Custom Providers

ScholarFlux enables integration with any API through a three-layer configuration system. This guide demonstrates how to add custom providers—from news APIs to specialized research databases—with full support for normalization, caching, and concurrent search.

Overview 

Why Add Custom Providers?

ScholarFlux ships with seven academic providers (PLOS, arXiv, PubMed, OpenAlex, Crossref, CORE, Springer Nature), but the research needs of end-users vary:

Institution-specific databases: University repositories, institutional archives
Domain-specific resources: Medical databases, patent databases, legal research platforms
News and media APIs: The Guardian, New York Times, Reuters
Specialized platforms: bioRxiv, SSRN, RePEc, HAL, Europe PMC
Internal APIs: Company knowledge bases, proprietary research databases

ScholarFlux’s provider system is universal—it works with any REST API returning JSON or XML.

The Three Configuration Layers 

Every provider is defined by three components:

1. APIParameterMap - Request parameter mapping

This core component needed to create a minimally viable ProviderConfig. It maps ScholarFlux parameters to API-specific parameter names:

APIParameterMap(
    query='q',                    # query → q
    start='page',                 # start → page
    records_per_page='page-size'  # records_per_page → page-size
)

2. Field Map - Record field normalization

An optional component that is used to map API-specific field names to universal field names used throughout ScholarFlux (particularly in academic applications). Normalizes API-specific fields to provider-agnostic field names:

# For academic APIs, use or subclass the AcademicFieldMap
from scholar_flux.api.normalization import AcademicFieldMap

field_map = AcademicFieldMap(
    provider_name='my_provider',
    title='article_title',
    abstract='summary',
    doi='DOI'
)

# For non-academic APIs, subclass NormalizingFieldMap
from scholar_flux.api.normalization import NormalizingFieldMap

class ArticleFieldMap(NormalizingFieldMap):
    provider_name: str = ""
    title: str | list[str] | None = None
    url: str | list[str] | None = None
    text: str | list[str] | None = None

Minimal Provider Example 

ScholarFlux offers a high degree of customization, the minimally viable provider-config only requires users to create an APIParameterMap and a ProviderConfig:

from scholar_flux.api import ProviderConfig, APIParameterMap, provider_registry
from scholar_flux import SearchCoordinator

# Minimal configuration - just parameter mapping
minimal_config = ProviderConfig(
    provider_name='a_custom_api_provider',
    display_name='A Human-Readable Name for the Custom Provider',
    base_url='https://api.a_custom_api_provider.com/search',
    parameter_map=APIParameterMap(
        query='query',
        start='item-start-number',
        records_per_page='items-per-page'
    ),
    records_per_page=20
)

provider_registry.add(minimal_config)

# Use immediately - returns raw API response
coordinator = SearchCoordinator(
    query="test",
    provider_name="a_custom_api_provider"
)
# [Dry Run] - Shows how each parameter is mapped in the prepared request URL
prepared_request = coordinator.api.prepare_search(page=1)
print(prepared_request.url) # indicates the URL that the request would be sent to
# OUTPUT: https://api.a_custom_api_provider.com/search?query=test&item-start-number=1&items-per-page=20

result = coordinator.search_page(page=1) # response container with additional metadata

if result:
    # Records have raw API field names
    print(result.response)  # The raw API response
    print(result.metadata)  # Extracted metadata
    print(result.data)  # Processed records
else:
    print(f"Error retrieving page {result.page}. {result.error}: {result.message}")

Complete Example: Guardian News API 

Let’s add The Guardian’s news API as a custom provider. This demonstrates a non-academic API with typical JSON responses.

Note

An API key is required for The Guardian. As always, it is strongly recommended to store the API key as an environment variable (defined as GUARDIAN_API_KEY here) instead of directly passing the API key as a string.

Full Configuration 

from datetime import datetime

from scholar_flux.api import (
    ProviderConfig,
    APIParameterMap,
    ResponseMetadataMap,
    provider_registry
)
from scholar_flux import SearchCoordinator
from scholar_flux.api.normalization import NormalizingFieldMap
from scholar_flux.utils.record_types import NormalizedRecordType  # dictionary with string fields
from scholar_flux.utils.helpers import parse_iso_timestamp


# Step 1: Configure API parameters
parameters = APIParameterMap(
    query='q',                     # Guardian uses 'q' for queries
    start='page',                  # Guardian uses 'page' for pagination
    records_per_page='page-size',  # Guardian uses 'page-size' for limit
    api_key_parameter='api-key',   # API key parameter name
    auto_calculate_page=False,     # Use page number directly
    zero_indexed_pagination=False, # Pages start at 1, not 0
    api_key_required=True          # API key is mandatory
)

# Step 2 (Optional - for field normalization): Define custom field map for news articles
class ArticleFieldMap(NormalizingFieldMap):
    """Field map for journalism/news APIs."""
    provider_name: str = ""
    title: str | list[str] | None = None
    record_id: str | list[str] | None = None
    record_type: str | list[str] | None = None
    subject: str | list[str] | None = None
    text: str | list[str] | None = None
    url: str | list[str] | None = None
    date_published: str | list[str] | None = None

class GuardianFieldMap(ArticleFieldMap):
    """Optionally subclass the field map for the Guardian API, adding API-specific post processing if needed."""

    def _post_process(self, record: NormalizedRecordType) -> NormalizedRecordType:
        """Optional post-processing example step for modifying records after field resolution for 'The Guardian'."""
        record = super()._post_process(record) # Initial post-processing step after fields are resolved

        # Converts the publication date into a datetime
        record["date_published"] = self.extract_iso_timestamp(record)

        # Returns the updated record after post-processing
        return record

    @classmethod
    def extract_iso_timestamp(cls, record: NormalizedRecordType, field: str = "date_published") -> datetime | None:
        """Extracts the date of the article's publication."""
        date = record.get(field) # In case a valid `date field` could not be resolved
        # Attempts to extract a timestamp, returning None if parsing fails
        return parse_iso_timestamp(date) if date else None


# Step 3 (Optional - for field normalization): Configure field mappings
field_map = GuardianFieldMap(
    provider_name='guardian',
    title='webTitle',              # Guardian's title field
    record_id='id',                # Guardian's ID field
    record_type='type',            # Article type
    subject='sectionName',         # Section as subject
    text='fields.trailText',       # Nested field for preview text
    url='webUrl',                  # Article URL
    date_published='webPublicationDate',
    api_specific_fields={          # Guardian-specific fields
        'section_name': 'sectionName',
        'pillar_name': 'pillarName'
    }
)

# Step 4 (Optional): Configure metadata extraction
metadata = ResponseMetadataMap(
    total_query_hits='total', # Path to total results
    records_per_page='pageSize' # path to page-size
)


# Step 5: Create provider configuration
guardian_config = ProviderConfig(
    provider_name='guardian',
    display_name='The Guardian',
    base_url='https://content.guardianapis.com/search',
    parameter_map=parameters,
    metadata_map=metadata,
    field_map=field_map,
    records_per_page=10,           # Default page size
    request_delay=1.0,             # Wait 1s between requests
    api_key_env_var='GUARDIAN_API_KEY',  # Environment variable
    docs_url='https://open-platform.theguardian.com/documentation/'
)

# Step 6: Add to registry
provider_registry.add(guardian_config)

# Step 7: Use immediately!
coordinator = SearchCoordinator(
    query="artificial intelligence",
    provider_name="guardian"
)

# Normalize records on the retrieval of page 1
result = coordinator.search_page(page=1, normalize_records=True)

if result.normalized_records:
    print(f"Retrieved {result.record_count} articles")
    first_record = result.normalized_records[0]
    print(f"First article: {first_record['title']}")

What just happened:

✅ Configured parameter mapping (query → q, page → page) ✅ Created custom field map for news articles ✅ Configured field normalization (webTitle → title) ✅ Configured metadata extraction (total results) ✅ Added to registry—now works like built-in providers ✅ Full ScholarFlux integration (caching, rate limiting, multi-provider search)

Understanding the Configuration 

APIParameterMap explained:

The core step used by ScholarFlux to translate requests into something that Guardian can understand. The Guardian API expects parameters like ?q=technology&page=1&page-size=10&api-key=xxx. ScholarFlux uses standard names (query, start, records_per_page), so we map them:

APIParameterMap(
    query='q',                     # ScholarFlux 'query' → Guardian 'q'
    start='page',                  # ScholarFlux 'start' → Guardian 'page'
    records_per_page='page-size',  # ScholarFlux 'records_per_page' → Guardian 'page-size'
    api_key_parameter='api-key',   # Where to insert the API key
    auto_calculate_page=False,     # Guardian uses page numbers (1, 2, 3...), so these are used directly
    zero_indexed_pagination=False  # First page is 1, not 0
)

Field map explained:

Guardian returns records with fields like webTitle, webUrl, etc. We normalize these:

field_map = ArticleFieldMap(
    provider_name='guardian',
    title='webTitle',              # Guardian's title → universal 'title'
    url='webUrl',                  # Guardian's URL → universal 'url'
    text='fields.trailText'        # Nested field extraction
)

ResponseMetadataMap explained:

Guardian returns JSON like:

{
  "response": {
    "total": 50000,
    "pageSize": "25",
    "results": [...]
  }
}

As nested metadata paths are traversed directly on extraction, we simply tell ScholarFlux the field names:

ResponseMetadataMap(
    total_query_hits='total', # Path to total results
    records_per_page='pageSize' # path to page-size
)

Inspecting and Extending Parameters 

ScholarFlux provides tools for inspecting supported parameters and extending them at runtime. For maximum applicability amidst changing APIs in the future, the API explicitly defines only the bare minimum of parameters that could be most beneficial to users. Users are encouraged to add new parameters to each provider’s configuration as needed on runtime.

Viewing Supported Parameters

Use SearchAPI.describe() to see all accepted universal and API-specific parameters for a provider:

from scholar_flux import SearchAPI

# View parameters for a built-in provider
api = SearchAPI.from_defaults(query="test", provider_name="crossref")
api.describe()

# Output:
# {'config_fields': ['provider_name', 'base_url', 'records_per_page',
#                    'request_delay', 'api_key', 'api_specific_parameters'],
#  'api_specific_parameters': {
#      'mailto': APISpecificParameter(name='mailto',
#                    description='An optional contact email...',
#                    validator='validate_and_process_email (function)', ...),
#      'sort': APISpecificParameter(name='sort',
#                    description="Sort field (e.g., 'published', 'deposited')...", ...),
#      'order': APISpecificParameter(name='order',
#                    description="Sort direction: 'asc' or 'desc'.", ...),
#  }}

This is especially useful when:

Discovering which API-specific parameters a provider supports
Understanding parameter validation and requirements
Debugging why a parameter isn’t being accepted

Adding Parameters at Runtime

Extend parameter support without modifying provider configuration:

from scholar_flux import SearchCoordinator
from scholar_flux.api.validators import validate_str

coordinator = SearchCoordinator(query="machine learning", provider_name="crossref")

# Add a custom API-specific parameter for the current session
new_parameter_config = coordinator.api.parameter_config.add_parameter(
        name='select',           # Actual API parameter name
        description="A Custom filter to remove unwanted fields from retrieved records (e.g. select='doi,title')",  # Documentation
        validator=validate_str,      # Ensures that the value is a string and raises an error otherwise
        required=False,              # Optional parameter
        inplace=False,              # Determines whether the global configuration settings should be modified
)
coordinator.api.parameter_config = new_parameter_config

# Now you can use the parameter
result = coordinator.search(page=1, select="DOI,title,page")

For lower-level control, use BaseAPIParameterMap.add_parameter():

from scholar_flux.api import provider_registry
from scholar_flux.api.validators import api_validator
from scholar_flux.api.models import APISpecificParameter

name = 'crossref'
# The decorator just provides more information on the field and provider if a validation error occurs.
@api_validator(provider_name=name, field="select")
def check_selection(value: str):
    """Simple validator for ensuring that received values are strings."""
    if value is not None and not isinstance(value, str):
        raise TypeError(f"The received value ({value}) is not a string...")
    return value

# Get the provider's parameter map
config = provider_registry.get(name)

# Add a single parameter efficiently
config.parameter_map.api_specific_parameters['select'] = APISpecificParameter(
        name='select',           # Actual API parameter name
        description="A Custom filter to remove unwanted fields from retrieved records (e.g. select='doi,title')",  # Documentation
        validator=validate_str,      # Ensures that the value is a string and raises an error otherwise
        required=False,              # Optional parameter
)

result = coordinator.search(page=2, select="DOI,title,page")

Tip

Runtime parameter extensions are session-scoped and don’t persist. For permanent additions, define them in your ProviderConfig.

Common Patterns 

Pagination Styles 

Different APIs use different pagination approaches:

Page-based, one-indexed (e.g., OpenAlex):

# API expects: ?page=1, ?page=2, ?page=3
APIParameterMap(
     query="search",
     start="page",
     records_per_page="per_page",
     api_key_parameter="api_key",
     api_key_required=False,
     auto_calculate_page=False,
     zero_indexed_pagination=False,
 )

Offset-based, zero-indexed (e.g., arXiv):

# API expects: ?start=0, ?start=25, ?start=50
APIParameterMap(
    query='search_query',
    start='start',
    records_per_page='max_results',
    api_key_parameter="api_key",
    api_key_required=False,
    auto_calculate_page=True,      # Calculate: (page-1) × records_per_page
    zero_indexed_pagination=True   # First record is at index 0
)

Example page -> offset calculation (one-indexed):

Page 1: start = 1 + (1-1) × 25 = 1
Page 2: start = 1 + (2-1) × 25 = 26
Page 3: start = 1 + (3-1) × 25 = 51

Mixed (Crossref):

# API uses offset but calls it 'cursor' or 'offset'
APIParameterMap(
    query='query',
    start='offset',
    records_per_page='rows',
    auto_calculate_page=True,
    zero_indexed_pagination=False
)

API Key Handling 

Query parameter (Guardian):

parameters = APIParameterMap(
    query='q',
    records_per_page='pageSize',
    api_key_parameter='api-key',  # Parameter name
    api_key_required=True         # Raise error if missing
)

config = ProviderConfig(
    provider_name='my_provider',
    parameter_map=parameters,
    api_key_env_var='MY_API_KEY'  # Environment variable to check
)

Optional API key:

parameters = APIParameterMap(
    query='q',
    records_per_page='pageSize',
    api_key_parameter='apikey',
    api_key_required=False  # API works without key (slower rate limit)
)

Note

For header-based authentication (Authorization: Bearer xxx), use a custom session with headers. See API reference for details.

Field Mapping Patterns 

Simple field mapping:

# Direct field name mapping
field_map = AcademicFieldMap(
    provider_name='my_provider',
    title='article_title',  # API field: article_title → title
    doi='DOI',              # API field: DOI → doi
    authors='author_list'   # API field: author_list → authors
)

Nested field mapping:

# Extract from nested objects
field_map = AcademicFieldMap(
    provider_name='my_provider',
    title='metadata.title',           # metadata.title → title
    abstract='content.abstract',      # content.abstract → abstract
    authors='authors.contributor.name' # Deep nesting
)

Fallback field mapping:

# Try multiple field names (uses first non-null value)
field_map = AcademicFieldMap(
    provider_name='my_provider',
    title=['title', 'headline', 'name'],  # Try in order
    abstract=['abstract', 'summary', 'description']
)

Testing Your Provider 

Validation Checklist 

Before using a custom provider in production:

Test with real queries:

coordinator = SearchCoordinator(
    query="test query",
    provider_name="my_provider"
)

# Check the URL to verify that the request URL appears correct. If not, revise the API's ProviderConfig
search_request = coordinator.api.prepare_search(page=1)  #  An individual requests.PreparedRequest
print(f"Search URL: {search_request.url}")  # Inspect the URL

# Test basic retrieval with `search_page` (returns a `SearchResult` container with additional metadata)
result = coordinator.search_page(page=1)
assert result, f"Failed: {type(result.response_result)}: {result.error} - {result.message}"
print(f"✓ Retrieved {result.record_count} records")

# Test multiple pages
results = coordinator.search_pages(pages=range(1, 4))
successful = results.filter()
print(f"✓ Retrieved {len(successful)}/{len(results)} pages")

Verify normalization:

# Tests retrieval with the returned `ProcessedResponse`, `ErrorResponse`, or None
result = coordinator.search(page=2, normalize_records=True)
if result and result.normalized_records:
    record = result.normalized_records[0]
    print(f"✓ Title: {record.get('title')}")
    print(f"✓ DOI: {record.get('doi')}")
    print(f"✓ Provider: {record.get('provider_name')}")
else:
    print("✗ Normalization failed")

Test pagination:

# Verify pages return different records
page1 = coordinator.search(page=1)
page2 = coordinator.search(page=2)

if page1 and page2:
    ids1 = [r['id'] for r in page1.data if 'id' in r]
    ids2 = [r['id'] for r in page2.data if 'id' in r]
    overlap = set(ids1) & set(ids2)
    print(f"✓ Pages have {len(overlap)} overlapping IDs (should be 0)")

Check metadata extraction:

result = coordinator.search(page=1)
if result:
    print(f"✓ Total results: {result.total_query_hits}")
    print(f"✓ Records per page: {result.records_per_page}")

Common Issues 

“No records found” but API returns data:

Check your record extraction path. Add debug logging:

result = coordinator.search(page=1)
if result:
    print(f"Parsed response keys: {result.parsed_response.keys()}")
    print(f"Extracted records: {len(result.extracted_records or [])}")

“Field not found” during normalization:

Check field names in actual API response:

result = coordinator.search(page=1)
if result and result.data:
    sample_record = result.data[0]
    print(f"Available fields: {list(sample_record.keys())}")

Pagination returns same records:

Verify the mapped parameters from APIParameterMap against the API provider’s requirements using its documentation:

# If the API uses page numbers (1, 2, 3):
print(coordinator.api.parameter_config.map)

Best Practices 

Configuration Guidelines 

Check API rate limiting requirements directly and start conservative with rate limits:

# Start with a longer delay
request_delay=5.0

# Monitor API response headers
# Adjust based on documented limits

Use descriptive provider names:

# Good
provider_name='europepmc'
provider_name='semantic_scholar'

# Avoid
provider_name='api1'
provider_name='custom'

Document your configuration:

"""
Custom ScholarFlux Provider: Europe PMC

Requirements:
- No API key required
- Rate limit: 3 requests/second

Usage:
    >>> from my_providers import europepmc_config
    >>> provider_registry.add(europepmc_config)
    >>> coordinator = SearchCoordinator(
    ...     query="cancer",
    ...     provider_name="europepmc"
    ... )

API Documentation:
- https://europepmc.org/RestfulWebService
"""

Test with diverse queries:

test_queries = [
    "simple query",
    "complex AND (query OR terms)",
    "phrase in quotes",
    "year:2024"
]

for query in test_queries:
    coordinator = SearchCoordinator(
        query=query,
        provider_name="my_provider"
    )
    result = coordinator.search(page=1)
    print(f"{query}: {'✓' if result else '✗'}")

Error Handling 

ScholarFlux uses response types instead of exceptions:

from scholar_flux import SearchCoordinator
from scholar_flux.api.models import NonResponse, ErrorResponse

def safe_search(query: str, provider_name: str):
    coordinator = SearchCoordinator(query=query, provider_name=provider_name)
    result = coordinator.search(page=1)

    # ProcessedResponse (truthy) - success
    if result:
        return result.normalize()

    # NonResponse - network error or API unreachable
    if isinstance(result.response_result, NonResponse):
        print(f"Network error: {result.message}")
        return []

    # ErrorResponse - API returned error
    if isinstance(result.response_result, ErrorResponse):
        print(f"API error: {result.message}")
        return []

    return []

Tip

Check response validity with if result: rather than try/except for cleaner code.

Next Steps 

Related Guides:

Schema Normalization - Deep dive into field normalization patterns
Multi-Provider Search - Use custom providers in concurrent searches
Workflows - Multi-step retrieval for complex APIs (like PubMed’s two-step process)

Advanced Topics:

Caching Strategies - Production caching with Redis, MongoDB, SQLAlchemy
Production Deployment - Deploy custom providers at scale

API Reference:

scholar_flux.api.models.ProviderConfig - Complete configuration reference
scholar_flux.api.models.APIParameterMap - Parameter mapping reference
scholar_flux.api.normalization.AcademicFieldMap - Academic field mapping
scholar_flux.api.normalization.NormalizingFieldMap - Base field map for custom schemas

Community Contributions 

Consider sharing your custom providers:

Test thoroughly with the validation checklist
Document clearly with usage examples
Open a pull request at https://github.com/SammieH21/scholar-flux
Include tests demonstrating functionality

Popular community providers may be included in future ScholarFlux releases!

Getting Help 

If you encounter issues:

Check API documentation: Verify parameter names and response structure
Test API directly: Use curl or requests to understand behavior
Search issues: https://github.com/SammieH21/scholar-flux/issues
Open an issue: Include provider details, config code, and error messages
Email: scholar.flux@gmail.com

When requesting help, include:

Provider name and documentation URL
Your ProviderConfig code
Sample API response (anonymize sensitive data)
Error messages or unexpected behavior
ScholarFlux version: pip show scholar-flux