Custom Providers

ScholarFlux enables integration with any API through a three-layer configuration system. This guide demonstrates how to add custom providers—from news APIs to specialized research databases—with full support for normalization, caching, and concurrent search.

Overview

Why Add Custom Providers?

ScholarFlux ships with seven academic providers (PLOS, arXiv, PubMed, OpenAlex, Crossref, CORE, Springer Nature), but research needs vary:

  • Institution-specific databases: University repositories, institutional archives

  • Domain-specific resources: Medical databases, patent databases, legal research platforms

  • News and media APIs: The Guardian, New York Times, Reuters

  • Specialized platforms: bioRxiv, SSRN, RePEc, HAL, Europe PMC

  • Internal APIs: Company knowledge bases, proprietary research databases

ScholarFlux’s provider system is universal—it works with any REST API returning JSON or XML.

The Three Configuration Layers

Every provider is defined by three components:

1. APIParameterMap - Request parameter mapping

This is the core component and the only map required to create a minimally viable ProviderConfig. It maps ScholarFlux parameters to API-specific parameter names:

APIParameterMap(
    query='q',                    # query → q
    start='page',                 # start → page
    records_per_page='page-size'  # records_per_page → page-size
)
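
As a rough illustration of what this mapping accomplishes, the sketch below renames universal parameter names to API-specific ones using only the standard library. The `translate_parameters` helper and the `PARAMETER_NAMES` dictionary are hypothetical names introduced for this example; they are not part of ScholarFlux, whose internal implementation may differ.

```python
from urllib.parse import urlencode

# Hypothetical stand-in for an APIParameterMap: universal name -> API-specific name
PARAMETER_NAMES = {
    "query": "q",
    "start": "page",
    "records_per_page": "page-size",
}


def translate_parameters(parameter_names: dict, **values) -> dict:
    """Rename universal parameter names to the names the API expects."""
    unknown = set(values) - set(parameter_names)
    if unknown:
        raise KeyError(f"Unmapped parameters: {sorted(unknown)}")
    return {parameter_names[name]: value for name, value in values.items()}


api_params = translate_parameters(
    PARAMETER_NAMES, query="test", start=1, records_per_page=20
)
query_string = urlencode(api_params)
print(query_string)  # q=test&page=1&page-size=20
```

Rejecting unmapped names early, as the sketch does, makes a typo in a parameter name fail loudly instead of being silently dropped from the request.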

2. Field Map - Record field normalization

An optional component that maps API-specific field names to the universal, provider-agnostic field names used throughout ScholarFlux (particularly in academic applications):

# For academic APIs, use AcademicFieldMap
from scholar_flux.api.normalization import AcademicFieldMap

field_map = AcademicFieldMap(
    provider_name='my_provider',
    title='article_title',
    abstract='summary',
    doi='DOI'
)

# For non-academic APIs, subclass NormalizingFieldMap
from scholar_flux.api.normalization import NormalizingFieldMap

class ArticleFieldMap(NormalizingFieldMap):
    provider_name: str = ""
    title: str | list[str] | None = None
    url: str | list[str] | None = None
    text: str | list[str] | None = None

See also

For detailed information on field normalization patterns, see Schema Normalization.

3. ResponseMetadataMap - Response metadata extraction

Extracts pagination info from API responses. This map is optional and is mainly used to determine whether more pages remain for a query when retrieving multiple pages in succession.

ResponseMetadataMap(
    total_query_hits='total',  # Path to total results
    records_per_page='pageSize' # path to page-size
)
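
To make the "more retrievable pages" check concrete, here is a minimal sketch of the arithmetic involved, assuming one-indexed page numbers. The `total_pages` and `has_more_pages` helpers are hypothetical and only illustrate the idea; ScholarFlux’s own logic may handle more cases.

```python
import math


def total_pages(total_query_hits: int, records_per_page: int) -> int:
    """Number of pages needed to cover every hit."""
    if records_per_page <= 0:
        raise ValueError("records_per_page must be positive")
    return math.ceil(total_query_hits / records_per_page)


def has_more_pages(page: int, total_query_hits: int, records_per_page: int) -> bool:
    """True when at least one page remains after the one-indexed `page`."""
    return page < total_pages(total_query_hits, records_per_page)


# Metadata values sometimes arrive as strings, so coerce them before use
metadata = {"total": "95", "pageSize": "25"}
hits, size = int(metadata["total"]), int(metadata["pageSize"])

print(total_pages(hits, size))        # 4 pages: 25 + 25 + 25 + 20
print(has_more_pages(3, hits, size))  # True
print(has_more_pages(4, hits, size))  # False
```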

Minimal Provider Example

Although ScholarFlux offers a high degree of customization, a minimally viable provider configuration only requires an APIParameterMap and a ProviderConfig:

from scholar_flux.api import ProviderConfig, APIParameterMap, provider_registry
from scholar_flux import SearchCoordinator

# Minimal configuration - just parameter mapping
minimal_config = ProviderConfig(
    provider_name='a_custom_api_provider',
    base_url='https://api.a_custom_api_provider.com/search',
    parameter_map=APIParameterMap(
        query='query',
        start='item-start-number',
        records_per_page='items-per-page'
    ),
    records_per_page=20
)

provider_registry.add(minimal_config)

# Use immediately - returns raw API response
coordinator = SearchCoordinator(
    query="test",
    provider_name="a_custom_api_provider"
)
# [Dry Run] - Shows how each parameter is mapped in the prepared request URL
prepared_request = coordinator.api.prepare_search(page=1)
print(prepared_request.url) # indicates the URL that the request would be sent to
# OUTPUT: https://api.a_custom_api_provider.com/search?query=test&item-start-number=1&items-per-page=20

result = coordinator.search_page(page=1) # response container with additional metadata

if result:
    # Records have raw API field names
    print(result.response)  # The raw API response
    print(result.metadata)  # Extracted metadata
    print(result.data)  # Processed records
else:
    print(f"Error retrieving page {result.page}. {result.error}: {result.message}")

Complete Example: Guardian News API

Let’s add The Guardian’s news API as a custom provider. This demonstrates a non-academic API with typical JSON responses.

Full Configuration

from scholar_flux.api import (
    ProviderConfig,
    APIParameterMap,
    ResponseMetadataMap,
    provider_registry
)
from scholar_flux import SearchCoordinator
from scholar_flux.api.normalization import NormalizingFieldMap


# Step 1: Configure API parameters
parameters = APIParameterMap(
    query='q',                     # Guardian uses 'q' for queries
    start='page',                  # Guardian uses 'page' for pagination
    records_per_page='page-size',  # Guardian uses 'page-size' for limit
    api_key_parameter='api-key',   # API key parameter name
    auto_calculate_page=False,     # Use page number directly
    zero_indexed_pagination=False, # Pages start at 1, not 0
    api_key_required=True          # API key is mandatory
)

# Step 2 (Optional - for field normalization): Define custom field map for news articles
class ArticleFieldMap(NormalizingFieldMap):
    """Field map for journalism/news APIs."""
    provider_name: str = ""
    title: str | list[str] | None = None
    record_id: str | list[str] | None = None
    record_type: str | list[str] | None = None
    subject: str | list[str] | None = None
    text: str | list[str] | None = None
    url: str | list[str] | None = None
    date_published: str | list[str] | None = None


# Step 3 (Optional - for field normalization): Configure field mappings
field_map = ArticleFieldMap(
    provider_name='guardian',
    title='webTitle',              # Guardian's title field
    record_id='id',                # Guardian's ID field
    record_type='type',            # Article type
    subject='sectionName',         # Section as subject
    text='fields.trailText',       # Nested field for preview text
    url='webUrl',                  # Article URL
    date_published='webPublicationDate',
    api_specific_fields={          # Guardian-specific fields
        'section_name': 'sectionName',
        'pillar_name': 'pillarName'
    }
)

# Step 4 (Optional): Configure metadata extraction
metadata = ResponseMetadataMap(
    total_query_hits='total', # Path to total results
    records_per_page='pageSize' # path to page-size
)


# Step 5: Create provider configuration
guardian_config = ProviderConfig(
    provider_name='guardian',
    base_url='https://content.guardianapis.com/search',
    parameter_map=parameters,
    metadata_map=metadata,
    field_map=field_map,
    records_per_page=10,           # Default page size
    request_delay=1.0,             # Wait 1s between requests
    api_key_env_var='GUARDIAN_API_KEY',  # Environment variable
    docs_url='https://open-platform.theguardian.com/documentation/'
)

# Step 6: Add to registry
provider_registry.add(guardian_config)

# Step 7: Use immediately!
coordinator = SearchCoordinator(
    query="artificial intelligence",
    provider_name="guardian"
)
result = coordinator.search(page=1, normalize_records=True)

if result:
    print(f"Retrieved {len(result.data)} articles")
    normalized = result.normalized_records or []
    if normalized:
        print(f"First article: {normalized[0]['title']}")

What just happened:

  ✅ Configured parameter mapping (query → q, start → page)

  ✅ Created custom field map for news articles

  ✅ Configured field normalization (webTitle → title)

  ✅ Configured metadata extraction (total results)

  ✅ Added to registry, so it now works like the built-in providers

  ✅ Full ScholarFlux integration (caching, rate limiting, multi-provider search)

Understanding the Configuration

APIParameterMap explained:

This is the core step ScholarFlux uses to translate requests into something the Guardian API can understand. The Guardian API expects parameters like ?q=technology&page=1&page-size=10&api-key=xxx. ScholarFlux uses standard names (query, start, records_per_page), so we map them:

APIParameterMap(
    query='q',                     # ScholarFlux 'query' → Guardian 'q'
    start='page',                  # ScholarFlux 'start' → Guardian 'page'
    records_per_page='page-size',  # ScholarFlux 'records_per_page' → Guardian 'page-size'
    api_key_parameter='api-key',   # Where to insert the API key
    auto_calculate_page=False,     # Guardian uses page numbers (1, 2, 3...), so these are used directly
    zero_indexed_pagination=False  # First page is 1, not 0
)

Field map explained:

Guardian returns records with fields like webTitle, webUrl, etc. We normalize these:

field_map = ArticleFieldMap(
    provider_name='guardian',
    title='webTitle',              # Guardian's title → universal 'title'
    url='webUrl',                  # Guardian's URL → universal 'url'
    text='fields.trailText'        # Nested field extraction
)

ResponseMetadataMap explained:

Guardian returns JSON like:

{
  "response": {
    "total": 50000,
    "pageSize": "25",
    "results": [...]
  }
}

As nested metadata paths are traversed directly on extraction, we simply tell ScholarFlux the field names:

ResponseMetadataMap(
    total_query_hits='total', # Path to total results
    records_per_page='pageSize' # path to page-size
)
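
One way to picture why only the field name is needed: a lookup can search the nested response for the first matching key. The `find_key` helper below is a hypothetical sketch of that idea using the sample JSON above; it is not ScholarFlux’s actual traversal code, which may resolve paths differently.

```python
from typing import Any


def find_key(data: Any, key: str) -> Any:
    """Return the first value stored under `key`, searching nested dicts and lists depth-first."""
    if isinstance(data, dict):
        if key in data:
            return data[key]
        children = list(data.values())
    elif isinstance(data, list):
        children = data
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


# Same shape as the sample response above
payload = {"response": {"total": 50000, "pageSize": "25", "results": []}}

total_hits = find_key(payload, "total")
page_size = find_key(payload, "pageSize")
print(total_hits, page_size)  # 50000 25
```

A first-match search is convenient but ambiguous when the same key appears at several levels; in that case a fully qualified path is the safer choice.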

Inspecting and Extending Parameters

ScholarFlux provides tools for inspecting supported parameters and extending them at runtime. To stay robust as provider APIs change, each built-in configuration explicitly defines only a minimal set of the most broadly useful parameters. Users are encouraged to add further parameters to a provider’s configuration at runtime as needed.

Viewing Supported Parameters

Use SearchAPI.describe() to see all accepted universal and API-specific parameters for a provider:

from scholar_flux import SearchAPI

# View parameters for a built-in provider
api = SearchAPI.from_defaults(query="test", provider_name="crossref")
api.describe()

# Output:
# {'config_fields': ['provider_name', 'base_url', 'records_per_page',
#                    'request_delay', 'api_key', 'api_specific_parameters'],
#  'api_specific_parameters': {
#      'mailto': APISpecificParameter(name='mailto',
#                    description='An optional contact email...',
#                    validator='validate_and_process_email (function)', ...),
#      'sort': APISpecificParameter(name='sort',
#                    description="Sort field (e.g., 'published', 'deposited')...", ...),
#      'order': APISpecificParameter(name='order',
#                    description="Sort direction: 'asc' or 'desc'.", ...),
#  }}

This is especially useful when:

  • Discovering which API-specific parameters a provider supports

  • Understanding parameter validation and requirements

  • Debugging why a parameter isn’t being accepted

Adding Parameters at Runtime

Extend parameter support without modifying provider configuration:

from scholar_flux import SearchCoordinator
from scholar_flux.api.validators import validate_str

coordinator = SearchCoordinator(query="machine learning", provider_name="crossref")

# Add a custom API-specific parameter for the current session
new_parameter_config = coordinator.api.parameter_config.add_parameter(
    name='select',           # Actual API parameter name
    description="A custom filter to remove unwanted fields from retrieved records (e.g. select='doi,title')",  # Documentation
    validator=validate_str,  # Ensures that the value is a string and raises an error otherwise
    required=False,          # Optional parameter
    inplace=False,           # Determines whether the global configuration settings should be modified
)
coordinator.api.parameter_config = new_parameter_config

# Now you can use the parameter
result = coordinator.search(page=1, select="DOI,title,page")

For lower-level control, use BaseAPIParameterMap.add_parameter():

from scholar_flux.api import provider_registry
from scholar_flux.api.validators import api_validator
from scholar_flux.api.models import APISpecificParameter

name = 'crossref'

# The decorator adds the field and provider names to the message if a validation error occurs.
@api_validator(provider_name=name, field="select")
def check_selection(value: str):
    """Simple validator for ensuring that received values are strings."""
    if value is not None and not isinstance(value, str):
        raise TypeError(f"The received value ({value}) is not a string...")
    return value

# Get the provider's configuration from the registry
config = provider_registry.get(name)

# Register a single parameter directly on the provider's parameter map
config.parameter_map.api_specific_parameters['select'] = APISpecificParameter(
    name='select',              # Actual API parameter name
    description="A custom filter to remove unwanted fields from retrieved records (e.g. select='doi,title')",  # Documentation
    validator=check_selection,  # The custom validator defined above
    required=False,             # Optional parameter
)

# Reuses the `coordinator` created in the previous example
result = coordinator.search(page=2, select="DOI,title,page")

Tip

Runtime parameter extensions are session-scoped and don’t persist. For permanent additions, define them in your ProviderConfig.
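
A validator can do more than a type check. The sketch below follows the same value-in, value-out shape as `check_selection` above and additionally cleans up a comma-separated field list. The `validate_select` function is a hypothetical example, and whether ScholarFlux forwards a validator’s returned value to the request is an assumption here.

```python
from typing import Optional


def validate_select(value: Optional[str]) -> Optional[str]:
    """Validate and tidy a comma-separated field list such as 'DOI,title'."""
    if value is None:
        return None  # the parameter is optional
    if not isinstance(value, str):
        raise TypeError(f"Expected a string, received {type(value).__name__}")
    fields = [field.strip() for field in value.split(",")]
    if not all(fields):
        raise ValueError(f"Empty field name in select value: {value!r}")
    return ",".join(fields)


print(validate_select("DOI, title ,page"))  # DOI,title,page
```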

Common Patterns

Pagination Styles

Different APIs use different pagination approaches:

Page-based, one-indexed (e.g., OpenAlex):

# API expects: ?page=1, ?page=2, ?page=3
APIParameterMap(
    query="search",
    start="page",
    records_per_page="per_page",
    api_key_parameter="api_key",
    api_key_required=False,
    auto_calculate_page=False,
    zero_indexed_pagination=False,
)

Offset-based, zero-indexed (e.g., arXiv):

# API expects: ?start=0, ?start=25, ?start=50
APIParameterMap(
    query='search_query',
    start='start',
    records_per_page='max_results',
    api_key_parameter="api_key",
    api_key_required=False,
    auto_calculate_page=True,      # Calculate: (page-1) × records_per_page
    zero_indexed_pagination=True   # First record is at index 0
)

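Example page → offset calculation with 25 records per page, one-indexed (zero_indexed_pagination=False); zero-indexed APIs such as arXiv drop the leading 1, giving 0, 25, 50: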

  • Page 1: start = 1 + (1-1) × 25 = 1

  • Page 2: start = 1 + (2-1) × 25 = 26

  • Page 3: start = 1 + (3-1) × 25 = 51
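
The calculation above can be sketched as a small function covering both the one-indexed and zero-indexed cases. `page_to_start` is a hypothetical helper written for illustration, not a ScholarFlux function.

```python
def page_to_start(page: int, records_per_page: int, zero_indexed: bool = False) -> int:
    """Convert a one-based page number into the index of its first record."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    first_index = 0 if zero_indexed else 1
    return first_index + (page - 1) * records_per_page


for page in (1, 2, 3):
    one_based = page_to_start(page, 25)
    zero_based = page_to_start(page, 25, zero_indexed=True)
    print(page, one_based, zero_based)
# 1 1 0
# 2 26 25
# 3 51 50
```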

Mixed (Crossref):

# API paginates by record offset rather than page number, so the start value is calculated from the page
APIParameterMap(
    query='query',
    start='offset',
    records_per_page='rows',
    auto_calculate_page=True,
    zero_indexed_pagination=False
)

API Key Handling

Query parameter (Guardian):

parameters = APIParameterMap(
    query='q',
    records_per_page='page-size', # Guardian's page-size parameter
    api_key_parameter='api-key',  # Parameter name
    api_key_required=True         # Raise error if missing
)

config = ProviderConfig(
    provider_name='my_provider',
    base_url='https://api.example.com/search',  # Placeholder; replace with your API's endpoint
    parameter_map=parameters,
    api_key_env_var='MY_API_KEY'  # Environment variable to check
)

Optional API key:

parameters = APIParameterMap(
    query='q',
    records_per_page='pageSize',
    api_key_parameter='apikey',
    api_key_required=False  # API works without key (slower rate limit)
)
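
The difference between a required and an optional key comes down to what happens when the environment variable is missing. The sketch below illustrates that behavior with the standard library; `resolve_api_key` and `build_params` are hypothetical helpers, and the injectable `environ` argument exists only to keep the example self-contained.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    env_var: str, required: bool, environ: Mapping[str, str] = os.environ
) -> Optional[str]:
    """Read an API key from the environment, failing only when it is required."""
    api_key = environ.get(env_var) or None  # treat an empty string as missing
    if api_key is None and required:
        raise RuntimeError(f"Set the {env_var} environment variable before searching")
    return api_key


def build_params(query: str, api_key: Optional[str], key_name: str = "api-key") -> dict:
    """Attach the key to the request parameters only when one is available."""
    params = {"q": query}
    if api_key is not None:
        params[key_name] = api_key
    return params


# Optional key, not set: the request simply goes out without it
print(build_params("test", resolve_api_key("MY_API_KEY", False, environ={})))
# {'q': 'test'}

# Required key, set: the key is added under the configured parameter name
fake_env = {"MY_API_KEY": "secret"}
print(build_params("test", resolve_api_key("MY_API_KEY", True, environ=fake_env)))
# {'q': 'test', 'api-key': 'secret'}
```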

Note

For header-based authentication (Authorization: Bearer xxx), use a custom session with headers. See API reference for details.

Field Mapping Patterns

Simple field mapping:

# Direct field name mapping
field_map = AcademicFieldMap(
    provider_name='my_provider',
    title='article_title',  # API field: article_title → title
    doi='DOI',              # API field: DOI → doi
    authors='author_list'   # API field: author_list → authors
)

Nested field mapping:

# Extract from nested objects
field_map = AcademicFieldMap(
    provider_name='my_provider',
    title='metadata.title',           # metadata.title → title
    abstract='content.abstract',      # content.abstract → abstract
    authors='authors.contributor.name' # Deep nesting
)

Fallback field mapping:

# Try multiple field names (uses first non-null value)
field_map = AcademicFieldMap(
    provider_name='my_provider',
    title=['title', 'headline', 'name'],  # Try in order
    abstract=['abstract', 'summary', 'description']
)
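
To show what nested and fallback mappings do to a record, here is a minimal standard-library sketch. The `get_path` and `resolve_field` helpers are hypothetical and handle only nested dictionaries; paths that pass through lists of objects, such as the deep `authors` example above, need more logic than this sketch provides.

```python
from typing import Any, List, Union


def get_path(record: Any, path: str) -> Any:
    """Follow a dotted path such as 'metadata.title' through nested dictionaries."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def resolve_field(record: dict, paths: Union[str, List[str]]) -> Any:
    """Return the first non-null value among one or more candidate paths."""
    candidates = [paths] if isinstance(paths, str) else paths
    for path in candidates:
        value = get_path(record, path)
        if value is not None:
            return value
    return None


record = {"headline": "Example", "content": {"summary": "Short text"}}
mapping = {
    "title": ["title", "headline", "name"],               # fallback list
    "abstract": ["content.abstract", "content.summary"],  # nested fallbacks
    "doi": "DOI",                                         # single path, absent here
}
normalized = {name: resolve_field(record, paths) for name, paths in mapping.items()}
print(normalized)
# {'title': 'Example', 'abstract': 'Short text', 'doi': None}
```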

See also

For advanced field mapping including nested arrays, conditional extraction, and custom processors, see Schema Normalization.

Testing Your Provider

Validation Checklist

Before using a custom provider in production:

  1. Test with real queries:

    coordinator = SearchCoordinator(
        query="test query",
        provider_name="my_provider"
    )
    
    # Test basic retrieval with `search_page` (returns a `SearchResult` container with additional metadata)
    result = coordinator.search_page(page=1)
    assert result, f"Failed: {type(result.response_result)}: {result.error} - {result.message}"
    print(f"✓ Retrieved {len(result.data)} records")
    
    # Test multiple pages
    results = coordinator.search_pages(pages=range(1, 4))
    successful = results.filter()
    print(f"✓ Retrieved {len(successful)}/{len(results)} pages")
    
  2. Verify normalization:

    # `search` returns a `ProcessedResponse`, an `ErrorResponse`, or None
    result = coordinator.search(page=2, normalize_records=True)
    if result and result.normalized_records:
        record = result.normalized_records[0]
        print(f"✓ Title: {record.get('title')}")
        print(f"✓ DOI: {record.get('doi')}")
        print(f"✓ Provider: {record.get('provider_name')}")
    else:
        print("✗ Normalization failed")
    
  3. Test pagination:

    # Verify pages return different records
    page1 = coordinator.search(page=1)
    page2 = coordinator.search(page=2)
    
    if page1 and page2:
        ids1 = [r['id'] for r in page1.data if 'id' in r]
        ids2 = [r['id'] for r in page2.data if 'id' in r]
        overlap = set(ids1) & set(ids2)
        print(f"✓ Pages have {len(overlap)} overlapping IDs (should be 0)")
    
  4. Check metadata extraction:

    result = coordinator.search(page=1)
    if result:
        print(f"✓ Total results: {result.total_query_hits}")
        print(f"✓ Records per page: {result.records_per_page}")
    

Common Issues

“No records found” but API returns data:

Check your record extraction path. Add debug logging:

result = coordinator.search(page=1)
if result:
    print(f"Parsed response keys: {result.parsed_response.keys()}")
    print(f"Extracted records: {len(result.extracted_records or [])}")

“Field not found” during normalization:

Check field names in actual API response:

result = coordinator.search(page=1)
if result and result.data:
    sample_record = result.data[0]
    print(f"Available fields: {list(sample_record.keys())}")

Pagination returns same records:

Verify the mapped parameters from APIParameterMap against the API provider’s requirements using its documentation:

# Inspect the active mapping, e.g. confirm that 'start' maps to a page number (1, 2, 3) if that is what the API expects
print(coordinator.api.parameter_config.map)

Best Practices

Configuration Guidelines

  1. Check the API’s documented rate limits and start with a conservative request delay:

    # Start with a longer delay
    request_delay=5.0
    
    # Monitor API response headers
    # Adjust based on documented limits
    
  2. Use descriptive provider names:

    # Good
    provider_name='europepmc'
    provider_name='semantic_scholar'
    
    # Avoid
    provider_name='api1'
    provider_name='custom'
    
  3. Document your configuration:

    """
    Custom ScholarFlux Provider: Europe PMC
    
    Requirements:
    - No API key required
    - Rate limit: 3 requests/second
    
    Usage:
        >>> from my_providers import europepmc_config
        >>> provider_registry.add(europepmc_config)
        >>> coordinator = SearchCoordinator(
        ...     query="cancer",
        ...     provider_name="europepmc"
        ... )
    
    API Documentation:
    - https://europepmc.org/RestfulWebService
    """
    
  4. Test with diverse queries:

    test_queries = [
        "simple query",
        "complex AND (query OR terms)",
        "phrase in quotes",
        "year:2024"
    ]
    
    for query in test_queries:
        coordinator = SearchCoordinator(
            query=query,
            provider_name="my_provider"
        )
        result = coordinator.search(page=1)
        print(f"{query}: {'✓' if result else '✗'}")
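
Guideline 1 recommends starting with a conservative request delay. As a rough picture of what a fixed delay between requests means in practice, the sketch below enforces a minimum gap between consecutive calls. `RequestThrottle` is a hypothetical class written for this example and is not how ScholarFlux implements `request_delay` internally.

```python
import time


class RequestThrottle:
    """Sleep as needed so consecutive calls are at least `delay` seconds apart."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Block until the delay has elapsed; return the seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.delay - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


throttle = RequestThrottle(delay=0.2)  # short delay so the demo finishes quickly
for attempt in range(3):
    slept = throttle.wait()
    print(f"request {attempt + 1}: waited {slept:.2f}s")
```

The first call goes through immediately; later calls wait only for whatever part of the delay has not already elapsed.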
    

Error Handling

ScholarFlux uses response types instead of exceptions:

from scholar_flux import SearchCoordinator
from scholar_flux.api.models import NonResponse, ErrorResponse

def safe_search(query: str, provider_name: str):
    coordinator = SearchCoordinator(query=query, provider_name=provider_name)
    # `search_page` returns a `SearchResult` container, which exposes `response_result`
    result = coordinator.search_page(page=1)

    # ProcessedResponse (truthy) - success
    if result:
        return result.normalize()

    # NonResponse - network error or API unreachable
    if isinstance(result.response_result, NonResponse):
        print(f"Network error: {result.message}")
        return []

    # ErrorResponse - API returned error
    if isinstance(result.response_result, ErrorResponse):
        print(f"API error: {result.message}")
        return []

    return []

Tip

Check response validity with if result: rather than try/except for cleaner code.
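
The `if result:` idiom works because success and failure objects can define their own truthiness. The sketch below shows the general pattern with two hypothetical classes, `Success` and `Failure`; they are stand-ins for illustration and do not mirror the actual ScholarFlux response classes.

```python
from dataclasses import dataclass, field


@dataclass
class Success:
    """Hypothetical success container; truthy so `if result:` passes."""
    data: list = field(default_factory=list)

    def __bool__(self) -> bool:
        return True


@dataclass
class Failure:
    """Hypothetical error container; falsy, but still carries a message."""
    message: str = ""

    def __bool__(self) -> bool:
        return False


def fetch(ok: bool):
    """Return a result object instead of raising on failure."""
    return Success(data=[{"title": "Example"}]) if ok else Failure(message="HTTP 429")


for result in (fetch(True), fetch(False)):
    if result:
        print(f"Retrieved {len(result.data)} records")
    else:
        print(f"Request failed: {result.message}")
# Retrieved 1 records
# Request failed: HTTP 429
```

Because the falsy object still carries its message, the caller can report what went wrong without a try/except block.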

Next Steps

Community Contributions

Consider sharing your custom providers:

  1. Test thoroughly with the validation checklist

  2. Document clearly with usage examples

  3. Open a pull request at https://github.com/SammieH21/scholar-flux

  4. Include tests demonstrating functionality

Popular community providers may be included in future ScholarFlux releases!

Getting Help

If you encounter issues:

  1. Check API documentation: Verify parameter names and response structure

  2. Test API directly: Use curl or requests to understand behavior

  3. Search issues: https://github.com/SammieH21/scholar-flux/issues

  4. Open an issue: Include provider details, config code, and error messages

  5. Email: scholar.flux@gmail.com

When requesting help, include:

  • Provider name and documentation URL

  • Your ProviderConfig code

  • Sample API response (anonymize sensitive data)

  • Error messages or unexpected behavior

  • ScholarFlux version: pip show scholar-flux