==================
Custom Providers
==================

ScholarFlux enables integration with any API through a three-layer configuration system. This guide demonstrates how to add custom providers—from news APIs to specialized research databases—with full support for normalization, caching, and concurrent search.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
========

Why Add Custom Providers?
--------------------------

ScholarFlux ships with seven academic providers (PLOS, arXiv, PubMed, OpenAlex, Crossref, CORE, Springer Nature), but research needs vary:

- **Institution-specific databases**: University repositories, institutional archives
- **Domain-specific resources**: Medical databases, patent databases, legal research platforms
- **News and media APIs**: The Guardian, New York Times, Reuters
- **Specialized platforms**: bioRxiv, SSRN, RePEc, HAL, Europe PMC
- **Internal APIs**: Company knowledge bases, proprietary research databases

**ScholarFlux's provider system is universal**—it works with any REST API returning JSON or XML.

The Three Configuration Layers
-------------------------------

Every provider is defined by three components:

**1. APIParameterMap** - Request parameter mapping

This core component needed to create a minimally viable ProviderConfig. It maps ScholarFlux parameters to API-specific parameter names:

.. code-block:: python

   APIParameterMap(
       query='q',                    # query → q
       start='page',                 # start → page
       records_per_page='page-size'  # records_per_page → page-size
   )


**2. Field Map** - Record field normalization

An optional component that is used to map API-specific field names to universal field names used throughout ScholarFlux (particularly in academic applications). Normalizes API-specific fields to provider-agnostic field names:

.. code-block:: python

   # For academic APIs, use AcademicFieldMap
   from scholar_flux.api.normalization import AcademicFieldMap
   
   field_map = AcademicFieldMap(
       provider_name='my_provider',
       title='article_title',
       abstract='summary',
       doi='DOI'
   )
   
   # For non-academic APIs, subclass NormalizingFieldMap
   from scholar_flux.api.normalization import NormalizingFieldMap
   
   class ArticleFieldMap(NormalizingFieldMap):
       provider_name: str = ""
       title: str | list[str] | None = None
       url: str | list[str] | None = None
       text: str | list[str] | None = None

.. seealso::
   For detailed information on field normalization patterns, see :doc:`schema_normalization`.

**3. ResponseMetadataMap** - Response metadata extraction

Extracts pagination info from API responses. This map is optional and mainly used when determining if there are more retrievable pages associated with a query when retrieving multiple pages in succession.

.. code-block:: python

   ResponseMetadataMap(
       total_query_hits='total',  # Path to total results
       records_per_page='pageSize' # path to page-size
   )

Minimal Provider Example
========================

ScholarFlux offers a high degree of customization, the minimally viable provider-config only requires users to create an APIParameterMap and a ProviderConfig: 

.. code-block:: python

   from scholar_flux.api import ProviderConfig, APIParameterMap, provider_registry
   from scholar_flux import SearchCoordinator
   
   # Minimal configuration - just parameter mapping
   minimal_config = ProviderConfig(
       provider_name='a_custom_api_provider',
       base_url='https://api.a_custom_api_provider.com/search',
       parameter_map=APIParameterMap(
           query='query',
           start='item-start-number',
           records_per_page='items-per-page'
       ),
       records_per_page=20
   )
   
   provider_registry.add(minimal_config)
   
   # Use immediately - returns raw API response
   coordinator = SearchCoordinator(
       query="test",
       provider_name="a_custom_api_provider"
   )
   # [Dry Run] - Shows how each parameter is mapped in the prepared request URL
   prepared_request = coordinator.api.prepare_search(page=1)
   print(prepared_request.url) # indicates the URL that the request would be sent to
   # OUTPUT: https://api.a_custom_api_provider.com/search?query=test&item-start-number=1&items-per-page=20

   result = coordinator.search_page(page=1) # response container with additional metadata
   
   if result:
       # Records have raw API field names
       print(result.response)  # The raw API response
       print(result.metadata)  # Extracted metadata
       print(result.data)  # Processed records
    else:
        print(f"Error retrieving page {result.page}. {result.error}: {result.message}")

Complete Example: Guardian News API
====================================

Let's add The Guardian's news API as a custom provider. This demonstrates a non-academic API with typical JSON responses.

Full Configuration
------------------

.. code-block:: python

   from scholar_flux.api import (
       ProviderConfig,
       APIParameterMap,
       ResponseMetadataMap,
       provider_registry
   )
   from scholar_flux import SearchCoordinator
   from scholar_flux.api.normalization import NormalizingFieldMap

   
   # Step 1: Configure API parameters
   parameters = APIParameterMap(
       query='q',                     # Guardian uses 'q' for queries
       start='page',                  # Guardian uses 'page' for pagination
       records_per_page='page-size',  # Guardian uses 'page-size' for limit
       api_key_parameter='api-key',   # API key parameter name
       auto_calculate_page=False,     # Use page number directly
       zero_indexed_pagination=False, # Pages start at 1, not 0
       api_key_required=True          # API key is mandatory
   )
   
   # Step 2 (Optional - for field normalization): Define custom field map for news articles
   class ArticleFieldMap(NormalizingFieldMap):
       """Field map for journalism/news APIs."""
       provider_name: str = ""
       title: str | list[str] | None = None
       record_id: str | list[str] | None = None
       record_type: str | list[str] | None = None
       subject: str | list[str] | None = None
       text: str | list[str] | None = None
       url: str | list[str] | None = None
       date_published: str | list[str] | None = None

   
   # Step 3 (Optional - for field normalization): Configure field mappings
   field_map = ArticleFieldMap(
       provider_name='guardian',
       title='webTitle',              # Guardian's title field
       record_id='id',                # Guardian's ID field
       record_type='type',            # Article type
       subject='sectionName',         # Section as subject
       text='fields.trailText',       # Nested field for preview text
       url='webUrl',                  # Article URL
       date_published='webPublicationDate',
       api_specific_fields={          # Guardian-specific fields
           'section_name': 'sectionName',
           'pillar_name': 'pillarName'
       }
   )
   
   # Step 4 (Optional): Configure metadata extraction
   metadata = ResponseMetadataMap(
       total_query_hits='total', # Path to total results
       records_per_page='pageSize' # path to page-size
   )

   
   # Step 5: Create provider configuration
   guardian_config = ProviderConfig(
       provider_name='guardian',
       base_url='https://content.guardianapis.com/search',
       parameter_map=parameters,
       metadata_map=metadata,
       field_map=field_map,
       records_per_page=10,           # Default page size
       request_delay=1.0,             # Wait 1s between requests
       api_key_env_var='GUARDIAN_API_KEY',  # Environment variable
       docs_url='https://open-platform.theguardian.com/documentation/'
   )
   
   # Step 6: Add to registry
   provider_registry.add(guardian_config)
   
   # Step 7: Use immediately!
   coordinator = SearchCoordinator(
       query="artificial intelligence",
       provider_name="guardian"
   )
   result = coordinator.search(page=1, normalize_records=True)
   
   if result:
       print(f"Retrieved {len(result.data)} articles")
       normalized = result.normalized_records or []
       if normalized:
           print(f"First article: {normalized[0]['title']}")

**What just happened:**

✅ Configured parameter mapping (query → q, page → page)  
✅ Created custom field map for news articles  
✅ Configured field normalization (webTitle → title)  
✅ Configured metadata extraction (total results)  
✅ Added to registry—now works like built-in providers  
✅ Full ScholarFlux integration (caching, rate limiting, multi-provider search)

Understanding the Configuration
--------------------------------

**APIParameterMap explained:**

The core step used by ScholarFlux to translate requests into something that Guardian can understand. The Guardian API expects parameters like ``?q=technology&page=1&page-size=10&api-key=xxx``. ScholarFlux uses standard names (``query``, ``start``, ``records_per_page``), so we map them:

.. code-block:: python

   APIParameterMap(
       query='q',                     # ScholarFlux 'query' → Guardian 'q'
       start='page',                  # ScholarFlux 'start' → Guardian 'page'
       records_per_page='page-size',  # ScholarFlux 'records_per_page' → Guardian 'page-size'
       api_key_parameter='api-key',   # Where to insert the API key
       auto_calculate_page=False,     # Guardian uses page numbers (1, 2, 3...), so these are used directly
       zero_indexed_pagination=False  # First page is 1, not 0
   )

**Field map explained:**

Guardian returns records with fields like ``webTitle``, ``webUrl``, etc. We normalize these:

.. code-block:: python

   field_map = ArticleFieldMap(
       provider_name='guardian',
       title='webTitle',              # Guardian's title → universal 'title'
       url='webUrl',                  # Guardian's URL → universal 'url'
       text='fields.trailText'        # Nested field extraction
   )

**ResponseMetadataMap explained:**

Guardian returns JSON like:

.. code-block:: text

   {
     "response": {
       "total": 50000,
       "pageSize": "25",
       "results": [...]
     }
   }

As nested metadata paths are traversed directly on extraction, we simply tell ScholarFlux the field names:

.. code-block:: python

   ResponseMetadataMap(
       total_query_hits='total', # Path to total results
       records_per_page='pageSize' # path to page-size
   )

Inspecting and Extending Parameters
------------------------------------

ScholarFlux provides tools for inspecting supported parameters and extending them at runtime. For maximum applicability amidst changing APIs in the future, the API explicitly defines only the bare minimum of parameters that could be most beneficial to users. Users are encouraged to add new parameters to each provider's configuration as needed on runtime.

Viewing Supported Parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use ``SearchAPI.describe()`` to see all accepted universal and API-specific parameters for a provider:

.. code-block:: python

   from scholar_flux import SearchAPI
   
   # View parameters for a built-in provider
   api = SearchAPI.from_defaults(query="test", provider_name="crossref")
   api.describe()
   
   # Output:
   # {'config_fields': ['provider_name', 'base_url', 'records_per_page', 
   #                    'request_delay', 'api_key', 'api_specific_parameters'],
   #  'api_specific_parameters': {
   #      'mailto': APISpecificParameter(name='mailto', 
   #                    description='An optional contact email...', 
   #                    validator='validate_and_process_email (function)', ...),
   #      'sort': APISpecificParameter(name='sort', 
   #                    description="Sort field (e.g., 'published', 'deposited')...", ...),
   #      'order': APISpecificParameter(name='order', 
   #                    description="Sort direction: 'asc' or 'desc'.", ...),
   #  }}

This is especially useful when:

- Discovering which API-specific parameters a provider supports
- Understanding parameter validation and requirements
- Debugging why a parameter isn't being accepted

Adding Parameters at Runtime
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Extend parameter support without modifying provider configuration:

.. code-block:: python

   from scholar_flux import SearchCoordinator
   from scholar_flux.api.validators import validate_str
   
   coordinator = SearchCoordinator(query="machine learning", provider_name="crossref")
   
   # Add a custom API-specific parameter for the current session
   new_parameter_config = coordinator.api.parameter_config.add_parameter(
           name='select',           # Actual API parameter name
           description="A Custom filter to remove unwanted fields from retrieved records (e.g. select='doi,title')",  # Documentation
           validator=validate_str,      # Ensures that the value is a string and raises an error otherwise
           required=False,              # Optional parameter
           inplace=False,              # Determines whether the global configuration settings should be modified
   )
   coordinator.api.parameter_config = new_parameter_config
   
   # Now you can use the parameter
   result = coordinator.search(page=1, select="DOI,title,page")

For lower-level control, use ``BaseAPIParameterMap.add_parameter()``:

.. code-block:: python

   from scholar_flux.api import provider_registry
   from scholar_flux.api.validators import api_validator
   from scholar_flux.api.models import APISpecificParameter

   name = 'crossref'
   # The decorator just provides more information on the field and provider if a validation error occurs.
   @api_validator(provider_name=name, field="select")
   def check_selection(value: str):
       """Simple validator for ensuring that received values are strings."""
       if value is not None and not isinstance(value, str):
           raise TypeError(f"The received value ({value}) is not a string...")
       return value
   
   # Get the provider's parameter map
   config = provider_registry.get(name)
   
   # Add a single parameter efficiently
   config.parameter_map.api_specific_parameters['select'] = APISpecificParameter(
           name='select',           # Actual API parameter name
           description="A Custom filter to remove unwanted fields from retrieved records (e.g. select='doi,title')",  # Documentation
           validator=validate_str,      # Ensures that the value is a string and raises an error otherwise
           required=False,              # Optional parameter
   )

   result = coordinator.search(page=2, select="DOI,title,page")

.. tip::
   Runtime parameter extensions are session-scoped and don't persist. For permanent additions, define them in your ``ProviderConfig``.

Common Patterns
===============

Pagination Styles
-----------------

Different APIs use different pagination approaches:

**Page-based, one-indexed (e.g., OpenAlex):**

.. code-block:: python

   # API expects: ?page=1, ?page=2, ?page=3
   APIParameterMap(
        query="search",
        start="page",
        records_per_page="per_page",
        api_key_parameter="api_key",
        api_key_required=False,
        auto_calculate_page=False,
        zero_indexed_pagination=False,
    )

**Offset-based, zero-indexed (e.g., arXiv):**

.. code-block:: python

   # API expects: ?start=0, ?start=25, ?start=50
   APIParameterMap(
       query='search_query',
       start='start',
       records_per_page='max_results',
       api_key_parameter="api_key",
       api_key_required=False,
       auto_calculate_page=True,      # Calculate: (page-1) × records_per_page
       zero_indexed_pagination=True   # First record is at index 0
   )

**Example page -> offset calculation (one-indexed):**

- Page 1: ``start = 1 + (1-1) × 25 = 1``
- Page 2: ``start = 1 + (2-1) × 25 = 26``
- Page 3: ``start = 1 + (3-1) × 25 = 51``

**Mixed (Crossref):**

.. code-block:: python

   # API uses offset but calls it 'cursor' or 'offset'
   APIParameterMap(
       query='query',
       start='offset',
       records_per_page='rows',
       auto_calculate_page=True,
       zero_indexed_pagination=False
   )

API Key Handling
----------------

**Query parameter (Guardian):**

.. code-block:: python

   parameters = APIParameterMap(
       query='q',
       records_per_page='pageSize',
       api_key_parameter='api-key',  # Parameter name
       api_key_required=True         # Raise error if missing
   )
   
   config = ProviderConfig(
       provider_name='my_provider',
       parameter_map=parameters,
       api_key_env_var='MY_API_KEY'  # Environment variable to check
   )

**Optional API key:**

.. code-block:: python

   parameters = APIParameterMap(
       query='q',
       records_per_page='pageSize',
       api_key_parameter='apikey',
       api_key_required=False  # API works without key (slower rate limit)
   )

.. note::
   For header-based authentication (``Authorization: Bearer xxx``), use a custom session with headers. See API reference for details.

Field Mapping Patterns
----------------------

**Simple field mapping:**

.. code-block:: python

   # Direct field name mapping
   field_map = AcademicFieldMap(
       provider_name='my_provider',
       title='article_title',  # API field: article_title → title
       doi='DOI',              # API field: DOI → doi
       authors='author_list'   # API field: author_list → authors
   )

**Nested field mapping:**

.. code-block:: python

   # Extract from nested objects
   field_map = AcademicFieldMap(
       provider_name='my_provider',
       title='metadata.title',           # metadata.title → title
       abstract='content.abstract',      # content.abstract → abstract
       authors='authors.contributor.name' # Deep nesting
   )

**Fallback field mapping:**

.. code-block:: python

   # Try multiple field names (uses first non-null value)
   field_map = AcademicFieldMap(
       provider_name='my_provider',
       title=['title', 'headline', 'name'],  # Try in order
       abstract=['abstract', 'summary', 'description']
   )

.. seealso::
   For advanced field mapping including nested arrays, conditional extraction, and custom processors, see :doc:`schema_normalization`.

Testing Your Provider
=====================

Validation Checklist
--------------------

Before using a custom provider in production:

1. **Test with real queries:**

   .. code-block:: python
   
      coordinator = SearchCoordinator(
          query="test query",
          provider_name="my_provider"
      )
      
      # Test basic retrieval with `search_page` (returns a `SearchResult` container with additional metadata)
      result = coordinator.search_page(page=1)
      assert result, f"Failed: {type(result.response_result)}: {result.error} - {result.message}"
      print(f"✓ Retrieved {len(result.data)} records")
      
      # Test multiple pages
      results = coordinator.search_pages(pages=range(1, 4))
      successful = results.filter()
      print(f"✓ Retrieved {len(successful)}/{len(results)} pages")

2. **Verify normalization:**

   .. code-block:: python
   
      # Tests retrieval with the returned `ProcessedResponse`, `ErrorResponse`, or None
      result = coordinator.search(page=2, normalize_records=True)
      if result and result.normalized_records:
          record = result.normalized_records[0]
          print(f"✓ Title: {record.get('title')}")
          print(f"✓ DOI: {record.get('doi')}")
          print(f"✓ Provider: {record.get('provider_name')}")
      else:
          print("✗ Normalization failed")

3. **Test pagination:**

   .. code-block:: python
   
      # Verify pages return different records
      page1 = coordinator.search(page=1)
      page2 = coordinator.search(page=2)
      
      if page1 and page2:
          ids1 = [r['id'] for r in page1.data if 'id' in r]
          ids2 = [r['id'] for r in page2.data if 'id' in r]
          overlap = set(ids1) & set(ids2)
          print(f"✓ Pages have {len(overlap)} overlapping IDs (should be 0)")

4. **Check metadata extraction:**

   .. code-block:: python
   
      result = coordinator.search(page=1)
      if result:
          print(f"✓ Total results: {result.total_query_hits}")
          print(f"✓ Records per page: {result.records_per_page}")

Common Issues
-------------

**"No records found" but API returns data:**

Check your record extraction path. Add debug logging:

.. code-block:: python

   result = coordinator.search(page=1)
   if result:
       print(f"Parsed response keys: {result.parsed_response.keys()}")
       print(f"Extracted records: {len(result.extracted_records or [])}")

**"Field not found" during normalization:**

Check field names in actual API response:

.. code-block:: python

   result = coordinator.search(page=1)
   if result and result.data:
       sample_record = result.data[0]
       print(f"Available fields: {list(sample_record.keys())}")

**Pagination returns same records:**

Verify the mapped parameters from ``APIParameterMap`` against the API provider's requirements using its documentation:

.. code-block:: python

   # If API uses page numbers (1, 2, 3):
   print(coordinator.api.parameter_config.map)
   

Best Practices
==============

Configuration Guidelines
------------------------

1. **Check API rate limiting requirements directly and start conservative with rate limits:**

   .. code-block:: python
   
      # Start with a longer delay
      request_delay=5.0
      
      # Monitor API response headers
      # Adjust based on documented limits

2. **Use descriptive provider names:**

   .. code-block:: python
   
      # Good
      provider_name='europepmc'
      provider_name='semantic_scholar'
      
      # Avoid
      provider_name='api1'
      provider_name='custom'

3. **Document your configuration:**

   .. code-block:: python
   
      """
      Custom ScholarFlux Provider: Europe PMC
      
      Requirements:
      - No API key required
      - Rate limit: 3 requests/second
      
      Usage:
          >>> from my_providers import europepmc_config
          >>> provider_registry.add(europepmc_config)
          >>> coordinator = SearchCoordinator(
          ...     query="cancer",
          ...     provider_name="europepmc"
          ... )
      
      API Documentation:
      - https://europepmc.org/RestfulWebService
      """

4. **Test with diverse queries:**

   .. code-block:: python
   
      test_queries = [
          "simple query",
          "complex AND (query OR terms)",
          "phrase in quotes",
          "year:2024"
      ]
      
      for query in test_queries:
          coordinator = SearchCoordinator(
              query=query,
              provider_name="my_provider"
          )
          result = coordinator.search(page=1)
          print(f"{query}: {'✓' if result else '✗'}")

Error Handling
--------------

ScholarFlux uses response types instead of exceptions:

.. code-block:: python

   from scholar_flux import SearchCoordinator
   from scholar_flux.api.models import NonResponse, ErrorResponse
   
   def safe_search(query: str, provider_name: str):
       coordinator = SearchCoordinator(query=query, provider_name=provider_name)
       result = coordinator.search(page=1)
       
       # ProcessedResponse (truthy) - success
       if result:
           return result.normalize()
       
       # NonResponse - network error or API unreachable
       if isinstance(result.response_result, NonResponse):
           print(f"Network error: {result.message}")
           return []
       
       # ErrorResponse - API returned error
       if isinstance(result.response_result, ErrorResponse):
           print(f"API error: {result.message}")
           return []
       
       return []

.. tip::
   Check response validity with ``if result:`` rather than ``try/except`` for cleaner code.

Next Steps
==========

**Related Guides:**

- :doc:`schema_normalization` - Deep dive into field normalization patterns
- :doc:`multi_provider_search` - Use custom providers in concurrent searches
- :doc:`advanced_workflows` - Multi-step retrieval for complex APIs (like PubMed's two-step process)

**Advanced Topics:**

- :doc:`caching_strategies` - Production caching with Redis, MongoDB, SQLAlchemy
- :doc:`production_deployment` - Deploy custom providers at scale

**API Reference:**

- :class:`scholar_flux.api.models.ProviderConfig` - Complete configuration reference
- :class:`scholar_flux.api.models.APIParameterMap` - Parameter mapping reference
- :class:`scholar_flux.api.normalization.AcademicFieldMap` - Academic field mapping
- :class:`scholar_flux.api.normalization.NormalizingFieldMap` - Base field map for custom schemas

Community Contributions
-----------------------

Consider sharing your custom providers:

1. Test thoroughly with the validation checklist
2. Document clearly with usage examples
3. Open a pull request at https://github.com/SammieH21/scholar-flux
4. Include tests demonstrating functionality

Popular community providers may be included in future ScholarFlux releases!

Getting Help
------------

If you encounter issues:

1. **Check API documentation**: Verify parameter names and response structure
2. **Test API directly**: Use ``curl`` or ``requests`` to understand behavior
3. **Search issues**: https://github.com/SammieH21/scholar-flux/issues
4. **Open an issue**: Include provider details, config code, and error messages
5. **Email**: scholar.flux@gmail.com

When requesting help, include:

- Provider name and documentation URL
- Your ``ProviderConfig`` code
- Sample API response (anonymize sensitive data)
- Error messages or unexpected behavior
- ScholarFlux version: ``pip show scholar-flux``