Schema Normalization
====================

This tutorial demonstrates ScholarFlux's schema normalization system, which transforms inconsistent provider-specific field names into a unified academic schema—ready for machine learning, analytics, and systematic reviews.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
========

The Challenge: Different Field Names for the Same Data
-------------------------------------------------------

Academic APIs return the same information using wildly different field names:

.. code-block:: python

    # The same "title" field across providers:
    plos_record = {
        'title_display': 'Machine Learning in Genomics',  # PLOS
        'author_display': ['Smith J', 'Jones K']
    }

    arxiv_record = {
        'title': 'Machine Learning in Genomics',  # arXiv
        'author': [{'name': 'Smith J'}, {'name': 'Jones K'}]
    }

    crossref_record = {
        'title': ['Machine Learning in Genomics'],  # Crossref
        'author': [{'family': 'Smith', 'given': 'J'}]
    }

    openalex_record = {
        'display_name': 'Machine Learning in Genomics',  # OpenAlex
        'authorships': [{'author': {'display_name': 'Smith J'}}]
    }

**Result**: Building ML datasets requires hours of manual schema mapping and custom parsers for each provider.

The Solution: Automatic Schema Normalization
---------------------------------------------

ScholarFlux normalizes provider-specific field names into universal academic fields:

.. code-block:: python

    from scholar_flux import SearchCoordinator, MultiSearchCoordinator
    import pandas as pd

    # Query 4 providers
    multi_coordinator = MultiSearchCoordinator()
    multi_coordinator.add_coordinators([
        SearchCoordinator(query="machine learning", provider_name=provider)
        for provider in ['plos', 'arxiv', 'openalex', 'crossref']
    ])

    results = multi_coordinator.search_pages(pages=range(1, 3))

    # Filter successful responses and normalize
    normalized_records = results.filter().normalize()

    # All records now have consistent field names
    df = pd.DataFrame(normalized_records)
    print(df.columns)
    # Index(['provider_name', 'doi', 'url', 'record_id', 'title', 'abstract',
    #        'authors', 'journal', 'publisher', 'year', 'date_published',
    #        'date_created', 'keywords', 'subjects', 'citation_count',
    #        'open_access', 'license', 'record_type', 'language', ...])

**What happened:**

- ✅ 4 different response schemas normalized to 1 unified schema
- ✅ Nested fields flattened (``author.name`` → ``authors``)
- ✅ Provider-specific fields preserved in additional columns
- ✅ Ready for immediate ML/analytics workflows

Learning Objectives
-------------------

By the end of this tutorial, you will:

- Normalize multi-provider search results with one method call
- Understand the universal academic fields in ``AcademicFieldMap``
- Build ML-ready pandas DataFrames from heterogeneous API responses
- Create custom field mappings for new providers
- Use fallback paths for fields with multiple possible locations
- Apply normalization at different levels (SearchResultList, SearchResult, ProcessedResponse)

Prerequisites
-------------

Before starting, ensure you have:

- Completed the :doc:`getting_started` tutorial
- Familiarity with :doc:`multi_provider_search` for concurrent queries
- Basic pandas knowledge (optional, for DataFrame examples)
- Installed ScholarFlux: ``pip install scholar-flux``

.. note::

   Normalization works with any provider—no special configuration needed!

Basic Normalization
===================

Single Provider Normalization
------------------------------

Normalize results from a single provider:

.. code-block:: python

    from scholar_flux import SearchCoordinator
    import pandas as pd

    # Search PLOS
    coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")
    results = coordinator.search_pages(pages=range(1, 6))

    # Filter successful responses and normalize
    normalized_records = results.filter().normalize()

    # Convert to DataFrame
    df = pd.DataFrame(normalized_records)

    # All records have consistent field names
    print(df[['provider_name', 'title', 'doi', 'authors', 'journal']].head())

**Expected output:**

.. code-block:: text

      provider_name                                 title              doi
    0          plos  CRISPR-Cas9 genome editing in plants  10.1371/jour...
    1          plos        Therapeutic applications of...  10.1371/jour...
    2          plos  Ethical considerations in CRISPR use  10.1371/jour...

**Before normalization** (PLOS-specific fields) → **after normalization**:

- ``title_display`` → ``title``
- ``id`` → ``doi``
- ``author_display`` → ``authors``

Multi-Provider Normalization
-----------------------------

The real power emerges with multiple providers:

.. code-block:: python

    from scholar_flux import SearchCoordinator, MultiSearchCoordinator
    import pandas as pd

    # Query 4 providers simultaneously
    multi_coordinator = MultiSearchCoordinator()
    multi_coordinator.add_coordinators([
        SearchCoordinator(query="machine learning", provider_name='plos'),
        SearchCoordinator(query="machine learning", provider_name='arxiv'),
        SearchCoordinator(query="machine learning", provider_name='openalex'),
        SearchCoordinator(query="machine learning", provider_name='crossref')
    ])

    # Retrieve 10 pages per provider (40 total requests)
    results = multi_coordinator.search_pages(pages=range(1, 11))

    # Normalize all 1,250+ records in one call
    normalized_records = results.filter().normalize()

    # ML-ready DataFrame
    df = pd.DataFrame(normalized_records)
    print(f"Total records: {len(df)}")
    print(f"Providers: {df['provider_name'].unique()}")
    print(f"Fields: {len(df.columns)}")

**Expected output:**

.. code-block:: text

    Total records: 1250
    Providers: ['plos' 'arxiv' 'openalex' 'crossref']
    Fields: 37

**What ScholarFlux normalized:**

+------------------+-------------------------+---------------------------+------------------------+
| Universal Field  | PLOS                    | arXiv                     | OpenAlex               |
+==================+=========================+===========================+========================+
| ``title``        | ``title_display``       | ``title``                 | ``display_name``       |
+------------------+-------------------------+---------------------------+------------------------+
| ``doi``          | ``id``                  | ``doi`` (from links)      | ``doi``                |
+------------------+-------------------------+---------------------------+------------------------+
| ``authors``      | ``author_display``      | ``author.name``           | ``authorships.author`` |
+------------------+-------------------------+---------------------------+------------------------+
| ``abstract``     | ``abstract``            | ``summary``               | ``abstract``           |
+------------------+-------------------------+---------------------------+------------------------+
| ``year``         | ``publication_date``    | ``published``             | ``publication_year``   |
+------------------+-------------------------+---------------------------+------------------------+

.. tip::

   Normalization preserves provider-specific fields as additional columns—you get the best of both worlds!

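To see exactly which columns are provider-specific extras rather than universal fields, you can compare against ``AcademicFieldMap``. A minimal sketch, assuming ``df`` is the DataFrame built above:

.. code-block:: python

    from scholar_flux.api.normalization import AcademicFieldMap

    # Columns that aren't part of the universal schema were carried over from the providers
    universal = set(AcademicFieldMap.model_fields)
    provider_extras = [col for col in df.columns if col not in universal]
    print(provider_extras)  # provider-specific columns preserved by normalize()
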
The normalize() Method
----------------------

The :meth:`~scholar_flux.api.models.SearchResultList.normalize` method is available at three levels:

1. **SearchResultList** (recommended for batch operations):

   .. code-block:: python

       results = coordinator.search_pages(pages=range(1, 11))
       normalized = results.filter().normalize()  # List[dict]

2. **SearchResult** (single page):

   .. code-block:: python

       result = coordinator.search(page=1)
       normalized = result.normalize()  # List[dict]

3. **ProcessedResponse** (lowest level):

   .. code-block:: python

       result = coordinator.search(page=1)
       normalized = result.response_result.normalize()  # List[dict]

.. note::

   All three methods return the same structure: a list of dictionaries with normalized field names.

Inline Normalization
--------------------

For convenience, normalize during search execution:

.. code-block:: python

    from scholar_flux import SearchCoordinator

    coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")

    # Normalize automatically during search
    result = coordinator.search(page=1, normalize_records=True)

    # Access cached normalized records
    normalized = result.response_result.normalized_records

    # Or call normalize() - returns cached results
    normalized = result.normalize()

**Why use inline normalization?**

- Normalized records are cached in ``ProcessedResponse.normalized_records``
- Subsequent ``normalize()`` calls return cached results (no recomputation)
- Useful when you know you'll need normalized data later

The filter() Method
-------------------

``SearchResultList.filter()`` removes unsuccessful responses before normalization:

.. code-block:: python

    from scholar_flux import SearchCoordinator

    coordinator = SearchCoordinator(query="test", provider_name="plos")
    results = coordinator.search_pages(pages=range(1, 20))

    # Without filter - may include ErrorResponse/NonResponse
    print(f"Total results: {len(results)}")

    # With filter - only ProcessedResponse instances
    successful = results.filter()
    print(f"Successful: {len(successful)}")

    # Normalize only successful responses
    normalized = successful.normalize()

**filter() behavior:**

- Keeps: ``ProcessedResponse`` instances (successful retrievals)
- Removes: ``ErrorResponse`` and ``NonResponse`` instances (failures)
- Returns: New ``SearchResultList`` with filtered results

.. tip::

   Always use ``filter()`` before ``normalize()`` to avoid errors from failed responses.

Understanding Universal Fields
==============================

The AcademicFieldMap
--------------------

ScholarFlux defines 20 universal academic fields through the ``AcademicFieldMap``:

.. code-block:: python

    from scholar_flux.api.normalization import AcademicFieldMap

    # View all universal fields
    universal_fields = AcademicFieldMap.model_fields.keys()
    print(list(universal_fields))

**Core Identifiers:**

- ``provider_name``: Source database (plos, arxiv, crossref, etc.)
- ``doi``: Digital Object Identifier
- ``url``: Direct link to article
- ``record_id``: Provider-specific identifier

**Bibliographic Metadata:**

- ``title``: Article title
- ``abstract``: Article abstract/summary
- ``authors``: Author list

**Publication Metadata:**

- ``journal``: Journal name
- ``publisher``: Publisher name
- ``year``: Publication year
- ``date_published``: Full publication date
- ``date_created``: Record creation date

**Content Classification:**

- ``keywords``: Article keywords
- ``subjects``: Subject classifications
- ``full_text``: Full text availability

**Metrics:**

- ``citation_count``: Number of citations

**Access Information:**

- ``open_access``: Open access status
- ``license``: License type

**Document Metadata:**

- ``record_type``: Article type
- ``language``: Primary language

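Every normalized record exposes all of these keys; anything the provider doesn't supply comes back as ``None`` rather than being dropped. A minimal sketch using a hypothetical provider name and field:

.. code-block:: python

    from scholar_flux.api.normalization import AcademicFieldMap

    # "demo_api" and "headline" are illustrative, not a real provider mapping
    field_map = AcademicFieldMap(provider_name="demo_api", title="headline")
    record = field_map.normalize_record({'headline': 'CRISPR Advances'})

    print(record['title'])           # 'CRISPR Advances'
    print(record['citation_count'])  # None - not mapped for this provider
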
Field Map Architecture
----------------------

Each provider has a custom field map defining how to extract universal fields:

.. code-block:: python

    from scholar_flux.api.providers import provider_registry

    # Get PLOS field map
    plos_config = provider_registry.get('plos')
    field_map = plos_config.field_map

    # View field mappings
    print(field_map.fields)
    # {'provider_name': 'plos',
    #  'title': 'title_display',
    #  'doi': 'id',
    #  'authors': 'author_display',
    #  'abstract': 'abstract',
    #  'year': 'publication_date',
    #  ...}

**How it works:**

1. Field map defines mapping from API-specific fields to universal fields
2. ``normalize()`` applies the field map to transform records
3. Missing fields are set to ``None`` (not excluded)
4. Provider-specific fields are preserved as additional columns

Nested Field Access
-------------------

Field maps support dot notation for nested fields:

.. code-block:: python

    from scholar_flux.api.normalization import AcademicFieldMap

    # Define nested field paths
    field_map = AcademicFieldMap(
        provider_name="custom_api",
        title="article.metadata.title",
        authors="article.authors.name",
        doi="identifiers.doi",
        year="publication.year"
    )

    # Sample nested record
    record = {
        'article': {
            'metadata': {'title': 'Deep Learning'},
            'authors': [
                {'name': 'Smith, J'},
                {'name': 'Doe, A'}
            ]
        },
        'identifiers': {'doi': '10.1234/example'},
        'publication': {'year': 2024}
    }

    # Normalize
    normalized = field_map.normalize_record(record)
    print(normalized)
    # {'provider_name': 'custom_api',
    #  'title': 'Deep Learning',
    #  'authors': ['Smith, J', 'Doe, A'],
    #  'doi': '10.1234/example',
    #  'year': 2024,
    #  ...}

**Nested field features:**

- Uses dot notation (``parent.child.field``)
- Automatically traverses lists (``authors.name`` extracts from all authors)
- Returns ``None`` if path doesn't exist
- Handles mixed types gracefully

Fallback Paths
--------------

Some providers store the same data in different locations. Use fallback paths:

.. code-block:: python

    from scholar_flux.api.normalization import AcademicFieldMap

    # Define fallback paths as a list
    field_map = AcademicFieldMap(
        provider_name="custom_api",
        # Try primary_title first, then fallback_title, then title
        title=["primary_title", "fallback_title", "title"],
        # Try detailed abstract first, then summary
        abstract=["detailed_abstract", "summary"]
    )

    # Record with fallback field
    record = {
        'fallback_title': 'Machine Learning Advances',
        'summary': 'A comprehensive review...'
    }

    normalized = field_map.normalize_record(record)
    print(normalized['title'])     # 'Machine Learning Advances'
    print(normalized['abstract'])  # 'A comprehensive review...'

**Fallback behavior:**

- Tries paths in order (left to right)
- Uses first non-None value found
- Sets to ``None`` if all paths fail
- Defined per-field (each field can have different fallbacks)

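When a record contains more than one candidate field, the leftmost path still wins. Continuing the ``field_map`` sketch above:

.. code-block:: python

    both = field_map.normalize_record({
        'primary_title': 'Primary Wins',
        'fallback_title': 'Never Used'
    })
    print(both['title'])  # 'Primary Wins' - the first matching path takes precedence
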
**Example from PubMed field map:**

.. code-block:: python

    # scholar_flux/api/normalization/pubmed_field_map.py
    field_map = AcademicFieldMap(
        provider_name="pubmed",
        # Try with #text attribute first, fallback to field directly
        title=[
            "MedlineCitation.Article.ArticleTitle.#text",
            "MedlineCitation.Article.ArticleTitle"
        ],
        abstract=[
            "MedlineCitation.Article.Abstract.AbstractText.#text",
            "MedlineCitation.Article.Abstract.AbstractText"
        ],
        # ... other fields
    )

This handles cases where XML parsing produces different structures depending on content.

Advanced Normalization
======================

Including Metadata in Normalized Records
-----------------------------------------

Include query/provider metadata alongside normalized records:

.. code-block:: python

    from scholar_flux import SearchCoordinator

    coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")
    results = coordinator.search_pages(pages=range(1, 3))

    # Default: includes provider_name and page
    normalized = results.filter().normalize()
    print(normalized[0].keys())
    # dict_keys(['provider_name', 'page', 'title', 'doi', ...])

    # Include only provider_name
    normalized = results.filter().normalize(include={'provider_name'})

    # Include all metadata
    normalized = results.filter().normalize(include={'query', 'provider_name', 'page'})
    print(normalized[0])
    # {'query': 'CRISPR',
    #  'provider_name': 'plos',
    #  'page': 1,
    #  'title': '...',
    #  'doi': '...',
    #  ...}

**Available metadata fields:**

- ``query``: Search query used
- ``provider_name``: Data source
- ``page``: Page number

Controlling Normalization Updates
----------------------------------

Control when normalized records are cached:

.. code-block:: python

    from scholar_flux import SearchCoordinator

    coordinator = SearchCoordinator(query="test", provider_name="plos")
    result = coordinator.search(page=1)

    # First normalization - computes and caches
    normalized1 = result.normalize(update_records=True)
    assert result.response_result.normalized_records == normalized1

    # Second normalization - uses cached results
    normalized2 = result.normalize()
    assert normalized1 is result.response_result.normalized_records

    # Force recomputation without caching
    normalized3 = result.normalize(update_records=False)
    # Recomputes but doesn't update .normalized_records

**update_records parameter:**

- ``None`` (default): Update cache if not already set
- ``True``: Always update cache
- ``False``: Never update cache (recompute each time)

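The default mode is the least surprising one: a plain ``normalize()`` call on a fresh result computes the records and fills the cache in one step. A short sketch, assuming the cache starts out unset on a newly retrieved page:

.. code-block:: python

    fresh = coordinator.search(page=2)
    normalized = fresh.normalize()  # default update_records=None: compute and cache

    # Later calls reuse the cached list
    assert fresh.response_result.normalized_records is not None
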
Error Handling
--------------

Normalization handles errors gracefully:

.. code-block:: python

    from scholar_flux import SearchCoordinator
    from scholar_flux.exceptions import RecordNormalizationException

    coordinator = SearchCoordinator(query="test", provider_name="unknown_provider")
    result = coordinator.search(page=1)

    # Graceful failure - returns empty list
    normalized = result.normalize(raise_on_error=False)
    print(normalized)  # []

    # Strict failure - raises exception
    try:
        normalized = result.normalize(raise_on_error=True)
    except RecordNormalizationException as e:
        print(f"Normalization failed: {e}")

**Error scenarios:**

- Provider not in registry → ``RecordNormalizationException``
- No field map defined → ``RecordNormalizationException``
- ErrorResponse/NonResponse → Returns ``[]`` if ``raise_on_error=False``
- Missing response result → ``RecordNormalizationException``

Working with DataFrames
=======================

Building ML-Ready Datasets
---------------------------

Convert normalized records directly to pandas DataFrames:

.. code-block:: python

    from scholar_flux import SearchCoordinator, MultiSearchCoordinator
    from scholar_flux.api.normalization import AcademicFieldMap
    import pandas as pd

    # Multi-provider search
    multi_coordinator = MultiSearchCoordinator()
    multi_coordinator.add_coordinators([
        SearchCoordinator(query="machine learning", provider_name='plos'),
        SearchCoordinator(query="machine learning", provider_name='crossref'),
        SearchCoordinator(query="machine learning", provider_name='openalex')
    ])

    results = multi_coordinator.search_pages(pages=range(1, 11))

    # Normalize with metadata
    normalized = results.filter().normalize(include={'provider_name', 'page'})

    # Convert to DataFrame
    df = pd.DataFrame(normalized)

    # Analyze field coverage
    universal_fields = list(AcademicFieldMap.model_fields.keys())
    coverage = df[universal_fields].notna().mean() * 100
    print(coverage.sort_values(ascending=False))
    # provider_name    100.0
    # title            100.0
    # doi               95.2
    # authors           87.3
    # abstract          76.8
    # year              98.1
    # ...

Analyzing Provider Coverage
----------------------------

Compare which fields are available across providers:

.. code-block:: python

    import pandas as pd
    from scholar_flux.api.normalization import AcademicFieldMap

    # Assume df is a DataFrame from normalized multi-provider results
    universal_fields = list(AcademicFieldMap.model_fields.keys())

    # Count records per provider with each field
    provider_field_counts = df.groupby('provider_name')[universal_fields].count()

    # Find fields available in 3+ providers
    min_providers = 3
    common_fields = (provider_field_counts > 0).sum() >= min_providers
    common_field_list = common_fields[common_fields].index.tolist()

    print("Fields common across providers:")
    print(common_field_list)

    print("\nRecord counts per provider:")
    print(provider_field_counts[common_field_list])

**Example output:**

.. code-block:: text

    Fields common across providers:
    ['provider_name', 'doi', 'url', 'record_id', 'title', 'abstract',
     'authors', 'journal', 'publisher', 'year', 'date_published',
     'date_created', 'subjects', 'record_type']

    Record counts per provider:
                   doi  url  record_id  title  abstract  ...
    provider_name                                        ...
    arxiv            0   50         50     50        50  ...
    crossref        50   50         50     50         3  ...
    openalex        40   49         50     50         0  ...
    plos           100    0        100    100        99  ...

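One practical takeaway from the counts above: no single identifier column is complete across providers (here, arXiv records lack DOIs and PLOS records lack URLs). A small sketch that derives a best-available link column, assuming ``df`` from the previous example:

.. code-block:: python

    # Prefer the direct URL; fall back to a DOI resolver link where the URL is missing.
    # String concatenation with a missing DOI stays NaN, so rows lacking both remain NaN.
    df['link'] = df['url'].fillna('https://doi.org/' + df['doi'])
    print(df['link'].notna().mean() * 100)  # percent of records with a usable link
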
Creating Custom Field Maps
==========================

Basic Custom Field Map
-----------------------

Create a custom field map for a new provider:

.. code-block:: python

    from scholar_flux.api.normalization import AcademicFieldMap

    # Define mapping for custom provider
    custom_map = AcademicFieldMap(
        provider_name="custom_api",

        # Direct field mappings
        title="article_title",
        doi="digital_identifier",
        abstract="summary_text",

        # Nested field mappings
        authors="contributors.author_name",
        journal="publication_venue.name",
        year="published_year",

        # API-specific fields to preserve
        api_specific_fields={
            'internal_id': 'record_number',
            'subject_codes': 'classification_codes',
            'access_level': 'availability_status'
        }
    )

    # Test with sample record
    sample = {
        'article_title': 'Deep Learning Methods',
        'digital_identifier': '10.1234/example.2024',
        'summary_text': 'A comprehensive review...',
        'contributors': [
            {'author_name': 'Smith, J'},
            {'author_name': 'Doe, A'}
        ],
        'publication_venue': {'name': 'Nature'},
        'published_year': 2024,
        'record_number': 12345,
        'classification_codes': ['CS.AI', 'STAT.ML']
    }

    normalized = custom_map.normalize_record(sample)
    print(normalized)
    # {'provider_name': 'custom_api',
    #  'title': 'Deep Learning Methods',
    #  'doi': '10.1234/example.2024',
    #  'abstract': 'A comprehensive review...',
    #  'authors': ['Smith, J', 'Doe, A'],
    #  'journal': 'Nature',
    #  'year': 2024,
    #  'internal_id': 12345,
    #  'subject_codes': ['CS.AI', 'STAT.ML'],
    #  ...}

Integrating Custom Maps with Providers
---------------------------------------

Add custom field maps to provider configurations:

.. code-block:: python

    from scholar_flux.api import ProviderConfig, APIParameterMap, SearchCoordinator
    from scholar_flux.api.providers import provider_registry
    from scholar_flux.api.normalization import AcademicFieldMap

    # Create custom field map
    field_map = AcademicFieldMap(
        provider_name="guardian",
        title="webTitle",
        url="webUrl",
        date_published="webPublicationDate",
        authors="tags.contributor",
        abstract="fields.trailText",
        api_specific_fields={
            'section_name': 'sectionName',
            'word_count': 'fields.wordcount'
        }
    )

    # Create provider config with field map
    guardian_config = ProviderConfig(
        provider_name='guardian',
        base_url='https://content.guardianapis.com/search',
        parameter_map=APIParameterMap(
            query='q',
            start='page',
            records_per_page='page-size',
            api_key_parameter='api-key',
            auto_calculate_page=False,
            api_key_required=True
        ),
        field_map=field_map,  # Add custom field map
        records_per_page=10,
        request_delay=6,
        api_key_env_var='GUARDIAN_API_KEY'
    )

    # Add to registry
    provider_registry.add(guardian_config)

    # Use with automatic normalization
    coordinator = SearchCoordinator(query="climate change", provider_name='guardian')
    result = coordinator.search(page=1, normalize_records=True)

    # Access normalized records
    normalized = result.response_result.normalized_records

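Before registering a new provider, it can be worth smoke-testing the field map against a captured sample response. A sketch using the ``field_map`` defined above; the record below is illustrative, not real Guardian output:

.. code-block:: python

    # Hypothetical sample record shaped like a Guardian content item
    sample = {
        'webTitle': 'Climate summit reaches draft agreement',
        'webUrl': 'https://www.theguardian.com/environment/example',
        'webPublicationDate': '2024-11-20T10:00:00Z',
        'fields': {'trailText': 'Negotiators agree on...', 'wordcount': 850},
        'sectionName': 'Environment',
    }

    normalized = field_map.normalize_record(sample)
    print(normalized['title'])         # 'Climate summit reaches draft agreement'
    print(normalized['section_name'])  # 'Environment' - preserved via api_specific_fields
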
Processing Complex Structures
------------------------------

For complex nested structures, combine with data processors:

.. code-block:: python

    from scholar_flux import SearchCoordinator
    from scholar_flux.data import RecursiveDataProcessor
    from scholar_flux.api.normalization import AcademicFieldMap

    # RecursiveDataProcessor flattens nested structures
    processor = RecursiveDataProcessor()

    coordinator = SearchCoordinator(
        query="test",
        provider_name="complex_api",
        processor=processor  # Flattens before normalization
    )

    # Field map paths address the flattened structure
    field_map = AcademicFieldMap(
        provider_name="complex_api",
        title="article.metadata.title",  # Matches the flattened "article.metadata.title" key
        authors="authors.name"           # Auto-extracts from the flattened author list
    )
    # Wire the map in via ProviderConfig(field_map=field_map), as in the previous section

Best Practices
==============

Performance Optimization
------------------------

**1. Cache normalized records when possible:**

.. code-block:: python

    # Good - normalizes once, caches result
    result = coordinator.search(page=1, normalize_records=True)
    normalized = result.response_result.normalized_records  # Uses cache

    # Less efficient - forces recomputation each time
    result = coordinator.search(page=1)
    normalized1 = result.normalize(update_records=False)
    normalized2 = result.normalize(update_records=False)  # Recomputes

**2. Batch normalization with SearchResultList:**

.. code-block:: python

    # Good - normalizes all at once
    results = coordinator.search_pages(pages=range(1, 100))
    normalized = results.filter().normalize()

    # Less efficient - normalizes one at a time
    normalized = []
    for result in results.filter():
        normalized.extend(result.normalize())

**3. Use filter() before normalize():**

.. code-block:: python

    # Good - only normalizes successful responses
    normalized = results.filter().normalize()

    # Less efficient - tries to normalize errors
    normalized = results.normalize(raise_on_error=False)

Memory Management
-----------------

For large datasets, process in chunks:

.. code-block:: python

    import pandas as pd
    from scholar_flux import SearchCoordinator

    coordinator = SearchCoordinator(query="machine learning", provider_name="plos")

    # Process 100 pages in chunks of 10
    all_records = []
    for start in range(1, 101, 10):
        chunk_pages = range(start, min(start + 10, 101))
        results = coordinator.search_pages(pages=chunk_pages)
        normalized = results.filter().normalize()
        all_records.extend(normalized)

        # Optional: Save intermediate results
        if start % 50 == 1:
            pd.DataFrame(all_records).to_parquet(f'checkpoint_{start}.parquet')

    # Final DataFrame
    df = pd.DataFrame(all_records)

Data Quality Checks
-------------------

Validate normalized data before analysis:

.. code-block:: python

    import pandas as pd
    from scholar_flux.api.normalization import AcademicFieldMap

    # Get normalized records
    normalized = results.filter().normalize(include={'provider_name'})
    df = pd.DataFrame(normalized)

    # Check for required fields
    required_fields = ['provider_name', 'title', 'doi']
    missing_required = df[required_fields].isna().sum()
    print("Missing required fields:")
    print(missing_required[missing_required > 0])

    # Check universal field coverage
    universal_fields = list(AcademicFieldMap.model_fields.keys())
    coverage = df[universal_fields].notna().mean() * 100
    print("\nField coverage:")
    print(coverage[coverage > 0].sort_values(ascending=False))

    # Check for duplicates by DOI
    duplicates = df[df.duplicated(subset=['doi'], keep=False)]
    print(f"\nDuplicate records: {len(duplicates)}")

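A common follow-up to the duplicate check: keep a single row per DOI, preferring whichever provider returned the most complete record. A sketch building on ``df`` and ``universal_fields`` from the example above:

.. code-block:: python

    # Score each record by how many universal fields it actually populates
    with_doi = df[df['doi'].notna()].copy()
    with_doi['completeness'] = with_doi[universal_fields].notna().sum(axis=1)

    # Sort the fullest records first, then keep the first occurrence of each DOI
    deduped = (with_doi.sort_values('completeness', ascending=False)
                       .drop_duplicates(subset='doi')
                       .drop(columns='completeness'))
    print(f"Kept {len(deduped)} unique DOIs out of {len(with_doi)} records")
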
Next Steps
==========

Congratulations! You now understand ScholarFlux's schema normalization system. You can:

- ✅ Normalize multi-provider search results with one method call
- ✅ Build ML-ready pandas DataFrames from heterogeneous APIs
- ✅ Create custom field mappings for new providers
- ✅ Use fallback paths for flexible field resolution
- ✅ Optimize normalization performance for large datasets

Real-World Use Cases
====================

Systematic Literature Review
-----------------------------

Build evidence tables for systematic reviews:

.. code-block:: python

    from scholar_flux import MultiSearchCoordinator, SearchCoordinator
    from scholar_flux.utils import JsonFileUtils, JsonDataEncoder
    from pathlib import Path
    import pandas as pd

    # Search all major databases for a medical topic
    multi_coordinator = MultiSearchCoordinator.from_coordinators([
        SearchCoordinator(query="COVID-19 vaccine efficacy", provider_name=p, use_cache=True)
        for p in ['pubmed', 'plos', 'crossref']
    ])

    results = multi_coordinator.search_pages(pages=range(1, 51))  # 50 pages per provider (150 requests)

    # Metadata fields to include in the result set
    search_fields = {'query', 'provider_name', 'page'}
    df = pd.DataFrame(results.filter().normalize(include=search_fields))

    # Save location
    documents_folder = Path.home() / "Documents"

    # Create an audit trail, saving the raw records before normalization
    raw_evidence_records_path = documents_folder / "covid_vaccine_evidence_raw_records.json"
    raw_evidence_records = results.join(include=search_fields)

    if not JsonFileUtils.is_jsonable(raw_evidence_records):
        print(
            "Can't save the JSON data directly! Elements that can't be stored as JSON "
            "will be encoded for storage.\n"
            "Use `scholar_flux.utils.JsonDataEncoder.decode()` after loading to restore the raw data.\n"
            "Note: only nested list and dictionary elements that can't be stored directly "
            "are encoded; everything else in the JSON is saved as is."
        )
        raw_evidence_records = JsonDataEncoder.encode(raw_evidence_records)

    JsonFileUtils.save_as(raw_evidence_records, raw_evidence_records_path)

    # Create evidence table
    evidence_records = df[[
        'title', 'authors', 'journal', 'year', 'doi',
        'abstract', 'full_text', 'open_access'
    ]].copy()

    # Add PRISMA screening columns
    evidence_records['include_abstract'] = evidence_records['abstract'].notna()
    evidence_records['include_fulltext'] = evidence_records['full_text'].notna()
    evidence_records['is_restricted'] = evidence_records['open_access'].fillna(False).eq(False)
    evidence_records['exclusion_reason'] = None

    # Export for manual review
    evidence_records_path = documents_folder / 'covid_vaccine_evidence.xlsx'
    evidence_records.to_excel(evidence_records_path, index=False)

    print(f"The data was successfully saved: \n1. '{raw_evidence_records_path}' \n2. '{evidence_records_path}'")

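The screening columns above also make PRISMA-style stage counts one-liners. A short sketch reusing ``evidence_records``:

.. code-block:: python

    print(f"Records identified:   {len(evidence_records)}")
    print(f"Abstracts available:  {evidence_records['include_abstract'].sum()}")
    print(f"Full texts available: {evidence_records['include_fulltext'].sum()}")
    print(f"Access-restricted:    {evidence_records['is_restricted'].sum()}")
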
Citation Network Analysis
-------------------------

Build citation graphs from normalized data:

.. code-block:: python

    from scholar_flux import SearchCoordinator
    import pandas as pd
    import networkx as nx

    # Retrieve papers with citation data
    coordinator = SearchCoordinator(query="neural networks", provider_name="openalex")
    results = coordinator.search_pages(pages=range(1, 101))
    df = pd.DataFrame(results.filter().normalize())

    # Filter papers with citations
    cited = df[df['citation_count'] > 0].copy()

    # Build citation network (simplified)
    G = nx.DiGraph()
    for _, row in cited.iterrows():
        if pd.notna(row['doi']):
            G.add_node(row['doi'],
                       title=row['title'],
                       year=row['year'],
                       citations=row['citation_count'])

    # Analyze network
    print(f"Nodes: {G.number_of_nodes()}")
    if G.number_of_nodes() > 0:
        most_cited = max(G.nodes(data=True), key=lambda x: x[1].get('citations', 0))
        print(f"Most cited: {most_cited[1]['title']} ({most_cited[1]['citations']} citations)")

Meta-Analysis Pipeline
----------------------

Extract data for meta-analysis:

.. code-block:: python

    from scholar_flux import SearchCoordinator
    import pandas as pd
    import re

    # Search for clinical trials
    coordinator = SearchCoordinator(
        query="randomized controlled trial depression treatment",
        provider_name="pubmed"
    )
    results = coordinator.search_pages(pages=range(1, 21))
    df = pd.DataFrame(results.filter().normalize())

    # Extract sample sizes from abstracts (simplified)
    def extract_n(abstract):
        if pd.isna(abstract):
            return None
        match = re.search(r'[Nn]=(\d+)', str(abstract))
        return int(match.group(1)) if match else None

    df['sample_size'] = df['abstract'].apply(extract_n)

    # Filter for meta-analysis
    meta_data = df[df['sample_size'].notna()].copy()

    # Export for RevMan or Comprehensive Meta-Analysis
    meta_data[['title', 'authors', 'year', 'journal', 'sample_size', 'doi']].to_csv(
        'depression_rct_meta.csv', index=False
    )

Getting Help
------------

If you encounter issues with normalization:

1. **Check field availability**: Print ``result.data[0].keys()`` to see actual field names
2. **Verify the provider has a field map**: ``provider_registry[provider_name].field_map``
3. **Test with a sample record**: Use ``field_map.normalize_record(sample)`` to debug
4. **Search existing issues**: https://github.com/SammieH21/scholar-flux/issues
5. **Ask for help**: Open a new issue or email scholar.flux@gmail.com

When reporting normalization issues, include:

- Provider name
- Sample raw record (``result.data[0]``)
- Expected normalized fields
- Actual normalized output
- ScholarFlux version

Where to Go Next
----------------

**Related Tutorials:**

- :doc:`multi_provider_search` - Concurrent multi-provider orchestration (pairs with normalization)
- :doc:`custom_providers` - Add new providers with custom field maps
- :doc:`advanced_workflows` - Multi-step normalization pipelines

**Advanced Topics:**

- :doc:`caching_strategies` - Cache normalized results for production
- :doc:`production_deployment` - Deploy normalized data pipelines

**Reference:**

- :class:`~scholar_flux.api.normalization.AcademicFieldMap` - Full field map API
- :class:`~scholar_flux.api.normalization.NormalizingFieldMap` - Base normalization class
- :class:`~scholar_flux.api.models.SearchResultList` - Batch normalization methods