scholar_flux.api.normalization package
Submodules
scholar_flux.api.normalization.academic_field_map module
The scholar_flux.api.normalization.academic_field_map implements the AcademicFieldMap for record normalization.
This implementation subclasses the NormalizingFieldMap class for use in academic record normalization by defining additional combinations of fields that apply solely to academic APIs and databases.
- Architecture Context:
This layer is the third step in a three-part configuration system tailored to each individual provider:
Parameter Map (BaseAPIParameterMap) - Translates search parameters to provider-specific API parameters
Metadata Map (ResponseMetadataMap) - Extracts pagination metadata (total hits, records per page)
Field Map (AcademicFieldMap) - Normalizes provider-specific fields into a universal schema
All three layers compose via ProviderConfig for complete provider integration:
>>> from scholar_flux.api.providers import provider_registry
>>> config = provider_registry.get("plos")
>>> config.parameter_map  # Request building
>>> config.metadata_map  # Pagination intelligence
>>> config.field_map  # Response normalization
- Design Philosophy:
Minimal defaults: Works out-of-box for common use cases
Provider-specific when needed: Subclasses override _post_process() for domain logic
User-extensible: Users can customize or replace field maps entirely
This is NOT a rigid framework—each provider handles genuinely different data structures. The base class provides common helpers, not enforced patterns.
- class scholar_flux.api.normalization.academic_field_map.AcademicFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: NormalizingFieldMap
Extends the NormalizingFieldMap to customize field extraction and processing for academic record normalization.
This class is used to normalize the names of academic data fields consistently across providers. By default, the AcademicFieldMap includes fields for several attributes of academic records including:
Core identifiers (e.g. doi, url, record_id)
Bibliographic metadata (title, abstract, authors)
Publication metadata (journal, publisher, year, date_published, date_created)
Content and classification (keywords, subjects, full_text)
Metrics and impact (citation_count)
Access and rights (open_access, license)
Document metadata (record_type, language)
All other fields that are relevant to only the current API (api_specific_fields)
During normalization, the AcademicFieldMap.fields property returns all subclassed field mappings as a flattened dictionary (excluding private fields prefixed with underscores). Both simple and nested API-specific field names are matched and mapped to universal field names.
Any changes to the instance configuration are automatically detected during normalization by comparing the _cached_fields to the updated fields property.
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> field_map = AcademicFieldMap(provider_name = None, title = 'article_title', record_id='ID')
>>> expected_result = field_map.fields | {'provider_name':'core', 'title': 'Decomposition of Political Tactics', 'record_id': 196}
>>> result = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> cached_fields = field_map._cached_fields
>>> print(result == expected_result)
>>> result2 = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> assert cached_fields is field_map._cached_fields
>>> assert result is not result2
Note
To account for special cases, the AcademicFieldMap can be subclassed to perform two-step normalization to further process extracted elements.
- Phase 1:
The AcademicFieldMap extracts nested fields for each record. This class traverses paths such as 'MedlineCitation.Article.AuthorList.Author' (PubMed) or 'authorships.institutions.display_name' (OpenAlex) to map API-specific fields to universal field names.
- Phase 2 (Subclasses):
Subclasses can reformat extracted data into finalized fields. For example, PubMed prepares the authors field by combining each author’s ‘ForeName’ and ‘LastName’ into ‘FirstName LastName’. PLOS creates the record URL for each article by combining the URL prefix for the website with the DOI of the current record. The AcademicFieldMap defines common (yet optional) class methods to aid in the extraction and processing of normalized fields.
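The two phases described above can be illustrated with a standalone sketch that does not import scholar_flux. The helper names get_path, extract_fields, and join_author_names are hypothetical, and the record is a simplified stand-in for a PubMed-like response rather than the library's actual implementation.

```python
from typing import Any


def get_path(record: dict[str, Any], path: str) -> Any:
    """Follow a dot-delimited path through nested dicts, fanning out over lists."""
    current: Any = record
    for key in path.split("."):
        if isinstance(current, list):
            # Collect the key from every dictionary found in the list.
            current = [item.get(key) for item in current if isinstance(item, dict)]
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            return None
    return current


def extract_fields(record: dict[str, Any], field_map: dict[str, str]) -> dict[str, Any]:
    """Phase 1: map provider-specific paths to universal field names."""
    return {name: get_path(record, path) for name, path in field_map.items()}


def join_author_names(normalized: dict[str, Any]) -> dict[str, Any]:
    """Phase 2: reformat extracted author dictionaries into 'ForeName LastName'."""
    authors = normalized.get("authors") or []
    names = [f"{a.get('ForeName', '')} {a.get('LastName', '')}".strip() for a in authors]
    return {**normalized, "authors": [n for n in names if n] or None}


raw = {
    "MedlineCitation": {
        "Article": {
            "ArticleTitle": "A hypothetical title",
            "AuthorList": {
                "Author": [
                    {"ForeName": "Jane", "LastName": "Doe"},
                    {"ForeName": "Evan", "LastName": "Doodle"},
                ]
            },
        }
    }
}
field_map = {
    "title": "MedlineCitation.Article.ArticleTitle",
    "authors": "MedlineCitation.Article.AuthorList.Author",
}
normalized = join_author_names(extract_fields(raw, field_map))
print(normalized)
```

Keeping extraction and reformatting separate means the path lookup stays generic, while provider-specific logic is confined to the second step.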
- abstract: list[str] | str | None
- api_specific_fields: dict[str, Any]
- authors: list[str] | str | None
- citation_count: list[str] | str | None
- date_created: list[str] | str | None
- date_published: list[str] | str | None
- default_field_values: dict[str, Any]
- doi: list[str] | str | None
- classmethod extract_abstract(record: NormalizedRecordType, strip_html: bool = False, field: str = 'abstract', **kwargs: Any) str | None[source]
Extracts and prepares the abstract for the current record.
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘abstract’ already available as a field.
strip_html (bool) – Indicates whether html tags should be checked and removed if found in the abstract.
field (str) – The field where an abstract or text field can be found.
**kwargs – Additional arguments to pass to get_text when stripping html elements.
- Returns:
An abstract string or None if not found or not a string/list of strings
- Return type:
Optional[str]
Example
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'abstract': 'Analysis of the Placebo effect on...'}
>>> AcademicFieldMap.extract_abstract(record)
# OUTPUT: 'Analysis of the Placebo effect on...'
>>> record = {'abstract': '<h1>Game theory in the technological industry.</h1><p>This study explores...</p>'}
>>> AcademicFieldMap.extract_abstract(record, strip_html=True, separator=' ')
# OUTPUT: 'Game theory in the technological industry. This study explores...'
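As a rough illustration of what stripping HTML with a separator involves, the sketch below uses only the standard library's html.parser. This is a swapped-in technique for demonstration: the documented method forwards its keyword arguments to get_text, and the names _TextCollector and strip_html_tags are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the non-empty text chunks found between HTML tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_html_tags(text: str, separator: str = " ") -> str:
    """Remove tags and join the remaining text chunks with a separator."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return separator.join(parser.chunks)


html = "<h1>Game theory in the technological industry.</h1><p>This study explores...</p>"
print(strip_html_tags(html))
```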
- classmethod extract_authors(record: NormalizedRecordType, field: str = 'authors') list[str] | None[source]
Filters and cleans the author names list.
- Parameters:
record (NormalizedRecordType) – Normalized record with an ‘authors’ field.
field (str) – The field to extract the list of authors from.
- Returns:
A list of non-empty author names, or None if empty
- Return type:
Optional[list[str]]
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'authors': 'Evan Doodle; Jane Doe'}
>>> AcademicFieldMap.extract_authors(record)
# OUTPUT: ['Evan Doodle', 'Jane Doe']
>>> record = {'authors': ['Evan Doodle', 'Jane Noah']}
>>> AcademicFieldMap.extract_authors(record)
# OUTPUT: ['Evan Doodle', 'Jane Noah']
>>> record = {'authors': [102, 203]}
>>> AcademicFieldMap.extract_authors(record)  # returns None because the elements aren't strings
# OUTPUT: None
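The cleaning behavior shown in these examples can be approximated with a small standalone function. This is a sketch under the assumption that string inputs are split on semicolons, as the first example suggests; clean_authors is a hypothetical name, not part of the library.

```python
import re
from typing import Any, Optional


def clean_authors(value: Any) -> Optional[list[str]]:
    """Return a list of non-empty author names, or None if nothing usable is found."""
    if isinstance(value, str):
        # Assumption: a single string holds semicolon-delimited names.
        value = re.split(r"\s*;\s*", value)
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        return None
    names = [v.strip() for v in value if v.strip()]
    return names or None


print(clean_authors("Evan Doodle; Jane Doe"))  # ['Evan Doodle', 'Jane Doe']
print(clean_authors([102, 203]))               # None
```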
- classmethod extract_boolean_field(record: NormalizedRecordType, field: str, true_values: tuple[str, ...] = ('true', '1', 'yes'), false_values: tuple[str, ...] = ('false', '0', 'no'), default: bool | None = None) bool | None[source]
Extracts a field's value from the current record as a boolean ('true' -> True, 'false' -> False, 'None' -> None).
- Parameters:
record (NormalizedRecordType) – The normalized record dictionary to extract a boolean value from.
field (str) – The record field to be used for the extraction of a boolean value.
true_values (tuple[str, ...]) – Values to be mapped to True when found.
false_values (tuple[str, ...]) – Values to be mapped to False when found.
default (Optional[bool]) – The value to default to when the observed value matches neither true_values nor false_values.
- Returns:
True if the field's value appears in true_values
False if the field's value appears in false_values
The default if the observed value cannot be found within true_values or false_values
- Return type:
Optional[bool]
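A minimal sketch of the three-way mapping described above follows. The function name to_boolean is hypothetical, and the case-insensitive comparison and the pass-through of existing booleans are assumptions rather than documented behavior.

```python
from typing import Any, Optional


def to_boolean(
    value: Any,
    true_values: tuple[str, ...] = ("true", "1", "yes"),
    false_values: tuple[str, ...] = ("false", "0", "no"),
    default: Optional[bool] = None,
) -> Optional[bool]:
    """Map a raw value to True, False, or a default when it matches neither set."""
    if isinstance(value, bool):
        return value  # assumption: real booleans pass through unchanged
    text = str(value).strip().lower()  # assumption: matching ignores case
    if text in true_values:
        return True
    if text in false_values:
        return False
    return default


print(to_boolean("Yes"))                   # True
print(to_boolean(0))                       # False
print(to_boolean("maybe", default=False))  # False
```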
- classmethod extract_id(record: NormalizedRecordType, field: str = 'record_id', strip_prefix: str | Pattern | None = None) str | None[source]
Extracts and coerces the ID from the current record into a string.
- Parameters:
record (NormalizedRecordType) – A normalized record dictionary before or after post-processing
field (str) – The IdType to filter for (e.g., ‘arxiv_id’, ‘pmid’, ‘mag_id’)
strip_prefix (Optional[str | re.Pattern]) – An optional prefix to remove from the identifier (e.g., ‘PMC’ for PMC IDs)
- Returns:
The record ID as a string, or None if not available
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> AcademicFieldMap.extract_id({"record_id": 12345678})
'12345678'
>>> AcademicFieldMap.extract_id({"record_id": "mock_id:123"})
'mock_id:123'
- classmethod extract_iso_date(record: NormalizedRecordType, field: str = 'date_created') str | None[source]
Extracts and formats a date from a dictionary or strings in ISO format (%Y-%m-%d).
- Parameters:
record (NormalizedRecordType) – A normalized record having a date_created or similar field to extract an ISO date from. Note: Users can extract an ISO date from a nested dictionary field if it is formatted with year, month, or day keys. If the field is a string, this method will instead attempt to parse it as an ISO timestamp. If the field is a datetime or date, the object will be formatted directly.
field (str) – The name of the field containing date information to extract.
- Returns:
An ISO formatted date string (YYYY-MM-DD, YYYY-MM, or YYYY) or None.
- Return type:
Optional[str]
Examples
PubDate with Year=’2025’, Month=’Dec’, Day=’19’: Returns ‘2025-12-19’
PubDate with Year=’2025’, Month=’12’: Returns ‘2025-12’
PLOS with timestamp: ‘2016-12-08T00:00:00Z’ Returns ‘2016-12-08’
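The three input shapes listed above (a Year/Month/Day dictionary, an ISO timestamp string, and a date object) can be handled as in the following sketch. The name to_iso_date and the month-abbreviation table are illustrative assumptions; the library's own parsing rules may differ.

```python
from datetime import date, datetime
from typing import Any, Optional

# Assumption: month names are matched on their first three letters.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def to_iso_date(value: Any) -> Optional[str]:
    """Format a date-like value as YYYY-MM-DD, YYYY-MM, or YYYY."""
    if isinstance(value, datetime):  # checked first: datetime subclasses date
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value.replace("Z", "+00:00")).date().isoformat()
        except ValueError:
            return None
    if isinstance(value, dict):
        year = value.get("Year")
        if not year:
            return None
        parts = [str(year)]
        month_text = str(value.get("Month") or "").strip()
        number = MONTHS.get(month_text[:3].lower())
        if number is None and month_text.isdigit():
            number = int(month_text)
        if number is None or not 1 <= number <= 12:
            return parts[0]  # only the year is usable
        parts.append(f"{number:02d}")
        day = str(value.get("Day") or "")
        if day.isdigit():
            parts.append(f"{int(day):02d}")
        return "-".join(parts)
    return None


print(to_iso_date({"Year": "2025", "Month": "Dec", "Day": "19"}))  # 2025-12-19
print(to_iso_date({"Year": "2025", "Month": "12"}))                # 2025-12
print(to_iso_date("2016-12-08T00:00:00Z"))                         # 2016-12-08
```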
- classmethod extract_journal(record: NormalizedRecordType, field: str = 'journal') str | None[source]
Extracts the publication journal title or a list of journal titles as a semicolon delimited string.
- Parameters:
record (NormalizedRecordType) – The normalized record dictionary to extract the journal field from.
field (str) – The field to extract the journal from.
- Returns:
The journal or journals of publication, joined by a semicolon, or None if not available.
- Return type:
Optional[str]
Examples
>>> AcademicFieldMap.extract_journal({"journal": "Nature"})
# OUTPUT: 'Nature'
>>> AcademicFieldMap.extract_journal({"journal": ["Nature", "Science"]})
# OUTPUT: 'Nature; Science'
>>> AcademicFieldMap.extract_journal({"journal": ["Nature", "", None, "Science"]})
# OUTPUT: 'Nature; Science'
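The flattening shown in these examples amounts to filtering out empty or non-string entries and joining the rest. A standalone sketch, with the hypothetical name join_journals, might look like this:

```python
from typing import Any, Optional


def join_journals(value: Any) -> Optional[str]:
    """Return one journal title, or several joined by '; ', skipping empty entries."""
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, list):
        titles = [v.strip() for v in value if isinstance(v, str) and v.strip()]
        return "; ".join(titles) or None
    return None


print(join_journals(["Nature", "", None, "Science"]))  # Nature; Science
```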
- classmethod extract_url(record: NormalizedRecordType, *paths: list[str | int] | str, pattern_delimiter: str | Pattern | None = re.compile('; *(?=http)|, *(?=http)|\\| *(?=http)'), delimiter_prefix: str | None = None, delimiter_suffix: str | None = '(?=http)') str | None[source]
Helper function for extracting a single, primary URL from record based on the path taken to traverse the URL.
- Parameters:
record (NormalizedRecordType) – The record dictionary to extract the URL from.
*paths – Arbitrary positional path arguments leading to a single URL or list of URLs. Each path can be a string or list of keys representing the path needed to find a URL in a nested record. Defaults to the tuple (‘url’, ) if not provided, defaulting to a basic url lookup.
pattern_delimiter (str | Pattern) – Regex pattern used to split URL strings. The default splits on semicolons, commas, or pipes (each optionally followed by spaces) that directly precede http. When a string is provided, a positive lookahead (?=http) is automatically appended to the delimiter to prevent splitting URLs mid-domain. Set to None to disable splitting. Note that if a re.Pattern object is provided, it will be used as is without transformation.
delimiter_prefix (str) – An optional string prepended to each element within a pattern. This prefix is None by default but can be used to identify URLs that directly follow a specific pattern.
delimiter_suffix (str) – An optional string appended as a suffix to each element within a pattern. This suffix is used to identify http schemes (typically associated with URLs) that may directly follow a delimiter.
- Returns:
The first value found at any of the specified paths. Commonly a string URL, but could be any type depending on the data structure. Returns None if not found.
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> # Semicolon-delimited URLs (common in CrossRef, Springer)
>>> record = {"url": "http://example.com; http://backup.com"}
>>> AcademicFieldMap.extract_url(record)
# OUTPUT: 'http://example.com'
>>> record = {"url": [{"value": "http://example.com"}]}
>>> AcademicFieldMap.extract_url(record, ["url", 0, "value"], ["url", 0])
# OUTPUT: 'http://example.com'
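To see why the lookahead matters, the sketch below splits a string with the same pattern that appears as the default in the signature above. Only the delimiter step is shown; first_url is a hypothetical helper and the path traversal is omitted.

```python
import re
from typing import Optional

# Same alternation as the documented default: split on ';', ',' or '|'
# only when the delimiter is directly followed by 'http'.
URL_DELIMITER = re.compile(r"; *(?=http)|, *(?=http)|\| *(?=http)")


def first_url(text: str) -> Optional[str]:
    """Return the first URL from a delimited string, or None if the string is empty."""
    parts = [part.strip() for part in URL_DELIMITER.split(text) if part.strip()]
    return parts[0] if parts else None


print(first_url("http://example.com; http://backup.com"))
# A semicolon inside a URL is not followed by 'http', so it is not a split point.
print(first_url("http://example.com/a;b=1, http://backup.com"))
```

The lookahead consumes nothing, so the http scheme stays attached to the second URL after the split.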
- classmethod extract_url_id(record: NormalizedRecordType, field: str = 'record_id', strip_prefix: str | Pattern | None = None) str | None[source]
Extracts an ID from the URL of the current record, removing a URL prefix when specified.
- Parameters:
record (NormalizedRecordType) – The record containing the URL ID to extract
field (str) – The field containing the ID (with or without a prefix)
strip_prefix (Optional[str | re.Pattern]) – The prefix or regex pattern to optionally remove from the URL
- Returns:
The ID after field extraction and the removal of the string prefix, if provided. If the record field doesn't exist, None is returned instead.
- Return type:
Optional[str]
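The strip_prefix parameter accepts either a plain string or a compiled pattern. A standalone sketch of that dual handling follows; strip_id_prefix is a hypothetical name, and the example identifiers are invented for illustration.

```python
import re
from typing import Any, Optional, Union


def strip_id_prefix(value: Any, strip_prefix: Union[str, re.Pattern, None] = None) -> Optional[str]:
    """Coerce an identifier to a string and remove an optional leading prefix."""
    if value is None:
        return None
    text = str(value)
    if strip_prefix is None:
        return text
    if isinstance(strip_prefix, str):
        # A plain string is treated literally and removed only at the start.
        return text[len(strip_prefix):] if text.startswith(strip_prefix) else text
    # A compiled pattern is applied once, as provided.
    return strip_prefix.sub("", text, count=1)


print(strip_id_prefix(12345678))             # 12345678
print(strip_id_prefix("PMC7654321", "PMC"))  # 7654321
print(strip_id_prefix("https://example.org/W123", re.compile(r"^https?://[^/]+/")))  # W123
```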
- classmethod extract_year(record: NormalizedRecordType, field: str = 'year') int | None[source]
Extracts the year of publication or record creation from the manuscript/record.
- Parameters:
record (NormalizedRecordType) – Normalized record dictionary
field (str) – The field to extract the year of publication or record creation from.
- Returns:
The year as an integer, or None if not extractable.
- Return type:
Optional[int]
Examples
>>> AcademicFieldMap.extract_year({"year": "2024-06-15"})
2024
>>> AcademicFieldMap.extract_year({"year": 2024})
2024
>>> AcademicFieldMap.extract_year({"year": None})
None
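One plausible way to obtain these results is to accept integers directly and search strings for a four-digit group. The sketch below assumes that approach; to_year is a hypothetical name and the regex is not taken from the library.

```python
import re
from typing import Any, Optional


def to_year(value: Any) -> Optional[int]:
    """Return a year as an integer from an int or a date-like string."""
    if isinstance(value, bool):
        return None  # bool is a subclass of int, so exclude it explicitly
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        match = re.search(r"\b(\d{4})\b", value)
        return int(match.group(1)) if match else None
    return None


print(to_year("2024-06-15"))  # 2024
print(to_year(2024))          # 2024
print(to_year(None))          # None
```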
- full_text: list[str] | str | None
- is_retracted: list[str] | str | None
- journal: list[str] | str | None
- keywords: list[str] | str | None
- language: list[str] | str | None
- license: list[str] | str | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- classmethod normalize_doi(record: NormalizedRecordType, field: str = 'doi') str | None[source]
Normalizes DOI by stripping the https://doi.org/ prefix.
- Parameters:
record (NormalizedRecordType) – Normalized record containing the ‘doi’ field to extract.
field (str) – The field to extract the record doi from.
- Returns:
Cleaned DOI string without URL prefix, or None if invalid
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'doi': 'https://doi.org/10.1234/example'}
>>> AcademicFieldMap.normalize_doi(record)
# OUTPUT: '10.1234/example'
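Prefix removal of this kind can be sketched with a single anchored regex. The documentation only mentions the https://doi.org/ prefix; accepting http and the dx.doi.org host as well is an assumption of this sketch, and strip_doi_prefix is a hypothetical name.

```python
import re
from typing import Any, Optional

# Assumption: http and the legacy dx.doi.org host are also treated as prefixes.
DOI_PREFIX = re.compile(r"^https?://(dx\.)?doi\.org/", re.IGNORECASE)


def strip_doi_prefix(value: Any) -> Optional[str]:
    """Return a bare DOI without a resolver URL prefix, or None if unusable."""
    if not isinstance(value, str):
        return None
    doi = DOI_PREFIX.sub("", value.strip())
    return doi or None


print(strip_doi_prefix("https://doi.org/10.1234/example"))  # 10.1234/example
```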
- open_access: list[str] | str | None
- provider_name: str
- publisher: list[str] | str | None
- classmethod reconstruct_url(id: str | None, url: str) str | None[source]
Reconstruct an article URL from the ID of the article.
Useful for PLOS and PubMed URL reconstruction.
- Parameters:
id (Optional[str]) – The ID/DOI identifier (e.g., “10.1371/journal.pone.0123456”)
url (str) – The URL prefix (e.g. f”https://journals.plos.org/plosone/article?id=”)
- Returns:
Reconstructed URL if ID is valid, None otherwise.
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> AcademicFieldMap.reconstruct_url(
...     id="10.1371/journal.pone.0123456",
...     url=f"https://journals.plos.org/plosone/article?id="
... )
# OUTPUT: 'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123456'
>>> AcademicFieldMap.reconstruct_url(None, '')
# OUTPUT: None
>>> AcademicFieldMap.reconstruct_url("", None)
# OUTPUT: None
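The behavior in these examples is consistent with simple concatenation guarded by a check that both parts are non-empty. A sketch under that assumption, using the hypothetical name rebuild_url:

```python
from typing import Optional


def rebuild_url(id: Optional[str], url: Optional[str]) -> Optional[str]:
    """Append an identifier to a URL prefix, or return None if either part is missing."""
    if not id or not url:
        return None
    return f"{url}{id}"


print(rebuild_url("10.1371/journal.pone.0123456", "https://journals.plos.org/plosone/article?id="))
```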
- record_id: list[str] | str | None
- record_type: list[str] | str | None
- subjects: list[str] | str | None
- title: list[str] | str | None
- url: list[str] | str | None
- year: list[str] | str | None
scholar_flux.api.normalization.arxiv_field_map module
The scholar_flux.api.normalization.arxiv_field_map.py module defines the normalization mappings used for arXiv.
- class scholar_flux.api.normalization.arxiv_field_map.ArXivFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
ArXiv specific field mapping with custom transformations.
The ArXivFieldMap implements a minimal set of methods for record normalization to finalize the structure of each extracted and normalized record during postprocessing.
- Post-Processed Fields:
arXiv, DOI, and record identifiers
Year extraction from ISO date strings
PDF URL extraction from link arrays
Open access status (always true for arXiv)
Subject and category normalization
Note
arXiv records use unique identifier and link structures. The field map configuration and post-processing logic handle these for consistent output.
- classmethod extract_pdf_url(record: NormalizedRecordType, field: str = 'url_list') str | None[source]
Extracts a valid PDF URL from the array of URLs corresponding to the record.
- Parameters:
record (NormalizedRecordType) – A normalized arXiv record dictionary to extract a PDF URL from.
field (str) – The field to extract the list of URLs from.
- Returns:
A validated PDF URL if available, otherwise None.
- Return type:
Optional[str]
- classmethod extract_record_type(record: NormalizedRecordType, journal_field: str = 'journal', comment_field: str = 'comment') str[source]
Infers a record type from the journal or comment field for an arXiv record using pattern matching.
The record type is inferred using predefined patterns to determine whether a record is a journal article, a book chapter, preprint, etc.
The possible types are defined within scholar_flux.api.normalization.arxiv_field_map as a RECORD_TYPE_PATTERNS dictionary that maps patterns to record types.
- Parameters:
record (NormalizedRecordType) – A normalized arXiv record used to infer the record type.
journal_field (str) – The journal field used to infer the record type.
comment_field (str) – The comment field used to infer the record type if a journal field does not exist.
- Returns:
When journal_field is available, the record type associated with the matched pattern is returned. If no patterns match, then journal-article is returned instead. When journal_field is not available, a record type is extracted by pattern matching against the comment_field. If no match is found, preprint is returned.
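The fallback rules described above can be sketched as follows. The contents of RECORD_TYPE_PATTERNS here are invented for illustration and are not the patterns defined in the module; infer_record_type is likewise a hypothetical name.

```python
import re
from typing import Optional

# Hypothetical patterns; the real dictionary lives in arxiv_field_map.
RECORD_TYPE_PATTERNS = {
    r"\b(proceedings|conference|workshop)\b": "proceedings-article",
    r"\b(book|chapter)\b": "book-chapter",
    r"\b(thesis|dissertation)\b": "dissertation",
}


def infer_record_type(journal: Optional[str], comment: Optional[str]) -> str:
    """Match the journal field first; fall back to the comment field if it is absent."""
    if journal:
        text, fallback = journal, "journal-article"
    else:
        text, fallback = comment or "", "preprint"
    for pattern, record_type in RECORD_TYPE_PATTERNS.items():
        if re.search(pattern, text, re.IGNORECASE):
            return record_type
    return fallback


print(infer_record_type("Phys. Rev. Lett. 120", None))    # journal-article
print(infer_record_type(None, "PhD thesis, 120 pages"))   # dissertation
print(infer_record_type(None, "12 pages, 3 figures"))     # preprint
```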
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
scholar_flux.api.normalization.base_field_map module
The scholar_flux.api.normalization.base_field_map defines a data model for normalizing API response records.
This implementation is to be used as the basis of the normalization of fields that often greatly differ in naming convention and structure across different API implementations. Future subclasses can directly specify expected fields and processing requirements to normalize the full range of processed records and generate a common set of named fields that unifies API-specific record specifications into a common structure.
- class scholar_flux.api.normalization.base_field_map.BaseFieldMap(*, provider_name: str, api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>)[source]
Bases: BaseModel
The BaseFieldMap is used to normalize the names of fields consistently across providers.
This class provides a minimal implementation for mapping API-specific fields from a non-nested dictionary record to a common record key. It is intended to be subclassed and customized for different APIs.
Instances of this class can be called directly to normalize a single record or multiple records, depending on the input. Direct calls to instances are handled by .apply() under the hood.
- normalize_record
Normalizes a single dictionary record
- normalize_records
Normalizes a list of dictionary records
- apply
Returns either a single normalized record or a list of normalized records matching the input.
- structure
Displays a string representation of the current BaseFieldMap instance
- provider_name
A default provider name to be assigned for all normalized records. If not provided, the field map will try to find the provider name from within each record.
- Type:
str
- api_specific_fields
Defines a dictionary of normalized field names (keys) to map to the names of fields within each dictionary record (values)
- Type:
dict[str, Any]
- default_field_values
Indicates values that should be assigned if a field cannot be found within a record.
- Type:
dict[str, Any]
- api_specific_fields: dict[str, Any]
- apply(records: RecordType) NormalizedRecordType[source]
- apply(records: RecordList) NormalizedRecordList
Normalizes a record or list of records by mapping API-specific field names to common fields.
- Parameters:
records (RecordType | RecordList) – A single dictionary record or a list of dictionary records to normalize.
- Returns:
NormalizedRecordType: A single normalized dictionary, returned if a single record is provided.
NormalizedRecordList: A list of normalized dictionaries, returned if a list of records is provided.
- Return type:
NormalizedRecordType | NormalizedRecordList
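The input-matching behavior of apply can be pictured with a small standalone sketch. It handles flat (non-nested) records only, and the function names and the defaults argument are illustrative rather than the library's signatures.

```python
from typing import Any, Optional, Union

Record = dict[str, Any]


def normalize_record(record: Record, field_map: dict[str, str], defaults: Optional[Record] = None) -> Record:
    """Rename provider-specific keys to common names, filling in defaults when missing."""
    if not isinstance(record, dict):
        raise TypeError(f"Expected a dictionary record, received {type(record).__name__}")
    defaults = defaults or {}
    return {name: record.get(source, defaults.get(name)) for name, source in field_map.items()}


def apply(
    records: Union[Record, list[Record]], field_map: dict[str, str], defaults: Optional[Record] = None
) -> Union[Record, list[Record]]:
    """Return a single normalized record for a dict input, or a list for a list input."""
    if isinstance(records, list):
        return [normalize_record(record, field_map, defaults) for record in records]
    return normalize_record(records, field_map, defaults)


field_map = {"title": "article_title", "record_id": "ID"}
single = apply({"article_title": "Decomposition of Political Tactics", "ID": 196}, field_map)
many = apply([{"article_title": "A", "ID": 1}, {"ID": 2}], field_map, defaults={"title": "Untitled"})
print(single)
print(many)
```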
- property core_fields: dict[str, Any]
Returns a dictionary of all core fields in the current FieldMap (excluding all API-specific fields).
- default_field_values: dict[str, Any]
- property fields: dict[str, Any]
Returns a representation of the current FieldMap as a dictionary.
- filter_api_specific_fields(record: NormalizedRecordType, keep_api_specific_fields: bool | Sequence[str] | set[str] | None = None) dict[str, Any][source]
Filters API Specific parameters from the processed record.
- Parameters:
record (NormalizedRecordType) – The current record to filter API-specific fields from.
keep_api_specific_fields (Optional[bool | Sequence[str] | set[str]]) – Either a boolean indicating whether to keep all API-specific fields (True/None) or to remove them after the completion of normalization (False). This parameter can also be a sequence/set of specific field names to keep.
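The three accepted forms of keep_api_specific_fields (True or None, False, or a collection of names) can be sketched as below. The function filter_fields and its api_specific_names argument are hypothetical; in the library the API-specific names come from the field map itself.

```python
from collections.abc import Iterable
from typing import Any, Union


def filter_fields(
    record: dict[str, Any],
    api_specific_names: set[str],
    keep: Union[bool, Iterable[str], None] = None,
) -> dict[str, Any]:
    """Keep all API-specific fields, none of them, or only the named ones."""
    if keep is None or keep is True:
        return dict(record)
    keep_names = set() if keep is False else set(keep)
    return {
        key: value
        for key, value in record.items()
        if key not in api_specific_names or key in keep_names
    }


record = {"title": "A title", "pmid": "123", "mag_id": "456"}
api_specific = {"pmid", "mag_id"}
print(filter_fields(record, api_specific, keep=False))     # {'title': 'A title'}
print(filter_fields(record, api_specific, keep=["pmid"]))  # {'title': 'A title', 'pmid': '123'}
```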
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- normalize_record(record: dict, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordType[source]
Maps API-specific fields in a single dictionary record to a normalized set of field names.
- Parameters:
record (dict) – The single, dictionary-typed record to normalize.
keep_api_specific_fields (Optional[bool | Sequence[str]]) – A boolean indicating whether to keep or remove all API-specific fields or a sequence indicating which API-specific fields to keep.
- Returns:
A new dictionary with normalized field names.
- Return type:
NormalizedRecordType
- Raises:
TypeError – If the input to record is not a mapping or dictionary object.
- normalize_records(records: RecordType | RecordList, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordList[source]
Maps API-specific fields in one or more records to a normalized set of field names.
- Parameters:
records (dict | RecordType | RecordList) – A single dictionary record or a list of dictionary records.
keep_api_specific_fields (Optional[bool | Sequence[str]]) – A boolean indicating whether to keep or remove all API-specific fields or a sequence indicating which API-specific fields to keep.
- Returns:
A list of dictionaries with normalized field names.
- Return type:
NormalizedRecordList
- provider_name: str
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method that shows the current structure of the BaseFieldMap.
- Parameters:
flatten (bool) – Whether to flatten the current field map's structural representation into a single line (Default = False)
show_value_attributes (bool) – Whether to show nested attributes of the base field map or subclass (Default = True)
- Returns:
A structural representation of the current field map as a string. Use a print statement to view it.
- Return type:
str
scholar_flux.api.normalization.core_field_map module
The scholar_flux.api.normalization.core_field_map.py module defines the normalization mappings used for the CORE API.
- class scholar_flux.api.normalization.core_field_map.CoreFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
Core specific field mappings with custom transformations.
The Core API provides open access scholarly content aggregated from thousands of repositories worldwide.
The CoreFieldMap implements several methods for record normalization and the extraction of record fields and cross platform IDs. The post-processing step finalizes the structure of each normalized record to consistently prepare and post-process records retrieved from the CORE API.
- Post-Processed Fields:
Year extraction from various date formats
Journal list flattening (Core can return multiple journal titles)
Record ID coercion to string format
Open access default (Core sources are generally all open access)
Cross-reference identifier extraction (arXiv, PubMed, MAG IDs)
Multi-identifier normalization for entity resolution
- classmethod extract_arxiv_id(record: NormalizedRecordType) str | None[source]
Extracts the arXiv identifier for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary.
- Returns:
The arXiv ID (e.g., ‘1012.4340’) or None if not available.
- Return type:
Optional[str]
- classmethod extract_mag_id(record: NormalizedRecordType) str | None[source]
Extracts the Microsoft Academic Graph identifier for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary
- Returns:
The MAG ID or None if not available
- Return type:
Optional[str]
Examples
>>> CoreFieldMap.extract_mag_id({"mag_id": "2056403249"})
'2056403249'
>>> CoreFieldMap.extract_mag_id({"mag_id": "None"})
None
- classmethod extract_oai_ids(record: NormalizedRecordType) list[str] | None[source]
Extracts the OAI identifiers for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary
- Returns:
The OAI IDs for the current record as a list or None if not available
- Return type:
Optional[list[str]]
- classmethod extract_pmid(record: NormalizedRecordType) str | None[source]
Extracts the PubMed identifier for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary
- Returns:
The PubMed ID or None if not available
- Return type:
Optional[str]
Examples
>>> CoreFieldMap.extract_pmid({"pmid": "12345678"})
'12345678'
>>> CoreFieldMap.extract_pmid({"pmid": "None"})
None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
scholar_flux.api.normalization.crossref_field_map module
The scholar_flux.api.normalization.crossref_field_map.py module defines the normalization mappings for Crossref.
- class scholar_flux.api.normalization.crossref_field_map.CrossrefFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases:
AcademicFieldMap
Crossref-specific field mapping with custom transformations.
The CrossrefFieldMap implements a minimal set of methods for field extraction and abstract HTML tag removal, preparing and finalizing the structure of each normalized record in the post-processing step.
- Post-Processed Fields:
DOI, URL, and record identifiers
Year and date extraction from nested date fields
Author name formatting
Open access status resolution from license URLs
Journal extraction
Abstract retrieval and HTML tag removal
Retraction status detection
Note
Crossref records may contain nested lists and multiple date fields. The field map configuration and post-processing logic handle these variations and normalize the output.
- classmethod check_retraction(record: NormalizedRecordType, field: str = 'updated_by_list', pattern: str | Pattern | None = None) bool | None[source]
Checks if the record is a retraction notice.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to check for retraction updates.
pattern (Optional[str | Pattern]) – An optional string or compiled pattern used to verify retraction status.
- Returns:
True if the paper has been retracted, None if the status is unknown.
- Return type:
Optional[bool]
Note
┌─────────────────────┐       updated-by       ┌─────────────────────┐
│ Retracted Paper     │ ◄───────────────────── │ Retraction Notice   │
│ (original article)  │ ─────────────────────► │ (update record)     │
└─────────────────────┘       update-to        └─────────────────────┘
Crossref’s update-to field is on the retraction NOTICE, pointing to the retracted paper. The retracted paper itself might instead contain an updated-by field indicating that the paper has been retracted.
When retraction status can’t be determined for certain due to a lack of information, retraction can be verified with the following steps:
Sending a separate crossref search with the filter=’update-type:retraction’ API-specific parameter
Checking the https://gitlab.com/crossref/retraction-watch-data repo (updated daily)
Source: https://www.crossref.org/documentation/retrieve-metadata/retraction-watch/ (2026)
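The relationship described in this note can be sketched as a scan over a record's update entries. This is a minimal illustration and not the library's implementation: the helper name `looks_retracted`, the assumed entry shape (a list of dictionaries with a `type` key), and the default pattern are all hypothetical.

```python
import re
from typing import Any, Optional


def looks_retracted(
    record: dict[str, Any],
    field: str = "updated_by_list",
    pattern: str = r"retract",
) -> Optional[bool]:
    """Hypothetical helper: return True when any update entry matches `pattern`.

    Returns None when the field is missing or no entry matches, because the
    absence of an update does not prove that the paper was never retracted.
    """
    updates = record.get(field)
    if not isinstance(updates, list):
        return None
    for update in updates:
        # Assumed entry shape: {"type": "retraction", ...}
        update_type = update.get("type") if isinstance(update, dict) else None
        if isinstance(update_type, str) and re.search(pattern, update_type, re.IGNORECASE):
            return True
    return None
```

Returning None rather than False for non-matching records mirrors the "status unknown" case described above, where a follow-up query or the Retraction Watch data would be needed.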
- classmethod extract_authors(record: NormalizedRecordType, field: str = 'author_list') list[str] | None[source]
Extracts formatted author names by combining GivenName and LastName.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract the nested list of authors from.
- Returns:
List of author names in ‘ForeName LastName’ format, or None if no authors.
- Return type:
Optional[list[str]]
Note
Returns None for organizational records (datasets, reports) where Crossref does not provide individual authors. Check the ‘publisher’ or ‘institution’ fields for organizational attribution.
- classmethod extract_date_parts(record: NormalizedRecordType, field: str = 'date_published') str | None[source]
Extracts the publication date or date_created for the current record.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract a Crossref date field from.
- Returns:
ISO formatted date string (YYYY-MM-DD) or None.
- Return type:
Optional[str]
Note: This class method handles Crossref’s nested date structure, consistently converting date fields in [[Year, Month, Day]] format to %Y-%m-%d format.
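A conversion of this kind might look like the sketch below. It is an assumption-laden illustration rather than the method's actual code: the helper name `date_parts_to_iso` is hypothetical, and defaulting a missing month or day to 1 is a choice made only for this example.

```python
from datetime import date
from typing import Any, Optional


def date_parts_to_iso(date_parts: Any) -> Optional[str]:
    """Hypothetical helper: convert [[year, month, day]] into 'YYYY-MM-DD'.

    Missing month or day values default to 1 (an assumption of this sketch).
    """
    if not isinstance(date_parts, list) or not date_parts:
        return None
    first = date_parts[0]
    if not isinstance(first, list) or not first:
        return None
    # Pad partial dates such as [2021] or [2021, 5] up to three components.
    year, month, day = (list(first) + [1, 1])[:3]
    try:
        # datetime.date validates the components (e.g. rejects month 13).
        return date(int(year), int(month), int(day)).isoformat()
    except (TypeError, ValueError):
        return None
```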
- classmethod extract_title(record: NormalizedRecordType, field: str = 'title') str | None[source]
Extracts the record title as a string, joining multiple titles when the field holds a (possibly nested) list of titles.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract the title from.
- Returns:
The title or delimited set of titles associated with the record, joined by a semicolon.
- Return type:
Optional[str]
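The joining behaviour can be illustrated with a small recursive helper. The name `join_titles` and the exact delimiter ("; " with a trailing space) are assumptions for this sketch and are not taken from the library.

```python
from typing import Any, Optional


def join_titles(value: Any, delimiter: str = "; ") -> Optional[str]:
    """Hypothetical helper: flatten a title or a (nested) list of titles into one string."""
    if isinstance(value, str):
        return value.strip() or None
    if isinstance(value, list):
        # Recurse into nested lists and drop empty or non-string entries.
        parts = [join_titles(item, delimiter) for item in value]
        kept = [part for part in parts if part]
        return delimiter.join(kept) or None
    return None
```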
- classmethod extract_year(record: NormalizedRecordType, field: str = 'year') int | None[source]
Extracts the year of publication or creation.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field name to extract the year from.
- Returns:
The year of record publication or creation extracted as an integer.
- Return type:
Optional[int]
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- classmethod resolve_open_access(record: NormalizedRecordType, field: str = 'license') bool | None[source]
Resolves the Open Access Status from known license URLs.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract license URLs from.
- Returns:
True if open access, False if restricted, None if indeterminate.
- Return type:
Optional[bool]
scholar_flux.api.normalization.normalizing_field_map module
The scholar_flux.api.normalization.normalizing_field_map implements the NormalizingFieldMap for complex record normalization scenarios.
This class builds on the BaseFieldMap, using a NormalizingDataProcessor to handle nested field traversal and scenarios where fields may be differently named in different records from the same API provider. The NormalizingFieldMap can be subclassed with specialized field names for validated normalization.
- class scholar_flux.api.normalization.normalizing_field_map.NormalizingFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>)[source]
Bases:
BaseFieldMap
A field map implementation that builds upon the original BaseFieldMap to recursively find and retrieve nested JSON elements from records with automated index processing and path-guided traversal.
During normalization, the NormalizingFieldMap.fields property returns all subclassed field mappings as a flattened dictionary (excluding private fields prefixed with underscores). Both simple and nested API-specific field names are matched and mapped to universal field names.
Any changes to the instance configuration are automatically detected during normalization by comparing the _cached_fields to the updated fields property.
Examples
>>> from scholar_flux.api.normalization.normalizing_field_map import NormalizingFieldMap
>>> field_map = NormalizingFieldMap(provider_name=None, api_specific_fields=dict(title='article_title', record_id='ID'))
>>> expected_result = field_map.fields | {'provider_name': 'core', 'title': 'Decomposition of Political Tactics', 'record_id': 196}
>>> result = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> cached_fields = field_map._cached_fields
>>> print(result == expected_result)
>>> result2 = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> assert cached_fields is field_map._cached_fields
>>> assert result is not result2
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- normalize_record(record: RecordType, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordType[source]
Maps API-specific fields in dictionaries of processed records to a normalized set of field names.
- normalize_records(records: RecordType | RecordList, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordList[source]
Maps API-specific fields within a processed record list to create a new, normalized record list.
- property processor: NormalizingDataProcessor
Generates a NormalizingDataProcessor using the current set of assigned field names.
Note that if a processor does not already exist or if the schema is changed, the data processor is recreated with the updated set of fields.
- provider_name: str
scholar_flux.api.normalization.open_alex_field_map module
The scholar_flux.api.normalization.open_alex_field_map.py module defines the normalization mappings for OpenAlex.
- class scholar_flux.api.normalization.open_alex_field_map.OpenAlexFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases:
AcademicFieldMap
OpenAlex-specific field mapping with custom transformations.
The OpenAlexFieldMap implements a minimal set of methods for field extraction and abstract reconstruction, finalizing the structure of each normalized record in the post-processing step.
- Post-Processed Fields:
Abstract reconstruction from inverted index format
DOI normalization (stripping URL prefix)
PMID extraction from ids object
Author list cleanup (filter empty entries)
- classmethod extract_open_access(record: NormalizedRecordType, field: str = 'open_access') bool | None[source]
Extracts the open access status from the OpenAlex record as a boolean field.
The value returned can be True or False, indicating whether the full text of the record is freely accessible to the public, or None if the field is missing or status cannot be determined from the field.
- Parameters:
record (NormalizedRecordType) – The Normalized OpenAlex record dictionary.
field (str) – The field to extract the open access status from.
- Returns:
True if the record is open access (e.g., arXiv, CORE, PubMed Central, CC-BY license).
False if the record is not open access (e.g., subscription, restricted, or fee-based access).
None if the status cannot be determined from the available metadata.
- Return type:
Optional[bool]
Note
- How OpenAlex determines open access status is explained here:
https://help.openalex.org/hc/en-us/articles/24347035046295-Open-Access-OA
- classmethod extract_pmid(record: NormalizedRecordType, field: str = 'pmid') str | None[source]
Extracts PubMed ID from the ids object.
- Parameters:
record (NormalizedRecordType) – Normalized OpenAlex record dictionary.
field (str) – The field to extract the PMID from.
- Returns:
PMID string without URL prefix, or None if not found.
- Return type:
Optional[str]
Examples
>>> record = {'pmid': 'https://pubmed.ncbi.nlm.nih.gov/29241234'}
>>> OpenAlexFieldMap.extract_pmid(record)
'29241234'
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- classmethod reconstruct_abstract(record: NormalizedRecordType, field: str = 'abstract_inverted_index') str | None[source]
Reconstructs abstract text from OpenAlex inverted index format.
OpenAlex stores abstracts as inverted indexes where keys are words and values are arrays of positions where those words appear.
- Parameters:
record (NormalizedRecordType) – Normalized OpenAlex record dictionary.
field (str) – The field containing the inverted index.
- Returns:
Reconstructed abstract string, or None if not available.
- Return type:
Optional[str]
Examples
>>> record = {'abstract_inverted_index': {'Hello': [0], 'world': [1]}}
>>> OpenAlexFieldMap.reconstruct_abstract(record)
'Hello world'
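The inversion itself can be sketched in a few lines. This is a standalone illustration of the technique and not the library's code; the helper name `rebuild_abstract` is hypothetical.

```python
from typing import Optional


def rebuild_abstract(inverted_index: Optional[dict[str, list[int]]]) -> Optional[str]:
    """Hypothetical helper: rebuild text from a {word: [positions]} inverted index."""
    if not inverted_index:
        return None
    # Pair every position with its word, then order the pairs by position.
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    if not positioned:
        return None
    return " ".join(word for _, word in sorted(positioned))
```

A word that occurs several times simply contributes one (position, word) pair per occurrence, so repeated words are restored in their original places.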
scholar_flux.api.normalization.plos_field_map module
The scholar_flux.api.normalization.plos_field_map.py module defines the normalization mappings for the PLOS API.
- class scholar_flux.api.normalization.plos_field_map.PLOSFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases:
AcademicFieldMap
PLOS-specific field mapping with custom transformations.
The PLOSFieldMap defines a minimal set of PLOS-specific post-processing steps to further process dictionary records after normalization.
- Post-Processed Fields:
DOI and record identifiers
Year extraction from publication date
URL reconstruction from DOI
Author and abstract normalization
Open access and license status
Note
The PLOS API provides most fields directly, but some (such as URLs) are reconstructed from the DOI. The field map configuration handles fallback paths and default values for publisher and open access status.
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- classmethod reconstruct_plos_url(record: NormalizedRecordType, field: str = 'doi') str | None[source]
Reconstructs the PLOS article URL from the DOI of the article.
- Parameters:
record (NormalizedRecordType) – The Normalized record dictionary containing the DOI used to reconstruct the URL.
field (str) – The field to extract the DOI from.
- Returns:
Reconstructed URL if DOI is valid, None otherwise.
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization.plos_field_map import PLOSFieldMap
>>> PLOSFieldMap.reconstruct_plos_url({'doi': "10.1371/journal.pone.0123456"})
# OUTPUT: 'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123456'
>>> PLOSFieldMap.reconstruct_plos_url({})
# OUTPUT: None
scholar_flux.api.normalization.pubmed_efetch_field_map module
The scholar_flux.api.normalization.pubmed_efetch_field_map.py module defines PubMed eFetch normalization mappings.
scholar_flux.api.normalization.pubmed_field_map module
The scholar_flux.api.normalization.pubmed_field_map.py module defines the normalization mappings used for PubMed.
- class scholar_flux.api.normalization.pubmed_field_map.PubMedFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases:
AcademicFieldMap
PubMed-specific field mapping with custom transformations.
The PubMedFieldMap builds on the original AcademicFieldMap to add a minimal PubMed-specific array of post-processing steps that produces final, consistent, normalized record structures across several record types.
- Post-Processed Fields:
PMCID, PMID, and PII identifiers
The date and year of creation or publication
The base URL for the record
Authors (After formatting nested authorship fields)
The DOI for the record
Open Access status
Abstract Retrieval
Note
PubMed’s XML structure varies between Articles and BookDocuments, which is handled via fallback paths in the field_map configuration (e.g., multiple paths for ‘year’ and ‘abstract’).
Article identifiers (DOI, PMCID, PII) are extracted from the ArticleIdList by filtering on the ‘@IdType’ attribute, with additional fallback logic for DOI via ELocationID.
Open access status is determined by the presence of a PMCID, indicating the article is available in PubMed Central.
- classmethod extract_authors(record: NormalizedRecordType, field: str = 'authors') list[str] | None[source]
Extract formatted author names combining ForeName and LastName.
- Parameters:
record (NormalizedRecordType) – Raw PubMed record dictionary
field (str) – The location to extract the nested list of authors from.
- Returns:
List of author names in ‘ForeName LastName’ format, or None if no authors
- Return type:
Optional[list[str]]
Notes
For an author with LastName=’Smith’, ForeName=’John’: Returns [‘John Smith’]
For an author with only LastName=’Smith’: Returns [‘Smith’]
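The two cases in these notes can be reproduced with a short standalone helper. The name `format_authors` and the handling of a single author passed as a bare dictionary are assumptions of this sketch, not confirmed library behaviour.

```python
from typing import Any, Optional


def format_authors(authors: Any) -> Optional[list[str]]:
    """Hypothetical helper: build 'ForeName LastName' strings from author dictionaries."""
    if isinstance(authors, dict):
        # A single author may arrive as a bare dictionary rather than a list (assumption).
        authors = [authors]
    if not isinstance(authors, list):
        return None
    names = []
    for author in authors:
        if not isinstance(author, dict):
            continue
        parts = [author.get("ForeName"), author.get("LastName")]
        # Skip missing or blank name parts so a lone LastName is returned as-is.
        name = " ".join(part.strip() for part in parts if isinstance(part, str) and part.strip())
        if name:
            names.append(name)
    return names or None
```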
- classmethod extract_date_created(record: NormalizedRecordType) str | None[source]
Extract date created or article date.
- Parameters:
record (NormalizedRecordType) – Raw PubMed record dictionary
- Returns:
ISO formatted date string (YYYY-MM-DD) or None
- Return type:
Optional[str]
Notes
Tries date created (MedlineCitation.DateCompleted) first, then falls back to article_date (ArticleDate.DateCompleted)
- classmethod extract_doi(record: NormalizedRecordType) str | None[source]
Extracts the DOI from the ArticleIdList based on the IdType attribute.
Attempts to extract DOI from two sources:
1. ArticleIdList with IdType=’doi’ (primary)
2. ELocationID with EIdType=’doi’ (fallback)
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ and ‘elocation_id’ already extracted
- Returns:
DOI string or None if not found
- Return type:
Optional[str]
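The primary-then-fallback lookup might be sketched as follows. The helper names are hypothetical, and the shape assumed for `elocation_id` (a dictionary or list of dictionaries with `@EIdType` and `#text` keys) is an assumption modelled on the `article_id_list` structure shown in the examples on this page.

```python
from typing import Any, Optional


def _as_list(value: Any) -> list[Any]:
    """Wrap a single item in a list; treat None as an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_id(entries: Any, type_key: str, id_type: str) -> Optional[str]:
    """Return the '#text' of the first entry whose type attribute equals `id_type`."""
    for entry in _as_list(entries):
        if isinstance(entry, dict) and entry.get(type_key) == id_type:
            text = entry.get("#text")
            if isinstance(text, str) and text.strip():
                return text.strip()
    return None


def doi_with_fallback(record: dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: try ArticleIdList first, then ELocationID."""
    id_list = record.get("article_id_list")
    article_ids = id_list.get("ArticleId") if isinstance(id_list, dict) else id_list
    primary = find_id(article_ids, "@IdType", "doi")
    if primary is not None:
        return primary
    # Assumed fallback shape: {"@EIdType": "doi", "#text": "10.xxxx/..."}
    return find_id(record.get("elocation_id"), "@EIdType", "doi")
```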
- classmethod extract_open_access(record: NormalizedRecordType) bool | None[source]
Determines if an article is open access based on PMC ID presence.
The presence of a PMCID indicates the article is available in PubMed Central, which means it is accessible as open access. This is a reliable indicator for PubMed records, though it may not capture all open access articles (e.g., those available only on publisher websites).
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ already extracted
- Returns:
True if PMCID present (open access), False if no PMCID, None if indeterminate
- Return type:
Optional[bool]
Examples
>>> from scholar_flux.api.normalization.pubmed_field_map import PubMedFieldMap
>>> record = {'article_id_list': {'ArticleId': [{'@IdType': 'pmc', '#text': 'PMC123456'}]}}
>>> PubMedFieldMap.extract_open_access(record)
# OUTPUT: True
>>> record = {'article_id_list': {'ArticleId': [{'@IdType': 'doi', '#text': '10.1234/example'}]}}
>>> PubMedFieldMap.extract_open_access(record)
# OUTPUT: False
- classmethod extract_pii(record: NormalizedRecordType) str | None[source]
Extracts the Publisher Item Identifier (PII) from the ArticleIdList.
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ already extracted
- Returns:
PII string or None if not found
- Return type:
Optional[str]
- classmethod extract_pmcid(record: NormalizedRecordType) str | None[source]
Extracts the PMC ID for full-text access from the normalized record.
Returns the PMCID without the ‘PMC’ prefix for consistency. Handles edge cases where stripping the prefix results in an empty string.
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ already extracted
- Returns:
PMC ID without ‘PMC’ prefix, or None if not found or invalid
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization.pubmed_field_map import PubMedFieldMap
>>> record = {'article_id_list': {'ArticleId': [{'@IdType': 'pmc', '#text': 'PMC123456'}]}}
>>> PubMedFieldMap.extract_pmcid(record)
# OUTPUT: '123456'
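The empty-string edge case mentioned above can be shown in isolation. The helper `strip_pmc_prefix` is hypothetical, and treating the prefix case-insensitively is an assumption of this sketch.

```python
import re
from typing import Optional


def strip_pmc_prefix(pmcid: Optional[str]) -> Optional[str]:
    """Hypothetical helper: drop a leading 'PMC' and return None for empty results."""
    if not isinstance(pmcid, str):
        return None
    stripped = re.sub(r"^PMC", "", pmcid.strip(), flags=re.IGNORECASE)
    # A bare 'PMC' leaves an empty string, which is reported as "not found".
    return stripped or None
```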
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- classmethod reconstruct_pubmed_url(record: NormalizedRecordType) str | None[source]
Reconstruct PubMed article URL from the PMID.
- Parameters:
record (NormalizedRecordType) – The record containing the ‘pmid’ field
- Returns:
A reconstructed URL if PMID is valid, None otherwise.
Examples
>>> from scholar_flux.api.normalization.pubmed_field_map import PubMedFieldMap
>>> PubMedFieldMap.reconstruct_pubmed_url({"pmid": "41418093"})
# OUTPUT: 'https://pubmed.ncbi.nlm.nih.gov/41418093/'
>>> PubMedFieldMap.reconstruct_pubmed_url({"pmid": None})
# OUTPUT: None
scholar_flux.api.normalization.springer_nature_field_map module
scholar_flux.api.normalization.springer_nature_field_map.py defines the normalization steps for Springer Nature.
- class scholar_flux.api.normalization.springer_nature_field_map.SpringerNatureFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases:
AcademicFieldMap
Springer Nature-specific field mapping with custom transformations.
The SpringerNatureFieldMap defines a minimal set of Springer Nature-specific post-processing steps that aid in the normalization of preprocessed records retrieved from the Springer Nature API.
- Post-Processed Fields:
DOI and record identifiers
Year extraction from publication date
URL extraction from nested URL objects
Open access status conversion from string to boolean
Author and abstract normalization
Note
Springer Nature records may contain arrays and nested objects for key fields (e.g., year and primary URL). The field map configuration and post-processing logic handle these for consistent normalization.
- classmethod extract_open_access(record: NormalizedRecordType, field: str = 'open_access') bool | None[source]
Extracts the current record’s open access status by delegating processing to .extract_boolean_field().
- Parameters:
record (NormalizedRecordType) – The normalized record to extract the open_access status for
field (str) – The field to extract the open access status from.
- Returns:
The open access status of the record when available, and None otherwise.
- Return type:
Optional[bool]
Examples
>>> from scholar_flux.api.normalization import SpringerNatureFieldMap
>>> record = {"doi": "10.1234/example", "title": "Sample Article", "open_access": "true"}
>>> SpringerNatureFieldMap.extract_open_access(record)
# OUTPUT: True
>>> record = {"doi": "10.5678/example", "title": "Sample Publication", "open_access": "false"}
>>> SpringerNatureFieldMap.extract_open_access(record)
# OUTPUT: False
>>> record = {"title": "Another Article", "open_access": "N/A"}
>>> SpringerNatureFieldMap.extract_open_access(record)
# OUTPUT: None
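A string-to-boolean conversion with a three-valued result could be sketched as below. This is not the body of .extract_boolean_field(); the helper name `to_optional_bool` and the sets of accepted spellings are assumptions chosen only to reproduce the three outcomes in the examples above.

```python
from typing import Any, Optional

TRUE_VALUES = {"true", "yes", "1"}    # assumed spellings of a positive flag
FALSE_VALUES = {"false", "no", "0"}   # assumed spellings of a negative flag


def to_optional_bool(value: Any) -> Optional[bool]:
    """Hypothetical helper: map flag-like values to True/False, anything else to None."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in TRUE_VALUES:
            return True
        if lowered in FALSE_VALUES:
            return False
    # Unrecognized values such as "N/A" stay indeterminate rather than False.
    return None
```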
- classmethod extract_primary_url(record: NormalizedRecordType, field: str = 'url', pattern_delimiter: str | Pattern | None = re.compile('; *(?=http)|, *(?=http)|\\| *(?=http)')) str | None[source]
Extracts the primary (or first valid) URL from a record field with a flat or nested JSON structure.
- Parameters:
record (NormalizedRecordType) – The normalized record to extract the primary URL field from
field (str) – The field to extract a primary URL from
pattern_delimiter (Optional[str | re.Pattern]) – An optional pattern used to separate URLs when combined as a list for primary URL extraction
- Returns:
The extracted primary URL when extraction is successful, and None otherwise.
- Return type:
Optional[str]
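The default pattern_delimiter in the signature splits on ";", "," or "|" only when the next token starts with "http", so commas inside a single URL are left alone. The sketch below applies that pattern; the helper name `first_url` and the assumed nested shape (dictionaries carrying the URL under a `value` key) are hypothetical and not taken from the library.

```python
import re
from typing import Any, Optional

# Delimiter copied from the signature above: the lookahead (?=http) keeps
# separators that are part of a URL from being treated as split points.
URL_DELIMITER = re.compile(r"; *(?=http)|, *(?=http)|\| *(?=http)")


def first_url(value: Any, delimiter: re.Pattern = URL_DELIMITER) -> Optional[str]:
    """Hypothetical helper: return the first http(s) URL in a string, list, or dict."""
    if isinstance(value, str):
        for candidate in delimiter.split(value):
            candidate = candidate.strip()
            if candidate.startswith(("http://", "https://")):
                return candidate
        return None
    if isinstance(value, dict):
        # Assumed nested shape: {"value": "https://..."}
        return first_url(value.get("value"), delimiter)
    if isinstance(value, list):
        for item in value:
            found = first_url(item, delimiter)
            if found:
                return found
    return None
```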
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
Module contents
The scholar_flux.api.normalization module implements API-specific record normalization for downstream analyses.
Because the range of fields retrieved after successful data processing can greatly vary based on API-specific semantics, extensive post-processing is usually required to reliably extract insights from API-derived data.
To solve this issue, this module implements record normalization based on heuristics and provider-specific logic for all default APIs while also supporting the application and extension of normalization logic to new APIs.
- Classes:
- BaseFieldMap:
Base class for mapping API-specific fields from non-nested dictionary records to common field names.
- NormalizingFieldMap:
Extends the BaseFieldMap, adding support for nested field extraction with fallback resolution.
- AcademicFieldMap:
Subclasses the NormalizingFieldMap to tailor its scope to academic field extraction and normalization. Implements several helpers to post-process fields such as DOI, authors, abstract, and publication metadata.
- ArXivFieldMap:
AcademicFieldMap subclass implementing arXiv-specific normalization post-processing steps with PDF URL extraction, ISO date parsing, and record ID formatting.
- CoreFieldMap:
Subclasses the AcademicFieldMap to implement CORE-specific normalization post-processing steps for full-text and repository metadata handling.
- CrossrefFieldMap:
A Crossref-specific AcademicFieldMap subclass that implements normalization with author affiliation and reference parsing. The abstract post-processing step will remove HTML tags from abstracts if the beautifulsoup4 optional dependency is available.
- OpenAlexFieldMap:
Subclasses the AcademicFieldMap to implement OpenAlex-specific normalization post-processing steps for institution and concept extraction. Post-processing also reconstructs abstracts using abstract-inverted indexes when available.
- PLOSFieldMap:
An AcademicFieldMap subclass that implements PLOS-specific normalization post-processing steps involving URL reconstruction, date extraction, and abstract cleanup.
- PubMedFieldMap:
PubMed-specific AcademicFieldMap subclass that implements normalization post-processing steps, including MeSH term extraction, author list formatting, the reconstruction of PubMed URLs from PubMed IDs, etc.
- SpringerNatureFieldMap:
Subclasses the AcademicFieldMap to define Springer Nature-specific normalization post-processing steps to extract the publication date, author list, open access status, and primary URL for each record.
- class scholar_flux.api.normalization.AcademicFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases:
NormalizingFieldMap
Extends the NormalizingFieldMap to customize field extraction and processing for academic record normalization.
This class is used to normalize the names of academic data fields consistently across providers. By default, the AcademicFieldMap includes fields for several attributes of academic records including:
Core identifiers (e.g. doi, url, record_id)
Bibliographic metadata (title, abstract, authors)
Publication metadata (journal, publisher, year, date_published, date_created)
Content and classification (keywords, subjects, full_text)
Metrics and impact (citation_count)
Access and rights (open_access, license)
Document metadata (record_type, language)
All other fields that are relevant to only the current API (api_specific_fields)
During normalization, the AcademicFieldMap.fields property returns all subclassed field mappings as a flattened dictionary (excluding private fields prefixed with underscores). Both simple and nested API-specific field names are matched and mapped to universal field names.
Any changes to the instance configuration are automatically detected during normalization by comparing the _cached_fields to the updated fields property.
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> field_map = AcademicFieldMap(provider_name=None, title='article_title', record_id='ID')
>>> expected_result = field_map.fields | {'provider_name': 'core', 'title': 'Decomposition of Political Tactics', 'record_id': 196}
>>> result = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> cached_fields = field_map._cached_fields
>>> print(result == expected_result)
>>> result2 = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> assert cached_fields is field_map._cached_fields
>>> assert result is not result2
Note
To account for special cases, the AcademicFieldMap can be subclassed to perform two-step normalization to further process extracted elements.
- Phase 1:
The AcademicFieldMap extracts nested fields for each record. This class traverses paths like ‘MedlineCitation.Article.AuthorList.Author’ (PubMed) or ‘authorships.institutions.display_name’ (OpenAlex) to map API-specific fields to universal field names.
- Phase 2 (Subclasses):
Subclasses can reformat extracted data into finalized fields. For example, PubMed prepares the authors field by combining each author’s ‘ForeName’ and ‘LastName’ into ‘FirstName LastName’. PLOS creates the record URL for each article by combining the URL prefix for the website with the DOI of the current record. The AcademicFieldMap defines common (yet optional) class methods to aid in the extraction and processing of normalized fields.
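The two phases described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: get_path and format_authors are hypothetical helpers, and the nested record is made up to mirror the PubMed path mentioned in Phase 1.

```python
from typing import Any, Optional


def get_path(record: dict[str, Any], path: str) -> Any:
    """Phase 1 (hypothetical helper): follow a dot-delimited path through nested dicts and lists."""
    current: Any = record
    for key in path.split("."):
        if isinstance(current, list):
            # Fan out over list elements and collect the key from each dictionary.
            current = [item.get(key) for item in current if isinstance(item, dict)]
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            return None
    return current


def format_authors(authors: Any) -> Optional[list[str]]:
    """Phase 2 (hypothetical helper): join each author's ForeName and LastName."""
    if isinstance(authors, dict):
        authors = [authors]
    if not isinstance(authors, list):
        return None
    names = [
        " ".join(part for part in (author.get("ForeName"), author.get("LastName")) if part)
        for author in authors
        if isinstance(author, dict)
    ]
    return [name for name in names if name] or None


# Made-up record shaped like the PubMed path named in Phase 1.
record = {"MedlineCitation": {"Article": {"AuthorList": {"Author": [
    {"ForeName": "Jane", "LastName": "Doe"},
    {"ForeName": "Evan", "LastName": "Doodle"},
]}}}}

extracted = {"authors": get_path(record, "MedlineCitation.Article.AuthorList.Author")}
normalized = {**extracted, "authors": format_authors(extracted["authors"])}
print(normalized)
```

Keeping extraction and reformatting separate means the path lookup stays generic while provider-specific quirks live in one small post-processing function.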
- abstract: list[str] | str | None
- api_specific_fields: dict[str, Any]
- authors: list[str] | str | None
- citation_count: list[str] | str | None
- date_created: list[str] | str | None
- date_published: list[str] | str | None
- default_field_values: dict[str, Any]
- doi: list[str] | str | None
- classmethod extract_abstract(record: NormalizedRecordType, strip_html: bool = False, field: str = 'abstract', **kwargs: Any) str | None[source]
Extracts and prepares the abstract for the current record.
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘abstract’ already available as a field.
strip_html (bool) – Indicates whether HTML tags should be detected and removed from the abstract.
field (str) – The field where an abstract or text field can be found.
**kwargs – Additional arguments to pass to get_text when stripping HTML elements.
- Returns:
An abstract string or None if not found or not a string/list of strings
- Return type:
Optional[str]
Example
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'abstract': 'Analysis of the Placebo effect on...'}
>>> AcademicFieldMap.extract_abstract(record)
# OUTPUT: 'Analysis of the Placebo effect on...'
>>> record = {'abstract': '<h1>Game theory in the technological industry.</h1><p>This study explores...</p>'}
>>> AcademicFieldMap.extract_abstract(record, strip_html=True, separator=' ')
# OUTPUT: 'Game theory in the technological industry. This study explores...'
- classmethod extract_authors(record: NormalizedRecordType, field: str = 'authors') list[str] | None[source]
Filters and cleans the author names list.
- Parameters:
record (NormalizedRecordType) – Normalized record with an ‘authors’ field.
field (str) – The field to extract the list of authors from.
- Returns:
A list of non-empty author names, or None if empty
- Return type:
Optional[list[str]]
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'authors': 'Evan Doodle; Jane Doe'}
>>> AcademicFieldMap.extract_authors(record)
# OUTPUT: ['Evan Doodle', 'Jane Doe']
>>> record = {'authors': ['Evan Doodle', 'Jane Noah']}
>>> AcademicFieldMap.extract_authors(record)
# OUTPUT: ['Evan Doodle', 'Jane Noah']
>>> record = {'authors': [102, 203]}
>>> AcademicFieldMap.extract_authors(record)  # returns None because the elements aren't strings
# OUTPUT: None
- classmethod extract_boolean_field(record: NormalizedRecordType, field: str, true_values: tuple[str, ...] = ('true', '1', 'yes'), false_values: tuple[str, ...] = ('false', '0', 'no'), default: bool | None = None) bool | None[source]
Extracts a field’s value from the current record as a boolean (‘true’ -> True, ‘false’ -> False, ‘None’ -> None).
- Parameters:
record (NormalizedRecordType) – The normalized record dictionary to extract a boolean value from.
field (str) – The record field to be used for the extraction of a boolean value.
true_values (tuple[str, ...]) – Values to be mapped to True when found.
false_values (tuple[str, ...]) – Values to be mapped to False when found.
default (Optional[bool]) – The value to return when the field matches neither true_values nor false_values.
- Returns:
True if the field’s value appears in true_values
False if the field’s value appears in false_values
The default if the observed value appears in neither true_values nor false_values
- Return type:
Optional[bool]
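The mapping rules above can be illustrated with a standalone sketch. to_bool is a hypothetical function, not part of scholar_flux, and the case-insensitive comparison and whitespace stripping are assumptions that the documentation above does not confirm.

```python
from typing import Any, Optional


def to_bool(
    value: Any,
    true_values: tuple[str, ...] = ("true", "1", "yes"),
    false_values: tuple[str, ...] = ("false", "0", "no"),
    default: Optional[bool] = None,
) -> Optional[bool]:
    """Hypothetical helper: coerce a raw field value to True, False, or a default."""
    if isinstance(value, bool):
        return value
    if value is None:
        return default
    # Assumption: values are compared case-insensitively after trimming whitespace.
    text = str(value).strip().lower()
    if text in true_values:
        return True
    if text in false_values:
        return False
    return default


print(to_bool("Yes"), to_bool(0), to_bool("maybe"), to_bool("maybe", default=False))
```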
- classmethod extract_id(record: NormalizedRecordType, field: str = 'record_id', strip_prefix: str | Pattern | None = None) str | None[source]
Extracts and coerces the ID from the current record into a string.
- Parameters:
record (NormalizedRecordType) – A normalized record dictionary before or after post-processing
field (str) – The IdType to filter for (e.g., ‘arxiv_id’, ‘pmid’, ‘mag_id’)
strip_prefix (Optional[str | re.Pattern]) – An optional prefix to remove from the identifier (e.g., ‘PMC’ for PMC IDs)
- Returns:
The record ID as a string, or None if not available
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> AcademicFieldMap.extract_id({"record_id": 12345678})
'12345678'
>>> AcademicFieldMap.extract_id({"record_id": "mock_id:123"})
'mock_id:123'
- classmethod extract_iso_date(record: NormalizedRecordType, field: str = 'date_created') str | None[source]
Extracts and formats a date from a dictionary or string in ISO format (%Y-%m-%d).
- Parameters:
record (NormalizedRecordType) – A normalized record with a date_created or similar field to extract an ISO date from. Note: If the field is a nested dictionary, the date is built from its year, month, and day entries. If the field is a string, this method attempts to parse it as an ISO timestamp. If the field is a datetime or date, the object is formatted directly.
field (str) – The name of the field containing date information to extract.
- Returns:
An ISO formatted date string (YYYY-MM-DD, YYYY-MM, or YYYY) or None.
- Return type:
Optional[str]
Examples
PubDate with Year=’2025’, Month=’Dec’, Day=’19’: Returns ‘2025-12-19’
PubDate with Year=’2025’, Month=’12’: Returns ‘2025-12’
PLOS with timestamp: ‘2016-12-08T00:00:00Z’ Returns ‘2016-12-08’
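The three cases listed above can be reproduced with a small standalone sketch. to_iso_date is a hypothetical function rather than the library's code, and the ‘Year’, ‘Month’, and ‘Day’ keys are taken from the PubDate examples above.

```python
from datetime import date, datetime
from typing import Any, Optional

# Three-letter month abbreviations mapped to month numbers (jan -> 1, ..., dec -> 12).
MONTH_NUMBERS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}


def to_iso_date(value: Any) -> Optional[str]:
    """Hypothetical helper: format dicts, ISO strings, and date objects as ISO dates."""
    if isinstance(value, datetime):  # checked first because datetime subclasses date
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, str):
        try:
            # Older Python versions reject a trailing 'Z', so swap it for an explicit offset.
            return datetime.fromisoformat(value.replace("Z", "+00:00")).date().isoformat()
        except ValueError:
            return None
    if isinstance(value, dict):
        year = str(value.get("Year") or "").strip()
        if not year.isdigit():
            return None
        month_text = str(value.get("Month") or "").strip()
        month = int(month_text) if month_text.isdigit() else MONTH_NUMBERS.get(month_text[:3].lower())
        if month is None:
            return year
        parts = [year, f"{month:02d}"]
        day_text = str(value.get("Day") or "").strip()
        if day_text.isdigit():
            parts.append(f"{int(day_text):02d}")
        return "-".join(parts)
    return None


print(to_iso_date({"Year": "2025", "Month": "Dec", "Day": "19"}))
print(to_iso_date({"Year": "2025", "Month": "12"}))
print(to_iso_date("2016-12-08T00:00:00Z"))
```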
- classmethod extract_journal(record: NormalizedRecordType, field: str = 'journal') str | None[source]
Extracts the publication journal title or a list of journal titles as a semicolon delimited string.
- Parameters:
record (NormalizedRecordType) – The normalized record dictionary to extract the journal field from.
field (str) – The field to extract the journal from.
- Returns:
The journal or journals of publication, joined by a semicolon, or None if not available.
- Return type:
Optional[str]
Examples
>>> AcademicFieldMap.extract_journal({"journal": "Nature"})
# OUTPUT: 'Nature'
>>> AcademicFieldMap.extract_journal({"journal": ["Nature", "Science"]})
# OUTPUT: 'Nature; Science'
>>> AcademicFieldMap.extract_journal({"journal": ["Nature", "", None, "Science"]})
# OUTPUT: 'Nature; Science'
- classmethod extract_url(record: NormalizedRecordType, *paths: list[str | int] | str, pattern_delimiter: str | Pattern | None = re.compile('; *(?=http)|, *(?=http)|\\| *(?=http)'), delimiter_prefix: str | None = None, delimiter_suffix: str | None = '(?=http)') str | None[source]
Helper function for extracting a single, primary URL from a record based on the paths used to traverse it.
- Parameters:
record (NormalizedRecordType) – The record dictionary to extract the URL from.
*paths – Arbitrary positional path arguments leading to a single URL or list of URLs. Each path can be a string or a list of keys describing how to reach a URL in a nested record. Defaults to (‘url’,), a basic url lookup, if no paths are provided.
pattern_delimiter (str | Pattern) – Regex pattern used to split URL strings. The default splits on semicolons, commas, or pipes that are followed by ‘http’. When a string is provided, the delimiter_suffix lookahead (?=http) is appended to it to prevent splitting URLs mid-domain. Set to None to disable splitting. If a re.Pattern object is provided, it is used as is without transformation.
delimiter_prefix (str) – An optional string prepended to each element within a pattern. This prefix is None by default but can be used to identify URLs that directly follow a specific pattern.
delimiter_suffix (str) – An optional string appended as a suffix to each element within a pattern. The default suffix is a lookahead that matches only when an http scheme (typically the start of a URL) directly follows the delimiter.
- Returns:
The first value found at any of the specified paths. Commonly a string URL, but could be any type depending on the data structure. Returns None if not found.
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {"url": "http://example.com; http://backup.com"}
>>> AcademicFieldMap.extract_url(record)
# OUTPUT: 'http://example.com'
>>> record = {"url": [{"value": "http://example.com"}]}
>>> AcademicFieldMap.extract_url(record, ["url", 0, "value"], ["url", 0])
# OUTPUT: 'http://example.com'
>>> # Semicolon-delimited URLs (common in CrossRef, Springer)
>>> record = {"url": "http://example.com; http://backup.com"}
>>> AcademicFieldMap.extract_url(record)
# OUTPUT: 'http://example.com'
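The role of the (?=http) lookahead can be shown with the standard re module alone. The sketch below reuses the default pattern from the signature above; first_url is a hypothetical helper and does not reproduce the path traversal that extract_url performs.

```python
import re
from typing import Optional

# Default delimiter pattern copied from the extract_url signature above.
URL_DELIMITER = re.compile(r"; *(?=http)|, *(?=http)|\| *(?=http)")


def first_url(value: str) -> Optional[str]:
    """Hypothetical helper: split a delimited URL string and return the first entry."""
    parts = [part.strip() for part in URL_DELIMITER.split(value) if part and part.strip()]
    return parts[0] if parts else None


print(first_url("http://example.com; http://backup.com"))
# The ';' inside the first URL is not followed by 'http', so the URL is left intact.
print(first_url("http://example.com/a;b=1, https://mirror.example.org"))
```

Because each delimiter only matches when an http scheme follows it, semicolons and commas that belong to a URL's own path or query string are not treated as separators.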
- classmethod extract_url_id(record: NormalizedRecordType, field: str = 'record_id', strip_prefix: str | Pattern | None = None) str | None[source]
Extracts an ID from the URL of the current record, removing a URL prefix when specified.
- Parameters:
record (NormalizedRecordType) – The record containing the URL ID to extract
field (str) – The field containing the ID (with or without a prefix)
strip_prefix (Optional[str | re.Pattern]) – The prefix or regex pattern to optionally remove from the URL
- Returns:
The ID after field extraction and removal of the string prefix, if provided. If the record field doesn’t exist, None is returned instead.
- Return type:
Optional[str]
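Prefix removal with either a plain string or a compiled pattern can be sketched as follows. strip_id_prefix is a hypothetical helper, not the library's implementation; treating a plain string as a literal prefix (rather than as a regex) is an assumption, and the example identifiers are illustrative.

```python
import re
from typing import Any, Optional, Union


def strip_id_prefix(value: Any, strip_prefix: Union[str, re.Pattern, None] = None) -> Optional[str]:
    """Hypothetical helper: coerce an ID to a string and remove a leading prefix."""
    if value is None:
        return None
    text = str(value).strip()
    if strip_prefix is not None:
        if isinstance(strip_prefix, re.Pattern):
            pattern = strip_prefix
        else:
            # Assumption: a plain string is matched literally, not as a regex.
            pattern = re.compile(re.escape(strip_prefix))
        # match() anchors at the start of the string, so only a leading prefix is removed.
        found = pattern.match(text)
        if found:
            text = text[found.end():]
    return text or None


print(strip_id_prefix("https://ids.example.org/W123", "https://ids.example.org/"))
print(strip_id_prefix("PMC12345", re.compile(r"PMC")))
```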
- classmethod extract_year(record: NormalizedRecordType, field: str = 'year') int | None[source]
Extracts the year of publication or record creation from the manuscript/record.
- Parameters:
record (NormalizedRecordType) – Normalized record dictionary
field (str) – The field to extract the year of publication or record creation from.
- Returns:
The year as an integer, or None if not extractable.
- Return type:
Optional[int]
Examples
>>> AcademicFieldMap.extract_year({"year": "2024-06-15"})
2024
>>> AcademicFieldMap.extract_year({"year": 2024})
2024
>>> AcademicFieldMap.extract_year({"year": None})
None
- full_text: list[str] | str | None
- is_retracted: list[str] | str | None
- journal: list[str] | str | None
- keywords: list[str] | str | None
- language: list[str] | str | None
- license: list[str] | str | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- classmethod normalize_doi(record: NormalizedRecordType, field: str = 'doi') str | None[source]
Normalizes DOI by stripping the https://doi.org/ prefix.
- Parameters:
record (NormalizedRecordType) – Normalized record containing the ‘doi’ field to extract.
field (str) – The field to extract the record doi from.
- Returns:
Cleaned DOI string without URL prefix, or None if invalid
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'doi': 'https://doi.org/10.1234/example'}
>>> AcademicFieldMap.normalize_doi(record)
# OUTPUT: '10.1234/example'
- open_access: list[str] | str | None
- provider_name: str
- publisher: list[str] | str | None
- classmethod reconstruct_url(id: str | None, url: str) str | None[source]
Reconstruct an article URL from the ID of the article.
Useful for PLOS and PubMed URL reconstruction.
- Parameters:
id (Optional[str]) – The ID/DOI identifier (e.g., “10.1371/journal.pone.0123456”)
url (str) – The URL prefix (e.g. "https://journals.plos.org/plosone/article?id=")
- Returns:
Reconstructed URL if ID is valid, None otherwise.
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> AcademicFieldMap.reconstruct_url(
...     id="10.1371/journal.pone.0123456",
...     url="https://journals.plos.org/plosone/article?id="
... )
# OUTPUT: 'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123456'
>>> AcademicFieldMap.reconstruct_url(None, '')
# OUTPUT: None
>>> AcademicFieldMap.reconstruct_url("", None)
# OUTPUT: None
- record_id: list[str] | str | None
- record_type: list[str] | str | None
- subjects: list[str] | str | None
- title: list[str] | str | None
- url: list[str] | str | None
- year: list[str] | str | None
- class scholar_flux.api.normalization.ArXivFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
ArXiv specific field mapping with custom transformations.
The ArXivFieldMap implements a minimal set of methods for record normalization to finalize the structure of each extracted and normalized record during postprocessing.
- Post-Processed Fields:
arXiv, DOI, and record identifiers
Year extraction from ISO date strings
PDF URL extraction from link arrays
Open access status (always true for arXiv)
Subject and category normalization
Note
arXiv records use unique identifier and link structures. The field map configuration and post-processing logic handle these for consistent output.
- abstract: str | list[str] | None
- api_specific_fields: dict[str, Any]
- authors: str | list[str] | None
- citation_count: str | list[str] | None
- date_created: str | list[str] | None
- date_published: str | list[str] | None
- default_field_values: dict[str, Any]
- doi: str | list[str] | None
- classmethod extract_pdf_url(record: NormalizedRecordType, field: str = 'url_list') str | None[source]
Extracts a valid PDF URL from the array of URLs corresponding to the record.
- Parameters:
record (NormalizedRecordType) – A normalized arXiv record dictionary to extract a PDF URL from.
field (str) – The field to extract the list of URLs from.
- Returns:
A validated PDF URL if available, otherwise None.
- Return type:
Optional[str]
- classmethod extract_record_type(record: NormalizedRecordType, journal_field: str = 'journal', comment_field: str = 'comment') str[source]
Infers a record type from the journal or comment field for an arXiv record using pattern matching.
The record type is inferred using predefined patterns to determine whether a record is a journal article, a book chapter, preprint, etc.
The possible types are defined within scholar_flux.api.normalization.arxiv_field_map as a RECORD_TYPE_PATTERNS dictionary that maps patterns to record types.
- Parameters:
record (NormalizedRecordType) – a normalized arXiv record to infer the record type with.
journal_field (str) – The journal field used to infer record type with.
comment_field (str) – The comment field used to infer record type with if a journal field does not exist.
- Returns:
When journal_field is available, the record type associated with the matched pattern is returned. If no patterns match, then journal-article is returned instead. When journal_field is not available, a record type is extracted by pattern matching against the comment_field. If no match is found, preprint is returned.
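The fallback order described in the return value can be sketched with a few regular expressions. The patterns and record types in EXAMPLE_PATTERNS are invented for illustration and are not the contents of RECORD_TYPE_PATTERNS; infer_record_type is likewise a hypothetical function.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; the library defines its own mapping.
EXAMPLE_PATTERNS = {
    re.compile(r"\b(proceedings|conference|workshop|symposium)\b", re.IGNORECASE): "proceedings-article",
    re.compile(r"\b(thesis|dissertation)\b", re.IGNORECASE): "dissertation",
    re.compile(r"\b(chapter|book)\b", re.IGNORECASE): "book-chapter",
}


def infer_record_type(journal: Optional[str], comment: Optional[str]) -> str:
    """Hypothetical helper: prefer the journal field, then fall back to the comment field."""
    if journal:
        text, fallback = journal, "journal-article"
    else:
        text, fallback = comment, "preprint"
    if text:
        for pattern, record_type in EXAMPLE_PATTERNS.items():
            if pattern.search(text):
                return record_type
    return fallback


print(infer_record_type("Phys. Rev. D 12 (2020)", None))
print(infer_record_type(None, "PhD thesis, 120 pages"))
print(infer_record_type(None, "12 pages, 3 figures"))
```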
- full_text: str | list[str] | None
- is_retracted: str | list[str] | None
- journal: str | list[str] | None
- keywords: str | list[str] | None
- language: str | list[str] | None
- license: str | list[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- open_access: str | list[str] | None
- provider_name: str
- publisher: str | list[str] | None
- record_id: str | list[str] | None
- record_type: str | list[str] | None
- subjects: str | list[str] | None
- title: str | list[str] | None
- url: str | list[str] | None
- year: str | list[str] | None
- class scholar_flux.api.normalization.BaseFieldMap(*, provider_name: str, api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>)[source]
Bases: BaseModel
The BaseFieldMap is used to normalize the names of fields consistently across providers.
This class provides a minimal implementation for mapping API-specific fields from a non-nested dictionary record to a common record key. It is intended to be subclassed and customized for different APIs.
Instances of this class can be called directly to normalize a single record or multiple records, depending on the input. Direct calls to an instance are handled by .apply() under the hood.
- Methods:
normalize_record – Normalizes a single dictionary record
normalize_records – Normalizes a list of dictionary records
apply – Returns either a single normalized record or a list of normalized records matching the input.
structure – Displays a string representation of the current BaseFieldMap instance
- provider_name
A default provider name to be assigned for all normalized records. If not provided, the field map will try to find the provider name from within each record.
- Type:
str
- api_specific_fields
Defines a dictionary of normalized field names (keys) to map to the names of fields within each dictionary record (values)
- Type:
dict[str, Any]
- default_field_values
Indicates values that should be assigned if a field cannot be found within a record.
- Type:
dict[str, Any]
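How the three attributes above interact can be illustrated with a standalone sketch. normalize_flat_record is a hypothetical function that does not use or reproduce the library's implementation; it only shows renaming flat keys, applying defaults, and falling back to the record's own provider name.

```python
from typing import Any, Optional


def normalize_flat_record(
    record: dict[str, Any],
    api_specific_fields: dict[str, str],
    default_field_values: Optional[dict[str, Any]] = None,
    provider_name: str = "",
) -> dict[str, Any]:
    """Hypothetical helper: rename flat record keys to common field names."""
    defaults = default_field_values or {}
    # Fall back to the provider name found in the record when none is configured.
    normalized: dict[str, Any] = {"provider_name": provider_name or record.get("provider_name")}
    for common_name, api_name in api_specific_fields.items():
        # Use the default only when the API-specific key is missing from the record.
        normalized[common_name] = record.get(api_name, defaults.get(common_name))
    return normalized


raw = {"provider_name": "core", "ID": 196, "article_title": "Decomposition of Political Tactics"}
result = normalize_flat_record(
    raw,
    api_specific_fields={"title": "article_title", "record_id": "ID", "language": "lang"},
    default_field_values={"language": "unknown"},
)
print(result)
```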
- api_specific_fields: dict[str, Any]
- apply(records: RecordType) NormalizedRecordType[source]
- apply(records: RecordList) NormalizedRecordList
Normalizes a record or list of records by mapping API-specific field names to common fields.
- Parameters:
records (RecordType | RecordList) – A single dictionary record or a list of dictionary records to normalize.
- Returns:
A single normalized dictionary if a single record is provided, or a list of normalized dictionaries if a list of records is provided.
- Return type:
NormalizedRecordType | NormalizedRecordList
- property core_fields: dict[str, Any]
Returns a dictionary of all core fields in the current FieldMap (excluding all API-specific fields).
- default_field_values: dict[str, Any]
- property fields: dict[str, Any]
Returns a representation of the current FieldMap as a dictionary.
- filter_api_specific_fields(record: NormalizedRecordType, keep_api_specific_fields: bool | Sequence[str] | set[str] | None = None) dict[str, Any][source]
Filters API-specific fields from the processed record.
- Parameters:
record (NormalizedRecordType) – The current record to filter API-specific fields from.
keep_api_specific_fields (Optional[bool | Sequence[str] | set[str]]) – Either a boolean indicating whether to keep all API-specific fields (True/None) or to remove them after the completion of normalization (False). This parameter can also be a sequence/set of specific field names to keep.
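The three accepted forms of keep_api_specific_fields can be sketched as a standalone function. filter_extra_fields and its api_specific_names argument are hypothetical; the real method determines which fields are API-specific from the field map itself.

```python
from typing import Any, Iterable, Union


def filter_extra_fields(
    record: dict[str, Any],
    api_specific_names: Iterable[str],
    keep: Union[bool, Iterable[str], None] = None,
) -> dict[str, Any]:
    """Hypothetical helper: drop API-specific keys according to ``keep``."""
    if keep is None or keep is True:
        return dict(record)  # keep every field
    if keep is False:
        kept: set[str] = set()
    elif isinstance(keep, str):
        kept = {keep}  # a bare string names one field, not its characters
    else:
        kept = set(keep)
    removable = set(api_specific_names) - kept
    return {key: value for key, value in record.items() if key not in removable}


record = {"title": "T", "pmid": "1", "mag_id": "2"}
extra = ["pmid", "mag_id"]
print(filter_extra_fields(record, extra))
print(filter_extra_fields(record, extra, keep=False))
print(filter_extra_fields(record, extra, keep=["pmid"]))
```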
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- normalize_record(record: dict, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordType[source]
Maps API-specific fields in a single dictionary record to a normalized set of field names.
- Parameters:
record (dict) – The single, dictionary-typed record to normalize.
keep_api_specific_fields (Optional[bool | Sequence[str]]) – A boolean indicating whether to keep or remove all API-specific fields or a sequence indicating which API-specific fields to keep.
- Returns:
A new dictionary with normalized field names.
- Return type:
NormalizedRecordType
- Raises:
TypeError – If the input to record is not a mapping or dictionary object.
- normalize_records(records: RecordType | RecordList, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordList[source]
Maps API-specific fields in one or more records to a normalized set of field names.
- Parameters:
records (dict | RecordType | RecordList) – A single dictionary record or a list of dictionary records.
keep_api_specific_fields (Optional[bool | Sequence[str]]) – A boolean indicating whether to keep or remove all API-specific fields or a sequence indicating which API-specific fields to keep.
- Returns:
A list of dictionaries with normalized field names.
- Return type:
NormalizedRecordList
- provider_name: str
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method that shows the current structure of the BaseFieldMap.
- Parameters:
flatten (bool) – Whether to flatten the current field map’s structural representation into a single line (Default = False)
show_value_attributes (bool) – Whether to show nested attributes of the base field map or subclass (Default = True)
- Returns:
A structural representation of the current field map as a string. Use a print statement to view it.
- Return type:
str
- class scholar_flux.api.normalization.CoreFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
Core specific field mappings with custom transformations.
The Core API provides open access scholarly content aggregated from thousands of repositories worldwide.
The CoreFieldMap implements several methods for record normalization and the extraction of record fields and cross platform IDs. The post-processing step finalizes the structure of each normalized record to consistently prepare and post-process records retrieved from the CORE API.
- Post-Processed Fields:
Year extraction from various date formats
Journal list flattening (Core can return multiple journal titles)
Record ID coercion to string format
Open access default (Core sources are generally all open access)
Cross-reference identifier extraction (arXiv, PubMed, MAG IDs)
Multi-identifier normalization for entity resolution
- abstract: str | list[str] | None
- api_specific_fields: dict[str, Any]
- authors: str | list[str] | None
- citation_count: str | list[str] | None
- date_created: str | list[str] | None
- date_published: str | list[str] | None
- default_field_values: dict[str, Any]
- doi: str | list[str] | None
- classmethod extract_arxiv_id(record: NormalizedRecordType) str | None[source]
Extracts the arXiv identifier for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary.
- Returns:
The arXiv ID (e.g., ‘1012.4340’) or None if not available.
- Return type:
Optional[str]
- classmethod extract_mag_id(record: NormalizedRecordType) str | None[source]
Extracts the Microsoft Academic Graph identifier for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary
- Returns:
The MAG ID or None if not available
- Return type:
Optional[str]
Examples
>>> CoreFieldMap.extract_mag_id({"mag_id": "2056403249"})
'2056403249'
>>> CoreFieldMap.extract_mag_id({"mag_id": "None"})
None
- classmethod extract_oai_ids(record: NormalizedRecordType) list[str] | None[source]
Extracts the OAI identifiers for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary
- Returns:
The OAI IDs for the current record as a list or None if not available
- Return type:
Optional[list[str]]
- classmethod extract_pmid(record: NormalizedRecordType) str | None[source]
Extracts the PubMed identifier for cross-database entity resolution.
- Parameters:
record (NormalizedRecordType) – Normalized Core record dictionary
- Returns:
The PubMed ID or None if not available
- Return type:
Optional[str]
Examples
>>> CoreFieldMap.extract_pmid({"pmid": "12345678"})
'12345678'
>>> CoreFieldMap.extract_pmid({"pmid": "None"})
None
- full_text: str | list[str] | None
- is_retracted: str | list[str] | None
- journal: str | list[str] | None
- keywords: str | list[str] | None
- language: str | list[str] | None
- license: str | list[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- open_access: str | list[str] | None
- provider_name: str
- publisher: str | list[str] | None
- record_id: str | list[str] | None
- record_type: str | list[str] | None
- subjects: str | list[str] | None
- title: str | list[str] | None
- url: str | list[str] | None
- year: str | list[str] | None
- class scholar_flux.api.normalization.CrossrefFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
Crossref specific field mapping with custom transformations.
The CrossrefFieldMap implements a minimal set of methods for field extraction and abstract HTML tag removal, preparing and finalizing the structure of each normalized record in the post-processing step.
- Post-Processed Fields:
DOI, URL, and record identifiers
Year and date extraction from nested date fields
Author name formatting
Open access status resolution from license URLs
Journal extraction
Abstract retrieval and HTML tag removal
Retraction status detection
Note
Crossref records may contain nested lists and multiple date fields. The field map configuration and post-processing logic handle these variations and normalize the output.
- abstract: str | list[str] | None
- api_specific_fields: dict[str, Any]
- authors: str | list[str] | None
- classmethod check_retraction(record: NormalizedRecordType, field: str = 'updated_by_list', pattern: str | Pattern | None = None) bool | None[source]
Checks if the record is a retraction notice.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to check for retraction updates.
pattern (Optional[str | re.Pattern]) – An optional string or regex pattern used to verify retraction status
- Returns:
True if the paper has been retracted, None if the status is unknown.
- Return type:
Optional[bool]
Note
┌─────────────────────┐       updated-by       ┌─────────────────────┐
│   Retracted Paper   │ ◄───────────────────── │  Retraction Notice  │
│ (original article)  │ ─────────────────────► │   (update record)   │
└─────────────────────┘       update-to        └─────────────────────┘
Crossref’s update-to field is on the retraction NOTICE, pointing to the retracted paper. The retracted paper itself might instead contain an updated-by field indicating that the paper has been retracted.
When retraction status cannot be determined from the record alone, it can be verified by:
Sending a separate Crossref search with the filter=’update-type:retraction’ API-specific parameter
Checking the https://gitlab.com/crossref/retraction-watch-data repo (updated daily)
Source: https://www.crossref.org/documentation/retrieve-metadata/retraction-watch/ (2026)
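The True-or-None contract described above can be sketched as a scan over a list of update entries. looks_retracted is a hypothetical function, and both the ‘type’ key and the default pattern are assumptions made for illustration; they are not confirmed by the documentation above.

```python
import re
from typing import Any, Optional

# Assumption: an update entry describes its kind in a 'type' key.
RETRACTION_PATTERN = re.compile(r"retract", re.IGNORECASE)


def looks_retracted(updates: Any, pattern: re.Pattern = RETRACTION_PATTERN) -> Optional[bool]:
    """Hypothetical helper: return True when any update entry matches, otherwise None."""
    if not isinstance(updates, list):
        return None
    for entry in updates:
        if isinstance(entry, dict) and pattern.search(str(entry.get("type", ""))):
            return True
    # The absence of a matching entry does not prove the paper was never retracted.
    return None


print(looks_retracted([{"type": "correction"}, {"type": "retraction"}]))
print(looks_retracted([{"type": "correction"}]))
```

Returning None instead of False reflects that a missing update entry leaves the status unknown, which is why the follow-up checks listed above exist.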
- citation_count: str | list[str] | None
- date_created: str | list[str] | None
- date_published: str | list[str] | None
- default_field_values: dict[str, Any]
- doi: str | list[str] | None
- classmethod extract_authors(record: NormalizedRecordType, field: str = 'author_list') list[str] | None[source]
Extracts formatted author names by combining GivenName and LastName.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract the nested list of authors from.
- Returns:
List of author names in ‘ForeName LastName’ format, or None if no authors.
- Return type:
Optional[list[str]]
Note
Returns None for organizational records (datasets, reports) where Crossref does not provide individual authors. Check the ‘publisher’ or ‘institution’ fields for organizational attribution.
- classmethod extract_date_parts(record: NormalizedRecordType, field: str = 'date_published') str | None[source]
Extracts the publication date or date_created for the current record.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract a Crossref date field from.
- Returns:
ISO formatted date string (YYYY-MM-DD) or None.
- Return type:
Optional[str]
Note: This class method is designed to handle Crossref’s nested date structure, consistently converting date fields in [[Year, Month, Day]] format to %Y-%m-%d format.
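The [[Year, Month, Day]] conversion can be sketched with a standalone function. date_parts_to_iso is a hypothetical helper, not the library's code; accepting a dictionary with a ‘date-parts’ key and returning shortened strings for partial dates are assumptions of this sketch.

```python
from typing import Any, Optional


def date_parts_to_iso(date_field: Any) -> Optional[str]:
    """Hypothetical helper: convert [[year, month, day]] into an ISO date string."""
    if isinstance(date_field, dict):
        # Assumption: the nested list may be wrapped in a 'date-parts' key.
        date_field = date_field.get("date-parts")
    if not (isinstance(date_field, list) and date_field and isinstance(date_field[0], list)):
        return None
    parts: list[int] = []
    for part in date_field[0][:3]:
        # Stop at the first missing or non-integer component (bool is excluded explicitly).
        if not isinstance(part, int) or isinstance(part, bool):
            break
        parts.append(part)
    if not parts:
        return None
    widths = (4, 2, 2)  # zero-padded widths for year, month, day
    return "-".join(f"{part:0{width}d}" for part, width in zip(parts, widths))


print(date_parts_to_iso({"date-parts": [[2024, 6, 15]]}))
print(date_parts_to_iso([[2024, 6]]))
print(date_parts_to_iso([[None]]))
```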
- classmethod extract_title(record: NormalizedRecordType, field: str = 'title') str | None[source]
Extracts the record title as a string, joining multiple titles from a nested list when more than one is present.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract the title from.
- Returns:
The title or delimited set of titles associated with the record, joined by a semicolon.
- Return type:
Optional[str]
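One way to picture the flattening step is the sketch below. The helper name `join_titles` and the exact `"; "` delimiter are assumptions for illustration; the documentation only states that titles are joined by a semicolon.

```python
from typing import Any, Optional


def join_titles(value: Any, delimiter: str = "; ") -> Optional[str]:
    """Hypothetical helper: flatten a title field into one delimited string."""

    def flatten(item: Any) -> list[str]:
        # Strings are leaves; lists and tuples are walked recursively.
        if isinstance(item, str):
            return [item.strip()] if item.strip() else []
        if isinstance(item, (list, tuple)):
            return [title for child in item for title in flatten(child)]
        return []

    titles = flatten(value)
    return delimiter.join(titles) if titles else None


print(join_titles(["Main Title", ["Alternate Title"]]))
```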
- classmethod extract_year(record: NormalizedRecordType, field: str = 'year') int | None[source]
Extracts the year of publication or creation.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field name to extract the year from.
- Returns:
The year of record publication or creation extracted as an integer.
- Return type:
Optional[int]
- full_text: str | list[str] | None
- is_retracted: str | list[str] | None
- journal: str | list[str] | None
- keywords: str | list[str] | None
- language: str | list[str] | None
- license: str | list[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- open_access: str | list[str] | None
- provider_name: str
- publisher: str | list[str] | None
- record_id: str | list[str] | None
- record_type: str | list[str] | None
- classmethod resolve_open_access(record: NormalizedRecordType, field: str = 'license') bool | None[source]
Resolves the Open Access Status from known license URLs.
- Parameters:
record (NormalizedRecordType) – Normalized Crossref record dictionary.
field (str) – The field to extract license URLs from.
- Returns:
True if open access, False if restricted, None if indeterminate.
- Return type:
Optional[bool]
- subjects: str | list[str] | None
- title: str | list[str] | None
- url: str | list[str] | None
- year: str | list[str] | None
- class scholar_flux.api.normalization.NormalizingFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>)[source]
Bases: BaseFieldMap
A field map implementation that builds upon the original BaseFieldMap to recursively find and retrieve nested JSON elements from records with automated index processing and path-guided traversal.
During normalization, the NormalizingFieldMap.fields property returns all subclassed field mappings as a flattened dictionary (excluding private fields prefixed with underscores). Both simple and nested API-specific field names are matched and mapped to universal field names.
Any changes to the instance configuration are automatically detected during normalization by comparing the _cached_fields to the updated fields property.
Examples
>>> from scholar_flux.api.normalization.normalizing_field_map import NormalizingFieldMap
>>> field_map = NormalizingFieldMap(provider_name = None, api_specific_fields=dict(title = 'article_title', record_id='ID'))
>>> expected_result = field_map.fields | {'provider_name':'core', 'title': 'Decomposition of Political Tactics', 'record_id': 196}
>>> result = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> cached_fields = field_map._cached_fields
>>> print(result == expected_result)
>>> result2 = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> assert cached_fields is field_map._cached_fields
>>> assert result is not result2
- api_specific_fields: dict[str, Any]
- default_field_values: dict[str, Any]
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- normalize_record(record: RecordType, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordType[source]
Maps API-specific fields in dictionaries of processed records to a normalized set of field names.
- normalize_records(records: RecordType | RecordList, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordList[source]
Maps API-specific fields within a processed record list to create a new, normalized record list.
- property processor: NormalizingDataProcessor
Generates a NormalizingDataProcessor using the current set of assigned field names.
Note that if a processor does not already exist, or if the schema has changed, the data processor is recreated with the updated set of fields.
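The rebuild-on-change behavior can be illustrated with a small stand-in class. Everything here is hypothetical (`CachedProcessorSketch`, `field_paths`, `rebuild_count`, and the placeholder dictionary used as the "processor"); it only sketches the general pattern of comparing a cached field snapshot against the current fields, and does not reproduce the library's internals.

```python
from typing import Any, Optional


class CachedProcessorSketch:
    """Hypothetical stand-in showing rebuild-on-change caching."""

    def __init__(self, **field_paths: Any) -> None:
        self.field_paths = dict(field_paths)
        self._cached_fields: Optional[dict[str, Any]] = None
        self._processor: Optional[dict[str, Any]] = None
        self.rebuild_count = 0

    @property
    def fields(self) -> dict[str, Any]:
        # Snapshot of the current configuration.
        return dict(self.field_paths)

    @property
    def processor(self) -> dict[str, Any]:
        current = self.fields
        # Rebuild only when no processor exists or the field set has changed.
        if self._processor is None or current != self._cached_fields:
            self._cached_fields = current
            self._processor = {"schema": current}  # placeholder for a real processor
            self.rebuild_count += 1
        return self._processor


sketch = CachedProcessorSketch(title="article_title")
first = sketch.processor
print(sketch.processor is first)      # unchanged configuration reuses the processor
sketch.field_paths["doi"] = "DOI"
print(sketch.processor is first)      # changed configuration triggers a rebuild
```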
- provider_name: str
- class scholar_flux.api.normalization.OpenAlexFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
OpenAlex specific field mapping with custom transformations.
The OpenAlexFieldMap implements a minimal set of methods for field extraction and abstract reconstruction, finalizing the structure of each normalized record in the post-processing step.
- Post-Processed Fields:
Abstract reconstruction from inverted index format
DOI normalization (stripping URL prefix)
PMID extraction from ids object
Author list cleanup (filter empty entries)
- abstract: str | list[str] | None
- api_specific_fields: dict[str, Any]
- authors: str | list[str] | None
- citation_count: str | list[str] | None
- date_created: str | list[str] | None
- date_published: str | list[str] | None
- default_field_values: dict[str, Any]
- doi: str | list[str] | None
- classmethod extract_open_access(record: NormalizedRecordType, field: str = 'open_access') bool | None[source]
Extracts the open access status from the OpenAlex record as a boolean field.
The value returned can be True or False, indicating whether the full text of the record is freely accessible to the public, or None if the field is missing or status cannot be determined from the field.
- Parameters:
record (NormalizedRecordType) – The Normalized OpenAlex record dictionary.
field (str) – The field to extract the open access status from.
- Returns:
True if the record is open access (e.g., arXiv, CORE, PubMed Central, CC-BY license).
False if the record is not open access (e.g., subscription, restricted, or fee-based access).
None if the status cannot be determined from the available metadata.
- Return type:
Optional[bool]
Note
- How OpenAlex determines open access status is explained here:
https://help.openalex.org/hc/en-us/articles/24347035046295-Open-Access-OA
- classmethod extract_pmid(record: NormalizedRecordType, field: str = 'pmid') str | None[source]
Extracts PubMed ID from the ids object.
- Parameters:
record (NormalizedRecordType) – Normalized OpenAlex record dictionary.
field (str) – The field to extract the PMID from.
- Returns:
PMID string without URL prefix, or None if not found.
- Return type:
Optional[str]
Examples
>>> record = {'pmid': 'https://pubmed.ncbi.nlm.nih.gov/29241234'}
>>> OpenAlexFieldMap.extract_pmid(record)
'29241234'
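A minimal sketch of the prefix-stripping step, assuming the PMID is the final path segment of the URL and consists only of digits. The helper name `strip_pmid_prefix` is hypothetical and not part of the library.

```python
from typing import Any, Optional


def strip_pmid_prefix(value: Any) -> Optional[str]:
    """Hypothetical helper: keep only the trailing numeric PMID."""
    if not isinstance(value, str):
        return None
    # Drop surrounding whitespace and any trailing slash, then take the last segment.
    pmid = value.strip().rstrip("/").rsplit("/", 1)[-1]
    # Assumption: a valid PMID is purely numeric.
    return pmid if pmid.isdigit() else None


print(strip_pmid_prefix("https://pubmed.ncbi.nlm.nih.gov/29241234"))
```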
- full_text: str | list[str] | None
- is_retracted: str | list[str] | None
- journal: str | list[str] | None
- keywords: str | list[str] | None
- language: str | list[str] | None
- license: str | list[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- open_access: str | list[str] | None
- provider_name: str
- publisher: str | list[str] | None
- classmethod reconstruct_abstract(record: NormalizedRecordType, field: str = 'abstract_inverted_index') str | None[source]
Reconstructs abstract text from OpenAlex inverted index format.
OpenAlex stores abstracts as inverted indexes where keys are words and values are arrays of positions where those words appear.
- Parameters:
record (NormalizedRecordType) – Normalized OpenAlex record dictionary.
field (str) – The field containing the inverted index.
- Returns:
Reconstructed abstract string, or None if not available.
- Return type:
Optional[str]
Examples
>>> record = {'abstract_inverted_index': {'Hello': [0], 'world': [1]}}
>>> OpenAlexFieldMap.reconstruct_abstract(record)
'Hello world'
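To make the inverted index format concrete, the sketch below rebuilds text by pairing each word with its positions and sorting by position. The helper name `rebuild_abstract` is hypothetical, and the sketch assumes words are joined with single spaces; the library's own method may differ in details.

```python
from typing import Any, Optional


def rebuild_abstract(inverted_index: Any) -> Optional[str]:
    """Hypothetical helper: turn {word: [positions]} back into running text."""
    if not isinstance(inverted_index, dict) or not inverted_index:
        return None
    positioned: list[tuple[int, str]] = []
    for word, positions in inverted_index.items():
        if not isinstance(positions, list):
            continue
        # A word that occurs several times contributes one entry per position.
        positioned.extend((position, word) for position in positions)
    positioned.sort(key=lambda pair: pair[0])
    text = " ".join(word for _, word in positioned)
    return text or None


print(rebuild_abstract({"the": [0, 2], "cat": [1], "hat": [3]}))
```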
- record_id: str | list[str] | None
- record_type: str | list[str] | None
- subjects: str | list[str] | None
- title: str | list[str] | None
- url: str | list[str] | None
- year: str | list[str] | None
- class scholar_flux.api.normalization.PLOSFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
PLOS specific field mapping with custom transformations.
The PLOSFieldMap defines a minimal set of PLOS-specific post-processing steps to further process dictionary records after normalization.
- Post-Processed Fields:
DOI and record identifiers
Year extraction from publication date
URL reconstruction from DOI
Author and abstract normalization
Open access and license status
Note
The PLOS API provides most fields directly, but some (such as URLs) are reconstructed from the DOI. The field map configuration handles fallback paths and default values for publisher and open access status.
- abstract: str | list[str] | None
- api_specific_fields: dict[str, Any]
- authors: str | list[str] | None
- citation_count: str | list[str] | None
- date_created: str | list[str] | None
- date_published: str | list[str] | None
- default_field_values: dict[str, Any]
- doi: str | list[str] | None
- full_text: str | list[str] | None
- is_retracted: str | list[str] | None
- journal: str | list[str] | None
- keywords: str | list[str] | None
- language: str | list[str] | None
- license: str | list[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- open_access: str | list[str] | None
- provider_name: str
- publisher: str | list[str] | None
- classmethod reconstruct_plos_url(record: NormalizedRecordType, field: str = 'doi') str | None[source]
Reconstructs the PLOS article URL from the DOI of the article.
- Parameters:
record (NormalizedRecordType) – The Normalized record dictionary containing the DOI used to reconstruct the URL.
field (str) – The field to extract the DOI from.
- Returns:
Reconstructed URL if DOI is valid, None otherwise.
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization.plos_field_map import PLOSFieldMap
>>> PLOSFieldMap.reconstruct_plos_url({'doi':"10.1371/journal.pone.0123456"})
# OUTPUT: 'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123456'
>>> PLOSFieldMap.reconstruct_plos_url({})
# OUTPUT: None
- record_id: str | list[str] | None
- record_type: str | list[str] | None
- subjects: str | list[str] | None
- title: str | list[str] | None
- url: str | list[str] | None
- year: str | list[str] | None
- class scholar_flux.api.normalization.PubMedFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
PubMed specific field mapping with custom transformations.
The PubMedFieldMap builds on AcademicFieldMap, adding a minimal set of PubMed-specific post-processing steps that produce consistent, normalized record structures across several record types.
- Post-Processed Fields:
PMCID, PMID, and PII identifiers
The date and year of creation or publication
The base URL for the record
Authors (after formatting nested authorship fields)
The DOI for the record
Open access status
Abstract retrieval
Note
PubMed’s XML structure varies between Articles and BookDocuments, which is handled via fallback paths in the field_map configuration (e.g., multiple paths for ‘year’ and ‘abstract’).
Article identifiers (DOI, PMCID, PII) are extracted from the ArticleIdList by filtering on the ‘@IdType’ attribute, with additional fallback logic for DOI via ELocationID.
Open access status is determined by the presence of a PMCID, indicating the article is available in PubMed Central.
- abstract: str | list[str] | None
- api_specific_fields: dict[str, Any]
- authors: str | list[str] | None
- citation_count: str | list[str] | None
- date_created: str | list[str] | None
- date_published: str | list[str] | None
- default_field_values: dict[str, Any]
- doi: str | list[str] | None
- classmethod extract_authors(record: NormalizedRecordType, field: str = 'authors') list[str] | None[source]
Extract formatted author names combining ForeName and LastName.
- Parameters:
record (NormalizedRecordType) – Raw PubMed record dictionary
field (str) – The location to extract the nested list of authors from.
- Returns:
List of author names in ‘ForeName LastName’ format, or None if no authors
- Return type:
Optional[list[str]]
Notes
For an author with LastName=’Smith’, ForeName=’John’: Returns [‘John Smith’]
For an author with only LastName=’Smith’: Returns [‘Smith’]
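The two cases in the notes can be reproduced with a short sketch. The helper name `format_author_names` is hypothetical, and accepting either a single author dictionary or a list of them is an assumption about how parsed XML may arrive.

```python
from typing import Any, Optional


def format_author_names(authors: Any) -> Optional[list[str]]:
    """Hypothetical helper: build 'ForeName LastName' strings from author dicts."""
    if isinstance(authors, dict):
        authors = [authors]  # assumption: a lone author may arrive as a dict
    if not isinstance(authors, list):
        return None
    names: list[str] = []
    for author in authors:
        if not isinstance(author, dict):
            continue
        parts = (author.get("ForeName"), author.get("LastName"))
        # Skip missing or blank name parts so a lone LastName stands by itself.
        name = " ".join(part.strip() for part in parts if isinstance(part, str) and part.strip())
        if name:
            names.append(name)
    return names or None


print(format_author_names([{"LastName": "Smith", "ForeName": "John"}, {"LastName": "Smith"}]))
```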
- classmethod extract_date_created(record: NormalizedRecordType) str | None[source]
Extracts the creation date, falling back to the article date.
- Parameters:
record (NormalizedRecordType) – Raw PubMed record dictionary
- Returns:
ISO formatted date string (YYYY-MM-DD) or None
- Return type:
Optional[str]
Notes
Tries date created (MedlineCitation.DateCompleted) first, then falls back to article_date (ArticleDate.DateCompleted)
- classmethod extract_doi(record: NormalizedRecordType) str | None[source]
Extracts the DOI from the ArticleIdList based on the IdType attribute.
Attempts to extract DOI from two sources:
1. ArticleIdList with IdType=’doi’ (primary)
2. ELocationID with EIdType=’doi’ (fallback)
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ and ‘elocation_id’ already extracted
- Returns:
DOI string or None if not found
- Return type:
Optional[str]
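The primary-then-fallback lookup might look like the sketch below. The helper names `first_text_with_attribute` and `find_doi` are hypothetical, and the '@EIdType' and '#text' keys for the ELocationID entry are assumed by analogy with the '@IdType' and '#text' keys shown in this module's other examples.

```python
from typing import Any, Optional


def first_text_with_attribute(entries: Any, attribute: str, expected: str) -> Optional[str]:
    """Return the '#text' of the first entry whose attribute equals `expected`."""
    if isinstance(entries, dict):
        entries = [entries]  # assumption: a lone element may arrive as a dict
    if not isinstance(entries, list):
        return None
    for entry in entries:
        if isinstance(entry, dict) and entry.get(attribute) == expected:
            text = entry.get("#text")
            if isinstance(text, str) and text.strip():
                return text.strip()
    return None


def find_doi(record: dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: ArticleIdList first, ELocationID as the fallback."""
    id_list = record.get("article_id_list")
    article_ids = id_list.get("ArticleId") if isinstance(id_list, dict) else None
    return (
        first_text_with_attribute(article_ids, "@IdType", "doi")
        or first_text_with_attribute(record.get("elocation_id"), "@EIdType", "doi")
    )


record = {
    "article_id_list": {"ArticleId": [{"@IdType": "pmc", "#text": "PMC123456"}]},
    "elocation_id": {"@EIdType": "doi", "#text": "10.1234/example"},
}
print(find_doi(record))
```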
- classmethod extract_open_access(record: NormalizedRecordType) bool | None[source]
Determines if an article is open access based on PMC ID presence.
The presence of a PMCID indicates the article is available in PubMed Central, which means it is accessible as open access. This is a reliable indicator for PubMed records, though it may not capture all open access articles (e.g., those available only on publisher websites).
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ already extracted
- Returns:
True if PMCID present (open access), False if no PMCID, None if indeterminate
- Return type:
Optional[bool]
Examples
>>> from scholar_flux.api.normalization.pubmed_field_map import PubMedFieldMap
>>> record = {'article_id_list': {'ArticleId': [{'@IdType': 'pmc', '#text': 'PMC123456'}]}}
>>> PubMedFieldMap.extract_open_access(record)
# OUTPUT: True
>>> record = {'article_id_list': {'ArticleId': [{'@IdType': 'doi', '#text': '10.1234/example'}]}}
>>> PubMedFieldMap.extract_open_access(record)
# OUTPUT: False
- classmethod extract_pii(record: NormalizedRecordType) str | None[source]
Extracts the Publisher Item Identifier (PII) from the ArticleIdList.
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ already extracted
- Returns:
PII string or None if not found
- Return type:
Optional[str]
- classmethod extract_pmcid(record: NormalizedRecordType) str | None[source]
Extracts the PMC ID for full-text access from the normalized record.
Returns the PMCID without the ‘PMC’ prefix for consistency. Handles edge cases where stripping the prefix results in an empty string.
- Parameters:
record (NormalizedRecordType) – Normalized record with ‘article_id_list’ already extracted
- Returns:
PMC ID without ‘PMC’ prefix, or None if not found or invalid
- Return type:
Optional[str]
Examples
>>> from scholar_flux.api.normalization.pubmed_field_map import PubMedFieldMap
>>> record = {'article_id_list': {'ArticleId': [{'@IdType': 'pmc', '#text': 'PMC123456'}]}}
>>> PubMedFieldMap.extract_pmcid(record)
# OUTPUT: '123456'
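The empty-string edge case mentioned above can be sketched as follows. The helper name `strip_pmc_prefix` is hypothetical, and treating the prefix case-insensitively is an assumption made for this example.

```python
from typing import Any, Optional


def strip_pmc_prefix(pmcid: Any) -> Optional[str]:
    """Hypothetical helper: remove a leading 'PMC' and reject empty results."""
    if not isinstance(pmcid, str):
        return None
    value = pmcid.strip()
    if value.upper().startswith("PMC"):
        value = value[3:]
    # A bare 'PMC' leaves an empty string, which is reported as None.
    return value or None


print(strip_pmc_prefix("PMC123456"))
print(strip_pmc_prefix("PMC"))
```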
- full_text: str | list[str] | None
- is_retracted: str | list[str] | None
- journal: str | list[str] | None
- keywords: str | list[str] | None
- language: str | list[str] | None
- license: str | list[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- open_access: str | list[str] | None
- provider_name: str
- publisher: str | list[str] | None
- classmethod reconstruct_pubmed_url(record: NormalizedRecordType) str | None[source]
Reconstruct PubMed article URL from the PMID.
- Parameters:
record (NormalizedRecordType) – The record containing the ‘pmid’ field
- Returns:
A reconstructed URL if PMID is valid, None otherwise.
Examples
>>> from scholar_flux.api.normalization.pubmed_field_map import PubMedFieldMap
>>> PubMedFieldMap.reconstruct_pubmed_url({"pmid": "41418093"})
# OUTPUT: 'https://pubmed.ncbi.nlm.nih.gov/41418093/'
>>> PubMedFieldMap.reconstruct_pubmed_url({"pmid": None})
# OUTPUT: None
- record_id: str | list[str] | None
- record_type: str | list[str] | None
- subjects: str | list[str] | None
- title: str | list[str] | None
- url: str | list[str] | None
- year: str | list[str] | None
- class scholar_flux.api.normalization.SpringerNatureFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]
Bases: AcademicFieldMap
Springer Nature specific field mapping with custom transformations.
The SpringerNatureFieldMap defines a minimal set of Springer Nature-specific post-processing steps that aid in the normalization of preprocessed records retrieved from the Springer Nature API.
- Post-Processed Fields:
DOI and record identifiers
Year extraction from publication date
URL extraction from nested URL objects
Open access status conversion from string to boolean
Author and abstract normalization
Note
Springer Nature records may contain arrays and nested objects for key fields (e.g., year and primary URL). The field map configuration and post-processing logic handle these for consistent normalization.
- abstract: str | list[str] | None
- api_specific_fields: dict[str, Any]
- authors: str | list[str] | None
- citation_count: str | list[str] | None
- date_created: str | list[str] | None
- date_published: str | list[str] | None
- default_field_values: dict[str, Any]
- doi: str | list[str] | None
- classmethod extract_open_access(record: NormalizedRecordType, field: str = 'open_access') bool | None[source]
Extracts the current record’s open access status by delegating processing to .extract_boolean_field().
- Parameters:
record (NormalizedRecordType) – The normalized record to extract the open_access status for
field (str) – The field to extract the open access status from.
- Returns:
The open access status of the record when available, and None otherwise.
- Return type:
Optional[bool]
Examples
>>> from scholar_flux.api.normalization import SpringerNatureFieldMap
>>> record = {"doi": "10.1234/example", "title": "Sample Article","open_access": "true"}
>>> SpringerNatureFieldMap.extract_open_access(record)
# OUTPUT: True
>>> record = {"doi": "10.5678/example", "title": "Sample Publication","open_access": "false"}
>>> SpringerNatureFieldMap.extract_open_access(record)
# OUTPUT: False
>>> record = {"title": "Another Article","open_access": "N/A"}
>>> SpringerNatureFieldMap.extract_open_access(record)
# OUTPUT: None
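The string-to-boolean conversion shown in these examples could be approximated as below. This is not the library's `extract_boolean_field`; the helper name `to_boolean` and the accepted string sets are assumptions chosen only to match the three documented cases.

```python
from typing import Any, Optional

# Assumption: only these spellings are recognized; anything else is indeterminate.
TRUE_STRINGS = frozenset({"true"})
FALSE_STRINGS = frozenset({"false"})


def to_boolean(value: Any) -> Optional[bool]:
    """Hypothetical helper: map 'true'/'false' strings to booleans, else None."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        normalized = value.strip().lower()
        if normalized in TRUE_STRINGS:
            return True
        if normalized in FALSE_STRINGS:
            return False
    return None


for raw in ("true", "false", "N/A"):
    print(raw, to_boolean(raw))
```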
- classmethod extract_primary_url(record: NormalizedRecordType, field: str = 'url', pattern_delimiter: str | Pattern | None = re.compile('; *(?=http)|, *(?=http)|\\| *(?=http)')) str | None[source]
Extracts the primary (or first valid) URL from a record field with a flat or nested JSON structure.
- Parameters:
record (NormalizedRecordType) – The normalized record to extract the primary URL field from
field (str) – The field to extract a primary URL from
pattern_delimiter (Optional[str | re.Pattern]) – An optional pattern used to separate URLs when combined as a list for primary URL extraction
- Returns:
The extracted primary URL when extraction is successful, and None otherwise.
- Return type:
Optional[str]
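The default `pattern_delimiter` splits only where a separator is immediately followed by `http`, so commas or semicolons inside a URL are left alone. The sketch below reuses that pattern; the helper name `first_url` is hypothetical, and reading nested URL objects from a 'value' key is an assumption about the record layout.

```python
import re
from typing import Any, Optional

# Same expression as the documented default for pattern_delimiter.
URL_DELIMITER = re.compile(r"; *(?=http)|, *(?=http)|\| *(?=http)")


def first_url(value: Any, pattern: re.Pattern = URL_DELIMITER) -> Optional[str]:
    """Hypothetical helper: return the first http(s) URL in a flat or nested field."""
    if isinstance(value, str):
        for candidate in pattern.split(value):
            candidate = candidate.strip()
            if candidate.startswith("http"):
                return candidate
        return None
    if isinstance(value, dict):
        # Assumption: nested URL objects keep the address under a 'value' key.
        return first_url(value.get("value"), pattern)
    if isinstance(value, (list, tuple)):
        for item in value:
            url = first_url(item, pattern)
            if url:
                return url
    return None


print(first_url("https://a.org/x,y| https://b.org"))
print(first_url([{"value": "http://link.springer.com/1"}, {"value": "http://example.org/2"}]))
```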
- full_text: str | list[str] | None
- is_retracted: str | list[str] | None
- journal: str | list[str] | None
- keywords: str | list[str] | None
- language: str | list[str] | None
- license: str | list[str] | None
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_post_init(context: Any, /) None
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
- Parameters:
self – The BaseModel instance.
context – The context.
- open_access: str | list[str] | None
- provider_name: str
- publisher: str | list[str] | None
- record_id: str | list[str] | None
- record_type: str | list[str] | None
- subjects: str | list[str] | None
- title: str | list[str] | None
- url: str | list[str] | None
- year: str | list[str] | None