scholar_flux.api.models package

Submodules

scholar_flux.api.models.api_parameters module

The scholar_flux.api.models.api_parameters module implements the APIParameterMap and APIParameterConfig classes.

These two classes are designed for flexibility in the creation and handling of API Responses given provider-specific differences in request parameters and configuration.

Classes:
APIParameterMap:

Extends the BaseAPIParameterMap to provide factory functions and utilities to more efficiently retrieve and use default parameter maps.

APIParameterConfig:

Uses or creates an APIParameterMap to prepare request parameters according to the specifications of the current provider’s API.

class scholar_flux.api.models.api_parameters.APIParameterConfig(parameter_map: APIParameterMap)[source]

Bases: object

Uses an APIParameterMap instance and runtime parameter values to build parameter dictionaries for API requests.

Parameters:

parameter_map (APIParameterMap) – The mapping of universal to API-specific parameter names.

Class Attributes:
DEFAULT_CORRECT_ZERO_INDEX (bool):

When True, autocorrects zero-indexed API parameter specifications so that only positive page values are accepted. When False, page calculations start from page 0 for zero-indexed APIs (e.g., arXiv).

Examples

>>> from scholar_flux.api import APIParameterConfig, APIParameterMap
>>> # the API parameter map is defined and used to resolve parameters to the API's language
>>> api_parameter_map = APIParameterMap(
... query='q', records_per_page = 'pagesize', start = 'page', auto_calculate_page = False
... )
>>> # The APIParameterConfig defines the class and settings that indicate how to create requests
>>> api_parameter_config = APIParameterConfig(api_parameter_map, auto_calculate_page = False)
>>> # Builds parameters using the specification from the APIParameterMap
>>> page = api_parameter_config.build_parameters(query= 'ml', page = 10, records_per_page=50)
>>> print(page)
# OUTPUT: {'q': 'ml', 'page': 10, 'pagesize': 50}
DEFAULT_CORRECT_ZERO_INDEX: ClassVar[bool] = True
__init__(*args: Any, **kwargs: Any) None
add_parameter(name: str, description: str | None = None, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False, inplace: bool = True) APIParameterConfig[source]

Passes keyword arguments to the current parameter map to add a new API-specific parameter to its config.

Parameters:
  • name (str) – The name of the parameter used when sending requests to APIs.

  • description (str) – A description of the API-specific parameter.

  • validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.

  • default (Any) – A default value used for the parameter if not specified by the user

  • required (bool) – Indicates whether the current parameter is required for API calls.

  • inplace (bool) –

    A flag that, if True, modifies the current parameter map instance in place. If False, it returns a new parameter map that contains the added parameter, while leaving the original unchanged.

    Note: If this instance is shared (e.g., retrieved from provider_registry) and inplace=True, changes will affect all references to this parameter map.

Returns:

An APIParameterConfig with the updated parameter map. If inplace=True, the original is returned. Otherwise, a new APIParameterConfig is returned whose parameter map contains an updated api_specific_parameters dict.

Return type:

APIParameterConfig

classmethod as_config(parameter_map: dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) APIParameterConfig[source]

Factory method for creating a new APIParameterConfig from a dictionary or APIParameterMap.

This helper class method resolves the structure of the APIParameterConfig against its basic building blocks to create a new configuration when possible.

Parameters:

parameter_map (dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) – A parameter mapping/config to use in the instantiation of an APIParameterConfig.

Returns:

A new APIParameterConfig created from the inputs.

Return type:

APIParameterConfig

Raises:

APIParameterException – If there is an error in the creation/resolution of the required parameters

build_parameters(query: str | None, page: int | None, records_per_page: int, **api_specific_parameters: Any) Dict[str, Any][source]

Builds the dictionary of request parameters using the current parameter map and provided values at runtime.

Parameters:
  • query (Optional[str]) – The search query string.

  • page (Optional[int]) – The page number for pagination (1-based).

  • records_per_page (int) – Number of records to fetch per page.

  • **api_specific_parameters – Additional API-specific parameters to include.

Returns:

The fully constructed API request parameters dictionary, with keys as API-specific parameter names and values as provided.

Return type:

Dict[str, Any]
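
To illustrate how universal names are translated and how auto_calculate_page could change the pagination value, here is a minimal standalone sketch that does not import scholar_flux. The function build_parameters_sketch, the name_map dictionary, and the start-index formula are assumptions made for illustration; the library's actual calculation may differ.

```python
from typing import Any, Dict, Optional


def build_parameters_sketch(
    name_map: Dict[str, str],
    query: Optional[str],
    page: Optional[int],
    records_per_page: int,
    auto_calculate_page: bool = True,
    **api_specific_parameters: Any,
) -> Dict[str, Any]:
    """Translate universal parameter names into provider-specific ones (illustrative only)."""
    start = page
    if page is not None and auto_calculate_page:
        # Assumed formula: convert a 1-based page number into a 1-based record offset.
        start = 1 + (page - 1) * records_per_page

    candidates = {
        name_map["query"]: query,
        name_map["start"]: start,
        name_map["records_per_page"]: records_per_page,
        **api_specific_parameters,
    }
    # Drop parameters that were not supplied.
    return {key: value for key, value in candidates.items() if value is not None}


name_map = {"query": "q", "start": "page", "records_per_page": "pagesize"}
# Page number passed through unchanged.
direct = build_parameters_sketch(name_map, "ml", 10, 50, auto_calculate_page=False)
# Page number converted into a record offset.
offset = build_parameters_sketch(name_map, "ml", 3, 50, auto_calculate_page=True)
```

With auto_calculate_page=False the sketch reproduces the dictionary shown in the class example above; with auto_calculate_page=True the value sent under the start parameter becomes a record offset instead of a page number.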

extract_parameters(parameters: dict[str, Any] | None) dict[str, Any][source]

Extracts all API-specific parameters from a dictionary. This is helpful when keywords must be extracted by provider.

Note: this method modifies the original parameter dictionary, using the pop() method to extract all values identified as api_specific_parameters from the parameters dictionary when possible. These extracted parameters are then returned in a separate dictionary.

Useful for reorganizing dictionaries that contain dynamically specified input parameters for distinct APIs.

Parameters:

parameters (Optional[dict[str, Any]]) – An optional parameter dictionary from which to extract API-specific parameters.

Returns:

A dictionary containing all extracted parameters if available.

Return type:

dict[str, Any]
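
Because the method mutates its input, it can help to see the side effect in isolation. The following standalone sketch shows pop()-based extraction; extract_parameters_sketch and the explicit api_specific_names argument are hypothetical stand-ins (the real method reads the names from its parameter map).

```python
from typing import Any, Dict, Iterable, Optional


def extract_parameters_sketch(
    parameters: Optional[Dict[str, Any]],
    api_specific_names: Iterable[str],
) -> Dict[str, Any]:
    """Pop known API-specific keys out of `parameters` and return them separately."""
    if not parameters:
        return {}
    extracted: Dict[str, Any] = {}
    for name in api_specific_names:
        if name in parameters:
            # pop() removes the key from the caller's dictionary.
            extracted[name] = parameters.pop(name)
    return extracted


request_kwargs = {"query": "ml", "mailto": "user@example.com", "sort": "date"}
extracted = extract_parameters_sketch(request_kwargs, ["mailto", "sort"])
# request_kwargs now only holds the keys that were not extracted.
```

If the original dictionary must stay intact, passing a shallow copy (for example dict(request_kwargs)) avoids the mutation.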

classmethod from_defaults(provider_name: str, **additional_parameters: Any) APIParameterConfig[source]

Factory method to create APIParameterConfig instances with sensible defaults for known APIs.

If the provider_name does not exist, the code will raise an exception.

Parameters:
  • provider_name (str) – The name of the API to create the parameter map for.

  • api_key (Optional[str]) – API key value if required.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter config instance for the specified API.

Return type:

APIParameterConfig

Raises:

NotImplementedError – If the API name is unknown.

classmethod get_defaults(provider_name: str, **additional_parameters: Any) APIParameterConfig | None[source]

Factory method to create APIParameterConfig instances with sensible defaults for known APIs.

Avoids throwing an error if the provider name does not already exist.

Parameters:
  • provider_name (str) – The name of the API to create the parameter map for.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter config instance for the specified API. Returns None if a mapping for the provider_name cannot be retrieved.

Return type:

Optional[APIParameterConfig]
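
The difference between from_defaults and get_defaults is only in how an unknown provider is reported. A minimal standalone sketch of that pattern follows; the _KNOWN_MAPS dictionary and both function names are hypothetical, and the mapping values are placeholders rather than the library's real defaults.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the provider registry; the values are placeholders.
_KNOWN_MAPS: Dict[str, Dict[str, str]] = {
    "plos": {"query": "q", "start": "start", "records_per_page": "rows"},
}


def get_defaults_sketch(provider_name: str) -> Optional[Dict[str, str]]:
    """Return a copy of the defaults for a provider, or None when it is unknown."""
    defaults = _KNOWN_MAPS.get(provider_name.lower())
    return dict(defaults) if defaults is not None else None


def from_defaults_sketch(provider_name: str) -> Dict[str, str]:
    """Return the defaults for a provider, raising when it is unknown."""
    defaults = get_defaults_sketch(provider_name)
    if defaults is None:
        raise NotImplementedError(f"No default parameter map for '{provider_name}'")
    return defaults
```

The None-returning variant suits optional lookups; the raising variant suits code paths where a missing provider is a configuration error.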

property map: APIParameterMap

Helper property that is an alias for the APIParameterMap attribute.

The APIParameterMap maps all universal parameters to the parameter names specific to the API provider.

Returns:

The mapping that the current APIParameterConfig will use to build a dictionary of parameter requests specific to the current API.

Return type:

APIParameterMap

parameter_map: APIParameterMap
show_parameters() list[source]

Helper method to show the complete list of all parameters that can be found in the current_mappings.

Returns:

The complete list of all universal and API-specific parameters corresponding to the current API.

Return type:

List

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the APIParameterConfig.

class scholar_flux.api.models.api_parameters.APIParameterMap(*, query: str, records_per_page: str, start: str | None = None, api_key_parameter: str | None = None, api_key_required: bool = False, auto_calculate_page: bool = True, zero_indexed_pagination: bool = False, api_specific_parameters: ~typing.Dict[str, ~scholar_flux.api.models.base_parameters.APISpecificParameter] = <factory>)[source]

Bases: BaseAPIParameterMap

Extends BaseAPIParameterMap by adding validation and the optional retrieval of provider defaults for known APIs.

This class also specifies default mappings for specific attributes such as API keys and additional parameter names.

query

The API-specific parameter name for the search query.

Type:

str

start

The API-specific parameter name for pagination (start index or page number).

Type:

Optional[str]

records_per_page

The API-specific parameter name for records per page.

Type:

str

api_key_parameter

The API-specific parameter name for the API key.

Type:

Optional[str]

api_key_required

Indicates whether an API key is required.

Type:

bool

auto_calculate_page

If True, calculates start index from page; if False, passes page number directly.

Type:

bool

zero_indexed_pagination

If True, treats 0 as an allowed page value when retrieving data from APIs.

Type:

bool

api_specific_parameters

Additional universal to API-specific parameter mappings.

Type:

Dict[str, APISpecificParameter]

api_key_parameter: str | None
api_key_required: bool
api_specific_parameters: Dict[str, APISpecificParameter]
auto_calculate_page: bool
classmethod from_defaults(provider_name: str, **additional_parameters: Any) APIParameterMap[source]

Factory method that uses the APIParameterMap.get_defaults classmethod to retrieve the provider config.

Raises an error if the provider does not exist.

Parameters:
  • provider_name (str) – The name of the API to create the parameter map for.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter map for the specified API.

Return type:

APIParameterMap

Raises:

NotImplementedError – If the API name is unknown.

classmethod get_defaults(provider_name: str, **additional_parameters: Any) APIParameterMap | None[source]

Factory method to create APIParameterMap instances with sensible defaults for known APIs.

This class method attempts to pull from the list of known providers defined in the scholar_flux.api.providers.provider_registry and returns None if an APIParameterMap for the provider cannot be found.

Using the additional_parameters keyword arguments, users can specify optional overrides for specific parameters if needed. This is helpful in circumstances where an API’s specification overlaps with that of a known provider.

Valid providers (as indicated in provider_registry) include:

  • springernature

  • plos

  • arxiv

  • openalex

  • core

  • crossref

Parameters:
  • provider_name (str) – The name of the API provider to retrieve the parameter map for.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter map for the specified API.

Return type:

Optional[APIParameterMap]

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

query: str
records_per_page: str
classmethod set_default_api_key_parameter(values: dict[str, Any]) dict[str, Any][source]

Sets the default for the API key parameter when api_key_required=True and api_key_parameter is None.

Parameters:

values (dict[str, Any]) – The dictionary of attributes to validate

Returns:

The updated parameter values passed to the APIParameterMap. api_key_parameter is set to “api_key” if a key is required but the parameter name is not specified.

Return type:

dict[str, Any]
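
The defaulting rule can be expressed as a plain function over the values dictionary. This is a sketch under the assumption that the validator receives raw field values before model construction; the function name is hypothetical and the pydantic decorator is omitted so the example runs without dependencies.

```python
from typing import Any, Dict


def set_default_api_key_parameter_sketch(values: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in "api_key" as the parameter name when a key is required but unnamed."""
    if values.get("api_key_required") and values.get("api_key_parameter") is None:
        # Return a copy rather than mutating the caller's dictionary.
        return {**values, "api_key_parameter": "api_key"}
    return values


defaulted = set_default_api_key_parameter_sketch({"query": "q", "api_key_required": True})
explicit = set_default_api_key_parameter_sketch(
    {"query": "q", "api_key_required": True, "api_key_parameter": "api-key"}
)
not_required = set_default_api_key_parameter_sketch({"query": "q", "api_key_required": False})
```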

start: str | None
classmethod validate_api_specific_parameter_mappings(values: dict[str, Any]) dict[str, Any][source]

Validates the additional mappings provided to the APIParameterMap.

This method validates that the input is a dictionary of mappings that consists of only string-typed keys mapped to API-specific parameters as defined by the APISpecificParameter class.

Parameters:

values (dict[str, Any]) – The dictionary of attribute values to validate.

Returns:

The updated dictionary if validation passes.

Return type:

dict[str, Any]

Raises:

APIParameterException – If api_specific_parameters is not a dictionary or contains non-string keys/values.

zero_indexed_pagination: bool

scholar_flux.api.models.base_parameters module

The scholar_flux.api.models.base_parameters module implements BaseAPIParameterMap and APISpecificParameter classes.

These classes define the core and API-specific fields required to interact with and create requests to API providers.

Classes:

BaseAPIParameterMap:

Defines parameters for interacting with a provider’s API specification.

APISpecificParameter:

Defines optional and required parameters specific to an API provider.

class scholar_flux.api.models.base_parameters.APISpecificParameter(name: str, description: str, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False)[source]

Bases: object

Dataclass that defines the specification of an API-specific parameter for an API provider.

Implements optionally specifiable defaults, validation steps, and indicators for optional vs. required fields.

Parameters:
  • name (str) – The name of the parameter used when sending requests to APIs.

  • description (str) – A description of the API-specific parameter.

  • validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.

  • default (Any) – A default value used for the parameter if not specified by the user

  • required (bool) – Indicates whether the current parameter is required for API calls.

__init__(*args: Any, **kwargs: Any) None
default: Any = None
description: str
name: str
required: bool = False
structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for showing the structure of the current APISpecificParameter.

validator: Callable[[Any], Any] | None = None
property validator_name: str

Helper method for generating a human-readable string from the validator function, if used.

class scholar_flux.api.models.base_parameters.BaseAPIParameterMap(*, query: str, records_per_page: str, start: str | None = None, api_key_parameter: str | None = None, api_key_required: bool = False, auto_calculate_page: bool = True, zero_indexed_pagination: bool = False, api_specific_parameters: ~typing.Dict[str, ~scholar_flux.api.models.base_parameters.APISpecificParameter] = <factory>)[source]

Bases: BaseModel

Base class for mapping universal SearchAPI parameter names to API-specific parameter names.

Includes core logic for distinguishing parameter names, indicating required API keys, and defining pagination logic.

query

The API-specific parameter name for the search query.

Type:

str

start

The API-specific parameter name for optional pagination (start index or page number).

Type:

Optional[str]

records_per_page

The API-specific parameter name for records per page.

Type:

str

api_key_parameter

The API-specific parameter name for the API key.

Type:

Optional[str]

api_key_required

Indicates whether an API key is required.

Type:

bool

page_required

If True, indicates that a page is required.

Type:

bool

auto_calculate_page

If True, calculates start index from page; if False, passes page number directly.

Type:

bool

zero_indexed_pagination

Treats page=0 as an allowed page value when retrieving data from the API.

Type:

bool

api_specific_parameters

Additional API-specific parameter mappings.

Type:

Dict[str, APISpecificParameter]

add_parameter(name: str, description: str | None = None, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False, inplace: bool = True) Self[source]

Helper method that enables the efficient addition of parameters to the current parameter map.

Parameters:
  • name (str) – The name of the parameter used when sending requests to APIs.

  • description (str) – A description of the API-specific parameter.

  • validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.

  • default (Any) – A default value used for the parameter if not specified by the user

  • required (bool) – Indicates whether the current parameter is required for API calls.

  • inplace (bool) –

    A flag that, if True, modifies the current parameter map instance in place. If False, it returns a new parameter map that contains the added parameter, while leaving the original unchanged.

    Note: If this instance is shared (e.g., retrieved from provider_registry) and inplace=True, changes will affect all references to this parameter map.

Returns:

A parameter map containing the specified parameter. If inplace=True, the original is returned. Otherwise a new parameter map containing an updated api_specific_parameters dict is returned.

Return type:

Self
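
The inplace flag determines whether callers that share a map observe the new parameter. The following standalone sketch mirrors that behaviour with a plain dataclass; ParameterMapSketch and its simplified add_parameter signature are hypothetical and only illustrate the copy-versus-mutate distinction.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ParameterMapSketch:
    """Hypothetical stand-in for a parameter map."""

    api_specific_parameters: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def add_parameter(
        self, name: str, default: Any = None, inplace: bool = True
    ) -> "ParameterMapSketch":
        # Work on this instance, or on a deep copy when inplace=False.
        target = self if inplace else copy.deepcopy(self)
        target.api_specific_parameters[name] = {"name": name, "default": default}
        return target


shared = ParameterMapSketch()
copied = shared.add_parameter("mailto", inplace=False)  # `shared` is left unchanged
same = shared.add_parameter("sort", default="date")     # `shared` itself is modified
```

Using inplace=False is the safer choice when the map was obtained from a shared registry and the change should stay local.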

api_key_parameter: str | None
api_key_required: bool
api_specific_parameters: Dict[str, APISpecificParameter]
auto_calculate_page: bool
classmethod from_dict(obj: Dict[str, Any]) BaseAPIParameterMap[source]

Create a new instance of BaseAPIParameterMap from a dictionary.

Parameters:

obj (dict) – The dictionary containing the data for the new instance.

Returns:

A new instance created from the given dictionary.

Return type:

BaseAPIParameterMap

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

query: str
records_per_page: str
show_parameters() list[source]

Helper method to show the complete list of all parameters that can be found in the current ParameterMap.

Returns:

The complete list of all universal and API-specific parameters corresponding to the current API

Return type:

List

start: str | None
structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the BaseAPIParameterMap.

to_dict() Dict[str, Any][source]

Convert the current instance into a dictionary representation.

Returns:

A dictionary representation of the current instance.

Return type:

Dict

update(other: BaseAPIParameterMap | Dict[str, Any]) BaseAPIParameterMap[source]

Update the current instance with values from another BaseAPIParameterMap or dictionary.

Parameters:

other (BaseAPIParameterMap | Dict) – The object containing updated values.

Returns:

A new instance with updated values.

Return type:

BaseAPIParameterMap

zero_indexed_pagination: bool

scholar_flux.api.models.base_provider_dict module

The scholar_flux.api.models.base_provider_dict module implements a BaseProviderDict to extend the dictionary and resolve provider names to a generic key, handling the normalization of provider names for consistent access.

class scholar_flux.api.models.base_provider_dict.BaseProviderDict(dict=None, /, **kwargs)[source]

Bases: UserDict[str, Any]

The BaseProviderDict extends the dictionary to resolve minor naming variations in keys to the same provider name.

The BaseProviderDict uses the ProviderConfig._normalize_name method to ignore underscores and case-sensitivity.
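
As a rough picture of key normalization on a UserDict, consider the standalone sketch below. NormalizedKeyDict and its _normalize rule (lowercase, underscores removed) are assumptions based on the description above, not the library's code.

```python
from collections import UserDict
from typing import Any


class NormalizedKeyDict(UserDict):
    """Hypothetical dictionary that normalizes keys before storing or reading them."""

    @staticmethod
    def _normalize(key: str) -> str:
        # Assumed rule: ignore case and underscores.
        return key.lower().replace("_", "")

    def __setitem__(self, key: str, value: Any) -> None:
        super().__setitem__(self._normalize(key), value)

    def __getitem__(self, key: str) -> Any:
        return super().__getitem__(self._normalize(key))

    def __contains__(self, key: object) -> bool:
        return isinstance(key, str) and super().__contains__(self._normalize(key))


registry = NormalizedKeyDict()
registry["Springer_Nature"] = "config-placeholder"
value = registry["springernature"]  # the same entry, reached through a different spelling
```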

find(key: str | Pattern, regex: bool | None = None) list[str][source]

Identifies providers with names matching the specified pattern using either prefix or regex pattern matching.

This implementation uses fuzzy finding, or “flexible matching that’s more forgiving than exact”. When regex=True or a compiled Pattern is provided, regex matching is used. Otherwise, provider names are filtered using prefix matching via str.startswith after normalizing the provided key and provider names.

Parameters:
  • key (str | re.Pattern) – The key or pattern to match using regular expressions or prefix matching.

  • regex (Optional[bool]) – Indicates whether regular expressions should be used to match provider names.

Returns:

A list of strings containing provider names that match the key/pattern.

Return type:

list[str]

Note

Unless a compiled pattern is received or regex=True, providers are matched if the normalized provider name starts with the normalized key.
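
The two matching modes can be sketched as a standalone function. The names find_sketch and normalize_name are hypothetical, and the normalization rule (lowercase, underscores removed) is an assumption based on the class description.

```python
import re
from typing import List, Optional, Union


def normalize_name(name: str) -> str:
    """Assumed normalization: lowercase and drop underscores."""
    return name.lower().replace("_", "")


def find_sketch(
    providers: List[str],
    key: Union[str, re.Pattern],
    regex: Optional[bool] = None,
) -> List[str]:
    """Return provider names matching `key` by prefix or by regular expression."""
    if isinstance(key, re.Pattern) or regex:
        pattern = key if isinstance(key, re.Pattern) else re.compile(key)
        return [name for name in providers if pattern.search(name)]
    prefix = normalize_name(key)
    return [name for name in providers if normalize_name(name).startswith(prefix)]


providers = ["springernature", "plos", "arxiv", "openalex", "core", "crossref"]
prefix_matches = find_sketch(providers, "C_R")            # prefix match after normalization
regex_matches = find_sketch(providers, r"x$", regex=True)  # names ending in "x"
```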

property providers: list[str]

Returns a list containing the names of all providers (keys) in the current registry.

Returns:

A complete list of all keys in the current registry.

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the BaseProviderDict or subclass.

scholar_flux.api.models.provider_config module

The scholar_flux.api.models.provider_config module implements the basic provider configuration necessary for interacting with APIs.

It provides the foundational information necessary for the SearchAPI to resolve provider names to the URLs of the providers, as well as basic defaults necessary for interaction.

class scholar_flux.api.models.provider_config.ProviderConfig(*, provider_name: Annotated[str, MinLen(min_length=1)], base_url: str, parameter_map: BaseAPIParameterMap, metadata_map: ResponseMetadataMap | None = None, field_map: BaseFieldMap | None = None, records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: Annotated[float, Ge(ge=0)] = 6.1, api_key_env_var: str | None = None, docs_url: str | None = None, display_name: Annotated[str, MinLen(min_length=1)] = '')[source]

Bases: BaseModel

Config that holds the basic instructions and settings necessary to interact with new providers. Configs for the default providers are created on package initialization in the scholar_flux.api.providers submodule. A new custom provider or override can be added to the provider_registry (a custom user dictionary) from the scholar_flux.api.providers module.

Parameters:
  • provider_name (str) – The name of the provider to be associated with the config.

  • base_url (str) – The URL of the provider to send requests with the specified parameters.

  • parameter_map (BaseAPIParameterMap) – The parameter map indicating the specific semantics of the API.

  • metadata_map (Optional[ResponseMetadataMap]) – Defines the names of metadata fields used to distinguish response characteristics.

  • field_map (Optional[BaseFieldMap]) – A provider-specific field map that normalizes processed response records into a universal record structure.

  • records_per_page (int) – Generally the upper limit (for some APIs) or reasonable limit for the number of retrieved records per request (specific to the API provider).

  • request_delay (float) – Indicates exactly how many seconds to wait before sending successive requests. Note that the requested interval may vary based on the API provider.

  • api_key_env_var (Optional[str]) – Indicates the environment variable to look for if the API requires or accepts API keys.

  • docs_url (Optional[str]) – An optional URL that indicates where documentation related to the use of the API can be found.

Example Usage:
>>> from scholar_flux.api import ProviderConfig, APIParameterMap, SearchAPI
>>> # Maps each of the individual parameters required to interact with the Guardian API
>>> parameters = APIParameterMap(query='q',
...                              start='page',
...                              records_per_page='page-size',
...                              api_key_parameter='api-key',
...                              auto_calculate_page=False,
...                              api_key_required=True)
>>> # creating the config object that holds the basic configuration necessary to interact with the API
>>> guardian_config = ProviderConfig(provider_name = 'GUARDIAN',
...                                  parameter_map = parameters,
...                                  base_url = 'https://content.guardianapis.com/search',
...                                  records_per_page=10,
...                                  api_key_env_var='GUARDIAN_API_KEY',
...                                  request_delay=6)
>>> api = SearchAPI.from_provider_config(query = 'economic welfare',
...                                      provider_config = guardian_config,
...                                      use_cache = True)
>>> assert api.provider_name == 'guardian'
>>> response = api.search(page = 1) # assumes that you have the GUARDIAN_API_KEY stored as an env variable
>>> assert response.ok
api_key_env_var: str | None
property api_key_required: bool

References the APIParameterMap to determine whether an API key is required.

base_url: str
display_name: str
docs_url: str | None
field_map: BaseFieldMap | None
property map: BaseAPIParameterMap

Helper property that is an alias for the APIParameterMap attribute.

The APIParameterMap maps all universal parameters to the parameter names specific to the API provider.

Returns:

The mapping that the current APIParameterConfig will use to build a dictionary of parameter requests specific to the current API.

Return type:

BaseAPIParameterMap

metadata_map: ResponseMetadataMap | None
model_config: ClassVar[ConfigDict] = {'str_strip_whitespace': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod normalize_provider_name(v: str) str[source]

Helper method for normalizing the names of providers to a consistent structure.

parameter_map: BaseAPIParameterMap
classmethod prepare_fields(values: dict[str, Any]) dict[str, Any][source]

Model validator used to prepare fields for the ProviderConfig prior to further field validation.

provider_name: str
records_per_page: int
request_delay: float
search_config_defaults() dict[str, Any][source]

Convenience method for retrieving ProviderConfig fields as a dict. Useful for providing the missing information needed to create a SearchAPIConfig object for a provider when only the provider_name has been provided.

Returns:

A dictionary containing the URL, name, records_per_page, and request_delay for the current provider.

Return type:

dict

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the ProviderConfig.

classmethod validate_base_url(v: str) str[source]

Validates the current URL and raises an APIParameterException if invalid.

classmethod validate_docs_url(v: str | None) str | None[source]

Validates the documentation URL and raises an APIParameterException if invalid.

scholar_flux.api.models.provider_registry module

The scholar_flux.api.models.provider_registry module implements the ProviderRegistry class, which extends Python’s user-dictionary implementation to map providers to their unique scholar_flux ProviderConfig.

When scholar_flux uses the name of a provider to create a SearchAPI or SearchCoordinator, the package-level scholar_flux.api.providers.provider_registry is referenced to retrieve the necessary configuration for easier interaction and specification of APIs.

class scholar_flux.api.models.provider_registry.ProviderRegistry(dict=None, /, **kwargs)[source]

Bases: BaseProviderDict

The ProviderRegistry implementation allows the smooth and efficient retrieval of API parameter maps and default configuration settings to aid in the creation of a SearchAPI that is specific to the current API.

Note that the ProviderRegistry uses the ProviderConfig._normalize_name method to ignore underscores and case-sensitivity.

- ProviderRegistry.from_defaults

Dynamically imports configurations stored within scholar_flux.api.providers, and fails gracefully if a provider’s module does not contain a ProviderConfig.

- ProviderRegistry.get

Resolves a provider name to its ProviderConfig if it exists in the registry.

- ProviderRegistry.get_from_url

Resolves a provider URL to its ProviderConfig if it exists in the registry.

add(provider_config: ProviderConfig) None[source]

Helper method for adding a new provider to the provider registry.

create(provider_name: str, **kwargs: Any) ProviderConfig[source]

Helper method that creates and registers a new ProviderConfig with the current provider registry.

Parameters:
  • provider_name (str) – The name of the provider to create a new provider_config for.

  • **kwargs – Additional keyword arguments to pass to scholar_flux.api.models.ProviderConfig

Returns:

The newly created provider configuration when possible.

Return type:

ProviderConfig

Raises:

APIParameterException – If an unexpected error occurs during the creation of a new ProviderConfig.

classmethod from_defaults() ProviderRegistry[source]

Dynamically loads provider configurations from the scholar_flux.api.providers module.

This method specifically uses the provider_name of each provider listed within the scholar_flux.api.providers.provider_registry to look up and return its ProviderConfig.

Returns:

A new registry containing the loaded default provider configurations.

Return type:

ProviderRegistry

get_display_name(provider_name: str, default: str | None = None) str | None[source]

Finds the human-readable name for a provider if it exists.

If the provider doesn’t exist within the registry, the result falls back to the default if available and None otherwise.

Parameters:
  • provider_name (str) – The provider identifier to look up.

  • default (Optional[str]) – The name to fall back to. If not specified, None is returned instead.

Returns:

The display name if the provider exists, otherwise the default is returned.

Return type:

Optional[str]

get_from_url(provider_url: str | None) ProviderConfig | None[source]

Attempt to retrieve a ProviderConfig instance for the given provider by resolving the provided URL to the provider’s base URL. Will not throw an error in the event that the provider does not exist.

Parameters:

provider_url (Optional[str]) – URL of the provider to look up.

Returns:

The configuration instance for the provider if it exists, otherwise None.

Return type:

Optional[ProviderConfig]

remove(provider_name: str) None[source]

Helper method for removing a provider configuration from the provider registry.

resolve_config(provider_url: str | None = None, provider_name: str | None = None, verbose: bool = True) ProviderConfig | None[source]

Helper method to resolve mismatches between the URL and the provider_name when both are provided. The default behavior is to always prefer a provided provider_url over the provider_name to offer maximum flexibility.

Parameters:
  • provider_url (Optional[str]) – The prospective URL associated with a provider configuration.

  • provider_name (Optional[str]) – The prospective name of the provider associated with a provider configuration.

  • verbose (bool) – Determines whether the origin of the configuration should be logged.

Returns:

A provider configuration resolved with priority given to the base URL, falling back to the provider name otherwise. If neither the base URL nor the provider name resolves to a known provider, None is returned instead.

Return type:

Optional[ProviderConfig]
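
The URL-first priority can be sketched without the library as follows. The REGISTRY dictionary, its placeholder URLs, and the URL comparison (case-insensitive, trailing slash ignored) are all assumptions for illustration; the library's URL normalization may be more involved.

```python
from typing import Dict, Optional

# Hypothetical registry keyed by provider name; the URLs are placeholders.
REGISTRY: Dict[str, Dict[str, str]] = {
    "alpha": {"provider_name": "alpha", "base_url": "https://api.alpha.example/search"},
    "beta": {"provider_name": "beta", "base_url": "https://api.beta.example/v1"},
}


def get_from_url_sketch(provider_url: Optional[str]) -> Optional[Dict[str, str]]:
    """Look up a config by comparing URLs after trimming trailing slashes and case."""
    if not provider_url:
        return None
    wanted = provider_url.rstrip("/").lower()
    for config in REGISTRY.values():
        if config["base_url"].rstrip("/").lower() == wanted:
            return config
    return None


def resolve_config_sketch(
    provider_url: Optional[str] = None,
    provider_name: Optional[str] = None,
) -> Optional[Dict[str, str]]:
    """Prefer a URL match, fall back to the provider name, otherwise return None."""
    from_url = get_from_url_sketch(provider_url)
    if from_url is not None:
        return from_url
    return REGISTRY.get(provider_name.lower()) if provider_name else None


# The URL points at "beta", so it wins over the mismatched name "alpha".
mismatch = resolve_config_sketch("https://api.beta.example/v1/", "alpha")
by_name = resolve_config_sketch(None, "Alpha")
unknown = resolve_config_sketch("https://unknown.example", None)
```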

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the ProviderRegistry.

scholar_flux.api.models.rate_limiter_registry module

The scholar_flux.api.models.rate_limiter_registry module implements a registry that stores rate limiters by provider.

The RateLimiterRegistry implements several helpers for interacting with, retrieving, and creating default and thread-safe rate limiters for both default and new providers.

class scholar_flux.api.models.rate_limiter_registry.RateLimiterRegistry(*args: Any, threaded: bool = False, **kwargs: Any)[source]

Bases: BaseProviderDict

A registry for creating, retrieving, updating, and deleting rate limiters by provider.

The RateLimiterRegistry standardizes CRUD operations with thread-safe rate limiters for both default and custom providers. It ensures compatibility when using rate limiters in active applications. This implementation is especially important when using MultiSearchCoordinators to enforce normalized rate limiting by provider.

threaded

Indicates whether the registry should use ThreadedRateLimiters.

Type:

bool

__init__(*args: Any, threaded: bool = False, **kwargs: Any) None[source]

Initializes the RateLimiterRegistry and enforces the use of ThreadedRateLimiters when threaded=True.

add(provider_name: str, rate_limiter: RateLimiter | ThreadedRateLimiter) None[source]

Helper method for adding a new provider and rate limiter to the provider registry.

create(provider_name: str, default_request_delay: int | float | None = None) RateLimiter | ThreadedRateLimiter[source]

Helper method that creates a new rate limiter for the current provider.

The minimum interval for the provider is chosen based on the following order of priority:

  1. If the provider exists in the provider_registry, use the request_delay from its configuration settings.

  2. Otherwise, use the default_request_delay parameter if it is a float or integer.

  3. If a provider doesn’t exist in the registry and default_request_delay isn’t specified, use the RateLimiter.DEFAULT_MIN_INTERVAL class parameter.

Parameters:
  • provider_name (str) – The name of the provider to create a new rate limiter for.

  • default_request_delay (Optional[int | float]) – The default minimum interval to use when creating a new rate limiter.
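
The three-step priority order above can be sketched as a small function. This is an illustration only: choose_min_interval, KNOWN_REQUEST_DELAYS, and FALLBACK_MIN_INTERVAL are hypothetical names, and the numeric values are placeholders rather than the library's actual defaults.

```python
from typing import Optional, Union

Number = Union[int, float]

# Placeholder values for illustration only; they are not the library's defaults.
FALLBACK_MIN_INTERVAL: Number = 5.0
KNOWN_REQUEST_DELAYS = {"provider_a": 3.0}


def choose_min_interval(provider_name: str, default_request_delay: Optional[Number] = None) -> Number:
    """Pick the request delay using the documented priority order."""
    # 1. A registered provider's configured delay takes precedence.
    if provider_name in KNOWN_REQUEST_DELAYS:
        return KNOWN_REQUEST_DELAYS[provider_name]
    # 2. Otherwise use the caller's default when it is an int or float.
    if isinstance(default_request_delay, (int, float)):
        return default_request_delay
    # 3. Fall back to the class-level default.
    return FALLBACK_MIN_INTERVAL
```

Note that a caller-supplied default_request_delay never overrides the delay of a provider that is already registered.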

classmethod from_defaults(threaded: bool = False) Self[source]

Initializes a new RateLimiterRegistry for known providers based on their configurations.

This method specifically uses the provider_name and request_delay of each provider listed within the scholar_flux.api.providers.provider_registry to create rate limiters for all known configurations.

Returns:

A new rate limiter registry that contains default rate limiters for known providers.

Return type:

RateLimiterRegistry

get_from_url(provider_url: str | None) RateLimiter | ThreadedRateLimiter | None[source]

Attempts to retrieve a RateLimiter for the specified provider from a URL.

This method retrieves the rate limiter of the provider associated with the provided URL if the URL after normalization exists within the scholar_flux.api.provider_registry. If a provider does not exist, a value of None will be returned instead.

Parameters:

provider_url (Optional[str]) – URL of the provider to look up.

Returns:

The rate limiter of the provider when available. Otherwise None.

Return type:

Optional[RateLimiter | ThreadedRateLimiter]

get_or_create(key: str, default_request_delay: int | float | None = None) RateLimiter | ThreadedRateLimiter[source]

Helper method that retrieves a rate limiter from the registry or creates one if it doesn’t exist.

This method is useful when a provider may or may not exist in the current registry and otherwise needs to be added. If a provider’s rate limiter does not yet exist, the registry attempts to create a new rate limiter.

Parameters:
  • key (str) – The name of the provider to retrieve a rate limiter for, and otherwise create a new rate limiter if it doesn’t exist.

  • default_request_delay (Optional[int | float]) – The default minimum interval to use when creating a new rate limiter if one does not already exist for the provider

Returns:

The retrieved rate limiter for the current provider if available. Otherwise, a new rate limiter is created, registered, and returned.

Return type:

RateLimiter | ThreadedRateLimiter
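
The get-or-create behavior, together with the selection of a limiter class from the threaded flag, can be sketched with a dictionary subclass. All class names here (SimpleLimiter, LockedLimiter, LimiterRegistry) are hypothetical stand-ins, and the 1.0 second fallback is an assumed placeholder.

```python
import threading
from typing import Optional, Type


class SimpleLimiter:
    """Hypothetical limiter that only stores a minimum interval."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval


class LockedLimiter(SimpleLimiter):
    """Hypothetical thread-aware variant that carries a lock."""

    def __init__(self, min_interval: float = 1.0) -> None:
        super().__init__(min_interval)
        self.lock = threading.Lock()


class LimiterRegistry(dict):
    def __init__(self, threaded: bool = False) -> None:
        super().__init__()
        self.threaded = threaded

    @property
    def limiter_class(self) -> Type[SimpleLimiter]:
        # The constructor depends on the registry-wide threaded flag.
        return LockedLimiter if self.threaded else SimpleLimiter

    def get_or_create(self, key: str, default_request_delay: Optional[float] = None) -> SimpleLimiter:
        # Create and register a limiter only when the key is missing.
        if key not in self:
            delay = 1.0 if default_request_delay is None else default_request_delay
            self[key] = self.limiter_class(delay)
        return self[key]


registry = LimiterRegistry(threaded=True)
first = registry.get_or_create("provider_a", 2.5)
second = registry.get_or_create("provider_a", 9.0)  # existing limiter is reused
```

Because the second call finds an existing entry, its default_request_delay is ignored and the same object is returned.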

property rate_limiter: type[RateLimiter | ThreadedRateLimiter]

Helper method that returns the class constructor for a rate limiter.

Returns:

A ThreadedRateLimiter if self.threaded=True, otherwise the core RateLimiter

remove(provider_name: str) None[source]

Helper method for removing a provider and its rate limiter from the registry.

scholar_flux.api.models.reconstructed_response module

The scholar_flux.api.models.reconstructed_response module implements the ReconstructedResponse for transformation.

The ReconstructedResponse class was designed to be request-client agnostic to improve flexibility in the request clients that can be used to retrieve data from APIs and load response data from cache.

The ReconstructedResponse is a minimal implementation of a response-like object that can transform response classes from requests, httpx, and aiohttp into a singular representation of the same response.

class scholar_flux.api.models.reconstructed_response.ReconstructedResponse(status_code: int, reason: str, headers: MutableMapping[str, str], content: bytes, url: Any)[source]

Bases: object

Core class for constructing minimal, universal response representations from responses and response-like objects.

The ReconstructedResponse implements several helpers that enable the reconstruction of response-like objects from different sources such as the requests, aiohttp, and httpx libraries.

The primary purpose of the ReconstructedResponse in scholar_flux is to create a minimal representation of a response when we need to construct a ProcessedResponse without an actual response and verify content fields.

In applications such as retrieving cached data from a scholar_flux.data_storage.DataCacheManager, if an original or cached response is not available, then a ReconstructedResponse is created from the cached response fields when available.

Parameters:
  • status_code (int) – The integer code indicating the status of the response

  • reason (str) – Indicates the reasoning associated with the status of the response

  • headers (MutableMapping[str, str]) – Indicates metadata associated with the response (e.g. Content-Type, etc.)

  • content (bytes) – The content within the response

  • url (Any) – The URL from which the response was received

Note

The ReconstructedResponse.build factory method is recommended when the needed fields are available but must be processed before use. Examples include instances where one has text or json data instead of content, a reason_phrase field instead of reason, etc.

Example

>>> from scholar_flux.api.models import ReconstructedResponse
# build a response using a factory method that infers fields from existing ones when not directly specified
>>> response = ReconstructedResponse.build(status_code = 200, content = b"success", url = "https://google.com")
# check whether the current class follows a ResponseProtocol and contains valid fields
>>> response.is_response()
# OUTPUT: True
>>> response.validate() # raises an error if invalid
>>> response.raise_for_status() # no error for 200 status codes
>>> assert response.reason == 'OK' == response.status  # inferred from the status_code attribute
__init__(status_code: int, reason: str, headers: MutableMapping[str, str], content: bytes, url: Any) None
asdict() dict[str, Any][source]

Converts the ReconstructedResponse into a dictionary containing attributes and their corresponding values.

This convenience method uses dataclasses.asdict() under the hood to convert a ReconstructedResponse to a dictionary consisting of key-value pairs.

Returns:

A dictionary that maps the field names of a ReconstructedResponse instance to their assigned values.

Return type:

dict[str, Any]

classmethod build(response: object | None = None, **kwargs: Any) ReconstructedResponse[source]

Helper method for building a new ReconstructedResponse from a regular response object.

This classmethod can either construct a new ReconstructedResponse object from a response or response-like object or otherwise build a new ReconstructedResponse via its keyword parameters.

Parameters:
  • response (Optional[object]) – A response or response-like object of unknown type or None.

  • **kwargs – The underlying components needed to construct a new response. Note that ideally, this set of key-value pairs would be specific only to the types expected by the ReconstructedResponse.

Returns:

A minimal ReconstructedResponse object created from the received parameter set.

Return type:

ReconstructedResponse

content: bytes
classmethod fields() list[str][source]

Retrieves a list containing the names of all fields associated with the ReconstructedResponse class.

Returns:

A list containing the name of each attribute in the ReconstructedResponse.

Return type:

list[str]

classmethod from_keywords(**kwargs: Any) ReconstructedResponse[source]

Uses the provided keyword arguments to create a ReconstructedResponse.

Parameters:

**kwargs

The ReconstructedResponse keyword arguments to normalize. Possible keywords include:

  • status_code (int): The integer code indicating the status of the response

  • reason (str): Indicates the reasoning associated with the status of the response.

  • headers (MutableMapping[str, str]): Indicates metadata associated with the response (e.g. Content-Type)

  • content (bytes): The content within the response

  • url (Any): The URL from which the response was received

The keywords can alternatively be inferred from other common response fields:

  • content: [‘content’, ‘_content’, ‘text’, ‘json’]

  • headers: [‘headers’, ‘_headers’]

  • reason: [‘reason’, ‘status’, ‘reason_phrase’, ‘status_code’]

Returns:

A newly reconstructed response from the given keyword components.

Return type:

ReconstructedResponse
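
The alias lists above suggest how missing fields can be filled in from related keywords. The sketch below shows one way to do this with the standard library; infer_fields is a hypothetical helper, and the order in which aliases are tried is an assumption rather than the library's documented behavior.

```python
import json
from http import HTTPStatus
from typing import Any


def infer_fields(**kwargs: Any) -> dict:
    """Hypothetical helper: fill reason, content, and headers from alternate keywords."""
    status_code = kwargs.get("status_code")

    # reason: first non-empty alias, then the standard phrase for the status code
    reason = kwargs.get("reason") or kwargs.get("status") or kwargs.get("reason_phrase")
    if reason is None and isinstance(status_code, int):
        try:
            reason = HTTPStatus(status_code).phrase
        except ValueError:
            reason = None  # unknown status codes have no standard phrase

    # content: bytes as given, else encoded text, else dumped JSON-compatible data
    content = kwargs.get("content", kwargs.get("_content"))
    if content is None and isinstance(kwargs.get("text"), str):
        content = kwargs["text"].encode("utf-8")
    elif content is None and kwargs.get("json") is not None:
        content = json.dumps(kwargs["json"]).encode("utf-8")

    headers = kwargs.get("headers") or kwargs.get("_headers") or {}
    return {
        "status_code": status_code,
        "reason": reason,
        "headers": dict(headers),
        "content": content,
        "url": kwargs.get("url"),
    }


fields = infer_fields(status_code=404, text="missing", url="https://example.com")
```

Here fields["reason"] is "Not Found" and fields["content"] is b"missing", both inferred from other keywords.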

headers: MutableMapping[str, str]
is_response() bool[source]

Validates the fields of the minimally reconstructed response, indicating whether all fields are valid.

The fields that are validated include:

  1. status codes (should be an integer)

  2. URLs (should be a valid url)

  3. reasons (should originate from a reason attribute or be inferred from the status code)

  4. content (should be a bytes field or encoded from a string text field)

  5. headers (should be a dictionary with string fields and preferably a content type)

Returns:

Indicates whether the current reconstructed response minimally recreates a response object.

Return type:

bool

json() dict[str, Any] | list[Any] | None[source]

Return JSON-decoded body from the underlying response, if available.

property ok: bool

Indicates whether the current response indicates a successful request (200 <= status_code < 300).

To account for the nature of successful requests to APIs in academic pipelines, status codes from 300 to 399 are excluded.

Returns:

True if the status code is an integer value within the range of 200 and 299, False otherwise.

Return type:

bool

classmethod prepare_response_fields(**kwargs: Any) dict[str, Any][source]

Extracts and prepares the fields required to reconstruct the response from the provided keyword arguments.

Parameters:
  • status_code (int) – The integer code indicating the status of the response

  • reason (str) – Indicates the reasoning associated with the status of the response

  • headers (MutableMapping[str, str]) – Indicates metadata associated with the response (e.g. Content-Type)

  • content (bytes) – The content within the response

  • url (Any) – The URL from which the response was received

Some fields can be both provided directly or inferred from other similarly common fields:

  • content: [‘content’, ‘_content’, ‘text’, ‘json’]

  • headers: [‘headers’, ‘_headers’]

  • reason: [‘reason’, ‘status’, ‘reason_phrase’, ‘status_code’]

Returns:

A dictionary containing the prepared response fields.

Return type:

dict[str, Any]

raise_for_status() None[source]

Verifies the status code for the current ReconstructedResponse, raising an error for failed responses.

This method follows a similar convention as requests and httpx response types, raising an error when encountering status codes that are indicative of failed responses.

As scholar_flux processes data that is generally only sent when status codes are between 200-299 (or exactly 200 [ok]), an error is raised when encountering a value outside of this range.

Raises:

HTTPError – If the structure of the response is invalid or the status code is not within the range of 200-299.
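
The 200-299 rule shared by ok and raise_for_status can be sketched as a pair of small functions. StatusError, is_ok, and check_status are hypothetical names used for illustration; StatusError stands in for the library's HTTPError.

```python
class StatusError(Exception):
    """Hypothetical error type standing in for the library's HTTPError."""


def is_ok(status_code: object) -> bool:
    """True only for integer status codes in the 200-299 range."""
    return isinstance(status_code, int) and 200 <= status_code < 300


def check_status(status_code: object) -> None:
    """Raise for anything outside 200-299, including 3xx redirects."""
    if not is_ok(status_code):
        raise StatusError(f"Unsuccessful status code: {status_code!r}")
```

Under this rule a 204 passes, while a 302 redirect or a non-integer status code is treated as a failure.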

reason: str
property status: str | None

Helper property for retrieving a human-readable description of the status.

Returns:

The status description associated with the response (if available).

Return type:

Optional[str]

status_code: int
property text: str | None

Helper property for retrieving the text from the bytes content as a string.

Returns:

The decoded text from the content of the response.

Return type:

Optional[str]

url: Any
validate() None[source]

Convenience method for the validation of the current ReconstructedResponse.

No exception is raised when the response validation succeeds.

Raises:

InvalidResponseReconstructionException – If at least one field is determined to be invalid and unexpected of a true response object.

scholar_flux.api.models.response_metadata_map module

The scholar_flux.api.models.response_metadata_map module implements the ResponseMetadataMap for field resolution.

class scholar_flux.api.models.response_metadata_map.ResponseMetadataMap(*, total_query_hits: str | None = None, records_per_page: str | None = None)[source]

Bases: BaseModel

Maps API-specific response metadata field names to common names.

This class enables extraction of metadata from API responses, primarily used for pagination decisions in multi-page searches. It extracts and processes metadata fields from metadata dictionaries and supports nested path traversal: nested fields are denoted by separating keys with periods.

Parameters:
  • total_query_hits – Field name containing the total number of results for a query (used to determine if more pages exist)

  • records_per_page – Field name indicating the number of records on the current page
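
Period-delimited field names of this kind are typically resolved by walking the nested mapping one key at a time. The sketch below illustrates the idea with two hypothetical helpers, get_nested and to_int; it is not the library's implementation.

```python
from collections.abc import Mapping
from typing import Any, Optional


def get_nested(metadata: Mapping, path: str) -> Optional[Any]:
    """Follow a period-delimited path through nested mappings."""
    current: Any = metadata
    for key in path.split("."):
        if not isinstance(current, Mapping) or key not in current:
            return None  # the path does not exist in this metadata
        current = current[key]
    return current


def to_int(value: Any) -> Optional[int]:
    """Convert a value to int when possible, otherwise return None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


metadata = {"statistics": {"totalHits": "1500"}}
total_hits = to_int(get_nested(metadata, "statistics.totalHits"))
```

Here total_hits is the integer 1500, and a path that is missing from the metadata yields None rather than an error.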

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits")
>>> metadata = {"totalHits": 318942, "limit": 10}
>>> total = metadata_map.calculate_query_hits(metadata)
>>> print(total)  # 318942
>>> # Used for pagination decisions
>>> current_page, records_per_page = 1, 10
>>> has_more = total > (current_page * records_per_page)
calculate_pages_remaining(page: int, total_query_hits: int | None = None, records_per_page: int | None = None, metadata: MetadataType | None = None) int | None[source]

Calculates the total number of pages yet to be queried using either metadata or direct integer fields.

Parameters:
  • page (int) – The current page number

  • total_query_hits (Optional[int]) – Total number of record hits associated with a given query. If not specified, this is parsed from the metadata

  • records_per_page (Optional[int]) – Total number of records on the current page as an integer if available and convertible

  • metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)

Returns:

The total number of pages that remain given the values total_query_hits and records_per_page

Return type:

Optional[int]

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(
... total_query_hits="statistics.totalHits", records_per_page="metadata.pageSize"
... )
>>> metadata = {"statistics": {"totalHits": "1500"},"metadata": {"pageSize": "20"}}
>>> total = metadata_map.calculate_pages_remaining(page = 74, metadata = metadata)
>>> print(total) # 1 (converted from string)
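
The arithmetic behind the example above can be reproduced with a ceiling division. This sketch assumes one-indexed pages and clamps the result at zero; pages_remaining is a hypothetical function and may differ from the library's handling of edge cases.

```python
import math
from typing import Optional


def pages_remaining(page: int, total_query_hits: Optional[int], records_per_page: Optional[int]) -> Optional[int]:
    """Number of pages left after `page`, or None when the inputs are unusable."""
    if total_query_hits is None or records_per_page is None or records_per_page <= 0:
        return None
    total_pages = math.ceil(total_query_hits / records_per_page)
    return max(total_pages - page, 0)
```

With 1500 hits and 20 records per page there are 75 pages in total, so pages_remaining(74, 1500, 20) returns 1, matching the example.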
calculate_query_hits(metadata: MetadataType) int | None[source]

Extract and convert total query hits from response metadata.

Parameters:

metadata (MetadataType) – A mapping containing response metadata typically from ProcessedResponse.metadata

Returns:

Total number of query hits as an integer if available and convertible, otherwise None

Return type:

Optional[int]

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits")
>>> metadata = {"totalHits": "1500", "results": [...]}
>>> total = metadata_map.calculate_query_hits(metadata)
>>> print(total)  # 1500 (converted from string)
calculate_records_per_page(metadata: MetadataType) int | None[source]

Extract and convert the total number of records on the current page from response metadata.

Parameters:

metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)

Returns:

Total number of records on the current page as an integer if available and convertible, otherwise None

Return type:

Optional[int]

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(records_per_page="pageSize")
>>> metadata = {"pageSize": "20", "results": [...]}
>>> total = metadata_map.calculate_records_per_page(metadata)
>>> print(total)  # 20 (converted from string)
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

process_metadata(metadata: MetadataType) MetadataType[source]

Helper method for processing metadata after mapping relevant fields using the metadata schema.

Parameters:

metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)

Returns:

A mapped dictionary of processed metadata fields.

Return type:

MetadataType

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits", records_per_page="pageSize")
>>> metadata = {"totalHits": "1500","pageSize": "20", "results": [...]}
>>> metadata_map.process_metadata(metadata)
# OUTPUT: {"total_query_hits": 1500, "records_per_page": 20}
records_per_page: str | None
total_query_hits: str | None

scholar_flux.api.models.response_types module

Helper module used to define response types returned by scholar-flux after API response retrieval and processing.

The APIResponseType is a union of different possible response types that can be received from a SearchCoordinator:
  • ProcessedResponse: A successfully processed response containing parsed response metadata, and processed records.

  • ErrorResponse: Indicates that an error has occurred during response retrieval and/or processing when unsuccessful.

  • NonResponse: ErrorResponse subclass indicating when an error prevents the successful retrieval of a response.

scholar_flux.api.models.responses module

The scholar_flux.api.models.responses module contains the core response types used during API response retrieval.

These responses are designed to indicate whether the retrieval and processing of API responses was successful or unsuccessful while also storing relevant fields that aid in post-retrieval diagnostics. Each class uses pydantic to ensure type-validated responses while also ensuring flexibility in how responses can be used and applied.

Classes:
ProcessedResponse:

Indicates that an API response was successfully retrieved, parsed, and processed. This model is designed to facilitate the inspection of intermediate results and retrieval of extracted response records.

ErrorResponse:

Indicates that an error occurred somewhere in the retrieval or processing of an API response. This class is designed to allow inspection of error messages and failure results to aid in debugging in case of unexpected scenarios.

NonResponse:

Inherits from ErrorResponse and is designed to indicate that an error occurred in the preparation of a request or the sending/retrieval of a response.

class scholar_flux.api.models.responses.APIResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None)[source]

Bases: BaseModel

A Response wrapper for responses of different types that allows consistency when using several possible backends.

The purpose of this class is to serve as the base for managing responses received from scholarly APIs while processing each component in a predictable, reproducible manner.

This class uses pydantic’s data validation and serialization/deserialization methods to aid caching and includes properties that refer back to the original response for displaying valid response codes, URLs, etc.

All future processing/error-based response classes inherit from and build on this class.

Parameters:
  • cache_key (Optional[str]) – A string for recording cache keys for use in later steps of the response orchestration involving processing, cache storage, and cache retrieval

  • response (Optional[requests.Response | ResponseProtocol]) – A response or response-like object to be validated and used/re-used in later caching and response processing/orchestration steps.

  • created_at (Optional[str]) – A value indicating the time at which a response or response-like object was created.

Example

>>> from scholar_flux.api import APIResponse
# Using keyword arguments to build a basic APIResponse data container:
>>> response = APIResponse.from_response(
...     cache_key = 'test-response',
...     status_code = 200,
...     content=b'success',
...     url='https://example.com',
...     headers={'Content-Type': 'application/text'}
... )
>>> response
# OUTPUT: APIResponse(cache_key='test-response', response = ReconstructedResponse(
#    status_code=200, reason='OK', headers={'Content-Type': 'application/text'},
#    text='success', url='https://example.com'
# ))
>>> assert response.status == 'OK' and response.text == 'success' and response.url == 'https://example.com'
>>> response.validate_response()
# OUTPUT: True
classmethod as_reconstructed_response(response: object) ReconstructedResponse[source]

Classmethod designed to create a reconstructed response from an original response object.

This method coerces response attributes into a reconstructed response that retains the original content, status code, headers, URL, reason, etc.

Returns:

A minimal response object that contains the core attributes needed to support other processes in the scholar_flux module such as response parsing and caching.

Return type:

ReconstructedResponse

build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType] | None[source]

Defines a No-Op method to be overridden by ProcessedResponse subclasses.

cache_key: str | None
property cached: bool | None

Identifies whether the current response was retrieved from the session cache.

Returns:

True if the response is a CachedResponse object, False if it is a fresh requests.Response object, and None if unknown (e.g., the response attribute is not a requests.Response object or subclass).

Return type:

Optional[bool]

property content: bytes | None

Return content from the underlying response, if available and valid.

Returns:

The bytes from the original response content

Return type:

Optional[bytes]

created_at: str | None
encode_response(response: object) dict[str, Any] | list[Any] | None[source]

Helper method for serializing a response into a JSON format.

Accounts for special cases such as CaseInsensitiveDict fields that are otherwise unserializable.

From this step, pydantic can safely use json internally to dump the encoded response fields.

classmethod from_response(response: Any | None = None, cache_key: str | None = None, auto_created_at: bool | None = None, **kwargs: Any) Self[source]

Construct an APIResponse from a response object or from keyword arguments.

If response is not a valid response object, builds a minimal response-like object from kwargs.

classmethod from_serialized_response(response: object | None = None, **kwargs: Any) ReconstructedResponse | None[source]

Helper method for creating a new reconstructed response from a dumped JSON object.

This method accounts for lack of ease of serialization of responses by decoding the response dictionary that was loaded from a string using json.loads from the JSON module in the standard library.

If the response input is still a serialized string, this method will manually load the response dict with the APIResponse._deserialize_response_dict class method before further processing.

Parameters:

response (object) – A prospective response value to load into the API Response.

Returns:

A reconstructed response object, if possible. Otherwise returns None

Return type:

Optional[ReconstructedResponse]

property headers: MutableMapping[str, str] | None

Return headers from the underlying response, if available and valid.

Returns:

A dictionary of headers from the response

Return type:

MutableMapping[str, str]

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize(*args: Any, **kwargs: Any) NormalizedRecordList[source]

Defines the normalize method that successfully processed API Responses can override to normalize records.

Raises:

NotImplementedError – Raised unless this method is overridden in a subclass.

process_metadata(*args: Any, **kwargs: Any) MetadataType | None[source]

Abstract processing method that APIResponse subclasses can override to process metadata.

Parameters:
  • *args – No-Op - Added for compatibility with the APIResponse subclasses.

  • **kwargs – No-Op - Added for compatibility with the APIResponse subclasses.

Raises:

NotImplementedError – Raised unless this method is overridden in a subclass.

raise_for_status() None[source]

Uses the underlying response or response-like object to validate the status code associated with the request.

If the attribute isn’t a response or reconstructed response, the code will coerce the class into a response object to verify the status code for the request URL and response.

Raises:

requests.RequestException – Errors for status codes that indicate unsuccessfully received responses.

property reason: str | None

Uses the reason or status code attribute on the response object to retrieve or create a status description.

Returns:

The status description associated with the response.

Return type:

Optional[str]

resolve_extracted_record(*args: Any, **kwargs: Any) RecordType | None[source]

Defines a No-Op method to be overridden by ProcessedResponse subclasses.

response: Response | ResponseProtocol | None
classmethod serialize_response(response: Response | ResponseProtocol) str | None[source]

Helper method for serializing a response into a json format.

The response object is first converted into a serialized string and subsequently dumped after ensuring that the field is serializable.

Parameters:

response (Response, ResponseProtocol) – A requests.Response or response-like object to serialize as a string.

Returns:

A serialized response when response serialization is possible. Otherwise None.

Return type:

Optional[str]

property status: str | None

Helper property for retrieving a human-readable status description from the APIResponse.

Returns:

The status description associated with the response (if available).

Return type:

Optional[str]

property status_code: int | None

Helper property for retrieving a status code from the APIResponse.

Returns:

The status code associated with the response (if available)

Return type:

Optional[int]

strip_annotations(*args: Any, **kwargs: Any) RecordList[source]

Defines a No-Op method to be overridden by ProcessedResponse subclasses.

property text: str | None

Attempts to retrieve the response text by first decoding the bytes of its content.

If not available, this property attempts to reference the text attribute directly.

Returns:

A text string if the text is available in the correct format, otherwise None

Return type:

Optional[str]

classmethod transform_response(v: Response | ResponseProtocol | None) Response | ResponseProtocol | None[source]

Attempts to resolve a valid or a serialized response-like object as an original or ReconstructedResponse.

All original response objects (duck-typed or requests response) with valid values will be returned as is.

If the passed object is a string, this function will attempt to deserialize it and then parse the result as a dictionary.

Dictionary fields will be decoded, if originally encoded, and parsed as a ReconstructedResponse object, if possible.

Otherwise, the original object is returned as is.
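
The resolution order described above (pass through, deserialize strings, rebuild from dictionaries, otherwise return the input) can be sketched as follows. coerce_response is a hypothetical function, SimpleNamespace stands in for a reconstructed response, and using the presence of a status_code key as the trigger for rebuilding is an assumption made for this illustration.

```python
import json
from types import SimpleNamespace
from typing import Any


def coerce_response(value: Any) -> Any:
    """Return a response-like object when one can be rebuilt, else the original value."""
    candidate = value
    if isinstance(candidate, str):
        try:
            candidate = json.loads(candidate)  # serialized string -> Python object
        except json.JSONDecodeError:
            return value  # not JSON: hand the original back unchanged
    if isinstance(candidate, dict) and "status_code" in candidate:
        fields = dict(candidate)
        # Content that was stored as text is re-encoded to bytes.
        if isinstance(fields.get("content"), str):
            fields["content"] = fields["content"].encode("utf-8")
        return SimpleNamespace(**fields)
    return value


rebuilt = coerce_response('{"status_code": 200, "content": "ok"}')
untouched = coerce_response("not json")
```

In this sketch, rebuilt exposes status_code and content as attributes, while untouched is the original string.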

property url: str | None

Return URL from the underlying response, if available and valid.

Returns:

The original URL in string format, if available. For URL objects that are not str types, this method attempts to convert them into strings when possible.

Return type:

str

classmethod validate_iso_timestamp(v: str | datetime | None) str | None[source]

Helper method for validating and ensuring that the timestamp accurately follows an ISO 8601 format.
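
A validator of this kind can be sketched with datetime.fromisoformat from the standard library. normalize_iso_timestamp is a hypothetical function, and raising ValueError for invalid input is an assumption; note also that fromisoformat accepts only a subset of ISO 8601 on Python versions before 3.11.

```python
from datetime import datetime
from typing import Optional, Union


def normalize_iso_timestamp(value: Union[str, datetime, None]) -> Optional[str]:
    """Return an ISO 8601 string, passing None through unchanged."""
    if value is None:
        return None
    if isinstance(value, datetime):
        return value.isoformat()
    try:
        # Round-trip through datetime to confirm the string parses.
        return datetime.fromisoformat(value).isoformat()
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Not an ISO 8601 timestamp: {value!r}") from exc
```

A date-only string such as "2024-01-02" is accepted and expanded to "2024-01-02T00:00:00".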

validate_response(raise_on_error: bool = False) bool[source]

Helper method for determining whether the response attribute is truly a response or response-like object.

If the response isn’t a requests.Response object, we use duck-typing to determine whether the response, itself, contains the attributes expected of a response.

For this purpose, response properties are checked to determine whether the properties of the nested response object match the expected types.

Parameters:

raise_on_error (bool) – Indicates whether an error should be raised if the response attribute is invalid (False by default).

Returns:

Indicates whether the current APIResponse.response attribute is a valid response.

Return type:

bool

Raises:

InvalidResponseStructureException – When the response attribute is invalid and raise_on_error=True
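
Duck-typed validation of this sort checks for the presence and type of each expected attribute instead of checking the class. In the sketch below, looks_like_response and EXPECTED_TYPES are hypothetical, the attribute list is an assumption, and TypeError stands in for the library's InvalidResponseStructureException.

```python
from collections.abc import Mapping
from types import SimpleNamespace
from typing import Any

# Assumed attribute names and types for a response-like object.
EXPECTED_TYPES = {"status_code": int, "content": bytes, "headers": Mapping, "url": object}
_MISSING = object()


def looks_like_response(obj: Any, raise_on_error: bool = False) -> bool:
    """Check that every expected attribute exists and has the expected type."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        value = getattr(obj, name, _MISSING)
        if value is _MISSING or not isinstance(value, expected):
            problems.append(name)
    if problems and raise_on_error:
        raise TypeError(f"Invalid or missing response fields: {problems}")
    return not problems


valid = SimpleNamespace(status_code=200, content=b"ok", headers={}, url="https://example.com")
invalid = SimpleNamespace(status_code="200", content="ok")
```

Here looks_like_response(valid) is True, looks_like_response(invalid) is False, and passing raise_on_error=True for the invalid object raises instead of returning.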

class scholar_flux.api.models.responses.ErrorResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None, message: str | None = None, error: str | None = None)[source]

Bases: APIResponse

Returned when something goes wrong but raising immediately is undesirable; the failure details are handed back to the caller instead.

The class is formatted for compatibility with the ProcessedResponse.
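
The pattern of returning failure details instead of raising can be sketched with two small dataclasses that share a field. Success, Failure, and run_safely are hypothetical names, and storing the exception's class name as the error string is an assumption made for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Success:
    data: list


@dataclass
class Failure:
    message: str
    error: str
    data: None = None  # mirrors the success field so callers can read .data on either type


def run_safely(step: Callable[[], list]) -> Union[Success, Failure]:
    """Run a processing step and return a result object instead of raising."""
    try:
        return Success(data=step())
    except Exception as exc:
        return Failure(message="Processing failed", error=type(exc).__name__)


succeeded = run_safely(lambda: [{"title": "example"}])
failed = run_safely(lambda: [1 / 0])
```

Callers can then branch on the result type or simply check whether data is None, without wrapping each call in its own try block.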

build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType][source]

No-Op: Returns an empty dict when no extracted records are available.

This method is retained for compatibility with ProcessedResponse. Since ErrorResponse has no extracted records to index, this method always returns an empty dictionary regardless of arguments provided.

Parameters:
  • *args – Positional argument placeholder for compatibility with the ProcessedResponse.build_record_id_index method. All arguments are ignored.

  • **kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.build_record_id_index method. All arguments are ignored.

Returns:

An empty dictionary indicating no records are available for indexing.

Return type:

dict[str, RecordType]

cache_key: str | None
created_at: str | None
property data: None

Provided for type hinting + compatibility.

error: str | None
property extracted_records: None

Provided for type hinting + compatibility.

classmethod from_error(message: str, error: Exception, cache_key: str | None = None, response: Response | ResponseProtocol | None = None) Self[source]

Creates and logs the processing error if one occurs during response processing.

Parameters:
  • message (str) – Error message describing the failure.

  • error (Exception) – The exception instance that was raised.

  • cache_key (Optional[str]) – Cache key for storing results.

  • response (Optional[requests.Response | ResponseProtocol]) – Raw API response.

Returns:

A pydantic model that contains the error response data and background information on what precipitated the error.

Return type:

ErrorResponse

message: str | None
property metadata: None

Provided for type hinting + compatibility.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = True, *args: Any, **kwargs: Any) NormalizedRecordList[source]

No-Op: Raises a RecordNormalizationException when raise_on_error=True and returns an empty list otherwise.

Parameters:
  • field_map (Optional[BaseFieldMap]) – An optional field map that can be used to normalize the current response. This is inferred from the registry if not provided as input.

  • raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a RecordNormalizationException is raised.

  • *args – Positional argument placeholder for compatibility with the ProcessedResponse.normalize method

  • **kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.normalize method

Returns:

An empty list if raise_on_error=False

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If raise_on_error=True, this exception is raised after catching NotImplementedError

property normalized_records: None

Provided for type hinting + compatibility.

property parsed_response: None

Provided for type hinting + compatibility.

process_metadata(*args: Any, **kwargs: Any) MetadataType | None[source]

No-Op: This method is retained for compatibility. It returns None by default.

property processed_metadata: None

Provided for type hinting + compatibility.

property processed_records: None

Provided for type hinting + compatibility.

property record_count: int

Number of records in this response.

property records_per_page: None

Provided for type hinting + compatibility.

resolve_extracted_record(*args: Any, **kwargs: Any) None[source]

No-Op: Returns None when no records are available.

This method is retained for compatibility with ProcessedResponse. Since ErrorResponse has no extracted or processed records, resolution is not possible and this method always returns None.

Parameters:
  • *args – Positional argument placeholder for compatibility with the ProcessedResponse.resolve_extracted_record method. Currently includes processed_index (int).

  • **kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.resolve_extracted_record method. All arguments are ignored.

Returns:

Always returns None since no records exist to resolve.

Return type:

None

response: requests.Response | ResponseProtocol | None
strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]

Convenience method for removing internal metadata annotations from a provided list of records.

This method removes all metadata annotations (dictionary keys that are prefixed with an underscore) that were added during the record extraction step for pipeline traceability (e.g., _extraction_index, _record_id).

Parameters:

records (Optional[RecordType | RecordList]) – Records to strip. Defaults to processed_records if None.

Returns:

A list of dictionary records with stripped metadata annotations when provided. If a record or record list is not provided, a warning is logged, and an empty list is returned.

Return type:

RecordList

Note: This method is defined primarily for compatibility with the ProcessedResponse API.

property total_query_hits: None

Provided for type hinting + compatibility.

class scholar_flux.api.models.responses.NonResponse(*, cache_key: str | None = None, response: None = None, created_at: str | None = None, message: str | None = None, error: str | None = None)[source]

Bases: ErrorResponse

Response class that indicates that an error occurred during request preparation or API response retrieval.

This class is used to signify the error that occurred within the search process using a similar interface as the other scholar_flux Response dataclasses.

cache_key: str | None
created_at: str | None
error: str | None
message: str | None
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

response: None
class scholar_flux.api.models.responses.ProcessedResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None, parsed_response: Any | None = None, extracted_records: RecordList | None = None, processed_records: RecordList | None = None, normalized_records: NormalizedRecordList | None = None, metadata: MetadataType | None = None, processed_metadata: MetadataType | None = None, message: str | None = None)[source]

Bases: APIResponse

APIResponse class that scholar_flux uses to return processed response data after successful response processing.

This class is populated to return response data containing information on the original, cached, or reconstructed API response that is received and processed after retrieval. In addition to returning processed records and metadata, this class also allows storage of intermediate steps including:

  1. Parsed responses

  2. Extracted records and metadata

  3. Processed records (aliased as data)

  4. Normalized records

  5. Processed metadata

  6. Any additional messages. An error field is provided for compatibility with the ErrorResponse class.

build_record_id_index() dict[str, RecordType][source]

Builds a lookup table for ID-based resolution of extracted records.

This method creates a dictionary that maps _record_id values to their corresponding extracted records. This is useful when performing multiple resolutions for records in the same response.

Returns:

A new dictionary mapping record IDs to the original record. An empty dictionary is returned if extracted_records is None or empty, or if no record has an associated ID.

Return type:

dict[str, RecordType]

Example

>>> from scholar_flux import SearchCoordinator
>>> coordinator = SearchCoordinator(query = 'public health', annotate_records=True)
>>> response = coordinator.search(page = 1)
>>> id_index = response.build_record_id_index()
>>> processed_record = response.data[0]
>>> extracted_record = id_index.get(processed_record["_record_id"])
>>> isinstance(extracted_record, dict)
# OUTPUT: True

Note

This method is used in the process of identifying raw, unprocessed records after extensive post-processing and filtering has been performed on each record and relies on record annotation being enabled during data extraction.
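The lookup table described above can be pictured with plain dictionaries. The following is a minimal sketch, not the library's implementation: it assumes each annotated record carries a `_record_id` key, and `build_id_index` is a hypothetical helper name chosen for illustration.

```python
from typing import Any, Optional

Record = dict[str, Any]


def build_id_index(extracted_records: Optional[list[Record]]) -> dict[str, Record]:
    """Map each `_record_id` to its record, skipping records without an ID.

    Hypothetical helper: illustrates the idea only.
    """
    if not extracted_records:
        return {}
    return {
        record["_record_id"]: record
        for record in extracted_records
        if record.get("_record_id") is not None
    }


extracted = [
    {"_record_id": "a1", "title": {"value": "First"}},
    {"_record_id": "b2", "title": {"value": "Second"}},
    {"title": {"value": "No annotation"}},  # skipped: no ID to index by
]
id_index = build_id_index(extracted)
```

Building the index once makes each later lookup a single dictionary access instead of a scan over every extracted record.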

cache_key: str | None
created_at: str | None
property data: RecordList | None

Alias to the processed_records attribute that holds a list of dictionaries, when available.

property error: None

Provided for type hinting + compatibility.

extracted_records: RecordList | None
message: str | None
metadata: MetadataType | None
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = False, update_records: bool | None = None, resolve_records: bool | None = None, keep_api_specific_fields: bool | Sequence | None = None, strip_annotations: bool | None = None) NormalizedRecordList[source]

Applies a field map to normalize the processed records of a ProcessedResponse into a common structure.

Note that if a field_map is not provided, this method will return the previously created normalized_records attribute if available. If normalized_records is None, this method will attempt to look up the FieldMap from the current provider_registry.

If processed_records is None (and not an empty list), record normalization falls back to using extracted_records. The results are broadly similar, with minor differences in value coercion, flattening, and the recursive extraction of values at non-terminal paths, depending on the implementation of the data processor.

Parameters:
  • field_map (Optional[BaseFieldMap]) – An optional field map that can be used to normalize the current response. This is inferred from the registry if not provided as input.

  • raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a normalization error is raised.

  • update_records (Optional[bool]) – A flag that determines whether updates should be made to the normalized_records attribute after computation. If None, updates are made only if the normalized_records attribute is currently None.

  • resolve_records (Optional[bool]) – A flag that determines if resolution with annotated records should occur. If True or None, resolution occurs. If False, normalization uses processed_records when not None and extracted_records otherwise.

  • keep_api_specific_fields (Optional[bool | Sequence]) – Indicates which API-specific fields should be retained from the complete set of fields that are returned. If False, only the core fields defined by the FieldMap are returned. If True or None, all fields are returned instead.

  • strip_annotations (Optional[bool]) – A flag for removing metadata annotations denoted by a leading underscore. When True or None (default), annotations are removed from normalized records.

Returns:

The list of normalized records in the same dimension as the original processed response. If a map for the current provider does not exist and raise_on_error=False, an empty list is returned instead.

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If an error occurs during the normalization of the record list.

Example

>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.utils import truncate, coerce_flattened_str
>>> coordinator = SearchCoordinator(query = 'public health')
>>> response = coordinator.search_page(page = 1)
>>> normalized_records = response.normalize()
>>> for record in normalized_records[:5]:
...     print(f"Title: {record['title']}")
...     print(f"URL: {record['url']}")
...     print(f"Source: {record['provider_name']}")
...     print(f"Abstract: {truncate(record['abstract'] or 'Not available')}")
...     print(f"Authors: {coerce_flattened_str(record['authors'])}")
...     print("-"*100)

# OUTPUT:
# Title: Are we prepared? The development of performance indicators for …
# URL: https://journals.plos.org/plosone/article?id=…
# Source: plos
# Abstract: Background: Disasters and emergencies…
# Authors: …
# ----------------------------------------------------------------------------------------------------

Note

Computation is performed in one of three cases:

  1. normalized_records does not already exist

  2. update_records is not True

  3. Either resolve_records or keep_api_specific_fields is not None
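To make the normalization step more concrete, here is a minimal sketch of applying a field map to a list of records. It rests on several assumptions: the map is modeled as a plain dictionary from universal names to provider-specific keys, `EXAMPLE_FIELD_MAP` and `normalize_records` are hypothetical names, and the real FieldMap classes likely handle nested paths and fallbacks that are not reproduced here.

```python
from typing import Any

Record = dict[str, Any]

# Hypothetical field map: universal field name -> provider-specific key.
EXAMPLE_FIELD_MAP = {"title": "title_display", "url": "id", "abstract": "abstract"}


def normalize_records(
    records: list[Record],
    field_map: dict[str, str],
    keep_api_specific_fields: bool = True,
    strip_annotations: bool = True,
) -> list[Record]:
    """Rename provider-specific keys to universal names (illustrative only)."""
    mapped_keys = set(field_map.values())
    normalized = []
    for record in records:
        # Core fields are always present; missing values become None.
        row = {name: record.get(key) for name, key in field_map.items()}
        if keep_api_specific_fields:
            # Carry over fields that the map does not cover.
            for key, value in record.items():
                if key not in mapped_keys and key not in row:
                    row[key] = value
        if strip_annotations:
            # Annotations are keys with a leading underscore.
            row = {k: v for k, v in row.items() if not k.startswith("_")}
        normalized.append(row)
    return normalized


raw = [{"title_display": "A study", "id": "https://example.org/1", "journal": "J", "_record_id": "x1"}]
full = normalize_records(raw, EXAMPLE_FIELD_MAP)
core = normalize_records(raw, EXAMPLE_FIELD_MAP, keep_api_specific_fields=False)
```

The output list has one entry per input record, which mirrors the statement above that normalized records keep the same dimension as the processed records.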

normalized_records: NormalizedRecordList | None
parsed_response: Any | None
process_metadata(metadata_map: ResponseMetadataMap | None = None, update_metadata: bool | None = None) MetadataType | None[source]

Uses a ResponseMetadataMap to process metadata for tertiary information on the response.

This method is a helper that is meant for primarily internal use for providing metadata information on the response where helpful and for informing users of the characteristics of the current response.

This function updates the ProcessedResponse.processed_metadata attribute when update_metadata=True. When update_metadata is None, the attribute is updated only if the current processed_metadata field is None or an empty dict. No update occurs when update_metadata=False.

Parameters:
  • metadata_map (Optional[ResponseMetadataMap]) – A mapping that resolves API-specific metadata names to a universal parameter name.

  • update_metadata (Optional[bool]) – Determines whether the underlying processed_metadata field should be updated. If True, the processed_metadata field is updated in place. If None, the field is only updated when metadata fields have been successfully processed and the processed_metadata field is None.

Returns:

The processed metadata returned as a dictionary when available. None otherwise.

Return type:

Optional[MetadataType]
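The three-state update_metadata flag can be summarized as a small decision function. This is a sketch of the documented rules only: `should_update_metadata` is a hypothetical name, and treating an empty dict the same as None is an assumption drawn from the description above.

```python
from typing import Any, Optional

Metadata = dict[str, Any]


def should_update_metadata(
    current: Optional[Metadata],
    processed: Optional[Metadata],
    update_metadata: Optional[bool] = None,
) -> bool:
    """Decide whether processed_metadata would be overwritten (illustrative)."""
    if update_metadata is True:
        return True  # explicit request: always update
    if update_metadata is False:
        return False  # explicit opt-out: never update
    # None: update only when processing produced something and nothing is stored yet.
    return bool(processed) and not current


decisions = {
    "first_run": should_update_metadata(None, {"total_query_hits": 10}),
    "already_set": should_update_metadata({"total_query_hits": 5}, {"total_query_hits": 10}),
    "forced": should_update_metadata({"total_query_hits": 5}, {"total_query_hits": 10}, True),
    "disabled": should_update_metadata(None, {"total_query_hits": 10}, False),
}
```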

processed_metadata: MetadataType | None
processed_records: RecordList | None
property record_count: int

The overall length of the processed data field as processed in the last step after filtering.

property records_per_page: int | None

Returns the total number of results on the current page.

This method retrieves the records_per_page variable from the processed_metadata attribute, and if metadata hasn’t yet been processed, this method will then call process_metadata() manually to ensure that the field is available.

resolve_extracted_record(processed_index: int) RecordType | None[source]

Resolve a processed record back to its original extracted record.

This method uses a two-phase resolution strategy with optional validation:

  1. Primary: Direct index lookup via _extraction_index (fast, single access)

  2. Validation: Verify _record_id matches

  3. Fallback: Search by _record_id if index lookup fails or mismatches (scans all records)

Parameters:

processed_index (int) – The index of the record in processed_records to resolve.

Returns:

The original extracted record, or None if resolution fails.

Return type:

Optional[RecordType]

Example

>>> from scholar_flux import SearchCoordinator, RecursiveDataProcessor
>>> coordinator = SearchCoordinator(
...     query='public health',
...     provider_name='openalex',
...     annotate_records=True,
...     processor=RecursiveDataProcessor()
... )
>>> response = coordinator.search(page=1)
>>> # Get processed (possibly flattened) record
>>> processed = response.processed_records[0]
>>> print(processed.get("authorships.author.display_name"))  # ['Kenneth L. Howard...']
>>> # Resolve to original nested structure
>>> original = response.resolve_extracted_record(0)
>>> print(original.get("authorships"))
>>> print(original.get("authorships")[0].keys())
# OUTPUT: dict_keys(['author_position', 'author', 'institutions', 'countries', 'is_corresponding', 'raw_author_name', 'raw_affiliation_strings', 'affiliations'])

Note

Resolution requires that records were extracted with annotate_records=True in the DataExtractor. Without annotation fields, this method returns None.
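The resolution strategy listed above (index lookup, ID validation, then a fallback scan) can be sketched with plain lists of dictionaries. This is an illustration under stated assumptions, not the library's code: `resolve_record` is a hypothetical function, and the annotation keys `_extraction_index` and `_record_id` are taken from the description above.

```python
from typing import Any, Optional

Record = dict[str, Any]


def resolve_record(
    processed_records: list[Record],
    extracted_records: list[Record],
    processed_index: int,
) -> Optional[Record]:
    """Map a processed record back to its extracted record (illustrative)."""
    if not 0 <= processed_index < len(processed_records):
        return None
    processed = processed_records[processed_index]
    record_id = processed.get("_record_id")
    extraction_index = processed.get("_extraction_index")
    if record_id is None and extraction_index is None:
        return None  # no annotations, so nothing to resolve with

    # Primary: direct index lookup.
    if isinstance(extraction_index, int) and 0 <= extraction_index < len(extracted_records):
        candidate = extracted_records[extraction_index]
        # Validation: accept the candidate only if the IDs agree (when an ID exists).
        if record_id is None or candidate.get("_record_id") == record_id:
            return candidate

    # Fallback: scan all extracted records for a matching ID.
    if record_id is not None:
        for record in extracted_records:
            if record.get("_record_id") == record_id:
                return record
    return None


extracted = [
    {"_record_id": "a", "_extraction_index": 0, "authors": [{"name": "X"}]},
    {"_record_id": "b", "_extraction_index": 1, "authors": [{"name": "Y"}]},
]
# After filtering, only the second record remains in the processed list.
processed = [{"_record_id": "b", "_extraction_index": 1, "authors.name": "Y"}]
original = resolve_record(processed, extracted, 0)
```

The index lookup keeps the common case cheap, while the ID check guards against indices that no longer line up after filtering or reordering.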

response: requests.Response | ResponseProtocol | None
strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]

Convenience method that removes metadata annotations from a record list for clean export.

This method removes all metadata annotations (dictionary keys that are prefixed with an underscore) that were added during the record extraction step for pipeline traceability (e.g., _extraction_index, _record_id).

Parameters:

records (Optional[RecordType | RecordList]) – Records to strip. Defaults to processed_records if None.

Returns:

New list of records with annotation fields removed.

Return type:

RecordList

Example

>>> import pandas as pd
>>> clean_data = response.strip_annotations()
>>> df = pd.DataFrame(clean_data)  # No internal fields in DataFrame
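
Annotation stripping as described above amounts to dropping keys with a leading underscore. The following is a minimal sketch with a hypothetical helper name (`strip_underscore_keys`); it assumes records are flat dictionaries and does not reproduce the library's logging or its default of falling back to processed_records.

```python
from typing import Any, Optional, Union

Record = dict[str, Any]


def strip_underscore_keys(records: Optional[Union[Record, list[Record]]]) -> list[Record]:
    """Return new records without underscore-prefixed keys (illustrative)."""
    if records is None:
        return []
    if isinstance(records, dict):
        records = [records]  # a single record is treated as a one-item list
    return [
        {key: value for key, value in record.items() if not str(key).startswith("_")}
        for record in records
    ]


annotated = [{"title": "A study", "_record_id": "x1", "_extraction_index": 0}]
clean = strip_underscore_keys(annotated)
```

Because new dictionaries are built, the annotated input records are left unchanged.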
property total_query_hits: int | None

Returns the total number of results as reported by the API.

This method retrieves the total_query_hits variable from the processed_metadata attribute, and if metadata hasn’t yet been processed, this method will then call process_metadata() manually to ensure that the field is available.

scholar_flux.api.models.search_api_config module

The scholar_flux.api.models.search_api_config module implements the core SearchAPIConfig used to drive API searches.

The SearchAPIConfig is used by the SearchAPI to interact with API providers via a unified interface for orchestrating response retrieval.

This configuration defines settings such as rate limiting, the number of records retrieved per request, API keys, and the API provider/URL where requests will be sent.

Under the hood, the SearchAPIConfig can use both pre-created and custom defaults to create a new configuration with minimal code.

class scholar_flux.api.models.search_api_config.SearchAPIConfig(*, provider_name: str = '', base_url: str = '', records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: float = -1, api_key: SecretStr | None = None, api_specific_parameters: dict[str, Any] | None = None)[source]

Bases: BaseModel

The SearchAPIConfig class provides the core tools necessary to set and interact with the API. The SearchAPI uses this class to retrieve data from an API using universal parameters to simplify the process of retrieving raw responses.

provider_name

Indicates the name of the API to use when making requests to a provider. If the provider name matches a known default and the base_url is unspecified, the base URL for the current provider is used instead.

Type:

str

base_url

Indicates the API URL where data will be searched and retrieved.

Type:

str

records_per_page

Controls the number of records that will appear on each page.

Type:

int

request_delay

Indicates the minimum delay between each request to avoid exceeding API rate limits.

Type:

float

api_key

This is an API-specific parameter for validating the current user’s identity. If a str type is provided, it is converted into a SecretStr.

Type:

Optional[str | SecretStr]

api_specific_parameters

A dictionary containing all parameters specific to the current API. API-specific parameters include the following:

  1. mailto (Optional[str | SecretStr]):

    An optional email address for receiving feedback on usage from providers. This parameter is currently applicable only to the Crossref API.

  2. db (str):

    The parameter used by the NIH to direct requests for data to the pubmed database. This parameter defaults to pubmed and does not require direct specification.

Type:

dict[str, Any]

Examples

>>> from scholar_flux.api import SearchAPIConfig, SearchAPI, provider_registry
# To create a CROSSREF configuration with minimal defaults and provide an api_specific_parameter:
>>> config = SearchAPIConfig.from_defaults(provider_name = 'crossref', mailto = 'your_email_here@example.com')
# The configuration automatically retrieves the configuration for the "Crossref" API.
>>> assert config.provider_name == 'crossref' and config.base_url == provider_registry['crossref'].base_url
>>> api = SearchAPI.from_settings(query = 'q', config = config)
>>> assert api.config == config
# To retrieve all defaults associated with a provider and automatically read an API key if needed:
>>> config = SearchAPIConfig.from_defaults(provider_name = 'pubmed', api_key = 'your api key goes here')
# The API key is retrieved automatically if you have the API key specified as an environment variable.
>>> assert config.api_key is not None
# Default provider API specifications are already pre-populated if they are set with defaults.
>>> assert config.api_specific_parameters['db'] == 'pubmed'  # Required by pubmed and defaults to pubmed.
# Update a provider and automatically retrieve its API key - the previous API key will no longer apply.
>>> updated_config = SearchAPIConfig.update(config, provider_name = 'core')
# The API key should have been overwritten to use core. Looks for a `CORE_API_KEY` env variable by default.
>>> assert updated_config.provider_name  == 'core' and  updated_config.api_key != config.api_key
DEFAULT_PROVIDER: ClassVar[str] = 'PLOS'
DEFAULT_RECORDS_PER_PAGE: ClassVar[int] = 25
DEFAULT_REQUEST_DELAY: ClassVar[float] = 6.1
MAX_API_KEY_LENGTH: ClassVar[int] = 512
api_key: SecretStr | None
api_specific_parameters: dict[str, Any] | None
base_url: str
classmethod default_request_delay(v: int | float | None, provider_name: str | None = None) float[source]

Helper method enabling the retrieval of the most appropriate rate limit for the current provider.

Defaults to the SearchAPIConfig default rate limit when the current provider is unknown and a valid rate limit has not yet been provided.

Parameters:
  • v (Optional[int | float]) – The value received for the current request_delay

  • provider_name (Optional[str]) – The name of the provider to retrieve a rate limit for

Returns:

The supplied non-negative request delay, the retrieved rate limit for the current provider if available, or SearchAPIConfig.DEFAULT_REQUEST_DELAY, in that order of priority.

Return type:

float
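The priority order described for the request delay can be sketched as follows. The provider table and the function name `resolve_request_delay` are hypothetical; only the default of 6.1 seconds comes from the class attribute documented above.

```python
from typing import Optional, Union

DEFAULT_REQUEST_DELAY = 6.1  # mirrors SearchAPIConfig.DEFAULT_REQUEST_DELAY

# Hypothetical per-provider rate limits, for illustration only.
PROVIDER_DELAYS = {"example_provider": 1.0}


def resolve_request_delay(
    v: Optional[Union[int, float]], provider_name: Optional[str] = None
) -> float:
    """Pick a delay: valid input first, then provider default, then class default."""
    if v is not None and v >= 0:
        return float(v)
    provider_delay = PROVIDER_DELAYS.get((provider_name or "").lower())
    if provider_delay is not None:
        return provider_delay
    return DEFAULT_REQUEST_DELAY


explicit = resolve_request_delay(2, "example_provider")
from_provider = resolve_request_delay(-1, "example_provider")
fallback = resolve_request_delay(None, "unknown_provider")
```

A negative value such as -1 acts as a sentinel meaning "not set", which lets the provider default or the class default take over.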

classmethod from_defaults(provider_name: str, **overrides: Any) SearchAPIConfig[source]

Uses the default configuration for the chosen provider to create a SearchAPIConfig object containing configuration parameters. Note that additional parameters and field overrides can be added via the **overrides field.

Parameters:
  • provider_name (str) – The name of the provider for which to create the config

  • **overrides – Optional keyword arguments to specify overrides and additional arguments

Returns:

A default SearchAPIConfig object based on the chosen parameters

Return type:

SearchAPIConfig

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

provider_name: str
records_per_page: int
request_delay: float
classmethod set_records_per_page(v: int | None) int[source]

Sets the records_per_page parameter, falling back to the default if the supplied value is missing or negative.

Triggers a validation error when records_per_page is an invalid type. Otherwise uses the DEFAULT_RECORDS_PER_PAGE class attribute if the supplied value is missing or is a negative number.
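The fallback rule for records_per_page can be expressed in a few lines. This sketch uses a hypothetical function name and raises ValueError for invalid types, whereas the actual validator runs inside pydantic and surfaces a validation error.

```python
from typing import Optional

DEFAULT_RECORDS_PER_PAGE = 25  # mirrors SearchAPIConfig.DEFAULT_RECORDS_PER_PAGE


def resolve_records_per_page(v: Optional[int]) -> int:
    """Return a usable page size, falling back to the default (illustrative)."""
    if v is None:
        return DEFAULT_RECORDS_PER_PAGE
    if not isinstance(v, int):
        raise ValueError(f"records_per_page must be an integer, received {v!r}")
    return DEFAULT_RECORDS_PER_PAGE if v < 0 else v


missing = resolve_records_per_page(None)
negative = resolve_records_per_page(-5)
valid = resolve_records_per_page(50)
```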

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for retrieving a string representation of the overall structure of the current SearchAPIConfig.

classmethod update(current_config: SearchAPIConfig, **overrides: Any) SearchAPIConfig[source]

Create a new SearchAPIConfig by updating an existing config with new values and/or switching to a different provider. This method ensures that the new provider’s base_url and defaults are used if provider_name is given, and that API-specific parameters are prioritized and merged as expected.

Parameters:
  • current_config (SearchAPIConfig) – The existing configuration to update.

  • **overrides – Any fields or API-specific parameters to override or add.

Returns:

A new config with the merged and prioritized values.

Return type:

SearchAPIConfig

property url_basename: str

Uses the _extract_url_basename method to extract the basename from the provider URL associated with the current config instance.

classmethod validate_api_key(v: SecretStr | str | None) SecretStr | None[source]

Validates the api_key attribute and triggers a validation error if it is not valid.

classmethod validate_provider_name(v: str | None) str[source]

Validates the provider_name attribute and triggers a validation error if it is not valid.

classmethod validate_request_delay(v: int | float | None) int | float | None[source]

Sets the request delay (delay between each request) for valid request delays. This validator triggers a validation error when the request delay is an invalid type.

If a request delay is left None or is a negative number, this class method returns -1, and further validation is performed by cls.default_request_delay to retrieve the provider’s default request delay.

If not available, SearchAPIConfig.DEFAULT_REQUEST_DELAY is used.

validate_search_api_config_parameters() Self[source]

Validation method that resolves URLs and/or provider names to provider_info when one or the other is not explicitly provided.

Occurs as the last step in the validation process.

classmethod validate_url(v: str) str[source]

Validates the base_url and triggers a validation error if it is not valid.

classmethod validate_url_type(v: str | None) str[source]

Validates the type for the base_url attribute and triggers a validation error if it is not valid.

scholar_flux.api.models.search_inputs module

The scholar_flux.api.models.search_inputs module implements the PageListInput RootModel for multi-page searches.

The PageListInput model is designed to validate and prepare lists and iterables of page numbers for multi-page retrieval using the SearchCoordinator.search_pages method.

class scholar_flux.api.models.search_inputs.PageListInput(root: RootModelRootType = PydanticUndefined)[source]

Bases: RootModel[Sequence[int]]

Helper class for processing page information in a predictable manner.

The PageListInput class expects to receive a list, string, or generator that contains at least one page number. If a single integer is received, it is transformed into a single-item list containing that integer.

Parameters:

root (Sequence[int]) – A list containing at least one page number.

Examples

>>> from scholar_flux.api.models import PageListInput
>>> PageListInput(5)
PageListInput([5])
>>> PageListInput(range(5))
PageListInput([0, 1, 2, 3, 4])
classmethod from_record_count(min_records: int, records_per_page: int, page_offset: int = 0) Self[source]

Helper method for calculating the total number of pages required to retrieve at least min_records records.

Parameters:
  • min_records (int) – The total number of records to retrieve sequentially.

  • records_per_page (int) – The total number of records that are retrieved per page.

  • page_offset (int) – The total number of pages to skip before beginning record retrieval (0 by default). When the provided value is not a non-negative integer, this parameter is coerced to 0 and a warning is triggered.

Returns:

The calculated page range used to retrieve at least min_records records given records_per_page.

Return type:

PageListInput

Examples

>>> from scholar_flux.api.models import PageListInput
>>> PageListInput.from_record_count(20, 10, 0)
PageListInput(1, 2)
>>> PageListInput.from_record_count(20, 10, 2)
PageListInput(3, 4)
>>> PageListInput.from_record_count(15, 10, 1)
PageListInput(2, 3)

>>> # triggers a warning for page_offset (non-integers are coerced to 0):
>>> PageListInput.from_record_count(20, 10, None)
PageListInput(1, 2)

>>> PageListInput.from_record_count(0, 10, 0)
PageListInput()

Note

This method expects a positive integer for min_records from which to calculate the page range required to retrieve at least min_records. Specifying 0 for min_records will result in an empty list of pages that essentially functions as a no-op search returning an empty list from SearchCoordinator.search_records.
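The page arithmetic behind this calculation is a ceiling division followed by an offset. Below is a minimal sketch that reproduces the page numbers shown in the examples above; `pages_for_record_count` is a hypothetical stand-in that returns a plain list rather than a PageListInput, and it assumes records_per_page is positive.

```python
import math
import warnings


def pages_for_record_count(min_records: int, records_per_page: int, page_offset: int = 0) -> list[int]:
    """Compute the page numbers needed to retrieve at least min_records (illustrative)."""
    if not isinstance(page_offset, int) or page_offset < 0:
        warnings.warn("page_offset must be a non-negative integer; using 0 instead")
        page_offset = 0
    # Number of pages needed to cover min_records, rounding up.
    total_pages = math.ceil(min_records / records_per_page)
    # Pages are 1-indexed and start immediately after the skipped pages.
    return list(range(page_offset + 1, page_offset + total_pages + 1))


two_pages = pages_for_record_count(20, 10, 0)
offset_pages = pages_for_record_count(20, 10, 2)
partial_page = pages_for_record_count(15, 10, 1)
no_pages = pages_for_record_count(0, 10, 0)
```

Rounding up means that 15 records at 10 per page still requires two pages, and a min_records of 0 yields an empty page list.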

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property page_numbers: Sequence[int]

Returns the sequence of validated page numbers as a list.

classmethod page_validation(v: str | int | Sequence[int | str]) Sequence[int][source]

Processes the page input to ensure that a list of integers is returned if the received page list is in a valid format.

Parameters:

v (str | int | Sequence[int | str]) – A page or sequence of pages to be formatted as a list of pages.

Returns:

A validated, formatted sequence of page numbers assuming successful page validation

Return type:

Sequence[int]

Raises:

ValidationError – Internally raised via pydantic if a ValueError is encountered (if the input is not exclusively a page or list of page numbers)

classmethod process_page(page_value: str | int) int[source]

Helper method for ensuring that each value in the sequence is a numeric string or whole number.

Note that this function will not throw an error for negative pages as that is handled at a later step in the page search process.

Parameters:

page_value (str | int) – The value to be converted if it is not already an integer

Returns:

A validated integer if the page can be converted to an integer and is not a float

Return type:

int

Raises:

ValueError – When the value is not an integer or numeric string to be converted to an integer
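The two validation helpers above can be approximated without pydantic. This sketch uses hypothetical names (`to_page_number`, `to_page_list`) and raises ValueError directly, whereas the actual model reports failures as a pydantic ValidationError.

```python
from collections.abc import Iterable
from typing import Union


def to_page_number(page_value: Union[str, int]) -> int:
    """Convert a whole number or numeric string to an int (illustrative)."""
    if isinstance(page_value, int):
        return page_value
    if isinstance(page_value, str) and page_value.strip().lstrip("-").isdigit():
        return int(page_value)
    # Floats and non-numeric strings are rejected.
    raise ValueError(f"Expected an integer or numeric string, received {page_value!r}")


def to_page_list(v: Union[str, int, Iterable]) -> list[int]:
    """Wrap a single page in a list, then validate every entry (illustrative)."""
    if isinstance(v, (int, str)):
        v = [v]
    return [to_page_number(page) for page in v]


single = to_page_list(5)
from_range = to_page_list(range(3))
mixed = to_page_list(["1", "2", 3])
```

Negative values pass through unchanged here, in line with the note above that negative pages are handled at a later step.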

scholar_flux.api.models.search_results module

The scholar_flux.api.models.search_results module defines the SearchResult and SearchResultList implementations.

These two classes are containers of API response data and aid in the storage of retrieved and processed response results while allowing the efficient identification of individual queries to providers from both multi-page and multi-coordinated searches.

These implementations allow increased organization for the API output of multiple searches by defining the provider, page, query, and response result retrieved from multi-page searches from the SearchCoordinator and multi-provider/page searches using the MultiSearchCoordinator.

Classes:
SearchResult:

Pydantic Base class that stores the search result as well as the query, provider name, and page.

SearchResultList:

Inherits from a basic list to constrain the output to a list of SearchResults while providing data preparation convenience functions for downstream frameworks.

Example

>>> from scholar_flux import SearchCoordinator
>>> coordinator = SearchCoordinator(query="sight restoration", provider_name="crossref")
>>> response = coordinator.search_page(1)
>>>
>>> # Check if processing succeeded
>>> if response:
...     print(f"Retrieved {response.record_count} records for page {response.page} with query {response.query}")
...     print(f"Total available: {response.total_query_hits}")
...
...     # Normalize to common schema with post-processing
...     normalized = response.normalize(include = {'query', 'page', 'display_name'})
...     for record in normalized[:3]:
...         print(f"Title: {record['title']}")
...         print(f"Authors: {record['authors']}")  # Formatted as a list
...         print(f"Publisher: {record['publisher']}") # Recursively extracted
...         print(f"Year: {record['year']}")  # Extracted and parsed as an integer
...         print("-"*100)
... else:
...     print(f"Error: {response.error} - {response.message}")
class scholar_flux.api.models.search_results.SearchResult(*, query: str, provider_name: str, page: Annotated[int, Ge(ge=0)], response_result: ProcessedResponse | ErrorResponse | None = None)[source]

Bases: BaseModel

Core container for search results that stores the retrieved and processed data from API Searches.

This class is useful when iterating and searching over a range of pages, queries, and providers at a time. This class uses pydantic to ensure that field validation is automatic, ensuring integrity and reliability of response processing. This supports multi-page searches that link each response result to a particular query, page, and provider.

Parameters:
  • query (str) – The query used to retrieve records and response metadata

  • provider_name (str) – The name of the provider where data is being retrieved

  • page (int) – The page number associated with the request for data

  • response_result (Optional[ProcessedResponse | ErrorResponse]) – The response result containing the specifics of the data retrieved from the response or the error messages recorded if the request is not successful.

For convenience, the properties of the response_result are referenced as properties of the SearchResult, including: response, parsed_response, processed_records, etc.
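The property forwarding described above can be illustrated with a small stand-in. Everything in this sketch is hypothetical (`FakeProcessedResponse`, `ResultSketch`); it uses dataclasses instead of pydantic and only shows how a container might expose fields of an optional inner response, returning None when no response is present.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeProcessedResponse:
    """Hypothetical stand-in for a processed response."""

    processed_records: list
    cache_key: Optional[str] = None


@dataclass
class ResultSketch:
    """Hypothetical container that forwards lookups to response_result."""

    query: str
    provider_name: str
    page: int
    response_result: Optional[Any] = None

    @property
    def processed_records(self) -> Optional[list]:
        # Falls back to None when response_result is missing or lacks the field.
        return getattr(self.response_result, "processed_records", None)

    @property
    def cache_key(self) -> Optional[str]:
        return getattr(self.response_result, "cache_key", None)


empty = ResultSketch(query="q", provider_name="plos", page=1)
filled = ResultSketch(
    query="q",
    provider_name="plos",
    page=1,
    response_result=FakeProcessedResponse([{"title": "A study"}], cache_key="k1"),
)
```

Forwarding through getattr with a default lets calling code treat successful, failed, and missing responses uniformly.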

build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType][source]

Builds a lookup table mapping record IDs to their original extracted records.

This method delegates to the underlying ProcessedResponse or ErrorResponse to build an index for fast ID-based resolution of extracted records. Useful for batch resolution operations where multiple records need to be resolved to their original nested structures without repeated searches.

Parameters:
  • *args – Positional arguments passed through to the underlying response’s build_record_id_index method. The ProcessedResponse implementation accepts no positional arguments.

  • **kwargs – Keyword arguments passed through to the underlying response’s build_record_id_index method. The ProcessedResponse implementation accepts no keyword arguments.

Returns:

A dictionary mapping _record_id values to their corresponding extracted records. Returns an empty dict if response_result is None or if no extracted records exist.

Return type:

dict[str, RecordType]

property cache_key: str | None

Extracts the cache key from the API Response if available.

This cache key is used when storing and retrieving data from response processing cache storage.

Returns:

The key if the response_result contains a cache_key that is not None. None otherwise.

Return type:

Optional[str]

property cached: bool | None

Identifies whether the current response was retrieved from the session cache.

Returns:

True if the response is a CachedResponse object, False if it is a fresh requests.Response object, and None if the status is unknown (e.g., the response attribute is not a requests.Response object or subclass).

Return type:

Optional[bool]

property created_at: str | None

Extracts the time at which the ErrorResponse or ProcessedResponse was created, if available.

property data: RecordList | None

Alias referring back to the processed records from the ProcessedResponse or ErrorResponse.

Contains the processed records from the API response processing step after a successfully received response has been processed. If an error response was received instead, the value of this property is None.

Returns:

The list of processed records if ProcessedResponse.data is not None. None otherwise.

Return type:

Optional[RecordList]

property display_name: str

Returns a human-readable provider name for the current provider when available.

property error: str | None

Extracts the error name associated with the result from the base class.

This field is generally populated when ErrorResponse objects are received and indicates why an error occurred.

Returns:

The error if the response_result is an ErrorResponse with a populated error field. None otherwise.

Return type:

Optional[str]

property extracted_records: RecordList | None

Contains the extracted records from the response record extraction step after successful response parsing.

If an ErrorResponse was received instead, the value of this property is None.

Returns:

A list of extracted records if ProcessedResponse.extracted_records is not None. None otherwise.

Return type:

Optional[RecordList]

property message: str | None

Extracts the message associated with the result from the base class.

This message is generally populated when ErrorResponse objects are received and indicates why an error occurred in the event that the response_result is an ErrorResponse.

Returns:

The message if the ProcessedResponse.message or ErrorResponse.message is not None. None otherwise.

Return type:

Optional[str]

property metadata: MetadataType | None

Contains the metadata from the API response metadata extraction step after successful response parsing.

If an ErrorResponse was received instead, the value of this property is None.

Returns:

A dictionary of metadata if ProcessedResponse.metadata is not None. None otherwise.

Return type:

Optional[MetadataType]

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = False, update_records: bool | None = None, include: SearchFields | None = None, *, resolve_records: bool | None = None, keep_api_specific_fields: bool | Sequence | None = None, strip_annotations: bool | None = None) NormalizedRecordList[source]

Normalizes ProcessedResponse record fields to map API-specific fields to provider-agnostic field names.

The field map is resolved in the following order of priority:

  1. User-specified field maps

  2. Resolving a provider name to a BaseFieldMap or subclass from the registry.

  3. Resolving the URL to a BaseFieldMap or subclass

If a field map is not available at any step in the process, an empty list will be returned if raise_on_error=False. Otherwise, a RecordNormalizationException is raised.
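The priority order above can be sketched as a simple fallback chain. Everything in this block is an assumption made for illustration: the two registries are plain dictionaries, strings stand in for field map objects, and resolve_field_map is a hypothetical helper rather than part of the package.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical registries: plain strings stand in for BaseFieldMap instances
MAPS_BY_PROVIDER = {"arxiv": "arxiv_field_map"}
MAPS_BY_HOST = {"api.crossref.org": "crossref_field_map"}


def resolve_field_map(
    user_map: Optional[str], provider_name: Optional[str], url: Optional[str]
) -> Optional[str]:
    """Return the first available map: user-specified, then provider name, then URL."""
    if user_map is not None:
        return user_map
    if provider_name and provider_name.lower() in MAPS_BY_PROVIDER:
        return MAPS_BY_PROVIDER[provider_name.lower()]
    if url:
        # Fall back to a lookup keyed on the host of the request URL
        return MAPS_BY_HOST.get(urlparse(url).hostname or "")
    return None
```

A caller that gets None back can then either return an empty list or raise, which mirrors the raise_on_error switch described above.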

Parameters:
  • field_map (Optional[BaseFieldMap]) – Optional field map to use in the normalization of the record list. If not provided, the field map is looked up from the registry using the name or URL of the current provider.

  • raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a normalization error is raised.

  • update_records (Optional[bool]) – A flag that determines whether updates should be made to the normalized_records attribute after computation. If None, updates are made only if the normalized_records attribute is None.

  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each normalized record dictionary. Possible fields include provider_name, query, display_name, and page. By default, no model fields are appended.

  • resolve_records (Optional[bool]) – A flag that determines if resolution with annotated records should occur. If True or None, resolution occurs. If False, normalization uses processed_records when not None and extracted_records otherwise.

  • keep_api_specific_fields (Optional[bool | Sequence]) – Indicates which API-specific fields should be retained from the complete set of fields that are returned. If False, only the core fields defined by the FieldMap are returned. If True or None, all fields are returned instead.

  • strip_annotations (Optional[bool]) – A flag indicating whether to remove metadata annotations from normalized records. If True or None, fields with leading underscores are removed from each normalized record.

Returns:

A list of normalized records, or empty list if normalization is unavailable.

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If raise_on_error=True and no field map found.

Note

The ProcessedResponse.normalize() method handles most of the internal logic. When the user does not explicitly pass a field map and the map resolved from the provider name matches the map resolved from the URL, this method delegates normalization to ProcessedResponse.normalize(), which also handles the resolution details for caching purposes.

Example

>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.utils import truncate, coerce_flattened_str
>>> coordinator = SearchCoordinator(query = 'AI Safety', provider_name = 'arXiv')
>>> response = coordinator.search_page(page = 1)
>>> normalized_records = response.normalize(include = {'display_name', 'query', 'page'})
>>> for record in normalized_records[:5]:
...     print(f"Title: {record['title']}")
...     print(f"URL: {record['url']}")
...     print(f"Source: From {record['display_name']}: '{record['query']}' Page={record['page']}")
...     print(f"Abstract: {truncate(record['abstract'] or 'Not available')}")
...     print(f"Authors: {coerce_flattened_str(record['authors'])}")
...     print("-"*100)

# OUTPUT:
# Title: AI Safety…
# URL: http://arxiv.org/abs/…
# Source: From arXiv: 'AI Safety' Page=1
# Abstract: This report …
# Authors: …
# ----------------------------------------------------------------------------------------------------

property normalized_records: NormalizedRecordList | None

Contains the normalized records from the API response processing step after normalization.

If an error response was received instead, the value of this property is None.

Returns:

The list of normalized dictionary records if ProcessedResponse.normalized_records is not None. None otherwise.

Return type:

Optional[NormalizedRecordList]

page: int
property parsed_response: Any | None

Contains the parsed response content from the API response parsing step.

Parsed API responses are generally formatted as dictionaries that contain the extracted JSON, XML, or YAML content from a successfully received, raw response.

If an ErrorResponse was received instead, the value of this property is None.

Returns:

The parsed response when ProcessedResponse.parsed_response is not None. Otherwise None.

Return type:

Optional[Any]

process_metadata(metadata_map: ResponseMetadataMap | None = None, update_metadata: bool | None = None) MetadataType | None[source]

Processes and maps API-specific ProcessedResponse.metadata fields to provider-agnostic field names.

By default, the ResponseMetadataMap retrieves the API-specific page-size (records per page) and total results (total query hits) fields and converts them to integers when possible.

The field map is resolved in the following order of priority:

  1. User-specified field maps

  2. Resolving a provider name to a ResponseMetadataMap or subclass from the registry.

  3. Resolving the URL to a ResponseMetadataMap or subclass

If a metadata_map is not available, None will be returned.

Parameters:
  • metadata_map (Optional[ResponseMetadataMap]) – An optional response metadata map to use in the mapping and processing of the response metadata. If not provided, the metadata map is looked up via the registry using the name or URL of the current provider.

  • update_metadata (Optional[bool]) – A flag that determines whether updates should be made to the processed_metadata attribute after computation. If None, updates are made only if the processed_metadata attribute is None.

Returns:

A processed metadata dictionary mapping total_query_hits and records_per_page fields where possible.

Return type:

Optional[MetadataType]
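To make the integer conversion concrete, here is a minimal sketch of mapping two API-specific metadata keys onto total_query_hits and records_per_page. The function map_metadata and the sample key names totalResults and itemsPerPage are hypothetical; the package's ResponseMetadataMap may behave differently.

```python
from typing import Any, Optional


def to_int(value: Any) -> Optional[int]:
    """Convert a value to int when possible, returning None otherwise."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def map_metadata(metadata: dict[str, Any], total_key: str, page_size_key: str) -> dict[str, Optional[int]]:
    """Map API-specific keys (assumed names) to provider-agnostic field names."""
    return {
        "total_query_hits": to_int(metadata.get(total_key)),
        "records_per_page": to_int(metadata.get(page_size_key)),
    }


# Hypothetical raw metadata where one value arrives as a string
raw_metadata = {"totalResults": "1200", "itemsPerPage": 20}
processed = map_metadata(raw_metadata, "totalResults", "itemsPerPage")
```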

property processed_metadata: MetadataType | None

Contains the processed metadata from the API response processing step after the response has been processed.

If an error response was received instead, the value of this property is None.

Returns:

The processed metadata dict if ProcessedResponse.processed_metadata is not None. None otherwise.

Return type:

Optional[MetadataType]

property processed_records: RecordList | None

Contains the processed records from the API response processing step after processing the response.

If an error response was received instead, the value of this property is None.

Returns:

The list of processed records if ProcessedResponse.processed_records is not None. None otherwise.

Return type:

Optional[RecordList]

provider_name: str
query: str
property record_count: int

Retrieves the overall length of the processed_records field from the API response if available.

property records_per_page: int | None

Returns the number of records sent on the current page according to the API-specific metadata field.

resolve_extracted_record(*args: Any, **kwargs: Any) RecordType | None[source]

Resolves a processed record back to its original extracted record.

This method delegates to the underlying ProcessedResponse or ErrorResponse to resolve a single processed record (identified by its index) back to its original extracted record with nested structure. Uses annotation fields (_extraction_index, _record_id) added during extraction.

Parameters:
  • *args – Positional arguments passed through to the underlying response’s resolve_extracted_record method. The ProcessedResponse implementation accepts processed_index (int), the index of the record in processed_records.

  • **kwargs – Keyword arguments passed through to the underlying response’s resolve_extracted_record method.

Returns:

The original extracted record with nested structure, or None if response_result is None, the record index is invalid, or no matching extracted record is found.

Return type:

Optional[RecordType]

property response: Response | ResponseProtocol | None

Directly references the raw response or response-like object from the API Response if available.

Returns:

The response object (response-like or None) if a ProcessedResponse or ErrorResponse is available. When either APIResponse subclass is not available, None is returned instead.

Return type:

Optional[Response | ResponseProtocol]

response_result: ProcessedResponse | ErrorResponse | None
property retrieval_timestamp: datetime | None

Indicates the ISO timestamp associated with the original response creation date and time.

property status: str | None

Extracts the human-readable status description from the underlying response, if available.

property status_code: int | None

Extracts the HTTP status code from the underlying response, if available.

strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]

Convenience method for removing metadata annotations from a record list for clean export.

Strips fields prefixed with underscore that were added during extraction for pipeline traceability (e.g., _extraction_index, _record_id).

Parameters:

records (Optional[RecordType | RecordList]) – Records to strip. Defaults to processed_records if None.

Returns:

New list of records with annotation fields removed. If there are no records to strip, an empty list is returned instead.

Example

>>> import pandas as pd
>>> clean_data = response.strip_annotations()
>>> df = pd.DataFrame(clean_data)  # No internal fields in DataFrame
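
The stripping step itself amounts to dropping keys that start with an underscore. The sketch below is a simplified stand-in under that assumption: strip_underscore_fields is a hypothetical name, and unlike the method above it returns an empty list for None instead of falling back to processed_records.

```python
from typing import Any, Union

Record = dict[str, Any]


def strip_underscore_fields(records: Union[Record, list[Record], None]) -> list[Record]:
    """Return new record dicts without underscore-prefixed annotation fields."""
    if records is None:
        return []
    if isinstance(records, dict):
        records = [records]  # accept a single record for convenience
    return [
        {key: value for key, value in record.items() if not str(key).startswith("_")}
        for record in records
    ]


# Hypothetical annotated record
annotated = [{"_record_id": "a1", "_extraction_index": 0, "title": "First"}]
clean = strip_underscore_fields(annotated)
```

Because new dictionaries are built, the annotated records stay intact for later resolution.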
property total_query_hits: int | None

Returns the total number of query hits according to the processed metadata field specific to the API.

property url: str | None

Extracts the URL from the underlying response, if available.

with_search_fields(records: NormalizedRecordType, include: SearchFields | None = None, strip_annotations: bool | None = None) NormalizedRecordType[source]
with_search_fields(records: NormalizedRecordList, include: SearchFields | None = None, strip_annotations: bool | None = None) NormalizedRecordList
with_search_fields(records: RecordType, include: SearchFields | None = None, strip_annotations: bool | None = None) RecordType
with_search_fields(records: RecordList | Iterator[RecordType], include: SearchFields | None = None, strip_annotations: bool | None = None) RecordList
with_search_fields(records: None, include: SearchFields | None = None, strip_annotations: bool | None = None) RecordType

Returns a record or list of record dictionaries merged with selected SearchResult fields.

Parameters:
  • records (RecordType | Iterator[RecordType] | NormalizedRecordType | RecordList | NormalizedRecordList) – The record dictionary or list of records to be merged with SearchResult fields.

  • include – Set of SearchResult fields to include (default: {“provider_name”, “page”}).

  • strip_annotations (Optional[bool]) – A flag indicating whether to remove metadata annotations from records. If True, fields with leading underscores are removed from each processed record.

Returns:

One of the following, depending on the input:

  • RecordType: A single dictionary is returned if a single parsed record is provided.

  • RecordList: A list of dictionaries is returned if a list of parsed records is provided.

  • NormalizedRecordType: A single normalized dictionary is returned if a single normalized record is provided.

  • NormalizedRecordList: A list of normalized dictionaries is returned if a list of normalized records is provided.

Return type:

RecordType | RecordList | NormalizedRecordType | NormalizedRecordList
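Conceptually, the merge adds a few search-level fields to each record dictionary. The following sketch assumes that search fields are appended after the record's own keys and would overwrite a colliding key; that precedence, the helper name merge_search_fields, and the sample values are assumptions rather than documented behavior.

```python
from typing import Any, Iterable, Optional

Record = dict[str, Any]


def merge_search_fields(
    record: Record, search_fields: dict[str, Any], include: Optional[Iterable[str]] = None
) -> Record:
    """Return a copy of `record` with the selected search fields added."""
    # The default mirrors the documented default of {"provider_name", "page"}
    selected = {"provider_name", "page"} if include is None else set(include)
    extra = {name: search_fields[name] for name in sorted(selected)}
    return {**record, **extra}


# Hypothetical search-level values
search_fields = {"query": "ml", "provider_name": "arxiv", "display_name": "arXiv", "page": 2}
merged = merge_search_fields({"title": "A paper"}, search_fields)
with_query = merge_search_fields({"title": "A paper"}, search_fields, include={"query"})
```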

class scholar_flux.api.models.search_results.SearchResultList(iterable=(), /)[source]

Bases: list[SearchResult]

A custom list that stores the results of multiple SearchResult instances for enhanced type safety.

The SearchResultList class inherits from list and extends its functionality to tailor its utility to ProcessedResponse and ErrorResponse objects received from SearchCoordinators and MultiSearchCoordinators.

- SearchResultList.append

Basic list.append implementation extended to accept only SearchResults

- SearchResultList.extend

Basic list.extend implementation extended to accept only iterables of SearchResults

- SearchResultList.filter

Removes NonResponses and ErrorResponses from the list of SearchResults

- SearchResultList.select

Selects a subset of SearchResults by query, provider_name, or page

- SearchResultList.join

Combines all records from ProcessedResponses into a list of dictionary-based records

Note: Attempts to add other classes to the SearchResultList other than SearchResults will raise a TypeError.

append(item: SearchResult) None[source]

Overrides the default list.append method for type-checking compatibility.

This override ensures that only SearchResult objects can be appended to the SearchResultList. For all other types, a TypeError will be raised when attempting to append it to the SearchResultList.

Parameters:

item (SearchResult) – A SearchResult containing API response data, the name of the queried provider, the query, and the page number associated with the ProcessedResponse or ErrorResponse response result.

Raises:

TypeError – When the item to append to the SearchResultList is not a SearchResult.

copy() SearchResultList[source]

Overrides the default list.copy to return a shallow copy as a SearchResultList.

Returns:

A new, shallow copy of the current list.

Return type:

SearchResultList

extend(other: SearchResultList | MutableSequence[SearchResult] | Iterable[SearchResult]) None[source]

Overrides the default list.extend method for type-checking compatibility.

This override ensures that only an iterable of SearchResult objects can be used to extend the SearchResultList. For all other types, a TypeError will be raised when attempting to extend the SearchResultList with them.

Parameters:
  • other (Iterable[SearchResult]) – An iterable/sequence of response results containing the API response data, the provider name, and the page associated with the response.

Raises:

TypeError – When the item used to extend the SearchResultList is not an iterable of SearchResult instances.
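The type-checked append and extend overrides follow a common pattern for list subclasses. The sketch below shows that pattern with hypothetical Result and ResultList classes; it is not the package's source code.

```python
from typing import Any, Iterable


class Result:
    """Hypothetical stand-in for SearchResult."""

    def __init__(self, query: str) -> None:
        self.query = query


class ResultList(list):
    """A list that only accepts Result instances (illustrative only)."""

    def append(self, item: Any) -> None:
        if not isinstance(item, Result):
            raise TypeError(f"Expected a Result, received {type(item).__name__}")
        super().append(item)

    def extend(self, other: Iterable[Any]) -> None:
        items = list(other)  # materialize first so validation happens before mutation
        for item in items:
            if not isinstance(item, Result):
                raise TypeError(f"Expected a Result, received {type(item).__name__}")
        super().extend(items)


results = ResultList()
results.append(Result("q1"))
results.extend([Result("q2"), Result("q3")])

try:
    results.extend([Result("q4"), "not a result"])
    rejected = False
except TypeError:
    rejected = True  # nothing from the invalid batch was added
```

Validating the whole batch before calling list.extend keeps the list unchanged when any element is invalid.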

filter(invert: bool = False) SearchResultList[source]

Helper method that retains only elements from the original response that indicate successful processing.

Parameters:

invert (bool) – Controls whether SearchResults containing ProcessedResponses or ErrorResponses should be selected. If True, ProcessedResponses are omitted from the filtered SearchResultList. Otherwise, only ProcessedResponses are retained.
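
The effect of invert can be sketched with a small stand-in type. Here a hypothetical Outcome dataclass uses records=None to represent an error response; the real method inspects the response_result type instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """Hypothetical stand-in: `records` is None when the request failed."""

    page: int
    records: Optional[list] = None


def keep_processed(results: list, invert: bool = False) -> list:
    """Keep successful outcomes, or only failed ones when invert=True."""
    return [result for result in results if (result.records is None) == invert]


outcomes = [Outcome(1, [{"title": "A"}]), Outcome(2, None), Outcome(3, [])]
successes = keep_processed(outcomes)
failures = keep_processed(outcomes, invert=True)
```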

join(include: SearchFields | None = None, strip_annotations: bool | None = None) RecordList[source]

Combines all successfully processed API responses into a single list of dictionary records across all pages.

This method is especially useful for compatibility with pandas and polars dataframes that can accept a list of records when individual records are dictionaries.

Note that this method will only load processed responses that contain records that were also successfully extracted and processed.

Parameters:
  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each parsed record dictionary. Possible fields include provider_name, display_name, query, and page.

  • strip_annotations (Optional[bool]) – A flag indicating whether to remove metadata annotations from records. If True, fields with leading underscores are removed from each processed record.

Returns:

A single list containing all records retrieved from each page

Return type:

RecordList
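
A minimal sketch of the flattening step, assuming each page is represented by a plain dictionary with a records key (a hypothetical structure chosen for illustration). Pages without records contribute nothing, and the selected fields are copied onto every record.

```python
from typing import Any, Optional

Record = dict[str, Any]


def join_pages(pages: list[Record], include: Optional[set[str]] = None) -> list[Record]:
    """Flatten the records of every successful page into one list of dictionaries."""
    joined: list[Record] = []
    for page in pages:
        records = page.get("records")
        if not records:
            continue  # failed or empty pages are skipped
        extra = {name: page[name] for name in sorted(include or ())}
        joined.extend({**record, **extra} for record in records)
    return joined


# Hypothetical pages from two providers; the second page failed
pages = [
    {"provider_name": "crossref", "page": 1, "records": [{"title": "A"}, {"title": "B"}]},
    {"provider_name": "crossref", "page": 2, "records": None},
    {"provider_name": "arxiv", "page": 1, "records": [{"title": "C"}]},
]
rows = join_pages(pages, include={"provider_name", "page"})
```

A flat list of dictionaries like rows is the shape that dataframe constructors accepting a list of records expect.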

normalize(raise_on_error: bool = False, update_records: bool | None = None, include: SearchFields | None = None, **kwargs: Any) NormalizedRecordList[source]

Convenience method allowing the batch normalization of all SearchResults in a SearchResultList.

When called, each result in the current SearchResultList is sequentially normalized as a record dictionary and outputted into a flattened list of normalized records across all pages, providers, and queries. The provider name is extracted from the normalization step and identifies the origin of each record, but additional search annotations (e.g., query, provider_name, display_name, page) can be added to each record to identify its origin.

Parameters:
  • raise_on_error (bool) – A flag indicating whether to raise an error. If False, iteration will continue through failures in processing, such as cases where ErrorResponses and NonResponses otherwise raise a NotImplementedError. If raise_on_error is True, the normalization error will be raised.

  • update_records (Optional[bool]) – A flag that determines whether updates should be made to the normalized_records attribute after computation. If None, updates are made only if the normalized_records attribute is None.

  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each normalized record dictionary. Possible fields include provider_name, query, display_name, and page. By default, no model fields are appended.

  • **kwargs

    Additional keyword parameters forwarded to SearchResult.normalize(). Supported parameters include:

    • strip_annotations (bool): Removes internal annotation fields from normalized records

    • resolve_records (bool): Merges extracted and processed records when annotations exist

    • keep_api_specific_fields (bool | Sequence): Controls API-specific field inclusion

    • field_map (BaseFieldMap): An optional override to the field map to be used for record normalization

Returns:

A list of all normalized records across all queried pages, or an empty list if no records are available.

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If raise_on_error=True and no field map found.

process_metadata(update_metadata: bool | None = None, include: SearchFields | None = None) list[MetadataType][source]

Processes the ProcessedResponse.metadata field to map metadata fields to provider-agnostic field names.

By default, the ResponseMetadataMap retrieves the API-specific page-size (records per page) and total results (total query hits) fields and converts them to integers when possible.

The field map is resolved in the following order of priority:

  1. User-specified field maps

  2. Resolving a provider name to a ResponseMetadataMap or subclass from the registry.

  3. Resolving the URL to a ResponseMetadataMap or subclass

Parameters:
  • update_metadata (Optional[bool]) – A flag that determines whether updates should be made to the processed_metadata attribute after computation. If None, updates are made only if the processed_metadata attribute is None.

  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each listed metadata dictionary. Possible fields include provider_name, display_name, query, and page.

Returns:

A list of processed metadata dictionaries mapping total_query_hits and records_per_page fields where possible.

Return type:

list[MetadataType]

Raises:

RecordNormalizationException – If raise_on_error=True and no field map found.

property record_count: int

Retrieves the overall record count across all search results if available.

select(query: str | None = None, provider_name: str | Pattern | None = None, page: tuple | MutableSequence | int | None = None, *, fuzzy: bool = True, regex: bool | None = None) SearchResultList[source]

Helper method that enables the selection of all responses (successful or failed) based on their attributes.

Parameters:
  • query (Optional[str]) – The exact query string to match (if provided). Ignored if None

  • provider_name (Optional[str | Pattern]) – The provider string or regex pattern to match (if provided). Ignored if None.

  • page (Optional[tuple | MutableSequence | int]) – The page or sequence of pages to match. Ignored if None.

  • fuzzy (bool) – Identifies search results by provider using fuzzy finding, or “flexible matching that’s more forgiving than exact”. When true, this implementation matches providers with normalized names that begin with the provided prefix. (e.g., pubmed can match pubmed or pubmedefetch). The provider_registry.find() method is used to find providers within the package-level registry with names starting with the prefix. Pattern matching is performed if provider_name is a re.Pattern. If fuzzy=False, then only strict string matches will be preserved.

  • regex (Optional[bool]) – An optional keyword parameter passed to provider_registry.find() when fuzzy=True. When True, key pattern matching is enabled and registered providers can be identified using regex. This parameter is a no-op if fuzzy=False.

Examples

>>> from scholar_flux.api.models import SearchResult, SearchResultList
>>> crossref_result = SearchResult(page=1, query = 'q1', provider_name='crossref')
>>> pubmed_result = SearchResult(page=2, query = 'q2', provider_name='pubmedefetch')
>>> springer_nature_result = SearchResult(page=3, query = 'q3', provider_name='springernature')
>>> search_result_list = SearchResultList([crossref_result, pubmed_result, springer_nature_result])
>>> len(search_result_list.select()) # No filters selected
# OUTPUT: 3
>>> search_result_list.select(provider_name="pubmed") # Fuzzy prefix match: 'pubmed' matches 'pubmedefetch'
# OUTPUT: [SearchResult(query='q2', provider_name='pubmedefetch', page=2, response_result=None, display_name='PubMed (eFetch)')]
>>> search_result_list.select(provider_name="springer")
# OUTPUT: [SearchResult(query='q3', provider_name='springernature', page=3, response_result=None, display_name='Springer Nature')]
>>> search_result_list.select(query="q1")
# OUTPUT: [SearchResult(query='q1', provider_name='crossref', page=1, response_result=None, display_name='Crossref')]

Returns:

A filtered list of search results containing only results that match the conditions.

Return type:

SearchResultList
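
The three matching modes for provider_name (regex pattern, fuzzy prefix, strict equality) can be approximated as follows. This sketch skips the provider registry entirely, and the normalization rule in normalize_name (lowercasing and dropping non-alphanumeric characters) is an assumption; both helper names are hypothetical.

```python
import re
from typing import Union


def normalize_name(name: str) -> str:
    """Lowercase a name and drop non-alphanumeric characters (assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def matches_provider(candidate: str, provider_name: Union[str, re.Pattern], fuzzy: bool = True) -> bool:
    """Check a candidate provider name against a pattern, a prefix, or an exact string."""
    if isinstance(provider_name, re.Pattern):
        return provider_name.search(candidate) is not None
    if fuzzy:
        # Prefix match on normalized names, e.g. 'pubmed' matches 'pubmedefetch'
        return normalize_name(candidate).startswith(normalize_name(provider_name))
    return candidate == provider_name


providers = ["crossref", "pubmedefetch", "springernature"]
fuzzy_hits = [name for name in providers if matches_provider(name, "pubmed")]
strict_hits = [name for name in providers if matches_provider(name, "pubmed", fuzzy=False)]
pattern_hits = [name for name in providers if matches_provider(name, re.compile(r"nature$"))]
```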

Module contents

The scholar_flux.api.models module includes the configuration classes needed to configure APIs for specific providers and to ensure that the process is orchestrated robustly.

Core Models:
  • APIParameterMap: Contains the mappings and settings used to customize common and API-specific parameters to the requirements of each API.

  • APIParameterConfig: Encapsulates the created APIParameterMap as well as the methods used to create each request.

  • SearchAPIConfig: Defines the core logic to abstract the creation of requests with parameters specific to each API.

  • ProviderConfig: Allows users to define each of the defaults and mappings settings needed to create a Search API.

  • ProviderRegistry: A customized dictionary mapping provider names to their dynamically retrieved configuration.

  • ProcessedResponse: Indicates a successfully retrieved and processed response from an API provider.

  • ErrorResponse: Indicates that an exception occurred somewhere in the process of response retrieval and processing.

  • NonResponse: Indicates that a response of any status code could not be retrieved due to an exception.

class scholar_flux.api.models.APIParameterConfig(parameter_map: APIParameterMap)[source]

Bases: object

Uses an APIParameterMap instance and runtime parameter values to build parameter dictionaries for API requests.

Parameters:

parameter_map (APIParameterMap) – The mapping of universal to API-specific parameter names.

Class Attributes:
DEFAULT_CORRECT_ZERO_INDEX (bool):

Autocorrects zero-indexed API parameter building specifications to only accept positive values when True. If False, page calculation APIs will start from page 0 if zero-indexed (e.g., arXiv).

Examples

>>> from scholar_flux.api import APIParameterConfig, APIParameterMap
>>> # the API parameter map is defined and used to resolve parameters to the API's language
>>> api_parameter_map = APIParameterMap(
... query='q', records_per_page = 'pagesize', start = 'page', auto_calculate_page = False
... )
>>> # The APIParameterConfig defines the settings that indicate how to create requests
>>> api_parameter_config = APIParameterConfig(api_parameter_map, auto_calculate_page = False)
>>> # Builds parameters using the specification from the APIParameterMap
>>> page = api_parameter_config.build_parameters(query= 'ml', page = 10, records_per_page=50)
>>> print(page)
# OUTPUT {'q': 'ml', 'page': 10, 'pagesize': 50}
DEFAULT_CORRECT_ZERO_INDEX: ClassVar[bool] = True
__init__(*args: Any, **kwargs: Any) None
add_parameter(name: str, description: str | None = None, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False, inplace: bool = True) APIParameterConfig[source]

Passes keyword arguments to the current parameter map to add a new API-specific parameter to its config.

Parameters:
  • name (str) – The name of the parameter used when sending requests to APIs.

  • description (str) – A description of the API-specific parameter.

  • validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.

  • default (Any) – A default value used for the parameter if not specified by the user

  • required (bool) – Indicates whether the current parameter is required for API calls.

  • inplace (bool) –

    A flag that, if True, modifies the current parameter map instance in place. If False, it returns a new parameter map that contains the added parameter, while leaving the original unchanged.

    Note: If inplace=True and this instance is shared (e.g., retrieved from provider_registry), changes will affect all references to this parameter map.

Returns:

An APIParameterConfig with the updated parameter map. If inplace=True, the original is returned. Otherwise a new parameter map containing an updated api_specific_parameters dict is returned.

Return type:

APIParameterConfig
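
The difference between inplace=True and inplace=False can be sketched with a small stand-in class. ParameterMap below is hypothetical and stores only a dictionary of defaults; it illustrates the copy-versus-mutate semantics rather than the package's actual classes.

```python
import copy
from typing import Any


class ParameterMap:
    """Hypothetical stand-in holding API-specific parameters."""

    def __init__(self) -> None:
        self.api_specific_parameters: dict[str, Any] = {}

    def add_parameter(self, name: str, default: Any = None, inplace: bool = True) -> "ParameterMap":
        # Mutate this instance, or work on a deep copy and leave the original untouched
        target = self if inplace else copy.deepcopy(self)
        target.api_specific_parameters[name] = default
        return target


shared = ParameterMap()
updated_copy = shared.add_parameter("sort", default="relevance", inplace=False)
same_object = shared.add_parameter("mailto", inplace=True)
```

With inplace=False the shared instance never sees the sort parameter, which is the safer choice when the map came from a registry that other code also reads.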

classmethod as_config(parameter_map: dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) APIParameterConfig[source]

Factory method for creating a new APIParameterConfig from a dictionary or APIParameterMap.

This helper class method resolves the structure of the APIParameterConfig against its basic building blocks to create a new configuration when possible.

Parameters:

parameter_map (dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) – A parameter mapping/config to use in the instantiation of an APIParameterConfig.

Returns:

A new APIParameterConfig created from the inputs.

Return type:

APIParameterConfig

Raises:

APIParameterException – If there is an error in the creation/resolution of the required parameters

build_parameters(query: str | None, page: int | None, records_per_page: int, **api_specific_parameters: Any) Dict[str, Any][source]

Builds the dictionary of request parameters using the current parameter map and provided values at runtime.

Parameters:
  • query (Optional[str]) – The search query string.

  • page (Optional[int]) – The page number for pagination (1-based).

  • records_per_page (int) – Number of records to fetch per page.

  • **api_specific_parameters – Additional API-specific parameters to include.

Returns:

The fully constructed API request parameters dictionary, with keys as API-specific parameter names and values as provided.

Return type:

Dict[str, Any]
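
At its core, parameter building is a renaming step from universal names to API-specific names. The sketch below reproduces the dictionary from the class example under simplifying assumptions: the name map is a plain dictionary, the page is passed through unchanged (no automatic page calculation or zero-index correction), and build_request_parameters is a hypothetical helper.

```python
from typing import Any, Optional


def build_request_parameters(
    name_map: dict[str, str],
    query: Optional[str],
    page: Optional[int],
    records_per_page: int,
    **api_specific_parameters: Any,
) -> dict[str, Any]:
    """Rename universal parameters to API-specific names, skipping unset values."""
    universal = {"query": query, "start": page, "records_per_page": records_per_page}
    parameters = {
        name_map[key]: value
        for key, value in universal.items()
        if key in name_map and value is not None
    }
    # API-specific extras are passed through under their own names
    parameters.update(api_specific_parameters)
    return parameters


name_map = {"query": "q", "records_per_page": "pagesize", "start": "page"}
built = build_request_parameters(name_map, query="ml", page=10, records_per_page=50)
with_extra = build_request_parameters(name_map, query="ml", page=None, records_per_page=50, sort="date")
```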

extract_parameters(parameters: dict[str, Any] | None) dict[str, Any][source]

Extracts all parameters from a dictionary: Helpful for when keywords must be extracted by provider.

Note: this method modifies the original parameter dictionary, using the pop() method to extract all values identified as api_specific_parameters from the parameters dictionary when possible. These extracted parameters are then returned in a separate dictionary.

Useful for reorganizing dictionaries that contain dynamically specified input parameters for distinct APIs.

Parameters:

parameters (Optional[dict[str, Any]]) – An optional parameter dictionary from which to extract API-specific parameters.

Returns:

A dictionary containing all extracted parameters if available.

Return type:

dict[str, Any]
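
Because the extraction relies on pop(), the input dictionary is modified as a side effect. A minimal sketch of that behavior, with a hypothetical helper extract_known and an assumed set of known parameter names:

```python
from typing import Any, Optional


def extract_known(parameters: Optional[dict[str, Any]], known_names: set[str]) -> dict[str, Any]:
    """Pop every known API-specific parameter out of `parameters` and return them."""
    if not parameters:
        return {}
    # Iterate over a snapshot of the keys because the dictionary shrinks while popping
    return {name: parameters.pop(name) for name in list(parameters) if name in known_names}


# Hypothetical mixed input: 'sort' and 'mailto' are treated as API-specific
mixed = {"query": "ml", "sort": "date", "mailto": "user@example.com"}
extracted = extract_known(mixed, known_names={"sort", "mailto"})
```

After the call, mixed contains only the entries that were not extracted, so callers that need the original should pass a copy.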

classmethod from_defaults(provider_name: str, **additional_parameters: Any) APIParameterConfig[source]

Factory method to create APIParameterConfig instances with sensible defaults for known APIs.

If the provider_name does not exist, the code will raise an exception.

Parameters:
  • provider_name (str) – The name of the API to create the parameter map for.

  • api_key (Optional[str]) – API key value if required.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter config instance for the specified API.

Return type:

APIParameterConfig

Raises:

NotImplementedError – If the API name is unknown.

classmethod get_defaults(provider_name: str, **additional_parameters: Any) APIParameterConfig | None[source]

Factory method to create APIParameterConfig instances with sensible defaults for known APIs.

Avoids throwing an error if the provider name does not already exist.

Parameters:
  • provider_name (str) – The name of the API to create the parameter map for.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter config instance for the specified API. Returns None if a mapping for the provider_name cannot be retrieved.

Return type:

Optional[APIParameterConfig]
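
The difference between the lenient get_defaults and the strict from_defaults can be sketched as follows. REGISTRY is a hypothetical stand-in for the provider registry, and its contents are illustrative placeholders rather than the library's actual defaults.

```python
from typing import Optional

# Hypothetical stand-in for a provider registry; values are plain dicts here.
REGISTRY: dict[str, dict[str, str]] = {
    "plos": {"query": "q", "records_per_page": "rows"},
}


def get_defaults(provider_name: str) -> Optional[dict[str, str]]:
    """Lenient lookup: returns None for unknown providers."""
    return REGISTRY.get(provider_name.lower())


def from_defaults(provider_name: str) -> dict[str, str]:
    """Strict lookup: raises NotImplementedError for unknown providers."""
    defaults = get_defaults(provider_name)
    if defaults is None:
        raise NotImplementedError(f"Unknown provider: {provider_name!r}")
    return defaults
```

The lenient form suits optional lookups, while the strict form fails early when a provider must exist.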

property map: APIParameterMap

Helper property that is an alias for the APIParameterMap attribute.

The APIParameterMap maps all universal parameters to the parameter names specific to the API provider.

Returns:

The mapping that the current APIParameterConfig will use to build a dictionary of parameter requests specific to the current API.

Return type:

APIParameterMap

parameter_map: APIParameterMap
show_parameters() list[source]

Helper method to show the complete list of all parameters that can be found in the current_mappings.

Returns:

The complete list of all universal and API-specific parameters corresponding to the current API.

Return type:

List

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the APIParameterConfig.

class scholar_flux.api.models.APIParameterMap(*, query: str, records_per_page: str, start: str | None = None, api_key_parameter: str | None = None, api_key_required: bool = False, auto_calculate_page: bool = True, zero_indexed_pagination: bool = False, api_specific_parameters: ~typing.Dict[str, ~scholar_flux.api.models.base_parameters.APISpecificParameter] = <factory>)[source]

Bases: BaseAPIParameterMap

Extends BaseAPIParameterMap by adding validation and the optional retrieval of provider defaults for known APIs.

This class also specifies default mappings for specific attributes such as API keys and additional parameter names.

query

The API-specific parameter name for the search query.

Type:

str

start

The API-specific parameter name for pagination (start index or page number).

Type:

Optional[str]

records_per_page

The API-specific parameter name for records per page.

Type:

str

api_key_parameter

The API-specific parameter name for the API key.

Type:

Optional[str]

api_key_required

Indicates whether an API key is required.

Type:

bool

auto_calculate_page

If True, calculates start index from page; if False, passes page number directly.

Type:

bool

zero_indexed_pagination

If True, treats 0 as an allowed page value when retrieving data from APIs.

Type:

bool

api_specific_parameters

Additional universal to API-specific parameter mappings.

Type:

Dict[str, APISpecificParameter]
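
How auto_calculate_page and zero_indexed_pagination might interact is sketched below. The arithmetic is an assumption (a common offset convention), not the library's confirmed formula, and resolve_start is a hypothetical helper.

```python
def resolve_start(
    page: int,
    records_per_page: int,
    auto_calculate_page: bool = True,
    zero_indexed_pagination: bool = False,
) -> int:
    """Return the value to send for the provider's start/page parameter (hypothetical)."""
    first_page = 0 if zero_indexed_pagination else 1
    if page < first_page:
        raise ValueError(f"page must be >= {first_page}")
    if not auto_calculate_page:
        # The provider paginates by page number, so the page is passed through.
        return page
    # Assumed convention: the first record of the first page has index `first_page`.
    return (page - first_page) * records_per_page + first_page
```

Under this assumed convention, page 3 with 25 records per page resolves to a start index of 51 for one-indexed providers.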

api_key_parameter: str | None
api_key_required: bool
api_specific_parameters: Dict[str, APISpecificParameter]
auto_calculate_page: bool
classmethod from_defaults(provider_name: str, **additional_parameters: Any) APIParameterMap[source]

Factory method that uses the APIParameterMap.get_defaults classmethod to retrieve the provider config.

Raises an error if the provider does not exist.

Parameters:
  • provider_name (str) – The name of the API to create the parameter map for.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter map for the specified API.

Return type:

APIParameterMap

Raises:

NotImplementedError – If the API name is unknown.

classmethod get_defaults(provider_name: str, **additional_parameters: Any) APIParameterMap | None[source]

Factory method to create APIParameterMap instances with sensible defaults for known APIs.

This class method attempts to pull from the list of known providers defined in the scholar_flux.api.providers.provider_registry and returns None if an APIParameterMap for the provider cannot be found.

Using the additional_parameters keyword arguments, users can specify optional overrides for specific parameters if needed. This is helpful in circumstances where an API’s specification overlaps with that of a known provider.

Valid providers (as indicated in provider_registry) include:

  • springernature

  • plos

  • arxiv

  • openalex

  • core

  • crossref

Parameters:
  • provider_name (str) – The name of the API provider to retrieve the parameter map for.

  • additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter map for the specified API.

Return type:

Optional[APIParameterMap]

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

query: str
records_per_page: str
classmethod set_default_api_key_parameter(values: dict[str, Any]) dict[str, Any][source]

Sets the default for the API key parameter when api_key_required=True and api_key_parameter is None.

Parameters:

values (dict[str, Any]) – The dictionary of attributes to validate

Returns:

The updated parameter values passed to the APIParameterMap. api_key_parameter is set to “api_key” if a key is required but not specified.

Return type:

dict[str, Any]
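
The defaulting rule can be sketched as a plain function over the values dictionary. This is a simplified illustration, not the validator itself; returning a copy is a choice made for the sketch.

```python
from typing import Any


def set_default_api_key_parameter(values: dict[str, Any]) -> dict[str, Any]:
    """Use 'api_key' as the parameter name when a key is required but unnamed."""
    if values.get("api_key_required") and values.get("api_key_parameter") is None:
        # A copy is returned so the caller's dictionary is left untouched.
        return {**values, "api_key_parameter": "api_key"}
    return values
```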

start: str | None
classmethod validate_api_specific_parameter_mappings(values: dict[str, Any]) dict[str, Any][source]

Validates the additional mappings provided to the APIParameterMap.

This method validates that the input is a dictionary of mappings that consists of only string-typed keys mapped to API-specific parameters as defined by the APISpecificParameter class.

Parameters:

values (dict[str, Any]) – The dictionary of attribute values to validate.

Returns:

The updated dictionary if validation passes.

Return type:

dict[str, Any]

Raises:

APIParameterException – If api_specific_parameters is not a dictionary or contains non-string keys/values.

zero_indexed_pagination: bool
class scholar_flux.api.models.APIResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None)[source]

Bases: BaseModel

A Response wrapper for responses of different types that allows consistency when using several possible backends.

The purpose of this class is to serve as the base for managing responses received from scholarly APIs while processing each component in a predictable, reproducible manner.

This class uses pydantic’s data validation and serialization/deserialization methods to aid caching and includes properties that refer back to the original response for displaying valid response codes, URLs, etc.

All future processing/error-based responses classes inherit from and build off of this class.

Parameters:
  • cache_key (Optional[str]) – A string for recording cache keys for use in later steps of the response orchestration involving processing, cache storage, and cache retrieval

  • response (Optional[requests.Response | ResponseProtocol]) – A response or response-like object to be validated and used/re-used in later caching and response processing/orchestration steps.

  • created_at (Optional[str]) – A value indicating the time at which a response or response-like object was created.

Example

>>> from scholar_flux.api import APIResponse
# Using keyword arguments to build a basic APIResponse data container:
>>> response = APIResponse.from_response(
...     cache_key = 'test-response',
...     status_code = 200,
...     content=b'success',
...     url='https://example.com',
...     headers={'Content-Type': 'application/text'}
... )
>>> response
# OUTPUT: APIResponse(cache_key='test-response', response = ReconstructedResponse(
#    status_code=200, reason='OK', headers={'Content-Type': 'application/text'},
#    text='success', url='https://example.com'
# ))
# The assertions below pass silently when the response is valid:
>>> assert response.status == 'OK' and response.text == 'success' and response.url == 'https://example.com'
>>> assert response.validate_response()
classmethod as_reconstructed_response(response: object) ReconstructedResponse[source]

Classmethod designed to create a reconstructed response from an original response object.

This method coerces response attributes into a reconstructed response that retains the original content, status code, headers, URL, reason, etc.

Returns:

A minimal response object that contains the core attributes needed to support other processes in the scholar_flux module such as response parsing and caching.

Return type:

ReconstructedResponse

build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType] | None[source]

Defines a No-Op method to be overridden by ProcessedResponse subclasses.

cache_key: str | None
property cached: bool | None

Identifies whether the current response was retrieved from the session cache.

Returns:

True if the response is a CachedResponse object and False if it is a fresh requests.Response object. None if unknown (e.g., the response attribute is not a requests.Response object or subclass).

Return type:

Optional[bool]

property content: bytes | None

Return content from the underlying response, if available and valid.

Returns:

The bytes from the original response content

Return type:

(bytes)

created_at: str | None
encode_response(response: object) dict[str, Any] | list[Any] | None[source]

Helper method for serializing a response into a JSON format.

Accounts for special cases such as CaseInsensitiveDict fields that are otherwise unserializable.

From this step, pydantic can safely use json internally to dump the encoded response fields.

classmethod from_response(response: Any | None = None, cache_key: str | None = None, auto_created_at: bool | None = None, **kwargs: Any) Self[source]

Construct an APIResponse from a response object or from keyword arguments.

If response is not a valid response object, builds a minimal response-like object from kwargs.

classmethod from_serialized_response(response: object | None = None, **kwargs: Any) ReconstructedResponse | None[source]

Helper method for creating a new APIResponse from a dumped JSON object.

This method accounts for lack of ease of serialization of responses by decoding the response dictionary that was loaded from a string using json.loads from the JSON module in the standard library.

If the response input is still a serialized string, this method will manually load the response dict with the APIResponse._deserialize_response_dict class method before further processing.

Parameters:

response (object) – A prospective response value to load into the API Response.

Returns:

A reconstructed response object, if possible. Otherwise returns None

Return type:

Optional[ReconstructedResponse]

property headers: MutableMapping[str, str] | None

Return headers from the underlying response, if available and valid.

Returns:

A dictionary of headers from the response

Return type:

MutableMapping[str, str]

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize(*args: Any, **kwargs: Any) NormalizedRecordList[source]

Defines the normalize method that successfully processed API Responses can override to normalize records.

Raises:

NotImplementedError – Raised unless this method is overridden in a subclass.

process_metadata(*args: Any, **kwargs: Any) MetadataType | None[source]

Abstract processing method that APIResponse subclasses can override to process metadata.

Parameters:
  • *args – No-Op - Added for compatibility with the APIResponse subclasses.

  • **kwargs – No-Op - Added for compatibility with the APIResponse subclasses.

Raises:

NotImplementedError – Raised unless this method is overridden in a subclass.

raise_for_status() None[source]

Uses the underlying response or response-like object to validate the status code associated with the request.

If the attribute isn’t a response or reconstructed response, the code will coerce the class into a response object to verify the status code for the request URL and response.

Raises:

requests.RequestException – Errors for status codes that indicate unsuccessfully received responses.

property reason: str | None

Uses the reason or status code attribute on the response object to retrieve or create a status description.

Returns:

The status description associated with the response.

Return type:

Optional[str]

resolve_extracted_record(*args: Any, **kwargs: Any) RecordType | None[source]

Defines a No-Op method to be overridden by ProcessedResponse subclasses.

response: Response | ResponseProtocol | None
classmethod serialize_response(response: Response | ResponseProtocol) str | None[source]

Helper method for serializing a response into a JSON format.

The response object is first converted into a serialized string and subsequently dumped after ensuring that the field is serializable.

Parameters:

response (Response, ResponseProtocol) – A requests.Response or response-like object to serialize as a string.

Returns:

A serialized response when response serialization is possible. Otherwise None.

Return type:

Optional[str]
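
A rough sketch of the "serialize or return None" behaviour, using only the standard library. The function serialize_response_like, the chosen fields, and the UTF-8 decoding are assumptions for illustration; the library's actual encoding of response fields may differ.

```python
import json
from types import SimpleNamespace
from typing import Any, Optional


def serialize_response_like(response: Any) -> Optional[str]:
    """Dump the core fields of a response-like object to a JSON string, or None."""
    try:
        payload = {
            "status_code": response.status_code,
            "url": str(response.url),
            # dict(...) turns mapping types (e.g. case-insensitive header dicts) into plain dicts.
            "headers": dict(response.headers),
            "content": response.content.decode("utf-8"),
        }
        return json.dumps(payload)
    except (AttributeError, TypeError, ValueError):
        # Missing attributes or unserializable values: serialization is not possible.
        return None


fake = SimpleNamespace(
    status_code=200,
    url="https://example.com",
    headers={"Content-Type": "text/plain"},
    content=b"success",
)
serialized = serialize_response_like(fake)
```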

property status: str | None

Helper property for retrieving a human-readable status description from the APIResponse.

Returns:

The status description associated with the response (if available).

Return type:

Optional[str]

property status_code: int | None

Helper property for retrieving a status code from the APIResponse.

Returns:

The status code associated with the response (if available)

Return type:

Optional[int]

strip_annotations(*args: Any, **kwargs: Any) RecordList[source]

Defines a No-Op method to be overridden by ProcessedResponse subclasses.

property text: str | None

Attempts to retrieve the response text by first decoding the bytes of its content.

If not available, this property attempts to reference the text attribute directly.

Returns:

A text string if the text is available in the correct format, otherwise None

Return type:

Optional[str]
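
The fallback order described above (decode content first, then use the text attribute) can be sketched as follows. resolve_text is a hypothetical helper, and the UTF-8 encoding is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_text(response: Any) -> Optional[str]:
    """Prefer decoded bytes from `content`; fall back to the `text` attribute."""
    content = getattr(response, "content", None)
    if isinstance(content, bytes):
        try:
            return content.decode("utf-8")
        except UnicodeDecodeError:
            pass  # fall through to the text attribute
    text = getattr(response, "text", None)
    return text if isinstance(text, str) else None


from_bytes = resolve_text(SimpleNamespace(content=b"success", text="ignored"))
from_text = resolve_text(SimpleNamespace(content=None, text="fallback"))
```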

classmethod transform_response(v: Response | ResponseProtocol | None) Response | ResponseProtocol | None[source]

Attempts to resolve a valid or a serialized response-like object as an original or ReconstructedResponse.

All original response objects (duck-typed or requests response) with valid values will be returned as is.

If the passed object is a string, this function will attempt to deserialize it before attempting to parse it as a dictionary.

Dictionary fields will be decoded, if originally encoded, and parsed as a ReconstructedResponse object, if possible.

Otherwise, the original object is returned as is.

property url: str | None

Return URL from the underlying response, if available and valid.

Returns:

The original URL in string format, if available. For URL objects that are not str types, this method attempts to convert them into strings when possible.

Return type:

str

classmethod validate_iso_timestamp(v: str | datetime | None) str | None[source]

Helper method for validating and ensuring that the timestamp accurately follows an ISO 8601 format.
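
One way such a check can be written with the standard library is shown below. It is a sketch only: whether the library raises or returns None for invalid input is not stated here, so the sketch simply lets datetime.fromisoformat raise a ValueError.

```python
from datetime import datetime
from typing import Optional, Union


def validate_iso_timestamp(value: Union[str, datetime, None]) -> Optional[str]:
    """Return an ISO 8601 string for valid input; raise ValueError otherwise."""
    if value is None:
        return None
    if isinstance(value, datetime):
        return value.isoformat()
    # fromisoformat raises ValueError when the string is not in an ISO 8601 format.
    return datetime.fromisoformat(value).isoformat()


checked = validate_iso_timestamp("2024-06-15T12:30:00")
```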

validate_response(raise_on_error: bool = False) bool[source]

Helper method for determining whether the response attribute is truly a response or response-like object.

If the response isn’t a requests.Response object, we use duck-typing to determine whether the response, itself, contains the attributes expected of a response.

For this purpose, response properties are checked in order to determine whether the properties of the nested response object match the expected types.

Parameters:

raise_on_error (bool) – Indicates whether an error should be raised if the response attribute is invalid (False by default).

Returns:

Indicates whether the current APIResponse.response attribute is a valid response.

Return type:

bool

Raises:

InvalidResponseStructureException – When the response attribute is invalid and raise_on_error=True
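
The duck-typing idea can be sketched as an attribute-and-type check. The attribute list below is an assumption chosen for illustration, and TypeError stands in for the library's InvalidResponseStructureException.

```python
from types import SimpleNamespace
from typing import Any

# Assumed minimal set of attributes that a response-like object should expose.
EXPECTED_ATTRIBUTES = {"status_code": int, "url": str, "content": bytes}


def looks_like_response(candidate: Any, raise_on_error: bool = False) -> bool:
    """Check that each expected attribute exists and has the expected type."""
    valid = all(
        isinstance(getattr(candidate, name, None), expected_type)
        for name, expected_type in EXPECTED_ATTRIBUTES.items()
    )
    if not valid and raise_on_error:
        raise TypeError("Object does not have the attributes expected of a response")
    return valid


good = SimpleNamespace(status_code=200, url="https://example.com", content=b"ok")
bad = SimpleNamespace(status_code="200", url="https://example.com")
```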

class scholar_flux.api.models.APISpecificParameter(name: str, description: str, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False)[source]

Bases: object

Dataclass that defines the specification of an API-specific parameter for an API provider.

Implements optionally specifiable defaults, validation steps, and indicators for optional vs. required fields.

Parameters:
  • name (str) – The name of the parameter used when sending requests to APIs.

  • description (str) – A description of the API-specific parameter.

  • validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.

  • default (Any) – A default value used for the parameter if not specified by the user

  • required (bool) – Indicates whether the current parameter is required for API calls.
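
To illustrate how defaults, validators, and the required flag can work together, here is a stand-alone sketch. ParameterSpec mirrors the fields listed above, but its resolve method and the validate_sort function are hypothetical additions, not part of APISpecificParameter.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ParameterSpec:
    name: str
    description: str
    validator: Optional[Callable[[Any], Any]] = None
    default: Any = None
    required: bool = False

    def resolve(self, value: Any = None) -> Any:
        """Apply the default, enforce `required`, then run the validator (hypothetical)."""
        if value is None:
            value = self.default
        if value is None:
            if self.required:
                raise ValueError(f"Missing required parameter: {self.name!r}")
            return None
        return self.validator(value) if self.validator else value


def validate_sort(value: Any) -> str:
    """Example validator: normalize case and restrict to known values."""
    text = str(value).lower()
    if text not in {"date", "relevance"}:
        raise ValueError(f"Unsupported sort order: {value!r}")
    return text


sort = ParameterSpec("sort", "Sort order for results", validator=validate_sort, default="relevance")
```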

__init__(*args: Any, **kwargs: Any) None
default: Any = None
description: str
name: str
required: bool = False
structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for showing the structure of the current APISpecificParameter.

validator: Callable[[Any], Any] | None = None
property validator_name: str

Helper method for generating a human-readable string from the validator function, if used.

class scholar_flux.api.models.AcademicFieldMap(*, provider_name: str = '', api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>, doi: list[str] | str | None = None, url: list[str] | str | None = None, record_id: list[str] | str | None = None, title: list[str] | str | None = None, abstract: list[str] | str | None = None, authors: list[str] | str | None = None, journal: list[str] | str | None = None, publisher: list[str] | str | None = None, year: list[str] | str | None = None, date_published: list[str] | str | None = None, date_created: list[str] | str | None = None, keywords: list[str] | str | None = None, subjects: list[str] | str | None = None, full_text: list[str] | str | None = None, citation_count: list[str] | str | None = None, open_access: list[str] | str | None = None, license: list[str] | str | None = None, record_type: list[str] | str | None = None, language: list[str] | str | None = None, is_retracted: list[str] | str | None = None)[source]

Bases: NormalizingFieldMap

Extends the NormalizingFieldMap to customize field extraction and processing for academic record normalization.

This class is used to normalize the names of academic data fields consistently across providers. By default, the AcademicFieldMap includes fields for several attributes of academic records including:

  1. Core identifiers (e.g. doi, url, record_id)

  2. Bibliographic metadata (title, abstract, authors)

  3. Publication metadata (journal, publisher, year, date_published, date_created)

  4. Content and classification (keywords, subjects, full_text)

  5. Metrics and impact (citation_count)

  6. Access and rights (open_access, license)

  7. Document metadata (record_type, language)

  8. All other fields that are relevant to only the current API (api_specific_fields)

During normalization, the AcademicFieldMap.fields property returns all subclassed field mappings as a flattened dictionary (excluding private fields prefixed with underscores). Both simple and nested API-specific field names are matched and mapped to universal field names.

Any changes to the instance configuration are automatically detected during normalization by comparing the _cached_fields to the updated fields property.

Examples

>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> field_map = AcademicFieldMap(provider_name = None, title = 'article_title', record_id='ID')
>>> expected_result = field_map.fields | {'provider_name':'core', 'title': 'Decomposition of Political Tactics', 'record_id': 196}
>>> result = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> cached_fields = field_map._cached_fields
>>> print(result == expected_result)
>>> result2 = field_map.apply(dict(provider_name='core', ID=196, article_title='Decomposition of Political Tactics'))
>>> assert cached_fields is field_map._cached_fields
>>> assert result is not result2

Note

To account for special cases, the AcademicFieldMap can be subclassed to perform two-step normalization to further process extracted elements.

  1. Phase 1:

    The AcademicFieldMap extracts nested fields for each record. This class traverses paths like ‘MedlineCitation.Article.AuthorList.Author’ (PubMed) or ‘authorships.institutions.display_name’ (OpenAlex) to map API-specific fields to universal parameter names.

  2. Phase 2 (Subclasses):

    Subclasses can reformat extracted data into finalized fields. For example, PubMed prepares the authors field by combining each author’s ‘ForeName’ and ‘LastName’ into ‘FirstName LastName’. PLOS creates the record URL for each article by combining the URL prefix for the website with the DOI of the current record. The AcademicFieldMap defines common (yet optional) class methods to aid in the extraction and processing of normalized fields.
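
The two phases can be sketched with plain functions. The record below is made up, and get_path and format_authors are hypothetical helpers that only approximate the path traversal and post-processing described above.

```python
from typing import Any


def get_path(record: Any, path: str) -> Any:
    """Phase 1: follow a dot-delimited path, collecting values across lists."""
    current = record
    for key in path.split("."):
        if isinstance(current, list):
            current = [item.get(key) for item in current if isinstance(item, dict)]
        elif isinstance(current, dict):
            current = current.get(key)
        else:
            return None
    return current


def format_authors(authors: Any) -> list[str]:
    """Phase 2: combine 'ForeName' and 'LastName' into a single display name."""
    return [f"{author['ForeName']} {author['LastName']}" for author in authors or []]


record = {"MedlineCitation": {"Article": {"AuthorList": {"Author": [
    {"ForeName": "Jane", "LastName": "Doe"},
    {"ForeName": "Evan", "LastName": "Doodle"},
]}}}}
authors = format_authors(get_path(record, "MedlineCitation.Article.AuthorList.Author"))
```

Keeping extraction and reformatting separate means the same traversal logic can serve every provider, with only the second step varying per subclass.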

abstract: list[str] | str | None
authors: list[str] | str | None
citation_count: list[str] | str | None
date_created: list[str] | str | None
date_published: list[str] | str | None
doi: list[str] | str | None
classmethod extract_abstract(record: NormalizedRecordType, strip_html: bool = False, field: str = 'abstract', **kwargs: Any) str | None[source]

Extracts and prepares the abstract for the current record.

Parameters:
  • record (NormalizedRecordType) – Normalized record with ‘abstract’ already available as a field.

  • strip_html (bool) – Indicates whether html tags should be checked and removed if found in the abstract.

  • field (str) – The field where an abstract or text field can be found.

  • **kwargs – Additional arguments to pass to get_text when stripping html elements.

Returns:

An abstract string or None if not found or not a string/list of strings

Return type:

Optional[str]

Example

>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'abstract': 'Analysis of the Placebo effect on...'}
>>> AcademicFieldMap.extract_abstract(record)
# OUTPUT: 'Analysis of the Placebo effect on...'
>>> record = {'abstract': '<h1>Game theory in the technological industry.</h1><p>This study explores...</p>'}
>>> AcademicFieldMap.extract_abstract(record, strip_html=True, separator=' ')
# OUTPUT: 'Game theory in the technological industry. This study explores...'
classmethod extract_authors(record: NormalizedRecordType, field: str = 'authors') list[str] | None[source]

Filters and cleans the author names list.

Parameters:
  • record (NormalizedRecordType) – Normalized record with an ‘authors’ field.

  • field (str) – The field to extract the list of authors from.

Returns:

A list of non-empty author names, or None if empty

Return type:

Optional[list[str]]

Examples

>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'authors': 'Evan Doodle; Jane Doe'}
>>> AcademicFieldMap.extract_authors(record)
# OUTPUT: ['Evan Doodle', 'Jane Doe']
>>> record = {'authors': ['Evan Doodle', 'Jane Noah']}
>>> AcademicFieldMap.extract_authors(record)
# OUTPUT: ['Evan Doodle', 'Jane Noah']
>>> record = {'authors': [102, 203]}
>>> AcademicFieldMap.extract_authors(record) # returns None because the elements aren't strings
# OUTPUT: None
classmethod extract_boolean_field(record: NormalizedRecordType, field: str, true_values: tuple[str, ...] = ('true', '1', 'yes'), false_values: tuple[str, ...] = ('false', '0', 'no'), default: bool | None = None) bool | None[source]

Extracts a field’s value from the current record as a boolean (‘true’ -> True, ‘false’ -> False, ‘None’ -> None).

Parameters:
  • record (NormalizedRecordType) – The normalized record dictionary to extract a boolean value from.

  • field (str) – The record field to be used for the extraction of a boolean value.

  • true_values (tuple[str, ...]) – Values to be mapped to True when found.

  • false_values (tuple[str, ...]) – Values to be mapped to False when found.

  • default (Optional[bool]) – The value to default to when the field’s value matches neither true_values nor false_values.

Returns:

  • True if the field’s value appears in true_values

  • False if the field’s value appears in false_values

  • The default if the observed value cannot be found within true_values or false_values

Return type:

Optional[bool]
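
The three-way mapping can be sketched as follows. The case-insensitive comparison and the handling of existing bool values are assumptions made for this sketch; to_boolean is a hypothetical helper that works on a single value rather than a record.

```python
from typing import Any, Optional


def to_boolean(
    value: Any,
    true_values: tuple[str, ...] = ("true", "1", "yes"),
    false_values: tuple[str, ...] = ("false", "0", "no"),
    default: Optional[bool] = None,
) -> Optional[bool]:
    """Map a raw value to True, False, or the default."""
    if isinstance(value, bool):
        return value
    # Assumed normalization: compare trimmed, lower-cased text.
    text = str(value).strip().lower()
    if text in true_values:
        return True
    if text in false_values:
        return False
    return default
```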

classmethod extract_id(record: NormalizedRecordType, field: str = 'record_id', strip_prefix: str | Pattern | None = None) str | None[source]

Extracts and coerces the ID from the current record into a string.

Parameters:
  • record (NormalizedRecordType) – A normalized record dictionary before or after post-processing

  • field (str) – The IdType to filter for (e.g., ‘arxiv_id’, ‘pmid’, ‘mag_id’)

  • strip_prefix (Optional[str | re.Pattern]) – An optional prefix to remove from the identifier (e.g., ‘PMC’ for PMC IDs)

Returns:

The record ID as a string, or None if not available

Examples

>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> AcademicFieldMap.extract_id({"record_id": 12345678})
'12345678'
>>> AcademicFieldMap.extract_id({"record_id": "mock_id:123"})
'mock_id:123'
classmethod extract_iso_date(record: NormalizedRecordType, field: str = 'date_created') str | None[source]

Extracts and formats a date from a dictionary or strings in ISO format (%Y-%m-%d).

Parameters:
  • record (NormalizedRecordType) – A normalized record with a date_created or similar field to extract an ISO date from. Note: Users can extract an ISO date from a nested dictionary field if it is formatted with year, month, or day keys. If the field is a string, this method will instead attempt to parse it as an ISO timestamp. If the field is a datetime or date, the object will be parsed directly.

  • field (str) – The name of the field containing date information to extract.

Returns:

An ISO formatted date string (YYYY-MM-DD, YYYY-MM, or YYYY) or None.

Return type:

(Optional[str])

Examples

PubDate with Year=’2025’, Month=’Dec’, Day=’19’: Returns ‘2025-12-19’

PubDate with Year=’2025’, Month=’12’: Returns ‘2025-12’

PLOS with timestamp: ‘2016-12-08T00:00:00Z’ Returns ‘2016-12-08’
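
The three cases above can be reproduced with a small stand-alone sketch. to_iso_date is a hypothetical helper operating on the field value itself, and the 'Year', 'Month', and 'Day' key names are taken from the PubDate examples; everything else is an assumption.

```python
from datetime import datetime
from typing import Any, Optional

MONTHS = {name: index for index, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"],
    start=1,
)}


def to_iso_date(value: Any) -> Optional[str]:
    """Format a date dictionary or ISO timestamp string as YYYY[-MM[-DD]]."""
    if isinstance(value, dict):
        year = value.get("Year")
        if not year:
            return None
        parts = [str(year)]
        month_text = str(value.get("Month", "")).strip().lower()[:3]
        # Accept month abbreviations ('Dec') as well as numeric months ('12').
        month = MONTHS.get(month_text) or (int(month_text) if month_text.isdigit() else None)
        if month:
            parts.append(f"{month:02d}")
            day = str(value.get("Day", "")).strip()
            if day.isdigit():
                parts.append(f"{int(day):02d}")
        return "-".join(parts)
    if isinstance(value, str):
        try:
            # A trailing 'Z' is replaced so older Python versions can parse UTC timestamps.
            return datetime.fromisoformat(value.replace("Z", "+00:00")).date().isoformat()
        except ValueError:
            return None
    return None
```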

classmethod extract_journal(record: NormalizedRecordType, field: str = 'journal') str | None[source]

Extracts the publication journal title or a list of journal titles as a semicolon delimited string.

Parameters:
  • record (NormalizedRecordType) – The normalized record dictionary to extract the journal field from.

  • field (str) – The field to extract the journal from.

Returns:

The journal or journals of publication, joined by a semicolon, or None if not available.

Return type:

Optional[str]

Examples

>>> AcademicFieldMap.extract_journal({"journal": "Nature"})
# OUTPUT: 'Nature'
>>> AcademicFieldMap.extract_journal({"journal": ["Nature", "Science"]})
# OUTPUT: 'Nature; Science'
>>> AcademicFieldMap.extract_journal({"journal": ["Nature", "", None, "Science"]})
# OUTPUT: 'Nature; Science'
classmethod extract_url(record: NormalizedRecordType, *paths: list[str | int] | str, pattern_delimiter: str | Pattern | None = re.compile('; *(?=http)|, *(?=http)|\\| *(?=http)'), delimiter_prefix: str | None = None, delimiter_suffix: str | None = '(?=http)') str | None[source]

Helper function for extracting a single, primary URL from a record based on the paths used to locate it.

Parameters:
  • record (NormalizedRecordType) – The record dictionary to extract the URL from.

  • *paths – Arbitrary positional path arguments leading to a single URL or list of URLs. Each path can be a string or list of keys representing the path needed to find a URL in a nested record. Defaults to the tuple (‘url’,) if not provided, which performs a basic URL lookup.

  • pattern_delimiter (str | Pattern) – Regex pattern to split URL strings. Defaults to “; *”. A positive lookahead (?=http) is automatically appended to the delimiter to prevent splitting URLs mid-domain. Set to None to disable splitting. Note that if a re.Pattern object is provided, it will be used as is without transformation.

  • delimiter_prefix (str) – An optional string prepended as a prefix to each element within a pattern. This prefix is None by default but can be used to identify URLs that directly follow a specific pattern.

  • delimiter_suffix (str) – An optional string appended as a suffix to each element within a pattern. This suffix is used to identify http schemes (typically associated with URLs) that may directly follow a string delimited by the suffix separator.

Returns:

The first value found at any of the specified paths. Commonly a string URL, but could be any type depending on the data structure. Returns None if not found.

Examples

>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {"url": "http://example.com; http://backup.com"}
>>> AcademicFieldMap.extract_url(record)
# OUTPUT: 'http://example.com'
>>> record = {"url": [{"value": "http://example.com"}]}
>>> AcademicFieldMap.extract_url(record, ["url", 0, "value"], ["url", 0])
# OUTPUT: 'http://example.com'
>>> # Semicolon-delimited URLs (common in CrossRef, Springer)
>>> record = {"url": "http://example.com; http://backup.com"}
>>> AcademicFieldMap.extract_url(record)
# OUTPUT: 'http://example.com'
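
The role of the (?=http) lookahead can be seen in isolation with the standard re module. The pattern below is copied from the default shown in the signature; first_url is a hypothetical helper that skips the path traversal entirely.

```python
import re
from typing import Optional

URL_DELIMITER = re.compile(r"; *(?=http)|, *(?=http)|\| *(?=http)")


def first_url(value: Optional[str]) -> Optional[str]:
    """Split a delimited string of URLs and return the first non-empty entry."""
    if not value:
        return None
    # The lookahead only splits where the delimiter is followed by 'http',
    # so delimiters that appear inside a URL are left alone.
    candidates = [part.strip() for part in URL_DELIMITER.split(value) if part.strip()]
    return candidates[0] if candidates else None
```

For example, a semicolon inside a URL path is preserved because it is not followed by an http scheme.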
classmethod extract_url_id(record: NormalizedRecordType, field: str = 'record_id', strip_prefix: str | Pattern | None = None) str | None[source]

Extracts an ID from the URL of the current record, removing a URL prefix when specified.

Parameters:
  • record (NormalizedRecordType) – The record containing the URL ID to extract

  • field (str) – The field containing the ID (with or without a prefix)

  • strip_prefix (Optional[str | re.Pattern]) – The prefix or regex pattern to optionally remove from the URL

Returns:

The ID after field extraction and the removal of the string prefix, if provided. If the record field doesn’t exist, None is returned instead.

Return type:

Optional[str]

classmethod extract_year(record: NormalizedRecordType, field: str = 'year') int | None[source]

Extracts the year of publication or record creation from the manuscript/record.

Parameters:
  • record (NormalizedRecordType) – Normalized record dictionary

  • field (str) – The field to extract the year of publication or record creation from.

Returns:

The year as an integer, or None if not extractable.

Return type:

Optional[int]

Examples

>>> AcademicFieldMap.extract_year({"year": "2024-06-15"})
2024
>>> AcademicFieldMap.extract_year({"year": 2024})
2024
>>> AcademicFieldMap.extract_year({"year": None})
None
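
One plausible way to obtain these results is a four-digit search over the stringified value. This regex-based approach is an assumption for illustration and may not match the library's parsing; to_year is a hypothetical helper.

```python
import re
from typing import Any, Optional


def to_year(value: Any) -> Optional[int]:
    """Return the first standalone four-digit number in the value as an integer."""
    if value is None or isinstance(value, bool):
        return None
    match = re.search(r"\b(\d{4})\b", str(value))
    return int(match.group(1)) if match else None
```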
full_text: list[str] | str | None
is_retracted: list[str] | str | None
journal: list[str] | str | None
keywords: list[str] | str | None
language: list[str] | str | None
license: list[str] | str | None
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_post_init(context: Any, /) None

This function is meant to behave like a BaseModel method to initialise private attributes.

It takes context as an argument since that’s what pydantic-core passes when calling it.

Parameters:
  • self – The BaseModel instance.

  • context – The context.

classmethod normalize_doi(record: NormalizedRecordType, field: str = 'doi') str | None[source]

Normalizes DOI by stripping the https://doi.org/ prefix.

Parameters:
  • record (NormalizedRecordType) – Normalized record containing the ‘doi’ field to extract.

  • field (str) – The field to extract the record doi from.

Returns:

Cleaned DOI string without URL prefix, or None if invalid

Return type:

Optional[str]

Examples

>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> record = {'doi': 'https://doi.org/10.1234/example'}
>>> AcademicFieldMap.normalize_doi(record)
# OUTPUT: '10.1234/example'
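
Prefix stripping of this kind can be sketched with a single regular expression. Accepting http and dx.doi.org variants is an assumption that goes beyond the prefix named above, and strip_doi_prefix is a hypothetical helper working on the field value.

```python
import re
from typing import Any, Optional

# Assumed set of URL prefixes; only https://doi.org/ is named in the documentation.
DOI_URL_PREFIX = re.compile(r"^https?://(dx\.)?doi\.org/", flags=re.IGNORECASE)


def strip_doi_prefix(value: Any) -> Optional[str]:
    """Remove a leading DOI resolver URL, returning None for empty or non-string input."""
    if not isinstance(value, str):
        return None
    doi = DOI_URL_PREFIX.sub("", value.strip())
    return doi or None
```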
open_access: list[str] | str | None
publisher: list[str] | str | None
classmethod reconstruct_url(id: str | None, url: str) str | None[source]

Reconstruct an article URL from the ID of the article.

Useful for PLOS and PubMed URL reconstruction.

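Parameters:
  • id (Optional[str]) – The ID of the article to append to the URL.

  • url (str) – The base URL that the article ID is appended to.

Returns: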

Reconstructed URL if ID is valid, None otherwise.

Return type:

Optional[str]

Examples

>>> from scholar_flux.api.normalization import AcademicFieldMap
>>> AcademicFieldMap.reconstruct_url(
...     id="10.1371/journal.pone.0123456",
...     url="https://journals.plos.org/plosone/article?id="
... )
# OUTPUT: 'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123456'
>>> AcademicFieldMap.reconstruct_url(None, '')
# OUTPUT: None
>>> AcademicFieldMap.reconstruct_url("", None)
# OUTPUT: None
record_id: list[str] | str | None
record_type: list[str] | str | None
subjects: list[str] | str | None
title: list[str] | str | None
url: list[str] | str | None
year: list[str] | str | None
class scholar_flux.api.models.BaseAPIParameterMap(*, query: str, records_per_page: str, start: str | None = None, api_key_parameter: str | None = None, api_key_required: bool = False, auto_calculate_page: bool = True, zero_indexed_pagination: bool = False, api_specific_parameters: ~typing.Dict[str, ~scholar_flux.api.models.base_parameters.APISpecificParameter] = <factory>)[source]

Bases: BaseModel

Base class for mapping universal SearchAPI parameter names to API-specific parameter names.

Includes core logic for distinguishing parameter names, indicating required API keys, and defining pagination logic.

query

The API-specific parameter name for the search query.

Type:

str

start

The API-specific parameter name for optional pagination (start index or page number).

Type:

Optional[str]

records_per_page

The API-specific parameter name for records per page.

Type:

str

api_key_parameter

The API-specific parameter name for the API key.

Type:

Optional[str]

api_key_required

Indicates whether an API key is required.

Type:

bool

page_required

If True, indicates that a page is required.

Type:

bool

auto_calculate_page

If True, calculates start index from page; if False, passes page number directly.

Type:

bool

zero_indexed_pagination

Treats page=0 as an allowed page value when retrieving data from the API.

Type:

bool

api_specific_parameters

Additional API-specific parameter mappings.

Type:

Dict[str, APISpecificParameter]
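
To make the interaction between auto_calculate_page and zero_indexed_pagination concrete, the following standalone sketch computes the value that would be sent for the start parameter. The function name resolve_start_value and the exact offset arithmetic are assumptions for illustration, not the library's implementation.

```python
def resolve_start_value(
    page: int,
    records_per_page: int,
    auto_calculate_page: bool = True,
    zero_indexed_pagination: bool = False,
) -> int:
    """Return the value sent for the API's start parameter for a given page."""
    lowest_page = 0 if zero_indexed_pagination else 1
    if page < lowest_page:
        raise ValueError(f"page must be >= {lowest_page}, received {page}")
    if not auto_calculate_page:
        # The API paginates by page number, so the page is passed through as-is.
        return page
    # Assumption: the first record index equals the lowest allowed page (0 or 1).
    return lowest_page + (page - lowest_page) * records_per_page


print(resolve_start_value(3, 10))
print(resolve_start_value(3, 10, auto_calculate_page=False))
print(resolve_start_value(2, 10, zero_indexed_pagination=True))
```

Under these assumptions, a one-indexed API with 10 records per page starts page 3 at record 21, while a zero-indexed API starts page 2 at record 20.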

add_parameter(name: str, description: str | None = None, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False, inplace: bool = True) Self[source]

Helper method that enables the efficient addition of parameters to the current parameter map.

Parameters:
  • name (str) – The name of the parameter used when sending requests to APIs.

  • description (str) – A description of the API-specific parameter.

  • validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.

  • default (Any) – A default value used for the parameter if not specified by the user

  • required (bool) – Indicates whether the current parameter is required for API calls.

  • inplace (bool) –

    A flag that, if True, modifies the current parameter map instance in place. If False, it returns a new parameter map that contains the added parameter, while leaving the original unchanged.

    Note: If this instance is shared (e.g., retrieved from provider_registry) and inplace=True, changes will affect all references to this parameter map.

Returns:

A parameter map containing the specified parameter. If inplace=True, the original is returned. Otherwise a new parameter map containing an updated api_specific_parameters dict is returned.

Return type:

Self
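
The difference between inplace=True and inplace=False can be illustrated with a small stand-in class. ParameterMapSketch is hypothetical and stores parameters as plain dictionaries; it only mirrors the copy-versus-mutate behavior described above.

```python
import copy
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParameterMapSketch:
    """Hypothetical stand-in for a parameter map holding API-specific parameters."""

    query: str
    api_specific_parameters: dict[str, Any] = field(default_factory=dict)

    def add_parameter(self, name: str, default: Any = None, inplace: bool = True) -> "ParameterMapSketch":
        # Mutate this instance, or work on a deep copy so the original is untouched.
        target = self if inplace else copy.deepcopy(self)
        target.api_specific_parameters[name] = {"default": default}
        return target


shared = ParameterMapSketch(query="q")
copied = shared.add_parameter("sort", default="relevance", inplace=False)
same = shared.add_parameter("mailto")

print(sorted(shared.api_specific_parameters))
print(sorted(copied.api_specific_parameters))
print(same is shared)
```

Passing inplace=False is the safer choice when the map came from a shared registry, since every holder of the original keeps seeing the unmodified parameters.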

api_key_parameter: str | None
api_key_required: bool
api_specific_parameters: Dict[str, APISpecificParameter]
auto_calculate_page: bool
classmethod from_dict(obj: Dict[str, Any]) BaseAPIParameterMap[source]

Create a new instance of BaseAPIParameterMap from a dictionary.

Parameters:

obj (dict) – The dictionary containing the data for the new instance.

Returns:

A new instance created from the given dictionary.

Return type:

BaseAPIParameterMap

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

query: str
records_per_page: str
show_parameters() list[source]

Helper method to show the complete list of all parameters that can be found in the current ParameterMap.

Returns:

The complete list of all universal and API-specific parameters corresponding to the current API

Return type:

List

start: str | None
structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the BaseAPIParameterMap.

to_dict() Dict[str, Any][source]

Convert the current instance into a dictionary representation.

Returns:

A dictionary representation of the current instance.

Return type:

Dict

update(other: BaseAPIParameterMap | Dict[str, Any]) BaseAPIParameterMap[source]

Update the current instance with values from another BaseAPIParameterMap or dictionary.

Parameters:

other (BaseAPIParameterMap | Dict) – The object containing updated values.

Returns:

A new instance with updated values.

Return type:

BaseAPIParameterMap

zero_indexed_pagination: bool
class scholar_flux.api.models.BaseFieldMap(*, provider_name: str, api_specific_fields: dict[str, ~typing.Any] = <factory>, default_field_values: dict[str, ~typing.Any] = <factory>)[source]

Bases: BaseModel

The BaseFieldMap is used to normalize the names of fields consistently across providers.

This class provides a minimal implementation for mapping API-specific fields from a non-nested dictionary record to a common record key. It is intended to be subclassed and customized for different APIs.

Instances of this class can be called directly to normalize a single record or multiple records, depending on the input. Direct calls to instances are handled by .apply() under the hood.

- normalize_record

Normalizes a single dictionary record

- normalize_records

Normalizes a list of dictionary records

- apply

Returns either a single normalized record or a list of normalized records matching the input.

- structure

Displays a string representation of the current BaseFieldMap instance

provider_name

A default provider name to be assigned for all normalized records. If not provided, the field map will try to find the provider name from within each record.

Type:

str

api_specific_fields

Defines a dictionary of normalized field names (keys) to map to the names of fields within each dictionary record (values)

Type:

dict[str, Any]

default_field_values

Indicates values that should be assigned if a field cannot be found within a record.

Type:

dict[str, Any]
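
The roles of the three attributes above can be sketched as a plain function over a flat dictionary record. The name normalize_record_sketch and the sample field names are assumptions for illustration; the real class adds validation and subclass hooks that are not shown.

```python
from typing import Any, Optional


def normalize_record_sketch(
    record: dict[str, Any],
    api_specific_fields: dict[str, str],
    default_field_values: Optional[dict[str, Any]] = None,
    provider_name: str = "",
) -> dict[str, Any]:
    """Map API-specific field names in a flat record to normalized field names."""
    defaults = default_field_values or {}
    # Fall back to a provider name stored in the record when none is configured.
    normalized: dict[str, Any] = {"provider_name": provider_name or record.get("provider_name", "")}
    for normalized_name, source_name in api_specific_fields.items():
        # Missing source fields receive the configured default (None if absent).
        normalized[normalized_name] = record.get(source_name, defaults.get(normalized_name))
    return normalized


raw = {"title_display": "A study", "id": "10.1234/example"}
field_names = {"title": "title_display", "doi": "id", "abstract": "abstract_text"}
print(normalize_record_sketch(raw, field_names, {"abstract": "Not available"}, "plos"))
```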

api_specific_fields: dict[str, Any]
apply(records: RecordType) NormalizedRecordType[source]
apply(records: RecordList) NormalizedRecordList

Normalizes a record or list of records by mapping API-specific field names to common fields.

Parameters:

records (RecordType | RecordList) – A single dictionary record or a list of dictionary records to normalize.

Returns:

A single normalized dictionary is returned if a single record is provided. A list of normalized dictionaries (NormalizedRecordList) is returned if a list of records is provided.

Return type:

NormalizedRecordType | NormalizedRecordList

property core_fields: dict[str, Any]

Returns a dictionary of all core fields in the current FieldMap (excluding all API-specific fields).

default_field_values: dict[str, Any]
property fields: dict[str, Any]

Returns a representation of the current FieldMap as a dictionary.

filter_api_specific_fields(record: NormalizedRecordType, keep_api_specific_fields: bool | Sequence[str] | set[str] | None = None) dict[str, Any][source]

Filters API-specific fields from the processed record.

Parameters:
  • record (NormalizedRecordType) – The current record to filter API-specific fields from.

  • keep_api_specific_fields (Optional[bool | Sequence[str] | set[str]]) – Either a boolean indicating whether to keep all API-specific fields (True/None) or to remove them after the completion of normalization (False). This parameter can also be a sequence/set of specific field names to keep.

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize_record(record: dict, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordType[source]

Maps API-specific fields in a single dictionary record to a normalized set of field names.

Parameters:
  • record (dict) – The single, dictionary-typed record to normalize.

  • keep_api_specific_fields (Optional[bool | Sequence[str]]) – A boolean indicating whether to keep or remove all API-specific fields or a sequence indicating which API-specific fields to keep.

Returns:

A new dictionary with normalized field names.

Return type:

NormalizedRecordType

Raises:

TypeError – If the input to record is not a mapping or dictionary object.

normalize_records(records: RecordType | RecordList, keep_api_specific_fields: bool | Sequence[str] | None = True) NormalizedRecordList[source]

Maps API-specific fields in one or more records to a normalized set of field names.

Parameters:
  • records (dict | RecordType | RecordList) – A single dictionary record or a list of dictionary records.

  • keep_api_specific_fields (Optional[bool | Sequence[str]]) – A boolean indicating whether to keep or remove all API-specific fields or a sequence indicating which API-specific fields to keep.

Returns:

A list of dictionaries with normalized field names.

Return type:

NormalizedRecordList

provider_name: str
structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the BaseFieldMap.

Parameters:
  • flatten (bool) – Whether to flatten the current field map’s structural representation into a single line (Default = False).

  • show_value_attributes (bool) – Whether to show nested attributes of the base field map or subclass (Default = True).

Returns:

A structural representation of the current field map as a string. Use a print statement to view it.

Return type:

str

classmethod validate_provider_name(v: str | None) str[source]

Transforms a provider_name of None into an empty string prior to further type validation.

class scholar_flux.api.models.BaseProviderDict(dict=None, /, **kwargs)[source]

Bases: UserDict[str, Any]

The BaseProviderDict extends the dictionary to resolve minor naming variations in keys to the same provider name.

The BaseProviderDict uses the ProviderConfig._normalize_name method to ignore underscores and case-sensitivity.

find(key: str | Pattern, regex: bool | None = None) list[str][source]

Identifies providers with names matching the specified pattern using either prefix or regex pattern matching.

This implementation uses fuzzy finding, or “flexible matching that’s more forgiving than exact”. When regex=True or a compiled Pattern is provided, regex matching is used. Otherwise, provider names are filtered using prefix matching via str.startswith after normalizing the provided key and provider names.

Parameters:
  • key (str | re.Pattern) – The key or pattern to match using regular expressions or prefix matching.

  • regex (Optional[bool]) – Indicates whether regular expressions should be used to match provider names.

Returns:

A list of strings containing provider names that match the key/pattern.

Return type:

list[str]

Note

Unless a compiled pattern is received or regex=True, a provider is matched when its normalized name starts with the normalized key.
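
The two matching modes can be sketched over a plain list of names. The helper names, the sample provider names, and the choice to apply regular expressions to the raw (un-normalized) names are assumptions for illustration; only the underscore- and case-insensitive prefix matching is taken from the description above.

```python
import re
from typing import Optional, Union


def normalize_name(name: str) -> str:
    """Ignore underscores and case, as the description of BaseProviderDict states."""
    return name.replace("_", "").lower()


def find_providers(names: list[str], key: Union[str, re.Pattern], regex: Optional[bool] = None) -> list[str]:
    """Return the names matching a prefix (default) or a regular expression."""
    if isinstance(key, re.Pattern) or regex:
        pattern = key if isinstance(key, re.Pattern) else re.compile(key)
        # Assumption: patterns are searched against the names as stored.
        return [name for name in names if pattern.search(name)]
    prefix = normalize_name(key)
    return [name for name in names if normalize_name(name).startswith(prefix)]


providers = ["plos", "pubmed", "pubmed_efetch", "springernature"]
print(find_providers(providers, "Pub_Med"))
print(find_providers(providers, "^p.*s$", regex=True))
print(find_providers(providers, re.compile("nature")))
```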

property providers: list[str]

Returns a list containing the names of all providers (keys) in the current registry.

Returns:

A complete list of all keys shown in the current registry

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the BaseProviderDict or subclass.

class scholar_flux.api.models.ErrorResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None, message: str | None = None, error: str | None = None)[source]

Bases: APIResponse

Returned when something goes wrong but raising an exception immediately is not desired; the failure details are handed back instead.

The class is formatted for compatibility with the ProcessedResponse.

build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType][source]

No-Op: Returns an empty dict when no extracted records are available.

This method is retained for compatibility with ProcessedResponse. Since ErrorResponse has no extracted records to index, this method always returns an empty dictionary regardless of arguments provided.

Parameters:
  • *args – Positional argument placeholder for compatibility with the ProcessedResponse.build_record_id_index method. All arguments are ignored.

  • **kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.build_record_id_index method. All arguments are ignored.

Returns:

An empty dictionary indicating no records are available for indexing.

Return type:

dict[str, RecordType]

cache_key: str | None
created_at: str | None
property data: None

Provided for type hinting + compatibility.

error: str | None
property extracted_records: None

Provided for type hinting + compatibility.

classmethod from_error(message: str, error: Exception, cache_key: str | None = None, response: Response | ResponseProtocol | None = None) Self[source]

Creates and logs the processing error if one occurs during response processing.

Parameters:
  • message (str) – Error message describing the failure.

  • error (Exception) – The exception instance that was raised.

  • cache_key (Optional[str]) – Cache key for storing results.

  • response (Optional[requests.Response | ResponseProtocol]) – Raw API response.

Returns:

A pydantic model that contains the error response data and background information on what precipitated the error.

Return type:

ErrorResponse

message: str | None
property metadata: None

Provided for type hinting + compatibility.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = True, *args: Any, **kwargs: Any) NormalizedRecordList[source]

No-Op: Raises a RecordNormalizationException when raise_on_error=True and returns an empty list otherwise.

Parameters:
  • field_map (Optional[BaseFieldMap]) – An optional field map that can be used to normalize the current response. This is inferred from the registry if not provided as input.

  • raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a RecordNormalizationException is raised.

  • *args – Positional argument placeholder for compatibility with the ProcessedResponse.normalize method

  • **kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.normalize method

Returns:

An empty list if raise_on_error=False

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If raise_on_error=True, this exception is raised after catching NotImplementedError

property normalized_records: None

Provided for type hinting + compatibility.

property parsed_response: None

Provided for type hinting + compatibility.

process_metadata(*args: Any, **kwargs: Any) MetadataType | None[source]

No-Op: This method is retained for compatibility. It returns None by default.

property processed_metadata: None

Provided for type hinting + compatibility.

property processed_records: None

Provided for type hinting + compatibility.

property record_count: int

Number of records in this response.

property records_per_page: None

Provided for type hinting + compatibility.

resolve_extracted_record(*args: Any, **kwargs: Any) None[source]

No-Op: Returns None when no records are available.

This method is retained for compatibility with ProcessedResponse. Since ErrorResponse has no extracted or processed records, resolution is not possible and this method always returns None.

Parameters:
  • *args – Positional argument placeholder for compatibility with the ProcessedResponse.resolve_extracted_record method. Currently includes processed_index (int).

  • **kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.resolve_extracted_record method. All arguments are ignored.

Returns:

Always returns None since no records exist to resolve.

Return type:

None

response: requests.Response | ResponseProtocol | None
strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]

Convenience method for removing internal metadata annotations from a provided list of records.

This method removes all metadata annotations (dictionary keys that are prefixed with an underscore) that were added during the record extraction step for pipeline traceability (e.g., _extraction_index, _record_id).

Parameters:

records (Optional[RecordType | RecordList]) – Records to strip. Defaults to processed_records if None.

Returns:

A list of dictionary records with stripped metadata annotations when provided. If a record or record list is not provided, a warning is logged, and an empty list is returned.

Return type:

RecordList

Note: This method is defined primarily for compatibility with the ProcessedResponse API.

property total_query_hits: None

Provided for type hinting + compatibility.

class scholar_flux.api.models.NonResponse(*, cache_key: str | None = None, response: None = None, created_at: str | None = None, message: str | None = None, error: str | None = None)[source]

Bases: ErrorResponse

Response class that indicates that an error occurred during request preparation or API response retrieval.

This class is used to signify the error that occurred within the search process using a similar interface as the other scholar_flux Response dataclasses.

cache_key: str | None
created_at: str | None
error: str | None
message: str | None
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

response: None
class scholar_flux.api.models.PageListInput(root: RootModelRootType = PydanticUndefined)[source]

Bases: RootModel[Sequence[int]]

Helper class for processing page information in a predictable manner.

The PageListInput class expects to receive a list, string, or generator that contains at least one page number. If a singular integer is received, the result is transformed into a single-item list containing that integer.

Parameters:

root (Sequence[int]) – A list containing at least one page number.

Examples

>>> from scholar_flux.api.models import PageListInput
>>> PageListInput(5)
PageListInput([5])
>>> PageListInput(range(5))
PageListInput([0, 1, 2, 3, 4])
classmethod from_record_count(min_records: int, records_per_page: int, page_offset: int = 0) Self[source]

Helper method for calculating the total number of pages required to retrieve at least min_records records.

Parameters:
  • min_records (int) – The total number of records to retrieve sequentially.

  • records_per_page (int) – The total number of records that are retrieved per page.

  • page_offset (int) – The total number of pages to skip before beginning record retrieval (0 by default). When the provided value is not a non-negative integer, this parameter is coerced to 0 and a warning is triggered.

Returns:

The calculated page range used to retrieve at least min_records records given records_per_page.

Return type:

PageListInput

Examples

>>> from scholar_flux.api.models import PageListInput
>>> PageListInput.from_record_count(20, 10, 0)
PageListInput(1, 2)
>>> PageListInput.from_record_count(20, 10, 2)
PageListInput(3, 4)
>>> PageListInput.from_record_count(15, 10, 1)
PageListInput(2, 3)

>>> # triggers a warning for page_offset (non-integers are coerced to 0):
>>> PageListInput.from_record_count(20, 10, None)
PageListInput(1, 2)

>>> PageListInput.from_record_count(0, 10, 0)
PageListInput()

Note

This method expects a positive integer for min_records from which to calculate the page range required to retrieve at least min_records. Specifying 0 for min_records will result in an empty list of pages that essentially functions as a no-op search returning an empty list from SearchCoordinator.search_records.
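
The page arithmetic behind the examples above can be reproduced with a ceiling division. The function pages_for_record_count is a hypothetical standalone sketch that returns a plain list; it does not emit the warning that the documented method triggers for invalid offsets.

```python
import math


def pages_for_record_count(min_records: int, records_per_page: int, page_offset: int = 0) -> list[int]:
    """Return the page numbers needed to retrieve at least min_records records."""
    if records_per_page <= 0:
        raise ValueError("records_per_page must be a positive integer")
    if not isinstance(page_offset, int) or page_offset < 0:
        # Invalid offsets are coerced to 0 (the documented method also warns).
        page_offset = 0
    n_pages = math.ceil(max(min_records, 0) / records_per_page)
    first_page = page_offset + 1
    return list(range(first_page, first_page + n_pages))


print(pages_for_record_count(20, 10, 0))
print(pages_for_record_count(15, 10, 1))
print(pages_for_record_count(0, 10, 0))
```

Rounding up means a request for 15 records at 10 per page retrieves two full pages, which is why the result covers at least, rather than exactly, min_records.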

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

property page_numbers: Sequence[int]

Returns the sequence of validated page numbers as a list.

classmethod page_validation(v: str | int | Sequence[int | str]) Sequence[int][source]

Processes the page input to ensure that a list of integers is returned if the received page list is in a valid format.

Parameters:

v (str | int | Sequence[int | str]) – A page or sequence of pages to be formatted as a list of pages.

Returns:

A validated, formatted sequence of page numbers assuming successful page validation

Return type:

Sequence[int]

Raises:

ValidationError – Internally raised via pydantic if a ValueError is encountered (if the input is not exclusively a page or list of page numbers)

classmethod process_page(page_value: str | int) int[source]

Helper method for ensuring that each value in the sequence is a numeric string or whole number.

Note that this function will not throw an error for negative pages as that is handled at a later step in the page search process.

Parameters:

page_value (str | int) – The value to be converted if it is not already an integer

Returns:

A validated integer if the page can be converted to an integer and is not a float

Return type:

int

Raises:

ValueError – When the value is not an integer or numeric string to be converted to an integer
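
A standalone sketch of this per-value check is shown below. The name process_page_sketch and the rejection of booleans are assumptions for illustration; as noted above, negative pages are deliberately allowed through at this stage.

```python
from typing import Union


def process_page_sketch(page_value: Union[str, int]) -> int:
    """Convert a whole number or numeric string into an integer page value."""
    if isinstance(page_value, bool):
        # Assumption: booleans are rejected even though bool subclasses int.
        raise ValueError(f"Invalid page value: {page_value!r}")
    if isinstance(page_value, int):
        return page_value
    if isinstance(page_value, str) and page_value.strip().lstrip("-").isdigit():
        # int() raises ValueError for malformed strings such as "--3".
        return int(page_value)
    raise ValueError(f"Invalid page value: {page_value!r}")


print(process_page_sketch("7"))
print(process_page_sketch(3))
print(process_page_sketch("-2"))
```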

root: RootModelRootType
class scholar_flux.api.models.ProcessedResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None, parsed_response: Any | None = None, extracted_records: RecordList | None = None, processed_records: RecordList | None = None, normalized_records: NormalizedRecordList | None = None, metadata: MetadataType | None = None, processed_metadata: MetadataType | None = None, message: str | None = None)[source]

Bases: APIResponse

APIResponse class that scholar_flux uses to return processed response data after successful response processing.

This class is populated to return response data containing information on the original, cached, or reconstructed API response that is received and processed after retrieval. In addition to returning processed records and metadata, this class also allows storage of intermediate steps including:

  1. Parsed responses

  2. Extracted records and metadata

  3. Processed records (aliased as data)

  4. Normalized records

  5. Processed metadata

  6. Any additional messages. An error field is provided for compatibility with the ErrorResponse class.

build_record_id_index() dict[str, RecordType][source]

Builds a lookup table for ID-based resolution of extracted records.

This method creates a dictionary that maps _record_id values to their corresponding extracted records. Useful when performing multiple resolutions for records from the same response.

Returns:

A new dictionary mapping record IDs to the original record. An empty dictionary is returned if extracted_records is None/empty or if no record has an associated ID.

Return type:

dict[str, RecordType]

Example

>>> from scholar_flux import SearchCoordinator
>>> coordinator = SearchCoordinator(query = 'public health', annotate_records=True)
>>> response = coordinator.search(page = 1)
>>> id_index = response.build_record_id_index()
>>> processed_record = response.data[0]
>>> extracted_record = id_index.get(processed_record["_record_id"])
>>> isinstance(extracted_record, dict)
# OUTPUT: True

Note

This method is used in the process of identifying raw, unprocessed records after extensive post-processing and filtering has been performed on each record and relies on record annotation being enabled during data extraction.
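
The lookup table described above amounts to a dictionary keyed on the _record_id annotation. The function below is a minimal standalone sketch with a hypothetical name; it operates on a plain list of dictionaries rather than on a ProcessedResponse.

```python
from typing import Any, Optional


def build_record_id_index_sketch(extracted_records: Optional[list[dict[str, Any]]]) -> dict[str, dict[str, Any]]:
    """Map each _record_id to its extracted record, skipping unannotated records."""
    if not extracted_records:
        return {}
    return {
        record["_record_id"]: record
        for record in extracted_records
        if record.get("_record_id") is not None
    }


extracted = [
    {"_record_id": "a1", "title": "First"},
    {"title": "No annotation"},
    {"_record_id": "b2", "title": "Second"},
]
index = build_record_id_index_sketch(extracted)
print(sorted(index))
```

Building the index once turns each later resolution into a single dictionary lookup instead of a scan over all extracted records.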

cache_key: str | None
created_at: str | None
property data: RecordList | None

Alias to the processed_records attribute that holds a list of dictionaries, when available.

property error: None

Provided for type hinting + compatibility.

extracted_records: RecordList | None
message: str | None
metadata: MetadataType | None
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = False, update_records: bool | None = None, resolve_records: bool | None = None, keep_api_specific_fields: bool | Sequence | None = None, strip_annotations: bool | None = None) NormalizedRecordList[source]

Applies a field map to normalize the processed records of a ProcessedResponse into a common structure.

Note that if a field_map is not provided, this method will return the previously created normalized_records attribute if available. If normalized_records is None, this method will attempt to look up the FieldMap from the current provider_registry.

If processed records is None (and not an empty list), record normalization will fall back to using extracted_records and will return relatively similar results with minor differences in potential value coercion, flattening, and the recursive extraction of values at non-terminal paths depending on the implementation of the data processor.

Parameters:
  • field_map (Optional[BaseFieldMap]) – An optional field map that can be used to normalize the current response. This is inferred from the registry if not provided as input.

  • raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a normalization error is raised.

  • update_records (Optional[bool]) – A flag that determines whether updates should be made to the normalized_records attribute after computation. If None, updates are made only if the normalized_records attribute is currently None.

  • resolve_records (Optional[bool]) – A flag that determines if resolution with annotated records should occur. If True or None, resolution occurs. If False, normalization uses processed_records when not None and extracted_records otherwise.

  • keep_api_specific_fields (Optional[bool | Sequence]) – Indicates which API-specific fields should be retained from the complete list of fields that are returned. If False, only the core fields defined by the FieldMap are returned. If True or None, all fields are returned instead.

  • strip_annotations (Optional[bool]) – A flag for removing metadata annotations denoted by a leading underscore. When True or None (default), annotations are removed from normalized records.

Returns:

The list of normalized records in the same dimension as the original processed response. If a map for the current provider does not exist and raise_on_error=False, an empty list is returned instead.

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If an error occurs during the normalization of the record list.

Example

>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.utils import truncate, coerce_flattened_str
>>> coordinator = SearchCoordinator(query = 'public health')
>>> response = coordinator.search_page(page = 1)
>>> normalized_records = response.normalize()
>>> for record in normalized_records[:5]:
...     print(f"Title: {record['title']}")
...     print(f"URL: {record['url']}")
...     print(f"Source: {record['provider_name']}")
...     print(f"Abstract: {truncate(record['abstract'] or 'Not available')}")
...     print(f"Authors: {coerce_flattened_str(record['authors'])}")
...     print("-"*100)

# OUTPUT:
# Title: Are we prepared? The development of performance indicators for …
# URL: https://journals.plos.org/plosone/article?id=…
# Source: plos
# Abstract: Background: Disasters and emergencies…
# Authors: …
# ----------------------------------------------------------------------------------------------------

Note

Computation is performed in one of three cases:

  1. normalized_records does not already exist

  2. update_records is not True

  3. Either resolve_records or keep_api_specific_fields is not None

normalized_records: NormalizedRecordList | None
parsed_response: Any | None
process_metadata(metadata_map: ResponseMetadataMap | None = None, update_metadata: bool | None = None) MetadataType | None[source]

Uses a ResponseMetadataMap to process metadata for tertiary information on the response.

This method is a helper that is meant for primarily internal use for providing metadata information on the response where helpful and for informing users of the characteristics of the current response.

This function updates the ProcessedResponse.processed_metadata attribute when update_metadata=True. Unless update_metadata=False, it also updates the attribute when the current processed_metadata field is an empty dict or None.

Parameters:
  • metadata_map (Optional[ResponseMetadataMap]) – A mapping that resolves API-specific metadata names to a universal parameter name.

  • update_metadata (Optional[bool]) – Determines whether the underlying processed_metadata field should be updated. If True, the processed_metadata field is updated in place. If None, the field is only updated when metadata fields have been successfully processed and the processed_metadata field is None.

Returns:

The processed metadata returned as a dictionary when available. None otherwise.

Return type:

Optional[MetadataType]

processed_metadata: MetadataType | None
processed_records: RecordList | None
property record_count: int

The overall length of the processed data field as processed in the last step after filtering.

property records_per_page: int | None

Returns the total number of results on the current page.

This method retrieves the records_per_page variable from the processed_metadata attribute, and if metadata hasn’t yet been processed, this method will then call process_metadata() manually to ensure that the field is available.

resolve_extracted_record(processed_index: int) RecordType | None[source]

Resolve a processed record back to its original extracted record.

This method uses a two-phase resolution strategy with optional validation:

  1. Primary: Direct index lookup via _extraction_index (fast, single access)

  2. Validation: Verify _record_id matches

  3. Fallback: Search by _record_id if index lookup fails or mismatches (scans all records)

Parameters:

processed_index (int) – The index of the record in processed_records to resolve.

Returns:

The original extracted record, or None if resolution fails.

Return type:

Optional[RecordType]

Example

>>> from scholar_flux import SearchCoordinator, RecursiveDataProcessor
>>> coordinator = SearchCoordinator(
...     query='public health',
...     provider_name='openalex',
...     annotate_records=True,
...     processor=RecursiveDataProcessor()
... )
>>> response = coordinator.search(page=1)
>>> # Get processed (possibly flattened) record
>>> processed = response.processed_records[0]
>>> print(processed.get("authorships.author.display_name"))  # ['Kenneth L. Howard...']
>>> # Resolve to original nested structure
>>> original = response.resolve_extracted_record(0)
>>> print(original.get("authorships"))
>>> print(original.get("authorships")[0].keys())
# OUTPUT: dict_keys(['author_position', 'author', 'institutions', 'countries', 'is_corresponding', 'raw_author_name', 'raw_affiliation_strings', 'affiliations'])

Note

Resolution requires that records were extracted with annotate_records=True in the DataExtractor. Without annotation fields, this method returns None.
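The lookup order described for resolve_extracted_record can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the library's implementation: the resolve_record function and the sample records are hypothetical, and only the _extraction_index and _record_id annotation keys come from the description above.

```python
from typing import Any, Optional

Record = dict[str, Any]


def resolve_record(processed: Record, extracted: list[Record]) -> Optional[Record]:
    """Hypothetical helper: map a processed record back to its extracted source."""
    index = processed.get("_extraction_index")
    record_id = processed.get("_record_id")
    if record_id is None:
        # Assumption: without annotation fields there is nothing to resolve.
        return None

    # 1. Primary: direct index lookup.
    if isinstance(index, int) and 0 <= index < len(extracted):
        candidate = extracted[index]
        # 2. Validation: the record IDs must agree.
        if candidate.get("_record_id") == record_id:
            return candidate

    # 3. Fallback: linear scan by record ID.
    return next((r for r in extracted if r.get("_record_id") == record_id), None)


extracted = [
    {"_extraction_index": 0, "_record_id": "a", "title": {"text": "First"}},
    {"_extraction_index": 1, "_record_id": "b", "title": {"text": "Second"}},
]
# A stale index (e.g., after reordering) still resolves through the fallback scan.
stale = {"_extraction_index": 0, "_record_id": "b", "title.text": "Second"}
resolved = resolve_record(stale, extracted)
```

The index lookup keeps the common case at a single list access, while the scan only runs when the index is missing or points at a different record.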

response: requests.Response | ResponseProtocol | None
strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]

Convenience method that removes metadata annotations from a record list for clean export.

This method removes all metadata annotations (dictionary keys that are prefixed with an underscore) that were added during the record extraction step for pipeline traceability (e.g., _extraction_index, _record_id).

Parameters:

records (RecordType | RecordList | None) – Records to strip. Defaults to processed_records if None.

Returns:

New list of records with annotation fields removed.

Return type:

RecordList

Example

>>> import pandas as pd
>>> clean_data = response.strip_annotations()
>>> df = pd.DataFrame(clean_data)  # No internal fields in DataFrame
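The underscore-prefix rule described for strip_annotations can be sketched as a small standalone function. This is a sketch only: strip_underscore_keys is a hypothetical name, and it assumes that every record is a dictionary with string keys.

```python
from typing import Any


def strip_underscore_keys(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical helper: copy each record without underscore-prefixed keys."""
    return [
        {key: value for key, value in record.items() if not key.startswith("_")}
        for record in records
    ]


annotated = [{"_extraction_index": 0, "_record_id": "a", "title": "First"}]
clean = strip_underscore_keys(annotated)
```

Because the sketch builds new dictionaries, the annotated input records are left unchanged.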
property total_query_hits: int | None

Returns the total number of results as reported by the API.

This property reads the total_query_hits value from the processed_metadata attribute. If metadata hasn’t been processed yet, it first calls process_metadata() so that the field is available.

class scholar_flux.api.models.ProviderConfig(*, provider_name: Annotated[str, MinLen(min_length=1)], base_url: str, parameter_map: BaseAPIParameterMap, metadata_map: ResponseMetadataMap | None = None, field_map: BaseFieldMap | None = None, records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: Annotated[float, Ge(ge=0)] = 6.1, api_key_env_var: str | None = None, docs_url: str | None = None, display_name: Annotated[str, MinLen(min_length=1)] = '')[source]

Bases: BaseModel

Configuration model that holds the basic instructions and settings needed to interact with a provider. Configs for the default providers are created on package initialization in the scholar_flux.api.providers submodule. A new custom provider or an override can be added to the provider_registry (a custom user dictionary) from the scholar_flux.api.providers module.

Parameters:
  • provider_name (str) – The name of the provider to be associated with the config.

  • base_url (str) – The URL of the provider to send requests with the specified parameters.

  • parameter_map (BaseAPIParameterMap) – The parameter map indicating the specific semantics of the API.

  • metadata_map (Optional[ResponseMetadataMap]) – Defines the names of metadata fields used to distinguish response characteristics.

  • field_map (Optional[BaseFieldMap]) – A provider-specific field map that normalizes processed response records into a universal record structure.

  • records_per_page (int) – Generally the upper limit (for some APIs) or reasonable limit for the number of retrieved records per request (specific to the API provider).

  • request_delay (float) – Indicates exactly how many seconds to wait before sending successive requests. Note that the requested interval may vary based on the API provider.

  • api_key_env_var (Optional[str]) – Indicates the environment variable to look for if the API requires or accepts API keys.

  • docs_url (Optional[str]) – An optional URL that indicates where documentation related to the use of the API can be found.

Example Usage:
>>> from scholar_flux.api import ProviderConfig, APIParameterMap, SearchAPI
>>> # Maps each of the individual parameters required to interact with the Guardian API
>>> parameters = APIParameterMap(query='q',
...                              start='page',
...                              records_per_page='page-size',
...                              api_key_parameter='api-key',
...                              auto_calculate_page=False,
...                              api_key_required=True)
>>> # Creates the config object that holds the basic configuration necessary to interact with the API
>>> guardian_config = ProviderConfig(provider_name = 'GUARDIAN',
...                                  parameter_map = parameters,
...                                  base_url = 'https://content.guardianapis.com/search',
...                                  records_per_page=10,
...                                  api_key_env_var='GUARDIAN_API_KEY',
...                                  request_delay=6)
>>> api = SearchAPI.from_provider_config(query = 'economic welfare',
...                                      provider_config = guardian_config,
...                                      use_cache = True)
>>> assert api.provider_name == 'guardian'
>>> response = api.search(page = 1) # assumes that you have the GUARDIAN_API_KEY stored as an env variable
>>> assert response.ok
api_key_env_var: str | None
property api_key_required: bool

References the APIParameterMap to determine whether an API key is required.

base_url: str
display_name: str
docs_url: str | None
field_map: BaseFieldMap | None
property map: BaseAPIParameterMap

Helper property that is an alias for the APIParameterMap attribute.

The APIParameterMap maps all universal parameters to the parameter names specific to the API provider.

Returns:

The mapping that the current APIParameterConfig will use to build a dictionary of parameter requests specific to the current API.

Return type:

APIParameterMap

metadata_map: ResponseMetadataMap | None
model_config: ClassVar[ConfigDict] = {'str_strip_whitespace': True}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod normalize_provider_name(v: str) str[source]

Helper method for normalizing the names of providers to a consistent structure.

parameter_map: BaseAPIParameterMap
classmethod prepare_fields(values: dict[str, Any]) dict[str, Any][source]

Model validator used to prepare fields for the ProviderConfig prior to further field validation.

provider_name: str
records_per_page: int
request_delay: float
search_config_defaults() dict[str, Any][source]

Convenience method for retrieving ProviderConfig fields as a dict. Useful for providing the missing information needed to create a SearchAPIConfig object for a provider when only the provider_name has been provided.

Returns:

A dictionary containing the URL, name, records_per_page, and request_delay for the current provider.

Return type:

dict

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the ProviderConfig.

classmethod validate_base_url(v: str) str[source]

Validates the current URL and raises an APIParameterException if invalid.

classmethod validate_docs_url(v: str | None) str | None[source]

Validates the documentation URL and raises an APIParameterException if invalid.

class scholar_flux.api.models.ProviderRegistry(dict=None, /, **kwargs)[source]

Bases: BaseProviderDict

The ProviderRegistry implementation allows the smooth and efficient retrieval of API parameter maps and default configuration settings to aid in the creation of a SearchAPI that is specific to the current API.

Note that the ProviderRegistry uses ProviderConfig._normalize_name to normalize provider names, so lookups ignore underscores and letter case.

- ProviderRegistry.from_defaults

Dynamically imports configurations stored within scholar_flux.api.providers, and fails gracefully if a provider’s module does not contain a ProviderConfig.

- ProviderRegistry.get

Resolves a provider name to its ProviderConfig if it exists in the registry.

- ProviderRegistry.get_from_url

Resolves a provider URL to its ProviderConfig if it exists in the registry.
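Name-insensitive lookups of this kind can be sketched with a small dictionary subclass. This is an illustration under assumptions, not the registry's code: normalize_name and NormalizedKeyDict are hypothetical names, and the exact normalization rules of the library may differ.

```python
from collections import UserDict
from typing import Any


def normalize_name(name: str) -> str:
    """Hypothetical normalizer: strip whitespace, drop underscores, lowercase."""
    return name.strip().replace("_", "").lower()


class NormalizedKeyDict(UserDict):
    """Dictionary whose string keys are normalized on every access."""

    def __setitem__(self, key: str, value: Any) -> None:
        self.data[normalize_name(key)] = value

    def __getitem__(self, key: str) -> Any:
        return self.data[normalize_name(key)]

    def __contains__(self, key: object) -> bool:
        return isinstance(key, str) and normalize_name(key) in self.data


registry = NormalizedKeyDict()
registry["Open_Alex"] = "openalex-config"
found = registry["OPENALEX"]
missing = registry.get("unknown_provider")
```

Normalizing at the dictionary boundary means callers never have to remember which spelling was used at registration time.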

add(provider_config: ProviderConfig) None[source]

Helper method for adding a new provider to the provider registry.

create(provider_name: str, **kwargs: Any) ProviderConfig[source]

Helper method that creates and registers a new ProviderConfig with the current provider registry.

Parameters:
  • provider_name (str) – The name of the provider to create a new provider_config for.

  • **kwargs – Additional keyword arguments to pass to scholar_flux.api.models.ProviderConfig

Returns:

The newly created provider configuration when possible.

Return type:

ProviderConfig

Raises:

APIParameterException – If an unexpected error occurs during the creation of a new ProviderConfig.

classmethod from_defaults() ProviderRegistry[source]

Dynamically loads provider configurations from the scholar_flux.api.providers module.

This method specifically uses the provider_name of each provider listed within the scholar_flux.api.providers.provider_registry to lookup and return its ProviderConfig.

Returns:

A new registry containing the loaded default provider configurations

Return type:

ProviderRegistry

get_display_name(provider_name: str, default: str | None = None) str | None[source]

Finds the human-readable name for a provider if it exists.

If the provider doesn’t exist within the registry, the result falls back to the default if available and None otherwise.

Parameters:
  • provider_name (str) – The provider identifier to look up.

  • default (Optional[str]) – The name to fall back to. If not specified, None is returned instead.

Returns:

The display name if the provider exists, otherwise the default is returned.

Return type:

Optional[str]

get_from_url(provider_url: str | None) ProviderConfig | None[source]

Attempt to retrieve a ProviderConfig instance for the given provider by resolving the provided URL to the provider’s base URL. Will not throw an error in the event that the provider does not exist.

Parameters:

provider_url (Optional[str]) – URL of the provider to look up.

Returns:

Instance configuration for the provider if it exists, else None

Return type:

Optional[ProviderConfig]

remove(provider_name: str) None[source]

Helper method for removing a provider configuration from the provider registry.

resolve_config(provider_url: str | None = None, provider_name: str | None = None, verbose: bool = True) ProviderConfig | None[source]

Helper method to resolve mismatches between the URL and the provider_name when both are provided. The default behavior is to always prefer a provided provider_url over the provider_name to offer maximum flexibility.

Parameters:
  • provider_url (Optional[str]) – The prospective URL associated with a provider configuration.

  • provider_name (Optional[str]) – The prospective name of the provider associated with a provider configuration.

  • verbose (bool) – Determines whether the origin of the configuration should be logged.

Returns:

A provider configuration resolved with priority given to the base URL, and to the provider name otherwise. If neither the base URL nor the provider name resolves to a known provider, None is returned instead.

Return type:

Optional[ProviderConfig]
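The URL-first priority described for resolve_config can be sketched with a plain dictionary. This is a sketch under assumptions: resolve_provider, normalize_url, and the CONFIGS sample mapping are hypothetical, and the library's own URL normalization may differ.

```python
from typing import Optional
from urllib.parse import urlparse

# Sample data for illustration only: provider name -> base URL.
CONFIGS = {
    "plos": "https://api.plos.org/search",
    "crossref": "https://api.crossref.org/works",
}


def normalize_url(url: str) -> str:
    """Reduce a URL to host plus path without a trailing slash."""
    parsed = urlparse(url.strip().lower())
    return f"{parsed.netloc}{parsed.path.rstrip('/')}"


def resolve_provider(
    provider_url: Optional[str] = None, provider_name: Optional[str] = None
) -> Optional[str]:
    """Hypothetical resolver: prefer a URL match, then fall back to the name."""
    if provider_url:
        target = normalize_url(provider_url)
        for name, base_url in CONFIGS.items():
            if normalize_url(base_url) == target:
                return name
    if provider_name and provider_name.lower() in CONFIGS:
        return provider_name.lower()
    return None


by_url = resolve_provider("https://api.crossref.org/works/", "plos")
by_name = resolve_provider("https://example.com/api", "plos")
```

When the URL and the name disagree, the URL match wins; the name is only consulted when the URL is absent or unknown.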

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method that shows the current structure of the ProviderRegistry.

class scholar_flux.api.models.ReconstructedResponse(status_code: int, reason: str, headers: MutableMapping[str, str], content: bytes, url: Any)[source]

Bases: object

Core class for constructing minimal, universal response representations from responses and response-like objects.

The ReconstructedResponse implements several helpers that enable the reconstruction of response-like objects from different sources such as the requests, aiohttp, and httpx libraries.

The primary purpose of the ReconstructedResponse in scholar_flux is to create a minimal representation of a response when we need to construct a ProcessedResponse without an actual response and verify content fields.

In applications such as retrieving cached data from a scholar_flux.data_storage.DataCacheManager, if an original or cached response is not available, then a ReconstructedResponse is created from the cached response fields when available.

Parameters:
  • status_code (int) – The integer code indicating the status of the response

  • reason (str) – Indicates the reasoning associated with the status of the response

  • headers (MutableMapping[str, str]) – Indicates metadata associated with the response (e.g. Content-Type, etc.)

  • content (bytes) – The content within the response

  • url (Any) – The URL from which the response was received

Note

The ReconstructedResponse.build factory method is recommended in cases when one property may contain the needed fields but may need to be processed and prepared first before being used. Examples include instances where one has text or json data instead of content, a reason_phrase field instead of reason, etc.

Example

>>> from scholar_flux.api.models import ReconstructedResponse
# build a response using a factory method that infers fields from existing ones when not directly specified
>>> response = ReconstructedResponse.build(status_code = 200, content = b"success", url = "https://google.com")
# check whether the current class follows a ResponseProtocol and contains valid fields
>>> assert response.is_response()
# OUTPUT: True
>>> response.validate() # raises an error if invalid
>>> response.raise_for_status() # no error for 200 status codes
>>> assert response.reason == 'OK' == response.status  # inferred from the status_code attribute
__init__(status_code: int, reason: str, headers: MutableMapping[str, str], content: bytes, url: Any) None
asdict() dict[str, Any][source]

Converts the ReconstructedResponse into a dictionary containing attributes and their corresponding values.

This convenience method uses dataclasses.asdict() under the hood to convert a ReconstructedResponse to a dictionary consisting of key-value pairs.

Returns:

A dictionary that maps the field names of a ReconstructedResponse instance to their assigned values.

Return type:

dict[str, Any]

classmethod build(response: object | None = None, **kwargs: Any) ReconstructedResponse[source]

Helper method for building a new ReconstructedResponse from a regular response object.

This classmethod can either construct a new ReconstructedResponse object from a response or response-like object or otherwise build a new ReconstructedResponse via its keyword parameters.

Parameters:
  • response (Optional[object]) – A response or response-like object of unknown type or None.

  • **kwargs – The underlying components needed to construct a new response. Note that ideally, this set of key-value pairs would be specific only to the types expected by the ReconstructedResponse.

Returns:

A minimal ReconstructedResponse object created from the received parameter set.

Return type:

ReconstructedResponse

content: bytes
classmethod fields() list[str][source]

Retrieves a list containing the names of all fields associated with the ReconstructedResponse class.

Returns:

A list containing the name of each attribute in the ReconstructedResponse.

Return type:

list[str]
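Since asdict() is described above as relying on dataclasses.asdict(), the behavior of asdict() and fields() can be illustrated with a standalone dataclass. MinimalResponse is a hypothetical stand-in with the same five field names, not the library class.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any


@dataclass
class MinimalResponse:
    """Hypothetical stand-in with the same five fields as a ReconstructedResponse."""

    status_code: int
    reason: str
    headers: dict[str, str]
    content: bytes
    url: Any


response = MinimalResponse(200, "OK", {"Content-Type": "text/plain"}, b"success", "https://example.com")
as_dict = asdict(response)  # field name -> value
field_names = [field.name for field in fields(MinimalResponse)]  # declaration order
```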

classmethod from_keywords(**kwargs: Any) ReconstructedResponse[source]

Uses the provided keyword arguments to create a ReconstructedResponse.

Parameters:

**kwargs

The ReconstructedResponse keyword arguments to normalize. Possible keywords include:

  • status_code (int): The integer code indicating the status of the response

  • reason (str): Indicates the reasoning associated with the status of the response.

  • headers (MutableMapping[str, str]): Indicates metadata associated with the response (e.g. Content-Type)

  • content (bytes): The content within the response

  • url (Any): The URL from which the response was received

The keywords can alternatively be inferred from other common response fields:

  • content: [‘content’, ‘_content’, ‘text’, ‘json’]

  • headers: [‘headers’, ‘_headers’]

  • reason: [‘reason’, ‘status’, ‘reason_phrase’, ‘status_code’]

Returns:

A newly reconstructed response from the given keyword components.

Return type:

ReconstructedResponse

headers: MutableMapping[str, str]
is_response() bool[source]

Validates the fields of the minimally reconstructed response, indicating whether all fields are valid.

The fields that are validated include:

  1. status codes (should be an integer)

  2. URLs (should be a valid url)

  3. reasons (should originate from a reason attribute or inferred from the status code)

  4. content (should be a bytes field or encoded from a string text field)

  5. headers (should be a dictionary with string fields and preferably a content type)

Returns:

Indicates whether the current reconstructed response minimally recreates a response object.

Return type:

bool

json() dict[str, Any] | list[Any] | None[source]

Return JSON-decoded body from the underlying response, if available.

property ok: bool

Indicates whether the current response indicates a successful request (200 <= status_code < 300).

To account for the nature of successful requests to APIs in academic pipelines, status codes from 300 to 399 are excluded.

Returns:

True if the status code is an integer value within the range of 200 and 299, False otherwise.

Return type:

bool

classmethod prepare_response_fields(**kwargs: Any) dict[str, Any][source]

Extracts and prepares the fields required to reconstruct the response from the provided keyword arguments.

Parameters:
  • status_code (int) – The integer code indicating the status of the response

  • reason (str) – Indicates the reasoning associated with the status of the response

  • headers (MutableMapping[str, str]) – Indicates metadata associated with the response (e.g. Content-Type)

  • content (bytes) – The content within the response

  • url (Any) – The URL from which the response was received

Some fields can be either provided directly or inferred from other similarly common fields:

  • content: [‘content’, ‘_content’, ‘text’, ‘json’]

  • headers: [‘headers’, ‘_headers’]

  • reason: [‘reason’, ‘status’, ‘reason_phrase’, ‘status_code’]

Returns:

A dictionary containing the prepared response fields.

Return type:

dict[str, Any]
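The alias-based inference listed above can be sketched as follows. This is a minimal sketch, not the library's implementation: prepare_fields and first_present are hypothetical names, and the choice of UTF-8 encoding and of http.HTTPStatus for deriving a reason phrase are assumptions.

```python
import json
from http import HTTPStatus
from typing import Any


def first_present(source: dict[str, Any], *names: str) -> Any:
    """Return the first non-None value found under the given names."""
    for name in names:
        if source.get(name) is not None:
            return source[name]
    return None


def prepare_fields(**kwargs: Any) -> dict[str, Any]:
    """Hypothetical sketch of inferring response fields from common aliases."""
    status_code = int(kwargs["status_code"])

    # content: prefer raw bytes, then encode text, then serialize JSON data.
    content = first_present(kwargs, "content", "_content")
    if content is None and kwargs.get("text") is not None:
        content = kwargs["text"].encode("utf-8")
    if content is None and kwargs.get("json") is not None:
        content = json.dumps(kwargs["json"]).encode("utf-8")

    # reason: prefer explicit aliases, then derive a phrase from the status code.
    reason = first_present(kwargs, "reason", "status", "reason_phrase")
    if reason is None:
        try:
            reason = HTTPStatus(status_code).phrase
        except ValueError:
            reason = ""

    headers = first_present(kwargs, "headers", "_headers") or {}
    return {
        "status_code": status_code,
        "reason": reason,
        "headers": headers,
        "content": content,
        "url": kwargs.get("url"),
    }


from_text = prepare_fields(status_code=200, text="success", url="https://example.com")
from_json = prepare_fields(status_code=404, json={"error": "missing"}, reason_phrase="Not Found")
```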

raise_for_status() None[source]

Verifies the status code for the current ReconstructedResponse, raising an error for failed responses.

This method follows a similar convention as requests and httpx response types, raising an error when encountering status codes that are indicative of failed responses.

As scholar_flux processes data that is generally only sent when status codes are between 200-299 (or exactly 200 [ok]), an error is raised when encountering a value outside of this range.

Raises:

HTTPError – If the structure of the response is invalid or the status code is not within the range of 200-299.
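The success range shared by ok and raise_for_status can be sketched in a few lines. StatusError is a hypothetical stand-in for the HTTPError mentioned above, and both function names are illustrative rather than part of the library.

```python
class StatusError(Exception):
    """Hypothetical stand-in for the HTTPError raised on failed responses."""


def is_ok(status_code: object) -> bool:
    """True only for integer status codes in the range 200-299."""
    return isinstance(status_code, int) and 200 <= status_code < 300


def raise_for_status(status_code: object) -> None:
    """Raise when the status code is invalid or outside the success range."""
    if not is_ok(status_code):
        raise StatusError(f"Unsuccessful or invalid status code: {status_code!r}")


raise_for_status(200)  # passes silently
redirect_is_ok = is_ok(301)  # redirects are not treated as successes here
```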

reason: str
property status: str | None

Helper property for retrieving a human-readable description of the status.

Returns:

The status description associated with the response (if available).

Return type:

Optional[str]

status_code: int
property text: str | None

Helper property for retrieving the text from the bytes content as a string.

Returns:

The decoded text from the content of the response.

Return type:

Optional[str]

url: Any
validate() None[source]

Convenience method for the validation of the current ReconstructedResponse.

Returns silently when validation succeeds; otherwise an InvalidResponseReconstructionException is raised.

Raises:

InvalidResponseReconstructionException – If at least one field is invalid or inconsistent with what a true response object would contain.

class scholar_flux.api.models.ResponseHistoryRegistry(*args: Any, **kwargs: Any)[source]

Bases: BaseProviderDict

The ResponseHistoryRegistry is responsible for storing a sorted list of responses for later retrieval.

This class is useful for orchestrated searches that send multiple requests to a single provider across workflows and coordinators.

Note that the ResponseHistoryRegistry uses ProviderConfig._normalize_name to normalize provider names, so lookups ignore underscores and letter case.

- ResponseHistoryRegistry.get

Resolves a provider name to an API response if it exists in the registry.

- ResponseHistoryRegistry.get_from_url

Resolves a provider URL to an API response if it exists in the registry.

__init__(*args: Any, **kwargs: Any) None[source]

Initializes the ResponseHistoryRegistry with a thread lock so that dictionary operations are thread-safe.
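One common way to guard dictionary writes with a lock is sketched below. This is an assumption-laden illustration, not the registry's implementation: LockedDict is a hypothetical class, and the choice of threading.RLock and of which operations to guard is ours.

```python
import threading
from collections import UserDict
from concurrent.futures import ThreadPoolExecutor
from typing import Any


class LockedDict(UserDict):
    """Hypothetical dictionary that serializes writes and deletions with a lock."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        # The lock must exist before UserDict.__init__ populates the dictionary.
        self._lock = threading.RLock()
        super().__init__(*args, **kwargs)

    def __setitem__(self, key: Any, value: Any) -> None:
        with self._lock:
            super().__setitem__(key, value)

    def __delitem__(self, key: Any) -> None:
        with self._lock:
            super().__delitem__(key)


history = LockedDict()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Several threads record the latest "response" for three providers.
    list(pool.map(lambda i: history.__setitem__(f"provider{i % 3}", i), range(30)))
```

Which value ends up stored for each provider depends on thread scheduling; the lock only guarantees that each individual write completes without interleaving.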

add(provider_name: str, response: ProcessedResponse | ErrorResponse) None[source]

Helper method for adding a new response to the ResponseHistoryRegistry.

get_from_url(provider_url: str | None) ProcessedResponse | ErrorResponse | None[source]

Attempt to retrieve a ProcessedResponse or ErrorResponse instance for the given provider from a URL.

This method retrieves responses by resolving the provided URL to the provider’s base URL after normalization. If a provider does not exist in the response history, a value of None will be returned instead.

Parameters:

provider_url (Optional[str]) – URL of the provider to look up.

Returns:

The last stored response for a provider if it has an entry in the response history. Otherwise None.

Return type:

Optional[ProcessedResponse | ErrorResponse]

remove(provider_name: str) None[source]

Helper method for removing an ErrorResponse or ProcessedResponse from the ResponseHistoryRegistry.

class scholar_flux.api.models.ResponseMetadataMap(*, total_query_hits: str | None = None, records_per_page: str | None = None)[source]

Bases: BaseModel

Maps API-specific response metadata field names to common names.

This class enables extraction of metadata from API responses, primarily used for pagination decisions in multi-page searches. It extracts and processes metadata fields from metadata dictionaries and supports nested field retrieval by denoting the path to a field with periods.

Parameters:
  • total_query_hits – Field name containing the total number of results for a query (used to determine if more pages exist)

  • records_per_page – Field name indicating the number of records on the current page

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits")
>>> metadata = {"totalHits": 318942, "limit": 10}
>>> total = metadata_map.calculate_query_hits(metadata)
>>> print(total)  # 318942
>>> # Used for pagination decisions
>>> current_page, records_per_page = 1, 10
>>> has_more = total > (current_page * records_per_page)
calculate_pages_remaining(page: int, total_query_hits: int | None = None, records_per_page: int | None = None, metadata: MetadataType | None = None) int | None[source]

Calculates the total number of pages yet to be queried using either metadata or direct integer fields.

Parameters:
  • page (int) – The current page number.

  • total_query_hits (Optional[int]) – Total number of record hits associated with a given query. If not specified, this is parsed from the metadata

  • records_per_page (Optional[int]) – Total number of records on the current page as an integer if available and convertible

  • metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)

Returns:

The total number of pages that remain given the values total_query_hits and records_per_page

Return type:

Optional[int]

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(
... total_query_hits="statistics.totalHits", records_per_page="metadata.pageSize"
... )
>>> metadata = {"statistics": {"totalHits": "1500"},"metadata": {"pageSize": "20"}}
>>> total = metadata_map.calculate_pages_remaining(page = 74, metadata = metadata)
>>> print(total) # 1 (converted from string)
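The arithmetic behind this example can be reproduced with a standalone sketch. The helper names get_nested, to_int, and pages_remaining are hypothetical, and clamping the result at zero is an assumption rather than documented behavior.

```python
import math
from typing import Any, Optional


def get_nested(metadata: dict[str, Any], path: str) -> Any:
    """Follow a period-delimited path through nested dictionaries."""
    current: Any = metadata
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def to_int(value: Any) -> Optional[int]:
    """Convert a value to an integer, returning None when that is not possible."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def pages_remaining(page: int, total_query_hits: Optional[int], records_per_page: Optional[int]) -> Optional[int]:
    """Pages left after the current one: ceil(hits / page size) - page."""
    if total_query_hits is None or not records_per_page:
        return None
    total_pages = math.ceil(total_query_hits / records_per_page)
    return max(total_pages - page, 0)  # assumption: never report a negative count


metadata = {"statistics": {"totalHits": "1500"}, "metadata": {"pageSize": "20"}}
hits = to_int(get_nested(metadata, "statistics.totalHits"))
size = to_int(get_nested(metadata, "metadata.pageSize"))
remaining = pages_remaining(74, hits, size)
```

With 1500 hits at 20 records per page there are 75 pages in total, so one page remains after page 74.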
calculate_query_hits(metadata: MetadataType) int | None[source]

Extract and convert total query hits from response metadata.

Parameters:

metadata (MetadataType) – A mapping containing response metadata typically from ProcessedResponse.metadata

Returns:

Total number of query hits as an integer if available and convertible, otherwise None

Return type:

Optional[int]

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits")
>>> metadata = {"totalHits": "1500", "results": [...]}
>>> total = metadata_map.calculate_query_hits(metadata)
>>> print(total)  # 1500 (converted from string)
calculate_records_per_page(metadata: MetadataType) int | None[source]

Extract and convert the total number of records on the current page from response metadata.

Parameters:

metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)

Returns:

Total number of records on the current page as an integer if available and convertible, otherwise None

Return type:

Optional[int]

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(records_per_page="pageSize")
>>> metadata = {"pageSize": "20", "results": [...]}
>>> total = metadata_map.calculate_records_per_page(metadata)
>>> print(total)  # 20 (converted from string)
model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

process_metadata(metadata: MetadataType) MetadataType[source]

Helper method for processing metadata after mapping relevant fields using the metadata schema.

Parameters:

metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)

Returns:

A mapped dictionary of processed metadata fields.

Return type:

MetadataType

Example

>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits", records_per_page="pageSize")
>>> metadata = {"totalHits": "1500","pageSize": "20", "results": [...]}
>>> metadata_map.process_metadata(metadata)
# OUTPUT: {"total_query_hits": 1500, "records_per_page": 20}
records_per_page: str | None
total_query_hits: str | None
class scholar_flux.api.models.SearchAPIConfig(*, provider_name: str = '', base_url: str = '', records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: float = -1, api_key: SecretStr | None = None, api_specific_parameters: dict[str, Any] | None = None)[source]

Bases: BaseModel

The SearchAPIConfig class provides the core tools necessary to set and interact with the API. The SearchAPI uses this class to retrieve data from an API using universal parameters to simplify the process of retrieving raw responses.

provider_name

Indicates the name of the API to use when making requests to a provider. If the provider name matches a known default and the base_url is unspecified, the base URL for the current provider is used instead.

Type:

str

base_url

Indicates the API URL where data will be searched and retrieved.

Type:

str

records_per_page

Controls the number of records that will appear on each page.

Type:

int

request_delay

Indicates the minimum delay between each request to avoid exceeding API rate limits.

Type:

float

api_key

This is an API-specific parameter for validating the current user’s identity. If a str type is provided, it is converted into a SecretStr.

Type:

Optional[str | SecretStr]

api_specific_parameters

A dictionary containing all parameters specific to the current API. API-specific parameters include the following:

  1. mailto (Optional[str | SecretStr]):

    An optional email address for receiving feedback on usage from providers. This parameter is currently applicable only to the Crossref API.

  2. db (str):

    The parameter used by the NIH to direct requests for data to the pubmed database. This parameter defaults to pubmed and does not require direct specification.

Type:

dict[str, Any]

Examples

>>> from scholar_flux.api import SearchAPIConfig, SearchAPI, provider_registry
# To create a CROSSREF configuration with minimal defaults and provide an api_specific_parameter:
>>> config = SearchAPIConfig.from_defaults(provider_name = 'crossref', mailto = 'your_email_here@example.com')
# The configuration automatically retrieves the configuration for the "Crossref" API.
>>> assert config.provider_name == 'crossref' and config.base_url == provider_registry['crossref'].base_url
>>> api = SearchAPI.from_settings(query = 'q', config = config)
>>> assert api.config == config
# To retrieve all defaults associated with a provider and automatically read an API key if needed:
>>> config = SearchAPIConfig.from_defaults(provider_name = 'pubmed', api_key = 'your api key goes here')
# The API key is retrieved automatically if you have the API key specified as an environment variable.
>>> assert config.api_key is not None
# Default provider API specifications are already pre-populated if they are set with defaults.
>>> assert config.api_specific_parameters['db'] == 'pubmed'  # Required by pubmed and defaults to pubmed.
# Update a provider and automatically retrieve its API key - the previous API key will no longer apply.
>>> updated_config = SearchAPIConfig.update(config, provider_name = 'core')
# The API key should have been overwritten to use core. Looks for a `CORE_API_KEY` env variable by default.
>>> assert updated_config.provider_name  == 'core' and  updated_config.api_key != config.api_key
DEFAULT_PROVIDER: ClassVar[str] = 'PLOS'
DEFAULT_RECORDS_PER_PAGE: ClassVar[int] = 25
DEFAULT_REQUEST_DELAY: ClassVar[float] = 6.1
MAX_API_KEY_LENGTH: ClassVar[int] = 512
api_key: SecretStr | None
api_specific_parameters: dict[str, Any] | None
base_url: str
classmethod default_request_delay(v: int | float | None, provider_name: str | None = None) float[source]

Helper method enabling the retrieval of the most appropriate rate limit for the current provider.

Defaults to the SearchAPIConfig default rate limit when the current provider is unknown and a valid rate limit has not yet been provided.

Parameters:
  • v (Optional[int | float]) – The value received for the current request_delay

  • provider_name (Optional[str]) – The name of the provider to retrieve a rate limit for

Returns:

The provided non-negative request delay, the retrieved rate limit for the current provider if available, or the SearchAPIConfig.DEFAULT_REQUEST_DELAY, in that order of priority.

Return type:

float

classmethod from_defaults(provider_name: str, **overrides: Any) SearchAPIConfig[source]

Uses the default configuration for the chosen provider to create a SearchAPIConfig object containing configuration parameters. Note that additional parameters and field overrides can be added via the **overrides field.

Parameters:
  • provider_name (str) – The name of the provider to create the config

  • **overrides – Optional keyword arguments to specify overrides and additional arguments

Returns:

A default SearchAPIConfig object based on the chosen parameters

Return type:

SearchAPIConfig

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

provider_name: str
records_per_page: int
request_delay: float
classmethod set_records_per_page(v: int | None) int[source]

Sets the records_per_page parameter, falling back to the default if the supplied value is not valid.

Triggers a validation error when records_per_page is an invalid type. Otherwise uses the DEFAULT_RECORDS_PER_PAGE class attribute if the supplied value is missing or is a negative number.

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for retrieving a string representation of the overall structure of the current SearchAPIConfig.

classmethod update(current_config: SearchAPIConfig, **overrides: Any) SearchAPIConfig[source]

Create a new SearchAPIConfig by updating an existing config with new values and/or switching to a different provider. This method ensures that the new provider’s base_url and defaults are used if provider_name is given, and that API-specific parameters are prioritized and merged as expected.

Parameters:
  • current_config (SearchAPIConfig) – The existing configuration to update.

  • **overrides – Any fields or API-specific parameters to override or add.

Returns:

A new config with the merged and prioritized values.

Return type:

SearchAPIConfig
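The merge order described for update can be sketched with a frozen dataclass. Everything here is a hypothetical stand-in: the Config class, the PROVIDER_DEFAULTS table, the example URLs, and update_config are not part of scholar_flux. The sketch assumes that a new provider's defaults are applied first and explicit overrides win last.

```python
from dataclasses import dataclass, replace
from typing import Any

# Hypothetical provider defaults, used only for this illustration.
PROVIDER_DEFAULTS = {
    "alpha": {"base_url": "https://alpha.example.org/api", "records_per_page": 25},
    "beta": {"base_url": "https://beta.example.org/api", "records_per_page": 50},
}


@dataclass(frozen=True)
class Config:
    provider_name: str
    base_url: str
    records_per_page: int


def update_config(current: Config, **overrides: Any) -> Config:
    """Return a new Config; the existing instance is never modified."""
    changes: dict = {}
    provider = overrides.get("provider_name")
    if provider is not None and provider != current.provider_name:
        # Switching providers: start from the new provider's defaults.
        changes.update(PROVIDER_DEFAULTS.get(provider, {}))
    # Explicit overrides take priority over provider defaults.
    changes.update(overrides)
    return replace(current, **changes)
```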

property url_basename: str

Applies the _extract_url_basename method to the provider URL associated with the current config instance.

classmethod validate_api_key(v: SecretStr | str | None) SecretStr | None[source]

Validates the api_key attribute and triggers a validation error if it is not valid.

classmethod validate_provider_name(v: str | None) str[source]

Validates the provider_name attribute and triggers a validation error if it is not valid.

classmethod validate_request_delay(v: int | float | None) int | float | None[source]

Sets the request delay (delay between each request) for valid request delays. This validator triggers a validation error when the request delay is an invalid type.

If a request delay is left None or is a negative number, this class method returns -1, and further validation is performed by cls.default_request_delay to retrieve the provider’s default request delay.

If not available, SearchAPIConfig.DEFAULT_REQUEST_DELAY is used.
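The two-stage resolution (sentinel first, then fallback) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: coerce_request_delay, resolve_request_delay, the PROVIDER_RATE_LIMITS table, and both numeric values are hypothetical.

```python
from typing import Optional, Union

# Hypothetical values for illustration only.
DEFAULT_REQUEST_DELAY = 6.1
PROVIDER_RATE_LIMITS = {"crossref": 1.0}


def coerce_request_delay(v: Union[int, float, None]) -> Union[int, float]:
    """Return the delay, or the sentinel -1 when it is missing or negative."""
    if v is None:
        return -1
    if isinstance(v, bool) or not isinstance(v, (int, float)):
        raise ValueError(f"request_delay must be numeric, got {type(v).__name__}")
    return v if v >= 0 else -1


def resolve_request_delay(v: Union[int, float], provider_name: Optional[str] = None) -> float:
    """Priority order: explicit non-negative delay, provider rate limit, global default."""
    if v >= 0:
        return float(v)
    if provider_name is not None and provider_name in PROVIDER_RATE_LIMITS:
        return PROVIDER_RATE_LIMITS[provider_name]
    return DEFAULT_REQUEST_DELAY
```

Using -1 as a sentinel lets the field validator stay independent of the provider name, which is only known once the whole model is being validated.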

validate_search_api_config_parameters() Self[source]

Validation method that resolves URLs and/or provider names to provider_info when one or the other is not explicitly provided.

Occurs as the last step in the validation process.

classmethod validate_url(v: str) str[source]

Validates the base_url and triggers a validation error if it is not valid.

classmethod validate_url_type(v: str | None) str[source]

Validates the type for the base_url attribute and triggers a validation error if it is not valid.

class scholar_flux.api.models.SearchResult(*, query: str, provider_name: str, page: Annotated[int, Ge(ge=0)], response_result: ProcessedResponse | ErrorResponse | None = None)[source]

Bases: BaseModel

Core container for search results that stores the retrieved and processed data from API Searches.

This class is useful when iterating and searching over a range of pages, queries, and providers at a time. This class uses pydantic to ensure that field validation is automatic, ensuring integrity and reliability of response processing. This supports multi-page searches that link each response result to a particular query, page, and provider.

Parameters:
  • query (str) – The query used to retrieve records and response metadata

  • provider_name (str) – The name of the provider where data is being retrieved

  • page (int) – The page number associated with the request for data

  • response_result (Optional[ProcessedResponse | ErrorResponse]) – The response result containing the specifics of the data retrieved from the response or the error messages recorded if the request is not successful.

For convenience, the properties of the response_result are referenced as properties of the SearchResult, including: response, parsed_response, processed_records, etc.

build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType][source]

Builds a lookup table mapping record IDs to their original extracted records.

This method delegates to the underlying ProcessedResponse or ErrorResponse to build an index for fast ID-based resolution of extracted records. Useful for batch resolution operations where multiple records need to be resolved to their original nested structures without repeated searches.

Parameters:
  • *args – Positional arguments passed through to the underlying response’s build_record_id_index method. The ProcessedResponse implementation accepts no positional arguments.

  • **kwargs – Keyword arguments passed through to the underlying response’s build_record_id_index method. The ProcessedResponse implementation accepts no keyword arguments.

Returns:

A dictionary mapping _record_id values to their corresponding extracted records. Returns an empty dict if response_result is None or if no extracted records exist.

Return type:

dict[str, RecordType]
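A minimal sketch of such a lookup table, assuming each extracted record is a dictionary that may carry a _record_id annotation. The function name index_by_record_id is hypothetical, and the sketch is not the library's implementation.

```python
from typing import Any, Dict, List, Optional

RecordType = Dict[str, Any]


def index_by_record_id(extracted_records: Optional[List[RecordType]]) -> Dict[str, RecordType]:
    """Map each _record_id to its record; records without an ID are skipped."""
    if not extracted_records:
        return {}
    return {
        record["_record_id"]: record
        for record in extracted_records
        if "_record_id" in record
    }
```

Building the dictionary once turns each later lookup into a constant-time operation instead of a scan over the record list.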

property cache_key: str | None

Extracts the cache key from the API Response if available.

This cache key is used when storing and retrieving data from response processing cache storage.

Returns:

The key if the response_result contains a cache_key that is not None. None otherwise.

Return type:

Optional[str]

property cached: bool | None

Identifies whether the current response was retrieved from the session cache.

Returns:

True if the response is a CachedResponse object, False if it is a fresh requests.Response object, and None if unknown (e.g., the response attribute is not a requests.Response object or subclass).

Return type:

Optional[bool]

property created_at: str | None

Extracts the time at which the ErrorResponse or ProcessedResponse was created, if available.

property data: RecordList | None

Alias referring back to the processed records from the ProcessedResponse or ErrorResponse.

Contains the processed records from the API response processing step after a successfully received response has been processed. If an error response was received instead, the value of this property is None.

Returns:

The list of processed records if ProcessedResponse.data is not None. None otherwise.

Return type:

Optional[RecordList]

property display_name: str

Returns a human-readable provider name for the current provider when available.

property error: str | None

Extracts the error name associated with the result from the base class.

This field is generally populated when ErrorResponse objects are received and indicates why an error occurred.

Returns:

The error if the response_result is an ErrorResponse with a populated error field. None otherwise.

Return type:

Optional[str]

property extracted_records: RecordList | None

Contains the extracted records from the response record extraction step after successful response parsing.

If an ErrorResponse was received instead, the value of this property is None.

Returns:

A list of extracted records if ProcessedResponse.extracted_records is not None. None otherwise.

Return type:

Optional[RecordList]

property message: str | None

Extracts the message associated with the result from the base class.

This message is generally populated when ErrorResponse objects are received and indicates why an error occurred in the event that the response_result is an ErrorResponse.

Returns:

The message if the ProcessedResponse.message or ErrorResponse.message is not None. None otherwise.

Return type:

Optional[str]

property metadata: MetadataType | None

Contains the metadata from the API response metadata extraction step after successful response parsing.

If an ErrorResponse was received instead, the value of this property is None.

Returns:

A dictionary of metadata if ProcessedResponse.metadata is not None. None otherwise.

Return type:

Optional[MetadataType]

model_config: ClassVar[ConfigDict] = {}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = False, update_records: bool | None = None, include: SearchFields | None = None, *, resolve_records: bool | None = None, keep_api_specific_fields: bool | Sequence | None = None, strip_annotations: bool | None = None) NormalizedRecordList[source]

Normalizes ProcessedResponse record fields to map API-specific fields to provider-agnostic field names.

The field map is resolved in the following order of priority:

  1. User-specified field maps

  2. Resolving a provider name to a BaseFieldMap or subclass from the registry.

  3. Resolving the URL to a BaseFieldMap or subclass

If a field map is not available at any step in the process, an empty list will be returned if raise_on_error=False. Otherwise, a RecordNormalizationException is raised.
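The resolution order and the empty-list fallback can be sketched as below. The two registries, the example URL, the plain-dictionary field maps, and the use of LookupError in place of RecordNormalizationException are all assumptions made for illustration; the real method works with BaseFieldMap objects.

```python
from typing import Any, Dict, List, Optional

FieldMap = Dict[str, str]  # universal field name -> API-specific key

# Hypothetical registries standing in for the package-level lookups.
FIELD_MAPS_BY_PROVIDER: Dict[str, FieldMap] = {"example": {"title": "title", "abstract": "summary"}}
FIELD_MAPS_BY_URL: Dict[str, FieldMap] = {"https://api.example.org/search": {"title": "title"}}


def resolve_field_map(field_map: Optional[FieldMap] = None, provider_name: Optional[str] = None,
                      url: Optional[str] = None) -> Optional[FieldMap]:
    """Return the first available map: user-specified, then by provider name, then by URL."""
    candidates = (field_map, FIELD_MAPS_BY_PROVIDER.get(provider_name), FIELD_MAPS_BY_URL.get(url))
    return next((candidate for candidate in candidates if candidate is not None), None)


def normalize_records(records: List[Dict[str, Any]], raise_on_error: bool = False,
                      **lookup: Any) -> List[Dict[str, Any]]:
    """Rename API-specific keys to universal names, or return [] when no map resolves."""
    resolved = resolve_field_map(**lookup)
    if resolved is None:
        if raise_on_error:
            raise LookupError("no field map could be resolved")
        return []
    return [{name: record.get(api_key) for name, api_key in resolved.items()} for record in records]
```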

Parameters:
  • field_map (Optional[BaseFieldMap]) – Optional field map to use in the normalization of the record list. If not provided, the field map is looked up from the registry using the name or URL of the current provider.

  • raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a normalization error is raised.

  • update_records (Optional[bool]) – A flag that determines whether updates should be made to the normalized_records attribute after computation. If None, updates are made only if the normalized_records attribute is None.

  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each normalized record dictionary. Possible fields include provider_name, query, display_name, and page. By default, no model fields are appended.

  • resolve_records (Optional[bool]) – A flag that determines if resolution with annotated records should occur. If True or None, resolution occurs. If False, normalization uses processed_records when not None and extracted_records otherwise.

  • keep_api_specific_fields (Optional[bool | Sequence]) – Indicates what API-specific records should be retained from the complete list of API parameters that are returned. If False, only the core parameters defined by the FieldMap are returned. If True or None, all parameters are returned instead.

  • strip_annotations (Optional[bool]) – A flag indicating whether to remove metadata annotations from normalized records. If True or None, fields with leading underscores are removed from each normalized record.

Returns:

A list of normalized records, or empty list if normalization is unavailable.

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If raise_on_error=True and no field map found.

Note

The ProcessedResponse.normalize() method handles most of the internal logic. This method delegates normalization to the ProcessedResponse when the user does not explicitly pass a field map and the provider-name-resolved map matches the URL-resolved map. In that case, ProcessedResponse.normalize() handles the resolution details for caching purposes.

Example

>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.utils import truncate, coerce_flattened_str
>>> coordinator = SearchCoordinator(query = 'AI Safety', provider_name = 'arXiv')
>>> response = coordinator.search_page(page = 1)
>>> normalized_records = response.normalize(include = {'display_name', 'query', 'page'})
>>> for record in normalized_records[:5]:
...     print(f"Title: {record['title']}")
...     print(f"URL: {record['url']}")
...     print(f"Source: From {record['display_name']}: '{record['query']}' Page={record['page']}")
...     print(f"Abstract: {truncate(record['abstract'] or 'Not available')}")
...     print(f"Authors: {coerce_flattened_str(record['authors'])}")
...     print("-"*100)

# OUTPUT:
# Title: AI Safety…
# URL: http://arxiv.org/abs/…
# Source: From arXiv: 'AI Safety' Page=1
# Abstract: This report …
# Authors: …
# ----------------------------------------------------------------------------------------------------

property normalized_records: NormalizedRecordList | None

Contains the normalized records from the API response processing step after normalization.

If an error response was received instead, the value of this property is None.

Returns:

The list of normalized dictionary records if ProcessedResponse.normalized_records is not None. None otherwise.

Return type:

Optional[NormalizedRecordList]

page: int
property parsed_response: Any | None

Contains the parsed response content from the API response parsing step.

Parsed API responses are generally formatted as dictionaries that contain the extracted JSON, XML, or YAML content from a successfully received, raw response.

If an ErrorResponse was received instead, the value of this property is None.

Returns:

The parsed response when ProcessedResponse.parsed_response is not None. Otherwise None.

Return type:

Optional[Any]

process_metadata(metadata_map: ResponseMetadataMap | None = None, update_metadata: bool | None = None) MetadataType | None[source]

Processes and maps API-specific ProcessedResponse.metadata fields to provider-agnostic field names.

By default, the ResponseMetadataMap map retrieves and converts the API-specific page-size (records per page) and total results (total query hits) fields to integers when possible.

The field map is resolved in the following order of priority:

  1. User-specified field maps

  2. Resolving a provider name to a ResponseMetadataMap or subclass from the registry.

  3. Resolving the URL to a ResponseMetadataMap or subclass

If a metadata_map is not available, None will be returned.
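The integer conversion step can be sketched as follows. The key names passed in the usage below are hypothetical, as are the helper names; the real ResponseMetadataMap may resolve nested paths and perform additional processing.

```python
from typing import Any, Dict, Optional


def to_int(value: Any) -> Optional[int]:
    """Convert a value to int when possible; return None instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def map_metadata(metadata: Dict[str, Any], total_key: str, page_size_key: str) -> Dict[str, Optional[int]]:
    """Map API-specific metadata keys to the provider-agnostic field names."""
    return {
        "total_query_hits": to_int(metadata.get(total_key)),
        "records_per_page": to_int(metadata.get(page_size_key)),
    }
```

Returning None for values that cannot be converted keeps one malformed metadata field from discarding the rest of the response.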

Parameters:
  • metadata_map (Optional[ResponseMetadataMap]) – An optional response metadata map to use in the mapping and processing of the response metadata. If not provided, the metadata map is looked up via the registry using the name or URL of the current provider.

  • update_metadata (Optional[bool]) – A flag that determines whether updates should be made to the processed_metadata attribute after computation. If None, updates are made only if the processed_metadata attribute is None.

Returns:

A processed metadata dictionary mapping total_query_hits and records_per_page fields where possible.

Return type:

Optional[MetadataType]

property processed_metadata: MetadataType | None

Contains the processed metadata from the API response processing step after the response has been processed.

If an error response was received instead, the value of this property is None.

Returns:

The processed metadata dict if ProcessedResponse.processed_metadata is not None. None otherwise.

Return type:

Optional[MetadataType]

property processed_records: RecordList | None

Contains the processed records from the API response processing step after processing the response.

If an error response was received instead, the value of this property is None.

Returns:

The list of processed records if ProcessedResponse.processed_records is not None. None otherwise.

Return type:

Optional[RecordList]

provider_name: str
query: str
property record_count: int

Retrieves the overall length of the processed_records field from the API response if available.

property records_per_page: int | None

Returns the number of records sent on the current page according to the API-specific metadata field.

resolve_extracted_record(*args: Any, **kwargs: Any) RecordType | None[source]

Resolves a processed record back to its original extracted record.

This method delegates to the underlying ProcessedResponse or ErrorResponse to resolve a single processed record (identified by its index) back to its original extracted record with nested structure. Uses annotation fields (_extraction_index, _record_id) added during extraction.

Parameters:
  • *args – Positional arguments passed through to the underlying response’s resolve_extracted_record method. The ProcessedResponse implementation accepts:

    • processed_index (int): Index of the record in processed_records

  • **kwargs – Keyword arguments passed through to the underlying response’s resolve_extracted_record method.

Returns:

The original extracted record with nested structure, or None if:

  • response_result is None

  • The record index is invalid

  • No matching extracted record is found

Return type:

Optional[RecordType]
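One way the annotation fields could drive this resolution is sketched below. The lookup order (matching on _record_id first, then falling back to _extraction_index) is an assumption made for illustration, as is the function name resolve_by_annotations; the library's actual strategy may differ.

```python
from typing import Any, Dict, List, Optional

RecordType = Dict[str, Any]


def resolve_by_annotations(processed_records: Optional[List[RecordType]],
                           extracted_records: Optional[List[RecordType]],
                           processed_index: int) -> Optional[RecordType]:
    """Return the extracted record behind a processed record, or None if unresolved."""
    if not processed_records or not extracted_records:
        return None
    if not 0 <= processed_index < len(processed_records):
        return None  # invalid record index
    processed = processed_records[processed_index]
    record_id = processed.get("_record_id")
    if record_id is not None:
        for record in extracted_records:
            if record.get("_record_id") == record_id:
                return record
    # Assumed fallback: use the position recorded at extraction time.
    position = processed.get("_extraction_index")
    if isinstance(position, int) and 0 <= position < len(extracted_records):
        return extracted_records[position]
    return None
```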

property response: Response | ResponseProtocol | None

Directly references the raw response or response-like object from the API Response if available.

Returns:

The response object (response-like or None) if a ProcessedResponse or ErrorResponse is available. When either APIResponse subclass is not available, None is returned instead.

Return type:

Optional[Response | ResponseProtocol]

response_result: ProcessedResponse | ErrorResponse | None
property retrieval_timestamp: datetime | None

Indicates the ISO timestamp associated with the original response creation date and time.

property status: str | None

Extracts the human-readable status description from the underlying response, if available.

property status_code: int | None

Extracts the HTTP status code from the underlying response, if available.

strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]

Convenience method for removing metadata annotations from a record list for clean export.

Strips fields prefixed with underscore that were added during extraction for pipeline traceability (e.g., _extraction_index, _record_id).

Parameters:

records (Optional[RecordType | RecordList]) – Records to strip. Defaults to processed_records if None.

Returns:

New list of records with annotation fields removed. If there are no records to strip, an empty list is returned instead.
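The stripping rule itself is simple to sketch. The helper below is a standalone illustration that assumes records are dictionaries with string keys; drop_underscore_fields is a hypothetical name and not the library's method.

```python
from typing import Any, Dict, List, Optional, Union

RecordType = Dict[str, Any]


def drop_underscore_fields(records: Optional[Union[RecordType, List[RecordType]]]) -> List[RecordType]:
    """Return new dictionaries without keys that start with an underscore."""
    if records is None:
        return []
    if isinstance(records, dict):
        records = [records]  # treat a single record as a one-item list
    return [
        {key: value for key, value in record.items() if not key.startswith("_")}
        for record in records
    ]
```

Because new dictionaries are built, the annotated originals remain available for later record resolution.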

Example

>>> import pandas as pd
>>> clean_data = response.strip_annotations()
>>> df = pd.DataFrame(clean_data)  # No internal fields in DataFrame

property total_query_hits: int | None

Returns the total number of query hits according to the processed metadata field specific to the API.

property url: str | None

Extracts the URL from the underlying response, if available.

with_search_fields(records: NormalizedRecordType, include: SearchFields | None = None, strip_annotations: bool | None = None) NormalizedRecordType[source]
with_search_fields(records: NormalizedRecordList, include: SearchFields | None = None, strip_annotations: bool | None = None) NormalizedRecordList
with_search_fields(records: RecordType, include: SearchFields | None = None, strip_annotations: bool | None = None) RecordType
with_search_fields(records: RecordList | Iterator[RecordType], include: SearchFields | None = None, strip_annotations: bool | None = None) RecordList
with_search_fields(records: None, include: SearchFields | None = None, strip_annotations: bool | None = None) RecordType

Returns a record or list of record dictionaries merged with selected SearchResult fields.

Parameters:
  • records (RecordType | Iterator[RecordType] | NormalizedRecordType | RecordList | NormalizedRecordList) – The record dictionary or list of records to be merged with SearchResult fields.

  • include (Optional[SearchFields]) – Set of SearchResult fields to include (default: {"provider_name", "page"}).

  • strip_annotations (Optional[bool]) – A flag indicating whether to remove metadata annotations from records. If True, fields with leading underscores are removed from each processed record.

Returns:

RecordType: A single dictionary is returned if a single parsed record is provided.

RecordList: A list of dictionaries is returned if a list of parsed records is provided.

NormalizedRecordType: A single normalized dictionary is returned if a single normalized record is provided.

NormalizedRecordList: A list of normalized dictionaries is returned if a list of normalized records is provided.

Return type:

RecordType
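The single-record versus list behavior can be sketched as follows. The merge_search_fields helper and its search_fields argument are hypothetical, and the sketch assumes that search fields overwrite same-named record keys, which the documentation does not state.

```python
from typing import Any, Dict, Iterable, List, Union

RecordType = Dict[str, Any]


def merge_search_fields(records: Union[RecordType, Iterable[RecordType]],
                        search_fields: Dict[str, Any],
                        include: Iterable[str] = ("provider_name", "page")) -> Union[RecordType, List[RecordType]]:
    """Merge the selected search fields into one record or into each record of an iterable."""
    extra = {name: search_fields[name] for name in include}
    if isinstance(records, dict):
        return {**records, **extra}  # single record in, single record out
    return [{**record, **extra} for record in records]
```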

class scholar_flux.api.models.SearchResultList(iterable=(), /)[source]

Bases: list[SearchResult]

A custom list that stores the results of multiple SearchResult instances for enhanced type safety.

The SearchResultList class inherits from a list and extends its functionality to tailor its utility to ProcessedResponse and ErrorResponse objects received from SearchCoordinators and MultiSearchCoordinators.

- SearchResultList.append

Basic list.append implementation extended to accept only SearchResults

- SearchResultList.extend

Basic list.extend implementation extended to accept only iterables of SearchResults

- SearchResultList.filter

Removes NonResponses and ErrorResponses from the list of SearchResults

- SearchResultList.select

Selects a subset of SearchResults by query, provider_name, or page

- SearchResultList.join

Combines all records from ProcessedResponses into a list of dictionary-based records

Note: Attempting to add anything other than a SearchResult to the SearchResultList raises a TypeError.

append(item: SearchResult) None[source]

Overrides the default list.append method for type-checking compatibility.

This override ensures that only SearchResult objects can be appended to the SearchResultList. For all other types, a TypeError will be raised when attempting to append them to the SearchResultList.

Parameters:

item (SearchResult) – A SearchResult containing API response data, the name of the queried provider, the query, and the page number associated with the ProcessedResponse or ErrorResponse response result.

Raises:

TypeError – When the item to append to the SearchResultList is not a SearchResult.
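A type-checked list of this kind can be sketched with a small list subclass. Result and ResultList are hypothetical stand-ins for SearchResult and SearchResultList, and the same check is shown for extend as well as append.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Result:  # hypothetical stand-in for SearchResult
    query: str
    provider_name: str
    page: int


class ResultList(list):
    """A list that accepts only Result instances."""

    def append(self, item: Result) -> None:
        if not isinstance(item, Result):
            raise TypeError(f"expected a Result, got {type(item).__name__}")
        super().append(item)

    def extend(self, other: Iterable[Result]) -> None:
        # Materialize first so that a bad item cannot leave the list half-extended.
        items = list(other)
        if not all(isinstance(item, Result) for item in items):
            raise TypeError("expected an iterable of Result instances")
        super().extend(items)
```

Validating the whole iterable before extending keeps the list unchanged when any single element is rejected.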

copy() SearchResultList[source]

Overrides the default list.copy to return a shallow copy as a SearchResultList.

Returns:

A new, shallow copy of the current list.

Return type:

SearchResultList

extend(other: SearchResultList | MutableSequence[SearchResult] | Iterable[SearchResult]) None[source]

Overrides the default list.extend method for type-checking compatibility.

This override ensures that only an iterable of SearchResult objects can be used to extend the SearchResultList. For all other types, a TypeError will be raised when attempting to extend the SearchResultList with them.

Parameters:
  • other (Iterable[SearchResult]) – An iterable/sequence of response results containing the API response data, the provider name, and page associated with the response

Raises:

TypeError – When the item used to extend the SearchResultList is not an iterable of SearchResult instances

filter(invert: bool = False) SearchResultList[source]

Helper method that retains only the elements of the original list that indicate successful processing.

Parameters:

invert (bool) – Controls whether SearchResults containing ProcessedResponses or ErrorResponses should be selected. If True, ProcessedResponses are omitted from the filtered SearchResultList. Otherwise, only ProcessedResponses are retained.
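The effect of the invert flag can be sketched as below. The Result dataclass is a hypothetical stand-in in which records=None plays the role of an ErrorResponse or missing response.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Result:  # hypothetical stand-in for SearchResult
    provider_name: str
    records: Optional[List[Dict[str, Any]]]  # None stands in for a failed response


def filter_results(results: List[Result], invert: bool = False) -> List[Result]:
    """Keep successful results by default; keep only failures when invert=True."""
    return [result for result in results if (result.records is None) == invert]
```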

join(include: SearchFields | None = None, strip_annotations: bool | None = None) RecordList[source]

Combines all successfully processed API responses into a single list of dictionary records across all pages.

This method is especially useful for compatibility with pandas and polars dataframes that can accept a list of records when individual records are dictionaries.

Note that this method will only load processed responses that contain records that were also successfully extracted and processed.

Parameters:
  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each parsed record dictionary. Possible fields include provider_name, display_name, query, and page.

  • strip_annotations (Optional[bool]) – A flag indicating whether to remove metadata annotations from records. If True, fields with leading underscores are removed from each processed record.

Returns:

A single list containing all records retrieved from each page

Return type:

RecordList
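The flattening step can be sketched as follows. Result and join_records are hypothetical names, and records=None again stands in for a response that was not successfully processed.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class Result:  # hypothetical stand-in for SearchResult
    provider_name: str
    page: int
    records: Optional[List[Dict[str, Any]]]


def join_records(results: Iterable[Result], include: Iterable[str] = ()) -> List[Dict[str, Any]]:
    """Flatten the records of all successful results into one list of dictionaries."""
    names = tuple(include)
    joined: List[Dict[str, Any]] = []
    for result in results:
        if result.records is None:
            continue  # skip results without processed records
        extra = {name: getattr(result, name) for name in names}
        joined.extend({**record, **extra} for record in result.records)
    return joined
```

The returned list of dictionaries can be passed directly to a dataframe constructor that accepts records.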

normalize(raise_on_error: bool = False, update_records: bool | None = None, include: SearchFields | None = None, **kwargs: Any) NormalizedRecordList[source]

Convenience method allowing the batch normalization of all SearchResults in a SearchResultList.

When called, each result in the current SearchResultList is sequentially normalized as a record dictionary and outputted into a flattened list of normalized records across all pages, providers, and queries. The provider name is extracted from the normalization step and identifies the origin of each record, but additional search annotations (e.g., query, provider_name, display_name, page) can be added to each record to identify its origin.

Parameters:
  • raise_on_error (bool) – A flag indicating whether to raise an error. If False, iteration will continue through failures in processing such as cases where ErrorResponses and NonResponses otherwise raise a NotImplementedError. If raise_on_error is True, the normalization error will be raised.

  • update_records (Optional[bool]) – A flag that determines whether updates should be made to the normalized_records attribute after computation. If None, updates are made only if the normalized_records attribute is None.

  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each normalized record dictionary. Possible fields include provider_name, query, display_name, and page. By default, no model fields are appended.

  • **kwargs

    Additional keyword parameters forwarded to SearchResult.normalize(). Supported parameters include:

    • strip_annotations (bool): Removes internal annotation fields from normalized records

    • resolve_records (bool): Merges extracted and processed records when annotations exist

    • keep_api_specific_fields (bool | Sequence): Controls API-specific field inclusion

    • field_map (BaseFieldMap): An optional override to the field map to be used for record normalization

Returns:

A list of all normalized records across all queried pages, or an empty list if no records are available.

Return type:

NormalizedRecordList

Raises:

RecordNormalizationException – If raise_on_error=True and no field map found.

process_metadata(update_metadata: bool | None = None, include: SearchFields | None = None) list[MetadataType][source]

Processes the ProcessedResponse.metadata field to map metadata fields to provider-agnostic field names.

By default, the ResponseMetadataMap map retrieves and converts the API-specific page-size (records per page) and total results (total query hits) fields to integers when possible.

The field map is resolved in the following order of priority:

  1. User-specified field maps

  2. Resolving a provider name to a ResponseMetadataMap or subclass from the registry.

  3. Resolving the URL to a ResponseMetadataMap or subclass

Parameters:
  • update_metadata (Optional[bool]) – A flag that determines whether updates should be made to the processed_metadata attribute after computation. If None, updates are made only if the processed_metadata attribute is None.

  • include (Optional[set[Literal['query', 'provider_name', "display_name", 'page']]]) – Optionally appends the specified model fields as key-value pairs to each listed metadata dictionary. Possible fields include provider_name, display_name, query, and page.

Returns:

A list of processed metadata dictionaries mapping total_query_hits and records_per_page fields where possible.

Return type:

list[MetadataType]

Raises:

RecordNormalizationException – If raise_on_error=True and no field map found.

property record_count: int

Retrieves the overall record count across all search results if available.

select(query: str | None = None, provider_name: str | Pattern | None = None, page: tuple | MutableSequence | int | None = None, *, fuzzy: bool = True, regex: bool | None = None) SearchResultList[source]

Helper method that enables the selection of all responses (successful or failed) based on their attributes.

Parameters:
  • query (Optional[str]) – The exact query string to match (if provided). Ignored if None

  • provider_name (Optional[str | Pattern]) – The provider string or regex pattern to match (if provided). Ignored if None.

  • page (Optional[tuple | MutableSequence | int]) – The page or sequence of pages to match. Ignored if None.

  • fuzzy (bool) – Identifies search results by provider using fuzzy finding, or “flexible matching that’s more forgiving than exact”. When True, this implementation matches providers with normalized names that begin with the provided prefix (e.g., pubmed can match pubmed or pubmedefetch). The provider_registry.find() method is used to find providers within the package-level registry with names starting with the prefix. Pattern matching is performed if provider_name is a re.Pattern. If fuzzy=False, then only strict string matches will be preserved.

  • regex (Optional[bool]) – An optional keyword parameter passed to provider_registry.find() when fuzzy=True. When True, key pattern matching is enabled and registered providers can be identified using regex. This parameter is a no-op if fuzzy=False.

Examples

>>> from scholar_flux.api.models import SearchResult, SearchResultList
>>> crossref_result = SearchResult(page=1, query = 'q1', provider_name='crossref')
>>> pubmed_result = SearchResult(page=2, query = 'q2', provider_name='pubmedefetch')
>>> springer_nature_result = SearchResult(page=3, query = 'q3', provider_name='springernature')
>>> search_result_list = SearchResultList([crossref_result, pubmed_result, springer_nature_result])
>>> len(search_result_list.select()) # No filters selected
# OUTPUT: 3
>>> search_result_list.select(provider_name="pubmed") # Fuzzy prefix match on the provider name
# OUTPUT: [SearchResult(query='q2', provider_name='pubmedefetch', page=2, response_result=None, display_name='PubMed (eFetch)')]
>>> search_result_list.select(provider_name="springer")
# OUTPUT: [SearchResult(query='q3', provider_name='springernature', page=3, response_result=None, display_name='Springer Nature')]
>>> search_result_list.select(query="q1")
# OUTPUT: [SearchResult(query='q1', provider_name='crossref', page=1, response_result=None, display_name='Crossref')]

Returns:

A filtered list of search results containing only results that match the conditions.

Return type:

SearchResultList
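The matching rules can be approximated without the provider registry. The sketch below is a simplification under stated assumptions: it compares normalized prefixes directly instead of calling provider_registry.find(), and Result, normalize_name, matches_provider, and select_results are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Union


@dataclass
class Result:  # hypothetical stand-in for SearchResult
    query: str
    provider_name: str
    page: int


def normalize_name(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def matches_provider(candidate: str, provider_name: Union[str, re.Pattern, None], fuzzy: bool) -> bool:
    if provider_name is None:
        return True  # no provider filter
    if isinstance(provider_name, re.Pattern):
        return provider_name.search(candidate) is not None
    if fuzzy:
        return normalize_name(candidate).startswith(normalize_name(provider_name))
    return candidate == provider_name


def select_results(results: List[Result], query: Optional[str] = None,
                   provider_name: Union[str, re.Pattern, None] = None,
                   page: Union[int, Sequence[int], None] = None, fuzzy: bool = True) -> List[Result]:
    """Keep results that satisfy every filter that was provided."""
    pages = None if page is None else ({page} if isinstance(page, int) else set(page))
    return [
        result for result in results
        if (query is None or result.query == query)
        and matches_provider(result.provider_name, provider_name, fuzzy)
        and (pages is None or result.page in pages)
    ]
```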