scholar_flux package

Subpackages

Module contents

ScholarFlux API

The scholar_flux package is an open-source project designed to streamline access to academic and scholarly resources across various platforms. It offers a unified API that simplifies querying academic databases, retrieving metadata, and performing comprehensive searches within scholarly articles, journals, and publications.

In addition, this API has built in extension capabilities for applications in News retrieval and other domains.

The SearchCoordinator offers the core functionality needed to orchestrate the process the process, including: API Retrieval -> Response Parsing -> Record Extraction -> Record Processing -> Returning the Processed Response

This module initializes the package and includes the core functionality and helper classes needed to retrieve API Responses from API Providers.

class scholar_flux.APIParameterConfig(parameter_map: APIParameterMap)[source]

Bases: object

Uses an APIParameterMap instance and runtime parameter values to build parameter dictionaries for API requests.

Parameters:: parameter_map (APIParameterMap) – The mapping of universal to API-specific parameter names.

Class Attributes:

DEFAULT_CORRECT_ZERO_INDEX (bool):: Autocorrects zero-indexed API parameter building specifications to only accept positive values when True. If otherwise False, page calculation APIs will start from page 0 if zero-indexed (i.e., arXiv).

Examples

>>> from scholar_flux.api import APIParameterConfig, APIParameterMap
>>> # the API parameter map is defined and used to resolve parameters to the API's language
>>> api_parameter_map = APIParameterMap(
... query='q', records_per_page = 'pagesize', start = 'page', auto_calculate_page = False
... )
# The APIParameterConfig defines class and settings that indicate how to create requests
>>> api_parameter_config = APIParameterConfig(api_parameter_map, auto_calculate_page = False)
# Builds parameters using the specification from the APIParameterMap
>>> page = api_parameter_config.build_parameters(query= 'ml', page = 10, records_per_page=50)
>>> print(page)
# OUTPUT {'q': 'ml', 'page': 10, 'pagesize': 50}

DEFAULT_CORRECT_ZERO_INDEX: ClassVar[bool] = True

__init__(*args: Any, **kwargs: Any) → None

classmethod as_config(parameter_map: dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) → APIParameterConfig[source]

Factory method for creating a new APIParameterConfig from a dictionary or APIParameterMap.

This helper class method resolves the structure of the APIParameterConfig against its basic building blocks to create a new configuration when possible.

Parameters:: parameter_map (dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) – A parameter mapping/config to use in the instantiation of an APIParameterConfig.
Returns:: A new structure from the inputs
Return type:: APIParameterConfig
Raises:: APIParameterException – If there is an error in the creation/resolution of the required parameters

build_parameters(query: str | None, page: int | None, records_per_page: int, **api_specific_parameters) → Dict[str, Any][source]

Builds the dictionary of request parameters using the current parameter map and provided values at runtime.

Parameters:

query (Optional[str]) – The search query string.
page (Optional[int]) – The page number for pagination (1-based).
records_per_page (int) – Number of records to fetch per page.
**api_specific_parameters – Additional API-specific parameters to include.

Returns:

The fully constructed API request parameters dictionary, with keys as API-specific parameter names and values as provided.

Return type:

Dict[str, Any]

classmethod from_defaults(provider_name: str, **additional_parameters) → APIParameterConfig[source]

Factory method to create APIParameterConfig instances with sensible defaults for known APIs.

If the provider_name does not exist, the code will raise an exception.

Parameters:

provider_name (str) – The name of the API to create the parameter map for.
api_key (Optional[str]) – API key value if required.
additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter config instance for the specified API.

Return type:

APIParameterConfig

Raises:

NotImplementedError – If the API name is unknown.

classmethod get_defaults(provider_name: str, **additional_parameters) → APIParameterConfig | None[source]

Factory method to create APIParameterConfig instances with sensible defaults for known APIs.

Avoids throwing an error if the provider name does not already exist.

Parameters:

provider_name (str) – The name of the API to create the parameter map for.
additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter config instance for the specified API. Returns None if a mapping for the provider_name isn’t retrieved

Return type:

Optional[APIParameterConfig]

property map: APIParameterMap

Helper property that is an alias for the APIParameterMap attribute.

The APIParameterMap maps all universal parameters to the parameter names specific to the API provider.

Returns:: The mapping that the current APIParameterConfig will use to build a dictionary of parameter requests specific to the current API.
Return type:: APIParameterMap

parameter_map: APIParameterMap

show_parameters() → list[source]

Helper method to show the complete list of all parameters that can be found in the current_mappings.

Returns:: The complete list of all universal and api specific parameters corresponding to the current API
Return type:: List

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]: Helper method that shows the current structure of the APIParameterConfig.

class scholar_flux.APIParameterMap(*, query: str, records_per_page: str, start: str | None = None, api_key_parameter: str | None = None, api_key_required: bool = False, auto_calculate_page: bool = True, zero_indexed_pagination: bool = False, api_specific_parameters: ~typing.Dict[str, ~scholar_flux.api.models.base_parameters.APISpecificParameter] = <factory>)[source]

Bases: BaseAPIParameterMap

Extends BaseAPIParameterMap by adding validation and the optional retrieval of provider defaults for known APIs.

This class also specifies default mappings for specific attributes such as API keys and additional parameter names.

query

The API-specific parameter name for the search query.

Type:: str

start

The API-specific parameter name for pagination (start index or page number).

Type:: Optional[str]

records_per_page

The API-specific parameter name for records per page.

Type:: str

api_key_parameter

The API-specific parameter name for the API key.

Type:: Optional[str]

api_key_required

Indicates whether an API key is required.

Type:: bool

auto_calculate_page

If True, calculates start index from page; if False, passes page number directly.

Type:: bool

zero_indexed_pagination

If True, treats 0 as an allowed page value when retrieving data from APIs.

Type:: bool

api_specific_parameters

Additional universal to API-specific parameter mappings.

Type:: Dict[str, str]

classmethod from_defaults(provider_name: str, **additional_parameters) → APIParameterMap[source]

Factory method that uses the APIParameterMap.get_defaults classmethod to retrieve the provider config.

Raises an error if the provider does not exist.

Parameters:

provider_name (str) – The name of the API to create the parameter map for.
additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter map for the specified API.

Return type:

APIParameterMap

Raises:

NotImplementedError – If the API name is unknown.

classmethod get_defaults(provider_name: str, **additional_parameters) → APIParameterMap | None[source]

Factory method to create APIParameterMap instances with sensible defaults for known APIs.

This class method attempts to pull from the list of known providers defined in the scholar_flux.api.providers.provider_registry and returns None if an APIParameterMap for the provider cannot be found.

Using the additional_parameters keyword arguments, users can specify optional overrides for specific parameters if needed. This is helpful in circumstances where an API’s specification overlaps with that of a known provider.

Valid providers (as indicated in provider_registry) include:

springernature
plos
arxiv
openalex
core
crossref

Parameters:

provider_name (str) – The name of the API provider to retrieve the parameter map for.
additional_parameters (dict) – Additional parameter mappings.

Returns:

Configured parameter map for the specified API.

Return type:

Optional[APIParameterMap]

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod set_default_api_key_parameter(values: dict[str, Any]) → dict[str, Any][source]

Sets the default for the api key parameter when api_key_required`=True and `api_key_parameter is None.

Parameters:: values (dict[str, Any]) – The dictionary of attributes to validate
Returns:: The updated parameter values passed to the APIParameterMap. api_key_parameter is set to “api_key” if key is required but not specified
Return type:: dict[str, Any]

classmethod validate_api_specific_parameter_mappings(values: dict[str, Any]) → dict[str, Any][source]

Validates the additional mappings provided to the APIParameterMap.

This method validates that the input is dictionary of mappings that consists of only string-typed keys mapped to API-specific parameters as defined by the APISpecificParameter class.

Parameters:: values (dict[str, Any]) – The dictionary of attribute values to validate.
Returns:: The updated dictionary if validation passes.
Return type:: dict[str, Any]
Raises:: APIParameterException – If api_specific_parameters is not a dictionary or contains non-string keys/values.

Bases: object

The BaseAPI client is a minimal implementation for user-friendly request preparation and response retrieval.

Parameters:

session (Optional[requests.Session]) – A pre-configured requests or requests-cache session. A new session is created if not specified.
user_agent (Optional[str]) – An optional user-agent string for the session.
timeout – (Optional[int | float]): Identifies the number of seconds to wait before raising a TimeoutError
use_cache (bool) – Indicates whether or not to create a cached session. If a cached session is already specified, this setting will have no effect on the creation of a session.

Examples

>>> from scholar_flux.api import BaseAPI
# creating a basic API client that uses the PLOS API as the default while caching response data in-memory:
>>> base_api = BaseAPI(use_cache=True)
# retrieve a basic request:
>>> parameters = {'q': 'machine learning', 'start': 1, 'rows': 20}
>>> response_page_1 = base_api.send_request('https://api.plos.org/search', parameters=parameters)
>>> assert response_page_1.ok
>>> response_page_1
# OUTPUT: <Response [200]>
>>> ml_page_1 = response_page_1.json()
# retrieving the next page:
>>> parameters['start'] = 21
>>> response_page_2 = base_api.send_request('https://api.plos.org/search', parameters=parameters)
>>> assert response_page_2.ok
>>> response_page_2
# OUTPUT: <Response [200]>
>>> ml_page_2 = response_page_2.json()
>>> ml_page_2
# OUTPUT: {'response': {'numFound': '...', 'start': 21, 'docs': ['...']}} # redacted

DEFAULT_TIMEOUT: int = 20

DEFAULT_USE_CACHE: bool = False

Initializes the BaseAPI client for response retrieval given the provided inputs.

The necessary attributes are prepared with a new or existing session (cached or uncached) via dependency injection. This class is designed to be subclassed for specific API implementations.

Parameters:

user_agent (Optional[str]) – Optional user-agent string for the session.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
timeout (Optional[int | float]) – Timeout for requests in seconds.
use_cache (Optional[bool]) – Indicates whether or not to use cache. The default setting is to create a regular requests.Session unless a CachedSession is already provided.

configure_session(session: Session | None = None, user_agent: str | None = None, use_cache: bool | None = None) → Session[source]

Creates a session object if one does not already exist. If use_cache = True, then a cached session object will be used. A regular session that is not already cached will be overridden.

Parameters:

session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist. If use_cache is True and a cached session has already been passed, the previously created cached session is returned. Otherwise, a new CachedSession is created.

Returns:

The configured session.

Return type:

requests.Session

prepare_request(base_url: str, endpoint: str | None = None, parameters: Dict[str, Any] | None = None) → PreparedRequest[source]

Prepares a GET request for the specified endpoint with optional parameters.

Parameters:

base_url (str) – The base URL for the API.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
parameters (Optional[Dict[str, Any]]) – Optional query parameters for the request.

Returns:

The prepared request object.

Return type:

prepared_request (PreparedRequest)

send_request(base_url: str, endpoint: str | None = None, parameters: Dict[str, Any] | None = None, timeout: int | float | None = None) → Response[source]

Sends a GET request to the specified endpoint with optional parameters.

Parameters:

base_url (str) – The base API to send the request to.
endpoint (Optional[str]) – The endpoint of the API to send the request to.
parameters (Optional[Dict[str, Any]]) – Optional query parameters for the request.
timeout (int) – Timeout for the request in seconds.

Returns:

The response object.

Return type:

requests.Response

structure(flatten: bool = True, show_value_attributes: bool = False) → str[source]

Base method for showing the structure of the current BaseAPI. This method reveals the configuration settings of the API client that will be used to send requests.

Returns:: The current structure of the BaseAPI or its subclass.
Return type:: str

summary() → str[source]

Create a summary representation of the current structure of the API:

Returns the original representation.

property user_agent: str | None

The User-Agent should always reflect what is used in the session.

This method retrieves the User-Agent from the session directly.

Bases: SessionManager

This session manager is a wrapper around requests-cache and enables the creation of a requests-cache session with defaults that abstract away the complexity of cached session management.

The purpose of this class is to abstract away the complexity in cached sessions by providing reasonable defaults that are well integrated with the scholar_flux package The requests_cache package is built off of the base requests library and similarly be injected into the scholar_flux SearchAPI for making cached queries.

Examples

>>> from scholar_flux.sessions import CachedSessionManager
>>> from scholar_flux.api import SearchAPI
>>> from requests_cache import CachedSession
### creates a sqlite cached session in a package-writable directory
>>> session_manager = CachedSessionManager(user_agent='scholar_flux_user_agent')
>>> cached_session = session_manager() # defaults to a sqlite session in the package directory
### Which is equivalent to:
>>> cached_session = session_manager.configure_session() # defaults to a sqlite session in the package directory
>>> assert isinstance(cached_session, CachedSession)
### Similarly to the basic requests.session, this can be dependency injected in the SearchAPI:
>>> SearchAPI(query = 'history of software design', session = cached_session)

The initialization of the CachedSessionManager defines the options that are later passed to the self.configure_session method which returns a session object after parameter validation.

Parameters:

user_agent (str) – Specifies the name to use for the User-Agent parameter that is to be provided in each request header.
cache_name (str) – The name to associate with the current cache - used as a file in the case of filesystem/sqlite storages, and is otherwise used as a cache name in the case of storages such as Redis.
Optional (cache_directory) – Defines the directory where the cache file is stored. if not provided, the cache_directory, when needed (sqlite, filesystem storage, etc.) will default to the first writable directory location using the scholar_flux.package_metadata.get_default_writable_directory method.
backend (str | requests.BaseCache) – Defines the backend to use when creating a requests-cache session. the default is sqlite. Other backends include memory, filesystem, mongodb, redis, gridfs, and dynamodb. Users can enter in direct cache storage implementations from requests_cache, including RedisCache, MongoCache, SQLiteCache, etc. For more information, visit the following link: https://requests-cache.readthedocs.io/en/stable/user_guide/backends.html#choosing-a-backend
serializer – (Optional[str | requests_cache.serializers.pipeline.SerializerPipeline | requests_cache.serializers.pipeline.Stage]): An optional serializer that is used to prepare cached responses for storage (serialization) and deserialize them for retrieval
expire_after (Optional[int|float|str|datetime.datetime|datetime.timedelta]) – Sets the expiration time after which previously successfully cached responses expire.
raise_on_error (bool) – Whether to raise an error on instantiation if an error is encountered in the creation of a session. If raise_on_error = False, the error is logged, and a requests.Session is created instead.

property backend: str | BaseCache: Makes the config’s backend storage device for requests-cache accessible from the CachedSessionManager.

property cache_directory: Path | None: Makes the config’s cache directory accessible by the CachedSessionManager.

property cache_name: str: Makes the config’s base file name for the cache accessible by the CachedSessionManager.

property cache_path: str: Makes the config’s cache directory accessible by the CachedSessionManager.

configure_session() → Session | CachedSession[source]

Configures and returns a cached session object with the options provided to the config when creating the CachedSessionManager.

Note

If the cached session can not be configured due to permission errors, or connection errors, the session_manager will fallback to creating a requests.Session if the self.raise_on_error attribute is set to False.

Returns:: A cached session object if successful otherwise returns a requests.Session object in the event of an error.
Return type:: requests.Session | requests_cache.CachedSession

property expire_after: int | float | str | datetime | timedelta | None: Makes the config’s value used for response cache expiration accessible from the CachedSessionManager.

Determines what directory will be used for session cache storage, favoring an explicitly assigned cache_directory if provided.

Note that this method will only attempt to find a cache directory if one is needed, such as when choosing to use a “filesystem” or “sqlite” database using a string.

Resolution order (highest to lowest priority):

Explicit cache_directory argument
config_settings.config[‘CACHE_DIRECTORY’] (can be set via environment variable)
Package or home directory defaults (depending on writeability)

If the resolved cache_directory is a string, it is coerced into a Path before being returned. Returns None if the backend does not require a cache directory (e.g., dynamodb, mongodb, etc.).

Parameters:

cache_directory (Optional[Path | str]) – Explicit directory to use, if provided.
backend (Optional[str | requests.BaseCache]) – Backend type, used to determine if a directory is needed.

Returns:

The resolved cache directory as a Path or None if not applicable

Return type:

Optional[Path]

property serializer: str | SerializerPipeline | Stage | None: Makes the serializer from the config accessible from the CachedSessionManager.

class scholar_flux.DataCacheManager(cache_storage: ABCStorage | None = None)[source]

Bases: object

DataCacheManager class manages caching of API responses.

This class provides methods to generate cache keys, verify cache entries, check cache validity, update cache with new data, and retrieve data from the cache storage.

Parameters:: cache_storage (-) – Optional; A dictionary to store cached data. Defaults to using In-Memory Storage .

- generate_fallback_cache_key(response): Generates a unique fallback cache key based on the response URL and status code.

- verify_cache(cache_key): Checks if the provided cache_key exists in the cache storage.

- cache_is_valid(cache_key, response=None, cached_response=None): Determines whether the cached data for a given key is still valid.

- update_cache(cache_key, response, store_raw=False, metadata=None, parsed_response=None, processed_records=None): Updates the cache storage with new data.

- retrieve(cache_key): Retrieves data from the cache storage based on the cache key.

- retrieve_from_response(response): Retrieves data from the cache storage based on the response if within cache.

Examples

>>> from scholar_flux.data_storage import DataCacheManager
>>> from scholar_flux.api import SearchCoordinator
# Factory method that creates a default redis connection to the service on localhost if available.
>>> redis_cache_manager = DataCacheManager.with_storage('redis')
# Creates a search coordinator for retrieving API responses from the PLOS API provider
>>> search_coordinator = SearchCoordinator(query = 'Computational Caching Strategies',
                                           provider_name='plos',
                                           cache_requests = True, # caches raw requests prior to processing
                                           cache_manager=redis_cache_manager) # caches response processing
# Uses the cache manager to temporarily store cached responses for the default duration
>>> processed_response = search_coordinator.search(page = 1)
# On the next search, the processed response data can be retrieved directly for later response reconstruction
>>> retrieved_response_json = search_coordinator.responses.cache.retrieve(processed_response.cache_key)
# Serialized responses store the core response fields (content, URL, status code) associated with API responses
>>> assert isinstance(retrieved_response_json, dict) and 'serialized_response' in retrieved_response_json

__init__(cache_storage: ABCStorage | None = None) → None[source]: Initializes the DataCacheManager with the selected cache storage.

classmethod cache_fingerprint(obj: str | Any | None = None, package_version: str | None = '0.1.5') → str[source]

This method helps identify changes in class/configuration for later cache retrieval. It generates a unique string based on the object and the package version.

By default, a fingerprint is generated from the current package version and object representation, if provided. If otherwise not provided, a new human-readable object representation is generated using the scholar_flux.utils.generate_repr helper function that represents the object name and its current state. A package version is also prepended to the current finger-print if enabled (not None), and can be customized if needed for object-specific versioning.

Parameters:

obj (Optional[str]) – A finger-printed object, or an object to generate a representation of
package_version (Optional[str]) – The current package version string or manually provided version for a component).

Returns:

A human-readable string including the version, object identity

Return type:

str

cache_is_valid(cache_key: str, response: Response | ResponseProtocol | None = None, cached_response: Dict[str, Any] | None = None) → bool[source]

Determines whether the cached data for a given key is still valid or needs reprocessing due to missing fields or modified content when checked against the current response.

If a cached_response dictionary was not directly passed, the cache key will be retrieved from storage before comparison.

Parameters:

cache_key (str) – The unique identifier for cached data.
response (Optional[Response | ResponseProtocol]) – The API response or response-like object used to validate the cache, if available.
cached_response – Optional[Dict[str, Any]]: The cached data associated with the key

Returns:

True if the cache is valid, False otherwise.

Return type:

bool

clone() → DataCacheManager[source]: Helper method for creating a newly cloned instance of the current DataCacheManager.

delete(cache_key: str) → None[source]

Deletes data from the cache storage based on the cache key.

Parameters:: cache_key – A unique identifier for the cached data.
Returns:: The cached data corresponding to the cache key if found, otherwise None.
Return type:: None

classmethod generate_fallback_cache_key(response: Response | ResponseProtocol) → str[source]

Generates a unique fallback cache key based on the response URL and status code.

Parameters:: response – The API response object.
Returns:: A unique fallback cache key.
Return type:: str

classmethod generate_response_hash(response: Response | ResponseProtocol) → str[source]

Generates a hash of the response content.

Parameters:: response – The API response object.
Returns:: A SHA-256 hash of the response content.
Return type:: str

isnull() → bool[source]: Helper method for determining whether the current cache manager uses a null storage.

classmethod null() → DataCacheManager[source]

Creates a DataCacheManager using a NullStorage (no storage.

This storage device has the effect of returning False when validating whether the current DataCacheManager is in operation or not

Returns:: The current class initialized without storage
Return type:: DataCacheManager

retrieve(cache_key: str) → Dict[str, Any] | None[source]

Retrieves data from the cache storage based on the cache key.

Parameters:: cache_key – A unique identifier for the cached data.
Returns:: The cached data corresponding to the cache key if found, otherwise None.
Return type:: Optional[Dict[str, Any]]

retrieve_from_response(response: Response | ResponseProtocol) → Dict[str, Any] | None[source]

Retrieves data from the cache storage based on the response if within cache.

Parameters:: response – The API response object.
Returns:: The cached data corresponding to the response if found, otherwise None.
Return type:: Optional[Dict[str, Any]]

structure(flatten: bool = False, show_value_attributes: bool = False) → str[source]

Helper method for quickly showing a representation of the overall structure of the current DataCacheManager. The instance uses the generate_repr helper function to produce human-readable representations of the core structure of the storage subclass with its defaults.

Returns:: The structure of the current DataCacheManager as a string.
Return type:: str

update_cache(cache_key: str, response: Response | ResponseProtocol, store_raw: bool = False, parsed_response: Any | None = None, metadata: Dict[str, Any] | None = None, extracted_records: Any | None = None, processed_records: Any | None = None, **kwargs) → None[source]

Updates the cache storage with new data.

Parameters:

cache_key – A unique identifier for the cached data.
response – (requests.Response | ResponseProtocol) The API response or response-like object.
store_raw – (Optional) A boolean indicating whether to store the raw response. Defaults to False.
metadata – (Optional) Additional metadata associated with the cached data. Defaults to None.
parsed_response – (Optional) The response data parsed into a structured format. Defaults to None.
processed_records – (Optional) The response data processed for specific use. Defaults to None.
kwargs – Optional additional hashable dictionary fields that can be stored using sql cattrs encodings or in-memory cache.

verify_cache(cache_key: str | None) → bool[source]

Checks if the provided cache_key exists in the cache storage.

Parameters:: cache_key – A unique identifier for the cached data.
Returns:: True if the cache key exists, False otherwise.
Return type:: bool

classmethod with_storage(cache_storage: Literal['redis', 'sql', 'sqlalchemy', 'mongodb', 'pymongo', 'inmemory', 'memory', 'null'] | None = None, *args, **kwargs) → DataCacheManager[source]

Creates a DataCacheManager using a known storage device.

This is a convenience function allowing the user to create a DataCacheManager with redis, sql, mongodb, or inmemory storage with default settings or through the use of optional positional and keyword parameters to initialize the storage as needed. :returns: The current class initialized the chosen storage :rtype: DataCacheManager

Bases: BaseDataExtractor

The DataExtractor allows for the streamlined extraction of records and metadata from responses retrieved from APIs. This proceeds as the second stage of the response processing step where metadata and records are extracted from parsed responses.

The data extractor provides two ways to identify metadata paths and record paths:

manual identification: If record path or metadata_path are specified, then the data extractor will attempt to retrieve the metadata and records at the provided paths. Note that, as metadata_paths can be associated with multiple keys, starting from the outside dictionary, we may have to specify a dictionary containing keys denoting metadata variables and their paths as a list of values indicating how to retrieve the value. The path can also be given by a list of lists describing how to retrieve the last element.

Dynamic identification: Uses heuristics to determine records from metadata. records will nearly always be defined by a list containing only dictionaries as its elements while the metadata will generally contain a variety of elements, some nested and others as integers, strings, etc. In some cases where its harder to determine, we can use dynamic_record_identifiers to determine whether a list containing a single nested dictionary is a record or metadata. For scientific purposes, its keys may contain ‘abstract’, ‘title’, ‘doi’, etc. This can be defined manually by the users if the defaults are not reliable for a given API.

Upon initializing the class, the class can be used as a callable that returns the records and metadata in that order.

Example

>>> from scholar_flux.data import DataExtractor
>>> data = dict(query='specification driven development', options={'record_count':5,'response_time':'50ms'})
>>> data['records'] = [dict(id=1, record='protocol vs.code'), dict(id=2, record='Impact of Agile')]
>>> extractor = DataExtractor()
>>> records, metadata = extractor(data)
>>> print(metadata)
# OUTPUT: {'query': 'specification driven development', 'record_count': 5, 'response_time': '50ms'}
>>> print(records)
# OUTPUT: [{'id': 1, 'record': 'protocol vs.code'}, {'id': 2, 'record': 'Impact of Agile'}]

DEFAULT_DYNAMIC_METADATA_IDENTIFIERS = ('metadata', 'facets', 'IdList')

DEFAULT_DYNAMIC_RECORD_IDENTIFIERS = ('title', 'doi', 'abstract')

Initialize the DataExtractor with optional path overrides for metadata and records.

Parameters:

record_path (Optional[List[str]]) – Custom path to find records in the parsed data. Contains a list of strings and rarely integers indexes indicating how to recursively find the list of records
metadata_path (List[List[str]] | Optional[Dict[str, List[str]]]) – Identifies the paths in a dictionary associated with metadata as opposed to records. This can be a list of paths where each element is a list describing how to get to a terminal
element
dynamic_record_identifiers (Optional[List[str]]) – Helps to identify dictionary keys that only belong to records when dealing with a single element that would otherwise be classified as metadata.
dynamic_metadata_identifiers (Optional[List[str]]) – Helps to identify dictionary keys that are likely to only belong to metadata that could otherwise share a similar structure to a list of dictionaries, similar to what’s seen with records.

dynamic_identification(parsed_page_dict: dict) → tuple[list[dict[str, Any]], dict[str, Any]][source]

Dynamically identify and separate metadata from records. This function recursively traverses the dictionary and uses a heuristic to determine whether a specific record corresponds to metadata or is a list of records: Generally, keys associated with values corresponding to metadata will contain only lists of dictionaries On the other hand, nested structures containing metadata will be associated with a singular value other a dictionary of keys associated with a singular value that is not a list. Using this heuristic, we’re able to determine metadata from records with a high degree of confidence.

Parameters:: parsed_page_dict (Dict) – The dictionary containing the page data and metadata to be extracted.
Returns:: A tuple containing the metadata dictionary and the list of record dictionaries.
Return type:: Tuple[Dict[str, Any], List[Dict[str, Any]]]

extract(parsed_page: list[dict] | dict) → tuple[list[dict] | None, dict[str, Any] | None][source]

Extract both records and metadata from the parsed page dictionary.

Parameters:: parsed_page (List[Dict] | Dict) – The dictionary containing the page data and metadata to be extracted.
Returns:: A tuple containing the list of records and the metadata dictionary.
Return type:: Tuple[Optional[List[Dict]], Optional[Dict]]

class scholar_flux.DataParser(additional_parsers: dict[str, Callable] | None = None)[source]

Bases: BaseDataParser

Extensible class that handles the identification and parsing of typical formats seen in APIs that send news and academic articles in XML, JSON, and YAML formats.

The BaseDataParser contains each of the necessary class elements to parse JSON, XML, and YAML formats as class methods while this class allows for the specification of additional parsers.

Parameters:: additional_parsers (Optional[dict[str, Callable]]) – Allows overrides for parsers in addition to the JSON, XML and YAML parsers that are enabled by default.

__init__(additional_parsers: dict[str, Callable] | None = None)[source]

On initialization, the data parser is set to use built-in class methods to parse json, xml, and yaml-based response content by default and the parse helper class to determine which parser to use based on the Content- Type.

Parameters:

additional_parsers (Optional[dict[str, Callable]]) – Allows for the addition of
identification. (new parsers and overrides to class methods to be used on content-type)

parse(response: Response | ResponseProtocol, format: str | None = None) → dict | list[dict] | None[source]

Parses the API response content using to core steps.

Detects the API response format if a format is not already specified
Uses the previously determined format to parse the content of the response and return a parsed dictionary (json) structure.

Parameters:

response (requests.Response | ResponseProtocol) – The response or response-like object from the API request.
format (str) – The parser needed to format the response as a list of dicts

Returns:

response dict containing fields including a list of metadata records as dictionaries.

Return type:

dict

Bases: ABCDataProcessor

Initialize the DataProcessor with explicit extraction paths and options. The DataProcessor performs the selective extraction os specific fields from each record within a page (list) of JSON (dictionary) records and assumes that the paths to extract are known beforehand.

Parameters:

record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.
ignore_keys – List of keys to ignore during processing.
keep_keys – List of keys that records should be retained during processing.
value_delimiter – Delimiter for joining multiple values.
regex – Whether to use regex for ignore filtering.

Examples

>>> from scholar_flux.data import DataProcessor
>>> data = [{'id':1, 'school':{'department':'NYU Department of Mathematics'}},
>>>         {'id':2, 'school':{'department':'GSU Department of History'}},
>>>         {'id':3, 'school':{'organization':'Pharmaceutical Research Team'}}]
# creating a basic processor
>>> data_processor = DataProcessor(record_keys = [['id'], ['school', 'department'], ['school', 'organization']]) # instantiating the class
### The process_page method can then be referenced using the processor as a callable:
>>> result = data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': 1, 'school.department': 'NYU Department of Mathematics', 'school.organization': None},
#          {'id': 2, 'school.department': 'GSU Department of History', 'school.organization': None},
#          {'id': 3, 'school.department': None, 'school.organization': 'Pharmaceutical Research Team'}]

Initialize the DataProcessor with explicit extraction paths and options.

Parameters:

record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.
ignore_keys – List of keys to ignore during processing.
value_delimiter – Delimiter for joining multiple values.
regex – Whether to use regex for ignore filtering.

collapse_fields(processed_record_dict: dict) → dict[str, list[str | int] | str | int][source]: Helper method for joining lists of data into a singular string for flattening.

Processes a specific key from a record by retrieving the value associated with the key at the nested path. Depending on whether value_delimiter is set, the method will joining non-None values into a string using the delimiter. Otherwise, keys with lists as values will contain the lists un-edited.

Parameters:

record – The record dictionary to extract the key from.
key – The key to process within the record dictionary.

Returns:

A string containing non-None values from the specified key in the record dictionary, joined by ‘; ‘.

Return type:

str

process_page(parsed_records: list[dict[str | int, Any]], ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = None) → list[dict][source]

Core method of the data processor that enables the processing of lists of dictionary records to filter and process records based on the configuration of the current DataProcessor.

Parameters:

parsed_records (list[dict[str | int, Any]]) – The records to process and/or filter
ignore_keys (Optional[list[str]]) – Optional overrides identifying records to ignore based on matching keys or absence of keys
keep_keys (Optional[list[str]]) – Optional overrides identifying records to keep based based on matching keys
regex – (Optional[bool]): Used to determine whether or not to filter records using regular expressions

process_record(record_dict: dict[str | int, Any]) → dict[str, Any][source]

Processes a record dictionary to extract record data and article content, creating a processed record dictionary with an abstract field.

Args: - record_dict: The dictionary containing the record data.

Returns: - dict: A processed record dictionary with record keys processed and an abstract field created from the article content.

record_filter(record_dict: dict[str | int, Any], record_keys: list[str] | None = None, regex: bool | None = None) → bool | None[source]: Helper method that filters records using regex pattern matching, checking if any of the keys provided in the function call exist.

class scholar_flux.InMemoryStorage(namespace: str | None = None, ttl: int | None = None, raise_on_error: bool | None = None, **kwargs)[source]

Bases: ABCStorage

Default storage class that implements an in-memory storage cache using a dictionary.

This class implements the required abstract methods from the ABCStorage base class to ensure compatibility with the scholar_flux.DataCacheManager. Methods are provided to delete from the cache, update the cache with new data, and retrieve data from the cache.

Parameters:

namespace (Optional[str]) – Prefix for cache keys. Defaults to None.
ttl (Optional[int]) – Ignored. Included for interface compatibility; not implemented.
**kwargs (Dict) – Ignored. Included for interface compatibility; not implemented.

Examples

>>> from scholar_flux.data_storage import InMemoryStorage
### defaults to a basic dictionary:
>>> memory_storage = InMemoryStorage(namespace='testing_functionality')
>>> print(memory_storage)
# OUTPUT: InMemoryStorage(...)
### Adding records to the storage
>>> memory_storage.update('record_page_1', {'id':52, 'article': 'A name to remember'})
>>> memory_storage.update('record_page_2', {'id':55, 'article': 'A name can have many meanings'})
### Revising and overwriting a record
>>> memory_storage.update('record_page_2', {'id':53, 'article': 'A name has many meanings'})
>>> memory_storage.retrieve_keys() # retrieves all current keys stored in the cache under the namespace
# OUTPUT: ['testing_functionality:record_page_1', 'testing_functionality:record_page_2']
>>> memory_storage.retrieve_all() # Will also be empty
# OUTPUT: {'testing_functionality:record_page_1': {'id': 52,
#           'article': 'A name to remember'},
#          'testing_functionality:record_page_2': {'id': 53,
#           'article': 'A name has many meanings'}}
>>> memory_storage.retrieve('record_page_1') # retrieves the record for page 1
# OUTPUT: {'id': 52, 'article': 'A name to remember'}
>>> memory_storage.delete_all() # deletes all records from the namespace
>>> memory_storage.retrieve_keys() # Will now be empty
>>> memory_storage.retrieve_all() # Will also be empty

DEFAULT_NAMESPACE: str | None = None

DEFAULT_RAISE_ON_ERROR: bool = False

__init__(namespace: str | None = None, ttl: int | None = None, raise_on_error: bool | None = None, **kwargs) → None[source]

Initialize a basic, dictionary-like memory_cache using a namespace.

Note that ttl and **kwargs are provided for interface compatibility, and specifying any of these as arguments will not affect processing or cache initialization.

clone() → InMemoryStorage[source]: Helper method for creating a new InMemoryStorage with the same configuration.

delete(key: str) → None[source]

Attempts to delete the selected cache key if found within the current namespace.

Parameters:: key (str) – The key used associated with the stored data from the dictionary cache.

delete_all() → None[source]: Attempts to delete all cache keys found within the current namespace.

classmethod is_available(*args, **kwargs) → bool[source]

Helper method that returns True, indicating that dictionary-based storage will always be available.

Returns:: True to indicate that the dictionary-base cache storage will always be available
Return type:: (bool)

retrieve(key: str) → Any | None[source]

Attempts to retrieve a response containing the specified cache key within the current namespace.

Parameters:: key (str) – The key used to fetch the stored data from cache.
Returns:: The value returned is deserialized JSON object if successful. Returns None if the key does not exist.
Return type:: Any

retrieve_all() → Dict[str, Any] | None[source]

Retrieves all cache key-response mappings found within the current namespace.

Returns:: A dictionary containing each key-value mapping for all cached data within the same namespace

retrieve_keys() → List[str] | None[source]

Retrieves the full list of all cache keys found within the current namespace.

Returns:: The full list of all keys that are currently mapped within the storage
Return type:: List[str]

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]: Helper method for creating an in-memory cache without overloading the representation with the specifics of what is being cached.

update(key: str, data: Any) → None[source]

Attempts to update the data associated with a specific cache key in the namespace.

Parameters:

key (str) – The key of the key-value pair
data (Any) – The data to be associated with the key

verify_cache(key: str) → bool[source]

Verifies whether a cache key exists the current namespace in the in-memory cache.

Parameters:: key (str) – The key to lookup in the cache
Returns:: True if the key is found otherwise False.
Return type:: bool

Bases: ABCStorage

Implements the storage methods necessary to interact with MongoDB with a unified backend interface.

The MongoDBStorage uses the same underlying interface as other scholar_flux storage classes for use with the DataCacheManager. This implementation is designed to use a key-value store as a cache by which data can be stored and retrieved in a relatively straightforward manner similar to the In-Memory Storage.

Examples

>>> from scholar_flux.data_storage import MongoDBStorage
# Defaults to connecting to locally (mongodb://127.0.0.1) on the default port for MongoDB (27017)
# Verifies that a mongodb service is actually available locally on the default port
>>> assert MongoDBStorage.is_available()
>>> mongo_storage = MongoDBStorage(namespace='testing_functionality')
>>> print(mongo_storage)
# OUTPUT: MongoDBStorage(...)
# Adding records to the storage
>>> mongo_storage.update('record_page_1', {'id':52, 'article': 'A name to remember'})
>>> mongo_storage.update('record_page_2', {'id':55, 'article': 'A name can have many meanings'})
# Revising and overwriting a record
>>> mongo_storage.update('record_page_2', {'id':53, 'article': 'A name has many meanings'})
>>> mongo_storage.retrieve_keys() # retrieves all current keys stored in the cache under the namespace
# OUTPUT: ['testing_functionality:record_page_1', 'testing_functionality:record_page_2']
>>> mongo_storage.retrieve_all()
# OUTPUT: {'testing_functionality:record_page_1': {'id': 52,
#           'article': 'A name to remember'},
#          'testing_functionality:record_page_2': {'id': 53,
#           'article': 'A name has many meanings'}}
>>> mongo_storage.retrieve('record_page_1') # retrieves the record for page 1
# OUTPUT: {'id': 52, 'article': 'A name to remember'}
>>> mongo_storage.delete_all() # deletes all records from the namespace
>>> mongo_storage.retrieve_keys() # Will now be empty
>>> mongo_storage.retrieve_all() # Will also be empty

DEFAULT_CONFIG: Dict[str, Any] = {'collection': 'result_page', 'db': 'storage_manager_db', 'host': 'mongodb://127.0.0.1', 'port': 27017}

DEFAULT_NAMESPACE: str | None = None

DEFAULT_RAISE_ON_ERROR: bool = False

Initialize the Mongo DB storage backend and connect to the Mongo DB server.

If no parameters are specified, the MongoDB storage will default to the parameters derived from the scholar_flux.utils.config_settings.config dictionary, which, in turn, resolves the host and port from environment variables or the default MongoDB host/port in the following order of priority:

SCHOLAR_FLUX_MONGODB_HOST > MONGODB_HOST > ‘mongodb://127.0.0.1’ (localhost)

SCHOLAR_FLUX_MONGODB_PORT > MONGODB_PORT > 27017

Parameters:

host (Optional[str]) –
The host address where the Mongo Database can be found. The default is ‘mongodb://127.0.0.1’, which is the mongo server on the localhost.

Each of the following are valid values for host:
- Simple hostname: ‘localhost’ (uses port parameter)
- Full URI: ‘mongodb://localhost:27017’ (ignores port parameter)
- Complex URI: ‘mongodb://user:pass@host:27017/db?options’
namespace (Optional[str]) – The prefix associated with each cache key. By default, this is None.
ttl (Optional[float | int]) – The total number of seconds that must elapse for a cache record
raise_on_error (Optional[bool]) – Determines whether an error should be raised when encountering unexpected issues when interacting with MongoDB. If None, the raise_on_error attribute defaults to MongoDBStorage.DEFAULT_RAISE_ON_ERROR.
**mongo_config (Dict[Any, Any]) – Configuration parameters required to connect to the Mongo DB server. Typically includes parameters such as host, port, db, etc.

Raises:

MongoDBImportError – If db module is not available or fails to load.

clone() → MongoDBStorage[source]

Helper method for creating a new MongoDBStorage with the same parameters.

Note that the implementation of the MongoClient is not able to be deep copied. This method is provided for convenience for re-instantiation with the same configuration.

delete(key: str)[source]

Delete the value associated with the provided key from cache.

Parameters:: key (str) – The key used associated with the stored data from the cache.
Raises:: PyMongoError – If there is an error deleting the record

delete_all()[source]

Delete all records from cache that match the current namespace prefix.

Raises:: PyMongoError – If there an error occurred when deleting records from the collection

classmethod is_available(host: str | None = None, port: int | None = None, verbose: bool = True) → bool[source]

Helper method that indicates whether the MongoDB service is available or not.

It attempts to establish a connection on the provided host and port and returns a boolean indicating if the connection was successful.

Note that if the input to the host is a URI (e.g. mongodb://localhost:27017), any input provided to the port variable will be ignored when MongoClient initializes the connection and use the URI exclusively.

Parameters:

host (Optional[str]) – The IP of the host of the MongoDB service. If None or an empty string, Defaults to localhost (the local computer) or the “host” entry from the class variable, DEFAULT_CONFIG.
port (Optional[int]) – The port where the service is hosted. If None or 0, defaults to port, 27017 or the “port” entry from the DEFAULT_CONFIG class variable.
verbose (bool) – Indicates whether to log status messages. Defaults to True

Returns:

Indicating whether or not the service was be successfully accessed. The value returned is True if successful and False otherwise.

Return type:

bool

Raises:

ServerSelectionTimeoutError – If a timeout error occurs when attempting to ping Mongo DB
ConnectionFailure – If a connection cannot be established

retrieve(key: str) → Any | None[source]

Retrieve the value associated with the provided key from cache.

Parameters:: key (str) – The key used to fetch the stored data from cache.
Returns:: The value returned is deserialized JSON object if successful. Returns None if the key does not exist.
Return type:: Any
Raises:: PyMongoError – If there is an error retrieving the record

retrieve_all() → Dict[str, Any][source]

Retrieve all records from cache that match the current namespace prefix.

Returns:: Dictionary of key-value pairs. Keys are original keys, values are JSON deserialized objects.
Return type:: dict
Raises:: PyMongoError – If there is an error during the retrieval of records under the namespace.

retrieve_keys() → List[str][source]

Retrieve all keys for records from cache.

Returns:: A list of all keys saved via MongoDB.
Return type:: list[str]
Raises:: PyMongoError – If there is an error retrieving the record key.

update(key: str, data: Any)[source]

Update the cache by storing associated value with provided key.

Parameters:

key (str) – The key used to store the data in cache.
data (Any) – A Python object that will be serialized into JSON format and stored. This includes standard data types such as strings, numbers, lists, dictionaries, etc.

Raises:

PyMongoError – If an error occur when attempting to insert or update a record

verify_cache(key: str) → bool[source]

Check if specific cache key exists.

Parameters:

key (str) – The key to check its presence in the Mongo DB storage backend.

Returns:

True if the key is found otherwise False.

Return type:

bool

Raises:

ValueError – If provided key is empty or None.
CacheVerificationException – If an error occurs on data retrieval

class scholar_flux.MultiSearchCoordinator(*args, **kwargs)[source]

Bases: UserDict

The MultiSearchCoordinator is a utility class for orchestrating searches across multiple providers, pages, and queries sequentially or using multithreading. This coordinator builds on the SearchCoordinator’s core structure to ensure consistent, rate-limited API requests.

The multi-search coordinator uses shared rate limiters to ensure that requests to the same provider (even across different queries) will use the same rate limiter.

This implementation uses the ThreadedRateLimiter.min_interval parameter from the shared rate limiter of each provider to determine the request_delay across all queries. These settings can be found and modified in the scholar_flux.api.providers.threaded_rate_limiter_registry by provider_name.

For new, unregistered providers, users can override the MultiSearchCoordinator.DEFAULT_THREADED_REQUEST_DELAY class variable to adjust the shared request_delay.

# Examples:

>>> from scholar_flux import MultiSearchCoordinator, SearchCoordinator, RecursiveDataProcessor
>>> from scholar_flux.api.rate_limiting import threaded_rate_limiter_registry
>>> multi_search_coordinator = MultiSearchCoordinator()
>>> threaded_rate_limiter_registry['arxiv'].min_interval = 6 # arbitrary rate limit (seconds per request)
>>>
>>> # Create coordinators for different queries and providers
>>> coordinators = [
...     SearchCoordinator(
...         provider_name=provider,
...         query=query,
...         processor=RecursiveDataProcessor(),
...         user_agent="SammieH",
...         cache_requests=True
...     )
...     for query in ('ml', 'nlp')
...     for provider in ('plos', 'arxiv', 'openalex', 'crossref')
... ]
>>>
>>> # Add coordinators to the multi-search coordinator
>>> multi_search_coordinator.add_coordinators(coordinators)
>>>
>>> # Execute searches across multiple pages
>>> all_pages = multi_search_coordinator.search_pages(pages=[1, 2, 3])
>>>
>>> # filters and retains successful requests from the multi-provider search
>>> filtered_pages = all_pages.filter()
>>> # The results will contain successfully processed responses across all queries, pages, and providers
>>> print(filtered_pages)  # Output will be a list of SearchResult objects
>>> # Extracts successfully processed records into a list of records where each record is a dictionary
>>> record_dict = filtered_pages.join() # retrieves a list of records
>>> print(record_dict)  # Output will be a flattened list of all records

DEFAULT_THREADED_REQUEST_DELAY: float | int = 6.0

__init__(*args, **kwargs)[source]

Initializes the MultiSearchCoordinator, allowing positional and keyword arguments to be specified when creating the MultiSearchCoordinator.

The initialization of the MultiSearchCoordinator operates similarly to that of a regular dict with the caveat that values are statically typed as SearchCoordinator instances.

add(search_coordinator: SearchCoordinator)[source]

Adds a new SearchCoordinator to the MultiSearchCoordinator instance.

Parameters:: search_coordinator (SearchCoordinator) – A search coordinator to add to the MultiSearchCoordinator dict

Raises: InvalidCoordinatorParameterException: If the expected type is not a SearchCoordinator

add_coordinators(search_coordinators: Iterable[SearchCoordinator])[source]: Helper method for adding a sequence of coordinators at a time.

property coordinators: list[SearchCoordinator]: Utility property for quickly retrieving a list of all currently registered coordinators.

current_providers() → set[str][source]: Extracts a set of names corresponding to the each API provider assigned to the MultiSearchCoordinator.

group_by_provider() → dict[str, dict[str, SearchCoordinator]][source]

Groups all coordinators by provider name to facilitate retrieval with normalized components where needed. Especially helpful in the latter retrieval of articles when using multithreading by provider (as opposed to by page) to account for strict rate limits. All coordinated searches corresponding to a provider would appear under a nested dictionary to facilitate orchestration on the same thread with the same rate limiter.

Returns:: All elements in the final dictionary map provider-specific coordinators to the normalized provider name for the nested dictionary of coordinators.
Return type:: dict[str, dict[str, SearchCoordinator]]

iter_pages(pages: Sequence[int] | PageListInput, iterate_by_group: bool = False, **kwargs) → Generator[SearchResult, None, None][source]

Helper method that creates and joins a sequence of generator functions for retrieving and processing records from each combination of queries, pages, and providers in sequence. This implementation uses the SearchCoordinator.iter_pages to dynamically identify when page retrieval should halt for each API provider, accounting for errors, timeouts, and less than the expected amount of records before filtering records with pre- specified criteria.

Parameters:

pages (Sequence[int]) – A sequence of page numbers to iteratively request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available Workflows are utilized by default.

Yields:

SearchResult –

Iteratively returns the SearchResult for each provider, query, and page using a generator: expression. Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)

iter_pages_threaded(pages: Sequence[int] | PageListInput, max_workers: int | None = None, **kwargs) → Generator[SearchResult, None, None][source]

Threading by provider to respect rate limits Helper method that implements threading to simultaneously retrieve a sequence of generator functions for retrieving and processing records from each combination of queries, pages, and providers in a multi-threaded set of sequences grouped by provider.

This implementation also uses the SearchCoordinator.iter_pages to dynamically identify when page retrieval should halt for each API provider, accounting for errors, timeouts, and less than the expected amount of records before filtering records with pre-specified criteria.

Note, that as threading is performed by provider, this method will not differ significantly in speed from the MultiSearchCoordinator.iter_pages method if only a single provider has been specified.

Parameters:

pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available Workflows are utilized by default.

Yields:

SearchResult –

Iteratively returns the SearchResult for each provider, query, and page using a generator: expression as each SearchResult becomes available after multi-threaded processing. Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)

search(page: int = 1, iterate_by_group: bool = False, max_workers: int | None = None, multithreading: bool = True, **kwargs) → SearchResultList[source]

Public method used to search for a single or multiple pages from multiple providers at once using a sequential or multithreading approach. This approach delegates the search to search_pages to retrieve a single page for query and provider using an iterative approach to search for articles grouped by provider.

Note that the MultiSearchCoordinator.search_pages method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.

Parameters:

pages (Sequence[int]) – A sequence of page numbers to iteratively request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available Workflows are utilized by default.

Returns:

The list containing all retrieved and processed pages from the API. If any non-stopping: errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.

Return type:

SearchResultList

search_pages(pages: Sequence[int] | PageListInput, iterate_by_group: bool = False, max_workers: int | None = None, multithreading: bool = True, **kwargs) → SearchResultList[source]

Public method used to search articles from multiple providers at once using a sequential or multithreading approach. This approach uses iter_pages under the.

Note that the MultiSearchCoordinator.search_pages method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.

Parameters:

pages (Sequence[int]) – A sequence of page numbers to iteratively request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available Workflows are utilized by default.

Returns:

The list containing all retrieved and processed pages from the API. If any non-stopping: errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.

Return type:

SearchResultList

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]: Helper method that shows the current structure of the MultiSearchCoordinator.

class scholar_flux.NullStorage(namespace: str | None = None, ttl: None = None, raise_on_error: bool | None = None, **kwargs)[source]

Bases: ABCStorage

NullStorage is a no-op implementation of ABCStorage. This class is useful for when you want to disable storage without changing code logic.

The scholar_flux package mainly implements this storage when the user turns off processing cache.

Example

>>> from scholar_flux.data_storage import DataCacheManager, NullStorage
>>> from scholar_flux.api import SearchCoordinator
>>> null_storage = DataCacheManager.null()
## This implements a data cache with the null storage under the hood:
>>> assert isinstance(null_storage.cache_storage, NullStorage)
>>> search_coordinator = SearchCoordinator(query='History of Data Caching', cache_manager=null_storage)
# Otherwise the same can be performed with the following:
>>> search_coordinator = SearchCoordinator(query='History of Data Caching', cache_results = False)
# The processing of responses will then be recomputed on the next search:
>>> response = search_coordinator.search(page = 1)

DEFAULT_NAMESPACE: str | None = None

DEFAULT_RAISE_ON_ERROR: bool = False

__init__(namespace: str | None = None, ttl: None = None, raise_on_error: bool | None = None, **kwargs) → None[source]

Initialize a No-Op cache for compatibility with the ABCStorage base class.

Note that namespace, ttl, raise_on_error, and **kwargs are provided for interface compatibility, and specifying any of these as arguments will not affect initialization.

clone() → NullStorage[source]: Helper method for creating a new implementation of the current NullStorage.

delete(*args, **kwargs) → None[source]: Method added for abstract class consistency - no-op

delete_all(*args, **kwargs) → None[source]: Method added for abstract class consistency - no-op

classmethod is_available(*args, **kwargs) → bool[source]: Method added for abstract class consistency - returns, True indicating that the no-op storage is always available although no cache is ever stored.

retrieve(*args, **kwargs) → Any | None[source]: Method added for abstract class consistency - no-op

retrieve_all(*args, **kwargs) → Dict[str, Any] | None[source]: Method added for abstract class consistency - returns a dictionary for type consistency

retrieve_keys(*args, **kwargs) → List[str] | None[source]: Method added for abstract class consistency - returns a list for type consistency

update(*args, **kwargs) → None[source]: Method added for abstract class consistency - no-op

verify_cache(*args, **kwargs) → bool[source]: Method added for abstract class consistency - returns None, indicating that no cache is ever stored

class scholar_flux.PassThroughDataProcessor(ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True)[source]

Bases: ABCDataProcessor

A basic data processor that retains all valid records without modification unless a specific filter for JSON keys are specified.

Unlike the DataProcessor, this specific implementation will not flatten records. Instead all filtered and selected records will retain their original nested structure.

__init__(ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True) → None[source]

Initialize the PassThroughDataProcessor with explicit extraction paths and options.

Parameters:

ignore_keys – List of keys to ignore during processing.
keep – List of keys that records should contain during processing.
value_delimiter – Delimiter for joining multiple values.
regex – Whether to use regex for ignore filtering.

process_page(parsed_records: list[dict[str | int, Any]], ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = None) → list[dict][source]: Processes and returns each record as is if filtering the final list of records by key is not enabled.

process_record(record_dict: dict[str | int, Any]) → dict[str | int, Any][source]

A no-op method retained for to maintain a similar interface as other DataProcessor implementations.

Args: - record_dict: The dictionary containing the record data.

Returns: - dict: The original processed dictionary

record_filter(record_dict: dict[str | int, Any], record_keys: list[str] | None = None, regex: bool | None = None) → bool | None[source]: Helper method that filters records using regex pattern matching, checking if any of the keys provided in the function call exist.

Bases: ABCDataProcessor

The PathDataProcessor uses a custom implementation of Trie-based processing to abstract nested key-value combinations into path-node pairs where the path defines the full range of nested keys that need to be traversed to arrive at each terminal field within each individual record.

This implementation automatically and dynamically flattens and filters a single page of records (a list of dictionary-based records) extracted from a response at a time to return the processed record data.

Example

>>> from scholar_flux.data import PathDataProcessor
>>> path_data_processor = PathDataProcessor() # instantiating the class
>>> data = [{'id':1, 'a':{'b':'c'}}, {'id':2, 'b':{'f':'e'}}, {'id':2, 'c':{'h':'g'}}]
### The process_page method can then be referenced using the processor as a callable:
>>> result = path_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'a.b': 'c'}, {'id': '2', 'b.f': 'e'}, {'id': '2', 'c.h': 'g'}]

__init__(json_data: dict | list[dict] | None = None, value_delimiter: str | None = '; ', ignore_keys: list | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_cache: bool | None = True) → None[source]: Initializes the data processor with JSON data and optional parameters for processing.

property cached: bool: Property indicating whether the underlying path node index uses a cache of weakreferences to nodes.

discover_keys() → dict[str, Any] | None[source]: Discovers all keys within the JSON data.

load_data(json_data: dict | list[dict] | None = None) → bool[source]

Attempts to load a data dictionary or list, contingent of it having at least one non-missing record to load from. If json_data is missing or the json input is equal to the current json_data attribute, then the json_data attribute will not be updated from the json input.

Parameters:: json_data (Optional[dict | list[dict]])
Returns:: Indicates whether the data was successfully loaded (True) or not (False)
Return type:: bool

process_page(parsed_records: list[dict] | None = None, keep_keys: list[str] | None = None, ignore_keys: list[str] | None = None, combine_keys: bool = True, regex: bool | None = None) → list[dict][source]: Processes each individual record dict from the JSON data.

process_record(record_index: int, keep_keys: list | None = None, ignore_keys: list | None = None, regex=None) → None[source]

Processes a record dictionary to extract record data and article content, creating a processed record dictionary with an abstract field.

Determines whether or not to retain a specific record at the index.

record_filter(record_dict: dict[ProcessingPath, Any], record_keys: list[str] | None = None, regex: bool | None = None) → bool[source]: Indicates whether a record contains a path (key) indicating whether the record as a whole should be retained or dropped.

structure(flatten: bool = False, show_value_attributes: bool = False) → str[source]

Method for showing the structure of the current PathDataProcessor and identifying the current configuration.

Useful for showing the options being used to process the api response records

class scholar_flux.ProviderConfig(*, provider_name: Annotated[str, MinLen(min_length=1)], base_url: str, parameter_map: BaseAPIParameterMap, records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: Annotated[float, Ge(ge=0)] = 6.1, api_key_env_var: str | None = None, docs_url: str | None = None)[source]

Bases: BaseModel

Config for creating the basic instructions and settings necessary to interact with new providers. This config on initialization is created for default providers on package initialization in the scholar_flux.api.providers submodule. A new, custom provider or override can be added to the provider_registry (A custom user dictionary) from the scholar_flux.api.providers module.

Parameters:

provider_name (str) – The name of the provider to be associated with the config.
base_url (str) – The URL of the provider to send requests with the specified parameters.
parameter_map (BaseAPIParameterMap) – The parameter map indicating the specific semantics of the API.
records_per_page (int) – Generally the upper limit (for some APIs) or reasonable limit for the number of retrieved records per request (specific to the API provider).
request_delay (float) – Indicates exactly how many seconds to wait before sending successive requests Note that the requested interval may vary based on the API provider.
api_key_env_var (Optional[str]) – Indicates the environment variable to look for if the API requires or accepts API keys.
docs_url – (Optional[str]): An optional URL that indicates where documentation related to the use of the API can be found.

Example Usage:

>>> from scholar_flux.api import ProviderConfig, APIParameterMap, SearchAPI
>>> # Maps each of the individual parameters required to interact with the Guardian API
>>> parameters = APIParameterMap(query='q',
>>>                              start='page',
>>>                              records_per_page='page-size',
>>>                              api_key_parameter='api-key',
>>>                              auto_calculate_page=False,
>>>                              api_key_required=True)
>>> # creating the config object that holds the basic configuration necessary to interact with the API
>>> guardian_config = ProviderConfig(provider_name = 'GUARDIAN',
>>>                                  parameter_map = parameters,
>>>                                  base_url = 'https://content.guardianapis.com//search',
>>>                                  records_per_page=10,
>>>                                  api_key_env_var='GUARDIAN_API_KEY',
>>>                                  request_delay=6)
>>> api = SearchAPI.from_provider_config(query = 'economic welfare',
>>>                                      provider_config = guardian_config,
>>>                                      use_cache = True)
>>> assert api.provider_name == 'guardian'
>>> response = api.search(page = 1) # assumes that you have the GUARDIAN_API_KEY stored as an env variable
>>> assert response.ok

api_key_env_var: str | None

base_url: str

docs_url: str | None

model_config: ClassVar[ConfigDict] = {'str_strip_whitespace': True}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

classmethod normalize_provider_name(v: str) → str[source]: Helper method for normalizing the names of providers to a consistent structure.

parameter_map: BaseAPIParameterMap

provider_name: str

records_per_page: int

request_delay: float

search_config_defaults() → dict[str, Any][source]

Convenience Method for retrieving ProviderConfig fields as a dict. Useful for providing the missing information needed to create a SearchAPIConfig object for a provider when only the provider_name has been provided.

Returns:

A dictionary containing the URL, name, records_per_page, and request_delay: for the current provider.

Return type:

(dict)

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]: Helper method that shows the current structure of the ProviderConfig.

classmethod validate_base_url(v: str) → str[source]: Validates the current url and raises an APIParameterException if invalid.

classmethod validate_docs_url(v: str | None) → str | None[source]: Validates the documentation url and raises an APIParameterException if invalid.

Bases: ABCDataProcessor

Processes a list of raw page record dict data from the API response based on discovered record keys and flattens them into a list of dictionaries consisting of key value pairs that simplify the interpretation of the final flattened json structure.

Example

>>> from scholar_flux.data import RecursiveDataProcessor
>>> data = [{'id':1, 'a':{'b':'c'}}, {'id':2, 'b':{'f':'e'}}, {'id':2, 'c':{'h':'g'}}]
# creating a basic processor
>>> recursive_data_processor = RecursiveDataProcessor() # instantiating the class
### The process_page method can then be referenced using the processor as a callable:
>>> result = recursive_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'b': 'c'}, {'id': '2', 'f': 'e'}, {'id': '2', 'h': 'g'}]
    # To identify the full nested location of record:
>>> recursive_data_processor = RecursiveDataProcessor(use_full_path=True) # instantiating the class
>>> result = recursive_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'a.b': 'c'}, {'id': '2', 'b.f': 'e'}, {'id': '2', 'c.h': 'g'}]

Initializes the data processor with JSON data and optional parameters for processing.

Parameters:

json_data (list[dict]) – The json data set to process and flatten - a list of dictionaries is expected
value_delimiter (Optional[str]) – Indicates whether or not to join values found at terminal paths
ignore_keys (Optional[list[str]]) – Determines records that should be omitted based on whether each record contains a key or substring. (off by default)
keep_keys (Optional[list[str]]) – Indicates whether or not to keep a record if the key is present. (off by default)
regex (Optional[bool]) – Determines whether to use regex filtering for filtering records based on the presence or absence of specific keywords
use_full_path (Optional[bool]) – Determines whether or not to keep the full path for the json record key. If False, the path is shortened, keeping the last key or set of keys while preventing name collisions.

discover_keys() → dict[str, list[str]] | None[source]: Discovers all keys within the JSON data.

filter_keys(prefix: str | None = None, min_length: int | None = None, substring: str | None = None, pattern: str | None = None, include: bool = True, **kwargs) → dict[str, list[str]][source]: Filters discovered keys based on specified criteria.

load_data(json_data: list[dict] | None = None)[source]

Attempts to load a data dictionary or list, contingent of it having at least one non-missing record to load from. If json_data is missing, or the json input is equal to the current json_data attribute, then the json_data attribute will not be updated from the json input.

Parameters:: json_data (Optional[list[dict]])
Returns:: Indicates whether the data was successfully loaded (True) or not (False)
Return type:: bool

process_page(parsed_records: list[dict] | None = None, keep_keys: list[str] | None = None, ignore_keys: list[str] | None = None, regex: bool | None = None) → list[dict][source]: Processes each individual record dict from the JSON data.

process_record(record_dict: dict[str, Any], **kwargs) → dict[str, Any][source]: Processes a record dictionary to extract record data and article content, creating a processed record dictionary with an abstract field.

record_filter(record_dict: dict[str, Any], record_keys: list[str] | None = None, regex: bool | None = None) → bool[source]: Filters records, using regex pattern matching, checking if any of the keys provided in the function call exist.

class scholar_flux.RedisStorage(host: str | None = None, namespace: str | None = None, ttl: int | None = None, raise_on_error: bool | None = None, **redis_config)[source]

Bases: ABCStorage

Implements the storage methods necessary to interact with Redis using a unified backend interface.

The RedisStorage implements the abstract methods from the ABCStorage class for use with the DataCacheManager. This implementation is designed to use a key-value store as a cache by which data can be stored and retrieved in a relatively straightforward manner similar to the In-Memory Storage.

Examples

>>> from scholar_flux.data_storage import RedisStorage
# Defaults to connecting to locally (localhost) on the default port for Redis services (6379)
# Verifies that a Redis service is locally available.
>>> assert RedisStorage.is_available()
>>> redis_storage = RedisStorage(namespace='testing_functionality')
>>> print(redis_storage)
# OUTPUT: RedisStorage(...)
# Adding records to the storage
>>> redis_storage.update('record_page_1', {'id':52, 'article': 'A name to remember'})
>>> redis_storage.update('record_page_2', {'id':55, 'article': 'A name can have many meanings'})
# Revising and overwriting a record
>>> redis_storage.update('record_page_2', {'id':53, 'article': 'A name has many meanings'})
>>> redis_storage.retrieve_keys() # retrieves all current keys stored in the cache under the namespace
# OUTPUT: ['testing_functionality:record_page_1', 'testing_functionality:record_page_2']
>>> redis_storage.retrieve_all() # Will also be empty
# OUTPUT: {'testing_functionality:record_page_1': {'id': 52,
#           'article': 'A name to remember'},
#          'testing_functionality:record_page_2': {'id': 53,
#           'article': 'A name has many meanings'}}
>>> redis_storage.retrieve('record_page_1') # retrieves the record for page 1
# OUTPUT: {'id': 52, 'article': 'A name to remember'}
>>> redis_storage.delete_all() # deletes all records from the namespace
>>> redis_storage.retrieve_keys() # Will now be empty
>>> redis_storage.retrieve_all() # Will also be empty

DEFAULT_CONFIG: dict = {'host': 'localhost', 'port': 6379}

DEFAULT_NAMESPACE: str = 'SFAPI'

DEFAULT_RAISE_ON_ERROR: bool = False

__init__(host: str | None = None, namespace: str | None = None, ttl: int | None = None, raise_on_error: bool | None = None, **redis_config)[source]

Initialize the Redis storage backend and connect to the Redis server.

If no parameters are specified, the Redis storage will attempt to resolve the host and port using variables from the environment (loaded into scholar_flux.utils.config_settings at runtime).

The resolved host and port are resolved from environment variables/defaults in the following order of priority:

SCHOLAR_FLUX_REDIS_HOST > REDIS_HOST > ‘localhost’

SCHOLAR_FLUX_REDIS_PORT > REDIS_PORT > 6379

Parameters:

host (Optional[str]) – Redis server host. Can be provided positionally or as a keyword argument. Defaults to ‘localhost’ if not specified.
namespace (Optional[str]) – The prefix associated with each cache key. Defaults to DEFAULT_NAMESPACE if left None.
ttl (Optional[int]) – The total number of seconds that must elapse for a cache record to expire. If not provided, ttl defaults to None.
raise_on_error (Optional[bool]) – Determines whether an error should be raised when encountering unexpected issues when interacting with Redis. If None, the raise_on_error attribute defaults to RedisStorage.DEFAULT_RAISE_ON_ERROR.
**redis_config (Optional[Dict[Any, Any]]) – Configuration parameters required to connect to the Redis server. Typically includes parameters such as host, port, db, etc.

Raises:

RedisImportError – If redis module is not available or fails to load.

clone() → RedisStorage[source]

Helper method for creating a new RedisStorage with the same parameters.

Note that the implementation of the RedisStorage is not able to be deep copied, and this method is provided for convenience in re-instantiation with the same configuration.

delete(key: str) → None[source]

Delete the value associated with the provided key from cache.

Parameters:: key (str) – The key used associated with the stored data from cache.
Raises:: RedisError – If there is an error deleting the record

delete_all() → None[source]

Delete all records from cache that match the current namespace prefix.

Raises:: RedisError – If there an error occurred when deleting records from the collection

classmethod is_available(host: str | None = None, port: int | None = None, verbose: bool = True) → bool[source]

Helper class method for testing whether the Redis service is available and can be accessed.

If Redis can be successfully reached, this function returns True, otherwise False.

Parameters:

host (Optional[str]) – Indicates the location to attempt a connection. If None or an empty string, Defaults to localhost (the local computer) or the “host” entry from the class variable, DEFAULT_CONFIG.
port (Optional[int]) – Indicates the port where the service can be accessed If None or 0, Defaults to port 6379 or the “port” entry from the DEFAULT_CONFIG class variable.
verbose (bool) – Indicates whether to log at the levels, DEBUG and lower, or to log warnings only

Raises:

TimeoutError – If a timeout error occurs when attempting to ping Redis
ConnectionError – If a connection cannot be established

retrieve(key: str) → Any | None[source]

Retrieve the value associated with the provided key from cache.

Parameters:: key (str) – The key used to fetch the stored data from cache.
Returns:: The value returned is deserialized JSON object if successful. Returns None if the key does not exist.
Return type:: Any

retrieve_all() → Dict[str, Any][source]

Retrieve all records from cache that match the current namespace prefix.

Returns:: Dictionary of key-value pairs. Keys are original keys, values are JSON deserialized objects.
Return type:: dict
Raises:: RedisError – If there is an error during the retrieval of records under the namespace

retrieve_keys() → List[str][source]

Retrieve all keys for records from cache that match the current namespace prefix.

Returns:: A list of all keys saved under the current namespace.
Return type:: list
Raises:: RedisError – If there is an error retrieving the record key

update(key: str, data: Any) → None[source]

Update the cache by storing associated value with provided key.

Parameters:

key (str) – The key used to store the serialized JSON string in cache.
data (Any) – A Python object that will be serialized into JSON format and stored. This includes standard data types like strings, numbers, lists, dictionaries, etc.

Raises:

Redis – If an error occur when attempting to insert or update a record

verify_cache(key: str) → bool[source]

Check if specific cache key exists.

Parameters:

key (str) – The key to check its presence in the Redis storage backend.

Returns:

True if the key is found otherwise False.

Return type:

bool

Raises:

ValueError – If provided key is empty or None.
RedisError – If an error occurs when looking up a key

class scholar_flux.ResponseCoordinator(parser: BaseDataParser, extractor: BaseDataExtractor, processor: ABCDataProcessor, cache_manager: DataCacheManager)[source]

Bases: object

Coordinates the parsing, extraction, processing, and caching of API responses. The ResponseCoordinator operates on the concept of dependency injection to orchestrate the entire process. Because the structure of the coordinator (parser, extractor, processor)

Note that the overall composition of the coordinator is a governing factor in how the response is processed. The ResponseCoordinator uses a cache key and schema fingerprint to ensure that it is only returning a processed response from the cache storage if the structure of the coordinator at the time of cache storage has not changed.

To ensure that we’re not pulling from cache on significant changes to the ResponseCoordinator, we validate the schema by default using DEFAULT_VALIDATE_FINGERPRINT. When the schema changes, previously cached data is ignored, although this can be explicitly overridden during response handling.

The coordinator orchestration process operates mainly through the ResponseCoordinator.handle_response method that sequentially calls the parser, extractor, processor, and cache_manager.

Example workflow:

>>> from scholar_flux.api import SearchAPI, ResponseCoordinator
>>> api = SearchAPI(query = 'technological innovation', provider_name = 'crossref', user_agent = 'scholar_flux')
>>> response_coordinator = ResponseCoordinator.build() # uses defaults with caching in-memory
>>> response = api.search(page = 1)
# future calls with the same structure will be cached
>>> processed_response = response_coordinator.handle_response(response, cache_key='tech-innovation-cache-key-page-1')
# the ProcessedResponse (or ErrorResponse) stores critical fields from the original and processed response
>>> processed_response
# OUTPUT: ProcessedResponse(len=20, cache_key='tech-innovation-cache-key-page-1', metadata=...)
>>> new_processed_response = response_coordinator.handle_response(processed_response, cache_key='tech-innovation-cache-key-page-1')
>>> new_processed_response
# OUTPUT: ProcessedResponse(len=20, cache_key='tech-innovation-cache-key-page-1', metadata=...)

Note that the entire process can be orchestrated via the SearchCoordinator that uses the SearchAPI and ResponseCoordinator as core dependency injected components:

>>> from scholar_flux import SearchCoordinator
>>> search_coordinator = SearchCoordinator(api, response_coordinator, cache_requests=True)
# uses a default cache key constructed from the response internally
>>> processed_response = search_coordinator.search(page = 1)
# OUTPUT: ProcessedResponse(len=20, cache_key='crossref_technological innovation_1_20', metadata=...)
>>> processed_response.content == new_processed_response.content

Parameters:

parser (BaseDataParser) – Parses raw API responses.
extractor (BaseDataExtractor) – Extracts records and metadata.
processor (ABCDataProcessor) – Processes extracted data.
cache_manager (DataCacheManager) – Manages response cache.

DEFAULT_VALIDATE_FINGERPRINT: bool = True

__init__(parser: BaseDataParser, extractor: BaseDataExtractor, processor: ABCDataProcessor, cache_manager: DataCacheManager)[source]: Initializes the response coordinator using the core components used to parse, process, and cache response data.

classmethod build(parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None) → ResponseCoordinator[source]

Factory method to build a ResponseCoordinator with sensible defaults.

Parameters:

parser – Optional([BaseDataParser]): First step of the response processing pipeline - parses response records into a dictionary
extractor – (Optional[BaseDataExtractor]): Extracts both records and metadata from responses separately
processor – (Optional[ABCDataProcessor]): Processes API responses into list of dictionaries
cache_manager – (Optional[DataCacheManager]): Manages the caching of processed records for faster retrieval
cache_requests – (Optional[bool]): Determines whether or not to cache requests - api is the ground truth if not directly specified
cache_results – (Optional[bool]): Determines whether or not to cache processed responses - on by default unless specified or if a cache manager is already provided

Returns:

A fully constructed coordinator.

Return type:

ResponseCoordinator

property cache: DataCacheManager

Alias for the response data processing cache manager:

Also allows direct access to the DataCacheManager from the ResponseCoordinator

property cache_manager: DataCacheManager: Allows direct access to the DataCacheManager from the ResponseCoordinator.

classmethod configure_cache(cache_manager: DataCacheManager | None = None, cache_results: bool | None = None) → DataCacheManager[source]

Helper method for building and swapping out cache managers depending on the cache chosen.

Parameters:

cache_manager (Optional[DataCacheManager]) – An optional cache manager to use
cache_results (Optional[bool]) – Ground truth parameter, used to resolve whether to use caching when the cache_manager and cache_results contradict

Returns:

An existing or newly created cache manager that can be used with the ResponseCoordinator

Return type:

DataCacheManager

property extractor: BaseDataExtractor: Allows direct access to the DataExtractor from the ResponseCoordinator.

handle_response(response: Response | ResponseProtocol, cache_key: str | None = None, from_cache: bool = True, validate_fingerprint: bool | None = None) → ErrorResponse | ProcessedResponse[source]

Retrieves the data from the processed response from cache as a if previously cached. Otherwise the data is retrieved after processing the response. The response data is subsequently transformed into a dataclass containing the response content, processing info, and metadata.

Parameters:

response (Response) – Raw API response.
cache_key (Optional[str]) – Cache key for storing/retrieving.
from_cache – (bool): Should we try to retrieve the processed response from the cache?

Returns:

A Dataclass Object that contains response data: and detailed processing info.

Return type:

ProcessedResponse

handle_response_data(response: Response, cache_key: str | None = None) → List[Dict[Any, Any]] | List | None[source]

Retrieves the data from the processed response from cache if previously cached. Otherwise the data is retrieved after processing the response.

Parameters:

response (Response) – Raw API response.
cache_key (Optional[str]) – Cache key for storing/retrieving.

Returns:

Processed response data or None.

Return type:

Optional[List[Dict[Any, Any]]]

property parser: BaseDataParser: Allows direct access to the data parser from the ResponseCoordinator.

property processor: ABCDataProcessor: Allows direct access to the DataProcessor from the ResponseCoordinator.

schema_fingerprint() → str[source]: Helper method for generating a concise view of the current structure of the response coordinator.

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]

Helper method for retrieving a string representation of the overall structure of the current ResponseCoordinator. The helper function, generate_repr_from_string helps produce human-readable representations of the core structure of the ResponseCoordinator.

Parameters:

flatten (bool) – Whether to flatten the ResponseCoordinator’s structural representation into a single line.
show_value_attributes (bool) – Whether to show nested attributes of the components in the structure of the current ResponseCoordinator instance.

Returns:

The structure of the current ResponseCoordinator as a string.

Return type:

str

summary() → str[source]: Helper class for creating a quick summary representation of the structure of the Response Coordinator.

classmethod update(response_coordinator: ResponseCoordinator, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None) → ResponseCoordinator[source]

Factory method to create a new ResponseCoordinator from an existing configuration.

Parameters:

response_coordinator – Optional([ResponseCoordinator]): ResponseCoordinator containing the defaults to swap
parser – Optional([BaseDataParser]): First step of the response processing pipeline - parses response records into a dictionary
extractor – (Optional[BaseDataExtractor]): Extracts both records and metadata from responses separately
processor – (Optional[ABCDataProcessor]): Processes API responses into list of dictionaries
cache_manager – (Optional[DataCacheManager]): Manages the caching of processed records for faster retrieval
cache_requests – (Optional[bool]): Determines whether or not to cache requests - api is the ground truth if not directly specified
cache_results – (Optional[bool]): Determines whether or not to cache processed responses - on by default unless specified or if a cache manager is already provided

Returns:

A fully constructed coordinator.

Return type:

ResponseCoordinator

class scholar_flux.ResponseValidator[source]

Bases: object

Helper class that serves as an initial response validation step to ensure that, in custom retry handling, the basic structure of a response can be validated to determine whether or not to retry the response retrieval process.

The ResponseValidator implements class methods that are simple tools that return boolean values (True/False) when response or response-like objects do not contain the required structure and raise errors when encountering non-response objects or when raise_on_error = True otherwise.

Example

>>> from scholar_flux.api import ResponseValidator, ReconstructedResponse
>>> mock_success_response = ReconstructedResponse.build(status_code = 200,
>>>                                                     json = {'response': 'success'},
>>>                                                     url = "https://an-example-url.com",
>>>                                                     headers={'Content-Type': 'application/json'}
>>>                                                     )
>>> ResponseValidator.validate_response(mock_success_response) is True
>>> ResponseValidator.validate_content(mock_success_response) is True

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]

Helper method that shows the current structure of the ResponseValidator class in a string format. This method will show the name of the current class along with its attributes (ResponseValidator())

Returns:: A string representation of the current structure of the ResponseValidator
Return type:: str

classmethod validate_content(response: Response | ResponseProtocol, expected_format: str = 'application/json', *, raise_on_error: bool = False) → bool[source]

Validates the response content type.

Parameters:

response (requests.Response | ResponseProtocol) – The HTTP response or response-like object to check.
expected_format (str) – The expected content type substring (e.g., “application/json”).
raise_on_error (bool) – If True, raises InvalidResponseException on mismatch.

Returns:

True if the content type matches, False otherwise.

Return type:

bool

Raises:

InvalidResponseException – If the content type does not match and raise_on_error is True.

classmethod validate_response(response: Response | ResponseProtocol, *, raise_on_error: bool = False) → bool[source]

Validates HTTP responses by verifying first whether the object is a Response or follows a ResponseProtocol. For valid response or response- like objects, the status code is verified, returning True for 400 and 500 level validation errors and raising an error if raise_on_error is set to True.

Note that a ResponseProtocol duck-types and verifies that each of a minimal set of attributes and/or properties can be found within the current response.

In the scholar_flux retrieval step, this validator verifies that the response received is a valid response.

Parameters:

response – (requests.Response | ResponseProtocol): The HTTP response object to validate
raise_on_error (bool) – If True, raises InvalidResponseException on error for invalid response status codes

Returns:

True if valid, False otherwise

Raises:

InvalidResponseException – If response is invalid and raise_on_error is True
RequestFailedException – If an exception occurs during response validation due to missing or incorrect types

class scholar_flux.SQLAlchemyStorage(url: str | None = None, namespace: str | None = None, ttl: None = None, raise_on_error: bool | None = False, **sqlalchemy_config)[source]

Bases: ABCStorage

Implements the storage methods necessary to interact with SQLite3 in addition to other SQL flavors via sqlalchemy. This implementation is designed to use a relational database as a cache by which data can be stored and retrieved in a relatively straightforward manner that associates records in key-value pairs similar to the In-Memory Storage.

Note:

This table uses the structure previously defined in the CacheTable to store records in a structured manner:

ID:
Automatically generated - identifies the unique record in the table

Key:
Is used to associate a specific cached record with a short human-readable (or hashed) string

Cache:
The JSON data associated with the record. To store the data, any nested, non-serializable data is first encoded before being unstructured and stored. On retrieving the data, the JSON string is decoded and restructured in order to return the original object.

The SQLAlchemyStorage can be initialized as follows:

### Import the package and initialize the storage in a dedicated package directory : >>> from scholar_flux.data_storage import SQLAlchemyStorage # Defaults to connecting to creating a local, file-based sqlite cache within the default writable directory. # Verifies that the dependency for a basic sqlite service is actually available for use locally >>> assert SQLAlchemyStorage.is_available() >>> sql_storage = SQLAlchemyStorage(namespace=’testing_functionality’) >>> print(sql_storage) # OUTPUT: SQLAlchemyStorage(…) # Adding records to the storage >>> sql_storage.update(‘record_page_1’, {‘id’:52, ‘article’: ‘A name to remember’}) >>> sql_storage.update(‘record_page_2’, {‘id’:55, ‘article’: ‘A name can have many meanings’}) # Revising and overwriting a record >>> sql_storage.update(‘record_page_2’, {‘id’:53, ‘article’: ‘A name has many meanings’}) >>> sql_storage.retrieve_keys() # retrieves all current keys stored in the cache under the namespace >>> sql_storage.retrieve_all() # OUTPUT: {‘testing_functionality:record_page_1’: {‘id’: 52, # ‘article’: ‘A name to remember’}, # ‘testing_functionality:record_page_2’: {‘id’: 53, # ‘article’: ‘A name has many meanings’}} # OUTPUT: [‘testing_functionality:record_page_1’, ‘testing_functionality:record_page_2’] >>> sql_storage.retrieve(‘record_page_1’) # retrieves the record for page 1 # OUTPUT: {‘id’: 52, ‘article’: ‘A name to remember’} >>> sql_storage.delete_all() # deletes all records from the namespace >>> sql_storage.retrieve_keys() # Will now be empty

DEFAULT_CONFIG: Dict[str, Any] = {'echo': False, 'url': <function SQLAlchemyStorage.<lambda>>}

DEFAULT_NAMESPACE: str | None = None

DEFAULT_RAISE_ON_ERROR: bool = False

__init__(url: str | None = None, namespace: str | None = None, ttl: None = None, raise_on_error: bool | None = False, **sqlalchemy_config) → None[source]

Initialize the SQLAlchemy storage backend and connect to the server indicated via the url parameter.

This class uses the innate flexibility of SQLAlchemy to support backends such as SQLite, Postgres, DuckDB, etc.

Parameters:

url (Optional[str]) – Database connection string. This can be provided positionally or as a keyword argument.
namespace (Optional[str]) – The prefix associated with each cache key. By default, this is None.
ttl (None) – Ignored. Included for interface compatibility; not implemented.
raise_on_error (Optional[bool]) – Determines whether an error should be raised when encountering unexpected issues when interacting with SQLAlchemy. If None, the raise_on_error attribute defaults to SQLAlchemyStorage.DEFAULT_RAISE_ON_ERROR.
**sqlalchemy_config –
Additional SQLAlchemy engine/session options passed to sqlalchemy.create_engine Typical parameters include the following:
- url (str): Indicates what server to connect to. Defaults to sqlite in the package directory.
- echo (bool): Indicates whether to show the executed SQL queries in the console.

clone() → SQLAlchemyStorage[source]

Helper method for creating a new SQLAlchemyStorage with the same parameters.

Note that the implementation of the SQLAlchemyStorage is not able to be deep copied, and this method is provided for convenience in re-instantiation with the same configuration.

delete(key: str) → None[source]

Delete the value associated with the provided key from cache.

Parameters:: key (str) – The key used associated with the stored data from cache.

delete_all() → None[source]: Delete all records from cache that match the current namespace prefix.

classmethod is_available(url: str | None = None, verbose: bool = True) → bool[source]

Helper class method for testing whether the SQL service can be accessed. If so, this function returns True, otherwise False.

Parameters:

host (str) – Indicates the location to attempt a connection
port (int) – Indicates the port where the service can be accessed
verbose (bool) – Indicates whether to log at the levels, DEBUG and lower, or to log warnings only

retrieve(key: str) → Any | None[source]

Retrieve the value associated with the provided key from cache.

Parameters:: key (str) – The key used to fetch the stored data from cache.
Returns:: The value returned is deserialized JSON object if successful. Returns None if the key does not exist.
Return type:: Any

retrieve_all() → Dict[str, Any][source]

Retrieve all records from cache.

Returns:: Dictionary of key-value pairs. Keys are original keys, values are JSON deserialized objects.
Return type:: dict

retrieve_keys() → List[str][source]

Retrieve all keys for records from cache .

Returns:: A list of all keys saved via SQL.
Return type:: list

update(key: str, data: Any) → None[source]

Update the cache by storing associated value with provided key.

Parameters:

key (str) – The key used to store the serialized JSON string in cache.
data (Any) – A Python object that will be serialized into JSON format and stored. This includes standard data types like strings, numbers, lists, dictionaries, etc.

verify_cache(key: str) → bool[source]

Check if specific cache key exists.

Parameters:: key (str) – The key to check its presence in the SQL storage backend.
Returns:: True if the key is found otherwise False.
Return type:: bool
Raises:: ValueError – If provided key is empty or None.

Bases: BaseAPI

The core interface that handles the retrieval of JSON, XML, and YAML content from the scholarly API sources offered by several providers such as SpringerNature, PLOS, and PubMed. The SearchAPI is structured to allow flexibility without complexity in initialization. API clients can be either constructed piece-by-piece or with sensible defaults for session-based retrieval, API key management, caching, and configuration options.

This class is integrated into the SearchCoordinator as a core component of a pipeline that further parses the response, extracts records and metadata, and caches the processed records to facilitate downstream tasks such as research, summarization, and data mining.

Examples

>>> from scholar_flux.api import SearchAPI
# creating a basic API that uses the PLOS as the default while caching data in-memory:
>>> api = SearchAPI(query = 'machine learning', provider_name = 'plos', use_cache = True)
# retrieve a basic request:
>>> response_page_1 = api.search(page = 1)
>>> assert response_page_1.ok
>>> response_page_1
# OUTPUT: <Response [200]>
>>> ml_page_1 = response_page_1.json()
# future requests automatically wait until the specified request delay passes to send another request:
>>> response_page_2 = api.search(page = 2)
>>> assert response_page_1.ok
>>> response_page_2
# OUTPUT: <Response [200]
>>> ml_page_2 = response_page_2.json()

DEFAULT_CACHED_SESSION: bool = False

DEFAULT_URL: str = 'https://api.plos.org/search'

Initializes the SearchAPI with a query and optional parameters. The absolute bare minimum for interacting with APIs requires a query, base_url, and an APIParameterConfig that associates relevant fields (aka query, records_per_page, etc. with fields that are specific to each API provider.

Parameters:

query (str) – The search keyword or query string.
provider_name (Optional[str]) – The name of the API provider where requests will be sent. If a provider_name and base_url are both given, the SearchAPIConfig will prioritize base_urls over the provider_name.
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – A config that a parameter map attribute under the hood to build the parameters necessary to interact with an API. For convenience, an APIParameterMap can be provided in place of an APIParameterConfig, and the conversion will take place under the hood.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session. A new session is created if not specified.
user_agent (Optional[str]) – Optional user-agent string for the session.
timeout – (Optional[int | float]): Identifies the number of seconds to wait before raising a TimeoutError
masker (Optional[str]) – Used for filtering potentially sensitive information from logs (API keys, auth bearers, emails, etc)
use_cache (bool) – Indicates whether or not to create a cached session. If a cached session is already specified, this setting will have no effect on the creation of a session.
base_url (str) – The base URL for the article API.
api_key (Optional[str | SecretStr]) – API key if required.
records_per_page (int) – Number of records to fetch per page (1-100).
request_delay (Optional[float]) – Minimum delay between requests in seconds. If not specified, the SearchAPI, this setting will use the default request delay defined in the SearchAPIConfig (6.1 seconds) if an override for the current provider does not exist.
**api_specific_parameters –

Additional parameter-value pairs to be provided to SearchAPIConfig class. API specific parameters include:
mailto (Optional[str | SecretStr]): (CROSSREF: an optional contact for feedback on API usage) db: str (PubMed: a database to retrieve data from (example: db=pubmed)

property api_key: SecretStr | None

Retrieves the current value of the API key from the SearchAPIConfig as a SecretStr.

Note that the API key is stored as a secret key when available. The value of the API key can be retrieved by using the api_key.get_secret_value() method.

Returns:: A secret string of the API key if it exists
Return type:: Optional[SecretStr]

property api_specific_parameters: dict

This property pulls additional parameters corresponding to the API from the configuration of the current API instance.

Returns:: A list of all parameters specific to the current API.
Return type:: dict[str, APISpecificParameter]

property base_url: str

Corresponds to the base URL of the current API.

Returns:: The base URL corresponding to the API Provider

build_parameters(page: int, additional_parameters: dict[str, Any] | None = None, **api_specific_parameters) → Dict[str, Any][source]

Constructs the request parameters for the API call, using the provided APIParameterConfig and its associated APIParameterMap. This method maps standard fields (query, page, records_per_page, api_key, etc.) to the provider-specific parameter names.

Using additional_parameters, an arbitrary set of parameter key-value can be added to request further customize or override parameter settings to the API. additional_parameters is offered as a convenience method in case an API may use additional arguments or a query requires specific advanced functionality.

Other arguments and mappings can be supplied through **api_specific_parameters to the parameter config, provided that the options or pre-defined mappings exist in the config.

When **api_specific_parameters and additional_parameters conflict, additional_parameters is considered the ground truth. If any remaining parameters are None in the constructed list of parameters, these values will be dropped from the final dictionary.

Parameters:

page (int) – The page number to request.
Optional[dict] (additional_parameters) – A dictionary of additional overrides that may or may not have been included in the original parameter map of the current API. (Provided for further customization of requests).
**api_specific_parameters – Additional parameters to provide to the parameter config: Note that the config will only accept keyword arguments that have been explicitly defined in the parameter map. For all others, they must be added using the additional_parameters parameter.

Returns:

The constructed request parameters.

Return type:

Dict[str, Any]

property cache: BaseCache | None

Retrieves the requests-session cache object if the session object is a CachedSession object.

If a session cache does not exist, this function will return None.

Returns:: The cache object if available, otherwise None.
Return type:: Optional[BaseCache]

property config: SearchAPIConfig

Property method for accessing the config for the SearchAPI.

Returns:: The configuration corresponding to the API Provider

describe() → dict[source]

A helper method used that describe accepted configuration for the current provider or user-defined parameter mappings.

Returns:: a dictionary describing valid config fields and provider-specific api parameters for the current provider (if applicable).
Return type:: dict

classmethod from_defaults(query: str, provider_name: str | None, session: Session | None = None, user_agent: Annotated[str | None, 'An optional User-Agent to associate with each search'] = None, use_cache: bool | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters) → SearchAPI[source]

Factory method to create SearchAPI instances with sensible defaults for known providers.

PLOS is used by default unless the environment variable, SCHOLAR_FLUX_DEFAULT_PROVIDER is set to another provider.

Parameters:

query (str) – The search keyword or query string.
base_url (str) – The base URL for the article API.
records_per_page (int) – Number of records to fetch per page (1-100).
request_delay (Optional[float]) – Minimum delay between requests in seconds.
api_key (Optional[str | SecretStr]) – API key if required.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist.
masker (Optional[str]) – Used for filtering potentially sensitive information from logs
**api_specific_parameters – Additional api parameter-value pairs and overrides to be provided to SearchAPIConfig class

Returns:

A new SearchAPI instance initialized with the config chosen.

classmethod from_provider_config(query: str, provider_config: ProviderConfig, session: Session | None = None, user_agent: Annotated[str | None, 'An optional User-Agent to associate with each search'] = None, use_cache: bool | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters) → SearchAPI[source]

Factory method to create a new SearchAPI instance using a ProviderConfig.

This method uses the default settings associated with the provider config to temporarily make the configuration settings globally available when creating the SearchAPIConfig and APIParameterConfig instances from the provider registry.

Parameters:

query (str) – The search keyword or query string.
provider_config – ProviderConfig,
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist.
timeout – (Optional[int | float]): Identifies the number of seconds to wait before raising a TimeoutError.
masker (Optional[str]) – Used for filtering potentially sensitive information from logs
**api_specific_parameters – Additional api parameter-value pairs and overrides to be provided to SearchAPIConfig class

Returns:

A new SearchAPI instance initialized with the chosen configuration.

Advanced constructor: instantiate directly from a SearchAPIConfig instance.

Parameters:

query (str) – The search keyword or query string.
config (SearchAPIConfig) – Indicates the configuration settings to be used when sending requests to APIs
parameter_config – (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]): Maps global scholar_flux parameters to those that are specific to the current API
session – (Optional[requests.Session | CachedSession]): An optional session to use for the creation of request sessions
timeout – (Optional[int | float]): Identifies the number of seconds to wait before raising a TimeoutError
use_cache – Optional[bool]: Indicates whether or not to use cache. The settings from session are otherwise used this option is not specified.
masker – (Optional[SensitiveDataMasker]): A masker used to filter logs of API keys and other sensitive data
user_agent – Optional[str] = An user agent to associate with the session

Returns:

A newly constructed SearchAPI with the chosen/validated settings

Return type:

SearchAPI

static is_cached_session(session: CachedSession | Session) → bool[source]

Checks whether the current session is a cached session.

To do so, this method first determines whether the current object has a ‘cache’ attribute and whether the cache element, if existing, is a BaseCache.

Parameters:: session (requests.Session) – The session to check.
Returns:: True if the session is a cached session, False otherwise.
Return type:: bool

make_request(current_page: int, additional_parameters: dict[str, Any] | None = None, request_delay: float | None = None) → Response[source]

Constructs and sends a request to the chosen api:

The parameters are built based on the default/chosen config and parameter map :param page: The page number to request. :type page: int :param additional_parameters Optional[dict]: A dictionary of additional overrides not included in the original SearchAPIConfig :param request_delay: Overrides the configured request delay for the current request only. :type request_delay: Optional[float]

Returns:: The API’s response to the request.
Return type:: requests.Response

property parameter_config: APIParameterConfig

Property method for accessing the parameter mapping config for the SearchAPI.

Returns:: The configuration corresponding to the API Provider

prepare_request(base_url: str | None = None, endpoint: str | None = None, parameters: Dict[str, Any] | None = None, api_key: str | None = None) → PreparedRequest[source]

Prepares a GET request for the specified endpoint with optional parameters.

This method builds on the original base class method by additionally allowing users to specify a custom request directly while also accounting for the addition of an API key specific to the API.

Parameters:

base_url (str) – The base URL for the API.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
parameters (Optional[Dict[str, Any]]) – Optional query parameters for the request.

Returns:

The prepared request object.

Return type:

requests.PreparedRequest

prepare_search(page: int | None = None, parameters: Dict[str, Any] | None = None) → PreparedRequest[source]

Prepares the current request given the provided page and parameters.

The prepared request object can be sent using the SearchAPI.session.send method with requests.Session and `requests_cache.CachedSession`objects.

Parameters:

page (Optional[int]) – Page number to query. If provided, parameters are built from the config and this page.
parameters (Optional[Dict[str, Any]]) – If provided alone, used as the full parameter set to build the current request. If provided together with page, these act as additional or overriding parameters on top of the built config.

Returns:

A request object that can be sent via api.session.send.

Return type:

requests.PreparedRequest

property provider_name: str

Property method for accessing the provider name in the current SearchAPI instance.

Returns:: The name corresponding to the API Provider.

property query: str: Retrieves the current value of the query to be sent to the current API.

property records_per_page: int

Indicates the total number of records to show on each page.

Returns:: an integer indicating the max number of records per page
Return type:: int

property request_delay: float

Indicates how long we should wait in-between requests.

Helpful for ensuring compliance with the rate-limiting requirements of various APIs.

Returns:: The number of seconds to wait at minimum between each request
Return type:: float

search(page: int | None = None, parameters: Dict[str, Any] | None = None, request_delay: float | None = None) → Response[source]

Public method to perform a search for the selected page with the current API configuration.

A search can be performed by specifying either the page to query with the preselected defaults and additional parameter overrides for other parameters accepted by the API.

Users can also create a custom request using a parameter dictionary containing the full set of API parameters.

Parameters:

page (Optional[int]) – Page number to query. If provided, parameters are built from the config and this page.
parameters (Optional[Dict[str, Any]]) – If provided alone, used as the full parameter set for the request. If provided together with page, these act as additional or overriding parameters on top of the built config.
request_delay (Optional[float]) – Overrides the configured request delay for the current request only.

Returns:

A response object from the API containing articles and metadata

Return type:

requests.Response

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]

Helper method for quickly showing a representation of the overall structure of the SearchAPI. The helper function, generate_repr_from_string helps produce human-readable representations of the core structure of the SearchAPI.

Parameters:

flatten (bool) – Whether to flatten the SearchAPI’s structural representation into a single line.
show_value_attributes (bool) – Whether to show nested attributes of the components of the SearchAPI.

Returns:

The structure of the current SearchAPI as a string.

Return type:

str

summary() → str[source]: Create a summary representation of the current structure of the API.

Helper method for generating a new SearchAPI from an existing SearchAPI instance. All parameters that are not modified are pulled from the original SearchAPI. If no changes are made, an identical SearchAPI is generated from the existing defaults.

Parameters:

config (SearchAPIConfig) – Indicates the configuration settings to be used when sending requests to APIs
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – Maps global scholar_flux parameters to those that are API specific.
session – (Optional[requests.Session | CachedSession]): An optional session to use for the creation of request sessions
timeout – (Optional[int | float]): Identifies the number of seconds to wait before raising a TimeoutError
use_cache – Optional[bool]: Indicates whether or not to use cache. The settings from session are otherwise used this option is not specified.
masker – (Optional[SensitiveDataMasker]): A masker used to filter logs of API keys and other sensitive data
user_agent – Optional[str] = An user agent to associate with the session

Returns:

A newly constructed SearchAPI with the chosen/validated settings

Return type:

SearchAPI

with_config(config: SearchAPIConfig | None = None, parameter_config: APIParameterConfig | None = None, provider_name: str | None = None, query: str | None = None) → Iterator[SearchAPI][source]

Temporarily modifies the SearchAPI’s SearchAPIConfig and/or APIParameterConfig and namespace. You can provide a config, a parameter_config, or a provider_name to fetch defaults. Explicitly provided configs take precedence over provider_name, and the context manager will revert changes to the parameter mappings and search configuration afterward.

Parameters:

config (Optional[SearchAPIConfig]) – Temporary search api configuration to use within the context to control where and how response records are retrieved.
parameter_config (Optional[APIParameterConfig]) – Temporary parameter config to use within the context to resolve universal parameters names to those that are specific to the current api.
provider_name (Optional[str]) – Used to retrieve the associated configuration for a specific provider in order to edit the parameter map when using a different provider.
query (Optional[str]) – Allows users to temporarily modify the query used to retrieve records from an API.

Yields:

SearchAPI – The current api object with a temporarily swapped config during the context manager.

with_config_parameters(provider_name: str | None = None, query: str | None = None, **api_specific_parameters) → Iterator[SearchAPI][source]

Allows for the temporary modification of the search configuration, and parameter mappings, and cache namespace. For the current API. Uses a contextmanager to temporarily change the provided parameters without persisting the changes.

Parameters:

provider_name (Optional[str]) – If provided, fetches the default parameter config for the provider.
query (Optional[str]) – Allows users to temporarily modify the query used to retrieve records from an API.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override in the current config.

Yields:

SearchAPI – The API object with temporarily swapped config and/or parameter config.

class scholar_flux.SearchAPIConfig(*, provider_name: str = '', base_url: str = '', records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: float = -1, api_key: SecretStr | None = None, api_specific_parameters: dict[str, Any] | None = None)[source]

Bases: BaseModel

The SearchAPIConfig class provides the core tools necessary to set and interact with the API. The SearchAPI uses this class to retrieve data from an API using universal parameters to simplify the process of retrieving raw responses.

provider_name

Indicates the name of the API to use when making requests to a provider. If the provider name matches a known default and the base_url is unspecified, the base URL for the current provider is used instead.

Type:: str

base_url

Indicates the API URL where data will be searched and retrieved.

Type:: str

records_per_page

Controls the number of records that will appear on each page

Type:: int

request_delay

Indicates the minimum delay between each request to avoid exceeding API rate limits

Type:: float

api_key

This is an API-specific parameter for validating the current user’s identity. If a str type is provided, it is converted into a SecretStr.

Type:: Optional[str | SecretStr]

api_specific_parameters

A dictionary containing all parameters specific to the current API. API-specific parameters include the following.

mailto (Optional[str | SecretStr]):
An optional email address for receiving feedback on usage from providers, This parameter is currently applicable only to the Crossref API.
db: (str):
The parameter use by the NIH to direct requests for data to the pubmed database. This parameter defaults to pubmed and does not require direct specification

Type:: dict[str, APISpecificParameter]

Examples

>>> from scholar_flux.api import SearchAPIConfig, SearchAPI, provider_registry
# to create a CROSSREF configuration with minimal defaults and provide an api_specific_parameter:
>>> config = SearchAPIConfig.from_defaults(provider_name = 'crossref', mailto = 'your_email_here@example.com')
# the configuration automatically retrieves the configuration for the "Crossref" API
>>> assert config.provider_name == 'crossref' and config.base_url == provider_registry['crossref'].base_url
>>> api = SearchAPI.from_settings(query = 'q', config = config)
>>> assert api.config == config
# to retrieve all defaults associated with a provider and automatically read an API key if needed
>>> config = SearchAPIConfig.from_defaults(provider_name = 'pubmed', api_key = 'your api key goes here')
# the API key is retrieved automatically if you have the API key specified as an environment variable
>>> assert config.api_key is not None
# Default provider API specifications are already pre-populated if they are set with defaults
>>> assert config.api_specific_parameters['db'] == 'pubmed'  # required by pubmed and defaults to pubmed
# Update a provider and automatically retrieve its API key - the previous API key will no longer apply
>>> updated_config = SearchAPIConfig.update(config, provider_name = 'core')
# The API key should have been overwritten to use core. Looks for a `CORE_API_KEY` env variable by default
>>> assert updated_config.provider_name  == 'core' and  updated_config.api_key != config.api_key

DEFAULT_PROVIDER: ClassVar[str] = 'PLOS'

DEFAULT_RECORDS_PER_PAGE: ClassVar[int] = 25

DEFAULT_REQUEST_DELAY: ClassVar[float] = 6.1

MAX_API_KEY_LENGTH: ClassVar[int] = 512

api_key: SecretStr | None

api_specific_parameters: dict[str, Any] | None

base_url: str

classmethod default_request_delay(v: int | float | None, provider_name: str | None = None) → float[source]

Helper method enabling the retrieval of the most appropriate rate limit for the current provider.

Defaults to the SearchAPIConfig default rate limit when the current provider is unknown and a valid rate limit has not yet been provided.

Parameters:

v (Optional[int | float]) – The value received for the current request_delay
provider_name (Optional[str]) – The name of the provider to retrieve a rate limit for

Returns:

The inputted non-negative request delay, the retrieved rate limit for the current provider: if available, or the SearchAPIConfig.DEFAULT_REQUEST_DELAY - all in order of priority.

Return type:

float

classmethod from_defaults(provider_name: str, **overrides) → SearchAPIConfig[source]

Uses the default configuration for the chosen provider to create a SearchAPIConfig object containing configuration parameters. Note that additional parameters and field overrides can be added via the **overrides field.

Parameters:

provider_name (str) – The name of the provider to create the config
**overrides – Optional keyword arguments to specify overrides and additional arguments

Returns:

A default APIConfig object based on the chosen parameters

Return type:

SearchAPIConfig

model_config: ClassVar[ConfigDict] = {}: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

provider_name: str

records_per_page: int

request_delay: float

classmethod set_records_per_page(v: int | None)[source]

Sets the records_per_page parameter with the default if the supplied value is not valid:

Triggers a validation error when request delay is an invalid type. Otherwise uses the DEFAULT_RECORDS_PER_PAGE class attribute if the supplied value is missing or is a negative number.

structure(flatten: bool = False, show_value_attributes: bool = True) → str[source]: Helper method for retrieving a string representation of the overall structure of the current SearchAPIConfig.

classmethod update(current_config: SearchAPIConfig, **overrides) → SearchAPIConfig[source]

Create a new SearchAPIConfig by updating an existing config with new values and/or switching to a different provider. This method ensures that the new provider’s base_url and defaults are used if provider_name is given, and that API-specific parameters are prioritized and merged as expected.

Parameters:

current_config (SearchAPIConfig) – The existing configuration to update.
**overrides – Any fields or API-specific parameters to override or add.

Returns:

A new config with the merged and prioritized values.

Return type:

SearchAPIConfig

property url_basename: str: Uses the _extract_url_basename method from the provider URL associated with the current config instance.

classmethod validate_api_key(v: SecretStr | str | None) → SecretStr | None[source]: Validates the api_key attribute and triggers a validation error if it is not valid.

classmethod validate_provider_name(v: str | None) → str[source]: Validates the provider_name attribute and triggers a validation error if it is not valid.

classmethod validate_request_delay(v: int | float | None) → int | float | None[source]

Sets the request delay (delay between each request) for valid request delays. This validator triggers a validation error when the request delay is an invalid type.

If a request delay is left None or is a negative number, this class method returns -1, and further validation is performed by cls.default_request_delay to retrieve the provider’s default request delay.

If not available, SearchAPIConfig.DEFAULT_REQUEST_DELAY is used.

validate_search_api_config_parameters() → Self[source]

Validation method that resolves URLs and/or provider names to provider_info when one or the other is not explicitly provided.

Occurs as the last step in the validation process.

classmethod validate_url(v: str)[source]: Validates the base_url and triggers a validation error if it is not valid.

classmethod validate_url_type(v: str | None) → str[source]: Validates the type for the base_url attribute and triggers a validation error if it is not valid.

Bases: BaseCoordinator

High-level coordinator for requesting and retrieving records and metadata from APIs.

This class uses dependency injection to orchestrate the process of constructing requests, validating response, and processing scientific works and articles. This class is designed to abstract away the complexity of using APIs while providing a consistent and robust interface for retrieving record data and metadata from request and storage cache if valid to help avoid exceeding limits in API requests.

If no search_api is provided, the coordinator will create a Search API that uses the default provider if the environment variable, SCHOLAR_FLUX_DEFAULT_PROVIDER, is not provided. Otherwise PLOS is used on the backend.

Flexible initializer that constructs a SearchCoordinator either from its core components or from their basic building blocks when these core components are not directly provided.

If search_api and response_coordinator are provided, then this method will use these inputs directly.

The additional parameters can still be used to update these two components. For example, a search_api can be updated with a new query, session, and SearchAPIConfig parameters through keyword arguments (**kwargs))

When neither component is provided:

The creation of the search_api requires, at minimum, a query.
If the response_coordinator, a parser, extractor, processor, and cache_manager aren’t provided, then a new ResponseCoordinator will be built from the default settings.

Core Components/Attributes:

SearchAPI: handles all requests to an API based on its configuration.: Dependencies: query, **kwargs
ResponseCoordinator:handles the parsing, record/metadata extraction, processing, and caching of responses: Dependencies: parser, extractor, processor, cache_manager

Other Attributes:

RetryHandler: Addresses when to retry failed requests and how failed requests are retried SearchWorkflow: An optional workflow that defines custom search logic from specific APIs Validator: handles how requests are validated. The default determines whether a 200 response was received

Note

This implementation uses the underlying private method _initialize to handle the assignment of parameters under the hood while the core function of the __init__ creates these components if they do not already exist.

Parameters:

search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs
response_coordinator (Optional[ResponseCoordinator]) – Core class used to handle the processing and core handling of all responses from APIs
parser (Optional(BaseDataParser)) – First step of the response processing pipeline - parses response records into a dictionary
extractor (Optional[BaseDataExtractor]) – Extracts both records and metadata from responses separately
processor (Optional[ABCDataProcessor]) – Processes the previously extracted API records into list of dictionaries that are filtered and optionally flattened during processing
cache_manager (Optional[DataCacheManager]) – Manages the caching of processed records for faster retrieval
query (Optional[str]) – Query to be used when sending requests when creating an API - modifies the query if the API already exists
provider_name (Optional[str]) – The name of the API provider where requests will be sent. If a provider_name and base_url are both given, the SearchAPIConfig will prioritize base_urls over the provider_name.
cache_requests (Optional[bool]) – Determines whether or not to cache requests - api is the ground truth if not directly specified
cache_results (Optional[bool]) – Determines whether or not to cache processed responses - on by default unless specified otherwise
retry_handler (Optional[RetryHandler]) – class used to retry failed requests-cache
validator (Optional[ResponseValidator]) – class used to verify and validate responses returned from APIs
workflow (Optional[SearchWorkflow]) – An optional workflow used to customize how records are retrieved from APIs. Uses the default workflow for the current provider when a workflow is not directly specified.
**kwargs – Keyword arguments to be passed to the SearchAPIConfig that creates the SearchAPI if it doesn’t already exist

Examples –

>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.api import APIResponse, ReconstructedResponse
>>> from scholar_flux.sessions import CachedSessionManager
>>> from typing import MutableMapping
>>> session = CachedSessionManager(user_agent = 'scholar_flux', backend='redis').configure_session()
>>> search_coordinator = SearchCoordinator(query = "Intrinsic Motivation", session = session, cache_results = False)
>>> response = search_coordinator.search(page = 1)
>>> response
# OUTPUT: <ProcessedResponse(len=50, cache_key='plos_Functional Processing_1_50', metadata='...') ': 1, 'maxSco...")>
>>> new_response = ReconstructedResponse.build(**response.response.__dict__)
>>> new_response.validate()
>>> new_response = ReconstructedResponse.build(response.response)
>>> ReconstructedResponse.build(new_response).validate()
>>> new_response.validate()
>>> newer_response = APIResponse.as_reconstructed_response(new_response)
>>> newer_response.validate()
>>> double_processed_response = search_coordinator._process_response(response = newer_response, cache_key = response.cache_key)

classmethod as_coordinator(search_api: SearchAPI, response_coordinator: ResponseCoordinator, *args, **kwargs) → SearchCoordinator[source]

Helper factory method for building a SearchCoordinator that allows users to build from the final building blocks of a SearchCoordinator.

Parameters:

search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs
response_coordinator (Optional[ResponseCoordinator]) – Core class used to handle the processing and core handling of all responses from APIs

Returns:

A newly created coordinator that orchestrates record retrieval and processing

Return type:

SearchCoordinator

fetch(page: int, from_request_cache: bool = True, raise_on_error: bool = False, **api_specific_parameters) → Response | ResponseProtocol | None[source]

Fetches the raw response from the current API or from cache if available.

Parameters:

page (int) – The page number to retrieve from the cache.
from_request_cache (bool) – This parameter determines whether to try to fetch a valid response from cache.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.

Returns:

The response object if available, otherwise None.

Return type:

Optional[Response]

get_cached_request(page: int, **kwargs) → Response | ResponseProtocol | None[source]

Retrieves the cached request for a given page number if available.

Parameters:: page (int) – The page number to retrieve from the cache.
Returns:: The cached request object if available, otherwise None.
Return type:: Optional[Response]

get_cached_response(page: int) → Dict[str, Any] | None[source]

Retrieves the cached response for a given page number if available.

Parameters:: page (int) – The page number to retrieve from the cache.
Returns:: The cached response data if available, otherwise None.
Return type:: Optional[Dict[str, Any]]

iter_pages(pages: Sequence[int] | PageListInput, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters) → Generator[SearchResult, None, None][source]

Helper method that creates a generator function for retrieving and processing records from the API Provider for a page range in sequence. This implementation dynamically examines the properties of the page search result for each retrieved API response to determine whether or not iteration should halt early versus determining whether iteration should continue.

This method is directly used by SearchCoordinator.search_pages to provide a clean interface that abstracts the complexity of iterators and is also provided for convenience when iteration is more preferable.

Parameters:

pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.

Yields:

SearchResult –

Iteratively returns the SearchResult for each page using a generator expression.: Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)

robust_request(page: int, **api_specific_parameters) → Response | ResponseProtocol | None[source]

Constructs and sends a request to the current API. Fetches a response from the current API.

Parameters:

page (int) – The page number to retrieve from the cache.
**kwargs – Optional Additional parameters to pass to the SearchAPI

Returns:

The request object if available, otherwise None.

Return type:

Optional[Response]

search(page: int = 1, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters) → ProcessedResponse | ErrorResponse | None[source]

Public method for retrieving and processing records from the API specifying the page and records per page. Note that the response object is saved under the last_response attribute in the event that the response is retrieved and processed successfully, irrespective of whether the response was cached.

Parameters:

page (int) – The current page number. Used for process caching purposes even if not required by the API
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage
use_workflow (bool) – Indicates whether to use a workflow if available Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.

Returns:

A ProcessedResponse model containing the response (response), processed records (data), and article metadata (metadata) if the response was successful. Otherwise returns an ErrorResponse where the reason behind the error (message), exception type (error), and response (response) are provided. Possible error responses also include a NonResponse (an ErrorResponse subclass) for cases where a response object is irretrievable. Like the ErrorResponse class, NonResponse is also Falsy (i.e., not NonResponse returns True)

Return type:

Optional[ProcessedResponse | ErrorResponse]

search_data(page: int = 1, from_request_cache: bool = True, from_process_cache: bool = True) → List[Dict] | None[source]

Public method to perform a search, specifying the page and records per page. Note that instead of returning a ProcessedResponse or ErrorResponse, this calls the search method and retrieves only the list of processed dictionary records from the ProcessedResponse.

Parameters:

page (int) – The current page number.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage stored within the SearchCoordinator.search_api.cache
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the processing cache stored within the SearchCoordinator.response_coordinator.cache

Returns:

A List of records containing processed article data

Return type:

Optional[List[Dict]]

search_pages(pages: Sequence[int] | PageListInput, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters) → SearchResultList[source]

Public method for retrieving and processing records from the API specifying the page and records per page in sequence. This method Note that the response object is saved under the last_response attribute in the event that the data is processed successfully, irrespective of whether responses are cached or not.

Parameters:

page (int) – The current page number. Used for process caching purposes even if not required by the API
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage
use_workflow (bool) – Indicates whether to use a workflow if available Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.

Returns:

A list of response data classes containing processed article data (data).: Note that processing stops if the response for a given page is None, is not retrievable, or contains less than the expected number of responses, indicating that the next page may contain no more records.

Return type:

List[ProcessedResponse]

classmethod update(search_coordinator: SearchCoordinator, search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, retry_handler: RetryHandler | None = None, validator: ResponseValidator | None = None, workflow: SearchWorkflow | None = None) → SearchCoordinator[source]

Helper factory method allowing the creation of a new components based on an existing configuration while allowing the replacement of previous components. Note that this implementation does not directly copy the underlying components if a new component is not selected.

Parameters:

SearchCoordinator – A previously created coordinator containing the components to use if a default is not provided
search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs
response_coordinator (Optional[ResponseCoordinator]) – Core class used to handle the processing and core handling of all responses from APIs
retry_handler (Optional[RetryHandler]) – class used to retry failed requests-cache
validator (Optional[ResponseValidator]) – class used to verify and validate responses returned from APIs
workflow (Optional[SearchWorkflow]) – An optional workflow used to customize how records are retrieved from APIs. Uses the default workflow for the current provider when a workflow is not directly specified and does not directly carry over in cases where a new provider is chosen.

Returns:

A newly created coordinator that orchestrates record retrieval and processing

Return type:

SearchCoordinator

class scholar_flux.SessionManager(user_agent: str | None = None)[source]

Bases: BaseSessionManager

Manager that creates a simple requests session using the default settings and the provided User-Agent.

Parameters:: user_agent (Optional[str]) – The User-Agent to be passed as a parameter in the creation of the session object.

Example

>>> from scholar_flux.sessions import SessionManager
>>> from scholar_flux.api import SearchAPI
>>> from requests import Session
>>> session_manager = SessionManager(user_agent='scholar_flux_user_agent')
### Creating the session object
>>> session = session_manager.configure_session()
### Which is also equivalent to:
>>> session = session_manager()
### This implementation returns a requests.session object which is compatible with the SearchAPI:
>>> assert isinstance(session, Session)
# OUTPUT: True
>>> api = SearchAPI(query='history of software design', session = session)

__init__(user_agent: str | None = None) → None[source]: Initializes a basic session manager that sets the user agent if provided.

configure_session() → Session[source]

Configures a basic requests session with the provided user_agent attribute.

Returns:: a regular requests.session object with the default settings and an optional user header.
Return type:: requests.Session