scholar_flux.data package

Submodules

scholar_flux.data.abc_processor module

The scholar_flux.data.abc_processor module defines the ABCDataProcessor, which defines the core abstract logic that all scholar_flux data processor subclasses implement.

This module defines the abstract methods and types that each processor will use for compatibility with the SearchCoordinator in the processing step.

class scholar_flux.data.abc_processor.ABCDataProcessor(*args: Any, **kwargs: Any)[source]

Bases: ABC

The ABCDataProcessor is the base class from which all other processors are created.

The purpose of all subclasses of the ABCDataProcessor is to transform extracted records into a format suitable for future data processing pipelines. More specifically, its responsibilities include:

Processing a specific key from a record by joining non-None values into a string.

Processing a record dictionary to extract record and article content, creating a processed record dictionary with an abstract field.

Processing a list of raw page record dict data from the API response based on record keys.

All subclasses, at minimum, are expected to implement the process_page method which would effectively transform the records of each page into the intended list of dictionaries.
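Since only process_page is strictly required, the subclass contract can be sketched in plain Python. MiniProcessor and TitleOnlyProcessor below are hypothetical stand-ins that mirror the shape of the contract, not the real ABCDataProcessor:

```python
from abc import ABC, abstractmethod


class MiniProcessor(ABC):
    """Hypothetical stand-in mirroring the ABCDataProcessor contract."""

    @abstractmethod
    def process_page(self, parsed_records: list[dict]) -> list[dict]:
        """Transform a page of records into the intended list of dictionaries."""


class TitleOnlyProcessor(MiniProcessor):
    """Minimal concrete subclass: only process_page is implemented."""

    def process_page(self, parsed_records: list[dict]) -> list[dict]:
        # Keep only the 'title' field from each record
        return [{"title": record.get("title")} for record in parsed_records]


processed = TitleOnlyProcessor().process_page([{"title": "Impact of Agile", "doi": "10.1/x"}])
```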

__init__(*args: Any, **kwargs: Any) None[source]

Initializes record keys and header/body paths in the object instance using defined methods.

define_record_keys(*args: Any, **kwargs: Any) dict | None[source]

Abstract method to be optionally implemented to determine record keys that should be parsed to process each record.

define_record_path(*args: Any, **kwargs: Any) Tuple | None[source]

Abstract method to be optionally implemented to define header and body paths for record extraction, with default paths provided if not specified.

discover_keys(*args: Any, **kwargs: Any) dict | None[source]

Abstract method to be optionally implemented to discover nested key paths in json data structures.

ignore_record_keys(*args: Any, **kwargs: Any) list | None[source]

Abstract method to be optionally implemented to ignore certain keys in records when processing records.

load_data(*args: Any, **kwargs: Any) Any[source]

Helper method that is optionally implemented by subclasses to load JSON data into customized implementations of processors.

process_key(*args: Any, **kwargs: Any) str | None[source]

Abstract method to be optionally implemented for processing keys from records.

abstract process_page(*args: Any, **kwargs: Any) list[dict][source]

Must be implemented in subclasses for processing entire pages of records.

process_record(*args: Any, **kwargs: Any) dict | None[source]

Abstract method to be optionally implemented for processing a single record in a json data structure.

Used to extract record data and article content, creating a processed record dictionary with an abstract field.

process_text(*args: Any, **kwargs: Any) str | None[source]

Abstract method to be optionally implemented for processing a record dictionary to extract record and article content, creating a processed record dictionary with an abstract field.

classmethod record_filter(*args: Any, **kwargs: Any) bool | None[source]

Optional filter implementation to handle record screening using regex or other logic.

Subclasses can customize filtering if required.

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for quickly showing a representation of the overall structure of the current Processor subclass. The instance uses the generate_repr helper function to produce human-readable representations of the core structure of the processing configuration along with its defaults.

Returns:

The structure of the current Processor subclass as a string.

Return type:

str

scholar_flux.data.base_extractor module

The scholar_flux.data.base_extractor implements the core processes used to extract data from parsed responses.

The BaseDataExtractor implements the methods and functionality that are used when the structure of the parsed response and paths for records and metadata are already known. The BaseDataExtractor serves as the base for later extension with the scholar_flux.data.data_extractor.DataExtractor to dynamically identify records and metadata paths when the structure of the response is not provided.

class scholar_flux.data.base_extractor.BaseDataExtractor(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None)[source]

Bases: object

Base DataExtractor implementing the minimum components necessary to extract records and metadata from parsed responses when the location of records and metadata is known beforehand.

__init__(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None)[source]

Initialize the DataExtractor with metadata and records to extract separately.

If record_path or metadata_path is specified, the data extractor will attempt to retrieve the metadata and records at the provided paths. Note that, because metadata paths can be associated with multiple keys, starting from the outermost dictionary, it may be necessary to specify a dictionary whose keys denote metadata variables and whose values are path lists indicating how to retrieve each value. The path can also be given as a list of lists describing how to retrieve the terminal element.

While the encouraged type for record_path is a list of strings, each representing one nested path element to traverse to arrive at a value for a field, a delimited string can also be used, with the default delimiter being scholar_flux.utils.PathStr.DELIMITER. Similarly, a list or dictionary of path strings can be used as shorthand for the individual metadata fields containing relevant metadata values.

Parameters:
  • record_path (Optional[List[str]]) – Custom path to find records in the parsed data. Contains a list of strings and, rarely, integer indexes indicating how to recursively find the list of records.

  • metadata_path (List[List[str]] | Optional[Dict[str, List[str]]]) – Identifies the paths in a dictionary associated with metadata as opposed to records. This can be a list of paths where each element is a list describing how to arrive at a terminal element.
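The path semantics described above can be illustrated in plain Python. follow_path is a hypothetical helper used only for illustration, not part of scholar_flux:

```python
from typing import Any


def follow_path(data: Any, path: list) -> Any:
    """Walk a nested structure one key (or list index) at a time."""
    for step in path:
        data = data[step]
    return data


page = {"response": {"docs": [{"id": 1}], "meta": {"total": 1}}}
records = follow_path(page, ["response", "docs"])                       # record_path-style lookup
metadata = {"total": follow_path(page, ["response", "meta", "total"])}  # metadata_path-style lookup
```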

extract(parsed_page: list[dict] | dict) tuple[RecordList | None, MetadataType | None][source]

Extract both records and metadata from the parsed page dictionary.

Parameters:

parsed_page (Union[list[dict], dict]) – The dictionary containing the page data and metadata to be extracted.

Returns:

A tuple containing the list of records and the metadata dictionary.

Return type:

tuple[Optional[RecordList], Optional[MetadataType]]

extract_metadata(parsed_page_dict: dict[str, Any]) MetadataType[source]

Extract metadata from the parsed page dictionary.

Parameters:

parsed_page_dict (Dict) – The dictionary containing the page data to be parsed.

Returns:

The extracted metadata.

Return type:

Dict

extract_records(parsed_page_dict: dict) RecordList | None[source]

Extract records from parsed data as a list of dicts.

Parameters:

parsed_page_dict (Dict) – The dictionary containing the page data to be parsed.

Returns:

A list of records as dictionaries, or None if extraction fails.

Return type:

Optional[RecordList]

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Base method for showing the structure of the current Data Extractor. This method reveals the configuration settings of the extractor config that will be used to extract records and metadata.

Returns:

The current structure of the BaseDataExtractor or its subclass.

Return type:

str

classmethod update(data_extractor: Self, **data_extractor_kwargs: Any) Self[source]

Helper method for creating a new BaseDataExtractor instance, replacing only the specified components.

Parameters:
  • data_extractor (Self) – A previously created BaseDataExtractor instance

  • **data_extractor_kwargs – Keyword arguments used to replace components of the BaseDataExtractor. Unspecified fields from the previous BaseDataExtractor remain unchanged.

Returns:

A new data extractor instance with the specified parameter updates

Return type:

BaseDataExtractor
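The copy-with-overrides semantics of update can be sketched with plain dictionaries; update_config is a hypothetical helper illustrating the pattern, not the library method:

```python
def update_config(current: dict, **overrides) -> dict:
    """Copy-with-overrides: unspecified fields keep their previous values."""
    return {**current, **overrides}


previous = {"record_path": ["response", "docs"], "metadata_path": None}
updated = update_config(previous, metadata_path={"total": ["response", "total"]})
```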

scholar_flux.data.base_parser module

The scholar_flux.data.base_parser module contains the core logic for parsing data structures received from APIs.

This module implements the BaseDataParser that is used to prepare and parse JSON, XML, and YAML into dictionary-based nested structures prior to record extraction and processing.

class scholar_flux.data.base_parser.BaseDataParser[source]

Bases: object

Base class responsible for parsing typical formats seen in APIs that send news and academic articles in XML, JSON, and YAML formats.

__init__() None[source]

On initialization, the data parser is set to use built-in class methods to parse JSON, XML, and YAML-based response content by default, and the parse helper to determine which parser to use based on the Content-Type.

Parameters:
  • additional_parsers (Optional[dict[str, Callable]]) – Allows for the addition of new parsers and overrides to class methods, used for content-type identification.

classmethod detect_format(response: Response | ResponseProtocol) str | None[source]

Helper method for determining the format corresponding to a response object.

classmethod get_default_parsers() dict[str, Callable][source]

Helper method used to retrieve the default parsers to parse XML, JSON, and YAML response data.

Returns:

A dictionary of data parsers that can be used to parse response data into a usable JSON format

Return type:

dict[str, Callable]

parse(response: Response | ResponseProtocol) dict | list[dict] | None[source]

Uses one of the default parsing methods to extract a dictionary of data from the response content.

classmethod parse_from_defaults(response: Response | ResponseProtocol) dict | list[dict] | None[source]

Detects the API response format if a format is not already specified and uses one of the default parsers to parse the response content into a dictionary, depending on the content type stored in the API response header.

Parameters:

response (response type) – The response (or response-like) object from the API request.

Returns:

response dict containing fields including a list of metadata records as dictionaries.

Return type:

dict

classmethod parse_json(content: bytes) dict | list[dict][source]

Uses the standard json library to parse JSON content into a dictionary.

classmethod parse_xml(content: bytes) dict | list[dict][source]

Uses the optional xmltodict library to parse XML content into a dictionary.

classmethod parse_yaml(content: bytes) dict | list[dict][source]

Uses the optional yaml library to parse YAML content.
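As an illustration, the JSON case reduces to the standard library's json module; this sketch only approximates the behavior of parse_json described above:

```python
import json

# Raw response bytes, as a JSON parser would receive them
content = b'{"records": [{"id": 1}], "total": 1}'

# Roughly what the default JSON parser does: decode and load into a dict
parsed = json.loads(content.decode("utf-8"))
```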

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for retrieving a string representation of the structure of the current BaseDataParser or subclass of the BaseDataParser.

Override this for more specific descriptions of attributes and defaults. Useful for showing the options being used for parsing response content into dictionary objects.

Returns:

A string representation of the base parser indicating all registered or default parsers

Return type:

str

scholar_flux.data.data_extractor module

The scholar_flux.data.data_extractor builds on the BaseDataExtractor to implement automated path extraction.

The DataExtractor implements dynamic record and metadata extraction when the paths are not known beforehand.

The extracted list of responses and metadata dictionaries are used in later steps prior to further response record processing.

class scholar_flux.data.data_extractor.DataExtractor(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None, dynamic_record_identifiers: list | tuple | None = None, dynamic_metadata_identifiers: list | tuple | None = None, annotate_records: bool | None = None)[source]

Bases: BaseDataExtractor

The DataExtractor allows for the streamlined extraction of records and metadata from responses retrieved from APIs. This proceeds as the second stage of the response processing step where metadata and records are extracted from parsed responses.

The data extractor provides two ways to identify metadata paths and record paths:

  1. Manual identification: If record_path or metadata_path is specified, the data extractor will attempt to retrieve the metadata and records at the provided paths. Note that, because metadata paths can be associated with multiple keys, starting from the outermost dictionary, it may be necessary to specify a dictionary whose keys denote metadata variables and whose values are path lists indicating how to retrieve each value. The path can also be given as a list of lists describing how to retrieve the terminal element.

  2. Dynamic identification: Uses heuristics to distinguish records from metadata. Records will nearly always be defined by a list containing only dictionaries as its elements, while the metadata will generally contain a variety of elements, some nested and others as integers, strings, etc. In cases where it is harder to determine, dynamic_record_identifiers can be used to decide whether a list containing a single nested dictionary is a record or metadata. For scientific APIs, a record’s keys may contain ‘abstract’, ‘title’, ‘doi’, etc. These identifiers can be defined manually by users if the defaults are not reliable for a given API.

Upon initializing the class, the class can be used as a callable that returns the records and metadata in that order.

Example

>>> from scholar_flux.data import DataExtractor
>>> data = dict(query='specification driven development', options={'record_count':5,'response_time':'50ms'})
>>> data['records'] = [dict(id=1, record='protocol vs.code'), dict(id=2, record='Impact of Agile')]
>>> extractor = DataExtractor(annotate_records=False)
>>> records, metadata = extractor(data)
>>> print(metadata)
# OUTPUT: {'query': 'specification driven development', 'record_count': 5, 'response_time': '50ms'}
>>> print(records)
# OUTPUT: [{'id': 1, 'record': 'protocol vs.code'}, {'id': 2, 'record': 'Impact of Agile'}]
Record Annotation:

When annotate_records=True, each extracted record receives two fields for downstream linkage after processing/flattening:

  • _extraction_index: Zero-based position in the extracted record list

  • _record_id: Content-based hash in format “hash_index” (e.g., “a1b2c3d4_0”)

These fields enable resolution back to original records when order may change or records are deduplicated. The hash is generated from record content excluding internal fields (those starting with ‘_’), ensuring stability across runs for identical content.

Example:
>>> extractor = DataExtractor(annotate_records=True)
>>> records, metadata = extractor(data)
>>> records[0]['_extraction_index']
# OUTPUT: 0
>>> records[0]['_record_id']
# OUTPUT: 'a9e3e93e_0'
>>> records[0]
# OUTPUT: {'id': 1, 'record': 'protocol vs.code', '_extraction_index': 0, '_record_id': 'a9e3e93e_0'}
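The content-based hash can be approximated in plain Python. The real algorithm, digest length, and serialization are internal to the library and not specified here, so record_id below is only an illustrative sketch of the documented behavior (hash of non-underscore fields, suffixed with the index):

```python
import hashlib
import json


def record_id(record: dict, index: int) -> str:
    """Illustrative sketch of a content-based record ID; the real algorithm is internal."""
    # Exclude internal fields (underscore-prefixed keys) so annotations don't affect the hash
    content = {key: value for key, value in record.items() if not key.startswith("_")}
    digest = hashlib.sha256(json.dumps(content, sort_keys=True).encode()).hexdigest()[:8]
    return f"{digest}_{index}"
```

Because internal fields are excluded, re-annotating an already-annotated record yields the same ID, which is what makes the IDs stable across runs for identical content.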
DEFAULT_DYNAMIC_METADATA_IDENTIFIERS = ('metadata', 'facets', 'IdList')
DEFAULT_DYNAMIC_RECORD_IDENTIFIERS = ('title', 'doi', 'abstract')
EXTRACTION_INDEX_KEY = '_extraction_index'
RECORD_ID_KEY = '_record_id'
__init__(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None, dynamic_record_identifiers: list | tuple | None = None, dynamic_metadata_identifiers: list | tuple | None = None, annotate_records: bool | None = None)[source]

Initialize the DataExtractor with optional path overrides for metadata and records.

Parameters:
  • record_path (Optional[List[str]]) – Custom path to find records in the parsed data. Contains a list of strings and, rarely, integer indexes indicating how to recursively find the list of records.

  • metadata_path (List[List[str]] | Optional[Dict[str, List[str]]]) – Identifies the paths in a dictionary associated with metadata as opposed to records. This can be a list of paths where each element is a list describing how to get to a terminal element.

  • dynamic_record_identifiers (Optional[List[str]]) – Helps to identify dictionary keys that only belong to records when dealing with a single element that would otherwise be classified as metadata.

  • dynamic_metadata_identifiers (Optional[List[str]]) – Helps to identify dictionary keys that are likely to only belong to metadata that could otherwise share a similar structure to a list of dictionaries, similar to what’s seen with records.

  • annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).

dynamic_identification(parsed_page_dict: dict) tuple[RecordList, MetadataType][source]

Dynamically identify and separate metadata from records. This function recursively traverses the dictionary and uses a heuristic to determine whether a given entry corresponds to metadata or a list of records: generally, keys associated with records will contain only lists of dictionaries, whereas nested structures containing metadata will be associated with a singular value, or a dictionary of keys each associated with a singular value that is not a list. Using this heuristic, metadata can be distinguished from records with a high degree of confidence.

Parameters:

parsed_page_dict (Dict) – The dictionary containing the page data and metadata to be extracted.

Returns:

A tuple containing the list of record dictionaries and the metadata dictionary.

Return type:

tuple[RecordList, MetadataType]
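The heuristic above can be approximated in plain Python. split_records_and_metadata is an illustrative, non-recursive stand-in for the idea, not the library's implementation:

```python
def split_records_and_metadata(page: dict) -> tuple[list[dict], dict]:
    """Illustrative sketch: separate a record list from metadata in one level of a page."""
    records: list[dict] = []
    metadata: dict = {}
    for key, value in page.items():
        # A non-empty list containing only dictionaries is treated as the record list
        if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            records = value
        else:
            metadata[key] = value
    return records, metadata


records, metadata = split_records_and_metadata(
    {"query": "agile", "total": 2, "items": [{"id": 1}, {"id": 2}]}
)
```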

extract(parsed_page: list[dict] | dict) tuple[RecordList | None, MetadataType | None][source]

Extract both records and metadata from the parsed page dictionary.

Parameters:

parsed_page (RecordList | dict) – The dictionary containing the page data and metadata to be extracted.

Returns:

A tuple containing the list of records and the metadata dictionary.

Return type:

tuple[Optional[RecordList], Optional[MetadataType]]

classmethod strip_annotations(records: RecordType) RecordType[source]
classmethod strip_annotations(records: NormalizedRecordList) NormalizedRecordList
classmethod strip_annotations(records: RecordList) RecordList
classmethod strip_annotations(records: None) None

Removes metadata annotations from records by filtering out keys prefixed with underscore.

This method creates clean copies of records without the internal pipeline metadata fields (e.g., ‘_extraction_index’, ‘_record_id’) that may be added during processing when record annotation is enabled.

Parameters:

records (RecordType | RecordList) – A single dictionary record or a list of dictionary records to clean. Records should contain dictionary elements with string keys.

Returns:

A new dictionary with annotation fields removed if the input is a single record, or a new list of dictionaries with annotation fields removed if the input is a list.

Return type:

RecordType

Note

The original records are not modified. This method instead returns a new dictionary or a new list of dictionaries with only non-annotation fields preserved.
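For a single record, the documented behavior amounts to filtering underscore-prefixed keys into a fresh dictionary; strip_annotations below is a plain-Python sketch of that idea, not the classmethod itself:

```python
def strip_annotations(record: dict) -> dict:
    """Return a copy of a record without underscore-prefixed annotation fields."""
    return {key: value for key, value in record.items() if not key.startswith("_")}


annotated = {"id": 1, "_extraction_index": 0, "_record_id": "a9e3e93e_0"}
clean = strip_annotations(annotated)
```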

classmethod update(data_extractor: BaseDataExtractor, **data_extractor_kwargs: Any) Self[source]

Helper method for creating a new DataExtractor instance, replacing only the specified components.

Parameters:
  • data_extractor (Self) – A previously created DataExtractor instance

  • **data_extractor_kwargs – Keyword arguments used to replace components of the DataExtractor. Unspecified fields from the previous DataExtractor remain unchanged.

Returns:

A new data extractor instance with the specified parameter updates

Return type:

DataExtractor

scholar_flux.data.data_parser module

The scholar_flux.data.data_parser module defines the DataParser used within the scholar_flux API to parse JSON as well as uncommon response formats.

This module implements the DataParser which allows for custom overrides to JSON, XML, and YAML files to prepare and parse dictionary-based nested structures prior to record extraction and processing.

class scholar_flux.data.data_parser.DataParser(additional_parsers: dict[str, Callable] | None = None)[source]

Bases: BaseDataParser

Extensible class that handles the identification and parsing of typical formats seen in APIs that send news and academic articles in XML, JSON, and YAML formats.

The BaseDataParser contains each of the necessary class methods to parse JSON, XML, and YAML formats, while this class allows for the specification of additional parsers.

Parameters:

additional_parsers (Optional[dict[str, Callable]]) – Allows overrides for parsers in addition to the JSON, XML and YAML parsers that are enabled by default.

__init__(additional_parsers: dict[str, Callable] | None = None)[source]

On initialization, the data parser is set to use built-in class methods to parse JSON, XML, and YAML-based response content by default, and the parse helper to determine which parser to use based on the Content-Type.

Parameters:
  • additional_parsers (Optional[dict[str, Callable]]) – Allows for the addition of new parsers and overrides to class methods, used for content-type identification.

parse(response: Response | ResponseProtocol, format: str | None = None) dict | list[dict] | None[source]

Parses the API response content using two core steps.

  1. Detects the API response format if a format is not already specified

  2. Uses the previously determined format to parse the content of the response and return a parsed dictionary (json) structure.

Parameters:
  • response (requests.Response | ResponseProtocol) – The response or response-like object from the API request.

  • format (Optional[str]) – The format to use when parsing the response content; detected automatically if not specified.

Returns:

response dict containing fields including a list of metadata records as dictionaries.

Return type:

dict

scholar_flux.data.data_processor module

The scholar_flux.data.data_processor implements a DataProcessor based on the schema required of the ABCDataProcessor for processing the records and/or metadata extracted from a response. The data processor implements manual nested key retrieval by using the list of record_keys that point to the paths of fields to extract from the passed list of nested JSON dictionary records.

The data processor can be used to filter records based on conditions and extract nested key-value pairs within each record to ensure that relevant records and fields from records are retained

class scholar_flux.data.data_processor.DataProcessor(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = '; ', regex: bool | None = True)[source]

Bases: ABCDataProcessor

Initialize the DataProcessor with explicit extraction paths and options. The DataProcessor performs the selective extraction of specific fields from each record within a page (list) of JSON (dictionary) records and assumes that the paths to extract are known beforehand.

Parameters:
  • record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.

  • ignore_keys – List of keys to ignore during processing.

  • keep_keys – List of keys that records should contain to be retained during processing.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for ignore filtering.

Examples
>>> from scholar_flux.data import DataProcessor
>>> data = [{'id':1, 'school':{'department':'NYU Department of Mathematics'}},
...         {'id':2, 'school':{'department':'GSU Department of History'}},
...         {'id':3, 'school':{'organization':'Pharmaceutical Research Team'}}]
# creating a basic processor
>>> data_processor = DataProcessor(record_keys = [['id'], ['school', 'department'], ['school', 'organization']]) # instantiating the class
# The process_page method can then be referenced using the processor as a callable:
>>> result = data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': 1, 'school.department': 'NYU Department of Mathematics', 'school.organization': None},
#          {'id': 2, 'school.department': 'GSU Department of History', 'school.organization': None},
#          {'id': 3, 'school.department': None, 'school.organization': 'Pharmaceutical Research Team'}]
# String paths can also be used to accomplish the same:
>>> data_processor = DataProcessor(record_keys = ['id', 'school.department', 'school.organization']) # instantiating the class
>>> assert data_processor.process_page(data) == result
__init__(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = '; ', regex: bool | None = True) None[source]

Initialize the DataProcessor with explicit extraction paths and options.

Parameters:
  • record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.

  • ignore_keys – List of keys to ignore during processing.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for ignore filtering.

collapse_fields(processed_record_dict: dict) dict[str, list[str | int] | str | int][source]

Helper method for joining lists of data into a singular string for flattening.
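The joining behavior can be sketched in plain Python; collapse_fields below is an illustrative stand-in under the assumption that None entries are skipped before joining:

```python
def collapse_fields(record: dict, delimiter: str = "; ") -> dict:
    """Join list values into single delimited strings (illustrative sketch)."""
    collapsed = {}
    for key, value in record.items():
        if isinstance(value, list):
            # Skip None entries, stringify the rest, and join with the delimiter
            collapsed[key] = delimiter.join(str(v) for v in value if v is not None)
        else:
            collapsed[key] = value
    return collapsed
```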

static extract_key(record: RecordType | RecordList | None, key: str | int, path: list[str | int] | None = None) list | None[source]

Processes a specific key from a record by retrieving the value associated with the key at the nested path. Depending on whether value_delimiter is set, the method will join non-None values into a string using the delimiter. Otherwise, keys with lists as values will retain the lists unedited.

Parameters:
  • record – The JSON structure (generally a nested list or dictionary) to extract the key from.

  • key – The key to process within the record dictionary.

Returns:

The value found at the specified key within a dictionary nested in a list, and otherwise None.

Return type:

list

process_page(parsed_records: RecordList, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = None) list[dict][source]

Core method of the data processor that enables the processing of lists of dictionary records to filter and process records based on the configuration of the current DataProcessor.

Parameters:
  • parsed_records (list[dict[str | int, Any]]) – The records to process and/or filter

  • ignore_keys (Optional[list[str]]) – Optional overrides that identify records to ignore based on the presence of specific keys or regex patterns.

  • keep_keys (Optional[list[str]]) – Optional overrides identifying records to keep based on the presence of specific keys or regex patterns.

  • regex (Optional[bool]) – Used to determine whether or not to filter records using regular expressions

process_record(record_dict: RecordType) dict[str, Any][source]

Processes a record dictionary to extract record data and article content, creating a processed record dictionary with an abstract field.

Parameters:

record_dict (RecordType) – The dictionary containing the record data.

Returns:

A processed record dictionary with record keys processed and an abstract field created from the article content.

Return type:

dict

classmethod record_filter(record_dict: RecordType, record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Helper method that filters records using regex pattern matching, checking if any of the keys provided in the function call exist.
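The key-matching idea can be illustrated in plain Python; record_filter below is a hypothetical sketch that keeps a record when any of its keys matches one of the given regex patterns, and is not the classmethod's actual implementation:

```python
import re


def record_filter(record: dict, patterns: list[str]) -> bool:
    """Keep a record if any of its keys matches one of the regex patterns (sketch)."""
    return any(re.search(pattern, key) for pattern in patterns for key in record)
```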

update_record_keys(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str]) None[source]

A helper method for transforming and updating the current dictionary of record keys with a new list.

scholar_flux.data.normalizing_data_processor module

This normalizing_data_processor.py module implements the NormalizingDataProcessor for normalizing API field names.

class scholar_flux.data.normalizing_data_processor.NormalizingDataProcessor(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = None, regex: bool | None = True, traverse_lists: bool | None = True)[source]

Bases: DataProcessor

A data processor that flattens records before extraction, extending DataProcessor.

This processor adds a normalization step to DataProcessor:

  1. Flattens each record into dot-notation keys (e.g., “school.department”)

  2. Extracts specified fields using parent class logic

  3. Handles already-flattened records (idempotent operation)

Inherits all functionality from DataProcessor, including:

  • Field extraction via record_keys

  • Record filtering via ignore_keys/keep_keys

  • Value collapsing via value_delimiter

Parameters:
  • record_keys – Keys to extract (same as DataProcessor).

  • ignore_keys – List of keys to ignore during processing.

  • keep_keys – List of keys that must be present to keep a record.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for filtering.

  • use_full_path – Whether to preserve full paths in flattened keys.

Examples

>>> from scholar_flux.data import NormalizingDataProcessor
>>> data = [{'id':1, 'school':{'department':'NYU Department of Mathematics'}},
...         {'id':2, 'school':{'department':'GSU Department of History'}},
...         {'id':3, 'school':{'organization':'Pharmaceutical Research Team'}}]
# creating a basic processor
>>> data_processor = NormalizingDataProcessor(record_keys = [['id'], ['school', 'department'], ['school', 'organization']]) # instantiating the class
# The process_page method can then be referenced using the processor as a callable:
>>> result = data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': 1, 'school.department': 'NYU Department of Mathematics', 'school.organization': None},
#          {'id': 2, 'school.department': 'GSU Department of History', 'school.organization': None},
#          {'id': 3, 'school.department': None, 'school.organization': 'Pharmaceutical Research Team'}]
# String paths can also be used to accomplish the same:
>>> data_processor = NormalizingDataProcessor(record_keys = ['id', 'school.department', 'school.organization']) # instantiating the class
>>> assert data_processor.process_page(data) == result
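The dot-notation flattening step can be approximated in plain Python; flatten below is an illustrative recursive sketch, not the library's implementation:

```python
def flatten(record: dict, prefix: str = "") -> dict:
    """Recursively flatten nested dicts into dot-notation keys (sketch)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Descend into nested dicts, accumulating the dotted path
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat
```

Running flatten over an already-flat record leaves it unchanged, which mirrors the idempotence noted above.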
__init__(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = None, regex: bool | None = True, traverse_lists: bool | None = True) None[source]

Initializes the NormalizingDataProcessor.

Parameters:
  • record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.

  • ignore_keys – List of keys to ignore during processing.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for ignore filtering.

  • traverse_lists (Optional[bool]) – Determines whether lists are automatically traversed when indices are not specified in the path.

process_record(record_dict: RecordType) NormalizedRecordType[source]

Process a single record by flattening it first, then extracting fields.

Overrides parent method to add flattening step before field extraction.

Parameters:

record_dict (RecordType) – The dictionary containing the record data.

Returns:

A processed record with specified keys extracted.

Return type:

NormalizedRecordType
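
The flatten-then-extract behavior described above can be sketched with the standard library alone. This is a simplified illustration, not the scholar_flux implementation: the real processor also handles lists, value delimiters, and key filtering, and the helper names here are chosen only for the example.

```python
from typing import Any


def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Recursively flatten nested dictionaries into delimiter-joined keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        path = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


def process_record(record: dict, record_keys: list) -> dict:
    """Flatten first, then extract only the requested dot-delimited keys."""
    flat = flatten(record)
    return {key: flat.get(key) for key in record_keys}


record = {"id": 1, "school": {"department": "NYU Department of Mathematics"}}
keys = ["id", "school.department", "school.organization"]
print(process_record(record, keys))
# {'id': 1, 'school.department': 'NYU Department of Mathematics', 'school.organization': None}
```

Missing keys resolve to None, which mirrors the documented example output above.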

scholar_flux.data.pass_through_data_processor module

The scholar_flux.data.pass_through_data_processor module implements a PassThroughDataProcessor based on the schema required by the ABCDataProcessor for processing the records and/or metadata extracted from a response.

The pass-through data processor is designed for simplicity: it returns extracted records as-is, while still allowing records to be filtered on conditions and nested key-value pairs to be extracted from each record when specified.

class scholar_flux.data.pass_through_data_processor.PassThroughDataProcessor(ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True)[source]

Bases: ABCDataProcessor

A basic data processor that retains all valid records without modification unless a specific filter for JSON keys is specified.

Unlike the DataProcessor, this specific implementation will not flatten records. Instead, all filtered and selected records retain their original nested structure.

__init__(ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True) None[source]

Initialize the PassThroughDataProcessor with explicit extraction paths and options.

Parameters:
  • ignore_keys – List of keys to ignore during processing.

  • keep_keys – List of keys that records should contain during processing.

  • regex – Whether to use regex for ignore filtering.

process_page(parsed_records: RecordList, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = None) list[dict][source]

Processes and returns each record as is if filtering the final list of records by key is not enabled.

process_record(record_dict: RecordType) RecordType[source]

A no-op method retained to maintain a similar interface as other DataProcessor implementations.

Parameters:

record_dict (RecordType) – The dictionary containing the record data.

Returns:

The original record dictionary, unchanged.

Return type:

RecordType

classmethod record_filter(record_dict: RecordType, record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Helper method that filters records using regex pattern matching, checking if any of the keys provided in the function call exist.
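
The filtering idea can be sketched as follows. This is a hypothetical approximation of record_filter, not the library's code, and the handling of defaults may differ in the actual method:

```python
import re


def record_filter(record: dict, record_keys=None, regex: bool = True) -> bool:
    """Return True when any requested key matches a key in the record."""
    if not record_keys:
        return True  # no filter specified: keep every record
    if regex:
        # Treat each entry in record_keys as a regex pattern tried against
        # every top-level key of the record
        return any(re.search(pattern, key) for pattern in record_keys for key in record)
    return any(key in record for key in record_keys)


print(record_filter({"id": 1, "title": "Agile"}, ["tit.*"]))  # True
print(record_filter({"id": 1}, ["title"], regex=False))       # False
```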

scholar_flux.data.path_data_processor module

The scholar_flux.data.path_data_processor module implements the PathDataProcessor, which uses a custom path-processing implementation to dynamically flatten and format JSON records to retrieve nested key-value pairs.

Similar to the RecursiveDataProcessor, the PathDataProcessor can be used to dynamically filter, process, and flatten nested paths while formatting the output based on its specification.

class scholar_flux.data.path_data_processor.PathDataProcessor(json_data: RecordType | RecordList | None = None, value_delimiter: str | None = None, ignore_keys: list | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_cache: bool | None = True)[source]

Bases: ABCDataProcessor

The PathDataProcessor uses a custom implementation of Trie-based processing to abstract nested key-value combinations into path-node pairs where the path defines the full range of nested keys that need to be traversed to arrive at each terminal field within each individual record.

This implementation automatically and dynamically flattens and filters a single page of records (a list of dictionary-based records) extracted from a response at a time to return the processed record data.

Example

>>> from scholar_flux.data import PathDataProcessor
>>> path_data_processor = PathDataProcessor() # instantiating the class
>>> data = [{'id':1, 'a':{'b':'c'}}, {'id':2, 'b':{'f':'e'}}, {'id':2, 'c':{'h':'g'}}]
### The process_page method can then be referenced using the processor as a callable:
>>> result = path_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'a.b': 'c'}, {'id': '2', 'b.f': 'e'}, {'id': '2', 'c.h': 'g'}]
__init__(json_data: RecordType | RecordList | None = None, value_delimiter: str | None = None, ignore_keys: list | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_cache: bool | None = True) None[source]

Initializes the data processor with JSON data and optional parameters for processing.

property cached: bool

Property indicating whether the underlying path node index uses a cache of weak references to nodes.

discover_keys() dict[str, Any] | None[source]

Discovers all keys within the JSON data.

property json_data: RecordList | None

A list of dictionary-based records to further process.

load_data(json_data: RecordType | RecordList | None = None) bool[source]

Attempts to load a data dictionary or list, contingent on the input having at least one non-missing record.

If json_data is missing or the json input is equal to the current json_data attribute, then the json_data attribute will not be updated from the json input.

Parameters:

json_data (Optional[RecordType | RecordList]) – The json data to be loaded as an attribute.

Returns:

Indicates whether the data was successfully loaded (True) or not (False).

Return type:

bool

process_page(parsed_records: RecordType | RecordList | None = None, keep_keys: list[str] | None = None, ignore_keys: list[str] | None = None, combine_keys: bool = True, regex: bool | None = None) RecordList[source]

Processes each individual record dict from the JSON data.

process_record(record_index: int, keep_keys: list | None = None, ignore_keys: list | None = None, regex: bool | None = None) None[source]

Processes the current record dictionary, indicating if the record at the index should be retained or dropped.

The full set of processed records is subsequently accessible via processor.path_node_index.simplify_to_rows().

classmethod record_filter(record_dict: dict[ProcessingPath, Any], record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Identifies whether a record contains a path (key), indicating whether the record should be retained.

structure(flatten: bool = False, show_value_attributes: bool = False) str[source]

Method for showing the structure of the current PathDataProcessor and identifying the current configuration.

Useful for showing the options being used to process the API response records.

scholar_flux.data.recursive_data_processor module

The scholar_flux.data.recursive_data_processor module implements the RecursiveDataProcessor, which performs dynamic, automatic recursive retrieval of nested key-data pairs from listed dictionary records.

The data processor can be used to flatten and filter records based on conditions and extract nested data for each record in the response.

class scholar_flux.data.recursive_data_processor.RecursiveDataProcessor(json_data: list[dict] | None = None, value_delimiter: str | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_full_path: bool | None = True)[source]

Bases: ABCDataProcessor

Processes a list of raw page record dict data from the API response based on discovered record keys and flattens them into a list of dictionaries consisting of key-value pairs that simplify the interpretation of the final flattened JSON structure.

Example

>>> from scholar_flux.data import RecursiveDataProcessor
>>> data = [{'id':1, 'a':{'b':'c'}}, {'id':2, 'b':{'f':'e'}}, {'id':2, 'c':{'h':'g'}}]
# creating a basic processor
>>> recursive_data_processor = RecursiveDataProcessor() # instantiating the class
### The process_page method can then be referenced using the processor as a callable:
>>> result = recursive_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'b': 'c'}, {'id': '2', 'f': 'e'}, {'id': '2', 'h': 'g'}]
# To identify the full nested location of each record:
>>> recursive_data_processor = RecursiveDataProcessor(use_full_path=True) # instantiating the class
>>> result = recursive_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'a.b': 'c'}, {'id': '2', 'b.f': 'e'}, {'id': '2', 'c.h': 'g'}]
__init__(json_data: list[dict] | None = None, value_delimiter: str | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_full_path: bool | None = True) None[source]

Initializes the data processor with JSON data and optional parameters for processing.

Parameters:
  • json_data (list[dict]) – The json data set to process and flatten - a list of dictionaries is expected

  • value_delimiter (Optional[str]) – The delimiter used to join multiple values found at terminal paths

  • ignore_keys (Optional[list[str]]) – Determines records that should be omitted based on whether each record contains a key or substring. (off by default)

  • keep_keys (Optional[list[str]]) – Determines whether a record is kept based on the presence of a key. (off by default)

  • regex (Optional[bool]) – Determines whether to use regex filtering for filtering records based on the presence or absence of specific keywords

  • use_full_path (Optional[bool]) – Determines whether or not to keep the full path for the json record key. If False, the path is shortened, keeping the last key or set of keys while preventing name collisions.
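
A minimal sketch of the use_full_path option under these assumptions (illustrative only; the real processor also joins values with value_delimiter, filters keys, and prevents name collisions when shortening paths):

```python
def flatten(record: dict, use_full_path: bool = True, _prefix: str = "") -> dict:
    """Flatten nested dictionaries, optionally keeping only terminal keys."""
    flat = {}
    for key, value in record.items():
        path = f"{_prefix}.{key}" if _prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, use_full_path, path))
        else:
            # Keep the full dotted path, or only the terminal key when shortened
            flat[path if use_full_path else str(key)] = value
    return flat


data = {"id": 2, "b": {"f": "e"}}
print(flatten(data))                       # {'id': 2, 'b.f': 'e'}
print(flatten(data, use_full_path=False))  # {'id': 2, 'f': 'e'}
```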

discover_keys() dict[str, list[str]] | None[source]

Discovers all keys within the JSON data.

filter_keys(prefix: str | None = None, min_length: int | None = None, substring: str | None = None, pattern: str | None = None, include: bool = True, **kwargs: Any) dict[str, list[str]][source]

Filters discovered keys based on specified criteria.

property json_data: RecordList | None

A list of dictionary-based records to further process.

load_data(json_data: RecordType | RecordList | None = None) bool[source]

Attempts to load a data dictionary or list, contingent on the input having at least one non-missing record.

If json_data is missing, or the json input is equal to the current json_data attribute, then the json_data attribute will not be updated from the json input.

Parameters:

json_data (Optional[RecordType | RecordList]) – The json data to be loaded as an attribute.

Returns:

Indicates whether the data was successfully loaded (True) or not (False).

Return type:

bool

process_page(parsed_records: list[dict] | None = None, keep_keys: list[str] | None = None, ignore_keys: list[str] | None = None, regex: bool | None = None) list[dict][source]

Processes each individual record dict from the JSON data.

process_record(record_dict: RecordType, **kwargs: Any) RecordType[source]

Processes and flattens record dictionary, extracting record data and article content in the process.

classmethod record_filter(record_dict: RecordType, record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Indicates if the current record contains any of the keys.

Module contents

The scholar_flux.data module contains components that process the raw responses, enabling end users to interact with structured and formatted data after the scholar_flux SearchApi receives a valid response. This module, after receiving the response, performs the following steps: Response Parsing –> Record Extraction –> Record Processing.

Stages:
Response Parsing:

Extracts XML, JSON, or YAML-based responses from the response content. The response content is automatically parsed depending on the content type listed in the response header. This can be further customized to enable the processing of other content types in a streamlined way.

Record Extraction:

This phase involves the extraction of metadata and records from parsed API responses. The process can be performed in two ways:

1. The paths of records are listed ahead of time, indicating individual metadata fields and where the list of JSON records can be found if available.

2. The metadata and records can be identified automatically using heuristics instead. Records are generally identified as a list of dictionaries where each list entry is a separate record that may contain similar sets of fields.

The record extraction phase then returns the records and metadata as a tuple in that order.

Record Processing:

The final stage of the response processing pipeline, where the records are flattened, processed, and filtered. This stage often involves flattening each individual record element into the path where the data can be found and the value found at the end of the nested path. This stage also allows individual records to be filtered by key - paths can be retained or removed based on whether they contain a regex pattern or fixed string. The results are then returned as a list of flattened dictionaries, depending on the Processor chosen.
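
The three stages can be illustrated end to end with only the standard library. This sketch stands in for the scholar_flux classes and uses made-up field names; the actual pipeline components are far more configurable:

```python
import json
import re

raw = b'{"metadata": {"total": 2}, "records": [{"id": 1, "title": "Agile"}, {"id": 2}]}'

# 1. Response Parsing: raw bytes -> Python structures
parsed = json.loads(raw)

# 2. Record Extraction: split the record list from the metadata
records, metadata = parsed["records"], parsed["metadata"]

# 3. Record Processing: keep only records whose keys match a keep pattern
keep_keys = ["title"]
processed = [r for r in records if any(re.search(p, k) for p in keep_keys for k in r)]
print(processed)  # [{'id': 1, 'title': 'Agile'}]
```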

Processors:

  • DataProcessor:

    Requires the end-user to manually specify the paths where data should be extracted in each record as well as a key that should correspond to the extracted value in each record.

  • NormalizingDataProcessor:

    Inherits from the DataProcessor and implements flattening prior to extracting the required parameters needed to normalize field maps. Useful in later steps of processing where fields may or may not already be normalized.

  • PassThroughDataProcessor:

    The simplest implementation of the DataProcessor that does not automatically flatten records. This implementation still allows for the filtering of records similarly to the DataProcessor.

  • RecursiveDataProcessor:

    A recursive implementation that dynamically discovers terminal paths and flattens them, using the path as the key for the extracted value.

  • PathDataProcessor:

    A custom implementation of a data processor that uses trie-based processing to efficiently process and filter a flattened and processed list of JSON records. This implementation is universally applicable to JSON-formatted data and allows for further customization in the specifics of how records (and JSON dictionaries) are processed.

Each element in the processing pipeline is designed to be extensible and can be further customized and used in the retrieval of response data using base/ABC implementations:

  • BaseDataParser

  • BaseDataExtractor

  • ABCDataProcessor

The resulting classes can then be used as such:
>>> from scholar_flux.data import DataParser, DataExtractor, PathDataProcessor
>>> from scholar_flux.api import SearchCoordinator
>>> search_coordinator = SearchCoordinator(query='Pharmaceuticals', parser=DataParser(), extractor=DataExtractor(), processor=PathDataProcessor())
>>> response = search_coordinator.search(page = 1)
>>> response
# OUTPUT: <ProcessedResponse(len=50, cache_key='plos_Pharmaceuticals_1_50', metadata=...)>
### Elements from each stage of the process can be accessed:
>>> response.parsed_response # a JSON formatted response after parsing the response with the search_coordinator.parser
>>> response.extracted_records # list of dictionaries containing records extracted using the search_coordinator.extractor
>>> response.data # the list of dictionaries processed from the search_coordinator.processor
class scholar_flux.data.ABCDataProcessor(*args: Any, **kwargs: Any)[source]

Bases: ABC

The ABCDataProcessor is the base class from which all other processors are created.

The purpose of all subclasses of the ABCDataProcessor is to transform extracted records into a format suitable for future data processing pipelines. More specifically, its responsibilities include:

Processing a specific key from a record by joining non-None values into a string.

Processing a record dictionary to extract record and article content, creating a processed record dictionary with an abstract field.

Processing a list of raw page record dict data from the API response based on record keys.

All subclasses, at minimum, are expected to implement the process_page method which would effectively transform the records of each page into the intended list of dictionaries.

__init__(*args: Any, **kwargs: Any) None[source]

Initializes record keys and header/body paths in the object instance using defined methods.

define_record_keys(*args: Any, **kwargs: Any) dict | None[source]

Abstract method to be optionally implemented to determine record keys that should be parsed to process each record.

define_record_path(*args: Any, **kwargs: Any) Tuple | None[source]

Abstract method to be optionally implemented to define header and body paths for record extraction, with default paths provided if not specified.

discover_keys(*args: Any, **kwargs: Any) dict | None[source]

Abstract method to be optionally implemented to discover nested key paths in json data structures.

ignore_record_keys(*args: Any, **kwargs: Any) list | None[source]

Abstract method to be optionally implemented to ignore certain keys in records when processing records.

load_data(*args: Any, **kwargs: Any) Any[source]

Helper method that is optionally implemented by subclasses to load JSON data into customized implementations of processors.

process_key(*args: Any, **kwargs: Any) str | None[source]

Abstract method to be optionally implemented for processing keys from records.

abstract process_page(*args: Any, **kwargs: Any) list[dict][source]

Must be implemented in subclasses for processing entire pages of records.

process_record(*args: Any, **kwargs: Any) dict | None[source]

Abstract method to be optionally implemented for processing a single record in a json data structure.

Used to extract record data and article content, creating a processed record dictionary with an abstract field.

process_text(*args: Any, **kwargs: Any) str | None[source]

Abstract method to be optionally implemented for processing a record dictionary to extract record and article content, creating a processed record dictionary with an abstract field.

classmethod record_filter(*args: Any, **kwargs: Any) bool | None[source]

Optional filter implementation to handle record screening using regex or other logic.

Subclasses can customize filtering if required.

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for quickly showing a representation of the overall structure of the current Processor subclass. The instance uses the generate_repr helper function to produce human-readable representations of the core structure of the processing configuration along with its defaults.

Returns:

The structure of the current Processor subclass as a string.

Return type:

str

class scholar_flux.data.BaseDataExtractor(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None)[source]

Bases: object

Base DataExtractor implementing the minimum components necessary to extract records and metadata from parsed responses when the location of records and metadata is known beforehand.

__init__(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None)[source]

Initialize the DataExtractor with metadata and records to extract separately.

If record_path or metadata_path is specified, then the data extractor will attempt to retrieve the metadata and records at the provided paths. Note that, as metadata paths can be associated with multiple keys, starting from the outermost dictionary, we may have to specify a dictionary whose keys denote metadata variables and whose values are lists indicating how to retrieve each value. The path can also be given as a list of lists describing how to retrieve the last element.

While the encouraged type for record_path is a list of strings that each represent each nested path element to be traversed to arrive at a value for a field, a delimited string can also be used with the default delimiter being scholar_flux.utils.PathStr.DELIMITER. Similarly, a list or dictionary of path strings can also be used as shorthand for the individual metadata fields containing relevant metadata values.


Parameters:
  • record_path (Optional[List[str]]) – Custom path to find records in the parsed data. Contains a list of strings and, rarely, integer indexes indicating how to recursively find the list of records.

  • metadata_path (List[List[str]] | Optional[Dict[str, List[str]]]) – Identifies the paths in a dictionary associated with metadata as opposed to records. This can be a list of paths where each element is a list describing how to arrive at a terminal element.
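
Path-based retrieval can be illustrated as follows. The page layout and keys ('response', 'docs', 'numFound') are hypothetical, and this sketch omits the error handling and path-string parsing the extractor performs:

```python
from typing import Any


def get_path(data: Any, path: list) -> Any:
    """Walk a nested structure one key (or list index) at a time."""
    for step in path:
        data = data[step]
    return data


page = {"response": {"docs": [{"id": 1}], "numFound": 1}}

# record_path as a list of nested keys; metadata_path as a dict of name -> path
records = get_path(page, ["response", "docs"])
metadata = {"total": get_path(page, ["response", "numFound"])}
print(records, metadata)  # [{'id': 1}] {'total': 1}
```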

extract(parsed_page: list[dict] | dict) tuple[RecordList | None, MetadataType | None][source]

Extract both records and metadata from the parsed page dictionary.

Parameters:

parsed_page (Union[list[dict], dict]) – The dictionary containing the page data and metadata to be extracted.

Returns:

A tuple containing the list of records and the metadata dictionary.

Return type:

tuple[Optional[RecordList], Optional[MetadataType]]

extract_metadata(parsed_page_dict: dict[str, Any]) MetadataType[source]

Extract metadata from the parsed page dictionary.

Parameters:

parsed_page_dict (Dict) – The dictionary containing the page data to be parsed.

Returns:

The extracted metadata.

Return type:

Dict

extract_records(parsed_page_dict: dict) RecordList | None[source]

Extract records from parsed data as a list of dicts.

Parameters:

parsed_page_dict (Dict) – The dictionary containing the page data to be parsed.

Returns:

A list of records as dictionaries, or None if extraction fails.

Return type:

Optional[RecordList]

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Base method for showing the structure of the current Data Extractor. This method reveals the configuration settings of the extractor config that will be used to extract records and metadata.

Returns:

The current structure of the BaseDataExtractor or its subclass.

Return type:

str

classmethod update(data_extractor: Self, **data_extractor_kwargs: Any) Self[source]

Helper method for creating a new BaseDataExtractor instance, replacing only the specified components.

Parameters:
  • data_extractor (Self) – A previously created BaseDataExtractor instance

  • **data_extractor_kwargs – Keyword arguments used to replace components of the BaseDataExtractor. Unspecified fields from the previous BaseDataExtractor remain unchanged.

Returns:

A new data extractor instance with the specified parameter updates

Return type:

BaseDataExtractor

class scholar_flux.data.BaseDataParser[source]

Bases: object

Base class responsible for parsing typical formats seen in APIs that send news and academic articles in XML, JSON, and YAML formats.

__init__() None[source]

On initialization, the data parser is set to use built-in class methods to parse JSON, XML, and YAML-based response content by default, and the parse helper class to determine which parser to use based on the Content-Type.

Parameters:
  • additional_parsers (Optional[dict[str, Callable]]) – Allows for the addition of new parsers and overrides to class methods to be used on content-type identification.

classmethod detect_format(response: Response | ResponseProtocol) str | None[source]

Helper method for determining the format corresponding to a response object.

classmethod get_default_parsers() dict[str, Callable][source]

Helper method used to retrieve the default parsers to parse XML, JSON, and YAML response data.

Returns:

A dictionary of data parsers that can be used to parse response data into usable JSON format.

Return type:

dict[str, Callable]

parse(response: Response | ResponseProtocol) dict | list[dict] | None[source]

Uses one of the default parsing methods to extract a dictionary of data from the response content.

classmethod parse_from_defaults(response: Response | ResponseProtocol) dict | list[dict] | None[source]

Detects the API response format if a format is not already specified and uses one of the default structures to parse the data structure into a dictionary depending on the content type stored in the API response header.

Parameters:

response (response type) – The response (or response-like) object from the API request.

Returns:

response dict containing fields including a list of metadata records as dictionaries.

Return type:

dict
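
The dispatch described above can be approximated as follows. This is an illustrative sketch, not the class's code: only the JSON branch is wired up, and the real parser derives the content type from the response header rather than taking it as an explicit argument.

```python
import json
from typing import Callable

# Map normalized content types to parser callables; XML and YAML parsers
# (via the optional xmltodict and PyYAML libraries) would slot in the same way
PARSERS: dict[str, Callable] = {
    "application/json": json.loads,
}


def parse(content: bytes, content_type: str):
    """Pick a parser from the content type; return None when unrecognized."""
    parser = PARSERS.get(content_type.split(";")[0].strip().lower())
    if parser is None:
        return None  # unknown format: leave parsing to a custom parser
    return parser(content)


print(parse(b'{"records": []}', "application/json; charset=utf-8"))
# {'records': []}
```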

classmethod parse_json(content: bytes) dict | list[dict][source]

Uses the standard json library to parse JSON content into a dictionary.

classmethod parse_xml(content: bytes) dict | list[dict][source]

Uses the optional xmltodict library to parse XML content into a dictionary.

classmethod parse_yaml(content: bytes) dict | list[dict][source]

Uses the optional yaml library to parse YAML content.

structure(flatten: bool = False, show_value_attributes: bool = True) str[source]

Helper method for retrieving a string representation of the structure of the current BaseDataParser or subclass of the BaseDataParser.

Override this for more specific descriptions of attributes and defaults. Useful for showing the options being used for parsing response content into dictionary objects.

Returns:

A string representation of the base parser indicating all registered or default parsers

Return type:

str

class scholar_flux.data.DataExtractor(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None, dynamic_record_identifiers: list | tuple | None = None, dynamic_metadata_identifiers: list | tuple | None = None, annotate_records: bool | None = None)[source]

Bases: BaseDataExtractor

The DataExtractor allows for the streamlined extraction of records and metadata from responses retrieved from APIs. This proceeds as the second stage of the response processing step where metadata and records are extracted from parsed responses.

The data extractor provides two ways to identify metadata paths and record paths:

  1. Manual identification: If record_path or metadata_path is specified, then the data extractor will attempt to retrieve the metadata and records at the provided paths. Note that, as metadata paths can be associated with multiple keys, starting from the outermost dictionary, we may have to specify a dictionary whose keys denote metadata variables and whose values are lists indicating how to retrieve each value. The path can also be given as a list of lists describing how to retrieve the last element.

  2. Dynamic identification: Uses heuristics to distinguish records from metadata. Records will nearly always be defined by a list containing only dictionaries as its elements, while the metadata will generally contain a variety of elements, some nested and others as integers, strings, etc. In cases where it's harder to determine, we can use dynamic_record_identifiers to determine whether a list containing a single nested dictionary is a record or metadata. For scientific purposes, its keys may contain 'abstract', 'title', 'doi', etc. This can be defined manually by users if the defaults are not reliable for a given API.

Upon initializing the class, the class can be used as a callable that returns the records and metadata in that order.

Example

>>> from scholar_flux.data import DataExtractor
>>> data = dict(query='specification driven development', options={'record_count':5,'response_time':'50ms'})
>>> data['records'] = [dict(id=1, record='protocol vs.code'), dict(id=2, record='Impact of Agile')]
>>> extractor = DataExtractor(annotate_records=False)
>>> records, metadata = extractor(data)
>>> print(metadata)
# OUTPUT: {'query': 'specification driven development', 'record_count': 5, 'response_time': '50ms'}
>>> print(records)
# OUTPUT: [{'id': 1, 'record': 'protocol vs.code'}, {'id': 2, 'record': 'Impact of Agile'}]
Record Annotation:

When annotate_records=True, each extracted record receives two fields for downstream linkage after processing/flattening:

  • _extraction_index: Zero-based position in the extracted record list

  • _record_id: Content-based hash in format “hash_index” (e.g., “a1b2c3d4_0”)

These fields enable resolution back to original records when order may change or records are deduplicated. The hash is generated from record content excluding internal fields (those starting with ‘_’), ensuring stability across runs for identical content.

Example:
>>> extractor = DataExtractor(annotate_records=True)
>>> records, metadata = extractor(data)
>>> records[0]['_extraction_index']
# OUTPUT: 0
>>> records[0]['_record_id']
# OUTPUT: 'a9e3e93e_0'
>>> records[0]
# OUTPUT: {'id': 1, 'record': 'protocol vs.code', '_extraction_index': 0, '_record_id': 'a9e3e93e_0'}
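
The annotation scheme can be approximated as below. The exact hash function and serialization used by DataExtractor are not specified here, so SHA-1 over a sorted JSON dump is an assumption; only the general shape (short content hash plus index, excluding underscore-prefixed fields) follows the documentation above.

```python
import hashlib
import json


def annotate(records: list) -> list:
    annotated = []
    for index, record in enumerate(records):
        # Hash only public fields (keys not starting with '_') so the id is
        # stable across runs for identical content
        content = {k: v for k, v in record.items() if not k.startswith("_")}
        digest = hashlib.sha1(json.dumps(content, sort_keys=True).encode()).hexdigest()[:8]
        annotated.append({**record, "_extraction_index": index, "_record_id": f"{digest}_{index}"})
    return annotated


records = annotate([{"id": 1, "title": "Agile"}])
print(records[0]["_extraction_index"], records[0]["_record_id"])
```
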
DEFAULT_DYNAMIC_METADATA_IDENTIFIERS = ('metadata', 'facets', 'IdList')
DEFAULT_DYNAMIC_RECORD_IDENTIFIERS = ('title', 'doi', 'abstract')
EXTRACTION_INDEX_KEY = '_extraction_index'
RECORD_ID_KEY = '_record_id'
__init__(record_path: list | None = None, metadata_path: list[list] | dict[str, list] | None = None, dynamic_record_identifiers: list | tuple | None = None, dynamic_metadata_identifiers: list | tuple | None = None, annotate_records: bool | None = None)[source]

Initialize the DataExtractor with optional path overrides for metadata and records.

Parameters:
  • record_path (Optional[List[str]]) – Custom path to find records in the parsed data. Contains a list of strings and, rarely, integer indexes indicating how to recursively find the list of records.

  • metadata_path (List[List[str]] | Optional[Dict[str, List[str]]]) – Identifies the paths in a dictionary associated with metadata as opposed to records. This can be a list of paths where each element is a list describing how to get to a terminal element.

  • dynamic_record_identifiers (Optional[List[str]]) – Helps to identify dictionary keys that only belong to records when dealing with a single element that would otherwise be classified as metadata.

  • dynamic_metadata_identifiers (Optional[List[str]]) – Helps to identify dictionary keys that are likely to only belong to metadata that could otherwise share a similar structure to a list of dictionaries, similar to what’s seen with records.

  • annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).

dynamic_identification(parsed_page_dict: dict) tuple[RecordList, MetadataType][source]

Dynamically identify and separate metadata from records. This function recursively traverses the dictionary and uses a heuristic to determine whether a specific value corresponds to metadata or is a list of records: generally, keys associated with records will contain only lists of dictionaries. On the other hand, nested structures containing metadata will be associated with a singular value, or a dictionary of keys each associated with a singular value that is not a list. Using this heuristic, we're able to distinguish metadata from records with a high degree of confidence.

Parameters:

parsed_page_dict (Dict) – The dictionary containing the page data and metadata to be extracted.

Returns:

A tuple containing the list of record dictionaries and the metadata dictionary.

Return type:

tuple[RecordList, MetadataType]
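
A simplified, top-level-only version of this heuristic might look like the following (the actual method recurses into nested structures and also consults the dynamic record/metadata identifiers):

```python
def dynamic_identification(page: dict) -> tuple:
    """Split a parsed page into (records, metadata) by structure alone."""
    records: list = []
    metadata: dict = {}
    for key, value in page.items():
        if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            records = value  # a non-empty list of dictionaries: looks like records
        else:
            metadata[key] = value  # singular or mixed values: treated as metadata
    return records, metadata


page = {"query": "agile", "count": 2, "docs": [{"id": 1}, {"id": 2}]}
records, metadata = dynamic_identification(page)
print(records)   # [{'id': 1}, {'id': 2}]
print(metadata)  # {'query': 'agile', 'count': 2}
```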

extract(parsed_page: list[dict] | dict) tuple[RecordList | None, MetadataType | None][source]

Extract both records and metadata from the parsed page dictionary.

Parameters:

parsed_page (RecordList | dict) – The dictionary containing the page data and metadata to be extracted.

Returns:

A tuple containing the list of records and the metadata dictionary.

Return type:

tuple[Optional[RecordList], Optional[MetadataType]]

classmethod strip_annotations(records: RecordType) RecordType[source]
classmethod strip_annotations(records: NormalizedRecordList) NormalizedRecordList
classmethod strip_annotations(records: RecordList) RecordList
classmethod strip_annotations(records: None) None

Removes metadata annotations from records by filtering out keys prefixed with underscore.

This method creates clean copies of records without the internal pipeline metadata fields (e.g., ‘_extraction_index’, ‘_record_id’) that may be added during processing when record annotation is enabled.

Parameters:

records (RecordType | RecordList) – A single dictionary record or a list of dictionary records to clean. Records should contain dictionary elements with string keys.

Returns:

A new dictionary with annotation fields removed if the input is a single record, or a new list of dictionaries with annotation fields removed if the input is a list.

Return type:

RecordType

Note

The original records are not modified. This method instead returns a new dictionary or a new list of dictionaries with only non-annotation fields preserved.
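The filtering behavior can be illustrated with a minimal, self-contained sketch; the helper below is hypothetical and only mirrors the documented removal of underscore-prefixed keys:

```python
def strip_annotations(record: dict) -> dict:
    """Return a copy of the record without underscore-prefixed annotation
    fields such as '_extraction_index' and '_record_id'."""
    return {k: v for k, v in record.items() if not str(k).startswith("_")}

annotated = {"id": 1, "title": "A", "_extraction_index": 0, "_record_id": "ab12-0"}
clean = strip_annotations(annotated)
# clean == {'id': 1, 'title': 'A'}; the original record is left unmodified
```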

classmethod update(data_extractor: BaseDataExtractor, **data_extractor_kwargs: Any) Self[source]

Helper method for creating a new DataExtractor instance, replacing only the specified components.

Parameters:
  • data_extractor (Self) – A previously created DataExtractor instance

  • **data_extractor_kwargs – Keyword arguments used to replace components of the DataExtractor. Unspecified fields from the previous DataExtractor remain unchanged.

Returns:

A new data extractor instance with the specified parameter updates

Return type:

DataExtractor

class scholar_flux.data.DataParser(additional_parsers: dict[str, Callable] | None = None)[source]

Bases: BaseDataParser

Extensible class that handles the identification and parsing of typical formats (XML, JSON, and YAML) seen in APIs that serve news and academic articles.

The BaseDataParser contains each of the necessary class elements to parse JSON, XML, and YAML formats as class methods while this class allows for the specification of additional parsers.

Parameters:

additional_parsers (Optional[dict[str, Callable]]) – Allows overrides for parsers in addition to the JSON, XML and YAML parsers that are enabled by default.

__init__(additional_parsers: dict[str, Callable] | None = None)[source]

On initialization, the data parser is set to use built-in class methods to parse JSON, XML, and YAML-based response content by default, and the parse helper class to determine which parser to use based on the Content-Type.

Parameters:
  • additional_parsers (Optional[dict[str, Callable]]) – Allows for the addition of new parsers and overrides to class methods to be used on content-type identification.

parse(response: Response | ResponseProtocol, format: str | None = None) dict | list[dict] | None[source]

Parses the API response content using two core steps.

  1. Detects the API response format if a format is not already specified

  2. Uses the previously determined format to parse the content of the response and return a parsed dictionary (json) structure.

Parameters:
  • response (requests.Response | ResponseProtocol) – The response or response-like object from the API request.

  • format (Optional[str]) – The format to use when parsing the response; detected automatically when not specified.

Returns:

response dict containing fields including a list of metadata records as dictionaries.

Return type:

dict

class scholar_flux.data.DataProcessor(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = '; ', regex: bool | None = True)[source]

Bases: ABCDataProcessor

Initialize the DataProcessor with explicit extraction paths and options. The DataProcessor performs the selective extraction of specific fields from each record within a page (list) of JSON (dictionary) records and assumes that the paths to extract are known beforehand.

Parameters:
  • record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.

  • ignore_keys – List of keys to ignore during processing.

  • keep_keys – List of keys that records must contain to be retained during processing.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for ignore filtering.

Examples
>>> from scholar_flux.data import DataProcessor
>>> data = [{'id':1, 'school':{'department':'NYU Department of Mathematics'}},
>>>         {'id':2, 'school':{'department':'GSU Department of History'}},
>>>         {'id':3, 'school':{'organization':'Pharmaceutical Research Team'}}]
# creating a basic processor
>>> data_processor = DataProcessor(record_keys = [['id'], ['school', 'department'], ['school', 'organization']]) # instantiating the class
### The process_page method can then be referenced using the processor as a callable:
>>> result = data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': 1, 'school.department': 'NYU Department of Mathematics', 'school.organization': None},
#          {'id': 2, 'school.department': 'GSU Department of History', 'school.organization': None},
#          {'id': 3, 'school.department': None, 'school.organization': 'Pharmaceutical Research Team'}]
# String paths can also be used to accomplish the same:
>>> data_processor = DataProcessor(record_keys = ['id', 'school.department', 'school.organization']) # instantiating the class
>>> assert data_processor.process_page(data) == result
__init__(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = '; ', regex: bool | None = True) None[source]

Initialize the DataProcessor with explicit extraction paths and options.

Parameters:
  • record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.

  • ignore_keys – List of keys to ignore during processing.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for ignore filtering.

collapse_fields(processed_record_dict: dict) dict[str, list[str | int] | str | int][source]

Helper method for joining lists of data into a singular string for flattening.

static extract_key(record: RecordType | RecordList | None, key: str | int, path: list[str | int] | None = None) list | None[source]

Processes a specific key from a record by retrieving the value associated with the key at the nested path. If value_delimiter is set, the method joins non-None values into a string using the delimiter; otherwise, keys with lists as values retain the lists unedited.

Parameters:
  • record – The JSON structure (generally a nested list or dictionary) to extract the key from.

  • key – The key to process within the record dictionary.

Returns:

The value found at the specified key within a dictionary nested in a list, and otherwise None.

Return type:

list
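A minimal sketch of this path-walking and delimiter-joining behavior, assuming dictionary-only traversal (the real method also handles lists nested within records); the helper below is a hypothetical simplification:

```python
def extract_key(record, path, value_delimiter="; "):
    """Walk a nested path through dictionaries, returning None when a key
    is missing; join list values with the delimiter when one is set."""
    value = record
    for key in path:
        if isinstance(value, dict):
            value = value.get(key)
        else:
            return None  # path continues past a non-dictionary value
    if isinstance(value, list) and value_delimiter is not None:
        # join non-None values into a single delimited string
        return value_delimiter.join(str(v) for v in value if v is not None)
    return value

record = {"school": {"tags": ["math", None, "research"]}}
# extract_key(record, ["school", "tags"]) == "math; research"
```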

process_page(parsed_records: RecordList, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = None) list[dict][source]

Core method of the data processor that enables the processing of lists of dictionary records to filter and process records based on the configuration of the current DataProcessor.

Parameters:
  • parsed_records (list[dict[str | int, Any]]) – The records to process and/or filter

  • ignore_keys (Optional[list[str]]) – Optional overrides that identify records to ignore based on the presence of specific keys or regex patterns.

  • keep_keys (Optional[list[str]]) – Optional overrides identifying records to keep based on the presence of specific keys or regex patterns.

  • regex (Optional[bool]) – Determines whether or not to filter records using regular expressions.

process_record(record_dict: RecordType) dict[str, Any][source]

Processes a record dictionary to extract record data and article content, creating a processed record dictionary with an abstract field.

Parameters:

record_dict (RecordType) – The dictionary containing the record data.

Returns:

A processed record dictionary with record keys processed and an abstract field created from the article content.

Return type:

dict

classmethod record_filter(record_dict: RecordType, record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Helper method that filters records using regex pattern matching, checking if any of the keys provided in the function call exist.

update_record_keys(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str]) None[source]

A helper method for transforming and updating the current dictionary of record keys with a new list.

class scholar_flux.data.NormalizingDataProcessor(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = None, regex: bool | None = True, traverse_lists: bool | None = True)[source]

Bases: DataProcessor

A data processor that flattens records before extraction, extending DataProcessor.

This processor adds a normalization step to DataProcessor:

  1. Flattens each record into dot-notation keys (e.g., “school.department”)

  2. Extracts specified fields using parent class logic

  3. Handles already-flattened records (idempotent operation)

Inherits all functionality from DataProcessor, including:

  • Field extraction via record_keys

  • Record filtering via ignore_keys/keep_keys

  • Value collapsing via value_delimiter

Parameters:
  • record_keys – Keys to extract (same as DataProcessor).

  • ignore_keys – List of keys to ignore during processing.

  • keep_keys – List of keys that must be present to keep a record.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for filtering.

  • use_full_path – Whether to preserve full paths in flattened keys.

Examples

>>> from scholar_flux.data import NormalizingDataProcessor
>>> data = [{'id':1, 'school':{'department':'NYU Department of Mathematics'}},
>>>         {'id':2, 'school':{'department':'GSU Department of History'}},
>>>         {'id':3, 'school':{'organization':'Pharmaceutical Research Team'}}]
# creating a basic processor
>>> data_processor = NormalizingDataProcessor(record_keys = [['id'], ['school', 'department'], ['school', 'organization']]) # instantiating the class
### The process_page method can then be referenced using the processor as a callable:
>>> result = data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': 1, 'school.department': 'NYU Department of Mathematics', 'school.organization': None},
#          {'id': 2, 'school.department': 'GSU Department of History', 'school.organization': None},
#          {'id': 3, 'school.department': None, 'school.organization': 'Pharmaceutical Research Team'}]
# String paths can also be used to accomplish the same:
>>> data_processor = NormalizingDataProcessor(record_keys = ['id', 'school.department', 'school.organization']) # instantiating the class
>>> assert data_processor.process_page(data) == result
__init__(record_keys: dict[str | int, Any] | dict[str, Any] | list[list[str | int]] | list[list[str]] | list[str] | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, value_delimiter: str | None = None, regex: bool | None = True, traverse_lists: bool | None = True) None[source]

Initializes the NormalizingDataProcessor.

Parameters:
  • record_keys – Keys to extract, as a dict of output_key to path, or a list of paths.

  • ignore_keys – List of keys to ignore during processing.

  • value_delimiter – Delimiter for joining multiple values.

  • regex – Whether to use regex for ignore filtering.

  • traverse_lists – (Optional[bool]): Determines whether lists are automatically traversed when indices are not specified in the path.

process_record(record_dict: RecordType) NormalizedRecordType[source]

Process a single record by flattening it first, then extracting fields.

Overrides parent method to add flattening step before field extraction.

Parameters:

record_dict (RecordType) – The dictionary containing the record data.

Returns:

A processed record with specified keys extracted.

Return type:

NormalizedRecordType
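The flattening step that precedes extraction can be sketched as follows; the `flatten` helper is a simplified, hypothetical illustration of dot-notation flattening and its idempotence:

```python
def flatten(record: dict, parent: str = "") -> dict:
    """Flatten nested dictionaries into dot-notation keys. Already-flat
    records pass through unchanged, making the step idempotent."""
    flat = {}
    for key, value in record.items():
        path = f"{parent}.{key}" if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into nested dicts
        else:
            flat[path] = value
    return flat

nested = {"id": 1, "school": {"department": "Mathematics"}}
flat = flatten(nested)
# flat == {'id': 1, 'school.department': 'Mathematics'}
assert flatten(flat) == flat  # idempotent on already-flat input
```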

class scholar_flux.data.PassThroughDataProcessor(ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True)[source]

Bases: ABCDataProcessor

A basic data processor that retains all valid records without modification unless a specific filter for JSON keys is specified.

Unlike the DataProcessor, this implementation will not flatten records. Instead, all filtered and selected records retain their original nested structure.

__init__(ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True) None[source]

Initialize the PassThroughDataProcessor with explicit extraction paths and options.

Parameters:
  • ignore_keys – List of keys to ignore during processing.

  • keep_keys – List of keys that records must contain to be retained during processing.

  • regex – Whether to use regex for ignore filtering.

process_page(parsed_records: RecordList, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = None) list[dict][source]

Processes and returns each record as is if filtering the final list of records by key is not enabled.

process_record(record_dict: RecordType) RecordType[source]

A no-op method retained to maintain a similar interface as other DataProcessor implementations.

Parameters:

record_dict (RecordType) – The dictionary containing the record data.

Returns:

The original record dictionary, unchanged.

Return type:

RecordType

classmethod record_filter(record_dict: RecordType, record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Helper method that filters records using regex pattern matching, checking if any of the keys provided in the function call exist.
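The keep/ignore filtering used by these processors can be sketched as below; `record_filter` here is a hypothetical simplification of the documented regex-based key matching:

```python
import re

def record_filter(record: dict, record_keys, regex=True) -> bool:
    """Return True when any of the given key patterns matches a key in
    the record; with regex disabled, require an exact key match."""
    if not record_keys:
        return True  # no filter configured: keep every record
    keys = [str(k) for k in record]
    if regex:
        return any(re.search(pattern, key) for pattern in record_keys for key in keys)
    return any(pattern in keys for pattern in record_keys)

record = {"id": 1, "school.department": "History"}
# record_filter(record, ["department"]) is True (regex substring match)
# record_filter(record, ["organization"]) is False
```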

class scholar_flux.data.PathDataProcessor(json_data: RecordType | RecordList | None = None, value_delimiter: str | None = None, ignore_keys: list | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_cache: bool | None = True)[source]

Bases: ABCDataProcessor

The PathDataProcessor uses a custom implementation of Trie-based processing to abstract nested key-value combinations into path-node pairs where the path defines the full range of nested keys that need to be traversed to arrive at each terminal field within each individual record.

This implementation automatically and dynamically flattens and filters a single page of records (a list of dictionary-based records) extracted from a response at a time to return the processed record data.

Example

>>> from scholar_flux.data import PathDataProcessor
>>> path_data_processor = PathDataProcessor() # instantiating the class
>>> data = [{'id':1, 'a':{'b':'c'}}, {'id':2, 'b':{'f':'e'}}, {'id':2, 'c':{'h':'g'}}]
### The process_page method can then be referenced using the processor as a callable:
>>> result = path_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'a.b': 'c'}, {'id': '2', 'b.f': 'e'}, {'id': '2', 'c.h': 'g'}]
__init__(json_data: RecordType | RecordList | None = None, value_delimiter: str | None = None, ignore_keys: list | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_cache: bool | None = True) None[source]

Initializes the data processor with JSON data and optional parameters for processing.

property cached: bool

Property indicating whether the underlying path node index uses a cache of weak references to nodes.

discover_keys() dict[str, Any] | None[source]

Discovers all keys within the JSON data.

property json_data: RecordList | None

A list of dictionary-based records to further process.

load_data(json_data: RecordType | RecordList | None = None) bool[source]

Attempts to load a data dictionary or list, contingent on the input having at least one non-missing record.

If json_data is missing or the json input is equal to the current json_data attribute, then the json_data attribute will not be updated from the json input.

Parameters:

json_data (Optional[RecordType | RecordList]) – The json data to be loaded as an attribute.

Returns:

Indicates whether the data was successfully loaded (True) or not (False).

Return type:

bool
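The load guard described above can be sketched with a plain dictionary standing in for the processor instance; this is a hypothetical simplification of the documented behavior:

```python
def load_data(state: dict, json_data) -> bool:
    """Load new data only when it is non-missing and differs from the
    currently loaded data; report whether a load occurred."""
    if not json_data or json_data == state.get("json_data"):
        return False  # missing or unchanged input: keep the current data
    state["json_data"] = json_data
    return True

state = {}
assert load_data(state, [{"id": 1}]) is True   # new data is loaded
assert load_data(state, [{"id": 1}]) is False  # identical input is skipped
assert load_data(state, None) is False         # missing input is skipped
```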

process_page(parsed_records: RecordType | RecordList | None = None, keep_keys: list[str] | None = None, ignore_keys: list[str] | None = None, combine_keys: bool = True, regex: bool | None = None) RecordList[source]

Processes each individual record dict from the JSON data.

process_record(record_index: int, keep_keys: list | None = None, ignore_keys: list | None = None, regex: bool | None = None) None[source]

Processes the current record dictionary, indicating if the record at the index should be retained or dropped.

The full set of processed records is subsequently accessible via processor.path_node_index.simplify_to_rows().

classmethod record_filter(record_dict: dict[ProcessingPath, Any], record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Identifies whether a record contains a path (key), indicating whether the record should be retained.

structure(flatten: bool = False, show_value_attributes: bool = False) str[source]

Method for showing the structure of the current PathDataProcessor and identifying the current configuration.

Useful for showing the options being used to process the API response records.

class scholar_flux.data.RecursiveDataProcessor(json_data: list[dict] | None = None, value_delimiter: str | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_full_path: bool | None = True)[source]

Bases: ABCDataProcessor

Processes a list of raw page record dict data from the API response based on discovered record keys and flattens the records into a list of dictionaries consisting of key-value pairs that simplify the interpretation of the final flattened JSON structure.

Example

>>> from scholar_flux.data import RecursiveDataProcessor
>>> data = [{'id':1, 'a':{'b':'c'}}, {'id':2, 'b':{'f':'e'}}, {'id':2, 'c':{'h':'g'}}]
# creating a basic processor
>>> recursive_data_processor = RecursiveDataProcessor() # instantiating the class
### The process_page method can then be referenced using the processor as a callable:
>>> result = recursive_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'b': 'c'}, {'id': '2', 'f': 'e'}, {'id': '2', 'h': 'g'}]
# To identify the full nested location of each record:
>>> recursive_data_processor = RecursiveDataProcessor(use_full_path=True) # instantiating the class
>>> result = recursive_data_processor(data) # recursively flattens and processes by default
>>> print(result)
# OUTPUT: [{'id': '1', 'a.b': 'c'}, {'id': '2', 'b.f': 'e'}, {'id': '2', 'c.h': 'g'}]
__init__(json_data: list[dict] | None = None, value_delimiter: str | None = None, ignore_keys: list[str] | None = None, keep_keys: list[str] | None = None, regex: bool | None = True, use_full_path: bool | None = True) None[source]

Initializes the data processor with JSON data and optional parameters for processing.

Parameters:
  • json_data (list[dict]) – The json data set to process and flatten - a list of dictionaries is expected

  • value_delimiter (Optional[str]) – Indicates whether or not to join values found at terminal paths

  • ignore_keys (Optional[list[str]]) – Determines records that should be omitted based on whether each record contains a key or substring. (off by default)

  • keep_keys (Optional[list[str]]) – Indicates whether or not to keep a record if the key is present. (off by default)

  • regex (Optional[bool]) – Determines whether to use regex filtering for filtering records based on the presence or absence of specific keywords

  • use_full_path (Optional[bool]) – Determines whether or not to keep the full path for the json record key. If False, the path is shortened, keeping the last key or set of keys while preventing name collisions.

discover_keys() dict[str, list[str]] | None[source]

Discovers all keys within the JSON data.

filter_keys(prefix: str | None = None, min_length: int | None = None, substring: str | None = None, pattern: str | None = None, include: bool = True, **kwargs: Any) dict[str, list[str]][source]

Filters discovered keys based on specified criteria.

property json_data: RecordList | None

A list of dictionary-based records to further process.

load_data(json_data: RecordType | RecordList | None = None) bool[source]

Attempts to load a data dictionary or list, contingent on the input having at least one non-missing record.

If json_data is missing, or the json input is equal to the current json_data attribute, then the json_data attribute will not be updated from the json input.

Parameters:

json_data (Optional[RecordType | RecordList]) – The json data to be loaded as an attribute.

Returns:

Indicates whether the data was successfully loaded (True) or not (False).

Return type:

bool

process_page(parsed_records: list[dict] | None = None, keep_keys: list[str] | None = None, ignore_keys: list[str] | None = None, regex: bool | None = None) list[dict][source]

Processes each individual record dict from the JSON data.

process_record(record_dict: RecordType, **kwargs: Any) RecordType[source]

Processes and flattens record dictionary, extracting record data and article content in the process.

classmethod record_filter(record_dict: RecordType, record_keys: list[str] | None = None, regex: bool | None = None) bool[source]

Indicates if the current record contains any of the keys.

class scholar_flux.data.RecursiveJsonProcessor(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False, path_delimiter: str | None = None)[source]

Bases: object

An implementation of a recursive JSON dictionary processor that is used to process and identify nested components such as paths, terminal key names, and the data at each terminal path.

The utility of the RecursiveJsonProcessor lies in flattening dictionary records into flat representations whose keys represent the terminal paths at each node and whose values represent the data found at each terminal path.

__init__(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False, path_delimiter: str | None = None)[source]

Initialize the RecursiveJsonProcessor with a JSON dictionary and a delimiter for joining list elements.

Parameters:
  • json_dict (Dict) – The input JSON dictionary to be parsed.

  • object_delimiter (str) – The delimiter used to join elements of maximum-depth list objects. Default is “; ”.

  • normalizing_delimiter (str) – The delimiter used to join elements across multiple keys when normalizing. Default is a newline character.

combine_normalized(normalized_field_value: list | str | None) list | str | None[source]

Combines lists of nested data (strings, ints, None, etc.) into a single string separated by the normalizing_delimiter.

If a delimiter isn’t specified or if the value is None, it is returned as is without modification.

create_record(obj: Any, path: List[Any]) List[JsonRecordData][source]

Helper method for creating a new record within the current JsonProcessor.

filter_extracted(exclude_keys: List[str] | None = None) Self[source]

Filter the extracted JSON dictionaries to exclude specified keys.

Parameters:

exclude_keys ([List[str]]) – List of keys to exclude from the flattened result.

flatten() Dict[str, List[Any] | str | None] | None[source]

Flatten the extracted JSON dictionary from a nested structure into a simpler structure.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]

process_and_flatten(obj: Dict | None = None, exclude_keys: List[str] | None = None, traversal_paths: List[str] | List[List[str]] | List[List[str | int]] | None = None, traverse_lists: bool = False) Dict[str, Any] | None[source]

Process the dictionary, filter extracted paths, and then flatten the result.

Parameters:
  • exclude_keys (Optional[List[str]]) – List of keys to exclude from the flattened result.

  • traversal_paths (Optional[List[str]]) – Optional ‘.’ delimited paths to constrain the extracted keys to. If omitted, all paths are traversed.

  • traverse_lists (bool) – Determines whether to automatically traverse and flatten list structures.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]

process_dictionary(obj: Dict | None = None) Self[source]

Create a new json dictionary that contains information about the relative paths of each field that can be found within the current JSON dict.

process_level(obj: Any, level_name: List[Any] | None = None) List[Any][source]

Helper method for processing a level within a dictionary.

This method is recursively called to process nested components

traverse_dictionary(paths: List[str] | List[List[str]] | List[List[str | int]], obj: Dict | None = None, traverse_lists: bool = False) Self[source]

Create a new json dictionary by traversing ‘.’ delimited paths for json data found from a JSON Dict.

traverse_level(path: List[str] | List[str | int], obj: Any, level_name: List[Any] | None = None, traverse_lists: bool = False) List[Any][source]

Helper method for traversing a level within a dictionary while constraining keys to known paths.

This method is recursively called to traverse nested components using known keys

static unlist(current_data: Dict | List | None) Any | None[source]

Flattens a dictionary or list if it contains a single element that is a dictionary.

Parameters:

current_data – A dictionary or list to be flattened if it contains a single dictionary element.

Returns:

The flattened dictionary if the input meets the flattening condition, otherwise returns the input unchanged.

Return type:

Optional[Dict|List]
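A minimal sketch of the list case of this unlisting behavior (a hypothetical simplification; the actual static method also accepts dictionaries, which it returns unchanged when the condition is not met):

```python
def unlist(current_data):
    """Collapse a single-element list whose only element is a dictionary
    down to that dictionary; return anything else unchanged."""
    if isinstance(current_data, list) and len(current_data) == 1 and isinstance(current_data[0], dict):
        return current_data[0]
    return current_data

# unlist([{'a': 1}]) == {'a': 1}      (single dict element: collapsed)
# unlist([1, 2]) == [1, 2]            (unchanged: not a single-dict list)
```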