scholar_flux.utils.paths package
Submodules
scholar_flux.utils.paths.path_discoverer module
The scholar_flux.utils.paths.path_discoverer module contains an implementation of a PathDiscoverer dataclass that facilitates the discovery of nested values within JSON data structures and the terminal path where each value is located within the data structure.
This implementation recursively explores the JSON data set and adds to a dictionary of path mappings until the JSON data set is fully represented as path-data combinations that facilitate further processing of JSON data structures using Trie-based implementations.
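The recursive traversal described above can be sketched in a few lines. This is a minimal illustration rather than the library's implementation: `discover_paths` is a hypothetical helper, and plain dotted strings stand in for ProcessingPath objects.

```python
from typing import Any, Optional


def discover_paths(data: Any, prefix: tuple = (), max_depth: Optional[int] = None) -> dict:
    """Map each terminal value in nested dicts/lists to a dotted path (hypothetical helper)."""
    # Stop descending when the value is terminal or the depth limit is reached.
    at_limit = max_depth is not None and len(prefix) >= max_depth
    if not isinstance(data, (dict, list)) or at_limit:
        return {".".join(prefix): data}

    # Dictionaries contribute their keys; lists contribute their positions.
    items = data.items() if isinstance(data, dict) else enumerate(data)
    mappings: dict = {}
    for key, value in items:
        mappings.update(discover_paths(value, prefix + (str(key),), max_depth))
    return mappings


records = [{"title": "Sample Study", "authors": {"lead": "Dr. Smith"}, "tags": ["a", "b"]}]
path_mappings = discover_paths(records)
```

Each key of `path_mappings` starts with the record's position in the list, which is the same convention that the record index properties documented further down rely on.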
- class scholar_flux.utils.paths.path_discoverer.PathDiscoverer(records: dict | list[dict] | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>)
Bases:
object
Used both for discovering paths and for flattening JSON files into a single dictionary that simplifies the nested structure into the path, the type of structure, and the terminal value.
- Parameters:
records – Optional[Union[list[dict], dict]]: A list of dictionaries to be flattened
path_mappings – dict[ProcessingPath, Any]: A set of key-value pairs mapping paths to terminal values
- records
The input data to be traversed and flattened.
- Type:
Optional[Union[list[dict], dict]]
- path_mappings
Holds a dictionary of values mapped to ProcessingPaths after processing
- Type:
dict[ProcessingPath, Any]
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- __init__(records: dict | list[dict] | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>) None
- discover_path_elements(records: dict | list[dict] | None = None, current_path: ProcessingPath | None = None, max_depth: int | None = None, inplace: bool = False) dict[ProcessingPath, Any] | None[source]
Recursively traverses records to discover keys, their paths, and terminal status. Uses the private method _discover_path_elements in order to add terminal path value pairs to the path_mappings attribute.
- Parameters:
records (Optional[Union[list[dict], dict]]) – A list of dictionaries to be flattened if not already provided.
current_path (Optional[ProcessingPath]) – The parent path to prefix all subsequent paths with. This is useful when working with a subset of a dict.
max_depth (Optional[int]) – The maximum number of levels to descend recursively when retrieving terminal paths. Leaving this at None will traverse all possible nested lists/dictionaries.
inplace (bool) – Determines whether or not to save the inner state of the PathDiscoverer object. When False: Returns the final object and clears the self.path_mappings attribute. When True: Retains the self.path_mappings attribute and returns None
- path_mappings: dict[ProcessingPath, Any]
- records: list[dict] | dict | None = None
- property terminal_paths: Set[ProcessingPath]
Helper property for returning the set of all discovered paths from the PathDiscoverer.
scholar_flux.utils.paths.path_node_index module
The scholar_flux.utils.paths.path_node_index module implements the PathNodeIndex class that uses trie-based logic to facilitate the processing of JSON data structures.
The PathNodeIndex is responsible for orchestrating JSON data discovery, processing, and flattening. It abstracts JSON data into path-node pairs that indicate both the terminal values and the path location of each terminal value within a nested JSON data structure.
- class scholar_flux.utils.paths.path_node_index.PathNodeIndex(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None)
Bases:
object
The PathNodeIndex is a dataclass that enables the efficient processing of nested key-value pairs from JSON data commonly received from APIs providing records, articles, and other forms of data.
This index orchestrates the parsing, flattening, and simplification of JSON data structures.
- Parameters:
node_map (PathNodeMap | RecordPathChainMap) – A dictionary of path-node mappings that are used by the PathNodeIndex to simplify JSON structures into a singular list of dictionaries where each dictionary represents a record
simplifier (PathSimplifier) – A structure that enables the simplification of a path node index into a singular list of dictionary records. The structure is initially used to identify unique path names for each path-value combination.
- Class Variables:
- DEFAULT_DELIMITER (str): A delimiter to use by default when reading JSON structures and transforming the
list of keys used to retrieve a terminal path into a simplified string. Each individual key is separated by this delimiter.
- MAX_PROCESSES (int): An optional maximum on the total number of processes to use when simplifying multiple
records into a singular structure in parallel. This can be configured directly or turned off altogether by setting this class variable to None.
- Example Usage:
>>> from scholar_flux.utils import PathNodeIndex
>>> record_test_json: list[dict] = [
...     {
...         "authors": {"principle_investigator": "Dr. Smith", "assistant": "Jane Doe"},
...         "doi": "10.1234/example.doi",
...         "title": "Sample Study",
...         # "abstract": ["This is a sample abstract.", "keywords: 'sample', 'abstract'"],
...         "genre": {"subspecialty": "Neuroscience"},
...         "journal": {"topic": "Sleep Research"},
...     },
...     {
...         "authors": {"principle_investigator": "Dr. Lee", "assistant": "John Roe"},
...         "doi": "10.5678/example2.doi",
...         "title": "Another Study",
...         "abstract": "Another abstract.",
...         "genre": {"subspecialty": "Psychiatry"},
...         "journal": {"topic": "Dreams"},
...     },
... ]
>>> normalized_records = PathNodeIndex.normalize_records(record_test_json)
>>> normalized_records
# OUTPUT: [{'abstract': 'Another abstract.',
#           'doi': '10.5678/example2.doi',
#           'title': 'Another Study',
#           'authors.assistant': 'John Roe',
#           'authors.principle_investigator': 'Dr. Lee',
#           'genre.subspecialty': 'Psychiatry',
#           'journal.topic': 'Dreams'},
#          {'doi': '10.1234/example.doi',
#           'title': 'Sample Study',
#           'authors.assistant': 'Jane Doe',
#           'authors.principle_investigator': 'Dr. Smith',
#           'genre.subspecialty': 'Neuroscience',
#           'journal.topic': 'Sleep Research'}]
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- MAX_PROCESSES: ClassVar[int | None] = 8
- __init__(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None) None
- combine_keys(skip_keys: list | None = None) None[source]
Combine nodes with values in their paths by updating the paths of count nodes.
This method searches for paths ending with values and count, identifies related nodes, and updates the paths by combining the value with the count node.
- Parameters:
skip_keys (Optional[list]) – Keys that should not be combined regardless of a matching pattern
quote_numeric (Optional[bool]) – Determines whether to quote integer components of paths to distinguish them from indices. The default behavior is to quote them (ex. 0, 123).
- Raises:
PathCombinationError – If an error occurs during the combination process.
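One way to picture the combination step is sketched below. It assumes, hypothetically, flattened paths in which a `value` entry and a `count` entry share the same parent path. The `combine_value_count` helper, the sample paths, and the treatment of `skip_keys` as parent paths are all illustrative and are not the library's API or its exact matching rules.

```python
from typing import Any, Optional


def combine_value_count(flat: dict, skip_keys: Optional[list] = None) -> dict:
    """Fold each 'value' entry into the path of its sibling 'count' entry (hypothetical)."""
    skip = set(skip_keys or [])
    combined: dict = {}
    for path, value in flat.items():
        parent, _, leaf = path.rpartition(".")
        has_pair = f"{parent}.value" in flat and f"{parent}.count" in flat
        if not has_pair or parent in skip:
            combined[path] = value  # Nothing to combine, or combination was skipped.
        elif leaf == "count":
            # Insert the sibling value into the path of the count node.
            combined[f"{parent}.{flat[f'{parent}.value']}.count"] = value
        elif leaf != "value":
            combined[path] = value  # Unrelated sibling under the same parent.
        # A paired 'value' entry is dropped because the count path now carries it.
    return combined


flat = {
    "facets.0.value": "Biology",
    "facets.0.count": 12,
    "facets.1.value": "Physics",
    "facets.1.count": 7,
    "doi": "10.1234/example.doi",
}
combined = combine_value_count(flat)
```

In this sketch the facet name ends up inside the path of its count, so a later flattening step can produce one column per facet instead of separate name and count columns.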
- classmethod from_path_mappings(path_mappings: dict[ProcessingPath, Any], chain_map: bool = False, use_cache: bool | None = None) PathNodeIndex[source]
Takes a dictionary of path:value mappings and transforms the dictionary into a list of PathNodes: useful for later path manipulations such as grouping and consolidating paths into a flattened dictionary.
If use_cache is not specified, then the Mapping will use the class default to determine whether or not to cache.
- Returns:
An index of PathNodes created from a dictionary
- Return type:
PathNodeIndex
- get_node(path: ProcessingPath | str) PathNode | None[source]
Try to retrieve a path node with the given path.
- Parameters:
path (Union[ProcessingPath, str]) – The exact path to search for in the index
- Returns:
- The exact node that matches the provided path.
Returns None if a match is not found
- Return type:
Optional[PathNode]
- node_map: PathNodeMap | RecordPathChainMap
- property nodes: list[PathNode]
Returns a list of PathNodes stored within the index.
- Returns:
The complete list of all PathNodes that have been registered in the PathNodeIndex
- Return type:
list[PathNode]
- classmethod normalize_records(json_records: dict | list[dict], combine_keys: bool = True, object_delimiter: str | None = ';', parallel: bool = False) list[dict[str, Any]][source]
Full pipeline for processing a loaded JSON structure into a list of dictionaries where each individual list element is a processed and normalized record.
- Parameters:
json_records (dict[str,Any] | list[dict[str,Any]]) – The JSON structure to normalize. If this structure is a dictionary, it will first be nested in a list as a single element before processing.
combine_keys – bool: This flag determines whether or not to combine keys that are likely to denote names and corresponding values/counts. Default is True
object_delimiter – This delimiter determines whether to join terminal paths in lists under the same key and how to collapse the list into a singular string. If empty, terminal lists are returned as is.
parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel
- Return type:
list[dict[str,Any]]
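As a rough mental model of this pipeline, each record can be flattened to dotted keys, with multiple values under the same key joined by the object delimiter. The sketch below is not the actual implementation, which works through the PathNodeIndex and PathSimplifier; the names `flatten_record` and `normalize` are hypothetical.

```python
from typing import Any, Optional


def flatten_record(value: Any, prefix: str = "") -> dict:
    """Collect terminal values under dotted keys, ignoring list positions (hypothetical)."""
    if isinstance(value, dict):
        children = [(f"{prefix}.{key}" if prefix else str(key), child) for key, child in value.items()]
    elif isinstance(value, list):
        children = [(prefix, child) for child in value]
    else:
        return {prefix: [value]}

    out: dict = {}
    for name, child in children:
        for key, values in flatten_record(child, name).items():
            out.setdefault(key, []).extend(values)
    return out


def normalize(records: Any, object_delimiter: Optional[str] = ";") -> list:
    """Turn one record or a list of records into a list of flat dictionaries."""
    records = [records] if isinstance(records, dict) else records
    rows = []
    for record in records:
        row = {}
        for key, values in flatten_record(record).items():
            if len(values) == 1:
                row[key] = values[0]
            elif object_delimiter:
                row[key] = object_delimiter.join(map(str, values))
            else:
                row[key] = values  # No delimiter: keep the list as is.
        rows.append(row)
    return rows


rows = normalize({"title": "Sample Study", "authors": {"names": ["Jane Doe", "John Roe"]}})
```

A single dictionary is wrapped in a list before processing, mirroring the behavior described for `json_records`.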
- property paths: list[ProcessingPath]
Returns a list of Paths stored within the index.
- Returns:
The complete list of all paths that have been registered in the PathNodeIndex
- Return type:
list[ProcessingPath]
- pattern_search(pattern: str | Pattern) list[PathNode][source]
Attempt to find all values containing the specified pattern using regular expressions.
- Parameters:
pattern (Union[str, re.Pattern]) – The regular expression pattern to search for
- Returns:
all paths and nodes that match the specified pattern
- Return type:
dict[ProcessingPath, PathNode]
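A regular-expression search over an index can be sketched as follows. The sketch assumes that the pattern is matched against the string form of each path, which the description above does not state explicitly; the function and the sample index are hypothetical.

```python
import re
from typing import Any, Union


def pattern_search(index: dict, pattern: Union[str, re.Pattern]) -> dict:
    """Return the entries whose path string matches the pattern (assumed matching target)."""
    compiled = re.compile(pattern)  # Accepts a string or an already compiled pattern.
    return {path: value for path, value in index.items() if compiled.search(path)}


index = {
    "0.authors.assistant": "Jane Doe",
    "0.doi": "10.1234/example.doi",
    "1.authors.assistant": "John Roe",
}
matches = pattern_search(index, r"authors\.\w+$")
```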
- property record_indices: list[int]
Helper property for quickly retrieving the full, sorted list of record indices across the current mapping of paths to nodes for the current index.
It refers back to the map for the underlying implementation in the retrieval of record_indices.
- Returns:
A list containing integers denoting individual records found in each path.
- Return type:
list[int]
- search(path: ProcessingPath) list[PathNode][source]
Attempt to find all values that match the provided path or that have sub-paths exactly matching the provided path.
- Parameters:
path (Union[str, ProcessingPath]) – The path to search for. Note that the provided path must exactly match a prefix/ancestor path of an indexed path to be considered a match.
- Returns:
- All paths equal to or containing sub-paths
exactly matching the specified path
- Return type:
dict[ProcessingPath, PathNode]
- simplifier: PathSimplifier
- simplify_to_rows(object_delimiter: str | None = ';', parallel: bool = False, max_components: int | None = None, remove_noninformative: bool = True) list[dict[str, Any]][source]
Simplify indexed nodes into a paginated data structure.
- Parameters:
object_delimiter (str) – The separator to use when collapsing multiple values into a single string.
parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel
- Returns:
A list of dictionaries representing the paginated data structure.
- Return type:
list[dict[str, Any]]
- use_cache: bool | None = None
scholar_flux.utils.paths.path_node_map module
The scholar_flux.utils.paths.path_node_map module implements the PathNodeMap, which records terminal path-value combinations and enables more efficient mapping, retrieval, and updates of terminal path-node combinations.
- class scholar_flux.utils.paths.path_node_map.PathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])
Bases:
UserDict[ProcessingPath, PathNode]
A dictionary-like class that maps ProcessingPaths to PathNode objects.
- DEFAULT_USE_CACHE: bool = True
- __init__(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]
Initializes the PathNodeMap instance.
- add(node: PathNode, overwrite: bool | None = None, inplace: bool = True) PathNodeMap | None[source]
Add a node to the PathNodeMap instance.
- Parameters:
node (PathNode) – The node to add.
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
- Raises:
PathNodeMapError – If any error occurs while adding the node.
- filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]
Filter the PathNodeMap for paths with the given prefix.
- Parameters:
prefix (ProcessingPath) – The prefix to search for.
min_depth (Optional[int]) – The minimum depth to search for. Default is None.
max_depth (Optional[int]) – The maximum depth to search for. Default is None.
from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.
- Returns:
A dictionary of paths with the given prefix and their corresponding terminal_nodes
- Return type:
dict[Optional[ProcessingPath], Optional[PathNode]]
- Raises:
PathNodeMapError – If an error occurs while filtering the PathNodeMap.
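Prefix filtering with depth bounds can be sketched with tuples standing in for ProcessingPath objects. The helper below is hypothetical, and it assumes that depth means the number of path components and that both bounds are inclusive.

```python
from typing import Any, Optional


def filter_by_prefix(
    nodes: dict,
    prefix: tuple,
    min_depth: Optional[int] = None,
    max_depth: Optional[int] = None,
) -> dict:
    """Keep entries whose path starts with `prefix` and whose depth is within the bounds."""
    selected = {}
    for path, node in nodes.items():
        if path[: len(prefix)] != prefix:
            continue  # Not a descendant of the prefix.
        depth = len(path)
        if min_depth is not None and depth < min_depth:
            continue
        if max_depth is not None and depth > max_depth:
            continue
        selected[path] = node
    return selected


nodes = {
    ("0", "authors", "assistant"): "Jane Doe",
    ("0", "doi"): "10.1234/example.doi",
    ("1", "doi"): "10.5678/example2.doi",
}
first_record = filter_by_prefix(nodes, ("0",))
shallow = filter_by_prefix(nodes, ("0",), max_depth=2)
```

A linear scan like this is what a prefix cache is meant to avoid; the sketch only shows the selection rule.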
- classmethod format_mapping(key_value_pairs: PathNodeMap | MutableMapping[ProcessingPath, PathNode] | dict[str, PathNode]) dict[ProcessingPath, PathNode][source]
Takes a dictionary or a PathNodeMap, transforms the string keys in the dictionary into ProcessingPaths, and returns the mapping.
- Parameters:
key_value_pairs (Union[dict[ProcessingPath, PathNode], dict[str, PathNode]]) – The dictionary of key-value pairs to transform.
- Returns:
a dictionary of validated path, node pairings
- Return type:
dict[ProcessingPath, PathNode]
- Raises:
PathNodeMapError – If the validation process fails.
- classmethod format_terminal_nodes(node_obj: MutableMapping | PathNodeMap | PathNode) dict[ProcessingPath, PathNode][source]
Recursively iterate over terminal nodes from PathNodeMaps and retrieve only terminal_nodes.
- Parameters:
node_obj (Union[dict, PathNodeMap]) – PathNode map or node dictionary containing either nested or already flattened terminal_paths
- Returns:
the flattened terminal paths extracted from the inputted node_obj
- Return type:
item (dict)
- get(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]
Gets an item from the PathNodeMap instance. If the value isn’t available, this method will return the value specified in default.
- Parameters:
key (Union[str, ProcessingPath]) – The key (ProcessingPath). If a string is provided, it is coerced to a ProcessingPath.
- Returns:
The value (PathNode instance).
- Return type:
Optional[PathNode]
- get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]
Helper method for retrieving a path node in a standardized way.
- node_exists(node: PathNode | ProcessingPath) bool[source]
Helper method to validate whether the current node exists.
- property nodes: list[PathNode]
Enables the retrieval of nodes stored within the current map as a property.
- property paths: list[ProcessingPath]
Enables the retrieval of paths stored within the current map as a property.
- property record_indices: list[int]
Helper property for retrieving the full list of all record indices across all paths for the current map.
Note: This assumes that all paths within the current map are derived from a list of records where every path’s first element denotes its initial position in a list with nested JSON components.
- Returns:
A list containing integers denoting individual records found in each path
- Return type:
list[int]
- remove(node: ProcessingPath | PathNode | str, inplace: bool = True) PathNodeMap | None[source]
Remove the specified path or node from the PathNodeMap instance.
- Parameters:
node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.
inplace (bool) – Whether to remove the path in-place or return a new PathNodeMap instance. Default is True.
- Returns:
A new PathNodeMap instance with the specified paths removed if inplace is False; otherwise None.
- Return type:
Optional[PathNodeMap]
- Raises:
PathNodeMapError – If any error occurs while removing.
- update(*args, overwrite: bool | None = None, **kwargs: Mapping[str | ProcessingPath, PathNode]) None[source]
Updates the PathNodeMap instance with new key-value pairs.
- Parameters:
*args (Union[PathNodeMap,dict[ProcessingPath, PathNode],dict[str, PathNode]]) – PathNodeMap or dictionary containing the key-value pairs to append to the PathNodeMap
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
**kwargs (PathNode) – Path Nodes using the path as the argument name to append to the PathNodeMap
scholar_flux.utils.paths.path_nodes module
The scholar_flux.utils.paths.path_nodes module implements the basic PathNode data class necessary to represent a terminal path-value combination within a nested JSON structure.
This data structure forms the basis of path processing that scholar_flux uses to process, filter, and flatten JSON data sets.
- class scholar_flux.utils.paths.path_nodes.PathNode(path: ProcessingPath, value: Any)
Bases:
object
A dataclass that acts as a wrapper for path-terminal value pairs in nested JSON structures.
The PathNode consists of a value of any type and a ProcessingPath instance that indicates where a terminal value was found. This class simplifies the process of manipulating and flattening data structures originating from JSON data.
- path
The terminal path where the value was located
- Type:
ProcessingPath
- value
- Type:
Any
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- __init__(path: ProcessingPath, value: Any) None
- classmethod is_valid_node(node: PathNode) bool[source]
Validates whether the given node is a PathNode instance. If the input is not a PathNode, this method raises an InvalidPathNodeError.
- Raises:
InvalidPathNodeError – If the current node is not a PathNode or if its path is not a valid ProcessingPath
- path: ProcessingPath
- property path_group: ProcessingPath
Attempt to retrieve the path omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is later useful for when we need to group paths into a list or sets in order to consolidate record fields.
- Returns:
A ProcessingPath instance with the last numeric component removed and indices replaced.
- Return type:
ProcessingPath
- property path_keys: ProcessingPath
Utility property for retaining keys from a path, ignoring indexes generated by lists. Retrieves the original path minus all keys that originate from list indexes.
- Returns:
A ProcessingPath instance associated with all dictionary keys
- Return type:
ProcessingPath
- property record_index: int
Extract the first element of the node’s path to determine the record number originating from a list of dictionaries, assuming the path originates from a paginated structure.
- Returns:
Value denoting the record that the path originates from
- Return type:
int
- Raises:
PathIndexingError – if the first element of the path is not a numerical index
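The three path-derived properties above can be illustrated on a plain tuple of string components. This sketch assumes that purely numeric components denote list indices; the real classes may distinguish indices from keys differently, and the library raises PathIndexingError where the sketch raises ValueError. All function bodies here are hypothetical.

```python
def path_group(components: tuple) -> tuple:
    """Drop a trailing numeric component, then replace remaining indices with 'i'."""
    if components and components[-1].isdigit():
        components = components[:-1]
    return tuple("i" if part.isdigit() else part for part in components)


def path_keys(components: tuple) -> tuple:
    """Keep only the components that look like dictionary keys."""
    return tuple(part for part in components if not part.isdigit())


def record_index(components: tuple) -> int:
    """Read the record number from the first component of the path."""
    if not components or not components[0].isdigit():
        raise ValueError("the first path component is not a numerical index")
    return int(components[0])


path = ("3", "authors", "0", "affiliations", "1")
group = path_group(path)
keys = path_keys(path)
index = record_index(path)
```

Grouping by `group` places every element of the same nested list under one name, which is what makes it possible to consolidate list values into a single record field later.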
- classmethod to_path_node(path: ProcessingPath | str | int | list[str] | list[int] | list[str | int], value: Any, **path_kwargs) Self[source]
Helper method for creating a path node from the components used to create paths in addition to value to assign the path node.
- Parameters:
path (Union[ProcessingPath, str, list[str]]) – The path to be assigned to the node. If this is not a path already, then a path will be created from what is provided
value (Any) – The value to associate with the new node
**path_kwargs – Additional keyword arguments to be used in the creation of a path. This is passed to ProcessingPath.to_processing_path when creating a path
- Returns:
The newly constructed path node
- Return type:
PathNode
- Raises:
InvalidPathNodeError – If the values provided cannot be used to create a new node
- update(**attributes: ProcessingPath | Any) PathNode[source]
Update the parameters of a PathNode by creating a new PathNode instance. Note that the original PathNode dataclass is frozen. This method uses the copied dict originating from the dataclass to initialize a new PathNode.
- Parameters:
**attributes (dict) – Keyword arguments indicating the attributes of the PathNode to update. If a specific key is not provided, then it will not be updated. Each key should be a valid attribute name of PathNode, and each value should be the corresponding updated value.
- Returns:
A new PathNode with the updated attributes
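The copy-and-rebuild pattern for frozen dataclasses can be sketched as follows. The `Node` class is a hypothetical stand-in for PathNode, and the error raised for unknown attribute names is an assumption of this sketch.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class Node:  # Hypothetical stand-in for PathNode.
    path: str
    value: Any

    def update(self, **attributes: Any) -> "Node":
        # Copy the current field values, overlay the requested changes,
        # and build a new instance; the frozen original is never mutated.
        current = {field.name: getattr(self, field.name) for field in fields(self)}
        unknown = set(attributes) - set(current)
        if unknown:
            raise AttributeError(f"unknown attributes: {sorted(unknown)}")
        return Node(**{**current, **attributes})


original = Node(path="0.title", value="Sample Study")
updated = original.update(value="Revised Study")
```

The standard library's `dataclasses.replace` achieves the same result for simple cases.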
- value: Any
scholar_flux.utils.paths.path_simplification module
The scholar_flux.utils.paths.path_simplification module implements the PathSimplifier for flattening JSON records.
This simplifier is used in the latter path processing steps to coerce a nested JSON structure represented by a PathNodeIndex into a singular list of dictionaries.
The PathSimplifier will return the full paths where each nested JSON value can be found, if allowed. Otherwise, the PathSimplifier will attempt to shorten the names in the final dictionary of paths up to the user-specified nested key (component) length while preventing name collisions from occurring.
- class scholar_flux.utils.paths.path_simplification.PathSimplifier(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>)
Bases:
object
A utility class for simplifying and managing ProcessingPaths.
- Parameters:
delimiter (str) – The delimiter to use when splitting paths.
non_informative (Optional[List[str]]) – A list of non-informative components to remove from paths.
- delimiter
The delimiter used to separate components in the path.
- Type:
str
- non_informative
A list of non-informative components to be removed during simplification.
- Type:
List[str]
- name_mappings
A dictionary for tracking unique names to avoid collisions.
- Type:
Dict[ProcessingPath, str]
- __init__(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>) None
- clear_mappings() None[source]
Clear all existing path mappings.
Example
>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.clear_mappings()
>>> simplifier.get_mapped_paths()
- Output:
{}
- delimiter: str = '.'
- generate_unique_name(path: ProcessingPath, max_components: int | None, remove_noninformative: bool = False) ProcessingPath[source]
Generate a unique name for the given Processing Path.
- Parameters:
path (ProcessingPath) – The ProcessingPath object representing the path components.
max_components (int) – The maximum number of components to use in the name.
remove_noninformative (bool) – Whether to remove non-informative components.
- Returns:
A unique ProcessingPath name.
- Return type:
ProcessingPath
- Raises:
PathSimplificationError – If an error occurs during name generation.
- get_mapped_paths() Dict[ProcessingPath, str][source]
Get the current name mappings.
- Returns:
The dictionary of mappings from original paths to simplified names.
- Return type:
Dict[ProcessingPath, str]
Example
>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.get_mapped_paths()
Output:
{ProcessingPath('a/b/c'): 'c', ProcessingPath('a/b/d'): 'd'}
- name_mappings: Dict[ProcessingPath, str]
- non_informative: list[str]
- simplify_paths(paths: List[ProcessingPath | str] | Set[ProcessingPath | str], max_components: int | None, remove_noninformative: bool = False) Dict[ProcessingPath, str][source]
Simplify paths by removing non-informative components and selecting the last ‘max_components’ informative components.
- Parameters:
paths (List[Union[ProcessingPath, str]]) – List of path strings or ProcessingPaths to simplify.
max_components (int) – The maximum desired number of informative components to retain in the simplified path.
remove_noninformative (bool) – Whether to remove non-informative components.
- Returns:
- A dictionary mapping the original path to its simplified unique group name
for all elements within the same path after removing indices
- Return type:
Dict[ProcessingPath, str]
- Raises:
PathSimplificationError – If an error occurs during path simplification.
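The idea of shortening paths while avoiding name collisions can be sketched as below. This is one possible strategy under stated assumptions, not the PathSimplifier's actual algorithm: the function is hypothetical, numeric components are treated as list indices, and a colliding name is lengthened one component at a time.

```python
from typing import Optional


def simplify_paths(paths: list, max_components: Optional[int], delimiter: str = ".") -> dict:
    """Map each path to a short name built from its trailing key components."""
    names: dict = {}
    owners: dict = {}  # Simplified name -> the index-free key path that claimed it.
    for path in paths:
        # Ignore list indices so that only dictionary keys contribute to the name.
        keys = [part for part in path.split(delimiter) if not part.isdigit()]
        full = delimiter.join(keys)
        size = len(keys) if max_components is None else max(1, min(max_components, len(keys)))
        name = delimiter.join(keys[-size:])
        # Lengthen the name while a different key path already owns it.
        while owners.get(name, full) != full and size < len(keys):
            size += 1
            name = delimiter.join(keys[-size:])
        owners.setdefault(name, full)
        names[path] = name
    return names


mapping = simplify_paths(["0.authors.name", "0.journal.name", "0.doi"], max_components=1)
```

Paths that differ only by list index, such as the same field in two records, share one name in this sketch, so that they land in the same column.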
- simplify_to_row(terminal_nodes: List[PathNode] | Set[PathNode], collapse: str | None = ';') Dict[str, Any][source]
Simplify terminal nodes by mapping them to their corresponding unique names.
- Parameters:
terminal_nodes (List[PathNode]) – A list of PathNode objects representing the terminal nodes.
collapse (Optional[str]) – The separator to use when collapsing multiple values into a single string.
- Returns:
A dictionary mapping unique names to their corresponding values or collapsed strings.
- Return type:
Dict[str, Union[List[str], str]]
- Raises:
PathSimplificationError – If an error occurs during simplification.
scholar_flux.utils.paths.processing_cache module
The scholar_flux.utils.paths.processing_cache module implements the PathProcessingCache to cache path processing operations.
By caching terminal paths and their parent paths, the PathProcessingCache class facilitates the faster, more efficient filtering, processing, and retrieval of nested JSON data components and structures as represented by path nodes.
For the duration that each path-node combination exists, the cache uses weakly-referenced dictionaries and weakly-referenced sets to facilitate indexed trie operations and the process of filtering each path-node combination.
- class scholar_flux.utils.paths.processing_cache.PathProcessingCache
Bases:
object
The PathProcessingCache class implements a method of path caching that enables faster prefix searches and retrieval of terminal paths associated with a path-to-node mapping. This class is used within PathNodeMaps and RecordPathNodeMaps to increase the speed and efficiency of path discovery, processing, and filtering of path-node mappings.
Because the primary purpose of the scholar_flux Trie-based path-node-processing implementation is the processing and preparation of highly nested JSON structures from API responses, the PathProcessingCache was created to efficiently keep track of all descendants of a terminal node with weak references and to facilitate the filtering and flattening of path-node combinations.
Stale data is automatically removed to reduce the number of comparisons needed to retrieve terminal paths only, and, as a result, later steps can more efficiently filter the complete list of terminal paths with faster path prefix searches to facilitate processing using Path-Node Maps and Indexes when processing JSON data structures.
- __init__() None[source]
Initializes the ProcessingCache instance.
- _cache
Underlying cache data structure that keeps track of all descendants that begin with the current prefix by mapping path strings to WeakSets that automatically remove ProcessingPaths when garbage collected
- Type:
defaultdict[str, WeakSet[ProcessingPath]]
- updates
Implements a lazy caching system that only adds elements to the _cache when filtering and node retrieval is explicitly required. The implementation uses weakly referenced keys to remove cached paths to ensure that references are deleted when a lazy operation is no longer needed.
- Type:
WeakKeyDictionary[ProcessingPath, Literal[‘add’, ‘remove’]]
- cache_update() None[source]
Initializes the lazy updates for the cache given the current update instructions.
- filter(prefix: ProcessingPath, min_depth: int | None = None, max_depth: int | None = None) Set[ProcessingPath][source]
Filter the cache for paths with the given prefix.
- Parameters:
prefix (ProcessingPath) – The prefix to search for.
min_depth (Optional[int]) – The minimum depth to search for. Default is None.
max_depth (Optional[int]) – The maximum depth to search for. Default is None.
- Returns:
A set of paths with the given prefix.
- Return type:
Set[ProcessingPath]
- lazy_add(path: ProcessingPath) None[source]
Add a path to the cache for faster prefix searches.
- Parameters:
path (ProcessingPath) – The path to add to the cache.
- lazy_remove(path: ProcessingPath) None[source]
Remove a path from the cache.
- Parameters:
path (ProcessingPath) – The path to remove from the cache.
- property path_cache: defaultdict[str, WeakSet[ProcessingPath]]
Helper method that allows for inspection of the ProcessingCache and automatically updates the node cache prior to retrieval.
- Returns:
- The underlying cache used within the ProcessingCache to
retrieve a list of all currently active terminal nodes.
- Return type:
defaultdict[str, WeakSet[ProcessingPath]]
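The combination of lazy updates and weak references described for this class can be sketched with the standard library. `Path` and `LazyPrefixCache` are hypothetical stand-ins, and the sketch omits the depth filtering that the real `filter` method supports.

```python
import gc
from collections import defaultdict
from weakref import WeakKeyDictionary, WeakSet


class Path:  # Hypothetical, weak-referenceable stand-in for ProcessingPath.
    def __init__(self, *components: str):
        self.components = components

    def prefixes(self) -> list:
        parts = self.components
        return [".".join(parts[:end]) for end in range(1, len(parts) + 1)]


class LazyPrefixCache:
    def __init__(self) -> None:
        self._cache = defaultdict(WeakSet)  # Prefix string -> live paths under it.
        self.updates = WeakKeyDictionary()  # Path -> pending 'add' or 'remove'.

    def lazy_add(self, path: Path) -> None:
        self.updates[path] = "add"

    def lazy_remove(self, path: Path) -> None:
        self.updates[path] = "remove"

    def cache_update(self) -> None:
        # Apply pending operations only when a lookup actually needs them.
        for path, operation in list(self.updates.items()):
            for prefix in path.prefixes():
                if operation == "add":
                    self._cache[prefix].add(path)
                else:
                    self._cache[prefix].discard(path)
        self.updates.clear()

    def filter(self, prefix: str) -> set:
        self.cache_update()
        return set(self._cache.get(prefix, ()))


cache = LazyPrefixCache()
doi, title = Path("0", "doi"), Path("0", "title")
cache.lazy_add(doi)
cache.lazy_add(title)
before = len(cache.filter("0"))

del title      # Drop the only strong reference to the second path.
gc.collect()   # CPython frees it right away; the call covers other runtimes.
after = len(cache.filter("0"))
```

Because the sets hold only weak references, a path that is no longer referenced elsewhere disappears from the cache without an explicit `lazy_remove` call.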
scholar_flux.utils.paths.processing_path module
Implements the ProcessingPath that is the most fundamental component in the scholar_flux JSON path processing trie implementation.
The ProcessingPath is used to store a path processing representation that allows for extensive flexibility in the creation, filtering, and discovery of nested keys in JSON structures.
- class scholar_flux.utils.paths.processing_path.ProcessingPath(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)
Bases:
object
A utility class to handle path operations for processing and flattening dictionaries.
- Parameters:
components (Union[str, int, Tuple[str, ...], List[str], List[int], List[str | int]]) – The initial path, either as a string or a list of strings. Any integers will be auto-converted to strings in the process of formatting the components of the path
component_types (Optional[Union[Tuple[str, ...], List[str]]]) – Optional metadata fields that can be used to annotate specific components of a path
delimiter (str) – The delimiter used to separate components in the path.
- Raises:
InvalidProcessingPathError – If the path is neither a string nor a list of strings.
InvalidPathDelimiterError – If the delimiter is invalid.
- components
A tuple of path components.
- Type:
Tuple[str, …]
- delimiter
The delimiter used to separate components in the path.
- Type:
str
Examples
>>> from scholar_flux.utils import ProcessingPath
>>> abc_path = ProcessingPath(['a', 'b', 'c'], delimiter='//')
>>> updated_path = abc_path / 'd'
>>> assert updated_path.depth > 3 and updated_path[-1] == 'd' # OUTPUT: True
>>> assert str(updated_path) == 'a//b//c//d'
>>> assert updated_path.has_ancestor(abc_path)
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- __init__(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)[source]
Initializes the ProcessingPath. The inputs are first validated to ensure that the path components and delimiters are valid.
- Parameters:
components – (Union[str, int, Tuple[str, …], List[str], List[int], List[str | int]]): The current path keys describing the path where each key represents a nested key in a JSON structure
component_types – (Optional[Union[Tuple[str, …], List[str]]]): An iterable of component types (used to annotate the components)
delimiter – (Optional[str]): The separator used to indicate separate nested keys in a JSON structure. Defaults to the class default if not directly specified.
- append(component: int | str, component_type: str | None = None) ProcessingPath[source]
Append a component to the path and return a new ProcessingPath object.
- Parameters:
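component (Union[int, str]) – The component to append.
component_type (Optional[str]) – An optional type used to annotate the appended component.
- Returns:
A new ProcessingPath object with the appended component.
- Return type:
ProcessingPath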
- Raises:
InvalidProcessingPathError – If the component is not a non-empty string.
- component_types: Tuple[str, ...] | None = None
- components: Tuple[str, ...]
- copy() ProcessingPath[source]
Create a copy of the ProcessingPath.
- Returns:
A new ProcessingPath object with the same components and delimiter.
- Return type:
ProcessingPath
- delimiter: str = ''
- property depth: int
Return the depth of the path.
- Returns:
The number of components in the path.
- Return type:
int
- get_ancestors() List[ProcessingPath | None][source]
Get all ancestor paths of the current ProcessingPath.
- Returns:
A list of all ancestor paths of the current path. If the depth of the path is 1, an empty list is returned.
- Return type:
List[Optional[ProcessingPath]]
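As a rough illustration of what "all ancestor paths" means, the following stand-alone sketch treats a path as a plain tuple of string components and lists every proper prefix. The function name `ancestors` and the shortest-first ordering are assumptions made for this example; they are not taken from the library.

```python
from typing import List, Tuple


def ancestors(components: Tuple[str, ...]) -> List[Tuple[str, ...]]:
    """Return every proper prefix of `components`, shortest first (assumed order)."""
    return [components[:i] for i in range(1, len(components))]


path = ("0", "authors", "assistant")
print(ancestors(path))    # [('0',), ('0', 'authors')]
print(ancestors(("0",)))  # [] because a depth-1 path has no ancestors
```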
- get_name(max_components: int = 1) ProcessingPath[source]
Generate a path name based on the last ‘max_components’ components of the path.
- Parameters:
max_components (int) – The maximum number of components to include in the name (default is 1).
- Returns:
A new ProcessingPath object representing the generated name.
- Return type:
ProcessingPath
- get_parent(step: int = 1) ProcessingPath | None[source]
Get the ancestor path of the current ProcessingPath by the specified number of steps.
This method navigates up the path structure by the given number of steps. If the step count is less than the depth of the current path, it returns the corresponding ancestor. If the step count equals the current depth, it returns the root ProcessingPath. If the step count is greater than the current depth, or if the path is already the root, it returns None.
- Parameters:
step (int) – The number of levels up to retrieve. 1 for parent, 2 for grandparent, etc. (default is 1).
- Returns:
The ancestor ProcessingPath if the step is within the path depth.
The root ProcessingPath if step equals the depth of the current path.
None if the step is greater than the current depth or if the path is already the root.
- Return type:
Optional[ProcessingPath]
- Raises:
ValueError – If the step is less than 1.
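The return rules listed above can be sketched with plain tuples. This is a minimal illustration, not the library's code: the name `parent` is hypothetical, and representing the root as an empty tuple is an assumption.

```python
from typing import Optional, Tuple


def parent(components: Tuple[str, ...], step: int = 1) -> Optional[Tuple[str, ...]]:
    """Walk `step` levels up a tuple-based path (the empty tuple stands in for the root)."""
    if step < 1:
        raise ValueError("step must be at least 1")
    depth = len(components)
    if depth == 0 or step > depth:
        # Already at the root, or asked to climb above it
        return None
    # When step == depth the slice is empty, i.e. the root
    return components[: depth - step]


print(parent(("a", "b", "c")))     # ('a', 'b')
print(parent(("a", "b", "c"), 3))  # ()
print(parent(("a", "b", "c"), 4))  # None
```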
- group(last_only: bool = False) ProcessingPath[source]
Attempt to retrieve the path omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is later useful for when we need to group paths into a list or sets in order to consolidate record fields.
- Parameters:
last_only (bool) – If True, only the last list index is removed; otherwise the remaining list indices are also replaced with the placeholder.
- Returns:
A ProcessingPath instance with the last numeric component removed and indices replaced.
- Return type:
ProcessingPath
- has_ancestor(path: str | ProcessingPath) bool[source]
Determine whether the provided path is equal to or a subset/ancestor of the current path (self).
- Parameters:
path (Union[str, ProcessingPath]) – The potential subset/ancestor of the current ProcessingPath (self).
- Returns:
True if ‘self’ is a superset of ‘path’. False Otherwise.
- Return type:
bool
- static infer_delimiter(path: str | ProcessingPath, delimiters: list[str] = ['<>', '//', '/', '>', '<', '\\', '%', '.']) str | None[source]
Infer the delimiter used in the path string based on its string representation.
- Parameters:
path (Union[str,ProcessingPath]) – The path string to infer the delimiter from.
delimiters (List[str]) – A list of common delimiters to search for in the path.
- Returns:
The inferred delimiter, if any of the candidate delimiters is found.
- Return type:
Optional[str]
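One plausible way to infer a delimiter is to scan the candidates in order and return the first one that occurs in the string. The sketch below assumes exactly that first-match rule; the function name `infer_delimiter_sketch` is hypothetical, and only the candidate list is taken from the signature above.

```python
from typing import Optional, Sequence

# Candidate list copied from the documented default; multi-character
# delimiters come first so that '//' is tried before '/'.
DELIMITERS = ("<>", "//", "/", ">", "<", "\\", "%", ".")


def infer_delimiter_sketch(path: str, delimiters: Sequence[str] = DELIMITERS) -> Optional[str]:
    """Return the first candidate delimiter found in `path`, or None (assumed rule)."""
    for delimiter in delimiters:
        if delimiter in path:
            return delimiter
    return None


print(infer_delimiter_sketch("a//b//c"))  # //
print(infer_delimiter_sketch("a.b.c"))    # .
print(infer_delimiter_sketch("abc"))      # None
```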
- info_content(non_informative: List[str]) int[source]
Calculate the number of informative components in the path.
- Parameters:
non_informative (List[str]) – A list of non-informative components.
- Returns:
The number of informative components.
- Return type:
int
- is_ancestor_of(path: str | ProcessingPath) bool[source]
Determine whether the current path (self) is equal to or a subset/ancestor of the specified path.
- Parameters:
path (Union[str, ProcessingPath]) – The potential superset of the current ProcessingPath (self).
- Returns:
True if ‘self’ is a subset of ‘path’. False Otherwise.
- Return type:
bool
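Both has_ancestor and is_ancestor_of describe the same prefix relationship, seen from opposite sides. A minimal stand-alone sketch with tuples follows; the helper name `is_prefix` is hypothetical, and treating equal paths as a match follows the "equal to" wording above.

```python
from typing import Sequence


def is_prefix(candidate: Sequence[str], full: Sequence[str]) -> bool:
    """True if `candidate` equals `full` or is a leading slice of it."""
    return len(candidate) <= len(full) and tuple(full[: len(candidate)]) == tuple(candidate)


abc = ("a", "b", "c")
abcd = abc + ("d",)

# abcd.has_ancestor(abc) corresponds to is_prefix(abc, abcd)
print(is_prefix(abc, abcd))  # True
# abcd.is_ancestor_of(abc) corresponds to is_prefix(abcd, abc)
print(is_prefix(abcd, abc))  # False
```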
- property is_root: bool
Check if the path represents the root node.
- Returns:
True if the path is root, False otherwise.
- Return type:
bool
- classmethod keep_descendants(paths: List[ProcessingPath]) List[ProcessingPath][source]
Filters a list of paths and keeps only descendants.
- property record_index: int
Extract the first element of the current path to determine the record number if the current path refers back to a paginated structure.
- Returns:
The first value, converted to an integer if possible
- Return type:
int
- Raises:
PathIndexingError – if the first element of the path is not a numerical index
- remove(removal_list: List[str]) ProcessingPath[source]
Remove specified components from the path.
- Parameters:
removal_list (List[str]) – A list of components to remove.
- Returns:
A new ProcessingPath object without the specified components.
- Return type:
ProcessingPath
- remove_by_type(removal_list: List[str], raise_on_error: bool = False) ProcessingPath[source]
Remove specified component types from the path.
- Parameters:
removal_list (List[str]) – A list of component types to remove.
- Returns:
A new ProcessingPath object without the specified components.
- Return type:
ProcessingPath
- remove_indices(num: int = -1, reverse: bool = False) ProcessingPath[source]
Remove numeric components from the path.
- Parameters:
num (int) – The number of numeric components to remove. If negative, removes all (default is -1).
- Returns:
A new ProcessingPath object without the specified numeric components.
- Return type:
ProcessingPath
- replace(old: str, new: str) ProcessingPath[source]
Replace occurrences of a component in the path.
- Parameters:
old (str) – The component to replace.
new (str) – The new component to replace the old one with.
- Returns:
A new ProcessingPath object with the replaced components.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the replacement arguments are not strings.
- replace_indices(placeholder: str = 'i') ProcessingPath[source]
Replace numeric components in the path with a placeholder.
- Parameters:
placeholder (str) – The placeholder to replace numeric components with (default is ‘i’).
- Returns:
A new ProcessingPath object with numeric components replaced by the placeholder.
- Return type:
ProcessingPath
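The index-handling methods (remove_indices, replace_indices, and group) can be pictured with the stand-alone sketch below. It works on plain tuples and assumes that a "numeric component" is any component for which `str.isdigit()` is true; the function bodies are illustrative and do not reproduce the library's implementation or its `num`, `reverse`, and `last_only` options.

```python
from typing import Tuple

Path = Tuple[str, ...]


def replace_indices(path: Path, placeholder: str = "i") -> Path:
    """Swap every numeric component for the placeholder."""
    return tuple(placeholder if c.isdigit() else c for c in path)


def remove_indices(path: Path) -> Path:
    """Drop every numeric component."""
    return tuple(c for c in path if not c.isdigit())


def group(path: Path) -> Path:
    """Drop a trailing numeric component, then replace the remaining indices."""
    trimmed = path[:-1] if path and path[-1].isdigit() else path
    return replace_indices(trimmed)


path = ("0", "abstract", "1")
print(replace_indices(path))  # ('i', 'abstract', 'i')
print(remove_indices(path))   # ('abstract',)
print(group(path))            # ('i', 'abstract')
```

Grouping this way means that `0.abstract.0` and `0.abstract.1` share one group key, which is what makes it possible to consolidate list elements into a single record field.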
- replace_path(old: str | ProcessingPath, new: str | ProcessingPath, component_types: List | Tuple | None = None) ProcessingPath[source]
Replace an ancestor path or full path in the current ProcessingPath with a new path.
- Parameters:
old (Union[str, ProcessingPath]) – The path to replace.
new (Union[str, ProcessingPath]) – The new path to replace the old path ancestor or full path with.
- Returns:
A new ProcessingPath object with the replaced components.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the replacement arguments are not strings or ProcessingPaths.
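Replacing an ancestor path amounts to swapping a leading run of components for another. The sketch below shows the idea on tuples; `replace_prefix` is a hypothetical name, and returning the path unchanged when `old` is not a prefix is an assumption, since the behavior for non-matching paths is not stated here.

```python
from typing import Tuple

Path = Tuple[str, ...]


def replace_prefix(path: Path, old: Path, new: Path) -> Path:
    """Swap the leading `old` components of `path` for `new` (no-op if `old` is not a prefix)."""
    if path[: len(old)] == old:
        return new + path[len(old):]
    return path


path = ("0", "authors", "assistant")
print(replace_prefix(path, ("0", "authors"), ("people",)))  # ('people', 'assistant')
print(replace_prefix(path, path, ("assistant",)))           # ('assistant',)
```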
- reversed() ProcessingPath[source]
Returns a reversed ProcessingPath from the current_path.
- Returns:
A new ProcessingPath object with the same components/types in a reversed order
- Return type:
ProcessingPath
- sorted() ProcessingPath[source]
Returns a sorted ProcessingPath from the current_path. Elements are sorted by component in alphabetical order.
- Returns:
A new ProcessingPath object with the same components/types in sorted order
- Return type:
ProcessingPath
- to_list() List[str][source]
Convert the ProcessingPath to a list of components.
- Returns:
A list of components in the ProcessingPath.
- Return type:
List[str]
- to_pattern(escape_all=False) Pattern[source]
Convert the ProcessingPath to a regular expression pattern.
- Returns:
The regular expression pattern representing the ProcessingPath.
- Return type:
Pattern
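For intuition only, a path can be turned into a regular expression by escaping each component and the delimiter before joining them. The sketch below uses only the standard `re` module; the function name `to_pattern_sketch` is hypothetical, and it does not model the `escape_all` flag, whose semantics are not described here.

```python
import re
from typing import Sequence


def to_pattern_sketch(components: Sequence[str], delimiter: str = ".") -> "re.Pattern[str]":
    """Build a literal-match pattern for a delimited path (illustrative only)."""
    # Escape both the components and the delimiter so that '.' is not a wildcard
    return re.compile(re.escape(delimiter).join(re.escape(c) for c in components))


pattern = to_pattern_sketch(("authors", "assistant"))
print(bool(pattern.fullmatch("authors.assistant")))  # True
print(bool(pattern.fullmatch("authorsXassistant")))  # False
```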
- classmethod to_processing_path(path: ProcessingPath | str | int | List[str] | List[int] | List[str | int], component_types: list | tuple | None = None, delimiter: str | None = None, infer_delimiter: bool = False) ProcessingPath[source]
Convert an input to a ProcessingPath instance if it’s not already.
- Parameters:
path (Union[ProcessingPath, str, int, List[str], List[int], List[str | int]]) – The input path to convert.
component_types (list|tuple) – The type of component associated with each path element
delimiter (str) – The delimiter to use if the input is a string.
- Returns:
A ProcessingPath instance.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.
- to_string() str[source]
Get the string representation of the ProcessingPath.
- Returns:
The string representation of the ProcessingPath.
- Return type:
str
- update_delimiter(new_delimiter: str) ProcessingPath[source]
Update the delimiter of the current ProcessingPath with the provided new delimiter.
This method creates a new ProcessingPath instance with the same components but replaces the existing delimiter with the specified new_delimiter.
- Parameters:
new_delimiter (str) – The new delimiter to replace the current one.
- Returns:
A new ProcessingPath instance with the updated delimiter.
- Return type:
ProcessingPath
- Raises:
InvalidPathDelimiterError – If the provided new_delimiter is not valid.
Example
>>> processing_path = ProcessingPath('a.b.c', delimiter='.')
>>> updated_path = processing_path.update_delimiter('/')
>>> print(updated_path)  # Output: ProcessingPath(a/b/c)
- classmethod with_inferred_delimiter(path: ProcessingPath | str, component_types: List | Tuple | None = None) ProcessingPath[source]
Converts an input to a ProcessingPath instance if it’s not already a processing path.
- Parameters:
path (Union[ProcessingPath, str]) – The input path to convert.
component_types (list|tuple) – The type of component associated with each path element
- Returns:
A ProcessingPath instance.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.
scholar_flux.utils.paths.record_path_chain_map module
The scholar_flux.utils.paths.record_path_chain_map module builds on top of the original PathNodeMap to further specialize the map implementation toward the nested dictionary records that can be found within paginated data.
This module implements the RecordPathNodeMap and the RecordPathChainMap, which process batches of nodes that all apply to a single record at a time, while speeding up cache operations through set/dictionary operations when only terminal nodes are retained.
- class scholar_flux.utils.paths.record_path_chain_map.RecordPathChainMap(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap])
Bases:
UserDict[int, RecordPathNodeMap]
A dictionary-like class that maps Processing paths to PathNode objects.
- DEFAULT_USE_CACHE = True
- __init__(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap]) None[source]
Initializes the RecordPathChainMap instance.
- add(node: PathNode | RecordPathNodeMap, overwrite: bool | None = None)[source]
Add a node to the PathNodeMap instance.
- Parameters:
node (PathNode) – The node to add.
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
- Raises:
PathNodeMapError – If any error occurs while adding the node.
- filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]
Filter the RecordPathChainMap for paths with the given prefix.
- Parameters:
prefix (ProcessingPath) – The prefix to search for.
min_depth (Optional[int]) – The minimum depth to search for. Default is None.
max_depth (Optional[int]) – The maximum depth to search for. Default is None.
from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.
- Returns:
A dictionary of paths with the given prefix and their corresponding terminal_nodes
- Return type:
dict[Optional[ProcessingPath], Optional[PathNode]]
- Raises:
RecordPathNodeMapError – If an error occurs while filtering the PathNodeMap.
- get(key: str | ProcessingPath, default: RecordPathNodeMap | None = None) RecordPathNodeMap | None[source]
Gets an item from the RecordPathNodeMap instance. If the value isn’t available, this method will return the value specified in default.
- Parameters:
key (Union[str,ProcessingPath]) – The key (Processing path). If a string, it is coerced to a ProcessingPath.
- Returns:
A record map instance
- Return type:
Optional[RecordPathNodeMap]
- get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]
Helper method for retrieving a path node in a standardized way across PathNodeMaps.
- node_exists(node: PathNode | ProcessingPath) bool[source]
Helper method to validate whether the current node exists.
- property paths: list[ProcessingPath]
Enables looping over nodes stored across maps.
- property record_indices: list[int]
Helper property for retrieving the full list of all record indices across all paths for the current map.
Note: A core requirement of the ChainMap is that each RecordPathNodeMap indicates the position of a record in a nested JSON structure. This property is a helper method to quickly retrieve the full list of sorted record_indices.
- Returns:
A list containing integers denoting individual records found in each path
- Return type:
list[int]
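The idea of keeping one map per record can be sketched by bucketing flattened paths on their first component. This is a simplified stand-in built from plain dictionaries and tuples; `group_by_record` is a hypothetical helper and says nothing about how RecordPathChainMap stores nodes or handles caching.

```python
from collections import defaultdict
from typing import Any, Dict, Tuple

Path = Tuple[str, ...]


def group_by_record(path_values: Dict[Path, Any]) -> Dict[int, Dict[Path, Any]]:
    """Bucket path/value pairs by the numeric record index that starts each path."""
    grouped: Dict[int, Dict[Path, Any]] = defaultdict(dict)
    for path, value in path_values.items():
        if not path or not path[0].isdigit():
            raise ValueError(f"path {path!r} does not start with a record index")
        grouped[int(path[0])][path] = value
    return dict(grouped)


flat = {
    ("0", "title"): "Sample Study",
    ("1", "title"): "Another Study",
    ("0", "doi"): "10.1234/example.doi",
}
by_record = group_by_record(flat)
print(sorted(by_record))  # [0, 1], the analogue of record_indices
```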
- remove(node: ProcessingPath | PathNode | str)[source]
Remove the specified path or node from the PathNodeMap instance.
- Parameters:
node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.
inplace (bool) – Whether to remove the path in-place or return a new PathNodeMap instance. Default is True.
- Returns:
A new PathNodeMap instance with the specified paths removed if inplace is specified as True.
- Return type:
Optional[PathNodeMap]
- Raises:
PathNodeMapError – If any error occurs while removing.
- update(*args, overwrite: bool | None = None, **kwargs: dict[str, PathNode] | dict[str | ProcessingPath, RecordPathNodeMap]) None[source]
Updates the PathNodeMap instance with new key-value pairs.
- Parameters:
*args (Union["PathNodeMap",dict[ProcessingPath, PathNode],dict[str, PathNode]]) – PathNodeMap or dictionary containing the key-value pairs to append to the PathNodeMap
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
**kwargs (PathNode) – Path Nodes using the path as the argument name to append to the PathNodeMap
- Returns:
None
- class scholar_flux.utils.paths.record_path_chain_map.RecordPathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])
Bases:
PathNodeMap
A dictionary-like class that maps Processing paths to PathNode objects using record indexes.
This implementation inherits from the PathNodeMap class and constrains the allowed nodes to those that begin with a numeric record index, where each index indicates a record and nodes represent values associated with the record.
- __init__(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]
Initializes the RecordPathNodeMap using a similar set of inputs as the original PathNodeMap.
This implementation constrains the input nodes to a single numeric key index that all nodes must begin with. If nodes are provided without the key, then the record_index is inferred from the inputs.
- classmethod from_mapping(mapping: dict[str | ProcessingPath, PathNode] | PathNodeMap | Sequence[PathNode] | set[PathNode] | RecordPathNodeMap, use_cache: bool | None = None) RecordPathNodeMap[source]
Helper method for coercing types into a RecordPathNodeMap.
Module contents
The scholar_flux.utils.paths module contains a series of related classes that provide a unified interface for processing JSON paths, representing nested data as paths leading to terminal values (keys) and the terminal elements found at those paths (values).
- Modules:
- processing_path.py: Implements the ProcessingPath class, which provides the basic building block for defining the nested path location where terminal values are stored in structured JSON data consisting of dictionaries, lists, and other nested elements.
- path_nodes.py: Implements a PathNode class where processing paths are paired with the value at that path.
- path_discoverer.py: Defines the PathDiscoverer that recursively finds terminal paths up to a specific max depth. This implementation processes a JSON data structure to create a new flattened dictionary consisting of terminal ProcessingPaths (keys) and their associated data at these terminal paths (values).
- processing_cache.py: Implements a caching mechanism using ProcessingPaths and weak references. The processing cache uses lazy path additions and WeakKeyDictionaries to implement a cache that stores terminal path references to ensure the efficient retrieval of path-node combinations.
- path_node_map: Defines a validated PathNodeMap data structure built off a user dict to efficiently store nodes found at terminal paths. This mapping also uses a generated cache that uses weakref to keep a running mapping of all terminal nodes.
- record_path_chain_map: Implements the RecordPathNodeMap that adds a mandatory record index to PathNodeMaps for consistency when reading and manipulating JSON data nested within lists. The RecordPathChainMap is also implemented, building on the RecordPathNodeMap for increased consistency and faster retrieval of nodes associated with particular records in a JSON data set. Operates as a drop-in replacement when used in a PathNodeIndex.
- path_node_index: Implements a PathNodeIndex data structure used to orchestrate the processing path-based sparse trie data structures that take a JSON and extract, flatten, and simplify the original data structure to create an easy to process flattened dictionary.
- path_simplifier: Implements the PathSimplifier utility class that takes a PathNodeIndex as input, identifies unique paths (ignoring index), and simplifies each path into a flattened list that outputs joined paths collapsed into a string or flattened into a list.
Examples
>>> from scholar_flux.utils import PathNodeIndex
>>> record_test_json: list[dict] = [
...     {
...         "authors": {
...             "principle_investigator": "Dr. Smith",
...             "assistant": "Jane Doe"
...         },
...         "doi": "10.1234/example.doi",
...         "title": "Sample Study",
...         "abstract": ["This is a sample abstract.", "keywords: 'sample', 'abstract'"],
...         "genre": {
...             "subspecialty": "Neuroscience"
...         },
...         "journal": {
...             "topic": "Sleep Research"
...         }
...     },
...     {
...         "authors": {
...             "principle_investigator": "Dr. Lee",
...             "assistant": "John Roe"
...         },
...         "doi": "10.5678/example2.doi",
...         "title": "Another Study",
...         "abstract": "Another abstract.",
...         "genre": {
...             "subspecialty": "Psychiatry"
...         },
...         "journal": {
...             "topic": "Dreams"
...         }
...     }
... ]
>>> # Create a new index to process the current json
>>> path_node_index = PathNodeIndex()
>>> # orchestrate the pipeline of identifying terminal paths and nodes, followed by formatting and flattening
>>> # the paths used to arrive at each value at the end of the terminal path.
>>> normalized_records = path_node_index.normalize_records(record_test_json, object_delimiter = None)
>>> print(normalized_records)
# OUTPUT: [{'abstract': 'Another abstract.',
'doi': '10.5678/example2.doi',
'title': 'Another Study',
'authors.assistant': 'John Roe',
'authors.principle_investigator': 'Dr. Lee',
'genre.subspecialty': 'Psychiatry',
'journal.topic': 'Dreams'},
{'doi': '10.1234/example.doi',
'title': 'Sample Study',
'abstract': ['This is a sample abstract.', "keywords: 'sample', 'abstract'"],
'authors.assistant': 'Jane Doe',
'authors.principle_investigator': 'Dr. Smith',
'genre.subspecialty': 'Neuroscience',
'journal.topic': 'Sleep Research'}]
- class scholar_flux.utils.paths.PathDiscoverer(records: dict | list[dict] | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>)
Bases:
object
For both discovering paths and flattening json files into a single dictionary that simplifies the nested structure into the path, the type of structure, and the terminal value.
- Parameters:
records – Optional[Union[list[dict], dict]]: A list of dictionaries to be flattened
path_mappings – dict[ProcessingPath, Any]: A set of key-value pairs mapping paths to terminal values
- records
The input data to be traversed and flattened.
- Type:
Optional[Union[list[dict], dict]]
- path_mappings
Holds a dictionary of values mapped to ProcessingPaths after processing
- Type:
dict[ProcessingPath, Any]
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- __init__(records: dict | list[dict] | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>) None
- discover_path_elements(records: dict | list[dict] | None = None, current_path: ProcessingPath | None = None, max_depth: int | None = None, inplace: bool = False) dict[ProcessingPath, Any] | None[source]
Recursively traverses records to discover keys, their paths, and terminal status. Uses the private method _discover_path_elements in order to add terminal path value pairs to the path_mappings attribute.
- Parameters:
records (Optional[Union[list[dict], dict]]) – A list of dictionaries to be flattened if not already provided.
current_path (Optional[ProcessingPath]) – The parent path to prefix all subsequent paths with. This is useful when working with a subset of a dict.
max_depth (Optional[int]) – Indicates the times we should recursively attempt to retrieve a terminal path. Leaving this at None will traverse all possible nested lists/dictionaries.
inplace (bool) – Determines whether or not to save the inner state of the PathDiscoverer object. When False: Returns the final object and clears the self.path_mappings attribute. When True: Retains the self.path_mappings attribute and returns None
- path_mappings: dict[ProcessingPath, Any]
- records: list[dict] | dict | None = None
- property terminal_paths: Set[ProcessingPath]
Helper property for returning the set of all discovered paths from the PathDiscoverer.
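The recursive flattening described for PathDiscoverer can be approximated in a few lines of plain Python. The sketch below is an assumption-laden stand-in: `discover_paths` is a hypothetical function, paths are tuples of strings rather than ProcessingPath objects, and the `max_depth` rule (stop descending once a path has that many components) is a guess at the intended semantics.

```python
from typing import Any, Dict, Optional, Tuple

Path = Tuple[str, ...]


def discover_paths(data: Any, current: Path = (), max_depth: Optional[int] = None) -> Dict[Path, Any]:
    """Map each terminal path in nested dicts/lists to the value found there."""
    mappings: Dict[Path, Any] = {}
    at_limit = max_depth is not None and len(current) >= max_depth
    if isinstance(data, dict) and data and not at_limit:
        for key, value in data.items():
            mappings.update(discover_paths(value, current + (str(key),), max_depth))
    elif isinstance(data, list) and data and not at_limit:
        for index, value in enumerate(data):
            mappings.update(discover_paths(value, current + (str(index),), max_depth))
    else:
        # Scalars, empty containers, and anything at the depth limit are terminal
        mappings[current] = data
    return mappings


records = [{"doi": "10.1234/example.doi", "authors": {"assistant": "Jane Doe"}, "abstract": ["a", "b"]}]
for path, value in discover_paths(records).items():
    print(".".join(path), "->", value)
```

Each printed key starts with the record index (`0` here), which is the property that the record-oriented maps in this package rely on.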
- class scholar_flux.utils.paths.PathNode(path: ProcessingPath, value: Any)
Bases:
object
A dataclass that acts as a wrapper for path-terminal value pairs in nested JSON structures.
The PathNode consists of a value of any type and a ProcessingPath instance that indicates where a terminal-value was found. This class simplifies the process of manipulating and flattening data structures originating from JSON data.
- path
The terminal path where the value was located
- Type:
ProcessingPath
- value
- Type:
Any
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- __init__(path: ProcessingPath, value: Any) None
- classmethod is_valid_node(node: PathNode) bool[source]
Validates whether the current node is a PathNode instance. If the current input is not a PathNode, then this method will raise an InvalidPathNodeError.
- Raises:
InvalidPathNodeError – If the current node is not a PathNode or if its path is not a valid ProcessingPath
- path: ProcessingPath
- property path_group: ProcessingPath
Attempt to retrieve the path omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is later useful for when we need to group paths into a list or sets in order to consolidate record fields.
- Returns:
A ProcessingPath instance with the last numeric component removed and indices replaced.
- Return type:
ProcessingPath
- property path_keys: ProcessingPath
Utility function for retaining keys from a path, ignoring indexes generated by lists. Retrieves the original path minus all keys that originate from list indexes.
- Returns:
A ProcessingPath instance associated with all dictionary keys
- Return type:
ProcessingPath
- property record_index: int
Extract the first element of the node’s path to determine the record number originating from a list of dictionaries, assuming the path originates from a paginated structure.
- Returns:
Value denoting the record that the path originates from
- Return type:
int
- Raises:
PathIndexingError – if the first element of the path is not a numerical index
- classmethod to_path_node(path: ProcessingPath | str | int | list[str] | list[int] | list[str | int], value: Any, **path_kwargs) Self[source]
Helper method for creating a path node from the components used to create paths in addition to value to assign the path node.
- Parameters:
path (Union[ProcessingPath, str, list[str]]) – The path to be assigned to the node. If this is not a path already, then a path will be created from what is provided
value (Any) – The value to associate with the new node
**path_kwargs – Additional keyword arguments to be used in the creation of a path. This is passed to ProcessingPath.to_processing_path when creating a path
- Returns:
The newly constructed path node
- Return type:
PathNode
- Raises:
InvalidPathNodeError – If the values provided cannot be used to create a new node
- update(**attributes: ProcessingPath | Any) PathNode[source]
Update the parameters of a PathNode by creating a new PathNode instance. Note that the original PathNode dataclass is frozen. This method uses the copied dict originating from the dataclass to initialize a new PathNode.
- Parameters:
**attributes (dict) – Keyword arguments indicating the attributes of the PathNode to update. If a specific key is not provided, then it will not be updated. Each key should be a valid attribute name of PathNode, and each value should be the corresponding updated value.
- Returns:
A new PathNode with the updated attributes
- value: Any
- class scholar_flux.utils.paths.PathNodeIndex(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None)
Bases:
object
The PathNodeIndex is a dataclass that enables the efficient processing of nested key value pairs from JSON data commonly received from APIs providing records, articles, and other forms of data.
This index enables the orchestration of parsing, flattening, and simplification of JSON data structures.
- Parameters:
node_map (PathNodeMap | RecordPathChainMap) – A dictionary of path-node mappings that are used by the PathNodeIndex to simplify JSON structures into a singular list of dictionaries where each dictionary represents a record
simplifier (PathSimplifier) – A structure that enables the simplification of a path node index into a singular list of dictionary records. The structure is initially used to identify unique path names for each path-value combination.
- Class Variables:
- DEFAULT_DELIMITER (str): A delimiter to use by default when reading JSON structures and transforming the list of keys used to retrieve a terminal path into a simplified string. Each individual key is separated by this delimiter.
- MAX_PROCESSES (int): An optional maximum on the total number of processes to use when simplifying multiple records into a singular structure in parallel. This can be configured directly or turned off altogether by setting this class variable to None.
- Example Usage:
>>> from scholar_flux.utils import PathNodeIndex
>>> record_test_json: list[dict] = [
...     {
...         "authors": {"principle_investigator": "Dr. Smith", "assistant": "Jane Doe"},
...         "doi": "10.1234/example.doi",
...         "title": "Sample Study",
...         # "abstract": ["This is a sample abstract.", "keywords: 'sample', 'abstract'"],
...         "genre": {"subspecialty": "Neuroscience"},
...         "journal": {"topic": "Sleep Research"},
...     },
...     {
...         "authors": {"principle_investigator": "Dr. Lee", "assistant": "John Roe"},
...         "doi": "10.5678/example2.doi",
...         "title": "Another Study",
...         "abstract": "Another abstract.",
...         "genre": {"subspecialty": "Psychiatry"},
...         "journal": {"topic": "Dreams"},
...     },
... ]
>>> normalized_records = PathNodeIndex.normalize_records(record_test_json)
>>> normalized_records
# OUTPUT: [{'abstract': 'Another abstract.',
#           'doi': '10.5678/example2.doi',
#           'title': 'Another Study',
#           'authors.assistant': 'John Roe',
#           'authors.principle_investigator': 'Dr. Lee',
#           'genre.subspecialty': 'Psychiatry',
#           'journal.topic': 'Dreams'},
#          {'doi': '10.1234/example.doi',
#           'title': 'Sample Study',
#           'authors.assistant': 'Jane Doe',
#           'authors.principle_investigator': 'Dr. Smith',
#           'genre.subspecialty': 'Neuroscience',
#           'journal.topic': 'Sleep Research'}]
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- MAX_PROCESSES: ClassVar[int | None] = 8
- __init__(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None) None
- combine_keys(skip_keys: list | None = None) None[source]
Combine nodes with values in their paths by updating the paths of count nodes.
This method searches for paths ending with values and count, identifies related nodes, and updates the paths by combining the value with the count node.
- Parameters:
skip_keys (Optional[list]) – Keys that should not be combined regardless of a matching pattern
quote_numeric (Optional[bool]) – Determines whether to quote integer components of paths to distinguish them from indices. The default behavior is to quote them (ex. 0, 123).
- Raises:
PathCombinationError – If an error occurs during the combination process.
- classmethod from_path_mappings(path_mappings: dict[ProcessingPath, Any], chain_map: bool = False, use_cache: bool | None = None) PathNodeIndex[source]
Takes a dictionary of path:value mappings and transforms the dictionary into a list of PathNodes: useful for later path manipulations such as grouping and consolidating paths into a flattened dictionary.
If use_cache is not specified, then the Mapping will use the class default to determine whether or not to cache.
- Returns:
An index of PathNodes created from a dictionary
- Return type:
PathNodeIndex
- get_node(path: ProcessingPath | str) PathNode | None[source]
Try to retrieve a path node with the given path.
- Parameters:
path (Union[ProcessingPath, str]) – The exact path to search for in the index
- Returns:
The exact node that matches the provided path. Returns None if a match is not found
- Return type:
Optional[PathNode]
- node_map: PathNodeMap | RecordPathChainMap
- property nodes: list[PathNode]
Returns a list of PathNodes stored within the index.
- Returns:
The complete list of all PathNodes that have been registered in the PathIndex
- Return type:
list[PathNode]
- classmethod normalize_records(json_records: dict | list[dict], combine_keys: bool = True, object_delimiter: str | None = ';', parallel: bool = False) list[dict[str, Any]][source]
Full pipeline for processing a loaded JSON structure into a list of dictionaries where each individual list element is a processed and normalized record.
- Parameters:
json_records (dict[str,Any] | list[dict[str,Any]]) – The JSON structure to normalize. If this structure is a dictionary, it will first be nested in a list as a single element before processing.
combine_keys (bool) – Determines whether to combine keys that are likely to denote names and their corresponding values/counts. Default is True.
object_delimiter (Optional[str]) – The delimiter used to join terminal values stored in lists under the same key, collapsing each list into a single string. If empty, terminal lists are returned as is.
parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel
- Return type:
list[dict[str,Any]]
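The pipeline described above can be pictured with a minimal, standard-library sketch. It is not the scholar_flux implementation: the `flatten` and `normalize` helpers below are hypothetical names, paths are plain tuples rather than ProcessingPaths, and key combination (`combine_keys`) and parallel processing are left out.

```python
from typing import Any, Optional, Union


def flatten(value: Any, prefix: tuple = ()) -> dict:
    """Map each terminal value to the tuple of keys/indices that leads to it."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    flat: dict = {}
    for key, child in items:
        flat.update(flatten(child, prefix + (str(key),)))
    return flat


def normalize(records: Union[dict, list], object_delimiter: Optional[str] = ";") -> list:
    """Flatten each record, group values by index-free path, and optionally join them."""
    if isinstance(records, dict):
        records = [records]  # a lone dictionary is treated as a single record
    rows = []
    for record in records:
        grouped: dict = {}
        for path, value in flatten(record).items():
            # Assumption: purely numeric components are list indices and are dropped.
            name = ".".join(part for part in path if not part.isdigit())
            grouped.setdefault(name, []).append(value)
        row = {}
        for name, values in grouped.items():
            if len(values) == 1:
                row[name] = values[0]
            elif object_delimiter:
                row[name] = object_delimiter.join(str(v) for v in values)
            else:
                row[name] = values  # empty delimiter: keep the list as is
        rows.append(row)
    return rows


rows = normalize({"title": "A", "authors": [{"name": "X"}, {"name": "Y"}]})
print(rows)  # [{'title': 'A', 'authors.name': 'X;Y'}]
```

Passing `object_delimiter=None` to the sketch keeps `['X', 'Y']` as a list instead of joining it.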
- property paths: list[ProcessingPath]
Returns a list of Paths stored within the index.
- Returns:
The complete list of all paths that have been registered in the PathNodeIndex
- Return type:
list[ProcessingPath]
- pattern_search(pattern: str | Pattern) list[PathNode][source]
Attempt to find all values containing the specified pattern using regular expressions.
- Parameters:
pattern (Union[str, re.Pattern]) – The pattern to search for
- Returns:
all paths and nodes that match the specified pattern
- Return type:
dict[ProcessingPath, PathNode]
- property record_indices: list[int]
Helper property for quickly retrieving the full, sorted list of record indices across the current mapping of paths to nodes.
It defers to the underlying node map for the retrieval of record_indices.
- Returns:
A list containing integers denoting individual records found in each path.
- Return type:
list[int]
- search(path: ProcessingPath) list[PathNode][source]
Attempt to find all values that match the provided path or have sub-paths that exactly match the provided path.
- Parameters:
path (Union[str, ProcessingPath]) – The path to search for. Note that the provided path must exactly match a prefix/ancestor path of an indexed path to be considered a match.
- Returns:
All paths equal to, or containing sub-paths exactly matching, the specified path
- Return type:
dict[ProcessingPath, PathNode]
- simplifier: PathSimplifier
- simplify_to_rows(object_delimiter: str | None = ';', parallel: bool = False, max_components: int | None = None, remove_noninformative: bool = True) list[dict[str, Any]][source]
Simplify indexed nodes into a paginated data structure.
- Parameters:
object_delimiter (str) – The separator to use when collapsing multiple values into a single string.
parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel
- Returns:
A list of dictionaries representing the paginated data structure.
- Return type:
list[dict[str, Any]]
- use_cache: bool | None = None
- class scholar_flux.utils.paths.PathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])
Bases:
UserDict[ProcessingPath, PathNode]
A dictionary-like class that maps ProcessingPaths to PathNode objects.
- DEFAULT_USE_CACHE: bool = True
- __init__(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]
Initializes the PathNodeMap instance.
- add(node: PathNode, overwrite: bool | None = None, inplace: bool = True) PathNodeMap | None[source]
Add a node to the PathNodeMap instance.
- Parameters:
node (PathNode) – The node to add.
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
- Raises:
PathNodeMapError – If any error occurs while adding the node.
- filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]
Filter the PathNodeMap for paths with the given prefix.
- Parameters:
prefix (ProcessingPath) – The prefix to search for.
min_depth (Optional[int]) – The minimum depth to search for. Default is None.
max_depth (Optional[int]) – The maximum depth to search for. Default is None.
from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.
- Returns:
A dictionary of paths with the given prefix and their corresponding terminal_nodes
- Return type:
dict[Optional[ProcessingPath], Optional[PathNode]]
- Raises:
PathNodeMapError – If an error occurs while filtering the PathNodeMap.
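As a rough illustration of prefix filtering with depth bounds, the sketch below works on plain tuples of components. The `filter_paths` helper is hypothetical, and treating `min_depth` and `max_depth` as inclusive bounds on the number of components is an assumption rather than documented behavior.

```python
from typing import Iterable, Optional


def filter_paths(
    paths: Iterable[tuple],
    prefix: tuple,
    min_depth: Optional[int] = None,
    max_depth: Optional[int] = None,
) -> set:
    """Return the paths that start with `prefix` and fall within the depth bounds."""
    matches = set()
    for path in paths:
        # Component-wise comparison: ("0", "auth") never matches ("0", "authors", ...).
        if path[: len(prefix)] != prefix:
            continue
        depth = len(path)
        if min_depth is not None and depth < min_depth:
            continue
        if max_depth is not None and depth > max_depth:
            continue
        matches.add(path)
    return matches


paths = [
    ("0", "title"),
    ("0", "authors", "0", "name"),
    ("0", "authors", "1", "name"),
    ("1", "title"),
]
authors = filter_paths(paths, ("0", "authors"))
shallow = filter_paths(paths, ("0",), max_depth=2)
partial = filter_paths(paths, ("0", "auth"))
```

Here `authors` holds the two author-name paths of record 0, `shallow` holds only `("0", "title")`, and `partial` is empty because prefixes are compared one component at a time.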
- classmethod format_mapping(key_value_pairs: PathNodeMap | MutableMapping[ProcessingPath, PathNode] | dict[str, PathNode]) dict[ProcessingPath, PathNode][source]
Takes a dictionary or a PathNodeMap, transforms the string keys in the dictionary into ProcessingPaths, and returns the mapping.
- Parameters:
key_value_pairs (Union[dict[ProcessingPath, PathNode], dict[str, PathNode]]) – The dictionary of key-value pairs to transform.
- Returns:
a dictionary of validated path, node pairings
- Return type:
dict[ProcessingPath, PathNode]
- Raises:
PathNodeMapError – If the validation process fails.
- classmethod format_terminal_nodes(node_obj: MutableMapping | PathNodeMap | PathNode) dict[ProcessingPath, PathNode][source]
Recursively iterate over terminal nodes from PathNodeMaps and retrieve only terminal nodes.
- Parameters:
node_obj (Union[dict, PathNodeMap]) – PathNode map or node dictionary containing either nested or already flattened terminal_paths
- Returns:
The flattened terminal paths extracted from the inputted node_obj
- Return type:
dict[ProcessingPath, PathNode]
- get(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]
Gets an item from the PathNodeMap instance. If the value isn’t available, this method will return the value specified in default.
- Parameters:
key (Union[str, ProcessingPath]) – The key (ProcessingPath). If a string is provided, it is coerced to a ProcessingPath.
- Returns:
The value (PathNode instance).
- Return type:
Optional[PathNode]
- get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]
Helper method for retrieving a path node in a standardized way.
- node_exists(node: PathNode | ProcessingPath) bool[source]
Helper method to validate whether the current node exists.
- property nodes: list[PathNode]
Enables the retrieval of nodes stored within the current map as a property.
- property paths: list[ProcessingPath]
Enables the retrieval of paths stored within the current map as a property.
- property record_indices: list[int]
Helper property for retrieving the full list of all record indices across all paths for the current map.
Note: This assumes that all paths within the current map are derived from a list of records, where every path's first element denotes its initial position in a list with nested JSON components.
- Returns:
A list containing integers denoting individual records found in each path
- Return type:
list[int]
- remove(node: ProcessingPath | PathNode | str, inplace: bool = True) PathNodeMap | None[source]
Remove the specified path or node from the PathNodeMap instance.
- Parameters:
node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.
inplace (bool) – Whether to remove the path in-place or return a new PathNodeMap instance. Default is True.
- Returns:
A new PathNodeMap instance with the specified paths removed if inplace is False; otherwise None.
- Return type:
Optional[PathNodeMap]
- Raises:
PathNodeMapError – If any error occurs while removing.
- update(*args, overwrite: bool | None = None, **kwargs: Mapping[str | ProcessingPath, PathNode]) None[source]
Updates the PathNodeMap instance with new key-value pairs.
- Parameters:
*args (Union[PathNodeMap,dict[ProcessingPath, PathNode],dict[str, PathNode]]) – PathNodeMap or dictionary containing the key-value pairs to append to the PathNodeMap
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
**kwargs (PathNode) – Path nodes, using the path as the argument name, to append to the PathNodeMap
- class scholar_flux.utils.paths.PathProcessingCache
Bases:
object
The PathProcessingCache class implements a method of path caching that enables faster prefix searches and retrieval of terminal paths associated with a path-to-node mapping. This class is used within PathNodeMaps and RecordPathNodeMaps to increase the speed and efficiency of path discovery, processing, and filtering of path-node mappings.
Because the primary purpose of the scholar_flux Trie-based path-node-processing implementation is the processing and preparation of highly nested JSON structures from API responses, the PathProcessingCache was created to efficiently keep track of all descendants of a terminal node with weak references and to facilitate filtering and flattening of path-node combinations.
Stale data is automatically removed to reduce the number of comparisons needed to retrieve terminal paths only, and, as a result, later steps can more efficiently filter the complete list of terminal paths with faster path prefix searches to facilitate processing using Path-Node Maps and Indexes when processing JSON data structures.
- __init__() None[source]
Initializes the PathProcessingCache instance.
- _cache
Underlying cache data structure that keeps track of all descendants that begin with the current prefix by mapping path strings to WeakSets that automatically remove ProcessingPaths when garbage collected
- Type:
defaultdict[str, WeakSet[ProcessingPath]]
- updates
Implements a lazy caching system that only adds elements to the _cache when filtering and node retrieval is explicitly required. The implementation uses weakly referenced keys to remove cached paths to ensure that references are deleted when a lazy operation is no longer needed.
- Type:
WeakKeyDictionary[ProcessingPath, Literal['add', 'remove']]
- cache_update() None[source]
Initializes the lazy updates for the cache given the current update instructions.
- filter(prefix: ProcessingPath, min_depth: int | None = None, max_depth: int | None = None) Set[ProcessingPath][source]
Filter the cache for paths with the given prefix.
- Parameters:
prefix (ProcessingPath) – The prefix to search for.
min_depth (Optional[int]) – The minimum depth to search for. Default is None.
max_depth (Optional[int]) – The maximum depth to search for. Default is None.
- Returns:
A set of paths with the given prefix.
- Return type:
Set[ProcessingPath]
- lazy_add(path: ProcessingPath) None[source]
Add a path to the cache for faster prefix searches.
- Parameters:
path (ProcessingPath) – The path to add to the cache.
- lazy_remove(path: ProcessingPath) None[source]
Remove a path from the cache.
- Parameters:
path (ProcessingPath) – The path to remove from the cache.
- property path_cache: defaultdict[str, WeakSet[ProcessingPath]]
Helper property that allows for inspection of the PathProcessingCache and automatically updates the node cache prior to retrieval.
- Returns:
The underlying cache used within the PathProcessingCache to retrieve a list of all currently active terminal nodes.
- Return type:
defaultdict[str, WeakSet[ProcessingPath]]
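The combination of a weakly referenced prefix cache and lazily applied updates can be sketched with the standard library alone. The `Path` and `LazyPrefixCache` classes below are hypothetical stand-ins written for illustration; they mirror the attribute and method names documented above but make no claim about how PathProcessingCache is actually implemented.

```python
from collections import defaultdict
from weakref import WeakKeyDictionary, WeakSet


class Path:
    """Hypothetical hashable, weak-referenceable stand-in for a path object."""

    def __init__(self, *components: str) -> None:
        self.components = tuple(components)

    def __hash__(self) -> int:
        return hash(self.components)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Path) and self.components == other.components

    def prefixes(self) -> list:
        """String form of every prefix of the path, including the path itself."""
        return [".".join(self.components[:i]) for i in range(1, len(self.components) + 1)]


class LazyPrefixCache:
    def __init__(self) -> None:
        # prefix string -> weak set of paths that start with that prefix
        self._cache: defaultdict = defaultdict(WeakSet)
        # pending instructions, applied only when the cache is actually queried
        self.updates: WeakKeyDictionary = WeakKeyDictionary()

    def lazy_add(self, path: Path) -> None:
        self.updates[path] = "add"

    def lazy_remove(self, path: Path) -> None:
        self.updates[path] = "remove"

    def cache_update(self) -> None:
        """Apply all pending instructions to the prefix cache."""
        for path, operation in list(self.updates.items()):
            for prefix in path.prefixes():
                if operation == "add":
                    self._cache[prefix].add(path)
                else:
                    self._cache[prefix].discard(path)
        self.updates.clear()

    def filter(self, prefix: str) -> set:
        self.cache_update()
        return set(self._cache[prefix]) if prefix in self._cache else set()


title = Path("0", "title")
author = Path("0", "authors", "0", "name")
cache = LazyPrefixCache()
cache.lazy_add(title)
cache.lazy_add(author)
found = cache.filter("0.authors")
```

Because both containers hold only weak references, a path that is no longer referenced anywhere else drops out of the cache and out of the pending updates without an explicit removal step.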
- class scholar_flux.utils.paths.PathSimplifier(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>)
Bases:
object
A utility class for simplifying and managing ProcessingPaths.
- Parameters:
delimiter (str) – The delimiter to use when splitting paths.
non_informative (Optional[List[str]]) – A list of non-informative components to remove from paths.
- delimiter
The delimiter used to separate components in the path.
- Type:
str
- non_informative
A list of non-informative components to be removed during simplification.
- Type:
List[str]
- name_mappings
A dictionary for tracking unique names to avoid collisions.
- Type:
Dict[ProcessingPath, str]
- __init__(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>) None
- clear_mappings() None[source]
Clear all existing path mappings.
Example
>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.clear_mappings()
>>> simplifier.get_mapped_paths()
- Output:
{}
- delimiter: str = '.'
- generate_unique_name(path: ProcessingPath, max_components: int | None, remove_noninformative: bool = False) ProcessingPath[source]
Generate a unique name for the given Processing Path.
- Parameters:
path (ProcessingPath) – The ProcessingPath object representing the path components.
max_components (int) – The maximum number of components to use in the name.
remove_noninformative (bool) – Whether to remove non-informative components.
- Returns:
A unique ProcessingPath name.
- Return type:
ProcessingPath
- Raises:
PathSimplificationError – If an error occurs during name generation.
- get_mapped_paths() Dict[ProcessingPath, str][source]
Get the current name mappings.
- Returns:
The dictionary of mappings from original paths to simplified names.
- Return type:
Dict[ProcessingPath, str]
Example
>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.get_mapped_paths()
- Output:
{ProcessingPath('a/b/c'): 'c', ProcessingPath('a/b/d'): 'd'}
- name_mappings: Dict[ProcessingPath, str]
- non_informative: list[str]
- simplify_paths(paths: List[ProcessingPath | str] | Set[ProcessingPath | str], max_components: int | None, remove_noninformative: bool = False) Dict[ProcessingPath, str][source]
Simplify paths by removing non-informative components and selecting the last ‘max_components’ informative components.
- Parameters:
paths (List[Union[ProcessingPath, str]]) – List of path strings or ProcessingPaths to simplify.
max_components (int) – The maximum desired number of informative components to retain in the simplified path.
remove_noninformative (bool) – Whether to remove non-informative components.
- Returns:
A dictionary mapping the original path to its simplified unique group name for all elements within the same path after removing indices
- Return type:
Dict[ProcessingPath, str]
- Raises:
PathSimplificationError – If an error occurs during path simplification.
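The idea of keeping only the last informative components can be shown with a small sketch on delimiter-separated strings. The `simplify_names` function is a hypothetical helper, not the PathSimplifier method: it drops numeric indices and non-informative components, keeps the trailing `max_components` parts, and makes no attempt to resolve name collisions.

```python
from typing import Iterable, Optional, Sequence


def simplify_names(
    paths: Iterable[str],
    max_components: Optional[int],
    non_informative: Sequence[str] = (),
    delimiter: str = ".",
) -> dict:
    """Map each path string to a shorter name built from its last informative parts."""
    names = {}
    for path in paths:
        parts = [
            part
            for part in path.split(delimiter)
            # Assumption: purely numeric components are list indices.
            if not part.isdigit() and part not in non_informative
        ]
        # A falsy max_components (None or 0) keeps every informative component.
        kept = parts[-max_components:] if max_components else parts
        names[path] = delimiter.join(kept)
    return names


names = simplify_names(
    ["0.data.authors.1.name", "0.data.title.value"],
    max_components=2,
    non_informative=["data", "value"],
)
print(names)
# {'0.data.authors.1.name': 'authors.name', '0.data.title.value': 'title'}
```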
- simplify_to_row(terminal_nodes: List[PathNode] | Set[PathNode], collapse: str | None = ';') Dict[str, Any][source]
Simplify terminal nodes by mapping them to their corresponding unique names.
- Parameters:
terminal_nodes (List[PathNode]) – A list of PathNode objects representing the terminal nodes.
collapse (Optional[str]) – The separator to use when collapsing multiple values into a single string.
- Returns:
A dictionary mapping unique names to their corresponding values or collapsed strings.
- Return type:
Dict[str, Union[List[str], str]]
- Raises:
PathSimplificationError – If an error occurs during simplification.
- class scholar_flux.utils.paths.ProcessingPath(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)
Bases:
object
A utility class to handle path operations for processing and flattening dictionaries.
- Parameters:
components (Union[str, int, Tuple[str, ...], List[str], List[int], List[str | int]]) – The initial path, either as a string or a list of strings. Any integers will be auto-converted to strings in the process of formatting the components of the path
component_types (Optional[Union[Tuple[str, ...], List[str]]]) – Optional metadata fields that can be used to annotate specific components of a path
delimiter (str) – The delimiter used to separate components in the path.
- Raises:
InvalidProcessingPathError – If the path is neither a string nor a list of strings.
InvalidPathDelimiterError – If the delimiter is invalid.
- components
A tuple of path components.
- Type:
Tuple[str, ...]
- delimiter
The delimiter used to separate components in the path.
- Type:
str
Examples
>>> from scholar_flux.utils import ProcessingPath
>>> abc_path = ProcessingPath(['a', 'b', 'c'], delimiter='//')
>>> updated_path = abc_path / 'd'
>>> assert updated_path.depth > 3 and updated_path[-1] == 'd' # OUTPUT: True
>>> assert str(updated_path) == 'a//b//c//d'
>>> assert updated_path.has_ancestor(abc_path)
- DEFAULT_DELIMITER: ClassVar[str] = '.'
- __init__(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)[source]
Initializes the ProcessingPath. The inputs are first validated to ensure that the path components and delimiters are valid.
- Parameters:
components (Union[str, int, Tuple[str, ...], List[str], List[int], List[str | int]]) – The current path keys describing the path, where each key represents a nested key in a JSON structure
component_types (Optional[Union[Tuple[str, ...], List[str]]]) – An iterable of component types (used to annotate the components)
delimiter (Optional[str]) – The separator used to indicate separate nested keys in a JSON structure. Defaults to the class default if not directly specified.
- append(component: int | str, component_type: str | None = None) ProcessingPath[source]
Append a component to the path and return a new ProcessingPath object.
- Parameters:
component (Union[int, str]) – The component to append.
- Returns:
A new ProcessingPath object with the appended component.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the component is not a non-empty string.
- component_types: Tuple[str, ...] | None = None
- components: Tuple[str, ...]
- copy() ProcessingPath[source]
Create a copy of the ProcessingPath.
- Returns:
A new ProcessingPath object with the same components and delimiter.
- Return type:
ProcessingPath
- delimiter: str = ''
- property depth: int
Return the depth of the path.
- Returns:
The number of components in the path.
- Return type:
int
- get_ancestors() List[ProcessingPath | None][source]
Get all ancestor paths of the current ProcessingPath.
- Returns:
A list of all ancestor paths for the current path. If the depth of the path is 1, an empty list is returned.
- Return type:
List[Optional[ProcessingPath]]
- get_name(max_components: int = 1) ProcessingPath[source]
Generate a path name based on the last ‘max_components’ components of the path.
- Parameters:
max_components (int) – The maximum number of components to include in the name (default is 1).
- Returns:
A new ProcessingPath object representing the generated name.
- Return type:
ProcessingPath
- get_parent(step: int = 1) ProcessingPath | None[source]
Get the ancestor path of the current ProcessingPath by the specified number of steps.
This method navigates up the path structure by the given number of steps. If the step count is greater than the depth of the current path, or if the path is already the root, it returns None. If the step count equals the current depth, it returns the root ProcessingPath.
- Parameters:
step (int) – The number of levels up to retrieve. 1 for parent, 2 for grandparent, etc. (default is 1).
- Returns:
The ancestor ProcessingPath if the step is within the path depth.
The root ProcessingPath if step equals the depth of the current path.
None if the step is greater than the current depth or if the path is already the root.
- Return type:
Optional[ProcessingPath]
- Raises:
ValueError – If the step is less than 1.
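The return rules listed above can be traced with a short sketch over tuples of components. The `parent_of` function is a hypothetical illustration, and representing the root as an empty tuple is an assumption made only for this example.

```python
from typing import Optional


def parent_of(components: tuple, step: int = 1) -> Optional[tuple]:
    """Return the ancestor `step` levels up, the root, or None, per the rules above."""
    if step < 1:
        raise ValueError("step must be at least 1")
    depth = len(components)
    if depth == 0 or step > depth:
        return None  # already the root, or stepping past it
    # When step == depth the slice is empty, which stands for the root here.
    return components[: depth - step]


path = ("a", "b", "c")
parent = parent_of(path)          # ('a', 'b')
grandparent = parent_of(path, 2)  # ('a',)
root = parent_of(path, 3)         # ()
beyond = parent_of(path, 4)       # None
```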
- group(last_only: bool = False) ProcessingPath[source]
Attempt to retrieve the path omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is later useful for when we need to group paths into a list or sets in order to consolidate record fields.
- Parameters:
last_only (bool) – Determines whether to only remove the last numeric component (True) or to also replace all remaining list indices with the placeholder (False).
- Returns:
A ProcessingPath instance with the last numeric component removed and indices replaced.
- Return type:
ProcessingPath
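To make the grouping behavior concrete, here is a minimal sketch on tuples of string components. The `group_key` function is hypothetical, and its reading of `last_only` (True removes only the trailing index, False also replaces the remaining indices) is an interpretation of the description above rather than confirmed behavior.

```python
def group_key(components: tuple, last_only: bool = False, placeholder: str = "i") -> tuple:
    """Drop a trailing numeric component and optionally mask the remaining indices."""
    parts = list(components)
    if parts and parts[-1].isdigit():
        parts.pop()  # omit the last element when it is a list index
    if not last_only:
        parts = [placeholder if part.isdigit() else part for part in parts]
    return tuple(parts)


path = ("0", "authors", "1", "affiliations", "2")
masked = group_key(path)                     # ('i', 'authors', 'i', 'affiliations')
trimmed = group_key(path, last_only=True)    # ('0', 'authors', '1', 'affiliations')
```

Paths from different records or list positions that share the same masked key can then be collected under one field name.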
- has_ancestor(path: str | ProcessingPath) bool[source]
Determine whether the provided path is equal to or an ancestor (prefix) of the current path (self).
- Parameters:
path (ProcessingPath) – The potential ancestor (prefix) of the current path (self).
- Returns:
True if 'self' is a superset of 'path'. False otherwise.
- Return type:
bool
- static infer_delimiter(path: str | ProcessingPath, delimiters: list[str] = ['<>', '//', '/', '>', '<', '\\', '%', '.']) str | None[source]
Infer the delimiter used in the path string based on its string representation.
- Parameters:
path (Union[str,ProcessingPath]) – The path string to infer the delimiter from.
delimiters (List[str]) – A list of common delimiters to search for in the path.
default_delimiter (str) – The default delimiter to use if no delimiter is found.
- Returns:
The inferred delimiter.
- Return type:
Optional[str]
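One plausible way to infer a delimiter is to return the first candidate that occurs in the path string, which is why multi-character candidates such as `<>` and `//` appear before their single-character parts in the default list. The `guess_delimiter` function below is a hypothetical sketch of that idea; the first-match rule and the None fallback are assumptions.

```python
from typing import Optional, Sequence

# Candidate order copied from the default shown in the signature above.
CANDIDATES = ["<>", "//", "/", ">", "<", "\\", "%", "."]


def guess_delimiter(path: str, delimiters: Sequence[str] = CANDIDATES) -> Optional[str]:
    """Return the first candidate delimiter found in the path string, if any."""
    for delimiter in delimiters:
        if delimiter in path:
            return delimiter
    return None


double_slash = guess_delimiter("a//b//c")  # '//' is checked before '/'
dotted = guess_delimiter("a.b.c")          # '.'
missing = guess_delimiter("abc")           # None
```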
- info_content(non_informative: List[str]) int[source]
Calculate the number of informative components in the path.
- Parameters:
non_informative (List[str]) – A list of non-informative components.
- Returns:
The number of informative components.
- Return type:
int
- is_ancestor_of(path: str | ProcessingPath) bool[source]
Determine whether the current path (self) is equal to or an ancestor (prefix) of the specified path.
- Parameters:
path (ProcessingPath) – The potential descendant (superset) of the current path (self).
- Returns:
True if 'self' is a subset of 'path'. False otherwise.
- Return type:
bool
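The two ancestor checks are mirror images of each other, which a tuple-based sketch makes easy to see. Both functions below are hypothetical illustrations that take the path as an explicit first argument instead of `self`.

```python
def has_ancestor(path: tuple, candidate: tuple) -> bool:
    """True when `candidate` equals `path` or is a leading run of its components."""
    return path[: len(candidate)] == candidate


def is_ancestor_of(path: tuple, other: tuple) -> bool:
    """True when `path` equals `other` or is a leading run of its components."""
    return has_ancestor(other, path)


abc = ("a", "b", "c")
abcd = ("a", "b", "c", "d")
child_has_parent = has_ancestor(abcd, abc)    # True
parent_of_child = is_ancestor_of(abc, abcd)   # True
reversed_check = has_ancestor(abc, abcd)      # False
```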
- property is_root: bool
Check if the path represents the root node.
- Returns:
True if the path is root, False otherwise.
- Return type:
bool
- classmethod keep_descendants(paths: List[ProcessingPath]) List[ProcessingPath][source]
Filters a list of paths and keeps only descendants.
- property record_index: int
Extract the first element of the current path to determine the record number if the current path refers back to a paginated structure.
- Returns:
The first value, converted to an integer if possible
- Return type:
int
- Raises:
PathIndexingError – if the first element of the path is not a numerical index
- remove(removal_list: List[str]) ProcessingPath[source]
Remove specified components from the path.
- Parameters:
removal_list (List[str]) – A list of components to remove.
- Returns:
A new ProcessingPath object without the specified components.
- Return type:
ProcessingPath
- remove_by_type(removal_list: List[str], raise_on_error: bool = False) ProcessingPath[source]
Remove specified component types from the path.
- Parameters:
removal_list (List[str]) – A list of component types to remove.
- Returns:
A new ProcessingPath object without the specified components.
- Return type:
ProcessingPath
- remove_indices(num: int = -1, reverse: bool = False) ProcessingPath[source]
Remove numeric components from the path.
- Parameters:
num (int) – The number of numeric components to remove. If negative, removes all (default is -1).
- Returns:
A new ProcessingPath object without the specified numeric components.
- Return type:
ProcessingPath
- replace(old: str, new: str) ProcessingPath[source]
Replace occurrences of a component in the path.
- Parameters:
old (str) – The component to replace.
new (str) – The new component to replace the old one with.
- Returns:
A new ProcessingPath object with the replaced components.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the replacement arguments are not strings.
- replace_indices(placeholder: str = 'i') ProcessingPath[source]
Replace numeric components in the path with a placeholder.
- Parameters:
placeholder (str) – The placeholder to replace numeric components with (default is ‘i’).
- Returns:
A new ProcessingPath object with numeric components replaced by the placeholder.
- Return type:
ProcessingPath
- replace_path(old: str | ProcessingPath, new: str | ProcessingPath, component_types: List | Tuple | None = None) ProcessingPath[source]
Replace an ancestor path or full path in the current ProcessingPath with a new path.
- Parameters:
old (Union[str, ProcessingPath]) – The path to replace.
new (Union[str, ProcessingPath]) – The new path to replace the old path ancestor or full path with.
- Returns:
A new ProcessingPath object with the replaced components.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the replacement arguments are not strings or ProcessingPaths.
- reversed() ProcessingPath[source]
Returns a reversed ProcessingPath from the current_path.
- Returns:
A new ProcessingPath object with the same components/types in a reversed order
- Return type:
ProcessingPath
- sorted() ProcessingPath[source]
Returns a sorted ProcessingPath from the current_path. Elements are sorted by component in alphabetical order.
- Returns:
A new ProcessingPath object with the same components/types in sorted order
- Return type:
ProcessingPath
- to_list() List[str][source]
Convert the ProcessingPath to a list of components.
- Returns:
A list of components in the ProcessingPath.
- Return type:
List[str]
- to_pattern(escape_all=False) Pattern[source]
Convert the ProcessingPath to a regular expression pattern.
- Returns:
The regular expression pattern representing the ProcessingPath.
- Return type:
Pattern
- classmethod to_processing_path(path: ProcessingPath | str | int | List[str] | List[int] | List[str | int], component_types: list | tuple | None = None, delimiter: str | None = None, infer_delimiter: bool = False) ProcessingPath[source]
Convert an input to a ProcessingPath instance if it’s not already.
- Parameters:
path (Union[ProcessingPath, str, int, List[str], List[int], List[str | int]]) – The input path to convert.
component_types (list|tuple) – The type of component associated with each path element
delimiter (str) – The delimiter to use if the input is a string.
- Returns:
A ProcessingPath instance.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.
- to_string() str[source]
Get the string representation of the ProcessingPath.
- Returns:
The string representation of the ProcessingPath.
- Return type:
str
- update_delimiter(new_delimiter: str) ProcessingPath[source]
Update the delimiter of the current ProcessingPath with the provided new delimiter.
This method creates a new ProcessingPath instance with the same components but replaces the existing delimiter with the specified new_delimiter.
- Parameters:
new_delimiter (str) – The new delimiter to replace the current one.
- Returns:
A new ProcessingPath instance with the updated delimiter.
- Return type:
ProcessingPath
- Raises:
InvalidPathDelimiterError – If the provided new_delimiter is not valid.
Example
>>> processing_path = ProcessingPath('a.b.c', delimiter='.')
>>> updated_path = processing_path.update_delimiter('/')
>>> print(updated_path) # Output: ProcessingPath(a/b/c)
- classmethod with_inferred_delimiter(path: ProcessingPath | str, component_types: List | Tuple | None = None) ProcessingPath[source]
Converts an input to a ProcessingPath instance if it’s not already a processing path.
- Parameters:
path (Union[ProcessingPath, str, List[str]]) – The input path to convert.
delimiter (str) – The delimiter to use if the input is a string.
component_types (list|tuple) – The type of component associated with each path element
- Returns:
A ProcessingPath instance.
- Return type:
ProcessingPath
- Raises:
InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.
- class scholar_flux.utils.paths.RecordPathChainMap(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap])
Bases:
UserDict[int, RecordPathNodeMap]
A dictionary-like class that maps ProcessingPaths to PathNode objects.
- DEFAULT_USE_CACHE = True
- __init__(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap]) None[source]
Initializes the RecordPathChainMap instance.
- add(node: PathNode | RecordPathNodeMap, overwrite: bool | None = None)[source]
Add a node to the PathNodeMap instance.
- Parameters:
node (PathNode) – The node to add.
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
- Raises:
PathNodeMapError – If any error occurs while adding the node.
- filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]
Filter the RecordPathChainMap for paths with the given prefix.
- Parameters:
prefix (ProcessingPath) – The prefix to search for.
min_depth (Optional[int]) – The minimum depth to search for. Default is None.
max_depth (Optional[int]) – The maximum depth to search for. Default is None.
from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.
- Returns:
A dictionary of paths with the given prefix and their corresponding terminal_nodes
- Return type:
dict[Optional[ProcessingPath], Optional[PathNode]]
- Raises:
RecordPathNodeMapError – If an error occurs while filtering the PathNodeMap.
- get(key: str | ProcessingPath, default: RecordPathNodeMap | None = None) RecordPathNodeMap | None[source]
Gets an item from the RecordPathNodeMap instance. If the value isn’t available, this method will return the value specified in default.
- Parameters:
key (Union[str, ProcessingPath]) – The key (ProcessingPath). If a string is provided, it is coerced to a ProcessingPath.
- Returns:
A record map instance
- Return type:
Optional[RecordPathNodeMap]
- get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]
Helper method for retrieving a path node in a standardized way across PathNodeMaps.
- node_exists(node: PathNode | ProcessingPath) bool[source]
Helper method to validate whether the current node exists.
- property paths: list[ProcessingPath]
Enables looping over paths stored across maps.
- property record_indices: list[int]
Helper property for quickly retrieving the full, sorted list of all record indices across all paths for the current map.
Note: A core requirement of the ChainMap is that each RecordPathNodeMap indicates the position of a record in a nested JSON structure.
- Returns:
A list containing integers denoting individual records found in each path
- Return type:
list[int]
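The record-keyed layout can be approximated by bucketing flattened paths by their leading index. The sketch below is a hypothetical illustration using plain dictionaries and dot-delimited strings; `record_index_of` and `chain_by_record` are invented names, not part of the RecordPathChainMap API.

```python
from collections import defaultdict


def record_index_of(path: str, delimiter: str = ".") -> int:
    """Read the leading component of a path as the record number."""
    first = path.split(delimiter, 1)[0]
    if not first.isdigit():
        raise ValueError(f"path {path!r} does not start with a numeric record index")
    return int(first)


def chain_by_record(mapping: dict) -> dict:
    """Group a flat path-to-value mapping into one sub-mapping per record index."""
    chain = defaultdict(dict)
    for path, value in mapping.items():
        chain[record_index_of(path)][path] = value
    return dict(chain)


flat = {"1.title": "B", "0.title": "A", "0.authors.0.name": "X"}
chain = chain_by_record(flat)
record_indices = sorted(chain)
print(record_indices)  # [0, 1]
```

Lookups for a single record then only need to scan that record's sub-mapping rather than every path in the index.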
- remove(node: ProcessingPath | PathNode | str)[source]
Remove the specified path or node from the PathNodeMap instance.
- Parameters:
node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.
inplace (bool) – Whether to remove the path in-place or return a new PathNodeMap instance. Default is True.
- Returns:
A new PathNodeMap instance with the specified paths removed if inplace is specified as True.
- Return type:
Optional[PathNodeMap]
- Raises:
PathNodeMapError – If any error occurs while removing.
- update(*args, overwrite: bool | None = None, **kwargs: dict[str, PathNode] | dict[str | ProcessingPath, RecordPathNodeMap]) None[source]
Updates the PathNodeMap instance with new key-value pairs.
- Parameters:
*args (Union["PathNodeMap",dict[ProcessingPath, PathNode],dict[str, PathNode]]) – PathNodeMap or dictionary containing the key-value pairs to append to the PathNodeMap
overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.
**kwargs (PathNode) – Path nodes, using the path as the argument name, to append to the PathNodeMap
- class scholar_flux.utils.paths.RecordPathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])
Bases:
PathNodeMap
A dictionary-like class that maps ProcessingPaths to PathNode objects using record indexes.
This implementation inherits from the PathNodeMap class and constrains the allowed nodes to those that begin with a numeric record index, where each index indicates a record and nodes represent values associated with the record.
- __init__(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]
Initializes the RecordPathNodeMap using a similar set of inputs as the original PathNodeMap.
This implementation constrains the inputted nodes to a singular numeric key index that all nodes must begin with. If nodes are provided without the key, then the record_index is inferred from the inputs.
- classmethod from_mapping(mapping: dict[str | ProcessingPath, PathNode] | PathNodeMap | Sequence[PathNode] | set[PathNode] | RecordPathNodeMap, use_cache: bool | None = None) RecordPathNodeMap[source]
Helper method for coercing types into a RecordPathNodeMap.