scholar_flux.utils package

Subpackages

Submodules

scholar_flux.utils.config_loader module

The scholar_flux.utils.config_loader module implements the primary configuration loader used by the scholar_flux package to register user-specified package defaults via environment variables.

The ConfigLoader is used alongside the scholar_flux.utils.initializer to fully initialize the scholar_flux package with the chosen configuration. This includes importing API keys as secret strings, defining log levels, setting the default API provider, etc.

class scholar_flux.utils.config_loader.ConfigLoader(env_path: str | Path | None = None)[source]

Bases: object

Helper class used to load the configuration of the scholar_flux package on initialization to dynamically configure package options. Using the config loader with environment variables, the following settings can be defined at runtime.

Package Level Settings:
  • SCHOLAR_FLUX_DEFAULT_PROVIDER: Defines the provider to use by default when creating a SearchAPI instance

API_KEYS:
  • ARXIV_API_KEY: API key used when retrieving academic data from arXiv

  • OPEN_ALEX_API_KEY: API key used when retrieving academic data from OpenAlex

  • SPRINGER_NATURE_API_KEY: API key used when retrieving academic data from Springer Nature

  • CROSSREF_API_KEY: API key used to retrieve academic metadata from Crossref (API key not required)

  • CORE_API_KEY: API key used to retrieve metadata and full-text publications from the CORE API

  • PUBMED_API_KEY: API key used to retrieve publications from the NIH PubMed database

Session Cache:
  • SCHOLAR_FLUX_CACHE_DIRECTORY: defines where the requests and response processing cache will be stored when using sqlite and similar cache storages

  • SCHOLAR_FLUX_CACHE_SECRET_KEY: defines the secret key used to create an encrypted session cache for request retrieval

Logging:
  • SCHOLAR_FLUX_ENABLE_LOGGING: defines whether logging should be enabled or not

  • SCHOLAR_FLUX_LOG_DIRECTORY: defines where rotating logs will be stored when logging is enabled

  • SCHOLAR_FLUX_LOG_LEVEL: defines the default log level used for package-level logging during and after scholar_flux package initialization

  • SCHOLAR_FLUX_PROPAGATE_LOGS: determines whether logs should be propagated or not. (True by default)

Examples

>>> from scholar_flux.utils import ConfigLoader
>>> from pydantic import SecretStr
>>> config_loader = ConfigLoader()
>>> config_loader.load_config(reload_env=True)
>>> api_key = '' # Your key goes here
>>> if api_key:
...     config_loader.config['CROSSREF_API_KEY'] = api_key
>>> print(config_loader.env_path) # the default environment location when writing/replacing an .env config
>>> config_loader.save_config() # to save the full configuration in the default environment folder
DEFAULT_ENV: Dict[str, Any] = {'ARXIV_API_KEY': None, 'CORE_API_KEY': None, 'CROSSREF_API_KEY': None, 'OPEN_ALEX_API_KEY': None, 'PUBMED_API_KEY': None, 'SCHOLAR_FLUX_CACHE_DIRECTORY': None, 'SCHOLAR_FLUX_CACHE_SECRET_KEY': None, 'SCHOLAR_FLUX_DEFAULT_PROVIDER': 'plos', 'SCHOLAR_FLUX_ENABLE_LOGGING': '', 'SCHOLAR_FLUX_LOG_DIRECTORY': None, 'SCHOLAR_FLUX_LOG_LEVEL': '', 'SCHOLAR_FLUX_MONGODB_HOST': 'mongodb://127.0.0.1', 'SCHOLAR_FLUX_MONGODB_PORT': 27017, 'SCHOLAR_FLUX_PROPAGATE_LOGS': '', 'SCHOLAR_FLUX_REDIS_HOST': 'localhost', 'SCHOLAR_FLUX_REDIS_PORT': 6379, 'SPRINGER_NATURE_API_KEY': None}
DEFAULT_ENV_PATH: Path = PosixPath('/home/runner/work/scholar-flux/scholar-flux/.env')
__init__(env_path: str | Path | None = None)[source]

Utility class for loading environment variables from the operating system and .env files.

load_config(env_path: str | Path | None = None, reload_env: bool = False, reload_os_env: bool = False, verbose: bool = False) None[source]

Load configuration settings from the global OS environment or an .env file while optionally overwriting previously set configuration settings.

Optionally attempts to reload and overwrite previously set ConfigLoader settings using either or both sources of config settings.

Note that config settings from an .env file are prioritized over globally set OS environment variables. If neither reload_os_env nor reload_env is set, this function has no effect on the current configuration.

Parameters:
  • env_path (Optional[Path | str]) – An optional env path to read from. Defaults to the current env_path if None.

  • reload_env (bool) – Determines whether environment variables will be loaded/reloaded from the provided env_path or a current self.env_path. Defaults to False, indicating that variables are not reloaded from an .env.

  • reload_os_env (bool) – Determines whether environment variables will be loaded/reloaded from the Operating System’s global environment.

  • verbose (bool) – Convenience setting indicating whether or not to log changed configuration variable names.
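
Example (a minimal illustrative sketch; assumes an .env file exists at the default location):

>>> from scholar_flux.utils import ConfigLoader
>>> loader = ConfigLoader()
>>> # reload settings from both an .env file and the OS environment, logging changed variable names
>>> loader.load_config(reload_env=True, reload_os_env=True, verbose=True)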

load_dotenv(env_path: str | Path | None = None, replace_all: bool = False, verbose: bool = False) dict[source]

Retrieves a dictionary of non-null environment variables from the current .env file.

Parameters:
  • env_path (Optional[Path | str]) – Location of the .env file from which env variables will be retrieved

  • replace_all (bool) – Indicates whether all environment variables should be replaced vs. only non-missing variables. Defaults to False.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. Defaults to False.

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict

load_os_env(replace_all: bool = False, verbose: bool = False) dict[source]

Load any updated configuration settings from variables set within the system environment.

The configuration setting must already exist in the config to be updated if available. Otherwise, the update_config method allows direct updates to the config settings.

Parameters:
  • replace_all (bool) – Indicates whether all environment variables should be replaced vs. only non-missing variables. Defaults to False.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. Defaults to False.

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict

classmethod load_os_env_key(key: str, **kwargs) str | SecretStr | None[source]

Loads the provided key from the global environment. Converts API_KEY variables to secret strings by default.

Parameters:
  • key (str) – The key to load from the environment. This key will be guarded if it contains any of the following substrings: “API_KEY”, “SECRET”, “MAIL”

  • matches (str) – The substrings used to indicate whether the loaded environment variable should be guarded

Returns:

The value of the environment variable, possibly wrapped as a secret string

Return type:

Optional[str | SecretStr]
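
Example (an illustrative sketch; assumes CROSSREF_API_KEY is set in the OS environment):

>>> from scholar_flux.utils import ConfigLoader
>>> api_key = ConfigLoader.load_os_env_key('CROSSREF_API_KEY')
>>> # keys containing "API_KEY" are guarded: the value is wrapped as a SecretStr and masked when displayed
>>> assert api_key is None or str(api_key) == '**********'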

save_config(env_path: str | Path | None = None) None[source]

Save configuration settings to a .env file.

Unmasks strings read as secrets if they are of type SecretStr.

try_loadenv(env_path: str | Path | None = None, verbose: bool = False) Dict[str, Any] | None[source]

Try to load environment variables from a specified .env file into the environment and return as a dict.

update_config(env_dict: dict[str, Any], verbose: bool = False) None[source]

Helper method for updating the config dictionary with the provided dictionary of key-value pairs.

This method coerces strings into integers when possible and uses the _guard_secret method as insurance to guard against logging and recording API keys without masking. Although the load_env and load_os_env methods also mask API keys, this is particularly useful if the end-user calls update_config directly.
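
Example (a brief sketch of a direct config update; the key values shown are placeholders):

>>> from scholar_flux.utils import ConfigLoader
>>> loader = ConfigLoader()
>>> # API keys passed through update_config are guarded (masked), and numeric strings are coerced to integers
>>> loader.update_config({'CROSSREF_API_KEY': 'a-placeholder-key', 'SCHOLAR_FLUX_REDIS_PORT': '6379'})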

write_key(key_name: str, key_value: str, env_path: str | Path | None = None, create: bool = True) None[source]

Write a key-value pair to a .env file.
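
Example (a hypothetical usage sketch; the key name and value are placeholders):

>>> from scholar_flux.utils import ConfigLoader
>>> loader = ConfigLoader()
>>> # writes or updates a single key-value pair in the default .env file
>>> loader.write_key('CROSSREF_API_KEY', 'a-placeholder-key')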

scholar_flux.utils.encoder module

The scholar_flux.utils.encoder module contains implementations of encoder-decoder helper classes that help abstract the serialization and deserialization of JSON data sets for easier storage.

Responses from APIs often contain non-serializable data types, including non-traditional sequences and mappings that aren't directly serializable. These implementations aid in creating representations of such classes that can be used to reconstruct the original object after serialization with built-in types.

Classes:
CacheDataEncoder:

Helper class used to recursively encode and decode nested JSON data with mixed data types.

JsonDataEncoder:

Helper class that builds on the CacheDataEncoder to provide built-in JSON loading/dumping support that aids in the creation of a simple Serialization-Deserialization pipeline.

class scholar_flux.utils.encoder.CacheDataEncoder[source]

Bases: object

A utility class to encode data into a base64 string representation or decode it back from base64.

This class supports encoding binary data (bytes) and recursively handles nested structures such as dictionaries and lists by encoding their elements, preserving the original structure upon decoding.

This class is used to serialize json structures when the structure isn’t known and contains unpredictable elements such as 1) None, 2) bytes, 3) nested lists, 4) Other unpredictable structures typically found in JSON.

Class Attributes:
DEFAULT_HASH_PREFIX: (Optional[str]):

An optional prefix used to mark fields as bytes for use when decoding. This field defaults to <hashbytes> but can optionally be turned off by setting CacheDataEncoder.DEFAULT_HASH_PREFIX=None or CacheDataEncoder.DEFAULT_HASH_PREFIX=''

DEFAULT_NONREADABLE_PROP (float):

A threshold used to identify previously encoded base64 fields. This proportion is used when a hash prefix that marks encoded text is not applied. To test whether a string is an encoded string: when decoded, a high percentage of its characters will be nonreadable (e.g., CacheDataEncoder.decode('encoders') -> b'zw(uêì').

Example

>>> from scholar_flux.utils import CacheDataEncoder
>>> import json
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string', 'list': ['a', True, 'series', 'of', None]}
>>> try:
...     json.dumps(data)
... except TypeError:
...     print('The `data` is non-serializable as expected')
>>>
>>> encoded_data = CacheDataEncoder.encode(data)
>>> serialized_data = json.dumps(encoded_data)
>>> assert data == CacheDataEncoder.decode(json.loads(serialized_data))
DEFAULT_HASH_PREFIX: str | None = '<hashbytes>'
DEFAULT_NONREADABLE_PROP: float = 0.2
classmethod decode(data: Any, hash_prefix: str | None = None) Any[source]

Recursively decodes base64 strings back to bytes or recursively decode elements within dictionaries and lists.

Parameters:
  • data (Any) – The input data that needs decoding from a base64 encoded format. This could be a base64 string or nested structures like dictionaries and lists containing base64 strings as values.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Decoded bytes for byte-based representations or recursively decoded elements within the dictionary/list/tuple if applicable.

Return type:

Any

classmethod encode(data: Any, hash_prefix: str | None = None) Any[source]

Recursively encodes all items that contain elements that cannot be directly serialized into JSON into a format more suitable for serialization:

  • Mappings are converted into dictionaries

  • Sets and other uncommon Sequences other than lists and tuples are converted into lists

  • Bytes objects are converted into strings and hashed with an optional prefix-identifier.

Parameters:
  • data (Any) – The input data. This can be: * bytes: Encoded directly to a base64 string. * Mappings/Sequences/Sets/Tuples: Recursively encodes elements if they are bytes.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Encoded string (for bytes) or a dictionary/list/tuple with recursively encoded elements.

Return type:

Any

classmethod is_base64(s: str | bytes, hash_prefix: str | None = None) bool[source]

Check if a string is a valid base64 encoded string. Encoded strings can optionally be identified with a hash_prefix to streamline checks to determine whether or not to later decode a base64 encoded string.

As a general heuristic when encoding and decoding base64 objects, a string should be equal to its original value after encoding and decoding. In this implementation, equals signs are stripped, as minor differences in padding aren't relevant.

Parameters:
  • s (str | bytes) – The string to check.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

True if the string is base64 encoded, False otherwise.

Return type:

bool
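
Example (a minimal sketch; assumes the default '<hashbytes>' prefix marks encoded output from CacheDataEncoder.encode):

>>> from scholar_flux.utils import CacheDataEncoder
>>> encoded = CacheDataEncoder.encode(b'raw response bytes')
>>> # bytes are encoded as prefixed base64 strings that can be identified before decoding
>>> assert CacheDataEncoder.is_base64(encoded)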

classmethod is_nonreadable(s: bytes, prop: float | None = None) bool[source]

Check if a decoded byte string contains a high percentage of non-printable characters. Non-printable characters are defined as those not within the unicode range of (32 <= c <= 126).

Parameters:
  • s (bytes) – The byte string to check.

  • prop (float) – The threshold proportion of non-printable characters. Defaults to DEFAULT_NONREADABLE_PROP if not specified.

Returns:

True if the string is likely gibberish, False otherwise.

Return type:

bool

class scholar_flux.utils.encoder.JsonDataEncoder[source]

Bases: CacheDataEncoder

Helper class that extends the CacheDataEncoder to provide functionality for serializing and deserializing JSON-formatted data to and from JSON strings for easier storage and recovery.

This class includes utility dumping and loading tools directly applicable to safely dumping and reloading responses received from various APIs.

Example Use:
>>> from scholar_flux.utils import JsonDataEncoder
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string',
...         'list': ['a', True, 'series', 'of', None]}
# serializes the original data even though it contains otherwise unserializable components
>>> serialized_data = JsonDataEncoder.dumps(data)
>>> assert isinstance(serialized_data, str)
# deserializes and decodes the data, returning the original structure
>>> recovered_data = JsonDataEncoder.deserialize(serialized_data)
# the result should be the original structure
>>> assert data == recovered_data
classmethod deserialize(s: str, **json_kwargs) Any[source]

Class method that deserializes and decodes json data from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The decoded data.

Return type:

Any

classmethod dumps(data: Any, **json_kwargs) str[source]

Convenience method that uses the json module to serialize (dump) JSON data into a JSON string.

Parameters:
  • data (Any) – The data to serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

classmethod loads(s: str, **json_kwargs) Any[source]

Convenience method that uses the json module to deserialize (load) from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The loaded json data.

Return type:

Any

classmethod serialize(data: Any, **json_kwargs) str[source]

Class method that encodes and serializes data to a JSON string.

Parameters:
  • data (Any) – The data to encode and serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

scholar_flux.utils.helpers module

The scholar_flux.utils.helpers module contains several helper functions that aid in common data manipulation scenarios, including character conversions, date-time parsing and formatting, and nesting and unnesting common Python data structures.

scholar_flux.utils.helpers.as_list_1d(value: Any) List[source]

Nests a value into a single element list if the value is not already a list.

Parameters:

value (Any) – The value to add to a list if it is not already a list

Returns:

If already a list, the value is returned as is. Otherwise, the value is nested in a list. Caveat: if the value is None, an empty list is returned

Return type:

List
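
A brief illustration of the documented behavior, including the None caveat:

>>> from scholar_flux.utils.helpers import as_list_1d
>>> assert as_list_1d('value') == ['value']
>>> assert as_list_1d(['a', 'b']) == ['a', 'b']
>>> assert as_list_1d(None) == []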

scholar_flux.utils.helpers.coerce_int(value: Any) int | None[source]

Attempts to convert a value to an integer, returning None if the conversion fails.

scholar_flux.utils.helpers.coerce_str(value: Any) str | None[source]

Attempts to convert a value into a string, if possible, returning None if conversion fails.

scholar_flux.utils.helpers.format_iso_timestamp(timestamp: datetime) str[source]

Formats a datetime as an ISO 8601 timestamp string in UTC with millisecond precision.

Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123Z”)

Return type:

str

scholar_flux.utils.helpers.generate_iso_timestamp() str[source]

Generates and formats an ISO 8601 timestamp string in UTC with millisecond precision for reliable round-trip conversion.

Example usage:
>>> from scholar_flux.utils import generate_iso_timestamp, parse_iso_timestamp, format_iso_timestamp
>>> timestamp = generate_iso_timestamp()
>>> parsed_timestamp = parse_iso_timestamp(timestamp)
>>> assert parsed_timestamp is not None and format_iso_timestamp(parsed_timestamp) == timestamp
Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123Z”)

Return type:

str

scholar_flux.utils.helpers.generate_response_hash(response: Response | ResponseProtocol) str[source]

Generates a response hash from a response or response-like object that implements the ResponseProtocol.

Parameters:

response (requests.Response | ResponseProtocol) – An http response or response-like object.

Returns:

A unique identifier for the response.

scholar_flux.utils.helpers.get_nested_data(json: list | dict | None, path: list) list | dict | None | str | int[source]

Recursively retrieves data from a nested dictionary using a sequence of keys.

Parameters:
  • json (List[Dict[Any, Any]] | Dict[Any, Any]) – The parsed json structure from which to extract data.

  • path (List[Any]) – A list of keys representing the path to the desired data within json.

Returns:

The value retrieved from the nested dictionary following the path, or None if any key in the path is not found or leads to a None value prematurely.

Return type:

Optional[Any]
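
A short illustration using a hypothetical nested record:

>>> from scholar_flux.utils.helpers import get_nested_data
>>> record = {'article': {'metadata': {'doi': '10.1234/example'}}}
>>> assert get_nested_data(record, ['article', 'metadata', 'doi']) == '10.1234/example'
>>> assert get_nested_data(record, ['article', 'missing', 'doi']) is None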

scholar_flux.utils.helpers.is_nested(obj: Any) bool[source]

Indicates whether the current value is a nested object. Useful for recursive iterations such as JSON record data.

Parameters:

obj – any (realistic JSON) data type - dicts, lists, strs, numbers

Returns:

True if nested otherwise False

Return type:

bool

scholar_flux.utils.helpers.nested_key_exists(obj: Any, key_to_find: str, regex: bool = False) bool[source]

Recursively checks if a specified key is present anywhere in a given JSON-like dictionary or list structure.

Parameters:
  • obj – The dictionary or list to search.

  • key_to_find – The key to search for.

  • regex – Whether or not to search with regular expressions.

Returns:

True if the key is present, False otherwise.
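
A short illustration using a hypothetical nested record:

>>> from scholar_flux.utils.helpers import nested_key_exists
>>> record = {'article': {'authors': [{'name': 'Doe'}]}}
>>> assert nested_key_exists(record, 'name')
>>> assert not nested_key_exists(record, 'doi')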

scholar_flux.utils.helpers.parse_iso_timestamp(timestamp_str: str) datetime | None[source]

Attempts to convert an ISO 8601 timestamp string back to a datetime object.

Parameters:

timestamp_str – ISO 8601 formatted timestamp string

Returns:

datetime object if parsing succeeds, None otherwise

Return type:

datetime

scholar_flux.utils.helpers.quote_if_string(value: Any) Any[source]

Attempt to quote string values to distinguish them from object text in class representations.

Parameters:

value (Any) – a value that is quoted only if it is a string

Returns:

Returns a quoted string if successful. Otherwise returns the value unchanged

Return type:

Any

scholar_flux.utils.helpers.quote_numeric(value: Any) str[source]

Attempts to quote a numeric value, returning the quoted string if successful. Raises a ValueError otherwise.

Parameters:

value (Any) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful.

Raises:

ValueError – If the value cannot be quoted

scholar_flux.utils.helpers.try_call(func: Callable, args: tuple | None = None, kwargs: dict | None = None, suppress: tuple = (), logger: Logger | None = None, log_level: int = 30, default: Any | None = None) Any | None[source]

A helper function for safely calling another function, suppressing an error if it occurs and is contained within the tuple of exceptions to suppress.

Parameters:
  • func – The function to call

  • args – A tuple of positional arguments to add to the function call

  • kwargs – A dictionary of keyword arguments to add to the function call

  • suppress – A tuple of exceptions to handle and suppress if they occur

  • logger – The logger to use for warning generation

  • default – The value to return in the event that an error occurs and is suppressed

Returns:

When successful, the return type of the callable is also returned without modification. Upon suppressing an exception, the function will generate a warning and return None by default unless the default was set.

Return type:

Optional[Any]
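
A minimal sketch of suppressing an expected error with a fallback default:

>>> from scholar_flux.utils.helpers import try_call
>>> # int('not a number') raises a ValueError; suppress it and return -1 instead
>>> assert try_call(int, args=('not a number',), suppress=(ValueError,), default=-1) == -1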

scholar_flux.utils.helpers.try_dict(value: List | Tuple | Dict) Dict | None[source]

Attempts to convert a value into a dictionary, if possible. If it is not possible to convert the value into a dictionary, the function will return None.

Parameters:

value (List | Tuple | Dict) – The value to attempt to convert into a dict

Returns:

The value converted into a dictionary if possible, otherwise None

Return type:

Optional[Dict]

scholar_flux.utils.helpers.try_int(value: JSON_TYPE | None) JSON_TYPE | int | None[source]

Attempts to convert a value to an integer, returning the original value if the conversion fails.

Parameters:

value (Hashable) – the value to attempt to coerce into an integer

Return type:

Optional[int]

scholar_flux.utils.helpers.try_pop(s: Set[T], item: T, default: T | None = None) T | None[source]

Attempt to remove an item from a set and return the item if it exists.

Parameters:
  • item (Hashable) – The item to try to remove from the set

  • default (Optional[Hashable]) – The object to return as a default if item is not found

Returns:

The item if the value is in the set, otherwise the specified default (Optional[Hashable])
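
A brief illustration of the documented fallback behavior:

>>> from scholar_flux.utils.helpers import try_pop
>>> items = {'a', 'b'}
>>> assert try_pop(items, 'a') == 'a'
>>> assert try_pop(items, 'missing', default='fallback') == 'fallback'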

scholar_flux.utils.helpers.try_quote_numeric(value: Any) str | None[source]

Attempt to quote numeric values to distinguish them from string values and integers.

Parameters:

value (Any) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful. Otherwise None

Return type:

Optional[str]

scholar_flux.utils.helpers.try_str(value: Any) str | None[source]

Attempts to convert a value to a string, returning the original value if the conversion fails.

Parameters:

value (Any) – the value to attempt to coerce into a string

Return type:

Optional[str]

scholar_flux.utils.helpers.unlist_1d(current_data: Tuple | List | Any) Any[source]

Retrieves an element from a list/tuple if it contains only a single element. Otherwise, it will return the element as is. Useful for extracting text from a single element list/tuple.

Parameters:

current_data (Tuple | List | Any) – An object to potentially unlist if it contains a single element.

Returns:

The unlisted object if it comes from a single element list/tuple, otherwise returns the input unchanged.

Return type:

Optional[Any]

scholar_flux.utils.initializer module

The scholar_flux.utils.initializer module is used within the scholar_flux package to kick-start the initialization of the scholar_flux package on import.

Several key steps are performed via the initializer:
  1. Environment variables are imported using the ConfigLoader

  2. The Logger is subsequently set up for the scholar_flux API package

  3. The package-level masker is set up to enable sensitive data to be redacted from logs

scholar_flux.utils.initializer.initialize_package(log: bool = True, env_path: str | Path | None = None, config_params: dict[str, Any] | None = None, logging_params: dict[str, Any] | None = None) tuple[dict[str, Any], Logger, SensitiveDataMasker][source]

Function used for orchestrating the initialization of the config, log settings, and masking for scholar_flux.

This function imports a ‘.env’ configuration file at the specified location if it exists. Otherwise, scholar_flux will look for a .env file in the default locations if available. If no .env configuration file is found, then only package defaults and available OS environment variables are used.

This function can also be used for dynamic re-initialization of configuration parameters and logging. The config_params are sent as keyword arguments to the scholar_flux.utils.ConfigLoader.load_config method, while logging_params are used as keyword arguments to the scholar_flux.utils.setup_logging method to set up logging settings and handlers.

Parameters:
  • log (bool) – A True/False flag that determines whether to enable or disable logging.

  • env_path (Optional[str | Path]) – The file path indicating from where to load the environment variables, if provided.

  • config_params (Optional[Dict]) – A dictionary allowing for the specification of configuration parameters when attempting to load environment variables from a config. Useful for loading API keys from environment variables for later use.

  • logging_params (Optional[Dict]) – A dictionary allowing users to specify options for package-level logging with custom logic. Log settings are loaded from the OS environment or an .env file when available, with precedence given to .env files. These settings, when loaded, override the default ScholarFlux logging configuration. Otherwise, ScholarFlux uses a log-level of WARNING by default.

Returns:

A tuple containing the configuration dictionary and the initialized logger.

Return type:

Tuple[Dict[str, Any], logging.Logger, scholar_flux.security.SensitiveDataMasker]

Raises:

PackageInitializationError – If there are issues with loading the configuration or initializing the logger.
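
Example (a hedged sketch; the config_params and logging_params keys follow the documented load_config and setup_logging signatures):

>>> import logging
>>> from scholar_flux.utils.initializer import initialize_package
>>> config, logger, masker = initialize_package(
...     log=True,
...     config_params={'reload_env': True},
...     logging_params={'log_level': logging.INFO},
... )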

scholar_flux.utils.json_file_utils module

The scholar_flux.utils.json_file_utils module implements a simple JsonFileUtils class that contains a basic set of convenience classes for interacting with the file system and JSON files.

class scholar_flux.utils.json_file_utils.JsonFileUtils[source]

Bases: object

Helper class that implements several basic file utility class methods for easily interacting with the file system. This class also contains utility methods used to parse, load, and dump JSON files for convenience.

Example

>>> from scholar_flux.utils.json_file_utils import JsonFileUtils
>>> from pathlib import Path
>>> original_data = {"key": "value"}
>>> json_file = "/tmp/sample"

# the JSON data should be serializable:
>>> assert JsonFileUtils.is_jsonable(original_data)
# writing the json file
>>> JsonFileUtils.save_as(original_data, json_file)
# the data should now exist at the '/tmp/sample.json' path
>>> assert Path(json_file).with_suffix('.json').exists()
# verifying that the dumped data can be loaded as intended:
>>> data = JsonFileUtils.load_data(json_file)
>>> assert data is not None and original_data == data

DEFAULT_EXT = 'json'
classmethod append_to_file(content: str | List[str], filepath: str | Path, ext: str | None = None) None[source]

Helper method used to append content to a file in a content-type aware manner.

Parameters:
  • content (Union[str, List[str]]) – The content to append to the file.

  • filepath (Union[str, Path]) – The file path to write to

  • ext (Optional[str]) – An optional extension to add to the file path

classmethod get_filepath(filepath: str | Path, ext: str | None = None) str[source]

Prepares the full filepath from the provided filepath and extension, if given. Assumes a Unix filesystem structure for edge cases.

Parameters:
  • filepath (Union[str, Path]) – The file path to read from

  • ext (Optional[str]) – An optional extension to add to the file path. If the extension is left None and an extension does not yet exist on the file path, the json extension is used by default.

static is_jsonable(obj: Any) bool[source]

Verifies whether the object can be serialized as a json object.

Parameters:

obj (Any) – The object to check

Returns:

True if the object is jsonable (serializable), otherwise False

Return type:

bool

classmethod load_data(filepath: str | Path, ext: str | None = None) Dict | List | str[source]

Attempts to load data from a filepath as a dictionary/list. If unsuccessful, the file’s contents are instead loaded as a string.

Parameters:

filepath (Union[str, Path]) – The file path to read the data from

Returns:

A dictionary or list if the data can be successfully loaded with json, and a string if loading with JSON is not possible.

Return type:

Union[Dict, List, str]

classmethod read_lines(filepath: str | Path, ext: str | None = None) Generator[str, None, None][source]

Iteratively reads lines from a text file.

Parameters:
  • filepath (Union[str, Path]) – The file path to read the data from

  • ext (Optional[str]) – An optional extension to add to the file path

Returns:

The lines read from a text file

Return type:

Generator[str, None, None]

To retrieve a list of data instead of a generator, pass the result to list:
>>> from scholar_flux.utils import JsonFileUtils
>>> line_gen = JsonFileUtils.read_lines('pyproject.toml')
>>> assert isinstance(list(line_gen), list)
classmethod save_as(obj: List | Dict | str | float | int, filepath: str | Path, ext: str | None = None, dump: bool = True) None[source]

Save an object in text format with the specified extension (if provided).

Parameters:
  • obj (Union[List, Dict, str, float, int]) – A value to save into a file

  • filepath (Union[str, Path]) – The file path to write the object to

  • ext (Optional[str]) – An optional extension to add to the file path

  • dump (bool) – If True, the object is serialized using json.dumps. Otherwise the str function is used

scholar_flux.utils.json_processing_utils module

Helper module used to recursively process JSON data of unknown type and structure received from APIs.

Classes:
PathUtils:

Utility class used to prepare path strings and lists of path components consistently for processing.

KeyDiscoverer:

Helper class for identifying JSON paths and terminal keys containing nested data elements.

KeyFilter:

Helper class used to identify and filter nested dictionaries based on path length and pattern matching.

RecursiveJsonProcessor:

Front-end facing utility class used by the scholar_flux.data.RecursiveDataProcessor to process, filter, and flatten JSON-formatted data.

JsonRecordData:

Helper class used as a container to hold extracted path/data components for further processing.

JsonNormalizer:

Helper class used by the RecursiveJsonProcessor to flatten the inputted JSON record into a non-nested dictionary

Example Use:
>>> from scholar_flux.utils import RecursiveJsonProcessor
>>> from pprint import pp
>>> data = {
...     "authors": {"principle_investigator": "Dr. Smith", "assistant": "Jane Doe"},
...     "doi": "10.1234/example.doi",
...     "title": "Sample Study",
...     "abstract": ["This is a sample abstract.", "keywords: 'sample', 'abstract'"],
...     "genre": {"subspecialty": "Neuroscience"},
...     "journal": {"topic": "Sleep Research"},
... }
# joins fields with nested components using the object_delimiter - retains full paths leading to each value
>>> processor = RecursiveJsonProcessor(object_delimiter='   ', use_full_path=True)
# processes and flattens the JSON dict using the defined helper classes under the hood
>>> result = processor.process_and_flatten(data)
# prints the result in a format that is easier to view from the CLI
>>> pp(result)
# OUTPUT: {'authors.principle_investigator': 'Dr. Smith',
           'authors.assistant': 'Jane Doe',
           'doi': '10.1234/example.doi',
           'title': 'Sample Study',
           'abstract': "This is a sample abstract.   keywords: 'sample', 'abstract'",
           'genre.subspecialty': 'Neuroscience',
           'journal.topic': 'Sleep Research'}
class scholar_flux.utils.json_processing_utils.JsonNormalizer(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Bases: object

Helper class that flattens and normalizes the retrieved list of JsonRecordData into a single flattened dictionary.

__init__(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Initialize the JsonNormalizer with the extracted JSON data.

Parameters:
  • json_record_data_list (List[JsonRecordData]) – The list of extracted JSON data.

  • use_full_path (bool) – Indicates whether to use the full nested json path or the smallest unique path available

create_unique_key(current_group: List[str], current_key_str: str, unique_mappings_dict: Dict[str, List[str]]) str[source]

Create a unique key for the current data entry if a simple key is not sufficient.

Parameters:
  • current_group (List[str]) – The list of keys in the current path.

  • current_key_str (str) – The string representation of the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

get_unique_key(current_key_str: str, current_group: List[str], unique_mappings_dict: Dict[str, List[str]]) str[source]

Generate a unique key for the current data entry.

Parameters:
  • current_key_str (str) – The string representation of the current path.

  • current_group (List[str]) – The list of keys in the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

normalize_extracted() Dict[str, List[Any] | str | None][source]

Normalize the extracted JSON data into a flattened dictionary.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Dict[str, List[Any]]

class scholar_flux.utils.json_processing_utils.JsonRecordData(path: List[str | int], data: Dict[str, Any])[source]

Bases: object

Helper class used as a container to record the paths, data, and names associated with each terminal path.

Parameters:
  • path (list[str | int]) – The path associated with the terminal data point where nested terminal values can be found

  • data (dict[str, Any]) – The nested terminal value at the end of a path

__init__(path: List[str | int], data: Dict[str, Any]) None
data: Dict[str, Any]
path: List[str | int]
class scholar_flux.utils.json_processing_utils.KeyDiscoverer(records: List[Dict] | None = None)[source]

Bases: object

Helper class used to discover terminal keys containing data within nested JSON data structures and identify the paths used to arrive at each key.

_discovered_keys

Defines the complete list of all keys that can be found in a dictionary and the path that needs to be traversed to arrive at that key

Type:

dict[str, list]

_terminal_paths

Creates a dictionary that indicates whether the currently added path is terminal within the JSON data structure

Type:

dict[str, bool]

__init__(records: List[Dict] | None = None)[source]

Initializes the KeyDiscoverer and identifies terminal key/path pairs within the JSON data structure.

filter_keys(prefix: str | None = None, min_length: int | None = None, substring: str | None = None) Dict[str, List[str]][source]

Helper method that filters a range of keys based on the specified criteria.

get_all_keys() Dict[str, List[str]][source]

Returns all discovered keys and their paths.

get_keys_with_path(key: str) List[str][source]

Returns all paths associated with a specific key.

get_terminal_keys() Dict[str, List[str]][source]

Returns keys and their terminal paths (paths that don’t contain nested dictionaries).

get_terminal_paths() List[str][source]

Returns paths indicating whether they are terminal (don’t contain nested dictionaries).
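
Example (an illustrative sketch based on the documented methods; the record data is hypothetical):

>>> from scholar_flux.utils.json_processing_utils import KeyDiscoverer
>>> records = [{'title': 'Sample Study', 'journal': {'topic': 'Sleep Research'}}]
>>> discoverer = KeyDiscoverer(records)
>>> # each discovered key maps to the paths that must be traversed to reach it
>>> assert 'topic' in discoverer.get_all_keys()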

class scholar_flux.utils.json_processing_utils.KeyFilter[source]

Bases: object

Helper class used to create a simple filter that allows for the identification of terminal keys associated with data in a JSON structure and the paths that lead to each terminal key.

static filter_keys(discovered_keys: Dict[str, List[str]], prefix: str | None = None, min_length: int | None = None, substring: str | None = None, pattern: str | None = None, include_matches: bool = True, match_any: bool = True) Dict[str, List[str]][source]

A method used to filter key-value pairs based on the specified criteria.

For example, filtering can be configured to identify keys based on prefix, minimum path length, and path substring/pattern matching with conditional match inclusion/exclusion.

class scholar_flux.utils.json_processing_utils.PathUtils[source]

Bases: object

Helper class used to perform string/list manipulations for paths that can be represented in either form, requiring conversion from one type to the other in specific JSON path processing scenarios.

static constant_path_indices(path: List[Any], constant: str = 'i') List[Any][source]

Replace integer indices with constants in the provided path.

Parameters:
  • path (List[Any]) – The original path containing both keys and indices.

  • constant (str) – A value to replace a numeric value with.

Returns:

A path with integer indices replaced by the provided constant.

Return type:

List[Any]

static group_path_assignments(path: List[Any]) str | None[source]

Group the path assignments into a single string, excluding indices.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A single string representing the grouped path, or None if the path is empty.

Return type:

Optional[str]

static path_name(level_names: List[Any]) str[source]

Generate a string representation of the path based on the provided level names.

Parameters:

level_names (List[Any]) – A list of names representing the path levels.

Returns:

A string representation of the path.

Return type:

str

static path_str(level_names: List[Any]) str[source]

Join the level names into a single string separated by underscores.

Parameters:

level_names (List[Any]) – A list of names representing the path levels.

Returns:

A single string with level names joined by underscores.

Return type:

str

static remove_path_indices(path: List[Any]) List[Any][source]

Remove integer indices from the path to get a list of key names.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A path with only the key names.

Return type:

List[Any]
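
Example (an illustrative sketch; the expected results are noted in comments rather than asserted):

>>> from scholar_flux.utils.json_processing_utils import PathUtils
>>> path = ['authors', 0, 'name']
>>> keys_only = PathUtils.remove_path_indices(path)     # expected: ['authors', 'name']
>>> normalized = PathUtils.constant_path_indices(path)  # expected: ['authors', 'i', 'name']
>>> joined = PathUtils.path_str(['authors', 'name'])    # expected: 'authors_name'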

class scholar_flux.utils.json_processing_utils.RecursiveJsonProcessor(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False)[source]

Bases: object

An implementation of a recursive JSON dictionary processor that is used to process and identify nested components such as paths, terminal key names, and the data at each terminal path.

The utility of the RecursiveJsonProcessor lies in flattening dictionary records into flat representations where keys represent the terminal paths at each node and values represent the data found at each terminal path.

__init__(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False)[source]

Initialize the RecursiveJsonProcessor with a JSON dictionary and a delimiter for joining list elements.

Parameters:
  • json_dict (Dict) – The input JSON dictionary to be parsed.

  • object_delimiter (str) – The delimiter used to join elements of max-depth list objects. Default is "; ".

  • normalizing_delimiter (str) – The delimiter used to join elements across multiple keys when normalizing. Default is "\n".

combine_normalized(normalized_field_value: list | str | None) list | str | None[source]

Combines lists of nested data (strings, ints, None, etc.) into a single string separated by the normalizing_delimiter.

If a delimiter isn’t specified or if the value is None, it is returned as is without modification.

filter_extracted(exclude_keys: List[str] | None = None)[source]

Filter the extracted JSON dictionaries to exclude specified keys.

Parameters:

exclude_keys (Optional[List[str]]) – List of keys to exclude from the flattened result.

flatten() Dict[str, List[Any] | str | None] | None[source]

Flatten the extracted JSON dictionary from a nested structure into a simpler structure.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]

process_and_flatten(obj: Dict | None = None, exclude_keys: List[str] | None = None) Dict[str, List[Any] | str | None] | None[source]

Process the dictionary, filter extracted paths, and then flatten the result.

Parameters:

exclude_keys (Optional[List[str]]) – List of keys to exclude from the flattened result.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]

process_dictionary(obj: Dict | None = None)[source]

Create a new json dictionary that contains information about the relative paths of each field that can be found within the current json_dict.

process_level(obj: Any, level_name: List[Any] | None = None) List[Any][source]

Helper method for processing a level within a dictionary.

This method is called recursively to process nested components.

static unlist(current_data: Dict | List | None) Any | None[source]

Flattens a dictionary or list if it contains a single element that is a dictionary.

Parameters:

current_data – A dictionary or list to be flattened if it contains a single dictionary element.

Returns:

The flattened dictionary if the input meets the flattening condition, otherwise returns the input unchanged.

Return type:

Optional[Dict|List]

scholar_flux.utils.logger module

The scholar_flux.utils.logger module implements a basic logger used to create an easy-to-re-initialize logger to be used for logging events and progress in the retrieval and processing of API responses.

scholar_flux.utils.logger.setup_logging(logger: Logger | None = None, log_directory: str | None = None, log_file: str | None = 'application.log', log_level: int = 10, propagate_logs: bool | None = True, max_bytes: int = 1048576, backup_count: int = 5, logging_filter: Filter | None = None)[source]

Configure logging to write to both console and file with optional filtering.

Sets up a logger that outputs to both the terminal (console) and a rotating log file. Rotating files automatically create new files when size limits are reached, keeping your logs manageable.

Parameters:
  • logger (Optional[logging.Logger]) – The logger instance to configure. If None, uses the root logger.

  • log_directory (Optional[str]) – Indicates where to save log files. If None, automatically finds a writable directory when a log_file is specified.

  • log_file (Optional[str]) – Name of the log file (default: ‘application.log’). If None, file-based logging will not be performed.

  • log_level (int) – Minimum level to log (DEBUG logs everything, INFO skips debug messages).

  • propagate_logs (Optional[bool]) – Determines whether to propagate logs. Logs are propagated by default if this option is not specified.

  • max_bytes (int) – Maximum size of each log file before rotating (default: 1MB).

  • backup_count (int) – Number of old log files to keep (default: 5).

  • logging_filter (Optional[logging.Filter]) – Optional filter to modify log messages (e.g., hide sensitive data).

Example

>>> # Basic setup - logs to console and file
>>> setup_logging()
>>> # Custom location and less verbose
>>> setup_logging(log_directory="/var/log/myapp", log_level=logging.INFO)
>>> # With sensitive data masking
>>> from scholar_flux.security import MaskingFilter
>>> mask_filter = MaskingFilter()
>>> setup_logging(logging_filter=mask_filter)

Note

  • Console shows all log messages in real-time

  • File keeps a permanent record with automatic rotation

  • If logging_filter is provided, it’s applied to both console and file output

  • Calling this function multiple times will reset the logger configuration

scholar_flux.utils.module_utils module

The scholar_flux.utils.module_utils module defines the set_public_api_module function, which is used throughout the scholar_flux source code to aid in logging and streamline the documentation of imports.

It is generally used in the initialization of submodules within scholar_flux, which helps greatly in structuring the automatic Sphinx documentation.

scholar_flux.utils.module_utils.set_public_api_module(module_name: str, public_names: list[str], namespace: dict)[source]

Assigns the current module’s name to the __module__ attribute of public API objects.

This function is useful for several use cases including sphinx documentation, introspection, and error handling/reporting.

For all objects defined in the list of a module's public API names (generally named __all__), this function sets their __module__ attribute to the name of the current public API module if supported.

This is useful for ensuring that imported classes and functions appear as if they are defined in the current module (such as in the automatic generation of sphinx documentation), which improves overall documentation, introspection, and error reporting.

Parameters:
  • module_name (str) – The name of the module (usually __name__).

  • public_names (list[str]) – List of public object names to update (e.g., __all__).

  • namespace (dict) – The module’s namespace (usually globals()).

Example usage:

set_public_api_module(__name__, __all__, globals())

scholar_flux.utils.provider_utils module

The scholar_flux.utils.provider_utils module implements the ProviderUtils class that is used to dynamically load the configuration for default providers stored in the scholar_flux.api.providers module.

class scholar_flux.utils.provider_utils.ProviderUtils[source]

Bases: object

Helper class used by the scholar_flux package to dynamically load the default ProviderConfig for each provider within the scholar_flux.api.providers module on import.

The ProviderUtils class uses importlib with exception handling to account for possible errors that may occur when dynamically importing the ProviderConfig for each provider.

classmethod load_provider_config(provider_module: str, provider_config_variable: str = 'provider') ProviderConfig | None[source]

Helper method that loads a single config from the provided module in the event that the module contains a ProviderConfig under the same name as the provider_config_variable. The default variable to look for is provider.

Parameters:
  • provider_module (str) – The name of the module to load.

  • provider_config_variable (str) – The name of the variable carrying the config to check for.

Returns:

The ProviderConfig associated with the module, if one has been found under the variable name provider_config_variable. Otherwise, the method returns None.

Return type:

Optional[ProviderConfig]

classmethod load_provider_config_dict() dict[str, ProviderConfig][source]

Helper method for dynamically retrieving the default provider list as a dictionary.

Returns:

A dictionary mapping the formatted name of each provider to its associated configuration

Return type:

dict[str, ProviderConfig]

scholar_flux.utils.repr_utils module

The scholar_flux.utils.repr_utils module includes several methods used in the creation of descriptive representations of custom objects such as custom classes, dataclasses, and base models. This module can be used to generate a representation from a string to show nested attributes and customize the representation if needed.

Functions:
  • generate_repr: The core representation-generating function that uses the class type and attributes to create a representation of the object

  • generate_repr_from_string: Takes a class name and a dictionary of attribute name-value pairs to create a representation from scratch

  • adjust_repr_padding: Helper function that adjusts the padding of the representation to ensure all attributes are shown in-line

  • format_repr_value: Formats the value of a nested attribute regarding padding and appearance with the selected options

  • normalize_repr: Formats the value of a nested attribute, cleaning memory locations and stripping whitespace

scholar_flux.utils.repr_utils.adjust_repr_padding(obj: Any, pad_length: int | None = 0, flatten: bool | None = None) str[source]

Helper method for adjusting the padding for representations of objects.

Parameters:
  • obj (Any) – The object to generate an adjusted repr for

  • pad_length (Optional[int]) – Indicates the additional amount of padding that should be added. Helpful for when attempting to create nested representations formatted as intended.

  • flatten (bool) – indicates whether to use newline characters. This is false by default

Returns:

A string representation of the current object that adjusts the padding accordingly

Return type:

str

scholar_flux.utils.repr_utils.format_repr_value(value: Any, pad_length: int | None = None, show_value_attributes: bool | None = None, flatten: bool | None = None) str[source]

Helper function for representing nested objects from custom classes.

Parameters:
  • value (Any) – The value containing the repr to format

  • pad_length (Optional[int]) – Indicates the total additional padding to add for each individual line

  • show_value_attributes (Optional[bool]) – If False, all attributes within the current object will be replaced with '…' (e.g., StorageDevice(...))

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

scholar_flux.utils.repr_utils.generate_repr(obj: object, exclude: set[str] | list[str] | tuple[str] | None = None, show_value_attributes: bool = True, flatten: bool = False) str[source]

Method for creating a basic representation of a custom object’s data structure. Useful for showing the options/attributes being used by an object.

In case the object doesn't have a __dict__ attribute, the resulting AttributeError is caught and the code falls back to using the basic string representation of the object.

Note that threading.Lock objects are excluded from the final representation.

Parameters:
  • obj – The object whose attributes are to be represented.

  • exclude – Attributes to exclude from the representation (default is None).

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

Returns:

A string representing the object’s attributes in a human-readable format.
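
Example (an illustrative sketch; SimpleNamespace stands in for a custom class with a __dict__ attribute):

>>> from types import SimpleNamespace
>>> from scholar_flux.utils.repr_utils import generate_repr
>>> options = SimpleNamespace(provider='plos', timeout=30)
>>> print(generate_repr(options, flatten=True))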

scholar_flux.utils.repr_utils.generate_repr_from_string(class_name: str, attribute_dict: dict[str, Any], show_value_attributes: bool | None = None, flatten: bool | None = False) str[source]

Method for creating a basic representation of a custom object’s data structure. Allows for the direct creation of a repr using the classname as a string and the attribute dict that will be formatted and prepared for representation of the attributes of the object.

Parameters:
  • class_name – The class name of the object whose attributes are to be represented.

  • attribute_dict (dict) – The dictionary containing the full list of attributes to format into the components of a repr

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

Returns:

A string representing the object’s attributes in a human-readable format.

scholar_flux.utils.repr_utils.normalize_repr(value: Any) str[source]

Helper function for removing byte locations and surrounding signs from classes.

Parameters:

value (Any) – a value whose representation to be normalized

Returns:

A normalized string representation of the current value

Return type:

str

scholar_flux.utils.response_protocol module

The scholar_flux.utils.response_protocol module is used to ensure that responses can be successfully duck-typed and implemented without favoring a specific client such as requests (or by extension, requests_cache), httpx, or asyncio.

An object is then seen as response-like if it passes the preliminary check of having all of the following attributes:
  • url

  • status_code

  • raise_for_status

  • headers

To ensure compatibility, the scholar_flux.api.ReconstructedResponse class is used as an adapter throughout the request retrieval, response processing, and caching processes to ensure that the ResponseProtocol generalizes to other implementations when not directly using the default requests client.

class scholar_flux.utils.response_protocol.ResponseProtocol(*args, **kwargs)[source]

Bases: Protocol

Protocol for HTTP response objects compatible with requests.Response, httpx.Response, and other response-like classes.

This protocol defines the common interface shared between popular HTTP client libraries, allowing for type-safe interoperability.

The URL is kept flexible to allow for types beyond the normal string, including basic pydantic and httpx URL types, for both httpx and other custom objects.

__init__(*args, **kwargs)
content: bytes
headers: MutableMapping[str, str]
raise_for_status() None[source]

Raise an exception for HTTP error status codes.

status_code: int
url: Any
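
Example (a minimal duck-typing sketch; MockResponse is a hypothetical stand-in, not part of scholar_flux):

>>> from dataclasses import dataclass, field
>>> from typing import Any, MutableMapping
>>> from scholar_flux.utils.response_protocol import ResponseProtocol
>>> @dataclass
... class MockResponse:
...     url: Any = 'https://api.example.org/search'
...     status_code: int = 200
...     content: bytes = b'{}'
...     headers: MutableMapping[str, str] = field(default_factory=dict)
...     def raise_for_status(self) -> None:
...         if self.status_code >= 400:
...             raise RuntimeError(f'HTTP {self.status_code}')
>>> def summarize(response: ResponseProtocol) -> str:
...     return f'{response.status_code}: {response.url}'
>>> summarize(MockResponse())
'200: https://api.example.org/search'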

Module contents

The scholar_flux.utils module contains a comprehensive set of utility tools used to simplify the re-implementation of common design patterns.

Modules:
  • initializer.py: Contains the tools used to initialize (or reinitialize) the scholar_flux package.
    The initializer creates the following package components:
    • config: Contains a list of environment variables and defaults for configuring the package

    • logger: created by calling the setup_logging function with inputs or defaults from an .env file

    • masker: identifies and masks sensitive data from logs such as api keys and email addresses

  • logger.py: Contains the setup_logging function that is used to set the logging level and output location for logs when using the scholar_flux package

  • config.py: Holds the ConfigLoader class that starts from the scholar_flux defaults and reads from an .env file and environment variables to automatically apply API keys, encryption settings, the default provider, etc.

  • helpers.py: Contains a variety of convenience and helper functions used throughout the scholar_flux package.

  • json_file_utils.py: Implements a JsonFileUtils class that contains several static methods for reading files

  • encoder: Contains implementations of a CacheDataEncoder and JsonDataEncoder that use base64 and json utilities to recursively serialize, deserialize, encode, and decode JSON dictionaries and lists for storage and retrieval. These classes account for cases where direct serialization isn't possible and would otherwise result in a JSONDecodeError as a direct result of not accounting for nested structures and types.

  • json_processing_utils: Contains a variety of utilities used in the creation of the RecursiveJsonProcessor, which is used to streamline the process of filtering and flattening parsed record data

  • /paths: Contains custom implementations for processing JSON lists using path processing that abstracts elements of JSON files into Nodes consisting of paths (keys) used to arrive at terminal entries (values), similar to dictionaries. This implementation simplifies the flattening, processing, and filtering of records when processing articles and record entries from response data.

  • provider_utils: Contains the ProviderUtils class that implements class methods used to dynamically read modules containing provider-specific config models. These config models are then used by the scholar_flux.api module to populate Search API configurations with API-specific settings.

  • repr_utils: Contains a set of helper functions specifically geared toward printing nested objects and compositions of classes in a human-readable format to create sensible representations of objects

class scholar_flux.utils.CacheDataEncoder[source]

Bases: object

A utility class to encode data into a base64 string representation or decode it back from base64.

This class supports encoding binary data (bytes) and recursively handles nested structures such as dictionaries and lists by encoding their elements, preserving the original structure upon decoding.

This class is used to serialize JSON structures when the structure isn’t known in advance and contains elements that are not directly serializable, such as 1) None, 2) bytes, 3) nested lists, and 4) other unpredictable structures typically found in JSON.

Class Attributes:
DEFAULT_HASH_PREFIX (Optional[str]):

An optional prefix used to mark fields as bytes for use when decoding. This field defaults to <hashbytes> but can be optionally turned off by setting CacheDataEncoder.DEFAULT_HASH_PREFIX=None or CacheDataEncoder.DEFAULT_HASH_PREFIX=''

DEFAULT_NONREADABLE_PROP (float):

A threshold used to identify previously encoded base64 fields. This proportion is used when a hash prefix marking encoded text is not applied. To test whether a string is an encoded string: when decoded, a high proportion of its characters will be non-readable (e.g. CacheDataEncoder.decode('encoders') -> b'zw(uêì').

Example

>>> from scholar_flux.utils import CacheDataEncoder
>>> import json
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string', 'list': ['a', True, 'series', 'of', None]}
>>> try:
>>>     json.dumps(data)
>>> except TypeError:
>>>     print('The `data` is non-serializable as expected ')
>>>
>>> encoded_data = CacheDataEncoder.encode(data)
>>> serialized_data = json.dumps(encoded_data)
>>> assert data == CacheDataEncoder.decode(json.loads(serialized_data))
DEFAULT_HASH_PREFIX: str | None = '<hashbytes>'
DEFAULT_NONREADABLE_PROP: float = 0.2
classmethod decode(data: Any, hash_prefix: str | None = None) Any[source]

Recursively decodes base64 strings back to bytes or recursively decode elements within dictionaries and lists.

Parameters:
  • data (Any) – The input data that needs decoding from a base64 encoded format. This could be a base64 string or nested structures like dictionaries and lists containing base64 strings as values.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Decoded bytes for byte-based representations or recursively decoded elements within the dictionary/list/tuple if applicable.

Return type:

Any

classmethod encode(data: Any, hash_prefix: str | None = None) Any[source]

Recursively encodes items containing elements that cannot be directly serialized into JSON, converting them into a format more suitable for serialization:

  • Mappings are converted into dictionaries

  • Sets and other uncommon Sequences other than lists and tuples are converted into lists

  • Bytes objects are converted into base64 strings and marked with an optional prefix identifier.

Parameters:
  • data (Any) – The input data. This can be: * bytes: Encoded directly to a base64 string. * Mappings/Sequences/Sets/Tuples: Recursively encodes elements if they are bytes.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Encoded string (for bytes) or a dictionary/list/tuple with recursively encoded elements.

Return type:

Any

classmethod is_base64(s: str | bytes, hash_prefix: str | None = None) bool[source]

Check if a string is a valid base64 encoded string. Encoded strings can optionally be identified with a hash_prefix to streamline checks to determine whether or not to later decode a base64 encoded string.

As a general heuristic when encoding and decoding base64 objects, a string should be equal to its original value after encoding and decoding. In this implementation, we strip equals signs, as minor differences in padding aren’t relevant.

Parameters:
  • s (str | bytes) – The string to check.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

True if the string is base64 encoded, False otherwise.

Return type:

bool

classmethod is_nonreadable(s: bytes, prop: float | None = None) bool[source]

Check if a decoded byte string contains a high percentage of non-printable characters. Non-printable characters are defined as those outside the printable ASCII range (32 <= c <= 126).

Parameters:
  • s (bytes) – The byte string to check.

  • prop (float) – The threshold proportion of non-printable characters. Defaults to DEFAULT_NONREADABLE_PROP if not specified.

Returns:

True if the string is likely gibberish, False otherwise.

Return type:

bool
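
For instance, a minimal round trip through encode and decode, sketched from the behavior documented above:

>>> from scholar_flux.utils import CacheDataEncoder
>>> encoded = CacheDataEncoder.encode(b'raw bytes')
>>> # bytes are converted into a base64 string marked with the default <hashbytes> prefix
>>> assert isinstance(encoded, str)
>>> assert CacheDataEncoder.decode(encoded) == b'raw bytes'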

class scholar_flux.utils.ConfigLoader(env_path: str | Path | None = None)[source]

Bases: object

Helper class used to load the configuration of the scholar_flux package on initialization to dynamically configure package options. Using the config loader with environment variables, the following settings can be defined at runtime.

Package Level Settings:
  • SCHOLAR_FLUX_DEFAULT_PROVIDER: Defines the provider to use by default when creating a SearchAPI instance

API_KEYS:
  • ARXIV_API_KEY: API key used when retrieving academic data from arXiv

  • OPEN_ALEX_API_KEY: API key used when retrieving academic data from OpenAlex

  • SPRINGER_NATURE_API_KEY: API key used when retrieving academic data from Springer Nature

  • CROSSREF_API_KEY: API key used to retrieve academic metadata from Crossref (API key not required)

  • CORE_API_KEY: API key used to retrieve metadata and full-text publications from the CORE API

  • PUBMED_API_KEY: API key used to retrieve publications from the NIH PubMed database

Session Cache:
  • SCHOLAR_FLUX_CACHE_DIRECTORY: defines where requests and response processing cache will be stored when using sqlite and similar cache storages

  • SCHOLAR_FLUX_CACHE_SECRET_KEY: defines the secret key used to create encrypted session cache for request retrieval

Logging:
  • SCHOLAR_FLUX_ENABLE_LOGGING: defines whether logging should be enabled or not

  • SCHOLAR_FLUX_LOG_DIRECTORY: defines where rotating logs will be stored when logging is enabled

  • SCHOLAR_FLUX_LOG_LEVEL: defines the default log level used for package level logging during and after scholar_flux package initialization

  • SCHOLAR_FLUX_PROPAGATE_LOGS: determines whether logs should be propagated or not. (True by default)

Examples

>>> from scholar_flux.utils import ConfigLoader
>>> from pydantic import SecretStr
>>> config_loader = ConfigLoader()
>>> config_loader.load_config(reload_env=True)
>>> api_key = '' # Your key goes here
>>> if api_key:
>>>     config_loader.config['CROSSREF_API_KEY'] = api_key
>>> print(config_loader.env_path) # the default environment location when writing/replacing an .env config
>>> config_loader.save_config() # to save the full configuration in the default environment folder
DEFAULT_ENV: Dict[str, Any] = {'ARXIV_API_KEY': None, 'CORE_API_KEY': None, 'CROSSREF_API_KEY': None, 'OPEN_ALEX_API_KEY': None, 'PUBMED_API_KEY': None, 'SCHOLAR_FLUX_CACHE_DIRECTORY': None, 'SCHOLAR_FLUX_CACHE_SECRET_KEY': None, 'SCHOLAR_FLUX_DEFAULT_PROVIDER': 'plos', 'SCHOLAR_FLUX_ENABLE_LOGGING': '', 'SCHOLAR_FLUX_LOG_DIRECTORY': None, 'SCHOLAR_FLUX_LOG_LEVEL': '', 'SCHOLAR_FLUX_MONGODB_HOST': 'mongodb://127.0.0.1', 'SCHOLAR_FLUX_MONGODB_PORT': 27017, 'SCHOLAR_FLUX_PROPAGATE_LOGS': '', 'SCHOLAR_FLUX_REDIS_HOST': 'localhost', 'SCHOLAR_FLUX_REDIS_PORT': 6379, 'SPRINGER_NATURE_API_KEY': None}
DEFAULT_ENV_PATH: Path = PosixPath('/home/runner/work/scholar-flux/scholar-flux/.env')
__init__(env_path: str | Path | None = None)[source]

Utility class for loading environment variables from the operating system and .env files.

config: Dict[str, Any]
env_path: Path
load_config(env_path: str | Path | None = None, reload_env: bool = False, reload_os_env: bool = False, verbose: bool = False) None[source]

Load configuration settings from the global OS environment or an .env file while optionally overwriting previously set configuration settings.

Optionally attempts to reload and overwrite previously set ConfigLoader settings using either or both sources of config settings.

Note that config settings from an .env file are prioritized over globally set OS environment variables. If neither reload_os_env nor reload_env is set, this function has no effect on the current configuration.

Parameters:
  • env_path (Optional[Path | str]) – An optional env path to read from. Defaults to the current env_path if None.

  • reload_env (bool) – Determines whether environment variables will be loaded/reloaded from the provided env_path or a current self.env_path. Defaults to False, indicating that variables are not reloaded from an .env.

  • reload_os_env (bool) – Determines whether environment variables will be loaded/reloaded from the Operating System’s global environment.

  • verbose (bool) – Convenience setting indicating whether or not to log changed configuration variable names.

load_dotenv(env_path: str | Path | None = None, replace_all: bool = False, verbose: bool = False) dict[source]

Retrieves a dictionary of the environment variables from the current .env file whose values are non-null.

Parameters:
  • env_path – Optional[Path | str]: Location of the .env file where env variables will be retrieved from

  • replace_all – bool = False: Indicates whether all environment variables should be replaced vs. only non-missing variables

  • verbose – bool = False: Flag indicating whether logging should be shown in the output

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict

load_os_env(replace_all: bool = False, verbose: bool = False) dict[source]

Load any updated configuration settings from variables set within the system environment.

A configuration setting must already exist in the config to be updated from the environment. Otherwise, the update_config method allows direct updates to the config settings.

Parameters:
  • replace_all – bool = False: Indicates whether all environment variables should be replaced vs. only non-missing variables

  • verbose – bool = False: Flag indicating whether logging should be shown in the output

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict

classmethod load_os_env_key(key: str, **kwargs) str | SecretStr | None[source]

Loads the provided key from the global environment. Converts API_KEY variables to secret strings by default.

Parameters:
  • key (str) – The key to load from the environment. This key will be guarded if it contains any of the following substrings: “API_KEY”, “SECRET”, “MAIL”

  • matches (str) – The substrings used to indicate whether the loaded environment variable should be guarded

Returns:

The value of the environment variable, possibly wrapped as a secret string

Return type:

Optional[str | SecretStr]
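
For example, a short sketch of the guarding behavior; the environment variable below is hypothetical:

>>> import os
>>> from pydantic import SecretStr
>>> from scholar_flux.utils import ConfigLoader
>>> os.environ['EXAMPLE_API_KEY'] = 'not-a-real-key'  # hypothetical variable for illustration
>>> value = ConfigLoader.load_os_env_key('EXAMPLE_API_KEY')
>>> # names containing "API_KEY" are wrapped as SecretStr, masking the raw value in logs and reprs
>>> assert isinstance(value, SecretStr)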

save_config(env_path: str | Path | None = None) None[source]

Save configuration settings to a .env file.

Unmasks strings read as secrets if they are of type SecretStr.

try_loadenv(env_path: str | Path | None = None, verbose: bool = False) Dict[str, Any] | None[source]

Try to load environment variables from a specified .env file into the environment and return as a dict.

update_config(env_dict: dict[str, Any], verbose: bool = False) None[source]

Helper method for updating the config dictionary with the provided dictionary of key-value pairs.

This method coerces strings into integers when possible and uses the _guard_secret method as insurance to guard against logging and recording API keys without masking. Although the load_env and load_os_env methods also mask API keys, this is particularly useful if the end-user calls update_config directly.

write_key(key_name: str, key_value: str, env_path: str | Path | None = None, create: bool = True) None[source]

Write a key-value pair to a .env file.

class scholar_flux.utils.JsonDataEncoder[source]

Bases: CacheDataEncoder

Helper class that extends the CacheDataEncoder to provide functionality directly relevant to serializing and deserializing data from JSON formats into serialized JSON strings for easier storage and recovery.

This class includes utility dumping and loading tools directly applicable to safely dumping and reloading responses received from various APIs.

Example Use:
>>> from scholar_flux.utils import JsonDataEncoder
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string',
>>>         'list': ['a', True, 'series', 'of', None]}
# serializes the original data even though it contains otherwise unserializable components
>>> serialized_data = JsonDataEncoder.dumps(data)
>>> assert isinstance(serialized_data, str)
# deserializes and decodes the data, returning the original structure
>>> recovered_data = JsonDataEncoder.deserialize(serialized_data)
# the result should equal the original structure
>>> assert data == recovered_data
# OUTPUT: True
classmethod deserialize(s: str, **json_kwargs) Any[source]

Class method that deserializes and decodes json data from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The decoded data.

Return type:

Any

classmethod dumps(data: Any, **json_kwargs) str[source]

Convenience method that uses the json module to serialize (dump) JSON data into a JSON string.

Parameters:
  • data (Any) – The data to serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

classmethod loads(s: str, **json_kwargs) Any[source]

Convenience method that uses the json module to deserialize (load) from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The loaded json data.

Return type:

Any

classmethod serialize(data: Any, **json_kwargs) str[source]

Class method that encodes and serializes data to a JSON string.

Parameters:
  • data (Any) – The data to encode and serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

class scholar_flux.utils.JsonFileUtils[source]

Bases: object

Helper class that implements several basic file utility class methods for easily interacting with the file system. This class also contains utility methods used to parse, load, and dump JSON files for convenience.

Example

>>> from scholar_flux.utils.json_file_utils import JsonFileUtils
>>> from pathlib import Path
>>> original_data = {"key": "value"}
>>> json_file = "/tmp/sample"

# the JSON data should be serializable:
>>> assert JsonFileUtils.is_jsonable(original_data)
# writing the json file
>>> JsonFileUtils.save_as(original_data, json_file)
# the data should now exist at the '/tmp/sample.json' path
>>> assert Path(json_file).with_suffix('.json').exists()
# verifying that the dumped data can be loaded as intended:
>>> data = JsonFileUtils.load_data(json_file)
>>> assert data is not None and original_data == data

DEFAULT_EXT = 'json'
classmethod append_to_file(content: str | List[str], filepath: str | Path, ext: str | None = None) None[source]

Helper method used to append content to a file in a content-type aware manner.

Parameters:
  • content (Union[str, List[str]]) – The content to append to the file.

  • filepath (Union[str, Path]) – The file path to write to

  • ext (Optional[str]) – An optional extension to add to the file path

classmethod get_filepath(filepath: str | Path, ext: str | None = None) str[source]

Prepare the filepath using the filepath and extension if provided. Assumes a Unix filesystem structure for edge cases.

Parameters:
  • filepath (Union[str, Path]) – The file path to read from

  • ext (Optional[str]) – An optional extension to add to the file path. If the extension is left None and an extension does not yet exist on the file path, the default JSON extension is used.

static is_jsonable(obj: Any) bool[source]

Verifies whether the object can be serialized as a json object.

Parameters:

obj (Any) – The object to check

Returns:

True if the object is jsonable (serializable), otherwise False

Return type:

bool

classmethod load_data(filepath: str | Path, ext: str | None = None) Dict | List | str[source]

Attempts to load data from a filepath as a dictionary/list. If unsuccessful, the file’s contents are instead loaded as a string.

Parameters:

filepath (Union[str, Path]) – The file path to read the data from

Returns:

A dictionary or list if the data can be successfully loaded with json, and a string if loading with JSON is not possible.

Return type:

Union[Dict, List, str]

classmethod read_lines(filepath: str | Path, ext: str | None = None) Generator[str, None, None][source]

Iteratively reads lines from a text file.

Parameters:
  • filepath (Union[str, Path]) – The file path to read the data from

  • ext (Optional[str]) – An optional extension to add to the file path

Returns:

The lines read from a text file

Return type:

Generator[str, None, None]

To retrieve a list of data instead of a generator, pass the result to list:
>>> from scholar_flux.utils import JsonFileUtils
>>> line_gen = JsonFileUtils.read_lines('pyproject.toml')
>>> assert isinstance(list(line_gen), list)
classmethod save_as(obj: List | Dict | str | float | int, filepath: str | Path, ext: str | None = None, dump: bool = True) None[source]

Save an object in text format with the specified extension (if provided).

Parameters:
  • obj (Union[List, Dict, str, float, int]) – A value to save into a file

  • filepath (Union[str, Path]) – The file path to write the object to

  • ext (Optional[str]) – An optional extension to add to the file path

  • dump (bool) – If True, the object is serialized using json.dumps. Otherwise the str function is used

class scholar_flux.utils.JsonNormalizer(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Bases: object

Helper class that flattens and normalizes the retrieved list of JsonRecordData into a singular flattened dictionary.

__init__(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Initialize the JsonNormalizer with extracted JSON data and a delimiter.

Parameters:
  • json_record_data_list (List[JsonRecordData]) – The list of extracted JSON data.

  • delimiter (str) – The delimiter used to join elements in lists.

  • use_full_path (bool) – Indicates whether to use the full nested json path or the smallest unique path available

create_unique_key(current_group: List[str], current_key_str: str, unique_mappings_dict: Dict[str, List[str]]) str[source]

Create a unique key for the current data entry if a simple key is not sufficient.

Parameters:
  • current_group (List[str]) – The list of keys in the current path.

  • current_key_str (str) – The string representation of the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

get_unique_key(current_key_str: str, current_group: List[str], unique_mappings_dict: Dict[str, List[str]]) str[source]

Generate a unique key for the current data entry.

Parameters:
  • current_key_str (str) – The string representation of the current path.

  • current_group (List[str]) – The list of keys in the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

normalize_extracted() Dict[str, List[Any] | str | None][source]

Normalize the extracted JSON data into a flattened dictionary.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Dict[str, List[Any]]

class scholar_flux.utils.JsonRecordData(path: List[str | int], data: Dict[str, Any])[source]

Bases: object

Helper class used as a container to record the paths, data, and names associated with each terminal path.

Parameters:
  • path (list[str | int]) – The path associated with the terminal data point where nested terminal values can be found

  • data (dict[str, Any]) – The nested terminal value at the end of a path

__init__(path: List[str | int], data: Dict[str, Any]) None
data: Dict[str, Any]
path: List[str | int]
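
A minimal sketch tying JsonRecordData and JsonNormalizer together; the path and data values below are hypothetical:

>>> from scholar_flux.utils import JsonNormalizer, JsonRecordData
>>> record = JsonRecordData(path=[0, 'journal'], data={'topic': 'Sleep Research'})
>>> normalizer = JsonNormalizer([record])
>>> flattened = normalizer.normalize_extracted()
>>> # the result is a single flattened dictionary keyed by simplified path names
>>> assert isinstance(flattened, dict)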
class scholar_flux.utils.KeyDiscoverer(records: List[Dict] | None = None)[source]

Bases: object

Helper class used to discover terminal keys containing data within nested JSON data structures and identify the paths used to arrive at each key.

_discovered_keys

Defines the complete list of all keys that can be found in a dictionary and the path that needs to be traversed to arrive at that key

Type:

dict[str, list]

_terminal_paths

Creates a dictionary that indicates whether the currently added path is terminal within the JSON data structure

Type:

dict[str, bool]

__init__(records: List[Dict] | None = None)[source]

Initializes the KeyDiscoverer and identifies terminal key/path pairs within the JSON data structure.

filter_keys(prefix: str | None = None, min_length: int | None = None, substring: str | None = None) Dict[str, List[str]][source]

Helper method that filters a range of keys based on the specified criteria.

get_all_keys() Dict[str, List[str]][source]

Returns all discovered keys and their paths.

get_keys_with_path(key: str) List[str][source]

Returns all paths associated with a specific key.

get_terminal_keys() Dict[str, List[str]][source]

Returns keys and their terminal paths (paths that don’t contain nested dictionaries).

get_terminal_paths() List[str][source]

Returns the paths that are terminal (i.e., paths that don’t contain nested dictionaries).
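
For instance, a brief sketch of key discovery over a small record list; the records below are hypothetical:

>>> from scholar_flux.utils import KeyDiscoverer
>>> records = [{'title': 'Sample Study', 'authors': {'assistant': 'Jane Doe'}}]
>>> discoverer = KeyDiscoverer(records)
>>> discovered = discoverer.get_all_keys()
>>> # each discovered key maps to the path(s) traversed to reach it
>>> assert 'title' in discovered and 'assistant' in discovered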

class scholar_flux.utils.KeyFilter[source]

Bases: object

Helper class used to create a simple filter that allows for the identification of terminal keys associated with data in a JSON structure and the paths that lead to each terminal key.

static filter_keys(discovered_keys: Dict[str, List[str]], prefix: str | None = None, min_length: int | None = None, substring: str | None = None, pattern: str | None = None, include_matches: bool = True, match_any: bool = True) Dict[str, List[str]][source]

A method used to filter key-path pairs based on the specified criteria.

For example, filtering can be configured to identify keys based on prefix, minimum path length, and path substring/pattern matching with conditional match inclusion/exclusion.
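
A short sketch of filtering discovered keys; the mapping below is hypothetical, and the assumption that the substring criterion matches against path strings is ours:

>>> from scholar_flux.utils import KeyFilter
>>> discovered_keys = {'title': ['0.title'], 'assistant': ['0.authors.assistant']}
>>> filtered = KeyFilter.filter_keys(discovered_keys, substring='authors')
>>> assert isinstance(filtered, dict)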

class scholar_flux.utils.PathDiscoverer(records: dict | list[dict] | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>)

Bases: object

A utility class for discovering paths and flattening JSON files into a single dictionary that reduces the nested structure to the path, the type of structure, and the terminal value.

Parameters:
  • records – Optional[Union[list[dict], dict]]: A list of dictionaries to be flattened

  • path_mappings – dict[ProcessingPath, Any]: A set of key-value pairs mapping paths to terminal values

records

The input data to be traversed and flattened.

Type:

Optional[Union[list[dict], dict]]

path_mappings

Holds a dictionary of values mapped to ProcessingPaths after processing

Type:

dict[ProcessingPath, Any]

DEFAULT_DELIMITER: ClassVar[str] = '.'
__init__(records: dict | list[dict] | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>) None
clear()[source]

Removes all path-value mappings from the self.path_mappings dictionary.

discover_path_elements(records: dict | list[dict] | None = None, current_path: ProcessingPath | None = None, max_depth: int | None = None, inplace: bool = False) dict[ProcessingPath, Any] | None[source]

Recursively traverses records to discover keys, their paths, and terminal status. Uses the private method _discover_path_elements in order to add terminal path value pairs to the path_mappings attribute.

Parameters:
  • records (Optional[Union[list[dict], dict]]) – A list of dictionaries to be flattened if not already provided.

  • current_path (Optional[ProcessingPath]) – The parent path to prefix all subsequent paths with. This is useful when working with a subset of a dict

  • max_depth (Optional[int]) – Indicates the maximum number of times we should recursively attempt to retrieve a terminal path. Leaving this at None will traverse all possible nested lists/dictionaries.

  • inplace (bool) – Determines whether or not to save the inner state of the PathDiscoverer object. When False: Returns the final object and clears the self.path_mappings attribute. When True: Retains the self.path_mappings attribute and returns None

path_mappings: dict[ProcessingPath, Any]
records: list[dict] | dict | None = None
property terminal_paths: Set[ProcessingPath]

Helper property returning the set of all discovered terminal paths from the PathDiscoverer.
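
For example, a minimal sketch of path discovery over a hypothetical record list:

>>> from scholar_flux.utils import PathDiscoverer
>>> records = [{'title': 'Sample Study', 'authors': {'assistant': 'Jane Doe'}}]
>>> discoverer = PathDiscoverer(records)
>>> path_mappings = discoverer.discover_path_elements()  # with inplace=False, returns the mapping directly
>>> # each key is a ProcessingPath leading to a terminal value such as 'Jane Doe'
>>> assert path_mappings is not None and 'Jane Doe' in path_mappings.values()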

class scholar_flux.utils.PathNode(path: ProcessingPath, value: Any)

Bases: object

A dataclass that acts as a wrapper for path-terminal value pairs in nested JSON structures.

The PathNode consists of a value of any type and a ProcessingPath instance that indicates where a terminal value was found. This class simplifies the process of manipulating and flattening data structures originating from JSON data.

path

The terminal path where the value was located

Type:

ProcessingPath

value
Type:

Any

DEFAULT_DELIMITER: ClassVar[str] = '.'
__init__(path: ProcessingPath, value: Any) None
copy() PathNode[source]

Helper method for copying and returning an identical path node.

classmethod is_valid_node(node: PathNode) bool[source]

Validates whether the current node is a PathNode instance. If the current input is not a PathNode, this method raises an InvalidPathNodeError.

Raises:

InvalidPathNodeError – If the current node is not a PathNode or if its path is not a valid ProcessingPath

path: ProcessingPath
property path_group: ProcessingPath

Attempt to retrieve the path omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is later useful for when we need to group paths into a list or sets in order to consolidate record fields.

Returns:

A ProcessingPath instance with the last numeric component removed and indices replaced.

Return type:

ProcessingPath

property path_keys: ProcessingPath

Utility property for retaining keys from a path, ignoring indexes generated by lists. Retrieves the original path minus all keys that originate from list indexes.

Returns:

A ProcessingPath instance associated with all dictionary keys

Return type:

ProcessingPath

property record_index: int

Extract the first element of the node’s path to determine the record number originating from a list of dictionaries, assuming the path originates from a paginated structure.

Returns:

Value denoting the record that the path originates from

Return type:

int

Raises:

PathIndexingError – if the first element of the path is not a numerical index

classmethod to_path_node(path: ProcessingPath | str | int | list[str] | list[int] | list[str | int], value: Any, **path_kwargs) Self[source]

Helper method for creating a path node from the components used to create paths, in addition to the value to assign to the path node.

Parameters:
  • path (Union[ProcessingPath, str, list[str]]) – The path to be assigned to the node. If this is not a path already, then a path will be created from what is provided

  • value (Any) – The value to associate with the new node

  • **path_kwargs – Additional keyword arguments to be used in the creation of a path. This is passed to ProcessingPath.to_processing_path when creating a path

Returns:

The newly constructed path node

Return type:

PathNode

Raises:

InvalidPathNodeError – If the values provided cannot be used to create a new node

update(**attributes: ProcessingPath | Any) PathNode[source]

Update the parameters of a PathNode by creating a new PathNode instance. Note that the original PathNode dataclass is frozen; this method uses the copied dict originating from the dataclass to initialize a new PathNode.

Parameters:

**attributes (ProcessingPath | Any) – Keyword arguments indicating the attributes of the PathNode to update. Each key should be a valid attribute name of PathNode, and each value should be the corresponding updated value. If a specific key is not provided, it will not be updated.

Returns:

A new PathNode with the updated attributes

value: Any
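
A brief sketch of constructing and inspecting a node with hypothetical values:

>>> from scholar_flux.utils import PathNode
>>> node = PathNode.to_path_node('0.authors.assistant', value='Jane Doe')
>>> assert node.value == 'Jane Doe'
>>> assert node.record_index == 0  # the first path component denotes the originating record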
class scholar_flux.utils.PathNodeIndex(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None)

Bases: object

The PathNodeIndex is a dataclass that enables the efficient processing of nested key-value pairs from JSON data commonly received from APIs providing records, articles, and other forms of data.

This index orchestrates the parsing, flattening, and simplification of JSON data structures.

Parameters:
  • node_map (PathNodeMap | RecordPathChainMap) – A dictionary of path-node mappings that are used by the PathNodeIndex to simplify JSON structures into a singular list of dictionaries where each dictionary represents a record

  • simplifier (PathSimplifier) – A structure that enables the simplification of a path node index into a singular list of dictionary records. The structure is initially used to identify unique path names for each path-value combination.

Class Variables:
DEFAULT_DELIMITER (str): A delimiter to use by default when reading JSON structures and transforming the list of keys used to retrieve a terminal path into a simplified string. Each individual key is separated by this delimiter.

MAX_PROCESSES (int): An optional maximum on the total number of processes to use when simplifying multiple records into a singular structure in parallel. This can be configured directly or turned off altogether by setting this class variable to None.

Example Usage:
>>> from scholar_flux.utils import PathNodeIndex
>>> record_test_json: list[dict] = [
>>>     {
>>>         "authors": {"principle_investigator": "Dr. Smith", "assistant": "Jane Doe"},
>>>         "doi": "10.1234/example.doi",
>>>         "title": "Sample Study",
>>>         # "abstract": ["This is a sample abstract.", "keywords: 'sample', 'abstract'"],
>>>         "genre": {"subspecialty": "Neuroscience"},
>>>         "journal": {"topic": "Sleep Research"},
>>>     },
>>>     {
>>>         "authors": {"principle_investigator": "Dr. Lee", "assistant": "John Roe"},
>>>         "doi": "10.5678/example2.doi",
>>>         "title": "Another Study",
>>>         "abstract": "Another abstract.",
>>>         "genre": {"subspecialty": "Psychiatry"},
>>>         "journal": {"topic": "Dreams"},
>>>     },
>>> ]
>>> normalized_records = PathNodeIndex.normalize_records(record_test_json)
>>> normalized_records
# OUTPUT: [{'abstract': 'Another abstract.',
#         'doi': '10.5678/example2.doi',
#         'title': 'Another Study',
#         'authors.assistant': 'John Roe',
#         'authors.principle_investigator': 'Dr. Lee',
#         'genre.subspecialty': 'Psychiatry',
#         'journal.topic': 'Dreams'},
#        {'doi': '10.1234/example.doi',
#         'title': 'Sample Study',
#         'authors.assistant': 'Jane Doe',
#         'authors.principle_investigator': 'Dr. Smith',
#         'genre.subspecialty': 'Neuroscience',
#         'journal.topic': 'Sleep Research'}]
DEFAULT_DELIMITER: ClassVar[str] = '.'
MAX_PROCESSES: ClassVar[int | None] = 8
__init__(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None) None
combine_keys(skip_keys: list | None = None) None[source]

Combine nodes with values in their paths by updating the paths of count nodes.

This method searches for paths ending with values and count, identifies related nodes, and updates the paths by combining the value with the count node.

Parameters:
  • skip_keys (Optional[list]) – Keys that should not be combined regardless of a matching pattern

  • quote_numeric (Optional[bool]) – Determines whether to quote integer components of paths to distinguish them from indices (default behavior is to quote them, e.g. 0, 123).

Raises:

PathCombinationError – If an error occurs during the combination process.

classmethod from_path_mappings(path_mappings: dict[ProcessingPath, Any], chain_map: bool = False, use_cache: bool | None = None) PathNodeIndex[source]

Takes a dictionary of path:value mappings and transforms the dictionary into a list of PathNodes: useful for later path manipulations such as grouping and consolidating paths into a flattened dictionary.

If use_cache is not specified, then the Mapping will use the class default to determine whether or not to cache.

Returns:

An index of PathNodes created from a dictionary

Return type:

PathNodeIndex

get_node(path: ProcessingPath | str) PathNode | None[source]

Try to retrieve a path node with the given path.

Parameters:

path (ProcessingPath | str) – The exact path to search for in the index.

Returns:

The exact node that matches the provided path.

Returns None if a match is not found

Return type:

Optional[PathNode]

node_map: PathNodeMap | RecordPathChainMap
property nodes: list[PathNode]

Returns a list of PathNodes stored within the index.

Returns:

The complete list of all PathNodes that have been registered in the PathIndex

Return type:

list[PathNode]

classmethod normalize_records(json_records: dict | list[dict], combine_keys: bool = True, object_delimiter: str | None = ';', parallel: bool = False) list[dict[str, Any]][source]

Full pipeline for processing a loaded JSON structure into a list of dictionaries where each individual list element is a processed and normalized record.

Parameters:
  • json_records (dict[str,Any] | list[dict[str,Any]]) – The JSON structure to normalize. If this structure is a dictionary, it will first be nested in a list as a single element before processing.

  • combine_keys (bool) – Determines whether or not to combine keys that are likely to denote names and corresponding values/counts. Default is True

  • object_delimiter – This delimiter determines whether to join terminal paths in lists under the same key and how to collapse the list into a singular string. If empty, terminal lists are returned as is.

  • parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel

Return type:

list[dict[str,Any]]

property paths: list[ProcessingPath]

Returns a list of Paths stored within the index.

Returns:

The complete list of all paths that have been registered in the PathIndex

Return type:

list[ProcessingPath]

Attempt to find all values containing the specified pattern using regular expressions.

Parameters:

pattern (Union[str, re.Pattern]) – The pattern to search for.

Returns:

all paths and nodes that match the specified pattern

Return type:

dict[ProcessingPath, PathNode]

property record_indices: list[int]

Helper property for retrieving the full list of all record indices across the current mapping of paths to nodes for the current index.

This property is a helper method to quickly retrieve the full list of sorted record_indices.

It refers back to the map for the underlying implementation in the retrieval of record_indices.

Returns:

A list containing integers denoting individual records found in each path.

Return type:

list[int]

search(path: ProcessingPath) list[PathNode][source]

Attempt to find all values that match the provided path or have sub-paths that exactly match the provided path.

Parameters:

path (Union[str, ProcessingPath]) – The path to search for. Note that the provided path must exactly match a prefix/ancestor path of an indexed path to be considered a match.

Returns:

All nodes whose paths are equal to or contain sub-paths exactly matching the specified path

Return type:

list[PathNode]

simplifier: PathSimplifier
simplify_to_rows(object_delimiter: str | None = ';', parallel: bool = False, max_components: int | None = None, remove_noninformative: bool = True) list[dict[str, Any]][source]

Simplify indexed nodes into a paginated data structure.

Parameters:
  • object_delimiter (str) – The separator to use when collapsing multiple values into a single string.

  • parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel

Returns:

A list of dictionaries representing the paginated data structure.

Return type:

list[dict[str, Any]]

use_cache: bool | None = None
class scholar_flux.utils.PathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])

Bases: UserDict[ProcessingPath, PathNode]

A dictionary-like class that maps Processing paths to PathNode objects.

DEFAULT_USE_CACHE: bool = True
__init__(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]

Initializes the PathNodeMap instance.

add(node: PathNode, overwrite: bool | None = None, inplace: bool = True) PathNodeMap | None[source]

Add a node to the PathNodeMap instance.

Parameters:
  • node (PathNode) – The node to add.

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

Raises:

PathNodeMapError – If any error occurs while adding the node.

filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]

Filter the PathNodeMap for paths with the given prefix.

Parameters:
  • prefix (ProcessingPath) – The prefix to search for.

  • min_depth (Optional[int]) – The minimum depth to search for. Default is None.

  • max_depth (Optional[int]) – The maximum depth to search for. Default is None.

  • from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.

Returns:

A dictionary of paths with the given prefix and their corresponding terminal_nodes

Return type:

dict[ProcessingPath, PathNode]

Raises:

PathNodeMapError – If an error occurs while filtering the PathNodeMap.

classmethod format_mapping(key_value_pairs: PathNodeMap | MutableMapping[ProcessingPath, PathNode] | dict[str, PathNode]) dict[ProcessingPath, PathNode][source]

Takes a dictionary or a PathNodeMap, transforms any string keys into ProcessingPaths, and returns the mapping.

Parameters:

key_value_pairs (Union[dict[ProcessingPath, PathNode], dict[str, PathNode]]) – The dictionary of key-value pairs to transform.

Returns:

a dictionary of validated path, node pairings

Return type:

dict[ProcessingPath, PathNode]

Raises:

PathNodeMapError – If the validation process fails.

classmethod format_terminal_nodes(node_obj: MutableMapping | PathNodeMap | PathNode) dict[ProcessingPath, PathNode][source]

Recursively iterate over terminal nodes from PathNodeMaps and retrieve only the terminal nodes.

Parameters:

node_obj (Union[MutableMapping, PathNodeMap, PathNode]) – PathNodeMap or node dictionary containing either nested or already flattened terminal_paths

Returns:

The flattened terminal paths extracted from the inputted node_obj

Return type:

dict[ProcessingPath, PathNode]

get(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]

Gets an item from the PathNodeMap instance. If the value isn’t available, this method will return the value specified in default.

Parameters:

key (Union[str, ProcessingPath]) – The key (ProcessingPath). If a string is provided, it is coerced to a ProcessingPath.

Returns:

The value (PathNode instance).

Return type:

PathNode

get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]

Helper method for retrieving a path node in a standardized way.

node_exists(node: PathNode | ProcessingPath) bool[source]

Helper method to validate whether the current node exists.

property nodes: list[PathNode]

Enables the retrieval of nodes stored within the current map as a property.

property paths: list[ProcessingPath]

Enables the retrieval of paths stored within the current map as a property.

property record_indices: list[int]

Helper property for retrieving the full list of all record indices across all paths for the current map. Note: This assumes that all paths within the current map are derived from a list of records where every path’s first element denotes its initial position in a list with nested json components.

Returns:

A list containing integers denoting individual records found in each path

Return type:

list[int]

remove(node: ProcessingPath | PathNode | str, inplace: bool = True) PathNodeMap | None[source]

Remove the specified path or node from the PathNodeMap instance.

Parameters:
  • node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.

  • inplace (bool) – Whether to remove the path in-place or return a new PathNodeMap instance. Default is True.

Returns:

A new PathNodeMap instance with the specified paths removed if inplace is specified as False; otherwise None.

Return type:

Optional[PathNodeMap]

Raises:

PathNodeMapError – If any error occurs while removing.

update(*args, overwrite: bool | None = None, **kwargs: Mapping[str | ProcessingPath, PathNode]) None[source]

Updates the PathNodeMap instance with new key-value pairs.

Parameters:
  • *args (Union[PathNodeMap, dict[ProcessingPath, PathNode], dict[str, PathNode]]) – PathNodeMap or dictionary containing the key-value pairs to append to the PathNodeMap

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

  • **kwargs (PathNode) – Path Nodes using the path as the argument name to append to the PathNodeMap
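
A minimal sketch of building and filtering a map with hypothetical paths and values:

>>> from scholar_flux.utils import PathNode, PathNodeMap
>>> node = PathNode.to_path_node('0.title', value='Sample Study')
>>> node_map = PathNodeMap(node)
>>> assert node_map.get('0.title') is not None
>>> prefix_matches = node_map.filter('0')  # all nodes whose paths begin with the '0' prefix
>>> assert node in prefix_matches.values()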

class scholar_flux.utils.PathProcessingCache

Bases: object

The PathProcessingCache class implements a method of path caching that enables faster prefix searches and retrieval of terminal paths associated with a path-to-node mapping. This class is used within PathNodeMaps and RecordPathNodeMaps to increase the speed and efficiency of path discovery, processing, and the filtering of path-node mappings.

Because the primary purpose of the scholar_flux Trie-based path-node-processing implementation is the processing and preparation of highly nested JSON structures from API responses, the PathProcessingCache was created to efficiently keep track of all descendants of a terminal node with weak references and to facilitate the filtering and flattening of path-node combinations.

Stale data is automatically removed to reduce the number of comparisons needed to retrieve terminal paths only, and, as a result, later steps can more efficiently filter the complete list of terminal paths with faster path prefix searches to facilitate processing using Path-Node Maps and Indexes when processing JSON data structures.

__init__() None[source]

Initializes the ProcessingCache instance.

_cache

Underlying cache data structure that keeps track of all descendants that begin with the current prefix by mapping path strings to WeakSets that automatically remove ProcessingPaths when garbage collected

Type:

defaultdict[str, WeakSet[ProcessingPath]]

updates

Implements a lazy caching system that only adds elements to the _cache when filtering and node retrieval is explicitly required. The implementation uses weakly referenced keys to remove cached paths to ensure that references are deleted when a lazy operation is no longer needed.

Type:

WeakKeyDictionary[ProcessingPath, Literal[‘add’, ‘remove’]]

cache_update() None[source]

Initializes the lazy updates for the cache given the current update instructions.

filter(prefix: ProcessingPath, min_depth: int | None = None, max_depth: int | None = None) Set[ProcessingPath][source]

Filter the cache for paths with the given prefix.

Parameters:
  • prefix (ProcessingPath) – The prefix to search for.

  • min_depth (Optional[int]) – The minimum depth to search for. Default is None.

  • max_depth (Optional[int]) – The maximum depth to search for. Default is None.

Returns:

A set of paths with the given prefix.

Return type:

Set[ProcessingPath]

lazy_add(path: ProcessingPath) None[source]

Add a path to the cache for faster prefix searches.

Parameters:

path (ProcessingPath) – The path to add to the cache.

lazy_remove(path: ProcessingPath) None[source]

Remove a path from the cache.

Parameters:

path (ProcessingPath) – The path to remove from the cache.

property path_cache: defaultdict[str, WeakSet[ProcessingPath]]

Helper property that allows for inspection of the ProcessingCache and automatically updates the node cache prior to retrieval.

Returns:

The underlying cache used within the ProcessingCache to retrieve a list of all currently active terminal nodes.

Return type:

defaultdict[str, WeakSet[ProcessingPath]]
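
A short sketch of the lazy add/filter cycle; note that cached entries are weakly referenced, so a strong reference to the path is kept here:

>>> from scholar_flux.utils import PathProcessingCache, ProcessingPath
>>> cache = PathProcessingCache()
>>> path = ProcessingPath(['0', 'authors', 'assistant'])  # strong reference keeps the entry alive
>>> cache.lazy_add(path)
>>> matches = cache.filter(ProcessingPath('0'))
>>> assert path in matches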

class scholar_flux.utils.PathSimplifier(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>)

Bases: object

A utility class for simplifying and managing Processing Paths.

Parameters:
  • delimiter (str) – The delimiter to use when splitting paths.

  • non_informative (Optional[List[str]]) – A list of non-informative components to remove from paths.

delimiter

The delimiter used to separate components in the path.

Type:

str

non_informative

A list of non-informative components to be removed during simplification.

Type:

List[str]

name_mappings

A dictionary for tracking unique names to avoid collisions.

Type:

Dict[ProcessingPath, str]

__init__(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>) None
clear_mappings() None[source]

Clear all existing path mappings.

Example

>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.clear_mappings()
>>> simplifier.get_mapped_paths()

Output:

{}

delimiter: str = '.'
generate_unique_name(path: ProcessingPath, max_components: int | None, remove_noninformative: bool = False) ProcessingPath[source]

Generate a unique name for the given Processing Path.

Parameters:
  • path (ProcessingPath) – The ProcessingPath object representing the path components.

  • max_components (int) – The maximum number of components to use in the name.

  • remove_noninformative (bool) – Whether to remove non-informative components.

Returns:

A unique ProcessingPath name.

Return type:

ProcessingPath

Raises:

PathSimplificationError – If an error occurs during name generation.

get_mapped_paths() Dict[ProcessingPath, str][source]

Get the current name mappings.

Returns:

The dictionary of mappings from original paths to simplified names.

Return type:

Dict[ProcessingPath, str]

Example

>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.get_mapped_paths()

Output:

{ProcessingPath('a/b/c'): 'c', ProcessingPath('a/b/d'): 'd'}

name_mappings: Dict[ProcessingPath, str]
non_informative: list[str]
simplify_paths(paths: List[ProcessingPath | str] | Set[ProcessingPath | str], max_components: int | None, remove_noninformative: bool = False) Dict[ProcessingPath, str][source]

Simplify paths by removing non-informative components and selecting the last ‘max_components’ informative components.

Parameters:
  • paths (List[Union[ProcessingPath, str]]) – List of path strings or ProcessingPaths to simplify.

  • max_components (int) – The maximum desired number of informative components to retain in the simplified path.

  • remove_noninformative (bool) – Whether to remove non-informative components.

Returns:

A dictionary mapping the original path to its simplified unique group name for all elements within the same path after removing indices

Return type:

Dict[ProcessingPath, str]

Raises:

PathSimplificationError – If an error occurs during path simplification.

simplify_to_row(terminal_nodes: List[PathNode] | Set[PathNode], collapse: str | None = ';') Dict[str, Any][source]

Simplify terminal nodes by mapping them to their corresponding unique names.

Parameters:
  • terminal_nodes (List[PathNode]) – A list of PathNode objects representing the terminal nodes.

  • collapse (Optional[str]) – The separator to use when collapsing multiple values into a single string.

Returns:

A dictionary mapping unique names to their corresponding values or collapsed strings.

Return type:

Dict[str, Union[List[str], str]]

Raises:

PathSimplificationError – If an error occurs during simplification.

class scholar_flux.utils.PathUtils[source]

Bases: object

Helper class used to perform string/list manipulations for paths that can be represented in either form, requiring conversion from one type to the other in specific JSON path processing scenarios.

static constant_path_indices(path: List[Any], constant: str = 'i') List[Any][source]

Replace integer indices with constants in the provided path.

Parameters:
  • path (List[Any]) – The original path containing both keys and indices.

  • constant (str) – A value to replace a numeric value with.

Returns:

A path with integer indices replaced by the constant.

Return type:

List[Any]

static group_path_assignments(path: List[Any]) str | None[source]

Group the path assignments into a single string, excluding indices.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A single string representing the grouped path, or None if the path is empty.

Return type:

Optional[str]

static path_name(level_names: List[Any]) str[source]

Generate a string representation of the path based on the provided level names.

Parameters:

level_names (List[Any]) – A list of names representing the path levels.

Returns:

A string representation of the path.

Return type:

str

static path_str(level_names: List[Any]) str[source]

Join the level names into a single string separated by underscores.

Parameters:

level_names (List[Any]) – A list of names representing the path levels.

Returns:

A single string with level names joined by underscores.

Return type:

str

static remove_path_indices(path: List[Any]) List[Any][source]

Remove integer indices from the path to get a list of key names.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A path with only the key names.

Return type:

List[Any]
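
A quick sketch of the index-handling helpers on a hypothetical path:

>>> from scholar_flux.utils import PathUtils
>>> path = ['records', 0, 'authors', 1, 'name']
>>> assert PathUtils.remove_path_indices(path) == ['records', 'authors', 'name']
>>> assert PathUtils.constant_path_indices(path, constant='i') == ['records', 'i', 'authors', 'i', 'name']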

class scholar_flux.utils.ProcessingPath(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)

Bases: object

A utility class to handle path operations for processing and flattening dictionaries.

Parameters:
  • components (Union[str, int, Tuple[str, ...], List[str], List[int], List[str | int]]) – The initial path, either as a string or a list of strings. Any integers will be auto-converted to strings in the process of formatting the components of the path

  • component_types (Optional[Union[Tuple[str, ...], List[str]]]) – Optional metadata fields that can be used to annotate specific components of a path

  • delimiter (str) – The delimiter used to separate components in the path.

components

A tuple of path components.

Type:

Tuple[str, …]

delimiter

The delimiter used to separate components in the path.

Type:

str

Examples

>>> from scholar_flux.utils import ProcessingPath
>>> abc_path = ProcessingPath(['a', 'b', 'c'], delimiter ='//')
>>> updated_path = abc_path / 'd'
>>> assert updated_path.depth > 3 and updated_path[-1] == 'd'
# OUTPUT: True
>>> assert str(updated_path) == 'a//b//c//d'
>>> assert updated_path.has_ancestor(abc_path)
DEFAULT_DELIMITER: ClassVar[str] = '.'
__init__(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)[source]

Initializes the ProcessingPath. The inputs are first validated to ensure that the path components and delimiters are valid.

Parameters:
  • components – (Union[str, int, Tuple[str, …], List[str], List[int], List[str | int]]): The current path keys describing the path where each key represents a nested key in a JSON structure

  • component_types – (Optional[Union[Tuple[str, …], List[str]]]): An iterable of component types (used to annotate the components)

  • delimiter – (Optional[str]): The separator used to indicate separate nested keys in a JSON structure. Defaults to the class default if not directly specified.

append(component: int | str, component_type: str | None = None) ProcessingPath[source]

Append a component to the path and return a new ProcessingPath object.

Parameters:

component (str) – The component to append.

Returns:

A new ProcessingPath object with the appended component.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the component is not a non-empty string.

component_types: Tuple[str, ...] | None = None
components: Tuple[str, ...]
copy() ProcessingPath[source]

Create a copy of the ProcessingPath.

Returns:

A new ProcessingPath object with the same components and delimiter.

Return type:

ProcessingPath

delimiter: str = ''
property depth: int

Return the depth of the path.

Returns:

The number of components in the path.

Return type:

int

get_ancestors() List[ProcessingPath | None][source]

Get all ancestor paths of the current ProcessingPath.

Returns:

  • Contains a list of all ancestor paths for the current path

  • If the depth of the path is 1, an empty list is returned

Return type:

List[Optional[ProcessingPath]]

get_name(max_components: int = 1) ProcessingPath[source]

Generate a path name based on the last ‘max_components’ components of the path.

Parameters:

max_components (int) – The maximum number of components to include in the name (default is 1).

Returns:

A new ProcessingPath object representing the generated name.

Return type:

ProcessingPath

get_parent(step: int = 1) ProcessingPath | None[source]

Get the ancestor path of the current ProcessingPath by the specified number of steps.

This method navigates up the path structure by the given number of steps. If the step count is greater than the depth of the current path, or if the path is already the root, it returns None. If the step count equals the current depth, it returns the root ProcessingPath.

Parameters:

step (int) – The number of levels up to retrieve. 1 for parent, 2 for grandparent, etc. (default is 1).

Returns:

  • The ancestor ProcessingPath if the step is within the path depth.

  • The root ProcessingPath if step equals the depth of the current path.

  • None if the step is greater than the current depth or if the path is already the root.

Return type:

Optional[ProcessingPath]

Raises:

ValueError – If the step is less than 1.

group(last_only: bool = False) ProcessingPath[source]

Attempt to retrieve the path omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is later useful for when we need to group paths into a list or sets in order to consolidate record fields.

Parameters:

last_only (bool) – Determines whether or not to replace all list indices vs removing only the last

Returns:

A ProcessingPath instance with the last numeric component removed and indices replaced.

Return type:

ProcessingPath

has_ancestor(path: str | ProcessingPath) bool[source]

Determine whether the provided path is equal to or an ancestor/prefix of the current path (self).

Parameters:

path (ProcessingPath) – The potential ancestor/prefix of the current (self) ProcessingPath.

Returns:

True if ‘path’ is equal to or an ancestor of ‘self’. False otherwise.

Return type:

bool

static infer_delimiter(path: str | ProcessingPath, delimiters: list[str] = ['<>', '//', '/', '>', '<', '\\', '%', '.']) str | None[source]

Infer the delimiter used in the path string based on its string representation.

Parameters:
  • path (Union[str, ProcessingPath]) – The path string to infer the delimiter from.

  • delimiters (List[str]) – A list of common delimiters to search for in the path.

Returns:

The inferred delimiter, or None if no known delimiter is found.

Return type:

Optional[str]
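
Example

A small sketch (the import location is assumed):

>>> from scholar_flux.utils import ProcessingPath
>>> ProcessingPath.infer_delimiter('a/b/c')  # Expected: '/'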

info_content(non_informative: List[str]) int[source]

Calculate the number of informative components in the path.

Parameters:

non_informative (List[str]) – A list of non-informative components.

Returns:

The number of informative components.

Return type:

int

is_ancestor_of(path: str | ProcessingPath) bool[source]

Determine whether the current path (self) is equal to the specified path or an ancestor (prefix) of it.

Parameters:

path (ProcessingPath) – The potential descendant/superset of the current (self) ProcessingPath.

Returns:

True if ‘self’ is a prefix (subset) of ‘path’; False otherwise.

Return type:

bool
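
Example

A sketch illustrating both ancestry checks (the import location is assumed):

>>> from scholar_flux.utils import ProcessingPath
>>> parent = ProcessingPath('a.b', delimiter='.')
>>> child = ProcessingPath('a.b.c', delimiter='.')
>>> child.has_ancestor(parent)    # Expected: True
>>> parent.is_ancestor_of(child)  # Expected: True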

property is_root: bool

Check if the path represents the root node.

Returns:

True if the path is root, False otherwise.

Return type:

bool

classmethod keep_descendants(paths: List[ProcessingPath]) List[ProcessingPath][source]

Filters a list of paths, keeping only descendant paths (i.e., paths that are not ancestors of any other path in the list).

property record_index: int

Extract the first element of the current path to determine the record number if the current path refers back to a paginated structure.

Returns:

The first value, converted to an integer if possible

Return type:

int

Raises:

PathIndexingError – if the first element of the path is not a numerical index

remove(removal_list: List[str]) ProcessingPath[source]

Remove specified components from the path.

Parameters:

removal_list (List[str]) – A list of components to remove.

Returns:

A new ProcessingPath object without the specified components.

Return type:

ProcessingPath

remove_by_type(removal_list: List[str], raise_on_error: bool = False) ProcessingPath[source]

Remove specified component types from the path.

Parameters:

  • removal_list (List[str]) – A list of component types to remove.

  • raise_on_error (bool) – Whether to raise an error if removal fails (default is False).

Returns:

A new ProcessingPath object without components of the specified types.

Return type:

ProcessingPath

remove_indices(num: int = -1, reverse: bool = False) ProcessingPath[source]

Remove numeric components from the path.

Parameters:

  • num (int) – The number of numeric components to remove. If negative, removes all (default is -1).

  • reverse (bool) – Whether to remove numeric components starting from the end of the path (default is False).

Returns:

A new ProcessingPath object without the specified numeric components.

Return type:

ProcessingPath

replace(old: str, new: str) ProcessingPath[source]

Replace occurrences of a component in the path.

Parameters:
  • old (str) – The component to replace.

  • new (str) – The new component to replace the old one with.

Returns:

A new ProcessingPath object with the replaced components.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the replacement arguments are not strings.

replace_indices(placeholder: str = 'i') ProcessingPath[source]

Replace numeric components in the path with a placeholder.

Parameters:

placeholder (str) – The placeholder to replace numeric components with (default is ‘i’).

Returns:

A new ProcessingPath object with numeric components replaced by the placeholder.

Return type:

ProcessingPath
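
Example

A minimal sketch (the import location is assumed):

>>> from scholar_flux.utils import ProcessingPath
>>> path = ProcessingPath('records.0.title', delimiter='.')
>>> print(path.replace_indices())  # Expected: ProcessingPath(records.i.title)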

replace_path(old: str | ProcessingPath, new: str | ProcessingPath, component_types: List | Tuple | None = None) ProcessingPath[source]

Replace an ancestor path or full path in the current ProcessingPath with a new path.

Parameters:
  • old (Union[str, ProcessingPath]) – The path to replace.

  • new (Union[str, ProcessingPath]) – The new path to replace the old path ancestor or full path with.

  • component_types (Optional[List | Tuple]) – Optional component types used to annotate the components of the new path.

Returns:

A new ProcessingPath object with the replaced components.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the replacement arguments are not strings or ProcessingPaths.

reversed() ProcessingPath[source]

Returns a reversed ProcessingPath derived from the current path.

Returns:

A new ProcessingPath object with the same components/types in a reversed order

Return type:

ProcessingPath

sorted() ProcessingPath[source]

Returns a sorted ProcessingPath derived from the current path. Elements are sorted by component in alphabetical order.

Returns:

A new ProcessingPath object with the same components/types sorted in alphabetical order

Return type:

ProcessingPath

to_list() List[str][source]

Convert the ProcessingPath to a list of components.

Returns:

A list of components in the ProcessingPath.

Return type:

List[str]

to_pattern(escape_all=False) Pattern[source]

Convert the ProcessingPath to a regular expression pattern.

Parameters:

escape_all (bool) – Whether to regex-escape all components when constructing the pattern (default is False).

Returns:

The regular expression pattern representing the ProcessingPath.

Return type:

Pattern

classmethod to_processing_path(path: ProcessingPath | str | int | List[str] | List[int] | List[str | int], component_types: list | tuple | None = None, delimiter: str | None = None, infer_delimiter: bool = False) ProcessingPath[source]

Convert an input to a ProcessingPath instance if it’s not already.

Parameters:
  • path (Union[ProcessingPath, str, int, List[str], List[int], List[str | int]]) – The input path to convert.

  • component_types (list|tuple) – The component types associated with each path element.

  • delimiter (str) – The delimiter to use if the input is a string.

  • infer_delimiter (bool) – Whether to infer the delimiter from the path’s string representation when one is not provided (default is False).

Returns:

A ProcessingPath instance.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.
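
Example

A minimal sketch of coercion from a list of components (the import location is assumed):

>>> from scholar_flux.utils import ProcessingPath
>>> path = ProcessingPath.to_processing_path(['a', 'b', 'c'], delimiter='.')
>>> print(path)  # Expected: ProcessingPath(a.b.c)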

to_string() str[source]

Get the string representation of the ProcessingPath.

Returns:

The string representation of the ProcessingPath.

Return type:

str

update_delimiter(new_delimiter: str) ProcessingPath[source]

Update the delimiter of the current ProcessingPath with the provided new delimiter.

This method creates a new ProcessingPath instance with the same components but replaces the existing delimiter with the specified new_delimiter.

Parameters:

new_delimiter (str) – The new delimiter to replace the current one.

Returns:

A new ProcessingPath instance with the updated delimiter.

Return type:

ProcessingPath

Raises:

InvalidPathDelimiterError – If the provided new_delimiter is not valid.

Example

>>> processing_path = ProcessingPath('a.b.c', delimiter='.')
>>> updated_path = processing_path.update_delimiter('/')
>>> print(updated_path)  # Output: ProcessingPath(a/b/c)
classmethod with_inferred_delimiter(path: ProcessingPath | str, component_types: List | Tuple | None = None) ProcessingPath[source]

Converts an input to a ProcessingPath instance if it’s not already a processing path, inferring the delimiter from the path’s string representation.

Parameters:
  • path (Union[ProcessingPath, str]) – The input path to convert.

  • component_types (Optional[List | Tuple]) – The component types associated with each path element.

Returns:

A ProcessingPath instance.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.
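
Example

A minimal sketch (the import location and inferred delimiter follow the behavior described above):

>>> from scholar_flux.utils import ProcessingPath
>>> path = ProcessingPath.with_inferred_delimiter('a/b/c')
>>> path.delimiter  # Expected: '/'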

class scholar_flux.utils.RecordPathChainMap(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap])

Bases: UserDict[int, RecordPathNodeMap]

A dictionary-like class that chains per-record mappings of ProcessingPaths to PathNode objects, keyed by record index.

DEFAULT_USE_CACHE = True
__init__(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap]) None[source]

Initializes the RecordPathChainMap instance.

add(node: PathNode | RecordPathNodeMap, overwrite: bool | None = None)[source]

Add a node or record map to the RecordPathChainMap instance.

Parameters:
  • node (PathNode | RecordPathNodeMap) – The node or record map to add.

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

Raises:

PathNodeMapError – If any error occurs while adding the node.

filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]

Filter the RecordPathChainMap for paths with the given prefix.

Parameters:
  • prefix (ProcessingPath) – The prefix to search for.

  • min_depth (Optional[int]) – The minimum depth to search for. Default is None.

  • max_depth (Optional[int]) – The maximum depth to search for. Default is None.

  • from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.

Returns:

A dictionary of paths with the given prefix and their corresponding terminal_nodes

Return type:

dict[ProcessingPath, PathNode]

Raises:

RecordPathNodeMapError – If an error occurs while filtering the PathNodeMap.

get(key: str | ProcessingPath, default: RecordPathNodeMap | None = None) RecordPathNodeMap | None[source]

Gets an item from the RecordPathChainMap instance. If the value isn’t available, this method returns the value specified in default.

Parameters:
  • key (Union[str, ProcessingPath]) – The key (processing path) to look up. If a string is provided, it is coerced to a ProcessingPath.

  • default (Optional[RecordPathNodeMap]) – The value to return if the key is not found.

Returns:

A record map instance

Return type:

RecordPathNodeMap

get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]

Helper method for retrieving a path node in a standardized way across PathNodeMaps.

node_exists(node: PathNode | ProcessingPath) bool[source]

Helper method to validate whether the specified node exists.

property nodes: list[PathNode]

Enables looping over nodes stored across maps.

property paths: list[ProcessingPath]

Enables looping over paths stored across maps.

property record_indices: list[int]

Helper property for retrieving the full list of all record indices across all paths in the current map. Note: a core requirement of the ChainMap is that each RecordPathNodeMap indicates the position of a record in a nested JSON structure. This property is a helper for quickly retrieving the full, sorted list of record indices.

Returns:

A list containing integers denoting individual records found in each path

Return type:

list[int]

remove(node: ProcessingPath | PathNode | str)[source]

Remove the specified path or node from the map.

Parameters:

node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.

Raises:

PathNodeMapError – If any error occurs while removing.

update(*args, overwrite: bool | None = None, **kwargs: dict[str, PathNode] | dict[str | ProcessingPath, RecordPathNodeMap]) None[source]

Updates the map with new key-value pairs.

Parameters:
  • *args (Union["PathNodeMap", dict[ProcessingPath, PathNode], dict[str, PathNode]]) – A PathNodeMap or dictionary containing the key-value pairs to append to the map

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

  • **kwargs (PathNode) – Path nodes, passed using the path as the argument name, to append to the map

class scholar_flux.utils.RecordPathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])

Bases: PathNodeMap

A dictionary-like class that maps Processing paths to PathNode objects using record indexes.

This implementation inherits from the PathNodeMap class and constrains the allowed nodes to those that begin with a numeric record index, where each index indicates a record and the nodes represent values associated with that record.

__init__(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]

Initializes the RecordPathNodeMap using a similar set of inputs as the original PathNodeMap.

This implementation constrains the input nodes to a single numeric key index that all nodes must begin with. If nodes are provided without the key, the record_index is inferred from the inputs.

classmethod from_mapping(mapping: dict[str | ProcessingPath, PathNode] | PathNodeMap | Sequence[PathNode] | set[PathNode] | RecordPathNodeMap, use_cache: bool | None = None) RecordPathNodeMap[source]

Helper method for coercing types into a RecordPathNodeMap.

class scholar_flux.utils.RecursiveJsonProcessor(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False)[source]

Bases: object

An implementation of a recursive JSON dictionary processor that is used to process and identify nested components such as paths, terminal key names, and the data at each terminal path.

The utility of the RecursiveJsonProcessor lies in flattening dictionary records into representations whose keys are the terminal paths at each node and whose values are the data found at each terminal path.

__init__(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False)[source]

Initialize the RecursiveJsonProcessor with a JSON dictionary and a delimiter for joining list elements.

Parameters:
  • json_dict (Dict) – The input JSON dictionary to be parsed.

  • object_delimiter (str) – The delimiter used to join elements of max-depth list objects. Default is “; ”.

  • normalizing_delimiter (str) – The delimiter used to join elements across multiple keys when normalizing. Default is a newline.

combine_normalized(normalized_field_value: list | str | None) list | str | None[source]

Combines lists of nested data (strings, ints, None, etc.) into a single string separated by the normalizing_delimiter.

If a delimiter isn’t specified or if the value is None, it is returned as is without modification.

filter_extracted(exclude_keys: List[str] | None = None)[source]

Filter the extracted JSON dictionaries to exclude specified keys.

Parameters:

exclude_keys (Optional[List[str]]) – List of keys to exclude from the flattened result.

flatten() Dict[str, List[Any] | str | None] | None[source]

Flatten the extracted JSON dictionary from a nested structure into a simpler structure.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]

process_and_flatten(obj: Dict | None = None, exclude_keys: List[str] | None = None) Dict[str, List[Any] | str | None] | None[source]

Process the dictionary, filter extracted paths, and then flatten the result.

Parameters:
  • obj (Optional[Dict]) – The dictionary to process; if not provided, the dictionary supplied at initialization is used.

  • exclude_keys (Optional[List[str]]) – List of keys to exclude from the flattened result.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]
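
Example

A hedged sketch of flattening a nested record (the record shown is illustrative, and the exact key format of the flattened output depends on the configured delimiters):

>>> from scholar_flux.utils import RecursiveJsonProcessor
>>> record = {'title': 'A Study', 'authors': [{'name': 'Ada'}, {'name': 'Grace'}]}
>>> processor = RecursiveJsonProcessor(record)
>>> flattened = processor.process_and_flatten()
>>> # Keys are terminal paths; values contain the data found at each path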

process_dictionary(obj: Dict | None = None)[source]

Create a new json dictionary that contains information about the relative paths of each field that can be found within the current json_dict.

process_level(obj: Any, level_name: List[Any] | None = None) List[Any][source]

Helper method for processing a level within a dictionary.

This method is recursively called to process nested components.

static unlist(current_data: Dict | List | None) Any | None[source]

Flattens a dictionary or list if it contains a single element that is a dictionary.

Parameters:

current_data – A dictionary or list to be flattened if it contains a single dictionary element.

Returns:

The flattened dictionary if the input meets the flattening condition, otherwise returns the input unchanged.

Return type:

Optional[Any]

class scholar_flux.utils.ResponseProtocol(*args, **kwargs)[source]

Bases: Protocol

Protocol for HTTP response objects compatible with requests.Response, httpx.Response, and other response-like classes.

This protocol defines the common interface shared between popular HTTP client libraries, allowing for type-safe interoperability.

The url attribute is typed flexibly (Any) to allow types beyond plain strings, including the URL types used by pydantic, httpx, and other custom objects.

__init__(*args, **kwargs)
content: bytes
headers: MutableMapping[str, str]
raise_for_status() None[source]

Raise an exception for HTTP error status codes.

status_code: int
url: Any
scholar_flux.utils.adjust_repr_padding(obj: Any, pad_length: int | None = 0, flatten: bool | None = None) str[source]

Helper method for adjusting the padding for representations of objects.

Parameters:
  • obj (Any) – The object to generate an adjusted repr for

  • pad_length (Optional[int]) – Indicates the additional amount of padding that should be added. Helpful when attempting to create nested representations that are formatted as intended.

  • flatten (Optional[bool]) – Indicates whether to flatten the representation instead of separating values with newline characters. False by default.

Returns:

A string representation of the current object that adjusts the padding accordingly

Return type:

str

scholar_flux.utils.as_list_1d(value: Any) List[source]

Nests a value into a single element list if the value is not already a list.

Parameters:

value (Any) – The value to add to a list if it is not already a list

Returns:

If already a list, the value is returned as is. Otherwise, the value is nested in a list. Caveat: if the value is None, an empty list is returned

Return type:

List
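
Example

A short sketch of the documented behavior:

>>> from scholar_flux.utils import as_list_1d
>>> as_list_1d('value')  # Expected: ['value']
>>> as_list_1d(['a', 'b'])  # Already a list; returned as is
>>> as_list_1d(None)  # Expected: []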

scholar_flux.utils.coerce_int(value: Any) int | None[source]

Attempts to convert a value to an integer, returning None if the conversion fails.

scholar_flux.utils.coerce_str(value: Any) str | None[source]

Attempts to convert a value into a string, if possible, returning None if conversion fails.

scholar_flux.utils.format_iso_timestamp(timestamp: datetime) str[source]

Formats a datetime as an ISO 8601 timestamp string in UTC with millisecond precision.

Parameters:

timestamp (datetime) – The datetime object to format.

Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123Z”)

Return type:

str

scholar_flux.utils.format_repr_value(value: Any, pad_length: int | None = None, show_value_attributes: bool | None = None, flatten: bool | None = None) str[source]

Helper function for representing nested objects from custom classes.

Parameters:
  • value (Any) – The value containing the repr to format

  • pad_length (Optional[int]) – Indicates the total additional padding to add for each individual line

  • show_value_attributes (Optional[bool]) – If False, all attributes within the current object will be replaced with ‘…’ (e.g., StorageDevice(…)).

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

scholar_flux.utils.generate_iso_timestamp() str[source]

Generates and formats an ISO 8601 timestamp string in UTC with millisecond precision for reliable round-trip conversion.

Example usage:
>>> from scholar_flux.utils import generate_iso_timestamp, parse_iso_timestamp, format_iso_timestamp
>>> timestamp = generate_iso_timestamp()
>>> parsed_timestamp = parse_iso_timestamp(timestamp)
>>> assert parsed_timestamp is not None and format_iso_timestamp(parsed_timestamp) == timestamp
Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123Z”)

Return type:

str

scholar_flux.utils.generate_repr(obj: object, exclude: set[str] | list[str] | tuple[str] | None = None, show_value_attributes: bool = True, flatten: bool = False) str[source]

Method for creating a basic representation of a custom object’s data structure. Useful for showing the options/attributes being used by an object.

If the object doesn’t have a __dict__ attribute, the resulting AttributeError is caught and the function falls back to the basic string representation of the object.

Note that threading.Lock objects are excluded from the final representation.

Parameters:
  • obj – The object whose attributes are to be represented.

  • exclude – Attributes to exclude from the representation (default is None).

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

Returns:

A string representing the object’s attributes in a human-readable format.

scholar_flux.utils.generate_repr_from_string(class_name: str, attribute_dict: dict[str, Any], show_value_attributes: bool | None = None, flatten: bool | None = False) str[source]

Method for creating a basic representation of a custom object’s data structure. Allows the direct creation of a repr from the class name (as a string) and an attribute dictionary that is formatted into the components of the representation.

Parameters:
  • class_name – The class name of the object whose attributes are to be represented.

  • attribute_dict (dict) – The dictionary containing the full list of attributes to format into the components of a repr

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

Returns:

A string representing the object’s attributes in a human-readable format.

scholar_flux.utils.generate_response_hash(response: Response | ResponseProtocol) str[source]

Generates a response hash from a response or response-like object that implements the ResponseProtocol.

Parameters:

response (requests.Response | ResponseProtocol) – An HTTP response or response-like object.

Returns:

A unique identifier for the response.

scholar_flux.utils.get_nested_data(json: list | dict | None, path: list) list | dict | None | str | int[source]

Recursively retrieves data from a nested dictionary using a sequence of keys.

Parameters:
  • json (List[Dict[Any, Any]] | Dict[Any, Any]) – The parsed json structure from which to extract data.

  • path (List[Any]) – A list of keys representing the path to the desired data within json.

Returns:

The value retrieved from the nested dictionary following the path, or None if any key in the path is not found or leads to a None value prematurely.

Return type:

Optional[Any]
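
Example

A minimal sketch of nested retrieval:

>>> from scholar_flux.utils import get_nested_data
>>> data = {'a': {'b': {'c': 42}}}
>>> get_nested_data(data, ['a', 'b', 'c'])  # Expected: 42
>>> get_nested_data(data, ['a', 'missing']) is None  # Expected: True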

scholar_flux.utils.initialize_package(log: bool = True, env_path: str | Path | None = None, config_params: dict[str, Any] | None = None, logging_params: dict[str, Any] | None = None) tuple[dict[str, Any], Logger, SensitiveDataMasker][source]

Function used for orchestrating the initialization of the config, log settings, and masking for scholar_flux.

This function imports a ‘.env’ configuration file at the specified location if it exists. Otherwise, scholar_flux will look for a .env file in the default locations if available. If no .env configuration file is found, then only package defaults and available OS environment variables are used.

This function can also be used for dynamic re-initialization of configuration parameters and logging. The config_params are sent as keyword arguments to the scholar_flux.utils.ConfigSettings.load_config method, while logging_params are used as keyword arguments to the scholar_flux.utils.setup_logging method to set up logging settings and handlers.

Parameters:
  • log (bool) – A True/False flag that determines whether to enable or disable logging.

  • env_path (Optional[str | Path]) – The file path indicating from where to load the environment variables, if provided.

  • config_params (Optional[Dict]) – A dictionary allowing for the specification of configuration parameters when attempting to load environment variables from a config. Useful for loading API keys from environment variables for later use.

  • logging_params (Optional[Dict]) – A dictionary allowing users to specify options for package-level logging with custom logic. Log settings are loaded from the OS environment or an .env file when available, with precedence given to .env files. These settings, when loaded, override the default ScholarFlux logging configuration. Otherwise, ScholarFlux uses a log-level of WARNING by default.

Returns:

A tuple containing the configuration dictionary, the initialized logger, and the sensitive data masker.

Return type:

Tuple[Dict[str, Any], logging.Logger, scholar_flux.security.SensitiveDataMasker]

Raises:

PackageInitializationError – If there are issues with loading the configuration or initializing the logger.

scholar_flux.utils.is_nested(obj: Any) bool[source]

Indicates whether the current value is a nested object. Useful for recursive iterations such as JSON record data.

Parameters:

obj – Any realistic JSON data type (dicts, lists, strings, numbers)

Returns:

True if nested otherwise False

Return type:

bool

scholar_flux.utils.nested_key_exists(obj: Any, key_to_find: str, regex: bool = False) bool[source]

Recursively checks if a specified key is present anywhere in a given JSON-like dictionary or list structure.

Parameters:
  • obj – The dictionary or list to search.

  • key_to_find – The key to search for.

  • regex – Whether or not to search with regular expressions.

Returns:

True if the key is present, False otherwise.
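
Example

A short sketch of a recursive key search (the record shown is illustrative):

>>> from scholar_flux.utils import nested_key_exists
>>> record = {'metadata': {'identifiers': [{'doi': '10.1000/xyz'}]}}
>>> nested_key_exists(record, 'doi')  # Expected: True
>>> nested_key_exists(record, 'isbn')  # Expected: False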

scholar_flux.utils.normalize_repr(value: Any) str[source]

Helper function for removing byte locations and surrounding angle brackets from default class representations.

Parameters:

value (Any) – a value whose representation to be normalized

Returns:

A normalized string representation of the current value

Return type:

str

scholar_flux.utils.parse_iso_timestamp(timestamp_str: str) datetime | None[source]

Attempts to convert an ISO 8601 timestamp string back to a datetime object.

Parameters:

timestamp_str – ISO 8601 formatted timestamp string

Returns:

datetime object if parsing succeeds, None otherwise

Return type:

datetime

scholar_flux.utils.quote_if_string(value: Any) Any[source]

Attempt to quote string values to distinguish them from object text in class representations.

Parameters:

value (Any) – a value that is quoted only if it is a string

Returns:

Returns a quoted string if successful. Otherwise returns the value unchanged

Return type:

Any

scholar_flux.utils.quote_numeric(value: Any) str[source]

Attempts to quote a value as a numeric string (a numeric string or an integer), returning the quoted string if successful.

Parameters:

value (Any) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful.

Raises:

ValueError – If the value cannot be quoted

scholar_flux.utils.set_public_api_module(module_name: str, public_names: list[str], namespace: dict)[source]

Assigns the current module’s name to the __module__ attribute of public API objects.

This function is useful for several use cases including sphinx documentation, introspection, and error handling/reporting.

For all objects defined in the list of a module’s public API names (generally named __all__), this function sets their __module__ attribute to the name of the current public API module, if supported.

This is useful for ensuring that imported classes and functions appear as if they are defined in the current module (such as in the automatic generation of sphinx documentation), which improves overall documentation, introspection, and error reporting.

Parameters:
  • module_name (str) – The name of the module (usually __name__).

  • public_names (list[str]) – List of public object names to update (e.g., __all__).

  • namespace (dict) – The module’s namespace (usually globals()).

Example usage:

set_public_api_module(__name__, __all__, globals())

scholar_flux.utils.setup_logging(logger: Logger | None = None, log_directory: str | None = None, log_file: str | None = 'application.log', log_level: int = 10, propagate_logs: bool | None = True, max_bytes: int = 1048576, backup_count: int = 5, logging_filter: Filter | None = None)[source]

Configure logging to write to both console and file with optional filtering.

Sets up a logger that outputs to both the terminal (console) and a rotating log file. Rotating files automatically create new files when size limits are reached, keeping your logs manageable.

Parameters:
  • logger (Optional[logging.Logger]) – The logger instance to configure. If None, uses the root logger.

  • log_directory (Optional[str]) – Indicates where to save log files. If None, automatically finds a writable directory when a log_file is specified.

  • log_file (Optional[str]) – Name of the log file (default: ‘application.log’). If None, file-based logging will not be performed.

  • log_level (int) – Minimum level to log (DEBUG logs everything, INFO skips debug messages).

  • propagate_logs (Optional[bool]) – Determines whether to propagate logs. Logs are propagated by default if this option is not specified.

  • max_bytes (int) – Maximum size of each log file before rotating (default: 1MB).

  • backup_count (int) – Number of old log files to keep (default: 5).

  • logging_filter (Optional[logging.Filter]) – Optional filter to modify log messages (e.g., hide sensitive data).

Example

>>> # Basic setup - logs to console and file
>>> setup_logging()
>>> # Custom location and less verbose
>>> setup_logging(log_directory="/var/log/myapp", log_level=logging.INFO)
>>> # With sensitive data masking
>>> from scholar_flux.security import MaskingFilter
>>> mask_filter = MaskingFilter()
>>> setup_logging(logging_filter=mask_filter)

Note

  • Console shows all log messages in real-time

  • File keeps a permanent record with automatic rotation

  • If logging_filter is provided, it’s applied to both console and file output

  • Calling this function multiple times will reset the logger configuration

scholar_flux.utils.try_call(func: Callable, args: tuple | None = None, kwargs: dict | None = None, suppress: tuple = (), logger: Logger | None = None, log_level: int = 30, default: Any | None = None) Any | None[source]

A helper function for calling another function safely: if one of the specified errors occurs and is contained within the tuple of errors to suppress, the exception is handled and suppressed.

Parameters:
  • func – The function to call

  • args – A tuple of positional arguments to add to the function call

  • kwargs – A dictionary of keyword arguments to add to the function call

  • suppress – A tuple of exceptions to handle and suppress if they occur

  • logger – The logger to use for warning generation

  • default – The value to return in the event that an error occurs and is suppressed

Returns:

When successful, the return type of the callable is also returned without modification. Upon suppressing an exception, the function will generate a warning and return None by default unless the default was set.

Return type:

Optional[Any]
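
Example

A minimal sketch of safe invocation (the built-in int is used purely for illustration):

>>> from scholar_flux.utils import try_call
>>> try_call(int, args=('42',))  # Expected: 42
>>> try_call(int, args=('not a number',), suppress=(ValueError,), default=0)  # Expected: 0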

scholar_flux.utils.try_dict(value: List | Tuple | Dict) Dict | None[source]

Attempts to convert a value into a dictionary, if possible. If it is not possible to convert the value into a dictionary, the function will return None.

Parameters:

value (List | Tuple | Dict) – The value to attempt to convert into a dict

Returns:

The value converted into a dictionary if possible, otherwise None

Return type:

Optional[Dict]

scholar_flux.utils.try_int(value: JSON_TYPE | None) JSON_TYPE | int | None[source]

Attempts to convert a value to an integer, returning the original value if the conversion fails.

Parameters:

value (JSON_TYPE | None) – The value to attempt to coerce into an integer

Return type:

JSON_TYPE | int | None

scholar_flux.utils.try_pop(s: Set[T], item: T, default: T | None = None) T | None[source]

Attempt to remove an item from a set and return the item if it exists.

Parameters:
  • s (Set[T]) – The set from which to remove the item

  • item (T) – The item to try to remove from the set

  • default (Optional[T]) – The object to return as a default if the item is not found

Returns:

The item if it is present in the set; otherwise, the specified default

Return type:

Optional[T]
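
Example

A short sketch of the documented set-removal behavior:

>>> from scholar_flux.utils import try_pop
>>> s = {1, 2, 3}
>>> try_pop(s, 2)  # Expected: 2 (and 2 is removed from s)
>>> try_pop(s, 99, default=-1)  # Expected: -1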

scholar_flux.utils.try_quote_numeric(value: Any) str | None[source]

Attempt to quote numeric values to distinguish them from string values and integers.

Parameters:

value (Any) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful. Otherwise None

Return type:

Optional[str]

scholar_flux.utils.try_str(value: Any) str | None[source]

Attempts to convert a value to a string, returning the original value if the conversion fails.

Parameters:

value (Any) – the value to attempt to coerce into a string

Return type:

Optional[str]

scholar_flux.utils.unlist_1d(current_data: Tuple | List | Any) Any[source]

Retrieves an element from a list/tuple if it contains only a single element. Otherwise, it will return the element as is. Useful for extracting text from a single element list/tuple.

Parameters:

current_data (Tuple | List | Any) – An object to potentially unlist if it contains a single element.

Returns:

The unlisted object if it comes from a single element list/tuple, otherwise returns the input unchanged.

Return type:

Optional[Any]
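
Example

A short sketch of single-element extraction:

>>> from scholar_flux.utils import unlist_1d
>>> unlist_1d(['only'])  # Expected: 'only'
>>> unlist_1d(['a', 'b'])  # Multiple elements; returned unchanged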