scholar_flux.utils package

Subpackages

Submodules

scholar_flux.utils.config_loader module

The scholar_flux.utils.config_loader module defines the primary ConfigLoader for the scholar_flux package.

The ConfigLoader is designed to ensure that user-specified package default settings are registered via the use of .env files when available, and OS environment variables otherwise.

ScholarFlux uses the ConfigLoader alongside the scholar_flux.utils.initializer to fully initialize the scholar_flux package with the chosen configuration. This includes importing API keys as secret strings, defining log levels, setting default API providers, etc.

class scholar_flux.utils.config_loader.ConfigLoader(env_path: str | Path | None = None)[source]

Bases: object

Configuration loader for the scholar_flux package settings and environment variables.

The ConfigLoader is used on package initialization to dynamically configure package options from .env files and the OS environment. ScholarFlux uses this class to define package-level settings at runtime while prioritizing .env file configurations when available.

Configuration Variables

Package Level Settings

  • SCHOLAR_FLUX_DEFAULT_PROVIDER: Defines the provider to use by default when creating a SearchAPI instance.

  • SCHOLAR_FLUX_DEFAULT_USER_AGENT:

    The default User-Agent to use when sending requests via requests-cache. If not specified, a default User-Agent will be generated automatically.

  • SCHOLAR_FLUX_DEFAULT_MAILTO:

    Defines the default mailto address that is used when creating a new search coordinator.

  • SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND:

    Controls the default backend for CachedSession instances created when initializing SearchAPI or SearchCoordinator. Supported requests_cache backends include sqlite, redis, mongodb, and memory.

  • SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE:

    Defines the default cache storage backend that the DataCacheManager creates for response caching during orchestration of the response processing steps. Supported options are redis, sql, mongodb, memory, and null. Defaults to memory if not specified.

  • SCHOLAR_FLUX_SESSION_CACHE_NAME:

Defines the name of the session cache db (mongodb/redis), table name (sqlite), or nested folder (filesystem).

  • SCHOLAR_FLUX_CACHE_DIRECTORY:

    Defines the directory path where requests and response processing cache will be stored when using filesystem-based cache backends (e.g., sqlite).

API_KEYS

  • ARXIV_API_KEY: API key used when retrieving academic data from arXiv.

  • OPEN_ALEX_API_KEY: API key used when retrieving academic data from OpenAlex.

  • SPRINGER_NATURE_API_KEY: API key used when retrieving academic data from Springer Nature.

  • CROSSREF_API_KEY: API key used to retrieve academic metadata from Crossref (API key not required).

  • CORE_API_KEY: API key used to retrieve metadata and full-text publications from the CORE API.

  • PUBMED_API_KEY: API key used to retrieve publications from the NIH PubMed database.

  • SCHOLAR_FLUX_CACHE_SECRET_KEY:

    Defines the secret key used to create encrypted session cache for request retrieval.

Logging

  • SCHOLAR_FLUX_ENABLE_LOGGING: Defines whether logging should be enabled when ScholarFlux is initialized.

  • SCHOLAR_FLUX_LOG_DIRECTORY: Defines where rotating logs will be stored when logging is enabled.

  • SCHOLAR_FLUX_LOG_LEVEL:

    Defines the default log level used for package level logging during and after scholar_flux package initialization.

  • SCHOLAR_FLUX_LOG_STREAM:

    Defines the default stream that should be used when initializing the package-level logger.

  • SCHOLAR_FLUX_PROPAGATE_LOGS: Determines whether logs should be propagated or not. (True by default).

Database Connections

  • SCHOLAR_FLUX_MONGODB_HOST: MongoDB connection string (default: “mongodb://127.0.0.1”)

  • SCHOLAR_FLUX_MONGODB_PORT: MongoDB port (default: 27017)

  • SCHOLAR_FLUX_REDIS_HOST: Redis host (default: “localhost”)

  • SCHOLAR_FLUX_REDIS_PORT: Redis port (default: 6379)

  • SCHOLAR_FLUX_SQLALCHEMY_URL: The default SQLAlchemy URL to use for the response processing cache.

  • SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL: Controls the time until expiration for cached responses (seconds)

  • SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL: Controls the time until expiration for processing cache (seconds)
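A hypothetical .env fragment combining several of the variables above (the values shown are illustrative placeholders, not recommended defaults):

```ini
# Hypothetical .env sketch using the variables documented above;
# values are placeholders chosen for illustration only.
SCHOLAR_FLUX_DEFAULT_PROVIDER=plos
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=sqlite
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=memory
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL=86400
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_REDIS_HOST=localhost
SCHOLAR_FLUX_REDIS_PORT=6379
```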

Examples

>>> from scholar_flux.utils import ConfigLoader
>>> from pydantic import SecretStr
>>> config_loader = ConfigLoader()
>>> config_loader.load_config(reload_env=True)
>>> api_key = '' # Your key goes here
>>> if api_key:
...     config_loader.config['CROSSREF_API_KEY'] = api_key
>>> print(config_loader.env_path) # the default environment location when writing/replacing an env config
>>> config_loader.save_config() # saves the full configuration to the default environment folder
DEFAULT_ENV: Dict[str, Any] = {'ARXIV_API_KEY': None, 'CORE_API_KEY': None, 'CROSSREF_API_KEY': None, 'OPEN_ALEX_API_KEY': None, 'PUBMED_API_KEY': None, 'SCHOLAR_FLUX_CACHE_DIRECTORY': None, 'SCHOLAR_FLUX_CACHE_SECRET_KEY': None, 'SCHOLAR_FLUX_DEFAULT_MAILTO': None, 'SCHOLAR_FLUX_DEFAULT_PROVIDER': 'plos', 'SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE': None, 'SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL': None, 'SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND': None, 'SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL': 86400, 'SCHOLAR_FLUX_DEFAULT_USER_AGENT': None, 'SCHOLAR_FLUX_ENABLE_LOGGING': None, 'SCHOLAR_FLUX_LOG_DIRECTORY': None, 'SCHOLAR_FLUX_LOG_FILE': 'application.log', 'SCHOLAR_FLUX_LOG_LEVEL': None, 'SCHOLAR_FLUX_LOG_STREAM': None, 'SCHOLAR_FLUX_MONGODB_HOST': 'mongodb://127.0.0.1', 'SCHOLAR_FLUX_MONGODB_PORT': 27017, 'SCHOLAR_FLUX_PROPAGATE_LOGS': None, 'SCHOLAR_FLUX_REDIS_HOST': 'localhost', 'SCHOLAR_FLUX_REDIS_PORT': 6379, 'SCHOLAR_FLUX_SESSION_CACHE_NAME': None, 'SCHOLAR_FLUX_SQLALCHEMY_URL': None, 'SPRINGER_NATURE_API_KEY': None}
DEFAULT_ENV_PATH: Path = PosixPath('/home/runner/work/scholar-flux/scholar-flux/.env')
__init__(env_path: str | Path | None = None)[source]

Initializes the ConfigLoader with class-level defaults and establishes the .env path to read from.

If a custom path is provided, it will be used when it points to a valid file that exists; otherwise, the path will default to a readable package location (SCHOLAR_FLUX_HOME, ~/.scholar_flux, or the current directory).

Parameters:

env_path (Optional[Path | str]) – The dotenv file to read environment variables from. If not passed, environment variables are scanned and checked from default package locations or the current directory when available.

env_path

The location of the .env file to load for reading/writing configuration.

Type:

Path

config

(Dict[str, Any]): The current configuration dictionary with masked sensitive values.

get(key: str, default: Any = None) Any[source]

Retrieve a configuration value from the config dictionary, falling back to the environment if not present.

Parameters:
  • key (str) – The name of the variable from which to retrieve the configuration value.

  • default (Any) – A fallback value that is returned when the key exists in neither the config dictionary nor the environment.

Note

Any values set during the current session are prioritized over values from the environment. If a value can’t be found within the config dictionary, the get() method will fall back to checking for the environment variable within the operating system environment.
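The lookup precedence described above can be sketched with a plain dictionary and os.environ (an illustrative stand-in, not the actual ConfigLoader implementation; the variable name is hypothetical):

```python
import os

# Illustrative stand-in for ConfigLoader.get's documented precedence:
# session-level values win; otherwise fall back to the OS environment.
def get(config: dict, key: str, default=None):
    value = config.get(key)
    if value is not None:
        return value  # a value set during the current session takes priority
    return os.environ.get(key, default)

os.environ["DEMO_PROVIDER"] = "plos"   # hypothetical variable name
config = {}
first = get(config, "DEMO_PROVIDER")   # not in config: falls back to the environment
config["DEMO_PROVIDER"] = "arxiv"
second = get(config, "DEMO_PROVIDER")  # the session value now wins
print(first, second)
```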

load_config(env_path: str | Path | None = None, reload_env: bool = False, reload_os_env: bool = False, verbose: bool = False) None[source]

Load configuration settings from a .env file and the global OS environment.

This method allows users to pick up new defaults when the environment changes, optionally overwriting previously set configuration settings using either or both sources of config settings.

Note that config settings from a .env file are prioritized over globally set OS environment variables. If neither reload_env nor reload_os_env is enabled, this method has no effect on the current configuration.

Parameters:
  • env_path (Optional[Path | str]) – An optional env path to read from. Defaults to the current env_path if None.

  • reload_env (bool) – Determines whether environment variables will be loaded/reloaded from the provided env_path or the current self.env_path. Defaults to False, indicating that variables are not reloaded from a .env file.

  • reload_os_env (bool) – Determines whether environment variables will be loaded/reloaded from the Operating System’s global environment.

  • verbose (bool) – Convenience setting indicating whether or not to log changed configuration variable names.

load_dotenv(env_path: str | Path | None = None, replace_all: bool = False, verbose: bool = False) dict[str, Any][source]

Retrieves the non-null environment variables from the current .env file as a dictionary.

Parameters:
  • env_path (Optional[Path | str]) – Location of the .env file where env variables will be retrieved from.

  • replace_all (bool) – Indicates whether all environment variables should be replaced vs. only non-missing variables. By default, only previously unset variables are assigned updated values.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. This is set to False by default.

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict[str, Any]

load_os_env(replace_all: bool = False, verbose: bool = False) dict[source]

Load any updated configuration settings from variables set within the system environment.

A configuration setting must already exist in the config for this method to update it. To directly add or update arbitrary settings, use the update_config method instead.

Parameters:
  • replace_all (bool) – Indicates whether all environment variables should be replaced vs. only non-missing variables. This is false by default.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. This is False by default.

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict[str, Any]

classmethod load_os_env_key(key: str, **kwargs: Any) str | SecretStr | None[source]

Loads the provided key from the global environment. Converts API_KEY variables to secret strings by default.

Parameters:
  • key (str) – The key to load from the environment. This key will be guarded if it contains any of the following substrings: “API_KEY”, “SECRET”, “MAIL”

  • matches (str) – An optional keyword argument specifying the substrings used to indicate whether the loaded environment variable should be guarded.

Returns:

The value of the environment variable, possibly wrapped as a secret string

Return type:

Optional[str | SecretStr]

save_config(env_path: str | Path | None = None) None[source]

Save configuration settings to a .env file.

Automatically unmasks SecretStr values before writing to disk.

Parameters:

env_path (Optional[Path | str]) – The location to save the configuration settings to.

Note

Sensitive values (SecretStr) are unmasked during the write. Ensure that .env files have appropriate permissions (e.g., chmod 600).
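Restricting a saved .env file to owner read/write, as the note recommends, can be done with the standard library (a sketch using a temporary placeholder path and value):

```python
import stat
import tempfile
from pathlib import Path

# Sketch: apply chmod 600 (owner read/write only) to a saved .env file.
# The path and key value below are placeholders for illustration.
env_path = Path(tempfile.mkdtemp()) / ".env"
env_path.write_text("CROSSREF_API_KEY=example-value\n")  # placeholder value
env_path.chmod(stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
mode = env_path.stat().st_mode & 0o777
print(oct(mode))
```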

set(key: str, value: Any, verbose: bool = True) None[source]

Sets a configuration value for a key within the config dictionary.

Parameters:
  • key (str) – The name of the variable to set or overwrite within the current session.

  • value (Any) – The value to assign to the setting in the config dictionary.

  • verbose (bool) – Determines whether overrides to defaults or previously existing variables should be logged.

Note

Values set with the .set() method are prioritized over values from the environment when .get() is called. To override this behavior and use environment variables instead, either remove the environment variable from the config dictionary, or set the value associated with the key to None.

try_loadenv(env_path: str | Path | None = None, verbose: bool = False) Dict[str, Any] | None[source]

Try to load environment variables from a specified .env file into the environment and return as a dict.

Parameters:
  • env_path (Optional[Path | str]) – Location of the .env file where env variables will be retrieved from.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. This is False by default.

Returns:

A loaded configuration that is returned as a dictionary when available. Otherwise, None is returned.

Return type:

Optional[Dict[str, Any]]

update_config(env_dict: dict[str, Any], verbose: bool = False) None[source]

Helper method for updating the config dictionary with the provided dictionary of key-value pairs.

This method coerces strings into integers when possible and uses the _guard_secret method as insurance to guard against logging and recording API keys without masking. Although the load_env and load_os_env methods also mask API keys, this is particularly useful if the end-user calls update_config directly.

Parameters:
  • env_dict (dict[str, Any]) – A dictionary containing environment variables that will be used to update the package-level config. dictionary for the current session.

  • verbose (bool) – Determines whether updates to the configuration should be logged when they occur.

write_key(key_name: str, key_value: str, env_path: str | Path | None = None, create: bool = True) None[source]

Write a key-value pair to a .env file.

Parameters:
  • key_name (str) – The name of the key to write to an environment configuration file.

  • key_value (str) – The value of the key to write to an environment configuration file.

  • env_path (Optional[Path | str]) – The dotenv filepath indicating where to write the key-value pair.

  • create (bool) – Determines whether a new dotenv file should be created if it doesn’t already exist. True by default.

Raises:
  • IOError – If the file cannot be written.

  • PermissionError – If permissions are insufficient to create or modify the file.
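The append-with-creation behavior described for write_key can be sketched with pathlib (illustrative only; the packaged method may handle quoting and in-place key replacement differently):

```python
import tempfile
from pathlib import Path

# Sketch of writing a key-value pair to a dotenv file, mirroring the
# documented create=True behavior. Paths and values are placeholders.
env_path = Path(tempfile.mkdtemp()) / ".env"
env_path.touch(exist_ok=True)  # create the file if it doesn't already exist
line = "SCHOLAR_FLUX_DEFAULT_PROVIDER=plos\n"
with env_path.open("a") as f:
    f.write(line)  # append the new key-value pair
print(env_path.read_text().endswith(line))
```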

scholar_flux.utils.encoder module

The scholar_flux.utils.encoder module contains encoder-decoder helper classes that abstract the serialization and deserialization of JSON data sets for easier storage.

Responses from APIs often contain data types, such as non-traditional sequences and mappings, that aren’t directly serializable. These implementations aid in creating representations of such objects that can be used to reconstruct the original data after serialization with built-in types.

Classes:
CacheDataEncoder:

Helper class used to recursively encode and decode nested JSON data with mixed data types.

JsonDataEncoder:

Helper class that builds on the CacheDataEncoder to provide built-in JSON loading/dumping support that aids in the creation of a simple Serialization-Deserialization pipeline.

class scholar_flux.utils.encoder.CacheDataEncoder[source]

Bases: object

A utility class to encode data into a base64 string representation or decode it back from base64.

This class supports encoding binary data (bytes) and recursively handles nested structures such as dictionaries and lists by encoding their elements, preserving the original structure upon decoding.

This class is used to serialize JSON structures when the structure isn’t known in advance and contains unpredictable elements such as 1) None, 2) bytes, 3) nested lists, or 4) other non-serializable structures typically found in JSON.

Class Attributes:
DEFAULT_HASH_PREFIX: (Optional[str]):

An optional indicator of fields to mark fields as bytes for use when decoding. This field defaults to <hashbytes> but can be optionally turned off by setting CacheDataEncoder.DEFAULT_HASH_PREFIX=None or CacheDataEncoder.DEFAULT_HASH_PREFIX=’’

DEFAULT_NONREADABLE_PROP (int):

A threshold used to identify previously encoded base64 fields. This proportion is used when a hash prefix that marks encoded text is not applied. To test whether a string is an encoded_string, when decoded, a high percentage of letters will be nonreadable when decoded. (i.e CacheDataEncoder.decode(‘encoders’) —> b’zw(uêì’
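The proportion check behind this threshold can be sketched with the standard library (an illustrative reimplementation, not the packaged code):

```python
import base64

# Illustrative sketch of the heuristic described above: the proportion of
# decoded bytes outside the printable ASCII range (32..126) flags strings
# that decode to gibberish, i.e. that were likely not base64 to begin with.
def nonreadable_prop(b: bytes) -> float:
    if not b:
        return 0.0
    return sum(1 for c in b if not 32 <= c <= 126) / len(b)

decoded = base64.b64decode("encoders")  # decodes to b'zw(u\xea\xec'
prop = nonreadable_prop(decoded)
print(round(prop, 3))  # 2 of the 6 decoded bytes are non-printable
```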

Example

>>> from scholar_flux.utils import CacheDataEncoder
>>> import json
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string', 'list': ['a', True, 'series', 'of', None]}
>>> try:
...     json.dumps(data)
... except TypeError:
...     print('The `data` is non-serializable as expected')
>>>
>>> encoded_data = CacheDataEncoder.encode(data)
>>> serialized_data = json.dumps(encoded_data)
>>> assert data == CacheDataEncoder.decode(json.loads(serialized_data))
DEFAULT_HASH_PREFIX: str | None = '<hashbytes>'
DEFAULT_NONREADABLE_PROP: float = 0.2
classmethod decode(data: str, hash_prefix: str | None = None) str | bytes[source]
classmethod decode(data: dict, hash_prefix: str | None = None) dict
classmethod decode(data: list, hash_prefix: str | None = None) list
classmethod decode(data: tuple, hash_prefix: str | None = None) tuple
classmethod decode(data: T, hash_prefix: str | None = None) T

Recursively decodes base64 strings back to bytes or recursively decode elements within dictionaries and lists.

Parameters:
  • data (object) – The input data that needs decoding from a base64 encoded format. This could be a base64 string or nested structures like dictionaries and lists containing base64 strings as values.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Decoded bytes for byte-based representations, or recursively decoded elements within the dictionary/list/tuple if applicable.

Return type:

object

classmethod encode(data: bytes, hash_prefix: str | None = None) str[source]
classmethod encode(data: MutableMapping, hash_prefix: str | None = None) dict
classmethod encode(data: MutableSequence | set, hash_prefix: str | None = None) list
classmethod encode(data: tuple, hash_prefix: str | None = None) tuple
classmethod encode(data: T, hash_prefix: str | None = None) T

Recursively encodes all items that contain elements that cannot be directly serialized into JSON into a format more suitable for serialization:

  • Mappings are converted into dictionaries

  • Sets and other uncommon Sequences other than lists and tuples are converted into lists

  • Bytes objects are converted into strings and hashed with an optional prefix-identifier.

Parameters:
  • data (object) – The input data. This can be: * bytes: Encoded directly to a base64 string. * Mappings/Sequences/Sets/Tuples: Recursively encodes elements if they are bytes.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Encoded string (for bytes) or a dictionary/list/tuple with recursively encoded elements.

Return type:

object
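The bytes-handling convention described above can be sketched in a few lines (a hedged illustration of the documented prefix scheme, not the actual implementation):

```python
import base64

# Sketch: bytes are base64-encoded and tagged with the documented
# '<hashbytes>' prefix so they can be recognized and restored on decode.
PREFIX = "<hashbytes>"

def encode_bytes(value: bytes) -> str:
    return PREFIX + base64.b64encode(value).decode("ascii")

def decode_value(value: str):
    if value.startswith(PREFIX):
        return base64.b64decode(value[len(PREFIX):])
    return value  # ordinary strings pass through unchanged

payload = b"a non-serializable string"
round_tripped = decode_value(encode_bytes(payload))
print(round_tripped == payload)
```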

classmethod is_base64(s: str | bytes, hash_prefix: str | None = None) bool[source]

Check if a string is a valid base64 encoded string. Encoded strings can optionally be identified with a hash_prefix to streamline checks to determine whether or not to later decode a base64 encoded string.

As a general heuristic, a base64 string should be equal to its original value after a decode-encode round trip. In this implementation, equals signs are stripped, as minor differences in padding aren’t relevant.

Parameters:
  • s (str | bytes) – The string to check.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

True if the string is base64 encoded, False otherwise.

Return type:

bool
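The round-trip heuristic described above can be illustrated with the standard base64 module (a sketch, not the packaged implementation):

```python
import base64
import binascii

# Sketch of the round-trip heuristic: a valid base64 string re-encodes to
# itself once '=' padding is ignored; invalid input fails to decode at all.
def looks_like_base64(s: str) -> bool:
    try:
        decoded = base64.b64decode(s, validate=True)
    except (binascii.Error, ValueError):
        return False
    return base64.b64encode(decoded).decode("ascii").rstrip("=") == s.rstrip("=")

encoded = base64.b64encode(b"hello").decode("ascii")  # 'aGVsbG8='
print(looks_like_base64(encoded), looks_like_base64("not base64!"))
```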

classmethod is_nonreadable(s: bytes, prop: float | None = None) bool[source]

Check if a decoded byte string contains a high percentage of non-printable characters. Non-printable characters are defined as those outside the printable ASCII range (32 <= ord(c) <= 126).

Parameters:
  • s (bytes) – The byte string to check.

  • prop (float) – The threshold proportion of non-printable characters. Defaults to DEFAULT_NONREADABLE_PROP if not specified.

Returns:

True if the string is likely gibberish, False otherwise.

Return type:

bool

class scholar_flux.utils.encoder.JsonDataEncoder[source]

Bases: CacheDataEncoder

Helper class that extends the CacheDataEncoder to provide functionality directly relevant to serializing and deserializing JSON data into serialized JSON strings for easier storage and recovery.

This class includes utility dumping and loading tools directly applicable to safely dumping and reloading responses received from various APIs.

Example Use:
>>> from scholar_flux.utils import JsonDataEncoder
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string',
...         'list': ['a', True, 'series', 'of', None]}
# serializes the original data even though it contains otherwise unserializable components
>>> serialized_data = JsonDataEncoder.dumps(data)
>>> assert isinstance(serialized_data, str)
# deserializes the data, returning the original structure
>>> recovered_data = JsonDataEncoder.deserialize(serialized_data)
# the result should be the original structure
>>> assert data == recovered_data
classmethod deserialize(s: str, **json_kwargs: Any) Any[source]

Class method that deserializes and decodes json data from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The decoded data.

Return type:

Any

classmethod dumps(data: object, **json_kwargs: Any) str[source]

Convenience method that uses the json module to serialize (dump) JSON data into a JSON string.

Parameters:
  • data (object) – The data to serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

classmethod loads(s: str, **json_kwargs: Any) Any[source]

Convenience method that uses the json module to deserialize (load) from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The loaded json data.

Return type:

Any

classmethod serialize(data: object, **json_kwargs: Any) str[source]

Class method that encodes and serializes data to a JSON string.

Parameters:
  • data (Any) – The data to encode and serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

scholar_flux.utils.helpers module

The scholar_flux.utils.helpers module contains several helper functions to aid in common data manipulation scenarios.

This module includes helpers for character conversion, date-time parsing and formatting, and nesting and unnesting common Python data structures.

scholar_flux.utils.helpers.as_list_1d(value: Any) list[source]

Nests a value into a single element list if the value is not already a list.

Parameters:

value (Any) – The value to add to a list if it is not already a list

Returns:

If already a list, the value is returned as is. Otherwise, the value is nested in a list. Caveat: if the value is None, an empty list is returned

Return type:

list
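The documented behavior, including the None caveat, can be sketched as a stand-in (illustrative only, not the packaged implementation):

```python
# Illustrative stand-in for as_list_1d's documented behavior:
def as_list_1d(value):
    if value is None:
        return []       # documented caveat: None yields an empty list
    if isinstance(value, list):
        return value    # lists are returned as-is
    return [value]      # anything else is nested in a single-element list

print(as_list_1d("abc"), as_list_1d([1, 2]), as_list_1d(None))
```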

scholar_flux.utils.helpers.as_str(value: object, *, encoding: str | None = 'utf-8', errors: str | None = 'strict') str[source]

Converts an object into a string type, accounting for re.Pattern/bytes semantics when relevant.

Parameters:
  • value (object) – The value to attempt to convert into a string.

  • encoding (Optional[str]) – An optional value used to decode byte strings. Not relevant for data of other types.

  • errors (Optional[str]) – An optional value for decoding errors with non-Unicode bytes characters. Not relevant for non-byte strings.

Returns:

The value converted into a string.

Return type:

str

scholar_flux.utils.helpers.as_tuple(obj: object) tuple[source]

Converts objects into tuples when possible and nests objects within a tuple otherwise.

This function is useful as a preprocessing step for function calls that require tuples instead of lists, NoneTypes, and other data types.

Parameters:

obj (object) – The object to nest as a tuple

Returns:

The original object converted into a tuple

Return type:

tuple

scholar_flux.utils.helpers.build_iso_date(year: str | None, month: str | None = '', day: str | None = '') str | None[source]

Build ISO-formatted date string with graduated precision.

Constructs date strings in ISO format with appropriate precision based on available components. Returns full date (YYYY-MM-DD) if all components present, year-month (YYYY-MM) if only year and month available, or year only (YYYY).

Parameters:
  • year (Optional[str]) – Year as string (required for output)

  • month (Optional[str]) – Month as string (name or number), optional

  • day (Optional[str]) – Day as string, optional

Returns:

ISO date string with graduated precision (YYYY-MM-DD, YYYY-MM, or YYYY), or None if year is empty/None

Return type:

Optional[str]

Examples

>>> build_iso_date('2025', '12', '19')
# OUTPUT: '2025-12-19'
>>> build_iso_date('2025', 'Dec')
# OUTPUT: '2025-12'
>>> build_iso_date('2025', 'Dec', '19')
# OUTPUT: '2025-12-19'
>>> build_iso_date('2025')
# OUTPUT: '2025'
>>> build_iso_date('')
# OUTPUT: None
scholar_flux.utils.helpers.coerce_bool(value: object, true_values: tuple[str, ...] = ('T', 'true', 'yes', '1'), false_values: tuple[str, ...] = ('F', 'false', 'no', '0')) bool | None[source]

Attempts to convert a value to a boolean value, returning None if the conversion fails.

Parameters:
  • value (object) – The value to attempt to convert into a boolean.

  • true_values (tuple[str, ...]) – Values to be mapped to True when matched by the input value.

  • false_values (tuple[str, ...]) – Values to be mapped to False when matched by the input value.

Returns:

The value converted into a boolean if possible, otherwise None.

Return type:

Optional[bool]

Examples

>>> from scholar_flux.utils.helpers import coerce_bool
>>> coerce_bool("TRUE")
True
>>> coerce_bool(1)
True
>>> coerce_bool(True, true_values=())
True
>>> coerce_bool("maybe", true_values=("Maybe",))
True
>>> coerce_bool("NO")
False
>>> coerce_bool("0")
False
>>> print(coerce_bool("Unknown?"))
None
>>> print(coerce_bool("0", false_values=None))
None
scholar_flux.utils.helpers.coerce_bytes(value: object, encoding: str | None = 'utf-8') bytes | None[source]

Attempts to convert a value into bytes, if possible, returning None if conversion fails.

Parameters:
  • value (object) – The value to attempt to convert into a bytes object.

  • encoding (Optional[str]) – An optional value used to encode strings as bytes. Not relevant for other data types.

Returns:

The value converted into a bytes object if possible, otherwise None

Return type:

Optional[bytes]

scholar_flux.utils.helpers.coerce_flattened_str(value: object, delimiter: str = '; ') str | None[source]

Coerces strings or sequences of strings into a single, flattened string.

This function handles the common pattern of normalizing journal names, keywords, or other metadata that may arrive as either a string or list of strings. Sequences of strings are handled by joining them, and if a sequence cannot be converted to a sequence of strings, None is returned instead.

Parameters:
  • value (object) – A string, bytes, list/tuple of strings, or None

  • delimiter (str) – The string used to join list elements with (default: “; “)

Returns:

A single string (coerced or joined), or None if conversion fails

Return type:

Optional[str]
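The normalization pattern this function is described as handling can be sketched as follows (an illustrative stand-in, not the packaged implementation):

```python
# Sketch of the coerce_flattened_str pattern: strings pass through,
# sequences of strings are joined with the delimiter, everything else
# fails conversion and yields None.
def flatten_str(value, delimiter="; "):
    if isinstance(value, str):
        return value
    if isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value):
        return delimiter.join(value)
    return None

print(flatten_str(["machine learning", "nlp"]))
print(flatten_str({"not": "a sequence"}))
```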

scholar_flux.utils.helpers.coerce_int(value: object) int | None[source]

Attempts to convert a value to an integer, returning None if the conversion fails.

Parameters:

value (object) – The value to attempt to convert into an int.

Returns:

The value converted into an integer if possible, otherwise None.

Return type:

Optional[int]

scholar_flux.utils.helpers.coerce_json_str(data: object) str | None[source]

Attempts to convert a serializable list or mapping into a JSON string.

This method uses the json.dumps() function to serialize a JSON sequence or mapping, returning None if conversion fails.

Parameters:

data (object) – The data to coerce into a JSON string. JSON string conversion and validation are attempted for Mapping, Sequence, str, and bytes data types. For all other data types, None is returned.

Returns:

The data coerced into a JSON string if possible, otherwise None.

Return type:

Optional[str]

Note

If the data is a string or bytes object, this method verifies that, when loaded with json.loads, the string is deserialized as a mapping or list. Otherwise, None is returned.

Examples

>>> from scholar_flux.utils.helpers import coerce_json_str
>>> coerce_json_str('{"a": 1, "b": 2}')  # already a JSON string, returned as is
# OUTPUT: '{"a": 1, "b": 2}'
>>> coerce_json_str({"a": 1, "b": 2})  # a dictionary, serialized into a JSON string
# OUTPUT: '{"a": 1, "b": 2}'
scholar_flux.utils.helpers.coerce_numeric(value: object) float | None[source]

Attempts to convert a value to a float, returning None if the conversion fails.

Parameters:

value (object) – The value to attempt to convert into a decimal value.

Returns:

The value converted into a float if possible, otherwise None.

Return type:

Optional[float]

Note

Conversion treats booleans as integers and converts them when observed. To avoid this, check for booleans explicitly (e.g., with isinstance(value, bool)) before converting.

scholar_flux.utils.helpers.coerce_str(value: object, *, encoding: str | None = 'utf-8', errors: str | None = 'strict') str | None[source]

Attempts to convert a value into a string, if possible, returning None if conversion fails.

Parameters:
  • value (object) – The value to attempt to convert into a string.

  • encoding (Optional[str]) – An optional value used to decode byte strings. Not relevant for data of other types.

  • errors (Optional[str]) – An optional value for decoding errors with non-Unicode bytes characters. Not relevant for non-byte strings.

Returns:

The value converted into a string if possible, otherwise None.

Return type:

Optional[str]

scholar_flux.utils.helpers.extract_year(value: Any, format: str = '%Y-%m-%d') int | None[source]

Extract a 4-digit year from a date string.

Attempts to parse the value using the specified format, then falls back to regex extraction.

Parameters:
  • value (Any) – A value (generally a string or integer) potentially containing a year.

  • format (str) – The expected date format (strptime format string). Defaults to “%Y-%m-%d”.

Returns:

The extracted year as an integer, or None if extraction fails.

Return type:

Optional[int]

Examples

>>> from datetime import date
>>> from scholar_flux.utils.helpers import extract_year
>>> extract_year(date(2027, 5, 5))
# OUTPUT: 2027
>>> extract_year("2026-03-01")
# OUTPUT: 2026
>>> extract_year("03/15/2024", format="%m/%d/%Y")
# OUTPUT: 2024
>>> extract_year("2023")
# OUTPUT: 2023
>>> extract_year(None)
# OUTPUT: None
scholar_flux.utils.helpers.filter_record_key_prefixes(record: Mapping[str, Any] | Mapping[str | int, Any], prefix: str, invert: bool = False) RecordType[source]

Removes or retains keys from dictionaries and mappings beginning with a specific string prefix.

Parameters:
  • record (Mapping[str, Any] | Mapping[str | int, Any]) – A dictionary record to filter keys containing specific prefixes

  • prefix (str) – The prefix to filter from the dictionary. Prefixes that are not strings will be coerced into strings internally, but only string-typed fields will be matched.

  • invert (bool) – If False, dictionary keys beginning with the prefix are removed (default behavior). If True, fields beginning with the prefix are retained instead.

Returns:

The filtered record after retaining (invert=True) or removing (invert=False) string prefixes.

Return type:

RecordType
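The filtering behavior described above can be sketched with a minimal stand-in; the `filter_prefixes` name below is illustrative and not the package's actual implementation:

```python
# Illustrative stand-in for filter_record_key_prefixes: remove (default) or
# retain (invert=True) string-typed keys that begin with a given prefix.
# Non-string keys are never matched, so they survive the default filter.
def filter_prefixes(record, prefix, invert=False):
    matches = lambda k: isinstance(k, str) and k.startswith(prefix)
    if invert:
        return {k: v for k, v in record.items() if matches(k)}
    return {k: v for k, v in record.items() if not matches(k)}

record = {"_internal_id": 1, "title": "Sample", "_internal_hash": "abc", 42: "kept"}
assert filter_prefixes(record, "_internal") == {"title": "Sample", 42: "kept"}
assert filter_prefixes(record, "_internal", invert=True) == {"_internal_id": 1, "_internal_hash": "abc"}
```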

scholar_flux.utils.helpers.format_iso_timestamp(timestamp: datetime) str[source]

Formats an iso timestamp string in UTC with millisecond precision.

Parameters:

timestamp (datetime) – The datetime object to format.

Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123+00:00”)

Return type:

str

scholar_flux.utils.helpers.generate_iso_timestamp() str[source]

Generates and formats an ISO 8601 timestamp string in UTC with millisecond precision for reliable round-trip conversion.

Example usage:
>>> from scholar_flux.utils import generate_iso_timestamp, parse_iso_timestamp, format_iso_timestamp
>>> timestamp = generate_iso_timestamp()
>>> parsed_timestamp = parse_iso_timestamp(timestamp)
>>> assert parsed_timestamp is not None and format_iso_timestamp(parsed_timestamp) == timestamp
Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123Z”)

Return type:

str

scholar_flux.utils.helpers.generate_response_hash(response: Response | ResponseProtocol) str[source]

Generates a response hash from a response or response-like object that implements the ResponseProtocol.

Parameters:

response (requests.Response | ResponseProtocol) – An http response or response-like object.

Returns:

A unique identifier for the response.

Return type:

str

scholar_flux.utils.helpers.get_first_available_key(data: Mapping[H | str, Any], keys: Sequence[H | str], default: T | None = None, case_sensitive: bool = True) Any | T[source]

Extracts the first key from a sequence of keys that can be found within a dictionary.

Parameters:
  • data (Mapping[H | str, Any]) – A dictionary or dictionary-like object to extract an existing data element from.

  • keys (Sequence[H | str]) – A sequence or set of keys used for the extraction of the first available data element.

  • default (T) – The value to use if none of the checked keys are available in the dictionary.

  • case_sensitive (bool) – Defines whether data element retrieval should rely on case sensitivity (Default=True).

Returns:

The value associated with the first available dictionary key

Return type:

Any
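The lookup semantics can be sketched with a minimal stand-in (`first_available` below mirrors the documented behavior but is illustrative, not the package's code):

```python
# Illustrative stand-in for get_first_available_key: return the value of the
# first key (in sequence order) present in the mapping, else the default.
def first_available(data, keys, default=None, case_sensitive=True):
    if not case_sensitive:
        # Fold both the mapping keys and the lookup keys to lowercase strings.
        lowered = {str(k).lower(): v for k, v in data.items()}
        for key in keys:
            if str(key).lower() in lowered:
                return lowered[str(key).lower()]
        return default
    for key in keys:
        if key in data:
            return data[key]
    return default

record = {"DOI": "10.1234/x", "title": "Sample"}
assert first_available(record, ["doi", "DOI"]) == "10.1234/x"
assert first_available(record, ["doi"], case_sensitive=False) == "10.1234/x"
assert first_available(record, ["issn"], default="n/a") == "n/a"
```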

scholar_flux.utils.helpers.get_nested_data(json: list | Mapping | None, path: str | list, flatten_nested_dictionaries: bool = True, verbose: bool = True) Any[source]

Recursively retrieves data from a nested dictionary using a sequence of keys.

Parameters:
  • json (list | Mapping | None) – The parsed json structure from which to extract data.

  • path (str | list) – A list of keys representing the path to the desired data within json.

  • flatten_nested_dictionaries (bool) – Determines whether single-element lists containing dictionary data should be extracted.

  • verbose (bool) – Determines whether logging should occur when an error is encountered.

Returns:

The value retrieved from the nested dictionary by following the path, or None if any key in the path is not found or leads to a None value prematurely.

Return type:

Any
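The core traversal can be sketched with a minimal stand-in (`nested_get` below omits the flattening and logging options and is illustrative only):

```python
# Illustrative stand-in for get_nested_data: walk a parsed JSON structure
# along a sequence of keys/indices, returning None as soon as the path breaks.
def nested_get(json, path):
    current = json
    for key in path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, list) and isinstance(key, int) and key < len(current):
            current = current[key]
        else:
            return None
        if current is None:
            return None
    return current

payload = {"response": {"records": [{"title": "Sample Study"}]}}
assert nested_get(payload, ["response", "records", 0, "title"]) == "Sample Study"
assert nested_get(payload, ["response", "missing", "title"]) is None
```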

scholar_flux.utils.helpers.get_values(obj: Iterable) Iterable[source]

Automatically retrieves .values() from dictionaries when available and returns the original input otherwise.

Parameters:

obj (Iterable) – An object to get the values from.

Returns:

An iterable created from obj.values() if the object is a dictionary and the original object otherwise.

If the object is empty or is not a nested object, an empty list is returned.

Return type:

Iterable

scholar_flux.utils.helpers.infer_text_pattern_search(text: str, pattern_dict: Mapping[str | Pattern, V | None] | Mapping[str, V | None] | Mapping[Pattern, V], default: None = None, *, regex: bool = True, flags: int | RegexFlag = 0) V | None

Infers a category based on a text pattern search. If a value match can’t be inferred, a default is returned.

Parameters:
  • text (str) – The text to match. If None or missing, the default is returned instead.

  • pattern_dict (Mapping[str | re.Pattern, Optional[V]] | Mapping[str, Optional[V]] | Mapping[re.Pattern, V]) – A dictionary that maps patterns to potential output values provided that the pattern matches.

  • default (Optional[D]) – The value to return if a match cannot be inferred from text pattern matching.

  • regex (bool) – Whether to interpret patterns as regex (default True).

  • flags (int | re.RegexFlag) – Optional flags to pass to re.search when available. (default flags=0 for no flags)

Returns:

The inferred category when a match is found based on a dictionary mapping, and the default otherwise.

Return type:

Optional[V | D]

Note

If the provided value is not a mapping or if the provided value cannot be coerced into a string, the default is returned instead.
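The pattern-to-value mapping can be sketched with a minimal stand-in (`infer_category` below is illustrative and covers only the regex path, not the package's full signature):

```python
import re

# Illustrative stand-in for infer_text_pattern_search: return the value mapped
# to the first pattern that matches the text, or the default otherwise.
def infer_category(text, pattern_dict, default=None, flags=0):
    if not isinstance(text, str):
        return default
    for pattern, value in pattern_dict.items():
        if re.search(pattern, text, flags):
            return value
    return default

patterns = {r"\bredis\b": "redis", r"\bsqlite\b": "sqlite", r"\bmemory\b": "memory"}
assert infer_category("cache backend: redis://localhost", patterns) == "redis"
assert infer_category("unknown backend", patterns, default="memory") == "memory"
assert infer_category(None, patterns) is None
```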

scholar_flux.utils.helpers.is_nested(obj: Any) bool[source]

Indicates whether the current value is a nested object. Useful for recursive iterations such as JSON record data.

Parameters:

obj (Any) – The object to check, typically a parsed JSON structure.

Returns:

True if the object is nested, otherwise False.

Return type:

bool

scholar_flux.utils.helpers.is_nested_json(obj: Any) bool[source]

Check if a value is a nested, parsed JSON structure.

Parameters:

obj (Any) – The object to check.

Returns:

True if the value is a nested JSON structure, False otherwise.

Return type:

bool

scholar_flux.utils.helpers.nested_key_exists(obj: object, key_to_find: str, regex: bool = False) bool[source]

Recursively checks if a specified key is present anywhere in a given JSON-like dictionary or list structure.

Parameters:
  • obj (object) – The dictionary or list to search.

  • key_to_find (str) – The key to search for.

  • regex (bool) – Whether or not to search with regular expressions.

Returns:

True if the key is present, False otherwise.

Return type:

bool
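The recursive search can be sketched with a minimal stand-in (`key_exists` below skips the regex option and is illustrative only):

```python
# Illustrative stand-in for nested_key_exists: recursively search dicts and
# lists for a key at any depth of a JSON-like structure.
def key_exists(obj, key_to_find):
    if isinstance(obj, dict):
        if key_to_find in obj:
            return True
        return any(key_exists(value, key_to_find) for value in obj.values())
    if isinstance(obj, list):
        return any(key_exists(item, key_to_find) for item in obj)
    return False

payload = {"results": [{"meta": {"doi": "10.1234/x"}}]}
assert key_exists(payload, "doi") is True
assert key_exists(payload, "issn") is False
```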

scholar_flux.utils.helpers.parse_iso_timestamp(timestamp_str: str) datetime | None[source]

Attempts to convert an ISO 8601 timestamp string back to a datetime object.

Parameters:

timestamp_str (str) – ISO 8601 formatted timestamp string

Returns:

datetime object if parsing succeeds, None otherwise

Return type:

Optional[datetime]
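The round trip between format_iso_timestamp and parse_iso_timestamp can be approximated with the standard library alone; the sketch below uses only stdlib calls and is not the package's implementation:

```python
from datetime import datetime, timezone

# Stdlib sketch of the documented round trip: format a UTC timestamp with
# millisecond precision, then parse it back. datetime.fromisoformat handles
# the "+00:00" offset form directly on Python 3.7+.
def to_iso_ms(dt):
    return dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")

moment = datetime(2024, 3, 15, 14, 30, 0, 123000, tzinfo=timezone.utc)
stamp = to_iso_ms(moment)
assert stamp == "2024-03-15T14:30:00.123+00:00"
assert datetime.fromisoformat(stamp) == moment
```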

scholar_flux.utils.helpers.quote_if_string(value: object) object[source]

Attempt to quote string values to distinguish them from object text in class representations.

Parameters:

value (object) – a value that is quoted only if it is a string

Returns:

Returns a quoted string if successful. Otherwise returns the value unchanged

Return type:

Any

scholar_flux.utils.helpers.quote_numeric(value: object) str[source]

Attempts to quote a value as a numeric string and returns the quoted value if successful; otherwise, raises an error.

Parameters:

value (object) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful.

Return type:

str

Raises:

ValueError – If the value cannot be quoted

scholar_flux.utils.helpers.strip_html_tags(text: str, parser: Literal['html.parser', 'lxml'] = 'html.parser', verbose: bool = True, **kwargs: Any) str[source]

Extracts the raw text from HTML while removing html elements such as paragraph tags and breaks.

Parameters:
  • text (str) – The text to extract and remove html tags and elements from

  • parser (Literal['html.parser', 'lxml']) – The parser to use for the removal of html elements

  • verbose (bool) – Indicates whether issues regarding missing dependencies and incorrect types should be logged.

  • **kwargs – Additional keyword arguments to be passed directly to BeautifulSoup.get_text(). Possible keywords include:
    - separator (str): String inserted between elements (default: '')
    - strip (bool): Whether to strip whitespace from element text (default: False)

Returns:

The string with HTML tags removed if the input is a string, and the original input otherwise.

Return type:

str

Examples

>>> strip_html_tags("<p>Hello</p><p>World</p>")
'HelloWorld'
>>> strip_html_tags("<p>Hello</p><p>World</p>", separator=" ")
'Hello World'
>>> strip_html_tags("<p>  Whitespace  </p>", strip=True)
'Whitespace'
scholar_flux.utils.helpers.try_bytes(value: bytes) bytes[source]
scholar_flux.utils.helpers.try_bytes(value: None) None
scholar_flux.utils.helpers.try_bytes(value: T) bytes | T

Attempts to convert a value to a bytes object, returning the original value if the conversion fails.

Parameters:

value (object) – The value to attempt to coerce into a bytes object.

Returns:

The converted bytes object if successful, otherwise the original value.

Return type:

bytes | object

scholar_flux.utils.helpers.try_call(func: Callable, args: tuple | None = None, kwargs: dict | None = None, suppress: tuple = (), logger: Logger | None = None, log_level: int = 30, default: Any | None = None) Any | None[source]

A helper function for safely calling another function, suppressing any raised exception that is contained within the tuple of errors to suppress.

Parameters:
  • func (Callable) – The function to call

  • args (Optional[tuple]) – A tuple of positional arguments to add to the function call

  • kwargs (Optional[dict]) – A dictionary of keyword arguments to add to the function call

  • suppress (tuple) – A tuple of exceptions to handle and suppress if they occur

  • logger (Optional[logging.Logger]) – The logger to use for warning generation

  • log_level (int) – The logging level to use when logging suppressed exceptions.

  • default (Optional[Any]) – The value to return in the event that an error occurs and is suppressed

Returns:

When successful, the return type of the callable is also returned without modification. Upon suppressing an exception, the function will generate a warning and return None by default unless the default was set.

Return type:

Optional[Any]
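The suppression logic can be sketched with a minimal stand-in (`safe_call` below mirrors the documented parameters but is illustrative, not the package's code):

```python
import logging

# Illustrative stand-in for try_call: invoke a callable, suppressing only the
# listed exception types and returning a default (with a log entry) when one
# is caught. An empty suppress tuple lets every exception propagate.
def safe_call(func, args=(), kwargs=None, suppress=(), default=None,
              logger=None, log_level=logging.WARNING):
    try:
        return func(*args, **(kwargs or {}))
    except suppress as exc:
        (logger or logging.getLogger(__name__)).log(log_level, "Suppressed: %r", exc)
        return default

assert safe_call(int, args=("42",)) == 42
assert safe_call(int, args=("not a number",), suppress=(ValueError,), default=-1) == -1
```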

scholar_flux.utils.helpers.try_compile(s: P, *, prefix: str | None = None, suffix: str | None = None, flags: int | RegexFlag = 0, escape: bool = False, verbose: bool = False) P[source]
scholar_flux.utils.helpers.try_compile(s: str | None, *, prefix: str | None = None, suffix: str | None = None, flags: int | RegexFlag = 0, escape: bool = False, verbose: bool = False) Pattern | None

Attempts to compile an object as a pattern when possible, returning None when compilation fails.

Parameters:
  • s (Optional[str | re.Pattern]) – The string to compile as a pattern

  • prefix (Optional[str]) – A prefix to add to the beginning of a string when a pattern is not directly provided

  • suffix (Optional[str]) – A suffix to add to the end of a string when a pattern is not directly provided

  • flags (int | re.RegexFlag) – Flags to use when compiling a pattern. By default, no flags are applied (flags=0).

  • escape (bool) – Indicates whether regular expression symbols should be escaped to interpret them literally.

  • verbose (bool)

Returns:

A regular expression pattern when successful, otherwise None

Return type:

Optional[re.Pattern]

Note

When a pattern is received, it is returned as is. Only valid strings are transformed into patterns containing a prefix when provided.

scholar_flux.utils.helpers.try_dict(value: dict) dict[source]
scholar_flux.utils.helpers.try_dict(value: list | tuple) dict
scholar_flux.utils.helpers.try_dict(value: object) dict | None

Attempts to convert a value into a dictionary, if possible.

If it is not possible to convert the value into a dictionary, the function will return None.

Parameters:

value (Any) – A value to attempt to convert into a dict.

Returns:

The value converted into a dictionary if possible, otherwise None

Return type:

Optional[dict]

scholar_flux.utils.helpers.try_int(value: int) int[source]
scholar_flux.utils.helpers.try_int(value: None) None
scholar_flux.utils.helpers.try_int(value: T) int | T

Attempts to convert a value to an integer, returning the original value if the conversion fails.

Parameters:

value (object) – the value to attempt to coerce into an integer

Returns:

The converted integer if successful, otherwise the original value.

Return type:

int | object

scholar_flux.utils.helpers.try_none(value: None) None[source]
scholar_flux.utils.helpers.try_none(value: T) None | T

Converts empty strings, ‘none’, and empty data containers into None. Otherwise, the original value is returned.

Parameters:
  • value (object) – The value to convert into None when possible

  • none_indicators (tuple[Any, ...]) – Tuple of values that should be treated as None indicators.

Returns:

The original value if not converted, and None otherwise

Return type:

object | None
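The normalization can be sketched with a minimal stand-in (`as_none` below hard-codes a small set of None indicators for illustration; the package accepts a configurable tuple):

```python
# Illustrative stand-in for try_none: map empty strings, the string "none"
# (case-insensitive), and empty containers to None; return anything else as-is.
def as_none(value):
    if isinstance(value, str) and value.strip().lower() in ("", "none"):
        return None
    if isinstance(value, (list, tuple, dict, set)) and not value:
        return None
    return value

assert as_none("") is None
assert as_none("None") is None
assert as_none([]) is None
assert as_none("redis") == "redis"
assert as_none(0) == 0  # falsy but not an emptiness indicator
```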

scholar_flux.utils.helpers.try_pop(s: Set[H], item: H, default: H | None = None) H | None[source]

Attempt to remove an item from a set and return the item if it exists.

Parameters:
  • s (Set[H]) – The set to remove the item from.

  • item (H) – The item to try to remove from the set

  • default (Optional[H]) – The object to return as a default if item is not found

Returns:

item if the value is in the set, otherwise returns the specified default

Return type:

H | None
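The behavior can be sketched with a minimal stand-in (`pop_item` below is illustrative, not the package's implementation):

```python
# Illustrative stand-in for try_pop: remove and return an item from a set if
# present; otherwise return the default and leave the set unchanged.
def pop_item(s, item, default=None):
    if item in s:
        s.discard(item)
        return item
    return default

backends = {"sqlite", "redis", "memory"}
assert pop_item(backends, "redis") == "redis"
assert "redis" not in backends
assert pop_item(backends, "mongodb", default="memory") == "memory"
```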

scholar_flux.utils.helpers.try_quote_numeric(value: object) str | None[source]

Attempt to quote numeric values to distinguish them from string values and integers.

Parameters:

value (object) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful. Otherwise None

Return type:

Optional[str]

scholar_flux.utils.helpers.try_str(value: str) str[source]
scholar_flux.utils.helpers.try_str(value: None) None
scholar_flux.utils.helpers.try_str(value: T) str | T

Attempts to convert a value to a string, returning the original value if the conversion fails.

Parameters:

value (object) – the value to attempt to coerce into an string

Returns:

The converted string if successful, otherwise the original value.

Return type:

str | object

scholar_flux.utils.helpers.unlist_1d(current_data: tuple | list | Any) Any[source]

Retrieves an element from a list/tuple if it contains only a single element. Otherwise, it will return the element as is. Useful for extracting text from a single element list/tuple.

Parameters:

current_data (tuple | list | Any) – An object to potentially unlist if it contains a single element.

Returns:

The unlisted object if it comes from a single-element list/tuple, otherwise the input unchanged.

Return type:

Any
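The unwrapping can be sketched with a minimal stand-in (`unlist` below is illustrative only):

```python
# Illustrative stand-in for unlist_1d: unwrap single-element lists/tuples,
# returning anything else unchanged.
def unlist(value):
    if isinstance(value, (list, tuple)) and len(value) == 1:
        return value[0]
    return value

assert unlist(["only text"]) == "only text"
assert unlist(("a", "b")) == ("a", "b")
assert unlist("already flat") == "already flat"
```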

scholar_flux.utils.initializer module

The scholar_flux.utils.initializer.py module is used within the scholar_flux package to kickstart the initialization of the scholar_flux package on import.

Several key steps are performed via the use of the initializer:

1) Environment variables are imported using the ConfigLoader.
2) The Logger is subsequently set up for the scholar_flux API package.
3) The package-level masker is set up to enable sensitive data to be redacted from logs.

scholar_flux.utils.initializer.initialize_package(log: bool = True, env_path: str | Path | None = None, config_params: dict[str, Any] | None = None, logging_params: dict[str, Any] | None = None) tuple[dict[str, Any], Logger, SensitiveDataMasker][source]

Function used for orchestrating the initialization of the config, log settings, and masking for scholar_flux.

This function imports a ‘.env’ configuration file at the specified location if it exists. Otherwise, scholar_flux will look for a .env file in the default locations if available. If no .env configuration file is found, then only package defaults and available OS environment variables are used.

This function can also be used for dynamic re-initialization of configuration parameters and logging. The config_params are sent as keyword arguments to the scholar_flux.utils.ConfigSettings.load_config method, and logging_params are used as keyword arguments to the scholar_flux.utils.setup_logging method to set up logging settings and handlers.

Parameters:
  • log (bool) – A True/False flag that determines whether to enable or disable logging.

  • env_path (Optional[str | Path]) – The file path indicating from where to load the environment variables, if provided.

  • config_params (Optional[Dict]) – A dictionary allowing for the specification of configuration parameters when attempting to load environment variables from a config. Useful for loading API keys from environment variables for later use.

  • logging_params (Optional[Dict]) – A dictionary allowing users to specify options for package-level logging with custom logic. Log settings are loaded from the OS environment or an .env file when available, with precedence given to .env files. These settings, when loaded, override the default ScholarFlux logging configuration. Otherwise, ScholarFlux uses a log-level of WARNING by default.

Returns:

A tuple containing the configuration dictionary and the initialized logger.

Return type:

Tuple[Dict[str, Any], logging.Logger, scholar_flux.security.SensitiveDataMasker]

Raises:

PackageInitializationError – If there are issues with loading the configuration or initializing the logger.

scholar_flux.utils.json_file_utils module

The scholar_flux.utils.json_file_utils module implements a simple JsonFileUtils class that contains a basic set of convenience classes for interacting with the file system and JSON files.

class scholar_flux.utils.json_file_utils.JsonFileUtils[source]

Bases: object

Helper class that implements several basic file utility class methods for easily interacting with the file system. This class also contains utility methods used to parse, load, and dump JSON files for convenience.

Example

>>> from scholar_flux.utils.json_file_utils import JsonFileUtils
>>> from pathlib import Path
>>> original_data = {"key": "value"}
>>> json_file = "/tmp/sample"

# the JSON data should be serializable:
>>> assert JsonFileUtils.is_jsonable(original_data)
# writing the json file
>>> JsonFileUtils.save_as(original_data, json_file)
# the data should now exist at the '/tmp/sample.json' path
>>> assert Path(json_file).with_suffix('.json').exists()
# verifying that the dumped data can be loaded as intended:
>>> data = JsonFileUtils.load_data(json_file)
>>> assert data is not None and original_data == data

DEFAULT_EXT = 'json'
classmethod append_to_file(content: str | list[str], filepath: str | Path, ext: str | None = None) None[source]

Helper method used to append content to a file in a content-type aware manner.

Parameters:
  • content (Union[str, list[str]]) – The content to append to the file.

  • filepath (Union[str, Path]) – The file path to write to

  • ext (Optional[str]) – An optional extension to add to the file path

classmethod get_filepath(filepath: str | Path, ext: str | None = None) str[source]

Prepare the filepath using the filepath and extension if provided. Assumes a Unix filesystem structure for edge cases.

Parameters:
  • filepath (Union[str, Path]) – The file path to read from

  • ext (Optional[str]) – An optional extension to add to the file path. If the extension is None and the file path does not already have one, the default 'json' extension is used.

static is_jsonable(obj: Any) bool[source]

Verifies whether the object can be serialized as a json object.

Parameters:

obj (Any) – The object to check

Returns:

True if the object is jsonable (serializable), otherwise False

Return type:

bool

classmethod load_data(filepath: str | Path, ext: str | None = None) dict | list | str[source]

Attempts to load data from a filepath as a dictionary/list. If unsuccessful, the file’s contents are instead loaded as a string.

Parameters:

filepath (Union[str, Path]) – The file path to read the data from

Returns:

A dictionary or list if the data can be successfully loaded with json, and a string if loading with JSON is not possible.

Return type:

Union[dict, list, str]

classmethod read_lines(filepath: str | Path, ext: str | None = None) Generator[str, None, None][source]

Iteratively reads lines from a text file.

Parameters:
  • filepath (Union[str, Path]) – The file path to read the data from

  • ext (Optional[str]) – An optional extension to add to the file path

Returns:

The lines read from a text file

Return type:

Generator[str, None, None]

To retrieve a list of data instead of a generator, pass the result to list:
>>> from scholar_flux.utils import JsonFileUtils
>>> line_gen = JsonFileUtils.read_lines('pyproject.toml')
>>> assert isinstance(list(line_gen), list)
classmethod save_as(obj: list | dict | str | float | int, filepath: str | Path, ext: str | None = None, dump: bool = True) None[source]

Save an object in text format with the specified extension (if provided).

Parameters:
  • obj (Union[list, dict, str, float, int]) – A value to save into a file

  • filepath (Union[str, Path]) – The file path to write the object to

  • ext (Optional[str]) – An optional extension to add to the file path

  • dump (bool) – If True, the object is serialized using json.dumps. Otherwise the str function is used

scholar_flux.utils.json_processing_utils module

Helper module used to process recursive JSON data received from APIs of an unknown type and structure.

Classes:
PathUtils:

Utility class used to prepare path strings and lists of path components consistently for processing.

KeyDiscoverer:

Helper class for identifying JSON paths and terminal keys containing nested data elements.

KeyFilter:

Helper class used to identify and filter nested dictionaries based on path length and pattern matching.

RecursiveJsonProcessor:

Front-end facing utility function used by the scholar_flux.data.RecursiveDataProcessor to process, filter, and flatten JSON formatted data.

JsonRecordData:

Helper class used as a container to hold extracted path/data components for further processing.

JsonNormalizer:

Helper class used by the RecursiveJsonProcessor to flatten the inputted JSON record into a non-nested dictionary

Example Use:
>>> from scholar_flux.utils import RecursiveJsonProcessor
>>> from pprint import pp
>>> data = {
        "authors": {"principle_investigator": "Dr. Smith", "assistant": "Jane Doe"},
        "doi": "10.1234/example.doi",
        "title": "Sample Study",
        "abstract": ["This is a sample abstract.", "keywords: 'sample', 'abstract'"],
        "genre": {"subspecialty": "Neuroscience"},
        "journal": {"topic": "Sleep Research"},
    }
# joins nested list elements using the specified delimiter (here, three spaces) and retains the full paths leading to each value
>>> processor = RecursiveJsonProcessor(object_delimiter = '   ', use_full_path = True)
# processes and flattens the JSON dict using the defined helper classes under the hood
>>> result = processor.process_and_flatten(data)
# prints the result in a format that is easier to view from the CLI
>>> pp(result)
# OUTPUT: {'authors.principle_investigator': 'Dr. Smith',
           'authors.assistant': 'Jane Doe',
           'doi': '10.1234/example.doi',
           'title': 'Sample Study',
           'abstract': "This is a sample abstract.   keywords: 'sample', 'abstract'",
           'genre.subspecialty': 'Neuroscience',
           'journal.topic': 'Sleep Research'}
class scholar_flux.utils.json_processing_utils.JsonNormalizer(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Bases: object

Helper class that flattens and normalizes the retrieved list of JsonRecordData into a single flattened dictionary.

__init__(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Initialize the JsonNormalizer with extracted JSON data and a delimiter.

Parameters:
  • json_record_data_list (List[JsonRecordData]) – The list of extracted JSON data.

  • use_full_path (bool) – Indicates whether to use the full nested json path or the smallest unique path available

create_unique_key(current_group: List[str], current_key_str: str, unique_mappings_dict: Dict[str, List[str]]) str[source]

Create a unique key for the current data entry if a simple key is not sufficient.

Parameters:
  • current_group (List[str]) – The list of keys in the current path.

  • current_key_str (str) – The string representation of the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

get_unique_key(current_key_str: str, current_group: List[str], unique_mappings_dict: Dict[str, List[str]]) str[source]

Generate a unique key for the current data entry.

Parameters:
  • current_key_str (str) – The string representation of the current path.

  • current_group (List[str]) – The list of keys in the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

normalize_extracted() Dict[str, List[Any]][source]

Normalize the extracted JSON data into a flattened dictionary.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Dict[str, List[Any]]

class scholar_flux.utils.json_processing_utils.JsonRecordData(path: List[str | int], data: Dict[str, Any])[source]

Bases: object

Helper class used as a container to record the paths, data, and names associated with each terminal path.

This class uses its structural representation to create a hash that allows it to be stored within a set.

Parameters:
  • path (list[str | int]) – The path associated with the terminal data point where nested terminal values can be found

  • data (dict[str, Any]) – The nested terminal value at the end of a path

__init__(path: List[str | int], data: Dict[str, Any]) None
data: Dict[str, Any]
path: List[str | int]
structure() str[source]

Helper method used to identify duplicate paths before addition.

class scholar_flux.utils.json_processing_utils.KeyDiscoverer(records: List[Dict] | None = None)[source]

Bases: object

Helper class used to discover terminal keys containing data within nested JSON data structures and identify the paths used to arrive at each key.

_discovered_keys

Defines the complete list of all keys that can be found in a dictionary and the path that needs to be traversed to arrive at that key

Type:

dict[str, list]

_terminal_paths

Creates a dictionary that indicates whether the currently added path is terminal within the JSON data structure

Type:

dict[str, bool]

__init__(records: List[Dict] | None = None)[source]

Initializes the KeyDiscoverer and identifies terminal key/path pairs within the JSON data structure.

filter_keys(prefix: str | None = None, min_length: int | None = None, substring: str | None = None) Dict[str, List[str]][source]

Helper method that filters a range of keys based on the specified criteria.

get_all_keys() Dict[str, List[str]][source]

Returns all discovered keys and their paths.

get_keys_with_path(key: str) List[str][source]

Returns all paths associated with a specific key.

get_terminal_keys() Dict[str, List[str]][source]

Returns keys and their terminal paths (paths that don’t contain nested dictionaries).

get_terminal_paths() List[str][source]

Returns paths indicating whether they are terminal (don’t contain nested dictionaries).

class scholar_flux.utils.json_processing_utils.KeyFilter[source]

Bases: object

Helper class used to create a simple filter that allows for the identification of terminal keys associated with data in a JSON structure and the paths that lead to each terminal key.

static filter_keys(discovered_keys: Dict[str, List[str]], prefix: str | None = None, min_length: int | None = None, substring: str | None = None, pattern: str | None = None, include_matches: bool = True, match_any: bool = True) Dict[str, List[str]][source]

A method used to match and filter key/path pairs based on the specified criteria.

For example, filtering can be configured to identify keys based on prefix, minimum path length, and path substring/pattern matching with conditional match inclusion/exclusion.

class scholar_flux.utils.json_processing_utils.PathUtils[source]

Bases: object

Helper class used to perform string/list manipulations for paths that can be represented in either form, requiring conversion from one type to the other in specific JSON path processing scenarios.

CONSTANT: str = 'i'
DELIMITER: str = '.'
IGNORE_KEYS: set = {'value'}
classmethod constant_path_indices(path: str | List[Any], constant: str | None = None) List[Any][source]

Replace integer indices with constants in the provided path.

Parameters:
  • path (List[Any]) – The original path containing both keys and indices.

  • constant (Optional[str]) – A value used to replace numeric indices. If not provided, the CONSTANT class variable is used instead.

Returns:

A path with only the key names.

Return type:

List[Any]

static group_path_assignments(path: List[Any]) str | None[source]

Group the path assignments into a single string, excluding indices.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A single string representing the grouped path, or None if the path is empty.

Return type:

Optional[str]

classmethod path_name(level_names: List[Any], delimiter: str | None = None) str[source]

Generate a string representation of the path based on the provided level names.

The path name is chosen starting from the last non-numeric key in a list of path elements.

Parameters:
  • level_names (List[Any]) – A list of names representing the path levels.

  • delimiter (Optional[str]) – A delimiter used to join levels that, together, form the name of a path. If not specified, the class-level delimiter is used.

Returns:

A string representation of the path.

Return type:

str

classmethod path_split(path: str, delimiter: str | None = None) List[str][source]

Splits a path on the cls.DELIMITER value.

Parameters:
  • path (str) – A string-based path to be split into a list

  • delimiter (Optional[str]) – A delimiter used to split a path string. If not specified, the class-level delimiter is used.

Returns:

A list containing each level of a path as a string element.

Return type:

List[str]

classmethod path_str(level_names: List[Any], delimiter: str | None = None) str[source]

Join the level names into a single string separated by the delimiter.

Parameters:
  • level_names (List[Any]) – A list of names representing the path levels.

  • delimiter (Optional[str]) – A delimiter used to join a path from its keys. If not specified, the class-level delimiter is used.

Returns:

A single string with level names joined by the delimiter.

Return type:

str

classmethod remove_path_indices(path: str | List[Any]) List[Any][source]

Remove integer indices from the path to get a list of key names.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A path with only the key names.

Return type:

List[Any]
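The index-stripping step can be sketched with a minimal stand-in; treating digit-only strings as indices is an assumption here, made because paths may mix string and integer components:

```python
# Illustrative stand-in for PathUtils.remove_path_indices: drop integer
# indices (and, as an assumption, digit-only strings) from a path so that
# only the key names remain.
def strip_indices(path):
    return [
        part for part in path
        if not (isinstance(part, int) or (isinstance(part, str) and part.isdigit()))
    ]

assert strip_indices(["authors", 0, "name"]) == ["authors", "name"]
assert strip_indices(["records", "2", "doi"]) == ["records", "doi"]
```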

classmethod to_path_sequence(path: str | List[str] | List[str | int], delimiter: str | None = None) List[str] | List[str | int][source]

Convert a path input (string or list) to a normalized path sequence.

Parameters:
  • path (str | List[str] | List[str | int]) – Either a delimited string or list of path components

  • delimiter (Optional[str]) – Optional delimiter for string paths

Returns:

List of path components (strings and/or integers)

Return type:

PathSequence

Examples

>>> PathUtils.to_path_sequence("authors.0.name")
['authors', '0', 'name']
>>> PathUtils.to_path_sequence(["authors", 0, "name"])
['authors', 0, 'name']
class scholar_flux.utils.json_processing_utils.RecursiveJsonProcessor(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False, path_delimiter: str | None = None)[source]

Bases: object

An implementation of a recursive JSON dictionary processor that is used to process and identify nested components such as paths, terminal key names, and the data at each terminal path.

The utility of the RecursiveJsonProcessor lies in flattening dictionary records into flat representations whose keys represent the terminal paths at each node and whose values represent the data found at each terminal path.

__init__(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False, path_delimiter: str | None = None)[source]

Initialize the RecursiveJsonProcessor with a JSON dictionary and a delimiter for joining list elements.

Parameters:
  • json_dict (Dict) – The input JSON dictionary to be parsed.

  • object_delimiter (str) – The delimiter used to join elements of maximum-depth list objects. Default is “; ”.

  • normalizing_delimiter (Optional[str]) – The delimiter used to join elements across multiple keys when normalizing (e.g., “\n”). Defaults to None.

combine_normalized(normalized_field_value: list | str | None) list | str | None[source]

Combines lists of nested data (strings, ints, None, etc.) into a single string separated by the normalizing_delimiter.

If a delimiter isn’t specified or if the value is None, it is returned as is without modification.

create_record(obj: Any, path: List[Any]) List[JsonRecordData][source]

Helper method for creating a new record within the current JsonProcessor.

filter_extracted(exclude_keys: List[str] | None = None) Self[source]

Filter the extracted JSON dictionaries to exclude specified keys.

Parameters:

exclude_keys ([List[str]]) – List of keys to exclude from the flattened result.

flatten() Dict[str, List[Any] | str | None] | None[source]

Flatten the extracted JSON dictionary from a nested structure into a simpler structure.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]

process_and_flatten(obj: Dict | None = None, exclude_keys: List[str] | None = None, traversal_paths: List[str] | List[List[str]] | List[List[str | int]] | None = None, traverse_lists: bool = False) Dict[str, Any] | None[source]

Process the dictionary, filter extracted paths, and then flatten the result.

Parameters:
  • exclude_keys (Optional[List[str]]) – List of keys to exclude from the flattened result.

  • traversal_paths (Optional[List[str]]) – Optional ‘.’ delimited paths to constrain the extracted keys to. If omitted, all paths are traversed.

  • traverse_lists (bool) – Determines whether to automatically traverse and flatten list structures.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]
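
The flattening behavior described above can be approximated with a short recursive sketch (illustrative only; the path joining, delimiters, and list handling in the real class are more configurable):

```python
from typing import Any, Dict


def flatten_sketch(obj: Any, path: str = "", delimiter: str = "_") -> Dict[str, Any]:
    """Recursively flatten nested dicts/lists into {joined_path: terminal_value}."""
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{path}{delimiter}{key}" if path else str(key)
            flat.update(flatten_sketch(value, child, delimiter))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            child = f"{path}{delimiter}{index}" if path else str(index)
            flat.update(flatten_sketch(value, child, delimiter))
    else:
        flat[path] = obj  # terminal value reached
    return flat
```

For example, flatten_sketch({"title": "Paper", "authors": [{"name": "A"}]}) yields {'title': 'Paper', 'authors_0_name': 'A'}.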

process_dictionary(obj: Dict | None = None) Self[source]

Create a new json dictionary that contains information about the relative paths of each field that can be found within the current JSON dict.

process_level(obj: Any, level_name: List[Any] | None = None) List[Any][source]

Helper method for processing a level within a dictionary.

This method is recursively called to process nested components

traverse_dictionary(paths: List[str] | List[List[str]] | List[List[str | int]], obj: Dict | None = None, traverse_lists: bool = False) Self[source]

Create a new json dictionary by traversing ‘.’ delimited paths within a JSON dictionary.

traverse_level(path: List[str] | List[str | int], obj: Any, level_name: List[Any] | None = None, traverse_lists: bool = False) List[Any][source]

Helper method for traversing a level within a dictionary while constraining keys to known paths.

This method is recursively called to traverse nested components using known keys

static unlist(current_data: Dict | List | None) Any | None[source]

Flattens a dictionary or list if it contains a single element that is a dictionary.

Parameters:

current_data – A dictionary or list to be flattened if it contains a single dictionary element.

Returns:

The flattened dictionary if the input meets the flattening condition, otherwise returns the input unchanged.

Return type:

Optional[Dict|List]

scholar_flux.utils.logger module

The scholar_flux.utils.logger module implements an easy-to-reinitialize logger used for logging events and progress in the retrieval and processing of API responses.

scholar_flux.utils.logger.log_level_context(log_level: int | str = 10, logger: Logger | None = None, allow_lower_level: bool = True) Iterator[None][source]

Context manager for temporarily changing the log level for the package-level (or custom) logger.

Parameters:
  • log_level (int | str) – The log level to temporarily change to. Options include: - logging.DEBUG (10) or “DEBUG” - logging.INFO (20) or “INFO” - logging.WARNING (30) or “WARNING” - logging.ERROR (40) or “ERROR” - logging.CRITICAL (50) or “CRITICAL”

  • logger (logging.Logger) – The logger to use when temporarily changing the log level. If not specified, the ScholarFlux package level logger is used.

  • allow_lower_level (bool) – When False, the current log level is overridden only when the provided log level is higher than the current log level.

Example

>>> from scholar_flux import SearchAPI, log_level_context
>>> api = SearchAPI(provider_name = "CORE", query = "Technological Safety")
>>> with log_level_context("DEBUG"): # `logging.DEBUG`
...     response = api.search(page = 1)
# OUTPUT: 2026-01-21 13:46:50,333 - scholar_flux.api.base_api - DEBUG - Sending request to https://api.core.ac.uk/v3/search/works

Note: when an invalid log_level is passed, a level of 51 is used in its place, effectively turning off logging.
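
The context-manager pattern behind this utility can be sketched with the stdlib alone (a simplified stand-in; the package version also supports the allow_lower_level option):

```python
import logging
from contextlib import contextmanager
from typing import Iterator, Optional, Union


@contextmanager
def log_level_context_sketch(
    log_level: Union[int, str], logger: Optional[logging.Logger] = None
) -> Iterator[None]:
    """Temporarily switch a logger's level, restoring the original on exit."""
    logger = logger or logging.getLogger()
    resolved = logging.getLevelName(log_level) if isinstance(log_level, str) else log_level
    if not isinstance(resolved, int):
        resolved = 51  # invalid names: mirror the documented fallback (logging off)
    previous = logger.level
    logger.setLevel(resolved)
    try:
        yield
    finally:
        logger.setLevel(previous)
```

Note that logging.getLevelName returns the integer level for a known name and a "Level <name>" string otherwise, which is what triggers the fallback branch.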

scholar_flux.utils.logger.resolve_log_level(log_level: str | int | None = None) int | None[source]

Utility for resolving numeric strings and log level values into integer log levels.

Parameters:

log_level (Optional[str | int]) – The log level to resolve as an integer if not already an integer. Accepts case-insensitive strings (“Warning”, “INFO”, “error”), numeric strings (“0”, “03”), and integers (1).

Returns:

The logging level resolved from the user-provided log_level. None: When a non-string/non-integer value is received or log level resolution from a string is unsuccessful.

Return type:

int
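
A stdlib-only approximation of this resolution logic (assumptions: numeric strings are cast directly, and named levels are resolved case-insensitively via logging.getLevelName):

```python
import logging
from typing import Optional, Union


def resolve_log_level_sketch(log_level: Union[str, int, None] = None) -> Optional[int]:
    """Resolve named or numeric log levels into integers; None when unresolvable."""
    if isinstance(log_level, bool) or log_level is None:
        return None
    if isinstance(log_level, int):
        return log_level
    if isinstance(log_level, str):
        text = log_level.strip()
        if text.isdigit():
            return int(text)  # numeric strings such as "0" or "03"
        resolved = logging.getLevelName(text.upper())
        return resolved if isinstance(resolved, int) else None
    return None
```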

scholar_flux.utils.logger.resolve_log_stream(stream: str | bool | TextIO | None) TextIO | Literal[False][source]

Helper for resolving streams used for logging from strings.

Parameters:

stream (Optional[str | bool | TextIO]) – The value to resolve as a stream type.

Returns:

A stderr or stdout stream resolved from the input. Literal[False]: If False or a similar falsy value is received (e.g., 0, ‘0’, ‘false’).

Return type:

TextIO

Note

This function attempts to resolve values into stderr or stdout using case-insensitive string normalization when possible. A value of False, when returned, indicates that streaming should not be used. If a value other than a string is passed (e.g., None, True, 23), the stream will default to stderr instead.
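
A minimal sketch of this resolution behavior, assuming the normalization rules described in the note above (the exact falsy markers accepted by the package may differ):

```python
import sys
from typing import TextIO, Union


def resolve_log_stream_sketch(stream: Union[str, bool, TextIO, None]):
    """Map strings to stderr/stdout; treat falsy markers as 'no stream'."""
    if isinstance(stream, str):
        normalized = stream.strip().lower()
        if normalized in {"", "0", "false"}:
            return False
        return sys.stdout if normalized == "stdout" else sys.stderr
    if stream is False or stream == 0:
        return False
    if hasattr(stream, "write"):  # already a usable text stream
        return stream
    return sys.stderr  # non-string values (None, True, 23) default to stderr
```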

scholar_flux.utils.logger.setup_logging(logger: Logger | None = None, log_directory: str | None = None, log_file: str | None = 'application.log', log_level: int = 10, propagate_logs: bool | None = True, max_bytes: int = 1048576, backup_count: int = 5, logging_filter: Filter | None = None, *, stream: TextIO | Literal[False] | None = None, raise_on_error: bool = True) None[source]

Configures a logger to write to the console and, optionally, file logs with an optional logging filter.

This function is a general purpose utility used by the scholar_flux package to set up a package level logger that implements sensitive data masking with a custom filter.

The logger is configured to write to the terminal (console) and, optionally, to a rotating log file if specified. Rotating files automatically create new files when size limits are reached, keeping your logs manageable.

Parameters:
  • logger (Optional[logging.Logger]) – The logger instance to configure. If None, uses the root logger.

  • log_directory (Optional[str]) – Indicates where to save log files. If None, automatically finds a writable directory when a log_file is specified.

  • log_file (Optional[str]) – Name of the log file (default: ‘application.log’). If None, file-based logging will not be performed.

  • log_level (int) – Minimum level to log (DEBUG logs everything, INFO skips debug messages).

  • propagate_logs (Optional[bool]) – Determines whether to propagate logs. Logs are propagated by default if this option is not specified.

  • max_bytes (int) – Maximum size of each log file before rotating (default: 1MB).

  • backup_count (int) – Number of old log files to keep (default: 5).

  • logging_filter (Optional[logging.Filter]) – Optional filter to modify log messages (e.g., hide sensitive data).

  • stream (Optional[TextIO | bool]) – Optionally modifies the stream used for logging. By default, a stream is created that uses stderr. Set this to False to avoid creating a log stream altogether.

  • raise_on_error (bool) – Indicates whether an error should be raised if an error on package directory setup occurs.

Example

>>> # Basic setup - logs to console and file
>>> setup_logging()
>>> # Custom location and less verbose
>>> setup_logging(log_directory="/var/log/myapp", log_level=logging.INFO)
>>> # With sensitive data masking
>>> from scholar_flux.security import MaskingFilter
>>> mask_filter = MaskingFilter()
>>> setup_logging(logging_filter=mask_filter)

Note

  • Console shows all log messages in real-time

  • File keeps a permanent record with automatic rotation

  • If logging_filter is provided, it’s applied to both console and file output

  • Calling this function multiple times will reset the logger configuration

scholar_flux.utils.module_utils module

The scholar_flux.utils.module_utils module defines the set_public_api_module that is used throughout the scholar_flux source code to aid in logging and streamline the documentation of imports.

It is generally used in the initialization of submodules within the scholar_flux package, which helps greatly in structuring the automatic sphinx documentation.

scholar_flux.utils.module_utils.set_public_api_module(module_name: str, public_names: list[str], namespace: dict) None[source]

Assigns the current module’s name to the __module__ attribute of public API objects.

This function is useful for several use cases including sphinx documentation, introspection, and error handling/reporting.

For all objects defined in the list of a module’s public API names (generally named __all__), this function sets their __module__ attribute to the name of the current public API module if supported.

This is useful for ensuring that imported classes and functions appear as if they are defined in the current module (such as in the automatic generation of sphinx documentation), which improves overall documentation, introspection, and error reporting.

Parameters:
  • module_name (str) – The name of the module (usually __name__).

  • public_names (list[str]) – List of public object names to update (e.g., __all__).

  • namespace (dict) – The module’s namespace (usually globals()).

Example usage:

set_public_api_module(__name__, __all__, globals())

scholar_flux.utils.provider_utils module

The scholar_flux.utils.provider_utils module implements the ProviderUtils class that is used to dynamically load the configuration for default providers stored in the scholar_flux.api.providers module.

class scholar_flux.utils.provider_utils.ProviderUtils[source]

Bases: object

Helper class used by the scholar_flux package to dynamically load the default ProviderConfig for each provider within the scholar_flux.api.providers module on import.

The ProviderUtils class uses importlib with exception handling to account for possible errors that may occur when dynamically importing the ProviderConfig for each provider.

classmethod load_provider_config(provider_module: str, provider_config_variable: str = 'provider') ProviderConfig | None[source]

Helper method that loads a single config from the provided module in the event that the module contains a ProviderConfig under the same name as provider_config_variable. The default variable to look for is provider.

Parameters:
  • provider_module (str) – The name of the module to load.

  • provider_config_variable (str) – The name of the variable carrying the config to check for.

Returns:

The ProviderConfig associated with the module if one is found under the variable name

provider_config_variable. Otherwise, the method returns None.

Return type:

Optional[ProviderConfig]

classmethod load_provider_config_dict() dict[str, ProviderConfig][source]

Helper method for dynamically retrieving the default provider list as a dictionary.

Returns:

A dictionary mapping the formatted name of each provider to its associated ProviderConfig.

Return type:

dict[str, ProviderConfig]
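
The dynamic-import pattern described here can be sketched with importlib (a hypothetical helper; the stdlib math module stands in for a provider module):

```python
import importlib
from typing import Any, Optional


def load_module_variable_sketch(module_name: str, variable: str = "provider") -> Optional[Any]:
    """Import a module by name and return the named variable, or None on failure."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, variable, None)
```

Wrapping the import in exception handling mirrors the ProviderUtils design: a provider module that fails to import simply yields None instead of breaking package initialization.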

scholar_flux.utils.repr_utils module

The scholar_flux.utils.repr_utils module includes several methods used in the creation of descriptive representations of custom objects such as custom classes, dataclasses, and base models. This module can be used to generate a representation from a string to show nested attributes and customize the representation if needed.

Functions:
  • truncate:

    A helper function used to truncate various types before representations of objects are displayed. This function also accounts for edge cases and type differences before other utilities display the repr.

  • generate_repr:

    The core representation generating function that uses the class type and attributes to create a representation of the object.

  • generate_repr_from_string:

    Takes a class name and dictionary of attribute name-value pairs to create a representation from scratch.

  • generate_sequence_repr:

    Generates a representation of a sequence given its class and internal elements. This function uses generate_repr on each nested component to generate the complete representation of the sequence.

  • adjust_repr_padding:

    Helper function that adjusts the padding of the representation to ensure all attributes are shown in-line.

  • format_repr_value:

    Formats the value of a nested attribute with regard to padding and appearance with the selected options.

  • normalize_repr:

    Formats the value of a nested attribute, cleaning memory locations and stripping whitespace.

scholar_flux.utils.repr_utils.adjust_repr_padding(obj: Any, pad_length: int | None = 0, flatten: bool | None = None) str[source]

Helper method for adjusting the padding for representations of objects.

Parameters:
  • obj (Any) – The object to generate an adjusted repr for

  • pad_length (Optional[int]) – Indicates the additional amount of padding that should be added. Helpful for when attempting to create nested representations formatted as intended.

  • flatten (bool) – Indicates whether to use newline characters. This is false by default

Returns:

A string representation of the current object that adjusts the padding accordingly

Return type:

str

scholar_flux.utils.repr_utils.format_repr_value(value: Any, pad_length: int | None = None, show_value_attributes: bool | None = None, flatten: bool | None = None, replace_numeric: bool | None = False) str[source]

Helper function for representing nested objects from custom classes.

Parameters:
  • value (Any) – The value containing the repr to format

  • pad_length (Optional[int]) – Indicates the total additional padding to add for each individual line

  • show_value_attributes (Optional[bool]) – If False, all attributes within the current object will be replaced with ‘…’. (e.g., StorageDevice(…))

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

Returns:

The formatted string representation of a value

Return type:

str

scholar_flux.utils.repr_utils.generate_repr(obj: object, exclude: set[str] | list[str] | tuple[str] | None = None, show_value_attributes: bool = True, flatten: bool = False, replace_numeric: bool = False, as_dict: bool | None = False, resolve_property_attributes: bool = False, flatten_nested: bool | None = None) str[source]

Method for creating a basic representation of a custom object’s data structure. Useful for showing the options/attributes being used by an object.

If the object doesn’t have a __dict__ attribute, the resulting AttributeError is caught and the function falls back to using the basic string representation of the object.

Note that threading.Lock objects are excluded from the final representation.

Parameters:
  • obj (object) – The object whose attributes are to be represented.

  • exclude (Optional[set[str] | list[str] | tuple[str]]) – Attributes to exclude from the representation (default is None).

  • show_value_attributes (bool) – If False, nested attributes within elements will be replaced with ‘…’. e.g., RetryAttempt(…)

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

  • as_dict (bool) – Determines whether to represent the current class as a dictionary.

  • resolve_property_attributes (bool) – Determines whether to substitute properties pointing to private attributes.

  • flatten_nested (Optional[bool]) – Indicates whether to use newline characters to create a representation of nested objects or to flatten them into a single line. If None, nested objects are flattened only if flatten=True.

Returns:

A string representing the object’s attributes in a human-readable format.

scholar_flux.utils.repr_utils.generate_repr_from_string(class_name: str, attribute_dict: dict[str, Any], show_value_attributes: bool | None = None, flatten: bool | None = False, replace_numeric: bool | None = False, as_dict: bool | None = False, flatten_nested: bool | None = None) str[source]

Method for creating a basic representation of a custom object’s data structure. Allows for the direct creation of a repr using the classname as a string and the attribute dict that will be formatted and prepared for representation of the attributes of the object.

Parameters:
  • class_name – The class name of the object whose attributes are to be represented.

  • attribute_dict (dict) – A dictionary containing attributes to format into the components of a repr.

  • show_value_attributes (bool) – If False, nested attributes within elements will be replaced with ‘…’. e.g., RetryAttempt(…).

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character.

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

  • as_dict (Optional[bool]) – Determines whether to represent the current class as a dictionary.

  • flatten_nested (Optional[bool]) – Indicates whether to use newline characters to create a representation of nested objects or to flatten them into a single line. False by default.

Returns:

A string representing the object’s attributes in a human-readable format.

Return type:

str

scholar_flux.utils.repr_utils.generate_sequence_repr(obj: Sequence | set, flatten: bool = False, show_value_attributes: bool = True, replace_numeric: bool = False, brackets: tuple[str, str] | None = ('[', ']'), flatten_nested: bool | None = None) str[source]

Method for creating basic representations of sequence-like data structures.

This function generates formatted str representations for collections such as list, tuple, deque, and custom sequence data types. A string representation is also created for nested elements using generate_repr internally.

When this function encounters an error, it falls back internally to the str function to create a basic string representation.

Parameters:
  • obj (Sequence) – The sequence-like object to create a string representation for

  • flatten (bool) – Indicates whether to use newline characters. This is false by default

  • show_value_attributes (bool) – If False, nested attributes within elements will be replaced with ‘…’. e.g., RetryAttempt(…)

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

  • brackets (Optional[tuple[str, str]]) – Opening and closing brackets for the sequence (default: “[”, “]”).

  • flatten_nested (Optional[bool]) – Indicates whether to use newline characters to create a representation of nested objects or to flatten them into a single line. If None, nested objects are flattened only if flatten=True.

Returns:

A string representing the sequence’s elements in a human-readable format.

Examples

>>> from collections import deque
>>> from scholar_flux.utils import generate_sequence_repr
>>> items = deque([{"a": 1}, {"b": 2}])
>>> print(generate_sequence_repr(items, flatten=True))
# OUTPUT: deque([{'a': 1}, {'b': 2}])
>>> print(generate_sequence_repr(items, flatten=False))
# OUTPUT: deque([{'a': 1},
#                {'b': 2}])
>>> print(generate_sequence_repr([1, 2, 3], flatten=True, brackets=None))
# OUTPUT: list((1, 2, 3))
scholar_flux.utils.repr_utils.normalize_repr(value: Any, replace_numeric: bool | None = False) str[source]

Helper function for removing byte locations and surrounding signs from classes.

Parameters:
  • value (Any) – A value whose representation is to be normalized

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

Returns:

A normalized string representation of the current value

Return type:

str
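
The memory-address cleanup can be approximated with a regular expression (an illustrative sketch, not the package’s exact normalization rules):

```python
import re
from typing import Any


def normalize_repr_sketch(value: Any) -> str:
    """Strip ' at 0x...' memory addresses and surrounding angle brackets from a repr."""
    text = re.sub(r" at 0x[0-9A-Fa-f]+", "", repr(value))
    return text.strip("<>").strip()


class Widget:
    """Placeholder class whose default repr contains a memory address."""
```

For a Widget instance, a repr like "<mymodule.Widget object at 0x7f...>" is reduced to "mymodule.Widget object".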

scholar_flux.utils.repr_utils.truncate(value: Any, max_length: int = 40, suffix: str = '...', show_count: bool = True) str[source]

Truncates various strings, mappings, and sequences for cleaner representations of objects in CLIs.

Handles:
  • Strings: Truncate with suffix

  • Mappings (dict): Show a preview of the first N characters with count

  • Sequences (list, tuple): Show a preview with count

  • Other objects: Use the string representation

Parameters:
  • value (Any) – The value to truncate.

  • max_length (int) – Maximum character length before truncation.

  • suffix (str) – String to append when truncated (default: “…”).

  • show_count (bool) – Whether to show item count for collections.

Returns:

Truncated string representation.

Return type:

str

Examples

>>> truncate("A very long string that needs truncation", max_length=20)
'A very long string...'
>>> truncate({'key1': 'value1', 'key2': 'value2'}, max_length=30)
"{'key1': 'value1', ...} (2 items)"
>>> truncate([1, 2, 3, 4, 5], max_length=10)
'[1, 2, ...] (5 items)'
>>> truncate({'a': 1}, max_length=50, show_count=False)
"{'a': 1}"

scholar_flux.utils.response_protocol module

The scholar_flux.utils.response_protocol module implements response object duck-typing for API client operations.

The implemented ResponseProtocol class and the is_response_like function are each implemented to ensure that responses during API retrieval and processing steps can be successfully duck-typed and validated without favoring a specific client such as requests (or by extension, requests_cache), httpx, or aiohttp.

An object is then seen as response-like if it passes the preliminary check, containing all of the following attributes:
  • url

  • status_code

  • raise_for_status

  • headers

To ensure compatibility, the scholar_flux.api.ReconstructedResponse class is used as an adapter throughout the request retrieval, response processing, and caching processes so that the ResponseProtocol generalizes to other implementations when the default requests client is not used directly.
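
The preliminary check can be illustrated with a runtime-checkable protocol carrying the four attributes listed above (a simplified stand-in for the actual ResponseProtocol):

```python
from typing import Any, MutableMapping, Protocol, runtime_checkable


@runtime_checkable
class ResponseLikeSketch(Protocol):
    """Duck-typed response: url, status_code, headers, raise_for_status."""

    url: Any
    status_code: int
    headers: MutableMapping[str, str]

    def raise_for_status(self) -> None: ...


class FakeResponse:
    """A client-agnostic object that satisfies the protocol without inheriting it."""

    def __init__(self) -> None:
        self.url = "https://example.org"
        self.status_code = 200
        self.headers = {"Content-Type": "application/json"}

    def raise_for_status(self) -> None:
        return None
```

Because the protocol is runtime-checkable, isinstance only verifies attribute presence, which is exactly the kind of preliminary check described above.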

class scholar_flux.utils.response_protocol.ResponseProtocol(*args, **kwargs)[source]

Bases: Protocol

Protocol for HTTP response objects compatible with requests.Response, httpx.Response, and other response classes.

This protocol defines the common interface shared between popular HTTP client libraries, allowing for type-safe interoperability.

The URL is kept flexible to allow types beyond plain strings, including pydantic and httpx URL types as well as other custom objects.

__init__(*args, **kwargs)
content: bytes
headers: MutableMapping[str, str]
raise_for_status() None[source]

Raises an exception for HTTP error status codes.

status_code: int
url: Any
class scholar_flux.utils.response_protocol.ResponseSupportsJSONProtocol(*args, **kwargs)[source]

Bases: ResponseProtocol, Protocol

Extends the ResponseProtocol for the identification of response-like objects that support JSON deserialization.

JSON parsing is supported for python http clients such as requests and httpx.

Use this protocol to narrow response types when JSON parsing is required, such as response parsing and the extraction of error details for unsuccessful responses.

json(**kwargs: Any) Any[source]

Deserializes response content into JSON format.

scholar_flux.utils.response_protocol.is_response_like(response: object) TypeGuard[Response | ResponseProtocol][source]

Identifies whether an object is a response or duck typed response protocol.

scholar_flux.utils.response_protocol.response_supports_json(response: object) TypeGuard[ResponseSupportsJSONProtocol][source]

Determines whether the current object is a response-like object that supports JSON content parsing.

Module contents

The scholar_flux.utils module defines several utilities for simplifying the implementation of common design patterns.

Modules:
  • initializer.py: Contains the tools used to initialize (or reinitialize) the scholar_flux package.
    The initializer creates the following package components:
    • config: Contains a list of environment variables and defaults for configuring the package

    • logger: created by calling the setup_logging function with inputs or defaults from an .env file

    • masker: identifies and masks sensitive data from logs such as api keys and email addresses

  • logger.py: Contains the setup_logging function that is used to set the logging level and output location for logs when

    using the scholar_flux package

  • config.py: Holds the ConfigLoader class that starts from the scholar_flux defaults and reads from an .env and

    environment variables to automatically apply API keys, encryption settings, the default provider, etc.

  • helpers.py: Contains a variety of convenience and helper functions used throughout the scholar_flux package.

  • file_utils.py: Implements a JsonFileUtils class that contains several static methods for reading files

  • encoder: Contains an implementation of a CacheDataEncoder and JsonDataEncoder that uses base64 and json utilities

    to recursively serialize, deserialize, encode, and decode JSON dictionaries and lists for storage and retrieval using base64. This approach accounts for cases where direct serialization isn’t possible and would otherwise result in a JSONDecodeError due to unhandled nested structures and types.

  • json_processing_utils: Contains a variety of utilities used in the creation of the RecursiveJsonProcessor which

    is used to streamline the process of filtering and flattening parsed record data

  • paths/: Contains custom implementations for processing JSON lists using path processing that abstracts

    elements of JSON files into Nodes consisting of paths (keys) to arrive at terminal entries (values), similar to dictionaries. This implementation simplifies the flattening, processing, and filtering of records when processing articles and record entries from response data.

  • provider_utils: Contains the ProviderUtils class that implements class methods that are used to dynamically read

    modules containing provider-specific config models. These config models are then used by the scholar_flux.api module to populate Search API configurations with API-specific settings.

  • repr_utils: Contains a set of helper functions specifically geared toward printing nested objects and

    compositions of classes into a human-readable format to create sensible representations of objects

class scholar_flux.utils.CacheDataEncoder[source]

Bases: object

A utility class to encode data into a base64 string representation or decode it back from base64.

This class supports encoding binary data (bytes) and recursively handles nested structures such as dictionaries and lists by encoding their elements, preserving the original structure upon decoding.

This class is used to serialize json structures when the structure isn’t known and contains unpredictable elements such as 1) None, 2) bytes, 3) nested lists, 4) Other unpredictable structures typically found in JSON.

Class Attributes:
DEFAULT_HASH_PREFIX: (Optional[str]):

An optional indicator of fields to mark fields as bytes for use when decoding. This field defaults to <hashbytes> but can be optionally turned off by setting CacheDataEncoder.DEFAULT_HASH_PREFIX=None or CacheDataEncoder.DEFAULT_HASH_PREFIX=’’

DEFAULT_NONREADABLE_PROP (int):

A threshold used to identify previously encoded base64 fields. This proportion is used when a hash prefix that marks encoded text is not applied. To test whether a string is an encoded string: when an ordinary (non-encoded) string is decoded, a high percentage of the resulting characters will be nonreadable (e.g., CacheDataEncoder.decode(‘encoders’) —> b’zw(uêì’).

Example

>>> from scholar_flux.utils import CacheDataEncoder
>>> import json
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string', 'list': ['a', True, 'series', 'of', None]}
>>> try:
...     json.dumps(data)
... except TypeError:
...     print('The `data` is non-serializable as expected')
>>>
>>> encoded_data = CacheDataEncoder.encode(data)
>>> serialized_data = json.dumps(encoded_data)
>>> assert data == CacheDataEncoder.decode(json.loads(serialized_data))
DEFAULT_HASH_PREFIX: str | None = '<hashbytes>'
DEFAULT_NONREADABLE_PROP: float = 0.2
classmethod decode(data: str, hash_prefix: str | None = None) str | bytes[source]
classmethod decode(data: dict, hash_prefix: str | None = None) dict
classmethod decode(data: list, hash_prefix: str | None = None) list
classmethod decode(data: tuple, hash_prefix: str | None = None) tuple
classmethod decode(data: T, hash_prefix: str | None = None) T

Recursively decodes base64 strings back to bytes or recursively decode elements within dictionaries and lists.

Parameters:
  • data (object) – The input data that needs decoding from a base64 encoded format. This could be a base64 string or nested structures like dictionaries and lists containing base64 strings as values.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Decoded bytes for byte-based representations or recursively decoded elements

within the dictionary/list/tuple if applicable.

Return type:

object

classmethod encode(data: bytes, hash_prefix: str | None = None) str[source]
classmethod encode(data: MutableMapping, hash_prefix: str | None = None) dict
classmethod encode(data: MutableSequence | set, hash_prefix: str | None = None) list
classmethod encode(data: tuple, hash_prefix: str | None = None) tuple
classmethod encode(data: T, hash_prefix: str | None = None) T

Recursively encodes all items that contain elements that cannot be directly serialized into JSON into a format more suitable for serialization:

  • Mappings are converted into dictionaries

  • Sets and other uncommon Sequences other than lists and tuples are converted into lists

  • Bytes objects are converted into strings and hashed with an optional prefix-identifier.

Parameters:
  • data (object) – The input data. This can be: * bytes: Encoded directly to a base64 string. * Mappings/Sequences/Sets/Tuples: Recursively encodes elements if they are bytes.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

Encoded string (for bytes) or a dictionary/list/tuple with recursively encoded elements.

Return type:

object

classmethod is_base64(s: str | bytes, hash_prefix: str | None = None) bool[source]

Check if a string is a valid base64 encoded string. Encoded strings can optionally be identified with a hash_prefix to streamline checks to determine whether or not to later decode a base64 encoded string.

As a general heuristic when encoding and decoding base64 objects, a string should be equal to its original value after encoding and decoding the string. In this implementation, we strip equals signs, as minor differences in padding aren’t relevant.

Parameters:
  • s (str | bytes) – The string to check.

  • hash_prefix (Optional[str]) – The prefix to identify hash bytes. Uses the class default prefix <hashbytes> but can be turned off if the CacheDataEncoder.DEFAULT_HASH_PREFIX is modified or hash_prefix is set to ‘’.

Returns:

True if the string is base64 encoded, False otherwise.

Return type:

bool
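The round-trip heuristic described above can be sketched with the standard library's base64 module. This is an illustrative simplification of the idea, not the scholar_flux implementation:

```python
import base64

def looks_like_base64(s: str) -> bool:
    """Heuristic: treat a string as base64 when decoding and re-encoding
    reproduces the original, ignoring '=' padding differences."""
    try:
        decoded = base64.b64decode(s, validate=True)
    except (ValueError, TypeError):
        return False
    reencoded = base64.b64encode(decoded).decode("ascii")
    return reencoded.rstrip("=") == s.rstrip("=")

print(looks_like_base64(base64.b64encode(b"hello").decode()))  # True
print(looks_like_base64("not base64!"))  # False
```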

classmethod is_nonreadable(s: bytes, prop: float | None = None) bool[source]

Check if a decoded byte string contains a high percentage of non-printable characters. Non-printable characters are defined as those not within the unicode range of (32 <= c <= 126).

Parameters:
  • s (bytes) – The byte string to check.

  • prop (Optional[float]) – The threshold proportion of non-printable characters. Defaults to DEFAULT_NONREADABLE_PROP if not specified.

Returns:

True if the string is likely gibberish, False otherwise.

Return type:

bool
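The check described above amounts to counting characters outside the printable ASCII range. A minimal sketch follows; the 0.3 threshold is an assumed placeholder for DEFAULT_NONREADABLE_PROP, whose actual value is not stated here:

```python
def is_nonreadable(data: bytes, prop: float = 0.3) -> bool:
    """Flag a byte string as non-readable when the proportion of characters
    outside the printable ASCII range (32 <= c <= 126) exceeds `prop`."""
    if not data:
        return False
    nonprintable = sum(1 for c in data if not 32 <= c <= 126)
    return nonprintable / len(data) > prop
```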

class scholar_flux.utils.ConfigLoader(env_path: str | Path | None = None)[source]

Bases: object

Configuration loader for the scholar_flux package settings and environment variables.

The ConfigLoader is used on package initialization to dynamically configure package options from .env files and the OS environment. ScholarFlux uses this class to define package-level settings at runtime while prioritizing .env file configurations when available.

Configuration Variables

Package Level Settings

  • SCHOLAR_FLUX_DEFAULT_PROVIDER: Defines the provider to use by default when creating a SearchAPI instance.

  • SCHOLAR_FLUX_DEFAULT_USER_AGENT:

    The default User-Agent to use when sending requests via requests-cache. If not specified, a default User-Agent will be generated automatically.

  • SCHOLAR_FLUX_DEFAULT_MAILTO:

    Defines the default mailto address that is used when creating a new search coordinator.

  • SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND:

    Controls the default backend for CachedSession instances created when initializing SearchAPI or SearchCoordinator. Supported requests_cache backends include sqlite, redis, mongodb, and memory.

  • SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE:

    Defines the default cache storage backend that the DataCacheManager creates for response caching during orchestration of the response processing steps. Supported options are redis, sql, mongodb, memory, and null. Defaults to memory if not specified.

  • SCHOLAR_FLUX_SESSION_CACHE_NAME:

    Defines the name of the session cache db (mongodb/redis), table name (sqlite), or nested folder (filename).

  • SCHOLAR_FLUX_CACHE_DIRECTORY:

    Defines the directory path where requests and response processing cache will be stored when using filesystem-based cache backends (e.g., sqlite).

API_KEYS

  • ARXIV_API_KEY: API key used when retrieving academic data from arXiv.

  • OPEN_ALEX_API_KEY: API key used when retrieving academic data from OpenAlex.

  • SPRINGER_NATURE_API_KEY: API key used when retrieving academic data from Springer Nature.

  • CROSSREF_API_KEY: API key used to retrieve academic metadata from Crossref (API key not required).

  • CORE_API_KEY: API key used to retrieve metadata and full-text publications from the CORE API.

  • PUBMED_API_KEY: API key used to retrieve publications from the NIH PubMed database.

  • SCHOLAR_FLUX_CACHE_SECRET_KEY:

    Defines the secret key used to create encrypted session cache for request retrieval.

Logging

  • SCHOLAR_FLUX_ENABLE_LOGGING: Defines whether logging should be enabled when ScholarFlux is initialized.

  • SCHOLAR_FLUX_LOG_DIRECTORY: Defines where rotating logs will be stored when logging is enabled.

  • SCHOLAR_FLUX_LOG_LEVEL:

    Defines the default log level used for package level logging during and after scholar_flux package initialization.

  • SCHOLAR_FLUX_LOG_STREAM:

    Defines the default stream that should be used when initializing the package-level logger.

  • SCHOLAR_FLUX_PROPAGATE_LOGS: Determines whether logs should be propagated or not. (True by default).

Database Connections

  • SCHOLAR_FLUX_MONGODB_HOST: MongoDB connection string (default: “mongodb://127.0.0.1”)

  • SCHOLAR_FLUX_MONGODB_PORT: MongoDB port (default: 27017)

  • SCHOLAR_FLUX_REDIS_HOST: Redis host (default: “localhost”)

  • SCHOLAR_FLUX_REDIS_PORT: Redis port (default: 6379)

  • SCHOLAR_FLUX_SQLALCHEMY_URL: The default SQLAlchemy URL to use for the response processing cache.

  • SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL: Controls the time until expiration for cached responses (seconds)

  • SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL: Controls the time until expiration for processing cache (seconds)
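Taken together, a minimal .env file setting a few of the variables above might look like the following (all values are illustrative placeholders, not recommended defaults):

```shell
# .env — illustrative values only
SCHOLAR_FLUX_DEFAULT_PROVIDER=plos
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=sqlite
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=memory
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_REDIS_HOST=localhost
SCHOLAR_FLUX_REDIS_PORT=6379
```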

Examples

>>> from scholar_flux.utils import ConfigLoader
>>> from pydantic import SecretStr
>>> config_loader = ConfigLoader()
>>> config_loader.load_config(reload_env=True)
>>> api_key = '' # Your key goes here
>>> if api_key:
>>>     config_loader.config['CROSSREF_API_KEY'] = api_key
>>> print(config_loader.env_path) # the default environment location when writing/replacing a env config
>>> config_loader.save_config() # to save the full configuration in the default environment folder
DEFAULT_ENV: Dict[str, Any] = {'ARXIV_API_KEY': None, 'CORE_API_KEY': None, 'CROSSREF_API_KEY': None, 'OPEN_ALEX_API_KEY': None, 'PUBMED_API_KEY': None, 'SCHOLAR_FLUX_CACHE_DIRECTORY': None, 'SCHOLAR_FLUX_CACHE_SECRET_KEY': None, 'SCHOLAR_FLUX_DEFAULT_MAILTO': None, 'SCHOLAR_FLUX_DEFAULT_PROVIDER': 'plos', 'SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE': None, 'SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL': None, 'SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND': None, 'SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL': 86400, 'SCHOLAR_FLUX_DEFAULT_USER_AGENT': None, 'SCHOLAR_FLUX_ENABLE_LOGGING': None, 'SCHOLAR_FLUX_LOG_DIRECTORY': None, 'SCHOLAR_FLUX_LOG_FILE': 'application.log', 'SCHOLAR_FLUX_LOG_LEVEL': None, 'SCHOLAR_FLUX_LOG_STREAM': None, 'SCHOLAR_FLUX_MONGODB_HOST': 'mongodb://127.0.0.1', 'SCHOLAR_FLUX_MONGODB_PORT': 27017, 'SCHOLAR_FLUX_PROPAGATE_LOGS': None, 'SCHOLAR_FLUX_REDIS_HOST': 'localhost', 'SCHOLAR_FLUX_REDIS_PORT': 6379, 'SCHOLAR_FLUX_SESSION_CACHE_NAME': None, 'SCHOLAR_FLUX_SQLALCHEMY_URL': None, 'SPRINGER_NATURE_API_KEY': None}
DEFAULT_ENV_PATH: Path = PosixPath('/home/runner/work/scholar-flux/scholar-flux/.env')
__init__(env_path: str | Path | None = None)[source]

Initializes the ConfigLoader with class-level defaults and establishes the .env path to read from.

If a custom path is provided and valid, it will be used when it points to a valid file that exists; otherwise, the path will default to a readable package location (SCHOLAR_FLUX_HOME, ~/.scholar_flux, or current directory).

Parameters:

env_path (Optional[Path | str]) – The dotenv file to read environment variables from. If not passed, environment variables are scanned and checked from default package locations or the current directory when available.

env_path

The location of the .env file to load for reading/writing configuration.

Type:

Path

config

(Dict[str, Any]): The current configuration dictionary with masked sensitive values.

config: Dict[str, Any]
env_path: Path
get(key: str, default: Any = None) Any[source]

Retrieve a configuration value from the config dictionary, falling back to the environment if not present.

Parameters:
  • key (str) – The name of the variable from which to retrieve the configuration value.

  • default (Any) – A fallback value that is returned when the key exists in neither the config dictionary nor the environment.

Note

Any values set during the current session are prioritized over values from the environment. If a value can’t be found within the config dictionary, the get() method will fallback to checking for the environment variable within the operating system environment.
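The lookup order described in the note (session config first, then the OS environment, then the default) can be sketched as follows. This is an illustration of the documented behavior, not the actual ConfigLoader source:

```python
import os

def get_setting(config: dict, key: str, default=None):
    """Session-level config values take priority; fall back to the
    OS environment, and finally to the provided default."""
    value = config.get(key)
    if value is not None:
        return value
    return os.environ.get(key, default)

config = {"SCHOLAR_FLUX_DEFAULT_PROVIDER": "plos"}
print(get_setting(config, "SCHOLAR_FLUX_DEFAULT_PROVIDER"))  # plos
```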

load_config(env_path: str | Path | None = None, reload_env: bool = False, reload_os_env: bool = False, verbose: bool = False) None[source]

Load configuration settings from a .env file and the global OS environment.

This package allows users to set new defaults on changes to the environment while optionally overwriting previously set configuration settings.

Optionally attempt to reload and overwrite previously set ConfigLoader settings using either or both sources of config settings.

Note that config settings from a .env file are prioritized over globally set OS environment variables. If neither reload_os_env nor reload_env is set, this method has no effect on the current configuration.

Parameters:
  • env_path (Optional[Path | str]) – An optional env path to read from. Defaults to the current env_path if None.

  • reload_env (bool) – Determines whether environment variables will be loaded/reloaded from the provided env_path or a current self.env_path. Defaults to False, indicating that variables are not reloaded from a .env.

  • reload_os_env (bool) – Determines whether environment variables will be loaded/reloaded from the Operating System’s global environment.

  • verbose (bool) – Convenience setting indicating whether or not to log changed configuration variable names.

load_dotenv(env_path: str | Path | None = None, replace_all: bool = False, verbose: bool = False) dict[str, Any][source]

Retrieves a dictionary of non-null environment variables from the current .env file.

Parameters:
  • env_path (Optional[Path | str]) – Location of the .env file where env variables will be retrieved from.

  • replace_all (bool) – Indicates whether all environment variables should be replaced vs. only missing variables. By default, only previously non-existent variables are assigned updated values.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. This is set to False by default.

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict[str, Any]
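At its core, reading a .env file means parsing KEY=VALUE lines. A simplified stand-in for the parsing step is sketched below (real implementations such as python-dotenv handle more edge cases, e.g. multiline values and export prefixes):

```python
from pathlib import Path

def parse_dotenv(env_path) -> dict:
    """Parse KEY=VALUE pairs from a .env file, skipping blank lines,
    comments, and malformed lines; strips surrounding quotes."""
    env = {}
    for raw_line in Path(env_path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("'\"")
    return env
```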

load_os_env(replace_all: bool = False, verbose: bool = False) dict[source]

Load any updated configuration settings from variables set within the system environment.

The configuration setting must already exist in the config to be updated if available. Otherwise, the update_config method allows direct updates to the config settings.

Parameters:
  • replace_all (bool) – Indicates whether all environment variables should be replaced vs. only non-missing variables. This is false by default.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. This is False by default.

Returns:

A dictionary of key-value pairs corresponding to environment variables

Return type:

dict[str, Any]

classmethod load_os_env_key(key: str, **kwargs: Any) str | SecretStr | None[source]

Loads the provided key from the global environment. Converts API_KEY variables to secret strings by default.

Parameters:
  • key (str) – The key to load from the environment. This key will be guarded if it contains any of the following substrings: “API_KEY”, “SECRET”, “MAIL”

  • matches (str) – The substrings used to indicate whether the loaded environment variable should be guarded

Returns:

The value of the environment variable, possibly wrapped as a secret string

Return type:

Optional[str | SecretStr]
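The guarding behavior described above (wrapping keys containing "API_KEY", "SECRET", or "MAIL") can be sketched with a minimal stand-in for pydantic's SecretStr. The Secret class here is a hypothetical simplification for illustration:

```python
import os

GUARDED_SUBSTRINGS = ("API_KEY", "SECRET", "MAIL")

class Secret:
    """Minimal stand-in for pydantic's SecretStr: masks the value in repr."""
    def __init__(self, value: str):
        self._value = value
    def get_secret_value(self) -> str:
        return self._value
    def __repr__(self) -> str:
        return "Secret('**********')"

def load_env_key(key: str):
    """Load an environment variable, wrapping it when the key name
    suggests it holds sensitive data."""
    value = os.environ.get(key)
    if value is not None and any(sub in key for sub in GUARDED_SUBSTRINGS):
        return Secret(value)
    return value
```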

save_config(env_path: str | Path | None = None) None[source]

Save configuration settings to a .env file.

Automatically unmasks SecretStr values before writing to disk.

Parameters:

env_path (Optional[Path | str]) – The location to save the configuration settings to.

Note

Sensitive values (SecretStr) are unmasked during write. Ensure .env files have appropriate permissions (for example, chmod 600).

set(key: str, value: Any, verbose: bool = True) None[source]

Sets a configuration value for a key within the config dictionary.

Parameters:
  • key (str) – The name of the variable to set or overwrite within the current session.

  • value (Any) – The value to assign to the setting in the config dictionary.

  • verbose (bool) – Determines whether overrides to defaults or previously existing variables should be logged.

Note

Values set with the .set() method are prioritized over values from the environment when .get() is called. To override this behavior and use environment variables instead, either remove the environment variable from the config dictionary, or set the value associated with the key to None.

try_loadenv(env_path: str | Path | None = None, verbose: bool = False) Dict[str, Any] | None[source]

Try to load environment variables from a specified .env file into the environment and return as a dict.

Parameters:
  • env_path (Optional[Path | str]) – Location of the .env file where env variables will be retrieved from.

  • verbose (bool) – Flag indicating whether logging should be shown in the output. This is False by default.

Returns:

A loaded configuration that is returned as a dictionary when available. Otherwise, None is returned.

Return type:

Optional[Dict[str, Any]]

update_config(env_dict: dict[str, Any], verbose: bool = False) None[source]

Helper method for updating the config dictionary with the provided dictionary of key-value pairs.

This method coerces strings into integers when possible and uses the _guard_secret method as insurance to guard against logging and recording API keys without masking. Although the load_dotenv and load_os_env methods also mask API keys, this is particularly useful if the end-user calls update_config directly.

Parameters:
  • env_dict (dict[str, Any]) – A dictionary containing environment variables that will be used to update the package-level config dictionary for the current session.

  • verbose (bool) – Determines whether updates to the configuration should be logged when they occur.

write_key(key_name: str, key_value: str, env_path: str | Path | None = None, create: bool = True) None[source]

Write a key-value pair to a .env file.

Parameters:
  • key_name (str) – The name of the key to write to an environment configuration file

  • key_value (str) – The value of the key to write to an environment configuration file

  • env_path (Optional[Path | str]) – The dotenv filepath indicating where to write the key-value pair.

  • create (bool) – Determines whether a new dotenv file should be created if it doesn’t already exist. True by default.

Raises:
  • IOError – If file cannot be written

  • PermissionError – If insufficient permissions to create/modify file

class scholar_flux.utils.JsonDataEncoder[source]

Bases: CacheDataEncoder

Helper class that extends the CacheDataEncoder to provide functionality directly relevant to serializing and deserializing data from JSON formats into serialized JSON strings for easier storage and recovery.

This class includes utility dumping and loading tools directly applicable to safely dumping and reloading responses received from various APIs.

Example Use:
>>> from scholar_flux.utils import JsonDataEncoder
>>> data = {'note': 'hello', 'another_note': b'a non-serializable string',
>>>         'list': ['a', True, 'series', 'of', None]}
# serializes the original data even though it contains otherwise unserializable components
>>> serialized_data = JsonDataEncoder.dumps(data)
>>> assert isinstance(serialized_data, str)
# deserializes and decodes the data, returning the original structure
>>> recovered_data = JsonDataEncoder.deserialize(serialized_data)
# the result should equal the original data
>>> assert data == recovered_data
classmethod deserialize(s: str, **json_kwargs: Any) Any[source]

Class method that deserializes and decodes json data from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The decoded data.

Return type:

Any

classmethod dumps(data: object, **json_kwargs: Any) str[source]

Convenience method that uses the json module to serialize (dump) JSON data into a JSON string.

Parameters:
  • data (object) – The data to serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

classmethod loads(s: str, **json_kwargs: Any) Any[source]

Convenience method that uses the json module to deserialize (load) from a JSON string.

Parameters:
  • s (str) – The JSON string to deserialize and decode.

  • **json_kwargs – Additional keyword arguments for json.loads.

Returns:

The loaded json data.

Return type:

Any

classmethod serialize(data: object, **json_kwargs: Any) str[source]

Class method that encodes and serializes data to a JSON string.

Parameters:
  • data (Any) – The data to encode and serialize as a json string.

  • **json_kwargs – Additional keyword arguments for json.dumps.

Returns:

The JSON string.

Return type:

str

class scholar_flux.utils.JsonFileUtils[source]

Bases: object

Helper class that implements several basic file utility class methods for easily interacting with the file system. This class also contains utility methods used to parse, load, and dump JSON files for convenience.

Example

>>> from scholar_flux.utils.json_file_utils import JsonFileUtils
>>> from pathlib import Path
>>> original_data = {"key": "value"}
>>> json_file = "/tmp/sample"

# the JSON data should be serializable:
>>> assert JsonFileUtils.is_jsonable(original_data)
# writing the json file
>>> JsonFileUtils.save_as(original_data, json_file)
# the data should now exist at the '/tmp/sample.json' path
>>> assert Path(json_file).with_suffix('.json').exists()
# verifying that the dumped data can be loaded as intended:
>>> data = JsonFileUtils.load_data(json_file)
>>> assert data is not None and original_data == data

DEFAULT_EXT = 'json'
classmethod append_to_file(content: str | list[str], filepath: str | Path, ext: str | None = None) None[source]

Helper method used to append content to a file in a content-type aware manner.

Parameters:
  • content (Union[str, list[str]]) – The content to append to the file.

  • filepath (Union[str, Path]) – The file path to write to

  • ext (Optional[str]) – An optional extension to add to the file path

classmethod get_filepath(filepath: str | Path, ext: str | None = None) str[source]

Prepare the filepath using the filepath and extension if provided. Assumes a Unix filesystem structure for edge cases.

Parameters:
  • filepath (Union[str, Path]) – The file path to read from

  • ext (Optional[str]) – An optional extension to add to the file path. If the extension is left None and an extension does not yet exist on the file path, the default JSON extension is used.

static is_jsonable(obj: Any) bool[source]

Verifies whether the object can be serialized as a json object.

Parameters:

obj (Any) – The object to check

Returns:

True if the object is jsonable (serializable), otherwise False

Return type:

bool
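The serializability check described above reduces to attempting a json.dumps call. A minimal sketch of the idea (not necessarily the JsonFileUtils source):

```python
import json

def is_jsonable(obj) -> bool:
    """Return True when json.dumps can serialize the object, False otherwise."""
    try:
        json.dumps(obj)
        return True
    except (TypeError, ValueError):
        return False
```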

classmethod load_data(filepath: str | Path, ext: str | None = None) dict | list | str[source]

Attempts to load data from a filepath as a dictionary/list. If unsuccessful, the file’s contents are instead loaded as a string.

Parameters:

filepath (Union[str, Path]) – The file path to read the data from

Returns:

A dictionary or list if the data can be successfully loaded with json, and a string if loading with JSON is not possible.

Return type:

Union[dict, list, str]

classmethod read_lines(filepath: str | Path, ext: str | None = None) Generator[str, None, None][source]

Iteratively reads lines from a text file.

Parameters:
  • filepath (Union[str, Path]) – The file path to read the data from

  • ext (Optional[str]) – An optional extension to add to the file path

Returns:

The lines read from a text file

Return type:

Generator[str, None, None]

To retrieve a list of data instead of a generator, pass the result to list:
>>> from scholar_flux.utils import JsonFileUtils
>>> line_gen = JsonFileUtils.read_lines('pyproject.toml')
>>> assert isinstance(list(line_gen), list)
classmethod save_as(obj: list | dict | str | float | int, filepath: str | Path, ext: str | None = None, dump: bool = True) None[source]

Save an object in text format with the specified extension (if provided).

Parameters:
  • obj (Union[list, dict, str, float, int]) – A value to save into a file

  • filepath (Union[str, Path]) – The file path to write the object to

  • ext (Optional[str]) – An optional extension to add to the file path

  • dump (bool) – If True, the object is serialized using json.dumps. Otherwise the str function is used

class scholar_flux.utils.JsonNormalizer(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Bases: object

Helper class that flattens and normalizes the retrieved list of JsonRecordData into a singular flattened dictionary.

__init__(json_record_data_list: List[JsonRecordData], use_full_path: bool = False)[source]

Initialize the JsonNormalizer with extracted JSON data and a delimiter.

Parameters:
  • json_record_data_list (List[JsonRecordData]) – The list of extracted JSON data.

  • use_full_path (bool) – Indicates whether to use the full nested json path or the smallest unique path available

create_unique_key(current_group: List[str], current_key_str: str, unique_mappings_dict: Dict[str, List[str]]) str[source]

Create a unique key for the current data entry if a simple key is not sufficient.

Parameters:
  • current_group (List[str]) – The list of keys in the current path.

  • current_key_str (str) – The string representation of the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

get_unique_key(current_key_str: str, current_group: List[str], unique_mappings_dict: Dict[str, List[str]]) str[source]

Generate a unique key for the current data entry.

Parameters:
  • current_key_str (str) – The string representation of the current path.

  • current_group (List[str]) – The list of keys in the current path.

  • unique_mappings_dict (Dict[str, List[str]]) – A dictionary tracking unique keys.

Returns:

A unique key for the current data entry.

Return type:

str

normalize_extracted() Dict[str, List[Any]][source]

Normalize the extracted JSON data into a flattened dictionary.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Dict[str, List[Any]]

class scholar_flux.utils.JsonRecordData(path: List[str | int], data: Dict[str, Any])[source]

Bases: object

Helper class used as a container to record the paths, data, and names associated with each terminal path.

This class uses its structural representation to create a hash that allows it to be stored within a set.

Parameters:
  • path (list[str | int]) – The path associated with the terminal data point where nested terminal values can be found

  • data (dict[str, Any]) – The nested terminal value at the end of a path

__init__(path: List[str | int], data: Dict[str, Any]) None
data: Dict[str, Any]
path: List[str | int]
structure() str[source]

Helper method used to identify duplicate paths before addition.

class scholar_flux.utils.KeyDiscoverer(records: List[Dict] | None = None)[source]

Bases: object

Helper class used to discover terminal keys containing data within nested JSON data structures and identify the paths used to arrive at each key.

_discovered_keys

Defines the complete list of all keys that can be found in a dictionary and the path that needs to be traversed to arrive at that key

Type:

dict[str, list]

_terminal_paths

Creates a dictionary that indicates whether the currently added path is terminal within the JSON data structure

Type:

dict[str, bool]

__init__(records: List[Dict] | None = None)[source]

Initializes the KeyDiscoverer and identifies terminal key/path pairs within the JSON data structure.

filter_keys(prefix: str | None = None, min_length: int | None = None, substring: str | None = None) Dict[str, List[str]][source]

Helper method that filters a range of keys based on the specified criteria.

get_all_keys() Dict[str, List[str]][source]

Returns all discovered keys and their paths.

get_keys_with_path(key: str) List[str][source]

Returns all paths associated with a specific key.

get_terminal_keys() Dict[str, List[str]][source]

Returns keys and their terminal paths (paths that don’t contain nested dictionaries).

get_terminal_paths() List[str][source]

Returns paths indicating whether they are terminal (don’t contain nested dictionaries).
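The key/path discovery idea that KeyDiscoverer implements can be sketched as a recursive traversal that records each terminal key together with the dotted path leading to it. This is an illustrative simplification, not the KeyDiscoverer source:

```python
def discover_keys(record, path=()):
    """Yield (terminal_key, dotted_path) pairs for every terminal value
    in a nested JSON-like structure of dicts and lists."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from discover_keys(value, path + (key,))
    elif isinstance(record, list):
        for i, value in enumerate(record):
            yield from discover_keys(value, path + (str(i),))
    else:
        # terminal (non-container) value reached
        yield (path[-1] if path else "", ".".join(path))

record = {"title": "Sample", "authors": [{"name": "Doe"}]}
print(list(discover_keys(record)))  # [('title', 'title'), ('name', 'authors.0.name')]
```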

class scholar_flux.utils.KeyFilter[source]

Bases: object

Helper class used to create a simple filter that allows for the identification of terminal keys associated with data in a JSON structure and the paths that lead to each terminal key.

static filter_keys(discovered_keys: Dict[str, List[str]], prefix: str | None = None, min_length: int | None = None, substring: str | None = None, pattern: str | None = None, include_matches: bool = True, match_any: bool = True) Dict[str, List[str]][source]

A method used to filter key-path pairs based on the specified criteria.

For example, filtering can be configured to identify keys based on prefix, minimum path length, and path substring/pattern matching with conditional match inclusion/exclusion.

class scholar_flux.utils.PathDiscoverer(records: list[dict] | dict | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>)

Bases: object

A helper class for discovering paths and flattening JSON files into a single dictionary that simplifies the nested structure into the path, the type of structure, and the terminal value.

Parameters:
  • records – Optional[Union[list[dict], dict]]: A list of dictionaries to be flattened

  • path_mappings – dict[ProcessingPath, Any]: A set of key-value pairs mapping paths to terminal values

records

The input data to be traversed and flattened.

Type:

Optional[Union[list[dict], dict]]

path_mappings

Holds a dictionary of values mapped to ProcessingPaths after processing

Type:

dict[ProcessingPath, Any]

DEFAULT_DELIMITER: ClassVar[str] = '.'
__init__(records: list[dict] | dict | None = None, path_mappings: dict[~scholar_flux.utils.paths.ProcessingPath, ~typing.Any] = <factory>) None
clear() None[source]

Removes all path-value mappings from the self.path_mappings dictionary.

discover_path_elements(records: list[dict] | dict | None = None, current_path: ProcessingPath | None = None, max_depth: int | None = None, inplace: bool = False) dict[ProcessingPath, Any] | None[source]

Recursively traverses records to discover keys, their paths, and terminal status. Uses the private method _discover_path_elements in order to add terminal path value pairs to the path_mappings attribute.

Parameters:
  • records (Optional[Union[list[dict], dict]]) – A list of dictionaries to be flattened if not already provided.

  • current_path (Optional[ProcessingPath]) – The parent path to prefix all subsequent paths with. This is useful when working with a subset of a dict.

  • max_depth (Optional[int]) – Indicates the times we should recursively attempt to retrieve a terminal path. Leaving this at None will traverse all possible nested lists/dictionaries.

  • inplace (bool) – Determines whether or not to save the inner state of the PathDiscoverer object. When False: Returns the final object and clears the self.path_mappings attribute. When True: Retains the self.path_mappings attribute and returns None

path_mappings: dict[ProcessingPath, Any]
records: list[dict] | dict | None = None
property terminal_paths: Set[ProcessingPath]

Helper property for returning the set of all discovered terminal paths from the PathDiscoverer.

class scholar_flux.utils.PathNode(path: ProcessingPath, value: Any)

Bases: object

A dataclass that acts as a wrapper for path-terminal value pairs in nested JSON structures.

The PathNode consists of a value of any type and a ProcessingPath instance that indicates where a terminal value was found. This class simplifies the process of manipulating and flattening data structures originating from JSON data.

path

The terminal path where the value was located

Type:

ProcessingPath

value

The terminal value found at the path

Type:

Any

DEFAULT_DELIMITER: ClassVar[str] = '.'
__init__(path: ProcessingPath, value: Any) None
copy() PathNode[source]

Helper method for copying and returning an identical path node.

classmethod is_valid_node(node: PathNode) bool[source]

Validates whether the current node is a PathNode instance. If the input is not a PathNode, this method will raise an InvalidPathNodeError.

Raises:

InvalidPathNodeError – If the current node is not a PathNode or if its path is not a valid ProcessingPath

path: ProcessingPath
property path_group: ProcessingPath

Attempt to retrieve the path, omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is later useful when paths need to be grouped into lists or sets in order to consolidate record fields.

Returns:

A ProcessingPath instance with the last numeric component removed and indices replaced.

Return type:

ProcessingPath

property path_keys: ProcessingPath

Utility property for retaining keys from a path while ignoring indexes generated by lists. Retrieves the original path minus all keys that originate from list indexes.

Returns:

A ProcessingPath instance associated with all dictionary keys

Return type:

ProcessingPath

property record_index: int

Extract the first element of the node’s path to determine the record number originating from a list of dictionaries, assuming the path originates from a paginated structure.

Returns:

Value denoting the record that the path originates from

Return type:

int

Raises:

PathIndexingError – if the first element of the path is not a numerical index

classmethod to_path_node(path: ProcessingPath | str | int | list[str] | list[int] | list[str | int], value: Any, **path_kwargs: Any) Self[source]

Helper method for creating a path node from the components used to create paths, in addition to the value to assign to the path node.

Parameters:
  • path (Union[ProcessingPath, str, list[str]]) – The path to be assigned to the node. If this is not a path already, then a path will be created from what is provided

  • value (Any) – The value to associate with the new node

  • **path_kwargs – Additional keyword arguments to be used in the creation of a path. This is passed to ProcessingPath.to_processing_path when creating a path

Returns:

The newly constructed path node

Return type:

PathNode

Raises:

InvalidPathNodeError – If the values provided cannot be used to create a new node

update(**attributes: ProcessingPath | Any) PathNode[source]

Update the parameters of a PathNode by creating a new PathNode instance. Note that the original PathNode dataclass is frozen. This method uses the copied dict originating from the dataclass to initialize a new PathNode.

Parameters:

**attributes (dict) – Keyword arguments indicating the attributes of the PathNode to update. Each key should be a valid attribute name of PathNode, and each value should be the corresponding updated value. If a specific key is not provided, then it will not be updated.

Returns:

A new path with the updated attributes

value: Any
class scholar_flux.utils.PathNodeIndex(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None)

Bases: object

The PathNodeIndex is a dataclass that enables the efficient processing of nested key value pairs from JSON data commonly received from APIs providing records, articles, and other forms of data.

This index orchestrates the parsing, flattening, and simplification of JSON data structures.

Parameters:
  • node_map (PathNodeMap | RecordPathChainMap) – A dictionary of path-node mappings that are used by the PathNodeIndex to simplify JSON structures into a singular list of dictionaries where each dictionary represents a record

  • simplifier (PathSimplifier) – A structure that enables the simplification of a path node index into a singular list of dictionary records. The structure is initially used to identify unique path names for each path-value combination.

Class Variables:

DEFAULT_DELIMITER (str): A delimiter to use by default when reading JSON structures and transforming the list of keys used to retrieve a terminal path into a simplified string. Each individual key is separated by this delimiter.

MAX_PROCESSES (int): An optional maximum on the total number of processes to use when simplifying multiple records into a singular structure in parallel. This can be configured directly or turned off altogether by setting this class variable to None.

Example Usage:
>>> from scholar_flux.utils import PathNodeIndex
>>> record_test_json: list[dict] = [
>>>     {
>>>         "authors": {"principle_investigator": "Dr. Smith", "assistant": "Jane Doe"},
>>>         "doi": "10.1234/example.doi",
>>>         "title": "Sample Study",
>>>         # "abstract": ["This is a sample abstract.", "keywords: 'sample', 'abstract'"],
>>>         "genre": {"subspecialty": "Neuroscience"},
>>>         "journal": {"topic": "Sleep Research"},
>>>     },
>>>     {
>>>         "authors": {"principle_investigator": "Dr. Lee", "assistant": "John Roe"},
>>>         "doi": "10.5678/example2.doi",
>>>         "title": "Another Study",
>>>         "abstract": "Another abstract.",
>>>         "genre": {"subspecialty": "Psychiatry"},
>>>         "journal": {"topic": "Dreams"},
>>>     },
>>> ]
>>> normalized_records = PathNodeIndex.normalize_records(record_test_json)
>>> normalized_records
# OUTPUT: [{'abstract': 'Another abstract.',
#         'doi': '10.5678/example2.doi',
#         'title': 'Another Study',
#         'authors.assistant': 'John Roe',
#         'authors.principle_investigator': 'Dr. Lee',
#         'genre.subspecialty': 'Psychiatry',
#         'journal.topic': 'Dreams'},
#        {'doi': '10.1234/example.doi',
#         'title': 'Sample Study',
#         'authors.assistant': 'Jane Doe',
#         'authors.principle_investigator': 'Dr. Smith',
#         'genre.subspecialty': 'Neuroscience',
#         'journal.topic': 'Sleep Research'}]
DEFAULT_DELIMITER: ClassVar[str] = '.'
MAX_PROCESSES: ClassVar[int | None] = 8
__init__(node_map: ~scholar_flux.utils.paths.PathNodeMap | ~scholar_flux.utils.paths.RecordPathChainMap = <factory>, simplifier: ~scholar_flux.utils.paths.PathSimplifier = <factory>, use_cache: bool | None = None) None
combine_keys(skip_keys: list | None = None) None[source]

Combine nodes with values in their paths by updating the paths of count nodes.

This method searches for paths ending with values and count, identifies related nodes, and updates the paths by combining the value with the count node.

Parameters:
  • skip_keys (Optional[list]) – Keys that should not be combined regardless of a matching pattern

  • quote_numeric (Optional[bool]) – Determines whether to quote integer components of paths to distinguish them from list indices. The default behavior is to quote them (e.g. '0', '123').

Raises:

PathCombinationError – If an error occurs during the combination process.

classmethod from_path_mappings(path_mappings: dict[ProcessingPath, Any], chain_map: bool = False, use_cache: bool | None = None) PathNodeIndex[source]

Takes a dictionary of path:value mappings and transforms the dictionary into a list of PathNodes: useful for later path manipulations such as grouping and consolidating paths into a flattened dictionary.

If use_cache is not specified, then the Mapping will use the class default to determine whether or not to cache.

Returns:

An index of PathNodes created from a dictionary

Return type:

PathNodeIndex

get_node(path: ProcessingPath | str) PathNode | None[source]

Try to retrieve a path node with the given path.

Parameters:

path (Union[ProcessingPath, str]) – The exact path to search for in the index.

Returns:

The exact node that matches the provided path.

Returns None if a match is not found

Return type:

Optional[PathNode]

node_map: PathNodeMap | RecordPathChainMap
property nodes: list[PathNode]

Returns a list of PathNodes stored within the index.

Returns:

The complete list of all PathNodes that have been registered in the PathIndex

Return type:

list[PathNode]

classmethod normalize_records(json_records: dict | list[dict], combine_keys: bool = True, object_delimiter: str | None = ';', parallel: bool = False) list[dict[str, Any]][source]

Full pipeline for processing a loaded JSON structure into a list of dictionaries where each individual list element is a processed and normalized record.

Parameters:
  • json_records (dict[str,Any] | list[dict[str,Any]]) – The JSON structure to normalize. If this structure is a dictionary, it will first be nested in a list as a single element before processing.

  • combine_keys (bool) – Determines whether or not to combine keys that are likely to denote names and corresponding values/counts. Default is True.

  • object_delimiter (Optional[str]) – Determines how terminal paths in lists under the same key are joined, collapsing the list into a singular string. If empty, terminal lists are returned as-is.

  • parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel

Return type:

list[dict[str,Any]]

property paths: list[ProcessingPath]

Returns a list of Paths stored within the index.

Returns:

The complete list of all paths that have been registered in the PathIndex

Return type:

list[ProcessingPath]

Attempt to find all values containing the specified pattern using regular expressions.

Parameters:

pattern (Union[str, re.Pattern]) – The pattern to match paths against.

Returns:

all paths and nodes that match the specified pattern

Return type:

dict[ProcessingPath, PathNode]

property record_indices: list[int]

Helper property for retrieving the full, sorted list of record indices across the current mapping of paths to nodes. It defers to the underlying map's implementation for the retrieval of record indices.

Returns:

A list containing integers denoting individual records found in each path.

Return type:

list[int]

search(path: ProcessingPath) list[PathNode][source]

Attempt to find all nodes that match the provided path or that have sub-paths exactly matching the provided path.

Parameters:

path (Union[str, ProcessingPath]) – The path to search for. Note that the provided path must exactly match a prefix/ancestor path of an indexed path to be considered a match.

Returns:

All nodes whose paths are equal to, or contain sub-paths exactly matching, the specified path.

Return type:

list[PathNode]

simplifier: PathSimplifier
simplify_to_rows(object_delimiter: str | None = ';', parallel: bool = False, max_components: int | None = None, remove_noninformative: bool = True) list[dict[str, Any]][source]

Simplify indexed nodes into a paginated data structure.

Parameters:
  • object_delimiter (str) – The separator to use when collapsing multiple values into a single string.

  • parallel (bool) – Whether or not the simplification into a flattened structure should occur in parallel

  • max_components (Optional[int]) – The maximum number of informative components to retain in each simplified name.

  • remove_noninformative (bool) – Whether to remove non-informative components when generating simplified names.

Returns:

A list of dictionaries representing the paginated data structure.

Return type:

list[dict[str, Any]]

use_cache: bool | None = None
class scholar_flux.utils.PathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])

Bases: UserDict[ProcessingPath, PathNode]

A dictionary-like class that maps Processing paths to PathNode objects.
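The mapping behaves like a dictionary keyed by paths and supports prefix-based filtering. The idea can be pictured with a plain-dict sketch (illustrative stand-ins only; `filter_by_prefix` and the sample data are not part of the scholar_flux API):

```python
# Plain-dict sketch of prefix filtering over path-keyed mappings.
# `filter_by_prefix` and the sample data are illustrative stand-ins,
# not part of the scholar_flux API.

def filter_by_prefix(node_map: dict, prefix: str, delimiter: str = ".") -> dict:
    """Return entries whose path equals the prefix or starts with prefix + delimiter."""
    return {
        path: node
        for path, node in node_map.items()
        if path == prefix or path.startswith(prefix + delimiter)
    }

node_map = {
    "0.authors.assistant": "Jane Doe",
    "0.authors.principle_investigator": "Dr. Smith",
    "0.title": "Sample Study",
}
print(filter_by_prefix(node_map, "0.authors"))
# {'0.authors.assistant': 'Jane Doe', '0.authors.principle_investigator': 'Dr. Smith'}
```

The real PathNodeMap layers validation, caching, and PathNode semantics on top of this basic lookup pattern.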

DEFAULT_USE_CACHE: bool = True
__init__(*nodes: PathNode | Generator[PathNode, None, None] | tuple[PathNode] | list[PathNode] | set[PathNode] | dict[str, PathNode] | dict[ProcessingPath, PathNode], use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]

Initializes the PathNodeMap instance.

add(node: PathNode, overwrite: bool | None = None, inplace: bool = True) PathNodeMap | None[source]

Add a node to the PathNodeMap instance.

Parameters:
  • node (PathNode) – The node to add.

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

Raises:

PathNodeMapError – If any error occurs while adding the node.

filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]

Filter the PathNodeMap for paths with the given prefix.

Parameters:
  • prefix (ProcessingPath) – The prefix to search for.

  • min_depth (Optional[int]) – The minimum depth to search for. Default is None.

  • max_depth (Optional[int]) – The maximum depth to search for. Default is None.

  • from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.

Returns:

A dictionary of paths with the given prefix and their corresponding terminal_nodes

Return type:

dict[ProcessingPath, PathNode]

Raises:

PathNodeMapError – If an error occurs while filtering the PathNodeMap.

classmethod format_mapping(key_value_pairs: PathNodeMap | MutableMapping[ProcessingPath, PathNode] | dict[str, PathNode]) dict[ProcessingPath, PathNode][source]

Takes a dictionary or a PathNodeMap, transforms any string keys into ProcessingPaths, and returns the resulting mapping.

Parameters:

key_value_pairs (Union[dict[ProcessingPath, PathNode], dict[str, PathNode]]) – The dictionary of key-value pairs to transform.

Returns:

a dictionary of validated path, node pairings

Return type:

dict[ProcessingPath, PathNode]

Raises:

PathNodeMapError – If the validation process fails.

classmethod format_terminal_nodes(node_obj: MutableMapping | PathNodeMap | PathNode) dict[ProcessingPath, PathNode][source]

Recursively iterate over nodes from PathNodeMaps and retrieve only terminal nodes.

Parameters:

node_obj (Union[MutableMapping, PathNodeMap, PathNode]) – A PathNodeMap or node dictionary containing either nested or already flattened terminal paths.

Returns:

the flattened terminal paths extracted from the inputted node_obj

Return type:

dict[ProcessingPath, PathNode]

get(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]

Gets an item from the PathNodeMap instance. If the value isn’t available, this method will return the value specified in default.

Parameters:
  • key (Union[str, ProcessingPath]) – The key (ProcessingPath) to retrieve. If a string is provided, it is coerced to a ProcessingPath.

  • default (Optional[PathNode]) – The value to return if the key is not found.

Returns:

The value (PathNode instance).

Return type:

PathNode

get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]

Helper method for retrieving a path node in a standardized way.

node_exists(node: PathNode | ProcessingPath) bool[source]

Helper method to validate whether the current node exists.

property nodes: list[PathNode]

Enables the retrieval of nodes stored within the current map as a property.

property paths: list[ProcessingPath]

Enables retrieval of paths stored within the current map as a property.

property record_indices: list[int]

Helper property for retrieving the full list of all record indices across all paths for the current map. Note: this assumes that all paths within the current map are derived from a list of records, where each path's first element denotes its initial position in a list with nested JSON components.

Returns:

A list containing integers denoting individual records found in each path

Return type:

list[int]

remove(node: ProcessingPath | PathNode | str, inplace: bool = True) PathNodeMap | None[source]

Remove the specified path or node from the PathNodeMap instance.

Parameters:
  • node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.

  • inplace (bool) – Whether to remove the path in-place or return a new PathNodeMap instance. Default is True.

Returns:

A new PathNodeMap instance with the specified paths removed if inplace is False; None otherwise.

Return type:

Optional[PathNodeMap]

Raises:

PathNodeMapError – If any error occurs while removing.

update(*args: Any, overwrite: bool | None = None, **kwargs: Mapping[str | ProcessingPath, PathNode]) None[source]

Updates the PathNodeMap instance with new key-value pairs.

Parameters:
  • *args (Union[PathNodeMap,dict[ProcessingPath, PathNode],dict[str, PathNode]]) – PathNodeMap or dictionary containing the key-value pairs to append to the PathNodeMap

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

  • **kwargs (PathNode) – Path Nodes, using the path as the argument name, to append to the PathNodeMap


class scholar_flux.utils.PathProcessingCache

Bases: object

The PathProcessingCache class implements a method of path caching that enables faster prefix searches and retrieval of terminal paths associated with a path to node mapping. This class is used within PathNodeMaps and RecordPathNodeMaps to increase the speed and efficiency of path discovery, processing, and filtering path-node mappings.

Because the primary purpose of the scholar_flux Trie-based path-node-processing implementation is the processing and preparation of highly nested JSON structures from API responses, the PathProcessingCache was created to efficiently keep track of all descendants of a terminal node with weak references and to facilitate the filtering and flattening of path-node combinations.

Stale data is automatically removed to reduce the number of comparisons needed to retrieve terminal paths only, and, as a result, later steps can more efficiently filter the complete list of terminal paths with faster path prefix searches to facilitate processing using Path-Node Maps and Indexes when processing JSON data structures.
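The weak-reference bookkeeping described above can be sketched with stdlib types (a hypothetical `Path` stand-in, not the actual ProcessingPath or PathProcessingCache implementation):

```python
# Sketch of the weak-reference prefix-cache idea: each prefix maps to a
# WeakSet, so a cached path disappears automatically once it is garbage
# collected. Stand-in types only; not the PathProcessingCache internals.
from collections import defaultdict
from weakref import WeakSet


class Path:
    """Hypothetical weak-referenceable stand-in for ProcessingPath."""

    def __init__(self, dotted: str):
        self.components = tuple(dotted.split("."))


cache = defaultdict(WeakSet)


def register(path: Path) -> None:
    # Index the path under every ancestor prefix for fast prefix lookups.
    for i in range(1, len(path.components) + 1):
        cache[".".join(path.components[:i])].add(path)


p = Path("0.authors.assistant")
register(p)
assert len(cache["0.authors"]) == 1  # reachable while `p` is alive
del p  # on CPython, the WeakSet entries vanish once `p` is collected
```

This mirrors how stale entries can drop out of the cache without explicit invalidation, keeping prefix searches over live paths cheap.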

__init__() None[source]

Initializes the ProcessingCache instance.

_cache

Underlying cache data structure that keeps track of all descendants that begin with the current prefix by mapping path strings to WeakSets that automatically remove ProcessingPaths when garbage collected

Type:

defaultdict[str, WeakSet[ProcessingPath]]

updates

Implements a lazy caching system that only adds elements to the _cache when filtering and node retrieval is explicitly required. The implementation uses weakly referenced keys to remove cached paths to ensure that references are deleted when a lazy operation is no longer needed.

Type:

WeakKeyDictionary[ProcessingPath, Literal[‘add’, ‘remove’]]

cache_update() None[source]

Initializes the lazy updates for the cache given the current update instructions.

filter(prefix: ProcessingPath, min_depth: int | None = None, max_depth: int | None = None) Set[ProcessingPath][source]

Filter the cache for paths with the given prefix.

Parameters:
  • prefix (ProcessingPath) – The prefix to search for.

  • min_depth (Optional[int]) – The minimum depth to search for. Default is None.

  • max_depth (Optional[int]) – The maximum depth to search for. Default is None.

Returns:

A set of paths with the given prefix.

Return type:

Set[ProcessingPath]

lazy_add(path: ProcessingPath) None[source]

Add a path to the cache for faster prefix searches.

Parameters:

path (ProcessingPath) – The path to add to the cache.

lazy_remove(path: ProcessingPath) None[source]

Remove a path from the cache.

Parameters:

path (ProcessingPath) – The path to remove from the cache.

property path_cache: defaultdict[str, WeakSet[ProcessingPath]]

Helper method that allows for inspection of the ProcessingCache and automatically updates the node cache prior to retrieval.

Returns:

The underlying cache used within the ProcessingCache to retrieve a list of all currently active terminal nodes.

Return type:

defaultdict[str, WeakSet[ProcessingPath]]

class scholar_flux.utils.PathSimplifier(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>)

Bases: object

A utility class for simplifying and managing Processing Paths.

Parameters:
  • delimiter (str) – The delimiter to use when splitting paths.

  • non_informative (Optional[List[str]]) – A list of non-informative components to remove from paths.

delimiter

The delimiter used to separate components in the path.

Type:

str

non_informative

A list of non-informative components to be removed during simplification.

Type:

List[str]

name_mappings

A dictionary for tracking unique names to avoid collisions.

Type:

Dict[ProcessingPath, str]

__init__(delimiter: str = '.', non_informative: list[str] = <factory>, name_mappings: ~typing.Dict[~scholar_flux.utils.paths.ProcessingPath, str] = <factory>) None
clear_mappings() None[source]

Clear all existing path mappings.

Example

>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.clear_mappings()
>>> simplifier.get_mapped_paths()

Output:

{}

delimiter: str = '.'
generate_unique_name(path: ProcessingPath, max_components: int | None, remove_noninformative: bool = False) ProcessingPath[source]

Generate a unique name for the given Processing Path.

Parameters:
  • path (ProcessingPath) – The ProcessingPath object representing the path components.

  • max_components (int) – The maximum number of components to use in the name.

  • remove_noninformative (bool) – Whether to remove non-informative components.

Returns:

A unique ProcessingPath name.

Return type:

ProcessingPath

Raises:

PathSimplificationError – If an error occurs during name generation.

get_mapped_paths() Dict[ProcessingPath, str][source]

Get the current name mappings.

Returns:

The dictionary of mappings from original paths to simplified names.

Return type:

Dict[ProcessingPath, str]

Example

>>> simplifier = PathSimplifier()
>>> simplifier.simplify_paths(['a/b/c', 'a/b/d'], 2)
>>> simplifier.get_mapped_paths()

Output:

{ProcessingPath('a/b/c'): 'c', ProcessingPath('a/b/d'): 'd'}

name_mappings: Dict[ProcessingPath, str]
non_informative: list[str]
simplify_paths(paths: List[ProcessingPath | str] | Set[ProcessingPath | str], max_components: int | None, remove_noninformative: bool = False) Dict[ProcessingPath, str][source]

Simplify paths by removing non-informative components and selecting the last ‘max_components’ informative components.

Parameters:
  • paths (List[Union[ProcessingPath, str]]) – List of path strings or ProcessingPaths to simplify.

  • max_components (int) – The maximum desired number of informative components to retain in the simplified path.

  • remove_noninformative (bool) – Whether to remove non-informative components.

Returns:

A dictionary mapping the original path to its simplified unique group name

for all elements within the same path after removing indices

Return type:

Dict[ProcessingPath, str]

Raises:

PathSimplificationError – If an error occurs during path simplification.
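The strategy described above can be sketched in a few lines (illustrative names; the collision-handling details of the real PathSimplifier may differ):

```python
# Minimal sketch of the simplification strategy: drop non-informative
# components, keep the last `max_components` components, and widen the
# suffix on name collisions. Not the PathSimplifier implementation.

def simplify(paths, max_components, non_informative, delimiter="."):
    mapping, used = {}, set()
    for path in paths:
        parts = [c for c in path.split(delimiter) if c not in non_informative]
        keep = max_components
        name = delimiter.join(parts[-keep:])
        # On collision, include one more trailing component until unique.
        while name in used and keep < len(parts):
            keep += 1
            name = delimiter.join(parts[-keep:])
        used.add(name)
        mapping[path] = name
    return mapping

print(simplify(["a.b.value.c", "a.d.c"], 1, {"value"}))
# {'a.b.value.c': 'c', 'a.d.c': 'd.c'}
```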

simplify_to_row(terminal_nodes: List[PathNode] | Set[PathNode], collapse: str | None = ';') Dict[str, Any][source]

Simplify terminal nodes by mapping them to their corresponding unique names.

Parameters:
  • terminal_nodes (List[PathNode]) – A list of PathNode objects representing the terminal nodes.

  • collapse (Optional[str]) – The separator to use when collapsing multiple values into a single string.

Returns:

A dictionary mapping unique names to their corresponding values or collapsed strings.

Return type:

Dict[str, Union[List[str], str]]

Raises:

PathSimplificationError – If an error occurs during simplification.

class scholar_flux.utils.PathUtils[source]

Bases: object

Helper class used to perform string/list manipulations for paths that can be represented in either form, requiring conversion from one type to the other in specific JSON path processing scenarios.

CONSTANT: str = 'i'
DELIMITER: str = '.'
IGNORE_KEYS: set = {'value'}
classmethod constant_path_indices(path: str | List[Any], constant: str | None = None) List[Any][source]

Replace integer indices with constants in the provided path.

Parameters:
  • path (List[Any]) – The original path containing both keys and indices.

  • constant (Optional[str]) – A value to replace a numeric value with. If not provided, the CONSTANT class variable is used instead.

Returns:

A path with only the key names.

Return type:

List[Any]

static group_path_assignments(path: List[Any]) str | None[source]

Group the path assignments into a single string, excluding indices.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A single string representing the grouped path, or None if the path is empty.

Return type:

Optional[str]

classmethod path_name(level_names: List[Any], delimiter: str | None = None) str[source]

Generate a string representation of the path based on the provided level names.

The path name is chosen starting from the last non-numeric key in a list of path elements.

Parameters:

level_names (List[Any]) – A list of names representing the path levels.

delimiter (Optional[str]):

A delimiter used to join levels that, together, form the name of a path. If not specified, the class-level delimiter is used.

Returns:

A string representation of the path.

Return type:

str

classmethod path_split(path: str, delimiter: str | None = None) List[str][source]

Splits a path on the cls.DELIMITER value.

Parameters:
  • path (str) – A string-based path to be split into a list

  • delimiter (Optional[str]) – A delimiter used to split a path string. If not specified, the class-level delimiter is used.

Returns:

A list containing each level of a path as a string element.

Return type:

List[str]

classmethod path_str(level_names: List[Any], delimiter: str | None = None) str[source]

Join the level names into a single string separated by the delimiter.

Parameters:
  • level_names (List[Any]) – A list of names representing the path levels.

  • delimiter (Optional[str]) – A delimiter used to join a path from its keys. If not specified, the class-level delimiter is used.

Returns:

A single string with level names joined by the delimiter.

Return type:

str

classmethod remove_path_indices(path: str | List[Any]) List[Any][source]

Remove integer indices from the path to get a list of key names.

Parameters:

path (List[Any]) – The original path containing both keys and indices.

Returns:

A path with only the key names.

Return type:

List[Any]

classmethod to_path_sequence(path: str | List[str] | List[str | int], delimiter: str | None = None) List[str] | List[str | int][source]

Convert a path input (string or list) to a normalized path sequence.

Parameters:
  • path (str | List[str] | List[str | int]) – Either a delimited string or list of path components

  • delimiter (Optional[str]) – Optional delimiter for string paths

Returns:

List of path components (strings and/or integers)

Return type:

PathSequence

Examples

>>> PathUtils.to_path_sequence("authors.0.name")
['authors', '0', 'name']
>>> PathUtils.to_path_sequence(["authors", 0, "name"])
['authors', 0, 'name']
class scholar_flux.utils.ProcessingPath(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)

Bases: object

A utility class to handle path operations for processing and flattening dictionaries.

Parameters:
  • components (Union[str, int, Tuple[str, ...], List[str], List[int], List[str | int]]) – The initial path, either as a string or a list of strings. Any integers will be auto-converted to strings in the process of formatting the components of the path

  • component_types (Optional[Union[Tuple[str, ...], List[str]]]) – Optional metadata fields that can be used to annotate specific components of a path

  • delimiter (str) – The delimiter used to separate components in the path.

components

A tuple of path components.

Type:

Tuple[str, …]

delimiter

The delimiter used to separate components in the path.

Type:

str

Examples

>>> from scholar_flux.utils import ProcessingPath
>>> abc_path = ProcessingPath(['a', 'b', 'c'], delimiter ='//')
>>> updated_path = abc_path / 'd'
>>> assert updated_path.depth > 3 and updated_path[-1] == 'd'
>>> assert str(updated_path) == 'a//b//c//d'
>>> assert updated_path.has_ancestor(abc_path)
DEFAULT_DELIMITER: ClassVar[str] = '.'
__init__(components: str | int | Tuple[str, ...] | List[str] | List[int] | List[str | int] = (), component_types: Tuple[str, ...] | List[str] | None = None, delimiter: str | None = None)[source]

Initializes the ProcessingPath. The inputs are first validated to ensure that the path components and delimiters are valid.

Parameters:
  • components – (Union[str, int, Tuple[str, …], List[str], List[int], List[str | int]]): The current path keys describing the path where each key represents a nested key in a JSON structure

  • component_types – (Optional[Union[Tuple[str, …], List[str]]]): An iterable of component types (used to annotate the components)

  • delimiter – (Optional[str]): The separator used to indicate separate nested keys in a JSON structure. Defaults to the class default if not directly specified.

append(component: int | str, component_type: str | None = None) ProcessingPath[source]

Append a component to the path and return a new ProcessingPath object.

Parameters:

component (str) – The component to append.

Returns:

A new ProcessingPath object with the appended component.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the component is not a non-empty string.

component_types: Tuple[str, ...] | None = None
components: Tuple[str, ...]
copy() ProcessingPath[source]

Create a copy of the ProcessingPath.

Returns:

A new ProcessingPath object with the same components and delimiter.

Return type:

ProcessingPath

delimiter: str = ''
property depth: int

Return the depth of the path.

Returns:

The number of components in the path.

Return type:

int

get_ancestors() List[ProcessingPath | None][source]

Get all ancestor paths of the current ProcessingPath.

Returns:

  • Contains a list of all ancestor paths for the current path

  • If the depth of the path is 1, an empty list is returned

Return type:

List[Optional[ProcessingPath]]

get_name(max_components: int = 1) ProcessingPath[source]

Generate a path name based on the last ‘max_components’ components of the path.

Parameters:

max_components (int) – The maximum number of components to include in the name (default is 1).

Returns:

A new ProcessingPath object representing the generated name.

Return type:

ProcessingPath

get_parent(step: int = 1) ProcessingPath | None[source]

Get the ancestor path of the current ProcessingPath by the specified number of steps.

This method navigates up the path structure by the given number of steps. If the step count is greater than or equal to the depth of the current path, or if the path is already the root, it returns None. If the step count equals the current depth, it returns the root ProcessingPath.

Parameters:

step (int) – The number of levels up to retrieve. 1 for parent, 2 for grandparent, etc. (default is 1).

Returns:

  • The ancestor ProcessingPath if the step is within the path depth.

  • The root ProcessingPath if step equals the depth of the current path.

  • None if the step is greater than the current depth or if the path is already the root.

Return type:

Optional[ProcessingPath]

Raises:

ValueError – If the step is less than 1.

group(last_only: bool = False) ProcessingPath[source]

Attempt to retrieve the path omitting the last element if it is numeric. The remaining integers are replaced with a placeholder (i). This is useful later when grouping paths into lists or sets in order to consolidate record fields.

Parameters:

last_only (bool) – Determines whether or not to replace all list indices vs removing only the last

Returns:

A ProcessingPath instance with the last numeric component removed and indices replaced.

Return type:

ProcessingPath
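The grouping behavior can be sketched over plain component tuples (a hypothetical `group` function, not the ProcessingPath method itself):

```python
# Illustrative sketch of the grouping behavior: drop a trailing list index
# and replace remaining numeric components with a placeholder. Stand-in
# function only; not the ProcessingPath implementation.

def group(components, placeholder="i", last_only=False):
    # Drop the final component when it is a bare list index.
    if components and components[-1].isdigit():
        components = components[:-1]
    if last_only:
        return components
    # Replace any remaining list indices with the placeholder.
    return tuple(placeholder if c.isdigit() else c for c in components)

print(group(("0", "authors", "1")))                  # ('i', 'authors')
print(group(("0", "authors", "1"), last_only=True))  # ('0', 'authors')
```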

has_ancestor(path: str | ProcessingPath) bool[source]

Determine whether the provided path is equal to, or a prefix/ancestor of, the current path (self).

Parameters:

path (ProcessingPath) – The potential prefix/ancestor of the current (self) ProcessingPath.

Returns:

True if ‘self’ is a superset of ‘path’. False Otherwise.

Return type:

bool

static infer_delimiter(path: str | ProcessingPath, delimiters: list[str] = ['<>', '//', '/', '>', '<', '\\', '%', '.']) str | None[source]

Infer the delimiter used in the path string based on its string representation.

Parameters:
  • path (Union[str,ProcessingPath]) – The path string to infer the delimiter from.

  • delimiters (List[str]) – A list of common delimiters to search for in the path.

Returns:

The inferred delimiter, or None if no delimiter is found.

Return type:

Optional[str]
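A minimal sketch of this inference, assuming a first-match scan over the candidate list (the real method may weigh candidates differently):

```python
# Sketch of delimiter inference: scan candidate delimiters in order and
# return the first one found in the path string, or None when nothing
# matches. Illustrative only; not the ProcessingPath implementation.

def infer_delimiter(path, delimiters=("<>", "//", "/", ">", "<", "\\", "%", ".")):
    for candidate in delimiters:
        if candidate in path:
            return candidate
    return None

print(infer_delimiter("a//b//c"))  # '//'
print(infer_delimiter("plain"))    # None
```

Note that multi-character candidates such as `<>` and `//` are listed before their single-character counterparts so they match first.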

info_content(non_informative: List[str]) int[source]

Calculate the number of informative components in the path.

Parameters:

non_informative (List[str]) – A list of non-informative components.

Returns:

The number of informative components.

Return type:

int

is_ancestor_of(path: str | ProcessingPath) bool[source]

Determine whether the current path (self) is equal to, or a prefix/ancestor of, the specified path.

Parameters:

path (ProcessingPath) – The potential superset of (self) ProcessingPath.

Returns:

True if ‘self’ is a subset of ‘path’. False Otherwise.

Return type:

bool

property is_root: bool

Check if the path represents the root node.

Returns:

True if the path is root, False otherwise.

Return type:

bool

classmethod keep_descendants(paths: List[ProcessingPath]) List[ProcessingPath][source]

Filters a list of paths and keeps only descendants.

property record_index: int

Extract the first element of the current path to determine the record number if the current path refers back to a paginated structure.

Returns:

The first value, converted to an integer if possible

Return type:

int

Raises:

PathIndexingError – if the first element of the path is not a numerical index

remove(removal_list: List[str]) ProcessingPath[source]

Remove specified components from the path.

Parameters:

removal_list (List[str]) – A list of components to remove.

Returns:

A new ProcessingPath object without the specified components.

Return type:

ProcessingPath

remove_by_type(removal_list: List[str], raise_on_error: bool = False) ProcessingPath[source]

Remove specified component types from the path.

Parameters:

removal_list (List[str]) – A list of component types to remove.

Returns:

A new ProcessingPath object without the specified components.

Return type:

ProcessingPath

remove_indices(num: int = -1, reverse: bool = False) ProcessingPath[source]

Remove numeric components from the path.

Parameters:

num (int) – The number of numeric components to remove. If negative, removes all (default is -1).

Returns:

A new ProcessingPath object without the specified numeric components.

Return type:

ProcessingPath

replace(old: str, new: str) ProcessingPath[source]

Replace occurrences of a component in the path.

Parameters:
  • old (str) – The component to replace.

  • new (str) – The new component to replace the old one with.

Returns:

A new ProcessingPath object with the replaced components.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the replacement arguments are not strings.

replace_indices(placeholder: str = 'i') ProcessingPath[source]

Replace numeric components in the path with a placeholder.

Parameters:

placeholder (str) – The placeholder to replace numeric components with (default is ‘i’).

Returns:

A new ProcessingPath object with numeric components replaced by the placeholder.

Return type:

ProcessingPath
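
Replacing numeric components with a placeholder lets paths that differ only by list position collapse to one normalized path. A minimal sketch of the idea (not the library's implementation):

```python
def replace_indices(components: list, placeholder: str = "i") -> list:
    """Replace purely numeric components with a placeholder string."""
    return [placeholder if str(c).isdigit() else str(c) for c in components]

path = ["records", "0", "authors", "3", "name"]
print(".".join(replace_indices(path)))  # records.i.authors.i.name
```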

replace_path(old: str | ProcessingPath, new: str | ProcessingPath, component_types: List | Tuple | None = None) ProcessingPath[source]

Replace an ancestor path or full path in the current ProcessingPath with a new path.

Parameters:
  • old (Union[str, ProcessingPath]) – The path to replace.

  • new (Union[str, ProcessingPath]) – The new path to replace the old path ancestor or full path with.

Returns:

A new ProcessingPath object with the replaced components.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the replacement arguments are not strings or ProcessingPaths.

reversed() ProcessingPath[source]

Returns a reversed ProcessingPath built from the current path.

Returns:

A new ProcessingPath object with the same components/types in a reversed order

Return type:

ProcessingPath

sorted() ProcessingPath[source]

Returns a sorted ProcessingPath from the current_path. Elements are sorted by component in alphabetical order.

Returns:

A new ProcessingPath object with the same components/types sorted in alphabetical order

Return type:

ProcessingPath

to_list() List[str][source]

Convert the ProcessingPath to a list of components.

Returns:

A list of components in the ProcessingPath.

Return type:

List[str]

to_pattern(escape_all: bool = False) Pattern[source]

Convert the ProcessingPath to a regular expression pattern.

Returns:

The regular expression pattern representing the ProcessingPath.

Return type:

Pattern

classmethod to_processing_path(path: ProcessingPath | str | int | List[str] | List[int] | List[str | int], component_types: list | tuple | None = None, delimiter: str | None = None, infer_delimiter: bool = False) ProcessingPath[source]

Convert an input to a ProcessingPath instance if it’s not already.

Parameters:
  • path (Union[ProcessingPath, str, int, List[str], List[int], List[str | int]]) – The input path to convert.

  • component_types (list|tuple) – The type of component associated with each path element

  • delimiter (str) – The delimiter to use if the input is a string.

  • infer_delimiter (bool) – Whether to infer the delimiter from the input string when a delimiter is not provided.

Returns:

A ProcessingPath instance.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.

to_string() str[source]

Get the string representation of the ProcessingPath.

Returns:

The string representation of the ProcessingPath.

Return type:

str

update_delimiter(new_delimiter: str) ProcessingPath[source]

Update the delimiter of the current ProcessingPath with the provided new delimiter.

This method creates a new ProcessingPath instance with the same components but replaces the existing delimiter with the specified new_delimiter.

Parameters:

new_delimiter (str) – The new delimiter to replace the current one.

Returns:

A new ProcessingPath instance with the updated delimiter.

Return type:

ProcessingPath

Raises:

InvalidPathDelimiterError – If the provided new_delimiter is not valid.

Example

>>> processing_path = ProcessingPath('a.b.c', delimiter='.')
>>> updated_path = processing_path.update_delimiter('/')
>>> print(updated_path)  # Output: ProcessingPath(a/b/c)
classmethod with_inferred_delimiter(path: ProcessingPath | str, component_types: List | Tuple | None = None) ProcessingPath[source]

Converts an input to a ProcessingPath instance, inferring the delimiter from the path string, if it is not already a ProcessingPath.

Parameters:
  • path (Union[ProcessingPath, str]) – The input path to convert. If a string, the delimiter is inferred from its contents.

  • component_types (list|tuple) – The type of component associated with each path element.

Returns:

A ProcessingPath instance.

Return type:

ProcessingPath

Raises:

InvalidProcessingPathError – If the input cannot be converted to a valid ProcessingPath.

class scholar_flux.utils.RecordPathChainMap(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap])

Bases: UserDict[int, RecordPathNodeMap]

A dictionary-like class that maps record indices to RecordPathNodeMap instances, chaining per-record path-node maps together.

DEFAULT_USE_CACHE = True
__init__(*record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap], use_cache: bool | None = None, **path_record_maps: RecordPathNodeMap | PathNodeMap | PathNode | Generator[PathNode, None, None] | Sequence[PathNode] | Mapping[int | str | ProcessingPath, PathNode] | Mapping[int, PathNodeMap]) None[source]

Initializes the RecordPathChainMap instance.

add(node: PathNode | RecordPathNodeMap, overwrite: bool | None = None) None[source]

Add a node to the PathNodeMap instance.

Parameters:
  • node (PathNode) – The node to add.

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

Raises:

PathNodeMapError – If any error occurs while adding the node.

filter(prefix: ProcessingPath | str | int, min_depth: int | None = None, max_depth: int | None = None, from_cache: bool | None = None) dict[ProcessingPath, PathNode][source]

Filter the RecordPathChainMap for paths with the given prefix.

Parameters:
  • prefix (ProcessingPath) – The prefix to search for.

  • min_depth (Optional[int]) – The minimum depth to search for. Default is None.

  • max_depth (Optional[int]) – The maximum depth to search for. Default is None.

  • from_cache (Optional[bool]) – Whether to use cache when filtering based on a path prefix.

Returns:

A dictionary of paths with the given prefix and their corresponding terminal_nodes

Return type:

dict[ProcessingPath, PathNode]

Raises:

RecordPathNodeMapError – If an error occurs while filtering the PathNodeMap.

get(key: str | ProcessingPath, default: RecordPathNodeMap | None = None) RecordPathNodeMap | None[source]

Gets an item from the RecordPathNodeMap instance. If the value isn’t available, this method will return the value specified in default.

Parameters:

key (Union[str, ProcessingPath]) – The key (processing path). If a string, it is coerced to a ProcessingPath.

Returns:

A record map instance

Return type:

RecordPathNodeMap

get_node(key: str | ProcessingPath, default: PathNode | None = None) PathNode | None[source]

Helper method for retrieving a path node in a standardized way across PathNodeMaps.

node_exists(node: PathNode | ProcessingPath) bool[source]

Helper method to validate whether the current node exists.

property nodes: list[PathNode]

Enables looping over nodes stored across maps.

property paths: list[ProcessingPath]

Enables looping over paths stored across maps.

property record_indices: list[int]

Helper property for retrieving the full list of all record indices across all paths in the current map.

Note: a core requirement of the ChainMap is that each RecordPathNodeMap indicates the position of a record in a nested JSON structure. This property quickly retrieves the full, sorted list of record_indices.

Returns:

A list containing integers denoting individual records found in each path

Return type:

list[int]

remove(node: ProcessingPath | PathNode | str) None[source]

Remove the specified path or node from the PathNodeMap instance.

Parameters:

node (Union[ProcessingPath, PathNode, str]) – The path or node to remove.

Raises:

PathNodeMapError – If any error occurs while removing.

update(*args: Any, overwrite: bool | None = None, **kwargs: dict[str, PathNode] | dict[str | ProcessingPath, RecordPathNodeMap]) None[source]

Updates the PathNodeMap instance with new key-value pairs.

Parameters:
  • *args (Union["PathNodeMap",dict[ProcessingPath, PathNode],dict[str, PathNode]]) – PathNodeMap or dictionary containing the key-value pairs to append to the PathNodeMap

  • overwrite (bool) – Flag indicating whether to overwrite existing values if the key already exists.

  • **kwargs (PathNode) – Path Nodes, using each path as the argument name, to append to the PathNodeMap

class scholar_flux.utils.RecordPathNodeMap(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode])

Bases: PathNodeMap

A dictionary-like class that maps Processing paths to PathNode objects using record indexes.

This implementation inherits from the PathNodeMap class and constrains the allowed nodes to those that begin with a numeric record index, where each index indicates a record and the nodes represent values associated with that record.

__init__(*nodes: PathNode | Generator[PathNode, None, None] | set[PathNode] | Sequence[PathNode] | Mapping[str | ProcessingPath, PathNode], record_index: int | str | None = None, use_cache: bool | None = None, allow_terminal: bool | None = False, overwrite: bool | None = True, **path_nodes: Mapping[str | ProcessingPath, PathNode]) None[source]

Initializes the RecordPathNodeMap using a similar set of inputs as the original PathNodeMap.

This implementation constrains the input nodes to a single numeric record index that all node paths must begin with. If nodes are provided without the key, the record_index is inferred from the inputs.

classmethod from_mapping(mapping: dict[str | ProcessingPath, PathNode] | PathNodeMap | Sequence[PathNode] | set[PathNode] | RecordPathNodeMap, use_cache: bool | None = None) RecordPathNodeMap[source]

Helper method for coercing types into a RecordPathNodeMap.

class scholar_flux.utils.RecursiveJsonProcessor(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False, path_delimiter: str | None = None)[source]

Bases: object

An implementation of a recursive JSON dictionary processor that is used to process and identify nested components such as paths, terminal key names, and the data at each terminal path.

The utility of the RecursiveJsonProcessor lies in flattening nested dictionary records into flat representations whose keys are the terminal paths of each node and whose values are the data found at each terminal path.

__init__(json_dict: Dict | None = None, object_delimiter: str | None = '; ', normalizing_delimiter: str | None = None, use_full_path: bool | None = False, path_delimiter: str | None = None)[source]

Initialize the RecursiveJsonProcessor with a JSON dictionary and a delimiter for joining list elements.

Parameters:
  • json_dict (Dict) – The input JSON dictionary to be parsed.

  • object_delimiter (str) – The delimiter used to join the elements of maximum-depth list objects. Default is “; ”.

  • normalizing_delimiter (str) – The delimiter used to join elements across multiple keys when normalizing. Default is a newline (“\n”).

combine_normalized(normalized_field_value: list | str | None) list | str | None[source]

Combines lists of nested data (strings, ints, None, etc.) into a single string separated by the normalizing_delimiter.

If a delimiter isn’t specified or if the value is None, it is returned as is without modification.

create_record(obj: Any, path: List[Any]) List[JsonRecordData][source]

Helper method for creating a new record within the current JsonProcessor.

filter_extracted(exclude_keys: List[str] | None = None) Self[source]

Filter the extracted JSON dictionaries to exclude specified keys.

Parameters:

exclude_keys ([List[str]]) – List of keys to exclude from the flattened result.

flatten() Dict[str, List[Any] | str | None] | None[source]

Flatten the extracted JSON dictionary from a nested structure into a simpler structure.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]

process_and_flatten(obj: Dict | None = None, exclude_keys: List[str] | None = None, traversal_paths: List[str] | List[List[str]] | List[List[str | int]] | None = None, traverse_lists: bool = False) Dict[str, Any] | None[source]

Process the dictionary, filter extracted paths, and then flatten the result.

Parameters:
  • exclude_keys (Optional[List[str]]) – List of keys to exclude from the flattened result.

  • traversal_paths (Optional[List[str]]) – Optional ‘.’ delimited paths to constrain the extracted keys to. If omitted, all paths are traversed.

  • traverse_lists (bool) – Determines whether to automatically traverse and flatten list structures.

Returns:

A dictionary with flattened paths as keys and lists of values.

Return type:

Optional[Dict[str, List[Any]]]
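
The core flattening step, turning a nested dictionary into path-keyed terminal values, can be sketched with a short recursive function (an illustrative simplification: the real method also handles lists, key exclusion, and traversal paths):

```python
def flatten_json(obj, path="", delimiter="."):
    """Recursively walk a nested dict, producing {terminal_path: value} pairs."""
    if isinstance(obj, dict) and obj:
        flattened = {}
        for key, value in obj.items():
            child_path = f"{path}{delimiter}{key}" if path else str(key)
            flattened.update(flatten_json(value, child_path, delimiter))
        return flattened
    return {path: obj}

record = {"title": "A Study", "journal": {"name": "Nature", "issn": "0028-0836"}}
print(flatten_json(record))
# {'title': 'A Study', 'journal.name': 'Nature', 'journal.issn': '0028-0836'}
```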

process_dictionary(obj: Dict | None = None) Self[source]

Create a new json dictionary that contains information about the relative paths of each field that can be found within the current JSON dict.

process_level(obj: Any, level_name: List[Any] | None = None) List[Any][source]

Helper method for processing a level within a dictionary.

This method is recursively called to process nested components

traverse_dictionary(paths: List[str] | List[List[str]] | List[List[str | int]], obj: Dict | None = None, traverse_lists: bool = False) Self[source]

Create a new json dictionary by traversing ‘.’ delimited paths for json data found from a JSON Dict.

traverse_level(path: List[str] | List[str | int], obj: Any, level_name: List[Any] | None = None, traverse_lists: bool = False) List[Any][source]

Helper method for traversing a level within a dictionary while constraining keys to known paths.

This method is recursively called to traverse nested components using known keys

static unlist(current_data: Dict | List | None) Any | None[source]

Flattens a dictionary or list if it contains a single element that is a dictionary.

Parameters:

current_data – A dictionary or list to be flattened if it contains a single dictionary element.

Returns:

The flattened dictionary if the input meets the flattening condition, otherwise returns the input unchanged.

Return type:

Optional[Dict|List]
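
The list case of the documented flattening condition can be sketched as follows (a hypothetical stand-in; the dict case is analogous):

```python
def unlist(current_data):
    """Unwrap a single-element list whose only element is a dict; otherwise return the input unchanged."""
    if isinstance(current_data, list) and len(current_data) == 1 and isinstance(current_data[0], dict):
        return current_data[0]
    return current_data

print(unlist([{"doi": "10.1/x"}]))   # {'doi': '10.1/x'}
print(unlist([{"a": 1}, {"b": 2}]))  # unchanged: [{'a': 1}, {'b': 2}]
```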

class scholar_flux.utils.ResponseProtocol(*args, **kwargs)[source]

Bases: Protocol

Protocol for HTTP response objects compatible with requests.Response, httpx.Response, and other response classes.

This protocol defines the common interface shared between popular HTTP client libraries, allowing for type-safe interoperability.

The URL is kept flexible to allow for other types outside of the normal string including basic pydantic and httpx type for both httpx and other custom objects.

__init__(*args, **kwargs)
content: bytes
headers: MutableMapping[str, str]
raise_for_status() None[source]

Raises an exception for HTTP error status codes.

status_code: int
url: Any
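
The structural typing this protocol enables can be illustrated with a local look-alike (this sketch redefines a minimal ResponseProtocol with the documented members; the real class lives in scholar_flux.utils). Any object exposing these attributes satisfies the protocol without inheriting from requests or httpx:

```python
from typing import Any, MutableMapping, Protocol, runtime_checkable

@runtime_checkable
class ResponseProtocol(Protocol):
    """Structural stand-in mirroring the documented members."""
    content: bytes
    headers: MutableMapping[str, str]
    status_code: int
    url: Any

    def raise_for_status(self) -> None: ...

class MinimalResponse:
    """Satisfies the protocol structurally—no inheritance required."""
    def __init__(self, status_code: int = 200) -> None:
        self.content = b"{}"
        self.headers = {"Content-Type": "application/json"}
        self.status_code = status_code
        self.url = "https://api.example.org/search"

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

print(isinstance(MinimalResponse(), ResponseProtocol))  # True
```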
class scholar_flux.utils.ResponseSupportsJSONProtocol(*args, **kwargs)[source]

Bases: ResponseProtocol, Protocol

Extends the ResponseProtocol for the identification of response-like objects that support JSON deserialization.

JSON parsing is supported for python http clients such as requests and httpx.

Use this protocol to narrow response types when JSON parsing is required, such as response parsing and the extraction of error details for unsuccessful responses.

json(**kwargs: Any) Any[source]

Deserializes response content into JSON format.

scholar_flux.utils.adjust_repr_padding(obj: Any, pad_length: int | None = 0, flatten: bool | None = None) str[source]

Helper method for adjusting the padding for representations of objects.

Parameters:
  • obj (Any) – The object to generate an adjusted repr for

  • pad_length (Optional[int]) – Indicates the additional amount of padding that should be added. Helpful for when attempting to create nested representations formatted as intended.

  • flatten (bool) – Indicates whether to use newline characters. This is false by default

Returns:

A string representation of the current object that adjusts the padding accordingly

Return type:

str

scholar_flux.utils.as_list_1d(value: Any) list[source]

Nests a value into a single element list if the value is not already a list.

Parameters:

value (Any) – The value to add to a list if it is not already a list

Returns:

If already a list, the value is returned as is. Otherwise, the value is nested in a list. Caveat: if the value is None, an empty list is returned

Return type:

list
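
The documented behavior, including the None caveat, can be sketched in a few lines (an illustrative stand-in, not the library's source):

```python
def as_list_1d(value):
    """Wrap a non-list value in a single-element list; None yields an empty list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

print(as_list_1d("doi:10.1/x"))  # ['doi:10.1/x']
print(as_list_1d(["a", "b"]))    # ['a', 'b']
print(as_list_1d(None))          # []
```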

scholar_flux.utils.as_tuple(obj: object) tuple[source]

Converts objects into tuples when possible and nests objects within a tuple otherwise.

This function is useful as a preprocessing step for function calls that require tuples instead of lists, NoneTypes, and other data types.

Parameters:

obj (object) – The object to nest as a tuple

Returns:

The original object converted into a tuple

Return type:

tuple

scholar_flux.utils.coerce_bool(value: object, true_values: tuple[str, ...] = ('T', 'true', 'yes', '1'), false_values: tuple[str, ...] = ('F', 'false', 'no', '0')) bool | None[source]

Attempts to convert a value to a boolean value, returning None if the conversion fails.

Parameters:
  • value (object) – The value to attempt to convert into a boolean.

  • true_values (tuple[str, ...]) – Values to be mapped to True when matched by the input value.

  • false_values (tuple[str, ...]) – Values to be mapped to False when matched by the input value.

Returns:

The value converted into a boolean if possible, otherwise None.

Return type:

Optional[bool]

Examples

>>> from scholar_flux.utils.helpers import coerce_bool
>>> coerce_bool("TRUE")
True
>>> coerce_bool(1)
True
>>> coerce_bool(True, true_values=())
True
>>> coerce_bool("maybe", true_values=("Maybe",))
True
>>> coerce_bool("NO")
False
>>> coerce_bool("0")
False
>>> coerce_bool("Unknown?")
None
>>> coerce_bool("0", false_values=None)
None
scholar_flux.utils.coerce_bytes(value: object, encoding: str | None = 'utf-8') bytes | None[source]

Attempts to convert a value into bytes, if possible, returning None if conversion fails.

Parameters:
  • value (object) – The value to attempt to convert into a bytes object.

  • encoding (Optional[str]) – An optional value used to encode strings as bytes. Not relevant for other data types.

Returns:

The value converted into a bytes object if possible, otherwise None

Return type:

Optional[bytes]

scholar_flux.utils.coerce_flattened_str(value: object, delimiter: str = '; ') str | None[source]

Coerces strings or sequences of strings into a single, flattened string.

This function handles the common pattern of normalizing journal names, keywords, or other metadata that may arrive as either a string or list of strings. Sequences of strings are handled by joining them, and if a sequence cannot be converted to a sequence of strings, None is returned instead.

Parameters:
  • value (object) – A string, bytes, list/tuple of strings, or None

  • delimiter (str) – The string used to join list elements with (default: “; “)

Returns:

A single string (coerced or joined), or None if conversion fails

Return type:

Optional[str]
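
The string-or-list-of-strings pattern described above can be sketched as follows (a simplified stand-in: the real helper also accepts bytes, per its description):

```python
def coerce_flattened_str(value, delimiter="; "):
    """Pass strings through; join sequences of strings; return None when coercion fails."""
    if isinstance(value, str):
        return value
    if isinstance(value, (list, tuple)):
        return delimiter.join(value) if all(isinstance(v, str) for v in value) else None
    return None

print(coerce_flattened_str(["genomics", "bioinformatics"]))  # 'genomics; bioinformatics'
print(coerce_flattened_str("Nature"))                        # 'Nature'
print(coerce_flattened_str(42))                              # None
```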

scholar_flux.utils.coerce_int(value: object) int | None[source]

Attempts to convert a value to an integer, returning None if the conversion fails.

Parameters:

value (object) – The value to attempt to convert into a int.

Returns:

The value converted into an integer if possible, otherwise None.

Return type:

Optional[int]

scholar_flux.utils.coerce_json_str(data: object) str | None[source]

Attempts to convert a serializable list or mapping into a JSON string.

This method uses the json.dumps() function to serialize a JSON sequence or mapping, returning None if conversion fails.

Parameters:

data (object) – Attempts to coerce a JSON object as a string. This function attempts JSON string conversion and validation for Mapping, Sequence, str, and bytes data types. For all other data types, None is returned.

Returns:

The data coerced into a JSON string if possible, otherwise None.

Return type:

Optional[str]

Note

If the data is a string or bytes object, this method verifies that, when loaded with json.loads, the string is deserialized as a mapping or list. Otherwise, None is returned.

Examples

>>> from scholar_flux.utils.helpers import coerce_json_str
>>> coerce_json_str('{"a": 1, "b": 2}')  # already a json string, returned as is
# OUTPUT: '"a": 1, "b": 2"'
>>> coerce_json_str({"a": 1, "b": 2})  # already a json string, returned as is
# OUTPUT: '""a": 1, "b": 2"'
scholar_flux.utils.coerce_numeric(value: object) float | None[source]

Attempts to convert a value to a float, returning None if the conversion fails.

Parameters:

value (object) – The value to attempt to convert into a decimal value.

Returns:

The value converted into a float if possible, otherwise None.

Return type:

Optional[float]

Note

Conversion treats booleans as integers and converts them when observed. To avoid this, use conditional logic.

scholar_flux.utils.coerce_str(value: object, *, encoding: str | None = 'utf-8', errors: str | None = 'strict') str | None[source]

Attempts to convert a value into a string, if possible, returning None if conversion fails.

Parameters:
  • value (object) – The value to attempt to convert into a string.

  • encoding (Optional[str]) – An optional value used to decode byte strings. Not relevant for data of other types.

  • errors (Optional[str]) – An optional value for decoding errors with non-Unicode bytes characters. Not relevant for non-byte strings.

Returns:

The value converted into a string if possible, otherwise None.

Return type:

Optional[str]

scholar_flux.utils.extract_year(value: Any, format: str = '%Y-%m-%d') int | None[source]

Extract a 4-digit year from a date string.

Attempts to parse the value using the specified format, then falls back to regex extraction.

Parameters:
  • value (Any) – A value (generally a string or integer) potentially containing a year.

  • format (str) – The expected date format (strptime format string). Defaults to “%Y-%m-%d”.

Returns:

The extracted year as an integer, or None if extraction fails.

Return type:

Optional[int]

Examples

>>> from datetime import date
>>> from scholar_flux.utils.helpers import extract_year
>>> extract_year(date(2027,5, 5))
# OUTPUT: 2027
>>> extract_year("2026-03-01")
# OUTPUT: 2026
>>> extract_year("03/15/2024", format="%m/%d/%Y")
# OUTPUT: 2024
>>> extract_year("2023")
# OUTPUT: 2023
>>> extract_year(None)
# OUTPUT: None
scholar_flux.utils.filter_record_key_prefixes(record: Mapping[str, Any] | Mapping[str | int, Any], prefix: str, invert: bool = False) RecordType[source]

Removes or retains keys from dictionaries and mappings beginning with a specific string prefix.

Parameters:
  • record (Mapping[str, Any] | Mapping[str | int, Any]) – A dictionary record to filter keys containing specific prefixes

  • prefix (str) – The prefix to filter from the dictionary. Prefixes that are not strings will be coerced into strings internally, but only string-typed fields will be matched.

  • invert (bool) – If False, dictionary keys beginning with the prefix are removed (default behavior). If true, fields beginning with the prefix are retained instead.

Returns:

The filtered record after retaining (invert=True) or removing (invert=False) string prefixes.

Return type:

RecordType
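
The filtering logic, dropping string keys that start with a prefix, or keeping only those keys when invert=True, can be sketched with a dictionary comprehension (an illustrative stand-in):

```python
def filter_record_key_prefixes(record, prefix, invert=False):
    """Drop string keys starting with `prefix` (or keep only those keys when invert=True)."""
    def matches(key):
        return isinstance(key, str) and key.startswith(prefix)
    return {key: value for key, value in record.items() if matches(key) == invert}

record = {"_meta_id": 1, "title": "A Study", "_meta_score": 0.9}
print(filter_record_key_prefixes(record, "_meta"))               # {'title': 'A Study'}
print(filter_record_key_prefixes(record, "_meta", invert=True))  # {'_meta_id': 1, '_meta_score': 0.9}
```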

scholar_flux.utils.format_iso_timestamp(timestamp: datetime) str[source]

Formats a datetime as an ISO 8601 timestamp string in UTC with millisecond precision.

Parameters:

timestamp (datetime) – The datetime object to format.

Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123+00:00”)

Return type:

str
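
Millisecond-precision UTC formatting is a one-liner with the stdlib (a sketch of the documented behavior, assuming an aware datetime; the library's implementation may differ in how it handles naive inputs):

```python
from datetime import datetime, timezone

def format_iso_timestamp(timestamp: datetime) -> str:
    """Render a datetime in UTC as ISO 8601 with millisecond precision."""
    return timestamp.astimezone(timezone.utc).isoformat(timespec="milliseconds")

dt = datetime(2024, 3, 15, 14, 30, 0, 123000, tzinfo=timezone.utc)
print(format_iso_timestamp(dt))  # 2024-03-15T14:30:00.123+00:00
```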

scholar_flux.utils.format_repr_value(value: Any, pad_length: int | None = None, show_value_attributes: bool | None = None, flatten: bool | None = None, replace_numeric: bool | None = False) str[source]

Helper function for representing nested objects from custom classes.

Parameters:
  • value (Any) – The value containing the repr to format

  • pad_length (Optional[int]) – Indicates the total additional padding to add for each individual line

  • show_value_attributes (Optional[bool]) – If False, all attributes within the current object will be replaced with ‘…’. (e.g., StorageDevice(…))

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

Returns:

The formatted string representation of a value

Return type:

str

scholar_flux.utils.generate_iso_timestamp() str[source]

Generates and formats an ISO 8601 timestamp string in UTC with millisecond precision for reliable round-trip conversion.

Example usage:
>>> from scholar_flux.utils import generate_iso_timestamp, parse_iso_timestamp, format_iso_timestamp
>>> timestamp = generate_iso_timestamp()
>>> parsed_timestamp = parse_iso_timestamp(timestamp)
>>> assert parsed_timestamp is not None and format_iso_timestamp(parsed_timestamp) == timestamp
Returns:

ISO 8601 formatted timestamp (e.g., “2024-03-15T14:30:00.123+00:00”)

Return type:

str

scholar_flux.utils.generate_repr(obj: object, exclude: set[str] | list[str] | tuple[str] | None = None, show_value_attributes: bool = True, flatten: bool = False, replace_numeric: bool = False, as_dict: bool | None = False, resolve_property_attributes: bool = False, flatten_nested: bool | None = None) str[source]

Method for creating a basic representation of a custom object’s data structure. Useful for showing the options/attributes being used by an object.

In case the object doesn’t have a __dict__ attribute, the code will raise an AttributeError and fall back to using the basic string representation of the object.

Note that threading.Lock objects are excluded from the final representation.

Parameters:
  • obj (object) – The object whose attributes are to be represented.

  • exclude (Optional[set[str] | list[str] | tuple[str]]) – Attributes to exclude from the representation (default is None).

  • show_value_attributes (bool) – If False, nested attributes within elements will be replaced with ‘…’. e.g., RetryAttempt(…)

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

  • as_dict (bool) – Determines whether to represent the current class as a dictionary.

  • resolve_property_attributes (bool) – Determines whether to substitute properties pointing to private attributes.

  • flatten_nested (Optional[bool]) – Indicates whether to use newline characters to create a representation of nested objects or to flatten them into a single line. If None, nested objects are flattened only if flatten=True.

Returns:

A string representing the object’s attributes in a human-readable format.

scholar_flux.utils.generate_repr_from_string(class_name: str, attribute_dict: dict[str, Any], show_value_attributes: bool | None = None, flatten: bool | None = False, replace_numeric: bool | None = False, as_dict: bool | None = False, flatten_nested: bool | None = None) str[source]

Method for creating a basic representation of a custom object’s data structure. Allows for the direct creation of a repr using the classname as a string and the attribute dict that will be formatted and prepared for representation of the attributes of the object.

Parameters:
  • class_name – The class name of the object whose attributes are to be represented.

  • attribute_dict (dict) – A dictionary containing attributes to format into the components of a repr.

  • show_value_attributes (bool) – If False, nested attributes within elements will be replaced with ‘…’. e.g., RetryAttempt(…).

  • flatten (bool) – Determines whether to show each individual value inline or separated by a newline character.

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

  • as_dict (Optional[bool]) – Determines whether to represent the current class as a dictionary.

  • flatten_nested (Optional[bool]) – Indicates whether to use newline characters to create a representation of nested objects or to flatten them into a single line. False by default.

Returns:

A string representing the object’s attributes in a human-readable format.

Return type:

str

scholar_flux.utils.generate_response_hash(response: Response | ResponseProtocol) str[source]

Generates a response hash from a response or response-like object that implements the ResponseProtocol.

Parameters:

response (requests.Response | ResponseProtocol) – An http response or response-like object.

Returns:

A unique identifier for the response.

Return type:

str

scholar_flux.utils.generate_sequence_repr(obj: Sequence | set, flatten: bool = False, show_value_attributes: bool = True, replace_numeric: bool = False, brackets: tuple[str, str] | None = ('[', ']'), flatten_nested: bool | None = None) str[source]

Method for creating a basic representations for sequence-like data structures.

This function generates formatted str representations for collections such as list, tuple, deque, and custom sequence data types. A string representation is also created for nested elements using generate_repr internally.

When this function encounters an error, the method internally falls back to using the str function to create a basic string representation.

Parameters:
  • obj (Sequence) – The sequence-like object to create a string representation for

  • flatten (bool) – Indicates whether to use newline characters. This is false by default

  • show_value_attributes (bool) – If False, nested attributes within elements will be replaced with ‘…’. e.g., RetryAttempt(…)

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

  • brackets (Optional[tuple[str, str]]) – Opening and closing brackets for the sequence (default: “[”, “]”).

  • flatten_nested (Optional[bool]) – Indicates whether to use newline characters to create a representation of nested objects or to flatten them into a single line. If None, nested objects are flattened only if flatten=True.

Returns:

A string representing the sequence’s elements in a human-readable format.

Examples

>>> from collections import deque
>>> from scholar_flux.utils import generate_sequence_repr
>>> items = deque([{"a": 1}, {"b": 2}])
>>> print(generate_sequence_repr(items, flatten=True))
# OUTPUT: deque([{'a': 1}, {'b': 2}])
>>> print(generate_sequence_repr(items, flatten=False))
# OUTPUT: deque([{'a': 1},
                 {'b': 2}])
>>> print(generate_sequence_repr([1, 2, 3], flatten=True, brackets=None))
# OUTPUT: list((1, 2, 3))
scholar_flux.utils.get_first_available_key(data: Mapping[H | str, Any], keys: Sequence[H | str], default: T | None = None, case_sensitive: bool = True) Any | T[source]

Extracts the first key from a sequence of keys that can be found within a dictionary.

Parameters:
  • data (Mapping[H | str, Any]) – A dictionary or dictionary-like object to extract an existing data element from.

  • keys (Sequence[H | str]) – A sequence or set of keys used for the extraction of the first available data element.

  • default (T) – The value to use if none of the checked keys are available in the dictionary.

  • case_sensitive (bool) – Defines whether data element retrieval should rely on case sensitivity (Default=True).

Returns:

The value associated with the first available dictionary key

Return type:

Any
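
The documented lookup behavior can be sketched as follows. Note that this is an illustrative approximation, not the library's actual implementation, and the helper name `first_available` is hypothetical:

```python
from typing import Any, Mapping, Sequence

def first_available(data: Mapping[str, Any], keys: Sequence[str],
                    default: Any = None, case_sensitive: bool = True) -> Any:
    """Return the value of the first key found in the mapping, else a default."""
    if not case_sensitive:
        # Normalize both the mapping keys and the search keys to lowercase
        data = {str(k).lower(): v for k, v in data.items()}
        keys = [str(k).lower() for k in keys]
    for key in keys:
        if key in data:
            return data[key]
    return default

record = {"DOI": "10.1234/example", "title": "A Study"}
print(first_available(record, ["doi", "identifier"], case_sensitive=False))  # 10.1234/example
print(first_available(record, ["abstract"], default="n/a"))                  # n/a
```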

scholar_flux.utils.get_nested_data(json: list | Mapping | None, path: str | list, flatten_nested_dictionaries: bool = True, verbose: bool = True) Any[source]

Recursively retrieves data from a nested dictionary using a sequence of keys.

Parameters:
  • json (list | Mapping | None) – The parsed json structure from which to extract data.

  • path (str | list) – A list of keys representing the path to the desired data within json.

  • flatten_nested_dictionaries (bool) – Determines whether single-element lists containing dictionary data should be extracted.

  • verbose (bool) – Determines whether logging should occur when an error is encountered.

Returns:

The value retrieved from the nested dictionary following the path, or None if any key in the path is not found or leads to a None value prematurely.

Return type:

Any
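
The path traversal can be sketched roughly as below. This is a simplified illustration of the documented behavior rather than the library's code, and it omits the single-element list flattening that `flatten_nested_dictionaries` controls:

```python
from typing import Any

def nested_get(json: Any, path: list) -> Any:
    """Follow a sequence of keys/indices into a nested structure; None on failure."""
    current = json
    for key in path:
        if current is None:
            return None
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return None
    return current

payload = {"response": {"docs": [{"title": "Paper A"}]}}
print(nested_get(payload, ["response", "docs", 0, "title"]))  # Paper A
print(nested_get(payload, ["response", "missing"]))           # None
```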

scholar_flux.utils.get_values(obj: Iterable) Iterable[source]

Automatically retrieves .values() from dictionaries when available and returns the original input otherwise.

Parameters:

obj (Iterable) – An object to get the values from.

Returns:

An iterable created from obj.values() if the object is a dictionary and the original object otherwise.

If the object is empty or is not a nested object, an empty list is returned.

Return type:

Iterable

Infers a category based on a text pattern search. If a value match can’t be inferred, a default is returned.

Parameters:
  • text (str) – The text to match. If None or missing, the default is returned instead.

  • pattern_dict (Mapping[str | re.Pattern, Optional[V]] | Mapping[str, Optional[V]] | Mapping[re.Pattern, V]) – A dictionary that maps patterns to potential output values provided that the pattern matches.

  • default (Optional[D]) – The value to return if a match cannot be inferred from text pattern matching.

  • regex (bool) – Whether to interpret patterns as regex (default True).

  • flags (int | re.RegexFlag) – Optional flags to pass to re.search when available. (default flags=0 for no flags)

Returns:

The inferred category when a match is found based on a dictionary mapping, and the default otherwise.

Return type:

Optional[V | D]

Note

If the provided pattern_dict is not a mapping, or if the provided text cannot be coerced into a string, the default is returned instead.

scholar_flux.utils.initialize_package(log: bool = True, env_path: str | Path | None = None, config_params: dict[str, Any] | None = None, logging_params: dict[str, Any] | None = None) tuple[dict[str, Any], Logger, SensitiveDataMasker][source]

Function used for orchestrating the initialization of the config, log settings, and masking for scholar_flux.

This function imports a ‘.env’ configuration file at the specified location if it exists. Otherwise, scholar_flux will look for a .env file in the default locations if available. If no .env configuration file is found, then only package defaults and available OS environment variables are used.

This function can also be used for dynamic re-initialization of configuration parameters and logging. The config_params are sent as keyword arguments to the scholar_flux.utils.ConfigSettings.load_config method. logging_params are used as keyword arguments to the scholar_flux.utils.setup_logging method to set up logging settings and handlers.

Parameters:
  • log (bool) – A True/False flag that determines whether to enable or disable logging.

  • env_path (Optional[str | Path]) – The file path indicating from where to load the environment variables, if provided.

  • config_params (Optional[Dict]) – A dictionary allowing for the specification of configuration parameters when attempting to load environment variables from a config. Useful for loading API keys from environment variables for later use.

  • logging_params (Optional[Dict]) – A dictionary allowing users to specify options for package-level logging with custom logic. Log settings are loaded from the OS environment or an .env file when available, with precedence given to .env files. These settings, when loaded, override the default ScholarFlux logging configuration. Otherwise, ScholarFlux uses a log-level of WARNING by default.

Returns:

A tuple containing the configuration dictionary and the initialized logger.

Return type:

Tuple[Dict[str, Any], logging.Logger, scholar_flux.security.SensitiveDataMasker]

Raises:

PackageInitializationError – If there are issues with loading the configuration or initializing the logger.

scholar_flux.utils.is_nested(obj: Any) bool[source]

Indicates whether the current value is a nested object. Useful for recursive iterations such as JSON record data.

Parameters:

obj (Any) – The object to check, typically a parsed JSON-like structure.

Returns:

True if the object is nested, otherwise False.

Return type:

bool

scholar_flux.utils.is_nested_json(obj: Any) bool[source]

Check if a value is a nested, parsed JSON structure.

Parameters:

obj (Any) – The object to check.

Returns:

True if the value is a nested JSON structure, False otherwise.

Return type:

bool

scholar_flux.utils.is_response_like(response: object) TypeGuard[Response | ResponseProtocol][source]

Identifies whether an object is a response or a duck-typed response protocol.

scholar_flux.utils.log_level_context(log_level: int | str = 10, logger: Logger | None = None, allow_lower_level: bool = True) Iterator[None][source]

Context manager for temporarily changing the log level for the package-level (or custom) logger.

Parameters:
  • log_level (int | str) – The log level to temporarily change to. Options include logging.DEBUG (10) or “DEBUG”, logging.INFO (20) or “INFO”, logging.WARNING (30) or “WARNING”, logging.ERROR (40) or “ERROR”, and logging.CRITICAL (50) or “CRITICAL”.

  • logger (logging.Logger) – The logger to use when temporarily changing the log level. If not specified, the ScholarFlux package level logger is used.

  • allow_lower_level (bool) – When False, the current log level is overridden only if the provided log level is higher than the current log level.

Example

>>> from scholar_flux import SearchAPI, log_level_context
>>> api = SearchAPI(provider_name = "CORE", query = "Technological Safety")
>>> with log_level_context("DEBUG"): # `logging.DEBUG`
...     response = api.search(page = 1)
# OUTPUT: 2026-01-21 13:46:50,333 - scholar_flux.api.base_api - DEBUG - Sending request to https://api.core.ac.uk/v3/search/works

Note: when an invalid log_level is passed, a level of 51 is used in its place, effectively turning off logging.

scholar_flux.utils.nested_key_exists(obj: object, key_to_find: str, regex: bool = False) bool[source]

Recursively checks if a specified key is present anywhere in a given JSON-like dictionary or list structure.

Parameters:
  • obj (object) – The dictionary or list to search.

  • key_to_find (str) – The key to search for.

  • regex (bool) – Whether or not to search with regular expressions.

Returns:

True if the key is present, False otherwise.

Return type:

bool
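
The recursive search can be approximated as below. This is an illustrative sketch of the documented behavior that omits the regex option:

```python
def key_exists(obj: object, key_to_find: str) -> bool:
    """Recursively check whether a key appears anywhere in a dict/list structure."""
    if isinstance(obj, dict):
        if key_to_find in obj:
            return True
        return any(key_exists(value, key_to_find) for value in obj.values())
    if isinstance(obj, list):
        return any(key_exists(item, key_to_find) for item in obj)
    return False

data = {"results": [{"meta": {"doi": "10.1/x"}}]}
print(key_exists(data, "doi"))   # True
print(key_exists(data, "issn"))  # False
```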

scholar_flux.utils.normalize_repr(value: Any, replace_numeric: bool | None = False) str[source]

Helper function for removing byte locations and surrounding signs from classes.

Parameters:
  • value (Any) – A value whose representation is to be normalized

  • replace_numeric (bool) – Determines whether count values in strings should be replaced.

Returns:

A normalized string representation of the current value

Return type:

str

scholar_flux.utils.parse_iso_timestamp(timestamp_str: str) datetime | None[source]

Attempts to convert an ISO 8601 timestamp string back to a datetime object.

Parameters:

timestamp_str (str) – ISO 8601 formatted timestamp string

Returns:

datetime object if parsing succeeds, None otherwise

Return type:

Optional[datetime]
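
A minimal equivalent can be written with the standard library. The use of datetime.fromisoformat here is an assumption about the implementation; the documented contract is simply datetime-or-None:

```python
from datetime import datetime
from typing import Optional

def parse_iso(timestamp_str: str) -> Optional[datetime]:
    """Parse an ISO 8601 string, returning None instead of raising on failure."""
    try:
        return datetime.fromisoformat(timestamp_str)
    except (ValueError, TypeError):
        return None

print(parse_iso("2026-01-21T13:46:50"))  # 2026-01-21 13:46:50
print(parse_iso("not a timestamp"))      # None
```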

scholar_flux.utils.quote_if_string(value: object) object[source]

Attempt to quote string values to distinguish them from object text in class representations.

Parameters:

value (object) – a value that is quoted only if it is a string

Returns:

Returns a quoted string if successful. Otherwise returns the value unchanged

Return type:

Any

scholar_flux.utils.quote_numeric(value: object) str[source]

Attempts to quote a value as numeric, returning the quoted value if successful. Otherwise raises an error.

Parameters:

value (object) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful.

Return type:

str

Raises:

ValueError – If the value cannot be quoted

scholar_flux.utils.resolve_log_level(log_level: str | int | None = None) int | None[source]

Utility for resolving numeric strings and log level values into integer log levels.

Parameters:

log_level (Optional[str | int]) – The log level to resolve as an integer if not already an integer. Accepts both case-insensitive strings (“Warning”, “INFO”, “error”) and integers (“0”, 1, “03”)

Returns:

The logging level resolved from the user-provided log_level, or None when a non-string/non-integer value is received or log level resolution from a string is unsuccessful.

Return type:

Optional[int]
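
The resolution rules can be sketched as below. This is an illustrative approximation; the mapping of case-insensitive names to levels relies on the standard logging module and is an assumption about the implementation:

```python
import logging
from typing import Optional, Union

def resolve_level(log_level: Union[str, int, None]) -> Optional[int]:
    """Resolve level names and numeric strings into integer log levels."""
    if isinstance(log_level, bool):
        return None                      # exclude bool, a subclass of int
    if isinstance(log_level, int):
        return log_level
    if isinstance(log_level, str):
        text = log_level.strip()
        if text.lstrip("+-").isdigit():
            return int(text)             # numeric strings such as "0" or "03"
        resolved = logging.getLevelName(text.upper())
        return resolved if isinstance(resolved, int) else None
    return None

print(resolve_level("Warning"))  # 30
print(resolve_level("03"))       # 3
print(resolve_level(None))       # None
```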

scholar_flux.utils.resolve_log_stream(stream: str | bool | TextIO | None) TextIO | Literal[False][source]

Helper for resolving streams used for logging from strings.

Parameters:

stream (Optional[str | bool | TextIO]) – The value to resolve as a stream type.

Returns:

A stderr or stdout stream resolved from the input. Literal[False]: returned if False or a similar falsy value is received (e.g., 0, ‘0’, ‘false’).

Return type:

TextIO | Literal[False]

Note

This function attempts to resolve values into stderr or stdout using case-insensitive string normalization when possible. A value of False, when returned, indicates that streaming should not be used. If a value other than a string is passed (e.g., None, True, 23), the stream will default to stderr instead.

scholar_flux.utils.response_supports_json(response: object) TypeGuard[ResponseSupportsJSONProtocol][source]

Determines whether the current object is a response-like object that supports JSON content parsing.

scholar_flux.utils.set_public_api_module(module_name: str, public_names: list[str], namespace: dict) None[source]

Assigns the current module’s name to the __module__ attribute of public API objects.

This function is useful for several use cases including sphinx documentation, introspection, and error handling/reporting.

For all objects defined in the list of a modules public API names (generally named __all__), this function sets their __module__ attribute to the name of the current public API module if supported.

This is useful for ensuring that imported classes and functions appear as if they are defined in the current module (such as in the automatic generation of sphinx documentation), which improves overall documentation, introspection, and error reporting.

Parameters:
  • module_name (str) – The name of the module (usually __name__).

  • public_names (list[str]) – List of public object names to update (e.g., __all__).

  • namespace (dict) – The module’s namespace (usually globals()).

Example usage:

set_public_api_module(__name__, __all__, globals())

scholar_flux.utils.setup_logging(logger: Logger | None = None, log_directory: str | None = None, log_file: str | None = 'application.log', log_level: int = 10, propagate_logs: bool | None = True, max_bytes: int = 1048576, backup_count: int = 5, logging_filter: Filter | None = None, *, stream: TextIO | Literal[False] | None = None, raise_on_error: bool = True) None[source]

Configures a logger to write to the console and, optionally, file logs with an optional logging filter.

This function is a general purpose utility used by the scholar_flux package to set up a package level logger that implements sensitive data masking with a custom filter.

The logger is configured to write to the terminal (console) and, optionally, to a rotating log file if specified. Rotating files automatically create new files when size limits are reached, keeping your logs manageable.

Parameters:
  • logger (Optional[logging.Logger]) – The logger instance to configure. If None, uses the root logger.

  • log_directory (Optional[str]) – Indicates where to save log files. If None, automatically finds a writable directory when a log_file is specified.

  • log_file (Optional[str]) – Name of the log file (default: ‘application.log’). If None, file-based logging will not be performed.

  • log_level (int) – Minimum level to log (DEBUG logs everything, INFO skips debug messages).

  • propagate_logs (Optional[bool]) – Determines whether to propagate logs. Logs are propagated by default if this option is not specified.

  • max_bytes (int) – Maximum size of each log file before rotating (default: 1MB).

  • backup_count (int) – Number of old log files to keep (default: 5).

  • logging_filter (Optional[logging.Filter]) – Optional filter to modify log messages (e.g., hide sensitive data).

  • stream (Optional[TextIO | bool]) – Optionally modifies the stream used for logging. By default, a stream is created that uses stderr. Set this to False to avoid creating a log stream altogether.

  • raise_on_error (bool) – Indicates whether an error should be raised if an error on package directory setup occurs.

Example

>>> # Basic setup - logs to console and file
>>> setup_logging()
>>> # Custom location and less verbose
>>> setup_logging(log_directory="/var/log/myapp", log_level=logging.INFO)
>>> # With sensitive data masking
>>> from scholar_flux.security import MaskingFilter
>>> mask_filter = MaskingFilter()
>>> setup_logging(logging_filter=mask_filter)

Note

  • Console shows all log messages in real-time

  • File keeps a permanent record with automatic rotation

  • If logging_filter is provided, it’s applied to both console and file output

  • Calling this function multiple times will reset the logger configuration

scholar_flux.utils.strip_html_tags(text: str, parser: Literal['html.parser', 'lxml'] = 'html.parser', verbose: bool = True, **kwargs: Any) str[source]

Extracts the raw text from HTML while removing html elements such as paragraph tags and breaks.

Parameters:
  • text (str) – The text to extract and remove html tags and elements from

  • parser (Literal['html.parser', 'lxml']) – The parser to use for the removal of html elements

  • verbose (bool) – Indicates whether issues regarding missing dependencies and incorrect types should be logged.

  • **kwargs – Additional keyword arguments passed directly to BeautifulSoup.get_text(). Possible keywords include separator (str), the string inserted between elements (default: ‘’), and strip (bool), whether to strip whitespace from element text (default: False).

Returns:

The string with HTML tags and elements removed if the input is a string, and the original input otherwise.

Return type:

str

Examples

>>> strip_html_tags("<p>Hello</p><p>World</p>")
'HelloWorld'
>>> strip_html_tags("<p>Hello</p><p>World</p>", separator=" ")
'Hello World'
>>> strip_html_tags("<p>  Whitespace  </p>", strip=True)
'Whitespace'
scholar_flux.utils.truncate(value: Any, max_length: int = 40, suffix: str = '...', show_count: bool = True) str[source]

Truncates various strings, mappings, and sequences for cleaner representations of objects in CLIs.

Handles:
  • Strings: Truncate with suffix
  • Mappings (dict): Show a preview of the first N characters with an item count
  • Sequences (list, tuple): Show a preview with an item count
  • Other objects: Use the string representation

Parameters:
  • value (Any) – The value to truncate.

  • max_length (int) – Maximum character length before truncation.

  • suffix (str) – String to append when truncated (default: “…”).

  • show_count (bool) – Whether to show item count for collections.

Returns:

Truncated string representation.

Return type:

str

Examples

>>> truncate("A very long string that needs truncation", max_length=20)
'A very long string...'
>>> truncate({'key1': 'value1', 'key2': 'value2'}, max_length=30)
"{'key1': 'value1', ...} (2 items)"
>>> truncate([1, 2, 3, 4, 5], max_length=10)
'[1, 2, ...] (5 items)'
>>> truncate({'a': 1}, max_length=50, show_count=False)
"{'a': 1}"
scholar_flux.utils.try_bytes(value: object) bytes | object[source]

Attempts to convert a value to a bytes object, returning the original value if the conversion fails.

Parameters:

value (object) – the value to attempt to coerce into a bytes object

Returns:

The converted bytes object if successful, otherwise the original value.

Return type:

bytes | object

scholar_flux.utils.try_call(func: Callable, args: tuple | None = None, kwargs: dict | None = None, suppress: tuple = (), logger: Logger | None = None, log_level: int = 30, default: Any | None = None) Any | None[source]

A helper function for calling another function safely in the event that one of the specified errors occurs and is contained within the tuple of errors to suppress.

Parameters:
  • func (Callable) – The function to call

  • args (Optional[tuple]) – A tuple of positional arguments to add to the function call

  • kwargs (Optional[dict]) – A dictionary of keyword arguments to add to the function call

  • suppress (tuple) – A tuple of exceptions to handle and suppress if they occur

  • logger (Optional[logging.Logger]) – The logger to use for warning generation

  • log_level (int) – The logging level to use when logging suppressed exceptions.

  • default (Optional[Any]) – The value to return in the event that an error occurs and is suppressed

Returns:

When successful, the return type of the callable is also returned without modification. Upon suppressing an exception, the function will generate a warning and return None by default unless the default was set.

Return type:

Optional[Any]
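
The suppression pattern can be sketched as follows. This is a simplified illustration of the documented behavior, not the library's implementation:

```python
import logging
from typing import Any, Callable, Optional

def safe_call(func: Callable, args: Optional[tuple] = None,
              kwargs: Optional[dict] = None, suppress: tuple = (),
              logger: Optional[logging.Logger] = None,
              log_level: int = logging.WARNING, default: Any = None) -> Any:
    """Call func, logging and suppressing the listed exceptions."""
    try:
        return func(*(args or ()), **(kwargs or {}))
    except suppress as exc:
        # Log the suppressed exception and fall back to the default value
        (logger or logging.getLogger(__name__)).log(
            log_level, "Suppressed %r from %s", exc, getattr(func, "__name__", func))
        return default

print(safe_call(int, args=("42",)))                                        # 42
print(safe_call(int, args=("oops",), suppress=(ValueError,), default=-1))  # -1
```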

scholar_flux.utils.try_compile(s: str | Pattern | None, *, prefix: str | None = None, suffix: str | None = None, flags: int | RegexFlag = 0, escape: bool = False, verbose: bool = False) Pattern | None[source]

Attempts to compile an object as a pattern when possible, returning None when compilation fails.

Parameters:
  • s (Optional[str | re.Pattern]) – The string to compile as a pattern

  • prefix (Optional[str]) – A prefix to add to the beginning of a string when a pattern is not directly provided

  • suffix (Optional[str]) – A suffix to add to the end of a string when a pattern is not directly provided

  • flags (int | re.RegexFlag) – Flags to use when compiling a pattern. By default, no flags are applied (flags=0).

  • escape (bool) – Indicates whether regular expression symbols should be escaped to interpret them literally.

  • verbose (bool)

Returns:

A regular expression pattern when successful, otherwise None

Return type:

Optional[re.Pattern]

Note

When a pattern is received, it is returned as is. Only valid strings are transformed into patterns containing a prefix when provided.
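
A rough sketch of the compile-or-None pattern follows. This is illustrative only; the exact ordering of prefix/suffix handling relative to escaping is an assumption based on the parameter descriptions:

```python
import re
from typing import Optional, Union

def compile_or_none(s: Union[str, re.Pattern, None], *, prefix: Optional[str] = None,
                    suffix: Optional[str] = None, flags: int = 0,
                    escape: bool = False) -> Optional[re.Pattern]:
    """Compile a string into a pattern, returning None when compilation fails."""
    if isinstance(s, re.Pattern):
        return s                        # patterns are returned as-is
    if not isinstance(s, str):
        return None
    text = re.escape(s) if escape else s
    try:
        return re.compile(f"{prefix or ''}{text}{suffix or ''}", flags)
    except re.error:
        return None

print(compile_or_none("doi", prefix="^"))  # re.compile('^doi')
print(compile_or_none("("))                # None (invalid regex)
```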

scholar_flux.utils.try_dict(value: Any) dict | None[source]

Attempts to convert a value into a dictionary, if possible.

If it is not possible to convert the value into a dictionary, the function will return None.

Parameters:

value (Any) – A value to attempt to convert into a dict.

Returns:

The value converted into a dictionary if possible, otherwise None

Return type:

Optional[dict]

scholar_flux.utils.try_int(value: object) int | object[source]

Attempts to convert a value to an integer, returning the original value if the conversion fails.

Parameters:

value (object) – the value to attempt to coerce into an integer

Returns:

The converted integer if successful, otherwise the original value.

Return type:

int | object

scholar_flux.utils.try_none(value: object, none_indicators: tuple[Any, ...] = ('none', 'unspecified', 'unknown', 'n/a')) object | None[source]

Converts empty strings, ‘none’, and empty data containers into None. Otherwise, the original value is returned.

Parameters:
  • value (object) – The value to convert into None when possible

  • none_indicators (tuple[Any, ...]) – Tuple of values that should be treated as None indicators.

Returns:

The original value if not converted, and None otherwise

Return type:

object | None
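
The coercion rules can be sketched as below. This approximates the documented behavior; the exact string normalization and container checks are assumptions:

```python
from typing import Any

def to_none(value: Any,
            none_indicators: tuple = ("none", "unspecified", "unknown", "n/a")) -> Any:
    """Map empty strings, None-like indicator strings, and empty containers to None."""
    if isinstance(value, str):
        text = value.strip().lower()
        if not text or text in none_indicators:
            return None
    elif hasattr(value, "__len__") and len(value) == 0:
        return None                      # empty containers such as [] or {}
    return value

print(to_none("None"))      # None
print(to_none([]))          # None
print(to_none("CrossRef"))  # CrossRef
```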

scholar_flux.utils.try_pop(s: Set[H], item: H, default: H | None = None) H | None[source]

Attempt to remove an item from a set and return the item if it exists.

Parameters:
  • s (Set[H]) – The set to remove the item from.

  • item (H) – The item to try to remove from the set

  • default (Optional[H]) – The object to return as a default if item is not found

Returns:

item if the value is in the set, otherwise returns the specified default

Return type:

H | None

scholar_flux.utils.try_quote_numeric(value: object) str | None[source]

Attempt to quote numeric values to distinguish them from string values and integers.

Parameters:

value (object) – a value that is quoted only if it is a numeric string or an integer

Returns:

Returns a quoted string if successful. Otherwise None

Return type:

Optional[str]

scholar_flux.utils.try_str(value: object) str | object[source]

Attempts to convert a value to a string, returning the original value if the conversion fails.

Parameters:

value (object) – the value to attempt to coerce into a string

Returns:

The converted string if successful, otherwise the original value.

Return type:

str | object

scholar_flux.utils.unlist_1d(current_data: tuple | list | Any) Any[source]

Retrieves the element from a list/tuple if it contains only a single element; otherwise returns the input as-is. Useful for extracting text from a single-element list/tuple.

Parameters:

current_data (tuple | list | Any) – An object to potentially unlist if it contains a single element.

Returns:

The unlisted element if the input is a single-element list/tuple, otherwise the input unchanged.

Return type:

Any
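
The behavior can be summarized with a short sketch (illustrative, not the library's code):

```python
from typing import Any, Union

def unlist(current_data: Union[tuple, list, Any]) -> Any:
    """Extract the sole element of a single-element list/tuple; otherwise pass through."""
    if isinstance(current_data, (list, tuple)) and len(current_data) == 1:
        return current_data[0]
    return current_data

print(unlist(["only value"]))  # only value
print(unlist([1, 2, 3]))       # [1, 2, 3]
```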