scholar_flux.api package
Subpackages
- scholar_flux.api.models package
- Submodules
- scholar_flux.api.models.api_parameters module
APIParameterConfig: DEFAULT_CORRECT_ZERO_INDEX, __init__(), add_parameter(), as_config(), build_parameters(), extract_parameters(), from_defaults(), get_defaults(), map, parameter_map, show_parameters(), structure()
APIParameterMap: query, start, records_per_page, api_key_parameter, api_key_required, auto_calculate_page, zero_indexed_pagination, api_specific_parameters, from_defaults(), get_defaults(), model_config, set_default_api_key_parameter(), validate_api_specific_parameter_mappings()
- scholar_flux.api.models.base_parameters module
APISpecificParameter
BaseAPIParameterMap: query, start, records_per_page, api_key_parameter, api_key_required, page_required, auto_calculate_page, zero_indexed_pagination, api_specific_parameters, add_parameter(), from_dict(), model_config, show_parameters(), structure(), to_dict(), update()
- scholar_flux.api.models.base_provider_dict module
- scholar_flux.api.models.provider_config module
ProviderConfig: api_key_env_var, api_key_required, base_url, display_name, docs_url, field_map, map, metadata_map, model_config, normalize_provider_name(), parameter_map, prepare_fields(), provider_name, records_per_page, request_delay, search_config_defaults(), structure(), validate_base_url(), validate_docs_url()
- scholar_flux.api.models.provider_registry module
- scholar_flux.api.models.rate_limiter_registry module
- scholar_flux.api.models.reconstructed_response module
ReconstructedResponse: __init__(), asdict(), build(), content, fields(), from_keywords(), headers, is_response(), json(), ok, prepare_response_fields(), raise_for_status(), reason, status, status_code, text, url, validate()
- scholar_flux.api.models.response_metadata_map module
- scholar_flux.api.models.response_types module
- scholar_flux.api.models.responses module
APIResponse: as_reconstructed_response(), build_record_id_index(), cache_key, cached, content, created_at, encode_response(), from_response(), from_serialized_response(), headers, model_config, normalize(), process_metadata(), raise_for_status(), reason, resolve_extracted_record(), response, serialize_response(), status, status_code, strip_annotations(), text, transform_response(), url, validate_iso_timestamp(), validate_response()
ErrorResponse: build_record_id_index(), cache_key, created_at, data, error, extracted_records, from_error(), message, metadata, model_config, normalize(), normalized_records, parsed_response, process_metadata(), processed_metadata, processed_records, record_count, records_per_page, resolve_extracted_record(), response, strip_annotations(), total_query_hits
NonResponse
ProcessedResponse: build_record_id_index(), cache_key, created_at, data, error, extracted_records, message, metadata, model_config, normalize(), normalized_records, parsed_response, process_metadata(), processed_metadata, processed_records, record_count, records_per_page, resolve_extracted_record(), response, strip_annotations(), total_query_hits
- scholar_flux.api.models.search_api_config module
SearchAPIConfig: provider_name, base_url, records_per_page, request_delay, api_key, api_specific_parameters, DEFAULT_PROVIDER, DEFAULT_RECORDS_PER_PAGE, DEFAULT_REQUEST_DELAY, MAX_API_KEY_LENGTH, default_request_delay(), from_defaults(), model_config, set_records_per_page(), structure(), update(), url_basename, validate_api_key(), validate_provider_name(), validate_request_delay(), validate_search_api_config_parameters(), validate_url(), validate_url_type()
- scholar_flux.api.models.search_inputs module
- scholar_flux.api.models.search_results module
SearchResult: build_record_id_index(), cache_key, cached, created_at, data, display_name, error, extracted_records, message, metadata, model_config, normalize(), normalized_records, page, parsed_response, process_metadata(), processed_metadata, processed_records, provider_name, query, record_count, records_per_page, resolve_extracted_record(), response, response_result, retrieval_timestamp, status, status_code, strip_annotations(), total_query_hits, url, with_search_fields()
SearchResultList
- Module contents
APIParameterConfig
APIParameterMap
APIResponse
APISpecificParameter
AcademicFieldMap
BaseAPIParameterMap
BaseFieldMap
BaseProviderDict
ErrorResponse
NonResponse
PageListInput
ProcessedResponse
ProviderConfig
ProviderRegistry
ReconstructedResponse
ResponseHistoryRegistry
ResponseMetadataMap
SearchAPIConfig
SearchResult
SearchResultList
- scholar_flux.api.normalization package
- Submodules
- scholar_flux.api.normalization.academic_field_map module
AcademicFieldMap: abstract, api_specific_fields, authors, citation_count, date_created, date_published, default_field_values, doi, extract_abstract(), extract_authors(), extract_boolean_field(), extract_id(), extract_iso_date(), extract_journal(), extract_url(), extract_url_id(), extract_year(), full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), normalize_doi(), open_access, provider_name, publisher, reconstruct_url(), record_id, record_type, subjects, title, url, year
- scholar_flux.api.normalization.arxiv_field_map module
- scholar_flux.api.normalization.base_field_map module
BaseFieldMap: provider_name, api_specific_fields, default_field_values, apply(), core_fields, fields, filter_api_specific_fields(), model_config, normalize_record(), normalize_records(), structure(), validate_provider_name()
- scholar_flux.api.normalization.core_field_map module
- scholar_flux.api.normalization.crossref_field_map module
- scholar_flux.api.normalization.normalizing_field_map module
- scholar_flux.api.normalization.open_alex_field_map module
- scholar_flux.api.normalization.plos_field_map module
- scholar_flux.api.normalization.pubmed_efetch_field_map module
- scholar_flux.api.normalization.pubmed_field_map module
- scholar_flux.api.normalization.springer_nature_field_map module
- Module contents
AcademicFieldMap
ArXivFieldMap: abstract, api_specific_fields, authors, citation_count, date_created, date_published, default_field_values, doi, extract_pdf_url(), extract_record_type(), full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), open_access, provider_name, publisher, record_id, record_type, subjects, title, url, year
BaseFieldMap
CoreFieldMap: abstract, api_specific_fields, authors, citation_count, date_created, date_published, default_field_values, doi, extract_arxiv_id(), extract_mag_id(), extract_oai_ids(), extract_pmid(), full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), open_access, provider_name, publisher, record_id, record_type, subjects, title, url, year
CrossrefFieldMap: abstract, api_specific_fields, authors, check_retraction(), citation_count, date_created, date_published, default_field_values, doi, extract_authors(), extract_date_parts(), extract_title(), extract_year(), full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), open_access, provider_name, publisher, record_id, record_type, resolve_open_access(), subjects, title, url, year
NormalizingFieldMap
OpenAlexFieldMap: abstract, api_specific_fields, authors, citation_count, date_created, date_published, default_field_values, doi, extract_open_access(), extract_pmid(), full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), open_access, provider_name, publisher, reconstruct_abstract(), record_id, record_type, subjects, title, url, year
PLOSFieldMap: abstract, api_specific_fields, authors, citation_count, date_created, date_published, default_field_values, doi, full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), open_access, provider_name, publisher, reconstruct_plos_url(), record_id, record_type, subjects, title, url, year
PubMedFieldMap: abstract, api_specific_fields, authors, citation_count, date_created, date_published, default_field_values, doi, extract_authors(), extract_date_created(), extract_doi(), extract_open_access(), extract_pii(), extract_pmcid(), full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), open_access, provider_name, publisher, reconstruct_pubmed_url(), record_id, record_type, subjects, title, url, year
SpringerNatureFieldMap: abstract, api_specific_fields, authors, citation_count, date_created, date_published, default_field_values, doi, extract_open_access(), extract_primary_url(), full_text, is_retracted, journal, keywords, language, license, model_config, model_post_init(), open_access, provider_name, publisher, record_id, record_type, subjects, title, url, year
- scholar_flux.api.providers package
- Submodules
- scholar_flux.api.providers.arxiv module
- scholar_flux.api.providers.core module
- scholar_flux.api.providers.crossref module
- scholar_flux.api.providers.open_alex module
- scholar_flux.api.providers.plos module
- scholar_flux.api.providers.pubmed module
- scholar_flux.api.providers.pubmed_efetch module
- scholar_flux.api.providers.springer_nature module
- Module contents
- scholar_flux.api.rate_limiting package
- Submodules
- scholar_flux.api.rate_limiting.rate_limiter module
- scholar_flux.api.rate_limiting.retry_handler module
RetryHandler: max_retries, backoff_factor, max_backoff, retry_statuses, history, DEFAULT_RAISE_ON_ERROR, DEFAULT_RETRY_AFTER_HEADERS, DEFAULT_RETRY_STATUSES, DEFAULT_VALID_STATUSES, RAISE_ON_DELAY_EXCEEDED, __init__(), calculate_retry_delay(), delay_exceeds_max_backoff(), execute_with_retry(), extract_retry_after(), extract_retry_after_from_response(), get_retry_after(), log_retry_attempt(), log_retry_warning(), parse_retry_after(), resize_history(), should_retry()
- scholar_flux.api.rate_limiting.threaded_rate_limiter module
- Module contents
RateLimiter
RetryHandler
ThreadedRateLimiter
- scholar_flux.api.workflows package
- Submodules
- scholar_flux.api.workflows.models module
- scholar_flux.api.workflows.pubmed_workflow module
- scholar_flux.api.workflows.search_workflow module
- scholar_flux.api.workflows.workflow_defaults module
- Module contents
BaseStepContext
BaseWorkflow
BaseWorkflowResult
BaseWorkflowStep
PubMedFetchStep
PubMedSearchStep: provider_name, step_number, description, additional_kwargs, config_parameters, model_config, search_parameters
PubMedSearchWorkflow
SearchWorkflow
StepContext
WORKFLOW_DEFAULTS
WorkflowResult
WorkflowStep
Submodules
scholar_flux.api.base_api module
Defines the BaseAPI, which implements minimal features such as caching, requests, and response retrieval.
The BaseAPI is subclassed by scholar_flux.api.SearchAPI to further build and formulate requests based on the parameters accepted by each API provider given their respective configurations.
- class scholar_flux.api.base_api.BaseAPI(user_agent: str | None = None, session: Session | None = None, timeout: int | float | None = None, use_cache: bool | None = None)[source]
Bases: object
The BaseAPI client is a minimal implementation for user-friendly request preparation and response retrieval.
- Parameters:
session (Optional[requests.Session]) – A pre-configured requests or requests-cache session. A new session is created if not specified.
user_agent (Optional[str]) – An optional user-agent string for the session.
timeout (Optional[int | float]) – The number of seconds to wait before raising a TimeoutError.
use_cache (bool) – Indicates whether or not to create a cached session. If a cached session is already specified, this setting will have no effect on the creation of a session.
Examples
>>> from scholar_flux.api import BaseAPI
# creating a basic API client that uses the PLOS API as the default while caching response data in-memory:
>>> base_api = BaseAPI(use_cache=True)
# retrieve a basic request:
>>> parameters = {'q': 'machine learning', 'start': 1, 'rows': 20}
>>> response_page_1 = base_api.send_request('https://api.plos.org/search', parameters=parameters)
>>> assert response_page_1.ok
>>> response_page_1
# OUTPUT: <Response [200]>
>>> ml_page_1 = response_page_1.json()
# retrieving the next page:
>>> parameters['start'] = 21
>>> response_page_2 = base_api.send_request('https://api.plos.org/search', parameters=parameters)
>>> assert response_page_2.ok
>>> response_page_2
# OUTPUT: <Response [200]>
>>> ml_page_2 = response_page_2.json()
>>> ml_page_2
# OUTPUT: {'response': {'numFound': '...', 'start': 21, 'docs': ['...']}} # redacted
Note
The class variable BaseAPI.DEFAULT_USE_CACHE is set to True at import time if the environment variable SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND is configured; otherwise, it is set to False. Changes made via config_settings after import will not enable or disable caching unless you manually update BaseAPI.DEFAULT_USE_CACHE (or SearchAPI.DEFAULT_USE_CACHE for the SearchAPI subclass).
- DEFAULT_TIMEOUT: int = 20
- DEFAULT_USE_CACHE: bool = False
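The import-time behavior described in the note above can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not the library's implementation: DemoAPI is a hypothetical name, the value "memory" is only illustrative, and treating any non-empty value of the environment variable as "configured" is an assumption.

```python
import os

ENV_VAR = "SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND"
os.environ.pop(ENV_VAR, None)  # start the demo from a clean environment


class DemoAPI:
    """Hypothetical stand-in for a client with an import-time default."""

    # Evaluated once, when the class body executes (i.e. at import time).
    # Assumption: any non-empty value counts as "configured".
    DEFAULT_USE_CACHE: bool = bool(os.environ.get(ENV_VAR))

    def __init__(self, use_cache=None):
        # An explicit argument wins; otherwise fall back to the class default.
        self.use_cache = self.DEFAULT_USE_CACHE if use_cache is None else use_cache


# Setting the variable after the class exists does not change the default...
os.environ[ENV_VAR] = "memory"  # illustrative value only
late_env = DemoAPI().use_cache

# ...but updating the class attribute directly does.
DemoAPI.DEFAULT_USE_CACHE = True
after_update = DemoAPI().use_cache

os.environ.pop(ENV_VAR, None)  # clean up after the demo
```

With the real classes, the equivalent of the last step would be assigning to BaseAPI.DEFAULT_USE_CACHE (or SearchAPI.DEFAULT_USE_CACHE) before creating clients, as the note states.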
- __init__(user_agent: str | None = None, session: Session | None = None, timeout: int | float | None = None, use_cache: bool | None = None)[source]
Initializes the BaseAPI client for response retrieval given the provided inputs.
The necessary attributes are prepared with a new or existing session (cached or uncached) via dependency injection. This class is designed to be subclassed for specific API implementations.
- Parameters:
user_agent (Optional[str]) – Optional user-agent string for the session.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
timeout (Optional[int | float]) – Timeout for requests in seconds.
use_cache (Optional[bool]) – Indicates whether or not to use cache. The default setting is to create a regular requests.Session unless a CachedSession is already provided.
- property cache: BaseCache | None
Retrieves the requests-session cache object if the session object is a CachedSession object.
If a session cache does not exist, this function will return None.
- Returns:
The cache object if available, otherwise None.
- Return type:
Optional[BaseCache]
- property cached: bool
Checks whether the current session object used by the current API is a cached session.
- Returns:
True if the current object is a cached session object, and False otherwise
- Return type:
bool
- configure_session(session: Session | None = None, user_agent: str | None = None, use_cache: bool | None = None) Session[source]
Creates a new Session or CachedSession object for API requests if a session does not already exist.
If use_cache = True, then a cached session object will be used. A regular session that is not already cached will be overridden.
- Parameters:
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist. If use_cache is True and a cached session has already been passed, the previously created cached session is returned. Otherwise, a new CachedSession is created.
- Returns:
The configured session.
- Return type:
requests.Session
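The resolution rules above can be sketched roughly as follows. `PlainSession` and `CachingSession` are hypothetical stand-ins for `requests.Session` and `requests_cache.CachedSession`, and the handling of cases the text does not cover (for example, `use_cache=False` with an already cached session, or whether headers carry over to a replacement session) is an assumption.

```python
from typing import Optional


class PlainSession:
    """Hypothetical stand-in for requests.Session."""

    def __init__(self):
        self.headers = {}


class CachingSession(PlainSession):
    """Hypothetical stand-in for requests_cache.CachedSession."""


def configure_session_sketch(session: Optional[PlainSession] = None,
                             user_agent: Optional[str] = None,
                             use_cache: Optional[bool] = None) -> PlainSession:
    if use_cache and not isinstance(session, CachingSession):
        # Documented: a regular (or missing) session is replaced by a cached one.
        session = CachingSession()
    elif session is None:
        # Documented default: a regular session when caching is not requested.
        session = PlainSession()
    # Assumption: in every other case the session that was passed in is reused as-is.
    if user_agent:
        session.headers["User-Agent"] = user_agent
    return session
```

Passing an existing `CachingSession` with `use_cache=True` returns the same object, which mirrors the documented behavior for a previously created cached session.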
- static is_cached_session(session: CachedSession | Session) bool[source]
Checks whether a provided session object is a requests_cache.CachedSession object.
- Parameters:
session (requests.Session) – The session to check.
- Returns:
True if the session is a cached session, False otherwise.
- Return type:
bool
- prepare_request(base_url: str, endpoint: str | None = None, parameters: Dict[str, Any] | None = None) PreparedRequest[source]
Prepares a GET request for the specified endpoint with optional parameters.
- Parameters:
base_url (str) – The base URL for the API.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
parameters (Optional[Dict[str, Any]]) – Optional query parameters for the request.
- Returns:
The prepared request object.
- Return type:
PreparedRequest
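To make the roles of `base_url`, `endpoint`, and `parameters` concrete, here is a small standard-library sketch of how a GET URL could be assembled from them. The helper `build_get_url` is hypothetical; the library itself returns a `PreparedRequest` and may join and encode these parts differently.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode


def build_get_url(base_url: str,
                  endpoint: Optional[str] = None,
                  parameters: Optional[Dict[str, Any]] = None) -> str:
    """Hypothetical helper: joins the base URL, endpoint, and query string."""
    url = base_url.rstrip("/")
    if endpoint:
        url = f"{url}/{endpoint.lstrip('/')}"
    if parameters:
        # urlencode percent-encodes values and uses '+' for spaces.
        url = f"{url}?{urlencode(parameters)}"
    return url


url = build_get_url("https://api.plos.org", "search",
                    {"q": "machine learning", "start": 1, "rows": 20})
# url == "https://api.plos.org/search?q=machine+learning&start=1&rows=20"
```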
- send_request(base_url: str, endpoint: str | None = None, parameters: Dict[str, Any] | None = None, timeout: int | float | None = None) Response[source]
Sends a GET request to the specified endpoint with optional parameters.
- Parameters:
base_url (str) – The base API to send the request to.
endpoint (Optional[str]) – The endpoint of the API to send the request to.
parameters (Optional[Dict[str, Any]]) – Optional query parameters for the request.
timeout (Optional[int | float]) – Timeout for the request in seconds.
- Returns:
The response object.
- Return type:
requests.Response
- session: Session
- structure(flatten: bool = True, show_value_attributes: bool = False) str[source]
Base method for showing the structure of the current BaseAPI. This method reveals the configuration settings of the API client that will be used to send requests.
- Returns:
The current structure of the BaseAPI or its subclass.
- Return type:
str
- summary() str[source]
Create a summary representation of the current structure of the API:
Returns the original representation.
- property user_agent: str | None
The User-Agent should always reflect what is used in the session.
This method retrieves the User-Agent from the session directly.
scholar_flux.api.base_coordinator module
Defines the BaseCoordinator that implements the most basic orchestration components used to request, process, and optionally cache processed record data from APIs.
- class scholar_flux.api.base_coordinator.BaseCoordinator(search_api: SearchAPI, response_coordinator: ResponseCoordinator)[source]
Bases: object
BaseCoordinator providing the minimum functionality for requesting and retrieving records and metadata from APIs.
This class uses dependency injection to orchestrate the process of constructing requests, validating responses, and processing scientific works and articles. This class is designed to provide the absolute minimum necessary functionality to both retrieve and process data from APIs and can make use of caching functionality for caching requests and responses.
After initialization, the BaseCoordinator uses two main components for the sequential orchestration of response retrieval, processing, and caching.
- Components:
- SearchAPI (api/search_api):
Handles the creation and orchestration of search requests in addition to the caching of successful requests via dependency injection.
- ResponseCoordinator (responses/response_coordinator):
Handles the full range of response processing steps after retrieving a response from an API. These parsing, extraction, and processing steps occur sequentially when a new response is received. If a response was previously handled, the coordinator will attempt to retrieve these responses from the processing cache.
Example
>>> from scholar_flux.api import SearchAPI, ResponseCoordinator, BaseCoordinator
# Note: the SearchAPI uses PLOS by default if `provider_name` is not provided.
# Unless the `SCHOLAR_FLUX_DEFAULT_PROVIDER` env variable is set to another provider.
>>> base_search_coordinator = BaseCoordinator(search_api = SearchAPI(query = 'Math'),
...                                           response_coordinator = ResponseCoordinator.build())
>>> response = base_search_coordinator.search(page = 1)
>>> response
# OUTPUT <ProcessedResponse(len=20, cache_key=None, metadata="{'numFound': 14618, 'start': 1, ...})>
# All processed records for a particular response can be found under response.data (a list of dictionaries)
>>> list(response.data[0].keys())
# OUTPUT ['article_type', 'eissn', 'id', 'journal', 'publication_date', 'score', 'title_display',
# 'abstract', 'author_display']
- __init__(search_api: SearchAPI, response_coordinator: ResponseCoordinator)[source]
Initializes the base coordinator by delegating assignment of attributes to the _initialize method. Future coordinators can follow a similar pattern of using an _initialize for initial parameter assignment.
- Parameters:
search_api (SearchAPI) – The search API to use for the retrieval of response records from APIs
response_coordinator (ResponseCoordinator) – Core class used to handle the processing and core handling of all responses from APIs
- classmethod as_coordinator(search_api: SearchAPI, response_coordinator: ResponseCoordinator, *args: Any, **kwargs: Any) Self[source]
Helper factory method for building a SearchCoordinator that allows users to build from the final building blocks of a SearchCoordinator.
- Parameters:
search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs.
response_coordinator (Optional[ResponseCoordinator]) – Core class used to handle the processing and core handling of all responses from APIs.
- Returns:
A newly created coordinator class or subclass that orchestrates record retrieval and processing.
- Return type:
Self
- property display_name: str
Human-readable provider name for logging and display purposes.
- property extractor: BaseDataExtractor
Allows direct access to the DataExtractor from the ResponseCoordinator.
- property last_response: ProcessedResponse | ErrorResponse | None
Retrieves the last response sent to a provider.
- parameter_search(**kwargs: Any) ProcessedResponse | ErrorResponse | None[source]
Public method for retrieving and processing non-paginated records with directly specified parameters.
This method is designed as a direct entrypoint to performing searches without the addition of otherwise automatically populated, pagination-related fields such as query, records_per_page, etc. while still taking advantage of the orchestration features of the current coordinator.
- property parser: BaseDataParser
Allows direct access to the data parser from the ResponseCoordinator.
- property processor: ABCDataProcessor
Allows direct access to the DataProcessor from the ResponseCoordinator.
- property provider_name: str
Property method for accessing the provider name in the current SearchAPI instance.
- Returns:
The name corresponding to the API Provider.
- property response_coordinator: ResponseCoordinator
Allows the ResponseCoordinator to be used as a property.
The response_coordinator handles and coordinates the processing of API responses from parsing, record/metadata extraction, processing, and cache management.
- property responses: ResponseCoordinator
An alias for the response_coordinator property that is used for orchestrating the processing of retrieved API responses.
Handles response orchestration, including response content parsing, the extraction of records/metadata, record processing, and cache operations.
- search(**kwargs: Any) ProcessedResponse | ErrorResponse | None[source]
Public Search Method coordinating the retrieval and processing of an API response.
This method serves as the base and will primarily handle the “How” of searching (e.g. Workflows, Single page search, etc.)
- property search_api: SearchAPI
Allows the search_api to be used as a property while also allowing for verification.
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for quickly showing a representation of the overall structure of the SearchCoordinator. The helper function, generate_repr_from_string helps produce human-readable representations of the core structure of the Coordinator.
- Parameters:
flatten (bool) – Whether to flatten the coordinator’s structural representation into a single line. Default=False
show_value_attributes (bool) – Whether to show nested attributes of the components of the BaseCoordinator or its subclass.
- Returns:
The structure of the current SearchCoordinator as a string.
- Return type:
str
- classmethod update(search_coordinator: Self, search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, **kwargs: Any) Self[source]
Creates a new coordinator with optionally replaced core components.
- Parameters:
search_coordinator – The coordinator to base the new instance on.
search_api (Optional[SearchAPI]) – Replacement SearchAPI, or None to keep existing.
response_coordinator (Optional[ResponseCoordinator]) – Replacement ResponseCoordinator, or None to keep existing.
**kwargs – Additional keyword arguments to be passed to BaseCoordinator.as_coordinator()
- Returns:
A new coordinator instance with the specified components.
- Return type:
Self
- with_components(search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, **update_kwargs: Any) Generator[Self, None, None][source]
Temporarily creates and yields a new coordinator with modified core components.
- Parameters:
search_api (Optional[SearchAPI]) – Replacement SearchAPI.
response_coordinator (Optional[ResponseCoordinator]) – Replacement ResponseCoordinator.
**update_kwargs – Optional keyword arguments to be passed to update
- Yields:
Self – A new coordinator instance with the specified modifications.
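The relationship between `update` (build a new coordinator, keeping any component passed as None) and `with_components` (yield such a copy temporarily) can be sketched as below. `CoordinatorSketch` is a hypothetical stand-in that stores plain strings instead of real components, and the use of `contextlib.contextmanager` is an assumption about how the generator is meant to be consumed.

```python
from contextlib import contextmanager
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CoordinatorSketch:
    """Hypothetical stand-in for a coordinator holding two components."""
    search_api: str
    response_coordinator: str

    @classmethod
    def update(cls, coordinator, search_api=None, response_coordinator=None):
        # None means "keep the existing component".
        return replace(
            coordinator,
            search_api=coordinator.search_api if search_api is None else search_api,
            response_coordinator=(coordinator.response_coordinator
                                  if response_coordinator is None
                                  else response_coordinator),
        )

    @contextmanager
    def with_components(self, search_api=None, response_coordinator=None):
        # Yield a modified copy; the original instance is never mutated.
        yield self.update(self, search_api, response_coordinator)


original = CoordinatorSketch("plos-api", "default-responses")
with original.with_components(search_api="crossref-api") as temporary:
    inside = (temporary.search_api, temporary.response_coordinator)
```

Here `inside` holds the replaced search component alongside the unchanged response component, while `original` keeps both of its initial values.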
scholar_flux.api.multisearch_coordinator module
Defines the MultiSearchCoordinator that builds on the features implemented by the SearchCoordinator to create multiple queries to different providers either sequentially or by using multithreading.
This implementation uses shared rate limiting to ensure that rate limits to different providers are not exceeded.
- class scholar_flux.api.multisearch_coordinator.MultiSearchCoordinator(*args: Any, **kwargs: Any)[source]
Bases: UserDict[str, SearchCoordinator]
The MultiSearchCoordinator is a utility class for orchestrating searches across multiple providers, pages, and queries sequentially or using multithreading. This coordinator builds on the SearchCoordinator’s core structure to ensure consistent, rate-limited API requests.
The multi-search coordinator uses shared rate limiters to ensure that requests to the same provider (even across different queries) will use the same rate limiter.
This implementation uses the ThreadedRateLimiter.min_interval parameter from the shared rate limiter of each provider to determine the request_delay across all queries. These settings can be found and modified in the scholar_flux.api.providers.threaded_rate_limiter_registry by provider_name.
For new, unregistered providers, users can override the MultiSearchCoordinator.DEFAULT_THREADED_REQUEST_DELAY class variable to adjust the shared request_delay.
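A shared, per-provider rate limiter of the kind described above might look roughly like this. `MinIntervalLimiter`, `registry`, and `limiter_for` are hypothetical names; the library's `ThreadedRateLimiter` and its registry may be implemented quite differently.

```python
import threading
import time


class MinIntervalLimiter:
    """Hypothetical limiter; the library's ThreadedRateLimiter may differ."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._last_call = None

    def wait(self) -> None:
        # Holding the lock while sleeping serializes callers across threads.
        with self._lock:
            now = time.monotonic()
            if self._last_call is not None:
                remaining = self.min_interval - (now - self._last_call)
                if remaining > 0:
                    time.sleep(remaining)
            self._last_call = time.monotonic()


DEFAULT_DELAY = 6.0  # mirrors DEFAULT_THREADED_REQUEST_DELAY
registry: dict = {}


def limiter_for(provider_name: str) -> MinIntervalLimiter:
    # Every coordinator asking for the same provider gets the same limiter.
    return registry.setdefault(provider_name, MinIntervalLimiter(DEFAULT_DELAY))


limiter_for("arxiv").min_interval = 0.05  # shortened for the demonstration
started = time.monotonic()
for _ in range(3):
    limiter_for("arxiv").wait()
elapsed = time.monotonic() - started
```

Because the limiter is looked up by provider name, two queries against the same provider share one delay, while an unregistered provider starts from the default interval.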
Examples
>>> from scholar_flux import MultiSearchCoordinator, SearchCoordinator, RecursiveDataProcessor
>>> from scholar_flux.api.rate_limiting import threaded_rate_limiter_registry
>>> multi_search_coordinator = MultiSearchCoordinator()
>>> threaded_rate_limiter_registry['arxiv'].min_interval = 6 # arbitrary rate limit (seconds per request)
>>>
>>> # Create coordinators for different queries and providers
>>> coordinators = [
...     SearchCoordinator(
...         provider_name=provider,
...         query=query,
...         processor=RecursiveDataProcessor(),
...         user_agent="SammieH",
...         cache_requests=True
...     )
...     for query in ('ml', 'nlp')
...     for provider in ('plos', 'arxiv', 'openalex', 'crossref')
... ]
>>>
>>> # Add coordinators to the multi-search coordinator
>>> multi_search_coordinator.add_coordinators(coordinators)
>>>
>>> # Execute searches across multiple pages
>>> all_pages = multi_search_coordinator.search_pages(pages=[1, 2, 3])
>>>
>>> # filters and retains successful requests from the multi-provider search
>>> filtered_pages = all_pages.filter()
>>> # The results will contain successfully processed responses across all queries, pages, and providers
>>> print(filtered_pages) # Output will be a list of SearchResult objects
>>> # Extracts successfully processed records into a list of records where each record is a dictionary
>>> record_dict = filtered_pages.join() # retrieves a list of records
>>> print(record_dict) # Output will be a flattened list of all records
- DEFAULT_THREADED_REQUEST_DELAY: float | int = 6.0
- __init__(*args: Any, **kwargs: Any) None[source]
Initializes the MultiSearchCoordinator, allowing positional and keyword arguments to be specified when creating the MultiSearchCoordinator.
The initialization of the MultiSearchCoordinator operates similarly to that of a regular dict with the caveat that values are statically typed as SearchCoordinator instances.
- add(search_coordinator: SearchCoordinator) None[source]
Adds a new SearchCoordinator to the MultiSearchCoordinator instance.
- Parameters:
search_coordinator (SearchCoordinator) – A search coordinator to add to the MultiSearchCoordinator dict
- Raises:
InvalidCoordinatorParameterException – If the provided object is not a SearchCoordinator.
- add_coordinators(search_coordinators: Iterable[SearchCoordinator]) None[source]
Helper method for adding a sequence of coordinators at a time.
- property coordinators: list[SearchCoordinator]
Utility property for quickly retrieving a list of all currently registered coordinators.
- current_providers() set[str][source]
Extracts a set of names corresponding to each API provider assigned to the MultiSearchCoordinator.
- classmethod from_coordinators(search_coordinators: Iterable[SearchCoordinator]) Self[source]
Constructs a new MultiSearchCoordinator instance from a sequence of coordinators.
- group_by_provider() dict[str, dict[str, SearchCoordinator]][source]
Groups all coordinators by provider name to facilitate retrieval with normalized components where needed. This is especially helpful for the later retrieval of articles when multithreading by provider (as opposed to by page) to account for strict rate limits. All coordinated searches for a provider appear under a nested dictionary so that they can be orchestrated on the same thread with the same rate limiter.
- Returns:
A dictionary that maps each normalized provider name to a nested dictionary of that provider’s coordinators.
- Return type:
dict[str, dict[str, SearchCoordinator]]
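The nested return shape can be illustrated with a short sketch. `CoordinatorStub` is a hypothetical stand-in for a SearchCoordinator, and the normalization used here (lowercasing and stripping whitespace) is an assumption; the library may normalize provider names differently.

```python
from collections import defaultdict
from typing import NamedTuple


class CoordinatorStub(NamedTuple):
    """Hypothetical stand-in for a SearchCoordinator."""
    provider_name: str
    query: str


def group_by_provider(coordinators: dict) -> dict:
    grouped = defaultdict(dict)
    for key, coordinator in coordinators.items():
        # Assumed normalization: case-insensitive, surrounding whitespace ignored.
        provider = coordinator.provider_name.strip().lower()
        grouped[provider][key] = coordinator
    return dict(grouped)


coordinators = {
    "plos_ml": CoordinatorStub("PLOS", "ml"),
    "plos_nlp": CoordinatorStub("plos", "nlp"),
    "arxiv_ml": CoordinatorStub("arXiv", "ml"),
}
groups = group_by_provider(coordinators)
# groups has the keys "plos" and "arxiv"; groups["plos"] holds both PLOS coordinators.
```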
- iter_pages(pages: Sequence[int] | PageListInput, iterate_by_group: bool = False, **kwargs: Any) Generator[SearchResult, None, None][source]
Helper method that creates and joins a sequence of generator functions for retrieving and processing records from each combination of queries, pages, and providers in sequence. This implementation uses SearchCoordinator.iter_pages to dynamically identify when page retrieval should halt for each API provider, accounting for errors, timeouts, and fewer records than expected, before filtering records with pre-specified criteria.
- Parameters:
pages (Sequence[int]) – A sequence of page numbers to iteratively request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Yields:
SearchResult – Iteratively returns the SearchResult for each provider, query, and page using a generator expression. Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)
- iter_pages_threaded(pages: Sequence[int] | PageListInput, max_workers: int | None = None, **kwargs: Any) Generator[SearchResult, None, None][source]
Threads by provider to respect rate limits. This helper method uses threading to simultaneously run a sequence of generator functions that retrieve and process records from each combination of queries, pages, and providers, with one multi-threaded sequence per provider.
This implementation also uses SearchCoordinator.iter_pages to dynamically identify when page retrieval should halt for each API provider, accounting for errors, timeouts, and fewer records than expected, before filtering records with pre-specified criteria.
Note that because threading is performed by provider, this method will not differ significantly in speed from the MultiSearchCoordinator.iter_pages method if only a single provider has been specified.
- Parameters:
pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Yields:
SearchResult – Iteratively returns the SearchResult for each provider, query, and page using a generator expression as each SearchResult becomes available after multi-threaded processing. Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)
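The idea of threading by provider while yielding results as they arrive can be sketched with one worker thread per provider and a shared queue. This is a simplified illustration, not the library's implementation: the function below is hypothetical, it ignores `max_workers`, and it emits placeholder tuples where real requests would be sent.

```python
import queue
import threading
from typing import Dict, Iterator, List, Sequence, Tuple

Result = Tuple[str, str, int]  # (provider, query, page)


def iter_pages_threaded(groups: Dict[str, List[str]],
                        pages: Sequence[int]) -> Iterator[Result]:
    results: "queue.Queue" = queue.Queue()
    done = object()  # sentinel marking the end of one provider's work

    def run_provider(provider: str, queries: List[str]) -> None:
        # One thread per provider: that provider's requests stay sequential.
        try:
            for query in queries:
                for page in pages:
                    results.put((provider, query, page))  # placeholder for a real request
        finally:
            results.put(done)

    threads = [threading.Thread(target=run_provider, args=item) for item in groups.items()]
    for thread in threads:
        thread.start()
    remaining = len(threads)
    while remaining:
        item = results.get()
        if item is done:
            remaining -= 1
        else:
            yield item  # handed to the caller as soon as it is available
    for thread in threads:
        thread.join()


found = list(iter_pages_threaded({"plos": ["ml", "nlp"], "arxiv": ["ml"]}, pages=[1, 2]))
```

Results from different providers may interleave in any order, but the results of a single provider keep their sequential order because they are produced by one thread.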
- search(page: int = 1, iterate_by_group: bool = False, max_workers: int | None = None, multithreading: bool = True, **kwargs: Any) SearchResultList[source]
Public method used to search for a single page from multiple providers at once using a sequential or multithreading approach. This approach delegates the search to search_pages to retrieve a single page for each query and provider using an iterative approach to search for articles grouped by provider.
Note that the MultiSearchCoordinator.search_pages method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.
- Parameters:
page (int) – The page number to iteratively request from each API Provider.
iterate_by_group (bool) – Determines whether all searches should be performed by page or by group. Note that page-based iteration is significantly faster due to API rate limits. This is set to False by default as a result.
max_workers (Optional[int]) – Determines how many threads should operate at one time. Applies only when multithreading is set to True. When None, as many threads are used as required.
multithreading (bool) – Multithreading is used when this parameter is set to True. Otherwise, sequential iteration is performed. Multithreading is enabled by default.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Returns:
The list containing all retrieved and processed pages from the API. If any non-stopping errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.
- Return type:
SearchResultList
- search_page(page: int = 1, **kwargs: Any) SearchResultList[source]
Retrieves a single page from all registered coordinators.
This method provides API compatibility with SearchCoordinator.search_page, returning results wrapped in SearchResult containers with provider metadata.
- Parameters:
page (int) – The page number to retrieve from each provider.
**kwargs – Additional arguments to pass to MultiSearchCoordinator.search_pages or the search_pages method for each individual coordinator.
- Returns:
Results from all coordinators for the specified page.
- Return type:
SearchResultList
- search_pages(pages: Sequence[int] | PageListInput, iterate_by_group: bool = False, max_workers: int | None = None, multithreading: bool = True, *, min_records: int | None = None, page_offset: int = 0, **kwargs: Any) SearchResultList[source]
Searches for records from multiple providers using a sequential or multithreading approach.
Note that the MultiSearchCoordinator.search_pages method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.
- Parameters:
pages (Sequence[int]) – A sequence of page numbers to iteratively request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
min_records (int) – The total number of records to retrieve sequentially. If not provided as an integer, the pages argument is validated immediately instead. No-Op when pages is a non-empty/non-zero value.
page_offset (int) – The page offset to begin record retrieval from (0 by default). This parameter is only relevant when a min_records value is provided instead of a page number.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Returns:
The list containing all retrieved and processed pages from the API. If any non-stopping errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.
- Return type:
SearchResultList
- search_records(min_records: int, page_offset: int = 0, **kwargs: Any) SearchResultList[source]
Helper method for retrieving a minimum of min_records records across all API providers.
This method retrieves a minimum of min_records per provider unless no pages remain to be retrieved or a non-retryable error occurs during processing. Note that this method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.
- Parameters:
min_records (int) – The total number of records to retrieve sequentially.
page_offset (int) – The page offset to begin record retrieval from (0 by default).
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Returns:
The list containing all retrieved and processed pages from the API. If any non-stopping errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.
- Return type:
SearchResultList
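As a rough illustration of how `min_records` and `page_offset` could translate into page numbers, the sketch below plans the pages needed for one provider. The helper `pages_for_min_records` and its `records_per_page` argument are hypothetical, and 1-based page numbering is assumed; the library may instead keep requesting pages until enough records have actually been retrieved.

```python
import math
from typing import List


def pages_for_min_records(min_records: int, records_per_page: int,
                          page_offset: int = 0) -> List[int]:
    """Hypothetical planner: pages needed to reach at least min_records."""
    if min_records <= 0 or records_per_page <= 0:
        raise ValueError("min_records and records_per_page must be positive")
    page_count = math.ceil(min_records / records_per_page)
    first_page = page_offset + 1  # assumes 1-based page numbers
    return list(range(first_page, first_page + page_count))


plan = pages_for_min_records(50, records_per_page=20)                     # [1, 2, 3]
shifted = pages_for_min_records(50, records_per_page=20, page_offset=2)   # [3, 4, 5]
```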
- select(query: str | None = None, provider_name: str | None = None) list[SearchCoordinator][source]
Helper method that enables the selection of coordinators based on their query or provider name.
scholar_flux.api.response_coordinator module
The scholar_flux.api.response_coordinator module implements the ResponseCoordinator that is used to coordinate the processing of successfully and unsuccessfully retrieved responses. This class is used by the SearchCoordinator to orchestrate the response parsing, processing and caching of responses.
The ResponseCoordinator relies on dependency injection to modify the processing methods used at each step.
- class scholar_flux.api.response_coordinator.ResponseCoordinator(parser: BaseDataParser, extractor: BaseDataExtractor, processor: ABCDataProcessor, cache_manager: DataCacheManager)[source]
Bases: object
Coordinates the parsing, extraction, processing, and caching of API responses. The ResponseCoordinator operates on the concept of dependency injection to orchestrate the entire process.
Note that the overall composition of the coordinator is a governing factor in how the response is processed. The ResponseCoordinator uses a cache key and schema fingerprint to ensure that it is only returning a processed response from the cache storage if the structure of the coordinator at the time of cache storage has not changed.
To ensure that we’re not pulling from cache on significant changes to the ResponseCoordinator, we validate the schema by default using DEFAULT_VALIDATE_FINGERPRINT. When the schema changes, previously cached data is ignored, although this can be explicitly overridden during response handling.
The coordinator orchestration process operates mainly through the ResponseCoordinator.handle_response method that sequentially calls the parser, extractor, processor, and cache_manager.
Example workflow:
>>> from scholar_flux.api import SearchAPI, ResponseCoordinator
>>> api = SearchAPI(query = 'technological innovation', provider_name = 'crossref', user_agent = 'scholar_flux')
>>> response_coordinator = ResponseCoordinator.build() # uses defaults with caching in-memory
>>> response = api.search(page = 1)
# future calls with the same structure will be cached
>>> processed_response = response_coordinator.handle_response(response, cache_key='tech-innovation-cache-key-page-1')
# the ProcessedResponse (or ErrorResponse) stores critical fields from the original and processed response
>>> processed_response
# OUTPUT: ProcessedResponse(len=20, cache_key='tech-innovation-cache-key-page-1', metadata=...)
>>> new_processed_response = response_coordinator.handle_response(processed_response, cache_key='tech-innovation-cache-key-page-1')
>>> new_processed_response
# OUTPUT: ProcessedResponse(len=20, cache_key='tech-innovation-cache-key-page-1', metadata=...)
Note that the entire process can be orchestrated via the SearchCoordinator that uses the SearchAPI and ResponseCoordinator as core dependency injected components:
>>> from scholar_flux import SearchCoordinator
>>> search_coordinator = SearchCoordinator(api, response_coordinator, cache_requests=True)
# uses a default cache key constructed from the response internally
>>> processed_response = search_coordinator.search(page = 1)
# OUTPUT: ProcessedResponse(len=20, cache_key='crossref_technological innovation_1_20', metadata=...)
>>> processed_response.content == new_processed_response.content
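The fingerprint check described earlier in this class description (cached results are ignored once the coordinator's structure changes, unless validation is disabled) can be sketched with the standard library. `FingerprintedCache` and the schema strings are hypothetical; they are not the library's DataCacheManager or its fingerprint format.

```python
import hashlib
from typing import Any, Dict, Optional


def fingerprint(structure: str) -> str:
    # Stable digest of a textual description of the pipeline's components.
    return hashlib.sha256(structure.encode("utf-8")).hexdigest()


class FingerprintedCache:
    """Hypothetical cache; not the library's DataCacheManager."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, Any]] = {}

    def store(self, cache_key: str, data: Any, schema: str) -> None:
        self._entries[cache_key] = {"data": data, "fingerprint": fingerprint(schema)}

    def retrieve(self, cache_key: str, schema: str,
                 validate_fingerprint: bool = True) -> Optional[Any]:
        entry = self._entries.get(cache_key)
        if entry is None:
            return None
        if validate_fingerprint and entry["fingerprint"] != fingerprint(schema):
            return None  # the pipeline changed, so the cached result is ignored
        return entry["data"]


cache = FingerprintedCache()
cache.store("page-1", [{"title": "A"}], schema="parser=json|processor=flat")
same = cache.retrieve("page-1", schema="parser=json|processor=flat")
changed = cache.retrieve("page-1", schema="parser=json|processor=recursive")
forced = cache.retrieve("page-1", schema="parser=json|processor=recursive",
                        validate_fingerprint=False)
```

In this sketch `same` and `forced` return the stored records, while `changed` is None because the schema description no longer matches the one recorded at storage time.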
- Core Attributes:
parser (BaseDataParser): Parses raw API responses.
extractor (BaseDataExtractor): Extracts records and metadata.
processor (ABCDataProcessor): Processes extracted data.
cache_manager (DataCacheManager): Manages response cache.
- DEFAULT_VALIDATE_FINGERPRINT: bool = True
- __init__(parser: BaseDataParser, extractor: BaseDataExtractor, processor: ABCDataProcessor, cache_manager: DataCacheManager)[source]
Initializes a ResponseCoordinator with specified components for response parsing, processing, and caching.
- Parameters:
parser (BaseDataParser) – First step of the response processing pipeline: parses response records into a dictionary.
extractor (BaseDataExtractor) – Extracts both records and metadata from an API response separately for future processing steps.
processor (ABCDataProcessor) – Processes the list of dictionary-based records that were previously extracted from the APIResponse.
cache_manager (DataCacheManager) – Manages the processed record caching for faster response processing for identical responses.
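The four injected components run in a fixed order: parse, extract, process, then cache. The following sketch shows that order with plain callables and a dictionary cache. `PipelineSketch`, `extract`, and `keep_ids` are hypothetical, and checking the cache before parsing is an assumption about the order of operations.

```python
import json
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]


class PipelineSketch:
    """Hypothetical pipeline; not the library's ResponseCoordinator."""

    def __init__(self,
                 parser: Callable[[str], dict],
                 extractor: Callable[[dict], Tuple[List[Record], dict]],
                 processor: Callable[[List[Record]], List[Record]],
                 cache: Dict[str, List[Record]]) -> None:
        self.parser = parser
        self.extractor = extractor
        self.processor = processor
        self.cache = cache

    def handle(self, body: str, cache_key: str) -> List[Record]:
        if cache_key in self.cache:  # assumed: cached results short-circuit the pipeline
            return self.cache[cache_key]
        parsed = self.parser(body)
        records, _metadata = self.extractor(parsed)
        processed = self.processor(records)
        self.cache[cache_key] = processed
        return processed


def extract(parsed: dict) -> Tuple[List[Record], dict]:
    response = parsed["response"]
    return response["docs"], {"numFound": response["numFound"]}


def keep_ids(records: List[Record]) -> List[Record]:
    return [{"id": record["id"]} for record in records]


pipeline = PipelineSketch(json.loads, extract, keep_ids, cache={})
body = '{"response": {"numFound": 2, "docs": [{"id": "a", "score": 1.0}, {"id": "b", "score": 0.5}]}}'
first = pipeline.handle(body, cache_key="page-1")
second = pipeline.handle(body, cache_key="page-1")  # served from the cache
```

Swapping any one callable changes that step without touching the others, which is the practical benefit of injecting the components.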
- classmethod build(parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None, annotate_records: bool | None = None) ResponseCoordinator[source]
Factory method to build a ResponseCoordinator with sensible defaults.
- Parameters:
parser (Optional[BaseDataParser]) – First step of the response processing pipeline: parses response records into a dictionary.
extractor (Optional[BaseDataExtractor]) – Extracts both records and metadata from an API response separately for future processing steps.
processor (Optional[ABCDataProcessor]) – Processes the list of dictionary-based records that were previously extracted from the APIResponse.
cache_manager (Optional[DataCacheManager]) – Manages the processed record caching for faster response processing for identical responses.
cache_results (Optional[bool]) – Determines whether or not to cache processed responses. Enabled by default unless specified or if a cache manager is already provided.
annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).
- Returns:
A fully constructed coordinator.
- Return type:
ResponseCoordinator
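The effect of `annotate_records` can be pictured with a small sketch that adds `_extraction_index` and `_record_id` to each record. The helper below is hypothetical, and the exact `_record_id` format (a truncated SHA-256 of the record's JSON form joined to the index) is an assumption; the text only states that it combines a content hash with the index.

```python
import hashlib
import json
from typing import Any, Dict, List


def annotate_records(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical annotation step; the library's id format may differ."""
    annotated = []
    for index, record in enumerate(records):
        payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:12]
        annotated.append({
            **record,
            "_extraction_index": index,          # position in the extracted list
            "_record_id": f"{digest}_{index}",   # content hash plus index
        })
    return annotated


records = [{"title": "A"}, {"title": "A"}, {"title": "B"}]
annotated = annotate_records(records)
```

Including the index keeps identifiers distinct even when two records have identical content, so each processed or flattened record can still be traced back to its original position.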
- property cache: DataCacheManager
Alias for the response data processing cache manager:
Also allows direct access to the DataCacheManager from the ResponseCoordinator
- property cache_manager: DataCacheManager
Allows direct access to the DataCacheManager from the ResponseCoordinator.
- classmethod configure_cache(cache_manager: DataCacheManager | None = None, cache_results: bool | None = None) DataCacheManager[source]
Helper method for building and swapping out cache managers depending on the cache chosen.
- Parameters:
cache_manager (Optional[DataCacheManager]) – An optional cache manager to use
cache_results (Optional[bool]) – Ground truth parameter, used to resolve whether to use caching when the cache_manager and cache_results contradict
- Returns:
An existing or newly created cache manager that can be used with the ResponseCoordinator
- Return type:
DataCacheManager
- property extractor: BaseDataExtractor
Allows direct access to the DataExtractor from the ResponseCoordinator.
- handle_response(response: Response | ResponseProtocol, cache_key: str | None = None, from_cache: bool = True, validate_fingerprint: bool | None = None, normalize_records: bool | None = None) ErrorResponse | ProcessedResponse[source]
Handles response data extraction, processing, and caching, retrieving response data from cache if available.
Once processed, the response data is transformed into a pydantic ProcessedResponse or ErrorResponse model that contains the response content, processing information, metadata, and/or error details when relevant.
- Parameters:
response (Response) – Raw API response.
cache_key (Optional[str]) – Cache key for storing/retrieving.
from_cache (bool) – Indicates whether the response data should be retrieved from cache if available.
validate_fingerprint (Optional[bool]) – Indicates whether cache should be invalidated if the ResponseCoordinator components are modified.
normalize_records (Optional[bool]) – Determines whether records should be normalized after processing.
- Returns:
A pydantic model containing the response data and detailed processing info.
- Return type:
ErrorResponse | ProcessedResponse
- handle_response_data(response: Response | ResponseProtocol, cache_key: str | None = None, **kwargs: Any) RecordList | None[source]
Retrieves the data from the processed response from cache if previously cached. Otherwise the data is retrieved after processing the response.
- Parameters:
response (Response | ResponseProtocol) – Raw API response.
cache_key (Optional[str]) – Cache key for storing/retrieving.
**kwargs – Additional keyword arguments to pass to ResponseCoordinator.handle_response.
- Returns:
Processed response data or None.
- Return type:
Optional[RecordList]
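The cache-first behaviour described above can be pictured with a small, self-contained sketch. It is not the ResponseCoordinator implementation: handle_cached, the process callable, and the plain-dict cache are hypothetical stand-ins that only illustrate the "look up by cache key, otherwise process and store" pattern.

```python
from typing import Any, Callable, Optional

# Hypothetical stand-in for a cache manager: a plain dict keyed by cache_key.
_cache: dict[str, list[dict[str, Any]]] = {}


def handle_cached(
    raw: bytes,
    process: Callable[[bytes], list[dict[str, Any]]],
    cache_key: Optional[str] = None,
    from_cache: bool = True,
) -> list[dict[str, Any]]:
    """Return cached records when allowed; otherwise process the payload and store the result."""
    if from_cache and cache_key is not None and cache_key in _cache:
        return _cache[cache_key]
    records = process(raw)
    if cache_key is not None:
        _cache[cache_key] = records
    return records


calls: list[bytes] = []


def fake_process(raw: bytes) -> list[dict[str, Any]]:
    # Records every invocation so that repeated processing is visible.
    calls.append(raw)
    return [{"title": raw.decode()}]


first = handle_cached(b"page-1", fake_process, cache_key="plos:ml:1")
second = handle_cached(b"page-1", fake_process, cache_key="plos:ml:1")
```

In this sketch the second call never reaches fake_process because the key is already cached; passing from_cache=False forces reprocessing.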
- property parser: BaseDataParser
Allows direct access to the data parser from the ResponseCoordinator.
- property processor: ABCDataProcessor
Allows direct access to the DataProcessor from the ResponseCoordinator.
- schema_fingerprint() str[source]
Helper method for generating a concise view of the current structure of the response coordinator.
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for retrieving a string representation of the overall structure of the current ResponseCoordinator. The helper function generate_repr_from_string produces human-readable representations of the core structure of the ResponseCoordinator.
- Parameters:
flatten (bool) – Whether to flatten the ResponseCoordinator’s structural representation into a single line.
show_value_attributes (bool) – Whether to show nested attributes of the components in the structure of the current ResponseCoordinator instance.
- Returns:
The structure of the current ResponseCoordinator as a string.
- Return type:
str
- summary() str[source]
Helper method for creating a quick summary representation of the structure of the ResponseCoordinator.
- classmethod update(response_coordinator: ResponseCoordinator, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None, annotate_records: bool | None = None) ResponseCoordinator[source]
Factory method to create a new ResponseCoordinator from an existing configuration.
- Parameters:
response_coordinator (ResponseCoordinator) – ResponseCoordinator containing the defaults to swap
parser (Optional[BaseDataParser]) – First step of the response processing pipeline - parses response records into a dictionary
extractor (Optional[BaseDataExtractor]) – Extracts both records and metadata from responses separately
processor (Optional[ABCDataProcessor]) – Processes API responses into list of dictionaries
cache_manager (Optional[DataCacheManager]) – Manages the caching of processed records for faster retrieval
cache_results (Optional[bool]) – Determines whether or not to cache processed responses - on by default unless specified or if a cache manager is already provided
annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).
- Returns:
A fully constructed coordinator.
- Return type:
ResponseCoordinator
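As a rough illustration of what annotate_records adds, the sketch below attaches an _extraction_index and a _record_id to each record. The exact hash function and identifier layout used by scholar_flux are not stated in this reference; the truncated SHA-256 digest over sorted JSON and the "digest_index" format are assumptions chosen only for the example, and annotate is a hypothetical helper.

```python
import hashlib
import json
from typing import Any


def annotate(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the records with positional and content-based linkage fields."""
    annotated = []
    for index, record in enumerate(records):
        # Assumption: the content hash is a truncated SHA-256 of the JSON-serialized record.
        payload = json.dumps(record, sort_keys=True, default=str).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:12]
        annotated.append({**record, "_extraction_index": index, "_record_id": f"{digest}_{index}"})
    return annotated


records = [{"title": "A"}, {"title": "B"}, {"title": "A"}]
annotated = annotate(records)
```

Combining the content hash with the index keeps identifiers distinct even when two records have identical content, which is what lets a flattened or processed record be traced back to its original position.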
scholar_flux.api.response_validator module
The scholar_flux.api.response_validator module implements a basic ResponseValidator that is used for preliminary response validation to determine whether received responses are valid and successful.
This class is used by default in SearchCoordinators to determine whether to proceed with response processing.
- class scholar_flux.api.response_validator.ResponseValidator[source]
Bases: object
Helper class that serves as an initial response validation step to ensure that, in custom retry handling, the basic structure of a response can be validated to determine whether or not to retry the response retrieval process.
The ResponseValidator implements simple class methods that return boolean values (True/False). They return False when response or response-like objects do not contain the required structure, and raise errors when they encounter non-response objects or when raise_on_error=True.
The ResponseValidator also contains helpers for the validation of both processed responses and responses that are reconstructed after storage and deserialization.
Example
>>> from scholar_flux.api import ResponseValidator, ReconstructedResponse
>>> mock_success_response = ReconstructedResponse.build(status_code = 200,
...                                                     json = {'response': 'success'},
...                                                     url = "https://an-example-url.com",
...                                                     headers={'Content-Type': 'application/json'}
...                                                     )
>>> ResponseValidator.validate_response(mock_success_response) is True
>>> ResponseValidator.validate_content(mock_success_response) is True
- classmethod identify_invalid_fields(response: Response | ResponseProtocol) dict[str, Any][source]
Helper class method for identifying invalid fields within a response.
This method iteratively checks each field of the current response and collects those that are invalid.
If any invalid fields exist, the method returns a dictionary of each field and its corresponding value.
- Parameters:
response (requests.Response | ResponseProtocol) – A response or response-like object to check for the presence of invalid values.
- Returns:
A dictionary containing each invalid field as keys and their assigned values
- Return type:
dict[str, Any]
- classmethod identify_invalid_keywords(status_code: object | None = None, url: object | None = None, reason: object | None = None, content: object | None = None, headers: object | None = None) dict[str, object][source]
Validates response field keyword arguments, indicating those that contain invalid values.
- Parameters:
status_code (Optional[object]) – The status code to validate (expected: int 100-599).
url (Optional[object]) – The URL to validate (should be a valid url).
reason (Optional[object]) – The reason string to validate (should be a string).
content (Optional[object]) – The content to validate (should be a bytes field).
headers (Optional[object]) – The headers to validate (should be a mapping with string-typed keys).
- Returns:
A dictionary containing each invalid field as a key and its assigned value.
- Return type:
dict[str, object]
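The expectations listed above (integer status code in 100-599, valid URL, string reason, bytes content, mapping headers with string keys) can be sketched with standard-library checks. This is an illustrative approximation rather than the ResponseValidator's code: invalid_keywords and CHECKS are hypothetical names, treating None as "not supplied" is an assumption based on the default arguments, and restricting URLs to http/https is also an assumption.

```python
from collections.abc import Mapping
from urllib.parse import urlparse


def is_valid_status_code(value: object) -> bool:
    # bool is a subclass of int, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and 100 <= value <= 599


def is_valid_url(value: object) -> bool:
    if not isinstance(value, str):
        return False
    parsed = urlparse(value)
    # Assumption: only absolute http(s) URLs count as valid here.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def is_valid_headers(value: object) -> bool:
    return isinstance(value, Mapping) and all(isinstance(key, str) for key in value)


CHECKS = {
    "status_code": is_valid_status_code,
    "url": is_valid_url,
    "reason": lambda value: isinstance(value, str),
    "content": lambda value: isinstance(value, bytes),
    "headers": is_valid_headers,
}


def invalid_keywords(**fields: object) -> dict[str, object]:
    """Return only the supplied fields that fail their check; None is treated as not supplied."""
    return {
        name: value
        for name, value in fields.items()
        if value is not None and not CHECKS[name](value)
    }


result = invalid_keywords(status_code=700, url="https://an-example-url.com", content="text")
```

Here the out-of-range status code and the str passed as content are reported, while the URL passes and the omitted fields are ignored.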
- classmethod is_valid_content(content: object) TypeGuard[bytes][source]
Validates whether content is a valid bytes object.
- classmethod is_valid_headers(headers: object) TypeGuard[Mapping[str, str]][source]
Validates whether headers is a mapping containing string-typed keys/values.
- classmethod is_valid_reason(reason: object) TypeGuard[str][source]
Validates whether reason is a valid string.
- classmethod is_valid_response_structure(response: object) TypeGuard[ResponseProtocol][source]
Validates whether each of the core components of a response are populated with the correct response types.
The following properties that refer back to the original response should be available:
status_code: (int)
reason: string
headers: dictionary
content: bytes
url: string or URL-like field
- Parameters:
response (object) – An object to evaluate as a response or response-like object.
- Returns:
True if all core response fields are valid, False otherwise.
- Return type:
TypeGuard[ResponseProtocol]
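To show why a structural check needs explicit type validation on top of duck typing, the sketch below uses a runtime-checkable typing.Protocol. ResponseLike and FakeResponse are hypothetical names that mirror the five properties listed above; they are not the library's ResponseProtocol.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ResponseLike(Protocol):
    # Hypothetical minimal protocol mirroring the five listed properties.
    status_code: int
    reason: str
    headers: Any
    content: bytes
    url: Any


@dataclass
class FakeResponse:
    status_code: Any = 200
    reason: str = "OK"
    headers: dict = field(default_factory=dict)
    content: bytes = b"{}"
    url: str = "https://an-example-url.com"


# isinstance against a runtime-checkable protocol only checks that the attributes exist.
has_attrs = isinstance(FakeResponse(), ResponseLike)
missing_attrs = isinstance(object(), ResponseLike)
wrong_type = isinstance(FakeResponse(status_code="200"), ResponseLike)
```

The last check still succeeds even though status_code is a string, because runtime protocol checks never inspect attribute types. That gap is the reason a validator would follow the duck-typed check with per-field type checks.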
- classmethod is_valid_status_code(status_code: object) TypeGuard[int][source]
Validates whether the status_code is a valid integer between 100-599.
- classmethod is_valid_url(url: object) TypeGuard[str][source]
Validates whether the provided value is a valid URL.
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method that shows the current structure of the ResponseValidator class in a string format. This method will show the name of the current class along with its attributes (ResponseValidator())
- Returns:
A string representation of the current structure of the ResponseValidator
- Return type:
str
- classmethod validate_content(response: Response | ResponseProtocol, expected_format: str = 'application/json', *, raise_on_error: bool = False) bool[source]
Validates the response content type.
- Parameters:
response (requests.Response | ResponseProtocol) – The HTTP response or response-like object to check.
expected_format (str) – The expected content type substring (e.g., “application/json”).
raise_on_error (bool) – If True, raises InvalidResponseException on mismatch.
- Returns:
True if the content type matches, False otherwise.
- Return type:
bool
- Raises:
InvalidResponseException – If the content type does not match and raise_on_error is True.
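A minimal sketch of the substring-based content type check with an optional raise, assuming the check inspects the Content-Type header. content_type_matches and ContentTypeMismatch are hypothetical; the latter merely stands in for InvalidResponseException.

```python
from collections.abc import Mapping


class ContentTypeMismatch(ValueError):
    """Hypothetical stand-in for the library's InvalidResponseException."""


def content_type_matches(
    headers: Mapping[str, str],
    expected_format: str = "application/json",
    *,
    raise_on_error: bool = False,
) -> bool:
    # HTTP header names are case-insensitive, so normalize them before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    content_type = normalized.get("content-type", "")
    matches = expected_format in content_type
    if not matches and raise_on_error:
        raise ContentTypeMismatch(f"expected {expected_format!r}, received {content_type!r}")
    return matches


json_ok = content_type_matches({"Content-Type": "application/json; charset=utf-8"})
html_ok = content_type_matches({"content-type": "text/html"})
```

Matching on a substring means a header such as "application/json; charset=utf-8" still counts as JSON.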
- classmethod validate_response(response: Response | ResponseProtocol, *, raise_on_error: bool = False) bool[source]
Validates HTTP responses by first verifying whether the object is a Response or follows a ResponseProtocol. For valid response or response-like objects, the status code is verified, returning False for 400 and 500 level validation errors when raise_on_error=False. If raise_on_error is set to True, an error is raised instead.
Note that a ResponseProtocol duck-types and verifies that each of a minimal set of attributes and/or properties can be found within the current response.
In the scholar_flux retrieval step, this validator verifies that the response received is a valid response.
- Parameters:
response (requests.Response | ResponseProtocol) – The HTTP response object to validate
raise_on_error (bool) – If True, raises InvalidResponseException on error for invalid response status codes
- Returns:
True if valid, False otherwise
- Raises:
InvalidResponseException – If response is invalid and raise_on_error is True
RequestFailedException – If an exception occurs during response validation due to missing or incorrect types
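The return-or-raise contract described above can be sketched as follows. status_is_ok and InvalidStatus are hypothetical names for illustration only; InvalidStatus stands in for InvalidResponseException, and the TypeError stands in for the error raised on incorrect types.

```python
class InvalidStatus(Exception):
    """Hypothetical stand-in for the library's InvalidResponseException."""


def status_is_ok(status_code: object, *, raise_on_error: bool = False) -> bool:
    """Return False for 4xx/5xx status codes, or raise instead when raise_on_error is True."""
    if not isinstance(status_code, int) or isinstance(status_code, bool):
        # Non-integer input is a structural problem rather than a failed request.
        raise TypeError(f"status_code must be an int, received {type(status_code).__name__}")
    failed = 400 <= status_code <= 599
    if failed and raise_on_error:
        raise InvalidStatus(f"request failed with status code {status_code}")
    return not failed


ok_result = status_is_ok(200)
not_found_result = status_is_ok(404)
```

A retry loop would typically use the boolean form to decide whether to try again, and switch to raise_on_error=True once retries are exhausted.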
- classmethod validate_response_like(response: object) TypeGuard[Response | ResponseProtocol][source]
Validates that an object is a response or a duck typed ResponseProtocol, raising an error if invalid.
- Parameters:
response (object) – An object to verify as a response or response-like object
- Returns:
True when the received object is a requests.Response or a ResponseProtocol.
- Return type:
TypeGuard[requests.Response | ResponseProtocol]
- Raises:
InvalidResponseStructureException – Raised when the object is not a response-like object.
- classmethod validate_response_structure(response: Response | ResponseProtocol, raise_on_error: bool = True) TypeGuard[Response | ResponseProtocol][source]
Raises an error if a response object does not contain valid properties expected of a response. If the response validation is successful, True is returned, indicating that the value is a valid ResponseLike object.
- Parameters:
response (requests.Response | ResponseProtocol) – The response or response-like object to validate.
raise_on_error (bool) – Flag indicating whether an InvalidResponseStructureException should be raised for objects with invalid structures (True by default).
- Returns:
True when the received object is a requests.Response or a ResponseProtocol.
- Return type:
TypeGuard[requests.Response | ResponseProtocol]
- Raises:
InvalidResponseStructureException – Raised when the object is not a response-like object or if at least one field is determined to be invalid and unexpected of a response-like object.
scholar_flux.api.search_api module
Implements the SearchAPI that is the core interface used throughout the scholar_flux package to retrieve responses.
The SearchAPI builds on the BaseAPI to simplify parameter handling into a universal interface where the specifics of parameter names and request formation are abstracted.
- class scholar_flux.api.search_api.SearchAPI(query: str, provider_name: str | None = None, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, use_cache: bool | None = None, base_url: str | None = None, api_key: SecretStr | str | None = None, records_per_page: int = 20, request_delay: float | None = None, **api_specific_parameters: Any)[source]
Bases: BaseAPI
The core interface that handles the retrieval of JSON, XML, and YAML content from the scholarly API sources offered by several providers such as Springer Nature, PLOS, and PubMed. The SearchAPI is structured to allow flexibility without complexity in initialization. API clients can be either constructed piece-by-piece or with sensible defaults for session-based retrieval, API key management, caching, and configuration options.
This class is integrated into the SearchCoordinator as a core component of a pipeline that further parses the response, extracts records and metadata, and caches the processed records to facilitate downstream tasks such as research, summarization, and data mining.
Examples
>>> from scholar_flux.api import SearchAPI
# creating a basic API that uses PLOS as the default while caching data in-memory:
>>> api = SearchAPI(query = 'machine learning', provider_name = 'plos', use_cache = True)
# retrieve a basic request:
>>> response_page_1 = api.search(page = 1)
>>> assert response_page_1.ok
>>> response_page_1
# OUTPUT: <Response [200]>
>>> ml_page_1 = response_page_1.json()
# future requests automatically wait until the specified request delay passes to send another request:
>>> response_page_2 = api.search(page = 2)
>>> assert response_page_2.ok
>>> response_page_2
# OUTPUT: <Response [200]>
>>> ml_page_2 = response_page_2.json()
- DEFAULT_URL: str = 'https://api.plos.org/search'
- __init__(query: str, provider_name: str | None = None, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, use_cache: bool | None = None, base_url: str | None = None, api_key: SecretStr | str | None = None, records_per_page: int = 20, request_delay: float | None = None, **api_specific_parameters: Any) None[source]
Initializes the SearchAPI with a query and optional parameters. At a bare minimum, interacting with an API requires a query, a base_url, and an APIParameterConfig that associates universal fields (e.g., query, records_per_page) with the fields that are specific to each API provider.
- Parameters:
query (str) – The search keyword or query string.
provider_name (Optional[str]) – The name of the API provider where requests will be sent. If a provider_name and base_url are both given, the SearchAPIConfig will prioritize the base_url over the provider_name.
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – A config that uses a parameter map under the hood to build the parameters necessary to interact with an API. For convenience, an APIParameterMap can be provided in place of an APIParameterConfig, and it will be converted automatically.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session. A new session is created if not specified.
user_agent (Optional[str]) – Optional user-agent string for the session.
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError
masker (Optional[SensitiveDataMasker]) – Used for filtering potentially sensitive information from logs (API keys, auth bearers, emails, etc.)
use_cache (bool) – Indicates whether or not to create a cached session. If a cached session is already specified, this setting will have no effect on the creation of a session.
base_url (str) – The base URL for the article API.
api_key (Optional[str | SecretStr]) – API key if required.
records_per_page (int) – Number of records to fetch per page (1-100).
request_delay (Optional[float]) – Minimum delay between requests in seconds. If not specified, the SearchAPI uses the default request delay defined in the SearchAPIConfig (6.1 seconds) unless an override exists for the current provider.
**api_specific_parameters – Additional parameter-value pairs to be provided to the SearchAPIConfig class. API-specific parameters include:
- mailto (Optional[str | SecretStr]): (Crossref) an optional contact for feedback on API usage
- db (str): (PubMed) a database to retrieve data from (example: db=pubmed)
- property api_key: SecretStr | None
Retrieves the current value of the API key from the SearchAPIConfig as a SecretStr.
Note that the API key is stored as a secret key when available. The value of the API key can be retrieved by using the api_key.get_secret_value() method.
- Returns:
A secret string of the API key if it exists
- Return type:
Optional[SecretStr]
- property api_specific_parameters: dict[str, APISpecificParameter]
This property pulls additional parameters corresponding to the API from the configuration of the current API instance.
- Returns:
A dictionary of all parameters specific to the current API.
- Return type:
dict[str, APISpecificParameter]
- property base_url: str
Corresponds to the base URL of the current API.
- Returns:
The base URL corresponding to the API Provider
- build_parameters(page: int, additional_parameters: dict[str, Any] | None = None, **api_specific_parameters: Any) dict[str, Any][source]
Constructs the request parameters for the API call, using the provided APIParameterConfig and its associated APIParameterMap. This method maps standard fields (query, page, records_per_page, api_key, etc.) to the provider-specific parameter names.
Using additional_parameters, an arbitrary set of parameter key-value pairs can be added to the request to further customize or override the parameter settings sent to the API. additional_parameters is offered as a convenience in case an API uses additional arguments or a query requires specific advanced functionality.
Other arguments and mappings can be supplied through **api_specific_parameters to the parameter config, provided that the options or pre-defined mappings exist in the config.
When **api_specific_parameters and additional_parameters conflict, additional_parameters is considered the ground truth. If any remaining parameters are None in the constructed list of parameters, these values will be dropped from the final dictionary.
- Parameters:
page (int) – The page number to request.
additional_parameters (Optional[dict]) – A dictionary of additional overrides that may or may not have been included in the original parameter map of the current API. (Provided for further customization of requests).
**api_specific_parameters – Additional parameters to provide to the parameter config: Note that the config will only accept keyword arguments that have been explicitly defined in the parameter map. For all others, they must be added using the additional_parameters parameter.
- Returns:
The constructed request parameters.
- Return type:
dict[str, Any]
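The mapping and precedence rules above can be sketched in a few lines. This is a simplified illustration, not SearchAPI.build_parameters itself: PARAMETER_MAP, DECLARED_SPECIFIC, the provider-side names (q, start, rows), the zero-based offset calculation, and raising ValueError for undeclared keyword arguments are all assumptions made for the example.

```python
from typing import Any, Optional

# Hypothetical map from universal field names to one provider's parameter names.
PARAMETER_MAP = {"query": "q", "start": "start", "records_per_page": "rows", "api_key": "api_key"}
# Hypothetical provider-specific options that the config explicitly declares.
DECLARED_SPECIFIC = {"sort"}


def build_parameters(
    query: str,
    page: int,
    records_per_page: int,
    api_key: Optional[str] = None,
    additional_parameters: Optional[dict[str, Any]] = None,
    **api_specific_parameters: Any,
) -> dict[str, Any]:
    unknown = set(api_specific_parameters) - DECLARED_SPECIFIC
    if unknown:
        # Assumption: undeclared keyword arguments are rejected rather than passed through.
        raise ValueError(f"parameters not defined in the parameter map: {sorted(unknown)}")
    universal = {
        "query": query,
        # Assumption: the provider expects a zero-based record offset instead of a page number.
        "start": (page - 1) * records_per_page,
        "records_per_page": records_per_page,
        "api_key": api_key,
    }
    parameters = {PARAMETER_MAP[name]: value for name, value in universal.items()}
    parameters.update(api_specific_parameters)
    # additional_parameters is applied last, so it wins on conflicts.
    parameters.update(additional_parameters or {})
    # Parameters that are still None are dropped from the final dictionary.
    return {name: value for name, value in parameters.items() if value is not None}


params = build_parameters(
    "machine learning", page=2, records_per_page=20,
    additional_parameters={"rows": 5, "fl": "title"}, sort="date",
)
```

In this sketch the missing api_key disappears from the result, rows is overridden by additional_parameters, and fl passes through even though it was never declared in the map.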
- property config: SearchAPIConfig
Property method for accessing the config for the SearchAPI.
- Returns:
The configuration corresponding to the API Provider
- describe() dict[str, Any][source]
A helper method that describes the accepted configuration for the current provider or user-defined parameter mappings.
- Returns:
A dictionary describing valid config fields and provider-specific api parameters for the current provider (if applicable).
- Return type:
dict[str, Any]
- property display_name: str
Human-readable provider name for logging and display purposes.
- classmethod from_defaults(query: str, provider_name: str | None, session: Session | None = None, user_agent: Annotated[str | None, 'An optional User-Agent to associate with each search'] = None, use_cache: bool | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters: Any) SearchAPI[source]
Factory method to create SearchAPI instances with sensible defaults for known providers.
PLOS is used by default unless the environment variable, SCHOLAR_FLUX_DEFAULT_PROVIDER is set to another provider.
- Parameters:
query (str) – The search keyword or query string.
base_url (str) – The base URL for the article API.
records_per_page (int) – Number of records to fetch per page (1-100).
request_delay (Optional[float]) – Minimum delay between requests in seconds.
api_key (Optional[str | SecretStr]) – API key if required.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist.
masker (Optional[SensitiveDataMasker]) – Used for filtering potentially sensitive information from logs
**api_specific_parameters – Additional api parameter-value pairs and overrides to be provided to SearchAPIConfig class
- Returns:
A new SearchAPI instance initialized with the config chosen.
- classmethod from_provider_config(query: str, provider_config: ProviderConfig, session: Session | None = None, user_agent: Annotated[str | None, 'An optional User-Agent to associate with each search'] = None, use_cache: bool | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters: Any) SearchAPI[source]
Factory method to create a new SearchAPI instance using a ProviderConfig.
This method uses the default settings associated with the provider config to temporarily make the configuration settings globally available when creating the SearchAPIConfig and APIParameterConfig instances from the provider registry.
- Parameters:
query (str) – The search keyword or query string.
provider_config (ProviderConfig) – The provider configuration whose default settings are used to create the SearchAPI.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist.
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError.
masker (Optional[SensitiveDataMasker]) – Used for filtering potentially sensitive information from logs
**api_specific_parameters – Additional api parameter-value pairs and overrides to be provided to SearchAPIConfig class
- Returns:
A new SearchAPI instance initialized with the chosen configuration.
- classmethod from_settings(query: str, config: SearchAPIConfig, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, use_cache: bool | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None) SearchAPI[source]
Advanced constructor: instantiate directly from a SearchAPIConfig instance.
- Parameters:
query (str) – The search keyword or query string.
config (SearchAPIConfig) – Indicates the configuration settings to be used when sending requests to APIs
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – Maps global scholar_flux parameters to those that are specific to the current API
session (Optional[requests.Session | CachedSession]) – An optional session to use for the creation of request sessions
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError
use_cache (Optional[bool]) – Indicates whether or not to use cache. If this option is not specified, the settings from session are used instead.
masker (Optional[SensitiveDataMasker]) – A masker used to filter logs of API keys and other sensitive data.
user_agent (Optional[str]) – A user agent to associate with the session.
- Returns:
A newly constructed SearchAPI with the chosen/validated settings.
- Return type:
SearchAPI
- classmethod get_default_provider_name() str[source]
Retrieves the name of the default provider as configured via config_settings.
Note
When config_settings does not resolve to a known provider, a warning is raised, and SearchAPIConfig.DEFAULT_PROVIDER is returned instead.
- Returns:
A known default, either resolved from SCHOLAR_FLUX_DEFAULT_PROVIDER or SearchAPIConfig.DEFAULT_PROVIDER.
- Return type:
str
- make_request(current_page: int, additional_parameters: dict[str, Any] | None = None, request_delay: float | None = None, endpoint: str | None = None) Response[source]
Constructs and sends a request to the chosen API.
The parameters are built based on the default/chosen config and parameter map.
- Parameters:
current_page (int) – The page number to request.
additional_parameters (Optional[dict]) – A dictionary of additional overrides not included in the original SearchAPIConfig.
request_delay (Optional[float]) – Overrides the configured request delay for the current request only.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
- Returns:
The API’s response to the request.
- Return type:
requests.Response
- property parameter_config: APIParameterConfig
Property method for accessing the parameter mapping config for the SearchAPI.
- Returns:
The configuration corresponding to the API Provider
- prepare_request(base_url: str | None = None, endpoint: str | None = None, parameters: dict[str, Any] | None = None, api_key: str | None = None) PreparedRequest[source]
Prepares a GET request for the specified endpoint with optional parameters.
This method builds on the original base class method by additionally allowing users to specify a custom request directly while also accounting for the addition of an API key specific to the API.
- Parameters:
base_url (str) – The base URL for the API.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
parameters (Optional[dict[str, Any]]) – Optional query parameters for the request.
- Returns:
The prepared request object.
- Return type:
requests.PreparedRequest
- prepare_search(page: int | None = None, parameters: dict[str, Any] | None = None, request_delay: float | None = None, endpoint: str | None = None) PreparedRequest[source]
Prepares the current request given the provided page and parameters.
The prepared request object can be sent using the SearchAPI.session.send method with requests.Session and requests_cache.CachedSession objects.
- Parameters:
page (Optional[int]) – Page number to query. If provided, parameters are built from the config and this page.
parameters (Optional[dict[str, Any]]) – If provided alone, used as the full parameter set to build the current request. If provided together with page, these act as additional or overriding parameters on top of the built config.
request_delay (Optional[float]) – No-Op: retained to emulate the .search() method’s parameters to ensure that the value is not included in the request parameters.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
- Returns:
A request object that can be sent via api.session.send.
- Return type:
requests.PreparedRequest
- property provider_name: str
Property method for accessing the provider name in the current SearchAPI instance.
- Returns:
The name corresponding to the API Provider.
- property query: str
Retrieves the current value of the query to be sent to the current API.
- property rate_limiter: RateLimiter
Property enabling public access to the rate limiter for ease of use.
- Returns:
The rate limiter that throttles the number of requests that can be sent to an API within a time interval.
- Return type:
RateLimiter
- property records_per_page: int
Indicates the total number of records to show on each page.
- Returns:
An integer indicating the maximum number of records per page.
- Return type:
int
- property request_delay: float
Indicates how long to wait between requests.
Helpful for ensuring compliance with the rate-limiting requirements of various APIs.
- Returns:
The number of seconds to wait at minimum between each request
- Return type:
float
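A minimum delay between requests can be enforced with a small helper that remembers when the previous request was sent. The sketch below is not the library's RateLimiter: MinimumDelay and its injectable clock and sleep arguments are hypothetical, and the injection exists only so the behaviour can be demonstrated without actually waiting.

```python
import time
from typing import Callable, Optional


class MinimumDelay:
    """Blocks until at least min_interval seconds have passed since the previous call."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self, min_interval: Optional[float] = None) -> float:
        """Sleep for the remaining part of the interval and return the seconds slept."""
        # A per-call override applies to this call only and leaves the default untouched.
        interval = self.min_interval if min_interval is None else min_interval
        remaining = 0.0
        if self._last_call is not None:
            remaining = max(0.0, interval - (self._clock() - self._last_call))
        if remaining > 0:
            self._sleep(remaining)
        self._last_call = self._clock()
        return remaining


# Simulated clock so that the demonstration runs instantly.
fake_now = [0.0]


def fake_sleep(seconds: float) -> None:
    fake_now[0] += seconds


limiter = MinimumDelay(6.1, clock=lambda: fake_now[0], sleep=fake_sleep)
first_wait = limiter.wait()   # nothing to wait for on the first request
fake_now[0] += 2.0            # two simulated seconds pass before the next request
second_wait = limiter.wait()  # sleeps for the remainder of the 6.1 second interval
```

The optional min_interval argument on wait mirrors the idea of overriding the configured delay for a single request.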
- search(page: int | None = None, parameters: dict[str, Any] | None = None, request_delay: float | None = None, endpoint: str | None = None) Response[source]
Public method to perform a search for the selected page with the current API configuration.
A search can be performed by specifying the page to query with the preselected defaults, optionally combined with overrides for other parameters accepted by the API.
Users can also create a custom request using a parameter dictionary containing the full set of API parameters.
- Parameters:
page (Optional[int]) – Page number to query. If provided, parameters are built from the config and this page.
parameters (Optional[dict[str, Any]]) – If provided alone, used as the full parameter set for the request. If provided together with page, these act as additional or overriding parameters on top of the built config.
request_delay (Optional[float]) – Overrides the configured request delay for the current request only.
endpoint (Optional[str]) – An optional API endpoint to append to base_url.
- Returns:
A response object from the API containing articles and metadata
- Return type:
requests.Response
- session: Session
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for quickly showing a representation of the overall structure of the SearchAPI. The helper function generate_repr_from_string produces human-readable representations of the core structure of the SearchAPI.
- Parameters:
flatten (bool) – Whether to flatten the SearchAPI’s structural representation into a single line.
show_value_attributes (bool) – Whether to show nested attributes of the components of the SearchAPI.
- Returns:
The structure of the current SearchAPI as a string.
- Return type:
str
- classmethod update(search_api: SearchAPI, query: str | None = None, config: SearchAPIConfig | None = None, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, use_cache: bool | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters: Any) SearchAPI[source]
Helper method for generating a new SearchAPI from an existing SearchAPI instance. All parameters that are not modified are pulled from the original SearchAPI. If no changes are made, an identical SearchAPI is generated from the existing defaults.
- Parameters:
config (SearchAPIConfig) – Indicates the configuration settings to be used when sending requests to APIs
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – Maps global scholar_flux parameters to those that are API specific.
session (Optional[requests.Session | CachedSession]) – An optional session to use for the creation of request sessions
user_agent (Optional[str]) – A user agent to associate with the session
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError
use_cache (Optional[bool]) – Indicates whether or not to use cache. If this option is not specified, the settings from session are used instead.
masker (Optional[SensitiveDataMasker]) – A masker used to filter logs of API keys and other sensitive data
- Returns:
A newly constructed SearchAPI with the chosen/validated settings
- Return type:
SearchAPI
- with_config(config: SearchAPIConfig | None = None, parameter_config: APIParameterConfig | None = None, provider_name: str | None = None, query: str | None = None) Iterator[SearchAPI][source]
Temporarily modifies the SearchAPI’s SearchAPIConfig and/or APIParameterConfig and namespace. You can provide a config, a parameter_config, or a provider_name to fetch defaults. Explicitly provided configs take precedence over provider_name, and the context manager will revert changes to the parameter mappings and search configuration afterward.
- Parameters:
config (Optional[SearchAPIConfig]) – Temporary search api configuration to use within the context to control where and how response records are retrieved.
parameter_config (Optional[APIParameterConfig]) – Temporary parameter config to use within the context to resolve universal parameters names to those that are specific to the current api.
provider_name (Optional[str]) – Used to retrieve the associated configuration for a specific provider in order to edit the parameter map when using a different provider.
query (Optional[str]) – Allows users to temporarily modify the query used to retrieve records from an API.
- Yields:
SearchAPI – The current api object with a temporarily swapped config during the context manager.
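The temporary swap described above can be sketched with the standard library alone. DemoAPI and its with_config method are hypothetical stand-ins, not the scholar_flux implementation; they only show how a context manager can replace settings and restore them afterward, even when the body of the with block raises.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class DemoAPI:
    """Hypothetical stand-in for an API client with swappable settings."""

    config: dict
    query: str

    @contextmanager
    def with_config(self, config: Optional[dict] = None, query: Optional[str] = None) -> Iterator["DemoAPI"]:
        # Remember the original values so they can be restored on exit.
        original_config, original_query = self.config, self.query
        try:
            if config is not None:
                self.config = config
            if query is not None:
                self.query = query
            yield self
        finally:
            # Runs on normal exit and on exceptions alike.
            self.config, self.query = original_config, original_query


api = DemoAPI(config={"records_per_page": 25}, query="sleep")
with api.with_config(config={"records_per_page": 5}, query="memory") as temp_api:
    inside = (temp_api.config["records_per_page"], temp_api.query)
after = (api.config["records_per_page"], api.query)
```

Placing the restore step in a finally clause is the design choice that keeps the original settings intact when a request inside the context fails.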
- with_config_parameters(provider_name: str | None = None, query: str | None = None, **api_specific_parameters: Any) Iterator[SearchAPI][source]
Allows for the temporary modification of the search configuration, parameter mappings, and cache namespace for the current API. Uses a context manager to temporarily change the provided parameters without persisting the changes.
- Parameters:
provider_name (Optional[str]) – If provided, fetches the default parameter config for the provider.
query (Optional[str]) – Allows users to temporarily modify the query used to retrieve records from an API.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override in the current config.
- Yields:
SearchAPI – The API object with temporarily swapped config and/or parameter config.
scholar_flux.api.search_coordinator module
Implements the SearchCoordinator for orchestrating single/multi-page API response retrieval and record processing.
- class scholar_flux.api.search_coordinator.SearchCoordinator(search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, *, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, query: str | None = None, provider_name: str | None = None, cache_requests: bool | None = None, cache_results: bool | None = None, annotate_records: bool | None = None, retry_handler: RetryHandler | None = None, validator: ResponseValidator | None = None, workflow: SearchWorkflow | None = None, **kwargs: Any)[source]
Bases: BaseCoordinator
High-level coordinator for requesting and retrieving records and metadata from APIs.
This class uses dependency injection to orchestrate the process of constructing requests, validating responses, and processing scientific works and articles. This class is designed to abstract away the complexity of using APIs while providing a consistent and robust interface for retrieving record data and metadata from request and storage cache if valid to help avoid exceeding limits in API requests.
If no search_api is provided, the coordinator will create a SearchAPI that uses the default provider named in the environment variable SCHOLAR_FLUX_DEFAULT_PROVIDER, if it is set. Otherwise PLOS is used on the backend.
- __init__(search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, *, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, query: str | None = None, provider_name: str | None = None, cache_requests: bool | None = None, cache_results: bool | None = None, annotate_records: bool | None = None, retry_handler: RetryHandler | None = None, validator: ResponseValidator | None = None, workflow: SearchWorkflow | None = None, **kwargs: Any) None[source]
Flexible initializer that constructs a SearchCoordinator from its core components or their building blocks.
If SearchAPI and ResponseCoordinator are provided, then this method will use these inputs directly. Otherwise, the coordinator will be created from their underlying dependencies when these core components are not directly provided.
The additional parameters can still be used to update these two components. For example, a search_api can be updated with a new query, session, and SearchAPIConfig parameters through keyword arguments (**kwargs).
- When neither component is provided:
The creation of the search_api requires, at minimum, a query.
If the response_coordinator, a parser, extractor, processor, and cache_manager aren’t provided, then a new ResponseCoordinator will be built from the default settings.
- Core Components/Attributes:
- SearchAPI: handles all requests to an API based on its configuration.
Dependencies: query, **kwargs
- ResponseCoordinator: handles the parsing, record/metadata extraction, processing, and caching of responses
Dependencies: parser, extractor, processor, cache_manager
- Other Attributes:
- RetryHandler: Addresses when to retry failed requests and how failed requests are retried
- SearchWorkflow: An optional workflow that defines custom search logic from specific APIs
- Validator: handles how requests are validated. The default determines whether a 200 response was received
Note
This implementation uses the underlying private method _initialize to handle the assignment of parameters under the hood while the core function of the __init__ creates these components if they do not already exist.
- Parameters:
search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs.
response_coordinator (Optional[ResponseCoordinator]) – Core class used to coordinate the handling and processing of all responses received from APIs.
parser (Optional[BaseDataParser]) – First step of the response processing pipeline - parses response records into a dictionary.
extractor (Optional[BaseDataExtractor]) – Extracts both records and metadata from responses separately.
processor (Optional[ABCDataProcessor]) – Processes the previously extracted API records into list of dictionaries that are filtered and optionally flattened during processing.
cache_manager (Optional[DataCacheManager]) – Manages the caching of processed records for faster retrieval.
query (Optional[str]) – Query to be used when sending requests when creating an API - modifies the query if the API already exists.
provider_name (Optional[str]) – The name of the API provider where requests will be sent. If a provider_name and base_url are both given, the SearchAPIConfig will prioritize base_urls over the provider_name.
cache_requests (Optional[bool]) – Determines whether or not to cache requests - api is the ground truth if not directly specified
cache_results (Optional[bool]) – Determines whether or not to cache processed responses - on by default unless specified otherwise
annotate_records (Optional[bool]) – Indicates whether the DataExtractor should add unique, record-identifying fields to each extracted record. These fields aid in record-linkage and the hashed identification of duplicates in later steps.
retry_handler (Optional[RetryHandler]) – Class used to retry failed requests.
validator (Optional[ResponseValidator]) – Class used to verify and validate responses returned from APIs.
workflow (Optional[SearchWorkflow]) – An optional workflow used to customize how records are retrieved from APIs. Uses the default workflow for the current provider when a workflow is not directly specified.
**kwargs – Keyword arguments to be passed to the SearchAPIConfig if a SearchAPI doesn’t already exist.
Examples
>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.api import APIResponse, ReconstructedResponse
>>> from scholar_flux.sessions import CachedSessionManager
>>> from typing import MutableMapping
>>> session = CachedSessionManager(user_agent = 'scholar_flux', backend='redis').configure_session()
>>> search_coordinator = SearchCoordinator(query = "Intrinsic Motivation", session = session, cache_results = False)
>>> response = search_coordinator.search(page = 1)
>>> response
# OUTPUT: <ProcessedResponse(len=50, cache_key='plos_Functional Processing_1_50', metadata='...') ': 1, 'maxSco...")>
>>> new_response = ReconstructedResponse.build(**response.response.__dict__)
>>> new_response.validate()
>>> new_response = ReconstructedResponse.build(response.response)
>>> ReconstructedResponse.build(new_response).validate()
>>> new_response.validate()
>>> newer_response = APIResponse.as_reconstructed_response(new_response)
>>> newer_response.validate()
>>> double_processed_response = search_coordinator._process_response(response = newer_response, cache_key = response.cache_key)
- classmethod as_coordinator(search_api: SearchAPI, response_coordinator: ResponseCoordinator, *args: Any, **kwargs: Any) SearchCoordinator[source]
Helper factory method for building a SearchCoordinator that allows users to build from the final building blocks of a SearchCoordinator.
- Parameters:
search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs
response_coordinator (Optional[ResponseCoordinator]) – Core class used to handle the processing and core handling of all responses from APIs
- Returns:
A newly created coordinator that orchestrates record retrieval and processing
- Return type:
SearchCoordinator
- fetch(page: int | None, from_request_cache: bool = True, raise_on_error: bool = False, cache_only: bool = False, **api_specific_parameters: Any) Response | ResponseProtocol | None[source]
Fetches the raw response from the current API or from cache if available.
If page is None, fetch will default to a basic parameter search using the API base URL given the specified parameters.
- Parameters:
page (Optional[int]) – The page number to retrieve from the cache.
from_request_cache (bool) – This parameter determines whether to try to fetch a valid response from cache.
raise_on_error (bool) – Indicates whether an error should be raised when failing to fetch a valid response.
cache_only (bool) – Flag indicating whether the search should only attempt to retrieve the page from cache.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
The response object if available, otherwise None.
- Return type:
Optional[Response]
- Raises:
RetryAfterDelayExceededException – If the server-requested delay until the next request exceeds the user-specified maximum wait time as configured through the RetryHandler.
RequestFailedException – If an unexpected error occurs during the retrieval process as orchestrated via the RetryHandler.
- get_cached_request(page: int | None, **kwargs: Any) Response | ResponseProtocol | None[source]
Retrieves the cached request for a given page number if available.
- Parameters:
page (Optional[int]) – The page number to retrieve from the cache.
- Returns:
The cached request object if available, otherwise None.
- Return type:
Optional[Response]
- get_cached_response(page: int, url: str | None = None, **kwargs: Any) ProcessedResponse | ErrorResponse | None[source]
Retrieves the cached response for a given page number if available.
This method prefers processed cached data when it is available, regardless of whether the underlying request was cached.
If the cached request does not exist, and the processed response data does exist, this method creates a ProcessedResponse with ReconstructedResponse when possible.
If the cached request exists or is newer, this method returns the ProcessedResponse after handling the raw cached response object.
- Parameters:
page (int) – The page number to retrieve from the cache.
url (Optional[str]) – The request URL for parameter-based cache keys. Used when page is None.
**kwargs – Additional arguments to pass to get_cached_request for the reconstruction of a cached response
- Returns:
The cached/reconstructed response if available.
- Return type:
Optional[ProcessedResponse | ErrorResponse]
- get_cached_response_keys() list[str][source]
Finds all cache keys from cached, paginated requests made with the current query.
- get_cached_search_result(page: int, url: str | None = None, **kwargs: Any) SearchResult | None[source]
Retrieves a SearchResult containing a ProcessedResponse for a given page number if available.
This is a convenience method that uses get_cached_response under the hood to retrieve and format a response as a SearchResult instance.
If the cached response does not exist, this method will return None instead.
- Parameters:
page (int) – The page number to retrieve from the cache.
url (Optional[str]) – The request URL for parameter-based cache keys. Used when page is None.
**kwargs – Additional arguments to pass to get_cached_response for the reconstruction of a cached response
- Returns:
The search result containing the reconstructed response result if available.
- Return type:
Optional[SearchResult]
- iter_pages(pages: Sequence[int] | PageListInput, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) Generator[SearchResult, None, None][source]
Helper method that creates a generator for retrieving and processing records from the API Provider for a page range in sequence. This implementation examines the properties of the search result for each retrieved API response to determine whether iteration should continue or halt early.
This method is directly used by SearchCoordinator.search_pages to provide a clean interface that abstracts the complexity of iterators and is also provided for convenience when iteration is preferable.
- Parameters:
pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Yields:
SearchResult – Iteratively yields the SearchResult for each page. Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search (api_response) containing a ProcessedResponse, an ErrorResponse, or None.
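The early-halt behavior described for iter_pages can be sketched as a plain generator. The names iter_until_short_page, fetch_page, and expected_per_page are hypothetical, and the stopping rule (a missing page or fewer records than expected) is an assumption based on the notes in this section rather than the library's exact logic.

```python
from typing import Callable, Iterable, Iterator, Optional


def iter_until_short_page(
    pages: Iterable[int],
    fetch_page: Callable[[int], Optional[list]],
    expected_per_page: int,
) -> Iterator[tuple[int, Optional[list]]]:
    """Yield (page, records) pairs, stopping once a page signals the end."""
    for page in pages:
        records = fetch_page(page)
        yield page, records
        # A missing or short page suggests that later pages are empty.
        if records is None or len(records) < expected_per_page:
            break


# Hypothetical data source: 7 records served 3 per page.
DATA = list(range(7))


def fake_fetch(page: int) -> list:
    start = (page - 1) * 3
    return DATA[start:start + 3]


results = list(iter_until_short_page(range(1, 10), fake_fetch, expected_per_page=3))
```

In this sketch the short page is still yielded before iteration stops, so the caller sees the final partial page and no further requests are made.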
- parameter_search(from_request_cache: bool = True, from_process_cache: bool = True, normalize_records: bool | None = None, **api_specific_parameters: Any) ProcessedResponse | ErrorResponse[source]
Public method for retrieving and processing records from the API with pre-specified parameters.
Note that the response object is saved under the last_response attribute in the event that the response is retrieved and processed successfully, irrespective of whether the response was cached.
- Parameters:
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage
normalize_records (Optional[bool]) – Determines whether records should be normalized after processing
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A ProcessedResponse model containing the response (response), processed records (data), and article metadata (metadata) if the response was successful. Otherwise returns an ErrorResponse where the reason behind the error (message), exception type (error), and response (response) are provided. Possible error responses also include a NonResponse (an ErrorResponse subclass) for cases where a response object is irretrievable. Like the ErrorResponse class, NonResponse is also Falsy (i.e., not NonResponse returns True)
- Return type:
Optional[ProcessedResponse | ErrorResponse]
- robust_request(page: int | None, **api_specific_parameters: Any) Response | ResponseProtocol | None[source]
Constructs and sends a request to the current API and fetches the response.
- Parameters:
page (Optional[int]) – The page number to retrieve from the cache. If missing, this implementation relies on api_specific_parameters to retrieve data from an API.
**api_specific_parameters – Optional additional parameters to pass to the SearchAPI
- Returns:
The request/response-like object if available, otherwise None.
- Return type:
Optional[Response | ResponseProtocol]
- search(page: int = 1, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, normalize_records: bool | None = None, **api_specific_parameters: Any) ProcessedResponse | ErrorResponse | None[source]
Public method for retrieving and processing records from the API specifying the page and records per page. Note that the response object is saved under the last_response attribute in the event that the response is retrieved and processed successfully, irrespective of whether the response was cached.
- Parameters:
page (int) – The current page number. Used for process caching purposes even if not required by the API
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
normalize_records (Optional[bool]) – Determines whether records should be normalized after processing
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A ProcessedResponse model containing the response (response), processed records (data), and article metadata (metadata) if the response was successful. Otherwise returns an ErrorResponse where the reason behind the error (message), exception type (error), and response (response) are provided. Possible error responses also include a NonResponse (an ErrorResponse subclass) for cases where a response object is irretrievable. Like the ErrorResponse class, NonResponse is also Falsy (i.e., not NonResponse returns True)
- Return type:
Optional[ProcessedResponse | ErrorResponse]
Note: When specifying cache_only=True, this keyword argument is propagated to the fetch method, ensuring that a fresh request is not sent to the current API when a previously cached response is unavailable from the session cache. Instead, a NonResponse is returned that records the PageUnavailableFromCacheException and its corresponding error message.
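Because error responses are described as falsy, callers can branch on truthiness instead of checking types. The sketch below uses hypothetical DemoProcessed and DemoError classes, not the scholar_flux models, and assumes that a successful response is truthy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DemoProcessed:
    """Hypothetical success container; truthy by default."""

    data: list = field(default_factory=list)


@dataclass
class DemoError:
    """Hypothetical error container that evaluates to False."""

    message: str
    error: Optional[str] = None

    def __bool__(self) -> bool:
        return False


def records_or_empty(response) -> list:
    # Truthiness separates success from both error objects and None.
    return response.data if response else []
```

A single `if response:` check then covers the ProcessedResponse, ErrorResponse, and None cases listed in the return description.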
- search_data(page: int = 1, *args: Any, **kwargs: Any) RecordList | None[source]
Public convenience method to perform a search, specifying the page and records per page.
Note that instead of returning a ProcessedResponse or ErrorResponse, this calls the search method and retrieves only the list of processed dictionary records from the ProcessedResponse.
- Parameters:
page (int) – The current page number.
*args – Positional arguments to pass directly to the .search() method
**kwargs – Keyword arguments to pass directly to the .search() method
- Returns:
A list of record dictionaries containing the processed article data when parsed successfully and records exist. If no records exist, or an error occurs during processing, None is returned instead.
- Return type:
Optional[RecordList]
- search_page(page: int, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) SearchResult[source]
Retrieves a single-page SearchResult, returning the processed response with additional metadata.
This method is used to support the retrieval of a page range while wrapping each result in a SearchResult class as a BaseModel that provides more structured information about the received API Response, including the provider’s name, the page number, and the response result.
The SearchResult.response_result attribute can hold three different types of responses:
- ProcessedResponse - indicates the successful retrieval and processing of the data
- ErrorResponse/NonResponse - indicates that a response was successfully received, but that an error occurred during request building, response retrieval or response processing
- None - indicates an issue in the retrieval of the response or formatting/preparation of the request
The SearchResult wrapper enables:
- Introspection: Access provider, query, and page without unpacking the response
- Aggregation: Combine results across pages with consistent metadata
- Normalization: Apply field mapping to create provider-agnostic schemas
When a workflow is active, the provider name is determined from the last-queried URL to ensure correct labeling. For non-workflow searches, the SearchAPI’s provider name is used.
- Parameters:
page (int) – The current page number. Used for process caching purposes even if not required by the API
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A search result containing the requested page number (page), the name of the provider (provider_name), and the result of the search (api_response) which contains a ProcessedResponse, an ErrorResponse, or None.
- Return type:
SearchResult
Note
When specifying cache_only=True, this keyword argument is propagated to the fetch method, ensuring that a fresh request is not sent to the current API when a previously cached response is unavailable from the session cache. Instead, a SearchResult containing a NonResponse is returned, recording the PageUnavailableFromCacheException and its corresponding error message.
- search_pages(pages: Sequence[int] | PageListInput, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) SearchResultList[source]
Public method for retrieving and processing records from the API specifying the page and records per page in sequence.
This method collects search results from multiple pages into a SearchResultList, which provides specialized methods for filtering, normalization, selection, and aggregation. Unlike iter_pages(), which streams results one at a time, this method returns the full collection for cross-page analysis and batch operations.
The SearchResultList return type enables powerful operations like filtering out failures, normalizing records across different providers, selecting subsets by query/provider/page, and joining all records into a single list for DataFrame creation.
- Parameters:
pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider. Can be a list, range, or PageListInput instance.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A specialized list containing SearchResult instances for each requested page. The SearchResultList provides methods including:
- filter(): Retain only successful ProcessedResponses or filter by success/failure
- select(): Filter results by query, provider_name, or page number
- normalize(): Apply field mapping to create provider-agnostic record schemas
- join(): Combine all records into a single list with optional metadata
- process_metadata(): Extract and process metadata across all results
- record_count: Total number of records across all pages
- Return type:
SearchResultList
Note
Retrieval stops early if a page response is None, not retrievable, or contains fewer than the expected number of records, indicating that subsequent pages may be empty. When cache_only=True, the fetch step will only fetch valid responses from cache. If a NonResponse is returned due to a cache miss, the search will continue without halting.
- search_records(min_records: int, page_offset: int = 0, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) SearchResultList[source]
Public method for retrieving and processing records by specifying the number of records to retrieve.
This method first calculates the total number of pages required to retrieve the specified number of records and subsequently collects search results from multiple pages into a SearchResultList. The result list provides specialized methods for filtering, normalization, selection, and aggregation. Unlike iter_pages(), which streams results one at a time, this method returns the full collection for cross-page analysis and batch operations.
The SearchResultList return type enables powerful operations like filtering out failures, normalizing records across different providers, selecting subsets by query/provider/page, and joining all records into a single list for DataFrame creation.
- Parameters:
min_records (int) – The total number of records to retrieve sequentially.
page_offset (int) – The page offset indicating the number of pages to skip before beginning record retrieval (0 by default).
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A specialized list containing SearchResult instances for each requested page. The SearchResultList provides methods including:
- filter(): Retain only successful ProcessedResponses or filter by success/failure
- select(): Filter results by query, provider_name, or page number
- normalize(): Apply field mapping to create provider-agnostic record schemas
- join(): Combine all records into a single list with optional metadata
- process_metadata(): Extract and process metadata across all results
- record_count: Total number of records across all pages
Note that retrieval stops early if a page response is None, not retrievable, or contains fewer than the expected number of records, indicating that subsequent pages may be empty.
- Return type:
SearchResultList
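The page calculation that search_records performs before retrieval can be approximated as follows. This is a sketch under stated assumptions: pages are 1-indexed, page_offset counts the pages to skip, and records_per_page is known in advance. The function name pages_for_records is hypothetical, and the library's actual arithmetic may differ.

```python
import math


def pages_for_records(min_records: int, records_per_page: int, page_offset: int = 0) -> range:
    """Return the 1-indexed page range that covers at least min_records."""
    if min_records < 1 or records_per_page < 1:
        raise ValueError("min_records and records_per_page must be positive")
    # Round up so a partial final page is still requested.
    n_pages = math.ceil(min_records / records_per_page)
    first_page = page_offset + 1
    return range(first_page, first_page + n_pages)
```

For example, requesting 120 records at 50 per page needs three pages, and the last page may contribute more records than strictly required.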
- classmethod update(search_coordinator: Self, search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, *, retry_handler: RetryHandler | None = None, validator: ResponseValidator | None = None, workflow: SearchWorkflow | None = None, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None, annotate_records: bool | None = None, **search_api_kwargs: Any) SearchCoordinator[source]
Helper factory method allowing the creation of a new SearchCoordinator from both current and new components.
A new coordinator can be created using the components from an existing configuration as a base while directly replacing other components with new configurations. Note that this implementation does not directly copy the underlying components if a new component is not selected.
- Parameters:
search_coordinator (SearchCoordinator) – A previously created coordinator containing the components to use if a default is not provided
search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs
response_coordinator (Optional[ResponseCoordinator]) – Core class used to handle the processing and core handling of all responses from APIs
retry_handler (Optional[RetryHandler]) – Class used to retry failed requests
validator (Optional[ResponseValidator]) – Class used to verify and validate responses returned from APIs
workflow (Optional[SearchWorkflow]) – An optional workflow used to customize how records are retrieved from APIs. Uses the default workflow for the current provider when a workflow is not directly specified and does not directly carry over in cases where a new provider is chosen.
parser (Optional[BaseDataParser]) – First step of the response processing pipeline - parses response records into a dictionary
extractor (Optional[BaseDataExtractor]) – Extracts both records and metadata from responses separately
processor (Optional[ABCDataProcessor]) – Processes API responses into list of dictionaries
cache_manager (Optional[DataCacheManager]) – Manages the caching of processed records for faster retrieval
cache_requests (Optional[bool]) – Determines whether or not to cache requests - api is the ground truth if not directly specified
cache_results (Optional[bool]) – Determines whether or not to cache processed responses - on by default unless specified or if a cache manager is already provided.
annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).
- Returns:
A newly created coordinator that orchestrates record retrieval and processing
- Return type:
SearchCoordinator
scholar_flux.api.validators module
The scholar_flux.api.validators module implements functions that are used to validate scholar_flux API configurations, ensuring that valid inputs are accepted and invalid inputs are rejected.
Functions:
- validate_email:
Used to verify whether an email matches the expected pattern
- validate_and_process_email:
Attempts to mask valid emails and raises an error on invalid input
- validate_url:
Used to verify whether a URL is a valid string
- normalize_url:
Uses regular expressions to format the URL in a consistent format for string comparisons
- validate_and_process_url:
Validates a URL to ensure that it matches the expected format and normalizes it for later use
- validate_int:
Validates integer values with optional min/max bounds
- validate_str:
Validates string values with optional allowed values constraint
- validate_date:
Validates date strings in YYYY-MM-DD format
- validate_bool_str:
Validates and normalizes boolean string values
- scholar_flux.api.validators.api_validator(provider_name: str, field: str) Callable[source]
Decorator for wrapping validators with standardized error handling.
- Parameters:
provider_name – Name of the API provider.
field – Name of the parameter being validated.
- Returns:
A decorator that wraps validators with enhanced error handling.
Examples
>>> @api_validator("openalex", "mailto")
... def validate_mailto(email):
...     return validate_and_process_email(email)
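The general shape of a decorator that wraps validators with standardized error handling can be sketched as below. demo_api_validator, the provider name "demo_provider", and the page_size bounds are all hypothetical; the sketch does not reproduce the error types or messages that scholar_flux actually raises.

```python
from functools import wraps
from typing import Any, Callable


def demo_api_validator(provider_name: str, field: str) -> Callable:
    """Hypothetical decorator factory that adds context to validation errors."""

    def decorator(validator: Callable[..., Any]) -> Callable[..., Any]:
        @wraps(validator)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return validator(*args, **kwargs)
            except (TypeError, ValueError) as exc:
                # Re-raise with the provider and field named in the message.
                raise ValueError(f"Invalid '{field}' for provider '{provider_name}': {exc}") from exc

        return wrapper

    return decorator


@demo_api_validator("demo_provider", "page_size")
def validate_page_size(value: int) -> int:
    # Hypothetical bounds chosen only for illustration.
    if not 1 <= value <= 100:
        raise ValueError("must be between 1 and 100")
    return value
```

Centralizing the try/except in the decorator keeps each validator short while every failure reports the same provider and field context.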
- scholar_flux.api.validators.normalize_url(url: str, normalize_https: bool = True, remove_parameters: bool = False) str[source]
Helper function to aid in comparisons of string URLs. Normalizes a URL for consistent comparisons by converting to https:// and stripping right-most forward slashes (‘/’).
- Parameters:
url (str) – The URL to normalize into a consistent structure for later comparison
normalize_https (bool) – indicates whether to normalize the http identifier on the URL. This is True by default.
- Returns:
The normalized URL
- Return type:
str
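As a rough illustration of the normalization described above, the sketch below forces the https:// scheme and strips trailing slashes using only the standard library. The function name normalize_url_sketch is hypothetical and this is not the library's implementation; the remove_parameters option is left out.

```python
import re


def normalize_url_sketch(url: str, normalize_https: bool = True) -> str:
    """Hypothetical stand-in that mirrors the documented behavior only."""
    normalized = url.strip()
    if normalize_https:
        # Rewrite a leading http:// (any letter case) as https://
        normalized = re.sub(r"^http://", "https://", normalized, flags=re.IGNORECASE)
    # Strip right-most forward slashes so equivalent URLs compare equal
    return normalized.rstrip("/")


print(normalize_url_sketch("http://api.plos.org/search/"))
```

Two URLs that differ only in scheme or trailing slash then compare equal after normalization.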
- scholar_flux.api.validators.remove_url_parameters(url: str) str[source]
Helper method for removing queries and parameters from URLs.
- Parameters:
url (str) – The URL
- scholar_flux.api.validators.validate_and_process_email(email: SecretStr | str | None, from_env: bool = True, verbose: bool = True) SecretStr | None[source]
If a string value is provided, determine whether the email is valid.
This function first uses the validate_email function to validate the email. If the value is non-missing and is not a valid email, this implementation will raise a ValueError. When the provided email is None, this function will attempt to load the email from the config and environment if possible (SCHOLAR_FLUX_DEFAULT_MAILTO).
- Parameters:
email (Optional[SecretStr | str]) – an email to validate if non-missing
from_env (bool) – If True, will attempt to load from environment if email is None.
verbose (bool) – If True, will log warnings for invalid emails.
- Returns:
Masked valid email, or None if not provided.
- Return type:
Optional[SecretStr]
- Raises:
ValueError – If the current value is not an email
- scholar_flux.api.validators.validate_and_process_url(url: str | None, **kwargs: Any) str | None[source]
If a string value is provided, determine whether the url is valid.
This function first uses the validate_url function for the validation of the url.
- Parameters:
url (Optional[str]) – a URL to validate if non-missing
- Returns:
Normalized URL if valid, or None if not provided.
- Return type:
Optional[str]
- Raises:
ValueError – If the provided URL is invalid.
- scholar_flux.api.validators.validate_api_specific_field(validator: Callable[[P], R], provider_name: str, field: str) Callable[[P], R][source]
Wrap a validator function with standardized error handling for API-specific parameters.
- Parameters:
validator (Callable) – The validation function to wrap.
provider_name (str) – Name of the API provider.
field (str) – Name of the parameter being validated.
- Returns:
A wrapped validator with enhanced error messages.
- Return type:
Callable
- Raises:
TypeError – If validator is not callable.
Examples
>>> validator = validate_api_specific_field(
...     validate_email,
...     "openalex",
...     "mailto"
... )
>>> validator("user@example.com")
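The wrapping pattern described for validate_api_specific_field can be sketched as follows. This is a minimal illustration under stated assumptions: wrap_validator_sketch and require_positive are hypothetical names, and re-raising as ValueError with a provider/field prefix is an assumed error format, since the exception type and message the library actually uses are not given here.

```python
from functools import wraps
from typing import Any, Callable


def wrap_validator_sketch(validator: Callable[[Any], Any], provider_name: str, field: str) -> Callable[[Any], Any]:
    """Hypothetical wrapper that adds provider and field context to validation errors."""
    if not callable(validator):
        raise TypeError(f"Expected a callable validator, received {type(validator).__name__}")

    @wraps(validator)
    def wrapped(value: Any) -> Any:
        try:
            return validator(value)
        except (TypeError, ValueError) as exc:
            # Assumed message format: prefix the provider and parameter name
            raise ValueError(f"[{provider_name}] invalid value for '{field}': {exc}") from exc

    return wrapped


def require_positive(value: int) -> int:
    """Hypothetical validator used only for this demonstration."""
    if value <= 0:
        raise ValueError("must be positive")
    return value


validate_per_page = wrap_validator_sketch(require_positive, "openalex", "per_page")
print(validate_per_page(25))
```

Valid input passes through unchanged, while a failure surfaces with the provider and field named in the message.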
- scholar_flux.api.validators.validate_bool_str(value: str | None, true_values: tuple[str, ...] = ('true', '1', 'yes'), default: bool | None = False) bool | None[source]
Validate and convert a boolean string to a Python bool.
- Parameters:
value (Optional[str]) – The string value to convert.
true_values (tuple[str, ...]) – A tuple of lowercase strings that represent True.
default (Optional[bool]) – An optional default to use when a true value cannot be found in a valid string (default=False). Some applications may warrant returning None instead of False when a True value is not identified.
- Returns:
True if value matches a true_value, False if non-empty string, None if value is None.
- Return type:
Optional[bool]
Examples
>>> validate_bool_str("true")
# OUTPUT: True
>>> validate_bool_str("false")
# OUTPUT: False
>>> validate_bool_str("TRUE")
# OUTPUT: True
>>> validate_bool_str(None)
# OUTPUT: None
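The conversion rules above can be sketched with a few lines of standard-library Python. The name bool_from_str_sketch is hypothetical, and lowercasing plus whitespace stripping before the comparison are assumptions made to match the "TRUE" example.

```python
from typing import Optional, Tuple


def bool_from_str_sketch(
    value: Optional[str],
    true_values: Tuple[str, ...] = ("true", "1", "yes"),
    default: Optional[bool] = False,
) -> Optional[bool]:
    """Hypothetical stand-in illustrating the documented return values."""
    if value is None:
        return None
    # Assumed normalization: compare case-insensitively, ignoring outer whitespace
    if value.strip().lower() in true_values:
        return True
    # Any other string falls back to the default (False unless overridden)
    return default


print(bool_from_str_sketch("TRUE"), bool_from_str_sketch("false"), bool_from_str_sketch(None))
```

Passing default=None distinguishes "not recognized as true" from an explicit False, which is the use case the default parameter describes.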
- scholar_flux.api.validators.validate_date(value: str | None, format: str = '%Y-%m-%d', format_description: str = 'YYYY-MM-DD') str | None[source]
Validate that a value is a date string in the specified format.
- Parameters:
value (Optional[str]) – The date string to validate.
format (str) – The expected date format (strptime format string).
format_description (str) – Human-readable format description for error messages.
- Returns:
The validated date string, or None if value is None.
- Return type:
Optional[str]
- Raises:
ValueError – If value is not a valid date in the specified format.
Examples
>>> validate_date("2023-01-15")
# OUTPUT: '2023-01-15'
>>> validate_date("2023/01/15", format="%Y/%m/%d", format_description="YYYY/MM/DD")
# OUTPUT: '2023/01/15'
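A strptime-based check of this kind might look like the sketch below. The name validate_date_sketch and the error message wording are assumptions for illustration; only datetime.strptime from the standard library is used.

```python
from datetime import datetime
from typing import Optional


def validate_date_sketch(
    value: Optional[str],
    format: str = "%Y-%m-%d",
    format_description: str = "YYYY-MM-DD",
) -> Optional[str]:
    """Hypothetical stand-in: returns the input string if it parses, else raises."""
    if value is None:
        return None
    try:
        # strptime raises ValueError when the string does not match the format
        datetime.strptime(value, format)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Expected a date in {format_description} format, received {value!r}") from exc
    return value


print(validate_date_sketch("2023-01-15"))
```

The original string is returned rather than a datetime object, which matches the documented return type of Optional[str].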
- scholar_flux.api.validators.validate_email(email: str, verbose: bool = True) bool[source]
Uses regex to determine whether the provided value is an email.
- Parameters:
email (str) – The email string to validate
- Returns:
True if the email is valid, and False otherwise
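For illustration only, a regex-based email check could be sketched as below. The pattern is a deliberately simple assumption (one "@", no whitespace, a dot in the domain) and is not the pattern the library uses; validate_email_sketch is a hypothetical name.

```python
import re

# Assumed, intentionally permissive pattern: local@domain.tld without whitespace
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def validate_email_sketch(email: object) -> bool:
    """Hypothetical stand-in: returns True when the value looks like an email."""
    return isinstance(email, str) and EMAIL_PATTERN.fullmatch(email) is not None


print(validate_email_sketch("user@example.com"))
```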
- scholar_flux.api.validators.validate_int(value: int | None, min: int | None = None, max: int | None = None) int | None[source]
Validate that a value is an integer and optionally within bounds.
- Parameters:
value (Optional[int]) – The value to validate as an integer.
min (Optional[int]) – Optional minimum value (inclusive).
max (Optional[int]) – Optional maximum value (inclusive).
- Returns:
The validated integer value, or None if value is None.
- Return type:
Optional[int]
- Raises:
ValueError – If value is not an integer or is outside the specified bounds.
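The inclusive bounds check described above can be sketched as follows. The names validate_int_sketch, min_value, and max_value are hypothetical, and rejecting bool values (which are a subclass of int in Python) is an assumption rather than documented behavior.

```python
from typing import Optional


def validate_int_sketch(
    value: Optional[int],
    min_value: Optional[int] = None,
    max_value: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical stand-in for an integer validator with inclusive bounds."""
    if value is None:
        return None
    # Assumption: bools are rejected even though bool subclasses int
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"Expected an integer, received {value!r}")
    if min_value is not None and value < min_value:
        raise ValueError(f"Expected a value >= {min_value}, received {value}")
    if max_value is not None and value > max_value:
        raise ValueError(f"Expected a value <= {max_value}, received {value}")
    return value


print(validate_int_sketch(25, min_value=1, max_value=100))
```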
- scholar_flux.api.validators.validate_str(value: str | None, allowed: list | set | tuple | None = None) str | None[source]
Validate that a value is a string and optionally in a set of allowed values.
- Parameters:
value (Optional[str]) – The value to validate as a string.
allowed (Optional[list | set | tuple]) – Optional collection of allowed string values.
- Returns:
The validated string value, or None if value is None.
- Return type:
Optional[str]
- Raises:
ValueError – If value is not a string or is not in the allowed values.
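A membership check of this kind is short enough to sketch directly. The name validate_str_sketch and the error messages are assumptions; the behavior follows the description above (None passes through, non-strings and disallowed values raise ValueError).

```python
from typing import Collection, Optional


def validate_str_sketch(value: Optional[str], allowed: Optional[Collection[str]] = None) -> Optional[str]:
    """Hypothetical stand-in for a string validator with an allowed-values constraint."""
    if value is None:
        return None
    if not isinstance(value, str):
        raise ValueError(f"Expected a string, received {type(value).__name__}")
    if allowed is not None and value not in allowed:
        raise ValueError(f"Expected one of {sorted(allowed)}, received {value!r}")
    return value


print(validate_str_sketch("relevance", allowed={"relevance", "date"}))
```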
- scholar_flux.api.validators.validate_url(url: str, verbose: bool = True) bool[source]
Uses urlparse to determine whether the provided value is a URL.
Basic Checks:
Only http:// and https:// schemes are accepted
A URL domain exists after the URL scheme
No whitespace exists in the domain name
Note: Further validation is delegated to request libraries.
- Parameters:
url (str) – The url string to validate
verbose (bool) – Determines whether to log upon encountering invalid URLs
- Returns:
True if the url is valid, and False otherwise
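The three basic checks listed for validate_url can be sketched with urllib.parse.urlparse. This is an illustrative approximation with a hypothetical name (validate_url_sketch); logging and the verbose flag are omitted.

```python
from urllib.parse import urlparse


def validate_url_sketch(url: object) -> bool:
    """Hypothetical stand-in applying the three basic checks described above."""
    if not isinstance(url, str):
        return False
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse can raise for malformed netlocs (e.g., unbalanced IPv6 brackets)
        return False
    # 1. Only http:// and https:// schemes are accepted
    if parsed.scheme not in ("http", "https"):
        return False
    # 2. A domain must exist after the scheme
    if not parsed.netloc:
        return False
    # 3. The domain must not contain whitespace
    return not any(character.isspace() for character in parsed.netloc)


print(validate_url_sketch("https://api.plos.org/search"))
```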
Module contents
The scholar_flux.api module includes the core classes and functionality necessary to interact with APIs in a universally applicable manner. This module defines the methods necessary to retrieve raw responses from APIs based on the configuration used for the API client (SearchAPI).
- Sub-modules:
- models: Contains the classes used to set up new configurations in addition to the API utility models and modules necessary to interact with APIs.
- providers: Defines the default provider specifications to easily create a new client for a specific provider with minimal code (e.g., plos.py contains the necessary config settings for the PLOS API).
- workflows: Defines custom workflows for APIs requiring API-specific logic modifications for easier record retrieval. This includes the PubMed Workflow, which searches IDs and then fetches the records.
- rate_limiting: Defines the methods and classes used to ensure that the rate limits associated with each API are not exceeded. The SearchAPI implements rate limiting using the RateLimiter and, optionally, ThreadedRateLimiter class to wait a specified interval of time before sending the next request.
- In order to use the API one can get started with the SearchCoordinator with minimal effort:
>>> from scholar_flux.api import SearchCoordinator  # imports the most forward facing interface for record retrieval
>>> search_coordinator = SearchCoordinator(query = 'Turing Machines')  # uses PLOS by default
>>> print(search_coordinator.api)  # Shows the core SearchAPI specification used to send requests to APIs
>>> processed_response = search_coordinator.search(page = 1)  # retrieves and processes records from the API response
- You can also retrieve the responses directly without processing via the SearchAPI:
>>> from scholar_flux.api import SearchAPI  # imports the core SearchAPI used by the coordinator to send requests
>>> api = SearchAPI(query='ML')  # uses PLOS by default
>>> response = api.search(page = 1)  # retrieves the raw response from the API without processing
- The functionality of the SearchCoordinator is further customized using the following modules:
scholar_flux.sessions: Contains the core classes for directly setting up cached sessions
scholar_flux.data: Contains the core classes used to parse, extract, and process records
scholar_flux.data_storage: Contains the core classes used for caching
scholar_flux.security: Contains the core classes used for ensuring security in console and logging (e.g., API keys)
- class scholar_flux.api.APIParameterConfig(parameter_map: APIParameterMap)[source]
Bases: object
Uses an APIParameterMap instance and runtime parameter values to build parameter dictionaries for API requests.
- Parameters:
parameter_map (APIParameterMap) – The mapping of universal to API-specific parameter names.
- Class Attributes:
- DEFAULT_CORRECT_ZERO_INDEX (bool):
When True, autocorrects zero-indexed parameter-building specifications to accept only positive page values. When False, page calculation starts from page 0 for zero-indexed APIs (e.g., arXiv).
Examples
>>> from scholar_flux.api import APIParameterConfig, APIParameterMap
>>> # the API parameter map is defined and used to resolve parameters to the API's language
>>> api_parameter_map = APIParameterMap(
...     query='q', records_per_page = 'pagesize', start = 'page', auto_calculate_page = False
... )
# The APIParameterConfig defines class and settings that indicate how to create requests
>>> api_parameter_config = APIParameterConfig(api_parameter_map, auto_calculate_page = False)
# Builds parameters using the specification from the APIParameterMap
>>> page = api_parameter_config.build_parameters(query= 'ml', page = 10, records_per_page=50)
>>> print(page)
# OUTPUT {'q': 'ml', 'page': 10, 'pagesize': 50}
- DEFAULT_CORRECT_ZERO_INDEX: ClassVar[bool] = True
- __init__(*args: Any, **kwargs: Any) None
- add_parameter(name: str, description: str | None = None, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False, inplace: bool = True) APIParameterConfig[source]
Passes keyword arguments to the current parameter map to add a new API-specific parameter to its config.
- Parameters:
name (str) – The name of the parameter used when sending requests to APIs.
description (str) – A description of the API-specific parameter.
validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.
default (Any) – A default value used for the parameter if not specified by the user
required (bool) – Indicates whether the current parameter is required for API calls.
inplace (bool) –
A flag that, if True, modifies the current parameter map instance in place. If False, it returns a new parameter map that contains the added parameter, while leaving the original unchanged.
Note: If this instance is shared (e.g., retrieved from provider_registry), changes will affect all references to this parameter map if inplace=True.
- Returns:
An APIParameterConfig with the updated parameter map. If inplace=True, the original is returned. Otherwise a new parameter map containing an updated api_specific_parameters dict is returned.
- Return type:
APIParameterConfig
- classmethod as_config(parameter_map: dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) APIParameterConfig[source]
Factory method for creating a new APIParameterConfig from a dictionary or APIParameterMap.
This helper class method resolves the structure of the APIParameterConfig against its basic building blocks to create a new configuration when possible.
- Parameters:
parameter_map (dict | BaseAPIParameterMap | APIParameterMap | APIParameterConfig) – A parameter mapping/config to use in the instantiation of an APIParameterConfig.
- Returns:
A new structure from the inputs
- Return type:
APIParameterConfig
- Raises:
APIParameterException – If there is an error in the creation/resolution of the required parameters
- build_parameters(query: str | None, page: int | None, records_per_page: int, **api_specific_parameters: Any) Dict[str, Any][source]
Builds the dictionary of request parameters using the current parameter map and provided values at runtime.
- Parameters:
query (Optional[str]) – The search query string.
page (Optional[int]) – The page number for pagination (1-based).
records_per_page (int) – Number of records to fetch per page.
**api_specific_parameters – Additional API-specific parameters to include.
- Returns:
The fully constructed API request parameters dictionary, with keys as API-specific parameter names and values as provided.
- Return type:
Dict[str, Any]
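To make the mapping step concrete, the sketch below translates universal parameter names into API-specific ones using a small stand-in for the parameter map. ParameterMapSketch and build_parameters_sketch are hypothetical names, and the start-index formula used when auto_calculate_page is True is an assumption (first record of a 1-based page, shifted by one for APIs that are not zero-indexed); the library's own calculation may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ParameterMapSketch:
    """Hypothetical, reduced stand-in for an APIParameterMap."""
    query: str
    records_per_page: str
    start: Optional[str] = None
    auto_calculate_page: bool = True
    zero_indexed_pagination: bool = False


def build_parameters_sketch(
    pmap: ParameterMapSketch,
    query: Optional[str],
    page: Optional[int],
    records_per_page: int,
    **api_specific_parameters: Any,
) -> Dict[str, Any]:
    params: Dict[str, Any] = {pmap.query: query, pmap.records_per_page: records_per_page}
    if pmap.start is not None and page is not None:
        if pmap.auto_calculate_page:
            # Assumed formula: offset of the first record on a 1-based page
            offset = (page - 1) * records_per_page
            params[pmap.start] = offset if pmap.zero_indexed_pagination else offset + 1
        else:
            # Pass the page number through unchanged
            params[pmap.start] = page
    params.update(api_specific_parameters)
    # Drop parameters that were never given a value
    return {key: value for key, value in params.items() if value is not None}


page_map = ParameterMapSketch(query="q", records_per_page="pagesize", start="page", auto_calculate_page=False)
print(build_parameters_sketch(page_map, query="ml", page=10, records_per_page=50))
```

With auto_calculate_page=False the page number is forwarded as-is, which reproduces the dictionary shown in the class-level example.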
- extract_parameters(parameters: dict[str, Any] | None) dict[str, Any][source]
Extracts all parameters from a dictionary: Helpful for when keywords must be extracted by provider.
Note: this method modifies the original parameter dictionary, using the pop() method to extract all values identified as api_specific_parameters from the parameters dictionary when possible. These extracted parameters are then returned in a separate dictionary.
Useful for reorganizing dictionaries that contain dynamically specified input parameters for distinct APIs.
- Parameters:
parameters (Optional[dict[str, Any]]) – An optional parameter dictionary from which to extract API-specific parameters.
- Returns:
A dictionary containing all extracted parameters if available.
- Return type:
(dict[str, Any])
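The pop-based extraction noted above, including its side effect on the input dictionary, can be sketched as follows. The function name and the explicit api_specific_names argument are hypothetical; in the library the recognized names come from the parameter map.

```python
from typing import Any, Dict, Iterable, Optional


def extract_parameters_sketch(
    parameters: Optional[Dict[str, Any]],
    api_specific_names: Iterable[str],
) -> Dict[str, Any]:
    """Hypothetical stand-in: moves recognized keys out of the input dictionary."""
    if not parameters:
        return {}
    # pop() removes each recognized key from the caller's dictionary
    return {name: parameters.pop(name) for name in api_specific_names if name in parameters}


user_input = {"query": "ml", "mailto": "user@example.com"}
extracted = extract_parameters_sketch(user_input, ["mailto", "sort"])
print(extracted, user_input)
```

Callers that need the original dictionary intact should pass a copy, since the extracted keys are removed from the input.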
- classmethod from_defaults(provider_name: str, **additional_parameters: Any) APIParameterConfig[source]
Factory method to create APIParameterConfig instances with sensible defaults for known APIs.
If the provider_name does not exist, the code will raise an exception.
- Parameters:
provider_name (str) – The name of the API to create the parameter map for.
api_key (Optional[str]) – API key value if required.
additional_parameters (dict) – Additional parameter mappings.
- Returns:
Configured parameter config instance for the specified API.
- Return type:
APIParameterConfig
- Raises:
NotImplementedError – If the API name is unknown.
- classmethod get_defaults(provider_name: str, **additional_parameters: Any) APIParameterConfig | None[source]
Factory method to create APIParameterConfig instances with sensible defaults for known APIs.
Avoids throwing an error if the provider name does not already exist.
- Parameters:
provider_name (str) – The name of the API to create the parameter map for.
additional_parameters (dict) – Additional parameter mappings.
- Returns:
Configured parameter config instance for the specified API. Returns None if a mapping for the provider_name isn’t retrieved
- Return type:
Optional[APIParameterConfig]
- property map: APIParameterMap
Helper property that is an alias for the APIParameterMap attribute.
The APIParameterMap maps all universal parameters to the parameter names specific to the API provider.
- Returns:
The mapping that the current APIParameterConfig will use to build a dictionary of parameter requests specific to the current API.
- Return type:
APIParameterMap
- parameter_map: APIParameterMap
- class scholar_flux.api.APIParameterMap(*, query: str, records_per_page: str, start: str | None = None, api_key_parameter: str | None = None, api_key_required: bool = False, auto_calculate_page: bool = True, zero_indexed_pagination: bool = False, api_specific_parameters: Dict[str, APISpecificParameter] = <factory>)[source]
Bases: BaseAPIParameterMap
Extends BaseAPIParameterMap by adding validation and the optional retrieval of provider defaults for known APIs.
This class also specifies default mappings for specific attributes such as API keys and additional parameter names.
- query
The API-specific parameter name for the search query.
- Type:
str
- start
The API-specific parameter name for pagination (start index or page number).
- Type:
Optional[str]
- records_per_page
The API-specific parameter name for records per page.
- Type:
str
- api_key_parameter
The API-specific parameter name for the API key.
- Type:
Optional[str]
- api_key_required
Indicates whether an API key is required.
- Type:
bool
- auto_calculate_page
If True, calculates start index from page; if False, passes page number directly.
- Type:
bool
- zero_indexed_pagination
If True, treats 0 as an allowed page value when retrieving data from APIs.
- Type:
bool
- api_specific_parameters
Additional universal to API-specific parameter mappings.
- Type:
Dict[str, APISpecificParameter]
- api_key_parameter: str | None
- api_key_required: bool
- api_specific_parameters: Dict[str, APISpecificParameter]
- auto_calculate_page: bool
- classmethod from_defaults(provider_name: str, **additional_parameters: Any) APIParameterMap[source]
Factory method that uses the APIParameterMap.get_defaults classmethod to retrieve the provider config.
Raises an error if the provider does not exist.
- Parameters:
provider_name (str) – The name of the API to create the parameter map for.
additional_parameters (dict) – Additional parameter mappings.
- Returns:
Configured parameter map for the specified API.
- Return type:
APIParameterMap
- Raises:
NotImplementedError – If the API name is unknown.
- classmethod get_defaults(provider_name: str, **additional_parameters: Any) APIParameterMap | None[source]
Factory method to create APIParameterMap instances with sensible defaults for known APIs.
This class method attempts to pull from the list of known providers defined in the scholar_flux.api.providers.provider_registry and returns None if an APIParameterMap for the provider cannot be found.
Using the additional_parameters keyword arguments, users can specify optional overrides for specific parameters if needed. This is helpful in circumstances where an API’s specification overlaps with that of a known provider.
Valid providers (as indicated in provider_registry) include:
springernature
plos
arxiv
openalex
core
crossref
- Parameters:
provider_name (str) – The name of the API provider to retrieve the parameter map for.
additional_parameters (dict) – Additional parameter mappings.
- Returns:
Configured parameter map for the specified API.
- Return type:
Optional[APIParameterMap]
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- query: str
- records_per_page: str
- classmethod set_default_api_key_parameter(values: dict[str, Any]) dict[str, Any][source]
Sets the default for the API key parameter when api_key_required=True and api_key_parameter is None.
- Parameters:
values (dict[str, Any]) – The dictionary of attributes to validate
- Returns:
The updated parameter values passed to the APIParameterMap. api_key_parameter is set to “api_key” if key is required but not specified
- Return type:
dict[str, Any]
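The defaulting rule described above amounts to a small dictionary transformation, sketched here outside of any pydantic validator. The function name is hypothetical, and returning a copy rather than mutating the input is a choice made for this sketch.

```python
from typing import Any, Dict


def set_default_api_key_parameter_sketch(values: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in: fills in 'api_key' when a key is required but unnamed."""
    if values.get("api_key_required") and values.get("api_key_parameter") is None:
        # Copy so the caller's dictionary is left untouched
        return {**values, "api_key_parameter": "api_key"}
    return values


print(set_default_api_key_parameter_sketch({"query": "q", "api_key_required": True}))
```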
- start: str | None
- classmethod validate_api_specific_parameter_mappings(values: dict[str, Any]) dict[str, Any][source]
Validates the additional mappings provided to the APIParameterMap.
This method validates that the input is a dictionary of mappings that consists of only string-typed keys mapped to API-specific parameters as defined by the APISpecificParameter class.
- Parameters:
values (dict[str, Any]) – The dictionary of attribute values to validate.
- Returns:
The updated dictionary if validation passes.
- Return type:
dict[str, Any]
- Raises:
APIParameterException – If api_specific_parameters is not a dictionary or contains non-string keys/values.
- zero_indexed_pagination: bool
- class scholar_flux.api.APIResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None)[source]
Bases: BaseModel
A Response wrapper for responses of different types that allows consistency when using several possible backends.
The purpose of this class is to serve as the base for managing responses received from scholarly APIs while processing each component in a predictable, reproducible manner.
This class uses pydantic’s data validation and serialization/deserialization methods to aid caching and includes properties that refer back to the original response for displaying valid response codes, URLs, etc.
All future processing/error-based response classes inherit from and build off of this class.
- Parameters:
cache_key (Optional[str]) – A string for recording cache keys for use in later steps of the response orchestration involving processing, cache storage, and cache retrieval
response (Optional[requests.Response | ResponseProtocol]) – A response or response-like object to be validated and used/re-used in later caching and response processing/orchestration steps.
created_at (Optional[str]) – A value indicating the time at which a response or response-like object was created.
Example
>>> from scholar_flux.api import APIResponse
# Using keyword arguments to build a basic APIResponse data container:
>>> response = APIResponse.from_response(
...     cache_key = 'test-response',
...     status_code = 200,
...     content=b'success',
...     url='https://example.com',
...     headers={'Content-Type': 'application/text'}
... )
>>> response
# OUTPUT: APIResponse(cache_key='test-response', response = ReconstructedResponse(
#    status_code=200, reason='OK', headers={'Content-Type': 'application/text'},
#    text='success', url='https://example.com'
#)
>>> assert response.status == 'OK' and response.text == 'success' and response.url == 'https://example.com'
# OUTPUT: True
>>> assert response.validate_response()
# OUTPUT: True
- classmethod as_reconstructed_response(response: object) ReconstructedResponse[source]
Classmethod designed to create a reconstructed response from an original response object.
This method coerces response attributes into a reconstructed response that retains the original content, status code, headers, URL, reason, etc.
- Returns:
- A minimal response object that contains the core attributes needed to support
other processes in the scholar_flux module such as response parsing and caching.
- Return type:
ReconstructedResponse
- build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType] | None[source]
Defines a No-Op method to be overridden by ProcessedResponse subclasses.
- cache_key: str | None
- property cached: bool | None
Identifies whether the current response was retrieved from the session cache.
- Returns:
True if the response is a CachedResponse object, False if it is a fresh requests.Response object, and None if unknown (e.g., the response attribute is not a requests.Response object or subclass).
- Return type:
Optional[bool]
- property content: bytes | None
Return content from the underlying response, if available and valid.
- Returns:
The bytes from the original response content
- Return type:
(bytes)
- created_at: str | None
- encode_response(response: object) dict[str, Any] | list[Any] | None[source]
Helper method for serializing a response into a json format.
Accounts for special cases such as CaseInsensitiveDict fields that are otherwise unserializable.
From this step, pydantic can safely use json internally to dump the encoded response fields
- classmethod from_response(response: Any | None = None, cache_key: str | None = None, auto_created_at: bool | None = None, **kwargs: Any) Self[source]
Construct an APIResponse from a response object or from keyword arguments.
If response is not a valid response object, builds a minimal response-like object from kwargs.
- classmethod from_serialized_response(response: object | None = None, **kwargs: Any) ReconstructedResponse | None[source]
Helper method for creating a new APIResponse from a dumped JSON object.
This method accounts for the difficulty of serializing responses by decoding the response dictionary that was loaded from a string using json.loads from the standard library.
If the response input is still a serialized string, this method will manually load the response dict with the APIResponse._deserialize_response_dict class method before further processing.
- Parameters:
response (object) – A prospective response value to load into the API Response.
- Returns:
A reconstructed response object, if possible. Otherwise returns None
- Return type:
Optional[ReconstructedResponse]
- property headers: MutableMapping[str, str] | None
Return headers from the underlying response, if available and valid.
- Returns:
A dictionary of headers from the response
- Return type:
MutableMapping[str, str]
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- normalize(*args: Any, **kwargs: Any) NormalizedRecordList[source]
Defines the normalize method that successfully processed API Responses can override to normalize records.
- Raises:
NotImplementedError – Raised unless this method is overridden in a subclass.
- process_metadata(*args: Any, **kwargs: Any) MetadataType | None[source]
Abstract processing method that APIResponse subclasses can override to process metadata.
- Parameters:
*args – No-Op - Added for compatibility with the APIResponse subclasses.
**kwargs – No-Op - Added for compatibility with the APIResponse subclasses.
- Raises:
NotImplementedError – Raised unless this method is overridden in a subclass.
- raise_for_status() None[source]
Uses the underlying response or response-like object to validate the status code associated with the request.
If the attribute isn’t a response or reconstructed response, the code will coerce the class into a response object to verify the status code for the request URL and response.
- Raises:
requests.RequestException – Errors for status codes that indicate unsuccessfully received responses.
- property reason: str | None
Uses the reason or status code attribute on the response object, to retrieve or create a status description.
- Returns:
The status description associated with the response.
- Return type:
Optional[str]
- resolve_extracted_record(*args: Any, **kwargs: Any) RecordType | None[source]
Defines a No-Op method to be overridden by ProcessedResponse subclasses.
- response: Response | ResponseProtocol | None
- classmethod serialize_response(response: Response | ResponseProtocol) str | None[source]
Helper method for serializing a response into a json format.
The response object is first converted into a serialized string and subsequently dumped after ensuring that the field is serializable.
- Parameters:
response (Response, ResponseProtocol) – A requests.Response or response-like object to serialize as a string.
- Returns:
A serialized response when response serialization is possible. Otherwise None.
- Return type:
Optional[str]
- property status: str | None
Helper property for retrieving a human-readable status description from the APIResponse.
- Returns:
The status description associated with the response (if available).
- Return type:
Optional[str]
- property status_code: int | None
Helper property for retrieving a status code from the APIResponse.
- Returns:
The status code associated with the response (if available)
- Return type:
Optional[int]
- strip_annotations(*args: Any, **kwargs: Any) RecordList[source]
Defines a No-Op method to be overridden by ProcessedResponse subclasses.
- property text: str | None
Attempts to retrieve the response text by first decoding the bytes of its content.
If not available, this property attempts to reference the text attribute directly.
- Returns:
A text string if the text is available in the correct format, otherwise None
- Return type:
Optional[str]
- classmethod transform_response(v: Response | ResponseProtocol | None) Response | ResponseProtocol | None[source]
Attempts to resolve a valid or a serialized response-like object as an original or ReconstructedResponse.
All original response objects (duck-typed or requests response) with valid values will be returned as is.
If the passed object is a string, this function will attempt to deserialize it before attempting to parse it as a dictionary.
Dictionary fields will be decoded, if originally encoded, and parsed as a ReconstructedResponse object, if possible.
Otherwise, the original object is returned as is.
- property url: str | None
Return URL from the underlying response, if available and valid.
- Returns:
- The original URL in string format, if available. For URL objects that are not str types, this method
attempts to convert them into strings when possible.
- Return type:
str
- classmethod validate_iso_timestamp(v: str | datetime | None) str | None[source]
Helper method for validating and ensuring that the timestamp accurately follows an ISO 8601 format.
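A timestamp check along these lines can be sketched with datetime.fromisoformat. The function name is hypothetical, and converting datetime inputs with isoformat() is an assumption about how non-string input is handled. Note that fromisoformat accepts only a subset of ISO 8601 on older Python versions (for example, a trailing "Z" is rejected before Python 3.11).

```python
from datetime import datetime
from typing import Optional, Union


def validate_iso_timestamp_sketch(v: Union[str, datetime, None]) -> Optional[str]:
    """Hypothetical stand-in: returns an ISO 8601 string or raises ValueError."""
    if v is None:
        return None
    if isinstance(v, datetime):
        # Assumed handling: datetimes are converted to their ISO 8601 string form
        return v.isoformat()
    try:
        datetime.fromisoformat(v)
    except (TypeError, ValueError) as exc:
        raise ValueError(f"Expected an ISO 8601 timestamp, received {v!r}") from exc
    return v


print(validate_iso_timestamp_sketch(datetime(2024, 1, 15, 10, 30)))
```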
- validate_response(raise_on_error: bool = False) bool[source]
Helper method for determining whether the response attribute is truly a response or response-like object.
If the response isn’t a requests.Response object, we use duck-typing to determine whether the response, itself, contains the attributes expected of a response.
For this purpose, response properties are checked to determine whether the properties of the nested response object match the expected types.
- Parameters:
raise_on_error (bool) – Indicates whether an error should be raised if the response attribute is invalid (False by default).
- Returns:
Indicates whether the current APIResponse.response attribute is a valid response.
- Return type:
bool
- Raises:
InvalidResponseStructureException – When the response attribute is invalid and raise_on_error=True
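The duck-typing idea behind validate_response can be sketched as an attribute-and-type check. The attribute set below (status_code, url, content) is an assumption, since the properties the library actually inspects are not listed here, and TypeError stands in for InvalidResponseStructureException.

```python
from types import SimpleNamespace
from typing import Any

# Assumed attribute/type pairs expected of a response-like object
EXPECTED_ATTRIBUTES = {"status_code": int, "url": str, "content": bytes}


def is_response_like(obj: Any, raise_on_error: bool = False) -> bool:
    """Hypothetical stand-in: checks that obj exposes response-like attributes."""
    valid = all(
        isinstance(getattr(obj, name, None), expected_type)
        for name, expected_type in EXPECTED_ATTRIBUTES.items()
    )
    if not valid and raise_on_error:
        # Stand-in for the library's InvalidResponseStructureException
        raise TypeError("The object does not expose the attributes expected of a response")
    return valid


fake_response = SimpleNamespace(status_code=200, url="https://example.com", content=b"success")
print(is_response_like(fake_response), is_response_like(object()))
```

Any object with the right attributes passes, regardless of its class, which is what allows reconstructed or cached responses to be used interchangeably with requests.Response objects.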
- class scholar_flux.api.APISpecificParameter(name: str, description: str, validator: Callable[[Any], Any] | None = None, default: Any = None, required: bool = False)[source]
Bases: object
Dataclass that defines the specification of an API-specific parameter for an API provider.
Implements optionally specifiable defaults, validation steps, and indicators for optional vs. required fields.
- Parameters:
name (str) – The name of the parameter used when sending requests to APIs.
description (str) – A description of the API-specific parameter.
validator (Optional[Callable[[Any], Any]]) – An optional function/method for verifying and pre-processing parameter input based on required types, constrained values, etc.
default (Any) – A default value used for the parameter if not specified by the user
required (bool) – Indicates whether the current parameter is required for API calls.
- __init__(*args: Any, **kwargs: Any) None
- default: Any = None
- description: str
- name: str
- required: bool = False
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for showing the structure of the current APISpecificParameter.
- validator: Callable[[Any], Any] | None = None
- property validator_name: str
Helper method for generating a human-readable string from the validator function, if used.
- class scholar_flux.api.BaseAPI(user_agent: str | None = None, session: Session | None = None, timeout: int | float | None = None, use_cache: bool | None = None)[source]
Bases: object
The BaseAPI client is a minimal implementation for user-friendly request preparation and response retrieval.
- Parameters:
session (Optional[requests.Session]) – A pre-configured requests or requests-cache session. A new session is created if not specified.
user_agent (Optional[str]) – An optional user-agent string for the session.
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError
use_cache (bool) – Indicates whether or not to create a cached session. If a cached session is already specified, this setting will have no effect on the creation of a session.
Examples
>>> from scholar_flux.api import BaseAPI
# creating a basic API client that uses the PLOS API as the default while caching response data in-memory:
>>> base_api = BaseAPI(use_cache=True)
# retrieve a basic request:
>>> parameters = {'q': 'machine learning', 'start': 1, 'rows': 20}
>>> response_page_1 = base_api.send_request('https://api.plos.org/search', parameters=parameters)
>>> assert response_page_1.ok
>>> response_page_1
# OUTPUT: <Response [200]>
>>> ml_page_1 = response_page_1.json()
# retrieving the next page:
>>> parameters['start'] = 21
>>> response_page_2 = base_api.send_request('https://api.plos.org/search', parameters=parameters)
>>> assert response_page_2.ok
>>> response_page_2
# OUTPUT: <Response [200]>
>>> ml_page_2 = response_page_2.json()
>>> ml_page_2
# OUTPUT: {'response': {'numFound': '...', 'start': 21, 'docs': ['...']}} # redacted
Note
The class variable, BaseAPI.DEFAULT_USE_CACHE is set at import to True if the environment variable, SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND, is configured. Otherwise, DEFAULT_USE_CACHE is set to False. Changes made via config_settings after import/runtime will not enable or disable caching unless you manually update BaseAPI.DEFAULT_USE_CACHE or SearchAPI.DEFAULT_USE_CACHE (for the SearchAPI subclass).
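The import-time behavior described in this note can be illustrated with a small stand-in class. This is only a sketch: `SketchAPI` is a hypothetical class, not the scholar_flux implementation, and the value `"memory"` assigned to the environment variable is an assumption made for illustration.

```python
import os

ENV_VAR = "SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND"
os.environ.pop(ENV_VAR, None)  # start from an unconfigured environment


class SketchAPI:
    """Hypothetical stand-in that mimics an import-time class default."""

    # Evaluated once, when the class body runs (i.e., at import time).
    DEFAULT_USE_CACHE: bool = bool(os.environ.get(ENV_VAR))

    def __init__(self, use_cache=None):
        # Fall back to the class default only when no explicit value is given.
        self.use_cache = self.DEFAULT_USE_CACHE if use_cache is None else use_cache


# Setting the variable after the class exists does not change the default...
os.environ[ENV_VAR] = "memory"  # assumed value, for illustration only
late_client = SketchAPI()

# ...but updating the class variable directly does.
SketchAPI.DEFAULT_USE_CACHE = True
updated_client = SketchAPI()
```

An explicit `use_cache` argument takes precedence over the class default in this sketch, which matches the optional `use_cache` parameter in the signature above.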
- DEFAULT_TIMEOUT: int = 20
- DEFAULT_USE_CACHE: bool = False
- __init__(user_agent: str | None = None, session: Session | None = None, timeout: int | float | None = None, use_cache: bool | None = None)[source]
Initializes the BaseAPI client for response retrieval given the provided inputs.
The necessary attributes are prepared with a new or existing session (cached or uncached) via dependency injection. This class is designed to be subclassed for specific API implementations.
- Parameters:
user_agent (Optional[str]) – Optional user-agent string for the session.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
timeout (Optional[int | float]) – Timeout for requests in seconds.
use_cache (Optional[bool]) – Indicates whether or not to use cache. The default setting is to create a regular requests.Session unless a CachedSession is already provided.
- property cache: BaseCache | None
Retrieves the requests-session cache object if the session object is a CachedSession object.
If a session cache does not exist, this function will return None.
- Returns:
The cache object if available, otherwise None.
- Return type:
Optional[BaseCache]
- property cached: bool
Checks whether the current session object used by the current API is a cached session.
- Returns:
True if the current object is a cached session object, and False otherwise
- Return type:
bool
- configure_session(session: Session | None = None, user_agent: str | None = None, use_cache: bool | None = None) Session[source]
Creates a new Session or CachedSession object for API requests if a session does not already exist.
If use_cache = True, then a cached session object will be used. A regular session that is not already cached will be overridden.
- Parameters:
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist. If use_cache is True and a cached session has already been passed, the previously created cached session is returned. Otherwise, a new CachedSession is created.
- Returns:
The configured session.
- Return type:
requests.Session
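The branching described for configure_session can be sketched with stand-in classes. Everything below is hypothetical: `Session` and `CachedSession` are minimal placeholders for the requests and requests-cache classes, `configure_session_sketch` is not the library's code, and whether headers are carried over from a replaced session is not stated in the documentation and is not modeled here.

```python
from typing import Optional


class Session:
    """Stand-in for requests.Session (hypothetical, for illustration)."""

    def __init__(self):
        self.headers = {}


class CachedSession(Session):
    """Stand-in for requests_cache.CachedSession (hypothetical)."""


def configure_session_sketch(session: Optional[Session] = None,
                             user_agent: Optional[str] = None,
                             use_cache: Optional[bool] = None) -> Session:
    if isinstance(session, CachedSession):
        configured = session  # an existing cached session is reused as-is
    elif use_cache:
        configured = CachedSession()  # replaces a missing or uncached session
    elif session is not None:
        configured = session  # keep the caller's regular session
    else:
        configured = Session()  # nothing provided: create a regular session
    if user_agent:
        configured.headers["User-Agent"] = user_agent
    return configured


cached = CachedSession()
reused = configure_session_sketch(cached, use_cache=True)
upgraded = configure_session_sketch(Session(), use_cache=True)
```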
- static is_cached_session(session: CachedSession | Session) bool[source]
Checks whether a provided session object is a requests_cache.CachedSession object.
- Parameters:
session (requests.Session) – The session to check.
- Returns:
True if the session is a cached session, False otherwise.
- Return type:
bool
- prepare_request(base_url: str, endpoint: str | None = None, parameters: Dict[str, Any] | None = None) PreparedRequest[source]
Prepares a GET request for the specified endpoint with optional parameters.
- Parameters:
base_url (str) – The base URL for the API.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
parameters (Optional[Dict[str, Any]]) – Optional query parameters for the request.
- Returns:
The prepared request object.
- Return type:
requests.PreparedRequest
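As a rough illustration of how a base URL, an endpoint, and query parameters combine into a GET request URL, the sketch below uses only the standard library. The helper `build_get_url` is hypothetical, and the slash-joining rule it applies is an assumption; the documentation does not specify how prepare_request joins these pieces.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode


def build_get_url(base_url: str, endpoint: Optional[str] = None,
                  parameters: Optional[Dict[str, Any]] = None) -> str:
    """Hypothetical helper: join the pieces of a GET request into one URL."""
    url = base_url.rstrip("/")
    if endpoint:
        # Assumed joining rule: exactly one slash between base URL and endpoint.
        url = f"{url}/{endpoint.lstrip('/')}"
    if parameters:
        # urlencode percent-encodes values and encodes spaces as '+'.
        url = f"{url}?{urlencode(parameters)}"
    return url


url = build_get_url("https://api.plos.org", "search",
                    {"q": "machine learning", "start": 1, "rows": 20})
```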
- send_request(base_url: str, endpoint: str | None = None, parameters: Dict[str, Any] | None = None, timeout: int | float | None = None) Response[source]
Sends a GET request to the specified endpoint with optional parameters.
- Parameters:
base_url (str) – The base API to send the request to.
endpoint (Optional[str]) – The endpoint of the API to send the request to.
parameters (Optional[Dict[str, Any]]) – Optional query parameters for the request.
timeout (Optional[int | float]) – Timeout for the request in seconds.
- Returns:
The response object.
- Return type:
requests.Response
- session: Session
- structure(flatten: bool = True, show_value_attributes: bool = False) str[source]
Base method for showing the structure of the current BaseAPI. This method reveals the configuration settings of the API client that will be used to send requests.
- Returns:
The current structure of the BaseAPI or its subclass.
- Return type:
str
- summary() str[source]
Creates a summary representation of the current structure of the API.
Returns the original representation.
- property user_agent: str | None
The User-Agent should always reflect what is used in the session.
This method retrieves the User-Agent from the session directly.
- class scholar_flux.api.BaseCoordinator(search_api: SearchAPI, response_coordinator: ResponseCoordinator)[source]
Bases: object
BaseCoordinator providing the minimum functionality for requesting and retrieving records and metadata from APIs.
This class uses dependency injection to orchestrate the process of constructing requests, validating responses, and processing scientific works and articles. This class is designed to provide the absolute minimum necessary functionality to both retrieve and process data from APIs and can make use of caching functionality for caching requests and responses.
After initialization, the BaseCoordinator uses two main components for the sequential orchestration of response retrieval, processing, and caching.
- Components:
- SearchAPI (api/search_api):
Handles the creation and orchestration of search requests in addition to the caching of successful requests via dependency injection.
- ResponseCoordinator (responses/response_coordinator):
Handles the full range of response processing steps after retrieving a response from an API. These parsing, extraction, and processing steps occur sequentially when a new response is received. If a response was previously handled, the coordinator will attempt to retrieve these responses from the processing cache.
Example
>>> from scholar_flux.api import SearchAPI, ResponseCoordinator, BaseCoordinator
# Note: the SearchAPI uses PLOS by default if `provider_name` is not provided.
# Unless the `SCHOLAR_FLUX_DEFAULT_PROVIDER` env variable is set to another provider.
>>> base_search_coordinator = BaseCoordinator(search_api = SearchAPI(query = 'Math'),
...                                           response_coordinator = ResponseCoordinator.build())
>>> response = base_search_coordinator.search(page = 1)
>>> response
# OUTPUT <ProcessedResponse(len=20, cache_key=None, metadata="{'numFound': 14618, 'start': 1, ...})>
# All processed records for a particular response can be found under response.data (a list of dictionaries)
>>> list(response.data[0].keys())
# OUTPUT ['article_type', 'eissn', 'id', 'journal', 'publication_date', 'score', 'title_display',
#         'abstract', 'author_display']
- __init__(search_api: SearchAPI, response_coordinator: ResponseCoordinator)[source]
Initializes the base coordinator by delegating assignment of attributes to the _initialize method. Future coordinators can follow a similar pattern of using an _initialize for initial parameter assignment.
- Parameters:
search_api (SearchAPI) – The search API to use for the retrieval of response records from APIs
response_coordinator (ResponseCoordinator) – Core class used to handle the processing and core handling of all responses from APIs
- classmethod as_coordinator(search_api: SearchAPI, response_coordinator: ResponseCoordinator, *args: Any, **kwargs: Any) Self[source]
Helper factory method for building a SearchCoordinator that allows users to build from the final building blocks of a SearchCoordinator.
- Parameters:
search_api (SearchAPI) – The search API to use for the retrieval of response records from APIs.
response_coordinator (ResponseCoordinator) – Core class used to handle the processing and core handling of all responses from APIs.
- Returns:
A newly created coordinator class or subclass that orchestrates record retrieval and processing.
- Return type:
Self
- property display_name: str
Human-readable provider name for logging and display purposes.
- property extractor: BaseDataExtractor
Allows direct access to the DataExtractor from the ResponseCoordinator.
- property last_response: ProcessedResponse | ErrorResponse | None
Retrieves the last response sent to a provider.
- parameter_search(**kwargs: Any) ProcessedResponse | ErrorResponse | None[source]
Public method for retrieving and processing non-paginated records with directly specified parameters.
This method is designed as a direct entrypoint to performing searches without the addition of otherwise automatically populated, pagination-related fields such as query, records_per_page, etc. while still taking advantage of the orchestration features of the current coordinator.
- property parser: BaseDataParser
Allows direct access to the data parser from the ResponseCoordinator.
- property processor: ABCDataProcessor
Allows direct access to the DataProcessor from the ResponseCoordinator.
- property provider_name: str
Property method for accessing the provider name in the current SearchAPI instance.
- Returns:
The name corresponding to the API Provider.
- property response_coordinator: ResponseCoordinator
Allows the ResponseCoordinator to be used as a property.
The response_coordinator handles and coordinates the processing of API responses from parsing, record/metadata extraction, processing, and cache management.
- property responses: ResponseCoordinator
An alias for the response_coordinator property that is used for orchestrating the processing of retrieved API responses.
Handles response orchestration, including response content parsing, the extraction of records/metadata, record processing, and cache operations.
- search(**kwargs: Any) ProcessedResponse | ErrorResponse | None[source]
Public Search Method coordinating the retrieval and processing of an API response.
This method serves as the base and will primarily handle the “How” of searching (e.g. Workflows, Single page search, etc.)
- property search_api: SearchAPI
Allows the search_api to be used as a property while also allowing for verification.
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for quickly showing a representation of the overall structure of the SearchCoordinator. The helper function, generate_repr_from_string helps produce human-readable representations of the core structure of the Coordinator.
- Parameters:
flatten (bool) – Whether to flatten the coordinator’s structural representation into a single line. Default=False
show_value_attributes (bool) – Whether to show nested attributes of the components of the BaseCoordinator or its subclass.
- Returns:
The structure of the current SearchCoordinator as a string.
- Return type:
str
- classmethod update(search_coordinator: Self, search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, **kwargs: Any) Self[source]
Creates a new coordinator with optionally replaced core components.
- Parameters:
search_coordinator – The coordinator to base the new instance on.
search_api (Optional[SearchAPI]) – Replacement SearchAPI, or None to keep existing.
response_coordinator (Optional[ResponseCoordinator]) – Replacement ResponseCoordinator, or None to keep existing.
**kwargs – Additional keyword arguments to be passed to BaseCoordinator.as_coordinator()
- Returns:
A new coordinator instance with the specified components.
- Return type:
Self
- with_components(search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, **update_kwargs: Any) Generator[Self, None, None][source]
Temporarily creates and yields a new coordinator with modified core components.
- Parameters:
search_api (Optional[SearchAPI]) – Replacement SearchAPI.
response_coordinator (Optional[ResponseCoordinator]) – Replacement ResponseCoordinator.
**update_kwargs – Optional keyword arguments to be passed to update
- Yields:
Self – A new coordinator instance with the specified modifications.
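The relationship between update and with_components can be sketched with a small immutable stand-in. This is a hedged illustration, not the library's implementation: `SketchCoordinator` and its string-valued components are hypothetical, and the use of `contextlib.contextmanager` is an assumption inferred from the documented generator return type and the word "temporarily".

```python
from contextlib import contextmanager
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchCoordinator:
    """Hypothetical stand-in holding two swappable components."""

    search_api: str
    response_coordinator: str

    @classmethod
    def update(cls, coordinator, search_api=None, response_coordinator=None):
        # None means "keep the existing component".
        changes = {}
        if search_api is not None:
            changes["search_api"] = search_api
        if response_coordinator is not None:
            changes["response_coordinator"] = response_coordinator
        return replace(coordinator, **changes)

    @contextmanager
    def with_components(self, search_api=None, response_coordinator=None):
        # Yield a modified copy; the original instance is never mutated.
        yield self.update(self, search_api, response_coordinator)


original = SketchCoordinator("plos-api", "default-responses")
with original.with_components(search_api="arxiv-api") as temporary:
    temporary_api = temporary.search_api
```

Building a new instance rather than mutating the existing one keeps the original coordinator usable while the temporary one is in scope.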
- class scholar_flux.api.ErrorResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None, message: str | None = None, error: str | None = None)[source]
Bases: APIResponse
Returned when something goes wrong, but we don’t want to throw immediately; the failure details are handed back instead.
The class is formatted for compatibility with the ProcessedResponse.
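One practical consequence of this compatibility is that calling code can read the same attributes from successful and failed responses. The sketch below illustrates the idea with hypothetical stand-ins (`SketchProcessed`, `SketchError`, and `collect_records` are not part of scholar_flux).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchProcessed:
    """Hypothetical stand-in for a successful response."""

    processed_records: list = field(default_factory=list)

    @property
    def data(self):
        return self.processed_records


@dataclass
class SketchError:
    """Hypothetical stand-in for a failed response with the same attributes."""

    message: Optional[str] = None
    error: Optional[str] = None

    @property
    def data(self):
        return None  # present only so callers can use one interface


def collect_records(responses):
    """Flatten records from mixed responses without isinstance checks."""
    records = []
    for response in responses:
        if response is None:  # searches may also return None
            continue
        records.extend(response.data or [])
    return records


records = collect_records([
    SketchProcessed([{"title": "A"}, {"title": "B"}]),
    SketchError(message="timeout", error="TimeoutError"),
    None,
])
```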
- build_record_id_index(*args: Any, **kwargs: Any) dict[str, RecordType][source]
No-Op: Returns an empty dict when no extracted records are available.
This method is retained for compatibility with ProcessedResponse. Since ErrorResponse has no extracted records to index, this method always returns an empty dictionary regardless of arguments provided.
- Parameters:
*args – Positional argument placeholder for compatibility with the ProcessedResponse.build_record_id_index method. All arguments are ignored.
**kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.build_record_id_index method. All arguments are ignored.
- Returns:
An empty dictionary indicating no records are available for indexing.
- Return type:
dict[str, RecordType]
- property data: None
Provided for type hinting + compatibility.
- error: str | None
- property extracted_records: None
Provided for type hinting + compatibility.
- classmethod from_error(message: str, error: Exception, cache_key: str | None = None, response: Response | ResponseProtocol | None = None) Self[source]
Creates and logs the processing error if one occurs during response processing.
- Parameters:
message (str) – Error message describing the failure.
error (Exception) – The exception instance that was raised.
cache_key (Optional[str]) – Cache key for storing results.
response (Optional[requests.Response | ResponseProtocol]) – Raw API response.
- Returns:
A pydantic model that contains the error response data and background information on what precipitated the error.
- Return type:
Self
- message: str | None
- property metadata: None
Provided for type hinting + compatibility.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = True, *args: Any, **kwargs: Any) NormalizedRecordList[source]
No-Op: Raises a RecordNormalizationException when raise_on_error=True and returns an empty list otherwise.
- Parameters:
field_map (Optional[BaseFieldMap]) – An optional field map that can be used to normalize the current response. This is inferred from the registry if not provided as input.
raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a RecordNormalizationException is raised.
*args – Positional argument placeholder for compatibility with the ProcessedResponse.normalize method
**kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.normalize method
- Returns:
An empty list if raise_on_error=False
- Return type:
NormalizedRecordList
- Raises:
RecordNormalizationException – If raise_on_error=True, this exception is raised after catching NotImplementedError
- property normalized_records: None
Provided for type hinting + compatibility.
- property parsed_response: None
Provided for type hinting + compatibility.
- process_metadata(*args: Any, **kwargs: Any) MetadataType | None[source]
No-Op: This method is retained for compatibility. It returns None by default.
- property processed_metadata: None
Provided for type hinting + compatibility.
- property processed_records: None
Provided for type hinting + compatibility.
- property record_count: int
Number of records in this response.
- property records_per_page: None
Provided for type hinting + compatibility.
- resolve_extracted_record(*args: Any, **kwargs: Any) None[source]
No-Op: Returns None when no records are available.
This method is retained for compatibility with ProcessedResponse. Since ErrorResponse has no extracted or processed records, resolution is not possible and this method always returns None.
- Parameters:
*args – Positional argument placeholder for compatibility with the ProcessedResponse.resolve_extracted_record method. Currently includes processed_index (int).
**kwargs – Keyword argument placeholder for compatibility with the ProcessedResponse.resolve_extracted_record method. All arguments are ignored.
- Returns:
Always returns None since no records exist to resolve.
- Return type:
None
- strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]
Convenience method for removing internal metadata annotations from a provided list of records.
This method removes all metadata annotations (dictionary keys that are prefixed with an underscore) that were added during the record extraction step for pipeline traceability (e.g., _extraction_index, _record_id).
- Parameters:
records (RecordType | RecordList | None) – Records to strip. Defaults to processed_records if None.
- Returns:
A list of dictionary records with stripped metadata annotations when provided. If a record or record list is not provided, a warning is logged, and an empty list is returned.
- Return type:
RecordList
Note: This method is defined primarily for compatibility with the ProcessedResponse API.
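The annotation-stripping step described above amounts to dropping underscore-prefixed keys. A minimal sketch follows, assuming annotations appear only as top-level string keys; `strip_annotations_sketch` is a hypothetical function, not the library method.

```python
def strip_annotations_sketch(records=None):
    """Drop top-level keys that start with an underscore from each record."""
    if records is None:
        return []  # the documented method also logs a warning in this case
    if isinstance(records, dict):
        records = [records]  # accept a single record as well as a list
    return [
        {key: value for key, value in record.items() if not key.startswith("_")}
        for record in records
    ]


annotated = [{"title": "A", "_record_id": "r1", "_extraction_index": 0}]
stripped = strip_annotations_sketch(annotated)
```

The sketch returns new dictionaries, so the annotated input records are left untouched.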
- property total_query_hits: None
Provided for type hinting + compatibility.
- class scholar_flux.api.MultiSearchCoordinator(*args: Any, **kwargs: Any)[source]
Bases: UserDict[str, SearchCoordinator]
The MultiSearchCoordinator is a utility class for orchestrating searches across multiple providers, pages, and queries sequentially or using multithreading. This coordinator builds on the SearchCoordinator’s core structure to ensure consistent, rate-limited API requests.
The multi-search coordinator uses shared rate limiters to ensure that requests to the same provider (even across different queries) will use the same rate limiter.
This implementation uses the ThreadedRateLimiter.min_interval parameter from the shared rate limiter of each provider to determine the request_delay across all queries. These settings can be found and modified in the scholar_flux.api.providers.threaded_rate_limiter_registry by provider_name.
For new, unregistered providers, users can override the MultiSearchCoordinator.DEFAULT_THREADED_REQUEST_DELAY class variable to adjust the shared request_delay.
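The idea of one rate limiter shared by every query sent to the same provider can be sketched with the standard library. The class `MinIntervalLimiter`, the `limiters` dictionary, and `fake_request` are hypothetical and are not the ThreadedRateLimiter implementation; the 0.05-second interval is chosen only to keep the demonstration fast.

```python
import threading
import time


class MinIntervalLimiter:
    """Hypothetical thread-safe limiter enforcing a minimum gap between calls."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._last_call = None
        self.call_times = []  # recorded only so the behavior can be inspected

    def wait(self) -> None:
        with self._lock:  # one caller at a time, across all threads
            if self._last_call is not None:
                remaining = self.min_interval - (time.monotonic() - self._last_call)
                if remaining > 0:
                    time.sleep(remaining)
            self._last_call = time.monotonic()
            self.call_times.append(self._last_call)


# One limiter per provider, shared by every query sent to that provider.
limiters = {"plos": MinIntervalLimiter(0.05)}


def fake_request(provider: str, query: str) -> str:
    limiters[provider].wait()  # blocks until the provider's interval has elapsed
    return f"{provider}:{query}"


threads = [threading.Thread(target=fake_request, args=("plos", query))
           for query in ("ml", "nlp", "cv")]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

times = limiters["plos"].call_times
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
```

Because the lock is held while sleeping, concurrent callers queue behind one another, so the spacing holds even when several queries target the same provider at once.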
Examples
>>> from scholar_flux import MultiSearchCoordinator, SearchCoordinator, RecursiveDataProcessor
>>> from scholar_flux.api.rate_limiting import threaded_rate_limiter_registry
>>> multi_search_coordinator = MultiSearchCoordinator()
>>> threaded_rate_limiter_registry['arxiv'].min_interval = 6 # arbitrary rate limit (seconds per request)
>>>
>>> # Create coordinators for different queries and providers
>>> coordinators = [
...     SearchCoordinator(
...         provider_name=provider,
...         query=query,
...         processor=RecursiveDataProcessor(),
...         user_agent="SammieH",
...         cache_requests=True
...     )
...     for query in ('ml', 'nlp')
...     for provider in ('plos', 'arxiv', 'openalex', 'crossref')
... ]
>>>
>>> # Add coordinators to the multi-search coordinator
>>> multi_search_coordinator.add_coordinators(coordinators)
>>>
>>> # Execute searches across multiple pages
>>> all_pages = multi_search_coordinator.search_pages(pages=[1, 2, 3])
>>>
>>> # filters and retains successful requests from the multi-provider search
>>> filtered_pages = all_pages.filter()
>>> # The results will contain successfully processed responses across all queries, pages, and providers
>>> print(filtered_pages) # Output will be a list of SearchResult objects
>>> # Extracts successfully processed records into a list of records where each record is a dictionary
>>> record_dict = filtered_pages.join() # retrieves a list of records
>>> print(record_dict) # Output will be a flattened list of all records
- DEFAULT_THREADED_REQUEST_DELAY: float | int = 6.0
- __init__(*args: Any, **kwargs: Any) None[source]
Initializes the MultiSearchCoordinator, allowing positional and keyword arguments to be specified when creating the MultiSearchCoordinator.
The initialization of the MultiSearchCoordinator operates similarly to that of a regular dict with the caveat that values are statically typed as SearchCoordinator instances.
- add(search_coordinator: SearchCoordinator) None[source]
Adds a new SearchCoordinator to the MultiSearchCoordinator instance.
- Parameters:
search_coordinator (SearchCoordinator) – A search coordinator to add to the MultiSearchCoordinator dict
- Raises:
InvalidCoordinatorParameterException – If the provided value is not a SearchCoordinator
- add_coordinators(search_coordinators: Iterable[SearchCoordinator]) None[source]
Helper method for adding a sequence of coordinators at a time.
- property coordinators: list[SearchCoordinator]
Utility property for quickly retrieving a list of all currently registered coordinators.
- current_providers() set[str][source]
Extracts a set of names corresponding to each API provider assigned to the MultiSearchCoordinator.
- classmethod from_coordinators(search_coordinators: Iterable[SearchCoordinator]) Self[source]
Constructs a new MultiSearchCoordinator instance from a sequence of coordinators.
- group_by_provider() dict[str, dict[str, SearchCoordinator]][source]
Groups all coordinators by provider name to facilitate retrieval with normalized components where needed. This is especially helpful for the later retrieval of articles when multithreading by provider (as opposed to by page) to account for strict rate limits. All coordinated searches corresponding to a provider appear under a nested dictionary to facilitate orchestration on the same thread with the same rate limiter.
- Returns:
All elements in the final dictionary map provider-specific coordinators to the normalized provider name for the nested dictionary of coordinators.
- Return type:
dict[str, dict[str, SearchCoordinator]]
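The nested structure returned by group_by_provider can be pictured with a short sketch. The names below are hypothetical: `QueryCoordinator` stands in for a SearchCoordinator, the `"provider:query"` key format is an assumption, and provider names are taken to be already normalized.

```python
from collections import defaultdict
from typing import NamedTuple


class QueryCoordinator(NamedTuple):
    """Hypothetical stand-in exposing only the fields needed for grouping."""

    provider_name: str
    query: str


def group_by_provider_sketch(coordinators: dict) -> dict:
    """Nest a flat mapping of coordinators under their provider name."""
    grouped = defaultdict(dict)
    for key, coordinator in coordinators.items():
        grouped[coordinator.provider_name][key] = coordinator
    return dict(grouped)


flat = {
    "plos:ml": QueryCoordinator("plos", "ml"),
    "plos:nlp": QueryCoordinator("plos", "nlp"),
    "arxiv:ml": QueryCoordinator("arxiv", "ml"),
}
grouped = group_by_provider_sketch(flat)
```

Each inner dictionary can then be handed to a single worker so that all requests for one provider share a thread and a rate limiter.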
- iter_pages(pages: Sequence[int] | PageListInput, iterate_by_group: bool = False, **kwargs: Any) Generator[SearchResult, None, None][source]
Helper method that creates and joins a sequence of generator functions for retrieving and processing records from each combination of queries, pages, and providers in sequence. This implementation uses SearchCoordinator.iter_pages to dynamically identify when page retrieval should halt for each API provider, accounting for errors, timeouts, and fewer records than expected, before filtering records with pre-specified criteria.
- Parameters:
pages (Sequence[int]) – A sequence of page numbers to iteratively request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Yields:
SearchResult – Iteratively yields the SearchResult for each provider, query, and page using a generator expression. Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)
- iter_pages_threaded(pages: Sequence[int] | PageListInput, max_workers: int | None = None, **kwargs: Any) Generator[SearchResult, None, None][source]
Threaded counterpart of iter_pages that threads by provider to respect rate limits. This helper method simultaneously runs a sequence of generator functions that retrieve and process records from each combination of queries, pages, and providers, with the sequences grouped by provider.
This implementation also uses SearchCoordinator.iter_pages to dynamically identify when page retrieval should halt for each API provider, accounting for errors, timeouts, and fewer records than expected, before filtering records with pre-specified criteria.
Note that because threading is performed by provider, this method will not differ significantly in speed from the MultiSearchCoordinator.iter_pages method if only a single provider has been specified.
- Parameters:
pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Yields:
SearchResult – Iteratively yields the SearchResult for each provider, query, and page using a generator expression as each SearchResult becomes available after multi-threaded processing. Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)
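Threading by provider, as opposed to by page, can be sketched with `concurrent.futures`. This is a simplified illustration under stated assumptions: `fetch_provider_pages` and `iter_pages_threaded_sketch` are hypothetical, results are plain tuples rather than SearchResult objects, no network requests are made, and results are yielded one provider batch at a time rather than individually.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_provider_pages(provider, queries, pages):
    """Run every (query, page) for one provider sequentially on one thread."""
    return [(provider, query, page) for query in queries for page in pages]


def iter_pages_threaded_sketch(groups, pages, max_workers=None):
    # One task per provider, so a provider's requests never run concurrently.
    workers = max_workers or len(groups) or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch_provider_pages, provider, queries, pages)
                   for provider, queries in groups.items()]
        # Yield each provider's batch as soon as that provider finishes.
        for future in as_completed(futures):
            yield from future.result()


groups = {"plos": ["ml", "nlp"], "arxiv": ["ml"]}
results = list(iter_pages_threaded_sketch(groups, pages=[1, 2]))
```

With a single provider there is only one task, which is why a threaded run gains little over sequential iteration in that case.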
- search(page: int = 1, iterate_by_group: bool = False, max_workers: int | None = None, multithreading: bool = True, **kwargs: Any) SearchResultList[source]
Public method used to search for a single page from multiple providers at once using a sequential or multithreading approach. This method delegates the search to search_pages to retrieve a single page for each query and provider, using an iterative approach to search for articles grouped by provider.
Note that the MultiSearchCoordinator.search_pages method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.
- Parameters:
page (int) – The page number to iteratively request from each API Provider.
iterate_by_group (bool) – Determines whether all searches should be performed by page or by group. Note that page-based iteration is significantly faster due to API rate limits. This is set to False by default as a result.
max_workers (Optional[int]) – Determines how many threads should operate at one time. Applies only when multithreading is set to True. When None, as many threads are used as required.
multithreading (bool) – Multithreading is used when this parameter is set to True. Otherwise, sequential iteration is performed. Multithreading is enabled by default.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Returns:
The list containing all retrieved and processed pages from the API. If any non-stopping errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.
- Return type:
SearchResultList
- search_page(page: int = 1, **kwargs: Any) SearchResultList[source]
Retrieves a single page from all registered coordinators.
This method provides API compatibility with SearchCoordinator.search_page, returning results wrapped in SearchResult containers with provider metadata.
- Parameters:
page (int) – The page number to retrieve from each provider.
**kwargs – Additional arguments to pass to MultiSearchCoordinator.search_pages or the search_pages method for each individual coordinator.
- Returns:
Results from all coordinators for the specified page.
- Return type:
SearchResultList
- search_pages(pages: Sequence[int] | PageListInput, iterate_by_group: bool = False, max_workers: int | None = None, multithreading: bool = True, *, min_records: int | None = None, page_offset: int = 0, **kwargs: Any) SearchResultList[source]
Searches for records from multiple providers using a sequential or multithreading approach.
Note that the MultiSearchCoordinator.search_pages method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.
- Parameters:
pages (Sequence[int]) – A sequence of page numbers to iteratively request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
min_records (int) – The total number of records to retrieve sequentially. If not provided as an integer, the pages argument is validated immediately instead. No-Op when pages is a non-empty/non-zero value.
page_offset (int) – The page offset to begin record retrieval from (0 by default). This parameter is only relevant when a min_records value is provided instead of a page number.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Returns:
The list containing all retrieved and processed pages from the API. If any non-stopping errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.
- Return type:
SearchResultList
- search_records(min_records: int, page_offset: int = 0, **kwargs: Any) SearchResultList[source]
Helper method for retrieving a minimum of min_records records across all API providers.
This method retrieves a minimum of min_records per provider unless no pages remain to be retrieved or a non-retryable error occurs during processing. Note that this method uses shared rate limiters to ensure that APIs are not overwhelmed by the number of requests being sent within a specific time interval.
- Parameters:
min_records (int) – The total number of records to retrieve sequentially.
page_offset (int) – The page offset to begin record retrieval from (0 by default).
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
- Returns:
The list containing all retrieved and processed pages from the API. If any non-stopping errors occur, this will return an ErrorResponse instead with error and message attributes further explaining any issues that occurred during processing.
- Return type:
SearchResultList
- select(query: str | None = None, provider_name: str | None = None) list[SearchCoordinator][source]
Helper method that enables the selection of coordinators based on their query or provider name.
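A selection of this kind can be sketched as a simple filter. The names below are hypothetical, and two behaviors are assumptions not stated in the documentation: a value of None is treated as "match anything", and when both arguments are given they are combined with a logical AND.

```python
from typing import NamedTuple, Optional


class SelectableCoordinator(NamedTuple):
    """Hypothetical stand-in exposing only the fields used for selection."""

    provider_name: str
    query: str


def select_sketch(coordinators, query: Optional[str] = None,
                  provider_name: Optional[str] = None) -> list:
    """Return the coordinators matching the given query and/or provider."""
    return [
        coordinator for coordinator in coordinators
        if (query is None or coordinator.query == query)
        and (provider_name is None or coordinator.provider_name == provider_name)
    ]


registered = [
    SelectableCoordinator("plos", "ml"),
    SelectableCoordinator("plos", "nlp"),
    SelectableCoordinator("arxiv", "ml"),
]
ml_only = select_sketch(registered, query="ml")
plos_ml = select_sketch(registered, query="ml", provider_name="plos")
```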
- class scholar_flux.api.NonResponse(*, cache_key: str | None = None, response: None = None, created_at: str | None = None, message: str | None = None, error: str | None = None)[source]
Bases: ErrorResponse
Response class that indicates that an error occurred during request preparation or API response retrieval.
This class is used to signify the error that occurred within the search process using a similar interface as the other scholar_flux Response dataclasses.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- response: None
- class scholar_flux.api.ProcessedResponse(*, cache_key: str | None = None, response: Response | ResponseProtocol | None = None, created_at: str | None = None, parsed_response: Any | None = None, extracted_records: RecordList | None = None, processed_records: RecordList | None = None, normalized_records: NormalizedRecordList | None = None, metadata: MetadataType | None = None, processed_metadata: MetadataType | None = None, message: str | None = None)[source]
Bases:
APIResponse
APIResponse class that scholar_flux uses to return processed response data after successful response processing.
This class is populated to return response data containing information on the original, cached, or reconstructed API response that is received and processed after retrieval. In addition to returning processed records and metadata, this class also allows storage of intermediate steps including:
Parsed responses
Extracted records and metadata
Processed records (aliased as data)
Normalized records
Processed metadata
Any additional messages. An error field is provided for compatibility with the ErrorResponse class.
- build_record_id_index() dict[str, RecordType][source]
Builds a lookup table for ID-based resolution of extracted records.
This method creates a dictionary that maps _record_id values to their corresponding extracted records. Useful when performing multiple resolutions for records in the same response.
- Returns:
A new dictionary mapping record IDs to the original record. An empty dictionary is returned if extracted_records is None/empty or if no record has an associated ID.
- Return type:
dict[str, RecordType]
Example
>>> from scholar_flux import SearchCoordinator
>>> coordinator = SearchCoordinator(query = 'public health', annotate_records=True)
>>> response = coordinator.search(page = 1)
>>> id_index = response.build_record_id_index()
>>> processed_record = response.data[0]
>>> extracted_record = id_index.get(processed_record["_record_id"])
>>> isinstance(extracted_record, dict)
# OUTPUT: True
Note
This method is used in the process of identifying raw, unprocessed records after extensive post-processing and filtering has been performed on each record and relies on record annotation being enabled during data extraction.
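The lookup table described above can be pictured with plain dictionaries. The following is a minimal sketch rather than the library's implementation: the `build_id_index` helper and the sample records are hypothetical, and only the `_record_id` annotation key is taken from the documentation above.

```python
from typing import Any, Optional

Record = dict[str, Any]


def build_id_index(extracted_records: Optional[list[Record]]) -> dict[str, Record]:
    """Map each `_record_id` to its record, skipping records without an ID."""
    index: dict[str, Record] = {}
    for record in extracted_records or []:
        record_id = record.get("_record_id")
        if record_id is not None:
            index[str(record_id)] = record
    return index


# Hypothetical extracted records; the last one was never annotated.
extracted = [
    {"_record_id": "a1", "title": {"main": "First"}},
    {"_record_id": "b2", "title": {"main": "Second"}},
    {"title": {"main": "No annotation"}},
]
id_index = build_id_index(extracted)
```

Building the dictionary once keeps each later lookup at constant time, which matters when many processed records are resolved against the same response.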
- property data: RecordList | None
Alias to the processed_records attribute that holds a list of dictionaries, when available.
- property error: None
Provided for type hinting + compatibility.
- extracted_records: RecordList | None
- message: str | None
- metadata: MetadataType | None
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- normalize(field_map: BaseFieldMap | None = None, raise_on_error: bool = False, update_records: bool | None = None, resolve_records: bool | None = None, keep_api_specific_fields: bool | Sequence | None = None, strip_annotations: bool | None = None) NormalizedRecordList[source]
Applies a field map to normalize the processed records of a ProcessedResponse into a common structure.
Note that if a field_map is not provided, this method will return the previously created normalized_records attribute if available. If normalized_records is None, this method will attempt to look up the FieldMap from the current provider_registry.
If processed records is None (and not an empty list), record normalization will fall back to using extracted_records and will return relatively similar results with minor differences in potential value coercion, flattening, and the recursive extraction of values at non-terminal paths depending on the implementation of the data processor.
- Parameters:
field_map (Optional[BaseFieldMap]) – An optional field map that can be used to normalize the current response. This is inferred from the registry if not provided as input.
raise_on_error (bool) – A flag indicating whether to raise an error. If a field_map cannot be identified for the current response and raise_on_error is also True, a normalization error is raised.
update_records (Optional[bool]) – A flag that determines whether updates should be made to the normalized_records attribute after computation. If None, updates are made only if the normalized_records attribute is currently None.
resolve_records (Optional[bool]) – A flag that determines if resolution with annotated records should occur. If True or None, resolution occurs. If False, normalization uses processed_records when not None and extracted_records otherwise.
keep_api_specific_fields (Optional[bool | Sequence]) – Indicates what API-specific records should be retained from the complete list of API parameters that are returned. If False, only the core parameters defined by the FieldMap are returned. If True or None, all parameters are returned instead.
strip_annotations (Optional[bool]) – A flag for removing metadata annotations denoted by a leading underscore. When True or None (default), annotations are removed from normalized records.
- Returns:
The list of normalized records in the same dimension as the original processed response. If a map for the current provider does not exist and raise_on_error=False, an empty list is returned instead.
- Return type:
NormalizedRecordList
- Raises:
RecordNormalizationException – If an error occurs during the normalization of record list.
Example
>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.utils import truncate, coerce_flattened_str
>>> coordinator = SearchCoordinator(query = 'public health')
>>> response = coordinator.search_page(page = 1)
>>> normalized_records = response.normalize()
>>> for record in normalized_records[:5]:
...     print(f"Title: {record['title']}")
...     print(f"URL: {record['url']}")
...     print(f"Source: {record['provider_name']}")
...     print(f"Abstract: {truncate(record['abstract'] or 'Not available')}")
...     print(f"Authors: {coerce_flattened_str(record['authors'])}")
...     print("-"*100)
# OUTPUT:
Title: Are we prepared? The development of performance indicators for …
URL: https://journals.plos.org/plosone/article?id=…
Source: plos
Abstract: Background: Disasters and emergencies…
Authors: …
----------------------------------------------------------------------------------------------------
Note
Computation is performed in one of three cases:
1. `normalized_records` does not already exist
2. `update_records` is not True
3. Either `resolve_records` or `keep_api_specific_fields` is not None
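To make the idea of a field map concrete, here is a minimal sketch under stated assumptions. The `FIELD_MAP` dictionary, the `normalize_record` helper, and the sample record are hypothetical stand-ins; the real `BaseFieldMap` may resolve nested paths and apply coercion that this sketch does not attempt.

```python
from typing import Any

Record = dict[str, Any]

# Hypothetical map: universal field name -> provider-specific key.
FIELD_MAP = {"title": "display_name", "url": "landing_page", "abstract": "summary"}


def normalize_record(
    record: Record,
    field_map: dict[str, str],
    keep_api_specific_fields: bool = True,
    strip_annotations: bool = True,
) -> Record:
    """Rename provider-specific keys to universal names (missing fields become None)."""
    normalized: Record = {name: record.get(key) for name, key in field_map.items()}
    if keep_api_specific_fields:
        mapped_keys = set(field_map.values())
        for key, value in record.items():
            # Carry over unmapped provider fields without overwriting core fields.
            if key not in mapped_keys and key not in normalized:
                normalized[key] = value
    if strip_annotations:
        # Annotations are keys with a leading underscore, e.g. `_record_id`.
        normalized = {k: v for k, v in normalized.items() if not k.startswith("_")}
    return normalized


record = {
    "display_name": "A study",
    "landing_page": "https://example.org/1",
    "cited_by": 3,
    "_record_id": "r1",
}
full = normalize_record(record, FIELD_MAP)
core_only = normalize_record(record, FIELD_MAP, keep_api_specific_fields=False)
```

Here `full` keeps the provider-specific `cited_by` field next to the core fields, while `core_only` contains only the keys named in the map.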
- normalized_records: NormalizedRecordList | None
- parsed_response: Any | None
- process_metadata(metadata_map: ResponseMetadataMap | None = None, update_metadata: bool | None = None) MetadataType | None[source]
Uses a ResponseMetadataMap to process metadata for tertiary information on the response.
This method is a helper that is meant for primarily internal use for providing metadata information on the response where helpful and for informing users of the characteristics of the current response.
This method updates the ProcessedResponse.processed_metadata attribute when update_metadata=True. Unless update_metadata=False, it also updates the attribute when the current processed_metadata field is an empty dict or None.
- Parameters:
metadata_map (Optional[ResponseMetadataMap]) – A mapping that resolves API-specific metadata names to a universal parameter name.
update_metadata (Optional[bool]) – Determines whether the underlying processed_metadata field should be updated. If True, the processed_metadata field is updated in place. If None, the field is only updated when metadata fields have been successfully processed and the `processed_metadata` field is None.
- Returns:
The processed metadata returned as a dictionary when available. None otherwise.
- Return type:
Optional[MetadataType]
- processed_metadata: MetadataType | None
- processed_records: RecordList | None
- property record_count: int
The overall length of the processed data field as processed in the last step after filtering.
- property records_per_page: int | None
Returns the total number of results on the current page.
This method retrieves the records_per_page variable from the processed_metadata attribute, and if metadata hasn’t yet been processed, this method will then call process_metadata() manually to ensure that the field is available.
- resolve_extracted_record(processed_index: int) RecordType | None[source]
Resolve a processed record back to its original extracted record.
This method uses a two-phase resolution strategy with optional validation:
Primary: Direct index lookup via _extraction_index (fast, single access)
Validation: Verify _record_id matches
Fallback: Search by _record_id if index lookup fails or mismatches (scans all records)
- Parameters:
processed_index (int) – The index of the record in processed_records to resolve.
- Returns:
The original extracted record, or None if resolution fails.
- Return type:
Optional[RecordType]
Example
>>> from scholar_flux import SearchCoordinator, RecursiveDataProcessor
>>> coordinator = SearchCoordinator(
...     query='public health',
...     provider_name='openalex',
...     annotate_records=True,
...     processor=RecursiveDataProcessor()
... )
>>> response = coordinator.search(page=1)
>>> # Get processed (possibly flattened) record
>>> processed = response.processed_records[0]
>>> print(processed.get("authorships.author.display_name"))
# ['Kenneth L. Howard...']
>>> # Resolve to original nested structure
>>> original = response.resolve_extracted_record(0)
>>> print(original.get("authorships"))
>>> print(original.get("authorships")[0].keys())
# OUTPUT: dict_keys(['author_position', 'author', 'institutions', 'countries', 'is_corresponding', 'raw_author_name', 'raw_affiliation_strings', 'affiliations'])
Note
Resolution requires that records were extracted with annotate_records=True in the DataExtractor. Without annotation fields, this method returns None.
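The primary, validation, and fallback steps listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's code: `resolve_record` and the sample data are hypothetical, and only the `_extraction_index` and `_record_id` annotation names come from the documentation.

```python
from typing import Any, Optional

Record = dict[str, Any]


def resolve_record(processed: Record, extracted_records: list[Record]) -> Optional[Record]:
    """Resolve a processed record to its extracted source record, if possible."""
    record_id = processed.get("_record_id")
    index = processed.get("_extraction_index")
    if record_id is None and index is None:
        return None  # no annotations, so nothing to resolve with

    # Primary: direct index lookup (single access).
    if isinstance(index, int) and 0 <= index < len(extracted_records):
        candidate = extracted_records[index]
        # Validation: accept the candidate only if the IDs agree.
        if record_id is None or candidate.get("_record_id") == record_id:
            return candidate

    # Fallback: scan all records for a matching ID.
    if record_id is not None:
        for record in extracted_records:
            if record.get("_record_id") == record_id:
                return record
    return None


extracted = [
    {"_record_id": "a", "_extraction_index": 0, "authors": [{"name": "X"}]},
    {"_record_id": "b", "_extraction_index": 1, "authors": [{"name": "Y"}]},
]
# A processed record whose index went stale, e.g. after filtering or reordering.
stale = {"_record_id": "b", "_extraction_index": 0, "authors.name": ["Y"]}
resolved = resolve_record(stale, extracted)
```

The stale index points at record `a`, validation rejects it, and the fallback scan then finds record `b`.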
- strip_annotations(records: RecordType | RecordList | None = None) RecordList[source]
Convenience method that removes metadata annotations from a record list for clean export.
This method removes all metadata annotations (dictionary keys that are prefixed with an underscore) that were added during the record extraction step for pipeline traceability (e.g., _extraction_index, _record_id).
- Parameters:
records (RecordType | RecordList) – Records to strip. Defaults to processed_records if None.
- Returns:
New list of records with annotation fields removed.
- Return type:
RecordType | RecordList
Example
>>> import pandas as pd
>>> clean_data = response.strip_annotations()
>>> df = pd.DataFrame(clean_data) # No internal fields in DataFrame
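As a rough illustration of what removing annotations means, the sketch below drops every key that starts with an underscore. The `strip_underscore_keys` helper is hypothetical and is not the library's method; it only mirrors the behavior described above.

```python
from typing import Any

Record = dict[str, Any]


def strip_underscore_keys(records: list[Record]) -> list[Record]:
    """Return new records without keys that start with an underscore."""
    return [
        {key: value for key, value in record.items() if not str(key).startswith("_")}
        for record in records
    ]


annotated = [{"title": "A", "_record_id": "r1", "_extraction_index": 0}]
clean = strip_underscore_keys(annotated)
```

The input list is left untouched, so annotated records remain available for later resolution.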
- property total_query_hits: int | None
Returns the total number of results as reported by the API.
This method retrieves the total_query_hits variable from the processed_metadata attribute, and if metadata hasn’t yet been processed, this method will then call process_metadata() manually to ensure that the field is available.
- class scholar_flux.api.ProviderConfig(*, provider_name: Annotated[str, MinLen(min_length=1)], base_url: str, parameter_map: BaseAPIParameterMap, metadata_map: ResponseMetadataMap | None = None, field_map: BaseFieldMap | None = None, records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: Annotated[float, Ge(ge=0)] = 6.1, api_key_env_var: str | None = None, docs_url: str | None = None, display_name: Annotated[str, MinLen(min_length=1)] = '')[source]
Bases:
BaseModel
Config for creating the basic instructions and settings necessary to interact with new providers. Configs for the default providers are created on package initialization in the scholar_flux.api.providers submodule. A new, custom provider or override can be added to the provider_registry (a custom user dictionary) from the scholar_flux.api.providers module.
- Parameters:
provider_name (str) – The name of the provider to be associated with the config.
base_url (str) – The URL of the provider to send requests with the specified parameters.
parameter_map (BaseAPIParameterMap) – The parameter map indicating the specific semantics of the API.
metadata_map (Optional[ResponseMetadataMap]) – Defines the names of metadata fields used to distinguish response characteristics.
field_map (Optional[BaseFieldMap]) – A provider-specific field map that normalizes processed response records into a universal record structure.
records_per_page (int) – Generally the upper limit (for some APIs) or reasonable limit for the number of retrieved records per request (specific to the API provider).
request_delay (float) – Indicates exactly how many seconds to wait before sending successive requests. Note that the requested interval may vary based on the API provider.
api_key_env_var (Optional[str]) – Indicates the environment variable to look for if the API requires or accepts API keys.
docs_url (Optional[str]) – An optional URL that indicates where documentation related to the use of the API can be found.
- Example Usage:
>>> from scholar_flux.api import ProviderConfig, APIParameterMap, SearchAPI
>>> # Maps each of the individual parameters required to interact with the Guardian API
>>> parameters = APIParameterMap(query='q',
...                              start='page',
...                              records_per_page='page-size',
...                              api_key_parameter='api-key',
...                              auto_calculate_page=False,
...                              api_key_required=True)
>>> # creating the config object that holds the basic configuration necessary to interact with the API
>>> guardian_config = ProviderConfig(provider_name = 'GUARDIAN',
...                                  parameter_map = parameters,
...                                  base_url = 'https://content.guardianapis.com/search',
...                                  records_per_page=10,
...                                  api_key_env_var='GUARDIAN_API_KEY',
...                                  request_delay=6)
>>> api = SearchAPI.from_provider_config(query = 'economic welfare',
...                                      provider_config = guardian_config,
...                                      use_cache = True)
>>> assert api.provider_name == 'guardian'
>>> response = api.search(page = 1) # assumes that you have the GUARDIAN_API_KEY stored as an env variable
>>> assert response.ok
- api_key_env_var: str | None
- property api_key_required: bool
References the APIParameterMap to determine whether an API key is required.
- base_url: str
- display_name: str
- docs_url: str | None
- field_map: BaseFieldMap | None
- property map: BaseAPIParameterMap
Helper property that is an alias for the APIParameterMap attribute.
The APIParameterMap maps all universal parameters to the parameter names specific to the API provider.
- Returns:
The mapping that the current APIParameterConfig will use to build a dictionary of parameter requests specific to the current API.
- Return type:
BaseAPIParameterMap
- metadata_map: ResponseMetadataMap | None
- model_config: ClassVar[ConfigDict] = {'str_strip_whitespace': True}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- classmethod normalize_provider_name(v: str) str[source]
Helper method for normalizing the names of providers to a consistent structure.
- parameter_map: BaseAPIParameterMap
- classmethod prepare_fields(values: dict[str, Any]) dict[str, Any][source]
Model validator used to prepare fields for the ProviderConfig prior to further field validation.
- provider_name: str
- records_per_page: int
- request_delay: float
- search_config_defaults() dict[str, Any][source]
Convenience method for retrieving ProviderConfig fields as a dict. Useful for providing the missing information needed to create a SearchAPIConfig object for a provider when only the provider_name has been provided.
- Returns:
A dictionary containing the URL, name, records_per_page, and request_delay for the current provider.
- Return type:
dict
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method that shows the current structure of the ProviderConfig.
- class scholar_flux.api.ProviderRegistry(dict=None, /, **kwargs)[source]
Bases:
BaseProviderDict
The ProviderRegistry implementation allows the smooth and efficient retrieval of API parameter maps and default configuration settings to aid in the creation of a SearchAPI that is specific to the current API.
Note that the ProviderRegistry uses the ProviderConfig._normalize_name to ignore underscores and case-sensitivity.
- ProviderRegistry.from_defaults
Dynamically imports configurations stored within scholar_flux.api.providers, and fails gracefully if a provider’s module does not contain a ProviderConfig.
- ProviderRegistry.get
Resolves a provider name to its ProviderConfig if it exists in the registry.
- ProviderRegistry.get_from_url
Resolves a provider URL to its ProviderConfig if it exists in the registry.
- add(provider_config: ProviderConfig) None[source]
Helper method for adding a new provider to the provider registry.
- create(provider_name: str, **kwargs: Any) ProviderConfig[source]
Helper method that creates and registers a new ProviderConfig with the current provider registry.
- Parameters:
provider_name (str) – The name of the provider to create a new provider_config for.
**kwargs – Additional keyword arguments to pass to scholar_flux.api.models.ProviderConfig
- Returns:
The newly created provider configuration when possible.
- Return type:
ProviderConfig
- Raises:
APIParameterException – If an unexpected error occurs during the creation of a new ProviderConfig.
- classmethod from_defaults() ProviderRegistry[source]
Dynamically loads provider configurations from the scholar_flux.api.providers module.
This method specifically uses the provider_name of each provider listed within the scholar_flux.api.providers.provider_registry to lookup and return its ProviderConfig.
- Returns:
A new registry containing the loaded default provider configurations
- Return type:
ProviderRegistry
- get_display_name(provider_name: str, default: str | None = None) str | None[source]
Finds the human-readable name for a provider if it exists.
If the provider doesn’t exist within the registry, the result falls back to the default if available and None otherwise.
- Parameters:
provider_name (str) – The provider identifier to look up.
default (Optional[str]) – The name to fall back to. If not specified, None is returned instead.
- Returns:
The display name if the provider exists, otherwise the default is returned.
- Return type:
Optional[str]
- get_from_url(provider_url: str | None) ProviderConfig | None[source]
Attempt to retrieve a ProviderConfig instance for the given provider by resolving the provided URL to the provider’s base URL. Will not throw an error in the event that the provider does not exist.
- Parameters:
provider_url (Optional[str]) – URL of the provider to look up.
- Returns:
Instance configuration for the provider if it exists, else None
- Return type:
Optional[ProviderConfig]
- remove(provider_name: str) None[source]
Helper method for removing a provider configuration from the provider registry.
- resolve_config(provider_url: str | None = None, provider_name: str | None = None, verbose: bool = True) ProviderConfig | None[source]
Helper method to resolve mismatches between the URL and the provider_name when both are provided. The default behavior is to always prefer a provided provider_url over the provider_name to offer maximum flexibility.
- Parameters:
provider_url (Optional[str]) – The prospective URL associated with a provider configuration.
provider_name (Optional[str]) – The prospective name of the provider associated with a provider configuration.
verbose (bool) – Determines whether the origin of the configuration should be logged.
- Returns:
A provider configuration resolved with priority given to the base URL, or to the provider name otherwise. If neither the base URL nor the provider name resolves to a known provider, None is returned instead.
- Return type:
Optional[ProviderConfig]
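The URL-first resolution order can be illustrated with a small dictionary-backed registry. This is a sketch under assumptions: `REGISTRY`, `normalize_name`, `resolve`, and the example.org URLs are all hypothetical, and the real registry stores ProviderConfig objects and may match URLs differently.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical registry: normalized provider name -> base URL.
REGISTRY = {
    "alphasource": "https://api.example.org/alpha",
    "betasource": "https://api.example.org/beta",
}


def normalize_name(name: str) -> str:
    """Ignore underscores and letter case when comparing provider names."""
    return name.replace("_", "").lower()


def _url_key(url: str) -> tuple[str, str]:
    """Reduce a URL to its host and path so trailing slashes do not matter."""
    parsed = urlparse(url)
    return parsed.netloc.lower(), parsed.path.rstrip("/")


def resolve(provider_url: Optional[str] = None, provider_name: Optional[str] = None) -> Optional[str]:
    """Return the registry key, preferring a URL match over a name match."""
    if provider_url:
        target = _url_key(provider_url)
        for name, base_url in REGISTRY.items():
            if _url_key(base_url) == target:
                return name
    if provider_name:
        key = normalize_name(provider_name)
        if key in REGISTRY:
            return key
    return None


# The URL and the name disagree, so the URL decides the result.
by_url = resolve("https://api.example.org/beta/", "alpha_source")
by_name = resolve(provider_name="Alpha_Source")
```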
- class scholar_flux.api.RateLimiter(min_interval: int | float | None = None)[source]
Bases:
object
A basic rate limiter used to ensure that function calls (such as API requests) do not exceed a specified rate.
The RateLimiter is used within ScholarFlux to throttle the total number of requests that can be made within a defined time interval (measured in seconds).
This class ensures that calls to RateLimiter.wait() (or any decorated function) are spaced by at least min_interval seconds.
For multithreading applications, the RateLimiter is not thread-safe. Instead, the ThreadedRateLimiter subclass can provide a thread-safe implementation when required.
- Parameters:
min_interval (Optional[float | int]) – The minimum number of seconds that must elapse before another request is sent or another call is performed. If min_interval is not specified, the class attribute RateLimiter.DEFAULT_MIN_INTERVAL is assigned to RateLimiter.min_interval instead.
Examples
>>> import requests
>>> from scholar_flux.api import RateLimiter
>>> rate_limiter = RateLimiter(min_interval = 5)
>>> # The first call won't sleep, because a prior call using the rate limiter doesn't yet exist
>>> with rate_limiter:
...     response = requests.get("http://httpbin.org/get")
>>> # will sleep if 5 seconds since the last call hasn't elapsed.
>>> with rate_limiter:
...     response = requests.get("http://httpbin.org/get")
>>> # Or simply call the `wait` method directly:
>>> rate_limiter.wait()
>>> response = requests.get("http://httpbin.org/get")
Note
The class-level history deque is a design choice. This attribute allows class-level monitoring and introspection into how request delays are computed. The HistoryDeque is thread-safe (uses CPython on the backend) and allows global observability, which is helpful for debugging, especially in cases where you need to adjust the total number of requests sent within a given interval to avoid 429 errors.
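The minimum-interval behavior described for this class can be sketched with the standard library alone. The `MinIntervalLimiter` and `FakeClock` classes below are hypothetical stand-ins, not scholar_flux code. The clock and sleep functions are injected so the example runs instantly, and the history deque is deliberately left out.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Hypothetical limiter that spaces calls by at least `min_interval` seconds."""

    def __init__(
        self,
        min_interval: float = 6.1,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if min_interval < 0:
            raise ValueError("min_interval must be non-negative")
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep only for the time still owed since the last call; return the delay."""
        delay = 0.0
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_call = self._clock()
        return delay

    def __enter__(self) -> "MinIntervalLimiter":
        self.wait()
        return self

    def __exit__(self, *exc_info: object) -> None:
        return None


class FakeClock:
    """Test clock whose `sleep` advances time instead of blocking."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
limiter = MinIntervalLimiter(min_interval=5, clock=clock.time, sleep=clock.sleep)
first_delay = limiter.wait()   # no prior call, so no delay
clock.now += 2                 # pretend 2 seconds of work happened
second_delay = limiter.wait()  # only the remaining 3 seconds are slept
```

Using a monotonic clock by default avoids surprises when the system wall clock is adjusted between calls.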
- DEFAULT_MIN_INTERVAL: float | int = 6.1
- __init__(min_interval: int | float | None = None)[source]
Initializes the rate limiter with the min_interval argument.
- Parameters:
min_interval (Optional[float | int]) – Minimum number of seconds to wait before the next call is performed or request sent.
- default_min_interval() float | int[source]
Returns the default minimum interval for the current rate limiter.
- history: HistoryDeque[RateLimitEvent] = HistoryDeque([])
- property min_interval: float | int
The minimum number of seconds that must elapse before another request is sent or another action is taken.
- rate(min_interval: float | int, metadata: Dict[str, Any] | None = None) Iterator[Self][source]
Temporarily adjusts the minimum interval between function calls or requests when used with a context manager.
After the context manager exits, the original minimum interval value is then reassigned its previous value, and the time of the last call is recorded.
- Parameters:
min_interval (float | int) – Indicates the minimum interval to be temporarily used during the call
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
- Yields:
RateLimiter – The original rate limiter with a temporarily changed minimum interval
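The temporary adjustment described above follows a common context-manager pattern: store the current value, swap in the new one, and restore it in a `finally` block. The `Limiter` class below is a hypothetical minimal sketch of that pattern and omits the timing and history bookkeeping of the real class.

```python
from contextlib import contextmanager
from typing import Iterator


class Limiter:
    """Hypothetical holder for a minimum interval."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval

    @contextmanager
    def rate(self, min_interval: float) -> Iterator["Limiter"]:
        """Temporarily replace `min_interval`, restoring it even if an error occurs."""
        original = self.min_interval
        self.min_interval = min_interval
        try:
            yield self
        finally:
            self.min_interval = original


limiter = Limiter(min_interval=6.1)
with limiter.rate(0.5) as temporary:
    inside = temporary.min_interval
after = limiter.min_interval
```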
- classmethod resize_history(maxlen: int) None[source]
Resize the global history deque, preserving existing records up to the new limit.
- sleep(interval: int | float | None = None, metadata: Dict[str, Any] | None = None) None[source]
Simple instance-level implementation of sleep that can be overridden when needed.
- Parameters:
interval (Optional[float | int]) – The time interval to sleep. If None, the default minimum interval for the current rate limiter is used.
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
- Exceptions:
APIParameterException: Occurs if the value provided is either not an integer/float or is less than 0
- wait(min_interval: int | float | None = None, metadata: Dict[str, Any] | None = None) None[source]
Block (time.sleep) until at least min_interval has passed since last call.
This method can be used with the min_interval attribute to determine when a search was last sent and throttle requests to make sure rate limits aren’t exceeded. If not enough time has passed, the API will wait before sending the next request.
- Parameters:
min_interval (Optional[float | int]) – The minimum time to wait until another call is sent. Note that the min_interval attribute or argument must be non-null; otherwise, the default min_interval value is used.
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
- Exceptions:
APIParameterException: Occurs if the value provided is either not an integer/float or is less than 0
- wait_since(min_interval: int | float | None = None, timestamp: float | int | datetime | None = None, metadata: Dict[str, Any] | None = None) None[source]
Wait based on a reference timestamp or datetime.
- Parameters:
min_interval (Optional[float | int]) – Minimum interval to wait. Uses default if None.
timestamp (Optional[float | int | datetime]) – Reference time such as a Unix timestamp or datetime. If None, sleeps for min_interval.
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
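Waiting relative to a reference time comes down to subtracting the elapsed time from the interval. The sketch below shows that arithmetic under stated assumptions: `remaining_wait` is a hypothetical helper, and the current time is passed in explicitly so the result is deterministic.

```python
from datetime import datetime, timezone
from typing import Optional, Union


def remaining_wait(
    min_interval: float,
    timestamp: Optional[Union[float, int, datetime]],
    now: float,
) -> float:
    """Seconds still to wait so that `min_interval` has passed since `timestamp`."""
    if timestamp is None:
        return float(min_interval)  # no reference time: wait the full interval
    if isinstance(timestamp, datetime):
        reference = timestamp.timestamp()  # convert to a Unix timestamp
    else:
        reference = float(timestamp)
    return max(0.0, min_interval - (now - reference))


partial = remaining_wait(5, 100.0, now=102.0)
full = remaining_wait(5, None, now=102.0)
elapsed = remaining_wait(5, datetime.fromtimestamp(100, tz=timezone.utc), now=110.0)
```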
- class scholar_flux.api.ReconstructedResponse(status_code: int, reason: str, headers: MutableMapping[str, str], content: bytes, url: Any)[source]
Bases:
object
Core class for constructing minimal, universal response representations from responses and response-like objects.
The ReconstructedResponse implements several helpers that enable the reconstruction of response-like objects from different sources such as the requests, aiohttp, and httpx libraries.
The primary purpose of the ReconstructedResponse in scholar_flux is to create a minimal representation of a response when we need to construct a ProcessedResponse without an actual response and verify content fields.
In applications such as retrieving cached data from a scholar_flux.data_storage.DataCacheManager, if an original or cached response is not available, then a ReconstructedResponse is created from the cached response fields when available.
- Parameters:
status_code (int) – The integer code indicating the status of the response
reason (str) – Indicates the reasoning associated with the status of the response
headers (MutableMapping[str, str]) – Indicates metadata associated with the response (e.g. Content-Type, etc.)
content (bytes) – The content within the response
url (Any) – The URL from which the response was received
Note
The ReconstructedResponse.build factory method is recommended in cases when one property may contain the needed fields but may need to be processed and prepared first before being used. Examples include instances where one has text or json data instead of content, a reason_phrase field instead of reason, etc.
Example
>>> from scholar_flux.api.models import ReconstructedResponse
# build a response using a factory method that infers fields from existing ones when not directly specified
>>> response = ReconstructedResponse.build(status_code = 200, content = b"success", url = "https://google.com")
# check whether the current class follows a ResponseProtocol and contains valid fields
>>> assert response.is_response()
# OUTPUT: True
>>> response.validate() # raises an error if invalid
>>> response.raise_for_status() # no error for 200 status codes
>>> assert response.reason == 'OK' == response.status # inferred from the status_code attribute
- __init__(status_code: int, reason: str, headers: MutableMapping[str, str], content: bytes, url: Any) None
- asdict() dict[str, Any][source]
Converts the ReconstructedResponse into a dictionary containing attributes and their corresponding values.
This convenience method uses dataclasses.asdict() under the hood to convert a ReconstructedResponse to a dictionary consisting of key-value pairs.
- Returns:
A dictionary that maps the field names of a ReconstructedResponse instance to their assigned values.
- Return type:
dict[str, Any]
- classmethod build(response: object | None = None, **kwargs: Any) ReconstructedResponse[source]
Helper method for building a new ReconstructedResponse from a regular response object.
This classmethod can either construct a new ReconstructedResponse object from a response or response-like object or otherwise build a new ReconstructedResponse via its keyword parameters.
- Parameters:
response (Optional[object]) – A response or response-like object of unknown type or None.
**kwargs – The underlying components needed to construct a new response. Note that ideally, this set of key-value pairs would be specific only to the types expected by the ReconstructedResponse.
- Returns:
A minimal ReconstructedResponse object created from the received parameter set.
- Return type:
ReconstructedResponse
- content: bytes
- classmethod fields() list[str][source]
Retrieves a list containing the names of all fields associated with the ReconstructedResponse class.
- Returns:
A list containing the name of each attribute in the ReconstructedResponse.
- Return type:
list[str]
- classmethod from_keywords(**kwargs: Any) ReconstructedResponse[source]
Uses the provided keyword arguments to create a ReconstructedResponse.
- Parameters:
**kwargs –
The ReconstructedResponse keyword arguments to normalize. Possible keywords include:
status_code (int): The integer code indicating the status of the response
reason (str): Indicates the reasoning associated with the status of the response.
headers (MutableMapping[str, str]): Indicates metadata associated with the response (e.g. Content-Type)
content (bytes): The content within the response
url (Any): The URL from which the response was received
The keywords can alternatively be inferred from other common response fields:
content: [‘content’, ‘_content’, ‘text’, ‘json’]
headers: [‘headers’, ‘_headers’]
reason: [‘reason’, ‘status’, ‘reason_phrase’, ‘status_code’]
- Returns:
A newly reconstructed response from the given keyword components.
- Return type:
ReconstructedResponse
- headers: MutableMapping[str, str]
- is_response() bool[source]
Validates the fields of the minimally reconstructed response, indicating whether all fields are valid.
The fields that are validated include:
status codes (should be an integer)
URLs (should be a valid url)
reasons (should originate from a reason attribute or inferred from the status code)
content (should be a bytes field or encoded from a string text field)
headers (should be a dictionary with string fields and preferably a content type)
- Returns:
Indicates whether the current reconstructed response minimally recreates a response object.
- Return type:
bool
- json() dict[str, Any] | list[Any] | None[source]
Return JSON-decoded body from the underlying response, if available.
- property ok: bool
Indicates whether the current response indicates a successful request (200 <= status_code < 300).
To account for the nature of successful requests to APIs in academic pipelines, status codes from 300 to 399 are excluded.
- Returns:
True if the status code is an integer value within the range of 200 and 299, False otherwise.
- Return type:
bool
- classmethod prepare_response_fields(**kwargs: Any) dict[str, Any][source]
Extracts and prepares the fields required to reconstruct the response from the provided keyword arguments.
- Parameters:
status_code (int) – The integer code indicating the status of the response
reason (str) – Indicates the reasoning associated with the status of the response
headers (MutableMapping[str, str]) – Indicates metadata associated with the response (e.g. Content-Type)
content (bytes) – The content within the response
url (Any) – The URL from which the response was received
Some fields can be both provided directly or inferred from other similarly common fields:
content: [‘content’, ‘_content’, ‘text’, ‘json’]
headers: [‘headers’, ‘_headers’]
reason: [‘reason’, ‘status’, ‘reason_phrase’, ‘status_code’]
- Returns:
A dictionary containing the prepared response fields.
- Return type:
dict[str, Any]
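The inference rules listed above can be approximated with the standard library. This is a sketch under assumptions, not the library's logic: `prepare_fields` is a hypothetical helper, the order in which aliases are checked is a guess, and the reason phrase is looked up with `http.HTTPStatus`.

```python
import json
from http import HTTPStatus
from typing import Any


def prepare_fields(**kwargs: Any) -> dict[str, Any]:
    """Fill in `reason`, `content`, and `headers` from commonly used alternatives."""
    status_code = kwargs.get("status_code")

    # reason: use the first string alias, otherwise infer it from the status code.
    reason = next(
        (kwargs[key] for key in ("reason", "status", "reason_phrase") if isinstance(kwargs.get(key), str)),
        None,
    )
    if reason is None and isinstance(status_code, int):
        try:
            reason = HTTPStatus(status_code).phrase
        except ValueError:
            reason = None  # not a registered HTTP status code

    # content: prefer bytes, then encode `text`, then serialize `json`.
    content = kwargs.get("content", kwargs.get("_content"))
    if content is None and isinstance(kwargs.get("text"), str):
        content = kwargs["text"].encode("utf-8")
    if content is None and kwargs.get("json") is not None:
        content = json.dumps(kwargs["json"]).encode("utf-8")

    headers = kwargs.get("headers") or kwargs.get("_headers") or {}
    return {
        "status_code": status_code,
        "reason": reason,
        "headers": dict(headers),
        "content": content,
        "url": kwargs.get("url"),
    }


from_text = prepare_fields(status_code=200, text="success", url="https://example.org")
from_json = prepare_fields(status_code=404, json={"a": 1})
```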
- raise_for_status() None[source]
Verifies the status code for the current ReconstructedResponse, raising an error for failed responses.
This method follows a similar convention as requests and httpx response types, raising an error when encountering status codes that are indicative of failed responses.
As scholar_flux processes data that is generally only sent when status codes are between 200-299 (or exactly 200 [ok]), an error is raised when encountering a value outside of this range.
- Raises:
HTTPError – If the structure of the response is invalid or the status code is not within the range of 200-299.
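The 200-299 success window used by `ok` and `raise_for_status` can be written as a small predicate. The sketch below is illustrative only: `is_ok`, `check_status`, and the `StatusError` exception are hypothetical names, and the real method raises an HTTPError instead.

```python
class StatusError(Exception):
    """Hypothetical stand-in for the HTTPError raised by the real method."""


def is_ok(status_code: object) -> bool:
    """True only for integer status codes from 200 through 299."""
    return (
        isinstance(status_code, int)
        and not isinstance(status_code, bool)  # bool is a subclass of int
        and 200 <= status_code < 300
    )


def check_status(status_code: object) -> None:
    """Raise when the status code falls outside the success range."""
    if not is_ok(status_code):
        raise StatusError(f"Unsuccessful status code: {status_code!r}")


results = {code: is_ok(code) for code in (200, 204, 301, 404)}
```

Redirect codes (300-399) are treated as unsuccessful here, matching the exclusion described for the `ok` property.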
- reason: str
- property status: str | None
Helper property for retrieving a human-readable description of the status.
- Returns:
The status description associated with the response (if available).
- Return type:
Optional[str]
- status_code: int
- property text: str | None
Helper property for retrieving the text from the bytes content as a string.
- Returns:
The decoded text from the content of the response.
- Return type:
Optional[str]
- url: Any
- validate() None[source]
Convenience method for the validation of the current ReconstructedResponse.
If the response validation is successful, an InvalidResponseReconstructionException will not be raised.
- Raises:
InvalidResponseReconstructionException – If at least one field is determined to be invalid and unexpected of a true response object.
- class scholar_flux.api.ResponseCoordinator(parser: BaseDataParser, extractor: BaseDataExtractor, processor: ABCDataProcessor, cache_manager: DataCacheManager)[source]
Bases:
object
Coordinates the parsing, extraction, processing, and caching of API responses. The ResponseCoordinator operates on the concept of dependency injection to orchestrate the entire process.
Note that the overall composition of the coordinator is a governing factor in how the response is processed. The ResponseCoordinator uses a cache key and schema fingerprint to ensure that it is only returning a processed response from the cache storage if the structure of the coordinator at the time of cache storage has not changed.
To ensure that we’re not pulling from cache on significant changes to the ResponseCoordinator, we validate the schema by default using DEFAULT_VALIDATE_FINGERPRINT. When the schema changes, previously cached data is ignored, although this can be explicitly overridden during response handling.
The coordinator orchestration process operates mainly through the ResponseCoordinator.handle_response method that sequentially calls the parser, extractor, processor, and cache_manager.
Example workflow:
>>> from scholar_flux.api import SearchAPI, ResponseCoordinator
>>> api = SearchAPI(query = 'technological innovation', provider_name = 'crossref', user_agent = 'scholar_flux')
>>> response_coordinator = ResponseCoordinator.build() # uses defaults with caching in-memory
>>> response = api.search(page = 1) # future calls with the same structure will be cached
>>> processed_response = response_coordinator.handle_response(response, cache_key='tech-innovation-cache-key-page-1')
# the ProcessedResponse (or ErrorResponse) stores critical fields from the original and processed response
>>> processed_response
# OUTPUT: ProcessedResponse(len=20, cache_key='tech-innovation-cache-key-page-1', metadata=...)
>>> new_processed_response = response_coordinator.handle_response(processed_response, cache_key='tech-innovation-cache-key-page-1')
>>> new_processed_response
# OUTPUT: ProcessedResponse(len=20, cache_key='tech-innovation-cache-key-page-1', metadata=...)
Note that the entire process can be orchestrated via the SearchCoordinator that uses the SearchAPI and ResponseCoordinator as core dependency injected components:
>>> from scholar_flux import SearchCoordinator
>>> search_coordinator = SearchCoordinator(api, response_coordinator, cache_requests=True)
# uses a default cache key constructed from the response internally
>>> processed_response = search_coordinator.search(page = 1)
# OUTPUT: ProcessedResponse(len=20, cache_key='crossref_technological innovation_1_20', metadata=...)
>>> processed_response.content == new_processed_response.content
- Core Attributes:
parser (BaseDataParser): Parses raw API responses.
extractor (BaseDataExtractor): Extracts records and metadata.
processor (ABCDataProcessor): Processes extracted data.
cache_manager (DataCacheManager): Manages response cache.
- DEFAULT_VALIDATE_FINGERPRINT: bool = True
- __init__(parser: BaseDataParser, extractor: BaseDataExtractor, processor: ABCDataProcessor, cache_manager: DataCacheManager)[source]
Initializes a ResponseCoordinator with specified components for response parsing, processing, and caching.
- Parameters:
parser – (BaseDataParser): First step of the response processing pipeline: parses response records into a dictionary.
extractor – (BaseDataExtractor): Extracts both records and metadata from an API response separately for future processing steps.
processor – (ABCDataProcessor): Processes the list of dictionary-based records that were previously extracted from the APIResponse.
cache_manager – (DataCacheManager): Manages the processed record caching for faster response processing for identical responses.
- classmethod build(parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None, annotate_records: bool | None = None) ResponseCoordinator[source]
Factory method to build a ResponseCoordinator with sensible defaults.
- Parameters:
parser – (Optional[BaseDataParser]): First step of the response processing pipeline: parses response records into a dictionary.
extractor – (Optional[BaseDataExtractor]): Extracts both records and metadata from an API response separately for future processing steps.
processor – (Optional[ABCDataProcessor]): Processes the list of dictionary-based records that were previously extracted from the APIResponse.
cache_manager – (Optional[DataCacheManager]): Manages the processed record caching for faster response processing for identical responses.
cache_results – (Optional[bool]): Determines whether or not to cache processed responses: Enabled by default unless specified or if a cache manager is already provided.
annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).
- Returns:
A fully constructed coordinator.
- Return type:
ResponseCoordinator
- property cache: DataCacheManager
Alias for the response data processing cache manager.
Also allows direct access to the DataCacheManager from the ResponseCoordinator.
- property cache_manager: DataCacheManager
Allows direct access to the DataCacheManager from the ResponseCoordinator.
- classmethod configure_cache(cache_manager: DataCacheManager | None = None, cache_results: bool | None = None) DataCacheManager[source]
Helper method for building and swapping out cache managers depending on the cache chosen.
- Parameters:
cache_manager (Optional[DataCacheManager]) – An optional cache manager to use
cache_results (Optional[bool]) – Ground truth parameter, used to resolve whether to use caching when the cache_manager and cache_results contradict
- Returns:
An existing or newly created cache manager that can be used with the ResponseCoordinator
- Return type:
DataCacheManager
- property extractor: BaseDataExtractor
Allows direct access to the DataExtractor from the ResponseCoordinator.
- handle_response(response: Response | ResponseProtocol, cache_key: str | None = None, from_cache: bool = True, validate_fingerprint: bool | None = None, normalize_records: bool | None = None) ErrorResponse | ProcessedResponse[source]
Handles response data extraction, processing, and caching, retrieving response data from cache if available.
Once processed, the response data is transformed into a pydantic ProcessedResponse or ErrorResponse model that contains the response content, processing information, metadata, and/or error details when relevant.
- Parameters:
response (Response) – Raw API response.
cache_key (Optional[str]) – Cache key for storing/retrieving.
from_cache – (bool): Indicates whether the response data should be retrieved from cache if available.
validate_fingerprint – (Optional[bool]): Indicates whether cache should be invalidated if the ResponseCoordinator components are modified.
normalize_records (Optional[bool]) – Determines whether records should be normalized after processing.
- Returns:
A pydantic model containing the response data and detailed processing info.
- Return type:
ErrorResponse | ProcessedResponse
- handle_response_data(response: Response | ResponseProtocol, cache_key: str | None = None, **kwargs: Any) RecordList | None[source]
Retrieves the data from the processed response from cache if previously cached. Otherwise the data is retrieved after processing the response.
- Parameters:
response (Response | ResponseProtocol) – Raw API response.
cache_key (Optional[str]) – Cache key for storing/retrieving.
**kwargs – Additional keyword arguments to pass to ResponseCoordinator.handle_response.
- Returns:
Processed response data or None.
- Return type:
Optional[RecordList]
- property parser: BaseDataParser
Allows direct access to the data parser from the ResponseCoordinator.
- property processor: ABCDataProcessor
Allows direct access to the DataProcessor from the ResponseCoordinator.
- schema_fingerprint() str[source]
Helper method for generating a concise view of the current structure of the response coordinator.
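To make the role of the fingerprint in cache validation more concrete, here is a minimal standalone sketch. It is not the library's implementation: `fingerprint_sketch` and `TinyCache` are hypothetical, and hashing the `repr` of each component is only one assumed way to summarize the coordinator's structure.

```python
import hashlib
from typing import Optional


def fingerprint_sketch(*components: object) -> str:
    """Hypothetical fingerprint: hash the repr of each pipeline component."""
    joined = "|".join(repr(component) for component in components)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class TinyCache:
    """Hypothetical in-memory cache that stores a fingerprint with each entry."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[str, list]] = {}

    def put(self, key: str, fingerprint: str, records: list) -> None:
        self._store[key] = (fingerprint, records)

    def get(self, key: str, fingerprint: str, validate: bool = True) -> Optional[list]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_fingerprint, records = entry
        # A changed pipeline yields a different fingerprint, so the entry is ignored.
        if validate and stored_fingerprint != fingerprint:
            return None
        return records


cache = TinyCache()
old = fingerprint_sketch("JSONParser", "Extractor", "Processor(v1)")
new = fingerprint_sketch("JSONParser", "Extractor", "Processor(v2)")
cache.put("page-1", old, [{"title": "example"}])
```

With this setup, `cache.get("page-1", new)` returns `None` because the processor changed, while passing `validate=False` mirrors the explicit override mentioned in the class description.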
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for retrieving a string representation of the overall structure of the current ResponseCoordinator. The helper function, generate_repr_from_string helps produce human-readable representations of the core structure of the ResponseCoordinator.
- Parameters:
flatten (bool) – Whether to flatten the ResponseCoordinator’s structural representation into a single line.
show_value_attributes (bool) – Whether to show nested attributes of the components in the structure of the current ResponseCoordinator instance.
- Returns:
The structure of the current ResponseCoordinator as a string.
- Return type:
str
- summary() str[source]
Helper method for creating a quick summary representation of the structure of the ResponseCoordinator.
- classmethod update(response_coordinator: ResponseCoordinator, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None, annotate_records: bool | None = None) ResponseCoordinator[source]
Factory method to create a new ResponseCoordinator from an existing configuration.
- Parameters:
response_coordinator – (ResponseCoordinator): ResponseCoordinator containing the defaults to swap
parser – (Optional[BaseDataParser]): First step of the response processing pipeline - parses response records into a dictionary
extractor – (Optional[BaseDataExtractor]): Extracts both records and metadata from responses separately
processor – (Optional[ABCDataProcessor]): Processes API responses into list of dictionaries
cache_manager – (Optional[DataCacheManager]): Manages the caching of processed records for faster retrieval
cache_results – (Optional[bool]): Determines whether or not to cache processed responses - on by default unless specified or if a cache manager is already provided
annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).
- Returns:
A fully constructed coordinator.
- Return type:
ResponseCoordinator
- class scholar_flux.api.ResponseMetadataMap(*, total_query_hits: str | None = None, records_per_page: str | None = None)[source]
Bases:
BaseModel
Maps API-specific response metadata field names to common names.
This class enables extraction of metadata from API responses, primarily used for pagination decisions in multi-page searches. It extracts and processes fields from metadata dictionaries and supports nested field retrieval: a nested field is denoted by separating the keys along its path with periods (e.g. statistics.totalHits).
- Parameters:
total_query_hits – Field name containing the total number of results for a query (used to determine if more pages exist)
records_per_page – Field name indicating the number of records on the current page
Example
>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits")
>>> metadata = {"totalHits": 318942, "limit": 10}
>>> total = metadata_map.calculate_query_hits(metadata)
>>> print(total) # 318942
>>> # Used for pagination decisions
>>> current_page, records_per_page = 1, 10 # illustrative values
>>> has_more = total > (current_page * records_per_page)
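The period-delimited lookup and integer conversion that this class relies on can be sketched without the library. `get_nested` and `to_int` below are hypothetical helpers written for illustration; returning `None` for a missing path or an unconvertible value is an assumption modeled on the `Optional[int]` return types documented for this class.

```python
from collections.abc import Mapping
from typing import Any, Optional


def get_nested(metadata: Mapping, path: str) -> Any:
    """Hypothetical helper: walk a mapping along a period-delimited path."""
    current: Any = metadata
    for part in path.split("."):
        # Stop as soon as the path leaves the mapping structure.
        if not isinstance(current, Mapping) or part not in current:
            return None
        current = current[part]
    return current


def to_int(value: Any) -> Optional[int]:
    """Convert a value to int when possible, otherwise return None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


metadata = {"statistics": {"totalHits": "1500"}, "metadata": {"pageSize": "20"}}
total_hits = to_int(get_nested(metadata, "statistics.totalHits"))
page_size = to_int(get_nested(metadata, "metadata.pageSize"))
```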
- calculate_pages_remaining(page: int, total_query_hits: int | None = None, records_per_page: int | None = None, metadata: MetadataType | None = None) int | None[source]
Calculates the total number of pages yet to be queried using either metadata or direct integer fields.
- Parameters:
total_query_hits (Optional[int]) – Total number of record hits associated with a given query. If not specified, this is parsed from the metadata
records_per_page (Optional[int]) – Total number of records on the current page as an integer if available and convertible
metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)
- Returns:
The total number of pages that remain given the values total_query_hits and records_per_page
- Return type:
Optional[int]
Example
>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(
...     total_query_hits="statistics.totalHits", records_per_page="metadata.pageSize"
... )
>>> metadata = {"statistics": {"totalHits": "1500"},"metadata": {"pageSize": "20"}}
>>> total = metadata_map.calculate_pages_remaining(page = 74, metadata = metadata)
>>> print(total) # 1 (converted from string)
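The arithmetic behind the example above (1500 hits at 20 per page gives 75 pages, so one page remains after page 74) can be written out as a standalone sketch. `pages_remaining_sketch` is a hypothetical function; one-indexed pages, rounding the page count up, and clamping the result at zero are assumptions, not confirmed library behavior.

```python
import math
from typing import Optional


def pages_remaining_sketch(
    page: int,
    total_query_hits: Optional[int],
    records_per_page: Optional[int],
) -> Optional[int]:
    """Hypothetical helper: pages left after `page`, or None if inputs are missing."""
    if total_query_hits is None or records_per_page is None or records_per_page <= 0:
        return None
    # A partially filled final page still counts as a page, hence the ceiling.
    total_pages = math.ceil(total_query_hits / records_per_page)
    # Assumption: never report a negative number of remaining pages.
    return max(total_pages - page, 0)


remaining = pages_remaining_sketch(page=74, total_query_hits=1500, records_per_page=20)
```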
- calculate_query_hits(metadata: MetadataType) int | None[source]
Extract and convert total query hits from response metadata.
- Parameters:
metadata (MetadataType) – A mapping containing response metadata typically from ProcessedResponse.metadata
- Returns:
Total number of query hits as an integer if available and convertible, otherwise None
- Return type:
Optional[int]
Example
>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits")
>>> metadata = {"totalHits": "1500", "results": [...]}
>>> total = metadata_map.calculate_query_hits(metadata)
>>> print(total) # 1500 (converted from string)
- calculate_records_per_page(metadata: MetadataType) int | None[source]
Extract and convert the total number of records on the current page from response metadata.
- Parameters:
metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)
- Returns:
Total number of records on the current page as an integer if available and convertible, otherwise None
- Return type:
Optional[int]
Example
>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(records_per_page="pageSize")
>>> metadata = {"pageSize": "20", "results": [...]}
>>> total = metadata_map.calculate_records_per_page(metadata)
>>> print(total) # 20 (converted from string)
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- process_metadata(metadata: MetadataType) MetadataType[source]
Helper method for processing metadata after mapping relevant fields using the metadata schema.
- Parameters:
metadata (MetadataType) – A mapping containing response metadata (typically from ProcessedResponse.metadata)
- Returns:
A mapped dictionary of processed metadata fields.
- Return type:
MetadataType
Example
>>> from scholar_flux.api.models.response_metadata_map import ResponseMetadataMap
>>> metadata_map = ResponseMetadataMap(total_query_hits="totalHits", records_per_page="pageSize")
>>> metadata = {"totalHits": "1500","pageSize": "20", "results": [...]}
>>> metadata_map.process_metadata(metadata)
# OUTPUT: {"total_query_hits": 1500, "records_per_page": 20}
- records_per_page: str | None
- total_query_hits: str | None
- class scholar_flux.api.ResponseValidator[source]
Bases:
object
Helper class that serves as an initial response validation step to ensure that, in custom retry handling, the basic structure of a response can be validated to determine whether or not to retry the response retrieval process.
The ResponseValidator implements simple class methods that return a boolean: True when validation succeeds and False when a response or response-like object does not contain the required structure. These methods raise errors instead when they encounter non-response objects or when raise_on_error=True.
The ResponseValidator also contains helpers for the validation of both processed responses and responses that are reconstructed after storage and deserialization.
Example
>>> from scholar_flux.api import ResponseValidator, ReconstructedResponse
>>> mock_success_response = ReconstructedResponse.build(status_code = 200,
...     json = {'response': 'success'},
...     url = "https://an-example-url.com",
...     headers={'Content-Type': 'application/json'}
... )
>>> ResponseValidator.validate_response(mock_success_response) is True
>>> ResponseValidator.validate_content(mock_success_response) is True
- classmethod identify_invalid_fields(response: Response | ResponseProtocol) dict[str, Any][source]
Helper class method for identifying invalid fields within a response.
This method iteratively validates each field of the current response and collects the complete list of fields that are invalid.
If any invalid fields exist, the method returns a dictionary of each field and its corresponding value.
- Parameters:
response (requests.Response | ResponseProtocol) – A response or response-like object to check for the presence of invalid values.
- Returns:
A dictionary containing each invalid field as keys and their assigned values
- Return type:
dict[str, Any]
- classmethod identify_invalid_keywords(status_code: object | None = None, url: object | None = None, reason: object | None = None, content: object | None = None, headers: object | None = None) dict[str, object][source]
Validates response field keyword arguments, indicating those that contain invalid values.
- Parameters:
status_code (Optional[object]) – The status code to validate (expected: int 100-599).
url (Optional[object]) – The URL to validate (should be a valid url).
reason (Optional[object]) – The reason string to validate (should be a string).
content (Optional[object]) – The content to validate (should be a bytes field).
headers (Optional[object]) – The headers to validate (should be a mapping with string-typed keys).
- Returns:
A dictionary containing each invalid field as a key and its assigned value.
- Return type:
dict[str, object]
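The per-field expectations listed above can be approximated in a short standalone sketch. `invalid_keywords_sketch` is a hypothetical function, and two details are assumptions: a URL counts as valid only when it has both a scheme and a host, and a field left as `None` is reported as invalid.

```python
from collections.abc import Mapping
from urllib.parse import urlparse


def invalid_keywords_sketch(
    status_code: object = None,
    url: object = None,
    reason: object = None,
    content: object = None,
    headers: object = None,
) -> dict[str, object]:
    """Hypothetical helper: return only the fields that fail their check."""
    values = {
        "status_code": status_code,
        "url": url,
        "reason": reason,
        "content": content,
        "headers": headers,
    }
    parsed = urlparse(url) if isinstance(url, str) else None
    checks = {
        # bool is a subclass of int, so it is excluded explicitly.
        "status_code": isinstance(status_code, int)
        and not isinstance(status_code, bool)
        and 100 <= status_code <= 599,
        # Assumption: a valid URL has both a scheme and a network location.
        "url": parsed is not None and bool(parsed.scheme and parsed.netloc),
        "reason": isinstance(reason, str),
        "content": isinstance(content, bytes),
        "headers": isinstance(headers, Mapping)
        and all(isinstance(key, str) for key in headers),
    }
    return {name: values[name] for name, valid in checks.items() if not valid}


invalid = invalid_keywords_sketch(
    status_code=999,
    url="https://example.com",
    reason="OK",
    content="not bytes",
    headers={"Content-Type": "application/json"},
)
```

In this call only `status_code` and `content` are returned, since 999 is outside 100-599 and the content is a `str` rather than `bytes`.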
- classmethod is_valid_content(content: object) TypeGuard[bytes][source]
Validates whether content is a valid bytes object.
- classmethod is_valid_headers(headers: object) TypeGuard[Mapping[str, str]][source]
Validates whether headers is a dict containing string-typed keys/values.
- classmethod is_valid_reason(reason: object) TypeGuard[str][source]
Validates whether reason is a valid string.
- classmethod is_valid_response_structure(response: object) TypeGuard[ResponseProtocol][source]
Validates whether each of the core components of a response are populated with the correct response types.
The following properties that refer back to the original response should be available:
status_code: (int)
reason: string
headers: dictionary
content: bytes
url: string or URL-like field
- Parameters:
response (object) – An object to evaluate as a response or response-like object.
- Returns:
True if all core response fields are valid, False otherwise.
- Return type:
TypeGuard[ResponseProtocol]
- classmethod is_valid_status_code(status_code: object) TypeGuard[int][source]
Validates whether the status_code is a valid integer between 100-599.
- classmethod is_valid_url(url: object) TypeGuard[str][source]
Validates whether the provided value is a valid URL.
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method that shows the current structure of the ResponseValidator class in a string format. This method will show the name of the current class along with its attributes (ResponseValidator())
- Returns:
A string representation of the current structure of the ResponseValidator
- Return type:
str
- classmethod validate_content(response: Response | ResponseProtocol, expected_format: str = 'application/json', *, raise_on_error: bool = False) bool[source]
Validates the response content type.
- Parameters:
response (requests.Response | ResponseProtocol) – The HTTP response or response-like object to check.
expected_format (str) – The expected content type substring (e.g., “application/json”).
raise_on_error (bool) – If True, raises InvalidResponseException on mismatch.
- Returns:
True if the content type matches, False otherwise.
- Return type:
bool
- Raises:
InvalidResponseException – If the content type does not match and raise_on_error is True.
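Because `expected_format` is described as a content type substring, the check can be pictured as a case-insensitive containment test on the Content-Type header. The sketch below is standalone; `content_type_matches` is a hypothetical function, and the case-insensitive header lookup is an assumption.

```python
from collections.abc import Mapping


def content_type_matches(headers: Mapping, expected_format: str = "application/json") -> bool:
    """Hypothetical helper: True when Content-Type contains the expected substring."""
    content_type = ""
    for key, value in headers.items():
        # Header names are matched case-insensitively in this sketch.
        if str(key).lower() == "content-type":
            content_type = str(value)
            break
    return expected_format.lower() in content_type.lower()


is_json = content_type_matches({"Content-Type": "application/json; charset=utf-8"})
```

A substring test lets values that carry parameters, such as a charset, still match the expected format.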
- classmethod validate_response(response: Response | ResponseProtocol, *, raise_on_error: bool = False) bool[source]
Validates HTTP responses by first verifying whether the object is a Response or follows a ResponseProtocol. For valid response or response-like objects, the status code is verified, returning False for 400 and 500 level validation errors when raise_on_error=False. If raise_on_error is set to True, an error is raised instead.
Note that a ResponseProtocol duck-types and verifies that each of a minimal set of attributes and/or properties can be found within the current response.
In the scholar_flux retrieval step, this validator verifies that the response received is a valid response.
- Parameters:
response – (requests.Response | ResponseProtocol): The HTTP response object to validate
raise_on_error (bool) – If True, raises InvalidResponseException on error for invalid response status codes
- Returns:
True if valid, False otherwise
- Raises:
InvalidResponseException – If response is invalid and raise_on_error is True
RequestFailedException – If an exception occurs during response validation due to missing or incorrect types
- classmethod validate_response_like(response: object) TypeGuard[Response | ResponseProtocol][source]
Validates that an object is a response or a duck typed ResponseProtocol, raising an error if invalid.
- Parameters:
response (object) – An object to verify as a response or response-like object
- Returns:
True when the received object is a requests.Response or a ResponseProtocol.
- Return type:
TypeGuard[requests.Response | ResponseProtocol]
- Raises:
InvalidResponseStructureException – Raised when the object is not a response-like object.
- classmethod validate_response_structure(response: Response | ResponseProtocol, raise_on_error: bool = True) TypeGuard[Response | ResponseProtocol][source]
Raises an error if a response object does not contain valid properties expected of a response. If the response validation is successful, True is returned, indicating that the value is a valid ResponseLike object.
- Parameters:
response (requests.Response | ResponseProtocol) – The response or response-like object to validate.
raise_on_error (bool) – Flag indicating whether an InvalidResponseStructureException should be raised for objects with invalid structures (True by default).
- Returns:
True when the received object is a requests.Response or a ResponseProtocol.
- Return type:
TypeGuard[requests.Response | ResponseProtocol]
- Raises:
InvalidResponseStructureException – Raised when the object is not a response-like object or if at least one field is determined to be invalid and unexpected of a response-like object.
- class scholar_flux.api.RetryHandler(max_retries: int = 3, backoff_factor: float = 0.5, max_backoff: int | float = 120, retry_statuses: set[int] | list[int] | None = None, raise_on_error: bool | None = None, min_retry_delay: int | float | None = None)[source]
Bases:
object
Core class used to send and dynamically retry failed requests with exponential backoff.
The RetryHandler automatically handles HTTP errors (429, 500, 501, 502, 503, 504) by retrying failed requests with increasing delays between attempts. Additional status codes can be added to the retry set via RetryHandler.DEFAULT_RETRY_STATUSES.add(<status_code>).
- Features:
Exponential backoff with configurable parameters
Respects Retry-After headers when provided
Thread-safe history tracking of all retry attempts
Configurable maximum retries and timeout limits
Example
>>> from scholar_flux import SearchCoordinator
>>> coordinator = SearchCoordinator(query="nutrition", provider_name="plos")
>>> # Configure retry behavior
>>> coordinator.retry_handler.max_retries = 5
>>> coordinator.retry_handler.backoff_factor = 1.0
>>> # The history is stored at the class level
>>> coordinator.retry_handler.history.clear_history()
>>> # Execute search with automatic retries
>>> result = coordinator.search_page(page=1)
>>> # Access retry statistics
>>> print(f"Retry attempts: {len(coordinator.retry_handler.history)}")
- max_retries
Maximum number of retry attempts (default: 3)
- Type:
int
- backoff_factor
The multiplier used for exponential backoff (default: 0.5)
- Type:
float
- max_backoff
Maximum delay between retries in seconds (default: 120). Also enforced as a hard ceiling for server-requested delays via Retry-After headers.
- Type:
float
- retry_statuses
HTTP status codes that trigger retries (default: {429, 500, 501, 502, 503, 504})
- Type:
set
- history
Thread-safe storage of all retry attempts
- Type:
HistoryDeque
Note
The retry handler is automatically used by SearchCoordinator for all requests. Each parameter is adjusted dynamically based on the provider. No manual intervention is required for basic usage.
If too many requests are sent to a single server within a specific time interval, the server may return a 429 Too Many Requests error and indicate the delay that should be respected before sending another request. If that delay exceeds max_backoff and the class attribute RAISE_ON_DELAY_EXCEEDED is True (default), a RetryAfterDelayExceededException is raised. To avoid the exception, either set the max_backoff parameter to a value that accommodates the requested delay, or set RetryHandler.RAISE_ON_DELAY_EXCEEDED=False to wait the full interval upon receiving a Retry-After header.
For observability, request information, delays, and response statuses are recorded in the RetryHandler.history class attribute for later inspection and can be referenced to help modify the rate limiting configuration when needed.
- DEFAULT_RAISE_ON_ERROR = False
- DEFAULT_RETRY_AFTER_HEADERS = ('retry-after', 'x-ratelimit-retry-after')
- DEFAULT_RETRY_STATUSES = {429, 500, 501, 502, 503, 504}
- DEFAULT_VALID_STATUSES = {200}
- RAISE_ON_DELAY_EXCEEDED: bool = True
- __init__(max_retries: int = 3, backoff_factor: float = 0.5, max_backoff: int | float = 120, retry_statuses: set[int] | list[int] | None = None, raise_on_error: bool | None = None, min_retry_delay: int | float | None = None) None[source]
Initializes the RetryHandler with configurable parameters for dynamically throttling successive requests.
- Parameters:
max_retries (int) – Indicates how many attempts should be performed before halting retries at retrieving a valid response.
backoff_factor (float) – Indicates the factor used to adjust when the next request should be attempted based on past unsuccessful attempts.
max_backoff (int | float) – Describes the maximum number of seconds to wait before submitting the next request.
retry_statuses (Optional[set[int]]) – Indicates the full list of status codes that should be retried if encountered.
raise_on_error (Optional[bool]) – A flag that indicates whether or not to raise an error upon encountering an invalid status_code or exception.
min_retry_delay (Optional[int | float]) – The minimum delay in seconds between requests.
Note
The class-level history deque is a design choice: while RateLimiter instances are designed to be stateless, this attribute enables class-level monitoring and allows introspection into how request delays are computed. The HistoryDeque is thread-safe (uses cpython on the backend) and allows global observability, which is helpful for debugging, especially in cases where you need to adjust the total number of requests sent within a given interval to avoid 429 errors.
- calculate_retry_delay(attempt_count: int, response: Response | ResponseProtocol | None = None, min_retry_delay: int | float | None = None, backoff_factor: int | float | None = None, max_backoff: int | float | None = None) int | float[source]
Calculates the delay in seconds to wait before the next retry attempt.
- Parameters:
attempt_count (int) – The number of attempts made so far.
response (Optional[requests.Response | ResponseProtocol]) – The response object from the last attempt.
min_retry_delay (Optional[int | float]) – The minimum delay in seconds between requests.
backoff_factor (Optional[int | float]) – The factor used to adjust the delay.
max_backoff (Optional[int | float]) – The maximum delay in seconds between requests.
- Returns:
The delay in seconds for the next retry attempt.
- Return type:
int | float
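A common way to combine these parameters is sketched below as a standalone function. The exact formula used by the RetryHandler is not stated in this reference, so `backoff_factor * 2 ** attempt_count` is an assumption, as is letting a server-provided Retry-After value (passed here as the hypothetical `retry_after` argument) take precedence over the computed delay.

```python
from typing import Optional


def retry_delay_sketch(
    attempt_count: int,
    backoff_factor: float = 0.5,
    max_backoff: float = 120,
    min_retry_delay: float = 0,
    retry_after: Optional[float] = None,
) -> float:
    """Hypothetical helper: exponential backoff bounded by a floor and a ceiling."""
    if retry_after is not None:
        # Assumption: a server-requested delay overrides the computed backoff.
        return max(retry_after, min_retry_delay)
    # Assumed formula: the delay doubles with each attempt.
    delay = backoff_factor * (2 ** attempt_count)
    # Apply the floor first, then cap at max_backoff.
    return min(max(delay, min_retry_delay), max_backoff)


delays = [retry_delay_sketch(attempt) for attempt in range(4)]
```

With the defaults shown, the first four computed delays double from half a second upward, and later attempts are capped at `max_backoff`.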
- delay_exceeds_max_backoff(delay: int | float | None, max_backoff: int | float | None = None, *, error_message: str | None = None, warning_message: str | None = None, response: Response | ResponseProtocol | None = None, verbose: bool = True) bool[source]
Helper method for identifying and handling scenarios where an API-requested delay exceeds max_backoff.
This method centralizes the logic for identifying and handling delays that exceed the user-defined maximum duration to wait between requests. The RetryHandler is structured to cap calculated wait times using the max_backoff attribute, but Retry-After fields are the one scenario where API-mandated request delays can exceed max_backoff.
This helper method is designed to:
Raise an exception when delay > max_backoff and RetryHandler.RAISE_ON_DELAY_EXCEEDED is True
Log a warning message and return True (indicating an excessive delay) when delay > max_backoff and RetryHandler.RAISE_ON_DELAY_EXCEEDED is False
Return False when delay is None or delay <= max_backoff
- Parameters:
delay (Optional[int | float]) – The delay in seconds to verify against the max_backoff.
max_backoff (Optional[int | float]) – The maximum allowable delay. Defaults to self.max_backoff if not provided.
error_message (Optional[str]) – A custom message to provide to the RetryAfterDelayExceededException. If None, this method raises the default error message indicating the server-requested delay.
warning_message (Optional[str]) – A custom message logged when RAISE_ON_DELAY_EXCEEDED is False. If None, this method logs a warning to indicate that the RetryHandler will otherwise wait the full duration.
response (Optional[requests.Response | ResponseProtocol]) – The response object to add as additional context to the raised exception.
verbose (bool) – A flag for logging a default/custom warning when the delay exceeds the maximum duration and the RAISE_ON_DELAY_EXCEEDED flag is false.
- Returns:
True when the delay exceeds the maximum allowable delay and False otherwise.
- Return type:
bool
- Raises:
RetryAfterDelayExceededException – When the API-requested delay exceeds the max_backoff and the RAISE_ON_DELAY_EXCEEDED flag is True.
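The three outcomes described for this helper can be sketched as a standalone function. `delay_exceeds_sketch` and `DelayExceeded` are hypothetical stand-ins for the method and its exception, and the `raise_on_exceeded` argument plays the role of the RAISE_ON_DELAY_EXCEEDED class attribute.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)


class DelayExceeded(Exception):
    """Stand-in for RetryAfterDelayExceededException in this sketch."""


def delay_exceeds_sketch(
    delay: Optional[float],
    max_backoff: float,
    raise_on_exceeded: bool = True,
    verbose: bool = True,
) -> bool:
    """Hypothetical helper: report or raise when a requested delay is too long."""
    # Case 1: nothing to check, or the delay is within bounds.
    if delay is None or delay <= max_backoff:
        return False
    # Case 2: the delay is excessive and the caller wants a hard failure.
    if raise_on_exceeded:
        raise DelayExceeded(f"Requested delay {delay}s exceeds max_backoff {max_backoff}s")
    # Case 3: the delay is excessive but tolerated, so warn and report it.
    if verbose:
        logger.warning("Delay %ss exceeds max_backoff %ss; waiting the full duration", delay, max_backoff)
    return True


tolerated = delay_exceeds_sketch(300, 120, raise_on_exceeded=False)
```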
- execute_with_retry(request_func: Callable[[...], ResponseLike], validator_func: Callable | None = None, sleep_func: Callable[[float], None] | None = None, *args: Any, backoff_factor: int | float | None = None, max_backoff: int | float | None = None, min_retry_delay: int | float | None = None, **kwargs: Any) ResponseLike | None[source]
Sends a request and retries on failure based on predefined criteria and validation function.
- Parameters:
request_func (Callable) – The function to send the request.
validator_func (Optional[Callable]) – A function that takes a response and returns True if valid.
sleep_func (Optional[Callable[[float], None]]) – An optional function used for blocking the next request until a specified duration has passed.
*args – Positional arguments for the request function.
backoff_factor (Optional[int | float]) – The factor used to adjust when the next request should be attempted, based on past unsuccessful attempts.
max_backoff (Optional[int | float]) – Describes the maximum number of seconds to wait before submitting the next request.
min_retry_delay (Optional[int | float]) – The minimum delay in seconds between requests.
**kwargs – Arbitrary keyword arguments for the request function.
- Returns:
The returned response-like object, when successful, or None if no valid response was obtained.
- Return type:
Optional[requests.Response | ResponseProtocol]
- Raises:
RequestFailedException – When a request raises an exception for whatever reason.
TimeoutError – When a request times out during response retrieval.
InvalidResponseException – When the number of retries has been exceeded and self.raise_on_error is True.
RetryAfterDelayExceededException – When the Retry-After delay requested from the server exceeds max_backoff
Note
If a Retry-After header exceeds the max_backoff and RetryHandler.RAISE_ON_DELAY_EXCEEDED=True, the exception will be raised immediately and halt the series of retry attempts.
Also note that the raw response can be recovered from a handled InvalidResponseException or RetryAfterDelayExceededException: catch the exception in a try/except block and read its response attribute:
Example
>>> from scholar_flux.api.rate_limiting.retry_handler import RetryHandler
>>> from scholar_flux.exceptions import InvalidResponseException, RetryAfterDelayExceededException
>>> import requests
>>> retry_handler = RetryHandler(raise_on_error=True)
>>> try:
...     response = retry_handler.execute_with_retry(requests.get, url="https://httpbin.org/status/200")
... except (RetryAfterDelayExceededException, InvalidResponseException) as e:
...     response = e.response
>>> print(response)
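To make the roles of validator_func, sleep_func, backoff_factor, max_backoff, and min_retry_delay concrete, here is a self-contained sketch of a retry loop. It is not the RetryHandler implementation: `retry_call`, `max_retries`, and the exponential delay formula are assumptions chosen for illustration, and the source does not state the actual schedule.

```python
from typing import Callable, List, Optional


def retry_call(
    request_func: Callable[[], object],
    validator_func: Callable[[object], bool],
    sleep_func: Callable[[float], None],
    max_retries: int = 3,
    backoff_factor: float = 0.5,
    max_backoff: float = 120.0,
    min_retry_delay: float = 0.0,
) -> Optional[object]:
    """Call request_func until validator_func accepts the result or retries run out."""
    for attempt in range(max_retries + 1):
        response = request_func()
        if validator_func(response):
            return response
        if attempt < max_retries:
            # Assumed schedule: exponential growth clamped to [min_retry_delay, max_backoff].
            delay = min(max_backoff, max(min_retry_delay, backoff_factor * 2**attempt))
            sleep_func(delay)
    return None


# Usage: simulate two failing status codes followed by a success.
# Passing list.append as sleep_func records the delays instead of blocking.
responses = iter([500, 503, 200])
waits: List[float] = []
result = retry_call(lambda: next(responses), lambda r: r == 200, waits.append)
```

Injecting the sleep function, as execute_with_retry allows through sleep_func, keeps the loop testable without real waiting.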
- classmethod extract_retry_after(headers: Mapping[str, Any] | None, keys: tuple | None = None) str | None[source]
Extracts the retry-after field from dictionary headers if the field exists.
- Parameters:
headers (Optional[Mapping[str, Any]]) – A headers dictionary or mapping to extract the retry-after field from
keys (Optional[tuple]) – The keys to look for in the headers. (case insensitive)
- Returns:
The retry-after field value, or None if not present.
- Return type:
Optional[str]
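A case-insensitive header lookup of this kind might look like the sketch below. The function name `find_retry_after` and the default `keys` tuple are assumptions for illustration. The library's own defaults are not shown in this reference.

```python
from typing import Any, Mapping, Optional, Tuple


def find_retry_after(
    headers: Optional[Mapping[str, Any]],
    keys: Tuple[str, ...] = ("retry-after",),  # assumed default
) -> Optional[str]:
    """Return the first header value whose name matches one of keys, ignoring case."""
    if not headers:
        return None
    wanted = {key.lower() for key in keys}
    for name, value in headers.items():
        if str(name).lower() in wanted and value is not None:
            # Header values are returned unparsed, as text.
            return str(value)
    return None
```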
- classmethod extract_retry_after_from_response(response: Response | ResponseProtocol | None) str | None[source]
Extracts and parses retry-after delay from any response type.
This method handles both raw responses (Response/ResponseProtocol) and processed responses (ProcessedResponse/ErrorResponse), making it the single entry point for retry-after extraction.
- Parameters:
response (Optional[requests.Response | ResponseProtocol]) – Any response object with headers
- Returns:
The raw (unparsed) retry-after header value, or None if not present.
- Return type:
Optional[str]
- classmethod get_retry_after(response: Response | ResponseProtocol | None) int | float | None[source]
Calculates the time that must elapse before the next request is sent according to the headers.
- Parameters:
response (requests.Response | ResponseProtocol) – The response object from the last attempt.
- Returns:
Indicates the number of seconds that must elapse before the next request is sent.
- Return type:
Optional[int | float]
- history: HistoryDeque[RetryAttempt] = HistoryDeque([])
- log_retry_attempt(delay: float, status_code: int | None = None) None[source]
Log an attempt to retry a request.
- Parameters:
delay (float) – The delay in seconds before the next retry attempt.
status_code (Optional[int]) – The status code of the response that triggered the retry.
- static log_retry_warning(message: str) None[source]
Log a warning when retries are exhausted or an error occurs.
- Parameters:
message (str) – The warning message to log.
- classmethod parse_retry_after(retry_after: str | None) int | float | None[source]
Parse the ‘Retry-After’ header to calculate delay.
- Parameters:
retry_after (Optional[str]) – The value of ‘Retry-After’ header.
- Returns:
The total delay in seconds parsed from the response field if available.
- Return type:
Optional[int | float]
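A Retry-After header may carry either a number of seconds or an HTTP date, so a parser has to handle both forms. The sketch below shows one way to do that with the standard library. `parse_retry_after_value` and its `now` argument are hypothetical names, and the library's own parsing rules may differ.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional, Union


def parse_retry_after_value(
    retry_after: Optional[str],
    now: Optional[datetime] = None,
) -> Optional[Union[int, float]]:
    """Convert a Retry-After value (delta-seconds or HTTP date) into seconds."""
    if retry_after is None:
        return None
    text = retry_after.strip()
    if text.isdigit():
        # Delta-seconds form, e.g. "120".
        return int(text)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        target = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        # Assumption: treat dates without an offset as UTC.
        target = target.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    # Dates in the past yield a delay of zero rather than a negative number.
    return max(0.0, (target - current).total_seconds())
```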
- classmethod resize_history(maxlen: int) None[source]
Resize the global history deque, preserving existing records up to the new limit.
- Parameters:
maxlen (int) – The new maximum length of the history deque.
- should_retry(response: Response | ResponseProtocol) bool[source]
Determine whether the request should be retried.
- class scholar_flux.api.SearchAPI(query: str, provider_name: str | None = None, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, use_cache: bool | None = None, base_url: str | None = None, api_key: SecretStr | str | None = None, records_per_page: int = 20, request_delay: float | None = None, **api_specific_parameters: Any)[source]
Bases: BaseAPI
The core interface that handles the retrieval of JSON, XML, and YAML content from the scholarly API sources offered by several providers such as Springer Nature, PLOS, and PubMed. The SearchAPI is structured to allow flexibility without complexity in initialization. API clients can be either constructed piece-by-piece or with sensible defaults for session-based retrieval, API key management, caching, and configuration options.
This class is integrated into the SearchCoordinator as a core component of a pipeline that further parses the response, extracts records and metadata, and caches the processed records to facilitate downstream tasks such as research, summarization, and data mining.
Examples
>>> from scholar_flux.api import SearchAPI
# creating a basic API that uses PLOS as the default while caching data in-memory:
>>> api = SearchAPI(query = 'machine learning', provider_name = 'plos', use_cache = True)
# retrieve a basic request:
>>> response_page_1 = api.search(page = 1)
>>> assert response_page_1.ok
>>> response_page_1
# OUTPUT: <Response [200]>
>>> ml_page_1 = response_page_1.json()
# future requests automatically wait until the specified request delay passes to send another request:
>>> response_page_2 = api.search(page = 2)
>>> assert response_page_2.ok
>>> response_page_2
# OUTPUT: <Response [200]>
>>> ml_page_2 = response_page_2.json()
- DEFAULT_URL: str = 'https://api.plos.org/search'
- __init__(query: str, provider_name: str | None = None, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, use_cache: bool | None = None, base_url: str | None = None, api_key: SecretStr | str | None = None, records_per_page: int = 20, request_delay: float | None = None, **api_specific_parameters: Any) None[source]
Initializes the SearchAPI with a query and optional parameters. The absolute bare minimum for interacting with APIs requires a query, a base_url, and an APIParameterConfig that associates relevant fields (query, records_per_page, etc.) with fields that are specific to each API provider.
- Parameters:
query (str) – The search keyword or query string.
provider_name (Optional[str]) – The name of the API provider where requests will be sent. If a provider_name and base_url are both given, the SearchAPIConfig will prioritize base_urls over the provider_name.
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – A config that uses a parameter map attribute under the hood to build the parameters necessary to interact with an API. For convenience, an APIParameterMap can be provided in place of an APIParameterConfig, and the conversion will take place under the hood.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session. A new session is created if not specified.
user_agent (Optional[str]) – Optional user-agent string for the session.
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError.
masker (Optional[SensitiveDataMasker]) – Used for filtering potentially sensitive information from logs (API keys, auth bearers, emails, etc.).
use_cache (bool) – Indicates whether or not to create a cached session. If a cached session is already specified, this setting will have no effect on the creation of a session.
base_url (str) – The base URL for the article API.
api_key (Optional[str | SecretStr]) – API key if required.
records_per_page (int) – Number of records to fetch per page (1-100).
request_delay (Optional[float]) – Minimum delay between requests in seconds. If not specified, the SearchAPI uses the provider-specific override when one exists and otherwise falls back to the default request delay defined in the SearchAPIConfig (6.1 seconds).
**api_specific_parameters – Additional parameter-value pairs to be provided to the SearchAPIConfig class. API-specific parameters include:
- mailto (Optional[str | SecretStr]): Crossref: an optional contact for feedback on API usage.
- db (str): PubMed: the database to retrieve data from (example: db=pubmed).
- property api_key: SecretStr | None
Retrieves the current value of the API key from the SearchAPIConfig as a SecretStr.
Note that the API key is stored as a secret key when available. The value of the API key can be retrieved by using the api_key.get_secret_value() method.
- Returns:
A secret string of the API key if it exists
- Return type:
Optional[SecretStr]
- property api_specific_parameters: dict[str, APISpecificParameter]
This property pulls additional parameters corresponding to the API from the configuration of the current API instance.
- Returns:
A dictionary of all parameters specific to the current API.
- Return type:
dict[str, APISpecificParameter]
- property base_url: str
Corresponds to the base URL of the current API.
- Returns:
The base URL corresponding to the API Provider
- build_parameters(page: int, additional_parameters: dict[str, Any] | None = None, **api_specific_parameters: Any) dict[str, Any][source]
Constructs the request parameters for the API call, using the provided APIParameterConfig and its associated APIParameterMap. This method maps standard fields (query, page, records_per_page, api_key, etc.) to the provider-specific parameter names.
Using additional_parameters, an arbitrary set of parameter key-value pairs can be added to the request to further customize or override the parameter settings sent to the API. additional_parameters is offered as a convenience in case an API uses additional arguments or a query requires specific advanced functionality.
Other arguments and mappings can be supplied through **api_specific_parameters to the parameter config, provided that the options or pre-defined mappings exist in the config.
When **api_specific_parameters and additional_parameters conflict, additional_parameters is considered the ground truth. If any remaining parameters are None in the constructed list of parameters, these values will be dropped from the final dictionary.
- Parameters:
page (int) – The page number to request.
additional_parameters (Optional[dict]) – A dictionary of additional overrides that may or may not have been included in the original parameter map of the current API. (Provided for further customization of requests).
**api_specific_parameters – Additional parameters to provide to the parameter config: Note that the config will only accept keyword arguments that have been explicitly defined in the parameter map. For all others, they must be added using the additional_parameters parameter.
- Returns:
The constructed request parameters.
- Return type:
dict[str, Any]
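The precedence rules described for build_parameters can be illustrated with plain dictionaries. This is a sketch under assumptions, not the library's code: `PARAMETER_MAP`, `build_request_parameters`, the provider-side names (`q`, `rows`), and the page-to-offset rule are all hypothetical.

```python
from typing import Any, Dict, Optional

# Hypothetical parameter map: universal field name -> provider-specific name.
PARAMETER_MAP = {"query": "q", "start": "start", "records_per_page": "rows"}


def build_request_parameters(
    query: str,
    page: int,
    records_per_page: int,
    additional_parameters: Optional[Dict[str, Any]] = None,
    **api_specific_parameters: Any,
) -> Dict[str, Any]:
    """Map universal fields to provider names, then layer overrides on top."""
    # Assumed pagination rule: one-indexed pages converted to a record offset.
    start = (page - 1) * records_per_page
    parameters: Dict[str, Any] = {
        PARAMETER_MAP["query"]: query,
        PARAMETER_MAP["start"]: start,
        PARAMETER_MAP["records_per_page"]: records_per_page,
    }
    # API-specific keyword arguments are applied first...
    parameters.update(api_specific_parameters)
    # ...and additional_parameters win whenever the two conflict.
    parameters.update(additional_parameters or {})
    # Parameters that are still None are dropped from the final dictionary.
    return {key: value for key, value in parameters.items() if value is not None}


params = build_request_parameters(
    "machine learning", 2, 10,
    additional_parameters={"sort": "date", "fl": None},
    sort="score", fl="title",
)
```

Here `sort` resolves to `"date"` because additional_parameters takes priority, and `fl` is removed because its final value is None.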
- property config: SearchAPIConfig
Property method for accessing the config for the SearchAPI.
- Returns:
The configuration corresponding to the API Provider
- describe() dict[str, Any][source]
A helper method that describes the accepted configuration for the current provider or user-defined parameter mappings.
- Returns:
A dictionary describing valid config fields and provider-specific api parameters for the current provider (if applicable).
- Return type:
dict[str, Any]
- property display_name: str
Human-readable provider name for logging and display purposes.
- classmethod from_defaults(query: str, provider_name: str | None, session: Session | None = None, user_agent: Annotated[str | None, 'An optional User-Agent to associate with each search'] = None, use_cache: bool | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters: Any) SearchAPI[source]
Factory method to create SearchAPI instances with sensible defaults for known providers.
PLOS is used by default unless the environment variable, SCHOLAR_FLUX_DEFAULT_PROVIDER is set to another provider.
- Parameters:
query (str) – The search keyword or query string.
base_url (str) – The base URL for the article API.
records_per_page (int) – Number of records to fetch per page (1-100).
request_delay (Optional[float]) – Minimum delay between requests in seconds.
api_key (Optional[str | SecretStr]) – API key if required.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist.
masker (Optional[SensitiveDataMasker]) – Used for filtering potentially sensitive information from logs
**api_specific_parameters – Additional api parameter-value pairs and overrides to be provided to SearchAPIConfig class
- Returns:
A new SearchAPI instance initialized with the config chosen.
- classmethod from_provider_config(query: str, provider_config: ProviderConfig, session: Session | None = None, user_agent: Annotated[str | None, 'An optional User-Agent to associate with each search'] = None, use_cache: bool | None = None, timeout: int | float | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters: Any) SearchAPI[source]
Factory method to create a new SearchAPI instance using a ProviderConfig.
This method uses the default settings associated with the provider config to temporarily make the configuration settings globally available when creating the SearchAPIConfig and APIParameterConfig instances from the provider registry.
- Parameters:
query (str) – The search keyword or query string.
provider_config (ProviderConfig) – The provider configuration whose default settings are used to create the SearchAPI.
session (Optional[requests.Session]) – A pre-configured session or None to create a new session.
user_agent (Optional[str]) – Optional user-agent string for the session.
use_cache (Optional[bool]) – Indicates whether or not to use cache if a cached session doesn’t yet exist.
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError.
masker (Optional[SensitiveDataMasker]) – Used for filtering potentially sensitive information from logs
**api_specific_parameters – Additional api parameter-value pairs and overrides to be provided to SearchAPIConfig class
- Returns:
A new SearchAPI instance initialized with the chosen configuration.
- classmethod from_settings(query: str, config: SearchAPIConfig, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, use_cache: bool | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None) SearchAPI[source]
Advanced constructor: instantiate directly from a SearchAPIConfig instance.
- Parameters:
query (str) – The search keyword or query string.
config (SearchAPIConfig) – Indicates the configuration settings to be used when sending requests to APIs
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – Maps global scholar_flux parameters to those that are specific to the current API
session (Optional[requests.Session | CachedSession]) – An optional session to use for the creation of request sessions
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError
use_cache (Optional[bool]) – Indicates whether or not to use cache. The settings from session are otherwise used if this option is not specified.
masker (Optional[SensitiveDataMasker]) – A masker used to filter logs of API keys and other sensitive data.
user_agent (Optional[str]) – A user agent to associate with the session.
- Returns:
A newly constructed SearchAPI with the chosen/validated settings.
- Return type:
SearchAPI
- classmethod get_default_provider_name() str[source]
Retrieves the name of the default provider as configured via config_settings.
Note
When config_settings does not resolve to a known provider, a warning is raised, and SearchAPIConfig.DEFAULT_PROVIDER is returned instead.
- Returns:
A known default, either resolved from SCHOLAR_FLUX_DEFAULT_PROVIDER or SearchAPIConfig.DEFAULT_PROVIDER.
- Return type:
str
- make_request(current_page: int, additional_parameters: dict[str, Any] | None = None, request_delay: float | None = None, endpoint: str | None = None) Response[source]
Constructs and sends a request to the chosen API.
The parameters are built based on the default/chosen config and parameter map.
- Parameters:
current_page (int) – The page number to request.
additional_parameters (Optional[dict]) – A dictionary of additional overrides not included in the original SearchAPIConfig.
request_delay (Optional[float]) – Overrides the configured request delay for the current request only.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
- Returns:
The API’s response to the request.
- Return type:
requests.Response
- property parameter_config: APIParameterConfig
Property method for accessing the parameter mapping config for the SearchAPI.
- Returns:
The configuration corresponding to the API Provider
- prepare_request(base_url: str | None = None, endpoint: str | None = None, parameters: dict[str, Any] | None = None, api_key: str | None = None) PreparedRequest[source]
Prepares a GET request for the specified endpoint with optional parameters.
This method builds on the original base class method by additionally allowing users to specify a custom request directly while also accounting for the addition of an API key specific to the API.
- Parameters:
base_url (str) – The base URL for the API.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
parameters (Optional[dict[str, Any]]) – Optional query parameters for the request.
- Returns:
The prepared request object.
- Return type:
requests.PreparedRequest
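As a rough picture of what preparing a GET request involves, the sketch below assembles a final URL from a base URL, an optional endpoint, query parameters, and an optional API key using only the standard library. `build_get_url` and the `api_key_parameter` default are hypothetical. The real method returns a requests.PreparedRequest rather than a string.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode


def build_get_url(
    base_url: str,
    endpoint: Optional[str] = None,
    parameters: Optional[Dict[str, Any]] = None,
    api_key: Optional[str] = None,
    api_key_parameter: str = "api_key",  # assumed parameter name
) -> str:
    """Join base URL, endpoint, and query parameters into a single GET URL."""
    url = base_url.rstrip("/")
    if endpoint:
        url = f"{url}/{endpoint.strip('/')}"
    query = dict(parameters or {})
    if api_key is not None:
        # The key is added under the provider-specific parameter name.
        query[api_key_parameter] = api_key
    return f"{url}?{urlencode(query)}" if query else url
```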
- prepare_search(page: int | None = None, parameters: dict[str, Any] | None = None, request_delay: float | None = None, endpoint: str | None = None) PreparedRequest[source]
Prepares the current request given the provided page and parameters.
The prepared request object can be sent using the SearchAPI.session.send method with requests.Session and requests_cache.CachedSession objects.
- Parameters:
page (Optional[int]) – Page number to query. If provided, parameters are built from the config and this page.
parameters (Optional[dict[str, Any]]) – If provided alone, used as the full parameter set to build the current request. If provided together with page, these act as additional or overriding parameters on top of the built config.
request_delay (Optional[float]) – No-Op: retained to emulate the .search() method’s parameters to ensure that the value is not included in the request parameters.
endpoint (Optional[str]) – The API endpoint to prepare the request for.
- Returns:
A request object that can be sent via api.session.send.
- Return type:
requests.PreparedRequest
- property provider_name: str
Property method for accessing the provider name in the current SearchAPI instance.
- Returns:
The name corresponding to the API Provider.
- property query: str
Retrieves the current value of the query to be sent to the current API.
- property rate_limiter: RateLimiter
Property enabling public access to the rate limiter for ease of use.
- Returns:
The rate limiter that throttles the number of requests that can be sent to an API within a time interval.
- Return type:
RateLimiter
- property records_per_page: int
Indicates the total number of records to show on each page.
- Returns:
an integer indicating the max number of records per page
- Return type:
int
- property request_delay: float
Indicates how long we should wait in-between requests.
Helpful for ensuring compliance with the rate-limiting requirements of various APIs.
- Returns:
The number of seconds to wait at minimum between each request
- Return type:
float
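The idea of enforcing a minimum delay between requests can be sketched with a small limiter. This is an illustration only: `MinIntervalLimiter` is a hypothetical class, not the library's RateLimiter, and the injectable `clock` and `sleep` arguments exist so the behavior can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class MinIntervalLimiter:
    """Blocks so that consecutive calls are at least min_interval seconds apart."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Sleep for the remaining part of the interval and return the time slept."""
        now = self._clock()
        remaining = 0.0
        if self._last_call is not None:
            remaining = max(0.0, self.min_interval - (now - self._last_call))
        if remaining > 0:
            self._sleep(remaining)
        # Record when this call was released so the next one is spaced from it.
        self._last_call = self._clock()
        return remaining
```

Calling `wait()` before each request spaces the requests by at least `min_interval` seconds, and the first call never blocks.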
- search(page: int | None = None, parameters: dict[str, Any] | None = None, request_delay: float | None = None, endpoint: str | None = None) Response[source]
Public method to perform a search for the selected page with the current API configuration.
A search can be performed by specifying the page to query, which uses the preselected defaults together with any additional parameter overrides accepted by the API.
Alternatively, users can create a custom request using a parameter dictionary containing the full set of API parameters.
- Parameters:
page (Optional[int]) – Page number to query. If provided, parameters are built from the config and this page.
parameters (Optional[dict[str, Any]]) – If provided alone, used as the full parameter set for the request. If provided together with page, these act as additional or overriding parameters on top of the built config.
request_delay (Optional[float]) – Overrides the configured request delay for the current request only.
endpoint (Optional[str]) – An Optional API endpoint to append to base_url.
- Returns:
A response object from the API containing articles and metadata
- Return type:
requests.Response
- session: Session
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for quickly showing a representation of the overall structure of the SearchAPI. The helper function, generate_repr_from_string helps produce human-readable representations of the core structure of the SearchAPI.
- Parameters:
flatten (bool) – Whether to flatten the SearchAPI’s structural representation into a single line.
show_value_attributes (bool) – Whether to show nested attributes of the components of the SearchAPI.
- Returns:
The structure of the current SearchAPI as a string.
- Return type:
str
- classmethod update(search_api: SearchAPI, query: str | None = None, config: SearchAPIConfig | None = None, parameter_config: BaseAPIParameterMap | APIParameterMap | APIParameterConfig | None = None, session: Session | CachedSession | None = None, user_agent: str | None = None, timeout: int | float | None = None, use_cache: bool | None = None, masker: SensitiveDataMasker | None = None, rate_limiter: RateLimiter | None = None, **api_specific_parameters: Any) SearchAPI[source]
Helper method for generating a new SearchAPI from an existing SearchAPI instance. All parameters that are not modified are pulled from the original SearchAPI. If no changes are made, an identical SearchAPI is generated from the existing defaults.
- Parameters:
config (SearchAPIConfig) – Indicates the configuration settings to be used when sending requests to APIs
parameter_config (Optional[BaseAPIParameterMap | APIParameterMap | APIParameterConfig]) – Maps global scholar_flux parameters to those that are API specific.
session (Optional[requests.Session | CachedSession]) – An optional session to use for the creation of request sessions
user_agent (Optional[str]) – A user agent to associate with the session
timeout (Optional[int | float]) – Identifies the number of seconds to wait before raising a TimeoutError
use_cache (Optional[bool]) – Indicates whether or not to use cache. The settings from session are otherwise used if this option is not specified.
masker (Optional[SensitiveDataMasker]) – A masker used to filter logs of API keys and other sensitive data
- Returns:
A newly constructed SearchAPI with the chosen/validated settings
- Return type:
SearchAPI
- with_config(config: SearchAPIConfig | None = None, parameter_config: APIParameterConfig | None = None, provider_name: str | None = None, query: str | None = None) Iterator[SearchAPI][source]
Temporarily modifies the SearchAPI’s SearchAPIConfig and/or APIParameterConfig and namespace. You can provide a config, a parameter_config, or a provider_name to fetch defaults. Explicitly provided configs take precedence over provider_name, and the context manager will revert changes to the parameter mappings and search configuration afterward.
- Parameters:
config (Optional[SearchAPIConfig]) – Temporary search api configuration to use within the context to control where and how response records are retrieved.
parameter_config (Optional[APIParameterConfig]) – Temporary parameter config to use within the context to resolve universal parameters names to those that are specific to the current api.
provider_name (Optional[str]) – Used to retrieve the associated configuration for a specific provider in order to edit the parameter map when using a different provider.
query (Optional[str]) – Allows users to temporarily modify the query used to retrieve records from an API.
- Yields:
SearchAPI – The current api object with a temporarily swapped config during the context manager.
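The temporary-swap behavior of this context manager follows a common pattern: save the current values, apply the overrides, and restore the originals in a finally block. The sketch below shows that pattern on a hypothetical `MiniAPI` class. It is not the SearchAPI implementation and omits provider lookups and cache namespaces.

```python
from contextlib import contextmanager
from typing import Iterator, Optional


class MiniAPI:
    """Hypothetical stand-in for a client holding a swappable config and query."""

    def __init__(self, query: str, config: dict) -> None:
        self.query = query
        self.config = config

    @contextmanager
    def with_config(
        self, config: Optional[dict] = None, query: Optional[str] = None
    ) -> Iterator["MiniAPI"]:
        original_config, original_query = self.config, self.query
        try:
            if config is not None:
                self.config = config
            if query is not None:
                self.query = query
            yield self
        finally:
            # Always restore the originals, even if the body raised.
            self.config, self.query = original_config, original_query


api = MiniAPI("machine learning", {"provider_name": "plos"})
with api.with_config(query="genomics") as temporary_api:
    query_inside = temporary_api.query
query_after = api.query
```

Inside the block the query is `"genomics"`, and after the block it reverts to `"machine learning"`.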
- with_config_parameters(provider_name: str | None = None, query: str | None = None, **api_specific_parameters: Any) Iterator[SearchAPI][source]
Allows for the temporary modification of the search configuration, parameter mappings, and cache namespace for the current API. Uses a context manager to temporarily change the provided parameters without persisting the changes.
- Parameters:
provider_name (Optional[str]) – If provided, fetches the default parameter config for the provider.
query (Optional[str]) – Allows users to temporarily modify the query used to retrieve records from an API.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override in the current config.
- Yields:
SearchAPI – The API object with temporarily swapped config and/or parameter config.
- class scholar_flux.api.SearchAPIConfig(*, provider_name: str = '', base_url: str = '', records_per_page: Annotated[int, Ge(ge=0), Le(le=1000)] = 20, request_delay: float = -1, api_key: SecretStr | None = None, api_specific_parameters: dict[str, Any] | None = None)[source]
Bases: BaseModel
The SearchAPIConfig class provides the core tools necessary to set and interact with the API. The SearchAPI uses this class to retrieve data from an API using universal parameters to simplify the process of retrieving raw responses.
- provider_name
Indicates the name of the API to use when making requests to a provider. If the provider name matches a known default and the base_url is unspecified, the base URL for the current provider is used instead.
- Type:
str
- base_url
Indicates the API URL where data will be searched and retrieved.
- Type:
str
- records_per_page
Controls the number of records that will appear on each page.
- Type:
int
- request_delay
Indicates the minimum delay between each request to avoid exceeding API rate limits.
- Type:
float
- api_key
This is an API-specific parameter for validating the current user’s identity. If a str type is provided, it is converted into a SecretStr.
- Type:
Optional[str | SecretStr]
- api_specific_parameters
A dictionary containing all parameters specific to the current API. API-specific parameters include the following:
- mailto (Optional[str | SecretStr]):
An optional email address for receiving feedback on usage from providers. This parameter is currently applicable only to the Crossref API.
- db (str):
The parameter used by the NIH to direct requests for data to the pubmed database. This parameter defaults to pubmed and does not require direct specification.
- Type:
dict[str, Any]
Examples
>>> from scholar_flux.api import SearchAPIConfig, SearchAPI, provider_registry
# To create a CROSSREF configuration with minimal defaults and provide an api_specific_parameter:
>>> config = SearchAPIConfig.from_defaults(provider_name = 'crossref', mailto = 'your_email_here@example.com')
# The configuration automatically retrieves the configuration for the "Crossref" API.
>>> assert config.provider_name == 'crossref' and config.base_url == provider_registry['crossref'].base_url
>>> api = SearchAPI.from_settings(query = 'q', config = config)
>>> assert api.config == config
# To retrieve all defaults associated with a provider and automatically read an API key if needed:
>>> config = SearchAPIConfig.from_defaults(provider_name = 'pubmed', api_key = 'your api key goes here')
# The API key is retrieved automatically if you have the API key specified as an environment variable.
>>> assert config.api_key is not None
# Default provider API specifications are already pre-populated if they are set with defaults.
>>> assert config.api_specific_parameters['db'] == 'pubmed' # Required by pubmed and defaults to pubmed.
# Update a provider and automatically retrieve its API key - the previous API key will no longer apply.
>>> updated_config = SearchAPIConfig.update(config, provider_name = 'core')
# The API key should have been overwritten to use core. Looks for a `CORE_API_KEY` env variable by default.
>>> assert updated_config.provider_name == 'core' and updated_config.api_key != config.api_key
- DEFAULT_PROVIDER: ClassVar[str] = 'PLOS'
- DEFAULT_RECORDS_PER_PAGE: ClassVar[int] = 25
- DEFAULT_REQUEST_DELAY: ClassVar[float] = 6.1
- MAX_API_KEY_LENGTH: ClassVar[int] = 512
- api_key: SecretStr | None
- api_specific_parameters: dict[str, Any] | None
- base_url: str
- classmethod default_request_delay(v: int | float | None, provider_name: str | None = None) float[source]
Helper method enabling the retrieval of the most appropriate rate limit for the current provider.
Defaults to the SearchAPIConfig default rate limit when the current provider is unknown and a valid rate limit has not yet been provided.
- Parameters:
v (Optional[int | float]) – The value received for the current request_delay
provider_name (Optional[str]) – The name of the provider to retrieve a rate limit for
- Returns:
The inputted non-negative request delay, the retrieved rate limit for the current provider if available, or the SearchAPIConfig.DEFAULT_REQUEST_DELAY, in that order of priority.
- Return type:
float
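The priority order described for the request delay can be written as a short fallback chain. In this sketch, `resolve_request_delay` and the `PROVIDER_DELAYS` table are hypothetical. Only the 6.1-second default comes from this reference.

```python
from typing import Dict, Optional, Union

DEFAULT_REQUEST_DELAY = 6.1  # matches SearchAPIConfig.DEFAULT_REQUEST_DELAY in this reference

# Hypothetical per-provider delays; real values live in the provider registry.
PROVIDER_DELAYS: Dict[str, float] = {"example_provider": 1.0}


def resolve_request_delay(
    v: Optional[Union[int, float]],
    provider_name: Optional[str] = None,
) -> float:
    """Pick the explicit delay, then the provider default, then the global default."""
    if v is not None and v >= 0:
        # 1. An explicit, non-negative value always wins.
        return float(v)
    provider_delay = PROVIDER_DELAYS.get((provider_name or "").lower())
    if provider_delay is not None:
        # 2. Otherwise use the provider's own rate limit when it is known.
        return provider_delay
    # 3. Fall back to the global default.
    return DEFAULT_REQUEST_DELAY
```

A value of -1, which the validator returns for missing or negative input, falls through to the provider or global default.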
- classmethod from_defaults(provider_name: str, **overrides: Any) SearchAPIConfig[source]
Uses the default configuration for the chosen provider to create a SearchAPIConfig object containing configuration parameters. Note that additional parameters and field overrides can be added via the **overrides field.
- Parameters:
provider_name (str) – The name of the provider to create the config
**overrides – Optional keyword arguments to specify overrides and additional arguments
- Returns:
A default SearchAPIConfig object based on the chosen parameters
- Return type:
SearchAPIConfig
- model_config: ClassVar[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
- provider_name: str
- records_per_page: int
- request_delay: float
- classmethod set_records_per_page(v: int | None) int[source]
Sets the records_per_page parameter with the default if the supplied value is not valid:
Triggers a validation error when records_per_page is an invalid type. Otherwise uses the DEFAULT_RECORDS_PER_PAGE class attribute if the supplied value is missing or is a negative number.
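The fallback rule for records_per_page can be sketched as a plain function. This is an assumption-laden illustration: `resolve_records_per_page` is a hypothetical name, and it raises TypeError where the real pydantic validator would trigger a validation error.

```python
from typing import Optional

DEFAULT_RECORDS_PER_PAGE = 25  # matches the class attribute listed in this reference


def resolve_records_per_page(v: Optional[int]) -> int:
    """Return v when it is a usable integer, otherwise fall back to the default."""
    if v is None:
        return DEFAULT_RECORDS_PER_PAGE
    if not isinstance(v, int):
        # Stand-in for the validation error raised on invalid types.
        raise TypeError(f"records_per_page must be an integer, got {type(v).__name__}")
    # Negative values are treated the same way as a missing value.
    return DEFAULT_RECORDS_PER_PAGE if v < 0 else v
```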
- structure(flatten: bool = False, show_value_attributes: bool = True) str[source]
Helper method for retrieving a string representation of the overall structure of the current SearchAPIConfig.
- classmethod update(current_config: SearchAPIConfig, **overrides: Any) SearchAPIConfig[source]
Create a new SearchAPIConfig by updating an existing config with new values and/or switching to a different provider. This method ensures that the new provider’s base_url and defaults are used if provider_name is given, and that API-specific parameters are prioritized and merged as expected.
- Parameters:
current_config (SearchAPIConfig) – The existing configuration to update.
**overrides – Any fields or API-specific parameters to override or add.
- Returns:
A new config with the merged and prioritized values.
- Return type:
SearchAPIConfig
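One way to picture the merge behaviour of update is the sketch below. It is an illustration under stated assumptions, not the library's code: `ConfigSketch`, `update_config`, the `PROVIDER_DEFAULTS` registry, and the placeholder URLs are all hypothetical, and carrying existing API-specific parameters over to a new provider is a choice made for this sketch only.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict

# Hypothetical provider registry; names and URLs are placeholders.
PROVIDER_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "provider_a": {"base_url": "https://api.provider-a.example/search", "records_per_page": 25},
    "provider_b": {"base_url": "https://api.provider-b.example/v1", "records_per_page": 50},
}
CORE_FIELDS = {"provider_name", "base_url", "records_per_page"}


@dataclass(frozen=True)
class ConfigSketch:
    provider_name: str
    base_url: str
    records_per_page: int
    api_specific_parameters: Dict[str, Any] = field(default_factory=dict)


def update_config(current: ConfigSketch, **overrides: Any) -> ConfigSketch:
    """Return a new config; the original instance is left untouched."""
    fields = {k: v for k, v in overrides.items() if k in CORE_FIELDS}
    extras = {k: v for k, v in overrides.items() if k not in CORE_FIELDS}

    new_provider = fields.get("provider_name")
    if new_provider and new_provider != current.provider_name:
        # Switching providers: start from the new provider's defaults so the
        # old base_url is not carried over, then apply explicit overrides.
        fields = {**PROVIDER_DEFAULTS[new_provider], **fields}

    # New API-specific values take priority over existing ones.
    merged = {**current.api_specific_parameters, **extras}
    return replace(current, api_specific_parameters=merged, **fields)
```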
- property url_basename: str
Extracts the URL basename from the provider URL associated with the current config instance using the _extract_url_basename method.
- classmethod validate_api_key(v: SecretStr | str | None) SecretStr | None[source]
Validates the api_key attribute and triggers a validation error if it is not valid.
- classmethod validate_provider_name(v: str | None) str[source]
Validates the provider_name attribute and triggers a validation error if it is not valid.
- classmethod validate_request_delay(v: int | float | None) int | float | None[source]
Sets the request delay (delay between each request) for valid request delays. This validator triggers a validation error when the request delay is an invalid type.
If a request delay is left None or is a negative number, this class method returns -1, and further validation is performed by cls.default_request_delay to retrieve the provider’s default request delay.
If not available, SearchAPIConfig.DEFAULT_REQUEST_DELAY is used.
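The first validation stage described above normalizes "no usable value" to a sentinel and leaves the choice of default to a later step. A minimal sketch, assuming a plain function in place of the pydantic validator (the name `normalize_request_delay` and the constant `UNSET_DELAY` are hypothetical):

```python
from typing import Optional, Union

UNSET_DELAY = -1  # sentinel meaning "resolve a default in a later step"


def normalize_request_delay(v: Optional[Union[int, float]]) -> Union[int, float]:
    """Return v when it is a usable delay, otherwise the -1 sentinel."""
    if v is None:
        return UNSET_DELAY
    # Reject values that are not real numbers (bool is excluded by assumption).
    if isinstance(v, bool) or not isinstance(v, (int, float)):
        raise ValueError(f"request_delay must be numeric, got {type(v).__name__}")
    # Negative delays are meaningless, so they are treated like a missing value.
    return UNSET_DELAY if v < 0 else v
```

Keeping the sentinel separate from the default lets the later step pick a provider-specific delay once the provider name is known.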
- validate_search_api_config_parameters() Self[source]
Validation method that resolves URLs and/or provider names to provider_info when one or the other is not explicitly provided.
Occurs as the last step in the validation process.
- class scholar_flux.api.SearchCoordinator(search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, *, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, query: str | None = None, provider_name: str | None = None, cache_requests: bool | None = None, cache_results: bool | None = None, annotate_records: bool | None = None, retry_handler: RetryHandler | None = None, validator: ResponseValidator | None = None, workflow: SearchWorkflow | None = None, **kwargs: Any)[source]
Bases: BaseCoordinator
High-level coordinator for requesting and retrieving records and metadata from APIs.
This class uses dependency injection to orchestrate the process of constructing requests, validating responses, and processing scientific works and articles. It abstracts away the complexity of using APIs while providing a consistent and robust interface for retrieving records and metadata, reusing valid entries from the request and storage caches to help avoid exceeding API rate limits.
If no search_api is provided, the coordinator creates a SearchAPI that uses the provider named in the SCHOLAR_FLUX_DEFAULT_PROVIDER environment variable when it is set. Otherwise, PLOS is used on the backend.
- __init__(search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, *, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, query: str | None = None, provider_name: str | None = None, cache_requests: bool | None = None, cache_results: bool | None = None, annotate_records: bool | None = None, retry_handler: RetryHandler | None = None, validator: ResponseValidator | None = None, workflow: SearchWorkflow | None = None, **kwargs: Any) None[source]
Flexible initializer that constructs a SearchCoordinator from its core components or their building blocks.
If a SearchAPI and a ResponseCoordinator are provided, this method uses them directly. Otherwise, each missing core component is created from its underlying dependencies.
The additional parameters can still be used to update these two components. For example, a search_api can be updated with a new query, session, and SearchAPIConfig parameters through keyword arguments (**kwargs).
- When neither component is provided:
The creation of the search_api requires, at minimum, a query.
If the response_coordinator, a parser, extractor, processor, and cache_manager aren’t provided, then a new ResponseCoordinator will be built from the default settings.
- Core Components/Attributes:
- SearchAPI: handles all requests to an API based on its configuration.
Dependencies: query, **kwargs
- ResponseCoordinator: handles the parsing, record/metadata extraction, processing, and caching of responses
Dependencies: parser, extractor, processor, cache_manager
- Other Attributes:
- RetryHandler: Addresses when to retry failed requests and how failed requests are retried
- SearchWorkflow: An optional workflow that defines custom search logic from specific APIs
- Validator: handles how responses are validated. The default determines whether a 200 response was received
Note
This implementation uses the underlying private method _initialize to handle the assignment of parameters under the hood while the core function of the __init__ creates these components if they do not already exist.
- Parameters:
search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs.
response_coordinator (Optional[ResponseCoordinator]) – Core class used to coordinate the handling and processing of all responses received from APIs.
parser (Optional[BaseDataParser]) – First step of the response processing pipeline - parses response records into a dictionary.
extractor (Optional[BaseDataExtractor]) – Extracts both records and metadata from responses separately.
processor (Optional[ABCDataProcessor]) – Processes the previously extracted API records into list of dictionaries that are filtered and optionally flattened during processing.
cache_manager (Optional[DataCacheManager]) – Manages the caching of processed records for faster retrieval.
query (Optional[str]) – Query to be used when sending requests when creating an API - modifies the query if the API already exists.
provider_name (Optional[str]) – The name of the API provider where requests will be sent. If a provider_name and base_url are both given, the SearchAPIConfig will prioritize base_urls over the provider_name.
cache_requests (Optional[bool]) – Determines whether or not to cache requests - api is the ground truth if not directly specified
cache_results (Optional[bool]) – Determines whether or not to cache processed responses - on by default unless specified otherwise
annotate_records (Optional[bool]) – Indicates whether the DataExtractor should add unique, record-identifying fields to each extracted record. These fields aid in record-linkage and the hashed identification of duplicates in later steps.
retry_handler (Optional[RetryHandler]) – Class used to retry failed requests.
validator (Optional[ResponseValidator]) – Class used to verify and validate responses returned from APIs.
workflow (Optional[SearchWorkflow]) – An optional workflow used to customize how records are retrieved from APIs. Uses the default workflow for the current provider when a workflow is not directly specified.
**kwargs – Keyword arguments to be passed to the SearchAPIConfig if a SearchAPI doesn’t already exist.
Examples –
>>> from scholar_flux import SearchCoordinator
>>> from scholar_flux.api import APIResponse, ReconstructedResponse
>>> from scholar_flux.sessions import CachedSessionManager
>>> from typing import MutableMapping
>>> session = CachedSessionManager(user_agent = 'scholar_flux', backend='redis').configure_session()
>>> search_coordinator = SearchCoordinator(query = "Intrinsic Motivation", session = session, cache_results = False)
>>> response = search_coordinator.search(page = 1)
>>> response
# OUTPUT: <ProcessedResponse(len=50, cache_key='plos_Functional Processing_1_50', metadata='...') ': 1, 'maxSco...")>
>>> new_response = ReconstructedResponse.build(**response.response.__dict__)
>>> new_response.validate()
>>> new_response = ReconstructedResponse.build(response.response)
>>> ReconstructedResponse.build(new_response).validate()
>>> new_response.validate()
>>> newer_response = APIResponse.as_reconstructed_response(new_response)
>>> newer_response.validate()
>>> double_processed_response = search_coordinator._process_response(response = newer_response, cache_key = response.cache_key)
- classmethod as_coordinator(search_api: SearchAPI, response_coordinator: ResponseCoordinator, *args: Any, **kwargs: Any) SearchCoordinator[source]
Helper factory method that builds a SearchCoordinator directly from its two final building blocks: a SearchAPI and a ResponseCoordinator.
- Parameters:
search_api (SearchAPI) – The search API to use for the retrieval of response records from APIs
response_coordinator (ResponseCoordinator) – Core class used to handle the processing and core handling of all responses from APIs
- Returns:
A newly created coordinator that orchestrates record retrieval and processing
- Return type:
SearchCoordinator
- fetch(page: int | None, from_request_cache: bool = True, raise_on_error: bool = False, cache_only: bool = False, **api_specific_parameters: Any) Response | ResponseProtocol | None[source]
Fetches the raw response from the current API or from cache if available.
If page is None, fetch will default to a basic parameter search using the API base URL given the specified parameters.
- Parameters:
page (Optional[int]) – The page number to retrieve from the cache.
from_request_cache (bool) – This parameter determines whether to try to fetch a valid response from cache.
raise_on_error (bool) – Indicates whether an error should be raised when failing to fetch a valid response.
cache_only (bool) – Flag indicating whether the search should only attempt to retrieve the page from cache.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
The response object if available, otherwise None.
- Return type:
Optional[Response]
- Raises:
RetryAfterDelayExceededException – If the server-requested delay until the next request exceeds the user-specified maximum wait time as configured through the RetryHandler.
RequestFailedException – If an unexpected error occurs during the retrieval process as orchestrated via the RetryHandler.
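The interaction between from_request_cache, cache_only, and raise_on_error can be sketched with a dictionary standing in for the request cache. Everything here is hypothetical (`fetch_sketch`, `PageUnavailableFromCache`, and the `send_request` callable), and returning None on a cache-only miss is an assumption of the sketch rather than documented behaviour.

```python
from typing import Callable, Dict, Optional


class PageUnavailableFromCache(Exception):
    """Hypothetical error for a cache miss when only the cache may be used."""


def fetch_sketch(
    page: int,
    cache: Dict[int, str],
    send_request: Callable[[int], str],
    from_request_cache: bool = True,
    cache_only: bool = False,
    raise_on_error: bool = False,
) -> Optional[str]:
    # Try the cache first when allowed (cache_only implies a cache lookup).
    if from_request_cache or cache_only:
        cached = cache.get(page)
        if cached is not None:
            return cached
    # With cache_only set, never contact the API on a miss.
    if cache_only:
        if raise_on_error:
            raise PageUnavailableFromCache(f"page {page} is not cached")
        return None
    try:
        response = send_request(page)
    except Exception:
        if raise_on_error:
            raise
        return None
    cache[page] = response  # store for later cache hits
    return response
```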
- get_cached_request(page: int | None, **kwargs: Any) Response | ResponseProtocol | None[source]
Retrieves the cached request for a given page number if available.
- Parameters:
page (Optional[int]) – The page number to retrieve from the cache.
- Returns:
The cached request object if available, otherwise None.
- Return type:
Optional[Response]
- get_cached_response(page: int, url: str | None = None, **kwargs: Any) ProcessedResponse | ErrorResponse | None[source]
Retrieves the cached response for a given page number if available.
This method prefers processed cached data when it is available, regardless of whether the underlying request was cached.
If the cached request does not exist, and the processed response data does exist, this method creates a ProcessedResponse with ReconstructedResponse when possible.
If the cached request exists or is newer, this method returns the ProcessedResponse after handling the raw cached response object.
- Parameters:
page (int) – The page number to retrieve from the cache.
url (Optional[str]) – The request URL for parameter-based cache keys. Used when page is None.
**kwargs – Additional arguments to pass to get_cached_request for the reconstruction of a cached response
- Returns:
The cached/reconstructed response if available.
- Return type:
Optional[ProcessedResponse | ErrorResponse]
- get_cached_response_keys() list[str][source]
Finds all cache keys from cached, paginated requests made with the current query.
- get_cached_search_result(page: int, url: str | None = None, **kwargs: Any) SearchResult | None[source]
Retrieves a SearchResult containing a ProcessedResponse for a given page number if available.
This is a convenience method that uses get_cached_response under the hood to retrieve and format a response as a SearchResult instance.
If the cached response does not exist, this method will return None instead.
- Parameters:
page (int) – The page number to retrieve from the cache.
url (Optional[str]) – The request URL for parameter-based cache keys. Used when page is None.
**kwargs – Additional arguments to pass to get_cached_response for the reconstruction of a cached response
- Returns:
The search result containing the reconstructed response result if available.
- Return type:
Optional[SearchResult]
- iter_pages(pages: Sequence[int] | PageListInput, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) Generator[SearchResult, None, None][source]
Helper method that creates a generator for retrieving and processing records from the API Provider for a sequence of pages. This implementation examines the search result for each retrieved page to determine whether iteration should halt early or continue.
SearchCoordinator.search_pages uses this method directly to provide a clean interface that abstracts away the complexity of iterators. The method is also available on its own for cases where iteration is preferable.
- Parameters:
pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Yields:
SearchResult –
- Iteratively returns the SearchResult for each page using a generator expression.
Each result contains the requested page number (page), the name of the provider (provider_name), and the result of the search containing a ProcessedResponse, an ErrorResponse, or None (api response)
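The early-halting behaviour of a page generator can be illustrated with a short sketch. The names `iter_pages_sketch` and `fetch_page` are hypothetical, and the stopping rule used here (a missing page or a page with fewer records than expected) is an assumption about what "halt early" means in practice.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

Records = Optional[List[Dict[str, str]]]


def iter_pages_sketch(
    pages: Iterable[int],
    fetch_page: Callable[[int], Records],
    records_per_page: int,
) -> Iterator[Tuple[int, Records]]:
    """Yield (page, records) lazily and stop once a page looks like the last one."""
    for page in pages:
        records = fetch_page(page)
        yield page, records
        # A missing or short page suggests the provider has no further results.
        if records is None or len(records) < records_per_page:
            break
```

Because results are yielded one at a time, a caller can stop consuming the generator at any point without requesting the remaining pages.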
- parameter_search(from_request_cache: bool = True, from_process_cache: bool = True, normalize_records: bool | None = None, **api_specific_parameters: Any) ProcessedResponse | ErrorResponse[source]
Public method for retrieving and processing records from the API with pre-specified parameters.
Note that the response object is saved under the last_response attribute in the event that the response is retrieved and processed successfully, irrespective of whether the response was cached.
- Parameters:
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage
normalize_records (Optional[bool]) – Determines whether records should be normalized after processing
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A ProcessedResponse model containing the response (response), processed records (data), and article metadata (metadata) if the response was successful. Otherwise returns an ErrorResponse where the reason behind the error (message), exception type (error), and response (response) are provided. Possible error responses also include a NonResponse (an ErrorResponse subclass) for cases where a response object is irretrievable. Like the ErrorResponse class, NonResponse is also Falsy (i.e., not NonResponse returns True)
- Return type:
Optional[ProcessedResponse | ErrorResponse]
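Because error responses are falsy, a caller can branch on the result directly. The sketch below shows the idea with stand-in classes (`ProcessedSketch` and `ErrorSketch` are hypothetical and are not the library's models).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Union


@dataclass
class ProcessedSketch:
    """Stand-in for a successful result that carries processed records."""
    data: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class ErrorSketch:
    """Stand-in for a failed result; evaluates to False in boolean checks."""
    message: str
    error: str

    def __bool__(self) -> bool:
        return False


def records_or_empty(result: Union[ProcessedSketch, ErrorSketch]) -> List[Dict[str, Any]]:
    # The truthiness check replaces an explicit isinstance() test.
    return result.data if result else []
```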
- robust_request(page: int | None, **api_specific_parameters: Any) Response | ResponseProtocol | None[source]
Constructs and sends a request to the current API and returns the response.
- Parameters:
page (Optional[int]) – The page number to retrieve from the cache. If missing, this implementation relies on api_specific_parameters to retrieve data from an API.
**api_specific_parameters – Optional additional parameters to pass to the SearchAPI
- Returns:
The request/response-like object if available, otherwise None.
- Return type:
Optional[Response | ResponseProtocol]
- search(page: int = 1, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, normalize_records: bool | None = None, **api_specific_parameters: Any) ProcessedResponse | ErrorResponse | None[source]
Public method for retrieving and processing records from the API specifying the page and records per page. Note that the response object is saved under the last_response attribute in the event that the response is retrieved and processed successfully, irrespective of whether the response was cached.
- Parameters:
page (int) – The current page number. Used for process caching purposes even if not required by the API
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
normalize_records (Optional[bool]) – Determines whether records should be normalized after processing
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A ProcessedResponse model containing the response (response), processed records (data), and article metadata (metadata) if the response was successful. Otherwise returns an ErrorResponse where the reason behind the error (message), exception type (error), and response (response) are provided. Possible error responses also include a NonResponse (an ErrorResponse subclass) for cases where a response object is irretrievable. Like the ErrorResponse class, NonResponse is also Falsy (i.e., not NonResponse returns True)
- Return type:
Optional[ProcessedResponse | ErrorResponse]
Note: When specifying cache_only=True, this keyword argument is propagated to the fetch method, ensuring that a fresh request is not sent to the current API when a previously cached response is unavailable from the session cache. Instead, a NonResponse is returned that records the PageUnavailableFromCacheException and its corresponding error message.
- search_data(page: int = 1, *args: Any, **kwargs: Any) RecordList | None[source]
Public convenience method to perform a search, specifying the page and records per page.
Note that instead of returning a ProcessedResponse or ErrorResponse, this calls the search method and retrieves only the list of processed dictionary records from the ProcessedResponse.
- Parameters:
page (int) – The current page number.
*args – Positional arguments to pass directly to the .search() method
**kwargs – Keyword arguments to pass directly to the .search() method
- Returns:
A list of record dictionaries containing the processed article data when parsed successfully and records exist. If no records exist or an error occurs during processing, None is returned instead.
- Return type:
Optional[RecordList]
- search_page(page: int, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) SearchResult[source]
Retrieves a single-page SearchResult, returning the processed response with additional metadata.
This method is used to support the retrieval of a page range while wrapping each result in a SearchResult class as a BaseModel that provides more structured information about the received API Response, including the provider’s name, the page number, and the response result.
The SearchResult.response_result attribute can hold three different types of responses:
- ProcessedResponse - indicates the successful retrieval and processing of the data
- ErrorResponse/NonResponse - indicates that a response was successfully received, but that an error occurred during request building, response retrieval or response processing
- None - indicates an issue in the retrieval of the response or formatting/preparation of the request
The SearchResult wrapper enables:
- Introspection: Access provider, query, and page without unpacking the response
- Aggregation: Combine results across pages with consistent metadata
- Normalization: Apply field mapping to create provider-agnostic schemas
When a workflow is active, the provider name is determined from the last-queried URL to ensure correct labeling. For non-workflow searches, the SearchAPI’s provider name is used.
- Parameters:
page (int) – The current page number. Used for process caching purposes even if not required by the API
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A search result containing the requested page number (page), the name of the provider (provider_name), and the result of the search (api_response) which contains a ProcessedResponse, an ErrorResponse, or None.
- Return type:
SearchResult
Note
When specifying cache_only=True, this keyword argument is propagated to the fetch method, ensuring that a fresh request is not sent to the current API when a previously cached response is unavailable from the session cache. Instead, a SearchResult containing a NonResponse is returned, recording the PageUnavailableFromCacheException and its corresponding error message.
- search_pages(pages: Sequence[int] | PageListInput, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) SearchResultList[source]
Public method for retrieving and processing records from the API specifying the page and records per page in sequence.
This method collects search results from multiple pages into a SearchResultList, which provides specialized methods for filtering, normalization, selection, and aggregation. Unlike iter_pages(), which streams results one at a time, this method returns the full collection for cross-page analysis and batch operations.
The SearchResultList return type enables powerful operations like filtering out failures, normalizing records across different providers, selecting subsets by query/provider/page, and joining all records into a single list for DataFrame creation.
- Parameters:
pages (Sequence[int] | PageListInput) – A sequence of page numbers to request from the API Provider. Can be a list, range, or PageListInput instance.
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A specialized list containing SearchResult instances for each requested page. The SearchResultList provides methods including:
- filter(): Retain only successful ProcessedResponses or filter by success/failure
- select(): Filter results by query, provider_name, or page number
- normalize(): Apply field mapping to create provider-agnostic record schemas
- join(): Combine all records into a single list with optional metadata
- process_metadata(): Extract and process metadata across all results
- record_count: Total number of records across all pages
- Return type:
SearchResultList
Note
Retrieval stops early if a page response is None, not retrievable, or contains fewer than the expected number of records, indicating that subsequent pages may be empty. When cache_only=True, the fetch step will only fetch valid responses from cache. If a NonResponse is returned due to a cache miss, the search will continue without halting.
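To make the idea of a result list with filtering and joining helpers concrete, here is a small stand-in. The classes `ResultSketch` and `ResultListSketch` are hypothetical, and their method signatures are simplified assumptions; they are not the signatures of the real SearchResultList methods.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ResultSketch:
    page: int
    provider_name: str
    records: Optional[List[Dict[str, Any]]]  # None stands in for a failed page


class ResultListSketch(list):
    """A list of per-page results with two convenience helpers."""

    def filter(self) -> "ResultListSketch":
        # Keep only pages that produced records.
        return ResultListSketch(r for r in self if r.records is not None)

    def join(self) -> List[Dict[str, Any]]:
        # Flatten all records and tag each with where it came from.
        return [
            {**record, "page": r.page, "provider_name": r.provider_name}
            for r in self.filter()
            for record in r.records
        ]
```

The flattened list of dictionaries produced by `join` is the kind of structure that can be passed straight to a DataFrame constructor.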
- search_records(min_records: int, page_offset: int = 0, from_request_cache: bool = True, from_process_cache: bool = True, use_workflow: bool | None = True, **api_specific_parameters: Any) SearchResultList[source]
Public method for retrieving and processing records by specifying the number of records to retrieve.
This method first calculates the total number of pages required to retrieve the specified number of records and subsequently collects search results from multiple pages into a SearchResultList. The result list provides specialized methods for filtering, normalization, selection, and aggregation. Unlike iter_pages(), which streams results one at a time, this method returns the full collection for cross-page analysis and batch operations.
The SearchResultList return type enables powerful operations like filtering out failures, normalizing records across different providers, selecting subsets by query/provider/page, and joining all records into a single list for DataFrame creation.
- Parameters:
min_records (int) – The total number of records to retrieve sequentially.
page_offset (int) – The page offset indicating the number of pages to skip before beginning record retrieval (0 by default).
from_request_cache (bool) – This parameter determines whether to try to retrieve the response from the requests-cache storage.
from_process_cache (bool) – This parameter determines whether to attempt to pull processed responses from the cache storage.
use_workflow (bool) – Indicates whether to use a workflow if available. Workflows are utilized by default.
**api_specific_parameters (SearchAPIConfig) – Fields to temporarily override when building the request.
- Returns:
A specialized list containing SearchResult instances for each requested page. The SearchResultList provides methods including:
- filter(): Retain only successful ProcessedResponses or filter by success/failure
- select(): Filter results by query, provider_name, or page number
- normalize(): Apply field mapping to create provider-agnostic record schemas
- join(): Combine all records into a single list with optional metadata
- process_metadata(): Extract and process metadata across all results
- record_count: Total number of records across all pages
Note that retrieval stops early if a page response is None, not retrievable, or contains fewer than the expected number of records, indicating that subsequent pages may be empty.
- Return type:
SearchResultList
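The page calculation that precedes retrieval is a ceiling division. A minimal sketch, assuming 1-based page numbers and that page_offset simply shifts the first page (the helper name `pages_for_records` is hypothetical):

```python
import math
from typing import List


def pages_for_records(min_records: int, records_per_page: int, page_offset: int = 0) -> List[int]:
    """Return the page numbers needed to cover at least min_records records."""
    if min_records <= 0 or records_per_page <= 0:
        return []
    # Round up so that a partially filled final page is still requested.
    n_pages = math.ceil(min_records / records_per_page)
    first_page = page_offset + 1  # assumes 1-based page numbering
    return list(range(first_page, first_page + n_pages))
```

For example, 120 records at 50 records per page require three pages, and an offset of two pages shifts the start to page 3.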
- classmethod update(search_coordinator: Self, search_api: SearchAPI | None = None, response_coordinator: ResponseCoordinator | None = None, *, retry_handler: RetryHandler | None = None, validator: ResponseValidator | None = None, workflow: SearchWorkflow | None = None, parser: BaseDataParser | None = None, extractor: BaseDataExtractor | None = None, processor: ABCDataProcessor | None = None, cache_manager: DataCacheManager | None = None, cache_results: bool | None = None, annotate_records: bool | None = None, **search_api_kwargs: Any) SearchCoordinator[source]
Helper factory method allowing the creation of a new SearchCoordinator from both current and new components.
A new coordinator can be created using the components from an existing configuration as a base while directly replacing other components with new configurations. Note that this implementation does not directly copy the underlying components if a new component is not selected.
- Parameters:
search_coordinator (SearchCoordinator) – A previously created coordinator containing the components to use when a replacement is not provided
search_api (Optional[SearchAPI]) – The search API to use for the retrieval of response records from APIs
response_coordinator (Optional[ResponseCoordinator]) – Core class used to handle the processing and core handling of all responses from APIs
retry_handler (Optional[RetryHandler]) – Class used to retry failed requests
validator (Optional[ResponseValidator]) – Class used to verify and validate responses returned from APIs
workflow (Optional[SearchWorkflow]) – An optional workflow used to customize how records are retrieved from APIs. Uses the default workflow for the current provider when a workflow is not directly specified and does not directly carry over in cases where a new provider is chosen.
parser (Optional[BaseDataParser]) – First step of the response processing pipeline - parses response records into a dictionary
extractor (Optional[BaseDataExtractor]) – Extracts both records and metadata from responses separately
processor (Optional[ABCDataProcessor]) – Processes API responses into list of dictionaries
cache_manager (Optional[DataCacheManager]) – Manages the caching of processed records for faster retrieval
cache_requests (Optional[bool]) – Determines whether or not to cache requests - api is the ground truth if not directly specified
cache_results (Optional[bool]) – Determines whether or not to cache processed responses - on by default unless specified or if a cache manager is already provided.
annotate_records (Optional[bool]) – When True, adds record-identifying linkage fields to each extracted record for resolution back to original data after processing or flattening. Adds _extraction_index (position) and _record_id (content hash + index). Default is None (no annotation).
- Returns:
A newly created coordinator that orchestrates record retrieval and processing
- Return type:
SearchCoordinator
- class scholar_flux.api.ThreadedRateLimiter(min_interval: int | float | None = None)[source]
Bases: RateLimiter
Thread-safe version of RateLimiter that can be safely used across multiple threads.
Inherits all functionality from RateLimiter but adds thread synchronization to prevent race conditions when multiple threads access the same limiter instance.
- __init__(min_interval: int | float | None = None) None[source]
Initializes a new ThreadedRateLimiter with thread safety.
- Parameters:
min_interval (Optional[float | int]) – The default minimum interval to wait. Uses default if None
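The general technique behind a thread-safe rate limiter is to hold a lock while computing and sleeping out the remaining interval. The sketch below shows that technique with the standard library only; `ThreadSafeLimiterSketch` is a hypothetical stand-in and makes no claim about how ThreadedRateLimiter is implemented internally.

```python
import threading
import time
from typing import Optional


class ThreadSafeLimiterSketch:
    """Allow at most one call per min_interval seconds across all threads."""

    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = float(min_interval)
        self._lock = threading.Lock()
        self._last_call: Optional[float] = None

    def wait(self, min_interval: Optional[float] = None) -> None:
        interval = self.min_interval if min_interval is None else float(min_interval)
        # Holding the lock while sleeping makes "check elapsed time, sleep,
        # record timestamp" a single atomic step, so threads queue up in turn.
        with self._lock:
            if self._last_call is not None:
                remaining = interval - (time.monotonic() - self._last_call)
                if remaining > 0:
                    time.sleep(remaining)
            self._last_call = time.monotonic()
```

Without the lock, two threads could both read the same timestamp, both decide that no wait is needed, and send requests back to back.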
- rate(min_interval: float | int, metadata: Dict[str, Any] | None = None) Iterator[Self][source]
Thread-safe version of .rate context manager.
- Parameters:
min_interval (float | int) – The minimum interval to temporarily use during the call
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
- Yields:
Self – The rate limiter with temporarily changed interval
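A context manager that temporarily swaps the interval and restores it afterwards can be sketched as follows. `LimiterSketch` is hypothetical, and holding a re-entrant lock for the whole `with` block is a design assumption of this sketch, not a statement about the library.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class LimiterSketch:
    def __init__(self, min_interval: float = 1.0) -> None:
        self.min_interval = min_interval
        self._lock = threading.RLock()

    @contextmanager
    def rate(self, min_interval: float) -> Iterator["LimiterSketch"]:
        # The lock keeps other threads from observing the temporary value.
        with self._lock:
            previous = self.min_interval
            self.min_interval = min_interval
            try:
                yield self
            finally:
                # Restore the original interval even if the block raises.
                self.min_interval = previous
```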
- sleep(interval: int | float | None = None, metadata: Dict[str, Any] | None = None) None[source]
Thread-safe version of .sleep that prevents race conditions.
This method provides thread-safe access to the sleep functionality by acquiring the internal lock before performing the sleep operation. This ensures that the sleep duration is calculated and executed atomically.
- Parameters:
interval (Optional[float | int]) – Optional interval to sleep for. If None, uses the default interval.
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
- wait(min_interval: int | float | None = None, metadata: Dict[str, Any] | None = None) None[source]
Thread-safe version of the .wait method that prevents race conditions.
- Parameters:
min_interval (Optional[float | int]) – Minimum interval to wait. Uses default if None.
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
- wait_since(min_interval: float | int | None = None, timestamp: float | int | datetime | None = None, metadata: Dict[str, Any] | None = None) None[source]
Thread-safe method for waiting until an interval from a reference timestamp or datetime has passed.
- Parameters:
min_interval (Optional[float | int]) – Minimum interval to wait. Uses default if None.
timestamp (Optional[float | int | datetime]) – Reference time formatted as a Unix timestamp or datetime. If None, sleeps for min_interval.
metadata (Optional[Dict[str, Any]]) – Optional metadata for observability (e.g., url, caller, reason).
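The arithmetic behind waiting relative to a reference time is the interval minus the time already elapsed, floored at zero. A minimal sketch, assuming a helper named `remaining_wait` (hypothetical) that only computes the duration and leaves the actual sleep to the caller:

```python
import time
from datetime import datetime
from typing import Optional, Union


def remaining_wait(
    min_interval: float,
    timestamp: Optional[Union[float, int, datetime]] = None,
    now: Optional[float] = None,
) -> float:
    """Seconds still to wait so that min_interval has passed since timestamp."""
    if timestamp is None:
        # No reference point: wait for the full interval.
        return max(0.0, float(min_interval))
    # Accept either a Unix timestamp or a datetime as the reference.
    reference = timestamp.timestamp() if isinstance(timestamp, datetime) else float(timestamp)
    current = time.time() if now is None else now
    return max(0.0, min_interval - (current - reference))
```

The `now` argument exists only to make the calculation easy to test with fixed values.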
- scholar_flux.api.validate_email(email: str, verbose: bool = True) bool[source]
Uses regex to determine whether the provided value is an email.
- Parameters:
email (str) – The email string to validate
- Returns:
True if the email is valid, and False otherwise
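A regex-based email check might look like the sketch below. The pattern and the name `looks_like_email` are assumptions made for illustration; the library's actual pattern is not shown in this reference and may accept or reject different inputs.

```python
import re

# A deliberately simple pattern: local part, "@", domain, and a dotted suffix.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(email: str) -> bool:
    """Return True when the whole string matches the simple email pattern."""
    return isinstance(email, str) and EMAIL_PATTERN.fullmatch(email) is not None
```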
- scholar_flux.api.validate_url(url: str, verbose: bool = True) bool[source]
Uses urlparse to determine whether the provided value is a URL.
Basic Checks:
Only http:// and https:// schemes are accepted
A URL domain exists after the URL scheme
No whitespace exists in the domain name
Note: Further validation is delegated to request libraries.
- Parameters:
url (str) – The url string to validate
verbose (bool) – Determines whether to log upon encountering invalid URLs
- Returns:
True if the url is valid, and False otherwise
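The three basic checks listed above map directly onto the fields returned by `urllib.parse.urlparse`. The following sketch is an illustration of those checks only; the name `looks_like_url` is hypothetical and the library's function may apply additional rules or logging.

```python
from urllib.parse import urlparse


def looks_like_url(url: str) -> bool:
    """Apply the three basic checks: scheme, domain present, no whitespace in domain."""
    if not isinstance(url, str):
        return False
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse can raise on malformed input such as invalid IPv6 brackets.
        return False
    if parsed.scheme not in ("http", "https"):
        return False
    if not parsed.netloc:
        return False
    if any(ch.isspace() for ch in parsed.netloc):
        return False
    return True
```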