Welcome to Scholar Flux’s documentation!
Scholar Flux is a Python library for searching and processing academic articles from multiple providers with built-in caching and data management capabilities.
Tutorials:
- Getting Started
- Response Handling and Error Patterns
  - Response Access Patterns
  - Response Types in Detail
  - Error Handling Patterns
  - Built-In Retry System
  - Practical Examples
- Multi-Provider Search
  - Overview
  - Quick Start
  - Understanding Multi-Provider Architecture
  - Working with Results
  - Rate Limiting
  - Real-World Example: Systematic Literature Review
  - Best Practices
  - Troubleshooting
  - Next Steps
- Schema Normalization
  - Overview
  - Basic Normalization
  - Understanding Universal Fields
  - Advanced Normalization
  - Working with DataFrames
  - Creating Custom Field Maps
  - Best Practices
  - Next Steps
- Real-World Use Cases
- Caching Strategies
  - Prerequisites
  - Understanding the Two Caches
  - Quick Start Patterns
  - Choosing a Storage Backend
  - Using Namespaces
  - Encrypted Session Caching
  - Monitoring Cache Behavior
  - Practical Examples
  - Multi-Provider Parallel Searches
  - Cache Invalidation
  - Time-To-Live (TTL) Strategies
  - Troubleshooting
  - Best Practices
  - Further Reading
- Workflows
- Custom Providers
- Production Deployment
  - Environment Configuration
  - Docker for Reproducibility
  - Production Patterns
  - Production Use Cases
  - Data Ownership & Citation
  - Security Essentials
  - Best Practices
  - Production Checklist
  - Next Steps
- scholar_flux
Quick Start
Installation
The scholar-flux package is currently in beta and is available for testing and preliminary use. Install Scholar Flux from PyPI:
pip install scholar-flux
Quick Start Example
Here’s a complete example demonstrating Scholar Flux’s core features:
from scholar_flux import SearchAPI, SearchCoordinator, DataCacheManager

# Initialize the API client with requests-cache to cache successful responses
api = SearchAPI.from_defaults(
    query="psychology",
    provider_name='plos',
    use_cache=True
)

# Perform a search and get a response object
response = api.search(page=1)

# Coordinate response retrieval and processing with a single search
# and an in-memory record cache
coordinator = SearchCoordinator(api)

# Or turn off record caching altogether
coordinator = SearchCoordinator(api, cache_results=False)

# Or use sqlalchemy, redis, or mongodb with an optional config
# (assuming a redis server and redis-py are installed)
coordinator = SearchCoordinator(
    api,
    cache_manager=DataCacheManager.with_storage('redis', 'localhost')
)

# Retrieve the previously cached response and process it
processed_response = coordinator.search(page=1)

# Show each record as a flattened dictionary
print(processed_response.data)

# Transform the list of records into a pandas DataFrame
import pandas as pd
record_data_frame = pd.DataFrame(processed_response.data)

# Display the first five records in a table
print(record_data_frame.head(5))

# View each record's metadata
print(processed_response.metadata)

# Search the next page
processed_response_two = coordinator.search(page=2)
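If scholar-flux is not yet installed, the final DataFrame step can be previewed with stand-in data. The example above describes `processed_response.data` as a list of flattened record dictionaries; the field names below are illustrative placeholders, not a provider's actual schema:

```python
import pandas as pd

# Illustrative stand-in for processed_response.data: a list of
# flattened record dictionaries (field names are hypothetical)
records = [
    {"id": "rec-001", "title": "Example article A", "journal": "PLOS ONE"},
    {"id": "rec-002", "title": "Example article B", "journal": "PLOS ONE"},
]

# The same transformation shown in the Quick Start example
record_data_frame = pd.DataFrame(records)
print(record_data_frame.head(5))
```

Each dictionary becomes one row, and each key becomes a column, so downstream filtering and export work with ordinary pandas operations.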
Key Features
- Multiple Provider Support: Search across different academic databases
- Smart Caching: Built-in request caching with requests-cache
- Flexible Storage: In-memory, Redis, MongoDB, or SQLAlchemy backends
- Data Processing: Transform responses into pandas DataFrames
- Response Management: Coordinate searches with automatic caching
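Conceptually, the record cache behaves like a mapping keyed by the search parameters, so repeating a search reuses the stored result instead of contacting the provider again. The sketch below imitates that behavior with a plain dictionary and a hypothetical fetch function; it illustrates the caching idea only, not scholar-flux's actual mechanism:

```python
# Minimal sketch of query/page-keyed caching
# (illustrative only; not scholar-flux's implementation)
cache = {}

def cached_search(query, page, fetch):
    """Return a cached result for (query, page), fetching on a miss."""
    key = (query, page)
    if key not in cache:
        cache[key] = fetch(query, page)  # cache miss: perform the search
    return cache[key]                    # cache hit: reuse the stored result

# Hypothetical fetch function standing in for a provider request
calls = []
def fake_fetch(query, page):
    calls.append((query, page))
    return {"query": query, "page": page, "records": []}

first = cached_search("psychology", 1, fake_fetch)
second = cached_search("psychology", 1, fake_fetch)  # served from cache
print(len(calls))  # the provider was contacted only once
```

Swapping the dictionary for Redis, MongoDB, or a SQLAlchemy-backed store changes where entries live without changing this lookup pattern, which is the role the storage backends listed above play.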
Example Pipelines
Production-quality examples demonstrating AI/ML integration patterns are available in the examples/ directory:
- Retrieval Pipeline Orchestration - Scheduled data preparation with date filtering, deduplication, and Parquet export
- Semantic Similarity Search - Embedding-based interdisciplinary paper discovery with ModernBERT
- Agentic Literature Review - Multi-provider search with LLM classification via PydanticAI
API Reference
For detailed API documentation, see the scholar_flux section.