Welcome to Scholar Flux’s documentation!
Scholar Flux is a Python library for searching and processing academic articles from multiple providers with built-in caching and data management capabilities.
Tutorials:
- Getting Started
- Response Handling and Error Patterns
  - Response Access Patterns
  - Response Types in Detail
  - Error Handling Patterns
  - Built-In Retry System
  - Practical Examples
- Multi-Provider Search
  - Overview
  - Quick Start
  - Understanding Multi-Provider Architecture
  - Working with Results
  - Rate Limiting
  - Real-World Example: Systematic Literature Review
  - Best Practices
  - Troubleshooting
  - Next Steps
- Schema Normalization
  - Overview
  - Basic Normalization
  - Understanding Universal Fields
  - Advanced Normalization
  - Working with DataFrames
  - Creating Custom Field Maps
  - Best Practices
  - Next Steps
- Real-World Use Cases
- Caching Strategies
  - Prerequisites
  - Understanding the Two Caches
  - Quick Start Patterns
  - Choosing a Storage Backend
  - Using Namespaces
  - Encrypted Session Caching
  - Monitoring Cache Behavior
  - Practical Examples
  - Multi-Provider Parallel Searches
  - Cache Invalidation
  - Time-To-Live (TTL) Strategies
  - Troubleshooting
  - Best Practices
  - Further Reading
- Workflows
- Custom Providers
- Production Deployment
  - Environment Configuration
  - Docker for Reproducibility
  - Production Patterns
  - Production Use Cases
  - Data Ownership & Citation
  - Security Essentials
  - Best Practices
  - Production Checklist
  - Next Steps
- scholar_flux
Quick Start
Installation
The scholar-flux package is currently in beta and is available for testing and preliminary use. Install Scholar Flux from PyPI:
pip install scholar-flux
Quick Start Example
Here’s a complete example demonstrating Scholar Flux’s core features:
from scholar_flux import SearchAPI, SearchCoordinator, DataCacheManager

# Initialize the API client with requests-cache to cache successful responses
api = SearchAPI.from_defaults(
    query="psychology",
    provider_name='plos',
    use_cache=True
)

# Perform a search and get a response object
response = api.search(page=1)

# Coordinate response retrieval and processing with a single search
# and an in-memory record cache
coordinator = SearchCoordinator(api)

# Or turn off record caching altogether
coordinator = SearchCoordinator(api, cache_results=False)

# Or use sqlalchemy, redis, or mongodb with an optional config
# (assuming a redis server and redis-py are installed)
coordinator = SearchCoordinator(
    api,
    cache_manager=DataCacheManager.with_storage('redis', 'localhost')
)

# Retrieve the previously cached response and process it
processed_response = coordinator.search(page=1)

# Show each record as a flattened dictionary
print(processed_response.data)

# Transform the list of records into a pandas DataFrame
import pandas as pd
record_data_frame = pd.DataFrame(processed_response.data)

# Display the first five records in a table
print(record_data_frame.head(5))

# View each record's metadata
print(processed_response.metadata)

# Search the next page
processed_response_two = coordinator.search(page=2)
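If scholar-flux is not yet installed, the final DataFrame step can be previewed with stand-in data. The example above describes `processed_response.data` as a list of flattened record dictionaries; the field names below are illustrative placeholders, not a provider's actual schema:

```python
import pandas as pd

# Illustrative stand-in for processed_response.data: a list of
# flattened record dictionaries (field names are hypothetical)
records = [
    {"id": "rec-001", "title": "Example article A", "journal": "PLOS ONE"},
    {"id": "rec-002", "title": "Example article B", "journal": "PLOS ONE"},
]

# The same transformation shown in the Quick Start example
record_data_frame = pd.DataFrame(records)
print(record_data_frame.head(5))
```

Each dictionary becomes one row, and each key becomes a column, so downstream filtering and export work with ordinary pandas operations.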
Key Features
- Multiple Provider Support: Search across different academic databases
- Smart Caching: Built-in request caching with requests-cache
- Flexible Storage: In-memory, Redis, MongoDB, or SQLAlchemy backends
- Data Processing: Transform responses into pandas DataFrames
- Response Management: Coordinate searches with automatic caching
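Conceptually, the record cache behaves like a mapping keyed by the search parameters, so repeating a search reuses the stored result instead of contacting the provider again. The sketch below imitates that behavior with a plain dictionary and a hypothetical fetch function; it illustrates the caching idea only, not scholar-flux's actual mechanism:

```python
# Minimal sketch of query/page-keyed caching
# (illustrative only; not scholar-flux's implementation)
cache = {}

def cached_search(query, page, fetch):
    """Return a cached result for (query, page), fetching on a miss."""
    key = (query, page)
    if key not in cache:
        cache[key] = fetch(query, page)  # cache miss: perform the search
    return cache[key]                    # cache hit: reuse the stored result

# Hypothetical fetch function standing in for a provider request
calls = []
def fake_fetch(query, page):
    calls.append((query, page))
    return {"query": query, "page": page, "records": []}

first = cached_search("psychology", 1, fake_fetch)
second = cached_search("psychology", 1, fake_fetch)  # served from cache
print(len(calls))  # the provider was contacted only once
```

Swapping the dictionary for Redis, MongoDB, or a SQLAlchemy-backed store changes where entries live without changing this lookup pattern, which is the role the storage backends listed above play.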
Example Pipelines
Production-quality examples demonstrating AI/ML integration patterns are available in the examples/ directory:
- Retrieval Pipeline Orchestration - Scheduled data preparation with date filtering, deduplication, and Parquet export
- Semantic Similarity Search - Embedding-based interdisciplinary paper discovery with ModernBERT
- Agentic Literature Review - Multi-provider search with LLM classification via PydanticAI
API Reference
For detailed API documentation, see the scholar_flux section.