Getting Started
Welcome to ScholarFlux! This tutorial will guide you through installation, configuration, and your first search across academic databases.
Overview
ScholarFlux is a production-grade orchestration layer for academic APIs that enables concurrent multi-provider search with automatic rate limiting and schema normalization. By the end of this tutorial, you’ll be querying multiple scholarly databases with just a few lines of Python.
Prerequisites
Before starting, ensure you have:
Python 3.10 or higher installed
pip or Poetry for package management
Basic familiarity with Python
(Optional) API keys for providers requiring authentication
Note
Most providers (PLOS, arXiv, OpenAlex, Crossref) work out-of-the-box without API keys!
Learning Objectives
By the end of this tutorial, you will:
Install ScholarFlux with the appropriate extras
Configure environment variables and API keys
Execute your first search query
Handle successful and failed searches safely
Retrieve multiple pages of results
Enable caching for better performance
Installation
Basic Installation
Install ScholarFlux using pip:
pip install scholar-flux
This installs the core package with minimal dependencies, sufficient for providers like PLOS, OpenAlex, and Crossref that return JSON responses.
Installation with Extras
For full functionality, install optional dependencies:
# All features (recommended for development)
pip install scholar-flux[parsing,database,cryptography,duckdb]
# XML parsing only (for PubMed, arXiv)
pip install scholar-flux[parsing]
# Database response caching backends (Redis, MongoDB, SQLAlchemy)
pip install scholar-flux[database]
# For DuckDB response caching via sqlalchemy:
pip install scholar-flux[duckdb]
# Encrypted caching support
pip install scholar-flux[cryptography]
When to use which extras:

| Extra | Installs | Required For |
|---|---|---|
| parsing | XML/HTML parsing dependencies (xmltodict, beautifulsoup4) | PubMed, arXiv (XML responses) |
| database | Redis, MongoDB, and SQLAlchemy caching backends | Production caching backends |
| cryptography | Encryption dependencies | Encrypted session caching |
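Before querying an XML provider, you can probe whether the parsing extra's dependencies are importable using only the standard library (a generic sketch, not a ScholarFlux API):

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if the named module can be imported."""
    return importlib.util.find_spec(name) is not None

# xmltodict ships with the [parsing] extra and is needed for XML providers
if has_module("xmltodict"):
    print("XML parsing available - PubMed/arXiv responses can be parsed")
else:
    print("Install scholar-flux[parsing] before using PubMed or arXiv")
```

This avoids a confusing parse failure at search time by surfacing the missing extra up front.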
Development Installation
For contributing or running tests:
git clone https://github.com/SammieH21/scholar-flux.git
cd scholar-flux
poetry install --with dev,testing --all-extras
Verifying Installation
Test your installation:
import scholar_flux
print(scholar_flux.__version__)
# Output: 0.5.0
from scholar_flux import SearchCoordinator
# Quick test with PLOS (no API key needed)
coordinator = SearchCoordinator(query="computer science validation strategies", provider_name="plos")
result = coordinator.search_page(page=1)
if result:
    print(f"✅ Installation successful! Retrieved {len(result.data)} records")
else:
    print(f"❌ Search failed: {result.error}")
If you see “✅ Installation successful!”, you’re ready to continue!
Configuration
Environment Variables
ScholarFlux supports configuration via environment variables. Create a .env file in your project root:
# Logging configuration
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_PROPAGATE_LOGS=TRUE
# API keys (optional - only needed for specific providers)
PUBMED_API_KEY=your_pubmed_key_here
SPRINGER_NATURE_API_KEY=your_springer_key_here
CORE_API_KEY=your_core_key_here
# Cache encryption (optional)
SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secret_key_here
Session and Request Defaults
The default behavior for API requests across all providers can also be configured:
# Default User-Agent for all sessions (recommended for production)
SCHOLAR_FLUX_DEFAULT_USER_AGENT=MyApp/1.0 (https://example.com; mailto:contact@example.com)
# Default mailto for Crossref and OpenAlex (enables "polite pool" access)
SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu
Tip
Polite Pool Access: Setting SCHOLAR_FLUX_DEFAULT_MAILTO automatically enables higher rate limits for OpenAlex and Crossref:
OpenAlex: 10 requests/second (vs 1 req/sec without)
Crossref: Priority access and faster responses
In addition to request defaults, you can pre-configure caching backends system-wide:
Cache Backend Defaults
Environment variables can also control the default cache backends used for session requests and response processing:
# Session cache backend (HTTP responses)
# Options: sqlite (default), redis, mongodb, memory, filesystem
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis
# Processing cache backend (parsed data)
# Options: inmemory (default), redis, sql/sqlalchemy/sqlite, mongodb, null
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis
See also
For comprehensive environment configuration, see Production Deployment.
Warning
Never commit .env files to version control! Add .env to your .gitignore.
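One way to enforce the warning above from the command line (a generic shell sketch, assuming it is run from your project root):

```shell
# Add .env to .gitignore, but only if it isn't listed already
grep -qxF ".env" .gitignore 2>/dev/null || echo ".env" >> .gitignore
```

Running it twice is harmless: the grep guard keeps the entry from being duplicated.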
Loading Configuration
Option 1: Automatic loading (recommended)
Create a .env file in your project root. ScholarFlux automatically loads it on import:
import scholar_flux # Automatically loads .env
Option 2: Explicit initialization
For custom configuration paths:
from scholar_flux import initialize_package
initialize_package(
    config_params={'enable_logging': True, 'log_level': 'DEBUG'},
    env_path='path/to/custom/.env'
)
Option 3: Direct environment variables
Set environment variables directly (useful for containers):
export SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
export SCHOLAR_FLUX_LOG_LEVEL=DEBUG
export PUBMED_API_KEY=your_key_here
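To confirm these variables are actually visible to your Python process, a quick standard-library check (generic Python; ScholarFlux reads them itself on import):

```python
import os

# Fall back to the documented default when a variable is unset
log_level = os.environ.get("SCHOLAR_FLUX_LOG_LEVEL", "INFO")
pubmed_key = os.environ.get("PUBMED_API_KEY")  # None when not configured

print(f"Log level: {log_level}")
print("PubMed key configured" if pubmed_key else "PubMed key not set (still fine for most providers)")
```

This is especially handy inside containers, where a missing `-e` flag or env file is a common source of silent misconfiguration.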
API Key Setup
Providers requiring API keys
While most APIs work out of the box, some require an API key for any use (Springer Nature), and others benefit from one through higher rate limits (PubMed and CORE):
| Provider | API Key Needed | How to Obtain |
|---|---|---|
| PLOS | No | Works out-of-the-box |
| arXiv | No | Works out-of-the-box |
| OpenAlex | No | Optional |
| Crossref | No | Optional |
| PubMed | No (Optional) | NCBI account settings (see below) |
| CORE | No (Optional) | CORE API registration (see below) |
| Springer Nature | ✅ Yes | |
PubMed API Key Setup
While PubMed doesn’t require an API key, having one can increase rate limits from 3 requests per second to 10 requests per second (as of 2026).
Create an NCBI account: https://www.ncbi.nlm.nih.gov/account/
Navigate to Settings → API Key Management
Generate a new API key
Export your PubMed API key as an environment variable or add it to a .env file (see the Configuration section above)
CORE API Key Setup
Similarly, the CORE API doesn't require an API key, but having one greatly increases rate limits, which matters for batch requests.
Create a CORE account: https://core.ac.uk/services/api
Navigate to Register Now and select either Academic, Non-Academic, or Personal Use depending on your affiliation
Check your email for a new API key
Export your CORE API key as an environment variable or add it to a .env file (see the Configuration section above)
CORE_API_KEY=your_key_here
Verify:
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="human psychology", provider_name="pubmed")
result = coordinator.search_page(page=1)
if coordinator.api.api_key and result:
    print(f"✅ PubMed API key working! Retrieved {result.record_count} records!")
Your First Search
Single-Provider Search
Let’s search PLOS for articles about machine learning:
from scholar_flux import SearchCoordinator
# Create a coordinator for PLOS
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos"
)
# Execute search for page 1
result = coordinator.search_page(page=1)
# Check if search was successful
if result:
    print(f"Found {len(result.data)} records")

    # Access the first record
    first_record = result.data[0]
    print(f"\nTitle: {first_record.get('title_display')}")
    print(f"DOI: {first_record.get('id')}")
    print(f"Journal: {first_record.get('journal')}")
else:
    print(f"Search failed: {result.error} - {result.message}")
Expected output:
Found 50 records
Title: Deep learning applications in medical image analysis
DOI: 10.1371/journal.pone.0212345
Journal: PLOS ONE
Understanding the Response
The coordinator.search_page() method returns a SearchResult container with search metadata (query, provider_name, page) and a response_result attribute.
SearchResult is truthy when the search succeeds and falsy when it fails, making error checking simple:
result = coordinator.search_page(page=1)
if result:
    # Success - access data safely
    print(f"Found {len(result.data)} records")
    for record in result.data[:3]:
        print(f"Title: {record.get('title_display')}")
else:
    # Failure - diagnostic info always available
    print(f"Error: {result.error} - {result.message}")
    print(f"Provider: {result.provider_name}, Page: {result.page}")
What’s in a SearchResult:
response: The raw response received from an API
processed_records: List of records (dictionaries) after processing
data: An alias for processed_records, containing a list of records after processing
extracted_records: List of records (dictionaries) after parsing but before processing
metadata: Provider-specific info (total results, page size, etc.)
parsed_response: The response data after parsing with JSON, XML, or YAML
query: Your search query
provider_name: The provider that was queried
page: The page number requested
response_result: The underlying response object (ProcessedResponse, ErrorResponse, or NonResponse) after response processing
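The truthy/falsy convention above can be mimicked with a minimal stand-in class; this is a conceptual illustration only (MiniResult is hypothetical, not ScholarFlux's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MiniResult:
    """Toy stand-in mirroring the SearchResult success/failure contract."""
    data: Optional[list] = None
    error: Optional[str] = None

    def __bool__(self) -> bool:
        # Truthy only when records were retrieved without error
        return self.error is None and self.data is not None

ok = MiniResult(data=[{"title": "A"}, {"title": "B"}])
failed = MiniResult(error="APIRequestFailed")

assert ok and not failed
print(len(ok.data))       # record count on success
print(failed.data or [])  # safe default when iterating a failure
```

Defining `__bool__` on the container is what lets `if result:` double as both a success check and a guard against `None` data.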
Tip
For detailed information on response types, error handling patterns, and the search() method, see Response Handling and Error Patterns.
Retrieving Multiple Pages
Sequential Page Retrieval
Retrieve multiple pages one at a time:
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")
# Retrieve pages 1-5
for page_num in range(1, 6):
    result = coordinator.search_page(page=page_num)
    if result:
        print(f"Page {page_num}: {len(result.data)} records")
    else:
        print(f"Page {page_num} failed: {result.error}")
        break  # Stop on first error
Expected output:
Page 1: 50 records
Page 2: 50 records
Page 3: 50 records
Page 4: 50 records
Page 5: 50 records
Batch Page Retrieval
Retrieve multiple pages in one call using search_pages():
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")
# Retrieve pages 1-5 in one call
results = coordinator.search_pages(pages=range(1, 6))
# Results is a SearchResultList
print(f"Retrieved {len(results)} pages")
# Filter successful responses
successful = results.filter()
print(f"Success rate: {len(successful)}/{len(results)}")
# Combine all records into a single list
all_records = successful.join()
print(f"Total records: {len(all_records)}")
Expected output:
Retrieved 5 pages
Success rate: 5/5
Total records: 250
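The filter()/join() behavior can be pictured with a small list subclass (Page and MiniResultList here are hypothetical stand-ins, not the library's classes):

```python
class Page:
    """Toy stand-in for one page's result: truthy on success."""
    def __init__(self, data=None, page=1):
        self.data = data
        self.page = page

    def __bool__(self):
        return self.data is not None

class MiniResultList(list):
    """Toy stand-in sketching SearchResultList.filter()/join() semantics."""
    def filter(self):
        # Keep only the successful (truthy) pages
        return MiniResultList(r for r in self if r)

    def join(self):
        # Flatten every page's records into one combined list
        return [record for r in self for record in (r.data or [])]

pages = MiniResultList([Page(["r1", "r2"], 1), Page(None, 2), Page(["r3"], 3)])
good = pages.filter()
print(f"Success rate: {len(good)}/{len(pages)}")  # Success rate: 2/3
print(good.join())                                # ['r1', 'r2', 'r3']
```

Because `filter()` returns the same list type, the two calls chain naturally: `results.filter().join()`.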
Working with SearchResultList
The SearchResultList provides convenient methods:
results = coordinator.search_pages(pages=range(1, 6))
# Filter only successful responses
successful = results.filter()
# Combine all records
all_records = successful.join()
# Convert to pandas DataFrame (requires pandas)
import pandas as pd
df = pd.DataFrame(all_records)
print(df.head())
# Iterate through results
for result in results:
    if result:
        print(f"Page {result.page}: {len(result.data)} records")
Caching Results
Request Caching (Layer 1)
Cache HTTP responses to avoid redundant network requests:
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos",
    use_cache=True  # Enable HTTP response caching
)
# First call: Makes network request
result1 = coordinator.search_page(page=1)
print("First call - from network")
# Second call: Retrieved from cache (instant)
result2 = coordinator.search_page(page=1)
print("Second call - from cache")
Note
By default, use_cache=True uses an in-memory SQLite cache. For production, use Redis or MongoDB.
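Conceptually, request caching amounts to memoizing the fetch call on (query, page). This toy sketch (make_cached_fetch is hypothetical, not ScholarFlux's code) shows why the second identical call never touches the network:

```python
def make_cached_fetch(fetch):
    """Wrap a fetch function with a simple (query, page) cache."""
    cache = {}

    def cached(query, page):
        key = (query, page)
        if key not in cache:
            cache[key] = fetch(query, page)  # network hit only on a miss
        return cache[key]

    return cached

network_calls = []

def fake_fetch(query, page):
    """Pretend network call that records each invocation."""
    network_calls.append((query, page))
    return {"query": query, "page": page, "records": ["r1", "r2"]}

fetch = make_cached_fetch(fake_fetch)
fetch("machine learning", 1)  # first call: goes to the "network"
fetch("machine learning", 1)  # second call: served from cache
print(len(network_calls))     # 1
```

A persistent backend like SQLite or Redis plays the role of the `cache` dict here, so repeated queries survive process restarts.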
Result Caching (Layer 2)
Cache processed results after extraction and transformation:
from scholar_flux import SearchCoordinator, DataCacheManager
# Use Redis for persistent caching
cache_manager = DataCacheManager.with_storage('redis', 'localhost:6379')
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos",
    cache_manager=cache_manager
)
# First call: Processes and caches results
result1 = coordinator.search_page(page=1)
# Second call: Retrieved from processed cache
result2 = coordinator.search_page(page=1)
See also
For advanced caching strategies, see Caching Strategies.
Next Steps
Congratulations! You’ve completed the Getting Started tutorial. You now know how to:
✅ Install ScholarFlux with appropriate extras
✅ Configure environment variables and API keys
✅ Execute searches across academic providers
✅ Handle successful and failed searches safely
✅ Retrieve multiple pages of results
✅ Cache responses for performance
Common Pitfalls
Forgetting to check response validity
❌ Bad:
result = coordinator.search_page(page=1)
for record in result.data:  # May crash if result.data is None (ErrorResponses and NonResponses)!
    print(record)
✅ Good:
result = coordinator.search_page(page=1)
for record in result.data or []:
    print(record)
Using wrong provider names
❌ Bad:
coordinator = SearchCoordinator(query="test", provider_name="pubmed_api") # No provider named "pubmed_api"!
✅ Good:
coordinator = SearchCoordinator(query="test", provider_name="pubmed")
Not installing extras required for specific providers
❌ Bad:
# Basic install without [parsing] extra
coordinator = SearchCoordinator(query="test", provider_name="arxiv")
result = coordinator.search_page(page=1)  # Will fail - arXiv returns XML!
# OUTPUT: ErrorResponse(...)
✅ Good:
pip install scholar-flux[parsing] # Installs xmltodict for XML parsing and beautifulsoup4 for html text parsing
Hardcoding API keys
❌ Bad:
coordinator = SearchCoordinator(
    query="test",
    provider_name="pubmed",
    api_key="abc123xyz"  # Hardcoded - will be committed to git!
)
✅ Good:
# Use .env file
# PUBMED_API_KEY=abc123xyz
coordinator = SearchCoordinator(query="test", provider_name="pubmed")
Where to Go Next
Core Tutorials:
Response Handling and Error Patterns - Response types, error handling, retry configuration
Multi-Provider Search - Query multiple providers concurrently
Schema Normalization - Build ML-ready datasets with consistent schemas
Caching Strategies - Advanced caching with Redis, MongoDB, SQLAlchemy
Advanced Topics:
Workflows - Multi-step retrieval pipelines
Custom Providers - Add new API providers to ScholarFlux
Production Deployment - Deploy ScholarFlux in production
Reference:
Welcome to Scholar Flux’s documentation! - Documentation home
Getting Help
If you encounter issues:
Check the documentation: https://SammieH21.github.io/scholar-flux/
Search existing issues: https://github.com/SammieH21/scholar-flux/issues
Ask a question: Open a new issue with details about your environment
Email: scholar.flux@gmail.com
When reporting issues, include:
ScholarFlux version: import scholar_flux; print(scholar_flux.__version__)
Python version: python --version
Operating system
Minimal code to reproduce the issue
Complete error message
Further Reading
Response Handling and Error Patterns - Response handling and error patterns
Multi-Provider Search - Concurrent multi-provider orchestration
Schema Normalization - Building ML datasets with consistent schemas
SearchCoordinator API reference
SearchAPI API reference
ProcessedResponse API reference