Getting Started
Welcome to ScholarFlux! This tutorial will guide you through installation, configuration, and your first search across academic databases.
Overview
ScholarFlux is a production-grade orchestration layer for academic APIs that enables concurrent multi-provider search with automatic rate limiting and schema normalization. By the end of this tutorial, you’ll be querying multiple scholarly databases with just a few lines of Python.
Prerequisites
Before starting, ensure you have:
Python 3.10 or higher installed
pip or Poetry for package management
Basic familiarity with Python
(Optional) API keys for providers requiring authentication
Note
Most providers (PLOS, arXiv, OpenAlex, Crossref) work out-of-the-box without API keys!
Learning Objectives
By the end of this tutorial, you will:
Install ScholarFlux with the appropriate extras
Configure environment variables and API keys
Execute your first search query
Handle successful and failed searches safely
Retrieve multiple pages of results
Enable caching for better performance
Installation
Basic Installation
Install ScholarFlux using pip:
pip install scholar-flux
This installs the core package with minimal dependencies, sufficient for providers like PLOS, OpenAlex, and Crossref that return JSON responses.
Installation with Extras
For full functionality, install optional dependencies:
# All features (recommended for development)
pip install scholar-flux[parsing,database,cryptography,duckdb]
# XML parsing only (for PubMed, arXiv)
pip install scholar-flux[parsing]
# Database response caching backends (Redis, MongoDB, SQLAlchemy)
pip install scholar-flux[database]
# For DuckDB response caching via sqlalchemy:
pip install scholar-flux[duckdb]
# Encrypted caching support
pip install scholar-flux[cryptography]
When to use which extras:

| Extra | Installs | Required For |
|---|---|---|
| parsing | XML/HTML parsing dependencies (xmltodict, beautifulsoup4) | PubMed, arXiv (XML responses) |
| database | Redis, MongoDB, and SQLAlchemy caching backends | Production caching backends |
| cryptography | Encryption dependencies | Encrypted session caching |
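Before querying an XML provider, you can probe whether the parsing extra's dependencies are importable using only the standard library (a generic sketch, not a ScholarFlux API):

```python
import importlib.util

def has_module(name: str) -> bool:
    """Return True if the named module can be imported."""
    return importlib.util.find_spec(name) is not None

# xmltodict ships with the [parsing] extra and is needed for XML providers
if has_module("xmltodict"):
    print("XML parsing available - PubMed/arXiv responses can be parsed")
else:
    print("Install scholar-flux[parsing] before using PubMed or arXiv")
```

This avoids a confusing parse failure at search time by surfacing the missing extra up front.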
Development Installation
For contributing or running tests:
git clone https://github.com/SammieH21/scholar-flux.git
cd scholar-flux
poetry install --with dev,testing --all-extras
Verifying Installation
Test your installation:
import scholar_flux
print(scholar_flux.__version__)
# Output: 0.5.0
from scholar_flux import SearchCoordinator
# Quick test with PLOS (no API key needed)
coordinator = SearchCoordinator(query="computer science validation strategies", provider_name="plos")
result = coordinator.search_page(page=1)
if result:
    print(f"✅ Installation successful! Retrieved {len(result.data)} records")
else:
    print(f"❌ Search failed: {result.error}")
If you see “✅ Installation successful!”, you’re ready to continue!
Configuration
Environment Variables
ScholarFlux supports configuration via environment variables. Create a .env file in your project root:
# Logging configuration
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_PROPAGATE_LOGS=TRUE
# API keys (optional - only needed for specific providers)
PUBMED_API_KEY=your_pubmed_key_here
SPRINGER_NATURE_API_KEY=your_springer_key_here
CORE_API_KEY=your_core_key_here
# Cache encryption (optional)
SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secret_key_here
Session and Request Defaults
The default behavior for API requests across all providers can also be configured:
# Default User-Agent for all sessions (recommended for production)
SCHOLAR_FLUX_DEFAULT_USER_AGENT=MyApp/1.0 (https://example.com; mailto:contact@example.com)
# Default mailto for Crossref and OpenAlex (enables "polite pool" access)
SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu
Tip
Polite Pool Access: Setting SCHOLAR_FLUX_DEFAULT_MAILTO automatically enables higher rate limits for OpenAlex and Crossref:
OpenAlex: 10 requests/second (vs 1 req/sec without)
Crossref: Priority access and faster responses
In addition to request defaults, you can pre-configure caching backends system-wide:
Cache Backend Defaults
Environment variables can also control the default cache backends used for session requests and response processing:
# Session cache backend (HTTP responses)
# Options: sqlite (default), redis, mongodb, memory, filesystem
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis
# Processing cache backend (parsed data)
# Options: inmemory (default), redis, sql/sqlalchemy/sqlite, mongodb, null
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis
See also
For comprehensive environment configuration, see Production Deployment.
Warning
Never commit .env files to version control! Add .env to your .gitignore.
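One way to enforce the warning above from the command line (a generic shell sketch, assuming it is run from your project root):

```shell
# Add .env to .gitignore, but only if it isn't listed already
grep -qxF ".env" .gitignore 2>/dev/null || echo ".env" >> .gitignore
```

Running it twice is harmless: the grep guard keeps the entry from being duplicated.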
Loading Configuration
Option 1: Automatic loading (recommended)
Create a .env file in your project root. ScholarFlux automatically loads it on import:
import scholar_flux # Automatically loads .env
Option 2: Explicit initialization
For custom configuration paths:
from scholar_flux import initialize_package
initialize_package(
    config_params={'enable_logging': True, 'log_level': 'DEBUG'},
    env_path='path/to/custom/.env'
)
Option 3: Direct environment variables
Set environment variables directly (useful for containers):
export SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
export SCHOLAR_FLUX_LOG_LEVEL=DEBUG
export PUBMED_API_KEY=your_key_here
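To confirm these variables are actually visible to your Python process, a quick standard-library check (generic Python; ScholarFlux reads them itself on import):

```python
import os

# Fall back to the documented default when a variable is unset
log_level = os.environ.get("SCHOLAR_FLUX_LOG_LEVEL", "INFO")
pubmed_key = os.environ.get("PUBMED_API_KEY")  # None when not configured

print(f"Log level: {log_level}")
print("PubMed key configured" if pubmed_key else "PubMed key not set (still fine for most providers)")
```

This is especially handy inside containers, where a missing `-e` flag or env file is a common source of silent misconfiguration.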
API Key Setup
Providers requiring API keys
While most APIs work out of the box, some require an API key for any use (Springer Nature), and others benefit from one through higher rate limits (PubMed and CORE):
| Provider | API Key Needed | How to Obtain |
|---|---|---|
| PLOS | No | Works out-of-the-box |
| arXiv | No | Works out-of-the-box |
| OpenAlex | No | Optional |
| Crossref | No | Optional |
| PubMed | No (Optional) | NCBI account settings (see below) |
| CORE | No (Optional) | CORE API registration (see below) |
| Springer Nature | ✅ Yes | |
PubMed API Key Setup
While PubMed doesn’t require an API key, having one can increase rate limits from 3 requests per second to 10 requests per second (as of 2026).
Create an NCBI account: https://www.ncbi.nlm.nih.gov/account/
Navigate to Settings → API Key Management
Generate a new API key
Export your PubMed API key as an environment variable or add it to a .env file (see the Configuration section above)
CORE API Key Setup
Similarly, the CORE API doesn't require an API key, but having one greatly increases rate limits, which matters for batch requests.
Create a CORE account: https://core.ac.uk/services/api
Navigate to Register Now and select either Academic, Non-Academic, or Personal Use depending on your affiliation
Check your email for a new API key
Export your CORE API key as an environment variable or add it to a .env file (see the Configuration section above)
CORE_API_KEY=your_key_here
Verify:
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="human psychology", provider_name="pubmed")
result = coordinator.search_page(page=1)
if coordinator.api.api_key and result:
    print(f"✅ PubMed API key working! Retrieved {result.record_count} records!")
Your First Search
Single-Provider Search
Let’s search PLOS for articles about machine learning:
from scholar_flux import SearchCoordinator
# Create a coordinator for PLOS
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos"
)
# Execute search for page 1
result = coordinator.search_page(page=1)
# Check if search was successful
if result:
    print(f"Found {len(result.data)} records")

    # Access the first record
    first_record = result.data[0]
    print(f"\nTitle: {first_record.get('title_display')}")
    print(f"DOI: {first_record.get('id')}")
    print(f"Journal: {first_record.get('journal')}")
else:
    print(f"Search failed: {result.error} - {result.message}")
Expected output:
Found 50 records
Title: Deep learning applications in medical image analysis
DOI: 10.1371/journal.pone.0212345
Journal: PLOS ONE
Understanding the Response
The coordinator.search_page() method returns a SearchResult container with search metadata (query, provider_name, page) and a response_result attribute.
SearchResult is truthy when the search succeeds and falsy when it fails, making error checking simple:
result = coordinator.search_page(page=1)
if result:
    # Success - access data safely
    print(f"Found {len(result.data)} records")
    for record in result.data[:3]:
        print(f"Title: {record.get('title_display')}")
else:
    # Failure - diagnostic info always available
    print(f"Error: {result.error} - {result.message}")
    print(f"Provider: {result.provider_name}, Page: {result.page}")
What’s in a SearchResult:
response: The raw response received from an API
processed_records: List of records (dictionaries) after processing
data: An alias for processed_records, containing a list of records after processing
extracted_records: List of records (dictionaries) after parsing but before processing
metadata: Provider-specific info (total results, page size, etc.)
parsed_response: The response data after parsing with JSON, XML, or YAML
query: Your search query
provider_name: The provider that was queried
page: The page number requested
response_result: The underlying response object (ProcessedResponse, ErrorResponse, or NonResponse) after response processing
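The truthy/falsy convention above can be mimicked with a minimal stand-in class; this is a conceptual illustration only (MiniResult is hypothetical, not ScholarFlux's implementation):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MiniResult:
    """Toy stand-in mirroring the SearchResult success/failure contract."""
    data: Optional[list] = None
    error: Optional[str] = None

    def __bool__(self) -> bool:
        # Truthy only when records were retrieved without error
        return self.error is None and self.data is not None

ok = MiniResult(data=[{"title": "A"}, {"title": "B"}])
failed = MiniResult(error="APIRequestFailed")

assert ok and not failed
print(len(ok.data))       # record count on success
print(failed.data or [])  # safe default when iterating a failure
```

Defining `__bool__` on the container is what lets `if result:` double as both a success check and a guard against `None` data.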
Tip
For detailed information on response types, error handling patterns, and the search() method, see Response Handling and Error Patterns.
Retrieving Multiple Pages
Sequential Page Retrieval
Retrieve multiple pages one at a time:
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")
# Retrieve pages 1-5
for page_num in range(1, 6):
    result = coordinator.search_page(page=page_num)
    if result:
        print(f"Page {page_num}: {len(result.data)} records")
    else:
        print(f"Page {page_num} failed: {result.error}")
        break  # Stop on first error
Expected output:
Page 1: 50 records
Page 2: 50 records
Page 3: 50 records
Page 4: 50 records
Page 5: 50 records
Batch Page Retrieval
Retrieve multiple pages in one call using search_pages():
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")
# Retrieve pages 1-5 in one call
results = coordinator.search_pages(pages=range(1, 6))
# Results is a SearchResultList
print(f"Retrieved {len(results)} pages")
# Filter successful responses
successful = results.filter()
print(f"Success rate: {len(successful)}/{len(results)}")
# Combine all records into a single list
all_records = successful.join()
print(f"Total records: {len(all_records)}")
Expected output:
Retrieved 5 pages
Success rate: 5/5
Total records: 250
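The filter()/join() behavior can be pictured with a small list subclass (Page and MiniResultList here are hypothetical stand-ins, not the library's classes):

```python
class Page:
    """Toy stand-in for one page's result: truthy on success."""
    def __init__(self, data=None, page=1):
        self.data = data
        self.page = page

    def __bool__(self):
        return self.data is not None

class MiniResultList(list):
    """Toy stand-in sketching SearchResultList.filter()/join() semantics."""
    def filter(self):
        # Keep only the successful (truthy) pages
        return MiniResultList(r for r in self if r)

    def join(self):
        # Flatten every page's records into one combined list
        return [record for r in self for record in (r.data or [])]

pages = MiniResultList([Page(["r1", "r2"], 1), Page(None, 2), Page(["r3"], 3)])
good = pages.filter()
print(f"Success rate: {len(good)}/{len(pages)}")  # Success rate: 2/3
print(good.join())                                # ['r1', 'r2', 'r3']
```

Because `filter()` returns the same list type, the two calls chain naturally: `results.filter().join()`.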
Working with SearchResultList
The SearchResultList provides convenient methods:
results = coordinator.search_pages(pages=range(1, 6))
# Filter only successful responses
successful = results.filter()
# Combine all records
all_records = successful.join()
# Convert to pandas DataFrame (requires pandas)
import pandas as pd
df = pd.DataFrame(all_records)
print(df.head())
# Iterate through results
for result in results:
    if result:
        print(f"Page {result.page}: {len(result.data)} records")
Caching Results
Request Caching (Layer 1)
Cache HTTP responses to avoid redundant network requests:
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos",
    use_cache=True  # Enable HTTP response caching
)
# First call: Makes network request
result1 = coordinator.search_page(page=1)
print("First call - from network")
# Second call: Retrieved from cache (instant)
result2 = coordinator.search_page(page=1)
print("Second call - from cache")
Note
By default, use_cache=True uses an in-memory SQLite cache. For production, use Redis or MongoDB.
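Conceptually, request caching amounts to memoizing the fetch call on (query, page). This toy sketch (make_cached_fetch is hypothetical, not ScholarFlux's code) shows why the second identical call never touches the network:

```python
def make_cached_fetch(fetch):
    """Wrap a fetch function with a simple (query, page) cache."""
    cache = {}

    def cached(query, page):
        key = (query, page)
        if key not in cache:
            cache[key] = fetch(query, page)  # network hit only on a miss
        return cache[key]

    return cached

network_calls = []

def fake_fetch(query, page):
    """Pretend network call that records each invocation."""
    network_calls.append((query, page))
    return {"query": query, "page": page, "records": ["r1", "r2"]}

fetch = make_cached_fetch(fake_fetch)
fetch("machine learning", 1)  # first call: goes to the "network"
fetch("machine learning", 1)  # second call: served from cache
print(len(network_calls))     # 1
```

A persistent backend like SQLite or Redis plays the role of the `cache` dict here, so repeated queries survive process restarts.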
Result Caching (Layer 2)
Cache processed results after extraction and transformation:
from scholar_flux import SearchCoordinator, DataCacheManager
# Use Redis for persistent caching
cache_manager = DataCacheManager.with_storage('redis', 'localhost:6379')
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos",
    cache_manager=cache_manager
)
# First call: Processes and caches results
result1 = coordinator.search_page(page=1)
# Second call: Retrieved from processed cache
result2 = coordinator.search_page(page=1)
See also
For advanced caching strategies, see Caching Strategies.
Next Steps
Congratulations! You’ve completed the Getting Started tutorial. You now know how to:
✅ Install ScholarFlux with appropriate extras
✅ Configure environment variables and API keys
✅ Execute searches across academic providers
✅ Handle successful and failed searches safely
✅ Retrieve multiple pages of results
✅ Cache responses for performance
Common Pitfalls
Forgetting to check response validity
❌ Bad:
result = coordinator.search_page(page=1)
for record in result.data:  # May crash if result.data is None (ErrorResponses and NonResponses)!
    print(record)
✅ Good:
result = coordinator.search_page(page=1)
for record in result.data or []:
    print(record)
Using wrong provider names
❌ Bad:
coordinator = SearchCoordinator(query="test", provider_name="pubmed_api") # No provider named "pubmed_api"!
✅ Good:
coordinator = SearchCoordinator(query="test", provider_name="pubmed")
Not installing extras required for specific providers
❌ Bad:
# Basic install without [parsing] extra
coordinator = SearchCoordinator(query="test", provider_name="arxiv")
result = coordinator.search_page(page=1)  # Will fail - arXiv returns XML!
# OUTPUT: ErrorResponse(...)
✅ Good:
pip install scholar-flux[parsing] # Installs xmltodict for XML parsing and beautifulsoup4 for html text parsing
Hardcoding API keys
❌ Bad:
coordinator = SearchCoordinator(
    query="test",
    provider_name="pubmed",
    api_key="abc123xyz"  # Hardcoded - will be committed to git!
)
✅ Good:
# Use .env file
# PUBMED_API_KEY=abc123xyz
coordinator = SearchCoordinator(query="test", provider_name="pubmed")
Where to Go Next
Core Tutorials:
Response Handling and Error Patterns - Response types, error handling, retry configuration
Multi-Provider Search - Query multiple providers concurrently
Schema Normalization - Build ML-ready datasets with consistent schemas
Caching Strategies - Advanced caching with Redis, MongoDB, SQLAlchemy
Advanced Topics:
Workflows - Multi-step retrieval pipelines
Custom Providers - Add new API providers to ScholarFlux
Production Deployment - Deploy ScholarFlux in production
Reference:
Welcome to Scholar Flux’s documentation! - Documentation home
Getting Help
If you encounter issues:
Check the documentation: https://SammieH21.github.io/scholar-flux/
Search existing issues: https://github.com/SammieH21/scholar-flux/issues
Ask a question: Open a new issue with details about your environment
Email: scholar.flux@gmail.com
When reporting issues, include:
ScholarFlux version: import scholar_flux; print(scholar_flux.__version__)
Python version: python --version
Operating system
Minimal code to reproduce the issue
Complete error message
Further Reading
Response Handling and Error Patterns - Response handling and error patterns
Multi-Provider Search - Concurrent multi-provider orchestration
Schema Normalization - Building ML datasets with consistent schemas
SearchCoordinator API reference
SearchAPI API reference
ProcessedResponse API reference