Getting Started

Welcome to ScholarFlux! This tutorial will guide you through installation, configuration, and your first search across academic databases.

Overview

ScholarFlux is a production-grade orchestration layer for academic APIs that enables concurrent multi-provider search with automatic rate limiting and schema normalization. By the end of this tutorial, you’ll be querying multiple scholarly databases with just a few lines of Python.

Prerequisites

Before starting, ensure you have:

  • Python 3.10 or higher installed

  • pip or Poetry for package management

  • Basic familiarity with Python

  • (Optional) API keys for providers requiring authentication

Note

Most providers (PLOS, arXiv, OpenAlex, Crossref) work out-of-the-box without API keys!

Learning Objectives

By the end of this tutorial, you will:

  • Install ScholarFlux with the appropriate extras

  • Configure environment variables and API keys

  • Execute your first search query

  • Handle successful and failed searches safely

  • Retrieve multiple pages of results

  • Enable caching for better performance

Installation

Basic Installation

Install ScholarFlux using pip:

pip install scholar-flux

This installs the core package with minimal dependencies, sufficient for providers like PLOS, OpenAlex, and Crossref that return JSON responses.

Installation with Extras

For full functionality, install optional dependencies:

# All features (recommended for development)
pip install scholar-flux[parsing,database,cryptography,duckdb]

# XML parsing only (for PubMed, arXiv)
pip install scholar-flux[parsing]

# Database response caching backends (Redis, MongoDB, SQLAlchemy)
pip install scholar-flux[database]

# For DuckDB response caching via sqlalchemy:
pip install scholar-flux[duckdb]

# Encrypted caching support
pip install scholar-flux[cryptography]

When to use which extras:

Extra          Installs                      Required For
parsing        xmltodict, pyyaml             PubMed, arXiv (XML responses)
database       redis, pymongo, sqlalchemy    Production caching backends
cryptography   cryptography                  Encrypted session caching

Development Installation

For contributing or running tests:

git clone https://github.com/SammieH21/scholar-flux.git
cd scholar-flux
poetry install --with dev,testing --all-extras

Verifying Installation

Test your installation:

import scholar_flux
print(scholar_flux.__version__)
# Output: 0.5.0
from scholar_flux import SearchCoordinator

# Quick test with PLOS (no API key needed)
coordinator = SearchCoordinator(query="computer science validation strategies", provider_name="plos")
result = coordinator.search_page(page=1)

if result:
    print(f"✅ Installation successful! Retrieved {len(result.data)} records")
else:
    print(f"❌ Search failed: {result.error}")

If you see “✅ Installation successful!”, you’re ready to continue!

Configuration

Environment Variables

ScholarFlux supports configuration via environment variables. Create a .env file in your project root:

# Logging configuration
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_PROPAGATE_LOGS=TRUE

# API keys (optional - only needed for specific providers)
PUBMED_API_KEY=your_pubmed_key_here
SPRINGER_NATURE_API_KEY=your_springer_key_here
CORE_API_KEY=your_core_key_here

# Cache encryption (optional)
SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secret_key_here
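If you ever need to load a .env file yourself (for instance in a script that runs before ScholarFlux initializes), no third-party helper is required. This is a minimal sketch of a KEY=VALUE parser, not a full .env implementation; it skips comments and blank lines:

```python
import os

def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env-style file into a dict.

    Comments (#) and blank lines are skipped; the value keeps everything
    after the first '='. A minimal sketch, not a full .env parser.
    """
    values: dict[str, str] = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values

# Apply to the current process without overwriting variables already set:
# for key, value in load_env_file(".env").items():
#     os.environ.setdefault(key, value)
```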

Session and Request Defaults

The default behavior for API requests across all providers can also be configured:

# Default User-Agent for all sessions (recommended for production)
SCHOLAR_FLUX_DEFAULT_USER_AGENT=MyApp/1.0 (https://example.com; mailto:contact@example.com)

# Default mailto for Crossref and OpenAlex (enables "polite pool" access)
SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu

Tip

Polite Pool Access: Setting SCHOLAR_FLUX_DEFAULT_MAILTO automatically enables higher rate limits for OpenAlex and Crossref:

  • OpenAlex: 10 requests/second (vs 1 req/sec without)

  • Crossref: Priority access and faster responses

In addition to request defaults, you can pre-configure caching backends system-wide:

Cache Backend Defaults

Environment variables can also control the default cache backends used for session requests and response processing:

# Session cache backend (HTTP responses)
# Options: sqlite (default), redis, mongodb, memory, filesystem
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis

# Processing cache backend (parsed data)
# Options: inmemory (default), redis, sql/sqlalchemy/sqlite, mongodb, null
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis
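Because unset variables fall back to the documented defaults (sqlite for the session cache, inmemory for processing), you can check what will apply in your environment before starting. A small sketch using only the standard library:

```python
import os

# Mirrors the documented fallbacks when the variables are unset.
session_backend = os.environ.get(
    "SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "sqlite"
)
response_backend = os.environ.get(
    "SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "inmemory"
)
print(f"session cache: {session_backend}, processing cache: {response_backend}")
```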

See also

For comprehensive environment configuration, see Production Deployment.

Warning

Never commit .env files to version control! Add .env to your .gitignore.

Loading Configuration

Option 1: Explicit initialization

For custom configuration paths:

from scholar_flux import initialize_package

initialize_package(
    config_params={'enable_logging': True, 'log_level': 'DEBUG'},
    env_path='path/to/custom/.env'
)

Option 2: Direct environment variables

Set environment variables directly (useful for containers):

export SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
export SCHOLAR_FLUX_LOG_LEVEL=DEBUG
export PUBMED_API_KEY=your_key_here

API Key Setup

Providers requiring API keys

While most APIs work out of the box, some may require an API key for use (Springer Nature) or for higher rate limits (PubMed and CORE API):

Provider         API Key Needed   How to Obtain
PLOS             No               Works out-of-the-box
arXiv            No               Works out-of-the-box
OpenAlex         No               Optional mailto for higher limits
Crossref         No               Optional mailto for higher limits
PubMed           Optional         https://www.ncbi.nlm.nih.gov/account/
CORE             Optional         https://core.ac.uk/services/api
Springer Nature  ✅ Yes           https://dev.springernature.com

PubMed API Key Setup

While PubMed doesn’t require an API key, having one increases rate limits from 3 requests per second to 10 requests per second.

  1. Create an NCBI account: https://www.ncbi.nlm.nih.gov/account/

  2. Navigate to Settings → API Key Management

  3. Generate a new API key

  4. Export your PubMed API key as an environment variable or add it to a .env file (See the configuration section above)
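A common stumbling block with .env files is the variable never reaching your process. Before running a search, you can confirm it loaded:

```python
import os

# os.environ.get returns None when the variable is unset, rather than raising.
pubmed_key = os.environ.get("PUBMED_API_KEY")
if pubmed_key:
    print(f"PUBMED_API_KEY loaded ({len(pubmed_key)} characters)")
else:
    print("PUBMED_API_KEY not set - PubMed still works, just at lower rate limits")
```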

CORE API Key Setup

Similarly, the CORE API doesn’t require an API key, but registering for one substantially raises rate limits, which matters for batch requests:

  1. Create a CORE account: https://core.ac.uk/services/api

  2. Navigate to Register Now and select either Academic, Non-Academic, or Personal Use depending on your affiliation

  3. Check your email for a new API key

  4. Export your CORE API key as an environment variable or add it to a .env file (See the configuration section above)

CORE_API_KEY=your_key_here

  5. Verify:

from scholar_flux import SearchCoordinator

coordinator = SearchCoordinator(query="human psychology", provider_name="core")
result = coordinator.search_page(page=1)

if coordinator.api.api_key and result:
    print(f"✅ CORE API key working! Retrieved {result.record_count} records!")

Retrieving Multiple Pages

Sequential Page Retrieval

Retrieve multiple pages one at a time:

from scholar_flux import SearchCoordinator

coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")

# Retrieve pages 1-5
for page_num in range(1, 6):
    result = coordinator.search_page(page=page_num)

    if result:
        print(f"Page {page_num}: {len(result.data)} records")
    else:
        print(f"Page {page_num} failed: {result.error}")
        break  # Stop on first error

Expected output:

Page 1: 50 records
Page 2: 50 records
Page 3: 50 records
Page 4: 50 records
Page 5: 50 records
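The page count behind output like this is just ceiling division: with PLOS's default of 50 records per page, N records span ceil(N / 50) pages. A small helper, independent of ScholarFlux (the page size here is an assumption you should match to your provider's settings):

```python
import math

def pages_needed(total_records: int, page_size: int = 50) -> int:
    """How many pages of `page_size` records cover `total_records`."""
    return math.ceil(total_records / page_size) if total_records else 0

print(pages_needed(250))  # 5 pages at 50 records each
print(pages_needed(251))  # 6 -- a final, partially filled page
```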

Batch Page Retrieval

Retrieve multiple pages in one call using search_pages():

from scholar_flux import SearchCoordinator

coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")

# Retrieve pages 1-5 in one call
results = coordinator.search_pages(pages=range(1, 6))

# Results is a SearchResultList
print(f"Retrieved {len(results)} pages")

# Filter successful responses
successful = results.filter()
print(f"Success rate: {len(successful)}/{len(results)}")

# Combine all records into a single list
all_records = successful.join()
print(f"Total records: {len(all_records)}")

Expected output:

Retrieved 5 pages
Success rate: 5/5
Total records: 250

Working with SearchResultList

The SearchResultList provides convenient methods:

results = coordinator.search_pages(pages=range(1, 6))

# Filter only successful responses
successful = results.filter()

# Combine all records
all_records = successful.join()

# Convert to pandas DataFrame (requires pandas)
import pandas as pd
df = pd.DataFrame(all_records)
print(df.head())

# Iterate through results
for result in results:
    if result:
        print(f"Page {result.page}: {len(result.data)} records")

Caching Results

Request Caching (Layer 1)

Cache HTTP responses to avoid redundant network requests:

from scholar_flux import SearchCoordinator

coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos",
    use_cache=True  # Enable HTTP response caching
)

# First call: Makes network request
result1 = coordinator.search_page(page=1)
print("First call - from network")

# Second call: Retrieved from cache (instant)
result2 = coordinator.search_page(page=1)
print("Second call - from cache")

Note

By default, use_cache=True uses an in-memory SQLite cache. For production, use Redis or MongoDB.
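The idea behind this caching layer is a lookup keyed by request identity (provider, query, page) before any network call. ScholarFlux delegates this to the configured backend; the toy sketch below just illustrates the mechanism, and its names are illustrative rather than ScholarFlux internals:

```python
def make_cache_key(provider: str, query: str, page: int) -> str:
    """Deterministic key for one page of one query against one provider."""
    return f"{provider}:{query}:{page}"

cache: dict[str, object] = {}

def cached_fetch(provider: str, query: str, page: int, fetch):
    """Return the cached payload if present; otherwise fetch and store it."""
    key = make_cache_key(provider, query, page)
    if key not in cache:
        cache[key] = fetch()
    return cache[key]

calls = 0
def fake_fetch():
    global calls
    calls += 1
    return {"records": ["..."]}

cached_fetch("plos", "machine learning", 1, fake_fetch)
cached_fetch("plos", "machine learning", 1, fake_fetch)
print(calls)  # 1 -- the second call was served from the cache
```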

Result Caching (Layer 2)

Cache processed results after extraction and transformation:

from scholar_flux import SearchCoordinator, DataCacheManager

# Use Redis for persistent caching
cache_manager = DataCacheManager.with_storage('redis', 'localhost:6379')

coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="plos",
    cache_manager=cache_manager
)

# First call: Processes and caches results
result1 = coordinator.search_page(page=1)

# Second call: Retrieved from processed cache
result2 = coordinator.search_page(page=1)

See also

For advanced caching strategies, see Caching Strategies.

Next Steps

Congratulations! You’ve completed the Getting Started tutorial. You now know how to:

✅ Install ScholarFlux with appropriate extras
✅ Configure environment variables and API keys
✅ Execute searches across academic providers
✅ Handle successful and failed searches safely
✅ Retrieve multiple pages of results
✅ Cache responses for performance

Common Pitfalls

  1. Forgetting to check response validity

    ❌ Bad:

    result = coordinator.search_page(page=1)
    for record in result.data:  # May crash if result.data is None (ErrorResponses and NonResponses)!
        print(record)
    

    ✅ Good:

    result = coordinator.search_page(page=1)
    for record in result.data or []:
        print(record)
    
  2. Using wrong provider names

    ❌ Bad:

    coordinator = SearchCoordinator(query="test", provider_name="pubmed_api")
    # No provider named "pubmed_api"!
    

    ✅ Good:

    coordinator = SearchCoordinator(query="test", provider_name="pubmed")
    
  3. Not installing extras required for specific providers

    ❌ Bad:

    # Basic install without [parsing] extra
    coordinator = SearchCoordinator(query="test", provider_name="arxiv")
    result = coordinator.search_page(page=1)  # Will fail - arXiv returns XML!
    # OUTPUT: ErrorResponse(...)
    

    ✅ Good:

    pip install scholar-flux[parsing]  # Installs xmltodict for XML parsing and beautifulsoup4 for HTML text parsing
    
  4. Hardcoding API keys

    ❌ Bad:

    coordinator = SearchCoordinator(
        query="test",
        provider_name="pubmed",
        api_key="abc123xyz"  # Hardcoded - will be committed to git!
    )
    

    ✅ Good:

    # Use .env file
    # PUBMED_API_KEY=abc123xyz
    coordinator = SearchCoordinator(query="test", provider_name="pubmed")
    

Where to Go Next

Core Tutorials:

Advanced Topics:

Reference:

Getting Help

If you encounter issues:

  1. Check the documentation: https://SammieH21.github.io/scholar-flux/

  2. Search existing issues: https://github.com/SammieH21/scholar-flux/issues

  3. Ask a question: Open a new issue with details about your environment

  4. Email: scholar.flux@gmail.com

When reporting issues, include:

  • ScholarFlux version: import scholar_flux; print(scholar_flux.__version__)

  • Python version: python --version

  • Operating system

  • Minimal code to reproduce the issue

  • Complete error message
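The details above can be gathered into one pasteable block from Python itself. A small convenience sketch (not part of ScholarFlux):

```python
import platform
import sys

def environment_report() -> str:
    """Collect version details for a bug report into one pasteable block."""
    lines = [
        f"Python: {sys.version.split()[0]}",
        f"OS: {platform.platform()}",
    ]
    try:
        import scholar_flux
        lines.append(f"ScholarFlux: {scholar_flux.__version__}")
    except ImportError:
        lines.append("ScholarFlux: not installed")
    return "\n".join(lines)

print(environment_report())
```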

Further Reading