Production Deployment

This guide covers essential configuration and patterns for deploying ScholarFlux in production environments for machine learning and data engineering workflows.

Overview

ScholarFlux is designed for production-grade data collection from academic APIs. This MVP guide focuses on:

  • ML/Data Engineering: Building training datasets, systematic reviews, longitudinal monitoring

  • Reproducibility: Environment configuration and containerization

  • Essential patterns: Caching, concurrency, and security basics

Note

ScholarFlux is currently beta (v0.5.0). Test thoroughly before production deployment and monitor the GitHub repository for updates.

Prerequisites

Environment Configuration

Configuration System

ScholarFlux uses a hierarchical configuration system with the following priority (highest first):

  1. Explicit parameters in code

  2. Environment variables set in the OS

  3. ``.env`` file (auto-loaded from $SCHOLAR_FLUX_HOME or fallback locations)

  4. Default values

See Getting Started for basic configuration setup.

Production Environment Variables

With SCHOLAR_FLUX_HOME set, create a .env file at $SCHOLAR_FLUX_HOME/.env:

Core Configuration

Based on scholar_flux.utils.config_loader:

# Logging (auto-uses $SCHOLAR_FLUX_HOME/logs/ if SCHOLAR_FLUX_HOME is set)
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO              # DEBUG, INFO, WARNING, ERROR, or CRITICAL
SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE        # Set to FALSE for production
SCHOLAR_FLUX_LOG_STREAM=STDERR           # STDERR, STDOUT, or FALSE (disables log streaming)

# Optional: Override the log directory (otherwise uses $SCHOLAR_FLUX_HOME/logs/)
# SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux

# Optional: Override the cache directory (otherwise uses $SCHOLAR_FLUX_HOME/package_cache/)
# SCHOLAR_FLUX_CACHE_DIRECTORY=/var/cache/scholar-flux

# Optional: Override the session cache name for specific backends (otherwise uses search_requests_cache)
# SCHOLAR_FLUX_SESSION_CACHE_NAME=session_cache_storage

# Cache encryption (generate secure random key)
SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secure_random_key_here
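One way to generate a suitable random value is Python's standard-library secrets module (a sketch; check ScholarFlux's security documentation in case a specific key format, such as a Fernet-compatible key, is required):

```python
import secrets

# Generate a URL-safe random key with 32 bytes of entropy for cache encryption.
# Assumes ScholarFlux accepts an arbitrary high-entropy string; verify the
# expected key format against the package's security docs.
key = secrets.token_urlsafe(32)
print(f"SCHOLAR_FLUX_CACHE_SECRET_KEY={key}")
```

Store the generated value in the .env file rather than in code or shell history.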

Note

If SCHOLAR_FLUX_HOME is set, you typically don’t need to set SCHOLAR_FLUX_LOG_DIRECTORY or SCHOLAR_FLUX_CACHE_DIRECTORY explicitly. They default to subdirectories within $SCHOLAR_FLUX_HOME.

API Provider Keys

# Required for specific providers
PUBMED_API_KEY=<insert_your_key>
SPRINGER_NATURE_API_KEY=<insert_your_key>
CORE_API_KEY=<insert_your_key>

# Optional (some providers don't require keys)
ARXIV_API_KEY=<insert_your_key>
OPEN_ALEX_API_KEY=<insert_your_key>
CROSSREF_API_KEY=<insert_your_key>

Session and Request Defaults

Configure default behavior for API requests across all providers:

# Default User-Agent for all sessions (recommended for production)
SCHOLAR_FLUX_DEFAULT_USER_AGENT=MyApp/1.0 (https://example.com; mailto:contact@example.com)

# Default mailto for Crossref and OpenAlex (enables "polite pool" access)
SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu

Tip

Polite Pool Access: Setting SCHOLAR_FLUX_DEFAULT_MAILTO with a valid email automatically enables higher rate limits for Crossref and OpenAlex:

  • OpenAlex: 10 requests/second with mailto (vs. 1 req/sec without) — a 10x improvement

  • Crossref: Priority access and faster responses for identified users

As of v0.3.1, ScholarFlux reduced the default OpenAlex request_delay from 6s to 1s to align with their documented rate limits. Combined with mailto, this significantly improves the default throughput for OpenAlex queries.

These variables are read automatically when creating sessions and search coordinators, eliminating the need to specify them in code:

from scholar_flux import SearchCoordinator

# Without environment variables - must specify each time
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="crossref",
    mailto="your.email@institution.edu"
)

# With SCHOLAR_FLUX_DEFAULT_MAILTO set - automatically applied
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="crossref"
)
# mailto is automatically read from environment

Cache Backend Configuration

Configure default cache backends via environment variables. This eliminates the need to specify backends in code:

# Layer 1: HTTP Response Cache (CachedSessionManager)
# Controls the default backend for requests-cache session caching
# Options: sqlite (default), redis, mongodb, memory, filesystem, gridfs, dynamodb
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis

# Layer 2: Processed Response Cache (DataCacheManager)
# Controls the default storage for processed API response caching
# Options: inmemory (default), redis, sql/sqlalchemy, mongodb, null
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis

# Redis connection (used by both layers when redis backend is selected)
SCHOLAR_FLUX_REDIS_HOST=localhost        # or REDIS_HOST
SCHOLAR_FLUX_REDIS_PORT=6379             # or REDIS_PORT

# MongoDB connection (alternative)
SCHOLAR_FLUX_MONGODB_HOST=mongodb://127.0.0.1   # or MONGODB_HOST
SCHOLAR_FLUX_MONGODB_PORT=27017                 # or MONGODB_PORT

# Optional: Use SQLite, DuckDB, or another SQLAlchemy flavor for caching processed response data
# SCHOLAR_FLUX_SQLALCHEMY_URL=<file path or SQLAlchemy URL>

# Default provider (optional)
SCHOLAR_FLUX_DEFAULT_PROVIDER=plos

With these environment variables set, cache backends are configured automatically:

from scholar_flux import SearchCoordinator
from scholar_flux.sessions import CachedSessionManager
from scholar_flux.data_storage import DataCacheManager

# Without environment variables - must specify backends explicitly
session_manager = CachedSessionManager(backend='redis')
cache_manager = DataCacheManager.with_storage('redis')

# With SCHOLAR_FLUX_DEFAULT_*_CACHE_* variables set - automatic configuration
session_manager = CachedSessionManager()  # Uses SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND
cache_manager = DataCacheManager.from_defaults()  # Uses SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE

# SearchCoordinator also respects these defaults
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="pubmed",
    session=session_manager(),
    cache_manager=cache_manager
)

Tip

ScholarFlux accepts both prefixed (SCHOLAR_FLUX_*) and unprefixed (REDIS_HOST) variables for cache backends, prioritizing the prefixed version.

Note

ScholarFlux builds on requests-cache, adding strict validation for cache TTL (expire_after), storage backend availability, and other common session cache parameters. While powerful, requests-cache itself performs minimal input validation and may silently accept invalid TTLs or malformed values. CachedSessionManager raises clear errors for unsupported types, negative values (other than -1 for “no expiration”), and malformed strings before constructing a CachedSession, so misconfiguration surfaces immediately rather than mid-request. This strict validation applies to both the session cache (expire_after) and the processed response cache (ttl for Redis/MongoDB), ensuring that bad inputs fail fast instead of propagating into orchestrated searches later.

Runtime Configuration

For scenarios where environment variables aren’t suitable (e.g., dynamic configuration, testing), use config_settings to configure defaults at runtime:

from scholar_flux.utils import config_settings

# Set defaults programmatically (equivalent to environment variables)
config_settings.set("SCHOLAR_FLUX_DEFAULT_USER_AGENT", "MyResearchApp/1.0")
config_settings.set("SCHOLAR_FLUX_DEFAULT_PROVIDER", "Crossref")
config_settings.set("SCHOLAR_FLUX_DEFAULT_MAILTO", "researcher@university.edu")
config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "redis")
config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "redis")

# Read configuration (checks config dict first, then falls back to environment)
user_agent = config_settings.get("SCHOLAR_FLUX_DEFAULT_USER_AGENT")
cache_backend = config_settings.get("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", default="sqlite")

# Now all SearchCoordinators will use these defaults to search for academic records via Crossref
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="test")
# Automatically uses the configured user_agent and mailto

Priority order for configuration:

  1. Explicit parameters passed to constructors

  2. Values set via config_settings.set()

  3. Environment variables (from OS or .env file)

  4. Built-in defaults

This pattern is useful for:

  • Testing: Override production settings without changing environment

  • Multi-tenant applications: Different configurations per request

  • Dynamic configuration: Change settings based on runtime conditions
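For the testing use case, environment overrides can also be scoped with the standard library so production settings are restored automatically when the test ends (a sketch using unittest.mock; variable names follow the ones above):

```python
import os
from unittest.mock import patch

# Temporarily override a ScholarFlux setting; the original environment is
# restored automatically when the context manager exits.
with patch.dict(os.environ, {"SCHOLAR_FLUX_DEFAULT_PROVIDER": "arxiv"}):
    assert os.environ["SCHOLAR_FLUX_DEFAULT_PROVIDER"] == "arxiv"
    # ... create a SearchCoordinator here; it sees the overridden value ...
# Outside the block, the previous value (or its absence) is restored.
```

Note that, per the priority order above, values already set via config_settings.set() take precedence over environment variables, so clear those first if a test needs to exercise environment-driven behavior.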

Loading Configuration

Custom Configuration Path

For custom .env locations:

from scholar_flux import initialize_package

initialize_package(
    env_path="/etc/scholar-flux/.env.production",
    config_params={'enable_logging': True, 'log_level': 'INFO'}
)

Validation

Validate required secrets on startup:

import os

required_secrets = ['PUBMED_API_KEY', 'REDIS_HOST', 'SCHOLAR_FLUX_CACHE_SECRET_KEY']
missing = [key for key in required_secrets if not os.getenv(key)]

if missing:
    raise EnvironmentError(f"Missing required configuration: {missing}")

Docker for Reproducibility

Using SCHOLAR_FLUX_HOME in Docker

The recommended approach is to mount a volume to SCHOLAR_FLUX_HOME:

# Create host directory
mkdir -p /opt/scholar-flux

# Run container with volume mount
docker run \
  -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
  -v /opt/scholar-flux:/app/scholar-flux \
  scholar-flux-app

This mounts the host’s /opt/scholar-flux to the container’s /app/scholar-flux, making .env, cache, and logs persist across container restarts.

Basic Dockerfile

Create reproducible research environments:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Create SCHOLAR_FLUX_HOME directory
RUN mkdir -p /app/scholar-flux/package_cache /app/scholar-flux/logs
ENV SCHOLAR_FLUX_HOME=/app/scholar-flux

# Non-root user
RUN useradd -m scholar && \
    chown -R scholar:scholar /app
USER scholar

CMD ["python", "research_pipeline.py"]

requirements.txt:

scholar-flux[parsing,database,cryptography]>=0.3.1
redis>=5.0.0
pymongo>=4.0.0
pandas>=2.0.0

Environment Variables in Docker

Option 1: .env file in SCHOLAR_FLUX_HOME (Recommended)

# Create .env on host
cat > /opt/scholar-flux/.env << 'EOF'
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
PUBMED_API_KEY=<insert_your_key>
EOF

# Run with volume mount
docker run \
  -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
  -v /opt/scholar-flux:/app/scholar-flux \
  scholar-flux-app

Option 2: Direct environment variables

docker run \
  -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
  -e PUBMED_API_KEY=$PUBMED_API_KEY \
  -e REDIS_HOST=redis \
  -v /opt/scholar-flux:/app/scholar-flux \
  scholar-flux-app

Option 3: Docker Compose (Recommended for multi-container)

version: '3.8'

services:
  app:
    build: .
    environment:
      - SCHOLAR_FLUX_HOME=/app/scholar-flux
      - REDIS_HOST=redis
    volumes:
      - ./scholar-flux:/app/scholar-flux  # Host directory to container
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:

Directory structure on host:

./
├── docker-compose.yml
├── Dockerfile
├── research_pipeline.py
└── scholar-flux/           # Mounted to container
    ├── .env                # Contains API keys
    ├── package_cache/      # Persisted cache
    └── logs/               # Persisted logs

Note

Docker is recommended for reproducible research environments, not necessarily for production deployment at scale. Using SCHOLAR_FLUX_HOME with volume mounts ensures your configuration, cache, and logs persist across container restarts.

Production Patterns

This section references core patterns covered in detail elsewhere. For production deployments, understand these foundational concepts:

Caching Strategy

ScholarFlux uses two-layer caching (HTTP responses + processed results). For production:

  • Use persistent backends (SQLite, Redis, or MongoDB) not in-memory

  • Set appropriate TTL based on data freshness needs

  • Use namespaces to isolate different projects/experiments

Example with SCHOLAR_FLUX_HOME:

from scholar_flux import SearchCoordinator, CachedSessionManager
from scholar_flux.data_storage import DataCacheManager
import os

# HTTP request cache with a 24-hour expiration (expire_after is in seconds)
session_manager = CachedSessionManager(
    backend='sqlite',
    user_agent='Research/1.0 mailto:user@your.affiliation.edu',
    expire_after=86400
)

# Production response processing cache with Redis
cache = DataCacheManager.with_storage(
    'redis',
    namespace='my_project',
    ttl=86400  # 24 hours
)

coordinator = SearchCoordinator(
    query="deep learning",
    provider_name="pubmed",
    session=session_manager(),
    cache_manager=cache
)

print(os.environ.get("SCHOLAR_FLUX_HOME"))
# OUTPUT (example): ~/.scholar_flux in a local development setup

print(session_manager.cache_path)
# OUTPUT: ~/.scholar_flux/package_cache/search_requests_cache

print(cache.cache_storage.config.get('url'))
# Redis connection details

Note

For production with Redis: Simply change backend='sqlite' to backend='redis' when creating the session manager. Both session caching and data caching will automatically use the same Redis connection via SCHOLAR_FLUX_REDIS_HOST and SCHOLAR_FLUX_REDIS_PORT environment variables.

See also

See Caching Strategies for comprehensive coverage including Redis/MongoDB configuration, encryption, TTL strategies, and thread-safety patterns.

Schema Normalization

Build ML-ready datasets with consistent schemas across providers:

# `multi_search_coordinator` is a configured MultiSearchCoordinator (see Multi-Provider Search)
# Normalize results from multiple providers
results = multi_search_coordinator.search_pages(pages=range(1, 6))
normalized = results.filter().normalize(
    include={'provider_name', 'query'}
)

# Export to pandas
import pandas as pd
df = pd.DataFrame(normalized)

See also

See Schema Normalization for field mappings, custom transformations, and ML pipeline integration.

Workflows

For APIs requiring multi-step retrieval (e.g., PubMed ID search → fetch records):

# PubMed workflow automatically handles search → fetch
coordinator = SearchCoordinator(
    query="cancer treatment",
    provider_name="pubmed"  # Uses PubMedWorkflow automatically
)

results = coordinator.search_pages(pages=range(1, 11))

See also

See Workflows for custom workflows, PubMed examples, and multi-step retrieval patterns.

Production Use Cases

Systematic Literature Review

Collect and cache papers for reproducible reviews:

from scholar_flux import SearchCoordinator
from scholar_flux.data_storage import DataCacheManager

cache = DataCacheManager.with_storage(
    'mongodb',
    namespace='systematic_review_2024',
    ttl=2592000  # 30 days
)

coordinator = SearchCoordinator(
    query="machine learning healthcare",
    provider_name="pubmed",
    cache_manager=cache
)

# Collect all pages
results = coordinator.search_pages(pages=range(1, 101))
successful = results.filter()

# Export for analysis
normalized = successful.normalize()

import pandas as pd
df = pd.DataFrame(normalized)
df.to_csv('systematic_review_data.csv', index=False)
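Multi-page collection can return duplicate records (for example, when result ordering shifts between requests). A deduplication pass before analysis, assuming the normalized records expose a `doi` field (actual field names depend on the provider mapping):

```python
import pandas as pd

# Hypothetical normalized records; real field names vary by provider
records = [
    {"doi": "10.1000/a", "title": "Paper A"},
    {"doi": "10.1000/b", "title": "Paper B"},
    {"doi": "10.1000/a", "title": "Paper A"},  # duplicate across pages
]
df = pd.DataFrame(records)

# Keep the first occurrence of each DOI
deduped = df.drop_duplicates(subset=["doi"], keep="first").reset_index(drop=True)
print(len(df), "->", len(deduped))  # 3 -> 2
```

For records without DOIs, falling back to a normalized title comparison is a common alternative.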

ML Training Data Collection

Build labeled datasets from multiple sources:

from scholar_flux import SearchCoordinator, MultiSearchCoordinator
from scholar_flux.data_storage import DataCacheManager

cache = DataCacheManager.with_storage('redis', namespace='ml_training')

# Define queries with labels
queries = {
    'machine learning classification': 'ml',
    'deep learning neural networks': 'dl',
    'reinforcement learning agents': 'rl'
}

# Create multi-coordinator
multi_search_coordinator = MultiSearchCoordinator()
for query, label in queries.items():
    coord = SearchCoordinator(
        query=query,
        provider_name="pubmed",
        cache_manager=cache
    )
    multi_search_coordinator.add_coordinator(coord)

# Collect data
results = multi_search_coordinator.search_pages(pages=range(1, 11))

# Add labels and export
import pandas as pd
normalized = results.filter().normalize(include={'query'})
df = pd.DataFrame(normalized)
df['label'] = df['query'].map(queries)

# Train/test split, etc.
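The split itself can be done with pandas alone (a sketch; scikit-learn's train_test_split is a common alternative with stratification support):

```python
import pandas as pd

# Hypothetical labeled dataset mirroring the structure built above
df = pd.DataFrame({
    "title": [f"paper {i}" for i in range(10)],
    "label": (["ml", "dl", "rl"] * 4)[:10],
})

# Reproducible 80/20 split: sample the training rows, the remainder is test
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))  # 8 2
```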

Longitudinal Monitoring

Track new publications over time with scheduled collection:

import schedule
import time
from datetime import datetime
from scholar_flux import SearchCoordinator
from scholar_flux.data_storage import DataCacheManager

cache = DataCacheManager.with_storage(
    'mongodb',
    namespace='monitoring',
    ttl=604800  # 7 days
)

def collect_new_papers():
    """Daily collection of new papers."""
    coordinator = SearchCoordinator(
        query="AI safety",
        provider_name="arxiv",
        cache_manager=cache
    )

    results = coordinator.search_pages(pages=range(1, 6))

    # Process and store with a collection timestamp
    collected_at = datetime.now()
    normalized = results.filter().normalize()

    # Save to database/file (use a filesystem-safe timestamp in the filename)
    import pandas as pd
    df = pd.DataFrame(normalized)
    df['collected_at'] = collected_at.isoformat()
    df.to_csv(f"papers_{collected_at:%Y%m%d_%H%M%S}.csv", index=False)

# Schedule daily collection
schedule.every().day.at("09:00").do(collect_new_papers)

while True:
    schedule.run_pending()
    time.sleep(3600)
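For long-running monitoring, one failed run shouldn't kill the scheduler loop; wrapping the job keeps the schedule alive across transient failures (a sketch; adapt the logging to your setup):

```python
import logging

logger = logging.getLogger("monitoring")

def safe_job(job):
    """Wrap a scheduled job so one failure doesn't stop the scheduler loop."""
    def wrapped():
        try:
            job()
        except Exception:  # e.g., network errors or provider outages
            logger.exception("Collection run failed; retrying at next schedule")
    return wrapped

# Usage: schedule.every().day.at("09:00").do(safe_job(collect_new_papers))
```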

Data Ownership & Citation

Warning

Important Legal and Ethical Notice

ScholarFlux facilitates data retrieval but does not grant ownership of the data. Users must:

  1. Comply with API Terms of Service

  2. Properly cite original data sources in publications

  3. Respect rate limits and provider guidelines

  4. Consider data privacy (GDPR, CCPA) when handling personal information

  5. Obtain permission for commercial use

Provider Attribution

Always cite data sources in publications:

Example acknowledgment:

“Data was retrieved from PubMed (NCBI), PLOS, and Crossref APIs using ScholarFlux. We acknowledge these providers for making scholarly data accessible.”

Provider requirements:

  • PubMed/NCBI: Acknowledge “Entrez Programming Utilities”

  • PLOS: Attribution required for PLOS content

  • Crossref: Use mailto in requests for higher rate limits

  • Springer Nature: Commercial use requires separate licensing

See individual provider documentation for complete terms of service.

Security Essentials

API Key Management

Never commit secrets to version control:

# .gitignore
.env
.env.*
*.key

Key rotation:

Rotate API keys periodically when possible. Note that some providers only issue a single API key without rotation support. For providers that support it:

# Update .env file in SCHOLAR_FLUX_HOME
# Old: PUBMED_API_KEY=old_key_123
# New: PUBMED_API_KEY=new_key_456

Then restart your application to load the new key.

Cache Security

For sensitive research data, use encrypted caching:

from scholar_flux.sessions import EncryptionPipelineFactory, CachedSessionManager
import os

# Load the encryption key; fail fast if it's missing
key = os.getenv('SCHOLAR_FLUX_CACHE_SECRET_KEY')
if not key:
    raise EnvironmentError("SCHOLAR_FLUX_CACHE_SECRET_KEY is not set")
encryption_factory = EncryptionPipelineFactory(key)

# Create encrypted session
serializer = encryption_factory()
session_manager = CachedSessionManager(
    backend='redis',
    serializer=serializer
)

See also

See SECURITY for comprehensive security guidelines including cache encryption, network security, and vulnerability reporting.

Logging

ScholarFlux includes built-in rotating logs. Configure via environment:

SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux
SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE  # Prevent duplicate logs

Logs automatically rotate when they reach size limits. For custom logging:

import logging

logger = logging.getLogger('scholar_flux')
logger.setLevel(logging.INFO)

# Add custom handlers as needed

Best Practices

Configuration

✅ Set SCHOLAR_FLUX_HOME for centralized configuration, caching, and logging

✅ Use environment-specific .env files in $SCHOLAR_FLUX_HOME

✅ Validate required secrets on application startup

✅ Use prefixed variables (SCHOLAR_FLUX_*) for clarity in shared environments

Example setup:

# Set up SCHOLAR_FLUX_HOME
export SCHOLAR_FLUX_HOME=/opt/scholar-flux
mkdir -p $SCHOLAR_FLUX_HOME/{package_cache,logs}

# Create .env file
cat > $SCHOLAR_FLUX_HOME/.env << 'EOF'
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
PUBMED_API_KEY=<insert_your_key>
EOF

Caching

✅ Use persistent backends (SQLite, Redis, MongoDB) for production

✅ Set appropriate TTL: 1-7 days for general research, 30+ days for archival

✅ Use namespaces: namespace='project_name' or namespace='user:123:project'

Request caching (HTTP responses): Off by default, enable with CachedSessionManager or use_cache=True

Response caching (processed data): In-memory by default, use DataCacheManager.with_storage() for persistence

✅ SQLite backends automatically use $SCHOLAR_FLUX_HOME/package_cache/ when SCHOLAR_FLUX_HOME is set

See Caching Strategies for detailed patterns.

Concurrency

✅ Use MultiSearchCoordinator for parallel provider searches

✅ Create new session per thread: session_manager() (sessions not thread-safe)

✅ Share DataCacheManager across threads (thread-safe)

See Multi-Provider Search for detailed threading patterns.
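The per-thread session rule can also be enforced structurally: give each worker its own session from a factory and share only the thread-safe cache manager. A sketch with stand-in callables; in ScholarFlux terms, `make_session` would be the `session_manager` and `run_one` would build and run a SearchCoordinator:

```python
from concurrent.futures import ThreadPoolExecutor

def run_queries(queries, make_session, run_one, max_workers=4):
    """Run each query in its own task with a fresh, never-shared session.

    `make_session` is called once per task so sessions are not shared across
    threads; `run_one(session, query)` performs the actual search.
    """
    def task(query):
        session = make_session()  # e.g., session_manager() in ScholarFlux
        return run_one(session, query)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, queries))

# Stand-in demonstration (no network; hypothetical factory and search call):
results = run_queries(
    ["q1", "q2", "q3"],
    make_session=dict,
    run_one=lambda session, q: f"done:{q}",
)
print(results)  # ['done:q1', 'done:q2', 'done:q3']
```

`pool.map` preserves input order, so results line up with the submitted queries.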

Data Management

✅ Always cite data sources in publications

✅ Check API terms of service for commercial use

✅ Implement proper data retention policies

✅ Consider GDPR/CCPA compliance for personal data

Security

✅ Rotate API keys when possible (note: some providers only issue a single key)

✅ Use encrypted caching for sensitive queries

✅ Never log API keys (ScholarFlux masks them automatically)

✅ Monitor for security advisories on GitHub

✅ Keep .env files in $SCHOLAR_FLUX_HOME with restricted permissions (chmod 600)

Production Checklist

Before deploying to production:

Environment

☐ Set up SCHOLAR_FLUX_HOME directory structure

☐ Create .env file in $SCHOLAR_FLUX_HOME with all required secrets

☐ Validate configuration on startup

☐ Verify write permissions for cache and log directories
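Write permissions can be verified programmatically at startup (a sketch using the standard library):

```python
import os
import tempfile

def verify_writable(directory):
    """Confirm the process can create files in `directory`; raise otherwise."""
    os.makedirs(directory, exist_ok=True)
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass  # file is created and removed; success means we can write
    except OSError as exc:
        raise PermissionError(f"Cannot write to {directory}: {exc}") from exc

# Usage at startup (paths assume SCHOLAR_FLUX_HOME is set):
# verify_writable(os.path.join(os.environ["SCHOLAR_FLUX_HOME"], "package_cache"))
# verify_writable(os.path.join(os.environ["SCHOLAR_FLUX_HOME"], "logs"))
```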

Caching

☐ Deploy Redis or MongoDB for persistent caching

☐ Configure appropriate TTL for your use case

☐ Test cache connectivity before starting collection
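A minimal reachability check before a long collection run (a sketch using a plain TCP probe; it verifies only that the host/port accepts connections, not authentication — redis-py's `ping()` or pymongo's `server_info()` are fuller checks):

```python
import socket

def backend_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the cache backend succeeds."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

# Usage at startup, honoring the environment variables described earlier:
# import os
# host = os.getenv("SCHOLAR_FLUX_REDIS_HOST", "localhost")
# port = os.getenv("SCHOLAR_FLUX_REDIS_PORT", "6379")
# if not backend_reachable(host, port):
#     raise ConnectionError(f"Cache backend unreachable at {host}:{port}")
```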

Security

☐ Remove all hardcoded secrets from code

☐ Set up API key rotation schedule

☐ Review SECURITY guidelines

☐ Configure encrypted caching if handling sensitive data

Testing

☐ Test with small page ranges first

☐ Verify rate limiting works correctly

☐ Test failure recovery (API errors, network issues)

☐ Validate data quality and completeness

Documentation

☐ Document which providers and queries you’re using

☐ Record data collection dates for reproducibility

☐ Plan for data citation in publications

☐ Document any provider-specific configurations

Next Steps

You now understand production deployment essentials for ScholarFlux. Continue with:

Example Pipelines:

Production-quality examples demonstrating real-world integration patterns:

Related Guides:

Advanced Topics:

Reference:

API Reference

Getting Help

For production deployment questions:

  1. Check documentation: Especially Caching Strategies and Multi-Provider Search

  2. GitHub Issues: https://github.com/SammieH21/scholar-flux/issues

  3. Email: scholar.flux@gmail.com

  4. Security issues: Use GitHub Security Advisories (private reporting)

When reporting issues, include:

  • ScholarFlux version: import scholar_flux; print(scholar_flux.__version__)

  • Python version and OS

  • Cache backend (Redis/MongoDB/SQL)

  • Relevant environment variables (mask secrets!)

  • Complete error messages

This is an MVP guide. We’ll expand with more patterns and examples as ScholarFlux matures toward v1.0. Feedback welcome!