Production Deployment

This guide covers essential configuration and patterns for deploying ScholarFlux in production environments for machine learning and data engineering workflows.

Overview

ScholarFlux is designed for production-grade data collection from academic APIs. This MVP guide focuses on:

  • ML/Data Engineering: Building training datasets, systematic reviews, longitudinal monitoring

  • Reproducibility: Environment configuration and containerization

  • Essential patterns: Caching, concurrency, and security basics

Note

ScholarFlux is currently beta (v0.5.0). Test thoroughly before production deployment and monitor the GitHub repository for updates.

Prerequisites

Environment Configuration

Configuration System

ScholarFlux uses a hierarchical configuration system with the following priority (highest first):

  1. Explicit parameters in code

  2. Environment variables set in the OS

  3. ``.env`` file (auto-loaded from $SCHOLAR_FLUX_HOME or fallback locations)

  4. Default values

See Getting Started for basic configuration setup.

Production Environment Variables

With SCHOLAR_FLUX_HOME set, create a .env file at $SCHOLAR_FLUX_HOME/.env:

Core Configuration

Based on scholar_flux.utils.config_loader:

# Logging (auto-uses $SCHOLAR_FLUX_HOME/logs/ if SCHOLAR_FLUX_HOME is set)
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO              # DEBUG, INFO, WARNING, ERROR, or CRITICAL
SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE        # Set to FALSE for production
SCHOLAR_FLUX_LOG_STREAM=STDERR           # STDERR, STDOUT, or FALSE (disables log streaming)

# Optional: Override the log directory (otherwise uses $SCHOLAR_FLUX_HOME/logs/)
# SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux

# Optional: Override the cache directory (otherwise uses $SCHOLAR_FLUX_HOME/package_cache/)
# SCHOLAR_FLUX_CACHE_DIRECTORY=/var/cache/scholar-flux

# Optional: Override the session cache name for specific backends (otherwise uses search_requests_cache)
# SCHOLAR_FLUX_SESSION_CACHE_NAME=session_cache_storage

# Cache encryption (generate secure random key)
SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secure_random_key_here
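One way to generate a suitable random value is Python's standard-library secrets module (a sketch; check ScholarFlux's security documentation in case a specific key format, such as a Fernet-compatible key, is required):

```python
import secrets

# Generate a URL-safe random key with 32 bytes of entropy for cache encryption.
# Assumes ScholarFlux accepts an arbitrary high-entropy string; verify the
# expected key format against the package's security docs.
key = secrets.token_urlsafe(32)
print(f"SCHOLAR_FLUX_CACHE_SECRET_KEY={key}")
```

Store the generated value in the .env file rather than in code or shell history.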

Note

If SCHOLAR_FLUX_HOME is set, you typically don’t need to set SCHOLAR_FLUX_LOG_DIRECTORY or SCHOLAR_FLUX_CACHE_DIRECTORY explicitly. They default to subdirectories within $SCHOLAR_FLUX_HOME.

API Provider Keys

# Required for specific providers
PUBMED_API_KEY=<insert_your_key>
SPRINGER_NATURE_API_KEY=<insert_your_key>
CORE_API_KEY=<insert_your_key>

# Optional (some providers don't require keys)
ARXIV_API_KEY=<insert_your_key>
OPEN_ALEX_API_KEY=<insert_your_key>
CROSSREF_API_KEY=<insert_your_key>

Session and Request Defaults

Configure default behavior for API requests across all providers:

# Default User-Agent for all sessions (recommended for production)
SCHOLAR_FLUX_DEFAULT_USER_AGENT=MyApp/1.0 (https://example.com; mailto:contact@example.com)

# Default mailto for Crossref and OpenAlex (enables "polite pool" access)
SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu

Tip

Polite Pool Access: Setting SCHOLAR_FLUX_DEFAULT_MAILTO with a valid email automatically enables higher rate limits for Crossref and OpenAlex:

  • OpenAlex: 10 requests/second with mailto (vs. 1 req/sec without) — a 10x improvement

  • Crossref: Priority access and faster responses for identified users

As of v0.3.1, ScholarFlux reduced the default OpenAlex request_delay from 6s to 1s to align with their documented rate limits. Combined with mailto, this significantly improves the default throughput for OpenAlex queries.

These variables are read automatically when creating sessions and search coordinators, eliminating the need to specify them in code:

from scholar_flux import SearchCoordinator

# Without environment variables - must specify each time
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="crossref",
    mailto="your.email@institution.edu"
)

# With SCHOLAR_FLUX_DEFAULT_MAILTO set - automatically applied
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="crossref"
)
# mailto is automatically read from environment

Cache Backend Configuration

Configure default cache backends via environment variables. This eliminates the need to specify backends in code:

# Layer 1: HTTP Response Cache (CachedSessionManager)
# Controls the default backend for requests-cache session caching
# Options: sqlite (default), redis, mongodb, memory, filesystem, gridfs, dynamodb
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis

# Layer 2: Processed Response Cache (DataCacheManager)
# Controls the default storage for processed API response caching
# Options: inmemory (default), redis, sql/sqlalchemy, mongodb, null
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis

# Redis connection (used by both layers when redis backend is selected)
SCHOLAR_FLUX_REDIS_HOST=localhost        # or REDIS_HOST
SCHOLAR_FLUX_REDIS_PORT=6379             # or REDIS_PORT

# MongoDB connection (alternative)
SCHOLAR_FLUX_MONGODB_HOST=mongodb://127.0.0.1   # or MONGODB_HOST
SCHOLAR_FLUX_MONGODB_PORT=27017                 # or MONGODB_PORT

# Optional: Use SQLite, DuckDB, or another SQLAlchemy flavor for caching processed response data
# SCHOLAR_FLUX_SQLALCHEMY_URL=<file path or SQLAlchemy URL>

# Default provider (optional)
SCHOLAR_FLUX_DEFAULT_PROVIDER=plos

With these environment variables set, cache backends are configured automatically:

from scholar_flux import SearchCoordinator
from scholar_flux.sessions import CachedSessionManager
from scholar_flux.data_storage import DataCacheManager

# Without environment variables - must specify backends explicitly
session_manager = CachedSessionManager(backend='redis')
cache_manager = DataCacheManager.with_storage('redis')

# With SCHOLAR_FLUX_DEFAULT_*_CACHE_* variables set - automatic configuration
session_manager = CachedSessionManager()  # Uses SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND
cache_manager = DataCacheManager.from_defaults()  # Uses SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE

# SearchCoordinator also respects these defaults
coordinator = SearchCoordinator(
    query="machine learning",
    provider_name="pubmed",
    session=session_manager(),
    cache_manager=cache_manager
)

Tip

ScholarFlux accepts both prefixed (SCHOLAR_FLUX_*) and unprefixed (REDIS_HOST) variables for cache backends, prioritizing the prefixed version.

Note

ScholarFlux builds on requests-cache, adding strict validation for cache TTL (expire_after), storage backend availability, and other common session cache parameters. While powerful, requests-cache itself performs minimal input validation and may silently accept invalid TTLs or malformed values. CachedSessionManager raises clear errors for unsupported types, negative values (other than -1 for “no expiration”), and malformed strings before constructing a CachedSession, so misconfiguration surfaces immediately rather than mid-request. This strict validation applies to both the session cache (expire_after) and the processed response cache (ttl for Redis/MongoDB), ensuring that bad inputs fail fast instead of propagating into orchestrated searches later.

Runtime Configuration

For scenarios where environment variables aren’t suitable (e.g., dynamic configuration, testing), use config_settings to configure defaults at runtime:

from scholar_flux.utils import config_settings

# Set defaults programmatically (equivalent to environment variables)
config_settings.set("SCHOLAR_FLUX_DEFAULT_USER_AGENT", "MyResearchApp/1.0")
config_settings.set("SCHOLAR_FLUX_DEFAULT_PROVIDER", "Crossref")
config_settings.set("SCHOLAR_FLUX_DEFAULT_MAILTO", "researcher@university.edu")
config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "redis")
config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "redis")

# Read configuration (checks config dict first, then falls back to environment)
user_agent = config_settings.get("SCHOLAR_FLUX_DEFAULT_USER_AGENT")
cache_backend = config_settings.get("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", default="sqlite")

# Now all SearchCoordinators will use these defaults to search for academic records via Crossref
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="test")
# Automatically uses the configured user_agent and mailto

Priority order for configuration:

  1. Explicit parameters passed to constructors

  2. Values set via config_settings.set()

  3. Environment variables (from OS or .env file)

  4. Built-in defaults

This pattern is useful for:

  • Testing: Override production settings without changing environment

  • Multi-tenant applications: Different configurations per request

  • Dynamic configuration: Change settings based on runtime conditions
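For the testing use case, environment overrides can also be scoped with the standard library so production settings are restored automatically when the test ends (a sketch using unittest.mock; variable names follow the ones above):

```python
import os
from unittest.mock import patch

# Temporarily override a ScholarFlux setting; the original environment is
# restored automatically when the context manager exits.
with patch.dict(os.environ, {"SCHOLAR_FLUX_DEFAULT_PROVIDER": "arxiv"}):
    assert os.environ["SCHOLAR_FLUX_DEFAULT_PROVIDER"] == "arxiv"
    # ... create a SearchCoordinator here; it sees the overridden value ...
# Outside the block, the previous value (or its absence) is restored.
```

Note that, per the priority order above, values already set via config_settings.set() take precedence over environment variables, so clear those first if a test needs to exercise environment-driven behavior.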

Loading Configuration

Custom Configuration Path

For custom .env locations:

from scholar_flux import initialize_package

initialize_package(
    env_path="/etc/scholar-flux/.env.production",
    config_params={'enable_logging': True, 'log_level': 'INFO'}
)

Validation

Validate required secrets on startup:

import os

required_secrets = ['PUBMED_API_KEY', 'REDIS_HOST', 'SCHOLAR_FLUX_CACHE_SECRET_KEY']
missing = [key for key in required_secrets if not os.getenv(key)]

if missing:
    raise EnvironmentError(f"Missing required configuration: {missing}")

Docker for Reproducibility

Using SCHOLAR_FLUX_HOME in Docker

The recommended approach is to mount a volume to SCHOLAR_FLUX_HOME:

# Create host directory
mkdir -p /opt/scholar-flux

# Run container with volume mount
docker run \
  -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
  -v /opt/scholar-flux:/app/scholar-flux \
  scholar-flux-app

This mounts the host’s /opt/scholar-flux to the container’s /app/scholar-flux, making .env, cache, and logs persist across container restarts.

Basic Dockerfile

Create reproducible research environments:

FROM python:3.11-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Create SCHOLAR_FLUX_HOME directory
RUN mkdir -p /app/scholar-flux/package_cache /app/scholar-flux/logs
ENV SCHOLAR_FLUX_HOME=/app/scholar-flux

# Non-root user
RUN useradd -m scholar && \
    chown -R scholar:scholar /app
USER scholar

CMD ["python", "research_pipeline.py"]

requirements.txt:

scholar-flux[parsing,database,cryptography]>=0.3.1
redis>=5.0.0
pymongo>=4.0.0
pandas>=2.0.0

Environment Variables in Docker

Option 1: .env file in SCHOLAR_FLUX_HOME (Recommended)

# Create .env on host
cat > /opt/scholar-flux/.env << 'EOF'
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
PUBMED_API_KEY=<insert_your_key>
EOF

# Run with volume mount
docker run \
  -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
  -v /opt/scholar-flux:/app/scholar-flux \
  scholar-flux-app

Option 2: Direct environment variables

docker run \
  -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
  -e PUBMED_API_KEY=$PUBMED_API_KEY \
  -e REDIS_HOST=redis \
  -v /opt/scholar-flux:/app/scholar-flux \
  scholar-flux-app

Option 3: Docker Compose (Recommended for multi-container)

version: '3.8'

services:
  app:
    build: .
    environment:
      - SCHOLAR_FLUX_HOME=/app/scholar-flux
      - REDIS_HOST=redis
    volumes:
      - ./scholar-flux:/app/scholar-flux  # Host directory to container
    depends_on:
      - redis

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:

Directory structure on host:

./
├── docker-compose.yml
├── Dockerfile
├── research_pipeline.py
└── scholar-flux/           # Mounted to container
    ├── .env                # Contains API keys
    ├── package_cache/      # Persisted cache
    └── logs/               # Persisted logs

Note

Docker is recommended for reproducible research environments, not necessarily for production deployment at scale. Using SCHOLAR_FLUX_HOME with volume mounts ensures your configuration, cache, and logs persist across container restarts.

Production Patterns

This section references core patterns covered in detail elsewhere. For production deployments, understand these foundational concepts:

Caching Strategy

ScholarFlux uses two-layer caching (HTTP responses + processed results). For production:

  • Use persistent backends (SQLite, Redis, or MongoDB) not in-memory

  • Set appropriate TTL based on data freshness needs

  • Use namespaces to isolate different projects/experiments

Example with SCHOLAR_FLUX_HOME:

from scholar_flux import SearchCoordinator, CachedSessionManager
from scholar_flux.data_storage import DataCacheManager
import os

# HTTP request cache with a 24-hour expiration (expire_after is in seconds)
session_manager = CachedSessionManager(
    backend='sqlite',
    user_agent='Research/1.0 mailto:user@your.affiliation.edu',
    expire_after=86400
)

# Production response processing cache with Redis
cache = DataCacheManager.with_storage(
    'redis',
    namespace='my_project',
    ttl=86400  # 24 hours
)

coordinator = SearchCoordinator(
    query="deep learning",
    provider_name="pubmed",
    session=session_manager(),
    cache_manager=cache
)

print(os.environ.get("SCHOLAR_FLUX_HOME"))
# OUTPUT (example): ~/.scholar_flux in a local development setup

print(session_manager.cache_path)
# OUTPUT: ~/.scholar_flux/package_cache/search_requests_cache

print(cache.cache_storage.config.get('url'))
# Redis connection details

Note

For production with Redis: Simply change backend='sqlite' to backend='redis' when creating the session manager. Both session caching and data caching will automatically use the same Redis connection via SCHOLAR_FLUX_REDIS_HOST and SCHOLAR_FLUX_REDIS_PORT environment variables.

See also

See Caching Strategies for comprehensive coverage including Redis/MongoDB configuration, encryption, TTL strategies, and thread-safety patterns.

Schema Normalization

Build ML-ready datasets with consistent schemas across providers:

# `multi_search_coordinator` is a configured MultiSearchCoordinator (see Multi-Provider Search)
# Normalize results from multiple providers
results = multi_search_coordinator.search_pages(pages=range(1, 6))
normalized = results.filter().normalize(
    include={'provider_name', 'query'}
)

# Export to pandas
import pandas as pd
df = pd.DataFrame(normalized)

See also

See Schema Normalization for field mappings, custom transformations, and ML pipeline integration.

Workflows

For APIs requiring multi-step retrieval (e.g., PubMed ID search → fetch records):

# PubMed workflow automatically handles search → fetch
coordinator = SearchCoordinator(
    query="cancer treatment",
    provider_name="pubmed"  # Uses PubMedWorkflow automatically
)

results = coordinator.search_pages(pages=range(1, 11))

See also

See Workflows for custom workflows, PubMed examples, and multi-step retrieval patterns.

Production Use Cases

Systematic Literature Review

Collect and cache papers for reproducible reviews:

from scholar_flux import SearchCoordinator
from scholar_flux.data_storage import DataCacheManager

cache = DataCacheManager.with_storage(
    'mongodb',
    namespace='systematic_review_2024',
    ttl=2592000  # 30 days
)

coordinator = SearchCoordinator(
    query="machine learning healthcare",
    provider_name="pubmed",
    cache_manager=cache
)

# Collect all pages
results = coordinator.search_pages(pages=range(1, 101))
successful = results.filter()

# Export for analysis
normalized = successful.normalize()

import pandas as pd
df = pd.DataFrame(normalized)
df.to_csv('systematic_review_data.csv', index=False)
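Multi-page collection can return duplicate records (for example, when result ordering shifts between requests). A deduplication pass before analysis, assuming the normalized records expose a `doi` field (actual field names depend on the provider mapping):

```python
import pandas as pd

# Hypothetical normalized records; real field names vary by provider
records = [
    {"doi": "10.1000/a", "title": "Paper A"},
    {"doi": "10.1000/b", "title": "Paper B"},
    {"doi": "10.1000/a", "title": "Paper A"},  # duplicate across pages
]
df = pd.DataFrame(records)

# Keep the first occurrence of each DOI
deduped = df.drop_duplicates(subset=["doi"], keep="first").reset_index(drop=True)
print(len(df), "->", len(deduped))  # 3 -> 2
```

For records without DOIs, falling back to a normalized title comparison is a common alternative.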

ML Training Data Collection

Build labeled datasets from multiple sources:

from scholar_flux import SearchCoordinator, MultiSearchCoordinator
from scholar_flux.data_storage import DataCacheManager

cache = DataCacheManager.with_storage('redis', namespace='ml_training')

# Define queries with labels
queries = {
    'machine learning classification': 'ml',
    'deep learning neural networks': 'dl',
    'reinforcement learning agents': 'rl'
}

# Create multi-coordinator
multi_search_coordinator = MultiSearchCoordinator()
for query, label in queries.items():
    coord = SearchCoordinator(
        query=query,
        provider_name="pubmed",
        cache_manager=cache
    )
    multi_search_coordinator.add_coordinator(coord)

# Collect data
results = multi_search_coordinator.search_pages(pages=range(1, 11))

# Add labels and export
import pandas as pd
normalized = results.filter().normalize(include={'query'})
df = pd.DataFrame(normalized)
df['label'] = df['query'].map(queries)

# Train/test split, etc.
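The split itself can be done with pandas alone (a sketch; scikit-learn's train_test_split is a common alternative with stratification support):

```python
import pandas as pd

# Hypothetical labeled dataset mirroring the structure built above
df = pd.DataFrame({
    "title": [f"paper {i}" for i in range(10)],
    "label": (["ml", "dl", "rl"] * 4)[:10],
})

# Reproducible 80/20 split: sample the training rows, the remainder is test
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)
print(len(train), len(test))  # 8 2
```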

Longitudinal Monitoring

Track new publications over time with scheduled collection:

import schedule
import time
from datetime import datetime
from scholar_flux import SearchCoordinator
from scholar_flux.data_storage import DataCacheManager

cache = DataCacheManager.with_storage(
    'mongodb',
    namespace='monitoring',
    ttl=604800  # 7 days
)

def collect_new_papers():
    """Daily collection of new papers."""
    coordinator = SearchCoordinator(
        query="AI safety",
        provider_name="arxiv",
        cache_manager=cache
    )

    results = coordinator.search_pages(pages=range(1, 6))

    # Process and store with a collection timestamp
    collected_at = datetime.now()
    normalized = results.filter().normalize()

    # Save to database/file (use a filesystem-safe timestamp in the filename)
    import pandas as pd
    df = pd.DataFrame(normalized)
    df['collected_at'] = collected_at.isoformat()
    df.to_csv(f"papers_{collected_at:%Y%m%d_%H%M%S}.csv", index=False)

# Schedule daily collection
schedule.every().day.at("09:00").do(collect_new_papers)

while True:
    schedule.run_pending()
    time.sleep(3600)
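For long-running monitoring, one failed run shouldn't kill the scheduler loop; wrapping the job keeps the schedule alive across transient failures (a sketch; adapt the logging to your setup):

```python
import logging

logger = logging.getLogger("monitoring")

def safe_job(job):
    """Wrap a scheduled job so one failure doesn't stop the scheduler loop."""
    def wrapped():
        try:
            job()
        except Exception:  # e.g., network errors or provider outages
            logger.exception("Collection run failed; retrying at next schedule")
    return wrapped

# Usage: schedule.every().day.at("09:00").do(safe_job(collect_new_papers))
```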

Data Ownership & Citation

Warning

Important Legal and Ethical Notice

ScholarFlux facilitates data retrieval but does not grant ownership of the data. Users must:

  1. Comply with API Terms of Service

  2. Properly cite original data sources in publications

  3. Respect rate limits and provider guidelines

  4. Consider data privacy (GDPR, CCPA) when handling personal information

  5. Obtain permission for commercial use

Provider Attribution

Always cite data sources in publications:

Example acknowledgment:

“Data was retrieved from PubMed (NCBI), PLOS, and Crossref APIs using ScholarFlux. We acknowledge these providers for making scholarly data accessible.”

Provider requirements:

  • PubMed/NCBI: Acknowledge “Entrez Programming Utilities”

  • PLOS: Attribution required for PLOS content

  • Crossref: Use mailto in requests for higher rate limits

  • Springer Nature: Commercial use requires separate licensing

See individual provider documentation for complete terms of service.

Security Essentials

API Key Management

Never commit secrets to version control:

# .gitignore
.env
.env.*
*.key

Key rotation:

Rotate API keys periodically when possible. Note that some providers only issue a single API key without rotation support. For providers that support it:

# Update .env file in SCHOLAR_FLUX_HOME
# Old: PUBMED_API_KEY=old_key_123
# New: PUBMED_API_KEY=new_key_456

Then restart your application to load the new key.

Cache Security

For sensitive research data, use encrypted caching:

from scholar_flux.sessions import EncryptionPipelineFactory, CachedSessionManager
import os

# Load the encryption key; fail fast if it's missing
key = os.getenv('SCHOLAR_FLUX_CACHE_SECRET_KEY')
if not key:
    raise EnvironmentError("SCHOLAR_FLUX_CACHE_SECRET_KEY is not set")
encryption_factory = EncryptionPipelineFactory(key)

# Create encrypted session
serializer = encryption_factory()
session_manager = CachedSessionManager(
    backend='redis',
    serializer=serializer
)

See also

See SECURITY for comprehensive security guidelines including cache encryption, network security, and vulnerability reporting.

Logging

ScholarFlux includes built-in rotating logs. Configure via environment:

SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux
SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE  # Prevent duplicate logs

Logs automatically rotate when they reach size limits. For custom logging:

import logging

logger = logging.getLogger('scholar_flux')
logger.setLevel(logging.INFO)

# Add custom handlers as needed

Best Practices

Configuration

✅ Set SCHOLAR_FLUX_HOME for centralized configuration, caching, and logging

✅ Use environment-specific .env files in $SCHOLAR_FLUX_HOME

✅ Validate required secrets on application startup

✅ Use prefixed variables (SCHOLAR_FLUX_*) for clarity in shared environments

Example setup:

# Set up SCHOLAR_FLUX_HOME
export SCHOLAR_FLUX_HOME=/opt/scholar-flux
mkdir -p $SCHOLAR_FLUX_HOME/{package_cache,logs}

# Create .env file
cat > $SCHOLAR_FLUX_HOME/.env << 'EOF'
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
PUBMED_API_KEY=<insert_your_key>
EOF

Caching

✅ Use persistent backends (SQLite, Redis, MongoDB) for production

✅ Set appropriate TTL: 1-7 days for general research, 30+ days for archival

✅ Use namespaces: namespace='project_name' or namespace='user:123:project'

Request caching (HTTP responses): Off by default, enable with CachedSessionManager or use_cache=True

Response caching (processed data): In-memory by default, use DataCacheManager.with_storage() for persistence

✅ SQLite backends automatically use $SCHOLAR_FLUX_HOME/package_cache/ when SCHOLAR_FLUX_HOME is set

See Caching Strategies for detailed patterns.

Concurrency

✅ Use MultiSearchCoordinator for parallel provider searches

✅ Create new session per thread: session_manager() (sessions not thread-safe)

✅ Share DataCacheManager across threads (thread-safe)

See Multi-Provider Search for detailed threading patterns.
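The per-thread session rule can also be enforced structurally: give each worker its own session from a factory and share only the thread-safe cache manager. A sketch with stand-in callables; in ScholarFlux terms, `make_session` would be the `session_manager` and `run_one` would build and run a SearchCoordinator:

```python
from concurrent.futures import ThreadPoolExecutor

def run_queries(queries, make_session, run_one, max_workers=4):
    """Run each query in its own task with a fresh, never-shared session.

    `make_session` is called once per task so sessions are not shared across
    threads; `run_one(session, query)` performs the actual search.
    """
    def task(query):
        session = make_session()  # e.g., session_manager() in ScholarFlux
        return run_one(session, query)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, queries))

# Stand-in demonstration (no network; hypothetical factory and search call):
results = run_queries(
    ["q1", "q2", "q3"],
    make_session=dict,
    run_one=lambda session, q: f"done:{q}",
)
print(results)  # ['done:q1', 'done:q2', 'done:q3']
```

`pool.map` preserves input order, so results line up with the submitted queries.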

Data Management

✅ Always cite data sources in publications

✅ Check API terms of service for commercial use

✅ Implement proper data retention policies

✅ Consider GDPR/CCPA compliance for personal data

Security

✅ Rotate API keys when possible (note: some providers only issue a single key)

✅ Use encrypted caching for sensitive queries

✅ Never log API keys (ScholarFlux masks them automatically)

✅ Monitor for security advisories on GitHub

✅ Keep .env files in $SCHOLAR_FLUX_HOME with restricted permissions (chmod 600)

Production Checklist

Before deploying to production:

Environment

☐ Set up SCHOLAR_FLUX_HOME directory structure

☐ Create .env file in $SCHOLAR_FLUX_HOME with all required secrets

☐ Validate configuration on startup

☐ Verify write permissions for cache and log directories
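Write permissions can be verified programmatically at startup (a sketch using the standard library):

```python
import os
import tempfile

def verify_writable(directory):
    """Confirm the process can create files in `directory`; raise otherwise."""
    os.makedirs(directory, exist_ok=True)
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            pass  # file is created and removed; success means we can write
    except OSError as exc:
        raise PermissionError(f"Cannot write to {directory}: {exc}") from exc

# Usage at startup (paths assume SCHOLAR_FLUX_HOME is set):
# verify_writable(os.path.join(os.environ["SCHOLAR_FLUX_HOME"], "package_cache"))
# verify_writable(os.path.join(os.environ["SCHOLAR_FLUX_HOME"], "logs"))
```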

Caching

☐ Deploy Redis or MongoDB for persistent caching

☐ Configure appropriate TTL for your use case

☐ Test cache connectivity before starting collection
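A minimal reachability check before a long collection run (a sketch using a plain TCP probe; it verifies only that the host/port accepts connections, not authentication — redis-py's `ping()` or pymongo's `server_info()` are fuller checks):

```python
import socket

def backend_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to the cache backend succeeds."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        return False

# Usage at startup, honoring the environment variables described earlier:
# import os
# host = os.getenv("SCHOLAR_FLUX_REDIS_HOST", "localhost")
# port = os.getenv("SCHOLAR_FLUX_REDIS_PORT", "6379")
# if not backend_reachable(host, port):
#     raise ConnectionError(f"Cache backend unreachable at {host}:{port}")
```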

Security

☐ Remove all hardcoded secrets from code

☐ Set up API key rotation schedule

☐ Review SECURITY guidelines

☐ Configure encrypted caching if handling sensitive data

Testing

☐ Test with small page ranges first

☐ Verify rate limiting works correctly

☐ Test failure recovery (API errors, network issues)

☐ Validate data quality and completeness

Documentation

☐ Document which providers and queries you’re using

☐ Record data collection dates for reproducibility

☐ Plan for data citation in publications

☐ Document any provider-specific configurations

Next Steps

You now understand production deployment essentials for ScholarFlux. Continue with:

Example Pipelines:

Production-quality examples demonstrating real-world integration patterns:

Related Guides:

Advanced Topics:

Reference:

API Reference

Getting Help

For production deployment questions:

  1. Check documentation: Especially Caching Strategies and Multi-Provider Search

  2. GitHub Issues: https://github.com/SammieH21/scholar-flux/issues

  3. Email: scholar.flux@gmail.com

  4. Security issues: Use GitHub Security Advisories (private reporting)

When reporting issues, include:

  • ScholarFlux version: import scholar_flux; print(scholar_flux.__version__)

  • Python version and OS

  • Cache backend (Redis/MongoDB/SQL)

  • Relevant environment variables (mask secrets!)

  • Complete error messages

This is an MVP guide. We’ll expand with more patterns and examples as ScholarFlux matures toward v1.0. Feedback welcome!