Production Deployment
This guide covers essential configuration and patterns for deploying ScholarFlux in production environments for machine learning and data engineering workflows.
Overview
ScholarFlux is designed for production-grade data collection from academic APIs. This MVP guide focuses on:
ML/Data Engineering: Building training datasets, systematic reviews, longitudinal monitoring
Reproducibility: Environment configuration and containerization
Essential patterns: Caching, concurrency, and security basics
Note
ScholarFlux is currently beta (v0.5.0). Test thoroughly before production deployment and monitor the GitHub repository for updates.
Prerequisites
Completed Getting Started tutorial
Understanding of Response Handling and Error Patterns for production error handling
Understanding of Caching Strategies for persistent storage
Python 3.10+ in production environment
Redis or MongoDB for production caching (recommended)
Environment Configuration
SCHOLAR_FLUX_HOME (Recommended)
The recommended approach for production is to set SCHOLAR_FLUX_HOME to centralize all ScholarFlux configuration, caching, and logging in a single directory:
# Set SCHOLAR_FLUX_HOME
export SCHOLAR_FLUX_HOME=/opt/scholar-flux
# ScholarFlux will automatically use:
# - .env file: $SCHOLAR_FLUX_HOME/.env
# - Cache: $SCHOLAR_FLUX_HOME/package_cache/
# - Logs: $SCHOLAR_FLUX_HOME/logs/
Directory structure:
/opt/scholar-flux/
├── .env # Configuration and API keys
├── package_cache/ # Processed results cache
│ └── data_store.sqlite # SQLite cache (if using SQL backend)
└── logs/ # Application logs (with rotation)
└── scholar_flux.log
Setup:
# Create directory structure
mkdir -p /opt/scholar-flux/{package_cache,logs}
# Create .env file
cat > /opt/scholar-flux/.env << 'EOF'
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
PUBMED_API_KEY=<your_api_key>
EOF
# Set environment variable (add to ~/.bashrc or /etc/environment)
export SCHOLAR_FLUX_HOME=/opt/scholar-flux
How it works:
ScholarFlux searches for writable directories in priority order:
1. $SCHOLAR_FLUX_HOME (if set) ← Recommended for production
2. ~/.scholar_flux (user home directory)
3. .scholar_flux (current working directory)
4. Package installation directory
For .env files specifically, ScholarFlux also checks the current working directory, making it easy to place .env in either $SCHOLAR_FLUX_HOME/.env or simply .env in your project directory.
See scholar_flux.package_metadata.directories.get_default_writable_directory for implementation details.
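The lookup order above can be sketched in plain Python. This is an illustration only; the actual resolution lives in scholar_flux.package_metadata.directories.get_default_writable_directory, which also handles creating missing directories:

```python
import os
from pathlib import Path

def resolve_writable_directory(env=None):
    """Illustrative sketch of the directory lookup order described above."""
    env = os.environ if env is None else env
    candidates = []
    if env.get("SCHOLAR_FLUX_HOME"):
        candidates.append(Path(env["SCHOLAR_FLUX_HOME"]))  # 1. explicit override
    candidates.append(Path.home() / ".scholar_flux")       # 2. user home
    candidates.append(Path.cwd() / ".scholar_flux")        # 3. working directory
    # 4. the package installation directory would be the final fallback
    for candidate in candidates:
        if candidate.is_dir() and os.access(candidate, os.W_OK):
            return candidate
    return None
```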
Tip
Using SCHOLAR_FLUX_HOME is especially useful for:
Docker containers (mount a volume to a known location)
Shared servers (separate directories per user/project)
Production environments (centralized configuration)
Multi-user systems (avoid ~/.scholar_flux conflicts)
Configuration System
ScholarFlux uses a hierarchical configuration system with the following priority:
1. Explicit parameters in code
2. Environment variables (recommended for production)
3. .env file (auto-loaded from $SCHOLAR_FLUX_HOME or fallback locations)
4. Default values
See Getting Started for basic configuration setup.
Production Environment Variables
With SCHOLAR_FLUX_HOME set, create a .env file at $SCHOLAR_FLUX_HOME/.env:
Core Configuration
Based on scholar_flux.utils.config_loader:
# Logging (auto-uses $SCHOLAR_FLUX_HOME/logs/ if SCHOLAR_FLUX_HOME is set)
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO # DEBUG, INFO, WARNING, ERROR, or CRITICAL
SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE # Set FALSE for production
SCHOLAR_FLUX_LOG_STREAM=STDERR # Stream: STDERR, STDOUT, or FALSE (disables log streaming)
# Optional: Override the log directory (otherwise uses $SCHOLAR_FLUX_HOME/logs/)
# SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux
# Optional: Override the cache directory (otherwise uses $SCHOLAR_FLUX_HOME/package_cache/)
# SCHOLAR_FLUX_CACHE_DIRECTORY=/var/cache/scholar-flux
# Optional: Override the session cache name for specific backends (otherwise uses search_requests_cache)
SCHOLAR_FLUX_SESSION_CACHE_NAME=session_cache_storage
# Cache encryption (generate secure random key)
SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secure_random_key_here
Note
If SCHOLAR_FLUX_HOME is set, you typically don’t need to set SCHOLAR_FLUX_LOG_DIRECTORY or SCHOLAR_FLUX_CACHE_DIRECTORY explicitly. They default to subdirectories within $SCHOLAR_FLUX_HOME.
API Provider Keys
# Required for specific providers
PUBMED_API_KEY=<insert_your_key>
SPRINGER_NATURE_API_KEY=<insert_your_key>
CORE_API_KEY=<insert_your_key>
# Optional (some providers don't require keys)
ARXIV_API_KEY=<insert_your_key>
OPEN_ALEX_API_KEY=<insert_your_key>
CROSSREF_API_KEY=<insert_your_key>
Session and Request Defaults
Configure default behavior for API requests across all providers:
# Default User-Agent for all sessions (recommended for production)
SCHOLAR_FLUX_DEFAULT_USER_AGENT=MyApp/1.0 (https://example.com; mailto:contact@example.com)
# Default mailto for Crossref and OpenAlex (enables "polite pool" access)
SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu
Tip
Polite Pool Access: Setting SCHOLAR_FLUX_DEFAULT_MAILTO with a valid email automatically enables higher rate limits for Crossref and OpenAlex:
OpenAlex: 10 requests/second with mailto (vs. 1 req/sec without) — a 10x improvement
Crossref: Priority access and faster responses for identified users
As of v0.3.1, ScholarFlux reduced the default OpenAlex request_delay from 6s to 1s to align with their documented rate limits. Combined with mailto, this significantly improves the default throughput for OpenAlex queries.
These variables are read automatically when creating sessions and search coordinators, eliminating the need to specify them in code:
# Without environment variables - must specify each time
coordinator = SearchCoordinator(
query="machine learning",
provider_name="crossref",
mailto="your.email@institution.edu"
)
# With SCHOLAR_FLUX_DEFAULT_MAILTO set - automatically applied
coordinator = SearchCoordinator(
query="machine learning",
provider_name="crossref"
)
# mailto is automatically read from environment
Cache Backend Configuration
Configure default cache backends via environment variables. This eliminates the need to specify backends in code:
# Layer 1: HTTP Response Cache (CachedSessionManager)
# Controls the default backend for requests-cache session caching
# Options: sqlite (default), redis, mongodb, memory, filesystem, gridfs, dynamodb
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis
# Layer 2: Processed Response Cache (DataCacheManager)
# Controls the default storage for processed API response caching
# Options: inmemory (default), redis, sql/sqlalchemy, mongodb, null
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis
# Redis connection (used by both layers when redis backend is selected)
SCHOLAR_FLUX_REDIS_HOST=localhost # or REDIS_HOST
SCHOLAR_FLUX_REDIS_PORT=6379 # or REDIS_PORT
# MongoDB connection (alternative)
SCHOLAR_FLUX_MONGODB_HOST=mongodb://127.0.0.1 # or MONGODB_HOST
SCHOLAR_FLUX_MONGODB_PORT=27017 # or MONGODB_PORT
# Optional: Use SQLite, DuckDB, or another SQLAlchemy flavor for caching processed response data
# SCHOLAR_FLUX_SQLALCHEMY_URL=<file path or SQLAlchemy URL>
# Default provider (optional)
SCHOLAR_FLUX_DEFAULT_PROVIDER=plos
With these environment variables set, cache backends are configured automatically:
from scholar_flux import SearchCoordinator
from scholar_flux.sessions import CachedSessionManager
from scholar_flux.data_storage import DataCacheManager
# Without environment variables - must specify backends explicitly
session_manager = CachedSessionManager(backend='redis')
cache_manager = DataCacheManager.with_storage('redis')
# With SCHOLAR_FLUX_DEFAULT_*_CACHE_* variables set - automatic configuration
session_manager = CachedSessionManager() # Uses SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND
cache_manager = DataCacheManager.from_defaults() # Uses SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE
# SearchCoordinator also respects these defaults
coordinator = SearchCoordinator(
query="machine learning",
provider_name="pubmed",
session=session_manager(),
cache_manager=cache_manager
)
Tip
ScholarFlux accepts both prefixed (SCHOLAR_FLUX_*) and unprefixed (REDIS_HOST) variables for cache backends, prioritizing the prefixed version.
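The precedence in this tip can be illustrated with a one-line lookup (a sketch of the assumed behavior, not the package's actual code):

```python
import os

def cache_env(name, default=None):
    """Prefer the SCHOLAR_FLUX_-prefixed variable, fall back to the bare name."""
    return os.environ.get(f"SCHOLAR_FLUX_{name}", os.environ.get(name, default))
```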
Note
ScholarFlux builds on requests-cache but adds strict validation for cache TTL (expire_after), cache storage availability, and other common session cache configuration parameters. While powerful, requests-cache performs minimal input validation and may silently accept invalid TTLs or malformed values. CachedSessionManager raises clear errors for unsupported types, negative values (other than -1 for “no expiration”), and malformed strings before using these inputs to create CachedSessions, preventing unexpected errors when sending requests and keeping session caching predictable in production.
This strict validation applies to both the session cache (expire_after) and the response processing cache (ttl for Redis/MongoDB), ensuring that bad inputs fail fast instead of surfacing later during orchestrated searches.
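The fail-fast behavior described here can be approximated as follows. This is a sketch of the assumed checks, not CachedSessionManager's actual code, and the exact exception types may differ:

```python
def validate_expire_after(value):
    """Reject TTL inputs that requests-cache might otherwise accept silently."""
    if value is None or value == -1:
        return value  # explicit "no expiration"
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"Unsupported expire_after type: {type(value).__name__}")
    if value < 0:
        raise ValueError(f"expire_after must be non-negative or -1, got {value}")
    return value
```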
Runtime Configuration
For scenarios where environment variables aren’t suitable (e.g., dynamic configuration, testing), use config_settings to configure defaults at runtime:
from scholar_flux.utils import config_settings
# Set defaults programmatically (equivalent to environment variables)
config_settings.set("SCHOLAR_FLUX_DEFAULT_USER_AGENT", "MyResearchApp/1.0")
config_settings.set("SCHOLAR_FLUX_DEFAULT_PROVIDER", "Crossref")
config_settings.set("SCHOLAR_FLUX_DEFAULT_MAILTO", "researcher@university.edu")
config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "redis")
config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "redis")
# Read configuration (checks config dict first, then falls back to environment)
user_agent = config_settings.get("SCHOLAR_FLUX_DEFAULT_USER_AGENT")
cache_backend = config_settings.get("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", default="sqlite")
# Now all SearchCoordinators will use these defaults to search for academic records via Crossref
from scholar_flux import SearchCoordinator
coordinator = SearchCoordinator(query="test")
# Automatically uses the configured user_agent and mailto
Priority order for configuration:
1. Explicit parameters passed to constructors
2. Values set via config_settings.set()
3. Environment variables (from the OS or a .env file)
4. Built-in defaults
This pattern is useful for:
Testing: Override production settings without changing environment
Multi-tenant applications: Different configurations per request
Dynamic configuration: Change settings based on runtime conditions
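The override semantics that make these patterns work can be sketched as follows (an illustration of the documented "config dict first, then environment" fallback, not scholar_flux.utils.config_settings itself):

```python
import os

class ConfigSettings:
    """Minimal sketch of the runtime-override behavior described above."""

    def __init__(self):
        self._overrides = {}

    def set(self, key, value):
        self._overrides[key] = value

    def get(self, key, default=None):
        # Runtime overrides win over environment variables, then the default
        if key in self._overrides:
            return self._overrides[key]
        return os.environ.get(key, default)
```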
Loading Configuration
Automatic Loading (Recommended)
ScholarFlux automatically loads environment configuration on import, including several cache-related settings:
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND: Controls the default backend for CachedSession instances created when initializing SearchAPI or SearchCoordinator. Supported requests_cache backends include sqlite, redis, mongodb, and memory.
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE: Defines the default cache storage backend that the DataCacheManager creates for response caching during orchestration of the response processing steps. Supported options are redis, sql/sqlalchemy, mongo/mongodb, memory/inmemory, and null. Defaults to memory if not specified.
SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL: Controls how many seconds requests-cache retains cached responses before they expire. Defaults to 86400 (1 day). Set to -1 or None to disable TTL-based session cache expiration.
SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL: Controls how many seconds the response processing cache retains processed response data before it expires. Defaults to None. Set to -1 or leave as None to disable response processing cache expiration.
SCHOLAR_FLUX_MONGODB_HOST: MongoDB connection string (default: “mongodb://127.0.0.1”)
SCHOLAR_FLUX_MONGODB_PORT: MongoDB port (default: 27017)
SCHOLAR_FLUX_REDIS_HOST: Redis host (default: “localhost”)
SCHOLAR_FLUX_REDIS_PORT: Redis port (default: 6379)
Each of these settings can be loaded from the environment on import or modified at runtime:
from scholar_flux import SearchCoordinator
from scholar_flux.utils import config_settings
config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "redis") # Layer 1: Session Cache
config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "redis") # Layer 2: Response Processing Cache
# `cache_requests=True` (or `use_cache=True` — a SearchAPI parameter) is required when a default session cache is not pre-set on import
coordinator = SearchCoordinator(query='test query', cache_requests=True)
print(coordinator.api.session)
# <CachedSession(cache=<RedisCache(name=search_requests_cache)>, settings=CacheSettings(allowable_codes=[200], allowable_methods=('GET',), expire_after=86400))>
print(coordinator.responses.cache)
# DataCacheManager(cache_storage=RedisStorage(...))
Otherwise, caching can be configured (or overridden) as usual by passing explicit components when creating a SearchCoordinator:
import scholar_flux # Loads .env from $SCHOLAR_FLUX_HOME or fallback locations
from scholar_flux import SearchCoordinator
from scholar_flux.sessions import CachedSessionManager
from scholar_flux.data_storage import DataCacheManager
# Request cache (off by default) - CachedSessionManager automatically uses
# $SCHOLAR_FLUX_HOME/package_cache/ for SQLite backend
session_manager = CachedSessionManager(
backend='sqlite',
user_agent='Research/1.0 mailto:your@institution.edu',
expire_after=86400 # 24 hours
)
# Response processing cache (in-memory by default) - DataCacheManager automatically uses
# $SCHOLAR_FLUX_HOME/package_cache/ for SQLite backend
cache = DataCacheManager.with_storage(
'sqlite',
namespace='my_project',
ttl=86400 # 24 hours
)
coordinator = SearchCoordinator(
query="machine learning",
provider_name="pubmed",
session=session_manager(), # Enable request caching
cache_manager=cache # Configure response cache
)
# Both caches automatically use SCHOLAR_FLUX_HOME/package_cache/
print(session_manager.cache_path)
# Example: /opt/scholar-flux/package_cache/search_requests_cache
print(cache.cache_storage.config.get('url'))
# Example: sqlite:////opt/scholar-flux/package_cache/data_store.sqlite
Tip
With SCHOLAR_FLUX_HOME set, both CachedSessionManager and DataCacheManager with SQLite backend automatically store cache files in $SCHOLAR_FLUX_HOME/package_cache/. No path configuration needed!
Note
Default caching behavior:
Request cache (HTTP responses): Off by default. Enable with session=CachedSessionManager() or use_cache=True
Response cache (processed data): In-memory by default. Use DataCacheManager.with_storage() for persistence
Custom Configuration Path
For custom .env locations:
from scholar_flux import initialize_package
initialize_package(
env_path="/etc/scholar-flux/.env.production",
config_params={'enable_logging': True, 'log_level': 'INFO'}
)
Validation
Validate required secrets on startup:
import os
required_secrets = ['PUBMED_API_KEY', 'REDIS_HOST', 'SCHOLAR_FLUX_CACHE_SECRET_KEY']
missing = [key for key in required_secrets if not os.getenv(key)]
if missing:
raise EnvironmentError(f"Missing required configuration: {missing}")
Docker for Reproducibility
Using SCHOLAR_FLUX_HOME in Docker
The recommended approach is to mount a volume to SCHOLAR_FLUX_HOME:
# Create host directory
mkdir -p /opt/scholar-flux
# Run container with volume mount
docker run \
-e SCHOLAR_FLUX_HOME=/app/scholar-flux \
-v /opt/scholar-flux:/app/scholar-flux \
scholar-flux-app
This mounts the host’s /opt/scholar-flux to the container’s /app/scholar-flux, making .env, cache, and logs persist across container restarts.
Basic Dockerfile
Create reproducible research environments:
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Create SCHOLAR_FLUX_HOME directory (avoid brace expansion; Docker's default /bin/sh does not support it)
RUN mkdir -p /app/scholar-flux/package_cache /app/scholar-flux/logs
ENV SCHOLAR_FLUX_HOME=/app/scholar-flux
# Non-root user
RUN useradd -m scholar && \
chown -R scholar:scholar /app
USER scholar
CMD ["python", "research_pipeline.py"]
requirements.txt:
scholar-flux[parsing,database,cryptography]>=0.3.1
redis>=5.0.0
pymongo>=4.0.0
pandas>=2.0.0
Environment Variables in Docker
Option 1: .env file in SCHOLAR_FLUX_HOME (Recommended)
# Create .env on host
cat > /opt/scholar-flux/.env << 'EOF'
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
PUBMED_API_KEY=<insert_your_key>
EOF
# Run with volume mount
docker run \
-e SCHOLAR_FLUX_HOME=/app/scholar-flux \
-v /opt/scholar-flux:/app/scholar-flux \
scholar-flux-app
Option 2: Direct environment variables
docker run \
-e SCHOLAR_FLUX_HOME=/app/scholar-flux \
-e PUBMED_API_KEY=$PUBMED_API_KEY \
-e REDIS_HOST=redis \
-v /opt/scholar-flux:/app/scholar-flux \
scholar-flux-app
Option 3: Docker Compose (Recommended for multi-container)
version: '3.8'
services:
app:
build: .
environment:
- SCHOLAR_FLUX_HOME=/app/scholar-flux
- REDIS_HOST=redis
volumes:
- ./scholar-flux:/app/scholar-flux # Host directory to container
depends_on:
- redis
redis:
image: redis:7-alpine
volumes:
- redis-data:/data
volumes:
redis-data:
Directory structure on host:
./
├── docker-compose.yml
├── Dockerfile
├── research_pipeline.py
└── scholar-flux/ # Mounted to container
├── .env # Contains API keys
├── package_cache/ # Persisted cache
└── logs/ # Persisted logs
Note
Docker is recommended for reproducible research environments, not necessarily for production deployment at scale. Using SCHOLAR_FLUX_HOME with volume mounts ensures your configuration, cache, and logs persist across container restarts.
Production Patterns
This section references core patterns covered in detail elsewhere. For production deployments, understand these foundational concepts:
Caching Strategy
ScholarFlux uses two-layer caching (HTTP responses + processed results). For production:
Use persistent backends (SQLite, Redis, or MongoDB) not in-memory
Set appropriate TTL based on data freshness needs
Use namespaces to isolate different projects/experiments
Example with SCHOLAR_FLUX_HOME:
from scholar_flux import SearchCoordinator, CachedSessionManager
from scholar_flux.data_storage import DataCacheManager
import os
# 24-hour request cache expiration
session_manager = CachedSessionManager(
backend='sqlite',
user_agent='Research/1.0 mailto:user@your.affiliation.edu',
expire_after=86400
)
# Production response processing cache with Redis
cache = DataCacheManager.with_storage(
'redis',
namespace='my_project',
ttl=86400 # 24 hours
)
coordinator = SearchCoordinator(
query="deep learning",
provider_name="pubmed",
session=session_manager(),
cache_manager=cache
)
print(os.environ.get("SCHOLAR_FLUX_HOME"))
# e.g. /opt/scholar-flux in production; unset in development, where ~/.scholar_flux is used
print(session_manager.cache_path)
# OUTPUT: ~/.scholar_flux/package_cache/search_requests_cache
print(cache.cache_storage.config.get('url'))
# Redis connection details
Note
For production with Redis: Simply change backend='sqlite' to backend='redis' when creating the session manager.
Both session caching and data caching will automatically use the same Redis connection via
SCHOLAR_FLUX_REDIS_HOST and SCHOLAR_FLUX_REDIS_PORT environment variables.
See also
See Caching Strategies for comprehensive coverage including Redis/MongoDB configuration, encryption, TTL strategies, and thread-safety patterns.
Multi-Provider Search
For ML dataset collection across multiple providers, use MultiSearchCoordinator for concurrent searches:
from scholar_flux import SearchCoordinator, MultiSearchCoordinator
# Create coordinators for different providers
coordinators = [
SearchCoordinator(query="CRISPR", provider_name="pubmed"),
SearchCoordinator(query="CRISPR", provider_name="plos"),
SearchCoordinator(query="CRISPR", provider_name="crossref")
]
# Execute concurrently (thread-safe)
multi_search_coordinator = MultiSearchCoordinator.from_coordinators(coordinators)
results = multi_search_coordinator.search_pages(pages=range(1, 11))
See also
See Multi-Provider Search for threading, rate limiting coordination, and session management.
Schema Normalization
Build ML-ready datasets with consistent schemas across providers:
# Normalize results from multiple providers
results = multi_search_coordinator.search_pages(pages=range(1, 6))
normalized = results.filter().normalize(
include={'provider_name', 'query'}
)
# Export to pandas
import pandas as pd
df = pd.DataFrame(normalized)
See also
See Schema Normalization for field mappings, custom transformations, and ML pipeline integration.
Workflows
For APIs requiring multi-step retrieval (e.g., PubMed ID search → fetch records):
# PubMed workflow automatically handles search → fetch
coordinator = SearchCoordinator(
query="cancer treatment",
provider_name="pubmed" # Uses PubMedWorkflow automatically
)
results = coordinator.search_pages(pages=range(1, 11))
See also
See Workflows for custom workflows, PubMed examples, and multi-step retrieval patterns.
Production Use Cases
Systematic Literature Review
Collect and cache papers for reproducible reviews:
from scholar_flux import SearchCoordinator
from scholar_flux.data_storage import DataCacheManager
cache = DataCacheManager.with_storage(
'mongodb',
namespace='systematic_review_2024',
ttl=2592000 # 30 days
)
coordinator = SearchCoordinator(
query="machine learning healthcare",
provider_name="pubmed",
cache_manager=cache
)
# Collect all pages
results = coordinator.search_pages(pages=range(1, 101))
successful = results.filter()
# Export for analysis
normalized = successful.normalize()
import pandas as pd
df = pd.DataFrame(normalized)
df.to_csv('systematic_review_data.csv', index=False)
ML Training Data Collection
Build labeled datasets from multiple sources:
from scholar_flux import SearchCoordinator, MultiSearchCoordinator
from scholar_flux.data_storage import DataCacheManager
cache = DataCacheManager.with_storage('redis', namespace='ml_training')
# Define queries with labels
queries = {
'machine learning classification': 'ml',
'deep learning neural networks': 'dl',
'reinforcement learning agents': 'rl'
}
# Create multi-coordinator
multi_search_coordinator = MultiSearchCoordinator()
for query, label in queries.items():
coord = SearchCoordinator(
query=query,
provider_name="pubmed",
cache_manager=cache
)
multi_search_coordinator.add_coordinator(coord)
# Collect data
results = multi_search_coordinator.search_pages(pages=range(1, 11))
# Add labels and export
import pandas as pd
normalized = results.filter().normalize(include={'query'})
df = pd.DataFrame(normalized)
df['label'] = df['query'].map(queries)
# Train/test split, etc.
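The train/test split hinted at above can be done with pandas alone. This is a simple illustration; the shuffling seed and split fraction are assumptions, and stratified splitting (e.g. scikit-learn's train_test_split) may be preferable for imbalanced labels:

```python
import pandas as pd

def train_test_split_df(df, test_frac=0.2, seed=42):
    """Shuffle a labeled DataFrame and split it into train/test partitions."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_test = int(len(shuffled) * test_frac)
    return shuffled.iloc[n_test:], shuffled.iloc[:n_test]
```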
Longitudinal Monitoring
Track new publications over time with scheduled collection:
import schedule
import time
from datetime import datetime
from scholar_flux import SearchCoordinator
from scholar_flux.data_storage import DataCacheManager
cache = DataCacheManager.with_storage(
'mongodb',
namespace='monitoring',
ttl=604800 # 7 days
)
def collect_new_papers():
"""Daily collection of new papers."""
coordinator = SearchCoordinator(
query="AI safety",
provider_name="arxiv",
cache_manager=cache
)
results = coordinator.search_pages(pages=range(1, 6))
# Process and store
timestamp = datetime.now().isoformat()
normalized = results.filter().normalize()
# Save to database/file with timestamp
import pandas as pd
df = pd.DataFrame(normalized)
df['collected_at'] = timestamp
df.to_csv(f'papers_{timestamp}.csv', index=False)
# Schedule daily collection
schedule.every().day.at("09:00").do(collect_new_papers)
while True:
schedule.run_pending()
time.sleep(3600)
Data Ownership & Citation
Warning
Important Legal and Ethical Notice
ScholarFlux facilitates data retrieval but does not grant ownership of the data. Users must:
Comply with API Terms of Service
Properly cite original data sources in publications
Respect rate limits and provider guidelines
Consider data privacy (GDPR, CCPA) when handling personal information
Obtain permission for commercial use
Provider Attribution
Always cite data sources in publications:
Example acknowledgment:
“Data was retrieved from PubMed (NCBI), PLOS, and Crossref APIs using ScholarFlux. We acknowledge these providers for making scholarly data accessible.”
Provider requirements:
PubMed/NCBI: Acknowledge “Entrez Programming Utilities”
PLOS: Attribution required for PLOS content
Crossref: Use mailto in requests for higher rate limits
Springer Nature: Commercial use requires separate licensing
See individual provider documentation for complete terms of service.
Security Essentials
API Key Management
Never commit secrets to version control:
# .gitignore
.env
.env.*
*.key
Key rotation:
Rotate API keys periodically when possible. Note that some providers only issue a single API key without rotation support. For providers that support it:
# Update .env file in SCHOLAR_FLUX_HOME
# Old: PUBMED_API_KEY=old_key_123
# New: PUBMED_API_KEY=new_key_456
Then restart your application to load the new key.
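If a restart is undesirable, a minimal reload can be sketched by re-reading the rotated .env into the process environment (this sketch assumes simple, unquoted KEY=VALUE lines; initialize_package, shown under Custom Configuration Path, can also be pointed at the file):

```python
import os

def reload_env(path):
    """Re-read KEY=VALUE pairs from a .env file after a key rotation."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            # Skip blanks, comments, and anything that isn't an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()
```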
Cache Security
For sensitive research data, use encrypted caching:
from scholar_flux.sessions import EncryptionPipelineFactory, CachedSessionManager
import os
# Load or generate encryption key
key = os.getenv('SCHOLAR_FLUX_CACHE_SECRET_KEY')
encryption_factory = EncryptionPipelineFactory(key)
# Create encrypted session
serializer = encryption_factory()
session_manager = CachedSessionManager(
backend='redis',
serializer=serializer
)
See also
See SECURITY for comprehensive security guidelines including cache encryption, network security, and vulnerability reporting.
Logging
ScholarFlux includes built-in rotating logs. Configure via environment:
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux
SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE # Prevent duplicate logs
Logs automatically rotate when they reach size limits. For custom logging:
import logging
logger = logging.getLogger('scholar_flux')
logger.setLevel(logging.INFO)
# Add custom handlers as needed
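For example, a rotating file handler can be attached with the standard library alone (the file path, size limit, and backup count here are illustrative; in production, point the path at $SCHOLAR_FLUX_HOME/logs/):

```python
import logging
from logging.handlers import RotatingFileHandler

logger = logging.getLogger("scholar_flux")
logger.setLevel(logging.INFO)

# Rotate at ~10 MB, keeping 5 backups
handler = RotatingFileHandler("scholar_flux.log", maxBytes=10_000_000, backupCount=5)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
)
logger.addHandler(handler)
```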
Best Practices
Configuration
✅ Set SCHOLAR_FLUX_HOME for centralized configuration, caching, and logging
✅ Use environment-specific .env files in $SCHOLAR_FLUX_HOME
✅ Validate required secrets on application startup
✅ Use prefixed variables (SCHOLAR_FLUX_*) for clarity in shared environments
Example setup:
# Set up SCHOLAR_FLUX_HOME
export SCHOLAR_FLUX_HOME=/opt/scholar-flux
mkdir -p $SCHOLAR_FLUX_HOME/{package_cache,logs}
# Create .env file
cat > $SCHOLAR_FLUX_HOME/.env << 'EOF'
SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
SCHOLAR_FLUX_LOG_LEVEL=INFO
PUBMED_API_KEY=<insert_your_key>
EOF
Caching
✅ Use persistent backends (SQLite, Redis, MongoDB) for production
✅ Set appropriate TTL: 1-7 days for general research, 30+ days for archival
✅ Use namespaces: namespace='project_name' or namespace='user:123:project'
✅ Request caching (HTTP responses): Off by default, enable with CachedSessionManager or use_cache=True
✅ Response caching (processed data): In-memory by default, use DataCacheManager.with_storage() for persistence
✅ SQLite backends automatically use $SCHOLAR_FLUX_HOME/package_cache/ when SCHOLAR_FLUX_HOME is set
See Caching Strategies for detailed patterns.
Concurrency
✅ Use MultiSearchCoordinator for parallel provider searches
✅ Create new session per thread: session_manager() (sessions not thread-safe)
✅ Share DataCacheManager across threads (thread-safe)
See Multi-Provider Search for detailed threading patterns.
Data Management
✅ Always cite data sources in publications
✅ Check API terms of service for commercial use
✅ Implement proper data retention policies
✅ Consider GDPR/CCPA compliance for personal data
Security
✅ Rotate API keys when possible (note: some providers only issue a single key)
✅ Use encrypted caching for sensitive queries
✅ Never log API keys (ScholarFlux masks them automatically)
✅ Monitor for security advisories on GitHub
✅ Keep .env files in $SCHOLAR_FLUX_HOME with restricted permissions (chmod 600)
Production Checklist
Before deploying to production:
Environment
☐ Set up SCHOLAR_FLUX_HOME directory structure
☐ Create .env file in $SCHOLAR_FLUX_HOME with all required secrets
☐ Validate configuration on startup
☐ Verify write permissions for cache and log directories
Caching
☐ Deploy Redis or MongoDB for persistent caching
☐ Configure appropriate TTL for your use case
☐ Test cache connectivity before starting collection
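A lightweight pre-flight connectivity check can be done with a plain TCP connection before starting collection (a sketch using the environment variable conventions above; for Redis, a redis-py ping() would be the more thorough check):

```python
import os
import socket

def check_cache_reachable(host=None, port=None, timeout=2.0):
    """Return True if the cache backend accepts TCP connections."""
    host = host or os.getenv("SCHOLAR_FLUX_REDIS_HOST", "localhost")
    port = int(port or os.getenv("SCHOLAR_FLUX_REDIS_PORT", 6379))
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```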
Security
☐ Remove all hardcoded secrets from code
☐ Set up API key rotation schedule
☐ Review SECURITY guidelines
☐ Configure encrypted caching if handling sensitive data
Testing
☐ Test with small page ranges first
☐ Verify rate limiting works correctly
☐ Test failure recovery (API errors, network issues)
☐ Validate data quality and completeness
Documentation
☐ Document which providers and queries you’re using
☐ Record data collection dates for reproducibility
☐ Plan for data citation in publications
☐ Document any provider-specific configurations
Next Steps
You now understand production deployment essentials for ScholarFlux. Continue with:
Example Pipelines:
Production-quality examples demonstrating real-world integration patterns:
Retrieval Pipeline Orchestration - Scheduled data preparation with date filtering, deduplication, and Parquet export
Semantic Similarity Search - Embedding-based paper discovery with ModernBERT
Agentic Literature Review - Multi-provider search with LLM classification via PydanticAI
Related Guides:
Getting Started - Installation and basic configuration
Multi-Provider Search - Concurrent provider coordination
Schema Normalization - Building ML-ready datasets
Advanced Topics:
Caching Strategies - Advanced caching backends and patterns
Workflows - Multi-step retrieval workflows
Reference:
Security Guidelines - Comprehensive security guidelines
Custom Providers - Adding new API providers
GitHub Issues - Report bugs or request features
API Reference
Getting Help
For production deployment questions:
Check documentation: Especially Caching Strategies and Multi-Provider Search
GitHub Issues: https://github.com/SammieH21/scholar-flux/issues
Email: scholar.flux@gmail.com
Security issues: Use GitHub Security Advisories (private reporting)
When reporting issues, include:
ScholarFlux version: import scholar_flux; print(scholar_flux.__version__)
Python version and OS
Cache backend (Redis/MongoDB/SQL)
Relevant environment variables (mask secrets!)
Complete error messages
—
This is an MVP guide. We’ll expand with more patterns and examples as ScholarFlux matures toward v1.0. Feedback welcome!