Production Deployment
=====================

This guide covers essential configuration and patterns for deploying ScholarFlux in production environments for machine learning and data engineering workflows.

.. contents:: Table of Contents
   :local:
   :depth: 2

Overview
--------

ScholarFlux is designed for production-grade data collection from academic APIs. This MVP guide focuses on:

- **ML/Data Engineering**: Building training datasets, systematic reviews, longitudinal monitoring
- **Reproducibility**: Environment configuration and containerization
- **Essential patterns**: Caching, concurrency, and security basics

.. note::

   ScholarFlux is currently **beta (v0.5.0)**. Test thoroughly before production deployment and monitor the `GitHub repository <https://github.com/SammieH21/scholar-flux>`_ for updates.

Prerequisites
-------------

- Completed :doc:`getting_started` tutorial
- Understanding of :doc:`response_handling_patterns` for error handling in production
- Understanding of :doc:`caching_strategies` for persistent storage
- Python 3.10+ in the production environment
- Redis or MongoDB for production caching (recommended)

Environment Configuration
=========================

SCHOLAR_FLUX_HOME (Recommended)
--------------------------------

The recommended approach for production is to set ``SCHOLAR_FLUX_HOME`` to centralize all ScholarFlux configuration, caching, and logging in a single directory:

.. code-block:: bash

   # Set SCHOLAR_FLUX_HOME
   export SCHOLAR_FLUX_HOME=/opt/scholar-flux

   # ScholarFlux will automatically use:
   # - .env file: $SCHOLAR_FLUX_HOME/.env
   # - Cache:     $SCHOLAR_FLUX_HOME/package_cache/
   # - Logs:      $SCHOLAR_FLUX_HOME/logs/

**Directory structure:**

.. code-block:: text

   /opt/scholar-flux/
   ├── .env                      # Configuration and API keys
   ├── package_cache/            # Processed results cache
   │   └── data_store.sqlite     # SQLite cache (if using SQL backend)
   └── logs/                     # Application logs (with rotation)
       └── scholar_flux.log

**Setup:**

.. code-block:: bash

   # Create directory structure
   mkdir -p /opt/scholar-flux/{package_cache,logs}

   # Create .env file
   cat > /opt/scholar-flux/.env << 'EOF'
   SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
   SCHOLAR_FLUX_LOG_LEVEL=INFO
   PUBMED_API_KEY=
   EOF

   # Set environment variable (add to ~/.bashrc or /etc/environment)
   export SCHOLAR_FLUX_HOME=/opt/scholar-flux

**How it works:**

ScholarFlux searches for writable directories in priority order:

1. ``$SCHOLAR_FLUX_HOME`` (if set) ← **Recommended for production**
2. ``~/.scholar_flux`` (user home directory)
3. ``.scholar_flux`` (current working directory)
4. Package installation directory

For ``.env`` files specifically, ScholarFlux also checks the current working directory, making it easy to place ``.env`` in either ``$SCHOLAR_FLUX_HOME/.env`` or simply ``.env`` in your project directory.

See ``scholar_flux.package_metadata.directories.get_default_writable_directory`` for implementation details.
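To confirm which directory was resolved at runtime, you can call this helper directly (a minimal sketch; the no-argument call is an assumption based on the documented function name rather than a confirmed signature):

.. code-block:: python

   from scholar_flux.package_metadata.directories import get_default_writable_directory

   # With SCHOLAR_FLUX_HOME=/opt/scholar-flux set, this should resolve to that path
   # (no-argument call assumed here for illustration)
   print(get_default_writable_directory())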
.. tip::

   Using ``SCHOLAR_FLUX_HOME`` is especially useful for:

   - Docker containers (mount a volume to a known location)
   - Shared servers (separate directories per user/project)
   - Production environments (centralized configuration)
   - Multi-user systems (avoid ~/.scholar_flux conflicts)

Configuration System
--------------------

ScholarFlux uses a hierarchical configuration system with the following priority:

1. **Explicit parameters** in code
2. **Environment variables** (the recommended mechanism for production)
3. **``.env`` file** (auto-loaded from ``$SCHOLAR_FLUX_HOME`` or fallback locations)
4. **Default values**

See :doc:`getting_started` for basic configuration setup.

Production Environment Variables
---------------------------------

With ``SCHOLAR_FLUX_HOME`` set, create a ``.env`` file at ``$SCHOLAR_FLUX_HOME/.env``:

Core Configuration
~~~~~~~~~~~~~~~~~~

Based on ``scholar_flux.utils.config_loader``:

.. code-block:: bash

   # Logging (auto-uses $SCHOLAR_FLUX_HOME/logs/ if SCHOLAR_FLUX_HOME is set)
   SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
   SCHOLAR_FLUX_LOG_LEVEL=INFO        # DEBUG, INFO, WARNING, ERROR, or CRITICAL
   SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE  # Set FALSE for production
   SCHOLAR_FLUX_LOG_STREAM=STDERR     # STDERR, STDOUT, or FALSE (turns off log streaming)

   # Optional: Override the log directory (otherwise uses $SCHOLAR_FLUX_HOME/logs/)
   # SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux

   # Optional: Override the cache directory (otherwise uses $SCHOLAR_FLUX_HOME/package_cache/)
   # SCHOLAR_FLUX_CACHE_DIRECTORY=/var/cache/scholar-flux

   # Optional: Override the session cache name for specific backends (otherwise uses search_requests_cache)
   SCHOLAR_FLUX_SESSION_CACHE_NAME=session_cache_storage

   # Cache encryption (generate a secure random key)
   SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secure_random_key_here

.. note::

   If ``SCHOLAR_FLUX_HOME`` is set, you typically don't need to set ``SCHOLAR_FLUX_LOG_DIRECTORY`` or ``SCHOLAR_FLUX_CACHE_DIRECTORY`` explicitly. They default to subdirectories within ``$SCHOLAR_FLUX_HOME``.
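For ``SCHOLAR_FLUX_CACHE_SECRET_KEY``, one straightforward way to generate a secure random value is Python's standard-library ``secrets`` module (a sketch; any cryptographically secure random string works):

.. code-block:: python

   import secrets

   # Generate a URL-safe, 256-bit random key for SCHOLAR_FLUX_CACHE_SECRET_KEY
   print(secrets.token_urlsafe(32))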
API Provider Keys
~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Required for specific providers
   PUBMED_API_KEY=
   SPRINGER_NATURE_API_KEY=
   CORE_API_KEY=

   # Optional (some providers don't require keys)
   ARXIV_API_KEY=
   OPEN_ALEX_API_KEY=
   CROSSREF_API_KEY=

Session and Request Defaults
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure default behavior for API requests across all providers:

.. code-block:: bash

   # Default User-Agent for all sessions (recommended for production)
   SCHOLAR_FLUX_DEFAULT_USER_AGENT=MyApp/1.0 (https://example.com; mailto:contact@example.com)

   # Default mailto for Crossref and OpenAlex (enables "polite pool" access)
   SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu

.. tip::

   **Polite Pool Access**: Setting ``SCHOLAR_FLUX_DEFAULT_MAILTO`` with a valid email automatically enables higher rate limits for Crossref and OpenAlex:

   - **OpenAlex**: 10 requests/second with mailto (vs. 1 req/sec without) — a 10x improvement
   - **Crossref**: Priority access and faster responses for identified users

   As of v0.3.1, ScholarFlux reduced the default OpenAlex ``request_delay`` from 6s to 1s to align with their documented rate limits. Combined with ``mailto``, this significantly improves the default throughput for OpenAlex queries.

These variables are read automatically when creating sessions and search coordinators, eliminating the need to specify them in code:

.. code-block:: python

   from scholar_flux import SearchCoordinator

   # Without environment variables - must specify each time
   coordinator = SearchCoordinator(
       query="machine learning",
       provider_name="crossref",
       mailto="your.email@institution.edu"
   )

   # With SCHOLAR_FLUX_DEFAULT_MAILTO set - automatically applied
   coordinator = SearchCoordinator(
       query="machine learning",
       provider_name="crossref"
   )
   # mailto is automatically read from the environment

Cache Backend Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Configure default cache backends via environment variables. This eliminates the need to specify backends in code:

.. code-block:: bash

   # Layer 1: HTTP Response Cache (CachedSessionManager)
   # Controls the default backend for requests-cache session caching
   # Options: sqlite (default), redis, mongodb, memory, filesystem, gridfs, dynamodb
   SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis

   # Layer 2: Processed Response Cache (DataCacheManager)
   # Controls the default storage for processed API response caching
   # Options: inmemory (default), redis, sql/sqlalchemy, mongodb, null
   SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis

   # Redis connection (used by both layers when the redis backend is selected)
   SCHOLAR_FLUX_REDIS_HOST=localhost  # or REDIS_HOST
   SCHOLAR_FLUX_REDIS_PORT=6379       # or REDIS_PORT

   # MongoDB connection (alternative)
   SCHOLAR_FLUX_MONGODB_HOST=mongodb://127.0.0.1  # or MONGODB_HOST
   SCHOLAR_FLUX_MONGODB_PORT=27017                # or MONGODB_PORT

   # Using SQLite, DuckDB, or another SQLAlchemy flavor:
   SCHOLAR_FLUX_SQLALCHEMY_URL=None  # An optional file path or URI for caching processed response data

   # Default provider (optional)
   SCHOLAR_FLUX_DEFAULT_PROVIDER=plos

With these environment variables set, cache backends are configured automatically:

.. code-block:: python

   from scholar_flux import SearchCoordinator
   from scholar_flux.sessions import CachedSessionManager
   from scholar_flux.data_storage import DataCacheManager

   # Without environment variables - must specify backends explicitly
   session_manager = CachedSessionManager(backend='redis')
   cache_manager = DataCacheManager.with_storage('redis')

   # With SCHOLAR_FLUX_DEFAULT_*_CACHE_* variables set - automatic configuration
   session_manager = CachedSessionManager()          # Uses SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND
   cache_manager = DataCacheManager.from_defaults()  # Uses SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE

   # SearchCoordinator also respects these defaults
   coordinator = SearchCoordinator(
       query="machine learning",
       provider_name="pubmed",
       session=session_manager(),
       cache_manager=cache_manager
   )

.. tip::

   ScholarFlux accepts both prefixed (``SCHOLAR_FLUX_*``) and unprefixed (``REDIS_HOST``) variables for cache backends, prioritizing the prefixed version.

.. note::

   ScholarFlux builds on ``requests-cache`` and adds strict validation for the cache TTL (``expire_after``), storage cache availability, and other common session cache configuration parameters. While powerful, ``requests-cache`` performs minimal input validation and may silently accept invalid TTLs or malformed values. ``CachedSessionManager`` raises clear errors for unsupported types, negative values (other than ``-1`` for "no expiration"), and malformed strings **before** these inputs are used to create a ``CachedSession``, preventing unexpected errors when sending requests and ensuring that session caching in production environments operates predictably. This strict validation applies to both the session cache (``expire_after``) and the response processing cache (``ttl`` for Redis/MongoDB), so that bad inputs fail fast instead of propagating unexpected issues later during orchestrated searches.
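A quick way to observe this fail-fast behavior (a minimal sketch; the exact exception type raised is not documented here, so the example catches broadly):

.. code-block:: python

   from scholar_flux.sessions import CachedSessionManager

   try:
       # Negative TTLs other than -1 are rejected at construction time,
       # not later when a request is sent
       CachedSessionManager(backend="sqlite", expire_after=-86400)
   except Exception as exc:
       print(f"Rejected invalid TTL: {exc}")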
Runtime Configuration
---------------------

For scenarios where environment variables aren't suitable (e.g., dynamic configuration, testing), use ``config_settings`` to configure defaults at runtime:

.. code-block:: python

   from scholar_flux.utils import config_settings

   # Set defaults programmatically (equivalent to environment variables)
   config_settings.set("SCHOLAR_FLUX_DEFAULT_USER_AGENT", "MyResearchApp/1.0")
   config_settings.set("SCHOLAR_FLUX_DEFAULT_PROVIDER", "Crossref")
   config_settings.set("SCHOLAR_FLUX_DEFAULT_MAILTO", "researcher@university.edu")
   config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "redis")
   config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "redis")

   # Read configuration (checks the config dict first, then falls back to the environment)
   user_agent = config_settings.get("SCHOLAR_FLUX_DEFAULT_USER_AGENT")
   cache_backend = config_settings.get("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", default="sqlite")

   # Now all SearchCoordinators will use these defaults to search for academic records via Crossref
   from scholar_flux import SearchCoordinator
   coordinator = SearchCoordinator(query="test")  # Automatically uses the configured user_agent and mailto

**Priority order for configuration:**

1. Explicit parameters passed to constructors
2. Values set via ``config_settings.set()``
3. Environment variables (from the OS or ``.env`` file)
4. Built-in defaults

This pattern is useful for:

- **Testing**: Override production settings without changing the environment
- **Multi-tenant applications**: Different configurations per request
- **Dynamic configuration**: Change settings based on runtime conditions

Loading Configuration
---------------------

Automatic Loading (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ScholarFlux automatically loads environment configuration on import, including several settings designated for caching:

- **SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND**: Controls the default backend for ``CachedSession`` instances created when initializing a ``SearchAPI`` or ``SearchCoordinator``. Supported ``requests_cache`` backends include ``sqlite``, ``redis``, ``mongodb``, and ``memory``.
- **SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE**: Defines the default cache storage backend that the ``DataCacheManager`` creates for response caching during orchestration of the response processing steps. Supported options are ``redis``, ``sql``/``sqlalchemy``, ``mongo``/``mongodb``, ``memory``/``inmemory``, and ``null``. Defaults to ``memory`` if not specified.
- **SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL**: Controls the total number of seconds that ``requests-cache`` retains cached responses before they expire. The TTL is set to 86400 (1 day) unless overridden. Set this to ``-1`` or ``None`` to turn off TTL-based session cache expiration.
- **SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL**: Controls the total number of seconds that the response processing cache retains processed response data before it expires. The TTL is set to ``None`` unless overridden. Set this to ``-1`` or leave it as ``None`` to turn off response processing cache expiration.
- **SCHOLAR_FLUX_MONGODB_HOST**: MongoDB connection string (default: "mongodb://127.0.0.1")
- **SCHOLAR_FLUX_MONGODB_PORT**: MongoDB port (default: 27017)
- **SCHOLAR_FLUX_REDIS_HOST**: Redis host (default: "localhost")
- **SCHOLAR_FLUX_REDIS_PORT**: Redis port (default: 6379)

Each of the above settings can be loaded from the environment on import or modified at runtime.
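For example, the TTL settings can be overridden at runtime before any coordinator is created (a sketch using ``config_settings`` as described above; values are passed as environment-variable-style strings, in seconds):

.. code-block:: python

   from scholar_flux.utils import config_settings

   # Retain cached HTTP responses for 7 days instead of the 1-day default
   config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL", "604800")

   # Expire processed response data after 1 day (no expiration by default)
   config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL", "86400")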
The cache backend settings work the same way and are picked up by new coordinators:

.. code-block:: python

   from scholar_flux import SearchCoordinator
   from scholar_flux.utils import config_settings

   config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "redis")    # Layer 1: Session Cache
   config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "redis")  # Layer 2: Response Processing Cache

   # `cache_requests=True` (or `use_cache=True` — a SearchAPI parameter) is required
   # when a default session cache is not pre-set on import
   coordinator = SearchCoordinator(query='test query', cache_requests=True)

   print(coordinator.api.session)
   # <CachedSession(..., settings=CacheSettings(allowable_codes=[200], allowable_methods=('GET',), expire_after=86400))>

   print(coordinator.responses.cache)
   # DataCacheManager(cache_storage=RedisStorage(...))

Otherwise, caching can be configured (or overridden) as usual by passing explicit components when creating a ``SearchCoordinator``:

.. code-block:: python

   import scholar_flux  # Loads .env from $SCHOLAR_FLUX_HOME or fallback locations
   from scholar_flux import SearchCoordinator
   from scholar_flux.sessions import CachedSessionManager
   from scholar_flux.data_storage import DataCacheManager

   # Request cache (off by default) - CachedSessionManager automatically uses
   # $SCHOLAR_FLUX_HOME/package_cache/ for the SQLite backend
   session_manager = CachedSessionManager(
       backend='sqlite',
       user_agent='Research/1.0 mailto:your@institution.edu',
       expire_after=86400  # 24 hours
   )

   # Response processing cache (in-memory by default) - DataCacheManager automatically uses
   # $SCHOLAR_FLUX_HOME/package_cache/ for the SQLite backend
   cache = DataCacheManager.with_storage(
       'sqlite',
       namespace='my_project',
       ttl=86400  # 24 hours
   )

   coordinator = SearchCoordinator(
       query="machine learning",
       provider_name="pubmed",
       session=session_manager(),  # Enable request caching
       cache_manager=cache         # Configure response cache
   )

   # Both caches automatically use SCHOLAR_FLUX_HOME/package_cache/
   print(session_manager.cache_path)
   # Example: /opt/scholar-flux/package_cache/search_requests_cache

   print(cache.cache_storage.config.get('url'))
   # Example: sqlite:////opt/scholar-flux/package_cache/data_store.sqlite

.. tip::

   With ``SCHOLAR_FLUX_HOME`` set, both ``CachedSessionManager`` and ``DataCacheManager`` with the SQLite backend automatically store cache files in ``$SCHOLAR_FLUX_HOME/package_cache/``. No path configuration needed!

.. note::

   **Default caching behavior:**

   - **Request cache** (HTTP responses): Off by default. Enable with ``session=CachedSessionManager()`` or ``use_cache=True``
   - **Response cache** (processed data): In-memory by default. Use ``DataCacheManager.with_storage()`` for persistence

Custom Configuration Path
~~~~~~~~~~~~~~~~~~~~~~~~~~

For custom ``.env`` locations:

.. code-block:: python

   from scholar_flux import initialize_package

   initialize_package(
       env_path="/etc/scholar-flux/.env.production",
       config_params={'enable_logging': True, 'log_level': 'INFO'}
   )

Validation
~~~~~~~~~~

Validate required secrets on startup:

.. code-block:: python

   import os

   required_secrets = ['PUBMED_API_KEY', 'REDIS_HOST', 'SCHOLAR_FLUX_CACHE_SECRET_KEY']
   missing = [key for key in required_secrets if not os.getenv(key)]

   if missing:
       raise EnvironmentError(f"Missing required configuration: {missing}")
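Beyond secrets, it can help to verify cache connectivity before starting a long collection run (a sketch using the ``redis`` client directly; assumes the ``redis`` package is installed and the Redis backend is in use):

.. code-block:: python

   import os

   import redis

   # Fail fast if the Redis cache backend is unreachable at startup
   client = redis.Redis(
       host=os.getenv("SCHOLAR_FLUX_REDIS_HOST", "localhost"),
       port=int(os.getenv("SCHOLAR_FLUX_REDIS_PORT", "6379")),
   )
   client.ping()  # Raises redis.exceptions.ConnectionError if unreachable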
Docker for Reproducibility
===========================

Using SCHOLAR_FLUX_HOME in Docker
----------------------------------

The recommended approach is to mount a volume to ``SCHOLAR_FLUX_HOME``:

.. code-block:: bash

   # Create host directory
   mkdir -p /opt/scholar-flux

   # Run container with volume mount
   docker run \
     -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
     -v /opt/scholar-flux:/app/scholar-flux \
     scholar-flux-app

This mounts the host's ``/opt/scholar-flux`` to the container's ``/app/scholar-flux``, making ``.env``, cache, and logs persist across container restarts.

Basic Dockerfile
----------------

Create reproducible research environments:

.. code-block:: dockerfile

   FROM python:3.11-slim

   WORKDIR /app

   # Install dependencies
   COPY requirements.txt .
   RUN pip install --no-cache-dir -r requirements.txt

   # Copy application
   COPY . .

   # Create SCHOLAR_FLUX_HOME directories (RUN uses /bin/sh, which lacks brace expansion)
   RUN mkdir -p /app/scholar-flux/package_cache /app/scholar-flux/logs
   ENV SCHOLAR_FLUX_HOME=/app/scholar-flux

   # Non-root user
   RUN useradd -m scholar && \
       chown -R scholar:scholar /app
   USER scholar

   CMD ["python", "research_pipeline.py"]

**requirements.txt:**

.. code-block:: text

   scholar-flux[parsing,database,cryptography]>=0.3.1
   redis>=5.0.0
   pymongo>=4.0.0
   pandas>=2.0.0

Environment Variables in Docker
--------------------------------

**Option 1: .env file in SCHOLAR_FLUX_HOME (Recommended)**

.. code-block:: bash

   # Create .env on host
   cat > /opt/scholar-flux/.env << 'EOF'
   SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
   PUBMED_API_KEY=
   EOF

   # Run with volume mount
   docker run \
     -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
     -v /opt/scholar-flux:/app/scholar-flux \
     scholar-flux-app

**Option 2: Direct environment variables**

.. code-block:: bash

   docker run \
     -e SCHOLAR_FLUX_HOME=/app/scholar-flux \
     -e PUBMED_API_KEY=$PUBMED_API_KEY \
     -e REDIS_HOST=redis \
     -v /opt/scholar-flux:/app/scholar-flux \
     scholar-flux-app

**Option 3: Docker Compose (Recommended for multi-container)**

.. code-block:: yaml

   version: '3.8'

   services:
     app:
       build: .
       environment:
         - SCHOLAR_FLUX_HOME=/app/scholar-flux
         - REDIS_HOST=redis
       volumes:
         - ./scholar-flux:/app/scholar-flux  # Host directory to container
       depends_on:
         - redis

     redis:
       image: redis:7-alpine
       volumes:
         - redis-data:/data

   volumes:
     redis-data:

**Directory structure on host:**

.. code-block:: text

   ./
   ├── docker-compose.yml
   ├── Dockerfile
   ├── research_pipeline.py
   └── scholar-flux/          # Mounted to container
       ├── .env               # Contains API keys
       ├── package_cache/     # Persisted cache
       └── logs/              # Persisted logs

.. note::

   Docker is recommended for **reproducible research environments**, not necessarily for production deployment at scale. Using ``SCHOLAR_FLUX_HOME`` with volume mounts ensures your configuration, cache, and logs persist across container restarts.

Production Patterns
===================

This section references core patterns covered in detail elsewhere. For production deployments, understand these foundational concepts:

Caching Strategy
----------------

ScholarFlux uses two-layer caching (HTTP responses + processed results). For production:

- Use **persistent backends** (SQLite, Redis, or MongoDB) rather than in-memory storage
- Set an appropriate **TTL** based on data freshness needs
- Use **namespaces** to isolate different projects/experiments

**Example with SCHOLAR_FLUX_HOME:**
.. code-block:: python

   import os

   from scholar_flux import SearchCoordinator, CachedSessionManager
   from scholar_flux.data_storage import DataCacheManager

   # 24-hour request cache expiration by default
   session_manager = CachedSessionManager(
       backend='sqlite',
       user_agent='Research/1.0 mailto:user@your.affiliation.edu',
       expire_after=86400
   )

   # Production response processing cache with Redis
   cache = DataCacheManager.with_storage(
       'redis',
       namespace='my_project',
       ttl=86400  # 24 hours
   )

   coordinator = SearchCoordinator(
       query="deep learning",
       provider_name="pubmed",
       session=session_manager(),
       cache_manager=cache
   )

   print(os.environ.get("SCHOLAR_FLUX_HOME"))
   # None when unset; ScholarFlux then falls back to ~/.scholar_flux during development

   print(session_manager.cache_path)
   # OUTPUT: ~/.scholar_flux/package_cache/search_requests_cache

   print(cache.cache_storage.config.get('url'))
   # Redis connection details

.. note::

   **For production with Redis**: Simply change ``backend='sqlite'`` to ``backend='redis'`` when creating the session manager. Both session caching and data caching will automatically use the same Redis connection via the ``SCHOLAR_FLUX_REDIS_HOST`` and ``SCHOLAR_FLUX_REDIS_PORT`` environment variables.

.. seealso::

   See :doc:`caching_strategies` for comprehensive coverage including Redis/MongoDB configuration, encryption, TTL strategies, and thread-safety patterns.

Multi-Provider Search
---------------------

For ML dataset collection across multiple providers, use ``MultiSearchCoordinator`` for concurrent searches:

.. code-block:: python

   from scholar_flux import SearchCoordinator, MultiSearchCoordinator

   # Create coordinators for different providers
   coordinators = [
       SearchCoordinator(query="CRISPR", provider_name="pubmed"),
       SearchCoordinator(query="CRISPR", provider_name="plos"),
       SearchCoordinator(query="CRISPR", provider_name="crossref")
   ]

   # Execute concurrently (thread-safe)
   multi_search_coordinator = MultiSearchCoordinator.from_coordinators(coordinators)
   results = multi_search_coordinator.search_pages(pages=range(1, 11))

.. seealso::

   See :doc:`multi_provider_search` for threading, rate limiting coordination, and session management.

Schema Normalization
--------------------

Build ML-ready datasets with consistent schemas across providers:

.. code-block:: python

   # Normalize results from multiple providers
   results = multi_search_coordinator.search_pages(pages=range(1, 6))
   normalized = results.filter().normalize(
       include={'provider_name', 'query'}
   )

   # Export to pandas
   import pandas as pd
   df = pd.DataFrame(normalized)

.. seealso::

   See :doc:`schema_normalization` for field mappings, custom transformations, and ML pipeline integration.

Workflows
---------

For APIs requiring multi-step retrieval (e.g., PubMed ID search → fetch records):

.. code-block:: python

   # The PubMed workflow automatically handles search → fetch
   coordinator = SearchCoordinator(
       query="cancer treatment",
       provider_name="pubmed"  # Uses PubMedWorkflow automatically
   )
   results = coordinator.search_pages(pages=range(1, 11))

.. seealso::

   See :doc:`advanced_workflows` for custom workflows, PubMed examples, and multi-step retrieval patterns.

Production Use Cases
====================

Systematic Literature Review
-----------------------------

Collect and cache papers for reproducible reviews:
.. code-block:: python

   from scholar_flux import SearchCoordinator
   from scholar_flux.data_storage import DataCacheManager

   cache = DataCacheManager.with_storage(
       'mongodb',
       namespace='systematic_review_2024',
       ttl=2592000  # 30 days
   )

   coordinator = SearchCoordinator(
       query="machine learning healthcare",
       provider_name="pubmed",
       cache_manager=cache
   )

   # Collect all pages
   results = coordinator.search_pages(pages=range(1, 101))
   successful = results.filter()

   # Export for analysis
   normalized = successful.normalize()

   import pandas as pd
   df = pd.DataFrame(normalized)
   df.to_csv('systematic_review_data.csv', index=False)

ML Training Data Collection
----------------------------

Build labeled datasets from multiple sources:

.. code-block:: python

   from scholar_flux import SearchCoordinator, MultiSearchCoordinator
   from scholar_flux.data_storage import DataCacheManager

   cache = DataCacheManager.with_storage('redis', namespace='ml_training')

   # Define queries with labels
   queries = {
       'machine learning classification': 'ml',
       'deep learning neural networks': 'dl',
       'reinforcement learning agents': 'rl'
   }

   # Create multi-coordinator
   multi_search_coordinator = MultiSearchCoordinator()
   for query, label in queries.items():
       coord = SearchCoordinator(
           query=query,
           provider_name="pubmed",
           cache_manager=cache
       )
       multi_search_coordinator.add_coordinator(coord)

   # Collect data
   results = multi_search_coordinator.search_pages(pages=range(1, 11))

   # Add labels and export
   import pandas as pd
   normalized = results.filter().normalize(include={'query'})
   df = pd.DataFrame(normalized)
   df['label'] = df['query'].map(queries)

   # Train/test split, etc.

Longitudinal Monitoring
-----------------------

Track new publications over time with scheduled collection:

.. code-block:: python

   import schedule
   import time
   from datetime import datetime

   from scholar_flux import SearchCoordinator
   from scholar_flux.data_storage import DataCacheManager

   cache = DataCacheManager.with_storage(
       'mongodb',
       namespace='monitoring',
       ttl=604800  # 7 days
   )

   def collect_new_papers():
       """Daily collection of new papers."""
       coordinator = SearchCoordinator(
           query="AI safety",
           provider_name="arxiv",
           cache_manager=cache
       )

       results = coordinator.search_pages(pages=range(1, 6))

       # Process and store
       timestamp = datetime.now().isoformat()
       normalized = results.filter().normalize()

       # Save to database/file with timestamp
       import pandas as pd
       df = pd.DataFrame(normalized)
       df['collected_at'] = timestamp
       df.to_csv(f'papers_{timestamp}.csv', index=False)

   # Schedule daily collection
   schedule.every().day.at("09:00").do(collect_new_papers)

   while True:
       schedule.run_pending()
       time.sleep(3600)

Data Ownership & Citation
==========================

.. warning::

   **Important Legal and Ethical Notice**

   ScholarFlux facilitates data retrieval but **does not grant ownership** of the data. Users must:

   1. Comply with API Terms of Service
   2. Properly cite original data sources in publications
   3. Respect rate limits and provider guidelines
   4. Consider data privacy (GDPR, CCPA) when handling personal information
   5. Obtain permission for commercial use

Provider Attribution
--------------------

Always cite data sources in publications.

**Example acknowledgment:**

   "Data was retrieved from PubMed (NCBI), PLOS, and Crossref APIs using ScholarFlux. We acknowledge these providers for making scholarly data accessible."
**Provider requirements:**

- **PubMed/NCBI**: Acknowledge "Entrez Programming Utilities"
- **PLOS**: Attribution required for PLOS content
- **Crossref**: Use ``mailto`` in requests for higher rate limits
- **Springer Nature**: Commercial use requires separate licensing

See individual provider documentation for complete terms of service.

Security Essentials
===================

API Key Management
------------------

Never commit secrets to version control:

.. code-block:: bash

   # .gitignore
   .env
   .env.*
   *.key

**Key rotation:** Rotate API keys periodically when possible. Note that some providers only issue a single API key without rotation support. For providers that support it:

.. code-block:: bash

   # Update the .env file in SCHOLAR_FLUX_HOME
   # Old: PUBMED_API_KEY=old_key_123
   # New: PUBMED_API_KEY=new_key_456

Then restart your application to load the new key.

Cache Security
--------------

For sensitive research data, use encrypted caching:

.. code-block:: python

   import os

   from scholar_flux.sessions import EncryptionPipelineFactory, CachedSessionManager

   # Load or generate an encryption key
   key = os.getenv('SCHOLAR_FLUX_CACHE_SECRET_KEY')
   encryption_factory = EncryptionPipelineFactory(key)

   # Create an encrypted session
   serializer = encryption_factory()
   session_manager = CachedSessionManager(
       backend='redis',
       serializer=serializer
   )

.. seealso::

   See the ``SECURITY`` guidelines in the `GitHub repository <https://github.com/SammieH21/scholar-flux>`_ for comprehensive coverage of cache encryption, network security, and vulnerability reporting.

Logging
-------

ScholarFlux includes built-in rotating logs. Configure via the environment:

.. code-block:: bash

   SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
   SCHOLAR_FLUX_LOG_LEVEL=INFO
   SCHOLAR_FLUX_LOG_DIRECTORY=/var/log/scholar-flux
   SCHOLAR_FLUX_PROPAGATE_LOGS=FALSE  # Prevent duplicate logs

Logs automatically rotate when they reach size limits. For custom logging:

.. code-block:: python

   import logging

   logger = logging.getLogger('scholar_flux')
   logger.setLevel(logging.INFO)

   # Add custom handlers as needed

Best Practices
==============

Configuration
-------------

✅ **Set ``SCHOLAR_FLUX_HOME``** for centralized configuration, caching, and logging

✅ Use environment-specific ``.env`` files in ``$SCHOLAR_FLUX_HOME``

✅ Validate required secrets on application startup

✅ Use prefixed variables (``SCHOLAR_FLUX_*``) for clarity in shared environments

**Example setup:**

.. code-block:: bash

   # Set up SCHOLAR_FLUX_HOME
   export SCHOLAR_FLUX_HOME=/opt/scholar-flux
   mkdir -p $SCHOLAR_FLUX_HOME/{package_cache,logs}

   # Create .env file
   cat > $SCHOLAR_FLUX_HOME/.env << 'EOF'
   SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
   SCHOLAR_FLUX_LOG_LEVEL=INFO
   PUBMED_API_KEY=
   EOF

Caching
-------

✅ Use persistent backends (SQLite, Redis, MongoDB) for production

✅ Set an appropriate TTL: 1-7 days for general research, 30+ days for archival

✅ Use namespaces: ``namespace='project_name'`` or ``namespace='user:123:project'``

✅ **Request caching** (HTTP responses): Off by default; enable with ``CachedSessionManager`` or ``use_cache=True``

✅ **Response caching** (processed data): In-memory by default; use ``DataCacheManager.with_storage()`` for persistence

✅ SQLite backends automatically use ``$SCHOLAR_FLUX_HOME/package_cache/`` when ``SCHOLAR_FLUX_HOME`` is set

See :doc:`caching_strategies` for detailed patterns.

Concurrency
-----------

✅ Use ``MultiSearchCoordinator`` for parallel provider searches

✅ Create a new session per thread with ``session_manager()`` (sessions are not thread-safe); see the sketch below

✅ Share a ``DataCacheManager`` across threads (thread-safe)

See :doc:`multi_provider_search` for detailed threading patterns.
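A minimal per-thread sketch consistent with these rules (``MultiSearchCoordinator`` handles this coordination for you; this only illustrates the session-per-thread pattern using APIs shown earlier in this guide):

.. code-block:: python

   from concurrent.futures import ThreadPoolExecutor

   from scholar_flux import SearchCoordinator
   from scholar_flux.data_storage import DataCacheManager
   from scholar_flux.sessions import CachedSessionManager

   session_manager = CachedSessionManager(backend='sqlite')
   cache_manager = DataCacheManager.with_storage('sqlite')  # Shared across threads (thread-safe)

   def search_provider(provider_name: str):
       # Each thread receives its own session from the session manager
       coordinator = SearchCoordinator(
           query="CRISPR",
           provider_name=provider_name,
           session=session_manager(),  # New session per thread
           cache_manager=cache_manager,
       )
       return coordinator.search_pages(pages=range(1, 3))

   with ThreadPoolExecutor(max_workers=3) as pool:
       results = list(pool.map(search_provider, ["pubmed", "plos", "crossref"]))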
Data Management
---------------

✅ Always cite data sources in publications

✅ Check API terms of service for commercial use

✅ Implement proper data retention policies

✅ Consider GDPR/CCPA compliance for personal data

Security
--------

✅ Rotate API keys when possible (note: some providers only issue a single key)

✅ Use encrypted caching for sensitive queries

✅ Never log API keys (ScholarFlux masks them automatically)

✅ Monitor for security advisories on GitHub

✅ Keep ``.env`` files in ``$SCHOLAR_FLUX_HOME`` with restricted permissions (``chmod 600``)

Production Checklist
====================

Before deploying to production:

**Environment**

☐ Set up the ``SCHOLAR_FLUX_HOME`` directory structure

☐ Create a ``.env`` file in ``$SCHOLAR_FLUX_HOME`` with all required secrets

☐ Validate configuration on startup

☐ Verify write permissions for cache and log directories

**Caching**

☐ Deploy Redis or MongoDB for persistent caching

☐ Configure an appropriate TTL for your use case

☐ Test cache connectivity before starting collection

**Security**

☐ Remove all hardcoded secrets from code

☐ Set up an API key rotation schedule

☐ Review the project's ``SECURITY`` guidelines

☐ Configure encrypted caching if handling sensitive data

**Testing**

☐ Test with small page ranges first

☐ Verify rate limiting works correctly

☐ Test failure recovery (API errors, network issues)

☐ Validate data quality and completeness

**Documentation**

☐ Document which providers and queries you're using

☐ Record data collection dates for reproducibility

☐ Plan for data citation in publications

☐ Document any provider-specific configurations

Next Steps
==========

You now understand production deployment essentials for ScholarFlux. Continue with:

**Example Pipelines:**

Production-quality examples demonstrating real-world integration patterns:

- `Retrieval Pipeline Orchestration `_ - Scheduled data preparation with date filtering, deduplication, and Parquet export
- `Semantic Similarity Search `_ - Embedding-based paper discovery with ModernBERT
- `Agentic Literature Review `_ - Multi-provider search with LLM classification via PydanticAI

**Related Guides:**

- :doc:`getting_started` - Installation and basic configuration
- :doc:`multi_provider_search` - Concurrent provider coordination
- :doc:`schema_normalization` - Building ML-ready datasets

**Advanced Topics:**

- :doc:`caching_strategies` - Advanced caching backends and patterns
- :doc:`advanced_workflows` - Multi-step retrieval workflows

**Reference:**

- ``SECURITY`` guidelines in the `GitHub repository <https://github.com/SammieH21/scholar-flux>`_ - Comprehensive security guidelines
- :doc:`custom_providers` - Adding new API providers
- `GitHub Issues <https://github.com/SammieH21/scholar-flux/issues>`_ - Report bugs or request features

API Reference
-------------

- :class:`~scholar_flux.api.SearchAPI`
- :class:`~scholar_flux.api.SearchCoordinator`
- :class:`~scholar_flux.api.MultiSearchCoordinator`
- :class:`~scholar_flux.utils.ConfigLoader`
- :class:`~scholar_flux.data_storage.DataCacheManager`
- :class:`~scholar_flux.sessions.CachedSessionManager`

Getting Help
------------

For production deployment questions:

1. **Check documentation**: Especially :doc:`caching_strategies` and :doc:`multi_provider_search`
2. **GitHub Issues**: https://github.com/SammieH21/scholar-flux/issues
3. **Email**: scholar.flux@gmail.com
4. **Security issues**: Use GitHub Security Advisories (private reporting)

When reporting issues, include:

- ScholarFlux version: ``import scholar_flux; print(scholar_flux.__version__)``
- Python version and OS
- Cache backend (Redis/MongoDB/SQL)
- Relevant environment variables (mask secrets!)
- Complete error messages

----

**This is an MVP guide.** We'll expand with more patterns and examples as ScholarFlux matures toward v1.0. Feedback welcome!