Caching Strategies ================== ScholarFlux uses **two-layer caching** to speed up data collection while keeping results accurate and fresh. This tutorial shows you how to configure caching for common research workflows. .. contents:: Table of Contents :local: :depth: 2 Prerequisites ------------- - Complete :doc:`getting_started` for basic usage - Understanding of :doc:`response_handling_patterns` for error handling with caching - Basic knowledge of cache backends (Redis, MongoDB, SQLite) Understanding the Two Caches ----------------------------- ScholarFlux has two independent caches that work together: **Session Cache (HTTP Responses)** Caches raw API responses to avoid redundant network requests. Uses ``requests-cache``. - **Status**: Disabled by default - **Thread-safe**: ❌ No (create one session per thread) - **Best for**: Repeated analysis of same data **Processing Cache (Parsed Data)** Caches processed results after parsing and extraction. - **Status**: Enabled by default (in-memory) - **Thread-safe**: ✅ Yes (safe to share) - **Best for**: Avoiding re-processing of API responses .. tip:: Both caches work identically across all providers (PubMed, Crossref, PLOS, CORE, Springer Nature). Quick Start Patterns -------------------- Default: Processing Cache Only ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default, processed results are cached in memory: .. code-block:: python from scholar_flux import SearchCoordinator coordinator = SearchCoordinator( query="machine learning applications", provider_name="pubmed" ) # First search: fetches from API and caches processed results results = coordinator.search(page=1) # Second search: retrieves from processing cache (no re-parsing) results_cached = coordinator.search(page=1) Enable Session Caching ~~~~~~~~~~~~~~~~~~~~~~~ Add HTTP response caching to reduce network requests (and avoid potential rate-limit-exceeded status codes): .. code-block:: python from scholar_flux import SearchCoordinator from scholar_flux.sessions import CachedSessionManager # Create session manager (factory for thread-safe sessions) session_manager = CachedSessionManager( cache_name="my_cache", backend="memory" ) coordinator = SearchCoordinator( query="deep learning", provider_name="crossref", session=session_manager.configure_session() # Creates new session instance ) results = coordinator.search(page=1) Disable All Caching ~~~~~~~~~~~~~~~~~~~~ For testing or when you always need fresh data: .. code-block:: python from scholar_flux import SearchCoordinator # Disable processing cache coordinator = SearchCoordinator( query="quantum computing", provider_name="plos", cache_results=False ) # Every search reprocesses results results = coordinator.search(page=1) # Or temporarily disable for one request: results = coordinator.search( page=1, from_request_cache=False, # Force fresh HTTP request from_process_cache=False # Force re-processing ) Choosing a Storage Backend --------------------------- The processing cache supports four backends. Choose based on your needs: +------------------+---------------+---------+-------------+------------------+ | Backend | Thread-Safe | TTL | Persistence | Best For | +==================+===============+=========+=============+==================+ | **memory** | ✅ Yes | ❌ No | ❌ No | Development | +------------------+---------------+---------+-------------+------------------+ | **sql** | ✅ Yes | ❌ No | ✅ Yes | Local projects | +------------------+---------------+---------+-------------+------------------+ | **duckdb** | ✅ Yes | ❌ No | ✅ Yes | Local analyses | +------------------+---------------+---------+-------------+------------------+ | **redis** | ✅ Yes | ✅ Yes | ✅ Yes | Production | +------------------+---------------+---------+-------------+------------------+ | **mongodb** | ✅ Yes | ✅ Yes | ✅ Yes | Document storage | +------------------+---------------+---------+-------------+------------------+ InMemory (Default) ~~~~~~~~~~~~~~~~~~ Fast but data is lost when your program ends: .. code-block:: python from scholar_flux import SearchCoordinator from scholar_flux.data_storage import DataCacheManager cache = DataCacheManager.with_storage("memory") coordinator = SearchCoordinator( query="climate change", provider_name="core", cache_manager=cache ) SQLAlchemy (Persistent) ~~~~~~~~~~~~~~~~~~~ Best for local projects where you want cache to persist. This implementation uses SQLite under the hood by default but is usable with a wide array of SQL backends. .. code-block:: python from scholar_flux.data_storage import DataCacheManager from scholar_flux import SearchCoordinator # Uses ~/.scholar-flux/package_cache/data_store.sqlite by default cache = DataCacheManager.with_storage( "sql", namespace="literature_review" ) # Or specify custom location: cache = DataCacheManager.with_storage( "sql", namespace="literature_review", url="sqlite:///./my_cache/data.db" ) coordinator = SearchCoordinator( query="renewable energy", provider_name="springernature", cache_manager=cache ) DuckDB (Persistent) ~~~~~~~~~~~~~~~~~~~ Best for workflows requiring analytical databases where you want cache to persist. This implementation builds off of the SQLAlchemyStorage backend to tailor workflows with the DuckDB backend. .. code-block:: python from scholar_flux.data_storage import DataCacheManager from scholar_flux import SearchCoordinator # Uses ~/.scholar-flux/package_cache/data_store.duckdb by default cache = DataCacheManager.with_storage( "duckdb", namespace="literature_review" ) # Or specify custom location: cache = DataCacheManager.with_storage( "duckdb", namespace="literature_review", url="duckdb:///./my_cache/data.db" ) coordinator = SearchCoordinator( query="Herbecology", provider_name="springernature", cache_manager=cache ) Redis (Production) ~~~~~~~~~~~~~~~~~~ High-performance with automatic expiration (TTL): .. code-block:: python from scholar_flux.data_storage import DataCacheManager cache = DataCacheManager.with_storage( "redis", namespace="production_search", host="localhost", # Default configuration if omitted port=6379, # Default configuration if omitted ttl=86400 # Expire after 24 hours ) MongoDB (Document Storage) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ Similar to Redis but with document-oriented storage: .. code-block:: python from scholar_flux.data_storage import DataCacheManager cache = DataCacheManager.with_storage( "mongodb", namespace="research_project", host="mongodb://127.0.0.1", # Default configuration if omitted port=27017, # Default configuration if omitted database="scholar_flux", collection="cache", ttl=604800 # Expire after 7 days ) Environment Variable Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For production deployments, you can configure default cache backends using environment variables instead of specifying them in code: .. code-block:: bash # Session cache (HTTP responses) - used by CachedSessionManager # Options: sqlite (default), redis, mongodb, memory, filesystem export SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis export SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_TTL=-1 # Turns off `Expire-After` for Session Cache (default=86400) # Processing cache (parsed data) - used by DataCacheManager # Options: inmemory (default), redis, sql/sqlalchemy, mongodb, null export SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis export SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_TTL=600 # Caches Processed Responses for 10 minutes (default=None) # Connection settings (optional - uses localhost defaults if not set) export SCHOLAR_FLUX_REDIS_HOST=localhost export SCHOLAR_FLUX_REDIS_PORT=6379 With these variables set, caches use the configured backends automatically: .. code-block:: python from scholar_flux import SearchCoordinator from scholar_flux.sessions import CachedSessionManager from scholar_flux.data_storage import DataCacheManager # Backends automatically configured from environment session_manager = CachedSessionManager() # Uses SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND cache_manager = DataCacheManager.from_defaults() # Uses SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE coordinator = SearchCoordinator( query="machine learning", provider_name="pubmed", session=session_manager(), cache_manager=cache_manager ) You can also configure backends programmatically at runtime: .. code-block:: python from scholar_flux.utils import config_settings # Set defaults before creating coordinators config_settings.set("SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND", "redis") config_settings.set("SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE", "redis") .. seealso:: See :doc:`production_deployment` for comprehensive environment configuration including ``SCHOLAR_FLUX_HOME`` setup. Using Namespaces ---------------- Namespaces let you organize cache by project, environment, or data source, even when they use the same DB: .. code-block:: python from scholar_flux.data_storage import DataCacheManager from scholar_flux import SearchCoordinator # Separate cache for different projects cancer_cache = DataCacheManager.with_storage( "sql", namespace="cancer_research" ) climate_cache = DataCacheManager.with_storage( "sql", namespace="climate_science" ) # Each uses separate cache space cancer_coord = SearchCoordinator( query="immunotherapy", provider_name="pubmed", cache_manager=cancer_cache ) climate_coord = SearchCoordinator( query="ocean acidification", provider_name="plos", cache_manager=climate_cache ) **Namespace best practices:** .. code-block:: python # Organize by environment dev_cache = DataCacheManager.with_storage("memory", namespace="dev") prod_cache = DataCacheManager.with_storage("redis", namespace="prod") # Organize hierarchically if needed cache = DataCacheManager.with_storage( "redis", namespace="user/123/project/ml_research" ) Encrypted Session Caching -------------------------- For sensitive queries, use encrypted session cache: .. code-block:: python """ Encrypt cached HTTP responses for security """ from scholar_flux.api import SearchCoordinator from scholar_flux.sessions import EncryptionPipelineFactory, CachedSessionManager from scholar_flux.utils import config_settings import os # Load or create encryption key key = os.environ.get("SCHOLAR_FLUX_CACHE_SECRET_KEY") encryption_factory = EncryptionPipelineFactory(key) if not key: # Save this key securely - losing it means losing cached data new_key = encryption_factory.secret_key print(f"Saving the secret key...") # next reload of scholar_flux should hold the following variable after it is saved config_settings.write_key( "SCHOLAR_FLUX_CACHE_SECRET_KEY", # the name of the key new_key.decode(), # the value of the key bytes to write env_path=config_settings.env_path # the current `env_path` is actually the default ) # Create encrypted serializer serializer = encryption_factory() # Create cached session with encryption session_manager = CachedSessionManager( cache_name="encrypted_cache", backend="sqlite", cache_directory=None, # Uses default scholar-flux directory serializer=serializer ) coordinator = SearchCoordinator( query="sensitive research query", provider_name="pubmed", session=session_manager() ) # Responses are encrypted in cache results = coordinator.search(page=1) .. warning:: - Never commit encryption keys to version control - Rotate encryption keys periodically - If the key is lost, cached data cannot be recovered - Use different keys for development and production Monitoring Cache Behavior -------------------------- Enable logging to see what's being cached: .. code-block:: python import logging # Enable ScholarFlux logging (console output is enabled by default) logger = logging.getLogger('scholar_flux') logger.setLevel(logging.INFO) # Optional: prevent propagating (duplicate) logs logger.propagate = False **Inspecting cache directly:** .. code-block:: python from scholar_flux.data_storage import DataCacheManager cache = DataCacheManager.with_storage("memory") storage_backend = cache.cache_storage # Perform searches... # coordinator.search(page=1) # coordinator.search(page=2) # Check what's cached all_keys = storage_backend.retrieve_keys() print(f"Cached pages: {len(all_keys)}") print(f"Keys: {all_keys}") # Get all cached data all_data = storage_backend.retrieve_all() for key, value in all_data.items(): records = value.get('processed_records', {}) print(f"Key: {key}, Records: {len(records)}") Or for directly inspecting search results cached with the current SearchCoordinator configuration: .. code-block:: python from scholar_flux import SearchCoordinator coordinator = SearchCoordinator(provider_name = 'arxiv', query = 'machine learning', use_cache=True) results = coordinator.search_pages(range(1, 4)) # Inspecting cache keys relevant to the current search coordinator: print(coordinator.get_cached_response_keys()) # ['arxiv_machine learning_1_25', # 'arxiv_machine learning_2_25', # 'arxiv_machine learning_3_25'] # Attempts to retrieve a single result: result1 = coordinator.get_cached_search_result(page = 1) # Retrieve a search result from session cache only: cached_results = coordinator.search_pages(range(1, 5), cache_only=True) print(cached_results) # [SearchResult(query='machine learning', provider_name='arxiv', page=1, ...,'), # SearchResult(query='machine learning', provider_name='arxiv', page=2, ...,'), # SearchResult(query='machine learning', provider_name='arxiv', page=3, ...,')] Practical Examples ------------------ Example: Machine Learning Data Collection ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Collect training data with persistent caching: .. code-block:: python """ Collect labeled papers for ML classification task """ from scholar_flux import SearchCoordinator, MultiSearchCoordinator from scholar_flux.data_storage import DataCacheManager from pathlib import Path import pandas as pd # Setup persistent cache with the default SQL-storage cache = DataCacheManager.with_storage( "sql", namespace="ml_training_data", # limit the record scope ) # Collect papers on different topics topics = { "machine learning algorithms":"machine_learning", "deep learning neural networks": "deep_learning", "reinforcement learning": "reinforcement" } # Create coordinators for threaded searches by provider (sequential in this case) multicoordinator = MultiSearchCoordinator() multicoordinator.add_coordinators( SearchCoordinator( query=query, provider_name="pubmed", cache_manager=cache ) for query in topics.keys() ) # Fetch pages 1 and 2 across several search_result_list = multicoordinator.search_pages(range(1, 3)) # Show the results of the search: for search_result in search_result_list: print(f"Collected {search_result.query} page {search_result.page}: {search_result.record_count} papers") print(f"Total records: {search_result_list.record_count}") # Maps record fields to common names and stores each dictionary record inside the same list normalized_records = search_result_list.filter().normalize(include={'provider_name', 'query', 'page'}) df = pd.DataFrame(normalized_records) df['label'] = df['query'].apply(lambda q: topics[q]) print(f"Cached pages: {len(cache.cache_storage.retrieve_keys())}") Multi-Provider Parallel Searches --------------------------------- For concurrent searches across providers, use ``MultiSearchCoordinator``: .. code-block:: python from scholar_flux import SearchCoordinator, MultiSearchCoordinator from scholar_flux.data_storage import DataCacheManager from scholar_flux.sessions import CachedSessionManager user_agent="Research/1.0 (mailto:user@institution.edu)" # Change this # Each provider needs a separate session factory, independent of backend (request-cache sessions are not thread-safe) session_manager = CachedSessionManager(backend="redis", user_agent = user_agent) # The data cache manager uses a shared cache (thread-safe) cache_manager = DataCacheManager.with_storage("redis", namespace="multi_search") # Create coordinators for each provider plos = SearchCoordinator(query="neural networks", provider_name="plos", cache_manager = cache_manager, session=session_manager()) arxiv = SearchCoordinator(query="neural networks", provider_name="arxiv", cache_manager = cache_manager, session=session_manager()) crossref = SearchCoordinator(query="neural networks", provider_name="crossref", cache_manager = cache_manager, session=session_manager()) # Search all concurrently multicoordinator = MultiSearchCoordinator() multicoordinator.add_coordinators([plos, arxiv, crossref]) # All providers search in parallel (thread-safe) results = multicoordinator.search_pages(pages=range(1, 11)) .. tip:: For multi-provider concurrent searches with caching, see :doc:`multi_provider_search`. For workflow-based caching patterns, see :doc:`advanced_workflows`. For production caching deployment, see :doc:`production_deployment`. Cache Invalidation ------------------ The processing cache automatically invalidates when: 1. **Response content changed** - API returned different data 2. **Coordinator structure changed** - Different parsing/processing steps 3. **TTL expired** - Cache entry too old (Redis/MongoDB only) **Manual cache control:** .. code-block:: python from scholar_flux.data_storage import DataCacheManager from scholar_flux import SearchCoordinator cache_manager = DataCacheManager.with_storage("sql", namespace="temp") coordinator = SearchCoordinator( query="test query", provider_name="pubmed", cache_manager=cache_manager ) # Cache results results = coordinator.search(page=1) # Clear specific page (will re-cache on next search) cache_key = coordinator._create_cache_key(page=1) cache_manager.delete(cache_key) # Clear entire namespace cache_manager.cache_storage.delete_all() Time-To-Live (TTL) Strategies ------------------------------ Redis and MongoDB support automatic cache expiration: .. code-block:: python from scholar_flux.data_storage import DataCacheManager # Short TTL for frequently-changing data (1 hour) news_cache = DataCacheManager.with_storage( "redis", namespace="news", ttl=3600 ) # Medium TTL for general searches (1 day) research_cache = DataCacheManager.with_storage( "redis", namespace="research", ttl=86400 ) # Long TTL for stable data (30 days) archive_cache = DataCacheManager.with_storage( "mongodb", namespace="archive", ttl=86400 * 30 ) Troubleshooting --------------- Cache Not Persisting ~~~~~~~~~~~~~~~~~~~~ **Problem**: Using in-memory cache (default) .. code-block:: python # Problem: Data lost when program ends cache = DataCacheManager.with_storage("memory") # Solution: Use persistent backend cache = DataCacheManager.with_storage("sql") Thread Safety Errors ~~~~~~~~~~~~~~~~~~~~~ **Problem**: Sharing session across threads .. code-block:: python # ❌ Wrong: Single session shared across threads session = CachedSessionManager(backend="memory").configure_session() # Used in multiple threads - NOT SAFE # ✅ Correct: Create session per thread session_manager = CachedSessionManager(backend="memory") # Call session_manager() to create new instance per thread Redis Connection Failed ~~~~~~~~~~~~~~~~~~~~~~~~ **Check these common issues:** 1. Redis server not running: ``sudo systemctl start redis`` 2. Wrong host/port configuration 3. Firewall blocking port 6379 4. Python redis library not installed: ``pip install redis`` .. code-block:: python # import the RedisStorage directly from scholar_flux.data_storage.redis_storage import RedisStorage if RedisStorage.is_available(): cache = DataCacheManager.with_storage("redis") else: print("Redis not available, falling back to SQL") cache = DataCacheManager.with_storage("sql") Best Practices -------------- **1. Choose the Right Backend** - Development: ``memory`` (fast, no setup) - Local projects: ``sql`` (persistent, simple) - Production: ``redis`` or ``mongodb`` (scalable, TTL) **2. Use Namespaces** - Separate projects: ``namespace="project_name"`` - Separate environments: ``namespace="dev"`` vs ``namespace="prod"`` - Hierarchical: ``namespace="user:123:project:cancer"`` **3. Set Appropriate TTL** - Frequently-changing data: 1-6 hours - General research: 1-7 days - Archival data: 30+ days - Development: No TTL (never expires) **4. Monitor Your Cache** .. code-block:: python import logging logger = logging.getLogger('scholar_flux') # Will log rate-limits, response retrieval, processing, etc. logger.setLevel(logging.INFO) **5. Handle Errors Gracefully** .. code-block:: python # Continue processing even if cache fails cache = DataCacheManager.with_storage( "redis", raise_on_error=False # Log errors, don't crash ) **6. Thread Safety** - Sessions: Create per thread with ``session_manager()`` - Processing cache: Safe to share across threads - For parallel work: Use ``MultiSearchCoordinator`` **Note**: As different providers may have different data use agreements regarding data caching and storage, always review their terms of service prior to using ScholarFlux caching features! Further Reading --------------- **Related Tutorials:** - :doc:`getting_started` - Basic ScholarFlux usage - :doc:`response_handling_patterns` - Error handling with caching - :doc:`multi_provider_search` - Parallel searches and threading - :doc:`advanced_workflows` - Cache multi-step workflows **Production:** - :doc:`production_deployment` - Production caching patterns with SCHOLAR_FLUX_HOME - `Security Guidelines `_ - Comprehensive security guidelines including encryption and security best practices For questions or issues, visit the `GitHub repository `_.