Getting Started
===============

Welcome to ScholarFlux! This tutorial guides you through installation, configuration, and your first search across academic databases.

Overview
--------

ScholarFlux is a production-grade orchestration layer for academic APIs that enables concurrent multi-provider search with automatic rate limiting and schema normalization. By the end of this tutorial, you'll be querying multiple scholarly databases with just a few lines of Python.

Prerequisites
-------------

Before starting, ensure you have:

- **Python 3.10 or higher** installed
- **pip** or **Poetry** for package management
- Basic familiarity with Python
- (Optional) API keys for providers requiring authentication

.. note::

   Most providers (PLOS, arXiv, OpenAlex, Crossref) work out-of-the-box without API keys!

Learning Objectives
-------------------

By the end of this tutorial, you will:

- Install ScholarFlux with the appropriate extras
- Configure environment variables and API keys
- Execute your first search query
- Handle successful and failed searches safely
- Retrieve multiple pages of results
- Enable caching for better performance

Installation
------------

Basic Installation
~~~~~~~~~~~~~~~~~~

Install ScholarFlux using pip:

.. code-block:: bash

   pip install scholar-flux

This installs the core package with minimal dependencies, sufficient for providers like PLOS, OpenAlex, and Crossref that return JSON responses.

Installation with Extras
~~~~~~~~~~~~~~~~~~~~~~~~

For full functionality, install optional dependencies. The requirement is quoted so that shells such as zsh do not treat the square brackets as a glob pattern:

.. code-block:: bash

   # All features (recommended for development)
   pip install "scholar-flux[parsing,database,cryptography,duckdb]"

   # XML parsing only (for PubMed, arXiv)
   pip install "scholar-flux[parsing]"

   # Database response caching backends (Redis, MongoDB, SQLAlchemy)
   pip install "scholar-flux[database]"

   # DuckDB response caching via SQLAlchemy
   pip install "scholar-flux[duckdb]"

   # Encrypted caching support
   pip install "scholar-flux[cryptography]"

**When to use which extras:**

+------------------+------------------------------------------------+--------------------------------+
| Extra            | Installs                                       | Required For                   |
+==================+================================================+================================+
| ``parsing``      | ``xmltodict``, ``pyyaml``                      | PubMed, arXiv (XML responses)  |
+------------------+------------------------------------------------+--------------------------------+
| ``database``     | ``redis``, ``pymongo``, ``sqlalchemy``         | Production caching backends    |
+------------------+------------------------------------------------+--------------------------------+
| ``cryptography`` | ``cryptography``                               | Encrypted session caching      |
+------------------+------------------------------------------------+--------------------------------+

Development Installation
~~~~~~~~~~~~~~~~~~~~~~~~

For contributing or running tests:

.. code-block:: bash

   git clone https://github.com/SammieH21/scholar-flux.git
   cd scholar-flux
   poetry install --with dev,testing --all-extras

Verifying Installation
~~~~~~~~~~~~~~~~~~~~~~

Test your installation:

.. code-block:: python

   import scholar_flux

   print(scholar_flux.__version__)
   # Output: 0.5.0

.. code-block:: python

   from scholar_flux import SearchCoordinator

   # Quick test with PLOS (no API key needed)
   coordinator = SearchCoordinator(query="computer science validation strategies", provider_name="plos")
   result = coordinator.search_page(page=1)

   if result:
       print(f"✅ Installation successful! Retrieved {len(result.data)} records")
   else:
       print(f"❌ Search failed: {result.error}")

If you see "✅ Installation successful!", you're ready to continue!

Configuration
-------------

Environment Variables
~~~~~~~~~~~~~~~~~~~~~

ScholarFlux supports configuration via environment variables. Create a ``.env`` file in your project root:

.. code-block:: bash

   # Logging configuration
   SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
   SCHOLAR_FLUX_LOG_LEVEL=INFO
   SCHOLAR_FLUX_PROPAGATE_LOGS=TRUE

   # API keys (optional - only needed for specific providers)
   PUBMED_API_KEY=your_pubmed_key_here
   SPRINGER_NATURE_API_KEY=your_springer_key_here
   CORE_API_KEY=your_core_key_here

   # Cache encryption (optional)
   SCHOLAR_FLUX_CACHE_SECRET_KEY=your_secret_key_here

Session and Request Defaults
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The default behavior for API requests across all providers can also be configured:

.. code-block:: bash

   # Default User-Agent for all sessions (recommended for production)
   # Quoted because the value contains spaces and a semicolon
   SCHOLAR_FLUX_DEFAULT_USER_AGENT="MyApp/1.0 (https://example.com; mailto:contact@example.com)"

   # Default mailto for Crossref and OpenAlex (enables "polite pool" access)
   SCHOLAR_FLUX_DEFAULT_MAILTO=your.email@institution.edu

.. tip::

   **Polite Pool Access**: Setting ``SCHOLAR_FLUX_DEFAULT_MAILTO`` automatically enables higher rate limits for OpenAlex and Crossref:

   - **OpenAlex**: 10 requests/second (vs. 1 request/second without)
   - **Crossref**: Priority access and faster responses

Cache Backend Defaults
^^^^^^^^^^^^^^^^^^^^^^

In addition to request defaults, environment variables can set the system-wide default cache backends used for session requests and response processing:

.. code-block:: bash

   # Session cache backend (HTTP responses)
   # Options: sqlite (default), redis, mongodb, memory, filesystem
   SCHOLAR_FLUX_DEFAULT_SESSION_CACHE_BACKEND=redis

   # Processing cache backend (parsed data)
   # Options: inmemory (default), redis, sql/sqlalchemy/sqlite, mongodb, null
   SCHOLAR_FLUX_DEFAULT_RESPONSE_CACHE_STORAGE=redis

.. seealso::

   For comprehensive environment configuration, see :doc:`production_deployment`.

.. warning::

   Never commit ``.env`` files to version control! Add ``.env`` to your ``.gitignore``.

Loading Configuration
~~~~~~~~~~~~~~~~~~~~~

Option 1: Automatic loading (recommended)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Create a ``.env`` file in your project root. ScholarFlux automatically loads it on import:

.. code-block:: python

   import scholar_flux  # Automatically loads .env

Option 2: Explicit initialization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For custom configuration paths:

.. code-block:: python

   from scholar_flux import initialize_package

   initialize_package(
       config_params={'enable_logging': True, 'log_level': 'DEBUG'},
       env_path='path/to/custom/.env'
   )

Option 3: Direct environment variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Set environment variables directly (useful for containers):

.. code-block:: bash

   export SCHOLAR_FLUX_ENABLE_LOGGING=TRUE
   export SCHOLAR_FLUX_LOG_LEVEL=DEBUG
   export PUBMED_API_KEY=your_key_here

API Key Setup
~~~~~~~~~~~~~

Providers requiring API keys
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most APIs work out of the box. Springer Nature requires an API key, while PubMed and CORE accept an optional key that raises their rate limits:

+---------------------+------------------+---------------------------------------+
| Provider            | API Key Needed   | How to Obtain                         |
+=====================+==================+=======================================+
| PLOS                | No               | Works out-of-the-box                  |
+---------------------+------------------+---------------------------------------+
| arXiv               | No               | Works out-of-the-box                  |
+---------------------+------------------+---------------------------------------+
| OpenAlex            | No               | Optional ``mailto`` for higher limits |
+---------------------+------------------+---------------------------------------+
| Crossref            | No               | Optional ``mailto`` for higher limits |
+---------------------+------------------+---------------------------------------+
| PubMed              | No (Optional)    | https://www.ncbi.nlm.nih.gov/account/ |
+---------------------+------------------+---------------------------------------+
| CORE                | No (Optional)    | https://core.ac.uk/services/api       |
+---------------------+------------------+---------------------------------------+
| Springer Nature     | Yes              | https://dev.springernature.com        |
+---------------------+------------------+---------------------------------------+

PubMed API Key Setup
^^^^^^^^^^^^^^^^^^^^

PubMed doesn't require an API key, but having one raises the rate limit from 3 to 10 requests per second (as of 2026).

1. Create an NCBI account: https://www.ncbi.nlm.nih.gov/account/
2. Navigate to Settings → API Key Management
3. Generate a new API key
4. Export your PubMed API key as an environment variable or add it to a ``.env`` file (see the Configuration section above)

CORE API Key Setup
^^^^^^^^^^^^^^^^^^

Similarly, the CORE API doesn't require an API key, but having one greatly increases rate limits, which is **very** important for batch requests.

1. Create a CORE account: https://core.ac.uk/services/api
2. Navigate to **Register Now** and select **Academic**, **Non-Academic**, or **Personal Use** depending on your affiliation
3. Check your email for a new API key
4. Export your CORE API key as an environment variable or add it to a ``.env`` file (see the Configuration section above)

   .. code-block:: bash

      CORE_API_KEY=your_key_here

5. Verify that a configured key is picked up. The example below checks the PubMed key:

   .. code-block:: python

      from scholar_flux import SearchCoordinator

      coordinator = SearchCoordinator(query="human psychology", provider_name="pubmed")
      result = coordinator.search_page(page=1)

      if coordinator.api.api_key and result:
          print(f"✅ PubMed API key working! Retrieved {result.record_count} records!")

Your First Search
-----------------

Single-Provider Search
~~~~~~~~~~~~~~~~~~~~~~

Let's search PLOS for articles about machine learning:

.. code-block:: python

   from scholar_flux import SearchCoordinator

   # Create a coordinator for PLOS
   coordinator = SearchCoordinator(
       query="machine learning",
       provider_name="plos"
   )

   # Execute search for page 1
   result = coordinator.search_page(page=1)

   # Check if search was successful
   if result:
       print(f"Found {len(result.data)} records")

       # Access the first record
       first_record = result.data[0]
       print(f"\nTitle: {first_record.get('title_display')}")
       print(f"DOI: {first_record.get('id')}")
       print(f"Journal: {first_record.get('journal')}")
   else:
       print(f"Search failed: {result.error} - {result.message}")

**Example output** (titles and identifiers will differ for your query):

.. code-block:: text

   Found 50 records

   Title: Deep learning applications in medical image analysis
   DOI: 10.1371/journal.pone.0212345
   Journal: PLOS ONE

Understanding the Response
~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``coordinator.search_page()`` method returns a :class:`~scholar_flux.api.models.SearchResult` container with search metadata (``query``, ``provider_name``, ``page``) and a ``response_result`` attribute. A ``SearchResult`` is truthy when the search succeeds and falsy when it fails, which keeps error checking simple:

.. code-block:: python

   result = coordinator.search_page(page=1)

   if result:
       # Success - access data safely
       print(f"Found {len(result.data)} records")
       for record in result.data[:3]:
           print(f"Title: {record.get('title_display')}")
   else:
       # Failure - diagnostic info always available
       print(f"Error: {result.error} - {result.message}")
       print(f"Provider: {result.provider_name}, Page: {result.page}")

**What's in a SearchResult:**

- ``response``: The raw response received from an API
- ``processed_records``: List of records (dictionaries) after processing
- ``data``: An alias for ``processed_records``
- ``extracted_records``: List of records (dictionaries) after parsing but before processing
- ``metadata``: Provider-specific info (total results, page size, etc.)
- ``parsed_response``: The response data after parsing with JSON, XML, or YAML
- ``query``: Your search query
- ``provider_name``: The provider that was queried
- ``page``: The page number requested
- ``response_result``: The underlying response object (``ProcessedResponse``, ``ErrorResponse``, or ``NonResponse``) after response processing

.. tip::

   For detailed information on response types, error handling patterns, and the ``search()`` method, see :doc:`response_handling_patterns`.

Retrieving Multiple Pages
--------------------------

Sequential Page Retrieval
~~~~~~~~~~~~~~~~~~~~~~~~~~

Retrieve multiple pages one at a time:

.. code-block:: python

   from scholar_flux import SearchCoordinator

   coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")

   # Retrieve pages 1-5
   for page_num in range(1, 6):
       result = coordinator.search_page(page=page_num)

       if result:
           print(f"Page {page_num}: {len(result.data)} records")
       else:
           print(f"Page {page_num} failed: {result.error}")
           break  # Stop on first error

**Expected output:**

.. code-block:: text

   Page 1: 50 records
   Page 2: 50 records
   Page 3: 50 records
   Page 4: 50 records
   Page 5: 50 records

Batch Page Retrieval
~~~~~~~~~~~~~~~~~~~~

Retrieve multiple pages in one call using :meth:`~scholar_flux.api.SearchCoordinator.search_pages`:

.. code-block:: python

   from scholar_flux import SearchCoordinator

   coordinator = SearchCoordinator(query="CRISPR", provider_name="plos")

   # Retrieve pages 1-5 in one call
   results = coordinator.search_pages(pages=range(1, 6))

   # Results is a SearchResultList
   print(f"Retrieved {len(results)} pages")

   # Filter successful responses
   successful = results.filter()
   print(f"Success rate: {len(successful)}/{len(results)}")

   # Combine all records into a single list
   all_records = successful.join()
   print(f"Total records: {len(all_records)}")

**Expected output:**

.. code-block:: text

   Retrieved 5 pages
   Success rate: 5/5
   Total records: 250

Working with SearchResultList
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`~scholar_flux.api.models.SearchResultList` provides convenient methods:

.. code-block:: python

   results = coordinator.search_pages(pages=range(1, 6))

   # Filter only successful responses
   successful = results.filter()

   # Combine all records
   all_records = successful.join()

   # Convert to pandas DataFrame (requires pandas)
   import pandas as pd
   df = pd.DataFrame(all_records)
   print(df.head())

   # Iterate through results
   for result in results:
       if result:
           print(f"Page {result.page}: {len(result.data)} records")

Caching Results
---------------

Request Caching (Layer 1)
~~~~~~~~~~~~~~~~~~~~~~~~~~

Cache HTTP responses to avoid redundant network requests:

.. code-block:: python

   from scholar_flux import SearchCoordinator

   coordinator = SearchCoordinator(
       query="machine learning",
       provider_name="plos",
       use_cache=True  # Enable HTTP response caching
   )

   # First call: Makes network request
   result1 = coordinator.search_page(page=1)
   print("First call - from network")

   # Second call: Retrieved from cache (instant)
   result2 = coordinator.search_page(page=1)
   print("Second call - from cache")

.. note::

   By default, ``use_cache=True`` uses an in-memory SQLite cache. For production, use Redis or MongoDB.

Result Caching (Layer 2)
~~~~~~~~~~~~~~~~~~~~~~~~~

Cache processed results after extraction and transformation:

.. code-block:: python

   from scholar_flux import SearchCoordinator, DataCacheManager

   # Use Redis for persistent caching
   cache_manager = DataCacheManager.with_storage('redis', 'localhost:6379')

   coordinator = SearchCoordinator(
       query="machine learning",
       provider_name="plos",
       cache_manager=cache_manager
   )

   # First call: Processes and caches results
   result1 = coordinator.search_page(page=1)

   # Second call: Retrieved from processed cache
   result2 = coordinator.search_page(page=1)

.. seealso::

   For advanced caching strategies, see :doc:`caching_strategies`.

Next Steps
----------

Congratulations! You've completed the Getting Started tutorial. You now know how to:

- ✅ Install ScholarFlux with appropriate extras
- ✅ Configure environment variables and API keys
- ✅ Execute searches across academic providers
- ✅ Handle successful and failed searches safely
- ✅ Retrieve multiple pages of results
- ✅ Cache responses for performance

Common Pitfalls
---------------

1. **Forgetting to check response validity**

   ❌ Bad:

   .. code-block:: python

      result = coordinator.search_page(page=1)
      for record in result.data:  # May crash if result.data is None (ErrorResponses and NonResponses)!
          print(record)

   ✅ Good:

   .. code-block:: python

      result = coordinator.search_page(page=1)
      for record in result.data or []:
          print(record)

2. **Using wrong provider names**

   ❌ Bad:

   .. code-block:: python

      coordinator = SearchCoordinator(query="test", provider_name="pubmed_api")
      # No provider named "pubmed_api"!

   ✅ Good:

   .. code-block:: python

      coordinator = SearchCoordinator(query="test", provider_name="pubmed")

3. **Not installing extras required for specific providers**

   ❌ Bad:

   .. code-block:: python

      # Basic install without [parsing] extra
      coordinator = SearchCoordinator(query="test", provider_name="arxiv")
      result = coordinator.search_page(page=1)  # Will fail - arXiv returns XML!
      # OUTPUT: ErrorResponse(...)

   ✅ Good:

   .. code-block:: bash

      pip install "scholar-flux[parsing]"
      # Installs xmltodict for XML parsing and beautifulsoup4 for HTML text parsing

4. **Hardcoding API keys**

   ❌ Bad:

   .. code-block:: python

      coordinator = SearchCoordinator(
          query="test",
          provider_name="pubmed",
          api_key="abc123xyz"  # Hardcoded - will be committed to git!
      )

   ✅ Good:

   .. code-block:: python

      # Use .env file
      # PUBMED_API_KEY=abc123xyz
      coordinator = SearchCoordinator(query="test", provider_name="pubmed")

Where to Go Next
~~~~~~~~~~~~~~~~

**Core Tutorials:**

- :doc:`response_handling_patterns` - Response types, error handling, retry configuration
- :doc:`multi_provider_search` - Query multiple providers concurrently
- :doc:`schema_normalization` - Build ML-ready datasets with consistent schemas
- :doc:`caching_strategies` - Advanced caching with Redis, MongoDB, SQLAlchemy

**Advanced Topics:**

- :doc:`advanced_workflows` - Multi-step retrieval pipelines
- :doc:`custom_providers` - Add new API providers to ScholarFlux
- :doc:`production_deployment` - Deploy ScholarFlux in production

**Reference:**

- :doc:`index` - Documentation home

Getting Help
------------

If you encounter issues:

1. **Check the documentation**: https://SammieH21.github.io/scholar-flux/
2. **Search existing issues**: https://github.com/SammieH21/scholar-flux/issues
3. **Ask a question**: Open a new issue with details about your environment
4. **Email**: scholar.flux@gmail.com

When reporting issues, include:

- ScholarFlux version: ``import scholar_flux; print(scholar_flux.__version__)``
- Python version: ``python --version``
- Operating system
- Minimal code to reproduce the issue
- Complete error message

Further Reading
---------------

- :doc:`response_handling_patterns` - Response handling and error patterns
- :doc:`multi_provider_search` - Concurrent multi-provider orchestration
- :doc:`schema_normalization` - Building ML datasets with consistent schemas
- :class:`~scholar_flux.api.SearchCoordinator` API reference
- :class:`~scholar_flux.api.SearchAPI` API reference
- :class:`~scholar_flux.api.models.ProcessedResponse` API reference
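
The environment details requested under Getting Help can be gathered with a short standard-library script. The sketch below is illustrative only: ``collect_environment_report`` is a hypothetical helper, not part of ScholarFlux, and it assumes the installed distribution is named ``scholar-flux``, as in the ``pip install`` commands above.

.. code-block:: python

   import platform
   from importlib import metadata


   def collect_environment_report(package: str = "scholar-flux") -> dict[str, str]:
       """Gather the version details typically requested in a bug report."""
       try:
           # Reads the version from installed package metadata (no import needed)
           package_version = metadata.version(package)
       except metadata.PackageNotFoundError:
           package_version = "not installed"

       return {
           "package": package,
           "package_version": package_version,
           "python_version": platform.python_version(),
           "operating_system": platform.platform(),
       }


   report = collect_environment_report()
   for key, value in report.items():
       print(f"{key}: {value}")

Paste the printed lines into your issue alongside the minimal reproducing code and the complete error message.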