Data Access Layer
Data catalog and boundary access.
DataCatalog Singleton
Bases: dict
Thread-safe singleton for managing catalog connections to climate data sources.
This class implements a thread-safe singleton pattern and inherits from dict to provide a unified interface for accessing multiple climate data catalogs. It manages connections to boundary, renewables, and general climate datasets through intake and intake-esm catalogs, offering convenient properties and methods for data querying and retrieval.
The class automatically initializes connections to predefined catalogs and supports dynamic addition of new catalogs.
Thread Safety
This class is thread-safe. The singleton instance is protected by a lock during creation, and the get_data() method accepts the catalog key as a parameter rather than storing it as mutable state, allowing concurrent queries from multiple threads.
Properties
data : intake_esm.core.esm_datastore
    Access to the main climate data catalog.
boundary : intake.catalog.Catalog
    Access to the boundary conditions catalog.
boundaries : Boundaries
    Access to the lazy-loading boundaries data manager.
renewables : intake_esm.core.esm_datastore
    Access to the renewables data catalog.
hdp : intake_esm.core.esm_datastore
    Access to the hdp data catalog.
Methods:
| Name | Description |
|---|---|
| set_catalog | Add a new catalog to the collection. |
| get_data | Retrieve data from the specified catalog using query parameters. |
| resolve_catalog_key | Resolve and validate a catalog key, returning the closest match if needed. |
Notes
This class implements the singleton pattern, ensuring only one instance exists throughout the application lifecycle. Multiple calls to DataCatalog() will return the same instance.
The class automatically handles catalog initialization and provides sensible defaults when invalid catalog keys are specified.
Examples:
Thread-safe concurrent usage:
>>> from concurrent.futures import ThreadPoolExecutor
>>> catalog = DataCatalog()
>>> def fetch_data(params):
... query, catalog_key = params
... return catalog.get_data(query, catalog_key=catalog_key)
>>> with ThreadPoolExecutor(max_workers=4) as executor:
... results = list(executor.map(fetch_data, queries_and_keys))
Initialize the DataCatalog instance.
This method sets up the catalog connections and initializes internal state. It only runs once due to the singleton pattern implementation.
The derived variable registry is attached to catalogs that support it, enabling users to query derived variables directly.
Source code in climakitae/new_core/data_access/data_access.py
data
property
Access data catalog.
Returns:
| Type | Description |
|---|---|
| esm_datastore | The main climate data catalog. |
boundary
property
Access boundary catalog.
Returns:
| Type | Description |
|---|---|
| Catalog | The boundary conditions catalog. |
renewables
property
Access renewables catalog.
Returns:
| Type | Description |
|---|---|
| esm_datastore | The renewables data catalog. |
hdp
property
Access historical data platform (histwxstns) catalog.
Returns:
| Type | Description |
|---|---|
| esm_datastore | The histwxstns data catalog. |
boundaries
property
Access boundaries data with lazy loading (thread-safe).
Returns:
| Type | Description |
|---|---|
| Boundaries | The lazy-loading boundaries data manager. |
derived_registry
property
Access the derived variable registry.
The registry contains definitions for derived variables that can be computed from source variables during data loading.
Returns:
| Type | Description |
|---|---|
| DerivedVariableRegistry | The intake-esm derived variable registry attached to the catalogs. |
__new__()
Override new to implement thread-safe singleton pattern.
Uses double-checked locking to ensure thread-safe singleton creation while minimizing lock contention after initialization.
Returns:
| Type | Description |
|---|---|
| DataCatalog | The singleton instance of DataCatalog. |
Source code in climakitae/new_core/data_access/data_access.py
merge_catalogs()
Merge the AE intake catalogs into a single DataFrame.
This method combines the AE data catalogs into a unified DataFrame for easier searching and querying across all available datasets.
Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame containing the merged data from AE catalogs with an additional 'catalog' column identifying the source catalog. |
Source code in climakitae/new_core/data_access/data_access.py
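A merge of this shape can be sketched with pandas. The catalog names and columns below are placeholders, not the actual AE catalog schema:

```python
import pandas as pd

# Hypothetical per-catalog DataFrames standing in for the AE catalogs.
cadcat = pd.DataFrame({"variable_id": ["tas", "pr"]})
renewables = pd.DataFrame({"variable_id": ["cf"]})

frames = []
for name, df in {"cadcat": cadcat, "renewables": renewables}.items():
    tagged = df.copy()
    tagged["catalog"] = name  # record the source catalog on every row
    frames.append(tagged)

# One unified frame, searchable across all catalogs at once.
merged = pd.concat(frames, ignore_index=True)
```

Tagging each row before concatenation is what lets a single query later recover which catalog a matching entry came from.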
resolve_catalog_key(key)
Resolve and validate a catalog key.
This method validates the provided catalog key and attempts to find the closest match if the exact key is not found. This is a pure function that does not modify any instance state, making it thread-safe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | Key of the catalog to resolve. Should be one of the available catalog keys. | required |
Returns:
| Type | Description |
|---|---|
| str or None | The resolved catalog key if valid or a close match is found; None if no valid key can be determined. |
Warns:
| Type | Description |
|---|---|
| UserWarning | If the catalog key is not found and suggestions are provided. |
Examples:
>>> catalog = DataCatalog()
>>> resolved = catalog.resolve_catalog_key("cadcat")
>>> resolved
'cadcat'
Source code in climakitae/new_core/data_access/data_access.py
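The closest-match behavior can be sketched with the standard library's `difflib`. The key names below are assumptions for illustration, and this is not climakitae's actual implementation:

```python
import difflib
import warnings

# Hypothetical set of available catalog keys.
AVAILABLE = ("data", "boundary", "renewables", "hdp")


def resolve_catalog_key(key: str):
    """Pure-function resolution: exact match, else closest fuzzy match,
    else None. No instance state is touched, so it is thread-safe."""
    if key in AVAILABLE:
        return key
    matches = difflib.get_close_matches(key, AVAILABLE, n=1, cutoff=0.6)
    if matches:
        warnings.warn(
            f"Catalog {key!r} not found; using closest match {matches[0]!r}.",
            UserWarning,
        )
        return matches[0]
    return None  # no valid key could be determined


assert resolve_catalog_key("renewables") == "renewables"
```

Because the function neither reads nor writes mutable state, concurrent callers can resolve keys without synchronization.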
set_catalog_key(key)
Set the catalog key (DEPRECATED - use resolve_catalog_key instead).
.. deprecated:: 1.5.0
    This method stores mutable state on the singleton, which is not
    thread-safe. Use `resolve_catalog_key` and pass the key
    directly to `get_data` instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| key | str | Key of the catalog to set. | required |
Returns:
| Type | Description |
|---|---|
| DataCatalog | The current instance (for backward compatibility). |
Source code in climakitae/new_core/data_access/data_access.py
set_catalog(name, catalog)
Set a named catalog.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Name of the catalog to set. | required |
| catalog | str | URL or path to the catalog file. | required |
Returns:
| Type | Description |
|---|---|
| DataCatalog | The current instance of DataCatalog, allowing method chaining. |
Source code in climakitae/new_core/data_access/data_access.py
get_data(query, catalog_key=None)
Get data from the specified catalog (thread-safe).
This method queries the specified catalog using the provided parameters and returns the matching datasets as a dictionary. The catalog_key is passed as a parameter rather than stored as instance state, making this method safe to call from multiple threads simultaneously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| query | dict | Query parameters for filtering data. The available parameters depend on the catalog and may include items like 'variable', 'scenario', 'model', etc. | required |
| catalog_key | str | The key identifying which catalog to query. If not provided, falls back to the deprecated instance attribute (for backward compatibility). | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Dataset] | The requested dataset(s) from the catalog, keyed by dataset identifiers. |
Raises:
| Type | Description |
|---|---|
| ValueError | If no catalog_key is provided and no default is available. |
Examples:
>>> catalog = DataCatalog()
>>> query = {"variable_id": "tas", "experiment_id": "historical"}
>>> data = catalog.get_data(query, catalog_key="cadcat")
Source code in climakitae/new_core/data_access/data_access.py
list_clip_boundaries()
List all available boundary options for clipping operations.
This method populates the available_boundaries attribute with a
dictionary of boundary categories and their available options. It's a
convenience method that provides direct access to boundary options
without needing to instantiate a Clip processor.
Notes
After calling this method, the available boundaries can be accessed
via the available_boundaries attribute.
Examples:
>>> catalog = DataCatalog()
>>> catalog.list_clip_boundaries()
>>> print(catalog.available_boundaries["states"])
['AZ', 'CA', 'CO', 'ID', 'MT', 'NV', 'NM', 'OR', 'UT', 'WA', 'WY']
Source code in climakitae/new_core/data_access/data_access.py
print_clip_boundaries()
Print all available boundary options for clipping in a user-friendly format.
This method provides a nicely formatted output showing all boundary categories and their available options for clipping operations. The output is formatted to be readable and includes summarized counts for categories with many options.
Examples:
>>> catalog = DataCatalog()
>>> catalog.print_clip_boundaries()
Available Boundary Options for Clipping:
========================================
states: - AZ, CA, CO, ID, MT ... and 6 more options
Source code in climakitae/new_core/data_access/data_access.py
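The truncated-list formatting shown in the example output above can be sketched like this. The helper name and cutoff are assumptions, not the actual implementation:

```python
def summarize_options(category, options, max_shown=5):
    """Format one boundary-category line, summarizing long option
    lists with a count (a sketch of the style shown above)."""
    shown = ", ".join(options[:max_shown])
    extra = len(options) - max_shown
    if extra > 0:
        shown += f" ... and {extra} more options"
    return f"{category}: - {shown}"


states = ["AZ", "CA", "CO", "ID", "MT", "NV", "NM", "OR", "UT", "WA", "WY"]
print(summarize_options("states", states))
# states: - AZ, CA, CO, ID, MT ... and 6 more options
```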
reset()
Reset the DataCatalog instance to its initial state.
This method clears any deprecated mutable state and resets the instance to its original state. The catalogs themselves remain loaded and available.
Note: With thread-safe design, there is minimal mutable state to reset. This method is maintained for backward compatibility.
Source code in climakitae/new_core/data_access/data_access.py
Boundaries
Lazy-loading geospatial polygon data manager for ClimakitAE.
This class provides efficient access to various boundary datasets stored in S3 parquet catalogs. Data is loaded only when first accessed, improving memory usage and initialization performance. All lookup dictionaries are cached to avoid recomputation.
The class supports geographic subsetting for climate data analysis by providing access to various administrative and utility boundaries in California and the western United States. All data access is optimized for memory efficiency through lazy loading and intelligent caching.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| boundary_catalog | Catalog | Intake catalog instance for accessing boundary parquet files from S3 | required |
Attributes:
| Name | Type | Description |
|---|---|---|
| _cat | Catalog | Reference to the boundary catalog instance used for data access |
Properties
_us_states : pd.DataFrame
    US western states with names, abbreviations, and geometries (lazy-loaded)
_ca_counties : pd.DataFrame
    California counties with names and geometries, sorted alphabetically (lazy-loaded)
_ca_watersheds : pd.DataFrame
    California HUC8 watersheds with names and geometries, sorted alphabetically (lazy-loaded)
_ca_utilities : pd.DataFrame
    California electric utilities (IOUs and POUs) with names and geometries (lazy-loaded)
_ca_forecast_zones : pd.DataFrame
    California electricity demand forecast zones with processed names (lazy-loaded)
_ca_electric_balancing_areas : pd.DataFrame
    Electric balancing authority areas with filtered geometries (lazy-loaded)
Methods:
| Name | Description |
|---|---|
| boundary_dict | Return dictionary of all boundary lookup dictionaries for UI population |
| preload_all | Preload all boundary data for performance-critical scenarios |
| clear_cache | Clear all cached data and lookup dictionaries to free memory |
| validate_catalog | Validate that required catalog entries exist and are accessible |
| get_memory_usage | Get detailed memory usage information for loaded boundary datasets |
| load | Deprecated method for backward compatibility; use preload_all() instead |
Examples:
Basic usage with lazy loading:
>>> import intake
>>> catalog = intake.open_catalog('boundaries.yaml')
>>> boundaries = Boundaries(catalog)
>>>
>>> # Data loads automatically when accessed
>>> counties = boundaries._ca_counties
>>> watersheds = boundaries._ca_watersheds
Getting boundary options for UI components:
>>> boundary_options = boundaries.boundary_dict()
>>> state_options = boundary_options['states']
>>> county_options = boundary_options['CA counties']
Performance optimization:
>>> # Preload all data if you know you'll need it
>>> boundaries.preload_all()
>>>
>>> # Check memory usage
>>> usage = boundaries.get_memory_usage()
>>> print(f"Total memory: {usage['total_human']}")
Memory management:
>>> # Clear cache to free memory
>>> boundaries.clear_cache()
>>>
>>> # Data will be reloaded on next access
>>> counties = boundaries._ca_counties
Notes
- All boundary data is cached after first access for performance
- The class automatically validates catalog structure on initialization
- Processing includes sorting, filtering, and name standardization
- Memory usage can be monitored and managed through provided methods
- Western states are ordered according to WESTERN_STATES_LIST constant
- Utilities are ordered with priority utilities first, then alphabetically
Initialize the Boundaries class with a boundary catalog.
Sets up the lazy-loading infrastructure and validates the catalog structure to ensure all required boundary datasets are available. No data is loaded during initialization - it's loaded on first access.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| boundary_catalog | Catalog | Intake catalog instance for accessing boundary parquet files. Must contain entries for: 'states', 'counties', 'huc8', 'utilities', 'dfz', and 'eba'. | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the catalog is missing required entries |
Examples:
>>> import intake
>>> catalog = intake.open_catalog('s3://bucket/boundaries.yaml')
>>> boundaries = Boundaries(catalog)
Source code in climakitae/new_core/data_access/boundaries.py
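The load-on-first-access-then-cache behavior can be sketched with a cached property. The class and loader below are simplified stand-ins, not climakitae's actual code:

```python
class LazyBoundaries:
    """Sketch of the lazy-load-and-cache pattern: data is fetched
    only on first property access, then served from the cache."""

    def __init__(self, loader):
        self._load = loader      # callable standing in for a catalog read
        self._counties = None    # cache slot; None means "not loaded yet"

    @property
    def ca_counties(self):
        if self._counties is None:       # first access triggers the load
            self._counties = self._load()
        return self._counties            # later accesses hit the cache


calls = []

def fake_loader():
    """Counting stand-in for an expensive S3 parquet read."""
    calls.append(1)
    return ["Alameda", "Alpine"]


b = LazyBoundaries(fake_loader)
b.ca_counties
b.ca_counties
assert len(calls) == 1  # loaded exactly once despite two accesses
```

Clearing the cache slot back to `None` (as `clear_cache()` describes) would make the next access trigger a fresh load.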
validate_catalog()
Validate that required catalog entries exist and are accessible.
Checks for the presence of all required boundary datasets in the catalog. This ensures that the boundary data can be loaded when requested by the user.
Raises:
| Type | Description |
|---|---|
| ValueError | If any required catalog entries are missing. The error message will list all missing entries. |
Notes
Required catalog entries:
- 'states': US state boundaries
- 'counties': California county boundaries
- 'huc8': California watershed boundaries (HUC8 level)
- 'utilities': California electric utility boundaries
- 'dfz': California demand forecast zones
- 'eba': Electric balancing authority areas
Source code in climakitae/new_core/data_access/boundaries.py
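A validation of this shape reduces to a set difference. This is a sketch of the described behavior, not the actual source:

```python
# The six required entries listed in the notes above.
REQUIRED = {"states", "counties", "huc8", "utilities", "dfz", "eba"}


def validate_catalog(catalog_entries):
    """Raise ValueError listing every required entry the catalog lacks."""
    missing = REQUIRED - set(catalog_entries)
    if missing:
        raise ValueError(
            f"Boundary catalog is missing required entries: {sorted(missing)}"
        )


# A complete catalog passes silently.
validate_catalog(["states", "counties", "huc8", "utilities", "dfz", "eba"])
```

Reporting all missing entries at once (rather than failing on the first) saves the user repeated round-trips when fixing a misconfigured catalog.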
boundary_dict()
Return dictionary of all boundary lookup dictionaries for UI population.
Creates a comprehensive dictionary of all available boundary datasets with their corresponding lookup dictionaries. This is primarily used to populate user interface components that allow boundary selection for geographic subsetting of climate data.
The returned dictionary maps boundary category names to lookup dictionaries that map specific boundary names to their DataFrame indices. This enables efficient boundary selection and data subsetting operations.
Returns:
| Type | Description |
|---|---|
| Dict[str, Dict[str, int]] | Nested dictionary: outer keys are boundary category names (e.g., 'states', 'CA counties'); inner dictionaries map boundary names to DataFrame indices. |

Available categories:
- 'none': No geographic subsetting
- 'lat/lon': Custom coordinate-based selection
- 'states': Western US states
- 'CA counties': California counties (alphabetical)
- 'CA watersheds': California HUC8 watersheds (alphabetical)
- 'CA Electric Load Serving Entities (IOU & POU)': Electric utilities
- 'CA Electricity Demand Forecast Zones': Forecast zones
- 'CA Electric Balancing Authority Areas': Balancing areas
Examples:
>>> boundaries = Boundaries(catalog)
>>> boundary_options = boundaries.boundary_dict()
>>>
>>> # Get available states
>>> states = boundary_options['states']
>>> print(states.keys()) # ['CA', 'OR', 'WA', ...]
>>>
>>> # Get available counties
>>> counties = boundary_options['CA counties']
>>> alameda_idx = counties['Alameda']
>>>
>>> # Use in UI dropdown population
>>> for category, options in boundary_options.items():
...     populate_dropdown(category, options.keys())
Notes
- Lookup dictionaries are cached for performance
- Western states follow ordering in WESTERN_STATES_LIST
- Utilities are ordered with priority utilities first
- All other boundaries are sorted alphabetically
Source code in climakitae/new_core/data_access/boundaries.py
load()
Preload all boundary data (deprecated - data loads automatically when accessed).
This method is kept for backward compatibility. Data now loads automatically when first accessed through the property system.
Source code in climakitae/new_core/data_access/boundaries.py
preload_all()
Preload all boundary data for performance-critical scenarios.
Forces immediate loading of all boundary datasets and builds all lookup caches. This eliminates lazy loading delays for subsequent data access operations, making it ideal for performance-critical scenarios or when you know all boundary data will be needed.
The method loads all six boundary datasets:
- US western states
- California counties
- California watersheds
- California utilities
- California forecast zones
- California electric balancing areas
And builds all corresponding lookup dictionaries for fast boundary selection operations.
Examples:
>>> boundaries = Boundaries(catalog)
>>>
>>> # Preload for performance-critical batch processing
>>> boundaries.preload_all()
>>>
>>> # All subsequent access is now immediate
>>> for county in boundaries._ca_counties.itertuples():
...     process_county_data(county)
Notes
- Increases initial memory usage but eliminates loading delays
- Useful for batch processing or repeated boundary access
- Data remains cached until clear_cache() is called
- Memory usage can be monitored with get_memory_usage()
Source code in climakitae/new_core/data_access/boundaries.py
clear_cache()
Clear all cached data and lookup dictionaries to free memory.
Removes all loaded boundary DataFrames and lookup dictionaries from memory, returning the Boundaries instance to its initial state. Data will be reloaded on next access through the lazy loading mechanism.
This is useful for:
- Memory management in long-running applications
- Forcing fresh data loads after catalog updates
- Resetting state during testing or debugging
Examples:
>>> boundaries = Boundaries(catalog)
>>> boundaries.preload_all()
>>> usage_before = boundaries.get_memory_usage()
>>> print(f"Memory before: {usage_before['total_human']}")
>>>
>>> boundaries.clear_cache()
>>> usage_after = boundaries.get_memory_usage()
>>> print(f"Memory after: {usage_after['total_human']}") # Much lower
>>>
>>> # Data loads again on next access
>>> counties = boundaries._ca_counties # Triggers reload
Notes
- All subsequent data access will trigger fresh loads from catalog
- Lookup dictionaries will be rebuilt as needed
- Does not affect the underlying catalog or data sources
- Memory savings are immediate and substantial for loaded datasets
Source code in climakitae/new_core/data_access/boundaries.py
get_memory_usage()
Get detailed memory usage information for loaded boundary datasets.
Analyzes memory consumption of all loaded boundary DataFrames and provides both detailed per-dataset usage and summary statistics. Useful for memory monitoring and optimization decisions.
Returns:
| Type | Description |
|---|---|
| Dict[str, Union[int, str]] | Comprehensive memory usage information; see the keys below. |

Per-dataset usage in bytes (0 if not loaded):
- 'us_states': Memory used by the US states DataFrame
- 'ca_counties': Memory used by the CA counties DataFrame
- 'ca_watersheds': Memory used by the CA watersheds DataFrame
- 'ca_utilities': Memory used by the CA utilities DataFrame
- 'ca_forecast_zones': Memory used by the forecast zones DataFrame
- 'ca_electric_balancing_areas': Memory used by the balancing areas DataFrame

Summary statistics:
- 'total_bytes': Total memory usage in bytes
- 'total_human': Human-readable total memory usage (e.g., '15.2 MB')
- 'loaded_datasets': Count of currently loaded datasets
- 'cached_lookups': Count of cached lookup dictionaries
Examples:
>>> boundaries = Boundaries(catalog)
>>> boundaries.preload_all()
>>> usage = boundaries.get_memory_usage()
>>>
>>> print(f"Total memory: {usage['total_human']}")
>>> print(f"Loaded datasets: {usage['loaded_datasets']}/6")
>>> print(f"Largest dataset: {max(usage['us_states'], usage['ca_counties'])}")
>>>
>>> # Check if specific dataset is loaded
>>> if usage['ca_counties'] > 0:
...     print("Counties data is loaded")
>>> # Monitor memory before/after operations
>>> usage_before = boundaries.get_memory_usage()
>>> boundaries.clear_cache()
>>> usage_after = boundaries.get_memory_usage()
>>> saved = usage_before['total_bytes'] - usage_after['total_bytes']
>>> print(f"Memory freed: {boundaries._format_bytes(saved)}")
Notes
- Memory usage includes deep analysis of DataFrame contents
- Unloaded datasets report 0 bytes usage
- Lookup dictionary cache usage is counted separately
- Total includes all loaded DataFrames but not lookup dictionaries
Source code in climakitae/new_core/data_access/boundaries.py
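The 'total_human' value above implies a byte-formatting helper. A sketch of such a helper (the name `format_bytes` and the unit steps are assumptions, not the library's actual `_format_bytes`):

```python
def format_bytes(n):
    """Render a byte count in human-readable form,
    as in the 'total_human' key described above."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1024 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1024  # step up to the next unit


print(format_bytes(15938355))  # 15.2 MB
```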