Legacy Data Interface

The climakitae.core.data_interface module is the main compatibility layer for the original climakitae API. It exposes the legacy parameter object, the data retrieval entry points, and the discovery helpers that powered the old GUI.

Warning

This page documents the legacy climakitae.core.data_interface module. It is kept for backward compatibility. New code should use climakitae.new_core.user_interface.ClimateData.

What this module does

The legacy data interface is responsible for:

  • mapping human-readable GUI values to catalog values
  • validating combinations of resolution, timescale, scenario, and spatial subset
  • exposing data and subsetting option lookup helpers
  • loading the cached data catalogs, variable metadata, station metadata, and boundary catalogs used by DataParameters

Core concepts

Concept | Legacy symbol | Role
Variable metadata | VariableDescriptions | Loads variable_descriptions.csv once and keeps it available for option lookups
Shared data connections | DataInterface | Singleton cache for catalogs, stations, boundaries, and warming-level tables
Query state | DataParameters | Param-based configuration object used by the GUI and direct code paths
Data discovery | get_data_options() | Returns the valid query combinations in legacy GUI language
Spatial discovery | get_subsetting_options() | Returns valid boundaries and station geometry options
Data retrieval | get_data() / DataParameters.retrieve() | Executes a legacy query and returns lazily loaded xarray data

Query flow

  1. DataParameters loads the singleton DataInterface and populates the available options.
  2. Option observers keep fields like resolution, timescale, scenario_ssp, and cached_area in sync.
  3. retrieve() or get_data() calls the catalog loader.
  4. The loader returns an xarray.DataArray, xarray.Dataset, or a list of DataArray objects depending on the request.

Legacy field names

The legacy module uses GUI-style names instead of catalog-native names. Common examples:

Legacy field | Meaning | Modern equivalent
downscaling_method | Dynamical, Statistical, or both | activity_id
resolution | 3 km, 9 km, or 45 km | grid_label
timescale | hourly, daily, monthly | table_id
scenario_ssp / scenario_historical | Scenario selection buckets | experiment_id
area_subset / cached_area | Named boundary selection | clip processor
time_slice | Year range tuple | time_slice processor

See Core Concepts for the full mapping.
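
As a rough illustration of the mapping in the table above, the sketch below translates legacy GUI values into catalog-style keys. The lookup dictionaries and the to_catalog_query helper are hypothetical, and the specific catalog codes (such as d02 or 1hr) are assumptions made for illustration; Core Concepts remains the authoritative source for the real values.

```python
# Hypothetical lookup tables; the catalog codes below are assumed for illustration.
RESOLUTION_TO_GRID_LABEL = {"45 km": "d01", "9 km": "d02", "3 km": "d03"}
TIMESCALE_TO_TABLE_ID = {"hourly": "1hr", "daily": "day", "monthly": "mon"}
DOWNSCALING_TO_ACTIVITY_ID = {"Dynamical": "WRF", "Statistical": "LOCA2"}


def to_catalog_query(downscaling_method, resolution, timescale):
    """Translate legacy GUI-style values into catalog-style keys."""
    try:
        return {
            "activity_id": DOWNSCALING_TO_ACTIVITY_ID[downscaling_method],
            "grid_label": RESOLUTION_TO_GRID_LABEL[resolution],
            "table_id": TIMESCALE_TO_TABLE_ID[timescale],
        }
    except KeyError as err:
        # Surface the offending GUI value instead of a bare KeyError.
        raise ValueError(f"Unknown legacy value: {err.args[0]!r}") from err


query = to_catalog_query("Dynamical", "9 km", "hourly")
print(query)
```

Keeping the translation in plain dictionaries makes it easy to reject unknown GUI values early, before any catalog search is attempted.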


Examples

Direct query with DataParameters

from climakitae.core.data_interface import DataParameters

params = DataParameters()
params.variable = "Air Temperature at 2m"
params.downscaling_method = "Dynamical"
params.resolution = "9 km"
params.timescale = "hourly"
params.scenario_historical = ["Historical Climate"]
params.scenario_ssp = ["SSP 3-7.0"]
params.area_subset = "CA counties"
params.cached_area = ["Los Angeles County"]

data = params.retrieve()

Direct query with get_data

from climakitae.core.data_interface import get_data

data = get_data(
    variable="Air Temperature at 2m",
    resolution="9 km",
    timescale="hourly",
    downscaling_method="Dynamical",
    scenario=["Historical Climate", "SSP 3-7.0"],
    area_subset="CA counties",
    cached_area=["Los Angeles County"],
)

Public API

VariableDescriptions

Load Variable Descriptions CSV only once

This singleton class is kept separate from DataInterface because ck.view uses variable descriptions without DataInterface. Because ck.view is imported when the package loads, the separation also avoids loading boundary data when it is not needed.

Attributes:

Name Type Description
variable_descriptions DataFrame

pandas dataframe that stores available data variables usable with the package

Source code in climakitae/core/data_interface.py
def __init__(self):
    self.variable_descriptions = pd.DataFrame()  # empty instance, so .empty is a real bool

load()

Read the variable descriptions csv into class variable.

Source code in climakitae/core/data_interface.py
def load(self):
    """Read the variable descriptions csv into class variable."""
    if self.variable_descriptions.empty:
        self.variable_descriptions = read_csv_file(VARIABLE_DESCRIPTIONS_CSV_PATH)

DataInterface

Load data connections into memory once

This is a singleton class called by the various Param classes to connect to the local data and to the intake data catalog and parquet boundary catalog. The class attributes are read-only so that the data does not get changed accidentally.

Attributes:

Name Type Description
variable_descriptions DataFrame

variable descriptions pandas data frame

stations DataFrame

station locations pandas data frame

stations_gdf GeoDataFrame

station locations geopandas data frame

data_catalog ESMDataSource

intake ESM data catalog

boundary_catalog Catalog

parquet boundary catalog

geographies Boundaries

boundary dictionaries class

warming_level_times DataFrame

table of when each simulation/scenario reaches each warming level

Source code in climakitae/core/data_interface.py
def __init__(self):
    global _data_interface_initialized

    if _data_interface_initialized:
        return

    with _data_interface_init_lock:
        if _data_interface_initialized:
            return
        var_desc = VariableDescriptions()
        var_desc.load()
        self._variable_descriptions = var_desc.variable_descriptions
        self._stations = pd.read_csv(HADISD_STATIONS_URL)
        self._stations_gdf = gpd.GeoDataFrame(
            self.stations,
            crs="EPSG:4326",
            geometry=gpd.points_from_xy(self.stations.LON_X, self.stations.LAT_Y),
        )
        self._data_catalog = intake.open_esm_datastore(DATA_CATALOG_URL)
        self._warming_level_times = read_csv_file(
            GWL_1850_1900_FILE, index_col=[0, 1, 2]
        )

        # Get geography boundaries
        self._boundary_catalog = intake.open_catalog(BOUNDARY_CATALOG_URL)
        self._geographies = Boundaries(self.boundary_catalog)

        self._geographies.load()
        _data_interface_initialized = True
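
The initializer above follows a double-checked locking pattern: a cheap flag check, then a lock, then a second check before the expensive load. The sketch below reproduces that pattern with only the standard library. SharedConnections, load_count, and the use of __new__ to share one instance are illustrative assumptions and are not taken from the module.

```python
import threading

_initialized = False
_init_lock = threading.Lock()
load_count = 0  # counts how many times the expensive load actually ran


class SharedConnections:
    """Hypothetical stand-in for a singleton that loads resources once."""

    _instance = None

    def __new__(cls):
        # Reuse one shared instance for every caller (assumed mechanism).
        if cls._instance is None:
            with _init_lock:
                if cls._instance is None:
                    cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        global _initialized, load_count
        if _initialized:  # fast path: skip the lock once loading is done
            return
        with _init_lock:
            if _initialized:  # re-check: another thread may have finished first
                return
            self._catalog = {"loaded": True}  # placeholder for expensive I/O
            load_count += 1
            _initialized = True


# Construct the object from several threads at once.
threads = [threading.Thread(target=SharedConnections) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(load_count)  # 1
```

The second check inside the lock is what prevents two threads that both passed the first check from loading the resources twice.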

variable_descriptions property

Get the variable descriptions dataframe

stations property

Get the stations dataframe

stations_gdf property

Get the stations geopandas dataframe

data_catalog property

Get the data catalog

warming_level_times property

Get the warming level times dataframe

boundary_catalog property

Get the boundary catalog

geographies property

Get the geographies object
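
The properties listed above expose cached objects without setters, which is how the read-only behavior is achieved. A minimal sketch of the same idea, using a hypothetical ReadOnlyHolder class and placeholder station labels:

```python
class ReadOnlyHolder:
    """Hypothetical class exposing a cached value through a getter-only property."""

    def __init__(self, stations):
        self._stations = stations

    @property
    def stations(self):
        """Return the cached stations; no setter is defined."""
        return self._stations


holder = ReadOnlyHolder(["station A", "station B"])  # placeholder labels

try:
    holder.stations = []  # rebinding a getter-only property is rejected
    reassigned = True
except AttributeError:
    reassigned = False

print(reassigned)  # False
```

Note that a getter-only property blocks rebinding of the attribute but does not stop in-place mutation of the returned object, so callers should still treat the returned data as shared state.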

DataParameters

Bases: Parameterized

Python param object to hold data parameters for use in panel GUI.

Call DataParameters when you want to select and retrieve data from the climakitae data catalog without using the ckg.Select GUI. ckg.Select uses this class to store selections and retrieve data.

DataParameters calls DataInterface, a singleton class that makes the connection to the intake-esm data store in an S3 bucket.

Attributes:

Name Type Description
unit_options_dict dict

options dictionary for converting unit to other units

area_subset str

dataset to use from Boundaries for sub area selection

cached_area list of strs

one or more features from area_subset datasets to use for selection

latitude tuple

latitude range of selection box

longitude tuple

longitude range of selection box

variable_type str

toggle raw or derived variable selection

default_variable str

initial variable to have selected in widget

time_slice tuple

year range to select

resolution str

resolution of data to select ("3 km", "9 km", "45 km")

timescale str

frequency of dataset ("hourly", "daily", "monthly")

scenario_historical list of strs

historical scenario selections

area_average str

whether to compute area average ("Yes", "No")

downscaling_method str

whether to choose WRF or LOCA2 data or both ("Dynamical", "Statistical", "Dynamical+Statistical")

data_type str

whether to choose gridded or station based data ("Gridded", "Stations")

stations list of strs

list of stations that can be filtered by cached_area

_station_data_info str

informational statement when station data selected with data_type

scenario_ssp list of strs

list of future climate scenarios selected (availability depends on other params)

simulation list of strs

list of simulations (models) selected (availability depends on other params)

variable str

variable long display name

units str

current unit abbreviation of the data (native or converted)

enable_hidden_vars boolean

enable selection of variables that are hidden from the GUI?

extended_description str

extended description of the data variable

variable_id list of strs

list of variable ids that match the variable (WRF and LOCA2 can have different codes for same type of variable)

historical_climate_range_wrf tuple

time range of historical WRF data

historical_climate_range_loca tuple

time range of historical LOCA2 data

historical_climate_range_wrf_and_loca tuple

time range of historical WRF and LOCA2 data combined

historical_reconstruction_range tuple

time range of historical reanalysis data

ssp_range tuple

time range of future scenario SSP data

_info_about_station_data str

warning message about station data

_data_warning str

warning about selecting unavailable data combination

data_interface DataInterface

data connection singleton class that provides data

_data_catalog ESMDataSource

shorthand alias to DataInterface.data_catalog

_variable_descriptions DataFrame

shorthand alias to DataInterface.variable_descriptions

_stations_gdf GeoDataFrame

shorthand alias to DataInterface.stations_gdf

_geographies Boundaries

shorthand alias to DataInterface.geographies

_geography_choose dict

shorthand alias to Boundaries.boundary_dict()

_warming_level_times DataFrame

shorthand alias to DataInterface.warming_level_times

colormap str

default colormap to render the currently selected data

scenario_options list of strs

list of available scenarios (historical and ssp) for selection

variable_options_df DataFrame

filtered variable descriptions for the downscaling_method and timescale

warming_level array

global warming level(s)

warming_level_window integer

years around Global Warming Level (+/-) (e.g. 15 means a 30yr window)

approach str

how do you want the data to be retrieved? ("Warming Level" or "Time")

warming_level_months array

months of year to use for computing warming levels; defaults to the entire calendar year: 1,2,3,4,5,6,7,8,9,10,11,12

all_touched boolean

spatial subset option for within or touching selection

Source code in climakitae/core/data_interface.py
def __init__(self, **params):
    # Set default values
    super().__init__(**params)

    self.data_interface = DataInterface()

    # Data Catalog
    self._data_catalog = self.data_interface.data_catalog

    # Warming Levels Table
    self._warming_level_times = self.data_interface.warming_level_times

    # variable descriptions
    self._variable_descriptions = self.data_interface.variable_descriptions

    # station data
    self._stations_gdf = self.data_interface.stations_gdf

    # Get geography boundaries and selection options
    self._geographies = self.data_interface.geographies
    self._geography_choose = self._geographies.boundary_dict()

    # Set location params
    self.area_subset = "none"
    self.param["area_subset"].objects = list(self._geography_choose.keys())
    self.param["cached_area"].objects = list(
        self._geography_choose[self.area_subset].keys()
    )

    self.all_touched = False

    # Set data params
    (
        self.scenario_options,
        self.simulation,
        unique_variable_ids,
    ) = _get_user_options(
        data_catalog=self._data_catalog,
        downscaling_method=self.downscaling_method,
        timescale=self.timescale,
        resolution=self.resolution,
    )
    self.variable_options_df = _get_variable_options_df(
        variable_descriptions=self._variable_descriptions,
        unique_variable_ids=unique_variable_ids,
        downscaling_method=self.downscaling_method,
        timescale=self.timescale,
        enable_hidden_vars=self.enable_hidden_vars,
    )

    # Show derived index option?
    indices = True
    if self.data_type == "Stations":
        indices = False
    if self.downscaling_method != "Dynamical":
        indices = False
    if self.timescale == "monthly":
        indices = False
    if not indices:
        self.param["variable_type"].objects = ["Variable"]
        self.variable_type = "Variable"
    else:
        self.param["variable_type"].objects = ["Variable", "Derived Index"]

    # Set scenario param
    scenario_ssp_options = [
        scenario_to_experiment_id(scen, reverse=True)
        for scen in self.scenario_options
        if "ssp" in scen
    ]
    for scenario_i in SSPS:
        if scenario_i in scenario_ssp_options:  # Reorder list
            scenario_ssp_options.remove(scenario_i)  # Remove item
            scenario_ssp_options.append(scenario_i)  # Add to back of list
    self.param["scenario_ssp"].objects = scenario_ssp_options
    self.scenario_ssp = []

    # Set variable param
    self.param["variable"].objects = (
        self.variable_options_df.display_name.values.tolist()
    )
    self.variable = self.default_variable

    # Set colormap, units, & extended description
    var_info = self.variable_options_df[
        self.variable_options_df["display_name"] == self.variable
    ]

    # Set params that are not selected by the user
    self.colormap = var_info.colormap.item()
    self.units = var_info.unit.item()
    self.extended_description = var_info.extended_description.item()
    self.variable_id = _get_var_ids(
        self._variable_descriptions,
        self.variable,
        self.downscaling_method,
        self.timescale,
        self.enable_hidden_vars,
    )
    self._data_warning = ""
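
Two pieces of logic in the initializer above can be restated as small pure functions: the rule that decides whether "Derived Index" is offered, and the loop that reorders the SSP options. The sketch below is a standalone restatement for illustration. The function names are hypothetical, and the order of the SSPS list is an assumption, since the real constant is defined elsewhere in the package.

```python
# Assumed canonical order; the real SSPS constant is defined elsewhere.
SSPS = ["SSP 2-4.5", "SSP 3-7.0", "SSP 5-8.5"]


def variable_type_options(data_type, downscaling_method, timescale):
    """Return the variable types offered for a given selection."""
    derived_ok = (
        data_type != "Stations"
        and downscaling_method == "Dynamical"
        and timescale != "monthly"
    )
    return ["Variable", "Derived Index"] if derived_ok else ["Variable"]


def reorder_ssps(options, canonical=SSPS):
    """Move each known SSP to the back of the list, in canonical order."""
    ordered = list(options)  # copy so the caller's list is not modified
    for ssp in canonical:
        if ssp in ordered:
            ordered.remove(ssp)
            ordered.append(ssp)
    return ordered


print(variable_type_options("Gridded", "Dynamical", "hourly"))
print(reorder_ssps(["SSP 5-8.5", "SSP 2-4.5"]))
```

Because every known SSP is moved to the back in turn, the known entries end up in canonical order regardless of the order in which the catalog returned them.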

retrieve(config=None, merge=True)

Retrieve data from catalog

By default, DataParameters determines the data retrieved. Grabs the data from the AWS S3 bucket and returns a lazily loaded dask array. User-facing function that provides a wrapper for read_catalog_from_select.

Returns:

Name Type Description
data_return DataArray | Dataset | List[DataArray]

DataArray or Dataset object

Source code in climakitae/core/data_interface.py
def retrieve(
    self, config: str = None, merge: bool = True
) -> Union[xr.DataArray, xr.Dataset, List[xr.DataArray]]:
    """Retrieve data from catalog

    By default, DataParameters determines the data retrieved.
    Grabs the data from the AWS S3 bucket, returns lazily loaded dask array.
    User-facing function that provides a wrapper for read_catalog_from_select.

    Returns
    -------
    data_return : xr.DataArray | xr.Dataset | List[xr.DataArray]
        DataArray or Dataset object

    """

    def _warn_of_large_file_size(da: xr.DataArray):
        """Warn user if the data array is large"""
        nbytes = da.nbytes
        match nbytes:
            case nbytes if nbytes >= int(1e9) and nbytes < int(5e9):
                print(
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                    "! Returned data array is large. Operations could take up to 5x longer than 1GB of data!\n"
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                )
            case nbytes if nbytes >= int(5e9) and nbytes < int(1e10):
                print(
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                    "!! Returned data array is very large. Operations could take up to 8x longer than 1GB of data !!\n"
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                )
            case nbytes if nbytes >= int(1e10):
                print(
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                    "!!! Returned data array is huge. Operations could take 10x to infinity longer than 1GB of data !!!\n"
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                )

    def _warn_of_empty_data(self):
        if self.approach == "Warming Level" and (len(self.warming_level) > 1):
            print(
                "WARNING FOR WARMING LEVELS APPROACH\n-----------------------------------\nThere may be NaNs in your data for certain simulation/warming level combinations if the warming level is not reached for that particular simulation before the year 2100. \n\nThis does not mean you have missing data, but rather a feature of how the data is combined in retrieval to return a single data object. \n\nIf you want to remove these empty simulations, it is recommended to first subset the data object by each individual warming level and then dropping NaN values."
            )
        elif (self.approach == "Time") and (len(self.scenario_ssp) > 1):
            print(
                "WARNING\n-------\nYou have retrieved data for more than one SSP, but not all ensemble members for each GCM are available for all SSPs.\n\nAs a result, some scenario and simulation combinations may contain NaN values.\n\nIf you want to remove these empty simulations, it is recommended to first subset the data object by each individual scenario and then dropping NaN values."
            )

    data_return = read_catalog_from_select(self)

    if isinstance(data_return, list):
        for l in data_return:
            _warn_of_large_file_size(l)
    else:
        _warn_of_large_file_size(data_return)

    # Warn about empty simulations for certain selections
    _warn_of_empty_data(self)

    return data_return
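
The size warnings in retrieve() fall into three tiers based on the array's nbytes. The sketch below restates those thresholds as a hypothetical helper, size_warning_tier, which could be used to check a request's estimated size before calling any compute step. The estimate_nbytes helper and the example shape are illustrative assumptions.

```python
import math


def size_warning_tier(nbytes):
    """Classify a byte count using the same thresholds as the warnings above."""
    if nbytes >= int(1e10):
        return "huge"
    if nbytes >= int(5e9):
        return "very large"
    if nbytes >= int(1e9):
        return "large"
    return None  # below 1 GB: no warning


def estimate_nbytes(shape, itemsize=4):
    """Estimate the in-memory size of a dense array (itemsize=4 assumes float32)."""
    return math.prod(shape) * itemsize


# One year of hourly data on a hypothetical 100 x 100 grid.
example_bytes = estimate_nbytes((8760, 100, 100))
print(example_bytes, size_warning_tier(example_bytes))
```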

get_data_options

Get data options, in the same format as the Select GUI, given a set of possible inputs. This lets the user access the data using the same language as the GUI, bypassing the sometimes unintuitive naming in the catalog. If no function inputs are provided, the function returns the entire AE catalog that is available via the Select GUI.

Parameters:

Name Type Description Default
variable str

Default to None

None
downscaling_method str

Default to None

None
resolution str

Default to None

None
timescale str

Default to None

None
scenario str or list

Default to None

None
tidy boolean

Format the pandas dataframe? This creates a DataFrame with a MultiIndex that makes it easier to parse the options. Default to True

True
enable_hidden_vars boolean

Return all variables, including the ones in which "show" is set to False? Default to False

False

Returns:

Name Type Description
cat_subset DataFrame

Catalog options for user-provided inputs

Source code in climakitae/core/data_interface.py
def get_data_options(
    variable: str = None,
    downscaling_method: str = None,
    resolution: str = None,
    timescale: str = None,
    scenario: Union[str, list[str]] = None,
    tidy: bool = True,
    enable_hidden_vars: bool = False,
) -> pd.DataFrame:
    """Get data options, in the same format as the Select GUI, given a set of possible inputs.
    Allows the user to access the data using the same language as the GUI, bypassing the sometimes unintuitive naming in the catalog.
    If no function inputs are provided, the function returns the entire AE catalog that is available via the Select GUI

    Parameters
    ----------
    variable : str, optional
        Default to None
    downscaling_method : str, optional
        Default to None
    resolution : str, optional
        Default to None
    timescale : str, optional
        Default to None
    scenario : str or list, optional
        Default to None
    tidy : boolean, optional
        Format the pandas dataframe? This creates a DataFrame with a MultiIndex that makes it easier to parse the options.
        Default to True
    enable_hidden_vars : boolean, optional
        Return all variables, including the ones in which "show" is set to False?
        Default to False

    Returns
    -------
    cat_subset : pd.DataFrame
        Catalog options for user-provided inputs

    """
    # Get intake catalog and variable descriptions from DataInterface object
    data_interface = DataInterface()
    var_df = data_interface.variable_descriptions
    catalog = data_interface.data_catalog
    cat_df = _get_user_friendly_catalog(
        intake_catalog=catalog,
        variable_descriptions=var_df,
        enable_hidden_vars=enable_hidden_vars,
    )

    # Raise error for bad input from user
    for user_input in [variable, downscaling_method, resolution, timescale]:
        if (user_input is not None) and (type(user_input) != str):
            print(
                _format_error_print_message(
                    "Function arguments require a single string value for your inputs"
                )
            )
            return None

    def _list(x: Union[str, list]) -> list:
        """Convert x to a list if its not a list"""
        return x if isinstance(x, list) else [x]

    d = {
        "variable": _list(variable),
        "timescale": _list(timescale),
        "downscaling_method": _list(downscaling_method),
        "scenario": _list(scenario),
        "resolution": _list(resolution),
    }

    d = _check_if_good_input(d, cat_df)

    # Subset the catalog with the user's inputs
    cat_subset = cat_df[
        (cat_df["variable"].isin(d["variable"]))
        & (cat_df["downscaling_method"].isin(d["downscaling_method"]))
        & (cat_df["resolution"].isin(d["resolution"]))
        & (cat_df["timescale"].isin(d["timescale"]))
        & (cat_df["scenario"].isin(d["scenario"]))
    ].reset_index(drop=True)
    if len(cat_subset) == 0:
        print(
            _format_error_print_message(
                "No data found for your input values. Please modify your data request."
            )
        )
        return None

    if tidy:
        cat_subset = cat_subset.set_index(
            ["downscaling_method", "scenario", "timescale"]
        )
    return cat_subset
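
The filtering step above wraps each argument in a list and keeps only the catalog rows that match every field. The sketch below shows the same idea on plain dictionaries, under the assumption that an argument left as None places no restriction on its field (consistent with the statement that calling the function with no inputs returns the whole catalog). The rows and the filter_options helper are made up for illustration.

```python
# Made-up catalog rows, for illustration only.
CATALOG_ROWS = [
    {"variable": "Air Temperature at 2m", "resolution": "9 km", "timescale": "hourly"},
    {"variable": "Air Temperature at 2m", "resolution": "3 km", "timescale": "daily"},
    {"variable": "Precipitation (total)", "resolution": "9 km", "timescale": "daily"},
]


def filter_options(rows, **criteria):
    """Keep rows matching every non-None criterion; None acts as a wildcard."""
    active = {}
    for key, value in criteria.items():
        if value is None:
            continue  # no restriction on this field
        active[key] = value if isinstance(value, list) else [value]
    return [
        row
        for row in rows
        if all(row[key] in allowed for key, allowed in active.items())
    ]


matches = filter_options(CATALOG_ROWS, resolution="9 km", timescale=None)
print(len(matches))  # 2
```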

get_subsetting_options

Get all geometry options for spatial subsetting. Options match those in the selections GUI.

Parameters:

Name Type Description Default
area_subset str

One of "all", "states", "CA counties", "CA Electricity Demand Forecast Zones", "CA watersheds", "CA Electric Balancing Authority Areas", "CA Electric Load Serving Entities (IOU & POU)", "Stations". Defaults to "all", which shows all the geometry options with area_subset as a MultiIndex.

'all'

Returns:

Name Type Description
geom_df DataFrame

Geometry options. If area_subset is anything other than "all", only the options for that area_subset are returned; for example, area_subset = "states" returns only the options for states.

Source code in climakitae/core/data_interface.py
def get_subsetting_options(area_subset: str = "all") -> pd.DataFrame:
    """Get all geometry options for spatial subsetting.
    Options match those in selections GUI

    Parameters
    ----------
    area_subset : str
        One of "all", "states", "CA counties", "CA Electricity Demand Forecast Zones", "CA watersheds", "CA Electric Balancing Authority Areas", "CA Electric Load Serving Entities (IOU & POU)", "Stations"
        Defaults to "all", which shows all the geometry options with area_subset as a multiindex

    Returns
    -------
    geom_df : pd.DataFrame
        Geometry options
        Shows only options for one area_subset if input is provided that is not "all"
        i.e. if area_subset = "states", only the options for states will be returned

    """
    # Get geographies from DataInterface object
    data_interface = DataInterface()
    geographies = data_interface._geographies
    boundary_dict = geographies.boundary_dict()

    # Get geometries and labels from Boundaries object
    df_dict = {
        "states": geographies._us_states[["abbrevs", "geometry"]].rename(
            columns={"abbrevs": "NAME"}
        ),
        "CA counties": geographies._ca_counties[["NAME", "geometry"]],
        "CA Electricity Demand Forecast Zones": geographies._ca_forecast_zones.rename(
            columns={"FZ_Name": "NAME"}
        )[["NAME", "geometry"]],
        "CA watersheds": geographies._ca_watersheds.rename(columns={"Name": "NAME"})[
            ["NAME", "geometry"]
        ],
        "CA Electric Balancing Authority Areas": geographies._ca_electric_balancing_areas[
            ["NAME", "geometry"]
        ],
        "CA Electric Load Serving Entities (IOU & POU)": geographies._ca_utilities.rename(
            columns={"Utility": "NAME"}
        )[
            ["NAME", "geometry"]
        ],
        "Stations": data_interface._stations_gdf.sort_values("station").rename(
            columns={"station": "NAME"}
        )[["NAME", "geometry"]],
    }

    # Confirm that input for argument "area_subset" is valid
    # Raise error and print helpful statements if bad input
    valid_inputs = list(df_dict.keys()) + ["all"]
    if area_subset not in valid_inputs:
        print(
            "'"
            + str(area_subset)
            + "' is not a valid option for function argument 'area_subset'.\nChoose one of the following: "
            + ", ".join(valid_inputs)
        )
        print("Default argument 'all' will show all valid geometry options.")
        raise ValueError("Bad input for argument 'area_subset'")

    # Some of the geometry options are limited further by the selections.show() GUI
    # i.e. not all US states are an option in the GUI, even though the parquet file provided by geographies._us_states contains all US states
    # Here, we limit the output to return the same options as the GUI
    for name, df in df_dict.items():
        df["area_subset"] = [name] * len(
            df
        )  # Add area subset as a column. Used to create multiindex if area_subset = "all"
        if name == "Stations":  # This logic doesn't apply to weather stations
            pass  # do nothing
        else:  # Limit options
            df = df[df["NAME"].isin(list(boundary_dict[name].keys()))]
        df_dict[name] = df  # Replace the dictionary with the new, reduced dictionary

    if area_subset != "all":
        # Only return the desired area subset
        geoms_df = (
            df_dict[area_subset]
            .drop(columns="area_subset")
            .rename(columns={"NAME": "cached_area"})
            .set_index("cached_area")
        )
    else:
        geoms_df = pd.concat(list(df_dict.values())).rename(
            columns={"NAME": "cached_area"}
        )
        geoms_df = geoms_df.set_index(
            ["area_subset", "cached_area"]
        )  # Create multiindex

    return geoms_df
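
The function above has two output shapes: a single-level index of cached_area names for one category, or a two-level (area_subset, cached_area) index for "all". The sketch below mimics that behavior with plain Python containers so the two shapes are easy to compare. The BOUNDARIES dictionary and the subsetting_options helper are hypothetical and hold only a few placeholder entries.

```python
# Placeholder boundary names; the real options come from the boundary catalog.
BOUNDARIES = {
    "states": ["CA", "NV"],
    "CA counties": ["Alameda County", "Los Angeles County"],
}


def subsetting_options(area_subset="all", boundaries=BOUNDARIES):
    """Return names for one category, or (category, name) pairs for 'all'."""
    valid_inputs = list(boundaries) + ["all"]
    if area_subset not in valid_inputs:
        raise ValueError(
            f"{area_subset!r} is not a valid area_subset. "
            f"Choose one of: {', '.join(valid_inputs)}"
        )
    if area_subset != "all":
        return list(boundaries[area_subset])
    # Two-level keys, analogous to a MultiIndex of (area_subset, cached_area).
    return [
        (category, name)
        for category, names in boundaries.items()
        for name in names
    ]


print(subsetting_options("states"))
print(subsetting_options())
```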

get_data

Retrieve formatted data from the Analytics Engine data catalog.

Contrasts with DataParameters().retrieve(), which retrieves data from the user inputs in climakitaegui's selections GUI.

Parameters:

Name Type Description Default
variable str

String name of climate variable

required
resolution str, one of ["3 km", "9 km", "45 km"]

Resolution of data in kilometers

required
timescale str, one of ["hourly", "daily", "monthly"]

Temporal frequency of dataset

required
downscaling_method str, one of ["Dynamical", "Statistical", "Dynamical+Statistical"]

Downscaling method of the data: WRF ("Dynamical"), LOCA2 ("Statistical"), or both ("Dynamical+Statistical"). Defaults to "Dynamical".

'Dynamical'
data_type str, one of ["Gridded", "Stations"]

Whether to choose gridded data or weather station data Default to "Gridded"

'Gridded'
approach one of ["Time", "Warming Level"]

Default to "Time"

'Time'
scenario str or list of str

SSP scenario ["SSP 3-7.0", "SSP 2-4.5", "SSP 5-8.5"] and/or historical data selection ["Historical Climate", "Historical Reconstruction"]. If approach = "Time", you need to set a valid option. If approach = "Warming Level", scenario is ignored.

None
units str

Variable units. Defaults to native units of data

None
area_subset str

Area category, e.g. "CA counties". Defaults to entire domain ("none").

'none'
cached_area list

Area, e.g. "Alameda county". Defaults to entire domain (["entire domain"]).

None
area_average one of ["Yes","No"]

Take an average over spatial domain? Default to "No".

None
latitude None or tuple of float

Tuple of valid latitude bounds Default to entire domain

None
longitude None or tuple of float

Tuple of valid longitude bounds Default to entire domain

None
time_slice tuple

Time range for retrieved data Only valid for approach = "Time"

None
stations list of str

Which weather stations to retrieve data for Only valid for data_type = "Stations" Default to all stations

None
warming_level list of float

Must be one of the warming levels available in climakitae.core.constants. Only valid for approach = "Warming Level" and data_type = "Stations"

None
warming_level_window int in range(5, 25)

Years around Global Warming Level (+/-) (e.g. 15 means a 30yr window)

None
warming_level_months list of int

Months of year for which to perform warming level computation. Defaults to all months in a year: [1,2,3,4,5,6,7,8,9,10,11,12]. For example, you may want to set warming_level_months=[12,1,2] to perform the analysis for the winter season. Only valid for approach = "Warming Level" and data_type = "Stations"

None
all_touched boolean

spatial subset option for within or touching selection

False
enable_hidden_vars boolean

Return all variables, including the ones in which "show" is set to False? Default to False

False
kwargs dict

Additional keyword arguments to pass to DataParameters()

{}

Returns:

Type Description
DataArray

The requested climate data, or None if an error occurred.

Notes

The function does not raise errors. Instead, it prints an informative message and returns None. This is because the AE Jupyter Hub raises a strange Pieces Mismatch Error for some bad inputs; that error is ignored and a more informative message is printed in its place.
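
Because failures are reported by printing a message and returning None, calling code in scripts or pipelines may want to check the result explicitly. The sketch below shows one way to do that. The require_data wrapper is hypothetical, and fake_get_data is a stand-in that only imitates the print-and-return-None style so the example can run without the package or network access.

```python
def require_data(fetch, **query):
    """Call a fetch function and raise if it signals failure by returning None."""
    result = fetch(**query)
    if result is None:
        raise RuntimeError(f"Query returned no data: {query}")
    return result


def fake_get_data(variable, resolution, timescale):
    """Stand-in that imitates the print-and-return-None error style."""
    if resolution not in ("3 km", "9 km", "45 km"):
        print("ERROR: invalid resolution")
        return None
    return {"variable": variable, "resolution": resolution, "timescale": timescale}


data = require_data(
    fake_get_data,
    variable="Air Temperature at 2m",
    resolution="9 km",
    timescale="hourly",
)
print(data["resolution"])  # 9 km
```

With the real function, the same wrapper would be called as require_data(get_data, variable=..., resolution=..., timescale=...), turning a silent None into an exception that stops the workflow at the point of failure.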

Source code in climakitae/core/data_interface.py
def get_data(
    variable: str,
    resolution: str,
    timescale: str,
    downscaling_method: str = "Dynamical",
    data_type: str = "Gridded",
    approach: str = "Time",
    scenario: Union[str, list[str]] = None,
    units: str = None,
    warming_level: list[float] = None,
    area_subset: str = "none",
    latitude: tuple[float, float] = None,
    longitude: tuple[float, float] = None,
    cached_area: list[str] = None,
    area_average: str = None,
    time_slice: tuple = None,
    stations: list[str] = None,
    warming_level_window: int = None,
    warming_level_months: list[int] = None,
    all_touched=False,
    enable_hidden_vars: bool = False,
    **kwargs,
) -> xr.DataArray:
    """Retrieve formatted data from the Analytics Engine data catalog.

    Contrasts with DataParameters().retrieve(), which retrieves data from
    the user inputs in climakitaegui's selections GUI.

    Parameters
    ----------
    variable : str
        String name of climate variable
    resolution : str, one of ["3 km", "9 km", "45 km"]
        Resolution of data in kilometers
    timescale : str, one of ["hourly", "daily", "monthly"]
        Temporal frequency of dataset
    downscaling_method : str, one of ["Dynamical", "Statistical", "Dynamical+Statistical"], optional
        Downscaling method of the data:
        WRF ("Dynamical"), LOCA2 ("Statistical"), or both "Dynamical+Statistical"
        Default to "Dynamical"
    data_type : str, one of ["Gridded", "Stations"], optional
        Whether to choose gridded data or weather station data
        Default to "Gridded"
    approach : one of ["Time", "Warming Level"], optional
        Default to "Time"
    scenario : str or list of str, optional
        SSP scenario ["SSP 3-7.0", "SSP 2-4.5","SSP 5-8.5"] and/or historical data selection ["Historical Climate", "Historical Reconstruction"]
        If approach = "Time", you need to set a valid option
        If approach = "Warming Level", scenario is ignored
    units : str, optional
        Variable units.
        Defaults to native units of data
    area_subset : str, optional
        Area category, e.g. "CA counties"
        Defaults to entire domain ("none")
    cached_area : list, optional
        Area, e.g. "Alameda county"
        Defaults to entire domain (["entire domain"])
    area_average : one of ["Yes","No"], optional
        Take an average over spatial domain?
        Default to "No".
    latitude : None or tuple of float, optional
        Tuple of valid latitude bounds
        Default to entire domain
    longitude : None or tuple of float, optional
        Tuple of valid longitude bounds
        Default to entire domain
    time_slice : tuple, optional
        Time range for retrieved data
        Only valid for approach = "Time"
    stations : list of str, optional
        Which weather stations to retrieve data for
        Only valid for data_type = "Stations"
        Default to all stations
    warming_level : list of float, optional
        Must be one of the warming levels available in `climakitae.core.constants`
        Only valid for approach = "Warming Level" and data_type = "Stations"
    warming_level_window : int in range (5,25), optional
        Years around Global Warming Level (+/-) (e.g. 15 means a 30yr window)
    warming_level_months : list of int, optional
        Months of year for which to perform warming level computation
        Default to all months in a year: [1,2,3,4,5,6,7,8,9,10,11,12]
        For example, you may want to set warming_level_months=[12,1,2] to perform the analysis for the winter season.
        Only valid for approach = "Warming Level" and data_type = "Stations"
    all_touched : boolean
        spatial subset option for within or touching selection
    enable_hidden_vars : boolean, optional
        Return all variables, including the ones in which "show" is set to False?
        Default to False
    kwargs : dict
        Additional keyword arguments to pass to DataParameters()

    Returns
    -------
    xr.DataArray
        The requested climate data, or None if an error occurred.

    Notes
    -----
    The function does not raise errors for most bad inputs. Instead, it prints
    an informative message and returns None. This is because the AE Jupyter Hub
    raises a confusing Pieces Mismatch Error for some bad inputs; that error is
    suppressed and a more informative message is printed in its place.

    """

    def _check_valid_input_station(
        stations: list[str], station_options_all: list[str]
    ) -> list[str]:
        """Check that the user input a valid value for station
        If invalid input, the function will "guess" a close-ish station using difflib
        See _get_closest_option function for more info
        If invalid input and no guesses found, the function will print an informative
        error message and raise a ValueError

        Parameters
        ----------
        stations : list[str]
        station_options_all : list of string
            All the possible station options
            Can be retrieved from DataParameters()._stations_gdf.station.values

        Returns
        -------
        stations : list[str]

        """
        station_options_all = sorted(
            station_options_all
        )  # sorted() puts the list in alphabetical order

        # Keep track of if error was raised and message was printed to user
        # If more than one station prints errors to the console, print a space between each station
        printed_warning = False

        for i, station_i in enumerate(stations):  # Go through all the stations
            # If the station is a valid option, don't do anything
            if station_i in station_options_all:
                continue

            if printed_warning:
                print(
                    "\n", end=""
                )  # Add a space between stations for better readability

            # If the station isn't a valid option...
            print("Input station='" + station_i + "' is not a valid option.")
            closest_options = _get_closest_options(
                station_i, station_options_all
            )  # See if there are any similar options

            # No close options found: print all valid options and raise an error
            match closest_options:
                case None:
                    print("Valid options: \n- ", end="")
                    print("\n- ".join(station_options_all))
                    raise ValueError("Bad input")

                # Just one option in the list
                case closest_options if len(closest_options) == 1:
                    print("Closest option: '" + closest_options[0] + "'")

                case closest_options if len(closest_options) > 1:
                    print("Closest options: \n- " + "\n- ".join(closest_options))

            print("Outputting data for station='" + closest_options[0] + "'")
            stations[i] = closest_options[
                0
            ]  # Replace that value in the list with the best option :)

            printed_warning = True

        return stations

    # Internal functions
    def _error_handling_warming_level_inputs(
        wl: Union[list[float], list[int]],
        argument_name: str,
        downscaling_method: str,
        resolution: str,
    ):
        """Error handling for arguments: warming_level and warming_level_months
        Both require a list of either floats or ints
        argument_name is either "warming_level" or "warming_level_months" and is used to
        print an appropriate error message for bad input

        """
        # Find the WL bounds for LOCA and WRF
        loca, wrf = create_ae_warming_trajectories(resolution)
        loca_max = round(loca.max().max(), 2)
        wrf_max = round(wrf.max().max(), 2)

        match downscaling_method:
            case "Statistical":
                max_val = loca_max
            case "Dynamical":
                max_val = wrf_max
            case "Dynamical+Statistical":
                max_val = min(loca_max, wrf_max)
            case _:
                raise ValueError(
                    "Downscaling method must be 'Statistical', 'Dynamical', or 'Dynamical+Statistical'"
                )

        if (wl is not None) and not isinstance(wl, list):
            if isinstance(wl, (float, int)):  # Convert float to a singleton list
                wl = [wl]
            if not isinstance(wl, list):
                raise ValueError(
                    f"""Function argument {argument_name} requires a float/int or list 
                    of floats/ints input. Your input: {type(wl)}"""
                )
        if isinstance(wl, list):
            for x in wl:
                if not isinstance(x, (float, int)):
                    raise ValueError(
                        f"Each item in '{argument_name}' must be a float or int. Got: {type(x)}"
                    )
                if argument_name == "warming_level":
                    if x < 0 or x > max_val:
                        raise ValueError(
                            f"{argument_name} value {x} is out of range. "
                            f"Allowed range for {downscaling_method}-downscaled data at {resolution} resolution is 0 to {max_val:.2f}."
                        )
        return wl

    def _error_handling_approach_inputs(
        approach: str, scenario: str, warming_level: list[float], time_slice: tuple
    ) -> tuple[str, str, list[float], tuple]:
        """Error handling for approach and scenario inputs"""
        _valid_options_approach = ["Time", "Warming Level"]
        if approach not in _valid_options_approach:
            # Maybe the user just capitalized it wrong
            # If so, fix it for them-- don't raise an error
            if approach.lower().title() in _valid_options_approach:
                approach = approach.lower().title()
            else:
                # An error will be raised later when you try to set selections
                pass

        # Print a warning if scenario is set but approach is Warming Level
        if approach == "Warming Level" and scenario not in [None, ["n/a"], "n/a"]:
            print(
                'WARNING: "scenario" argument will be ignored for warming levels approach'
            )
            scenario = None
        if approach == "Warming Level" and time_slice is not None:
            print(
                'WARNING: "time_slice" argument will be ignored for warming levels approach'
            )
            time_slice = None

        if approach == "Time":
            warming_level = ["n/a"]

        return approach, scenario, warming_level, time_slice

    def _error_handling_location_settings(
        area_subset: list[str], cached_area: list[str]
    ) -> list[str]:
        """Maybe the user put an input for cached area but not for area subset
        We need to have the matching/correct area subset in order for selections.retrieve() to actually subset the data
        Here, we load in the geometry options to set area_subset to the correct value
        This also raises an appropriate error if the user has a bad input

        """
        if area_subset == "none" and cached_area != ["entire domain"]:
            geom_df = get_subsetting_options(area_subset="all").reset_index()
            area_subset_vals = geom_df[geom_df["cached_area"] == cached_area[0]][
                "area_subset"
            ].values
            if len(area_subset_vals) == 0:
                raise ValueError("Invalid input for argument 'cached_area'")
            else:
                area_subset = area_subset_vals[0]
        return area_subset

    def _get_scenario_ssp_scenario_historical(
        approach: str, scenario: str
    ) -> tuple[str, str]:
        """Get scenario_ssp, scenario_historical depending on user inputs"""
        match approach:
            case "Warming Level":
                scenario_ssp = ["n/a"]
                scenario_historical = ["n/a"]
            case "Time":
                if (
                    "Historical Reconstruction" in scenario
                ):  # Handling for Historical Reconstruction option
                    scenario_historical = [x for x in scenario if "Historical" in x]
                    scenario_ssp = []
                    if (
                        len(scenario) != 1
                    ):  # No SSP options for Historical Reconstruction data
                        print(
                            "WARNING: Historical Reconstruction data cannot be retrieved in the same data object as SSP scenario options. SSP data will not be retrieved."
                        )
                else:
                    scenario_ssp = [
                        x for x in scenario if "Historical" not in x
                    ]  # Add non-historical SSPs to scenario_ssp key
                    if "Historical Climate" in scenario:
                        scenario_historical = ["Historical Climate"]
                    else:
                        scenario_historical = []
            case _:
                scenario_ssp, scenario_historical = None, None
        return scenario_ssp, scenario_historical

    # default values set as lists are dangerous, so set them to None and then set to
    # default value later
    if cached_area is None:
        cached_area = ["entire domain"]
    # Get intake catalog and variable descriptions from DataInterface object
    data_interface = DataInterface()
    var_df = data_interface.variable_descriptions.rename(
        columns={"variable": "display_name"}
    )  # Rename column so that it can be merged with cat_df

    # Filter variable descriptions based on enable_hidden_vars
    if not enable_hidden_vars:
        var_df = var_df[var_df["show"] == True]

    ## --------- ERROR HANDLING ----------
    # Deal with bad or missing users inputs

    # Station data error handling
    if data_type == "Stations":
        # dictionary with { argument name : [valid option, user input]}
        d = {
            "downscaling_method": ["Dynamical", downscaling_method],
            "timescale": ["hourly", timescale],
            "variable": ["Air Temperature at 2m", variable],
        }
        # Go through the users inputs
        # See if they match the required value for that argument
        # If not, print a warning to the user.
        for key, vals in d.items():
            if vals[0] != vals[1]:
                print(
                    "Weather station data can only be retrieved for {0}={1} \nYour input: {2} \nRetrieving data for {0}={1}".format(
                        key, vals[0], vals[1]
                    )
                )

        downscaling_method = "Dynamical"
        timescale = "hourly"
        variable = "Air Temperature at 2m"

        # Deal with scenario and time_slice arguments
        # Handle various use-cases of user inputs/errors
        if scenario is None:
            if time_slice is None:
                # Default
                scenario = ["Historical Climate"]
            else:
                scenario = []

        if resolution == "3 km":
            # Neither SSP 2-4.5 nor SSP 5-8.5 are valid options for scenario... need to remove
            for bad_scenario_choice in ["SSP 2-4.5", "SSP 5-8.5"]:
                if bad_scenario_choice in scenario:
                    error_message = f"{bad_scenario_choice} is not a valid scenario input for resolution = {resolution}"
                    print(_format_error_print_message(error_message))
                    return None
        if time_slice is not None:
            # Make sure time_slice and scenario match each other
            # If time_slice is not assigned by the user, it will be auto-set by the DataInterface object
            if any(value < 2015 for value in time_slice) and (
                ("Historical Climate") not in scenario
            ):
                # Add Historical Climate to scenario if the time scale includes historical period
                scenario.append("Historical Climate")
            if any(value >= 2015 for value in time_slice) and not any(
                "SSP" in item for item in scenario
            ):
                # If the time scale includes the future period and no SSP data is selected, add SSP 3-7.0
                scenario.append("SSP 3-7.0")

        if stations is None:
            # Print a warning if the user wants to retrieve station data but they don't input a value for station
            # The function will return all the stations by default
            print(
                "WARNING: You haven't set a particular station/s to retrieve data for; the function will default to retrieving all available stations in the domain"
            )
        if isinstance(stations, str):
            # Catch an easy user mistake without raising an error: inputting a string instead of a list
            # I imagine this could happen if you just wanted to retrieve data for a single station
            stations = [stations]

    # If lat/lon input, change cached_area and area_subset
    if (latitude is not None) and (longitude is not None):
        area_subset = "lat/lon"
        cached_area = ["coordinate selection"]

    # Check warming level inputs
    try:
        warming_level = _error_handling_warming_level_inputs(
            warming_level, "warming_level", downscaling_method, resolution
        )
        warming_level_months = _error_handling_warming_level_inputs(
            warming_level_months, "warming_level_months", downscaling_method, resolution
        )
    except ValueError as error_message:
        print(_format_error_print_message(error_message))
        return None

    # Make sure the inputs are a valid type (no floats, ints, dictionaries, etc)
    for user_input in [
        variable,
        downscaling_method,
        resolution,
        timescale,
        area_subset,
        area_average,
        approach,
        scenario,
    ]:
        if (user_input is not None) and (type(user_input) not in [str, list]):
            error_message = (
                "Function arguments require a single string value for your inputs"
            )
            print(_format_error_print_message(error_message))
            return None

    # Maybe area average was capitalized wrong
    # Fix it instead of raising an error
    if area_average is not None:
        if area_average.lower().title() in ["Yes", "No"]:
            area_average = area_average.lower().title()

    # Cached area should be a list even if it's just a single string value (i.e. [str])
    cached_area = cached_area if isinstance(cached_area, list) else [cached_area]

    # If all_touched is None, set it to False
    if all_touched is None:
        all_touched = False

    # Check if all_touched boolean
    if all_touched not in [True, False]:
        raise ValueError("all_touched must be a boolean")

    # Make sure approach matches the scenario setting
    # See function documentation for more details
    approach, scenario, warming_level, time_slice = _error_handling_approach_inputs(
        approach, scenario, warming_level, time_slice
    )

    # Make sure the area subset is set to a valid input
    # See function documentation for more details
    try:
        area_subset = _error_handling_location_settings(area_subset, cached_area)
    except ValueError as error_message:
        print(_format_error_print_message(error_message))
        return None

    ## --------- ADD ARGUMENTS TO A DICTIONARY ----------
    # A dictionary is used for all the inputs in selections because it enables better error handling and cleaner code when we set selections.thing = thing
    # It also makes parsing through the arguments easier
    # The inputs here need to be a list so that they can be parsed easier by the _check_if_good_input function when comparing with the valid catalog options to confirm the user input is valid
    scenario_user_input = scenario  # What the user originally input for scenario

    check_input_df = get_data_options(
        variable=variable,
        downscaling_method=downscaling_method,
        resolution=resolution,
        timescale=timescale,
        scenario=scenario,
        tidy=False,
        enable_hidden_vars=enable_hidden_vars,
    )

    if check_input_df is None:
        # Does this print an informative error message? I think so but I'm not sure.
        return None

    # Merge with variable dataframe to get all the info about the data in one place
    check_input_df = check_input_df.merge(var_df, how="left")

    # Convert to a dictionary so it can be easily parsed by the function
    cat_dict = check_input_df.to_dict(orient="list")
    for key, values in cat_dict.items():
        # Remove non-unique values
        # This happens because we converted a pandas dataframe to a dictionary
        cat_dict[key] = list(np.unique(values))

    # _check_if_good_input will default fill the scenario options with EVERY possible option
    # It will in most cases give a list of all the available SSPs and the two historical data options (Historical Climate AND Historical Reconstruction)
    # I'd like the function to just default to Historical Climate + SSPs
    # So, if the user input None for scenario, I just remove Historical Reconstruction from the list
    if scenario_user_input is None:
        if "Historical Reconstruction" in cat_dict["scenario"]:
            cat_dict["scenario"] = [
                item
                for item in cat_dict["scenario"]
                if item != "Historical Reconstruction"
            ]

    # Check if it's an index
    # Use proper variable_id lookup that considers downscaling method and timescale
    variable_ids = _get_var_ids(
        data_interface.variable_descriptions,
        cat_dict["variable"][0],
        cat_dict["downscaling_method"][0],
        cat_dict["timescale"][0],
        enable_hidden_vars=enable_hidden_vars,
    )
    variable_id = variable_ids[0] if variable_ids else ""
    variable_type = "Derived Index" if "_index" in variable_id else "Variable"

    # Settings for selections
    selections_dict = {
        "variable": cat_dict["variable"][0],
        "timescale": cat_dict["timescale"][0],
        "downscaling_method": cat_dict["downscaling_method"][0],
        "resolution": cat_dict["resolution"][0],
        "data_type": data_type,
        "scenario": cat_dict["scenario"],
        "area_average": area_average,
        "area_subset": area_subset,
        "cached_area": cached_area,
        "approach": approach,
        "warming_level": warming_level,
        "warming_level_window": warming_level_window,
        "warming_level_months": warming_level_months,
        "variable_type": variable_type,
        "time_slice": time_slice,
        "latitude": latitude,
        "longitude": longitude,
        "stations": stations,
        "all_touched": all_touched,
    }

    scenario_ssp, scenario_historical = _get_scenario_ssp_scenario_historical(
        selections_dict["approach"], selections_dict["scenario"]
    )
    selections_dict["scenario_ssp"] = scenario_ssp
    selections_dict["scenario_historical"] = scenario_historical

    ## ----- SET THE UNITS ------

    # Query the table based on input values
    # Timescale in table needs to be handled differently
    # This is because the monthly variables are derived from daily variables, so they are listed in the table as "daily, monthly"
    # Hourly variables may be different
    # Querying the data needs special handling due to the layout of the csv file
    var_df_query = var_df[
        (var_df["display_name"] == selections_dict["variable"])
        & (var_df["downscaling_method"] == selections_dict["downscaling_method"])
    ]
    var_df_query = var_df_query[
        var_df_query["timescale"].str.contains(selections_dict["timescale"])
    ]

    selections_dict["units"] = (
        units if units is not None else var_df_query["unit"].item()
    )  # Set units if user doesn't set them manually

    ## ------ CREATE SELECTIONS OBJECT --------
    selections = DataParameters(enable_hidden_vars=enable_hidden_vars)

    # Error handling for stations
    # If the user input a value for the station argument, check that it exists
    # If it doesn't exist, see if you can find something close... if not, throw an error
    # Need to do the error handling here since it requires the selections object
    if data_type == "Stations" and stations is not None:
        stations = _check_valid_input_station(
            stations, selections._stations_gdf.station.values
        )

    ## ------- SET EACH ATTRIBUTE -------

    try:
        selections.data_type = selections_dict["data_type"]
        selections.approach = selections_dict["approach"]
        selections.scenario_ssp = selections_dict["scenario_ssp"]
        selections.scenario_historical = selections_dict["scenario_historical"]
        selections.area_subset = selections_dict["area_subset"]
        selections.cached_area = selections_dict["cached_area"]
        selections.downscaling_method = selections_dict["downscaling_method"]
        selections.resolution = selections_dict["resolution"]
        selections.timescale = selections_dict["timescale"]
        selections.variable_type = selections_dict["variable_type"]
        selections.variable = selections_dict["variable"]
        selections.units = selections_dict["units"]
        selections.all_touched = selections_dict["all_touched"]

        # Setting the values like this enables us to take advantage of the default settings in DataParameters without having to manually set defaults in this function
        if selections_dict["warming_level"] is not None:
            selections.warming_level = selections_dict["warming_level"]
        if selections_dict["warming_level_window"] is not None:
            selections.warming_level_window = selections_dict["warming_level_window"]
        if selections_dict["area_average"] is not None:
            selections.area_average = selections_dict["area_average"]
        if selections_dict["time_slice"] is not None:
            selections.time_slice = selections_dict["time_slice"]
        if selections_dict["warming_level_months"] is not None:
            selections.warming_level_months = selections_dict["warming_level_months"]
        if selections_dict["latitude"] is not None:
            selections.latitude = selections_dict["latitude"]
        if selections_dict["longitude"] is not None:
            selections.longitude = selections_dict["longitude"]
        if selections_dict["stations"] is not None:
            selections.stations = selections_dict["stations"]

        for key in kwargs:
            if getattr(selections, key, None) is not None:
                setattr(selections, key, kwargs[key])

        # Force update variable_id after all attributes are set
        # This ensures hidden variables work correctly
        selections.variable_id = _get_var_ids(
            data_interface.variable_descriptions,
            selections.variable,
            selections.downscaling_method,
            selections.timescale,
            enable_hidden_vars=enable_hidden_vars,
        )

    except ValueError as error_message:
        # The error message is really long
        # And sometimes has a confusing Attribute Error: Pieces mismatch that is hard to interpret
        # Here we just print the error message and return None instead of allowing the long error to be raised by default
        print(_format_error_print_message(error_message))
        return None

    # Retrieve data
    data = selections.retrieve()
    return data
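
The station check in the listing above mentions guessing a close station name with difflib. The helper `_get_closest_options` is not shown on this page, so the following is only a sketch of how such a lookup might behave, using the standard library's `difflib.get_close_matches`. The function name `closest_station`, the cutoff value, and the station names are hypothetical and chosen for illustration.

```python
import difflib
from typing import Optional


def closest_station(name: str, valid: list[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the best fuzzy match for `name` among `valid`, or None if nothing is close."""
    # get_close_matches returns candidates ordered from most to least similar
    matches = difflib.get_close_matches(name, sorted(valid), n=3, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical station names, for illustration only
stations = ["Sacramento Executive Airport (KSAC)", "Oakland Metro Airport (KOAK)"]
guess = closest_station("Sacramento Exec Airport (KSAC)", stations)
no_guess = closest_station("zzzz", stations)
```

Returning None when nothing clears the cutoff mirrors the listing's `case None` branch, where the caller prints every valid option and raises instead of silently picking a station.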

Notes on behavior

  • DataParameters.retrieve() is the closest analogue to the old GUI workflow.
  • get_data_options() and get_subsetting_options() are useful when you need to discover valid values programmatically.
  • The module does not raise on every bad input. In several cases it prints a diagnostic message and returns None to match the original notebook behavior.
  • get_data() is the lower-level direct entry point and accepts the same legacy naming conventions as the GUI.
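
Because several failure paths print a message and return None, calling code may want to check the result explicitly before using it. A minimal sketch, assuming a hypothetical wrapper `retrieve_or_raise` (not part of climakitae) and a stand-in `fake_get_data` used in place of the real function so the example runs without catalog access:

```python
from typing import Any, Callable


def retrieve_or_raise(fetch: Callable[..., Any], **query: Any) -> Any:
    """Call `fetch(**query)` and raise if it signals failure by returning None."""
    result = fetch(**query)
    if result is None:
        raise RuntimeError(f"Retrieval returned None for query: {query}")
    return result


def fake_get_data(variable: str, resolution: str, timescale: str):
    """Stand-in for get_data(): returns None for an unsupported resolution."""
    if resolution not in ("3 km", "9 km", "45 km"):
        print("ERROR: invalid resolution")  # mimics the printed diagnostic
        return None
    return {"variable": variable, "resolution": resolution, "timescale": timescale}


ok = retrieve_or_raise(
    fake_get_data, variable="Air Temperature at 2m", resolution="9 km", timescale="hourly"
)
```

With the real function, the call would presumably take the form `retrieve_or_raise(get_data, variable=..., resolution=..., timescale=...)`, so that a failed query surfaces as an exception instead of a None that propagates into later analysis steps.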