Core Data Interface (Detailed)

The legacy data interface module providing a function-based API for climate data access.

Overview

climakitae.core.data_interface is the main entry point for the legacy interface. It provides:

- DataParameters class — a configuration object for data queries
- get_data() function — executes data queries with validation

Warning

This is the legacy interface. For new code, use climakitae.new_core.user_interface.ClimateData instead.

DataParameters Class

Bases: Parameterized

Python param object to hold data parameters for use in panel GUI.

Use DataParameters when you want to select and retrieve data from the climakitae data catalog without the ckg.Select GUI. ckg.Select itself uses this class to store selections and retrieve data.

DataParameters calls DataInterface, a singleton class that manages the connection to the intake-esm data store in an S3 bucket.
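Because DataInterface is a singleton, repeated instantiations share one catalog connection rather than reopening it. A minimal self-contained sketch of that pattern (illustrative only, not the climakitae implementation):

```python
class CatalogConnection:
    """Illustrative singleton: every instantiation returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            # Expensive setup (e.g. opening the intake-esm catalog) runs once
            cls._instance.opened = True
        return cls._instance


a = CatalogConnection()
b = CatalogConnection()
print(a is b)  # True: both names point at the single shared connection
```

This is why constructing multiple DataParameters objects is cheap: they all reuse the same underlying catalog handle.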

Attributes

unit_options_dict : dict
    Options dictionary for converting units to other units
area_subset : str
    Dataset to use from Boundaries for subarea selection
cached_area : list of strs
    One or more features from the area_subset datasets to use for selection
latitude : tuple
    Latitude range of selection box
longitude : tuple
    Longitude range of selection box
variable_type : str
    Toggle raw or derived variable selection
default_variable : str
    Initial variable to have selected in the widget
time_slice : tuple
    Year range to select
resolution : str
    Resolution of data to select ("3 km", "9 km", "45 km")
timescale : str
    Frequency of dataset ("hourly", "daily", "monthly")
scenario_historical : list of strs
    Historical scenario selections
area_average : str
    Whether to compute an area average ("Yes", "No")
downscaling_method : str
    Whether to choose WRF or LOCA2 data or both ("Dynamical", "Statistical", "Dynamical+Statistical")
data_type : str
    Whether to choose gridded or station-based data ("Gridded", "Stations")
stations : list of strs
    List of stations that can be filtered by cached_area
_station_data_info : str
    Informational statement shown when station data is selected with data_type
scenario_ssp : list of strs
    List of future climate scenarios selected (availability depends on other params)
simulation : list of strs
    List of simulations (models) selected (availability depends on other params)
variable : str
    Variable long display name
units : str
    Unit abbreviation of the data as currently set (native or converted)
enable_hidden_vars : boolean
    Enable selection of variables that are hidden from the GUI?
extended_description : str
    Extended description of the data variable
variable_id : list of strs
    List of variable ids that match the variable (WRF and LOCA2 can have different codes for the same type of variable)
historical_climate_range_wrf : tuple
    Time range of historical WRF data
historical_climate_range_loca : tuple
    Time range of historical LOCA2 data
historical_climate_range_wrf_and_loca : tuple
    Time range of historical WRF and LOCA2 data combined
historical_reconstruction_range : tuple
    Time range of historical reanalysis data
ssp_range : tuple
    Time range of future-scenario SSP data
_info_about_station_data : str
    Warning message about station data
_data_warning : str
    Warning about selecting an unavailable data combination
data_interface : DataInterface
    Data connection singleton class that provides the data
_data_catalog : intake_esm.source.ESMDataSource
    Shorthand alias to DataInterface.data_catalog
_variable_descriptions : pd.DataFrame
    Shorthand alias to DataInterface.variable_descriptions
_stations_gdf : gpd.GeoDataFrame
    Shorthand alias to DataInterface.stations_gdf
_geographies : Boundaries
    Shorthand alias to DataInterface.geographies
_geography_choose : dict
    Shorthand alias to Boundaries.boundary_dict()
_warming_level_times : pd.DataFrame
    Shorthand alias to DataInterface.warming_level_times
colormap : str
    Default colormap to render the currently selected data
scenario_options : list of strs
    List of available scenarios (historical and ssp) for selection
variable_options_df : pd.DataFrame
    Filtered variable descriptions for the downscaling_method and timescale
warming_level : array
    Global warming level(s)
warming_level_window : integer
    Years around the global warming level (+/-) (e.g. 15 means a 30-year window)
approach : str, "Warming Level" or "Time"
    How you want the data to be retrieved
warming_level_months : array
    Months of the year to use for computing warming levels; defaults to the entire calendar year: 1,2,3,4,5,6,7,8,9,10,11,12
all_touched : boolean
    Spatial subset option for "within" vs. "touching" selection
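Several of these attributes are restricted to fixed option sets. A hypothetical validator mirroring the option sets documented above (the real DataParameters enforces these through the param library, not this helper):

```python
# Option sets taken from the attribute documentation above
VALID_OPTIONS = {
    "resolution": {"3 km", "9 km", "45 km"},
    "timescale": {"hourly", "daily", "monthly"},
    "downscaling_method": {"Dynamical", "Statistical", "Dynamical+Statistical"},
    "data_type": {"Gridded", "Stations"},
}


def validate_selection(**selections):
    """Raise ValueError for any selection outside its documented option set."""
    for name, value in selections.items():
        allowed = VALID_OPTIONS.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name}={value!r}; valid options: {sorted(allowed)}")
    return selections


sel = validate_selection(resolution="9 km", timescale="daily")
```

Passing an unlisted value (say, resolution="10 km") raises immediately, which is the same fail-fast behavior the param-backed class gives you.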

Source code in climakitae/core/data_interface.py
def __init__(self, **params):
    # Set default values
    super().__init__(**params)

    self.data_interface = DataInterface()

    # Data Catalog
    self._data_catalog = self.data_interface.data_catalog

    # Warming Levels Table
    self._warming_level_times = self.data_interface.warming_level_times

    # variable descriptions
    self._variable_descriptions = self.data_interface.variable_descriptions

    # station data
    self._stations_gdf = self.data_interface.stations_gdf

    # Get geography boundaries and selection options
    self._geographies = self.data_interface.geographies
    self._geography_choose = self._geographies.boundary_dict()

    # Set location params
    self.area_subset = "none"
    self.param["area_subset"].objects = list(self._geography_choose.keys())
    self.param["cached_area"].objects = list(
        self._geography_choose[self.area_subset].keys()
    )

    self.all_touched = False

    # Set data params
    (
        self.scenario_options,
        self.simulation,
        unique_variable_ids,
    ) = _get_user_options(
        data_catalog=self._data_catalog,
        downscaling_method=self.downscaling_method,
        timescale=self.timescale,
        resolution=self.resolution,
    )
    self.variable_options_df = _get_variable_options_df(
        variable_descriptions=self._variable_descriptions,
        unique_variable_ids=unique_variable_ids,
        downscaling_method=self.downscaling_method,
        timescale=self.timescale,
        enable_hidden_vars=self.enable_hidden_vars,
    )

    # Show derived index option?
    indices = True
    if self.data_type == "Stations":
        indices = False
    if self.downscaling_method != "Dynamical":
        indices = False
    if self.timescale == "monthly":
        indices = False
    if not indices:
        self.param["variable_type"].objects = ["Variable"]
        self.variable_type = "Variable"
    else:
        self.param["variable_type"].objects = ["Variable", "Derived Index"]

    # Set scenario param
    scenario_ssp_options = [
        scenario_to_experiment_id(scen, reverse=True)
        for scen in self.scenario_options
        if "ssp" in scen
    ]
    for scenario_i in SSPS:
        if scenario_i in scenario_ssp_options:  # Reorder list
            scenario_ssp_options.remove(scenario_i)  # Remove item
            scenario_ssp_options.append(scenario_i)  # Add to back of list
    self.param["scenario_ssp"].objects = scenario_ssp_options
    self.scenario_ssp = []

    # Set variable param
    self.param["variable"].objects = (
        self.variable_options_df.display_name.values.tolist()
    )
    self.variable = self.default_variable

    # Set colormap, units, & extended description
    var_info = self.variable_options_df[
        self.variable_options_df["display_name"] == self.variable
    ]

    # Set params that are not selected by the user
    self.colormap = var_info.colormap.item()
    self.units = var_info.unit.item()
    self.extended_description = var_info.extended_description.item()
    self.variable_id = _get_var_ids(
        self._variable_descriptions,
        self.variable,
        self.downscaling_method,
        self.timescale,
        self.enable_hidden_vars,
    )
    self._data_warning = ""

retrieve(config=None, merge=True)

Retrieve data from the catalog

By default, the DataParameters selections determine the data retrieved. The data is read from the AWS S3 bucket and returned as a lazily loaded dask-backed array. This user-facing method wraps read_catalog_from_select.

Returns:

data_return : xr.DataArray | xr.Dataset | List[xr.DataArray]
    DataArray or Dataset object
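retrieve() also prints a size warning based on the returned array's nbytes. The warning tiers reduce to three thresholds; a hypothetical helper summarizing them (names are illustrative, the thresholds come from the source):

```python
def size_warning_level(nbytes: int):
    """Classify an array size into the warning tiers used by retrieve():
    >= 1 GB -> 'large', >= 5 GB -> 'very large', >= 10 GB -> 'huge'."""
    if nbytes >= int(1e10):
        return "huge"
    if nbytes >= int(5e9):
        return "very large"
    if nbytes >= int(1e9):
        return "large"
    return None  # below 1 GB, no warning is printed


print(size_warning_level(int(2e9)))  # large
```

Since the returned object is lazy, these warnings are about the cost of later computations, not of the retrieval itself.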

Source code in climakitae/core/data_interface.py
def retrieve(
    self, config: str = None, merge: bool = True
) -> Union[xr.DataArray, xr.Dataset, List[xr.DataArray]]:
    """Retrieve data from catalog

    By default, DataParameters determines the data retrieved.
    Grabs the data from the AWS S3 bucket, returns lazily loaded dask array.
    User-facing function that provides a wrapper for read_catalog_from_select.

    Returns
    -------
    data_return : xr.DataArray | xr.Dataset | List[xr.DataArray]
        DataArray or Dataset object

    """

    def _warn_of_large_file_size(da: xr.DataArray):
        """Warn user if the data array is large"""
        nbytes = da.nbytes
        match nbytes:
            case nbytes if nbytes >= int(1e9) and nbytes < int(5e9):
                print(
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                    "! Returned data array is large. Operations could take up to 5x longer than 1GB of data!\n"
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                )
            case nbytes if nbytes >= int(5e9) and nbytes < int(1e10):
                print(
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                    "!! Returned data array is very large. Operations could take up to 8x longer than 1GB of data !!\n"
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                )
            case nbytes if nbytes >= int(1e10):
                print(
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                    "!!! Returned data array is huge. Operations could take 10x to infinity longer than 1GB of data !!!\n"
                    "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n"
                )

    def _warn_of_empty_data(self):
        if self.approach == "Warming Level" and (len(self.warming_level) > 1):
            print(
                "WARNING FOR WARMING LEVELS APPROACH\n-----------------------------------\nThere may be NaNs in your data for certain simulation/warming level combinations if the warming level is not reached for that particular simulation before the year 2100. \n\nThis does not mean you have missing data, but rather a feature of how the data is combined in retrieval to return a single data object. \n\nIf you want to remove these empty simulations, it is recommended to first subset the data object by each individual warming level and then dropping NaN values."
            )
        elif (self.approach == "Time") and (len(self.scenario_ssp) > 1):
            print(
                "WARNING\n-------\nYou have retrieved data for more than one SSP, but not all ensemble members for each GCM are available for all SSPs.\n\nAs a result, some scenario and simulation combinations may contain NaN values.\n\nIf you want to remove these empty simulations, it is recommended to first subset the data object by each individual scenario and then dropping NaN values."
            )

    data_return = read_catalog_from_select(self)

    if isinstance(data_return, list):
        for da in data_return:
            _warn_of_large_file_size(da)
    else:
        _warn_of_large_file_size(data_return)

    # Warn about empty simulations for certain selections
    _warn_of_empty_data(self)

    return data_return

Get Data Function

Retrieve formatted data from the Analytics Engine data catalog.

Contrasts with DataParameters().retrieve(), which retrieves data from the user inputs in climakitaegui's selections GUI.

Parameters

variable : str
    String name of climate variable
resolution : str, one of ["3 km", "9 km", "45 km"]
    Resolution of data in kilometers
timescale : str, one of ["hourly", "daily", "monthly"]
    Temporal frequency of dataset
downscaling_method : str, one of ["Dynamical", "Statistical", "Dynamical+Statistical"], optional
    Downscaling method of the data: WRF ("Dynamical"), LOCA2 ("Statistical"), or both ("Dynamical+Statistical")
    Defaults to "Dynamical"
data_type : str, one of ["Gridded", "Stations"], optional
    Whether to choose gridded data or weather station data
    Defaults to "Gridded"
approach : one of ["Time", "Warming Level"], optional
    Defaults to "Time"
scenario : str or list of str, optional
    SSP scenario ["SSP 3-7.0", "SSP 2-4.5", "SSP 5-8.5"] and/or historical data selection ["Historical Climate", "Historical Reconstruction"]
    If approach = "Time", you need to set a valid option
    If approach = "Warming Level", scenario is ignored
units : str, optional
    Variable units. Defaults to the native units of the data
area_subset : str, optional
    Area category, e.g. "CA counties"
    Defaults to the entire domain ("none")
cached_area : list, optional
    Area, e.g. "Alameda county"
    Defaults to the entire domain (["entire domain"])
area_average : one of ["Yes", "No"], optional
    Take an average over the spatial domain?
    Defaults to "No"
latitude : None or tuple of float, optional
    Tuple of valid latitude bounds
    Defaults to the entire domain
longitude : None or tuple of float, optional
    Tuple of valid longitude bounds
    Defaults to the entire domain
time_slice : tuple, optional
    Time range for retrieved data
    Only valid for approach = "Time"
stations : list of str, optional
    Which weather stations to retrieve data for
    Only valid for data_type = "Stations"
    Defaults to all stations
warming_level : list of float, optional
    Must be one of the warming levels available in climakitae.core.constants
    Only valid for approach = "Warming Level" and data_type = "Stations"
warming_level_window : int in range (5, 25), optional
    Years around the global warming level (+/-) (e.g. 15 means a 30-year window)
warming_level_months : list of int, optional
    Months of the year for which to perform the warming level computation
    Defaults to all months in a year: [1,2,3,4,5,6,7,8,9,10,11,12]
    For example, set warming_level_months=[12,1,2] to perform the analysis for the winter season
    Only valid for approach = "Warming Level" and data_type = "Stations"
all_touched : boolean, optional
    Spatial subset option for "within" vs. "touching" selection
enable_hidden_vars : boolean, optional
    Return all variables, including the ones in which "show" is set to False?
    Defaults to False
kwargs : dict
    Additional keyword arguments to pass to DataParameters()
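The warming_level_months behavior described above (default to the full calendar year, accept only valid month numbers) can be sketched with a hypothetical helper; the name is illustrative and the real validation lives inside get_data():

```python
def normalize_warming_level_months(months=None):
    """Default to all 12 months and reject out-of-range values,
    mirroring the documented behavior of warming_level_months."""
    if months is None:
        return list(range(1, 13))  # entire calendar year
    bad = [m for m in months if not (isinstance(m, int) and 1 <= m <= 12)]
    if bad:
        raise ValueError(f"Months must be integers in 1..12; got {bad}")
    return list(months)


winter = normalize_warming_level_months([12, 1, 2])  # DJF season
```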

Returns

xr.DataArray
    The requested climate data, or None if an error occurred.

Notes

The function does not raise errors. Instead, an informative message is printed and the function returns None. This is because the AE Jupyter Hub raises an opaque Pieces Mismatch Error for some bad inputs; that error is suppressed and a more informative message is printed in its place.
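This print-and-return-None behavior can be sketched as a wrapper (illustrative only; get_data implements it inline, and the function names below are hypothetical):

```python
import functools


def print_errors_return_none(func):
    """Convert raised exceptions into a printed message plus a None return,
    mimicking the documented error behavior of get_data()."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as err:
            print(f"Error retrieving data: {err}")
            return None

    return wrapper


@print_errors_return_none
def fetch(variable):
    # Stand-in for a catalog query that may fail on bad input
    if variable != "Air Temperature at 2m":
        raise ValueError("unknown variable")
    return "dataset"
```

The practical consequence for callers: always check the return value for None rather than wrapping get_data() in try/except.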

Source code in climakitae/core/data_interface.py
def get_data(
    variable: str,
    resolution: str,
    timescale: str,
    downscaling_method: str = "Dynamical",
    data_type: str = "Gridded",
    approach: str = "Time",
    scenario: Union[str, list[str]] = None,
    units: str = None,
    warming_level: list[float] = None,
    area_subset: str = "none",
    latitude: tuple[float, float] = None,
    longitude: tuple[float, float] = None,
    cached_area: list[str] = None,
    area_average: str = None,
    time_slice: tuple = None,
    stations: list[str] = None,
    warming_level_window: int = None,
    warming_level_months: list[int] = None,
    all_touched=False,
    enable_hidden_vars: bool = False,
    **kwargs,
) -> xr.DataArray:
    """Retrieve formatted data from the Analytics Engine data catalog.

    Contrasts with DataParameters().retrieve(), which retrieves data from
    the user inputs in climakitaegui's selections GUI.

    Parameters
    ----------
    variable : str
        String name of climate variable
    resolution : str, one of ["3 km", "9 km", "45 km"]
        Resolution of data in kilometers
    timescale : str, one of ["hourly", "daily", "monthly"]
        Temporal frequency of dataset
    downscaling_method : str, one of ["Dynamical", "Statistical", "Dynamical+Statistical"], optional
        Downscaling method of the data:
        WRF ("Dynamical"), LOCA2 ("Statistical"), or both "Dynamical+Statistical"
        Default to "Dynamical"
    data_type : str, one of ["Gridded", "Stations"], optional
        Whether to choose gridded data or weather station data
        Default to "Gridded"
    approach : one of ["Time", "Warming Level"], optional
        Default to "Time"
    scenario : str or list of str, optional
        SSP scenario ["SSP 3-7.0", "SSP 2-4.5","SSP 5-8.5"] and/or historical data selection ["Historical Climate", "Historical Reconstruction"]
        If approach = "Time", you need to set a valid option
        If approach = "Warming Level", scenario is ignored
    units : str, optional
        Variable units.
        Defaults to native units of data
    area_subset : str, optional
        Area category: i.e "CA counties"
        Defaults to entire domain ("none")
    cached_area : list, optional
        Area: i.e "Alameda county"
        Defaults to entire domain (["entire domain"])
    area_average : one of ["Yes","No"], optional
        Take an average over spatial domain?
        Default to "No".
    latitude : None or tuple of float, optional
        Tuple of valid latitude bounds
        Default to entire domain
    longitude : None or tuple of float, optional
        Tuple of valid longitude bounds
        Default to entire domain
    time_slice : tuple, optional
        Time range for retrieved data
        Only valid for approach = "Time"
    stations : list of str, optional
        Which weather stations to retrieve data for
        Only valid for data_type = "Stations"
        Default to all stations
    warming_level : list of float, optional
        Must be one of the warming levels available in `climakitae.core.constants`
        Only valid for approach = "Warming Level" and data_type = "Stations"
    warming_level_window : int in range (5,25), optional
        Years around Global Warming Level (+/-) \n (e.g. 15 means a 30yr window)
    warming_level_months : list of int, optional
        Months of year for which to perform warming level computation
        Default to all months in a year: [1,2,3,4,5,6,7,8,9,10,11,12]
        For example, you may want to set warming_level_months=[12,1,2] to perform the analysis for the winter season.
        Only valid for approach = "Warming Level" and data_type = "Stations"
    all_touched : boolean
        spatial subset option for within or touching selection
    enable_hidden_vars : boolean, optional
        Return all variables, including the ones in which "show" is set to False?
        Default to False
    kwargs : dict
        Additional keyword arguments to pass to DataParameters()

    Returns
    -------
    xr.DataArray
        The requested climate data, or None if an error occurred.

    Notes
    -----
    Errors aren't raised by the function. Rather, an appropriate informative
    message is printed, and the function returns None. This is due to the fact
    that the AE Jupyter Hub raises a strange Pieces Mismatch Error for some bad
    inputs; instead, that error is ignored and a more informative error message
    is printed instead.

    """

    def _check_valid_input_station(
        stations: list[str], station_options_all: list[str]
    ) -> list[str]:
        """Check that the user input a valid value for station
        If invalid input, the function will "guess" a close-ish station using difflib
        See _get_closest_option function for more info
        If invalid input and no guesses found, the function will print an informative
        error message and raise a ValueError

        Parameters
        ----------
        stations : list[str]
        station_options_all : list of string
            All the possible station options
            Can be retrieved from DataParameters()._stations_gdf.station.values

        Returns
        -------
        stations : list[str]

        """
        station_options_all = sorted(
            station_options_all
        )  # sorted() puts the list in alphabetical order

        # Keep track of if error was raised and message was printed to user
        # If more than one station prints errors to the console, print a space between each station
        printed_warning = False

        for i, station_i in enumerate(stations):  # Go through all the stations
            # If the station is a valid option, don't do anything
            if station_i in station_options_all:
                continue

            if printed_warning:
                print(
                    "\n", end=""
                )  # Add a space between stations for better readability

            # If the station isn't a valid option...
            print("Input station='" + station_i + "' is not a valid option.")
            closest_options = _get_closest_options(
                station_i, station_options_all
            )  # See if theres any similar options

            # Sad! No closest options found. Just set the key to all valid options
            match closest_options:
                case None:
                    print("Valid options: \n- ", end="")
                    print("\n- ".join(station_options_all))
                    raise ValueError("Bad input")

                # Just one option in the list
                case closest_options if len(closest_options) == 1:
                    print("Closest option: '" + closest_options[0] + "'")

                case closest_options if len(closest_options) > 1:
                    print("Closest options: \n- " + "\n- ".join(closest_options))

            print("Outputting data for station='" + closest_options[0] + "'")
            stations[i] = closest_options[
                0
            ]  # Replace that value in the list with the best option :)

            printed_warning = True

        return stations

    # Internal functions
    def _error_handling_warming_level_inputs(
        wl: Union[list[float], list[int]],
        argument_name: str,
        downscaling_method: str,
        resolution: str,
    ):
        """Error handling for arguments: warming_level and warming_level_month
        Both require a list of either floats or ints
        argument_name is either "warming_level" or "warming_level_months" and is used to
        print an appropriate error message for bad input

        """
        # Find the WL bounds for LOCA and WRF
        loca, wrf = create_ae_warming_trajectories(resolution)
        loca_max = round(loca.max().max(), 2)
        wrf_max = round(wrf.max().max(), 2)

        match downscaling_method:
            case "Statistical":
                max_val = loca_max
            case "Dynamical":
                max_val = wrf_max
            case "Dynamical+Statistical":
                max_val = min(loca_max, wrf_max)
            case _:
                raise ValueError(
                    "Downscaling method must be 'Statistical', 'Dynamical', or 'Dynamical+Statistical'"
                )

        if (wl is not None) and not isinstance(wl, list):
            if isinstance(wl, (float, int)):  # Convert float to a singleton list
                wl = [wl]
            if not isinstance(wl, list):
                raise ValueError(
                    f"""Function argument {argument_name} requires a float/int or list 
                    of floats/ints input. Your input: {type(wl)}"""
                )
        if isinstance(wl, list):
            for x in wl:
                if not isinstance(x, (float, int)):
                    raise ValueError(
                        f"Each item in '{argument_name}' must be a float or int. Got: {type(x)}"
                    )
                if argument_name == "warming_level":
                    if x < 0 or x > max_val:
                        raise ValueError(
                            f"{argument_name} value {x}. "
                            f"Allowed range for {downscaling_method}-downscaled data at {resolution} resolution is 0 to {max_val:.2f}."
                        )
        return wl

    def _error_handling_approach_inputs(
        approach: str, scenario: str, warming_level: list[float], time_slice: tuple
    ) -> tuple[str, str, list[float], tuple]:
        """Error handling for approach and scenario inputs"""
        _valid_options_approach = ["Time", "Warming Level"]
        if approach not in _valid_options_approach:
            # Maybe the user just capitalized it wrong
            # If so, fix it for them-- don't raise an error
            if approach.lower().title() in _valid_options_approach:
                approach = approach.lower().title()
            else:
                # An error will be raised later when you try to set selections
                pass

        # Print a warning if scenario is set but approach is Warming Level
        if approach == "Warming Level" and scenario not in [None, ["n/a"], "n/a"]:
            print(
                'WARNING: "scenario" argument will be ignored for warming levels approach'
            )
            scenario = None
        if approach == "Warming Level" and time_slice is not None:
            print(
                'WARNING: "time_slice" argument will be ignored for warming levels approach'
            )
            time_slice = None

        if approach == "Time":
            warming_level = ["n/a"]

        return approach, scenario, warming_level, time_slice

    def _error_handling_location_settings(
        area_subset: list[str], cached_area: list[str]
    ) -> list[str]:
        """Maybe the user put an input for cached area but not for area subset
        We need to have the matching/correct area subset in order for selections.retrieve() to actually subset the data
        Here, we load in the geometry options to set area_subset to the correct value
        This also raises an appropriate error if the user has a bad input

        """
        if area_subset == "none" and cached_area != ["entire domain"]:
            geom_df = get_subsetting_options(area_subset="all").reset_index()
            area_subset_vals = geom_df[geom_df["cached_area"] == cached_area[0]][
                "area_subset"
            ].values
            if len(area_subset_vals) == 0:
                raise ValueError("Invalid input for argument 'cached_area'")
            else:
                area_subset = area_subset_vals[0]
        return area_subset

    def _get_scenario_ssp_scenario_historical(
        approach: str, scenario: str
    ) -> tuple[str, str]:
        """Get scenario_ssp, scenario_historical depending on user inputs"""
        match approach:
            case "Warming Level":
                scenario_ssp = ["n/a"]
                scenario_historical = ["n/a"]
            case "Time":
                if (
                    "Historical Reconstruction" in scenario
                ):  # Handling for Historical Reconstruction option
                    scenario_historical = [x for x in scenario if "Historical" in x]
                    scenario_ssp = []
                    if (
                        len(scenario) != 1
                    ):  # No SSP options for Historical Reconstruction data
                        print(
                            "WARNING: Historical Reconstruction data cannot be retrieved in the same data object as SSP scenario options. SSP data will not be retrieved."
                        )
                else:
                    scenario_ssp = [
                        x for x in scenario if "Historical" not in x
                    ]  # Add non-historical SSPs to scenario_ssp key
                    if "Historical Climate" in scenario:
                        scenario_historical = ["Historical Climate"]
                    else:
                        scenario_historical = []
            case _:
                scenario_ssp, scenario_historical = None, None
        return scenario_ssp, scenario_historical

    # default values set as lists are dangerous, so set them to None and then set to
    # default value later
    if cached_area is None:
        cached_area = ["entire domain"]
    # Get intake catalog and variable descriptions from DataInterface object
    data_interface = DataInterface()
    var_df = data_interface.variable_descriptions.rename(
        columns={"variable": "display_name"}
    )  # Rename column so that it can be merged with cat_df

    # Filter variable descriptions based on enable_hidden_vars
    if not enable_hidden_vars:
        var_df = var_df[var_df["show"] == True]

    ## --------- ERROR HANDLING ----------
    # Deal with bad or missing user inputs

    # Station data error handling
    if data_type == "Stations":
        # dictionary with { argument name : [valid option, user input]}
        d = {
            "downscaling_method": ["Dynamical", downscaling_method],
            "timescale": ["hourly", timescale],
            "variable": ["Air Temperature at 2m", variable],
        }
        # Go through the users inputs
        # See if they match the required value for that argument
        # If not, print a warning to the user.
        for key, vals in d.items():
            if vals[0] != vals[1]:
                print(
                    "Weather station data can only be retrieved for {0}={1} \nYour input: {2} \nRetrieving data for {0}={1}".format(
                        key, vals[0], vals[1]
                    )
                )

        downscaling_method = "Dynamical"
        timescale = "hourly"
        variable = "Air Temperature at 2m"

        # Deal with scenario and time_slice arguments
        # Handle various use-cases of user inputs/errors
        if scenario is None:
            if time_slice is None:
                # Default
                scenario = ["Historical Climate"]
            else:
                scenario = []

        if resolution == "3 km":
            # Neither SSP 2-4.5 nor SSP 5-8.5 are valid options for scenario... need to remove
            for bad_scenario_choice in ["SSP 2-4.5", "SSP 5-8.5"]:
                if bad_scenario_choice in scenario:
                    error_message = f"{bad_scenario_choice} is not a valid scenario input for resolution = {resolution}"
                    print(_format_error_print_message(error_message))
                    return None
        if time_slice is not None:
            # Make sure time_slice and scenario match each other
            # If time_slice is not assigned by the user, it will be auto-set by the DataInterface object
            if any(value < 2015 for value in time_slice) and (
                ("Historical Climate") not in scenario
            ):
                # Add Historical Climate to scenario if the time scale includes historical period
                scenario.append("Historical Climate")
            if any(value >= 2015 for value in time_slice) and not any(
                "SSP" in item for item in scenario
            ):
                # If the time scale includes the future period and no SSP data is selected, add SSP 3-7.0
                scenario.append("SSP 3-7.0")

        if stations is None:
            # Warn the user that no station was specified;
            # the function defaults to retrieving all available stations
            print(
                "WARNING: You haven't set a particular station/s to retrieve data for; the function will default to retrieving all available stations in the domain"
            )
        if (stations is not None) and isinstance(stations, str):
            # Catch an easy user mistake without raising an error: a bare string
            # instead of a list, e.g. when retrieving data for a single station
            stations = [stations]

    # If lat/lon input, change cached_area and area_subset
    if (latitude is not None) and (longitude is not None):
        area_subset = "lat/lon"
        cached_area = ["coordinate selection"]

    # Check warming level inputs
    try:
        warming_level = _error_handling_warming_level_inputs(
            warming_level, "warming_level", downscaling_method, resolution
        )
        warming_level_months = _error_handling_warming_level_inputs(
            warming_level_months, "warming_level_months", downscaling_method, resolution
        )
    except ValueError as error_message:
        print(_format_error_print_message(error_message))
        return None

    # Make sure the inputs are a valid type (no floats, ints, dictionaries, etc)
    for user_input in [
        variable,
        downscaling_method,
        resolution,
        timescale,
        area_subset,
        area_average,
        approach,
        scenario,
    ]:
        if (user_input is not None) and not isinstance(user_input, (str, list)):
            error_message = (
                "Function arguments require a single string value for your inputs"
            )
            print(_format_error_print_message(error_message))
            return None

    # Maybe area average was capitalized wrong
    # Fix it instead of raising an error
    if area_average is not None:
        if area_average.lower().title() in ["Yes", "No"]:
            area_average = area_average.lower().title()

    # Cached area should be a list even if it's just a single string value (i.e. [str])
    cached_area = [cached_area] if not isinstance(cached_area, list) else cached_area

    # If all_touched is None set to False
    if all_touched is None:
        all_touched = False

    # Check that all_touched is a boolean
    if not isinstance(all_touched, bool):
        raise ValueError("all_touched must be a boolean")

    # Make sure approach matches the scenario setting
    # See function documentation for more details
    approach, scenario, warming_level, time_slice = _error_handling_approach_inputs(
        approach, scenario, warming_level, time_slice
    )

    # Make sure the area subset is set to a valid input
    # See function documentation for more details
    try:
        area_subset = _error_handling_location_settings(area_subset, cached_area)
    except ValueError as error_message:
        print(_format_error_print_message(error_message))
        return None

    ## --------- ADD ARGUMENTS TO A DICTIONARY ----------
    # A dictionary of inputs enables better error handling and cleaner code when
    # we later set selections.<attr> = <value>, and it makes parsing the
    # arguments easier. The inputs must be lists so that _check_if_good_input
    # can compare them against the valid catalog options.
    scenario_user_input = scenario  # What the user originally input for scenario

    check_input_df = get_data_options(
        variable=variable,
        downscaling_method=downscaling_method,
        resolution=resolution,
        timescale=timescale,
        scenario=scenario,
        tidy=False,
        enable_hidden_vars=enable_hidden_vars,
    )

    if check_input_df is None:
        # get_data_options prints its own error message for invalid inputs
        return None

    # Merge with variable dataframe to get all the info about the data in one place
    check_input_df = check_input_df.merge(var_df, how="left")

    # Convert to a dictionary so it can be easily parsed by the function
    cat_dict = check_input_df.to_dict(orient="list")
    for key, values in cat_dict.items():
        # Remove non-unique values
        # This happens because we converted a pandas dataframe to a dictionary
        cat_dict[key] = list(np.unique(values))

    # _check_if_good_input default-fills the scenario options with EVERY possible option:
    # usually all available SSPs plus both historical options (Historical Climate AND Historical Reconstruction)
    # The desired default is Historical Climate + SSPs
    # So, if the user input None for scenario, remove Historical Reconstruction from the list
    if scenario_user_input is None:
        if "Historical Reconstruction" in cat_dict["scenario"]:
            cat_dict["scenario"] = [
                item
                for item in cat_dict["scenario"]
                if item != "Historical Reconstruction"
            ]

    # Check if it's an index
    # Use proper variable_id lookup that considers downscaling method and timescale
    variable_ids = _get_var_ids(
        data_interface.variable_descriptions,
        cat_dict["variable"][0],
        cat_dict["downscaling_method"][0],
        cat_dict["timescale"][0],
        enable_hidden_vars=enable_hidden_vars,
    )
    variable_id = variable_ids[0] if variable_ids else ""
    variable_type = "Derived Index" if "_index" in variable_id else "Variable"

    # Settings for selections
    selections_dict = {
        "variable": cat_dict["variable"][0],
        "timescale": cat_dict["timescale"][0],
        "downscaling_method": cat_dict["downscaling_method"][0],
        "resolution": cat_dict["resolution"][0],
        "data_type": data_type,
        "scenario": cat_dict["scenario"],
        "area_average": area_average,
        "area_subset": area_subset,
        "cached_area": cached_area,
        "approach": approach,
        "warming_level": warming_level,
        "warming_level_window": warming_level_window,
        "warming_level_months": warming_level_months,
        "variable_type": variable_type,
        "time_slice": time_slice,
        "latitude": latitude,
        "longitude": longitude,
        "stations": stations,
        "all_touched": all_touched,
    }

    scenario_ssp, scenario_historical = _get_scenario_ssp_scenario_historical(
        selections_dict["approach"], selections_dict["scenario"]
    )
    selections_dict["scenario_ssp"] = scenario_ssp
    selections_dict["scenario_historical"] = scenario_historical

    ## ----- SET THE UNITS ------

    # Query the table based on input values
    # Timescale needs special handling due to the layout of the csv file:
    # monthly variables are derived from daily variables, so the table lists
    # them as "daily, monthly", hence the substring match on timescale below
    var_df_query = var_df[
        (var_df["display_name"] == selections_dict["variable"])
        & (var_df["downscaling_method"] == selections_dict["downscaling_method"])
    ]
    var_df_query = var_df_query[
        var_df_query["timescale"].str.contains(selections_dict["timescale"])
    ]

    selections_dict["units"] = (
        units if units is not None else var_df_query["unit"].item()
    )  # Set units if user doesn't set them manually

    ## ------ CREATE SELECTIONS OBJECT --------
    selections = DataParameters(enable_hidden_vars=enable_hidden_vars)

    # Error handling for stations
    # If the user input a value for the station argument, check that it exists
    # If it doesn't exist, see if you can find something close... if not, throw an error
    # Need to do the error handling here since it requires the selections object
    if data_type == "Stations" and stations is not None:
        stations = _check_valid_input_station(
            stations, selections._stations_gdf.station.values
        )

    ## ------- SET EACH ATTRIBUTE -------

    try:
        selections.data_type = selections_dict["data_type"]
        selections.approach = selections_dict["approach"]
        selections.scenario_ssp = selections_dict["scenario_ssp"]
        selections.scenario_historical = selections_dict["scenario_historical"]
        selections.area_subset = selections_dict["area_subset"]
        selections.cached_area = selections_dict["cached_area"]
        selections.downscaling_method = selections_dict["downscaling_method"]
        selections.resolution = selections_dict["resolution"]
        selections.timescale = selections_dict["timescale"]
        selections.variable_type = selections_dict["variable_type"]
        selections.variable = selections_dict["variable"]
        selections.units = selections_dict["units"]
        selections.all_touched = selections_dict["all_touched"]

        # Setting the values like this enables us to take advantage of the default settings in DataParameters without having to manually set defaults in this function
        if selections_dict["warming_level"] is not None:
            selections.warming_level = selections_dict["warming_level"]
        if selections_dict["warming_level_window"] is not None:
            selections.warming_level_window = selections_dict["warming_level_window"]
        if selections_dict["area_average"] is not None:
            selections.area_average = selections_dict["area_average"]
        if selections_dict["time_slice"] is not None:
            selections.time_slice = selections_dict["time_slice"]
        if selections_dict["warming_level_months"] is not None:
            selections.warming_level_months = selections_dict["warming_level_months"]
        if selections_dict["latitude"] is not None:
            selections.latitude = selections_dict["latitude"]
        if selections_dict["longitude"] is not None:
            selections.longitude = selections_dict["longitude"]
        if selections_dict["stations"] is not None:
            selections.stations = selections_dict["stations"]

        for key in kwargs:
            if getattr(selections, key, None) is not None:
                setattr(selections, key, kwargs[key])

        # Force update variable_id after all attributes are set
        # This ensures hidden variables work correctly
        selections.variable_id = _get_var_ids(
            data_interface.variable_descriptions,
            selections.variable,
            selections.downscaling_method,
            selections.timescale,
            enable_hidden_vars=enable_hidden_vars,
        )

    except ValueError as error_message:
        # The default error message is very long and sometimes includes a
        # confusing "AttributeError: Pieces mismatch" that is hard to interpret
        # Print a concise message and return None instead of raising
        print(_format_error_print_message(error_message))
        return None

    # Retrieve data
    data = selections.retrieve()
    return data
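The nested helper at the top of the listing splits the user's scenario list into SSP and historical components. A minimal standalone sketch of that logic (the function name `split_scenarios` is ours, for illustration only; it is not part of climakitae):

```python
def split_scenarios(approach, scenario):
    """Split a scenario list into (scenario_ssp, scenario_historical),
    mirroring get_data's nested _get_scenario_ssp_scenario_historical helper."""
    if approach == "Warming Level":
        # Warming-level queries ignore the scenario selection entirely
        return ["n/a"], ["n/a"]
    if approach == "Time":
        if "Historical Reconstruction" in scenario:
            # Reconstruction data cannot be mixed with SSP scenarios
            return [], [x for x in scenario if "Historical" in x]
        ssp = [x for x in scenario if "Historical" not in x]
        hist = ["Historical Climate"] if "Historical Climate" in scenario else []
        return ssp, hist
    return None, None  # unknown approach

print(split_scenarios("Time", ["Historical Climate", "SSP 3-7.0"]))
# -> (['SSP 3-7.0'], ['Historical Climate'])
```

Note that a "Warming Level" approach short-circuits the split entirely, which is why `get_data` can ignore `scenario` when a warming level is supplied.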

Migration Note

For new code, use the modern climakitae.new_core interface. See the migration guide for detailed upgrade instructions.

Quick Example

Legacy (old):

from climakitae.core.data_interface import get_data, DataParameters

# Option 1: pass keyword arguments directly to get_data()
data = get_data(
    variable="Maximum air temperature at 2m",
    time_slice=(2015, 2045),            # year-range tuple
    downscaling_method="Statistical",   # ≈ LOCA2
    resolution="3 km",                  # ≈ grid_label d03
    timescale="monthly",                # ≈ table_id "mon"
)

# Option 2: configure a DataParameters object, then retrieve
params = DataParameters()
params.variable = "Maximum air temperature at 2m"
params.time_slice = (2015, 2045)
params.downscaling_method = "Statistical"
params.resolution = "3 km"
params.timescale = "monthly"
data = params.retrieve()

Modern (new):

from climakitae.new_core.user_interface import ClimateData

data = (ClimateData()
    .variable("tasmax")
    .processes({"time_slice": (2015, 2045)})
    .get())
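As the `get_data` source above shows, the legacy interface also reconciles `time_slice` with `scenario`: any year before 2015 pulls in "Historical Climate", and any year from 2015 onward pulls in an SSP (defaulting to "SSP 3-7.0"). A standalone sketch of that reconciliation (the helper name `reconcile_scenarios` is hypothetical, not part of climakitae):

```python
def reconcile_scenarios(scenario, time_slice):
    """Mirror get_data's station-data logic: a pre-2015 year implies
    Historical Climate, a 2015+ year implies at least one SSP."""
    scenario = list(scenario)  # avoid mutating the caller's list
    if time_slice is None:
        return scenario
    if any(year < 2015 for year in time_slice) and "Historical Climate" not in scenario:
        scenario.append("Historical Climate")
    if any(year >= 2015 for year in time_slice) and not any("SSP" in s for s in scenario):
        scenario.append("SSP 3-7.0")
    return scenario

print(reconcile_scenarios([], (2000, 2050)))
# -> ['Historical Climate', 'SSP 3-7.0']
```

This is why a query spanning 2000–2050 silently retrieves both historical and SSP data even when the caller supplies no scenario at all.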