Processor: Clip
Registry key: clip | Priority: 65 | Category: Spatial Processing
Subset climate data to specific geographic regions, points, or boundaries. Extract data for counties, watersheds, weather stations, or custom lat/lon coordinates with automatic nearest-gridcell location and coordinate system handling.
Algorithm
Clip runs in two phases: first it parses self.value into a geometry (or routes to a point-based path), then it dispatches over the input data type and calls the appropriate clipper.
flowchart TD
Init([__init__: parse value, detect mode]) --> Init2["Set flags:<br/>is_single_point / is_multi_point /<br/>separated / extract_points / persist"]
Init2 --> Start([execute])
Start --> MatchValue{match self.value}
MatchValue -->|str| StrCheck{Station id?}
StrCheck -->|Yes| StationCoords["_get_station_coordinates<br/>(sets is_station, is_single_point)"]
StrCheck -->|No| PathCheck{File path exists?}
PathCheck -->|Yes| ReadFile["gpd.read_file(value)"]
PathCheck -->|No| BoundaryLookup["_get_boundary_geometry"]
MatchValue -->|list| ListCheck{All station ids?}
ListCheck -->|Yes| MultiStation["_convert_stations_to_points"]
ListCheck -->|No — lat/lon tuples| MultiPoint["is_multi_point<br/>(use point_list)"]
ListCheck -->|No — separated boundaries| Sep["separated path<br/>(per-boundary loop)"]
ListCheck -->|No — union| MultiBoundary["_get_multi_boundary_geometry<br/>(union)"]
MatchValue -->|tuple len 2 of floats| SinglePt["is_single_point<br/>(lat, lon)"]
MatchValue -->|tuple of tuples| BBox["shapely.box → GeoDataFrame<br/>(EPSG:4326)"]
MatchValue -->|other| RaiseValue["raise ValueError"]
StationCoords --> DispatchResult
ReadFile --> DispatchResult
BoundaryLookup --> DispatchResult
MultiStation --> DispatchResult
MultiPoint --> DispatchResult
Sep --> DispatchResult
MultiBoundary --> DispatchResult
SinglePt --> DispatchResult
BBox --> DispatchResult
DispatchResult{match result}
DispatchResult -->|dict| LoopDict["For each (key, value):<br/>route by mode flag"]
DispatchResult -->|Dataset / DataArray| RouteSingle["Route by mode flag"]
DispatchResult -->|list / tuple| LoopList["Per-item route, preserve container type"]
LoopDict --> ChooseClipper
RouteSingle --> ChooseClipper
LoopList --> ChooseClipper
ChooseClipper{Mode flag}
ChooseClipper -->|is_single_point| ClipPoint["_clip_data_to_point<br/>(closest cell, fallback to 3x3 neighborhood mean)"]
ChooseClipper -->|is_multi_point| ClipMaskPts["_clip_data_to_points_as_mask<br/>(mask or extract along 'points' dim)"]
ChooseClipper -->|separated| ClipSep["_clip_data_separated<br/>(one geom per boundary)"]
ChooseClipper -->|geom set| ClipGeom["_clip_data_with_geom<br/>(rio.clip)"]
ClipPoint --> CheckPersist{persist?}
ClipMaskPts --> CheckPersist
ClipSep --> CheckPersist
ClipGeom --> CheckPersist
CheckPersist -->|Yes| Compute[".compute() to collapse Dask graph"]
CheckPersist -->|No| UpdateCtx
Compute --> UpdateCtx["update_context"]
UpdateCtx --> End([Output: clipped data, same container as input])
click Init "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L186" "__init__"
click Start "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L294" "execute"
click MatchValue "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L303" "First match: parse self.value"
click StationCoords "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L541" "_get_station_coordinates"
click ReadFile "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L318" "gpd.read_file"
click BoundaryLookup "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L1443" "_get_boundary_geometry"
click MultiStation "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L570" "_convert_stations_to_points"
click MultiBoundary "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L1683" "_get_multi_boundary_geometry (union)"
click BBox "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L355" "shapely.box → GeoDataFrame"
click DispatchResult "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L386" "Second match: dispatch by result type"
click ClipPoint "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L665" "_clip_data_to_point"
click ClipMaskPts "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L1061" "_clip_data_to_points_as_mask"
click ClipSep "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L1348" "_clip_data_separated"
click ClipGeom "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L595" "_clip_data_with_geom"
click UpdateCtx "https://github.com/cal-adapt/climakitae/blob/main/climakitae/new_core/processors/clip.py#L479" "update_context"
Input Modes
Mode 1: Named Boundaries
Clip using predefined administrative or utility boundaries from the Cal-Adapt boundary catalog.
Examples:
"Los Angeles" # County name
"San Francisco Bay" # Watershed
"CA" # State-wide
"CA_IOU" # Utility (IOU = Investor-Owned Utility)
Mode 2: Weather Stations
Clip to specific weather station locations from the HadISD station network.
Examples:
"KSAC" # Sacramento International Airport
"KSFO" # San Francisco International Airport
"KLAX" # Los Angeles International Airport
Mode 3: Single Point (Closest Valid Gridcell)
Extract data for a single geographic point. The processor first selects the geographically closest gridcell; if that cell is all-NaN (common in WRF data near coastlines/mountains/masked regions) it falls back to a 3×3 index-space neighborhood around the nearest cell and returns the mean of the valid (non-NaN) cells in that neighborhood.
Algorithm (_clip_data_to_point, line 665):
- Call get_closest_gridcell(dataset, lat, lon) and check if it has valid data on a sample slice.
- If valid, return that cell.
- If NaN, find the nearest grid index (idx1, idx2) along the spatial dims (x/y for WRF, lat/lon for LOCA2). For projected grids, lat/lon are first transformed to the dataset's CRS via pyproj.Transformer.
- Iterate the 3×3 neighborhood (di, dj) ∈ {-1, 0, 1}², skipping out-of-bounds and all-NaN cells.
- If any valid neighbors exist, xr.concat(neighbors, dim="nearest_cell").mean(dim="nearest_cell") and reassign mean lat/lon coords; otherwise return None.
- Fallback paths handle averaging failures by returning the center cell directly.
The earlier docs described an expanding-radius search (0.01° → 0.05° → ... → 0.5°). That was never the implementation; the current behavior is the index-space 3×3 mean above.
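The fallback above can be sketched on a plain 2-D grid. This is a minimal illustration under simplifying assumptions, not the processor's code: nested lists and math.isnan stand in for xarray, each cell holds one value instead of a time series, and the name neighborhood_mean is hypothetical.

```python
import math


def neighborhood_mean(grid, i, j):
    """Return grid[i][j] if valid, else the mean of the valid cells in the
    3x3 index-space neighborhood around (i, j), or None if none are valid."""
    center = grid[i][j]
    if not math.isnan(center):
        return center  # closest cell has data; no fallback needed

    valid = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            ni, nj = i + di, j + dj
            # Skip neighbors that fall off the edge of the grid.
            if not (0 <= ni < len(grid) and 0 <= nj < len(grid[0])):
                continue
            value = grid[ni][nj]
            if not math.isnan(value):
                valid.append(value)

    return sum(valid) / len(valid) if valid else None


nan = float("nan")
grid = [
    [nan, 2.0, 4.0],
    [nan, nan, 6.0],
    [nan, nan, nan],
]
print(neighborhood_mean(grid, 1, 1))  # mean of the valid neighbors 2.0, 4.0, 6.0
print(neighborhood_mean(grid, 2, 0))  # None: every cell in reach is NaN
```

Because the neighborhood is defined in index space, the same loop applies to projected (x/y) and geographic (lat/lon) grids once the nearest index is known.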
Example:
(37.7749, -122.4194) # (lat, lon) for San Francisco
Mode 4: Bounding Box
Clip data to a rectangular geographic region specified by latitude and longitude ranges.
Example:
((37.5, 38.5), (-123.0, -121.5)) # ((lat_min, lat_max), (lon_min, lon_max)), rough Bay Area box
Mode 5: Custom File
Clip using geometry from a shapefile or GeoJSON file.
Example:
Multi-Input Handling
Multiple Boundaries
When providing a list of boundary names, they are combined using union (OR logic):
["Alameda", "Contra Costa"]
# Returns: single Dataset clipped to the union of both counties
With separation:
{"boundaries": ["Alameda", "Contra Costa"], "separated": True}
# Returns: Dict with separate Dataset for each county
Multiple Points
When providing multiple point coordinates:
- Each point gets independent smart-gridcell search
- Results include closest_cell dimension with length = number of points
- Automatic duplicate filtering if multiple points map to same gridcell
Example:
[(37.7749, -122.4194), (34.0522, -118.2437), (32.7157, -117.1611)]
# Returns: Dataset with closest_cell dimension = 3 (SF, LA, San Diego)
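The duplicate filtering described above can be illustrated with a small sketch. It assumes a regular lat/lon grid given as two 1-D coordinate lists (a projected WRF grid would need a different lookup), and the helper names nearest_index and unique_gridcells are hypothetical, not part of the processor.

```python
def nearest_index(coords, value):
    """Index of the coordinate closest to value (1-D nearest-neighbor lookup)."""
    return min(range(len(coords)), key=lambda k: abs(coords[k] - value))


def unique_gridcells(points, lats, lons):
    """Map (lat, lon) points to grid indices, dropping repeats of the same cell."""
    seen = set()
    cells = []
    for lat, lon in points:
        cell = (nearest_index(lats, lat), nearest_index(lons, lon))
        if cell not in seen:  # keep only the first point that lands in each cell
            seen.add(cell)
            cells.append(cell)
    return cells


# Coarse illustrative grid; the first two points share a gridcell.
lats = [32.0, 34.0, 36.0, 38.0]
lons = [-124.0, -122.0, -120.0, -118.0]
points = [(37.7749, -122.4194), (37.80, -122.27), (34.0522, -118.2437)]
print(unique_gridcells(points, lats, lons))  # [(3, 1), (1, 3)]
```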
Spatial Processing Details
Coordinate System Handling
- Data CRS detection (_clip_data_with_geom, line 595): if data.rio.crs is unset, the processor detects WRF data by the presence of a Lambert_Conformal coordinate and writes the CRS from spatial_ref or CF-convention attributes; otherwise it assumes EPSG:4326 (LOCA2 lat/lon).
- Boundary CRS: boundaries are assumed to be EPSG:4326 (a warning is logged if no CRS is set).
- Reprojection direction: when data.rio.crs != gdf.crs, the GeoDataFrame is reprojected to the data's CRS; the data stays in its native projection (i.e., WRF output remains in Lambert Conformal).
Masking Strategy
Geometry-based clipping uses rioxarray:
- all_touched=True includes cells that any part of the geometry touches.
- drop=True trims the bounding box to the clipped extent.
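To show why all_touched=True matters for small geometries, here is a minimal sketch that compares "any overlap" selection with "cell center inside" selection. It is an illustration only: it assumes unit-sized cells and an axis-aligned rectangle in place of a real polygon, and cells_in_rect is a hypothetical helper, not a rioxarray function.

```python
def cells_in_rect(rect, nx, ny, all_touched):
    """Select unit grid cells against an axis-aligned rectangle.

    rect is (xmin, ymin, xmax, ymax); cell (i, j) covers [i, i+1] x [j, j+1].
    """
    xmin, ymin, xmax, ymax = rect
    selected = []
    for j in range(ny):
        for i in range(nx):
            if all_touched:
                # Keep the cell if any part of it overlaps the rectangle.
                keep = i < xmax and i + 1 > xmin and j < ymax and j + 1 > ymin
            else:
                # Keep the cell only if its center lies inside the rectangle.
                cx, cy = i + 0.5, j + 0.5
                keep = xmin <= cx <= xmax and ymin <= cy <= ymax
            if keep:
                selected.append((i, j))
    return selected


rect = (0.6, 0.6, 2.2, 1.6)
print(cells_in_rect(rect, 4, 3, all_touched=True))   # 6 cells overlap the rectangle
print(cells_in_rect(rect, 4, 3, all_touched=False))  # only cell (1, 1) has its center inside
```

On a coarse grid, the center-only rule can drop most of a small boundary, which is the case all_touched=True guards against.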
For multi-point clipping, _clip_data_to_points_as_mask either applies a points-mask in place or extracts along a new points dimension when extract_points=True (set by {"points": [...], "separated": True}).
Persisting to Memory
When persist=True (constructor arg or persist key in dict input), the processor calls .compute() on the result before returning. This collapses the Dask task graph and is recommended for large multi-point clipping followed by 1-in-X analysis or other graph-heavy operations.
Boundary Catalog Access
- Boundaries loaded lazily via _get_boundary_geometry (line 1443) / _get_multi_boundary_geometry (line 1683).
- Sourced from S3 intake-esm catalog and cached for the process lifetime.
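The lazy-load-then-cache behavior can be pictured with a short sketch. It only illustrates the pattern: load_boundary_table is a hypothetical stand-in for the catalog read, and using functools.lru_cache here is an assumption, not a statement about how the processor caches.

```python
from functools import lru_cache

load_calls = []  # records each real load so the caching is observable


@lru_cache(maxsize=None)
def load_boundary_table(name):
    """Hypothetical loader standing in for one boundary-table read from the catalog."""
    load_calls.append(name)
    return {"table": name, "rows": []}  # placeholder payload


first = load_boundary_table("counties")   # first request triggers the load
second = load_boundary_table("counties")  # repeat request is served from the cache
print(first is second, load_calls)  # True ['counties']
```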
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| value | str / list / tuple / dict | ✓ | - | Geometry specification (modes 1–5 above). For dict input, must contain boundaries or points. |
| separated | bool (in dict input) | | False | For boundaries: keep each as its own dataset entry. For points: extract along a new points dimension. |
| persist | bool (in dict input or constructor) | | False | Call .compute() after clipping to collapse the Dask task graph. Recommended for large multi-point workflows. |
Code References
| Method | Lines | Purpose |
|---|---|---|
| __init__ | 186–291 | Parse value, set mode flags (is_single_point, is_multi_point, separated, extract_points, persist) |
| execute | 294–477 | Two-phase: parse value → geom (line 303), then dispatch over result type (line 386); optional .compute() if persist |
| update_context | 479–535 | Record clipped-region metadata under new_attrs["clip"] |
| set_data_accessor | 537–539 | Receive DataCatalog reference for boundary/station lookups |
| _get_station_coordinates | 541–568 | HadISD station code → (lat, lon, metadata) |
| _convert_stations_to_points | 570–593 | Multi-station list → point_list + metadata list |
| _clip_data_with_geom | 595–663 | CRS detection + rio.clip(all_touched=True, drop=True) |
| _clip_data_to_point | 665–878 | Closest cell, fallback to 3×3 index-space neighborhood mean |
| _clip_data_to_multiple_points | 880–961 | Vectorized multi-point selection |
| _clip_data_to_multiple_points_fallback | 963–1059 | Fallback path used when vectorized lookup fails |
| _clip_data_to_points_as_mask | 1061–1346 | Mask-based multi-point clip; supports extract_points along new points dim |
| _clip_data_separated | 1348–1441 | Per-boundary clipping returning dict/list keyed by boundary |
| _get_boundary_geometry | 1443–1503 | Single boundary key → GeoDataFrame |
| _get_multi_boundary_geometry | 1683–1749 | Multi-boundary union via _combine_geometries |
| _combine_geometries | 1751– | Geometry union helper |
Examples
Single County
from climakitae.new_core.user_interface import ClimateData

data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .variable("t2max")
    .table_id("day")
    .grid_label("d03")
    .processes({
        "clip": "Alameda"
    })
    .get())
Multiple Counties (Separated)
# Get each county in a separate dataset
data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .variable("pr")
    .table_id("mon")
    .grid_label("d02")
    .processes({
        "clip": {
            "boundaries": ["Alameda", "Contra Costa", "Santa Clara"],
            "separated": True
        }
    })
    .get())
# data is dict: {"Alameda": ds1, "Contra Costa": ds2, "Santa Clara": ds3}
Single Lat/Lon Point
# Closest grid cell to San Francisco
data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .variable("t2max")
    .table_id("day")
    .grid_label("d03")
    .processes({
        "clip": (37.7749, -122.4194)
    })
    .get())
# Scalar lat/lon coordinates (size 1)
Multiple Points (Separated)
# Time series for 3 cities
locations = [
    (34.05, -118.25),  # Los Angeles
    (37.77, -122.42),  # San Francisco
    (32.72, -117.16),  # San Diego
]
data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .variable("t2max")
    .table_id("day")
    .grid_label("d03")
    .processes({
        "clip": {
            "points": locations,  # point lists go under "points", not "boundaries"
            "separated": True,
            "location_based_naming": True
        }
    })
    .get())
# With "separated": True, the points are extracted along a new points dimension
Bounding Box
# Bay Area region (rough bbox): ((lat_min, lat_max), (lon_min, lon_max))
data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .variable("pr")
    .table_id("mon")
    .grid_label("d03")
    .processes({
        "clip": ((37.5, 38.5), (-123.0, -121.5))
    })
    .get())
Weather Station
# Model data at the gridcell nearest the Sacramento airport station (KSAC)
data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .variable("t2max")
    .table_id("day")
    .grid_label("d03")
    .processes({
        "clip": "KSAC"
    })
    .get())
Chained: Clip → Warming Level → Export
data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .experiment_id("ssp245")
    .variable("t2max")
    .table_id("day")
    .grid_label("d03")
    .processes({
        "clip": "Los Angeles",
        "warming_level": {"warming_levels": [1.5, 2.0, 3.0]},
        "export": {
            "filename": "la_warming",
            "file_format": "NetCDF"
        }
    })
    .get())
Implementation Details
Geometry Loading
Clip resolves self.value to a geometry (or to a point-mode flag) in this order:
- String → station id (is_station_identifier, line 305) → file path (os.path.exists, line 318) → boundary key (_get_boundary_geometry, line 320).
- List → all-station list (line 324) → lat/lon tuples (is_multi_point, line 340) → separated boundaries (line 343) → union of boundaries (_get_multi_boundary_geometry, line 345).
- Tuple → single (lat, lon) (line 347) or bbox ((lat_min, lat_max), (lon_min, lon_max)) (line 355, builds shapely.box in EPSG:4326).
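The resolution order above can be sketched as a standalone classifier. This is a simplified illustration, not the processor's code: classify_clip_value and bbox_to_bounds are hypothetical names, station_ids is an assumed stand-in for the station lookup, and isinstance checks replace the match statement so the sketch runs on older Python versions.

```python
import os


def classify_clip_value(value, station_ids=frozenset(), separated=False):
    """Return a label for the path a clip value would take."""
    if isinstance(value, str):
        if value in station_ids:
            return "station"
        if os.path.exists(value):
            return "file"
        return "boundary"  # anything else is treated as a boundary key
    if isinstance(value, list):
        if value and all(isinstance(v, str) and v in station_ids for v in value):
            return "multi_station"
        if value and all(isinstance(v, tuple) for v in value):
            return "multi_point"
        return "separated_boundaries" if separated else "boundary_union"
    if isinstance(value, tuple) and len(value) == 2:
        if all(isinstance(v, (int, float)) for v in value):
            return "single_point"
        if all(isinstance(v, tuple) and len(v) == 2 for v in value):
            return "bbox"
    raise ValueError(f"Unsupported clip value: {value!r}")


def bbox_to_bounds(bbox):
    """Reorder ((lat_min, lat_max), (lon_min, lon_max)) to (minx, miny, maxx, maxy),
    the argument order shapely.box expects (x is longitude, y is latitude)."""
    (lat_min, lat_max), (lon_min, lon_max) = bbox
    return (lon_min, lat_min, lon_max, lat_max)


stations = {"KSAC", "KSFO"}
print(classify_clip_value("KSAC", stations))                    # station
print(classify_clip_value((37.7749, -122.4194)))                # single_point
print(bbox_to_bounds(((37.5, 38.5), (-123.0, -121.5))))         # (-123.0, 37.5, -121.5, 38.5)
```

The bbox helper highlights the one ordering trap in this step: the input is given lat-first, while the box geometry is built lon-first.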
Result Dispatch
The second match (line 386) routes the input data:
- dict: per-key clip; mode flag selects the clipper.
- Dataset / DataArray: single clip via the matching mode’s clipper.
- list / tuple: per-item clip, container type preserved; None results filtered out for point modes.
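The container handling above can be sketched as follows. This is a minimal illustration under assumptions: dispatch_clip is a hypothetical name, clipper stands for whichever mode-specific clipper was selected, and plain numbers stand in for Dataset / DataArray objects.

```python
def dispatch_clip(result, clipper):
    """Apply clipper to each data object, preserving the input container type."""
    if isinstance(result, dict):
        return {key: clipper(item) for key, item in result.items()}
    if isinstance(result, (list, tuple)):
        clipped = [clipper(item) for item in result]
        # Point-mode clippers may return None; drop those entries.
        clipped = [item for item in clipped if item is not None]
        return type(result)(clipped)  # list stays list, tuple stays tuple
    return clipper(result)  # single Dataset / DataArray


def toy_clipper(x):
    """Stand-in clipper: doubles valid inputs and returns None for negatives."""
    return None if x < 0 else x * 2


print(dispatch_clip({"a": 1, "b": 2}, toy_clipper))  # {'a': 2, 'b': 4}
print(dispatch_clip((1, -1, 3), toy_clipper))        # (2, 6)
print(dispatch_clip(5, toy_clipper))                 # 10
```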
Persist (persist=True)
After clipping, if self.persist is true the processor calls .compute() on the result (per-value for dicts, on the whole object otherwise). This collapses very large Dask task graphs that arise from multi-point clipping followed by quantile or block operations and prevents OOMs in downstream steps.
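The per-value versus whole-object distinction can be sketched without Dask. This is an illustration only: LazyValue is a hypothetical stand-in that models nothing but a .compute() method, and maybe_compute is not a function in the processor.

```python
class LazyValue:
    """Stand-in for a Dask-backed xarray object; only .compute() is modeled."""

    def __init__(self, value):
        self.value = value

    def compute(self):
        return self.value  # a real object would execute its task graph here


def maybe_compute(result, persist):
    """Materialize the result when persist is true; otherwise return it untouched."""
    if not persist:
        return result
    if isinstance(result, dict):
        # Dict results are computed one value at a time.
        return {key: item.compute() for key, item in result.items()}
    return result.compute()


print(maybe_compute({"Alameda": LazyValue(1), "Contra Costa": LazyValue(2)}, persist=True))
print(maybe_compute(LazyValue(3), persist=True))
```

Note that computing loads the clipped result into memory, so this trade only pays off when the clipped data is much smaller than the task graph it replaces.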
Error Handling
- Invalid station code, missing boundary key, or unsupported value type → ValueError.
- All-NaN single point with no valid 3×3 neighbors → _clip_data_to_point returns None; the per-item paths filter these out.
- Failed CRS detection on WRF data missing required CF attributes → ValueError with the missing key.
Common Patterns
County Loop
from climakitae.new_core.user_interface import ClimateData

counties = ["Alameda", "Contra Costa", "Santa Clara", "San Mateo"]
data_by_county = {}
for county in counties:
    # One independent query per county, keyed by county name
    data_by_county[county] = (ClimateData()
        .catalog("cadcat")
        .activity_id("WRF")
        .variable("t2max")
        .table_id("day")
        .grid_label("d03")
        .processes({"clip": county})
        .get())
Urban Heat Island Study
# Urban and rural points for comparison
urban_point = (37.7749, -122.4194)  # San Francisco downtown
rural_point = (37.5, -122.0)  # Southern East Bay near Fremont; substitute a rural site for a real study
data = (ClimateData()
    .catalog("cadcat")
    .activity_id("WRF")
    .variable("t2max")
    .table_id("day")
    .grid_label("d03")
    .processes({
        "clip": {
            "points": [urban_point, rural_point],  # point lists go under "points"
            "separated": True
        }
    })
    .get())
# Compare the two locations along the points dimension (urban first, rural second)