Custom metrics#
Custom metrics, such as for land cover type or mosquito incidence can be added to the pipeline. For a fully custom configuration, it is recommended to work directly with the underlying geoglue library, for which we provide an example of performing zonal aggregation in its online documentation.
Metrics can also be integrated into the DART-Pipeline command line interface,
such that they can be processed using uv run dart-pipeline as described in
the Metrics page. This can be done by cloning the DART-Pipeline
repository and placing a new
file under src/dart_pipeline/metrics that follows a certain format. Metric
processing code in the DART-Pipeline is divided into three parts: declaring the metric metadata, fetching and
processing.
Declaring metadata#
Each Python code file under src/dart_pipeline/metrics/ represents a single data source. If your code is complex, it is recommended to create a snuffled under src/dart_pipeline/metrics.
Metadata is declared for a source (which can comprise multiple metrics) by calling the register_metrics() function defined in the dart_pipeline.metrics submodule.
Each metric has associated metadata, starting with the name, which should have
lowercase characters, with underscore (_) separating words. The other
metadata are as follows. CF Convention-compliant
attributes that are saved as part of the netCDF file are indicated as [CF]
- url
URL of data source
- long_name
Description of the metric [CF]
- units
Unit of the metric [CF], should correspond to UDUNITS which is recommended by the CF Conventions
- valid_min
Minimum value of the metric, values below this will be considered as NA [CF]
- valid_max
Maximum value of the metric, values above this will be considered as NA [CF]
- license
License of the underlying data, should follow the list of SPDX licenses
- citation
Free text describing required citation information for using the underlying data
You can see an example
registration
for the worldpop.pop_count data source:
register_metrics(
"worldpop",
description="WorldPop population data",
metrics={
"pop_count": {
"url": "https://hub.worldpop.org/geodata/listing?id=75",
"long_name": "WorldPop population count",
"units": "unitless",
"license": "CC-BY-4.0",
# We produce outputs at minimum admin1 resolution, unlikely
# that any administrative area will have population greater than this
"valid_min": 0,
"citation": """
WorldPop (www.worldpop.org - School of Geography and Environmental
Science, University of Southampton; Department of Geography and
Geosciences, University of Louisville; Departement de Geographie,
Universite de Namur) and Center for International Earth Science
Information Network (CIESIN), Columbia University (2018). Global
High Resolution Population Denominators Project - Funded by The
Bill and Melinda Gates Foundation (OPP1134076).
https://dx.doi.org/10.5258/SOTON/WP00671
""",
},
},
)
Fetching data#
To fetch data, define a function and decorate it with
@register_fetch("source.metric_name"). The first parameter of the function
must be of a geoglue.region
type
(BaseRegion, ZonedBaseRegion, BaseCountry, Region, AdministrativeLevel, or
CountryAdministrativeLevel). If the function takes a temporal scope (year,
year-month, date or year range), the second parameter must be named date
to work with the DART CLI. The rest of the parameter signature can be custom
parameters, which can be passed through the CLI as param=value. For example, the fetcher for worldpop.pop_count is defined as follows
@register_fetch("worldpop.pop_count")
def worldpop_pop_count_fetch(region: BaseCountry, date: str) -> Literal[False]:
The return type can be (for type definitions, see src/dart_pipeline/types.py)
Literal[False]: The function directly downloads data. This skips any automated processing that DART-Pipeline may perform after fetching dataDataFile: The function downloads into a temporary buffer, which is then processed by DART to save the file to diskURLCollection: A type that comprises a base URL and a list of files, with optional attributes indicating whether ZIP files should be extracted. If the fetcher function returns this type, then DART-Pipeline automatically fetches the data and extracts it to a standardised location, saving you the effort of writing boilerplate downloading code.
Processing data#
Similarly to fetching data, the first parameter of the processing function must be a geoglue.region type, and the second parameter should be a date, if the processing requires a temporal scope. The return type must be one of these types:
pd.DataFrame: The function can return a pandas DataFrame. The DataFrame must contain aDART_REGIONcolumn containing the region name, and a dataframe attributeadminrepresenting the administrative level. This is used by DART to save the file to the appropriate output location with a standardised file name.xr.DataArray|xr.Dataset: The function returns an annotated xarray Dataset or DataArray. We recommend using thezonal_stats_xarray()function indart_pipeline.utilwhich can perform zonal statistics for a given source raster and geoglue.region type, while adding appropriate metadata. If writing your own function for zonal aggregation, it is recommended to add CF-compliant attributes and theDART_regionattribute that stores a string representation of the geoglue.region type passed to the processing function.
The function should be decorated using the register_process() decorator, as
seen in this snippet for the worldpop.pop_count metric:
@register_process("worldpop.pop_count")
def worldpop_pop_count_process(region: CountryAdministrativeLevel, date: str) -> xr.DataArray: