How Metrics Are Computed

Scorecard metrics compare forecast values against surface observations at each station and lead time. All metrics are computed over daily lead-time bins and averaged across initialization times within the selected window (90 or 180 days).

Deterministic metrics (RMSE, MAE, Bias, ETS, Frequency Bias, HSS, FSS) are applied to the ensemble mean for ensemble models, or directly for deterministic models. CRPS is the only metric that uses individual ensemble members; for deterministic models it reduces to MAE.

For precipitation, both forecast and observed values are first resampled to 6-hour averages before any metric is computed.

RMSE (Root Mean Square Error)

Square root of the mean of squared differences between forecast and observed values. RMSE penalizes large errors more heavily than MAE. Units match the variable. A perfect forecast has RMSE = 0. Applied to temperature only.

def rmse(forecast, observed):
    se = (forecast - observed) ** 2
    daily_mean = se.resample(lead_time="1d").mean().mean(dim="init_time")
    return np.sqrt(daily_mean)

MAE (Mean Absolute Error)

Mean of the absolute differences between forecast and observed values. MAE treats all error magnitudes equally. Units match the variable. A perfect forecast has MAE = 0.

def mae(forecast, observed):
    ae = abs(forecast - observed)
    return ae.resample(lead_time="1d").mean().mean(dim="init_time")

Bias

Mean signed difference (forecast minus observed). Positive bias means the model over-predicts; negative means it under-predicts. Units match the variable. A perfect forecast has Bias = 0.

def bias(forecast, observed):
    err = forecast - observed
    return err.resample(lead_time="1d").mean().mean(dim="init_time")

CRPS (Continuous Ranked Probability Score)

CRPS measures how well the full forecast probability distribution matches the observation. It decomposes into an accuracy term (mean absolute error of each ensemble member against the observation) minus a spread term (mean absolute difference between ensemble member pairs), rewarding ensembles that are both accurate and well-calibrated. For deterministic forecasts, CRPS reduces to MAE. Units match the variable. A perfect forecast has CRPS = 0.

The spread term uses an efficient O(n log n) rank-weighted formulation rather than computing all O(n²) pairwise differences.

def crps_ensemble(forecast, observed):
    # For deterministic forecasts, reduces to MAE
    if "ensemble_member" not in forecast.dims:
        return mae(forecast, observed)

    n_ens = forecast.sizes["ensemble_member"]

    # Term 1: E|X - y| — mean absolute error of each member vs obs
    abs_err = abs(forecast - observed).mean(dim="ensemble_member")

    # Term 2: E|X - X'| — mean absolute spread between all member pairs
    # Efficient O(n log n) form using sorted ensemble members:
    # E|X-X'| = (2 / n²) * sum_i (2i - n - 1) * x_{(i)}
    sorted_fc = forecast.copy(data=np.sort(forecast.values, axis=...))
    ranks = np.arange(1, n_ens + 1)
    weights = 2.0 * ranks - n_ens - 1
    spread = (2.0 / (n_ens ** 2)) * (sorted_fc * weights).sum(dim="ensemble_member")

    crps = abs_err - 0.5 * spread
    return crps.resample(lead_time="1d").mean().mean(dim="init_time")

ETS (Equitable Threat Score)

Measures skill at predicting precipitation occurrence after removing hits expected by chance. A precipitation event is defined as a rate at or above 0.1 mm/h. ETS ranges from −1/3 to 1, where 1 is a perfect forecast, 0 indicates no skill beyond random chance, and negative values indicate worse-than-random performance. Applied to precipitation only.

ETS is computed from a contingency table of hits (H), misses (M), false alarms (F), and correct negatives (CN):

PRECIP_THRESHOLD_MM_S = 0.1 / 3600.0  # 0.1 mm/h in model units (mm/s)

def ets(forecast, observed):
    h, m, f, cn = _contingency(forecast, observed, PRECIP_THRESHOLD_MM_S)
    # Sum counts over daily lead-time bins and init_times
    h, m, f, cn = [_daily_sum(x) for x in (h, m, f, cn)]
    n = h + m + f + cn
    h_random = (h + m) * (h + f) / n
    return (h - h_random) / (h + m + f - h_random)

Frequency Bias

Ratio of the number of forecast precipitation events to the number of observed precipitation events, using the same 0.1 mm/h threshold as ETS. A value of 1 means the model predicts precipitation as often as it actually occurs. Values above 1 indicate over-forecasting; below 1 indicate under-forecasting. Applied to precipitation only.

def frequency_bias(forecast, observed):
    h, m, f, _ = _contingency(forecast, observed, PRECIP_THRESHOLD_MM_S)
    h, m, f = [_daily_sum(x) for x in (h, m, f)]
    return (h + f) / (h + m)

HSS (Heidke Skill Score)

Measures the fraction of correct forecasts (both wet and dry) after removing those expected by random chance. Uses the same 0.1 mm/h precipitation threshold as ETS and Frequency Bias. HSS ranges from −1 to 1, where 1 is a perfect forecast, 0 indicates no skill beyond random chance, and negative values indicate worse-than-random performance. Unlike ETS, HSS gives equal credit for correctly forecasting both wet and dry events. Applied to precipitation only.

def hss(forecast, observed):
    h, m, f, cn = _contingency(forecast, observed, PRECIP_THRESHOLD_MM_S)
    h, m, f, cn = [_daily_sum(x) for x in (h, m, f, cn)]
    numer = 2 * (h * cn - f * m)
    denom = (h + m) * (m + cn) + (h + f) * (f + cn)
    return numer / denom

FSS (Fractions Skill Score)

Measures forecast skill by comparing the fractional coverage of precipitation events within a spatial neighborhood, rather than requiring exact grid-point matches. Uses the same 0.1 mm/h precipitation threshold as ETS, Frequency Bias, and HSS. FSS ranges from 0 to 1, where 1 is a perfect forecast and 0 indicates no skill. Unlike point-based categorical metrics, FSS gives credit for forecasts that are close in space, making it more forgiving of small spatial displacement errors. Applied to precipitation only.

FSS is computed by comparing the observed and forecast fractions of precipitation events within a neighborhood:

def fss(forecast, observed):
    f_binary = (forecast >= PRECIP_THRESHOLD_MM_S).astype(float)
    o_binary = (observed >= PRECIP_THRESHOLD_MM_S).astype(float)

    # Compute fractions over the neighborhood (spatial smoothing)
    f_frac = uniform_filter(f_binary, size=neighborhood_size)
    o_frac = uniform_filter(o_binary, size=neighborhood_size)

    # Sum counts over daily lead-time bins and init_times
    f_frac, o_frac = [_daily_mean(x) for x in (f_frac, o_frac)]

    mse = ((f_frac - o_frac) ** 2).mean()
    ref = (f_frac ** 2).mean() + (o_frac ** 2).mean()
    return 1 - mse / ref

Bias-Corrected Metrics

Temperature metrics are also available in bias-corrected variants (RMSE_bc, MAE_bc, Bias_bc, CRPS_bc). These remove the systematic offset between each model's grid-cell average and the point station measurement before computing the metric.

NWP grid cells represent area-averages, but weather stations measure point values. A station in the upland corner of a grid cell will systematically read cooler than the cell average. This pixel-station representativeness error inflates all error metrics regardless of forecast skill.

The correction factor is a single scalar per station per model, computed as the mean forecast-minus-observed difference over a 180-day window across all lead times:

correction = mean(forecast - observed)  # per station, per model, 180 days

This correction is then subtracted from every forecast value before computing metrics. It cannot systematically help or hurt any particular model type — it simply removes the fixed spatial mismatch between each model's grid representation and the station location.

Bias-corrected metrics are available for temperature only. Precipitation bias is multiplicative rather than additive, so a constant offset correction does not apply.

Contingency Table

The categorical precipitation metrics (ETS, Frequency Bias, HSS) are built on a contingency table. For each forecast-observation pair, both values are classified as "wet" (>= 0.1 mm/h) or "dry" (< 0.1 mm/h), producing four counts:

	Observed wet	Observed dry
Forecast wet	Hit (H)	False alarm (F)
Forecast dry	Miss (M)	Correct negative (CN)

NaN values in either forecast or observation are excluded from all four cells.

def _contingency(forecast, observed, threshold):
    valid = forecast.notnull() & observed.notnull()
    f_yes = (forecast >= threshold) & valid
    o_yes = (observed >= threshold) & valid
    hits = (f_yes & o_yes).astype(float)
    misses = (~f_yes & o_yes).astype(float)
    false_alarms = (f_yes & ~o_yes).astype(float)
    correct_neg = (~f_yes & ~o_yes & valid).astype(float)
    return hits, misses, false_alarms, correct_neg