Compare Runs

Objective

Compare multiple DSAMbayes runner executions and select a candidate model for reporting or decision-making, using predictive scoring and diagnostic summaries.

Prerequisites

  • Two or more completed runner executions (MCMC fit method).
  • Artefacts under 50_model_selection/ for each run (LOO summary, ELPD outputs).
  • Familiarity with Diagnostics Gates and Model Selection Plots.

Steps

1. Collect run directories

Identify the run directories to compare:

results/20260228_083808_blm_synth_kpi_os_hfb01/
results/20260228_084410_blm_synth_kpi_os_hfb01/
results/20260228_084602_blm_synth_kpi_os_hfb01/
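When many runs accumulate under results/, a shell one-liner can enumerate the candidates. This is a sketch; the mkdir lines only create placeholder directories so the example runs stand-alone:

```shell
# Create placeholder run directories so the listing below has something to show.
mkdir -p results/20260228_083808_blm_synth_kpi_os_hfb01 \
         results/20260228_084410_blm_synth_kpi_os_hfb01 \
         results/20260228_084602_blm_synth_kpi_os_hfb01

# List matching run directories, newest first (names start with a timestamp).
ls -d results/*_blm_synth_kpi_os_hfb01 | sort -r
```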

2. Compare ELPD scores

The compare_runs() helper ranks runs by expected log predictive density (ELPD):

library(DSAMbayes)
comparison <- compare_runs(
  run_dirs = c(
    "results/20260228_083808_blm_synth_kpi_os_hfb01",
    "results/20260228_084410_blm_synth_kpi_os_hfb01"
  )
)
print(comparison)

The output ranks runs by ELPD (higher is better) and reports Pareto-k diagnostics.

3. Check Pareto-k reliability

Examine the loo_summary.csv in each run’s 50_model_selection/ folder:

cat results/<run_dir>/50_model_selection/loo_summary.csv

Key metrics:

| Metric | Interpretation |
| --- | --- |
| elpd_loo | Expected log predictive density; higher is better |
| p_loo | Effective number of parameters |
| looic | LOO information criterion; lower is better |
| Pareto-k counts | Observations with k > 0.7 indicate unreliable LOO estimates |

If many observations have high Pareto-k values, the LOO approximation is unreliable for that run. Consider time-series cross-validation as an alternative.
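The exact columns of loo_summary.csv depend on the runner version; assuming a per-observation pareto_k column, counting unreliable observations can be sketched as follows (the here-doc fabricates a tiny example file purely for illustration):

```shell
# Fabricate a minimal loo_summary-like CSV; real files live under
# results/<run_dir>/50_model_selection/ and may use different column names.
cat > loo_summary_example.csv <<'EOF'
obs,pareto_k
1,0.12
2,0.81
3,0.45
4,0.73
EOF

# Count observations with Pareto-k above the 0.7 reliability threshold.
awk -F, 'NR > 1 && $2 > 0.7 { n++ } END { print n+0 }' loo_summary_example.csv
# prints 2
```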

4. Review time-series CV (if available)

If diagnostics.time_series_selection.enabled: true was configured, check:

cat results/<run_dir>/50_model_selection/tscv_summary.csv

This provides expanding-window blocked CV scores (holdout ELPD, RMSE, SMAPE) that are more appropriate for time-series data than standard LOO.
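To compare TSCV results across runs, it can help to collapse the per-window scores into one number per run, such as the mean holdout RMSE. The column names below are illustrative assumptions, not guaranteed by the runner:

```shell
# Fabricate a small tscv_summary-like CSV (illustrative column names).
cat > tscv_summary_example.csv <<'EOF'
window,holdout_elpd,holdout_rmse,holdout_smape
1,-312.4,1820.5,0.081
2,-305.9,1744.2,0.077
3,-318.7,1901.3,0.085
EOF

# Average the holdout RMSE across expanding windows (column 3).
awk -F, 'NR > 1 { s += $3; n++ } END { printf "%.1f\n", s/n }' tscv_summary_example.csv
# prints 1822.0
```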

5. Cross-reference diagnostics

For each candidate run, check the diagnostics overall status:

head -1 results/<run_dir>/40_diagnostics/diagnostics_report.csv

A model with better ELPD but failing diagnostics should not be preferred over a model with slightly lower ELPD and passing diagnostics.
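Checking several candidate runs at once can be scripted. The loop below fabricates two mock run directories so it runs stand-alone; the report contents are illustrative, as real runs already contain diagnostics_report.csv:

```shell
# Fabricate two mock run directories with a one-line diagnostics report each.
for run in runA runB; do
  mkdir -p "results_mock/$run/40_diagnostics"
done
echo "overall_status,pass" > results_mock/runA/40_diagnostics/diagnostics_report.csv
echo "overall_status,fail" > results_mock/runB/40_diagnostics/diagnostics_report.csv

# Print each run's overall status side by side.
for report in results_mock/*/40_diagnostics/diagnostics_report.csv; do
  printf '%s: %s\n' "$(basename "$(dirname "$(dirname "$report")")")" "$(head -1 "$report")"
done
```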

6. Compare fit quality visually

Review the fit time series and scatter plots in 20_model_fit/ for each run:

  • Fit time series — does the model track the observed KPI?
  • Fit scatter — is the predicted-vs-observed relationship close to the diagonal?
  • Posterior forest — are coefficient estimates reasonable and well-identified?

7. Selection decision matrix

| Criterion | Weight | Run A | Run B |
| --- | --- | --- | --- |
| ELPD (higher is better) | High | value | value |
| Pareto-k reliability (fewer high-k) | High | value | value |
| Diagnostics overall status | High | pass/warn/fail | pass/warn/fail |
| TSCV holdout RMSE (if available) | Medium | value | value |
| Coefficient plausibility | Medium | judgement | judgement |
| Fit visual quality | Low | judgement | judgement |

8. Record the selection

Document the selected run directory and rationale. If using the runner for release evidence, the selected run’s artefacts form part of the evidence pack.
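One lightweight way to record the decision is a short note kept alongside the results. The filename and fields below are suggestions, not runner conventions, and the rationale text is a template to fill in:

```shell
# Write a brief selection note (illustrative filename and fields).
cat > model_selection_note.md <<'EOF'
Selected run: results/20260228_084410_blm_synth_kpi_os_hfb01
Rationale: highest ELPD among runs with passing diagnostics;
all Pareto-k values below 0.7; TSCV holdout RMSE within 2% of best.
EOF

cat model_selection_note.md
```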

Caveats

  • ELPD is not causal validation. Predictive scoring measures how well the model predicts the observed data, not whether the model identifies causal media effects correctly.
  • Pooled models do not support time-series CV (rejected by config validation).
  • MAP-fitted models do not produce LOO diagnostics. Use MCMC for model comparison.