Compare Runs

Objective

Compare multiple DSAMbayes runner executions and select a candidate model for reporting or decision-making, using predictive scoring and diagnostic summaries.

Prerequisites

  • Two or more completed runner executions (MCMC fit method).
  • Artefacts under 50_model_selection/ for each run (LOO summary, ELPD outputs).
  • Familiarity with Diagnostics Gates and Model Selection Plots.

Steps

1. Collect run directories

Identify the run directories to compare:

results/20260228_083808_blm_synth_kpi_os_hfb01/
results/20260228_084410_blm_synth_kpi_os_hfb01/
results/20260228_084602_blm_synth_kpi_os_hfb01/
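When many runs accumulate under results/, a shell one-liner can enumerate the candidates. This is a sketch; the mkdir lines only create placeholder directories so the example runs stand-alone:

```shell
# Create placeholder run directories so the listing below has something to show.
mkdir -p results/20260228_083808_blm_synth_kpi_os_hfb01 \
         results/20260228_084410_blm_synth_kpi_os_hfb01 \
         results/20260228_084602_blm_synth_kpi_os_hfb01

# List matching run directories, newest first (names start with a timestamp).
ls -d results/*_blm_synth_kpi_os_hfb01 | sort -r
```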

2. Compare ELPD scores

The compare_runs() helper ranks runs by expected log predictive density (ELPD):

library(DSAMbayes)
comparison <- compare_runs(
  run_dirs = c(
    "results/20260228_083808_blm_synth_kpi_os_hfb01",
    "results/20260228_084410_blm_synth_kpi_os_hfb01"
  )
)
print(comparison)

The output ranks runs by ELPD (higher is better) and reports Pareto-k diagnostics.

3. Check Pareto-k reliability

Examine the loo_summary.csv in each run’s 50_model_selection/ folder:

cat results/<run_dir>/50_model_selection/loo_summary.csv

Key metrics:

| Metric | Interpretation |
| --- | --- |
| elpd_loo | Expected log predictive density; higher is better |
| p_loo | Effective number of parameters |
| looic | LOO information criterion; lower is better |
| Pareto-k counts | Observations with k > 0.7 indicate unreliable LOO estimates |

If many observations have high Pareto-k values, the LOO approximation is unreliable for that run. Consider time-series cross-validation as an alternative.
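The exact columns of loo_summary.csv depend on the runner version; assuming a per-observation pareto_k column, counting unreliable observations can be sketched as follows (the here-doc fabricates a tiny example file purely for illustration):

```shell
# Fabricate a minimal loo_summary-like CSV; real files live under
# results/<run_dir>/50_model_selection/ and may use different column names.
cat > loo_summary_example.csv <<'EOF'
obs,pareto_k
1,0.12
2,0.81
3,0.45
4,0.73
EOF

# Count observations with Pareto-k above the 0.7 reliability threshold.
awk -F, 'NR > 1 && $2 > 0.7 { n++ } END { print n+0 }' loo_summary_example.csv
# prints 2
```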

4. Review time-series CV (if available)

If diagnostics.time_series_selection.enabled: true was configured, check:

cat results/<run_dir>/50_model_selection/tscv_summary.csv

This provides expanding-window blocked CV scores (holdout ELPD, RMSE, SMAPE) that are more appropriate for time-series data than standard LOO.
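To compare TSCV results across runs, it can help to collapse the per-window scores into one number per run, such as the mean holdout RMSE. The column names below are illustrative assumptions, not guaranteed by the runner:

```shell
# Fabricate a small tscv_summary-like CSV (illustrative column names).
cat > tscv_summary_example.csv <<'EOF'
window,holdout_elpd,holdout_rmse,holdout_smape
1,-312.4,1820.5,0.081
2,-305.9,1744.2,0.077
3,-318.7,1901.3,0.085
EOF

# Average the holdout RMSE across expanding windows (column 3).
awk -F, 'NR > 1 { s += $3; n++ } END { printf "%.1f\n", s/n }' tscv_summary_example.csv
# prints 1822.0
```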

5. Cross-reference diagnostics

For each candidate run, check the diagnostics overall status:

head -1 results/<run_dir>/40_diagnostics/diagnostics_report.csv

A model with better ELPD but failing diagnostics should not be preferred over a model with slightly lower ELPD and passing diagnostics.
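Checking several candidate runs at once can be scripted. The loop below fabricates two mock run directories so it runs stand-alone; the report contents are illustrative, as real runs already contain diagnostics_report.csv:

```shell
# Fabricate two mock run directories with a one-line diagnostics report each.
for run in runA runB; do
  mkdir -p "results_mock/$run/40_diagnostics"
done
echo "overall_status,pass" > results_mock/runA/40_diagnostics/diagnostics_report.csv
echo "overall_status,fail" > results_mock/runB/40_diagnostics/diagnostics_report.csv

# Print each run's overall status side by side.
for report in results_mock/*/40_diagnostics/diagnostics_report.csv; do
  printf '%s: %s\n' "$(basename "$(dirname "$(dirname "$report")")")" "$(head -1 "$report")"
done
```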

6. Compare fit quality visually

Review the fit time series and scatter plots in 20_model_fit/ for each run:

  • Fit time series — does the model track the observed KPI?
  • Fit scatter — is the predicted-vs-observed relationship close to the diagonal?
  • Posterior forest — are coefficient estimates reasonable and well-identified?

7. Selection decision matrix

| Criterion | Weight | Run A | Run B |
| --- | --- | --- | --- |
| ELPD (higher is better) | High | value | value |
| Pareto-k reliability (fewer high-k) | High | value | value |
| Diagnostics overall status | High | pass/warn/fail | pass/warn/fail |
| TSCV holdout RMSE (if available) | Medium | value | value |
| Coefficient plausibility | Medium | judgement | judgement |
| Fit visual quality | Low | judgement | judgement |

8. Record the selection

Document the selected run directory and rationale. If using the runner for release evidence, the selected run’s artefacts form part of the evidence pack.
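One lightweight way to record the decision is a short note kept alongside the results. The filename and fields below are suggestions, not runner conventions, and the rationale text is a template to fill in:

```shell
# Write a brief selection note (illustrative filename and fields).
cat > model_selection_note.md <<'EOF'
Selected run: results/20260228_084410_blm_synth_kpi_os_hfb01
Rationale: highest ELPD among runs with passing diagnostics;
all Pareto-k values below 0.7; TSCV holdout RMSE within 2% of best.
EOF

cat model_selection_note.md
```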

Caveats

  • ELPD is not causal validation. Predictive scoring measures how well the model predicts the observed data, not whether the model identifies causal media effects correctly.
  • Pooled models do not support time-series CV (rejected by config validation).
  • MAP-fitted models do not produce LOO diagnostics. Use MCMC for model comparison.