# Compare Runs

## Objective
Compare multiple DSAMbayes runner executions and select a candidate model for reporting or decision-making, using predictive scoring and diagnostic summaries.
## Prerequisites

- Two or more completed `runner run` executions (MCMC fit method).
- Artefacts under `50_model_selection/` for each run (LOO summary, ELPD outputs).
- Familiarity with Diagnostics Gates and Model Selection Plots.
## Steps
### 1. Collect run directories

Identify the run directories to compare:
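As a sketch (assuming runs are written under a `runs/` output root, which may differ in your setup), candidate directories can be collected with `pathlib`, keeping only runs that produced model-selection artefacts:

```python
from pathlib import Path

# Hypothetical layout: each runner execution writes its artefacts to its
# own directory under runs/ (adjust the root to your configuration).
run_dirs = sorted(Path("runs").glob("*"))

# Keep only runs that completed model selection (have a LOO summary).
candidates = [
    d for d in run_dirs
    if (d / "50_model_selection" / "loo_summary.csv").exists()
]
print(candidates)
```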
### 2. Compare ELPD scores
The `compare_runs()` helper ranks runs by expected log predictive density (ELPD):

The output lists runs from best to worst ELPD (higher is better) and reports Pareto-k diagnostics for each run.
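The exact signature of `compare_runs()` is not shown here; the ranking it performs can be sketched with made-up ELPD values (run names and numbers are illustrative, not real outputs):

```python
# Illustrative ranking of runs by ELPD (higher is better).
results = {
    "runs/2024-05-01": {"elpd_loo": -512.3, "se": 9.1},
    "runs/2024-05-08": {"elpd_loo": -498.7, "se": 8.6},
}
ranked = sorted(results.items(), key=lambda kv: kv[1]["elpd_loo"], reverse=True)
best_run, best = ranked[0]

# An ELPD difference smaller than roughly 2 standard errors of the
# difference is usually not decisive on its own.
delta = best["elpd_loo"] - ranked[1][1]["elpd_loo"]
print(best_run, round(delta, 1))
```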
### 3. Check Pareto-k reliability
Examine the `loo_summary.csv` in each run's `50_model_selection/` folder:
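A minimal sketch of inspecting such a file programmatically, using illustrative column names and values (the real artefact's columns may differ):

```python
import csv
import io

# Stand-in for open("runs/<run>/50_model_selection/loo_summary.csv"),
# with made-up metric values.
sample = io.StringIO(
    "metric,value\n"
    "elpd_loo,-498.7\n"
    "p_loo,14.2\n"
    "looic,997.4\n"
)
summary = {row["metric"]: float(row["value"]) for row in csv.DictReader(sample)}
print(summary["elpd_loo"], summary["looic"])
```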
Key metrics:
| Metric | Interpretation |
|---|---|
| `elpd_loo` | Expected log predictive density; higher is better |
| `p_loo` | Effective number of parameters |
| `looic` | LOO information criterion; lower is better |
| Pareto-k counts | Observations with k > 0.7 indicate unreliable LOO estimates |
If many observations have high Pareto-k values, the LOO approximation is unreliable for that run. Consider time-series cross-validation as an alternative.
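The reliability check above can be sketched as a simple count over illustrative Pareto-k values:

```python
# Flag a run whose LOO estimate is unreliable because too many
# observations have Pareto-k above 0.7 (k values are illustrative).
pareto_k = [0.12, 0.45, 0.83, 0.31, 0.76, 0.09]
n_high = sum(1 for k in pareto_k if k > 0.7)
share_high = n_high / len(pareto_k)

# Rule of thumb: more than a few percent of high-k observations means
# the LOO comparison for this run should be treated with caution and
# time-series CV preferred where available.
print(n_high, round(share_high, 2))
```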
### 4. Review time-series CV (if available)
If `diagnostics.time_series_selection.enabled: true` was configured, check:
This provides expanding-window blocked CV scores (holdout ELPD, RMSE, SMAPE) that are more appropriate for time-series data than standard LOO.
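An expanding-window scheme can be sketched as follows, with a placeholder mean model standing in for the fitted MMM and illustrative data (this shows only how per-window holdout scores are accumulated, not DSAMbayes internals):

```python
import math

# Expanding-window blocked CV: fit on y[:cut], score on the next block.
y = [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0]
block = 2
rmse_scores, smape_scores = [], []
for cut in range(4, len(y), block):
    train, hold = y[:cut], y[cut:cut + block]
    # Placeholder "model": predict the training mean for every holdout point.
    pred = [sum(train) / len(train)] * len(hold)
    rmse_scores.append(
        math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, hold)) / len(hold))
    )
    smape_scores.append(
        sum(2 * abs(p - o) / (abs(p) + abs(o)) for p, o in zip(pred, hold)) / len(hold)
    )
print(len(rmse_scores), round(sum(rmse_scores) / len(rmse_scores), 2))
```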
### 5. Cross-reference diagnostics
For each candidate run, check the diagnostics overall status:
A model with better ELPD but failing diagnostics should not be preferred over a model with slightly lower ELPD and passing diagnostics.
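The gating rule above can be sketched as a filter applied before ELPD ranking, so a failing run can never win on predictive score alone (run records are illustrative):

```python
runs = [
    {"dir": "runs/a", "elpd_loo": -490.2, "diagnostics": "fail"},
    {"dir": "runs/b", "elpd_loo": -498.7, "diagnostics": "pass"},
]

# Diagnostics act as a gate: only pass/warn runs are eligible, then the
# best ELPD among the eligible runs is selected.
eligible = [r for r in runs if r["diagnostics"] in ("pass", "warn")]
selected = max(eligible, key=lambda r: r["elpd_loo"])
print(selected["dir"])
```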
### 6. Compare fit quality visually

Review the fit time series and scatter plots in `20_model_fit/` for each run:
- Fit time series — does the model track the observed KPI?
- Fit scatter — is the predicted-vs-observed relationship close to the diagonal?
- Posterior forest — are coefficient estimates reasonable and well-identified?
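As a numeric companion to the scatter inspection, a quick R² of predicted vs observed (values below are illustrative) summarises how close the points sit to the diagonal:

```python
obs = [10.0, 12.0, 11.0, 13.0, 15.0]
pred = [10.4, 11.6, 11.3, 12.8, 14.9]

# R^2 = 1 - SS_res / SS_tot; values near 1 mean the scatter hugs
# the diagonal, complementing the visual check.
mean_o = sum(obs) / len(obs)
ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
ss_tot = sum((o - mean_o) ** 2 for o in obs)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))
```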
### 7. Selection decision matrix
| Criterion | Weight | Run A | Run B |
|---|---|---|---|
| ELPD (higher is better) | High | value | value |
| Pareto-k reliability (fewer high-k) | High | value | value |
| Diagnostics overall status | High | pass/warn/fail | pass/warn/fail |
| TSCV holdout RMSE (if available) | Medium | value | value |
| Coefficient plausibility | Medium | judgement | judgement |
| Fit visual quality | Low | judgement | judgement |
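One way to make the matrix operational is a weighted score; the weights and per-criterion scores below are illustrative judgements on a 0–1 scale, not DSAMbayes outputs:

```python
# Weights mirror the High/Medium/Low column of the decision matrix.
weights = {
    "elpd": 3, "pareto_k": 3, "diagnostics": 3,
    "tscv_rmse": 2, "plausibility": 2, "visual": 1,
}
scores = {
    "Run A": {"elpd": 0.6, "pareto_k": 0.9, "diagnostics": 1.0,
              "tscv_rmse": 0.7, "plausibility": 0.8, "visual": 0.7},
    "Run B": {"elpd": 0.9, "pareto_k": 0.5, "diagnostics": 0.5,
              "tscv_rmse": 0.6, "plausibility": 0.6, "visual": 0.8},
}
totals = {run: sum(weights[c] * s[c] for c in weights) for run, s in scores.items()}
winner = max(totals, key=totals.get)
print(winner, round(totals[winner], 1))
```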
### 8. Record the selection
Document the selected run directory and rationale. If using the runner for release evidence, the selected run’s artefacts form part of the evidence pack.
## Caveats
- ELPD is not causal validation. Predictive scoring measures how well the model predicts the observed data, not whether the model identifies causal media effects correctly.
- Pooled models do not support time-series CV (rejected by config validation).
- MAP-fitted models do not produce LOO diagnostics. Use MCMC for model comparison.
## Related pages
- Diagnostics Gates — diagnostic quality checks
- Model Selection Plots — Pareto-k, LOO-PIT, ELPD influence plots
- Run from YAML — executing runs for comparison
- Config Schema — `model_selection.*` YAML keys