Modelling

Purpose

Describe model classes, inference contracts, diagnostics, and decision-layer semantics for DSAMbayes.

Audience

  • Practitioners building and interpreting DSAMbayes models.
  • Reviewers validating modelling assumptions and outputs.

Pages

| Page | Topic |
| --- | --- |
| Model Classes | BLM, hierarchical, and pooled class constructors, fit support, and limitations |
| Model Object Lifecycle | State transitions from construction through fitting to post-fit extraction |
| Priors and Boundaries | Prior schema, defaults, overrides, boundary controls, and scale semantics |
| Minimal-Prior Policy | Governance guidance for prior specification in MMM |
| Response Scale Semantics | Identity vs log response, KPI-scale conversion, Jensen-safe reporting |
| Diagnostics Gates | Policy modes, threshold tables, identifiability gate, and remediation actions |
| CRE / Mundlak | Correlated random effects for hierarchical models |
| Time Components | Managed holiday feature generation and weekly anchoring |
| Budget Optimisation | Decision-layer budget allocation, objectives, risk scoring, and response transforms |

Subsections of Modelling

Model Classes

Purpose

DSAMbayes provides three model classes for Bayesian marketing mix modelling. Each class targets a different data structure and pooling strategy. This page describes the constructor pathways, fit support, and practical limitations of each class so that an operator can select the appropriate model for a given dataset.

Class summary

| Class | S3 class chain | Constructor | Data structure | Grouping | Typical use case |
| --- | --- | --- | --- | --- | --- |
| BLM | blm | blm(formula, data) | Single market/brand | None | One-market regression with full prior and boundary control |
| Hierarchical | hierarchical, blm | blm(formula, data) with (term \| group) syntax | Panel (long format) | Random effects by group | Multi-market models sharing strength across groups |
| Pooled | pooled, blm | pool(blm_obj, grouping_vars, map) | Single market | Structured coefficient pooling via dimension map | Single-market models with media coefficients pooled across labelled dimensions |

BLM (blm)

Construction

model <- blm(kpi ~ m_tv + m_search + trend + seasonality, data = df)

blm() dispatches on the first argument. When passed a formula, it creates a blm object with default priors and boundaries. When passed an lm object, it creates a bayes_lm_updater whose priors are initialised from the OLS coefficient estimates and standard errors.
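The two dispatch pathways might be used like this (a sketch; `df` and the formula are illustrative, and the `data` argument on the `lm` pathway follows the class-selection guide on this page):

```r
# Formula pathway: blm object with default priors and boundaries
model <- blm(kpi ~ m_tv + m_search + trend, data = df)

# lm pathway: bayes_lm_updater whose priors come from the OLS fit
ols     <- lm(kpi ~ m_tv + m_search + trend, data = df)
updater <- blm(ols, data = df)
```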

Fit support

Method Function Backend
MCMC fit(model, ...) rstan::sampling()
MAP fit_map(model, n_runs, ...) rstan::optimizing() (repeated starts)

Post-fit accessors

  • get_posterior() — coefficient draws, fitted values, metrics
  • fitted() — predicted response on the original scale
  • decomp() — predictor-level decomposition via DSAMdecomp
  • optimise_budget() — decision-layer budget allocation
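Putting construction, fitting, and the accessors above together, a typical session might look like this (a sketch; `df` is illustrative):

```r
model <- blm(kpi ~ m_tv + m_search + trend + seasonality, data = df)
model <- fit(model)            # MCMC via rstan::sampling()

draws   <- get_posterior(model)  # coefficient draws, fitted values, metrics
yhat    <- fitted(model)         # predictions on the original scale
contrib <- decomp(model)         # predictor-level decomposition
```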

Limitations

  • No group structure. For multi-market data, use the hierarchical class.
  • optimise_budget() aborts if scale=TRUE and an offset is present (unsupported combination for the bayes_lm_updater Stan template).

Hierarchical (hierarchical)

Construction

The hierarchical class is created automatically when blm() detects random-effects syntax (|) in the formula:

model <- blm(
  kpi ~ m_tv + m_search + trend + (1 + m_tv + m_search | market),
  data = panel_df
)

Within the parenthesised block, terms to the left of | become random intercepts and slopes; the variable to the right defines the grouping factor. Multiple grouping terms are supported.
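Since multiple grouping terms are supported, a formula such as the following (illustrative) is also valid:

```r
model <- blm(
  kpi ~ m_tv + trend + (1 + m_tv | market) + (1 | region),
  data = panel_df
)
```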

CRE / Mundlak extension

For correlated random effects, call set_cre() after construction:

model <- set_cre(model, vars = c("m_tv", "m_search"))

This augments the population formula with group-mean terms (cre_mean_*) and updates priors and boundaries accordingly. See CRE / Mundlak for details.

Fit support

Method Function Backend
MCMC fit(model, ...) rstan::sampling()
MAP fit_map(model, n_runs, ...) rstan::optimizing() (repeated starts)

Post-fit accessors

Same as BLM. Coefficient draws from get_posterior() return vectors (one value per group) rather than scalars. Budget optimisation uses the population-level (fixed-effect) coefficient draws from the beta parameter.

Limitations

  • Stan template compilation uses a templated source (general_hierarchical.stan) rendered per number of groups and parameterisation mode. First compilation is slow; subsequent runs use a cached binary.
  • Response decomposition via model.matrix() may fail for formulas containing | syntax. The runner wraps this in tryCatch and skips gracefully.
  • Posterior forest and prior-vs-posterior plots average group-specific draws to produce a single population-level estimate.
  • Offset support in the hierarchical Stan template is handled via stats::model.offset() within build_hierarchical_frame_data().

Pooled (pooled)

Construction

The pooled class is created by converting an existing BLM object with pool():

base <- blm(kpi ~ m_tv + m_search + trend + seasonality, data = df)
model <- pool(base, grouping_vars = c("channel"), map = pooling_map)

The map is a data frame with a variable column mapping formula terms to pooling dimension labels. Priors and boundaries are reset to defaults when pool() is called.
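The map in the example above might be built like this. The variable column is documented; the channel column pairing each term with a dimension label is an assumption about how grouping_vars entries are matched:

```r
# Hypothetical pooling map: each media term labelled with a
# pooling-dimension value under the "channel" grouping variable
pooling_map <- data.frame(
  variable = c("m_tv", "m_search"),
  channel  = c("offline", "digital")
)
```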

Fit support

Method Function Backend
MCMC fit(model, ...) rstan::sampling()

MAP fitting (fit_map) is not currently implemented for pooled models.

Post-fit accessors

Same as BLM. The design matrix is split into base terms (intercept + non-pooled) and media terms (pooled). The Stan template uses a per-dimension coefficient structure.

Limitations

  • MAP fitting is not available.
  • extract_stan_design_matrix() may return a zero-row matrix, which causes VIF computation to be skipped.
  • The pooled Stan cache key includes sorted grouping variable names to avoid collisions between different pooling configurations.
  • Time-series cross-validation is not supported for pooled models (rejected by config validation).

Class selection guide

Scenario Recommended class Rationale
Single market, sufficient data BLM Simplest pathway; full accessor and optimisation support
Single market, OLS baseline available BLM via blm(lm_obj, data) Priors initialised from OLS; Bayesian updating
Multi-market panel Hierarchical Partial pooling shares strength across markets
Multi-market panel with confounding concerns Hierarchical + CRE Mundlak terms control for between-group confounding
Single market with structured media dimensions Pooled Coefficient pooling across labelled media categories

Fit method selection

Criterion MCMC (fit) MAP (fit_map)
Full posterior Yes No (point estimate only)
Credible intervals Yes Approximate via repeated starts
Diagnostics (Rhat, ESS, divergences) Yes Not applicable
LOO-CV / model selection Yes Not supported
Speed Minutes to hours Seconds to minutes
Budget optimisation Full posterior-based Point-estimate-based

For production runs where diagnostics and uncertainty quantification matter, MCMC is the recommended fit method. MAP is useful for rapid iteration during model development.


Model Object Lifecycle

DSAMbayes model objects (blm, hierarchical, pooled) are mutable S3 lists that progress through a well-defined sequence of states. Understanding these states helps avoid calling post-fit accessors on an unfitted object or forgetting to compile before fitting.

State-machine diagram

                        ┌─────────────────────────────────────────┐
                        │           CREATED                       │
                        │  blm(), blm.formula(), blm.lm(),        │
                        │  as_bayes_lm_updater()                  │
                        │  Fields set: .formula, .original_data,  │
                        │    .prior, .boundaries                  │
                        └──────────────┬──────────────────────────┘
                                       │
            ┌──────────────────────────┼──────────────────────────┐
            ▼                          ▼                          ▼
    set_prior(obj, …)         set_boundary(obj, …)        set_date(obj, …)
    Mutates .prior            Mutates .boundaries         Sets .date_var
            │                          │                          │
            └──────────────────────────┼──────────────────────────┘
                                       │
                    (optional: pool() transitions blm → pooled,
                     resets .prior/.boundaries, adds .pooling_vars/.pooling_map)
                                       │
                                       ▼
                        ┌─────────────────────────────────────────┐
                        │           CONFIGURED                    │
                        │  Priors, boundaries, date variable are  │
                        │  set (may still use defaults).          │
                        └──────────────┬──────────────────────────┘
                                       │
                                       ▼
                        ┌─────────────────────────────────────────┐
                        │           COMPILED                      │
                        │  compile_model(obj)                     │
                        │  Sets .stan_model                       │
                        │  (pre_flight_checks auto-compiles if    │
                        │   .stan_model is NULL)                  │
                        └──────────────┬──────────────────────────┘
                                       │
                                       ▼
                        ┌─────────────────────────────────────────┐
                        │           PRE-FLIGHTED                  │
                        │  pre_flight_checks(obj, data)           │
                        │  Validates formula/data compatibility,  │
                        │  auto-compiles and auto-sets date_var   │
                        │  if missing. Sets .response_transform,  │
                        │  .response_scale.                       │
                        └──────────────┬──────────────────────────┘
                                       │
                                       ▼
                        ┌─────────────────────────────────────────┐
                        │           FITTED                        │
                        │  fit(obj) / fit_map(obj)                │
                        │  Calls pre_flight_checks internally,    │
                        │  then prep_data_for_fit → rstan.        │
                        │  Sets .stan_data, .date_val, .posterior │
                        └──────────────┬──────────────────────────┘
                                       │
                   ┌───────────────────┼───────────────────┐
                   ▼                   ▼                   ▼
           get_posterior(obj)    fitted(obj)         decomp(obj)
           Returns tibble of    Predicted values    Decomposition
           posterior draws      (yhat)              via DSAMdecomp
                   │                                       │
                   ▼                                       ▼
           optimise_budget(obj, …)                         Further analysis
           Budget allocation
           (requires fitted model)

States and key fields

State Entry point Fields populated
Created blm(), blm.formula(), blm.lm(), as_bayes_lm_updater() .formula, .original_data, .prior, .boundaries, .response_transform, .response_scale
Configured set_prior(), set_boundary(), set_date() Mutates .prior, .boundaries, .date_var
Pooled pool(obj, grouping_vars, map) Adds .pooling_vars, .pooling_map; resets .prior, .boundaries; class becomes pooled
Compiled compile_model(obj) .stan_model
Pre-flighted pre_flight_checks(obj, data) .response_transform, .response_scale; auto-sets .stan_model, .date_var if missing
Fitted fit(obj) / fit_map(obj) .stan_data, .date_val, .posterior

Post-fit accessors

These functions require a fitted model (.posterior is not NULL):

Accessor Returns Notes
get_posterior(obj) Tibble of posterior draws (coefficients, metrics, yhat) Back-transforms to original scale when scale=TRUE
fitted(obj) Predicted values (yhat) on original scale
get_optimisation(obj) Optimisation results tibble Only for MAP-fitted models (.posterior inherits optimisation)
decomp(obj) Predictor-level decomposition via DSAMdecomp
optimise_budget(obj, …) Budget allocation results Requires fitted model with media terms
chain_diagnostics(obj) MCMC chain diagnostic summary Only for MCMC-fitted models

Guards and auto-transitions

  • pre_flight_checks() auto-compiles via compile_model() if .stan_model is NULL, and auto-sets .date_var to "date" if not already set.
  • fit() and fit_map() call pre_flight_checks() internally, so explicit compilation is optional.
  • get_posterior() aborts with a clear error if .posterior is NULL.
  • optimise_budget() aborts if the model has scale=TRUE and an offset is present (unsupported combination for the bayes_lm_updater class).
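These guards mean the minimal happy path can skip explicit compilation entirely (a sketch; `df` is illustrative):

```r
model <- blm(kpi ~ m_tv + trend, data = df)

# compile_model(model) is optional here: fit() runs pre_flight_checks(),
# which auto-compiles when .stan_model is NULL and defaults .date_var
# to "date" if unset.
model <- fit(model)

draws <- get_posterior(model)  # works: .posterior is populated
# Calling get_posterior() before fit() would abort with a clear error.
```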

Object field reference

All fields are initialised by model_object_schema_defaults() in R/model_schema.R. The canonical field list:

Field Type Set by
.original lm object or NULL Constructor
.formula formula Constructor
.original_data data.frame Constructor
.response_transform character(1) Constructor / pre_flight_checks
.response_scale character(1) Constructor / pre_flight_checks
.prior tibble Constructor / set_prior
.boundaries tibble Constructor / set_boundary
.stan_model stanmodel compile_model
.stan_data list prep_data_for_fit (via fit)
.posterior stanfit or optimisation fit / fit_map
.fitted logical(1) Internal
.offset matrix or NULL prep_offset (via fit)
.date_var character(1) set_date / pre_flight_checks
.date_val vector fit / fit_map
.cre list or NULL apply_cre_data (hierarchical)
.pooling_vars character pool()
.pooling_map data.frame pool()
.positive_prior_parameterization character(1) Runner config

Runner-injected fields

These are set by the YAML/CLI runner (run_from_yaml()) for artifact writing and are not part of the core modelling API:

  • .runner_config, .runner_kpi_type, .runner_identifiability
  • .runner_time_components, .runner_budget_optimisation
  • .runner_model_selection, .runner_model_type

Priors and Boundaries

Purpose

This page defines how DSAMbayes specifies, defaults, overrides, and scales coefficient priors and parameter boundaries for all model classes. It covers the prior schema, supported families, default-generation logic, YAML override contract, and the interaction between priors, boundaries, and the scale=TRUE pathway.

Prior schema

Each model object carries a .prior tibble with one row per parameter. The columns are:

Column Type Meaning
parameter character Parameter name (matches design-matrix column or special name)
description character Human-readable label
distribution call R distribution call, e.g. normal(0, 5)
is_default logical Whether the row was generated by default_prior()

Supported prior families

| Family | Stan encoding | Use case |
| --- | --- | --- |
| normal(mean, sd) | Default (prior_family_noise_sd = 0) | Coefficient priors (location–scale) |
| lognormal_ms(mean, sd) | prior_family_noise_sd = 1 with log-transformed parameters | noise_sd prior when positive support is desired |

All coefficient priors use normal(). The lognormal_ms family is available only for the noise_sd parameter and is parameterised by the mean and standard deviation on the original (non-log) scale; DSAMbayes converts these internally to log-space parameters.
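The internal mapping is not spelled out here; one standard moment-matching identity for a log-normal with mean $m$ and standard deviation $s$ on the original scale (shown as an assumption about the likely conversion) is:

$$\mu = \log\!\left(\frac{m^2}{\sqrt{m^2 + s^2}}\right), \qquad \sigma = \sqrt{\log\!\left(1 + \frac{s^2}{m^2}\right)}$$

where $\mu$ and $\sigma$ are the log-space parameters passed to Stan.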

Default prior generation

BLM and hierarchical (population terms)

default_prior.blm() calls standard_prior_terms(), which produces normal(0, 5) for each population-formula term (intercept and slope terms) plus a noise_sd entry.

Hierarchical (group-level standard deviations)

default_prior.hierarchical() additionally generates sd_<idx>[<term>] rows for each group factor. The prior standard deviation is set to the between-group standard deviation of the response, rounded to two decimal places.

BLM from lm (Bayesian updating)

default_prior.bayes_lm_updater() initialises coefficient priors from the OLS point estimates (mean) and standard errors (sd), enabling informative Bayesian updating.

Pooled

default_prior.pooled() uses the BLM defaults for non-pooled terms (intercept, base regressors, noise_sd) and normal(0, 5) for each dimension-level pooled coefficient.

Boundary schema

Each model object carries a .boundaries tibble with one row per parameter:

Column Type Meaning
parameter character Parameter name
description character Human-readable label
boundary list-column List with $lower and $upper (numeric scalars)
is_default logical Whether the row was generated by default_boundary()

Default boundaries are lower = -Inf, upper = Inf for all terms. No sign constraints are imposed by default.

YAML override contract

Prior overrides

priors:
  use_defaults: true
  overrides:
    - { parameter: m_tv, mean: 0.5, sd: 0.2 }
    - { parameter: price_index, mean: -0.2, sd: 0.1 }

Each override replaces the distribution call for the named parameter with normal(mean, sd). Overrides are sparse: only the listed parameters are changed; all other parameters keep their defaults.

When use_defaults: false, the default prior table is not generated. This is not recommended for typical use.

Boundary overrides

boundaries:
  overrides:
    - { parameter: m_tv, lower: 0.0, upper: .Inf }
    - { parameter: competitor_discount, lower: -.Inf, upper: 0.0 }

Each override replaces the boundary entry for the named parameter. YAML infinity tokens (.Inf, -.Inf) are coerced during config resolution.

Scale semantics (scale = TRUE)

When model.scale: true (the default), the response and predictors are standardised before Stan fitting. This affects both priors and boundaries.

Coefficient prior scaling

Prior standard deviations are scaled by the ratio sx / sy for slope terms and by 1 / sy for the intercept. The noise_sd prior standard deviation is multiplied by sy (the response standard deviation) to remain interpretable in the scaled space.

Boundary scaling

  • Zero boundaries (0) are invariant under scaling.
  • Infinite boundaries (±Inf) are invariant under scaling.
  • Finite non-zero boundaries for slope terms are scaled using scale_boundary_for_parameter(), which applies the same sx / sy ratio used for slope priors.
  • If a finite non-zero boundary is specified for a parameter without a matching scale factor in the design matrix, DSAMbayes aborts with a validation error.
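The invariance rules above can be illustrated with the slope-term arithmetic in base R. This is a sketch of the documented sx / sy rule, not the internal scale_boundary_for_parameter() implementation; sx and sy are illustrative:

```r
# Rescale a slope-term boundary (or prior sd) from the original data
# scale into the standardised space used by Stan when scale = TRUE.
scale_slope_value <- function(value, sx, sy) {
  if (value == 0 || is.infinite(value)) return(value)  # invariant cases
  value * sx / sy
}

sx <- 4   # sd of the predictor on the original scale
sy <- 20  # sd of the response on the original scale

scale_slope_value(0.5, sx, sy)  # finite non-zero value: 0.5 * 4 / 20 = 0.1
scale_slope_value(0, sx, sy)    # zero boundary: unchanged
scale_slope_value(Inf, sx, sy)  # infinite boundary: unchanged
```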

Practical implication

Users specify priors and boundaries on the original (unscaled) data scale. DSAMbayes converts them internally before passing data to Stan. Post-fit, coefficient draws are back-transformed to the original scale by get_posterior().

Interaction with model classes

| Behaviour | BLM | Hierarchical | Pooled |
| --- | --- | --- | --- |
| Default priors | normal(0, 5) per term | Population: same as BLM; group SD: data-derived | Non-pooled: BLM defaults; pooled: normal(0, 5) per dimension |
| Boundary defaults | (-Inf, Inf) per term | Same as BLM for population terms | Per-dimension boundaries for pooled terms |
| Prior scaling | sx / sy ratio | Same, computed on pooled model frame | Same, computed on full model frame |
| Boundary scaling | Same ratio | Same | Same |

Programmatic API

Inspect priors and boundaries

peek_prior(model)
peek_boundary(model)

Override priors

model <- model %>%
  set_prior(
    m_tv ~ normal(0.5, 0.2),
    price_index ~ normal(-0.2, 0.1)
  )

Override boundaries

model <- model %>%
  set_boundary(
    m_tv > 0,
    competitor_discount < 0
  )

Minimal-prior policy

The recommended operating profile for MMM is documented in Minimal-Prior Policy. The policy keeps priors weak by default and uses hard constraints only when there is structural business knowledge.


Minimal-Prior Policy

Purpose

Use a principled but low-friction prior setup that avoids specification-hunting while preserving identifiability in short, collinear MMM datasets.

Policy

  1. Default-first: keep priors.use_defaults: true.
  2. Sparse overrides: only add priors.overrides for high-conviction terms.
  3. Selective bounds: add boundaries.overrides only for structural signs.
  4. No blanket constraints: do not force all controls/media to one sign by default.
  5. Diagnose before tightening: use pre-flight and diagnostics gates first, then add priors/bounds if uncertainty is still unstable.

YAML mapping

priors:
  use_defaults: true
  overrides:
    # Optional high-conviction override examples:
    # - { parameter: price_index, mean: -0.2, sd: 0.1 }
    # - { parameter: distribution, mean: 0.15, sd: 0.1 }

boundaries:
  overrides:
    # Optional structural sign constraints:
    # - { parameter: m_tv, lower: 0.0, upper: .Inf }
    # - { parameter: competitor_discount, lower: -.Inf, upper: 0.0 }

When to override defaults

  • Do override when domain mechanism is stable and defensible.
  • Do not override only to improve one run’s fit metrics.
  • Do not add bounds if sign can plausibly flip under promotion, pricing, or substitution effects.

Review checklist

  • Are overrides fewer than the number of major business assumptions?
  • Is each bound tied to a concrete causal rationale?
  • Did diagnostics indicate a real identifiability problem before tightening?

Response Scale Semantics

Purpose

DSAMbayes models can operate on an identity (level) or log response scale. This page defines how response scale is detected, stored, and used for post-fit reporting, so that operators understand which scale their outputs are on and how KPI-scale conversions work.

Response scale detection

Response scale is determined at construction time by detect_response_scale(), which inspects the left-hand side of the formula:

Formula LHS Detected transform Response scale label
kpi ~ ... identity response_level
log(kpi) ~ ... log response_log

The detected value is stored in two model-object fields:

  • .response_transform — "identity" or "log". Describes the mathematical transform applied to the response before modelling.
  • .response_scale — "identity" or "log". Used as a label when reporting whether outputs are on the model scale or the KPI scale.

Both fields are set by the constructor and confirmed by pre_flight_checks().

Model scale vs KPI scale

Concept Identity response Log response
Model scale Raw KPI units Log of KPI units
KPI scale Same as model scale exp() of model scale
Coefficient interpretation Unit change in KPI per unit change in predictor Approximate percentage change in KPI per unit change in predictor

For identity-response models, model scale and KPI scale are identical. For log-response models, fitted values and residuals on the model scale are in log units and must be exponentiated to obtain KPI-scale values.

Post-fit accessors and scale behaviour

fitted() — model scale

fitted() returns predicted values on the model scale. For identity-response models this is the KPI scale. For log-response models this is the log scale.

fit_tbl <- fitted(model)
# fit_tbl$fitted is on model scale

fitted_kpi() — KPI scale

fitted_kpi() applies the inverse transform draw-wise before summarising. For log-response models the default conversion (since v1.2.2) uses the conditional-mean estimator:

$$E[Y] = \exp\!\bigl(\mu + \tfrac{\sigma^2}{2}\bigr)$$

This is the bias-corrected back-transform that accounts for the log-normal variance term. The previous behaviour (v1.2.0) used the simpler exp(mu) estimator, which corresponds to the conditional median on the KPI scale. To retain that behaviour, pass log_response = "median":

# Default (v1.2.2): conditional mean — bias-corrected
kpi_tbl <- fitted_kpi(model)

# Explicit median — equivalent to pre-v1.2.2 behaviour
kpi_tbl <- fitted_kpi(model, log_response = "median")

The output includes source_response_scale (the model’s response scale), response_scale = "kpi", and conversion_method ("conditional_mean" or "point_exp") to label the result.

observed() — model scale

observed() returns the observed response on the model scale after unscaling (if scale=TRUE).

observed_kpi() — KPI scale

observed_kpi() returns the observed response on the KPI scale. For log-response models, this applies exp() to the model-scale observed values.

to_kpi_scale() helper

The internal function to_kpi_scale(x, response_scale) implements the conversion:

  • If response_scale == "log": returns exp(x).
  • Otherwise: returns x unchanged.

This function is used consistently by fitted_kpi(), observed_kpi(), and runner artefact writers.
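The rule is simple enough to mirror directly (a sketch of the documented behaviour, not the internal function itself):

```r
# Mirror of the documented to_kpi_scale() rule: exponentiate log-scale
# values, pass identity-scale values through unchanged.
to_kpi_scale_sketch <- function(x, response_scale) {
  if (response_scale == "log") exp(x) else x
}

to_kpi_scale_sketch(log(100), "log")  # back on KPI units (~ 100)
to_kpi_scale_sketch(100, "identity")  # unchanged (100)
```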

Runner artefact scale conventions

Runner artefact writers use the response scale metadata to determine which scale to report:

Artefact Scale Notes
fitted.csv Model scale Direct output from fitted()
observed.csv Model scale Direct output from observed()
posterior_summary.csv Model scale Coefficient summaries on model scale
Fit time series plot KPI scale Uses fitted_kpi() and observed_kpi() for visual comparison
Fit scatter plot KPI scale Same as fit time series
Diagnostics (residuals) Model scale Residuals computed on model scale
Budget optimisation outputs KPI scale Response curves and allocations reported on KPI scale

Interaction with scale = TRUE

The scale flag and response scale are orthogonal:

  • scale = TRUE standardises predictors and response by centring and dividing by standard deviation before Stan fitting. Coefficients and fitted values are back-transformed to the original scale by get_posterior().
  • Response scale determines whether the original scale is levels (identity) or logs (log).

Both transformations compose: a log-response model with scale=TRUE first takes the log of the response (via the formula), then standardises the logged values. Post-fit, draws are first unscaled, then (for KPI-scale outputs) exponentiated.

Jensen’s inequality and draw-wise conversion

When converting log-scale posterior draws to KPI scale, DSAMbayes applies exp() to each draw individually before computing summaries (mean, median, credible intervals). This is the correct Bayesian approach because:

  • E[exp(X)] ≠ exp(E[X]) when X has non-zero variance (Jensen’s inequality).
  • Draw-wise conversion preserves the full posterior distribution on the KPI scale.
  • Summary statistics (mean, quantiles) computed after conversion correctly reflect KPI-scale uncertainty.
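A quick base-R simulation makes the gap concrete (the mean and sd of the simulated log-scale draws are illustrative):

```r
set.seed(1)
draws <- rnorm(1e5, mean = 2, sd = 0.5)  # posterior draws on the log scale

exp(mean(draws))  # exp of the mean: conditional-median flavour, ~ exp(2) ≈ 7.4
mean(exp(draws))  # mean of exp (draw-wise): ~ exp(2 + 0.5^2 / 2) ≈ 8.4
```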

Practical guidance

  • Use identity-response models when the KPI is naturally additive and coefficients should represent unit changes.
  • Use log-response models when the KPI is naturally multiplicative, when variance scales with level, or when the response must remain positive.
  • Always check response_scale_label(model) before interpreting coefficient magnitudes.
  • Use fitted_kpi() for business reporting; use fitted() for diagnostics.
  • Do not manually exponentiate posterior means from log-response models. Use fitted_kpi() or to_kpi_scale() on individual draws.


CRE / Mundlak

Purpose

The correlated random effects (CRE) pathway, implemented as a Mundlak device, augments hierarchical DSAMbayes models with group-mean terms. This separates within-group variation from between-group variation for selected regressors, reducing confounding bias when group-level means are correlated with the random effects.

When to use CRE

Use CRE when:

  • The model is hierarchical (panel data with (term | group) syntax).
  • Time-varying regressors (e.g. media spend) have group-level means that may be correlated with the group intercept or slope.
  • You want to decompose effects into within-group (temporal) and between-group (cross-sectional) components.

Do not use CRE when:

  • The model is BLM or pooled (CRE requires hierarchical class).
  • The panel has only one group (no between-group variation exists).
  • All regressors of interest are time-invariant (CRE mean terms would be constant).

Construction

CRE is applied after model construction via set_cre():

model <- blm(
  kpi ~ m_tv + m_search + trend + (1 + m_tv + m_search | market),
  data = panel_df
)
model <- set_cre(model, vars = c("m_tv", "m_search"))

What set_cre() does

  1. Resolves the grouping variable. If the formula has one group factor, it is used automatically. If multiple group factors exist, the group argument must be specified explicitly.

  2. Generates group-mean column names. For each variable in vars, a mean-term column is named cre_mean_<variable> (configurable via prefix).

  3. Augments the data. apply_cre_data() computes group-level means of each CRE variable and joins them back to the panel data as new columns.

  4. Updates the formula. The generated mean terms are appended to the population formula as fixed effects.

  5. Extends priors and boundaries. Default prior and boundary entries are added for each new mean term, matching the existing prior schema.
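Step 3 amounts to a group-mean join; in base R the same computation looks like this (illustrative data, with column naming following the documented cre_mean_ prefix):

```r
panel <- data.frame(
  market = rep(c("uk", "de"), each = 3),
  m_tv   = c(10, 20, 30, 5, 5, 5)
)

# Group means recycled back onto each row (ave() defaults to mean)
panel$cre_mean_m_tv <- ave(panel$m_tv, panel$market)

# Within-group variation is what remains after conditioning on the mean
panel$m_tv_within <- panel$m_tv - panel$cre_mean_m_tv
```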

YAML runner configuration

When using the runner, CRE is configured via:

cre:
  enabled: true
  vars: [m_tv, m_search, m_social]
  group: market
  prefix: cre_mean_

The runner calls set_cre() during model construction if cre.enabled: true.

Mundlak decomposition

For a regressor $x_{gt}$ (group $g$, time $t$), the Mundlak device decomposes the effect into:

  • Within-group effect: the coefficient on $x_{gt}$ in the population formula captures temporal variation after conditioning on the group mean.
  • Between-group effect: the coefficient on $\bar{x}_g$ (the CRE mean term) captures cross-sectional variation in group-level averages.

The original coefficient on $x_{gt}$ in a standard random-effects model conflates both sources. Adding $\bar{x}_g$ as a fixed effect separates them.
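In equation form, one common Mundlak parameterisation of the augmented model (notation illustrative) is:

$$y_{gt} = \alpha_g + \beta\, x_{gt} + \gamma\, \bar{x}_g + \varepsilon_{gt}, \qquad \bar{x}_g = \frac{1}{T_g}\sum_{t=1}^{T_g} x_{gt}$$

In this parameterisation $\beta$ carries the within-group effect, the total between-group effect is $\beta + \gamma$, and the CRE mean-term coefficient $\gamma$ is the between–within contrast.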

Validation and identification warnings

Input validation

set_cre() validates:

  • The model is hierarchical (aborts for BLM or pooled).
  • All vars are present in the data and are numeric.
  • The group variable exists in the formula’s group factors.
  • No CRE mean terms appear in random-slope blocks (would cause double-counting).

Identification warnings

warn_cre_identification() checks two conditions after CRE setup:

  1. More CRE variables than groups. If length(vars) > n_groups, between-effect estimates may be weakly identified. The function emits a warning.

  2. Near-zero within-group variation. For each CRE variable, the within-group residual ($x_{gt} - \bar{x}_g$) standard deviation is checked. If it is effectively zero, within-effect identification is weak. The function emits a per-variable warning.

Zero-variance CRE mean terms

If a CRE mean term has zero variance across all observations (possible when the underlying variable has identical group means), calculate_scaling_terms() in R/scale.R will abort when scale=TRUE. The error message identifies the constant CRE columns and suggests using model.type: re (without CRE) or model.scale: false as workarounds.

Panel assumptions

  • Balanced panels are not required. apply_cre_data() computes group means using dplyr::group_by() and mean(), which handles unequal group sizes.
  • Missing values in CRE variables are excluded from the group-mean calculation (na.rm = TRUE).
  • Group-mean recomputation. CRE mean columns are recomputed each time apply_cre_data() is called, including during prep_data_for_fit.hierarchical(). Existing CRE mean columns are dropped and regenerated to prevent stale values.

Decomposition and reporting

CRE mean terms appear as ordinary fixed-effect terms in the population formula. This means:

  • Posterior summary includes CRE mean-term coefficients alongside other population coefficients.
  • Response decomposition via decomp() attributes fitted-value contributions to CRE mean terms separately from their within-group counterparts.
  • Plots (posterior forest, prior-vs-posterior) include CRE mean terms.

Interpretation note: the CRE mean-term coefficient represents the between-group effect conditional on the within-group variation. It does not represent the total effect of the underlying variable.


Time Components

Purpose

DSAMbayes provides managed time-component generation through the time_components config section. When enabled, the runner deterministically generates holiday feature columns from a calendar file and optionally appends them to the model formula. This page defines the configuration contract, generation logic, naming conventions, and audit properties.

Overview

Time components in DSAMbayes cover:

  • Holidays — deterministic weekly indicator features derived from an external calendar file.
  • Trend and seasonality — specified directly in the model formula (e.g. t_scaled, sin52_1, cos52_1). These are not generated by the time-components system; they are user-supplied columns in the data.

The time_components system is responsible only for holiday feature generation.

YAML configuration

time_components:
  enabled: true
  holidays:
    enabled: true
    calendar_path: data/holidays.csv
    date_col: null          # auto-detected: date, ds, or event_date
    label_col: holiday
    date_format: null       # null = ISO 8601; or e.g. "%d/%m/%Y"
    week_start: monday
    timezone: UTC
    prefix: holiday_
    window_before: 0
    window_after: 0
    aggregation_rule: count # count | any
    overlap_policy: count_all # count_all | dedupe_label_date
    add_to_formula: true
    overwrite_existing: false

Key definitions

Key Default Description
enabled false Master toggle for the time-components system
holidays.enabled false Toggle for holiday feature generation
holidays.calendar_path null Path to the holiday calendar CSV (resolved relative to the config file)
holidays.date_col null Date column in the calendar; auto-detected from date, ds, or event_date
holidays.label_col holiday Column containing holiday event labels
holidays.date_format null Date parse format; null assumes ISO 8601
holidays.week_start monday Day-of-week anchor for weekly aggregation
holidays.timezone UTC Timezone used when parsing POSIX date-time inputs
holidays.prefix holiday_ Prefix prepended to generated feature column names
holidays.window_before 0 Days before each event date to include in the holiday window
holidays.window_after 0 Days after each event date to include in the holiday window
holidays.aggregation_rule count Weekly aggregation: count sums event-days per week; any produces a binary indicator
holidays.overlap_policy count_all Overlap handling: count_all counts every event-day; dedupe_label_date deduplicates per label and date
holidays.add_to_formula true Whether generated holiday terms are appended to the model formula automatically
holidays.overwrite_existing false Whether existing columns with matching names are overwritten

Calendar file contract

The holiday calendar is a CSV (or data frame) with at minimum:

Column Required Content
Date column Yes Daily event dates (one row per event occurrence)
Label column Yes Human-readable event name (e.g. Christmas, Black Friday)

Date column detection

If date_col is null, the system tries column names in order: date, ds, event_date. If none is found, validation aborts.

Label normalisation

Holiday labels are normalised to lowercase, alphanumeric-plus-underscore form via normalise_holiday_label(). For example:

  • Black Fridayblack_friday
  • New Year's Daynew_year_s_day
  • Empty labels → unnamed

The generated feature column name is {prefix}{normalised_label}, e.g. holiday_black_friday.
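The normalisation rule can be sketched as lowercase-then-collapse-non-alphanumerics. This Python illustration is consistent with the examples above, though the R `normalise_holiday_label()` may differ in edge cases:

```python
import re

def normalise_label(label, prefix="holiday_"):
    """Lowercase; collapse runs of non-alphanumerics to one underscore."""
    s = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    return prefix + (s or "unnamed")

normalise_label("Black Friday")    # "holiday_black_friday"
normalise_label("New Year's Day")  # "holiday_new_year_s_day"
normalise_label("")                # "holiday_unnamed"
```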

Generation pipeline

The runner calls build_weekly_holiday_features() with the following steps:

  1. Parse and validate the calendar. validate_holiday_calendar() checks column presence, date parsing, and label completeness.

  2. Expand holiday windows. expand_holiday_windows() replicates each event row across the [event_date - window_before, event_date + window_after] range.

  3. Align to weekly index. Each expanded event-day is mapped to its containing week using week_floor_date() with the configured week_start.

  4. Aggregate per week. Events are counted per week per feature. Under aggregation_rule: any, counts are collapsed to binary (0/1). Under overlap_policy: dedupe_label_date, duplicate label-date pairs within a week are removed before counting.

  5. Join to model data. The generated feature matrix is left-joined to the model data by the date column. Weeks with no events receive zero.

  6. Append to formula. If add_to_formula: true, generated feature columns are appended as additive terms to the population formula.
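Steps 3–4 above (weekly alignment plus `count`/`any` aggregation) can be sketched as follows. This is an illustrative Python rendering of the documented behaviour, not the package's R code:

```python
from datetime import date, timedelta

def week_floor(d, week_start=0):
    """Most recent week_start day (0 = Monday) on or before d."""
    return d - timedelta(days=(d.weekday() - week_start) % 7)

def weekly_feature(event_days, rule="count"):
    """Aggregate expanded event-days into one value per week."""
    counts = {}
    for d in event_days:
        w = week_floor(d)
        counts[w] = counts.get(w, 0) + 1
    if rule == "any":
        counts = {w: 1 for w in counts}   # collapse counts to a 0/1 indicator
    return counts

# Two event-days fall in the same Monday-anchored week:
days = [date(2024, 12, 25), date(2024, 12, 26)]
weekly_feature(days)           # {date(2024, 12, 23): 2}
weekly_feature(days, "any")    # {date(2024, 12, 23): 1}
```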

Weekly anchoring

All weekly alignment uses week_floor_date(), which computes the most recent occurrence of week_start on or before each date. The model data’s date column must contain week-start-aligned dates; normalise_weekly_index() validates this and aborts if dates are not aligned.

Supported week-start values

monday, tuesday, wednesday, thursday, friday, saturday, sunday.

Timezone handling

  • Calendar dates are parsed using the configured timezone (default UTC).
  • If the calendar contains POSIXt values, they are coerced to Date in the configured timezone.
  • Character dates are parsed as ISO 8601 by default, or using date_format if specified.

Generated-term audit contract

Generated holiday terms are tracked for downstream diagnostics and reporting:

  • The list of generated term names is stored in model$.runner_time_components$generated_terms.
  • The identifiability gate in R/diagnostics_report.R uses this list to auto-detect baseline terms (via detect_baseline_terms()), so generated holiday terms are included in baseline-media correlation checks without requiring explicit configuration.

Feature naming collision

If two different holiday labels normalise to the same feature name, build_weekly_holiday_features() aborts with a collision error. Ensure calendar labels are distinct after normalisation.

Interaction with existing data columns

  • If overwrite_existing: false (default), the runner aborts if any generated column name already exists in the data.
  • If overwrite_existing: true, existing columns with matching names are replaced by the generated features.

Practical guidance

  • Start with aggregation_rule: count to capture multi-day holiday effects (e.g. a holiday spanning two days in one week produces a count of 2).
  • Use window_before and window_after for events with known anticipation or lingering effects (e.g. window_before: 7 for pre-Christmas shopping).
  • Use aggregation_rule: any when you want binary holiday indicators regardless of how many event-days fall in a week.
  • Check generated terms in the resolved config (config.resolved.yaml) and posterior summary to confirm which holidays entered the model.

Cross-references

Diagnostics Gates

Purpose

DSAMbayes runs a deterministic diagnostics framework after model fitting. Each diagnostic check produces a pass, warn, or fail status. The policy mode controls how lenient or strict the thresholds are. This page defines the check taxonomy, threshold tables, policy modes, identifiability gate, and the overall status aggregation rule.

Policy modes

The diagnostics framework supports three policy modes, configured via diagnostics.policy_mode in YAML:

Mode Intent Threshold behaviour
explore Rapid iteration during model development Relaxed fail thresholds; many checks can only warn, not fail
publish Default production mode for shareable outputs Balanced thresholds; condition-number fail is downgraded to warn
strict Audit-grade gating for release candidates Tightest thresholds; rank deficit fails rather than warns

The mode is resolved by diagnostics_policy_thresholds(mode) in R/diagnostics_report.R.

Check taxonomy

Checks are organised into phases:

Phase Scope When evaluated
P0 Data integrity and design matrix validity Pre-fit (design matrix available)
P1 Sampler quality, residual behaviour, identifiability Post-fit (posterior available)

Each check row includes:

Field Meaning
check_id Unique identifier
phase P0 or P1
severity Priority rating (P0 = critical, P1 = important)
status pass, warn, fail, or skipped
metric Metric name
value Observed value
threshold Applied threshold description
message Human-readable explanation

P0 design checks

Check ID Metric Pass Warn Fail
pre_response_finite non_finite_response_count == 0 n/a > 0
pre_design_constants_duplicates constant_plus_duplicate_columns == 0 n/a > 0
pre_design_rank_deficit rank_deficit == 0 > 0 (publish) > 0 (strict)
pre_design_condition_number kappa_X ≤ warn > warn > fail

Condition number thresholds by mode

Mode Warn Fail
explore 10,000 ∞ (cannot fail)
publish 10,000 1,000,000 (downgraded to warn)
strict 10,000 1,000,000
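kappa(X) is the ratio of the largest to smallest singular value of the design matrix. For a two-column design it can be computed from the eigenvalues of the 2×2 Gram matrix X'X, as in this illustrative Python sketch (not the package's R implementation):

```python
import math

def condition_number_2col(x1, x2):
    """kappa(X) = sqrt(lambda_max / lambda_min) of X'X for two-column X."""
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    disc = math.sqrt((a - c) ** 2 + 4 * b * b)
    lmax = (a + c + disc) / 2
    lmin = (a + c - disc) / 2
    return math.sqrt(lmax / lmin)

x = [i / 49 for i in range(50)]
near_dup = [v + 1e-5 for v in x]          # nearly collinear copy of x
condition_number_2col([1.0] * 50, x)      # well-conditioned: small kappa
condition_number_2col(x, near_dup)        # blows past the 10,000 warn threshold
```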

P1 sampler checks (MCMC only)

Check ID Metric Direction Warn Fail
sampler_rhat_max max_rhat Lower is better 1.01 1.05
sampler_ess_bulk_min min_ess_bulk Higher is better 400 200
sampler_ess_tail_min min_ess_tail Higher is better 200 100
sampler_ebfmi_min min_ebfmi Higher is better 0.30 0.20
sampler_treedepth_frac treedepth_hit_fraction Lower is better 0.00 0.01
sampler_divergences divergent_fraction Lower is better 0.00 0.00

Mode adjustments for sampler checks

In explore mode, fail thresholds are substantially relaxed (e.g. rhat_fail = 1.10, ess_bulk_fail = 50). In strict mode, warn thresholds match publish fail thresholds.
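Direction-aware threshold evaluation can be sketched as below (a Python illustration of the pass/warn/fail logic; the actual thresholds come from `diagnostics_policy_thresholds(mode)` in R):

```python
def check_status(value, warn, fail, higher_is_better=False):
    """Map a metric value to pass/warn/fail under direction-aware thresholds."""
    if higher_is_better:
        if value < fail:
            return "fail"
        if value < warn:
            return "warn"
    else:
        if value > fail:
            return "fail"
        if value > warn:
            return "warn"
    return "pass"

check_status(1.03, warn=1.01, fail=1.05)                      # rhat: "warn"
check_status(150, warn=400, fail=200, higher_is_better=True)  # ess_bulk: "fail"
check_status(0.001, warn=0.0, fail=0.0)                       # divergences: "fail"
```

Note that a warn and fail threshold of 0.00 with a strict greater-than comparison means any non-zero divergent fraction fails, matching the remediation table below.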

P1 residual checks

Check ID Metric Direction Warn Fail
resid_ljung_box_p resid_lb_p Higher is better 0.05 0.01
resid_acf_max resid_acf_max Lower is better 0.20 0.40

Mode adjustments for residual checks

Mode resid_lb_p warn resid_lb_p fail resid_acf warn resid_acf fail
explore 0.05 0.00 (cannot fail) 0.20 ∞ (cannot fail)
publish 0.05 0.01 0.20 0.40
strict 0.10 0.05 0.15 0.30

P1 boundary hit check

Check ID Metric Direction Warn Fail
boundary_hit_fraction boundary_hit_frac Lower is better 0.05 0.20

In explore mode, boundary hits cannot fail. In strict mode, thresholds tighten to warn > 0.02, fail > 0.10.

P1 within-group variation check

Check ID Metric Direction Warn Fail
within_var_ratio within_var_min_ratio Higher is better 0.10 0.05

This check applies to hierarchical models and flags groups where within-group variation is extremely low relative to between-group variation. In explore mode, the fail threshold is zero (cannot fail).

Identifiability gate

The identifiability gate measures the maximum absolute correlation between baseline terms and media terms in the design matrix. It is configured via diagnostics.identifiability in YAML:

diagnostics:
  identifiability:
    enabled: true
    media_terms: [m_tv, m_search, m_social]
    baseline_terms: [trend, seasonality]
    baseline_regex: ["^h_", "^sin", "^cos"]
    abs_corr_warn: 0.80
    abs_corr_fail: 0.95

Term detection

  • Media terms: explicitly listed in media_terms.
  • Baseline terms: union of baseline_terms, generated time-component terms, and matches from baseline_regex patterns.
  • Both sets are intersected with actual design-matrix columns and filtered to remove constant columns.
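The gate's metric reduces to a maximum of pairwise Pearson correlations across the two term sets. A minimal Python sketch of that computation (illustrative only; column names are hypothetical):

```python
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def identifiability_metric(cols, media, baseline):
    """Max |corr| between any media column and any baseline column."""
    return max(abs(pearson(cols[m], cols[b])) for m in media for b in baseline)

cols = {
    "m_tv":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "trend":  [1.0, 2.0, 3.0, 4.0, 6.0],   # tracks m_tv almost exactly
    "h_xmas": [0.0, 0.0, 1.0, 0.0, 0.0],
}
identifiability_metric(cols, ["m_tv"], ["trend", "h_xmas"])  # > 0.95: fails under publish
```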

Thresholds by mode

Mode Warn Fail
explore 0.80 ∞ (cannot fail)
publish 0.80 0.95
strict 0.70 0.85

Skip conditions

The identifiability gate reports skipped when:

  • identifiability.enabled: false
  • No configured media terms found in the design matrix
  • No baseline terms detected from configured terms/regex
  • All resolved baseline or media terms are constant

Overall status aggregation

The overall diagnostics status is determined by diagnostics_overall_status():

  1. If any check has status == "fail" → overall status is fail.
  2. If any check has status == "warn" (and none fail) → overall status is warn.
  3. Otherwise → overall status is pass.

Checks with status == "skipped" do not affect the overall status.
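The aggregation rule above can be sketched in a few lines (Python illustration of the documented precedence, not the R `diagnostics_overall_status()` itself):

```python
def overall_status(statuses):
    """fail dominates, then warn; skipped checks never affect the outcome."""
    active = [s for s in statuses if s != "skipped"]
    if "fail" in active:
        return "fail"
    if "warn" in active:
        return "warn"
    return "pass"

overall_status(["pass", "skipped", "warn"])   # "warn"
overall_status(["skipped", "skipped"])        # "pass"
```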

Runner artefact output

The diagnostics framework produces:

Artefact Location Content
diagnostics_report.csv 40_diagnostics/ Full check table with all fields
diagnostics_summary.txt 40_diagnostics/ Human-readable summary of overall status and failing checks

Interpretation guidance

  • pass — no remediation needed; model is suitable for the configured policy mode.
  • warn — review recommended; the model may have quality concerns but does not block the configured policy.
  • fail — remediation required before the model can be considered production-ready under the configured policy.

Common remediation actions

Diagnostic area Warning signs Actions
High Rhat > 1.01 Increase MCMC iterations or warmup; simplify model
Low ESS < 400 bulk or < 200 tail Increase iterations; check for multimodality
Divergences Any non-zero fraction Increase adapt_delta; reparameterise model
High condition number kappa > 10,000 Reduce collinearity; remove redundant terms
Residual autocorrelation High ACF or low Ljung-Box p Add time controls (trend, seasonality, holidays)
Boundary hits > 5% of draws Review boundary specification; widen or remove constraints
High baseline-media correlation > 0.80 Add controls to separate baseline from media; consider alternative model specifications

Cross-references

Budget Optimisation

Purpose

DSAMbayes provides a decision-layer budget optimisation engine that operates on fitted model posteriors. Given a channel scenario with spend bounds, response-transform specifications, and an objective function, the engine searches for the allocation that maximises the chosen objective while respecting channel-level constraints. This page defines the inputs, objectives, risk scoring, response-scale handling, and output structure.

Overview

Budget optimisation is separate from parameter estimation. It takes a fitted model and a scenario specification, then:

  1. Extracts posterior coefficient draws for the scenario’s channel terms.
  2. Generates feasible candidate allocations within channel bounds that sum to the total budget.
  3. Evaluates each candidate across all posterior draws to obtain a distribution of KPI outcomes.
  4. Ranks candidates by the configured objective and risk scoring function.
  5. Returns the best allocation, channel-level summaries, response curves, and impact breakdowns.

Entry point

result <- optimise_budget(model, scenario, n_candidates = 2000L, seed = 123L)

An optimize_budget() alias is also available for users who prefer American English spelling.

Scenario specification

The scenario is a structured list with the following top-level keys:

channels

A list of channel definitions, each containing:

Key Required Default Description
term Yes Model formula term name for this channel
name No Same as term Human-readable channel label
spend_col No Same as name Data column used for reference spend lookup
bounds.min No 0 Minimum allowed spend for this channel
bounds.max No Inf Maximum allowed spend for this channel
response No {type: "identity"} Response transform specification
currency_col No null Data column for currency-unit conversion

Channel names and terms must be unique across the scenario.

budget_total

Total budget to allocate across all channels. All feasible allocations sum to this value.

reference_spend

Optional named list of per-channel reference spend values. If not provided, reference spend is estimated from the mean of the spend_col in the model’s original data.

objective

Defines the optimisation target and risk scoring:

Key Values Description
target kpi_uplift, profit What to maximise
value_per_kpi numeric (required for profit) Currency value of one KPI unit
risk.type mean, mean_minus_sd, quantile Risk scoring function
risk.lambda numeric ≥ 0 (for mean_minus_sd) Penalty weight on posterior standard deviation
risk.quantile (0, 1) (for quantile) Quantile level for pessimistic scoring

Response transforms

Each channel can specify a response transform that maps raw spend to the transformed value used in the linear predictor. Supported types:

Type Formula Parameters
identity spend None
atan atan(spend / scale) scale (positive scalar)
log1p log(1 + spend / scale) scale (positive scalar)
hill spend^n / (spend^n + k^n) k (half-saturation), n (shape)

The response transform is applied within response_transform_value() and determines the shape of the channel’s response curve.
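The four transform formulas can be sketched as a single dispatch function. This Python version only mirrors the table above; the package applies them via `response_transform_value()` in R:

```python
import math

def response_value(spend, spec):
    """Transformed spend entering the linear predictor (illustrative)."""
    kind = spec.get("type", "identity")
    if kind == "identity":
        return spend
    if kind == "atan":
        return math.atan(spend / spec["scale"])
    if kind == "log1p":
        return math.log1p(spend / spec["scale"])
    if kind == "hill":
        k, n = spec["k"], spec["n"]
        return spend ** n / (spend ** n + k ** n)
    raise ValueError(f"unknown response type: {kind}")

response_value(100.0, {"type": "hill", "k": 100.0, "n": 1.0})  # 0.5 at half-saturation
```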

Objective functions

kpi_uplift

Maximises the expected change in KPI relative to the reference allocation. The metric for each candidate is:

$$\Delta\text{KPI}_d = f_d(\text{candidate}) - f_d(\text{reference})$$

evaluated across posterior draws $d$.

profit

Maximises expected profit, defined as:

$$\text{profit}_d = \text{value\_per\_kpi} \times \Delta\text{KPI}_d - \Delta\text{spend}$$

where $\Delta\text{spend} = \text{candidate total} - \text{reference total}$.

Risk-aware scoring

The risk scoring function determines how the distribution of objective draws is summarised into a single score for ranking candidates:

Risk type Score formula Use case
mean $\bar{m}$ Risk-neutral; maximises expected value
mean_minus_sd $\bar{m} - \lambda \cdot \sigma$ Penalises uncertainty; higher $\lambda$ is more conservative
quantile $Q_\alpha(m)$ Optimises the $\alpha$-quantile; directly targets worst-case outcomes
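The effect of risk scoring on candidate ranking is easiest to see with two toy candidates. A Python sketch (the quantile variant here uses a crude index-based quantile for illustration; the R implementation may interpolate differently):

```python
from statistics import mean, pstdev

def risk_score(draws, risk):
    """Collapse a candidate's objective draws to one ranking score."""
    if risk["type"] == "mean":
        return mean(draws)
    if risk["type"] == "mean_minus_sd":
        return mean(draws) - risk["lambda"] * pstdev(draws)
    if risk["type"] == "quantile":
        xs = sorted(draws)
        return xs[int(risk["quantile"] * (len(xs) - 1))]  # crude, no interpolation
    raise ValueError("unknown risk type")

volatile = [0.0, 20.0]   # mean 10, sd 10
steady   = [8.0, 10.0]   # mean 9,  sd 1
risk_score(volatile, {"type": "mean"})                          # 10.0: volatile wins
risk_score(volatile, {"type": "mean_minus_sd", "lambda": 0.5})  # 5.0
risk_score(steady,   {"type": "mean_minus_sd", "lambda": 0.5})  # 8.5: steady wins
```

Under risk-neutral `mean` scoring the volatile candidate ranks first; once uncertainty is penalised, the steadier candidate overtakes it.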

Coefficient extraction

BLM and pooled models

Coefficient draws are extracted via get_posterior() and indexed by the scenario’s channel terms.

Hierarchical models

For hierarchical MCMC models, the population-level (fixed-effect) beta draws are extracted directly from the Stan posterior. If the model was fitted with scale=TRUE, draws are back-transformed to the original scale before optimisation. This ensures that optimisation operates on the population effect rather than group-specific random-effect totals.

Draw thinning

If max_draws is specified, a random subsample of posterior draws is used for computational efficiency. The subsampling uses the configured seed for reproducibility.

Response-scale handling

Budget optimisation handles both identity and log response scales:

  • Identity response: $\Delta\text{KPI}$ is the difference in linear-predictor draws between candidate and reference allocations.
  • Log response: $\Delta\text{KPI}$ is computed via kpi_delta_from_link_levels(), which correctly accounts for the exponential back-transformation. If kpi_baseline is available, the delta is expressed in absolute KPI units; otherwise, it is expressed as a relative change.

The delta_kpi_from_link() and kpi_delta_from_link_levels() functions ensure Jensen-safe conversions by operating draw-wise.
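The draw-wise principle: exponentiate each draw of the linear predictor first, then difference and average on the KPI scale, never exponentiate the posterior mean. A Python sketch under stated assumptions (the R helpers may define the baseline conversion differently):

```python
import math
from statistics import mean

def kpi_delta_log_response(eta_cand, eta_ref, kpi_baseline=None):
    """Draw-wise KPI delta for a log-response model."""
    rel = [math.exp(c - r) - 1.0 for c, r in zip(eta_cand, eta_ref)]
    if kpi_baseline is None:
        return mean(rel)                          # relative change
    return mean(r * kpi_baseline for r in rel)    # absolute KPI units

eta_cand, eta_ref = [0.1, 0.3], [0.0, 0.0]
draw_wise = kpi_delta_log_response(eta_cand, eta_ref)
naive = math.exp(mean(eta_cand)) - math.exp(mean(eta_ref))  # Jensen-biased
draw_wise > naive   # True: exp of the mean understates the mean effect
```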

Feasible allocation generation

sample_feasible_allocation() generates random allocations that:

  1. Respect per-channel lower bounds.
  2. Respect per-channel upper bounds.
  3. Sum exactly to budget_total.

Allocation is performed by distributing remaining budget (after lower bounds) using exponential random weights, iteratively filling channels until the budget is exhausted. project_to_budget() ensures exact budget equality via proportional adjustment.
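One way to realise that scheme is the sketch below: lower bounds first, then the remainder split by exponential weights, capping each channel at its headroom and re-splitting until the budget is exhausted. This Python version is an assumption-laden illustration, not the R `sample_feasible_allocation()`:

```python
import random

def sample_feasible(lower, upper, total, rng):
    """Random allocation respecting bounds and summing to the budget."""
    assert sum(lower) <= total <= sum(upper)
    alloc = list(lower)
    open_idx = list(range(len(lower)))
    remaining = total - sum(alloc)
    while remaining > 1e-12 and open_idx:
        w = {i: rng.expovariate(1.0) for i in open_idx}
        total_w = sum(w.values())
        still_open = []
        for i in open_idx:
            # give each open channel its weighted share, capped at headroom
            take = min(remaining * w[i] / total_w, upper[i] - alloc[i])
            alloc[i] += take
            if alloc[i] < upper[i] - 1e-12:
                still_open.append(i)
        remaining = total - sum(alloc)
        open_idx = still_open
    return alloc

alloc = sample_feasible([10.0, 0.0, 0.0], [50.0, 30.0, 100.0], 100.0, random.Random(1))
# sums to the budget and respects every per-channel bound
```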

Output structure

optimise_budget() returns a budget_optimisation object containing:

Field Content
best_spend Named numeric vector of optimal per-channel spend
best_score Objective score of the best allocation
channel_summary Tibble with per-channel reference vs optimised spend, response, ROI, CPA, and deltas
curves List of per-channel response curve tibbles (spend grid × mean/lower/p50/upper)
points Tibble of reference and optimised points per channel with confidence intervals
impact Waterfall-style tibble of per-channel KPI contribution and interaction residual
objective_cfg Echo of the objective configuration
scenario Echo of the input scenario
model_metadata Model class, response scale, and scale flag

Runner integration

When allocation.enabled: true in YAML, the runner calls optimise_budget() after fitting and writes artefacts under 60_optimisation/:

Artefact Content
allocation_summary.csv Channel summary table
response_curves.csv Response curve data for all channels
allocation_impact.csv Waterfall impact breakdown
Plot PNGs Response curves, ROI/CPA panel, allocation waterfall, and other visual outputs

Constraints and guardrails

  • Budget feasibility: if channel lower bounds sum to more than budget_total, the engine aborts.
  • Upper bound capacity: if channel upper bounds cannot accommodate the full budget, the engine aborts.
  • Missing terms: if a scenario term is not found in the posterior coefficients, the engine aborts with a descriptive error.
  • Offset + scale combination: for bayes_lm_updater models, optimise_budget() aborts if scale=TRUE and an offset is present.

Cross-references