This vignette collects the questions early users most often hit when running pigauto on their own data. Each section follows the same template: Symptom, Why this happens, Diagnose, Fix, and See also (the relevant ?function, design memo, or Methodology bench under the Methodology navbar dropdown).
If you hit something that isn’t here and feels surprising, please open an issue; most of the items below were added because real users tripped on them.
Pitfall 1: “impute() and result$prediction$imputed looks like my input”
Symptom. You run
result <- impute(df, tree) on a fully-observed dataset
(e.g. the bundled avonet300) and read
result$prediction$imputed$Mass expecting “the imputed
values” — but the values look exactly like your input data, including
legitimately huge ones (a 25 kg rhea, a 12 kg vulture).
Why this happens. impute() only
imputes cells that are NA in the input. Your input
was fully observed, so nothing was imputed:
result$completed equals the input,
sum(result$imputed_mask) is zero, and
result$prediction$imputed contains the model’s prediction
for every cell — observed and missing alike. For
observed cells, the well-calibrated gate keeps the prediction close to
the input value, so what comes back is essentially the original data
passed through. The slot is intended for diagnostics (checking
calibration on training cells), not as the imputed-values output.
Diagnose.
library(pigauto)
data(avonet300, tree300)
df <- avonet300
rownames(df) <- df$Species_Key
df$Species_Key <- NULL
sum(is.na(df)) # if 0, there's nothing for impute() to do
Fix. Mask some cells before calling
impute(), then evaluate predictions only on the held-out
cells:
set.seed(1L)
hide <- sample(which(!is.na(df$Mass)), 30L)
df_obs <- df
df_obs$Mass[hide] <- NA # hide 30 mass values
result <- impute(df_obs, tree300)
result$completed$Mass[hide] # pigauto's imputations
df$Mass[hide] # held-out truth, for comparison
sum(result$imputed_mask[, "Mass"]) # 30
For your own data with real NAs, the imputed values you
actually care about are
result$completed[result$imputed_mask], not
result$prediction$imputed.
See also. ?impute (“What gets imputed
(read this first)”), issue #67.
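The mask-then-score pattern above can be sketched without pigauto at all. The block below is a minimal illustration in base R, assuming a hypothetical stand-in imputer (`mean_impute`, a column-mean fill) and toy data; none of these names come from the package. It only shows why scoring must be restricted to the held-out cells.

```r
# Hypothetical stand-in imputer: fills NAs with the column mean.
# It is NOT pigauto; it only mimics the shape of the workflow.
mean_impute <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

set.seed(1L)
truth <- exp(rnorm(100, mean = 5, sd = 1))  # toy positive "mass" values
hide  <- sample(seq_along(truth), 30L)      # cells to hold out

observed <- truth
observed[hide] <- NA                        # mask the held-out cells

completed    <- mean_impute(observed)
imputed_mask <- is.na(observed)             # TRUE only where a value was filled

# Score on the held-out cells only, here on the log scale
rmse_log <- sqrt(mean((log(completed[hide]) - log(truth[hide]))^2))
rmse_log

# Observed cells pass through unchanged, so including them in the score
# would only dilute the error with zeros
identical(completed[!imputed_mask], truth[!imputed_mask])
```

Swapping `mean_impute()` for a real imputer keeps the rest of the pattern unchanged: mask, impute, then compare only at `hide`.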
Symptom. You impute an ordinal trait and the
prediction is the majority class for every species. For example, on
avonet300$Migration (K = 3 ordinal: Resident / Partial /
Full), the predicted class counts come out as 300/0/0.
Why this happens. Two things compound:
1. If your input has no NAs in that column, there’s
nothing to impute externally (see Pitfall 1);
result$prediction$imputed$Migration reflects the model’s
calibrated-gate output, not new imputations.
2. Under the default settings (n_imputations = 1L,
pool_method = "median"), a small ordinal trait whose
marginal distribution is heavily skewed (AVONET Migration
is ~78 % Resident / 14 % Partial / 8 % Full at n = 300) can have its
calibrated gate snap to a corner that returns the majority class for
every species.
Diagnose.
library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL
table(df$Migration) # check the marginal distribution
result <- impute(df, tree300, verbose = FALSE)
table(result$prediction$imputed$Migration)
Fix. For imbalanced K-class ordinal traits, increase
n_imputations and switch to
pool_method = "mode" (Phase H). On the AVONET multi-seed
bench this gave +6.6 percentage-point accuracy on Migration
(K = 3) versus the default median pool.
set.seed(1L)
hide <- sample(which(!is.na(df$Migration)), 30L)
df_obs <- df
df_obs$Migration[hide] <- NA
# Default settings: prone to majority-class collapse on imbalanced K = 3
result <- impute(df_obs, tree300, verbose = FALSE)
table(result$completed$Migration[hide], df$Migration[hide])
# Recommended for K = 3 ordinal: more draws + mode pooling
result_mode <- impute(df_obs, tree300, n_imputations = 20L,
                      pool_method = "mode", verbose = FALSE)
table(result_mode$completed$Migration[hide], df$Migration[hide])
See also. ?impute (“Imbalanced K-class
traits”), Phase
H memo, issue #68.
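To see how median and mode pooling can disagree on an ordinal trait, here is a small base R sketch. It does not reproduce pigauto's internal pooling code; `pool_median` and `pool_mode` are hypothetical helpers, and the nine draws are invented for illustration.

```r
# Ordinal levels coded 1 = Resident, 2 = Partial, 3 = Full.
# Nine hypothetical draws for a single species (not pigauto output).
draws <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)

# Median pooling: take the middle draw on the ordinal scale
pool_median <- function(x) as.integer(round(median(x)))

# Mode pooling: take the most frequent class
# (which.max() returns the first maximum, so ties go to the lowest class)
pool_mode <- function(x) {
  tab <- table(x)
  as.integer(names(tab)[which.max(tab)])
}

pool_median(draws)  # the middle of the sorted draws
pool_mode(draws)    # the class that was drawn most often
```

With these draws the median lands on the middle class even though the most frequent class is the top one, which is the kind of disagreement that motivates trying pool_method = "mode" with more draws.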
Symptom. You expected the GNN to dominate, but
inspecting the fitted model shows the calibrated gate is fully or
near-fully closed (r_cal_gnn ≈ 0) — predictions equal the
BM baseline.
Why this happens. This is the safety-floor design
behaviour, not a bug. After training, pigauto picks the
per-latent-column gate that minimises validation loss over blends
\(r_{\text{BM}} \cdot \text{BM} + r_{\text{GNN}}
\cdot \text{GNN} + r_{\text{MEAN}} \cdot \text{MEAN}\), with the weights constrained to the simplex. When the
GNN cannot beat BM on the held-out validation set, the optimum can be
r_cal_gnn = 0. In that case the calibrated prediction stays
on the validation-supported baseline or mean corner instead of forcing a
GNN contribution. This is what the package was designed to do on
high-phylogenetic-signal traits where BM is already hard to beat.
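The gate selection described above can be illustrated with a coarse grid search over the simplex. This is a sketch under stated assumptions, not the package's calibration routine: the toy predictors, the 0.05 grid, the mean-squared-error loss, and the helper `val_mse` are all invented here.

```r
set.seed(42L)
n     <- 200L
truth <- rnorm(n)                    # toy held-out validation values
bm    <- truth + rnorm(n, sd = 0.3)  # baseline that is already accurate
gnn   <- truth + rnorm(n, sd = 1.0)  # noisier competitor
mu    <- rep(mean(truth), n)         # grand-mean predictor

# Validation loss of one blend (mean squared error is an assumption)
val_mse <- function(r_bm, r_gnn, r_mean) {
  mean((r_bm * bm + r_gnn * gnn + r_mean * mu - truth)^2)
}

# Enumerate weights on a 0.05 grid over the simplex:
# non-negative and summing to 1
steps <- seq(0, 1, by = 0.05)
grid  <- expand.grid(r_bm = steps, r_gnn = steps)
grid  <- grid[grid$r_bm + grid$r_gnn <= 1 + 1e-9, ]
grid$r_mean <- pmax(0, 1 - grid$r_bm - grid$r_gnn)
grid$mse    <- mapply(val_mse, grid$r_bm, grid$r_gnn, grid$r_mean)

# The selected gate is the grid point with the lowest validation loss
best <- grid[which.min(grid$mse), ]
best
```

Because the three corners of the simplex are part of the grid, the selected blend can never do worse on the validation set than pure BM, pure GNN, or the pure mean; that is the sense in which a closed GNN gate is a floor rather than a failure.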
Diagnose.
library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL
fit <- impute(df, tree300, verbose = FALSE)$fit
# Per-latent-column calibrated gates (since v0.9.1.9002):
fit$r_cal_bm # r assigned to the BM baseline
fit$r_cal_gnn # r assigned to the GNN delta
fit$r_cal_mean # r assigned to the grand mean
A row where r_cal_gnn is small (< 0.1) means the gate
has effectively closed for that latent column.
Fix. Often there is nothing to fix: the closed gate is evidence of high phylogenetic signal, not a problem. If you suspect the GNN should be helping (e.g. you’ve added covariates, or the trait has known cross-trait structure) but the gate is closed, see ?fit_pigauto (“Calibration at small n”).
See also. ?fit_pigauto
(phylo_signal_gate, “Safety floor”), design
spec.
Symptom. You aren’t sure whether pigauto’s BM kriging baseline will outperform a simple mean impute on your dataset.
Why this matters. pigauto’s BM baseline buys you accuracy in proportion to the phylogenetic signal in the trait. At Pagel’s λ ≈ 0 (no signal), BM kriging reduces to the mean across species, so pigauto won’t beat a simple mean baseline; at λ ≈ 1 (strong signal), BM kriging materially outperforms the mean. The Phase 8 signal-strength sweep shows the crossover empirically (re-running it locally produces the evidence; the deployed Methodology dropdown surfaces it once the bench HTML is regenerated).
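The two limits can be checked on a toy example. The sketch below is a textbook BM kriging predictor with a Pagel's λ rescaling of the off-diagonal covariances, written in base R; it is not pigauto's implementation, and the covariance matrix, trait values, and the helper `bm_krige` are assumptions made for illustration.

```r
# Toy BM covariance for 4 species on an ultrametric tree (((A,B),C),D)
# of depth 1: entries are shared branch lengths
C <- matrix(c(1.0, 0.8, 0.4, 0.0,
              0.8, 1.0, 0.4, 0.0,
              0.4, 0.4, 1.0, 0.0,
              0.0, 0.0, 0.0, 1.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))
y <- c(A = NA, B = 2.0, C = 1.0, D = -1.5)  # species A is missing

bm_krige <- function(C, y, lambda) {
  C_l <- C * lambda          # Pagel's lambda shrinks the off-diagonals...
  diag(C_l) <- diag(C)       # ...and leaves the variances untouched
  obs  <- !is.na(y)
  C_oo_inv <- solve(C_l[obs, obs, drop = FALSE])
  ones <- rep(1, sum(obs))
  # GLS estimate of the root mean from the observed species
  mu <- drop(ones %*% C_oo_inv %*% y[obs]) / drop(ones %*% C_oo_inv %*% ones)
  # Conditional expectation of the missing species given the observed ones
  pred <- mu + C_l[!obs, obs, drop = FALSE] %*% C_oo_inv %*% (y[obs] - mu)
  as.numeric(pred)
}

bm_krige(C, y, lambda = 0)  # equals mean(y, na.rm = TRUE)
bm_krige(C, y, lambda = 1)  # pulled toward sister species B
```

At λ = 0 the missing species shares no covariance with anything observed, so the prediction collapses to the plain mean; at λ = 1 the close relative dominates.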
Diagnose. The fitted object stores the per-trait λ
values used by phylo_signal_gate:
library(pigauto)
data(avonet300, tree300)
df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL
fit <- impute(df, tree300)$fit
fit$phylo_signal_per_trait
The output reports λ on the observed cells where it can be estimated. Discrete traits use the package’s internal continuous proxy on the latent/liability scale.
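Fix. Use the λ estimate to set expectations: near λ ≈ 0, expect little or no gain over a simple mean impute; near λ ≈ 1, expect BM kriging to outperform the mean materially.
See also. ?fit_pigauto
(phylo_signal_gate), Phase
8 signal sweep memo.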
Symptom. A masked log-transformed continuous trait (body mass, seed mass, fish weight) predicts a value 50–100× larger than anything observed. On AVONET, the canonical case is the cassowary: truth ≈ 35 kg, predicted up to ~540 kg.
Why this happens. For log-transformed traits, the
GNN’s MC-dropout draws are on the log scale. A latent draw roughly 3-4 σ
above the training distribution becomes a ~50-100× error on the original
scale after expm1() back-transformation. With
n_imputations = 1, a single unlucky dropout pattern can
produce this. With pool_method = "median" (the default), the
median of M draws is robust to one bad draw, but a small M (≤ 5) on the
long tail of the latent distribution can still mis-pool.
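The arithmetic behind this is easy to reproduce in base R. The sketch below uses invented numbers (a 35 kg truth expressed in grams and hand-picked log-scale offsets); it assumes a log1p/expm1 transform pair and does not call pigauto.

```r
truth_g <- 35000                 # ~35 kg in grams (the unit is an assumption)
z_true  <- log1p(truth_g)        # value on the log scale

# Five hypothetical log-scale draws; the third is the unlucky one
draws <- z_true + c(-0.05, 0.02, 4.0, 0.04, -0.01)

single_bad <- expm1(draws[3])       # what a single unlucky draw returns
pooled     <- expm1(median(draws))  # median pooling shrugs off one bad draw

single_bad / truth_g  # an additive log offset becomes a multiplicative error
pooled / truth_g      # stays close to 1

# With only three draws, two of them in the tail, the median is itself a
# tail draw and the pooled value is still far off
mis_pooled <- expm1(median(z_true + c(4.0, 3.8, 0.0)))
mis_pooled / truth_g
```

An offset of 4 on the log scale multiplies the back-transformed value by roughly exp(4), which is why a latent error that looks moderate turns into an implausible mass.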
Diagnose.
# After running impute(), check whether any imputation exceeds the
# observed maximum by an unrealistic factor:
predicted_mass <- result$completed$Mass[result$imputed_mask[, "Mass"]]
obs_max <- max(df$Mass, na.rm = TRUE)
sum(predicted_mass > 5 * obs_max)
A non-zero count is a signal of tail extrapolation.
Fix. The Phase G option clamp_outliers = TRUE caps
post-back-transform predictions for log-transformed continuous, count,
and zi_count magnitude traits at
obs_max * clamp_factor (default 5). It is opt-in because
on some datasets (e.g. legitimate growth curves) values above 5× the
observed maximum are plausible, and capping them would be wrong.
result <- impute(df_obs, tree300,
                 clamp_outliers = TRUE,
                 clamp_factor = 5, # Tukey-style outlier cap
                 verbose = FALSE)
See also. ?impute
(clamp_outliers, clamp_factor arguments), AVONET
Mass diagnosis memo, Phase
G results.
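The cap itself amounts to one line of arithmetic. The following base R sketch applies the rule stated above (obs_max * clamp_factor) to toy numbers; `clamp_to_obs_max` is a hypothetical helper, not a pigauto function, and the package may handle details such as per-trait maxima differently.

```r
# Cap predictions at clamp_factor times the largest observed value
clamp_to_obs_max <- function(pred, observed, clamp_factor = 5) {
  cap <- max(observed, na.rm = TRUE) * clamp_factor
  pmin(pred, cap)
}

observed <- c(0.02, 1.5, 12, 35, NA)  # toy observed masses, in kg
pred     <- c(0.5, 40, 540)           # toy back-transformed predictions

clamp_to_obs_max(pred, observed)      # only the 540 is pulled down to the cap
```

Predictions below the cap pass through unchanged, so the clamp only affects the tail cases that the diagnostic above would flag.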