Real comparative datasets contain many kinds of traits: body mass (continuous), clutch size (count), migratory status (yes/no), diet type (carnivore/herbivore/omnivore), threat status (LC < VU < EN). pigauto handles all of them in a single model — you do not need separate imputation runs for different column types.

| Type | R class | Example | Notes |
|---|---|---|---|
| Continuous | numeric | Body mass, wing length | Auto-detected |
| Count | integer | Clutch size, litter size | Auto-detected |
| Binary | factor (2 levels) | Migratory yes/no | Auto-detected |
| Categorical | factor (>2 levels) | Diet type, lifestyle | Auto-detected |
| Ordinal | ordered | Threat status (LC < VU < EN) | Auto-detected |
| Proportion | numeric (0–1) | Habitat cover, diet fraction | Requires `trait_types = "proportion"` override |
| ZI count | integer (zero-inflated) | Parasite load, rare behaviour | Requires `trait_types = "zi_count"` override; experimental, with accuracy more variable than other types |
| Multi-proportion | K numeric columns summing to 1 | Diet composition, plumage-colour fractions, microbiome relative abundances | Requires `multi_proportion_groups = list(<name> = c("col1", ..., "colK"))`; encoded via centred log-ratio (CLR) + per-component z-score |

The first five rows in this table are auto-detected from R column
class; proportion, zi_count, and
multi_proportion must be declared explicitly
(trait_types or multi_proportion_groups). All
eight share the same latent space — the phylogenetic baseline and GNN
correction both operate in this space, and type-specific logic appears
only at encoding, loss computation, and decoding.
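
The CLR + z-score encoding named in the table can be illustrated in base R. This is a minimal sketch on made-up data, not pigauto's internal code; the matrix diet_frac and the helper clr() are hypothetical names introduced only for this example.

```r
# Made-up data: 4 species x 3 diet components, each row sums to 1
diet_frac <- rbind(
  c(0.70, 0.20, 0.10),
  c(0.10, 0.60, 0.30),
  c(0.25, 0.25, 0.50),
  c(0.40, 0.40, 0.20)
)
colnames(diet_frac) <- c("plants", "insects", "seeds")

# Centred log-ratio: log of each component minus the row mean of the logs.
# Assumes strictly positive entries; zeros would need a small offset first.
clr <- function(p) {
  lp <- log(p)
  lp - rowMeans(lp)
}
z <- clr(diet_frac)

# Per-component z-score: centre and scale each CLR column
z_scaled <- scale(z)

# Decoding reverses the steps: undo the z-score, then apply a softmax
mu <- attr(z_scaled, "scaled:center")
s <- attr(z_scaled, "scaled:scale")
z_back <- sweep(sweep(z_scaled, 2, s, "*"), 2, mu, "+")
p_back <- exp(z_back) / rowSums(exp(z_back))
```

Each CLR row sums to zero, and the softmax of a CLR vector recovers the original composition, so p_back matches diet_frac up to floating-point error.
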

```r
library(pigauto)
library(ape)

set.seed(42)
n <- 60
tree <- rtree(n)
traits <- data.frame(
  row.names = tree$tip.label,
  mass   = exp(rnorm(n, 3, 0.5)),
  clutch = as.integer(rpois(n, 3) + 1L),
  migr   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  diet   = factor(sample(c("herb", "carn", "omni"), n, replace = TRUE)),
  threat = ordered(sample(c("LC", "VU", "EN"), n, replace = TRUE),
                   levels = c("LC", "VU", "EN"))
)
```

preprocess_traits() auto-detects column types from R classes:
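
As a rough illustration of how class-based detection can work, the base-R sketch below maps each column to a type with a hypothetical helper, guess_trait_type(). It is not pigauto's implementation; the rules are simply read off the table above.

```r
# Hypothetical helper, not part of pigauto: map an R column to a trait type
guess_trait_type <- function(x) {
  if (is.ordered(x)) return("ordinal")      # test before is.factor()
  if (is.factor(x)) {
    return(if (nlevels(x) == 2L) "binary" else "categorical")
  }
  if (is.integer(x)) return("count")        # test before is.numeric()
  if (is.numeric(x)) return("continuous")
  stop("unsupported column class: ", class(x)[1])
}

# Small made-up data frame with one column per auto-detected type
toy <- data.frame(
  mass   = c(12.5, 30.1, 8.4),
  clutch = c(2L, 4L, 3L),
  migr   = factor(c("no", "yes", "no")),
  diet   = factor(c("herb", "carn", "omni")),
  threat = ordered(c("LC", "EN", "VU"), levels = c("LC", "VU", "EN"))
)
detected <- vapply(toy, guess_trait_type, character(1))
print(detected)
```

The order of the checks matters: an ordered factor is also a factor, and an integer vector is also numeric, so the more specific class has to be tested first.
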
The trait_map records each trait’s type, levels, latent
column indices, and normalisation parameters:
When trait_map is supplied,
make_missing_splits() operates at the original-trait level.
For categorical traits, all K one-hot columns are held out together:
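
Holding out a categorical trait as a unit can be sketched in base R as follows. The objects latent_cols and held are hypothetical stand-ins for the information a trait map would carry; this is not the code behind make_missing_splits().

```r
diet <- factor(c("herb", "carn", "omni", "carn", "herb", "omni"))
mass <- c(2.1, 3.4, 1.8, 2.9, 3.1, 2.2)

# Latent matrix: one column for mass, K = 3 one-hot columns for diet
onehot <- model.matrix(~ diet - 1)
X <- cbind(mass = mass, onehot)

# Hypothetical map from trait name to latent column indices
latent_cols <- list(mass = 1L, diet = 2:4)

# Held-out cells are chosen at the trait level: (species row, trait) pairs
held <- data.frame(row = c(2L, 5L), trait = c("diet", "diet"),
                   stringsAsFactors = FALSE)

X_masked <- X
for (i in seq_len(nrow(held))) {
  cols <- latent_cols[[held$trait[i]]]
  X_masked[held$row[i], cols] <- NA   # mask all K columns of the trait at once
}
```

Masking the one-hot columns independently would leak the answer: if only one of the K columns is hidden, the remaining ones reveal the class.
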
The baseline uses a phylogenetic conditional multivariate normal (MVN) for continuous-family latent columns, and label-propagation or threshold/liability candidates for discrete-family columns:
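
For the continuous case, the conditional-MVN idea can be sketched in base R. The covariance matrix below is made up for illustration (for a real tree, ape's vcv() returns the Brownian-motion covariance), the rate parameter is fixed at 1, and none of this is pigauto's baseline code.

```r
# Phylogenetic covariance for 4 species under Brownian motion: entry [i, j]
# is the branch length shared by species i and j (made-up values).
sp <- paste0("sp", 1:4)
V <- matrix(c(
  1.0, 0.8, 0.3, 0.3,
  0.8, 1.0, 0.3, 0.3,
  0.3, 0.3, 1.0, 0.6,
  0.3, 0.3, 0.6, 1.0
), nrow = 4, byrow = TRUE, dimnames = list(sp, sp))

y <- c(sp1 = 1.2, sp2 = NA, sp3 = -0.4, sp4 = -0.1)   # sp2 is missing
obs <- !is.na(y)

# Generalised least squares estimate of the root mean from observed species
Voo_inv <- solve(V[obs, obs])
one <- rep(1, sum(obs))
mu <- as.numeric((t(one) %*% Voo_inv %*% y[obs]) /
                 (t(one) %*% Voo_inv %*% one))

# Conditional mean and variance of the missing species given the observed
Vmo <- V[!obs, obs, drop = FALSE]
cond_mean <- mu + Vmo %*% Voo_inv %*% (y[obs] - mu)
cond_var <- V[!obs, !obs, drop = FALSE] - Vmo %*% Voo_inv %*% t(Vmo)
```

Because sp2 shares most of its history with sp1, its conditional variance is well below the marginal variance of 1; a species with no close observed relatives would retain most of its marginal variance.
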
fit_pigauto() uses type-specific losses and trait-level
corruption masking automatically when a trait_map is
present:
predict() decodes latent predictions back to original
types:
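
What decoding means for each type can be sketched with a few hypothetical base-R helpers. The transforms here (log scale plus z-score for continuous traits, a logit for binary traits, a softmax over K columns for categorical traits) are assumptions made for illustration and may differ from what predict() does internally.

```r
# Numerically stable softmax over one vector of latent scores
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

# Continuous: undo the z-score, then undo an assumed log transform
decode_continuous <- function(z, mean_log, sd_log) exp(z * sd_log + mean_log)

# Binary: logistic of the latent value gives P(second level)
decode_binary <- function(z, levels) {
  p <- plogis(z)
  list(prob = p,
       label = factor(ifelse(p > 0.5, levels[2], levels[1]), levels = levels))
}

# Categorical: softmax across the K latent columns, then the most likely class
decode_categorical <- function(Z, levels) {
  P <- t(apply(Z, 1, softmax))
  colnames(P) <- levels
  list(prob = P,
       label = factor(levels[max.col(P, ties.method = "first")],
                      levels = levels))
}

mass_hat <- decode_continuous(c(-1, 0, 1), mean_log = 3, sd_log = 0.5)
migr_hat <- decode_binary(c(-2, 0.3), levels = c("no", "yes"))
diet_hat <- decode_categorical(rbind(c(2, 0, -1), c(0, 0.5, 3)),
                               levels = c("carn", "herb", "omni"))
```
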
For binary and categorical traits, class probabilities are available:

```r
# Binary: probability of "yes"
head(pred$probabilities$migr)

# Categorical: probability of each diet class
head(pred$probabilities$diet)
```

The SE matrix provides type-appropriate uncertainty:
evaluate_imputation() dispatches type-specific
metrics:
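
The idea of dispatching on trait type can be sketched with a hypothetical scoring helper. The metrics chosen here (RMSE and MAE for numeric traits, accuracy for factors) are assumptions for illustration and are not necessarily the set that evaluate_imputation() reports.

```r
# Hypothetical helper: choose metrics from the class of the true values
score_trait <- function(truth, pred) {
  if (is.factor(truth)) {
    c(accuracy = mean(as.character(truth) == as.character(pred)))
  } else {
    err <- truth - pred
    c(rmse = sqrt(mean(err^2)), mae = mean(abs(err)))
  }
}

# Made-up held-out cells and their imputed values
truth_cells <- list(mass = c(10, 20, 30),
                    diet = factor(c("herb", "carn", "omni")))
pred_cells <- list(mass = c(12, 18, 33),
                   diet = factor(c("herb", "omni", "omni")))

scores <- Map(score_trait, truth_cells, pred_cells)
```
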
The recommended workflow uses multi_impute() to generate
M complete datasets from the model’s calibrated uncertainty
distribution, then pools downstream coefficients with Rubin’s rules via
with_imputations() and pool_mi().
draws_method = "conformal" is the default: missing cells
are drawn from Normal distributions centred on the point estimate with
width set by the split-conformal calibration score. The alternative
draws_method = "mc_dropout" runs M stochastic GNN forward
passes with dropout active, using Brownian-motion (BM) posterior draws
as the blend baseline; it is available for comparison.
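
The conformal drawing scheme can be sketched in base R on simulated residuals. How the package converts the conformal score into a Normal width is not stated here, so the conversion below (half-width divided by the matching Normal quantile) is an assumption, as are all object names.

```r
set.seed(42)

# Calibration residuals |truth - prediction| on held-out cells (simulated)
cal_resid <- abs(rnorm(200, sd = 0.4))
alpha <- 0.05
n_cal <- length(cal_resid)

# Split-conformal score: finite-sample-adjusted (1 - alpha) quantile
k <- ceiling((n_cal + 1) * (1 - alpha))
q_hat <- sort(cal_resid)[min(k, n_cal)]

# Assumption: choose the Normal SD so that +/- q_hat spans a 1 - alpha interval
sd_draw <- q_hat / qnorm(1 - alpha / 2)

# Hypothetical point estimates for two missing cells
point_est <- c(sp7 = 2.3, sp12 = 1.1)

# M = 20 draws per missing cell, one column per cell
m <- 20L
draws <- sapply(point_est, function(mu) rnorm(m, mean = mu, sd = sd_draw))
```

Each row of draws would fill the missing cells of one completed dataset.
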

```r
# Generate M = 20 stochastic complete datasets
mi <- multi_impute(fit$data$X_original, tree, m = 20L)

# Fit a downstream model to each
fits <- with_imputations(mi, function(d) {
  glm(Migratory ~ log(Mass) + Diet, data = d, family = binomial)
})

# Pool with Rubin's rules
pool_mi(fits)
```

pool_mi() returns a tidy data.frame with
estimate, std.error, p.value,
df (Barnard-Rubin degrees of freedom), fmi
(fraction of missing information), and riv (relative
increase in variance) per coefficient.
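
For readers who want to see where these quantities come from, the standard Rubin's-rules formulas for a single coefficient can be written in a few lines of base R. The function pool_rubin() is a hypothetical illustration, not the source of pool_mi(), and the estimates and standard errors are made up.

```r
# Pool one coefficient across m imputations (hypothetical helper).
# est, se: per-imputation estimates and standard errors
# df_com:  complete-data residual degrees of freedom
pool_rubin <- function(est, se, df_com = Inf) {
  m <- length(est)
  q_bar <- mean(est)                  # pooled estimate
  u_bar <- mean(se^2)                 # within-imputation variance
  b <- var(est)                       # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b    # total variance
  riv <- (1 + 1 / m) * b / u_bar      # relative increase in variance
  lambda <- (1 + 1 / m) * b / t_var
  df <- (m - 1) / lambda^2            # Rubin's large-sample df
  if (is.finite(df_com)) {            # Barnard-Rubin small-sample adjustment
    df_obs <- (df_com + 1) / (df_com + 3) * df_com * (1 - lambda)
    df <- df * df_obs / (df + df_obs)
  }
  fmi <- (riv + 2 / (df + 3)) / (1 + riv)
  std_error <- sqrt(t_var)
  data.frame(estimate = q_bar, std.error = std_error,
             p.value = 2 * pt(-abs(q_bar / std_error), df),
             df = df, fmi = fmi, riv = riv)
}

est <- c(0.50, 0.62, 0.55, 0.47, 0.58)   # made-up coefficient estimates
se <- c(0.20, 0.21, 0.19, 0.22, 0.20)    # made-up standard errors
pooled <- pool_rubin(est, se, df_com = 55)
```

The pooled standard error exceeds the average within-imputation standard error because it also carries the between-imputation spread, which is how imputation uncertainty reaches the downstream inference.
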
For reference, the lower-level
predict(fit, n_imputations = 5L) interface returns M
complete datasets directly (without downstream pooling):