Mixed-Type Trait Imputation

Overview

Real comparative datasets mix many kinds of traits: body mass (continuous), clutch size (count), migratory status (yes/no), diet type (carnivore/herbivore/omnivore), threat status (LC < VU < EN). pigauto handles all of them in a single model, so you do not need separate imputation runs for different column types.

| Type | R class | Example | Notes |
|---|---|---|---|
| Continuous | numeric | Body mass, wing length | Auto-detected |
| Count | integer | Clutch size, litter size | Auto-detected |
| Binary | factor (2 levels) | Migratory yes/no | Auto-detected |
| Categorical | factor (>2 levels) | Diet type, lifestyle | Auto-detected |
| Ordinal | ordered | Threat status (LC < VU < EN) | Auto-detected |
| Proportion | numeric (0–1) | Habitat cover, diet fraction | Requires trait_types = "proportion" override |
| ZI count | integer (zero-inflated) | Parasite load, rare behaviour | Requires trait_types = "zi_count" override; experimental, with accuracy more variable than other types |
| Multi-proportion | K numeric columns summing to 1 | Diet composition, plumage-colour fractions, microbiome relative abundances | Requires multi_proportion_groups = list(<name> = c("col1", ..., "colK")); encoded via centred log-ratio (CLR) + per-component z-score |

The first five rows of this table are auto-detected from the R column class; proportion, zi_count, and multi_proportion must be declared explicitly, via trait_types for the first two and multi_proportion_groups for the third. All eight types share the same latent space: the phylogenetic baseline and the GNN correction both operate in it, and type-specific logic appears only at encoding, loss computation, and decoding.
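To make the multi-proportion encoding concrete, here is a minimal base-R sketch of a centred log-ratio transform followed by a per-component z-score, plus the reverse steps. It illustrates the general technique only and is not pigauto's internal code: the matrix `comp` and the helper `clr()` are made up for this example, and zero fractions (which CLR cannot handle without an adjustment) are deliberately avoided.

```r
# Toy compositions: 3 species x 3 diet fractions, each row sums to 1.
comp <- rbind(sp1 = c(0.6, 0.3, 0.1),
              sp2 = c(0.2, 0.5, 0.3),
              sp3 = c(0.1, 0.1, 0.8))

# Centred log-ratio: log of each part minus the mean log of its row.
# (hypothetical helper, not a pigauto function)
clr <- function(p) {
  lp <- log(p)
  lp - rowMeans(lp)
}

z        <- clr(comp)   # each row of z sums to 0
z_scaled <- scale(z)    # column-wise z-score, one per component

# Decoding reverses the steps: undo the z-score, then apply a softmax.
z_back <- sweep(z_scaled, 2, attr(z_scaled, "scaled:scale"), `*`)
z_back <- sweep(z_back, 2, attr(z_scaled, "scaled:center"), `+`)
comp_back <- exp(z_back) / rowSums(exp(z_back))
```

Because the softmax inverts CLR exactly, `comp_back` reproduces `comp` up to floating-point error, which is why a composition can round-trip through an unconstrained latent space.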

Synthetic example

library(pigauto)
library(ape)

set.seed(42)
n <- 60
tree <- rtree(n)

traits <- data.frame(
  row.names = tree$tip.label,
  mass      = exp(rnorm(n, 3, 0.5)),
  clutch    = as.integer(rpois(n, 3) + 1L),
  migr      = factor(sample(c("no", "yes"), n, replace = TRUE)),
  diet      = factor(sample(c("herb", "carn", "omni"), n, replace = TRUE)),
  threat    = ordered(sample(c("LC", "VU", "EN"), n, replace = TRUE),
                      levels = c("LC", "VU", "EN"))
)

Preprocessing

preprocess_traits() auto-detects column types from R classes:

pd <- preprocess_traits(traits, tree)
print(pd)
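The class-based rules from the table above can be sketched in a few lines of base R. This is an illustration of the mapping only, under the assumption that detection depends solely on column class and number of levels; `detect_type()` is a hypothetical helper, not part of pigauto.

```r
# Hypothetical helper mirroring the auto-detection table.
# Order matters: ordered factors are also factors, and integers are also numeric.
detect_type <- function(x) {
  if (is.ordered(x)) return("ordinal")
  if (is.factor(x))  return(if (nlevels(x) == 2L) "binary" else "categorical")
  if (is.integer(x)) return("count")
  if (is.numeric(x)) return("continuous")
  NA_character_
}

toy <- data.frame(
  mass   = c(12.5, 30.1, 8.2),
  clutch = c(2L, 4L, 3L),
  migr   = factor(c("no", "yes", "no")),
  diet   = factor(c("herb", "carn", "omni")),
  threat = ordered(c("LC", "EN", "VU"), levels = c("LC", "VU", "EN"))
)

types <- vapply(toy, detect_type, character(1))
print(types)
```

A rule like this cannot distinguish a proportion from any other numeric column, or a zero-inflated count from an ordinary integer column, which is consistent with those types needing an explicit override.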

The trait_map records each trait’s type, levels, latent column indices, and normalisation parameters:

str(pd$trait_map, max.level = 1)
pd$trait_map$diet

Creating splits

When trait_map is supplied, make_missing_splits() operates at the original-trait level. For categorical traits, all K one-hot columns are held out together:

spl <- make_missing_splits(pd$X_scaled, missing_frac = 0.20,
                           seed = 1, trait_map = pd$trait_map)
cat("Val cells (latent):", length(spl$val_idx), "\n")
cat("Test cells (latent):", length(spl$test_idx), "\n")
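The idea of holding out all K one-hot columns together can be sketched in base R as follows. This is a simplified illustration, not the logic of make_missing_splits(): the `latent_cols` map, the toy dimensions, and the use of column-major linear indices are all assumptions made for the example.

```r
set.seed(1)

# Toy latent layout: 6 species x 4 latent columns.
# Column 1 encodes "mass"; columns 2-4 are the one-hot encoding of "diet".
latent_cols <- list(mass = 1L, diet = 2:4)
n_species   <- 6L
n_latent    <- 4L

# Sample held-out cells at the (species, trait) level, not the latent-cell level.
pairs <- expand.grid(row = seq_len(n_species),
                     trait = names(latent_cols),
                     stringsAsFactors = FALSE)
held <- pairs[sample(nrow(pairs), 3L), ]

# Expand each held-out pair to the linear indices of all its latent cells.
to_cells <- function(row, trait) (latent_cols[[trait]] - 1L) * n_species + row
cells <- unlist(Map(to_cells, held$row, held$trait))

mask <- matrix(FALSE, n_species, n_latent)
mask[cells] <- TRUE
print(mask)
```

In the resulting mask, a species' three diet columns are always either all held out or all observed, so the model never sees part of a one-hot vector for a cell it is asked to predict.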

Baseline fitting

The baseline uses a phylogenetic conditional multivariate normal (MVN) for continuous-family latent columns, and label-propagation or threshold/liability candidates for discrete-family columns:

bl <- fit_baseline(pd, tree, splits = spl)
dim(bl$mu)
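For intuition about the continuous-family baseline, the following sketch computes a conditional MVN prediction for one trait under Brownian motion, using only ape and base R. It is a textbook version of the technique under stated assumptions (a single trait, unit evolutionary rate, a GLS estimate of the root mean) and is not the code inside fit_baseline(); all object names are illustrative.

```r
library(ape)

set.seed(1)
tree <- rtree(8)
C <- vcv(tree)                           # BM covariance from shared branch lengths
x <- setNames(rnorm(8), tree$tip.label)  # toy trait values, one per tip

miss <- tree$tip.label[1:2]              # pretend these two tips are unobserved
obs  <- setdiff(tree$tip.label, miss)

C_oo <- C[obs, obs, drop = FALSE]
C_mo <- C[miss, obs, drop = FALSE]

# GLS estimate of the root mean from the observed tips.
w      <- solve(C_oo, rep(1, length(obs)))
mu_hat <- sum(w * x[obs]) / sum(w)

# Conditional mean and covariance of the missing tips given the observed ones
# (covariance is expressed at unit rate; multiply by sigma^2 in practice).
cond_mean <- mu_hat + C_mo %*% solve(C_oo, x[obs] - mu_hat)
cond_cov  <- C[miss, miss] - C_mo %*% solve(C_oo, t(C_mo))

print(cond_mean)
print(sqrt(diag(cond_cov)))
```

Tips with close observed relatives get a conditional mean pulled toward those relatives and a smaller conditional variance, which is the behaviour a phylogenetic baseline is meant to capture.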

Training

fit_pigauto() uses type-specific losses and trait-level corruption masking automatically when a trait_map is present:

fit <- fit_pigauto(
  pd, tree,
  splits = spl,
  epochs = 200L,
  eval_every = 50L,
  patience = 5L,
  verbose = FALSE,
  seed = 1
)
print(fit)

Prediction and decoding

predict() decodes latent predictions back to original types:

pred <- predict(fit, return_se = TRUE)
head(pred$imputed)

For binary and categorical traits, class probabilities are available:

# Binary: probability of "yes"
head(pred$probabilities$migr)

# Categorical: probability of each diet class
head(pred$probabilities$diet)
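One common way to obtain such class probabilities from unconstrained latent values is a logistic function for binary traits and a softmax for categorical traits. The sketch below shows that generic decoding step; whether pigauto's decoder works exactly this way is an assumption, and the scores and the `softmax()` helper are invented for the example.

```r
# Hypothetical helper: numerically stable softmax over a score vector.
softmax <- function(s) {
  e <- exp(s - max(s))
  e / sum(e)
}

# Categorical: one latent score per diet class for a single species.
scores <- c(herb = 1.2, carn = -0.3, omni = 0.4)
probs  <- softmax(scores)
label  <- names(probs)[which.max(probs)]   # most probable class

# Binary: a single latent score mapped to P("yes") with the logistic function.
p_yes <- plogis(0.8)

print(round(probs, 3))
print(label)
print(round(p_yes, 3))
```

Reporting the full probability vector rather than only the decoded label lets downstream code distinguish a confident call from a near tie between classes.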

The SE matrix provides type-appropriate uncertainty:

head(pred$se)

Evaluation

evaluate_imputation() dispatches type-specific metrics:

ev <- evaluate_imputation(pred, pd$X_scaled, spl)
print(ev)
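Which metrics evaluate_imputation() reports is not listed here, but the idea of type-specific metrics can be illustrated with two typical choices: RMSE for a continuous trait and accuracy for a categorical one. The vectors below are invented held-out values and predictions, used only to show the calculation.

```r
# Continuous trait: root mean squared error on held-out cells.
truth_mass <- c(20.1, 35.4, 12.0, 50.3)
pred_mass  <- c(22.0, 33.1, 15.5, 47.9)
rmse <- sqrt(mean((pred_mass - truth_mass)^2))

# Categorical trait: proportion of held-out cells assigned the correct class.
lv <- c("herb", "carn", "omni")
truth_diet <- factor(c("herb", "carn", "omni", "herb"), levels = lv)
pred_diet  <- factor(c("herb", "omni", "omni", "herb"), levels = lv)
accuracy <- mean(pred_diet == truth_diet)   # 3 of 4 correct

print(c(rmse = rmse, accuracy = accuracy))
```

Using one metric for every column would be misleading: an RMSE on one-hot columns or an accuracy on body mass has no natural interpretation.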

Multiple imputation for downstream inference

The recommended workflow uses multi_impute() to generate M complete datasets from the model’s calibrated uncertainty distribution, then pools downstream coefficients with Rubin’s rules via with_imputations() and pool_mi().

The default, draws_method = "conformal", draws missing cells from Normal distributions centred on the point estimate, with width set by the split-conformal calibration score. The alternative, draws_method = "mc_dropout", runs M stochastic GNN forward passes with dropout active and uses Brownian-motion (BM) posterior draws as the blend baseline; it is available for comparison.
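The conformal draw can be pictured with a short base-R sketch: compute a split-conformal score from calibration residuals, turn it into a standard deviation, and sample around the point estimate. The conversion from score to standard deviation used here (dividing by the matching Normal quantile) is an assumption made for illustration, as are all data and variable names; the text above does not specify how pigauto maps the score to a width.

```r
set.seed(1)

# Calibration cells: known values and the model's predictions for them (toy data).
calib_truth <- rnorm(200)
calib_pred  <- calib_truth + rnorm(200, sd = 0.5)
scores      <- abs(calib_truth - calib_pred)

# Split-conformal score: the ceiling((n + 1)(1 - alpha))-th smallest residual.
alpha <- 0.05
n_cal <- length(scores)
k     <- ceiling((n_cal + 1) * (1 - alpha))
q_hat <- sort(scores)[k]

# Assumption: treat q_hat as the half-width of a (1 - alpha) Normal interval.
sd_draw <- q_hat / qnorm(1 - alpha / 2)

# Draw M = 20 imputed values for one missing cell with point estimate 1.7.
point_est <- 1.7
draws <- rnorm(20, mean = point_est, sd = sd_draw)

print(c(q_hat = q_hat, sd_draw = sd_draw))
```

Each of the 20 draws would fill the same missing cell in a different completed dataset, so the spread between datasets reflects the calibrated prediction error.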

# Generate M = 20 stochastic complete datasets
mi <- multi_impute(fit$data$X_original, tree, m = 20L)

# Fit a downstream model to each
fits <- with_imputations(mi, function(d) {
  glm(migr ~ log(mass) + diet, data = d, family = binomial)
})

# Pool with Rubin's rules
pool_mi(fits)

pool_mi() returns a tidy data.frame with estimate, std.error, p.value, df (Barnard-Rubin degrees of freedom), fmi (fraction of missing information), and riv (relative increase in variance) per coefficient.
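To show where these columns come from, here is a minimal base-R sketch of Rubin's rules for a single coefficient, with the Barnard-Rubin degrees-of-freedom adjustment. It follows the standard formulas and is not the source of pool_mi(); `pool_rubin()`, its arguments, and the example numbers are hypothetical.

```r
# Hypothetical helper: pool one coefficient across m imputed-data fits.
# q = per-imputation estimates, u = per-imputation squared standard errors,
# df_com = residual degrees of freedom of the complete-data model.
pool_rubin <- function(q, u, df_com = Inf) {
  m     <- length(q)
  q_bar <- mean(q)                    # pooled estimate
  u_bar <- mean(u)                    # within-imputation variance
  b     <- var(q)                     # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b    # total variance
  riv    <- (1 + 1 / m) * b / u_bar   # relative increase in variance
  lambda <- (1 + 1 / m) * b / t_var
  df_old <- (m - 1) / lambda^2
  df <- if (is.finite(df_com)) {
    df_obs <- (df_com + 1) / (df_com + 3) * df_com * (1 - lambda)
    df_old * df_obs / (df_old + df_obs)   # Barnard-Rubin adjustment
  } else {
    df_old
  }
  fmi <- (riv + 2 / (df + 3)) / (1 + riv)
  se  <- sqrt(t_var)
  data.frame(estimate = q_bar, std.error = se,
             p.value = 2 * pt(-abs(q_bar / se), df),
             df = df, fmi = fmi, riv = riv)
}

# Toy input: five imputations of one slope and its squared standard error.
q <- c(0.52, 0.47, 0.55, 0.50, 0.49)
u <- c(0.010, 0.011, 0.009, 0.010, 0.010)
res <- pool_rubin(q, u, df_com = 56)
print(res)
```

The pooled standard error is always at least as large as the average within-imputation standard error, because the between-imputation term carries the extra uncertainty due to missing data.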

For reference, the lower-level predict(fit, n_imputations = 5L) interface returns M complete datasets directly (without downstream pooling):