---
title: "Mixed-Type Trait Imputation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Mixed-Type Trait Imputation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = torch::torch_is_installed()
)
```

## Overview

Real comparative datasets contain many kinds of traits: body mass (continuous), clutch size (count), migratory status (yes/no), diet type (carnivore/herbivore/omnivore), threat status (LC < VU < EN). pigauto handles all of them in a single model, so you do not need separate imputation runs for different column types.

| Type | R class | Example | Notes |
|------|---------|---------|-------|
| Continuous | `numeric` | Body mass, wing length | Auto-detected |
| Count | `integer` | Clutch size, litter size | Auto-detected |
| Binary | `factor` (2 levels) | Migratory yes/no | Auto-detected |
| Categorical | `factor` (>2 levels) | Diet type, lifestyle | Auto-detected |
| Ordinal | `ordered` | Threat status (LC < VU < EN) | Auto-detected |
| Proportion | `numeric` (0–1) | Habitat cover, diet fraction | Requires `trait_types = "proportion"` override |
| ZI count | `integer` (zero-inflated) | Parasite load, rare behaviour | Requires `trait_types = "zi_count"` override; experimental; accuracy more variable than other types |
| Multi-proportion | K `numeric` columns summing to 1 | Diet composition, plumage-colour fractions, microbiome relative abundances | Requires `multi_proportion_groups = list( = c("col1", ..., "colK"))`; encoded via centred log-ratio (CLR) + per-component z-score |

The first five rows in this table are auto-detected from R column class; `proportion`, `zi_count`, and `multi_proportion` must be declared explicitly (`trait_types` or `multi_proportion_groups`).
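
The class-based detection in the first five rows can be pictured with a small base-R helper. This is a minimal sketch, not pigauto's implementation: `guess_trait_type()` and `demo_traits` are hypothetical names, and the check order (ordered before factor, integer before numeric) is an assumption that follows from how R's class predicates nest.

```r
# Hypothetical helper: maps an R column to the trait type named in the table.
# Illustrative only; pigauto's own detection logic may differ.
guess_trait_type <- function(x) {
  if (is.ordered(x)) return("ordinal")   # ordered factors are also factors
  if (is.factor(x)) {
    return(if (nlevels(x) == 2L) "binary" else "categorical")
  }
  if (is.integer(x)) return("count")     # integers also pass is.numeric()
  if (is.numeric(x)) return("continuous")
  stop("unsupported column class: ", class(x)[1])
}

# Toy data with one column per auto-detected type
demo_traits <- data.frame(
  mass   = c(12.5, 30.1, 8.2),
  clutch = c(2L, 4L, 3L),
  migr   = factor(c("no", "yes", "no")),
  diet   = factor(c("herb", "carn", "omni")),
  threat = ordered(c("LC", "EN", "VU"), levels = c("LC", "VU", "EN"))
)

detected <- vapply(demo_traits, guess_trait_type, character(1))
print(detected)
```

A proportion stored as `numeric` would be reported as continuous by a rule like this, which is why the table lists proportion, zero-inflated count, and multi-proportion as explicit overrides.
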
All eight share the same latent space: the phylogenetic baseline and GNN correction both operate in this space, and type-specific logic appears only at encoding, loss computation, and decoding.

## Synthetic example

```{r simulate}
library(pigauto)
library(ape)

set.seed(42)
n <- 60
tree <- rtree(n)

traits <- data.frame(
  row.names = tree$tip.label,
  mass   = exp(rnorm(n, 3, 0.5)),
  clutch = as.integer(rpois(n, 3) + 1L),
  migr   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  diet   = factor(sample(c("herb", "carn", "omni"), n, replace = TRUE)),
  threat = ordered(sample(c("LC", "VU", "EN"), n, replace = TRUE),
                   levels = c("LC", "VU", "EN"))
)
```

## Preprocessing

`preprocess_traits()` auto-detects column types from R classes:

```{r preprocess}
pd <- preprocess_traits(traits, tree)
print(pd)
```

The `trait_map` records each trait's type, levels, latent column indices, and normalisation parameters:

```{r trait_map}
str(pd$trait_map, max.level = 1)
pd$trait_map$diet
```

## Creating splits

When `trait_map` is supplied, `make_missing_splits()` operates at the original-trait level.
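
Holding out at the original-trait level means a masked cell is a (species, trait) pair, so every latent column derived from that trait is hidden for that species. The base-R sketch below illustrates the idea on a toy matrix. The `latent_cols` lookup is a hypothetical stand-in for the column indices that `trait_map` stores, and the code is not how `make_missing_splits()` is implemented.

```r
set.seed(1)
species <- paste0("sp", 1:6)

# Toy latent matrix: one column for mass, three one-hot columns for diet
X_latent <- cbind(
  mass      = rnorm(6),
  diet_carn = c(1, 0, 0, 1, 0, 0),
  diet_herb = c(0, 1, 0, 0, 1, 0),
  diet_omni = c(0, 0, 1, 0, 0, 1)
)
rownames(X_latent) <- species

# Hypothetical lookup: latent column indices for each original trait
latent_cols <- list(mass = 1L, diet = 2:4)

# Sample held-out cells as (species, trait) pairs, not as latent cells
cells <- expand.grid(row = seq_along(species), trait = names(latent_cols),
                     stringsAsFactors = FALSE)
held <- cells[sample(nrow(cells), 3), ]

# Expand each pair to all latent columns belonging to that trait
mask <- matrix(FALSE, nrow(X_latent), ncol(X_latent),
               dimnames = dimnames(X_latent))
for (i in seq_len(nrow(held))) {
  mask[held$row[i], latent_cols[[held$trait[i]]]] <- TRUE
}

X_masked <- X_latent
X_masked[mask] <- NA
print(X_masked)
```

In the printed matrix a species never has only part of its diet encoding missing, so the model cannot read the held-out class off the remaining one-hot columns.
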
For categorical traits, all K one-hot columns are held out together:

```{r splits}
spl <- make_missing_splits(pd$X_scaled, missing_frac = 0.20, seed = 1,
                           trait_map = pd$trait_map)
cat("Val cells (latent):", length(spl$val_idx), "\n")
cat("Test cells (latent):", length(spl$test_idx), "\n")
```

## Baseline fitting

The baseline uses phylogenetic conditional MVN for continuous-family latent columns, and label-propagation or threshold/liability candidates for discrete-family columns:

```{r baseline}
bl <- fit_baseline(pd, tree, splits = spl)
dim(bl$mu)
```

## Training

`fit_pigauto()` uses type-specific losses and trait-level corruption masking automatically when a `trait_map` is present:

```{r train, message = FALSE}
fit <- fit_pigauto(
  pd, tree,
  splits = spl,
  epochs = 200L,
  eval_every = 50L,
  patience = 5L,
  verbose = FALSE,
  seed = 1
)
print(fit)
```

## Prediction and decoding

`predict()` decodes latent predictions back to original types:

```{r predict}
pred <- predict(fit, return_se = TRUE)
head(pred$imputed)
```

For binary and categorical traits, class probabilities are available:

```{r probs}
# Binary: probability of "yes"
head(pred$probabilities$migr)

# Categorical: probability of each diet class
head(pred$probabilities$diet)
```

The SE matrix provides type-appropriate uncertainty:

```{r se}
head(pred$se)
```

## Evaluation

`evaluate_imputation()` dispatches type-specific metrics:

```{r evaluate}
ev <- evaluate_imputation(pred, pd$X_scaled, spl)
print(ev)
```

## Multiple imputation for downstream inference

The recommended workflow uses `multi_impute()` to generate M complete datasets from the model's calibrated uncertainty distribution, then pools downstream coefficients with Rubin's rules via `with_imputations()` and `pool_mi()`. `draws_method = "conformal"` is the default: missing cells are drawn from Normal distributions centred on the point estimate with width set by the split-conformal calibration score.
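
As an illustration of the conformal draw described above, the sketch below calibrates a split-conformal score on held-out residuals and uses it to set the spread of Normal draws for one missing cell. The mapping from score to standard deviation (`q_hat / qnorm(1 - alpha / 2)`), the toy data, and all variable names are assumptions made for this example; pigauto's internal calibration may differ.

```r
set.seed(7)
alpha   <- 0.05   # target miscoverage (assumed for this sketch)
m_draws <- 20L    # number of imputations to draw

# Toy calibration set: absolute residuals on held-out validation cells
cal_truth <- rnorm(200)
cal_pred  <- cal_truth + rnorm(200, sd = 0.4)
scores    <- abs(cal_truth - cal_pred)

# Split-conformal quantile with the finite-sample (n + 1) correction
n_cal <- length(scores)
k <- ceiling((n_cal + 1) * (1 - alpha))
stopifnot(k <= n_cal)   # otherwise the interval would be unbounded
q_hat <- sort(scores)[k]

# Assumed mapping: treat [mu - q_hat, mu + q_hat] as a central 1 - alpha
# Normal interval, which fixes the standard deviation of the draws
sd_draw <- q_hat / qnorm(1 - alpha / 2)

mu_missing <- 1.3   # made-up point estimate for one missing cell
draws <- rnorm(m_draws, mean = mu_missing, sd = sd_draw)
print(round(draws, 2))
```

Each of the `m_draws` values would fill the same missing cell in a different completed dataset, which is what lets the between-imputation variance reflect the calibrated uncertainty.
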
The alternative `draws_method = "mc_dropout"` runs M stochastic GNN forward passes with dropout active and BM posterior draws as the blend baseline; it is available for comparison.

```{r mi_workflow, eval=FALSE}
# Generate M = 20 stochastic complete datasets
mi <- multi_impute(fit$data$X_original, tree, m = 20L)

# Fit a downstream model to each completed dataset
# (column names follow the synthetic example above)
fits <- with_imputations(mi, function(d) {
  glm(migr ~ log(mass) + diet, data = d, family = binomial)
})

# Pool with Rubin's rules
pool_mi(fits)
```

`pool_mi()` returns a tidy data.frame with `estimate`, `std.error`, `p.value`, `df` (Barnard-Rubin degrees of freedom), `fmi` (fraction of missing information), and `riv` (relative increase in variance) per coefficient.

For reference, the lower-level `predict(fit, n_imputations = 5L)` interface returns M complete datasets directly (without downstream pooling):