--- title: "Getting started with pigauto: Phylogenetic Imputation via Graph AUTO-encoders" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with pigauto} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 6, fig.height = 4 ) ``` ## What pigauto does Many comparative datasets have missing trait values: species were not measured, specimens were unavailable, or data were lost. pigauto fills in those gaps using three sources of information: 1. **The phylogenetic tree.** Closely related species tend to share similar traits, so known values of relatives are informative about missing ones. 2. **Cross-trait correlations.** If body mass predicts beak length, then observed masses help impute missing beak lengths even when the species itself was never measured for that trait. 3. **Environmental covariates** (optional). Climate, habitat, or geographic variables can improve imputation when trait variation has a strong environmental component beyond phylogenetic signal. The package handles continuous measurements, counts, binary variables, ordered categories, unordered categories, bounded proportions, zero-inflated counts, and compositional multi-proportion data — all in a single call to `impute()`. ### How it works (briefly) pigauto combines a phylogenetic baseline with a graph neural network (GNN) correction. The baseline uses Brownian-motion conditional imputation for continuous-family latent columns and phylogenetic label-propagation or threshold/liability candidates for discrete-family columns. The GNN learns additional patterns from tree topology and cross-trait structure. A per-trait gate controls how much the GNN contributes: when the baseline is already good enough, the gate closes and the GNN stays out of the way. This is verified on held-out data after training. > **Technical note.** The internal class name `ResidualPhyloDAE` refers > to the ResNet-style skip connections inside the GNN layers, not to a > statistical residual `y - baseline`. The phylogenetic structure is encoded as: - A symmetric-normalised Gaussian-kernel adjacency matrix derived from cophenetic (patristic) distances. - *k* Laplacian eigenvectors as node positional features — these encode phylogenetic clusters at multiple scales. ## Installation ```{r install, eval=FALSE} pak::pak("itchyshin/pigauto") # torch backend (required; ~1 GB first-time download) torch::install_torch() ``` ```{r load} library(pigauto) library(ape) ``` ## Quick start For most users, a single function call is all you need: ```{r quickstart, eval=FALSE} result <- impute(traits, tree) result$completed # data.frame: observed values preserved, NAs filled result$imputed_mask # logical matrix: TRUE where a cell was imputed result$prediction$se # per-cell uncertainty (original units) ``` `impute()` runs the full pipeline internally — preprocessing, baseline fitting, graph construction, GNN training, calibration, and prediction. The rest of this vignette walks through each step individually for users who need fine-grained control, want to cache intermediate objects, or are benchmarking components. ## The AVONET bird data pigauto ships with `avonet300`, a 300-species subset of the AVONET v3 bird morphology database (Tobias et al. 2022, *Ecology Letters*), and `tree300`, a matching pruned Hackett MCC phylogeny from BirdTree.org. ```{r data} data(avonet300, tree300, package = "pigauto") head(avonet300) ape::Ntip(tree300) ``` The four morphological traits — body mass, beak length, tarsus length, and wing length — are all log-normally distributed and highly heritable, making them ideal test cases for phylogenetic imputation. ## Step 1: Preprocess `preprocess_traits()` log-transforms and z-scores the traits, and reorders rows to match the phylogeny's tip order (required by the graph operations). ```{r preprocess} traits <- avonet300[, -1] # drop Species_Key column rownames(traits) <- avonet300$Species_Key pd <- preprocess_traits(traits, tree300, log_transform = TRUE) print(pd) ``` The object `pd` is a `pigauto_data` list containing: - `X_scaled`: the z-scored, log-transformed trait matrix (300 × 4) - `means`, `sds`: per-trait normalisation parameters (for back-transformation) - `log_transform`: `TRUE`, recorded so `predict()` can invert it automatically ## Step 2: Create evaluation splits We designate 25% of all matrix cells as "missing" for evaluation. These cells are *not* used during training or baseline fitting; they provide an honest estimate of imputation error. ```{r splits} splits <- make_missing_splits(pd$X_scaled, missing_frac = 0.25, seed = 42) cat("Total held-out cells:", length(splits$val_idx) + length(splits$test_idx), "\n") cat(" Validation:", length(splits$val_idx), "\n") cat(" Test: ", length(splits$test_idx), "\n") ``` The missing cells are split further: validation cells guide early stopping, test cells provide the final unbiased performance estimate. ## Step 3: Fit the BM baseline `fit_baseline()` fits the phylogenetic BM baseline after masking the held-out cells. This gives the benchmark against which the GNN will be compared. ```{r baseline, eval=TRUE} baseline <- fit_baseline(pd, tree300, splits = splits, model = "BM") cat("Baseline object contains:\n") cat(" mu matrix:", nrow(baseline$mu), "x", ncol(baseline$mu), "\n") cat(" se matrix:", nrow(baseline$se), "x", ncol(baseline$se), "\n") ``` ## Step 4: Build the phylogenetic graph `build_phylo_graph()` computes the adjacency matrix and Laplacian spectral features from the tree. The result can be cached to disk for reuse. ```{r graph, eval=TRUE} graph <- build_phylo_graph( tree300, k_eigen = 8, sigma_mult = 0.5 ) cat("Graph: n =", graph$n, "species\n") cat("Adjacency: [", nrow(graph$adj), "x", ncol(graph$adj), "]\n") cat("Spectral coords: [", nrow(graph$coords), "x", ncol(graph$coords), "]\n") cat("Kernel bandwidth sigma:", round(graph$sigma, 3), "\n") ``` For repeated runs, supply `cache_path = "graph_cache.rds"` to skip recomputation. ## Step 5: Train the model The training loop corrupts a random 55% of observed cells each epoch with a learnable mask token, then minimises the type-appropriate loss on the corrupted cells plus a shrinkage penalty on `delta - baseline` that pulls the GNN toward the baseline. Early stopping is based on held-out validation RMSE. ```{r train, eval=FALSE} fit <- fit_pigauto( data = pd, tree = tree300, splits = splits, graph = graph, baseline = baseline, hidden_dim = 64, k_eigen = 8, dropout = 0.10, lr = 3e-3, epochs = 2000, corruption_rate = 0.55, lambda_shrink = 0.03, eval_every = 100, patience = 10, verbose = TRUE, seed = 1 ) ``` Example console output during training: ``` Epoch 100 | train loss: 0.8431 | val RMSE: 0.8852 Epoch 200 | train loss: 0.7215 | val RMSE: 0.8601 ... Epoch 900 | train loss: 0.6108 | val RMSE: 0.8289 Early stopping at epoch 900 (no improvement for 10 evals). ``` ```{r print-fit, eval=FALSE} print(fit) ``` ``` Species : 300 Traits : 4 -- Mass, Beak.Length_Culmen, Tarsus.Length, Wing.Length Architecture: hidden_dim = 64, k_eigen = 8, dropout = 0.1 Val RMSE : 0.828 (z-score) Test RMSE: 0.831 (z-score) ``` ## Step 6: Predict `predict.pigauto_fit()` runs a single forward pass through the trained model, optionally repeated with MC dropout to generate multiple stochastic completions for uncertainty quantification. The result is back-transformed to the original (non-log) scale. ```{r predict, eval=FALSE} pred <- predict(fit, return_se = TRUE) # pred$imputed: 300 x 4 matrix in original units # pred$se: 300 x 4 uncertainty matrix (original units) head(pred$imputed) # Conformal prediction intervals (95% marginal coverage guarantee) pred$conformal_lower[["Mass"]] # lower bound per species (original units) pred$conformal_upper[["Mass"]] # upper bound per species (original units) pred$conformal_coverage # empirical coverage on val set (target ≈ 0.95) ``` Expected output (values will vary by seed and convergence): ``` Mass Beak.Length_Culmen Tarsus.Length Wing.Length t1 28.4150 14.2316 18.0923 136.421 t2 41.2031 18.5034 21.3441 154.832 ... ``` ## Step 7: Evaluate and compare baselines We compare the BM baseline to the GNN on the held-out test cells (both evaluated in z-score scale for comparability). ```{r evaluate, eval=FALSE} # BM baseline RMSE on test cells eval_bm <- evaluate_imputation(baseline$mu, pd$X_scaled, splits) eval_bm[eval_bm$split == "test", c("trait", "n", "rmse", "pearson_r")] ``` ``` trait n rmse pearson_r Mass Mass 56 0.872 0.611 Beak.Length_Culmen Beak.Length_Culmen 56 0.901 0.582 Tarsus.Length Tarsus.Length 56 0.863 0.628 Wing.Length Wing.Length 56 0.884 0.595 ``` ```{r compare, eval=FALSE} # GNN test RMSE stored in fit object data.frame( trait = fit$trait_names, bm_rmse = eval_bm$rmse[eval_bm$split == "test"], gnn_rmse = fit$test_rmse ) ``` ``` trait bm_rmse gnn_rmse 1 Mass 0.872 0.831 2 Beak.Length_Culmen 0.901 0.856 3 Tarsus.Length 0.863 0.822 4 Wing.Length 0.884 0.841 ``` In this example output, the GNN reduces test RMSE relative to the BM baseline. How large the improvement is depends on the trait, clade, missingness pattern, and validation split. ## Step 8: Visualise ### Training history ```{r plot-history, eval=FALSE} plot(fit, type = "history") ``` This plots the validation RMSE over epochs, showing where early stopping triggered. ### Uncertainty ribbons ```{r plot-uncertainty, eval=FALSE} plot(pred, type = "intervals", trait = "Mass") ``` Species are sorted by predicted mass. The blue ribbon shows the conformal prediction interval when conformal scores are available. Grey points (if `truth` is supplied) show known values — a useful visual check of calibration. ## Multiple imputation for downstream inference The single completed dataset from `impute()` ignores imputation uncertainty — downstream standard errors will be too small. Use `multi_impute()` to generate M stochastic complete datasets, each a draw from the model's calibrated uncertainty distribution, then pool coefficients with Rubin's rules. ```{r mi-workflow, eval=FALSE} # Step 1: generate M = 50 complete datasets # draws_method = "conformal" draws from conformal-calibrated # uncertainty — properly calibrated against held-out residuals. mi <- multi_impute(traits, tree, m = 50L) # Step 2: fit your downstream model to each imputed dataset fits <- with_imputations(mi, function(d) { lm(log(Wing.Length) ~ log(Mass) + Trophic.Level, data = d) }) # Step 3: pool with Rubin's rules pool_mi(fits) # Returns a tidy data.frame with estimate, std.error, statistic, p.value, # df (Barnard-Rubin degrees of freedom), fmi (fraction of missing information), # and riv (relative increase in variance) per coefficient. ``` Choose M ≥ 50 for stable pooled SEs; M ≥ 100 × max(fmi) for reliable p-values. The `fmi` column directly tells you how much each coefficient is affected by missing data — if `fmi < 0.05`, imputation uncertainty is negligible for that term. ## Phylogenetic tree uncertainty A single fixed tree ignores phylogenetic uncertainty. pigauto splits this into **two** steps, matching Nakagawa & de Villemereuil (2019, *Syst. Biol.* 68:632–641): **Step 1 — imputation.** `multi_impute_trees()` fits pigauto on each of the T posterior trees and returns T × M completed datasets. Every dataset is conditional on a specific tree, recorded in `mi$tree_index`: ```{r tree-uncertainty-step1, eval=FALSE} data(trees300, package = "pigauto") # 50 posterior trees mi <- multi_impute_trees(traits, trees = trees300, m_per_tree = 5L) # 50 trees × 5 imputations = 250 completed datasets ``` **Step 2 — analysis.** For each completed dataset, fit the downstream comparative model using the *same* tree that produced that dataset, then pool all T × M fits with Rubin's rules: ```{r tree-uncertainty-step2, eval=FALSE} fits <- with_imputations(mi, function(dat, tree) { dat$species <- rownames(dat) nlme::gls( log(Mass) ~ log(Wing.Length), correlation = ape::corBrownian( phy = tree, form = ~species), data = dat, method = "ML" ) }) pool_mi(fits) ``` The pooled standard errors reflect trait-imputation uncertainty, phylogenetic tree uncertainty, and their interaction in one step. **Compute cost scales linearly with T** (each tree requires a fresh pigauto fit). At `n = 300` species this is ~30 minutes for T = 50 trees; at `n = 10,000` species it can be 17+ hours. For large trees we recommend reducing T to 10–20 (the `fmi` column in `pool_mi()` output typically indicates convergence well before T = 50), or parallelising tree fits across machines. ## Active imputation: where to measure next When you can measure k more species but not all of them, which ones should you prioritise to maximally reduce imputation uncertainty across the rest of your tree? `suggest_next_observation()` answers this with a closed-form variance-reduction calculation built on the BM conditional MVN. ```{r active-impute, eval=FALSE} res <- impute(traits, tree) # Top-10 individual (species, trait) cells, ordered by expected # total variance reduction: suggest_next_observation(res, top_n = 10, by = "cell") # Or aggregated by species (sum of variance reductions across the # species' currently-missing continuous-family traits): suggest_next_observation(res, top_n = 10, by = "species") ``` The `delta_var_total` column is the closed-form expected reduction in total predictive variance across all currently-missing cells if you observed that candidate next: \[ \Delta V(s_{\text{new}}) = \sigma_t^2 \sum_{i \in \mathrm{miss}_t} \frac{D_{ik}^2}{\alpha_k} \] where \(D = R_{mm} - R_{mo} R_{oo}^{-1} R_{om}\) is the residual matrix at currently-missing cells, \(\alpha_k = D_{kk}\) is the current relative leverage of cell \(k\), and \(\sigma_t^2\) is the REML BM variance. The formula is the standard active-learning variance-reduction bound (Cohn et al. 1996); applying it to phylogenetic trait imputation appears to be new. **Scope.** `suggest_next_observation()` v2 (2026-05-01) supports all eight pigauto trait types: * **Variance reduction** for continuous-family traits (`continuous`, `count`, `ordinal`, `proportion`) and the magnitude column of `zi_count`, computed by closed-form Sherman-Morrison rank-1 update on the BM phylogenetic-correlation matrix. * **Entropy reduction** for `binary`, `categorical`, and the gate column of `zi_count`, computed by closed-form label-propagation posteriors. * **Multi-component variance reduction** (per-component BM, summed across the K CLR-z latent columns) for `multi_proportion`. For `zi_count` rows the output populates BOTH `delta_var_total` (probability-weighted magnitude variance reduction) AND `delta_entropy_total` (gate entropy reduction); the `metric` column is set to `"variance"` so rows sort on the magnitude scale. See `?suggest_next_observation` for full details. **Variance vs entropy are not directly comparable** — the cross- metric ordering by `delta_var_total` is approximate. When you want a strict ranking on a single metric, filter by `metric` first or restrict `types` to continuous-family-only or discrete-only. Multi-obs (multiple observations per species) inputs are not supported and error with a clear message. ## Optional advanced features ### Phylogenetic-signal gating `impute()` runs a per-trait Pagel's-λ check before training and routes traits with very weak phylogenetic signal to the **grand-mean corner** of the safety-floor simplex (i.e., it predicts the column mean for those traits). This protects against over-fitting noise on traits where the phylogeny carries little information. It is on by default since v0.9.1.9003. The two arguments you can tune: ```{r phylo-signal-gate, eval=FALSE} result <- impute(traits, tree, phylo_signal_gate = TRUE, # default phylo_signal_threshold = 0.2, # default; min lambda to keep BM/GNN phylo_signal_method = "lambda") ``` If a trait's λ falls below `phylo_signal_threshold`, the calibrated gate routes it to the grand-mean corner. `print(fit)` reports which traits triggered the gate and their λ values. To disable (e.g., when comparing against an older snapshot or for benchmarking purposes): ```{r phylo-signal-gate-off, eval=FALSE} result <- impute(traits, tree, phylo_signal_gate = FALSE) ``` ### Assembling covariates from public data Two helpers are bundled for the common case of building a covariate matrix from species-level environmental data: ```{r covariate-helpers, eval=FALSE} # Step 1: GBIF occurrence centroids (median lat/lon across cleaned points) gbif <- pull_gbif_centroids(species = rownames(my_traits), cache_dir = "cache/gbif", occurrence_limit = 500L) #> data.frame with columns species, centroid_lat, centroid_lon, n_occurrences # Step 2: WorldClim bioclim summaries (median + IQR of all 19 bio variables # across each species' GBIF occurrences) clim <- pull_worldclim_per_species(species = rownames(my_traits), gbif_cache_dir = "cache/gbif", worldclim_cache_dir = "cache/worldclim", resolution = "10m") #> data.frame with bio1..bio19 medians + IQRs + n_extracted per species # Step 3: hand the whole climate frame to impute() as covariates result <- impute(my_traits, my_tree, covariates = clim) ``` Caching is on by default — both helpers write to the `cache_dir` you supply and re-use existing extracts on subsequent runs. GBIF requires no API key. WorldClim raster downloads are ~50–500 MB depending on resolution and are cached in `worldclim_cache_dir`. Both functions error with a clear message if the optional dependencies (`rgbif`, `terra`) are not installed. For a worked end-to-end example with covariates, use this section alongside `?impute`, `?pull_gbif_centroids`, and `?pull_worldclim_per_species`. The old static covariate walk-through is being refreshed and is not the current source of truth. ## Tips and next steps **Use GPU if available.** pigauto automatically selects CUDA > MPS > CPU. Check availability with: ```{r gpu, eval=FALSE} torch::cuda_is_available() # NVIDIA GPU torch::backends_mps_is_available() # Apple Silicon GPU ``` **Cache the phylogenetic graph.** For repeated experiments, save time by writing the graph to disk: ```{r cache, eval=FALSE} graph <- build_phylo_graph(tree300, k_eigen = 8, cache_path = "tree300_graph.rds") ``` **Large trees (N > 2000).** `build_phylo_graph()` computes a dense N × N cophenetic distance matrix and a Laplacian eigendecomposition. The eigendecomposition is O(N³) in the worst case, but `k_eigen = "auto"` caps the number of eigenvectors at 32 (the default for all trees), which keeps computation tractable in practice — pigauto has been tested up to 10,000 species on a standard laptop. The main cost at very large N is memory for the N × N distance matrix (~800 MB at N = 10,000). If memory is tight, prune to a clade of interest or watch for a future release that switches to a sparse Lanczos eigensolver. **Supplying your own data.** The workflow above works for any phylogenetic tree and trait matrix: ```{r own-data, eval=FALSE} tree <- ape::read.tree("my_phylogeny.nwk") traits <- read_traits("my_traits.csv", species_col = "species") pd <- preprocess_traits(traits, tree) # ...proceed as above ``` Trait columns can be any of eight supported types. Five are auto-detected from R column class: `numeric` (continuous), `integer` (count), `factor` with 2 levels (binary), `factor` with >2 levels (categorical), and `ordered` (ordinal). Three require an explicit declaration because they cannot be inferred from class alone: `numeric` (0–1) with `trait_types = "proportion"` for bounded proportions; `integer` with `trait_types = "zi_count"` for zero-inflated counts; and K `numeric` columns declared via the `multi_proportion_groups` argument for compositional (simplex) data whose rows sum to 1. pigauto reads column classes to decide how each trait is encoded and imputed. The species names in `traits` must overlap with tip labels in `tree` (partial overlap is allowed; unmatched species are dropped with a warning). ## References - Tobias, J.A. et al. (2022). AVONET: morphological, ecological and geographical data for all birds. *Ecology Letters*, 25, 581–597. - Jetz, W. et al. (2012). The global diversity of birds in space and time. *Nature*, 491, 444–448. - Goolsby, E.W., Bruggeman, J. & Ané, C. (2017). Rphylopars: fast multivariate phylogenetic comparative methods for missing data and within-species variation. *Methods in Ecology and Evolution*, 8, 22–27. (pigauto's internal BM baseline implements the same conditional-MVN approach.) - Cohn, D.A., Ghahramani, Z. & Jordan, M.I. (1996). Active learning with statistical models. *Journal of Artificial Intelligence Research*, 4, 129–145. (Variance-reduction framework that `suggest_next_observation()` adapts to the phylogenetic setting.)