---
title: "Getting started with pigauto: Phylogenetic Imputation via Graph AUTO-encoders"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting started with pigauto}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.width = 6,
  fig.height = 4
)
```

## What pigauto does

Many comparative datasets have missing trait values: species were not
measured, specimens were unavailable, or data were lost. pigauto fills
in those gaps using three sources of information:

1. **The phylogenetic tree.** Closely related species tend to share
   similar traits, so known values of relatives are informative about
   missing ones.
2. **Cross-trait correlations.** If body mass predicts beak length,
   then observed masses help impute missing beak lengths even when the
   species itself was never measured for that trait.
3. **Environmental covariates** (optional). Climate, habitat, or
   geographic variables can improve imputation when trait variation
   has a strong environmental component beyond phylogenetic signal.

The package handles continuous measurements, counts, binary variables,
ordered categories, unordered categories, bounded proportions,
zero-inflated counts, and compositional multi-proportion data — all in a
single call to `impute()`.

### How it works (briefly)

pigauto combines a phylogenetic baseline with a graph neural network
(GNN) correction. The baseline uses Brownian-motion conditional
imputation for continuous-family latent columns and phylogenetic
label-propagation or threshold/liability candidates for discrete-family
columns. The GNN learns
additional patterns from tree topology and cross-trait structure. A
per-trait gate controls how much the GNN contributes: when the baseline
is already good enough, the gate closes and the GNN stays out of the
way. This is verified on held-out data after training.

> **Technical note.** The internal class name `ResidualPhyloDAE` refers
> to the ResNet-style skip connections inside the GNN layers, not to a
> statistical residual `y - baseline`.

The phylogenetic structure is encoded as:

- A symmetric-normalised Gaussian-kernel adjacency matrix derived from
  cophenetic (patristic) distances.
- *k* Laplacian eigenvectors as node positional features — these encode
  phylogenetic clusters at multiple scales.

## Installation

```{r install, eval=FALSE}
pak::pak("itchyshin/pigauto")

# torch backend (required; ~1 GB first-time download)
torch::install_torch()
```

```{r load}
library(pigauto)
library(ape)
```

## Quick start

For most users, a single function call is all you need:

```{r quickstart, eval=FALSE}
result <- impute(traits, tree)

result$completed    # data.frame: observed values preserved, NAs filled
result$imputed_mask # logical matrix: TRUE where a cell was imputed
result$prediction$se # per-cell uncertainty (original units)
```

`impute()` runs the full pipeline internally — preprocessing, baseline fitting,
graph construction, GNN training, calibration, and prediction. The rest of this
vignette walks through each step individually for users who need fine-grained
control, want to cache intermediate objects, or are benchmarking components.

## The AVONET bird data

pigauto ships with `avonet300`, a 300-species subset of the AVONET v3 bird
morphology database (Tobias et al. 2022, *Ecology Letters*), and `tree300`,
a matching pruned Hackett MCC phylogeny from BirdTree.org.

```{r data}
data(avonet300, tree300, package = "pigauto")

head(avonet300)
ape::Ntip(tree300)
```

The four morphological traits — body mass, beak length, tarsus length, and
wing length — are all log-normally distributed and highly heritable, making
them ideal test cases for phylogenetic imputation.

## Step 1: Preprocess

`preprocess_traits()` log-transforms and z-scores the traits, and reorders
rows to match the phylogeny's tip order (required by the graph operations).

```{r preprocess}
traits <- avonet300[, -1]              # drop Species_Key column
rownames(traits) <- avonet300$Species_Key

pd <- preprocess_traits(traits, tree300, log_transform = TRUE)
print(pd)
```

The object `pd` is a `pigauto_data` list containing:

- `X_scaled`: the z-scored, log-transformed trait matrix (300 × 4)
- `means`, `sds`: per-trait normalisation parameters (for back-transformation)
- `log_transform`: `TRUE`, recorded so `predict()` can invert it automatically

## Step 2: Create evaluation splits

We designate 25% of all matrix cells as "missing" for evaluation. These cells
are *not* used during training or baseline fitting; they provide an honest
estimate of imputation error.

```{r splits}
splits <- make_missing_splits(pd$X_scaled, missing_frac = 0.25, seed = 42)

cat("Total held-out cells:", length(splits$val_idx) + length(splits$test_idx), "\n")
cat("  Validation:", length(splits$val_idx), "\n")
cat("  Test:      ", length(splits$test_idx), "\n")
```

The missing cells are split further: validation cells guide early stopping,
test cells provide the final unbiased performance estimate.

## Step 3: Fit the BM baseline

`fit_baseline()` fits the phylogenetic BM baseline after masking the held-out cells. This
gives the benchmark against which the GNN will be compared.

```{r baseline, eval=TRUE}
baseline <- fit_baseline(pd, tree300, splits = splits, model = "BM")

cat("Baseline object contains:\n")
cat("  mu matrix:", nrow(baseline$mu), "x", ncol(baseline$mu), "\n")
cat("  se matrix:", nrow(baseline$se), "x", ncol(baseline$se), "\n")
```

## Step 4: Build the phylogenetic graph

`build_phylo_graph()` computes the adjacency matrix and Laplacian spectral
features from the tree. The result can be cached to disk for reuse.

```{r graph, eval=TRUE}
graph <- build_phylo_graph(
  tree300,
  k_eigen    = 8,
  sigma_mult = 0.5
)

cat("Graph: n =", graph$n, "species\n")
cat("Adjacency: [", nrow(graph$adj), "x", ncol(graph$adj), "]\n")
cat("Spectral coords: [", nrow(graph$coords), "x", ncol(graph$coords), "]\n")
cat("Kernel bandwidth sigma:", round(graph$sigma, 3), "\n")
```

For repeated runs, supply `cache_path = "graph_cache.rds"` to skip
recomputation.

## Step 5: Train the model

The training loop corrupts a random 55% of observed cells each epoch with a
learnable mask token, then minimises the type-appropriate loss on the
corrupted cells plus a shrinkage penalty on `delta - baseline` that
pulls the GNN toward the baseline. Early stopping is based on held-out
validation RMSE.

```{r train, eval=FALSE}
fit <- fit_pigauto(
  data            = pd,
  tree            = tree300,
  splits          = splits,
  graph           = graph,
  baseline        = baseline,
  hidden_dim      = 64,
  k_eigen         = 8,
  dropout         = 0.10,
  lr              = 3e-3,
  epochs          = 2000,
  corruption_rate = 0.55,
  lambda_shrink   = 0.03,
  eval_every      = 100,
  patience        = 10,
  verbose         = TRUE,
  seed            = 1
)
```

Example console output during training:

```
Epoch  100 | train loss: 0.8431 | val RMSE: 0.8852
Epoch  200 | train loss: 0.7215 | val RMSE: 0.8601
...
Epoch  900 | train loss: 0.6108 | val RMSE: 0.8289
Early stopping at epoch 900 (no improvement for 10 evals).
```

```{r print-fit, eval=FALSE}
print(fit)
```

```
<pigauto_fit>
  Species  : 300
  Traits   : 4 -- Mass, Beak.Length_Culmen, Tarsus.Length, Wing.Length
  Architecture: hidden_dim = 64, k_eigen = 8, dropout = 0.1
  Val RMSE : 0.828 (z-score)
  Test RMSE: 0.831 (z-score)
```

## Step 6: Predict

`predict.pigauto_fit()` runs a single forward pass through the trained model,
optionally repeated with MC dropout to generate multiple stochastic completions
for uncertainty quantification. The result is back-transformed to the original
(non-log) scale.

```{r predict, eval=FALSE}
pred <- predict(fit, return_se = TRUE)

# pred$imputed: 300 x 4 matrix in original units
# pred$se:      300 x 4 uncertainty matrix (original units)
head(pred$imputed)

# Conformal prediction intervals (95% marginal coverage guarantee)
pred$conformal_lower[["Mass"]]    # lower bound per species (original units)
pred$conformal_upper[["Mass"]]    # upper bound per species (original units)
pred$conformal_coverage           # empirical coverage on val set (target ≈ 0.95)
```

Expected output (values will vary by seed and convergence):

```
       Mass Beak.Length_Culmen Tarsus.Length Wing.Length
t1  28.4150           14.2316       18.0923     136.421
t2  41.2031           18.5034       21.3441     154.832
...
```

## Step 7: Evaluate and compare baselines

We compare the BM baseline to the GNN on the held-out test cells
(both evaluated in z-score scale for comparability).

```{r evaluate, eval=FALSE}
# BM baseline RMSE on test cells
eval_bm <- evaluate_imputation(baseline$mu, pd$X_scaled, splits)
eval_bm[eval_bm$split == "test", c("trait", "n", "rmse", "pearson_r")]
```

```
               trait  n  rmse pearson_r
Mass            Mass 56 0.872     0.611
Beak.Length_Culmen Beak.Length_Culmen 56 0.901     0.582
Tarsus.Length Tarsus.Length 56 0.863     0.628
Wing.Length Wing.Length 56 0.884     0.595
```

```{r compare, eval=FALSE}
# GNN test RMSE stored in fit object
data.frame(
  trait    = fit$trait_names,
  bm_rmse  = eval_bm$rmse[eval_bm$split == "test"],
  gnn_rmse = fit$test_rmse
)
```

```
               trait bm_rmse gnn_rmse
1               Mass   0.872    0.831
2 Beak.Length_Culmen   0.901    0.856
3      Tarsus.Length   0.863    0.822
4        Wing.Length   0.884    0.841
```

In this example output, the GNN reduces test RMSE relative to the BM baseline.
How large the improvement is depends on the trait, clade, missingness pattern,
and validation split.

## Step 8: Visualise

### Training history

```{r plot-history, eval=FALSE}
plot(fit, type = "history")
```

This plots the validation RMSE over epochs, showing where early stopping
triggered.

### Uncertainty ribbons

```{r plot-uncertainty, eval=FALSE}
plot(pred, type = "intervals", trait = "Mass")
```

Species are sorted by predicted mass. The blue ribbon shows the conformal
prediction interval when conformal scores are available. Grey points (if
`truth` is supplied) show known values — a useful visual check of calibration.

## Multiple imputation for downstream inference

The single completed dataset from `impute()` ignores imputation uncertainty —
downstream standard errors will be too small. Use `multi_impute()` to generate
M stochastic complete datasets, each a draw from the model's calibrated
uncertainty distribution, then pool coefficients with Rubin's rules.

```{r mi-workflow, eval=FALSE}
# Step 1: generate M = 50 complete datasets
# draws_method = "conformal" draws from conformal-calibrated
# uncertainty — properly calibrated against held-out residuals.
mi <- multi_impute(traits, tree, m = 50L)

# Step 2: fit your downstream model to each imputed dataset
fits <- with_imputations(mi, function(d) {
  lm(log(Wing.Length) ~ log(Mass) + Trophic.Level, data = d)
})

# Step 3: pool with Rubin's rules
pool_mi(fits)
# Returns a tidy data.frame with estimate, std.error, statistic, p.value,
# df (Barnard-Rubin degrees of freedom), fmi (fraction of missing information),
# and riv (relative increase in variance) per coefficient.
```

Choose M ≥ 50 for stable pooled SEs; M ≥ 100 × max(fmi) for reliable p-values.
The `fmi` column directly tells you how much each coefficient is affected by
missing data — if `fmi < 0.05`, imputation uncertainty is negligible for that term.

## Phylogenetic tree uncertainty

A single fixed tree ignores phylogenetic uncertainty. pigauto splits this
into **two** steps, matching Nakagawa & de Villemereuil (2019, *Syst.
Biol.* 68:632–641):

**Step 1 — imputation.**  `multi_impute_trees()` fits pigauto on each of
the T posterior trees and returns T × M completed datasets. Every dataset
is conditional on a specific tree, recorded in `mi$tree_index`:

```{r tree-uncertainty-step1, eval=FALSE}
data(trees300, package = "pigauto")   # 50 posterior trees

mi <- multi_impute_trees(traits, trees = trees300, m_per_tree = 5L)
# 50 trees × 5 imputations = 250 completed datasets
```

**Step 2 — analysis.**  For each completed dataset, fit the downstream
comparative model using the *same* tree that produced that dataset, then
pool all T × M fits with Rubin's rules:

```{r tree-uncertainty-step2, eval=FALSE}
fits <- with_imputations(mi, function(dat, tree) {
  dat$species <- rownames(dat)
  nlme::gls(
    log(Mass) ~ log(Wing.Length),
    correlation = ape::corBrownian(
      phy = tree, form = ~species),
    data = dat, method = "ML"
  )
})
pool_mi(fits)
```

The pooled standard errors reflect trait-imputation uncertainty,
phylogenetic tree uncertainty, and their interaction in one step.

**Compute cost scales linearly with T** (each tree requires a fresh
pigauto fit). At `n = 300` species this is ~30 minutes for T = 50 trees;
at `n = 10,000` species it can be 17+ hours. For large trees we
recommend reducing T to 10–20 (the `fmi` column in `pool_mi()` output
typically indicates convergence well before T = 50), or parallelising
tree fits across machines.

## Active imputation: where to measure next

When you can measure k more species but not all of them, which ones
should you prioritise to maximally reduce imputation uncertainty
across the rest of your tree? `suggest_next_observation()` answers
this with a closed-form variance-reduction calculation built on the
BM conditional MVN.

```{r active-impute, eval=FALSE}
res <- impute(traits, tree)

# Top-10 individual (species, trait) cells, ordered by expected
# total variance reduction:
suggest_next_observation(res, top_n = 10, by = "cell")

# Or aggregated by species (sum of variance reductions across the
# species' currently-missing continuous-family traits):
suggest_next_observation(res, top_n = 10, by = "species")
```

The `delta_var_total` column is the closed-form expected reduction
in total predictive variance across all currently-missing cells if
you observed that candidate next:

\[
\Delta V(s_{\text{new}}) = \sigma_t^2 \sum_{i \in \mathrm{miss}_t}
\frac{D_{ik}^2}{\alpha_k}
\]

where \(D = R_{mm} - R_{mo} R_{oo}^{-1} R_{om}\) is the residual
matrix at currently-missing cells, \(\alpha_k = D_{kk}\) is the
current relative leverage of cell \(k\), and \(\sigma_t^2\) is
the REML BM variance. The formula is the standard active-learning
variance-reduction bound (Cohn et al. 1996); applying it to
phylogenetic trait imputation appears to be new.

**Scope.** `suggest_next_observation()` v2 (2026-05-01) supports
all eight pigauto trait types:

* **Variance reduction** for continuous-family traits (`continuous`,
  `count`, `ordinal`, `proportion`) and the magnitude column of
  `zi_count`, computed by closed-form Sherman-Morrison rank-1
  update on the BM phylogenetic-correlation matrix.
* **Entropy reduction** for `binary`, `categorical`, and the gate
  column of `zi_count`, computed by closed-form label-propagation
  posteriors.
* **Multi-component variance reduction** (per-component BM,
  summed across the K CLR-z latent columns) for `multi_proportion`.

For `zi_count` rows the output populates BOTH `delta_var_total`
(probability-weighted magnitude variance reduction) AND
`delta_entropy_total` (gate entropy reduction); the `metric`
column is set to `"variance"` so rows sort on the magnitude scale.
See `?suggest_next_observation` for full details.

**Variance vs entropy are not directly comparable** — the cross-
metric ordering by `delta_var_total` is approximate.  When you
want a strict ranking on a single metric, filter by `metric`
first or restrict `types` to continuous-family-only or
discrete-only.

Multi-obs (multiple observations per species) inputs are not
supported and error with a clear message.

## Optional advanced features

### Phylogenetic-signal gating

`impute()` runs a per-trait Pagel's-λ check before training and
routes traits with very weak phylogenetic signal to the
**grand-mean corner** of the safety-floor simplex (i.e., it
predicts the column mean for those traits).  This protects against
over-fitting noise on traits where the phylogeny carries little
information.

It is on by default since v0.9.1.9003.  The two arguments you can
tune:

```{r phylo-signal-gate, eval=FALSE}
result <- impute(traits, tree,
                 phylo_signal_gate     = TRUE,   # default
                 phylo_signal_threshold = 0.2,   # default; min lambda to keep BM/GNN
                 phylo_signal_method   = "lambda")
```

If a trait's λ falls below `phylo_signal_threshold`, the calibrated
gate routes it to the grand-mean corner.  `print(fit)` reports
which traits triggered the gate and their λ values.

To disable (e.g., when comparing against an older snapshot or for
benchmarking purposes):

```{r phylo-signal-gate-off, eval=FALSE}
result <- impute(traits, tree, phylo_signal_gate = FALSE)
```

### Assembling covariates from public data

Two helpers are bundled for the common case of building a
covariate matrix from species-level environmental data:

```{r covariate-helpers, eval=FALSE}
# Step 1: GBIF occurrence centroids (median lat/lon across cleaned points)
gbif <- pull_gbif_centroids(species          = rownames(my_traits),
                            cache_dir        = "cache/gbif",
                            occurrence_limit = 500L)
#> data.frame with columns species, centroid_lat, centroid_lon, n_occurrences

# Step 2: WorldClim bioclim summaries (median + IQR of all 19 bio variables
# across each species' GBIF occurrences)
clim <- pull_worldclim_per_species(species             = rownames(my_traits),
                                    gbif_cache_dir      = "cache/gbif",
                                    worldclim_cache_dir = "cache/worldclim",
                                    resolution          = "10m")
#> data.frame with bio1..bio19 medians + IQRs + n_extracted per species

# Step 3: hand the whole climate frame to impute() as covariates
result <- impute(my_traits, my_tree, covariates = clim)
```

Caching is on by default — both helpers write to the
`cache_dir` you supply and re-use existing extracts on subsequent
runs.  GBIF requires no API key.  WorldClim raster downloads are
~50–500 MB depending on resolution and are cached in
`worldclim_cache_dir`.  Both functions error with a clear message
if the optional dependencies (`rgbif`, `terra`) are not installed.

For a worked end-to-end example with covariates, use this section alongside
`?impute`, `?pull_gbif_centroids`, and `?pull_worldclim_per_species`. The old
static covariate walk-through is being refreshed and is not the current source
of truth.

## Tips and next steps

**Use GPU if available.** pigauto automatically selects CUDA > MPS > CPU. Check
availability with:

```{r gpu, eval=FALSE}
torch::cuda_is_available()       # NVIDIA GPU
torch::backends_mps_is_available() # Apple Silicon GPU
```

**Cache the phylogenetic graph.** For repeated experiments, save time by
writing the graph to disk:

```{r cache, eval=FALSE}
graph <- build_phylo_graph(tree300, k_eigen = 8, cache_path = "tree300_graph.rds")
```

**Large trees (N > 2000).** `build_phylo_graph()` computes a dense N × N
cophenetic distance matrix and a Laplacian eigendecomposition. The eigendecomposition
is O(N³) in the worst case, but `k_eigen = "auto"` caps the number of eigenvectors
at 32 (the default for all trees), which keeps computation tractable in practice —
pigauto has been tested up to 10,000 species on a standard laptop. The main
cost at very large N is memory for the N × N distance matrix (~800 MB at
N = 10,000). If memory is tight, prune to a clade of interest or watch for a
future release that switches to a sparse Lanczos eigensolver.

**Supplying your own data.** The workflow above works for any phylogenetic tree
and trait matrix:

```{r own-data, eval=FALSE}
tree   <- ape::read.tree("my_phylogeny.nwk")
traits <- read_traits("my_traits.csv", species_col = "species")
pd     <- preprocess_traits(traits, tree)
# ...proceed as above
```

Trait columns can be any of eight supported types. Five are
auto-detected from R column class: `numeric` (continuous), `integer`
(count), `factor` with 2 levels (binary), `factor` with >2 levels
(categorical), and `ordered` (ordinal). Three require an explicit
declaration because they cannot be inferred from class alone:
`numeric` (0–1) with `trait_types = "proportion"` for bounded
proportions; `integer` with `trait_types = "zi_count"` for
zero-inflated counts; and K `numeric` columns declared via the
`multi_proportion_groups` argument for compositional (simplex) data
whose rows sum to 1. pigauto reads column classes to decide how each
trait is encoded and imputed. The
species names in `traits` must overlap with tip labels in `tree` (partial
overlap is allowed; unmatched species are dropped with a warning).

## References

- Tobias, J.A. et al. (2022). AVONET: morphological, ecological and
  geographical data for all birds. *Ecology Letters*, 25, 581–597.
- Jetz, W. et al. (2012). The global diversity of birds in space and time.
  *Nature*, 491, 444–448.
- Goolsby, E.W., Bruggeman, J. & Ané, C. (2017). Rphylopars: fast
  multivariate phylogenetic comparative methods for missing data and
  within-species variation. *Methods in Ecology and Evolution*, 8, 22–27.
  (pigauto's internal BM baseline implements the same conditional-MVN
  approach.)
- Cohn, D.A., Ghahramani, Z. & Jordan, M.I. (1996). Active learning
  with statistical models. *Journal of Artificial Intelligence Research*,
  4, 129–145. (Variance-reduction framework that
  `suggest_next_observation()` adapts to the phylogenetic setting.)