| Title: | Fill in Missing Species Traits Using a Phylogenetic Tree |
|---|---|
| Description: | Imputes missing species trait data for comparative analyses by combining three sources of information: phylogenetic similarity (closely related species share similar traits), cross-trait correlations (observed traits inform missing ones), and optional environmental covariates (climate, habitat, geography). Handles continuous measurements, counts, binary variables, ordered categories, unordered categories, bounded proportions, zero-inflated counts, and compositional multi-proportion data in a single call. The method blends a phylogenetic baseline with a graph neural network correction; a per-trait gate calibrated on held-out data ensures the network only contributes when it improves on the baseline. Provides conformal prediction intervals (95% coverage) for continuous, count, and ordinal traits and supports Rubin's-rules multiple imputation for downstream inference, including tree-uncertainty propagation via posterior tree samples. Tested up to 10,000 species. Bundled datasets include a 300-species and a 9,993-species AVONET bird trait + BirdTree phylogeny subset. |
| Authors: | Shinichi Nakagawa [aut, cre] |
| Maintainer: | Shinichi Nakagawa <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.10.0.9000 |
| Built: | 2026-06-01 23:08:58 UTC |
| Source: | https://github.com/itchyshin/pigauto |
The full-scale counterpart to avonet300: every bird species
for which AVONET3 and the BirdTree Stage2 Hackett MCC phylogeny agree on
both a species label and a complete set of continuous morphometric
measurements. The schema is identical to avonet300 (same trait
columns, same factor encodings, same Species_Key column) so any
code that runs on avonet300 runs on avonet_full with no
modification.
avonet_fullavonet_full
A data frame with 9,993 rows and 8 variables. Columns match
avonet300: Species_Key, Mass,
Beak.Length_Culmen, Tarsus.Length, Wing.Length,
Trophic.Level, Primary.Lifestyle, Migration.
Unlike the 300-species subset, native AVONET missingness is PRESERVED in
the two ecological columns: 4 NAs in Trophic.Level and 20 NAs in
Migration. The four continuous columns are complete by construction.
Use avonet_full + tree_full to exercise pigauto at
real-world scale (the AVONET missingness sweep benchmark uses this
dataset). For quick examples or unit tests, prefer the 300-species
avonet300 subset – it is ~30x smaller and loads instantly.
See avonet300 for the full column schema (same traits and
encodings).
Tobias et al. (2022) AVONET: morphological, ecological and geographical data for all birds. Ecology Letters, 25, 581-597. BirdTree backbone: Hackett et al. MCC tree via BirdTree.org.
A data frame with 300 rows and 8 columns: 4 continuous morphometric traits
and 3 ecological traits (2 categorical, 1 ordinal) for demonstrating
mixed-type imputation. Species names are in the Species_Key
column. Set rownames(df) <- df$Species_Key; df$Species_Key <- NULL
before passing to preprocess_traits.
avonet300avonet300
A data frame with 300 rows and 8 variables:
Character. Species name in BirdTree format (spaces replaced by underscores).
Numeric. Body mass (grams).
Numeric. Beak length from culmen (mm).
Numeric. Tarsus length (mm).
Numeric. Wing length (mm).
Factor. Dietary category: Carnivore, Herbivore, Omnivore, or Scavenger.
Factor. Lifestyle category: Aerial, Aquatic, Generalist, Insessorial, or Terrestrial.
Ordered factor. Migration strategy: Resident < Partial < Full.
Tobias et al. (2022) AVONET: morphological, ecological and geographical data for all birds. Ecology Letters, 25, 581-597. BirdTree backbone: Hackett et al. MCC tree via BirdTree.org.
Computes a symmetric-normalised Gaussian kernel adjacency matrix and
Laplacian spectral node features from the cophenetic distance matrix.
Results are optionally cached to an .rds file.
build_phylo_graph(tree, k_eigen = "auto", sigma_mult = 0.5, cache_path = NULL)build_phylo_graph(tree, k_eigen = "auto", sigma_mult = 0.5, cache_path = NULL)
tree |
object of class |
k_eigen |
integer or |
sigma_mult |
numeric. Bandwidth multiplier (call it |
cache_path |
character or |
The graph is built in three steps: (1) cophenetic distances between all
pairs of tips; (2) Gaussian kernel with (where
is the user-supplied sigma_mult); (3) symmetric normalisation
with self-loops added before
normalisation.
Spectral node features are the k_eigen smallest non-zero
eigenvectors of the unnormalised Laplacian, which encode the broad
cluster structure of the phylogeny.
A list with:
Numeric matrix (n x n). Symmetric-normalised adjacency.
Numeric matrix (n x k_eigen). Spectral node features.
Integer. Number of tips.
Numeric. Bandwidth used.
Numeric matrix (n x n). Cophenetic (patristic) distances
between tips, row/column-ordered by tree$tip.label.
Returned so downstream functions can reuse it instead of calling
ape::cophenetic.phylo() again.
Numeric matrix (n x n). Element-wise squared cophenetic
distances (D * D). Used by GraphTransformerBlock for
the learnable per-head Gaussian bandwidth (B2 rate-aware attention).
Not freed during impute() training — only D is freed.
Numeric matrix (n x n). Phylogenetic correlation matrix
cov2cor(ape::vcv(tree)) (diagonal = 1). Used by
fit_baseline for BM conditional imputation.
For trees with more than 7500 tips, build_phylo_graph() uses
RSpectra's sparse Lanczos eigensolver (eigs_sym) instead
of the dense base::eigen(). This reduces the spectral step
from (hours at ) to (seconds to a minute). The threshold is
conservative because the dense eigensolver is competitive on small
and mid-size trees, and on pathological ultrametric simulations
(ape::rcoal) the Laplacian spectrum has tight degenerate
clusters that slow down Lanczos until the tree is fairly large. On
realistic asymmetric phylogenies with variable branch lengths
(e.g. the AVONET BirdTree Stage2 Hackett tree) the sparse path wins
much earlier, and at it delivers a ~10x speedup
over dense. The dense path is kept as the default for smaller trees
and as a fallback when RSpectra is not installed, or if
Lanczos fails to converge. The cophenetic distance matrix is
computed once per call and reused for both the adjacency and the
spectral features; it is also returned in the result so that
downstream consumers (notably fit_baseline) can skip
recomputation.
set.seed(1) tree <- ape::rtree(30) g <- build_phylo_graph(tree, k_eigen = 4) dim(g$adj) # 30 x 30 dim(g$coords) # 30 x 4 dim(g$D) # 30 x 30set.seed(1) tree <- ape::rtree(30) g <- build_phylo_graph(tree, k_eigen = 4) dim(g$adj) # 30 x 30 dim(g$coords) # 30 x 4 dim(g$D) # 30 x 30
Bins predicted probabilities and computes observed frequencies within each bin. Useful for calibration plots of binary classifiers.
calibration_df(truth, prob, n_bins = 10L)calibration_df(truth, prob, n_bins = 10L)
truth |
numeric vector (0/1 or TRUE/FALSE). |
prob |
numeric vector of predicted probabilities. |
n_bins |
integer, number of bins (default 10). |
A data.frame with columns bin_mid, obs_freq,
mean_pred, n.
Runs a full comparison of the BM baseline and pigauto (with attention + calibration) over multiple random seeds. Returns a tidy data.frame suitable for plotting or downstream analysis.
compare_methods( data, tree, splits = NULL, seeds = 1:3, epochs = 500L, verbose = TRUE, ... )compare_methods( data, tree, splits = NULL, seeds = 1:3, epochs = 500L, verbose = TRUE, ... )
data |
pigauto_data object. |
tree |
phylo object. |
splits |
pre-computed splits (applied to all reps) or |
seeds |
integer vector of random seeds for replication. |
epochs |
number of training epochs. |
verbose |
logical. |
... |
additional arguments passed to |
For each seed the function:
Creates train/val/test splits.
Fits the phylogenetic BM baseline.
Fits pigauto (with attention and calibration enabled by default).
Evaluates both methods on the test split.
Collects results into a single data.frame.
A data.frame with columns: method, trait,
type, metric, value, rep.
## Not run: cmp <- compare_methods(pd, tree300, seeds = 1:3, epochs = 500) # Summarise across reps aggregate(value ~ method + trait + metric, data = cmp, FUN = mean) ## End(Not run)## Not run: cmp <- compare_methods(pd, tree300, seeds = 1:3, epochs = 500) # Summarise across reps aggregate(value ~ method + trait + metric, data = cmp, FUN = mean) ## End(Not run)
Compute a confusion matrix for categorical or binary predictions
confusion_matrix(truth, predicted, levels = NULL)confusion_matrix(truth, predicted, levels = NULL)
truth |
factor or character vector of true classes. |
predicted |
factor or character vector of predicted classes. |
levels |
character vector of class levels (default: union of truth and predicted levels). |
A list with:
Confusion matrix (rows = truth, columns = predicted).
Overall accuracy.
Data.frame with per-class precision, recall, F1.
Performs stratified k-fold cross-validation by rotating which cells serve as the test set. Returns per-fold and aggregated metrics.
cross_validate( data, tree, k = 5L, seeds = 1:3, epochs = 500L, verbose = TRUE, ... )cross_validate( data, tree, k = 5L, seeds = 1:3, epochs = 500L, verbose = TRUE, ... )
data |
pigauto_data object (output of |
tree |
phylo object. |
k |
integer. Number of folds (default 5). |
seeds |
integer vector. Seeds for replicate runs. |
epochs |
integer. Training epochs per fold (default 500). |
verbose |
logical. Print progress (default TRUE). |
... |
Additional arguments passed to |
A list of class "pigauto_cv" with:
Data.frame with columns: fold, rep, trait, type, metric, value.
Data.frame with mean and sd across folds/reps for each trait + metric.
Data.frame of coverage per trait across folds (if available).
Number of folds.
Number of replicates.
## Not run: data(avonet300, tree300, package = "pigauto") traits <- avonet300; rownames(traits) <- traits$Species_Key traits$Species_Key <- NULL pd <- preprocess_traits(traits, tree300) cv <- cross_validate(pd, tree300, k = 5L, seeds = 1:3, epochs = 500L) print(cv) summary(cv) ## End(Not run)## Not run: data(avonet300, tree300, package = "pigauto") traits <- avonet300; rownames(traits) <- traits$Species_Key traits$Species_Key <- NULL pd <- preprocess_traits(traits, tree300) cv <- cross_validate(pd, tree300, k = 5L, seeds = 1:3, epochs = 500L) print(cv) summary(cv) ## End(Not run)
A simulated dataset mimicking a thermal tolerance study where each species has multiple measurements of critical thermal maximum (CTmax) taken at different acclimation temperatures. The data-generating process is
ctmax_simctmax_sim
A data frame with 1,464 rows and 3 variables:
Character. Species name matching tree300 tips.
Numeric. Acclimation temperature (degrees C). This is an observation-level covariate that varies within species.
Numeric. Critical thermal maximum (degrees C). Contains NA values for unobserved species and MCAR missingness.
where follows Brownian motion on tree300,
the within-species acclimation response ratio is 0.10, and
. Thirty percent of species are entirely
unobserved (all CTmax values are NA), and an additional 15\
observations are missing at random.
This dataset demonstrates pigauto's multi-observation and observation-level
covariate support. Use with tree300 and set
species_col = "species" in impute.
Simulated. See data-raw/make_ctmax_sim.R for the
generation script.
Plumage lightness measurements and environmental covariates for 5,809 passerine bird species. This dataset demonstrates environmental-covariate support in pigauto: climate variables are fully observed conditioners that improve imputation of lightness traits.
delhey5809delhey5809
A data frame with 5,809 rows and 10 variables:
Character. Species name in BirdTree format.
Character. Taxonomic family.
Numeric. Annual mean temperature (BIO1, degrees C x 10).
Numeric. Annual precipitation (mm).
Numeric. Percent tree cover.
Numeric. Mean temperature of warmest quarter (BIO10, degrees C x 10).
Numeric. Precipitation of warmest quarter (mm).
Numeric. Mid-latitude of species range.
Numeric. Average plumage lightness for males.
Numeric. Average plumage lightness for females.
Delhey K, Dale J, Valcu M, Kempenaers B (2019). "Reconciling ecogeographical rules: rainfall and temperature predict global colour variation in the largest bird radiation." Ecology Letters, 22(5): 726-736.
Computes type-specific performance metrics for each trait. By default
the model is evaluated on the test split stored in the fit object, but
an alternative data / splits can be supplied.
evaluate(fit, data = NULL, splits = NULL)evaluate(fit, data = NULL, splits = NULL)
fit |
pigauto_fit object. |
data |
pigauto_data object (default: |
splits |
splits object (default: |
Metrics by trait type:
RMSE, Pearson r, MAE
RMSE, MAE, Pearson r
Accuracy, Brier score
Accuracy (overall)
RMSE (on integer scale), Spearman rho
When conformal scores are present in the fit, conformal coverage at the 95\ ordinal traits.
When the fit includes a baseline, baseline metrics are appended with
method = "baseline" for direct comparison.
A data.frame with columns: method, trait,
type, metric, value, n_test.
## Not run: eval_df <- evaluate(fit) eval_df[eval_df$metric == "rmse", ] ## End(Not run)## Not run: eval_df <- evaluate(fit) eval_df[eval_df$metric == "rmse", ] ## End(Not run)
Computes type-specific metrics for each trait on the validation and test
splits. When a trait_map is supplied, metrics are dispatched per
trait type; otherwise the function falls back to continuous-only metrics
(RMSE, Pearson r, 95% coverage).
evaluate_imputation(pred, truth, splits, pred_se = NULL, trait_map = NULL)evaluate_imputation(pred, truth, splits, pred_se = NULL, trait_map = NULL)
pred |
predicted values: either a numeric matrix in latent scale
(same dimensions as |
truth |
numeric matrix of true values in latent scale (from
|
splits |
list (output of |
pred_se |
numeric matrix of prediction SEs (same scale as
|
trait_map |
list of trait descriptors (from |
Metrics by trait type:
RMSE, Pearson r, 95\ supplied)
RMSE, Pearson r, 95\ supplied)
RMSE, MAE, Pearson r
RMSE, Spearman rho
Accuracy, Brier score
Accuracy
RMSE, MAE, Pearson r, zero-accuracy, Brier score (on gate)
For binary and categorical traits the function accepts either a
pigauto_pred object (preferred, gives access to probabilities) or
raw matrices (latent scale).
A data.frame with columns split, trait,
type, n, and type-specific metric columns.
## Not run: # From pigauto_pred object eval_df <- evaluate_imputation(pred_obj, pd$X_scaled, splits) # From raw latent matrix eval_df <- evaluate_imputation(bl$mu, pd$X_scaled, splits, trait_map = pd$trait_map) ## End(Not run)## Not run: # From pigauto_pred object eval_df <- evaluate_imputation(pred_obj, pd$X_scaled, splits) # From raw latent matrix eval_df <- evaluate_imputation(bl$mu, pd$X_scaled, splits, trait_map = pd$trait_map) ## End(Not run)
Dispatches to pigauto's phylogenetic baseline machinery and returns imputed latent-scale means and standard errors for every species.
fit_baseline( data, tree, splits = NULL, model = "BM", graph = NULL, multi_obs_aggregation = c("hard", "soft"), em_iterations = 0L, em_tol = 0.001, em_offdiag = FALSE )fit_baseline( data, tree, splits = NULL, model = "BM", graph = NULL, multi_obs_aggregation = c("hard", "soft"), em_iterations = 0L, em_tol = 0.001, em_offdiag = FALSE )
data |
object of class |
tree |
object of class |
splits |
list (output of |
model |
character. Evolutionary model: |
graph |
optional list returned by |
multi_obs_aggregation |
character. How to aggregate multiple
observations per species before the Level-C (Rphylopars) baseline:
|
em_iterations |
integer. Number of Phase 6 EM iterations for the
threshold-joint baseline (binary + ordinal + OVR categorical). Default
|
em_tol |
numeric. Relative-Frobenius convergence tolerance for the
Phase 6 / 7 EM loop. Early-stops when
|
em_offdiag |
logical. Phase 7 opt-in: when |
When splits is supplied the val and test cells are masked to
NA before fitting, so the baseline is evaluated under the same
conditions as fit_pigauto.
Continuous-family columns use Brownian-motion conditional MVN baselines on the phylogenetic correlation matrix, either independently or through the joint MVN path when the data and optional dependencies support it. Binary, ordinal, categorical, and zero-inflated gate columns use the appropriate label-propagation or threshold/liability baseline candidates, with per-column fallbacks when a joint path is not available.
A list with:
Numeric matrix (n_species x p_latent), baseline means in latent scale.
Numeric matrix (n_species x p_latent), standard errors.
## Not run: data(avonet300, tree300, package = "pigauto") traits <- avonet300; rownames(traits) <- traits$Species_Key traits$Species_Key <- NULL pd <- preprocess_traits(traits, tree300) splits <- make_missing_splits(pd$X_scaled, trait_map = pd$trait_map) bl <- fit_baseline(pd, tree300, splits) ## End(Not run)## Not run: data(avonet300, tree300, package = "pigauto") traits <- avonet300; rownames(traits) <- traits$Species_Key traits$Species_Key <- NULL pd <- preprocess_traits(traits, tree300) splits <- make_missing_splits(pd$X_scaled, trait_map = pd$trait_map) bl <- fit_baseline(pd, tree300, splits) ## End(Not run)
Uses BACE to provide a phylogenetically-informed Bayesian baseline
for all trait types. Returns imputed means and between-imputation SEs
in latent scale, matching the interface of fit_baseline.
fit_baseline_bace( data, tree, splits = NULL, runs = 5L, nitt = 4000L, burnin = 1000L, thin = 10L, verbose = TRUE )fit_baseline_bace( data, tree, splits = NULL, runs = 5L, nitt = 4000L, burnin = 1000L, thin = 10L, verbose = TRUE )
data |
object of class |
tree |
object of class |
splits |
list (output of |
runs |
integer. Number of BACE chained imputation iterations
(default |
nitt |
integer. MCMC iterations per model (default |
burnin |
integer. Burn-in iterations (default |
thin |
integer. Thinning rate (default |
verbose |
logical. |
BACE runs chained MCMCglmm imputation: each trait is modelled as a response with all others as predictors, cycling through multiple MCMC runs. This provides a fully phylogenetic baseline for binary, categorical, and count traits — not just continuous ones.
The returned mu matrix is the mean across imputation runs;
se is the between-imputation SD (capturing imputation
uncertainty). Both are in latent scale (same as
pigauto_data$X_scaled), so they can be passed directly to
fit_pigauto as the baseline argument.
A list with:
Numeric matrix (n x p_latent), baseline means.
Numeric matrix (n x p_latent), between-imputation SEs.
## Not run: bl_bace <- fit_baseline_bace(pd, tree, splits = spl) fit <- fit_pigauto(pd, tree, splits = spl, baseline = bl_bace) ## End(Not run)## Not run: bl_bace <- fit_baseline_bace(pd, tree, splits = spl) fit <- fit_pigauto(pd, tree, splits = spl, baseline = bl_bace) ## End(Not run)
Trains a pigauto model: a gated ensemble of a phylogenetic baseline
and an attention-based graph neural network correction, implemented
as an internal torch module (ResidualPhyloDAE; "Residual" here
refers to the ResNet-style skip connections in the GNN layers, not
to a statistical residual). For continuous, count, and ordinal traits
the baseline is Brownian motion (phylogenetic correlation matrix); for binary
and categorical traits it is phylogenetic label propagation. Supports
all five trait types via a unified latent space.
fit_pigauto( data, tree, splits = NULL, graph = NULL, baseline = NULL, hidden_dim = 64L, k_eigen = "auto", n_gnn_layers = 2L, gate_cap = 0.8, use_attention = TRUE, use_transformer_blocks = TRUE, n_heads = 4L, ffn_mult = 4L, use_trait_attention = FALSE, n_trait_heads = 2L, trait_embed_dim = 32L, dropout = 0.1, lr = 0.003, weight_decay = 1e-04, epochs = 3000L, corruption_rate = 0.55, corruption_start = 0.2, corruption_ramp = 500L, refine_steps = 8L, lambda_shrink = 0.03, lambda_gate = 0.01, warmup_epochs = 200L, edge_dropout = 0.1, eval_every = 100L, patience = 10L, clip_norm = 1, conformal_method = c("split", "bootstrap"), conformal_bootstrap_B = 500L, conformal_split_val = FALSE, gate_method = c("cv_folds", "median_splits", "single_split"), gate_splits_B = 31L, gate_cv_folds = 5L, safety_floor = TRUE, phylo_signal_gate = TRUE, phylo_signal_threshold = 0.2, phylo_signal_method = c("lambda", "blomberg_k"), min_val_cells = 20L, verbose = TRUE, seed = 1L )fit_pigauto( data, tree, splits = NULL, graph = NULL, baseline = NULL, hidden_dim = 64L, k_eigen = "auto", n_gnn_layers = 2L, gate_cap = 0.8, use_attention = TRUE, use_transformer_blocks = TRUE, n_heads = 4L, ffn_mult = 4L, use_trait_attention = FALSE, n_trait_heads = 2L, trait_embed_dim = 32L, dropout = 0.1, lr = 0.003, weight_decay = 1e-04, epochs = 3000L, corruption_rate = 0.55, corruption_start = 0.2, corruption_ramp = 500L, refine_steps = 8L, lambda_shrink = 0.03, lambda_gate = 0.01, warmup_epochs = 200L, edge_dropout = 0.1, eval_every = 100L, patience = 10L, clip_norm = 1, conformal_method = c("split", "bootstrap"), conformal_bootstrap_B = 500L, conformal_split_val = FALSE, gate_method = c("cv_folds", "median_splits", "single_split"), gate_splits_B = 31L, gate_cv_folds = 5L, safety_floor = TRUE, phylo_signal_gate = TRUE, phylo_signal_threshold = 0.2, phylo_signal_method = c("lambda", "blomberg_k"), min_val_cells = 20L, verbose = TRUE, seed = 1L )
data |
object of class |
tree |
object of class |
splits |
list (output of |
graph |
list (output of |
baseline |
list (output of |
|
integer. Hidden layer width (default |
|
k_eigen |
integer. Number of spectral node features (default
|
n_gnn_layers |
integer. Number of graph message-passing layers
(default |
gate_cap |
numeric. Upper bound for the per-column blend gate
(default |
use_attention |
logical. Use attention in the GNN layers (default
|
use_transformer_blocks |
logical. Replace the legacy attention stack with pre-norm transformer-encoder blocks (multi-head attention
|
n_heads |
integer. Number of attention heads when
|
ffn_mult |
integer. Feed-forward width multiplier inside each
transformer block (default |
use_trait_attention |
logical. Opt-in within-row cross-trait
self-attention (B3, v0.9.3). When |
n_trait_heads |
integer. Number of attention heads in the within-row
self-attention block when |
trait_embed_dim |
integer. Embedding dim per trait token in the
within-row self-attention block. Default |
dropout |
numeric. Dropout rate (default |
lr |
numeric. AdamW learning rate (default |
weight_decay |
numeric. AdamW weight decay (default |
epochs |
integer. Maximum training epochs (default |
corruption_rate |
numeric. Final corruption fraction if
|
corruption_start |
numeric. Initial corruption fraction for the
curriculum schedule (default |
corruption_ramp |
integer. Epochs over which corruption linearly
ramps from |
refine_steps |
integer. Iterative refinement steps at inference
(default |
lambda_shrink |
numeric. Weight on the shrinkage penalty
|
lambda_gate |
numeric. Weight on the gate regularisation penalty
that pushes learnable gates toward zero. Prevents gates from staying
open when the GNN provides no useful correction (default |
warmup_epochs |
integer. Linear learning-rate warmup over the first
N epochs (default |
edge_dropout |
numeric. Fraction of adjacency edges randomly zeroed
each training epoch for graph regularisation (default |
eval_every |
integer. Evaluate on val every N epochs (default
|
patience |
integer. Early-stopping patience in eval cycles (default
|
clip_norm |
numeric. Gradient clip norm (default |
conformal_method |
character. How the conformal prediction score
is estimated from validation residuals. |
conformal_bootstrap_B |
integer. Bootstrap resamples used when
|
conformal_split_val |
logical. Default |
gate_method |
character. How the per-trait calibrated gate is
chosen. |
gate_splits_B |
integer. Random splits used when
|
gate_cv_folds |
integer. Number of CV folds when
|
safety_floor |
logical. When |
phylo_signal_gate |
logical. When |
phylo_signal_threshold |
numeric, default |
phylo_signal_method |
character, currently only |
min_val_cells |
integer. Warn at fit time if any trait has fewer
than |
verbose |
logical. Print training progress (default |
seed |
integer. Random seed (default |
Blend formulation:
The prediction is , where is
the BM baseline, is the model's direct prediction, and
is a per-column learnable gate
bounded in . When , the prediction
collapses to the baseline.
The gate is regularised toward zero via the shrinkage penalty on
, so the model defaults to the baseline unless the
GNN's correction demonstrably helps on the validation set.
Training objective (per epoch):
A random subset of observed cells is corrupted with a learnable mask token.
The model predicts from graph context.
Loss = type-specific reconstruction on corrupted cells +
lambda_shrink * MSE() +
lambda_gate * MSE().
The gate penalty on is necessary because when (the BM-optimal solution for observed cells), the reconstruction
and shrinkage losses both equal zero regardless of , leaving no
gradient to close the gate. The explicit penalty ensures gates default
toward zero when the GNN correction provides no benefit.
Type-specific losses:
MSE
BCE with logits
cross-entropy over K latent columns
An object of class "pigauto_fit".
pigauto's 95\
interval half-width for each trait is the empirical (1 - alpha)
quantile of on held-out validation cells. When the
number of validation cells per trait (n_val) is large
(), split-conformal gives near-exact marginal coverage
under mild exchangeability assumptions.
At small n_val ( 20, and especially 10) two
things degrade at once:
The conformal quantile clamps to max(residuals) because
the required quantile level exceeds 1. The score is the single
largest val residual and has substantial sampling variance across
fits.
The gate calibration's half-A / half-B split ends up with only
a few cells per half, so the grid-search winner is essentially
random between 0 and gate_cap.
Empirically (pigauto simulation harness, n=150 species, 35\
trait_MAR, 10 random seeds), default behaviour produces 92\
coverage with per-fit coverage ranging [0.73, 1.00]. The mean
is close to the 95\
gate_method = "median_splits" and conformal_method =
"bootstrap" reduce their respective estimator variances but
empirically do not meaningfully narrow the coverage distribution.
Treat the 95\
regime; increase missing_frac (more held-out data for
calibration) or collect more species if tight per-fit coverage is
required.
The min_val_cells warning fires at fit time whenever any trait
falls below this threshold, so you know when to interpret the
intervals cautiously.
One-call interface to the full pigauto pipeline: preprocessing, baseline
fitting, GNN training, and prediction. For fine-grained control, use the
individual functions (preprocess_traits,
fit_baseline, fit_pigauto, etc.) directly.
impute( traits, tree, species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_transform = TRUE, missing_frac = 0.25, n_imputations = 1L, covariates = NULL, epochs = 2000L, verbose = TRUE, seed = 1L, multi_obs_aggregation = c("hard", "soft"), em_iterations = 0L, em_tol = 0.001, em_offdiag = FALSE, pool_method = c("median", "mean", "mode"), clamp_outliers = FALSE, clamp_factor = 5, match_observed = c("none", "pmm"), pmm_K = 5L, safety_floor = TRUE, phylo_signal_gate = TRUE, phylo_signal_threshold = 0.2, phylo_signal_method = "lambda", ... )impute( traits, tree, species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_transform = TRUE, missing_frac = 0.25, n_imputations = 1L, covariates = NULL, epochs = 2000L, verbose = TRUE, seed = 1L, multi_obs_aggregation = c("hard", "soft"), em_iterations = 0L, em_tol = 0.001, em_offdiag = FALSE, pool_method = c("median", "mean", "mode"), clamp_outliers = FALSE, clamp_factor = 5, match_observed = c("none", "pmm"), pmm_K = 5L, safety_floor = TRUE, phylo_signal_gate = TRUE, phylo_signal_threshold = 0.2, phylo_signal_method = "lambda", ... )
traits |
data.frame with species as rownames and trait columns,
or (when |
tree |
object of class |
species_col |
character. Name of the column in |
trait_types |
named character vector overriding the auto-detected
type for specific trait columns, e.g.
|
multi_proportion_groups |
named list declaring compositional
( |
log_transform |
logical. Auto-log positive continuous columns
(default |
missing_frac |
numeric. Fraction of observed cells held out for
validation/test evaluation (default |
n_imputations |
integer. Number of MC-dropout imputation sets
(default |
covariates |
data.frame or matrix of environmental covariates
(fully observed — no NAs). Covariates are conditioners: they inform
imputation but are not themselves imputed. Numeric/integer columns are
z-scored; factor/ordered columns are one-hot encoded automatically.
If a variable has missing values, include it in |
epochs |
integer. Maximum GNN training epochs (default |
verbose |
logical. Print progress (default |
seed |
integer. Random seed (default |
multi_obs_aggregation |
character. How to aggregate multiple
observations per species before the Level-C baseline. |
em_iterations |
integer. Phase 6 EM iterations for the
threshold-joint baseline (binary + ordinal + OVR categorical).
Default |
em_tol |
numeric. Relative-Frobenius convergence tolerance for
the Phase 6 / 7 EM loop. Default |
em_offdiag |
logical. Phase 7 opt-in: when |
pool_method |
character. How to pool multiple imputation draws
( |
clamp_outliers |
logical. Phase G (v0.9.1.9011+). When
|
clamp_factor |
numeric scalar (>= 1). Multiplicative factor
on the observed maximum used by |
match_observed |
character, one of When to use: PMM is a niche feature. pigauto already
provides conformal prediction intervals (calibrated against
held-out residuals) and Default |
pmm_K |
integer (>= 1). Donor pool size for PMM. Default
|
safety_floor |
logical. When |
phylo_signal_gate, phylo_signal_threshold, phylo_signal_method
|
Pass-through to |
... |
additional arguments passed to |
An object of class "pigauto_result" with components:
The input traits data.frame with observed
values preserved and only missing cells filled in. This is the
primary output – typically what users want.
Logical matrix (same shape as completed)
that is TRUE for cells that were imputed (originally
NA) and FALSE for observed cells.
A pigauto_pred object from
predict.pigauto_fit containing raw model
predictions for every cell (observed + missing), standard errors,
class probabilities, and conformal intervals.
The trained pigauto_fit object.
The phylogenetic baseline.
The preprocessed pigauto_data object.
The val/test splits (or NULL if
missing_frac = 0).
Evaluation metrics on test set (or NULL).
pigauto infers each trait's type from its R class — no trait_types
argument is needed for most data:
| R class | pigauto type |
numeric |
continuous (auto-log if all-positive) |
integer |
count |
factor with 2 levels |
binary |
factor (unordered) with >2 levels |
categorical |
ordered / factor(..., ordered = TRUE) |
ordinal |
character |
→ factor → binary or categorical |
logical |
binary |
Two types cannot be inferred from class alone and must be declared
via trait_types:
"proportion"A numeric bounded 0–1, e.g. survival
rate: trait_types = c(Survival = "proportion").
"zi_count"An integer with excess zeros, e.g.
parasite count: trait_types = c(Parasites = "zi_count").
Use the trait_types argument directly (it is an explicit
parameter, not a ... pass-through).
The distinction is functional, not ontological: a trait is something
you want to impute (NA values allowed in traits); a covariate is
something you use to sharpen imputation accuracy (must be fully observed,
passed via covariates). The same variable can be either depending on
the scientific question.
Examples:
IUCN status with Data Deficient species → put it in
traits as ordered(c("LC","NT","VU","EN","CR")) so
pigauto predicts the unknown categories.
IUCN status fully known for all species → pass as a covariate to inform imputation of other traits (e.g. body mass, range size).
Realm / biome (factor) → pass as a covariate; pigauto one-hot encodes factor columns automatically (v0.6.1+).
Variables that belong in traits: anything with missing values you
care about predicting. Variables that belong in covariates: fully
observed, exogenous to the trait space (geography, climate, habitat,
experimental treatment).
With safety_floor = TRUE (the new default), the post-training
calibration grid searches a 3-way convex combination of the
Brownian-motion baseline, the GNN delta, and the per-trait grand
mean. The simplex is sampled at step 0.05 (231 candidates per latent
column). Because the corner (0, 0, 1) — pure grand mean —
is always in the grid, the selected candidate cannot be worse than
the grand-mean corner on the validation cells under the calibration
metric. The fit object gains four new slots:
r_cal_bm, r_cal_gnn, r_cal_mean (each a named
numeric of length p_latent), and
mean_baseline_per_col.
Set safety_floor = FALSE to reproduce the pre-v0.9.1.9002
1-D calibration bit-identically (no mean term; r_cal_mean = 0;
r_cal_bm = 1 - r_cal_gnn). See
specs/2026-04-23-safety-floor-mean-gate-design.md for the
design rationale and
plans/2026-04-23-safety-floor-mean-gate.md for the
implementation plan.
pigauto only imputes cells that are NA in the input.
Observed cells are preserved as-is in result$completed. The
slot result$prediction$imputed contains the model's
prediction for every cell – observed and missing alike –
and is intended for diagnostics (e.g. checking calibration on
training cells), not as the imputed-values output. The
imputed values themselves are result$completed[result$imputed_mask].
Common pitfall. If you call impute() on a fully
observed trait matrix (no NAs anywhere), there is nothing
to impute. result$completed is identical to the input,
sum(result$imputed_mask) is 0, and
result$prediction$imputed is just model predictions for
already-known values. This is the right behaviour, but it can
look surprising: e.g. on avonet300 (fully observed), the
"imputed" Mass values you see are simply the observed body
masses passed through (some bird species are 24 kg). To exercise
the imputation path on a complete dataset, mask some cells first
(see Examples).
Imbalanced K-class traits. At default settings
(n_imputations = 1L, pool_method = "median"), a
small ordinal / categorical trait whose marginal distribution is
heavily skewed (e.g. AVONET Migration is ~78\
~14\
collapse onto a corner that predicts the majority class
everywhere. When this matters, increase n_imputations
(>= 20 in our K=3 ordinal benches) and set
pool_method = "mode" (Phase H, +6.6 pp on AVONET
Migration K=3 vs the default median pool). See
useful/MEMO_2026-05-01_phase_h_results.md.
## Not run: # Typical use: your data already has NAs you want filled result <- impute(my_traits_with_NAs, my_tree) result$completed # observed preserved, NAs filled result$imputed_mask # which cells were imputed sum(result$imputed_mask) # how many cells were imputed # Proportion and zi_count must be declared explicitly result <- impute(my_traits, my_tree, trait_types = c(Survival = "proportion", Parasites = "zi_count")) # Diagnostic: raw predictions for every cell (NOT the imputed values). # `imputed` here means "model output", not "filled gap". result$prediction$imputed # model prediction at every cell result$prediction$se # per-cell uncertainty result$prediction$probabilities$diet # class probabilities # Demonstration / sanity-check on a fully observed dataset: # mask some cells, impute, compare predictions to truth. data(avonet300, tree300) df <- avonet300 rownames(df) <- df$Species_Key; df$Species_Key <- NULL set.seed(1L) truth <- df$Mass df_obs <- df hide <- sample(which(!is.na(df$Mass)), 30L) df_obs$Mass[hide] <- NA result <- impute(df_obs, tree300) truth[hide] # held-out truth result$completed$Mass[hide] # pigauto's imputations for those cells plot(truth[hide], result$completed$Mass[hide], log = "xy", xlab = "truth", ylab = "imputed") abline(0, 1, col = "red") # For imbalanced K=3 ordinal traits (e.g. Migration), prefer: result <- impute(df_obs, tree300, n_imputations = 20L, pool_method = "mode") ## End(Not run)## Not run: # Typical use: your data already has NAs you want filled result <- impute(my_traits_with_NAs, my_tree) result$completed # observed preserved, NAs filled result$imputed_mask # which cells were imputed sum(result$imputed_mask) # how many cells were imputed # Proportion and zi_count must be declared explicitly result <- impute(my_traits, my_tree, trait_types = c(Survival = "proportion", Parasites = "zi_count")) # Diagnostic: raw predictions for every cell (NOT the imputed values). # `imputed` here means "model output", not "filled gap". result$prediction$imputed # model prediction at every cell result$prediction$se # per-cell uncertainty result$prediction$probabilities$diet # class probabilities # Demonstration / sanity-check on a fully observed dataset: # mask some cells, impute, compare predictions to truth. data(avonet300, tree300) df <- avonet300 rownames(df) <- df$Species_Key; df$Species_Key <- NULL set.seed(1L) truth <- df$Mass df_obs <- df hide <- sample(which(!is.na(df$Mass)), 30L) df_obs$Mass[hide] <- NA result <- impute(df_obs, tree300) truth[hide] # held-out truth result$completed$Mass[hide] # pigauto's imputations for those cells plot(truth[hide], result$completed$Mass[hide], log = "xy", xlab = "truth", ylab = "imputed") abline(0, 1, col = "red") # For imbalanced K=3 ordinal traits (e.g. Migration), prefer: result <- impute(df_obs, tree300, n_imputations = 20L, pool_method = "mode") ## End(Not run)
Reads a pigauto_fit object previously saved with
save_pigauto.
load_pigauto(path)load_pigauto(path)
path |
character. File path to load from. |
An object of class "pigauto_fit".
## Not run: fit <- load_pigauto("my_model.pigauto") pred <- predict(fit) ## End(Not run)## Not run: fit <- load_pigauto("my_model.pigauto") pred <- predict(fit) ## End(Not run)
Randomly designates a fraction of cells as "missing" and splits them into
validation and test sets. When a trait_map is supplied, masking
operates at the original trait level – all latent columns belonging
to one trait are held out together (important for categorical traits).
make_missing_splits( X, missing_frac = 0.25, val_frac = 0.25, seed = 555, trait_map = NULL, mechanism = c("MCAR", "MAR_trait", "MAR_phylo", "MNAR"), mechanism_args = list(), tree = NULL )make_missing_splits( X, missing_frac = 0.25, val_frac = 0.25, seed = 555, trait_map = NULL, mechanism = c("MCAR", "MAR_trait", "MAR_phylo", "MNAR"), mechanism_args = list(), tree = NULL )
X |
numeric matrix (species x latent columns from
|
missing_frac |
numeric. Fraction of all (species, trait) cells to
designate as missing (default |
val_frac |
numeric. Fraction of missing cells for validation
(default |
seed |
integer. Random seed for reproducibility (default |
trait_map |
list of trait descriptors (from |
mechanism |
character. Missingness mechanism:
|
mechanism_args |
named list of mechanism-specific parameters:
|
tree |
object of class |
The returned index vectors use linear (column-major) indexing. Both
original-trait-space and latent-space indices are returned when a
trait_map is present.
A list with:
Integer vector of linear indices (latent space).
Integer vector of linear indices (latent space).
Integer vector in original-trait space (if
trait_map supplied).
Integer vector in original-trait space (if
trait_map supplied).
Number of species (rows).
Number of latent columns.
Number of original traits.
Logical matrix (n x p_latent). TRUE = observed.
Character string of the mechanism used.
X <- matrix(rnorm(100), nrow = 20) splits <- make_missing_splits(X, missing_frac = 0.25, seed = 1) length(splits$val_idx) # MAR: missingness depends on another trait splits_mar <- make_missing_splits(X, mechanism = "MAR_trait", mechanism_args = list(driver_col = 1, beta = 2))X <- matrix(rnorm(100), nrow = 20) splits <- make_missing_splits(X, missing_frac = 0.25, seed = 1) length(splits$val_idx) # MAR: missingness depends on another trait splits_mar <- make_missing_splits(X, mechanism = "MAR_trait", mechanism_args = list(driver_col = 1, beta = 2))
Returns a logical matrix of the same dimensions as X, with
TRUE where values are observed (not NA).
mask_missing(X)mask_missing(X)
X |
numeric matrix (species x traits or species x latent columns). |
Logical matrix, TRUE = observed.
X <- matrix(c(1, NA, 3, 4), nrow = 2) mask_missing(X)X <- matrix(c(1, NA, 3, 4), nrow = 2) mask_missing(X)
Run pigauto's full imputation pipeline and return M stochastic
completions of the trait matrix instead of a single point estimate.
The M datasets are the input needed for the classical multiple
imputation workflow: fit a downstream model on each dataset, then
pool the results with Rubin's rules via pool_mi(). This is the
standard way to propagate imputation uncertainty into phylogenetic
comparative analyses (PGLS, PGLMM, etc.) rather than treating
imputed cells as if they were observed.
multi_impute( traits, tree, m = 100L, draws_method = c("conformal", "mc_dropout"), species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_transform = TRUE, missing_frac = 0.25, covariates = NULL, epochs = 2000L, verbose = TRUE, seed = 1L, ... )multi_impute( traits, tree, m = 100L, draws_method = c("conformal", "mc_dropout"), species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_transform = TRUE, missing_frac = 0.25, covariates = NULL, epochs = 2000L, verbose = TRUE, seed = 1L, ... )
traits |
data.frame with species as rownames and trait columns.
Same input format as |
tree |
object of class |
m |
integer. Number of imputation datasets to generate
(default |
draws_method |
character. How stochastic draws are generated for missing cells. One of:
|
species_col |
character or |
trait_types |
named character vector overriding auto-detected
trait types for specific columns. Required for |
multi_proportion_groups |
named list declaring compositional
trait groups (rows summing to 1), e.g.
|
log_transform |
logical. Auto-log positive continuous columns
(default |
missing_frac |
numeric. Fraction of observed cells held out for
validation/test during training (default |
covariates |
data.frame or matrix of environmental covariates
(fully observed, numeric). Passed through to |
epochs |
integer. Maximum GNN training epochs (default |
verbose |
logical. Print progress (default |
seed |
integer. Random seed (default |
... |
additional arguments forwarded to |
Multiple imputation is a method for doing downstream analysis
under missing data, not an end in itself. Plugging a single
point-estimate imputation into a regression underestimates standard
errors because it treats imputed cells as if they were observed.
The standard remedy, due to Rubin (1987), is to generate M
stochastic completions, fit the downstream model on each, and pool
the results. multi_impute() + with_imputations() + pool_mi()
implement this workflow end to end.
draws_method = "conformal" (default): Run the model once; missing
cells are sampled from
where is the trait-level conformal score (the empirical
97.5th percentile of held-out absolute residuals, in latent z-score
units back-transformed to the original scale). The draw width is
therefore calibrated against actual prediction error regardless of
whether the BM or GNN term dominates. For discrete traits (binary,
categorical) it uses Bernoulli / categorical draws from the estimated
probability vector. For multi_proportion groups it draws the
K CLR latent columns with their BM latent SEs, projects back to
sum-zero CLR space, and decodes to the simplex. This is the preferred
default for pigauto.
draws_method = "mc_dropout": Run M GNN forward passes in
training mode (dropout active). Caution: when the per-trait
calibrated gate r_cal = 0 (which happens whenever the BM baseline
already fits well, typically for continuous traits with strong
phylogenetic signal), every MC pass is identical to the BM point
estimate and draws have zero between-imputation variance. Check
mi$fit$calibrated_gates after fitting — if all gates for the traits
of interest are zero, use draws_method = "conformal" instead.
Nakagawa & Freckleton (2008, 2011) review the consequences of ignoring missing data in ecological and comparative analyses and argue for multiple imputation as the default.
An object of class "pigauto_mi" with components:
datasetsA list of length m. Each element is a
data.frame with the same shape and column types as the input
traits; observed cells are preserved and missing cells are
filled with the corresponding imputation draw. Pass this list
to with_imputations() to fit downstream models.
mNumber of imputations.
pooled_pointA single data.frame whose missing cells
are replaced by the MC-averaged point estimate. Convenient for
reporting but does not propagate imputation uncertainty –
use datasets + pool_mi() for inference.
seMatrix of per-cell standard errors combining the baseline SE and the between-imputation standard deviation.
imputed_maskLogical matrix; TRUE where a cell was
originally missing.
fitThe underlying pigauto_fit
object, retained for diagnostics and for calls to predict()
on new data.
dataThe pigauto_data object.
treeThe input phylogeny.
species_colPassed-through species-column name or
NULL.
pigauto provides two multiple-imputation functions. Pick based on how many trees you have:
One tree (single published phylogeny, single time-calibrated tree):
use multi_impute(). The m MC-dropout imputations capture model
uncertainty.
Multiple posterior trees (BirdTree samples, BEAST posterior, etc.):
use multi_impute_trees(). Between-tree variation is added to the
pooled SEs via Rubin's rules (Nakagawa & de Villemereuil 2019).
The two functions share the same downstream API — both return objects
compatible with with_imputations() and pool_mi().
When fit_pigauto() was called with safety_floor = TRUE
(the default since v0.9.1.9002), the 3-way blend
r_BM * BM + r_GNN * GNN + r_MEAN * MEAN propagates through
every imputation draw automatically via the updated
predict.pigauto_fit(). For draws_method = "mc_dropout"
the mean term contributes no between-draw variance (it is a
deterministic scalar per column), so Rubin-pooled SE stays correctly
calibrated: variance comes from the BM-draw and GNN-dropout terms
only. For draws_method = "conformal" the blend centre is the
3-way prediction and conformal scores remain calibrated on the
blended residuals.
Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
Nakagawa S, Freckleton RP (2008). "Missing inaction: the dangers of ignoring missing data." Trends in Ecology & Evolution 23(11): 592-596.
Nakagawa S, Freckleton RP (2011). "Model averaging, missing data and multiple imputation: a case study for behavioural ecology." Behavioral Ecology and Sociobiology 65(1): 103-116.
impute() for single-point imputation, with_imputations()
for applying a model-fitting function across the M datasets,
pool_mi() for Rubin's rules pooling of the resulting fits.
## Not run: library(pigauto) data(avonet300, tree300) df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL # Generate 100 complete datasets mi <- multi_impute(df, tree300, m = 100) print(mi) # Downstream analysis: phylogenetic GLS via nlme, pooled with Rubin's rules fits <- with_imputations(mi, function(d) { d$species <- rownames(d) nlme::gls( log(Mass) ~ log(Wing.Length), correlation = ape::corBrownian(phy = tree300, form = ~species), data = d, method = "ML" ) }) pool_mi(fits) ## End(Not run)## Not run: library(pigauto) data(avonet300, tree300) df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL # Generate 100 complete datasets mi <- multi_impute(df, tree300, m = 100) print(mi) # Downstream analysis: phylogenetic GLS via nlme, pooled with Rubin's rules fits <- with_imputations(mi, function(d) { d$species <- rownames(d) nlme::gls( log(Mass) ~ log(Wing.Length), correlation = ape::corBrownian(phy = tree300, form = ~species), data = d, method = "ML" ) }) pool_mi(fits) ## End(Not run)
Run pigauto's full imputation pipeline on each of T posterior
phylogenies, generating m_per_tree stochastic completions per tree
for a total of T * m_per_tree completed datasets. Each completed
dataset is conditional on a specific posterior tree
(recorded in mi$tree_index).
multi_impute_trees( traits, trees, m_per_tree = 1L, species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_transform = TRUE, missing_frac = 0.25, covariates = NULL, epochs = 2000L, verbose = TRUE, seed = 1L, share_gnn = TRUE, reference_tree = NULL, ... )multi_impute_trees( traits, trees, m_per_tree = 1L, species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_transform = TRUE, missing_frac = 0.25, covariates = NULL, epochs = 2000L, verbose = TRUE, seed = 1L, share_gnn = TRUE, reference_tree = NULL, ... )
traits |
data.frame. Same format as |
trees |
list of |
m_per_tree |
integer. Number of MC-dropout imputations per tree
(default |
species_col |
character or |
trait_types |
named character vector overriding auto-detected
trait types. Required for |
multi_proportion_groups |
named list declaring compositional
trait groups (rows summing to 1), forwarded to |
log_transform |
logical. Auto-log positive continuous columns
(default |
missing_frac |
numeric. Fraction held out for validation/test
during training (default |
covariates |
data.frame or matrix of environmental covariates
(fully observed, numeric). Passed through to |
epochs |
integer. Maximum GNN training epochs per tree (default
|
verbose |
logical. Print progress (default |
seed |
integer. Base random seed; each tree uses |
share_gnn |
logical. If |
reference_tree |
optional |
... |
additional arguments forwarded to |
This is step 1 of the two-step workflow for propagating tree
uncertainty. Step 2 — varying the downstream analysis tree in
lockstep with the imputation tree — is the user's responsibility
and is described in the Examples section and in
vignette("tree-uncertainty").
multi_impute_trees() handles the imputation half (step 1) cleanly:
every completed dataset carries a different tree's signal so that
between-tree variation propagates into the pooled standard errors.
Step 2 is where Nakagawa & de Villemereuil (2019) enters: for each
completed dataset, fit the downstream model (e.g. nlme::gls() with
a corBrownian on trees[[ mi$tree_index[i] ]]), then pool the
T × M fits with pool_mi().
For each tree the function runs the full pigauto pipeline
(preprocess -> baseline -> GNN -> predict) when share_gnn = FALSE.
With the default share_gnn = TRUE, the GNN is trained once and only
the baseline is recomputed per tree. Topologies and branch lengths vary
across trees, so the phylogenetic baseline covariance differs for each
tree.
Downstream usage is identical to multi_impute(): pass the result
to with_imputations() to fit a model on each dataset, then to
pool_mi() for Rubin's-rules pooling. The pooled standard errors
will be wider than those from a single tree because they incorporate
the extra between-tree variance.
Variance decomposition. The between-imputation variance from
Rubin's rules has two sources: (1) within-tree sampling variance
(MC-dropout noise), and (2) between-tree variance (phylogenetic
uncertainty at the imputation step). The fraction of missing
information (FMI) reported by pool_mi() reflects both. To decompose
them, compare FMI from multi_impute() (single tree) with FMI from
multi_impute_trees().
Computation time. With share_gnn = TRUE (default): one GNN fit
T cheap baseline passes. Rough budget on a modern CPU laptop:
| Species n | 1 fit | T = 50 share_gnn=TRUE | T = 50 share_gnn=FALSE |
| 300 | ~30-60 s | ~3-5 min | 25-50 min |
| 5,000 | ~5-10 min | ~10-20 min | 4-8 hr |
| 10,000 | ~20-40 min | ~30-60 min | 17-33 hr |
An object of class "pigauto_mi_trees", inheriting from
"pigauto_mi", with components:
datasetsList of T * m_per_tree completed data.frames.
Observed cells are preserved; missing cells are filled with
imputation draws. Compatible with with_imputations().
mTotal number of datasets (T * m_per_tree).
n_treesNumber of posterior trees used.
m_per_treeImputations per tree.
tree_indexInteger vector of length m; element i gives
the tree index (1..T) for dataset i.
pooled_pointSingle data.frame averaging across all
T * m_per_tree datasets. For reporting, not inference.
seMatrix of per-cell pooled SEs (NA if not available).
imputed_maskLogical matrix; TRUE where a cell was
originally missing.
share_gnnLogical; TRUE if the shared-GNN path was used.
fitSingle pigauto_fit trained on the reference
tree when share_gnn = TRUE; NULL otherwise.
fitsList of T pigauto_fit objects (one per tree)
when share_gnn = FALSE; NULL when share_gnn = TRUE.
reference_treeThe reference phylo used for GNN training
when share_gnn = TRUE; NULL otherwise.
treesThe input posterior trees.
species_colPassed-through species column name.
pigauto provides two multiple-imputation functions. Pick based on how many trees you have:
One tree (single published phylogeny, single time-calibrated tree):
use multi_impute(). The m MC-dropout imputations capture model
uncertainty.
Multiple posterior trees (BirdTree samples, BEAST posterior, etc.):
use multi_impute_trees(). Between-tree variation is added to the
pooled SEs via Rubin's rules (Nakagawa & de Villemereuil 2019).
The two functions share the same downstream API — both return objects
compatible with with_imputations() and pool_mi().
Under share_gnn = TRUE the GNN weights and spectral features are
trained once on the reference tree (MCC by default). For each
posterior tree the BM / joint-MVN baseline is recomputed, and the
prediction is the blend (1 - r_cal) * baseline_t + r_cal * gnn_shared.
Because r_cal is calibrated once on held-out data at the reference
tree and applied uniformly, the tree-uncertainty contribution is:
Fully preserved when the gate is closed (r_cal near 0): the GNN contributes nothing, and the baseline varies per tree.
Partially preserved when the gate is open: the baseline portion still varies, but the GNN portion is a tree-invariant constant — this slightly under-estimates tree variance in the GNN channel.
Lost in the GNN channel when the gate is fully open (rare on real data; the baseline channel still carries tree variation).
On every real dataset benchmarked in the v0.9.0 campaign the gate
closed partially or fully, so share_gnn = TRUE is cheap AND honest.
Set share_gnn = FALSE if you need exact per-tree model independence.
When share_gnn = TRUE with safety_floor = TRUE, the
grand-mean baseline mean_baseline_per_col and the three
calibrated weights (r_cal_bm, r_cal_gnn,
r_cal_mean) are computed ONCE on the reference tree and
reused across all posterior trees. They are properties of the
observed training traits, not of the tree topology. Each posterior
tree only recomputes the BM baseline; the GNN delta and the three
weights stay fixed. This preserves the Nakagawa & de Villemereuil
(2019) tree-uncertainty integration story without re-calibrating
the safety floor per tree, and keeps the shared-GNN speedup intact.
Nakagawa S, de Villemereuil P (2019). "A general method for simultaneously accounting for phylogenetic and species sampling uncertainty via Rubin's rules in comparative analysis." Systematic Biology 68(4): 632-641.
Jetz W, Thomas GH, Joy JB, Hartmann K, Mooers AO (2012). "The global diversity of birds in space and time." Nature 491(7424): 444-448.
multi_impute() for single-tree MI, with_imputations(),
pool_mi(), trees300
## Not run: library(pigauto) data(avonet300, trees300) df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL # ---- Step 1: tree-aware imputation (canonical N&dV 2019 workflow) -- # 50 trees x 1 imputation = 50 completed datasets (fast with share_gnn=TRUE) mi <- multi_impute_trees(df, trees300, m_per_tree = 1L) print(mi) # ---- Step 2: tree-aware analysis (Nakagawa & de Villemereuil 2019) # For each completed dataset, fit the downstream model using the SAME # tree that produced that dataset. `mi$tree_index[i]` gives the tree # index (1..T) for dataset `i`. fits <- with_imputations(mi, function(dat, tree) { dat$species <- rownames(dat) nlme::gls( log(Mass) ~ log(Wing.Length), correlation = ape::corBrownian( phy = tree, form = ~species), data = dat, method = "ML" ) }) # Rubin's rules: pooled SEs include both trait-imputation and # phylogenetic-tree uncertainty. pool_mi(fits) ## End(Not run)## Not run: library(pigauto) data(avonet300, trees300) df <- avonet300; rownames(df) <- df$Species_Key; df$Species_Key <- NULL # ---- Step 1: tree-aware imputation (canonical N&dV 2019 workflow) -- # 50 trees x 1 imputation = 50 completed datasets (fast with share_gnn=TRUE) mi <- multi_impute_trees(df, trees300, m_per_tree = 1L) print(mi) # ---- Step 2: tree-aware analysis (Nakagawa & de Villemereuil 2019) # For each completed dataset, fit the downstream model using the SAME # tree that produced that dataset. `mi$tree_index[i]` gives the tree # index (1..T) for dataset `i`. fits <- with_imputations(mi, function(dat, tree) { dat$species <- rownames(dat) nlme::gls( log(Mass) ~ log(Wing.Length), correlation = ape::corBrownian( phy = tree, form = ~species), data = dat, method = "ML" ) }) # Rubin's rules: pooled SEs include both trait-imputation and # phylogenetic-tree uncertainty. pool_mi(fits) ## End(Not run)
Produces a self-contained HTML file with interactive charts comparing the GNN against the phylogenetic baseline. The report includes per-trait metrics, gate values, conformal coverage, and training history.
pigauto_report( fit, data = NULL, splits = NULL, output_path = "pigauto_report.html", title = "pigauto Imputation Report", open = TRUE )pigauto_report( fit, data = NULL, splits = NULL, output_path = "pigauto_report.html", title = "pigauto Imputation Report", open = TRUE )
fit |
A |
data |
Optional |
splits |
Optional splits object. Extracted automatically when
|
output_path |
Character. File path for the HTML report (default
|
title |
Character. Report title. |
open |
Logical. Open the report in a browser when done (default
|
The output path (invisibly).
Takes a data.frame of benchmark results and produces a forest-plot style figure comparing baseline and GNN performance per trait. Each trait appears on the y-axis; points mark RMSE (continuous/count) or accuracy (binary/categorical) for the two methods, connected by a horizontal line colour-coded by whether the GNN improves over the baseline.
plot_comparison( results, metric = NULL, methods = c("BM_baseline", "pigauto_GNN"), ... )plot_comparison( results, metric = NULL, methods = c("BM_baseline", "pigauto_GNN"), ... )
results |
A data.frame with columns |
metric |
Character. Which metric to compare.
Default |
methods |
Character vector of length 2: the baseline method
name and the GNN method name. Default
|
... |
Additional arguments passed to |
When both RMSE and accuracy metrics are present and metric is
NULL, a two-panel figure is produced automatically.
Invisible NULL. Called for its side effect (plotting).
## Not run: results <- read.csv("benchmark_results.csv") plot_comparison(results) plot_comparison(results, metric = "rmse") ## End(Not run)## Not run: results <- read.csv("benchmark_results.csv") plot_comparison(results) plot_comparison(results, metric = "rmse") ## End(Not run)
Deprecated. Use the base-R S3 method instead:
plot(fit, type = "history").
Produces a ggplot2 validation loss training curve.
plot_history_gg(x, ...)plot_history_gg(x, ...)
x |
object of class |
... |
ignored. |
A ggplot object.
Plots conformal prediction intervals when present, otherwise an
approximate prediction +/- 1.96 * SE ribbon, sorted by predicted
value. If truth is supplied, observed values are overlaid as
points. Works for continuous, count, and ordinal traits. For binary
traits, plots predicted probabilities with an uncertainty ribbon.
plot_uncertainty(pred_result, truth = NULL, trait_name)plot_uncertainty(pred_result, truth = NULL, trait_name)
pred_result |
list with |
truth |
matrix or data.frame of true values (same scale as
|
trait_name |
character. Which trait to plot (must match column name). |
A ggplot object.
Creates a multi-panel comparison plot showing BM baseline vs pigauto performance across all simulation scenarios.
## S3 method for class 'pigauto_benchmark' plot(x, metric = "rmse", ...)## S3 method for class 'pigauto_benchmark' plot(x, metric = "rmse", ...)
x |
pigauto_benchmark object. |
metric |
character. Which metric to plot (default |
... |
passed to base plot functions. |
Invisible NULL.
S3 plot method for pigauto_fit objects, using base R
graphics. Three plot types are available: training history,
calibrated gate values, and conformal prediction scores.
## S3 method for class 'pigauto_fit' plot(x, type = "history", ...)## S3 method for class 'pigauto_fit' plot(x, type = "history", ...)
x |
An object of class |
type |
Character.
|
... |
Additional arguments passed to base plotting functions. |
Invisible NULL. Called for its side effect (plotting).
## Not run: fit <- fit_pigauto(data, tree) plot(fit) plot(fit, type = "history") plot(fit, type = "gates") plot(fit, type = "conformal") ## End(Not run)## Not run: fit <- fit_pigauto(data, tree) plot(fit) plot(fit, type = "history") plot(fit, type = "gates") plot(fit, type = "conformal") ## End(Not run)
S3 plot method for pigauto_pred objects, using base R
graphics. Three plot types are available: scatter plots of observed
vs predicted values, interval plots per species, and probability
charts for discrete traits.
## S3 method for class 'pigauto_pred' plot(x, data = NULL, splits = NULL, type = "scatter", trait = NULL, ...)## S3 method for class 'pigauto_pred' plot(x, data = NULL, splits = NULL, type = "scatter", trait = NULL, ...)
x |
An object of class |
data |
Optional numeric matrix or data.frame of observed values in original scale. When supplied, observed values are overlaid on scatter and interval plots. |
splits |
Optional list (output of |
type |
Character.
|
trait |
Character vector of trait names to plot. If |
... |
Additional arguments passed to base plotting functions. |
Invisible NULL. Called for its side effect (plotting).
## Not run: pred <- predict(fit) plot(pred) plot(pred, type = "scatter", data = observed_df) plot(pred, type = "intervals", trait = "Beak.Length_Culmen") plot(pred, type = "probabilities", data = observed_df) ## End(Not run)## Not run: pred <- predict(fit) plot(pred) plot(pred, type = "scatter", data = observed_df) plot(pred, type = "intervals", trait = "Beak.Length_Culmen") plot(pred, type = "probabilities", data = observed_df) ## End(Not run)
Combine regression coefficients from M model fits – one per imputed
dataset – into a single pooled table using Rubin's rules. The pooled
standard errors properly account for both within-imputation sampling
variance and between-imputation variance, so downstream inference
propagates the uncertainty introduced by imputation.
pool_mi( fits, conf.level = 0.95, coef_fun = stats::coef, vcov_fun = stats::vcov, df_fun = NULL )pool_mi( fits, conf.level = 0.95, coef_fun = stats::coef, vcov_fun = stats::vcov, df_fun = NULL )
fits |
A list of model fits of length |
conf.level |
Confidence level for the pooled interval (default
|
coef_fun |
Function extracting a named numeric coefficient vector
from one fit. Defaults to |
vcov_fun |
Function extracting the variance-covariance matrix of
the fixed-effect coefficients. Defaults to |
df_fun |
Optional function returning the complete-data residual
degrees of freedom |
Let be the coefficient vector from fit i and
, for .
Rubin's rules (Rubin 1987) give
with pooled standard error . The relative increase in
variance is , the classical pooled df is
, and the fraction of missing
information is
When df_fun returns finite complete-data df nu_com, the
Barnard-Rubin (1999) correction combines
with nu_old via
.
MCMCglmm fits are rejected because Rubin's rules are not the right
tool for posterior samples: variance decomposition does not generalise
cleanly to posterior distributions. For a Bayesian pigauto workflow
(pigauto as imputer, MCMCglmm as inference engine), concatenate the
posterior samples across imputations manually – stack fit$Sol and
fit$VCV row-wise with do.call(rbind, ...) and wrap the result in
coda::as.mcmc(). For the frequentist Rubin's-rules path, see
vignette("mixed-types", package = "pigauto"). For an integrated
chained-equation MCMC workflow where imputation and inference happen
in a single engine, use the companion BACE package end-to-end
(BACE::bace() + BACE::pool_posteriors()).
A data.frame with one row per coefficient and columns:
termCoefficient name.
estimatePooled point estimate (mean across fits).
std.errorPooled standard error sqrt(T) where
T = W + (1 + 1/M) * B.
dfPooled degrees of freedom (Barnard-Rubin if df_fun
supplied, else classical Rubin).
statisticestimate / std.error.
p.valueTwo-sided p-value from a t distribution on df.
conf.low, conf.high
Pooled conf.level interval.
fmiFraction of missing information.
rivRelative increase in variance due to non-response.
Rubin DB (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
Barnard J, Rubin DB (1999). "Small-sample degrees of freedom with multiple imputation." Biometrika 86(4): 948-955.
Nakagawa S, Freckleton RP (2008). "Missing inaction: the dangers of ignoring missing data." Trends in Ecology & Evolution 23(11): 592-596.
Nakagawa S, Freckleton RP (2011). "Model averaging, missing data and multiple imputation: a case study for behavioural ecology." Behavioral Ecology and Sociobiology 65(1): 103-116.
multi_impute(), with_imputations()
## Not run: # Typical workflow mi <- multi_impute(traits, tree, m = 100) fits <- with_imputations(mi, function(d) lm(y ~ x1 + x2, data = d)) pool_mi(fits) # glmmTMB with custom extractors (conditional component only) pool_mi( fits, coef_fun = function(f) glmmTMB::fixef(f)$cond, vcov_fun = function(f) vcov(f)$cond ) ## End(Not run)## Not run: # Typical workflow mi <- multi_impute(traits, tree, m = 100) fits <- with_imputations(mi, function(d) lm(y ~ x1 + x2, data = d)) pool_mi(fits) # glmmTMB with custom extractors (conditional component only) pool_mi( fits, coef_fun = function(f) glmmTMB::fixef(f)$cond, vcov_fun = function(f) vcov(f)$cond ) ## End(Not run)
Runs a single forward pass through the fitted model and returns imputed
trait values back-transformed to the original scale. Supports all
trait types (continuous, binary, categorical, ordinal, count, proportion,
zero-inflated count, multi-proportion)
and MC dropout for multiple imputation (when n_imputations > 1). The fitted model is a gated ensemble
of a phylogenetic baseline and a graph neural network correction;
prediction is the per-trait blend
(1 - r_cal) * baseline + r_cal * delta_GNN.
## S3 method for class 'pigauto_fit' predict( object, newdata = NULL, return_se = TRUE, n_imputations = 1L, baseline_override = NULL, pool_method = c("median", "mean", "mode"), clamp_outliers = FALSE, clamp_factor = 5, match_observed = c("none", "pmm"), pmm_K = 5L, ... )## S3 method for class 'pigauto_fit' predict( object, newdata = NULL, return_se = TRUE, n_imputations = 1L, baseline_override = NULL, pool_method = c("median", "mean", "mode"), clamp_outliers = FALSE, clamp_factor = 5, match_observed = c("none", "pmm"), pmm_K = 5L, ... )
object |
object of class |
newdata |
|
return_se |
logical. Compute standard errors? (default |
n_imputations |
integer. Number of stochastic imputation draws —
BM posterior samples plus GNN dropout — (default |
baseline_override |
optional |
pool_method |
character. How to pool the M decoded draws when
|
clamp_outliers |
logical. Phase G (v0.9.1.9011+). When
|
clamp_factor |
numeric scalar (>= 1). Multiplicative factor
on the observed maximum used by |
match_observed |
character, one of When to use: PMM is a niche feature in pigauto. The
package already provides conformal prediction intervals
(calibrated against held-out residuals) and
The Phase G' acceptance bench
( Discrete-class types (binary / categorical / ordinal /
multi_proportion) and un-log continuous: no-op. Default
|
pmm_K |
integer scalar (>= 1). Donor pool size for PMM.
Default |
... |
ignored. |
When n_imputations > 1, each imputation m draws a BM
posterior sample t_BM_draw ~ N(BM_mu, BM_se) on the latent scale
for originally-missing cells (BM_se = 0 for observed cells so they
are never perturbed). The model runs in train mode (GNN dropout active)
using t_BM_draw as input. The final blend is
(1 - r_cal) * t_BM_draw + r_cal * GNN_delta(t_BM_draw): when the
calibrated gate is zero the imputation is a pure BM posterior draw; when
r_cal > 0 both BM draws and GNN dropout contribute variance. Point
estimates are the mean (continuous, count) or mode (binary, categorical,
ordinal) across passes. The M complete datasets are returned in
imputed_datasets for Rubin's-rules pooling. For the user-facing
multiple-imputation workflow, prefer multi_impute() which offers
draws_method = "conformal" (calibrated, narrower) or
"mc_dropout" (BM posterior draws + GNN dropout, wider).
Decoding per type:
reverse z-score, then exp() if log-transformed
sigmoid(latent) to probability, round to 0/1
reverse z-score of log1p, expm1(), round, clip >= 0
reverse z-score, round to nearest valid integer level
softmax() over K latent columns, argmax
A list of class "pigauto_pred" with:
data.frame of imputed values in original scale with proper R types (numeric, integer, factor, ordered).
Numeric matrix (n x p_latent) of predictions in latent scale.
Numeric matrix (n x n_original_traits) of per-cell
uncertainty. Continuous/count/ordinal/proportion: SE in original
scale (BM conditional SD, delta-method back-transformed).
Binary: min(p, 1-p) — probability of being wrong (0 = certain,
0.5 = maximally uncertain); not a Gaussian SE.
Categorical: 1 - max(p_k) — margin from certainty; not
a Gaussian SE. NULL if return_se = FALSE.
Named list. Binary traits: numeric probability vector. Categorical traits: n x K probability matrix. Other types: not present.
List of M data.frames when
n_imputations > 1; NULL otherwise.
Trait map from the fitted model.
Character vector.
Character vector.
Integer, number of imputations performed.
## Not run: pred <- predict(fit, return_se = TRUE) pred$imputed # data.frame, original scale pred$se # matrix, uncertainty pred$probabilities # list of prob vectors/matrices # Multiple imputation (BM posterior draws + GNN dropout) pred10 <- predict(fit, n_imputations = 10) pred10$imputed_datasets # 10 complete data.frames ## End(Not run)## Not run: pred <- predict(fit, return_se = TRUE) pred$imputed # data.frame, original scale pred$se # matrix, uncertainty pred$probabilities # list of prob vectors/matrices # Multiple imputation (BM posterior draws + GNN dropout) pred10 <- predict(fit, n_imputations = 10) pred10$imputed_datasets # 10 complete data.frames ## End(Not run)
Aligns species in the trait data frame to the tree, detects or accepts trait types (continuous, binary, categorical, ordinal, count, proportion, zi_count), and encodes each trait into a continuous latent matrix.
preprocess_traits( traits, tree, species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_cols = NULL, log_transform = TRUE, center = TRUE, scale = TRUE, covariates = NULL )preprocess_traits( traits, tree, species_col = NULL, trait_types = NULL, multi_proportion_groups = NULL, log_cols = NULL, log_transform = TRUE, center = TRUE, scale = TRUE, covariates = NULL )
traits |
|
tree |
object of class |
species_col |
character. Name of the column in |
trait_types |
named character vector overriding auto-detection,
e.g. |
multi_proportion_groups |
named list declaring compositional
(multi-proportion) trait groups. Each element is a character vector
of column names whose row-wise values sum to 1 (e.g.
|
log_cols |
character vector of continuous trait names to
log-transform. Default |
log_transform |
logical. Legacy parameter: if |
center |
logical. Subtract column means for continuous/count/ordinal?
Default |
scale |
logical. Divide by column SDs for continuous/count/ordinal?
Default |
covariates |
data.frame or numeric matrix of environmental covariates.
Covariates are conditioners: they inform imputation but are not themselves
imputed, so they must be fully observed (no NAs — if a variable
has missing values, put it in
Default |
When each species has one observation (the default), output rows match
tree$tip.label order. When species_col is supplied,
multiple observations per species are supported: the output matrix has
one row per observation, plus an obs_to_species mapping for the
GNN (which operates at species level).
Automatic type detection (when trait_types = NULL) follows the
R class of each column — no user input is required for most data:
| R class | pigauto type |
numeric (non-integer) |
continuous |
integer |
count |
factor with 2 levels |
binary |
factor (unordered) with >2 levels |
categorical |
ordered / factor(..., ordered = TRUE) |
ordinal |
character |
converted to factor, then binary or categorical |
logical |
binary (FALSE = 0, TRUE = 1) |
Two types require an explicit override because they cannot be distinguished from their R class alone:
"proportion"A numeric column bounded 0–1 looks
identical to continuous. Declare it explicitly:
trait_types = c(SurvivalRate = "proportion").
Encoded via qlogis(clamp(x, 0.001, 0.999)).
"zi_count"An integer column with excess zeros
looks identical to count. Declare it explicitly:
trait_types = c(Parasites = "zi_count").
Encoded as a binary zero/non-zero gate plus log1p-z magnitude.
Practical examples of type assignment:
Body mass (numeric, all positive) → continuous,
auto-log-transformed.
Clutch size (integer) → count.
Migratory (factor with levels "Yes"/"No") → binary.
Diet (factor with >2 levels) → categorical.
IUCN status (ordered factor, LC < NT < VU < EN < CR) →
ordinal. If left as an unordered factor, it becomes
categorical — both are valid depending on the question.
Parasite load (integer with many zeros) → needs
trait_types = c(Parasites = "zi_count").
Survival rate (numeric, values in 0 to 1) → needs
trait_types = c(Survival = "proportion").
Latent encoding per type:
optional log(), then z-score (1 latent column)
0/1 encoding (1 latent column)
log1p(), then z-score (1 latent column)
integer coding (0 to K-1), then z-score (1 latent column)
one-hot encoding (K latent columns)
qlogis(clamp(x, 0.001, 0.999)), then z-score (1 latent column)
gate (0/1) + log1p-z of non-zeros (2 latent columns)
centred log-ratio (CLR) + per-component z-score
(K latent columns per group). Rows must sum to 1. Declared via the
multi_proportion_groups argument, not trait_types.
A list of class "pigauto_data" with components:
Numeric matrix (n_obs x p_latent), latent encoding.
When species_col is NULL, n_obs = n_species.
Numeric matrix of continuous traits after optional log but before z-scoring (for backward compatibility).
Original data.frame (aligned to tree, before encoding).
Named numeric vector of column means used for z-scoring (continuous/count/ordinal traits only).
Named numeric vector of column SDs.
Character vector of unique species matching
tree$tip.label order (length = n_species).
Character vector of species labels per observation
(length = n_obs). When multi-obs, can have duplicates.
NULL when species_col is NULL.
Integer vector (length = n_obs) mapping each
observation to its species index in species_names.
NULL when species_col is NULL.
Integer, number of unique species.
Integer, number of observations (= n_species when single-obs).
Logical, TRUE when multiple observations
per species are present.
Character vector of original trait names.
Character vector of latent column names.
List of trait descriptors (see Details).
Integer, total number of latent columns.
Logical, legacy field (TRUE if any continuous trait was log-transformed).
# Single-obs per species (backward compatible) data(avonet300, tree300, package = "pigauto") traits <- avonet300 rownames(traits) <- traits$Species_Key traits$Species_Key <- NULL pd <- preprocess_traits(traits, tree300) dim(pd$X_scaled) # 300 x p_latent # Multi-obs per species (via species_col) pd2 <- preprocess_traits(avonet300, tree300, species_col = "Species_Key") pd2$n_obs # number of observations pd2$n_species # number of unique species# Single-obs per species (backward compatible) data(avonet300, tree300, package = "pigauto") traits <- avonet300 rownames(traits) <- traits$Species_Key traits$Species_Key <- NULL pd <- preprocess_traits(traits, tree300) dim(pd$X_scaled) # 300 x p_latent # Multi-obs per species (via species_col) pd2 <- preprocess_traits(avonet300, tree300, species_col = "Species_Key") pd2$n_obs # number of observations pd2$n_species # number of unique species
Pulls occurrence records from the Global Biodiversity Information
Facility (GBIF) for each species in species, aggregates them
to a median lat / lon centroid, and returns a data.frame ready to
be passed to impute via its covariates argument.
pull_gbif_centroids( species, cache_dir = NULL, occurrence_limit = 500L, sleep_ms = 100L, verbose = TRUE, refresh_cache = FALSE, store_points = FALSE )pull_gbif_centroids( species, cache_dir = NULL, occurrence_limit = 500L, sleep_ms = 100L, verbose = TRUE, refresh_cache = FALSE, store_points = FALSE )
species |
character vector of species names. |
cache_dir |
directory to cache per-species RDS files.
|
occurrence_limit |
integer, maximum number of occurrences to fetch per species (default 500; paginated if > 300). |
sleep_ms |
integer, polite delay between API calls in milliseconds (default 100). |
verbose |
logical, print progress every 50 species. |
refresh_cache |
logical, force re-fetch even when cache exists. |
store_points |
logical. When |
Caching is strongly recommended — GBIF rate-limits anonymous calls
and the per-species fetch is expensive. With cache_dir set,
each species gets one RDS file; subsequent calls skip the API.
For each species: resolve taxon via
name_backbone, fetch up to
occurrence_limit records via occ_search
(paginated at 300 per GBIF call), filter out records with
hasGeospatialIssues = TRUE and basisOfRecord in
c("FOSSIL_SPECIMEN", "LIVING_SPECIMEN"), drop
out-of-range coordinates, then take the median latitude and
longitude as the species centroid.
Species with zero post-filter records receive NA centroids;
their rows are still included in the returned data.frame so it
aligns with the input species list.
Requires the optional rgbif package (in DESCRIPTION
Suggests). If rgbif is not installed the function
errors with an installation message.
A data.frame with columns species, centroid_lat,
centroid_lon, n_occurrences. Rownames are set to
species. NA centroids are returned for species with no
GBIF hits; the caller should decide how to handle them (typical:
drop or impute).
impute (pass the return value as
covariates).
## Not run: # Plants ecology example: pull centroids for a species list. sp <- c("Quercus alba", "Pinus taeda", "Acer saccharum") cov <- pull_gbif_centroids(sp, cache_dir = "script/data-cache/gbif") # Use as covariates (drop the bookkeeping cols) cov_num <- cov[, c("centroid_lat", "centroid_lon"), drop = FALSE] # Then: impute(traits, tree, covariates = cov_num) ## End(Not run)## Not run: # Plants ecology example: pull centroids for a species list. sp <- c("Quercus alba", "Pinus taeda", "Acer saccharum") cov <- pull_gbif_centroids(sp, cache_dir = "script/data-cache/gbif") # Use as covariates (drop the bookkeeping cols) cov_num <- cov[, c("centroid_lat", "centroid_lon"), drop = FALSE] # Then: impute(traits, tree, covariates = cov_num) ## End(Not run)
Extends pull_gbif_centroids by extracting 19 WorldClim
v2.1 bioclim variables at each species' GBIF occurrence points,
aggregating per species (median + IQR), and returning a data.frame
ready for impute via its covariates argument.
On first call, downloads the WorldClim 10-arc-minute raster stack
(~130 MB compressed, ~500 MB unzipped) to worldclim_cache_dir.
A sentinel file makes subsequent calls fully offline.
Per-species extracts are also cached (one RDS per species in
worldclim_cache_dir/extracts/), so repeated calls for the
same species list are instantaneous.
Since v0.9.1.9006 (2026-04-30) extraction operates on the raw
per-occurrence point list when pull_gbif_centroids
was called with store_points = TRUE. For legacy caches
written by older pull_gbif_centroids (centroid-only) calls,
extraction silently falls back to the species centroid.
Requires the optional terra package (in DESCRIPTION
Suggests). If terra is not installed the function
errors with an installation message.
pull_worldclim_per_species( species, gbif_cache_dir, worldclim_cache_dir, resolution = "10m", verbose = TRUE, refresh_cache = FALSE )pull_worldclim_per_species( species, gbif_cache_dir, worldclim_cache_dir, resolution = "10m", verbose = TRUE, refresh_cache = FALSE )
species |
character vector of species binomials. |
gbif_cache_dir |
character path, GBIF per-species cache
directory (as written by |
worldclim_cache_dir |
character path, directory for WorldClim rasters + per-species extract cache. Created if absent. |
resolution |
one of |
verbose |
logical. Print progress every 50 species. |
refresh_cache |
logical. Force re-extract even when per-species cache exists. |
A data.frame with length(species) rows and columns:
species
bio1_median, bio1_iqr, ...,
bio19_median, bio19_iqr
n_extracted (integer, number of valid occurrence
points that contributed)
Rownames are set to species. Species with no GBIF hits
get all-NA bio values and n_extracted = 0.
Fick SE, Hijmans RJ (2017). WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology 37, 4302–4315.
pull_gbif_centroids (B.1, required input),
impute (consume as covariates).
## Not run: sp <- c("Quercus alba", "Pinus taeda", "Acer saccharum") gbif_df <- pull_gbif_centroids(sp, cache_dir = "script/data-cache/gbif") wc_df <- pull_worldclim_per_species( species = sp, gbif_cache_dir = "script/data-cache/gbif", worldclim_cache_dir = "script/data-cache/worldclim") # Combine for impute() cov <- cbind(gbif_df[, c("centroid_lat", "centroid_lon")], wc_df[, grep("^bio", colnames(wc_df))]) # res <- impute(traits, tree, covariates = cov, # phylo_signal_gate = FALSE) # <-- needed; see NEWS ## End(Not run)## Not run: sp <- c("Quercus alba", "Pinus taeda", "Acer saccharum") gbif_df <- pull_gbif_centroids(sp, cache_dir = "script/data-cache/gbif") wc_df <- pull_worldclim_per_species( species = sp, gbif_cache_dir = "script/data-cache/gbif", worldclim_cache_dir = "script/data-cache/worldclim") # Combine for impute() cov <- cbind(gbif_df[, c("centroid_lat", "centroid_lon")], wc_df[, grep("^bio", colnames(wc_df))]) # res <- impute(traits, tree, covariates = cov, # phylo_signal_gate = FALSE) # <-- needed; see NEWS ## End(Not run)
Loads trait data and sets species names as row names. If a CSV path is
supplied, it is read with read.csv. By default, all non-species
columns are returned (numeric, factor, integer, etc.). Use
trait_cols to select a subset.
read_traits( x, species_col = "species", trait_cols = NULL, factor_cols = NULL, ordered_cols = NULL )read_traits( x, species_col = "species", trait_cols = NULL, factor_cols = NULL, ordered_cols = NULL )
x |
character path to a CSV file, or a |
species_col |
character. Name of the column containing species names
(default |
trait_cols |
character vector of column names to include. If
|
factor_cols |
character vector of columns to coerce to |
ordered_cols |
character vector of columns to coerce to |
A data.frame with species names as row names.
df <- data.frame(species = c("Sp_a", "Sp_b"), mass = c(10, 20)) traits <- read_traits(df, species_col = "species")df <- data.frame(species = c("Sp_a", "Sp_b"), mass = c(10, 20)) traits <- read_traits(df, species_col = "species")
Reads a Newick or NEXUS tree file using ape. Tries
ape::read.tree first; falls back to ape::read.nexus if that
fails.
read_tree(path)read_tree(path)
path |
character. Path to the tree file. |
An object of class "phylo".
## Not run: tree <- read_tree("path/to/tree.tre") ## End(Not run)## Not run: tree <- read_tree("path/to/tree.tre") ## End(Not run)
Serialises a pigauto_fit object to disk, including the torch
model state. The saved file can be loaded on any machine with the
same pigauto version.
save_pigauto(fit, path, compress = TRUE)save_pigauto(fit, path, compress = TRUE)
fit |
pigauto_fit object. |
path |
character. File path to save to (recommended extension:
|
compress |
logical. Use gzip compression (default |
Standard saveRDS() does not work for pigauto fits because
torch tensor states cannot be serialised by R's native mechanism.
This function converts the model state to a portable raw format
before saving.
Invisible path.
## Not run: save_pigauto(fit, "my_model.pigauto") fit2 <- load_pigauto("my_model.pigauto") ## End(Not run)## Not run: save_pigauto(fit, "my_model.pigauto") fit2 <- load_pigauto("my_model.pigauto") ## End(Not run)
Generates trait data under various evolutionary models, introduces missing data, fits both the Brownian motion baseline and the full pigauto GNN, and compares performance. This is the recommended way to assess pigauto on data with known properties before applying it to real data.
simulate_benchmark( n_species = 100L, n_traits = 4L, scenarios = c("BM", "OU", "regime_shift", "nonlinear", "mixed"), missing_frac = 0.25, n_reps = 3L, epochs = 500L, verbose = TRUE, ... )simulate_benchmark( n_species = 100L, n_traits = 4L, scenarios = c("BM", "OU", "regime_shift", "nonlinear", "mixed"), missing_frac = 0.25, n_reps = 3L, epochs = 500L, verbose = TRUE, ... )
n_species |
integer. Number of tips in the simulated tree (default 100). |
n_traits |
integer. Number of continuous traits (default 4).
Ignored for |
scenarios |
character vector. Subset of
|
missing_frac |
numeric. Fraction of observed cells held out (default 0.25). |
n_reps |
integer. Number of replicate trees per scenario (default 3). |
epochs |
integer. Maximum GNN training epochs (default 500). |
verbose |
logical. Print progress (default |
... |
additional arguments passed to |
Available scenarios:
"BM"Pure Brownian motion – the baseline is exact, so the GNN should tie or slightly improve via inter-trait correlations.
"OU"Ornstein-Uhlenbeck – stabilising selection constrains variation. BM over-estimates evolutionary variance.
"regime_shift"Two-regime BM – clade-specific optima create bimodal distributions that BM cannot capture.
"nonlinear"Non-linear inter-trait relationships – the GNN's multi-layer message passing can capture quadratic and interaction effects that BM's linear covariance misses.
"mixed"Mixed trait types: 2 continuous + 1 binary + 1 categorical (3 levels). Tests the full type pipeline.
An object of class "pigauto_benchmark" with:
data.frame with columns: scenario,
rep, method, trait, type,
metric, value, n_test.
data.frame averaged across replicates.
character vector of scenarios run.
integer.
integer.
## Not run: bench <- simulate_benchmark(n_species = 50, epochs = 200, n_reps = 2) bench$summary plot(bench) ## End(Not run)## Not run: bench <- simulate_benchmark(n_species = 50, epochs = 200, n_reps = 2) bench$summary plot(bench) ## End(Not run)
Generates species trait data under models that deviate from Brownian Motion. Supports OU (stabilising selection), regime shifts (clade-specific optima), and non-linear inter-trait correlations.
simulate_non_bm( tree, n_traits = 4, scenario = c("OU", "regime_shift", "nonlinear"), alpha = 2, theta = 0, sigma = 1, shift_magnitude = 2, seed = NULL )simulate_non_bm( tree, n_traits = 4, scenario = c("OU", "regime_shift", "nonlinear"), alpha = 2, theta = 0, sigma = 1, shift_magnitude = 2, seed = NULL )
tree |
Object of class |
n_traits |
Integer. Number of traits to simulate (default 4). |
scenario |
Character. One of |
alpha |
Numeric. OU pull strength (only for |
theta |
Numeric. OU optimum (only for |
sigma |
Numeric. BM diffusion rate. |
shift_magnitude |
Numeric. Regime shift size in SD units
(only for |
seed |
Integer seed or |
A data.frame with species as rownames.
For a fitted pigauto_result (returned by impute),
compute the closed-form expected reduction in total predictive
variance across all currently-missing cells if each candidate cell
were observed next. Useful for sampling-design guidance: when you
have time/budget to measure k more species, this function
tells you which ones contribute most to imputation precision.
suggest_next_observation( result, top_n = 10L, by = c("cell", "species"), types = c("continuous", "count", "ordinal", "proportion", "binary", "categorical", "zi_count", "multi_proportion") )suggest_next_observation( result, top_n = 10L, by = c("cell", "species"), types = c("continuous", "count", "ordinal", "proportion", "binary", "categorical", "zi_count", "multi_proportion") )
result |
A |
top_n |
integer, default |
by |
character, one of |
types |
character vector of pigauto trait types to include.
Default includes all eight supported types: |
Two metrics are supported, dispatched by trait type:
Continuous-family traits (continuous, count, ordinal,
proportion) use the BM variance-reduction formula. The variance-
reduction formula is derived from a Sherman-Morrison rank-1 inverse
update on the BM conditional MVN: adding species to the
observed set updates the inverse correlation matrix by a known
closed form. For each candidate cell ,
where is the residual
matrix at currently-missing cells, is the
current relative leverage of cell , and is
the REML BM variance for trait .
Discrete traits (binary, categorical) use a label-
propagation expected-entropy-reduction formula. The current LP
probability at species is
, with entropy
. After observing
with unknown class ,
the new LP probability has a closed form, and the expected entropy
is averaged over = current LP estimate at
. Total expected entropy reduction sums
across all currently-missing cells (the entropy at
itself drops to 0).
Variance and entropy are NOT directly comparable. The
output sorts within each metric and the cross-metric ordering by
delta is approximate. When you want a strict ranking,
filter by metric first.
Reductions are summed across the included traits for each species
when by = "species", supporting the typical use case where
measuring a species observes all of its currently-missing traits
at once. At by = "species", the per-trait variance and
entropy reductions are summed separately into
delta_var_total and delta_entropy_total columns; the
delta column is whichever is non-NA (or
delta_var_total when both are populated). Cross-type
species-level ranking is approximate – see the variance-vs-
entropy caveat above.
zi_count (v2): observing a missing zi_count cell reveals
the gate value (entropy reduction at the gate column, computed via
the LP binary formula) AND, with probability
(current LP estimate at ), reveals a
magnitude (variance reduction at the magnitude column, computed
via the BM Sherman-Morrison formula on the gate=1 subset). Output
rows for zi_count populate BOTH delta_var_total (= expected
magnitude variance reduction = ) AND delta_entropy_total (= gate entropy
reduction). metric is set to "variance" so the row
sorts on the magnitude scale; delta_entropy_total is
available for users who care about gate-uncertainty separately.
multi_proportion (v2): observing a row reveals all K
simplex components simultaneously. Per-component variance
reductions are computed via BM Sherman-Morrison on each CLR-z
latent column, summed across components. metric is
"variance"; delta_var_total is the K-component sum.
A data.frame of class "pigauto_active". Columns
when by = "cell": species, trait,
type, metric ("variance" or "entropy"),
delta, delta_var_total (NA for discrete rows), and
delta_entropy_total (NA for continuous rows), sorted by
delta descending. When by = "species":
species, delta_var_total, delta_entropy_total,
n_traits_missing, sorted by the SUM of available metrics.
## Not run: data(avonet300, tree300, package = "pigauto") res <- impute(avonet300, tree300) suggest_next_observation(res, top_n = 5) # top-5 cells suggest_next_observation(res, top_n = 10, by = "species") # top-10 species # Continuous only: suggest_next_observation(res, top_n = 10, types = c("continuous", "count", "ordinal", "proportion")) ## End(Not run)## Not run: data(avonet300, tree300, package = "pigauto") res <- impute(avonet300, tree300) suggest_next_observation(res, top_n = 5) # top-5 cells suggest_next_observation(res, top_n = 10, by = "species") # top-10 species # Continuous only: suggest_next_observation(res, top_n = 10, types = c("continuous", "count", "ordinal", "proportion")) ## End(Not run)
Prints a formatted evaluation table including per-trait metrics,
gate calibration, and conformal coverage. Requires the original
pigauto_data object to compute test-set performance.
## S3 method for class 'pigauto_fit' summary(object, ..., data = NULL)## S3 method for class 'pigauto_fit' summary(object, ..., data = NULL)
object |
pigauto_fit object. |
... |
ignored. |
data |
pigauto_data object used for fitting (optional; when
|
Invisibly returns the evaluation data.frame (or NULL if
data is not supplied).
## Not run: summary(fit, data = pd) ## End(Not run)## Not run: summary(fit, data = pd) ## End(Not run)
delhey5809
An object of class 'phylo' from the ape package. The tree is
the Stage2_Hackett_MCC Maximum Clade Credibility tree, pruned to the 5,809
passerine species in delhey5809.
tree_delheytree_delhey
An object of class phylo with 5,809 tips.
BirdTree.org (Jetz et al. 2012, Hackett et al. backbone).
avonet_full
An object of class 'phylo' from the ape package. The tree is
the same Stage2_Hackett_MCC Maximum Clade Credibility tree used for
tree300, but pruned to the 9,993 species present in
avonet_full rather than a 300-species random subset.
tree_fulltree_full
An object of class phylo with 9,993 tips.
Row order in avonet_full matches tip order in tree_full:
all(avonet_full$Species_Key == tree_full$tip.label) returns
TRUE.
BirdTree.org (Jetz et al. 2012, Hackett et al. backbone).
avonet300
An object of class 'phylo' from the ape package. The tree is a
Maximum Clade Credibility (MCC) tree from the Hackett et al. backbone
(Stage2_Hackett_MCC_no_neg.tre), pruned to the 300 species present in
avonet300.
tree300tree300
An object of class phylo with 300 tips.
BirdTree.org (Jetz et al. 2012, Hackett et al. backbone).
avonet300
A multiPhylo list of 50 phylogenetic trees randomly sampled from the
BirdTree Hackett backbone posterior (Jetz et al. 2012), each pruned to the
300 species in avonet300. These trees capture phylogenetic
uncertainty: topologies and branch lengths vary across the posterior sample.
trees300trees300
An object of class multiPhylo containing 50 phylo
objects, each with 300 tips.
Use with multi_impute_trees for the imputation half
(step 1) of the two-step workflow for propagating phylogenetic
uncertainty. The analysis half (step 2) is the user's responsibility:
refit the downstream comparative model on each
(imputation, trees300[[mi$tree_index[i]]]) pair and pool the
T x M fits with pool_mi
(Nakagawa & de Villemereuil 2019). See
?multi_impute_trees for a complete code example.
BirdTree.org posterior (Jetz et al. 2012, Hackett et al. backbone),
pruned from megatrees::tree_bird_n100.
tree300, avonet300,
multi_impute_trees
Apply a user-supplied model-fitting function .f to each of the M
complete datasets stored in a pigauto_mi object and return the list
of fits. This is the middle step of the canonical multiple-imputation
workflow multi_impute() -> with_imputations() -> pool_mi().
with_imputations( mi, .f, ..., .progress = interactive(), .on_error = c("continue", "stop") )with_imputations( mi, .f, ..., .progress = interactive(), .on_error = c("continue", "stop") )
mi |
A |
.f |
A function of the form |
... |
Additional arguments passed to |
.progress |
Logical. Show a text progress indicator (default
|
.on_error |
One of |
A list of length M with class "pigauto_mi_fits". Each
element is either a model fit or, if .f errored on that
imputation and .on_error = "continue", an object of class
"pigauto_mi_error" containing the captured condition. pool_mi()
filters error elements automatically.
## Not run: mi <- multi_impute(df, tree, m = 50) # nlme::gls example -- zero new dependencies fits <- with_imputations(mi, function(d) { d$species <- rownames(d) nlme::gls( log(Mass) ~ log(Wing.Length), correlation = ape::corBrownian(phy = mi$tree, form = ~species), data = d, method = "ML" ) }) pool_mi(fits) # Tree-aware MI: with_imputations() passes the matching posterior tree # when the callback declares a `tree` argument. mi_t <- multi_impute_trees(df, trees, m_per_tree = 1L) fits_t <- with_imputations(mi_t, function(d, tree) { d$species <- rownames(d) nlme::gls( y ~ x, correlation = ape::corBrownian(phy = tree, form = ~species), data = d, method = "ML" ) }) pool_mi(fits_t) ## End(Not run)## Not run: mi <- multi_impute(df, tree, m = 50) # nlme::gls example -- zero new dependencies fits <- with_imputations(mi, function(d) { d$species <- rownames(d) nlme::gls( log(Mass) ~ log(Wing.Length), correlation = ape::corBrownian(phy = mi$tree, form = ~species), data = d, method = "ML" ) }) pool_mi(fits) # Tree-aware MI: with_imputations() passes the matching posterior tree # when the callback declares a `tree` argument. mi_t <- multi_impute_trees(df, trees, m_per_tree = 1L) fits_t <- with_imputations(mi_t, function(d, tree) { d$species <- rownames(d) nlme::gls( y ~ x, correlation = ape::corBrownian(phy = tree, form = ~species), data = d, method = "ML" ) }) pool_mi(fits_t) ## End(Not run)