Cross-validation for model evaluation and comparison
2024-12-16
Source:vignettes/articles/cross-validation.Rmd
cross-validation.Rmd
If the code in this vignette has not been evaluated, a rendered version is available on the documentation site under ‘Articles’.
Overview
Cross-validation is one of the best approaches that can be used to quantify model performance and compare sdmTMB models with different structures (unlike AIC, this approach will also factor in uncertainty in random effects). Arguably the most challenging decision in implementing cross-validation is how to specify the folds (each fold representing a subset of data that is in turn held out and used as a test set). Folds may vary in number and how data are partitioned, and will likely be slightly different for each application.
The goals of some sdmTMB applications may be focused on spatial
prediction; these include making prediction to new spatial regions
(e.g. unsampled areas or areas not sampled in every year). For these
types of models we recommend exploring folds using the
blockCV
or spatialsample
packages (Valavi et al. 2019; Silge 2021). In
general, these spatial sampling approaches assign observations that are
spatially autocorrelated to the same fold. Accounting for the spatial
correlation can lead to better estimates of covariate effects, as well
as prediction errors.
Alternatively, the goals of an analysis with sdmTMB may be to evaluate the predictive accuracy of a model in time (e.g. a missing survey year, or prediction to future years). For retrospective analyses, all points within a year may be assigned to a fold (or groups of years to the same fold). In contrast, models that are forward looking would use Leave Future Out Cross-Validation (LFOCV). In LFOCV, data up to year are used to predict observations at , etc.
Cross validation in sdmTMB
Cross validation in sdmTMB is implemented using the
sdmTMB_cv()
function, with the k_folds
argument specifying the number of folds (defaults to 8). The function
uses parallelization by default a future::plan()
is set,
but this can be turned off with the parallel
argument.
# Set parallel processing if desired:
library(future)
plan(multisession, workers = 2)
m_cv <- sdmTMB_cv(
density ~ 0 + s(depth_scaled) + fyear,
data = pcod,
mesh = mesh,
family = tweedie(link = "log"),
k_folds = 4
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
In the above example, folds are assigned randomly—but these can be
modified to specific spatial or temporal applications. Without getting
into the complexities of the blockCV
or
spatialsample
packages, we could simply use
kmeans
to generate spatial clusters, e.g.
clust <- kmeans(pcod[, c("X", "Y")], 20)$cluster
m_cv <- sdmTMB_cv(
density ~ 0 + s(depth_scaled) + fyear,
data = pcod,
mesh = mesh,
fold_ids = clust,
family = tweedie(link = "log")
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
Or similarly, these clusters could be assigned in time—here, each year to a unique fold. Note that year is not included as a factor and spatiotemporal fields are turned off because they cannot be estimated in missing years.
clust <- as.numeric(as.factor(pcod$year))
m_cv <- sdmTMB_cv(
density ~ 0 + s(depth_scaled),
data = pcod,
mesh = mesh,
fold_ids = clust,
spatiotemporal = "off",
family = tweedie(link = "log")
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
Measuring model performance
Lots of measures of predictive accuracy can be used to evaluate model
performance. By default, sdmTMB_cv()
returns a list that
contains the sum of the log likelihoods for each left-out fold and the
total summed across the left-out folds. This is roughly equivalent to
the expected log predictive density (ELPD) in the Bayesian literature
and can be interpreted as the predictive ability of the model for new
observations. These can be accessed as below, and inspecting the
quantities across folds may help elucidate whether there are particular
folds that are difficult to predict.
m_cv <- sdmTMB_cv(
density ~ 0 + s(depth_scaled) + fyear,
data = pcod,
mesh = mesh,
family = tweedie(link = "log"),
k_folds = 4
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
m_cv$fold_loglik # fold log-likelihood
#> [1] -1720.122 -1756.138 -1502.946 -1661.257
m_cv$sum_loglik # total log-likelihood
#> [1] -6640.463
Single splits
In cases where only a single test set is evaluated (e.g., 10% of the
data), using the sdmTMB_cv()
function may be overkill
because two sdmTMB()
models will be fit, but using this
function may be worthwhile to reduce coding errors (in the
log-likelihood calculations). For example, here we assign two folds,
randomly holding out 10% of the observations as a test set (the test set
is given ID = 1, and the training set is given ID = 2).
clust <- sample(1:2, size = nrow(pcod), replace = TRUE, prob = c(0.1, 0.9))
m_cv <- sdmTMB_cv(
density ~ 0 + s(depth_scaled) + fyear,
data = pcod,
mesh = mesh,
fold_ids = clust,
family = tweedie(link = "log"),
k_folds = length(unique(clust))
)
We can ignore the total log-likelihood, and just focus on the first element of list list:
m_cv$fold_loglik[[1]]
#> [1] -608.5838
Comparing two or more models
We can use the output of sdmTMB_cv()
to compare two or
more models. For example, if we wanted to evaluate the support for a
depth effect or not, we could do 10-fold cross validation (it’s
important that the folds be the same across the two models). In this
example, using either the predictive log-likelihood or ELPD would lead
one to conclude that including depth improves the predictive accuracy of
the model.
clust <- sample(seq_len(10), size = nrow(pcod), replace = TRUE)
m1 <- sdmTMB_cv(
density ~ 0 + fyear,
data = pcod,
mesh = mesh,
fold_ids = clust,
family = tweedie(link = "log")
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
m2 <- sdmTMB_cv(
density ~ 0 + fyear + s(depth_scaled),
data = pcod,
mesh = mesh,
fold_ids = clust,
family = tweedie(link = "log")
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
# Compare log-likelihoods -- higher is better!
m1$sum_loglik
#> [1] -6748.929
m2$sum_loglik
#> [1] -6596.481
Model ensembling
Finally, instead of identifying single “best” models, we may be
interested in doing model averaging. In the sdmTMB package, we’ve
implemented the model stacking procedure described by (Yao et al. 2018) in the
sdmTMB_stacking()
function. This procedure uses
optimization to find the normalized weights that maximize the total
log-likelihood across models (other metrics may also be used). Inputs to
the function are a list of models (a fictitious
model_list
), where each list element is the output of a
call to sdmTMB_cv()
:
weights <- sdmTMB_stacking(model_list)
By default this calculation uses data from each fold. If instead, we
had split the data into the 10/90 split (as in the example above), we
wouldn’t want to use the 2nd model fit to generate these weights. If we
had just wanted to use the predictions from the first fold onto the 10%
test set, we could specify that using the include_folds
argument.
weights <- sdmTMB_stacking(model_list, include_folds = 1)
Calculating measures of predictive skill for binary data
For delta models, or models of presence-absence data, several measures of predictive ability are available. These are applicable to cross validation, although we demonstrate them here first in a non-cross validation context for simplicity.
A first commonly used diagnostic is the AUC (Area Under the Curve),
which quantifies the ability of a model to discriminate between the two
classes; this is done from the Receiver Operating Characteristic (ROC)
curve, which plots the true positive rate vs. false positive rate. There
are several packages to calculate AUC in R, but this can be done with
the pROC
package, where inputs are a vector of 0s and 1s
(or factor equivalents) in the raw data, and a vector of estimated
probabilities (generated from a call to predict()
, as shown
below). The plogis()
function is needed to convert
estimated values in logit space to probabilities in natural (zero to
one) space.
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)
fit <- sdmTMB(present ~ s(depth), data = pcod, mesh = mesh)
pred <- predict(fit) # presence-absence model
roc <- pROC::roc(pcod$present, plogis(pred$est))
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
auc <- pROC::auc(roc)
auc
#> Area under the curve: 0.8831
With a delta model, two estimated values are returned, so only the first would be used. E.g.,
fit <- sdmTMB(density ~ 1, data = pcod,
mesh = mesh, family = delta_gamma())
pred <- predict(fit)
# the first linear predictor is the binomial component (est1):
roc <- pROC::roc(pcod$present, plogis(pred$est1))
#> Setting levels: control = 0, case = 1
#> Setting direction: controls < cases
auc <- pROC::auc(roc)
auc
#> Area under the curve: 0.8604
If we wanted to apply this in the context of cross validation, we could do it like this:
x <- sdmTMB_cv(
present ~ s(depth), data = pcod, spatial = "off",
mesh = mesh, family = binomial(), k_folds = 2
)
roc <- pROC::roc(x$data$present, plogis(x$data$cv_predicted))
auc <- pROC::auc(roc)
auc
AUC may be sensitive to imbalances in the data, however, and alternative metrics may better approximate skill. Here we highlight an example of using true skill score (implemented in packages such as SDMtune):
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)
fit <- sdmTMB(present ~ 1, data = pcod,
mesh = mesh, family = binomial())
Next, we can generate predicted probabilities and classes using a threshold of 0.5 as an example:
Next we create a confusion matrix and calculate the true skill score:
conmat <- table(pred$pred_01, pred$present)
true_neg <- conmat[1, 1]
false_neg <- conmat[1, 2]
false_pos <- conmat[2, 1]
true_pos <- conmat[2, 2]
# Calculate TSS:
true_pos_rate <- true_pos / (true_pos + false_neg)
true_neg_rate <- true_neg / (true_neg + false_pos)
TSS <- true_pos_rate + true_neg_rate - 1
TSS
#> [1] 0.5238745
In some cases, reporting the true negative or true positive rate might be of interest in addition to TSS.