Facilitates cross validation with sdmTMB models. Returns log likelihood or expected log predictive density (ELPD) across left-out data. Has an option for leave-future-out cross validation. By default creates folds randomly but folds can be manually assigned.
Usage
sdmTMB_cv(
formula,
data,
mesh_args,
mesh = NULL,
time = NULL,
k_folds = 8,
fold_ids = NULL,
lfo = FALSE,
lfo_forecast = 1,
lfo_validations = 5,
parallel = TRUE,
use_initial_fit = FALSE,
spde = deprecated(),
...
)
Arguments
- formula
Model formula.
- data
A data frame.
- mesh_args
Arguments for
make_mesh()
. If supplied, the mesh will be reconstructed for each fold.- mesh
Output from
make_mesh()
. If supplied, the mesh will be constant across folds.- time
The name of the time column. Leave as
NULL
if this is only spatial data.- k_folds
Number of folds.
- fold_ids
Optional vector containing user fold IDs. Can also be a single string, e.g.
"fold_id"
representing the name of the variable indata
. Ignored iflfo
is TRUE- lfo
Whether to implement leave-future-out (LFO) cross validation where data are used to predict future folds.
time
argument insdmTMB()
must be specified. See Details section below.- lfo_forecast
If
lfo = TRUE
, number of time steps to forecast. Time steps 1, ..., T are used to predict T +lfo_forecast
and the last forecasted time step is used for validation. See Details section below.- lfo_validations
If
lfo = TRUE
, number of times to step through the LFOCV process. Defaults to 5. See Details section below.- parallel
If
TRUE
and afuture::plan()
is supplied, will be run in parallel.- use_initial_fit
Fit the first fold and use those parameter values as starting values for subsequent folds? Can be faster with many folds.
- spde
Depreciated. Use
mesh
instead.- ...
All other arguments required to run
sdmTMB()
model with the exception ofweights
, which are used to define the folds.
Value
A list:
data
: Original data plus columns for fold ID, CV predicted value, and CV log likelihood.models
: A list of models; one per fold.fold_loglik
: Sum of left-out log likelihoods per fold.fold_elpd
: Expected log predictive density per fold on left-out data.sum_loglik
: Sum offold_loglik
across all left-out data.elpd
: Expected log predictive density across all left-out data.pdHess
: Logical vector: Hessian was invertible each fold?converged
: Logical: allpdHess
TRUE
?max_gradients
: Max gradient per fold.
Details
Parallel processing
Parallel processing can be used by setting a future::plan()
.
For example:
library(future)
plan(multisession)
# now use sdmTMB_cv() ...
Leave-future-out cross validation (LFOCV)
An example of LFOCV with 9 time steps, lfo_forecast = 1
, and
lfo_validations = 2
:
Fit data to time steps 1 to 7, predict and validate step 8.
Fit data to time steps 1 to 8, predict and validate step 9.
An example of LFOCV with 9 time steps, lfo_forecast = 2
, and
lfo_validations = 3
:
Fit data to time steps 1 to 5, predict and validate step 7.
Fit data to time steps 1 to 6, predict and validate step 8.
Fit data to time steps 1 to 7, predict and validate step 9.
See example below.
Examples
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)
# Set parallel processing first if desired with the future package.
# See the Details section above.
m_cv <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, mesh = mesh,
family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
m_cv$fold_elpd
#> [1] -1.058697 -1.031947
m_cv$elpd
#> [1] -1.045089
m_cv$fold_loglik
#> [1] -3387.398 -3329.670
m_cv$sum_loglik
#> [1] -6717.068
head(m_cv$data)
#> # A tibble: 6 × 15
#> year X Y depth density present lat lon depth_mean depth_sd
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2003 446. 5793. 201 113. 1 52.3 -130. 5.16 0.445
#> 2 2003 446. 5800. 212 41.7 1 52.3 -130. 5.16 0.445
#> 3 2003 449. 5802. 220 0 0 52.4 -130. 5.16 0.445
#> 4 2003 437. 5802. 197 15.7 1 52.4 -130. 5.16 0.445
#> 5 2003 421. 5771. 256 0 0 52.1 -130. 5.16 0.445
#> 6 2003 418. 5772. 293 0 0 52.1 -130. 5.16 0.445
#> # ℹ 5 more variables: depth_scaled <dbl>, depth_scaled2 <dbl>, cv_fold <int>,
#> # cv_predicted <dbl>, cv_loglik <dbl>
m_cv$models[[1]]
#> Spatial model fit by ML ['sdmTMB']
#> Formula: density ~ 0 + depth_scaled + depth_scaled2
#> Family: tweedie(link = 'log')
#>
#> coef.est coef.se
#> depth_scaled -2.06 0.19
#> depth_scaled2 -1.27 0.10
#>
#> Dispersion parameter: 13.10
#> Tweedie p: 1.60
#> Matérn range: 96.42
#> Spatial SD: 2.78
#> ML criterion at convergence: 3206.009
#>
#> See ?tidy.sdmTMB to extract these values as a data frame.
m_cv$max_gradients
#> [1] 1.814759e-10 3.560097e-12
# \donttest{
# Create mesh each fold:
m_cv2 <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, mesh_args = list(xy_cols = c("X", "Y"), cutoff = 20),
family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
# Use fold_ids:
m_cv3 <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, mesh = mesh,
family = tweedie(link = "log"),
fold_ids = rep(seq(1, 3), nrow(pcod))[seq(1, nrow(pcod))]
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
# LFOCV:
m_lfocv <- sdmTMB_cv(
present ~ s(year, k = 4),
data = pcod,
mesh = mesh,
lfo = TRUE,
lfo_forecast = 2,
lfo_validations = 3,
family = binomial(),
spatiotemporal = "off",
time = "year" # must be specified
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
# See how the LFOCV folds were assigned:
example_data <- m_lfocv$models[[1]]$data
table(example_data$cv_fold, example_data$year)
#>
#> 2003 2004 2005 2007 2009 2011 2013 2015 2017
#> 1 232 230 224 255 233 0 0 0 0
#> 2 0 0 0 0 0 251 0 0 0
#> 3 0 0 0 0 0 0 240 0 0
#> 4 0 0 0 0 0 0 0 238 0
#> 5 0 0 0 0 0 0 0 0 240
# }