Facilitates cross validation with sdmTMB models. Returns the log likelihood
of left-out data, which is similar in spirit to the ELPD (expected log
pointwise predictive density). The function has an option for
leave-future-out cross validation. By default, the function creates folds
randomly but folds can be manually assigned via the fold_ids
argument.
Usage
sdmTMB_cv(
formula,
data,
mesh_args,
mesh = NULL,
time = NULL,
k_folds = 8,
fold_ids = NULL,
lfo = FALSE,
lfo_forecast = 1,
lfo_validations = 5,
parallel = TRUE,
use_initial_fit = FALSE,
future_globals = NULL,
spde = deprecated(),
...
)
Arguments
- formula
Model formula.
- data
A data frame.
- mesh_args
Arguments for
make_mesh()
. If supplied, the mesh will be reconstructed for each fold.- mesh
Output from
make_mesh()
. If supplied, the mesh will be constant across folds.- time
The name of the time column. Leave as
NULL
if this is only spatial data.- k_folds
Number of folds.
- fold_ids
Optional vector containing user fold IDs. Can also be a single string, e.g.
"fold_id"
representing the name of the variable indata
. Ignored iflfo
is TRUE- lfo
Whether to implement leave-future-out (LFO) cross validation where data are used to predict future folds.
time
argument insdmTMB()
must be specified. See Details section below.- lfo_forecast
If
lfo = TRUE
, number of time steps to forecast. Time steps 1, ..., T are used to predict T +lfo_forecast
and the last forecasted time step is used for validation. See Details section below.- lfo_validations
If
lfo = TRUE
, number of times to step through the LFOCV process. Defaults to 5. See Details section below.- parallel
If
TRUE
and afuture::plan()
is supplied, will be run in parallel.- use_initial_fit
Fit the first fold and use those parameter values as starting values for subsequent folds? Can be faster with many folds.
- future_globals
A character vector of global variables used within arguments if an error is returned that future.apply can't find an object. This vector is appended to
TRUE
and passed to the argumentfuture.globals
infuture.apply::future_lapply()
. Useful if global objects are used to specify arguments like priors, families, etc.- spde
Depreciated. Use
mesh
instead.- ...
All other arguments required to run
sdmTMB()
model with the exception ofweights
, which are used to define the folds.
Value
A list:
data
: Original data plus columns for fold ID, CV predicted value, and CV log likelihood.models
: A list of models; one per fold.fold_loglik
: Sum of left-out log likelihoods per fold. More positive values are better.sum_loglik
: Sum offold_loglik
across all left-out data. More positive values are better.pdHess
: Logical vector: Hessian was invertible each fold?converged
: Logical: allpdHess
TRUE
?max_gradients
: Max gradient per fold.
Prior to sdmTMB version '0.3.0.9002', elpd
was incorrectly returned as
the log average likelihood, which is another metric you could compare models
with, but not ELPD. For maximum likelihood, ELPD is equivalent in spirit to the sum of the log likelihoods.
Details
Parallel processing
Parallel processing can be used by setting a future::plan()
.
For example:
Leave-future-out cross validation (LFOCV)
An example of LFOCV with 9 time steps, lfo_forecast = 1
, and
lfo_validations = 2
:
Fit data to time steps 1 to 7, predict and validate step 8.
Fit data to time steps 1 to 8, predict and validate step 9.
An example of LFOCV with 9 time steps, lfo_forecast = 2
, and
lfo_validations = 3
:
Fit data to time steps 1 to 5, predict and validate step 7.
Fit data to time steps 1 to 6, predict and validate step 8.
Fit data to time steps 1 to 7, predict and validate step 9.
See example below.
Examples
mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)
# Set parallel processing first if desired with the future package.
# See the Details section above.
m_cv <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, mesh = mesh,
family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
m_cv$fold_loglik
#> [1] -3351.292 -3298.167
m_cv$sum_loglik
#> [1] -6649.459
head(m_cv$data)
#> # A tibble: 6 × 15
#> year X Y depth density present lat lon depth_mean depth_sd
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2003 446. 5793. 201 113. 1 52.3 -130. 5.16 0.445
#> 2 2003 446. 5800. 212 41.7 1 52.3 -130. 5.16 0.445
#> 3 2003 449. 5802. 220 0 0 52.4 -130. 5.16 0.445
#> 4 2003 437. 5802. 197 15.7 1 52.4 -130. 5.16 0.445
#> 5 2003 421. 5771. 256 0 0 52.1 -130. 5.16 0.445
#> 6 2003 418. 5772. 293 0 0 52.1 -130. 5.16 0.445
#> # ℹ 5 more variables: depth_scaled <dbl>, depth_scaled2 <dbl>, cv_fold <int>,
#> # cv_predicted <dbl>, cv_loglik <dbl>
m_cv$models[[1]]
#> Spatial model fit by ML ['sdmTMB']
#> Formula: density ~ 0 + depth_scaled + depth_scaled2
#> Family: tweedie(link = 'log')
#>
#> coef.est coef.se
#> depth_scaled -2.07 0.23
#> depth_scaled2 -1.57 0.14
#>
#> Dispersion parameter: 14.61
#> Tweedie p: 1.64
#> Matérn range: 100.81
#> Spatial SD: 3.11
#> ML criterion at convergence: 3191.001
#>
#> See ?tidy.sdmTMB to extract these values as a data frame.
m_cv$max_gradients
#> [1] 2.788227e-08 6.945717e-10
# \donttest{
# Create mesh each fold:
m_cv2 <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, mesh_args = list(xy_cols = c("X", "Y"), cutoff = 20),
family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
# Use fold_ids:
m_cv3 <- sdmTMB_cv(
density ~ 0 + depth_scaled + depth_scaled2,
data = pcod, mesh = mesh,
family = tweedie(link = "log"),
fold_ids = rep(seq(1, 3), nrow(pcod))[seq(1, nrow(pcod))]
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
# }