Skip to contents

Facilitates cross validation with sdmTMB models. Returns the log likelihood of left-out data, which is similar in spirit to the ELPD (expected log pointwise predictive density). The function has an option for leave-future-out cross validation. By default, the function creates folds randomly but folds can be manually assigned via the fold_ids argument.

Usage

sdmTMB_cv(
  formula,
  data,
  mesh_args,
  mesh = NULL,
  time = NULL,
  k_folds = 8,
  fold_ids = NULL,
  lfo = FALSE,
  lfo_forecast = 1,
  lfo_validations = 5,
  parallel = TRUE,
  use_initial_fit = FALSE,
  future_globals = NULL,
  spde = deprecated(),
  ...
)

Arguments

formula

Model formula.

data

A data frame.

mesh_args

Arguments for make_mesh(). If supplied, the mesh will be reconstructed for each fold.

mesh

Output from make_mesh(). If supplied, the mesh will be constant across folds.

time

The name of the time column. Leave as NULL if this is only spatial data.

k_folds

Number of folds.

fold_ids

Optional vector containing user fold IDs. Can also be a single string, e.g. "fold_id" representing the name of the variable in data. Ignored if lfo is TRUE

lfo

Whether to implement leave-future-out (LFO) cross validation where data are used to predict future folds. time argument in sdmTMB() must be specified. See Details section below.

lfo_forecast

If lfo = TRUE, number of time steps to forecast. Time steps 1, ..., T are used to predict T + lfo_forecast and the last forecasted time step is used for validation. See Details section below.

lfo_validations

If lfo = TRUE, number of times to step through the LFOCV process. Defaults to 5. See Details section below.

parallel

If TRUE and a future::plan() is supplied, will be run in parallel.

use_initial_fit

Fit the first fold and use those parameter values as starting values for subsequent folds? Can be faster with many folds.

future_globals

A character vector of global variables used within arguments if an error is returned that future.apply can't find an object. This vector is appended to TRUE and passed to the argument future.globals in future.apply::future_lapply(). Useful if global objects are used to specify arguments like priors, families, etc.

spde

Depreciated. Use mesh instead.

...

All other arguments required to run sdmTMB() model with the exception of weights, which are used to define the folds.

Value

A list:

  • data: Original data plus columns for fold ID, CV predicted value, and CV log likelihood.

  • models: A list of models; one per fold.

  • fold_loglik: Sum of left-out log likelihoods per fold. More positive values are better.

  • sum_loglik: Sum of fold_loglik across all left-out data. More positive values are better.

  • pdHess: Logical vector: Hessian was invertible each fold?

  • converged: Logical: all pdHess TRUE?

  • max_gradients: Max gradient per fold.

Prior to sdmTMB version '0.3.0.9002', elpd was incorrectly returned as the log average likelihood, which is another metric you could compare models with, but not ELPD. For maximum likelihood, ELPD is equivalent in spirit to the sum of the log likelihoods.

Details

Parallel processing

Parallel processing can be used by setting a future::plan().

For example:

library(future)
plan(multisession)
# now use sdmTMB_cv() ...

Leave-future-out cross validation (LFOCV)

An example of LFOCV with 9 time steps, lfo_forecast = 1, and lfo_validations = 2:

  • Fit data to time steps 1 to 7, predict and validate step 8.

  • Fit data to time steps 1 to 8, predict and validate step 9.

An example of LFOCV with 9 time steps, lfo_forecast = 2, and lfo_validations = 3:

  • Fit data to time steps 1 to 5, predict and validate step 7.

  • Fit data to time steps 1 to 6, predict and validate step 8.

  • Fit data to time steps 1 to 7, predict and validate step 9.

See example below.

Examples

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)

# Set parallel processing first if desired with the future package.
# See the Details section above.

m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

m_cv$fold_loglik
#> [1] -3387.398 -3329.670
m_cv$sum_loglik
#> [1] -6717.068

head(m_cv$data)
#> # A tibble: 6 × 15
#>    year     X     Y depth density present   lat   lon depth_mean depth_sd
#>   <int> <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>      <dbl>    <dbl>
#> 1  2003  446. 5793.   201   113.        1  52.3 -130.       5.16    0.445
#> 2  2003  446. 5800.   212    41.7       1  52.3 -130.       5.16    0.445
#> 3  2003  449. 5802.   220     0         0  52.4 -130.       5.16    0.445
#> 4  2003  437. 5802.   197    15.7       1  52.4 -130.       5.16    0.445
#> 5  2003  421. 5771.   256     0         0  52.1 -130.       5.16    0.445
#> 6  2003  418. 5772.   293     0         0  52.1 -130.       5.16    0.445
#> # ℹ 5 more variables: depth_scaled <dbl>, depth_scaled2 <dbl>, cv_fold <int>,
#> #   cv_predicted <dbl>, cv_loglik <dbl>
m_cv$models[[1]]
#> Spatial model fit by ML ['sdmTMB']
#> Formula: density ~ 0 + depth_scaled + depth_scaled2
#> Family: tweedie(link = 'log')
#>  
#>               coef.est coef.se
#> depth_scaled     -2.06    0.19
#> depth_scaled2    -1.27    0.10
#> 
#> Dispersion parameter: 13.10
#> Tweedie p: 1.60
#> Matérn range: 96.42
#> Spatial SD: 2.78
#> ML criterion at convergence: 3206.009
#> 
#> See ?tidy.sdmTMB to extract these values as a data frame.
m_cv$max_gradients
#> [1] 1.814876e-10 4.610284e-12

# \donttest{
# Create mesh each fold:
m_cv2 <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh_args = list(xy_cols = c("X", "Y"), cutoff = 20),
  family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

# Use fold_ids:
m_cv3 <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"),
  fold_ids = rep(seq(1, 3), nrow(pcod))[seq(1, nrow(pcod))]
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

# LFOCV:
m_lfocv <- sdmTMB_cv(
  present ~ s(year, k = 4),
  data = pcod,
  mesh = mesh,
  lfo = TRUE,
  lfo_forecast = 2,
  lfo_validations = 3,
  family = binomial(),
  spatiotemporal = "off",
  time = "year" # must be specified
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

# See how the LFOCV folds were assigned:
example_data <- m_lfocv$models[[1]]$data
table(example_data$cv_fold, example_data$year)
#>    
#>     2003 2004 2005 2007 2009 2011 2013 2015 2017
#>   1  232  230  224  255  233  251    0    0    0
#>   2    0    0    0    0    0    0  240    0    0
#>   3    0    0    0    0    0    0    0  238    0
#>   4    0    0    0    0    0    0    0    0  240
# }