Skip to contents

Save log likelihoods of k-fold cross-validation for sdmTMB models.

Usage

sdmTMB_cv(
  formula,
  data,
  mesh_args,
  mesh = NULL,
  time = NULL,
  k_folds = 8,
  fold_ids = NULL,
  parallel = TRUE,
  use_initial_fit = FALSE,
  spde = deprecated(),
  ...
)

Arguments

formula

Model formula.

data

A data frame.

mesh_args

Arguments for make_mesh(). If supplied, the mesh will be reconstructed for each fold.

mesh

Output from make_mesh(). If supplied, the mesh will be constant across folds.

time

The name of the time column. Leave as NULL if this is only spatial data.

k_folds

Number of folds.

fold_ids

Optional vector containing user fold IDs. Can also be a single string, e.g. "fold_id" representing the name of the variable in data.

parallel

If TRUE and a future::plan() is supplied, will be run in parallel.

use_initial_fit

Fit the first fold and use those parameter values as starting values for subsequent folds? Can be faster with many folds.

spde

Depreciated. Use mesh instead.

...

All other arguments required to run sdmTMB() model with the exception of weights, which are used to define the folds.

Value

A list:

  • data: Original data plus columns for fold ID, CV predicted value, and CV log likelihood.

  • models: A list of models; one per fold.

  • fold_loglik: Sum of left-out log likelihoods per fold.

  • fold_elpd: Expected log predictive density per fold on left-out data.

  • sum_loglik: Sum of fold_loglik across all left-out data.

  • elpd: Expected log predictive density across all left-out data.

  • pdHess: Logical vector: Hessian was invertible each fold?

  • converged: Logical: all pdHess TRUE?

  • max_gradients: Max gradient per fold.

Details

Parallel processing can be used by setting a future::plan().

For example:

library(future)
plan(multisession)
# now use sdmTMB_cv() ...

Examples

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 25)

# Set parallel processing first if desired with the future package.
# See the Details section above.

m_cv <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

m_cv$fold_elpd
#> [1] -1.058983 -1.035476
m_cv$elpd
#> [1] -1.046859

m_cv$fold_loglik
#> [1] -3288.382 -3326.460
m_cv$sum_loglik
#> [1] -6614.843

head(m_cv$data)
#> # A tibble: 6 × 15
#>    year     X     Y depth density present   lat   lon depth_mean depth…¹ depth…²
#>   <int> <dbl> <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>      <dbl>   <dbl>   <dbl>
#> 1  2003  446. 5793.   201   113.        1  52.3 -130.       5.16   0.445   0.333
#> 2  2003  446. 5800.   212    41.7       1  52.3 -130.       5.16   0.445   0.453
#> 3  2003  449. 5802.   220     0         0  52.4 -130.       5.16   0.445   0.536
#> 4  2003  437. 5802.   197    15.7       1  52.4 -130.       5.16   0.445   0.288
#> 5  2003  421. 5771.   256     0         0  52.1 -130.       5.16   0.445   0.877
#> 6  2003  418. 5772.   293     0         0  52.1 -130.       5.16   0.445   1.18 
#> # … with 4 more variables: depth_scaled2 <dbl>, cv_fold <int>,
#> #   cv_predicted <dbl>, cv_loglik <dbl>, and abbreviated variable names
#> #   ¹​depth_sd, ²​depth_scaled
m_cv$models[[1]]
#> Spatial model fit by ML ['sdmTMB']
#> Formula: density ~ 0 + depth_scaled + depth_scaled2
#> Family: tweedie(link = 'log')
#>  
#>               coef.est coef.se
#> depth_scaled     -1.81    0.19
#> depth_scaled2    -1.28    0.10
#> 
#> Dispersion parameter: 14.03
#> Tweedie p: 1.60
#> Matern range: 117.49
#> Spatial SD: 2.61
#> ML criterion at convergence: 3282.890
#> 
#> See ?tidy.sdmTMB to extract these values as a data frame.
m_cv$max_gradients
#> [1] 0.0003465450 0.0008065475

# \donttest{
# Create mesh each fold:
m_cv2 <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh_args = list(xy_cols = c("X", "Y"), cutoff = 20),
  family = tweedie(link = "log"), k_folds = 2
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.

# Use fold_ids:
m_cv3 <- sdmTMB_cv(
  density ~ 0 + depth_scaled + depth_scaled2,
  data = pcod, mesh = mesh,
  family = tweedie(link = "log"),
  fold_ids = rep(seq(1, 3), nrow(pcod))[seq(1, nrow(pcod))]
)
#> Running fits with `future.apply()`.
#> Set a parallel `future::plan()` to use parallel processing.
# }