class: center, middle, inverse, title-slide

.title[
# Model selection with sdmTMB
]
.subtitle[
## NOAA PSAW Seminar Series
]
.author[
### 
]
.date[
### March 9, 2022
]

---

<!-- Build with: xaringan::inf_mr() -->

# What is the goal of modeling?

* Straight prediction?
  * in-sample, out-of-sample, or both?
* Parsimony?
  * balancing the bias-variance tradeoff

* Not all metrics are appropriate for all applications

---

# Too many metrics to discuss

* Root mean squared error
* AUC (see the pseudo-absence example)
  * true and false positive rates
* AIC: a widely used tool in ecology
  * `\(\mathrm{AIC} = -2 \log L + 2K\)`, where `\(K\)` is the number of parameters
  * designed for fixed-effects models (Burnham and Anderson 2002)

---

# AIC and likelihood with sdmTMB

* `AIC()` and `logLik()` methods work, just like `glm()`

```r
library(sdmTMB) # pcod is a dataset built into sdmTMB
mesh <- make_mesh(pcod,
  xy_cols = c("X", "Y"),
  cutoff = 10
)
fit <- sdmTMB(
  present ~ 1,
  data = pcod,
  mesh = mesh,
  family = binomial(link = "logit"),
  spatial = "on"
)
logLik(fit) # log likelihood
AIC(fit) # AIC
```

---

# When to use restricted maximum likelihood (REML)

* Integrates over random effects *and* fixed effects; sometimes helps convergence too
* *Can* use REML when comparing different random effect structures
* *Don't* use REML when comparing alternative fixed effects
* *Don't* use REML for index standardization

.small[
```r
fit <- sdmTMB(..., reml = TRUE)
```
]

---

# Reminder: fixed and random effects in sdmTMB

Random effects include

* random intercepts: `(1 | group)`
* smooth terms: `s(year)`
* time-varying coefficients
* all random fields
* spatially varying coefficients (also random fields)

---

# Limitations of AIC

* Originally conceived for fixed-effects models
  * Burnham and Anderson (2002)
* Approximate and problematic for mixed-effects models
  * Vaida and Blanchard (2005)
  * Liang et al. (2008)
* Great [FAQ on `glmmTMB` by Ben Bolker](https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#can-i-use-aic-for-mixed-models-how-do-i-count-the-number-of-degrees-of-freedom-for-a-random-effect)

---

# Predictive model selection

* Ideal world: use cross validation to evaluate predictive accuracy
* Split data into train / test sets
* Objective function:
  * maximize the log likelihood of the test data

---

# ELPD

Expected log pointwise predictive density

Interpreted as the theoretical expected log predictive density for a new dataset

`\(\mathrm{ELPD} = \log \left( \frac{1}{N} \sum \exp \left( \log L(y^* \mid \theta) \right) \right)\)`, where:

`\(y^*\)` are left-out data, `\(\theta\)` are parameters, and `\(N\)` is the number of left-out data points

[Vehtari, Gelman, and Gabry (2017)](https://arxiv.org/abs/1507.04544)
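To make the formula concrete, here is a minimal sketch of the same calculation in plain R; the vector `ll` of held-out log likelihoods is hypothetical, not sdmTMB output:

```r
# Hypothetical held-out log likelihoods, one per left-out observation:
ll <- c(-1.2, -0.8, -2.1, -0.5)
# Log of the average predictive density, as in the formula above:
log(mean(exp(ll)))
```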
---

# Cross validation in sdmTMB

.small[
```r
*fit <- sdmTMB_cv(
  present ~ 1,
  data = pcod,
  mesh = mesh,
  family = binomial(link = "logit"),
* k_folds = 8,
* fold_ids = NULL,
* parallel = TRUE,
* use_initial_fit = FALSE
)
```
]

* More folds = more computation time
* `use_initial_fit = TRUE`
  * fits the first fold and initializes subsequent model fits from that fit

---

# Cross validation in sdmTMB

`sdmTMB_cv()` returns:

- A list of models (each an `sdmTMB()` fit object)
- `fold_loglik`: sum of held-out log likelihoods for each fold
- `sum_loglik`: sum of `fold_loglik` across folds, i.e., over all data
- `fold_elpd`: expected log predictive density for each fold
- `elpd`: expected log predictive density across all data

---

# How to choose folds? How many?

.small[
* Words of wisdom:
  * Results can be highly variable depending on the data, folds, and sampling scheme
* Spatial sampling or random?
  * [blockCV R package](https://cran.r-project.org/web/packages/blockCV/index.html), [Valavi et al. (2019)](https://doi.org/10.1111/2041-210X.13107)
* How to sample with time / years?
  * LOOCV (leave-one-out cross validation) vs. LFOCV (leave-future-out cross validation)
* `sdmTMB_cv()` does random fold assignment
  * Custom folds can be specified with `fold_ids`
]

---

# Basic cross validation example

Set a `future::plan()` and the folds will be fit in parallel

```r
*library(future)
*plan(multisession)

mesh <- make_mesh(pcod, c("X", "Y"), cutoff = 10)

*m_cv <- sdmTMB_cv(
  density ~ s(depth, k = 5),
  data = pcod,
  mesh = mesh,
  family = tweedie(link = "log"),
* k_folds = 2
)

# Sum of log likelihoods of left-out data:
m_cv$sum_loglik
#> [1] -12541.86

# Expected log pointwise predictive density
# of left-out data:
m_cv$elpd
#> [1] -1.00918
```

---

# Ensembling

Rather than choosing a single best model, we can do model averaging

* An ensemble may perform better than any single model
* AIC weights (please don't!)
* Stacking

---

# Stacking

* Borrowed from the Bayesian literature, e.g., Yao et al. (2018)
* Given a set of models `\(m_{1}, m_{2}, \ldots, m_{M}\)`
* Estimate weights to maximize the combined predictive density

`$$p(y^* | y) = \sum_{i=1}^{M} w_{i} \, p(y^*|y, m_{i})$$`

`\(y^*\)` is new data (e.g., test data); `\(\sum_{i=1}^{M} w_{i} = 1\)`

---

# Stacking in sdmTMB

```r
sdmTMB_stacking(model_list,
  include_folds = NULL
)
```

* `model_list` is a list of models fit with `sdmTMB_cv()`
* `optim()` is used to find the weights `\(w_{i}\)` that maximize the combined predictive density
* Not in the `main` branch yet!
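---

# Stacking: how the weights combine models

To make the weighting concrete, here is a minimal sketch in plain R; the per-observation predictive densities `dens_m1` and `dens_m2` and the weights `w` are hypothetical values for illustration, not output from `sdmTMB_stacking()`:

```r
# Hypothetical predictive densities for three held-out observations
# under two candidate models:
dens_m1 <- c(0.20, 0.10, 0.35)
dens_m2 <- c(0.15, 0.12, 0.30)
w <- c(0.7, 0.3) # stacking weights; constrained to sum to 1

# Combined predictive density for each observation,
# as in the stacking equation:
dens_ensemble <- w[1] * dens_m1 + w[2] * dens_m2
sum(log(dens_ensemble)) # combined log predictive density to maximize
```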