Introduction to sdmTMB

.title[
# Introduction to sdmTMB
]
.subtitle[
## IMR sdmTMB workshop
]
.author[
### 
]
.date[
### May 23–25 2023
]

---

# sdmTMB origins

A need for reproducible code and methods for assessments [(Anderson et al. 2019)](https://www.dfo-mpo.gc.ca/csas-sccs/Publications/ResDocs-DocRech/2019/2019_041-eng.html)

---

# sdmTMB highlights

sdmTMB is a user-friendly R package for modeling spatial processes

* Familiar syntax to widely used functions/packages; `glm()`, mgcv, glmmTMB, etc.
  
* Performs fast (marginal) maximum likelihood estimation via TMB
  
* Widely applicable to species distribution modelling, stock assessment inputs, bycatch models, etc. that include spatially referenced data

---

# Installing sdmTMB

From CRAN:

```r
install.packages("sdmTMB", dependencies = TRUE)
```

From source on GitHub:

* Install a C++ compiler
  * e.g., [Rtools](https://cran.r-project.org/bin/windows/Rtools/rtools40.html) on Windows
  * e.g., Xcode command line tools on a Mac:  
  `xcode-select --install`

```r
# install.packages("remotes")
remotes::install_github("pbs-assess/sdmTMB", 
  dependencies = TRUE)
```
]

---

# Installing sdmTMB

Warning: The INLA package is large, not on CRAN, and can be slow to install. If you run into problems, try installing it separately first:

<https://www.r-inla.org/download-install>

---

# General sdmTMB workflow

1. Prepare data .xsmall[(convert to UTMs, scale covariates, ...)]

2. Construct a mesh

3. Fit the model

4. Inspect the model .xsmall[(and possibly refit the model)]

5. Predict from the model

6. Calculate any derived quantities

---

# General sdmTMB workflow

1. Prepare data: .blue[`add_utm_columns()`]

2. Construct a mesh: .blue[`make_mesh()`]

3. Fit the model: .blue[`sdmTMB()`]

4. Inspect the model: .blue[`print()`], .blue[`sanity()`], .blue[`tidy()`], .blue[`residuals()`]

5. Predict from the model: .blue[`predict()`]

6. Get derived quantities: .blue[`get_index()`]

---

# Preparing the data

---

# The usual things

Prepare the data in wide format (i.e. row = observation)

Make sure there is no **NA** in the used covariates

**NA** in the response value is OK (internally checked)

---

# Preparing the "space": the use of UTMs

Need a projection that preserves constant distance

UTMs work well for regional analyses

Helper function: `sdmTMB::add_utm_columns()`

Internally, guesses at UTM zone and uses the sf package:  
`sf::st_as_sf()`, `sf::st_transform()`, and `sf::st_coordinates()`

Or use the sf or sp packages yourself

---

# Example of adding UTM columns

```r
d <- data.frame(
  lat = c(52.1, 53.4), 
  lon = c(-130.0, -131.4)
)
d <- sdmTMB::add_utm_columns(d, c("lon", "lat"))
#> Detected UTM zone 9N; CRS = 32609.
#> Visit https://epsg.io/32609 to verify.
d
#>    lat    lon        X        Y
#> 1 52.1 -130.0 431.5034 5772.632
#> 2 53.4 -131.4 340.4411 5919.452
```

* Note default `units = "km"`
* Why? Range parameter estimated in units of X and Y
* Should be not too big or small for estimation

---

# Constructing a mesh

---

# Constructing a mesh

`make_mesh()` has 2 shortcuts to mesh construction

1. K-means algorithm: used to cluster points (e.g., `n_knots = 100`); approach taken in VAST; sensitive to random `seed` argument!

2. Cutoff: minimum allowed distance between vertices (e.g., `cutoff = 10`) - **_the preferred option_**
  
Alternatively, build any INLA mesh and supply it to the `mesh` argument in `make_mesh()`

---

# Constructing a mesh

`cutoff` is in units of x and y (minimum triangle size)
]

```r
d <- data.frame(x = runif(500), y = runif(500))
mesh <- make_mesh(d, xy_cols = c("x", "y"), cutoff = 0.1)
mesh$mesh$n
#> [1] 89
plot(mesh)
```

<img src="02-intro-sdmTMB_files/figure-html/make-mesh-1.png" width="280px" style="display: block; margin: auto;" />
]

---

# Fitting the model: sdmTMB()

---

# sdmTMB()

Set up is similar to `glmmTMB()`. Common arguments:

```r
fit <- sdmTMB(
  formula,
  data,
  mesh,
  time = NULL,
  family = gaussian(link = "identity"),
  spatial = c("on", "off"),
  spatiotemporal = c("iid", "ar1", "rw", "off"),
  silent = TRUE,
  ...
)
```

See `?sdmTMB`

---

# Syntax: formulas for non-spatial model components

---

# Formula interface

sdmTMB uses a similar formula interface to widely used R packages

A formula is used to specify fixed effects and (optionally) random intercepts

```r
# linear effect of x1:
formula = y ~ x1

# add smoother effect of x2:
formula = y ~ x1 + s(x2)

# add random intercept by group g:
formula = y ~ x1 + s(x2) + (1 | g)
```
]

---

# Smoothers (as in mgcv)

```r
# smoother effect of x:
formula = y ~ s(x)

# basis dimension of 5:
formula = y ~ s(x, k = 5)

# bivariate smoother effect of x & y:
formula = y ~ s(x, y)

# smoother effect of x1 varying by x2:
formula = y ~ s(x1, by = x2)

# other kinds of mgcv smoothers:
formula = ~ s(month, bs = "cc", k = 12)
```

Smoothers are penalized ('p-splines'), i.e. data determine 'wiggliness'
]

---

# Other common R formula options

Polynomials and omitting the intercept:

```r
# transformations using `I()` notation
y ~ depth + I(depth^2)

# polynomial functions using `poly`
y ~ poly(depth, degree = 2)

# omit intercept
y ~ -1 + as.factor(year)
y ~ 0 + as.factor(year)
```

---

# Breakpoint functions for threshold analyses

```r
cpue ~ breakpt(temperature)
```

.tiny[
Essington, T.E. S.C. Anderson, L.A.K. Barnett, H.M. Berger, S.A. Siedlecki, E.J. Ward. Advancing statistical models to reveal the effect of dissolved oxygen on the spatial distribution of marine taxa using thresholds and a physiologically based index. 2022. Ecography 2022(8): e06249. <https://doi.org/10.1111/ecog.06249>
]

---

# Logistic functions for threshold analyses

```r
cpue ~ logistic(temperature)
```

---

# Syntax: families and links

---

# Families

Many of the same families used in `glm()`, `glmmTMB()`, `mgcv::gam()` can be used here

Includes:
  `gaussian()`, `Gamma()`, 
  `binomial()`, `poisson()`, `Beta()`, 
  `student()`, `tweedie()`, 
  `nbinom1()`, `nbinom2()`,
  `truncated_nbinom1()`, `truncated_nbinom2()`,
  `delta_gamma()`, `delta_lognormal()`, `delta_beta()`, and more...
  
All have `link` arguments
  
See `?sdmTMB::Families`

---

# An aside on the Tweedie

Dispersion ( `$\phi$` ) and power ( `$p$` ) parameters allow for a wide variety of shapes including many zeros

Also known as compound Poisson-Gamma distribution
]

---

# An aside on the student-t

Can also be used to deal with positive data when using _log_ `link`

`df` parameters controls for the amount of tail in the distribution i.e. degree of "extreme" values

P.S: beware when df << 3. All noise is assimilated in the "observation error"

]

.tiny[
Anderson, S. C., Branch, T. A., Cooper, A. B., and Dulvy, N. K. 2017. Black-swan events in animal populations. Proceedings of the National Academy of Sciences, 114: 3252–3257. http://www.pnas.org/lookup/doi/10.1073/pnas.16115.
]

---

# An aside on delta models

* Delta/hurdle model has 2 sub-models:
- presence/absence (binomial, logit link)  
- positive model (link varies by family)

* `family` argument to sdmTMB can be a list()  
- for convenience, many delta- families implemented: `delta_gamma`, `delta_lognormal`, `delta_truncated_nbinom2` etc

* More detail coming on day 3

---

# An aside on mixture models

Positive components may be modeled as a mixture of 2 distributions

* Finite mixture model (2 components)

* Also referred to as "ECE" (extreme catch event) model, Thorson et al. (2012)

* Mechanisms: shoaling, etc.

* See `gamma_mix()` and `lognormal_mix()`

* But requires data

---

# Decision tree for family choice

---

# Syntax: spatial model components

---

# Spatial vs. spatiotemporal fields -  notion

* A spatial field can be thought of as a spatial intercept

* a wiggly spatial process that is constant in time.
  e.g. areas that has on average a higher/lower animal density, constant "hot/cold-spot" 
  
--

* Spatiotemporal variation represents separate fields estimated for each time
  slice (possibly correlated)
  
  * wiggly spatial processes that change through time.
  e.g. inter-annual variability in "hot/cold-spot" locations

* See [model description](https://pbs-assess.github.io/sdmTMB/articles/model-description.html) vignette for math notation

---

# Spatial fields can be turned on/off

* By default `sdmTMB()` estimates a spatial field

```r
fit <- sdmTMB(
  y ~ x,
  family = gaussian(),
  data = dat,
  mesh = mesh,
* spatial = "on",
  ...
)
```

---

# Why *not* estimate a spatial field?

* If shared process across time slices isn't of interest

* If magnitude of spatiotemporal variability >> spatial variation

* If confounded with other parameters (more later)

---

# Spatiotemporal fields can be turned on/off

* By default `sdmTMB()` estimates a spatiotemporal field if the `time` argument is specified

```r
fit <- sdmTMB(
  y ~ x,
  family = gaussian(),
  data = dat,
  mesh = mesh,
* time = "year", # column in `data`
* spatiotemporal = "iid",
  ...
)
```

---

# Types of spatiotemporal fields

* None (`spatiotemporal = "off"`)

* Independent (`spatiotemporal = "iid"`)

* Random walk (`spatiotemporal = "rw"`)

* Autoregressive (`spatiotemporal = "ar1"`)

---

# Independent (IID) spatiotemporal fields

---

# AR1 spatiotemporal fields

.small[
Useful if pattern are related between years.                 
P.S: Random walk = AR1 with 1.0 correlation
]

---

# Spatiotemporal fields

* Why include spatiotemporal fields?

* If the data are collected in both space and time *and* 
    there are 'latent' spatial processes that vary through time
  * E.g., effect of water temperature on abundance if temperature wasn't in the model
  * Represents all the missing variables that vary through time

* Why would a field be IID vs RW/AR1?

* Do we expect hotspots to be independent with each time slice or adapt slowly over time?

---

# After model fitting (more tomorrow)

Inspecting, summarizing, predicting, etc.

Covered in examples in tomorrow's slides.

* `predict()`: `?predict.sdmTMB`
* `residuals()`: `?residuals.sdmTMB`
* `tidy()`: `?tidy.sdmTMB`
* `print()`
* `sanity()`
* `get_index()`
*  `...`