class: center, middle, inverse, title-slide .title[ # Introduction to sdmTMB ] .subtitle[ ## DFO TESA sdmTMB workshop ] .author[ ### ] .date[ ### January 16–19 2023 ] --- <!-- Build with: xaringan::inf_mr() --> # sdmTMB origins A need for reproducible code and methods for assessments [(Anderson et al. 2019)](https://www.dfo-mpo.gc.ca/csas-sccs/Publications/ResDocs-DocRech/2019/2019_041-eng.html) .center[ <img src="images/yellowtail_synopsis.png" width="450px" height = "400px"/> ] --- # sdmTMB highlights sdmTMB is a user-friendly R package for modeling spatial processes * Familiar syntax to widely used functions/packages; `glm()`, mgcv, glmmTMB, etc. * Performs fast (marginal) maximum likelihood estimation via TMB * Widely applicable to species distribution modelling, stock assessment inputs, bycatch models, etc. that include spatially referenced data --- # Installing sdmTMB From CRAN: ```r install.packages("sdmTMB", dependencies = TRUE) ``` From source on GitHub: * Install a C++ compiler * e.g., [Rtools](https://cran.r-project.org/bin/windows/Rtools/rtools40.html) on Windows * e.g., Xcode command line tools on a Mac: `xcode-select --install` .small[ ```r # install.packages("remotes") remotes::install_github("pbs-assess/sdmTMB", dependencies = TRUE) ``` ] --- # Installing sdmTMB Warning: The INLA package is large, not on CRAN, and can be slow to install. If you run into problems, try installing it separately first: <https://www.r-inla.org/download-install> --- # sdmTMB workflow 1. Prepare data .xsmall[(convert to UTMs, scale covariates, ...)] 2. Construct a mesh 3. Fit the model 4. Inspect the model .xsmall[(and possibly refit the model)] 5. Predict from the model 6. Calculate any derived quantities --- # sdmTMB workflow 1. Prepare data: .blue[`add_utm_columns()`] 2. Construct a mesh: .blue[`make_mesh()`] 3. Fit the model: .blue[`sdmTMB()`] 4. Inspect the model: .blue[`print()`], .blue[`sanity()`], .blue[`tidy()`], .blue[`residuals()`] 5. Predict from the model: .blue[`predict()`] 6. Get derived quantities: .blue[`get_index()`] <!-- <img src="images/sdmTMB_workflow.png" height = "500px" width="900px" /> --> --- class: center, middle, inverse # Preparing data: getting into constant distance coordinates --- # Getting into UTMs Need a projection that preserves constant distance UTMs work well for regional analyses Helper function: `sdmTMB::add_utm_columns()` Internally, guesses at UTM zone and uses the sf package: `sf::st_as_sf()`, `sf::st_transform()`, and `sf::st_coordinates()` Or use the sf or sp packages yourself --- # Example of adding UTM columns ```r d <- data.frame( lat = c(52.1, 53.4), lon = c(-130.0, -131.4) ) d <- sdmTMB::add_utm_columns(d, c("lon", "lat")) #> Detected UTM zone 9N; CRS = 32609. #> Visit https://epsg.io/32609 to verify. d #> lat lon X Y #> 1 52.1 -130.0 431.5034 5772.632 #> 2 53.4 -131.4 340.4411 5919.452 ``` * Note default `units = "km"` * Why? Range parameter estimated in units of X and Y * Should be not too big or small for estimation --- class: center, middle, inverse # Constructing a mesh --- # Constructing a mesh `make_mesh()` has 2 shortcuts to mesh construction 1. K-means algorithm: used to cluster points (e.g., `n_knots = 100`); approach taken in VAST; sensitive to random `seed` argument! 2. Cutoff: minimum allowed distance between vertices (e.g., `cutoff = 10`) Alternatively, build any INLA mesh and supply it to the `mesh` argument in `make_mesh()` --- # Constructing a mesh .small[ Size of mesh has the single largest impact on fitting speed `cutoff` is in units of x and y (minimum triangle size) ] .small[ ```r d <- data.frame(x = runif(500), y = runif(500)) mesh <- make_mesh(d, xy_cols = c("x", "y"), cutoff = 0.1) mesh$mesh$n #> [1] 89 plot(mesh) ``` <img src="02-intro-sdmTMB_files/figure-html/make-mesh-1.png" width="280px" style="display: block; margin: auto;" /> ] --- class: center, middle, inverse # Fitting the model: sdmTMB() --- # sdmTMB() Set up is similar to `glmmTMB()`. Common arguments: ```r fit <- sdmTMB( formula, data, mesh, time = NULL, family = gaussian(link = "identity"), spatial = c("on", "off"), spatiotemporal = c("iid", "ar1", "rw", "off"), silent = TRUE, ... ) ``` See `?sdmTMB` --- class: center, middle, inverse # Syntax: formulas for non-spatial model components --- # Formula interface sdmTMB uses a similar formula interface to widely used R packages A formula is used to specify fixed effects and (optionally) random intercepts .small[ ```r # linear effect of x1: formula = y ~ x1 # add smoother effect of x2: formula = y ~ x1 + s(x2) # add random intercept by group g: formula = y ~ x1 + s(x2) + (1 | g) ``` ] --- # Smoothers (as in mgcv) .small[ ```r # smoother effect of x: formula = y ~ s(x) # basis dimension of 5: formula = y ~ s(x, k = 5) # bivariate smoother effect of x & y: formula = y ~ s(x, y) # smoother effect of x1 varying by x2: formula = y ~ s(x1, by = x2) # other kinds of mgcv smoothers: formula = ~ s(month, bs = "cc", k = 12) ``` Smoothers are penalized ('p-splines'), i.e. data determine 'wiggliness' ] --- # Other common R formula options Polynomials and omitting the intercept: ```r # transformations using `I()` notation y ~ depth + I(depth^2) # polynomial functions using `poly` y ~ poly(depth, degree = 2) # omit intercept y ~ -1 + as.factor(year) y ~ 0 + as.factor(year) ``` --- # Breakpoint functions for threshold analyses ```r cpue ~ breakpt(temperature) ``` <img src="02-intro-sdmTMB_files/figure-html/make-breakpt-1.png" width="400px" style="display: block; margin: auto;" /> .tiny[ Essington, T.E. S.C. Anderson, L.A.K. Barnett, H.M. Berger, S.A. Siedlecki, E.J. Ward. Advancing statistical models to reveal the effect of dissolved oxygen on the spatial distribution of marine taxa using thresholds and a physiologically based index. 2022. Ecography 2022(8): e06249. <https://doi.org/10.1111/ecog.06249> ] --- # Logistic functions for threshold analyses ```r cpue ~ logistic(temperature) ``` <img src="02-intro-sdmTMB_files/figure-html/make-logistic-1.png" width="500px" style="display: block; margin: auto;" /> --- class: center, middle, inverse # Syntax: families and links --- # Families Many of the same families used in `glm()`, `glmmTMB()`, `mgcv::gam()` can be used here Includes: `gaussian()`, `Gamma()`, `binomial()`, `poisson()`, `Beta()`, `student()`, `tweedie()`, `nbinom1()`, `nbinom2()`, `truncated_nbinom1()`, `truncated_nbinom2()`, `delta_gamma()`, `delta_lognormal()`, `delta_beta()`, and more... All have `link` arguments See `?sdmTMB::Families` --- # An aside on the Tweedie .small[ Useful for positive continuous data with zeros (e.g., biomass density per unit effort) Dispersion ( `\(\phi\)` ) and power ( `\(p\)` ) parameters allow for a wide variety of shapes including many zeros Also known as compound Poisson-Gamma distribution ] <img src="02-intro-sdmTMB_files/figure-html/sim-tweedie-1.png" width="400px" style="display: block; margin: auto;" /> --- class: center, middle, inverse # Syntax: spatial model components --- # Spatial vs. spatiotemporal fields * A spatial field can be thought of as a spatial intercept * a wiggly spatial process that is constant in time -- * Spatiotemporal variation represents separate fields estimated for each time slice (possibly correlated) * wiggly spatial processes that change through time -- * Refer to [model description](https://pbs-assess.github.io/sdmTMB/articles/model-description.html) vignette for math notation --- # Spatial fields can be turned on/off * By default `sdmTMB()` estimates a spatial field ```r fit <- sdmTMB( y ~ x, family = gaussian(), data = dat, mesh = mesh, * spatial = "on", ... ) ``` --- # Why *not* estimate a spatial field? * If shared process across time slices isn't of interest * If magnitude of spatiotemporal variability >> spatial variation * If confounded with other parameters (more later) --- # Spatiotemporal fields can be turned on/off * By default `sdmTMB()` estimates a spatiotemporal field if the `time` argument is specified ```r fit <- sdmTMB( y ~ x, family = gaussian(), data = dat, mesh = mesh, * time = "year", # column in `data` * spatiotemporal = "iid", ... ) ``` --- # Types of spatiotemporal fields * None (`spatiotemporal = "off"`) * Independent (`spatiotemporal = "iid"`) * Random walk (`spatiotemporal = "rw"`) * Autoregressive (`spatiotemporal = "ar1"`) --- # Independent (IID) spatiotemporal fields <img src="02-intro-sdmTMB_files/figure-html/iid-demo-1.png" width="700px" style="display: block; margin: auto;" /> --- # AR1 spatiotemporal fields <img src="02-intro-sdmTMB_files/figure-html/ar1-demo-1.png" width="700px" style="display: block; margin: auto;" /> .small[Random walk = AR1 with 1.0 correlation] --- # Spatiotemporal fields * Why include spatiotemporal fields? * If the data are collected in both space and time *and* there are 'latent' spatial processes that vary through time * E.g., effect of water temperature on abundance if temperature wasn't in the model * Represents all the missing variables that vary through time -- * Why would a field be IID vs RW/AR1? * Do we expect hotspots to be independent with each time slice or adapt slowly over time? --- # After model fitting Inspecting, summarizing, predicting, etc. Covered in examples in next slides. * `predict()`: `?predict.sdmTMB` * `residuals()`: `?residuals.sdmTMB` * `tidy()`: `?tidy.sdmTMB` * `print()` * `sanity()` * `get_index()` * `...`