library(ggplot2)
library(dplyr)
library(sdmTMB)

sdmTMB does not have built-in hurdle models (also called delta models), but we can fit the two components separately and easily combine the predictions. In this case, our delta-gamma model involves fitting a binomial presence-absence model (family = binomial(link = "logit")) and then a model for the positive catches only with a gamma observation distribution and a log link (Gamma(link = "log")). We could also use a lognormal(link = "log") family. Hurdle models are more appropriate than something like a Tweedie when there are differences in the processes controlling presence vs. abundance, or when greater flexibility to account for overdispersion is required. A similar strategy can also be used for zero-inflated count data but with a truncated_nbinom1(link = "log") or truncated_nbinom2(link = "log") distribution instead of the gamma for the positive component.

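The logic of combining the two components can be sketched in base R. This is a toy illustration with made-up linear predictor values (not output from any fitted model): the expected density under a delta-gamma model is the probability of presence times the expected positive density, each obtained by applying the appropriate inverse link.

```r
# Toy sketch of the delta (hurdle) expectation:
# E[density] = Pr(present) * E[density | present]
eta_bin <- 0.5               # hypothetical linear predictor, logit scale
eta_pos <- 3.2               # hypothetical linear predictor, log scale
p_present <- plogis(eta_bin) # inverse logit link
mu_pos <- exp(eta_pos)       # inverse log link
e_density <- p_present * mu_pos
```

This is exactly the multiplication we will perform below on the predictions from the two fitted models.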
We will use a dataset built into the sdmTMB package: trawl survey data for Pacific Cod in Queen Charlotte Sound, British Columbia, Canada. The density units are kg/km2. Here, X and Y are coordinates in UTM zone 9.

glimpse(pcod)
#> Rows: 2,143
#> Columns: 12
#> $ year          <int> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20…
#> $ X             <dbl> 446.4752, 446.4594, 448.5987, 436.9157, 420.6101, 417.71…
#> $ Y             <dbl> 5793.426, 5800.136, 5801.687, 5802.305, 5771.055, 5772.2…
#> $ depth         <dbl> 201, 212, 220, 197, 256, 293, 410, 387, 285, 270, 381, 1…
#> $ density       <dbl> 113.138476, 41.704922, 0.000000, 15.706138, 0.000000, 0.…
#> $ present       <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ lat           <dbl> 52.28858, 52.34890, 52.36305, 52.36738, 52.08437, 52.094…
#> $ lon           <dbl> -129.7847, -129.7860, -129.7549, -129.9265, -130.1586, -…
#> $ depth_mean    <dbl> 5.155194, 5.155194, 5.155194, 5.155194, 5.155194, 5.1551…
#> $ depth_sd      <dbl> 0.4448783, 0.4448783, 0.4448783, 0.4448783, 0.4448783, 0…
#> $ depth_scaled  <dbl> 0.3329252, 0.4526914, 0.5359529, 0.2877417, 0.8766077, 1…
#> $ depth_scaled2 <dbl> 0.11083919, 0.20492947, 0.28724555, 0.08279527, 0.768440…

mesh1 <- make_mesh(pcod, c("X", "Y"), cutoff = 10)

It is not necessary to use the same mesh for both models, but one can do so by updating the first mesh to match the reduced data frame as shown here:

dat2 <- subset(pcod, density > 0)
mesh2 <- make_mesh(dat2,
  xy_cols = c("X", "Y"),
  mesh = mesh1$mesh
)
plot(mesh2)

This delta-gamma model is similar to the Tweedie model in the Intro to modelling with sdmTMB vignette, except that we will use s() for the depth effect.

m1 <- sdmTMB(
  formula = present ~ 0 + as.factor(year) + s(depth, k = 3),
  data = pcod,
  mesh = mesh1,
  time = "year",
  family = binomial(link = "logit"),
  spatiotemporal = "iid",
  spatial = "on"
)
m1
#> Spatiotemporal model fit by ML ['sdmTMB']
#> Formula: present ~ 0 + as.factor(year) + s(depth, k = 3)
#> Time column: "year"
#> Mesh: mesh1
#> Data: pcod
#> Family: binomial(link = 'logit')
#>                     coef.est coef.se
#> as.factor(year)2003    -0.90    0.46
#> as.factor(year)2004    -0.52    0.45
#> as.factor(year)2005    -0.55    0.46
#> as.factor(year)2007    -1.67    0.46
#> as.factor(year)2009    -1.31    0.46
#> as.factor(year)2011    -1.77    0.46
#> as.factor(year)2013    -0.45    0.45
#> as.factor(year)2015    -0.78    0.46
#> as.factor(year)2017    -1.75    0.46
#> sdepth                  6.36    0.61
#>
#> Smooth terms:
#>            Std. Dev.
#> sds(depth)     96.19
#>
#> Matern range: 32.87
#> Spatial SD: 1.80
#> Spatiotemporal SD: 0.78
#> ML criterion at convergence: 1031.273
#>
#> See ?tidy.sdmTMB to extract these values as a data frame.

One can use different covariates in each model, but in this case we will just let the depth effect be more wiggly by not specifying k = 3.
m2 <- sdmTMB(
  formula = density ~ 0 + as.factor(year) + s(depth),
  data = dat2,
  mesh = mesh2,
  time = "year",
  family = Gamma(link = "log"),
  spatiotemporal = "iid",
  spatial = "on"
)
m2
#> Spatiotemporal model fit by ML ['sdmTMB']
#> Formula: density ~ 0 + as.factor(year) + s(depth)
#> Time column: "year"
#> Mesh: mesh2
#> Data: dat2
#> Family: Gamma(link = 'log')
#>                     coef.est coef.se
#> as.factor(year)2003     3.92    0.19
#> as.factor(year)2004     4.25    0.17
#> as.factor(year)2005     4.06    0.17
#> as.factor(year)2007     3.33    0.18
#> as.factor(year)2009     3.61    0.18
#> as.factor(year)2011     4.39    0.19
#> as.factor(year)2013     3.99    0.16
#> as.factor(year)2015     4.09    0.17
#> as.factor(year)2017     3.75    0.19
#> sdepth                  1.60    3.39
#>
#> Smooth terms:
#>            Std. Dev.
#> sds(depth)      2.18
#>
#> Dispersion parameter: 1.11
#> Matern range: 10.88
#> Spatial SD: 0.63
#> Spatiotemporal SD: 1.40
#> ML criterion at convergence: 5081.850
#>
#> See ?tidy.sdmTMB to extract these values as a data frame.

We can inspect the effect of the smoothed covariate using plot_smooth().

s1 <- plot_smooth(m1, ggplot = TRUE)
s1 + xlim(min(pcod$depth), max(pcod$depth))

s2 <- plot_smooth(m2, ggplot = TRUE)
s2 + xlim(min(pcod$depth), max(pcod$depth))

Next, we need some way of combining the predictions across the two models. If all we need are point predictions, we can multiply the predictions from the two models after applying the inverse link:

pred <- qcs_grid # use the grid as a template for saving our predictions
p_bin <- predict(m1, newdata = qcs_grid)
p_pos <- predict(m2, newdata = qcs_grid)
p_bin_prob <- m1$family$linkinv(p_bin$est)
p_pos_exp <- m2$family$linkinv(p_pos$est)
pred$est_exp <- p_bin_prob * p_pos_exp

But if a measure of uncertainty is required, we can simulate from the joint precision matrix of the parameters using the predict() function with some number of simulation draws (e.g., sims = 500). Because the predictions come from draws from the parameter covariance matrix, they become more consistent as the number of draws increases. However, more draws take longer to calculate and use more memory (a larger matrix), so fewer draws (~100) may be fine for experimentation, while a larger number (say ~1000) may be appropriate for final model runs.

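The tradeoff between the number of draws and the stability of the summaries can be illustrated with a toy base-R example that has nothing to do with the fitted models: the Monte Carlo error of an upper-quantile estimate shrinks as the number of draws grows.

```r
# Toy illustration: stability of a 97.5% quantile estimate vs. number of draws
set.seed(1)
mc_error <- function(n_draws, n_reps = 200) {
  est <- replicate(n_reps, quantile(rlnorm(n_draws), probs = 0.975))
  sd(est) # spread of the quantile estimate across repetitions
}
mc_error(100)  # larger Monte Carlo error
mc_error(1000) # smaller Monte Carlo error
```
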
set.seed(28239)
p_bin_sim <- predict(m1, newdata = qcs_grid, sims = 500)
p_pos_sim <- predict(m2, newdata = qcs_grid, sims = 500)
p_bin_prob_sim <- m1$family$linkinv(p_bin_sim)
p_pos_exp_sim <- m2$family$linkinv(p_pos_sim)
p_combined_sim <- p_bin_prob_sim * p_pos_exp_sim

p_combined_sim is just a matrix with a row for each row of data we predicted on and a column for each simulation draw (dim(p_combined_sim) is 65826 by 500). You can process this matrix however you would like. We can save the median predictions and the lower and upper limits of a 95% confidence interval:

pred$median <- apply(p_combined_sim, 1, median)
pred$lwr <- apply(p_combined_sim, 1, quantile, probs = 0.025)
pred$upr <- apply(p_combined_sim, 1, quantile, probs = 0.975)

plot(pred$est_exp, pred$median)

ggplot(subset(pred, year == 2017), aes(X, Y, fill = median)) +
  geom_raster() +
  coord_fixed() +
  scale_fill_viridis_c(trans = "sqrt")

And we can now calculate spatial uncertainty, here as the coefficient of variation (CV):

pred$cv <- apply(p_combined_sim, 1, function(x) sd(x) / mean(x))
ggplot(subset(pred, year == 2017), aes(X, Y, fill = cv)) + # 2017 as an example
  geom_raster() +
  coord_fixed() +
  scale_fill_viridis_c(trans = "log10")
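Any other per-cell summary can be computed across the columns of the draw matrix in the same way. As a sketch, here is one more option, an exceedance probability, using a small stand-in matrix (the dimensions and threshold are made up for illustration):

```r
# Stand-in for a draw matrix like p_combined_sim: 5 grid cells x 1000 draws
set.seed(42)
sim <- matrix(rlnorm(5 * 1000, meanlog = 2), nrow = 5)
threshold <- 10 # hypothetical density threshold in kg/km^2
# Proportion of draws in which each cell exceeds the threshold
p_exceed <- apply(sim, 1, function(x) mean(x > threshold))
```

The same apply() call on p_combined_sim would give a per-cell probability that predicted density exceeds the chosen threshold.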

sdmTMB also has a function, get_index_sims(), for calculating an index from those draws. This function sums biomass or abundance across grid cells for each simulation draw and each year, and then calculates quantiles on the distribution of samples. By default, this function expects the simulations to still be in log space, so we either need to take log(p_combined_sim), or we can set agg_function = function(x) sum(x).

qcs_grid$area <- 4 # all grid cells are 2 x 2 km
ind <- get_index_sims(p_combined_sim / 1000, # convert from kg to tonnes
  agg_function = function(x) sum(x),
  area = qcs_grid$area
)
ggplot(ind, aes(year, est)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4) +
  ylab("Biomass (t)")
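The aggregation that get_index_sims() performs can be sketched manually in base R with a toy matrix of draws (2 years of 3 cells each, 500 draws; all values made up): sum across cells within each year for each draw, then take quantiles across draws.

```r
# Toy draw matrix: rows ordered as (year 1 cells, then year 2 cells)
set.seed(7)
n_cells <- 3; n_years <- 2; n_draws <- 500
sim <- matrix(rlnorm(n_cells * n_years * n_draws), nrow = n_cells * n_years)
year <- rep(1:n_years, each = n_cells)
# Sum the cells within each year, separately for every draw
totals <- rowsum(sim, group = year) # n_years x n_draws
# Quantiles across draws give the index and its interval, one row per year
index <- t(apply(totals, 1, quantile, probs = c(0.025, 0.5, 0.975)))
```
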

For more on modelling for the purposes of creating an index see the vignette on Index standardization with sdmTMB.