Using 'get_all' functions
Philina English
2024-10-11
Source:vignettes/02-gfdata-vignette-get-all.Rmd
02-gfdata-vignette-get-all.Rmd
Why use a get_all_*()
function?
The original get_*()
survey functions are limited to
returning sets and specimen samples that conform to current survey
design specifications based on assigned grouping and usability codes.
This works well for some surveys and data uses that depend on sampling a
consistent survey footprint across years such as generating design-based
abundance indexes. Other uses can harness information from samples that
were collected outside the current survey footprint or from different
types of surveys, some of which don’t consistently apply these codes.
The get_all_*()
functions have been designed to retrieve
all fishery-independent survey data relevant to a particular species, or
set of species, and to do so more quickly and comprehensively than the
original functions. When retrieving data for multiple species at once,
these functions will be dramatically faster than the original
get_*()
functions. This is because sql scripts are called
once for all species rather than repeatedly for each species. When
retrieving data for single species run times will depend on which
surveys and arguments are used. The extent of the data returned can be
specific to a single survey, a single major stat area, any combination
of these, or generalized to get everything in the database that is
appropriately formatted.
Additional variables are also returned to support modelling objectives
and decisions regarding which data should be retained for specific
purposes.
More flexibility
The original functions required the user to specify survey series ids
(ssid
) and were limited in which they could accept and
return.
For get_all_*()
functions, we have the option to set
ssid = NULL
which will return all fishery-independent
samples and catch data in the database, as long as the survey series ids
(and for catch data, survey ids as well) have been assigned in the
database. These include trap (sablefish), longline, jig, and most
contemporary and historic trawl surveys (exceptions include the
Nearshore Shrimp trawl survey for which survey ids are missing). If a
character string of major stat area codes is provided to the argument
major
, than all sets or samples from that area will be
returned.
Design-based analyses
Because the original get_*()
survey functions were
intended to only return sets and specimen samples that conform to
current survey design specifications, they cannot retrieve sets and
samples from cells that fall outside the latest definition of a
particular survey’s design. When this behaviour is desired, it can be
reproduced using the get_all_*()
functions with the
filtering options usability = c(0, 1, 2, 6)
and
grouping_only = TRUE
. This works reliably for most
groundfish bottom trawl and longline surveys
(ssid = c(1, 2, 3, 4, 16, 22, 36, 39, 40)
). However, when
these filtering options are applied to certain surveys, a small
proportion of the data returned by the original function may be missed
(e.g., IPHC: ssid = 14
) and no data is returned for surveys
for which grouping or usability codes do not appear in the database
(e.g., the jig survey: ssid = c(82:87)
).
If using the get_all_*()
functions to generate
design-based indices, the strata area variable is now called
grouping_area_km2
(instead of area_km2
to
avoid confusion with area_swept*
variables) and, in case
design changes occur that are not incorporated into the usability codes,
one should also always check for differences between the
grouping_code
and grouping_code_updated
variables. The grouping_code_updated
generally contains a
subset of the former, likely as a result of a shrinking footprint or the
dropping an entire strata from a survey’s definition. This is currently
the case for the offshore shrimp, also known as multi-species small-mesh
(MSSM), surveys (ssid = c(6, 7)
), which needs to be
filtered for only those sets with updated grouping codes
(!is.na(grouping_code_updated)
) in order to match the
current survey design. Consulting data stewards for specific surveys may
be helpful in understanding differences between grouping codes.
To retrieve specimen samples that conform to design specifications,
the arguments unsorted_only = TRUE
and
random_only = TRUE
should be used in addition to usability
and grouping options. When doing so
get_all_survey_samples()
function will return > 70
additional specimens over the original function for each of the longline
surveys (ssid = c(22, 36, 39, 40)
). This is because the
original get_survey_samples()
function used a stricter
method for filtering based on grouping codes. This stricter filtering
matches how get_survey_sets()
filtered for the current
trawl survey footprint, but not how it filtered sets for longline
surveys. If desired, this stricter filtering can be achieved for both
sets and samples from any survey by filtering for
!is.na(grouping_code_updated)
.
Non-standard data
In contrast, the default behaviour of the get_all_*()
survey functions is to return all data collected on any given survey,
whether or not it conforms to current design. This includes sets and
samples from grid cells that all within subsequently established
Rockfish Conservation areas (RCAs), and data that differ at the skate
level. The original functions were not built to retrieve data that
differed at the skate level, like gear comparison studies (e.g.,
ssid = 48
). The get_all_*()
functions will
automatically return catch information at the skate level, instead of
the fishing event level, for sets within a single function call whenever
gear variables (currently checking for differences in hook code and
size) differ between skates.
Set up
If you don’t already have the package installed, see the general gfdata vignette for instructions. Here we will load gfdata along with the package dplyr.
The available arguments are described in the help documentation:
Examples
What survey data is available for a species?
As an example, we might want to determine what survey set data are in
our database for Bluntnose Sixgill Sharks (Hexanchus griseus).
For now, we will leave the default settings that pull all surveys and
all areas. Beware that some records in the database are from outside
Canadian waters. If desired, returned data can be filtered using the
major_stat_area_code
to retain only Canadian records (see
get_major_areas()
to identify which codes to use).
Original get_survey_sets()
function
To start with, we check what the original
get_survey_sets()
function returns for this species. By
default this function returns just the most commonly used groundfish
surveys: synoptic trawl (ssid = c(1, 3, 4, 16)
), one
historical trawl (2
), and five longline–IPHC
(14
) and PHMA (22, 36, 39, 40
) surveys. The
first thing to note here is that this function will only return one row
per fishing event (unless overlapping survey series or sample_ids were
requested). This function will also return all sets for any survey
series, even when the species has never been recorded on that
survey.
d0 <- get_survey_sets("Bluntnose Sixgill Shark")
nrow(d0) #> number of rows
#> [1] 15349
#> [1] 15349
#> [1] 1 2 3 4 14 16 22 36 39 40
In contrast, get_all_survey_sets()
only returns set data
for survey series that have captured the species at least once. Both
*_survey_sets()
functions return all sets for any survey
series returned, including those that did not record the species. So, to
make sets that did capture the species visible to head()
,
we will sort by descending catch_count
.
d0 <- d0 |> rename(ssid = survey_series_id) |>
relocate(year, fishing_event_id, catch_count, catch_weight, ssid, survey_abbrev,
survey_series_desc) |>
arrange(-catch_count, -fishing_event_id)
head(d0, n = 8L)
#> # A tibble: 8 x 36
#> year fishing_event_id catch_count catch_weight ssid survey_abbrev survey_series_desc survey_id
#> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <int>
#> 1 2018 5092608 24 0 14 IPHC FISS International Pacif~ 538
#> 2 2018 5087665 14 0 14 IPHC FISS International Pacif~ 538
#> 3 2018 5093313 13 0 14 IPHC FISS International Pacif~ 538
#> 4 2018 5089779 10 0 14 IPHC FISS International Pacif~ 538
#> 5 2018 5074269 10 0 14 IPHC FISS International Pacif~ 538
#> 6 2018 5090481 9 0 14 IPHC FISS International Pacif~ 538
#> 7 2018 5074976 8 0 14 IPHC FISS International Pacif~ 538
#> 8 2018 5091185 6 0 14 IPHC FISS International Pacif~ 538
#> # i 28 more variables: species_code <chr>, survey_desc <chr>, trip_id <dbl>,
#> # fe_major_level_id <dbl>, latitude <dbl>, longitude <dbl>, grouping_code <dbl>,
#> # major_stat_area_code <chr>, minor_stat_area_code <chr>, depth_m <dbl>, duration_min <int>,
#> # doorspread_m <dbl>, speed_mpm <dbl>, tow_length_m <dbl>, density_kgpm2 <dbl>,
#> # density_pcpm2 <dbl>, skate_count <int>, hook_count <int>, density_ppkm2 <dbl>, month <int>,
#> # day <int>, time_deployed <dttm>, time_retrieved <dttm>, latitude_end <dbl>,
#> # longitude_end <dbl>, species_common_name <chr>, species_science_name <chr>, ...
Notice that catch_weight
sometimes contains zeros when
catch_count
is at least 1. This is because the original SQL
code assume NULL values to be zeros. In many cases catch weights are
missing because they are not collected on that type of survey. However,
even for surveys where weights are the usual unit of measurement, a
particular catch may have been too large or small for the scale and
therefore recorded only as a count. For
get_all_survey_sets()
, the default setting is
remove_false_zeros = TRUE
, which removes these misleading
zeros from the data.
Using get_all_survey_sets()
Messages and warnings will alert the user about nuances in the data requested and returned. For example, this function call results in multiple rows of data that share the same fishing_event_id and a warning that suggests possible reasons for this. In this case, the number of rows of data exceeds the number of fishing events because catch is being returned at the skate level for some fishing events. This will happen any time skates within a fishing event differ in their gear (currently just working off differences in hook type and size).
d <- get_all_survey_sets("Bluntnose Sixgill Shark")
#> [1] "Returning all sets/events/skates (including those with no catch) from all survey series that recorded Bluntnose Sixgill Shark at least once."
#> Warning in get_all_survey_sets("Bluntnose Sixgill Shark"): Duplicate fishing_event_ids are still
#> present despite `remove_duplicates = TRUE`. This may be because of overlapping survey
#> stratifications or multiple skates per event (specifically when at least one survey included used
#> skates with differences in gear type), but could also be due to trips participating in more than
#> one type of survey. If the latter, location, gear, or `reason_desc` columns should be used to
#> choose which events to keep. After selecting specific survey stratifications and determining that
#> all relevant variables are accurate, the remaining duplications can be filtered using `dat <-
#> dat[!duplicated(dat$fishing_event_id), ]`.
#> Warning in get_all_survey_sets("Bluntnose Sixgill Shark"): All sablefish research related sets are
#> returned as survey_series_id 35. To separate types of sets, use reason_desc and grouping_code
#> variables.
nrow(d) #> number of rows
#> [1] 18967
#> [1] 18840
#> [1] 4 6 7 14 34 35 36 39 40 45 48 76 80
Now, when we view the data catch_weight
appears
appropriately as NA
when this data was not collected. We
now also get catches at the skate level (multiple skates make up each
fishing event) for the dogfish comparison work
(ssid = c(48)
) which was not (and cannot be) returned by
get_survey_sets()
due to the gear differences between
skates.
d <- d |> rename(ssid = survey_series_id) |>
relocate(year, fishing_event_id, catch_count, catch_weight, ssid, survey_abbrev,
activity_desc, skate_id) |>
arrange(-catch_count, -fishing_event_id)
head(d, n = 8L)
#> # A tibble: 8 x 75
#> year fishing_event_id catch_count catch_weight ssid survey_abbrev activity_desc skate_id
#> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl>
#> 1 2018 5092608 24 NA 14 IPHC FISS INTERNATIONAL PACIFI~ NA
#> 2 2018 5087665 14 NA 14 IPHC FISS INTERNATIONAL PACIFI~ NA
#> 3 2018 5093313 13 NA 14 IPHC FISS INTERNATIONAL PACIFI~ NA
#> 4 2018 5089779 10 NA 14 IPHC FISS INTERNATIONAL PACIFI~ NA
#> 5 2018 5074269 10 NA 14 IPHC FISS INTERNATIONAL PACIFI~ NA
#> 6 2018 5090481 9 NA 14 IPHC FISS INTERNATIONAL PACIFI~ NA
#> 7 2018 5074976 8 NA 14 IPHC FISS INTERNATIONAL PACIFI~ NA
#> 8 2023 5795618 6 NA 48 OTHER DOGFISH GEAR/TIMING ~ 5788696
#> # i 67 more variables: species_common_name <chr>, species_code <chr>, fe_major_level_id <dbl>,
#> # trip_id <dbl>, survey_series_og <dbl>, survey_id <int>, activity_code <dbl>, reason_desc <chr>,
#> # trip_year <int>, month <int>, day <int>, time_deployed <dttm>, time_retrieved <dttm>,
#> # time_end_deployment <dttm>, time_begin_retrieval <dttm>, latitude <dbl>, longitude <dbl>,
#> # latitude_end <dbl>, longitude_end <dbl>, major_stat_area_code <chr>,
#> # minor_stat_area_code <chr>, depth_m <dbl>, depth_begin <dbl>, depth_end <dbl>, vessel_id <dbl>,
#> # captain_id <dbl>, duration_min <int>, tow_length_m <dbl>, mouth_width_m <dbl>, ...
So, which surveys encountered the most of this species? Some surveys only count individuals and others only weigh the total catch, so we will summarize both count and weight variables.
d |> group_by(ssid, survey_series_desc) |>
mutate(event_skate_id = paste0(fishing_event_id, "-", skate_id)) |>
summarise(individuals = sum(catch_count, na.rm = TRUE),
weight = sum(catch_weight),
events = length(unique(fishing_event_id)),
skates = length(unique(event_skate_id)),
rows = n()) |>
arrange(-individuals, -weight)
#> # A tibble: 13 x 7
#> # Groups: ssid [13]
#> ssid survey_series_desc individuals weight events skates rows
#> <dbl> <chr> <dbl> <dbl> <int> <int> <int>
#> 1 14 "International Pacific Halibut Commission Fishery-I~ 175 NA 3278 3278 3278
#> 2 48 "Dogfish Gear/Timing Comparison Surveys" 38 NA 145 271 271
#> 3 40 "Hard Bottom Longline Inside South " 16 NA 528 528 528
#> 4 76 "Strait of Georgia Dogfish Longline" 13 NA 351 351 351
#> 5 45 "Strait of Georgia Synoptic Bottom Trawl" 2 19.4 98 98 98
#> 6 39 "Hard Bottom Longline Inside North " 1 200 769 769 769
#> 7 34 "Strait of Georgia Ecosystem Research Initiative Ac~ 1 36.7 167 167 167
#> 8 6 "Queen Charlotte Sound Multispecies Small-mesh Bott~ 1 7.14 1295 1295 1295
#> 9 35 "Sablefish Research and Assessment" 1 5.8 3582 3582 3582
#> 10 4 "West Coast Vancouver Island Synoptic Bottom Trawl" 1 1.52 1737 1737 1737
#> 11 36 "Hard Bottom Longline Outside South" 1 NA 1536 1536 1536
#> 12 7 "West Coast Vancouver Island Multispecies Small-mes~ 0 15.7 5084 5084 5084
#> 13 80 "Eulachon Migration Study Bottom Trawl (South)" 0 4.5 271 271 271
We can also tally the number of unique fishing events versus the
number of rows of data returned to see which surveys have been returned
at the skate level. This was the case for SSID 48
.
The vast majority of records for Bluntnose Sixgill Shark come from
the IPHC, followed by the Dogfish and Hard Bottom Longline surveys, both
conducted in the Strait of Georgia (aka. Inside South). Because the IPHC
covers a wider area, we can explore the spatial distribution of catches
within that survey only, and confirm that they are most frequently
caught in the Strait of Georgia, major stat area "01"
.
d |> filter(ssid == 14) |>
group_by(major_stat_area_code) |>
summarise(individuals = sum(catch_count, na.rm = TRUE),
weight = sum(catch_weight),
events = length(unique(fishing_event_id))) |>
arrange(-individuals, -weight)
#> # A tibble: 8 x 4
#> major_stat_area_code individuals weight events
#> <chr> <dbl> <dbl> <int>
#> 1 01 137 NA 43
#> 2 06 14 NA 831
#> 3 03 10 NA 383
#> 4 04 10 NA 262
#> 5 07 2 NA 757
#> 6 05 1 NA 362
#> 7 09 1 NA 201
#> 8 08 0 0 439
What survey samples are available for a species within a specific area?
As an example, we might want to determine what survey sample data
exists for Pacific Spiny Dogfish in the Strait of Georgia. The area
argument is major
and accepts character vectors of major
stat area codes. A table of options can be retrieved with
get_major_area()
.
d2 <- get_all_survey_samples("north pacific spiny dogfish",
major = c("01"))
#> [1] "Returning all north pacific spiny dogfish specimens from major area(s) 01 from any survey series."
d2 |> group_by(survey_series_id, survey_series_desc) |>
summarise(specimens = length(unique(specimen_id)),
lengths = sum(!is.na(length)),
weights = sum(!is.na(weight)),
age_structures = sum(age_specimen_collected)
) |>
arrange(-specimens)|>
rename(ssid = survey_series_id)
#> # A tibble: 15 x 6
#> # Groups: ssid [15]
#> ssid survey_series_desc specimens lengths weights age_structures
#> <dbl> <chr> <int> <int> <int> <int>
#> 1 76 "Strait of Georgia Dogfish Longline" 54716 54690 6582 19053
#> 2 40 "Hard Bottom Longline Inside South " 36566 36561 3 0
#> 3 39 "Hard Bottom Longline Inside North " 35624 35497 408 400
#> 4 48 "Dogfish Gear/Timing Comparison Surveys" 16695 16681 3322 884
#> 5 45 "Strait of Georgia Synoptic Bottom Trawl" 1721 1721 1243 505
#> 6 15 "Strait of Georgia Lingcod Young-of-year Bottom T~ 1371 1181 0 0
#> 7 34 "Strait of Georgia Ecosystem Research Initiative ~ 925 877 402 0
#> 8 50 "Yelloweye Rockfish Genetics" 786 786 0 0
#> 9 87 "Jig Survey - 4B Stat Area 19" 314 300 0 0
#> 10 86 "Jig Survey - 4B Stat Area 18" 176 164 0 0
#> 11 51 "Combined Submersible And Longline Fishing Survey" 169 169 168 0
#> 12 82 "Jig Survey - 4B Stat Area 12" 14 14 0 0
#> 13 85 "Jig Survey - 4B Stat Area 16" 5 4 0 0
#> 14 68 "Joint Canada/US Hake Acoustic" 1 1 1 0
#> 15 84 "Jig Survey - 4B Stat Area 15" 1 0 0 0
This should return all fishery-independent specimen records. We
haven’t counted actual ages for this species, because none are
available. For both get_all_*
functions, the default is to
drop all columns with no data, so in this case the column named
age
is missing. However, this column could have been
retained by using the argument drop_na_columns = FALSE
.
If you want to focus on specimens that come from design-based survey
sets you can add arguments that filter for unsorted random samples that
come from events that have grouping codes that match those expected for
the current survey design. Alternatively, this can be achieved by
filtering for specimens where !is.na(grouping_code)
or
!is.na(grouping_code_updated)
and checking that the
sample_type_comment
and sample_source_desc
notes are consistent with the specimens being from random samples. Here,
we will use the built in filter arguments, but also add the additional
constraint of filtering based on the updated grouping codes. Note: Some
surveys do not use grouping codes, and therefore won’t be returned when
the grouping_only
option is used. In this case, all of SSID
51
and 82:87
are now missing.
d3 <- get_all_survey_samples("north pacific spiny dogfish",
major = c("01"),
usability = c(0, 1, 2, 6),
unsorted_only = TRUE,
random_only = TRUE,
grouping_only = TRUE)
#> [1] "Looking for samples that are usable (0, 1, 2, 6) unsorted random with originally specified grouping codes."
#> [1] "Returning all north pacific spiny dogfish specimens from major area(s) 01 from any survey series."
d3 |> filter(!is.na(grouping_code_updated)) |>
group_by(survey_series_id, survey_series_desc) |>
summarise(specimens = length(unique(specimen_id)),
lengths = sum(!is.na(length)),
weights = sum(!is.na(weight)),
age_structures = sum(age_specimen_collected)) |>
arrange(-specimens) |>
rename(ssid = survey_series_id)
#> # A tibble: 7 x 6
#> # Groups: ssid [7]
#> ssid survey_series_desc specimens lengths weights age_structures
#> <dbl> <chr> <int> <int> <int> <int>
#> 1 40 "Hard Bottom Longline Inside South " 35070 35065 3 0
#> 2 39 "Hard Bottom Longline Inside North " 29149 29027 399 400
#> 3 76 "Strait of Georgia Dogfish Longline" 18384 18376 0 15521
#> 4 45 "Strait of Georgia Synoptic Bottom Trawl" 1721 1721 1243 505
#> 5 15 "Strait of Georgia Lingcod Young-of-year Bottom Tr~ 1071 1071 0 0
#> 6 34 "Strait of Georgia Ecosystem Research Initiative A~ 215 215 0 0
#> 7 68 "Joint Canada/US Hake Acoustic" 1 1 1 0
If you want to retrieve additional event or skate-level covariates
for use in model-based analyses, than use the argument
include_event_info = TRUE
. For example, when this is
applied to the various longline surveys in the Strait of Georgia, one
can test the effects of variables like depth, date, hook type and size
on the sex and sizes of fish caught.
d4 <- get_all_survey_samples("north pacific spiny dogfish",
major = c("01"),
ssid = c(39, 40, 48, 76),
include_event_info = TRUE)
#> [1] "Returning all north pacific spiny dogfish specimens from within major area(s) 01 and belonging to survey series 39, 40, 48, 76."
#> [1] "Specimens found. Fetching additional event info."
d4 |> group_by(survey_series_id, activity_desc, hook_desc, hooksize_desc) |>
summarise(specimens = length(unique(specimen_id)),
years = paste(min(year, na.rm = TRUE), "-", max(year, na.rm = TRUE))) |>
arrange(-specimens) |>
rename(ssid = survey_series_id, hooksize = hooksize_desc) |>
print()
#> # A tibble: 8 x 6
#> # Groups: ssid, activity_desc, hook_desc [7]
#> ssid activity_desc hook_desc hooksize specimens years
#> <dbl> <chr> <chr> <chr> <int> <chr>
#> 1 40 HARD BOTTOM LONGLINE HOOK SURVEY - INSIDE SOUTH CIRCLE HOOK 13/0 36566 2005 - 2022
#> 2 76 STRAIT OF GEORGIA DOGFISH LONGLINE SURVEY CIRCLE HOOK 14/0 36332 2005 - 2019
#> 3 39 HARD BOTTOM LONGLINE HOOK SURVEY - INSIDE NORTH CIRCLE HOOK 13/0 35624 2003 - 2023
#> 4 76 STRAIT OF GEORGIA DOGFISH LONGLINE SURVEY J-HOOK 12/0 18384 1986 - 1989
#> 5 48 DOGFISH GEAR/TIMING COMPARISON SURVEYS CIRCLE HOOK 14/0 8474 2004 - 2023
#> 6 48 DOGFISH GEAR/TIMING COMPARISON SURVEYS CIRCLE HOOK 13/0 7254 2019 - 2023
#> 7 48 DOGFISH GEAR/TIMING COMPARISON SURVEYS J-HOOK 12/0 948 2004 - 2023
#> 8 48 DOGFISH GEAR/TIMING COMPARISON SURVEYS <NA> <NA> 19 2004 - 2004
These are the variables returned by default:
glimpse(d2)
#> Rows: 149,084
#> Columns: 50
#> $ species_common_name <chr> "north pacific spiny dogfish", "north pacific spiny dogfish",~
#> $ survey_series_id <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
#> $ sex <dbl> 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2~
#> $ length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ weight <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ survey_series_og <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 1~
#> $ activity_desc <chr> "STRAIT OF GEORGIA LINGCOD YOUNG-OF-YEAR BOTTOM TRAWL SURVEY"~
#> $ activity_code <dbl> 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 3~
#> $ fishing_event_id <dbl> 3132875, 3132875, 3132875, 3132875, 3132875, 3132875, 3132875~
#> $ trip_year <int> 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2~
#> $ sample_date <dttm> 2006-08-04, 2006-08-04, 2006-08-04, 2006-08-04, 2006-08-04, ~
#> $ species_code <chr> "044", "044", "044", "044", "044", "044", "044", "044", "044"~
#> $ species_science_name <chr> "squalus suckleyi", "squalus suckleyi", "squalus suckleyi", "~
#> $ specimen_id <dbl> 12291537, 12291539, 12291540, 12291542, 12291543, 12291544, 1~
#> $ sample_id <dbl> 399729, 399729, 399729, 399729, 399729, 399729, 399729, 39972~
#> $ fork_length <dbl> 72, 72, 73, 74, 75, 75, 76, 80, 85, 88, 89, 70, 77, 77, 78, 7~
#> $ total_length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ maturity_code <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ maturity_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ maturity_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ maturity_convention_code <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9~
#> $ maturity_convention_desc <chr> "MATURITIES NOT LOOKED AT", "MATURITIES NOT LOOKED AT", "MATU~
#> $ maturity_convention_maxvalue <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ major_stat_area_code <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "~
#> $ major_stat_area_name <chr> "4B: STRAIT OF GEORGIA", "4B: STRAIT OF GEORGIA", "4B: STRAIT~
#> $ minor_stat_area_code <chr> "14", "14", "14", "14", "14", "14", "14", "14", "14", "14", "~
#> $ gear <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ reason_desc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ survey_id <int> 481, 481, 481, 481, 481, 481, 481, 481, 481, 481, 481, 481, 4~
#> $ trip_id <dbl> 73330, 73330, 73330, 73330, 73330, 73330, 73330, 73330, 73330~
#> $ trip_sub_type_code <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2~
#> $ fe_parent_event_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ fe_major_level_id <dbl> 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 4~
#> $ fe_sub_level_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ sample_type_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ sample_type_comment <chr> "TOTAL CATCH", "TOTAL CATCH", "TOTAL CATCH", "TOTAL CATCH~
#> $ species_category_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ sample_source_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ age_specimen_collected <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
#> $ usability_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
#> $ grouping_code <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 1~
#> $ grouping_desc <chr> "Lingcod YOY depth stratum 1: 16 - 25 m", "Lingcod YOY depth ~
#> $ grouping_code_updated <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ grouping_desc_updated <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ original_ind <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
#> $ length_type <chr> "total_length", "total_length", "total_length", "total_length~
#> $ sample_source_desc <chr> "Unsorted", "Unsorted", "Unsorted", "Unsorted", "Unsorted", "~
#> $ usability_desc <chr> "FULLY USABLE", "FULLY USABLE", "FULLY USABLE", "FULLY USABLE~
#> $ survey_series_desc <chr> "Strait of Georgia Lingcod Young-of-year Bottom Trawl", "Stra~
#> $ survey_abbrev <chr> "OTHER", "OTHER", "OTHER", "OTHER", "OTHER", "OTHER", "OTHER"~
And these additional variables were returned for longline surveys
when include_event_info = TRUE
:
#> Rows: 143,601
#> Columns: 34
#> $ catch_weight <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ catch_count <dbl> 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 5~
#> $ year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202~
#> $ month <int> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, ~
#> $ day <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
#> $ time_deployed <dttm> 2023-09-02 14:43:16, 2023-09-02 14:43:16, 2023-09-02 14:43:16, 2023-~
#> $ time_retrieved <dttm> 2023-09-02 16:42:55, 2023-09-02 16:42:55, 2023-09-02 16:42:55, 2023-~
#> $ time_end_deployment <dttm> 2023-09-02 14:43:16, 2023-09-02 14:43:16, 2023-09-02 14:43:16, 2023-~
#> $ time_begin_retrieval <dttm> 2023-09-02 16:42:55, 2023-09-02 16:42:55, 2023-09-02 16:42:55, 2023-~
#> $ latitude <dbl> 50.21935, 50.21935, 50.21935, 50.21935, 50.21935, 50.21935, 50.21935,~
#> $ longitude <dbl> -125.3644, -125.3644, -125.3644, -125.3644, -125.3644, -125.3644, -12~
#> $ latitude_end <dbl> 50.22198, 50.22198, 50.22198, 50.22198, 50.22198, 50.22198, 50.22198,~
#> $ longitude_end <dbl> -125.3671, -125.3671, -125.3671, -125.3671, -125.3671, -125.3671, -12~
#> $ depth_m <dbl> 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 79, 7~
#> $ depth_begin <dbl> 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 6~
#> $ depth_end <dbl> 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101, 101,~
#> $ vessel_id <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201~
#> $ captain_id <dbl> 969, 969, 969, 969, 969, 969, 969, 969, 969, 969, 969, 969, 969, 969,~
#> $ duration_min <int> 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119, 119,~
#> $ tow_length_m <dbl> 350, 350, 350, 350, 350, 350, 350, 350, 350, 350, 350, 350, 350, 350,~
#> $ hook_code <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ lglsp_hook_count <dbl> 225, 225, 225, 225, 225, 225, 225, 225, 225, 225, 225, 225, 225, 225,~
#> $ hook_desc <chr> "CIRCLE HOOK", "CIRCLE HOOK", "CIRCLE HOOK", "CIRCLE HOOK", "CIRCLE H~
#> $ hooksize_desc <chr> "13/0", "13/0", "13/0", "13/0", "13/0", "13/0", "13/0", "13/0", "13/0~
#> $ grouping_depth_id <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2",~
#> $ grouping_area_km2 <int> 844, 844, 844, 844, 844, 844, 844, 844, 844, 844, 844, 844, 844, 844,~
#> $ skate_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ mean_per_skate <dbl> 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236,~
#> $ minor_id_count <int> 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236,~
#> $ minor_id_max <dbl> 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236, 236,~
#> $ diff <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ event_level_count <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ skate_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N~
#> $ hook_area_swept_km2 <dbl> 0.005262028, 0.005262028, 0.005262028, 0.005262028, 0.005262028, 0.00~
The collection of variables that are returned can change depending on the records that are retrieved. For example, variables that are specific to longline surveys are omitted when only trawl survey sets are returned.
Surveys with overlapping stratifications
Some fishing events are assigned to multiple surveys, which may or
may not be fully or partially overlapping, and are defined by the same
activity code in the database. To get all sets with matching activity
codes one can use include_activity_matches = TRUE
. It will
return all events that share the same activity_code
as any
SSIDs requested. This works when retrieving either sets or samples.
d5 <- get_all_survey_sets("north pacific spiny dogfish",
ssid = c(7),
include_activity_matches = TRUE,
remove_duplicates = TRUE #> default
)
#> [1] "north pacific spiny dogfish have been recorded by survey series 7, 6, 67 at least once. "
#> [1] "Returning all relevant sets/events/skates including those with no catch."
#> Warning in get_all_survey_sets("north pacific spiny dogfish", ssid = c(7), : Duplicate
#> fishing_event_ids are still present despite `remove_duplicates = TRUE`. This may be because of
#> overlapping survey stratifications or multiple skates per event (specifically when at least one
#> survey included used skates with differences in gear type), but could also be due to trips
#> participating in more than one type of survey. If the latter, location, gear, or `reason_desc`
#> columns should be used to choose which events to keep. After selecting specific survey
#> stratifications and determining that all relevant variables are accurate, the remaining
#> duplications can be filtered using `dat <- dat[!duplicated(dat$fishing_event_id), ]`.
d5 |> group_by(survey_series_id, survey_abbrev, activity_desc) |>
summarise(events = length(unique(fishing_event_id)),
years = paste(min(year), "-", max(year)),
rows = n()) |>
arrange(-events) |>
rename(ssid = survey_series_id)
#> # A tibble: 2 x 6
#> # Groups: ssid, survey_abbrev [2]
#> ssid survey_abbrev activity_desc events years rows
#> <dbl> <chr> <chr> <int> <chr> <int>
#> 1 7 MSSM WCVI MULTISPECIES SMALL-MESH (AKA SHRIMP) BOTTOM TRAWL SURVEY 5084 1975 - ~ 5084
#> 2 6 MSSM QCS MULTISPECIES SMALL-MESH (AKA SHRIMP) BOTTOM TRAWL SURVEY 1295 1998 - ~ 1295
You will get a warning that some fishing events are duplicated even
though remove_duplicates = TRUE
. We can look at one of the
duplicated events and see that it lacks location information, which
means that it couldn’t be accurately assigned to either region of shrimp
survey, so it has been returned as potentially belonging to both.
dd <- d5[duplicated(d5$fishing_event_id),]
glimpse(dd)
#> Rows: 1
#> Columns: 54
#> $ species_common_name <chr> "north pacific spiny dogfish"
#> $ catch_count <dbl> NA
#> $ catch_weight <dbl> 1.9
#> $ survey_series_id <dbl> 7
#> $ survey_abbrev <chr> "MSSM WCVI"
#> $ year <int> 2000
#> $ fishing_event_id <dbl> 901698
#> $ species_code <chr> "044"
#> $ fe_major_level_id <dbl> 9
#> $ trip_id <dbl> 60021
#> $ survey_series_og <dbl> 7
#> $ activity_desc <chr> "MULTISPECIES SMALL-MESH (AKA SHRIMP) BOTTOM TRAWL SURVEY"
#> $ activity_code <dbl> 25
#> $ reason_desc <chr> "QUANT. BIOMASS SURVEY"
#> $ trip_year <int> 2000
#> $ month <int> NA
#> $ day <int> NA
#> $ time_deployed <dttm> NA
#> $ time_retrieved <dttm> NA
#> $ time_end_deployment <dttm> NA
#> $ time_begin_retrieval <dttm> NA
#> $ latitude <dbl> NA
#> $ longitude <dbl> NA
#> $ latitude_end <dbl> NA
#> $ longitude_end <dbl> NA
#> $ major_stat_area_code <chr> "00"
#> $ minor_stat_area_code <chr> "00"
#> $ depth_m <dbl> NA
#> $ depth_begin <dbl> NA
#> $ depth_end <dbl> NA
#> $ vessel_id <dbl> 2000
#> $ captain_id <dbl> NA
#> $ duration_min <int> NA
#> $ tow_length_m <dbl> NA
#> $ mouth_width_m <dbl> 10.6
#> $ doorspread_m <dbl> 29.6
#> $ speed_mpm <dbl> 86.66668
#> $ usability_code <dbl> 13
#> $ grouping_code <dbl> NA
#> $ grouping_desc <chr> NA
#> $ grouping_code_updated <dbl> NA
#> $ grouping_desc_updated <chr> NA
#> $ grouping_area_km2 <int> NA
#> $ original_ind <chr> "Y"
#> $ survey_series_desc <chr> "West Coast Vancouver Island Multispecies Small-mesh Bottom Trawl"
#> $ species_science_name <chr> "squalus suckleyi"
#> $ species_desc <chr> "north pacific spiny dogfish"
#> $ usability_desc <chr> "UNUSABLE FOR CPUE ESTIMATION"
#> $ area_swept1 <dbl> NA
#> $ area_swept2 <dbl> NA
#> $ area_swept <dbl> NA
#> $ area_swept_km2 <dbl> NA
#> $ density_kgpm2 <dbl> NA
#> $ density_pcpm2 <dbl> NA
NOTE: All activity matches are always returned by default whenever
any one of the sablefish surveys (ssid = 35, 41, 42, 43) is requested.
This is because SSIDs for that survey were inconsistently assigned and
frequently share trip ids, which results in duplication and or
assignment to the wrong survey series. In order to accurately separate
the types of sablefish surveys one needs to split data from this survey
by the reason_desc
variable.
Troubleshooting
Beware of duplication of fishing events and specimens
A risk in using get_all_*()
functions is that, in the
attempt to return a comprehensive data set, some fishing events and
specimen ids may be duplicated (this also occurs with the original
functions but usually for different reasons). Some vessel trips conduct
sampling for multiple survey series, and unless joining is based on
grouping codes (which aren’t used consistently for some surveys) the
only way to connect a fishing event to a survey series id is through the
vessel trip. This can result in events and specimens getting assigned to
both surveys that were conducted on the same trip. The most common
instances of this (e.g., sablefish, jig, and offshore shrimp surveys)
have custom corrections coded into an internal function
correct_ssids()
that is applied within the
get_all_*()
functions. Duplication can also occur due to
missing covariates (e.g., both event level and survey defaults are
missing for doorspread_m
on a couple sets for some trawl
survey series), or for specimens when multiple vials of DNA were
collected and return_dna_info = TRUE
. It is recommended to
always check for unexpected duplication of observations (usually
fishing_event_id
or specimen_id
) before
beginning any analysis. Two return both copies of each duplicated record
the following can be used:
d[duplicated(d$specimen_id) | duplicated(d$specimen_id, fromLast=TRUE), ]
.
Error messages
Any error message that mentions SQL Server suggests either that the network connection or server timed out, or that the SQL query was flawed. One way this can happen is providing an invalid ssid (e.g., a character “4” instead of a numeric 4), invalid major area code (e.g., a numeric 1 instead of a character “01”), or a search that is insufficiently limited in its scope.
Error: nanodbc/nanodbc.cpp:2823: 08S01 [Microsoft][ODBC SQL Server Driver][DBNETLIB]ConnectionRead (recv()). [Microsoft][ODBC SQL Server Driver][DBNETLIB]General network error. Check your network documentation.
If the SQL search is successfully returned to R, but your computer has insufficient memory to handle the amount of data returned, you may see an error like this:
Error: cannot allocate vector of size XXX Mb