vignettes/MazamaLocationUtils.Rmd
This package is intended for use in data management activities associated with fixed locations in space. The motivating fields include air and water quality monitoring where fixed sensors report at regular time intervals.
When working with environmental monitoring time series, one of the first things you have to do is create unique identifiers for each individual time series. In an ideal world, each environmental time series would have both a `locationID` and a `deviceID` that uniquely identify the specific instrument making measurements and the physical location where measurements are made. A unique `timeseriesID` could be produced as `locationID_deviceID`. Metadata associated with each `timeseriesID` would contain basic information needed for downstream analysis including at least:
`timeseriesID, locationID, deviceID, longitude, latitude, ...`

With these fields, records can be selected by `deviceID` or by `locationID`, positioned using `longitude, latitude`, and matched to measurements held in a `data` table with `timeseriesID` column names.

Unfortunately, we are rarely supplied with a truly unique and truly spatial `locationID`. Instead we often use `deviceID` or an associated non-spatial identifier as a stand-in for `locationID`.
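As a minimal sketch of the `locationID_deviceID` construction described above, the base R snippet below builds a `timeseriesID` for each metadata record. The data frame name `meta` and the `deviceID` values are hypothetical and chosen only for illustration.

```r
# Hypothetical metadata records; the deviceID values are made up for illustration
meta <- data.frame(
  locationID = c("c2913q48uk", "c28f8z9xq8"),
  deviceID = c("dev001", "dev002"),
  longitude = c(-122.2852, -122.6600),
  latitude = c(48.06534, 48.29440),
  stringsAsFactors = FALSE
)

# Combine the two identifiers into a single identifier per time series
meta$timeseriesID <- paste(meta$locationID, meta$deviceID, sep = "_")

print(meta[, c("timeseriesID", "locationID", "deviceID", "longitude", "latitude")])
```

Because the combined identifier is built from both parts, two devices at the same location, or one device moved between locations, still get distinct `timeseriesID` values.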
We have seen a number of complications in practice, several of them involving `locationID`.

A solution to all these problems is possible if we store spatial metadata in simple tables in a standard directory. These tables will be referred to as collections. Location lookups can be performed with geodesic distance calculations where a longitude-latitude pair is assigned to a pre-existing known location if it is within `distanceThreshold` meters of that location. These lookups will be extremely fast.
If no previously known location is found, a new known location metadata record is created, which is relatively slow (seconds), and the new record is then added to the growing collection.
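The lookup rule can be illustrated with a small base R sketch. This is not the package's implementation: it assumes a spherical Earth and uses the haversine formula as a stand-in for a true geodesic calculation, and the function names `haversineMeters()` and `findKnownLocation()` are hypothetical.

```r
# Great-circle distance in meters on a spherical Earth (an approximation,
# assumed here in place of a true geodesic calculation)
haversineMeters <- function(lon1, lat1, lon2, lat2, radius = 6371000) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 +
    cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Return the index of the nearest known location if it lies within
# distanceThreshold meters, otherwise NA (hypothetical helper)
findKnownLocation <- function(lon, lat, knownLon, knownLat, distanceThreshold) {
  d <- haversineMeters(lon, lat, knownLon, knownLat)
  i <- which.min(d)
  if (d[i] <= distanceThreshold) i else NA_integer_
}

# Two known locations taken from the coordinates shown in this document
knownLon <- c(-122.2852, -122.6600)
knownLat <- c(48.06534, 48.29440)

# A point a few meters from the first known location is assigned to it
nearIndex <- findKnownLocation(-122.2850, 48.0654, knownLon, knownLat, 500)

# A point far from both known locations is not assigned; a new record
# would have to be created for it
farIndex <- findKnownLocation(-120, 47, knownLon, knownLat, 500)

print(c(nearIndex, farIndex))
```

The lookup involves only arithmetic on the stored coordinates, which is why it can be much faster than deriving spatial metadata for a new location.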
For collections of stationary environmental monitors that only number in the thousands, this entire collection can be stored as either a `.rda` or `.csv` file and will be under a megabyte in size making it fast to load. This small size also makes it possible to store multiple known locations files, each created with different locations and different distance thresholds to address the needs of different scientific studies.
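To give a rough sense of scale, the following base R sketch writes a synthetic table of 5000 locations to both formats and reports the file sizes. The table here has only three columns with made-up values, so it is an illustration under those assumptions rather than a measurement of a real known locations table; the object and file names are hypothetical.

```r
# Synthetic collection: 5000 made-up locations (not real monitor data)
set.seed(1)
n <- 5000
knownLocations <- data.frame(
  locationID = sprintf("loc%05d", seq_len(n)),
  longitude = runif(n, -125, -117),
  latitude = runif(n, 45.5, 49),
  stringsAsFactors = FALSE
)

# Write the same table as .csv and as .rda in a temporary directory
csvPath <- file.path(tempdir(), "example_collection.csv")
rdaPath <- file.path(tempdir(), "example_collection.rda")
write.csv(knownLocations, csvPath, row.names = FALSE)
save(knownLocations, file = rdaPath)

# File sizes in bytes
fileSizes <- file.size(c(csvPath, rdaPath))
print(fileSizes)

# Reading the table back is a single call
reloaded <- read.csv(csvPath, stringsAsFactors = FALSE)
```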
The package comes with some example known locations tables.
Let's take some metadata we have for air quality monitors in Washington state and create a known locations table for them. The metadata are held in a data frame named `wa` with the following columns:
```
##  [1] "deviceDeploymentID"    "deviceID"              "deviceType"
##  [4] "deviceDescription"     "deviceExtra"           "pollutant"
##  [7] "units"                 "dataIngestSource"      "dataIngestURL"
## [10] "dataIngestUnitID"      "dataIngestExtra"       "dataIngestDescription"
## [13] "locationID"            "locationName"          "longitude"
## [16] "latitude"              "elevation"             "countryCode"
## [19] "stateCode"             "countyName"            "timezone"
## [22] "houseNumber"           "street"                "city"
## [25] "postalCode"            "AQSID"                 "fullAQSID"
## [28] "address"               "deploymentType"
```
We can create a known locations table for them with a minimum 500 meter separation between distinct locations. (NOTE: This will take some time to perform all the spatial queries.)

To speed things up, we call `table_addLocation()` with its defaults: `elevationService = NULL, addressService = NULL`. This avoids slow web service requests and results in a table with `NA` for the corresponding columns.
```r
library(MazamaLocationUtils)

# Initialize with standard directories
initializeMazamaSpatialUtils()
setLocationDataDir("./data")

wa_monitors_500 <-
  table_initialize() %>%
  table_addLocation(wa$longitude, wa$latitude, distanceThreshold = 500)
```
At this point, our known locations table contains only automatically generated spatial metadata.
```r
dplyr::glimpse(wa_monitors_500, width = 75)
```

```
## Rows: 78
## Columns: 13
## $ locationID   <chr> "c2913q48uk", "c28f8z9xq8", "c23hfxrdne", "c2k9v9bjc…
## $ locationName <chr> "us.wa_c2913q", "us.wa_c28f8z", "us.wa_c23hfx", "us.…
## $ longitude    <dbl> -122.2852, -122.6600, -122.2233, -117.1801, -119.008…
## $ latitude     <dbl> 48.06534, 48.29440, 47.28140, 46.72450, 46.20010, 48…
## $ elevation    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ countryCode  <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US"…
## $ stateCode    <chr> "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA", "WA"…
## $ countyName   <chr> "Snohomish", "Island", "King", "Whitman", "Walla Wal…
## $ timezone     <chr> "America/Los_Angeles", "America/Los_Angeles", "Ameri…
## $ houseNumber  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ street       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ city         <chr> "Tulalip Bay", "Oak Harbor", "Auburn", "Pullman", "B…
## $ postalCode   <chr> "98207", "98277", "98002", "99163", "99323", "98221"…
```
Perhaps we would like to import some of the original metadata into our new table. This is a very common use case where non-spatial metadata like uniform identifiers or owner information for a monitor can be added.
Just to make it interesting, let’s assume that our known locations table is already large and we are only providing additional metadata for a subset of the records.
```r
# Use a subset of the wa metadata
wa_indices <- seq(5, 65, 5)
wa_sub <- wa[wa_indices, ]

# Use a generic name for the location table
locationTbl <- wa_monitors_500

# Find the location IDs associated with our subset
locationID <- table_getLocationID(
  locationTbl,
  longitude = wa_sub$longitude,
  latitude = wa_sub$latitude,
  distanceThreshold = 500
)

# Now add the "AQSID" column for our subset of locations
locationData <- wa_sub$AQSID
locationTbl <- table_updateColumn(
  locationTbl,
  columnName = "AQSID",
  locationID = locationID,
  locationData = locationData
)

# Let's see how we did
locationTbl_indices <- table_getRecordIndex(locationTbl, locationID)
locationTbl[locationTbl_indices, c("city", "AQSID")]
```
```
## # A tibble: 13 × 2
##    city         AQSID
##    <chr>        <chr>
##  1 Burbank      530710006
##  2 Newport      840MM0510008
##  3 Soap Lake    840530250003
##  4 Shelton      530450007
##  5 Winthrop     530470010
##  6 Seattle      530330030
##  7 Cle Elum     840MM0370180
##  8 Longview     530150015
##  9 Enumclaw     530330023
## 10 Wenatchee    530070011
## 11 Mount Vernon 530570015
## 12 White Salmon 840MM0399990
## 13 LaCrosse     530750005
```
Very nice. We have added `AQSID` to our known locations table for a more detailed description of each monitor's location.
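The idea behind updating a column for only a subset of records can be sketched in base R with `match()`. This is an illustration of the general technique, not the implementation of `table_updateColumn()`; the `locationID` values and object names below are hypothetical, while the city and `AQSID` values are taken from the output above.

```r
# A small stand-in for a known locations table (locationID values are made up)
exampleTbl <- data.frame(
  locationID = c("loc_a", "loc_b", "loc_c", "loc_d"),
  city = c("Seattle", "Burbank", "Shelton", "LaCrosse"),
  stringsAsFactors = FALSE
)

# New metadata supplied for only two of the four locations, in arbitrary order
subsetID <- c("loc_d", "loc_b")
subsetAQSID <- c("530750005", "530710006")

# Start with an all-NA column, then fill in the rows whose IDs match
exampleTbl$AQSID <- NA_character_
rowIndex <- match(subsetID, exampleTbl$locationID)
exampleTbl$AQSID[rowIndex] <- subsetAQSID

print(exampleTbl)
```

Records that receive no new data keep `NA` in the new column, so the same table can be filled in incrementally as more metadata become available.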
The whole point of a known locations table is to speed up access to spatial and other metadata. Here’s how we can use it with a set of longitudes and latitudes that are not currently in our table.
```r
# Create new locations near our known locations
lons <- jitter(wa_sub$longitude)
lats <- jitter(wa_sub$latitude)

# Any known locations within 50 meters?
table_getNearestLocation(
  wa_monitors_500,
  longitude = lons,
  latitude = lats,
  distanceThreshold = 50
) %>% dplyr::pull(city)
```

```
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA NA
```

```r
# Any known locations within 250 meters?
table_getNearestLocation(
  wa_monitors_500,
  longitude = lons,
  latitude = lats,
  distanceThreshold = 250
) %>% dplyr::pull(city)
```

```
##  [1] "Burbank"      "Newport"      NA             "Shelton"      "Winthrop"
##  [6] NA             NA             "Longview"     "Enumclaw"     "Wenatchee"
## [11] "Mount Vernon" NA             "LaCrosse"
```

```r
# How about 5000 meters?
table_getNearestLocation(
  wa_monitors_500,
  longitude = lons,
  latitude = lats,
  distanceThreshold = 5000
) %>% dplyr::pull(city)
```

```
##  [1] "Burbank"      "Newport"      "Soap Lake"    "Shelton"      "Winthrop"
##  [6] "Seattle"      "Cle Elum"     "Longview"     "Enumclaw"     "Wenatchee"
## [11] "Mount Vernon" "White Salmon" "LaCrosse"
```
Before using MazamaLocationUtils you must first install MazamaSpatialUtils and then install core spatial data with:
```r
library(MazamaSpatialUtils)
setSpatialDataDir("~/Data/Spatial")

installSpatialData("EEZCountries")
installSpatialData("OSMTimezones")
installSpatialData("NaturalEarthAdm1")
installSpatialData("USCensusCounties")
```
The `initializeMazamaSpatialUtils()` function by default assumes spatial data are installed in the standard location and is just a wrapper for:
```r
MazamaSpatialUtils::setSpatialDataDir("~/Data/Spatial")

MazamaSpatialUtils::loadSpatialData("EEZCountries.rda")
MazamaSpatialUtils::loadSpatialData("OSMTimezones.rda")
MazamaSpatialUtils::loadSpatialData("NaturalEarthAdm1.rda")
MazamaSpatialUtils::loadSpatialData("USCensusCounties.rda")
```
Once the required datasets have been installed, the easiest way to set things up each session is with:
```r
library(MazamaLocationUtils)
initializeMazamaSpatialUtils()
setLocationDataDir("~/Data/KnownLocations")
```
Every time you `table_save()` your location table, a backup will be created so you can experiment without losing your work. File sizes are pretty tiny so you don’t have to worry about filling up your disk.
Best wishes for well organized spatial metadata!