Distance clustering is used to identify unique deployments of a sensor in an environmental monitoring field study. GPS-reported locations can be jittery and result in a sensor self-reporting from a cluster of nearby locations. Clustering helps resolve this by assigning a single location to the cluster.

Standard kmeans clustering does not work well when clusters can have widely differing numbers of members. A much better result is acheived with the Partitioning Around Medoids method available in cluster::pam().

The value of clusterDiameter is compared with the output of cluster::pam(...)$clusinfo[,'av_diss'] to determine the number of clusters.

clusterByDistance(
  tbl,
  clusterDiameter = 1000,
  lonVar = "longitude",
  latVar = "latitude",
  maxClusters = 50
)

Arguments

tbl

Tibble with geolocation information.

clusterDiameter

Diameter in meters used to determine the number of clusters (see description).

lonVar

Name of longitude variable in the incoming tibble.

latVar

Name of the latitude variable in the incoming tibble.

maxClusters

Maximum number of clusters to try.

Value

Input tibble with additional columns: clusterLon, clusterLat, clusterID.

Note

In most applications, the table_addClustering function should be used as it implements two-stage clustering using clusterbyDistance().

Examples

library(MazamaLocationUtils)

# Fremont, Seattle 47.6504, -122.3509
# Magnolia, Seattle 47.6403, -122.3997
# Downtown Seattle 47.6055, -122.3370

fremont_x <- jitter(rep(-122.3509, 10), .0005)
fremont_y <- jitter(rep(47.6504, 10), .0005)

magnolia_x <- jitter(rep(-122.3997, 8), .0005)
magnolia_y <- jitter(rep(47.6403, 8), .0005)

downtown_x <- jitter(rep(-122.3370, 3), .0005)
downtown_y <- jitter(rep(47.6055, 3), .0005)

# Apply clustering
tbl <-
  dplyr::tibble(
    longitude = c(fremont_x, magnolia_x, downtown_x),
    latitude = c(fremont_y, magnolia_y, downtown_y)
  ) %>%
  clusterByDistance(
    clusterDiameter = 1000
  )

plot(tbl$longitude, tbl$latitude, pch = tbl$clusterID)