Add clustering information to a dataframe

Clustering is used to identify unique deployments of a sensor in an environmental monitoring field study.

Sensors will be moved around from time to time, sometimes across the country and sometimes across the street. We would like to assign unique identifiers to each new "deployment" but not when the sensor is moved a short distance.

We use clustering to find an appropriate number of unique "deployments". The sensitivity of this algorithm can be adjused with the clusterDiameter argument.

Standard kmeans clustering does not work well when clusters can have widely differing numbers of members. A much better result is acheived with the Partitioning Around Medoids method available in cluster::pam().

The value of clusterRadius is compared with the output of cluster::pam(...)$clusinfo[,'av_diss'] to determine the number of clusters.

table_addClustering(
  tbl,
  clusterDiameter = 1000,
  lonVar = "longitude",
  latVar = "latitude",
  maxClusters = 50
)

Arguments

tbl: Tibble with geolocation information (e.g..
clusterDiameter: Diameter in meters used to determine the number of clusters (see description).
lonVar: Name of longitude variable in the incoming tibble.
latVar: Name of the latitude variable in the incoming tibble.
maxClusters: Maximum number of clusters to try.

Value

Input tibble with additional columns: clusterLon, clusterLat.

Note

The table_addClustering() function implements two-stage clustering using clusterByDistance. If the first attempt at clustering produces clustered locations that are still too close to eachother, another round of clustering is performed using the results of the previous attempt. This two-stage approach seems to work well in. practice.

References

When k-means clustering fails

Examples

library(MazamaLocationUtils)

# Fremont, Seattle 47.6504, -122.3509
# Magnolia, Seattle 47.6403, -122.3997
# Downtown Seattle 47.6055, -122.3370

fremont_x <- jitter(rep(-122.3509, 10), .0005)
fremont_y <- jitter(rep(47.6504, 10), .0005)

magnolia_x <- jitter(rep(-122.3997, 8), .0005)
magnolia_y <- jitter(rep(47.6403, 8), .0005)

downtown_x <- jitter(rep(-122.3370, 3), .0005)
downtown_y <- jitter(rep(47.6055, 3), .0005)

# Apply clustering
tbl <-
  dplyr::tibble(
    longitude = c(fremont_x, magnolia_x, downtown_x),
    latitude = c(fremont_y, magnolia_y, downtown_y)
  ) %>%
  table_addClustering(
    clusterDiameter = 1000
  )

plot(tbl$longitude, tbl$latitude, pch = tbl$clusterID)

Arguments

Value

Note

References

See also

Examples