Find all links in an html page — html_getLinks • MazamaCoreUtils

Parses an html page to extract all <a href="...">...</a> links and returns them in a dataframe where linkName is the human-readable name and linkUrl is the href portion. By default, this function returns relative URLs.

This is especially useful for extracting data from an index page that shows the contents of a web-accessible directory.

Wrapper functions html_getLinkNames() and html_getLinkUrls() return the appropriate columns as vectors.
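
The parsing described above can be illustrated with a minimal sketch built on the xml2 package. This is not the package's actual implementation: sketch_getLinks() is a hypothetical helper, the inline page is invented for illustration, and resolving relative URLs with xml2::url_absolute() is an assumption about how absolute URLs might be produced.

```r
library(xml2)

# Hypothetical helper for illustration only; NOT the MazamaCoreUtils source.
# `html` is a literal html string, file path or URL accepted by read_html().
# `base` is an assumed optional base URL used to make links absolute.
sketch_getLinks <- function(html, base = NULL) {
  doc <- read_html(html)

  # Select only anchors that carry an href attribute
  nodes <- xml_find_all(doc, "//a[@href]")

  linkUrl <- xml_attr(nodes, "href")
  if (!is.null(base)) {
    # Resolve relative hrefs against the base URL
    linkUrl <- url_absolute(linkUrl, base)
  }

  data.frame(
    linkName = xml_text(nodes),
    linkUrl = linkUrl,
    stringsAsFactors = FALSE
  )
}

# A made-up index page, parsed from a literal string so no network is needed
page <- '<html><body>
  <a href="cb_2019_us_county_20m.zip">cb_2019_us_county_20m.zip</a>
  <a href="docs/readme.txt">Read me</a>
</body></html>'

# Relative URLs, as they appear in the href attributes
sketch_getLinks(page)

# Absolute URLs, resolved against an example base URL
sketch_getLinks(page, base = "https://example.com/shp/")
```

Returning the href untouched by default mirrors the relative = TRUE default documented here; a caller who wants to download the files would typically need the absolute form.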

html_getLinks(url = NULL, relative = TRUE)

html_getLinkNames(url = NULL)

html_getLinkUrls(url = NULL, relative = TRUE)

Arguments

url: URL or file path of an html page.
relative: Logical specifying whether to return relative URLs (default TRUE).

Value

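html_getLinks() returns a dataframe with linkName and linkUrl columns. html_getLinkNames() and html_getLinkUrls() return the corresponding column as a vector.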

Examples

library(MazamaCoreUtils)

# Fail gracefully if the resource is not available
try({

  # US Census 2019 shapefiles
  url <- "https://www2.census.gov/geo/tiger/GENZ2019/shp/"

  # Extract links
  dataLinks <- html_getLinks(url)

  dataLinks <- dataLinks %>%
    dplyr::filter(stringr::str_detect(linkName, "us_county"))
  head(dataLinks, 10)

}, silent = FALSE)
#> # A tibble: 5 × 2
#>   linkName                                linkUrl                               
#>   <chr>                                   <chr>                                 
#> 1 cb_2019_us_county_20m.zip               cb_2019_us_county_20m.zip             
#> 2 cb_2019_us_county_500k.zip              cb_2019_us_county_500k.zip            
#> 3 cb_2019_us_county_5m.zip                cb_2019_us_county_5m.zip              
#> 4 cb_2019_us_county_within_cd116_500k.zip cb_2019_us_county_within_cd116_500k.z…
#> 5 cb_2019_us_county_within_ua_500k.zip    cb_2019_us_county_within_ua_500k.zip
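
Because url may also be a file path, the wrapper functions can be tried without network access. The following is a minimal sketch, assuming MazamaCoreUtils is installed; the temporary file and its contents are made up for illustration, and no output is shown because it has not been verified here.

```r
library(MazamaCoreUtils)

# Write a small, made-up index page to a temporary file
tmpFile <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body>",
  "<a href=\"data_2019.zip\">2019 data</a>",
  "<a href=\"data_2020.zip\">2020 data</a>",
  "</body></html>"
), tmpFile)

# Fail gracefully if the page cannot be parsed
try({

  # Dataframe with both columns
  links <- html_getLinks(tmpFile)

  # Wrappers return the individual columns as vectors
  linkNames <- html_getLinkNames(tmpFile)
  linkUrls <- html_getLinkUrls(tmpFile)

  print(links)
  print(linkNames)
  print(linkUrls)

}, silent = FALSE)
```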