Introduction

This tutorial demonstrates basic exploration of previously created ‘pat’ timeseries data. In order to run the code in this tutorial you must have followed the instructions in Tutorial 1 and created a directory with ‘pas’ and ‘pat’ data files for the Methow Valley. Target audiences include grad students, researchers and any member of the public concerned about air quality and comfortable working with R and RStudio.

Tutorials in this series include:

Goal

Our goal in this tutorial is to explore a variety of pat_~() functions available in the AirSensor package and to explain their most important arguments and the interpretation of output graphics.

PurpleAir A & B Channels

PurpleAir (PA) sensors are unique in that they have two separate optical particle detectors — one labeled A and one labeled B. All of the plotting functions, except pat_monitorComparison() and pat_externalFit(), only use data from a single PA sensor with A and B data channels.

When measurements from the A and B channels are largely in agreement, the sensor is working properly. When they substantially differ, there is a problem. Moreover, when a sensor is properly functioning, correlation between the A and B channel (pm25_A:pm25_B) will be strongly positive. In western states, the correlation between temperature and humidity will typically be strongly negative.

Loading pat objects created in MVCAA Tutorial 1

library(AirSensor)

# Use the default archiveDir unless it is already defined
if ( !exists("archiveDir") ) {
  archiveDir <- file.path("~/Data/MVCAA")
}

# Load pat data
patList <- get(load(file.path(archiveDir,"/patList.rda")))

# Print site ids and associated names
sapply(patList, function(x) { return(x$meta$label) }) 
##                               ab5dca99422f2c0d_13669 
##               "MV Clean Air Ambassador @ Balky Hill" 
##                               f6c44edd41c941c7_10182 
##             "MV Clean Air Ambassador @ Benson Creek" 
##                               49215ad49d1a87e3_10188 
##              "MV Clean Air Ambassador @ Bush School" 
##                               f736fd3fb21fc4da_13667 
##               "MV Clean Air Ambassador @ Gunn Ranch" 
##                               db5d6b3b79f5830e_39237 
## "MV Clean Air Ambassador @ Liberty Bell High School" 
##                               4f19d256e1787973_10166 
##          "MV Clean Air Ambassador @ Lower Studhorse" 
##                               f592adb5067ad9d3_13675 
##         "MV Clean Air Ambassador @ McFarland Creek " 
##                               4a47b9252e16e558_15077 
##           "MV Clean Air Ambassador @ Methow Estates" 
##                               0cbfeb2ce4c1553c_13661 
##             "MV Clean Air Ambassador @ Pine Forest " 
##                               2e3b5ceea86a885b_10168 
##       "MV Clean Air Ambassador @ Upper Beaver Creek" 
##                               f96deab8c29aa42b_10134 
##         "MV Clean Air Ambassador @ Willowbrook Farm" 
##                               96b108298883ca47_64441 
##              "MV Clean Air Ambassador-Little Cougar"
# Pull out each sensor as a separate 'pat' object
Balky_Hill <- patList[["ab5dca99422f2c0d_13669"]]
Bush_School <- patList[["49215ad49d1a87e3_10188"]]
Liberty_School <- patList[["db5d6b3b79f5830e_39237"]]      # good
McFarland_Creek <- patList[["f592adb5067ad9d3_13675"]]
Pine_Forest <- patList[["0cbfeb2ce4c1553c_13661"]]         # great
Willowbrook_Farm <- patList[["f96deab8c29aa42b_10134"]]    # poor
Benson_Creek <- patList[["f6c44edd41c941c7_10182"]]
Gunn_Ranch <- patList[["f736fd3fb21fc4da_13667"]]
Lower_Studhorse <- patList[["4f19d256e1787973_10166"]]
Methow_Estates <- patList[["4a47b9252e16e558_15077"]] 
Beaver_Creek <- patList[["2e3b5ceea86a885b_10168"]]
Little_Cougar <- patList[["96b108298883ca47_64441"]]

pat_dygraph

The pat_dygraph() function creates interactive graphs that will be displayed in RStudio’s ‘Viewer’ tab.

Lets explore a couple of sensors. We will accept default settings for most of the pat_dygraph() parameters but will apply use sampleSize = 1000 and rollPeriod = 6. Reducing the sampleSize prevents the interactive viewer from bogging down with too much data and increasing the rollPeriod results in smooth lines rather than disconnected points.

Specifying the correct timezone ensures that times are displayed in local time.

NOTE: You can experiment with the rollPeriod by adjusting the number in the dygraph viewer.

NOTE: You can use the sliders underneath each plot to change the time period or “brush” over the plot itself to restrict both X and Y ranges. Double-click to return to full range.

Liberty School

pat_dygraph(
  pat = Liberty_School, 
  sampleSize = 1000,
  xlab = "September 2020",
  rollPeriod = 6,
  timezone = "America/Los_Angeles"
)

Willowbrook Farm

pat_dygraph(
  pat = Willowbrook_Farm,       
  sampleSize = 1000,
  xlab = "September 2020",
  rollPeriod = 6,
  timezone = "America/Los_Angeles"
)

For Liberty School, the A and B channel measurements are largely in agreement while for Willowbrook Farm, the channels show serious discrepancies. This
indicates a poorly functioning sensor. It would be interesting to explore a larger time range to investigate whether these discrepancies persist.

pat_multiPlot

The pat_multiplot() function allow us to quickly display several parameters from our raw pat data using one of the pre-defined plot types. Available plottype options include:

  • “all” – pm25_A, pm25_B, temperature, humidity

  • “pm25_a” – PM2.5 from channel A only

  • “pm25_b” – PM2.5 from channel B only

  • “pm25” – PM2.5 from channels A and B in separate plots

  • “pm25_over” – PM2.5 from channels A and B in the same plot

  • “aux” – auxiliary data (temperature, humidity)

As before, sub-sampling to only 1000 points speeds up plotting.

NOTE: The sub-sampling algorithm carefully retains outliers in the A and B channels so that the overall appearance of the A and and B plots should remain the same regardless of the sampleSize.

# Liberty School multiplot - plottype = "all"
pat_multiPlot(
  pat = Liberty_School,
  plottype = "all",
  sampleSize = 1000
)

The “all” plot type above offers a nice overview of all four variables. We can see general agreement between channel A and B data and an inverse relationship between humidity and temperature measurements.

Let’s take a closer look at the A and B PM2.5 measurements using the “pm25_over” plot type.

# Liberty School multiplot - plottype = "pm25_over"
pat_multiPlot(
  pat = Liberty_School,
  plottype = "pm25_over",
  sampleSize = 1000
)

Channels A and B show strong agreement. We improve our faith in this assessment by increasing the sample size. Let’s try sampleSize = 5000.

# Balky Hill multiplot - plottype = "pm25_over", sampleSize = 5000
pat_multiPlot(
  pat = Liberty_School,
  plottype = "pm25_over",
  sampleSize = 5000
)

Again, channels A and B are largely in agreement. Purely based on visual inspection, we can assume that the Liberty School sensor is working properly A more in-depth statistical analysis of A and B channel agreement is discussed later in this tutorial.

Let’s take a closer look at the relationship between temperature and humidity by using the “aux” plot type.

# Liberty School multiplot - plottype = "aux"
pat_multiPlot(
  pat = Balky_Hill,
  plottype = "aux",
  sampleSize = 5000
)

From the graph, we can see a strong anti-correlation between the two variables. This also supports the assessment that the sensor is functioning properly

pat_scatterPlotMatrix

The pat_scatterPlotMatrix() function returns a matrix of scatterplots and correlation values for the most important variables in the pat object. Various sensor failure modes are often evident as unexpected or missing correlations in this plot.

The list of parameters used by default includes:

  • datetime – measurement time

  • pm25_A – A channel PM2.5 (ug/m3)

  • pm25_B – B channel PM2.5 (ug/m3)

  • temperature – temperature (F)

  • humidity – humidity (%)

Let’s compare all of them for the Liberty School sensor.

# Liberty School scatterPlotMatrix
pat_scatterPlotMatrix(
  pat = Liberty_School,
  sampleSize = 5000
)

As we can see from the plot above, there is an excellent correlation of near 1 between PM2.5 measurements in channels A and B and a strong negative correlation of around -0.84 between temperature and humidity. (We say “around” because we are sampling only 5000 points from the entire population.) These results support the conclusion that the sensor worked well throughout September 2020.

pat_outliers

The pat_outliers() function detects outliers using a Median Average Deviation or “Hampel” filter. This filter looks at a window of consecutive values and identifies those that are very far out in the tails of the distribution of values within the window. The filter window is then “rolled” over the entire length of the time series, much like a “rolling mean”.

windowSize = 15 corresponds to a temporal window of approximately 30 minutes given the current PurpleAir sampling interval of 2 minutes. This is long enough that we have a reasonable number of values from which to calculate statistics and short enough that we would not expect radical changes in atmospheric conditions

The default threshold setting thresholdMin = 8 identifies points that are extremely unlikely to be part of a normal distribution and therefore very likely to be outliers.

Specifying replace = TRUE will return a new pat object with outliers replaced by the window median values.

NOTE: Using this technique, you can create a highly smoothed, artificial dataset by setting thresholdMin = 1 or lower (but always above zero).

If desired, you can use pat_outlisers() to create a new pat object and save it in your archive directory as in the example below.

# create an outliers plot for Liberty School 
Liberty_School_clean <- pat_outliers(
  pat = Liberty_School,
  windowSize = 15,
  thresholdMin = 8,
  replace = TRUE,
  showPlot = TRUE
)

# Liberty_School_clean contains the replaced values and can be saved in your archive 
# directory.
save(Liberty_School_clean, file = file.path(archiveDir, "Liberty_School_clean.rda"))

The pat_outlier() function will almost always find some outliers, often insignificant ones. The plot above shows a few non-concerning outliers. Outliers of concern may be associated with electronic “glitches” and appear far away from the main time-series trace as depicted in the example below (notice data around August 3, 2018).

example_pat %>%
  pat_filterDate(20180801, 20180815) %>%
  pat_outliers()

pat_internalFit

The pat_internalFit() function uses a linear model to fit data from channel A to data from channel B and returns an object of class “lm”. If you assign the result to an object, you can then interrogate the linear model using the base R function summary().

The default setting of showPlot = TRUE will generate a model fit plot.

Let’s have a look at a well functioning sensor and a not-so-well functioning one.

# Liberty School linear model 
lm_Liberty <- pat_internalFit(
  pat = Liberty_School
)

summary(lm_Liberty)
## 
## Call:
## lm(formula = data$pm25_A ~ data$pm25_B, subset = NULL, weights = NULL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8389  -1.2155  -0.1392   0.8043  16.6558 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.715848   0.019458  -36.79   <2e-16 ***
## data$pm25_B  0.948297   0.000209 4536.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.353 on 20144 degrees of freedom
##   (1980 observations deleted due to missingness)
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999 
## F-statistic: 2.058e+07 on 1 and 20144 DF,  p-value: < 2.2e-16

The model fit plot shows that the model is slightly biased towards channel B (slope = 0.95) and that the linear model explains 99.9% of the data variability. This supports the assumption that the Liberty School sensor functioned correctly throughout most of September 2020.

Now lets take a look at our problem sensor:

# Willowbrook Farm linear model
lm_Willowbrook <- pat_internalFit(
  pat = Willowbrook_Farm
)

summary(lm_Willowbrook)
## 
## Call:
## lm(formula = data$pm25_A ~ data$pm25_B, subset = NULL, weights = NULL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -332.29   -2.41    7.53    8.15   58.07 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.634438   0.250332   -30.5   <2e-16 ***
## data$pm25_B  1.497656   0.003102   482.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.76 on 19638 degrees of freedom
##   (454 observations deleted due to missingness)
## Multiple R-squared:  0.9223, Adjusted R-squared:  0.9223 
## F-statistic: 2.33e+05 on 1 and 19638 DF,  p-value: < 2.2e-16

The model fit plot shows that the model is biased towards channel A (slope = 1.5), as is also evident in the bottom graphic. The model explains only 92.2% of the data variability, reinforcing the notion that the sensor was not functioning properly in September, 2020.

But we knew something was wrong from a first glance at the plots!

While statistical calculations are important, we encourage everyone to use the AirSensor package to truly “look at” the data from these sensors. Your eyes are a very powerful statistical tool all on their own.

pat_dailySoHPlot

The pat_dailySoHPlot() function is in a very different category from the other functions mentioned so far. It works with daily “State-of-Health” metrics that attempt to identify days when a sensor is performing well or poorly. The idea is that State-of-Health metrics can be used as a sort of daily checkup so that people can identify sensors that are beginning to malfunction due to acute problems like disconnected wires or due to gradual deterioration.

The pat_dailySoHPlot() function plots a subset of the most useful State of Health metrics calculated by the pat_dailySoH() function. (pat_dailySoHPlot() runs pat_dailySoH() internally and uses the output to create the plot.)

To better understand each plot, see also:

  • PurpleAirSoH_dailyPctReporting(), for pm25_A_PctReporting and pm25_B_PctReporting. Plots show the percentage of each day that the sensor is reporting data.

  • PurpleAirSoH_dailyPctValid(), for pm25_A_PctValid and pm25_B_PctValid. Plots show percentage of the total recorded measurements that are considered physically plausible, e.g. no negative values for PM2.5.

  • PurpleAirSoH_dailyPctDC(), for pm25_A_pctDC and pm25_B_pctDC. A high percent DC (direct current) value indicates the likely occurrence of a “sticky value”, and a zero or low percent DC indicates that the sensor is recording dynamic data.

  • PurpleAirSoH_dailyMetFit(), for pm25_A_temperature_rsquared. Plot shows daily linear model fit values between the pm25_A and temperature channels. The daily r-squared value is returned and is expected to hover near 0 for a properly functioning sensor, under normal environmental conditions.

  • PurpleAirSoH_dailyABFit(), for pm25_A_pm25_B_slope, pm25_A_pm25_B_intercept, and pm25_A_pm25_B_rsquared. Plots show daily linear model fit values between the pm25_A and pm25_B channels. The daily r-squared value is returned in addition to the coefficients of the linear fit (slope and intercept).

# Create state-of-health plots for Liberty School
pat_dailySoHPlot(Liberty_School, ncol = 2)

The red line in each plot above identifies expected values. The plot that grabs our attention the most is pm25_A_temperature_rsquared with the persistent elevated values occurring throughout the month. Values are expected to hover near 0 for a properly functioning sensor under normal environmental conditions. However, wildfire events during the summer and the heavy use of wood heat during winter months could explain a non-zero correlation between PM2.5 and temperature. In our case, it’s possible that the elevated values recorded in this plot are due to wildfire events during that period.

The pm25_A_pctDC plot shows that channel B reported dynamic data throughout the month while channel A recorded a higher percentage of “sticky values” towards the beginning and the end of the month. (Looking at the raw data, channel A had periods where it reported repeated PM2.5 values of 0.0.) We notice that this overlaps with periods when the r-squared values were lower as shown in the pm25_A_pm25_B_rsquared plot. Especially at the beginning of the month, the plot shows r-squared values hovering around zero, indicating the poor functioning of the sensor during that period.

pat_monitorComparison

The pat_monitorComparison() function is in another new category of functions that compare hourly aggregated sensor data with data from a professionally maintained “Federal Reference Method” (FRM) monitor

pat_monitorComparison() creates and returns a ggplot object that plots raw pat data, hourly aggregated pat data and hourly data from the nearest monitor in the US Forest Service maintained PWFSL database.

If the nearest monitor didn’t collect data during the period of time investigated or is not within the distanceCutoff, it is possible that you will encounter the following error:

Error in monitor_subset(ws_monitor, tlim = c(starttime, endtime)) : ws_monitor object contains zero monitors

You can try to change the default distanceCutoff = 20 to 50 (km). However, 20 km is the recommended distance to ensure that both instruments are measuring similar parcels of air.

To learn more info about the FRM monitors in your area of interest and discover their exact location, you can use the USFS AirFire Monitoring Site.

When aggregating raw data to hourly averages, pat_monitorComparison() applies one of several Quality Control functions available in the AirSensor package. These functions use various algorithms to invalidate certain hourly values that don’t pass QC. The FUN argument lets you specify which QC algorithm to use.

For specific details on the QC logic, have look at the online documentation available for each of the four algorithms:

The default “AB_01” algorithm attempts to find a balance where it throws out data that is “obviously” junk while retaining data during “periods of interest”. It works tolerably well under normal conditions but is perhaps too strict during periods of very high PM2.5 values like the Fourth of July and wildfire events where it ends up invalidating many hourly values.

# Comparison plot for Liberty School sensor
pat_monitorComparison(
  pat = Liberty_School,
  FUN = AirSensor::PurpleAirQC_hourly_AB_01, # default QC
  distanceCutoff = 20
)

# Find the nearest monitor
Liberty_School$meta$pwfsl_closestMonitorID # "530470010_01"
## [1] "530470010_01"

The graph above shows:

  • in gray – sensor data before aggregation and QC (PA raw)
  • in purple – sensor data after the QC treatment (PA hourly)
  • in black – data collected by the nearest monitor (Monitor)

There is general agreement between “PA hourly” and “Monitor” but many of the hourly aggregated values are missing because they have been invalidated by our QC function.

Because we have looked carefully at the raw data from Liberty School and know this sensor to be functioning well, we feel comfortable using the very forgiving “AB_00” QC function to see more of the hourly aggregated values:

# comparison plot for Liberty School sensor
pat_monitorComparison(
  pat = Liberty_School,
  FUN = AirSensor::PurpleAirQC_hourly_AB_00, # minimal QC
  distanceCutoff = 20
)

In the plot above, the PA hourly and the monitor data are telling the same basic story. The trends are similar but the absolute values differ. A more detailed assessment of their statistical relationship is discussed in the next section.

pat_externalFit

The pat_externalFit() function produces a linear model between data from PurpleAir and data from the closest FRM monitor. A diagnostic plot is produced if showPlot = TRUE.

NOTE: As of AirSensor version 1.0.8, this function does not properly support choosing the QC algorithm and hourly_AB_01 is always used.

# Create comparison plot for Liberty School sensor
lm_01 <- pat_externalFit(
  pat = Liberty_School,
  showPlot = TRUE
)

summary(lm_01)
## 
## Call:
## lm(formula = both_data$pwfsl_pm25 ~ both_data$pa_pm25, subset = NULL, 
##     weights = NULL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.855  -1.175  -0.352   0.689  39.943 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.570862   0.360715   4.355 1.55e-05 ***
## both_data$pa_pm25 0.694821   0.006858 101.312  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.705 on 649 degrees of freedom
##   (69 observations deleted due to missingness)
## Multiple R-squared:  0.9405, Adjusted R-squared:  0.9404 
## F-statistic: 1.026e+04 on 1 and 649 DF,  p-value: < 2.2e-16

We can almost see in the title that the nearest FRM monitor to the Liberty School sensor is 4.2 km away – nicely close for this analysis.

Unfortunately, several of the hourly aggregated values have been invalidated by default QC algorithm “AB_01” during the periods of highest wildfire smoke. Of the remaining hourly values, the linear fit is actually pretty good with an r-squared of 0.94 and a slope of 0.69 implying that the PurpleAir sensor is overstating the PM2.5 values.

Overall, the Liberty sensor tracked wildfire smoke intrusions quite well.


Best of luck assessing air quality in your community!