Outlier Management
GRAPHICAL APPROACHES TO OUTLIER DETECTION
Boxplots and histograms are useful for getting an idea of the distribution that could be used to model the data, and they can also provide insight into whether outliers exist in our data set.
y <- read.csv("y.csv")
ylarge <- read.csv("ylarge.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col = rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")
The Shapiro-Wilk test used above checks whether a data set is plausibly normally distributed, and normality is the assumption underlying the outlier detection hypothesis tests that follow. In this case, with p-values of 0.365 and 0.399 respectively and sample sizes of 30 and 1000, both samples y and ylarge appear to be normally distributed.
The graphical analysis tells us that there could possibly be outliers in ylarge, the larger of the two data sets. The normal probability plots also indicate that both data sets (as different in sample size as they are) can be modeled using normal distributions.
DIXON AND CHI SQUARED TESTS FOR OUTLIERS
The Dixon test and Chi-squared test for outliers are statistical hypothesis tests used to detect outliers in a given sample. Bear in mind, though, that this Chi-squared test for outliers is very different from the better-known Chi-squared test used for comparing multiple proportions. The Dixon test makes a normality assumption about the data, and is generally used for samples of 30 points or fewer. The Chi-squared test, on the other hand, makes an assumption about the variance, and is not sensitive to mild outliers if the variance isn't specified as an argument. Let's see how these tests can be used for outlier detection.
library(outliers)

#Dixon tests for outliers for y
dixon.test(y[,2], opposite = TRUE)
dixon.test(y[,2], opposite = FALSE)

#Dixon tests for outliers for ylarge
#note: dixon.test() only accepts samples of 3 to 30 points, so these calls fail for ylarge (n = 1000)
dixon.test(ylarge[,2], opposite = TRUE)
dixon.test(ylarge[,2], opposite = FALSE)

#Chi-squared tests for outliers for y
chisq.out.test(y[,2], variance = var(y[,2]), opposite = TRUE)
chisq.out.test(y[,2], variance = var(y[,2]), opposite = FALSE)

#Chi-squared tests for outliers for ylarge
chisq.out.test(ylarge[,2], variance = var(ylarge[,2]), opposite = TRUE)
chisq.out.test(ylarge[,2], variance = var(ylarge[,2]), opposite = FALSE)
In each of the Dixon and Chi-squared tests for outliers above, we've run the test twice, choosing the options TRUE and FALSE in turn for the argument opposite. This argument lets us choose between testing the lowest extreme value and the highest extreme value, since outliers can lie on both sides of the data set.
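As a minimal sketch of how opposite flips which tail is tested, consider a small made-up vector (the toy values below are hypothetical, not taken from y or ylarge):

library(outliers)

#hypothetical toy vector: -6 is farthest from the mean, 9 is the extreme on the other side
toy <- c(-6, 1, 2, 2, 3, 3, 4, 9)

#opposite = FALSE tests the most suspicious value (here the low extreme, -6)
dixon.test(toy, opposite = FALSE)

#opposite = TRUE tests the extreme on the opposite side (here the high extreme, 9)
dixon.test(toy, opposite = TRUE)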
Sample output is below, from one of the tests.
> #Dixon Tests for Outliers for y
> dixon.test(y[,2], opposite = TRUE)

        Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier
Looking at the p-values of these tests alone, we see the following results:
P-values for outlier tests:
Dixon test (y, upper): 0.114
Dixon test (y, lower): 0.3543
Dixon test not executed for ylarge (dixon.test() requires samples of 30 points or fewer, and ylarge has 1000)
Chi-sq test (y, upper): 0.1047
Chi-sq test (y, lower): 0.0715
Chi-sq test (ylarge, upper): 0.0012
Chi-sq test (ylarge, lower): 4e-04
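If you run the Dixon test on ylarge anyway, the outliers package stops with a sample size error. One way to keep a script running past that, sketched below (the tryCatch() guard is my addition, not part of the original analysis):

#dixon.test() errors on samples larger than 30 points; tryCatch() reports
#the failure instead of halting the rest of the script
tryCatch(dixon.test(ylarge[,2], opposite = FALSE),
         error = function(e) message("Dixon test skipped: ", conditionMessage(e)))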
The p-values here (taken at an indicative 5% significance level) imply that the extreme values in ylarge may be outliers. This may or may not be true, of course, since in inferential statistics we always state the chance of error. In this case, we can conclude that there is only a very small chance that the extreme values we see in ylarge are actually typical of that data set.
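As a follow-up, the outliers package's outlier() function returns the observation farthest from the mean, which makes it easy to pull out the flagged extremes. Below is a minimal sketch; dropping the flagged values is shown purely for illustration, since whether to remove, cap, or keep an outlier depends on the analysis at hand:

#outlier() returns the value with the largest absolute deviation from the mean;
#opposite = TRUE returns the extreme on the other side
extremes <- c(outlier(ylarge[,2]), outlier(ylarge[,2], opposite = TRUE))
extremes

#one possible management step: drop the flagged extremes before re-plotting or re-fitting
ylarge_trimmed <- ylarge[!(ylarge[,2] %in% extremes), ]
summary(ylarge_trimmed[,2])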