Saturday, July 28, 2018

Outlier Management - Hard and Soft Outliers Identification

Outlier Management 










GRAPHICAL APPROACHES TO OUTLIER DETECTION

Boxplots and histograms are useful to get an idea of the distribution that could be used to model the data, and could also provide insights into whether outliers exist or not in our data set.
y <-read.csv("y.csv")
ylarge <- read.csv("ylarge.csv")

#summarizing and plotting y
summary(y)
hist(y[,2], breaks = 20, col = rgb(0,0,1,0.5))
boxplot(y[,2], col = rgb(0,0,1,0.5), main = "Boxplot of y[,2]")
shapiro.test(y[,2])
qqnorm(y[,2], main = "Normal QQ Plot - y")
qqline(y[,2], col = "red")

#summarizing and plotting ylarge
summary(ylarge)
hist(ylarge[,2], breaks = 20, col = rgb(0,1,0,0.5))
boxplot(ylarge[,2], col =  rgb(0,1,0,0.5), main = "Boxplot of ylarge[,2]")
shapiro.test(ylarge[,2])
qqnorm(ylarge[,2], main = "Normal QQ Plot - ylarge")
qqline(ylarge[,2], col = "red")
The Shapiro-Wilk test used above is used to check for the normality of a data set. Normality assumptions underlie outlier detection hypothesis tests. In this case, with p-values of 0.365 and 0.399 respectively and sample sizes of 30 and 1000, both samples y and ylarge seem to be normally distributed.




Box plot of y (no real outliers observed as per graph)
Box plot of y (no real outliers observed as per graph)
Boxplot of ylarge - a few outlier points seem to be present in graph
Boxplot of ylarge – a few outlier points seem to be present in graph
Histogram of y
Histogram of y
Histogram of ylarge
Histogram of ylarge
Normal QQ Plot of Y
Normal QQ Plot of Y
Normal QQ Plot of ylarge
Normal QQ Plot of ylarge


The graphical analysis tells us that there could possibly be outliers in our data set ylarge, which is the larger data set out of the two. The normal probability plots also seem to indicate that these data sets (as different in sample size as they are) can be modeled using normal distributions.

DIXON AND CHI SQUARED TESTS FOR OUTLIERS

The Dixon test and Chi-squared tests for outliers (PDF) are statistical hypothesis tests used to detect outliers in given sample sets. Bear in mind though, that this Chi-squared test for outliers is very different from the better known Chi-square test used for comparing multiple proportions. The Dixon tests makes a normality assumption about the data, and is used generally for 30 points or less. The Chi-square test on the other hands makes variance assumptions, and is not sensitive to mild outliers if variance isn’t specified as an argument. Let’s see how these tests can be used for outliers detection.
library(outliers)
#Dixon Tests for Outliers for y
dixon.test(y[,2],opposite = TRUE)
dixon.test(y[,2],opposite = FALSE)

#Dixon Tests for Outliers for ylarge
dixon.test(ylarge[,2],opposite = TRUE)
dixon.test(ylarge[,2],opposite = FALSE)


#Chi-Sq Tests for Outliers for y
chisq.out.test(y[,2],variance = var(y[,2]),opposite = TRUE)
chisq.out.test(y[,2],variance = var(y[,2]),opposite = FALSE)

#Chi-Sq Tests for Outliers for ylarge
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = TRUE)
chisq.out.test(ylarge[,2],variance = var(ylarge[,2]),opposite = FALSE)
In each of the Dixon and Chi-Squared tests for outliers above, we’ve chosen both options TRUE and FALSE in turn, for the argument opposite. This argument helps us choose between whether we’re testing for the lowest extreme value, or the highest extreme value, since outliers can lie to both sides of the data set.
Sample output is below, from one of the tests.
> #Dixon Tests for Outliers for y
> dixon.test(y[,2],opposite = TRUE)

 Dixon test for outliers

data:  y[, 2]
Q = 0.0466, p-value = 0.114
alternative hypothesis: highest value 11.7079474800368 is an outlier


When you closely observe the p-values of these tests alone, you can see the following results:
P-values for outlier tests:

Dixon test (y, upper):  0.114 ; Dixon test (y, lower):  0.3543
Dixon test not executed for ylarge
Chi-sq test (y, upper):  0.1047 ; Chi-sq test (y, lower):  0.0715
Chi-sq test (ylarge, upper):  0.0012 ; Chi-sq test (ylarge, lower):  4e-04

The p-values here (taken with an indicative 5% significance) may imply that the possibility that the extreme values in ylarge are outliers. This may or may not be true, of course, since in inferential statistics, we always state the chance of error. And in this case, we can conclude that there is a very small chance that those extreme values we see ibare actually typical in that data set.

No comments:

Post a Comment

ChatGPT and Intelligent Document Processing!!

 ChatGPT and Intelligent Document Processing!! Question: How chatgpt can helpful in IDP? Answer: As an AI language model, ChatGPT can be he...