Monday, May 28, 2018

How to tackle common data cleaning issues in R


R is a language and environment that is easy to learn, very flexible, and focused squarely on statistical computing, which makes it a great choice for manipulating, cleaning, and summarizing data, producing probability statistics, and so on.
In addition, here are a few more reasons to use R for data cleaning:
  • It is used by a large number of data scientists so it's not going away anytime soon
  • R is platform independent, so what you create will run almost anywhere
  • R has awesome help resources--just Google it, you'll see!
Editor’s Note: While the author has named the example data as ‘Gamming Data’, it is simply the gaming data that he uses to demonstrate his code.

Outliers


The simplest explanation of outliers is that they are the data points that just don't fit the rest of your data. Any value that is very high, very low, or simply unusual (within the context of your project) is an outlier. As part of data cleansing, a data scientist typically identifies the outliers and then addresses them using a generally accepted method:
  • Delete the outlier values or even the actual variable where the outliers exist
  • Transform the values or the variable itself
Let's look at a real-world example of using R to identify and then address data outliers.
In the world of gaming, slot machines (gambling machines operated by inserting coins into a slot and pulling a handle, which determines the payoff) are quite popular. Most slot machines today are electronic and are therefore programmed in such a way that all of their activity is continuously tracked. In our example, investors in a casino want to use this data (along with various supplementary data) to drive adjustments to their profitability strategy. In other words, what makes for a profitable slot machine? Is it the machine's theme or its type? Are newer machines more profitable than older or retro machines? What about the physical location of the machine? Are lower denomination machines more profitable? Before we can answer any of these questions, we need to deal with the outliers.
We are given a collection or pool of gaming data (formatted as a comma-delimited, or CSV, text file), which includes data points such as the location of the slot machine, its denomination, month, day, year, machine type, age of the machine, promotions, coupons, weather, and Coin-in (the total amount inserted into the machine less pay-outs). The first step for us as data scientists is to review (sometimes called profile) the data to determine whether any outliers exist. The second step will be to address those outliers.

Step 1 – Profiling the data


R makes this step very simple. Although there are many ways to program a solution, let us try to keep the lines of the actual program code or script to a minimum. We can begin by defining our CSV file as a variable in our R session (named MyFile) and then reading our file into an R data.frame (named MyData):
MyFile <-"C:/GammingData/SlotsResults.csv" 
MyData <- read.csv(file=MyFile, header=TRUE, sep=",")
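
Before plotting anything, it's worth taking a quick first look at what was loaded. Here is a minimal check (assuming, as the boxplot call below does, that column 11 of the file holds the Coin-in values):
str(MyData)          # data types and a short preview of every column
summary(MyData[11])  # five-number summary of the Coin-in column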

In statistics, a boxplot is a simple way to gain information about the shape, variability, and centre (or median) of a statistical dataset, so we'll use a boxplot with our data to see if we can identify both the median Coin-in and whether there are any outliers. To do this, we can ask R to plot the Coin-in value for each slot machine in our file, using the boxplot function:
boxplot(MyData[11],main='Gamming Data Review', ylab = "Coin-in")
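
The boxplot will draw any outlying points individually, beyond the whiskers. If you also want the actual values that the plot flags as outliers, the boxplot.stats function will return them (again assuming column 11 is Coin-in):
outlierValues <- boxplot.stats(MyData[[11]])$out   # the values plotted beyond the whiskers
length(outlierValues)                              # how many potential outliers were flagged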


Step 2 – Addressing the outliers


Now that we can see outliers do exist within our data, we can address them so that they do not adversely affect our intended study. First, we know that it is illogical to have a negative Coin-in value, since machines cannot dispense more coins than have been inserted into them. Given this rule, we can simply drop any records from the file that have negative Coin-in values. Again, R makes it easy: we'll use the subset function to create a new version of our data.frame, one that only has records (or cases) with positive Coin-in values.
We'll call our subset data frame noNegs:
noNegs <- subset(MyData, MyData[11]>0)

Then, we'll replot to make sure we've dropped our negative outlier:
boxplot(noNegs[11],main='Gamming Data Review', ylab = "Coin-in")
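
As a quick sanity check (not strictly necessary, but handy), you can compare row counts to confirm how many cases the filter removed:
nrow(MyData) - nrow(noNegs)   # number of cases dropped by the Coin-in filter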




We can use the same approach to drop our extreme positive Coin-in values (those greater than $1,500) by creating yet another subset and then replotting:
noOutliers <-subset(noNegs, noNegs[11]<1500)
boxplot(noOutliers[11],main='Gamming Data Review', ylab = "Coin-in")
It is well-advised, as you work through various iterations of your data, to save off the most significant versions along the way. You can use the R function write.csv:
write.csv(noOutliers, file = "C:/GammingData/MyData_lessOutliers.csv")
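
Deleting cases is not the only option. As mentioned earlier, another accepted approach is to transform the values; a minimal sketch of capping (rather than dropping) the extreme Coin-in values at $1,500 might look like this:
capped <- noNegs                            # start from the data with negatives removed
capped[[11]][capped[[11]] > 1500] <- 1500   # cap extreme Coin-in values at $1,500
boxplot(capped[11],main='Gamming Data Review', ylab = "Coin-in")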



Domain expertise

Moving on, another data cleaning technique is cleaning data based upon domain expertise. This doesn't need to be complicated; the point is simply to use information that is not found in the data itself. For example, we previously excluded cases with negative Coin-in values because we know it is impossible for a machine to pay out more than was inserted. Another example might be the period when Hurricane Sandy hit the northeastern United States: during that time, most machines recorded very low (if not zero) Coin-in amounts, so a data scientist would likely remove all of the cases from that time period based on that knowledge.
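As an illustration only (the Year and Month column names here are hypothetical and would need to match the actual file), such a rule could be applied with another subset call:
# Drop cases from the period around Hurricane Sandy (late October 2012)
noSandy <- subset(noOutliers, !(Year == 2012 & Month %in% c(10, 11)))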

Validity checking

Validity checking, sometimes referred to as cross-validation, is a technique where a data scientist applies rules to the data in a data pool.
Note
Validity checking is the most common form of statistical data cleansing and is a process that both the data developer and the data scientist will most likely be (at least somewhat) familiar with.
There can be any number of validity rules used to clean the data, and these rules will depend upon the intended purpose or objective of the data scientist. Examples of these rules include:
  • Data-typing (for example, a field must be numeric)
  • Range limitations (numbers or dates must fall within a certain range)
  • Required (a value cannot be empty or missing)
  • Uniqueness (a field, or a combination of fields, must be unique within the data pool)
  • Set membership (values must be a member of a discrete list)
  • Foreign-key (certain values found within a case must be defined as a member of, or meet, a particular rule)
  • Regular expression patterning (verifying that a value is formatted in a prescribed way)
  • Cross-field validation (combinations of fields within a case must meet certain criteria)
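To make a few of these rules concrete, here is a minimal sketch in R; the column names (Denomination, MachineType, MachineID) are hypothetical and would need to match the actual file:
# Range limitation: denominations must fall between $0.01 and $100
validData <- subset(MyData, Denomination >= 0.01 & Denomination <= 100)
# Required: a machine type cannot be empty or missing
validData <- subset(validData, !is.na(MachineType) & MachineType != "")
# Set membership: machine type must come from a discrete list of known types
validData <- subset(validData, MachineType %in% c("Video", "Reel", "Hybrid"))
# Uniqueness: flag duplicated machine/day combinations for review
dupes <- validData[duplicated(validData[c("MachineID", "Month", "Day", "Year")]), ]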
Let's look at a few examples of the preceding, starting with data-typing (also known as coercion). R offers several coercion functions to make this easy:
  • as.numeric
  • as.integer
  • as.character
  • as.logical
  • as.factor
  • as.ordered
  • as.Date
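A brief, hypothetical sketch of a few of these in use (once again, the column names are assumptions about the gaming file):
MyData$Age <- as.numeric(MyData$Age)                  # force machine age to be numeric
MyData$MachineType <- as.factor(MyData$MachineType)   # treat machine type as a category
# Combine the year, month, and day columns into a proper Date value
MyData$PlayDate <- as.Date(paste(MyData$Year, MyData$Month, MyData$Day, sep = "-"))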
