Wednesday, May 30, 2018

Big Data Testing

Data science is all about trying to create a process that allows you to chart out new ways of thinking about problems that are novel, or trying to use the existing data in a creative atmosphere with a pragmatic approach.

Businesses are struggling to grapple with the phenomenal information explosion. Conventional database systems and business intelligence applications have given way to horizontal databases, columnar designs and cloud-enabled schemas powered by sharing techniques.

Particularly, the role of QA is very challenging in this context, as this is still in a nascent stage. Testing Big Data applications requires a specific mindset, skillset and deep understanding of the technologies, and pragmatic approaches to data science. Big Data from a tester’s perspective is an interesting aspect. Understanding the evolution of Big Data, What is Big Data meant for and Why Test Big Data Applications is fundamentally important.

Big Data Testing – Needs and Challenges

The following are some of the needs and challenges that make it imperative for Big Data applications to be tested thoroughly.

An in-depth understanding of the 4 Nouns of Big Data is a key to successful Big Data Testing.

Increasing need for Live integration of information: With multiple sources of information from different data, it has become imminent to facilitate live integration of information. This forces enterprises to have constantly clean and reliable data, which can only be ensured through end-to-end testing of the data sources and integrators.
Instant Data Collection and Deployment: Power of Predictive analytics and the ability to take Decisive Actions have pushed enterprises to adopt instant data collection solutions. These decisions bring in significant business impact by leveraging the insights from the minute patterns in large data sets. Add that to the CIO’s profile which demands deployment of instant solutions to stay in tune with changing dynamics of business. Unless the applications and data feeds are tested and certified for live deployment, these challenges cannot be met with the assurance that is essential for every critical operation.
Real-time scalability challenges: Big Data Applications are built to match the level of scalability and monumental data processing that is involved in a given scenario. Critical errors in the architectural elements governing the design of Big Data Applications can lead to catastrophic situations. Hardcore testing involving smarter data sampling and cataloging techniques coupled with high end performance testing capabilities are essential to meet the scalability problems that Big Data Applications pose.

Big Data Testing Needs and Challenges - Cigniti

Current data integration platforms which have been built for an older generation of data challenges, limit IT’s ability to support the business. In order to keep up, organizations are beginning to look at next-generation data integration techniques and platforms.

Ability to understand, analyze and create test sets that encompass multiple data sets, is vital to ensure comprehensive Big Data Testing.

#KMeans Clustering on Gun violence data set

#Please post comments in case of suggestions

#Dataset Available at the below URL:
https://github.com/shrishtripathi/Datasets/blob/master/gun-violence-data_01-2013_03-2018.zip

guns<-read.csv("C:/Users/E002891/Desktop/DayWiseTracker/Programming Concepts/Data Science/DataSets/Kaggle DataSets/gun-violence-data_01-2013_03-2018.csv", na.strings = c('',' ',' ','?','NA'))
nrow(guns)
set.seed(123)
guns<-guns[sample(1:nrow(guns),40000),]

#Preprocessing
summary(guns)
colnames(guns)
#Determining the columns to use
#The below cols are not required
#incident_id,date,address,incident_url,source_url,incident_url_fields_missing,incident_characteristics,location_description,
#notes,participant_age,participant_age_group,participant_name,participant_status,participant_type,sources,state_house_district,state_senate_district
for(i in colnames(guns[,c("incident_id","date","address","incident_url","source_url","incident_url_fields_missing","incident_characteristics","location_description","notes","participant_age","participant_age_group","participant_name","participant_status","participant_type","sources","state_house_district","state_senate_district")]))
{
print(which(colnames(guns)==i))
}

guns<-guns[,-c(1,2,5,8,9,10,14,16,19,
20,21,23,25,
26,27,28,29)]
head(guns)

#Also the below columns are not required
guns$participant_gender<-NULL
guns$participant_relationship<-NULL
guns$gun_stolen<-NULL
View(guns)
table(guns$gun_type)
guns$gun_type<-NULL

mapply(table, guns)
table(guns$city_or_county)
guns$longitude<-NULL
guns$latitude<-NULL
guns$city_or_county<-NULL

#Factorization and bucketing
str(guns)
table(guns$n_killed)
#guns$n_killed<-ifelse(guns$n_killed<2,"Less",ifelse(guns$n_killed<4,"Medium","More"))
table(guns$n_injured)
#guns$n_injured<-ifelse(guns$n_injured<4,"Less",ifelse(guns$n_injured<10,"Medium","More"))
table(guns$congressional_district)
#guns$congressional_district<-ifelse(guns$congressional_district<20,"Less",ifelse(guns$congressional_district<40,"Medium","More"))
table(guns$n_guns_involved)
#guns$n_guns_involved<-ifelse(guns$n_guns_involved<50,"Less",ifelse(guns$n_guns_involved<150,"Medium","More"))
#guns$n_killed<-as.factor(guns$n_killed)
#guns$n_injured<-as.factor(guns$n_injured)
#guns$congressional_district<-as.factor(guns$congressional_district)
#guns$n_guns_involved<-as.factor(guns$n_guns_involved)

#Imputation
library(DMwR)
mapply(anyNA, guns)
guns<-guns[sample(1:nrow(guns),10000),]
guns<-knnImputation(guns, k = 10)

#Start Clustering
withinByBetween<-c()
for(i in 2:15)
{
clusters<-kmeans(guns[,-c(1)], centers = i) #State not taken as its categorical
withinByBetween<-c(withinByBetween,mean(clusters$withinss)/clusters$betweenss)
}

#Error more cluster centers than distinct data points. (As catagorical data). Hence commented the factorization codes
plot(2:15,withinByBetween, type="l")

#No Of clusters=7
clusters<-kmeans(guns[,-c(1)], centers = 7)
guns$cluster<-clusters$cluster
View(guns)
clusters$centers

Monday, May 28, 2018

Which is the best tool for the job

One of the perennial points of debate in data science industry has been – “Which is the best tool for the job?“. Traditionally, this question was raised for SAS vs. R. Recently, there have been discussions on R vs. Python.

A few decades back, when R / SAS launched, it was difficult to envisage the possibilities future will offer. And this turned out to be a ‘blessing in disguise’. Because, it made easy for them to focus on one tool!

But today ? The situation is different. Even before deciding what technique they should apply, they fall into the pit of searching for the best tool to perform that particular task. And finally, they get nothing out of it.

The honest answer is that there is no universal winner in this contest. Each tool has its own strength and weakness. A prudent data scientist would diversify his / her repository of tools and use the one appropriate in each situation. In order to do this, it is critical to know the strengths and weakness of each tool, which is what this infographic offers.

Info-graphics:
Courtesy : AnalyticVidya

#Linear Regression Model on AirQuality dataset

#Your Comments are valuable for us. Please leave a comment if you think the model needs to be optimised more.

data(airquality)
airquality
summary(airquality)
head(airquality)
mapply(table, airquality)
table(airquality$Day)

#Preprocessing Steps
summary(airquality)
mapply(anyNA, airquality)
library(DMwR)
airquality<-knnImputation(airquality,k=5)

#Normalize the dataset

mapply(shapiro.test, airquality)

minMaxFunc<-function(x){
return((x-min(x))/(max(x)-min(x)))
}

airquality<-minMaxFunc(airquality)

str(airquality)

#Construct Model
rows<-1:nrow(airquality)
set.seed(123)
trainIndex<-sample(rows,round(0.8*length(rows)))
train<-airquality[trainIndex,]
test<-airquality[-trainIndex,]
nrow(train)/nrow(airquality)
nrow(test)/nrow(airquality)

model1<-lm(Ozone~.,data = train)
summary(model1)
# As month and day have no effect on model so removing them
train$Month<-NULL
train$Day<-NULL
test$Month<-NULL
test$Day<-NULL

model1<-lm(Ozone~.,data = train)
plot(model1)
abline(model1)
summary(model1)

preds<-predict(model1,test)
test$preds<-preds

#Calculating RMSE
rmse<-sqrt(mean((test$preds-test$Ozone)^2))

#Logistic regression Model on Titanic dataset

#Please post your comments in the comment box if you think we can optimise the model prediction

titanic<-read.csv("C:/Users/******/Desktop/DayWiseTracker/Programming Concepts/Data Science/DataSets/titanic.csv", na.strings = c(""," "," ","?","NA"))
summary(titanic)
colnames(titanic)
View(titanic)

#Cols To Use: Not applying as want to make a naive model first
#Imputation
table(titanic$pclass)
library(DMwR)
knnImputation(data=titanic, k = 5)
#Error in knnImputation(titanic, k = 5) : Not sufficient complete cases for computing neighbors.
#Dropping boat and body
sum(is.na(titanic$boat)) #823
nrow(titanic) #1309
#Out of 1309 rows 823 are NA. So KnnImputation will not work

sum(is.na(titanic$body)) #1188
#Out of 1309 rows 1188 are NA. So KnnImputation will not work

titanic$boat<-NULL
titanic$body<-NULL
data<-knnImputation(data=titanic, k = 5)

#Binning not req
#Convert numerics to factor
str(data)
data$name<-NULL #if we take name then it will be converted to many factors by one hot encoding
data$ticket<-NULL
data$cabin<-NULL
data$home_dest<-NULL

data$age<-ifelse(data$age<30,"young",ifelse(data$age<60,"middle","Aged"))
data$sibsp<-ifelse(data$sibsp<3,"Low",ifelse(data$sibsp<5,"Mid","High"))
table(data$embarked)
data$parch<-ifelse(data$parch<3,"Low",ifelse(data$parch<5,"Mid","High"))

#Scaling
hist((data$fare)^1/3)
install.packages("uroot",dependencies = TRUE)
library(forecast)
BoxCox(data$fare,BoxCox.lambda(data$fare)) #Not good. Go with log
data$fare<-log(data$fare)

#Outlier mgmt: Let's live with this. As we are going to make naive model

#Model Construct
rows<-1:nrow(data)
set.seed(123)
trainIndex<-sample(rows,round(0.8*length(rows)))
train<-data[trainIndex,]
test<-data[-trainIndex,]
nrow(train)/nrow(data)
nrow(test)/nrow(data)

str(data)

model1<-glm(formula = survived ~ .-fare, family = binomial(link = "logit"), data = train)
plot(model1)
summary(model1)

#Prediction
preds<-predict(model1,test,type = 'response')

test$preds<-preds
test$preds<-ifelse(test$preds>0.5,1,0)

#Construct Confusion matrix
table(test$preds,test$survived,dnn = c('preds','actuals'))
#Precision: how much is predicted truely and positively from the total predicted values (60 out of 22+60)
precision<-60/(22+60)
#Recall: how much is predicted truely and positively from the total actual values (60 out of 34+60)
recall<-60/(34+60)

#By caret in-built functions
library(caret)
precision1 <- posPredValue(as.factor(test$preds), as.factor(test$survived), positive="1")
sensitivity1 <- sensitivity(as.factor(test$preds), as.factor(test$survived), positive="1")

#Calculate ROCR
library(ROCR)
rocrPred<-prediction(test$preds,test$survived)
rocrPerf<-performance(rocrPred,'tpr','fpr')
plot(rocrPerf,colorize=TRUE,text.adj=c(-0.2,1.7))

#plot glm
library(ggplot2)
ggplot(test, aes(x=Rating, y=Recommended)) + geom_point() +
stat_smooth(method="glm", family="binomial", se=FALSE)

How to tackle common data cleaning issues in R

R is a language and environment that is easy to learn, very flexible in nature, and very focused on statistical computing, making it a great choice for manipulating, cleaning, summarizing, producing probability statistics, and so on.

In addition, here are a few more reasons to use R for data cleaning:

It is used by a large number of data scientists so it's not going away anytime soon
R is platform independent, so what you create will run almost anywhere
R has awesome help resources--just Google it, you'll see!

Editor’s Note: While the author has named the example data as ‘Gamming Data’, it is simply the gaming data that he uses to demonstrate his code.

Outliers

The simplest explanation for what outliers are might be is to say that outliers are those data points that just don't fit the rest of your data. Upon observance, any data that is either very high, very low, or just unusual (within the context of your project), is an outlier. As part of data cleansing, a data scientist would typically identify the outliers and then address the outliers using a generally accepted method:

Delete the outlier values or even the actual variable where the outliers exist
Transform the values or the variable itself
Let's look at a real-world example of using R to identify and then address data outliers.

In the world of gaming, slot machines (a gambling machine operated by inserting coins into a slot and pulling a handle which determines the payoff) are quite popular. Most slot machines today are electronic and therefore are programmed in such a way that all their activities are continuously tracked. In our example, investors in a casino want to use this data (as well as various supplementary data) to drive adjustments to their profitability strategy. In other words, what makes for a profitable slot machine? Is it the machine's theme or its type? Are newer machines more profitable than older or retro machines? What about the physical location of the machine? Are lower denomination machines more profitable? We try to find our answers using the outliers.

We are given a collection or pool of gaming data (formatted as a comma-delimited or CSV text file), which includes data points such as the location of the slot machine, its denomination, month, day, year, machine type, age of the machine, promotions, coupons, weather, and coin-in (which is the total amount inserted into the machine less pay-outs). The first step for us as a data scientist is to review (sometimes called profile) the data, where we'll determine if any outliers exist. The second step will be to address those outliers.

Step 1 – Profiling the data

R makes this step very simple. Although there are many ways to program a solution, let us try to keep the lines of the actual program code or script to a minimum. We can begin by defining our CSV file as a variable in our R session (named MyFile) and then reading our file into an R data.frame (named MyData):

MyFile <-"C:/GammingData/SlotsResults.csv" 
MyData <- read.csv(file=MyFile, header=TRUE, sep=",")

In statistics, a boxplot is a simple way to gain information regarding the shape, variability, and centre (or median) of a statistical dataset, so we'll use the boxplot with our data to see if we can identify both the median Coin-in and if there are any outliers. To do this, we can ask R to plot the Coin-in value for each slot machine in our file, using the boxplot function:

boxplot(MyData[11],main='Gamming Data Review', ylab = "Coin-in")

Step 2 – Addressing the outliers

Now that we see the outliers do exist within our data, we can address them so that they do not adversely affect our intended study. Firstly, we know that it is illogical to have a negative Coin-in value since machines cannot dispense more coins that have been inserted in them. Given this rule, we can simply drop any records from the file that have negative Coin-in values. Again, R makes it easy as we'll use the subset function to create a new version of our data.frame, one that only has records (or cases) with non-negative Coin-in values.

We'll call our subset data frame noNegs:

noNegs <- subset(MyData, MyData[11]>0)

Then, we'll replot to make sure we've dropped our negative outlier:

boxplot(noNegs[11],main='Gamming Data Review', ylab = "Coin-in")


We can use the same approach to drop our extreme positive Coin-in values (those greater than $1,500) by creating yet another subset and then replotting:

noOutliers <-subset(noNegs, noNegs[11]<1500)
boxplot(noOutliers[11],main='Gamming Data Review', ylab = "Coin-in")




It is well-advised, as you work through various iterations of your data, that you save off most (if not just the most significant) versions of your data. You can use the R function write.csv:

write.csv(noOutliers, file = "C:/GammingData/MyData_lessOutliers.csv")





Domain expertise

 
Moving on, another data cleaning technique is referred to as cleaning data based upon domain expertise. This doesn't need to be complicated. The point of this technique is simply using information not found in the data. For example, previously we excluded cases with negative Coin-in values since we know it is impossible to have a negative Coin-in amount. Another example might be the time when Hurricane Sandy hit the northeast United States. During that period of time, the cases of most machines had very low (if not zero) Coin-in amounts. A data scientist would probably remove all the data cases during a specific time period, based on that information.



Validity checking

 
Cross-validation is is a technique where a data scientist applies rules to data in a data pool.

Note

Validity checking is the most common form of statistical data cleansing and is a process that both the data developer and the data scientist will most likely be (at least somewhat) familiar with.



here can be any number of validity rules used to clean the data, and these rules will depend upon the intended purpose or objective of the data scientist. Examples of these rules include: data-typing (for example, a field must be a numeric), range limitations (where numbers or dates must fall within a certain range), required (a value cannot be empty or missing), uniqueness (a field, or a combination of fields, must be unique within the data pool), set-member (this is when values must be a member of a discreet list), foreign-key (certain values found within a case must be defined as member of or meeting a particular rule), regular expression patterning (which simply means verifying that a value is formatted in a prescribed format), and cross-field validation (where combinations of fields within a case must meet a certain criteria).

Let's look at a few examples of the preceding, starting with data-typing (also known as coercion). R offers six coercion functions to make it easy:




as.numeric
as.integer
as.character
as.logical
as.factor
as.ordered
as.Date

Learning Path on R

Learning Path on R – Step by Step Guide to Learn Data Science on R

One of the common problems people face in learning R is lack of a structured path. They don’t know, from where to start, how to proceed, which track to choose? Though, there is an overload of good free resources available on the Internet, this could be overwhelming as well as confusing at the same time.

To create this R learning path, Analytics Vidhya and DataCamp sat together and selected a comprehensive set of resources to help you learn R from scratch. This learning path is a great introduction for anyone new to data science or R, and if you are a more experienced R user you will be updated on some of the latest advancements.

This will help you learn R quickly and efficiently. Time to have fun while lea-R-ning!

Step 0: Warming up

Before starting your journey, the first question to answer is: Why use R? or How would R be useful?

R is a fast growing open source contestant to commercial software packages like SAS, STATA and SPSS. The demand for R skills in the job marketing is rising rapidly, and recently companies such as Microsoft pledged their commitment to R as a lingua franca of Data Science.

Watch this 90 seconds video from Revolution Analytics to get an idea of how useful R could be. Incidentally Revolution Analytics just got acquired by Microsoft.

Step 1: Setting up your machine

The easiest way to set-up R is by downloading a copy of it on your local computer from the Comprehensive R Archive Network (CRAN). You can choose between binaries for Linux, Mac and Windows.

Although you could consider working with the basic R console, we recommend you to install one of R’s integrated development environment (IDE). The most well known IDE is RStudio, which makes R coding much easier and faster as it allows you to type multiple lines of code, handle plots, install and maintain packages and navigate your programming environment much more productively. An alternative to RStudio is Architect, an eclipse-based workbench.

(Need a GUI? Check R-commander or Deducer)

Assignment

Install R, and RStudio
Install Packages Rcmdr, rattle, and Deducer. Install all suggested packages or dependencies including GUI.
Load these packages using library command and open these GUIs one by one.

Step 2: Learn the basics of R language

You should start by understanding the basics of the language, libraries and data structure.

If you prefer an online interactive learning environment to learn R’s syntax this free online R tutorial by DataCamp is a great way to get you going. Also check the successor to this course: intermediate R programming. An alternative learning tool is this online version of swirl where you can learn R in an environment similar to RStudio.

Next to these interactive learning environments, you can also choose to enroll in one of the Moocs available on Coursera or Edx.

In addition to these online resources, you can also consider the following excellent written resources:

The free introduction to R manual by CRAN
Jared Lander’s R for Everyone
Quick-R

Specifically learn: read.table, data frames, table, summary, describe, loading and installing packages, data visualization using plot command

Assignment

Take the free online R tutorial by DataCamp and become familiar with basic R syntax
Create a github account at http://github.com
Learn to troubleshoot package installation above by googling for help.
Install package swirl and learn R programming (see above)

Step 3: Understanding the R community

The major reason R is growing rapidly and is such a huge success, is because of its strong community. At the center of this is R’s package ecosystem. These packages can be downloaded from the Comprehensive R Archive Network, or from bioconductor, github and bitbucket. At Rdocumentation you can easily search packages from CRAN, github and bioconductor that will fit your needs for the task at hand.

Next to the package ecosystem R, you can also easily find help and feedback on your R endeavours. First of all there is R’s built-in help system which you can access via the command ? and the name of e.g. a function. There is also Analytics Vidhya Discussions, Stack Overflow where R is one of the fastests growing languages. To end, there are numerous blogs run by R enthusiast, a great collection of these is aggregated at R-bloggers.

Assignment

Understand the R package ecosystem by visiting Cran Task Views
Sign up at http://r-bloggers.com for the daily newsletter

Step 4: Importing and manipulating your data

Importing and manipulating your data are important steps in the data science workflow. R allows for the import of different data formats using specific packages that can make your job easier:

readr for importing flat files
The readxl package for getting excel files into R
The haven package lets you import SAS, STATA and SPSS data files into R.
Databases: connect via packages like RMySQL and RpostgreSQL, and access and manipulate via DBI
rvest for webscraping

Once your data is available in your working environment you are ready to start manipulating it using these packages:

The tidyr package for tidying your data.
The stringr package for string manipulation.
For data frame like objects learn the ins and outs of the dplyr package (try this course).
Need to perform heavy data wrangling tasks? Check out the data.table package
Performing time series analysis? Try out packages like like zoo, xts and quantmod.

Assignment

Master the packages mentioned for importing data via this “Importing Data Into R” course, or read these articles 1,2,3 and 4.
See this Data Wrangling with R video by RStudio
Read and practice how to work with packages like dplyr, tidyr, and data.table.

Step 5: Effective Data Visualization

There is no greater satisfaction than creating your own data visualizations. However, visualizing data is as much of an art as it is a skill. A great read on this is Edward Tufte principles for visualizing quantitative data, or the pitfalls on dashboard design by Stephen Few. Also check out the blog FlowingData by Nathan Yau for inspiration on creating visualization using (mainly) R.

5.1: Plots everywhere

R offers multiple ways for creating graphs. The standard way is by making use of base graphics in R. However, there are way better tools (or packages) to create your graphs in a more simple way that will look on top of that way more beautiful:

Start with learning the grammar of graphics, a practical way to do data visualizations in R.
Probably the most important package to master if you want to become serious about data visualization in R is the ggplot2 package. ggplot2 is so popular that there are tons of resources available on the web for learning purposes such as this online ggplot2 tutorial, a handy cheatsheet or this book by the creator of the package Hadley Wickham.
A package such as ggvis allows you create interactive web graphics using the grammar of graphics (see tutorial)
Know this ted talk by Hans Rosling? Learn how to re-create this yourself with googleVis (an interface with google charts).
In case you run into issues plotting your data this post might help as well.

See more visualization option in this CRAN task view.

Alternatively look at the data visualization guide to R.

5.2: Maps everywhere

Interested in visualizing data on spatial analysis? Take the tutorial on Introduction to visualising spatial data in Rand get started easily with these two packages:

Visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with ggmap.
Ari Lamstein’s choroplethr
The tmap package.

5.2: Maps everywhere

Interested in visualizing data on spatial analysis? Take the tutorial on Introduction to visualising spatial data in Rand get started easily with these two packages:

Visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with ggmap.
Ari Lamstein’s choroplethr
The tmap package.

5.3: HTML widgets

A very promising new tool for visualizations in R is the usage of HTML widgets. HTML widgets allow you to create interactive web visualizations in an easy way (see the tutorial by RStudio) and mastering this type of visualizations is very likely to become a must have R skill. Impress your friends and colleagues with these visualizations:

Dynamic maps with leaflet
Time-series data charting using dygraphs
Interactive tables (DataTables)
DiagrammeR for diagrams and flowcharts
D3 scatterplots, line charts, and histograms with MetricsGraphics

Assignment

Make sure you have understand the principles of the grammar of graphics.
Take the ggplot2 tutorial
Follow the html widgets tutorial by RStudio

Step 6: Data Mining and Machine Learning

For those that are new to statistics we recommend these resources:

Andrew Conway’s Introduction to statistics with R (online)
Data Analysis and Statistical Inference by Duke University (online)
Practical Data Science With R (book)
Data Science Specialization by Johns Hopkins (online)
A Survival Guide to Data Science with R (book)

If you want to sharpen your machine learning skills, consider starting with these tutorials:

Make sure to see the various machine learning options available in R in the relevant CRAN task view.

Assignment

Start off with one of the intro to statistics courses
Take this free machine learning course by kaggle.
If there is one book on data mining using R you want, it is on Rattle
You can learn on time series forecasting from this booklet – A Little Book for Time Series in R .

Step 7: Reporting Results

Communicating your results and sharing your insights with fellow data science enthusiast is equally important as the analysis itself. Luckily R has some very nifty tools to do this that can save you a lot of time.

The first is R Markdown , a great tool for reporting your data analysis in a reproducible manner based on knitr and pandoc. With R markdown, R generates a final document that replaces the R code with its results. This document can be in an html, word, pfd, ioslides, etc. format. You can learn more on it via this tutorial and use this cheat sheet a a reference.

Next to R Markdown there is also ReporteRs. ReporteRs is an R package for creating Microsoft (Word docx and Powerpoint pptx) and html documents and runs on Windows, Linux, Unix and Mac OS systems. Just like R Markdown it’s an ideal tool to automate reporting generation from R. See here how to get started.

Last but not least there is Shiny, one of the most exciting tools in R around at the moment. Shiny makes it incredibly easy to build interactive web applications with R. It allows you to turn your analysis into interactive web applications without needing to know HTML, CSS or Javascript. If you want to get started with Shiny (and believe us you should!), checkout the RStudio learning portal.

Assignment

Create your first interactive report using RMarkdown and/or ReporteRs
Try to build your very first Shiny app

Bonus Step: Practice

You will only become a great R programmer through practice. Therefore, make sure to tackle new data science challenges regularly. The best recommendation we can make to you here is to start competing with fellow data scientists on Kaggle: https://www.kaggle.com/c/titanic-gettingStarted.

Test your R Skills on live challenges – Practice Problems

Step 8: Time Series Analysis

R has a dedicated task view for Time Series. If you ever want to do something with time series analysis in R, this is definitely the place the start. You will soon see that the scope & depth of tools is tremendous.

You will not run out of online resources for learning time series analysis with R easily. Good starting points are A little book of R for time series or check out Forecasting: principles and practice. In terms of packages, you need to make sure that you are familiar with the zoo package and the xts. Zoo provides you a common used format for saving time series objects, while xts gives you the tools to manipulate your time series data sets.

Alternate resource: Comprehensive tutorial on Time Series

Assignment

Take one of the recommended time series tutorials listed above so you are ready to start your own analysis.
Use a package such as quantmod or quandl to download financial data and start your own time series analysis.
Use a package such as dygraphs to create stunning visualizations of your time series data and analysis.

Bonus Step – Text Mining is Important Too!

To learn text mining, you can refer to text mining module from analytics edge course. Though, the course is archived, you can still access the tutorials.

Practice

Text Mining Competiton – Complete Solution in R

Step 9: Becoming an R Master

Now that you have learnt most of data analytics using R , it is time to give some advanced topics a shot. There is a good chance that you already know many of these, but have a look at these tutorials too.

Advanced R by Hadley Wickham
Using R together with Hadoop, MongoDB or NoSQL
The RevoScaleR package by Microsoft (formerly Revolution Analytics)