Thursday, June 20, 2019

AUC ROC Curve



In this post, I will go through the AUC ROC curve and explain how it evaluates your model’s performance. I highly suggest you go through the Confusion Matrix post before you go ahead.
All set? Let’s explore it! :D



AUC ROC is one of the most important evaluation metrics for any classification model’s performance.

What is ROC?

The ROC (Receiver Operating Characteristic) Curve tells us how well the model can distinguish between two classes (e.g. whether a patient has a disease or not). A better model can accurately distinguish between the two, whereas a poor model will have difficulty doing so.
Let’s assume we have a model which predicts whether a patient has a particular disease or not. The model predicts a probability for each patient (in Python we use the “predict_proba” function). Using these probabilities, we plot the distribution as shown below:



Here, the red distribution represents all the patients who do not have the disease and the green distribution represents all the patients who have the disease.
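Below is a minimal sketch of how such a plot could be produced. It assumes scikit-learn and matplotlib, and uses a synthetic toy dataset and a logistic regression model purely for illustration; any classifier with a “predict_proba” method would work the same way.

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split
  import matplotlib.pyplot as plt

  # Toy stand-in for the patient data; replace with your own dataset.
  X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

  model = LogisticRegression().fit(X_train, y_train)
  probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

  # Score distributions: red = patients without the disease, green = with the disease
  plt.hist(probs[y_test == 0], bins=30, alpha=0.5, color="red", label="No disease")
  plt.hist(probs[y_test == 1], bins=30, alpha=0.5, color="green", label="Disease")
  plt.xlabel("Predicted probability")
  plt.legend()
  plt.show()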
Now we need to pick a cut-off, i.e. a threshold value, above which we will predict everyone as positive (they have the disease) and below which we will predict everyone as negative (they do not have the disease). We will set the threshold at 0.5 as shown below:



All the positive values above the threshold will be “True Positives”, and the negative values above the threshold will be “False Positives”, as they are predicted incorrectly as positives.
All the negative values below the threshold will be “True Negatives”, and the positive values below the threshold will be “False Negatives”, as they are predicted incorrectly as negatives.



Here, we have a basic idea of how the model predicts correct and incorrect values with respect to the chosen threshold. Before we move on, let’s go through two important terms: Sensitivity and Specificity.

What are Sensitivity and Specificity?

In simple terms, the proportion of patients correctly identified as having the disease (i.e. True Positives) out of the total number of patients who actually have the disease is called Sensitivity or Recall: Sensitivity = TP / (TP + FN).



Similarly, the proportion of patients correctly identified as not having the disease (i.e. True Negatives) out of the total number of patients who do not have the disease is called Specificity: Specificity = TN / (TN + FP).
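As a quick sketch of how both quantities can be computed in Python (assuming scikit-learn; the labels and predictions below are made up for illustration):

  from sklearn.metrics import confusion_matrix

  # Hypothetical true labels and predictions at some threshold
  y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
  y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  sensitivity = tp / (tp + fn)  # True Positive Rate (Recall)
  specificity = tn / (tn + fp)  # True Negative Rate
  print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")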



Trade-off between Sensitivity and Specificity

When we decrease the threshold, we get more positive predictions, thus increasing the sensitivity while decreasing the specificity.
Similarly, when we increase the threshold, we get more negative predictions, thus increasing the specificity and decreasing the sensitivity.
As Sensitivity ⬇️ Specificity ⬆️
As Specificity ⬇️ Sensitivity ⬆️


Trade off between Sensitivity & Specificity

But this is not how we graph the ROC curve. To plot the ROC curve, instead of Specificity we use (1 — Specificity), and the graph will look something like this:



So now, when the sensitivity increases, (1 — specificity) will also increase. This curve is known as the ROC curve.
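In practice you rarely compute the curve by hand. A minimal sketch, reusing the hypothetical “y_test” and “probs” from the earlier example, could look like this (scikit-learn’s roc_curve returns the false positive rate and true positive rate at every threshold):

  from sklearn.metrics import roc_curve
  import matplotlib.pyplot as plt

  # Reusing y_test and probs from the earlier sketch
  fpr, tpr, thresholds = roc_curve(y_test, probs)

  plt.plot(fpr, tpr, label="ROC curve")
  plt.plot([0, 1], [0, 1], linestyle="--", label="Random guess")
  plt.xlabel("1 - Specificity (False Positive Rate)")
  plt.ylabel("Sensitivity (True Positive Rate)")
  plt.legend()
  plt.show()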



I know you might be wondering why we use (1 — specificity). Don’t worry, I’ll come back to it soon. :)

Area Under the Curve

The AUC is the area under the ROC curve. This score gives us a good idea of how well the model performs.
Let’s take a few examples:



As we see, the first model does quite a good job of distinguishing the positive and the negative values; therefore its AUC score is 0.9, as the area under the ROC curve is large.
Whereas, if we look at the last model, the predictions completely overlap each other and we get an AUC score of 0.5. This means that the model is performing poorly and its predictions are almost random.
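The AUC itself is a one-liner with scikit-learn, again reusing the hypothetical “y_test” and “probs” from the earlier sketch:

  from sklearn.metrics import roc_auc_score

  auc = roc_auc_score(y_test, probs)
  print(f"AUC: {auc:.2f}")  # close to 1.0 means good separation, around 0.5 is random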

Why do we use (1 — Specificity)?

Let’s derive what exactly (1 — Specificity) is. Since Specificity = TN / (TN + FP), we have 1 — Specificity = FP / (TN + FP): the proportion of actual negatives that are incorrectly predicted as positive.



As we see above, Specificity gives us the True Negative Rate and (1 — Specificity) gives us the False Positive Rate.
So Sensitivity can be called the “True Positive Rate” and (1 — Specificity) can be called the “False Positive Rate”.
So now we are just looking at the positives. As we increase the threshold, we decrease both the TPR and the FPR, and when we decrease the threshold, we increase both the TPR and the FPR.
Thus, AUC ROC indicates how well the probabilities from the positive classes are separated from the negative classes.

I hope I’ve given you some understanding of what exactly the AUC ROC curve is and how it evaluates your model’s performance.

Ensemble Techniques

Bagging and Boosting are similar in that they are both ensemble techniques, where a set of weak learners are combined to create a strong learner that obtains better performance than a single one.

ENSEMBLE LEARNING

Ensemble methods combine several decision tree classifiers to produce better predictive performance than a single decision tree classifier. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner, thus increasing the accuracy of the model. When we try to predict the target variable using any machine learning technique, the main causes of the difference between actual and predicted values are noise, variance, and bias. Ensembles help to reduce these factors (except noise, which is irreducible error).
Another way to think about ensemble learning is the fable of the blind men and the elephant. Each of the blind men had his own description of the elephant. Even though each description was partially true, it would have been better to come together and discuss their understanding before coming to a final conclusion. This story perfectly describes the ensemble learning method.
Using techniques like Bagging and Boosting helps to decrease the variance and increase the robustness of the model. Combining multiple classifiers decreases variance, especially in the case of unstable classifiers, and may produce a more reliable classification than a single classifier.
But before understanding Bagging and Boosting and how different classifiers are selected in the two algorithms, let’s first talk about Bootstrapping.

BOOTSTRAPPING

Bootstrapping refers to random sampling with replacement. Bootstrapping allows us to better understand the bias and the variance of the dataset. It involves randomly sampling small subsets of data from the dataset, with replacement, so every example in the dataset has an equal probability of being selected. This method can help us better estimate statistics such as the mean and standard deviation of the dataset.
Let’s assume we have a sample of ‘n’ values (x) and we’d like to get an estimate of the mean of the sample.
mean(x) = 1/n * sum(x)
We know that our sample is small and that our mean has error in it. We can improve the estimate of our mean using the bootstrap procedure:
  1. Create many (e.g. m) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).
  2. Calculate the mean of each sub-sample.
  3. Calculate the average of all of our collected means and use that as our estimated mean for the data.
For example, let’s say we used 3 resamples and got the mean values 2.5, 3.3 and 4.7. Taking the average of these we could take the estimated mean of the data to be 3.5.
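A small sketch of this procedure in Python (using NumPy and a made-up sample in place of real data):

  import numpy as np

  rng = np.random.default_rng(42)
  x = rng.normal(loc=3.5, scale=1.0, size=20)  # a small hypothetical sample

  # 1. draw many sub-samples with replacement, 2. take each mean, 3. average the means
  boot_means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(1000)]
  print("Bootstrap estimate of the mean:", np.mean(boot_means))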
Having understood Bootstrapping we will use this knowledge to understand Bagging and Boosting.

BAGGING

Bootstrap Aggregation (or Bagging for short), is a simple and very powerful ensemble method. Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.
  1. Suppose there are N observations and M features. A sample of observations is selected randomly with replacement (bootstrapping).
  2. A subset of features is selected to create a model with the sample of observations and subset of features.
  3. The feature from the subset that gives the best split on the training data is selected. (Visit my blog on Decision Trees to learn more about best splits.)
  4. This is repeated to create many models, and every model is trained in parallel.
  5. The prediction is given based on the aggregation of predictions from all the models.
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason, and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have high variance and low bias, which are important characteristics of sub-models when combining predictions using bagging. The only parameter when bagging decision trees is the number of samples and hence the number of trees to include. This can be chosen by increasing the number of trees run after run until the accuracy stops showing improvement.
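A minimal sketch of bagging with deep decision trees, assuming scikit-learn (in recent versions the parameter is called “estimator”; older versions call it “base_estimator”) and a synthetic dataset for illustration:

  from sklearn.ensemble import BaggingClassifier
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.datasets import make_classification
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=500, random_state=0)

  # Deep, unpruned trees (high variance, low bias) aggregated over bootstrap samples
  bagging = BaggingClassifier(
      estimator=DecisionTreeClassifier(),  # base learner
      n_estimators=100,                    # number of bootstrap samples / trees
      random_state=0,
  )
  print("Bagging accuracy:", cross_val_score(bagging, X, y, cv=5).mean())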

BOOSTING

Boosting refers to a group of algorithms that utilize weighted averages to turn weak learners into stronger learners. Unlike bagging, which has each model run independently and then aggregates the outputs at the end without preference to any model, boosting is all about “teamwork”. Each model that runs dictates which features the next model will focus on.
Box 1: You can see that we have assigned equal weights to each data point and applied a decision stump to classify them as + (plus) or — (minus). The decision stump (D1) has generated a vertical line on the left side to classify the data points. We see that this vertical line has incorrectly predicted three + (plus) as — (minus). In this case, we assign higher weights to these three + (plus) and apply another decision stump.
Box 2: Here, you can see that the size of the three incorrectly predicted + (plus) is bigger compared to the rest of the data points. In this case, the second decision stump (D2) will try to predict them correctly. Now, a vertical line (D2) on the right side of this box has classified the three mis-classified + (plus) correctly. But again, it has caused mis-classification errors, this time with three — (minus). Again, we assign higher weights to the three — (minus) and apply another decision stump.
Box 3: Here, the three — (minus) are given higher weights. A decision stump (D3) is applied to predict these mis-classified observations correctly. This time a horizontal line is generated to classify + (plus) and — (minus) based on the higher weights of the mis-classified observations.
Box 4: Here, we have combined D1, D2 and D3 to form a strong predictor with a more complex rule than any individual weak learner. You can see that this algorithm has classified these observations quite well compared to any of the individual weak learners.
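The boxes above describe the reweighting idea used by AdaBoost-style boosting. A minimal sketch with scikit-learn, using decision stumps as the weak learners on a synthetic dataset (same caveat as before: the parameter is “estimator” in recent versions, “base_estimator” in older ones):

  from sklearn.ensemble import AdaBoostClassifier
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.datasets import make_classification
  from sklearn.model_selection import cross_val_score

  X, y = make_classification(n_samples=500, random_state=0)

  # Decision stumps (max_depth=1) fitted sequentially, each reweighting the
  # points the previous stump got wrong, as in the boxes above
  boosting = AdaBoostClassifier(
      estimator=DecisionTreeClassifier(max_depth=1),
      n_estimators=50,
      random_state=0,
  )
  print("Boosting accuracy:", cross_val_score(boosting, X, y, cv=5).mean())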

Which is the best, Bagging or Boosting?

There’s not an outright winner; it depends on the data, the simulation and the circumstances.
Bagging and Boosting decrease the variance of your single estimate as they combine several estimates from different models. So the result may be a model with higher stability.
If the problem is that the single model gets a very low performance, Bagging will rarely get a better bias. However, Boosting could generate a combined model with lower errors as it optimises the advantages and reduces pitfalls of the single model.
By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best option. Boosting for its part doesn’t help to avoid over-fitting; in fact, this technique is faced with this problem itself. For this reason, Bagging is effective more often than Boosting.


Saturday, June 15, 2019

R Or Python for Machine Learning

After a few years of programming in both Python and R, I still struggle with this. Which language is the best language to use for Data Science? I like to think of myself as a technologist with a Statistics degree who is dabbling in Data Science. But even I can’t shy away from the easy-to-use aspects of R. The allure of the R programming language stayed with me even as I ventured into Pandas, NumPy and SciPy. Python’s robust packages in machine learning frankly amaze me. It’s literally a one-stop shop. At the same time, it takes me less than 30 minutes to run simple statistical analyses in R to explore my dataset. Machine Learning and Deep Learning packages are now becoming the norm in R as well.
For a while, like any data enthusiast, I started to test out both programming languages side by side for everyday tasks of data science. Using them across datasets large and small has really made a difference in how I look at each programming language.
Before we delve into the programming languages, I want you to understand that a Data Scientist or Data Analyst or a Data Enthusiast does not actually develop software.
He or she might inform the process of software development and help software developers with data logic that’s needed inside the software, but he or she does not actually develop software.
There’s a real difference between programming for software development and programming for data analysis.
Software Development requires extensive design of code for simplicity and efficiency. Object-Oriented languages tend to lend themselves to software development simply because the code is more scalable as the system grows.
Data Science programming requires the ability to do everything it takes to analyze the data. By everything, I mean using a multidisciplinary set of knowledge from almost every walk of life to figure out the true nature of the data. The code is used to solve an equation in the form of a dataset.

Functional Programming vs. OOP

Functional Programming Language is a language that:
  • treats computation as the evaluation of mathematical functions
  • avoids changing-state and mutable data
  • programming is done with expressions or declarations
  • functional code is idempotent: a function’s return value depends only on its arguments
Functional Programming is good when you have a fixed set of things. As your code evolves, you add new operations on existing things.
In contrast, in Object-Oriented Programming, data can have both mutable and immutable states. Programming is done with statements and expressions. Global programming states can affect the function’s resulting value.
OOP is good when you have a fixed set of operations on things. As your code evolves, you add new things.

Why does Data Science lend itself to Functional Programming?

Data Science’s objective is often to solve a problem. It is often functional in nature. Models themselves are essentially equations where the return values need to be the same. Even in deep learning, the data itself does not change. New values are added. But, the data stays the same. The immutable state is essential for the output to be consistent in the model. Functional Programming is all about chaining together functions to operate over a simple data structure. This design makes it easy to implement parallelism. In any machine learning or deep learning project, parallelism is essential when working with large sets of data.
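As a tiny illustration of that chaining style in Python (made-up numbers; map, filter and reduce are pure operations that never modify the original list):

  from functools import reduce

  values = [3, 1, 4, 1, 5, 9, 2, 6]

  # Chain pure functions over an immutable input: square, keep large values, sum
  squared = map(lambda v: v * v, values)
  large = filter(lambda v: v > 10, squared)
  total = reduce(lambda a, b: a + b, large)
  print(total)  # 16 + 25 + 81 + 36 = 158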

The Nature of R and Python

Python is an interpreted, high level, general-purpose programming language. You can do some functional programming in Python. But, Python is not a functional programming language. It does not meet the technical specifications of “purity” in the context of a functional programming language. There are a lot more OO use cases that Python caters to. Python is actually a good language to use for Object-Oriented Programming. You will find that because Python is versatile, it will often be used in Software Development. Even though it’s not a strictly functional programming language, it has robust packages for Data Science.
R is primarily a functional programming language. It contains many tools for the creation and manipulation of functions. You can do anything with functions that you can do with vectors. Anonymous functions give you the ability to use functions without giving them a name. This makes possible the chaining of functions that is useful in machine learning and deep learning. Almost all R objects are immutable. R environments, however, are mutable. R has robust visualization libraries such as ggplot2, plot, and lattice. Statisticians use R to visualize data. Often, a quick visualization of the data can provide insights that lead to further statistical analysis.

Which One is Better: R vs. Python?

In the real world, it’s often difficult to choose R or Python for all of your Data Science efforts.
At the end of the day, the purpose of the programming language is to allow for the simplest and the most efficient code to be used for the job at hand.
Personally, for my Data Science projects, I have taken to using both R and Python in conjunction with other languages for the different steps of the Data Science process.

Exploration of Unstructured Data

80% of the world’s data is actually unstructured data. Data such as text, video, and images are all unstructured data. Python has a multitude of packages such as NLTK, scikit-image, and PyPI for natural language processing, image processing, and voice analysis. Making sense of unstructured data often means that the data needs to be converted into structured data. Python is very useful for this conversion.

Data Cleaning: Structured Data or Semi-Structured Data

With large sets of data, Python is unbeatable in data cleaning. You can use packages such as Pandas and NumPy to easily clean up large sets of data. Frequently, I will also use Perl one-liners for specific data cleaning purposes.
The combination of the two often produces “clean data” in a short span of time. This way, most of my Data Science effort can be focused on Analysis.
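A small, hypothetical example of the kind of cleanup Pandas makes easy (column names and values are invented for illustration):

  import pandas as pd
  import numpy as np

  df = pd.DataFrame({
      "age": ["34", "29", None, "41"],
      "income": [52000, np.nan, 61000, 58000],
      "city": [" new york", "Boston ", "boston", "New York"],
  })

  df["age"] = pd.to_numeric(df["age"])                        # fix types
  df["income"] = df["income"].fillna(df["income"].median())   # impute missing values
  df["city"] = df["city"].str.strip().str.title()             # normalize text
  df = df.dropna(subset=["age"])                              # drop rows missing key fields
  print(df)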

Exploration and Modeling in R

Once you have structured or semi-structured data, it’s much easier to do data exploration in R. I can write clean code for a multitude of statistical analyses to get to know my data: ANOVA, multivariate correlations and regressions, factor analysis, and geostatistics. It’s also easy to use the visualization packages to visualize the data to help with my analysis. Logistic Regression and Time Series Analysis are both simple to implement in R with easy visualizations. Feature selection is easily done in R using the caret and fastcaret packages. Model selection is easily implemented. Machine Learning models such as LDA, CART, kNN, SVM and RF are all easily implemented in R. Each algorithm has its own package in R. Training the dataset and cross-validation take just a few lines of code. Even deep learning is now an easier endeavor in R, with the Keras library and TensorFlow.

Exploration and Modeling in Python

Data exploration and modeling are not limited to R. Python has packages such as NumPy, Matplotlib and Pandas that can help with the data exploration process. Seaborn is used for visualization much the same way as ggplot2 in R. SciPy provides all you need for traditional statistical analysis. SciKit-Learn provides machine learning algorithm implementations, cross-validation and more. Using Keras, TensorFlow and PyTorch, deep learning in Python is now also a much easier process. Machine Learning and Deep Learning often mean that you are working with large sets of data that sit in the cloud. Most likely the infrastructure aspect of the Data Science project will drive any Data Scientist to AWS, Azure, or Google Cloud. This means that Python will be the default language to use in such large-scale Data Science projects.
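A minimal sketch of that workflow with SciKit-Learn, comparing a few common classifiers with 5-fold cross-validation on a built-in toy dataset:

  from sklearn.datasets import load_iris
  from sklearn.model_selection import cross_val_score
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.svm import SVC
  from sklearn.ensemble import RandomForestClassifier

  X, y = load_iris(return_X_y=True)

  # Each model is scored with 5-fold cross-validation in a couple of lines
  for name, model in [("kNN", KNeighborsClassifier()),
                      ("SVM", SVC()),
                      ("Random Forest", RandomForestClassifier(random_state=0))]:
      print(name, cross_val_score(model, X, y, cv=5).mean())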

In conclusion, working with real-world data presents complex problems. These problems often can’t be solved with one programming language alone. Understanding the nature of R and Python can help any programmer, data scientist or data analyst choose the best programming language for the task at hand. The hybrid nature of tasks in Data Science means that there will always be a wrestling match between Python and R.
That is a good thing. The competing nature of the two languages might help us produce the simplest and the most efficient code for our purposes.
