Tuesday, June 26, 2018

How Predictive Analytics Can Help NGOs

Predictive analytics can deliver the insights you need to help ensure your non-profit’s success. By uncovering patterns and trends hiding within your datasets, you can easily identify donors who may be willing to give more, or determine which programs offer the best return for your investment. And that means you can focus your limited resources — and efforts — on those areas where they’re likely to do the most good.

NGOs, also known as non-profits here in the U.S., fulfill a very important role as they seek to accomplish social good. They are in a unique position that allows them to see social need and react to it in ways that often have more impact than other organizations' efforts could. Vault is looking to apply the science and art of measurement and data analytics to help NGOs accomplish their various missions, and we believe that, applied correctly, analytics can make a huge difference in NGO effectiveness.

We’ve broken down the process of how to use analytics for NGOs into three categories, summarized below. We feel that it presents a systematic and practical approach to foster performance management and measurement in these organizations.

Measurement

The first hurdle that must be crossed is that of measurement, of taking the time and effort to measure work and progress and collect it in a database for further analysis and presentation. There are several reasons why it is important for an NGO to measure its efforts:
- Make sure time, effort, and money are being used where they need to be
- Gain ability to prove that you are accomplishing and fulfilling your social mission
- Gain ability to show that donor and sponsor funding is being used effectively
There are a few things to keep in mind when implementing a measurement strategy. First, it is important to measure not only the end goal, but also the incremental steps that lead up to that goal. Let's say your organization's mission is to decrease the number of diabetics within a specific demographic in your community. Measuring the percentage decrease in diabetes within this population over a given time period is great, but it doesn't tell the whole story. Ask yourself: what are the incremental steps leading up to the lowered diabetes rates? Perhaps one is the amount of exercise the average person in the demographic is getting on a daily basis. Perhaps another is the amount of sweets or fatty foods the average person is consuming per day. As you attack these issues that lead to diabetes, measure the improvement in these areas. Then people get the whole story of where your efforts have helped reduce each aspect of the larger problem, and you can find out which efforts are the most effective at addressing it.
Second, make sure to measure regression rates. Too often we stop the measurement once the problem is solved, once we have lowered the diabetes rate, in this case. But how many of those people, after we stopped working with them, have regressed into having diabetes? This is sometimes an alarmingly high number, and when regression rates are high, it means all the work we performed to lower the diabetes rate in the first place has gone to waste. If you see through measurement that the regression level is high, it's time to put some strategies and effort into keeping the solution in place, that is, not losing ground once you've attained it. It's often much easier to keep a problem solved than to go back and fix it again. This allows you to really fulfill your mission in a lasting sense. It wastes fewer resources because you retain the ground you've gained. And donors and sponsors will be excited by the fact that you can show that your solution is a long-lasting one.

Analytics

Once we have measurement strategies in place, we have lots of data on our hands. Analytics is the process by which we extract useful intelligence from this data. There are many methods of doing this, whether through visual analysis techniques, statistics, predictive models, or other approaches (specific ways of doing these types of analysis will be the topic of subsequent posts). Many people think that analytics is a task that is beyond their abilities, but often even simple analysis will yield enough intelligence to help you do your work smarter.
One of the most important things to remember in doing analysis is the principle of segmentation. This means looking at the data in smaller pieces, rather than in aggregate. For instance, if you want to know who your most effective workers are, break down the data to show the hours each worker put in and the changes in the incremental metrics we discussed above that occurred as a result of their work. Maybe you want to know which types of donors consistently give large sums to support your work: break them down by demographics, income, age, or any other variable to get a view of what your ideal donor looks like. Then you can target more of these kinds of people in your donation campaigns, as in the sketch below.
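As a concrete illustration, here is a minimal C# sketch of segmentation using LINQ. The Donor class and the sample values are invented for illustration; in practice the records would come from your own database.

using System;
using System.Collections.Generic;
using System.Linq;

class Donor
{
    public int Age;
    public decimal Income;
    public decimal TotalDonations;
}

class SegmentationExample
{
    static void Main()
    {
        // Hypothetical donor records; in practice these come from your donor database.
        var donors = new List<Donor>
        {
            new Donor { Age = 27, Income = 35000m, TotalDonations = 120m },
            new Donor { Age = 41, Income = 82000m, TotalDonations = 950m },
            new Donor { Age = 63, Income = 54000m, TotalDonations = 400m },
            new Donor { Age = 35, Income = 91000m, TotalDonations = 1200m },
        };

        // Segment donors into age bands and compare the average total giving per segment.
        var segments = donors
            .GroupBy(d => d.Age < 35 ? "Under 35" : d.Age < 55 ? "35-54" : "55+")
            .Select(g => new { Segment = g.Key, Count = g.Count(), AverageGiving = g.Average(d => d.TotalDonations) });

        foreach (var s in segments)
            Console.WriteLine($"{s.Segment}: {s.Count} donors, average total giving {s.AverageGiving}");
    }
}

The same GroupBy pattern works for any other segmentation variable (income band, region, acquisition channel, and so on).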

Presentation

Not to be forgotten is the element of presentation. Once you have the data and all the analysis, you need to be able to present the intelligence you’ve found to others in a way that they understand, and in a way that will cause a change in their behavior. The intelligence from the analytics is there so that you can be more effective in your work; however, if no one understands it, nothing will change and it will be useless. There are a few easy guidelines to follow in presenting analytical information so that it sticks:
– Relate the numbers to something people understand (Just saying the number 416 can be somewhat abstract, but if you say instead “the number of people that fit in a Boeing 747” the number becomes real and concrete)
– Only show the necessary elements of analysis to get your point across (many times you’ll have to go through a lot of analysis to get a few golden nuggets of intelligence, and our tendency is to want to show off all of the work we did to get there. The problem is, the process is not important to the people you are talking to. What’s important is the results and intelligence, so just focus on that.)
– Keep it simple (showing too many variables on a graph, or just plain too many graphs, causes more confusion than it does clarity)
– Relate the analysis back to what concerns your constituents (Your focus should always be on solving the problem, and the analysis is only important insofar as it helps you to do that. Focus on what solves the problem for the constituents)
Hopefully this small outline gives you a framework that you can use in thinking about how to implement analytics into your organization. In the coming posts we’ll be discussing more in depth how to do each of these three points.
The challenges of fundraising:
Around the world, donors are already crucial to the function and sustainability of nonprofit organizations. Individual donors now represent almost 99 percent of nonprofit funding in India [2], 80 to 95 percent of nonprofit funding in the US [4] and 53 percent in Europe [5]. But donor giving is undergoing dramatic changes. While the number of donors has remained steady, the average gift they give is much lower. As donors struggle with tightening their own household or corporate philanthropic budgets, they are more selective about which causes they will support. With fewer dollars available for giving, donors may seek a perfect fit with a cause before making a commitment. And when fewer donors are receptive to giving, the competition for charitable dollars increases.
In this environment, it is increasingly important for nonprofits to not only increase the number of donors, but increase the donation amount as well. In addition, they must find ways to reduce the high costs of donor processing and correspondence, and accelerate the turnaround time for funds availability. Most importantly, nonprofits must not invest their limited time, effort and expense soliciting potential donors who will likely never contribute. All of these challenges require a more effective strategy for the overall task of donor engagement.
Fundraising encompasses a wide range of capabilities within a nonprofit organization. These include:
• Tracking and managing a complex array of donors, members and volunteers
• Anticipating which donors are most likely to give
• Building loyal donor relationships for repeat giving
• Identifying which donors will provide the biggest returns
• Creating campaigns and donation request levels to appeal to different donor types
• Understanding which communication channels are most effective
• Knowing when donors should be solicited, and when they should not be, to avoid saturation
• Deploying limited resources more cost-effectively
Predictive fundraising analytics 
Forward-looking nonprofit organizations are now using predictive analytics to improve donor engagement and returns on fundraising efforts. Predictive analytics helps these organizations unlock hidden insights within their data so they can:
• Identify prospective donors
• Understand and anticipate donor needs, behaviors and preferences
• Know where to deploy donor resources for the biggest returns
• Predict which donors are most likely to donate, how much they will give, and when they would likely donate
• Determine the most effective messages and channels for solicitation (such as email, phone, direct mail or others)
• Optimize the frequency of donor contact to maximize contributions
• Anticipate when staff should provide additional attention to a specific donor
[Image: 360° donor experience: attract ideal donors, retain donors, increase contributions]

What is predictive analytics? 
Predictive analytics uncovers patterns, trends and associations hidden within all types of data to help predict future outcomes, solve problems and guide smarter decisions. Commercial businesses across many industries use predictive analytics to understand their customers and build stronger, more profitable relationships. These capabilities are also used by nonprofit organizations to gain similar benefits with their donors.
Predictive analytics uses advanced algorithms to analyze donor data and deliver a 360-degree view of individual donors. These analytic results provide detailed insight into the needs, preferences and behaviors of donors. Predictive models can be created which enable nonprofits to anticipate how donors will respond to certain campaigns, which contribution amounts they would be likely to give, when they should be solicited and when they should be left alone, which communication channels they prefer and much more. By deploying these insights to decision makers and frontline systems such as call centers or direct mail initiatives, nonprofits can significantly increase the effectiveness of donor campaigns and strategies.
And because predictive analytics learns from every donor interaction, it can also help to build more loyal relationships over time and provide an "early warning system" of donors that may be dissatisfied and require extra attention. Predictive analytics also helps nonprofits prioritize their resources based on anticipated returns and thereby reduce the costs of donor management. Organizations can determine which donor targets, messages and channels will yield the best results. The wasted effort and expense of low-yield donor processing and correspondence can be minimized.

Four steps for using predictive analytics for fundraising:
So exactly how can nonprofits use predictive analytics to carry out their donor management strategies? There are four basic steps that follow an analytical process: align donor data, predict what donors want, personalize donor interactions, and integrate what you learned back into the process to optimize your future predictions. 
Step 1—Align: Integrate donor data
The first step and the foundation of this process is to align your existing raw donor information. Donor data from all sources and systems across your organization, including spreadsheets, surveys, databases and social media, can be integrated within a single solution. With IBM predictive analytics solutions, it is not necessary to create a separate data warehouse to store this consolidated data. The predictive analytics software can access data from disparate sources and perform the required analysis on your desktop PC or a server.
This data does not have to be "perfect" before you move forward with your analysis. Because this is an ongoing process, you will have many opportunities to improve and refine your data with future iterations. Although these volumes of information are already available within many nonprofit organizations, they are often unused or not used to their full advantage. By accessing, organizing and analyzing this data, you can unlock valuable insights that you can put to good use. Some key data elements you should focus on in order to ensure success in the following steps of this process include (a small sketch of pulling these sources together follows the list):
• Demographic data such as age, income, occupation, family status, business and personal relationships
• Campaign data such as contact history, responses and donations, and results of test campaigns
• Opinion data captured from donor feedback in social media, emails and surveys that provides insight into donor needs and preferences
• Any other structured or unstructured (text) data regarding donor activity
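As a rough illustration of the alignment step, here is a C# sketch that joins two hypothetical CSV exports (donors.csv with demographic data and gifts.csv with campaign history) into one consolidated view per donor. The file names and column layouts are assumptions for illustration, not a prescribed format.

using System;
using System.IO;
using System.Linq;

class AlignDonorData
{
    static void Main()
    {
        // Assumed exports: donors.csv -> "DonorId,Age,Income"
        //                  gifts.csv  -> "DonorId,Campaign,Amount"
        var donors = File.ReadAllLines("donors.csv").Skip(1)
            .Select(line => line.Split(','))
            .Select(f => new { DonorId = f[0], Age = int.Parse(f[1]), Income = decimal.Parse(f[2]) });

        var gifts = File.ReadAllLines("gifts.csv").Skip(1)
            .Select(line => line.Split(','))
            .Select(f => new { DonorId = f[0], Campaign = f[1], Amount = decimal.Parse(f[2]) });

        // Join the two sources on the donor ID to build a single consolidated record per donor.
        var consolidated = donors.GroupJoin(
            gifts,
            d => d.DonorId,
            g => g.DonorId,
            (d, donorGifts) => new
            {
                d.DonorId,
                d.Age,
                d.Income,
                GiftCount = donorGifts.Count(),
                TotalGiven = donorGifts.Sum(x => x.Amount)
            });

        foreach (var row in consolidated)
            Console.WriteLine($"{row.DonorId}: age {row.Age}, income {row.Income}, {row.GiftCount} gifts totalling {row.TotalGiven}");
    }
}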
Step 2—Anticipate: Predict what donors want
The information you consolidated can now be analyzed by predictive models that help you understand and anticipate what donors want and will do next. These models use predictive analytics to determine ideal donor segments, score the data and predict the likelihood of future events. For example, you could use predictive models to determine how likely it is for an individual donor to respond to a marketing campaign. Or predict the most effective actions that will build long term, profitable relationships with donors. Along with predictive modeling, another key capability of this step is decision optimization. Once your predictive models tell you how a donor will likely respond, decision optimization tells you how to use that information most effectively. For example, outreach managers would not only know which donors are likely to provide the biggest returns, but also which precise messages or campaigns to implement in order to maximize the success of every donor interaction.
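To make this step concrete, here is a rough sketch of a donor-response model built with the same early ML.NET pipeline components used in the ML.NET post further down this blog. The donor fields, the donor-history.txt file and the Yes/No response label are assumptions for illustration, not a prescribed solution.

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System;

// Hypothetical training row: age, income, gifts last year, and whether the donor responded ("Yes"/"No").
public class DonorData
{
    [Column("0")] public float Age;
    [Column("1")] public float Income;
    [Column("2")] public float GiftsLastYear;
    [Column("3")] [ColumnName("Label")] public string Label;
}

public class DonorPrediction
{
    [ColumnName("PredictedLabel")] public string PredictedLabels;
}

class AnticipateDonors
{
    static void Main()
    {
        // donor-history.txt is an assumed comma-separated export of past campaign results.
        var pipeline = new LearningPipeline();
        pipeline.Add(new TextLoader("donor-history.txt").CreateFrom<DonorData>(separator: ','));
        pipeline.Add(new Dictionarizer("Label"));
        pipeline.Add(new ColumnConcatenator("Features", "Age", "Income", "GiftsLastYear"));
        pipeline.Add(new StochasticDualCoordinateAscentClassifier());
        pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });

        var model = pipeline.Train<DonorData, DonorPrediction>();

        // Score a prospective donor to decide whether to include them in the next campaign.
        var prediction = model.Predict(new DonorData { Age = 42, Income = 65000f, GiftsLastYear = 2 });
        Console.WriteLine($"Predicted response: {prediction.PredictedLabels}");
    }
}

In practice you would hold back part of the historical campaign data to evaluate the model before letting its scores drive solicitation decisions.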
Step 3—Act: Personalize donor interactions
Now that you know the best actions to take, the next capability is to personalize interactions with donors by integrating those insights into your operational processes and systems. For example, you could integrate predicted donor responses into your direct marketing programs. Individual donors would receive direct marketing offers that appeal to them and your organization would not waste time or expense targeting donors that have no interest in a particular solicitation. You can also use predictive donor insights to guide the actions of your donor outreach representatives. With at-a-glance, aggregated donor information, your employees will know which donors may be dissatisfied and need a little extra care, and where to focus their retention efforts. In this way, personalizing donor interactions can help improve loyalty, boost response rates, reduce marketing costs and maximize contributions. And because analytics provides the capability to predict the likely amount of donor contributions, you can further customize solicitations to ensure that they will increase a donor's value and giving over time. The predictive intelligence you gain from this process can also be sent to upper management via dashboards and scorecards to guide their decisions and strategies.
Step 4—Optimize your predictions
Predictive analytics isn't a linear process. With each iteration you gain new insights from donor responses. That valuable source of information can now be integrated back into the analytical process to continually improve future performance. By adding more data sources to your analysis over time, and refining your existing sources, you can significantly enrich your donor view and sharpen the accuracy of your predictive models. And with direct analytical insight into the results of your donor initiatives, you can isolate KPPs, or key performance predictors, that will guide your efforts moving forward. This capability provides insightful data on what worked and what did not within your marketing or campaign initiatives. You are then able to anticipate what you can do next time to gain better results, reduce costs and improve overall efficiency.

Conclusion 

Nonprofit organizations need to improve their fundraising capabilities so they can become as efficient and effective as possible. Predictive analytics provides an effective way to understand and anticipate donor needs in order to increase the success of fundraising and marketing campaigns. Using this technology, nonprofits can gain a significant return on investment by increasing donor contributions, reducing costs and building stronger donor relationships over time.

Sunday, June 24, 2018

Measures of Spread

Introduction

A measure of spread, sometimes also called a measure of dispersion, is used to describe the variability in a sample or population. It is usually used in conjunction with a measure of central tendency, such as the mean or median, to provide an overall description of a set of data.

Why is it important to measure the spread of data?

There are many reasons why the measure of the spread of data values is important, but one of the main reasons regards its relationship with measures of central tendency. A measure of spread gives us an idea of how well the mean, for example, represents the data. If the spread of values in the data set is large, the mean is not as representative of the data as if the spread of data is small. This is because a large spread indicates that there are probably large differences between individual scores. Additionally, in research, it is often seen as positive if there is little variation in each data group, as it indicates that the groups are similar.
We will be looking at the range, quartiles and interquartile range, absolute deviation, variance and standard deviation.

Range

The range is the difference between the highest and lowest scores in a data set and is the simplest measure of spread. So we calculate range as:
Range = maximum value - minimum value
For example, let us consider the following data set:
23, 56, 45, 65, 59, 55, 62, 54, 85, 25
The maximum value is 85 and the minimum value is 23. This results in a range of 62, which is 85 minus 23. Whilst using the range as a measure of spread is limited, it does set the boundaries of the scores. This can be useful if you are measuring a variable that has either a critical low or high threshold (or both) that should not be crossed. The range will instantly inform you whether at least one value broke these critical thresholds. In addition, the range can be used to detect any errors when entering data. For example, if you have recorded the age of school children in your study and your range is 7 to 123 years old you know you have made a mistake!
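As a quick illustration, here is a minimal C# sketch that computes the range of the ten scores above:

using System;
using System.Linq;

class RangeExample
{
    static void Main()
    {
        int[] scores = { 23, 56, 45, 65, 59, 55, 62, 54, 85, 25 };

        // Range = maximum value - minimum value
        int range = scores.Max() - scores.Min();
        Console.WriteLine($"Range = {scores.Max()} - {scores.Min()} = {range}");   // 85 - 23 = 62
    }
}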

Quartiles and Interquartile Range

Quartiles tell us about the spread of a data set by breaking the data set into quarters, just like the median breaks it in half. For example, consider the marks of the 100 students below, which have been ordered from the lowest to the highest scores, with the quartile positions falling between the 25th/26th, 50th/51st and 75th/76th scores.
Order  Score    Order  Score    Order  Score    Order  Score    Order  Score
1st    35       21st   42       41st   53       61st   64       81st   74
2nd    37       22nd   42       42nd   53       62nd   64       82nd   74
3rd    37       23rd   44       43rd   54       63rd   65       83rd   74
4th    38       24th   44       44th   55       64th   66       84th   75
5th    39       25th   45       45th   55       65th   67       85th   75
6th    39       26th   45       46th   56       66th   67       86th   76
7th    39       27th   45       47th   57       67th   67       87th   77
8th    39       28th   45       48th   57       68th   67       88th   77
9th    39       29th   47       49th   58       69th   68       89th   79
10th   40       30th   48       50th   58       70th   69       90th   80
11th   40       31st   49       51st   59       71st   69       91st   81
12th   40       32nd   49       52nd   60       72nd   69       92nd   81
13th   40       33rd   49       53rd   61       73rd   70       93rd   81
14th   40       34th   49       54th   62       74th   70       94th   81
15th   40       35th   51       55th   62       75th   71       95th   81
16th   41       36th   51       56th   62       76th   71       96th   81
17th   41       37th   51       57th   63       77th   71       97th   83
18th   42       38th   51       58th   63       78th   72       98th   84
19th   42       39th   52       59th   64       79th   74       99th   84
20th   42       40th   52       60th   64       80th   74       100th  85

The first quartile (Q1) lies between the 25th and 26th student's marks, the second quartile (Q2) between the 50th and 51st student's marks, and the third quartile (Q3) between the 75th and 76th student's marks. Hence:
First quartile (Q1) = (45 + 45) ÷ 2 = 45
Second quartile (Q2) = (58 + 59) ÷ 2 = 58.5
Third quartile (Q3) = (71 + 71) ÷ 2 = 71
In the above example, we have an even number of scores (100 students, rather than an odd number, such as 99 students). This means that when we calculate the quartiles, we take the sum of the two scores around each quartile and then halve them (hence Q1 = (45 + 45) ÷ 2 = 45). However, if we had an odd number of scores (say, 99 students), we would only need to take one score for each quartile (that is, the 25th, 50th and 75th scores). You should recognize that the second quartile is also the median.

Quartiles are a useful measure of spread because they are much less affected by outliers or a skewed data set than the equivalent measures of mean and standard deviation. For this reason, quartiles are often reported along with the median as the best choice of measure of spread and central tendency, respectively, when dealing with skewed and/or data with outliers. A common way of expressing quartiles is as an interquartile range. The interquartile range describes the difference between the third quartile (Q3) and the first quartile (Q1), telling us about the range of the middle half of the scores in the distribution. Hence, for our 100 students:
Interquartile range = Q3 - Q1
= 71 - 45
= 26
However, it should be noted that in journals and other publications you will usually see the interquartile range reported as 45 to 71, rather than the calculated range.
A slight variation on this is the semi-interquartile range, which is half the interquartile range = ½ (Q3 - Q1). Hence, for our 100 students, this would be 26 ÷ 2 = 13.
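The quartile calculation described above (averaging the two scores either side of each quartile position when the number of scores is even) can be sketched in C#. Given the 100 sorted marks from the table, it reproduces Q1 = 45, Q2 = 58.5, Q3 = 71, an interquartile range of 26 and a semi-interquartile range of 13.

using System;

class QuartileExample
{
    // The 100 student marks from the table above, in ascending order.
    static readonly int[] Scores =
    {
        35,37,37,38,39,39,39,39,39,40,40,40,40,40,40,41,41,42,42,42,
        42,42,44,44,45,45,45,45,47,48,49,49,49,49,51,51,51,51,52,52,
        53,53,54,55,55,56,57,57,58,58,59,60,61,62,62,62,63,63,64,64,
        64,64,65,66,67,67,67,67,68,69,69,69,70,70,71,71,71,72,74,74,
        74,74,74,75,75,76,77,77,79,80,81,81,81,81,81,81,83,84,84,85
    };

    // With an even number of scores, average the two scores either side of the
    // quartile position (the 25th/26th, 50th/51st and 75th/76th of 100).
    static double Quartile(int[] sorted, int quarter)
    {
        int position = sorted.Length * quarter / 4;        // 25, 50 or 75 here
        return (sorted[position - 1] + sorted[position]) / 2.0;
    }

    static void Main()
    {
        double q1 = Quartile(Scores, 1);   // 45
        double q2 = Quartile(Scores, 2);   // 58.5 (also the median)
        double q3 = Quartile(Scores, 3);   // 71

        Console.WriteLine($"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}");
        Console.WriteLine($"Interquartile range = {q3 - q1}");            // 26
        Console.WriteLine($"Semi-interquartile range = {(q3 - q1) / 2}"); // 13
    }
}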


Absolute Deviation & Variance

Variation

Quartiles are useful, but they are also somewhat limited because they do not take into account every score in our group of data. To get a more representative idea of spread we need to take into account the actual values of each score in a data set. The absolute deviation, variance and standard deviation are such measures.
The absolute and mean absolute deviation show the amount of deviation (variation) that occurs around the mean score. To find the total variability in our group of data, we simply add up the deviation of each score from the mean. The average deviation of a score can then be calculated by dividing this total by the number of scores. How we calculate the deviation of a score from the mean depends on our choice of statistic, whether we use absolute deviation, variance or standard deviation.

Absolute Deviation and Mean Absolute Deviation

Perhaps the simplest way of calculating the deviation of a score from the mean is to take each score and subtract the mean score from it. For example, the mean score for the group of 100 students we used earlier was 58.75 out of 100. Therefore, if we took a student that scored 60 out of 100, the deviation of the score from the mean is 60 - 58.75 = 1.25. It is important to note that scores above the mean have positive deviations (as demonstrated above), whilst scores below the mean will have negative deviations.
To find out the total variability in our data set, we would perform this calculation for all of the 100 students' scores. However, the problem is that because we have both positive and negative deviations, when we add up all of these deviations, they cancel each other out, giving us a total deviation of zero. Since we are only interested in the deviations of the scores and not whether they are above or below the mean score, we can ignore the minus sign and take only the absolute value, giving us the absolute deviation. Adding up all of these absolute deviations and dividing them by the total number of scores then gives us the mean absolute deviation (see below). Therefore, for our 100 students the mean absolute deviation is 12.81, as shown below:
Mean absolute deviation = (Σ|x − x̄|) ÷ n = 1,281 ÷ 100 = 12.81

Variance

Another method for calculating the deviation of a group of scores from the mean, such as the 100 students we used earlier, is to use the variance. Unlike the absolute deviation, which uses the absolute value of the deviation in order to "rid itself" of the negative values, the variance achieves positive values by squaring each of the deviations instead. Adding up these squared deviations gives us the sum of squares, which we can then divide by the total number of scores in our group of data (in other words, 100 because there are 100 students) to find the variance (see below). Therefore, for our 100 students, the variance is 211.89, as shown below:
Variance = (Σ(x − x̄)²) ÷ n = 21,188.75 ÷ 100 ≈ 211.89
As a measure of variability, the variance is useful. If the scores in our group of data are spread out, the variance will be a large number. Conversely, if the scores are spread closely around the mean, the variance will be a smaller number. However, there are two potential problems with the variance. First, because the deviations of scores from the mean are 'squared', this gives more weight to extreme scores. If our data contains outliers (in other words, one or a small number of scores that are particularly far away from the mean and perhaps do not represent well our data as a whole), this can give undue weight to these scores. Secondly, the variance is not in the same units as the scores in our data set: variance is measured in the units squared. This means we cannot place it on our frequency distribution and cannot directly relate its value to the values in our data set. Therefore, the figure of 211.89, our variance, appears somewhat arbitrary. Calculating the standard deviation rather than the variance rectifies this problem. Nonetheless, analysing variance is extremely important in some statistical analyses, discussed in other statistical guides.
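Here is a small C# sketch of both calculations on the 100 marks from the quartile table, using n as the divisor as described above; it reproduces the mean of 58.75, the mean absolute deviation of 12.81 and the variance of roughly 211.89.

using System;
using System.Linq;

class SpreadExample
{
    // The 100 student marks from the quartile table, in ascending order.
    static readonly double[] Scores =
    {
        35,37,37,38,39,39,39,39,39,40,40,40,40,40,40,41,41,42,42,42,
        42,42,44,44,45,45,45,45,47,48,49,49,49,49,51,51,51,51,52,52,
        53,53,54,55,55,56,57,57,58,58,59,60,61,62,62,62,63,63,64,64,
        64,64,65,66,67,67,67,67,68,69,69,69,70,70,71,71,71,72,74,74,
        74,74,74,75,75,76,77,77,79,80,81,81,81,81,81,81,83,84,84,85
    };

    static void Main()
    {
        double mean = Scores.Average();                                   // 58.75

        // Mean absolute deviation: the average of |score - mean|.
        double mad = Scores.Average(x => Math.Abs(x - mean));             // 12.81

        // Variance: the average of the squared deviations (dividing by n, as above).
        double variance = Scores.Average(x => (x - mean) * (x - mean));   // ≈ 211.89

        Console.WriteLine($"Mean = {mean}, MAD = {mad:F2}, Variance = {variance:F2}");
    }
}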


Standard Deviation

Introduction

The standard deviation is a measure of the spread of scores within a set of data. Usually, we are interested in the standard deviation of a population. However, as we are often presented with data from a sample only, we can estimate the population standard deviation from a sample standard deviation. These two standard deviations - sample and population standard deviations - are calculated differently. In statistics, we are usually presented with having to calculate sample standard deviations, and so this is what this article will focus on, although the formula for a population standard deviation will also be shown.

When to use the sample or population standard deviation

We are normally interested in knowing the population standard deviation because our population contains all the values we are interested in. Therefore, you would normally calculate the population standard deviation if: (1) you have the entire population or (2) you have a sample of a larger population, but you are only interested in this sample and do not wish to generalize your findings to the population. However, in statistics, we are usually presented with a sample from which we wish to estimate (generalize to) a population, and the standard deviation is no exception to this. Therefore, if all you have is a sample, but you wish to make a statement about the population standard deviation from which the sample is drawn, you need to use the sample standard deviation. Confusion can often arise as to which standard deviation to use due to the name "sample" standard deviation incorrectly being interpreted as meaning the standard deviation of the sample itself and not the estimate of the population standard deviation based on the sample.

What type of data should you use when you calculate a standard deviation?

The standard deviation is used in conjunction with the mean to summarise continuous data, not categorical data. In addition, the standard deviation, like the mean, is normally only appropriate when the continuous data is not significantly skewed and does not have outliers.

Examples of when to use the sample or population standard deviation

Q. A teacher sets an exam for their pupils. The teacher wants to summarize the results the pupils attained as a mean and standard deviation. Which standard deviation should be used?
A. Population standard deviation. Why? Because the teacher is only interested in this class of pupils' scores and nobody else.
Q. A researcher has recruited males aged 45 to 65 years old for an exercise training study to investigate risk markers for heart disease (e.g., cholesterol). Which standard deviation would most likely be used?
A. Sample standard deviation. Although not explicitly stated, a researcher investigating health related issues will not simply be concerned with just the participants of their study; they will want to show how their sample results can be generalised to the whole population (in this case, males aged 45 to 65 years old). Hence, the use of the sample standard deviation.
Q. One of the questions on a national census asks for respondents' age. Which standard deviation would be used to describe the variation in all ages received from the census?
A. Population standard deviation. A national census is used to find out information about the nation's citizens. By definition, it includes the whole population. Therefore, a population standard deviation would be used.

What are the formulas for the standard deviation?

The sample standard deviation formula is:

s = √[ Σ(x − x̄)² ÷ (n − 1) ]

where,
s = sample standard deviation
Σ = sum of...
x̄ = sample mean
n = number of scores in the sample

The population standard deviation formula is:

σ = √[ Σ(x − μ)² ÷ n ]

where,
σ = population standard deviation
Σ = sum of...
μ = population mean
n = number of scores in the population
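As a small illustration of the difference between the two formulas, here is a C# sketch that computes both versions for an arbitrary set of scores; the example array is invented, and the only difference between the two functions is the n − 1 versus n divisor.

using System;
using System.Linq;

class StandardDeviationExample
{
    // Sample standard deviation: divide the sum of squared deviations by (n - 1).
    static double SampleStdDev(double[] x)
    {
        double mean = x.Average();
        double sumSquares = x.Sum(v => (v - mean) * (v - mean));
        return Math.Sqrt(sumSquares / (x.Length - 1));
    }

    // Population standard deviation: divide the sum of squared deviations by n.
    static double PopulationStdDev(double[] x)
    {
        double mean = x.Average();
        double sumSquares = x.Sum(v => (v - mean) * (v - mean));
        return Math.Sqrt(sumSquares / x.Length);
    }

    static void Main()
    {
        // Invented exam scores for illustration.
        double[] scores = { 65, 72, 58, 90, 81, 77, 69 };

        Console.WriteLine($"Sample standard deviation     = {SampleStdDev(scores):F2}");
        Console.WriteLine($"Population standard deviation = {PopulationStdDev(scores):F2}");
    }
}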

Thursday, June 14, 2018

K-Means Clustering in Machine Learning Quick Guide



This post explores clustering within the realm of machine learning and big data, looking specifically at the commonly used k-means algorithm.

K-Means Clustering in the Real World

Clustering as a general technique is something that humans do. But unlike decision trees, I don’t think anybody really uses k-means as a technique outside of the realm of data science. But let’s pretend for a second, that you really wanted to do just that. What would it look like?
Imagine it’s the end of the long summer vacation and your child is heading back off to college. The day before leaving, said child delivers a request: “Hey, I forgot to do my laundry; can you wash it so it’s clean for college”? You head down to the bedroom and are greeted (that’s not really the right word here, but you get the idea) with a pile of dirty clothes in the middle of the floor that’s nearly as tall as you. What do you do (apart from burning it all)?
[Image: Clustering and K-Means]
Of course, you know that some of these clothes are washed in hot water, some in cold, that some can be put in the dryer and some need air drying, and so on. 
To make this a clustering scenario, we have to assume that your child, who is going to be doing all the work here, has no clue about how to group things for laundry. Not a stretch in this scenario. Let’s treat this as a learning opportunity and put them to work.
Tell them that you want them to group all the clothes into three piles using a very specific approach. Ask them to pick three different items of clothing at random that will be the starting points for those three separate piles (clusters in fact). Then, go through that massive initial pile and look at each item of clothing in turn. Ask them to compare the attributes of “water temperature” and “drying temperature” and “color” with each of the three starting items and then place the new item into the best pile.
The definition of “best pile” is based not on the whole pile, but purely on that starting item (technically that starting item is known as a centroid, but we’ll get to that later). The items won’t be identical, so pick the best match. If you want to be really detailed about this, ask them to place each item closer to or farther away from the starting item, based on how similar they are (the more similar, the closer together).
Now you’ve got three piles, which means your child is ready to use the washing machine. Not so fast, we’re just getting started. This is a very iterative process, and the next step would be to determine the new centroids of each of those piles and repeat the process. Those new centers would be calculated and may not correspond to an actual item of clothing. And we don’t really know if three piles is the optimal number, so once your child has completed iteration with three piles, they should go and try four, five, and more.
Clustering in general, and perhaps this algorithm in particular, is not a good technique for sorting laundry. Let’s look at a more realistic example using K means clustering and start working with data, not dirty socks and laundry labels.

How Does K-Means Clustering Work in Machine Learning, Exactly?

I’ll reuse the same data table that we had for the decision trees example. But this time, instead of trying to predict customer churn, we’re going to use clustering to see what different customer segments we can find.
Customer ID    Age    Income    Gender    Churned
1008764        34     47,200    F         Yes
I’m initially going to work with just two columns or attributes: age and income. This will allow us to focus on the method without the complexity of the data. Having two attributes enables us to work with a two-dimensional plane, and easily plot data points.
Is the data normalized? Let's say that ages range up to 100 and incomes up to 200,000. We'll scale the age range 0..100 to 0..1, and similarly scale incomes 0..200,000 to 0..1. (Note that if your data has outliers, such as one person with an income of 800,000, there are techniques to deal with that; I just won't cover them here.)
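A minimal C# sketch of that min-max scaling, using the assumed upper bounds of 100 for age and 200,000 for income:

using System;

class NormalizationExample
{
    // Min-max scaling: map a value from [min, max] onto [0, 1].
    static double Scale(double value, double min, double max) =>
        (value - min) / (max - min);

    static void Main()
    {
        double age = 34;        // ages assumed to range 0..100
        double income = 47200;  // incomes assumed to range 0..200,000

        Console.WriteLine($"Scaled age    = {Scale(age, 0, 100)}");       // 0.34
        Console.WriteLine($"Scaled income = {Scale(income, 0, 200000)}"); // 0.236
    }
}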
The first thing we do is pick two centroids, which means that we’re going to start with K=2 (now you know what the K in k-means represents). These represent the proposed centers of the two clusters we are going to uncover. This is quite a simple choice: pick two rows at random and use those values. If you’re building a mental representation of this process in your head, then you’ve got a piece of paper with two axes, one labeled “Age” and the other labeled “Income.” That chart has two color coded points marked on it, representing those two centroids.
Now we can get to work. We start with the first row and determine the Euclidean distance (which is a fancy way of saying we’ll use Pythagoras’s theorem) between the point in question and both centroids. It will be closer to one of them. If you’re visualizing this in your head, you can just plot the new point on the chart, and color code it to match the closer centroid (diagram below). Repeat this process for every row that you want to work with. You’ll end up with a chart with all points plotted and color coded, and two clusters will be apparent.
[Image: Clustering and K-Means Centroid]
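Here is a rough C# sketch of that assignment step: compute the Euclidean distance from a point to each centroid and pick the closest. The points and centroids below are invented and assumed to be already scaled to 0..1.

using System;

class AssignmentExample
{
    // Euclidean distance between two (age, income) points, i.e. Pythagoras's theorem.
    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(Math.Pow(a[0] - b[0], 2) + Math.Pow(a[1] - b[1], 2));

    // Return the index of the centroid closest to the point.
    static int NearestCentroid(double[] point, double[][] centroids)
    {
        int best = 0;
        for (int k = 1; k < centroids.Length; k++)
            if (Distance(point, centroids[k]) < Distance(point, centroids[best]))
                best = k;
        return best;
    }

    static void Main()
    {
        // Two centroids picked at random from the (scaled) data.
        double[][] centroids = { new[] { 0.34, 0.24 }, new[] { 0.61, 0.55 } };

        double[] point = { 0.45, 0.30 };   // a new row: scaled age and income
        Console.WriteLine($"Point assigned to cluster {NearestCentroid(point, centroids)}");
    }
}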
Once all the points have been allocated to a cluster we have to go back and calculate the centers of both clusters (a centroid doesn’t have to be in the center of the cluster, and in the early iterations could be some distance away from the eventual center). Then go back and repeat the calculations for each and every point (including those initial two rows that were your starting centroids). Some of those points will actually be nearer to the other cluster and will effectively swap clusters. This, of course, will change the centers of your clusters, so when you’ve completed all the rows for a second time, just recalculate the new centroids and start again.
In the diagram below you can see two clusters with the eight-point star showing the location of the new centroids which will be the basis for the next iteration. The arrows show the “movement” of the centroid from the previous iteration.
[Image: Clustering and K-Means Clusters]
As with all iterative processes you need to figure out when to stop. Two obvious stopping points, both used in the sketch after this list, would be:
  • Stop after some number of iterations (10, for example)
  • Stop when the clusters are stable (when there is little or no movement of the centroids after each iteration, which would also mean few or no points “swapping clusters”)
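Putting the assignment step, the centroid update and those stopping checks together, a bare-bones k-means loop might look like the sketch below. The data points are invented and already scaled, and K and the iteration cap are chosen arbitrarily.

using System;
using System.Linq;

class KMeansSketch
{
    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    static void Main()
    {
        // Invented, already-scaled (age, income) points.
        double[][] points =
        {
            new[] { 0.20, 0.15 }, new[] { 0.25, 0.20 }, new[] { 0.30, 0.10 },
            new[] { 0.70, 0.80 }, new[] { 0.75, 0.70 }, new[] { 0.80, 0.85 },
        };

        int k = 2, maxIterations = 10;
        var random = new Random(42);

        // Start with K centroids picked at random from the data.
        double[][] centroids = points.OrderBy(_ => random.Next()).Take(k)
                                     .Select(p => (double[])p.Clone()).ToArray();
        int[] assignment = new int[points.Length];

        for (int iteration = 0; iteration < maxIterations; iteration++)
        {
            // Assignment step: attach each point to its nearest centroid.
            bool anySwapped = false;
            for (int i = 0; i < points.Length; i++)
            {
                int best = Enumerable.Range(0, k)
                    .OrderBy(c => Distance(points[i], centroids[c])).First();
                if (best != assignment[i]) { assignment[i] = best; anySwapped = true; }
            }

            // Stop when the clusters are stable (no point swapped clusters).
            if (iteration > 0 && !anySwapped) break;

            // Update step: move each centroid to the mean of the points assigned to it.
            for (int c = 0; c < k; c++)
            {
                var members = points.Where((p, i) => assignment[i] == c).ToArray();
                if (members.Length == 0) continue;   // leave an empty cluster's centroid where it is
                centroids[c] = new[] { members.Average(p => p[0]), members.Average(p => p[1]) };
            }
        }

        for (int c = 0; c < k; c++)
            Console.WriteLine($"Cluster {c} centroid: ({centroids[c][0]:F2}, {centroids[c][1]:F2})");
    }
}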
Is K=2 or two clusters optimal for this data set? Good question. At this stage you don’t know. But you can start finding out by repeating the whole process above with K=3 (i.e. start with 3 random centroids). And then go to four, and five and so on.

How to Optimize Your K-Means Clustering

Fortunately, this process doesn’t go on forever. There’s one more calculation to do: for each value of K, measure the average distance between all the data points in a cluster and the centroid for that cluster. As you add more clusters you will tend to get smaller clusters, more tightly grouped together. So that “average within-cluster distance to centroid” will decrease as K increases (as the number of clusters increases). Basically this metric gives you a concrete measure of how “good” a cluster you’ve got, with lower numbers meaning a more tightly grouped cluster with members that are more similar to each other.
If you increase K to equal the total number of data points you have (giving you K clusters, each with one member) then that distance will be zero but that’s not going to be useful. The best way to find the optimal value of K is to look at how that “average within-cluster distance to centroid” decreases as K increases and find the “elbow”. In this chart I’d say that the elbow is at K=3, though you might prefer to use K=4. There doesn’t look to be much point going with five or more clusters because the incremental improvement is minimal.
[Image: Clustering and K-Means Elbow Graph]
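The "average within-cluster distance to centroid" itself is easy to compute once you have assignments and centroids for a given K. The sketch below uses small hypothetical arrays; you would run it for K = 2, 3, 4, and so on, and plot the result against K to find the elbow.

using System;
using System.Linq;

class ElbowMetricExample
{
    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    // Average distance from each point to the centroid of the cluster it was assigned to.
    static double AverageWithinClusterDistance(double[][] points, int[] assignment, double[][] centroids) =>
        points.Select((p, i) => Distance(p, centroids[assignment[i]])).Average();

    static void Main()
    {
        // Hypothetical output of one k-means run with K = 2.
        double[][] points = { new[] { 0.20, 0.15 }, new[] { 0.25, 0.20 }, new[] { 0.75, 0.80 } };
        int[] assignment = { 0, 0, 1 };
        double[][] centroids = { new[] { 0.225, 0.175 }, new[] { 0.75, 0.80 } };

        Console.WriteLine($"Average within-cluster distance = {AverageWithinClusterDistance(points, assignment, centroids):F3}");
    }
}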
In this example I used just two attributes to build the clusters. This has the advantage of being easy to visualize on a chart. What would those clusters represent? In this simple example with just ages and incomes, it might not be very illuminating, but maybe I’d find that I have a segment of younger customers with relatively higher incomes (and presumably higher disposable incomes) that I could target with promotions for suitable luxury goods. In general, you’d probably get more value from segments defined with more attributes.

A Real-Life Example for K-Means Clustering

With three attributes things are a little harder to visualize in 2D. Below there’s a cluster diagram for cars using fuel economy (in miles per gallon), peak power (horsepower) and engine displacement (in cubic inches) for a selection of cars from the early 1980s.
[Image: K-Means Cluster and Horsepower]
Real use cases for k-means clustering might employ hundreds or thousands of attributes, and while those are hard to visualize on a single chart, they are easily computed. They are also likely to be much more useful. In the customer segmentation example starting about halfway down this machine learning article, somebody obviously used many more attributes than just age and income.

CRISP-DM model

CRISP-DM

The model splits a data mining project into six phases and allows for going back and forth between different stages. I'd personally add a few more backwards arrows, but it's generally fine. The CRISP-DM model applies equally well to a data science project.
CRISP-DM Process diagram by Kenneth Jensen (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

TYPICAL ACTIVITIES IN EACH PHASE

  • Business Understanding
    • Understanding the business goal
    • Situation assessment
    • Translating the business goal into a data mining objective
    • Development of a project plan
  • Data understanding
    • Considering data requirements
    • Initial data collection, exploration, and quality assessment
  • Data preparation
    • Selection of required data
    • Data acquisition
    • Data integration and formatting […]
    • Data cleaning
    • Data transformation and enrichment […]
  • Modeling
    • Selection of appropriate modeling technique
    • […] Splitting of the dataset into training and testing subsets for evaluation purposes
    • Development and examination of alternative modeling algorithms and parameter settings
    • Fine tuning of the model settings according to an initial assessment of the model’s performance
  • Model evaluation
    • Evaluation of the model in the context of the business success criteria
    • Model approval
  • Deployment
    • Create a report of findings
    • Planning and development of the deployment procedure
    • Deployment of the […] model
    • Distribution of the model results and integration in the organisation’s operational […] system
    • Development of a maintenance / update plan
    • Review of the project
    • Planning the next steps

CRISP-DM’S VALUE

The CRISP-DM process outlines the steps involved in performing data science activities from business need to deployment, and most importantly it indicates how iterative this process is and that you never get things perfectly right.

Wednesday, June 13, 2018

Get started with ML.NET in 10 minutes


  1. Install the .NET SDK

    To start building .NET apps you just need to download and install the .NET SDK (Software Development Kit).
  2. Create your app

    Open a new command prompt and run the following commands:
    dotnet new console -o myApp
    cd myApp
    The dotnet command creates a new application of type console for you. The -o parameter creates a directory named myApp where your app is stored, and populates it with the required files. The cd myApp command puts you into the newly created app directory.
  3. Install ML.NET package

    To use ML.NET, you need to install the Microsoft.ML package. In your command prompt, run the following command:
    dotnet add package Microsoft.ML --version 0.2.0
  4. Download the data set

    Your machine learning app will predict the type of iris flower (setosa, versicolor, or virginica) based on four features: petal length, petal width, sepal length, and sepal width.
    Open the UCI Machine Learning Repository: Iris Data Set, copy and paste the data into a text editor (e.g. Notepad), and save it as iris-data.txt in the myApp directory.
    When you paste the data it will look like the following. Each row represents a different sample of an iris flower. From left to right, the columns represent: sepal length, sepal width, petal length, petal width, and type of iris flower.
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    ...
    4.7,3.2,1.3,0.2,Iris-setosa

    Using Visual Studio?

    If you're following along in Visual Studio, you'll need to configure iris-data.txt to be copied to the output directory.
    In Visual Studio, open the properties window for iris-data.txt and set 'Copy To Output Directory' to 'Copy always'.
  5. Write some code

    Open Program.cs in any text editor and replace all of the code with the following:
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using Microsoft.ML.Runtime.Api;
    using Microsoft.ML.Trainers;
    using Microsoft.ML.Transforms;
    using System;

    namespace myApp
    {
        class Program
        {
            // STEP 1: Define your data structures

            // IrisData is used to provide training data, and as
            // input for prediction operations
            // - First 4 properties are inputs/features used to predict the label
            // - Label is what you are predicting, and is only set when training
            public class IrisData
            {
                [Column("0")]
                public float SepalLength;

                [Column("1")]
                public float SepalWidth;

                [Column("2")]
                public float PetalLength;

                [Column("3")]
                public float PetalWidth;

                [Column("4")]
                [ColumnName("Label")]
                public string Label;
            }

            // IrisPrediction is the result returned from prediction operations
            public class IrisPrediction
            {
                [ColumnName("PredictedLabel")]
                public string PredictedLabels;
            }

            static void Main(string[] args)
            {
                // STEP 2: Create a pipeline and load your data
                var pipeline = new LearningPipeline();

                // If working in Visual Studio, make sure the 'Copy to Output Directory'
                // property of iris-data.txt is set to 'Copy always'
                string dataPath = "iris-data.txt";
                pipeline.Add(new TextLoader(dataPath).CreateFrom<IrisData>(separator: ','));

                // STEP 3: Transform your data
                // Assign numeric values to text in the "Label" column, because only
                // numbers can be processed during model training
                pipeline.Add(new Dictionarizer("Label"));

                // Puts all features into a vector
                pipeline.Add(new ColumnConcatenator("Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"));

                // STEP 4: Add learner
                // Add a learning algorithm to the pipeline.
                // This is a classification scenario (What type of iris is this?)
                pipeline.Add(new StochasticDualCoordinateAscentClassifier());

                // Convert the Label back into original text (after converting to number in step 3)
                pipeline.Add(new PredictedLabelColumnOriginalValueConverter() { PredictedLabelColumn = "PredictedLabel" });

                // STEP 5: Train your model based on the data set
                var model = pipeline.Train<IrisData, IrisPrediction>();

                // STEP 6: Use your model to make a prediction
                // You can change these numbers to test different predictions
                var prediction = model.Predict(new IrisData()
                {
                    SepalLength = 3.3f,
                    SepalWidth = 1.6f,
                    PetalLength = 0.2f,
                    PetalWidth = 5.1f,
                });

                Console.WriteLine($"Predicted flower type is: {prediction.PredictedLabels}");
            }
        }
    }

  6. Run your app

    In your command prompt, run the following command:
    dotnet run
