Posted by Kelley MacEwen on 9/22/14 3:31 PM

--------------------------------------------------------

Last week we talked about the diverse applications of predictive analytics, from estimating annual government tax revenue to modeling the effects of a zombie apocalypse. This week, we’ll dive a little deeper into the first step in any predictive analytics process: understanding the problem at hand and collecting the appropriate data. Sampling is one technique for collecting data, and it can be adapted in a variety of ways to suit a given problem.

Summit has used sampling in many business areas, including auditing and program evaluation. Different sampling designs can be used to gain insight into a population of interest when it is impractical to observe each individual or entity in that population. For example, one can estimate the improper payment rate (i.e., the rate of overpayment or underpayment) by sampling from the entirety of payments made to another party and investigating each payment in the sample for appropriateness. The key statistics obtained from the sample, such as the share of sampled payments found to be improper or the average underpayment per payment transaction, are then extrapolated to the entire population to estimate the improper payment rate for the population as a whole.
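To make the extrapolation step concrete, here is a minimal sketch in Python. It is not Summit's actual methodology: the function names (`draw_sample`, `extrapolate`), the population of 1,000 payments, and the audited dollar errors are all made up for illustration, and the sketch assumes a simple random sample in which every payment is equally likely to be chosen.

```python
import random
from statistics import mean


def draw_sample(payment_ids, n, seed=None):
    """Draw a simple random sample of n payment IDs without replacement."""
    return random.Random(seed).sample(payment_ids, n)


def extrapolate(population_size, audited_errors):
    """Scale sample findings up to the whole payment population.

    audited_errors holds one signed dollar error per audited payment:
    positive = overpayment, negative = underpayment, 0 = proper payment.
    """
    n = len(audited_errors)
    improper_count = sum(1 for e in audited_errors if e != 0)
    rate = improper_count / n
    return {
        "improper_rate": rate,
        "estimated_improper_payments": rate * population_size,
        # Mean error per sampled payment, multiplied by the population size.
        "estimated_net_error": mean(audited_errors) * population_size,
    }


# Hypothetical population of 1,000 payments, identified by number.
payment_ids = list(range(1000))
to_audit = draw_sample(payment_ids, n=10, seed=1)

# Hypothetical audit findings for the ten sampled payments (made-up values).
audited_errors = [0.0, 0.0, 25.0, 0.0, -10.0, 0.0, 0.0, 0.0, 5.0, 0.0]

estimate = extrapolate(len(payment_ids), audited_errors)
print(estimate)
```

With these made-up findings, three of the ten audited payments are improper, so the estimated improper payment rate is 30 percent. A real audit would use a far larger sample and report a margin of error alongside the point estimate, which this sketch leaves out.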

Stratified sampling is a common statistical sampling technique that divides the population into subgroups (strata) based on some characteristic and samples from each one, which ensures that every subgroup is represented in the sample. An illustration is shown below:

Let’s say we wanted to sample four aliens from the population of aliens in the illustration. We suspect that aliens of different colors have different attributes, so we want to make sure that blue, red, and green aliens are all chosen as part of our sample. Since there are six red aliens in the population, a simple random sample of four could end up containing only red aliens. Instead, we can stratify by alien color so we can make sure we sample two red aliens out of six, one blue alien out of three, and one green alien out of three. Extrapolating to population statistics takes more care, because each color's results must be weighted by that color's share of the population, but it can still be calculated.
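The alien example can be sketched in a few lines of Python. This is an illustrative sketch rather than a definitive implementation: it assumes the population consists of exactly the twelve aliens counted above (six red, three blue, three green), and the helper names `stratified_sample` and `stratified_mean`, along with the "measured" values at the end, are hypothetical.

```python
import random
from collections import defaultdict
from math import comb

# Assumed population: six red, three blue, and three green aliens.
population = (
    [("red", i) for i in range(6)]
    + [("blue", i) for i in range(3)]
    + [("green", i) for i in range(3)]
)

# Chance that a simple random sample of four aliens is entirely red.
p_all_red = comb(6, 4) / comb(len(population), 4)


def stratified_sample(units, key, allocation, seed=None):
    """Sample a fixed number of units from each stratum, without replacement."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[key(unit)].append(unit)
    sample = []
    for name, k in allocation.items():
        sample.extend(rng.sample(strata[name], k))
    return sample


def stratified_mean(stratum_sizes, sampled_values):
    """Weight each stratum's sample mean by its share of the population."""
    total = sum(stratum_sizes.values())
    return sum(
        stratum_sizes[name] / total * (sum(values) / len(values))
        for name, values in sampled_values.items()
    )


sample = stratified_sample(
    population,
    key=lambda alien: alien[0],
    allocation={"red": 2, "blue": 1, "green": 1},
    seed=42,
)

# Made-up measurements for the sampled aliens, grouped by color.
sizes = {"red": 6, "blue": 3, "green": 3}
measured = {"red": [10, 14], "blue": [20], "green": [8]}
estimated_mean = stratified_mean(sizes, measured)

print(f"P(all four red) = {p_all_red:.3f}")
print(sample)
print(estimated_mean)
```

Under the twelve-alien assumption, an all-red simple random sample has a probability of 15 in 495, or about 3 percent. Stratifying removes that possibility entirely. Because the allocation here is proportional to each color's share of the population, the weighted estimate happens to equal the plain average of the four sampled values; with a non-proportional allocation the two would differ, which is where the weighting matters.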

In the next post, we’ll use this stratification technique, which is often used in large-scale samples stratified by geographic region, to answer the all-important question: which beer should I serve at my wedding? Stay tuned!

This post was written with the help of Sejla Karalic and Jon Carver.
