Predictive analytics modeling is a statistical method that connects observable patterns to unobservable occurrences. For example, the IRS has access to tax filing data (observable patterns). Since rule-breakers usually attempt to hide their bad behavior, fraud can be difficult to detect. To address this problem, investigators can analyze data using special modeling techniques that detect filing patterns most closely associated with fraudulent behavior (unobservable occurrences). In this way, predictive analytics can help agencies identify these harmful patterns using already-available data.

#### CHOOSING A MODELING METHOD

In recent decades, modeling methods have proliferated. From these plentiful options, researchers choose their modeling method depending on (a) the type and availability of data and (b) what outcome they want to investigate. While a full exploration of modeling is beyond the scope of a single blog post, we will assume for the purpose of this conversation that historical outcome data is available. (Note: If an outcome variable is unavailable, we could use different modeling techniques instead.[1])

We can model an outcome by expressing it as a mathematical function of other variables that exhibit statistical patterns. In this context, the outcome is the dependent variable and the other variables are predictors. Modeling estimates the numerical impact of each predictor on the outcome, since each predictor is likely to exert a different influence. These numerical impacts are patterned after historical experience, as reflected in the available data.
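To make this concrete, here is a minimal Python sketch of the idea. The predictor names, weights, and intercept below are invented for illustration; in practice, the weights (the "numerical impacts") would be estimated from historical data.

```python
import math

# Hypothetical weights ("numerical impacts") for two made-up predictors;
# a real model would estimate these from historical outcome data.
INTERCEPT = -1.0
WEIGHTS = {"predictor_a": 0.8, "predictor_b": -0.5}

def predict_probability(record):
    # Combine the predictors into a linear score, then map the score
    # to a 0-1 probability with the logistic function.
    score = INTERCEPT + sum(WEIGHTS[name] * value
                            for name, value in record.items())
    return 1 / (1 + math.exp(-score))

p = predict_probability({"predictor_a": 2.0, "predictor_b": 1.0})
print(round(p, 3))  # 0.525
```

Each weight captures how strongly its predictor pushes the predicted outcome up or down, which is exactly the "different influence" each predictor exerts.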

#### EXPLORING A REAL-WORLD EXAMPLE: MORTGAGE DEFAULT PREDICTORS AND OUTCOMES

Let’s consider a concrete financial example: mortgage defaults. During the modeling process, banks investigate which residential mortgage applicants are likely to default on their future payment obligations. Banks are interested in historical default patterns, but every borrower is different. Some have better credit (e.g., a higher FICO score), some provide a larger down payment (e.g., a lower loan-to-value ratio), some choose fixed-rate loans while others choose adjustable-rate mortgages (i.e., loan product type), and some have higher income (e.g., a lower debt-to-income ratio). Banks routinely collect these data elements, so the information is already available in administrative datasets.

Banks express the outcome (that is, whether a borrower defaulted) as a mathematical formula of predictors including FICO score, loan-to-value ratio, loan product type, and debt-to-income ratio. The modeling process assigns each predictor a numerical impact, patterned after its influence on historical default experience. By applying these numerical impacts to individual mortgage applicants, banks can predict default probabilities based on applicants’ financial profiles.
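As a sketch of that last step, the snippet below applies numerical impacts to two applicant profiles. The coefficients are made up for demonstration (they are not real bank estimates); only the direction of each effect follows the intuition above.

```python
import math

# Illustrative, invented coefficients; a real model would estimate these
# from the bank's historical default data.
COEF = {
    "fico": -0.01,       # higher credit score lowers default risk
    "ltv": 0.03,         # higher loan-to-value ratio raises risk
    "dti": 0.05,         # higher debt-to-income ratio raises risk
    "adjustable": 0.40,  # adjustable-rate product adds risk in this sketch
}
INTERCEPT = 1.0

def default_probability(applicant):
    # Linear score from the applicant's profile, mapped to a probability.
    score = INTERCEPT + sum(COEF[k] * applicant[k] for k in COEF)
    return 1 / (1 + math.exp(-score))

strong = {"fico": 780, "ltv": 60, "dti": 20, "adjustable": 0}
weak = {"fico": 620, "ltv": 95, "dti": 45, "adjustable": 1}
print(default_probability(strong) < default_probability(weak))  # True
```

The stronger profile (higher FICO, lower leverage, fixed-rate loan) receives a lower predicted default probability, which is the kind of ranking banks use when screening applicants.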

#### MODELING CIVIL INFRACTIONS

Although the same modeling approach applies to many enforcement settings, modeling civil infractions is considerably more complex.

First, there is the issue of sample selection. Enforcement agencies often do not have a database that contains all historical civil infractions. Rather, they have a database of infractions from the cases they investigated, which is likely a subset of all infractions; the infractions that were never investigated are invisible to the agency. To the extent that investigations are not representative of the regulated community, observed infractions are unlikely to be representative of all infractions. Left uncorrected, a model trained on historical infractions may inherit systematic blind spots and miss critical patterns.[2]
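A small simulation can make this blind spot concrete. Everything below is assumed for illustration (the population size, the 20% infraction rate, and the 90%-versus-10% investigation rates); the point is only that when investigations are skewed, the observed infractions misrepresent the population.

```python
import random

random.seed(42)

# Hypothetical population: by construction, large and small entities commit
# infractions at the same 20% rate.
population = [{"large": random.random() < 0.5,
               "infraction": random.random() < 0.2}
              for _ in range(10_000)]

# The agency investigates 90% of large entities but only 10% of small ones.
investigated = [e for e in population
                if random.random() < (0.9 if e["large"] else 0.1)]

def infraction_count(entities, large):
    # Number of infractions among entities of the given size class.
    return sum(e["infraction"] for e in entities if e["large"] == large)

true_small = infraction_count(population, large=False)
seen_small = infraction_count(investigated, large=False)
true_large = infraction_count(population, large=True)
seen_large = infraction_count(investigated, large=True)

# The enforcement database "sees" most large-entity infractions but only a
# sliver of small-entity ones, even though the true rates are equal.
print(f"small: {seen_small}/{true_small} observed, "
      f"large: {seen_large}/{true_large} observed")
```

A model trained only on the investigated cases would wrongly conclude that small entities rarely offend, which is exactly the systematic blind spot described above.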

Second, the definitions of civil infractions can be unclear and impermanent. Often, the interpretations of rules and regulations change over time. An agency’s investigative focus can change as the agency’s priorities change, causing inconsistent outcomes. Because determining civil infractions is a human endeavor, it is susceptible to subjectivity. Cases with the same fact pattern could receive different verdicts. In other words, outcomes are imperfectly measured.
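The effect of this imperfect measurement can also be sketched in a few lines. The 15% mislabeling rate below is a made-up assumption standing in for shifting interpretations and subjective verdicts.

```python
import random

random.seed(0)

# Hypothetical illustration: a "true" outcome exists for each case, but the
# recorded verdict flips in 15% of cases due to subjectivity and changing
# interpretations of the rules.
true_outcomes = [random.random() < 0.3 for _ in range(5_000)]
verdicts = [(not y) if random.random() < 0.15 else y
            for y in true_outcomes]

noise_rate = sum(t != v for t, v in zip(true_outcomes, verdicts)) / len(verdicts)
print(f"share of mislabeled outcomes: {noise_rate:.2f}")
```

A model trained on the noisy verdicts is learning a blurred version of the true outcome, so its predicted patterns are weaker and less reliable than the underlying behavior would warrant.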

Curious about how predictive analytics can help your organization or agency identify fraudulent behavior? Do you want to learn how to identify patterns in administrative data? Subscribe to our blog for our next post on predictive analytics or talk with our team.

[1] Models trained with outcome variables are “supervised”; otherwise, they are “unsupervised.” See The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Hastie et al. (2008) for further elaboration.

[2] Professor Jonathan Feinstein studied this modeling challenge extensively. See Jonathan S. Feinstein, “Detection Controlled Estimation,” The Journal of Law & Economics, Vol. 33, No. 1 (Apr. 1990), pp. 233–276.