Predictive Coding

July 5, 2013 Kelley MacEwen

On a recently completed project, Summit employed a technique called “predictive coding.” Predictive coding starts with a set of text documents that have already been classified as relevant or not relevant to particular topics, and uses the information in those documents to train an algorithm that automatically determines the relevance of other, unclassified documents. It is one of many applications of the rapidly evolving field of text analytics, and it is especially effective when a dataset contains incomplete classification information.

For this particular project, Summit was provided an inquiry dataset with subject indicators and a notes field, which contained free-form text about the topic of each inquiry. A large number of these inquiries were missing the relevant subject indicators. Summit used a cleaning algorithm to standardize the natural language information found in the notes fields and to eliminate noise. After standardization, we identified the most common words found in the notes fields for each subject of interest, and assigned each note in the inquiry population a set of indicator variables signaling whether that note contained each common word. We then used these derived indicator variables, together with the assigned subject indicators, in a logistic regression, which allowed Summit to compute the probability of subject assignment for each inquiry. After out-of-sample tests confirmed that the models were not overfitting the training data, Summit used them to assign subjects to inquiries that were initially missing subject information.
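The steps described above can be illustrated with a small, self-contained Python sketch. It is not Summit's actual implementation: the toy notes, the single "billing" subject, the stopword list, the function names, and the use of plain gradient descent to fit the logistic regression are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real cleaning step would be more thorough.
STOPWORDS = {"a", "an", "the", "to", "of", "on", "in", "my", "about", "not"}

def clean(note):
    """Standardize a note: lowercase, keep letters only, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", note.lower()) if t not in STOPWORDS]

def top_words(notes, labels, k):
    """Most common words among notes already labeled with the subject."""
    counts = Counter()
    for note, label in zip(notes, labels):
        if label == 1:
            counts.update(sorted(set(clean(note))))  # count each word once per note
    return [word for word, _ in counts.most_common(k)]

def indicators(note, vocab):
    """One 0/1 indicator per common word."""
    tokens = set(clean(note))
    return [1.0 if word in tokens else 0.0 for word in vocab]

def predict_proba(w, x):
    """Logistic model; w[0] is the intercept."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xij in enumerate(xi):
                grad[j + 1] += err * xij
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

# Toy inquiries with a known subject indicator (1 = billing, 0 = other).
train = [
    ("Customer asked about a billing error on the invoice", 1),
    ("Invoice shows a duplicate billing charge", 1),
    ("Question about late payment fee on invoice", 1),
    ("Billing charge refund requested", 1),
    ("Password reset link not working", 0),
    ("Cannot log in to the account portal", 0),
    ("Requesting a change of mailing address", 0),
    ("Portal login page times out", 0),
]
# Labeled inquiries held back for the out-of-sample check.
holdout = [("Duplicate charge on my invoice", 1), ("Login password expired", 0)]
# Inquiries whose subject indicator is missing.
unlabeled = ["Invoice billing dispute", "Account portal locked"]

notes, labels = zip(*train)
vocab = top_words(notes, labels, k=3)
weights = fit_logistic([indicators(n, vocab) for n in notes], list(labels))

# Out-of-sample accuracy at a 0.5 probability cutoff.
correct = sum(
    (predict_proba(weights, indicators(n, vocab)) > 0.5) == bool(y)
    for n, y in holdout
)
accuracy = correct / len(holdout)

# Probability that each unlabeled inquiry belongs to the billing subject.
probs = [predict_proba(weights, indicators(n, vocab)) for n in unlabeled]
print(vocab, accuracy, [round(p, 2) for p in probs])
```

On real data, each subject of interest would get its own vocabulary and model, and the holdout set would need to be large enough for the out-of-sample check to be meaningful. A probability cutoff other than 0.5 may also be appropriate, depending on the cost of a wrong subject assignment.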

Summit is using predictive coding, as well as other types of text analysis, to harness information contained in natural language fields. If you have a dataset that has a substantial amount of incomplete information (such as a voluntary field that is not always filled out) along with relevant text fields, predictive coding may be a promising avenue to deal with this issue.
