Text Analytics for Combining Information Sources

Posted by Randall Ronsberg on 7/18/13 10:32 AM

--------------------------------------------------------

Data has been captured and stored at an enormous rate since the middle of the last decade. Figure 1, produced by McKinsey Global Institute, provides a breakdown of the information captured by each industry in 2009. As more data becomes available, integrating it has become a frequent challenge for organizations. An organization’s data may be stored in separate databases, yet combining these different data sources is a powerful way to gain insight and enhance business decisions.

Unfortunately, not all data sources we would like to merge can be combined on a clean numeric key (such as a Student ID number or a Patient ID number). All is not lost if both data sources share a character string that identifies each record (such as a name or an address). The problem with names is that they can vary by data source: XYZ Corporation and XYZ Corp are likely the same entity, but a direct merge on the two names would fail because the strings are not identical. Character-string identifiers therefore require special treatment. A matching algorithm can be employed to reconcile the differences in character strings across data sources.
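To make the problem concrete, here is a minimal Python sketch of a direct merge on names. The two dictionaries and their values are hypothetical data invented for illustration, and `exact_merge` is a helper written for this example rather than a function from any library.

```python
# Hypothetical records from two sources, keyed by company name.
revenue = {"XYZ Corporation": 120, "Acme Inc": 45}
headcount = {"XYZ Corp": 300, "Acme Inc": 80}


def exact_merge(left, right):
    """Join two dicts on keys that are character-for-character identical."""
    return {key: (left[key], right[key]) for key in left if key in right}


merged = exact_merge(revenue, headcount)
unmatched = sorted(set(revenue) - set(merged))

print(merged)     # only "Acme Inc" joins
print(unmatched)  # "XYZ Corporation" is left without a partner
```

The merge silently drops XYZ Corporation even though a person reading both lists would pair it with XYZ Corp.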

[Figure 1: Company data storage]

Edit distance is one way to match string-based identifying information. Edit distance (also known as Levenshtein distance, named after Vladimir Levenshtein) computes a ‘distance’ between two character strings as the number of single-character edits necessary to build one string out of the other. Statistical software packages provide edit distance functions, such as levenshteinDist in R or SAS’s COMPGED function, that return a score for a pair of character strings; the lower the score, the more likely the pair is a match[1]. When dealing with thousands or millions of records, some human judgment is required to decide where the ‘cut-off’ point between matching and non-matching pairs of strings should be. Matching records with these text matching algorithms is often referred to as ‘Fuzzy Matching’, because text matching is unlikely to be a perfect solution. If the project has a tolerance for some error (mismatches), then integrating data with a text matching algorithm provides information that would otherwise be unattainable.
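The idea can be sketched in a few lines of Python. This is an illustrative implementation of plain Levenshtein distance, not the code behind the R or SAS functions named above. The helper names (`normalized_distance`, `fuzzy_match`), the lower-casing step, and the 0.5 cut-off are all assumptions chosen for the example; in practice the cut-off would be tuned by inspecting real pairs.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or keep
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length,
    after lower-casing and trimming (an assumed clean-up step)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def fuzzy_match(name, candidates, cutoff=0.5):
    """Return the closest candidate, or None if even the closest one
    is farther away than the (assumed) cut-off."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: normalized_distance(name, c))
    return best if normalized_distance(name, best) <= cutoff else None


print(levenshtein("XYZ Corporation", "XYZ Corp"))  # 7 deletions
print(fuzzy_match("XYZ Corporation", ["XYZ Corp", "Acme Inc"]))
```

Scaling by string length is one design choice among several: a raw distance of 7 means something very different for a 15-character name than for a 60-character address, so a relative score makes a single cut-off easier to reason about.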

At Summit we have applied text analytics solutions to our clients’ projects to improve the quality of the information available. Text analytics, and record matching based on character strings in particular, is likely to become more prevalent as the volume of ‘big data’ continues to grow rapidly. Using a tool such as edit distance is one way you can integrate your data and improve information quality for business decision making.

[1] Many programming languages can compute edit distance; Java, Python and C are additional examples.

Topics: data analytics, R
