Text Matching Techniques within the Field of Text Analytics

Text analytics is the field of analysis concerned with extracting useful information from strings of text. Text mining is becoming more relevant in the data-rich world of econometrics and consulting: as databases grow larger, eyeballing text information becomes increasingly unreliable and inefficient.

Summit recently examined typed-in names. For instance, one entry refers to the YMCA as “YMCA of the USA” and another as “NATIONAL YMCA”. In a small database, this might be easy to spot, but in a sizable population, it would be nearly impossible to recognize these names as the same entity without text matching.

In many cases, it is advisable to use multiple text matching techniques, as no single method will catch every matching text string. Two such methods are exact location matching and letter frequency matching. For many text matching methods, the first step is removing spacing, punctuation, and common words like “THE”, “AND”, or even “CORPORATION”. For exact location matching, the next step is to compare the condensed text strings position by position and use the number of matching letter positions to determine a similarity score. For letter frequency matching, the frequency of each letter in each text string is tallied, and these counts are then compared to determine a similarity score. Because letter frequency matching ignores position, it is useful for text strings where the key information is likely to appear in different positions, like a comment log. Summit applied these text matching methods to produce standardized and distinct groupings from many disparate spelling variations and entries of names.
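
The two methods described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Summit's actual implementation: the function names (condense, position_score, frequency_score), the stop-word list, and the scoring formulas (matches divided by the length of the longer string) are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the text names only THE, AND, and CORPORATION.
STOP_WORDS = {"THE", "AND", "CORPORATION"}


def condense(text):
    """Uppercase the text, drop punctuation and stop words, and remove spacing."""
    words = re.findall(r"[A-Z0-9]+", text.upper())
    return "".join(w for w in words if w not in STOP_WORDS)


def position_score(a, b):
    """Exact location matching: share of positions holding the same character.

    Assumed formula: matching positions divided by the longer condensed length.
    """
    a, b = condense(a), condense(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / longest


def frequency_score(a, b):
    """Letter frequency matching: share of letters in common, ignoring order.

    Assumed formula: overlap of the two letter tallies divided by the longer
    condensed length.
    """
    counts_a, counts_b = Counter(condense(a)), Counter(condense(b))
    total = max(sum(counts_a.values()), sum(counts_b.values()))
    if total == 0:
        return 0.0
    shared = sum((counts_a & counts_b).values())  # per-letter minimum count
    return shared / total


if __name__ == "__main__":
    print(position_score("YMCA of the USA", "NATIONAL YMCA"))
    print(frequency_score("YMCA of the USA", "NATIONAL YMCA"))
```

With these assumed formulas, the two YMCA entries share only one aligned letter position, so exact location matching scores them low, while letter frequency matching scores them at 0.5 because the letters of “YMCA” appear in both strings regardless of position. This is one way to see why combining methods can catch matches that a single method misses.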

There are plenty of other text analysis algorithms. To enhance our competency, Summit has held weekly meetings to discuss cutting-edge text analysis techniques and their implementation. With the demand for text analysis growing among our clients, we expect to deploy more and more of these techniques to meet their needs.