In this post of our Strength in Numbers series, we break down how data scientists work with unstructured data. This post was written by former Summit Senior Analyst Tim Book.
When you see the word “data,” you might imagine cascading rows of numbers in neat rectangular spreadsheets. However, data is just a synonym for information, and information is rarely so organized. We call any data not fitting neatly into this classic spreadsheet format unstructured. Examples of unstructured data include images, videos, sound, and plain text.
Cutting-edge machine learning algorithms are often used to analyze unstructured data. Self-driving cars use live video data to avoid obstacles and read street signs. Smartphones use sound data to communicate with Siri or transcribe speech-to-text messages. While these are flashy examples, to a data scientist, the most powerful type of unstructured data is text.
We spoke in a previous blog post about natural language processing (NLP), the branch of machine learning used to analyze plain text. NLP gives us the ability to analyze millions of tweets, Amazon reviews, and news articles in mere seconds. NLP has been used by marketers to analyze favorability towards their product, by research agencies to detect bias in news outlets, and by hobbyists to analyze Donald Trump’s Twitter account.
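To give a flavor of what NLP looks like in practice, here is a minimal sketch of one of its simplest building blocks: counting word frequencies in a piece of text. The sample review text below is made up for illustration; real analyses like sentiment or bias detection build on far richer models, but they start from tokenization steps like this.

```python
# Count word frequencies in a snippet of text using only the
# Python standard library. This is a toy example, not a full
# NLP pipeline.
import re
from collections import Counter

text = "The product is great. Great price, great service."

# Lowercase the text and pull out runs of letters as rough tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Tally how often each token appears.
counts = Counter(tokens)
print(counts.most_common(3))
```

Even this crude tally hints at how a marketer might spot that "great" dominates a batch of reviews; scaled up to millions of documents, the same idea underpins favorability analysis.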
This is great – but how do you get text data? Certainly no one is copying and pasting every single news article published on the web. Enter the most powerful tool in any data scientist’s tool belt: web scraping. Web scraping is the act of using code to gather text data from the internet. This is where we come upon a new variety of unstructured data: HTML. HTML is the language of the web – literal computer code that your web browser reads to render websites. If you’d like to see proof, right-click anywhere on this page and select “Inspect.” The jumble of code you’ve just been confronted with is everything that makes up the website you are on – including its contents. An experienced web scraper can write a short program that automatically collects text from millions of pages – no copying and pasting required.
Web scraping can be done using a variety of programming languages and tools. The right approach depends on the type of website you are scraping and the tools you are most comfortable with. Python’s BeautifulSoup library is particularly useful and popular for scraping data and is relatively fast to code and run. Java’s jsoup library is another popular choice, though there are many other suitable options available.
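As a rough sketch of what BeautifulSoup does, the example below parses a small hand-written HTML snippet and pulls out its paragraph text. The HTML string stands in for a page you would normally download first (for instance with a library like requests), and it assumes the third-party beautifulsoup4 package is installed.

```python
# Extract paragraph text from HTML using BeautifulSoup.
# The HTML here is an inline stand-in for a downloaded web page.
from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1>Example Article</h1>
    <p>First paragraph of the article.</p>
    <p>Second paragraph of the article.</p>
  </body>
</html>
"""

# Parse the raw HTML into a navigable tree of tags.
soup = BeautifulSoup(html, "html.parser")

# Collect the visible text inside every <p> tag.
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

Wrapped in a loop over a list of URLs, a few lines like these are the core of most scraping scripts: fetch a page, parse it, keep the text, discard the markup.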
Combining the tools mentioned in this post (and countless others) allows data scientists to break out of the land of spreadsheets and unlock the treasure trove of information held in unstructured data.
You can read all posts from the Strength in Numbers blog series here.