Part 1: How to build a text analytics solution in under 10 minutes


For a long time, I’ve been planning to write a post to clarify what’s possible in the Text Analytics space today, in 2018. Throughout my career, I’ve spoken with many people who are living through the pain of analyzing text and trying to find a solution.

Some try to reinvent the wheel by writing their own algorithms from scratch. Others believe that Google and IBM APIs are the saviors. Others still are stuck with technologies from the late ’90s that vendors pitch as “advanced Text Analytics”.

So, I decided to post a series of articles that dig deeper into the 5 most common Text Analytics approaches and examine their pros and cons.

Introducing word spotting: DIY in Excel or Python

Let’s start with word spotting. First off, it’s not a thing!

The academic Natural Language Processing community does not register such an approach, and rightly so. In fact, in the academic world, word spotting refers to handwriting recognition (spotting which word a person, a doctor perhaps, has written).

There is also keyword spotting, which focuses on speech processing.

But to my knowledge, word spotting is not used for any type of text analysis in academia.

Still, I’ve heard about it in meetings often enough to include it in this review. It’s loved by DIY analysts and Excel wizards and is a popular approach among many customer insights professionals.

The main idea behind word spotting is this: If a word appears in a piece of text, we can assume that the text is “about” that particular word. For example, if words like “price” or “cost” are mentioned in a review, this means that the review is about “Price”.

The beauty of the word spotting approach is its simplicity.

You can implement word spotting in an Excel spreadsheet in less than 10 minutes.

Or, you could write a script in Python or R. Here’s how.
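A minimal Python sketch of the idea might look like the following. The keyword lists and the `spot_themes` function are my own illustrative assumptions, not a fixed recipe: each theme gets a handful of trigger words, and a comment is tagged with every theme whose trigger words it contains.

```python
import re

# Hypothetical trigger words per theme, chosen for illustration only.
THEME_KEYWORDS = {
    "Billing": {"bill", "billing", "invoice"},
    "Pricing": {"price", "cost", "expensive"},
    "Ease of use": {"easy", "simple", "intuitive"},
}


def spot_themes(comment, theme_keywords=THEME_KEYWORDS):
    """Return every theme whose trigger words appear as whole words in the comment."""
    # Lowercase the comment and split it into words.
    words = set(re.findall(r"[a-z']+", comment.lower()))
    # A theme matches if at least one of its trigger words is present.
    return [theme for theme, keywords in theme_keywords.items() if words & keywords]


if __name__ == "__main__":
    for comment in ["The price is too high", "Easy to use but the cost adds up"]:
        print(comment, "->", spot_themes(comment))
```

Matching on whole words rather than substrings is a deliberate choice in this sketch, so that a trigger word like “bill” does not fire on “billion”.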

How to build a Text Analytics solution in 10 minutes

You can type in a formula, like this one, in Excel to categorize comments into “Billing”, “Pricing” and “Ease of use”:

Source: Thematic.

And voilà!

Here it is applied to a Net Promoter Score survey where column B contains open-ended answers to the question “Why did you give us this score?”:

Source: Thematic.

It probably took me less than 10 minutes to create this, and the result is so encouraging! But wait…

Everyone loves simplicity. But in this case, simplicity sucks

Various issues can easily crop up with this approach.

Here, I’ve annotated them for you.

Source: Thematic.

Here, only 3 out of 7 comments were categorized correctly. One comment tagged “Billing” is actually about “Price”, and three other comments missed additional themes. Would you bet your customer insights on something that’s at best 50% accurate?
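To see how these errors arise, here is a small Python sketch that compares word spotting output against the themes a human reviewer would assign. The comments, keyword lists and labels are all invented for illustration; they are not the survey data from the screenshot.

```python
import re

# Hypothetical trigger words per theme, for illustration only.
THEME_KEYWORDS = {
    "Billing": {"bill", "billing", "invoice"},
    "Pricing": {"price", "cost"},
    "Ease of use": {"easy", "simple"},
}


def spot_themes(comment):
    """Tag a comment with every theme whose trigger words it contains."""
    words = set(re.findall(r"[a-z']+", comment.lower()))
    return {theme for theme, keywords in THEME_KEYWORDS.items() if words & keywords}


# Made-up comments paired with the themes a human reviewer would assign.
LABELED = [
    ("The price is fair", {"Pricing"}),
    # Mentions "bill", but the complaint is really about price.
    ("My bill is way too high", {"Pricing"}),
    # "pricey" is not a trigger word, so the Pricing theme is missed.
    ("Simple to use, but it's getting pricey", {"Ease of use", "Pricing"}),
]


def accuracy(labeled):
    """Share of comments whose spotted themes exactly match the human labels."""
    correct = sum(spot_themes(comment) == expected for comment, expected in labeled)
    return correct / len(labeled)


if __name__ == "__main__":
    for comment, expected in LABELED:
        print(comment, "| spotted:", spot_themes(comment), "| expected:", expected)
    print("accuracy:", accuracy(LABELED))
```

In this toy set, only the first comment is tagged correctly. You could add “pricey” to the list, but every such patch is one more rule to maintain by hand.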

When word spotting is OK

You can imagine that the formula above can be tweaked further. And indeed, I’ve talked to companies that have hand-crafted massive custom spreadsheets and are very happy with the results.

If you have a dataset with a couple of hundred responses that you only need to analyze once or twice, you can use this approach. If the dataset is small, you can review the results and ensure high accuracy very quickly.

When word spotting fails

As for the downside? Please don’t use word spotting:

  • If you have any substantial amount of data, more than several hundred responses
  • If you won’t have time to review and correct the accuracy of each piece of text
  • If you need to visualize the results (Excel will hear you swearing)
  • If you need to share the results with your colleagues
  • If you need to maintain the data consistently over time

There are also many other disadvantages to DIY word spotting, which we’ll discuss in the next post. I’ll also talk about which approaches actually do work.

That’s it for now. Watch out for Part 2, coming soon!

This article was first published here.

Alyona Medelyan
I run Thematic, a SaaS company for analysing customer feedback. We tell companies how to drive change to Net Promoter Score, customer satisfaction and churn. Thematic uses proprietary world-class Text Analytics technology developed based on 15+ years of my research in NLP and Machine Learning.

