From tapeworm to rape: cautionary tales of chatbot tuning



Danbo character reading book Photo by Lisa Fotios from Pexels

Even the best and most advanced chatbots can find themselves out of their depth. For most, this results in an error message and a second attempt. For some, it leads to concerns, public discussions or even outrage.

Two prominent chatbots have recently blundered and sparked such upset. From tapeworm diagnoses to jokey rape references, these high-profile chatbots missed the mark and made news for all the wrong reasons.

Worryingly, these chatbots have teams constantly testing their success and tuning their responses. So, how could they get it so wrong?

Meet chatbot one

Babylon Health’s chatbot advises patients on the likely causes of their symptoms. This can help them decide if they need to make an appointment with a GP. Or, it might give them a prior indication of the diagnosis they could be about to face.

The chatbot algorithm was tested with 100 real-life scenarios. It allegedly passed an exam with a score of 81%, higher than the human average of 72%. And, for the most part, the chatbot has served its function well. It’s backed by the NHS and it generally makes sound suggestions to help concerned patients.

However, this doesn’t make the chatbot infallible. Much like a human doctor, it can make the occasional mistake, as seen in a few recent blunders.

The cautionary tale of the Babylon Health chatbot

A 66-year-old female heavy smoker recently sought advice after losing weight and coughing up blood. (Unfortunately, these are indicators of cancer or a serious chest infection.)

The chatbot advised the unwell Suffolk user that she ‘most likely’ had a tapeworm. In Britain, the odds of getting a tapeworm are slim, to say the least. It listed chest infections like pneumonia, bronchiectasis and tuberculosis, along with ‘something more serious’ (i.e. cancer), as being ‘less likely’.

This isn’t the only time that the health bot made a mistake like this. At the end of the same month, the chatbot advised a woman with a breast lump. Its diagnosis? Osteoporosis (a condition where bones become weaker with age). This sparked discussion over the potential danger of the health chatbot.

Why did it happen?

In the first instance, the issue with the chatbot’s diagnosis wasn’t that it was inaccurate. The symptoms described could indeed be indicative of a tapeworm. Rather, the fault came with the probabilities assigned to each possible diagnosis.

A possible cause is the loss of context — the bot forgot the patient was in Britain. (And so had a low chance of getting a tapeworm.)
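To make the point about probabilities concrete, here is a minimal sketch of how losing context can reorder a list of suggestions. It is not Babylon Health's algorithm, which isn't public in this article. The function name `rank_conditions` and every number below are hypothetical, invented purely for illustration.

```python
def rank_conditions(symptom_fit, prevalence):
    """Weight how well the symptoms fit each condition by how common
    that condition is locally, then rank from most to least likely."""
    scores = {c: symptom_fit[c] * prevalence.get(c, 0.0) for c in symptom_fit}
    total = sum(scores.values())
    if total == 0:
        raise ValueError("no condition has a non-zero score")
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(condition, score / total) for condition, score in ranked]


# Hypothetical figures, NOT real medical data.
symptom_fit = {"tapeworm": 0.6, "chest infection": 0.5, "lung cancer": 0.5}
uk_prevalence = {"tapeworm": 0.0001, "chest infection": 0.05, "lung cancer": 0.01}

# Context lost: every condition is treated as equally common.
no_context = {condition: 1.0 for condition in symptom_fit}

top_without_context = rank_conditions(symptom_fit, no_context)[0][0]
top_with_context = rank_conditions(symptom_fit, uk_prevalence)[0][0]
print(top_without_context)  # tapeworm
print(top_with_context)     # chest infection
```

With the location-based weighting in place, the same symptoms produce a very different top suggestion. That is the kind of shift a forgotten piece of context could plausibly cause.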

Admittedly, it’s unclear what prompted the bot to suggest a breast lump was the product of osteoporosis. The likely case is that some overlapping symptoms prompted the suggestion.

In both cases, however, the error lies with the bot: something in its algorithm brought it to the wrong conclusions.

So, what does this tale tell?

Something went wrong in these two cases out of many. But these faults, and the ensuing discussions, show more than a broken chatbot making an error.

The fact that the chatbot made these mistakes demonstrates the need for ongoing testing, updating and management of a chatbot’s algorithm. It shows that no chatbot, no matter how well used, how successful or how well backed it is, will get it right every time. There’s still the need for maintenance, tuning and tweaking of the algorithm.
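One way to picture ongoing testing is a set of saved scenarios that gets re-run after every change to the bot. The sketch below assumes a bot can be called as a plain function. `toy_bot`, `run_regression` and the scenarios themselves are hypothetical stand-ins, not anything the teams in this article are known to use.

```python
def run_regression(bot, scenarios):
    """Return (scenario name, answer) for every scenario where the
    bot's answer is not in the set of acceptable answers."""
    failures = []
    for scenario in scenarios:
        answer = bot(scenario["symptoms"])
        if answer not in scenario["acceptable"]:
            failures.append((scenario["name"], answer))
    return failures


def toy_bot(symptoms):
    # Hypothetical stand-in for a real chatbot, with a deliberate flaw.
    if "coughing up blood" in symptoms:
        return "tapeworm"
    return "see a GP"


# Hypothetical scenarios written for illustration only.
scenarios = [
    {
        "name": "smoker with weight loss",
        "symptoms": ["weight loss", "coughing up blood"],
        "acceptable": {"see a GP urgently"},
    },
    {
        "name": "mild cold",
        "symptoms": ["runny nose"],
        "acceptable": {"see a GP", "rest at home"},
    },
]

failures = run_regression(toy_bot, scenarios)
print(failures)  # [('smoker with weight loss', 'tapeworm')]
```

Each real-world blunder can be added to the scenario list, so the same mistake is caught if it ever resurfaces.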

The outrage also shows something about the journey of a chatbot. A human expert in any one field is likely to make an error from time to time, and so is a chatbot. But the chatbot is far less likely to get away with it. Mistakes from a chatbot lead to ridicule, concerns and criticism of the kind that makes it into national newspapers.

Meet chatbot two

Cleo is a financial assistant chatbot that integrates with Facebook Messenger. The idea is that Cleo helps people manage their money. The bot boasts a friendly, informal personality, with messages from the bot often containing emoji or funny GIFs.

Cleo can help you budget, dissect your spending habits and suggest how you can save more money. The AI side of Cleo is found in the calculations, predictions and recommendations. Meanwhile, the wording of the messages is pre-determined by humans.

Cleo’s cautionary chatbot faux pas

Cleo’s creators had a bright idea for Valentine’s Day this year: tough love. So, they added functionality to their chatbot called ‘savage mode’. The idea was that a user could opt in to receive some hard truths (and a fair amount of sass) regarding their finances. They did this by typing ‘Roast me Cleo’ into the chat interface.

Unfortunately, the wording chosen to introduce this new function was less than ideal, with a (perceived) implication of sexual violence. The message in question read: “I’m adding the option to get a bit savage with you”, and was followed up with, “Fully consensual, only when you want it.”

These connotations caused considerable upset when highlighted on Twitter.

What happened?

Babylon Health is an example of something going wrong with the algorithm of the chatbot. Cleo is a chatbot that suffered a flaw in human judgement.

The copywriting behind Cleo’s so-called ‘savage mode’ worked as intended much of the time. The message that caused the controversy also didn’t upset everyone. Some commenters took it at face value, questioning what made the message so inappropriate.

Regardless, the team behind the message were quick to respond. They removed the offending message, explained their reasoning and apologised for the offence.

What does this tale tell?

Babylon Health’s chatbot blunder demonstrated that even proficient chatbots need their algorithms to be subject to ongoing tuning. Cleo’s faux pas is different. It demonstrates the importance of tone of voice and UX testing, and highlights issues around chatbot personality.

What your chatbot says and how it says it is integral to the user experience. So, testing and tuning your chatbot’s personality is as important as maintaining the functionality of the bot. The chatbot might work perfectly, but that won’t count for much if the tone of voice is inconsistent, jarring or offensive.

And, even when it’s an AI chatbot (like Cleo), it isn’t going to have a moral compass. It won’t question what it is told to say or do. AI doesn’t understand language connotations in the same way that humans do. So, we need to carefully test, tune, question and update the things our chatbots say.
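As a rough sketch of what questioning the things our chatbots say might look like in practice, the snippet below routes scripted messages containing watch-listed words to a human reviewer before they go live. The watch list and the function name `flag_for_review` are hypothetical. A keyword match cannot judge connotation, so it only decides which messages a person should read twice.

```python
import re

# Hypothetical watch list; a real one would be agreed by the team.
SENSITIVE_TERMS = ["consensual", "savage"]


def flag_for_review(messages, terms=SENSITIVE_TERMS):
    """Return (message, matched terms) for each message that contains
    a watch-listed word, ignoring case."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    flagged = []
    for message in messages:
        hits = sorted({hit.lower() for hit in pattern.findall(message)})
        if hits:
            flagged.append((message, hits))
    return flagged


messages = [
    "I'm adding the option to get a bit savage with you",
    "Fully consensual, only when you want it.",
    "You spent more on coffee this week than last week.",  # invented example
]

flagged = flag_for_review(messages)
for message, hits in flagged:
    print(hits, "->", message)
```

Here the first two messages are flagged and the third passes through. The decision about whether the wording is acceptable still sits with a human.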

Always be tuning, always be testing

These cautionary tales of chatbot tuning teach valuable lessons. They show that even the most carefully tended chatbots need a helping hand to stay on track.

Chatbots are yet to understand the intricacies of human language. They struggle to remember the context of conversations. In short, they have a long way to go before they’re as smart, eloquent and human as, well, a human.

So, when it comes to maintaining a chatbot, it’s never a case of install and press go. Chatbots need ongoing tuning and constant testing. They also need human support at the ready for the times they drift a little too far out of their depth.

Niamh Reed
I'm a Keele University graduate and copywriter for digital engagement specialist Parker Software. I graduated with first-class honors in English with creative writing and was also awarded a certificate of competency in Japanese. I can usually be found feverishly writing business technology articles – covering everything from AI to customer service – and drinking too much tea.

