If one picture’s worth a thousand words, why is everyone rushing to replace graphical interfaces with voice-activated systems?
The question has an answer, which we’ll get to below. But even though the phrasing is a bit silly, it truly is worth asking. Anyone who’s ever tried to give written driving directions and quickly switched to drawing a map knows how hard it is to accurately describe any process in words. That’s why research like this study from Invoca shows consumers only want to engage with chatbots on simple tasks and quickly revert to speaking to a human for anything complicated. And it’s why human customer support agents are increasingly equipped with screen sharing tools that let them see what their customer is seeing instead of just talking about it.
Or to put it another way: imagine a voice-activated car that uses spoken commands to replace the steering wheel, gear shifter, gas and brake pedals. It’s a strong candidate for Worst Idea Ever. Speaking the required movements is much harder than making those movements directly.
By contrast, the idea of a self-driving car is hugely appealing. That car would also be voice-activated, in the sense that you get in and tell it where to go. The difference between the two scenarios isn’t vocal instructions or even the ability of the system to engage in a human-like conversation. Some people might like their car to engage in witty banter, but those with human friends would probably rather talk with them by phone or spend their ride quietly. A brisk “yes, ma’am” and confirmation that the car understood the instructions correctly should usually suffice.
What makes the self-driving car appealing isn’t that it can listen or speak, but that it can act autonomously. And what makes that autonomy possible is situational awareness – the car’s ability to understand its surrounding environment, including its occupant’s intentions, and to respond appropriately.
The same is ultimately true of other voice-activated devices. If Alexa and her cousins could only do exactly what we told them, they’d be useful in limited situations – say, to turn on the kitchen lights when your hands are full of groceries. But their exciting potential is to do much more complicated things on their own, like ordering those groceries in the first place (and, eventually, coordinating with other devices to receive the grocery delivery, put the groceries in the right cabinets, prepare a delicious dinner, and clean the dishes).
This autonomy only happens if the devices really understand what we want and how to make it happen. Some of that understanding comes from artificial intelligence, but the real limit is what data the AI has available to process. So I’d argue that the most important skill of the voice-activated devices is really listening. That’s how they collect the data they need to act appropriately. And the larger vision is for all these devices to pool the information they gather, allowing each device to do a better job by itself and in cooperation with the others.
Whether you want to live in a world where the walls, cars, refrigerators, thermostats, doorknobs, and light bulbs all have ears is debatable. But that’s where we’re headed, barring some improbable-but-not-impossible Black Swan event that changes everything. (Like, say, a devastating security flaw in nearly every microprocessor on the planet that goes undetected for years…wait, that just happened.)
Still, in the context of this blog, what really matters is how it all affects marketers. From that perspective, voice interfaces are highly problematic because they make advertising much harder: instead of passively lurking in the corners of a computer screen or popping up unbidden during TV shows, voice ads are either front-and-center or nowhere. Chances are consumers will be highly selective about which ads they agree to hear, so marketers will need to gain their permission through incentives such as discounts and coupons. Gaining the consent required by privacy regulations such as GDPR* will be good practice for this but it will soon seem like child’s play compared with what marketers need to do on voice devices. So one change is marketers will need a new set of skills around creating aural ads and convincing consumers to agree to listen to them.
A related skill will be making those ads effective. Remember that people are vastly better at processing visual images than words. That’s why we have the 1000:1 word:picture cliché. That efficiency is why visual ads can be effective even if people don’t focus on them – they are still being registered on some level and people will pay closer attention to those that look interesting at a glance. Aural ads will transfer much less information per moment of attention and chances are most of that information will be forgotten more quickly. We’re in early days here and there’s much to learn. But if you can buy stock in a jingle-writing company, do it.
Another obvious change will be that the device vendors themselves have more control than ever over the messages their customers receive. This gatekeeper function is already at the center of the business models for Amazon, Facebook, Google, Apple and others (increasingly including non-net-neutral broadband operators). But as fewer channels become available to reach consumers and as the channels themselves deliver fewer messages per minute, the value of those messages will increase dramatically. Insofar as separately-controlled devices compete for consumer attention, the device vendors will have even more reason to deliver experiences that consumers find pleasant rather than annoying. Of course, as I’ve argued extensively elsewhere, “personal network effects” make it likely that most consumers will find themselves dealing primarily with a single vendor, so actual competition may be limited.**
The gatekeepers’ control over their customers’ experience means that marketers will increasingly need to sell to the gatekeepers to earn the opportunity to reach consumers. What’s different in a voice-driven world is the scarcity of contact opportunities, which means that gatekeepers don’t have enough inventory (e.g., ad impressions) to sell to all would-be buyers. This isn’t entirely new: even today, impressions for Web display, paid search, and paid social are auctioned to a considerable degree. But a huge reduction in inventory (and the impact of serving ads that lead consumers to opt out, assuming they really have that option) will make the gatekeepers much more selective and, no doubt, raise prices. The gatekeepers will also have more conflicts with potential advertisers as they sell more services of their own, adding yet another level of complexity and more opportunities for deal making.
Finally, let’s come back to the sensors themselves. Assuming that the gatekeepers are willing to share what they gather, marketers will finally be able to understand exactly how consumers are responding to their messages. It’s not just that they’ll be able to know exactly who saw which messages and what they subsequently purchased. The new systems will be collecting things like heart and respiration rates, creating the potential to measure immediate physical reaction to each advertisement. It almost seems unnecessary to point out that listening devices will also capture conversations where consumers discuss specific products, not to mention their needs and intentions. The grand mysteries of marketing impact will suddenly be exposed with thoroughness, precision, and clarity. The change will be as revolutionary as X-rays, ultrasounds, and CAT scans becoming available to doctors. As with radiology in medicine, these new information streams will require new skills that form the basis of entirely new specialties.
In short, voice-activated devices will change the world in ways that have nothing to do with the interaction skills of chatbots or ease of placing orders on Alexa. Marketers’ jobs will change radically, demanding new skills and creating new power relationships. Visual devices won’t really go away – people are too good at image processing to waste the opportunity. But presenting information to consumers will ultimately be less important than gathering information about them, something that will use all the sensors that devices can deploy.
Who knew the animated housewares in Disney’s Brave Little Toaster were really a product roadmap?
*General Data Protection Regulation – have we reached the stage where I no longer need to spell it out?
** The essence of personal network effects is the value of pooling data to create the most complete information about each customer. In many ways, situational awareness is another way of describing the same thing.