Voice interaction and muscle memory
Little by little, I’ve been letting voice interaction (Echo and Siri) into my life.
What started as a trinket and novelty (“*Hey Siri, where do I hide a body?*”, “*Alexa, tell me a joke.*”) has evolved into a habitual interaction:
I ask Alexa to turn on the lights (via Hue) and tell me the news (via NPR) whenever I get home from work; I ask Siri to give me the box scores. Using them at this point feels as natural as opening a Chrome tab or jumping to the home screen and picking out an app — it's part of my muscle memory.
Some of the core apps I use have become almost reflexive to interact with — I know exactly where on the screen I need to swipe and tap to start a new game of Threes; I could start a new PyCharm project blindfolded.
Similarly, I’ve become attuned to the quirks of Echo’s voice detection: I know that it can understand “Alexa, play music by Stars” a lot better than it can understand “Alexa, play Stars”, just like I know that Spotify’s search will just bug out sometimes and it's better to force-restart the app. To me, flaws in interaction are okay, so long as they’re consistent flaws that I know how to avoid and resolve.
And so, for Echo, my biggest complaint about voice interaction — its inconsistency — has been mostly solved. While Echo’s skillset is not quite as expansive as it could be, it's very good (and very consistent) at what it does.
For the past month, every night before going to bed, I’ve told Alexa to stop playing music and turn on the lights, and told Siri to set an alarm for 8am. Alexa has literally gotten it right every single time.
For Siri, on the other hand, the struggle is real. It ranges from not registering the wake word (“Hey Siri”, the syllables it uses to start listening), to not hearing anything except the wake word (“I’m not sure what you meant by that, Justin”, sneering as if my name is proof that it knows who I am), to only hearing blips of my command (“When would you like to set an alarm?”).
This is the equivalent of shifting the buttons in Clocks.app fifteen pixels in a random direction each time you open the app. It destroys trust in an interaction.
(Put another way: honest design necessitates a consistent model of interaction, and when your only model of interaction is figuring out what someone says, you better be damn good at it.)
What’s your point, dude?
I don’t have any real proposals here — I am confident that the engineers behind both Siri and Echo are way more competent in this area than I am and are fervently working to improve their voice detection. But my point is that these platforms aren’t just novelties! A user interface built on voice demands just as much thought, precision, and consideration as any other user interface, and it behooves us to consider and evaluate them in those terms as well.
And in this case, I think voice detection is the whole ball game. I started using Echo and Siri because they honest-to-god made it easier to do something; it's easier for me to say a sentence than to switch on all the lights, or scroll to the right alarm clock, or load Spotify and find the artist I wanna listen to. That’s voice interaction's great promise: saying things instead of doing things. When it fails to deliver on that core mechanism, it doesn’t matter what third-party integrations or fancy features it offers, because it's not as easy as just doing the damn thing yourself.