  • Adam Woollacott

The Exploration of Voice Interfaces

This was written as a research paper for the Digital Assistant Academy course, towards the end of last year.

What makes voice interfaces work? Why do personas help in creating a user experience, and which pieces of information matter most to the user? This paper explores these questions in more detail, whilst also benchmarking voice interfaces that incorporate some great designs and offer some food for thought.

Voice-Only Interfaces

The first step of this research was to identify which assistants would be benchmarked against one another. For the voice-only interfaces, the Amazon Echo Dot (Alexa) and Google Nest Mini were selected.

For the purpose of this research, each interface was tested on simple tasks around setting reminders. Although this is not a tough test of what voice assistants can do, research shows that 60% of users automate tasks and save time by using voice regularly (Vixen Labs, 2020), so it was a sensible benchmark for seeing how the two measure up. Both assistants were given the same set of tasks and the results were recorded.

1. Remind me to turn on the oven at 3:30pm

2. What reminders do I have today?

3. Cancel my reminder.

4. *Reminder Notification

Setting the reminder was an easy process with both interfaces. Both use a similar conversational approach, opening with words such as “Alright” and “Okay”. These set the tone and give the assistant a persona, allowing the user to connect on a more familiar level and building trust. The persona plays a crucial part in creating a positive user experience and can turn an interaction with the user into a relationship. This informal language is typical of a household setting; it makes the voice much less rigid and robotic and gives the impression of another person in the room.

Google - “Alright I’ll remind you at 3:30pm”

Alexa - “Okay, I’ll remind you at 3:30pm”

“There is no such thing as a voice user interface without persona” (Cohen, Giangola and Balogh, Voice User Interface Design)

However, although both were effective in confirming the time of the reminder, neither confirmed what the reminder was actually for. Users could be left unsure whether the assistant had correctly recognised the request unless they asked again, therefore wasting time. The notification delivered at the specified time was a different matter: Alexa told the user what the reminder was straight away, whereas Google only announced that a reminder existed, withholding the details until asked.

Alexa - “I’m reminding you…Turn on the oven”

Google - “I have a reminder for Adam”

User - “What’s my reminder?”

Google - “You have one reminder, today at 3:30, turn on the oven.”

Withholding information, as in the example above, can work well because of privacy concerns that may arise at the time of the reminder: the user may not want other people in the room to hear the details. With about 40% of users concerned about the privacy of their data when using voice apps (Vixen Labs, 2020), there is a clear need to provide as much privacy as possible in voice technology, so that users feel more comfortable sharing personal data.
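The trade-off between announcing a reminder in full and withholding its details could be expressed as a single setting. The following is a minimal Python sketch under stated assumptions, not a description of how either assistant is actually built: the `Reminder` class, the `discreet` flag and the phrasing templates are all hypothetical, with the wording modelled on the responses recorded in this benchmark. It also shows a confirmation that repeats the task as well as the time, which neither assistant did.

```python
from dataclasses import dataclass


@dataclass
class Reminder:
    owner: str  # who the reminder belongs to (hypothetical field)
    task: str   # what the user asked to be reminded about
    time: str   # spoken form of the time, e.g. "3:30pm"


def confirmation(reminder: Reminder) -> str:
    """Confirm a new reminder, echoing both the time and the task."""
    return f"Okay, I'll remind you to {reminder.task} at {reminder.time}."


def notification(reminder: Reminder, discreet: bool) -> str:
    """Announce a due reminder, hiding the details when discreet is True."""
    if discreet:
        # Other people in the room only learn that a reminder exists.
        return f"I have a reminder for {reminder.owner}."
    # Full announcement: the task is spoken straight away.
    return f"I'm reminding you: {reminder.task}."


oven = Reminder(owner="Adam", task="turn on the oven", time="3:30pm")
print(confirmation(oven))
print(notification(oven, discreet=True))
print(notification(oven, discreet=False))
```

A per-user or per-reminder `discreet` setting would let the user decide how much is spoken aloud, rather than the platform deciding for everyone.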

One particularly interesting finding was the order in which Google and Alexa told the user what reminders they had that day. Deciding which pieces of information in a sentence are most important has a psychological dimension. The serial position effect describes how the position of an item in a sequence or sentence affects how accurately the user recalls it. Items in the middle are recalled least accurately, whereas items at the beginning and at the end are recalled more accurately (Wong, 2020). This reveals an interesting difference between Google Assistant and Amazon Alexa: Google placed “today at 3:30pm” in the middle, treating it as the least important information, whereas Alexa placed “turn off the oven” there.

User - “What reminders do I have today?”

Google - “You have one reminder, today at 3:30pm, switch on the oven”

Alexa - “You have one reminder today, turn off the oven at 3:31pm”
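To make the serial position effect concrete, here is a small Python sketch of how a response could be ordered so that the least important fragment lands in the middle. It is only an illustration: the function name `order_for_recall` and the importance scores are assumptions, not something either assistant is known to use, and a real response would also need to stay grammatical.

```python
def order_for_recall(fragments):
    """Arrange (text, importance) pairs so the least important sit in the middle.

    The most important fragment goes last, the second most important goes
    first, and the remaining fragments fill in towards the middle.
    """
    ranked = sorted(fragments, key=lambda f: f[1], reverse=True)
    front, back = [], []
    for i, (text, _) in enumerate(ranked):
        if i % 2 == 0:
            back.insert(0, text)  # 1st, 3rd, ... most important: build from the end
        else:
            front.append(text)    # 2nd, 4th, ... most important: build from the start
    return front + back


# Hypothetical importance scores: the task matters most, the time least.
fragments = [
    ("You have one reminder", 2),
    ("today at 3:30pm", 1),
    ("turn on the oven", 3),
]
print(", ".join(order_for_recall(fragments)))
```

With these assumed scores the result follows the same pattern as Google's response: the count first, the time in the middle and the task at the end.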

Lastly, being able to notify the user of what is happening during the different states of a voice session is an important factor. Confirmations are a way to ensure that the user feels understood (Kore, 2018). Both interfaces notified the user through sounds (earcons) and visual feedback. Both also had options within their own apps to disable the request sounds depending on user preference, giving a more personalised experience. They included some very helpful earcons that confirmed to the user that a task was in progress or had been completed. The phrase ‘less is more’ is commonly used in the voice industry, with research showing a much stronger preference for shorter dialogue (Voicebot, 2019). An earcon, a short, unique sound, can remove the need for dialogue by using a tone to confirm to the user what is happening. During the reminder notification, both interfaces produced gentle earcons with a happy, light and unobtrusive tone, whilst also being informative. The main difference between the two was that Google used three notes compared to Alexa’s two.

The visual feedback from the devices was very useful and informative without attracting too much unwanted attention. Each had its own distinctive visuals: the Google Nest Mini used the four LED lights on top of the device, which pulsed and changed depending on the situation (Figure 1), and Amazon Alexa used the blue LED ring, which spun round, changing colour to light blue (Figure 2).

Multimodal Interfaces

The second part of this research focuses on the use of multimodal interfaces. Why do we need multimodal interfaces when we already have voice-only ones? Multimodal interfaces add another dimension to the user experience. They can help to create deeper interactions with users, whilst also speeding up processes.

“Interacting with both voice and screens not only enables more complex interactions that require visuals, but it’s also significantly faster. Just as humans speak much faster than they type, they also read faster than they listen, at 250 words per minute and 130 words per minute, respectively.” (Dengel, 2020)
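As a rough worked example of the figures in this quote, the Python sketch below estimates how long the same response would take to hear and to read. The two rates come from the quote; the 65-word response length and the function name are assumptions chosen purely for illustration.

```python
READING_WPM = 250    # reading speed quoted by Dengel (2020)
LISTENING_WPM = 130  # listening speed quoted by Dengel (2020)


def seconds_to_consume(word_count: int, words_per_minute: int) -> float:
    """Time in seconds to take in word_count words at the given rate."""
    return word_count / words_per_minute * 60


words = 65  # hypothetical response length
listen = seconds_to_consume(words, LISTENING_WPM)
read = seconds_to_consume(words, READING_WPM)
print(f"Listening: {listen:.1f}s, reading: {read:.1f}s")
```

At these rates a 65-word response takes about 30 seconds to listen to but under 16 seconds to read, which is roughly half the time.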

The interfaces used for this section were Google Assistant and Siri, on smartphone devices. Both were challenged to find, and get directions to, a local McDonald’s restaurant. According to the In-Car Voice Assistant Report 2020 (Cerence, 2020), navigation is the second most common use case in cars, after phone calls, so it was important to look deeper into the interfaces involved.

Both interfaces were asked the same selection of questions to find out which direction the conversation would go.

1. Which McDonald’s shall I go to?

2. Is it open?

3. How long will it take to get there?

Both assistants offered easy navigation through simple, easy-to-read interfaces. The first and most noticeable finding was the relationship between the voice and the on-screen text. Both interfaces coordinated the different modalities well, using each mode for the requests it suits best. By pairing a short voice prompt or answer with more in-depth information on-screen, they made the information much easier to digest.

According to a 2019 study, consumers prefer shorter dialogues, as they make information easier to retain, whereas long dialogues can become tiresome.

User - “Which McDonald’s shall I go to?”

Siri - “One option I see is McDonalds Rue St. Dennis. Wanna try that one?”

Google Assistant - “I found a few McDonald locations near you”

You can see from Figure 3 that both show a similar interface, presenting a few options to the user whilst providing short, quick information on the search. This will be a recurring point in this section on multimodal devices. One thing worth mentioning is the inclusion of a map in Google Assistant. With a quick glance at the screen, users can get an idea of where the locations are in relation to themselves, which is a useful indication.

Both interfaces allowed the conversation to flow well, prompting the user to decide which location they preferred. However, there were occasions with Siri where touch was the only way for the user to carry on the conversation. It would have worked better to let the user answer in either mode, so that they are not forced to change mode unnaturally.

Other examples of a good balance between voice and screen appeared when the user chose their final location. The visual interfaces give the user all the basic information in an easy-to-read format, with large icons for quick navigation. This is especially important when users only have time to glance at the screen. To work alongside the screen, the voice assistant told the user some extra important information about changes at the restaurant. This proved to be incredibly useful, as the user was now aware of this new information and could go on to ask for further details.

Google Assistant - “McDonald’s, here you are. A lot of places have changed their hours and services temporarily, so you might want to check with them.”

Siri - “I can call that location or get directions, which would you like?”

Lastly, each interface had its own way of notifying the user of the different states. Letting the user know that something is happening is an important feature of a user interface. It gives the user feedback and confirmation that the assistant is processing the information. Siri used its logo, which increased in size whilst adding swirling movement. Google Assistant, on the other hand, used its four coloured dots, which are recognisable from other Google devices. The dots bounced, creating a simple but effective movement to notify the user.


So, benchmarking different types of interfaces, across different tasks and different modalities, has shown that even though each one differs from the others, they all share the same goal of wanting to help the user. The interfaces aim to make interaction easier for each user, whether they are finding out exactly what reminders they have that day or where the closest McDonald’s is. Sometimes this happens through voice, sometimes through visuals and sometimes through a combination of the two. The delivery method also matters here. The order in which a sentence is phrased can affect the user’s memory retention, as can the decision about which information is delivered by voice and which by visuals. We as humans utilise multiple senses. Technology that interacts with these different senses can help develop strong relationships between humans and technology, assisting us in positive ways.

So who knows what the future holds?

Also, if anyone is interested in learning more about Voice Technology and Conversation Design, I recommend checking out the Digital Assistant Academy's course on Voice Interaction Design.


Vixen Labs. “Vixen Labs Commerce Voice Report”, 2020, p17-19, Retrieved 11th November 2020.

Wong, Euphemia. “Serial Position Effect: How to Create Better User Interfaces”, Interaction Design Foundation, 2020, Retrieved 12th November 2020.

Giangola, James. “Voice User Interface Design”, Addison Wesley; 1st edition, 2004, Retrieved 12th November 2020.

Collins English Dictionary. “Persona”, Harper Collins Publishers, Retrieved 12th November 2020.

King, Jennifer. “Hearing it from a Skill Builder: How to Create a Persona for your Alexa Skill”, Amazon, 2018, Retrieved 13th November 2020.

Kore. “Tips on Designing Conversations for Voice Interfaces”, UX Design, 2018, Retrieved 13th November 2020.

Voice Labs. “What Consumers Want in Voice App Design”, 2019, Retrieved 14th November 2020.

Figure 1 – “Meaning of Google Home Mini LED Lights”, Gadget Guide Online, Retrieved 15th November 2020.

Figure 2 – Kozuch, Kate. “Decoding Alexa Flashing Lights Review”, Tom's Guide, 2020, Retrieved 15th November 2020.

Dengel, Tobias. “Google Highlights Multimodal as a Key Trend in 2020”, Willow Tree Apps, 2020, Retrieved 16th November 2020.

Cerence. “In-Car Voice Assistant Report”, 2020, p15, Retrieved 16th November 2020.

