Taking VUI to the Next Level with Non-Speech Audio
As we saw in the previous issue of the VUI View, speech-enabling Interactive Voice Response (IVR) systems, when done right, markedly increases caller satisfaction. The tedious, restrictive, and colorless medium of pushing buttons to interact with an automated system is deemed far less pleasant by users than speaking naturally. Callers don't just want to complete their task: they want to do it efficiently, with the least amount of frustration and, if at all possible, enjoy themselves while at it.
A technique that can help the designer compound the added benefits of speech and achieve even higher caller satisfaction is the use of non-speech audio: sounds such as beeps, blips, chimes, and music that help guide and position callers during their interaction with the system. Such markers, when deployed judiciously by the designer, can unobtrusively and gently cue the user to do the right thing at the right time, and even add some "audio color" to the caller's experience. They serve to punctuate interactions, mentally position callers in the dialog, let the caller know that the system is moving along, and efficiently establish quick associations between sound and function that would otherwise be cumbersome to communicate with speech alone.
In this issue of the VUI View, I briefly review what non-speech audio is and where and how it should be used in a Voice User Interface (VUI).
Types of non-speech audio
There are five basic types of non-speech audio, each useful only for certain conversational contexts and situations.
- Beep or blip: a simple sound with a static pitch and a simple energy envelope. [1] It can be the traditional beep [beep.wav] or shorter variations of that sound, such as [shortbeep.wav] or [blip.wav].
- Chime: a slightly more complex sound pattern that is usually used to announce menus or new contexts and to mark transitions. [harp.wav]
- Earcon: also known as "auditory icon", the earcon is the audio equivalent of the visual icon and similarly serves to impart to the user a specific meaning when encountered. For instance, the sound of a bat cracking could signal the beginning of the baseball scores section in a dialog [bat.wav].
- Audio logo: a short signature jingle or tune, often no more than three or four notes, that is vividly associated in the minds of callers with a company's brand. [outback.wav]
- Music: usually used at the opening of a call, as light background when offering a list of options, or during times when the caller is made to wait, either for a system response (if the wait is more than a few seconds) [drums.wav] or for the next available operator.
When to use non-speech audio
Here are twelve situations where the use of non-speech audio will enhance a caller's experience.
- Opening the application: if the service is provided by a company whose brand is associated with an audio jingle or tune, use an audio logo to open the call. If the logo is a short three- or four-note jingle or sounds more like a chime, then open the call with the audio followed by speech. The Outback signature flute is a good example of such a branded sound with which to open a speech application. [outback.wav] If the logo is music, then mix the music with voice.
- Signaling that it's the caller's turn to speak: use a short, soft beep [shortbeep.wav].
- When the system is busy doing something and is holding the turn: if the caller is made to wait a few seconds (10 or fewer), use an earcon [keyboard.wav]; if they must wait longer, use music.
- When waiting for the caller to give an answer: use a music loop [bongos.wav] [bass.wav] [drums.wav] to convey a sense that the system is patiently waiting on the caller to say something.
- After a caller's speech is successfully captured: use a short beep or a short chime to communicate to the caller that their input was successfully understood. [shortbeep.wav]
- After a "no input": use a beep to signal to the caller that the system did not successfully capture caller input. A double beep is a good way to alert the caller to start speaking. [doublebeep.wav]
- When announcing a menu: use a chime [mainmenu.wav] followed by a verbal landmark, such as "main menu".
- When transitioning from one section to the next: use a short chime or an earcon. [transition.wav]
- When entering a new section: use an earcon that captures the theme of the new section (e.g., the sound of a train whistling [train.wav] would communicate to the caller that they are now in the train reservation subsection).
- When reading back a list of results: use a short beep at the end of each list item to mark each choice. [blip.wav]
- When announcing help: use a chime followed by a verbal landmark, such as "help", or "help menu". [cymbal.wav]
- End of the application: fading, tension-releasing music is usually the best sound for ending an application. [endgalactic.wav]
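The situations above can be summarized as a simple lookup from situation to audio cue. The following Python sketch is only an illustration: the `Situation` names, the `CUES` table, and the `system_busy_cue` helper are hypothetical, the file names are the samples cited in this article, and where the article offers several samples (for example, the three music loops) one is picked arbitrarily.

```python
from enum import Enum


class Situation(Enum):
    """Hypothetical labels for the situations listed in this article."""
    OPENING = "opening the application"
    CALLER_TURN = "caller's turn to speak"
    WAITING_FOR_CALLER = "waiting for the caller to answer"
    INPUT_CAPTURED = "caller speech captured"
    NO_INPUT = "no input"
    MENU = "announcing a menu"
    TRANSITION = "transition between sections"
    NEW_SECTION = "entering a new section"
    LIST_ITEM = "end of a list item"
    HELP = "announcing help"
    CLOSING = "end of the application"


# One cue per situation, using the sample files named in the article.
CUES = {
    Situation.OPENING: "outback.wav",            # audio logo
    Situation.CALLER_TURN: "shortbeep.wav",      # short, soft beep
    Situation.WAITING_FOR_CALLER: "bongos.wav",  # music loop (one of three samples)
    Situation.INPUT_CAPTURED: "shortbeep.wav",   # short beep or chime
    Situation.NO_INPUT: "doublebeep.wav",        # double beep
    Situation.MENU: "mainmenu.wav",              # chime, then a verbal landmark
    Situation.TRANSITION: "transition.wav",      # short chime or earcon
    Situation.NEW_SECTION: "train.wav",          # theme earcon (train reservations)
    Situation.LIST_ITEM: "blip.wav",             # short beep after each item
    Situation.HELP: "cymbal.wav",                # chime, then a verbal landmark
    Situation.CLOSING: "endgalactic.wav",        # fading music
}

# The article suggests an earcon for waits of about 10 seconds or less.
SHORT_WAIT_LIMIT_SECONDS = 10


def system_busy_cue(expected_wait_seconds):
    """Pick the cue to play while the system holds the turn."""
    if expected_wait_seconds <= SHORT_WAIT_LIMIT_SECONDS:
        return "keyboard.wav"  # earcon for a short wait
    return "drums.wav"         # music for a longer wait
```

Keeping the mapping in one table, rather than scattering file names across dialog states, makes it easier to reuse the same sound for the same situation everywhere.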
When using non-speech audio, it is crucial to be consistent: (1) use non-speech audio across the application and not just sporadically (e.g., always play a music loop when waiting for caller input, not just at some waiting instances), and (2) use the same audio for the same situations (don't change the wait-for-caller-input music from one state to the next).
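Both consistency rules can be checked mechanically once each dialog state declares the audio it plays. The sketch below assumes a hypothetical representation in which every state maps situation names to an audio file, or to `None` when it plays nothing; the function name and data layout are illustrative and not part of any particular IVR platform.

```python
from collections import defaultdict


def find_inconsistent_cues(states):
    """Report violations of the two consistency rules.

    `states` maps a state name to a dict of situation -> audio file,
    with None meaning the state plays no audio for that situation.
    Returns (conflicting, sporadic):
      conflicting: situations played with more than one audio file
      sporadic: situations that have audio in some states but not others
    """
    files_by_situation = defaultdict(set)
    silent_states = defaultdict(list)

    for state_name, cues in states.items():
        for situation, audio in cues.items():
            if audio is None:
                silent_states[situation].append(state_name)
            else:
                files_by_situation[situation].add(audio)

    conflicting = {
        situation: sorted(files)
        for situation, files in files_by_situation.items()
        if len(files) > 1
    }
    sporadic = {
        situation: sorted(names)
        for situation, names in silent_states.items()
        if situation in files_by_situation
    }
    return conflicting, sporadic


# Example: two states that disagree on the waiting music, and one
# state that omits the confirmation beep the other one plays.
example_states = {
    "main_menu": {"wait_for_caller": "bongos.wav", "input_captured": "shortbeep.wav"},
    "train_booking": {"wait_for_caller": "bass.wav", "input_captured": None},
}
conflicting, sporadic = find_inconsistent_cues(example_states)
```

In this example, `conflicting` flags `wait_for_caller` (two different music loops) and `sporadic` flags `input_captured` (missing in `train_booking`).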
If you are interested in professional help with optimizing your voice applications, feel free to contact me at bouzid@angel.com or call 1-888-MYANGEL (1-888-692-6435) and say "Ahmed Bouzid".
[1] For more details on the physics of tonal sounds, see: Bruce Balentine & David P. Morgan, How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues, 2nd edition, 1999, San Ramon, CA: Enterprise Integration Group.