The State of Speech Synthesis Markup Language (SSML)
Updated: May 10, 2021
Speech Synthesis Markup Language (SSML) is an XML-based markup language for controlling how synthetic speech is generated across speech applications. By standardizing control over different aspects of speech, it can shape pronunciation, accent, emotion and its intensity, language and volume, among other properties.
The aim? To make conversations sound more natural by replicating human speech. According to the W3C Recommendation, the intended use of SSML is to improve the quality of synthesized content (W3C, 2010).
How is SSML used?
SSML is predominantly used to tune Text-to-Speech (TTS) engines. TTS engines such as Amazon Polly already provide basic speech controls for things such as punctuation, but support for SSML gives much more refined control over speech generation. Within a TTS engine, SSML content is processed as XML. In the first step, an XML parser separates the plain text from the markup. The markup is then turned into a set of instructions for the synthesiser to follow as part of the TTS process. For the parser to separate the text, the SSML must be well-formed, meaning every element is closed and properly nested.
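The parsing step described above can be sketched with Python's standard XML parser. This is only an illustration of the idea, not how any particular TTS engine is implemented; the function names split_ssml and is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET


def split_ssml(ssml: str):
    """Separate plain text from markup in a well-formed SSML string.

    Returns (plain_text, instructions), where instructions is a list of
    (tag, attributes) pairs in document order. Raises ET.ParseError if
    the markup is not well-formed.
    """
    root = ET.fromstring(ssml)
    # Every element becomes one "instruction" for the (imaginary) synthesiser.
    instructions = [(element.tag, dict(element.attrib)) for element in root.iter()]
    # Collect all text nodes and normalise whitespace.
    plain_text = " ".join("".join(root.itertext()).split())
    return plain_text, instructions


def is_well_formed(ssml: str) -> bool:
    """Report whether the parser accepts the markup at all."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    sample = '<speak>Hello <prosody volume="x-loud">world</prosody>.</speak>'
    print(split_ssml(sample))
    # An unclosed <prosody> element is rejected before any synthesis happens.
    print(is_well_formed('<speak><prosody rate="slow">oops</speak>'))
```

A real engine would map each element to synthesiser behaviour rather than return a list, but the requirement is the same: markup that is not closed and properly nested never gets past the parser.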
Some SSML elements (tags) have their own attributes and values, which add another layer of customization to the speech synthesis. Table 1 shows each of the elements alongside its purpose. For example, the <prosody> element has six attributes: contour, duration, pitch, range, rate and volume. Each attribute accepts a set of defined values, and several attributes can be combined for a deeper level of control. To illustrate, the volume attribute accepts labels such as silent, x-soft, soft, medium, loud and x-loud, while pitch accepts x-low, low, medium, high and x-high. Take a look at the example below (Amazon, 2021):
Normal volume for the first sentence.
<prosody volume="x-loud">Louder volume for the second sentence</prosody>.
When I wake up, <prosody rate="x-slow">I speak quite slowly</prosody>.
I can speak with my normal pitch,
<prosody pitch="x-high"> but also with a much higher pitch </prosody>,
and also <prosody pitch="low">with a lower pitch</prosody>.
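Snippets like the one above are often generated by code rather than written by hand. The following is a minimal sketch of a helper that builds a <prosody> element and rejects labels that do not belong to the attribute. The helper name prosody and the ALLOWED_LABELS table are hypothetical; the label sets follow the W3C recommendation, and numeric values such as percentages are deliberately left out to keep the sketch short.

```python
from xml.sax.saxutils import escape, quoteattr

# Label values defined by the W3C SSML recommendation for three of the
# six prosody attributes. Numeric forms (e.g. "51%") are not handled here.
ALLOWED_LABELS = {
    "volume": {"silent", "x-soft", "soft", "medium", "loud", "x-loud", "default"},
    "pitch": {"x-low", "low", "medium", "high", "x-high", "default"},
    "rate": {"x-slow", "slow", "medium", "fast", "x-fast", "default"},
}


def prosody(text: str, **attributes: str) -> str:
    """Wrap text in a <prosody> element after validating label values."""
    for name, value in attributes.items():
        allowed = ALLOWED_LABELS.get(name)
        if allowed is None:
            raise ValueError(f"unsupported prosody attribute: {name}")
        if value not in allowed:
            raise ValueError(f"{value!r} is not a valid label for {name}")
    attrs = "".join(f" {name}={quoteattr(value)}" for name, value in attributes.items())
    # escape() keeps characters such as & and < from breaking well-formedness.
    return f"<prosody{attrs}>{escape(text)}</prosody>"


if __name__ == "__main__":
    print(prosody("Louder volume for the second sentence", volume="x-loud"))
    print(prosody("I speak quite slowly", rate="x-slow"))
```

Validating before sending the markup to an engine catches mix-ups such as using a pitch label (x-high) for volume, which an engine might otherwise reject or ignore.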
Another tuning option within the <prosody> element is the rate attribute, which controls the speed of the speech, either as a label such as x-slow or as a percentage of the default rate. It can be used to alter the emphasis on certain words or phrases. The example below (Ma, 2020) slows the word “really” to 51% of the normal rate. This adds emphasis to the word, giving the sentence a slightly different meaning. Listen to the example in the audio below.
Sometimes somebody will bring something that you <prosody rate="51.00%">really </prosody>like.
Audio 1. Rate - Original Audio 2. Rate - 51%
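To get a feel for what a rate of 51% means, the arithmetic can be sketched as follows. This assumes the percentage is read as a multiplier of the default speaking rate, as Amazon Polly documents it (100% means no change, 50% means half speed); other engines may interpret the value differently. The function name stretched_duration is hypothetical.

```python
def stretched_duration(default_seconds: float, rate_percent: float) -> float:
    """Estimate how long a segment lasts when spoken at rate_percent.

    Assumes rate_percent is a multiplier of the default speaking rate,
    so halving the rate doubles the duration.
    """
    if rate_percent <= 0:
        raise ValueError("rate_percent must be positive")
    return default_seconds / (rate_percent / 100.0)


if __name__ == "__main__":
    # A word that normally takes 0.4 s lasts almost twice as long at 51%.
    print(round(stretched_duration(0.4, 51.0), 3))
```

Under that assumption, the slowed word takes roughly 1 / 0.51, or about 1.96 times, as long as it would at the default rate.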
How Does SSML Help in Voice Applications?
Words can be pronounced in many different ways, both across languages and within a single language. SSML is therefore used within voice applications to help voice assistants pronounce words or text correctly. The example below showcases this (VideoLocalize, 2019):
Genba is a Japanese term meaning “the actual place”.
Audio 3. Genba - Original Audio 4. Genba - SSML
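One common way to correct a pronunciation is the <phoneme> element from the W3C recommendation, which attaches a phonetic transcription to a word. The sketch below is not necessarily the markup used in the cited example; the helper name add_phoneme is hypothetical and the transcription passed to it is only an illustrative assumption.

```python
from xml.sax.saxutils import escape, quoteattr


def add_phoneme(sentence: str, word: str, ipa: str) -> str:
    """Wrap the first occurrence of word in a <phoneme> element.

    The result is a complete <speak> document. The ipa argument is passed
    through unchanged; choosing a correct transcription is up to the caller.
    """
    before, found, after = sentence.partition(word)
    if not found:
        raise ValueError(f"{word!r} does not occur in the sentence")
    tagged = f'<phoneme alphabet="ipa" ph={quoteattr(ipa)}>{escape(word)}</phoneme>'
    return f"<speak>{escape(before)}{tagged}{escape(after)}</speak>"


if __name__ == "__main__":
    # "gemba" is a placeholder transcription used for illustration only.
    print(add_phoneme("Genba is a Japanese term.", "Genba", "gemba"))
```

Engines differ in which phonetic alphabets they accept, so the alphabet value would need checking against the target engine's documentation.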
According to Reval (2020), SSML is important within voice applications because it allows developers and designers to create better user experiences. By developing voice applications that convey emotion, pauses and other variations of human speech, conversations between humans and voice-enabled devices will become much smoother interactions.
Incorporating speech synthesis and SSML into the development of voice applications could improve retention for each application. According to a Voicelabs report from 2017, voice app second-week retention increased from 3% to 6% (Kinsella, 2017). Even for 2017 these percentages are low, and Mark Tucker stated in 2020 that retention rates still remain a problem (Kinsella, 2020). Finding ways to increase retention among voice applications is therefore important, and speech synthesis can be part of the solution. For example, a study on the impact of voice pitch on text memory states that the pitch of the voice mediates long-term memory effects (Helfrich and Weidenbecher, 2011). It shows that altering parameters such as pitch can affect how people remember information, a finding that can be incorporated into developing speech synthesis for voice applications.
Discussion and Conclusion
Table 2 shows the range of elements available according to W3C (2010), along with which TTS engines support each element. Together, these parameters give the engines a great deal of flexibility to customize the sound of the speech. The table also shows that the engines as a whole have some catching up to do before they support all the available elements for maximum customization and flexibility. Companies such as Amazon are already developing new elements for their voice applications. Examples include <amazon:auto-breaths> and <amazon:effect name="whispered">, which add breathing sounds and whispering to make speech more expressive. This shows that companies are continuing to improve their own engines through Speech Synthesis Markup Language, providing developers and designers with more customization.
SSML is a powerful language that can affect many different parts of speech synthesis. It can provide increased customization for speech, allowing for more natural and free-flowing conversations between technology and humans. Continued research will deepen the understanding of speech synthesis and help to improve tuning accuracy in the future.