Speech Synthesis Markup Language (SSML)

To control additional aspects of AT&T Natural Voices, you can use SSML (Speech Synthesis Markup Language), an XML-based markup language. It consists of a set of XML control tags that modify some aspect of the synthesized speech. For example, you can instruct the engine to ignore the capitalization of words or to change the rate, voice, or volume.

The SSML tags can be embedded within question, information, or response text. These tags will not appear when the text is displayed, but will alter the voice or pronunciation of the text that follows during an ACASI interview.

SSML follows the rules for XML syntax. Every tag must have a matching closing tag, such as: <tag attribute="value">this is text</tag>. When a tag does not enclose any text, the format is <tag attribute="value"/>.
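The matching-tag rule can be checked before tagged text is pasted into an interview. The sketch below is one possible way to do that with Python's standard XML parser. The function name is_well_formed and the wrapper element root are illustrative assumptions, not part of AT&T Natural Voices. A strict XML parser is case-sensitive and expects straight quotation marks around attribute values, so it may be stricter than the TTS engine itself.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if every tag in the fragment is closed and properly nested."""
    try:
        # Wrap the fragment in a dummy root element (an assumption of this
        # sketch) so that plain text mixed with tags parses as one document.
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False


print(is_well_formed('Time for a pause <Break size="medium"/> Okay, keep going.'))  # True
print(is_well_formed('<P> This paragraph is never closed.'))  # False
```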

SSML (Speech Synthesis Markup Language) Control Tags

Ignore Case

This tag tells the TTS engine to ignore the capitalization of words within the specified context. It is useful for overriding the default behavior, which is to spell out all capitalized words.

  • Syntax: <ATT_IGNORE_CASE> text </ATT_IGNORE_CASE>
  • Example: <ATT_ignore_case> THIS CONTRACT </ATT_ignore_case>

Note: Pronounced as “this contract”

Break

This tag instructs the TTS engine to insert a pause in the synthesized speech, using one of three syntaxes.

Syntax #1: <BREAK/>

  • Example: Time for a pause <Break/> Okay, keep going.
  • Note: Inserts a brief break after the word “pause”.

Syntax #2: <BREAK Size=”none | small | medium | large”/>

  • Example: No time for a pause <Break size=”none”/> Keep going.
  • Note: Inserts no break after the word “pause”.
  • Example: Time for a pause <Break size=”medium”/> Okay, keep going.
  • Note: Inserts a brief silence, the equivalent of the silence following a sentence or paragraph, after the word “pause”.
  • Example: Time for a pause <Break size=”large”/> Keep going.
  • Note: Inserts only the default break after the word “pause”.

Syntax #3: <BREAK time=” duration ”/>

  • Example: Break for 100 milliseconds <Break time=”100ms”/> Okay, keep going.
  • Note: Inserts 100 milliseconds of silence after the word “milliseconds”.
  • Example: Break for 3 seconds <Break time=”3s”/> Okay, keep going.
  • Note: Inserts 3 seconds of silence after the word “seconds”.
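When pause lengths are computed rather than typed by hand, the time form of the tag can be generated. This is a minimal sketch, assuming only the ms and s units shown in the examples above. The helper name break_tag is hypothetical.

```python
def break_tag(milliseconds: int) -> str:
    """Build a Break tag that uses the time attribute (Syntax #3)."""
    if milliseconds < 0:
        raise ValueError("pause length cannot be negative")
    # Use whole seconds when the value divides evenly, otherwise milliseconds.
    if milliseconds >= 1000 and milliseconds % 1000 == 0:
        return f'<Break time="{milliseconds // 1000}s"/>'
    return f'<Break time="{milliseconds}ms"/>'


print("Break for 3 seconds " + break_tag(3000) + " Okay, keep going.")
```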

Paragraph

This tag tells the TTS engine to change the prosody to reflect the end of a paragraph, regardless of the surrounding punctuation.

  • Syntax: <PARAGRAPH> text </PARAGRAPH> or <P> text </P>
  • Example: <P> The paragraph tag can be abbreviated as just the letter P. </P>

Note: The TTS engine changes the prosody to reflect the paragraph boundaries.

Phoneme

This tag allows the user to specify pronunciations explicitly in the input text. Including the orthography, i.e. the word or words represented by the phonemes, is optional but recommended, because some modules in the front end depend on it. Pronunciations must be represented using the DARPA or SAMPA phoneme set when using the AT&T Natural Voices TTS Client SDK.

  • Syntax: <PHONEME alphabet="att_darpabet_english" ph="phoneme+"/>
  • Syntax: <PHONEME alphabet="att_sampa_spanish" ph="phoneme+"> orthography </PHONEME>
  • Example: <Phoneme alphabet="att_darpabet_english" ph="b ow t 1"/>
  • Example: <Phoneme alphabet="att_sampa_english" ph="b ow t 1"/>
  • Example: <Phoneme alphabet="att_sampa_spanish" ph="p a 1 D r e"> padre </Phoneme>
  • Example: <Phoneme alphabet="att_sampa_german" ph="b o: t 1"> boot </Phoneme>

Prosody RATE

The Rate attribute of the Prosody tag changes the rate at which the text is spoken. You can specify either the absolute rate or a relative change in the current speaking rate. This release supports a range of up to eight times faster or slower than the default speed.

  • Syntax: <PROSODY RATE="fast | medium | slow | default"> text </PROSODY>
  • Syntax: <PROSODY RATE="RelativeChange"> text </PROSODY>

Note: This changes the speaking rate, which is expressed in words per minute (WPM). RelativeChange is a floating-point number that is added to the current rate. A RelativeChange less than 0 decreases the speed.

  • Example: This is the default speed <prosody rate="slow"> this is speaking slowly <prosody rate="fast"> this is speaking fast </prosody> back to slow </prosody> back to the default rate.
  • Example: This is the default speed <prosody rate="-50%"> this is 50% slower <prosody rate="+100%"> this is 50% faster than normal </prosody> back to 50% slower </prosody> back to the default rate.

Prosody VOLUME

The Volume attribute of the Prosody tag allows the application to change the volume of the TTS voice.

Note that this does not change the volume of the output device, but it does raise or lower the volume of the text spoken within the context of the tag.

  • Syntax: <PROSODY VOLUME="level"> text </PROSODY>

Note: Level is a value from 0.0 to 200.0. A value of 100 is the voice’s default volume, a value of 0 changes the volume to 0 and a value of 200 doubles the volume. The volume changes linearly.

  • Syntax: <PROSODY VOLUME="silent | soft | medium | loud | default"> text </PROSODY>
  • Syntax: <PROSODY VOLUME="delta"> text </PROSODY>
  • Example: This is the default volume <prosody volume="silent"> silence </prosody> <prosody volume="soft"> Now I'm whispering </prosody> <prosody volume="+20%"> a little louder </prosody> <prosody volume="medium"> medium volume </prosody> <prosody volume="+20%"> a little louder </prosody> <prosody volume="loud"> very loud </prosody> <prosody volume="default"> back to default volume </prosody>.
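Because the numeric level scales linearly from 0 to 200 with 100 as the default, the loudness multiplier is the level divided by 100. The sketch below works through that arithmetic and wraps text in the numeric form of the tag. The names volume_gain and volume_tag are illustrative assumptions. The named levels (soft, loud, and so on) and the delta form are not handled here.

```python
def volume_gain(level: float) -> float:
    """Convert a numeric Prosody VOLUME level (0.0 to 200.0) to a linear gain."""
    if not 0.0 <= level <= 200.0:
        raise ValueError("level must be between 0.0 and 200.0")
    # 100 is the default volume, so 100 maps to a gain of 1.0.
    return level / 100.0


def volume_tag(level: float, text: str) -> str:
    """Wrap text in a Prosody tag with a numeric volume level."""
    volume_gain(level)  # reuse the range check; the return value is not needed
    return f'<prosody volume="{level:g}"> {text} </prosody>'


for level in (0, 50, 100, 200):
    print(level, volume_gain(level))
print(volume_tag(150, "one and a half times the default volume"))
```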

Say-As

Say-As tags provide contextual hints to the TTS engine about how text should be pronounced.

The TTS engine supports a number of different contexts that can be used to fine-tune the pronunciation of words.
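If tagged text is assembled by a script, a small helper can wrap a value in a Say-As tag for any of the contexts described below. This is a sketch only: the function name say_as is hypothetical, and it assumes the engine accepts standard XML escaping for characters such as & in the spoken text.

```python
from xml.sax.saxutils import escape, quoteattr


def say_as(context: str, text: str) -> str:
    """Wrap text in a Say-As tag for the given context, e.g. "Date:MD"."""
    # escape() replaces &, < and > so the text cannot be mistaken for markup.
    # quoteattr() adds the surrounding quotation marks for the attribute value.
    return f"<Say-as type={quoteattr(context)}> {escape(text)} </Say-as>"


print(say_as("Currency", "$25.32"))
print(say_as("Date:MD", "Dec 25"))
```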

Say-As > Acronym (English only)

The Acronym context tells the TTS engine to treat the text as an acronym and to pronounce the text as the letters in the words. This tag is especially useful if your text is mostly upper case and you use the ATT_Ignore_Case tag but then encounter an acronym.

  • Syntax: <SAY-AS Type=”Acronym”> text </SAY-AS>
  • Example: MADD <Say-as type=”acronym”> MADD </Say-as>

Note: Pronounced as “mad M-A-D-D”.

Say-As > Address

The Address context tells the TTS engine to treat the text as an address.

  • Syntax: <SAY-AS Type=”Address”> text </SAY-AS>
  • Example: <Say-as Type=”Address”> 123 Main St. , New York, NY 10017 </Say-as>

Note: Will be pronounced “one twenty three main street, New York, New York one zero zero one seven”

Say-As > ATT_Literal (English only)

The ATT_Literal context instructs the TTS engine to pass the string through literally without applying additional context processing such as abbreviations, addresses, dates, math symbols, measurements, names, numbers, times, or phone numbers.

  • Syntax: <SAY-AS Type=”ATT_Literal”> text </SAY-AS>
  • Example: <Say-as type=”ATT_Literal”> misc. </Say-as> misc.

Note: Pronounced as “misk miscellaneous”

Say-As > ATT_Math (English only)

The ATT_Math context tells the TTS engine to treat the text as a mathematical expression.

  • Syntax: <SAY-AS Type=”ATT_Math”> text </SAY-AS>
  • Example: <Say-as Type=”ATT_Math”> 3+4=7 </Say-as> 3+5=8

Note: Pronounced as “three plus four equals seven three plus sign five equal sign eight”

Say-As > ATT_Measurement (English only)

The ATT_Measurement context tells the TTS engine to treat the text as a measurement, e.g. single quotes are pronounced “feet” and double quotes are pronounced “inches”.

  • Syntax: <SAY-AS Type=”ATT_Measurement”> text </SAY-AS>
  • Example: <Say-as Type=”ATT_Measurement”> 5’ 3” </Say-as> 7’ 9”

Note: Pronounced as “five feet three inches seven single quote nine double quote”

Say-As > Currency

The Currency context tells the TTS engine to treat the text as currency and expand $ and decimal numbers appropriately. The Currency context works only with US currency.

  • Syntax: <SAY-AS Type=”Currency”> text </SAY-AS>
  • Example: <Say-as Type=”Currency”> $25.32 </Say-as>

Note: Pronounced as “twenty five dollars and thirty two cents”

Say-As > Date

The Date context tells the TTS engine to treat the text as a date. You may also add qualifiers to provide more information to the TTS engine, but in general the extra qualifiers are not needed.

  • Syntax: <SAY-AS Type=”Date”> text </SAY-AS>
  • Example: <Say-as Type=”Date”> Dec 25, 2001 </Say-as>

Note: Pronounced as “December twenty fifth two thousand one”

  • Syntax: <SAY-AS Type=”Date:M”> text </SAY-AS>
  • Example: <Say-as Type=”Date:M”> Dec </Say-as>

Note: Pronounced as “December”

  • Syntax: <SAY-AS Type=”Date:MD”> text </SAY-AS>
  • Example: <Say-as Type=”Date:MD”> Dec 25</Say-as>

Note: Pronounced as “December twenty fifth”

  • Syntax: <SAY-AS Type=”Date:MDY”> text </SAY-AS>
  • Example: <Say-as Type=”Date:MDY”> Dec 25, 2001 </Say-as>

Note: Pronounced as “December twenty fifth two thousand one”

  • Syntax: <SAY-AS Type=”Date:MY”> text </SAY-AS>
  • Example: <Say-as Type=”Date:MY”> Dec, 2001 </Say-as>

Note: Pronounced as “December two thousand one”

  • Syntax: <SAY-AS Type=”Date:Y”> text </SAY-AS>
  • Example: <Say-as Type=”Date:Y”> 2001 </Say-as>

Note: Pronounced as “two thousand one”

Say-As > Name

The AT&T Natural Voices TTS engine handles names well without XML tags, but you can tell the engine to expect a name with the SAY-AS Name tag.

  • Syntax: <SAY-AS Type=”Name”> text </SAY-AS>
  • Example: <Say-as Type=”Name”> Mark Beutnagel </Say-as>

Note: Pronounced as “Mark Beutnagel”

Say-As > Net

The Net type tells the engine to expect either an email address or a URL.

  • Syntax: <SAY-AS Type=”Net”> text </SAY-AS>
  • Example: <Say-as Type=”Net”> help@naturalvoices.att.com </Say-as>

Note: Pronounced as “help at natural voices dot ATT dot com”

  • Example: <Say-as Type=”Net”> http://naturalvoices.att.com </Say-as>

Note: Pronounced as “H T T P natural voices dot ATT dot com”

  • Syntax: <SAY-AS Type=”Net:email”> text </SAY-AS>
  • Example: <Say-as Type=”Net:email”> help@naturalvoices.att.com </Say-as>

Note: Pronounced as “help at natural voices dot ATT dot com”

  • Syntax: <SAY-AS Type=”Net:URL”> text </SAY-AS>
  • Example: <Say-as Type=”Net:URL”> http://naturalvoices.att.com </Say-as>

Note: Pronounced as “H T T P natural voices dot ATT dot com”

Say-As > Number

The Number type tells the engine to expect a number.

  • Syntax: <SAY-AS type=”Number”> text </SAY-AS>
  • Example: <Say-as type=”Number”> 10,243 </Say-as>

Note: Pronounced as “ten thousand two hundred forty three”

  • Syntax: <SAY-AS type=”Number:Decimal”> text </SAY-AS>
  • Example: <Say-as type=”Number:decimal”> 3.14159 </Say-as>

Note: Pronounced as “three point one four one five nine”

  • Syntax: <SAY-AS type=”Number:Fraction”> text </SAY-AS>
  • Example: <Say-as type=”Number:Fraction”> 5 3/4 </Say-as>

Note: Pronounced as “five and three fourths”

  • Syntax: <SAY-AS type=”Number:Ordinal”> text </SAY-AS>
  • Example: <Say-as type=”Number:ordinal”> VI </Say-as>

Note: Pronounced as “sixth”

Say-As > Sub

The SUB tag allows you to substitute spoken text for the written text.

  • Syntax: <SAY-AS sub=”spoken”> text </SAY-AS>
  • Example: <Say-as sub=”Mothers Against Drunk Driving”> MADD </Say-as>

Note: Pronounced as “mothers against drunk driving”

Say-As > Telephone

The Telephone context tells the TTS engine to treat the text as a telephone number.

  • Syntax: <SAY-AS type=”Telephone”> text </SAY-AS>
  • Example: <Say-as type=”telephone”> (212)555-1212 </Say-as>

Note: Pronounced as “two one two five five five one two one two”

Say-As > Time

The Time context tells the TTS engine to treat the text as a time.

  • Syntax: <SAY-AS Type=”Time”> text </SAY-AS>
  • Example: <SAY-AS type=”Time”> 12:34 PM </SAY-AS>

Note: Pronounced as “twelve thirty four P M”

  • Syntax: <SAY-AS Type=”Time:HMS”> text </SAY-AS>
  • Example: <SAY-AS type=”Time:HMS”> 12:34:56 PM </SAY-AS>

Note: Pronounced as “twelve thirty four and fifty six seconds P M”

  • Syntax: <SAY-AS Type=”Time:HM”> text </SAY-AS>
  • Example: <SAY-AS type=”Time:HM”> 12:34 PM </SAY-AS>

Note: Pronounced as “twelve thirty four P M”

  • Syntax: <SAY-AS Type=”Time:H”> text </SAY-AS>
  • Example: <SAY-AS type=”Time:H”> 12 PM </SAY-AS>

Note: Pronounced as “twelve P M”

Sentence

This tag tells the TTS engine to change the prosody to reflect the end of a sentence, regardless of the surrounding punctuation. The TTS engine changes the prosody to reflect the sentence boundaries.

  • Syntax: <SENTENCE> text </SENTENCE> or <S> text </S>
  • Example: <Sentence> This text is a sentence. </Sentence>
  • Example: <S> The sentence tag can be abbreviated as just the letter S. </S>

Voice

The Voice tag allows the application to change the TTS voice from within the input text. For example, you might use different voices to speak different sections of an email message or to carry on a conversation between two voices. The default voice for a server is specified when the server process is started. You choose a voice by specifying one or more of the following attributes:

  • Gender: male, female, or neutral
  • Age: an integer value, e.g. 30
  • Category: child, teen, adult, or elder
  • Name: mike8, crystal8, rosa16, rich16, or any other voices you’ve installed.
  • Language:
    • “en_us” or “English” for US English
    • “es_us” or “Spanish” for Spanish

You can specify several attributes, but in the current release it is best to specify the speaker by name. You can purchase additional voices, which may make more use of the additional attributes. Note that switching voices forces the server to load the data files for each voice that is specified, which may result in noticeable delays. Once the voice data is loaded, voice switches happen instantaneously.

  • Syntax: <VOICE Gender=”male | female | neutral” Age=”integer” Category=”child | teen | adult | elder” Language=”en_us | …” Name=”mike8 | …”> text </VOICE>
  • Example: <Voice Name=”mike8”> Hi, I’m Mike </Voice>

Note: Pronounced as “Hi, I’m Mike” using Mike’s 8 KHz voice.

  • Example: <Voice Name=”crystal8”> I’m Crystal </Voice>

Note: Pronounced as “I’m Crystal” using Crystal’s 8 KHz voice.

  • Example: <Voice Name=”rosa8”> Hola, me llamo Rosa </Voice>

Note: Pronounced as “Hola, me llamo Rosa” using Rosa’s 8 KHz voice.

  • Example: <Voice Name=”mike16”> This is Mike <Voice Name=”crystal16”> This is Crystal </Voice> This is Mike again. </Voice>

Note: Pronounced as “This is Mike” in Mike’s voice, then “This is Crystal” in Crystal’s voice, then “This is Mike again” in Mike’s voice.
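A conversation between two voices, as described at the start of this section, can also be written as a sequence of separate Voice tags rather than nested ones. The sketch below builds such a sequence from a list of turns. The function name dialogue is hypothetical, the voice names are the ones used in the examples above, and standard XML escaping of the spoken text is assumed.

```python
from xml.sax.saxutils import escape, quoteattr


def dialogue(turns):
    """Join (voice_name, line) pairs into one string, one Voice tag per turn."""
    parts = []
    for voice_name, line in turns:
        # Each turn gets its own opening and closing Voice tag.
        parts.append(f"<Voice Name={quoteattr(voice_name)}> {escape(line)} </Voice>")
    return " ".join(parts)


print(dialogue([("mike8", "Hi, I'm Mike"), ("crystal8", "I'm Crystal")]))
```

Keeping the number of distinct voices small limits the loading delays noted above.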