W3C

Speech Synthesis Markup Language Specification

W3C Working Draft 5 April 2002

This version:
http://www.w3.org/TR/2002/WD-speech-synthesis-20020405/
Latest version:
http://www.w3.org/TR/speech-synthesis
Previous version:
http://www.w3.org/TR/2001/WD-speech-synthesis-20010103/

Editors:
Daniel C. Burnett, Nuance
Mark R. Walker, Intel
Andrew Hunt, SpeechWorks International

Abstract

The Voice Browser Working Group has sought to develop standards to enable access to the web using spoken interaction. The Speech Synthesis Markup Language Specification is part of this set of new markup specifications for voice browsers, and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, and rate across different synthesis-capable platforms.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This is a working draft of the "Speech Synthesis Markup Language Specification". You are encouraged to subscribe to the public discussion list <www-voice@w3.org> and to mail in your comments as soon as possible. To subscribe, send an email to <mailto:www-voice-request@w3.org> with the word subscribe in the subject line (include the word unsubscribe if you want to unsubscribe). A public archive is available online.

This specification describes markup for generating synthetic speech via a speech synthesizer, and forms part of the proposals for the W3C Speech Interface Framework. This document has been produced as part of the W3C Voice Browser Activity, following the procedures set out for the W3C Process. The authors of this document are members of the Voice Browser Working Group (W3C Members only).

The previous draft of this specification was published as a Last Call Working Draft in January 2001. Over the past year the Voice Browser Working Group has not focused its attention on this specification, but it is now ready to make more active and timely progress. The Working Group has meanwhile made progress on other specifications, such as the Speech Recognition Grammar Format and the VoiceXML 2.0 specification. These are related to the SSML specification, and in some areas depend on it.

In order to coordinate the advancement of these specifications along the W3C track to Recommendation, the Working Group felt it was necessary to update the SSML specification with the changes needed to support the VoiceXML specification. Because the state of the art in speech synthesis technology has changed during this timeframe, the Working Group also felt it appropriate to release this new version of the specification, with a small number of changes, as a Working Draft. The expectation and goal are that the next draft can be released as a Last Call Working Draft, once the Working Group has focused sufficient attention on the specification for it to be technically sound in today's world.

Following the publication of the previous draft of this specification, the group received a number of public comments. Those comments have not been addressed in this current Working Draft but will be addressed in the timeframe of the Last Call Working Draft. Commenters who have sent their comments to the public mailing list need not resubmit their comments in order for them to be addressed at that time.

To help the Voice Browser Working Group build an implementation report (as part of advancing the document on the W3C Recommendation Track), you are encouraged to implement this specification and to indicate to W3C which features have been implemented, along with any problems that arose.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite W3C Working Drafts as other than "work in progress". A list of current public W3C Working Drafts can be found at http://www.w3.org/TR/.

Table of Contents

1. Introduction

This W3C standard is known as the Speech Synthesis Markup Language specification and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [JSML].

The Speech Synthesis Markup Language specification is part of a larger set of markup specifications for voice browsers developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, and rate across different synthesis-capable platforms.

1.1 Vocabulary and Design Concepts

There is some variance in the use of technical vocabulary in the speech synthesis community. The following definitions establish a common understanding for this document.

Voice Browser: A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.
Speech Synthesis: The process of automatic generation of speech output from data input which may include plain text, formatted text or binary objects.
Text-To-Speech: The process of automatic generation of speech output from text or annotated text input.

The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages published December 23, 1999 by the W3C Voice Browser Working Group.

The following items were the key design criteria.

1.2 Speech Synthesis Processes

A Text-To-Speech (TTS) system that supports the Speech Synthesis Markup Language will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.

Document creation: A text document provided as input to the TTS system may be produced automatically, by human authoring, or through a combination of these forms. The Speech Synthesis Markup Language defines the form of the document.

Document processing: The following are the six major processing steps undertaken by a TTS system to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich to allow control over each of the steps described below, so that the document author (human or machine) can control the final voice output. A combined example follows the list.

  1. XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.

  2. Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.

    - Markup support: The "paragraph" and "sentence" elements defined in the TTS markup language explicitly indicate document structures that affect the speech output.

    - Non-markup behavior: In documents and parts of documents where these elements are not used, the TTS system is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.

  3. Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the TTS system that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.

    - Markup support: The "say-as" element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked includes dates, times, numbers, acronyms, current amounts and more. The set covers many of the common constructs that require special treatment across a wide number of languages but is not and cannot be a complete set.

    - Non-markup behavior: For text content that is not marked with the "say-as" element the TTS system is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different systems to render the same document differently.

  4. Text-to-phoneme conversion: Once the system has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English TTS system will often have trouble determining how to speak some non-English-origin names; e.g., "Tlalpachicatl", which has a Mexican/Aztec origin.

    - Markup support: The "phoneme" element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The "say-as" element may also be used to indicate that text is a proper name that may allow a TTS system to apply special rules to determine a pronunciation.

    - Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most TTS systems are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.

  5. Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.
    - Markup support: The "emphasis" element, "break" element and "prosody" element may all be used by document creators to guide the TTS system in generating appropriate prosodic features in the speech output.

    - Non-markup behavior: In the absence of these elements, TTS systems are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.

  6. Waveform production: The phonemes and prosodic information are used by the TTS system in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.

    - Markup support: The TTS markup does not provide explicit controls over the generation of waveforms. The "voice" element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The "audio" element allows for insertion of recorded audio data into the output stream.
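
The following combined fragment (informative) illustrates how markup defined in Section 2 can influence several of these steps at once; the audio file reference and the phonetic string are illustrative only.

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <paragraph>
    <!-- Step 3: resolve "1/2" as a date rather than a fraction -->
    <sentence>Your appointment is on
      <say-as type="date:md">1/2</say-as>.</sentence>
    <!-- Step 4: force the past-tense pronunciation of "read" -->
    <sentence>I have
      <phoneme alphabet="ipa" ph="r&#x25B;d">read</phoneme>
      the book.</sentence>
    <!-- Step 5: emphasis, a pause and a slower speaking rate -->
    That is <emphasis>important</emphasis>. <break time="500ms"/>
    <prosody rate="slow">Please write it down.</prosody>
    <!-- Step 6: a requested voice quality and inserted audio -->
    <voice gender="female">Thank you.</voice> <audio src="tone.wav"/>
  </paragraph>
</speak>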

1.3 Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a TTS system. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

The following are important instances of architectures or designs from which marked-up TTS documents will be generated. The language design is intended to facilitate each of these approaches.

1.4 Platform-Dependent Output Behavior of Speech Synthesis Content

The Speech Synthesis Markup Language Specification provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, and rate. Exact specification of synthetic speech output behavior across disparate platforms, however, is beyond the scope of this document.

1.5 Terminology

Requirements terms

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119]. However, for readability, these words do not appear in all uppercase letters in this specification.

2. Elements and Attributes

The following elements are defined in this draft specification.

2.1 Document Structure, Text Processing and Pronunciation

2.1.1 "speak" Root Element

The Speech Synthesis Markup Language is an XML application. The root element is speak. xml:lang is a defined attribute specifying the language of the root document. The version attribute is a required attribute that indicates the version of the specification to be used for the document. The version number for this specification is 1.0.

<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US">
  ... the body ...
</speak>

2.1.2 "xml:lang" Attribute: Language

Following the XML convention, languages are indicated by an xml:lang attribute on the enclosing element with the value following [RFC3066] to define language codes. A language is specified by an RFC 3066 identifier following the convention of XML 1.0. [Note: XML 1.0 adopted RFC3066 through Errata as of 2001-02-22].

Language information is inherited down the document hierarchy: it has to be given only once if the whole document is in one language. Language information also nests: inner attributes override outer attributes.

xml:lang is a defined attribute for speak, paragraph, sentence, p, and s elements.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <paragraph>I don't speak Japanese.</paragraph>
  <paragraph xml:lang="ja">Nihongo-ga wakarimasen.</paragraph>
</speak>

General Notes (Normative)

2.1.3 "paragraph" and "sentence": Text Structure Elements

A paragraph element represents the paragraph structure in text. A sentence element represents the sentence structure in text. A paragraph contains zero or more sentences.

xml:lang is a defined attribute on both paragraph and sentence elements.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
<paragraph>
  <sentence>This is the first sentence of the paragraph.</sentence>
  <sentence>Here's another sentence.</sentence>
</paragraph>
</speak>

General Notes (Normative)

2.1.4 "say-as" Element

The say-as element indicates the type of text construct contained within the element. This information is used to help specify the pronunciation of the contained text. Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages. The say-as element has been specified with a reasonable set of format types. Text substitution may be utilized for unsupported constructs.

The type attribute is a required attribute that indicates the contained text construct. The value is a text type, optionally followed by a colon and a format.

The base set of type values, divided according to broad functionality, is as follows (the complete set of values appears in the DTD of Appendix B and the schema of Appendix C):

Pronunciation Types: acronym, spell-out

Numerical Types: number, number:ordinal, number:digits, number:cardinal

Time, Date and Measure Types: date (formats dmy, mdy, ymd, ym, my, md, y, m, d), time (formats hms, hm, h), duration (formats hms, hm, ms, h, m, s), currency, measure

Address, Name, Net Types: address, name, telephone, net (formats email, uri)

General Notes (Normative)

Usage Notes (Informative)
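
As an informative sketch, the following fragment exercises several of the base types listed above; the values shown are illustrative, and actual spoken renderings are platform-dependent.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <say-as type="acronym">W3C</say-as>
  <say-as type="number:ordinal">1</say-as>
  <say-as type="date:mdy">4/5/2002</say-as>
  <say-as type="time:hm">3:45</say-as>
  <say-as type="currency">$200</say-as>
  <say-as type="net:email">www-voice@w3.org</say-as>
</speak>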

2.1.5 "phoneme" Element

The phoneme element provides a phonetic pronunciation for the contained text. The "phoneme" element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

The ph attribute is a required attribute that specifies the phoneme string.

The alphabet attribute is an optional attribute that specifies the phonetic alphabet. The default value of alphabet for a conforming SSML processor is "ipa", corresponding to characters composing the International Phonetic Alphabet. In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;">
    tomato
  </phoneme>
  <!-- This is an example of IPA using character entities -->
</speak>

If a value for alphabet specifying an unknown phonetic alphabet is encountered, a conforming SSML processor should continue processing and should notify the hosting environment in that case.

Usage Notes (Informative)

2.1.6 "sub" Element

The sub element is employed to indicate that the specified text replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be substituted for the enclosed string.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <sub alias="World Wide Web Consortium"> W3C </sub>
  <!-- World Wide Web Consortium -->
</speak>

2.2 Prosody and Style

2.2.1 "voice" Element

The "voice" element is a production element that requests a change in speaking voice. Attributes are:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">   
  <voice gender="female">Mary had a little lamb,</voice>
  <!-- now request a different female child's voice -->
  <voice gender="female" variant="2">
    Its fleece was white as snow.
  </voice>
  <!-- platform-specific voice selection -->
  <voice name="Mike">I want to be like Mike.</voice>
</speak>

General Notes (Normative)

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <voice gender="female"> 
    Any female voice here.
    <voice age="6"> 
      A female child voice here.
      <paragraph xml:lang="ja"> 
        <!-- A female child voice in Japanese. -->
      </paragraph>
    </voice>
  </voice>
</speak>
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <voice xml:lang="en-cockney">
    Try a Cockney voice (London area).
  </voice>
  <voice xml:lang="en-brooklyn">
    Try one with a New York accent.
  </voice>
</speak>

2.2.2 "emphasis" Element

The "emphasis" element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  That is a <emphasis> big </emphasis> car!
  That is a <emphasis level="strong"> huge </emphasis>
  bank account!
</speak>

2.2.3 "break" Element

The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not present, the speech synthesizer is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a speech synthesizer. The attributes are:

size: the optional size attribute takes one of the values "none", "small", "medium" (the default) or "large" and requests a relative break strength.
time: the optional time attribute indicates the duration of the pause to be inserted, in seconds or milliseconds, e.g. "3s", "250ms".

    <?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  Take a deep breath <break/> then continue. 
  Press 1 or wait for the tone. <break time="3s"/>
  I didn't hear you!
</speak>

Usage Notes (Informative)
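
For example (informative), the size attribute can be used to request a relative boundary strength instead of an explicit time; the mapping from size values to actual pause durations is platform-specific.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  To be <break size="none"/> or not to be, <break size="small"/>
  that is the question. <break size="large"/>
  Whether 'tis nobler in the mind to suffer.
</speak>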

2.2.4 "prosody" Element

The prosody element permits control of the pitch, speaking rate and volume of the speech output. The attributes are:

pitch: the baseline pitch for the contained text, given as an absolute value, a relative change or one of the descriptive values "high", "medium", "low" or "default".
contour: a set of pitch targets for the contained text (see Pitch contour below).
range: the pitch range (variability) for the contained text, given in the same forms as pitch.
rate: the speaking rate for the contained text, given as an absolute value, a relative change or one of the descriptive values "fast", "medium", "slow" or "default".
duration: a value in seconds or milliseconds for the desired time to take to read the element contents, e.g. "250ms", "3s".
volume: the volume for the contained text, given as a number in the range 0.0 to 100.0, a relative change or one of the descriptive values "silent", "soft", "medium", "loud" or "default".

Relative values

Relative changes for any of the attributes above are specified as floating-point values: "+10", "-5.5", "+15.2%", "-8.0%". For the pitch and range attributes, relative changes in semitones are permitted: "+0.5st", "+5st", "-2st".

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  The price of XYZ is <prosody rate="-10%">
  <say-as type="currency">$45</say-as></prosody>
</speak>
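
The following additional sketch (informative) combines a descriptive pitch value, a relative semitone change to the range, a relative rate change and a descriptive volume value on a single utterance; the precise output will vary across platforms.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <prosody pitch="high" range="+4st" rate="-20%" volume="loud">
    This is spoken higher, with a wider pitch range,
    more slowly, and more loudly.
  </prosody>
</speak>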

Pitch contour

The pitch contour is defined as a set of targets at specified intervals in the speech output. The algorithm for interpolating between the targets is platform-specific. In each pair of the form (interval,target), the first value is a percentage of the period of the contained text and the second value is the value of the pitch attribute (absolute, relative, relative semitone, or descriptive values are all permitted). Interval values outside 0% to 100% are ignored. If a value is not defined for 0% or 100% then the nearest pitch target is copied.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis">
  <prosody contour="(0%,+20)(10%,+30%)(40%,+10)">
    good morning
  </prosody>
</speak>

General Notes (Normative)

Usage Notes (Informative)

2.3 Other Elements

2.3.1 "audio" Element

The audio element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, or another audio element. The alternate content may also be used when rendering the document to non-audible output and for accessibility. The optional attribute is src, which is the URI of a document with an appropriate MIME type.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
                 
<!-- Empty element -->
Please say your name after the tone.  <audio src="beep.wav"/>

<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>
<audio src="welcome.wav">  
     <emphasis>Welcome</emphasis>  to the Voice Portal. 
</audio>

</speak>

An audio element is successfully rendered if:

  1. The referenced audio source is played, or
  2. If the processor is unable to execute #1, then the alternative content is successfully rendered.

Deciding which conditions result in the alternative content being rendered is platform dependent. If the audio element is not successfully rendered, a conforming SSML processor should continue processing and should notify the hosting environment in that case. An SSML processor may determine after beginning playback of an audio source that it cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.

Usage Notes (Informative)

2.3.2 "mark" Element

The mark element places a marker into the text/tag sequence. A mark element that contains text is used to reference a special sequence of tags and text, either for internal reference within the SSML document or externally by another document. The empty mark element can also be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When audio output of the TTS document reaches the mark, the speech synthesizer issues an event that includes the required name attribute of the element. The platform defines the destination of the event. The mark element does not affect the speech output process.

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
                 
We would like 
<mark name="congrats">to extend our warmest congratulations</mark> 
to the members of the Voice Browser Working Group! 
Go from <mark name="here"/> here, to <mark name="there"/> there!

</speak>

Usage Notes (Informative)

When supported by the implementation, requests can be made to pause and resume at document locations specified by the mark values.

3. SSML Documents

3.1 Document Form

A legal Speech Synthesis Markup Language document must have a legal XML Prolog [XML §2.8].

The XML prolog in a synthesis document comprises the XML declaration and an optional DOCTYPE declaration referencing the synthesis DTD. It is followed by the root speak element. The XML prolog may also contain XML comments, processing instructions and other content permitted by XML in a prolog.

The version number of the XML declaration indicates which version of XML is being used. The version attribute of the speak element indicates which version of the SSML specification is being used: "1.0" for this specification. The version attribute is required.

The speak element must designate the SSML namespace using the xmlns attribute [XMLNS]. The namespace for SSML is defined to be http://www.w3.org/2001/10/synthesis.

If present, the DOCTYPE should reference the standard DOCTYPE and identifier.

The following are two examples of SSML headers:

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en">


<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
        "http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en">

3.2 Integration With Other Markup Languages

3.2.1 SMIL

The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in Appendix A.

3.2.2 ACSS

Aural style sheets are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.

3.3 SSML Document Fetching

The fetching and caching behavior of SSML documents is defined by the environment in which the SSML processor operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.

4. Conformance

This section is Normative.

4.1 Conforming Speech Synthesis Markup Language Fragments

A synthesis document fragment is a Conforming Speech Synthesis Markup Language Fragment if:

4.2 Conforming Stand-Alone Speech Synthesis Markup Language Documents

A document is a Conforming Stand-Alone Speech Synthesis Markup Language Document if:

The Speech Synthesis specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.

4.3 Using SSML with other Namespaces

The SSML namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.

4.4 Conforming Speech Synthesis Markup Language Processors

5. References

5.1 Normative References

[XML]
World Wide Web Consortium. Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation, 6 October 2000. See http://www.w3.org/TR/2000/REC-xml-20001006
[XMLNS]
World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names
[JSML]
Sun Microsystems. JSpeech Markup Language. Sun Microsystems submission to W3C, 5 June 2000. See http://www.w3.org/TR/jsml/
[RFC2119]
S. Bradner, Key words for use in RFCs to Indicate Requirement Levels, Harvard University, March 1997. See http://www.normos.org/ietf/rfc/rfc2119.txt
[RFC3066]
H. Alvestrand, Tags for the Identification of Languages. See http://www.ietf.org/rfc/rfc3066.txt
[CSS2]
World Wide Web Consortium, Cascading Style Sheets, level 2 (CSS2) Specification. W3C Recommendation. See http://www.w3.org/TR/REC-CSS2/aural.html
[VXML]
World Wide Web Consortium. Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C Working Draft. See http://www.w3.org/TR/2001/WD-voicexml20-20011023/

5.2 Informative References

[SABLE]
R. Sproat, SABLE: A Standard for TTS Markup. See http://www.research.att.com/~rws/Sable.v1_0.htm

6. Acknowledgements

This document was written with the participation of the members of the W3C Voice Browser Working Group (listed in alphabetical order):

Brian Eberman, SpeechWorks International
Andrew Hunt, SpeechWorks International
Jim Larson, Intel
Bruce Lucas, IBM
Scott McGlashan, PipeBeach
T.V. Raman, IBM
Dave Raggett, W3C/Openwave
Richard Sproat, AT&T
Luc Van Tichelen, ScanSoft
Kuansan Wang, Microsoft
Mark Walker, Intel

Appendix A: Example SSML

This appendix is Non-Normative.

The following is an example of reading headers of email messages. The paragraph and sentence elements are used to mark the text structure. The say-as element is used to indicate text constructs such as the time and proper name. The break element is placed before the time and has the effect of marking the time as important information for the listener to note. The prosody element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">
  <paragraph>
    <sentence>You have 4 new messages.</sentence>
    <sentence>The first is from 
      <say-as type="name">
        Stephanie Williams
      </say-as>
      and arrived at <break/>
      <say-as type="time">3:45pm</say-as>.
    </sentence>
    <sentence>
      The subject is <prosody rate="-20%">ski trip</prosody>
    </sentence>
  </paragraph>
</speak>

The following example combines audio files and different spoken voices to provide information on a collection of music.

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">

  <paragraph>
    <voice gender="male">
      <sentence>Today we preview the latest romantic music
      from the W3C.</sentence>

      <sentence>Hear what the Software Reviews said about
      Tim Lee's newest hit.</sentence>
    </voice>
  </paragraph>

  <paragraph>
    <voice gender="female">
      He sings about issues that touch us all.
    </voice>
  </paragraph>

  <paragraph>
    <voice gender="male">
      Here's a sample.  <audio src="http://www.w3c.org/music.wav"/>
      Would you like to buy it?
    </voice>
  </paragraph>

</speak>

SMIL Integration Example

The SMIL language is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.

File 'greetings.ssml' contains the following:

<?xml version="1.0"?>

<speak xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en">

  <sentence>
  <mark name="greetings">
    <emphasis>Greetings</emphasis> from the
    <sub alias="World Wide Web Consortium">W3C</sub>!
  </mark>
  </sentence>
</speak>

SMIL Example 1: W3C logo image appears, and then one second later, the speech sequence is rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <par>
      <img src="http://w3clogo.gif" region="whole" begin="0s"/>
      <ref src="greetings.ssml#greetings" begin="1s"/>
    </par>
  </body>
</smil>

SMIL Example 2: W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File 'greetings.smil' contains the following:

<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <top-layout width="640" height="320">
      <region id="whole" width="640" height="320"/>
    </top-layout>
  </head>
  <body>
    <seq>
      <img id="logo" src="http://w3clogo.gif" region="whole" 
      begin="0s" end="logo.activateEvent"/>
      <ref src="greetings.ssml#greetings"/>
    </seq>
  </body>
</smil>

Appendix B: DTD for the Speech Synthesis Markup Language

This appendix is Informative.

The synthesis DTD is located at http://www.w3.org/TR/speech-synthesis/synthesis.dtd.

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- 

SSML DTD 20020313

Copyright 1998-2002 W3C (MIT, INRIA, Keio), All Rights Reserved. 

Permission to use, copy, modify and distribute the SSML DTD and 
its accompanying documentation for any purpose and without fee is 
hereby granted in perpetuity, provided that the above copyright 
notice and this paragraph appear in all copies.  

The copyright holders make no representation about the suitability 
of the DTD for any purpose. It is provided "as is" without expressed 
or implied warranty.

-->
<!ENTITY % duration "CDATA">
<!ENTITY % integer "CDATA">
<!ENTITY % uri "CDATA">
<!ENTITY % audio "#PCDATA | audio ">
<!ENTITY % structure "paragraph | p | sentence | s">
<!ENTITY % sentence-elements "break | emphasis | mark | phoneme | 
        prosody | say-as | voice | sub">
<!ENTITY % allowed-within-sentence " %audio; | %sentence-elements; ">
<!ENTITY % say-as-types "(acronym|spell-out|currency|measure|
        name|telephone|address|
        number|number:ordinal|number:digits|number:cardinal|
            date|date:dmy|date:mdy|date:ymd|
        date:ym|date:my|date:md|
        date:y|date:m|date:d|
            time|time:hms|time:hm|time:h|
            duration|duration:hms|duration:hm|duration:ms|
            duration:h|duration:m|duration:s|
            net|net:email|net:uri)">
<!ELEMENT speak (%allowed-within-sentence; | %structure;)*>
<!ATTLIST speak
    version NMTOKEN #REQUIRED
    xml:lang NMTOKEN #IMPLIED
    xmlns CDATA #REQUIRED
    xmlns:xsi CDATA #IMPLIED
    xsi:schemaLocation CDATA #IMPLIED
>
<!ELEMENT paragraph (%allowed-within-sentence; | sentence | s)*>
<!ATTLIST paragraph
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT sentence (%allowed-within-sentence;)*>
<!ATTLIST sentence
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT p (%allowed-within-sentence; | sentence | s)*>
<!ATTLIST p
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT s (%allowed-within-sentence;)*>
<!ATTLIST s
    xml:lang NMTOKEN #IMPLIED
>
<!ELEMENT voice (%allowed-within-sentence; | %structure;)*>
<!ATTLIST voice
    xml:lang NMTOKEN #IMPLIED
    gender (male | female | neutral) #IMPLIED
    age %integer; #IMPLIED
    variant %integer; #IMPLIED
    name CDATA #IMPLIED
>
<!ELEMENT prosody (%allowed-within-sentence; | %structure;)*>
<!ATTLIST prosody
    pitch CDATA #IMPLIED
    contour CDATA #IMPLIED
    range CDATA #IMPLIED
    rate CDATA #IMPLIED
    duration %duration; #IMPLIED
    volume CDATA #IMPLIED
>
<!ELEMENT audio (%allowed-within-sentence; | %structure;)*>
<!ATTLIST audio
    src %uri; #IMPLIED
>
<!ELEMENT emphasis (%allowed-within-sentence;)*>
<!ATTLIST emphasis
    level (strong | moderate | none | reduced) "moderate"
>
<!ELEMENT say-as (#PCDATA)>
<!ATTLIST say-as
    type %say-as-types; #REQUIRED
>
<!ELEMENT sub (#PCDATA)>
<!ATTLIST sub
    alias CDATA #REQUIRED
>
<!ELEMENT phoneme (#PCDATA)>
<!ATTLIST phoneme
    ph CDATA #REQUIRED
    alphabet CDATA "ipa"
>
<!ELEMENT break EMPTY>
<!ATTLIST break
    size (large | medium | small | none) "medium"
    time %duration; #IMPLIED
>
<!ELEMENT mark (%allowed-within-sentence; | %structure;)*>
<!ATTLIST mark
    name ID #REQUIRED
>

Appendix C: Schema for the Speech Synthesis Markup Language

This appendix is Normative.

The synthesis schema is located at http://www.w3.org/TR/speech-synthesis/synthesis.xsd.

Note: the synthesis schema includes a no-namespace core schema, located at http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments embedded in non-synthesis namespace schemas.

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsd:schema targetNamespace="http://www.w3.org/2001/10/synthesis" 
xmlns:xsd="http://www.w3.org/2001/XMLSchema" 
xmlns="http://www.w3.org/2001/10/synthesis"
elementFormDefault="qualified">
    <xsd:annotation>
    <xsd:documentation>SSML 1.0 Schema (20020311)</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
    <xsd:documentation>Copyright 1998-2002 W3C (MIT, INRIA, Keio),
  All Rights Reserved. Permission to use, copy, modify and
  distribute the SSML schema and its accompanying documentation
  for any purpose and without fee is hereby granted in
  perpetuity, provided that the above copyright notice and this
  paragraph appear in all copies.  The copyright holders make no
  representation about the suitability of the schema for any purpose. 
  It is provided "as is" without expressed or implied warranty.
  </xsd:documentation>
    </xsd:annotation>
    <xsd:include schemaLocation="synthesis-core.xsd"/>
</xsd:schema>
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">
    <xsd:annotation>
        <xsd:documentation>SSML 1.0 Core Schema 
        (20020222)</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
        <xsd:documentation>Copyright 1998-2002 W3C (MIT,
        INRIA, Keio), All Rights Reserved. Permission to use,
        copy, modify and distribute the SSML core schema and its
        accompanying documentation for any purpose and without
        fee is hereby granted in perpetuity, provided that the
        above copyright notice and this paragraph appear in all
        copies.  The copyright holders make no representation
        about the suitability of the schema for any purpose. 
        It is provided "as is" without expressed or implied
        warranty.</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
        <xsd:documentation>Importing dependent
        namespaces</xsd:documentation>
    </xsd:annotation>
    <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
    schemaLocation="http://www.w3.org/2001/xml.xsd"/>
    <xsd:annotation>
        <xsd:documentation>General Datatypes</xsd:documentation>
    </xsd:annotation>
    <xsd:simpleType name="duration">
        <xsd:annotation>
            <xsd:documentation>Duration follows "Times" in
            CSS specification; e.g. "25ms", "3s"</xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[0-9]+m?s"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="relative.change">
        <xsd:annotation>
            <xsd:documentation>Relative change: e.g. +10, -5.5,
            +15%, -9.0%</xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[+-][0-9]+(.[0-9]+)?[%]?"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="relative.change.st">
        <xsd:annotation>
            <xsd:documentation>Relative change in semi-tones:
            e.g. +10st, -5st</xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[+-]?[0-9]+st"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="height.scale">
        <xsd:annotation>
            <xsd:documentation>values for height
                   </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="high"/>
            <xsd:enumeration value="medium"/>
            <xsd:enumeration value="low"/>
            <xsd:enumeration value="default"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="number.range">
        <xsd:annotation>
            <xsd:documentation>number range: e.g. 0-123, 23343-223333.
    No constraint that the second number is greater than the first.
    </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:pattern value="[0-9]+-.[0-9]+"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="speed.scale">
        <xsd:annotation>
            <xsd:documentation>values for speed
                     </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="fast"/>
            <xsd:enumeration value="medium"/>
            <xsd:enumeration value="slow"/>
            <xsd:enumeration value="default"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="volume.scale">
        <xsd:annotation>
            <xsd:documentation>values for volume
                 </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="silent"/>
            <xsd:enumeration value="soft"/>
            <xsd:enumeration value="medium"/>
            <xsd:enumeration value="loud"/>
            <xsd:enumeration value="default"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="float.range1">
        <xsd:annotation>
            <xsd:documentation>0.0 - 100.0
            </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:float">
            <xsd:minInclusive value="0.0"/>
            <xsd:maxInclusive value="100.0"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:simpleType name="Say-as.datatype">
        <xsd:annotation>
            <xsd:documentation>say-as datatypes
            </xsd:documentation>
        </xsd:annotation>
        <xsd:restriction base="xsd:string">
            <xsd:enumeration value="acronym"/>
            <xsd:enumeration value="spell-out"/>
            <xsd:enumeration value="number"/>
            <xsd:enumeration value="number:ordinal"/>
            <xsd:enumeration value="number:digits"/>
            <xsd:enumeration value="number:cardinal"/>
            <xsd:enumeration value="date"/>
            <xsd:enumeration value="date:dmy"/>
            <xsd:enumeration value="date:mdy"/>
            <xsd:enumeration value="date:ymd"/>
            <xsd:enumeration value="date:ym"/>
            <xsd:enumeration value="date:my"/>
            <xsd:enumeration value="date:md"/>
            <xsd:enumeration value="date:y"/>
            <xsd:enumeration value="date:m"/>
            <xsd:enumeration value="date:d"/>
            <xsd:enumeration value="time"/>
            <xsd:enumeration value="time:hms"/>
            <xsd:enumeration value="time:hm"/>
            <xsd:enumeration value="time:h"/>
            <xsd:enumeration value="duration"/>
            <xsd:enumeration value="duration:hms"/>
            <xsd:enumeration value="duration:hm"/>
            <xsd:enumeration value="duration:ms"/>
            <xsd:enumeration value="duration:h"/>
            <xsd:enumeration value="duration:m"/>
            <xsd:enumeration value="duration:s"/>
            <xsd:enumeration value="currency"/>
            <xsd:enumeration value="measure"/>
            <xsd:enumeration value="name"/>
            <xsd:enumeration value="net"/>
            <xsd:enumeration value="net:email"/>
            <xsd:enumeration value="net:uri"/>
            <xsd:enumeration value="address"/>
            <xsd:enumeration value="telephone"/>
        </xsd:restriction>
    </xsd:simpleType>
    <xsd:annotation>
        <xsd:documentation>General attributes</xsd:documentation>
    </xsd:annotation>
    <xsd:annotation>
        <xsd:documentation>Elements</xsd:documentation>
    </xsd:annotation>
    <xsd:element name="aws" abstract="true">
        <xsd:annotation>
            <xsd:documentation>The 'allowed-within-sentence'
            group uses this abstract element. Elements with aws as
            their substitution class are then alternatives for
            'allowed-within-sentence'.</xsd:documentation>
        </xsd:annotation>
    </xsd:element>
    <xsd:group name="allowed-within-sentence">
        <xsd:choice>
            <xsd:element ref="aws"/>
        </xsd:choice>
    </xsd:group>
    <xsd:element name="struct" abstract="true"/>
    <xsd:group name="structure">
        <xsd:choice>
            <xsd:element ref="struct"/>
        </xsd:choice>
    </xsd:group>
    <xsd:element name="speak" type="speak"/>
    <xsd:complexType name="speak" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="version" use="required">
            <xsd:simpleType>
                <xsd:restriction base="xsd:NMTOKEN"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="paragraph" type="paragraph"
    substitutionGroup="struct"/>
    <xsd:element name="p" type="paragraph" substitutionGroup="struct"/>
    <xsd:complexType name="paragraph" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:element ref="sentence"/>
            <xsd:element ref="s"/>
        </xsd:choice>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="sentence" type="sentence"
    substitutionGroup="struct"/>
    <xsd:element name="s" type="sentence" substitutionGroup="struct"/>
    <xsd:complexType name="sentence" mixed="true">
        <xsd:sequence minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
        </xsd:sequence>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="voice" type="voice" substitutionGroup="aws"/>
    <xsd:complexType name="voice" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="gender">
            <xsd:simpleType>
                <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="male"/>
                    <xsd:enumeration value="female"/>
                    <xsd:enumeration value="neutral"/>
                </xsd:restriction>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="age" type="xsd:positiveInteger"/>
        <xsd:attribute name="variant" type="xsd:integer"/>
        <xsd:attribute name="name" type="xsd:string"/>
        <xsd:attribute ref="xml:lang"/>
    </xsd:complexType>
    <xsd:element name="prosody" type="prosody" substitutionGroup="aws"/>
    <xsd:complexType name="prosody" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="pitch">
            <xsd:simpleType>
                <xsd:union memberTypes="xsd:positiveInteger
                relative.change relative.change.st height.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="contour" type="xsd:string"/>
        <xsd:attribute name="range">
            <xsd:simpleType>
                <xsd:union memberTypes="number.range
                relative.change relative.change.st height.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="rate">
            <xsd:simpleType>
                <xsd:union memberTypes="xsd:positiveInteger
                relative.change speed.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="duration" type="duration"/>
        <xsd:attribute name="volume">
            <xsd:simpleType>
                <xsd:union memberTypes="float.range1
                relative.change volume.scale"/>
            </xsd:simpleType>
        </xsd:attribute>
    </xsd:complexType>
    <xsd:element name="audio" type="audio" substitutionGroup="aws"/>
    <xsd:complexType name="audio" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="src" type="xsd:anyURI"/>
    </xsd:complexType>
    <xsd:element name="emphasis" type="emphasis" substitutionGroup="aws"/>
    <xsd:complexType name="emphasis" mixed="true">
        <xsd:sequence minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
        </xsd:sequence>
        <xsd:attribute name="level" default="moderate">
            <xsd:simpleType>
                <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="strong"/>
                    <xsd:enumeration value="moderate"/>
                    <xsd:enumeration value="none"/>
                    <xsd:enumeration value="reduced"/>
                </xsd:restriction>
            </xsd:simpleType>
        </xsd:attribute>
    </xsd:complexType>
    <xsd:element name="sub" type="sub" substitutionGroup="aws"/>
    <xsd:complexType name="sub">
        <xsd:simpleContent>
            <xsd:extension base="xsd:string">
                <xsd:attribute name="alias" type="xsd:string"
                use="required"/>
            </xsd:extension>
        </xsd:simpleContent>
    </xsd:complexType>
    <xsd:element name="say-as" type="say-as" substitutionGroup="aws"/>
    <xsd:complexType name="say-as" mixed="true">
        <xsd:attribute name="type" type="Say-as.datatype" use="required"/>
    </xsd:complexType>
    <xsd:element name="phoneme" type="phoneme" substitutionGroup="aws"/>
    <xsd:complexType name="phoneme" mixed="true">
        <xsd:attribute name="ph" type="xsd:string" use="required"/>
        <xsd:attribute name="alphabet" type="xsd:string" default="ipa"/>
    </xsd:complexType>
    <xsd:element name="break" type="break" substitutionGroup="aws"/>
    <xsd:complexType name="break">
        <xsd:attribute name="size" default="medium">
            <xsd:simpleType>
                <xsd:restriction base="xsd:string">
                    <xsd:enumeration value="large"/>
                    <xsd:enumeration value="medium"/>
                    <xsd:enumeration value="small"/>
                    <xsd:enumeration value="none"/>
                </xsd:restriction>
            </xsd:simpleType>
        </xsd:attribute>
        <xsd:attribute name="time" type="duration"/>
    </xsd:complexType>
    <xsd:element name="mark" type="mark" substitutionGroup="aws"/>
    <xsd:complexType name="mark" mixed="true">
        <xsd:choice minOccurs="0" maxOccurs="unbounded">
            <xsd:group ref="allowed-within-sentence"/>
            <xsd:group ref="structure"/>
        </xsd:choice>
        <xsd:attribute name="name" type="xsd:ID" use="required"/>
    </xsd:complexType>
</xsd:schema>

Appendix D: Audio File Formats

This appendix is Normative.

SSML requires that a platform support the playing of the audio formats specified below.

Each entry below gives the audio format, followed by its media type.

- Raw (headerless) 8kHz 8-bit mono mu-law [PCM] single channel (G.711): audio/basic (from http://www.ietf.org/rfc/rfc1521.txt)
- Raw (headerless) 8kHz 8-bit mono A-law [PCM] single channel (G.711): audio/x-alaw-basic
- WAV (RIFF header) 8kHz 8-bit mono mu-law [PCM] single channel: audio/wav
- WAV (RIFF header) 8kHz 8-bit mono A-law [PCM] single channel: audio/wav

The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for recording, the mu-law format must be used. For playback with the 'audio/basic' MIME type, platforms must support the mu-law format and may support the 'au' format.

Appendix E: MIME Types and File Suffix

This appendix is Non-Normative.

The W3C Voice Browser Working Group has applied to IETF to register a MIME type for the Speech Synthesis Markup Language. The current proposal is to use "application/ssml+xml".

The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where "speak" is the root element.

Appendix F: Features Under Consideration for Future Versions

This appendix is Non-Normative.

The following features are under consideration for versions of the Speech Synthesis Markup Language Specification after version 1.0:

Appendix G: Internationalization

This appendix is Normative.

SSML is an application of XML 1.0 and thus supports Unicode, which defines a standard universal character set.

Additionally, SSML provides a mechanism for precise control of the input and output languages via the use of the "xml:lang" attribute. This facility provides: