Copyright ©2001 W3C® (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This specification defines the Second Edition of XHTML 1.0, a reformulation of HTML 4 as an XML 1.0 application, and three DTDs corresponding to the ones defined by HTML 4. The semantics of the elements and their attributes are defined in the W3C Recommendation for HTML 4. These semantics provide the foundation for future extensibility of XHTML. Compatibility with existing HTML user agents is possible by following a small set of guidelines.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.
This document is the second edition of the XHTML 1.0 specification incorporating the errata changes as of 4 October 2001. It is released as a W3C Working Draft to gather public feedback before its final release as a W3C Recommendation. This document should not be used as reference material or cited as a normative reference from another document. The review period for this Working Draft is 4 weeks ending 1 November 2001.
This second edition is not a new version of XHTML 1.0 (first published 26 January 2000). The changes in this document reflect corrections applied as a result of comments submitted by the community and as a result of ongoing work within the HTML Working Group. There are no substantive changes in this document - only the integration of various errata.
Please report errors in this document to www-html-editor@w3.org (archive).
This document has been produced as part of the W3C HTML Activity. The goals of the HTML Working Group (members only) are discussed in the HTML Working Group charter.
A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.
Public discussion on HTML features takes place on the mailing list www-html@w3.org (archive).
XHTML is a family of current and future document types and modules that reproduce, subset, and extend HTML 4 [HTML]. XHTML family document types are XML based, and ultimately are designed to work in conjunction with XML-based user agents. This family and its evolution are discussed in more detail in the section on Future Directions.
XHTML 1.0 (this specification) is the first document type in the XHTML family. It is a reformulation of the three HTML 4 document types as applications of XML 1.0 [XML]. It is intended to be used as a language for content that is both XML-conforming and, if some simple guidelines are followed, operates in HTML 4 conforming user agents. Developers who migrate their content to XHTML 1.0 will realize the following benefits:
The XHTML family is the next step in the evolution of the Internet. By migrating to XHTML today, content developers can enter the XML world with all of its attendant benefits, while still remaining confident in their content's backward and future compatibility.
HTML 4 [HTML] is an SGML (Standard Generalized Markup Language) application conforming to International Standard ISO 8879, and is widely regarded as the standard publishing language of the World Wide Web.
SGML is a language for describing markup languages, particularly those used in electronic document exchange, document management, and document publishing. HTML is an example of a language defined in SGML.
SGML has been around since the mid-1980s and has remained quite stable. Much of this stability stems from the fact that the language is both feature-rich and flexible. This flexibility, however, comes at a price, and that price is a level of complexity that has inhibited its adoption in a diversity of environments, including the World Wide Web.
HTML, as originally conceived, was to be a language for the exchange of scientific and other technical documents, suitable for use by non-document specialists. HTML addressed the problem of SGML complexity by specifying a small set of structural and semantic tags suitable for authoring relatively simple documents. In addition to simplifying the document structure, HTML added support for hypertext. Multimedia capabilities were added later.
In a remarkably short space of time, HTML became wildly popular and rapidly outgrew its original purpose. Since HTML's inception, there has been rapid invention of new elements for use within HTML (as a standard) and for adapting HTML to vertical, highly specialized, markets. This plethora of new elements has led to compatibility problems for documents across different platforms.
As the heterogeneity of both software and platforms rapidly proliferates, it is clear that the suitability of 'classic' HTML 4 for use on these platforms is somewhat limited.
XML™ is the shorthand name for Extensible Markup Language [XML].
XML was conceived as a means of regaining the power and flexibility of SGML without most of its complexity. Although a restricted form of SGML, XML nonetheless preserves most of SGML's power and richness, and yet still retains all of SGML's commonly used features.
While retaining these beneficial features, XML removes many of the more complex features of SGML that make the authoring and design of suitable software both difficult and costly.
The benefits of migrating to XHTML 1.0 are described above. Some of the benefits of migrating to XHTML in general are:
The following terms are used in this specification. These terms extend the definitions in [RFC2119] in ways based upon similar definitions in ISO/IEC 9945-1:1990 [POSIX.1]:
This version of XHTML provides a definition of strictly conforming XHTML documents, which are restricted to tags and attributes from the XML and XHTML namespaces. See Section 3.1.2 for information on using XHTML with other namespaces, for instance, to include metadata expressed in RDF within XHTML documents.
A Strictly Conforming XHTML Document is a document that requires only the facilities described as mandatory in this specification. Such a document must meet all of the following criteria:
It must conform to the constraints expressed in one of the three DTDs found in Appendix A.
The root element of the document must be <html>.
The root element of the document must designate the XHTML namespace using the xmlns attribute [XMLNAMES]. The namespace for XHTML is defined to be http://www.w3.org/1999/xhtml.
There must be a DOCTYPE declaration in the document prior to the root element. The public identifier included in the DOCTYPE declaration must reference one of the three DTDs found in Appendix A using the respective Formal Public Identifier. The system identifier may be changed to reflect local system conventions.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
The DTD subset must not be used to override any parameter entities in the DTD.
Here is an example of a minimal XHTML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p>
  </body>
</html>
Note that in this example, the XML declaration is included. An XML declaration like the one above is not required in all XML documents. XHTML document authors are strongly encouraged to use XML declarations in all their documents. Such a declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding was determined by a higher-level protocol.
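Note (informative): some of the criteria above can be checked mechanically. The following is a minimal sketch in Python, assuming the standard library xml.dom.minidom parser; the helper name check_basic_criteria is hypothetical and not part of this specification. It checks only the root element, its namespace, and the public identifier of the DOCTYPE declaration. It does not validate the document against a DTD.

```python
from xml.dom import minidom

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
}

MINIMAL = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Virtual Library</title></head>
  <body><p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p></body>
</html>"""


def check_basic_criteria(source):
    """Return a list of problems; an empty list means these checks passed.

    Hypothetical helper for illustration. It performs no DTD validation.
    """
    doc = minidom.parseString(source)  # raises ExpatError if not well-formed
    root = doc.documentElement
    problems = []
    if root.localName != "html":
        problems.append("root element is not html")
    if root.namespaceURI != XHTML_NS:
        problems.append("root element is not in the XHTML namespace")
    if doc.doctype is None or doc.doctype.publicId not in XHTML_PUBLIC_IDS:
        problems.append("missing or unrecognized DOCTYPE public identifier")
    return problems


print(check_basic_criteria(MINIMAL))  # []
```

A document that passes these checks may still fail DTD validation, so a validating parser is needed to establish strict conformance.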
The XHTML namespace may be used with other XML namespaces as per [XMLNAMES], although such documents are not strictly conforming XHTML 1.0 documents as defined above. Future work by W3C will address ways to specify conformance for documents involving multiple namespaces.
The following example shows the way in which XHTML 1.0 could be used in conjunction with the MathML Recommendation:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>A Math Example</title>
  </head>
  <body>
    <p>The following is MathML markup:</p>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply> <log/>
        <logbase>
          <cn> 3 </cn>
        </logbase>
        <ci> x </ci>
      </apply>
    </math>
  </body>
</html>
The following example shows the way in which XHTML 1.0 markup could be incorporated into another XML namespace:
<?xml version="1.0" encoding="UTF-8"?>
<!-- initially, the default namespace is "books" -->
<book xmlns='urn:loc.gov:books'
    xmlns:isbn='urn:ISBN:0-395-36341-6' xml:lang="en" lang="en">
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <!-- make HTML the default namespace for a hypertext commentary -->
    <p xmlns='http://www.w3.org/1999/xhtml'>
        This is also available <a href="http://www.w3.org/">online</a>.
    </p>
  </notes>
</book>
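Note (informative): a namespace-aware processor can tell the embedded XHTML elements apart from the surrounding vocabulary. The sketch below assumes Python's standard xml.etree.ElementTree module, which reports each element name as {namespace}local-name; the book markup is abridged from the example above.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

BOOK = """<book xmlns='urn:loc.gov:books'
      xmlns:isbn='urn:ISBN:0-395-36341-6'>
  <title>Cheaper by the Dozen</title>
  <isbn:number>1568491379</isbn:number>
  <notes>
    <p xmlns='http://www.w3.org/1999/xhtml'>
      This is also available <a href="http://www.w3.org/">online</a>.
    </p>
  </notes>
</book>"""

root = ET.fromstring(BOOK)

# Collect, in document order, every element that lives in the XHTML namespace.
xhtml_elements = [
    el.tag for el in root.iter() if el.tag.startswith("{" + XHTML_NS + "}")
]

print(root.tag)        # {urn:loc.gov:books}book
print(xhtml_elements)  # the p element and the a element nested inside it
```

Note that the title element here belongs to the books namespace, not to XHTML, even though XHTML also defines an element with that local name.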
A conforming user agent must meet all of the following criteria:
User agents must recognize attributes of type ID (e.g. the id attribute on most XHTML elements) as fragment identifiers.
White space is handled according to the following rules. The following characters are defined in [XML] as white space characters:
The XML processor normalizes different systems' line end codes into a single LINE FEED character, which is passed up to the application.
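Note (informative): line-end normalization can be observed with any conforming XML processor. The sketch below assumes Python's standard xml.etree.ElementTree module (backed by the Expat parser) and feeds it character data containing CR LF, a lone CR, and a lone LF.

```python
import xml.etree.ElementTree as ET

# Three different line-end conventions inside one element.
raw = "<p>one\r\ntwo\rthree\nfour</p>"

text = ET.fromstring(raw).text
print(repr(text))  # 'one\ntwo\nthree\nfour'
```

The application therefore only ever sees LINE FEED characters, which is why the rules that follow speak of LINE FEED alone.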
The user agent must process white space characters in the data received from the XML processor as follows:
If the 'xml:space' attribute is set to 'preserve', white space characters must be preserved and consequently LINE FEED characters within a block must not be converted.
If the 'xml:space' attribute is not set to 'preserve', then:
White space in attribute values is processed according to [XML].
In determining how to convert a LINE FEED character a user agent must meet the following rules, whereby the script of characters on either side of the LINE FEED determines the choice of the replacement. The assignment of script names to all characters is done in accordance with the Unicode [UNICODE] technical report TR#24 (Script Names).
Note (informative): Some scripts, such as HAN, HIRAGANA, KATAKANA, KHMER, LAO, MYANMAR, THAI do not use space characters for word boundary delimitation, but may still use these space characters for delimitation of sentences or fragments of sentence. If such a character occurs as the last character before a LINE FEED character, or a character following a LINE FEED character, it may be eliminated by the white space processing described above. Several solutions are possible:
Note that in order to produce a Canonical XHTML document, the rules above must be applied and the rules in [XMLC14N] must also be applied to the document.
This section is informative.
Due to the fact that XHTML is an XML application, certain practices that were perfectly legal in SGML-based HTML 4 [HTML] must be changed.
Well-formedness is a new concept introduced by [XML]. Essentially this means that all elements must either have closing tags or be written in a special form (as described below), and that all the elements must nest properly.
Although overlapping is illegal in SGML, it is widely tolerated in existing browsers.
CORRECT: nested elements.
<p>here is an emphasized <em>paragraph</em>.</p>
INCORRECT: overlapping elements
<p>here is an emphasized <em>paragraph.</p></em>
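Note (informative): an XML processor rejects overlapping elements outright. The sketch below assumes Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the XML processor accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


nested = "<p>here is an emphasized <em>paragraph</em>.</p>"
overlapping = "<p>here is an emphasized <em>paragraph.</p></em>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```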
XHTML documents must use lower case for all HTML element and attribute names. This difference is necessary because XML is case-sensitive; e.g., <li> and <LI> are different tags.
In SGML-based HTML 4 certain elements were permitted to omit the end tag, with the elements that followed implying closure. XML does not allow end tags to be omitted. All elements other than those declared in the DTD as EMPTY must have an end tag. Elements that are declared in the DTD as EMPTY can have an end tag or can use empty element shorthand (see Empty Elements).
CORRECT: terminated elements
<p>here is a paragraph.</p><p>here is another paragraph.</p>
INCORRECT: unterminated elements
<p>here is a paragraph.<p>here is another paragraph.
All attribute values must be quoted, even those which appear to be numeric.
CORRECT: quoted attribute values
<td rowspan="3">
INCORRECT: unquoted attribute values
<td rowspan=3>
XML does not support attribute minimization. Attribute-value pairs must be written in full. Attribute names such as compact and checked cannot occur in elements without their value being specified.
CORRECT: unminimized attributes
<dl compact="compact">
INCORRECT: minimized attributes
<dl compact>
Empty elements must either have an end tag or the start tag must end with />. For instance, <br/> or <hr></hr>. See HTML Compatibility Guidelines for information on ways to ensure this is backward compatible with HTML 4 user agents.
CORRECT: terminated empty elements
<br/><hr/>
INCORRECT: unterminated empty elements
<br><hr>
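Note (informative): the rules in the preceding subsections on case, end tags, attribute quoting, attribute minimization and empty elements are all enforced by the XML processor itself. The sketch below assumes Python's standard xml.etree.ElementTree module. The helper name and the fragments are illustrative; each fragment is wrapped so that it is a complete element and differs from its counterpart only in the rule being tested.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the XML processor accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# (rule, accepted form, rejected form)
EXAMPLES = [
    ("names are case-sensitive",
     "<ul><li>item</li></ul>", "<ul><li>item</LI></ul>"),
    ("end tags are required",
     "<div><p>one</p><p>two</p></div>", "<div><p>one<p>two</div>"),
    ("attribute values are quoted",
     '<td rowspan="3"></td>', "<td rowspan=3></td>"),
    ("attributes are not minimized",
     '<dl compact="compact"></dl>', "<dl compact></dl>"),
    ("empty elements are terminated",
     "<div><br/><hr/></div>", "<div><br><hr></div>"),
]

for rule, good, bad in EXAMPLES:
    print(rule, is_well_formed(good), is_well_formed(bad))
```

Each line of output pairs a rule with True for the accepted form and False for the rejected form. Well-formedness alone does not establish validity: markup written entirely in upper case is well-formed XML but does not match the lower-case names declared in the XHTML DTDs.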
User agents will strip leading and trailing white space from attribute values and map sequences of one or more white space characters (including line breaks) to a single inter-word space (an ASCII space character for western scripts). See Section 3.3.3 of [XML].
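Note (informative): under Section 3.3.3 of [XML], the amount of normalization depends on the declared type of the attribute. The sketch below assumes Python's standard xml.etree.ElementTree module and a made-up element e with two made-up attributes: tokens is declared as NMTOKENS in an internal DTD subset, while free is left undeclared and is therefore treated as CDATA.

```python
import xml.etree.ElementTree as ET

# Hypothetical element and attribute names, for illustration only.
DOC = """<!DOCTYPE e [
  <!ATTLIST e tokens NMTOKENS #IMPLIED>
]>
<e tokens="  alpha
   beta  " free="one
two"/>"""

root = ET.fromstring(DOC)

# Tokenized type: leading/trailing space stripped, runs collapsed.
print(repr(root.get("tokens")))  # 'alpha beta'

# CDATA type: the line break is replaced by a single space.
print(repr(root.get("free")))    # 'one two'
```

For CDATA attributes, each white space character is replaced by a space but runs of spaces are not collapsed by the XML processor, which is one reason authors are advised to avoid line breaks and multiple white space characters in attribute values.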
In XHTML, the script and style elements are declared as having #PCDATA content. As a result, < and & will be treated as the start of markup, and entities such as &lt; and &amp; will be recognized as entity references by the XML processor to < and & respectively. Wrapping the content of the script or style element within a CDATA marked section avoids the expansion of these entities.
<script type="text/javascript">
<![CDATA[
... unescaped script content ...
]]>
</script>
CDATA sections are recognized by the XML processor and appear as nodes in the Document Object Model, see Section 1.3 of the DOM Level 1 Recommendation [DOM].
An alternative is to use external script and style documents.
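Note (informative): the effect of the CDATA marked section can be seen by parsing the same script content with and without it. The sketch below assumes Python's standard xml.etree.ElementTree module; the script body is an arbitrary illustration.

```python
import xml.etree.ElementTree as ET

WITH_CDATA = """<script type="text/javascript"><![CDATA[
if (a < b && c) { run(); }
]]></script>"""

WITHOUT_CDATA = """<script type="text/javascript">
if (a < b && c) { run(); }
</script>"""

# Inside the CDATA section, < and & are passed through as character data.
script = ET.fromstring(WITH_CDATA)
print(script.text.strip())  # if (a < b && c) { run(); }

# Without it, the same characters are read as markup and parsing fails.
try:
    ET.fromstring(WITHOUT_CDATA)
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)  # False
```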
SGML gives the writer of a DTD the ability to exclude specific elements from being contained within an element. Such prohibitions (called "exclusions") are not possible in XML.
For example, the HTML 4 Strict DTD forbids the nesting of an 'a' element within another 'a' element to any descendant depth. It is not possible to spell out such prohibitions in XML. Even though these prohibitions cannot be defined in the DTD, certain elements should not be nested. A summary of such elements and the elements that should not be nested in them is found in the normative Appendix B.
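Note (informative): because the DTD cannot express these prohibitions, a checker has to enforce them in application code. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module. The element names in the table are transcribed from Appendix B; the function names are hypothetical, and for brevity the sketch compares local names only and does not verify that each element is in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Element -> descendants it must not contain, at any depth (Appendix B).
PROHIBITED = {
    "a": {"a"},
    "pre": {"img", "object", "big", "small", "sub", "sup"},
    "button": {"input", "select", "textarea", "label", "button",
               "form", "fieldset", "iframe", "isindex"},
    "label": {"label"},
    "form": {"form"},
}


def local_name(tag):
    """Strip a leading {namespace} from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def find_violations(element, active=None):
    """Return (ancestor, descendant) pairs that break a prohibition."""
    active = dict(active or {})  # banned name -> ancestor that bans it
    name = local_name(element.tag)
    violations = []
    if name in active:
        violations.append((active[name], name))
    for banned in PROHIBITED.get(name, ()):
        active[banned] = name
    for child in element:
        violations.extend(find_violations(child, active))
    return violations


SAMPLE = """<body xmlns="http://www.w3.org/1999/xhtml">
  <p><a href="#x">outer <em><a href="#y">inner</a></em></a></p>
  <pre>text <img src="karen.jpg" alt="Karen" /></pre>
</body>"""

print(find_violations(ET.fromstring(SAMPLE)))  # [('a', 'a'), ('pre', 'img')]
```

The first violation is found even though an em element sits between the two a elements, since the prohibition applies to every descendant and not only to children.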
HTML 4 defined the name attribute for the elements a, applet, form, frame, iframe, img, and map. HTML 4 also introduced the id attribute. Both of these attributes are designed to be used as fragment identifiers.
In XML, fragment identifiers are of type ID, and there can only be a single attribute of type ID per element. Therefore, in XHTML 1.0 the id attribute is defined to be of type ID. In order to ensure that XHTML 1.0 documents are well-structured XML documents, XHTML 1.0 documents MUST use the id attribute when defining fragment identifiers on the elements listed above. See the HTML Compatibility Guidelines for information on ensuring such anchors are backwards compatible when serving XHTML documents as media type text/html.
Note that in XHTML 1.0, the name attribute of these elements is formally deprecated, and will be removed in a subsequent version of XHTML.
HTML 4 and XHTML both have some attributes that have pre-defined and limited sets of values (e.g. the type attribute of the input element). In SGML and XML, these are called enumerated attributes. Under HTML 4, the interpretation of these values was case-insensitive, so a value of TEXT was equivalent to a value of text. Under XML, the interpretation of these values is case-sensitive, and in XHTML 1 all of these values are defined in lower-case.
Although there is no requirement for XHTML 1.0 documents to be compatible with existing user agents, in practice this is easy to accomplish. Guidelines for creating compatible documents can be found in Appendix C.
XHTML Documents which follow the guidelines set forth in Appendix C, "HTML Compatibility Guidelines" may be labeled with the Internet Media Type "text/html", as they are compatible with most HTML browsers. This document makes no recommendation about MIME labeling of other XHTML documents.
XHTML 1.0 provides the basis for a family of document types that will extend and subset XHTML, in order to support a wide range of new devices and applications, by defining modules and specifying a mechanism for combining these modules. This mechanism will enable the extension and sub-setting of XHTML 1.0 in a uniform way through the definition of new modules.
As the use of XHTML moves from the traditional desktop user agents to other platforms, it is clear that not all of the XHTML elements will be required on all platforms. For example, a handheld device or a cell phone may only support a subset of XHTML elements.
The process of modularization breaks XHTML up into a series of smaller element sets. These elements can then be recombined to meet the needs of different communities.
These modules are defined in "Modularization of XHTML" [XHTMLMOD].
Modularization brings with it several advantages:
It provides a formal mechanism for sub-setting XHTML.
It provides a formal mechanism for extending XHTML.
It simplifies the transformation between document types.
It promotes the reuse of modules in new document types.
This appendix is normative.
These DTDs and entity sets form a normative part of this specification. The complete set of DTD files together with an XML declaration and SGML Open Catalog is included in the zip file and the gzip'd tar file for this specification. Users looking for local copies of the DTDs to work with should download and use those archives rather than using the specific DTDs referenced below.
These DTDs approximate the HTML 4 DTDs. It is likely that when the DTDs are modularized, a method of DTD construction will be employed that corresponds more closely to HTML 4.
The XHTML entity sets are the same as for HTML 4, but have been modified to be valid XML 1.0 entity declarations. Note the entity for the Euro currency sign (&euro; or &#8364; or &#x20AC;) is defined as part of the special characters.
This appendix is normative.
The following elements have prohibitions on which elements they can contain (see Section 4.9). This prohibition applies to all depths of nesting, i.e. it contains all the descendant elements.
a must not contain other a elements.
pre must not contain the img, object, big, small, sub, or sup elements.
button must not contain the input, select, textarea, label, button, form, fieldset, iframe or isindex elements.
label must not contain other label elements.
form must not contain other form elements.
This appendix is informative.
This appendix summarizes design guidelines for authors who wish their XHTML documents to render on existing HTML user agents.
Be aware that processing instructions are rendered on some user agents. However, also note that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16.
Include a space before the trailing / and > of empty elements, e.g. <br />, <hr /> and <img src="karen.jpg" alt="Karen" />. Also, use the minimized tag syntax for empty elements, e.g. <br />, as the alternative syntax <br></br> allowed by XML gives uncertain results in many existing user agents.
Given an empty instance of an element whose content model is not EMPTY (for example, an empty title or paragraph) do not use the minimized form (e.g. use <p> </p> and not <p />).
Use external style sheets if your style sheet uses < or & or ]]> or --. Use external scripts if your script uses < or & or ]]> or --. Note that XML parsers are permitted to silently remove the contents of comments. Therefore, the historical practice of "hiding" scripts and style sheets within "comments" to make the documents backward compatible is likely to not work as expected in XML-based implementations.
Avoid line breaks and multiple white space characters within attribute values. These are handled inconsistently by user agents.
Don't include more than one isindex element in the document head. The isindex element is deprecated in favor of the input element.
lang and xml:lang Attributes
Use both the lang and xml:lang attributes when specifying the language of an element. The value of the xml:lang attribute takes precedence.
In XML, URI-references [RFC2396] that end with fragment identifiers of the form "#foo" do not refer to elements with an attribute name="foo"; rather, they refer to elements with an attribute defined to be of type ID, e.g., the id attribute in HTML 4. Many existing HTML clients don't support the use of ID-type attributes in this way, so identical values may be supplied for both of these attributes to ensure maximum forward and backward compatibility (e.g., <a id="foo" name="foo">...</a>).
Further, since the set of legal values for attributes of type ID is much smaller than for those of type CDATA, the type of the name attribute has been changed to NMTOKEN. This attribute is constrained such that it can only have the same values as type ID, or as the Name production in XML 1.0 Section 2.3, production 5. Unfortunately, this constraint cannot be expressed in the XHTML 1.0 DTDs. Because of this change, care must be taken when converting existing HTML documents. The values of these attributes must be unique within the document, valid, and any references to these fragment identifiers (both internal and external) must be updated should the values be changed during conversion.
Note that the collection of legal values in XML 1.0 Section 2.3, production 5 is much larger than that permitted to be used in the ID and NAME types defined in HTML 4. When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used. See [HTML] Section 6.2 for more information.
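Note (informative): the pattern above can be applied directly as a regular expression when converting documents. The sketch below assumes Python's standard re module; the helper name is hypothetical.

```python
import re

# The backward-compatible pattern given in the text, anchored by fullmatch().
COMPATIBLE_ID = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_backward_compatible_id(value):
    """Return True if value is safe as both an id and a name in HTML 4."""
    return COMPATIBLE_ID.fullmatch(value) is not None


for candidate in ["foo", "section-1.2", "1st", "été", "a b"]:
    print(candidate, is_backward_compatible_id(candidate))
```

The first two candidates are accepted. The others are rejected because they start with a digit, contain a non-ASCII letter, or contain a space, respectively.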
Finally, note that XHTML 1.0 has deprecated the name attribute of the a, applet, form, frame, iframe, img, and map elements, and it will be removed from XHTML in subsequent versions.
To specify a character encoding in the document, use both the encoding attribute specification on the xml declaration (e.g. <?xml version="1.0" encoding="EUC-JP"?>) and a meta http-equiv statement (e.g. <meta http-equiv="Content-type" content='text/html; charset=EUC-JP' />). The value of the encoding attribute of the xml declaration takes precedence.
Some HTML user agents are unable to interpret boolean attributes when these appear in their full (non-minimized) form, as required by XML 1.0. Note this problem doesn't affect user agents compliant with HTML 4. The following attributes are involved: compact, nowrap, ismap, declare, noshade, checked, disabled, readonly, multiple, selected, noresize, defer.
The Document Object Model level 1 Recommendation [DOM] defines document object model interfaces for XML and HTML 4. The HTML 4 document object model specifies that HTML element and attribute names are returned in upper-case. The XML document object model specifies that element and attribute names are returned in the case they are specified. In XHTML 1.0, elements and attributes are specified in lower-case. This apparent difference can be addressed in two ways:
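Applications that access XHTML documents served as Internet media type text/html via the DOM can use the HTML DOM, and can rely upon element and attribute names being returned in upper-case from those interfaces.
Applications that access XHTML documents served as Internet media types text/xml or application/xml can also use the XML DOM. Elements and attributes will be returned in lower-case.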
Also, some XHTML elements may or may not appear in the object tree because they are optional in the content model (e.g. the tbody element within table). This occurs because in HTML 4 some elements were permitted to be minimized such that their start and end tags are both omitted (an SGML feature). This is not possible in XML. Rather than require document authors to insert extraneous elements, XHTML has made the elements optional. Applications need to adapt to this accordingly.
When an attribute value contains an ampersand, it must be expressed as a character entity reference (e.g. "&amp;"). For example, when the href attribute of the a element refers to a CGI script that takes parameters, it must be expressed as http://my.site.dom/cgi-bin/myscript.pl?class=guest&amp;name=user rather than as http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user.
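Note (informative): tools that generate XHTML can apply this escaping automatically. The sketch below assumes Python's standard xml.sax.saxutils.quoteattr function, which escapes the ampersand and adds the surrounding quotes, and uses xml.etree.ElementTree to confirm that an XML processor recovers the original URI.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

url = "http://my.site.dom/cgi-bin/myscript.pl?class=guest&name=user"

# quoteattr() returns the value already quoted and with & written as &amp;
markup = "<a href=%s>link</a>" % quoteattr(url)
print(markup)

# Round trip: the parser turns &amp; back into a plain ampersand.
print(ET.fromstring(markup).get("href") == url)  # True
```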
The Cascading Style Sheets level 2 Recommendation [CSS2] defines style properties which are applied to the parse tree of the HTML or XML documents. Differences in parsing will produce different visual or aural results, depending on the selectors used. The following hints will reduce this effect for documents which are served without modification as both media types:
In HTML 4 and XHTML, the style element can be used to define document-internal style rules. In XML, an XML stylesheet declaration is used to define style rules. In order to be compatible with this convention, style elements should have their fragment identifier set using the id attribute, and an XML stylesheet declaration should reference this fragment. For example:
<?xml-stylesheet href="http://www.w3.org/StyleSheets/TR/W3C-REC.css" type="text/css"?>
<?xml-stylesheet href="#internalStyle" type="text/css"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>An internal stylesheet example</title>
    <style type="text/css" id="internalStyle">
      code {
        color: green;
        font-family: monospace;
        font-weight: bold;
      }
    </style>
  </head>
  <body>
    <p>
      This is text that uses our
      <code>internal stylesheet</code>.
    </p>
  </body>
</html>
Some characters that are legal in HTML documents are illegal in XML documents. For example, in HTML the form feed character (U+000C) is treated as white space; in XHTML, due to XML's definition of characters, it is illegal.
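Note (informative): when converting existing HTML content, such characters can be located before the document is handed to an XML processor. The sketch below assumes Python's standard re module and builds the complement of the Char production in Section 2.2 of [XML]; the helper name is hypothetical.

```python
import re

# Anything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def illegal_xml_chars(text):
    """Return the code points in text that XML 1.0 does not allow."""
    return ["U+%04X" % ord(ch) for ch in ILLEGAL_XML_CHAR.findall(text)]


print(illegal_xml_chars("page one\x0cpage two"))  # ['U+000C']
print(illegal_xml_chars("tab\tand newline\n"))    # []
```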
The named character reference &apos; (the apostrophe, U+0027) was introduced in XML 1.0 but does not appear in HTML. Authors should therefore use &#39; instead of &apos; to work as expected in HTML 4 user agents.
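Note (informative): generators can emit the numeric reference rather than the named one. The sketch below assumes Python's standard xml.sax.saxutils.escape function, whose optional second argument maps additional characters to replacement strings; the helper name is hypothetical.

```python
from xml.sax.saxutils import escape


def escape_for_html4(text):
    """Escape &, < and >, and write the apostrophe as a numeric reference."""
    return escape(text, {"'": "&#39;"})


print(escape_for_html4("Karen's <notes> & more"))
# Karen&#39;s &lt;notes&gt; &amp; more
```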
This appendix is informative.
This specification was written with the participation of the members of the W3C HTML Working Group.
At publication of the second edition, the membership was:
At publication of the first edition, the membership was:
This appendix is informative.