This section is informative.
Both DTDs and XML Schema are designed to accomplish the same fundamental task: to define the structure of XML document types. In this sense both are simply different text representations for the same underlying data structures. However, Schema and DTDs differ significantly in several ways, both in structure and capabilities.
Some differences worth noting are:
Common XML features - XML Schema are XML documents themselves and therefore share many aspects of the languages they define.
Data typing - schemas are designed with a much larger set of built-in data types than DTDs, and provide methods for creating user-defined types.
Namespaces - DTDs only partially support XML Namespaces, which are inherently a part of XML Schema.
Extension - XML Schema have a rich set of extension mechanisms including inheritance, redefinition, and substitution.
Entities - there is no mechanism in XML Schema corresponding to the use of entities for data abstraction in DTDs. In many cases the functionality of entities can be replaced through other XML-based mechanisms. However, XML Schema currently provides no support for the named character entity references used in XHTML.
DTDs and Document Order Dependence - a more subtle feature of modularized DTDs is their dependence on the document order; the order in which elements and entities are defined within DTD files has a large impact on language development. XML Schema are far less dependent on document order.
XML language definitions, regardless of their text representation, contain at least three types of data structures. When combined into a coherent and consistent whole, they form a complete language definition. These three components are content models, attribute lists, and elements.
Additional abstract data structures may be defined for use in the language definition, such as common content models or attribute groups, whose use is shared by other data structures within the language definition. The definition of these structures is the primary task of language development, and the core of the modularization framework.
This schema modularization framework consists of two parts:
1) A set of schema modules that conform to the abstract modules in XHTML
2) A set of modularization conventions that describe how the individual modules work together, and how they can be modified or extended.
In XHTML-MOD, every object in the DTDs is represented by an XML entity. These entities are then composed into larger sets of entities and so on, resulting in a set of data abstractions that can be generalized and used modularly. These multiple levels of abstraction are tied together by the use of a specific naming convention and a set of abstract modules.
Generic classes of entities (composed of sub- and sub-sub-entities) are used to create definitions of the three components listed above. Content models, attribute lists and elements are defined separately, sometimes in separate modules, and the ordering of the modules in the DTD structure is strictly defined (due to document order dependence). They are then combined to form the resulting document type. Extensibility is accomplished through the extensive use of INCLUDE/IGNORE sections in the DTD modules. How each of these structures relates to its Schema-based counterpart is summarized in Table 1 below.
Both the DTD and schema-based modularization frameworks implement a set of formalized data structures, often in a conceptually similar way. The modularization framework described here is designed around the use of similar data structures, which can be represented (more or less) equally well in either representation. This is accomplished through the use of a straightforward mapping of data structures defined in the DTD modules onto equivalent data structures in the XML Schema language.
In XHTML-MOD, content models for elements are defined using three classes of entities, identified through the naming conventions by the suffixes ".content", ".class", and ".mix". Each of these classes of entities is mapped onto a corresponding Schema counterpart in the following way:
".content" models - these models are used to define the contents of individual elements. For each element there is a corresponding ".content" object. In XML Schema, ".content" entities are mapped directly onto groups:
DTD:
<!ENTITY % html.content "(head+,body+)">
Schema:
<group name="html.content">
  <sequence>
    <element ref="head" minOccurs="1"/>
    <element ref="body" minOccurs="1"/>
  </sequence>
</group>
The contents of ".content" groups are often classes or mixes.
".class" models - these models are used to define abstract classes of content models made up of either ".content" entities or other ".class" entities (or elements). In XML Schema they correspond to groups that may also contain substitution groups:
DTD:
<!ENTITY % Misc.class "%Edit.class; %Script.class; %Misc.extra;">
Schema:
<group name="Misc.class">
  <choice minOccurs="0" maxOccurs="unbounded">
    <element ref="Edit.class" abstract="true"/>
    <element ref="Script.class" abstract="true"/>
    <element ref="Misc.extra" abstract="true"/>
  </choice>
</group>
".mix" models - these models correspond to content models that are mixed groupings of ".class", ".content", and ".mix" entities and serve as abstract content models often used in common by many elements in the DTD. They correspond to groups in XML Schema:
DTD:
<!ENTITY % Block.mix "%Heading.class; | %List.class; | %Block.class; %Misc.class;">
Schema:
<group name="Block.mix">
  <choice minOccurs="0" maxOccurs="unbounded">
    <group ref="Heading.class"/>
    <group ref="List.class"/>
    <group ref="Block.class"/>
    <group ref="Misc.class"/>
  </choice>
</group>
In addition to these three content model groupings, XHTML-MOD includes an additional grouping ".extra". These are currently omitted from the schema modules. (If needed, a developer could add them to the schema modules in a conformant way.)
Attributes and Attribute lists in DTDs correspond directly to attribute and attributeGroup elements in XML Schema. The translation from one to the other is relatively simple and straightforward. Here is an example:
DTD:
<!ENTITY % title.attrib "title %Text.datatype; #IMPLIED">
Schema:
<attributeGroup name="title">
  <attribute name="title" type="string"/>
</attributeGroup>
Complex attribute groups that are used by many different elements are grouped in the DTDs using entities suffixed with ".attrib". These attribute entities map directly onto attributeGroup elements in XML Schema as shown above.
The XML Schema specification allows elements as well as attribute values to be strongly typed. In defining elements in the modularized schema, an element type is created for each element that is a complex type composed of the content model (element.content) and the attribute list (element.attlist) as shown below:
<complexType name="form.type">
  <group ref="form.content"/>
  <attributeGroup ref="form.attlist"/>
</complexType>
Elements are then declared to be of the type element.type:
<element name="form" type="form.type"/>
This allows the author the greatest degree of flexibility while retaining strict type checking via XML Schema. It also allows for extension of the element via type substitution.
Note that in the case of an element with a mixed content model, a complexType is necessary.
In summary, each element is composed of a content model and an attribute list, which are composed into a type for that element.
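As an informal illustration of extension via type substitution, the following sketch derives a new type from form.type. The names form.extended.type and autocomplete are hypothetical and are not part of this framework. The fragment assumes that the xsd prefix is bound to the XML Schema namespace and that the default namespace is the schema's target namespace.

```xml
<!-- Hypothetical derived type: everything in form.type plus one attribute. -->
<xsd:complexType name="form.extended.type">
  <xsd:complexContent>
    <xsd:extension base="form.type">
      <!-- "autocomplete" is an illustrative attribute, not an XHTML one. -->
      <xsd:attribute name="autocomplete" type="xsd:boolean"/>
    </xsd:extension>
  </xsd:complexContent>
</xsd:complexType>
```

An instance document could then request the derived type on a form element by setting xsi:type="form.extended.type", assuming the xsi prefix is bound to http://www.w3.org/2001/XMLSchema-instance.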
XML Schema allows inheritance and redefinition of elements, groups, attributes and attributeGroups. In several cases modules require modification of previously declared attribute lists. This is done by using the <xsd:redefine> element to redefine the attributeGroup that needs to be modified:
<!-- add the new attribute to the caption element's attribute group;
     redefine must precede other declarations in the schema document -->
<redefine schemaLocation="xhtml-table-01.xsd">
  <attributeGroup name="caption.attlist">
    <!-- the self-reference pulls in the original definition -->
    <attributeGroup ref="caption.attlist"/>
    <attributeGroup ref="align.legacy.attlist"/>
  </attributeGroup>
</redefine>
<!-- new attribute to be added -->
<attributeGroup name="align.legacy.attlist">
  <attribute name="align">
    <simpleType>
      <restriction base="NMTOKEN">
        <enumeration value="left"/>
        <enumeration value="center"/>
        <enumeration value="right"/>
        <enumeration value="justify"/>
      </restriction>
    </simpleType>
  </attribute>
</attributeGroup>
In this example, we redefine the attribute list for the caption element in the tables module to add the align attribute defined in align.legacy.attlist.
The modularized DTDs contain support mechanisms for XHTML. Some of these are DTD-specific and are not fully supported in XML Schema.
This modularization framework attempts to recreate these support structures to the greatest extent possible.
Notations are an SGML feature that allows non-SGML data within documents to be interpreted locally [CATALOG]. Notations for XHTML are preserved in the Schema modules using the notation element in a straightforward way.
DTD:
<!NOTATION character PUBLIC "-//W3C//NOTATION XHTML Datatype: Character//EN">
Schema:
<notation name="character" public="-//W3C//NOTATION XHTML Datatype: Character//EN"/>
The strong typing mechanism in XML Schema, along with the large set of intrinsic types and the ability to create user-defined types, provides for a high level of type safety in instance documents. This feature can be used to express more strict data type constraints, such as those of attribute values, when using XML Schema for validation.
DTD:
<!ENTITY % Length.datatype "CDATA" >
Schema:
<simpleType name="Length">
  <restriction base="string"/>
</simpleType>
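The Length mapping above accepts any string, just as CDATA does in the DTD. As a sketch of how a stricter constraint might be expressed, the fragment below adds a pattern facet. It assumes that a length is either a number of pixels or a percentage, as in the HTML 4 definition of Length; the pattern itself is illustrative and is not taken from the schema modules. The xsd prefix is assumed to be bound to the XML Schema namespace.

```xml
<xsd:simpleType name="Length">
  <xsd:restriction base="xsd:string">
    <!-- One or more digits, optionally followed by a percent sign,
         e.g. "120" (pixels) or "50%" (percentage). -->
    <xsd:pattern value="[0-9]+%?"/>
  </xsd:restriction>
</xsd:simpleType>
```

With such a definition, a value like "wide" would be rejected during validation, whereas the DTD declaration would accept it.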
XML Schema provides no means of duplicating XHTML's named character entity mechanism. In most cases data abstraction through entities can be dispensed with in schemas. However, in the case of named character references, no replacement method is available.
Character entities are used to represent characters occurring in document data that may not be processed natively on the user's machine, for instance the copyright symbol. XHTML makes use of three sets of named character entities: Latin 1, Symbols, and Special.
A general solution for the resolution of language-specific named character entities is outside the scope of this document.
Entities are currently referenced in this framework as notations; strictly speaking, this is incorrect. However, it is considered desirable to include references to the XHTML-specific entity sets.
Here is an example of the use of the notation declaration syntax for named character references:
<notation name="XHTML-Latin-1" id="xhtml-lat1"
          public="-//W3C//ENTITIES Latin 1 for XHTML//EN"
          system="http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent"/>
The following table summarizes the mapping of DTD data structures onto XML Schema structures.
DTD Entity | Use | Schema Element |
.content | Element content model | group |
.class | Abstract content model | group |
.mix | Abstract content model | group |
.attlist | Attribute lists | attributeGroup |
.attrib | Attributes | attribute |
.extra | Abstract attribute group | attributeGroup |
elements | Element definitions | element + complexType |
attribute redefinition | Attribute list redefinition | attributeGroup with redefine |
notation | SGML-specific | notation |
datatypes | Attribute datatypes | simpleType |
entities | Character replacement | notation reference |
DTD "driver" | Framework document | "Hub" Schema document |
One further issue of note in the conversion of DTDs to XML Schema is that it is absolutely necessary to define all elements globally. Otherwise they are not considered to be in the XHTML namespace but only "associated" with it. This document does not make use of this association feature in XML Schema.
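The difference between global and local declarations can be sketched as follows. The content models are deliberately simplified (li is given a plain string type) and the fragment is illustrative only, not an excerpt from the schema modules.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.w3.org/1999/xhtml"
            targetNamespace="http://www.w3.org/1999/xhtml"
            elementFormDefault="unqualified">

  <!-- Global declaration: a direct child of xsd:schema, so "li"
       belongs to the target (XHTML) namespace. -->
  <xsd:element name="li" type="xsd:string"/>

  <xsd:element name="ul">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Reference to the global declaration. -->
        <xsd:element ref="li" maxOccurs="unbounded"/>
        <!-- A local declaration placed here instead, such as
             <xsd:element name="li" type="xsd:string"/>,
             would be unqualified under elementFormDefault="unqualified",
             that is, not in the XHTML namespace. -->
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```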
This section is normative.
This modularization framework consists of a complete set of XHTML schema modules and a set of framework conventions that describe how to use them. The use of the framework conventions is required for conformance.
The modularized XHTML schema uses three types of files, which when combined comprise the entire XHTML grammar definition.
The Schema hub document is the base document for the schema. It contains only annotations and modules, which in turn contain <xsd:include> statements referencing other module files. The hub document corresponds to the DTD "driver" file in XHTML-MOD, but is much simpler. The hub document allows the author to modify the schema's contents by the simple expedient of commenting out modules that are not used. Note that some modules are always required in order to ensure conformance.
The (non-normative) hub document described here contains <include> elements for two modules, named "required" and "optional". Each of these included modules is itself a module container file.
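A hub document of this shape might look like the following minimal sketch. The file names and relative paths are hypothetical, chosen only to follow the naming convention and directory hierarchy described in this section; they are not the names of the actual module container files.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.w3.org/1999/xhtml"
            targetNamespace="http://www.w3.org/1999/xhtml"
            version="1.1"
            elementFormDefault="unqualified">

  <xsd:annotation>
    <xsd:documentation>
      Illustrative hub document: contains only annotations and includes.
    </xsd:documentation>
  </xsd:annotation>

  <!-- Module container for the required modules (hypothetical file name). -->
  <xsd:include schemaLocation="req/xhtml-required-01.xsd"/>

  <!-- Module container for the optional modules (hypothetical file name). -->
  <xsd:include schemaLocation="opt/xhtml-optional-01.xsd"/>
</xsd:schema>
```

Because the hub contains no element definitions of its own, tailoring the schema amounts to editing the include elements in the hub or in the containers it references.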
Module container files, reasonably enough, include other modules. Modules and their containers are organized according to function. Including the hub document, which is a special case of a module container, there are ten included module container files.
In addition to the module container files listed above, there are around forty schema modules which contain only element definitions and their associated attribute and content model definitions. By convention, Schema modularization files may contain either <include> statements or element definitions but not both.
In order to easily identify the contents of any particular schema module file, it is useful to provide here a module naming convention syntax. This syntax also provides a simple means of distinguishing module files based on their language version, which may improve maintainability of the modules themselves.
The module naming convention adopted here is the same in almost all respects as that used in XHTML-MOD.
Schema modules for XHTML should have names that are:
Modules used in this modularization framework must have names that conform to the following syntax:
Pattern |
languagename-filecontentsdescription-versionnumber.xsd |
Example |
xhtml-table-01.xsd |
Exceptions to this rule are made for the Schema hub modules, whose names follow the same pattern but may omit the file contents description for brevity.
Hub module names may omit the leading zero in the version number, but should include the minor version number.
Example: xhtml-1.1.xsd
In the case where a hub module contains elements or attributes from external namespaces, the name(s) of the external module(s) should be appended to the base language name using the "+" character.
Example: xhtml+fml-1.0.xsd
This module naming convention is intended also to comply with the required use of the media type in [XHTMLMIME].
In order to establish a physical structure for the composition of the Schema modules that corresponds to the abstract modules in XHTML, a module hierarchy structure has been used to organize the physical modules. The hierarchy structure looks like this:
xhtml/
xhtml/req/
xhtml/req/framework/
xhtml/req/core/
xhtml/req/core/text/
xhtml/opt/
xhtml/opt/pres/
xhtml/opt/legacy/
xhtml/opt/legacy/misc/
xhtml/opt/legacy/frames/
These correspond to the divisions of XHTML into abstract modules described in detail in Section 3.2. The hierarchy structure is intended to match the abstract module structure as closely as possible. This feature is not present in DTD modularization, and is not required for Schema modularization. It does, however, allow the developer to organize the modules in accordance with their hierarchical structure. The directories listed in Table 2 also correspond exactly to the module container modules in this framework.
The consistent use of naming conventions is important for the maintenance and development of complex software applications.
Adhering to these conventions provides numerous benefits to developers:
Simplifies testing and debugging by managing complexity
Eases maintenance by allowing any developer to read and understand another developer's code
Provides self-documenting code by using descriptive names and predictable naming conventions
Enforces encapsulation by using consistent naming conventions for public and private knowledge.
With few exceptions, the naming conventions used in XHTML-MOD are preserved in this framework.
The naming convention in XHTML-MOD uses suffixing of object names to indicate functionality, as described below.
Abstract attribute groups and attribute lists are suffixed with the ".attrib" and ".attlist" suffixes respectively.
Three different suffixes are used in content model names. They are ".content" for element content models, and ".class" or ".mix" for abstract content models.
Element names are not suffixed in XHTML-MOD. This document uses the notion of element types, which are complexTypes used to define elements and are suffixed with ".type". The ".type" suffix was used in XHTML-MOD for attribute data types. This is superfluous in XML Schema (since attribute types are arguments to the "type" attribute) and so the suffix is used in a different way in this framework.
This document establishes a convention for the internal structure of XHTML Schema modules. This convention provides a consistent and predictable way of organizing schema module files internally. This convention applies also to the hub document, which is itself simply a module of modules, albeit a somewhat specialized one.
Each schema module is composed of several components, some of which are required for functional reasons and some of which provide metadata as a convenience to the author. Not every component is included in every module.
Each file begins with an <xsd:schema> root element (after the optional XML declaration and DOCTYPE).
In the XHTML schema modules, the version number for the specific language being defined (e.g. "1.1") is used as the default value of the version attribute on the schema element.
This framework uses the value "unqualified" for the elementFormDefault attribute on the schema root element. Elements within the XHTML namespace do not need to use a namespace prefix.
After the root element each module contains an annotation element containing several documentation sections briefly describing the purpose of the module.
This is an annotation element that contains a short description of the module and its purpose.
An annotation element containing authoring and versioning information for the module should always be included.
The standard W3C copyright statement is included in each module through the use of an include element. An exception is the hub document, which contains the full copyright text.
This is a module specific documentation element providing detailed information about the file's contents, its organization, and any noteworthy items of interest to developers.
Module elements contain include statements, import statements, or other modules (or comments). They must precede any other definitions in the file.
These include groups with names ending in ".content", ".class", or ".mix".
These are suffixed with either ".attrib" or ".attlist".
These are complexType elements defining each element's type.
These define individual elements in the module.
Additional constraints on the internal structure of schema modules are:
Each module must contain include statements for other modules or data structure definitions, but not both.
Each module must include at least sections 1 and 2 above, as well as either section 3 or some combination of sections 4-7.
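The internal structure described above can be sketched as a skeleton module. The note and notepara elements are hypothetical, invented purely for illustration, and the section numbering in the comments assumes the ordering used in this section (root element, annotations, module elements, content models, attribute groups, element types, elements).

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns="http://www.w3.org/1999/xhtml"
            targetNamespace="http://www.w3.org/1999/xhtml"
            version="1.1"
            elementFormDefault="unqualified">
  <!-- 1. The xsd:schema root element above. -->

  <!-- 2. Annotation describing the module. -->
  <xsd:annotation>
    <xsd:documentation>
      Hypothetical example module defining a "note" element.
    </xsd:documentation>
  </xsd:annotation>

  <!-- 3. Module elements (includes) are omitted: this module holds
       definitions, so by convention it contains no includes. -->

  <!-- 4. Content model. -->
  <xsd:group name="note.content">
    <xsd:sequence>
      <xsd:element ref="notepara" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:group>

  <!-- 5. Attribute list. -->
  <xsd:attributeGroup name="note.attlist">
    <xsd:attribute name="title" type="xsd:string"/>
  </xsd:attributeGroup>

  <!-- 6. Element type, combining content model and attribute list. -->
  <xsd:complexType name="note.type">
    <xsd:group ref="note.content"/>
    <xsd:attributeGroup ref="note.attlist"/>
  </xsd:complexType>

  <!-- 7. Element declarations (global). -->
  <xsd:element name="notepara" type="xsd:string"/>
  <xsd:element name="note" type="note.type"/>
</xsd:schema>
```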
The handling of namespaces in XML Schema is entirely different from that in XHTML-MOD. Namespaces are integral to XML Schema and their use in modularization arises naturally from the schema syntax.
One convention chosen for this framework is that the names of elements and attributes in the modules are unqualified, i.e., no namespace prefix is required for XHTML elements.
This is set by using the value of "unqualified" on the elementFormDefault attribute of the xsd:schema element.
A consistent commenting convention has been imposed on the modules described here. The purpose of a commenting convention is to allow for generating documentation from the comments (as well as general comprehension). Documentation elements containing Annotation-level comments are assumed to be of the highest importance and should be used to denote information about the module itself, and for important notes for developers.
File-level comments are denoted as usual with SGML comment delimiters "<!--" and "-->". By means of this convention, modules can become self-documenting. Tools for extracting these comments and formatting them suitably may (hopefully) be developed in the future.