Refinement Task Force Report

Eve Maler, Ashok Malhotra, and Allen Brown on behalf of the Refinement Task Force


22 September 1999 draft; incorporates skeleton recommendations and all known outstanding issues
(production edits $Revision: 1.4 $ $Date: 1999/09/24 21:24:55 $ by $Author: connolly $)
Table of Contents
1. Introduction
Motivation
Terminology
2. Analysis of the Problem Space
Contracts and Modification Tolerances
Modification in XML vs. Other Languages
Potential Forms of Content Model Modification
Potential Forms of Content Model Openness
Granularity of Tolerance for Final References
Implementation Considerations
Abstract Types
Defining a Base Type
Reuse Mechanisms
3. Existing Proposals
Schema: Structures
Schema NG
Views and Inheritance
4. Recommendations
Recommendations for Reuse
Recommendations for Modification
Recommendations for Subtyping
Recommendations for Openness
A. Formal Model for Schemas, Refinement, Tolerance, and Validity

Chapter 1. Introduction

The Refinement Task Force has been chartered to "address the issues of refinement, named model groups, and named attribute groups. The task force should put existing and/or new options for this functionality on an equal footing, i.e. set out strengths and weaknesses in a comparable way, examine relevance to existing use cases, etc., and if possible to make a recommendation on what approach to this functionality, if any, should appear in version 1.0." The relevant requirements we hope to address are "mechanisms to enable inheritance for element, attribute, and datatype definitions" and, possibly, "mechanisms for addressing the evolution of schemata."

This report discharges [or will soon] that duty. Conclusions and recommendations can be found in Chapter 4.

Status of this document: Preliminary recommendations have been provided; issues and incomplete sections are highlighted [in bold, like this]. We anticipate providing a more complete draft in time for the Antibes F2F.


Motivation

The design of any features for "refinement, named model groups, and named attribute groups" has the following motivations:

  • Reuse: The ability to reuse parts of a definition in another definition without having to redeclare them literally. For example, consider the information consisting of a name, address, and telephone number. This is the kind of thing that one may want to use in several element definitions and it would be nice to be able to do that by referring to the collective information by a single name, such as NameAddress. In DTDs, this requirement is primarily addressed through the use of parameter entities.

  • Modification: The ability to use an existing schema or individual definition (often authored by someone else) by modifying parts of it. The modifications may be desired for several reasons; for example, you may need to include extra information, or change the occurrence constraints on certain components, or change the names of some components.

  • Subtyping: Asserting an "is-a" relationship between one or more base types and a derived type, such that a type lattice can be constructed. This typing information would be made available as a contribution to the infoset.

Subtyping implies substitutability; that is, a construct with a derived type should, by design, be directly usable by a particular class of applications in all the ways that a construct of the base type is usable. Sometimes, such as in schema versioning and evolution, modification is interesting even without the substitutability property, because the relationship between old and new versions of a schema, or between an original and its third-party modifications, can be described in formal terms. Such a formal description can help in automating tasks related to evolution, such as upgrading documents to comply with the new version. In these cases, application code may have to be modified.


Terminology

We use reuse, modification, and subtyping in this report instead of terms such as refinement and inheritance because the latter have meanings from object-oriented languages that would color this discussion in the mind of the reader. As we shall show, the possible modifications to XML content models are more general and more powerful than refinement for object-oriented languages, and the conditions for substitutability are more complex.

Applications that consume XML content are distinguished in the section called Potential Forms of Content Model Modification in Chapter 2 in the following ways:

  • Context-insensitive (CI) applications work with a limited perspective of an XML document, ignoring nodes deemed to be irrelevant and often ignoring element order; this is typical of, though not exclusive to, "data" applications. Such applications are optimized for extraction, and it may be inefficient to use them for (say) document formatting. As a general rule, documents used with these applications are machine-generated.

  • Context-sensitive (CS) applications are expected to walk all the nodes and output them; this is typical of, though not exclusive to, "document" applications. Such applications are optimized for rules-based behavior, and it may be inefficient to use them for extraction. As a general rule, documents used with these applications are human-authored.

An example may make this distinction clear. Narrative documents often have a "header" element that contains various kinds of metadata about the documents, such as the author, the date of publication, and so on. Say these pieces of metadata are all in individual sub-elements. In order to populate a database with the information, you would use a context-insensitive application to locate the elements of interest, probably ignoring their order and the rest of the document. In order to format a title page for the document containing the metadata, you would use a context-sensitive application, which would be a lot easier to write if the original order of the metadata elements matched the desired output order. You might also use ancestry and other context cues to generate output text and achieve other sophisticated formatting effects.
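The header example can be sketched in a few lines of Python. This is only an illustration of the CI/CS distinction, not part of any proposal: the sample document, the element names (author, pubdate, title), and both function names are assumptions made up for the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical narrative document with a metadata header.
DOC = """<article>
  <header>
    <author>A. Writer</author>
    <pubdate>1999-09-22</pubdate>
    <title>Sample Report</title>
  </header>
  <body><para>Text.</para></body>
</article>"""


def extract_metadata(root, wanted=("title", "author", "pubdate")):
    """Context-insensitive: pick out elements of interest by name,
    ignoring their order and the rest of the document."""
    found = {}
    for elem in root.iter():
        if elem.tag in wanted and elem.tag not in found:
            found[elem.tag] = (elem.text or "").strip()
    return found


def format_title_page(root):
    """Context-sensitive: walk the header's children in document order
    and emit every one, using the parent as a context cue."""
    header = root.find("header")
    return [f"{header.tag}/{child.tag}: {(child.text or '').strip()}"
            for child in header]


root = ET.fromstring(DOC)
metadata = extract_metadata(root)
title_page = format_title_page(root)
```

The extraction function would keep working if new sub-elements were added to the header or the existing ones were reordered; the title-page function would silently change its output in both cases.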


Chapter 2. Analysis of the Problem Space

This chapter records the Refinement Task Force's discussion of various reuse and modification issues. Note that the text of this report links to all use cases that we have collected.


Contracts and Modification Tolerances

A schema-aware validating processor takes the following two inputs:

  • The infoset for an XML schema root element

  • A set of definitions from a nominated schema (possibly gathered from several different schema modules)

And it produces the following two outputs:

  • A report on whether the schema root is valid according to the schema definitions

  • An augmented infoset that reflects any newly contributed schema information items, such as subtyping information (motivated by, for example, David Beech's use case)

The Task Force discussed the notion of making a schema-aware validating processor responsible for producing an additional output: performing transformations on the infoset in order to present elements to downstream applications as being valid instances of a type other than their declared type.

The reason this idea arose is that since a schema-aware processor would have access to the type lattice, having it do the transformation would be useful in cases where a document is valid according to the schema but is nonetheless unacceptable ("invalid") as input to a downstream processor.

(Note that "validity" in this sense is not universal, but rather would be specified as a contract between the schema and the application. In the contract, the application specifies the kinds of modifications it will accept. In some cases, the application may agree to accept different forms of modification made to different types.)

The forms of modification acceptable to an application would depend on what it does with various kinds of elements. A pretty-printing program may not much care whether new sub-elements are added in a type's tree structure. A query program may not allow renaming of attributes or sub-elements. An application that extracts names and addresses of employees and builds a master list may break if multiple addresses are allowed, and so on. Here is a typical example of an intolerant application, based on Dave Hollander's classic use case: An application written to handle a ShipAddr element based on the Address type may not be able to handle a ModifiedShipAddr element based on a derived AddressPlus type that tacks subelements onto the end of the model, even if the original definition was declared to be "refinable."

In this case, if the ModifiedShipAddr element were renamed and the offending subelements removed from the instance, the application could (it is hoped) operate normally. The only alternative to the renaming and removal is to require instances to be validated against the original schema, rejecting any documents that take advantage of the modified definitions.

However, we ultimately felt that specifying such a transformation model is out of scope for the Working Group; the transformation process is a job for some other application component (such as a run-time processor), not a schema-aware validator. The schema is therefore allowed to express unilaterally the terms of the contract -- that is, it promises to deliver only instances that are valid according to the rules expressed within it.
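As an illustration of what such a downstream component might do, the following Python sketch performs the renaming and removal described for the ModifiedShipAddr case. It is a sketch under stated assumptions: the child names of the Address type (street, city, postcode), the extra giftwrap element, and the downcast function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed children of the base Address type (not given in the report).
ADDRESS_CHILDREN = ("street", "city", "postcode")


def downcast(elem, base_name, base_children):
    """Build a new element that presents `elem` as an instance of the base
    type: renamed to `base_name`, keeping only children that the base
    content model knows about. The original element is left untouched."""
    result = ET.Element(base_name, dict(elem.attrib))
    result.text = elem.text
    for child in elem:
        if child.tag in base_children:
            result.append(child)  # child elements are shared, not copied
    return result


modified = ET.fromstring(
    "<ModifiedShipAddr><street>1 Main St</street><city>Anytown</city>"
    "<postcode>00000</postcode><giftwrap>yes</giftwrap></ModifiedShipAddr>"
)
plain = downcast(modified, "ShipAddr", ADDRESS_CHILDREN)
```

A real component would need the type lattice to know which base type to target and which children belong to it; here both are passed in by hand.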

Even if infoset transformation is left to a downstream application, we discussed how it might be useful for a designer of a type in a schema to declare the nature of modifications it tolerates, where such information would be used to augment the infoset. However, Matthew Fuchs notes that "If we agree that it's the application's duty to control what refinements it accepts, it is not at all obvious that types must likewise be control freaks about what refinements are acceptable. In any case, since all things should be permitted which are not explicitly forbidden, it should be [a rejects instead of a tolerates paradigm]..."


Modification in XML vs. Other Languages

In object-oriented programming languages such as Java and C++, you can refine a class by adding attributes to it. (You can also add and change methods, but this is not relevant for XML). Because the order of attributes is not significant, the new attributes are generally added at the end. The definition of "validity" of a refined class (that is, substitutability of classes/instances) is that methods that run on an instance of the base class must also run on an instance of the refined class; this means that it is prohibited to remove or rename attributes.

In XML, the situation is much more complex. Because element order is significant, it may be desirable to add new sub-elements in specific locations within the tree structure of a type. Because multiple occurrences of an element are allowed, it may be desirable to increase or decrease the occurrence constraints. Because some elements and attributes are optional, it may be desirable to remove them altogether.


Potential Forms of Content Model Modification

The following table shows the major forms of content model modification possible in XML. These forms of modification are broken down first by whether they broaden or narrow the generative power of the content model as given in the base type, and second by the structural effect of the modification.

The substitutability assessments below (abbreviated "S14Y" in the table) are made assuming the following:

  • The original definition was intended to tolerate no modifications

  • The application is coded perfectly defensively (that is, accounting for every possible instance in a schema's actual "value space")

  • The application makes important use of every "relevant fact" about the content model (admittedly a judgment call)

# | Form of modification | S14Y quotient | Simplistic examples | Use cases and comments
Broadening of the content model
1 Outright deletion of required element or group No S14Y for any app.

(first, middle, last) to (first, last)

(Actually, this doesn't broaden the model so much as make a disjunctive change. Also, #1 is a special case of #2.)

For human-authored content, this is usually only done when a schema undergoes a major renovation. When DocBook moves to a new major version, it is allowed to introduce backwards incompatibilities; the ArtHeader element, for example, is slated for destruction.

Jonathan Robie's use case on deletion suggests that deletion of required constructs is not very useful.

2 Relaxation of element or group's occurrence No S14Y for any app.

(first, middle, last) to (first, middle?, last)

For human-authored content, this is sometimes done as an intermediate step towards the removal of an element, as described above. (Note that the fourth example is simultaneously both a broadening and a narrowing.)

Jonathan Robie's use case on deletion suggests that deletion of optional constructs could be useful.

(first, middle+, last) to (first, middle*, last)

(first, middle[2,4], last) to (first, middle[0,4], last)

(first, middle[2,4], last) to (first, middle[0,2], last)

(first|middle|last)+ to (first|middle|last)*

((first|given), middle, last) to ((first|given)?, middle, last)

3 Outright insertion of required element or group S14Y for CI apps because ordering is preserved. (True for all insertion positions?) No S14Y for CS apps.

(first, middle, last) to (honorific[2,3], first, middle, last)

For CI apps, suffixing is the major requirement, although insertion into other content model positions may be desired as well.

For CS apps, insertion in all content model positions is sometimes done for human-authored content when an organization's production environment needs proprietary extensions that an interchange DTD does not provide.

Existing schema language proposals allow only suffixing of elements, probably because of ease of implementation. Prefixing may be nearly as easy. Infixes are extremely complicated; you need a content model transformation language in order to achieve it.

Jonathan Robie's use case on extension suggests that extension could be useful.

(first, middle, last) to (first, middle, nickname+, last)

(first, middle, last) to (first, middle, last, suffix)

4 Outright insertion of optional element or group S14Y for CI apps because ordering is preserved. (True for all insertion positions?) No S14Y for CS apps.

(first, middle, last) to (first, middle, nickname[0,2], last)

Michael Sperberg-McQueen's use case #2 (add a new phrase-level or block-level element type to a repeatable-OR group) is similar to the fourth example shown here. The DTD-oriented method of achieving this effect is to use a sort of model group fragment, which he notes is "ugly, but simple and very local":

<!ENTITY % x.data 'ship |' >

Jonathan Robie's use case on extension suggests that extension could be useful.

(first, middle, last) to (honorific?, first, middle, last)

(first, middle, last) to (first, middle, last, suffix?)

(first|middle|last)+ to (first|middle|last|suffix)+

((first|given), middle, last) to ((first|given|Christian), middle, last)

5 Outright insertion of certain classes of element (well-formed XML, well-balanced XML, elements in certain namespaces, elements not in certain namespaces) S14Y for CI apps because ordering is preserved. (True for all insertion positions?) No S14Y for CS apps.

(first, middle, last) to (first, middle, last, [WF]*)

The second example shows a use case that has been made by many people and has even arisen in the Working Group's discussion of handling schema documentation: it is desirable to allow "any HTML element" in a certain position in the content model. It might also be useful to allow elements from a particular set of namespaces; the third example demonstrates allowing elements from all namespaces except the default one.

The logical extension of this form of modification, shown in the last example, is "complete openness" (which does not mean the same thing as openness in the current Schema: Structures draft): modifications that allow any well-formed XML chunk to appear in any location.

(first, middle, last) to ([html:]*, first, middle, last)

(first, middle, last) to (first, middle, last, [^default:] *)

(first, middle, last) to ([WF]*, first, [WF]*, middle, [WF]*, last, [WF]*)

6 Renaming of element (while keeping the same base type) S14Y for all apps if the processing model handles the mapping somewhere.

(first, middle, last) to (Christian, middle, surname)

It is extremely common for developers of all types of applications to rename elements to match the technical jargon or natural language of their users.

Michael Sperberg-McQueen's use case #4 appears to need renaming, though he does not mention this explicitly.

As Matthew Fuchs has pointed out, renamed GIs can cause major problems by turning acceptable content models into ambiguous models. This is not a problem if the renaming occurs as part of creating a refined type.

(first, middle, last) to (prenom, moyenne, nom)

7 Reordering of elements or groups No S14Y for any app.

(given, middle, surname) to (surname, middle, given)

The use case in Jonathan's proposal shows how elements might need to be reordered in the case of address fields (although his proposed solution avoids reordering, preferring to fix an order where originally there was none).

((first|given), middle, surname) to (surname, middle, (first|given))

8 Unfixing the order of a sequential group No S14Y for any app.

(first, middle, last) to (first|middle|last)*

Unusual in human-authored content.

Narrowing of the content model
9 Outright deletion of optional element or group S14Y for all apps.

(first, middle?, last) to (first, last)

Done frequently for human-authored content, in order to make the authoring environment simpler than the interchange version of a schema. To date, achieved mostly by means of editing the DTD code directly, although parameter entities are sometimes used. Standard DTDs are often heavily subsetted to remove elements.

Michael Sperberg-McQueen's use case #4 is similar to the first example here; one choice of a mutually exclusive group is removed. The DTD-oriented method for doing this is to redeclare the entire element type. This has the disadvantage that, as he notes, "There is no constraint on what the user can do when the user redefines an element type."

((first|given), middle, last) to ((first), middle, last)

(first, (middle|nickname)?, last) to (first, last)

(first|middle|last)+ to (first|last)+

10 Restricting of element or group's occurrence S14Y for all apps.

(first, middle*, last) to (first, middle?, last)

Unusual in human-authored content.

11 Fixing the order of a free mixture of elements S14Y for all apps.

(first|middle|last)* to (first, middle?, last)

Jonathan Robie's proposal advocates the notion of starting with an archetype that has just a "vocabulary" -- a list of elements that are available to be used in a real element type. When an element type is derived from an archetype, it has the opportunity to anchor any or all of the vocabulary items into a sequential group. His use case is of a set of address-related fields whose order can then be fixed.
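The broadening/narrowing classification used in the table can be checked mechanically for small cases by treating each content model as a regular expression over the sequence of child element names. The Python sketch below is an assumption-laden illustration: the translator handles only the notation used in the examples above (names, commas, |, ?, *, +, and [m,n] occurrence ranges), and it compares models by brute-force enumeration of child sequences up to a length bound rather than by a true language-inclusion test.

```python
import re
from itertools import product

TOKEN = re.compile(r"\s*(\[\d+,\d+\]|[A-Za-z]+|[(),|?*+])")


def to_regex(model):
    """Translate the table's content-model notation into a Python regex
    that matches a string of space-terminated child names."""
    model = model.strip()
    out, pos = [], 0
    while pos < len(model):
        m = TOKEN.match(model, pos)
        if not m:
            raise ValueError(f"cannot parse {model!r} at offset {pos}")
        tok, pos = m.group(1), m.end()
        if tok[0].isalpha():
            out.append(f"(?:{tok} )")          # element name
        elif tok == "(":
            out.append("(?:")                  # group
        elif tok == ",":
            continue                           # sequence is plain concatenation
        elif tok[0] == "[":
            lo, hi = tok[1:-1].split(",")
            out.append("{%s,%s}" % (lo, hi))   # occurrence range
        else:
            out.append(tok)                    # ) | ? * +
    return re.compile("".join(out))


def language(model, alphabet, max_len):
    """All child-name sequences up to max_len that the model accepts."""
    rx = to_regex(model)
    return {seq
            for n in range(max_len + 1)
            for seq in product(alphabet, repeat=n)
            if rx.fullmatch("".join(name + " " for name in seq))}


def compare(base, derived, alphabet, max_len=4):
    """Classify a modification, within the length bound only."""
    b = language(base, alphabet, max_len)
    d = language(derived, alphabet, max_len)
    if d == b:
        return "same"
    if d < b:
        return "narrowing"
    if d > b:
        return "broadening"
    return "incomparable"  # neither contains the other


NAMES = ("first", "middle", "last")
row2 = compare("(first, middle, last)", "(first, middle?, last)", NAMES)
row9 = compare("(first, middle?, last)", "(first, last)", NAMES)
row1 = compare("(first, middle, last)", "(first, last)", NAMES)
```

Under this check, the row 1 example comes out as "incomparable" rather than "broadening", which agrees with the remark in that row that outright deletion of a required element is really a disjunctive change.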


Potential Forms of Content Model Openness

Content model modification presumes that an original type has been modified in the course of creating a derived type; this literally creates a new content model against which an element can be validated. Thus, it directly affects only schemas. The "open content model" notion does something different: It provides some of the parameters of allowed content outside the usual content model mechanism. That is, it directly affects constraints on instances.

The following table shows the major forms of content model openness that have been suggested to date.

Note that the forms of model-broadening modification could be considered as potential forms of openness too; for example, we could add model="relax-occurrence" as a kind of openness. However, most of these have impractical consequences. For example, it would be pointless, without further direction, to declare that a model has "outright insertion openness"; it's better to literally perform the model modification. Thus, these are not listed exhaustively here.

# | Form of openness | S14Y quotient | Simplistic examples | Use cases and comments
12 Completely open model S14Y for CI apps because ordering is preserved. (True for all insertion positions?) No S14Y for CS apps.

(first, middle, last) model being validated against as if it were declared ([WF]*, first, [WF]*, middle, [WF]*, last, [WF]*), as long as the WF blocks didn't contain first, middle, or last

This is the notion of openness provided by the current Schema: Structures draft.

13 Allowing any element of a derived type in place of an element of an original type S14Y for CI apps that are prepared to do the transformation described in the section called Contracts and Modification Tolerances or otherwise are prepared to query on types/lattices rather than GIs

(first, middle, last) model being validated against as if each element could be substituted with any element conforming to the original element's type or to a type derived from it

A use case from Matthew Fuchs's paper describes disallowing any elements that have derived types as the "restrict" option. This sort of restriction is similar to declaring an element "final."

The WG has not yet concluded whether this type of openness, which Schema: Structures already has (with no way to turn it off), allows "variant" element GIs. David Beech's paper proposes not to allow this. Matthew has also later pointed out that variant GIs can cause major problems by turning acceptable content models into ambiguous models.

14 Allowing insertion of certain classes of element (well-formed XML, well-balanced XML, elements in certain namespaces, elements not in certain namespaces) S14Y for CI apps because ordering is preserved. (True for all insertion positions?) No S14Y for CS apps.

(first, middle, last) being validated against as if it were declared ([html:]*, first, middle, last)

Similar to #5 above, only accomplished without a literal content model modification. This would require a pretty slick interface for indicating the location, type, and amount of openness.

15 Allowing any derived (arche)type in place of an original (arche)type S14Y for CI apps that are prepared to do the transformation described in the section called Contracts and Modification Tolerances or otherwise are prepared to query on types/lattices rather than GIs

(Using PEs very loosely as if they were archetypes!!) ((%first;), middle, last) model being validated against as if %first; could be substituted with any (arche)type conforming to %first;

Since the WG voted to allow (arche)type references in content models, this option arises in parallel with #13.
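For the completely open model (#12), validation can be pictured as setting aside the foreign children and matching what remains against the declared model. The Python sketch below is one possible reading, not the Schema: Structures algorithm: it assumes that "the WF blocks didn't contain first, middle, or last" applies to descendants of foreign elements as well as to the elements themselves, and the name valid_open is made up for the illustration.

```python
import re
import xml.etree.ElementTree as ET

DECLARED = ("first", "middle", "last")
# (first, middle, last), hand-translated to a regex over
# space-terminated child names.
MODEL = re.compile(r"first middle last ")


def valid_open(elem, declared=DECLARED, model=MODEL):
    """Validate elem's children against the declared model as if arbitrary
    well-formed blocks were allowed before, between, and after them."""
    kept = []
    for child in elem:
        if child.tag in declared:
            kept.append(child.tag)
        elif any(d.tag in declared for d in child.iter()):
            # Assumption: a foreign block may not contain declared names.
            return False
    return model.fullmatch("".join(tag + " " for tag in kept)) is not None


closed_ok = valid_open(ET.fromstring("<name><first/><middle/><last/></name>"))
open_ok = valid_open(ET.fromstring(
    "<name><title/><first/><middle/><nick/><last/></name>"))
missing = valid_open(ET.fromstring("<name><first/><last/></name>"))
nested = valid_open(ET.fromstring(
    "<name><first/><middle/><last/><extra><first/></extra></name>"))
```

The sketch accepts the first two instances and rejects the last two: one omits a required element, and the other hides a declared name inside a foreign block.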


Granularity of Tolerance for Final References

It's possible to set tolerance/rejection of various kinds of openness at various levels. For example, a "final" switch (#13 above) could be set for a whole schema, for a type or element (no matter where it gets referenced), or for a particular slot in a content model where a type or element is referenced. Gabe Beged-Dov notes that a schema author can simulate the last choice by using the middle choice: You create a trivial subtype that is set to be final, and then use that in the desired slots in place of the non-final base type. He says further: "When you mark a type as final it does not constrain the substitubility of slots in that type's content model. It constrains the substitubility of the type in other content model's slots. In this world view there is no way to constrain a type's substitutability in a particular content model. Substitutability is part of a type's contract with the content models that contain it. If you want a different contract, you need a different type."
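The trivial-final-subtype technique can be sketched with a toy type registry in Python. Everything here is hypothetical (the type names, the registry layout, the allowed_in_slot function), and the sketch assumes one particular reading of "final": nothing derived from a final type may substitute for it.

```python
# Hypothetical registry: type name -> (base type or None, final flag).
TYPES = {
    "Address": (None, False),
    "USAddress": ("Address", False),
    # Trivial subtype: adds nothing to Address, but is marked final.
    "FinalAddress": ("Address", True),
}


def allowed_in_slot(slot_type):
    """Types whose instances may fill a content model slot declared with
    slot_type: the type itself plus, unless it is final, everything
    transitively derived from it."""
    allowed = {slot_type}
    if TYPES[slot_type][1]:
        return allowed
    for name, (base, _final) in TYPES.items():
        if base == slot_type:
            allowed |= allowed_in_slot(name)
    return allowed


open_slot = allowed_in_slot("Address")
closed_slot = allowed_in_slot("FinalAddress")
```

A slot declared with Address accepts all three types, while a slot declared with FinalAddress accepts only that type, which gives per-slot control without a per-slot switch.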


Implementation Considerations

[Issue: Revise this section.] We discussed the costs of searching in the modification and validation model we choose.

Suppose we allow no variant GIs. The validator stands at a node n in a document instance having some particular tag t. What type T (and consequently, what content model -- regular expression -- r) should be used for validation? Without variation, things proceed as follows: The basis for the type search is the explicitly declared type found in the instance, if there is one. Otherwise, we take as the basis for the search the type associated with t in the schema. Beginning with the basis type, we choose it or its nearest refinement in the type hierarchy whose associated content model accepts the children of n. Because validation can fail on some descendant of n, the validator may have to backtrack to the point at which it chose a type to validate the children of n, and make another choice (next nearest) of type from the type hierarchy.
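The search without variant GIs might be sketched as follows in Python. The schema is entirely hypothetical and is contrived so that backtracking is needed; content models are written directly as regexes over space-terminated child tags; "nearest" is interpreted as breadth-first order in the type hierarchy; and the sketch does not check that a refinement is itself legitimate.

```python
import re
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)
    declared_type: Optional[str] = None  # explicit type in the instance


@dataclass
class Type:
    model: str                                       # regex over child tags
    child_types: dict = field(default_factory=dict)  # child tag -> type name
    base: Optional[str] = None                       # type this one refines


# Hypothetical schema.
TYPES = {
    "Text": Type(""),
    "USAddress": Type("street zip ", {"street": "Text", "zip": "Text"}),
    "UKAddress": Type("street postcode ",
                      {"street": "Text", "postcode": "Text"}),
    "Order": Type("addr ", {"addr": "USAddress"}),
    "UKOrder": Type("addr ", {"addr": "UKAddress"}, base="Order"),
}


def nearest_first(basis):
    """Yield the basis type, then its refinements, nearest first."""
    queue = deque([basis])
    while queue:
        current = queue.popleft()
        yield current
        queue.extend(n for n, t in TYPES.items() if t.base == current)


def validate(node, tag_type):
    """Return [(tag, chosen type), ...] in document order, or None."""
    basis = node.declared_type or tag_type
    children = "".join(child.tag + " " for child in node.children)
    for candidate in nearest_first(basis):
        ctype = TYPES[candidate]
        if not re.fullmatch(ctype.model, children):
            continue
        typed = [(node.tag, candidate)]
        for child in node.children:
            sub = validate(child, ctype.child_types[child.tag])
            if sub is None:
                break  # a descendant failed: backtrack to the next candidate
            typed.extend(sub)
        else:
            return typed
    return None


order = Node("order", [Node("addr", [Node("street"), Node("postcode")])])
result = validate(order, "Order")
```

In this instance the content model of Order accepts the children of order, but validation then fails inside addr, so the validator backs up and succeeds with UKOrder instead.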

When we allow GI variance, we effectively increase the size of the search space. Let r1 be a content model. A variant of r1 is any regular expression r2 derived from r1 by replacing a tag t1 in r1 with a tag t2 whose associated type is a refinement of the associated type of t1. (The "variant of" relationship is transitive.) Now having chosen a type in the search above as a candidate for use in the validation of the descendants of n, we have an additional choice to make: We can use the content model associated with the chosen type, or any variant of that content model. This choice too can fail, necessitating backtracking to the point of choice and choosing another variant.
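To give a feel for how quickly the variant space grows, the following Python sketch enumerates the variants of a content model. It simplifies heavily: a content model is just a flat sequence of tags, the type hierarchy and tag-to-type mapping are hypothetical, and a type is counted as a refinement of itself so that the original model appears among its own variants.

```python
from itertools import product

# Hypothetical type hierarchy: type -> the type it refines (or None).
BASE = {"Name": None, "FullName": "Name", "FormalName": "FullName",
        "Text": None}
# Hypothetical association of tags with types.
TAG_TYPE = {"name": "Name", "fullname": "FullName",
            "formalname": "FormalName", "note": "Text"}


def refines(type_name, ancestor):
    """True if type_name is ancestor or a transitive refinement of it."""
    while type_name is not None:
        if type_name == ancestor:
            return True
        type_name = BASE[type_name]
    return False


def substitutes(tag):
    """Tags whose type refines the type associated with tag."""
    return [other for other, typ in TAG_TYPE.items()
            if refines(typ, TAG_TYPE[tag])]


def variants(model):
    """All variants of a flat sequence model, the model itself included."""
    return [list(choice)
            for choice in product(*(substitutes(tag) for tag in model))]


all_variants = variants(["name", "note", "name"])
```

With three tags able to stand in for name, a model that mentions name twice already has nine variants; each is a further point at which validation can fail and force backtracking.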


Abstract Types

The idea of being able to set a type as being abstract (unusable without further derivation) has come up numerous times in discussion. In addition, Jonathan Robie's use case for addresses has the notion of an "abstract address" type from which all directly usable types are derived, and Matthew Fuchs proposed a "real simple" model for discussion that contained the notion of an abstract type.

We are inclined towards recommending that the schema language allow this, as it seems conceptually useful. In addition, there may be proto-content model mechanisms, such as "amp," that we could then allow solely in abstract types in order to avoid parser implementation problems.


Defining a Base Type

We are inclined towards recommending the definition of a single built-in type from which all other types are derived. This base type would essentially "validate" a well-formed schema root, and might define certain global type characteristics that all XML elements must share.


Reuse Mechanisms

Quite apart from subtyping and modification, it has been proposed to simplify the mechanisms offered for reuse-by-reference of type definitions. David Beech's paper outlines a proposal to remove named model groups and attribute groups in favor of type references.


Chapter 3. Existing Proposals

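There have been several proposals to support modification: the current Schema: Structures draft, Andrew Layman's Schema NG Guide, and Jonathan Robie's Views and Inheritance proposal.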

These proposals each allow a limited set of modifications from the universe of all modifications. We take the position that all possible modifications can, in theory, be allowed; Allen Brown's paper explains the theoretical underpinnings of this position. If a content model is considered to be a regular expression, modification entails changing the regular expression. Given the original content model and the changes that were made to it, the validator can always validate instances of the modified class, that is, it has the information to carry out the validation (although this may be expensive or impractical in other ways).

The following sections analyze each proposal in turn, in light of the potential forms of modification and openness described above.

[Issue: Add an example of how to handle strict:p vs. transitional:p somewhere here.]


Schema: Structures

The current Schema: Structures draft offers the following features:

# | Description
Reuse
-- Named model groups (where a model group is pulled in by reference).
-- Named attribute groups (where an attribute group is pulled in by reference).
-- Refinement of base types (where a type definition is pulled in by reference).
-- Schema inclusion (where a schema module is pulled in by reference).
Modification
3 "Refinement" of a content model by appending required elements and groups to the end of the model.
4 "Refinement" of a content model by appending optional elements and groups to the end of the model.
Subtyping
-- "Refinement" of base types (where the derived type is presumed to be a subtype of the base type; there is no "private inheritance"). Also, derivation (without refinement) made to a base type by means of the type attribute on a type or elementType element.
Tolerance of openness
12 Allows a completely open model. (Also has an explicit setting for tolerance of modification.)

The mechanism for tolerance and modification works as follows to add an element suffix to the end of a personName element's content model. Notice that the refines element acts as a conceptual "pulling in by reference" of the base type in the precise location in the content model where the base type's model is needed.

<!-- #3: (first, middle, last) to (first, middle, last, suffix) -->
<type name="name" model="refinable" order="seq">
  <element name="first"/>
  <element name="middle"/>
  <element name="last"/>
</type>

<element name="fullname">
  <refines name="name"/>
  <element name="suffix"/>
</element>

<!-- #4: (first, middle, last) to (first, middle, last, suffix?) -->
<element name="fullname">
  <refines name="name"/>
  <element name="suffix" minOccurs="0" maxOccurs="1"/>
</element>

Schema NG

Andrew Layman's Schema NG Guide offers the following features:

# | Description
Reuse
-- Silent on this issue.
Modification
2 "Revision" of a portion of a model (though illicitly, according to prose constraints in the proposal, since "it is an error to assert constraints that are less restrictive than those of the base type").
3 "Refinement" of a content model by appending required elements and groups to the end of the model.
4 "Refinement" of a content model by appending optional elements and groups to the end of the model.
Subtyping
-- "Refinement" of base types (where the derived type is presumed to be a subtype of the base type; there is no "private inheritance"). Does not seem to offer the option of direct derivation (without refinement) made to a base type; only element declarations can refer directly to a type without the refines mechanism. Allows for subtyping from multiple base types by using "refinement" markers in constructing content models.
Openness
12 Completely open model.

The mechanism works as follows. Notice that the refines element acts as a conceptual "pulling in by reference" of the base type in the precise location in the content model where the base type's model is needed, but here (as opposed to Schema: Structures), multiple refines elements are allowed wherever an element or (arche)type name can appear. Also notice that the revises element allows for easy adjustment of occurrence.

<!-- #2: (first, middle, last) to (first, middle?, last) -->
<!-- NOTE: Broadening the model is disallowed in the prose -->
<type name="shortname">
  <refines type="name">
    <element revises="middle" maxOccurs="0" />
  </refines>
</type>

<!-- #3: (first, middle, last) to (first, middle, last, suffix) -->
<type name="fullname">
  <refines type="name" />
  <element name="suffix" type="..." />
</type>

<!-- #4: (first, middle, last) to (first, middle, last, suffix?) -->
<type name="fullname">
  <refines type="name" />
  <element name="suffix" type="..." minOccurs="0"/>
</type>

Views and Inheritance

Jonathan Robie's proposal offers the following features:

# | Description
Reuse
-- Silent on this issue.
Modification
3 "Inheritance" of a content model by appending required elements and groups to the end of the model while explicitly referring to the base type as a "basis" for the derived type. This is called "inheriting type and content model" in the proposal.
4 "Inheritance" of a content model by appending optional elements and groups to the end of the model while explicitly referring to the base type as a "basis" for the derived type. This is called "inheriting type and content model" in the proposal.
7 Reordering of elements in a base type's model while explicitly referring to the base type as a "basis" for the derived type. Expected to re-mention the base type's sub-elements in this case; does not implicitly reuse them.
11 Fixing of the order of a potential vocabulary of elements listed in an abstract type. This is called "inheriting type and vocabulary" in the proposal.
Subtyping
-- "Inheritance" of base types (where the derived type is presumed to be a subtype of the base type; there is no "private inheritance"). Also, "inheritance" of base types that had no content model to begin with, so the derived type is a vacuous specialization. This is called "inheriting type without content model" in the proposal.
Openness
-- Silent on this issue.

[Issue: Add an example of #7 here.]

The mechanism for inheriting a base type with a vocabulary works as follows. Notice that desired elements are referenced by name, using a namespace-prefix-like mechanism.

<!-- #11: (first|middle|last)* to (first, middle?, last) -->
<ElementType name="AbstractName" abstract="true">
  <vocab>
    <element name="first" ... />
    <element name="middle" ... />
    <element name="last" ... />
  </vocab>
</ElementType>

<ElementType name="name" basis:a="AbstractName">
  <element name="first" inherit="a:first" />
  <element name="middle" inherit="a:middle" minOccurs="0" />
  <element name="last" inherit="a:last" />
</ElementType>

Chapter 4. Recommendations

Planned reuse, planned modification, and a subtyping system that makes contributions to a document's infoset are desirable and necessary in Schema: Structures version 1.0. Without them, DTDs alone would have served most of the important purposes of structural schemas. At issue are which kinds of reuse, modification, subtyping, and openness should be allowed, to what extent, and what mechanisms should be used to achieve them.

We took it as our goal to choose the kinds of reuse, modification, subtyping, and openness that:

[Issue: Do we recommend changing any of the names of the constructs, e.g. "refines," wherever we propose to retain the existing Schema: Structures mechanism?]

[Issue: Do we recommend keeping or changing the default openness statuses in the language?]


Recommendations for Reuse

We propose to drop named model groups and named attribute groups, and in their place use type references.

Currently, named model groups are reusable without requiring a wrapper element around their content. For example:

[TBS]

Just substituting type references for named model groups will not duplicate this functionality, since a type must ultimately be referenced by an element. For example:

[TBS]

In order to allow type references in content models to have the same power as named model groups, it is necessary to allow the group element to reference a type, in much the same way as an element declaration (or a derived type) references a type. For example:

[TBS]

Note that the act of referencing a type from an element declaration or a group has no effect on the type lattice; that is, no subtype is created.

[Issue: Does a subtype get created if you use the type attribute on the type element, that is, reuse an entire base type without modifying it?]

[Issue: We still need to discuss David's specific recommendations for concrete syntax (allowing archetypeRef in place of attribute declarations, rather than collecting them in the model) and for achieving co-constraints with sequence/choice groups.]


Recommendations for Modification

We propose to allow only one kind of content model modification: the "refines" modification already available in Schema: Structures. This mechanism allows for appending elements [or groups?] at the end of the reused content model. This mechanism has the effect of subtyping from the base type (or types).

We considered the "refines" aspect of the Schema NG proposal, which allows the schema author to build up a complex content model in which one or more base types' content models can be inserted as placeholders. While this mechanism was appealing in its apparent simplicity, it has too many special cases (accounted for by prose in the NG proposal) for which we might eventually need to add more mechanisms. Since the Schema: Structures mechanism is a subset of the NG mechanism and doesn't preclude a move to the more robust one later, we decided to stick with the Schema: Structures mechanism.

We also considered the "revises" aspect of the Schema NG proposal, which allows the schema author to reuse a content model and then selectively restrict the occurrence settings of the original elements. We failed to achieve consensus on allowing this.

[Issue: We need to consider Jonathan's solution for modification form #7.]


Recommendations for Subtyping

We agree with the existing subtyping mechanism ("refines") in Schema: Structures that allows for creating a subtype from a base type or types. Creating a derived type is the only way to add an entry to the type lattice.

[Issue: Does a subtype get created if you use the type attribute on the type element, that is, reuse an entire base type without modifying it?]

[Issue:] We considered the Views and Inheritance proposal to allow subtyping without reusing the base content model. We think that this is already allowed, vacuously, in Schema: Structures, in the case where the base type has no model at all. We failed to reach consensus on allowing this.

[Issue:] We considered the Views and Inheritance proposal to allow subtyping from a vocabulary "pool." There are two obvious ways to do this: subtype by fixing the order of an "ampersand" group, and subtype by constructing a content model out of a "repeatable-OR" group. We are still undecided, and are waiting for a further writeup from Jonathan Robie. One possibility is to allow ampersand groups only in abstract types, so that parsers never actually have to face the problems of validating against such a group.

We propose that the schema language have a built-in "ur-type" that defines any primitive information necessary for every XML element:

[TBS]

We propose that the schema language offer a way to declare that a type is abstract, that is, that it cannot be referenced directly but can only be used as the base type for a subtype that refines it.


Recommendations for Openness

We agree with the existing mechanism in Schema: Structures for allowing a content model to tolerate complete openness.

We propose to let a schema author mark a [schema? type/element definition? content model slot?] as rejecting any element substitution. With this setting on, all elements present in an instance must have exactly the same GIs as those mentioned in the content model, and exactly the types referenced by the declarations of each of those elements. For example:

[TBS]

We considered letting a schema author mark a [schema? type definition? content model slot?] as rejecting any archetype substitution. We failed to reach consensus on allowing this.

[Issue:] We considered how to resolve the issue of variant vs. invariant GIs. So far, we have failed to reach consensus.


Appendix A. Formal Model for Schemas, Refinement, Tolerance, and Validity

A.1  Introduction

The principal idea here is that the designer of a modified or refined type be able to declare the nature of the modification made. In order to exploit fully such a declaration, an application needs to be able to declare what sorts of modifications it is willing to tolerate. Moreover, the validator, upon hearing such advice from the application, needs to validate an instance according to the tolerances declared. In effect we are proposing that we make applications responsible for telling the validator what classes of transformations lead to substitutable refinements from the application's perspective. The validator, then, is responsible for guaranteeing that the effective refinements fall within the application's tolerances.

We open by informally sketching a model of the declaration, the notion of tolerance, and the kinds of refinements with respect to which validations might be successfully conducted. Later, we will give a rigorous account of the concepts in question, from which one can design a concrete specification syntax, and implement a validator.

First, let's embellish the element declaration to permit refinements:

<element type="mumble" tolerates="(attr-append | elt-append)*"> ... </element>

and then include them:

<element type="baz"> <refines type="mumble" by="attr-append"/> ... </element>

The former declares a type that permits refinement by any sequence of element or attribute addition transformations. The latter refinement means that the subtype will be defined by adding an attribute to the former. (See the section below on transformations.)

An application would declare to the validator that certain kinds of refinement are tolerated. One possibility is to make a global declaration of such tolerance. Another possibility is to make local declarations of the form "this application tolerates refinements of the mumble type by any sequence of transformations of the following kinds ... ."

Now there are two questions:

  1. What kinds of refinement are there?
  2. How might the validation of such refinements work?

In this informal account, we only consider refinements vis-à-vis content models, and "element only" content models at that. (It should be reasonably clear how to extend the same ideas to cover refinements with respect to attributes.) For the time being, types will be conflated with regular expressions, where a refined type is accessed through a new tag [1]. Finally, we assume that all content models are representable by regular expressions.

A.2  Validation: an Informal Account

First, let's consider validation without refinement. Validity is defined recursively, roughly as follows: a node of an XML tree tagged with A, which has the type definition A → R (R being a regular expression in tags), is valid just in case the children of the node tagged with A (when taken in sequence) are in the regular set denoted by R, and each of the children is itself valid. An XML tree, by extension, is valid just in case all of its nodes are valid.

Without saying what can count as a refinement (other than to note that every type is a refinement of itself), we extend the definition of validity to accommodate refinement. A node of an XML tree tagged with A is valid just in case A has some refinement B defined by B → Q, such that the children of the node tagged with A (when taken in sequence) are in the regular set denoted by Q, and each of the children is valid. An XML tree, by extension, is valid just in case all of its nodes are valid.
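As a rough illustration only (not part of any proposal discussed here), the two informal definitions can be sketched in Python. The sketch assumes single-character tags so that Python's `re` module can stand in for regular expressions over tags; the names `Node`, `TYPES`, `REFINEMENTS`, and `is_valid` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


# Type definitions A -> R: each tag maps to a regular expression over child tags.
# Single-letter tags are an assumption made so that `re` can match tag sequences.
TYPES = {
    "n": "fml",  # name      -> (first, middle, last)
    "s": "fl",   # shortname -> (first, last)
    "f": "", "m": "", "l": "",
}

# Hypothetical refinement table; every type is also a refinement of itself.
REFINEMENTS = {"n": ["n", "s"]}


def is_valid(node, use_refinement=False):
    """Check a node against its own type or, optionally, any of its refinements."""
    if use_refinement:
        candidates = REFINEMENTS.get(node.tag, [node.tag])
    else:
        candidates = [node.tag]
    child_tags = "".join(child.tag for child in node.children)
    fits = any(re.fullmatch(TYPES[b], child_tags) for b in candidates)
    return fits and all(is_valid(c, use_refinement) for c in node.children)
```

A node tagged `n` with children `f` and `l` is rejected by the first definition but accepted by the second, because `s` is listed as a refinement of `n`.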

A.3  Refinement by Transformation

Suppose A1 → R1, ..., An → Rn are type definitions. We will define a set of immediate transformations below, each yielding a type definition B → Q such that B is an immediate refinement from A1, ..., An by one such transformation. If {T1, ..., Tk} is a set of transformations, there is the obvious inductive definition of a refinement B from A by a sequence of immediate transformations taken from that set. So what might be interesting immediate transformations?

Roughly speaking, the first XML schema working draft defined the immediate refinement of B from A1, ..., An by the "prepend" immediate transformation (or inheritance transformation) illustrated as follows:

\[
\begin{array}{l}
A_1 \to R_1 \\
A_2 \to R_2 \\
\;\vdots \\
A_n \to R_n
\end{array}
\quad \text{yields} \quad
B \to R_1 R_2 \cdots R_n Q
\]

where Q is a new regular expression in tags. Similarly, one might define the "postpend" immediate transformation by

\[
B \to Q R_1 R_2 \cdots R_n
\]

Andrew Layman defines a kind of "interleaving" immediate transformation. Beginning with the type definitions

\[
\begin{array}{l}
A_{1,1} \to R_{1,1} \\
\;\vdots \\
A_{1,k_1} \to R_{1,k_1} \\
\;\vdots \\
A_{m,1} \to R_{m,1} \\
\;\vdots \\
A_{m,k_m} \to R_{m,k_m}
\end{array}
\]

he "refines" them to an inherited type definition

\[
B \to Q_0 R_{1,1} \cdots R_{1,k_1} Q_1 \cdots Q_{m-1} R_{m,1} \cdots R_{m,k_m} Q_m
\]

where the Q's are all new regular expressions. The first WD and Layman definitions of refinement are induced by particular structured transformations of regular expressions. But there are many others, as tabulated above.

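Refining B from A by the "insertion" immediate transformation:

\[
A \to R_1 \cdots R_{m-1} R_m \cdots R_n
\quad \text{becomes} \quad
B \to R_1 \cdots R_{m-1} Q R_m \cdots R_n
\]

Refining B from A by the "deletion" immediate transformation:

\[
A \to R_1 \cdots R_{m-1} R_m R_{m+1} \cdots R_n
\quad \text{becomes} \quad
B \to R_1 \cdots R_{m-1} R_{m+1} \cdots R_n
\]

Refining Bi from A by a "restriction on choices" immediate transformation:

\[
A \to P Q^{*} R
\quad \text{can be restricted to any of} \quad
B_1 \to P Q^{+} R, \qquad
B_2 \to P Q^{?} R, \qquad
B_3 \to P Q R
\]

Refining B from A by the "cyclic permutation" immediate transformation:

\[
A \to R_1 \cdots R_m \cdots R_{m+k} \cdots R_n
\quad \text{becomes} \quad
B \to R_1 \cdots R_{m+1} \cdots R_{m+k} R_m \cdots R_n
\]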

No doubt you have noticed that we have not told you how the positions at which the transformations occur are to be identified. Also, the only transformations we have shown are on the immediate subexpressions of regular expressions. There are endless grammar-based processors that allow the identification of subexpressions and modification thereof, so we'll defer providing a solution.

By composing the transformations described, a wide variety of insertions, deletions and rearrangements can be accommodated. (Indeed, if we allow the transformations above to occur at any level, we get a universal editor of the sort proposed by Matthew Fuchs.) While each immediate transformation is eminently understandable, their composition can rapidly become opaque, so one may want to declare the number of immediate transformations that are tolerable in refining B from A. An application's robustness against transformations is more likely, however, to be determined by the nature of the transformations tolerated rather than their number.
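To make the immediate transformations concrete, here is a minimal Python sketch that treats a content model as a flat list of factors and identifies positions by list index. Both choices are assumptions of the sketch, since the text deliberately leaves position identification open, and all function names are hypothetical. Restriction on choices is omitted because it needs occurrence indicators, which a flat list does not carry.

```python
def prepend(bases, q):
    """First-WD style refinement: B -> R1 R2 ... Rn Q (base models come first)."""
    return [r for base in bases for r in base] + list(q)


def postpend(bases, q):
    """B -> Q R1 R2 ... Rn (new material comes first)."""
    return list(q) + [r for base in bases for r in base]


def insert(model, pos, q):
    """Insert the new factors q before the factor at index pos."""
    return model[:pos] + list(q) + model[pos:]


def delete(model, pos):
    """Drop the factor at index pos."""
    return model[:pos] + model[pos + 1:]


def cyclic_permute(model, m, k):
    """Move the factor at index m to just after the factor at index m + k."""
    return model[:m] + model[m + 1:m + k + 1] + [model[m]] + model[m + k + 1:]
```

For example, with `name = ["first", "middle", "last"]`, `prepend([name], ["suffix"])` gives the "appending" refinement, and `cyclic_permute(name, 0, 2)` moves `first` to the end. Each function returns a new list, so transformations compose by ordinary function application.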

Imagine that an application can declare that it will tolerate refinements B from A by any sequence of transformations in a set of immediate transformations {T1, ..., Tk}. We then amend the validation definition above to look for not just refinements, but "tolerated" refinements. The validator, of course, should leave a record of what refinements were used at each node during the validation process. Moreover, the application can have access to the transformational relationship between the type associated with a node's actual tag and the type that led to the validation of the node's children.

A.4  The Formalities

Now we are ready to provide mathematical models of schemata, instances and refinement wherein:

A.4.1  Schemata

A schema S is a structure < E,T,A,V,v,f,g,h,e0 > where

A type t ∈ T is basic if there does not exist a distinct type t′ such that t v t′ [4]. Also, we will sometimes call f(t) the signature of t.

A.4.2  Instances

An instance I of a schema S is a structure < N, ℑ, <<, l, n0 > where

A labeling, being an ordered 4-tuple, has a first component called its tag, a second component called its attribution, a third component called its type, and a fourth component called its value. When the type is ⊥, it is said to be implicit, and otherwise explicit. The partial order ℑ induces a tree on the nodes of I rooted at n0.

A.4.3  Basic Validity

An instance I is S=-valid (read "S-basic valid") with respect to a schema S iff all of its nodes have implicit types, n0's tag is e0, and each node n ∈ N is S=-valid with respect to S. A node n ∈ N is S=-valid with respect to S iff

  1. for every a in the domain of n's attribution p, p(a) ∈ (g(h(y(n))))(a) [6]; and
  2. either
    1. f(h(y(n))) ∈ ℘(V);
    2. n has no children;
    3. the type component of n is ⊥; and
    4. the value component of n is in f(h(y(n))) [7];
  3. or
    1. f(h(y(n))) ∈ ℜ(E);
    2. the value component of n is ⊥;
    3. the sequence y(n1)y(n2)...y(nk), where n1 << n2 << ... << nk are all the children of n, is in the regular set defined by f(h(y(n))); and
    4. each of n1, n2, ..., nk is S=-valid with respect to S.

A.4.4  Validity with Refinement

The intent of refinement is to control the circumstances under which a tag might occur at a node where another kind of tag is expected [8]. Our task here is to define the circumstances under which such a substitution might be valid. Regular expressions offer us an obvious mechanism by which we might achieve this: alternation. If e, e′ ∈ E and h(e′) v* h(e), then e′ is a v-alternative of e. Let e1, ..., ek be all of the v-alternatives of e; then e|e1|...|ek is the Sv-closure of e. By extension, for r ∈ ℜ(E), r̂ is the Sv-closure of r iff r̂ is obtained from r by replacing each element in the latter with its Sv-closure. So any occurrence of the e above would be replaced with (e|e1|...|ek) [9]. Finally, if t, t′ ∈ T and t′ v* t, then t′ is a v-alternative of t. Let t1, ..., tm be all of the v-alternatives of t; then the Sv-effective signature of t is

An instance I is Sv-valid (read "S-refinement valid") with respect to a schema S iff n0's tag is either e0 or a v-alternative of e0, and each node n ∈ N is Sv-valid with respect to S. A node n ∈ N is Sv-valid with respect to S iff, assuming the base type t for validating n to be the type component of n when that component is explicit, and h(y(n)) otherwise,

  1. t v* h(y(n));
  2. for every a in the domain of n's attribution p, p(a) ∈ (g(h(t′)))(a); and
  3. either
    1. f(h(y(n))) ∈ ℘(V);
    2. n has no children; and
    3. the value component of n is in the Sv-effective signature of t;
  4. or
    1. f(h(y(n))) ∈ ℜ(E);
    2. if r′ is the Sv-effective signature of t, the sequence y(n1)y(n2)...y(nk), where n1, n2, ..., nk are all the children of n taken in << order whose tags are mentioned in r′, is in the regular set defined by r′ [10];
    3. each of n1, n2, ..., nk is Sv-valid with respect to S.
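The Sv-closure used in these definitions can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative algorithm: the relation v is given as a set of (subtype, base type) pairs, h is a plain dictionary from tags to type names, content models are strings in which tags are identifiers, and an element is not counted among its own alternatives. The names `reflexive_transitive`, `alternatives`, and `sv_closure` are hypothetical.

```python
import re


def reflexive_transitive(pairs, types):
    """Return v* as a set of (subtype, base type) pairs."""
    star = {(t, t) for t in types} | set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(star):
            for (c, d) in list(star):
                if b == c and (a, d) not in star:
                    star.add((a, d))
                    changed = True
    return star


def alternatives(e, h, v_star):
    """All other elements e' with h(e') v* h(e), in sorted order."""
    return sorted(x for x in h if x != e and (h[x], h[e]) in v_star)


def sv_closure(r, h, v_star):
    """Replace each tag in the content model r by (e|e1|...|ek)."""
    def widen(match):
        e = match.group(0)
        return "(" + "|".join([e] + alternatives(e, h, v_star)) + ")"
    return re.sub(r"[A-Za-z_]\w*", widen, r)
```

With `h = {"id": "Id", "name": "Name", "fullname": "FullName"}` and `FullName` declared as a refinement of `Name`, the closure of `"id, name+"` is `"(id), (name|fullname)+"`: wherever `name` is expected, `fullname` may now appear.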

A.4.5  Validity with Refinement and Tolerance

Recalling that v is a subset of T × T, consider a segmentation of v into a finite collection of segments v1, ..., vk such that v = v1 ∪ ... ∪ vk. We can think of each of the segments as corresponding to a particular atomic refinement transformation on content models. (In the next section we'll examine some particular kinds of atomic transformations.) Furthermore, we can abuse the notation of juxtaposition and consider composing the atomic transformations as relations. Indeed, we can consider regular expressions over atomic transformations and note that in the notation of regular expressions v = (v1 | ... | vk).

Let S and I be as above. Suppose v is segmented into a finite collection of segments v1, ..., vk such that v = v1 ∪ ... ∪ vk. Let R(v) be a regular expression over the segments of v. Putting R(v) in the place of v, we can define the R(v)-alternatives, the SR(v)-closure, and the SR(v)-effective signature by analogy with the v-alternatives, the Sv-closure, and the Sv-effective signature above. An instance I is SR(v)-valid (read "S-refinement valid tolerating transformations in the regular set R(v)") with respect to a schema S iff n0's tag is either e0 or an R(v)-alternative of e0, and each node n ∈ N is SR(v)-valid with respect to S. A node n ∈ N is SR(v)-valid with respect to S iff, assuming the base type t for validating n to be the type component of n when that component is explicit, and h(y(n)) otherwise,

  1. t (R(v))* h(y(n));
  2. for every a in the domain of n's attribution p, p(a) ∈ (g(h(t′)))(a); and
  3. either
    1. f(h(y(n))) ∈ ℘(V);
    2. n has no children; and
    3. the value component of n is in the SR(v)-effective signature of t;
  4. or
    1. f(h(y(n))) ∈ ℜ(E);
    2. if r′ is the SR(v)-effective signature of t, the sequence y(n1)y(n2)...y(nk), where n1, n2, ..., nk are all the children of n taken in << order whose tags are mentioned in r′, is in the regular set defined by r′;
    3. each of n1, n2, ..., nk is SR(v)-valid with respect to S.
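The first condition, that t be related to the expected type by a tolerated composition of atomic transformations, can be sketched in Python. The sketch assumes that each refinement step is recorded as an edge labelled with a single-letter transformation name, that the derivation graph is acyclic, and that the application's tolerance is written as an ordinary regular expression over those letters (including any desired repetition). The names `label_paths` and `tolerated` are hypothetical.

```python
import re


def label_paths(t, base, edges):
    """Yield the label string of every derivation path from t up to base.

    `edges` maps (subtype, base type) pairs to a one-letter transformation
    label; the derivation graph is assumed to be acyclic.
    """
    if t == base:
        yield ""
        return
    for (sub, sup), label in edges.items():
        if sub == t:
            for rest in label_paths(sup, base, edges):
                yield label + rest


def tolerated(t, base, edges, tolerance):
    """True if some derivation of t from base uses only tolerated steps."""
    return any(re.fullmatch(tolerance, path)
               for path in label_paths(t, base, edges))
```

For instance, with `edges = {("B", "A"): "a", ("C", "B"): "r", ("D", "A"): "k"}`, an application declaring the tolerance `"(a|r)*"` accepts `C` where `A` is expected (path `"ra"`) but rejects `D` (path `"k"`).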

A.4.6  Modeling Particular Refinement Transformations

We now explore how some of the forms of modification described above fit into the formal model. Generally speaking, we proceed by taking a schema S and defining a new schema S′ where the components of the latter structure are defined by making incremental changes to the components of the former, but leaving the root element the same in both structures. We will subscript the components of S and S′ with their respective schema names, e.g. ES and vS′.

A.4.6.1  Adding a new tag
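From schema S with type t we create a new schema S′ with a new element e′ via transformation e:

\[
\begin{aligned}
S' &= \langle E_{S'}, T_S, A_S, V_S, v_S, f_S, g_S, h_{S'}, e_0 \rangle \quad \text{where} \\
E_{S'} &= E_S \cup \{e'\} \\
h_{S'}(e) &= \begin{cases} h_S(e) & \text{if } e \in E_S \\ t & \text{if } e = e' \end{cases}
\end{aligned}
\]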
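A minimal Python sketch of this transformation follows. It models only the components that the transformation touches (E, T, h, and e0) and leaves out A, V, v, f, and g for brevity; the class `Schema` and the function `add_tag` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    E: set    # element tags
    T: set    # type names
    h: dict   # maps each element tag to its type
    e0: str   # root element


def add_tag(s, new_tag, t):
    """Return S' with a new element e' of existing type t; S is left unchanged."""
    if new_tag in s.E:
        raise ValueError(f"element {new_tag!r} already exists")
    if t not in s.T:
        raise ValueError(f"unknown type {t!r}")
    return Schema(E=s.E | {new_tag}, T=s.T, h={**s.h, new_tag: t}, e0=s.e0)
```

Building a fresh `Schema` rather than mutating the old one mirrors the text, where S and S′ are distinct structures that share the root element e0.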

A.4.6.2  Creating a new type by adding an attribute
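From schema S with type t′′ we create a new schema S′ having a new type t′ with an additional attribute a′ assigned from the subspace V′ ⊆ V via transformation a:

\[
\begin{aligned}
S' &= \langle E_S, T_{S'}, A_S, V_S, v_{S'}, f_{S'}, g_{S'}, h_S, e_0 \rangle \quad \text{where} \\
T_{S'} &= T_S \cup \{t'\} \\
v_{S'} &= v_S \cup \{\langle t', t'' \rangle\} \\
f_{S'}(t) &= \begin{cases} f_S(t) & \text{if } t \in T_S \\ f_S(t'') & \text{if } t = t' \end{cases} \\
g_{S'}(t) &= g_S(t) \quad \text{for } t \in T_S \\
(g_{S'}(t'))(a) &= \begin{cases} (g_S(t''))(a) & \text{for } a \neq a' \\ V' & \text{for } a = a' \end{cases}
\end{aligned}
\]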

A.4.6.3  Creating a new type by restricting choices
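From schema S with type t′′ we create a new schema S′ having a new type t′ such that fS′(t′) = r′ is derived from fS(t′′) by restriction of choices (see A.3) via transformation r:

\[
\begin{aligned}
S' &= \langle E_S, T_{S'}, A_S, V_S, v_{S'}, f_{S'}, g_{S'}, h_S, e_0 \rangle \quad \text{where} \\
T_{S'} &= T_S \cup \{t'\} \\
v_{S'} &= v_S \cup \{\langle t', t'' \rangle\} \\
f_{S'}(t) &= \begin{cases} f_S(t) & \text{if } t \in T_S \\ r' & \text{if } t = t' \end{cases} \\
g_{S'}(t) &= \begin{cases} g_S(t) & \text{if } t \in T_S \\ g_S(t'') & \text{if } t = t' \end{cases}
\end{aligned}
\]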

A.4.6.4  Creating a new type by cyclically permuting a content model
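From schema S with type t′′ we create a new schema S′ having a new type t′ such that fS′(t′) = r′ is derived from fS(t′′) by cyclic permutation (see A.3) via transformation k:

\[
\begin{aligned}
S' &= \langle E_S, T_{S'}, A_S, V_S, v_{S'}, f_{S'}, g_{S'}, h_S, e_0 \rangle \quad \text{where} \\
T_{S'} &= T_S \cup \{t'\} \\
v_{S'} &= v_S \cup \{\langle t', t'' \rangle\} \\
f_{S'}(t) &= \begin{cases} f_S(t) & \text{if } t \in T_S \\ r' & \text{if } t = t' \end{cases} \\
g_{S'}(t) &= \begin{cases} g_S(t) & \text{if } t \in T_S \\ g_S(t'') & \text{if } t = t' \end{cases}
\end{aligned}
\]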

A.4.6.5  The case of international addresses
Jonathan Robie presents a use case in which he defines an abstract address type. From this he refines US and German addresses. Each differs from the base class in that it is a reordering of the base type with a narrowing of the attribute value space. He presents these as follows.

<archetype name="AbstractAddress">
  <SGMLAmpersandGroup>
    <element name="Street" type="dt:String" />
    <element name="StreetNumber" type="dt:String" />
    <element name="City" type="dt:String" />
    <element name="PostalCode" type="dt:String" />
  </SGMLAmpersandGroup>
</archetype>

<element name="Address" type="AbstractAddress">
  <element name="StreetNumber" type="a:StreetNumber" />
  <element name="StreetName" type="a:Street" />
  <element name="City" type="a:City" />
  <element name="Zip" type="a:PostalCode" />
</element>

<element name="Adresse" type="AbstractAddress">
  <element name="Strasse" type="a:Street" />
  <element name="Strassennummer" type="a:StreetNumber" />
  <element name="Postleitzahl" type="a:PostalCode" />
  <element name="Stadt" type="a:City" />
</element>

We achieve the effect of the above by deriving a schema having two new types for US and German addresses and associating the elements Address and Adresse with those types respectively. Assume a schema S having types t1, t2, t3, t4, t5, t6 corresponding to AbstractAddress, dt:String, a:StreetNumber, a:Street, a:City, and a:PostalCode. Moreover, in schema S these types are related as follows: t3 vS t2, t4 vS t2, t5 vS t2, t6 vS t2.

  1. Via e transformations we create StreetName, Zip, Strasse, Strassennummer, Postleitzahl, and Stadt elements.
  2. Via k transformations we create a new USAddress type from the AbstractAddress type.
  3. Via an e transformation we create an Address element of type USAddress.
  4. Via k transformations we create a new DAddresse type from the AbstractAddress type.
  5. Call the resulting schema S′.

At this point an instance of the form of the element definition of Address or of the element definition of Adresse will be S′(e|k)*-valid with respect to S′.


Footnotes:

[1] The informal account corresponds to the view of refinement that happens through the variation of tags. We do this because it simplifies the informal exposition. The formal exposition, however, permits refinement both by variation of tag and through type names.

[2] ℜ(E) ∩ V = ∅. If t, t′ ∈ T and t′ v t then both h(t) and h(t′) are in ℜ(E) or they are both in ℘(V).

[3] For a ∈ A such that g(a) ≠ ∅ we will say that the type T has the attribute a.

[4] We will take as primitive those types in T that are mapped by f to the empty regular expression, i.e. strings, integers, dates, etc.

[5] A partial order ℑ is discrete if whenever n ℑ n′ there does not exist an n′′ such that n ℑ n′′ ℑ n′ for distinct n, n′, n′′.

[6] For a node n, y(n) is the tag of n.

[7] This corresponds to an element with non-element content contained in a specified subset of the set of values V.

[8] This can occur either through the refinement of type definitions or the refinement of type declarations.

[9] In considering the Sv-closure of a tag we incorporate the refinement of type declarations, sometimes called the variation of tags.

[10] Since the sequence of children nodes in question is the maximum subsequence of children nodes whose tags are mentioned in r′, all other children nodes are ignored for the sake of validation. This is how we model openness of content models. Alternatively, if we required that all the children of n be tagged with tags mentioned in r′, we would effectively have a closed content model.