Document Type Definitions (DTDs) and Namespaces

This chapter is based on Chapter 5 of Erik Ray's book [Ray 2001], Chapter 6 of the book by Deitel et al. [Deitel 2001], Part 2 of Elliotte Rusty Harold's book [Harold 1999], and additional material from the web.

Introduction

One of the most powerful features of XML is that it lets you create your own markup language, defining elements and attributes that best fit the information you want to describe. To be useful, however, such a language needs a formal definition that a parser can check documents against. Writing that definition is called document modeling. The modeling mechanism built into XML itself is the Document Type Definition (DTD). DTDs provide the following features.

Element Type Declarations

Element type declarations identify the names of elements and the nature of their content. Consider Figure 1.

<!ELEMENT album (title, artist+, label?, track+)>

Figure 1 : The Album Element

This declaration identifies the element named album. Its content model follows the element name and defines what the element may contain. In this case, an album must contain a title, one or more artists, an optional label, and one or more tracks, in that order. The commas between element names indicate that they must occur in succession. The plus sign after artist and track indicates that they may be repeated, but must occur at least once. The question mark after label indicates that it is optional (zero or one occurrence). A name with no punctuation, such as title, must occur exactly once.
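
As an illustration, the fragment below is one album that would satisfy this content model. The titles and names are placeholders invented for the example, not data from the CD collection.

```xml
<album>
  <title>Example Album</title>
  <!-- artist+ : at least one artist, repeated here -->
  <artist>First Artist</artist>
  <artist>Second Artist</artist>
  <!-- label? : optional, omitted in this album -->
  <!-- track+ : at least one track -->
  <track>Opening Track</track>
  <track>Closing Track</track>
</album>
```

Moving a track before the title, or leaving out every artist, would make the album invalid against this declaration.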

Declarations for title, artist, etc. must also be provided for the XML parser to check the validity of the document. In addition to element names, the special symbol #PCDATA indicates parsed character data. Consider Figure 2.

<!ELEMENT title (#PCDATA)>

Figure 2 : The Title Element

The title element contains character data and no nested elements. Elements that contain only other elements are said to have element content, while elements that contain both other elements and character data are said to have mixed content. Parentheses can be used to group several elements so that the group is treated as a single unit. Consider the declaration of an HTML definition list given in Figure 3.

<!ELEMENT DL (DT, DD)+>

Figure 3 : HTML Definition List

Each <DT> element is followed by a <DD> element; this pair can be repeated one or more times. A vertical bar is used to indicate alternatives. Consider the declaration of an HTML table row given in Figure 4.

<!ELEMENT TR (TH | TD)+>

Figure 4 : HTML Table Row

Each table row consists of one or more cells, each of which is either a header cell or a data cell. There are two other possible content models. ANY indicates that any content is allowed; this should be avoided because it disables content checking for that element. EMPTY indicates that the element has no content; in XML it is written either as an empty-element tag such as <IMG/> or as a start tag immediately followed by its end tag. Consider the declaration of an HTML image given in Figure 5.

<!ELEMENT IMG EMPTY>

Figure 5 : HTML Image Element

All the information concerning the image is given as attributes inside the IMG tag.
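
The following self-contained document is a small sketch that exercises alternatives, mixed content, and EMPTY together. The content models for TH and TD and the src attribute are simplified assumptions made for the example; they are not the real HTML declarations.

```xml
<?xml version="1.0"?>
<!DOCTYPE TR [
  <!ELEMENT TR (TH | TD)+>
  <!ELEMENT TH (#PCDATA)>
  <!-- Mixed content: character data interleaved with IMG elements -->
  <!ELEMENT TD (#PCDATA | IMG)*>
  <!ELEMENT IMG EMPTY>
  <!-- Assumed attribute so that the image carries some information -->
  <!ATTLIST IMG src CDATA #REQUIRED>
]>
<TR>
  <TH>Cover</TH>
  <!-- Empty-element tag form -->
  <TD>Front: <IMG src="front.png"/></TD>
  <!-- Start tag immediately followed by end tag; also allowed for EMPTY -->
  <TD><IMG src="back.png"></IMG></TD>
</TR>
```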

Attribute List Declarations

Attribute list declarations identify which elements may have attributes, what attributes they may have, what values the attributes may hold, and what value is the default. Consider the attribute list declaration for a track on one of our CDs given in Figure 6.

<!ATTLIST track time CDATA #IMPLIED>

Figure 6 : CD Track Attribute

This indicates that the track element has one attribute, time, which consists of character data, and that it's optional. Figure 7 is the complete DTD for the CD collection.

<?xml version="1.0" encoding="ISO-8859-1"?>

<!ELEMENT album (title, artist+, label?, track+)>
<!ATTLIST album id CDATA #IMPLIED>

<!ELEMENT artist (#PCDATA)>

<!ELEMENT cdcollection (album+)>

<!ELEMENT label (#PCDATA)>

<!ELEMENT title (#PCDATA)>

<!ELEMENT track (#PCDATA)>
<!ATTLIST track time CDATA #IMPLIED>


Figure 7 : CD Collection DTD
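
A document is tied to this DTD through a document type declaration. The sketch below assumes the DTD in Figure 7 has been saved as cd-collection.dtd next to the document; that file name and all of the album data are placeholders.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- cd-collection.dtd is an assumed file name for the DTD in Figure 7 -->
<!DOCTYPE cdcollection SYSTEM "cd-collection.dtd">
<cdcollection>
  <album id="a1">
    <title>Example Album</title>
    <artist>First Artist</artist>
    <label>Example Records</label>
    <!-- time is #IMPLIED, so it may be present or absent -->
    <track time="3:45">Opening Track</track>
    <track>Closing Track</track>
  </album>
</cdcollection>
```

The name after DOCTYPE must match the root element, here cdcollection.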

Each attribute in a declaration has three parts: a name, a type, and a default value. The attribute types fall into the six groups listed below; the less common NOTATION type is not covered here.

CDATA
Character data, i.e. strings of text.
ID
A name uniquely identifying one element in the document.
IDREF or IDREFS
The value of an ID attribute on some element in the document; the plural form takes a list of them separated by whitespace.
ENTITY or ENTITIES
The name of an unparsed entity declared in the DTD (see below); the plural form takes a list of them separated by whitespace.
NMTOKEN or NMTOKENS
A single name token, i.e. a word made up of name characters; the plural form takes a list of them separated by whitespace.
A list of names
An enumeration of permitted values, separated by vertical bars and enclosed in parentheses.

There are four possible default values.

#REQUIRED
The attribute must have an explicitly specified value on every occurrence of the element in the document.
#IMPLIED
The attribute value is not required, and no default value is provided.
"value"
Any legal value as a default if the attribute is not provided.
#FIXED "value"
The attribute is not required, but if present it must have the specified value; if it is absent, the parser supplies that value.

Consider the element declaration and attribute list declaration of an HTML line break given in Figure 8.

<!ELEMENT BR EMPTY>
<!ATTLIST BR clear (all|left|none|right) "none">

Figure 8 : HTML Line Break

The clear attribute of the <BR> element can take one of four values; if it is omitted the default value is "none".
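
To show several types and defaults side by side, here is a sketch with a deliberately simplified album element. The seealso, genre, format, and units attributes are hypothetical, and id is typed as ID here rather than CDATA as in Figure 7.

```xml
<?xml version="1.0"?>
<!DOCTYPE cdcollection [
  <!ELEMENT cdcollection (album+)>
  <!ELEMENT album (title)>
  <!ELEMENT title (#PCDATA)>
  <!-- Hypothetical attributes, chosen only to illustrate types and defaults -->
  <!ATTLIST album
            id      ID                  #REQUIRED
            seealso IDREFS              #IMPLIED
            genre   NMTOKEN             #IMPLIED
            format  (cd | vinyl | tape) "cd"
            units   CDATA               #FIXED "minutes">
]>
<cdcollection>
  <!-- format defaults to "cd"; units is supplied as "minutes" -->
  <album id="a1" genre="jazz">
    <title>First Example</title>
  </album>
  <!-- seealso must name ID values that exist in this document -->
  <album id="a2" seealso="a1" format="vinyl">
    <title>Second Example</title>
  </album>
</cdcollection>
```

A validating parser would reject a duplicate id value, a seealso value that matches no id, or a format value outside the enumeration.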

Entity Declarations

Entity declarations associate a name with some other fragment of content. There are three types of entity.

Internal Entities

These associate a name with a string of literal text. A simple example is given in Figure 9.

<!ENTITY bbk "Birkbeck, University of London">

Figure 9 : Birkbeck Entity

The &bbk; entity reference is replaced by the associated text wherever it occurs, even in other entity declarations. The XML specification predefines the five internal entities listed in Figure 10. Each character can also be written as a decimal or hexadecimal character reference.

Named Entity   Decimal Reference   Hexadecimal Reference   Character
&quot;         &#34;               &#x22;                  Double quote (")
&amp;          &#38;               &#x26;                  Ampersand (&)
&apos;         &#39;               &#x27;                  Apostrophe (')
&lt;           &#60;               &#x3C;                  Less than (<)
&gt;           &#62;               &#x3E;                  Greater than (>)

HTML entities such as &nbsp; (non-breaking space, &#160; or &#xA0;) and &shy; (soft hyphen, &#173; or &#xAD;) are not predefined in XML. They must either be declared in the DTD or be written as character references.

Figure 10 : XML Internal Entities

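
The short document below sketches both kinds of reference in use. The note element and the dept entity are hypothetical names introduced for the example.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY bbk "Birkbeck, University of London">
  <!-- An entity whose replacement text refers to another entity -->
  <!ENTITY dept "Hypothetical Department, &bbk;">
]>
<!-- Predefined entities let markup characters appear as ordinary text -->
<note>&dept; stores &lt;album&gt; &amp; &lt;track&gt; elements.</note>
```

After parsing, the character data of note reads: Hypothetical Department, Birkbeck, University of London stores <album> & <track> elements.
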
External Entities

These associate a name with the contents of another file or URL. Consider the entity declaration in Figure 11.

<!ENTITY cds SYSTEM "cd-collection.xml">

Figure 11 : CD Collection Entity

The SYSTEM keyword indicates that it's an external entity and the &cds; entity reference will be replaced by the contents of the (relative) file wherever it occurs.
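
A minimal sketch of the reference in use follows. The library element is a hypothetical wrapper, and the sketch aims only at well-formedness: for validation, the elements inside cd-collection.xml would also need declarations. It further assumes that cd-collection.xml contains no document type declaration of its own, since one is not allowed inside an external parsed entity.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
  <!-- library is a hypothetical root element for this sketch -->
  <!ELEMENT library ANY>
  <!ENTITY cds SYSTEM "cd-collection.xml">
]>
<library>
  <!-- Replaced by the contents of cd-collection.xml when entities are expanded -->
  &cds;
</library>
```

Many parsers do not load external entities unless explicitly configured to, often for security reasons, so whether the reference is expanded depends on the parser settings.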

Parameter Entities

Parameter entities can be declared and referenced only within the DTD. They are declared just like internal entities, but with a % and a space before the name. Consider the declaration of an HTML heading given in Figure 12.

<!ENTITY % heading "H1 | H2 | H3 | H4 | H5 | H6">

Figure 12 : HTML Heading

A parameter entity reference, such as %heading;, is expanded immediately and becomes part of the declaration. Consider the partial declaration of an HTML block given in Figure 13.

<!ENTITY % block "P | %heading; | %list; | ...">

Figure 13 : HTML Block

The %block; entity reference expands to the P element, the six heading elements, the four list elements, and much more besides.
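
The fragment below sketches how such entities end up inside a content model. It is meant for an external DTD file, because parameter entity references inside declarations are not permitted in the internal subset. The list entity is cut down to two elements and the BODY content model is a simplification; both are assumptions, not the real HTML DTD.

```xml
<!-- Fragment of an external DTD file -->
<!ENTITY % heading "H1 | H2 | H3 | H4 | H5 | H6">
<!-- Simplified, assumed definition of the list entity -->
<!ENTITY % list "UL | OL">
<!ENTITY % block "P | %heading; | %list;">

<!-- After expansion: (P | H1 | H2 | H3 | H4 | H5 | H6 | UL | OL)+ -->
<!ELEMENT BODY (%block;)+>
```

Each entity must be declared before the declaration that references it.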

Validity

A document is well-formed if it obeys the syntax rules of XML, for example that every start tag has a matching end tag and that elements nest properly. By definition, if a document is not well-formed, it is not XML. A well-formed document is valid only if it has a document type declaration that contains or refers to a DTD, and if the document conforms to that DTD.
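
The distinction can be seen in the sketch below, which reuses the album declarations from earlier with placeholder data. The document is well-formed but not valid.

```xml
<?xml version="1.0"?>
<!DOCTYPE album [
  <!ELEMENT album (title, artist+, label?, track+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT artist (#PCDATA)>
  <!ELEMENT label (#PCDATA)>
  <!ELEMENT track (#PCDATA)>
]>
<!-- Well-formed: every tag is closed and properly nested.
     Not valid: the content model requires at least one artist. -->
<album>
  <title>Example Album</title>
  <track>Only Track</track>
</album>
```

A non-validating parser accepts this document; a validating parser reports the missing artist element.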

Namespaces

A namespace is a group of element and attribute names, identified by a URI. You can declare that an element belongs to a particular namespace so that its name cannot be confused with the same name from another vocabulary. Note that DTD validation itself is not namespace-aware: a DTD treats a prefixed name such as cd:title as a single opaque name. For example, is <title> the title of a CD album, the title of a track on the CD, or the title of a book? Attaching a prefix to the element indicates which namespace it belongs to, so <cd:title> is different from <book:title>.
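
A minimal sketch of two title elements coexisting in one document is shown below. The catalog element and both example.org URIs are hypothetical; any URI under your control can serve as a namespace name.

```xml
<?xml version="1.0"?>
<!-- Both namespace URIs are made-up identifiers for this example -->
<catalog xmlns:cd="http://example.org/ns/cd"
         xmlns:book="http://example.org/ns/book">
  <cd:album>
    <cd:title>Example Album</cd:title>
  </cd:album>
  <book:book>
    <book:title>Example Book</book:title>
  </book:book>
</catalog>
```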

Namespaces aren't useful only for preventing name clashes. They also help the XML processor sort out different groups of elements for different processing. For example, XSL transformations (XSLT) rely on namespaces to distinguish between XML objects that are data, and those that are instructions for processing the data. Figure 14 is the complete listing of the XSL document for transforming our CD collection into HTML.

<?xml version="1.0"?>
 
<xsl:stylesheet version="1.0"
		xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  
  <xsl:template match="cdcollection">
    <xsl:apply-templates select="album"/>
  </xsl:template>
  
  <xsl:template match="album">
    <H1>
    <xsl:apply-templates select="title"/>
    </H1>
    <H2>
    <xsl:apply-templates select="artist"/>
    </H2>
    <OL>
    <xsl:for-each select="track">
      <LI>
      <xsl:value-of select="text()"/>
      <xsl:if test="@time">
        (<xsl:value-of select="@time"/>)
      </xsl:if>
      </LI>
    </xsl:for-each>
    </OL>
  </xsl:template>

  <xsl:template match="artist">
    <xsl:value-of select="text()"/>
    <xsl:if test="position()!=last()">, </xsl:if>
  </xsl:template>

</xsl:stylesheet>


Figure 14 : CD XSL

The xmlns:xsl attribute on the <xsl:stylesheet> root element binds the prefix xsl to the namespace identified by the given URI. The URI serves only as a unique identifier for the XSLT vocabulary; the processor does not retrieve anything from it. So <xsl:...> elements are processed as instructions and the remaining elements, such as <H1> and <OL>, are just data copied to the output. Note that the stylesheet has to be well-formed XML, so HTML start tags must have corresponding end tags.
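
What an XSLT processor looks at is the namespace URI, not the prefix itself, so the prefix xsl is a convention rather than a requirement. The cut-down stylesheet below is a sketch that uses the arbitrary prefix t instead and should behave the same way for the one template it contains.

```xml
<?xml version="1.0"?>
<!-- The prefix t is arbitrary; the URI identifies the elements as XSLT -->
<t:stylesheet version="1.0"
              xmlns:t="http://www.w3.org/1999/XSL/Transform">

  <t:template match="title">
    <H1><t:value-of select="text()"/></H1>
  </t:template>

</t:stylesheet>
```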

For fine-grained control, any element in the document can contain a namespace declaration, but most often it's the root element.
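
As a sketch of a declaration placed below the root, the document here binds a prefix on a single nested element. The notes element and the h prefix are hypothetical; the URI is the XHTML namespace name.

```xml
<?xml version="1.0"?>
<cdcollection>
  <album>
    <title>Example Album</title>
    <artist>First Artist</artist>
    <track>Opening Track</track>
    <!-- The h prefix is in scope only inside this notes element -->
    <notes xmlns:h="http://www.w3.org/1999/xhtml">
      <h:p>Recorded <h:em>live</h:em>.</h:p>
    </notes>
  </album>
</cdcollection>
```

Using h:p outside the notes element would be an error in a namespace-aware parser, because the prefix is not declared there.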
