This chapter is based on Chapter 5 of Erik Ray's book [Ray 2001], Chapter 6 of the Deitel et al. tome [Deitel 2001], Part 2 of Elliotte Rusty Harold's book [Harold 1999], plus additional material from the web.
One of the most powerful features of XML is that it lets you create your own markup language, defining elements and attributes that best fit the information you want to describe. However, the syntax rules of XML alone do not define your language formally: you need a separate mechanism for stating which elements and attributes are allowed and how they may be combined. This process is called document modeling. The current way of modeling XML is with Document Type Definitions (DTDs). They provide the following features.
Element type declarations identify the names of elements and the nature of their content. Consider Figure 1.
<!ELEMENT album (title, artist+, label?, track+)>
This declaration identifies the element named
album. Its content model follows the
element name and defines what the element may contain. In
this case, an album must contain a
title, one or more artist elements, an optional
label, and one or more track elements. The commas
between element names indicate that they must occur in
that order. The plus sign after artist and
track indicates that they may be repeated, but
must occur at least once. The question mark after
label indicates that it is optional (zero or
one occurrence). A name with no punctuation, such as
title, must occur exactly once.
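As an illustration, the following fragment is one album that satisfies this content model. Only the element names come from the declaration; the text inside the elements is made up for the example.

```xml
<album>
  <title>Example Album</title>      <!-- exactly one title -->
  <artist>First Artist</artist>     <!-- one or more artists -->
  <artist>Second Artist</artist>
  <label>Example Records</label>    <!-- optional: may be omitted -->
  <track>Opening Song</track>       <!-- one or more tracks -->
  <track>Closing Song</track>
</album>
```

Placing label before artist, or leaving out every track, would make the fragment invalid against this declaration.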
Declarations for title, artist,
etc. must also be provided for the XML parser to check the
validity of the document. In addition to element names, the
special symbol #PCDATA indicates parsed
character data. Consider Figure 2.
<!ELEMENT title (#PCDATA)>
The title element contains character data and
no nested elements. Elements that contain only other
elements are said to have element content while
elements that contain both other elements and character data
are said to have mixed content. Parentheses can be
used to group several elements so that they are treated as a single unit.
Consider the declaration of an HTML definition list given in
Figure 3.
<!ELEMENT DL (DT, DD)+>
Each <DT> element is followed by a
<DD> element; this pair can be repeated
one or more times. A vertical bar is used to indicate
alternatives. Consider the declaration of an HTML table row
given in Figure 4.
<!ELEMENT TR (TH | TD)+>
Each table row consists of one or more header cells or data
cells. There are two other possible content models.
ANY indicates that any content is allowed; this
should be avoided because it disables content checking.
EMPTY indicates that the element has no content; in XML
it is written as an empty-element tag such as <IMG/>, or as
a start tag immediately followed by its end tag. Consider the declaration of an
HTML image given in Figure 5.
<!ELEMENT IMG EMPTY>
All the information concerning the image is given as
attributes inside the IMG tag.
Attribute list declarations identify which elements may have
attributes, what attributes they may have, what values the
attributes may hold, and what value is the default.
Consider the attribute list declaration for a
track on one of our CDs given in Figure 6.
<!ATTLIST track time CDATA #IMPLIED>
This indicates that the track element has one
attribute, time, which consists of character
data, and that it's optional. Figure 7 is the complete DTD
for the CD
collection.
<?xml version="1.0" encoding="ISO-8859-1"?>
<!ELEMENT album (title, artist+, label?, track+)>
<!ATTLIST album id CDATA #IMPLIED>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT cdcollection (album+)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT track (#PCDATA)>
<!ATTLIST track time CDATA #IMPLIED>
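To show how a document ties itself to this DTD, here is a minimal sketch of a CD collection that should validate against it. The file name cd-collection.dtd is an assumption (the chapter does not name the DTD file), and the album data is invented.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- "cd-collection.dtd" is an assumed file name for the DTD listed above -->
<!DOCTYPE cdcollection SYSTEM "cd-collection.dtd">
<cdcollection>
  <album id="a1">
    <title>Example Album</title>
    <artist>Example Artist</artist>
    <label>Example Records</label>
    <!-- time is #IMPLIED, so it may be given or left out -->
    <track time="3:45">Opening Song</track>
    <track>Closing Song</track>
  </album>
</cdcollection>
```

The name after DOCTYPE must match the root element, here cdcollection.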
Each attribute in a declaration has three parts: a name, a type, and a default value. There are six possible attribute types.
CDATA
ID
IDREF or IDREFS
ENTITY or ENTITIES
NMTOKEN or NMTOKENS
An enumerated list of allowed values, such as (all|left|none|right)
There are four possible default values.
#REQUIRED
#IMPLIED
#FIXED "value"
"value" (a literal default, used when the attribute is omitted)
Consider the element declaration and attribute list declaration of an HTML line break given in Figure 8.
<!ELEMENT BR EMPTY>
<!ATTLIST BR clear (all|left|none|right) "none">
The clear attribute of the
<BR> element can take one of four values;
if it is omitted the default value is "none".
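The following sketch combines several attribute types and default values in one declaration. It is a deliberately simplified, hypothetical variant of the CD collection: the attribute names seealso, media, and units are invented for illustration, and album is reduced to character data to keep the example short.

```xml
<?xml version="1.0"?>
<!DOCTYPE cdcollection [
  <!ELEMENT cdcollection (album+)>
  <!ELEMENT album (#PCDATA)>
  <!-- id must be present and unique; seealso must match some id;
       media defaults to "cd"; units can only ever be "minutes" -->
  <!ATTLIST album
    id      ID               #REQUIRED
    seealso IDREF            #IMPLIED
    media   (cd|vinyl|tape)  "cd"
    units   CDATA            #FIXED "minutes">
]>
<cdcollection>
  <album id="a1">First album</album>
  <album id="a2" seealso="a1" media="vinyl">Second album</album>
</cdcollection>
```

A validating parser would treat the first album as if it carried media="cd" and units="minutes", and would reject a seealso value that does not match any id in the document.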
Entity declarations associate a name with some other fragment of content. There are three types of entity.
Internal entities associate a name with a string of literal text. A simple example is given in Figure 9.
<!ENTITY bbk "Birkbeck, University of London">
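The &bbk; entity reference is replaced by
the associated text wherever it occurs, even in other entity
declarations. The XML specification predefines five
internal entities, shown in the first five rows of Figure 10. The last two rows, &nbsp; and &shy;, come from HTML; in XML they must be declared in the DTD before they can be used by name, although their numeric forms always work.
| Named Entity | Decimal Entity | Hexadecimal Entity | Character |
|---|---|---|---|
| &quot; | &#34; | &#x22; | Double quote (") |
| &amp; | &#38; | &#x26; | Ampersand (&) |
| &apos; | &#39; | &#x27; | Apostrophe (') |
| &lt; | &#60; | &#x3C; | Less than (<) |
| &gt; | &#62; | &#x3E; | Greater than (>) |
| &nbsp; | &#160; | &#xA0; | Non-breaking space |
| &shy; | &#173; | &#xAD; | Soft hyphen |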
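A small sketch of internal entities in use, with the declarations placed in the document's internal DTD subset. The note element and the course entity are hypothetical names chosen for the example; only bbk comes from the text.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY bbk "Birkbeck, University of London">
  <!-- The replacement text of one entity may refer to another entity -->
  <!ENTITY course "XML course at &bbk;">
]>
<note>Notes for the &course; &amp; related material.</note>
```

After entity replacement, the content of note reads "Notes for the XML course at Birkbeck, University of London & related material." The &amp; reference needs no declaration because it is predefined.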
External entities associate a name with the contents of another file or URL. Consider the entity declaration in Figure 11.
<!ENTITY cds SYSTEM "cd-collection.xml">
The SYSTEM keyword indicates that it's an
external entity and the &cds; entity reference
will be replaced by the contents of the (relative) file
wherever it occurs.
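As a sketch of how such a reference might be used, the document below pulls the CD collection into a larger document. The library element is a hypothetical wrapper, and no element declarations are given, so the example is only meant to be well-formed.

```xml
<?xml version="1.0"?>
<!DOCTYPE library [
  <!-- Resolved relative to the location of this document -->
  <!ENTITY cds SYSTEM "cd-collection.xml">
]>
<library>
  &cds;
</library>
```

For this to work, cd-collection.xml must not contain a document type declaration of its own, since its contents are inserted inside the library element. Non-validating parsers are also allowed to skip external entities, so the result can depend on the parser.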
Parameter entities can only occur in the document type
declaration. They are declared in the DTD just like
internal entities, but with a % and a
space before the name. Consider the declaration of an
HTML heading given in Figure 12.
<!ENTITY % heading "H1 | H2 | H3 | H4 | H5 | H6">
A parameter entity reference, such as
%heading;, is expanded immediately and becomes
part of the declaration. Consider the partial declaration
of an HTML block given in Figure 13.
<!ENTITY % block "P | %heading; | %list; | ...">
The %block; entity reference contains the
P element, the six heading elements, the four
list elements, and much more besides.
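The expansion can be sketched with a cut-down version of these declarations. The block entity here is simplified to paragraphs and headings, and the BODY content model is an assumption made for the example rather than the real HTML declaration.

```xml
<!-- Fragment of an external DTD file -->
<!ENTITY % heading "H1 | H2 | H3 | H4 | H5 | H6">
<!ENTITY % block "P | %heading;">

<!-- After expansion this reads: (P | H1 | H2 | H3 | H4 | H5 | H6)* -->
<!ELEMENT BODY (%block;)*>
```

Note that parameter entity references inside a declaration, as in the last two lines, are only permitted in an external DTD; in the internal subset they may appear only between declarations.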
A document is well-formed if it obeys the syntax of XML, that is, if it conforms to the grammar of XML documents. By definition, if a document is not well-formed, it is not XML. A well-formed document is valid only if it has a DTD, either included in the document or referenced from it, and conforms to that DTD.
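The difference can be seen in a short sketch that reuses the album declarations from earlier in the chapter (the album data is invented). The document is well-formed but not valid.

```xml
<?xml version="1.0"?>
<!DOCTYPE album [
  <!ELEMENT album (title, artist+, label?, track+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT artist (#PCDATA)>
  <!ELEMENT label (#PCDATA)>
  <!ELEMENT track (#PCDATA)>
]>
<!-- Well-formed: every start tag has an end tag and elements nest properly.
     Not valid: the content model requires at least one track. -->
<album>
  <title>Example Album</title>
  <artist>Example Artist</artist>
</album>
```

A non-validating parser accepts this document; a validating parser reports an error at the end of album. Dropping the closing album tag instead would make the document not well-formed, and every XML parser would then reject it.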
A namespace is a group of element and attribute
names. You can declare that an element belongs to a
particular namespace, so that a processor can tell which
vocabulary its name comes from. For example, is
<title> the title of a CD album or the
title of a track on the CD, or the title of a book?
Attaching a prefix to the element indicates which namespace
it belongs to, so the <cd:title> is
different from the <book:title>.
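A minimal sketch of the two title elements side by side. The order wrapper element and both namespace URIs are made up for the example; any unique URI would serve.

```xml
<?xml version="1.0"?>
<!-- Both namespace URIs are invented identifiers for this example -->
<order xmlns:cd="http://example.org/ns/cd"
       xmlns:book="http://example.org/ns/book">
  <cd:album>
    <cd:title>Example Album</cd:title>
  </cd:album>
  <book:book>
    <book:title>Example Book</book:title>
  </book:book>
</order>
```

The prefixes are only shorthand: what distinguishes the two title elements is the URI each prefix is bound to.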
Namespaces aren't useful only for preventing name clashes. They also help the XML processor sort out different groups of elements for different processing. For example, XSL transformations (XSLT) rely on namespaces to distinguish between XML objects that are data, and those that are instructions for processing the data. Figure 14 is the complete listing of the XSL document for transforming our CD collection into HTML.
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="cdcollection">
    <xsl:apply-templates select="album"/>
  </xsl:template>
  <xsl:template match="album">
    <H1>
      <xsl:apply-templates select="title"/>
    </H1>
    <H2>
      <xsl:apply-templates select="artist"/>
    </H2>
    <OL>
      <xsl:for-each select="track">
        <LI>
          <xsl:value-of select="text()"/>
          <xsl:if test="@time">
            (<xsl:value-of select="@time"/>)
          </xsl:if>
        </LI>
      </xsl:for-each>
    </OL>
  </xsl:template>
  <xsl:template match="artist">
    <xsl:value-of select="text()"/>
    <xsl:if test="position()!=last()">, </xsl:if>
  </xsl:template>
</xsl:stylesheet>
The <xsl:stylesheet> root element
binds the namespace prefix xsl to the
namespace of XSL transformations, which is identified by
the indicated URI. The URI serves only as a unique name;
the processor does not retrieve anything from it. So
<xsl:...> elements are processed and the
remaining elements, such as <H1> and
<OL>, are just data. Note that the
document has to be well-formed XML, so HTML start tags must
have corresponding end tags.
For fine-grained control, any element in the document can contain a namespace declaration, but most often it's the root element.
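A sketch of a declaration placed below the root. The catalogue and item elements and the namespace URI are hypothetical.

```xml
<?xml version="1.0"?>
<catalogue>
  <!-- The cd prefix is declared on this element and is in scope
       only for the element itself and its descendants -->
  <cd:album xmlns:cd="http://example.org/ns/cd">
    <cd:title>Example Album</cd:title>
  </cd:album>
  <!-- Using the cd prefix here would be a namespace error -->
  <item>No namespace prefix is in scope here</item>
</catalogue>
```

Declaring the prefix on the root element instead would make it available throughout the document, which is why that is the more common choice.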