Defining Web Document Types

Peter Wood

Document Types

Document Type Definitions (DTDs)

Valid XML

XML parser checking document is valid

Valid XHTML

XHTML parser checking document is valid

  • all XHTML element names (tags) must be in lowercase
  • empty elements must use abbreviated closing tag
    • e.g., <hr> must appear as <hr/> (or <hr /> to fool old browsers)
  • must conform to one of 3 DTDs: strict, transitional or frameset
    • strict: presentation must use CSS
    • transitional: presentation information can be embedded in document
    • frameset: for documents using frames (not covered)
  • XHTML document can be validated using the W3C validator service
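
A minimal sketch of a document that follows the rules above, assuming the XHTML 1.0 Strict DTD is the one chosen; the title and paragraph text are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <!-- title is required inside head -->
    <title>Placeholder title</title>
  </head>
  <body>
    <!-- element names are written in lowercase -->
    <p>Placeholder paragraph.</p>
    <!-- empty element written with the abbreviated closing form -->
    <hr />
  </body>
</html>
```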

DTD syntax

  • syntax for an element declaration in a DTD is:
    <!ELEMENT name    (model) >
    
    where
    • ELEMENT is a keyword
    • name is the element name being declared
    • model is the element content model (the allowed contents of the element)
  • content model specified using a regular expression over element names
  • regular expression specifies the permitted sequences of element names

Examples of DTD element declarations

  • an html element must contain a head element followed by a body element:
    <!ELEMENT html    (head, body) >
    
    where "," is the sequence (or concatenation) operator
  • a list element (not in HTML) must contain either a ul element or an ol element (but not both):
    <!ELEMENT list    (ul|ol) >
    
    where "|" is the alternation (or "exclusive or") operator
  • a ul element must contain zero or more li elements:
    <!ELEMENT ul    (li)* >
    
    where "*" is the repetition (or "Kleene star") operator
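
The three operators can be seen working together in one small document. This is only a sketch: the declarations for ol and li are assumptions added to make the example complete, since the slide declares only list and ul.

```xml
<?xml version="1.0"?>
<!DOCTYPE list [
  <!-- alternation: exactly one of ul or ol -->
  <!ELEMENT list (ul|ol) >
  <!-- repetition: zero or more li children -->
  <!ELEMENT ul   (li)* >
  <!-- assumed declarations, not given in the examples above -->
  <!ELEMENT ol   (li)* >
  <!ELEMENT li   (#PCDATA) >
]>
<list>
  <ul>
    <li>first</li>
    <li>second</li>
  </ul>
</list>
```

Replacing the ul element with an empty ul, or with an ol, would still be valid; placing both a ul and an ol inside list would not.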

DTD syntax

DTD Syntax   Meaning
b            element b must occur
b,c          both b and c must occur, in the order specified
b|c          one (and only one) of b or c must occur
b*           zero or more occurrences of b may occur
b+           one or more occurrences of b must occur
b?           zero or one occurrence of b may occur
EMPTY        no content (neither elements nor text) is allowed
ANY          any content (of declared elements and text) is allowed
#PCDATA      content is text rather than an element
  • element names in above table are b and c
  • parentheses can be used for grouping, e.g., (a,b)*
  • b+ is short for (b,b*)
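  • b? means "b or nothing"; writing this as (b|EMPTY) is informal shorthand only, since the keyword EMPTY cannot appear inside a group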
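
A sketch combining grouping with the + and ? operators. All element names here (glossary, term, definition, note) are hypothetical and chosen only for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE glossary [
  <!-- one or more (term, definitions) groups, then an optional note -->
  <!ELEMENT glossary   ((term, definition+)+, note?) >
  <!ELEMENT term       (#PCDATA) >
  <!ELEMENT definition (#PCDATA) >
  <!ELEMENT note       (#PCDATA) >
]>
<glossary>
  <term>DTD</term>
  <definition>Document Type Definition</definition>
  <term>XML</term>
  <definition>Extensible Markup Language</definition>
  <definition>a metalanguage for defining markup</definition>
  <note>Definitions are informal.</note>
</glossary>
```

A term with no definition after it, or a second note, would make the document invalid.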

#PCDATA stands for "parsed character data", meaning an XML parser should parse the characters to resolve character and entity references.

DTD for RSS

  • a fragment of a simplified DTD for RSS might be
    <!ELEMENT rss         (channel) >
    <!ELEMENT channel     (title,link,description,item+) >
    <!ELEMENT item        (title,description,link,pubDate?) >
    <!ELEMENT title       (#PCDATA) >
    <!ELEMENT link        (#PCDATA) >
    <!ELEMENT description (#PCDATA) >
    <!ELEMENT pubDate     (#PCDATA) >
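
An instance that should be valid against the fragment above might look as follows. The feed text and the example.org URLs are placeholders. Note that the children of item follow the order given in the fragment (title, description, link), and that no version attribute appears because the fragment declares none.

```xml
<rss>
  <channel>
    <title>Example feed</title>
    <link>http://example.org/</link>
    <description>Placeholder channel description</description>
    <item>
      <title>First item</title>
      <description>Item with a publication date</description>
      <link>http://example.org/items/1</link>
      <pubDate>Mon, 06 Sep 2010 00:01:00 GMT</pubDate>
    </item>
    <item>
      <title>Second item</title>
      <description>pubDate is optional, so it is omitted here</description>
      <link>http://example.org/items/2</link>
    </item>
  </channel>
</rss>
```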
    

Validation of XML Documents

Referencing a DTD

  • DTD to be used to validate a document can be specified
    • internally (part of document)
    • externally (in another file)
  • done using a document type declaration
  • declare document to be of type given in DTD
  • e.g., <!DOCTYPE rss ... >

Declaring an Internal DTD

<?xml version="1.0"?>
<!DOCTYPE rss [
    <!-- all declarations for rss DTD go here -->
    ...
    <!ELEMENT rss ... >
    ...
]>
<rss>
   <!-- This is an instance of a document of type rss -->
   ...
</rss>
  • element rss must be declared in the DTD
  • name after DOCTYPE (i.e., rss) must match root element of document

Declaring an External DTD (1)

<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "rss.dtd">
<rss>
   <!-- This is an instance of a document of type rss -->
   ...
</rss>
  • what follows SYSTEM is a URI
  • rss.dtd is a relative URI, resolved against the location of the source document (so here the DTD is expected in the same directory)
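
An external DTD can also be combined with an internal subset in the same document type declaration. The sketch below assumes rss.dtd holds the element declarations of the simplified RSS fragment shown earlier; the entity name feed-owner and its value are hypothetical (entities are covered later in these notes).

```xml
<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "rss.dtd" [
  <!-- internal subset: local declarations added to those in rss.dtd -->
  <!ENTITY feed-owner "Example Publisher">
]>
<rss>
  <channel>
    <title>News from &feed-owner;</title>
    <link>http://example.org/</link>
    <description>Placeholder description</description>
    <item>
      <title>First item</title>
      <description>Placeholder item</description>
      <link>http://example.org/items/1</link>
    </item>
  </channel>
</rss>
```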

Declaring an External DTD (2)

<?xml version="1.0"?>
<!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN"
     "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd">
<math>
   <!-- This is an instance of a mathML document type -->
   ...
</math>
  • PUBLIC means what follows is a formal public identifier with 4 fields:
    1. ISO for ISO standard, + for approval by other standards body, and - for everything else
    2. owner of the DTD: e.g., W3C
    3. title of the DTD: e.g., DTD MathML 2.0
    4. language abbreviation: e.g., EN
  • URI gives location of DTD

Formal public identifiers are meant for widely used entities. They should be unique world-wide. Processing software might either come with such entities already installed or know the most efficient sites from which to download them. If not, the URI is used to retrieve the DTD.

Attributes

  • recall that attribute name-value pairs are allowed in start tags
  • e.g., href="file.html" in an HTML a start tag
  • allowed attributes for an element are defined in an attribute list declaration
  • e.g., for rss and guid elements, these might be
    <!ATTLIST rss
       version CDATA #FIXED "2.0" >
    <!ATTLIST guid
       isPermaLink (true|false) "true" >
    
  • attribute definition comprises
    • attribute name, e.g., version
    • type, e.g., CDATA
    • default, e.g., "true"

Some Attribute Types

  • CDATA: any valid character data
  • ID: an identifier unique within the document
  • IDREF: a reference to a unique identifier
  • IDREFS: a reference to several unique identifiers (separated by white-space)
  • (a|b|c): an enumerated attribute type; in this example the value must be one of a, b or c

Attribute Defaults

  • #IMPLIED: attribute may be omitted (optional)
  • #REQUIRED: attribute must be present
  • #FIXED "x": attribute is optional; if present, its value must be x, and if omitted, the value x is supplied
  • "x": a default value; the value will be x if the attribute is omitted
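
The four kinds of default can be seen together in one attribute list declaration. This is a sketch with hypothetical names (task, id, owner, schema, priority):

```xml
<?xml version="1.0"?>
<!DOCTYPE task [
  <!ELEMENT task (#PCDATA) >
  <!ATTLIST task
     id        ID                 #REQUIRED
     owner     CDATA              #IMPLIED
     schema    CDATA              #FIXED "1.0"
     priority  (low|medium|high)  "medium" >
]>
<!-- id must be given; owner is omitted; schema is omitted, so the
     fixed value 1.0 is supplied; priority is given explicitly -->
<task id="t1" priority="high">Write the DTD</task>
```

If priority were omitted as well, a validating parser would report its value as medium. Writing schema="2.0" would make the document invalid.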

Mixed Content Models

  • in rss DTD all content models comprise only elements or only text
  • in HTML, by contrast, paragraph elements allow text interleaved with various in-line elements, such as em, img, b, etc.
  • such a model is said to have mixed content
  • if we want to mix text with elements em, img and b as contents of element p:
    <!ELEMENT p (#PCDATA | em | img | b)* >
    
  • #PCDATA must be first (in the definition)
  • followed by other elements separated by |
  • the * must be applied to the group as a whole
  • this limits our ability to constrain the content model
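
A sketch of a complete document using this mixed content model. The declarations for em, b and img (including the src attribute) are assumptions added so that the example is self-contained; logo.png is a placeholder file name.

```xml
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ELEMENT p   (#PCDATA | em | img | b)* >
  <!-- assumed declarations for the in-line elements -->
  <!ELEMENT em  (#PCDATA) >
  <!ELEMENT b   (#PCDATA) >
  <!ELEMENT img EMPTY >
  <!ATTLIST img src CDATA #REQUIRED >
]>
<p>Text can be <em>interleaved</em> with <b>in-line</b> elements
and images <img src="logo.png"/> in any order and number.</p>
```

Because the model is a starred choice, the DTD cannot require, say, exactly one img or that em comes before b.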

Some exercises

  • Consider the content models (zero, one)* and (zero | one)*. Give an example of a sequence of elements allowed by the one model but not by the other.
  • Consider the elements day, month and year. Produce a content model which allows for each of the sequences
    year
    month year
    day month year
    
    but no others.

Family example DTD

<!ELEMENT family (parent, (parent)?, (child)*)>
<!ELEMENT parent (name)>
<!ELEMENT child  (name)>
<!ELEMENT name   (#PCDATA)>

<!ATTLIST parent
  pno     ID               #IMPLIED
  role    (mother|father)  #IMPLIED
  spouse  IDREF            #IMPLIED>

<!ATTLIST  child
  cno           ID      #IMPLIED
  date-of-birth CDATA   #IMPLIED
  siblings      IDREFS  #IMPLIED>
  • spouse attribute is meant to be interpreted as a reference to a pno attribute
  • siblings attribute is meant to be interpreted as a set of references to cno attributes

Family example: XML document

<?xml version="1.0"?>
<!-- <!DOCTYPE family [ ... DTD goes here ... ]> -->
<family>
  <parent pno="p1" role="mother" spouse="p2">
    <name>Janet</name>
  </parent>
  <parent pno="p2" role="father" spouse="p1">
    <name>John</name>
  </parent>
  <child cno="c1" siblings="c2 c3">
    <name>Tom</name>
  </child>
  <child cno="c2" siblings="c1 c3">
    <name>Dick</name>
  </child>
  <child cno="c3" siblings="c1 c2">
    <name>Harry</name>
  </child>
</family>

A valid family

<?xml version="1.0"?>
<!-- <!DOCTYPE family [ ... DTD goes here ... ]> -->
<family>
  <parent pno="janet">
    <name>Janet</name>
  </parent>
  <child date-of-birth="yesterday">
    <name>Tom</name>
  </child>
</family>
  • no need for an attribute of type ID to be referenced
  • date-of-birth cannot be restricted to a valid date by a DTD

An invalid family

<family>
  <parent role="stepmother" spouse="john jim">
    <name>Janet</name>
  </parent>
  <parent pno="john" spouse="janet"></parent>
  <parent pno="jim" spouse="janet"></parent>
</family>
  • if role is given, must be mother or father
  • spouse must refer to only one value of type ID
  • spouse must refer to a value of type ID which exists
  • each parent must have a name
  • at most two parent elements allowed
  • checking the file invalid-family.xml using the online XML validator gives results
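
One possible repair of the invalid document, shown here with the family DTD as an internal subset. This is only one of many ways to fix it: the role value is changed to an enumerated one, each spouse refers to a single existing ID, the missing name is added, and the third parent is removed.

```xml
<?xml version="1.0"?>
<!DOCTYPE family [
  <!ELEMENT family (parent, (parent)?, (child)*)>
  <!ELEMENT parent (name)>
  <!ELEMENT child  (name)>
  <!ELEMENT name   (#PCDATA)>
  <!ATTLIST parent
    pno     ID               #IMPLIED
    role    (mother|father)  #IMPLIED
    spouse  IDREF            #IMPLIED>
  <!ATTLIST child
    cno           ID      #IMPLIED
    date-of-birth CDATA   #IMPLIED
    siblings      IDREFS  #IMPLIED>
]>
<family>
  <!-- role is one of the enumerated values; spouse is a single existing ID -->
  <parent pno="janet" role="mother" spouse="john">
    <name>Janet</name>
  </parent>
  <!-- second (and last) parent, now with the required name child -->
  <parent pno="john" spouse="janet">
    <name>John</name>
  </parent>
</family>
```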

Entities

  • entity is a physical unit such as a character, string or file
  • entities allow
    • references to non-keyboard characters
    • abbreviations for frequently used strings
    • documents to be broken up into multiple parts
  • entity declaration in a DTD associates a name with an entity, e.g.,
    <!ENTITY BBK "Birkbeck, University of London">
  • entity reference, e.g., &BBK; substitutes value of entity for its name in document
  • entity must be declared before it is referenced

Example using Entities

<!DOCTYPE xmas [
<!ENTITY  on        "On the">
<!ENTITY  day       "day of Christmas my true love sent to me">
<!ENTITY  partridge "<line>a partridge in a pear tree.</line>">
<!ENTITY  doves     "<line>two turtle doves and</line> &partridge;">
<!ENTITY  hens      "<line>three French hens,</line> &doves;">
<!ELEMENT xmas      (verse+)>
<!ELEMENT verse     (line+)>
<!ELEMENT line      (#PCDATA)>
]>
<xmas>
  <verse><line>&on; first  &day;</line> &partridge;</verse>
  <verse><line>&on; second &day;</line> &doves;</verse>
  <verse><line>&on; third  &day;</line> &hens;</verse>
</xmas>

General Entities

  • BBK is an example of a general entity
  • can also be used for non-keyboard characters, i.e., a character entity reference (see slide 3.23)
  • in XML, only 5 general entities are predefined
    • &amp; (&), &lt; (<), &gt; (>), &quot; ("), &apos; (')
  • contents of internal entities are defined in same document as references to them
  • contents of external entities are defined elsewhere, e.g.,
    <!ENTITY HTML-chapter SYSTEM "html.xml" >
    • then &HTML-chapter; includes contents of file html.xml at point of reference
    • standalone="no" may be stated in the XML declaration, but it is not strictly required when the external entity is declared in the internal subset
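
A sketch showing internal, external and predefined general entities together with a numeric character reference. The book, title and chapter elements are hypothetical, and html.xml is assumed to contain a single chapter element so that the expanded document matches the content model.

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE book [
  <!ELEMENT book    (title, chapter+) >
  <!ELEMENT title   (#PCDATA) >
  <!ELEMENT chapter (#PCDATA) >
  <!-- internal general entity: value is given in this document -->
  <!ENTITY BBK "Birkbeck, University of London">
  <!-- external general entity: value is the contents of html.xml,
       assumed here to hold a single chapter element -->
  <!ENTITY HTML-chapter SYSTEM "html.xml" >
]>
<book>
  <!-- amp is predefined; #233 is a character reference for e-acute -->
  <title>Caf&#233; notes &amp; XML at &BBK;</title>
  &HTML-chapter;
</book>
```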

Parameter Entities

  • parameter entities are
    • used only within XML markup declarations
    • declared by inserting % between ENTITY and name, e.g.,
      <!ENTITY % list     "OL | UL" >
      <!ENTITY % heading  "H1 | H2 | H3 | H4 | H5 | H6" >
      
    • referenced using % and ; delimiters, e.g.,
      <!ENTITY % block  "P | %list; | %heading; | ..." >
      
  • as an example, see the HTML 4.01 DTD
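
A sketch of how such parameter entities are then used in an element declaration. The file is assumed to be an external DTD (the file name doc.dtd is hypothetical), because parameter entity references inside a markup declaration are not permitted in the internal subset. The heading list is shortened and the BODY declaration is an illustration, not a quotation from the HTML 4.01 DTD.

```xml
<!-- contents of a hypothetical external DTD file, e.g. doc.dtd -->
<!ENTITY % list     "OL | UL" >
<!ENTITY % heading  "H1 | H2 | H3" >
<!ENTITY % block    "P | %list; | %heading;" >

<!-- after expansion: (P | OL | UL | H1 | H2 | H3)+ -->
<!ELEMENT BODY (%block;)+ >
```

Changing the value of %heading; in one place updates every content model that refers to it, directly or through %block;.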

Limitations of DTDs

  • non-XML syntax
  • no data typing, especially for element content
  • only marginally compatible with namespaces
  • cannot use mixed content and enforce order and number of child elements
  • clumsy to enforce presence of child elements without also enforcing order (i.e. no & operator from SGML)
  • element names in a DTD are global
  • the XML Schema Definition Language, for example, addresses these limitations

Exercises

  1. Write an XML DTD which will define the following structure for documents of type exam. An exam has a course code, a title and a date, which comprises only the month and year. These are followed by a list of questions. Exams consist of either 5 or 6 questions. Each question has one or more parts. Parts of questions can themselves comprise parts along with text.

    Give an instance of an exam document which is valid with respect to your DTD and two instances which are invalid, explaining why they are invalid. Check your answers using an on-line XML validator.


  2. Write an XML DTD for representing information about students on an MSc programme. All information should be represented using elements rather than attributes. The root element of the document is programme. A programme has a degree and a year. These elements are followed by the results for the programme. The results are partitioned into distinction, merit, pass and fail. Within each is a sequence of name elements, each containing the name of a person having achieved the corresponding result for the programme.


  3. Consider a relational database containing a relation teaches with attributes course and lecturer, representing the relationship between courses taught on an MSc programme and the lecturers who teach them. Give an XML DTD for representing this information.

Links to more information

DTDs are covered in Chapter 4 of [Moller and Schwartzbach] and briefly in Chapter 2 of [Jacobs].