Defining Web Document Types

Peter Wood

Document Types

Document Type Definitions (DTDs)

Valid XML

XML parser checking document is valid

DTD syntax

  • syntax for an element declaration in a DTD is:
    <!ELEMENT name    (model) >
    
    where
    • ELEMENT is a keyword
    • name is the element name being declared
    • model is the element content model (the allowed contents of the element)
  • content model specified using a regular expression over element names
  • regular expression specifies the permitted sequences of element names

Examples of DTD element declarations

  • an html element must contain a head element followed by a body element:
    <!ELEMENT html    (head, body) >
    
    where "," is the sequence (or concatenation) operator
  • a list element (not in HTML) must contain either a ul element or an ol element (but not both):
    <!ELEMENT list    (ul|ol) >
    
    where "|" is the alternation (or "exclusive or") operator
  • a ul element must contain zero or more li elements:
    <!ELEMENT ul    (li)* >
    
    where "*" is the repetition (or "Kleene star") operator
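
Since a content model is a regular expression over element names, the three declarations above can be imitated with an ordinary regular expression engine. The following is only a sketch: the helper matches and the "name;" encoding are assumptions made for illustration, not part of any XML tool.

```python
import re

def matches(model_regex, children):
    """Return True if the list of child element names fits the content model."""
    # Encode the children as "name;name;..." so that each name is one token.
    encoded = "".join(name + ";" for name in children)
    return re.fullmatch(model_regex, encoded) is not None

# Hand-translated versions of the content models declared above.
HTML_MODEL = r"head;body;"      # (head, body)
LIST_MODEL = r"(?:ul;|ol;)"     # (ul|ol)
UL_MODEL = r"(?:li;)*"          # (li)*

print(matches(HTML_MODEL, ["head", "body"]))  # True
print(matches(HTML_MODEL, ["body", "head"]))  # False: wrong order
print(matches(LIST_MODEL, ["ul", "ol"]))      # False: only one alternative allowed
print(matches(UL_MODEL, []))                  # True: zero li elements
```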

DTD syntax

DTD Syntax   Meaning
b            element b must occur
b,c          both b and c must occur, in the order specified
b|c          one (and only one) of b or c must occur
b*           zero or more occurrences of b must occur
b+           one or more occurrences of b must occur
b?           zero or one occurrence of b must occur
EMPTY        no element content is allowed
ANY          any content (of declared elements and text) is allowed
#PCDATA      content is text rather than an element
  • element names in above table are b and c
  • parentheses can be used for grouping, e.g., (a,b)*
  • b+ is short for (b,b*)
  • b? means b or nothing, informally (b|EMPTY); note that EMPTY cannot actually be written inside a group in a DTD

#PCDATA stands for "parsed character data", meaning an XML parser should parse the characters to resolve character and entity references.
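
As a small illustration of "parsed" character data, the sketch below uses Python's standard xml.etree.ElementTree parser (chosen here only for convenience) on a made-up title element. The parser hands back the text with the entity and character references already resolved.

```python
import xml.etree.ElementTree as ET

# &amp; is a built-in entity reference; &#60; and &#62; are character
# references for "<" and ">". The element and its text are invented.
title = ET.fromstring("<title>Fish &amp; Chips &#60;fresh&#62;</title>")

print(title.text)  # Fish & Chips <fresh>
```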

DTD for RSS

  • a fragment of a simplified DTD for RSS might be
    <!ELEMENT rss         (channel) >
    <!ELEMENT channel     (title,link,description,item+) >
    <!ELEMENT item        (title,description,link,pubDate?) >
    <!ELEMENT title       (#PCDATA) >
    <!ELEMENT link        (#PCDATA) >
    <!ELEMENT description (#PCDATA) >
    <!ELEMENT pubDate     (#PCDATA) >
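
To make the fragment concrete, here is a rough sketch of checking a document against these declarations. It is not a real validator: the content models are translated by hand into regular expressions, only child element names are checked (text between children and attributes are ignored), and the names MODELS and is_valid as well as the sample feed are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the fragment above, as regular expressions over
# "name;" tokens. None stands for (#PCDATA), i.e. text only.
MODELS = {
    "rss": r"channel;",
    "channel": r"title;link;description;(?:item;)+",
    "item": r"title;description;link;(?:pubDate;)?",
    "title": None,
    "link": None,
    "description": None,
    "pubDate": None,
}

def is_valid(element):
    """Check an element and its descendants against MODELS (child names only)."""
    if element.tag not in MODELS:
        return False                       # undeclared element
    model = MODELS[element.tag]
    children = list(element)
    if model is None:
        return not children                # (#PCDATA): no child elements allowed
    encoded = "".join(child.tag + ";" for child in children)
    if re.fullmatch(model, encoded) is None:
        return False
    return all(is_valid(child) for child in children)

feed = ET.fromstring(
    "<rss><channel><title>News</title><link>http://example.org/</link>"
    "<description>Example feed</description>"
    "<item><title>Story</title><description>Text</description>"
    "<link>http://example.org/1</link></item>"
    "</channel></rss>"
)
print(is_valid(feed))  # True
```

Removing the item element from the sample makes the check fail, because item+ requires at least one item.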
    

Validation of XML Documents

Referencing a DTD

  • DTD to be used to validate a document can be specified
    • internally (part of document)
    • externally (in another file)
  • done using a document type declaration
  • declare document to be of type given in DTD
  • e.g., <!DOCTYPE rss ... >

Declaring an Internal DTD

<?xml version="1.0"?>
<!DOCTYPE rss [
    <!-- all declarations for rss DTD go here -->
    ...
    <!ELEMENT rss ... >
    ...
]>
<rss>
   <!-- This is an instance of a document of type rss -->
   ...
</rss>
  • element rss must be defined in the DTD
  • name after DOCTYPE (i.e., rss) must match root element of document
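
A non-validating parser will not enforce the second point, but it can be checked by hand. The sketch below uses Python's standard xml.dom.minidom; the helper doctype_matches_root and the tiny internal DTD are assumptions made for the example.

```python
from xml.dom import minidom

def doctype_matches_root(xml_text):
    """Return True if the name after DOCTYPE equals the root element's name."""
    doc = minidom.parseString(xml_text)
    return doc.doctype is not None and doc.doctype.name == doc.documentElement.tagName

GOOD = ('<?xml version="1.0"?>'
        "<!DOCTYPE rss [ <!ELEMENT rss (#PCDATA)> ]>"
        "<rss>example</rss>")
# Same document, but the declared document type no longer matches the root.
BAD = GOOD.replace("<!DOCTYPE rss", "<!DOCTYPE feed")

print(doctype_matches_root(GOOD))  # True
print(doctype_matches_root(BAD))   # False
```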

Declaring an External DTD (1)

<?xml version="1.0"?>
<!DOCTYPE rss SYSTEM "rss.dtd">
<rss>
   <!-- This is an instance of a document of type rss -->
   ...
</rss>
  • what follows SYSTEM is a URI
  • rss.dtd is a relative URI, assumed to be in same directory as source document
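
Resolving the relative URI works like resolving any other relative reference. A minimal sketch, assuming the document itself lives at the hypothetical address http://example.org/feeds/news.xml:

```python
from urllib.parse import urljoin

document_uri = "http://example.org/feeds/news.xml"  # hypothetical location
dtd_uri = urljoin(document_uri, "rss.dtd")          # system identifier from the DOCTYPE

print(dtd_uri)  # http://example.org/feeds/rss.dtd
```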

Declaring an External DTD (2)

<?xml version="1.0"?>
<!DOCTYPE math PUBLIC "-//W3C//DTD MathML 2.0//EN"
     "http://www.w3.org/TR/MathML2/dtd/mathml2.dtd">
<math>
   <!-- This is an instance of a mathML document type -->
   ...
</math>
  • PUBLIC means what follows is a formal public identifier with 4 fields:
    1. ISO for ISO standard, + for approval by other standards body, and - for everything else
    2. owner of the DTD: e.g., W3C
    3. title of the DTD: e.g., DTD MathML 2.0
    4. language abbreviation: e.g., EN
  • URI gives location of DTD

Formal public identifiers are meant for widely used entities. They should be unique worldwide. Processing software might either come with such entities already installed or know the most efficient sites from which to download them. If not, the URI is used to retrieve the DTD.
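
The four fields of a formal public identifier are separated by "//", so they are easy to pull apart. The function parse_fpi below is a hypothetical helper written for this example; it assumes exactly four fields, as in the MathML identifier above.

```python
def parse_fpi(fpi):
    """Split a four-field formal public identifier into named parts."""
    registration, owner, title, language = fpi.split("//")
    return {
        "registration": registration,  # ISO, + or -
        "owner": owner,
        "title": title,
        "language": language,
    }

fields = parse_fpi("-//W3C//DTD MathML 2.0//EN")
print(fields["owner"])  # W3C
print(fields["title"])  # DTD MathML 2.0
```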

DTD for CD Example

  • recall the XML document for representing a collection of CDs
  • assume that every CD element contains a composer followed by one or more performance elements
  • and that each performance element must contain composition and date elements
  • between them is an optional soloist element, followed by either both an orchestra and conductor or neither of them
  • the declarations of the CD and performance elements might be as follows:
    <!ELEMENT CD          (composer, (performance)+)>
    <!ELEMENT performance (composition, (soloist)?,
                           (orchestra, conductor)?, date)>
    

Attributes

  • recall that attribute name-value pairs are allowed in start tags
  • e.g., href="file.html" in an HTML a start tag
  • allowed attributes for an element are defined in an attribute list declaration
  • e.g., for rss and guid elements, these might be
    <!ATTLIST rss
       version CDATA #FIXED "2.0" >
    <!ATTLIST guid
       isPermaLink (true|false) "true" >
    
  • attribute definition comprises
    • attribute name, e.g., version
    • type, e.g., CDATA
    • default, e.g., "true"

Some Attribute Types

  • CDATA: any valid character data
  • ID: an identifier unique within the document
  • IDREF: a reference to a unique identifier
  • IDREFS: a reference to several unique identifiers (separated by white-space)
  • (a|b|c), e.g.: an enumerated attribute type; the value must be one of a, b or c

Attribute Defaults

  • #IMPLIED: attribute may be omitted (optional)
  • #REQUIRED: attribute must be present
  • #FIXED "x", e.g.: attribute optional; if present, value must be x
  • "x", e.g.: value will be x if attribute is omitted
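
The last case can be observed even with a non-validating parser, provided the ATTLIST is in the internal DTD. The sketch below uses Python's standard xml.etree.ElementTree with the guid declaration from earlier; the guid contents are made up.

```python
import xml.etree.ElementTree as ET

# Internal DTD declaring a default value for isPermaLink.
DTD = '<!DOCTYPE guid [ <!ATTLIST guid isPermaLink (true|false) "true"> ]>'

omitted = ET.fromstring(DTD + "<guid>http://example.org/1</guid>")
explicit = ET.fromstring(DTD + '<guid isPermaLink="false">item-1</guid>')

print(omitted.get("isPermaLink"))   # default supplied by the parser
print(explicit.get("isPermaLink"))  # value given in the start tag
```

Note that the parser fills in the default but does not check that an explicit value is one of the enumerated alternatives; that would require a validating parser.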

Example Declaring HTML Attributes

  • in HTML, the IMG element is empty and must have src and alt attributes
  • it may have height and width attributes
  • (simplified) declaration may be as follows:
    <!ELEMENT IMG EMPTY>
    <!ATTLIST IMG
      src         CDATA       #REQUIRED
      alt         CDATA       #REQUIRED
      height      CDATA       #IMPLIED
      width       CDATA       #IMPLIED
      >
    
  • in HTML, the FORM element has an optional method attribute
  • if present, it must have the value GET or POST, with default value GET:
    <!ATTLIST FORM
      method      (GET|POST)     "GET">
    

Mixed Content Models

  • in rss DTD all content models comprise only elements or only text
  • in HTML, paragraph elements, e.g., allow text interleaved with various in-line elements, such as em, img, b, etc.
  • such a model is said to have mixed content
  • if we want to mix text with elements em, img and b as contents of element p:
    <!ELEMENT p (#PCDATA | em | img | b)* >
    
  • #PCDATA must be first (in the definition)
  • followed by other elements separated by |
  • all must have * applied to them
  • this limits our ability to constrain the content model
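
The sketch below shows what mixed content looks like to a program and why the model is weak: all that can be checked is that each child is one of the listed elements. It uses Python's standard xml.etree.ElementTree; the names ALLOWED_IN_P and fits_mixed_model and the sample paragraph are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

ALLOWED_IN_P = {"em", "img", "b"}   # from (#PCDATA | em | img | b)*

def fits_mixed_model(p):
    """Any number of the allowed elements, in any order, with text anywhere."""
    return all(child.tag in ALLOWED_IN_P for child in p)

p = ET.fromstring("<p>Some <em>mixed</em> content <b>here</b>.</p>")

# Text before the first child is in p.text; text after a child is in child.tail.
pieces = [p.text] + [child.tail for child in p]

print(fits_mixed_model(p))  # True
print(pieces)               # ['Some ', ' content ', '.']
```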

Some exercises

  • Consider the content models (zero, one)* and (zero | one)*. Give an example of a sequence of elements allowed by the one model but not by the other.
  • Consider the elements day, month and year. Produce a content model which allows for each of the sequences
    year
    month year
    day month year
    
    but no others.

Entities

  • entity is a physical unit such as a character, string or file
  • entities allow
    • references to non-keyboard characters
    • abbreviations for frequently used strings
    • documents to be broken up into multiple parts
  • entity declaration in a DTD associates a name with an entity, e.g.,
    <!ENTITY BBK "Birkbeck, University of London">
  • entity reference, e.g., &BBK; substitutes value of entity for its name in document
  • entity must be declared before it is referenced
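
A quick way to see the substitution is to declare the entity in an internal DTD and parse a document that references it. The sketch uses Python's standard xml.etree.ElementTree; the note element is made up for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<!DOCTYPE note [ <!ENTITY BBK "Birkbeck, University of London"> ]>'
    "<note>Welcome to &BBK;.</note>"
)

print(doc.text)  # Welcome to Birkbeck, University of London.
```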

General Entities

  • BBK is an example of a general entity
  • can also be used for non-keyboard characters, i.e., a character entity reference (see slide 3.23)
  • in XML, only 5 general entity declarations are built-in
    • &amp; (&), &lt; (<), &gt; (>), &quot; ("), &apos; (')
  • contents of internal entities are defined in same document as references to them
  • contents of external entities are defined elsewhere, e.g.,
    <!ENTITY HTML-chapter SYSTEM "html.xml" >
    • then &HTML-chapter; includes contents of file html.xml at point of reference
    • must include standalone="no" in XML declaration

Parameter Entities

  • parameter entities are
    • used only within XML markup declarations
    • declared by inserting % between ENTITY and name, e.g.,
      <!ENTITY % list     "OL | UL" >
      <!ENTITY % heading  "H1 | H2 | H3 | H4 | H5 | H6" >
      
    • referenced using % and ; delimiters, e.g.,
      <!ENTITY % block  "P | %list; | %heading; | ..." >
      
  • as an example, see the HTML 4.01 DTD
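
Conceptually, a parameter entity reference is replaced by the entity's text before the declaration is interpreted. The sketch below imitates this with plain string substitution; it is a simplification of what an XML parser does, the function expand is a hypothetical helper, and the elided part of the block declaration is left out.

```python
import re

# Parameter entities declared above.
PARAMS = {
    "list": "OL | UL",
    "heading": "H1 | H2 | H3 | H4 | H5 | H6",
}

def expand(text, params):
    """Replace each %name; reference with its (recursively expanded) value."""
    return re.sub(
        r"%([A-Za-z_][\w.-]*);",
        lambda m: expand(params[m.group(1)], params),
        text,
    )

block = expand("P | %list; | %heading;", PARAMS)
print(block)  # P | OL | UL | H1 | H2 | H3 | H4 | H5 | H6
```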

Limitations of DTDs

  • non-XML syntax
  • no data typing, especially for element content
  • only marginally compatible with namespaces
  • cannot use mixed content and enforce order and number of child elements
  • clumsy to enforce presence of child elements without also enforcing order (i.e. no & operator from SGML)
  • element names in a DTD are global
  • XML Schema Definition Language, e.g., addresses these limitations

JSON Schema

  • JSON Schema is a draft vocabulary for annotating and validating JSON documents (see json-schema.org)
  • sample JSON for product API (inspired by example on json-schema.org):
    {
        "id": 1234,
        "name": "Bowers & Wilkins Zeppelin Wireless",
        "price": 499.50,
        "tags": ["airplay", "spotify", "bluetooth"]
    }
    

Example Product Schema (1)

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Product",
    "description": "A product from Acme's catalog",
    "type": "object"
}
  • note that the schema is also a JSON document
  • $schema keyword specifies the vocabulary used
  • title and description keywords are descriptive only; they do not add constraints
  • type keyword defines the first constraint: it has to be a JSON object

Example Product Schema (2)

  • assume that id and name are always required
  • and that id is an integer, while name is a string
{
    ...
    "type": "object",
    "properties": {
        "id": {
            "description": "The unique identifier for a product",
            "type": "integer"
        },
        "name": {
            "description": "The name of the product",
            "type": "string"
        }
    },
    "required": ["id", "name"]
}

Example Product Schema (3)

  • assume that price is required and must be a positive number
  • and that tags is an array of strings, where there must be at least one tag and all tags must be unique
{
    ...
    "properties": {
        ...
        "price": {
            "type": "number",
            "exclusiveMinimum": 0
        },
        "tags": {
            "type": "array",
            "items": {
                "type": "string"
            },
            "minItems": 1,
            "uniqueItems": true
        }
    },
    "required": ["id", "name", "price"]
}
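
To show what these constraints mean for a concrete instance, the following hand-written check covers the keywords used so far. It is only a sketch and not a JSON Schema implementation: check_product is a hypothetical function, the error messages are invented, and details of the draft are simplified (for example, how integer treats a value such as 1.0).

```python
import json

def is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def check_product(product):
    """Return a list of constraint violations (empty if the product is acceptable)."""
    if not isinstance(product, dict):                  # "type": "object"
        return ["instance must be an object"]
    errors = []
    for key in ("id", "name", "price"):                # "required"
        if key not in product:
            errors.append("missing required property " + key)
    if "id" in product and not (isinstance(product["id"], int)
                                and not isinstance(product["id"], bool)):
        errors.append("id must be an integer")
    if "name" in product and not isinstance(product["name"], str):
        errors.append("name must be a string")
    if "price" in product:
        price = product["price"]
        if not is_number(price) or price <= 0:         # "exclusiveMinimum": 0
            errors.append("price must be a number greater than 0")
    if "tags" in product:
        tags = product["tags"]
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            errors.append("tags must be an array of strings")
        elif len(tags) < 1:                            # "minItems": 1
            errors.append("tags must contain at least one item")
        elif len(set(tags)) != len(tags):              # "uniqueItems": true
            errors.append("tags must be unique")
    return errors

sample = json.loads("""
{
    "id": 1234,
    "name": "Bowers & Wilkins Zeppelin Wireless",
    "price": 499.50,
    "tags": ["airplay", "spotify", "bluetooth"]
}
""")
print(check_product(sample))  # []
```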

Some Other JSON Schema Features

  • maximum and minimum values for numbers
  • maxLength, minLength and pattern (regular expression) values for strings
  • "anyOf": [ ... ] specifies that at least one alternative should match
  • "oneOf": [ ... ] specifies that exactly one alternative should match
  • "allOf": [ ... ] specifies that all alternatives should match
  • "additionalProperties": false specifies that no other properties are allowed
  • ...
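
The difference between the three combinators is easiest to see by counting how many alternatives match. In the sketch below the alternatives are plain Python predicates standing in for subschemas; all names are invented for illustration.

```python
def any_of(checks, value):
    """At least one alternative matches."""
    return any(check(value) for check in checks)

def one_of(checks, value):
    """Exactly one alternative matches."""
    return sum(1 for check in checks if check(value)) == 1

def all_of(checks, value):
    """Every alternative matches."""
    return all(check(value) for check in checks)

# Two stand-in alternatives: multiples of 3 and multiples of 5.
ALTERNATIVES = [lambda n: n % 3 == 0, lambda n: n % 5 == 0]

for n in (9, 15, 7):
    print(n, any_of(ALTERNATIVES, n), one_of(ALTERNATIVES, n), all_of(ALTERNATIVES, n))
# 9 True True False
# 15 True False True
# 7 False False False
```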

Exercises

  1. Write an XML DTD which will define the following structure for documents of type exam. An exam has a course code, a title and a date, which comprises only the month and year. These are followed by a list of questions. Exams consist of either 5 or 6 questions. Each question has one or more parts. Parts of questions can themselves comprise parts along with text.

    Give an instance of an exam document which is valid with respect to your DTD and two instances which are invalid, explaining why they are invalid. Check your answers using an on-line XML validator.


  2. Write an XML DTD for representing information about students on an MSc programme. All information should be represented using elements rather than attributes. The root element of the document is programme. A programme has a degree and a year. These elements are followed by the results for the programme. The results are partitioned into distinction, merit, pass and fail. Within each is a sequence of name elements, each containing the name of a person having achieved the corresponding result for the programme.


  3. Consider a relational database containing a relation teaches with attributes course and lecturer, representing the relationship between courses taught on an MSc programme and the lecturers who teach them. Give an XML DTD for representing this information.

Links to more information

DTDs are covered in Chapter 4 of [Moller and Schwartzbach] and briefly in Chapter 2 of [Jacobs].