This chapter is based on Chapter 2 of Schneider & Perry's book [Schneider 2000], Chapters 4, 5, 20 and 21 of the Deitel et al. tome [Deitel 2001], some examples from Elliotte Rusty Harold's book [Harold 1999], plus additional material from the web.
SGML, HTML, and XML are the most important markup languages: SGML because it is the parent language of both HTML and XML, HTML because it is the current language of the web, and XML because it is the future language of the web.
In the late 1960s, IBM researchers worked on the problem of building a portable system for the interchange and manipulation of legal documents. Their prototype language marked up structural elements, with formatting information kept in separate files, called style sheets. The document structure was defined in yet another file, called a Document Type Definition (DTD). By 1969, the researchers had developed the General Markup Language (GML). After further work worldwide, in 1986, the International Organization for Standardization (ISO) adopted a particular version called the Standard Generalised Markup Language (SGML). It quickly became the business standard for data storage and interchange. SGML has the following advantages.
However, it also has the following disadvantages.
Put bluntly, it is too elaborate for the ever-changing web.
Tim Berners-Lee and Robert Cailliau, working independently of each other at CERN, invented the HyperText Markup Language (HTML) based on SGML. HTML is one particular SGML DTD that is easier to learn and use than SGML. HTML is a trimmed-down version of SGML, eliminating SGML features that are rarely needed, but including hyperlinks to link web documents.
With earlier versions of HTML, web browsers controlled the appearance (rendering) of every web page. With the advent of Cascading Style Sheets (CSS), the document author can control the way the browser renders the page, or the entire web site for that matter. Style sheets allow document authors to specify the style of their page elements (spacing, margins, etc.) separately from their structure (section headers, body text, etc.), thus allowing greater manageability.
The Extensible Markup Language (XML) is also a descendant of SGML, representing an industry-wide effort to define which data are displayed (or printed), whereas HTML defines how a page is displayed. XML will overtake HTML because of its ability to describe content. XML has the following advantages.
XML defines a document's structure by marking the start and
end (tags) of its logical parts
(elements). This is similar to HTML, but also
defines record structures for databases and other
applications. Figure 1 illustrates an XML-formatted week
from my calendar in file calendar.xml.
<?xml version="1.0"?>
<!DOCTYPE calendar SYSTEM "calendar.dtd">
<calendar>
  <year value="2001">
    <date month="01" day="22">
      <event time="1700">
        Eric Hobsbawm's lecture
      </event>
    </date>
    <date month="01" day="23">
      <event time="0930">
        Lewisham Hospital
      </event>
      <event time="1730">
        Quizmaster's Cup
      </event>
    </date>
    <date month="01" day="24">
      <event time="1600">
        Teaching Committee
      </event>
    </date>
    <date month="01" day="25">
      <event time="1800">
        Computer Networking 3
      </event>
    </date>
    <date month="01" day="26">
      <event time="1400">
        School Meeting
      </event>
      <event time="1800">
        Electronic Commerce 3
      </event>
    </date>
  </year>
</calendar>
|
The first line is an XML declaration, specifying
which version of XML the document conforms to. The second
line is a document type declaration, naming the external DTD
(calendar.dtd) that describes the allowed structure. All XML
documents must contain exactly one root element,
e.g. <calendar> in this example,
containing all other elements. Element
<year> is a child element
because it is nested inside element
<calendar>.
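The root and child relationships are easy to see once the document has been parsed. What follows is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module; the helper name list_events is hypothetical, and the calendar is a shortened copy of Figure 1.

```python
import xml.etree.ElementTree as ET

# A shortened copy of Figure 1. The DOCTYPE line is left out because
# ElementTree does not read external DTDs.
CALENDAR_XML = """<?xml version="1.0"?>
<calendar>
  <year value="2001">
    <date month="01" day="22">
      <event time="1700">Eric Hobsbawm's lecture</event>
    </date>
    <date month="01" day="23">
      <event time="0930">Lewisham Hospital</event>
      <event time="1730">Quizmaster's Cup</event>
    </date>
  </year>
</calendar>"""

root = ET.fromstring(CALENDAR_XML)
print(root.tag)                        # the single root element
print([child.tag for child in root])   # its child elements


def list_events(calendar):
    """Return (month, day, time, text) for every event, in document order."""
    rows = []
    for date in calendar.iter("date"):
        for event in date.findall("event"):
            rows.append((date.get("month"), date.get("day"),
                         event.get("time"), event.text.strip()))
    return rows


events = list_events(root)
for row in events:
    print(row)
```

Attributes come back through get(), while the character data of an element is in its text field.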
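How do we know that the above XML document is valid,
i.e. that it uses only the permitted elements and attributes
in the permitted places? (This is a stronger property than
being well formed, which only requires correct XML syntax.)
Enter a document model in the form of a Document Type
Definition (DTD), which is a hand-me-down from SGML
defining the allowed structure. Figure 2 is
calendar.dtd.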
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT calendar (year)*>
<!ELEMENT year (date)*>
<!ATTLIST year value CDATA #REQUIRED>
<!ELEMENT date (event)*>
<!ATTLIST date day   CDATA #REQUIRED
               month CDATA #REQUIRED>
<!ELEMENT event (#PCDATA)>
<!ATTLIST event time CDATA #IMPLIED>
|
This archaic structure contains a set of rules or
declarations. Each declaration adds a new element,
set of attributes, or notation to the language we are
describing. Briefly it states that a
<calendar> contains zero or more
<year>s, a <year>
contains zero or more <date>s and that
the attribute value is mandatory, etc.
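To make these rules concrete, here is a minimal sketch that checks a document against them by hand. It assumes Python; the parsers in its standard library do not validate against DTDs, so the two tables below are transcribed manually from calendar.dtd, and the function name check_calendar is hypothetical. A validating parser would normally do this job and would also reject undeclared attributes, which this sketch does not.

```python
import xml.etree.ElementTree as ET

# Transcribed by hand from calendar.dtd (an assumption of this sketch;
# nothing here reads the DTD file itself).
ALLOWED_CHILDREN = {"calendar": {"year"}, "year": {"date"},
                    "date": {"event"}, "event": set()}
REQUIRED_ATTRS = {"calendar": set(), "year": {"value"},
                  "date": {"day", "month"}, "event": set()}


def check_calendar(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "calendar":
        problems.append(f"root element is <{root.tag}>, expected <calendar>")
    for element in root.iter():
        if element.tag not in ALLOWED_CHILDREN:
            problems.append(f"undeclared element <{element.tag}>")
            continue
        missing = REQUIRED_ATTRS[element.tag] - set(element.attrib)
        for name in sorted(missing):
            problems.append(f"<{element.tag}> is missing attribute '{name}'")
        for child in element:
            declared = child.tag in ALLOWED_CHILDREN
            if declared and child.tag not in ALLOWED_CHILDREN[element.tag]:
                problems.append(
                    f"<{child.tag}> is not allowed inside <{element.tag}>")
    return problems


good = ('<calendar><year value="2001"><date month="01" day="22">'
        '<event time="1700">Lecture</event></date></year></calendar>')
bad = '<calendar><year><event>Lecture</event></year></calendar>'

print(check_calendar(good))
print(check_calendar(bad))
```

Both test strings are well formed; only the first also obeys the rules of the DTD.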
The format in Figure 1 is consistent with that outlined in
[RFC 2445] for the exchange of
Calendar information between applications, e.g. your PC and
your Personal Digital Assistant (PDA). In addition,
this format can easily be transformed into another format
using a parser and a style sheet. Figure
3 is the result of parsing the above XML document with the
style sheet xml2html.xsl.
Figure 3: the rendered HTML page, headed "Calendar for 2001".
|
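The contents of xml2html.xsl are not reproduced here, so the following is only a stand-in for it: a minimal sketch, assuming Python, that turns the calendar into HTML. The function name calendar_to_html and the table layout are assumptions; only the heading "Calendar for 2001" is taken from Figure 3.

```python
import xml.etree.ElementTree as ET
from html import escape

# One day taken from Figure 1.
CALENDAR_XML = """<calendar>
  <year value="2001">
    <date month="01" day="23">
      <event time="0930">Lewisham Hospital</event>
      <event time="1730">Quizmaster's Cup</event>
    </date>
  </year>
</calendar>"""


def calendar_to_html(xml_text):
    """Render each <year> as a heading followed by a table of its events."""
    root = ET.fromstring(xml_text)
    lines = []
    for year in root.findall("year"):
        lines.append(f"<h1>Calendar for {escape(year.get('value'))}</h1>")
        lines.append("<table>")
        for date in year.findall("date"):
            day = f"{date.get('day')}/{date.get('month')}"
            for event in date.findall("event"):
                # time is #IMPLIED in the DTD, so it may be absent
                when = f"{day} {event.get('time', '')}".strip()
                lines.append(f"<tr><td>{escape(when)}</td>"
                             f"<td>{escape(event.text.strip())}</td></tr>")
        lines.append("</table>")
    return "\n".join(lines)


html_page = calendar_to_html(CALENDAR_XML)
print(html_page)
```

An XSL style sheet expresses the same mapping declaratively, as templates matched against elements, rather than as explicit loops.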
On reflection, there are certain disadvantages to the chosen
style of markup. It would be easier to remove
month as an attribute of
<date> and make it a child element of
<year>. In addition, an
<event> should have start and stop times.
However, it is not difficult to provide a parser and style
sheet to generate virtually any other markup. This
is illustrated in Figure 4.
Notice that we have introduced yet another document type here, written in the Extensible Stylesheet Language (XSL). Virtually a programming language, XSL supports functions, recursion, and templates. Being able to generate a number of alternative documents from a single source document is a big win and, of course, each style sheet only needs to be written once.
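The idea of deriving alternative markup from one source can also be sketched without XSL. The following is a minimal sketch in Python rather than a style sheet; the <month> element it produces is just one possible reading of the restructuring suggested earlier, and the function name group_by_month is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two days taken from Figure 1.
SOURCE = """<calendar><year value="2001">
  <date month="01" day="25"><event time="1800">Computer Networking 3</event></date>
  <date month="01" day="26"><event time="1400">School Meeting</event></date>
</year></calendar>"""


def group_by_month(xml_text):
    """Rebuild the calendar with <month> as a child element of <year>."""
    old_root = ET.fromstring(xml_text)
    new_root = ET.Element("calendar")
    for old_year in old_root.findall("year"):
        new_year = ET.SubElement(new_root, "year", value=old_year.get("value"))
        months = {}  # month number -> <month> element, created on first use
        for old_date in old_year.findall("date"):
            number = old_date.get("month")
            if number not in months:
                months[number] = ET.SubElement(new_year, "month", value=number)
            new_date = ET.SubElement(months[number], "date",
                                     day=old_date.get("day"))
            for event in old_date.findall("event"):
                new_date.append(event)  # events are carried over unchanged
    return new_root


regrouped = group_by_month(SOURCE)
print(ET.tostring(regrouped, encoding="unicode"))
```

The source document is untouched as a file; each alternative view is generated from it on demand.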
XML tags are case-sensitive; a start tag and an end tag that
differ only in case do not match, which is an error. XML can
use Unicode characters. Unicode is a standard
defining the characters for the world's major languages
(Klingon is currently undergoing review :-). Markup
text is enclosed within angle brackets
(< and >). Character
data is the text between a start tag and an end tag,
e.g. Electronic Commerce 3 in Figure 1. In XML
every start tag must have a matching end tag. Consider Figure 5.
| HTML | XML | XML (short form) |
|---|---|---|
| `<img src="image.gif">` | `<img src="image.gif"></img>` | `<img src="image.gif"/>` |
The middle entry is called an empty element, which can be written more concisely as given on the right. Elements define structure, and may or may not contain content. Attributes describe elements, and attribute values are enclosed in quotes. Figure 6 illustrates another XML application.
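Before moving on to Figure 6, the rules about end tags and case are easy to check with a parser. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


html_style = '<img src="image.gif">'          # start tag with no end tag
xml_long = '<img src="image.gif"></img>'      # explicit end tag
xml_short = '<img src="image.gif"/>'          # empty-element short form
mixed_case = '<Event>School Meeting</event>'  # tag names differ in case

for text in (html_style, xml_long, xml_short, mixed_case):
    print(f"{text!r}: {is_well_formed(text)}")

# The two XML forms of the empty element parse to the same thing.
long_form = ET.fromstring(xml_long)
short_form = ET.fromstring(xml_short)
print(long_form.attrib == short_form.attrib, len(long_form), len(short_form))
```

The HTML form is fine for a browser but is rejected by an XML parser, as is the mixed-case pair of tags.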
<?xml version="1.0"?>
<!-- Connex Train Information Database -->
<schedule date="07/03/01">
  <train route="24">
    <status depart="1602">
      Cancelled
    </status>
  </train>
  <train route="25">
    <status depart="1605" platform="2">
      Delayed waiting for in-bound driver
    </status>
  </train>
  <train route="34">
    <status depart="1628" platform="3">
      About to depart
    </status>
  </train>
</schedule>
|
Presumably this <schedule> is updated
periodically as conditions change. The route
attribute is used as a key into another database containing
information about which stations are serviced. This XML
document is easily parsed and transformed into whatever
format the public information displays require. It could
also be used to update their web pages!
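As a rough illustration of how little code such a transformation needs, here is a minimal sketch in Python that turns Figure 6 into lines for a departure board. The function name departure_board and the layout of each line are assumptions, not part of any real display system.

```python
import xml.etree.ElementTree as ET

SCHEDULE_XML = """<?xml version="1.0"?>
<!-- Connex Train Information Database -->
<schedule date="07/03/01">
  <train route="24">
    <status depart="1602">Cancelled</status>
  </train>
  <train route="25">
    <status depart="1605" platform="2">Delayed waiting for in-bound driver</status>
  </train>
  <train route="34">
    <status depart="1628" platform="3">About to depart</status>
  </train>
</schedule>"""


def departure_board(xml_text):
    """Return one display line per train: time, route, platform, status."""
    schedule = ET.fromstring(xml_text)
    lines = []
    for train in schedule.findall("train"):
        status = train.find("status")
        depart = status.get("depart")
        # The cancelled train has no platform attribute, so supply a default.
        platform = status.get("platform", "-")
        lines.append(f"{depart[:2]}:{depart[2:]}  route {train.get('route')}  "
                     f"platform {platform}  {status.text.strip()}")
    return lines


board = departure_board(SCHEDULE_XML)
print("\n".join(board))
```

Generating a web page instead would only change the strings being built, not the parsing.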
XML languages are being developed for many areas of document processing and e-commerce. Some of the more prominent ones are presented below.
The Bank Internet Payment System [BIPS] facilitates secure electronic transactions over the Internet. Transactions can be initiated by either the payer or payee, and are secured using digital certificates.
JavaBeans (also called beans) are software components that can be combined to create Java applications and applets. The Bean Markup Language [BML] is used for describing JavaBeans. BML defines how various beans are interconnected.
Peter Murray-Rust's Chemical Markup Language [CML] is used for representing molecular and chemical information. Figure 7 illustrates the CML document for a water molecule (H2O).
<?xml version="1.0"?>
<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>
|
Unfortunately, this example cannot be displayed in current browsers. However, Figure 8 should give some idea of what it should look like.
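Even without a browser to display it, the document is easy for software to read. The following is a minimal sketch in Python, assuming that the ATID1 and ATID2 arrays hold 1-based positions in the ELSYM array and that ORDER holds the bond order; that reading is inferred from Figure 7 rather than taken from the CML specification, and the function name read_molecule is hypothetical.

```python
import xml.etree.ElementTree as ET

WATER_CML = """<?xml version="1.0"?>
<CML>
  <MOL TITLE="Water">
    <ATOMS>
      <ARRAY BUILTIN="ELSYM">H O H</ARRAY>
    </ATOMS>
    <BONDS>
      <ARRAY BUILTIN="ATID1">1 2</ARRAY>
      <ARRAY BUILTIN="ATID2">2 3</ARRAY>
      <ARRAY BUILTIN="ORDER">1 1</ARRAY>
    </BONDS>
  </MOL>
</CML>"""


def read_molecule(xml_text):
    """Return (title, element symbols, bonds as (atom, atom, order))."""
    mol = ET.fromstring(xml_text).find("MOL")

    def array(section, builtin):
        # Pick the <ARRAY> with the given BUILTIN attribute and split its text.
        node = mol.find(f"{section}/ARRAY[@BUILTIN='{builtin}']")
        return node.text.split()

    symbols = array("ATOMS", "ELSYM")
    bonds = []
    for first, second, order in zip(array("BONDS", "ATID1"),
                                    array("BONDS", "ATID2"),
                                    array("BONDS", "ORDER")):
        # Assumed: atom ids are 1-based indexes into the symbol list.
        bonds.append((symbols[int(first) - 1], symbols[int(second) - 1],
                      int(order)))
    return mol.get("TITLE"), symbols, bonds


title, symbols, bonds = read_molecule(WATER_CML)
print(title, symbols, bonds)
```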
Commerce XML [cXML] is used for describing catalog data and performing business-to-business electronic transactions that use the data.
Electronic Business XML [ebXML] is the result of an 18-month project by the United Nations to standardise the global exchange of business information. Rather than emphasising business documents, ebXML emphasises business processes.
The Extensible Business Reporting Language [XBRL] captures existing financial and accounting information standards in XML. Future versions of XBRL will expand to encompass descriptions of information in other areas of business.
The Extensible User Interface Language [XUL] (pronounced zool) is an XML-based language developed by the Mozilla project for describing user interfaces. Cross-platform applications can load the information from a XUL document to create the appropriate user interface.
The Geography Markup Language [GML] describes geographical information for use and reuse by different applications for different purposes. In GML, geographic information is described in terms of features. A feature is composed of properties and geometries. A property contains name, type and value elements. Geometries contain the bulk of geometric data.
In the USA court documents must be filed with a clerk, and the information often must be entered into different document management systems multiple times. With LegalXML [LegalXML], the information in court documents can be described to enable more efficient processing.
The Mathematical Markup Language [MathML] was developed for describing mathematical notations and expressions using XML. It allows mathematical expressions to be processed by different applications for different purposes. Figure 9 shows the MathML for the quadratic equation x² + 4x + 4 = 0.
<math>
  <mrow>
    <mrow>
      <msup>
        <mi>x</mi>
        <mn>2</mn>
      </msup>
      <mo>+</mo>
      <mrow>
        <mn>4</mn>
        <mo>&InvisibleTimes;</mo>
        <mi>x</mi>
      </mrow>
      <mo>+</mo>
      <mn>4</mn>
    </mrow>
    <mo>=</mo>
    <mn>0</mn>
  </mrow>
</math>
|
The <mi> element is for identifiers, the
<mn> element is for numbers, the
<mo> element is for operators, etc. The
entity &InvisibleTimes; is important: it's
invisible when rendered for viewing, spoken when rendered
for voice, but indicates multiplication if the equation is
being computed!
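The "being computed" case can be sketched in a few lines. This is a minimal sketch in Python that handles only the handful of elements used in Figure 9; the function names to_python and holds_for are hypothetical, and the invisible-times operator is written as the character reference &#x2062; because the standard-library parser does not know MathML's named entities. It uses eval, which is acceptable for a toy but not for untrusted input.

```python
import xml.etree.ElementTree as ET

INVISIBLE_TIMES = "\u2062"

QUADRATIC = """<math><mrow>
  <mrow>
    <msup><mi>x</mi><mn>2</mn></msup>
    <mo>+</mo>
    <mrow><mn>4</mn><mo>&#x2062;</mo><mi>x</mi></mrow>
    <mo>+</mo>
    <mn>4</mn>
  </mrow>
  <mo>=</mo>
  <mn>0</mn>
</mrow></math>"""

# How each MathML operator is spelled in Python.
OPERATORS = {"+": "+", "-": "-", "=": "==", INVISIBLE_TIMES: "*"}


def to_python(node):
    """Translate a small subset of MathML into a Python expression string."""
    if node.tag == "mn":
        return str(float(node.text.strip()))
    if node.tag == "mi":
        name = node.text.strip()
        if not name.isidentifier():
            raise ValueError(f"unsupported identifier {name!r}")
        return name
    if node.tag == "mo":
        return OPERATORS[node.text.strip()]
    if node.tag == "msup":
        base, exponent = node
        return f"({to_python(base)})**({to_python(exponent)})"
    if node.tag in ("math", "mrow"):
        return "(" + " ".join(to_python(child) for child in node) + ")"
    raise ValueError(f"unsupported element <{node.tag}>")


def holds_for(mathml_text, x):
    """Evaluate the equation for a given x and report whether it holds."""
    expression = to_python(ET.fromstring(mathml_text))
    return eval(expression, {"__builtins__": {}}, {"x": x})


print(to_python(ET.fromstring(QUADRATIC)))
print(holds_for(QUADRATIC, -2), holds_for(QUADRATIC, 1))
```

Without the invisible-times operator, the 4 and the x would sit side by side with nothing to say how they combine.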
News items exist in many different formats and are presented and received through different means. NewsXML [NewsXML] is designed to be media independent, so that all news-content formats (e.g. text, photo, etc.) can be described. NewsXML also enables tracking and revision of documents over time.
The Open eBook Publication Structure is the result of a collaboration of companies, the Open eBook Forum [Open eBook Forum], dedicated to electronic text publication. The language is designed to be platform independent, but maintains flexibility and permits document authors to embed platform-specific content as long as a platform-independent alternative is provided.
OpenMath [OpenMath] is a standard for describing mathematical content as objects which can be exchanged, manipulated, and displayed by different browsers in different contexts.
The Scalable Vector Graphics markup language [SVG] is a way to describe vector graphics
data over the web. Current methods (e.g. GIF,
JPEG, PNG) use bitmaps,
which have a fixed resolution and cannot be scaled without a
loss in image quality. Vector graphics describe graphical
information in terms of lines, curves, etc. which can be
scaled and printed quite easily. Think PostScript for
pictures.
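To show what "described in terms of lines and curves" means in practice, here is a minimal sketch in Python that writes the same picture at two sizes. The function name make_picture and the shapes themselves are made up for illustration; the element and attribute names (svg, circle, line, viewBox) are standard SVG.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_picture(width, height):
    """Build an SVG document whose shapes live in a fixed 100x100 grid."""
    svg = ET.Element("svg", xmlns=SVG_NS, width=str(width),
                     height=str(height), viewBox="0 0 100 100")
    # The shapes are given as geometry, not pixels.
    ET.SubElement(svg, "circle", cx="50", cy="50", r="40",
                  fill="none", stroke="black")
    ET.SubElement(svg, "line", x1="10", y1="90", x2="90", y2="10",
                  stroke="black")
    return ET.tostring(svg, encoding="unicode")


small = make_picture(100, 100)
large = make_picture(1000, 1000)
print(small)
print(large)
```

Only the width and height differ between the two documents; the renderer scales the geometry, so the large version needs no extra data and loses no quality.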
If your browser is Mozilla (Version 0.9 or higher) or Internet Explorer (Version 4.0 or higher), Adobe provide a free plug-in for rendering SVG documents. The plug-in is available at www.adobe.com/svg/. A static demonstration is here (1747 bytes) and an animated demonstration is here (2054 bytes). Both are from the Deitel book.
The Synchronised Multimedia Integration Language [SMIL] (pronounced "smile") enables web authors to co-ordinate the presentation of a wide range of multimedia elements. In SMIL, multimedia elements can work together; this enables authors to specify when and how these multimedia elements appear in the document.
Visa [Visa] has developed the Visa XML Invoice Specification to enable its business customers to exchange credit-card purchase information between businesses over the Internet in a secure and standardised form. Currently, the specification provides a framework that describes credit-card purchases in the areas of procurement (i.e. business-to-business purchasing) and travel & entertainment (T&E) expenses.
Motorola's VoxML [VoxML] is an XML application for the spoken word, in particular for automated telephone response systems. VoxML enables the same data on the web to be served up via the telephone. Figure 10 is an example taken from Elliotte Rusty Harold's book.
<?xml version="1.0"?>
<DIALOG>
  <CLASS NAME="help_top">
    <HELP>Welcome to TIC consumer products division.
      For shampoo information, say shampoo now.
    </HELP>
  </CLASS>
  <STEP NAME="init" PARENT="help_top">
    <PROMPT>Welcome to Wonder Shampoo
      <BREAK SIZE="large"/>
      Which color did Wonder Shampoo turn your hair?
    </PROMPT>
    <INPUT TYPE="OPTIONLIST">
      <OPTION NEXT="#green">green</OPTION>
      <OPTION NEXT="#purple">purple</OPTION>
      <OPTION NEXT="#bald">bald</OPTION>
      <OPTION NEXT="#bye">exit</OPTION>
    </INPUT>
  </STEP>
  <STEP NAME="green" PARENT="help_top">
    <PROMPT>
      If Wonder Shampoo turned your hair green and you wish
      to return it to its natural color, simply shampoo seven
      times with three parts soap, seven parts water, four
      parts kerosene, and two parts iguana bile.
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="purple" PARENT="help_top">
    <PROMPT>
      If Wonder Shampoo turned your hair purple and you wish
      to return it to its natural color, please walk
      widdershins around your local cemetery
      three times while chanting "Surrender Dorothy".
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="bald" PARENT="help_top">
    <PROMPT>
      If you went bald as a result of using Wonder Shampoo,
      please purchase and apply a three months supply
      of our Magic Hair Growth Formula(TM). Please do not
      consult an attorney as doing so would violate the
      license agreement printed on inside fold of the Wonder
      Shampoo box in 3 point type which you agreed to
      by opening the package.
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#bye"/>
  </STEP>
  <STEP NAME="bye" PARENT="help_top">
    <PROMPT>
      Thank you for visiting TIC Corp. Goodbye.
    </PROMPT>
    <INPUT TYPE="NONE" NEXT="#exit"/>
  </STEP>
</DIALOG>
|
It's not possible to show a screen shot of this example, because it's not intended for the web. Just pick up your phone!
The Wireless Markup Language [WML] allows web pages to be displayed on wireless devices such as cellular phones and PDAs. WML works with the Wireless Application Protocol (WAP) to deliver the content.
People are beginning to use XML to store their data. Software applications can use XML to store preferences and virtually any kind of information from chemical formulae to file archives. XML is not the perfect solution for every data storage problem; access times may be slow and documents can be large. Some kinds of data just don't need XML - a raster image is usually a long sequence of binary digits, monolithic, unparsable and huge.
However, XML has great possibilities for programmers. It is well suited to being read, written, and altered by software. Its syntax is straightforward and easy to parse. It's well documented and there are many tools and code libraries available to developers. As an open standard, with support from many popular programming languages, XML may well become the lingua franca for computer communication.
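As a small illustration of XML being read, written, and altered by software, here is a minimal sketch in Python that stores application preferences. The <preferences> and <option> element names and the three function names are made up for this example; they are not a standard format.

```python
import xml.etree.ElementTree as ET


def save_preferences(preferences):
    """Write a dict of preferences as an XML document (returned as text)."""
    root = ET.Element("preferences")
    for name, value in preferences.items():
        ET.SubElement(root, "option", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def load_preferences(xml_text):
    """Read the document back into a dict; all values come back as strings."""
    root = ET.fromstring(xml_text)
    return {option.get("name"): option.text or ""
            for option in root.findall("option")}


def set_preference(xml_text, name, value):
    """Alter one option in an existing document, adding it if absent."""
    root = ET.fromstring(xml_text)
    for option in root.findall("option"):
        if option.get("name") == name:
            option.text = str(value)
            break
    else:
        ET.SubElement(root, "option", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


document = save_preferences({"font": "Times", "size": 12})
print(document)
document = set_preference(document, "size", 14)
document = set_preference(document, "language", "en-GB")
print(load_preferences(document))
```

Note that the number 12 comes back as the string "12": plain XML carries text, and any typing has to be agreed separately, for example in a DTD or by convention.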
Now try this exercise.