
The 10 Minute Guide To Reading an XML DTD

This document is meant to be a brief overview of how to read and interpret an XML Document Type Definition (DTD). It makes no assumptions about how much you know about markup languages, XML or DTDs.

Markup Language

What's a markup language? A markup language is a way of describing textual data by indicating the meaning of a particular piece of text. This text can be anything from a word or a sentence to a paragraph, or even an entire document.

Why would you do this? It's usually (!) obvious to the human reader of a document what the text in a document means. For example, you can see from the way I've written "The 10 Minute Guide To Reading an XML DTD" above, and "Markup Language", that these are meant to be headings. Because I left a space between this block of text and the previous block, you can see that this is a separate paragraph. We take the meaning for granted.

It's much harder to describe this meaning to a computer which must process a document. So we need a means to explicitly describe the meaning of a particular piece or block of text so that the application knows how to process it.

If this were an HTML document then I might do something like the following:

      <TITLE>THE 10 MINUTE GUIDE TO READING AN XML DTD</TITLE>
      <P>What's a markup language?...</P>
      

Here I've told the application (an HTML browser in this instance) that a certain piece of text is the title, and that the next piece of text is a paragraph. In this case the title text is described by what is known as a 'tag' or 'element' called TITLE, and the paragraph by one called P.

Tags and Elements

Notice from the previous example the syntax of a tag (or element) in a markup language such as HTML or XML:

   <name-of-the-tag-goes-here>
   

The end of the text which is bounded by a tag is marked as follows:

   </name-of-the-tag-goes-here>
   

Attributes

Elements can have additional information (usually meta information) associated with them. This information is known as an attribute. It looks like this:

<P ALIGN="RIGHT">Some text</P>

Here we've told the browser that the text 'Some text' is a separate paragraph and that the paragraph should be aligned to the right margin when it's displayed.

Attributes then have the form:

<tag name-of-the-attribute="the-value-of-the-attribute">...</tag>
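To put tags and attributes together, here is a small sketch of a marked-up fragment. TITLE, P and ALIGN come from the examples above; the ARTICLE element, the AUTHOR attribute and the value "A. Writer" are assumptions made for illustration.

```xml
<!-- ARTICLE is the outer element; AUTHOR is an attribute on it -->
<ARTICLE AUTHOR="A. Writer">
  <!-- start tag, text content, end tag -->
  <TITLE>The 10 Minute Guide To Reading an XML DTD</TITLE>
  <!-- ALIGN is an attribute of this P element -->
  <P ALIGN="RIGHT">Some text</P>
  <P>Some more text</P>
</ARTICLE>
```

Every start tag has a matching end tag, and the inner elements are closed before the outer one.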

Document Type Definitions (DTD)

It ought to be obvious from the above HTML examples that providing a document with markup like <TITLE> and <P> is only useful if the application knows that those elements exist and how to handle them.

This is achieved by agreeing on a Document Type Definition, which describes the legal elements and attributes that can be used to mark up a document. This is essentially a contract between the application and the user of the markup language: if the user marks up a document in a certain way, then the application can be relied upon to respond accordingly.

An additional advantage of a DTD is that it is defined using a rigorous syntax, which means that it becomes possible to 'validate' (i.e. check) a document against its DTD to see whether it conforms to the letter of the contract.

XML DTDs all have a common underlying syntax which allows any XML parser (an application that can read a document, and potentially validate it against its DTD) to process any XML document.
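As a rough sketch of how a document and its DTD are tied together: an XML document can name its DTD in a DOCTYPE declaration placed before the root element. The file name article.dtd and the ARTICLE element are assumptions made up for this example.

```xml
<?xml version="1.0"?>
<!-- The DOCTYPE names the root element (ARTICLE) and points at the DTD.
     "article.dtd" is a hypothetical file name. -->
<!DOCTYPE ARTICLE SYSTEM "article.dtd">
<ARTICLE>
  <P>Some text</P>
</ARTICLE>
```

A validating parser reads the DOCTYPE, loads the declarations it points to, and checks the rest of the document against them.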

Defining an Element in a DTD

OK, how do we define an element in a DTD?

Here's how:

<!ELEMENT element-name  ...>

This declares an element called element-name. From the example above we might have:

<!ELEMENT TITLE         ...>

What does the ... mean? This is where you declare what other tags or text an element can contain: its 'content model'. A content model is usually either plain text, other elements, or a mixture of the two.

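Here's an element which can contain only plain text (no other elements). Plain text is written as #PCDATA (parsed character data), and the XML syntax requires parentheses around it:

<!ELEMENT TITLE         (#PCDATA)>

Here's an element which can contain a single paragraph element:

<!ELEMENT FOOTNOTE       (P)>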

How do we specify multiple or optional elements? There are three qualifiers which we can add after an element name or a bracketed group to denote multiple or optional elements:

   ?    zero or one (optional)
   *    zero or more
   +    one or more

So if we wanted to declare an element which could contain multiple paragraphs then we would do the following:

<!ELEMENT       ARTICLE         (P)*>

If we wanted our article to include an optional title, then we'd declare:

<!ELEMENT       ARTICLE         (TITLE?, P*)>

Notice the comma separator. An article consists of an optional title and zero or more paragraphs. The order matters, however: the TITLE element, if there is one, must appear before all P elements.
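To make the sequence rule concrete, here are a few separate fragments checked by hand against the (TITLE?, P*) declaration above. The last declaration is an assumption added for illustration: it uses the + qualifier (one or more), which the examples so far have not shown.

```xml
<!-- Valid: title first, then paragraphs -->
<ARTICLE>
  <TITLE>A title</TITLE>
  <P>First paragraph</P>
  <P>Second paragraph</P>
</ARTICLE>

<!-- Valid: the title is optional, and so are the paragraphs -->
<ARTICLE></ARTICLE>

<!-- Not valid: TITLE appears after a P -->
<ARTICLE>
  <P>First paragraph</P>
  <TITLE>A title</TITLE>
</ARTICLE>

<!-- Hypothetical variant: optional title, then at least one paragraph.
     With this declaration the empty ARTICLE above would no longer be valid. -->
<!ELEMENT ARTICLE (TITLE?, P+)>
```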

If you want to specify an OR (instead of an AND, which is what the comma specifies), then you can specify the following:

<!ELEMENT       ARTICLE         (TITLE? | P*)>
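As a sketch of what the OR form allows, the separate fragments below are checked by hand against the (TITLE? | P*) declaration above: an article holds either an optional title or zero or more paragraphs, but not both. The final declarations are an assumption for illustration. They show how the "mixture of text and elements" content model mentioned earlier is usually written, using a made-up EM element.

```xml
<!-- Valid: just a title -->
<ARTICLE><TITLE>A title</TITLE></ARTICLE>

<!-- Valid: just paragraphs -->
<ARTICLE><P>One</P><P>Two</P></ARTICLE>

<!-- Not valid: a title AND paragraphs; the | allows only one branch -->
<ARTICLE><TITLE>A title</TITLE><P>One</P></ARTICLE>

<!-- Mixed content (EM is a hypothetical element): text and EM elements
     in any order. #PCDATA must come first and the group must end in * -->
<!ELEMENT P  (#PCDATA | EM)*>
<!ELEMENT EM (#PCDATA)>

<!-- A paragraph that fits the mixed declaration -->
<P>Some <EM>emphasised</EM> text</P>
```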

Declaring an Attribute in a DTD

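OK, how do we define an attribute in a DTD?

Here's how:

<!ATTLIST element-name  attribute-name  CDATA  #IMPLIED>

This declares an attribute, attribute-name, for the element element-name. The attribute is declared as having text content (character data, or CDATA). Every attribute declaration must end with either a default value or a keyword such as #IMPLIED, which marks the attribute as optional; both are covered below.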

You can declare multiple attributes for an element in one go:

<!ATTLIST ARTICLE       AUTHOR  CDATA   #IMPLIED
                        DATE    CDATA   #IMPLIED>

It's common to put the declarations on separate lines for readability. An attribute can also be declared as taking one value from a fixed list:

<!ATTLIST P             ALIGN   (LEFT|RIGHT|CENTRE)  #IMPLIED>

This restricts the ALIGN attribute of the P element to the value LEFT, RIGHT or CENTRE. A default can be supplied in place of the #IMPLIED keyword as follows:

<!ATTLIST P             ALIGN   (LEFT|RIGHT|CENTRE) "LEFT">

So if the ALIGN attribute isn't specified, it defaults to LEFT.

An attribute can also be made mandatory, i.e. the author is forced to supply a particular piece of information:

<!ATTLIST ARTICLE       AUTHOR  CDATA   #REQUIRED
                           DATE    CDATA   #IMPLIED>

This means that an article *must* have an AUTHOR attribute, and can have a DATE attribute but this is optional.
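Putting the attribute declarations together, a sketch of how they apply to a document might look like this. The attribute values are invented for illustration, and the fragments are checked by hand against the declarations rather than by a parser.

```xml
<!-- Declarations from the examples above -->
<!ATTLIST ARTICLE AUTHOR CDATA               #REQUIRED
                  DATE   CDATA               #IMPLIED>
<!ATTLIST P       ALIGN  (LEFT|RIGHT|CENTRE) "LEFT">

<!-- Valid: AUTHOR is present; DATE is optional and left out.
     The first P has no ALIGN, so the default value LEFT applies -->
<ARTICLE AUTHOR="A. Writer">
  <P>Some text</P>
  <P ALIGN="CENTRE">Some centred text</P>
</ARTICLE>

<!-- Not valid: the required AUTHOR attribute is missing,
     and MIDDLE is not one of the listed ALIGN values -->
<ARTICLE DATE="2001-01-01">
  <P ALIGN="MIDDLE">Some text</P>
</ARTICLE>
```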

Is that all there is to it?

No, unfortunately not. There are lots of other subtle rules relating to DTDs, and other types of declaration that can be used, but as elements and attributes are the main contents of a DTD, this should be enough to get you started.

One thing to remember is that an XML (or HTML or SGML) document is really a tree structure. This means that one element is the 'document' element or the root of the tree. In HTML the root element is the HTML tag.

e.g. <HTML>...other tags...</HTML>

The root element cannot be contained in any other element, although it can potentially contain any other element. It's a good tip when reading a DTD to determine the root element and then start reading from there, working outwards from the root. This will give you a starting point which should give you an idea of the broad structure of a document quite quickly.

XML does not require it, but by convention elements are declared before they are used in the content models of other elements, so you can usually find the document element of a DTD towards the bottom of the file.

e.g.

<!ELEMENT  P       (#PCDATA)>
<!ELEMENT ARTICLE  (P)*>

The article element is declared after the paragraph element because it can contain paragraphs. So look for the document element and start there.
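As a worked sketch of this reading order, here is a small DTD assembled from the declarations used throughout this guide. The way they are combined, and the file name, are assumptions for illustration rather than a published DTD. ARTICLE is not mentioned in any other element's content model, so it is the document element, and it sits at the bottom.

```xml
<!-- article.dtd (hypothetical file name) -->

<!-- Leaf elements first: both contain only text -->
<!ELEMENT TITLE   (#PCDATA)>
<!ELEMENT P       (#PCDATA)>

<!-- Attributes for P -->
<!ATTLIST P       ALIGN (LEFT|RIGHT|CENTRE) "LEFT">

<!-- Document element last: start reading here -->
<!ELEMENT ARTICLE (TITLE?, P*)>
<!ATTLIST ARTICLE AUTHOR CDATA #REQUIRED
                  DATE   CDATA #IMPLIED>
```

Reading from ARTICLE: an article has a required AUTHOR, an optional DATE, an optional TITLE and any number of P elements. Following those names back up the file, TITLE and P turn out to be plain text, and P carries an ALIGN attribute that defaults to LEFT.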

Contact info

I hope this proved of some use. If you want to contact the author, you can mail me here: ldodds@ingenta.com. If there are any glaring omissions or errors then please let me know.