XDELTA - Deriving an XML based format for Taxonomic Information

Introduction
What is DELTA?
Where can I find additional information on the DELTA format?
What is XML?
Where can I find additional information relating to XML?
What is the XDELTA format?
What are the advantages of using an XML based format?
What are the disadvantages of using an XML format?
An Overview of the XDELTA DTD
What tools/applications can be used to process XDELTA?
Will the XDELTA format, and any associated tools be free?
Downloading XDELTA
Version History
TODO List

Introduction

This short paper describes my attempts to produce an XML file format (working title XDELTA), derived from the DELTA (DEscription Language for TAxonomy) standard. I outline some of the advantages of using XML over a proprietary format, and link to associated reference material. A first draft of the DTD and some sample data are also provided.

The intention behind this work is to facilitate discussions on the requirements for a standardised XML format for taxonomic information by providing a basic example implementation - in this regard XDELTA should be seen as a 'proof of concept' project: there are likely to be additional requirements for such a format over and above a basic 1:1 translation of DELTA into XML.

This page is a work in progress. Please forward any comments to the author at the following address: ldodds@ingenta.com

What is DELTA?

DELTA is a file format and an associated set of applications used for storing taxonomic information in a form suitable for computer processing. It is a standard which has been adopted by the International Taxonomic Databases Working Group.

A full description of the DELTA format can be found here.

The features of the DELTA system include:

generation and typesetting of taxonomic descriptions and conventional keys
conversion of the DELTA data for use in classification applications
applications to allow the interactive identification of taxa, and the retrieval of taxonomic information

Where can I find additional information on the DELTA format?

The DELTA home page contains links to the relevant applications, data and documentation of the system. The DELTA standard can be downloaded for off-line use (Compressed MS Word document).

Discussions relating to the DELTA format and this paper can be carried out on the DELTA mailing list.

What is XML?

XML, the eXtensible Markup Language, is a markup language similar to HTML, which can be used to describe data and information in a structured format. It is extensible in that it is a 'meta' language: it defines a basic grammar which can be built upon to design your own markup. Efforts are under way to produce standard XML based representations for information used in many contexts, both in business and scientific applications. For example, CML is a markup language used by chemists to manage and exchange information relating to chemical structures. Other scientific markup languages include Bioinformatic Sequence Markup Language (BSML) and BIOpolymer Markup Language (BIOML).

XML is a text based language and can be very easy to author by hand. There are also a lot of freely available tools for manipulating (parsing, reading, writing, translating) XML data. Because all XML formats share a basic common grammar, any XML parser can parse any XML document. For custom markup languages, e.g. CML, XDELTA, a Document Type Definition (DTD) can be defined which specifies exactly what information can be inserted in an XML document, and where in that document the information must be placed. A parser can then validate a document against its DTD to ensure that the document is valid - ensuring that it can be successfully processed.
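The point that any XML parser can read any XML document can be sketched in a few lines. The example below is only an illustration in Python: the element names are borrowed from the DTD fragment discussed later in this paper, the sample strings are made up, and the standard library parser used here checks well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical XDELTA-like fragments; the content is illustrative only.
WELL_FORMED = "<character><description>Wing colour</description><txt/></character>"
MALFORMED = "<character><description>Wing colour</character>"  # unclosed <description>


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise.

    This checks well-formedness only; it does not validate against a DTD.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED))
print(is_well_formed(MALFORMED))
```

A validating parser would go one step further and also check the document against its DTD.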

See the next section for additional resources on XML. The above is only a (very) brief introduction.

Where can I find additional information relating to XML?

Here are some miscellaneous XML links:

What is the XDELTA format?

The XDELTA format is an XML version of the DELTA format. It consists of a draft DTD which captures the essential information from the DELTA format, and (eventually) some sample utilities for converting XDELTA to and from the original DELTA format for backwards compatibility. These utilities will hopefully also demonstrate some of the power of XML in storing structured data for information exchange between research scientists.

What are the advantages of using an XML based format?

Free parsing and processing tools
Because of the common basic grammar which underlies XML, parsing and validation of any markup language based upon the standard can be carried out by a shared toolset. This toolset includes both non-validating and validating parsers (i.e. those that check against a Document Type Definition) many of which are freely available and are highly optimised for this purpose. This makes them high performing, and in some cases very lightweight applications. XML parsers range from server-side backend utilities, through to implementations suitable for use in web browsers and (Java) applets.

This freely available toolset, which includes parsers suitable for use in any programming language, means that there is no need to rely on custom parsing routines to process any XML data. Writing parser software can be very time-consuming and difficult, particularly if the data format is poorly defined or includes non-standard extensions which may or may not be shared by all applications. Relying on this shared toolset means that XML based applications gain the benefits of very robust parser technology.

Well defined and standardised programmatic interfaces to XML parsers (the SAX and DOM APIs) mean that development of XML applications can, once a DTD has been agreed upon, move quickly onto the core functionality rather than having to reinvent the 'parser wheel' every time the format changes slightly.
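As a rough illustration of what working through the SAX API looks like, the sketch below uses Python's standard xml.sax module to count character elements and collect their descriptions. The sample document, its xdelta root element and the CharacterCounter class are assumptions made for the example; they are not taken from the XDELTA DTD itself.

```python
import xml.sax


class CharacterCounter(xml.sax.ContentHandler):
    """Counts <character> elements and collects their <description> text."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.descriptions = []
        self._in_description = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "character":
            self.count += 1
        elif name == "description":
            self._in_description = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._in_description:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "description":
            self.descriptions.append("".join(self._buffer).strip())
            self._in_description = False


# Hypothetical sample data; the structure is assumed for illustration.
SAMPLE = b"""<xdelta>
  <character><description>Wing colour</description><txt/></character>
  <character><description>Antenna length</description><num/></character>
</xdelta>"""

handler = CharacterCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.count, handler.descriptions)
```

The application code only implements the callbacks it cares about; tokenising, entity handling and error reporting are all left to the shared parser.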

Extensibility, flexibility, openness
Defining an XML DTD involves (or should involve) a lot of analysis of the data which is to be used by the application(s). This analysis is then captured within the DTD which helps to document the format. This means that XML is to a large extent self-documenting. Reading a DTD is not a difficult task, so it is easy to begin working with data, as well as authoring new XML documents.

Taking care over the analysis behind the DTD can yield great benefits in terms of the flexibility and extensibility of the format. Provided that the core structure of the DTD remains the same, additional optional elements and attributes can be added to the document type without invalidating data which conforms to an earlier version. This is far better than pigeon-holing data into comment blocks, for example, which overloads the meanings of the fields in the format and further complicates the parser routines.

For example this fragment from the XDELTA DTD might define a character:

<!ELEMENT character (description, comment?, (multi|num|txt))>

The above means that a character element (or tag) must have a description, may have an optional comment (signified by the ? operator), and must contain exactly one of multiple choice, numeric, or text data. Assume that this has been in use for several months when it is realised that the format needs to be able to hold additional notes about a character as well as image reference information. In the original DELTA format, to avoid breaking existing software, this information has to be held in comment blocks, making the meaning of such blocks more confusing both to the parser writer and to the author of a document. In XML, however, we can do the following:

<!ELEMENT character (description, comment?, (multi|num|txt), image?, notes?)>

We've added optional image and notes elements to the document type. By making them optional we've ensured that existing data which has been marked up with an earlier version of the DTD is still valid, but we have also greatly increased the expressive power of the document without having to overload the meaning of any particular element. The content of the image and/or notes element can be as complex as required.
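The backwards compatibility argument can be checked mechanically. The sketch below is not a DTD validator: it is a simplified stand-in that rewrites each of the two content models above as a regular expression over the names of the child elements. The sample data and the matches helper are made up for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# The two content models from the text, written as regular expressions
# over a comma-terminated list of child element names.
OLD_MODEL = re.compile(r"description,(comment,)?(multi|num|txt),")
NEW_MODEL = re.compile(r"description,(comment,)?(multi|num|txt),(image,)?(notes,)?")


def matches(model, character_xml):
    """Return True if the children of the element fit the content model."""
    element = ET.fromstring(character_xml)
    children = "".join(child.tag + "," for child in element)
    return model.fullmatch(children) is not None


# Hypothetical data marked up before and after the DTD change.
OLD_DATA = "<character><description>Wing colour</description><txt/></character>"
NEW_DATA = ("<character><description>Wing colour</description><txt/>"
            "<notes>Check in daylight</notes></character>")

print(matches(OLD_MODEL, OLD_DATA))  # old data, old DTD
print(matches(NEW_MODEL, OLD_DATA))  # old data, new DTD
print(matches(NEW_MODEL, NEW_DATA))  # new data, new DTD
print(matches(OLD_MODEL, NEW_DATA))  # new data, old DTD
```

Only the last combination fails: data that uses the new elements is not valid under the old DTD, but nothing written against the old DTD is broken by the change.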

In certain cases, however, backwards compatibility is not possible. We may have decided that all characters should have associated notes to help the user to make an identification. This presents the problem that existing data must be amended to include this new element. This is easily achieved by carrying out transformations of the document structure. These transformations can be managed through the use of what are known as 'stylesheets'. A stylesheet defines the mapping between two different document types, with no requirement that the target document is actually XML data: it can as easily be plain text, HTML, another XML format, PDF, and so on. Here we see that the well-defined structure of XML documents provides a great deal of flexibility when it comes to processing the data that it captures.
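The kind of structural amendment described above would normally be written as a stylesheet. As a rough stand-in, the sketch below does the same job in Python with the standard ElementTree module rather than with a stylesheet language: it adds a notes element to every character that lacks one. The add_missing_notes function, the placeholder text and the sample document are all hypothetical.

```python
import xml.etree.ElementTree as ET


def add_missing_notes(xml_text, placeholder=""):
    """Return xml_text with a <notes> child added to each <character>
    element that does not already have one."""
    root = ET.fromstring(xml_text)
    # Copy the matches into a list so the tree is not modified mid-iteration.
    for character in list(root.iter("character")):
        if character.find("notes") is None:
            # Appending keeps <notes> last, as the revised content model requires.
            notes = ET.SubElement(character, "notes")
            notes.text = placeholder
    return ET.tostring(root, encoding="unicode")


# Hypothetical document written against the earlier version of the DTD.
OLD_DOC = ("<xdelta><character><description>Wing colour</description>"
           "<txt/></character></xdelta>")

print(add_missing_notes(OLD_DOC, placeholder="TODO"))
```

Running the function a second time on its own output changes nothing, so the conversion can safely be applied to a mixture of old and new data.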

Re-evaluating the DELTA (and other) formats
Another benefit to be gained from analysing the description of taxonomic data in XML is that it provides the opportunity to revisit the DELTA format (as well as others such as NEXUS) and re-assess whether they are still meeting the requirements of the scientific community. Are there new requirements that need to be met? Are there old requirements which are no longer a priority? Are there outstanding problems that need to be resolved? The DELTA documentation has many references to future enhancements as well as obsolete constructs in the data. Starting with a new format gives the opportunity to clean out any 'dead wood' as well as ensuring that all processing and data capture practices are well-standardised.

For example the DELTA standard states (page 5, Taxonomic Descriptions):

Note. Comments in item names were implemented before text characters, and often contained material, such as synonomy, which would now be better placed in text characters. These comments are now generally used for the authority, as in the example below. The interpretation of inner comments is currently not defined; they may be used in future extensions of the DELTA format.

Here we see that text characters are a more recent addition than comments, and that some material in comments is better suited for text characters, and alongside this the note that comments are now generally used for authority references. This example is not meant to criticise the format, or the standard, but hopefully just serves to highlight how a format evolves over time - a well specified XML standard should hopefully minimise these types of problem.

What are the disadvantages of using an XML format?

Learning a new format
Any change to an existing computer system causes an additional learning curve for users of the system as they get to grips with new features and a new way of working. XML markup is very different to the current DELTA format, and long documents can be unwieldy to author because of the need to learn the meanings of a new set of elements and to follow XML's strict syntax requirements. Users of the system will have to be willing to take on a new standard, and their willingness to do so will depend on how well that standard appears to meet their perceived requirements.

XML is also still a relatively young language. Although there are many freely available tools, in some cases those tools are still in early versions, with the obvious problems that this can cause. However, XML is also growing rapidly, and new parsers, validators, authoring packages and a host of other utilities are constantly being developed.

XML isn't that far removed from HTML, so prior experience in this area can help mitigate the learning curve for the end-user of the format. The last few months have seen a surge in the number of XML authoring tools available.

Lack of current software
The DELTA format is well established and there is a good array of packages available for processing its data. This software is well-known and well-used - users are familiar with it, and many bugs will have already been eliminated. Moving to a new standard may well involve writing new applications, although hopefully incorporation of an XML parser into packages such as Intkey and NaviKey could avoid re-inventing the wheel.

Indeed, some software will be made obsolete by an XML standard. I am currently working on sample stylesheets which demonstrate how the data can be automatically converted to HTML for use on the Internet; this encroaches on some of the functionality of CONFOR, for example.

XML is not a 'magic bullet'
As with any newly hyped tool or language, XML can be wrongly viewed as the 'magic bullet' of software. Unfortunately this is not the case: while it provides a great many benefits, these are only a portion of those available to a well-designed system. XML is not a substitute for design: it's a starting point. These messages from the XML-DEV mailing list reiterate this point: [1], [2], [3].

An Overview of the XDELTA DTD

In its present form the XDELTA DTD attempts to capture the main body of information present in DELTA files. The author has attempted to keep to the 'spirit' of the DELTA format so that the DTD isn't biased unduly towards programming constraints or requirements, and aims to make mark-up of data relatively simple. To this end the following requirements were kept in mind:

Keep the number of tags/attributes to a minimum
Name the tags/attributes in plain English, rather than using acronyms (readability)
Allow information to be entered 'freely' - i.e. avoid over-prescription of data requirements
Keep the expressive power of the DELTA format

The DTD itself contains a lot of comments which go into more detail of its structure.

I see this version of the DTD as being mainly a discussion point. I accept that changes to the DTD may be required after it has been reviewed by users/maintainers of the DELTA format, and would encourage the discussion of what additional information could be captured by the format over and above that found in DELTA. The NEXUS format seems to hold additional information, and this is an obvious example of how the format can be expanded.

Go to the download section to grab a copy of the DTD

What tools/applications can be used to process XDELTA?

Because this is still an early version of XDELTA, I've yet to produce any significant applications which are based on the format. The download section does include some example stylesheets, however. I aim to use the source of the NaviKey application to develop a conversion application to generate XDELTA XML documents from DELTA files. This work should also enhance NaviKey to allow it to process XDELTA documents.

Will the XDELTA format, and any associated tools be free?

Yes. The author is hoping that the XDELTA format can be used as a starting point for the definition of an XML based markup language for use in sharing and manipulating taxonomic information. To this end the format, and any tools produced by the author, will be considered Open Source. The software will be available free, but placed under a license that will hopefully ensure that the format remains open to extension and development by the community at large.

The Artistic License will probably suffice in this regard. At present treat XDELTA as 'free, but used at your own risk'.

Downloading XDELTA

The latest version of the XDELTA DTD can be found here (you may need to shift-click the link):

http://www.ldodds.com/delta/xdelta.dtd

If you are not familiar with reading XML DTDs then I'll direct you to the section of XML pointers, or this quick 10 minute guide to reading a DTD.

Some style-sheets which demonstrate the manipulation of XDELTA documents can be found here:

Production of DELTA format Character descriptions - ctext.xsl
Production of DELTA format Item descriptions - itext.xsl

As well as demonstrating the flexibility of an XML format (i.e. its ability to be transformed into other formats), these stylesheets hopefully also show that the current version of the DTD captures the information held in DELTA files.

An example XML data file containing data for the Lepidoptera is also available. This is derived from the example DELTA Lepidoptera data which is available here.

A zip file containing this page, the dtd, and the example stylesheets and data can also be downloaded.

In developing the XDELTA format, I've been using the XP Java parser, and the XT XSLT processor. You'll need the latest version of the XT processor to use the example stylesheets. The Java Development Kit can be downloaded from the Java Home Page. The RXP application is also useful for validating documents.

Version History

Main points of interest:

7/10/99 - Initial draft of DTD produced
11/10/99 - Sample data (for Lepidoptera) selected, and characters marked-up into the XDELTA format
11/10/99 - DTD revised to simplify the markup of character values in taxon descriptions
12/10/99 - DTD revised to include groupings of characters
12/10/99 - DTD revised to include image and image references
12/10/99 - XSL stylesheet produced to convert XDELTA character information back into the equivalent DELTA format
12/10/99 - Mailed DELTA mailing list to announce work.
13/10/99 - Draft web page produced
15/10/99 - Produced stylesheet to generate DELTA format item descriptions from XDELTA
18/10/99 - Work started on a stylesheet to generate plain text natural language versions of the data.
19/10/99 - Rough version of natural language conversion stylesheet produced.
21/10/99 - Made slight amendments to the DTD, e.g. renaming doctype to xdelta, and added a great deal of comments to document each element.
21/10/99 - Wrote a 10 Minute Guide to XML DTDs.
22/10/99 - Produced first public version of this HTML document.

TODO List

In no particular order (plus see the comments in the DTD for additional thoughts):

Finalise markup of sample data - image information is still largely missing
Produce XSL stylesheet to produce HTML version of character data
Produce XSL stylesheet to produce HTML version of taxon information
Produce XSL stylesheet to produce natural language description of taxa
Produce Java based DELTA-to-XDELTA conversion utility
Examine use of FOP for production of PDF output
XML - RTF utilities?
Your Suggestion Here!

Last Updated 22/10/99 - Leigh Dodds