It strikes me that when the Semantic Web truly takes off, there will be a clear need to extract RDF statements from existing XML (but non-RDF) data formats. All that information isn't going to be thrown away, and all those formats aren't going to be revised to include RDF markup (although new formats may well be RDF-based). Even mining existing HTML documents, badly formed as they are, will be important.
How can we achieve this?
Here are some avenues to explore. Some of them may be dead-ends, some of them may not. At the moment I'm just pulling together some notes and will then see where they lead.
- OOL -- Out of Line Markup by Simon St. Laurent or, more generally, the capability to add or refer to additional markup outside of the document instance.
- By extension of the above, Concurrent Markup
- Regular Fragmentations -- again by Simon St. Laurent. Useful for extracting bits and pieces of data from documents to build new markup.
- XSLT, of course. See the Semantic Web screen-scraping notes from Dan Brickley.
- Pipeline processing; see XML Pipeline Frameworks, etc.
- Meaning Definition Language
- Schema annotations?
A lot of ground to cover!
I think there are two views of the RDF syntax:
Firstly, RDF is a framework/graph model, and the specs make it very clear that it's not about syntax and that there are various possible syntaxes. There's an XML one, and then there's N3. If you focus on "RDF as framework", then the syntax is about serializing data for exchange between RDF applications. The syntax is therefore optimised for data exchange and needs to describe the graph model efficiently. This is fully-fledged RDF.
Secondly, RDF elements can be added to XML documents (e.g. as in RSS 1.0) to make some relationships more explicit, enabling RDF-aware applications to extract more information from them. So we might characterise this as annotating XML data (which will probably be a serialization of some other data model) with enough RDF that it can be parsed into a triples-based graph model.
An RDF application parsing data in the second form has no need to use the original syntax at all; it can just serialize its graph as pure RDF. So there seems to be a bootstrapping activity here.
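To make the "graph first, syntax second" point concrete, here is a rough Python sketch of an application that holds its graph as plain triples and writes them out in a line-based syntax (N-Triples is my choice here, not something the notes above depend on). The resource URIs, the `GRAPH` data and the function names are all made up for illustration.

```python
def ntriples_literal(value):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object, object_is_uri) tuples, one statement per line."""
    lines = []
    for subject, predicate, obj, obj_is_uri in triples:
        obj_term = f"<{obj}>" if obj_is_uri else ntriples_literal(obj)
        lines.append(f"<{subject}> <{predicate}> {obj_term} .")
    return "\n".join(lines)


# Hypothetical graph, as it might look after parsing an RSS 1.0 style document.
# The example.org URIs are placeholders.
GRAPH = [
    ("http://example.org/item1", "http://purl.org/rss/1.0/title", "An example item", False),
    ("http://example.org/item1", "http://purl.org/rss/1.0/link", "http://example.org/item1.html", False),
    ("http://example.org/channel", "http://example.org/terms#hasItem", "http://example.org/item1", True),
]

print(to_ntriples(GRAPH))
```

Nothing in the output depends on how the triples were obtained, which is the sense in which the original syntax can be discarded once the graph exists.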
So, the key issue is: do the RDF annotations have to be in the original document?
I suspect the answer is that they don't, but I am interested in finding out how best this might be achieved.
XSLT looks like the obvious candidate for doing this, as it can easily be used to inject new elements/attributes into an existing document. However, this is at the syntax layer. Perhaps it's possible to use Architectural Forms to make the relationships explicit when the parsed data is presented to the application.
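As a first experiment with keeping the annotations out of the original document, here is a minimal sketch in Python (standing in for XSLT, and using only the standard library's `xml.etree.ElementTree`). The source document, the `MAPPING` table, the base URI and the `extract_triples` function are all hypothetical; the only real vocabulary used is the Dublin Core element namespace. The idea is that the mapping lives outside the XML instance and says which elements become which RDF properties.

```python
import xml.etree.ElementTree as ET

# Hypothetical non-RDF source document; it carries no RDF markup at all.
SOURCE = """
<catalog>
  <book id="b1">
    <title>An Example Book</title>
    <author>A. N. Author</author>
  </book>
  <book id="b2">
    <title>Another Example Book</title>
    <author>A. N. Other</author>
  </book>
</catalog>
"""

BASE = "http://example.org/catalog#"      # assumed base URI for subjects
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element namespace

# The out-of-line "annotations": which elements are resources, and which
# child elements map to which predicates. Kept separate from the document.
MAPPING = {
    "subject_path": "book",
    "id_attribute": "id",
    "properties": {"title": DC + "title", "author": DC + "creator"},
}


def extract_triples(xml_text, mapping, base):
    """Return (subject, predicate, literal) triples for every mapped element."""
    root = ET.fromstring(xml_text)
    triples = []
    for node in root.findall(mapping["subject_path"]):
        subject = base + node.get(mapping["id_attribute"])
        for child_name, predicate in mapping["properties"].items():
            for child in node.findall(child_name):
                triples.append((subject, predicate, (child.text or "").strip()))
    return triples


for triple in extract_triples(SOURCE, MAPPING, BASE):
    print(triple)
```

The source document is never modified, so the same file could be paired with different mappings for different applications. Like the XSLT approach, though, this still works at the syntax layer: the mapping is expressed against element names rather than against the data model behind them.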