There's been a lot of discussion about getting decent XML out of Microsoft Office documents, particularly from Word. However, I can't remember seeing anyone talk about extracting metadata from these kinds of documents.
As an exercise in further tinkering with Jena, and as a way for people to create more RDF data to embed in FOAF documents, it'd be interesting to attempt creating RDF metadata from the kinds of properties stored in these documents.
A simple command-line app should suffice, dumping the RDF document to standard output, although it'd be interesting to embed this into the FOAF-a-Matic Mark 2.
Turns out this is very easy to do. The Apache POI project provides a Java interface to OLE2 documents, including the Horrible Property Set Format. This provides direct access to the properties stored in a document. Should be simple to process a document and build a Jena Model from the property values.
| HPSF Property | DC Equivalent |
|---|---|
| Last Saved Time/Date | dc:modified |
| Last Saved By | dc:contributor? |
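As a rough illustration of the mapping in the table, here's a minimal sketch that turns a map of property values into N-Triples using only the Java standard library. It assumes the values have already been read out of the document; in a real tool they'd come from POI's HPSF classes and the statements would go into a Jena Model rather than a string. The class name `PropertyMapper`, the map keys, the example URI, and the sample values are all made up for illustration. It also assumes `dc` is bound to the DCMI Terms namespace, since that is where `modified` is defined.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PropertyMapper {
    // Assumption: "dc" is bound to the DCMI Terms namespace, which defines both terms below.
    static final String DC = "http://purl.org/dc/terms/";

    // HPSF property label (as written in the table) -> DC term local name.
    static final Map<String, String> MAPPING = new LinkedHashMap<>();
    static {
        MAPPING.put("Last Saved Time/Date", "modified");
        MAPPING.put("Last Saved By", "contributor"); // tentative, as in the table
    }

    // Escape a plain literal for N-Triples (backslash first, then quotes and line breaks).
    static String escape(String s) {
        return s.replace("\\", "\\\\")
                .replace("\"", "\\\"")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // Emit one triple per mapped property that is present; unmapped properties are ignored.
    static String toNTriples(String docUri, Map<String, String> props) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : MAPPING.entrySet()) {
            String value = props.get(e.getKey());
            if (value == null) continue;
            out.append('<').append(docUri).append("> <")
               .append(DC).append(e.getValue()).append("> \"")
               .append(escape(value)).append("\" .\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for what would be read from a real document.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Last Saved Time/Date", "2003-01-15T10:30:00Z");
        props.put("Last Saved By", "A. N. Other");
        System.out.print(toNTriples("http://example.org/report.doc", props));
    }
}
```

Keeping the mapping in a single table-like structure means new rows can be added without touching the serialisation code.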
One could manufacture the dc:format value based on the file type, e.g. Word or Excel.
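A minimal sketch of how the dc:format value might be derived, assuming the file extension is a good enough signal of the file type. The class name `FormatGuesser` and the choice to key on the extension are my own assumptions; the MIME types are the registered ones for the classic binary Office formats.

```java
import java.util.Locale;
import java.util.Map;

class FormatGuesser {
    // Extension -> MIME type for the OLE2-based Office formats.
    static final Map<String, String> MIME_BY_EXTENSION = Map.of(
        "doc", "application/msword",
        "xls", "application/vnd.ms-excel",
        "ppt", "application/vnd.ms-powerpoint");

    // Returns a candidate dc:format value, or null if the extension is missing or unknown.
    static String guessFormat(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) return null;
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.get(ext);
    }

    public static void main(String[] args) {
        System.out.println(guessFormat("Report.DOC"));
        System.out.println(guessFormat("budget.xls"));
    }
}
```

Returning null for anything unrecognised leaves it to the caller to decide whether to omit dc:format or fall back to something more generic.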
There's also a bunch of other potentially useful data, such as word, line, and page counts, the creating application, etc. It'd probably be useful to dump out as much as possible. I should check whether there are standard ontologies for describing documents in this much detail.
There are some other interesting possibilities. For example, it may be possible to use other features in POI to extract the text from a Word document. That text could then be scanned for hyperlinks to build up a list of references (dc:citation) to other texts on the web (if available).
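The link-harvesting step could look something like the sketch below. It assumes the document text has already been extracted into a plain string (the POI part isn't shown), and it uses a deliberately simple regular expression rather than a full URL parser. The class name `LinkExtractor` is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Simple pattern: http(s) URLs running up to whitespace or a common delimiter.
    static final Pattern URL = Pattern.compile("https?://[^\\s<>\"')\\]]+");

    // Returns the distinct links found in the text, in order of first appearance.
    static Set<String> extractLinks(String text) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Drop trailing sentence punctuation that the pattern picks up.
            links.add(m.group().replaceAll("[.,;:!?]+$", ""));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "See http://example.org/a.html, and (https://example.com/b).";
        for (String link : extractLinks(text)) {
            System.out.println(link);
        }
    }
}
```

Each link returned could then become the object of one reference statement about the document.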
The aim would be something equivalent to Norm Walsh's jpegrdf, i.e. a tool that allows extraction and querying of metadata from the command line.
You can search Google for Word (and other Office) documents. So in theory one could write a spider that used the Google API to search for all Word documents that contain a particular expression. The result set (or the first n results, anyway) could then be walked through, fetching each document (or the cached version from Google) in turn and extracting its metadata. The result would be a big bunch of document-related metadata that someone could have fun with.