Introduction

There's been a lot of discussion about getting decent XML out of Microsoft Office documents, particularly from Word. However, I can't remember seeing anyone talk about extracting metadata from these kinds of documents.

As an exercise in further tinkering with Jena, and as a way for people to create more RDF data to embed in FOAF documents, it'd be interesting to attempt creating RDF metadata from the kinds of properties stored in these documents.

A simple command-line app should suffice, dumping the RDF document to standard output, although it'd be interesting to embed this into the FOAF-a-Matic Mark 2.

Research Notes

Turns out this is very easy to do. The Apache POI project provides a Java interface to OLE2 documents, including the Horrible Property Set Format (HPSF). This provides direct access to the properties stored in a document. It should be simple to process a document and build a Jena Model from the property values.

See the "internals" document for a list of property ids and their meaning. The most interesting ones are likely to be the following (with mappings to DC terms):

HPSF Property           DC Equivalent
Title                   dc:title
Author                  dc:creator
Keywords                dc:subject
Subject                 dc:subject
Comments                dc:description
Create Time/Date        dc:created
Last Saved Time/Date    dc:modified
Last Saved By           dc:contributor?
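
As a rough sketch of how the mapping above might be applied, the following plain-Java class turns a map of property values into Turtle. It deliberately uses neither POI nor Jena: the class name HpsfToDc, the toTurtle method and the input map are all hypothetical, and the property values are assumed to have been read from the document already. One assumption worth flagging: created and modified are defined in the DCMI terms namespace rather than the 1.1 element set, so the sketch gives those two a separate dcterms prefix. The tentative dc:contributor mapping is kept as-is.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of POI or Jena.
class HpsfToDc {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String DCTERMS = "http://purl.org/dc/terms/";

    // HPSF property name -> prefixed DC term, following the table above.
    static final Map<String, String> TERMS = new LinkedHashMap<>();
    static {
        TERMS.put("Title", "dc:title");
        TERMS.put("Author", "dc:creator");
        TERMS.put("Keywords", "dc:subject");
        TERMS.put("Subject", "dc:subject");
        TERMS.put("Comments", "dc:description");
        TERMS.put("Create Time/Date", "dcterms:created");   // assumption: dcterms namespace
        TERMS.put("Last Saved Time/Date", "dcterms:modified"); // assumption: dcterms namespace
        TERMS.put("Last Saved By", "dc:contributor");       // tentative mapping
    }

    // Escape a value so it is safe inside a double-quoted Turtle literal.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"")
                .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t");
    }

    // Emit one triple per mapped, non-empty property; unmapped properties are skipped.
    // The document URI is written as given and is not validated.
    static String toTurtle(String documentUri, Map<String, String> properties) {
        StringBuilder out = new StringBuilder();
        out.append("@prefix dc: <").append(DC).append("> .\n");
        out.append("@prefix dcterms: <").append(DCTERMS).append("> .\n\n");
        for (Map.Entry<String, String> e : properties.entrySet()) {
            String term = TERMS.get(e.getKey());
            String value = e.getValue();
            if (term == null || value == null || value.isEmpty()) {
                continue;
            }
            out.append('<').append(documentUri).append("> ")
               .append(term).append(" \"").append(escape(value)).append("\" .\n");
        }
        return out.toString();
    }
}
```

In a real version the string building would presumably be replaced by adding statements to a Jena Model and letting Jena do the serialisation.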

One could manufacture the dc:format value based on the file type, e.g. Word or Excel.
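
A minimal sketch of the dc:format idea, assuming the file extension is a good enough guide to the file type and that a MIME type is an acceptable dc:format value. The class and method names are made up for illustration, and only the three classic binary Office formats are covered.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: guesses a dc:format value from a file name.
class OfficeFormat {
    // Registered MIME types for the binary (OLE2) Office formats.
    static final Map<String, String> MIME_BY_EXTENSION = Map.of(
            "doc", "application/msword",
            "xls", "application/vnd.ms-excel",
            "ppt", "application/vnd.ms-powerpoint");

    // Returns the MIME type for the file's extension, or null if the
    // name has no extension or the extension is not in the table.
    static String dcFormat(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.get(ext);
    }
}
```

So OfficeFormat.dcFormat("budget.xls") would yield application/vnd.ms-excel, and anything unrecognised simply gets no dc:format statement.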

There's also a bunch of other potentially useful data such as word, line, page counts, creating application, etc. It'd probably be useful to dump out as much as possible. Should check for standard ontologies for describing documents in this detail.

There are some other interesting possibilities. For example, it may be possible to use other features in POI to extract text from a Word document. This could be used to extract hyperlinks from within the document to build up a list of references (dc:citation) to other texts on the web (if available).
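
The link-harvesting step might look something like the sketch below. It assumes the document text has already been extracted into a String by some other means, and that the URLs appear literally in that text; links hidden behind display text would be missed. LinkExtractor is a hypothetical name and the regular expression is deliberately crude.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls http(s) URLs out of already-extracted text.
class LinkExtractor {
    // An http(s) URL runs up to the next whitespace, angle bracket or double quote.
    static final Pattern URL = Pattern.compile("https?://[^\\s<>\"]+");

    // Returns the distinct URLs in order of first appearance.
    static Set<String> extractLinks(String text) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            // Drop trailing punctuation that most likely belongs to the sentence.
            links.add(m.group().replaceAll("[.,;:!?)\\]]+$", ""));
        }
        return links;
    }
}
```

Each URL in the returned set could then become the object of one reference statement about the document.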

Application Ideas

Command-line App

Something equivalent to Norm Walsh's jpegrdf, i.e. allow extraction and querying of metadata from the command line.

Google Spider

You can search Google for Word (and other Office) documents. So in theory one could write a spider that used the Google API to perform a search for all Word documents that contain a particular expression. The result set (or the first n results anyway) could then be walked through, fetching each document (or the cached version from Google) in turn and extracting its metadata. The result would be a big bunch of document-related metadata that someone could have fun with.

This page (revision-1) was last changed on 13-Jun-2003 06:03 by unknown