Re: Normalize metadata to Dublin Core

Stephane Bastian Wed, 03 Dec 2008 00:38:04 -0800

Hi Jukka,

my 2 cents on this:

While this certainly sounds like a very good idea, it will be difficult
to settle on a single metadata format in Tika. Dublin Core is one of
several metadata formats available, and while it is certainly suitable
for some documents (Word, Excel, OpenDocument and such), it's not a
silver bullet. For instance, when it comes to images, audio and other
media, it is fairly limited, and we have almost no choice but to
describe the metadata in a format other than Dublin Core (for instance,
we could use something like this:
http://www.metadataworkinggroup.com/pdf/mwg_guidance.pdf).

What is important to me, though, is that Tika parsers should never
extract metadata using a key that doesn't belong to a known format, as
that makes the data difficult to use.
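
As a rough sketch of that rule in plain Java: a guard that only accepts
keys whose prefix names a known metadata schema. The class name
KnownSchemaKeys, the "prefix:name" key convention, and the list of
prefixes are all assumptions made for illustration; none of this is
existing Tika API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical guard, not an existing Tika class.
class KnownSchemaKeys {

    // Schema prefixes assumed for this example (Dublin Core, EXIF, XMP).
    private static final Set<String> KNOWN_PREFIXES =
            new HashSet<>(Arrays.asList("dc", "exif", "xmp"));

    // Returns true only for keys of the form "prefix:name" whose prefix
    // is one of the known schemas above.
    static boolean isKnown(String key) {
        int colon = key.indexOf(':');
        if (colon <= 0 || colon == key.length() - 1) {
            return false; // no prefix, empty prefix, or empty name
        }
        return KNOWN_PREFIXES.contains(key.substring(0, colon));
    }
}
```

A parser could run such a check before emitting a key, so that
something like "dc:title" passes while an ad hoc key like "myKey" is
rejected.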


BR,

Stephane Bastian

Jukka Zitting wrote:

Hi,

Currently Tika doesn't have any good guidelines on the semantics and
usage of metadata keys. Mostly we've just ended up with a few basic
keys like CONTENT_TYPE and a bunch of more or less inconsistently used
other keys. The result is that a client that currently wants to assign
any reasonable semantics to the extracted metadata needs to first
check the reported CONTENT_TYPE and use that to deduce the meanings of
the other available metadata keys based on documentation in [1].

This is not optimal. It should be up to the Tika parsers to interpret
the metadata available in the supported document types and map that as
well as possible to a single standard like Dublin Core. This way a
client only needs to know a single set of metadata semantics.

The parser can still make the raw underlying metadata available using
metadata keys that are specific to the actual metadata schema used in
the document type, but that should be considered an extra feature
beyond the normalized Dublin Core output.
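
A minimal sketch of this idea in plain Java, assuming metadata is held
as a simple string map. The class name DublinCoreNormalizer, the raw key
names, and the "dc:" prefix convention are hypothetical and chosen for
illustration; they are not taken from Tika's API.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika.
class DublinCoreNormalizer {

    // Illustrative mapping from raw, format-specific keys to Dublin Core
    // elements. The raw key names on the left are assumptions.
    private static final Map<String, String> RAW_TO_DC = new HashMap<>();
    static {
        RAW_TO_DC.put("Content-Type", "dc:format");
        RAW_TO_DC.put("Author", "dc:creator");
        RAW_TO_DC.put("Title", "dc:title");
        RAW_TO_DC.put("Last-Modified", "dc:date");
    }

    // Returns a new map holding every raw entry plus a Dublin Core entry
    // for each raw key that has a known mapping.
    static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>(raw); // raw keys stay available
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String dcKey = RAW_TO_DC.get(e.getKey());
            if (dcKey != null) {
                // Do not overwrite a Dublin Core value that is already present.
                out.putIfAbsent(dcKey, e.getValue());
            }
        }
        return out;
    }
}
```

With this shape, a client reads only the "dc:" keys and ignores the
rest, while the raw keys remain available to callers that understand
the underlying schema.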

One corollary of this is that we should replace the current HTTP-based
CONTENT_TYPE metadata key with the Dublin Core FORMAT.

WDYT?

[1] http://lucene.apache.org/tika/formats.html

BR,

Jukka Zitting
