On Feb 8, 2009, at 5:55 AM, Jukka Zitting wrote:
> Hi,
> On Sun, Feb 8, 2009 at 6:22 AM, Jonathan Koren
> <jonat...@soe.ucsc.edu> wrote:
>> The problem with all these metadata standards is that they're all
>> dumb in the sense that they duplicate effort.
> Agreed. So why would we want to duplicate the effort in Tika?
Because someone is going to be stuck doing it anyway. The only
question is whether it's going to be Tika, or the application using
Tika. Tika is in a better position than any single application
developer to know what the variety of formats is and how they
interrelate. Tika already does this for the barebones metadata of
image and audio files. Picking some externally developed standard
doesn't solve anything. All it does is purport to absolve Tika of
responsibility.
Say you want to export (because that's what we're really talking about
here) Dublin Core. MS Office doesn't support DC, it has its own
ontology. Not only do these ontologies not map one to one, they only
sort of share one concept: ms:author and dc:creator. The other
concepts simply don't exist. Sure you could perhaps cajole
ms:lastauthor into dc:contributor, or ms:lastsavedate to dc:modified,
but the vast majority of items simply have no counterpart in the other
ontology. Now whatever DC Tika would construct from the MS metadata
would be wrong by definition (since the ontologies are being abused)
or be so devoid of information, it might as well not even exist.
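
To make the loss concrete, here is a rough Java sketch of that export. It is not Tika code: the class name, the mapping table, and the ms:revision and ms:template keys are made up for illustration; only the three mapped pairs come from the discussion above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: key names and the mapping table are illustrative, not a Tika API.
final class LossyMappingSketch {

    // The MS Office -> Dublin Core pairs mentioned above. Only the first is a
    // reasonably clean match; the other two are the "cajoled" ones.
    static final Map<String, String> MS_TO_DC = new LinkedHashMap<>();
    static {
        MS_TO_DC.put("ms:author", "dc:creator");
        MS_TO_DC.put("ms:lastauthor", "dc:contributor");
        MS_TO_DC.put("ms:lastsavedate", "dc:modified");
    }

    // Copies every mappable entry into 'dc' and returns the source keys
    // that had no counterpart and were therefore dropped.
    static List<String> export(Map<String, String> ms, Map<String, String> dc) {
        List<String> dropped = new ArrayList<>();
        for (Map.Entry<String, String> e : ms.entrySet()) {
            String target = MS_TO_DC.get(e.getKey());
            if (target == null) {
                dropped.add(e.getKey());
            } else {
                dc.put(target, e.getValue());
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        Map<String, String> ms = new LinkedHashMap<>();
        ms.put("ms:author", "A. Writer");
        ms.put("ms:lastauthor", "B. Editor");
        ms.put("ms:revision", "17");      // hypothetical key, no DC counterpart
        ms.put("ms:template", "Normal");  // hypothetical key, no DC counterpart

        Map<String, String> dc = new LinkedHashMap<>();
        List<String> dropped = export(ms, dc);
        System.out.println("exported: " + dc);
        System.out.println("dropped:  " + dropped);
    }
}
```

Everything in the dropped list is information the application can no longer get at if the DC export is all Tika hands back.
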
Now let's say we're dealing with two other metadata formats. You've
got ID3v2 and you want to export XMP. XMP has xmpDM:artist, but
your ID3 information has conflicting id3:artist and id3:albumartist
tags. Which one do you map, and which one do you lose? More
importantly, how do you tell the user that you might have mapped the
wrong one? If you use a Tika namespace for the lowest common
denominator metadata, you have not only provided an answer to the
question "who's the artist?", but you've also told the user that the
answer might be wrong. This ability to express uncertainty simply
doesn't exist in any existing ontology, because each ontology believes
it's the One True Ontology, and that mappings from the inferior
ontologies to the One True Ontology exist for at least all cases that
anyone cares about.
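
A minimal sketch of what "an answer plus a warning" could look like, again in Java. The Answer type, the tika:artist key printed in main, and the preference order between the two ID3 tags are all assumptions of mine, not anything Tika defines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical types, not part of Tika.
final class UncertainArtistSketch {

    // A synthesized value, where it came from, and which tags disagreed with it.
    static final class Answer {
        final String value;
        final String sourceKey;
        final List<String> conflictingKeys;

        Answer(String value, String sourceKey, List<String> conflictingKeys) {
            this.value = value;
            this.sourceKey = sourceKey;
            this.conflictingKeys = Collections.unmodifiableList(conflictingKeys);
        }

        // True when another tag held a different value, so the choice may be wrong.
        boolean isUncertain() {
            return !conflictingKeys.isEmpty();
        }
    }

    // Picks the first candidate present; the order is an arbitrary assumption.
    // Returns null when neither tag is set.
    static Answer artist(Map<String, String> id3) {
        String[] candidates = {"id3:artist", "id3:albumartist"};
        String value = null;
        String source = null;
        List<String> conflicts = new ArrayList<>();
        for (String key : candidates) {
            String v = id3.get(key);
            if (v == null) {
                continue;
            }
            if (value == null) {
                value = v;
                source = key;
            } else if (!v.equals(value)) {
                conflicts.add(key);
            }
        }
        return value == null ? null : new Answer(value, source, conflicts);
    }

    public static void main(String[] args) {
        Map<String, String> id3 = new LinkedHashMap<>();
        id3.put("id3:artist", "Featured Singer");
        id3.put("id3:albumartist", "Various Artists");
        Answer a = artist(id3);
        System.out.println("tika:artist = " + a.value
                + " (from " + a.sourceKey + ", uncertain=" + a.isUncertain() + ")");
    }
}
```

A plain xmpDM:artist string has nowhere to carry the equivalent of isUncertain(), which is the point being made above.
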
I STRONGLY believe that you're going to have to store all the raw
metadata according to some set of Tika-blessed namespaces (e.g. dc,
id3, xmp, msoffice, exif, tiff) in order to allow application
developers to handle anything above the least common denominator of
the various metadata formats. No mapping among the ontologies exists
that is going to satisfy everyone in all cases, so why should Tika
keep users from making their own mappings if they really want to do
that? If you use an existing ontology, you're going to have to flag
that it's synthesized from other metadata, and thus is suspect.
Furthermore, you're going to have to be able to flag the synthesized
data on a per-key basis in order to avoid collisions between real and
synthetic metadata within the exported namespace.
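
Roughly what I have in mind, as a Java sketch. This is a hypothetical container, not Tika's existing Metadata class; the method names and the rule that a synthesized value never overwrites a real one are my assumptions about how the per-key flag would be used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a hypothetical store keyed by "namespace:name".
final class NamespacedMetadataSketch {

    // One stored value plus the per-key "this was synthesized" flag.
    private static final class Entry {
        final String value;
        final boolean synthesized;

        Entry(String value, boolean synthesized) {
            this.value = value;
            this.synthesized = synthesized;
        }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();

    // Stores a value read directly from the file, e.g. "msoffice:author".
    void putRaw(String key, String value) {
        entries.put(key, new Entry(value, false));
    }

    // Stores a value derived by mapping from another namespace. Refuses to
    // overwrite a raw value under the same key; returns whether it was stored.
    boolean putSynthesized(String key, String value) {
        Entry existing = entries.get(key);
        if (existing != null && !existing.synthesized) {
            return false;
        }
        entries.put(key, new Entry(value, true));
        return true;
    }

    String get(String key) {
        Entry e = entries.get(key);
        return e == null ? null : e.value;
    }

    boolean isSynthesized(String key) {
        Entry e = entries.get(key);
        return e != null && e.synthesized;
    }

    public static void main(String[] args) {
        NamespacedMetadataSketch md = new NamespacedMetadataSketch();
        md.putRaw("msoffice:author", "A. Writer");
        md.putRaw("msoffice:lastauthor", "B. Editor");
        md.putSynthesized("dc:creator", md.get("msoffice:author"));
        md.putSynthesized("dc:contributor", md.get("msoffice:lastauthor"));
        System.out.println("dc:creator = " + md.get("dc:creator")
                + ", synthesized = " + md.isSynthesized("dc:creator"));
    }
}
```

The raw msoffice: keys stay available for applications that want to build their own mapping, and anything Tika derived into dc: is marked as suspect key by key.
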
--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/