On Feb 8, 2009, at 10:59 AM, Jukka Zitting wrote:

Hi,

On Sun, Feb 8, 2009 at 4:57 PM, Jonathan Koren <jonat...@soe.ucsc.edu> wrote:
On Feb 8, 2009, at 5:55 AM, Jukka Zitting wrote:
On Sun, Feb 8, 2009 at 6:22 AM, Jonathan Koren <jonat...@soe.ucsc.edu> wrote:
The problem with all these metadata standards is that they're all dumb in
the sense that they duplicate effort.

Agreed. So why would we want to duplicate the effort in Tika?

Because someone is going to be stuck doing it anyway.

Why? The metadata keys I proposed are semantically equivalent to the
custom keys we use now. Why would someone need to specify custom keys
when standard alternatives for the exact same concepts already exist?
Note that I'm only proposing that we change the keys of the six
metadata entries I listed.

But why only those six? It certainly seems like an arbitrary list based on temporary convenience. You're not proposing to support all of XMP, just the bare minimum that you need this week. At some point you're going to want to add more metadata, and then you're going to have to deal with the ontology mismatch problem. By luck or design you've picked ones that do map 1-to-1 to some other ontology, but this doesn't hold across XMP and it doesn't scale across multiple ontologies, including the ontologies you're currently using. You haven't explained how you're going to solve the mismatch problem when the day comes that you want to add more metadata.

I don't understand what you do with the things that don't map 1-to-1 with XMP. Ignore them? That doesn't work because then you're arbitrarily dictating what kinds of problems the user can solve. Map them to some other space? That doesn't work either because then if the user wants to grab all the metadata from the foo space the user will have to know that foo:one gets mapped to bar:uno, foo:two gets mapped to baz:cinco, and foo:three doesn't get mapped. It's unreasonable to force such an ugly hack on all users just because it was easier to do this for one person once.

I have a concrete use case where doing this would be beneficial: My
employer is building a digital asset management application where we
plan to leverage XMP for metadata handling. Rather than explicitly
mapping each individual Tika metadata key to equivalent XMP entries,
it would be much easier and clearer to just map the "tiff" and "xmlDM"
prefixes to appropriate XMP namespaces when importing Tika metadata.
We also wouldn't need to keep updating the metadata mappings whenever
new Tika versions start supporting new keys.

I understand that you don't want to keep updating your own code every time Tika changes, but as you said, this is a 0.x release, so you're going to be stuck doing that for a while. What I don't understand is why a publicly available library is the appropriate place to naively hardcode the requirements of your current project.

Is there some better way for us to implement this use case?

Yes. Tika does no translation between ontologies. It simply dumps all metadata detected for a file into its own namespace. This means that an MS Office file gets an MS namespace. Something with XMP gets an XMP namespace. ID3 tags go into the ID3 namespace. Tika does no mapping among the types by default. You create a new class that takes the raw key-value pairs that are stored in Tika::Metadata and translates them to something else. Call it Metadata2XMP or whatever. That can be packaged within Tika as a convenience class that does least common denominator (LCD) mapping in a well-defined way. By breaking the mapping out to a class separate from Metadata, you avoid spreading a single metadata namespace across 15 namespaces, and you make all mapping 100% reversible (well, in this case, ignorable), since inevitably some mappings will be wrong in some cases. If all a user wants is LCD metadata, they can get it through a common XMP namespace.
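
As a rough sketch of what I mean (nothing here is existing Tika API): a plain Map stands in for the raw key-value pairs held by Metadata, and the "msoffice:" and "id3:" prefixes and the mapping table are made up for illustration. Only the idea matters: the raw metadata stays in its source namespace, and the translation lives in a separate class that hands back a new map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical convenience class; the name comes from the suggestion above,
// it is not an existing Tika class.
public class Metadata2XMP {

    // Illustrative mapping table from raw, source-namespaced keys to
    // XMP-style keys. The source keys are assumptions, not real Tika keys.
    private final Map<String, String> keyMap = new LinkedHashMap<>();

    public Metadata2XMP() {
        keyMap.put("msoffice:Author", "dc:creator");
        keyMap.put("msoffice:Title", "dc:title");
        keyMap.put("id3:artist", "dc:creator");
        keyMap.put("id3:title", "dc:title");
    }

    // Builds a new map holding only the entries that have a known mapping.
    // The raw map is never modified, so a caller who disagrees with a
    // mapping can simply ignore the result and read the raw keys instead.
    public Map<String, String> translate(Map<String, String> raw) {
        Map<String, String> xmp = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = keyMap.get(entry.getKey());
            if (target != null) {
                xmp.put(target, entry.getValue());
            }
        }
        return xmp;
    }
}
```

A caller who only wants the LCD view calls translate() on the raw pairs. Anyone who needs a key with no mapping (say, a hypothetical "id3:track") still finds it untouched in the raw metadata.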


--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/

