On Feb 8, 2009, at 10:59 AM, Jukka Zitting wrote:
Hi,
On Sun, Feb 8, 2009 at 4:57 PM, Jonathan Koren
<jonat...@soe.ucsc.edu> wrote:
On Feb 8, 2009, at 5:55 AM, Jukka Zitting wrote:
On Sun, Feb 8, 2009 at 6:22 AM, Jonathan Koren
<jonat...@soe.ucsc.edu> wrote:
The problem with all these metadata standards is that they're all
dumb in the sense that they duplicate effort.
Agreed. So why would we want to duplicate the effort in Tika?
Because someone is going to be stuck doing it anyway.
Why? The metadata keys I proposed are semantically equivalent to the
custom keys we use now. Why would someone need to specify custom keys
when standard alternatives for the exact same concepts already exist?
Note that I'm only proposing that we change the keys of the six
metadata entries I listed.
But why only those six? It certainly seems like an arbitrary list
based on temporary convenience. You're not proposing to support all
of XMP, just the bare minimum that you need this week. At some point
you're going to want to add more metadata and then you're going
to have to deal with the ontology mismatch problem. By luck or design
you've picked ones that do map 1-to-1 to some other ontology, but this
doesn't hold across XMP and it doesn't scale across multiple
ontologies, including the ontologies you're currently using. When the
day comes that you want to add more metadata, you haven't explained
how you're going to solve the mismatch problem.
I don't understand what you do with the things that don't map 1-to-1
with XMP. Ignore them? That doesn't work because then you're
arbitrarily dictating what kinds of problems the user can solve. Map
them to some other space? That doesn't work either because then if
the user wants to grab all the metadata from the foo space the user
will have to know that foo:one gets mapped to bar:uno, foo:two gets
mapped to baz:cinco, and foo:three doesn't get mapped. It's
unreasonable to force such an ugly hack on all users just because it
was easier to do this for one person once.
I have a concrete use case where doing this would be beneficial: My
employer is building a digital asset management application where we
plan to leverage XMP for metadata handling. Rather than explicitly
mapping each individual Tika metadata key to equivalent XMP entries,
it would be much easier and clearer to just map the "tiff" and "xmlDM"
prefixes to appropriate XMP namespaces when importing Tika metadata.
We also wouldn't need to keep updating the metadata mappings whenever
new Tika versions start supporting new keys.
I understand that you don't want to keep updating your own code every
time Tika changes, but as you said, this is a 0.x release, so you're
going to be stuck doing that for a while. What I don't understand is
why a publicly available library is the appropriate place to hardcode
the requirements of your current project.
Is there some better way for us to implement this use case?
Yes. Tika does no translation between ontologies. It simply dumps
all metadata detected for a file into its own namespace. This means
that an MS Office file gets an MS namespace. Something with XMP gets
an XMP namespace. ID3 tags go into the ID3 namespace. Tika does no
mapping among the types by default. You create a new class that takes
the raw key-value pairs that are stored in Tika::Metadata and translates
them to something else. Call it Metadata2XMP or whatever. That can
be packaged within Tika as a convenient class that does least common
denominator mapping in a well defined way. By breaking the mapping
out to a class separate from Metadata, you avoid spreading a single
metadata namespace across 15 namespaces, and you make all mapping 100%
reversible (well, in this case, ignorable), since inevitably some will
be wrong in some case. If all a user wants is LCD metadata, they can
get it through a common XMP namespace.
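
A minimal sketch of that separation, in Java. Everything here is an
assumption for illustration: the class name Metadata2Xmp, the plain Map
standing in for Tika's Metadata object, and every key in the mapping
table. None of it is taken from Tika's actual API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical mapper, not part of Tika. It reads raw, source-namespaced
 * key-value pairs and returns a separate least-common-denominator view.
 */
final class Metadata2Xmp {
    // Illustrative mapping table only; the keys on both sides are assumptions.
    private static final Map<String, String> RAW_TO_XMP = Map.of(
            "msoffice:Author", "dc:creator",
            "id3:artist", "dc:creator",
            "msoffice:Title", "dc:title",
            "id3:title", "dc:title");

    private Metadata2Xmp() {}

    /** Returns a new map; the raw metadata passed in is never modified. */
    static Map<String, String> translate(Map<String, String> raw) {
        Map<String, String> xmp = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = RAW_TO_XMP.get(entry.getKey());
            if (target != null) {
                // First value wins if two raw keys map to the same target.
                xmp.putIfAbsent(target, entry.getValue());
            }
            // Unmapped keys are skipped here but stay available in raw.
        }
        return xmp;
    }
}
```

Because translate() only builds a new map, a user who disagrees with a
mapping can ignore the result and read the untouched raw pairs instead.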
--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/