On Feb 7, 2009, at 11:32 AM, Jukka Zitting wrote:
> The current image and audio parsers use hardcoded strings like
> "width", "height", "encoding" and "samplerate" for extracted metadata.
> The semantics of these metadata keys are not documented anywhere, and
> little thought has been given to interoperability with external metadata
> applications. To improve things I'd like to replace these custom
> metadata keys with keys defined in part 2 of the XMP specification
> [1].
> More specifically, I'd like to start using the following keys for
> image and audio metadata:
> * "tiff:ImageWidth" instead of "width"
> * "tiff:ImageHeight" instead of "height"
> * "xmpDM:audioCompressor" instead of "encoding"
> * "xmpDM:audioSampleRate" instead of "samplerate"
> * "xmpDM:audioSampleType" instead of "bits"
> * "xmpDM:audioChannelType" instead of "channels"
Why would you want to use a tag that implies that the underlying data
is TIFF when it isn't (e.g. JPEG)? That strikes me as a REALLY Bad
Idea(tm). The reason why Adobe put this out and is using TIFF tags is
because they target Photoshop at professional photographers who take
12 megapixel shots and store them as uncompressed TIFFs. It's the
path of least resistance for them, since they already support TIFF
tags. Correctness isn't even fourth on their list of priorities. If
this was from Apple, they'd be talking about iPhoto, and so you would
have gotten jpg:wdth, because the average consumer takes JPEGs. This
isn't even really a spec as much as it's Adobe saying, "This is what
we're already doing and we're not changing. If you want to play,
these are the rules. Deal with it." While appropriate for
interoperability with Adobe Creative Suite, this isn't really for general
use.
The problem with all these metadata standards is that they're all dumb
in the sense that they duplicate effort. What is the
philosophical difference between xmpDM:artist,
tiff:Artist, and dc:creator? These examples were culled from Adobe's
XMP "spec" you linked to. Throw in id3:artist, pdf:author, and
literally countless others, and you can begin to appreciate the sheer
number of metadata tags that mean "person or organization from which
this artifact originates."[*]
You're already converting metadata from one ontology to another,
whether you realize it or not, each one of which has its own biases
and shortcomings. Currently you're converting from whatever metadata
ontology the file has to Tika's implicit ontology. I consider this a
Good Thing(tm). As a developer I shouldn't have to know what esoteric
keys are used to store what metadata in whatever specific file I'm
reading, any more than I have to know how to get the text out of the
file. Tika handles that for me, and that's why I like it. It's
someone else's problem.
Metadata ontologies are already such a mess, because of historical,
not-invented-here, and I-know-better-than-everyone-else reasons.
Fundamentally, they're just key-value pairs, so who cares? Just wrap
whatever key-value pairs are detected with some namespace thing
to avoid name collisions, and copy the metadata to some generic Tika
ontology. That way the user has a common interface to whatever
metadata he/she wants, but at the same time has access to the raw
metadata if need be. Even if you ended up duplicating all the
metadata, we're dealing with what? 20 keys? It's trivial.
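To make that concrete, here is a rough sketch of the wrap-and-copy idea.
It is not Tika's API. The class name, the merge() method and the generic
"creator" key are all made up for illustration, and the mapping table
only lists the artist/author keys mentioned above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of Tika: keep every raw key under its
// namespace and also copy known keys to one generic key.
class NamespacedMetadata {
    // Assumed mapping from format-specific keys to a generic key.
    // The generic name "creator" is an illustration, not a Tika constant.
    static final Map<String, String> GENERIC_KEYS = Map.of(
            "xmpDM:artist", "creator",
            "tiff:Artist", "creator",
            "dc:creator", "creator",
            "id3:artist", "creator",
            "pdf:author", "creator");

    // Prefix each raw key with its namespace to avoid collisions, then
    // add the generic key when the namespaced key has a known mapping.
    static Map<String, String> merge(String namespace, Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            String rawKey = namespace + ":" + e.getKey();
            out.put(rawKey, e.getValue());
            String generic = GENERIC_KEYS.get(rawKey);
            if (generic != null) {
                // First value wins if several raw keys map to one generic key.
                out.putIfAbsent(generic, e.getValue());
            }
        }
        return out;
    }
}
```

Calling NamespacedMetadata.merge("id3", Map.of("artist", "Some Band"))
would give back both "id3:artist" and "creator", so the user gets the
common key and still has the raw one.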
Sympathizing with the Universalist camp, I say there's no reason why
you can't combine metadata from a variety of ontologies, and then have
the values interpreted appropriately according to whatever document
type the user is interrogating. Say we're dealing with the concept of
"length". This represents a variety of concepts, but typically either
a spatial or temporal measurement. No one is going to interpret the
"length" tag for an audio file as being meters, and if they do,
they're dumb.
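As a sketch of what "interpreted appropriately" could look like, assuming
a made-up DocumentKind enum and a two-way split into temporal and spatial
(neither of which exists in Tika):

```java
// Hypothetical sketch: the same "length" key is read differently
// depending on what kind of document it came from.
class LengthInterpreter {
    enum DocumentKind { AUDIO, VIDEO, IMAGE }   // assumed document types
    enum Dimension { TEMPORAL, SPATIAL }        // assumed interpretations

    // Decide which kind of measurement "length" denotes for a document.
    static Dimension dimensionOf(DocumentKind kind) {
        switch (kind) {
            case AUDIO:
            case VIDEO:
                return Dimension.TEMPORAL;  // a duration
            case IMAGE:
                return Dimension.SPATIAL;   // an extent
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    // Attach the interpretation to a raw "length" value for display.
    static String describe(DocumentKind kind, String value) {
        return "length=" + value + " (" + dimensionOf(kind) + ")";
    }
}
```

The key stays the same everywhere. Only the reading of the value changes
with the document type.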
In summary, my objections are:
1. XMP is lazily written.
2. XMP was never intended to solve the problem at hand.
3. There needs to be a clean interface. A hodgepodge of competing or
at best quasi-interoperable standards isn't clean.
4. They're just key-value pairs. It doesn't cost anything to add
more, so just add everything.
----
[*] I can't help but think that this touches on the Problem of
Universals, which has been around for about 2400 years.
--
Jonathan Koren
jonat...@soe.ucsc.edu
http://www.soe.ucsc.edu/~jonathan/