Hi Jukka,

On 12/2/08 4:34 PM, "Jukka Zitting" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].

This is really only true of subclasses of o.a.t.parser.CompositeParser.
Nothing enforces that this be the case. In fact, on the contrary, it's
possible to write another o.a.t.parser.Parser implementation that has
entirely different semantics. Sure, by doing this you don't really take
advantage of tika-config.xml and its associated auto-goodness, but that's
the whole point of making Parser an interface: to allow folks to adhere to
the lowest-common-denominator standard.

>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.

I'm not sure how these two things are related: the fact that
CompositeParsers use the CONTENT_TYPE metadata to determine which
underlying CompositeParser subclass to call, and the metadata standard
adhered to by the underlying parser.

It seems like you are suggesting that o.a.t.parser.Parsers should declare
which metadata semantics and standard vocabulary (or vocabularies) they
adhere to, so that you know, e.g., whether you can pipeline different
parsers together and take advantage of their output.

This is an interesting proposition because if we go down this path, we are
now starting to get into the realm of data flow dependencies, and then we
have to start thinking about parsing workflows and how Tika can support
them. I think declaring things like required InputMetadata, and declaring
provided output metadata semantics and vocabularies, would be a very
interesting and useful contribution to Tika.
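
To make the idea concrete, here is a minimal Java sketch of what such a
declaration might look like. Nothing in it is existing Tika API:
DeclaringParser, requiredInputKeys(), providedOutputKeys(),
outputVocabularies(), declare() and canPipeline() are all hypothetical names
made up for illustration, and the check shown is only the simplest possible
data flow test.

```java
import java.util.Set;

// Hypothetical sketch; none of these types exist in Tika.
final class PipelineSketch {

    /** A parser that declares the metadata it needs and the metadata it emits. */
    interface DeclaringParser {
        /** Metadata keys that must be present before this parser runs. */
        Set<String> requiredInputKeys();

        /** Metadata keys this parser promises to produce. */
        Set<String> providedOutputKeys();

        /** Vocabularies (e.g. "dublin-core") the output keys are drawn from. */
        Set<String> outputVocabularies();
    }

    /** True if every key 'next' requires is provided by 'first'. */
    static boolean canPipeline(DeclaringParser first, DeclaringParser next) {
        return first.providedOutputKeys().containsAll(next.requiredInputKeys());
    }

    /** Small helper that builds a declaration for the example below. */
    static DeclaringParser declare(Set<String> in, Set<String> out, Set<String> vocab) {
        return new DeclaringParser() {
            public Set<String> requiredInputKeys() { return in; }
            public Set<String> providedOutputKeys() { return out; }
            public Set<String> outputVocabularies() { return vocab; }
        };
    }

    public static void main(String[] args) {
        // A type detector needs nothing and provides "format".
        DeclaringParser detector =
            declare(Set.of(), Set.of("format"), Set.of("dublin-core"));
        // A document parser needs "format" and provides "title" and "creator".
        DeclaringParser document =
            declare(Set.of("format"), Set.of("title", "creator"), Set.of("dublin-core"));

        System.out.println(canPipeline(detector, document)); // detector feeds document
        System.out.println(canPipeline(document, document)); // document lacks "format"
    }
}
```

A real design would also have to compare the declared vocabularies, not just
the key names, before deciding that two parsers can be chained.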

However, I want to point out that this seems entirely independent of the
CompositeParser.

>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.

To me, while adhering to Dublin Core is great and provides standardization,
we shouldn't enforce Dublin Core as the _only_ output metadata vocabulary.
In fact, as noted above, we should have several output metadata
vocabularies, and perhaps have all the o.a.t.parser.Parsers declare them.
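
As a rough illustration of what "several output vocabularies" could mean in
practice, the sketch below stores one extracted value under a key from each
vocabulary that has a mapping for it. The class name, the emit() method and
the "dc" / "http" prefixes are assumptions for the example, not Tika API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the class and key names are illustrative, not Tika API.
final class MultiVocabularySketch {

    /**
     * Stores one extracted value under a key in each vocabulary.
     * keyPerVocabulary maps a vocabulary prefix to that vocabulary's key name.
     */
    static Map<String, String> emit(String value, Map<String, String> keyPerVocabulary) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : keyPerVocabulary.entrySet()) {
            // Qualify the key with its vocabulary so a client can pick
            // whichever vocabulary it understands.
            metadata.put(e.getKey() + ":" + e.getValue(), value);
        }
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> keys = new LinkedHashMap<>();
        keys.put("dc", "format");          // Dublin Core element
        keys.put("http", "Content-Type");  // HTTP header name
        System.out.println(emit("application/pdf", keys));
    }
}
```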

>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.

In what context? Could you be more specific?

Thanks,
 Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

