On Wed, Dec 3, 2008 at 12:34 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].
>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.
>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.
>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
>
> WDYT?

like the idea :-)

but it gets more interesting once you move away from the basics

there are lots of good ways in which CONTENT_TYPE could be represented, e.g.
http://www.w3.org/Protocols/rfc2616/rfc2616.html#content-type or
http://dbpedia.org/page/Content-Type or
http://dublincore.org/2008/01/14/dcelements.rdf#format. the most
precise meaning is http://lucene.apache.org/tika/content_type. the
rest are just synonyms, some more subjective than others.
different users may prefer different choices.

this suggests - to me at least - that some minimal support for
deductive ontologies would be useful (in the same way that the
namespacing gives minimal support for RDF). for example, a user may
ask for http://dublincore.org/2008/01/14/dcelements.rdf#format. that
meta-data property may be absent, while
http://lucene.apache.org/tika/content_type is present and is a
subclass of http://dublincore.org/2008/01/14/dcelements.rdf#format.
so that value is returned.
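
a minimal sketch of that fallback, in python rather than java just to
keep it short. every name here (SPECIALISES, is_specialisation_of,
lookup) is hypothetical and not part of any Tika API, and it assumes
each property refines at most one more general property:

```python
# Hypothetical sketch: none of these names are Tika APIs.
DC_FORMAT = "http://dublincore.org/2008/01/14/dcelements.rdf#format"
TIKA_CONTENT_TYPE = "http://lucene.apache.org/tika/content_type"

# Maps a more specific property to the more general one it refines.
# Assumption: a single parent per property is enough for this sketch.
SPECIALISES = {TIKA_CONTENT_TYPE: DC_FORMAT}


def is_specialisation_of(prop, ancestor):
    """Return True if prop equals ancestor or refines it, directly or transitively."""
    seen = set()
    while prop is not None and prop not in seen:
        if prop == ancestor:
            return True
        seen.add(prop)
        prop = SPECIALISES.get(prop)
    return False


def lookup(metadata, requested):
    """Return the value under `requested`, else under any property that refines it."""
    if requested in metadata:
        return metadata[requested]
    for key, value in metadata.items():
        if is_specialisation_of(key, requested):
            return value
    return None


# Only the Tika-specific key was extracted, but the user asks for dc format.
extracted = {TIKA_CONTENT_TYPE: "application/pdf"}
print(lookup(extracted, DC_FORMAT))  # prints: application/pdf
```

the deduction only runs one way: asking for the more specific
property when only the general one is present returns nothing, since
the general value may not carry the precise meaning.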

- robert
