On Wed, Dec 3, 2008 at 12:34 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].
>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.
>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.
>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.
>
> WDYT?

like the idea :-)

but it gets more interesting once you move away from the basics. there are lots of good ways in which CONTENT_TYPE could be represented, e.g.

  http://www.w3.org/Protocols/rfc2616/rfc2616.html#content-type
  http://dbpedia.org/page/Content-Type
  http://dublincore.org/2008/01/14/dcelements.rdf#format

the most precise meaning is http://lucene.apache.org/tika/content_type. the rest are just synonyms, some more subjective than others, and different users may prefer different choices.

this suggests - to me at least - that some minimal support for deductive ontologies would be useful (in the same way that the namespacing gives minimal support for RDF).

for example, a user may ask for http://dublincore.org/2008/01/14/dcelements.rdf#format. this metadata property may be absent, but http://lucene.apache.org/tika/content_type is present and is a subclass of http://dublincore.org/2008/01/14/dcelements.rdf#format, so that value is returned.

- robert
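
a rough sketch of the fallback lookup described above, in python only to keep it short. everything here is an assumption made for illustration: `PARENT_OF`, `is_specialisation` and `lookup` are hypothetical names, not part of any Tika API, and the single parent/child relation is just the one from the example (RDFS would express it with rdfs:subPropertyOf).

```python
# Hypothetical sketch: none of these names exist in Tika.
DC_FORMAT = "http://dublincore.org/2008/01/14/dcelements.rdf#format"
TIKA_CONTENT_TYPE = "http://lucene.apache.org/tika/content_type"

# Assumed relation: each key maps to the more general key it specialises.
PARENT_OF = {TIKA_CONTENT_TYPE: DC_FORMAT}


def is_specialisation(key, ancestor):
    """Return True if key is ancestor or reaches it by following PARENT_OF."""
    seen = set()  # guards against accidental cycles in the relation
    while key is not None and key not in seen:
        if key == ancestor:
            return True
        seen.add(key)
        key = PARENT_OF.get(key)
    return False


def lookup(metadata, requested):
    """Return the value stored under requested, else under a more specific key."""
    if requested in metadata:
        return metadata[requested]
    for key, value in metadata.items():
        if is_specialisation(key, requested):
            return value
    return None


# The parser only reported the Tika-specific key.
extracted = {TIKA_CONTENT_TYPE: "application/pdf"}
print(lookup(extracted, DC_FORMAT))
```

the deduction only runs from the specific key to the general one: asking for the Tika key when only the Dublin Core key is present returns nothing, since a general value need not satisfy the narrower meaning.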