Hi,

On Wed, Dec 3, 2008 at 9:32 AM, Robert Burrell Donkin <[EMAIL PROTECTED]> wrote:
> there are lots of good ways which CONTENT_TYPE could be represented eg
> http://www.w3.org/Protocols/rfc2616/rfc2616.html#content-type or
> http://dbpedia.org/page/Content-Type or
> http://dublincore.org/2008/01/14/dcelements.rdf#format. the most
> precise meaning is http://lucene.apache.org/tika/content_type. the
> rest are just synonyms, and some more subjective than others.
> different users may prefer different choices.

Yeah, been there, done that. :-) Getting your head around all the
semantic details of different metadata schemas and making your content
consistently use one of them is major work, and I'd rather do as much
of that in Tika as possible so I won't need to reimplement it in each
client application.

My proposal is that we choose one widely used metadata schema as the
standard in Tika and try to use it as consistently as possible in all
our parsers. Even with its limitations, Dublin Core seems like the best
alternative for us to use.

> this suggests - to me at least - that some minimal support would be
> useful for deductive ontologies. (in the same way, the namespacing
> gives minimal support for RDF.) for example, a user may ask for
> http://dublincore.org/2008/01/14/dcelements.rdf#format but this
> meta-data property may be absent but
> http://lucene.apache.org/tika/content_type is present, and is a
> subclass of http://dublincore.org/2008/01/14/dcelements.rdf#format .
> so, that value is returned.

There be dragons down that path...

BR,

Jukka Zitting
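
For reference, the fallback lookup described in the quoted text above
could be sketched roughly as follows. This is only an illustration
under stated assumptions: the class MetadataFallback, the
SUB_PROPERTIES table and the lookup() method are made up for this
example and are not part of Tika, and the relation between the two
property URIs is taken from the quoted text rather than from any
published schema. It handles a single level of fallback and nothing
transitive.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names exist in Tika.
final class MetadataFallback {
    static final String DC_FORMAT =
            "http://dublincore.org/2008/01/14/dcelements.rdf#format";
    static final String TIKA_CONTENT_TYPE =
            "http://lucene.apache.org/tika/content_type";

    // Assumed relation: requested property -> more specific properties
    // whose values may stand in when the requested one is absent.
    static final Map<String, List<String>> SUB_PROPERTIES =
            Map.of(DC_FORMAT, List.of(TIKA_CONTENT_TYPE));

    // Returns the value stored under the requested property, or else the
    // value of the first declared sub-property that is present.
    static Optional<String> lookup(Map<String, String> metadata, String property) {
        String direct = metadata.get(property);
        if (direct != null) {
            return Optional.of(direct);
        }
        for (String sub : SUB_PROPERTIES.getOrDefault(property, List.of())) {
            String value = metadata.get(sub);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }
}
```

Even this toy version has to decide which sub-property wins when
several are present and whether the fallback is transitive, which
hints at where the dragons live.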