On Wed, Dec 3, 2008 at 11:33 AM, Jukka Zitting <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On Wed, Dec 3, 2008 at 9:32 AM, Robert Burrell Donkin
> <[EMAIL PROTECTED]> wrote:
>> there are lots of good ways which CONTENT_TYPE could be represented eg
>> http://www.w3.org/Protocols/rfc2616/rfc2616.html#content-type or
>> http://dbpedia.org/page/Content-Type or
>> http://dublincore.org/2008/01/14/dcelements.rdf#format. the most
>> precise meaning is http://lucene.apache.org/tika/content_type. the
>> rest are just synonyms, and some more subjective than others.
>> different users may prefer different choices.
>
> Yeah, been there done that. :-) Getting your head around all the
> semantic details of different metadata schemas and making your content
> consistently use one of them is major work, and I'd rather do as much
> of that in Tika as possible so I won't need to reimplement it in each
> client application.
>
> My proposal is that we choose one widely used metadata schema as the
> standard in Tika and try to use it as consistently as possible in all
> our parsers. Even with its limitations, Dublin Core seems like the
> best alternative for us to use.

one of the problems i found with dublin core is that it's important to
adhere to its definitions, which are often quite specific (suited mostly
to librarians) yet also wide enough to be hard to interpret.
content-type has a good definition in HTTP but DC format could contain
just about anything, which makes it hard to parse. content-type is a
subclass of format.

dublin core is also limited in its expressiveness: simile and other
people tend to prefer DBPedia, which is wider and allows more precision.

>> this suggests - to me at least - that some minimal support would be
>> useful for deductive ontologies. (in the same way, the namespacing
>> gives minimal support for RDF.) for example, a user may ask for
>> http://dublincore.org/2008/01/14/dcelements.rdf#format. that
>> meta-data property may be absent while
>> http://lucene.apache.org/tika/content_type is present, and is a
>> subclass of http://dublincore.org/2008/01/14/dcelements.rdf#format .
>> in that case, the content_type value is returned.
>
> There be dragons down that path...

yep :-)

<flame-proof-boots>
should be simple enough to support minimal subclassing eg
tika:content-type -> dc:format

i found that (in RAT) coding information like this in java turned out
to be a bad idea. probably a text configuration would be better with a
canonical version shipped with the software. people can then easily
contribute new mappings back.
</flame-proof-boots>
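
to make the idea concrete, here is a rough java sketch of the fallback lookup driven by a text mapping. everything in it is an assumption for illustration only: the class name SubPropertyLookup, the "child -> parent" line format, and the plain Map standing in for the metadata are hypothetical, not existing Tika or RAT API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: resolves a requested property
// by falling back to any present sub-property declared in a text mapping.
class SubPropertyLookup {
    // child property -> parent property, e.g. tika:content-type -> dc:format
    private final Map<String, String> parentOf = new HashMap<>();

    /** Parses lines of the form "child -> parent"; blank lines and '#' comments are skipped. */
    public SubPropertyLookup(String config) {
        for (String line : config.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] parts = trimmed.split("->");
            if (parts.length != 2) {
                throw new IllegalArgumentException("Bad mapping: " + line);
            }
            parentOf.put(parts[0].trim(), parts[1].trim());
        }
    }

    /** True if candidate equals wanted or reaches it by following parent links. */
    private boolean isSameOrSubPropertyOf(String candidate, String wanted) {
        String current = candidate;
        // Bound the walk so a cyclic configuration cannot loop forever.
        for (int i = 0; current != null && i <= parentOf.size(); i++) {
            if (current.equals(wanted)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    /** Returns the exact value if present, else the value of a present sub-property, else null. */
    public String get(Map<String, String> metadata, String wanted) {
        String exact = metadata.get(wanted);
        if (exact != null) {
            return exact;
        }
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            if (isSameOrSubPropertyOf(entry.getKey(), wanted)) {
                return entry.getValue();
            }
        }
        return null;
    }
}
```

with a mapping text containing the single line "tika:content-type -> dc:format", asking for dc:format on metadata that only carries tika:content-type would return the content type value. keeping the mappings in text rather than java means the canonical file can ship with the software and new mappings can be contributed without touching code.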

- robert
