Hi Jukka, my 2 cents on this:
While this certainly sounds like a very good idea, it will be difficult to settle on a single metadata format in Tika. Dublin Core is one of several metadata formats available, and while it is certainly suitable for some documents (Word, Excel, OpenDocument and such), it's not a silver bullet. For instance, when it comes to images, audio and other media, it is fairly limited, and we have almost no choice but to describe the metadata in a format other than Dublin Core (for instance, we could use something like this: http://www.metadataworkinggroup.com/pdf/mwg_guidance.pdf).
What is important to me, though, is that Tika parsers should never extract metadata using a key that doesn't belong to a known format, as that makes the data difficult to use.
BR,
Stephane Bastian

Jukka Zitting wrote:
Hi,

Currently Tika doesn't have any good guidelines on the semantics and usage of metadata keys. Mostly we've just ended up with a few basic keys like CONTENT_TYPE and a bunch of more or less inconsistently used other keys.

The result is that a client that currently wants to assign any reasonable semantics to the extracted metadata needs to first check the reported CONTENT_TYPE and use that to deduce the meanings of the other available metadata keys based on documentation in [1].

This is not optimal. It should be up to the Tika parsers to interpret the metadata available in the supported document types and map that as well as possible to a single standard like Dublin Core. This way a client only needs to know a single set of metadata semantics.

The parser can still make the raw underlying metadata available using metadata keys that are specific to the actual metadata schema used in the document type, but that should be considered an extra feature beyond the normalized Dublin Core output.

One corollary of this is that we should replace the current HTTP-based CONTENT_TYPE metadata key with the Dublin Core FORMAT.

WDYT?

[1] http://lucene.apache.org/tika/formats.html

BR,
Jukka Zitting
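
To make the proposal above concrete, here is a minimal Java sketch of the normalization step: format-specific keys are mapped to Dublin Core element names, and the raw keys stay available alongside them. Everything in it is an assumption for illustration only. The class name DublinCoreNormalizer, the "dc:" prefix, and the sample raw keys ("Author", "Content-Type", and so on) are hypothetical and are not Tika's actual API or key set.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not part of Tika: normalizes raw metadata keys to Dublin Core. */
final class DublinCoreNormalizer {

    // Illustrative mapping from format-specific keys to Dublin Core element names.
    // The raw key names on the left are assumptions chosen for this example.
    private static final Map<String, String> TO_DUBLIN_CORE = Map.of(
            "Content-Type", "format",
            "Author", "creator",
            "Last-Modified", "date",
            "Keywords", "subject");

    private DublinCoreNormalizer() {
    }

    /**
     * Returns a new map holding every raw entry unchanged, plus a "dc:"-prefixed
     * entry for each raw key that has a known Dublin Core equivalent.
     */
    static Map<String, String> normalize(Map<String, String> raw) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String element = TO_DUBLIN_CORE.get(entry.getKey());
            if (element != null) {
                // The first raw key mapped to a given element wins.
                out.putIfAbsent("dc:" + element, entry.getValue());
            }
            // Keep the raw key too, as the "extra feature" beyond the normalized output.
            out.put(entry.getKey(), entry.getValue());
        }
        return out;
    }
}
```

With this sketch, a client would read "dc:creator" or "dc:format" without first checking the content type, while keys with no Dublin Core equivalent pass through under their original names. That is the case Stephane raises for image and audio metadata, where a second known schema would be needed.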