Hi Jukka,

On 12/2/08 4:34 PM, "Jukka Zitting" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Currently Tika doesn't have any good guidelines on the semantics and
> usage of metadata keys. Mostly we've just ended up with a few basic
> keys like CONTENT_TYPE and a bunch of more or less inconsistently used
> other keys. The result is that a client that currently wants to assign
> any reasonable semantics to the extracted metadata needs to first
> check the reported CONTENT_TYPE and use that to deduce the meanings of
> the other available metadata keys based on documentation in [1].

This is really only true of subclasses of o.a.t.parser.CompositeParser.
There is no mechanism enforcing that this be the case. In fact, on the
contrary, it's possible to implement another o.a.t.parser.Parser
implementation that has entirely different semantics. Sure, by doing this
you really don't take advantage of tika-config.xml and its associated
auto-goodness, but that's the whole point of making Parser an interface:
to allow folks to adhere to the lowest-common-denominator standard.

>
> This is not optimal. It should be up to the Tika parsers to interpret
> the metadata available in the supported document types and map that as
> well as possible to a single standard like Dublin Core. This way a
> client only needs to know a single set of metadata semantics.

I'm not sure I see the relationship between the fact that CompositeParsers
use the CONTENT_TYPE metadata to determine which underlying CompositeParser
subclass to call and the metadata standard adhered to by the underlying
parser. It seems like you are suggesting that o.a.t.parser.Parsers should
declare which metadata semantics and standard vocabulary (or vocabularies)
they adhere to, so that a client can tell, e.g., whether different parsers
can be pipelined together to take advantage of each other's output. This
is an interesting proposition, because if we go down this path we start to
get into the realm of data flow dependencies, and then we have to start
thinking about parsing workflows and how Tika can support them.
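
To make the idea concrete, here is a rough Java sketch of what such a declaration might look like. Everything in it is hypothetical: the DeclaresMetadata interface, the Declaration class, the canPipeline check, and the example keys are invented for illustration and are not part of Tika's Parser API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class MetadataContracts {

    /** Hypothetical contract a parser could implement; not an existing Tika interface. */
    public interface DeclaresMetadata {
        Set<String> requiredInputKeys();   // metadata keys the parser needs as input
        Set<String> providedOutputKeys();  // metadata keys the parser promises to emit
        Set<String> outputVocabularies();  // e.g. "Dublin Core"; a parser may list several
    }

    /** Simple immutable holder standing in for a real parser in this sketch. */
    public static final class Declaration implements DeclaresMetadata {
        private final Set<String> required;
        private final Set<String> provided;
        private final Set<String> vocabularies;

        public Declaration(Set<String> required, Set<String> provided, Set<String> vocabularies) {
            this.required = Collections.unmodifiableSet(new HashSet<>(required));
            this.provided = Collections.unmodifiableSet(new HashSet<>(provided));
            this.vocabularies = Collections.unmodifiableSet(new HashSet<>(vocabularies));
        }

        @Override public Set<String> requiredInputKeys() { return required; }
        @Override public Set<String> providedOutputKeys() { return provided; }
        @Override public Set<String> outputVocabularies() { return vocabularies; }
    }

    /** Small helper to build a set from literal values. */
    public static Set<String> setOf(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    /** Two stages can be chained if the upstream output covers every key the downstream stage requires. */
    public static boolean canPipeline(DeclaresMetadata upstream, DeclaresMetadata downstream) {
        return upstream.providedOutputKeys().containsAll(downstream.requiredInputKeys());
    }

    public static void main(String[] args) {
        // Example key names are assumptions chosen for illustration only.
        Declaration pdf = new Declaration(setOf(), setOf("dc:title", "dc:format"), setOf("Dublin Core"));
        Declaration indexer = new Declaration(setOf("dc:title"), setOf(), setOf());
        Declaration rotator = new Declaration(setOf("exif:Orientation"), setOf(), setOf());

        System.out.println(canPipeline(pdf, indexer)); // true: dc:title is provided
        System.out.println(canPipeline(pdf, rotator)); // false: exif:Orientation is missing
    }
}
```

Note that nothing here depends on CompositeParser: the declaration lives on the individual parser, and the dependency check is just a set comparison between what one stage provides and what the next requires.
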
I think declaring things like required InputMetadata, and declaring the
provided output metadata semantics and vocabularies, would be a very
interesting and useful contribution to Tika. However, I want to point out
that this seems entirely independent of the CompositeParser.

>
> The parser can still make the raw underlying metadata available using
> metadata keys that are specific to the actual metadata schema used in
> the document type, but that should be considered an extra feature
> beyond the normalized Dublin Core output.

To me, while adhering to Dublin Core is great and provides
standardization, we shouldn't enforce Dublin Core as the _only_ output
metadata vocabulary. In fact, as noted above, we should have several
output metadata vocabularies, and perhaps have all the
o.a.t.parser.Parsers declare them.

>
> One corollary of this is that we should replace the current HTTP-based
> CONTENT_TYPE metadata key with the Dublin Core FORMAT.

In what context? Could you be more specific?

Thanks,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.