I'd like to add my 2 cents to this. The whole idea of Tika actually grew out of the fact that we tried this already in Nutch, and we realized that implementing your own parser libraries and glue code/interfaces:
1. was not developer-friendly: the maintainers of these parsing libraries typically like to branch off and work on the intricacies of the parser-specific code in their own forum, which makes it difficult to keep the parsing code in your own system up to date.
2. was a hindrance to reuse: we had different versions and forks of existing parser libraries (e.g., see parse-rss) which fell out of date with the externally developed versions.
3. did not scale.

I would be hesitant to consider implementing parsers in Tika, as it really defeats the purpose of the project: that is, to be a bridging middleware and standard interface to parsing libraries, metadata representation, MIME extraction frameworks and content analysis mechanisms.

Thanks!

Cheers,
Chris

On 12/4/08 9:14 AM, "Jukka Zitting" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
>> Just one question: Is there interest to do the same tag mapping approach for
>> OpenXML (MS Office 2007) files? In my opinion, this is much resource
>> friendlier (because it is only extracting text from an XML file) than the
>> POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
>> OpenXML schema with additional external dependencies.
>
> I agree that directly mapping things from the underlying XML is
> probably the most straightforward and easy solution for simple text
> extraction.
>
> However, a proper parser library becomes very handy as soon as you
> start implementing more complex things like extracting content from
> possible attachments or handling encryption. Using an external parser
> library also insulates us from a lot of complex details like users
> complaining why isn't some content in their documents being extracted.
> If we implement parsing inside Tika we also need to take on the burden
> of maintaining and supporting that implementation.
>
> In general I'd only implement a parser fully in Tika if the required
> amount of code is small (up to a few hundred lines max) and that code
> covers all the features we need. The current MP3 parser is a good
> example where both requirements are currently satisfied, though if we
> want to start supporting some of the more complex MP3 tagging formats
> I'd definitely go for an external parser library.
>
> BR,
>
> Jukka Zitting
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer: The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.