Hi,

On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
> Just one question: Is there interest to do the same tag mapping approach for
> OpenXML (MS Office 2007) files? In my opinion, this is much resource
> friendlier (because it is only extracting text from an XML file) than the
> POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
> OpenXML schema with additional external dependencies.

I agree that directly mapping things from the underlying XML is probably the most straightforward and easy solution for simple text extraction. However, a proper parser library becomes very handy as soon as you start implementing more complex things like extracting content from possible attachments or handling encryption.

Using an external parser library also insulates us from a lot of complex details, like users complaining that some content in their documents isn't being extracted. If we implement parsing inside Tika, we also need to take on the burden of maintaining and supporting that implementation.

In general I'd only implement a parser fully in Tika if the required amount of code is small (up to a few hundred lines max) and that code covers all the features we need. The current MP3 parser is a good example where both requirements are currently satisfied, though if we want to start supporting some of the more complex MP3 tagging formats I'd definitely go for an external parser library.

BR,

Jukka Zitting