XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Jukka Zitting Thu, 04 Dec 2008 09:14:47 -0800

Hi,

On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
> Just one question: Is there interest to do the same tag mapping approach for
> OpenXML (MS Office 2007) files? In my opinion, this is much resource
> friendlier (because it is only extracting text from an XML file) than the
> POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
> OpenXML schema with additional external dependencies.


I agree that directly mapping things from the underlying XML is
probably the most straightforward and easy solution for simple text
extraction.
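
As a rough illustration of what such a direct mapping could look like
for OpenXML, here is a minimal sketch using only the JDK's zip and SAX
classes. It is not Tika or POI code. The class and method names are
made up for this example, and it assumes the main document part is
stored at word/document.xml (the usual location, though a robust
implementation would resolve it through the package relationships).
It collects the text of w:t elements and emits a newline at the end
of each w:p paragraph.

```java
import java.io.FilterInputStream;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
public class DocxTextSketch {

    // Namespace of WordprocessingML elements such as w:p and w:t.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static String extractText(InputStream docx) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        final StringBuilder out = new StringBuilder();

        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // Assumption: the main part lives at word/document.xml.
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                // Some SAX parsers close their input; shield the zip stream.
                InputStream part = new FilterInputStream(zip) {
                    @Override public void close() { }
                };
                factory.newSAXParser().parse(part, new DefaultHandler() {
                    private boolean inText;

                    @Override
                    public void startElement(String uri, String local,
                                             String qName, Attributes atts) {
                        if (W_NS.equals(uri) && "t".equals(local)) {
                            inText = true;
                        }
                    }

                    @Override
                    public void endElement(String uri, String local,
                                           String qName) {
                        if (!W_NS.equals(uri)) {
                            return;
                        }
                        if ("t".equals(local)) {
                            inText = false;
                        } else if ("p".equals(local)) {
                            out.append('\n'); // one line per paragraph
                        }
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inText) {
                            out.append(ch, start, length);
                        }
                    }
                });
                break;
            }
        }
        return out.toString();
    }
}
```

Because this streams the XML, it never builds a DOM tree, which is the
resource saving Uwe describes. It also ignores tables, headers,
footnotes and embedded parts.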

However, a proper parser library becomes very handy as soon as you
start implementing more complex things, like extracting content from
any attachments or handling encryption. Using an external parser
library also insulates us from a lot of complex details, such as
users asking why some content in their documents isn't being
extracted. If we implement parsing inside Tika, we also take on the
burden of maintaining and supporting that implementation.

In general, I'd only implement a parser fully in Tika if the required
amount of code is small (a few hundred lines at most) and that code
covers all the features we need. The current MP3 parser is a good
example where both requirements are satisfied, though if we want to
start supporting some of the more complex MP3 tagging formats, I'd
definitely go for an external parser library.
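
To give a feel for the "small amount of code" end of the scale, here
is a minimal sketch of reading an ID3v1 tag, the simplest MP3 tagging
format. This is not the Tika MP3 parser, and the class and method
names are hypothetical. It relies only on the fixed ID3v1 layout: the
last 128 bytes of the file start with "TAG", followed by a 30-byte
title and a 30-byte artist field.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
public class Id3v1Sketch {

    // Returns {title, artist}, or null if no ID3v1 tag is present.
    public static String[] readTitleAndArtist(byte[] mp3) {
        if (mp3.length < 128) {
            return null;
        }
        int tag = mp3.length - 128; // the tag occupies the final 128 bytes
        if (mp3[tag] != 'T' || mp3[tag + 1] != 'A' || mp3[tag + 2] != 'G') {
            return null;
        }
        return new String[] {
            field(mp3, tag + 3, 30),  // title
            field(mp3, tag + 33, 30), // artist
        };
    }

    // Fields are fixed width, padded with NUL bytes or spaces.
    private static String field(byte[] data, int offset, int length) {
        int end = offset;
        while (end < offset + length && data[end] != 0) {
            end++;
        }
        return new String(data, offset, end - offset,
                          StandardCharsets.ISO_8859_1).trim();
    }
}
```

The richer tagging formats need far more than this (variable-length
frames, multiple text encodings, and so on), which is where an
external library starts to pay off.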

BR,

Jukka Zitting
