Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Mattmann, Chris A Sun, 07 Dec 2008 18:00:50 -0800

I'd like to add my 2 cents to this. The whole idea of Tika actually grew out
of the fact that we tried this already in Nutch, and we realized that
implementing your own parser libraries and glue code/interfaces:


1. was not developer-friendly: the maintainers of these parsing libraries
typically like to branch off and work on the intricacies of maintaining the
parser-specific code in their own forum, which makes it difficult to keep
the current parsing code in your system up to date.

2. was a hindrance to reuse: we had different versions and forks of existing
parser libraries (e.g., see parse-rss) which fell out of date with the
externally developed versions.

3. did not scale.

I would be reluctant to consider implementing parsers in Tika, as it really
defeats the purpose of the project: that is, to be a bridging middleware and
standard interface to parsing libraries, metadata representation, mime
extraction frameworks and content analysis mechanisms.
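
To make the "bridging" idea a bit more concrete, here is a minimal sketch
in Java. It is not Tika's actual API: DocumentParser, ExternalTextLibrary,
ExternalTextAdapter and extract() are all hypothetical names made up for
illustration. The point is only that the host system codes against one
small interface, while the real parsing work stays in a library that is
maintained elsewhere.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are Tika's real API.
final class BridgeSketch {

    /** The one interface the host system codes against. */
    interface DocumentParser {
        String parse(byte[] content, Map<String, String> metadata);
    }

    /** Stand-in for an externally maintained library with its own API. */
    static final class ExternalTextLibrary {
        String decodeAndTrim(byte[] raw) {
            return new String(raw, StandardCharsets.UTF_8).trim();
        }
    }

    /** Thin glue: translates the common interface into library calls. */
    static final class ExternalTextAdapter implements DocumentParser {
        private final ExternalTextLibrary library = new ExternalTextLibrary();

        @Override
        public String parse(byte[] content, Map<String, String> metadata) {
            metadata.put("Content-Length", Integer.toString(content.length));
            return library.decodeAndTrim(content);
        }
    }

    // Registry keyed by MIME type; upgrading a library touches one adapter only.
    private static final Map<String, DocumentParser> PARSERS = new HashMap<>();
    static {
        PARSERS.put("text/plain", new ExternalTextAdapter());
    }

    /** Looks up the adapter for the MIME type and delegates to it. */
    static String extract(String mimeType, byte[] content,
                          Map<String, String> metadata) {
        DocumentParser parser = PARSERS.get(mimeType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mimeType);
        }
        metadata.put("Content-Type", mimeType);
        return parser.parse(content, metadata);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        byte[] content = "  hello from a wrapped library  "
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extract("text/plain", content, metadata));
        System.out.println(metadata);
    }
}
```

With this shape, picking up a new release of the external library should
only mean touching the adapter, not the callers of extract().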

Thanks!

Cheers,
Chris



On 12/4/08 9:14 AM, "Jukka Zitting" <[EMAIL PROTECTED]> wrote:

> Hi,
>
> On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
>> Just one question: Is there interest in applying the same tag-mapping
>> approach to OpenXML (MS Office 2007) files? In my opinion, this is much
>> more resource-friendly (because it only extracts text from an XML file)
>> than the POI approach of having DOM trees and megabytes of DOM-tree
>> mappings of the OpenXML schema with additional external dependencies.
>
> I agree that directly mapping things from the underlying XML is
> probably the most straightforward and easy solution for simple text
> extraction.
>
> However, a proper parser library becomes very handy as soon as you
> start implementing more complex things like extracting content from
> possible attachments or handling encryption. Using an external parser
> library also insulates us from a lot of complex details, such as users
> complaining that some content in their documents isn't being extracted.
> If we implement parsing inside Tika we also need to take on the burden
> of maintaining and supporting that implementation.
>
> In general I'd only implement a parser fully in Tika if the required
> amount of code is small (up to a few hundred lines max) and that code
> covers all the features we need. The current MP3 parser is a good
> example where both requirements are currently satisfied, though if we
> want to start supporting some of the more complex MP3 tagging formats
> I'd definitely go for an external parser library.
>
> BR,
>
> Jukka Zitting
>

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
