Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Nadav Har'El, Sun, 07 Dec 2008 23:44:41 -0800

I know my following question and proposal are a bit heretical, but I think
they need to be raised nonetheless...


On Sun, Dec 07, 2008, Mattmann, Chris A wrote about "Re: XML formats vs. parser 
libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that 
emmits structured XHTML content.)":
> I would be remiss to consider implementing parsers in Tika as it really
> defeats the purpose of the project: that is, to be a bridging middleware and
> standard interface to parsing libraries, metadata representation, mime
> extraction frameworks and content analysis mechanisms.

The more I think about this issue, the less I am convinced that Tika as a
wrapper for a dozen other libraries is the right way to go in the long run.
We are facing several problems:

1. The big libraries such as PDFBox, POI, etc. often do much more than Tika
   wants to do - they can understand the document fonts, layout, graphics
   and other things Tika doesn't care about, and can create new documents or
   modify existing documents. As a result, these libraries are very complex,
   and very large - heck, each of PDFBox and POI is larger than the whole of
   Lucene (Tika's parent project)! This is a problem for applications that
   need text extraction but wish to remain small or at least not huge
   (think: desktop search, mobile devices, etc.).

2. Many formats will need new code anyway, such as OpenOffice, MP3, and the
   other formats for which Tika already has its own parsers. Very soon (if
   it is not true already), most of the parsers will be in Tika's code
   anyway, and not in external projects.

3. It is hard to enforce behavior standards and policies on the individual
   libraries. Each library makes its own decision whether it works by
   streaming or by loading the whole file into memory, how it handles memory
   constraints (to avoid OOM exceptions, which are catastrophic to the
   application), what kind of metadata it knows how to extract, and so on.
   This makes it harder for Tika to provide consistent behavior.

What I'd really love to see materialize is a relatively small stand-alone
library (500K sounds like a reasonable figure ;-)), which can extract text and a
bit of metadata from all sorts of file formats, but nothing more. I would
love to see Tika become this sort of library.

I don't think this is a pipe dream - I believe (but correct me if I got the
wrong impression...) that only a minority of the code in PDFBox and POI, for
example, is relevant at all to Tika, and that Tika could do better by copying
only the relevant parts of the code rather than using the whole code as
a black box. I would be even happier if those projects made the separation
themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
doubtful that this will happen on its own any time soon.

What I am suggesting may sound scary at first, but I think that in the long
run, this is the right way to go forward.

Nadav.

-- 
Nadav Har'El                        |      Monday, Dec  8 2008, 11 Kislev 5769
[EMAIL PROTECTED]             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |I have a great signature, but it won't
http://nadav.harel.org.il           |fit at the end of this message -- Fermat
