On Mon, Dec 08, 2008, Stephane Bastian wrote about "Re: XML formats vs. parser 
libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that 
emmits structured XHTML content.)":
> If the size of the dependencies is the main concern, we could use the 
> minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/) 

This is a very interesting idea. Thanks! I wasn't aware of this tool.
Maybe I'll use it for some of my projects :-)

However, I fear that this technique underestimates by a long shot the
possible savings in the case of Tika. The most important example is:

> Original length of pdfbox-0.7.3.jar was 3321762 bytes. Was able shrink 
> it to pdfbox-0.7.3-mini.jar at 3076844 bytes (92%)

If you have ever looked at the PDFBox code, it is evident (at least I think
it is...) that Tika needs far less than 92% of the code. PDFBox can create
PDF files, modify them, render them, fill forms, and do countless other things
that are completely irrelevant to Tika. The problem is that PDFBox is written
in such a way that even when you just want to extract text, all sorts of
irrelevant code (like font handling) gets referenced. I once tried to reduce
the size of pdfbox's jar manually by removing what I thought was irrelevant,
and only partially succeeded because the code was so entangled.

Anyway, the size of the dependencies is indeed a major concern, but like I
said it isn't the only concern (I mentioned a couple of others in my post).
Uwe brought up another interesting concern today: with so many dependencies,
each needed in a specific version, it is very hard to get the set right. If
Tika depends on a dozen libraries, it will probably require you to get library
A version X, library B version Y, and so on, and if you get one of these
versions wrong it might fail in mysterious ways.
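
As a rough illustration of this concern, a startup check could compare the
library versions an application expects against the versions it actually
finds, and report every mismatch up front instead of failing later. This is
only a sketch under assumptions: the class and method names are hypothetical,
"libraryB" and its versions are made up, and how the "found" versions would be
obtained (for example from jar manifests) is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika or PDFBox.
class DependencyVersionCheck {

    // Returns one message per expected library that is missing or whose
    // found version differs from the expected one.
    public static List<String> mismatches(Map<String, String> expected,
                                          Map<String, String> found) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String actual = found.get(e.getKey());
            if (actual == null) {
                problems.add(e.getKey() + ": missing (expected " + e.getValue() + ")");
            } else if (!actual.equals(e.getValue())) {
                problems.add(e.getKey() + ": found " + actual
                        + ", expected " + e.getValue());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("pdfbox", "0.7.3");
        expected.put("libraryB", "2.1");   // hypothetical library and version

        Map<String, String> found = new LinkedHashMap<>();
        found.put("pdfbox", "0.7.3");
        found.put("libraryB", "2.0");      // deliberately wrong version

        // Print every problem found, one per line.
        for (String problem : mismatches(expected, found)) {
            System.out.println(problem);
        }
    }
}
```

A check like this does not reduce the number of dependencies; it only turns a
mysterious failure into an explicit message naming the library at fault.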

Nadav.

-- 
Nadav Har'El                        |      Monday, Dec  8 2008, 11 Kislev 5769
[EMAIL PROTECTED]             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Red meat is not bad for you: fuzzy green
http://nadav.harel.org.il           |meat is bad for you.
