On Mon, Dec 08, 2008, Stephane Bastian wrote about "Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)":
> If the size of the dependencies is the main concern, we could use the
> minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/)

This is a very interesting idea. Thanks! I wasn't aware of this tool. Maybe I'll use it for some of my projects :-)

However, I fear that this technique underestimates by a long shot the possible savings in the case of Tika. The most important example is:

> Original length of pdfbox-0.7.3.jar was 3321762 bytes. Was able shrink
> it to pdfbox-0.7.3-mini.jar at 3076844 bytes (92%)

If you have ever looked at the PDFBox code, it is evident (at least I think it is...) that Tika needs far less than 92% of the code. PDFBox can create PDF files, modify them, render them, fill in forms, and has countless other features that are completely irrelevant to Tika. The problem is that PDFBox is written in such a way that even when you just want to extract text, all sorts of irrelevant code (like font handling) gets referenced. I once tried to manually reduce the size of PDFBox's jar by removing what I thought was irrelevant, and only partially succeeded because the code was so entangled.

Anyway, the size of the dependencies is indeed a major concern, but like I said, it isn't the only concern (I mentioned a couple of others in my post). Uwe brought up another interesting concern today: the sheer number of dependencies and their versions makes them very hard to use correctly. If Tika depends on a dozen libraries, it will probably require you to get library A version X, library B version Y, and so on, and if you get one of these versions wrong it might fail in mysterious ways.

Nadav.

-- 
Nadav Har'El                        | Monday, Dec 8 2008, 11 Kislev 5769
[EMAIL PROTECTED]                   |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |Red meat is not bad for you: fuzzy green
http://nadav.harel.org.il           |meat is bad for you.