Re: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Stephane Bastian Mon, 08 Dec 2008 00:43:33 -0800

Hi,

If the size of the dependencies is the main concern, we could use the minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/), which creates a mini version of each dependency used by Tika. The plugin analyses classes and keeps only the ones actually used by Tika. If we need something more powerful, we could then use ProGuard: http://proguard.sourceforge.net/
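
As a rough illustration of the idea (not minijar's actual implementation), this kind of pruning can be thought of as a reachability walk over a class dependency graph: start from the project's own classes and keep everything they reference, directly or indirectly. A minimal Java sketch, assuming a hypothetical precomputed map from each class name to the class names it references (the class and method names here are made up):

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: which classes must be kept, given who references whom.
final class ReachableClasses {

    // references: class name -> names of the classes it refers to (assumed input).
    // roots: the project's own classes, which are always kept.
    static Set<String> reachable(Map<String, Set<String>> references,
                                 Collection<String> roots) {
        Set<String> kept = new HashSet<>(roots);
        Deque<String> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            Set<String> targets =
                    references.getOrDefault(current, Collections.emptySet());
            for (String referenced : targets) {
                // add() returns false for classes already kept, so each
                // class is visited at most once.
                if (kept.add(referenced)) {
                    pending.push(referenced);
                }
            }
        }
        return kept;
    }
}
```

Any class in a dependency jar that is not in the returned set is a candidate for removal. Note that a purely static walk like this cannot see classes loaded by name at runtime (for example via Class.forName), so the shrunken jars would still need testing.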

I modified pom.xml and ran the minijar plugin. Here are the results, if you're interested:


Can remove 3969 of 5601 classes (70%).

Original length of commons-io-1.4.jar was 109043 bytes. Was able shrink it to commons-io-1.4-mini.jar at 19022 bytes (17%)
Original length of poi-3.1-FINAL.jar was 1344956 bytes. Was able shrink it to poi-3.1-FINAL-mini.jar at 852767 bytes (63%)
Original length of log4j-1.2.14.jar was 367444 bytes. Was able shrink it to log4j-1.2.14-mini.jar at 98397 bytes (26%)
Original length of poi-scratchpad-3.1-FINAL.jar was 721469 bytes. Was able shrink it to poi-scratchpad-3.1-FINAL-mini.jar at 469216 bytes (65%)
Original length of commons-logging-1.0.4.jar was 38015 bytes. Was able shrink it to commons-logging-1.0.4-mini.jar at 15839 bytes (41%)
Original length of commons-lang-2.1.jar was 207723 bytes. Was able shrink it to commons-lang-2.1-mini.jar at 106922 bytes (51%)
Original length of bcmail-jdk14-136.jar was 189233 bytes. Was able shrink it to bcmail-jdk14-136-mini.jar at 10639 bytes (5%)
Original length of jempbox-0.2.0.jar was 42669 bytes. Was able shrink it to jempbox-0.2.0-mini.jar at 234 bytes (0%)

No references to jar icu4j-3.8.jar. You can safely omit that dependency.

Original length of bcprov-jdk14-136.jar was 1401560 bytes. Was able shrink it to bcprov-jdk14-136-mini.jar at 142698 bytes (10%)
Original length of xml-apis-1.3.03.jar was 195119 bytes. Was able shrink it to xml-apis-1.3.03-mini.jar at 64806 bytes (33%)
Original length of fontbox-0.1.0.jar was 63159 bytes. Was able shrink it to fontbox-0.1.0-mini.jar at 61322 bytes (97%)
Original length of commons-codec-1.3.jar was 46725 bytes. Was able shrink it to commons-codec-1.3-mini.jar at 12439 bytes (26%)
Original length of asm-3.1.jar was 43033 bytes. Was able shrink it to asm-3.1-mini.jar at 39590 bytes (91%)
Original length of xercesImpl-2.8.1.jar was 1212965 bytes. Was able shrink it to xercesImpl-2.8.1-mini.jar at 101989 bytes (8%)
Original length of nekohtml-1.9.9.jar was 116162 bytes. Was able shrink it to nekohtml-1.9.9-mini.jar at 92625 bytes (79%)
Original length of pdfbox-0.7.3.jar was 3321762 bytes. Was able shrink it to pdfbox-0.7.3-mini.jar at 3076844 bytes (92%)
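
The percentages in this output appear to be the mini jar size as a whole-number percentage of the original size, truncated rather than rounded (jempbox, at 234 of 42669 bytes, prints as 0%). A small Java sketch for recomputing them and the overall ratio from the figures above; the class and method names are made up for illustration:

```java
// Hypothetical helper; the sizes are copied from the minijar output above.
final class JarSizeSummary {

    // {original bytes, mini bytes} per jar, in the order reported above.
    static final long[][] SIZES = {
        {109043, 19022},     // commons-io-1.4
        {1344956, 852767},   // poi-3.1-FINAL
        {367444, 98397},     // log4j-1.2.14
        {721469, 469216},    // poi-scratchpad-3.1-FINAL
        {38015, 15839},      // commons-logging-1.0.4
        {207723, 106922},    // commons-lang-2.1
        {189233, 10639},     // bcmail-jdk14-136
        {42669, 234},        // jempbox-0.2.0
        {1401560, 142698},   // bcprov-jdk14-136
        {195119, 64806},     // xml-apis-1.3.03
        {63159, 61322},      // fontbox-0.1.0
        {46725, 12439},      // commons-codec-1.3
        {43033, 39590},      // asm-3.1
        {1212965, 101989},   // xercesImpl-2.8.1
        {116162, 92625},     // nekohtml-1.9.9
        {3321762, 3076844},  // pdfbox-0.7.3
    };

    // Mini size as a whole-number percentage of the original, truncated.
    static long percentOfOriginal(long original, long mini) {
        return 100 * mini / original;
    }

    // Sum of the original jar sizes in bytes.
    static long totalOriginal() {
        long sum = 0;
        for (long[] pair : SIZES) {
            sum += pair[0];
        }
        return sum;
    }

    // Sum of the shrunken jar sizes in bytes.
    static long totalMini() {
        long sum = 0;
        for (long[] pair : SIZES) {
            sum += pair[1];
        }
        return sum;
    }
}
```

For the 16 jars listed, this gives 5165349 of 9421037 bytes, i.e. the mini jars together are about 55% of the original size (icu4j, which minijar reports as unreferenced, is not included). Most of what remains is pdfbox, which barely shrinks.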


BR,

Stephane

Nadav Har'El wrote:

I know my following question and proposal are a bit heretical, but I think they
need to be raised nonetheless...

On Sun, Dec 07, 2008, Mattmann, Chris A wrote about "Re: XML formats vs. parser 
libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits 
structured XHTML content.)":

I would be remiss to consider implementing parsers in Tika as it really
defeats the purpose of the project: that is, to be a bridging middleware and
standard interface to parsing libraries, metadata representation, mime
extraction frameworks and content analysis mechanisms.


The more I think about this issue, the less I am convinced that Tika as a
wrapper for a dozen other libraries is the right way to go in the long run.
We are facing several problems:

1. The big libraries such as PDFBox, POI, etc. often do much more than Tika
   wants to do - they can understand the document fonts, layout, graphics
   and other things Tika doesn't care about, and can create new documents or
   modify existing documents. As a result, these libraries are very complex,
   and very large - heck, each of PDFBox and POI is larger than the whole of
   Lucene (Tika's parent project)! This is a problem for applications that
   need text extraction but wish to remain small or at least not huge
   (think: desktop search, mobile devices, etc.).

2. There are many formats that will need new code anyway, such as OpenOffice,
   MP3, and other parsers which Tika already has code for. Very soon (if
   it is not true already), most of the parsers will be in Tika's code
   anyway, and not in external projects.

3. It is hard to enforce behavior standards and policies on the individual
   libraries. Each library makes its own decision whether it works by
   streaming or allocating the whole file, how it handles memory constraints
   (to avoid OOM exceptions which are catastrophic to the application), what
   kind of metadata it knows how to extract, and so on. This makes it harder
   for Tika to give consistent behavior.

What I'd really love to see materialize is a relatively small stand-alone
library (500K sounds like a reasonable figure ;-)), which can extract text and a
bit of metadata from all sorts of file formats, but nothing more. I would
love to see Tika become this sort of library.

I don't think this is a pipe-dream - I believe (but correct me if I got the
wrong impression...) that only a minority of the code in PDFBox and POI for
example is relevant at all to Tika, and that Tika could do better by copying
only the relevant parts of the code rather than using the whole code as
a black box. I would be even happier if those projects made the separation
themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
doubtful that this will happen on its own any time soon.

What I am suggesting may sound scary at first, but I think that in the long
run, this is the right way to go forward.

Nadav.
