Hi,
If the size of the dependencies is the main concern, we could use the
minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/),
which creates a mini version of each dependency used by Tika. The plugin
analyses the classes and keeps only the ones Tika actually uses. If we
need something more powerful, we could then use ProGuard:
http://proguard.sourceforge.net/
I modified pom.xml and ran the minijar plugin. Here are the results, if
you're interested:
Can remove 3969 of 5601 classes (70%).
Original length of commons-io-1.4.jar was 109043 bytes. Was able shrink
it to commons-io-1.4-mini.jar at 19022 bytes (17%)
Original length of poi-3.1-FINAL.jar was 1344956 bytes. Was able shrink
it to poi-3.1-FINAL-mini.jar at 852767 bytes (63%)
Original length of log4j-1.2.14.jar was 367444 bytes. Was able shrink it
to log4j-1.2.14-mini.jar at 98397 bytes (26%)
Original length of poi-scratchpad-3.1-FINAL.jar was 721469 bytes. Was
able shrink it to poi-scratchpad-3.1-FINAL-mini.jar at 469216 bytes (65%)
Original length of commons-logging-1.0.4.jar was 38015 bytes. Was able
shrink it to commons-logging-1.0.4-mini.jar at 15839 bytes (41%)
Original length of commons-lang-2.1.jar was 207723 bytes. Was able
shrink it to commons-lang-2.1-mini.jar at 106922 bytes (51%)
Original length of bcmail-jdk14-136.jar was 189233 bytes. Was able
shrink it to bcmail-jdk14-136-mini.jar at 10639 bytes (5%)
Original length of jempbox-0.2.0.jar was 42669 bytes. Was able shrink it
to jempbox-0.2.0-mini.jar at 234 bytes (0%)
No references to jar icu4j-3.8.jar. You can safely omit that dependency.
Original length of bcprov-jdk14-136.jar was 1401560 bytes. Was able
shrink it to bcprov-jdk14-136-mini.jar at 142698 bytes (10%)
Original length of xml-apis-1.3.03.jar was 195119 bytes. Was able shrink
it to xml-apis-1.3.03-mini.jar at 64806 bytes (33%)
Original length of fontbox-0.1.0.jar was 63159 bytes. Was able shrink it
to fontbox-0.1.0-mini.jar at 61322 bytes (97%)
Original length of commons-codec-1.3.jar was 46725 bytes. Was able
shrink it to commons-codec-1.3-mini.jar at 12439 bytes (26%)
Original length of asm-3.1.jar was 43033 bytes. Was able shrink it to
asm-3.1-mini.jar at 39590 bytes (91%)
Original length of xercesImpl-2.8.1.jar was 1212965 bytes. Was able
shrink it to xercesImpl-2.8.1-mini.jar at 101989 bytes (8%)
Original length of nekohtml-1.9.9.jar was 116162 bytes. Was able shrink
it to nekohtml-1.9.9-mini.jar at 92625 bytes (79%)
Original length of pdfbox-0.7.3.jar was 3321762 bytes. Was able shrink
it to pdfbox-0.7.3-mini.jar at 3076844 bytes (92%)
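
For a rough overall picture, the per-jar figures above can be summed. Below
is a minimal Python sketch that does only that arithmetic; the byte counts are
copied from the minijar output, the names (`sizes`, `total_original`, etc.)
are just illustrative, and icu4j is left out because the output does not
report its size.

```python
# (original_bytes, mini_bytes) per jar, copied from the minijar output above.
# icu4j-3.8.jar is omitted: the output gives no size for it.
sizes = {
    "commons-io-1.4": (109043, 19022),
    "poi-3.1-FINAL": (1344956, 852767),
    "log4j-1.2.14": (367444, 98397),
    "poi-scratchpad-3.1-FINAL": (721469, 469216),
    "commons-logging-1.0.4": (38015, 15839),
    "commons-lang-2.1": (207723, 106922),
    "bcmail-jdk14-136": (189233, 10639),
    "jempbox-0.2.0": (42669, 234),
    "bcprov-jdk14-136": (1401560, 142698),
    "xml-apis-1.3.03": (195119, 64806),
    "fontbox-0.1.0": (63159, 61322),
    "commons-codec-1.3": (46725, 12439),
    "asm-3.1": (43033, 39590),
    "xercesImpl-2.8.1": (1212965, 101989),
    "nekohtml-1.9.9": (116162, 92625),
    "pdfbox-0.7.3": (3321762, 3076844),
}

# Sum both columns and derive the overall saving.
total_original = sum(orig for orig, _ in sizes.values())
total_mini = sum(mini for _, mini in sizes.values())
saved = total_original - total_mini

print(f"original: {total_original} bytes")
print(f"mini:     {total_mini} bytes")
print(f"saved:    {saved} bytes ({100 * saved / total_original:.1f}%)")
```

Summed this way, the listed jars go from 9421037 bytes to 5165349 bytes,
a saving of roughly 45%. About 60% of what remains is pdfbox-0.7.3-mini.jar
alone, which minijar could barely shrink.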
BR,
Stephane
Nadav Har'El wrote:
I know the following question and proposal are a bit heretical, but I think
they need to be raised nonetheless...
On Sun, Dec 07, 2008, Mattmann, Chris A wrote about "Re: XML formats vs. parser
libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits
structured XHTML content.)":
I would be remiss to consider implementing parsers in Tika as it really
defeats the purpose of the project: that is, to be a bridging middleware and
standard interface to parsing libraries, metadata representation, mime
extraction frameworks and content analysis mechanisms.
The more I think about this issue, the less I am convinced that Tika as a
wrapper for a dozen other libraries is the right way to go in the long run.
We are facing several problems:
1. The big libraries such as PDFBox, POI, etc. often do much more than Tika
wants to do - they can understand the document fonts, layout, graphics
and other things Tika doesn't care about, and can create new documents or
modify existing documents. As a result, these libraries are very complex,
and very large - heck, each of PDFBox and POI is larger than the whole of
Lucene (Tika's parent project)! This is a problem for applications that
need text extraction but wish to remain small or at least not huge
(think: desktop search, mobile devices, etc.).
2. Many formats will need new code anyway, such as OpenOffice, MP3, and
others for which Tika already has its own parsers. Very soon (if it is
not the case already), most of the parsers will be in Tika's code
anyway, and not in external projects.
3. It is hard to enforce behavior standards and policies on the individual
libraries. Each library makes its own decision on whether it works by
streaming or by loading the whole file into memory, how it handles memory
constraints (to avoid OOM errors, which are catastrophic to the application),
what kind of metadata it knows how to extract, and so on. This makes it harder
for Tika to give consistent behavior.
What I'd really love to see materialize is a relatively small stand-alone
library (500K sounds like a reasonable figure ;-)), which can extract text and
a bit of metadata from all sorts of file formats, but nothing more. I would
love to see Tika become this sort of library.
I don't think this is a pipe dream - I believe (but correct me if I got the
wrong impression...) that only a minority of the code in PDFBox and POI, for
example, is relevant to Tika at all, and that Tika could do better by copying
only the relevant parts of the code rather than using the whole code as
a black box. I would be even happier if those projects made the separation
themselves, e.g., PDFBox split into a small "PDFBOX-extract" package which is
exactly what Tika needs and "PDFBOX-etc" which is all the rest, but I am
doubtful that this will happen on its own any time soon.
What I am suggesting may sound scary at first, but I think that in the long
run, this is the right way to go forward.
Nadav.