Hello Stephane,

> If the size of the dependencies is the main concern, we could use the
> minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/)
> which creates a mini version of the dependencies used by tika The plugin
> analyses classes and only keeps the ones actually used by Tika. If we
> need something more powerful we could then use Proguard:
> http://proguard.sourceforge.net/

But then you still have the problem of polluting the classpath with class
files you cannot control. Suppose a project uses the minified standalone TIKA
JAR and also needs one of the bundled dependencies directly (e.g. nekohtml
for its own development), but needs more of its classes than the bundle
contains (because it uses some extra feature of nekohtml). Then you have a
problem. The project may add nekohtml to its own lib path, maybe in a newer
version. Depending on the position of the TIKA standalone JAR file in the
classpath, you get one of two symptoms: either it fails completely, because
the few classes bundled with TIKA mix with the newer nekohtml-full classes
and create incompatibilities (this happens when TIKA comes before the
project classes, since internal implementations may not stay consistent
across versions), or it works (TIKA comes at the end). If your project has
several such dependencies, this is hard to track down.
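
To find out which copy of a class wins, a small check like the following
could help. It is only a rough sketch: the class name ClasspathDuplicates and
the helper locationsOf are made up for this mail, and the nekohtml class used
as the default argument is just an example. It asks the class loader for
every classpath location that provides a given class; more than one hit means
the class is shadowed, and the first one listed is normally the one that gets
loaded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of TIKA: lists all classpath entries
// that contain a given class, in the order the class loader reports them.
final class ClasspathDuplicates {

    static List<URL> locationsOf(String className, ClassLoader loader) {
        // A class is stored as a resource, e.g. "org/foo/Bar.class".
        String resource = className.replace('.', '/') + ".class";
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Example class only; pass any fully qualified class name as argument.
        String className = args.length > 0
                ? args[0]
                : "org.cyberneko.html.parsers.DOMParser";

        List<URL> hits = locationsOf(className,
                ClasspathDuplicates.class.getClassLoader());

        if (hits.isEmpty()) {
            System.out.println(className + " is not on the classpath");
        } else if (hits.size() == 1) {
            System.out.println(className + " found once: " + hits.get(0));
        } else {
            System.out.println(className + " found " + hits.size()
                    + " times, the first entry normally wins:");
            for (URL url : hits) {
                System.out.println("  " + url);
            }
        }
    }
}
```

Run it with the same classpath as the project. If both the standalone JAR and
a separate nekohtml JAR show up for the same class, you have exactly the
mixing problem described above.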

In my opinion, third-party JARs should *always* be kept in their original JAR
files, with the original name and version number. If you really want to have
something like tika-standalone.jar for the command line interface, clearly
tell end users that it is *not* meant for inclusion in projects and that the
JAR file is *only* for the CLI version (see my other mail mentioning my
Tomcat<6 problems).

If somebody uses TIKA as a parser plugin in their own Java development, they
can look through the supplied JAR files (hopefully these will be supplied in
the binary release soon) and choose the ones they need. For that it would be
good to have a map somewhere: if you want to support this or that document
format, you need JAR files a, b, c.

If they want to minimize them, they can use the tools you proposed. I think
this is the way to go.

Uwe
