Hello Stephane,

> If the size of the dependencies is the main concern, we could use the
> minijar maven plugin (http://mojo.codehaus.org/minijar-maven-plugin/)
> which creates a mini version of the dependencies used by tika. The plugin
> analyses classes and only keeps the ones actually used by Tika. If we
> need something more powerful we could then use Proguard:
> http://proguard.sourceforge.net/
But then you still have the problem of polluting the classpath with class files you cannot control. Suppose a project uses the minified standalone Tika JAR and also needs one of the bundled dependencies for its own development (e.g. nekohtml), but needs more of its classes than the bundle contains, because it relies on some extra feature of nekohtml. Then you have a problem. The project may choose to add nekohtml to its own lib path, maybe in a newer version. Depending on the position of the Tika standalone JAR in the classpath, you get one of two symptoms: if Tika comes before the project's classes, it fails completely, because the few classes bundled with Tika mix with the newer ones from the full nekohtml JAR, and this creates incompatibilities (internal implementations are not guaranteed to stay consistent across versions); if Tika comes at the end, it works. If you have more such dependencies in your project, the cause is hard to figure out.

In my opinion, third-party JARs should *always* be kept in their original JAR files, with the original name and version number. If you really want something like tika-standalone.jar for the command line interface, clearly tell the end user that it is *not* meant to be included in projects and that the JAR file is *only* for the CLI version (see my other mail mentioning my Tomcat<6 problems).

If somebody uses Tika as a parser plugin in their own Java development, they can scan the supplied JAR files (hopefully these will be supplied in the binary release soon) and choose the ones they need. For that it would be good to have a map somewhere: if you want to support this and that document format, you need JAR files a, b, c. If they want to minimize the JARs, they can use the tools you proposed.

I think this is the way to go.

Uwe
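
P.S. To make the classpath problem above easier to spot, here is a minimal sketch (my own illustration, not part of Tika) that asks the class loader for every location supplying a given class. The class name `ClasspathCopies` and the method `copiesOf` are made up for this example; the only library call it relies on is the standard `ClassLoader.getResources`. With the usual parent-first class loaders the first location listed is typically the one that wins, but that is an assumption that may not hold in containers with their own loading rules.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ClasspathCopies {

    // Lists every location that supplies the given class, in the order
    // the class loader reports them. An empty list means "not found".
    static List<URL> copiesOf(String className) {
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = ClasspathCopies.class.getClassLoader();
        try {
            return Collections.list(loader.getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the class to check as the first argument; the default
        // below is only a placeholder so the program runs without one.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        List<URL> copies = copiesOf(className);

        System.out.println(className + ": " + copies.size() + " location(s)");
        for (URL url : copies) {
            System.out.println("  " + url);
        }
        if (copies.size() > 1) {
            System.out.println("More than one JAR supplies this class.");
        }
    }
}
```

Run it with the same classpath as the project and the name of a class you suspect is duplicated, for instance a nekohtml class. If both a bundled standalone JAR and a separate nekohtml JAR show up, you have exactly the mix described above.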