Hi,

Inspired by TIKA-236, I ran the following ad-hoc test:

    $ time java -jar tika-0.3-standalone.jar --text lucene-2.0.0-src.zip > output-0.3.txt

    real    0m29.844s
    user    0m39.686s
    sys     0m0.840s

    $ time java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip > output-0.4.txt

    real    0m12.587s
    user    0m15.911s
    sys     0m0.495s

This is especially impressive as the 0.4 version is able to extract
almost twice as much text from the archive:

    $ du -h output-*
    6.8M    output-0.3.txt
    13M     output-0.4.txt

This speed increase is mostly the result of the TIKA-204 and TIKA-238
improvements.

Looking deeper at the output reveals some minor issues that I'll be
filing bugs for. However, in general the result of the extraction
seems pretty good.

BR,

Jukka Zitting
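
Note: the figures above can be turned into rough ratios with a few lines of arithmetic. The following is only a minimal sketch, not part of the original test; the function name `ratio` and the variable names are hypothetical, and the inputs are simply the `time` and `du -h` values quoted in this message. Since `du -h` rounds its output, the size ratio is approximate.

```python
# Rough ratios derived from the timings and sizes quoted in this message.
# All names here are illustrative; the numbers are copied from the output above.

def ratio(before: float, after: float) -> float:
    """Return how many times larger `before` is than `after`."""
    return before / after

# Wall-clock ("real") and CPU ("user") seconds for Tika 0.3 vs 0.4-SNAPSHOT.
real_03, real_04 = 29.844, 12.587
user_03, user_04 = 39.686, 15.911

# Extracted text size in megabytes as reported by `du -h` (rounded values).
size_03_mb, size_04_mb = 6.8, 13.0

wall_speedup = ratio(real_03, real_04)      # wall-clock speedup of 0.4 over 0.3
cpu_speedup = ratio(user_03, user_04)       # user CPU time speedup
text_growth = ratio(size_04_mb, size_03_mb) # how much more text 0.4 extracted

print(f"wall-clock speedup: {wall_speedup:.2f}x")
print(f"user CPU speedup:   {cpu_speedup:.2f}x")
print(f"extracted text:     {text_growth:.2f}x as much")
```

On these numbers the wall-clock speedup comes out at a bit under 2.4x, while the extracted text grows by roughly 1.9x, which matches the "almost twice as much" observation above.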