Nice, thanks for sharing! Did you observe the same speed increase pattern after running this several times, to rule out any cold/hot cache side effects?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Jukka Zitting <jukka.zitt...@gmail.com>
> To: tika-dev@lucene.apache.org
> Sent: Wednesday, June 3, 2009 6:18:02 AM
> Subject: Major speed improvements in package parsing
>
> Hi,
>
> Inspired by TIKA-236, I ran the following ad-hoc test:
>
> $ time java -jar tika-0.3-standalone.jar --text lucene-2.0.0-src.zip > output-0.3.txt
> real 0m29.844s
> user 0m39.686s
> sys 0m0.840s
> $ time java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip > output-0.4.txt
> real 0m12.587s
> user 0m15.911s
> sys 0m0.495s
>
> This is especially impressive as the 0.4 version is able to extract
> almost twice as much text from the archive:
>
> $ du -h output-*
> 6.8M output-0.3.txt
> 13M output-0.4.txt
>
> This speed increase is mostly the result of the TIKA-204 and TIKA-238
> improvements.
>
> Looking deeper at the output reveals some minor issues that I'll be
> filing bugs for. However, in general the result of the extraction
> seems pretty good.
>
> BR,
>
> Jukka Zitting