Nice, thanks for sharing! Did you observe the same speed increase pattern after
running this several times to avoid any cold/hot cache side-effects?
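
As a rough sketch of what I mean: something like the following would repeat a
run so the first (likely cold-cache) pass can be compared with later ones. The
time_runs helper is hypothetical, not part of Tika, and date +%s only gives
whole-second resolution, which should be enough for runs in the 10-30 second
range.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print the wall-clock
# seconds of each run. Run 1 is likely the cold-cache case; later runs
# should reflect warm-cache behavior.
time_runs() {
    runs=$1
    shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@" > /dev/null 2>&1   # discard output; only the timing matters here
        end=$(date +%s)
        echo "run $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Example usage (assumes the jar and archive are in the current directory):
# time_runs 5 java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip
```

If runs 2 through 5 look like run 1 for both versions, caching is probably not
what explains the difference.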

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Jukka Zitting <jukka.zitt...@gmail.com>
> To: tika-dev@lucene.apache.org
> Sent: Wednesday, June 3, 2009 6:18:02 AM
> Subject: Major speed improvements in package parsing
> 
> Hi,
> 
> Inspired by TIKA-236, I ran the following ad-hoc test:
> 
> $ time java -jar tika-0.3-standalone.jar --text lucene-2.0.0-src.zip > output-0.3.txt
> real    0m29.844s
> user    0m39.686s
> sys    0m0.840s
> $ time java -jar tika-app-0.4-SNAPSHOT.jar --text lucene-2.0.0-src.zip > output-0.4.txt
> real    0m12.587s
> user    0m15.911s
> sys    0m0.495s
> 
> This is especially impressive as the 0.4 version is able to extract
> almost twice as much text from the archive:
> 
> $ du -h output-*
> 6.8M    output-0.3.txt
> 13M    output-0.4.txt
> 
> This speed increase is mostly the result of the TIKA-204 and TIKA-238
> improvements.
> 
> Looking deeper at the output reveals some minor issues that I'll be
> filing bugs for. However, in general the result of the extraction
> seems pretty good.
> 
> BR,
> 
> Jukka Zitting
