With Nutch 1.1 I have been experiencing consistent hangs on zip files.  I
have had as many as 15 errant threads in one Nutch process.  The issue is
caused by Tika (TIKA-401 https://issues.apache.org/jira/browse/TIKA-401) or,
more accurately, by a library that Tika uses: Commons Compress, aka
commons-compress-1.0.jar (issue COMPRESS-87
https://issues.apache.org/jira/browse/COMPRESS-87 ).  The issue is being
fixed in Commons Compress 1.1, which has not been released yet but is
expected any day now.


To see whether Commons Compress 1.1 fixes the problem, I did the following:

1. Downloaded the current Commons Compress 1.1 source and built it.
2. Renamed .../nutch/plugins/parse-tika/commons-compress-1.0.jar to
   commons-compress-1.0.jar_orig
3. Copied commons-compress-1.1-SNAPSHOT.jar to
   .../nutch/plugins/parse-tika/commons-compress-1.1.jar
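
For anyone who wants to repeat the jar swap, here is a minimal Python sketch
of the rename-and-copy steps.  It is only an illustration, not what I
actually ran: the function name swap_jar and its parameters are made up for
this example, and you have to supply your own plugin directory and the path
to the jar you built.

```python
import shutil
from pathlib import Path


def swap_jar(plugin_dir, old_name, new_jar, new_name):
    """Move the old jar aside with an _orig suffix and copy the new jar in.

    Hypothetical helper for illustration only; all arguments are supplied
    by the caller and nothing here is specific to Nutch or Tika.
    """
    plugin_dir = Path(plugin_dir)
    old_jar = plugin_dir / old_name
    backup = plugin_dir / (old_name + "_orig")

    # Refuse to continue if the layout is not what we expect.
    if not old_jar.exists():
        raise FileNotFoundError(old_jar)
    if backup.exists():
        raise FileExistsError(backup)

    # Keep the original jar so the change can be reverted later.
    old_jar.rename(backup)

    # Copy the freshly built jar into the plugin directory under its new name.
    target = plugin_dir / new_name
    shutil.copy2(new_jar, target)
    return backup, target
```

You would call it with your own parse-tika plugin directory,
"commons-compress-1.0.jar" as the old name, the path to the built
commons-compress-1.1-SNAPSHOT.jar, and "commons-compress-1.1.jar" as the
new name.  To revert, delete the new jar and rename the _orig file back.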

I have since run 5 large fetch-and-parse cycles without a single hang of the
Tika parser on zip files, and CPU utilization has dropped by about half.

So it looks like the updated Commons Compress 1.1 will solve the problem.

Hopefully the fix will make it into the next Tika release, which in turn
will make it into Nutch 1.2.



