With Nutch 1.1 I have been experiencing consistent hangs on zip files, with as many as 15 errant threads in a single Nutch process. The issue is caused by Tika (TIKA-401, https://issues.apache.org/jira/browse/TIKA-401) or, more accurately, by a library Tika uses: Commons Compress, i.e. commons-compress-1.0.jar (COMPRESS-87, https://issues.apache.org/jira/browse/COMPRESS-87). The issue is being fixed in Commons Compress 1.1, which has not been released yet but is expected any day now.
To see if Commons Compress 1.1 fixes the problem, I downloaded the current Commons Compress 1.1 code and built it. Then I renamed .../nutch/plugins/parse-tika/commons-compress-1.0.jar to commons-compress-1.0.jar_orig and copied commons-compress-1.1-SNAPSHOT.jar to .../nutch/plugins/parse-tika/commons-compress-1.1.jar.

Since then I have run 5 large fetch and parse jobs without a single hang of the Tika parser on zip files, and CPU utilization has dropped by half. So it looks like the updated Commons Compress 1.1 will solve the problem. Hopefully it will make it into the next Tika release, which in turn will make it into Nutch 1.2.
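For anyone who wants to repeat the jar swap described above on several machines, here is a minimal sketch in Java. The class name SwapCommonsCompress and the swap method are made up for illustration and are not part of Nutch or Tika. The plugin directory and the location of the locally built snapshot jar are passed in as arguments, since they depend on your installation. Only the three jar file names come from the steps above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Nutch or Tika: it only automates the
// manual rename-and-copy steps described in the post.
public class SwapCommonsCompress {

    /**
     * Sets the old 1.0 jar aside as commons-compress-1.0.jar_orig and copies
     * the locally built jar into the plugin directory as
     * commons-compress-1.1.jar. Returns the path of the installed jar.
     */
    public static Path swap(Path pluginDir, Path newJar) throws IOException {
        Path oldJar = pluginDir.resolve("commons-compress-1.0.jar");
        Path backup = pluginDir.resolve("commons-compress-1.0.jar_orig");
        Path target = pluginDir.resolve("commons-compress-1.1.jar");

        // Keep the original jar so the change can be reverted. The _orig
        // suffix means the plugin no longer sees it as a .jar file.
        if (Files.exists(oldJar)) {
            Files.move(oldJar, backup, StandardCopyOption.REPLACE_EXISTING);
        }

        // Install the snapshot build under the 1.1 file name.
        Files.copy(newJar, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println(
                "usage: SwapCommonsCompress <parse-tika plugin dir> <built snapshot jar>");
            return;
        }
        Path installed = swap(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Installed " + installed);
    }
}
```

To revert, delete commons-compress-1.1.jar and rename commons-compress-1.0.jar_orig back to commons-compress-1.0.jar. If your parse-tika plugin.xml lists its library jars by file name, that entry may also need to match the new file name; I have not verified this here, so treat it as something to check.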

