sorry, what's the compress tika issue?
-aj

On Wed, Aug 4, 2010 at 11:20 AM, brad <[email protected]> wrote:

> I agree. The patch has been applied to Nutch 1.2 and 2.0, so it is in the
> loop. However, my understanding is that the timeout patch won't fix the
> compress tika issue. And I know it didn't for me.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of AJ Chen
> Sent: Wednesday, August 04, 2010 10:56 AM
> To: [email protected]
> Subject: Re: Nutch Parser: Tika hangs on corrupt zip files fix due soon
>
> I also found that tika hangs occasionally on other content types. Until tika
> becomes reliable (meaning: it doesn't just hang unexpectedly but quits when
> it can't handle a doc), I would like to see the parser timeout patch get
> into the new nutch release. I have applied the timeout mechanism to avoid
> the headache.
> -aj
>
> On Wed, Aug 4, 2010 at 10:18 AM, brad <[email protected]> wrote:
>
> > With Nutch 1.1 I have been experiencing consistent hangs on zip files.
> > I have had as many as 15 errant threads in one Nutch process. The
> > issue is caused by Tika (TIKA-401
> > https://issues.apache.org/jira/browse/TIKA-401)
> > or, more accurately, by a library Tika uses: Commons Compress, aka
> > commons-compress-1.0.jar (issue COMPRESS-87
> > https://issues.apache.org/jira/browse/COMPRESS-87 ). The issue is
> > being fixed in Commons Compress 1.1. Commons Compress 1.1 has not
> > been released yet, but is expected to be released any day now.
> >
> > To see if Commons Compress 1.1 fixes the problem, I downloaded the
> > current commons compress-1.1 and built it.
> > Then I renamed .../nutch/plugins/parse-tika/commons-compress-1.0.jar
> > to commons-compress-1.0.jar_orig
> > Then I copied commons-compress-1.1-SNAPSHOT.jar to
> > .../nutch/plugins/parse-tika/commons-compress-1.1.jar
> >
> > I have run 5 large fetch and parse jobs and I have not had a single hang
> > of the tika parser on zip files, and CPU utilization has dropped by half...
> >
> > So it looks like the updated Commons Compress 1.1 will solve the problem.
> >
> > So hopefully this will make it into the next Tika release, which will
> > make it into Nutch 1.2.
> >
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
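
For readers following the thread, here is a minimal sketch of the general idea behind a parser timeout: run the parse on a worker thread and give up waiting after a deadline. This is not the actual Nutch patch. The class name TimedParse and the method runWithTimeout are hypothetical, and the Callable stands in for whatever call invokes the parser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, for illustration only; not part of Nutch or Tika.
public class TimedParse {

    /** Runs the task on a worker thread; returns null if it does not finish in time. */
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Requests an interrupt. A worker stuck in a loop that never
            // checks the interrupt flag keeps running and keeps using CPU.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes quickly, and one that blocks past the deadline.
        String ok = runWithTimeout(() -> "parsed", 1000);
        String hung = runWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 200);
        System.out.println(ok + " / " + hung);
    }
}
```

Note that a timeout of this kind only stops the caller from waiting. If the library underneath spins without ever checking for interruption, the worker thread stays alive, which would be consistent with brad's report that the timeout patch did not help with the zip hangs while replacing the Commons Compress jar did.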

