The zip parser runs in an infinite loop.  My understanding is that the
timeout patch does not actually kill the parser thread.  So, in the case of
the zip parser, the thread keeps running in its infinite loop and consuming
resources.  At least that has been my experience.
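
To illustrate what I mean, here is a minimal sketch.  It is not the actual
Nutch timeout patch; the class and method names are made up for the example,
and I am assuming the timeout is implemented roughly as a Future with a timed
get() followed by cancel(true).  The stuck "parser" is a loop that never
checks the interrupt flag, so cancelling the Future only marks it cancelled
while the thread keeps spinning:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class, not part of Nutch or Tika.
public class TimeoutDoesNotKill {

    // Counts loop iterations so we can observe whether the thread is still alive.
    static final AtomicLong spins = new AtomicLong();

    static boolean stillRunningAfterTimeout() {
        // Daemon thread, so the JVM can still exit even though the loop never ends.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "fake-parser");
            t.setDaemon(true);
            return t;
        });

        // Stand-in for a parser stuck in a loop that ignores interruption.
        Runnable stuckParser = () -> {
            while (true) {
                spins.incrementAndGet();
            }
        };

        try {
            Future<?> task = pool.submit(stuckParser);
            try {
                task.get(200, TimeUnit.MILLISECONDS); // the "timeout"
            } catch (TimeoutException e) {
                // cancel(true) only interrupts the thread; it cannot force it to stop.
                task.cancel(true);
            }

            // Sample the counter twice after the cancel to see if work continues.
            long before = spins.get();
            Thread.sleep(200);
            long after = spins.get();
            return after > before;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow(); // also just an interrupt; the loop ignores it
        }
    }

    public static void main(String[] args) {
        System.out.println("still spinning after cancel: " + stillRunningAfterTimeout());
    }
}
```

If the real patch works along these lines, the caller gets control back after
the timeout, but each hung parse leaves one thread burning CPU until the
process exits, which would match the errant threads I have been seeing.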



-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of AJ Chen
Sent: Wednesday, August 04, 2010 11:23 AM
To: [email protected]
Subject: Re: Nutch Parser: Tika hangs on corrupt zip files fix due soon

sorry, what's the compress tika issue?
-aj

On Wed, Aug 4, 2010 at 11:20 AM, brad <[email protected]> wrote:

> I agree.  The patch has been applied to Nutch 1.2 and 2.0, so it is in 
> the loop.  However, my understanding is the timeout patch won't fix 
> the compress tika issue.  And I know it didn't for me.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of AJ 
> Chen
> Sent: Wednesday, August 04, 2010 10:56 AM
> To: [email protected]
> Subject: Re: Nutch Parser: Tika hangs on corrupt zip files fix due 
> soon
>
> I have also found that Tika occasionally hangs on other content types.
> Until Tika becomes reliable (meaning it quits when it can't handle a
> doc instead of hanging unexpectedly), I would like to see the parser
> timeout patch get into the new Nutch release.  I have applied the
> timeout mechanism myself to avoid the headache.
> -aj
>
> On Wed, Aug 4, 2010 at 10:18 AM, brad <[email protected]> wrote:
>
> >
> > With Nutch 1.1 I have been experiencing consistent hangs on zip files.
> > I have had as many as 15 errant threads in one Nutch process.  The 
> > issue is caused by Tika (TIKA-401
> > https://issues.apache.org/jira/browse/TIKA-401)
> > or, more accurately, by a library that Tika uses, Commons Compress
> > (commons-compress-1.0.jar; issue COMPRESS-87
> > https://issues.apache.org/jira/browse/COMPRESS-87 ).  The issue is
> > being fixed in Commons Compress 1.1, which has not been released
> > yet but is expected any day now.
> >
> >
> > To see if Commons Compress 1.1 fixes the problem, I downloaded the
> > current Commons Compress 1.1 code and built it.
> > Then I renamed .../nutch/plugins/parse-tika/commons-compress-1.0.jar
> > to commons-compress-1.0.jar_orig.  Then I copied
> > commons-compress-1.1-SNAPSHOT.jar to
> > .../nutch/plugins/parse-tika/commons-compress-1.1.jar
> >
> > I have run 5 large fetches and parses without a single hang of the
> > Tika parser on zip files, and CPU utilization has dropped by half...
> >
> > So it looks like the updated Commons Compress 1.1 will solve the
> > problem.
> >
> > Hopefully this will make it into the next Tika release, which will
> > in turn make it into Nutch 1.2.
> >
> >
> >
> >
> >
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>
>


--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
