Sorry, what's the compress Tika issue?
-aj

On Wed, Aug 4, 2010 at 11:20 AM, brad <[email protected]> wrote:

> I agree.  The patch has been applied to Nutch 1.2 and 2.0, so it is in the
> loop.  However, my understanding is that the timeout patch won't fix the
> compress Tika issue, and I know it didn't for me.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of AJ Chen
> Sent: Wednesday, August 04, 2010 10:56 AM
> To: [email protected]
> Subject: Re: Nutch Parser: Tika hangs on corrupt zip files fix due soon
>
> I have also found that Tika occasionally hangs on other content types. Until
> Tika becomes reliable (meaning it quits when it can't handle a document
> instead of hanging unexpectedly), I would like to see the parser timeout
> patch get into the next Nutch release.  I have applied the timeout mechanism
> to avoid the headache.
> -aj
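
For readers following the thread, here is a minimal sketch of what a parser
timeout of this kind can look like in plain Java. It is not the actual Nutch
patch: the class name ParseTimeoutSketch, the method parseWithTimeout, and the
fallback value are all hypothetical names chosen for illustration. The idea is
to run the parse on a worker thread and give up waiting after a deadline.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not code from Nutch or Tika.
class ParseTimeoutSketch {

    // Runs parseTask on a worker thread and waits at most timeoutMillis.
    // Returns the task's result, or fallback if it times out or fails.
    public static <T> T parseWithTimeout(Callable<T> parseTask,
                                         long timeoutMillis,
                                         T fallback) {
        // Daemon worker so a stuck parse cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Stop waiting and request interruption of the worker.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The parse itself threw an exception.
            return fallback;
        } catch (InterruptedException e) {
            // Preserve the caller's interrupt status.
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A parse that finishes in time.
        String ok = parseWithTimeout(() -> "parsed text", 1000, "PARSE_FAILED");
        // A simulated hang: sleeps far longer than the timeout.
        String hung = parseWithTimeout(() -> {
            Thread.sleep(5000);
            return "too late";
        }, 100, "PARSE_FAILED");
        System.out.println(ok);
        System.out.println(hung);
    }
}
```

One caveat with this approach: cancel(true) only requests interruption. A
worker stuck in a loop that never checks for interruption keeps running and
keeps using CPU, so the caller is unblocked but the thread is not reclaimed.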
>
> On Wed, Aug 4, 2010 at 10:18 AM, brad <[email protected]> wrote:
>
> >
> > With Nutch 1.1 I have been experiencing consistent hangs on zip files.
> > I have had as many as 15 errant threads in one Nutch process.  The
> > issue is caused by Tika (TIKA-401,
> > https://issues.apache.org/jira/browse/TIKA-401)
> > or, more accurately, by a library Tika uses: Commons Compress, aka
> > commons-compress-1.0.jar (issue COMPRESS-87,
> > https://issues.apache.org/jira/browse/COMPRESS-87).  The issue is
> > being fixed in Commons Compress 1.1, which has not been released yet
> > but is expected any day now.
> >
> >
> > To see if Commons Compress 1.1 fixes the problem, I downloaded the
> > current Commons Compress 1.1 and built it.
> > Then I renamed .../nutch/plugins/parse-tika/commons-compress-1.0.jar
> > to commons-compress-1.0.jar_orig.  Then I copied
> > commons-compress-1.1-SNAPSHOT.jar to
> > .../nutch/plugins/parse-tika/commons-compress-1.1.jar
> >
> > I have run 5 large fetches and parses and have not had a single hang
> > of the Tika parser on zip files, and CPU utilization has dropped by half.
> >
> > So it looks like the updated Commons Compress 1.1 will solve the problem.
> >
> > Hopefully this will make it into the next Tika release, which will make
> > it into Nutch 1.2.
> >
> >
> >
> >
> >
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
