I have also found that Tika occasionally hangs on other content types. Until
Tika becomes reliable (meaning it quits cleanly when it can't handle a doc
instead of hanging unexpectedly), I would like to see the parser timeout patch
get into the next Nutch release.  I have applied the timeout mechanism myself
to avoid the headache.
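
For anyone wondering what a timeout around a parse call looks like in general,
here is a minimal sketch. It is not the actual Nutch patch: the class name
ParseTimeoutSketch and the method runWithTimeout are made up for illustration,
and the Callable stands in for whatever call invokes the parser. It only uses
java.util.concurrent from the JDK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not code from Nutch or Tika.
public class ParseTimeoutSketch {

    // Runs the task on a worker thread and returns its result,
    // or null if the task does not finish within timeoutMillis.
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker. A task that ignores interrupts keeps
            // running in the background, but the caller is no longer blocked.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The task itself threw; surface the underlying cause.
            throw new Exception("parse failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes well within the limit.
        String ok = runWithTimeout(() -> "parsed", 1000);
        // A task that simulates a hung parser.
        String hung = runWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 200);
        System.out.println(ok);
        System.out.println(hung);
    }
}
```

Note that a timeout like this only unblocks the caller. If the parser is stuck
in a busy loop that never checks for interruption, the worker thread can keep
burning CPU, so this is a workaround and not a fix for the underlying hang.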
-aj

On Wed, Aug 4, 2010 at 10:18 AM, brad <[email protected]> wrote:

>
> With Nutch 1.1 I have been experiencing consistent hangs on zip files.  I
> have had as many as 15 errant threads in one Nutch process.  The issue is
> caused by Tika (TIKA-401 https://issues.apache.org/jira/browse/TIKA-401) or,
> more accurately, by a library Tika uses: Commons Compress, aka
> commons-compress-1.0.jar (issue COMPRESS-87
> https://issues.apache.org/jira/browse/COMPRESS-87 ).  The issue is being
> fixed in Commons Compress 1.1, which has not been released yet but is
> expected any day now.
>
>
> To see if Commons Compress 1.1 fixes the problem,
> I downloaded the current Commons Compress 1.1 and built it.
> Then I renamed .../nutch/plugins/parse-tika/commons-compress-1.0.jar to
> commons-compress-1.0.jar_orig
> Then I copied commons-compress-1.1-SNAPSHOT.jar to
> .../nutch/plugins/parse-tika/commons-compress-1.1.jar
>
> I have run 5 large fetch and parse runs and have not had a single hang of
> the Tika parser on zip files, and CPU utilization has dropped by half...
>
> So it looks like the updated Commons Compress 1.1 will solve the problem.
>
> Hopefully this will make it into the next Tika release, which will in turn
> make it into Nutch 1.2.
>
>
>
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
