I believe this timeout patch may be what you need to apply:

https://issues.apache.org/jira/browse/NUTCH-696



-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of AJ Chen
Sent: Tuesday, July 27, 2010 11:37 AM
To: [email protected]
Subject: Re: parse step hangs

A different hang case:
Tika hangs when parsing an image whose content type is incorrectly set to
text/plain.

2010-07-27 10:42:54,099 DEBUG parse.ParseUtil - Parsing [
http://rsb.info.nih.gov/ij/images/im2.dcm] with
[org.apache.nutch.parse.tika.tikapar...@1c0e6e]

It seems Tika occasionally hangs on some edge cases. This is a tough
problem since we don't know all the edge cases. Is there a timeout mechanism
for the Tika parser? If we can time out the Tika parser on a per-document
basis, the crawl will not stall forever when Tika hangs on a specific document.
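
A minimal sketch of what such a per-document timeout could look like in
plain Java, assuming the parse call can be wrapped in a Callable. The class
and method names here (ParseWithTimeout, runWithTimeout) are hypothetical and
not part of Nutch or Tika, and the NUTCH-696 patch may well take a different
approach:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not a Nutch or Tika class.
final class ParseWithTimeout {

    /**
     * Runs one parse task on a worker thread and waits at most timeoutMillis.
     * Returns the task's result, or null if it failed or timed out.
     */
    static <T> T runWithTimeout(Callable<T> parseTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            return null;
        } catch (ExecutionException e) {
            return null; // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that finishes well within the limit.
        System.out.println(runWithTimeout(() -> "parsed", 1000));

        // A task that simulates a hanging parser by sleeping past the limit.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5000);
            return "late";
        }, 100));
    }
}
```

Note that cancelling only interrupts the worker thread. A parser stuck in a
tight loop that ignores interrupts would keep that thread busy, but the caller
gets control back and the crawl can move on to the next document.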

-aj

On Fri, Jul 23, 2010 at 4:38 PM, AJ Chen <[email protected]> wrote:

> Tika is known to hang on truncated zip files. To prevent this from
> happening, enable the zip parser, i.e. do not use Tika for zip files.
> -aj
>
>
> On Thu, Jul 22, 2010 at 11:21 AM, AJ Chen <[email protected]> wrote:
>
>> In reality, some documents will be trimmed. The parser should just
>> throw an exception instead of hanging on a trimmed document. It would
>> be surprising if Tika were not designed this way. I wonder if there is
>> an unknown bug in the Tika parser that causes the occasional hang.
>>  -aj
>>
>>
>> On Thu, Jul 22, 2010 at 5:01 AM, Julien Nioche <[email protected]> wrote:
>>
>>> This is probably due to the content having been trimmed during
>>> fetching.
>>> Try setting http.content.limit to a larger value.
>>>
>>> On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:
>>>
>>> > Another case where the parser hangs. Debug logging is on.
>>> >
>>> > 2010-07-21 11:49:11,620 WARN  parse.Parser - Error parsing:
>>> > http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67:
>>> > failed(2,0): expected='endstream' actual=''
>>> > org.apache.pdfbox.io.pushbackinputstr...@f3552f
>>> > 2010-07-21 11:49:11,622 INFO  parse.ParserFactory - The parsing plugins:
>>> > [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes
>>> > system property, and all claim to support the content type application/zip,
>>> > but they are not mapped to it  in the parse-plugins.xml file
>>> >
>>> > any idea?
>>> >
>>> > -aj
>>> >
>>> > On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <[email protected]> wrote:
>>> >
>>> > > The log you sent earlier indicated that Tika had no parser for that
>>> > > mime type, which means it is not used for it.
>>> > > It might be hanging, but that would be on a different document and
>>> > > possibly a different mime type.
>>> > >
>>> > > Try setting in log4j.properties
>>> > > log4j.logger.org.apache.nutch=DEBUG
>>> > >
>>> > > and check the logs again
>>> > >
>>> > > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
>>> > >
>>> > > > I set mime.type.magic=false and parsed the segment again. The parser
>>> > > > hung at the same place. Maybe Tika is trapped in an endless loop
>>> > > > after seeing mime-type application/x-sh. Is there a way to configure
>>> > > > Tika to skip mime-type application/x-sh?
>>> > > > thanks,
>>> > > > -aj
>>> > > >
>>> > > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
>>> > > >
>>> > > > > There is another thread reporting hangs during Tika parsing. I'm
>>> > > > > seeing a similar problem now. I'm not sure whether the cause is the
>>> > > > > same, but I want to show the messages logged at the point of hanging.
>>> > > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika
>>> > > > > parser for mime-type application/x-sh
>>> > > > > 2010-07-12 14:36:33,645 WARN  parse.Parser - Error parsing:
>>> > > > > http://rsb.info.nih.gov/ij/download/linux/unix-script.txt: failed(2,0):
>>> > > > > Can't retrieve Tika parser for mime-type application/x-sh
>>> > > > > 2010-07-12 14:36:33,650 INFO  parse.ParserFactory - The parsing plugins:
>>> > > > > [org.apache.nutch.parse.tika.Parser -
>>> > > > > org.apache.nutch.parse.text.TextParser] are enabled via the
>>> > > > > plugin.includes system property, and all claim to support the content
>>> > > > > type text/plain, but they are not mapped to it  in the
>>> > > > > parse-plugins.xml file
>>> > > > >
>>> > > > > my setting:
>>> > > > > mime.type.magic=true
>>> > > > > plugin.includes=...parse-(text|html|js|tika)...
>>> > > > >
>>> > > > > any idea?
>>> > > > > thanks,
>>> > > > > --
>>> > > > > AJ Chen, PhD
>>> > > > > Chair, Semantic Web SIG, sdforum.org
>>> > > > > http://web2express.org
>>> > > > > twitter @web2express
>>> > > > > Palo Alto, CA, USA
>>> > > > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > DigitalPebble Ltd
>>> > >
>>> > > Open Source Solutions for Text Engineering 
>>> > > http://www.digitalpebble.com
>>> > >
>>>
>>
>



