Yes, the parser timeout code works. Thank you, everyone. -aj
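
For anyone reading this thread later, here is a minimal sketch of the general idea of a per-document parse timeout. It is not the NUTCH-696 patch itself and does not use the Nutch or Tika APIs; the class name ParseTimeout, the method parseWithTimeout, and the Callable standing in for the real parse call are all hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: not part of Nutch or Tika.
public class ParseTimeout {

    /**
     * Runs one parse call on a worker thread and gives up after timeoutMillis.
     * Returns the parse result, or null if the call timed out or threw.
     */
    public static <T> T parseWithTimeout(Callable<T> parseCall, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parser thread must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(parseCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; a tight loop may ignore it
            return null;
        } catch (ExecutionException e) {
            return null; // the parse call itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast "parse" returns its value; a slow one is abandoned after 200 ms.
        String ok = parseWithTimeout(() -> "parsed", 1000);
        String hung = parseWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 200);
        System.out.println(ok);
        System.out.println(hung);
    }
}
```

One caveat with this approach: cancel(true) only interrupts the worker thread. A parser stuck in a loop that never checks for interrupts keeps running in the background, which is why the sketch marks the worker as a daemon thread so the crawl can still move on and the JVM can still exit.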

On Tue, Jul 27, 2010 at 11:54 AM, brad <[email protected]> wrote:

> I believe this timeout patch may be what you need to apply:
>
> https://issues.apache.org/jira/browse/NUTCH-696
>
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of AJ Chen
> Sent: Tuesday, July 27, 2010 11:37 AM
> To: [email protected]
> Subject: Re: parse step hangs
>
> A different hang case:
> Tika hangs when parsing an image whose content type is incorrectly set to
> text/plain.
>
> 2010-07-27 10:42:54,099 DEBUG parse.ParseUtil - Parsing [
> http://rsb.info.nih.gov/ij/images/im2.dcm] with
> [org.apache.nutch.parse.tika.tikapar...@1c0e6e]
>
> It seems Tika hangs occasionally on some edge cases. This is a tough
> problem since we don't know all the edge cases. Is there a timeout
> mechanism for the Tika parser? If we can time out the Tika parser on a
> per-document basis, the crawl will not stall forever when Tika hangs on a
> specific document.
>
> -aj
>
> On Fri, Jul 23, 2010 at 4:38 PM, AJ Chen <[email protected]> wrote:
>
> > Tika is known to hang on truncated zip files. To prevent this from
> > happening, enable the zip parser, i.e. do not use Tika for zip files.
> > -aj
> >
> >
> > On Thu, Jul 22, 2010 at 11:21 AM, AJ Chen <[email protected]> wrote:
> >
> >> In reality, some documents will be trimmed. The parser should just
> >> throw an exception instead of hanging on a trimmed document. I would
> >> be surprised if Tika were not designed this way. I wonder if there is
> >> an unknown bug in the Tika parser that causes the occasional hang.
> >> -aj
> >>
> >>
> >> On Thu, Jul 22, 2010 at 5:01 AM, Julien Nioche <
> >> [email protected]> wrote:
> >>
> >>> This is probably due to the content having been trimmed during
> >>> fetching.
> >>> Try setting http.content.limit to a larger value.
> >>>
> >>> On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:
> >>>
> >>> > Another case where the parser hangs. Debug logging is on.
> >>> >
> >>> > 2010-07-21 11:49:11,620 WARN  parse.Parser - Error parsing:
> >>> > http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67:
> >>> > failed(2,0): expected='endstream' actual=''
> >>> > org.apache.pdfbox.io.pushbackinputstr...@f3552f
> >>> > 2010-07-21 11:49:11,622 INFO  parse.ParserFactory - The parsing plugins:
> >>> > [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes
> >>> > system property, and all claim to support the content type application/zip,
> >>> > but they are not mapped to it  in the parse-plugins.xml file
> >>> >
> >>> > any idea?
> >>> >
> >>> > -aj
> >>> >
> >>> > On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <
> >>> > [email protected]> wrote:
> >>> >
> >>> > > The log you sent earlier indicated that Tika had no parser for that
> >>> > > MIME type, which means it is not used for it.
> >>> > > It might be hanging, but that would be on a different document and
> >>> > > possibly a different MIME type.
> >>> > >
> >>> > > Try setting in log4j.properties
> >>> > > log4j.logger.org.apache.nutch=DEBUG
> >>> > >
> >>> > > and check the logs again
> >>> > >
> >>> > > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
> >>> > >
> >>> > > > I set mime.type.magic=false and parsed the segment again. The
> >>> > > > parser hung at the same place. Maybe Tika is trapped in an endless
> >>> > > > loop after seeing mime-type application/x-sh. Is there a way to
> >>> > > > configure Tika to skip mime-type application/x-sh?
> >>> > > > thanks,
> >>> > > > -aj
> >>> > > >
> >>> > > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
> >>> > > >
> >>> > > > > There is another thread reporting a hang during Tika parsing. I'm
> >>> > > > > seeing a similar problem now. I'm not sure whether the cause is
> >>> > > > > the same, but I want to show the messages at the point of the hang.
> >>> > > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika
> >>> > > > > parser for mime-type application/x-sh
> >>> > > > > 2010-07-12 14:36:33,645 WARN  parse.Parser - Error parsing:
> >>> > > > > http://rsb.info.nih.gov/ij/download/linux/unix-script.txt: failed(2,0):
> >>> > > > > Can't retrieve Tika parser for mime-type application/x-sh
> >>> > > > > 2010-07-12 14:36:33,650 INFO  parse.ParserFactory - The parsing plugins:
> >>> > > > > [org.apache.nutch.parse.tika.Parser -
> >>> > > > > org.apache.nutch.parse.text.TextParser] are enabled via the plugin.includes
> >>> > > > > system property, and all claim to support the content type text/plain,
> >>> > > > > but they are not mapped to it  in the parse-plugins.xml file
> >>> > > > >
> >>> > > > > my setting:
> >>> > > > > mime.type.magic=true
> >>> > > > > plugin.includes=...parse-(text|html|js|tika)...
> >>> > > > >
> >>> > > > > any idea?
> >>> > > > > thanks,
> >>> > > > > --
> >>> > > > > AJ Chen, PhD
> >>> > > > > Chair, Semantic Web SIG, sdforum.org http://web2express.org
> >>> > > > > twitter @web2express Palo Alto, CA, USA
> >>> > > --
> >>> > > DigitalPebble Ltd
> >>> > >
> >>> > > Open Source Solutions for Text Engineering
> >>> > > http://www.digitalpebble.com


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
