yes, the parser timeout code works. Thank you everyone.
-aj

On Tue, Jul 27, 2010 at 11:54 AM, brad <[email protected]> wrote:
> I believe this timeout patch may be what you need to apply:
>
> https://issues.apache.org/jira/browse/NUTCH-696
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of AJ Chen
> Sent: Tuesday, July 27, 2010 11:37 AM
> To: [email protected]
> Subject: Re: parse step hangs
>
> A different hanging up case:
> Tika hangs up when parsing an image whose content-type is incorrectly
> set to text/plain.
>
> 2010-07-27 10:42:54,099 DEBUG parse.ParseUtil - Parsing [http://rsb.info.nih.gov/ij/images/im2.dcm] with [org.apache.nutch.parse.tika.tikapar...@1c0e6e]
>
> It seems tika hangs up occasionally with some edge cases. This is a
> tough problem since we don't know all the edge cases. Is there a
> timeout mechanism for tika parser? If we can timeout tika parser on a
> document basis, the crawling will not stall forever when tika hangs on
> a specific document.
>
> -aj
>
> On Fri, Jul 23, 2010 at 4:38 PM, AJ Chen <[email protected]> wrote:
>
> > Tika is known to hang up on truncated zip file. To prevent this from
> > happening, enable zip parser, i.e. not to use tika for zip file.
> > -aj
> >
> > On Thu, Jul 22, 2010 at 11:21 AM, AJ Chen <[email protected]> wrote:
> >
> >> in reality, some documents will be trimmed. the parser should just
> >> throw an error exception instead of hanging up on a trimmed document.
> >> I would be surprised if tika is not designed this way. I wonder if
> >> there is an unknown bug in tika parser that causes the hanging
> >> occasionally.
> >> -aj
> >>
> >> On Thu, Jul 22, 2010 at 5:01 AM, Julien Nioche <[email protected]> wrote:
> >>
> >>> This is probably due to the content having been trimmed during the
> >>> fetching.
> >>> Try setting http.content.limit to a larger value
> >>>
> >>> On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:
> >>>
> >>> > Another case where the parser hangs up. debug is on for logging.
> >>> >
> >>> > 2010-07-21 11:49:11,620 WARN parse.Parser - Error parsing: http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@f3552f
> >>> > 2010-07-21 11:49:11,622 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes system property, and all claim to support the content type application/zip, but they are not mapped to it in the parse-plugins.xml file
> >>> >
> >>> > any idea?
> >>> >
> >>> > -aj
> >>> >
> >>> > On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <[email protected]> wrote:
> >>> >
> >>> > > The log you sent earlier indicated that Tika had no parser for
> >>> > > that mime type, which means it is not used for it.
> >>> > > It might be hanging but that would be on a different document
> >>> > > and possibly mimetype
> >>> > >
> >>> > > Try setting in log4j.properties
> >>> > > log4j.logger.org.apache.nutch=DEBUG
> >>> > >
> >>> > > and check the logs again
> >>> > >
> >>> > > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
> >>> > >
> >>> > > > I set mime.type.magic=false, parsed the segment again. the
> >>> > > > parser got hung up at the same place. maybe tika is trapped
> >>> > > > in an endless loop after seeing mime-type application/x-sh.
> >>> > > > is there a way to configure tika to skip mime-type
> >>> > > > application/x-sh?
> >>> > > > thanks,
> >>> > > > -aj
> >>> > > >
> >>> > > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
> >>> > > >
> >>> > > > > there is another thread reporting hanging during tika
> >>> > > > > parsing. I'm seeing a similar problem now. not sure the
> >>> > > > > cause is the same or not, but I want to show the message
> >>> > > > > at the point of hanging.
> >>> > > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-sh
> >>> > > > > 2010-07-12 14:36:33,645 WARN parse.Parser - Error parsing: http://rsb.info.nih.gov/ij/download/linux/unix-script.txt: failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
> >>> > > > > 2010-07-12 14:36:33,650 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser - org.apache.nutch.parse.text.TextParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it in the parse-plugins.xml file
> >>> > > > >
> >>> > > > > my setting:
> >>> > > > > mime.type.magic=true
> >>> > > > > plugin.includes=...parse-(text|html|js|tika)...
> >>> > > > >
> >>> > > > > any idea?
> >>> > > > > thanks,
> >>> > > > > --
> >>> > > > > AJ Chen, PhD
> >>> > > > > Chair, Semantic Web SIG, sdforum.org
> >>> > > > > http://web2express.org
> >>> > > > > twitter @web2express
> >>> > > > > Palo Alto, CA, USA
> >>> > >
> >>> > > --
> >>> > > DigitalPebble Ltd
> >>> > >
> >>> > > Open Source Solutions for Text Engineering
> >>> > > http://www.digitalpebble.com
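
For anyone reading this thread later: the per-document timeout asked about above ("If we can timeout tika parser on a document basis") can be sketched roughly as below. This is only an illustration of the general idea, not the NUTCH-696 patch itself, whose code is not shown in this thread. The class name ParseWithTimeout, the method callWithTimeout, and the fallback value are hypothetical; the sketch assumes the parse call can be wrapped in a java.util.concurrent.Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch or Tika.
public class ParseWithTimeout {

    /**
     * Runs the task on a separate daemon thread and waits at most
     * timeoutMillis for it. Returns the fallback if the task times out,
     * throws, or the caller is interrupted.
     */
    public static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        // Daemon thread: a worker that ignores interruption cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; give up on this document
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for a real parse call: one returns at once, one "hangs".
        String ok = callWithTimeout(() -> "parsed", 1000L, "timed out");
        String hung = callWithTimeout(() -> {
            Thread.sleep(60_000L);
            return "parsed";
        }, 200L, "timed out");
        System.out.println(ok + " / " + hung); // prints "parsed / timed out"
    }
}
```

One caveat with this approach: cancelling the Future only interrupts the worker thread. A parser stuck in a loop that never checks for interruption would keep running in the background until the JVM exits, so the crawl moves on but the CPU cost remains.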

