In practice, some documents will be trimmed during fetching. The parser should simply throw an exception instead of hanging on a trimmed document. I would be surprised if Tika were not designed this way. I wonder whether there is an unknown bug in the Tika parser that occasionally causes the hang.

-aj
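
Until the cause is found, one possible workaround is to run each parse call under a time limit, so that a hang on one document surfaces as an exception instead of stalling the whole segment. Below is a minimal sketch in plain Java. `ParseWithTimeout` and `callWithTimeout` are hypothetical names used only for illustration; they are not part of Nutch or Tika, and the task passed in stands for whatever code wraps the real parse call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not a Nutch or Tika class.
class ParseWithTimeout {

    // Runs the task on a worker thread and waits at most `timeout` for its result.
    // Throws TimeoutException if the task has not finished in time, or
    // ExecutionException if the task itself failed.
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        // Daemon thread: a worker that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupt the worker; this only helps if the task reacts to interrupts.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes in time returns its result.
        System.out.println(callWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS));

        // A task that takes too long is abandoned with a TimeoutException.
        try {
            callWithTimeout(() -> {
                Thread.sleep(60_000); // stands in for a parse call that hangs
                return "never";
            }, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("parse timed out");
        }
    }
}
```

One caveat: cancelling only interrupts the worker thread. If the parser is stuck in a loop that ignores interrupts, that thread keeps running in the background, so this limits the damage of a hang but does not fix it.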

On Thu, Jul 22, 2010 at 5:01 AM, Julien Nioche <[email protected]> wrote:

> This is probably due to the content having been trimmed during the
> fetching.
> Try setting http.content.limit to a larger value
>
> On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:
>
> > Another case where the parser hangs up. debug is on for logging.
> >
> > 2010-07-21 11:49:11,620 WARN parse.Parser - Error parsing: http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@f3552f
> > 2010-07-21 11:49:11,622 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes system property, and all claim to support the content type application/zip, but they are not mapped to it in the parse-plugins.xml file
> >
> > any idea?
> >
> > -aj
> >
> > On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <[email protected]> wrote:
> >
> > > The log you sent earlier indicated that Tika had no parser for the that
> > > mime type, which means it not used for it.
> > > It might be hanging but that would be on a different document and
> > > possibly mimetype
> > >
> > > Try setting in log4j.properties
> > > log4j.logger.org.apache.nutch=DEBUG
> > >
> > > and check the logs again
> > >
> > > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
> > >
> > > > I set mime.type.magic=false, parsed the segment again. the parser got
> > > > hung up at the same place. maybe tika is trapped into a endless loop
> > > > after seeing mime-type application/x-sh. is there a way to configure
> > > > tika to skip mime-type application/x-sh?
> > > > thanks,
> > > > -aj
> > > >
> > > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
> > > >
> > > > > there is another thread reporting hanging during tika parsing. I'm
> > > > > seeing similar problem now. not sure the cause is the same or not,
> > > > > but what to show the message at the point of hanging.
> > > > >
> > > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-sh
> > > > > 2010-07-12 14:36:33,645 WARN parse.Parser - Error parsing: http://rsb.info.nih.gov/ij/download/linux/unix-script.txt: failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
> > > > > 2010-07-12 14:36:33,650 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser - org.apache.nutch.parse.text.TextParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it in the parse-plugins.xml file
> > > > >
> > > > > my setting:
> > > > > mime.type.magic=true
> > > > > plugin.includes=...parse-(text|html|js|tika)...
> > > > >
> > > > > any idea?
> > > > > thanks,
> > > > > --
> > > > > AJ Chen, PhD
> > > > > Chair, Semantic Web SIG, sdforum.org
> > > > > http://web2express.org
> > > > > twitter @web2express
> > > > > Palo Alto, CA, USA
> > > >
> > > > --
> > > > AJ Chen, PhD
> > > > Chair, Semantic Web SIG, sdforum.org
> > > > http://web2express.org
> > > > twitter @web2express
> > > > Palo Alto, CA, USA
> > >
> > > --
> > > DigitalPebble Ltd
> > >
> > > Open Source Solutions for Text Engineering
> > > http://www.digitalpebble.com
> >
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

