Another case where the parser hangs. Debug logging is on.

2010-07-21 11:49:11,620 WARN parse.Parser - Error parsing: http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@f3552f
2010-07-21 11:49:11,622 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes system property, and all claim to support the content type application/zip, but they are not mapped to it in the parse-plugins.xml file

Any idea?
-aj

On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <[email protected]> wrote:

> The log you sent earlier indicated that Tika had no parser for that mime
> type, which means it is not used for it.
> It might be hanging but that would be on a different document and possibly
> mimetype
>
> Try setting in log4j.properties
> log4j.logger.org.apache.nutch=DEBUG
>
> and check the logs again
>
> On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
>
> > I set mime.type.magic=false, parsed the segment again. the parser got hung
> > up at the same place. maybe tika is trapped in an endless loop after seeing
> > mime-type application/x-sh. is there a way to configure tika to skip
> > mime-type application/x-sh?
> > thanks,
> > -aj
> >
> > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
> >
> > > there is another thread reporting hanging during tika parsing. I'm seeing
> > > a similar problem now. not sure the cause is the same or not, but want to
> > > show the message at the point of hanging.
> > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-sh
> > > 2010-07-12 14:36:33,645 WARN parse.Parser - Error parsing: http://rsb.info.nih.gov/ij/download/linux/unix-script.txt: failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
> > > 2010-07-12 14:36:33,650 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser - org.apache.nutch.parse.text.TextParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it in the parse-plugins.xml file
> > >
> > > my setting:
> > > mime.type.magic=true
> > > plugin.includes=...parse-(text|html|js|tika)...
> > >
> > > any idea?
> > > thanks,
> > > --
> > > AJ Chen, PhD
> > > Chair, Semantic Web SIG, sdforum.org
> > > http://web2express.org
> > > twitter @web2express
> > > Palo Alto, CA, USA
> >
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
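
A generic stopgap for a parser that hangs on a single document is to run each parse call under a time limit, so the run can record the URL and move on to the next one. The following is only a minimal sketch in plain Java under that assumption. It does not use any Nutch or Tika API, and `ParseTimeoutSketch`, `parseWithTimeout`, and the stand-in tasks in `main` are hypothetical names chosen for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch or Tika.
public class ParseTimeoutSketch {

    // Runs the task on a worker thread and returns its result, or the
    // fallback value if the task fails or exceeds the time limit.
    static <T> T parseWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        // Daemon thread, so a worker that never finishes cannot block JVM exit.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code stuck in a loop
            // that ignores interrupts keeps running in the background.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw an exception.
            return fallback;
        } catch (InterruptedException e) {
            // Restore the caller's interrupt flag before giving up.
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a document that parses quickly.
        String fast = parseWithTimeout(() -> "parsed", 1000, "skipped");
        // Stand-in for a document on which the parser hangs.
        String slow = parseWithTimeout(() -> {
            Thread.sleep(5000);
            return "parsed";
        }, 100, "skipped");
        System.out.println(fast + " / " + slow);
    }
}
```

This limits how long the caller waits; it does not fix the underlying hang, and a worker thread that ignores interruption will still consume CPU until the process exits.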

