Here is another case where the parser hangs. Debug logging is on.

2010-07-21 11:49:11,620 WARN  parse.Parser - Error parsing: http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@f3552f
2010-07-21 11:49:11,622 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes system property, and all claim to support the content type application/zip, but they are not mapped to it  in the parse-plugins.xml file
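
To see which document was handled last before a hang, the "Error parsing" records can be pulled out of the log with a small script. The following is only a rough sketch: it assumes each record sits on a single line in the log file and follows the layout seen in the lines above (timestamp, level, logger, " - ", message). The name `parse_errors` and both regular expressions are made up for illustration and are not part of Nutch or Tika.

```python
import re

# Assumed log4j layout, taken from the excerpts: timestamp, level, logger, message.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)\s+-\s+"
    r"(?P<msg>.*)$"
)

# Splits an "Error parsing: <url>: <reason>" message into URL and reason.
# The lazy \S+? stops at the first colon that is followed by whitespace,
# so the colon inside "http://" is not mistaken for the separator.
ERROR_PARSING = re.compile(r"^Error parsing:\s+(?P<url>\S+?):\s+(?P<reason>.*)$")


def parse_errors(lines):
    """Return (timestamp, url, reason) for every parse.Parser 'Error parsing' record."""
    found = []
    for line in lines:
        m = LOG_LINE.match(line.rstrip("\n"))
        if not m or m.group("logger") != "parse.Parser":
            continue
        e = ERROR_PARSING.match(m.group("msg"))
        if e:
            found.append((m.group("ts"), e.group("url"), e.group("reason")))
    return found


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python parse_errors.py hadoop.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for ts, url, reason in parse_errors(fh):
            print(ts, url, reason, sep="\t")
```

The last record printed only shows the last document that failed with a logged error. A document that hangs the parser may log nothing at all, so this narrows the search rather than pinpointing the culprit.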

any idea?

-aj

On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <
[email protected]> wrote:

> The log you sent earlier indicated that Tika had no parser for that mime
> type, which means it is not used for it.
> It might be hanging, but that would be on a different document and possibly
> a different mime type.
>
> Try setting in log4j.properties
> log4j.logger.org.apache.nutch=DEBUG
>
> and check the logs again
>
> On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
>
> > I set mime.type.magic=false and parsed the segment again. The parser hung
> > at the same place. Maybe Tika is trapped in an endless loop after seeing
> > mime-type application/x-sh. Is there a way to configure Tika to skip
> > mime-type application/x-sh?
> > thanks,
> > -aj
> >
> > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
> >
> > > There is another thread reporting hanging during Tika parsing. I'm seeing
> > > a similar problem now. Not sure whether the cause is the same, but I want
> > > to show the messages at the point of hanging.
> > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-sh
> > > 2010-07-12 14:36:33,645 WARN  parse.Parser - Error parsing: http://rsb.info.nih.gov/ij/download/linux/unix-script.txt: failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
> > > 2010-07-12 14:36:33,650 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser - org.apache.nutch.parse.text.TextParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it  in the parse-plugins.xml file
> > >
> > > my setting:
> > > mime.type.magic=true
> > > plugin.includes=...parse-(text|html|js|tika)...
> > >
> > > any idea?
> > > thanks,
> > > --
> > > AJ Chen, PhD
> > > Chair, Semantic Web SIG, sdforum.org
> > > http://web2express.org
> > > twitter @web2express
> > > Palo Alto, CA, USA
> > >
> >
> >
> >
> > --
> > AJ Chen, PhD
> > Chair, Semantic Web SIG, sdforum.org
> > http://web2express.org
> > twitter @web2express
> > Palo Alto, CA, USA
> >
>
>
>
> --
> DigitalPebble Ltd
>
> Open Source Solutions for Text Engineering
> http://www.digitalpebble.com
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
