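This is probably because the content was truncated during fetching: the
parser expected 'endstream' but found nothing, which is what a PDF cut off
partway through a stream would look like.
Try setting http.content.limit to a larger value.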
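
A minimal sketch of that override, assuming it goes inside the
<configuration> element of conf/nutch-site.xml. The 10 MB value is only an
illustrative choice, not something taken from this thread; the default in
nutch-default.xml is 65536 bytes, and a negative value disables truncation
altogether.

```xml
<!-- conf/nutch-site.xml: overrides the default from nutch-default.xml -->
<property>
  <name>http.content.limit</name>
  <!-- illustrative value (10 MB); pick one that fits your largest PDFs -->
  <value>10485760</value>
</property>
```

The segment would most likely need to be fetched again afterwards, since the
content already stored for that URL is the truncated copy.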

On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:

> Another case where the parser hangs up.  debug is on for logging.
>
> 2010-07-21 11:49:11,620 WARN  parse.Parser - Error parsing:
> http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67:
> failed(2,0): expected='endstream' actual=''
> org.apache.pdfbox.io.pushbackinputstr...@f3552f
> 2010-07-21 11:49:11,622 INFO  parse.ParserFactory - The parsing plugins:
> [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes
> system property, and all claim to support the content type application/zip,
> but they are not mapped to it  in the parse-plugins.xml file
>
> any idea?
>
> -aj
>
> On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <[email protected]> wrote:
>
> > The log you sent earlier indicated that Tika had no parser for that mime
> > type, which means it is not used for it.
> > It might be hanging, but that would be on a different document and
> > possibly a different mimetype.
> >
> > Try setting in log4j.properties
> > log4j.logger.org.apache.nutch=DEBUG
> >
> > and check the logs again
> >
> > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
> >
> > > I set mime.type.magic=false and parsed the segment again. The parser
> > > hung at the same place. Maybe Tika is trapped in an endless loop after
> > > seeing mime-type application/x-sh. Is there a way to configure Tika to
> > > skip mime-type application/x-sh?
> > > thanks,
> > > -aj
> > >
> > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
> > >
> > > > There is another thread reporting hanging during Tika parsing. I'm
> > > > seeing a similar problem now. Not sure whether the cause is the same,
> > > > but I want to show the messages at the point of hanging.
> > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika
> > > > parser for mime-type application/x-sh
> > > > 2010-07-12 14:36:33,645 WARN  parse.Parser - Error parsing:
> > > > http://rsb.info.nih.gov/ij/download/linux/unix-script.txt:
> > > > failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
> > > > 2010-07-12 14:36:33,650 INFO  parse.ParserFactory - The parsing plugins:
> > > > [org.apache.nutch.parse.tika.Parser -
> > > > org.apache.nutch.parse.text.TextParser] are enabled via the
> > > > plugin.includes system property, and all claim to support the content
> > > > type text/plain, but they are not mapped to it  in the parse-plugins.xml
> > > > file
> > > >
> > > > my setting:
> > > > mime.type.magic=true
> > > > plugin.includes=...parse-(text|html|js|tika)...
> > > >
> > > > any idea?
> > > > thanks,
> > > > --
> > > > AJ Chen, PhD
> > > > Chair, Semantic Web SIG, sdforum.org
> > > > http://web2express.org
> > > > twitter @web2express
> > > > Palo Alto, CA, USA
> > > >
> > >
> > >
> > >
> > > --
> > > AJ Chen, PhD
> > > Chair, Semantic Web SIG, sdforum.org
> > > http://web2express.org
> > > twitter @web2express
> > > Palo Alto, CA, USA
> > >
> >
> >
> >
> > --
> > DigitalPebble Ltd
> >
> > Open Source Solutions for Text Engineering
> > http://www.digitalpebble.com
> >
>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>



-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
