In reality, some documents will be trimmed. The parser should just throw an
exception instead of hanging on a trimmed document. I would be surprised
if Tika were not designed this way. I wonder if there is an unknown bug in
the Tika parser that occasionally causes the hang.
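
Until the cause is found, one possible workaround is to run each parse under
a time limit, so that a stuck document is reported as an error instead of
blocking the whole job. Below is a minimal sketch in plain Java, assuming the
parse call can be wrapped in a Callable. TimedParse and parseWithTimeout are
hypothetical names for illustration, not part of Nutch or Tika.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a Nutch or Tika class): bounds the time a parse may take. */
public class TimedParse {

    /**
     * Runs parseTask on a worker thread and waits at most timeoutMillis for it.
     * Throws TimeoutException if the task has not finished by then.
     */
    public static String parseWithTimeout(Callable<String> parseTask, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker thread
            throw new TimeoutException("parse did not finish within " + timeoutMillis + " ms");
        } catch (ExecutionException e) {
            // Surface the parser's own exception rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes in time returns its result.
        System.out.println(parseWithTimeout(() -> "parsed text", 1000));

        // A task that blocks (simulated here with sleep) is abandoned after 200 ms.
        try {
            parseWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 200);
        } catch (TimeoutException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

Note that cancel(true) only interrupts the worker thread. A parser spinning in
a tight loop that never checks for interruption would keep running in the
background, so this keeps the job moving but does not reclaim the thread.
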
-aj

On Thu, Jul 22, 2010 at 5:01 AM, Julien Nioche <
[email protected]> wrote:

> This is probably due to the content having been trimmed during the
> fetching.
> Try setting http.content.limit to a larger value.
>
> On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:
>
> > Another case where the parser hangs. Debug logging is on.
> >
> > 2010-07-21 11:49:11,620 WARN  parse.Parser - Error parsing:
> > http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67:
> > failed(2,0): expected='endstream' actual=''
> > org.apache.pdfbox.io.pushbackinputstr...@f3552f
> > 2010-07-21 11:49:11,622 INFO  parse.ParserFactory - The parsing plugins:
> > [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes
> > system property, and all claim to support the content type application/zip,
> > but they are not mapped to it  in the parse-plugins.xml file
> >
> > any idea?
> >
> > -aj
> >
> > On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <
> > [email protected]> wrote:
> >
> > > The log you sent earlier indicated that Tika had no parser for that
> > > mime type, which means it is not used for it.
> > > It might be hanging, but that would be on a different document and
> > > possibly a different mime type.
> > >
> > > Try setting in log4j.properties
> > > log4j.logger.org.apache.nutch=DEBUG
> > >
> > > and check the logs again
> > >
> > > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
> > >
> > > > I set mime.type.magic=false and parsed the segment again. The parser
> > > > hung at the same place. Maybe Tika is trapped in an endless loop after
> > > > seeing mime-type application/x-sh. Is there a way to configure Tika to
> > > > skip mime-type application/x-sh?
> > > > thanks,
> > > > -aj
> > > >
> > > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]>
> > wrote:
> > > >
> > > > > There is another thread reporting hangs during Tika parsing. I'm
> > > > > seeing a similar problem now. I'm not sure whether the cause is the
> > > > > same, but I want to show the messages logged at the point of hanging.
> > > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika
> > > > > parser for mime-type application/x-sh
> > > > > 2010-07-12 14:36:33,645 WARN  parse.Parser - Error parsing:
> > > > > http://rsb.info.nih.gov/ij/download/linux/unix-script.txt:
> > > > > failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
> > > > > 2010-07-12 14:36:33,650 INFO  parse.ParserFactory - The parsing plugins:
> > > > > [org.apache.nutch.parse.tika.Parser -
> > > > > org.apache.nutch.parse.text.TextParser] are enabled via the
> > > > > plugin.includes system property, and all claim to support the content
> > > > > type text/plain, but they are not mapped to it  in the
> > > > > parse-plugins.xml file
> > > > >
> > > > > my setting:
> > > > > mime.type.magic=true
> > > > > plugin.includes=...parse-(text|html|js|tika)...
> > > > >
> > > > > any idea?
> > > > > thanks,
> > > > > --
> > > > > AJ Chen, PhD
> > > > > Chair, Semantic Web SIG, sdforum.org
> > > > > http://web2express.org
> > > > > twitter @web2express
> > > > > Palo Alto, CA, USA
> > > > >
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > DigitalPebble Ltd
> > >
> > > Open Source Solutions for Text Engineering
> > > http://www.digitalpebble.com
> > >
> >
> >
>
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
