Tika is known to hang on truncated zip files. To prevent this, enable the
zip parser so that Tika is not used for zip files.
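
A more general guard, whichever parser ends up handling the file, is to run
the parse in a worker thread and give up after a timeout. Below is a minimal
sketch using only the JDK. The names ParseWithTimeout and runWithTimeout and
the stand-in tasks are hypothetical, not Nutch or Tika APIs; the real parse
call would go inside the Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class ParseWithTimeout {

    /**
     * Runs parseTask on a worker thread and waits at most timeoutMs
     * milliseconds for its result. Throws TimeoutException if the task
     * does not finish in time.
     */
    public static <T> T runWithTimeout(Callable<T> parseTask, long timeoutMs)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the worker; a parser spinning in a busy loop may
            // ignore the interrupt and keep running as a daemon thread.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a parse that completes normally.
        System.out.println(runWithTimeout(() -> "parsed", 1000));

        // Stand-in for a parse that hangs (here: a long sleep).
        try {
            runWithTimeout(() -> { Thread.sleep(5000); return "late"; }, 100);
        } catch (TimeoutException e) {
            System.out.println("parse timed out");
        }
    }
}
```

The caller can then log the URL and move on to the next document. This does
not fix the underlying hang, it only keeps one bad document from blocking the
whole segment.
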
-aj

On Thu, Jul 22, 2010 at 11:21 AM, AJ Chen <[email protected]> wrote:

> In reality, some documents will be trimmed. The parser should just throw
> an exception instead of hanging on a trimmed document. I would be
> surprised if Tika were not designed this way. I wonder if there is an
> unknown bug in the Tika parser that causes the occasional hang.
> -aj
>
>
> On Thu, Jul 22, 2010 at 5:01 AM, Julien Nioche <
> [email protected]> wrote:
>
>> This is probably due to the content having been trimmed during
>> fetching.
>> Try setting http.content.limit to a larger value.
>>
>> On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:
>>
>> > Another case where the parser hangs. Debug logging is on.
>> >
>> > 2010-07-21 11:49:11,620 WARN  parse.Parser - Error parsing:
>> > http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67:
>> > failed(2,0): expected='endstream' actual=''
>> > org.apache.pdfbox.io.pushbackinputstr...@f3552f
>> > 2010-07-21 11:49:11,622 INFO  parse.ParserFactory - The parsing plugins:
>> > [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes
>> > system property, and all claim to support the content type application/zip,
>> > but they are not mapped to it  in the parse-plugins.xml file
>> >
>> > any idea?
>> >
>> > -aj
>> >
>> > On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <
>> > [email protected]> wrote:
>> >
>> > > The log you sent earlier indicated that Tika had no parser for that
>> > > mime type, which means it is not used for it.
>> > > It might be hanging, but that would be on a different document and
>> > > possibly a different mime type.
>> > >
>> > > Try setting in log4j.properties
>> > > log4j.logger.org.apache.nutch=DEBUG
>> > >
>> > > and check the logs again
>> > >
>> > > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
>> > >
>> > > > I set mime.type.magic=false and parsed the segment again. The parser
>> > > > hung at the same place. Maybe Tika is trapped in an endless loop
>> > > > after seeing mime-type application/x-sh. Is there a way to configure
>> > > > Tika to skip mime-type application/x-sh?
>> > > > thanks,
>> > > > -aj
>> > > >
>> > > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
>> > > >
>> > > > > There is another thread reporting hangs during Tika parsing. I'm
>> > > > > seeing a similar problem now. I'm not sure whether the cause is the
>> > > > > same, but I want to show the messages logged at the point of hanging.
>> > > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika
>> > > > > parser for mime-type application/x-sh
>> > > > > 2010-07-12 14:36:33,645 WARN  parse.Parser - Error parsing:
>> > > > > http://rsb.info.nih.gov/ij/download/linux/unix-script.txt:
>> > > > > failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
>> > > > > 2010-07-12 14:36:33,650 INFO  parse.ParserFactory - The parsing plugins:
>> > > > > [org.apache.nutch.parse.tika.Parser -
>> > > > > org.apache.nutch.parse.text.TextParser] are enabled via the
>> > > > > plugin.includes system property, and all claim to support the content
>> > > > > type text/plain, but they are not mapped to it  in the parse-plugins.xml
>> > > > > file
>> > > > >
>> > > > > my setting:
>> > > > > mime.type.magic=true
>> > > > > plugin.includes=...parse-(text|html|js|tika)...
>> > > > >
>> > > > > any idea?
>> > > > > thanks,
>> > > > > --
>> > > > > AJ Chen, PhD
>> > > > > Chair, Semantic Web SIG, sdforum.org
>> > > > > http://web2express.org
>> > > > > twitter @web2express
>> > > > > Palo Alto, CA, USA
>> > > > >
>> > >
>> > >
>> > >
>> > > --
>> > > DigitalPebble Ltd
>> > >
>> > > Open Source Solutions for Text Engineering
>> > > http://www.digitalpebble.com
>> > >



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
