Tika is known to hang on truncated zip files. To prevent this, enable the zip parser so that Tika is not used for zip files.
-aj
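A complementary guard is to check whether fetched zip content looks complete before handing it to any parser. The sketch below is only an illustration, not part of Nutch or Tika: the class name TruncatedZipCheck and the method looksComplete are hypothetical. It relies on the fact that a complete zip archive ends with an End of Central Directory record (signature "PK\005\006"), which a download cut off by http.content.limit will normally be missing.

```java
import java.util.Arrays;

// Hypothetical helper (not a Nutch or Tika class): heuristic check for truncated zip content.
public class TruncatedZipCheck {
    // The End of Central Directory (EOCD) record is at least 22 bytes long
    // and may be followed by an archive comment of up to 65535 bytes.
    private static final int EOCD_MIN_SIZE = 22;
    private static final int MAX_COMMENT_SIZE = 65535;

    /** Returns true if an EOCD record that ends exactly at the end of the data is found. */
    public static boolean looksComplete(byte[] data) {
        if (data == null || data.length < EOCD_MIN_SIZE) {
            return false;
        }
        int lowest = Math.max(0, data.length - EOCD_MIN_SIZE - MAX_COMMENT_SIZE);
        // Scan backwards from the last position where an EOCD record could start.
        for (int pos = data.length - EOCD_MIN_SIZE; pos >= lowest; pos--) {
            boolean signature = data[pos] == 0x50 && data[pos + 1] == 0x4B
                    && data[pos + 2] == 0x05 && data[pos + 3] == 0x06;
            if (signature) {
                // The comment length is a little-endian 16-bit value at offset 20 of the record.
                int commentLength = (data[pos + 20] & 0xFF) | ((data[pos + 21] & 0xFF) << 8);
                if (pos + EOCD_MIN_SIZE + commentLength == data.length) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // An empty zip archive is exactly one EOCD record: the signature plus 18 zero bytes.
        byte[] emptyZip = new byte[22];
        emptyZip[0] = 0x50;
        emptyZip[1] = 0x4B;
        emptyZip[2] = 0x05;
        emptyZip[3] = 0x06;
        System.out.println(looksComplete(emptyZip));                    // true
        System.out.println(looksComplete(Arrays.copyOf(emptyZip, 10))); // false (cut short)
    }
}
```

This is a heuristic only: it says nothing about whether the entries inside the archive are valid, just that the tail of the file is present, so content that fails the check could be skipped or logged instead of parsed.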
On Thu, Jul 22, 2010 at 11:21 AM, AJ Chen <[email protected]> wrote:

> in reality, some documents will be trimmed. the parser should just throw
> an exception instead of hanging on a trimmed document. I would be
> surprised if tika is not designed this way. I wonder if there is an
> unknown bug in the tika parser that causes the occasional hang.
> -aj
>
> On Thu, Jul 22, 2010 at 5:01 AM, Julien Nioche <[email protected]> wrote:
>
>> This is probably due to the content having been trimmed during the
>> fetching.
>> Try setting http.content.limit to a larger value
>>
>> On 21 July 2010 19:56, AJ Chen <[email protected]> wrote:
>>
>> > Another case where the parser hangs. debug is on for logging.
>> >
>> > 2010-07-21 11:49:11,620 WARN parse.Parser - Error parsing: http://dailymed.nlm.nih.gov/dailymed/getFile.cfm?id=18551&type=pdf&name=d19c7ed0-ad5c-426e-b2df-722508f97d67: failed(2,0): expected='endstream' actual='' org.apache.pdfbox.io.pushbackinputstr...@f3552f
>> > 2010-07-21 11:49:11,622 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser] are enabled via the plugin.includes system property, and all claim to support the content type application/zip, but they are not mapped to it in the parse-plugins.xml file
>> >
>> > any idea?
>> >
>> > -aj
>> >
>> > On Tue, Jul 13, 2010 at 1:49 AM, Julien Nioche <[email protected]> wrote:
>> >
>> > > The log you sent earlier indicated that Tika had no parser for that
>> > > mime type, which means it is not used for it.
>> > > It might be hanging but that would be on a different document and
>> > > possibly mimetype
>> > >
>> > > Try setting in log4j.properties
>> > > log4j.logger.org.apache.nutch=DEBUG
>> > >
>> > > and check the logs again
>> > >
>> > > On 12 July 2010 23:57, AJ Chen <[email protected]> wrote:
>> > >
>> > > > I set mime.type.magic=false, parsed the segment again. the parser
>> > > > got hung up at the same place. maybe tika is trapped in an endless
>> > > > loop after seeing mime-type application/x-sh. is there a way to
>> > > > configure tika to skip mime-type application/x-sh?
>> > > > thanks,
>> > > > -aj
>> > > >
>> > > > On Mon, Jul 12, 2010 at 3:36 PM, AJ Chen <[email protected]> wrote:
>> > > >
>> > > > > there is another thread reporting hanging during tika parsing. I'm
>> > > > > seeing a similar problem now. not sure the cause is the same or not,
>> > > > > but want to show the messages at the point of hanging.
>> > > > >
>> > > > > 2010-07-12 14:36:33,645 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/x-sh
>> > > > > 2010-07-12 14:36:33,645 WARN parse.Parser - Error parsing: http://rsb.info.nih.gov/ij/download/linux/unix-script.txt: failed(2,0): Can't retrieve Tika parser for mime-type application/x-sh
>> > > > > 2010-07-12 14:36:33,650 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.Parser - org.apache.nutch.parse.text.TextParser] are enabled via the plugin.includes system property, and all claim to support the content type text/plain, but they are not mapped to it in the parse-plugins.xml file
>> > > > >
>> > > > > my setting:
>> > > > > mime.type.magic=true
>> > > > > plugin.includes=...parse-(text|html|js|tika)...
>> > > > >
>> > > > > any idea?
>> > > > > thanks,
>> > > > > --
>> > > > > AJ Chen, PhD
>> > > > > Chair, Semantic Web SIG, sdforum.org
>> > > > > http://web2express.org
>> > > > > twitter @web2express
>> > > > > Palo Alto, CA, USA
>> > >
>> > > --
>> > > DigitalPebble Ltd
>> > >
>> > > Open Source Solutions for Text Engineering
>> > > http://www.digitalpebble.com

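The point raised in the thread, that a parser should fail with an exception rather than hang on a trimmed document, can be approximated from the calling side by running the parse call under a time limit. The following is a minimal sketch using only the JDK; ParseWithTimeout and callWithTimeout are hypothetical names, the lambdas stand in for a real parse call, and this is not how Nutch or Tika invoke parsers.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper (not a Nutch or Tika class): turns a hang into a TimeoutException.
public class ParseWithTimeout {

    /** Runs the task on a daemon worker thread and gives up after the timeout. */
    public static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        // A daemon thread does not keep the JVM alive if the task never returns.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true);
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker, if it responds to interrupts
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A stand-in for a parse call that finishes normally.
        System.out.println(callWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS));

        // A stand-in for a parse call that hangs on a truncated document.
        try {
            callWithTimeout(() -> {
                Thread.sleep(60_000);
                return "never";
            }, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("parse timed out, document skipped");
        }
    }
}
```

One caveat with this approach: a parser stuck in a tight loop that never checks for interruption keeps its worker thread busy even after the caller has moved on, so the timeout protects the crawl but does not free the CPU.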
