Re: issue about tika parse

宾军志 Mon, 15 Oct 2012 03:10:06 -0700

Hi Tejas,

Thanks for the information. The issue was resolved by following your
instructions.


BTW, which versions of MS Office does Nutch currently support?

BR,

Rock Bin

2012/10/15 Tejas Patil <[email protected]>

> This can happen for either of these reasons:
>
> 1. The content was probably trimmed during fetching. Try setting
> http.content.limit to a larger value
> <http://lucene.472066.n3.nabble.com/parse-step-hangs-td961720.html>.
> 2. If the file is huge, try increasing your parser.timeout setting
> <http://osdir.com/ml/user.nutch.apache/2011-10/msg00229.html>.
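
One way to tell whether a stored copy of the document is intact is to look at its first 8 bytes. The sketch below is only an illustration, not part of Nutch or Tika: the helper names are made up, and it assumes the "expected 0xE11AB1A1E011CFD0" value in the exception is the OLE2 magic bytes read as a little-endian 64-bit integer.

```python
import struct

# OLE2 (Compound File Binary) magic bytes found at the start of legacy
# .doc/.xls/.ppt files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def read_signature(data: bytes) -> int:
    """Return the first 8 bytes as a little-endian unsigned 64-bit integer.

    Assumption: this mirrors how the value in the POI error message is formed.
    """
    if len(data) < 8:
        raise ValueError("need at least 8 bytes to read a header signature")
    return struct.unpack("<Q", data[:8])[0]


def looks_like_ole2(data: bytes) -> bool:
    """True if the data starts with the OLE2 magic bytes."""
    return data[:8] == OLE2_MAGIC


def describe_file(path: str) -> str:
    """Hypothetical helper: summarize the header of a file on disk."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if len(head) < 8:
        return "too short to carry an OLE2 header"
    status = "OLE2 header present" if looks_like_ole2(head) else "not an OLE2 header"
    return f"{status} (read 0x{read_signature(head):016X})"
```

Calling `describe_file("zhangw8867520120625134724265971.doc")` on the locally downloaded copy should report an OLE2 header if that copy is a complete legacy Word file. A copy whose first bytes differ would point at the fetch or storage step rather than at the parser.
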
>
> thanks,
> Tejas Patil
>
> On Sun, Oct 14, 2012 at 7:50 PM, 宾军志 <[email protected]> wrote:
>
> > Hi All,
> >
> > I have installed Nutch 2.1 with HBase, and it works well for HTML
> > parsing.
> > But when I try to parse a Word document, I get the exception below:
> >
> > 2012-10-14 17:56:04,686 INFO  crawl.SignatureFactory - Using Signature
> > impl: org.apache.nutch.crawl.MD5Signature
> > 2012-10-14 17:56:05,026 INFO  mapreduce.GoraRecordReader -
> > gora.buffer.read.limit = 10000
> > 2012-10-14 17:56:05,048 INFO  mapreduce.GoraRecordWriter -
> > gora.buffer.write.limit = 10000
> > 2012-10-14 17:56:05,054 INFO  crawl.SignatureFactory - Using Signature
> > impl: org.apache.nutch.crawl.MD5Signature
> > 2012-10-14 17:56:05,077 INFO  parse.ParserJob - Parsing
> > http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
> > 2012-10-14 17:56:05,077 INFO  parse.ParserFactory - The parsing plugins:
> > [org.apache.nutch.parse.tika.TikaParser] are enabled via the
> > plugin.includes system property, and all claim to support the content
> > type application/x-tika-msoffice, but they are not mapped to it  in the
> > parse-plugins.xml file
> > 2012-10-14 17:56:05,164 ERROR tika.TikaParser - Error parsing
> > http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
> > java.io.IOException: Invalid header signature; read 0x0000000000000000,
> > expected 0xE11AB1A1E011CFD0
> > at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
> > at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
> > at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:265)
> > at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
> > at org.apache.nutch.parse.tika.TikaParser.getParse(Unknown Source)
> > at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
> > at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
> > at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> > at java.util.concurrent.FutureTask.run(Unknown Source)
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> > at java.lang.Thread.run(Unknown Source)
> >
> > Then I downloaded the document to my local machine and ran the Tika
> > parser on it with this command:
> > ./bin/nutch plugin parse-tika org.apache.nutch.parse.tika.TikaParser zhangw8867520120625134724265971.doc
> > This command worked well.
> >
> > Does anyone have an idea what is going on?
> >
> > BR,
> >
> > Rock Bin
> >
>
