See the Tika formats page for more info:
http://tika.apache.org/1.2/formats.html#Microsoft_Office_document_formats

-----Original message-----
> From: 宾军志 <[email protected]>
> Sent: Mon 15-Oct-2012 12:14
> To: [email protected]
> Subject: Re: issue about tika parse
>
> Hi Tejas,
>
> Thanks for your information. This issue has been resolved by following
> your instructions.
>
> BTW: which versions of MS Office are currently supported by Nutch?
>
> BR,
>
> Rock Bin
>
> 2012/10/15 Tejas Patil <[email protected]>
>
> > This can happen due to either of these:
> >
> > 1. The content was probably trimmed during fetching. Try setting
> >    http.content.limit to a larger value
> >    <http://lucene.472066.n3.nabble.com/parse-step-hangs-td961720.html>.
> > 2. If the file is huge, try increasing your parser.timeout setting
> >    <http://osdir.com/ml/user.nutch.apache/2011-10/msg00229.html>.
> >
> > thanks,
> > Tejas Patil
> >
> > On Sun, Oct 14, 2012 at 7:50 PM, 宾军志 <[email protected]> wrote:
> >
> > > Hi All,
> > >
> > > I have installed Nutch 2.1 with HBase, and it works well for HTML
> > > parsing. But when I try to parse a Word document, I get the
> > > exception below:
> > >
> > > 2012-10-14 17:56:04,686 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > > 2012-10-14 17:56:05,026 INFO mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
> > > 2012-10-14 17:56:05,048 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
> > > 2012-10-14 17:56:05,054 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> > > 2012-10-14 17:56:05,077 INFO parse.ParserJob - Parsing http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
> > > 2012-10-14 17:56:05,077 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/x-tika-msoffice, but they are not mapped to it in the parse-plugins.xml file
> > > 2012-10-14 17:56:05,164 ERROR tika.TikaParser - Error parsing http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
> > > java.io.IOException: Invalid header signature; read 0x0000000000000000, expected 0xE11AB1A1E011CFD0
> > >     at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
> > >     at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
> > >     at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:265)
> > >     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
> > >     at org.apache.nutch.parse.tika.TikaParser.getParse(Unknown Source)
> > >     at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
> > >     at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
> > >     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> > >     at java.util.concurrent.FutureTask.run(Unknown Source)
> > >     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> > >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> > >     at java.lang.Thread.run(Unknown Source)
> > >
> > > I then downloaded the document to my local machine and ran the Tika
> > > parser on it from the command line:
> > >
> > > ./bin/nutch plugin parse-tika org.apache.nutch.parse.tika.TikaParser zhangw8867520120625134724265971.doc
> > >
> > > This command worked well.
> > >
> > > Does anyone have any idea about this?
> > >
> > > BR,
> > >
> > > Rock Bin
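
P.S. The "Invalid header signature" message in the trace compares the first 8 bytes of the content, read as a little-endian 64-bit value, against the OLE2 signature 0xE11AB1A1E011CFD0. If you want to confirm that a stored or downloaded copy of a .doc file really starts with that signature, something like the following should do. This is only a rough sketch in Python, not part of Nutch, Tika or POI; the function names are made up for illustration, and it assumes you have the bytes available as a local file.

```python
import struct

# OLE2 signature as printed in the exception message: the first 8 bytes of
# the file (D0 CF 11 E0 A1 B1 1A E1) read as a little-endian 64-bit value.
OLE2_SIGNATURE = 0xE11AB1A1E011CFD0


def read_header_signature(data: bytes) -> int:
    """Return the first 8 bytes of data as a little-endian unsigned 64-bit int."""
    if len(data) < 8:
        raise ValueError("need at least 8 bytes to read the header signature")
    return struct.unpack("<Q", data[:8])[0]


def looks_like_ole2(data: bytes) -> bool:
    """True if data is long enough and starts with the OLE2 signature."""
    return len(data) >= 8 and read_header_signature(data) == OLE2_SIGNATURE


def describe_file(path: str) -> str:
    """Hypothetical helper: summarise the header check for a local file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8:
        return f"{path}: only {len(head)} bytes, too short for an OLE2 header"
    sig = read_header_signature(head)
    verdict = "matches" if sig == OLE2_SIGNATURE else "does not match"
    return f"{path}: read 0x{sig:016X}, {verdict} 0x{OLE2_SIGNATURE:016X}"
```

Calling describe_file("zhangw8867520120625134724265971.doc") on the locally downloaded copy should report a match, since that copy parsed fine. A header of all zero bytes, as in the log above, would be reported as not matching.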

