Hi All,

I have installed Nutch 2.1 with HBase, and HTML parsing works well.
But when I try to parse a Word document, I get the exception below:

2012-10-14 17:56:04,686 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2012-10-14 17:56:05,026 INFO  mapreduce.GoraRecordReader - gora.buffer.read.limit = 10000
2012-10-14 17:56:05,048 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2012-10-14 17:56:05,054 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2012-10-14 17:56:05,077 INFO  parse.ParserJob - Parsing http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
2012-10-14 17:56:05,077 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/x-tika-msoffice, but they are not mapped to it  in the parse-plugins.xml file
2012-10-14 17:56:05,164 ERROR tika.TikaParser - Error parsing http://www.g12e.com/upload/html/2012/6/25/zhangw8867520120625134724265971.doc
java.io.IOException: Invalid header signature; read 0x0000000000000000, expected 0xE11AB1A1E011CFD0
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:140)
at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:115)
at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:265)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
at org.apache.nutch.parse.tika.TikaParser.getParse(Unknown Source)
at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
at org.apache.nutch.parse.ParseCallable.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)

I then downloaded this document to my local machine and parsed it with Tika directly, using this command:
./bin/nutch plugin parse-tika org.apache.nutch.parse.tika.TikaParser zhangw8867520120625134724265971.doc
This command worked well.
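
For reference, the check that fails in the trace compares the first eight bytes of the content against the OLE2 signature (bytes D0 CF 11 E0 A1 B1 1A E1, which read as a little-endian long give the 0xE11AB1A1E011CFD0 in the message). "read 0x0000000000000000" means the bytes handed to the parser started with zeros. Below is a minimal sketch of the same comparison that can be run on the local copy. The class name Ole2HeaderCheck and its methods are made up for illustration; they are not part of Nutch, Tika, or POI.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a Nutch/Tika/POI class.
public class Ole2HeaderCheck {

    // OLE2 magic bytes D0 CF 11 E0 A1 B1 1A E1, read as a little-endian long.
    public static final long OLE2_SIGNATURE = 0xE11AB1A1E011CFD0L;

    // Reads the first eight bytes as a little-endian long.
    public static long readSignature(byte[] content) {
        if (content.length < 8) {
            throw new IllegalArgumentException("need at least 8 bytes, got " + content.length);
        }
        return ByteBuffer.wrap(content, 0, 8).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }

    // True if the content starts with the OLE2 signature.
    public static boolean looksLikeOle2(byte[] content) {
        return content.length >= 8 && readSignature(content) == OLE2_SIGNATURE;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2HeaderCheck <file>");
            return;
        }
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        if (content.length < 8) {
            System.out.println("file is shorter than 8 bytes: " + content.length);
            return;
        }
        // Same wording as the exception, so the two outputs can be compared directly.
        System.out.printf("read 0x%016X, expected 0x%016X%n",
                readSignature(content), OLE2_SIGNATURE);
    }
}
```

Run against the downloaded file, this should report a matching signature if the local copy is a valid .doc, which would suggest the difference lies in the content stored by the crawl rather than in the document itself. That is only a guess on my side.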

Does anyone have an idea what could cause this?

BR,

Rock Bin
