Sorry to repeat this mail. Is no one using Nutch 1.2 having problems
indexing MS-Office documents? Should I send this problem to the Tika list?

Regards and thanks
German


---------- Forwarded message ----------
From: Germán Biozzoli <[email protected]>
Date: Wed, Nov 17, 2010 at 3:16 PM
Subject: problem with tika and MS-Office file - Nucth 1.2
To: [email protected]


Hi everybody

I'm using Nutch 1.2 to crawl a set of specialized sites. HTML and PDF
files parse fine, but when Nutch tries to parse .doc files, the
following message appears:

Unable to successfully parse content http://xxx of type application/x-tika-msoffice

I've tried to follow what is shown here:

http://www.mail-archive.com/[email protected]/msg01073.html

But I really cannot find a solution. When I run the same test command
from that thread, nutch returns:


r...@tango06:/home/apache-nutch-1.2# bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
Exception in thread "main" org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-tika-msoffice url=http://ridder.uio.no/wtest2.doc
       at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
       at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)

In nutch-default.xml, the plugin folder is set as follows:

<property>
 <name>plugin.folders</name>
 <value>/home/apache-nutch-1.2/build/plugins</value>
 <description>Directories where nutch plugins are located.  Each
 element may be a relative or absolute path.  If absolute, it is used
 as is.  If relative, it is searched for on the classpath.</description>
</property>
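
A check along these lines can confirm that the configured folder actually holds the Tika parser plugin. This is only a sketch: the helper name check_tika_plugin is made up for illustration, and it assumes the plugin lives in a parse-tika subdirectory containing a plugin.xml descriptor, which may not match every build.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): report whether a plugin folder
# contains a parse-tika directory with a plugin.xml descriptor.
# ASSUMPTION: the Tika parser plugin directory is named "parse-tika".
check_tika_plugin() {
    plugin_dir="$1"
    if [ -f "$plugin_dir/parse-tika/plugin.xml" ]; then
        echo "found"
    else
        echo "missing"
    fi
}

# Path taken from the plugin.folders value above.
check_tika_plugin /home/apache-nutch-1.2/build/plugins
```

If this prints "missing", the folder exists but the plugin was not built or copied into it, which would be consistent with a "parser not found" error.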

The path is OK.

And this is the relevant entry in tika-mimetypes.xml:

 <mime-type type="application/msword">
   <alias type="application/vnd.ms-word"/>
   <comment>Microsoft Word Document</comment>
   <magic priority="50">
     <match value="Microsoft\ Word\ 6.0\ Document" type="string" offset="2080"/>
     <match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/>
     <match value="MSWordDoc" type="string" offset="2112"/>
     <match value="0x31be0000" type="big32" offset="0"/>
     <match value="PO^Q`" type="string" offset="0"/>
     <match value="\376\067\0\043" type="string" offset="0"/>
     <match value="\333\245-\0\0\0" type="string" offset="0"/>
     <match value="\354\245\301" type="string" offset="512"/>
     <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
     <match value="\224\246\056" type="string" offset="0"/>
     <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string" offset="512"/>
   </magic>
   <glob pattern="*.doc"/>
   <glob pattern="*.dot"/>
   <sub-class-of type="application/x-tika-msoffice"/>
 </mime-type>

I can't work out what I'm doing wrong. Could somebody help me?
Regards and thanks
German
