Hi German,

I was having the same problem (and others). For an immediate fix, try switching 
the parser used for this type, e.g.:


        <mimeType name="application/msword">
                <plugin id="parse-oo" />
        </mimeType>

In my experience, this parser is a lot more tolerant. For a long-term solution, 
these problems should be reported to Tika, though I expect they already know 
about them.
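
To double-check which plugins a content type is mapped to, something like the 
following sketch might help. It is only an illustration: the SAMPLE string is a 
made-up stand-in that mirrors the <mimeType>/<plugin> layout above (replace it 
with the contents of your own mapping file), and plugins_for is a hypothetical 
helper name, not part of Nutch. It does an exact-name lookup only and does not 
model any wildcard or fallback rules Nutch itself may apply.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the <mimeType>/<plugin> layout shown above;
# in practice you would read your own mapping file into this string.
SAMPLE = """
<parse-plugins>
  <mimeType name="application/msword">
    <plugin id="parse-oo" />
  </mimeType>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>
"""


def plugins_for(xml_text, mime_type):
    """Return the plugin ids listed for mime_type, in document order."""
    root = ET.fromstring(xml_text)
    return [
        plugin.get("id")
        for entry in root.iter("mimeType")
        if entry.get("name") == mime_type
        for plugin in entry.findall("plugin")
    ]


# Report the mapping (or lack of one) for the types seen in this thread.
for mime in ("application/msword", "application/x-tika-msoffice"):
    found = plugins_for(SAMPLE, mime)
    print(mime, "->", found if found else "no mapping")
```

An empty result for a content type means there is no explicit entry for it in 
the text you passed in, which is worth ruling out when you see "parser not 
found" for that type.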

Regards,

Arkadi


>-----Original Message-----
>From: Germán Biozzoli [mailto:[email protected]]
>Sent: Friday, November 19, 2010 1:53 PM
>To: [email protected]
>Subject: Problem with tika and MS-Office file - Nucth 1.2
>
>Sorry to repeat this mail. Is no one using Nutch 1.2 having problems
>indexing MS-Office documents? Should I send this problem to the Tika list?
>
>Regards and thanks
>German
>
>
>---------- Forwarded message ----------
>From: Germán Biozzoli <[email protected]>
>Date: Wed, Nov 17, 2010 at 3:16 PM
>Subject: problem with tika and MS-Office file - Nucth 1.2
>To: [email protected]
>
>
>Hi everybody
>
>I'm using Nutch 1.2 to crawl a set of specialized sites. I can parse
>HTML and PDF files fine, but when it tries to parse .doc files, the
>following message appears:
>
>Unable to successfully
>parse content http://xxx of type
>application/x-tika-msoffice
>
>I've tried to follow what is shown here:
>
>http://www.mail-archive.com/[email protected]/msg01073.html
>
>But I really cannot find a solution. When I test with the same command,
>Nutch returns:
>
>
>r...@tango06:/home/apache-nutch-1.2# bin/nutch
>org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
>Exception in thread "main" org.apache.nutch.parse.ParseException:
>parser not found for contentType=application/x-tika-msoffice
>url=http://ridder.uio.no/wtest2.doc
>       at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
>       at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)
>
>In nutch-default.xml I have the plugin folder set as follows:
>
><property>
> <name>plugin.folders</name>
> <value>/home/apache-nutch-1.2/build/plugins</value>
> <description>Directories where nutch plugins are located.  Each
> element may be a relative or absolute path.  If absolute, it is used
> as is.  If relative, it is searched for on the classpath.</description>
></property>
>
>The path is OK.
>
>tika-mimetypes.xml contains:
>
> <mime-type type="application/msword">
>   <alias type="application/vnd.ms-word"/>
>   <comment>Microsoft Word Document</comment>
>   <magic priority="50">
>     <match value="Microsoft\ Word\ 6.0\ Document" type="string"
>offset="2080"/>
>     <match value="Documento\ Microsoft\ Word\ 6" type="string"
>offset="2080"/>
>     <match value="MSWordDoc" type="string" offset="2112"/>
>     <match value="0x31be0000" type="big32" offset="0"/>
>     <match value="PO^Q`" type="string" offset="0"/>
>     <match value="\376\067\0\043" type="string" offset="0"/>
>     <match value="\333\245-\0\0\0" type="string" offset="0"/>
>     <match value="\354\245\301" type="string" offset="512"/>
>     <match value="\320\317\021\340\241\261\032\341" type="string"
>offset="0"/>
>     <match value="\224\246\056" type="string" offset="0"/>
>     <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string"
>offset="512"/>
>   </magic>
>   <glob pattern="*.doc"/>
>   <glob pattern="*.dot"/>
>   <sub-class-of type="application/x-tika-msoffice"/>
> </mime-type>
>
>I can't imagine what I'm doing wrong. Could somebody help me?
>Regards and thanks
>German
