Re: problem with tika and MS-Office file - Nutch 1.2

Mattmann, Chris A (388J) Fri, 19 Nov 2010 21:52:46 -0800

Hi Guys,

One thing to try is parsing the file with the latest Tika release (0.8) to 
see whether it works there. If it does, then the issue Julien just filed [1] 
to upgrade Nutch to Tika 0.8 might solve your problem.
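
Before trying that, it may also be worth confirming that the downloaded file
really is an OLE2 container, since the \320\317\021\340\241\261\032\341 match
at offset 0 in the tika-mimetypes.xml quoted below is the OLE2 signature
(D0 CF 11 E0 A1 B1 1A E1 in hex). Here is a rough sketch using only the JDK.
It is not Tika's or Nutch's detection code, and the class and method names
(Ole2MagicCheck, startsWithOle2Magic, fileHasOle2Magic) are made up for
illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika or Nutch.
public class Ole2MagicCheck {

    // OLE2 compound-document signature: the same eight bytes as the octal
    // match value \320\317\021\340\241\261\032\341 at offset 0.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the buffer begins with the OLE2 signature.
    public static boolean startsWithOle2Magic(byte[] head) {
        if (head == null || head.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Reads the first eight bytes of a local file and checks them.
    public static boolean fileHasOle2Magic(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
        }
        return read == head.length && startsWithOle2Magic(head);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Ole2MagicCheck <local-file>");
            return;
        }
        boolean ok = fileHasOle2Magic(Paths.get(args[0]));
        System.out.println(args[0] + (ok ? ": OLE2 signature found"
                                         : ": OLE2 signature NOT found"));
    }
}
```

Run it as "java Ole2MagicCheck wtest2.doc" on a locally saved copy. If the
signature is missing, the server is probably returning something other than
a Word file (an HTML error page, for example), and a parser upgrade alone
would not help.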


Cheers,
Chris

[1] http://issues.apache.org/jira/browse/NUTCH-934

On Nov 19, 2010, at 1:52 AM, qiu chi wrote:

> Tika is not almighty. You can try downloading the document to local disk
> and using the Tika plugin to parse it, then check whether it can be parsed.
> 
> 
> On Thu, Nov 18, 2010 at 2:16 AM, Germán Biozzoli
> <[email protected]> wrote:
> 
>> Hi everybody
>> 
>> I'm using Nutch 1.2 to crawl a set of specialized sites. I can parse
>> HTML and PDF files fine, but when it tries to parse .doc files, the
>> following message appears:
>> 
>> Unable to successfully parse content http://xxx of type application/x-tika-msoffice
>> 
>> I've tried to follow what is shown here:
>> 
>> http://www.mail-archive.com/[email protected]/msg01073.html
>> 
>> But I really cannot find a solution. When I test with the same command,
>> Nutch returns:
>> 
>> 
>> r...@tango06:/home/apache-nutch-1.2# bin/nutch
>> org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
>> Exception in thread "main" org.apache.nutch.parse.ParseException:
>> parser not found for contentType=application/x-tika-msoffice
>> url=http://ridder.uio.no/wtest2.doc
>>       at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
>>       at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)
>> 
>> In nutch-default.xml, the plugin folder is set as follows:
>> 
>> <property>
>> <name>plugin.folders</name>
>> <value>/home/apache-nutch-1.2/build/plugins</value>
>> <description>Directories where nutch plugins are located.  Each
>> element may be a relative or absolute path.  If absolute, it is used
>> as is.  If relative, it is searched for on the classpath.</description>
>> </property>
>> 
>> The path is OK,
>> 
>> and tika-mimetypes.xml contains:
>> 
>> <mime-type type="application/msword">
>>   <alias type="application/vnd.ms-word"/>
>>   <comment>Microsoft Word Document</comment>
>>   <magic priority="50">
>>     <match value="Microsoft\ Word\ 6.0\ Document" type="string" offset="2080"/>
>>     <match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/>
>>     <match value="MSWordDoc" type="string" offset="2112"/>
>>     <match value="0x31be0000" type="big32" offset="0"/>
>>     <match value="PO^Q`" type="string" offset="0"/>
>>     <match value="\376\067\0\043" type="string" offset="0"/>
>>     <match value="\333\245-\0\0\0" type="string" offset="0"/>
>>     <match value="\354\245\301" type="string" offset="512"/>
>>     <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
>>     <match value="\224\246\056" type="string" offset="0"/>
>>     <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string" offset="512"/>
>>   </magic>
>>   <glob pattern="*.doc"/>
>>   <glob pattern="*.dot"/>
>>   <sub-class-of type="application/x-tika-msoffice"/>
>> </mime-type>
>> 
>> I can't imagine what I'm doing wrong. Could somebody help me?
>> Regards and thanks
>> German
>> 
> 
> 
> 
> -- 
> Regards
> Qiu
> - [email protected]


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
