Hi Guys,

One thing to try would be to parse the file with the latest Tika release (0.8) and see if it works there. If it does, then the issue Julien just filed [1] to upgrade Nutch to Tika 0.8 might help with your problem...
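
Before testing against Tika 0.8, it may also be worth confirming that the downloaded file really is an OLE2 (old-style MS Office) container, since a saved error page would fail for a different reason. Here is a minimal sketch in plain Python, not part of Nutch or Tika; the names OLE2_MAGIC and looks_like_ole2 are made up for illustration. It only compares the first 8 bytes against the signature that appears as the octal-escaped match at offset 0 in the tika-mimetypes.xml quoted below.

```python
import sys

# OLE2 compound-file signature. These are the same bytes as the
# octal-escaped "\320\317\021\340\241\261\032\341" match at offset 0
# in the quoted tika-mimetypes.xml entry.
OLE2_MAGIC = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def looks_like_ole2(path):
    """Return True if the file at `path` starts with the OLE2 signature.

    Hypothetical helper for illustration only; it says nothing about
    whether the container actually holds a Word document.
    """
    with open(path, "rb") as fh:
        return fh.read(len(OLE2_MAGIC)) == OLE2_MAGIC


if __name__ == "__main__":
    # Usage: python check_ole2.py wtest2.doc [more files...]
    for name in sys.argv[1:]:
        verdict = "OLE2 container" if looks_like_ole2(name) else "not OLE2"
        print(f"{name}: {verdict}")
```

If the check fails, the file on disk is probably not a binary .doc at all. If it passes, the content is at least an OLE2 container, which points the investigation back at how the parser is selected for that content type.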

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/NUTCH-934

On Nov 19, 2010, at 1:52 AM, qiu chi wrote:

> tika is not almighty, you can try to download the document to local disk
> and use the tika plugin to parse it then check if it can be parsed.
>
>
> On Thu, Nov 18, 2010 at 2:16 AM, Germán Biozzoli
> <[email protected]> wrote:
>
>> Hi everybody
>>
>> I'm using Nutch 1.2 to crawl a set of specialized sites. I could parse
>> OK html and pdf files, but when it tries to parse doc files, the
>> following message appears:
>>
>> Unable to successfully
>> parse content http://xxx of type
>> application/x-tika-msoffice
>>
>> I've tried to follow what is shown here:
>>
>> http://www.mail-archive.com/[email protected]/msg01073.html
>>
>> But really cannot find a solution. Only if I test the same command,
>> nutch returns:
>>
>>
>> r...@tango06:/home/apache-nutch-1.2# bin/nutch
>> org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
>> Exception in thread "main" org.apache.nutch.parse.ParseException:
>> parser not found for contentType=application/x-tika-msoffice
>> url=http://ridder.uio.no/wtest2.doc
>> at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
>> at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)
>>
>> I have at nutch-default.xml the plugin folder in
>>
>> <property>
>> <name>plugin.folders</name>
>> <value>/home/apache-nutch-1.2/build/plugins</value>
>> <description>Directories where nutch plugins are located. Each
>> element may be a relative or absolute path. If absolute, it is used
>> as is. If relative, it is searched for on the classpath.</description>
>> </property>
>>
>> The path is ok
>>
>> and the tika-mimetypes.xml
>>
>> <mime-type type="application/msword">
>> <alias type="application/vnd.ms-word"/>
>> <comment>Microsoft Word Document</comment>
>> <magic priority="50">
>> <match value="Microsoft\ Word\ 6.0\ Document" type="string"
>> offset="2080"/>
>> <match value="Documento\ Microsoft\ Word\ 6" type="string"
>> offset="2080"/>
>> <match value="MSWordDoc" type="string" offset="2112"/>
>> <match value="0x31be0000" type="big32" offset="0"/>
>> <match value="PO^Q`" type="string" offset="0"/>
>> <match value="\376\067\0\043" type="string" offset="0"/>
>> <match value="\333\245-\0\0\0" type="string" offset="0"/>
>> <match value="\354\245\301" type="string" offset="512"/>
>> <match value="\320\317\021\340\241\261\032\341" type="string"
>> offset="0"/>
>> <match value="\224\246\056" type="string" offset="0"/>
>> <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string"
>> offset="512"/>
>> </magic>
>> <glob pattern="*.doc"/>
>> <glob pattern="*.dot"/>
>> <sub-class-of type="application/x-tika-msoffice"/>
>> </mime-type>
>>
>> I can't imagine what I'm doing wrong. Could somebody help me?
>> Regards and thanks
>> German
>>
>
>
>
> --
> Regards
> Qiu
> - [email protected]

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

