Sorry to repeat this mail. Has no one using Nutch 1.2 had problems indexing MS-Office documents? Should I send this problem to the Tika list?
Regards and thanks,
German

---------- Forwarded message ----------
From: Germán Biozzoli <[email protected]>
Date: Wed, Nov 17, 2010 at 3:16 PM
Subject: problem with tika and MS-Office file - Nucth 1.2
To: [email protected]

Hi everybody,

I'm using Nutch 1.2 to crawl a set of specialized sites. HTML and PDF files parse fine, but when it tries to parse .doc files, the following message appears:

    Unable to successfully parse content http://xxx of type application/x-tika-msoffice

I've tried to follow what is shown here:

http://www.mail-archive.com/[email protected]/msg01073.html

but I really cannot find a solution. When I run the same command as in that thread, Nutch returns:

    r...@tango06:/home/apache-nutch-1.2# bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
    Exception in thread "main" org.apache.nutch.parse.ParseException: parser not found for contentType=application/x-tika-msoffice url=http://ridder.uio.no/wtest2.doc
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)

In nutch-default.xml, the plugin folder is set as follows:

    <property>
      <name>plugin.folders</name>
      <value>/home/apache-nutch-1.2/build/plugins</value>
      <description>Directories where nutch plugins are located. Each
      element may be a relative or absolute path. If absolute, it is used
      as is. If relative, it is searched for on the classpath.</description>
    </property>

The path is OK, and tika-mimetypes.xml contains:

    <mime-type type="application/msword">
      <alias type="application/vnd.ms-word"/>
      <comment>Microsoft Word Document</comment>
      <magic priority="50">
        <match value="Microsoft\ Word\ 6.0\ Document" type="string" offset="2080"/>
        <match value="Documento\ Microsoft\ Word\ 6" type="string" offset="2080"/>
        <match value="MSWordDoc" type="string" offset="2112"/>
        <match value="0x31be0000" type="big32" offset="0"/>
        <match value="PO^Q`" type="string" offset="0"/>
        <match value="\376\067\0\043" type="string" offset="0"/>
        <match value="\333\245-\0\0\0" type="string" offset="0"/>
        <match value="\354\245\301" type="string" offset="512"/>
        <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
        <match value="\224\246\056" type="string" offset="0"/>
        <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string" offset="512"/>
      </magic>
      <glob pattern="*.doc"/>
      <glob pattern="*.dot"/>
      <sub-class-of type="application/x-tika-msoffice"/>
    </mime-type>

I can't imagine what I'm doing wrong. Could somebody help me?

Regards and thanks,
German
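
A minimal sketch (plain Python, not part of Nutch) of one way to double-check the plugin.folders setting quoted above: it reads the property from a configuration fragment and reports which of the listed folders contain a given plugin directory with a plugin.xml file. The helper names, the <configuration> wrapper, the comma-separated handling of several folders, and the plugin id "parse-tika" are assumptions made for illustration; only the property name and path come from the message.

```python
import os
import xml.etree.ElementTree as ET


def read_property(conf_xml, name):
    """Return the <value> text of the named <property>, or None if absent."""
    root = ET.fromstring(conf_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None


def folders_with_plugin(folders_value, plugin_id):
    """Return the listed folders that contain <plugin_id>/plugin.xml."""
    # Assumption: several folders may be given as a comma-separated list.
    folders = [f.strip() for f in folders_value.split(",") if f.strip()]
    return [f for f in folders
            if os.path.isfile(os.path.join(f, plugin_id, "plugin.xml"))]


# Fragment copied from the message, wrapped in an assumed root element.
CONF = """
<configuration>
  <property>
    <name>plugin.folders</name>
    <value>/home/apache-nutch-1.2/build/plugins</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    folders = read_property(CONF, "plugin.folders")
    # "parse-tika" is the assumed id of the Tika parser plugin.
    print(folders_with_plugin(folders, "parse-tika"))
```

An empty list would suggest that the plugin directory is missing from the configured folder. A non-empty list only shows that the files are present, not that the plugin is enabled or mapped to the content type.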

