Tika is not almighty. Try downloading the document to local disk and parsing it with the Tika plugin, to check whether it can be parsed at all.
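Something along these lines might do for the local check. It is only a rough sketch: it assumes you have a standalone Tika application jar on disk (the name `tika-app.jar` and the helper `check_doc` are hypothetical, not something shipped with Nutch) and it uses the Tika command-line options `--metadata` and `--text`.

```shell
#!/bin/sh
# check_doc FILE TIKA_JAR
# Runs a standalone Tika application jar against a local file.
# ASSUMPTION: TIKA_JAR points to a tika-app jar you downloaded yourself;
# its name and location are hypothetical and not part of the Nutch distribution.
check_doc() {
    doc="$1"
    jar="$2"
    if [ ! -f "$doc" ]; then
        echo "document not found: $doc" >&2
        return 2
    fi
    if [ ! -f "$jar" ]; then
        echo "Tika jar not found: $jar" >&2
        return 3
    fi
    # Print the metadata Tika extracts, including the detected Content-Type.
    java -jar "$jar" --metadata "$doc" || return 1
    # Try to extract plain text; a failure here suggests Tika itself
    # cannot parse the file, independent of the Nutch configuration.
    java -jar "$jar" --text "$doc" > /dev/null || return 1
    echo "parsed OK: $doc"
}

# Example usage (uncomment to run; the URL is the one from the original post):
# wget -O wtest2.doc http://ridder.uio.no/wtest2.doc
# check_doc wtest2.doc ./tika-app.jar
```

If Tika parses the file on its own, the document is probably fine and the problem is more likely on the Nutch side.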

On Thu, Nov 18, 2010 at 2:16 AM, Germán Biozzoli <[email protected]> wrote:

> Hi everybody
>
> I'm using Nutch 1.2 to crawl a set of specialized sites. I could parse
> OK html and pdf files, but when it tries to parse doc files, the
> following message appears:
>
> Unable to successfully
> parse content http://xxx of type
> application/x-tika-msoffice
>
> I've tried to follow what is shown here:
>
> http://www.mail-archive.com/[email protected]/msg01073.html
>
> But really cannot find a solution. Only if I test the same command,
> nutch returns:
>
>
> r...@tango06:/home/apache-nutch-1.2# bin/nutch
> org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc
> Exception in thread "main" org.apache.nutch.parse.ParseException:
> parser not found for contentType=application/x-tika-msoffice
> url=http://ridder.uio.no/wtest2.doc
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
>         at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:97)
>
> I have at nutch-default.xml the plugin folder in
>
> <property>
>   <name>plugin.folders</name>
>   <value>/home/apache-nutch-1.2/build/plugins</value>
>   <description>Directories where nutch plugins are located. Each
>   element may be a relative or absolute path. If absolute, it is used
>   as is. If relative, it is searched for on the classpath.</description>
> </property>
>
> The path is ok
>
> and the tika-mimetypes.xml
>
> <mime-type type="application/msword">
>   <alias type="application/vnd.ms-word"/>
>   <comment>Microsoft Word Document</comment>
>   <magic priority="50">
>     <match value="Microsoft\ Word\ 6.0\ Document" type="string"
>       offset="2080"/>
>     <match value="Documento\ Microsoft\ Word\ 6" type="string"
>       offset="2080"/>
>     <match value="MSWordDoc" type="string" offset="2112"/>
>     <match value="0x31be0000" type="big32" offset="0"/>
>     <match value="PO^Q`" type="string" offset="0"/>
>     <match value="\376\067\0\043" type="string" offset="0"/>
>     <match value="\333\245-\0\0\0" type="string" offset="0"/>
>     <match value="\354\245\301" type="string" offset="512"/>
>     <match value="\320\317\021\340\241\261\032\341" type="string"
>       offset="0"/>
>     <match value="\224\246\056" type="string" offset="0"/>
>     <match value="R\0o\0o\0t\0\ \0E\0n\0t\0r\0y" type="string"
>       offset="512"/>
>   </magic>
>   <glob pattern="*.doc"/>
>   <glob pattern="*.dot"/>
>   <sub-class-of type="application/x-tika-msoffice"/>
> </mime-type>
>
> I can't imagine what I'm doing wrong. Somebody could help me?
> Regards and thanks
> German
>

--
Regards
Qiu - [email protected]

