Hi

> I'm trying to parse different document formats with Nutch 1.2, but only an
> XLSX (Excel) document is parsed in addition to HTML files. This is how my
> plugin.includes settings are configured at the moment:
>
> <value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
>
> And I have several questions as well:
> 1. Should I define formats such as pdf and msword when tika is set? Or is
> the tika setting sufficient?

Tika handles the pdf and msword formats so you should not need to specify them in plugin.includes

> 2. Do I have to enable all document formats in parse-plugins.xml? A lot of
> document formats are disabled. Even though I tried to enable them, e.g.
> <mimeType name="application/msword">, it still does not parse MS Word.

Since you've added parse-tika to plugin.includes, it will be used by default on all mime-types which is why we don't have an explicit association for all mimeTypes in parse-plugins.xml

> 3. Why is the XLSX document parsed when I (still) haven't defined msexcel
> (or enabled the type in parse-plugins.xml?

See explanation for point 2 above

> Here are some lines from the hadoop.log file regarding MS Word:
>
> 2010-10-08 10:29:03,925 INFO fetcher.Fetcher - fetching
> http://ridder.uio.no/wtest2.doc
> 2010-10-08 10:29:04,012 WARN parse.ParseUtil - Unable to successfully
> parse content http://ridder.uio.no/wtest2.doc of type
> application/x-tika-msoffice
> 2010-10-08 10:29:04,013 WARN fetcher.Fetcher - Error parsing:
> http://ridder.uio.no/wtest2.doc: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse content

Strange, I can't reproduce the issue with nutch-1.2. Can you try running the command below?

bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc

--------- Url ---------------
http://ridder.uio.no/wtest2.doc
--------- ParseData ---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag="20064f-5600-47d0d1f0f8400" Date=Fri, 08 Oct 2010 10:19:19 GMT Content-Length=22016 Last-Modified=Wed, 13 Jan 2010 15:06:56 GMT Content-Type=application/msword Connection=close Accept-Ranges=bytes Server=Apache/2.2.16 (Unix) mod_ssl/2.2.16 OpenSSL/0.9.8j DAV/2 PHP/5.2.14 mod_perl/2.0.4 Perl/v5.8.8
Parse Metadata: Revision-Number=2 Last-Author=Erlend Garåsen Template=Normal.dotm subject= Page-Count=1 Application-Name=Microsoft Macintosh Word Author=Erlend Garåsen Edit-Time=600000000 Creation-Date=Wed Jan 13 14:58:00 GMT 2010 Company=Universitetet i Oslo Content-Type=application/msword Keywords= Last-Save-Date=Wed Jan 13 14:58:00 GMT 2010

HTH

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
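
P.S. To make the answer to question 1 concrete, here is a rough sketch of what the plugin.includes property could look like once the pdf and msword entries are dropped. The value is the one quoted in the question with only those two entries removed. The surrounding property wrapper and the assumption that the override lives in conf/nutch-site.xml are mine, not something stated in this thread, so adjust to your setup.

```xml
<!-- Assumed location: conf/nutch-site.xml (not confirmed in this thread). -->
<property>
  <name>plugin.includes</name>
  <!-- Same value as in the question, minus pdf and msword:
       parse-tika is expected to pick up those formats. -->
  <value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
</property>
```

With a value like this there should be no need to add explicit mimeType entries to parse-plugins.xml for Word or Excel files, since parse-tika is used by default as explained under question 2.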

