Hi

> I'm trying to parse different document formats with Nutch 1.2, but only an
> XLSX (Excel) document is parsed in addition to HTML files. This is how my
> plugin.includes settings are configured at the moment:
>
>
> <value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
>
> And I have several questions as well:
> 1. Should I define formats such as pdf and msword when tika is set? Or is
> the tika setting sufficient?
>

Tika handles the pdf and msword formats, so you should not need to specify
them in plugin.includes.
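
As a rough sketch, assuming everything else in your setting stays as you
posted it, the value with the pdf and msword entries dropped would look
something like this (the surrounding property element is the usual
nutch-site.xml layout, not something taken from your file):

```xml
<!-- Sketch only: your posted value with pdf and msword removed, -->
<!-- leaving parse-tika to handle those formats. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|parse-(text|html|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
</property>
```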


> 2. Do I have to enable all document formats in parse-plugins.xml? A lot of
> document formats are disabled. Even though I tried to enable them, e.g.
> <mimeType name="application/msword">, it still does not parse MS Word.


Since you've added parse-tika to plugin.includes, it will be used by default
on all mime-types, which is why we don't have an explicit association for
every mimeType in parse-plugins.xml.
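
For illustration, a catch-all association of this kind in parse-plugins.xml
would look roughly like the fragment below. I am writing it from memory, so
treat the exact layout as an assumption and compare it with the
conf/parse-plugins.xml shipped with your release:

```xml
<!-- Assumed shape of the wildcard entry: any mime-type without an -->
<!-- explicit mapping falls through to parse-tika. -->
<mimeType name="*">
  <plugin id="parse-tika" />
</mimeType>
```

With an entry like this in place, uncommenting individual mimeType elements
such as application/msword should not be necessary.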



> 3. Why is the XLSX document parsed when I (still) haven't defined msexcel
> (or enabled the type in parse-plugins.xml)?
>

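See the explanation for point 2 above: parse-tika is used by default for all
mime-types, so XLSX is parsed without any msexcel setting.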


>
> Here are some lines from the hadoop.log file regarding MS Word:
>
> 2010-10-08 10:29:03,925 INFO  fetcher.Fetcher - fetching
> http://ridder.uio.no/wtest2.doc
> 2010-10-08 10:29:04,012 WARN  parse.ParseUtil - Unable to successfully
> parse content http://ridder.uio.no/wtest2.doc of type
> application/x-tika-msoffice
> 2010-10-08 10:29:04,013 WARN  fetcher.Fetcher - Error parsing:
> http://ridder.uio.no/wtest2.doc: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
>

Strange, I can't reproduce the issue with nutch-1.2. Can you try running the
command below? This is the output I get:

bin/nutch org.apache.nutch.parse.ParserChecker http://ridder.uio.no/wtest2.doc

---------
Url
---------------
http://ridder.uio.no/wtest2.doc
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: ETag="20064f-5600-47d0d1f0f8400" Date=Fri, 08 Oct 2010
10:19:19 GMT Content-Length=22016 Last-Modified=Wed, 13 Jan 2010 15:06:56
GMT Content-Type=application/msword Connection=close Accept-Ranges=bytes
Server=Apache/2.2.16 (Unix) mod_ssl/2.2.16 OpenSSL/0.9.8j DAV/2 PHP/5.2.14
mod_perl/2.0.4 Perl/v5.8.8
Parse Metadata: Revision-Number=2 Last-Author=Erlend Garåsen
Template=Normal.dotm subject= Page-Count=1 Application-Name=Microsoft
Macintosh Word Author=Erlend Garåsen Edit-Time=600000000 Creation-Date=Wed
Jan 13 14:58:00 GMT 2010 Company=Universitetet i Oslo
Content-Type=application/msword Keywords= Last-Save-Date=Wed Jan 13 14:58:00
GMT 2010

HTH

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
