Hello list,

I'm trying to parse different document formats with Nutch 1.2, but only an XLSX (Excel) document is parsed in addition to HTML files. This is how my plugin.includes settings are configured at the moment:

<value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
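For context, that value would normally sit inside a full property element. The following is only a sketch: the file location (conf/nutch-site.xml, overriding nutch-default.xml) is an assumption and is not stated above.

```xml
<!-- Assumed location: conf/nutch-site.xml -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
</property>
```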

And I have several questions as well:
1. Should I define formats such as pdf and msword when tika is set? Or is the tika setting sufficient?

2. Do I have to enable all document formats in parse-plugins.xml? A lot of document formats are disabled. Even though I tried to enable them, e.g. <mimeType name="application/msword">, it still does not parse MS Word.

3. Why is the XLSX document parsed when I haven't yet defined msexcel (or enabled the type in parse-plugins.xml)?

Here are some lines from the hadoop.log file regarding MS Word:

2010-10-08 10:29:03,925 INFO fetcher.Fetcher - fetching http://ridder.uio.no/wtest2.doc
2010-10-08 10:29:04,012 WARN parse.ParseUtil - Unable to successfully parse content http://ridder.uio.no/wtest2.doc of type application/x-tika-msoffice
2010-10-08 10:29:04,013 WARN fetcher.Fetcher - Error parsing: http://ridder.uio.no/wtest2.doc: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content

I need to parse the following document types:
- html
- xml
- MS Word (doc and docx)
- MS Excel (xls and xlsx)
- RTF
- PDF
- ODT

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
