Parse MS Office etc. in Nutch 1.2

Erlend Garåsen Fri, 08 Oct 2010 02:29:23 -0700


Hello list,

I'm trying to parse different document formats with Nutch 1.2, but only an XLSX (Excel) document is parsed in addition to HTML files. This is how my plugin.includes settings are configured at the moment:


<value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
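
As a sketch only, assuming the override lives in conf/nutch-site.xml (the file name and the surrounding property element follow the usual Nutch configuration layout and are not copied from the actual file), the complete entry would look roughly like this:

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element -->
<property>
  <name>plugin.includes</name>
  <!-- Value exactly as quoted above -->
  <value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
</property>
```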

And I have several questions as well:

1. Should I define formats such as pdf and msword when tika is set? Or is the tika setting sufficient?

2. Do I have to enable all document formats in parse-plugins.xml? A lot of document formats are disabled. Even though I tried to enable them, e.g. <mimeType name="application/msword">, it still does not parse MS Word.

3. Why is the XLSX document parsed when I (still) haven't defined msexcel (or enabled the type in parse-plugins.xml)?


Here are some lines from the hadoop.log file regarding MS Word:

2010-10-08 10:29:03,925 INFO fetcher.Fetcher - fetching http://ridder.uio.no/wtest2.doc
2010-10-08 10:29:04,012 WARN parse.ParseUtil - Unable to successfully parse content http://ridder.uio.no/wtest2.doc of type application/x-tika-msoffice
2010-10-08 10:29:04,013 WARN fetcher.Fetcher - Error parsing: http://ridder.uio.no/wtest2.doc: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content


I need to parse the following document types:
- html
- xml
- MS Word (doc and docx)
- MS Excel (xls and xlsx)
- RTF
- PDF
- ODT

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
