Hello list,
I'm trying to parse different document formats with Nutch 1.2, but apart
from HTML files, only an XLSX (Excel) document gets parsed. This is my
current plugin.includes setting:
<value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
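To make it easier to see what that expression actually selects, here is a minimal Python sketch. It assumes Nutch treats plugin.includes as a regular expression that has to match the whole plugin id. The candidate ids in the list are only illustrative and may not all exist as plugins in 1.2.

```python
import re

# Value of plugin.includes, copied from the setting above.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|parse-(text|html|pdf|msword|tika)"
    r"|index-(basic|more)|query-(basic|site|url|lang)"
)

# Illustrative plugin ids (assumed names, not a verified list for Nutch 1.2).
CANDIDATES = [
    "parse-tika",
    "parse-html",
    "parse-pdf",
    "parse-msword",
    "parse-msexcel",
    "protocol-http",
]


def selected(pattern, plugin_ids):
    """Return the ids that the pattern matches in full."""
    regex = re.compile(pattern)
    return [pid for pid in plugin_ids if regex.fullmatch(pid)]


if __name__ == "__main__":
    chosen = selected(PLUGIN_INCLUDES, CANDIDATES)
    for pid in CANDIDATES:
        print(pid, "included" if pid in chosen else "not included")
```

Under that assumption, an id such as parse-msexcel is not selected by the expression, while parse-tika is.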
I also have a few questions:
1. Should I list formats such as pdf and msword when tika is already
included? Or is the tika entry sufficient on its own?
2. Do I have to enable all document formats in parse-plugins.xml? A lot
of document formats are disabled there. I tried enabling some of them,
e.g. <mimeType name="application/msword">, but MS Word is still not parsed.
3. Why is the XLSX document parsed when I haven't yet defined
msexcel (or enabled the type in parse-plugins.xml)?
Here are some lines from the hadoop.log file regarding MS Word:
2010-10-08 10:29:03,925 INFO fetcher.Fetcher - fetching http://ridder.uio.no/wtest2.doc
2010-10-08 10:29:04,012 WARN parse.ParseUtil - Unable to successfully parse content http://ridder.uio.no/wtest2.doc of type application/x-tika-msoffice
2010-10-08 10:29:04,013 WARN fetcher.Fetcher - Error parsing: http://ridder.uio.no/wtest2.doc: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
I need to parse the following document types:
- html
- xml
- MS Word (doc and docx)
- MS Excel (xls and xlsx)
- RTF
- PDF
- ODT
Erlend
--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050