Re: Parse MS Office etc. in Nutch 1.2

Markus Jelsma Tue, 23 Nov 2010 04:55:56 -0800

You can patch your version now.
https://issues.apache.org/jira/browse/NUTCH-912


On Friday 08 October 2010 11:28:44 Erlend Garåsen wrote:
> Hello list,
> 
> I'm trying to parse different document formats with Nutch 1.2, but only
> an XLSX (Excel) document is parsed in addition to HTML files. This is
> how my plugin.includes settings are configured at the moment:
> 
> <value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
> 
> And I have several questions as well:
> 1. Should I define formats such as pdf and msword when tika is set? Or
> is the tika setting sufficient?
> 
> 2. Do I have to enable all document formats in parse-plugins.xml? A lot
> of document formats are disabled. Even though I tried to enable them,
> e.g. <mimeType name="application/msword">, it still does not parse MS Word.
> 
> 3. Why is the XLSX document parsed when I (still) haven't defined
> msexcel (or enabled the type in parse-plugins.xml)?
> 
> Here are some lines from the hadoop.log file regarding MS Word:
> 
> 2010-10-08 10:29:03,925 INFO  fetcher.Fetcher - fetching
> http://ridder.uio.no/wtest2.doc
> 2010-10-08 10:29:04,012 WARN  parse.ParseUtil - Unable to successfully
> parse content http://ridder.uio.no/wtest2.doc of type
> application/x-tika-msoffice
> 2010-10-08 10:29:04,013 WARN  fetcher.Fetcher - Error parsing:
> http://ridder.uio.no/wtest2.doc: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> 
> I need to parse the following document types:
> - html
> - xml
> - MS Word (doc and docx)
> - MS Excel (xls and xlsx)
> - RTF
> - PDF
> - ODT
> 
> Erlend
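
As a rough aid for questions 1 and 3 in the quoted message, the sketch below checks which plugin ids the quoted plugin.includes value would accept. It assumes Nutch treats plugin.includes as a regular expression that must match each whole plugin id. The candidate ids and the helper name `enabled` are hypothetical and chosen only for illustration.

```python
import re

# plugin.includes value from the quoted message, joined back onto one line.
PLUGIN_INCLUDES = (
    r"protocol-httpclient|parse-(text|html|pdf|msword|tika)"
    r"|index-(basic|more)|query-(basic|site|url|lang)"
)


def enabled(plugin_ids, includes=PLUGIN_INCLUDES):
    """Return the ids that the expression matches as a whole string.

    Assumption: whole-string matching mirrors how Nutch filters plugins.
    """
    pattern = re.compile(includes)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


# Hypothetical candidate ids, for illustration only.
candidates = ["parse-tika", "parse-msword", "parse-msexcel", "parse-pdf", "index-more"]
print(enabled(candidates))
# prints ['parse-tika', 'parse-msword', 'parse-pdf', 'index-more']
```

Under that assumption, parse-msexcel is not enabled by this setting while parse-tika is, which is consistent with (though not proof of) the XLSX file having been handled by parse-tika.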

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
