You can patch your version now. https://issues.apache.org/jira/browse/NUTCH-912

On Friday 08 October 2010 11:28:44 Erlend Garåsen wrote:
> Hello list,
>
> I'm trying to parse different document formats with Nutch 1.2, but only
> an XLSX (Excel) document is parsed in addition to HTML files. This is
> how my plugin.includes settings are configured at the moment:
>
> <value>protocol-httpclient|parse-(text|html|pdf|msword|tika)|index-(basic|more)|query-(basic|site|url|lang)</value>
>
> And I have several questions as well:
> 1. Should I define formats such as pdf and msword when tika is set? Or
> is the tika setting sufficient?
>
> 2. Do I have to enable all document formats in parse-plugins.xml? A lot
> of document formats are disabled. Even though I tried to enable them,
> e.g. <mimeType name="application/msword">, it still does not parse MS Word.
>
> 3. Why is the XLSX document parsed when I (still) haven't defined
> msexcel (or enabled the type in parse-plugins.xml?
>
> Here are some lines from the hadoop.log file regarding MS Word:
>
> 2010-10-08 10:29:03,925 INFO fetcher.Fetcher - fetching
> http://ridder.uio.no/wtest2.doc
> 2010-10-08 10:29:04,012 WARN parse.ParseUtil - Unable to successfully
> parse content http://ridder.uio.no/wtest2.doc of type
> application/x-tika-msoffice
> 2010-10-08 10:29:04,013 WARN fetcher.Fetcher - Error parsing:
> http://ridder.uio.no/wtest2.doc: failed(2,200):
> org.apache.nutch.parse.ParseException: Unable to successfully parse content
>
> I need to parse the following document types:
> - html
> - xml
> - MS Word (doc and docx)
> - MS Excel (xls and xlsx)
> - RTF
> - PDF
> - ODT
>
> Erlend

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350

