Erlend Garåsen wrote:

After I changed the nutch-default.xml file back to its original form, I'm able to parse html, pdf, xls and xlsx (by using the command line instead of running Nutch inside Eclipse). But do you know how to parse MS Office, RTF and ODT?

Sorry, I'm able to parse doc, docx, sxw, odt and rtf as well. After I removed the plugin.folders setting that I had changed in order to run Nutch inside Eclipse, everything works.

BTW, I see the following in my log file:
2010-10-08 13:56:32,555 WARN more.MoreIndexingFilter - http://ridder.uio.no/test1.xlsx: can't parse erroneous date: 2010-10-08T13:55:54Z
2010-10-08 13:56:32,558 WARN more.MoreIndexingFilter - http://ridder.uio.no/wtest1.docx: can't parse erroneous date: 2010-10-08T13:55:49Z

Should I report this as an IndexingFilter bug? It seems that I need to rewrite it in order to parse the date correctly, but it is not a big issue right now.
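
For reference, the two rejected values in the log are ISO 8601 timestamps in UTC (the trailing Z). Below is a minimal sketch of parsing that format, assuming a plain SimpleDateFormat pattern is enough. The class IsoDateSketch and the method parseIsoUtc are hypothetical names, not part of Nutch, and this does not show how MoreIndexingFilter chooses its date patterns.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class IsoDateSketch {

    // Hypothetical helper (not a Nutch API): parses values such as
    // 2010-10-08T13:55:54Z, treating the literal 'Z' as UTC.
    public static Date parseIsoUtc(String value) {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false); // reject out-of-range fields such as month 13
        try {
            return format.parse(value);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not an ISO 8601 UTC timestamp: " + value, e);
        }
    }

    public static void main(String[] args) {
        // Prints the parsed instant as milliseconds since the epoch.
        System.out.println(parseIsoUtc("2010-10-08T13:55:54Z").getTime());
    }
}
```

If the filter works from a list of accepted date patterns, adding a pattern like the one above might be enough, but that is an assumption to check against the MoreIndexingFilter source.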

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
