Hi,
I'm also curious whether anyone has had success parsing Office 2007
documents (.pptx, .xlsx, .docx) with Nutch. I get the same errors as
seen here:
http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949
Is a
Hi, this is my first post to the Nutch mailing list, so please let me
know if I commit any list protocol errors.
I'm currently using Nutch 1.0 with the PowerPoint plugin enabled and can
verify that Nutch is indeed pulling in the entire file for passing off
to the parser (i.e., I've set the