Nutch 1.0 and Office 2007 documents
Hi, I'm also curious whether anyone has had success parsing Office 2007 documents (.pptx, .xlsx, .docx) with Nutch. I get the same errors as seen here: http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949 Is a separate plugin required to parse these documents (i.e., will parse-msexcel, parse-mspowerpoint, etc. *not* work)? I noticed this comment on the above thread: "docx should be parsed,A plugin can be used to Parsed docx file. you get some help info from parse-html plugin and so on." I didn't find it really helpful, though. Regards, Joe This message is confidential to Prodea Systems, Inc unless otherwise indicated or apparent from its nature. This message is directed to the intended recipient only, who may be readily determined by the sender of this message and its contents. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient:(a)any dissemination or copying of this message is strictly prohibited; and(b)immediately notify the sender by return message and destroy any copies of this message in any form(electronic, paper or otherwise) that you have.The delivery of this message and its information is neither intended to be nor constitutes a disclosure or waiver of any trade secrets, intellectual property, attorney work product, or attorney-client communications. The authority of the individual sending this message to legally bind Prodea Systems is neither apparent nor implied,and must be independently verified.
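One thing that may help when comparing errors across file types: Office 2007 files (.docx, .xlsx, .pptx) are ZIP packages, while the older .doc, .xls and .ppt files are OLE2 compound documents, so the two families need different parsers. The sketch below only tells the containers apart by their leading bytes. The function name `office_container` is made up for illustration, and whether the Nutch 1.0 parse-ms* plugins handle only the OLE2 family is an assumption to verify, not something this thread confirms.

```python
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc/.xls/.ppt container
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header


def office_container(path):
    """Return 'ooxml', 'ole2', 'zip' or 'unknown' for the file at path.

    Hypothetical helper: it inspects the container only and does not
    validate the document inside it.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head == OLE2_MAGIC:
        return "ole2"
    if head.startswith(ZIP_MAGIC) and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # OOXML packages carry this part at the archive root.
            if "[Content_Types].xml" in zf.namelist():
                return "ooxml"
        return "zip"
    return "unknown"
```

Running this over a few of the files that fail to parse would at least show which container format the parser is being handed.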
Nutch 1.0 ms-powerpoint plugin
Hi, this is my first post to the Nutch mailing list, so please let me know if I commit any list protocol errors. I'm currently using Nutch 1.0 with the PowerPoint plugin enabled, and I can verify that Nutch is pulling in the entire file to pass to the parser (I've set the content limit to -1 to get the full file). However, it appears that most PowerPoint files of any complexity (those that use a template or contain tables, images, etc.) do not get indexed at all. In one case I created a new file with a single "title" slide; the title text was recognized, but the subtitle text directly underneath was not. My question is whether I'm missing something that has already been covered (for example, http://issues.apache.org/jira/browse/NUTCH-463, though I don't see any logs indicating issues in my crawl) or whether this is a known defect in the existing PowerPoint plugin. I'd very much like to be able to index PowerPoint slides completely, as this is going to be the most common document type on my site. Thanks, Joe
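For anyone trying to reproduce this setup, the two settings described above can be checked with a small script. This is a minimal sketch, assuming the configuration lives in a Hadoop-style `nutch-site.xml`, that the content limit in question is `http.content.limit`, and that the plugin id is `parse-mspowerpoint`. The helper names `read_properties` and `check_setup` are hypothetical, and Python's regex dialect is only an approximation of the Java one that Nutch applies to `plugin.includes`.

```python
import re
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map property names to values from a Hadoop-style configuration file."""
    root = ET.fromstring(xml_text)
    return {
        p.findtext("name", "").strip(): p.findtext("value", "").strip()
        for p in root.iter("property")
    }


def check_setup(xml_text, plugin="parse-mspowerpoint"):
    """Report whether truncation is disabled and the parser plugin is included.

    Only the given file is inspected; properties missing from it fall back
    to nutch-default.xml, which this sketch does not read.
    """
    props = read_properties(xml_text)
    limit = props.get("http.content.limit")
    includes = props.get("plugin.includes", "")
    return {
        # A negative limit means fetched content is not truncated.
        "content_limit_unbounded": limit is not None and int(limit) < 0,
        # plugin.includes is a regex that must match the whole plugin id.
        "plugin_enabled": bool(includes) and re.fullmatch(includes, plugin) is not None,
    }
```

If both values come back true and complex slides still produce no text, the problem is more likely in the parser itself than in the fetch settings.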