Re: Nutch 1.0 and Office 2007 documents
Hi all,

Has anyone successfully used Nutch to index Office 2007 documents? I know that this question has already been asked, but considering the number of e-mails asking the same question, it looks like Nutch does not support Office 2007 documents.

Best,
Adilson

On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell joe.b...@prodeasystems.com wrote:

Hi,

I'm also curious as to whether anyone has had success with Nutch and parsing Office 2007 documents (.pptx, .xlsx, .docx). I get the same errors as seen here:

http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949

Is a separate plugin required to parse these documents (i.e., parse-msexcel, parse-mspowerpoint, etc. will *not* work)? I noticed the comment on the above thread, "docx should be parsed, A plugin can be used to Parsed docx file. you get some help info from parse-html plugin and so on.", but didn't find it really helpful.

Regards,
Joe

This message is confidential to Prodea Systems, Inc unless otherwise indicated or apparent from its nature. This message is directed to the intended recipient only, who may be readily determined by the sender of this message and its contents. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient: (a) any dissemination or copying of this message is strictly prohibited; and (b) immediately notify the sender by return message and destroy any copies of this message in any form (electronic, paper or otherwise) that you have. The delivery of this message and its information is neither intended to be nor constitutes a disclosure or waiver of any trade secrets, intellectual property, attorney work product, or attorney-client communications. The authority of the individual sending this message to legally bind Prodea Systems is neither apparent nor implied, and must be independently verified.
Re: Nutch 1.0 and Office 2007 documents
Hi,

There is a Tika plugin in JIRA (https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page, support for Office 2007 was imminent in POI (which Tika uses internally). The plan for Nutch is to progressively delegate parsing to Tika; NUTCH-766 has been implemented for this. I haven't checked whether Tika currently supports Office 2007, but I suggest that you try parsing documents in this format with Tika. If that works, you'll get the same support automatically via NUTCH-766.

Makes sense?

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/12/14 Adilson Oliveira Cruz adilsonoc...@gmail.com

Has anyone successfully used Nutch to index Office 2007 documents?
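Before trying a batch of documents with Tika, it can help to confirm which files are really in the Office 2007 format. The sketch below is a minimal, standard-library Python check and is not part of Nutch or Tika. It relies on the fact that .docx, .xlsx and .pptx files are ZIP packages containing a `[Content_Types].xml` entry, while the older .doc, .xls and .ppt files are OLE2 compound documents with a different signature. The function name `sniff_office_format` and the labels it returns are made up for illustration, and the folder-prefix check (`word/`, `xl/`, `ppt/`) is a heuristic based on the conventional package layout, not a full format validation.

```python
import io
import zipfile

# Signature of legacy OLE2 compound documents (.doc, .xls, .ppt).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header; Office 2007 files are ZIP packages.
ZIP_MAGIC = b"PK\x03\x04"

# Conventional top-level folders inside Office 2007 packages (heuristic).
OOXML_FOLDERS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}


def sniff_office_format(data: bytes) -> str:
    """Return 'ole2', 'docx', 'xlsx', 'pptx', 'zip' or 'unknown' (illustrative labels)."""
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if not data.startswith(ZIP_MAGIC):
        return "unknown"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            names = package.namelist()
    except zipfile.BadZipFile:
        return "unknown"
    # Every Office 2007 package carries a content-types manifest at its root.
    if "[Content_Types].xml" in names:
        for folder, kind in OOXML_FOLDERS.items():
            if any(name.startswith(folder) for name in names):
                return kind
    # A ZIP archive that does not look like an Office 2007 package.
    return "zip"
```

For example, calling `sniff_office_format(open("report.docx", "rb").read())` on a hypothetical `report.docx` saved by Word 2007 should return "docx". A result of "ole2" means the file is in the older binary format, which, as far as I know, is what plugins such as parse-msexcel and parse-mspowerpoint were written for.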
Re: Nutch 1.0 and Office 2007 documents
Hi,

Thanks for the reply. I will try to use Tika with Nutch to parse the documents. My current Nutch setup is working quite nicely and I don't want to configure another Nutch instance. If I manage to get it working, I will write a mini how-to here.

Best,
Adilson

On Mon, Dec 14, 2009 at 10:00 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote:

There is a Tika plugin in JIRA (https://issues.apache.org/jira/browse/NUTCH-766).
Re: Nutch 1.0 and Office 2007 documents
"If I manage to get it working, I will write a mini how-to here."

The Nutch Wiki would be the right place for that. It would be nice to have a page there listing the differences between the capabilities of the Tika plugin and the existing Nutch parsing plugins, as there might be differences between them (support for Office 2007 potentially being one of them).

Note that the Tika plugin is VERY beta.

Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/12/14 Adilson Oliveira Cruz adilsonoc...@gmail.com

Thanks for the reply. I will try to use Tika with Nutch to parse the documents.
Re: Nutch 1.0 and Office 2007 documents
I have created a page, http://wiki.apache.org/nutch/TikaPlugin; feel free to use it for your how-to.

J.
--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/12/14 Julien Nioche lists.digitalpeb...@gmail.com

The Nutch Wiki would be the right place for that.