Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Adilson Oliveira Cruz
 Hi all,

 Has anyone successfully used Nutch to index Office 2007 documents? I know that
this question has already been asked, but considering the number of e-mails
asking the same question, it looks like Nutch does not support Office 2007
documents.

 Best,

 Adilson

On Wed, Dec 9, 2009 at 2:27 PM, Joe Bell joe.b...@prodeasystems.com wrote:

 Hi,



 I'm also curious as to whether anyone has had success with Nutch and
 parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same
 errors as seen here -
 http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949



 Is a separate plugin required to parse these documents (i.e., will
 parse-msexcel, parse-mspowerpoint, etc. *not* work)?



 I noticed the comment on the above thread - "docx should be parsed, A
 plugin can be used to Parsed docx file. you get some
 help info from parse-html plugin and so on." - but didn't find it really
 helpful.



 Regards,

 Joe






Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche
Hi,

There is a Tika plugin in JIRA
(https://issues.apache.org/jira/browse/NUTCH-766). According to Tika's page,
support for Office 2007 was imminent in POI (which Tika uses internally).
The plan for Nutch is to progressively delegate parsing to Tika; NUTCH-766
was implemented for this. I haven't checked whether Tika currently supports
Office 2007, but I suggest that you try parsing documents in this format
with Tika. If it works, you'll get that automatically via NUTCH-766.

Makes sense?

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com
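
Before trying a sample file with Tika, it can help to confirm which container format the file actually uses. Legacy Office files (.doc, .xls, .ppt) use the OLE2 binary container, while Office 2007 files (.docx, .xlsx, .pptx) are ZIP packages of XML parts, so a parser written for the former will likely not read the latter. The following is a minimal sketch in Python using only the standard library. It does not use Nutch or Tika, and the function name `sniff_office_format` and its return labels are made up for illustration.

```python
import io
import zipfile

# Signature of the OLE2 compound file used by legacy .doc/.xls/.ppt files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Signature of a ZIP local file header; Office 2007 files are ZIP packages.
ZIP_MAGIC = b"PK\x03\x04"

# Top-level directory holding the main part for each Office 2007 file type.
OOXML_MAIN_DIRS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}


def sniff_office_format(data: bytes) -> str:
    """Classify raw bytes as 'ole2', 'docx', 'xlsx', 'pptx', 'zip' or 'unknown'.

    Hypothetical helper for illustration only; it looks at the container,
    not at whether the document content itself is valid.
    """
    if data.startswith(OLE2_MAGIC):
        return "ole2"
    if not data.startswith(ZIP_MAGIC):
        return "unknown"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            names = package.namelist()
    except zipfile.BadZipFile:
        # Starts like a ZIP but is truncated or corrupt.
        return "unknown"
    # An Office 2007 package carries a [Content_Types].xml part at its root.
    if "[Content_Types].xml" not in names:
        return "zip"
    for prefix, kind in OOXML_MAIN_DIRS.items():
        if any(name.startswith(prefix) for name in names):
            return kind
    return "zip"
```

Reading a few crawled files in binary mode and passing their bytes to this function would show whether they are genuine Office 2007 packages before any parsing plugin is involved.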




Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Adilson Oliveira Cruz
 Hi,

 Thanks for the reply. I will try to use Tika with Nutch to parse the
documents. My current Nutch setup is working quite nicely and I don't want to
configure another Nutch instance.

 If I manage to get it working, I will write a mini how-to here.

 Best,

 Adilson

 



Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche

  If I manage to get it working, I will write a mini how-to here.


The Nutch Wiki would be the right place for that. It would be nice to have a
page there listing the differences between the capabilities of the Tika
plugin and the existing Nutch parsing plugins, as there might be differences
between them (support for Office 2007 potentially being one of them).

Note that the Tika plugin is VERY beta.

Julien




Re: Nutch 1.0 and Office 2007 documents

2009-12-14 Thread Julien Nioche
I have created a page, http://wiki.apache.org/nutch/TikaPlugin; feel free to
use it for your how-to.

J.
