Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Ferdy Galema
Agree with only releasing src.

On Thu, Jun 14, 2012 at 11:32 PM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

  Or just not ship a bin release at all. Src is the only thing we really
 VOTE on legally though bin is provided for convenience purposes. Will type
 more on this later...

 Sent from my iPhone

 On Jun 14, 2012, at 2:18 PM, Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com wrote:

   Hi Julien,

 Do you suggest with the binary release that we simply open up all gora-*
 deps and ship it with every jar available?

 Lewis

 On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

 I disagree. You'd expect a binary release to work out of the box - which
 is not the case. Plus we'd have to spend more time explaining the
 workaround, answering the same questions over and over on the ML etc...
 Fixing this should not be a big deal (i.e. add the gora-x modules for the
 backends to the ivy deps file).

 Julien


 On 14 June 2012 20:27, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 I think the annoyance is probably something folks can live with as they
 have been waiting for an official release of 2.x for years :)

 My +1 to roll RC #2 with or without a solution to this and mark it as a
 TODO. Release early, release often :)

 Cheers,
 Chris

 On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

  Aye this is no good at all. Depending on which backend you wish to use
 with Gora, you will need to go and manually fetch the correct jars from
 Maven Central.
 
  Does anyone else have either a solution or a workaround before I push
 RC2 with just src dists?
 
  Thanks
 
  Lewis
 
  On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:
   We only supply src distributions...
   Does this principle apply to Nutch 2 as well?
  Maybe, yes.
  The situation with the current binary package is uncomfortable:
  I had to copy/link gora-hbase and hbase jars into lib/ to get nutch
 running.
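
 The workaround described above (copying the gora-hbase and hbase jars into
 lib/ by hand) can be checked with a few lines of script. What follows is only
 a minimal sketch under stated assumptions: the helper name missing_jars, the
 REQUIRED_JAR_PREFIXES mapping and the "lib" path are hypothetical and are not
 part of Nutch or Gora; only the two jar names for the HBase backend come from
 this thread.

```python
from pathlib import Path

# Assumption: jar name prefixes per backend. Only the "hbase" entry is taken
# from the thread (gora-hbase and hbase jars); extend it for other backends.
REQUIRED_JAR_PREFIXES = {
    "hbase": ["gora-hbase", "hbase"],
}


def missing_jars(lib_dir, backend):
    """Return the required jar prefixes that have no matching jar in lib_dir."""
    jar_names = [path.name for path in Path(lib_dir).glob("*.jar")]
    missing = []
    for prefix in REQUIRED_JAR_PREFIXES[backend]:
        # Match "<prefix>-<version>.jar". Using startswith (rather than a
        # substring test) keeps "gora-hbase-0.2.jar" from counting as the
        # plain "hbase" jar.
        if not any(name.startswith(prefix + "-") for name in jar_names):
            missing.append(prefix)
    return missing


if __name__ == "__main__":
    # Hypothetical path: the lib/ directory of an unpacked binary distribution.
    print(missing_jars("lib", "hbase"))
```

 An empty list would mean both jars are in place; anything else names the jars
 that still have to be fetched and copied into lib/.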
 
  2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Hi Guys,
 
  Whilst updating the Nutch2Tutorial I got thinking that within Gora we
 don't supply binary distributions of the code. This is because when using
 Gora a user may wish/require to recompile the code to accommodate config
 changes etc. We only supply src distributions...
 
  Does this principle apply to Nutch 2 as well? I mean, what if you're
 using the gora-sql dependency, then you wish to switch to HBase and
 recompile? Is this possible within the binary distribution?
 
  Best
 
  Lewis
 
 
  On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
  Ferdy
 
  The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.
 
  The binary distrib corresponds to runtime/local and as such should NOT
 have the job file there. This is now the norm since 1.5
 
  Will try and do some testing of the RC
 
  Thanks
 
  Julien
 
 
 
  --
 
   Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 
 
  --
  Lewis
 
 
 
 
 
  --
  Lewis
 


  ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++








Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Julien Nioche
+1

On 15 June 2012 09:00, Ferdy Galema ferdy.gal...@kalooga.com wrote:

 Agree with only releasing src.







-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Lewis John Mcgibbney
I'll push this in an hour or so guys.

Thanks for the input.

Lewis

On Fri, Jun 15, 2012 at 9:39 AM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 +1






-- 
*Lewis*


Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Julien Nioche
Before you do, could you check that NutchGora passes ant test successfully?
I just tried and got an error related to the parse-tika tests. I am about to
open a JIRA to update to the latest version of Tika for NutchGora, which
should fix the problem and put it at the same level as trunk.

J

On 15 June 2012 10:01, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 I'll push this in an hour or so guys.

 Thanks for the input.

 Lewis






-- 
*
*Open Source Solutions for Text Engineering


Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Julien Nioche
see https://issues.apache.org/jira/browse/NUTCH-1396

On 15 June 2012 10:43, Julien Nioche lists.digitalpeb...@gmail.com wrote:

 Before you do, could you check that NutchGora passes ant test
 successfully? I just tried and got an error related to the parse-tika
 tests. I am about to open a JIRA to update to the latest version of Tika for
 NutchGora, which should fix the problem and put it at the same level as trunk.

 J

Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Mattmann, Chris A (388J)
OK you are just making us all look bad now Juls ;)

Super fast!

Cheers,
Chris


On Jun 15, 2012, at 2:54 AM, Julien Nioche wrote:

 see https://issues.apache.org/jira/browse/NUTCH-1396
 

Re: VOTE Apache Nutch 2.0 RC1

2012-06-15 Thread Julien Nioche
That was not intended. It's just that I am on holidays, it's raining and the
children were either asleep or playing nicely :-)

On 15 June 2012 18:19, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 OK you are just making us all look bad now Juls ;)

 Super fast!

 Cheers,
 Chris



Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Ferdy Galema
Maybe just 1392? I went ahead and made a patch that should fix this. Feel
free to commit or ignore prior to RC2.

On Thu, Jun 14, 2012 at 1:44 AM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Sebastian,

 On Wed, Jun 13, 2012 at 11:30 PM, Sebastian Nagel
 wastl.na...@googlemail.com wrote:
 I managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
  Much simpler than 1.x (no segments!).

 :0)

  % ./bin/nutch readdb -stats
  WebTable statistics start
  WebTableReader: java.io.EOFException
      at java.io.DataInputStream.readFully(DataInputStream.java:197)
      at java.io.DataInputStream.readFully(DataInputStream.java:169)
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
      at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
      at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
      at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
      at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
      at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
  -- readdb -dump works.

 Confirmed and ticket opened as NUTCH-1391

  % ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
  Exception in thread main java.lang.IllegalArgumentException: arg -parse not recognized

 The parse argument was removed in Nutch 2.0 and now throws an
 IllegalArgumentException. This is now the expected behaviour. To enable
 parsing during fetching, please set the config in nutch-site.xml. The
 incorrect -parse argument is still in the Usage message because I was
 not diligent enough when patching the fetcher CLI aesthetics. I'll
 address this within the issue below as well.
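
As a rough illustration of the nutch-site.xml setting mentioned above: the
thread does not name the property, so fetcher.parse below is an assumption
(it is the key listed in nutch-default.xml as far as I know); please verify
it against your own nutch-default.xml before relying on it.

```xml
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <!-- assumed property name; not stated in this thread -->
    <name>fetcher.parse</name>
    <value>true</value>
    <description>Parse content while fetching instead of running a
    separate parse step afterwards.</description>
  </property>
</configuration>
```

With a setting like this in place, the fetch command is run without the
-parse argument.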

 
 
  % ./bin/nutch parse -all -force -resume
  ParserJob: starting
  ParserJob: resuming:        false   -- -resume and
  ParserJob: forced reparse:  false   -- -force obviously ignored ?
  ParserJob: parsing all

 Yes confirmed and ticket opened as NUTCH-1392


  % ./bin/nutch generate
  -- generates batchid, but should show help as in 1.x ?
  -- is there an option -topN ?

 Yes this is opened in NUTCH-1393. Users may not necessarily wish to
 generate at all, instead wishing to merely find out the GeneratorJob
 CLI options... I will open this just now and fix for 2.1.

  The 2.0 Solr schema and mappings still contain the field site
  which has been removed in 1.x (NUTCH-1232).
  Should be done also in 2.0: it's easier to maintain only one Solr
 installation
  for all Nutch versions.

 Logged in NUTCH-1394

 Thanks Seb for your contributions here... this is exactly what we are
 after.

 Does anyone have issues with running another RC and addressing these
 issues in 2.1?

 --
 Lewis



Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Sebastian Nagel
  We only supply src distributions...
 Does this principle apply to Nutch 2 as well?
Maybe, yes.
The situation with the current binary package is uncomfortable:
I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running.

2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com

 Hi Guys,

 Whilst updating the Nutch2Tutorial I got thinking that within Gora we
 don't supply binary distributions of the code, this is because when using
 Gora a user may wish/require to recompile the code to accommodate config
 changes etc. We only supply src distributions...

 Does this principle apply to Nutch 2 as well? I mean, what if you're using
 the gora-sql dependency, then you wish to switch to HBase and recompile, is
 this possible within the binary distribution?

 Best

 Lewis


 On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

 Ferdy


 The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.


 The binary distrib corresponds to runtime/local and as such should NOT
 have the job file there. This is now the norm since 1.5

 Will try and do some testing of the RC

 Thanks

 Julien



 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --
 *Lewis*




Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Lewis John Mcgibbney
Aye this is no good at all. Depending on which backend you wish to use with
Gora, you will need to go and manually fetch the correct .jar's from maven
central.

Does anyone else have either a solution or a workaround before I push RC2
with just src dists?

Thanks

Lewis

On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel wastl.na...@googlemail.com
 wrote:

  We only supply src distributions...
  Does this principle apply to Nutch 2 as well?
 Maybe, yes.
 The situation with the current binary package is uncomfortable:
 I had to copy/link gora-hbase and hbase jars into lib/ to get nutch
 running.

 2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com

 Hi Guys,

 Whilst updating the Nutch2Tutorial I got thinking that within Gora we
 don't supply binary distributions of the code, this is because when using
 Gora a user may wish/require to recompile the code to accommodate config
 changes etc. We only supply src distributions...

 Does this principle apply to Nutch 2 as well? I mean, what if you're using
 the gora-sql dependency, then you wish to switch to HBase and recompile, is
 this possible within the binary distribution?

 Best

 Lewis


 On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

 Ferdy


 The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.


 The binary distrib corresponds to runtime/local and as such should NOT
 have the job file there. This is now the norm since 1.5

 Will try and do some testing of the RC

 Thanks

 Julien



 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --
 *Lewis*





-- 
*Lewis*


Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Mattmann, Chris A (388J)
Hey Guys,

I think the annoyance is probably something folks can live with as they have been
waiting for an official release of 2.x for years :)

My +1 to roll RC #2 with or without a solution to this and mark it as a TODO.
Release early, release often :)

Cheers,
Chris

On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

 Aye this is no good at all. Depending on which backend you wish to use with 
 Gora, you will need to go and manually fetch the correct .jar's from maven 
 central.
 
 Does anyone else have either solution or a workaround before I push RC2 with 
 just src dists?
 
 Thanks
 
 Lewis
 
 On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel wastl.na...@googlemail.com 
 wrote:
  We only supply src distributions... 
  Does this principle apply to Nutch 2 as well?
 Maybe, yes.
 The situation with the current binary package is uncomfortable:
 I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running.
 
 2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
 Hi Guys,
 
 Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't 
 supply binary distributions of the code, this is because when using Gora a 
 user may wish/require to recompile the code to accommodate config changes etc.
 We only supply src distributions...
 
 Does this principle apply to Nutch 2 as well? I mean, what if you're using the
 gora-sql dependency, then you wish to switch to HBase and recompile, is this 
 possible within the binary distribution?
 
 Best
 
 Lewis
 
 
 On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 Ferdy
 
 The Nutch job jar is not present in the binary archive. This means 
 distributed running of jobs is not supported. I'm not sure if this is a 
 problem (since users can always build one themselves), merely pointing it 
 out. The recently released 1.5 also lacks this job jar, so at least no 
 difference there.
 
 The binary distrib corresponds to runtime/local and as such should NOT have 
 the job file there. This is now the norm since 1.5
 
 Will try and do some testing of the RC
 
 Thanks
 
 Julien
 
 
 
 -- 
 
 Open Source Solutions for Text Engineering
 
 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble
 
 
 
 
 -- 
 Lewis 
 
 
 
 
 
 -- 
 Lewis 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Julien Nioche
I disagree. You'd expect a binary release to work out of the box - which is
not the case. Plus we'd have to spend more time explaining the workaround,
answering the same questions over and over on the ML etc... Fixing this
should not be a big deal (i.e. add the gora-x modules for the backends to
the ivy deps file).

Julien

On 14 June 2012 20:27, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 I think the annoyance is probably something folks can live with as they
 have been
 waiting for an official release of 2.x for years :)

 My +1 to roll RC #2 with or without a solution to this and mark it as a
 TODO. release
 early, release often :)

 Cheers,
 Chris

 On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

  Aye this is no good at all. Depending on which backend you wish to use
 with Gora, you will need to go and manually fetch the correct .jar's from
 maven central.
 
  Does anyone else have either solution or a workaround before I push RC2
 with just src dists?
 
  Thanks
 
  Lewis
 
  On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:
   We only supply src distributions...
   Does this principle apply to Nutch 2 as well?
  Maybe, yes.
  The situation with the current binary package is uncomfortable:
  I had to copy/link gora-hbase and hbase jars into lib/ to get nutch
 running.
 
  2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Hi Guys,
 
  Whilst updating the Nutch2Tutorial I got thinking that within Gora we
 don't supply binary distributions of the code, this is because when using
 Gora a user may wish/require to recompile the code to accommodate config
 changes etc. We only supply src distributions...
 
  Does this principle apply to Nutch 2 as well? I mean, what if you're using
 the gora-sql dependency, then you wish to switch to HBase and recompile, is
 this possible within the binary distribution?
 
  Best
 
  Lewis
 
 
  On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
  Ferdy
 
  The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.
 
  The binary distrib corresponds to runtime/local and as such should NOT
 have the job file there. This is now the norm since 1.5
 
  Will try and do some testing of the RC
 
  Thanks
 
  Julien
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 
 
  --
  Lewis
 
 
 
 
 
  --
  Lewis
 


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Lewis John Mcgibbney
Hi Julien,

Do you suggest with the binary release that we simply open up all gora-*
deps and ship it with every jar available?

Lewis

On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 I disagree. You'd expect a binary release to work out of the box - which
 is not the case. Plus we'd have to spend more time explaining the
 workaround, answering the same questions over and over on the ML etc...
 Fixing this should not be a big deal (i.e. add the gora-x modules for the
 backends to the ivy deps file).

 Julien


 On 14 June 2012 20:27, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 I think the annoyance is probably something folks can live with as they
 have been
 waiting for an official release of 2.x for years :)

 My +1 to roll RC #2 with or without a solution to this and mark it as a
 TODO. release
 early, release often :)

 Cheers,
 Chris

 On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

  Aye this is no good at all. Depending on which backend you wish to use
 with Gora, you will need to go and manually fetch the correct .jar's from
 maven central.
 
  Does anyone else have either solution or a workaround before I push RC2
 with just src dists?
 
  Thanks
 
  Lewis
 
  On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:
   We only supply src distributions...
   Does this principle apply to Nutch 2 as well?
  Maybe, yes.
  The situation with the current binary package is uncomfortable:
  I had to copy/link gora-hbase and hbase jars into lib/ to get nutch
 running.
 
  2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Hi Guys,
 
  Whilst updating the Nutch2Tutorial I got thinking that within Gora we
 don't supply binary distributions of the code, this is because when using
 Gora a user may wish/require to recompile the code to accommodate config
 changes etc. We only supply src distributions...
 
  Does this principle apply to Nutch 2 as well? I mean, what if you're
 using the gora-sql dependency, then you wish to switch to HBase and
 recompile, is this possible within the binary distribution?
 
  Best
 
  Lewis
 
 
  On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
  Ferdy
 
  The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.
 
  The binary distrib corresponds to runtime/local and as such should NOT
 have the job file there. This is now the norm since 1.5
 
  Will try and do some testing of the RC
 
  Thanks
 
  Julien
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 
 
  --
  Lewis
 
 
 
 
 
  --
  Lewis
 


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




 --
 *
 *
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




-- 
*Lewis*


Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Julien Nioche
yep, remember that you can't build from the bin package so inevitably
someone will wonder why only such or such backend is available etc...

another option is to NOT have a binary release at all, in which case it is
acceptable I think not to include the deps in ivy. Maybe we should at least
add them but comment them out
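
A minimal sketch of what that could look like in the ivy deps file, assuming
the file is ivy/ivy.xml and that the other backends are published as
gora-hbase and gora-cassandra. Only gora-sql 0.1.1-incubating is named in
this thread; the other rev values are placeholders to check against Maven
Central.

```xml
<!-- ivy/ivy.xml (fragment): Gora backend modules -->
<dependencies>
  <!-- default backend, as named in this thread -->
  <dependency org="org.apache.gora" name="gora-sql"
              rev="0.1.1-incubating" conf="*->default"/>

  <!-- Other backends, shipped commented out. Uncomment the one you need
       and rebuild. Module names and rev values here are assumptions. -->
  <!--
  <dependency org="org.apache.gora" name="gora-hbase"
              rev="0.2" conf="*->default"/>
  <dependency org="org.apache.gora" name="gora-cassandra"
              rev="0.2" conf="*->default"/>
  -->
</dependencies>
```

Keeping the extra modules commented out leaves the default build unchanged
while documenting, in the file itself, which line to enable before running
'ant runtime' from a src distribution.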

Ju

On 14 June 2012 21:51, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi Julien,

 Do you suggest with the binary release that we simply open up all gora-*
 deps and ship it with every jar available?

 Lewis


 On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

 I disagree. You'd expect a binary release to work out of the box - which
 is not the case. Plus we'd have to spend more time explaining the
 workaround, answering the same questions over and over on the ML etc...
 Fixing this should not be a big deal (i.e. add the gora-x modules for the
 backends to the ivy deps file).

 Julien


 On 14 June 2012 20:27, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 I think the annoyance is probably something folks can live with as they
 have been
 waiting for an official release of 2.x for years :)

 My +1 to roll RC #2 with or without a solution to this and mark it as a
 TODO. release
 early, release often :)

 Cheers,
 Chris

 On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

  Aye this is no good at all. Depending on which backend you wish to use
 with Gora, you will need to go and manually fetch the correct .jar's from
 maven central.
 
  Does anyone else have either solution or a workaround before I push
 RC2 with just src dists?
 
  Thanks
 
  Lewis
 
  On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:
   We only supply src distributions...
   Does this principle apply to Nutch 2 as well?
  Maybe, yes.
  The situation with the current binary package is uncomfortable:
  I had to copy/link gora-hbase and hbase jars into lib/ to get nutch
 running.
 
  2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Hi Guys,
 
  Whilst updating the Nutch2Tutorial I got thinking that within Gora we
 don't supply binary distributions of the code, this is because when using
 Gora a user may wish/require to recompile the code to accommodate config
 changes etc. We only supply src distributions...
 
  Does this principle apply to Nutch 2 as well? I mean, what if you're
 using the gora-sql dependency, then you wish to switch to HBase and
 recompile, is this possible within the binary distribution?
 
  Best
 
  Lewis
 
 
  On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
  Ferdy
 
  The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.
 
  The binary distrib corresponds to runtime/local and as such should NOT
 have the job file there. This is now the norm since 1.5
 
  Will try and do some testing of the RC
 
  Thanks
 
  Julien
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 
 
  --
  Lewis
 
 
 
 
 
  --
  Lewis
 


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




 --
 *
 *
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --
 *Lewis*




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Lewis John Mcgibbney
This is what is currently done and what I was essentially proposing.

I really don't know about the size of the bin artifact if we enable all
gora-* dependencies before packaging it for distribution... thanks to input
from yourselves we recently sorted out some size issues with 1.5, it would
be good to have 2.0 shadow this.

I am +1 for shipping just src distributions for 2.0, this would keep the
default (gora-sql 0.1.1-incubating) ivy configuration.

If users can't do 'ant runtime' then you kinda got to wonder how they're
using Nutch at all...

On Thu, Jun 14, 2012 at 9:56 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 yep, remember that you can't build from the bin package so inevitably
 someone will wonder why only such or such backend is available etc...

 another option is to NOT have a binary release at all, in which case it is
 acceptable I think not to include the deps in ivy. Maybe we should at least
 add them but comment them out

 Ju


 On 14 June 2012 21:51, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Hi Julien,

 Do you suggest with the binary release that we simply open up all gora-*
 deps and ship it with every jar available?

 Lewis


 On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:

 I disagree. You'd expect a binary release to work out of the box - which
 is not the case. Plus we'd have to spend more time explaining the
 workaround, answering the same questions over and over on the ML etc...
 Fixing this should not be a big deal (i.e. add the gora-x modules for the
 backends to the ivy deps file).

 Julien


 On 14 June 2012 20:27, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 I think the annoyance is probably something folks can live with as they
 have been
 waiting for an official release of 2.x for years :)

 My +1 to roll RC #2 with or without a solution to this and mark it as a
 TODO. release
 early, release often :)

 Cheers,
 Chris

 On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

  Aye this is no good at all. Depending on which backend you wish to
 use with Gora, you will need to go and manually fetch the correct .jar's
 from maven central.
 
  Does anyone else have either solution or a workaround before I push
 RC2 with just src dists?
 
  Thanks
 
  Lewis
 
  On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:
   We only supply src distributions...
   Does this principle apply to Nutch 2 as well?
  Maybe, yes.
  The situation with the current binary package is uncomfortable:
  I had to copy/link gora-hbase and hbase jars into lib/ to get nutch
 running.
 
  2012/6/13 Lewis John Mcgibbney lewis.mcgibb...@gmail.com
  Hi Guys,
 
  Whilst updating the Nutch2Tutorial I got thinking that within Gora we
 don't supply binary distributions of the code, this is because when using
 Gora a user may wish/require to recompile the code to accommodate config
 changes etc. We only supply src distributions...
 
  Does this principle apply to Nutch 2 as well? I mean, what if you're
 using the gora-sql dependency, then you wish to switch to HBase and
 recompile, is this possible within the binary distribution?
 
  Best
 
  Lewis
 
 
  On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
  Ferdy
 
  The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.
 
  The binary distrib corresponds to runtime/local and as such should
 NOT have the job file there. This is now the norm since 1.5
 
  Will try and do some testing of the RC
 
  Thanks
 
  Julien
 
 
 
  --
 
  Open Source Solutions for Text Engineering
 
  http://digitalpebble.blogspot.com/
  http://www.digitalpebble.com
  http://twitter.com/digitalpebble
 
 
 
 
  --
  Lewis
 
 
 
 
 
  --
  Lewis
 


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




 --
 *
 *
 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --
 *Lewis*




 --
 *
 *Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




-- 
*Lewis*


Re: VOTE Apache Nutch 2.0 RC1

2012-06-14 Thread Mattmann, Chris A (388J)
Or just not ship a bin release at all. Src is the only thing we really VOTE on 
legally though bin is provided for convenience purposes. Will type more on this 
later...

Sent from my iPhone

On Jun 14, 2012, at 2:18 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

Hi Julien,

Do you suggest with the binary release that we simply open up all gora-* deps 
and ship it with every jar available?

Lewis

On Thu, Jun 14, 2012 at 9:39 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:
I disagree. You'd expect a binary release to work out of the box - which is not 
the case. Plus we'd have to spend more time explaining the workaround, 
answering the same questions over and over on the ML etc... Fixing this should 
not be a big deal (i.e. add the gora-x modules for the backends to the ivy deps 
file).

Julien


On 14 June 2012 20:27, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:
Hey Guys,

I think the annoyance is probably something folks can live with as they have 
been
waiting for an official release of 2.x for years :)

My +1 to roll RC #2 with or without a solution to this and mark it as a TODO. 
release
early, release often :)

Cheers,
Chris

On Jun 14, 2012, at 10:04 AM, Lewis John Mcgibbney wrote:

 Aye this is no good at all. Depending on which backend you wish to use with 
 Gora, you will need to go and manually fetch the correct .jar's from maven 
 central.

 Does anyone else have either solution or a workaround before I push RC2 with 
 just src dists?

 Thanks

 Lewis

 On Thu, Jun 14, 2012 at 4:52 PM, Sebastian Nagel 
 wastl.na...@googlemail.com wrote:
  We only supply src distributions...
  Does this principle apply to Nutch 2 as well?
 Maybe, yes.
 The situation with the current binary package is uncomfortable:
 I had to copy/link gora-hbase and hbase jars into lib/ to get nutch running.

 2012/6/13 Lewis John Mcgibbney 
 lewis.mcgibb...@gmail.com
 Hi Guys,

 Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't 
 supply binary distributions of the code, this is because when using Gora a 
 user may wish/require to recompile the code to accommodate config changes etc.
 We only supply src distributions...

 Does this principle apply to Nutch 2 as well? I mean, what if you're using the
 gora-sql dependency, then you wish to switch to HBase and recompile, is this 
 possible within the binary distribution?

 Best

 Lewis


 On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
 lists.digitalpeb...@gmail.com wrote:
 Ferdy

 The Nutch job jar is not present in the binary archive. This means 
 distributed running of jobs is not supported. I'm not sure if this is a 
 problem (since users can always build one themselves), merely pointing it 
 out. The recently released 1.5 also lacks this job jar, so at least no 
 difference there.

 The binary distrib corresponds to runtime/local and as such should NOT have 
 the job file there. This is now the norm since 1.5

 Will try and do some testing of the RC

 Thanks

 Julien



 --

 Open Source Solutions for Text Engineering

 http://digitalpebble.blogspot.com/
 http://www.digitalpebble.com
 http://twitter.com/digitalpebble




 --
 Lewis





 --
 Lewis



++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++




--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble




--
Lewis



Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Ferdy Galema
Findings about Nutch-2.0 RC 1.

The Nutch job jar is not present in the binary archive. This means
distributed running of jobs is not supported. I'm not sure if this is a
problem (since users can always build one themselves), merely pointing it
out. The recently released 1.5 also lacks this job jar, so at least no
difference there.

Parse text is limited to 100 characters for html. We noticed this when our
index wasn't showing enough terms for some documents. This is a pretty
severe bug that I will commit a fix for right away.

Building runtime with the default SqlStore and HBaseStore works fine. Will
perform some more functionality tests when there is a new RC.

Ferdy.

On Wed, Jun 13, 2012 at 4:24 AM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 #2 is probably reason enough for a respin.

 Lewis if you don't have time to do it before Thursday, I could probably
 give it a whack. Let me know.

 Cheers,
 Chris

 On Jun 12, 2012, at 3:33 PM, Sebastian Nagel wrote:

  Hi Lewis,
 
  my first steps with 2.0 (to be continued, still struggling).
 
  Two points (I'll try to give a final vote tomorrow):
 
  1 some guidance would be nice. README.txt points
  to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x
  (I'm using
 http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html
 )
 
  2 the package contains your nutch-site.xml:
 <name>http.agent.email</name>
 <value>lewi...@apache.org</value>
  I guess that's not intended :)
 
  Cheers,
  Sebastian
 
  On 06/12/2012 10:16 PM, Lewis John Mcgibbney wrote:
  Hi Everyone,
 
  I appreciate that most of the core dev's are using trunk, however I
  would appeal to you guys to at least check out the artifacts and check
  sigs, tests, license headers if possible. Although this does not fully
  satisfy the requirements of a thoroughly reviewed RC, hopefully the
  thorough stuff can be undertaken by those directly using the artifacts
  and code in development/production.
 
  Thanks very much in advance
 
  Best
 
  Lewis
 
  On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney 
 lewi...@apache.org wrote:
  Good Evening Everyone,
 
  A candidate for the Apache Nutch 2.0 RC1 is available at:
 
  http://people.apache.org/~lewismc/nutch-2.0
 
  The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
  archive of the sources in:
 
  http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1
 
  Further, a staged Maven repository of the 2.0 jar, sources.jar and
  javadoc.jar is available here:
 
  https://repository.apache.org/content/repositories/orgapachenutch-215
 
  Please vote on releasing this package as Apache Nutch 2.0.
  The vote is open for the next 72 hours and passes if a majority of at
  least three +1 Nutch PMC votes are cast.
 
  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...
 
  Many Thanks and heres to plenty more.
 
  Have a great weekend, Kind Regards,
  Lewis
 
  P.S. Here's my +1.
 
 
 
 


 ++
 Chris Mattmann, Ph.D.
 Senior Computer Scientist
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 171-266B, Mailstop: 171-246
 Email: chris.a.mattm...@nasa.gov
 WWW:   http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Assistant Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Ferdy Galema
Hmm, please ignore the point about parse text being limited to 100 chars;
this is actually not the case. (It only happens in our branch, which has a
fix for limiting anchor texts that is not yet present in the nutchgora
branch because it still needs polishing.) So no need to wait for commits on
my part.

On Wed, Jun 13, 2012 at 11:00 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:

 Findings about Nutch-2.0 RC 1.

 The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.

 Parse text is limited to 100 characters for html. We noticed this when our
 index wasn't showing enough terms for some documents. This is a pretty
 severe bug that I will commit a fix for right away.

 Building runtime with the default SqlStore and HBaseStore works fine. Will
 perform some more functionality tests when there is a new RC.

 Ferdy.

 On Wed, Jun 13, 2012 at 4:24 AM, Mattmann, Chris A (388J) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Guys,

 #2 is probably reason enough for a respin.

 Lewis if you don't have time to do it before Thursday, I could probably
 give it a whack. Let me know.

 Cheers,
 Chris

 On Jun 12, 2012, at 3:33 PM, Sebastian Nagel wrote:

  Hi Lewis,
 
  my first steps with 2.0 (to be continued, still struggling).
 
  Two points (I'll try to give a final vote tomorrow):
 
  1 some guidance would be nice. README.txt points
  to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x
  (I'm using
 http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html
 )
 
  2 the package contains your nutch-site.xml:
 <name>http.agent.email</name>
 <value>lewi...@apache.org</value>
  I guess that's not intended :)
 
  Cheers,
  Sebastian
 
  On 06/12/2012 10:16 PM, Lewis John Mcgibbney wrote:
  Hi Everyone,
 
  I appreciate that most of the core dev's are using trunk, however I
  would appeal to you guys to at least check out the artifacts and check
  sigs, tests, license headers if possible. Although this does not fully
  satisfy the requirements of a thoroughly reviewed RC, hopefully the
  thorough stuff can be undertaken by those directly using the artifacts
  and code in development/production.
 
  Thanks very much in advance
 
  Best
 
  Lewis
 
  On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney 
 lewi...@apache.org wrote:
  Good Evening Everyone,
 
  A candidate for the Apache Nutch 2.0 RC1 is available at:
 
  http://people.apache.org/~lewismc/nutch-2.0
 
  The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
  archive of the sources in:
 
  http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1
 
  Further, a staged Maven repository of the 2.0 jar, sources.jar and
  javadoc.jar is available here:
 
  https://repository.apache.org/content/repositories/orgapachenutch-215
 
  Please vote on releasing this package as Apache Nutch 2.0.
  The vote is open for the next 72 hours and passes if a majority of at
  least three +1 Nutch PMC votes are cast.
 
  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...
 
  Many Thanks and heres to plenty more.
 
  Have a great weekend, Kind Regards,
  Lewis
 
  P.S. Here's my +1.
 
 
 
 







Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Seb,

As Chris said, the issues you highlight well justify another RC.

I can shift it by the end of play today.

Thanks very much for having a look through, guys.

Lewis

On Tue, Jun 12, 2012 at 11:33 PM, Sebastian Nagel
wastl.na...@googlemail.com wrote:
 Hi Lewis,

 my first steps with 2.0 (to be continued, still struggling).

 Two points (I'll try to give a final vote tomorrow):

 1 some guidance would be nice. README.txt points
 to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x
 (I'm using 
 http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html)

 2 the package contains your nutch-site.xml:
     <name>http.agent.email</name>
     <value>lewi...@apache.org</value>
 I guess that's not intended :)

 Cheers,
 Sebastian

 On 06/12/2012 10:16 PM, Lewis John Mcgibbney wrote:
 Hi Everyone,

 I appreciate that most of the core dev's are using trunk, however I
 would appeal to you guys to at least check out the artifacts and check
 sigs, tests, license headers if possible. Although this does not fully
 satisfy the requirements of a thoroughly reviewed RC, hopefully the
 thorough stuff can be undertaken by those directly using the artifacts
 and code in development/production.

 Thanks very much in advance

 Best

 Lewis

 On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org 
 wrote:
 Good Evening Everyone,

 A candidate for the Apache Nutch 2.0 RC1 is available at:

 http://people.apache.org/~lewismc/nutch-2.0

 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:

 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1

 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:

 https://repository.apache.org/content/repositories/orgapachenutch-215

 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.

  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...

 Many Thanks and heres to plenty more.

 Have a great weekend, Kind Regards,
 Lewis

 P.S. Here's my +1.







-- 
Lewis


Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Seb,

Quick update

On Tue, Jun 12, 2012 at 11:33 PM, Sebastian Nagel
wastl.na...@googlemail.com wrote:
1 some guidance would be nice. README.txt points
 to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x

Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an
update of Julien's (I think) page on GORA_HBase. This will get you
rocking with HBase. The changes between Cassandra, Accumulo and the
other data stores are fairly trivial.
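
For reference, a minimal sketch of how the backend is usually selected in
conf/nutch-site.xml, assuming the HBase store class name from Gora. The
property name and class are given as I understand them from the tutorial
and should be checked against nutch-default.xml in the release:

```xml
<configuration>
  <!-- Assumption: storage.data.store.class selects the Gora backend. -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
```

Switching to another store would then mostly mean changing the value to
that store's class and adding the matching gora-* dependency.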

 2 the package contains your nutch-site.xml:
     <name>http.agent.email</name>
     <value>lewi...@apache.org</value>
 I guess that's not intended :)

I'll deal with this when I spin RC2. Thanks

Lewis


Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Julien Nioche
Ferdy


 The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.


The binary distrib corresponds to runtime/local and as such should NOT have
the job file there. This has been the norm since 1.5.

Will try and do some testing of the RC

Thanks

Julien



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Guys,

Whilst updating the Nutch2Tutorial I got thinking that within Gora we don't
supply binary distributions of the code. This is because when using Gora a
user may wish or need to recompile the code to accommodate config changes
etc. We only supply src distributions...

Does this principle apply to Nutch 2 as well? I mean, what if you're using
the gora-sql dependency and then wish to switch to HBase and recompile? Is
this possible within the binary distribution?
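
To make the question concrete, here is a sketch of the kind of change a
backend switch involves in the ivy deps file (ivy/ivy.xml). The rev and
conf values below are assumptions for illustration, not taken from the RC:

```xml
<!-- Assumed fragment of ivy/ivy.xml: enable the HBase backend module. -->
<!-- rev="0.2" and conf="*->default" are illustrative assumptions.     -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2"
            conf="*->default" />
```

After such a change the runtime has to be rebuilt from source so that the
new jars are resolved, which is the step a binary-only package cannot do.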

Best

Lewis

On Wed, Jun 13, 2012 at 3:38 PM, Julien Nioche 
lists.digitalpeb...@gmail.com wrote:

 Ferdy


 The Nutch job jar is not present in the binary archive. This means
 distributed running of jobs is not supported. I'm not sure if this is a
 problem (since users can always build one themselves), merely pointing it
 out. The recently released 1.5 also lacks this job jar, so at least no
 difference there.


 The binary distrib corresponds to runtime/local and as such should NOT
 have the job file there. This is now the norm since 1.5

 Will try and do some testing of the RC

 Thanks

 Julien







-- 
*Lewis*


Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Sebastian Nagel
Hi Lewis,

 Please see http://wiki.apache.org/nutch/Nutch2Tutorial which is an
 update of Julien's (I think) page on GORA_HBase. This will get you
 rocking with HBase. The changes between Cassandra, Accumulo and the
 other data stores are fairly trivial.

I've managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
Much simpler than 1.x (no segments!).

Below are a couple of problems I've run into (possible issues to be
addressed in 2.1).

Cheers,
Sebastian



% ./bin/nutch readdb -stats
WebTable statistics start
WebTableReader: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:197)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
        at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
        at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
-- readdb -dump works.



% ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
Exception in thread main java.lang.IllegalArgumentException: arg -parse not recognized



% ./bin/nutch parse -all -force -resume
ParserJob: starting
ParserJob: resuming:            false    -resume and
ParserJob: forced reparse:      false    -force obviously ignored ?
ParserJob: parsing all



% ./bin/nutch generate
-- generates batchid, but should show help as in 1.x ?
-- is there an option -topN ?



The 2.0 Solr schema and mappings still contain the site field,
which has been removed in 1.x (NUTCH-1232).
It should be removed in 2.0 as well: it's easier to maintain only one Solr
installation for all Nutch versions.



Re: VOTE Apache Nutch 2.0 RC1

2012-06-13 Thread Lewis John Mcgibbney
Hi Sebastian,

On Wed, Jun 13, 2012 at 11:30 PM, Sebastian Nagel
wastl.na...@googlemail.com wrote:
I've managed to perform a crawl with 2.0 and HBase: it rocks, indeed.
 Much simpler than 1.x (no segments!).

:0)

 % ./bin/nutch readdb -stats
 WebTable statistics start
 WebTableReader: java.io.EOFException
         at java.io.DataInputStream.readFully(DataInputStream.java:197)
         at java.io.DataInputStream.readFully(DataInputStream.java:169)
         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
         at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:89)
         at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:537)
         at org.apache.nutch.crawl.WebTableReader.processStatJob(WebTableReader.java:218)
         at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:479)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
         at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
 -- readdb -dump works.

Confirmed and ticket opened as NUTCH-1391

 % ./bin/nutch fetch 1339621550-203073321 -threads 1 -parse
 Exception in thread main java.lang.IllegalArgumentException: arg -parse not recognized

The -parse argument was removed in Nutch 2.0 and now throws an
IllegalArgumentException. This is the expected behaviour. To enable
parsing during fetching, please set the config in nutch-site.xml. The
reason the incorrect -parse argument is still in the Usage message is
that I was not diligent enough when patching the fetcher CLI aesthetics.
I'll address this within the issue below as well.
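
As a sketch of the config route mentioned above, assuming the relevant
property is fetcher.parse as listed in nutch-default.xml (please verify
the name and default against the release you are running):

```xml
<configuration>
  <!-- Assumption: fetcher.parse controls parsing during the fetch step. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>If true, fetcher will parse content.</description>
  </property>
</configuration>
```

With this in conf/nutch-site.xml the fetch command is run without any
-parse argument.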



 % ./bin/nutch parse -all -force -resume
 ParserJob: starting
 ParserJob: resuming:    false            -resume and
 ParserJob: forced reparse:      false    -force obviously ignored ?
 ParserJob: parsing all

Yes confirmed and ticket opened as NUTCH-1392


 % ./bin/nutch generate
 -- generates batchid, but should show help as in 1.x ?
 -- is there an option -topN ?

Yes, this will be tracked as NUTCH-1393. Users may not necessarily wish to
generate at all, and may merely want to find out the GeneratorJob
CLI options... I will open the ticket just now and fix it for 2.1.

 The 2.0 Solr schema and mappings still contain the field site
 which has been removed in 1.x (NUTCH-1232).
 Should be done also in 2.0: it's easier to maintain only one Solr installation
 for all Nutch versions.

Logged in NUTCH-1394

Thanks Seb for your contributions here... this is exactly what we are after.

Does anyone have issues with running another RC and addressing these
issues in 2.1?

-- 
Lewis


Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Lewis John Mcgibbney
Hi Everyone,

I appreciate that most of the core dev's are using trunk, however I
would appeal to you guys to at least check out the artifacts and check
sigs, tests, license headers if possible. Although this does not fully
satisfy the requirements of a thoroughly reviewed RC, hopefully the
thorough stuff can be undertaken by those directly using the artifacts
and code in development/production.

Thanks very much in advance

Best

Lewis

On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org wrote:
 Good Evening Everyone,

 A candidate for the Apache Nutch 2.0 RC1 is available at:

 http://people.apache.org/~lewismc/nutch-2.0

 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:

 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1

 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:

 https://repository.apache.org/content/repositories/orgapachenutch-215

 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.

  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...

 Many Thanks and heres to plenty more.

 Have a great weekend, Kind Regards,
 Lewis

 P.S. Here's my +1.



-- 
Lewis


Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Mattmann, Chris A (388J)
Hey Lewis,

I will get to this tonight, for sure.

Thanks!

Cheers,
Chris

On Jun 12, 2012, at 1:16 PM, Lewis John Mcgibbney wrote:

 Hi Everyone,
 
 I appreciate that most of the core dev's are using trunk, however I
 would appeal to you guys to at least check out the artifacts and check
 sigs, tests, license headers if possible. Although this does not fully
 satisfy the requirements of a thoroughly reviewed RC, hopefully the
 thorough stuff can be undertaken by those directly using the artifacts
 and code in development/production.
 
 Thanks very much in advance
 
 Best
 
 Lewis
 
 On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org 
 wrote:
 Good Evening Everyone,
 
 A candidate for the Apache Nutch 2.0 RC1 is available at:
 
 http://people.apache.org/~lewismc/nutch-2.0
 
 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1
 
 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-215
 
 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...
 
 Many Thanks and heres to plenty more.
 
 Have a great weekend, Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 
 
 -- 
 Lewis


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Lewis John Mcgibbney
Thank you

On Tue, Jun 12, 2012 at 9:19 PM, Mattmann, Chris A (388J)
chris.a.mattm...@jpl.nasa.gov wrote:
 Hey Lewis,

 I will get to this tonight, for sure.

 Thanks!

 Cheers,
 Chris

 On Jun 12, 2012, at 1:16 PM, Lewis John Mcgibbney wrote:

 Hi Everyone,

 I appreciate that most of the core dev's are using trunk, however I
 would appeal to you guys to at least check out the artifacts and check
 sigs, tests, license headers if possible. Although this does not fully
 satisfy the requirements of a thoroughly reviewed RC, hopefully the
 thorough stuff can be undertaken by those directly using the artifacts
 and code in development/production.

 Thanks very much in advance

 Best

 Lewis

 On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org 
 wrote:
 Good Evening Everyone,

 A candidate for the Apache Nutch 2.0 RC1 is available at:

 http://people.apache.org/~lewismc/nutch-2.0

 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:

 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1

 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:

 https://repository.apache.org/content/repositories/orgapachenutch-215

 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.

  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...

 Many Thanks and heres to plenty more.

 Have a great weekend, Kind Regards,
 Lewis

 P.S. Here's my +1.



 --
 Lewis






-- 
Lewis


Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Sebastian Nagel
Hi Lewis,

my first steps with 2.0 (to be continued, still struggling).

Two points (I'll try to give a final vote tomorrow):

1 some guidance would be nice. README.txt points
to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x
(I'm using 
http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html)

2 the package contains your nutch-site.xml:
<name>http.agent.email</name>
<value>lewi...@apache.org</value>
I guess that's not intended :)

Cheers,
Sebastian

On 06/12/2012 10:16 PM, Lewis John Mcgibbney wrote:
 Hi Everyone,
 
 I appreciate that most of the core dev's are using trunk, however I
 would appeal to you guys to at least check out the artifacts and check
 sigs, tests, license headers if possible. Although this does not fully
 satisfy the requirements of a thoroughly reviewed RC, hopefully the
 thorough stuff can be undertaken by those directly using the artifacts
 and code in development/production.
 
 Thanks very much in advance
 
 Best
 
 Lewis
 
 On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org 
 wrote:
 Good Evening Everyone,

 A candidate for the Apache Nutch 2.0 RC1 is available at:

 http://people.apache.org/~lewismc/nutch-2.0

 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:

 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1

 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:

 https://repository.apache.org/content/repositories/orgapachenutch-215

 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.

  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...

 Many Thanks and heres to plenty more.

 Have a great weekend, Kind Regards,
 Lewis

 P.S. Here's my +1.
 
 
 



Re: VOTE Apache Nutch 2.0 RC1

2012-06-12 Thread Mattmann, Chris A (388J)
Hey Guys,

#2 is probably reason enough for a respin. 

Lewis if you don't have time to do it before Thursday, I could probably
give it a whack. Let me know.

Cheers,
Chris

On Jun 12, 2012, at 3:33 PM, Sebastian Nagel wrote:

 Hi Lewis,
 
 my first steps with 2.0 (to be continued, still struggling).
 
 Two points (I'll try to give a final vote tomorrow):
 
 1 some guidance would be nice. README.txt points
 to http://wiki.apache.org/nutch/NutchTutorial which refers to 1.x
 (I'm using 
 http://sujitpal.blogspot.de/2012/01/exploring-nutch-gora-with-cassandra.html)
 
 2 the package contains your nutch-site.xml:
<name>http.agent.email</name>
<value>lewi...@apache.org</value>
 I guess that's not intended :)
 
 Cheers,
 Sebastian
 
 On 06/12/2012 10:16 PM, Lewis John Mcgibbney wrote:
 Hi Everyone,
 
 I appreciate that most of the core dev's are using trunk, however I
 would appeal to you guys to at least check out the artifacts and check
 sigs, tests, license headers if possible. Although this does not fully
 satisfy the requirements of a thoroughly reviewed RC, hopefully the
 thorough stuff can be undertaken by those directly using the artifacts
 and code in development/production.
 
 Thanks very much in advance
 
 Best
 
 Lewis
 
 On Fri, Jun 8, 2012 at 3:49 PM, lewis john mcgibbney lewi...@apache.org 
 wrote:
 Good Evening Everyone,
 
 A candidate for the Apache Nutch 2.0 RC1 is available at:
 
 http://people.apache.org/~lewismc/nutch-2.0
 
 The release candidate is a src.zip, bin.zip, src.tar.gz and bin.tar.gz
 archive of the sources in:
 
 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc1
 
 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:
 
 https://repository.apache.org/content/repositories/orgapachenutch-215
 
 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.
 
 [ ] +1 Release this package as Apache Nutch 2.0
 [ ] -1 Do not release this package because...
 
 Many Thanks and heres to plenty more.
 
 Have a great weekend, Kind Regards,
 Lewis
 
 P.S. Here's my +1.
 
 
 
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++