Re: [DISCUSS] contents of nutch release artifact

2009-03-21 Thread Andrzej Bialecki

Doğacan Güney wrote:

On Thu, Mar 19, 2009 at 23:46, Sami Siren ssi...@gmail.com wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a
local job, no optional filesystems etc), the *.job and *.war files and
scripts. Scripts would check for the presence of plugins/ dir, and offer an
option to create it from *.job. Assumption here is that this should be enough
to run full cycle in local mode, and that people who want to run a
distributed cluster will first install a plain Hadoop release, and then just
put the *.job and bin/nutch on the master.

* source: no build artifacts, no .svn (equivalent to svn export), simple
tgz.


this sounds good to me. additionally some new documentation needs to be
written too.


I added a simple patch to NUTCH-728 to make a plain source release from svn.
What do people think: should we add the plain source package to the next rc? I
would not like to make changes to the binary package now, but propose that we
make those changes post 1.0.



+1 for including plain source release in next rc.

As for local/distributed separation, it is a good idea, but I think we
should hold it for 1.1 (or something else) if it requires architectural
changes (and thus needs review and testing).


Yes, sorry for not being more explicit - my proposal was for 1.1, I 
think 1.0 has to go out as it is (and I'd even hesitate to create a 
source-only release now - we would have to test that it's still 
buildable and fully functional.)



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] contents of nutch release artifact

2009-03-21 Thread Jukka Zitting
Hi,

On Fri, Mar 20, 2009 at 1:10 PM, Andrzej Bialecki a...@sigram.com wrote:
 Yes, sorry for not being more explicit - my proposal was for 1.1, I think
 1.0 has to go out as it is (and I'd even hesitate to create a source-only
 release now - we would have to test that it's still buildable and fully
 functional.)

To be accurate, the source release *is* the collection of bits that
the release manager is using to produce binaries and other release
artifacts. It's just a packaged svn export of the release tag.

If the release manager can build and test the sources, then anyone
else should be able to do the same using the exact same set of bits.
Verifying that is one of the key parts of the release vote.

From that perspective I'm even a bit worried about the idea of having
an Ant target that exports and packages the tag, as it suggests that
the release manager is not necessarily using that set of bits to build
the release.
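One way to check that everyone votes on the same set of bits is to compare a digest of the downloaded source package with the digest announced by the release manager. The following is only a minimal sketch: the function names `sha256_of` and `same_bits` are hypothetical, and the choice of SHA-256 is an assumption, not something stated in this thread.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bits(candidate: Path, published_digest: str) -> bool:
    """True if the local file matches the digest announced for the vote."""
    return sha256_of(candidate) == published_digest.strip().lower()
```

A voter would call `same_bits` on the downloaded package before building and testing it, so that the result of the test applies to exactly the package under vote.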

BR,

Jukka Zitting


Re: [DISCUSS] contents of nutch release artifact

2009-03-21 Thread Jukka Zitting
Hi,

On Sat, Mar 21, 2009 at 12:28 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:
 To be accurate, the source release *is* the collection of bits that
 the release manager is using to produce binaries and other release
 artifacts. It's just a packaged svn export of the release tag.

Or, to express this in another way, the release manager can produce
the source release simply by packaging the entire source tree he's
using just before invoking any Ant targets to produce the binaries.
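The packaging step described above could look roughly like the sketch below. It assumes the tree is packaged before any Ant target has run, so there is no build output to exclude, and it drops `.svn` directories so that the result resembles an svn export. The function name `package_source` and the `top_dir` parameter are hypothetical and not part of any Nutch build script.

```python
import tarfile
from pathlib import Path

# Assumption: only Subversion metadata has to be filtered out, because the
# tree is packaged before any build output exists.
EXCLUDED_NAMES = {".svn"}


def package_source(tree: Path, archive: Path, top_dir: str) -> None:
    """Write `tree` into a gzip-compressed tar file under `top_dir`."""

    def skip_excluded(info: tarfile.TarInfo):
        # Returning None drops the entry; for a directory this also skips
        # everything below it.
        parts = Path(info.name).parts
        return None if any(part in EXCLUDED_NAMES for part in parts) else info

    with tarfile.open(archive, "w:gz") as tar:
        tar.add(tree, arcname=top_dir, filter=skip_excluded)
```

The archive should be written outside the tree being packaged, otherwise the partly written archive would end up inside itself.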

BR,

Jukka Zitting


Re: [DISCUSS] contents of nutch release artifact

2009-03-20 Thread Doğacan Güney
On Thu, Mar 19, 2009 at 23:46, Sami Siren ssi...@gmail.com wrote:
 Sami Siren wrote:

 Andrzej Bialecki wrote:

 How about the following: we build just 2 packages:

 * binary: this includes only base hadoop libs in lib/ (enough to start a
 local job, no optional filesystems etc), the *.job and *.war files and
 scripts. Scripts would check for the presence of plugins/ dir, and offer an
 option to create it from *.job. Assumption here is that this should be enough
 to run full cycle in local mode, and that people who want to run a
 distributed cluster will first install a plain Hadoop release, and then just
 put the *.job and bin/nutch on the master.

 * source: no build artifacts, no .svn (equivalent to svn export), simple
 tgz.


 this sounds good to me. additionally some new documentation needs to be
 written too.


 I added a simple patch to NUTCH-728 to make a plain source release from svn.
 What do people think: should we add the plain source package to the next rc? I
 would not like to make changes to the binary package now, but propose that we
 make those changes post 1.0.


+1 for including plain source release in next rc.

As for local/distributed separation, it is a good idea, but I think we
should hold it for 1.1 (or something else) if it requires architectural
changes (and thus needs review and testing).

 --
  Sami Siren




-- 
Doğacan Güney


[DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0
but we could perhaps start the discussion about this anyway so throw in 
your opinions...



the related snippet from email discussion:

Sami Siren wrote:
 Jukka Zitting wrote:
 * Why does the release package contain pre-built documentation and
 binaries? Downloading the 90MB package takes much longer than checking
 out and building the 40MB tag from svn.
 IMHO it would be a service to users to make the release contain just
 the svn export with instruction
 on how to build the rest.

 I see your point about the fat artifact but I am not totally convinced
 that users (as in end users) would prefer the idea of fetching the
 development tools and compiling the software before they use it, at
 least I am not doing that with the software I use.

 I will discuss this with rest of the devs and see what we can do here.
 One solution could be to split the release in two parts, binary only and
 source (they would both be about the same size, since our build process
 currently copies jars around; I think that's mostly the reason for the
 gigantic size), as you propose below.


--
 Sami Siren


Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Andrzej Bialecki

Sami Siren wrote:


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0
but we could perhaps start the discussion about this anyway so throw in 
your opinions...


I agree with you and Jukka that we should provide separate tarballs of 
source and binaries. This likely won't result in significant size 
reductions (anyway, what's a measly 90MB nowadays .. ;) but it would 
help other parties to deploy clean binaries and/or track the officially 
released sources.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Eric J. Christeson


On Mar 19, 2009, at 8:48 AM, Sami Siren wrote:



Jukka Zitting was suggesting we should rethink the Nutch release  
packaging because of its size. I don't see this as a blocker for
1.0 but we could perhaps start the discussion about this anyway so  
throw in your opinions...


+1 for both binary and source releases.  As I see it, it's not much  
more work and it gives people options.  If we're looking to get more  
interest in Nutch, making things as easy as possible for people is a  
good thing.


Eric

--
Eric J. Christeson  
eric.christe...@ndsu.edu

Enterprise Computing and Infrastructure    (701) 231-8693 (Voice)
North Dakota State University, Fargo, North Dakota, USA



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Jukka Zitting
Hi,

On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote:
 (anyway, what's a measly 90MB nowadays .. ;)

It's a pretty long download unless you have a fast connection and a
nearby mirror.

BR,

Jukka Zitting


Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Doğacan Güney
On Thu, Mar 19, 2009 at 16:48, Jukka Zitting jukka.zitt...@gmail.com wrote:
 Hi,

 On Thu, Mar 19, 2009 at 3:38 PM, Andrzej Bialecki a...@getopt.org wrote:
 (anyway, what's a measly 90MB nowadays .. ;)

 It's a pretty long download unless you have a fast connection and a
 nearby mirror.


I agree. Can't we also do a source-only release? Kind of like a checkout from
svn (without, of course, svn bits)? I think this would be much more interesting
to me if I wasn't using trunk.

So, my suggestion is that we have 3 releases? Source only, binary only and full.


 BR,

 Jukka Zitting




-- 
Doğacan Güney


Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Andrzej Bialecki wrote:

Sami Siren wrote:


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0
but we could perhaps start the discussion about this anyway so throw 
in your opinions...


I agree with you and Jukka that we should provide separate tarballs of 
source and binaries. This likely won't result in significant size 
reductions (anyway, what's a measly 90MB nowadays .. ;) but it would 
help other parties to deploy clean binaries and/or track the officially 
released sources.


The source package is the straightforward one; its size would be about
30GB. But the binary package will still remain quite big if we need to
allow it to run in local and distributed mode (plugins in exploded format
and also the .job + .war); the size of such a binary package would still
be nearly 80G.


We could split the binary to yet smaller pieces: one for local mode, one 
for distributed mode, and the .war separately but I am not sure if 
that's worth the effort.


--
 Sami Siren




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Andrzej Bialecki

Sami Siren wrote:

Andrzej Bialecki wrote:

Sami Siren wrote:


Jukka Zitting was suggesting we should rethink the Nutch release 
packaging because of its size. I don't see this as a blocker for 1.0
but we could perhaps start the discussion about this anyway so throw 
in your opinions...


I agree with you and Jukka that we should provide separate tarballs of 
source and binaries. This likely won't result in significant size 
reductions (anyway, what's a measly 90MB nowadays .. ;) but it would 
help other parties to deploy clean binaries and/or track the 
officially released sources.


The source package is the straightforward one. Size of source package would
be about 30GB. but the binary package will still remain quite big if we 

   

Now, this is big, indeed ;)

need to allow it to run on local and distributed mode (plugins as 
exploded format and also the .job + .war), size of such binary package 
would still be nearly 80G.


We could split the binary to yet smaller pieces: one for local mode, one 
for distributed mode, and the .war separately but I am not sure if 
that's worth the effort.


I don't think so either. Please remember also that each binary 
sub-package may create its own range of support issues ...


How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a 
local job, no optional filesystems etc), the *.job and *.war files and 
scripts. Scripts would check for the presence of plugins/ dir, and offer 
an option to create it from *.job. Assumption here is that this should be
enough to run full cycle in local mode, and that people who want to run 
a distributed cluster will first install a plain Hadoop release, and 
then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), simple 
tgz.
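The "check for plugins/ and offer to create it from *.job" step in the binary package proposal above could be sketched as follows. This assumes the *.job file is an ordinary zip/jar archive that carries the plugins under a top-level `plugins/` prefix; both that layout and the function names `plugins_missing` and `extract_plugins` are assumptions made for illustration, and the real scripts are shell scripts rather than Python.

```python
import zipfile
from pathlib import Path


def plugins_missing(nutch_home: Path) -> bool:
    """True if the installation has no plugins/ directory yet."""
    return not (nutch_home / "plugins").is_dir()


def extract_plugins(job_file: Path, nutch_home: Path,
                    prefix: str = "plugins/") -> int:
    """Unpack only the entries under `prefix` from the job archive.

    Returns the number of files written below `nutch_home`.
    """
    extracted = 0
    with zipfile.ZipFile(job_file) as job:
        for name in job.namelist():
            # Skip directory entries; extract() creates parent dirs itself.
            if name.startswith(prefix) and not name.endswith("/"):
                job.extract(name, nutch_home)
                extracted += 1
    return extracted
```

A launcher script would call `plugins_missing` first and only offer the extraction when it returns true, so an existing, possibly customised plugins/ directory is never overwritten silently.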



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren
The source package is the straightforward one. Size of source package
would be about 30GB. but the binary package will still remain quite 
big if we 

   

Now, this is big, indeed ;)


heh, some serious software, need to buy more disc just to download it 
(yes I was thinking of M not G)  :)


--
 Sami Siren




Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start a 
local job, no optional filesystems etc), the *.job and *.war files and 
scripts. Scripts would check for the presence of plugins/ dir, and offer 
an option to create it from *.job. Assumption here is that this should be
enough to run full cycle in local mode, and that people who want to run 
a distributed cluster will first install a plain Hadoop release, and 
then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), simple 
tgz.



this sounds good to me. additionally some new documentation needs to be 
written too.


--
 Sami Siren



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Andrzej Bialecki

Eric J. Christeson wrote:


On Mar 19, 2009, at 12:03 PM, Sami Siren wrote:


Andrzej Bialecki wrote:

How about the following: we build just 2 packages:
* binary: this includes only base hadoop libs in lib/ (enough to 
start a local job, no optional filesystems etc), the *.job and *.war 
files and scripts. Scripts would check for the presence of plugins/ 
dir, and offer an option to create it from *.job. Assumption here is 
that this should be enough to run full cycle in local mode, and that
people who want to run a distributed cluster will first install a 
plain Hadoop release, and then just put the *.job and bin/nutch on 
the master.
* source: no build artifacts, no .svn (equivalent to svn export), 
simple tgz.



this sounds good to me. additionally some new documentation needs to 
be written too.


Distributed is a little more complicated than just dropping *.job and 
bin/nutch on a hadoop install.  Will this even work unless one edits 
config/stuff and builds a new .job?  Anyone using distributed nutch 
probably wouldn't be interested in something trivial so a step-by-step 
config how-to would probably be a good idea.


Actually, this works very well and it _is_ just a matter of dropping the 
*.job file and a (slightly) modified bin/nutch.


Some time ago I committed a fix that removed Hadoop artifacts from nutch 
*.job file. This was exactly to avoid confusion that multiple 
hadoop-site.xml and hadoop*.jar caused (one in your Hadoop install and 
the other in your Nutch job jar). So now the only place where you should 
edit Hadoop-related stuff is in your Hadoop conf/ dir, and the only 
place where you should edit Nutch-related stuff is in your Nutch conf/ 
dir (and after that indeed you need to rebuild the *.job jar and drop 
the new version to your Hadoop master).
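To confirm that a given *.job file really carries no Hadoop artifacts of its own, one could list its entries and look for the two kinds of files mentioned above. This is a hedged sketch only: the function name `hadoop_artifacts_in` is hypothetical, and the patterns are taken from the file names in this message rather than from the Nutch build file.

```python
import fnmatch
import zipfile
from pathlib import Path, PurePosixPath

# File-name patterns for the artifacts named in the message above.
HADOOP_PATTERNS = ("hadoop-site.xml", "hadoop*.jar")


def hadoop_artifacts_in(job_file: Path) -> list[str]:
    """Return the sorted archive entries whose file name looks like Hadoop's."""
    with zipfile.ZipFile(job_file) as job:
        return sorted(
            name
            for name in job.namelist()
            if any(
                # Match on the last path component only, case-sensitively.
                fnmatch.fnmatchcase(PurePosixPath(name).name, pattern)
                for pattern in HADOOP_PATTERNS
            )
        )
```

An empty result means Hadoop settings can only come from the Hadoop conf/ dir, which is the separation described in the message.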


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: [DISCUSS] contents of nutch release artifact

2009-03-19 Thread Sami Siren

Sami Siren wrote:

Andrzej Bialecki wrote:

How about the following: we build just 2 packages:

* binary: this includes only base hadoop libs in lib/ (enough to start 
a local job, no optional filesystems etc), the *.job and *.war files 
and scripts. Scripts would check for the presence of plugins/ dir, and 
offer an option to create it from *.job. Assumption here is that this 
should be enough to run full cycle in local mode, and that people who
want to run a distributed cluster will first install a plain Hadoop 
release, and then just put the *.job and bin/nutch on the master.


* source: no build artifacts, no .svn (equivalent to svn export), 
simple tgz.



this sounds good to me. additionally some new documentation needs to be 
written too.




I added a simple patch to NUTCH-728 to make a plain source release from
svn. What do people think: should we add the plain source package to the
next rc? I would not like to make changes to the binary package now, but
propose that we make those changes post 1.0.


--
 Sami Siren