One reason we are experimenting with Dan's Hadoop integration with GE is its ability to control how many nodes can be used for a given Hadoop MapReduce job. This is an important feature in an environment where many users are vying for resources to run their Hadoop MapReduce jobs.
Regards,
- Chansup

On Tue, Mar 6, 2012 at 1:50 PM, Rayson Ho <[email protected]> wrote:
> On Tue, Mar 6, 2012 at 1:18 PM, Heywood, Todd <[email protected]> wrote:
> > Rayson, the point of your 2nd paragraph is good. I'd like to see dynamic
> > partitioning of a cluster between GridEngine and Hadoop nodes, where idle
> > nodes in each partition could be requisitioned on-the-fly for the other
> > partition. For the SGE side, you could just stop and start sgeexecd's, but
> > for the Hadoop side you would need to gracefully add and remove nodes from
> > Hadoop.
>
> I looked at Hadoop in Sept & Oct last year (and went to a Hadoop
> workshop with a friend in Nov - heck! It's almost 4 months ago!). I
> recall downloading an old version of Hadoop so that I could compile
> the examples in the books too.
>
> I agree that on the Hadoop side you can't just remove an HDFS node
> (unless you always remove less than 3 nodes and give HDFS enough time
> to do its job to rebuild the replicated data - and during the period
> make sure that hard disks and nodes don't fail :-D ).
>
> Even in the original SGE-Hadoop integration HDFS nodes are always
> running (or else the load sensors wouldn't be able to report the
> locality of data in HDFS). So SGE does not get a "full" node. HDFS
> does balancing in the background and in later versions there is also
> the HDFS snapshot feature that can use up processor cycles & I/O...
>
> So, partitioning is the cleaner approach, and I remember on the Hadoop
> dev mailing list there were discussions on partitioning a cluster such
> that Hadoop & other batch systems do not over commit the execution
> nodes.
>
> Rayson
>
> >
> > Todd
> >
> > -----Original Message-----
> > From: Rayson Ho <[email protected]>
> > Date: Tue, 6 Mar 2012 13:09:29 -0500
> > To: CB <[email protected]>
> > Cc: Todd Heywood <[email protected]>, "[email protected]"
> > <[email protected]>
> > Subject: Re: [gridengine users] Hadoop integration
> >
> >>Chansup,
> >>
> >>Did you need to change anything in GE2011.11 to integrate it with
> >>Hadoop?? I am finishing up GE2011.11 patch 1 (ie. GE2011.11 update-0
> >>patch-1), so if the changes are small and isolated, then I can quickly
> >>integrate them into the patch 1 release, or else I will just push them
> >>into patch 2 & GE2011.11 u1.
> >>
> >>
> >>Todd,
> >>
> >>The SGE-Hadoop integration uses Grid Engine as the job scheduler for
> >>Hadoop jobs, and the integration has the Herd JSV & load sensor that
> >>talk to HDFS to request & report data locality. There was a big API
> >>change in Hadoop 0.20.x for the Hadoop 1.0 release. I recall someone
> >>contributed a small patch that fixed things related to Hadoop, and
> >>that part is in GE 2011.11 already, but I don't recall changing any of
> >>the Java code in the GE2011.11 release for Hadoop.
> >>
> >>However, to be honest, using the SGE-Hadoop integration means that you
> >>need to give up the Hadoop job scheduler, and thus to get the full
> >>functionality of a normal Hadoop cluster, Grid Engine needs to
> >>implement all the features of the scheduler in Hadoop. For example,
> >>the Hadoop scheduler supports "Speculative Execution" and Grid Engine
> >>does not have it.
> >>
> >>Rayson
> >>
> >>
> >>
> >>On Tue, Mar 6, 2012 at 12:53 PM, CB <[email protected]> wrote:
> >>> Hi Todd,
> >>>
> >>> I have implemented a hadoop (0.20.2 version) integration with OGE2011.11
> >>> release based on Dan T's work as described in the link below. We are
> >>> experimenting on the development cluster for internal projects.
> >>>
> >>> Dan T's hadoop module was built with hadoop 0.20.x release.
> >>> So it will require some changes in order to work with the latest hadoop 1.x
> >>> release. This is on my ToDo list. :-)
> >>>
> >>> Regards,
> >>> - Chansup
> >>>
> >>>
> >>> On Tue, Mar 6, 2012 at 12:21 PM, Heywood, Todd <[email protected]> wrote:
> >>>>
> >>>> Yes. There also used to be something similar called Hadoop-on-Demand.
> >>>>
> >>>> But the idea is to schedule jobs to a persistent HDFS, sending jobs to
> >>>> where the data is, as opposed to setting up and tearing down HDFS for
> >>>> every job.
> >>>>
> >>>> I probably should have given this as background:
> >>>>
> >>>> https://blogs.oracle.com/templedf/entry/beta_testing_the_sun_grid
> >>>>
> >>>>
> >>>> -----Original Message-----
> >>>> From: "Hung-Sheng Tsao (LaoTsao) Ph.D" <[email protected]>
> >>>> Date: Tue, 6 Mar 2012 12:12:06 -0500
> >>>> To: Todd Heywood <[email protected]>
> >>>> Cc: "[email protected]" <[email protected]>
> >>>> Subject: Re: [gridengine users] Hadoop integration
> >>>>
> >>>> >did you see this blog?
> >>>> >https://blogs.oracle.com/ravee/entry/creating_hadoop_pe_under_sge
> >>>> >
> >>>> >Sent from my iPad
> >>>> >
> >>>> >On Mar 6, 2012, at 11:45, "Heywood, Todd" <[email protected]> wrote:
> >>>> >
> >>>> >> Way back when SGE was still at Sun, Dan Templeton wrote a SGE-Hadoop
> >>>> >>integration for 6.2u5 (Sun's distribution as a value-added feature).
> >>>> >>
> >>>> >> I have been told that because changes have been made to the Hadoop
> >>>> >>API since Oracle purchased Sun this integration no longer works - at
> >>>> >>least not in the open source versions following 6.2u5.
> >>>> >>
> >>>> >> Does anyone know if this is true? Has anyone worked with this recently?
> >>>> >>I do see a hadoop.tar.gz at the SoGE site
> >>>> >>http://arc.liv.ac.uk/downloads/SGE/releases/8.0.0d/ but it looks to me
> >>>> >>like it is probably the 2-3 year old code from Sun (with no
> >>>> >>documentation since it was a value-added feature for Sun).
> >>>> >>
> >>>> >> Thanks,
> >>>> >>
> >>>> >> Todd Heywood
> >>>> >>
> >>>> >>
> >>>> >> _______________________________________________
> >>>> >> users mailing list
> >>>> >> [email protected]
> >>>> >> https://gridengine.org/mailman/listinfo/users
