Just want to update everyone - I followed up with Ralph @ EMC and looked at his code, which is very similar to DanT's code in SGE 6.2u5 - i.e. they both pull information from HDFS and use the locality info to affect scheduling.
However, the APIs used are different, and we will pay attention to the
Hadoop 2.x API changes and test DanT's integration again when 2.x comes
out.

CB, can you let me know about the multi-user issue? As mentioned before,
we have HBase, Pig, Hive, etc. tested with our Hadoop setup, but we
don't have real users on it, so it would really help if you could let us
know the issues you've encountered.

Rayson

On Fri, Mar 30, 2012 at 3:18 PM, CB <[email protected]> wrote:
> I'm very much interested in the SGE + Hadoop enhancement.
>
> I'm currently testing Dan T's Hadoop + SGE integration for a multi-user
> environment on an internal dev cluster and it's working nicely.
> But it is not easy to set up. It requires changing file permissions in
> various places to make it work in a multi-user environment.
>
> - Chansup
>
> On Fri, Mar 30, 2012 at 1:42 PM, Chris Dagdigian <[email protected]> wrote:
>>
>> I'm registering my interest here.
>>
>> Reuti -- if you could pass my email along to Ralph I'd appreciate it.
>>
>> I have several consulting customers using EMC Isilon storage on Grid
>> Engine HPC clusters, and we've been getting pinged by EMC/Greenplum
>> sales reps pushing to show off the combination of native HDFS support
>> in Isilon + the Greenplum Hadoop appliance integration.
>>
>> Basically, I have a few largish sites that could test & provide
>> feedback if things work out. Some are commercial, some are .gov, and
>> all are interested in SGE + Hadoop enhancements.
>>
>> -dag
>>
>> Reuti wrote:
>>>
>>> On behalf of Ralph Castain, whom you may know from the Open MPI
>>> mailing list, I want to forward this email to your attention.
>>>
>>> -- Reuti
>>>
>>>> I have a question for the Gridengine community, but thought I'd run
>>>> it through you as I believe you work in that area?
>>>>
>>>> As you may know, I am now employed by Greenplum/EMC to work on
>>>> resource management for Hadoop as well as MPI.
>>>> The main concern, frankly, is that the current Hadoop RM (YARN)
>>>> scales poorly in terms of launch and provides no support for MPI
>>>> wireup, thus causing MPI jobs to exhibit quadratic scaling of
>>>> startup times.
>>>>
>>>> The only reason for using YARN is that it has the HDFS interface
>>>> required to determine file locality, thus allowing users to place
>>>> processes network-near to the files they will use. I have initiated
>>>> an effort here at GP to create a C library for accessing HDFS to
>>>> obtain that locality info, and expect to have it completed in the
>>>> next few weeks.
>>>>
>>>> Armed with that capability, it would be possible to extend more
>>>> capable RMs such as Gridengine so that users could obtain HDFS-based
>>>> allocations for their MapReduce applications. This would allow
>>>> Gridengine to support Hadoop operations, and make Hadoop clusters
>>>> that used Gridengine as their RM "multi-use".
>>>>
>>>> Would this be of interest to the community? I can contribute the
>>>> C-lib code for their use under a BSD-like license structure, if that
>>>> would help.
>>>>
>>>> Regards,
>>>> Ralph

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
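[For anyone following the thread who hasn't seen DanT's or Ralph's code:
the locality-aware placement being discussed boils down to a simple
decision, which the hypothetical Python sketch below illustrates. It is
not taken from either integration; `pick_host` and the node names are
made up. The input mimics the per-block host lists that an HDFS locality
query (e.g. libhdfs's hdfsGetHosts) reports, and the scheduler-side
choice is simply: prefer the available execution host storing the most
blocks of the input file.]

```python
# Hypothetical sketch of HDFS-locality-aware placement, as discussed in
# this thread: pick the execution host that stores the most blocks of
# the job's input file, falling back to any available host.

from collections import Counter

def pick_host(block_hosts, available_hosts):
    """block_hosts: list of lists; block_hosts[i] is the set of
    hostnames storing replicas of block i (the shape an HDFS locality
    query returns). available_hosts: hosts the RM can schedule on.
    Returns the available host holding the most local blocks."""
    local_count = Counter()
    for replicas in block_hosts:
        for host in set(replicas):
            if host in available_hosts:
                local_count[host] += 1
    if not local_count:
        # No data-local host is free: fall back to any available host.
        return next(iter(available_hosts), None)
    # The host with the highest local-block count wins.
    return max(local_count, key=local_count.get)

# Example: a 3-block file replicated across four nodes.
blocks = [["node1", "node2"], ["node2", "node3"], ["node2", "node4"]]
print(pick_host(blocks, {"node1", "node2", "node3"}))
```

In a real SGE integration this preference would of course be expressed
through the scheduler (e.g. soft resource requests per host) rather
than a hard pick, but the ranking logic is the same.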
