Hi Prakashan & Ron,

I thought about this issue while I was writing & testing the HOWTO... but
I didn't spend much more time on it, as I needed to work on something
else, and it requires an upcoming C API binding for HDFS from Ralph.
Plus... I didn't want to pre-announce too many upcoming new features. :-)
With the architecture of Prakashan's On-demand Hadoop Cluster, we can take
advantage of Ralph's C HDFS API and write a scheduler plugin that queries
HDFS for block locations. The plugin can then influence scheduling
decisions so that Open Grid Scheduler/Grid Engine sends jobs to the data,
which IMO is the core idea behind Hadoop: scheduling jobs & tasks close to
the data.

Note that we will also need to productionize the "Parallel Environment
Queue Sort (PQS) Scheduler API", which was a technology preview in GE
2011.11:

http://gridscheduler.sourceforge.net/Releases/ReleaseNotesGE2011.11.pdf
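The HDFS side of such a plugin is mostly a lookup. Here is a rough,
untested sketch of that lookup against the C API (the file name and build
line are placeholders, and a real plugin would feed the host lists into
the queue-sort logic instead of printing them):

    /*
     * block_hosts.c -- ask HDFS which datanodes hold each block of a
     * file, i.e. the raw locality information a scheduler plugin needs.
     * Sketch only: assumes libhdfs's hdfs.h and a reachable namenode;
     * the build is site-specific, roughly:
     *     gcc -std=c99 block_hosts.c -lhdfs -ljvm
     */
    #include <stdio.h>
    #include "hdfs.h"

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <hdfs-path>\n", argv[0]);
            return 1;
        }

        /* "default" picks the namenode up from the client config. */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) {
            fprintf(stderr, "cannot connect to HDFS\n");
            return 1;
        }

        /* The file size tells hdfsGetHosts() the byte range to cover. */
        hdfsFileInfo *info = hdfsGetPathInfo(fs, argv[1]);
        if (!info) {
            fprintf(stderr, "no such path: %s\n", argv[1]);
            hdfsDisconnect(fs);
            return 1;
        }

        /* One NULL-terminated list of hostnames per block. */
        char ***hosts = hdfsGetHosts(fs, argv[1], 0, info->mSize);
        if (hosts) {
            for (int b = 0; hosts[b]; b++) {
                printf("block %d:", b);
                for (int h = 0; hosts[b][h]; h++)
                    printf(" %s", hosts[b][h]);
                printf("\n");
            }
            hdfsFreeHosts(hosts);
        }

        hdfsFreeFileInfo(info, 1);
        hdfsDisconnect(fs);
        return 0;
    }

Counting how many of a job's blocks live on each candidate host and
handing those counts to the PQS sort function is then a small step on top
of this.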
Rayson



On Mon, Jun 4, 2012 at 12:55 PM, Prakashan Korambath <[email protected]> wrote:
> Hi Ron,
>
> I don't have anything planned beyond what I released right now. The idea
> is to leave what Hadoop does best to Hadoop, and what SGE (or any
> scheduler) does best to the scheduler. I believe somebody from SDSC also
> released a similar strategy for PBS/Torque. I worked only on SGE because
> I mostly use SGE.
>
> Prakashan
>
>
> On 06/04/2012 09:45 AM, Ron Chen wrote:
>>
>> Hi Prakashan,
>>
>> I am trying to understand your integration, and it looks like Ravi
>> Chandra Nallan's Hadoop integration.
>>
>> One of the improvements in Daniel Templeton's Hadoop integration is
>> that he models HDFS data as resources, and thus can schedule jobs to
>> the data. Is scheduling jobs to data a planned feature of your
>> "On-Demand Hadoop Cluster" integration?
>>
>> For those who don't know Ravi Chandra Nallan: he was with Sun
>> Microsystems when he developed the integration. Last I checked, he was
>> with Oracle.
>>
>> -Ron
>>
>>
>> ----- Original Message -----
>> From: Rayson Ho <[email protected]>
>> To: Prakashan Korambath <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Sent: Friday, June 1, 2012 3:04 PM
>> Subject: Re: [gridengine users] Hadoop Integration HOWTO (was: Hadoop
>> Integration - how's it going)
>>
>> Thanks again Prakashan for the contribution!
>>
>> Rayson
>>
>>
>> On Fri, Jun 1, 2012 at 1:25 PM, Prakashan Korambath <[email protected]> wrote:
>>> Thank you Rayson! I appreciate you taking the time to upload the tar
>>> files and write the HOWTO.
>>>
>>> Regards,
>>>
>>> Prakashan
>>>
>>>
>>> On 06/01/2012 10:19 AM, Rayson Ho wrote:
>>>> I've reviewed the integration, and wrote a short Grid Engine Hadoop
>>>> HOWTO:
>>>>
>>>> http://gridscheduler.sourceforge.net/howto/GridEngineHadoop.html
>>>>
>>>> The difference between the two methods (the original SGE 6.2u5 one
>>>> vs. Prakashan's) is that with Prakashan's approach, Grid Engine is
>>>> used for resource allocation, and the Hadoop job scheduler/JobTracker
>>>> handles all the MapReduce operations. A Hadoop cluster is created on
>>>> demand with Prakashan's approach, whereas in the original SGE 6.2u5
>>>> method Grid Engine replaces the Hadoop job scheduler.
>>>>
>>>> As standard Grid Engine PEs are used in this new approach, one can
>>>> call "qrsh -inherit" and use Grid Engine's mechanism to start the
>>>> Hadoop services on the remote nodes, and thus get full job control,
>>>> job accounting, and cleanup-at-termination benefits like any other
>>>> tight PE job!
>>>>
>>>> Rayson
>>>>
>>>>
>>>> On Tue, May 29, 2012 at 10:36 AM, Prakashan Korambath <[email protected]> wrote:
>>>>> I put my scripts in a tar file and sent it to Rayson yesterday so
>>>>> that he can put it in a common place for download.
>>>>>
>>>>> Prakashan
>>>>>
>>>>>
>>>>> On 05/29/2012 07:18 AM, Jesse Becker wrote:
>>>>>> On Mon, May 28, 2012 at 12:00:24PM -0400, Prakashan Korambath wrote:
>>>>>>> This is how we run Hadoop using Grid Engine (or, for that matter,
>>>>>>> any scheduler with the appropriate alterations):
>>>>>>>
>>>>>>> http://www.ats.ucla.edu/clusters/hoffman2/hadoop/default.htm
>>>>>>>
>>>>>>> Basically, run a prologue (or call a script from inside the
>>>>>>> submission command file itself) that parses the output of
>>>>>>> PE_HOSTFILE to create the hadoop *.site.xml, masters and slaves
>>>>>>> files at run time. This methodology is suitable for any
>>>>>>> scheduler, as it does not depend on any particular one. If there
>>>>>>> is interest I can post the prologue script. Thanks.
>>>>>>
>>>>>> Please do.
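P.S. The PE_HOSTFILE trick Prakashan describes in the quoted thread above
is small enough to sketch. His real prologue is a site-specific script and
also generates the *.site.xml files; this untested illustration covers
only the masters/slaves step, and the conf/ output paths are made up:

    /*
     * pe2hadoop.c -- parse Grid Engine's PE_HOSTFILE: the first granted
     * host becomes the Hadoop master, and every granted host becomes a
     * slave (datanode/tasktracker).
     */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Grid Engine exports PE_HOSTFILE inside every parallel job. */
        const char *pe = getenv("PE_HOSTFILE");
        if (!pe) {
            fprintf(stderr, "PE_HOSTFILE not set -- not a PE job?\n");
            return 1;
        }

        FILE *in      = fopen(pe, "r");
        FILE *masters = fopen("conf/masters", "w");
        FILE *slaves  = fopen("conf/slaves", "w");
        if (!in || !masters || !slaves) {
            perror("fopen");
            return 1;
        }

        /* Each line reads "<host> <slots> <queue> <processors>"; only
         * the hostname in the first column matters here. */
        char line[1024], host[256];
        int first = 1;
        while (fgets(line, sizeof line, in)) {
            if (sscanf(line, "%255s", host) != 1)
                continue;                     /* skip blank lines */
            if (first) {
                fprintf(masters, "%s\n", host);
                first = 0;
            }
            fprintf(slaves, "%s\n", host);
        }

        fclose(in);
        fclose(masters);
        fclose(slaves);
        return 0;
    }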
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users