Hi Rayson,

Let me know when you have the C API bindings from Ralph ready. I can help you guys with testing them out.

Prakashan


On 06/04/2012 10:17 AM, Rayson Ho wrote:
Hi Prakashan & Ron,

I thought about this issue while I was writing & testing the HOWTO...
but I didn't spend much more time on it as I needed to work on
something else, and it requires an upcoming C API binding for HDFS
from Ralph. Plus... I didn't want to pre-announce too many upcoming
new features. :-)

With the architecture of Prakashan's On-demand Hadoop Cluster, we can
take advantage of Ralph's C HDFS API, and we can then easily write a
scheduler plugin that queries HDFS block information. This scheduler
plugin then affects scheduling decisions so that Open Grid
Scheduler/Grid Engine can send jobs to the data, which IMO is the core
idea behind Hadoop - scheduling jobs & tasks to the data.
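The C binding from Ralph isn't out yet, but the locality decision such a plugin would make is easy to sketch. Here is a toy illustration (shell; the block-to-host data is made up - the real plugin would get it from the HDFS API) of picking the host that holds the most replicas of a job's input blocks:

```shell
# Hypothetical input: one line per HDFS block of the job's input file,
# followed by the hosts that hold a replica of that block.
BLOCKS='blk_1 node01 node03
blk_2 node01 node02
blk_3 node02 node01'

# Count how many of the job's blocks each host holds, then pick the
# host with the highest count - that is where the job should land.
BEST_HOST=$(printf '%s\n' "$BLOCKS" \
  | awk '{for (i = 2; i <= NF; i++) count[$i]++}
         END {for (h in count) print count[h], h}' \
  | sort -rn | head -n 1 | awk '{print $2}')
echo "$BEST_HOST"   # node01 - it holds a replica of all three blocks
```

A real plugin would of course weigh this against load and slot availability, but the core of "send jobs to the data" is just this ranking.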

Note that we will also need to productionize the "Parallel Environment
Queue Sort (PQS) Scheduler API", which was under technology preview in
GE 2011.11:

http://gridscheduler.sourceforge.net/Releases/ReleaseNotesGE2011.11.pdf

Rayson



On Mon, Jun 4, 2012 at 12:55 PM, Prakashan Korambath<[email protected]>  wrote:
Hi Ron,

I don't have anything planned beyond what I released right now.  The idea is
to leave what Hadoop does best to Hadoop and what SGE (or any scheduler) does
best to the scheduler.  I believe somebody from SDSC also released a similar
strategy for PBS/Torque.  I worked only on SGE because I mostly use SGE.

Prakashan



On 06/04/2012 09:45 AM, Ron Chen wrote:

Hi Prakashan,


I am trying to understand your integration, and it looks similar to Ravi
Chandra Nallan's Hadoop Integration.

One of the improvements in Daniel Templeton's Hadoop Integration is that he
models HDFS data as resources, and thus can schedule jobs to data. Is
scheduling jobs to data a planned feature of your "On-Demand Hadoop Cluster"
integration?

For those who didn't know Ravi Chandra Nallan, he was with Sun Micro when
he developed the integration. Last I checked, he was with Oracle.

  -Ron




----- Original Message -----
From: Rayson Ho<[email protected]>
To: Prakashan Korambath<[email protected]>
Cc: "[email protected]"<[email protected]>
Sent: Friday, June 1, 2012 3:04 PM
Subject: Re: [gridengine users] Hadoop Integration HOWTO (was: Hadoop
Integration - how's it going)

Thanks again Prakashan for the contribution!

Rayson



On Fri, Jun 1, 2012 at 1:25 PM, Prakashan Korambath<[email protected]>
  wrote:

Thank you Rayson!  Appreciate you taking the time to upload the tar files
and write the HOWTO.

Regards,

Prakashan



On 06/01/2012 10:19 AM, Rayson Ho wrote:


I've reviewed the integration, and wrote a short Grid Engine Hadoop
HOWTO:

http://gridscheduler.sourceforge.net/howto/GridEngineHadoop.html

The difference between the 2 methods (original SGE 6.2u5 vs
Prakashan's) is that with Prakashan's approach, Grid Engine is used
for resource allocation, and the Hadoop job scheduler/Job Tracker is
used to handle all the MapReduce operations. A Hadoop cluster is
created on demand with Prakashan's approach, but in the original SGE
6.2u5 method Grid Engine replaces the Hadoop job scheduler.

As standard Grid Engine PEs are used in this new approach, one can
call "qrsh -inherit" and use Grid Engine's method to start Hadoop
services on remote nodes, and thus get the full job control, job
accounting, and cleanup-at-termination benefits of any other tight PE
job!

Rayson



On Tue, May 29, 2012 at 10:36 AM, Prakashan Korambath<[email protected]>
  wrote:


I put my scripts in a tar file and sent it to Rayson yesterday so that
he can put it in a common place for download.

Prakashan



On 05/29/2012 07:18 AM, Jesse Becker wrote:



On Mon, May 28, 2012 at 12:00:24PM -0400, Prakashan
Korambath wrote:




This is how we run Hadoop using Grid Engine (or, for that matter,
any scheduler with appropriate alterations):

http://www.ats.ucla.edu/clusters/hoffman2/hadoop/default.htm

Basically, either run a prolog or call a script inside the
submission command file itself to parse the contents of
PE_HOSTFILE and create the Hadoop *.site.xml, masters, and slaves
files at run time. This methodology is suitable for any
scheduler as it is not dependent on one. If there is
interest I can post the prologue script. Thanks.
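To make the approach concrete, here is a rough sketch of what such a prolog might do (the two-node host list, file names, and port are made up for illustration; under Grid Engine, tight-integration PE jobs get the real host file via $PE_HOSTFILE):

```shell
# Fabricate a two-node PE host file for illustration. Each line is:
# hostname slots queue processor-range.
PE_HOSTFILE=$(mktemp)
printf 'node01 4 all.q@node01 UNDEFINED\nnode02 4 all.q@node02 UNDEFINED\n' \
  > "$PE_HOSTFILE"

CONF_DIR=$(mktemp -d)

# Elect the first granted host as the master (NameNode/JobTracker),
# and list every granted host as a slave (DataNode/TaskTracker).
awk 'NR==1 {print $1}' "$PE_HOSTFILE" > "$CONF_DIR/masters"
awk '{print $1}'       "$PE_HOSTFILE" > "$CONF_DIR/slaves"

# Generate a minimal core-site.xml pointing HDFS at the elected master.
MASTER=$(head -n 1 "$CONF_DIR/masters")
cat > "$CONF_DIR/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://$MASTER:9000</value>
  </property>
</configuration>
EOF
```

The real prolog would also fill in mapred-site.xml and hdfs-site.xml the same way, then start the daemons - but the whole trick is this translation from the scheduler's host list to Hadoop's config files.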




Please do.




_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users


