I have a bit of a conflict of interest, given that I have worked on Hadoop YARN 
all this time, but..

I have also worked on Torque/Condor-based resource management systems. There are 
many advantages to working on top of YARN; a couple that should be specifically 
relevant here:
 - MR and non-MR workloads all on the same cluster (there are a few not-so-ready 
MR implementations on existing schedulers, but with lots of limitations)
 - Data locality, a feature that is native to Hadoop YARN and hard to simulate 
in other schedulers (we have experience trying this in the past)
 - Elastic resource management: jobs can grow and shrink their resource usage 
elastically

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On May 17, 2013, at 7:20 AM, Tim St Clair wrote:

> Hi John - 
> 
> If you are doing extensive levels of non-MR C-style batch, you may be better 
> served by looking at the myriad universes of existing schedulers (Torque, 
> Condor, etc.), or by investigating the space around interop (1 cluster, many 
> schedulers). 
>  
> 
> Either way, I recommend minimizing the dependency graph of your C application 
> where possible if you are working in a heterogeneous environment. 
> 
> Cheers,
> Tim
> 
> 
> From: "John Lilley" <[email protected]>
> To: [email protected]
> Sent: Friday, May 17, 2013 8:35:53 AM
> Subject: RE: Distribution of native executables and data for YARN-based 
> execution
> 
> Thanks!  This sounds exactly like what I need.  PUBLIC is right.
>  
> Do you know if this works for executables as well?  Like, would there be any 
> issue transferring the executable bit on the file?
>  
> john
>  
> From: Vinod Kumar Vavilapalli [mailto:[email protected]] 
> Sent: Friday, May 17, 2013 12:56 AM
> To: [email protected]
> Subject: Re: Distribution of native executables and data for YARN-based 
> execution
>  
>  
> The "local resources" you mentioned are the exact solution for this. For each 
> LocalResource, you also specify a LocalResourceVisibility, which takes one of 
> three values today - PUBLIC, PRIVATE and APPLICATION.
>  
> PUBLIC resources are downloaded only once and shared by any application 
> running on that node.
>  
> PRIVATE resources are downloaded only once and shared by any application run 
> by the same user on that node.
>  
> APPLICATION resources are downloaded per application and removed after the 
> application finishes.
>  
> Seems like you want PUBLIC or PRIVATE.
>  
> Note that for PUBLIC resources to work, the corresponding files need to be 
> public on HDFS too.
>  
> Also, if the remote files on HDFS are updated, the local copies will be 
> downloaded afresh on each node where your containers run.
>  
> HTH
>  
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>  
>  
> On May 16, 2013, at 2:21 PM, John Lilley wrote:
> 
> 
> I am attempting to distribute the execution of a C-based program onto a 
> Hadoop cluster, without using MapReduce.  I read that YARN can be used to 
> schedule non-MapReduce applications by programming to the ASM/RM interfaces.  
> As I understand it, eventually I get down to specifying each sub-task via 
> ContainerLaunchContext.setCommands().
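As a concrete sketch of the per-task command such a launch context would carry: the binary name "myapp" and its arguments below are hypothetical, and in a real ApplicationMaster the resulting list is what gets passed to ContainerLaunchContext.setCommands().

```java
import java.util.Arrays;
import java.util.List;

public class LaunchCommand {
    // Build the container launch command for a native binary. The binary is
    // expected to sit in the container's working directory (e.g. as a
    // localized resource); <LOG_DIR> is expanded by the NodeManager.
    static List<String> buildCommand(String binary, String dataDir) {
        return Arrays.asList(
            "./" + binary,
            "--data", dataDir,
            "1><LOG_DIR>/stdout",
            "2><LOG_DIR>/stderr");
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("myapp", "refdata"));
    }
}
```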
>  
> However, the program and shared libraries need to be stored on each worker’s 
> local disk to run.  In addition there is a hefty data set that the 
> application uses (say, 4GB) that is accessed via regular open()/read() calls 
> by a library.  I thought a decent strategy would be to push the program+data 
> package to a known folder in HDFS, then launch a “bootstrap” that compared 
> the HDFS folder version to a local folder, copying any updated files as 
> needed before launching the native application task.
>  
> Are there better approaches?  I notice that one can implicitly copy “local 
> resources” as part of the launch, but I don’t want to copy 4GB every time, 
> only occasionally when the application or reference data is updated.  Also, 
> will my bootstrapper be allowed to set executable-mode bits on the programs 
> after they are copied?
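The bootstrap-and-chmod idea above can be sketched with plain Java file APIs (no YARN involved). This assumes the HDFS folder has already been mirrored to a local staging directory by some fetch step; the key points are the staleness check and re-applying the executable bit, which a plain byte copy does not carry.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class Bootstrap {
    // Copy src to dst only if dst is missing or older, then restore the
    // executable bit. Returns true if a copy was performed.
    static boolean syncFile(Path src, Path dst) throws IOException {
        boolean stale = !Files.exists(dst)
            || Files.getLastModifiedTime(src)
                    .compareTo(Files.getLastModifiedTime(dst)) > 0;
        if (stale) {
            Files.createDirectories(dst.getParent());
            Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING,
                       StandardCopyOption.COPY_ATTRIBUTES);
            dst.toFile().setExecutable(true);   // re-apply the executable bit
        }
        return stale;
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("app", ".bin");
        Path dst = src.resolveSibling("local-" + src.getFileName());
        System.out.println(syncFile(src, dst));  // first call copies
        System.out.println(syncFile(src, dst));  // second call is a no-op
    }
}
```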
>  
> Thanks
> John
>  
>  
> 
