I have a little bit of a conflict of interest here, given that I have worked on Hadoop YARN all this time, but..
I have worked on torque/condor-based resource management systems too. There are many advantages to working on top of YARN; a couple that should be specifically relevant here:

- MR and non-MR workloads on the same cluster (there are a few not-so-ready MR implementations on existing schedulers, but with lots of limitations)
- Data locality, a feature that is native to Hadoop YARN and hard to simulate in other schedulers (we have experience trying this in the past)
- Elastic resource management - jobs can grow and shrink elastically

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On May 17, 2013, at 7:20 AM, Tim St Clair wrote:

> Hi John -
>
> If you are doing extensive levels of non-MR C-style batch, you may be better
> served looking at the myriad existing schedulers (torque, condor, etc.), or
> investigating the space around interop (1 cluster, many schedulers).
>
> Either way, I recommend minimizing the dependency graph of your
> C-application where possible if you are working in a heterogeneous
> environment.
>
> Cheers,
> Tim
>
> From: "John Lilley" <[email protected]>
> To: [email protected]
> Sent: Friday, May 17, 2013 8:35:53 AM
> Subject: RE: Distribution of native executables and data for YARN-based execution
>
> Thanks! This sounds exactly like what I need. PUBLIC is right.
>
> Do you know if this works for executables as well? For example, would there
> be any issue transferring the executable bit on the file?
>
> john
>
> From: Vinod Kumar Vavilapalli [mailto:[email protected]]
> Sent: Friday, May 17, 2013 12:56 AM
> To: [email protected]
> Subject: Re: Distribution of native executables and data for YARN-based execution
>
> The "local resources" mechanism you mentioned is the exact solution for this.
> For each LocalResource, you also specify a LocalResourceVisibility, which
> takes one of three values today - PUBLIC, PRIVATE and APPLICATION.
>
> PUBLIC resources are downloaded only once and shared by any application
> running on that node.
>
> PRIVATE resources are downloaded only once and shared by any application run
> by the same user on that node.
>
> APPLICATION resources are downloaded per application and removed after the
> application finishes.
>
> It seems like you want PUBLIC or PRIVATE.
>
> Note that for PUBLIC resources to work, the corresponding files need to be
> public on HDFS too.
>
> Also, if the remote files on HDFS are updated, the local copies will be
> downloaded afresh on each node where your containers run.
>
> HTH
>
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
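For concreteness, here is a minimal sketch of wiring up a PUBLIC local resource
with the Hadoop 2.x YARN client API; the HDFS path and the "myapp" link name
are hypothetical:

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.LocalResource;
    import org.apache.hadoop.yarn.api.records.LocalResourceType;
    import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
    import org.apache.hadoop.yarn.util.ConverterUtils;
    import org.apache.hadoop.yarn.util.Records;

    public class LocalResourceExample {
        // Registers an HDFS archive as a PUBLIC local resource so the NodeManager
        // downloads (and unpacks) it once per node and shares it across applications.
        public static void addPublicArchive(ContainerLaunchContext ctx, Configuration conf)
                throws Exception {
            Path hdfsPath = new Path("hdfs:///apps/myapp/native-bundle.tar.gz"); // hypothetical
            FileStatus status = FileSystem.get(conf).getFileStatus(hdfsPath);

            LocalResource res = Records.newRecord(LocalResource.class);
            res.setResource(ConverterUtils.getYarnUrlFromPath(hdfsPath));
            res.setSize(status.getLen());                      // must match the HDFS file exactly
            res.setTimestamp(status.getModificationTime());    // a stale timestamp fails localization
            res.setType(LocalResourceType.ARCHIVE);            // unpacked under the link name below
            res.setVisibility(LocalResourceVisibility.PUBLIC); // file must be world-readable on HDFS

            // The container sees the unpacked archive under ./myapp in its working directory.
            ctx.setLocalResources(Collections.singletonMap("myapp", res));
        }
    }

Since the size and timestamp must match the file on HDFS exactly, read them
from a fresh FileStatus at submission time, as above.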
> On May 16, 2013, at 2:21 PM, John Lilley wrote:
>
> I am attempting to distribute the execution of a C-based program onto a
> Hadoop cluster, without using MapReduce. I have read that YARN can be used
> to schedule non-MapReduce applications by programming to the ASM/RM
> interfaces. As I understand it, I eventually get down to specifying each
> sub-task via ContainerLaunchContext.setCommands().
>
> However, the program and shared libraries need to be stored on each worker’s
> local disk to run. In addition, there is a hefty data set that the
> application uses (say, 4GB), accessed via regular open()/read() calls by a
> library. I thought a decent strategy would be to push the program+data
> package to a known folder in HDFS, then launch a “bootstrap” that compares
> the HDFS folder version to a local folder, copying any updated files as
> needed before launching the native application task.
>
> Are there better approaches? I notice that one can implicitly copy “local
> resources” as part of the launch, but I don’t want to copy 4GB every time,
> only occasionally, when the application or reference data is updated. Also,
> will my bootstrapper be allowed to set executable-mode bits on the programs
> after they are copied?
>
> Thanks
> John
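A minimal sketch of the ContainerLaunchContext.setCommands() step mentioned
above, assuming the bundle was localized under the link name "myapp" and
contains a hypothetical bin/worker binary:

    import java.util.Arrays;

    import org.apache.hadoop.yarn.api.ApplicationConstants;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.util.Records;

    public class LaunchCommandExample {
        // Builds the launch context for one sub-task: run the localized native
        // binary directly, redirecting stdout/stderr into the container's log dir.
        public static ContainerLaunchContext buildContext() {
            ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
            ctx.setCommands(Arrays.asList(
                "./myapp/bin/worker --input part-00000"  // hypothetical binary and args
                    + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
                    + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
            return ctx;
        }
    }

The same context would also carry the local-resource map from the earlier sketch.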

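And a rough sketch of the “bootstrap” idea from the original question, assuming
hypothetical paths and that the copy does not preserve permission bits; the
explicit setExecutable() call is one way to handle the executable-bit concern:

    import java.io.File;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class BootstrapSync {
        // Refreshes a node-local copy of the program+data bundle from HDFS, but
        // only when the HDFS copy is newer, then re-marks the binaries executable.
        public static void syncIfStale(Configuration conf) throws Exception {
            Path remote = new Path("hdfs:///apps/myapp/bundle");  // hypothetical layout
            File local = new File("/var/cache/myapp/bundle");     // hypothetical layout

            FileSystem fs = FileSystem.get(conf);
            FileStatus remoteStatus = fs.getFileStatus(remote);

            if (local.exists() && local.lastModified() >= remoteStatus.getModificationTime()) {
                return;  // local copy is current; skip the 4GB transfer
            }
            FileUtil.fullyDelete(local);  // drop any stale copy before re-downloading
            fs.copyToLocalFile(remote, new Path(local.getAbsolutePath()));

            // Restore the executable bits after the download, since a plain copy
            // from HDFS may not preserve them.
            File[] bins = new File(local, "bin").listFiles();
            if (bins != null) {
                for (File f : bins) {
                    f.setExecutable(true, /* ownerOnly = */ false);
                }
            }
            local.setLastModified(remoteStatus.getModificationTime());
        }
    }

A small version-marker file on HDFS would make the staleness check more robust
than directory timestamps, but the comparison above keeps the sketch short.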