Hi John -

If you are doing extensive amounts of non-MR, C-style batch work, you may be
better served by looking at the myriad of existing schedulers (Torque,
Condor, etc.), or by investigating the interop space (one cluster, many
schedulers).
Either way, I recommend minimizing the dependency graph of your C
application where possible if you are working in a heterogeneous
environment.

Cheers,
Tim

----- Original Message -----
> From: "John Lilley" <[email protected]>
> To: [email protected]
> Sent: Friday, May 17, 2013 8:35:53 AM
> Subject: RE: Distribution of native executables and data for YARN-based execution
>
> Thanks! This sounds exactly like what I need. PUBLIC is right.
> Do you know if this works for executables as well? For example, would
> there be any issue transferring the executable bit on the file?
> John
>
> From: Vinod Kumar Vavilapalli [mailto:[email protected]]
> Sent: Friday, May 17, 2013 12:56 AM
> To: [email protected]
> Subject: Re: Distribution of native executables and data for YARN-based execution
>
> The "local resources" you mention are the exact solution for this. For
> each LocalResource, you also specify a LocalResourceVisibility, which
> takes one of three values today: PUBLIC, PRIVATE, and APPLICATION.
>
> PUBLIC resources are downloaded only once and shared by any application
> running on that node.
> PRIVATE resources are downloaded only once and shared by any application
> run by the same user on that node.
> APPLICATION resources are downloaded per application and removed after
> the application finishes.
>
> It seems like you want PUBLIC or PRIVATE.
>
> Note that for PUBLIC resources to work, the corresponding files need to
> be public on HDFS too. Also, if the remote files on HDFS are updated,
> the local copies will be downloaded afresh on each node where your
> containers run.
>
> HTH.
>
> Thanks,
> +Vinod Kumar Vavilapalli
> Hortonworks Inc.
> http://hortonworks.com/
>
> On May 16, 2013, at 2:21 PM, John Lilley wrote:
>
> I am attempting to distribute the execution of a C-based program onto a
> Hadoop cluster without using MapReduce. I have read that YARN can be
> used to schedule non-MapReduce applications by programming to the
> ASM/RM interfaces. As I understand it, I eventually get down to
> specifying each sub-task via ContainerLaunchContext.setCommands().
>
> However, the program and its shared libraries need to be on each
> worker's local disk in order to run. In addition, there is a hefty data
> set (say, 4 GB) that the application accesses via regular open()/read()
> calls from a library. I thought a decent strategy would be to push the
> program+data package to a known folder in HDFS, then launch a
> "bootstrap" that compares the HDFS folder version to a local folder,
> copying any updated files as needed before launching the native
> application task.
>
> Are there better approaches? I notice that one can implicitly copy
> "local resources" as part of the launch, but I don't want to copy 4 GB
> every time, only occasionally when the application or reference data is
> updated. Also, will my bootstrapper be allowed to set executable-mode
> bits on the programs after they are copied?
>
> Thanks,
> John
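
For reference, a minimal sketch of the LocalResource setup Vinod describes,
combined with the setCommands() launch John mentions, might look like the
following. It assumes Hadoop 2.x-era YARN APIs and that fs.defaultFS points
at HDFS; the HDFS path /apps/myapp/mytool and the link name "mytool" are
hypothetical placeholders, not anything from the thread.

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class NativeLaunchSketch {
  public static ContainerLaunchContext buildContext(Configuration conf)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path binary = new Path("/apps/myapp/mytool"); // hypothetical HDFS path
    FileStatus stat = fs.getFileStatus(binary);

    LocalResource res = Records.newRecord(LocalResource.class);
    res.setResource(ConverterUtils.getYarnUrlFromPath(binary));
    res.setType(LocalResourceType.FILE);
    res.setVisibility(LocalResourceVisibility.PUBLIC); // shared across apps
    // Size and timestamp must match the file on HDFS exactly; a changed
    // timestamp on HDFS is what triggers a fresh download on each node.
    res.setSize(stat.getLen());
    res.setTimestamp(stat.getModificationTime());

    ContainerLaunchContext ctx =
        Records.newRecord(ContainerLaunchContext.class);
    // The map key is the link name created in the container's working dir.
    ctx.setLocalResources(Collections.singletonMap("mytool", res));
    // If the executable bit does not survive localization on your cluster,
    // a chmod in the launch command is one workaround.
    ctx.setCommands(
        Collections.singletonList("chmod +x ./mytool && ./mytool"));
    return ctx;
  }
}

Note that YARN validates the registered size and timestamp against the
remote file at localization time (a mismatch fails the container), so the
application master should re-read the FileStatus each time it registers the
resource. With PUBLIC visibility this gives the "download only when
updated" behavior John is after, without a separate bootstrap step.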
