> Taking care about a concrete TaskTracker node is not the Hadoop approach.

The "Hadoop approach" is no longer a pure MapReduce approach with the
coming of YARN and Hadoop 2.  It is a parallel processing platform.

Not every piece of data that goes into a parallel processing system is
massively huge "big data".  Some may be only hundreds of gigabytes, not
terabytes.

In my case, moving the data into HDFS involves some manipulation, both
filename changes and data changes (particularly for smaller reference data
sets), that is more easily done on the local filesystem before I push the
data into HDFS.
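To make that staging step concrete, here is a rough sketch (the file names, paths, and transformations are invented for illustration; the actual HDFS push needs a Hadoop client, so it is shown commented out):

```shell
# Stage a small reference data set locally before pushing it to HDFS.
mkdir -p staging
printf 'id|name\n1|foo\n' > staging/REF_DATA.TXT

# Filename manipulation: normalize names to lowercase before the push.
for f in staging/*.TXT; do
  mv "$f" "$(dirname "$f")/$(basename "$f" .TXT | tr 'A-Z' 'a-z').txt"
done

# Data change: rewrite the delimiter in the small reference set.
sed -i 's/|/,/g' staging/ref_data.txt

# Only now push into HDFS (requires a Hadoop client):
# hdfs dfs -mkdir -p /data/ref
# hdfs dfs -put staging/ref_data.txt /data/ref/
```

This kind of local work is exactly what is awkward to express when every action may land on a different node.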

If Oozie is going to be the workflow and scheduling system for a parallel
processing platform, it needs to account for the fact that some algorithms
will have requirements different from the MR model.

As Alejandro said in a previous note, adding an action that can run multiple
actions on the same node is not unreasonable in a YARN environment.
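Purely as an illustration of that idea, a compound action might look something like this in a workflow definition. To be clear, this is hypothetical syntax: no such action type exists in Oozie today, and the element and class names here are invented:

```xml
<!-- Hypothetical: Oozie has no "compound" action type; names are invented. -->
<action name="prepare-and-load">
  <compound>
    <shell>
      <exec>prepare.sh</exec>
    </shell>
    <java>
      <main-class>com.example.LoadToHdfs</main-class>
    </java>
  </compound>
  <ok to="end"/>
  <error to="fail"/>
</action>
```

The point is only that both steps would be guaranteed to run consecutively on the same node.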

Anyway, thanks for answering my question that there is no way to do this
currently.
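On the classpath point from my earlier message below, one rough approach is to build the classpath from `hadoop classpath`, which prints the jars and directories the Hadoop CLI itself uses. This is only a sketch: `myapp.jar`, `com.example.Main`, and the fallback path are placeholders I made up:

```shell
# Assemble a classpath for a plain 'java' launch inside a shell action.
# 'hadoop classpath' prints the Hadoop client's own classpath; the fallback
# path is an invented placeholder for machines without a Hadoop client.
build_cp() {
  local hadoop_cp
  hadoop_cp="$(hadoop classpath 2>/dev/null || echo "/opt/hadoop/lib/*")"
  echo "myapp.jar:${hadoop_cp}"
}

# Launch the (placeholder) main class with the Hadoop jars visible:
# java -cp "$(build_cp)" com.example.Main "$@"
```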

-Michael


On Wed, Oct 30, 2013 at 1:59 PM, Serega Sheypak <[email protected]> wrote:

> Looks like a global design issue.
> Taking care about a concrete TaskTracker node is not the Hadoop approach.
> It's hard to imagine what kind of problem you are trying to solve.
> What would you do if your cluster grows to 50 nodes? To 100 nodes?
>
>
>
> 2013/10/30 <[email protected]>
>
> > Using the distributed cache is a good idea for MR-based tasks, but not
> > all tasks are MR-based.
> >
> > For example, I might need to run a shell script action followed by a
> > Java action, neither of which does anything with MR, and both need to
> > work on files on the local filesystem.  It would be useful to have a
> > "compound action" that can run a shell action and a Java action on the
> > same node consecutively.  I was hoping this is what a sub-workflow is
> > for.
> >
> > One could argue that "compound things" just need to be managed via your
> > own shell action, but I like the Java action because it sets up your
> > classpath (including the Hadoop jars in your path).  I'm not sure how
> > to do this in my own shell script to launch a Java program.  So it is
> > more convenient to run a shell action that runs some bash stuff and
> > then launches a Java program to do more stuff with it before putting
> > the final result into HDFS.
> >
> > Any other ideas on ways to do this?
> > -Michael
> >
> >
> > On Wed, Oct 30, 2013 at 12:20 PM, Serega Sheypak
> > <[email protected]> wrote:
> >
> > > It's MapReduce's duty to select which TaskTracker node runs a task.
> > > Try to put your local stuff into HDFS and use the distributed cache.
> > > On 30.10.2013 at 19:22, <[email protected]> wrote:
> > >
> > > > I have two actions that need to run on the same datanode (due to
> > > > stuff on the local filesystem).  Is there any way to ensure this
> > > > in Oozie?
> > > >
> > > > For instance, if I put them into the same sub-workflow, will that
> > > > work?  Does a sub-workflow run two or more actions on the same
> > > > node?
> > > >
> > > > Thanks,
> > > > -Michael
> > > >
> > >
> >
>
