Based on earlier discussions, I was considering using JobControl or ChainMapper to do this. But as a few of you mentioned, Pig, Cascading, or Oozie might be better. So what are the use cases for them? How do I decide which one works best for what?
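(As a point of comparison for the options above: ChainMapper wires several Mapper classes into one map task, which is essentially the same composition the Pig/Cascading optimizer performs automatically. A minimal plain-Java sketch of that idea — no Hadoop dependencies, and the three mapper functions here are hypothetical stand-ins:)

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ComposedMappers {
    // Compose three map stages into one function, so records flow
    // mapper1 -> mapper2 -> mapper3 in a single pass with no
    // intermediate output written between stages.
    static List<Integer> run(List<String> input) {
        Function<String, String> mapper1 = String::trim;        // hypothetical stage 1
        Function<String, String> mapper2 = String::toLowerCase; // hypothetical stage 2
        Function<String, Integer> mapper3 = String::length;     // hypothetical stage 3

        Function<String, Integer> composed =
                mapper1.andThen(mapper2).andThen(mapper3);

        return input.stream().map(composed).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("  Foo ", "BAR"))); // [3, 3]
    }
}
```

This is the "(mapper1 -> mapper2 -> mapper3) => reducer" shape Ted describes below; ChainMapper does the equivalent inside one Hadoop map task.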
Thank you all for your feedback.

On Mon, Mar 4, 2013 at 2:43 PM, Ted Dunning <[email protected]> wrote:

> Chaining the jobs is a fantastically inefficient solution. If you use Pig
> or Cascading, the optimizer will glue all of your map functions into a
> single mapper. The result is something like:
>
> (mapper1 -> mapper2 -> mapper3) => reducer
>
> Here the parentheses indicate that all of the map functions are executed
> as a single function formed by composing mapper1, mapper2, and mapper3.
> Writing multiple jobs to do this forces *lots* of unnecessary traffic to
> your persistent store and lots of unnecessary synchronization.
>
> You can do this optimization by hand, but using a higher-level language is
> often better for maintenance.
>
> On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <[email protected]> wrote:
>
>> You can chain MR jobs with Oozie, but I would suggest using Cascading,
>> Pig, or Hive. You can do this in a couple of lines of code, I suspect.
>> Two MapReduce jobs should not pose any kind of challenge with the right
>> tools.
>>
>> On Monday, March 4, 2013, Sandy Ryza wrote:
>>
>>> Hi Aji,
>>>
>>> Oozie is a mature project for managing MapReduce workflows.
>>> http://oozie.apache.org/
>>>
>>> -Sandy
>>>
>>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <[email protected]> wrote:
>>>
>>>> Aji,
>>>>
>>>> Why don't you just chain the jobs together?
>>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>>
>>>> Justin
>>>>
>>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <[email protected]> wrote:
>>>> > Russell, thanks for the link.
>>>> >
>>>> > I am interested in finding a solution (if one is out there) where
>>>> > Mapper1 outputs a custom object and Mapper2 can use that as input.
>>>> > One way to do this, obviously, is by writing to Accumulo, in my
>>>> > case. But is there another solution for this:
>>>> >
>>>> > List<MyObject> ----> Input to Job
>>>> >
>>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>>> > <MyObjectId, MyObject>
>>>> >
>>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>>> >
>>>> > Ideas?
>>>> >
>>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney
>>>> > <[email protected]> wrote:
>>>> >>
>>>> >> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>>> >>
>>>> >> AccumuloStorage for Pig comes with Accumulo. The easiest way would
>>>> >> be to try it.
>>>> >>
>>>> >> Russell Jurney http://datasyndrome.com
>>>> >>
>>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <[email protected]> wrote:
>>>> >>
>>>> >> Hello,
>>>> >>
>>>> >> I have an MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>>> >> Mapper3 -> Reducer1. Mapper1's input is an Accumulo table. M1's
>>>> >> output goes to M2, and so on. Finally, the Reducer writes output to
>>>> >> Accumulo.
>>>> >>
>>>> >> Questions:
>>>> >>
>>>> >> 1) Has anyone tried something like this before? Are there any
>>>> >> workflow-control APIs (in or outside of Hadoop) that can help me set
>>>> >> up the job like this, or am I limited to using Quartz for this?
>>>> >> 2) If both M2 and M3 needed to write some data to the same two
>>>> >> tables in Accumulo, is it possible to do so? Are there any good
>>>> >> Accumulo MapReduce jobs you can point me to? Blogs/pages that I can
>>>> >> use for reference (starting point/best practices)?
>>>> >>
>>>> >> Thank you in advance for any suggestions!
>>>> >>
>>>> >> Aji
>>
>> --
>> Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com
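(To make the flow sketched in the quoted thread concrete — Mapper1 emitting <MyObjectId, MyObject> pairs that Mapper2 consumes directly, with no intermediate Accumulo write, followed by a reduce — here is a dependency-free plain-Java sketch. MyObject, its fields, and both map functions are hypothetical; in a real job this shape is what ChainMapper, Pig, or Cascading would run as one composed map ahead of the reducer:)

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapperChainSketch {
    // Hypothetical stand-in for the custom object in the thread.
    static class MyObject {
        final String id;
        final int value;
        MyObject(String id, int value) { this.id = id; this.value = value; }
    }

    // Mapper1: MyObject -> (MyObjectId, MyObject).
    static Map.Entry<String, MyObject> mapper1(MyObject o) {
        return Map.entry(o.id, o);
    }

    // Mapper2: consumes Mapper1's pairs in the same pass -- no
    // intermediate table write. Here it just doubles the value.
    static Map.Entry<String, Integer> mapper2(Map.Entry<String, MyObject> e) {
        return Map.entry(e.getKey(), e.getValue().value * 2);
    }

    // Reducer: sum the mapped values per key.
    static Map<String, Integer> run(List<MyObject> input) {
        return input.stream()
                .map(MapperChainSketch::mapper1)
                .map(MapperChainSketch::mapper2)
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<MyObject> objects = List.of(
                new MyObject("a", 1), new MyObject("b", 2), new MyObject("a", 3));
        System.out.println(run(objects)); // {a=8, b=4}
    }
}
```

In real Hadoop code the per-stage types would need to be Writable (or serialized some other way) to cross the map/reduce boundary; within a single chained map task, however, objects can be handed from one mapper to the next in memory, which is exactly the saving Ted describes.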
