Based on earlier discussions, I was considering using JobControl or ChainMapper to do this. But as a few of you mentioned, Pig, Cascading, or Oozie might be better. So what are the use cases for them? How do I decide which one works best for what?
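(As a point of comparison for the options above: ChainMapper wires several Mapper classes into one map task, which is essentially the same composition the Pig/Cascading optimizer performs automatically. A minimal plain-Java sketch of that idea — no Hadoop dependencies, and the three mapper functions here are hypothetical stand-ins:)

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ComposedMappers {
    // Compose three map stages into one function, so records flow
    // mapper1 -> mapper2 -> mapper3 in a single pass with no
    // intermediate output written between stages.
    static List<Integer> run(List<String> input) {
        Function<String, String> mapper1 = String::trim;        // hypothetical stage 1
        Function<String, String> mapper2 = String::toLowerCase; // hypothetical stage 2
        Function<String, Integer> mapper3 = String::length;     // hypothetical stage 3

        Function<String, Integer> composed =
                mapper1.andThen(mapper2).andThen(mapper3);

        return input.stream().map(composed).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("  Foo ", "BAR"))); // [3, 3]
    }
}
```

This is the "(mapper1 -> mapper2 -> mapper3) => reducer" shape Ted describes below; ChainMapper does the equivalent inside one Hadoop map task.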
Thank you all for your feedback.

On Mon, Mar 4, 2013 at 2:43 PM, Ted Dunning <[email protected]> wrote:

> Chaining the jobs is a fantastically inefficient solution. If you use Pig
> or Cascading, the optimizer will glue all of your map functions into a
> single mapper. The result is something like:
>
> (mapper1 -> mapper2 -> mapper3) => reducer
>
> Here the parentheses indicate that all of the map functions are executed
> as a single function formed by composing mapper1, mapper2, and mapper3.
> Writing multiple jobs to do this forces *lots* of unnecessary traffic to
> your persistent store and lots of unnecessary synchronization.
>
> You can do this optimization by hand, but using a higher-level language is
> often better for maintenance.
>
> On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <[email protected]> wrote:
>
>> You can chain MR jobs with Oozie, but I would suggest using Cascading,
>> Pig, or Hive. You can do this in a couple of lines of code, I suspect.
>> Two MapReduce jobs should not pose any kind of challenge with the right
>> tools.
>>
>> On Monday, March 4, 2013, Sandy Ryza wrote:
>>
>>> Hi Aji,
>>>
>>> Oozie is a mature project for managing MapReduce workflows.
>>> http://oozie.apache.org/
>>>
>>> -Sandy
>>>
>>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <[email protected]> wrote:
>>>
>>>> Aji,
>>>>
>>>> Why don't you just chain the jobs together?
>>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>>
>>>> Justin
>>>>
>>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <[email protected]> wrote:
>>>> > Russell, thanks for the link.
>>>> >
>>>> > I am interested in finding a solution (if one is out there) where
>>>> > Mapper1 outputs a custom object and Mapper2 can use that as input.
>>>> > One way to do this, obviously, is by writing to Accumulo, in my
>>>> > case. But is there another solution for this:
>>>> >
>>>> > List<MyObject> ----> Input to Job
>>>> >
>>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>>> > <MyObjectId, MyObject>
>>>> >
>>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>>> >
>>>> > Ideas?
>>>> >
>>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney
>>>> > <[email protected]> wrote:
>>>> >>
>>>> >> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>>> >>
>>>> >> AccumuloStorage for Pig comes with Accumulo. The easiest way would
>>>> >> be to try it.
>>>> >>
>>>> >> Russell Jurney http://datasyndrome.com
>>>> >>
>>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <[email protected]> wrote:
>>>> >>
>>>> >> Hello,
>>>> >>
>>>> >> I have an MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>>> >> Mapper3 -> Reducer1. Mapper1's input is an Accumulo table. M1's
>>>> >> output goes to M2, and so on. Finally, the Reducer writes output to
>>>> >> Accumulo.
>>>> >>
>>>> >> Questions:
>>>> >>
>>>> >> 1) Has anyone tried something like this before? Are there any
>>>> >> workflow-control APIs (in or outside of Hadoop) that can help me set
>>>> >> up the job like this, or am I limited to using Quartz for this?
>>>> >> 2) If both M2 and M3 needed to write some data to the same two
>>>> >> tables in Accumulo, is it possible to do so? Are there any good
>>>> >> Accumulo MapReduce jobs you can point me to? Blogs/pages that I can
>>>> >> use for reference (starting point/best practices)?
>>>> >>
>>>> >> Thank you in advance for any suggestions!
>>>> >>
>>>> >> Aji
>>
>> --
>> Russell Jurney twitter.com/rjurney [email protected] datasyndrome.com
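(To make the flow sketched in the quoted thread concrete — Mapper1 emitting <MyObjectId, MyObject> pairs that Mapper2 consumes directly, with no intermediate Accumulo write, followed by a reduce — here is a dependency-free plain-Java sketch. MyObject, its fields, and both map functions are hypothetical; in a real job this shape is what ChainMapper, Pig, or Cascading would run as one composed map ahead of the reducer:)

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MapperChainSketch {
    // Hypothetical stand-in for the custom object in the thread.
    static class MyObject {
        final String id;
        final int value;
        MyObject(String id, int value) { this.id = id; this.value = value; }
    }

    // Mapper1: MyObject -> (MyObjectId, MyObject).
    static Map.Entry<String, MyObject> mapper1(MyObject o) {
        return Map.entry(o.id, o);
    }

    // Mapper2: consumes Mapper1's pairs in the same pass -- no
    // intermediate table write. Here it just doubles the value.
    static Map.Entry<String, Integer> mapper2(Map.Entry<String, MyObject> e) {
        return Map.entry(e.getKey(), e.getValue().value * 2);
    }

    // Reducer: sum the mapped values per key.
    static Map<String, Integer> run(List<MyObject> input) {
        return input.stream()
                .map(MapperChainSketch::mapper1)
                .map(MapperChainSketch::mapper2)
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<MyObject> objects = List.of(
                new MyObject("a", 1), new MyObject("b", 2), new MyObject("a", 3));
        System.out.println(run(objects)); // {a=8, b=4}
    }
}
```

In real Hadoop code the per-stage types would need to be Writable (or serialized some other way) to cross the map/reduce boundary; within a single chained map task, however, objects can be handed from one mapper to the next in memory, which is exactly the saving Ted describes.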
