Re: How to best decide mapper output/reducer input for a huge string?

Pavan Sudheendra Sat, 21 Sep 2013 00:05:23 -0700

Hi Pradeep,
Yes. Basically, I'm only writing the key part as the map output; the V of
<K,V> is not of much use to me. But I'm hoping to change that if it leads
to faster execution. I'm kind of a newbie, so I'm looking to make the
map/reduce job run a lot faster.
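
For illustration, here is a minimal plain-Java sketch of what "using the V part" could look like: the record is split at the first space, so the HouseHoldID becomes a short key and the remaining fields become the value. The class name, method name, and sample record are made up for this example, and it assumes the HouseHoldID is always the first space-separated field. In the real job the two parts would presumably be written as a Text key and a Text value; whether this actually speeds anything up has not been measured.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java sketch (no Hadoop dependency): split the mapper's long
// space-separated record into a short key and a value.
class KeyValueSplitSketch {

    // Assumption: the first space-separated field is the HouseHoldID.
    static Map.Entry<String, String> split(String record) {
        int firstSpace = record.indexOf(' ');
        if (firstSpace < 0) {
            // No other fields: the whole record is the key, the value is empty.
            return new SimpleEntry<>(record, "");
        }
        return new SimpleEntry<>(record.substring(0, firstSpace),
                                 record.substring(firstSpace + 1));
    }

    public static void main(String[] args) {
        // Hypothetical record in the field order described in the thread.
        Map.Entry<String, String> kv =
                split("HH42 c7 NewsAtTen 1800 news live ch5 p3 tv9 1379746800");
        System.out.println("key   = " + kv.getKey());
        System.out.println("value = " + kv.getValue());
    }
}
```

With a short key, the sort only has to compare HouseHoldIDs instead of the whole string, although the total number of bytes shuffled stays about the same.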


Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But
if I write a map output for each and every row of a 19-million-row HBase
table, it takes nearly a day to complete (21 mappers and 21 reducers).
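
To put that in perspective, here is a back-of-the-envelope calculation using the 22-hour runtime and 21 mappers from the original message quoted below. It is only arithmetic on those figures (the class and method names are made up), and it assumes the whole runtime is spread evenly across the rows, which is a simplification since the reduce phase is included.

```java
// Rough throughput estimate from the figures quoted in the thread.
class ThroughputSketch {

    // Average rows processed per second over the whole job.
    static double rowsPerSecond(long rows, double hours) {
        return rows / (hours * 3600.0);
    }

    public static void main(String[] args) {
        long rows = 19_000_000L;  // Table1 row count from the thread
        double hours = 22.0;      // reported job runtime
        int mappers = 21;         // one mapper per region

        double overall = rowsPerSecond(rows, hours);
        System.out.printf("overall:    %.1f rows/s%n", overall);
        System.out.printf("per mapper: %.1f rows/s%n", overall / mappers);
    }
}
```

That works out to roughly 240 rows per second overall, or about 11 per mapper. That looks low for a plain scan-and-emit, so it may be worth measuring where the per-row time goes before tuning the sort.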

I have looked at both Pig and Hive to do the job, but I'm supposed to do
this via an MR job, so I cannot use either of them. Is there something you
would recommend I try, given that I have the data in that format?


On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:

> I'm sorry but I don't understand your question. Is the output of the
> mapper you're describing the key portion? If it is the key, then your data
> should already be sorted by HouseHoldId since it occurs first in your key.
>
> The SortComparator will tell Hadoop how to sort your data. So you use this
> if you have a need for a non-lexical sort order. The GroupingComparator
> will tell Hadoop how to group your data for the reducer. All KV-pairs from
> the same group will be given to the same reduce() call.
>
> If your reduce computation needs all the KV-pairs for the same
> HouseHoldId, then you will need to write a GroupingComparator.
>
> Also, have you considered using a higher-level abstraction on Hadoop such
> as Pig, Hive, Cascading, etc.? Sorting/grouping tasks are a LOT easier to
> write in these languages.
>
> Hope this helps!
> - Pradeep
>
>
> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:
>
>> I need to improve my MR job, which uses HBase as both source and sink.
>>
>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>> it out as one huge string for the reducer to do some computation on and
>> dump into an HBase table.
>>
>> Table1 ~ 19 million rows. Table2 ~ 2 million rows. Table3 ~ 900,000 rows.
>>
>> The output of the mapper is something like this:
>>
>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>
>> I'm interested in sorting it on the basis of the HouseHoldID value, so I'm
>> using this technique. I'm not interested in the V part of the pair, so I'm
>> ignoring it. My mapper class is defined as follows:
>>
>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
>>
>> My MR job takes 22 hours to complete, which is not desirable at all. I'm
>> supposed to optimize it somehow to run a lot faster.
>>
>> scan.setCaching(750);
>> scan.setCacheBlocks(false);
>> TableMapReduceUtil.initTableMapperJob(
>>         Table1,                 // input HBase table name
>>         scan,
>>         AnalyzeMapper.class,    // mapper
>>         Text.class,             // mapper output key
>>         IntWritable.class,      // mapper output value
>>         job);
>>
>> TableMapReduceUtil.initTableReducerJob(
>>         OutputTable,                // output table
>>         AnalyzeReducerTable.class,  // reducer class
>>         job);
>> job.setNumReduceTasks(RegionCount);
>>
>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running
>> an 8-node Cloudera cluster.
>>
>> Should I use a custom SortComparator or a GroupingComparator?
>>
>>
>> --
>> Regards-
>> Pavan
>>
>
>
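
The difference between the SortComparator and the GroupingComparator described in the quoted reply can be sketched in plain Java, without any Hadoop classes. This is only a simulation of the idea: the class name, the helper methods, and the sample keys are made up, and java.util.Comparator stands in for Hadoop's comparator classes. The sort order compares the full key, while the grouping comparison looks only at the HouseHoldID (assumed to be the first space-separated field), so consecutive sorted keys with the same HouseHoldID end up in one group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of sort vs. grouping comparison on composite keys.
class ComparatorSketch {

    // Assumption: the HouseHoldID is the first space-separated field of the key.
    static String householdId(String key) {
        int firstSpace = key.indexOf(' ');
        return firstSpace < 0 ? key : key.substring(0, firstSpace);
    }

    // Stand-in for a sort comparator: lexical order on the full key.
    static final Comparator<String> SORT = Comparator.naturalOrder();

    // Stand-in for a grouping comparator: keys are "equal" if HouseHoldIDs match.
    static final Comparator<String> GROUP =
            Comparator.comparing(ComparatorSketch::householdId);

    // Sort the keys, then start a new group whenever the grouping
    // comparison says the current key differs from the previous one.
    static List<List<String>> group(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sorted) {
            if (previous == null || GROUP.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical keys: HouseHoldID followed by a contentID.
        List<String> keys = List.of("HH2 c1", "HH1 c9", "HH1 c3");
        for (List<String> g : group(keys)) {
            System.out.println(g);
        }
    }
}
```

In an actual job, grouping by HouseHoldID only helps if all keys for one HouseHoldID also reach the same reducer, which would normally mean partitioning on the HouseHoldID as well.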


-- 
Regards-
Pavan
