No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed.

Although there's no single obvious bottleneck, the time it takes to process the entire table is huge, so I'm constantly checking whether I can optimize it somehow. Okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of rework? Thank you so much for helping.
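On the map-output compression question: it is a configuration change rather than a code change. A minimal sketch, assuming "job" is the org.apache.hadoop.mapreduce.Job instance configured in the original post quoted below, and that the Snappy native libraries are installed on the cluster (the property names are the Hadoop 2.x ones; Hadoop 1.x uses mapred.compress.map.output and mapred.map.output.compression.codec instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;

    // Compress the intermediate map output to cut shuffle I/O.
    // This does not change the job's final output.
    Configuration conf = job.getConfiguration();
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);

As for the combiner: it is optional, and it only helps when the reduce-side computation is associative and commutative, so that partial aggregation inside each map task is safe.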
On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[email protected]> wrote:

> One thing that comes to mind is that your keys are Strings, which are
> highly inefficient. You might get much better performance if you write a
> custom Writable for your key object using the appropriate data types. For
> example, use a long (LongWritable) for timestamps. This should make
> (de)serialization a lot faster. If HouseHoldId is an integer, your
> comparisons for sorting will also be faster.
>
> Ensure that your map outputs are being compressed. Are your tables in
> HBase compressed? Do you have a combiner?
>
> Have you been able to profile your code to see where the bottlenecks are?
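To make the custom-key suggestion above concrete, here is a minimal sketch of a composite WritableComparable, assuming HouseHoldId fits in an int and the timestamp in a long; the names HouseholdKey, houseHoldId, and timestamp are illustrative, not taken from the actual job:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    // Composite map-output key: binary fields (de)serialize much faster
    // than a concatenated Text key, and they compare numerically.
    public class HouseholdKey implements WritableComparable<HouseholdKey> {
        private int houseHoldId;
        private long timestamp;

        public HouseholdKey() {}  // no-arg constructor required by Hadoop

        public HouseholdKey(int houseHoldId, long timestamp) {
            this.houseHoldId = houseHoldId;
            this.timestamp = timestamp;
        }

        public int getHouseHoldId() { return houseHoldId; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(houseHoldId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            houseHoldId = in.readInt();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(HouseholdKey o) {
            // Sort by household first, then by time.
            int c = Integer.compare(houseHoldId, o.houseHoldId);
            return c != 0 ? c : Long.compare(timestamp, o.timestamp);
        }

        @Override
        public int hashCode() {
            // Hash on the household alone so the default HashPartitioner
            // routes all records for one household to the same reducer.
            return houseHoldId;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof HouseholdKey
                && compareTo((HouseholdKey) o) == 0;
        }
    }

If a key like this were adopted, the Text.class map-output key in the initTableMapperJob() call quoted below would become HouseholdKey.class, and the mapper would extend TableMapper<HouseholdKey, IntWritable>.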
> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[email protected]> wrote:
>
>> Hi Pradeep,
>> Yes, basically I'm only writing the key part as the map output; the V of
>> <K,V> is not of much use to me, but I'm hoping to change that if it leads
>> to faster execution. I'm kind of a newbie, so I'm looking to make the
>> map/reduce job run a lot faster.
>>
>> Also, yes, it gets sorted by the HouseHoldId, which is what I needed. But
>> it seems that if I write a map output for each and every row of a
>> 19-million-row HBase table, it takes nearly a day to complete (21 mappers
>> and 21 reducers).
>>
>> I have looked at both Pig and Hive to do the job, but I'm supposed to do
>> this via an MR job, so I cannot use either of them. Do you recommend I
>> try something, given that I have the data in that format?
>>
>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:
>>
>>> I'm sorry, but I don't understand your question. Is the output of the
>>> mapper you're describing the key portion? If it is the key, then your
>>> data should already be sorted by HouseHoldId, since it occurs first in
>>> your key.
>>>
>>> The SortComparator tells Hadoop how to sort your data, so you use it
>>> when you need a non-lexical sort order. The GroupingComparator tells
>>> Hadoop how to group your data for the reducer: all KV pairs from the
>>> same group will be given to the same reduce call.
>>>
>>> If your reduce computation needs all the KV pairs for the same
>>> HouseHoldId, then you will need to write a GroupingComparator.
>>>
>>> Also, have you considered using a higher-level abstraction on Hadoop,
>>> such as Pig, Hive, Cascading, etc.? Sorting/grouping tasks are a LOT
>>> easier to write in those languages.
>>>
>>> Hope this helps!
>>> - Pradeep
>>>
>>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>>> I need to improve my MR jobs, which use HBase as both source and sink.
>>>>
>>>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>>>> it out as one huge string for the reducer to do some computation on
>>>> and dump into an HBase table.
>>>>
>>>> Table1 ~ 19 million rows
>>>> Table2 ~ 2 million rows
>>>> Table3 ~ 900,000 rows
>>>>
>>>> The output of the mapper is something like this:
>>>>
>>>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>>>
>>>> I'm interested in sorting it on the basis of the HouseHoldId value, so
>>>> I'm using this technique. I'm not interested in the V part of the
>>>> pair, so I'm effectively ignoring it. My mapper class is defined as
>>>> follows:
>>>>
>>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
>>>> }
>>>>
>>>> The MR job takes 22 hours to complete, which is not desirable at all.
>>>> I'm supposed to optimize it somehow to run a lot faster. The job is
>>>> set up like this:
>>>>
>>>> scan.setCaching(750);
>>>> scan.setCacheBlocks(false);
>>>> TableMapReduceUtil.initTableMapperJob(
>>>>     Table1,               // input HBase table name
>>>>     scan,
>>>>     AnalyzeMapper.class,  // mapper class
>>>>     Text.class,           // mapper output key
>>>>     IntWritable.class,    // mapper output value
>>>>     job);
>>>>
>>>> TableMapReduceUtil.initTableReducerJob(
>>>>     OutputTable,                // output table
>>>>     AnalyzeReducerTable.class,  // reducer class
>>>>     job);
>>>> job.setNumReduceTasks(RegionCount);
>>>>
>>>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are
>>>> running an 8-node Cloudera cluster.
>>>>
>>>> Should I use a custom SortComparator or a GroupingComparator?
>>>>
>>>> --
>>>> Regards,
>>>> Pavan

--
Regards,
Pavan
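Following up on the SortComparator vs. GroupingComparator question that closes the thread: with a composite key like the illustrative HouseholdKey sketched earlier, compareTo() already orders by HouseHoldId first, so no custom SortComparator is needed for that layout; a GroupingComparator is what guarantees that all KV pairs for one household arrive in a single reduce() call. A minimal sketch under that same assumption:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Groups reduce input by HouseHoldId alone, even though the full
    // sort order also takes the timestamp into account.
    public class HouseholdGroupingComparator extends WritableComparator {
        protected HouseholdGroupingComparator() {
            super(HouseholdKey.class, true);  // true => instantiate keys
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return Integer.compare(((HouseholdKey) a).getHouseHoldId(),
                                   ((HouseholdKey) b).getHouseHoldId());
        }
    }

    // Wiring it into the job:
    // job.setGroupingComparatorClass(HouseholdGroupingComparator.class);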
