@Rahul, yes, you are right: 21 mappers are spawned, and all 21 are running at the same time. @Pradeep, I should add compression as you suggest; I'll give it a shot. As far as I can see, I'll also need to implement a Writable and write out the mapper key using the specific data types instead of writing it out as a string, which might be slowing the operation down.
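
Something along these lines is what I have in mind for the key -- just a sketch, assuming HouseHoldId fits in an int and the timestamp is an epoch-millis long (both assumptions on my part):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key: sorts by householdId first, then timestamp.
    public class HouseholdKey implements WritableComparable<HouseholdKey> {

        private int householdId;   // assumed to be an integer id
        private long timestamp;    // assumed to be epoch millis

        public HouseholdKey() {}   // no-arg constructor required by Hadoop

        public HouseholdKey(int householdId, long timestamp) {
            this.householdId = householdId;
            this.timestamp = timestamp;
        }

        public int getHouseholdId() { return householdId; }

        public long getTimestamp() { return timestamp; }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(householdId);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            householdId = in.readInt();
            timestamp = in.readLong();
        }

        @Override
        public int compareTo(HouseholdKey other) {
            int cmp = Integer.compare(householdId, other.householdId);
            return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
        }

        @Override
        public int hashCode() {
            // Partition on householdId only, so all rows for a household
            // go to the same reducer under the default HashPartitioner.
            return householdId;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof HouseholdKey)) return false;
            HouseholdKey k = (HouseholdKey) o;
            return householdId == k.householdId && timestamp == k.timestamp;
        }
    }

I'd then pass HouseholdKey.class to initTableMapperJob as the mapper output key class instead of Text.class.
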
On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <[email protected]> wrote:

> Pavan,
>
> It's hard to tell whether there's anything wrong with your design or not
> since you haven't given us specific enough details. The best thing you can
> do is instrument your code and see what is taking a long time. Rahul
> mentioned a problem that I myself have seen before, with only one region
> (or a couple) having any data. So even if you have 21 regions, only one
> mapper might be doing the heavy lifting.
>
> A combiner is hugely helpful in terms of reducing the data output of
> mappers. Writing a combiner is a best practice and you should almost
> always have one. Compression can be turned on by setting the following
> properties in your job config:
>
> <property>
>   <name>mapreduce.map.output.compress</name>
>   <value>true</value>
> </property>
> <property>
>   <name>mapreduce.map.output.compress.codec</name>
>   <value>org.apache.hadoop.io.compress.GzipCodec</value>
> </property>
>
> You can also try other compression codecs such as Lzo, Snappy, Bzip2, etc.
> depending on your use case. Gzip is really slow but gets the best
> compression ratios. Snappy/Lzo are a lot faster but don't have as good a
> compression ratio. If your computations are CPU bound, then you'd probably
> want to use Snappy/Lzo. If your computations are I/O bound and your CPUs
> are idle, you can use Gzip. You'll have to experiment and find the best
> settings for you. There are a lot of other tweaks you can try to get the
> best performance out of your cluster.
>
> One of the best things you can do is to install Ganglia (or some other
> similar tool) on your cluster and monitor resource usage while your job
> is running. This will tell you whether your job is I/O bound or CPU bound.
>
> Take a look at this paper by Intel about optimizing your Hadoop cluster
> and see if that fits your deployment:
> http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf
>
> If your cluster is already optimized and your job is not I/O bound, then
> there might be a problem with your algorithm, and it might warrant a
> redesign.
>
> Hope this helps!
> - Pradeep
>
>
> On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee
> <[email protected]> wrote:
>
>> One mapper is spawned per HBase table region. You can use the HBase admin
>> UI to find the number of regions per table. It might happen that all the
>> data is sitting in a single region, so a single mapper is spawned and you
>> are not getting enough parallel work done.
>>
>> If that is the case, then you can recreate the tables with predefined
>> splits to create more regions.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <[email protected]> wrote:
>>
>>> Pavan,
>>>
>>> How large are the rows in HBase? 22 million rows is not very much, but
>>> you mentioned “huge strings”. Can you tell which part of the processing
>>> is the limiting factor (read from HBase, mapper output, reducers)?
>>>
>>> John
>>>
>>> From: Pavan Sudheendra [mailto:[email protected]]
>>> Sent: Saturday, September 21, 2013 2:17 AM
>>> To: [email protected]
>>> Subject: Re: How to best decide mapper output/reducer input for a huge string?
>>>
>>> No, I don't have a combiner in place. Is it necessary? How do I make my
>>> map output compressed?
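
For reference, a minimal sketch of how the properties Pradeep lists above could be set in the driver code instead of the XML, together with a combiner. The Snappy choice, the job name, and the AnalyzeCombiner class are illustrative assumptions, not from the thread, and the combiner is only valid if the reduce-side aggregation is a commutative, associative sum of the IntWritable values:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AnalyzeDriver {

        // Hypothetical combiner: pre-sums the IntWritable counts per key on
        // the map side so less data crosses the network. Input and output
        // types must match the mapper output types (Text / IntWritable).
        public static class AnalyzeCombiner
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Same properties as in the XML above, set programmatically
            // (assuming the Hadoop 2.x / MR2 property names).
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                          SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "analyze-households");
            job.setJarByClass(AnalyzeDriver.class);
            job.setCombinerClass(AnalyzeCombiner.class);
            // ... then TableMapReduceUtil.initTableMapperJob /
            // initTableReducerJob as in the original driver, and
            // job.waitForCompletion(true).
        }
    }
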
>>> Yes, the tables in HBase are compressed.
>>>
>>> Although there's no real bottleneck, the time it takes to process the
>>> entire table is huge, so I have to keep checking whether I can optimize
>>> it somehow.
>>>
>>> Oh okay, I'll implement a custom Writable. Apart from that, do you see
>>> anything wrong with my design? Does it require any kind of rework?
>>> Thank you so much for helping.
>>>
>>> On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[email protected]> wrote:
>>>
>>> One thing that comes to mind is that your keys are Strings, which are
>>> highly inefficient. You might get a lot better performance if you write
>>> a custom writable for your key object using the appropriate data types.
>>> For example, use a long (LongWritable) for timestamps. This should make
>>> (de)serialization a lot faster. If HouseHoldId is an integer, your speed
>>> of comparisons for sorting will also go up.
>>>
>>> Ensure that your map outputs are being compressed. Are your tables in
>>> HBase compressed? Do you have a combiner?
>>>
>>> Have you been able to profile your code to see where the bottlenecks are?
>>>
>>> On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>> Hi Pradeep,
>>>
>>> Yes, basically I'm only writing the key part as the map output; the V of
>>> <K,V> is not of much use to me, but I'm hoping to change that if it
>>> leads to faster execution. I'm kind of a newbie, so I'm looking to make
>>> the map/reduce job run a lot faster.
>>>
>>> Also, yes, it gets sorted by the HouseHoldId, which is what I needed.
>>> But it seems that if I write a map output for each and every row of a
>>> 19 million row HBase table, it takes nearly a day to complete
>>> (21 mappers and 21 reducers).
>>>
>>> I have looked at both Pig and Hive to do the job, but I'm supposed to do
>>> this via an MR job, so I cannot use either of those. Do you recommend I
>>> try something else, given that I have the data in that format?
>>>
>>> On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:
>>>
>>> I'm sorry, but I don't understand your question. Is the output of the
>>> mapper you're describing the key portion? If it is the key, then your
>>> data should already be sorted by HouseHoldId since it occurs first in
>>> your key.
>>>
>>> The SortComparator will tell Hadoop how to sort your data, so you use
>>> this if you need a non-lexical sort order. The GroupingComparator will
>>> tell Hadoop how to group your data for the reducer. All KV pairs from
>>> the same group will be given to the same reducer.
>>>
>>> If your reduce computation needs all the KV pairs for the same
>>> HouseHoldId, then you will need to write a GroupingComparator.
>>>
>>> Also, have you considered using a higher-level abstraction on Hadoop
>>> such as Pig, Hive, Cascading, etc.? The sorting/grouping type of tasks
>>> are a LOT easier to write in these languages.
>>>
>>> Hope this helps!
>>> - Pradeep
>>>
>>> On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:
>>>
>>> I need to improve my MR job, which uses HBase as both source and sink.
>>> Basically, I'm reading data from 3 HBase tables in the mapper, writing
>>> it out as one huge string for the reducer to do some computation and
>>> dump into an HBase table.
>>>
>>> Table1 ~ 19 million rows.
>>> Table2 ~ 2 million rows.
>>> Table3 ~ 900,000 rows.
>>>
>>> The output of the mapper is something like this:
>>>
>>> HouseHoldId contentID name duration genre type channelId personId televisionID timestamp
>>>
>>> I'm interested in sorting on the basis of the HouseHoldId value, so I'm
>>> using this technique. I'm not interested in the V part of the pair, so
>>> I'm kind of ignoring it. My mapper class is defined as follows:
>>>
>>> public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
>>> }
>>>
>>> The MR job takes 22 hours to complete, which is not desirable at all.
>>> I'm supposed to optimize this somehow so that it runs a lot faster.
>>>
>>> scan.setCaching(750);
>>> scan.setCacheBlocks(false);
>>>
>>> TableMapReduceUtil.initTableMapperJob(
>>>     Table1,               // input HBase table name
>>>     scan,
>>>     AnalyzeMapper.class,  // mapper
>>>     Text.class,           // mapper output key
>>>     IntWritable.class,    // mapper output value
>>>     job);
>>>
>>> TableMapReduceUtil.initTableReducerJob(
>>>     OutputTable,                // output table
>>>     AnalyzeReducerTable.class,  // reducer class
>>>     job);
>>>
>>> job.setNumReduceTasks(RegionCount);
>>>
>>> My HBase Table1 has 21 regions, so 21 mappers are spawned. We are
>>> running an 8-node Cloudera cluster.
>>>
>>> Should I use a custom SortComparator or a GroupingComparator?
>>>
>>> --
>>> Regards-
>>> Pavan

--
Regards-
Pavan
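
For completeness, here is a sketch of the GroupingComparator Pradeep mentioned, paired with the hypothetical HouseholdKey above. It groups reducer input by HouseHoldId only, while the full (HouseHoldId, timestamp) key still drives the sort order; it's only worth adding if the reduce step genuinely needs every row for a household in a single reduce() call:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    // Hypothetical grouping comparator for the HouseholdKey sketched earlier:
    // two keys are treated as "equal" for grouping when their householdId
    // matches, so one reduce() call sees all records for a household while
    // the secondary sort on timestamp within the group is preserved.
    public class HouseholdGroupingComparator extends WritableComparator {

        public HouseholdGroupingComparator() {
            super(HouseholdKey.class, true);  // true => create key instances for comparison
        }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            HouseholdKey left = (HouseholdKey) a;
            HouseholdKey right = (HouseholdKey) b;
            return Integer.compare(left.getHouseholdId(), right.getHouseholdId());
        }
    }

It would be registered in the driver with job.setGroupingComparatorClass(HouseholdGroupingComparator.class).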
