@John, to be really frank, I don't know what the limiting factor is. It might be all of them or a subset of them; I cannot tell.
On Mon, Sep 23, 2013 at 2:58 PM, Pavan Sudheendra <[email protected]> wrote:

@Rahul, yes, you are right: 21 mappers are spawned, and all 21 run at the same time. @Pradeep, I should add compression as you suggest; I'll give it a shot. As far as I can see, I'll also need to implement a Writable and write out the mapper's key using the specific data types, instead of writing it out as a string, which might be slowing the operation down.

On Mon, Sep 23, 2013 at 9:29 AM, Pradeep Gollakota <[email protected]> wrote:

Pavan,

It's hard to tell whether there's anything wrong with your design, since you haven't given us specific enough details. The best thing you can do is instrument your code and see what is taking so long. Rahul mentioned a problem I have seen myself, where only one region (or a couple) has any data; even if you have 21 regions, only one mapper might be doing the heavy lifting.

A combiner is hugely helpful in reducing the data output of mappers. Writing a combiner is a best practice, and you should almost always have one. Compression can be turned on by setting the following properties in your job config:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>

You can also try other compression codecs such as LZO, Snappy, or Bzip2, depending on your use case. Gzip is really slow but gets the best compression ratios; Snappy and LZO are a lot faster but don't achieve as good a compression ratio. If your computations are CPU bound, you'd probably want Snappy or LZO; if they are I/O bound and your CPUs are idle, you can use Gzip. You'll have to experiment and find the best settings for you. There are a lot of other tweaks you can try to get the best performance out of your cluster.

One of the best things you can do is install Ganglia (or a similar tool) on your cluster and monitor resource usage while your job is running. This will tell you whether your job is I/O bound or CPU bound.

Take a look at this paper by Intel about optimizing your Hadoop cluster and see if it fits your deployment:
http://software.intel.com/sites/default/files/m/f/4/3/2/f/31124-Optimizing_Hadoop_2010_final.pdf

If your cluster is already optimized and your job is not I/O bound, then there might be a problem with your algorithm, and it might warrant a redesign.

Hope this helps!
- Pradeep
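A minimal sketch of the combiner Pradeep recommends might look like the following. It assumes the Text/IntWritable map output types shown in Pavan's job setup later in the thread, and it treats the IntWritable value as a count to be summed; the merge logic would need adapting to whatever the value really represents.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch: pre-aggregates duplicate keys on the map side so less data is
// shuffled to the reducers. Assumes the IntWritable value is a count.
public class AnalyzeCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        sum.set(total);
        context.write(key, sum);
    }
}

It would be registered with job.setCombinerClass(AnalyzeCombiner.class). The XML properties above can equivalently be set in code via job.getConfiguration().setBoolean("mapreduce.map.output.compress", true).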
On Sun, Sep 22, 2013 at 8:14 PM, Rahul Bhattacharjee <[email protected]> wrote:

One mapper is spawned per HBase table region. You can use the HBase admin UI to find the number of regions per table. It might happen that all the data is sitting in a single region, so a single mapper is spawned and you are not getting enough parallel work done.

If that is the case, you can recreate the tables with predefined splits to create more regions.

Thanks,
Rahul

On Sun, Sep 22, 2013 at 4:38 AM, John Lilley <[email protected]> wrote:

Pavan,

How large are the rows in HBase? 22 million rows is not very much, but you mentioned "huge strings". Can you tell which part of the processing is the limiting factor (read from HBase, mapper output, reducers)?

John

From: Pavan Sudheendra [mailto:[email protected]]
Sent: Saturday, September 21, 2013 2:17 AM
To: [email protected]
Subject: Re: How to best decide mapper output/reducer input for a huge string?

No, I don't have a combiner in place. Is it necessary? How do I make my map output compressed? Yes, the tables in HBase are compressed.

Although there's no single obvious bottleneck, the time it takes to process the entire table is huge, so I constantly have to check whether I can optimize it somehow.

Oh, okay, I'll implement a custom Writable. Apart from that, do you see anything wrong with my design? Does it require any kind of rework? Thank you so much for helping.

On Sat, Sep 21, 2013 at 1:06 PM, Pradeep Gollakota <[email protected]> wrote:

One thing that comes to mind is that your keys are Strings, which are highly inefficient. You might get a lot better performance if you write a custom Writable for your key object using the appropriate data types; for example, use a long (LongWritable) for timestamps. This should make (de)serialization a lot faster, and if HouseHoldId is an integer, your comparisons for sorting will also be faster.

Ensure that your map outputs are being compressed. Are your tables in HBase compressed? Do you have a combiner?

Have you been able to profile your code to see where the bottlenecks are?
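For illustration, a sketch of the kind of key Pradeep describes, trimmed to two of the fields from this thread (HouseHoldId as an int, timestamp as a long); the class and accessor names are made up, and the remaining fields would be added the same way:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Sketch: a binary composite key replacing the concatenated string key.
// Comparing an int and a long is much cheaper than comparing strings, and
// (de)serialization is fixed-size instead of text parsing.
public class HouseholdKey implements WritableComparable<HouseholdKey> {
    private int householdId;
    private long timestamp;

    public HouseholdKey() {}  // no-arg constructor required by Hadoop

    public HouseholdKey(int householdId, long timestamp) {
        this.householdId = householdId;
        this.timestamp = timestamp;
    }

    public int getHouseholdId() {
        return householdId;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(householdId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        householdId = in.readInt();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(HouseholdKey other) {
        int cmp = Integer.compare(householdId, other.householdId);
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {
        return householdId;  // HashPartitioner then partitions by household only
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof HouseholdKey
                && ((HouseholdKey) o).householdId == householdId
                && ((HouseholdKey) o).timestamp == timestamp;
    }
}

The job would then declare it with job.setMapOutputKeyClass(HouseholdKey.class) in place of Text.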
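Separately, Rahul's earlier suggestion of recreating the tables with predefined splits could look roughly like this using the pre-1.0 HBaseAdmin API; the table name, column family, and split points below are placeholders and would have to match the real row-key layout:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: creates a table with predefined split points so data (and thus
// mappers) spread across several regions from the start.
public class PresplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("Table1_presplit");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Illustrative split keys assuming numeric HouseHoldId row prefixes.
        byte[][] splits = new byte[][] {
            Bytes.toBytes("1000000"),
            Bytes.toBytes("2000000"),
            Bytes.toBytes("3000000"),
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}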
On Sat, Sep 21, 2013 at 12:04 AM, Pavan Sudheendra <[email protected]> wrote:

Hi Pradeep,

Yes, basically I'm only writing the key part as the map output; the V of <K,V> is not of much use to me, but I'm hoping to change that if it leads to faster execution. I'm kind of a newbie, so I'm looking to make the map/reduce job run a lot faster.

Also, yes, it gets sorted by the HouseHoldID, which is what I needed. But it seems that if I write a map output for each and every row of a 19 million row HBase table, it takes nearly a day to complete (21 mappers and 21 reducers).

I have looked at both Pig and Hive to do the job, but I'm supposed to do this via an MR job, so I cannot use either of them. Do you recommend I try something else, given that I have the data in that format?

On Sat, Sep 21, 2013 at 12:26 PM, Pradeep Gollakota <[email protected]> wrote:

I'm sorry, but I don't understand your question. Is the output of the mapper you're describing the key portion? If it is the key, then your data should already be sorted by HouseHoldId, since it occurs first in your key.

The SortComparator tells Hadoop how to sort your data, so you use it if you need a non-lexical sort order. The GroupingComparator tells Hadoop how to group your data for the reducer: all KV pairs from the same group will be given to the same Reducer.

If your reduce computation needs all the KV pairs for the same HouseHoldId, then you will need to write a GroupingComparator.

Also, have you considered using a higher-level abstraction on Hadoop such as Pig, Hive, or Cascading? Sorting/grouping tasks of this kind are a LOT easier to write in those languages.

Hope this helps!
- Pradeep

On Fri, Sep 20, 2013 at 11:32 PM, Pavan Sudheendra <[email protected]> wrote:

I need to improve my MR jobs, which use HBase as both source and sink.

Basically, I'm reading data from 3 HBase tables in the mapper, writing it out as one huge string for the reducer to do some computation on and dump into an HBase table.

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The output of the mapper is something like this:

HouseHoldId contentID name duration genre type channelId personId televisionID timestamp

I'm interested in sorting on the HouseHoldID value, so I'm using this technique. I'm not interested in the V part of the pair, so I'm kind of ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> {
}

The MR job takes 22 hours to complete, which is not desirable at all. I'm supposed to optimize it to run a lot faster somehow.

scan.setCaching(750);
scan.setCacheBlocks(false);
TableMapReduceUtil.initTableMapperJob(
    Table1,               // input HBase table name
    scan,
    AnalyzeMapper.class,  // mapper
    Text.class,           // mapper output key
    IntWritable.class,    // mapper output value
    job);

TableMapReduceUtil.initTableReducerJob(
    OutputTable,                // output table
    AnalyzeReducerTable.class,  // reducer class
    job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster.

Should I use a custom SortComparator or a GroupingComparator?

--
Regards,
Pavan
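As a rough sketch of the GroupingComparator Pradeep mentions, built on the hypothetical HouseholdKey shown earlier in the thread: it groups records by HouseHoldId alone, so one reduce() call sees every record for a household even though the full key also carries a timestamp.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Sketch: groups map output by HouseHoldId only, while the full key can
// still sort on its remaining fields (the secondary-sort pattern).
public class HouseholdGroupingComparator extends WritableComparator {
    public HouseholdGroupingComparator() {
        super(HouseholdKey.class, true);  // true: instantiate keys for comparison
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return Integer.compare(((HouseholdKey) a).getHouseholdId(),
                               ((HouseholdKey) b).getHouseholdId());
    }
}

It would be wired in with job.setGroupingComparatorClass(HouseholdGroupingComparator.class), alongside a partitioner that likewise partitions on HouseHoldId only.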
