Ted and lhztop, here is a gist of my code: http://pastebin.com/mxY4AqBA
Can you suggest a few ways of optimizing it? I know I am re-initializing the
conf object in the map function every time it's called; I'll change that.

Anil Gupta, it's a 6-node cluster, so 6 Region Servers. I am basically trying
to do a partial join across 3 tables, perform some computation on the result,
and dump it into another table. The first table is somewhere around 19M rows,
the 2nd one 1M rows, and the 3rd table 2.5M rows. I know we could use Hive/Pig
for this, but I am required to write this as a MapReduce application.

For the first table, I created a smaller subset of 100,000 rows and ran it.
The result was my first message in this thread: it completed in one hour. For
19M rows, I cannot imagine it finishing in a reasonable time. Please suggest
something.

On Mon, Aug 26, 2013 at 12:03 PM, Pavan Sudheendra <[email protected]> wrote:

> Jens, can I set a smaller value in my application? Is this valid?
>
> conf.setInt("mapred.max.split.size", 50);
>
> This is our mapred-site.xml:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>ip-10-10-100170.eu-east-1.compute.internal:8021</value>
>   </property>
>   <property>
>     <name>mapred.job.tracker.http.address</name>
>     <value>0.0.0.0:50030</value>
>   </property>
>   <property>
>     <name>mapreduce.job.counters.max</name>
>     <value>120</value>
>   </property>
>   <property>
>     <name>mapred.output.compress</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.output.compression.type</name>
>     <value>BLOCK</value>
>   </property>
>   <property>
>     <name>mapred.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.DefaultCodec</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>org.apache.hadoop.io.compress.SnappyCodec</value>
>   </property>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>zlib.compress.level</name>
>     <value>DEFAULT_COMPRESSION</value>
>   </property>
>   <property>
>     <name>io.sort.factor</name>
>     <value>64</value>
>   </property>
>   <property>
>     <name>io.sort.record.percent</name>
>     <value>0.05</value>
>   </property>
>   <property>
>     <name>io.sort.spill.percent</name>
>     <value>0.8</value>
>   </property>
>   <property>
>     <name>mapred.reduce.parallel.copies</name>
>     <value>10</value>
>   </property>
>   <property>
>     <name>mapred.submit.replication</name>
>     <value>2</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>6</value>
>   </property>
>   <property>
>     <name>mapred.userlog.retain.hours</name>
>     <value>24</value>
>   </property>
>   <property>
>     <name>io.sort.mb</name>
>     <value>112</value>
>   </property>
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value> -Xmx471075479</value>
>   </property>
>   <property>
>     <name>mapred.job.reuse.jvm.num.tasks</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>mapred.map.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks.speculative.execution</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>mapred.reduce.slowstart.completed.maps</name>
>     <value>0.8</value>
>   </property>
> </configuration>
>
> Suggest ways to override the default value, please.
>
> On Mon, Aug 26, 2013 at 9:38 AM, anil gupta <[email protected]> wrote:
>
>> Hi Pavan,
>>
>> Standalone cluster? How many RS are you running? What are you trying to
>> achieve in MR? Have you tried increasing scanner caching?
>> "Slow" is very subjective unless we know some more details of your setup.
>>
>> ~Anil
>>
>> On Sun, Aug 25, 2013 at 5:52 PM, 李洪忠 <[email protected]> wrote:
>>
>>> You need to post your map code here so we can analyze the question.
>>> Generally, when you run MapReduce over HBase, a scanner with filter(s) is
>>> used, so the mapper count is the region count of your HBase table.
>>> As for why your reduce is so slow, my guess is that you have a
>>> disastrous join across the three tables, which produces far too many rows.
>>> On 2013/8/26 4:36, Pavan Sudheendra wrote:
>>>
>>>> Another question: why does it indicate the number of mappers as 1? Can I
>>>> change it so that multiple mappers perform the computation?
>>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>
> --
> Regards-
> Pavan

--
Regards-
Pavan
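[Editor's note] The thread's concrete suggestions so far are (a) stop re-creating the conf object and table handles inside map(), and (b) try increasing scanner caching. Below is a minimal sketch of how both typically look in an HBase TableMapper job. This is an illustration under stated assumptions, not Pavan's actual code: the class name JoinJobSketch, the table names first_table/second_table/third_table, and the caching value 500 are all hypothetical, and it assumes the 0.94-era HBase client API (HTable, TableMapReduceUtil) that matches the property names used in this thread.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class JoinJobSketch {

  public static class JoinMapper extends TableMapper<Text, Text> {
    // Open the side tables once per task in setup(), not once per map() call.
    private HTable secondTable;
    private HTable thirdTable;

    @Override
    protected void setup(Context context) throws IOException {
      Configuration conf = context.getConfiguration();
      secondTable = new HTable(conf, "second_table"); // hypothetical name
      thirdTable  = new HTable(conf, "third_table");  // hypothetical name
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      // ... look up the matching rows in secondTable/thirdTable,
      // compute, and emit key/value pairs for the reducer ...
    }

    @Override
    protected void cleanup(Context context) throws IOException {
      secondTable.close();
      thirdTable.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "partial-join");
    job.setJarByClass(JoinJobSketch.class);

    Scan scan = new Scan();
    scan.setCaching(500);        // fetch 500 rows per RPC instead of the default
    scan.setCacheBlocks(false);  // don't churn the RS block cache on a full scan

    TableMapReduceUtil.initTableMapperJob(
        "first_table",           // hypothetical name for the 19M-row source table
        scan, JoinMapper.class, Text.class, Text.class, job);
    job.setNumReduceTasks(6);    // one per node, matching mapred-site.xml
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

On the "why only 1 mapper" question: a TableInputFormat job generally gets one map task per region, so a table with a single region yields a single mapper regardless of mapred.max.split.size (that property applies to file-based input formats). Pre-splitting the source table into more regions is the usual way to raise map-side parallelism here.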
