Re: Hadoop shuffling traffic

Pramod Biligiri Fri, 10 Oct 2014 00:05:01 -0700

I think it refers to the number of bytes the reducers fetch from the mappers
during the shuffle, i.e. your option 2, not the output written to HDFS.
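
As a rough cross-check, here is a minimal Python sketch that pulls the counters out of the JobClient console output so they can be compared side by side. The regular expression and the `parse_counters` helper are my own illustration, not part of Hadoop, and the sample lines are copied from the job output quoted below with the email line wrapping undone.

```python
import re

# Matches counter lines such as:
#   14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
# (pattern is an assumption based on the console output quoted in this thread)
COUNTER_LINE = re.compile(
    r"INFO mapred\.JobClient:\s+(?P<name>[^=]+?)=(?P<value>\d+)\s*$"
)


def parse_counters(console_output):
    """Return {counter name: integer value} for every 'name=value' line."""
    counters = {}
    for line in console_output.splitlines():
        match = COUNTER_LINE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


sample = """\
14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
"""

counters = parse_counters(sample)
print(counters["Reduce shuffle bytes"])           # bytes fetched by the reducer
print(counters["Map output materialized bytes"])  # map output as written for the shuffle
print(counters["HDFS_BYTES_WRITTEN"])             # final job output in HDFS
```

In this job the shuffle counter (3059) equals "Map output materialized bytes" (3059) and differs from HDFS_BYTES_WRITTEN (1106), which is consistent with it measuring map-to-reduce transfer rather than the final output.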

Pramod


On Wed, Oct 8, 2014 at 10:17 PM, Abdul Navaz <[email protected]> wrote:

> Hello,
>
> First of all, thank you very much for your help. :)
>
> I still have a doubt about this.
>
> Is the highlighted metric "Reduce shuffle bytes=3059"
>
>    1. the total bytes after the reduce phase (that is, the output file
>    which is written into HDFS),
>
> or
>
>    2. the actual shuffle traffic which is exchanged between mappers and
>    reducers before the reduce is performed?
>
> Please clarify!
>
> Thanks & Regards,
>
> Abdul Navaz
>
>
>
> From: Pramod Biligiri <[email protected]>
> Reply-To: <[email protected]>
> Date: Thursday, October 2, 2014 at 12:44 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Hadoop shuffling traffic
>
> Hi Abdul,
> That is the right metric. You can take a look at this report we made on
> this earlier:
> http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
>
> Pramod
>
> On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <[email protected]> wrote:
>
>> Hello,
>>
>> This is the portion of the output which is displayed on the console when
>> I run the sample word count job.
>>
>> map 0% reduce 0%
>>
>> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>>
>> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
>> job_201409262002_0003
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
>> reduces waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
>> waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
>> bytes=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>>
>>
>> I am trying to find the shuffle traffic, that is, the total traffic
>> generated when the mappers send their key-value pairs to the reducer. Does
>> the highlighted counter give the shuffle traffic?
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> On 9/26/14, 12:00 AM, "karthikeyan S" <[email protected]> wrote:
>>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so, it fetches data from that mapper.
>> So it's not necessary for all the mappers of a job to complete for
>> the reducer to start fetching data.
>>
>> When a reducer starts fetching data from the mappers, it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <[email protected]>
>> wrote:
>>
>> See mapreduce.job.reduce.slowstart.completedmaps.
>> It gives a hint of when reduce tasks can kick off.
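
To make the slowstart setting concrete, here is a minimal Python sketch of the threshold it implies. It assumes the scheduler compares the number of finished map tasks against ceil(fraction * total map tasks); `reduces_may_start` is a hypothetical helper for illustration, not a Hadoop API. The default fraction is 0.05, and the 16 map tasks are an assumption (a 1 GB input with 64 MB blocks).

```python
import math


def reduces_may_start(completed_maps, total_maps, slowstart=0.05):
    """True once enough map tasks have finished for reducers to be scheduled.

    slowstart mirrors mapreduce.job.reduce.slowstart.completedmaps: the
    fraction of map tasks that must complete first (assumed semantics).
    """
    threshold = math.ceil(slowstart * total_maps)
    return completed_maps >= threshold


total_maps = 16  # assumption: 1 GB input split into 64 MB blocks

# With the default fraction, one finished map is enough for reducers to start.
print(reduces_may_start(0, total_maps))
print(reduces_may_start(1, total_maps))

# With slowstart=1.0, reducers wait until every map task has finished.
print(reduces_may_start(15, total_maps, slowstart=1.0))
print(reduces_may_start(16, total_maps, slowstart=1.0))
```

Under these assumptions, seeing "map 40% reduce 10%" on the console is expected with the default value: reducers are scheduled early and begin copying map output while the remaining maps are still running.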
>>
>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <[email protected]>:
>>
>>
>> Hello,
>>
>> I have a Hadoop cluster with 1 name node and 3 data nodes. I am running the
>> sample word count job on a 1 GB file which is distributed across HDFS.
>>
>> When I run the MapReduce job, the reduce phase starts before the map phase
>> reaches 100%, e.g. map 40% reduce 10%.
>>
>> I would like to know when the shuffle traffic starts.
>>
>> -> Is there any way to find out exactly when shuffling started? Does it
>> write anything to the logs?
>> -> How can I find the total amount of shuffle traffic?
>>
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> --
>> Bing Jiang
>> Tel：(86)134-2619-1361
>> weibo: http://weibo.com/jiangbinglover
>> BLOG: www.binospace.com
>> BLOG: http://blog.sina.com.cn/jiangbinglover
>> Focus on distributed computing, HDFS/HBase
>>
>>
>>
>
