By the way, to be clear, I run repartition first to force all of the data through the shuffle, instead of running reduceByKey etc. directly (which reduces the amount of data that needs to be shuffled and serialized). So in this setup all ~50MB/s of data read from HDFS goes through the serializer. (In fact, I also tried generating the data in memory directly instead of reading it from HDFS, with a similar throughput result.)
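For reference, a minimal sketch of the kind of job described above, assuming Spark's Scala RDD API; the paths, app name, and partition count are illustrative, not taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: WordCount where repartition() forces every (String, Int) pair
    // through a full shuffle, so the serializer sees the whole input volume
    // instead of only the map-side-combined output of reduceByKey.
    object WordCountShuffleAll {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCountShuffleAll")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("hdfs:///data/wordcount/input")  // hypothetical path
        val words = lines.flatMap(_.split(" ")).map(w => (w, 1))

        // Force the full shuffle first (32 tasks, matching the thread's example),
        // then aggregate.
        val counts = words.repartition(32).reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///data/wordcount/output")   // hypothetical path
        sc.stop()
      }
    }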
Best Regards,
Raymond Liu

-----Original Message-----
From: Liu, Raymond [mailto:raymond....@intel.com]

For all the tasks, say 32 tasks in total.

Best Regards,
Raymond Liu

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]

Is this the serialization throughput per task or the serialization throughput for all the tasks?

On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond <raymond....@intel.com> wrote:
> Hi
>
> I am running a WordCount program which counts words from HDFS, and I noticed
> that the serializer part of the code takes a lot of CPU time. On a
> 16-core/32-thread node, the total throughput is around 50MB/s with
> JavaSerializer, and if I switch to KryoSerializer, it doubles to around
> 100-150MB/s. (I have 12 disks per node and the files are scattered across
> the disks, so HDFS bandwidth is not the problem.)
>
> I also noticed that, in this case, the object being written is (String, Int);
> if I try a case with (int, int), the throughput is a further 2-3x faster.
>
> So, in my WordCount case, the bottleneck is CPU (because with shuffle
> compression on, the 150MB/s data bandwidth on the input side usually leads
> to around 50MB/s of shuffle data).
>
> This serialization bandwidth looks somehow too low, so I am wondering: what
> bandwidth do you observe in your case? Does this throughput sound reasonable
> to you? If not, is there anything that might need to be examined in my case?
>
> Best Regards,
> Raymond Liu
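For completeness, a minimal sketch of the KryoSerializer switch mentioned in the quoted message; the property keys are standard Spark configuration names, while the app name is illustrative:

    import org.apache.spark.SparkConf

    // Sketch: switch the serializer from the default JavaSerializer to Kryo,
    // which the thread reports roughly doubled serialization throughput.
    val conf = new SparkConf()
      .setAppName("WordCountKryo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Shuffle compression as discussed above (this is the default setting).
      .set("spark.shuffle.compress", "true")
    // Optionally, frequently serialized classes can be registered with Kryo
    // (e.g. via spark.kryo.registrator) so full class names are not written
    // with every object.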