Hi

        I am running a WordCount program which count words from HDFS, and I 
noticed that the serializer part of code takes a lot of CPU time. On a 
16core/32thread node, the total throughput is around 50MB/s by JavaSerializer, 
and if I switching to KryoSerializer, it doubles to around 100-150MB/s. ( I 
have 12 disks per node and files scatter across disks, so HDFS BW is not a 
problem)

        And I also notice that, in this case, the object to write is (String, 
Int), if I try some case with (int, int), the throughput will be 2-3x faster 
further.

        So, in my Wordcount case, the bottleneck is CPU ( cause if with shuffle 
compress on, the 150MB/s data bandwidth in input side, will usually lead to 
around 50MB/s shuffle data)

        This serialize BW looks somehow too low , so I am wondering, what's BW 
you observe in your case? Does this throughput sounds reasonable to you? If 
not, anything might possible need to be examined in my case?



Best Regards,
Raymond Liu


Reply via email to