By the way, to be clear, I run repartition first to force all of the data through the shuffle, instead of running reduceByKey etc. directly (which reduces the amount of data that needs to be shuffled and serialized). So in this setup all ~50MB/s of data read from HDFS goes through the serializer. (In fact, I also tried generating the data in memory directly instead of reading it from HDFS, with a similar throughput result.)
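For reference, a minimal sketch of the kind of job described above, assuming Spark's Scala RDD API; the paths, app name, and partition count are illustrative, not taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: WordCount where repartition() forces every (String, Int) pair
    // through a full shuffle, so the serializer sees the whole input volume
    // instead of only the map-side-combined output of reduceByKey.
    object WordCountShuffleAll {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("WordCountShuffleAll")
        val sc = new SparkContext(conf)

        val lines = sc.textFile("hdfs:///data/wordcount/input")  // hypothetical path
        val words = lines.flatMap(_.split(" ")).map(w => (w, 1))

        // Force the full shuffle first (32 tasks, matching the thread's example),
        // then aggregate.
        val counts = words.repartition(32).reduceByKey(_ + _)

        counts.saveAsTextFile("hdfs:///data/wordcount/output")   // hypothetical path
        sc.stop()
      }
    }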
Best Regards,
Raymond Liu

-----Original Message-----
From: Liu, Raymond [mailto:raymond....@intel.com]

For all the tasks, say 32 tasks in total.

Best Regards,
Raymond Liu

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]

Is this the serialization throughput per task or the serialization throughput for all the tasks?

On Tue, Apr 29, 2014 at 9:34 PM, Liu, Raymond <raymond....@intel.com> wrote:
> Hi
>
> I am running a WordCount program which counts words from HDFS, and I noticed
> that the serializer part of the code takes a lot of CPU time. On a
> 16-core/32-thread node, the total throughput is around 50MB/s with
> JavaSerializer, and if I switch to KryoSerializer, it doubles to around
> 100-150MB/s. (I have 12 disks per node and the files are scattered across
> the disks, so HDFS bandwidth is not the problem.)
>
> I also noticed that, in this case, the object being written is (String, Int);
> if I try a case with (int, int), the throughput is a further 2-3x faster.
>
> So, in my WordCount case, the bottleneck is CPU (because with shuffle
> compression on, the 150MB/s data bandwidth on the input side usually leads
> to around 50MB/s of shuffle data).
>
> This serialization bandwidth looks somehow too low, so I am wondering: what
> bandwidth do you observe in your case? Does this throughput sound reasonable
> to you? If not, is there anything that might need to be examined in my case?
>
> Best Regards,
> Raymond Liu
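For completeness, a minimal sketch of the KryoSerializer switch mentioned in the quoted message; the property keys are standard Spark configuration names, while the app name is illustrative:

    import org.apache.spark.SparkConf

    // Sketch: switch the serializer from the default JavaSerializer to Kryo,
    // which the thread reports roughly doubled serialization throughput.
    val conf = new SparkConf()
      .setAppName("WordCountKryo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Shuffle compression as discussed above (this is the default setting).
      .set("spark.shuffle.compress", "true")
    // Optionally, frequently serialized classes can be registered with Kryo
    // (e.g. via spark.kryo.registrator) so full class names are not written
    // with every object.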