Hello, I am trying to solve a performance problem with my Spark job.
Here is the algorithm: RDD.flatMap(func(createSomeObject)).distinct.reduceByKey(...)... When the data set is big enough to create around 2000 tasks, by the time the job reaches the second stage (reduceByKey) the CPUs of ALL the machines spend most of their time waiting, and I can also see a constant flow of network traffic. I am using the ec2 script to deploy a cluster on Amazon.

The strange thing is that when I have fewer than 300 tasks, the reduceByKey stage is faster than the first stage (distinct), so I am trying to figure out why it becomes really slow (slower than the distinct stage) with a huge data set. Thanks a lot.
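In case it helps to show what I suspect is happening, here is a tiny pure-Scala sketch (no Spark needed to run it; the object and function names are my own) of why a key-based stage like reduceByKey moves data over the network during the shuffle, and how per-partition pre-combining reduces the number of records shuffled:

```scala
// Sketch: simulate the map-side combine that reduceByKey performs before
// shuffling. Fewer distinct keys per partition => fewer records sent over
// the network. With many tasks and a huge data set, the shuffle volume is
// what I suspect keeps the CPUs waiting on network I/O.
object ShuffleSketch {
  // Pre-aggregate one partition locally, like Spark's map-side combine.
  def combineLocally(part: Seq[(String, Int)]): Map[String, Int] =
    part.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    // Two fake partitions of (key, count) records.
    val partitions = Seq(
      Seq("a" -> 1, "b" -> 1, "a" -> 1),
      Seq("a" -> 1, "c" -> 1)
    )
    // Records shuffled with no combining: one per input record.
    val naive = partitions.map(_.size).sum
    // Records shuffled after local combining: one per key per partition.
    val combined = partitions.map(p => combineLocally(p).size).sum
    println(s"naive=$naive combined=$combined") // prints "naive=5 combined=4"
  }
}
```

On real data with skewed or high-cardinality keys, the gap between the two numbers is what determines how much the shuffle stresses the network.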
