Repartitioning won't save you from skewed data, unfortunately. The way Spark
works now, groupBy pulls all values for the same key into a single partition,
and Spark, AFAIK, keeps that mapping from key to values in memory.
You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid
this problem, since they combine values on the map side before the shuffle.
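To illustrate why the combine-before-shuffle operations tolerate skew better, here is a pure-Python sketch (not the real Spark API; the function name `reduce_by_key` and the sample partitions are mine) of the map-side combine idea: each partition collapses its own values per key first, so only one value per key per partition crosses the shuffle, instead of every raw record for a hot key landing in a single partition.

```python
def reduce_by_key(partitions, func):
    """Toy model of reduceByKey: combine locally per partition, then merge."""
    combined = []
    for part in partitions:
        # Map-side combine: merge this partition's values per key before
        # anything is "shuffled", so a hot key never forces all of its
        # raw records into one partition's memory at once.
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)
    # "Shuffle" + final merge: only one pre-combined value per key per
    # partition needs to move.
    result = {}
    for local in combined:
        for key, value in local.items():
            result[key] = func(result[key], value) if key in result else value
    return result

# Example: maximum per key across skewed partitions
parts = [[("a", 3), ("a", 9), ("b", 2)], [("a", 7), ("b", 5)]]
print(reduce_by_key(parts, max))  # {'a': 9, 'b': 5}
```

groupByKey has no equivalent local step, which is why the skewed key's entire value list must fit in one task.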
Hi,
I am trying to group data in Spark and find the maximum value for each group
of data. I have to use group by, as I need to transpose based on the values.
I tried repartitioning the data, increasing the number from 1 to 1. The job
runs until the stage below and then takes a long time to move ahead. I was ne