Repartitioning won't save you from skewed data, unfortunately. The way Spark
works now, groupBy pulls all values for the same key into a single partition,
and Spark, AFAIK, keeps that mapping from key to values in memory.
You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid
this problem, since they combine values on the map side before the shuffle.
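To illustrate why the combine-before-shuffle operations tolerate skew better, here is a pure-Python sketch (not the real Spark API; the function name `reduce_by_key` and the sample partitions are mine) of the map-side combine idea: each partition collapses its own values per key first, so only one value per key per partition crosses the shuffle, instead of every raw record for a hot key landing in a single partition.

```python
def reduce_by_key(partitions, func):
    """Toy model of reduceByKey: combine locally per partition, then merge."""
    combined = []
    for part in partitions:
        # Map-side combine: merge this partition's values per key before
        # anything is "shuffled", so a hot key never forces all of its
        # raw records into one partition's memory at once.
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)
    # "Shuffle" + final merge: only one pre-combined value per key per
    # partition needs to move.
    result = {}
    for local in combined:
        for key, value in local.items():
            result[key] = func(result[key], value) if key in result else value
    return result

# Example: maximum per key across skewed partitions
parts = [[("a", 3), ("a", 9), ("b", 2)], [("a", 7), ("b", 5)]]
print(reduce_by_key(parts, max))  # {'a': 9, 'b': 5}
```

groupByKey has no equivalent local step, which is why the skewed key's entire value list must fit in one task.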
Hi,
I am trying to group data in Spark and find the maximum value for each group
of data. I have to use group by, as I need to transpose based on the values.
I tried repartitioning the data, increasing the number from 1 to 1. The job
runs until the stage below and then takes a long time to move ahead. I was ne