Enrico,
The solution below works, but there is a small glitch.
It works fine in spark-shell but fails for skewed keys
when run via spark-submit.
Looking into the execution plan, the partitioning value is the same
for both repartition and groupByKey and is driven by the value
I believe I logged an issue first and should have received a response first.
I was ignored.
Regards
Did you know there are 8 million people in Kashmir locked up in their homes
by the Hindutva (Indians)
for 8 months?
Now the whole planet is locked up in their homes.
You didn't take notice of them either.
Abhinav,
you can repartition by your key, then sortWithinPartitions, and then
groupByKey. Since the data are already hash-partitioned by key, Spark should
not shuffle them again and hence not change the sort within each partition:
ds.repartition($"key").sortWithinPartitions($"code").groupBy($"key")
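To make the idea behind that one-liner concrete, here is a small plain-Scala sketch (no Spark dependency; the `Row` case class, the `sortedGroups` helper, and the partition count are illustrative inventions, not Spark API): rows are hash-partitioned by key, each partition is sorted locally, and because all rows of a given key land in the same partition, grouping afterwards preserves the per-key sort order without another shuffle.

```scala
object SecondarySortSketch {
  // Illustrative row shape matching the dataframe in the question
  case class Row(key: Int, code: String, codeValue: Int)

  // Mimics repartition($"key") + sortWithinPartitions($"code") + groupBy($"key"):
  // returns, per key, the codes in within-partition sort order.
  def sortedGroups(rows: Seq[Row], numPartitions: Int): Map[Int, Seq[String]] = {
    // "repartition($"key")": hash-partition rows by key
    val partitions: Map[Int, Seq[Row]] =
      rows.groupBy(r => math.abs(r.key.hashCode) % numPartitions)

    // "sortWithinPartitions($"code")": sort each partition locally
    val locallySorted: Seq[Row] =
      partitions.values.toSeq.flatMap(_.sortBy(_.code))

    // "groupBy($"key")": a key never straddles partitions, so the
    // local sort order survives the grouping
    locallySorted.groupBy(_.key).map { case (k, rs) => k -> rs.map(_.code) }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1, "c2", 12), Row(1, "c1", 11), Row(1, "c2", 9), Row(2, "c3", 7))
    sortedGroups(rows, 4).toSeq.sortBy(_._1).foreach { case (k, codes) =>
      println(s"$k -> ${codes.mkString(",")}")
    }
  }
}
```

Note that this also hints at the skew issue reported above: with hash partitioning, every row of a hot key goes to a single partition, so one skewed key means one oversized partition.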
Enrico
Hi,
I have a dataframe which has data like:
key | code | code_value
1 | c1 | 11
1 | c2 | 12
1 | c2 | 9
1 | c3