Hi everyone, I am trying to push by understanding of bucketing vs sorting. I hope I can get som clarification from this list.
Bucketing as I've come to understand it is primarily intended for when preparing the dataframe for join operations. Where the goal is to get data that will be joined together in the same partition, to make the joins faster. Sorting on the other hand is simply for when I want my data sorted, nothing strange there I guess. The effect of using bucketing, as I see it, would be the same as sorting if I'm not doing any joining and I use enough buckets, like in the following example program. Where the sorting or bucketing would replace the `?()` transformation. ```pseudo code df = spark.read.parquet("s3://...") // df contains the columns A, B, and C df2 = df.distinct().?().repartition(num_desired_partitions) df2.write.parquet("s3://,,,") ``` Is my understanding correct or am I missing something? Is there a performance consideration between sorting and bucketing that I need to keep in mind? The end goal for me here is not that the data as such is sorted on the A column, it's that each distinct value of A is kept together with all other rows which have the same value in A. If all rows with the same A value cannot fit within one partitions, then I accept that there's more than one partitions with the same value in the A column. If there's room left in the partitions, then it would be fine for rows with another value of A to fill up the partition. I would like something as depicted below ```desireable example -- Partition 1 A|B|C ===== 2|?|? 2|?|? 2|?|? 2|?|? 2|?|? 2|?|? 2|?|? -- Partition 2 A|B|C ===== 2|?|? 0|?|? 0|?|? 0|?|? 1|?|? ``` What I don't want is something like below ```undesireable example -- Partition 1 A|B|C ===== 0|?|? 0|?|? 1|?|? 0|?|? 1|?|? 2|?|? 1|?|? -- Partition 2 A|B|C ===== 0|?|? 0|?|? 0|?|? 1|?|? 2|?|? ``` Where the A value varies. Patrik Iselind