Question on bucketing vs sorting

Patrik Iselind Thu, 31 Dec 2020 05:39:28 -0800

Hi everyone,

I am trying to push by understanding of bucketing vs sorting. I hope I can
get som clarification from this list.


Bucketing as I've come to understand it is primarily intended for when
preparing the dataframe for join operations. Where the goal is to get data
that will be joined together in the same partition, to make the joins
faster.

Sorting on the other hand is simply for when I want my data sorted, nothing
strange there I guess.

The effect of using bucketing, as I see it, would be the same as sorting if
I'm not doing any joining and I use enough buckets, like in the following
example program. Where the sorting or bucketing would replace the `?()`
transformation.

```pseudo code
df = spark.read.parquet("s3://...")
// df contains the columns A, B, and C
df2 = df.distinct().?().repartition(num_desired_partitions)
df2.write.parquet("s3://,,,")
```

Is my understanding correct or am I missing something?

Is there a performance consideration between sorting and bucketing that I
need to keep in mind?

The end goal for me here is not that the data as such is sorted on the A
column, it's that each  distinct value of A is kept together with all other
rows which have the same value in A. If all rows with the same A value
cannot fit within one partitions, then I accept that there's more than one
partitions with the same value in the A column. If there's room left in the
partitions, then it would be fine for rows with another value of A to fill
up the partition.

I would like something as depicted below
```desireable example
-- Partition 1
A|B|C
=====
2|?|?
2|?|?
2|?|?
2|?|?
2|?|?
2|?|?
2|?|?
-- Partition 2
A|B|C
=====
2|?|?
0|?|?
0|?|?
0|?|?
1|?|?
```

What I don't want is something like below

```undesireable example
-- Partition 1
A|B|C
=====
0|?|?
0|?|?
1|?|?
0|?|?
1|?|?
2|?|?
1|?|?
-- Partition 2
A|B|C
=====
0|?|?
0|?|?
0|?|?
1|?|?
2|?|?
```
Where the A value varies.

Patrik Iselind

Question on bucketing vs sorting

Reply via email to