So there are Hive partitions (at-rest partitioning) versus Spark
partitioning; make sure you're not confusing the two. Bucketing is worth
using if the cardinality of the column you want to bucket by isn't too high,
you don't have data skew with respect to the buckets, each bucket holds at
least 256MB-1GB of data, and your total data size is large enough, not a few
MBs. A bucket doesn't necessarily translate to one Spark partition: if you
have a couple of gigabytes per bucket (at rest), Spark will create many
partitions per bucket. So whether you sort or bucket, the same value won't
necessarily end up in the same partition unless you force it by using a
specific number of partitions (the same as your number of buckets), which may
not be a good idea if you have too much data per bucket.
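
For illustration, a minimal PySpark sketch of a bucketed write (the table
name, key column, and bucket count are assumptions, not from this thread):

```python
# Hypothetical example: bucket an existing DataFrame `df` by a join key,
# choosing the bucket count so each bucket holds roughly 256MB-1GB at rest.
(df.write
    .bucketBy(64, "user_id")          # hash rows into 64 buckets by "user_id"
    .sortBy("user_id")                # optionally sort within each bucket file
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("events_bucketed"))  # bucketed writes need saveAsTable(), not save()
```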


On Thu, Dec 31, 2020 at 10:21 AM Patrik Iselind <patrik....@gmail.com>
wrote:

> Thank you Peyman for clarifying this for me.
> Would you say there's a case for using bucketing here at all, or should I
> simply focus on the sorting solution? And in general, when would you say
> bucketing is the preferred solution?
>
> Patrik Iselind
>
>
> On Thu, Dec 31, 2020 at 4:15 PM Peyman Mohajerian <mohaj...@gmail.com>
> wrote:
>
>> You can save your data to HDFS or other targets using either a sorted or a
>> bucketed dataframe. In the case of bucketing you get a different
>> data-skipping mechanism when you read the data back, compared to the
>> sorted version.
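>>
>> As a rough sketch of the two write paths (paths and table names are made
>> up for illustration):
>>
>> ```python
>> # Sorted write: order rows by "A" so each Parquet file covers a narrow
>> # range of A, letting min/max column statistics skip files on read.
>> df.sort("A").write.mode("overwrite").parquet("hdfs:///data/sorted_by_a")
>>
>> # Bucketed write: hash rows on "A" into a fixed number of buckets; reads
>> # through the table can prune buckets and avoid shuffles on "A".
>> df.write.bucketBy(32, "A").mode("overwrite").saveAsTable("bucketed_by_a")
>> ```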
>>
>> On Thu, Dec 31, 2020 at 5:40 AM Patrik Iselind <patrik....@gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I am trying to push my understanding of bucketing vs sorting. I hope I
>>> can get some clarification from this list.
>>>
>>> Bucketing, as I've come to understand it, is primarily intended for
>>> preparing the dataframe for join operations, where the goal is to get
>>> data that will be joined together into the same partition, to make the
>>> joins faster.
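>>>
>>> A minimal sketch of what I mean (the table and column names are made up):
>>>
>>> ```python
>>> # If both sides are bucketed by the join key with the same bucket count,
>>> # Spark should be able to join them without shuffling either side.
>>> left = spark.table("orders_bucketed")
>>> right = spark.table("customers_bucketed")
>>> joined = left.join(right, "customer_id")
>>> joined.explain()  # ideally no Exchange on the bucketed sides
>>> ```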
>>>
>>> Sorting on the other hand is simply for when I want my data sorted,
>>> nothing strange there I guess.
>>>
>>> The effect of using bucketing, as I see it, would be the same as sorting
>>> if I'm not doing any joining and I use enough buckets, as in the
>>> following example program, where the sorting or bucketing would replace
>>> the `?()` transformation.
>>>
>>> ```pseudo code
>>> df = spark.read.parquet("s3://...")
>>> # df contains the columns A, B, and C
>>> df2 = df.distinct().?().repartition(num_desired_partitions)
>>> df2.write.parquet("s3://...")
>>> ```
>>>
>>> Is my understanding correct or am I missing something?
>>>
>>> Is there a performance consideration between sorting and bucketing that
>>> I need to keep in mind?
>>>
>>> The end goal for me here is not that the data as such is sorted on the A
>>> column; it's that each distinct value of A is kept together with all
>>> other rows that have the same value in A. If all rows with the same A
>>> value cannot fit within one partition, then I accept that there's more
>>> than one partition with the same value in the A column. If there's room
>>> left in a partition, then it would be fine for rows with another value of
>>> A to fill up the partition.
>>>
>>> I would like something as depicted below
>>> ```desirable example
>>> -- Partition 1
>>> A|B|C
>>> =====
>>> 2|?|?
>>> 2|?|?
>>> 2|?|?
>>> 2|?|?
>>> 2|?|?
>>> 2|?|?
>>> 2|?|?
>>> -- Partition 2
>>> A|B|C
>>> =====
>>> 2|?|?
>>> 0|?|?
>>> 0|?|?
>>> 0|?|?
>>> 1|?|?
>>> ```
>>>
>>> What I don't want is something like below
>>>
>>> ```undesirable example
>>> -- Partition 1
>>> A|B|C
>>> =====
>>> 0|?|?
>>> 0|?|?
>>> 1|?|?
>>> 0|?|?
>>> 1|?|?
>>> 2|?|?
>>> 1|?|?
>>> -- Partition 2
>>> A|B|C
>>> =====
>>> 0|?|?
>>> 0|?|?
>>> 0|?|?
>>> 1|?|?
>>> 2|?|?
>>> ```
>>> That is, where the A value keeps alternating within each partition.
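>>>
>>> For reference, one substitution for `?()` I've been considering (just an
>>> assumption on my part, not something from the docs):
>>>
>>> ```python
>>> # Hash-repartition by A: every row with the same A lands in the same
>>> # partition, and several A values can share a partition.
>>> df2 = df.distinct().repartition(num_desired_partitions, "A")
>>> df2.write.parquet("s3://...")
>>> # Caveat: hash partitioning never splits one A value across partitions,
>>> # so a very frequent A value can make one partition large (skew).
>>> ```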
>>>
>>> Patrik Iselind
>>>
>>
