Re: Multi-level partitions question

Boris Tyukin Thu, 11 Oct 2018 13:31:22 -0700

Hi Dan,

thanks for replying so quickly. Unfortunately, customer_id per shop is
indeed very skewed (some shops have way more customers than others) and I
was tempted to do  hash(shop_id, customer_id) but because 40% of queries
won't use customer_id, it would cause a full scan. Actually to be accurate,
then we join tables in Impala, first join might be limited on shop_id,
while other tables will be joined by shop_id and customer_id. Of course,
that first table will be very large (billions of rows).


Just to clarify, are you saying that partition by hash(shop_id),
hash(customer_id) and partition by hash(customer_id), hash(shop_id) are
totally equivalent in terms of having tablets being skewed and there is no
difference at all?

I thought it would be more like with Hive/Impala HDFS, then nested
partitions are grouped under different parent partitions. Does Kudu store
them as a bunch of independent files instead and each file will have data
for the specific hash of shop_id/customer_id?

Boris

On Thu, Oct 11, 2018 at 4:05 PM Dan Burkert <[email protected]> wrote:

> Hi Boris,
>
> The two examples you gave are exactly equivalent; the relative ordering of
> hash levels has no effect on query performance, hotspotting, or anything
> else.  Given that 60% of your queries don't specify a specific customer_id,
> it does make sense to use hash(shop_id), hash(customer_id) instead of
> combining them in a single hash level as hash(shop_id, customer_id),
> however the trade-off is that the hotspotting resistance isn't as good.  If
> the shop_id and customer_id columns aren't skewed to begin with that's not
> a concern, though.
>
> - Dan
>
> On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <[email protected]>
> wrote:
>
>> Hi guys,
>> Read this doc
>> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning
>> and I have a question on this particular statement
>> "Scans on multilevel partitioned tables can take advantage of partition
>> pruning on any of the levels independently"
>>
>> Does it mean, that both strategies below would be equivalent in terms of
>> performance (i.e. minimum scans)
>>
>> partition by hash(shop_id), hash(customer_id)
>> vs.
>> partition by hash(customer_id), hash(shop_id)
>>
>> 60% of the queries are using both shop_id and customer_id but 40% of
>> queries need to pull all customers for a specific shop_id. And almost never
>> by customer_id alone (customer_id is not unique across shops and is
>> assigned per shop).
>>
>> At the same time, if I partition by customer_id first,  partitions will
>> be distributed more evenly.
>>
>> Thanks!
>> Boris
>>
>>
>>
>>

Re: Multi-level partitions question

Reply via email to