Re: Multi-level partitions question

Boris Thu, 11 Oct 2018 15:59:06 -0700

Got it, appreciate your help, Dan

On Thu, Oct 11, 2018, 18:25 Dan Burkert <[email protected]> wrote:


> > Just to clarify, are you saying that partition by hash(shop_id),
> hash(customer_id) and partition by hash(customer_id), hash(shop_id) are
> totally equivalent in terms of having tablets being skewed and there is no
> difference at all?
>
> Yes, that's correct, there is no difference at all in any metric,
> including skew, read, or write performance.
>
> > I thought it would be more like with Hive/Impala HDFS, then nested
> partitions are grouped under different parent partitions. Does Kudu store
> them as a bunch of independent files instead and each file will have data
> for the specific hash of shop_id/customer_id?
>
> That's correct, there is no parent/child relationship.  The only minor
> clarification is that the individual hash pairs are a tablet, not a file.
>
> - Dan
>
> On Thu, Oct 11, 2018 at 1:31 PM Boris Tyukin <[email protected]>
> wrote:
>
>> Hi Dan,
>>
>> thanks for replying so quickly. Unfortunately, customer_id per shop is
>> indeed very skewed (some shops have way more customers than others) and I
>> was tempted to do  hash(shop_id, customer_id) but because 40% of queries
>> won't use customer_id, it would cause a full scan. Actually to be accurate,
>> then we join tables in Impala, first join might be limited on shop_id,
>> while other tables will be joined by shop_id and customer_id. Of course,
>> that first table will be very large (billions of rows).
>>
>> Just to clarify, are you saying that partition by hash(shop_id),
>> hash(customer_id) and partition by hash(customer_id), hash(shop_id) are
>> totally equivalent in terms of having tablets being skewed and there is no
>> difference at all?
>>
>> I thought it would be more like with Hive/Impala HDFS, then nested
>> partitions are grouped under different parent partitions. Does Kudu store
>> them as a bunch of independent files instead and each file will have data
>> for the specific hash of shop_id/customer_id?
>>
>> Boris
>>
>> On Thu, Oct 11, 2018 at 4:05 PM Dan Burkert <[email protected]>
>> wrote:
>>
>>> Hi Boris,
>>>
>>> The two examples you gave are exactly equivalent; the relative ordering
>>> of hash levels has no effect on query performance, hotspotting, or anything
>>> else.  Given that 60% of your queries don't specify a specific customer_id,
>>> it does make sense to use hash(shop_id), hash(customer_id) instead of
>>> combining them in a single hash level as hash(shop_id, customer_id),
>>> however the trade-off is that the hotspotting resistance isn't as good.  If
>>> the shop_id and customer_id columns aren't skewed to begin with that's not
>>> a concern, though.
>>>
>>> - Dan
>>>
>>> On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <[email protected]>
>>> wrote:
>>>
>>>> Hi guys,
>>>> Read this doc
>>>> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning
>>>> and I have a question on this particular statement
>>>> "Scans on multilevel partitioned tables can take advantage of partition
>>>> pruning on any of the levels independently"
>>>>
>>>> Does it mean, that both strategies below would be equivalent in terms
>>>> of performance (i.e. minimum scans)
>>>>
>>>> partition by hash(shop_id), hash(customer_id)
>>>> vs.
>>>> partition by hash(customer_id), hash(shop_id)
>>>>
>>>> 60% of the queries are using both shop_id and customer_id but 40% of
>>>> queries need to pull all customers for a specific shop_id. And almost never
>>>> by customer_id alone (customer_id is not unique across shops and is
>>>> assigned per shop).
>>>>
>>>> At the same time, if I partition by customer_id first,  partitions will
>>>> be distributed more evenly.
>>>>
>>>> Thanks!
>>>> Boris
>>>>
>>>>
>>>>
>>>>

Re: Multi-level partitions question

Reply via email to