One thing I'll add - you should consider trying out the following schema:

CREATE TABLE orders (
shop_id bigint,
customer_id bigint,
...,
PRIMARY KEY (shop_id, customer_id))
PARTITION BY HASH (shop_id, customer_id) PARTITIONS <n>;

The thinking there is you will get great skew resistance, the only way
there could be skew among hash buckets is if a particular customer at a
particular shop had an outsized number of orders.  In terms of scan
performance, when the query has equality on both shop_id and customer_id
all tablets can be pruned except the one containing the data.  When the
query has just equality on the shop_id there will be no partition pruning,
however the scans on every tablet should be quite fast because they can
take advantage of the PK to only scan the data related to the specific
shop.  Depending on your circumstances this could be ideal; if individual
shops have a lot of data you would actually want it to be parallelized
across many tablets, and therefore be able to take advantage of many
tservers to perform the scan.

- Dan

On Thu, Oct 11, 2018 at 3:25 PM Dan Burkert <d...@cloudera.com> wrote:

> > Just to clarify, are you saying that partition by hash(shop_id),
> hash(customer_id) and partition by hash(customer_id), hash(shop_id) are
> totally equivalent in terms of having tablets being skewed and there is no
> difference at all?
>
> Yes, that's correct, there is no difference at all in any metric,
> including skew, read, or write performance.
>
> > I thought it would be more like with Hive/Impala HDFS, then nested
> partitions are grouped under different parent partitions. Does Kudu store
> them as a bunch of independent files instead and each file will have data
> for the specific hash of shop_id/customer_id?
>
> That's correct, there is no parent/child relationship.  The only minor
> clarification is that the individual hash pairs are a tablet, not a file.
>
> - Dan
>
> On Thu, Oct 11, 2018 at 1:31 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi Dan,
>>
>> thanks for replying so quickly. Unfortunately, customer_id per shop is
>> indeed very skewed (some shops have way more customers than others) and I
>> was tempted to do  hash(shop_id, customer_id) but because 40% of queries
>> won't use customer_id, it would cause a full scan. Actually to be accurate,
>> then we join tables in Impala, first join might be limited on shop_id,
>> while other tables will be joined by shop_id and customer_id. Of course,
>> that first table will be very large (billions of rows).
>>
>> Just to clarify, are you saying that partition by hash(shop_id),
>> hash(customer_id) and partition by hash(customer_id), hash(shop_id) are
>> totally equivalent in terms of having tablets being skewed and there is no
>> difference at all?
>>
>> I thought it would be more like with Hive/Impala HDFS, then nested
>> partitions are grouped under different parent partitions. Does Kudu store
>> them as a bunch of independent files instead and each file will have data
>> for the specific hash of shop_id/customer_id?
>>
>> Boris
>>
>> On Thu, Oct 11, 2018 at 4:05 PM Dan Burkert <danburk...@apache.org>
>> wrote:
>>
>>> Hi Boris,
>>>
>>> The two examples you gave are exactly equivalent; the relative ordering
>>> of hash levels has no effect on query performance, hotspotting, or anything
>>> else.  Given that 60% of your queries don't specify a specific customer_id,
>>> it does make sense to use hash(shop_id), hash(customer_id) instead of
>>> combining them in a single hash level as hash(shop_id, customer_id),
>>> however the trade-off is that the hotspotting resistance isn't as good.  If
>>> the shop_id and customer_id columns aren't skewed to begin with that's not
>>> a concern, though.
>>>
>>> - Dan
>>>
>>> On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <bo...@boristyukin.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>> Read this doc
>>>> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning
>>>> and I have a question on this particular statement
>>>> "Scans on multilevel partitioned tables can take advantage of partition
>>>> pruning on any of the levels independently"
>>>>
>>>> Does it mean, that both strategies below would be equivalent in terms
>>>> of performance (i.e. minimum scans)
>>>>
>>>> partition by hash(shop_id), hash(customer_id)
>>>> vs.
>>>> partition by hash(customer_id), hash(shop_id)
>>>>
>>>> 60% of the queries are using both shop_id and customer_id but 40% of
>>>> queries need to pull all customers for a specific shop_id. And almost never
>>>> by customer_id alone (customer_id is not unique across shops and is
>>>> assigned per shop).
>>>>
>>>> At the same time, if I partition by customer_id first,  partitions will
>>>> be distributed more evenly.
>>>>
>>>> Thanks!
>>>> Boris
>>>>
>>>>
>>>>
>>>>

Reply via email to