Got it, appreciate your help, Dan On Thu, Oct 11, 2018, 18:25 Dan Burkert <[email protected]> wrote:
> > Just to clarify, are you saying that partition by hash(shop_id), > hash(customer_id) and partition by hash(customer_id), hash(shop_id) are > totally equivalent in terms of having tablets being skewed and there is no > difference at all? > > Yes, that's correct, there is no difference at all in any metric, > including skew, read, or write performance. > > > I thought it would be more like with Hive/Impala HDFS, then nested > partitions are grouped under different parent partitions. Does Kudu store > them as a bunch of independent files instead and each file will have data > for the specific hash of shop_id/customer_id? > > That's correct, there is no parent/child relationship. The only minor > clarification is that the individual hash pairs are a tablet, not a file. > > - Dan > > On Thu, Oct 11, 2018 at 1:31 PM Boris Tyukin <[email protected]> > wrote: > >> Hi Dan, >> >> thanks for replying so quickly. Unfortunately, customer_id per shop is >> indeed very skewed (some shops have way more customers than others) and I >> was tempted to do hash(shop_id, customer_id) but because 40% of queries >> won't use customer_id, it would cause a full scan. Actually to be accurate, >> then we join tables in Impala, first join might be limited on shop_id, >> while other tables will be joined by shop_id and customer_id. Of course, >> that first table will be very large (billions of rows). >> >> Just to clarify, are you saying that partition by hash(shop_id), >> hash(customer_id) and partition by hash(customer_id), hash(shop_id) are >> totally equivalent in terms of having tablets being skewed and there is no >> difference at all? >> >> I thought it would be more like with Hive/Impala HDFS, then nested >> partitions are grouped under different parent partitions. Does Kudu store >> them as a bunch of independent files instead and each file will have data >> for the specific hash of shop_id/customer_id? >> >> Boris >> >> On Thu, Oct 11, 2018 at 4:05 PM Dan Burkert <[email protected]> >> wrote: >> >>> Hi Boris, >>> >>> The two examples you gave are exactly equivalent; the relative ordering >>> of hash levels has no effect on query performance, hotspotting, or anything >>> else. Given that 60% of your queries don't specify a specific customer_id, >>> it does make sense to use hash(shop_id), hash(customer_id) instead of >>> combining them in a single hash level as hash(shop_id, customer_id), >>> however the trade-off is that the hotspotting resistance isn't as good. If >>> the shop_id and customer_id columns aren't skewed to begin with that's not >>> a concern, though. >>> >>> - Dan >>> >>> On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <[email protected]> >>> wrote: >>> >>>> Hi guys, >>>> Read this doc >>>> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning >>>> and I have a question on this particular statement >>>> "Scans on multilevel partitioned tables can take advantage of partition >>>> pruning on any of the levels independently" >>>> >>>> Does it mean, that both strategies below would be equivalent in terms >>>> of performance (i.e. minimum scans) >>>> >>>> partition by hash(shop_id), hash(customer_id) >>>> vs. >>>> partition by hash(customer_id), hash(shop_id) >>>> >>>> 60% of the queries are using both shop_id and customer_id but 40% of >>>> queries need to pull all customers for a specific shop_id. And almost never >>>> by customer_id alone (customer_id is not unique across shops and is >>>> assigned per shop). >>>> >>>> At the same time, if I partition by customer_id first, partitions will >>>> be distributed more evenly. >>>> >>>> Thanks! >>>> Boris >>>> >>>> >>>> >>>>
