Hi Dan, thanks for replying so quickly. Unfortunately, customer_id per shop is indeed very skewed (some shops have way more customers than others) and I was tempted to do hash(shop_id, customer_id) but because 40% of queries won't use customer_id, it would cause a full scan. Actually to be accurate, then we join tables in Impala, first join might be limited on shop_id, while other tables will be joined by shop_id and customer_id. Of course, that first table will be very large (billions of rows).
Just to clarify, are you saying that partition by hash(shop_id), hash(customer_id) and partition by hash(customer_id), hash(shop_id) are totally equivalent in terms of having tablets being skewed and there is no difference at all? I thought it would be more like with Hive/Impala HDFS, then nested partitions are grouped under different parent partitions. Does Kudu store them as a bunch of independent files instead and each file will have data for the specific hash of shop_id/customer_id? Boris On Thu, Oct 11, 2018 at 4:05 PM Dan Burkert <[email protected]> wrote: > Hi Boris, > > The two examples you gave are exactly equivalent; the relative ordering of > hash levels has no effect on query performance, hotspotting, or anything > else. Given that 60% of your queries don't specify a specific customer_id, > it does make sense to use hash(shop_id), hash(customer_id) instead of > combining them in a single hash level as hash(shop_id, customer_id), > however the trade-off is that the hotspotting resistance isn't as good. If > the shop_id and customer_id columns aren't skewed to begin with that's not > a concern, though. > > - Dan > > On Thu, Oct 11, 2018 at 12:14 PM Boris Tyukin <[email protected]> > wrote: > >> Hi guys, >> Read this doc >> https://kudu.apache.org/docs/schema_design.html#multilevel-partitioning >> and I have a question on this particular statement >> "Scans on multilevel partitioned tables can take advantage of partition >> pruning on any of the levels independently" >> >> Does it mean, that both strategies below would be equivalent in terms of >> performance (i.e. minimum scans) >> >> partition by hash(shop_id), hash(customer_id) >> vs. >> partition by hash(customer_id), hash(shop_id) >> >> 60% of the queries are using both shop_id and customer_id but 40% of >> queries need to pull all customers for a specific shop_id. And almost never >> by customer_id alone (customer_id is not unique across shops and is >> assigned per shop). >> >> At the same time, if I partition by customer_id first, partitions will >> be distributed more evenly. >> >> Thanks! >> Boris >> >> >> >>
