Cliff, I think I get the idea - you basically use a bunch of temporary servers that can be evicted at any time, each with attached temporary NVMe drives - very clever! I am surprised it works well with Kudu, since our experience with our on-prem cluster is that Kudu is extremely sensitive to disk failures and random reboots. At one point we had a situation where 2 out of 3 replicas were down for a while; it took us quite a bit of time to recover, and we had to use the kudu CLI to evict some tablet replicas manually.
Also, manual rebalancing is often needed, and it takes a lot of time (many hours).

On Sat, Apr 25, 2020 at 2:54 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> Cliff, I would be extremely interested to see a blog post comparing
> Snowflake, Redshift, and Impala/Kudu, since you have tried all of them.
>
> I would also love to get some details on how you set up your Kudu/Impala
> cluster on AWS, as my company might be heading in the same direction. This
> does not mean much to me yet, as we are not using the cloud, but I hope you
> can elaborate on your setup: "12 hash x 12 (month) ranges in two 12-node
> clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's
> replication we can use "for free" instance-store NVMe. We also have
> associated compute-oriented stateless Impala Spot Fleet clusters for HLL
> and other compute-oriented queries."
>
> I cannot really find anything on the web that compares Impala/Kudu to
> Snowflake and Redshift. Everything I see is about Snowflake, Redshift, and
> BigQuery.
>
> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cre...@gmail.com> wrote:
>
>> This is a good conversation, but I don't think the comparison with
>> Snowflake is a fair one, at least against an older version of Snowflake.
>> (In my last job, about 5 years ago, I pretty much single-handedly scale
>> tested Snowflake in exchange for a sweetheart pricing deal.) Though
>> Snowflake is closed source, it seems pretty clear the architectures are
>> quite different. Snowflake has no primary key index and no UPSERT
>> capability, features that make Kudu valuable for some use cases.
>>
>> It also seems to me that their intended workloads are quite different.
>> Snowflake is great for intensive analytics on demand and can handle
>> deeply nested data very well, where Kudu can't handle that at all.
>> Snowflake is designed not for heavy concurrency, but for complex query
>> plans from a small group of users.
>> If you select an x-large Snowflake cluster, it's probably because you
>> have a large amount of data to churn through, not because you have a
>> large number of users. Or at least that's how we used it.
>>
>> At my current workplace we use Kudu/Impala to handle about 30-60
>> concurrent queries. I agree that getting very fussy about partitioning
>> can be a pain, but for the large fact tables we generally use a simple
>> strategy of twelves: 12 hash x 12 (month) ranges in two 12-node clusters
>> fronted by a load balancer. We're in AWS, and thanks to Kudu's
>> replication we can use "for free" instance-store NVMe. We also have
>> associated compute-oriented stateless Impala Spot Fleet clusters for HLL
>> and other compute-oriented queries.
>>
>> The net result blows away what we had with Redshift at less than 1/3 the
>> cost, with performance improvements mostly from better concurrency
>> handling. This is despite the fact that Redshift has a built-in cache. We
>> also use streaming ingestion which, aside from being impossible with
>> Redshift, removes the added cost of staging.
>>
>> Getting back to Snowflake, there's no way we could use it the same way
>> we use Kudu, and even if we could, the cost would probably put us out of
>> business!
>>
>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> Thanks, Andrew, for taking the time to respond to me. It seems that
>>> there are no exact recommendations.
>>>
>>> I did look at the scaling recommendations, but that math is extremely
>>> complicated, and I do not think anyone will know all the answers to plug
>>> into that calculation. We have no real control over what users are
>>> doing: how many queries they run, how many are hot vs. cold, etc. It is
>>> not realistic IMHO to expect that knowledge of user query patterns.
>>> I do like the Snowflake approach, where the engine takes care of
>>> defaults, can estimate the number of micro-partitions, and can even
>>> repartition tables as they grow. I feel Kudu has the same capabilities,
>>> as the design is very similar. I really do not like picking a random
>>> number of buckets. Also, we manage 100s of tables; I cannot look at each
>>> one individually to make these decisions. Does that make sense?
>>>
>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>
>>>> Hey Boris,
>>>>
>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>> general can be tricky; thanks for the question. I left some thoughts
>>>> below:
>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>> enough in my example above, should I go with 3, 9, or 18 tablets? Or
>>>>> should I pick tablets to be closer to 1Gb?
>>>>> And a follow-up question: if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition, or still break them into
>>>>> smaller tablets for concurrency?
>>>>
>>>> Per your numbers, this confirms that partitions are the unit of
>>>> concurrency here, and that having more, smaller partitions therefore
>>>> yields a concurrency bump. That said, extending a scheme of smaller
>>>> partitions across all tables may not scale when thinking about the
>>>> total number of partitions cluster-wide.
>>>>
>>>> There are some trade-offs with the replica count per tablet server here
>>>> -- generally, each tablet replica has a resource cost on tablet
>>>> servers: WALs and tablet-related metadata use a shared disk (if you can
>>>> put this on an SSD, I would recommend doing so), each tablet introduces
>>>> some Raft-related RPC traffic, each tablet replica introduces some
>>>> maintenance operations into the pool of background operations to be
>>>> run, etc.
>>>> Your point about scan concurrency is certainly a valid one -- there
>>>> have been patches for other integrations that have tackled this by
>>>> decoupling partitioning from scan concurrency (KUDU-2437
>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are examples, where
>>>> Kudu's Spark integration will split range scans into smaller-scoped
>>>> scan tokens to be run concurrently, though this optimization hasn't
>>>> made its way into Impala yet). I filed KUDU-3071
>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>> think is left on the Kudu side to get this up and running, so that it
>>>> can be worked into Impala.
>>>>
>>>> For now, I would try to take into account the total sum of resources
>>>> you have available to Kudu (including the number of tablet servers, the
>>>> amount of storage per node, the number of disks per tablet server, and
>>>> the type of disk used for the WAL/metadata) to settle on roughly how
>>>> many tablet replicas your system can handle (the scaling guide
>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
>>>> and hopefully that, along with your own SLAs per table, can help guide
>>>> how you partition your tables.
>>>>
>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>
>>>> Aiming for 1Gb seems a bit low; Kudu should be able to handle the low
>>>> tens of GB per tablet replica, though exact performance obviously
>>>> depends on your workload. As you show, and as pointed out in the
>>>> documentation, larger and fewer tablets can limit the amount of
>>>> concurrency for writes and reads, though we've seen that multiple GBs
>>>> work relatively well for many use cases while weighing the
>>>> above-mentioned trade-offs with replica count.
>>>>> It is recommended that new tables which are expected to have heavy
>>>>> read and write workloads have at least as many tablets as tablet
>>>>> servers.
>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>
>>>> The recommendation here is to have at least 20 logical partitions per
>>>> table. That way, if a scan were to touch a table's entire keyspace, the
>>>> table scan would be broken up into 20 tablet scans, and each of those
>>>> might land on a different tablet server running on isolated hardware.
>>>> For a significantly larger table into which you expect highly
>>>> concurrent workloads, the recommendation serves as a lower bound -- I'd
>>>> recommend having more partitions, and if your data is naturally
>>>> time-oriented, consider range-partitioning on timestamp.
>>>>
>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> just saw this in the docs, but it is still a confusing statement:
>>>>>
>>>>> No Default Partitioning
>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>> tables. It is recommended that new tables which are expected to have
>>>>> heavy read and write workloads have at least as many tablets as tablet
>>>>> servers.
>>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>>
>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> hey guys,
>>>>>>
>>>>>> I asked the same question on Slack and got no responses. I just went
>>>>>> through the docs, the design doc, and the FAQ, and still did not find
>>>>>> an answer.
>>>>>>
>>>>>> Can someone comment?
>>>>>>
>>>>>> Maybe I was not asking a clear question.
>>>>>> If my cluster is large enough in my example above, should I go with
>>>>>> 3, 9, or 18 tablets? Or should I pick tablets to be closer to 1Gb?
>>>>>>
>>>>>> And a follow-up question: if I have tons of smaller tables under 5
>>>>>> million rows, should I just use 1 partition, or still break them into
>>>>>> smaller tablets for concurrency?
>>>>>>
>>>>>> We cannot pick the partitioning strategy for each table, as we need
>>>>>> to stream 100s of tables; we use the PK from the RDBMS and need to
>>>>>> come up with an automated way to pick the number of
>>>>>> partitions/tablets. So far I was using the 1Gb rule, but I am
>>>>>> rethinking this now for another project.
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I forgot to post the results of my quick test:
>>>>>>>
>>>>>>> Kudu 1.5
>>>>>>>
>>>>>>> The table takes 18Gb of disk space after 3x replication.
>>>>>>>
>>>>>>> Tablets | Tablet Size | Query run time, sec
>>>>>>> 3       | 2Gb         | 65
>>>>>>> 9       | 700Mb       | 27
>>>>>>> 18      | 350Mb       | 17
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I just want to clarify the recommendations from the doc. It says:
>>>>>>>>
>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>
>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>
>>>>>>>> - For large tables, such as fact tables, aim for as many tablets
>>>>>>>> as you have cores in the cluster.
>>>>>>>>
>>>>>>>> - For small tables, such as dimension tables, ensure that each
>>>>>>>> tablet is at least 1 GB in size.
>>>>>>>> In general, be mindful that the number of tablets limits the
>>>>>>>> parallelism of reads in the current implementation. Increasing the
>>>>>>>> number of tablets significantly beyond the number of cores is
>>>>>>>> likely to have diminishing returns.
>>>>>>>>
>>>>>>>> I've read this a few times, but I am not sure I understand it
>>>>>>>> correctly. Let me use a concrete example.
>>>>>>>>
>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>> replication it is ~6Gb in a single tablet if I do not partition),
>>>>>>>> should I aim for 1Gb tablets (6 tablets before replication), or
>>>>>>>> should I aim for 500Mb tablets if my cluster capacity allows it
>>>>>>>> (12 tablets before replication)? I am confused why they say "at
>>>>>>>> least" and not "at most" - does it mean I should design it so a
>>>>>>>> tablet takes 2Gb or 3Gb in this example?
>>>>>>>>
>>>>>>>> Assume that I have tons of CPU cores on the cluster...
>>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>>> more tablets/partitions... In this example, 18 tablets gave me the
>>>>>>>> best timing, but the tablet size was around 300-400Mb. But the doc
>>>>>>>> says "at least 1Gb".
>>>>>>>>
>>>>>>>> I am really confused about what the doc is saying; please help.
>>>>>>>>
>>>>>>>> Boris
>>>>
>>>> --
>>>> Andrew Wong
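To pull the thread's rules of thumb together, a minimal Python sketch of the kind of automated partition-count heuristic Boris is asking for might look like the following. Everything here is illustrative: the function name and the thresholds (a ~4Gb target per tablet, based on Andrew's "multiple GBs works well" comment, and a 1Gb floor from the docs) are assumptions, not anything from Kudu's API or official guidance.

```python
import math

def pick_tablet_count(logical_size_gb, n_tablet_servers,
                      target_tablet_gb=4, min_tablet_gb=1):
    """Illustrative heuristic combining the thread's advice:
    - hot tables: at least as many tablets as tablet servers
    - aim for low-single-digit Gb per tablet (pre-replication)
    - small tables: never split below ~1 Gb per tablet
    """
    # Tablets needed to keep each one near the target size.
    by_size = math.ceil(logical_size_gb / target_tablet_gb)
    # Concurrency floor: roughly one tablet per tablet server.
    wanted = max(n_tablet_servers, by_size)
    # Cap so small tables are not shredded into tiny tablets.
    cap = max(1, math.floor(logical_size_gb / min_tablet_gb))
    return min(wanted, cap)

# Boris's example: 18Gb on disk after 3x replication => ~6Gb logical data.
print(pick_tablet_count(6, 20))    # 6: the 1Gb floor caps a small table
print(pick_tablet_count(500, 20))  # 125: a large fact table gets ~4Gb tablets

# Cliff's "strategy of twelves": 12 hash buckets x 12 monthly ranges.
tablets = 12 * 12          # 144 tablets per table
replicas = tablets * 3     # 432 replicas with 3x replication
print(replicas / 12)       # 36.0 replicas per node in a 12-node cluster
```

Note how the cap reproduces the tension discussed above: for small tables the "at least 1 GB per tablet" rule wins over "one tablet per tablet server", which matches the behavior Boris observed only up to the point where tablets become tiny.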