Cliff, i would be extremely interested to see a blog post to compare Snowflake, Redshift and Impala/Kudu since you tried all of them.
would love to get some details how you set up Kudu/Impala cluster on AWS as well as my company might be heading the same direction. this does not mean much to me as we are not using cloud but I hope you can elaborate on your setup "12 hash x a12 (month) ranges in two 12-node clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's replication we can use "for free" instance-store NVMe. We also have associated compute-oriented stateless Impala Spot Fleet clusters for HLL and other compute oriented queries. " I cannot really find anything on the web that would compare Impala/Kudu to Snowflake and Redshift. Everything I see is about Snowflake, Redshift and BigQuery. On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <[email protected]> wrote: > This is a good conversation but I don't think the comparison with > Snowflake is a fair one, at least from an older version of Snowflake (In my > last job, about 5 years ago, I pretty much single-handedly scale tested > Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is > closed source, it seems pretty clear the architectures are quite different. > Snowflake has no primary key index, no UPSERT capability, features that > make Kudu valuable for some use cases. > > It also seems to me that their intended workloads are quite different. > Snowflake is great for intensive analytics on demand, and can handle deeply > nested data very well, where Kudu can't handle that at all. Snowflake is > not designed for heavy concurrency, but complex query plans for a small > group of users. If you select an x-large Snowflake cluster it's probably > because you have a large amount of data to churn through, not because you > have a large number of users. Or, at least that's how we used it. > > At my current workplace we use Kudu/Impala to handle about 30-60 > concurrent queries. I agree that getting very fussy about partitioning can > be a pain, but for the large fact tables we generally use a simple strategy > of twelves: 12 hash x a12 (month) ranges in two 12-node clusters fronted > by a load balancer. We're in AWS, and thanks to Kudu's replication we can > use "for free" instance-store NVMe. We also have associated > compute-oriented stateless Impala Spot Fleet clusters for HLL and other > compute oriented queries. > > The net result blows away what we had with RedShift at less than 1/3 the > cost, with performance improvements mostly from better concurrency > handling. This is despite the fact that RedShift has built-in cache. We > also use streaming ingestion which, aside from being impossible with > RedShift, removes the added cost of staging. > > Getting back to Snowflake, there's no way we could use it the same way we > use Kudu, and even if we could, the cost would would probably put us out of > business! > > On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <[email protected]> wrote: > >> thanks Andrew for taking your time responding to me. It seems that there >> are no exact recommendations. >> >> I did look at scaling recommendations but that math is extremely >> complicated and I do not think anyone will know all the answers to plug >> into that calculation. We have no control really what users are doing, how >> many queries they run, how many are hot vs cold etc. It is not realistic >> IMHO to expect that knowledge of user query patterns. >> >> I do like the Snowflake approach than the engine takes care of defaults >> and can estimate the number of micro-partitions and even repartition tables >> as they grow. I feel Kudu has the same capabilities as the design is very >> similar. I really do not like to pick random number of buckets. Also we >> manager 100s of tables, I cannot look at them each one by one to make these >> decisions. Does it make sense? >> >> >> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <[email protected]> wrote: >> >>> Hey Boris, >>> >>> Sorry you didn't have much luck on Slack. I know partitioning in general >>> can be tricky; thanks for the question. Left some thoughts below: >>> >>> Maybe I was not asking a clear question. If my cluster is large enough >>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick >>>> tablets to be closer to 1Gb? >>>> And a follow-up question, if I have tons of smaller tables under 5 >>>> million rows, should I just use 1 partition or still break them on smaller >>>> tablets for concurrency? >>> >>> >>> Per your numbers, this confirms that the partitions are the units of >>> concurrency here, and that therefore having more and having smaller >>> partitions yields a concurrency bump. That said, extending a scheme of >>> smaller partitions across all tables may not scale when thinking about the >>> total number of partitions cluster-wide. >>> >>> There are some trade offs with the replica count per tablet server here >>> -- generally, each tablet replica has a resource cost on tablet servers: >>> WALs and tablet-related metadata use a shared disk (if you can put this on >>> an SSD, I would recommend doing so), each tablet introduces some >>> Raft-related RPC traffic, each tablet replica introduces some maintenance >>> operations in the pool of background operations to be run, etc. >>> >>> Your point about scan concurrency is certainly a valid one -- there have >>> been patches for other integrations that have tackled this to decouple >>> partitioning from scan concurrency (KUDU-2437 >>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670 >>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where >>> Kudu's Spark integration will split range scans into smaller-scoped scan >>> tokens to be run concurrently, though this optimization hasn't made its way >>> into Impala yet). I filed KUDU-3071 >>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think >>> is left on the Kudu-side to get this up and running, so that it can be >>> worked into Impala. >>> >>> For now, I would try to take into account the total sum of resources you >>> have available to Kudu (including number of tablet servers, amount of >>> storage per node, number of disks per tablet server, type of disk for the >>> WAL/metadata disks), to settle on roughly how many tablet replicas your >>> system can handle (the scaling guide >>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here), >>> and hopefully that, along with your own SLAs per table, can help guide how >>> you partition your tables. >>> >>> confused why they say "at least" not "at most" - does it mean I should >>>> design it so a tablet takes 2Gb or 3Gb in this example? >>> >>> >>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low >>> tens of GB per tablet replica, though exact perf obviously depends on your >>> workload. As you show and as pointed out in documentation, larger and fewer >>> tablets can limit the amount of concurrency for writes and reads, though >>> we've seen multiple GBs works relatively well for many use cases while >>> weighing the above mentioned tradeoffs with replica count. >>> >>> It is recommended that new tables which are expected to have heavy read >>>> and write workloads have at least as many tablets as tablet servers. >>> >>> if I have 20 tablet servers and I have two tables - one with 1MM rows >>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both >>>> (divide by 3 because of replication)? >>> >>> >>> The recommendation here is to have at least 20 logical partitions per >>> table. That way, a scan were to touch a table's entire keyspace, the table >>> scan would be broken up into 20 tablet scans, and each of those might land >>> on a different tablet server running on isolated hardware. For a >>> significantly larger table into which you expect highly concurrent >>> workloads, the recommendation serves as a lower bound -- I'd recommend >>> having more partitions, and if your data is naturally time-oriented, >>> consider range-partitioning on timestamp. >>> >>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <[email protected]> >>> wrote: >>> >>>> just saw this in the docs but it is still confusing statement >>>> No Default Partitioning >>>> Kudu does not provide a default partitioning strategy when creating >>>> tables. It is recommended that new tables which are expected to have heavy >>>> read and write workloads have at least as many tablets as tablet servers. >>>> >>>> >>>> if I have 20 tablet servers and I have two tables - one with 1MM rows >>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both >>>> (divide by 3 because of replication)? >>>> >>>> >>>> >>>> >>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <[email protected]> >>>> wrote: >>>> >>>>> hey guys, >>>>> >>>>> I asked the same question on Slack on and got no responses. I just >>>>> went through the docs and design doc and FAQ and still did not find an >>>>> answer. >>>>> >>>>> Can someone comment? >>>>> >>>>> Maybe I was not asking a clear question. If my cluster is large enough >>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick >>>>> tablets to be closer to 1Gb? >>>>> >>>>> And a follow-up question, if I have tons of smaller tables under 5 >>>>> million rows, should I just use 1 partition or still break them on smaller >>>>> tablets for concurrency? >>>>> >>>>> We cannot pick the partitioning strategy for each table as we need to >>>>> stream 100s of tables and we use PK from RBDMS and need to come with an >>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb >>>>> rule but rethinking this now for another project. >>>>> >>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <[email protected]> >>>>> wrote: >>>>> >>>>>> forgot to post results of my quick test: >>>>>> >>>>>> Kudu 1.5 >>>>>> >>>>>> Table takes 18Gb of disk space after 3x replication >>>>>> >>>>>> Tablets Tablet Size Query run time, sec >>>>>> 3 2Gb 65 >>>>>> 9 700Mb 27 >>>>>> 18 350Mb 17 >>>>>> [image: image.png] >>>>>> >>>>>> >>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi guys, >>>>>>> >>>>>>> just want to clarify recommendations from the doc. It says: >>>>>>> >>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb >>>>>>> >>>>>>> Partitioning Rules of Thumb >>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb> >>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb> >>>>>>> >>>>>>> - >>>>>>> >>>>>>> For large tables, such as fact tables, aim for as many tablets >>>>>>> as you have cores in the cluster. >>>>>>> - >>>>>>> >>>>>>> For small tables, such as dimension tables, ensure that each >>>>>>> tablet is at least 1 GB in size. >>>>>>> >>>>>>> In general, be mindful the number of tablets limits the parallelism >>>>>>> of reads, in the current implementation. Increasing the number of >>>>>>> tablets >>>>>>> significantly beyond the number of cores is likely to have diminishing >>>>>>> returns. >>>>>>> >>>>>>> >>>>>>> I've read this a few times but I am not sure I understand it >>>>>>> correctly. Let me use concrete example. >>>>>>> >>>>>>> If a table ends up taking 18Gb after replication (so with 3x >>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim >>>>>>> for >>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb >>>>>>> tablets if my cluster capacity allows so (12 tablets before >>>>>>> replication)? >>>>>>> confused why they say "at least" not "at most" - does it mean I should >>>>>>> design it so a tablet takes 2Gb or 3Gb in this example? >>>>>>> >>>>>>> Assume that I have tons of CPU cores on a cluster... >>>>>>> Based on my quick test, it seems that queries are faster if I have >>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best >>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least >>>>>>> 1Gb". >>>>>>> >>>>>>> Really confused what the doc is saying, please help >>>>>>> >>>>>>> Boris >>>>>>> >>>>>>> >>> >>> -- >>> Andrew Wong >>> >>
