very nice, Grant! this sounds really promising!

On Fri, Mar 13, 2020 at 9:52 AM Grant Henke <ghe...@cloudera.com> wrote:
>> Actually, I have a much longer wish list on the Impala side than Kudu - it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with multi-tenancy and complex queries that Hive would run like a champ.
>
> There is an experimental Hive integration you could try. You may need to compile it yourself, depending on what version of Hive you are using. Feel free to reach out on Slack if you have issues. The details of the integration are here:
> https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
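>
> For a rough idea of what that looks like, here is a minimal sketch along the lines of the wiki page - the Hive table name, Kudu table name, and master address below are made up, so adjust them for your environment:
>
>   CREATE EXTERNAL TABLE kudu_events
>   STORED BY 'org.apache.hadoop.hive.kudu.KuduStorageHandler'
>   TBLPROPERTIES (
>     "kudu.table_name" = "impala::default.events",
>     "kudu.master_addresses" = "kudu-master-1:7051"
>   );
>
>   -- once mapped, Hive can query the Kudu table like any other:
>   SELECT count(*) FROM kudu_events;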
>
> On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu, and I do not think Snowflake deserves that much credit and noise. But they do have great ideas that take their engine to the next level. I do not know how it works architecture-wise, but here is the best doc I found regarding micro-partitions:
>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>
>> It is also appealing to me that they support nested data types with ease (and they are still very efficient to query/filter by), and that they scale workloads easily (Kudu requires manual rebalancing, and it is VERY slow, as I learned last weekend by doing it on our dev cluster). Also, they support very large blobs and text fields that Kudu does not.
>>
>> Actually, I have a much longer wish list on the Impala side than Kudu - it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with multi-tenancy and complex queries that Hive would run like a champ.
>>
>> As for Kudu's compaction process - it was largely an issue with 1.5 and I believe it was addressed in 1.9. We had a terrible time for a few weeks when frequent updates/deletes slowed all our queries to a crawl. A lot of smaller tables had a huge number of tiny rowsets, causing massive scans and freezing the entire Kudu cluster. We had a good discussion on Slack, and Todd Lipcon suggested a good workaround using the flush_threshold_secs flag until we move to 1.9; it worked fine. Nowhere in the documentation was it suggested to set this flag, and in fact it was one of those "use it at your own risk" flags.
>>
>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <a...@cloudera.com> wrote:
>>
>>> This has been an excellent discussion to follow, with very useful feedback. Thank you for that.
>>>
>>> Boris, if I can try to summarize your position, it's that manual partitioning doesn't scale when dealing with hundreds of (small) tables and when you don't control the PK of each table. The Kudu schema design guide would advise you to hash partition such tables, and, following Andrew's recommendation, have at least one partition per tserver. Except that given enough tables, you'll eventually overwhelm any cluster based on the sheer number of partitions per tserver. And if you reduce the number of partitions per table, you open yourself up to potential hotspotting if the per-table load isn't even. Is that correct?
>>>
>>> I can think of several improvements we could make here:
>>> 1. Support splitting/merging of range partitions, first manually and later automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both are complicated, and since we've gotten a lot of mileage out of static range/hash partitioning, there hasn't been much traction on those JIRAs.
>>> 2. Improve server efficiency when hosting large numbers of replicas, so that you can blanket your cluster with maximally hash-partitioned tables. We did some work here a couple of years ago (see KUDU-1967) but there's clearly much more that we can do. Of note, we previously discussed implementing Multi-Raft (https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031).
>>> 3. Add support for consistent hashing. I'm not super familiar with the details, but YugabyteDB (based on Kudu) has seen success here. Their recent blog post on sharding is a good (and related) read: https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>
>>> Since you're clearly more familiar with Snowflake than I am, can you comment more on how their partitioning system works? It sounds like it's quite automated; is that ever problematic when it decides to repartition during a workload? The YugabyteDB blog post talks about heuristics "reacting" to hotspotting by load balancing, and concludes that hotspots can move around too quickly for a load balancer to keep up.
>>>
>>> Separately, you mentioned having to manage Kudu's compaction process. Could you go into more detail here?
>>>
>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>
>>>> thanks Cliff, this is really good info. I am tempted to do the benchmarks myself but need to find a sponsor :) Snowflake has been getting A LOT of traction lately, and based on a few conversations in Slack, it looks like the Kudu team is not tracking the competition that much, and I think there are awesome features in Snowflake that would be great to have in Kudu.
>>>>
>>>> I think they do support upserts now, as well as back-in-time queries - they do not actually upsert/delete records, they just create new micro-partitions, and it is a metadata operation - that's how they can see data as of a given time. The idea of a virtual warehouse is also very appealing, especially in healthcare, as we often need to share data with other departments or partners.
>>>>
>>>> do not get me wrong, I decided to use Kudu for the same reasons and I really did not have a choice on CDH other than HBase, but I get questioned lately why we should not just give up CDH and go all in on the Cloud with Cloud-native tech like Redshift and Snowflake. It is getting harder to explain why :)
>>>>
>>>> My biggest gripe with Kudu, besides the known limitations, is the management of partitions and the compaction process. For our largest tables, we just max out the number of tablets Kudu allows per cluster. Smaller tables, though, are a challenge and more like an art, while I feel there should be an automated process with good defaults like in other data storage systems. No one expects to assign partitions manually in Snowflake, BigQuery or Redshift. No one is expected to tweak compaction parameters and deal with performance that degrades quickly over time on tables with a high number of deletes/updates. We are happy with performance, but it is not all about performance.
>>>> We also struggle on the Impala side, as it feels it is limiting what we can do with Kudu, and it feels like the Impala team is always behind. pyspark is another problem - the Kudu pyspark client is lagging behind the Java client and is very difficult to install/deploy. Some of the API is not even available in pyspark.
>>>>
>>>> And honestly, I do not like that Impala is the only viable option with Kudu. A lot of the time we are forced to do heavy ETL in Hive, which is like a tank, but we cannot do that anymore with Kudu tables, and we struggle with disk spilling issues, Impala choosing bad explain plans and forcing us to use the straight_join hint many times, etc.
>>>>
>>>> Sorry if this post came across a bit on the negative side; I really like Kudu and the dev team behind it rocks. But I do not like where the industry is going and why Kudu is not getting the traction it deserves.
>>>>
>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cre...@gmail.com> wrote:
>>>>
>>>>> This is a good conversation, but I don't think the comparison with Snowflake is a fair one, at least from an older version of Snowflake (in my last job, about 5 years ago, I pretty much single-handedly scale tested Snowflake in exchange for a sweetheart pricing deal). Though Snowflake is closed source, it seems pretty clear the architectures are quite different. Snowflake has no primary key index and no UPSERT capability, features that make Kudu valuable for some use cases.
>>>>>
>>>>> It also seems to me that their intended workloads are quite different. Snowflake is great for intensive analytics on demand, and can handle deeply nested data very well, where Kudu can't handle that at all. Snowflake is not designed for heavy concurrency, but for complex query plans for a small group of users. If you select an x-large Snowflake cluster, it's probably because you have a large amount of data to churn through, not because you have a large number of users. Or, at least, that's how we used it.
>>>>>
>>>>> At my current workplace we use Kudu/Impala to handle about 30-60 concurrent queries. I agree that getting very fussy about partitioning can be a pain, but for the large fact tables we generally use a simple strategy of twelves: 12 hash buckets x 12 (month) ranges in two 12-node clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's replication we can use "for free" instance-store NVMe. We also have associated compute-oriented stateless Impala Spot Fleet clusters for HLL and other compute-oriented queries.
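>>>>>
>>>>> If it helps to picture that layout, a rough Impala DDL sketch is below (assuming Impala is already pointed at your Kudu masters). The table and column names are made up, and the integer month key is just to keep the example short - in practice the range column might be your event timestamp, and you would size the buckets and ranges for your own data:
>>>>>
>>>>>   CREATE TABLE fact_events (
>>>>>     event_id BIGINT,
>>>>>     event_month INT,    -- e.g. 202001 .. 202012
>>>>>     payload STRING,
>>>>>     PRIMARY KEY (event_id, event_month)
>>>>>   )
>>>>>   PARTITION BY HASH (event_id) PARTITIONS 12,
>>>>>     RANGE (event_month) (
>>>>>       PARTITION 202001 <= VALUES < 202002,
>>>>>       PARTITION 202002 <= VALUES < 202003
>>>>>       -- ...and so on, one range partition per month, 12 in total
>>>>>     )
>>>>>   STORED AS KUDU;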
>>>>> The net result blows away what we had with RedShift, at less than 1/3 the cost, with performance improvements coming mostly from better concurrency handling. This is despite the fact that RedShift has a built-in cache. We also use streaming ingestion which, aside from being impossible with RedShift, removes the added cost of staging.
>>>>>
>>>>> Getting back to Snowflake, there's no way we could use it the same way we use Kudu, and even if we could, the cost would probably put us out of business!
>>>>>
>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>
>>>>>> thanks Andrew for taking your time responding to me. It seems that there are no exact recommendations.
>>>>>>
>>>>>> I did look at the scaling recommendations, but that math is extremely complicated and I do not think anyone will know all the answers to plug into that calculation. We have no control, really, over what users are doing, how many queries they run, how many are hot vs cold, etc. It is not realistic IMHO to expect that knowledge of user query patterns.
>>>>>>
>>>>>> I do like the Snowflake approach, where the engine takes care of defaults and can estimate the number of micro-partitions and even repartition tables as they grow. I feel Kudu has the same capabilities, as the design is very similar. I really do not like to pick a random number of buckets. Also, we manage 100s of tables; I cannot look at each of them one by one to make these decisions. Does it make sense?
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>>>>
>>>>>>> Hey Boris,
>>>>>>>
>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>
>>>>>>>> Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?
>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them into smaller tablets for concurrency?
>>>>>>>
>>>>>>> Per your numbers, this confirms that the partitions are the units of concurrency here, and that therefore having more and smaller partitions yields a concurrency bump. That said, extending a scheme of smaller partitions across all tables may not scale when thinking about the total number of partitions cluster-wide.
>>>>>>>
>>>>>>> There are some trade-offs with the replica count per tablet server here -- generally, each tablet replica has a resource cost on tablet servers: WALs and tablet-related metadata use a shared disk (if you can put this on an SSD, I would recommend doing so), each tablet introduces some Raft-related RPC traffic, each tablet replica introduces some maintenance operations in the pool of background operations to be run, etc.
>>>>>>>
>>>>>>> Your point about scan concurrency is certainly a valid one -- there have been patches for other integrations that have tackled this to decouple partitioning from scan concurrency (KUDU-2437 <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670 <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where Kudu's Spark integration will split range scans into smaller-scoped scan tokens to be run concurrently, though this optimization hasn't made its way into Impala yet). I filed KUDU-3071 <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think is left on the Kudu side to get this up and running, so that it can be worked into Impala.
>>>>>>>
>>>>>>> For now, I would try to take into account the total sum of resources you have available to Kudu (including the number of tablet servers, amount of storage per node, number of disks per tablet server, and type of disk for the WAL/metadata disks) to settle on roughly how many tablet replicas your system can handle (the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here), and hopefully that, along with your own SLAs per table, can help guide how you partition your tables.
>>>>>>>
>>>>>>>> confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low tens of GB per tablet replica, though exact performance obviously depends on your workload. As you show, and as pointed out in the documentation, larger and fewer tablets can limit the amount of concurrency for writes and reads, though we've seen multiple GBs work relatively well for many use cases while weighing the above-mentioned trade-offs with replica count.
>>>>>>>
>>>>>>>> It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?
>>>>>>>
>>>>>>> The recommendation here is to have at least 20 logical partitions per table. That way, if a scan were to touch a table's entire keyspace, the table scan would be broken up into 20 tablet scans, and each of those might land on a different tablet server running on isolated hardware. For a significantly larger table into which you expect highly concurrent workloads, the recommendation serves as a lower bound -- I'd recommend having more partitions, and if your data is naturally time-oriented, consider range-partitioning on timestamp.
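>>>>>>>
>>>>>>> To make that concrete, a hypothetical Impala sketch for your 20-tablet-server case (table and column names invented) might look like the following: the ~1MM-row table gets one hash bucket per tablet server, and the ~100MM-row, heavily accessed table treats 20 as a lower bound and goes higher (and could add a RANGE dimension on a timestamp as mentioned above). Note the partition counts are logical, not divided by the replication factor:
>>>>>>>
>>>>>>>   -- small table: at least as many partitions as tablet servers
>>>>>>>   CREATE TABLE dim_small (
>>>>>>>     id BIGINT,
>>>>>>>     name STRING,
>>>>>>>     PRIMARY KEY (id)
>>>>>>>   )
>>>>>>>   PARTITION BY HASH (id) PARTITIONS 20
>>>>>>>   STORED AS KUDU;
>>>>>>>
>>>>>>>   -- large, hot table: more partitions than the lower bound
>>>>>>>   CREATE TABLE fact_large (
>>>>>>>     id BIGINT,
>>>>>>>     amount DOUBLE,
>>>>>>>     PRIMARY KEY (id)
>>>>>>>   )
>>>>>>>   PARTITION BY HASH (id) PARTITIONS 40
>>>>>>>   STORED AS KUDU;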
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>
>>>>>>>> just saw this in the docs, but it is still a confusing statement:
>>>>>>>>
>>>>>>>> No Default Partitioning
>>>>>>>> Kudu does not provide a default partitioning strategy when creating tables. It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>
>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?
>>>>>>>>
>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>>
>>>>>>>>> hey guys,
>>>>>>>>>
>>>>>>>>> I asked the same question on Slack and got no responses. I just went through the docs, the design doc and the FAQ and still did not find an answer.
>>>>>>>>>
>>>>>>>>> Can someone comment?
>>>>>>>>>
>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?
>>>>>>>>>
>>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them into smaller tablets for concurrency?
>>>>>>>>>
>>>>>>>>> We cannot pick the partitioning strategy for each table individually, as we need to stream 100s of tables, we use the PK from the RDBMS, and we need to come up with an automated way to pick the number of partitions/tablets. So far I was using the 1Gb rule but am rethinking this now for another project.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>
>>>>>>>>>> Kudu 1.5
>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>
>>>>>>>>>> Tablets | Tablet Size | Query run time, sec
>>>>>>>>>> 3       | 2Gb         | 65
>>>>>>>>>> 9       | 700Mb       | 27
>>>>>>>>>> 18      | 350Mb       | 17
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> just want to clarify the recommendations from the doc. It says:
>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>
>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>> - For large tables, such as fact tables, aim for as many tablets as you have cores in the cluster.
>>>>>>>>>>> - For small tables, such as dimension tables, ensure that each tablet is at least 1 GB in size.
>>>>>>>>>>> In general, be mindful the number of tablets limits the parallelism of reads, in the current implementation. Increasing the number of tablets significantly beyond the number of cores is likely to have diminishing returns.
>>>>>>>>>>>
>>>>>>>>>>> I've read this a few times but I am not sure I understand it correctly. Let me use a concrete example.
>>>>>>>>>>>
>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x replication it is ~9Gb per tablet if I do not partition), should I aim for 1Gb tablets (6 tablets before replication) or should I aim for 500Mb tablets if my cluster capacity allows (12 tablets before replication)? confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>
>>>>>>>>>>> Assume that I have tons of CPU cores on the cluster... Based on my quick test, it seems that queries are faster if I have more tablets/partitions... In this example, 18 tablets gave me the best timing, but the tablet size was around 300-400Mb. But the doc says "at least 1Gb".
>>>>>>>>>>> Really confused what the doc is saying, please help.
>>>>>>>>>>>
>>>>>>>>>>> Boris
>>>>>>>
>>>>>>> --
>>>>>>> Andrew Wong
>
> --
> Grant Henke
> Software Engineer | Cloudera
> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke