I look forward to this enhancement, Tim! Thanks for sharing.

On Mon, Mar 16, 2020 at 7:54 PM Tim Armstrong <tarmstr...@cloudera.com> wrote:
> I think your hardware situation hurts not only for the number of tablets but also for Kudu + Impala. Impala afaik will only use one core per host per query, so is a poor fit for large complex queries on vertical hardware.

This is basically true as of current releases of Impala, but I'm working on addressing this. It's been possible to set mt_dop per-query on queries without joins or table sinks for a long time now (the limitations make it of limited use). Rough join and table sink support is behind a hidden flag in 3.3 and 3.4. I've been working on making it performant with joins (and doing all the requisite testing), which should land in master very soon and, if things go remotely to plan, be a fully supported option in the next release after 3.4. We saw huge speedups on a lot of queries (like 10x or more). Some queries didn't benefit much, if they were limited by the scan perf (including if the runtime filters pushed into the scans were filtering most data before joins).

On Mon, Mar 16, 2020 at 2:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:

appreciate your thoughts, Cliff

On Mon, Mar 16, 2020 at 11:18 AM Cliff Resnick <cre...@gmail.com> wrote:

Boris,

I think the crux of the problem is that "real-time analytics" over deeply nested storage does not really exist. I'll qualify that statement with "real-time" meaning streaming per-record ingestion, and "analytics" as columnar query access. The closest thing I know of today is Google BigQuery or Snowflake, but those are actually micro-batch ingestion done well, not per-record like Kudu. The only real-time analytics solution I know of that has a modicum of nesting is Druid, with a single level of "GroupBy" dimensions that get flattened into virtual rows as part of its non-SQL rollup API. Columnar storage and real-time are probably always going to be a difficult pairing; in fact, Kudu's "real-time" storage format is row-based, and Druid requires a whole lambda-architecture batch compaction. As we all know, nothing is for free, whether the trade-off for nested data be column-shredding into something like Kudu as Adar suggested, or micro-batching to Parquet, or some combination of both. I won't even go into how nested analytics are handled on the SQL/query side because it gets weird there, too.

I think your hardware situation hurts not only for the number of tablets but also for Kudu + Impala. Impala afaik will only use one core per host per query, so is a poor fit for large complex queries on vertical hardware.

In conclusion, I don't know how far you've gone into your investigations, but I can say that if your needs are to support "power users" at a premium, then something like Snowflake is great for your problem space. But if analytics are also more centrally integrated in pipelines, then Parquet is hard to beat for the price and flexibility, as is Kudu for dashboards or other intelligence that leverages upsert/key semantics. Ultimately, like many of us, you may find that you'll need all of the above.

Cliff

On Sun, Mar 15, 2020 at 1:11 PM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Adar for your comments and thinking this through! I really think Kudu has tons of potential and am very happy we've made a decision to use it for our real-time pipeline.
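For reference, the mt_dop option Tim describes above is an ordinary Impala query option, so it can be enabled per query from a client session. A minimal sketch with impyla is below; the host, port and table name are placeholders, and whether mt_dop actually takes effect for a given query still depends on the Impala version and on the join/table-sink limitations Tim mentions:

    from impala.dbapi import connect

    # Placeholder coordinator host/port; adjust for your cluster.
    conn = connect(host='impala-coordinator.example.com', port=21050)
    cur = conn.cursor()

    # MT_DOP is a per-query option; SET applies it for the rest of this session.
    cur.execute('SET MT_DOP=8')
    cur.execute('SELECT count(*) FROM my_kudu_table')
    print(cur.fetchall())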
you asked about use cases for support of nested and large binary/text objects. I work for a large healthcare organization and there is a lot going on with the newer HL7 FHIR standard. FHIR documents are highly nested JSON objects that might get as big as a few GBs in size. We could not use Kudu for storing FHIR bundles/documents due to the size and the inability to store FHIR objects natively. We ended up using Hive/Spark for that, but that solution is not real-time. And it is not only about storing, but also about fast search in those FHIR documents. I know we could use HBase for that, but HBase is really bad when it comes to analytics-type workloads.

As for large text/binaries, the good and common example is narrative notes (physician notes, progress notes, etc.). They can be in the form of RTF or PDF documents or images.

The 300-column limit was not a big deal so far, but we have high-density nodes with two 44-core CPUs and twelve 12-TB drives, and we cannot use their full power due to Kudu's limits on the number of tablets per node. This is more concerning than the 300-column limit.

On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <a...@cloudera.com> wrote:

Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
1. Both are created transparently and automatically as data is inserted.
2. Both may overlap with one another based on the sort column (Snowflake lets you choose the sort column; in Kudu this is always the PK).
3. Both are pruned at scan time if the scan's predicates allow it.
4. In Kudu, a tablet with a large clustering depth (known as "average rowset height") will be a likely compaction candidate. It's not clear to me if Snowflake has a process to automatically compact their micro-partitions.

I don't understand how this solves the partitioning problem though: the doc doesn't explain how micro-partitions are mapped to nodes, or if/when micro-partitions are load balanced across the cluster. I'd be curious to learn more about this.

Below you'll find some more info on the other issues you raised:
- Nested data type support is tracked in KUDU-1261. We've had an on-again/off-again relationship with it: it's a significant amount of work to implement, so we're hesitant to commit unless there's also significant need. Many users have been successful in "shredding" their nested types across many Kudu columns, which is one of the reasons we're actively working on increasing the number of supported columns above 300.
- We're also working on auto-rebalancing right now; see KUDU-2780 for more details.
- Large cell support is tracked in KUDU-2874, but there doesn't seem to have been as much interest there.
- The compaction issue you're describing sounds like KUDU-1400, which was indeed fixed in 1.9. This was our bad; for quite some time we were focused on optimizing for heavy workloads and didn't pay attention to Kudu performance in "trickling" scenarios, especially not over a long period of time. That's also why we didn't think to advertise --flush_threshold_secs, or even to change its default value (which is still 2 minutes: far too short if we hadn't fixed KUDU-1400); we just weren't very aware of how widespread of a problem it was.
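To make the "shredding" idea above concrete: the gist is to flatten each nested document into one wide row whose column names encode the path. The sketch below is purely illustrative Python; the helper and the trimmed FHIR-like record are invented for the example, and real FHIR resources with repeated elements usually also need child tables keyed by the parent's primary key:

    def shred(doc, prefix='', sep='_'):
        """Flatten a nested dict/list into a single-level dict of column -> value."""
        cols = {}
        if isinstance(doc, dict):
            for key, value in doc.items():
                cols.update(shred(value, prefix + key + sep, sep))
        elif isinstance(doc, list):
            for i, value in enumerate(doc):
                cols.update(shred(value, prefix + str(i) + sep, sep))
        else:
            cols[prefix.rstrip(sep)] = doc
        return cols

    # A heavily trimmed FHIR-like resource, flattened into Kudu-style columns.
    row = shred({'resourceType': 'Patient',
                 'name': [{'family': 'Smith', 'given': ['Ann']}],
                 'birthDate': '1980-01-01'})
    # -> {'resourceType': 'Patient', 'name_0_family': 'Smith',
    #     'name_0_given_0': 'Ann', 'birthDate': '1980-01-01'}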
On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Adar, I am a big fan of Kudu and I see a lot of potential in it, and I do not think Snowflake deserves that much credit and noise. But they do have great ideas that take their engine to the next level. I do not know, architecture-wise, how it works, but here is the best doc I found regarding micro-partitions:

https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html

It is also appealing to me that they support nested data types with ease (and they are still very efficient to query/filter by), and they scale workloads easily (Kudu requires manual rebalancing, and it is VERY slow, as I learned last weekend by doing it on our dev cluster). Also, they support very large blobs and text fields, which Kudu does not.

Actually, I have a much longer wish list on the Impala side than on the Kudu side: it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with multi-tenancy and complex queries that Hive would run like a champ.

As for Kudu's compaction process: it was largely an issue with 1.5, and I believe it was addressed in 1.9. We had a terrible time for a few weeks when frequent updates/deletes brought almost all our queries to a crawl. A lot of smaller tables had a huge number of tiny rowsets, causing massive scans and freezing the entire Kudu cluster. We had a good discussion on Slack, and Todd Lipcon suggested a good workaround using flush_threshold_secs until we moved to 1.9, and it worked fine. Nowhere in the documentation was it suggested to set this flag, and actually it was one of those "use it at your own risk" flags.

On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <a...@cloudera.com> wrote:

This has been an excellent discussion to follow, with very useful feedback. Thank you for that.

Boris, if I can try to summarize your position, it's that manual partitioning doesn't scale when dealing with hundreds of (small) tables and when you don't control the PK of each table. The Kudu schema design guide would advise you to hash partition such tables and, following Andrew's recommendation, have at least one partition per tserver. Except that, given enough tables, you'll eventually overwhelm any cluster based on the sheer number of partitions per tserver. And if you reduce the number of partitions per table, you open yourself up to potential hotspotting if the per-table load isn't even. Is that correct?

I can think of several improvements we could make here:
1. Support splitting/merging of range partitions, first manually and later automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both are complicated, and since we've gotten a lot of mileage out of static range/hash partitioning, there hasn't been much traction on those JIRAs.
2. Improve server efficiency when hosting large numbers of replicas, so that you can blanket your cluster with maximally hash-partitioned tables. We did some work here a couple of years ago (see KUDU-1967), but there's clearly much more that we can do. Of note, we previously discussed implementing Multi-Raft (https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031).
3. Add support for consistent hashing. I'm not super familiar with the details, but YugabyteDB (based on Kudu) has seen success here. Their recent blog post on sharding is a good (and related) read: https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/

Since you're clearly more familiar with Snowflake than I am, can you comment more on how their partitioning system works? It sounds like it's quite automated; is that ever problematic when it decides to repartition during a workload? The YugabyteDB blog post talks about heuristics "reacting" to hotspotting by load balancing, and concludes that hotspots can move around too quickly for a load balancer to keep up.

Separately, you mentioned having to manage Kudu's compaction process. Could you go into more detail here?

On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Cliff, this is really good info. I am tempted to do the benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of traction lately, and based on a few conversations in Slack, it looks like the Kudu team is not tracking the competition that much, and I think there are awesome features in Snowflake that would be great to have in Kudu.

I think they do support upserts now, as well as back-in-time queries - while they do not upsert/delete records, they just create new micro-partitions, and it is a metadata operation - that's how they can see data as of a given time. The idea of virtual warehouses is also very appealing, especially in healthcare, as we often need to share data with other departments or partners.

do not get me wrong, I decided to use Kudu for the same reasons, and I really did not have a choice on CDH other than HBase, but I get questioned lately why we should not just give up CDH and go all in on the cloud with cloud-native tech like Redshift and Snowflake. It is getting harder to explain why :)

My biggest gripe with Kudu, besides the known limitations, is the management of partitions and the compaction process. For our largest tables, we just max out the number of tablets that Kudu allows per cluster. Smaller tables, though, are a challenge and more like an art, while I feel there should be an automated process with good defaults, like in other data storage systems. No one expects to assign partitions manually in Snowflake, BigQuery or Redshift. No one is expected to tweak compaction parameters and deal with fast-degrading performance over time on tables with a high number of deletes/updates.
We are happy with performance, but it is not all about performance.

We also struggle on the Impala side, as it feels it is limiting what we can do with Kudu, and it feels like the Impala team is always behind. pyspark is another problem - the Kudu pyspark client is lagging behind the Java client and is very difficult to install/deploy. Some APIs are not even available in pyspark.

And honestly, I do not like that Impala is the only viable option with Kudu. A lot of the time we are forced to do heavy ETL in Hive, which is like a tank, but we cannot do that anymore with Kudu tables, and we struggle with disk-spilling issues, Impala choosing bad explain plans and forcing us to use the straight_join hint many times, etc.

Sorry if this post came across a bit on the negative side; I really like Kudu, and the dev team behind it rocks. But I do not like where the industry is going, and I wonder why Kudu is not getting the traction it deserves.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cre...@gmail.com> wrote:

This is a good conversation, but I don't think the comparison with Snowflake is a fair one, at least from an older version of Snowflake (in my last job, about 5 years ago, I pretty much single-handedly scale tested Snowflake in exchange for a sweetheart pricing deal). Though Snowflake is closed source, it seems pretty clear the architectures are quite different. Snowflake has no primary key index and no UPSERT capability, features that make Kudu valuable for some use cases.

It also seems to me that their intended workloads are quite different. Snowflake is great for intensive analytics on demand and can handle deeply nested data very well, where Kudu can't handle that at all. Snowflake is designed not for heavy concurrency, but for complex query plans from a small group of users. If you select an x-large Snowflake cluster, it's probably because you have a large amount of data to churn through, not because you have a large number of users. Or, at least, that's how we used it.

At my current workplace we use Kudu/Impala to handle about 30-60 concurrent queries. I agree that getting very fussy about partitioning can be a pain, but for the large fact tables we generally use a simple strategy of twelves: 12 hash x 12 (month) ranges in two 12-node clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's replication we can use "for free" instance-store NVMe. We also have associated compute-oriented stateless Impala Spot Fleet clusters for HLL and other compute-oriented queries.

The net result blows away what we had with RedShift at less than 1/3 the cost, with performance improvements mostly from better concurrency handling. This is despite the fact that RedShift has a built-in cache. We also use streaming ingestion which, aside from being impossible with RedShift, removes the added cost of staging.
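For reference, the "strategy of twelves" Cliff describes above maps fairly directly onto Kudu's two-level partitioning, as exposed through Impala DDL. A rough sketch issued through impyla is below; the table and column names are invented for the example, the monthly ranges would normally be pre-created for the whole year and extended over time, and the exact literal syntax for timestamp range bounds can vary by Impala version:

    from impala.dbapi import connect

    conn = connect(host='impala-coordinator.example.com', port=21050)
    cur = conn.cursor()

    # 12 hash buckets on the id, crossed with monthly range partitions on the
    # event time: 12 x 12 tablets per year of data, before replication.
    cur.execute("""
        CREATE TABLE events (
          id BIGINT,
          event_time TIMESTAMP,
          payload STRING,
          PRIMARY KEY (id, event_time)
        )
        PARTITION BY HASH (id) PARTITIONS 12,
        RANGE (event_time) (
          PARTITION '2020-01-01' <= VALUES < '2020-02-01',
          PARTITION '2020-02-01' <= VALUES < '2020-03-01'
          -- ... one partition per month
        )
        STORED AS KUDU
    """)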
Getting back to Snowflake, there's no way we could use it the same way we use Kudu, and even if we could, the cost would probably put us out of business!

On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Andrew for taking the time to respond to me. It seems that there are no exact recommendations.

I did look at the scaling recommendations, but that math is extremely complicated, and I do not think anyone will know all the answers to plug into that calculation. We have no real control over what users are doing, how many queries they run, how many are hot vs cold, etc. It is not realistic IMHO to expect that knowledge of user query patterns.

I do like the Snowflake approach, where the engine takes care of defaults and can estimate the number of micro-partitions and even repartition tables as they grow. I feel Kudu has the same capabilities, as the design is very similar. I really do not like picking a random number of buckets. Also, we manage 100s of tables; I cannot look at each one of them to make these decisions. Does it make sense?

On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:

Hey Boris,

Sorry you didn't have much luck on Slack. I know partitioning in general can be tricky; thanks for the question. Left some thoughts below:

> Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?
> And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them on smaller tablets for concurrency?

Per your numbers, this confirms that partitions are the units of concurrency here, and that therefore having more, smaller partitions yields a concurrency bump. That said, extending a scheme of smaller partitions across all tables may not scale when thinking about the total number of partitions cluster-wide.

There are some trade-offs with the replica count per tablet server here -- generally, each tablet replica has a resource cost on tablet servers: WALs and tablet-related metadata use a shared disk (if you can put this on an SSD, I would recommend doing so), each tablet introduces some Raft-related RPC traffic, each tablet replica introduces some maintenance operations in the pool of background operations to be run, etc.
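A quick back-of-the-envelope check makes the cluster-wide concern concrete. The numbers below are invented for illustration, and the per-server replica budget is only a rough figure in the spirit of the scaling guide, not a hard limit; the point is just that small per-table partition counts multiply quickly across hundreds of tables:

    import math

    num_tables = 300                    # hypothetical: hundreds of streamed tables
    partitions_per_table = 20           # "at least as many tablets as tablet servers"
    replication_factor = 3
    num_tservers = 20
    replica_budget_per_tserver = 1000   # rough per-server comfort zone, assumed here

    total_replicas = num_tables * partitions_per_table * replication_factor
    per_tserver = math.ceil(total_replicas / num_tservers)

    print(total_replicas, per_tserver)                 # 18000 replicas, ~900 per tserver
    print(per_tserver <= replica_budget_per_tserver)   # True, but with little headroom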
Your point about scan concurrency is certainly a valid one -- there have been patches for other integrations that have tackled this to decouple partitioning from scan concurrency (KUDU-2437 <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670 <https://issues.apache.org/jira/browse/KUDU-2670> are examples, where Kudu's Spark integration will split range scans into smaller-scoped scan tokens to be run concurrently, though this optimization hasn't made its way into Impala yet). I filed KUDU-3071 <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think is left on the Kudu side to get this up and running, so that it can be worked into Impala.

For now, I would try to take into account the total sum of resources you have available to Kudu (including the number of tablet servers, amount of storage per node, number of disks per tablet server, and type of disk for the WAL/metadata disks) to settle on roughly how many tablet replicas your system can handle (the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here), and hopefully that, along with your own SLAs per table, can help guide how you partition your tables.

> confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?

Aiming for 1GB seems a bit low; Kudu should be able to handle the low tens of GB per tablet replica, though exact perf obviously depends on your workload. As you show, and as pointed out in the documentation, fewer, larger tablets can limit the amount of concurrency for writes and reads, though we've seen multiple GBs work relatively well for many use cases while weighing the above-mentioned trade-offs with replica count.

> It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.

> if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?

The recommendation here is to have at least 20 logical partitions per table. That way, if a scan were to touch a table's entire keyspace, the table scan would be broken up into 20 tablet scans, and each of those might land on a different tablet server running on isolated hardware. For a significantly larger table into which you expect highly concurrent workloads, the recommendation serves as a lower bound -- I'd recommend having more partitions, and if your data is naturally time-oriented, consider range-partitioning on timestamp.
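One way to turn that advice into the kind of automated rule Boris asks for below is a simple heuristic: never go below the number of tablet servers, and otherwise size partitions toward a target tablet size. This is only a sketch of such a heuristic; the target size and the cap are placeholders, not official guidance:

    import math

    def suggest_partitions(table_size_gb, num_tservers,
                           target_tablet_gb=4, max_partitions_per_table=240):
        """Pick a partition (tablet) count for a table, before replication.

        At least one partition per tablet server for scan/write concurrency,
        more if the data would otherwise produce oversized tablets.
        """
        by_size = math.ceil(table_size_gb / target_tablet_gb)
        return min(max(num_tservers, by_size), max_partitions_per_table)

    # A 6 GB table (Boris's example) on a 20-tserver cluster -> 20 partitions;
    # a 2 TB fact table -> 500, capped at 240 partitions.
    print(suggest_partitions(6, 20), suggest_partitions(2000, 20))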
On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com> wrote:

just saw this in the docs, but it is still a confusing statement:

No Default Partitioning
Kudu does not provide a default partitioning strategy when creating tables. It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.

if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?

On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com> wrote:

hey guys,

I asked the same question on Slack and got no responses. I just went through the docs, the design doc and the FAQ and still did not find an answer.

Can someone comment?

Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?

And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them into smaller tablets for concurrency?

We cannot pick the partitioning strategy for each table, as we need to stream 100s of tables and we use the PK from the RDBMS, so we need to come up with an automated way to pick the number of partitions/tablets. So far I was using the 1Gb rule but am rethinking this now for another project.

On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com> wrote:

forgot to post results of my quick test:

Kudu 1.5

Table takes 18Gb of disk space after 3x replication

Tablets | Tablet size | Query run time, sec
3       | 2Gb         | 65
9       | 700Mb       | 27
18      | 350Mb       | 17

On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:

Hi guys,

just want to clarify recommendations from the doc. It says:

https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb

Partitioning Rules of Thumb

- For large tables, such as fact tables, aim for as many tablets as you have cores in the cluster.
- For small tables, such as dimension tables, ensure that each tablet is at least 1 GB in size.

In general, be mindful the number of tablets limits the parallelism of reads, in the current implementation. Increasing the number of tablets significantly beyond the number of cores is likely to have diminishing returns.

I've read this a few times, but I am not sure I understand it correctly. Let me use a concrete example.

If a table ends up taking 18Gb after replication (so with 3x replication it is ~6Gb per tablet if I do not partition), should I aim for 1Gb tablets (6 tablets before replication), or should I aim for 500Mb tablets if my cluster capacity allows it (12 tablets before replication)? I am confused why they say "at least" and not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?

Assume that I have tons of CPU cores on the cluster... Based on my quick test, it seems that queries are faster if I have more tablets/partitions... In this example, 18 tablets gave me the best timing, but the tablet size was around 300-400Mb. But the doc says "at least 1Gb".

Really confused about what the doc is saying, please help

Boris

--
Andrew Wong