I look forward to this enhancement, Tim! Thanks for sharing.

On Mon, Mar 16, 2020 at 7:54 PM Tim Armstrong <tarmstr...@cloudera.com> wrote:
> I think your hardware situation hurts not only for the number of tablets but also for Kudu + Impala. Impala afaik will only use one core per host per query, so is a poor fit for large complex queries on vertical hardware.

This is basically true as of current releases of Impala, but I'm working on addressing this. It's been possible to set mt_dop per-query on queries without joins or table sinks for a long time now (the limitations make it of limited use). Rough join and table sink support is behind a hidden flag in 3.3 and 3.4. I've been working on making it performant with joins (and doing all the requisite testing), which should land in master very soon and, if things go remotely to plan, be a fully supported option in the next release after 3.4. We saw huge speedups on a lot of queries (like 10x or more). Some queries didn't benefit much, if they were limited by the scan perf (including if the runtime filters pushed into the scans were filtering most data before joins).

On Mon, Mar 16, 2020 at 2:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:

appreciate your thoughts, Cliff

On Mon, Mar 16, 2020 at 11:18 AM Cliff Resnick <cre...@gmail.com> wrote:

Boris,

I think the crux of the problem is that "real-time analytics" over deeply nested storage does not really exist. I'll qualify that statement with "real-time" meaning streaming per-record ingestion, and "analytics" as columnar query access. The closest thing I know of today is Google BigQuery or Snowflake, but those are actually micro-batch ingestion done well, not per-record like Kudu. The only real-time analytics solution I know of that has a modicum of nesting is Druid, with a single level of "GroupBy" dimensions that get flattened into virtual rows as part of its non-SQL rollup API. Columnar storage and real-time are probably always going to be a difficult pairing; in fact, Kudu's "real-time" storage format is row-based, and Druid requires a whole lambda-architecture batch compaction. As we all know, nothing is for free, whether the trade-off for nested data be column-shredding into something like Kudu as Adar suggested, or micro-batching to Parquet, or some combination of both. I won't even go into how nested analytics are handled on the SQL/query side because it gets weird there, too.

I think your hardware situation hurts not only for the number of tablets but also for Kudu + Impala. Impala afaik will only use one core per host per query, so is a poor fit for large complex queries on vertical hardware.

In conclusion, I don't know how far you've gone into your investigations, but I can say that if your needs are to support "power users" at a premium, then something like Snowflake is great for your problem space. But if analytics are also more centrally integrated in pipelines, then Parquet is hard to beat for the price and flexibility, as is Kudu for dashboards or other intelligence that leverages upsert/key semantics. Ultimately, like many of us, you may find that you'll need all of the above.

Cliff

On Sun, Mar 15, 2020 at 1:11 PM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Adar for your comments and thinking this through! I really think Kudu has tons of potential and am very happy we've made a decision to use it for our real-time pipeline.
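For reference, the mt_dop option Tim describes above is an ordinary Impala query option, so it can be enabled per query from a client session. A minimal sketch with impyla is below; the host, port and table name are placeholders, and whether mt_dop actually takes effect for a given query still depends on the Impala version and on the join/table-sink limitations Tim mentions:

    from impala.dbapi import connect

    # Placeholder coordinator host/port; adjust for your cluster.
    conn = connect(host='impala-coordinator.example.com', port=21050)
    cur = conn.cursor()

    # MT_DOP is a per-query option; SET applies it for the rest of this session.
    cur.execute('SET MT_DOP=8')
    cur.execute('SELECT count(*) FROM my_kudu_table')
    print(cur.fetchall())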
you asked about use cases for support of nested and large binary/text objects. I work for a large healthcare organization and there is a lot going on with the newer HL7 FHIR standard. FHIR documents are highly nested JSON objects that might get as big as a few GBs in size. We could not use Kudu for storing FHIR bundles/documents due to the size and the inability to store FHIR objects natively. We ended up using Hive/Spark for that, but that solution is not real-time. And it is not only about storing, but also about fast search in those FHIR documents. I know we could use HBase for that, but HBase is really bad when it comes to analytics-type workloads.

As for large text/binaries, the good and common example is narrative notes (physician notes, progress notes, etc.). They can be in the form of RTF or PDF documents or images.

The 300-column limit was not a big deal so far, but we have high-density nodes with two 44-core CPUs and twelve 12-TB drives, and we cannot use their full power due to Kudu's limits on the number of tablets per node. This is more concerning than the 300-column limit.

On Sat, Mar 14, 2020 at 3:45 AM Adar Lieber-Dembo <a...@cloudera.com> wrote:

Snowflake's micro-partitions sound an awful lot like Kudu's rowsets:
1. Both are created transparently and automatically as data is inserted.
2. Both may overlap with one another based on the sort column (Snowflake lets you choose the sort column; in Kudu this is always the PK).
3. Both are pruned at scan time if the scan's predicates allow it.
4. In Kudu, a tablet with a large clustering depth (known as "average rowset height") will be a likely compaction candidate. It's not clear to me if Snowflake has a process to automatically compact their micro-partitions.

I don't understand how this solves the partitioning problem though: the doc doesn't explain how micro-partitions are mapped to nodes, or if/when micro-partitions are load balanced across the cluster. I'd be curious to learn more about this.

Below you'll find some more info on the other issues you raised:
- Nested data type support is tracked in KUDU-1261. We've had an on-again/off-again relationship with it: it's a significant amount of work to implement, so we're hesitant to commit unless there's also significant need. Many users have been successful in "shredding" their nested types across many Kudu columns, which is one of the reasons we're actively working on increasing the number of supported columns above 300.
- We're also working on auto-rebalancing right now; see KUDU-2780 for more details.
- Large cell support is tracked in KUDU-2874, but there doesn't seem to have been as much interest there.
- The compaction issue you're describing sounds like KUDU-1400, which was indeed fixed in 1.9. This was our bad; for quite some time we were focused on optimizing for heavy workloads and didn't pay attention to Kudu performance in "trickling" scenarios, especially not over a long period of time. That's also why we didn't think to advertise --flush_threshold_secs, or even to change its default value (which is still 2 minutes: far too short if we hadn't fixed KUDU-1400); we just weren't very aware of how widespread of a problem it was.
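To make the "shredding" idea above concrete: the gist is to flatten each nested document into one wide row whose column names encode the path. The sketch below is purely illustrative Python; the helper and the trimmed FHIR-like record are invented for the example, and real FHIR resources with repeated elements usually also need child tables keyed by the parent's primary key:

    def shred(doc, prefix='', sep='_'):
        """Flatten a nested dict/list into a single-level dict of column -> value."""
        cols = {}
        if isinstance(doc, dict):
            for key, value in doc.items():
                cols.update(shred(value, prefix + key + sep, sep))
        elif isinstance(doc, list):
            for i, value in enumerate(doc):
                cols.update(shred(value, prefix + str(i) + sep, sep))
        else:
            cols[prefix.rstrip(sep)] = doc
        return cols

    # A heavily trimmed FHIR-like resource, flattened into Kudu-style columns.
    row = shred({'resourceType': 'Patient',
                 'name': [{'family': 'Smith', 'given': ['Ann']}],
                 'birthDate': '1980-01-01'})
    # -> {'resourceType': 'Patient', 'name_0_family': 'Smith',
    #     'name_0_given_0': 'Ann', 'birthDate': '1980-01-01'}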
On Fri, Mar 13, 2020 at 6:22 AM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Adar, I am a big fan of Kudu and I see a lot of potential in it, and I do not think Snowflake deserves that much credit and noise. But they do have great ideas that take their engine to the next level. I do not know, architecture-wise, how it works, but here is the best doc I found regarding micro-partitions:

https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html

It is also appealing to me that they support nested data types with ease (and they are still very efficient to query/filter by), and they scale workloads easily (Kudu requires manual rebalancing, and it is VERY slow, as I learned last weekend by doing it on our dev cluster). Also, they support very large blobs and text fields, which Kudu does not.

Actually, I have a much longer wish list on the Impala side than on the Kudu side: it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with multi-tenancy and complex queries that Hive would run like a champ.

As for Kudu's compaction process: it was largely an issue with 1.5, and I believe it was addressed in 1.9. We had a terrible time for a few weeks when frequent updates/deletes brought almost all our queries to a crawl. A lot of smaller tables had a huge number of tiny rowsets, causing massive scans and freezing the entire Kudu cluster. We had a good discussion on Slack, and Todd Lipcon suggested a good workaround using flush_threshold_secs until we moved to 1.9, and it worked fine. Nowhere in the documentation was it suggested to set this flag, and actually it was one of those "use it at your own risk" flags.

On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <a...@cloudera.com> wrote:

This has been an excellent discussion to follow, with very useful feedback. Thank you for that.

Boris, if I can try to summarize your position, it's that manual partitioning doesn't scale when dealing with hundreds of (small) tables and when you don't control the PK of each table. The Kudu schema design guide would advise you to hash partition such tables and, following Andrew's recommendation, have at least one partition per tserver. Except that, given enough tables, you'll eventually overwhelm any cluster based on the sheer number of partitions per tserver. And if you reduce the number of partitions per table, you open yourself up to potential hotspotting if the per-table load isn't even. Is that correct?

I can think of several improvements we could make here:
1. Support splitting/merging of range partitions, first manually and later automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both are complicated, and since we've gotten a lot of mileage out of static range/hash partitioning, there hasn't been much traction on those JIRAs.
2. Improve server efficiency when hosting large numbers of replicas, so that you can blanket your cluster with maximally hash-partitioned tables. We did some work here a couple of years ago (see KUDU-1967), but there's clearly much more that we can do. Of note, we previously discussed implementing Multi-Raft (https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031).
3. Add support for consistent hashing. I'm not super familiar with the details, but YugabyteDB (based on Kudu) has seen success here. Their recent blog post on sharding is a good (and related) read: https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/

Since you're clearly more familiar with Snowflake than I am, can you comment more on how their partitioning system works? It sounds like it's quite automated; is that ever problematic when it decides to repartition during a workload? The YugabyteDB blog post talks about heuristics "reacting" to hotspotting by load balancing, and concludes that hotspots can move around too quickly for a load balancer to keep up.

Separately, you mentioned having to manage Kudu's compaction process. Could you go into more detail here?

On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Cliff, this is really good info. I am tempted to do the benchmarks myself but need to find a sponsor :) Snowflake gets A LOT of traction lately, and based on a few conversations in Slack, it looks like the Kudu team is not tracking the competition that much, and I think there are awesome features in Snowflake that would be great to have in Kudu.

I think they do support upserts now, as well as back-in-time queries - while they do not upsert/delete records, they just create new micro-partitions, and it is a metadata operation - that's how they can see data as of a given time. The idea of virtual warehouses is also very appealing, especially in healthcare, as we often need to share data with other departments or partners.

do not get me wrong, I decided to use Kudu for the same reasons, and I really did not have a choice on CDH other than HBase, but I get questioned lately why we should not just give up CDH and go all in on the cloud with cloud-native tech like Redshift and Snowflake. It is getting harder to explain why :)

My biggest gripe with Kudu, besides the known limitations, is the management of partitions and the compaction process. For our largest tables, we just max out the number of tablets that Kudu allows per cluster. Smaller tables, though, are a challenge and more like an art, while I feel there should be an automated process with good defaults, like in other data storage systems. No one expects to assign partitions manually in Snowflake, BigQuery or Redshift. No one is expected to tweak compaction parameters and deal with fast-degrading performance over time on tables with a high number of deletes/updates.
We are happy with performance, but it is not all about performance.

We also struggle on the Impala side, as it feels it is limiting what we can do with Kudu, and it feels like the Impala team is always behind. pyspark is another problem - the Kudu pyspark client is lagging behind the Java client and is very difficult to install/deploy. Some APIs are not even available in pyspark.

And honestly, I do not like that Impala is the only viable option with Kudu. A lot of the time we are forced to do heavy ETL in Hive, which is like a tank, but we cannot do that anymore with Kudu tables, and we struggle with disk-spilling issues, Impala choosing bad explain plans and forcing us to use the straight_join hint many times, etc.

Sorry if this post came across a bit on the negative side; I really like Kudu, and the dev team behind it rocks. But I do not like where the industry is going, and I wonder why Kudu is not getting the traction it deserves.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cre...@gmail.com> wrote:

This is a good conversation, but I don't think the comparison with Snowflake is a fair one, at least from an older version of Snowflake (in my last job, about 5 years ago, I pretty much single-handedly scale tested Snowflake in exchange for a sweetheart pricing deal). Though Snowflake is closed source, it seems pretty clear the architectures are quite different. Snowflake has no primary key index and no UPSERT capability, features that make Kudu valuable for some use cases.

It also seems to me that their intended workloads are quite different. Snowflake is great for intensive analytics on demand and can handle deeply nested data very well, where Kudu can't handle that at all. Snowflake is designed not for heavy concurrency, but for complex query plans from a small group of users. If you select an x-large Snowflake cluster, it's probably because you have a large amount of data to churn through, not because you have a large number of users. Or, at least, that's how we used it.

At my current workplace we use Kudu/Impala to handle about 30-60 concurrent queries. I agree that getting very fussy about partitioning can be a pain, but for the large fact tables we generally use a simple strategy of twelves: 12 hash x 12 (month) ranges in two 12-node clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's replication we can use "for free" instance-store NVMe. We also have associated compute-oriented stateless Impala Spot Fleet clusters for HLL and other compute-oriented queries.

The net result blows away what we had with RedShift at less than 1/3 the cost, with performance improvements mostly from better concurrency handling. This is despite the fact that RedShift has a built-in cache. We also use streaming ingestion which, aside from being impossible with RedShift, removes the added cost of staging.
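For reference, the "strategy of twelves" Cliff describes above maps fairly directly onto Kudu's two-level partitioning, as exposed through Impala DDL. A rough sketch issued through impyla is below; the table and column names are invented for the example, the monthly ranges would normally be pre-created for the whole year and extended over time, and the exact literal syntax for timestamp range bounds can vary by Impala version:

    from impala.dbapi import connect

    conn = connect(host='impala-coordinator.example.com', port=21050)
    cur = conn.cursor()

    # 12 hash buckets on the id, crossed with monthly range partitions on the
    # event time: 12 x 12 tablets per year of data, before replication.
    cur.execute("""
        CREATE TABLE events (
          id BIGINT,
          event_time TIMESTAMP,
          payload STRING,
          PRIMARY KEY (id, event_time)
        )
        PARTITION BY HASH (id) PARTITIONS 12,
        RANGE (event_time) (
          PARTITION '2020-01-01' <= VALUES < '2020-02-01',
          PARTITION '2020-02-01' <= VALUES < '2020-03-01'
          -- ... one partition per month
        )
        STORED AS KUDU
    """)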
Getting back to Snowflake, there's no way we could use it the same way we use Kudu, and even if we could, the cost would probably put us out of business!

On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:

thanks Andrew for taking the time to respond to me. It seems that there are no exact recommendations.

I did look at the scaling recommendations, but that math is extremely complicated, and I do not think anyone will know all the answers to plug into that calculation. We have no real control over what users are doing, how many queries they run, how many are hot vs cold, etc. It is not realistic IMHO to expect that knowledge of user query patterns.

I do like the Snowflake approach, where the engine takes care of defaults and can estimate the number of micro-partitions and even repartition tables as they grow. I feel Kudu has the same capabilities, as the design is very similar. I really do not like picking a random number of buckets. Also, we manage 100s of tables; I cannot look at each one of them to make these decisions. Does it make sense?

On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:

Hey Boris,

Sorry you didn't have much luck on Slack. I know partitioning in general can be tricky; thanks for the question. Left some thoughts below:

> Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?
> And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them on smaller tablets for concurrency?

Per your numbers, this confirms that partitions are the units of concurrency here, and that therefore having more, smaller partitions yields a concurrency bump. That said, extending a scheme of smaller partitions across all tables may not scale when thinking about the total number of partitions cluster-wide.

There are some trade-offs with the replica count per tablet server here -- generally, each tablet replica has a resource cost on tablet servers: WALs and tablet-related metadata use a shared disk (if you can put this on an SSD, I would recommend doing so), each tablet introduces some Raft-related RPC traffic, each tablet replica introduces some maintenance operations in the pool of background operations to be run, etc.
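A quick back-of-the-envelope check makes the cluster-wide concern concrete. The numbers below are invented for illustration, and the per-server replica budget is only a rough figure in the spirit of the scaling guide, not a hard limit; the point is just that small per-table partition counts multiply quickly across hundreds of tables:

    import math

    num_tables = 300                    # hypothetical: hundreds of streamed tables
    partitions_per_table = 20           # "at least as many tablets as tablet servers"
    replication_factor = 3
    num_tservers = 20
    replica_budget_per_tserver = 1000   # rough per-server comfort zone, assumed here

    total_replicas = num_tables * partitions_per_table * replication_factor
    per_tserver = math.ceil(total_replicas / num_tservers)

    print(total_replicas, per_tserver)                 # 18000 replicas, ~900 per tserver
    print(per_tserver <= replica_budget_per_tserver)   # True, but with little headroom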
Your point about scan concurrency is certainly a valid one -- there have been patches for other integrations that have tackled this to decouple partitioning from scan concurrency (KUDU-2437 <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670 <https://issues.apache.org/jira/browse/KUDU-2670> are examples, where Kudu's Spark integration will split range scans into smaller-scoped scan tokens to be run concurrently, though this optimization hasn't made its way into Impala yet). I filed KUDU-3071 <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think is left on the Kudu side to get this up and running, so that it can be worked into Impala.

For now, I would try to take into account the total sum of resources you have available to Kudu (including the number of tablet servers, amount of storage per node, number of disks per tablet server, and type of disk for the WAL/metadata disks) to settle on roughly how many tablet replicas your system can handle (the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here), and hopefully that, along with your own SLAs per table, can help guide how you partition your tables.

> confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?

Aiming for 1GB seems a bit low; Kudu should be able to handle the low tens of GB per tablet replica, though exact perf obviously depends on your workload. As you show, and as pointed out in the documentation, fewer, larger tablets can limit the amount of concurrency for writes and reads, though we've seen multiple GBs work relatively well for many use cases while weighing the above-mentioned trade-offs with replica count.

> It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.

> if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?

The recommendation here is to have at least 20 logical partitions per table. That way, if a scan were to touch a table's entire keyspace, the table scan would be broken up into 20 tablet scans, and each of those might land on a different tablet server running on isolated hardware. For a significantly larger table into which you expect highly concurrent workloads, the recommendation serves as a lower bound -- I'd recommend having more partitions, and if your data is naturally time-oriented, consider range-partitioning on timestamp.
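One way to turn that advice into the kind of automated rule Boris asks for below is a simple heuristic: never go below the number of tablet servers, and otherwise size partitions toward a target tablet size. This is only a sketch of such a heuristic; the target size and the cap are placeholders, not official guidance:

    import math

    def suggest_partitions(table_size_gb, num_tservers,
                           target_tablet_gb=4, max_partitions_per_table=240):
        """Pick a partition (tablet) count for a table, before replication.

        At least one partition per tablet server for scan/write concurrency,
        more if the data would otherwise produce oversized tablets.
        """
        by_size = math.ceil(table_size_gb / target_tablet_gb)
        return min(max(num_tservers, by_size), max_partitions_per_table)

    # A 6 GB table (Boris's example) on a 20-tserver cluster -> 20 partitions;
    # a 2 TB fact table -> 500, capped at 240 partitions.
    print(suggest_partitions(6, 20), suggest_partitions(2000, 20))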
On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com> wrote:

just saw this in the docs, but it is still a confusing statement:

No Default Partitioning
Kudu does not provide a default partitioning strategy when creating tables. It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.

if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?

On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com> wrote:

hey guys,

I asked the same question on Slack and got no responses. I just went through the docs, the design doc and the FAQ and still did not find an answer.

Can someone comment?

Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?

And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them into smaller tablets for concurrency?

We cannot pick the partitioning strategy for each table, as we need to stream 100s of tables and we use the PK from the RDBMS, so we need to come up with an automated way to pick the number of partitions/tablets. So far I was using the 1Gb rule but am rethinking this now for another project.

On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com> wrote:

forgot to post results of my quick test:

Kudu 1.5

Table takes 18Gb of disk space after 3x replication

Tablets | Tablet size | Query run time, sec
3       | 2Gb         | 65
9       | 700Mb       | 27
18      | 350Mb       | 17

On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:

Hi guys,

just want to clarify recommendations from the doc. It says:

https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb

Partitioning Rules of Thumb

- For large tables, such as fact tables, aim for as many tablets as you have cores in the cluster.
- For small tables, such as dimension tables, ensure that each tablet is at least 1 GB in size.

In general, be mindful the number of tablets limits the parallelism of reads, in the current implementation. Increasing the number of tablets significantly beyond the number of cores is likely to have diminishing returns.

I've read this a few times, but I am not sure I understand it correctly. Let me use a concrete example.

If a table ends up taking 18Gb after replication (so with 3x replication it is ~6Gb per tablet if I do not partition), should I aim for 1Gb tablets (6 tablets before replication), or should I aim for 500Mb tablets if my cluster capacity allows it (12 tablets before replication)? I am confused why they say "at least" and not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?

Assume that I have tons of CPU cores on the cluster... Based on my quick test, it seems that queries are faster if I have more tablets/partitions... In this example, 18 tablets gave me the best timing, but the tablet size was around 300-400Mb. But the doc says "at least 1Gb".

Really confused about what the doc is saying, please help

Boris

--
Andrew Wong