very nice, Grant! this sounds really promising!

On Fri, Mar 13, 2020 at 9:52 AM Grant Henke <ghe...@cloudera.com> wrote:
>> Actually, I have a much longer wish list on the Impala side than Kudu - it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with multi-tenancy and complex queries that Hive would run like a champ.
>
> There is an experimental Hive integration you could try. You may need to compile it yourself, depending on what version of Hive you are using. Feel free to reach out on Slack if you have issues. The details of the integration are here:
> https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
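>
> For a rough idea of what that looks like, here is a minimal sketch along the lines of the wiki page - the Hive table name, Kudu table name, and master address below are made up, so adjust them for your environment:
>
>   CREATE EXTERNAL TABLE kudu_events
>   STORED BY 'org.apache.hadoop.hive.kudu.KuduStorageHandler'
>   TBLPROPERTIES (
>     "kudu.table_name" = "impala::default.events",
>     "kudu.master_addresses" = "kudu-master-1:7051"
>   );
>
>   -- once mapped, Hive can query the Kudu table like any other:
>   SELECT count(*) FROM kudu_events;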
>
> On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>
>> thanks Adar, I am a big fan of Kudu and I see a lot of potential in Kudu, and I do not think Snowflake deserves that much credit and noise. But they do have great ideas that take their engine to the next level. I do not know how it works architecture-wise, but here is the best doc I found regarding micro-partitions:
>> https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html
>>
>> It is also appealing to me that they support nested data types with ease (and they are still very efficient to query/filter by), and that they scale workloads easily (Kudu requires manual rebalancing, and it is VERY slow, as I learned last weekend by doing it on our dev cluster). Also, they support very large blobs and text fields that Kudu does not.
>>
>> Actually, I have a much longer wish list on the Impala side than Kudu - it feels that Kudu itself is very close to Snowflake but lacks these self-management features. While Impala is certainly one of the fastest SQL engines I've seen, we struggle with multi-tenancy and complex queries that Hive would run like a champ.
>>
>> As for Kudu's compaction process - it was largely an issue with 1.5 and I believe it was addressed in 1.9. We had a terrible time for a few weeks when frequent updates/deletes slowed all our queries to a crawl. A lot of smaller tables had a huge number of tiny rowsets, causing massive scans and freezing the entire Kudu cluster. We had a good discussion on Slack, and Todd Lipcon suggested a good workaround using the flush_threshold_secs flag until we move to 1.9; it worked fine. Nowhere in the documentation was it suggested to set this flag, and in fact it was one of those "use it at your own risk" flags.
>>
>> On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <a...@cloudera.com> wrote:
>>
>>> This has been an excellent discussion to follow, with very useful feedback. Thank you for that.
>>>
>>> Boris, if I can try to summarize your position, it's that manual partitioning doesn't scale when dealing with hundreds of (small) tables and when you don't control the PK of each table. The Kudu schema design guide would advise you to hash partition such tables, and, following Andrew's recommendation, have at least one partition per tserver. Except that given enough tables, you'll eventually overwhelm any cluster based on the sheer number of partitions per tserver. And if you reduce the number of partitions per table, you open yourself up to potential hotspotting if the per-table load isn't even. Is that correct?
>>>
>>> I can think of several improvements we could make here:
>>> 1. Support splitting/merging of range partitions, first manually and later automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both are complicated, and since we've gotten a lot of mileage out of static range/hash partitioning, there hasn't been much traction on those JIRAs.
>>> 2. Improve server efficiency when hosting large numbers of replicas, so that you can blanket your cluster with maximally hash-partitioned tables. We did some work here a couple of years ago (see KUDU-1967) but there's clearly much more that we can do. Of note, we previously discussed implementing Multi-Raft (https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031).
>>> 3. Add support for consistent hashing. I'm not super familiar with the details, but YugabyteDB (based on Kudu) has seen success here. Their recent blog post on sharding is a good (and related) read: https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
>>>
>>> Since you're clearly more familiar with Snowflake than I am, can you comment more on how their partitioning system works? It sounds like it's quite automated; is that ever problematic when it decides to repartition during a workload? The YugabyteDB blog post talks about heuristics "reacting" to hotspotting by load balancing, and concludes that hotspots can move around too quickly for a load balancer to keep up.
>>>
>>> Separately, you mentioned having to manage Kudu's compaction process. Could you go into more detail here?
>>>
>>> On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>
>>>> thanks Cliff, this is really good info. I am tempted to do the benchmarks myself but need to find a sponsor :) Snowflake has been getting A LOT of traction lately, and based on a few conversations in Slack, it looks like the Kudu team is not tracking the competition that much, and I think there are awesome features in Snowflake that would be great to have in Kudu.
>>>>
>>>> I think they do support upserts now, as well as back-in-time queries - they do not actually upsert/delete records, they just create new micro-partitions, and it is a metadata operation - that's how they can see data as of a given time. The idea of a virtual warehouse is also very appealing, especially in healthcare, as we often need to share data with other departments or partners.
>>>>
>>>> do not get me wrong, I decided to use Kudu for the same reasons and I really did not have a choice on CDH other than HBase, but I get questioned lately why we should not just give up CDH and go all in on the Cloud with Cloud-native tech like Redshift and Snowflake. It is getting harder to explain why :)
>>>>
>>>> My biggest gripe with Kudu, besides the known limitations, is the management of partitions and the compaction process. For our largest tables, we just max out the number of tablets Kudu allows per cluster. Smaller tables, though, are a challenge and more like an art, while I feel there should be an automated process with good defaults like in other data storage systems. No one expects to assign partitions manually in Snowflake, BigQuery or Redshift. No one is expected to tweak compaction parameters and deal with performance that degrades quickly over time on tables with a high number of deletes/updates. We are happy with performance, but it is not all about performance.
>>>> We also struggle on the Impala side, as it feels it is limiting what we can do with Kudu, and it feels like the Impala team is always behind. pyspark is another problem - the Kudu pyspark client is lagging behind the Java client and is very difficult to install/deploy. Some of the API is not even available in pyspark.
>>>>
>>>> And honestly, I do not like that Impala is the only viable option with Kudu. A lot of the time we are forced to do heavy ETL in Hive, which is like a tank, but we cannot do that anymore with Kudu tables, and we struggle with disk spilling issues, Impala choosing bad explain plans and forcing us to use the straight_join hint many times, etc.
>>>>
>>>> Sorry if this post came across a bit on the negative side; I really like Kudu and the dev team behind it rocks. But I do not like where the industry is going and why Kudu is not getting the traction it deserves.
>>>>
>>>> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cre...@gmail.com> wrote:
>>>>
>>>>> This is a good conversation, but I don't think the comparison with Snowflake is a fair one, at least from an older version of Snowflake (in my last job, about 5 years ago, I pretty much single-handedly scale tested Snowflake in exchange for a sweetheart pricing deal). Though Snowflake is closed source, it seems pretty clear the architectures are quite different. Snowflake has no primary key index and no UPSERT capability, features that make Kudu valuable for some use cases.
>>>>>
>>>>> It also seems to me that their intended workloads are quite different. Snowflake is great for intensive analytics on demand, and can handle deeply nested data very well, where Kudu can't handle that at all. Snowflake is not designed for heavy concurrency, but for complex query plans for a small group of users. If you select an x-large Snowflake cluster, it's probably because you have a large amount of data to churn through, not because you have a large number of users. Or, at least, that's how we used it.
>>>>>
>>>>> At my current workplace we use Kudu/Impala to handle about 30-60 concurrent queries. I agree that getting very fussy about partitioning can be a pain, but for the large fact tables we generally use a simple strategy of twelves: 12 hash buckets x 12 (month) ranges in two 12-node clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's replication we can use "for free" instance-store NVMe. We also have associated compute-oriented stateless Impala Spot Fleet clusters for HLL and other compute-oriented queries.
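>>>>>
>>>>> If it helps to picture that layout, a rough Impala DDL sketch is below (assuming Impala is already pointed at your Kudu masters). The table and column names are made up, and the integer month key is just to keep the example short - in practice the range column might be your event timestamp, and you would size the buckets and ranges for your own data:
>>>>>
>>>>>   CREATE TABLE fact_events (
>>>>>     event_id BIGINT,
>>>>>     event_month INT,    -- e.g. 202001 .. 202012
>>>>>     payload STRING,
>>>>>     PRIMARY KEY (event_id, event_month)
>>>>>   )
>>>>>   PARTITION BY HASH (event_id) PARTITIONS 12,
>>>>>     RANGE (event_month) (
>>>>>       PARTITION 202001 <= VALUES < 202002,
>>>>>       PARTITION 202002 <= VALUES < 202003
>>>>>       -- ...and so on, one range partition per month, 12 in total
>>>>>     )
>>>>>   STORED AS KUDU;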
>>>>> The net result blows away what we had with RedShift, at less than 1/3 the cost, with performance improvements coming mostly from better concurrency handling. This is despite the fact that RedShift has a built-in cache. We also use streaming ingestion which, aside from being impossible with RedShift, removes the added cost of staging.
>>>>>
>>>>> Getting back to Snowflake, there's no way we could use it the same way we use Kudu, and even if we could, the cost would probably put us out of business!
>>>>>
>>>>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>
>>>>>> thanks Andrew for taking your time responding to me. It seems that there are no exact recommendations.
>>>>>>
>>>>>> I did look at the scaling recommendations, but that math is extremely complicated and I do not think anyone will know all the answers to plug into that calculation. We have no control, really, over what users are doing, how many queries they run, how many are hot vs cold, etc. It is not realistic IMHO to expect that knowledge of user query patterns.
>>>>>>
>>>>>> I do like the Snowflake approach, where the engine takes care of defaults and can estimate the number of micro-partitions and even repartition tables as they grow. I feel Kudu has the same capabilities, as the design is very similar. I really do not like to pick a random number of buckets. Also, we manage 100s of tables; I cannot look at each of them one by one to make these decisions. Does it make sense?
>>>>>>
>>>>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>>>>
>>>>>>> Hey Boris,
>>>>>>>
>>>>>>> Sorry you didn't have much luck on Slack. I know partitioning in general can be tricky; thanks for the question. Left some thoughts below:
>>>>>>>
>>>>>>>> Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?
>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them into smaller tablets for concurrency?
>>>>>>>
>>>>>>> Per your numbers, this confirms that the partitions are the units of concurrency here, and that therefore having more and smaller partitions yields a concurrency bump. That said, extending a scheme of smaller partitions across all tables may not scale when thinking about the total number of partitions cluster-wide.
>>>>>>>
>>>>>>> There are some trade-offs with the replica count per tablet server here -- generally, each tablet replica has a resource cost on tablet servers: WALs and tablet-related metadata use a shared disk (if you can put this on an SSD, I would recommend doing so), each tablet introduces some Raft-related RPC traffic, each tablet replica introduces some maintenance operations in the pool of background operations to be run, etc.
>>>>>>>
>>>>>>> Your point about scan concurrency is certainly a valid one -- there have been patches for other integrations that have tackled this to decouple partitioning from scan concurrency (KUDU-2437 <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670 <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where Kudu's Spark integration will split range scans into smaller-scoped scan tokens to be run concurrently, though this optimization hasn't made its way into Impala yet). I filed KUDU-3071 <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think is left on the Kudu side to get this up and running, so that it can be worked into Impala.
>>>>>>>
>>>>>>> For now, I would try to take into account the total sum of resources you have available to Kudu (including the number of tablet servers, amount of storage per node, number of disks per tablet server, and type of disk for the WAL/metadata disks) to settle on roughly how many tablet replicas your system can handle (the scaling guide <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here), and hopefully that, along with your own SLAs per table, can help guide how you partition your tables.
>>>>>>>
>>>>>>>> confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low tens of GB per tablet replica, though exact performance obviously depends on your workload. As you show, and as pointed out in the documentation, larger and fewer tablets can limit the amount of concurrency for writes and reads, though we've seen multiple GBs work relatively well for many use cases while weighing the above-mentioned trade-offs with replica count.
>>>>>>>
>>>>>>>> It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.
>>>>>>>
>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?
>>>>>>>
>>>>>>> The recommendation here is to have at least 20 logical partitions per table. That way, if a scan were to touch a table's entire keyspace, the table scan would be broken up into 20 tablet scans, and each of those might land on a different tablet server running on isolated hardware. For a significantly larger table into which you expect highly concurrent workloads, the recommendation serves as a lower bound -- I'd recommend having more partitions, and if your data is naturally time-oriented, consider range-partitioning on timestamp.
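>>>>>>>
>>>>>>> To make that concrete, a hypothetical Impala sketch for your 20-tablet-server case (table and column names invented) might look like the following: the ~1MM-row table gets one hash bucket per tablet server, and the ~100MM-row, heavily accessed table treats 20 as a lower bound and goes higher (and could add a RANGE dimension on a timestamp as mentioned above). Note the partition counts are logical, not divided by the replication factor:
>>>>>>>
>>>>>>>   -- small table: at least as many partitions as tablet servers
>>>>>>>   CREATE TABLE dim_small (
>>>>>>>     id BIGINT,
>>>>>>>     name STRING,
>>>>>>>     PRIMARY KEY (id)
>>>>>>>   )
>>>>>>>   PARTITION BY HASH (id) PARTITIONS 20
>>>>>>>   STORED AS KUDU;
>>>>>>>
>>>>>>>   -- large, hot table: more partitions than the lower bound
>>>>>>>   CREATE TABLE fact_large (
>>>>>>>     id BIGINT,
>>>>>>>     amount DOUBLE,
>>>>>>>     PRIMARY KEY (id)
>>>>>>>   )
>>>>>>>   PARTITION BY HASH (id) PARTITIONS 40
>>>>>>>   STORED AS KUDU;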
>>>>>>>
>>>>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>
>>>>>>>> just saw this in the docs, but it is still a confusing statement:
>>>>>>>>
>>>>>>>> No Default Partitioning
>>>>>>>> Kudu does not provide a default partitioning strategy when creating tables. It is recommended that new tables which are expected to have heavy read and write workloads have at least as many tablets as tablet servers.
>>>>>>>>
>>>>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows and another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 because of replication)?
>>>>>>>>
>>>>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>>
>>>>>>>>> hey guys,
>>>>>>>>>
>>>>>>>>> I asked the same question on Slack and got no responses. I just went through the docs, the design doc and the FAQ and still did not find an answer.
>>>>>>>>>
>>>>>>>>> Can someone comment?
>>>>>>>>>
>>>>>>>>> Maybe I was not asking a clear question. If my cluster is large enough in my example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to be closer to 1Gb?
>>>>>>>>>
>>>>>>>>> And a follow-up question, if I have tons of smaller tables under 5 million rows, should I just use 1 partition or still break them into smaller tablets for concurrency?
>>>>>>>>>
>>>>>>>>> We cannot pick the partitioning strategy for each table individually, as we need to stream 100s of tables, we use the PK from the RDBMS, and we need to come up with an automated way to pick the number of partitions/tablets. So far I was using the 1Gb rule but am rethinking this now for another project.
>>>>>>>>>
>>>>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>>>
>>>>>>>>>> forgot to post results of my quick test:
>>>>>>>>>>
>>>>>>>>>> Kudu 1.5
>>>>>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>>>>>
>>>>>>>>>> Tablets | Tablet Size | Query run time, sec
>>>>>>>>>> 3       | 2Gb         | 65
>>>>>>>>>> 9       | 700Mb       | 27
>>>>>>>>>> 18      | 350Mb       | 17
>>>>>>>>>>
>>>>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> just want to clarify the recommendations from the doc. It says:
>>>>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>>>>
>>>>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>>>> - For large tables, such as fact tables, aim for as many tablets as you have cores in the cluster.
>>>>>>>>>>> - For small tables, such as dimension tables, ensure that each tablet is at least 1 GB in size.
>>>>>>>>>>> In general, be mindful the number of tablets limits the parallelism of reads, in the current implementation. Increasing the number of tablets significantly beyond the number of cores is likely to have diminishing returns.
>>>>>>>>>>>
>>>>>>>>>>> I've read this a few times but I am not sure I understand it correctly. Let me use a concrete example.
>>>>>>>>>>>
>>>>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x replication it is ~9Gb per tablet if I do not partition), should I aim for 1Gb tablets (6 tablets before replication) or should I aim for 500Mb tablets if my cluster capacity allows (12 tablets before replication)? confused why they say "at least" not "at most" - does it mean I should design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>>>>>
>>>>>>>>>>> Assume that I have tons of CPU cores on the cluster... Based on my quick test, it seems that queries are faster if I have more tablets/partitions... In this example, 18 tablets gave me the best timing, but the tablet size was around 300-400Mb. But the doc says "at least 1Gb".
>>>>>>>>>>> Really confused what the doc is saying, please help.
>>>>>>>>>>>
>>>>>>>>>>> Boris
>>>>>>>
>>>>>>> --
>>>>>>> Andrew Wong
>
> --
> Grant Henke
> Software Engineer | Cloudera
> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke