Cliff, I think I get the idea - you basically use a bunch of temporary servers that can be evicted at any time, each with attached temporary NVMe drives - very clever! I am surprised it works well with Kudu, since our experience with our on-prem cluster is that Kudu is extremely sensitive to disk failures and random reboots. At one point we had a situation where 2 out of 3 replicas were down for a while; it took us quite a bit of time to recover, and we had to use the kudu CLI to evict some tablet replicas manually.
Also, manual rebalancing is often needed, and it takes a lot of time (many hours).

On Sat, Apr 25, 2020 at 2:54 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> Cliff, I would be extremely interested to see a blog post comparing
> Snowflake, Redshift, and Impala/Kudu, since you have tried all of them.
>
> I would also love to get some details on how you set up your Kudu/Impala
> cluster on AWS, as my company might be heading in the same direction. This
> does not mean much to me yet, as we are not using the cloud, but I hope you
> can elaborate on your setup: "12 hash x 12 (month) ranges in two 12-node
> clusters fronted by a load balancer. We're in AWS, and thanks to Kudu's
> replication we can use "for free" instance-store NVMe. We also have
> associated compute-oriented stateless Impala Spot Fleet clusters for HLL
> and other compute-oriented queries."
>
> I cannot really find anything on the web that compares Impala/Kudu to
> Snowflake and Redshift. Everything I see is about Snowflake, Redshift, and
> BigQuery.
>
> On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cre...@gmail.com> wrote:
>
>> This is a good conversation, but I don't think the comparison with
>> Snowflake is a fair one, at least against an older version of Snowflake.
>> (In my last job, about 5 years ago, I pretty much single-handedly scale
>> tested Snowflake in exchange for a sweetheart pricing deal.) Though
>> Snowflake is closed source, it seems pretty clear the architectures are
>> quite different. Snowflake has no primary key index and no UPSERT
>> capability, features that make Kudu valuable for some use cases.
>>
>> It also seems to me that their intended workloads are quite different.
>> Snowflake is great for intensive analytics on demand and can handle
>> deeply nested data very well, where Kudu can't handle that at all.
>> Snowflake is designed not for heavy concurrency, but for complex query
>> plans from a small group of users.
>> If you select an x-large Snowflake cluster, it's probably because you
>> have a large amount of data to churn through, not because you have a
>> large number of users. Or at least that's how we used it.
>>
>> At my current workplace we use Kudu/Impala to handle about 30-60
>> concurrent queries. I agree that getting very fussy about partitioning
>> can be a pain, but for the large fact tables we generally use a simple
>> strategy of twelves: 12 hash x 12 (month) ranges in two 12-node clusters
>> fronted by a load balancer. We're in AWS, and thanks to Kudu's
>> replication we can use "for free" instance-store NVMe. We also have
>> associated compute-oriented stateless Impala Spot Fleet clusters for HLL
>> and other compute-oriented queries.
>>
>> The net result blows away what we had with Redshift at less than 1/3 the
>> cost, with performance improvements mostly from better concurrency
>> handling. This is despite the fact that Redshift has a built-in cache. We
>> also use streaming ingestion which, aside from being impossible with
>> Redshift, removes the added cost of staging.
>>
>> Getting back to Snowflake, there's no way we could use it the same way
>> we use Kudu, and even if we could, the cost would probably put us out of
>> business!
>>
>> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com>
>> wrote:
>>
>>> Thanks, Andrew, for taking the time to respond to me. It seems that
>>> there are no exact recommendations.
>>>
>>> I did look at the scaling recommendations, but that math is extremely
>>> complicated, and I do not think anyone will know all the answers to plug
>>> into that calculation. We have no real control over what users are
>>> doing: how many queries they run, how many are hot vs. cold, etc. It is
>>> not realistic IMHO to expect that knowledge of user query patterns.
>>> I do like the Snowflake approach, where the engine takes care of
>>> defaults, can estimate the number of micro-partitions, and can even
>>> repartition tables as they grow. I feel Kudu has the same capabilities,
>>> as the design is very similar. I really do not like picking a random
>>> number of buckets. Also, we manage 100s of tables; I cannot look at each
>>> one individually to make these decisions. Does that make sense?
>>>
>>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
>>>
>>>> Hey Boris,
>>>>
>>>> Sorry you didn't have much luck on Slack. I know partitioning in
>>>> general can be tricky; thanks for the question. I left some thoughts
>>>> below:
>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large
>>>>> enough in my example above, should I go with 3, 9, or 18 tablets? Or
>>>>> should I pick tablets to be closer to 1Gb?
>>>>> And a follow-up question: if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition, or still break them into
>>>>> smaller tablets for concurrency?
>>>>
>>>> Per your numbers, this confirms that partitions are the unit of
>>>> concurrency here, and that having more, smaller partitions therefore
>>>> yields a concurrency bump. That said, extending a scheme of smaller
>>>> partitions across all tables may not scale when thinking about the
>>>> total number of partitions cluster-wide.
>>>>
>>>> There are some trade-offs with the replica count per tablet server here
>>>> -- generally, each tablet replica has a resource cost on tablet
>>>> servers: WALs and tablet-related metadata use a shared disk (if you can
>>>> put this on an SSD, I would recommend doing so), each tablet introduces
>>>> some Raft-related RPC traffic, each tablet replica introduces some
>>>> maintenance operations into the pool of background operations to be
>>>> run, etc.
>>>> Your point about scan concurrency is certainly a valid one -- there
>>>> have been patches for other integrations that have tackled this by
>>>> decoupling partitioning from scan concurrency (KUDU-2437
>>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>>> <https://issues.apache.org/jira/browse/KUDU-2670> are examples, where
>>>> Kudu's Spark integration will split range scans into smaller-scoped
>>>> scan tokens to be run concurrently, though this optimization hasn't
>>>> made its way into Impala yet). I filed KUDU-3071
>>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I
>>>> think is left on the Kudu side to get this up and running, so that it
>>>> can be worked into Impala.
>>>>
>>>> For now, I would try to take into account the total sum of resources
>>>> you have available to Kudu (including the number of tablet servers, the
>>>> amount of storage per node, the number of disks per tablet server, and
>>>> the type of disk used for the WAL/metadata) to settle on roughly how
>>>> many tablet replicas your system can handle (the scaling guide
>>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
>>>> and hopefully that, along with your own SLAs per table, can help guide
>>>> how you partition your tables.
>>>>
>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>
>>>> Aiming for 1Gb seems a bit low; Kudu should be able to handle the low
>>>> tens of GB per tablet replica, though exact performance obviously
>>>> depends on your workload. As you show, and as pointed out in the
>>>> documentation, larger and fewer tablets can limit the amount of
>>>> concurrency for writes and reads, though we've seen that multiple GBs
>>>> work relatively well for many use cases while weighing the
>>>> above-mentioned trade-offs with replica count.
>>>>> It is recommended that new tables which are expected to have heavy
>>>>> read and write workloads have at least as many tablets as tablet
>>>>> servers.
>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>
>>>> The recommendation here is to have at least 20 logical partitions per
>>>> table. That way, if a scan were to touch a table's entire keyspace, the
>>>> table scan would be broken up into 20 tablet scans, and each of those
>>>> might land on a different tablet server running on isolated hardware.
>>>> For a significantly larger table into which you expect highly
>>>> concurrent workloads, the recommendation serves as a lower bound -- I'd
>>>> recommend having more partitions, and if your data is naturally
>>>> time-oriented, consider range-partitioning on timestamp.
>>>>
>>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com>
>>>> wrote:
>>>>
>>>>> just saw this in the docs, but it is still a confusing statement:
>>>>>
>>>>> No Default Partitioning
>>>>> Kudu does not provide a default partitioning strategy when creating
>>>>> tables. It is recommended that new tables which are expected to have
>>>>> heavy read and write workloads have at least as many tablets as tablet
>>>>> servers.
>>>>>
>>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>>> (divide by 3 because of replication)?
>>>>>
>>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com>
>>>>> wrote:
>>>>>
>>>>>> hey guys,
>>>>>>
>>>>>> I asked the same question on Slack and got no responses. I just went
>>>>>> through the docs, the design doc, and the FAQ, and still did not find
>>>>>> an answer.
>>>>>>
>>>>>> Can someone comment?
>>>>>>
>>>>>> Maybe I was not asking a clear question.
>>>>>> If my cluster is large enough in my example above, should I go with
>>>>>> 3, 9, or 18 tablets? Or should I pick tablets to be closer to 1Gb?
>>>>>>
>>>>>> And a follow-up question: if I have tons of smaller tables under 5
>>>>>> million rows, should I just use 1 partition, or still break them into
>>>>>> smaller tablets for concurrency?
>>>>>>
>>>>>> We cannot pick the partitioning strategy for each table, as we need
>>>>>> to stream 100s of tables; we use the PK from the RDBMS and need to
>>>>>> come up with an automated way to pick the number of
>>>>>> partitions/tablets. So far I was using the 1Gb rule, but I am
>>>>>> rethinking this now for another project.
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I forgot to post the results of my quick test:
>>>>>>>
>>>>>>> Kudu 1.5
>>>>>>>
>>>>>>> The table takes 18Gb of disk space after 3x replication.
>>>>>>>
>>>>>>> Tablets | Tablet Size | Query run time, sec
>>>>>>> 3       | 2Gb         | 65
>>>>>>> 9       | 700Mb       | 27
>>>>>>> 18      | 350Mb       | 17
>>>>>>>
>>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi guys,
>>>>>>>>
>>>>>>>> I just want to clarify the recommendations from the doc. It says:
>>>>>>>>
>>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>>
>>>>>>>> Partitioning Rules of Thumb
>>>>>>>>
>>>>>>>> - For large tables, such as fact tables, aim for as many tablets
>>>>>>>> as you have cores in the cluster.
>>>>>>>>
>>>>>>>> - For small tables, such as dimension tables, ensure that each
>>>>>>>> tablet is at least 1 GB in size.
>>>>>>>> In general, be mindful that the number of tablets limits the
>>>>>>>> parallelism of reads in the current implementation. Increasing the
>>>>>>>> number of tablets significantly beyond the number of cores is
>>>>>>>> likely to have diminishing returns.
>>>>>>>>
>>>>>>>> I've read this a few times, but I am not sure I understand it
>>>>>>>> correctly. Let me use a concrete example.
>>>>>>>>
>>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>>> replication it is ~6Gb in a single tablet if I do not partition),
>>>>>>>> should I aim for 1Gb tablets (6 tablets before replication), or
>>>>>>>> should I aim for 500Mb tablets if my cluster capacity allows it
>>>>>>>> (12 tablets before replication)? I am confused why they say "at
>>>>>>>> least" and not "at most" - does it mean I should design it so a
>>>>>>>> tablet takes 2Gb or 3Gb in this example?
>>>>>>>>
>>>>>>>> Assume that I have tons of CPU cores on the cluster...
>>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>>> more tablets/partitions... In this example, 18 tablets gave me the
>>>>>>>> best timing, but the tablet size was around 300-400Mb. But the doc
>>>>>>>> says "at least 1Gb".
>>>>>>>>
>>>>>>>> I am really confused about what the doc is saying; please help.
>>>>>>>>
>>>>>>>> Boris
>>>>
>>>> --
>>>> Andrew Wong
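To pull the thread's rules of thumb together, a minimal Python sketch of the kind of automated partition-count heuristic Boris is asking for might look like the following. Everything here is illustrative: the function name and the thresholds (a ~4Gb target per tablet, based on Andrew's "multiple GBs works well" comment, and a 1Gb floor from the docs) are assumptions, not anything from Kudu's API or official guidance.

```python
import math

def pick_tablet_count(logical_size_gb, n_tablet_servers,
                      target_tablet_gb=4, min_tablet_gb=1):
    """Illustrative heuristic combining the thread's advice:
    - hot tables: at least as many tablets as tablet servers
    - aim for low-single-digit Gb per tablet (pre-replication)
    - small tables: never split below ~1 Gb per tablet
    """
    # Tablets needed to keep each one near the target size.
    by_size = math.ceil(logical_size_gb / target_tablet_gb)
    # Concurrency floor: roughly one tablet per tablet server.
    wanted = max(n_tablet_servers, by_size)
    # Cap so small tables are not shredded into tiny tablets.
    cap = max(1, math.floor(logical_size_gb / min_tablet_gb))
    return min(wanted, cap)

# Boris's example: 18Gb on disk after 3x replication => ~6Gb logical data.
print(pick_tablet_count(6, 20))    # 6: the 1Gb floor caps a small table
print(pick_tablet_count(500, 20))  # 125: a large fact table gets ~4Gb tablets

# Cliff's "strategy of twelves": 12 hash buckets x 12 monthly ranges.
tablets = 12 * 12          # 144 tablets per table
replicas = tablets * 3     # 432 replicas with 3x replication
print(replicas / 12)       # 36.0 replicas per node in a 12-node cluster
```

Note how the cap reproduces the tension discussed above: for small tables the "at least 1 GB per tablet" rule wins over "one tablet per tablet server", which matches the behavior Boris observed only up to the point where tablets become tiny.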