Re: Partitioning Rules of Thumb

Boris Tyukin Sat, 25 Apr 2020 11:57:42 -0700

Cliff, i would be extremely interested to see a blog post to compare
Snowflake, Redshift and Impala/Kudu since you tried all of them.


would love to get some details how you set up Kudu/Impala cluster on AWS as
well as my company might be heading the same direction. this does not mean
much to me as we are not using cloud but I hope you can elaborate on your
setup "12 hash x a12 (month) ranges in two 12-node clusters fronted by a
load balancer. We're in AWS, and thanks to Kudu's replication we can use
"for free" instance-store NVMe. We also have associated compute-oriented
stateless Impala Spot Fleet clusters for HLL and other compute oriented
queries. "

I cannot really find anything on the web that would compare Impala/Kudu to
Snowflake and Redshift. Everything I see is about Snowflake, Redshift and
BigQuery.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <[email protected]> wrote:

> This is a good conversation but I don't think the comparison with
> Snowflake is a fair one, at least from an older version of Snowflake (In my
> last job, about 5 years ago, I pretty much single-handedly scale tested
> Snowflake in exchange for a sweetheart pricing deal) . Though Snowflake is
> closed source, it seems pretty clear the architectures are quite different.
> Snowflake has no primary key index, no UPSERT capability, features that
> make Kudu valuable for some use cases.
>
> It also seems to me that their intended workloads are quite different.
> Snowflake is great for intensive analytics on demand, and can handle deeply
> nested data very well, where Kudu can't handle that at all. Snowflake is
> not designed for heavy concurrency, but complex query plans for a small
> group of users. If you select an x-large Snowflake cluster it's probably
> because you have a large amount of data to churn through, not because you
> have a large number of users. Or, at least that's how we used it.
>
> At my current workplace we use Kudu/Impala to handle about 30-60
> concurrent queries. I agree that getting very fussy about partitioning can
> be a pain, but for the large fact tables we generally use a simple strategy
> of twelves:  12 hash x a12 (month) ranges in two 12-node clusters fronted
> by a load balancer. We're in AWS, and thanks to Kudu's replication we can
> use "for free" instance-store NVMe. We also have associated
> compute-oriented stateless Impala Spot Fleet clusters for HLL and other
> compute oriented queries.
>
> The net result blows away what we had with RedShift at less than 1/3 the
> cost, with performance improvements mostly from better concurrency
> handling. This is despite the fact that RedShift has built-in cache. We
> also use streaming ingestion which, aside from being impossible with
> RedShift, removes the added cost of staging.
>
> Getting back to Snowflake, there's no way we could use it the same way we
> use Kudu, and even if we could, the cost would would probably put us out of
> business!
>
> On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <[email protected]> wrote:
>
>> thanks Andrew for taking your time responding to me. It seems that there
>> are no exact recommendations.
>>
>> I did look at scaling recommendations but that math is extremely
>> complicated and I do not think anyone will know all the answers to plug
>> into that calculation. We have no control really what users are doing, how
>> many queries they run, how many are hot vs cold etc. It is not realistic
>> IMHO to expect that knowledge of user query patterns.
>>
>> I do like the Snowflake approach than the engine takes care of defaults
>> and can estimate the number of micro-partitions and even repartition tables
>> as they grow. I feel Kudu has the same capabilities as the design is very
>> similar. I really do not like to pick random number of buckets. Also we
>> manager 100s of tables, I cannot look at them each one by one to make these
>> decisions. Does it make sense?
>>
>>
>> On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <[email protected]> wrote:
>>
>>> Hey Boris,
>>>
>>> Sorry you didn't have much luck on Slack. I know partitioning in general
>>> can be tricky; thanks for the question. Left some thoughts below:
>>>
>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>> tablets to be closer to 1Gb?
>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>> million rows, should I just use 1 partition or still break them on smaller
>>>> tablets for concurrency?
>>>
>>>
>>> Per your numbers, this confirms that the partitions are the units of
>>> concurrency here, and that therefore having more and having smaller
>>> partitions yields a concurrency bump. That said, extending a scheme of
>>> smaller partitions across all tables may not scale when thinking about the
>>> total number of partitions cluster-wide.
>>>
>>> There are some trade offs with the replica count per tablet server here
>>> -- generally, each tablet replica has a resource cost on tablet servers:
>>> WALs and tablet-related metadata use a shared disk (if you can put this on
>>> an SSD, I would recommend doing so), each tablet introduces some
>>> Raft-related RPC traffic, each tablet replica introduces some maintenance
>>> operations in the pool of background operations to be run, etc.
>>>
>>> Your point about scan concurrency is certainly a valid one -- there have
>>> been patches for other integrations that have tackled this to decouple
>>> partitioning from scan concurrency (KUDU-2437
>>> <https://issues.apache.org/jira/browse/KUDU-2437> and KUDU-2670
>>> <https://issues.apache.org/jira/browse/KUDU-2670> are an example, where
>>> Kudu's Spark integration will split range scans into smaller-scoped scan
>>> tokens to be run concurrently, though this optimization hasn't made its way
>>> into Impala yet). I filed KUDU-3071
>>> <https://issues.apache.org/jira/browse/KUDU-3071> to track what I think
>>> is left on the Kudu-side to get this up and running, so that it can be
>>> worked into Impala.
>>>
>>> For now, I would try to take into account the total sum of resources you
>>> have available to Kudu (including number of tablet servers, amount of
>>> storage per node, number of disks per tablet server, type of disk for the
>>> WAL/metadata disks), to settle on roughly how many tablet replicas your
>>> system can handle (the scaling guide
>>> <https://kudu.apache.org/docs/scaling_guide.html> may be helpful here),
>>> and hopefully that, along with your own SLAs per table, can help guide how
>>> you partition your tables.
>>>
>>> confused why they say "at least" not "at most" - does it mean I should
>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>
>>>
>>> Aiming for 1GB seems a bit low; Kudu should be able to handle in the low
>>> tens of GB per tablet replica, though exact perf obviously depends on your
>>> workload. As you show and as pointed out in documentation, larger and fewer
>>> tablets can limit the amount of concurrency for writes and reads, though
>>> we've seen multiple GBs works relatively well for many use cases while
>>> weighing the above mentioned tradeoffs with replica count.
>>>
>>> It is recommended that new tables which are expected to have heavy read
>>>> and write workloads have at least as many tablets as tablet servers.
>>>
>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>
>>>
>>> The recommendation here is to have at least 20 logical partitions per
>>> table. That way, a scan were to touch a table's entire keyspace, the table
>>> scan would be broken up into 20 tablet scans, and each of those might land
>>> on a different tablet server running on isolated hardware. For a
>>> significantly larger table into which you expect highly concurrent
>>> workloads, the recommendation serves as a lower bound -- I'd recommend
>>> having more partitions, and if your data is naturally time-oriented,
>>> consider range-partitioning on timestamp.
>>>
>>> On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <[email protected]>
>>> wrote:
>>>
>>>> just saw this in the docs but it is still confusing statement
>>>> No Default Partitioning
>>>> Kudu does not provide a default partitioning strategy when creating
>>>> tables. It is recommended that new tables which are expected to have heavy
>>>> read and write workloads have at least as many tablets as tablet servers.
>>>>
>>>>
>>>> if I have 20 tablet servers and I have two tables - one with 1MM rows
>>>> and another one with 100MM rows, do I pick 20 / 3 partitions for both
>>>> (divide by 3 because of replication)?
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <[email protected]>
>>>> wrote:
>>>>
>>>>> hey guys,
>>>>>
>>>>> I asked the same question on Slack on and got no responses. I just
>>>>> went through the docs and design doc and FAQ and still did not find an
>>>>> answer.
>>>>>
>>>>> Can someone comment?
>>>>>
>>>>> Maybe I was not asking a clear question. If my cluster is large enough
>>>>> in my example above, should I go with 3, 9 or 18 tablets? or should I pick
>>>>> tablets to be closer to 1Gb?
>>>>>
>>>>> And a follow-up question, if I have tons of smaller tables under 5
>>>>> million rows, should I just use 1 partition or still break them on smaller
>>>>> tablets for concurrency?
>>>>>
>>>>> We cannot pick the partitioning strategy for each table as we need to
>>>>> stream 100s of tables and we use PK from RBDMS and need to come with an
>>>>> automated way to pick number of partitions/tablets. So far I was using 1Gb
>>>>> rule but rethinking this now for another project.
>>>>>
>>>>> On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> forgot to post results of my quick test:
>>>>>>
>>>>>>  Kudu 1.5
>>>>>>
>>>>>> Table takes 18Gb of disk space after 3x replication
>>>>>>
>>>>>> Tablets Tablet Size Query run time, sec
>>>>>> 3 2Gb 65
>>>>>> 9 700Mb 27
>>>>>> 18 350Mb 17
>>>>>> [image: image.png]
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> just want to clarify recommendations from the doc. It says:
>>>>>>>
>>>>>>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>>>>>>
>>>>>>> Partitioning Rules of Thumb
>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>> <https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb>
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    For large tables, such as fact tables, aim for as many tablets
>>>>>>>    as you have cores in the cluster.
>>>>>>>    -
>>>>>>>
>>>>>>>    For small tables, such as dimension tables, ensure that each
>>>>>>>    tablet is at least 1 GB in size.
>>>>>>>
>>>>>>> In general, be mindful the number of tablets limits the parallelism
>>>>>>> of reads, in the current implementation. Increasing the number of 
>>>>>>> tablets
>>>>>>> significantly beyond the number of cores is likely to have diminishing
>>>>>>> returns.
>>>>>>>
>>>>>>>
>>>>>>> I've read this a few times but I am not sure I understand it
>>>>>>> correctly. Let me use concrete example.
>>>>>>>
>>>>>>> If a table ends up taking 18Gb after replication (so with 3x
>>>>>>> replication it is ~9Gb per tablet if I do not partition), should I aim 
>>>>>>> for
>>>>>>> 1Gb tablets (6 tablets before replication) or should I aim for 500Mb
>>>>>>> tablets if my cluster capacity allows so (12 tablets before 
>>>>>>> replication)?
>>>>>>> confused why they say "at least" not "at most" - does it mean I should
>>>>>>> design it so a tablet takes 2Gb or 3Gb in this example?
>>>>>>>
>>>>>>> Assume that I have tons of CPU cores on a cluster...
>>>>>>> Based on my quick test, it seems that queries are faster if I have
>>>>>>> more tablets/partitions...In this example, 18 tablets gave me the best
>>>>>>> timing but tablet size was around 300-400Mb. But the doc says "at least
>>>>>>> 1Gb".
>>>>>>>
>>>>>>> Really confused what the doc is saying, please help
>>>>>>>
>>>>>>> Boris
>>>>>>>
>>>>>>>
>>>
>>> --
>>> Andrew Wong
>>>
>>

Re: Partitioning Rules of Thumb

Reply via email to