hey guys,

I asked the same question on Slack and got no responses. I went through
the docs, the design doc, and the FAQ and still did not find an answer.

Can someone comment?

Maybe I was not asking a clear question. If my cluster is large enough, in
my example above should I go with 3, 9, or 18 tablets? Or should I size
tablets to be closer to 1 GB?

And a follow-up question: if I have tons of smaller tables under 5 million
rows, should I just use 1 partition, or still break them into smaller
tablets for concurrency?

We cannot hand-pick a partitioning strategy for each table: we need to
stream 100s of tables, we reuse the PKs from the RDBMS, and we need an
automated way to pick the number of partitions/tablets. So far I have been
using the 1 GB rule, but I am rethinking it now for another project.
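
For reference, here is roughly the sizing logic I have been using so far
(just a sketch of my own approach, not something from the Kudu docs; the
1 GB target, the 64-core cap, and the function name are my own
assumptions):

# Sketch of how I derive the number of hash partitions per table.
# Assumptions (mine, not from the Kudu docs): I can estimate the
# pre-replication size of each table, I target ~1 GB per tablet,
# and I cap the tablet count at the number of cores in the cluster.
import math

def pick_num_tablets(estimated_size_gb, target_tablet_gb=1.0,
                     cluster_cores=64, min_tablets=1):
    """Aim for ~target_tablet_gb per tablet, capped at cluster_cores."""
    by_size = math.ceil(estimated_size_gb / target_tablet_gb)
    return max(min_tablets, min(by_size, cluster_cores))

# Example: a 6 GB table (pre-replication) with the defaults -> 6 tablets.
print(pick_num_tablets(6.0))

My question is basically whether that 1 GB target should be smaller (or
the core cap should be the main driver), given the test results quoted
below.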

On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com> wrote:

> forgot to post results of my quick test:
>
>  Kudu 1.5
>
> Table takes 18 GB of disk space after 3x replication
>
> Tablets   Tablet size   Query run time (sec)
> 3         2 GB          65
> 9         700 MB        27
> 18        350 MB        17
>
>
> On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com>
> wrote:
>
>> Hi guys,
>>
>> I just want to clarify the recommendations from the doc. It says:
>>
>> https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb
>>
>> Partitioning Rules of Thumb
>>
>>    - For large tables, such as fact tables, aim for as many tablets as
>>      you have cores in the cluster.
>>
>>    - For small tables, such as dimension tables, ensure that each tablet
>>      is at least 1 GB in size.
>>
>> In general, be mindful the number of tablets limits the parallelism of
>> reads, in the current implementation. Increasing the number of tablets
>> significantly beyond the number of cores is likely to have diminishing
>> returns.
>>
>>
>> I've read this a few times but I am not sure I understand it correctly.
>> Let me use a concrete example.
>>
>> If a table ends up taking 18 GB after replication (so with 3x
>> replication that is ~6 GB of data, which all sits in a single tablet if
>> I do not partition), should I aim for 1 GB tablets (6 tablets before
>> replication), or should I aim for 500 MB tablets if my cluster capacity
>> allows it (12 tablets before replication)? I am confused why the doc
>> says "at least" and not "at most" - does it mean I should design it so
>> that a tablet takes 2 GB or 3 GB in this example?
>>
>> Assume that I have tons of CPU cores on the cluster...
>> Based on my quick test, it seems that queries are faster when I have
>> more tablets/partitions. In this example, 18 tablets gave me the best
>> timing, but the tablet size was only around 300-400 MB, while the doc
>> says "at least 1 GB".
>>
>> I am really confused about what the doc is saying, please help.
>>
>> Boris
>>
>>
