Hi All,
How can I set up the Hive integration with Kudu tables? I'm new to this; please let me 
know what I need to install to achieve this.

Best Regards,
Jyothi Swaroop Bh.
HADOOP |BIDM |GIS |Applied Materials
EXTN: #9575-8876 | PH: +91-9051222662

From: Boris Tyukin <bo...@boristyukin.com>
Sent: Friday, March 13, 2020 8:15 PM
To: user@kudu.apache.org
Subject: [EXTERNAL] Re: Partitioning Rules of Thumb


very nice, Grant! this sounds really promising!

On Fri, Mar 13, 2020 at 9:52 AM Grant Henke <ghe...@cloudera.com> wrote:
Actually, I have a much longer wish list on the Impala side than on the Kudu side - it 
feels that Kudu itself is very close to Snowflake but lacks these 
self-management features. While Impala is certainly one of the fastest SQL 
engines I've seen, we struggle with multi-tenancy and complex queries that 
Hive would run like a champ.

There is an experimental Hive integration you could try. You may need to 
compile it yourself, depending on what version of Hive you are using. Feel free 
to reach out on Slack if you have issues. The details on the integration are here: 
https://cwiki.apache.org/confluence/display/Hive/Kudu+Integration
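In case it helps anyone setting this up: the wiki boils down to creating an external Hive table backed by Kudu's storage handler. Here is a rough sketch that just formats that DDL; the storage handler class and property names are as I read them off the wiki page, so double-check them against your Hive version, and the table/master names are placeholders.

```python
# Sketch only: format the CREATE EXTERNAL TABLE statement described on the
# Hive/Kudu integration wiki. The storage handler class and property names
# come from that page; verify them against your Hive version.

def kudu_hive_ddl(hive_table, kudu_table, master_addresses):
    """Return Hive DDL for an external table backed by an existing Kudu table."""
    return (
        f"CREATE EXTERNAL TABLE {hive_table}\n"
        "STORED BY 'org.apache.hadoop.hive.kudu.KuduStorageHandler'\n"
        "TBLPROPERTIES (\n"
        f"  'kudu.table_name' = '{kudu_table}',\n"
        f"  'kudu.master_addresses' = '{master_addresses}'\n"
        ");"
    )

# Placeholder names -- substitute your own Kudu table and master addresses.
print(kudu_hive_ddl("kudu_events", "impala::default.events", "kudu-master-1:7051"))
```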

On Fri, Mar 13, 2020 at 8:22 AM Boris Tyukin <bo...@boristyukin.com> wrote:
thanks Adar, I am a big fan of Kudu and see a lot of potential in it, and I do 
not think Snowflake deserves that much credit and noise. But they do have 
great ideas that take their engine to the next level. I do not know how it 
works architecture-wise, but here is the best doc I found regarding micro-partitions:
https://docs.snowflake.net/manuals/user-guide/tables-clustering-micropartitions.html

It is also appealing to me that they support nested data types with ease (and 
those are still very efficient to query and filter by), and that they scale 
workloads easily (Kudu requires manual rebalancing, which is VERY slow, as I 
learned last weekend by doing it on our dev cluster). Also, they support very 
large blobs and text fields, which Kudu does not.

Actually, I have a much longer wish list on the Impala side than on the Kudu side - it 
feels that Kudu itself is very close to Snowflake but lacks these 
self-management features. While Impala is certainly one of the fastest SQL 
engines I've seen, we struggle with multi-tenancy and complex queries that 
Hive would run like a champ.

As for Kudu's compaction process - it was largely an issue with 1.5 and I 
believe was addressed in 1.9. We had a terrible time for a few weeks when 
frequent updates/deletes slowed almost all our queries to a crawl. A lot of smaller 
tables had a huge number of tiny rowsets, causing massive scans and freezing the 
entire Kudu cluster. We had a good discussion on Slack and Todd Lipcon 
suggested a good workaround using flush_threshold_secs until we move to 1.9, and 
it worked fine. Nowhere in the documentation was it suggested to set this flag, 
and it was actually one of those "use at your own risk" flags.



On Thu, Mar 12, 2020 at 8:49 PM Adar Lieber-Dembo <a...@cloudera.com> wrote:
This has been an excellent discussion to follow, with very useful feedback. 
Thank you for that.

Boris, if I can try to summarize your position, it's that manual partitioning 
doesn't scale when dealing with hundreds of (small) tables and when you don't 
control the PK of each table. The Kudu schema design guide would advise you to 
hash partition such tables, and, following Andrew's recommendation, have at 
least one partition per tserver. Except that given enough tables, you'll 
eventually overwhelm any cluster based on the sheer number of partitions per 
tserver. And if you reduce the number of partitions per table, you open 
yourself up to potential hotspotting if the per-table load isn't even. Is that 
correct?

I can think of several improvements we could make here:
1. Support splitting/merging of range partitions, first manually and later 
automatically based on load. This is tracked in KUDU-437 and KUDU-441. Both are 
complicated, and since we've gotten a lot of mileage out of static range/hash 
partitioning, there hasn't been much traction on those JIRAs.
2. Improve server efficiency when hosting large numbers of replicas, so that 
you can blanket your cluster with maximally hash-partitioned tables. We did 
some work here a couple years ago (see KUDU-1967) but there's clearly much more 
that we can do. Of note, we previously discussed implementing Multi-Raft 
(https://issues.apache.org/jira/browse/KUDU-1913?focusedCommentId=15967031&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15967031).
3. Add support for consistent hashing. I'm not super familiar with the details 
but YugabyteDB (based on Kudu) has seen success here. Their recent blog post on 
sharding is a good (and related) read: 
https://blog.yugabyte.com/four-data-sharding-strategies-we-analyzed-in-building-a-distributed-sql-database/
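For anyone following along who hasn't seen consistent hashing before, here is a toy illustration of the technique itself -- not Kudu or YugabyteDB code, just the textbook idea -- showing that adding a node only remaps the slice of keys that lands on the new node:

```python
# Toy consistent-hash ring: nodes and keys share one hash space; a key
# belongs to the first node point clockwise from the key's hash position.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets `vnodes` points on the ring to even out the load.
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        h = _hash(key)
        idx = bisect.bisect(self._ring, (h, chr(0x10FFFF)))  # first point past h
        return self._ring[idx % len(self._ring)][1]

ring = HashRing(["ts1", "ts2", "ts3"])
bigger = HashRing(["ts1", "ts2", "ts3", "ts4"])  # same ring plus one node
keys = [f"row-{i}" for i in range(1000)]
moved = sum(ring.lookup(k) != bigger.lookup(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding a node")
```

The point of the exercise: every key that moved, moved to the new node -- nothing shuffles between the old nodes, which is what makes rebalancing cheap.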

Since you're clearly more familiar with Snowflake than I, can you comment more 
on how their partitioning system works? Sounds like it's quite automated; is 
that ever problematic when it decides to repartition during a workload? The 
YugabyteDB blog post talks about heuristics "reacting" to hotspotting by load 
balancing, and concludes that hotspots can move around too quickly for a load 
balancer to keep up.

Separately, you mentioned having to manage Kudu's compaction process. Could you 
go into more detail here?

On Wed, Mar 11, 2020 at 6:49 AM Boris Tyukin <bo...@boristyukin.com> wrote:
thanks Cliff, this is really good info. I am tempted to do the benchmarks 
myself but need to find a sponsor :) Snowflake has been getting A LOT of traction 
lately, and based on a few conversations in Slack, it looks like the Kudu team is 
not tracking the competition that much, and I think there are awesome features in 
Snowflake that would be great to have in Kudu.

I think they do support upserts now, as well as back-in-time queries - while they 
do not upsert/delete records in place, they just create new micro-partitions, and 
it is a metadata operation - that's how they can see data as of a given point in 
time. The idea of virtual warehouses is also very appealing, especially in 
healthcare, as we often need to share data with other departments or partners.

do not get me wrong, I decided to use Kudu for the same reasons, and I really 
did not have a choice on CDH other than HBase. But lately I get questioned about 
why we should not just give up CDH and go all-in on the cloud with cloud-native 
tech like Redshift and Snowflake. It is getting harder to explain why :)

My biggest gripe with Kudu, besides the known limitations, is the management of 
partitions and the compaction process. For our largest tables, we just max out the 
number of tablets Kudu allows per cluster. Smaller tables, though, are a 
challenge and more like an art, while I feel there should be an automated 
process with good defaults like in other data storage systems. No one expects 
to assign partitions manually in Snowflake, BigQuery or Redshift. No one is 
expected to tweak compaction parameters and deal with rapidly degrading performance 
over time on tables with a high number of deletes/updates. We are happy with 
performance, but it is not all about performance.

We also struggle on the Impala side, as it feels it is limiting what we can do with 
Kudu, and it feels like the Impala team is always behind. pyspark is another problem 
- the Kudu pyspark client is lagging behind the Java client and is very difficult to 
install/deploy. Some APIs are not even available in pyspark.

And honestly, I do not like that Impala is the only viable option with Kudu. A lot 
of the time we are forced to do heavy ETL in Hive, which is like a tank, but we 
cannot do that anymore with Kudu tables, and we struggle with disk-spilling issues, 
Impala choosing bad explain plans and forcing us to use the straight_join hint many 
times, etc.

Sorry if this post came across a bit on the negative side; I really like Kudu and 
the dev team behind it rocks. But I do not like where the industry is going and why 
Kudu is not getting the traction it deserves.

On Tue, Mar 10, 2020 at 9:03 PM Cliff Resnick <cre...@gmail.com> wrote:
This is a good conversation, but I don't think the comparison with Snowflake is 
a fair one, at least versus an older version of Snowflake (in my last job, about 
5 years ago, I pretty much single-handedly scale tested Snowflake in exchange 
for a sweetheart pricing deal). Though Snowflake is closed source, it seems 
pretty clear the architectures are quite different. Snowflake has no primary 
key index and no UPSERT capability, features that make Kudu valuable for some use 
cases.

It also seems to me that their intended workloads are quite different. 
Snowflake is great for intensive analytics on demand, and can handle deeply 
nested data very well, which Kudu can't handle at all. Snowflake is not 
designed for heavy concurrency, but for complex query plans for a small group of 
users. If you select an x-large Snowflake cluster, it's probably because you 
have a large amount of data to churn through, not because you have a large 
number of users. Or at least that's how we used it.

At my current workplace we use Kudu/Impala to handle about 30-60 concurrent 
queries. I agree that getting very fussy about partitioning can be a pain, but 
for the large fact tables we generally use a simple strategy of twelves: 12 
hash x 12 (month) ranges in two 12-node clusters fronted by a load balancer. 
We're in AWS, and thanks to Kudu's replication we can use "for free" 
instance-store NVMe. We also have associated stateless, compute-oriented Impala 
Spot Fleet clusters for HLL and other compute-oriented queries.
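Multiplying out that strategy of twelves, under my own assumptions that RF=3 and the table lives in one 12-node cluster (Cliff may mean something slightly different with the two clusters):

```python
# Back-of-the-envelope tablet math for "12 hash x 12 (month) ranges"
# on a 12-node cluster with replication factor 3.
hash_buckets = 12
range_partitions = 12          # one range partition per month
replication_factor = 3
nodes = 12

tablets = hash_buckets * range_partitions      # logical partitions of the table
replicas = tablets * replication_factor        # physical tablet replicas
replicas_per_node = replicas / nodes
print(tablets, replicas, replicas_per_node)
```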

The net result blows away what we had with RedShift at less than 1/3 the cost, 
with performance improvements mostly from better concurrency handling. This is 
despite the fact that RedShift has a built-in cache. We also use streaming 
ingestion which, aside from being impossible with RedShift, removes the added 
cost of staging.

Getting back to Snowflake, there's no way we could use it the same way we use 
Kudu, and even if we could, the cost would probably put us out of business!

On Tue, Mar 10, 2020, 10:59 AM Boris Tyukin <bo...@boristyukin.com> wrote:
thanks Andrew for taking the time to respond to me. It seems that there are no 
exact recommendations.

I did look at the scaling recommendations, but that math is extremely complicated, 
and I do not think anyone will know all the answers to plug into that 
calculation. We really have no control over what users are doing, how many queries 
they run, how many are hot vs cold, etc. It is not realistic IMHO to expect that 
knowledge of user query patterns.

I do like the Snowflake approach, where the engine takes care of defaults and can 
estimate the number of micro-partitions and even repartition tables as they 
grow. I feel Kudu has the same capabilities, as the design is very similar. I 
really do not like picking a random number of buckets. Also, we manage 100s of 
tables; I cannot look at each one, one by one, to make these decisions. Does it 
make sense?


On Mon, Mar 9, 2020 at 4:42 PM Andrew Wong <aw...@cloudera.com> wrote:
Hey Boris,

Sorry you didn't have much luck on Slack. I know partitioning in general can be 
tricky; thanks for the question. Left some thoughts below:

Maybe I was not asking a clear question. If my cluster is large enough in my 
example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to 
be closer to 1Gb?
And a follow-up question, if I have tons of smaller tables under 5 million 
rows, should I just use 1 partition or still break them on smaller tablets for 
concurrency?

Per your numbers, this confirms that partitions are the units of 
concurrency here, and that therefore having more, smaller partitions 
yields a concurrency bump. That said, extending a scheme of smaller partitions 
across all tables may not scale when thinking about the total number of 
partitions cluster-wide.

There are some trade-offs with the replica count per tablet server here -- 
generally, each tablet replica has a resource cost on tablet servers: WALs and 
tablet-related metadata use a shared disk (if you can put this on an SSD, I 
would recommend doing so), each tablet introduces some Raft-related RPC 
traffic, each tablet replica introduces some maintenance operations in the pool 
of background operations to be run, etc.

Your point about scan concurrency is certainly a valid one -- there have been 
patches for other integrations that have tackled this to decouple partitioning 
from scan concurrency. See KUDU-2437 (https://issues.apache.org/jira/browse/KUDU-2437) 
and KUDU-2670 (https://issues.apache.org/jira/browse/KUDU-2670), where Kudu's 
Spark integration will split range scans into smaller-scoped scan tokens to be 
run concurrently, though this optimization hasn't made its way into Impala yet. 
I filed KUDU-3071 (https://issues.apache.org/jira/browse/KUDU-3071) to track 
what I think is left on the Kudu side to get this up and running, so that it can 
be worked into Impala.
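As a rough mental model of those scan tokens -- with hypothetical helper names, not the real Kudu/Spark API -- one big key range gets split into contiguous tokens that can be scanned in parallel, independently of how the table is partitioned:

```python
# Illustration only: split one key range into "scan tokens" and run them
# concurrently. Stand-in for what KUDU-2437/KUDU-2670 do with real scan tokens.
from concurrent.futures import ThreadPoolExecutor

def split_scan(lower, upper, num_tokens):
    """Split [lower, upper) into num_tokens contiguous sub-ranges."""
    step, rem = divmod(upper - lower, num_tokens)
    tokens, start = [], lower
    for i in range(num_tokens):
        end = start + step + (1 if i < rem else 0)
        tokens.append((start, end))
        start = end
    return tokens

def scan(token, rows):
    lo, hi = token
    return [r for r in rows if lo <= r < hi]

rows = list(range(100))
tokens = split_scan(0, 100, 8)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(scan, tokens, [rows] * len(tokens)))
total = sum(len(r) for r in results)
print(f"{len(tokens)} tokens scanned {total} rows")
```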

For now, I would try to take into account the total sum of resources you have 
available to Kudu (including the number of tablet servers, amount of storage per 
node, number of disks per tablet server, and type of disk for the WAL/metadata 
disks) to settle on roughly how many tablet replicas your system can handle 
(the scaling guide (https://kudu.apache.org/docs/scaling_guide.html) may be 
helpful here), and hopefully that, along with your own SLAs per table, 
can help guide how you partition your tables.
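A back-of-the-envelope version of that sizing exercise; the 1000-replicas-per-tserver planning budget below is a made-up number for illustration, not an official limit (the scaling guide covers the real constraints):

```python
# Rough planning arithmetic: how many tablets the cluster can host, and a
# naive even split of that budget across tables. The per-tserver replica
# budget is an assumed planning number, not a Kudu-documented limit.

def replica_budget(num_tservers, replicas_per_tserver_budget=1000, rf=3):
    """Cluster-wide tablet (logical partition) budget given a replica budget."""
    total_replicas = num_tservers * replicas_per_tserver_budget
    return total_replicas // rf

def max_partitions_per_table(num_tables, num_tservers, **kw):
    """Split the cluster-wide tablet budget evenly across tables."""
    return max(1, replica_budget(num_tservers, **kw) // num_tables)

# e.g. 20 tablet servers hosting 300 streamed tables:
print(max_partitions_per_table(300, 20))
```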

confused why they say "at least" not "at most" - does it mean I should design 
it so a tablet takes 2Gb or 3Gb in this example?

Aiming for 1GB seems a bit low; Kudu should be able to handle in the low tens 
of GB per tablet replica, though exact perf obviously depends on your workload. 
As you show and as pointed out in the documentation, larger and fewer tablets can 
limit the amount of concurrency for writes and reads, though we've seen 
multiple GBs work relatively well for many use cases while weighing the 
above-mentioned trade-offs with replica count.

It is recommended that new tables which are expected to have heavy read and 
write workloads have at least as many tablets as tablet servers.
if I have 20 tablet servers and I have two tables - one with 1MM rows and 
another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 
because of replication)?

The recommendation here is to have at least 20 logical partitions per table. 
That way, if a scan were to touch a table's entire keyspace, the table scan would 
be broken up into 20 tablet scans, and each of those might land on a different 
tablet server running on isolated hardware. For a significantly larger table 
into which you expect highly concurrent workloads, the recommendation serves as 
a lower bound -- I'd recommend having more partitions, and if your data is 
naturally time-oriented, consider range-partitioning on timestamp.
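Spelling out the arithmetic, since the divide-by-3 question comes up twice in this thread: the guideline counts logical partitions, and replication multiplies replicas -- it does not reduce the partition count. A tiny sketch with the 20-tserver example (the 1GB target size is just the rule of thumb from this thread):

```python
# Lower-bound partition count for a hot table: at least one partition per
# tablet server, growing with table size. Replication factor never enters
# the formula -- it governs replicas, not logical partitions.
import math

def recommended_partitions(num_tservers, table_size_bytes, target_tablet_bytes=2**30):
    by_size = math.ceil(table_size_bytes / target_tablet_bytes)
    return max(num_tservers, by_size)

GB = 2**30
small = recommended_partitions(20, 1 * GB)    # e.g. the 1MM-row table
large = recommended_partitions(20, 100 * GB)  # e.g. the 100MM-row table
print(small, large)
```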

On Sat, Mar 7, 2020 at 7:13 AM Boris Tyukin <bo...@boristyukin.com> wrote:
just saw this in the docs, but it is still a confusing statement:
No Default Partitioning
Kudu does not provide a default partitioning strategy when creating tables. It 
is recommended that new tables which are expected to have heavy read and write 
workloads have at least as many tablets as tablet servers.

if I have 20 tablet servers and I have two tables - one with 1MM rows and 
another one with 100MM rows, do I pick 20 / 3 partitions for both (divide by 3 
because of replication)?




On Sat, Mar 7, 2020 at 9:52 AM Boris Tyukin <bo...@boristyukin.com> wrote:
hey guys,

I asked the same question on Slack and got no responses. I just went through 
the docs, the design doc, and the FAQ and still did not find an answer.

Can someone comment?

Maybe I was not asking a clear question. If my cluster is large enough in my 
example above, should I go with 3, 9 or 18 tablets? or should I pick tablets to 
be closer to 1Gb?

And a follow-up question, if I have tons of smaller tables under 5 million 
rows, should I just use 1 partition or still break them on smaller tablets for 
concurrency?

We cannot pick the partitioning strategy for each table, as we need to stream 
100s of tables; we use the PK from the RDBMS and need to come up with an automated 
way to pick the number of partitions/tablets. So far I was using the 1Gb rule but 
am rethinking this now for another project.
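For what it's worth, the automated "1Gb rule" can be written down directly; the clamp values here are our own choices for illustration, not anything official:

```python
# Pick a hash-bucket count for a table from its estimated size: roughly one
# bucket per target_tablet_bytes, clamped so tiny tables still get a couple
# of tablets and huge tables don't blow the cluster-wide tablet budget.

def pick_hash_buckets(estimated_size_bytes, target_tablet_bytes=2**30,
                      min_buckets=2, max_buckets=60):
    buckets = max(1, round(estimated_size_bytes / target_tablet_bytes))
    return min(max(buckets, min_buckets), max_buckets)

MB, GB = 2**20, 2**30
for size in (50 * MB, 5 * GB, 500 * GB):
    print(size, "->", pick_hash_buckets(size))
```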

On Tue, Sep 24, 2019 at 4:29 PM Boris Tyukin <bo...@boristyukin.com> wrote:
forgot to post results of my quick test:

 Kudu 1.5

Table takes 18Gb of disk space after 3x replication

Tablets | Tablet Size | Query run time, sec
3       | 2Gb         | 65
9       | 700Mb       | 27
18      | 350Mb       | 17



On Tue, Sep 24, 2019 at 3:58 PM Boris Tyukin <bo...@boristyukin.com> wrote:
Hi guys,

just want to clarify recommendations from the doc. It says:
https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_rules_of_thumb

Partitioning Rules of Thumb

  *   For large tables, such as fact tables, aim for as many tablets as you 
have cores in the cluster.
  *   For small tables, such as dimension tables, ensure that each tablet is at 
least 1 GB in size.

In general, be mindful the number of tablets limits the parallelism of reads, 
in the current implementation. Increasing the number of tablets significantly 
beyond the number of cores is likely to have diminishing returns.



I've read this a few times, but I am not sure I understand it correctly. Let me 
use a concrete example.

If a table ends up taking 18Gb after replication (so with 3x replication that is 
~6Gb of data, in a single tablet if I do not partition), should I aim for 1Gb 
tablets (6 tablets before replication), or should I aim for 500Mb tablets if my 
cluster capacity allows (12 tablets before replication)? I am confused why they 
say "at least" and not "at most" - does it mean I should design it so a tablet 
takes 2Gb or 3Gb in this example?
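The arithmetic behind this example: 18Gb on disk divided by replication factor 3 is ~6Gb of logical data, which is where the 6 and 12 tablet counts come from:

```python
# Tablet counts for the 18Gb example: divide out replication first,
# then divide the logical size by the target tablet size.

def tablets_for_target(on_disk_bytes, replication_factor, target_tablet_bytes):
    """Pre-replication tablet count needed to hit a target tablet size."""
    logical = on_disk_bytes / replication_factor
    return round(logical / target_tablet_bytes)

GB = 2**30
print(tablets_for_target(18 * GB, 3, 1 * GB))    # 1Gb tablets
print(tablets_for_target(18 * GB, 3, 0.5 * GB))  # 500Mb tablets
```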

Assume that I have tons of CPU cores on a cluster...
Based on my quick test, it seems that queries are faster if I have more 
tablets/partitions... In this example, 18 tablets gave me the best timing, but 
the tablet size was around 300-400Mb. But the doc says "at least 1Gb".

Really confused about what the doc is saying, please help

Boris



--
Andrew Wong


--
Grant Henke
Software Engineer | Cloudera
gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

