Thanks for that, glad I was wrong there! Aside from replication considerations, 
is it also recommended the number of tablet servers be odd?

I will check the forums as you suggested, but from what I have read so far, 
Impala relies on user-configured caching strategies using HDFS cache.  The 
write workload for these tables is very light, maybe a dozen or so records per 
hour across 6 or 7 tables. The tables range in size from thousands to low 
millions of rows, so sub-partitioning would not be required. So perhaps this 
is not a typical use case, but I think it could work quite well with Kudu.

From: Dan Burkert <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Friday, March 16, 2018 at 2:09 PM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: "broadcast" tablet replication for kudu?

The replication count is the number of tablet servers which Kudu will host 
copies on.  So if you set the replication level to 5, Kudu will put the data on 
5 separate tablet servers.  There's no built-in broadcast table feature; upping 
the replication factor is the closest thing.  A couple of things to keep in 
mind:

- Always use an odd replication count.  This is important due to how the Raft 
algorithm works.  Recent versions of Kudu won't even let you specify an even 
number without flipping some flags.
- We don't test much beyond 5 replicas.  It should work, but you may run 
into issues since it's a relatively rare configuration.  With a heavy write 
workload and many replicas you are even more likely to encounter issues.
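
To illustrate the point about odd replication counts: the sketch below 
assumes the usual Raft rule that a write needs acknowledgement from a strict 
majority of replicas. The function names are made up for this example and are 
not part of any Kudu API.

```python
def raft_majority(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - raft_majority(replicas)


if __name__ == "__main__":
    # Compare odd and even replica counts side by side.
    for n in range(3, 8):
        print(f"{n} replicas: majority={raft_majority(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Under that assumption, 4 replicas tolerate the same single failure as 3, and 
6 tolerate the same two failures as 5, so an even count adds a copy to keep in 
sync without adding fault tolerance.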

It's also worth checking in an Impala forum whether it has features that make 
joins against small broadcast tables better.  Perhaps Impala can cache small 
tables locally when doing joins.

- Dan

On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick 
<[email protected]<mailto:[email protected]>> wrote:
The problem is, AFAIK, that replication count is not necessarily the 
distribution count, so you can't guarantee all tablet servers will have a copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin 
<[email protected]<mailto:[email protected]>> wrote:
I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We have a 
few tables that are small but used a lot. My plan is to replicate them more than 
3 times. When you create a Kudu table, you can specify the number of replicated 
copies (3 by default), and I guess you can put a number there that corresponds 
to the node count in your cluster. The downside is that you cannot change that 
number unless you recreate the table.

On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick 
<[email protected]<mailto:[email protected]>> wrote:
We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One 
Redshift feature that we will miss is its ALL Distribution, where a copy of a 
table is maintained on each server. We define a number of metadata tables this 
way since they are used in nearly every query. We are considering using Parquet 
in HDFS cache for these. Kudu would be a much better fit for the update 
semantics, but we are worried about the additional contention.  I'm wondering if 
having a Broadcast, or ALL, tablet replication might be an easy feature to add 
to Kudu?

-Cliff

