On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]> wrote:
> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing --
> it's much appreciated! I just have one last follow-up question. Reading
> about BROADCAST operations (Kudu, Spark, Flink, etc.) it seems the smaller
> table is always copied in its entirety BEFORE the predicate is evaluated.
>

That's not quite true. If you have a predicate on a joined column, or on one
of the columns in the joined table, it will be pushed down to the "scan"
operator, which happens before the "exchange". In addition, there is a
feature called "runtime filters" that can push dynamically generated filters
from one side of the exchange to the other.

> But since the Kudu client provides a serialized scanner as part of the
> ScanToken API, why wouldn't Impala use that instead if it knows that the
> table is Kudu and the query has any type of predicate? Perhaps if I
> hash-partition the table I could force this (because that complicates a
> BROADCAST)? I guess this is really a question for Impala, but perhaps there
> is a more basic reason.
>

Impala could definitely be smarter; it's just a matter of programming
Kudu-specific join strategies into the optimizer. Today, the optimizer isn't
aware of the unique properties of Kudu scans vs. other storage mechanisms.

-Todd

> -Cliff
>
> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]> wrote:
>
>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>> [email protected]> wrote:
>>
>>> I thought I had read that the Kudu client can configure a scan for
>>> CLOSEST_REPLICA and assumed this was a way to take advantage of data
>>> collocation.
>>>
>>
>> Yeah, when a client uses CLOSEST_REPLICA it will read a local replica if
>> one is available. However, that doesn't influence the higher-level
>> operation of the Impala (or Spark) planner. The planner isn't aware of the
>> replication policy, so it will use one of the existing supported JOIN
>> strategies. Given statistics, it will choose to broadcast the small table,
>> which means it will create a plan that looks like:
>>
>>                 +-----------------------------+
>>                 |                             |
>>      +--------->| build     JOIN        probe |
>>      |          |                             |
>>      |          +-----------------------------+
>>      |                                    ^
>>      |                                    |
>> +----+---------+                          |
>> |   Exchange   |                          |
>> |  (broadcast) |                          |
>> +--------------+                          |
>>        ^                                  |
>>        |                                  |
>> +------+------+           +---------------+------+
>> |    SCAN     |           |                      |
>> |    KUDU     |           |  SCAN (other side)   |
>> |             |           |                      |
>> +-------------+           +----------------------+
>>
>> (hopefully the ASCII art comes through)
>>
>> In other words, the "scan kudu" operator scans the table once, and then
>> replicates the results of that scan into the JOIN operator. The "scan
>> kudu" operator will of course read its local copy, but it will still go
>> through the exchange process.
>>
>> For the use case you're talking about, where the join is just looking up
>> a single row by PK in a dimension table, ideally we'd be using an
>> altogether different join strategy such as a nested-loop join, with the
>> inner "loop" actually being a Kudu PK lookup, but that strategy isn't
>> implemented by Impala.
>>
>> -Todd
>>
>>> If this exists, then how far out of context is my understanding of it?
>>> Reading about HDFS cache replication, I do know that Impala will choose
>>> a random replica there to more evenly distribute load. But especially
>>> compared to Kudu upsert, managing mutable data using Parquet is painful.
>>> So, perhaps to sum things up: if nearly 100% of my metadata scans are
>>> single primary-key lookups followed by a tiny broadcast, then am I
>>> really just splitting hairs performance-wise between Kudu and
>>> HDFS-cached Parquet?
>>>
>>> From: Todd Lipcon <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, March 16, 2018 at 2:51 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>
>>> It's worth noting that, even if your table is replicated, Impala's
>>> planner is unaware of this fact and will produce the same plan
>>> regardless. That is to say, rather than every node scanning its local
>>> copy, a single node will perform the whole scan (assuming it's a small
>>> table) and broadcast it from there within the scope of a single query.
>>> So, I don't think you'll see any performance improvement on Impala
>>> queries by attempting something like an extremely high replication
>>> count.
>>>
>>> I could see bumping the replication count to 5 for these tables, since
>>> the extra storage cost is low and it will ensure higher availability of
>>> the important central tables, but I'd be surprised if there is any
>>> measurable perf impact.
>>>
>>> -Todd
>>>
>>> On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <
>>> [email protected]> wrote:
>>>
>>>> Thanks for that, glad I was wrong there! Aside from replication
>>>> considerations, is it also recommended that the number of tablet
>>>> servers be odd?
>>>>
>>>> I will check the forums as you suggested, but from what I read after
>>>> searching, Impala relies on user-configured caching strategies using
>>>> the HDFS cache. The workload for these tables is very light on writes,
>>>> maybe a dozen or so records per hour across 6 or 7 tables. The size of
>>>> the tables ranges from thousands to low millions of rows, so
>>>> sub-partitioning would not be required. So perhaps this is not a
>>>> typical use case, but I think it could work quite well with Kudu.
>>>>
>>>> From: Dan Burkert <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Friday, March 16, 2018 at 2:09 PM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>
>>>> The replication count is the number of tablet servers on which Kudu
>>>> will host copies of the data. So if you set the replication level to 5,
>>>> Kudu will put the data on 5 separate tablet servers. There's no
>>>> built-in broadcast table feature; upping the replication factor is the
>>>> closest thing. A couple of things to keep in mind:
>>>>
>>>> - Always use an odd replication count. This is important due to how the
>>>> Raft algorithm works. Recent versions of Kudu won't even let you
>>>> specify an even number without flipping some flags.
>>>> - We don't test much beyond 5 replicas. It *should* work, but you may
>>>> run into issues since it's a relatively rare configuration. With a
>>>> heavy write workload and many replicas, you are even more likely to
>>>> encounter issues.
>>>>
>>>> It's also worth checking in an Impala forum whether it has features
>>>> that make joins against small broadcast tables better. Perhaps Impala
>>>> can cache small tables locally when doing joins.
>>>>
>>>> - Dan
>>>>
>>>> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <
>>>> [email protected]> wrote:
>>>>
>>>>> The problem is, AFAIK, that the replication count is not necessarily
>>>>> the distribution count, so you can't guarantee that all tablet servers
>>>>> will have a copy.
>>>>>
>>>>> On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:
>>>>> I'm new to Kudu, but we are also going to use Impala mostly with Kudu.
>>>>> We have a few tables that are small but used a lot. My plan is to
>>>>> replicate them more than 3 times. When you create a Kudu table, you
>>>>> can specify the number of replicated copies (3 by default), and I
>>>>> guess you can set that to match the node count of your cluster. The
>>>>> downside is that you cannot change that number unless you recreate the
>>>>> table.
>>>>>
>>>>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We will soon be moving our analytics from AWS Redshift to
>>>>>> Impala/Kudu. One Redshift feature that we will miss is its ALL
>>>>>> distribution, where a copy of a table is maintained on each server.
>>>>>> We define a number of metadata tables this way since they are used in
>>>>>> nearly every query. We are considering using Parquet in the HDFS
>>>>>> cache for these; Kudu would be a much better fit for the update
>>>>>> semantics, but we are worried about the additional contention. I'm
>>>>>> wondering if a broadcast, or ALL, tablet replication mode might be an
>>>>>> easy feature to add to Kudu?
>>>>>>
>>>>>> -Cliff
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>

--
Todd Lipcon
Software Engineer, Cloudera
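
A few client-level sketches of the Kudu APIs discussed in this thread, using
the Kudu Java client. First, the predicate pushdown Todd describes: a
predicate attached to a Kudu scan is evaluated by the tablet server, so a
single-row primary-key lookup against a dimension table returns just that
row. This is only a minimal sketch; the master address, table name
("metadata_dim"), and key column ("id") are hypothetical.

    import org.apache.kudu.client.*;

    public class PkLookupSketch {
      public static void main(String[] args) throws KuduException {
        // Placeholder master address; substitute your cluster's masters.
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          KuduTable table = client.openTable("metadata_dim");  // hypothetical table

          // Equality predicate on the primary key column. Kudu evaluates it
          // server-side, so only the matching row crosses the wire.
          KuduPredicate byPk = KuduPredicate.newComparisonPredicate(
              table.getSchema().getColumn("id"),                // hypothetical PK column
              KuduPredicate.ComparisonOp.EQUAL, 42L);

          KuduScanner scanner = client.newScannerBuilder(table)
              .addPredicate(byPk)
              .build();

          while (scanner.hasMoreRows()) {
            for (RowResult row : scanner.nextRows()) {
              System.out.println(row.rowToString());
            }
          }
        }
      }
    }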
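
The ScanToken API Cliff mentions hands back serialized scan descriptors that
a planner process can distribute to workers; each worker turns its token back
into a ready-to-run scanner, with any predicates already attached. A rough
sketch of that round trip, under the same hypothetical names as above:

    import java.io.IOException;
    import java.util.List;
    import org.apache.kudu.client.*;

    public class ScanTokenSketch {
      public static void main(String[] args) throws IOException {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          KuduTable table = client.openTable("metadata_dim");

          // "Planner" side: build tokens with the predicate baked in.
          List<KuduScanToken> tokens = client.newScanTokenBuilder(table)
              .addPredicate(KuduPredicate.newComparisonPredicate(
                  table.getSchema().getColumn("id"),
                  KuduPredicate.ComparisonOp.EQUAL, 42L))
              .build();

          for (KuduScanToken token : tokens) {
            byte[] serialized = token.serialize();  // bytes shipped to a worker

            // "Worker" side: rehydrate the token into a scanner and read it.
            KuduScanner scanner =
                KuduScanToken.deserializeIntoScanner(serialized, client);
            while (scanner.hasMoreRows()) {
              for (RowResult row : scanner.nextRows()) {
                System.out.println(row.rowToString());
              }
            }
          }
        }
      }
    }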
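
The CLOSEST_REPLICA setting Clifford asked about is a per-scanner option in
the Kudu client; as Todd notes, it only affects which replica a given scan
reads from, not how the planner lays out the join. A minimal sketch, assuming
the same hypothetical table and a reasonably recent Java client that exposes
the replica-selection option on the scanner builder:

    import org.apache.kudu.client.*;

    public class ClosestReplicaSketch {
      public static void main(String[] args) throws KuduException {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          KuduTable table = client.openTable("metadata_dim");

          // Prefer a replica close to this client (e.g. a co-located tablet
          // server) instead of the default of always reading from the leader.
          KuduScanner scanner = client.newScannerBuilder(table)
              .replicaSelection(ReplicaSelection.CLOSEST_REPLICA)
              .build();

          while (scanner.hasMoreRows()) {
            scanner.nextRows();  // row handling elided in this sketch
          }
        }
      }
    }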
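
Finally, Dan and Boris's point that the replication factor is fixed when the
table is created: with the Java client it is set through CreateTableOptions,
roughly as below. The two-column schema and table name are made up for
illustration; pick an odd replica count, and note that (as discussed above)
changing it later means recreating the table.

    import java.util.Arrays;
    import java.util.Collections;
    import org.apache.kudu.ColumnSchema;
    import org.apache.kudu.Schema;
    import org.apache.kudu.Type;
    import org.apache.kudu.client.CreateTableOptions;
    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduException;

    public class CreateReplicatedTableSketch {
      public static void main(String[] args) throws KuduException {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          // Hypothetical small dimension table: a PK plus one attribute column.
          Schema schema = new Schema(Arrays.asList(
              new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
              new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING).build()));

          CreateTableOptions opts = new CreateTableOptions()
              .setRangePartitionColumns(Collections.singletonList("id"))
              .setNumReplicas(5);  // odd count; fixed once the table exists

          client.createTable("metadata_dim", schema, opts);
        }
      }
    }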
