Are you on the latest release of Impala? It switched from using Thrift for RPC to a new implementation (actually borrowed from Kudu), which might help broadcast performance a bit.
Todd

On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin <[email protected]> wrote:

Sorry to revive the old thread, but I am curious if there is a good way to speed up requests to frequently used tables in Kudu.

On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <[email protected]> wrote:

Bummer... after reading your conversation, I wish there was an easier way. We will have the same issue, since we have a few dozen tables which are used very frequently in joins, and I was hoping there was an easy way to replicate them on most of the nodes to avoid broadcasts every time.

On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <[email protected]> wrote:

The table in our case is 12x hashed and ranged by month, so the broadcasts were often to all (12) nodes.

On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <[email protected]> wrote:

Sorry I left that out Cliff, FWIW it does seem to have been broadcast.

Not sure though how a shuffle would be much different from a broadcast if the entire table is 1 file/block on 1 node.

On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <[email protected]> wrote:

From the screenshot it does not look like there was a broadcast of the dimension table(s), so it could be the case here that the multiple smaller sends help. Our dim tables are generally in the single-digit millions, and Impala chooses to broadcast them. Since the fact result cardinality is always much smaller, we've found that forcing a [shuffle] dimension join is actually faster, since it only sends the dims once rather than all to all nodes. The degenerate performance of broadcast is especially obvious when the query returns zero results. I don't have much experience here, but it does seem that Kudu's efficient predicate scans can sometimes "break" Impala's query plan.

-Cliff

On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd not to belabor the point, but when I suggested breaking up small dim tables into multiple parquet files (and, in this thread's context, perhaps partitioning a Kudu table, even if small, into multiple tablets), it was to speed up joins/exchanges, not to parallelize the scan.

For example, recently we ran into a slow query where a 14M-record dimension fit into a single file & block, so it got scanned on a single node, though still pretty quickly (300ms); however, it caused the join to take 25+ seconds and bogged down the entire query. See the highlighted fragment and its parent.

So we broke it into several small files the way I described in my previous post, and now the join and the query are fast (6s).

-m

On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <[email protected]> wrote:

I suppose in the case that the dimension table scan makes up a non-trivial portion of your workload time, then yeah, parallelizing the scan as you suggest would be beneficial. That said, in typical analytic queries, scanning the dimension tables is very quick compared to scanning the much-larger fact tables, so the extra parallelism on the dim table scan isn't worth too much.

-Todd
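For reference, the [shuffle] hint Cliff describes above is an ordinary Impala join hint; a minimal sketch, with made-up fact/dimension table and column names:

    -- Force a partitioned (shuffle) join instead of letting the planner
    -- broadcast the dimension table to every node.
    -- (fact_events, dim_customer, and the columns are hypothetical.)
    SELECT f.event_id, d.customer_name
    FROM fact_events f
    JOIN [SHUFFLE] dim_customer d
      ON f.customer_id = d.customer_id
    WHERE f.event_date = '2018-04-11';

With the hint, both join inputs are hash-partitioned on the join key, so each dimension row is sent to exactly one node, rather than the whole dimension being sent to every node as in a broadcast join.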
On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd, working with parquet in the past I've seen small dimensions that fit in a single file/block limit the parallelism of join/exchange/aggregation nodes, and I've forced those dims to spread across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or similar when doing INSERT OVERWRITE to load them, which then allows these operations to parallelize across that many nodes.

Wouldn't it be useful here for Cliff's small dims to be partitioned into a couple of tablets to similarly improve parallelism?

-m

On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]> wrote:
>
> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing -- it's much appreciated! I just have one last follow-up question. Reading about BROADCAST operations (Kudu, Spark, Flink, etc.) it seems the smaller table is always copied in its entirety BEFORE the predicate is evaluated.

That's not quite true. If you have a predicate on a joined column, or on one of the columns in the joined table, it will be pushed down to the "scan" operator, which happens before the "exchange". In addition, there is a feature called "runtime filters" that can push dynamically-generated filters from one side of the exchange to the other.

> But since the Kudu client provides a serialized scanner as part of the ScanToken API, why wouldn't Impala use that instead if it knows that the table is Kudu and the query has any type of predicate? Perhaps if I hash-partition the table I could force this (because that complicates a BROADCAST)? I guess this is really a question for Impala, but perhaps there is a more basic reason.

Impala could definitely be smarter; it's just a matter of programming Kudu-specific join strategies into the optimizer. Today, the optimizer isn't aware of the unique properties of Kudu scans vs other storage mechanisms.

-Todd

> -Cliff
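A rough sketch of the two variants Mauricio describes above, with illustrative table names, the 8 MB target file size he mentions, and an arbitrary tablet count:

    -- Parquet variant: cap each output file at ~8 MB so a small dimension
    -- ends up spread across many files/blocks instead of one.
    -- (dim_customer and dim_customer_staging are hypothetical names.)
    SET PARQUET_FILE_SIZE=8m;
    INSERT OVERWRITE TABLE dim_customer
    SELECT * FROM dim_customer_staging;

    -- Kudu variant: spread the same dimension across several tablets by
    -- hash-partitioning it, even though the table is small.
    CREATE TABLE dim_customer_kudu (
      customer_id   BIGINT,
      customer_name STRING,
      PRIMARY KEY (customer_id)
    )
    PARTITION BY HASH (customer_id) PARTITIONS 4
    STORED AS KUDU;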
On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <[email protected]> wrote:
>
> I thought I had read that the Kudu client can configure a scan for CLOSEST_REPLICA and assumed this was a way to take advantage of data collocation.

Yea, when a client uses CLOSEST_REPLICA it will read a local replica if one is available. However, that doesn't influence the higher-level operation of the Impala (or Spark) planner. The planner isn't aware of the replication policy, so it will use one of the existing supported JOIN strategies. Given statistics, it will choose to broadcast the small table, which means that it will create a plan that looks like:

                       +-------------------------+
                       |                         |
            +--------->| build    JOIN           |
            |          |                         |
            |          |          probe          |
            |          +------------^------------+
   +--------+-------+               |
   |                |               |
   |    Exchange    |               |
   |   (broadcast)  |               |
   |                |               |
   +--------+-------+               |
            ^                       |
            |                       |
   +--------+-------+   +-----------+-----------+
   |                |   |                       |
   |   SCAN KUDU    |   |   SCAN (other side)   |
   |                |   |                       |
   +----------------+   +-----------------------+

(hopefully the ASCII art comes through)

In other words, the "scan kudu" operator scans the table once, and then replicates the results of that scan into the JOIN operator. The "scan kudu" operator will of course read its local copy, but the data will still go through the exchange process.

For the use case you're talking about, where the join is just looking up a single row by PK in a dimension table, ideally we'd be using an altogether different join strategy such as a nested-loop join, with the inner "loop" actually being a Kudu PK lookup, but that strategy isn't implemented by Impala.

-Todd

> If this exists, then how far out of context is my understanding of it? Reading about HDFS cache replication, I do know that Impala will choose a random replica there to more evenly distribute load. But especially compared to Kudu upsert, managing mutable data using Parquet is painful. So, perhaps to sum things up: if nearly 100% of my metadata scans are single primary-key lookups followed by a tiny broadcast, then am I really just splitting hairs performance-wise between Kudu and HDFS-cached parquet?

On Friday, March 16, 2018 at 2:51 PM, Todd Lipcon <[email protected]> wrote:

It's worth noting that, even if your table is replicated, Impala's planner is unaware of this fact and it will give the same plan regardless. That is to say, rather than every node scanning its local copy, a single node will perform the whole scan (assuming it's a small table) and broadcast it from there within the scope of a single query. So I don't think you'll see any performance improvement on Impala queries by attempting something like an extremely high replication count.

I could see bumping the replication count to 5 for these tables, since the extra storage cost is low and it will ensure higher availability of these important central tables, but I'd be surprised if there is any measurable perf impact.

-Todd
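For the HDFS-cached parquet alternative Clifford raises above, the Impala side of the setup looks roughly like this. This is only a sketch: it assumes an HDFS cache pool (named 'dim_pool' here purely for illustration) has already been created by the cluster admin with hdfs cacheadmin, and the table name is made up:

    -- Pin the small dimension into an HDFS cache pool, keeping several
    -- cached replicas so Impala can spread its scans across the hosts
    -- that hold a cached copy.
    ALTER TABLE dim_customer SET CACHED IN 'dim_pool' WITH REPLICATION = 4;

    -- To release it later:
    -- ALTER TABLE dim_customer SET UNCACHED;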
On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <[email protected]> wrote:

Thanks for that, glad I was wrong there! Aside from replication considerations, is it also recommended that the number of tablet servers be odd?

I will check the forums as you suggested, but from what I read after searching, Impala relies on user-configured caching strategies using the HDFS cache. The workload for these tables is very light write, maybe a dozen or so records per hour across 6 or 7 tables. The size of the tables ranges from thousands to low millions of rows, so sub-partitioning would not be required. So perhaps this is not a typical use-case, but I think it could work quite well with Kudu.

On Friday, March 16, 2018 at 2:09 PM, Dan Burkert <[email protected]> wrote:

The replication count is the number of tablet servers on which Kudu will host copies. So if you set the replication level to 5, Kudu will put the data on 5 separate tablet servers. There's no built-in broadcast table feature; upping the replication factor is the closest thing. A couple of things to keep in mind:

- Always use an odd replication count. This is important due to how the Raft algorithm works. Recent versions of Kudu won't even let you specify an even number without flipping some flags.
- We don't test much beyond 5 replicas. It *should* work, but you may run into issues, since it's a relatively rare configuration. With a heavy write workload and many replicas you are even more likely to encounter issues.

It's also worth checking in an Impala forum whether it has features that make joins against small broadcast tables better; perhaps Impala can cache small tables locally when doing joins.

- Dan

On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <[email protected]> wrote:

The problem is, AFAIK, that the replication count is not necessarily the distribution count, so you can't guarantee all tablet servers will have a copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:

I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We have a few tables that are small but used a lot. My plan is to replicate them more than 3 times. When you create a Kudu table, you can specify the number of replicated copies (3 by default), and I guess you can put there a number corresponding to your node count in the cluster. The downside: you cannot change that number unless you recreate the table.
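For illustration, specifying the replication factor Dan and Boris mention at table-creation time through Impala looks roughly like this (the table, columns, and partition count are made up; per Dan's note the replica count should be odd):

    -- Small, frequently-joined lookup table created with a higher
    -- replication factor than the default of 3.
    CREATE TABLE lookup_dim (
      id   BIGINT,
      name STRING,
      PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 2
    STORED AS KUDU
    TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');

As noted in the thread, the replication factor has to be chosen when the table is created; changing it later means recreating the table.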
On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]> wrote:

We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One Redshift feature that we will miss is its ALL distribution, where a copy of a table is maintained on each server. We define a number of metadata tables this way since they are used in nearly every query. We are considering using parquet in the HDFS cache for these; Kudu would be a much better fit for the update semantics, but we are worried about the additional contention. I'm wondering if having a broadcast, or ALL, tablet replication might be an easy feature to add to Kudu?

-Cliff
