On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin <[email protected]> wrote:

> Hi Todd,
>
> Are you saying that your earlier comment below is no longer valid with
> Impala 2.11, and that if I replicate a table to all our Kudu nodes Impala
> can benefit from this?
>

No, the earlier comment is still valid. I'm just saying that in some cases
the exchange can be faster in the new Impala version.


> "
> *It's worth noting that, even if your table is replicated, Impala's
> planner is unaware of this fact and it will give the same plan regardless.
> That is to say, rather than every node scanning its local copy, instead a
> single node will perform the whole scan (assuming it's a small table) and
> broadcast it from there within the scope of a single query. So, I don't
> think you'll see any performance improvements on Impala queries by
> attempting something like an extremely high replication count.*
>
> *I could see bumping the replication count to 5 for these tables since the
> extra storage cost is low and it will ensure higher availability of the
> important central tables, but I'd be surprised if there is any measurable
> perf impact.*
> "
>
> On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon <[email protected]> wrote:
>
>> Are you on the latest release of Impala? It switched from using Thrift
>> for RPC to a new implementation (actually borrowed from kudu) which might
>> help broadcast performance a bit.
>>
>> Todd
>>
>> On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin <[email protected]> wrote:
>>
>>> Sorry to revive the old thread, but I am curious whether there is a good
>>> way to speed up requests to frequently used tables in Kudu.
>>>
>>> On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <[email protected]>
>>> wrote:
>>>
>>>> Bummer... After reading you guys' conversation, I wish there was an
>>>> easier way. We will have the same issue, as we have a few dozen tables
>>>> which are used very frequently in joins, and I was hoping there was an easy
>>>> way to replicate them on most of the nodes to avoid broadcasts every time.
>>>>
>>>> On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <
>>>> [email protected]> wrote:
>>>>
>>>>> The table in our case is hash-partitioned 12 ways and range-partitioned
>>>>> by month, so the broadcasts were often to all (12) nodes.
>>>>>
>>>>> On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <[email protected]>
>>>>> wrote:
>>>>> Sorry, I left that out, Cliff. FWIW it does seem to have been
>>>>> broadcast.
>>>>>
>>>>>
>>>>>
>>>>> I'm not sure, though, how a shuffle would be much different from a
>>>>> broadcast if the entire table is 1 file/block on 1 node.
>>>>>
>>>>> On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> From the screenshot it does not look like there was a broadcast of
>>>>>> the dimension table(s), so it could be the case here that the multiple
>>>>>> smaller sends help. Our dim tables are generally in the single-digit
>>>>>> millions and Impala chooses to broadcast them. Since the fact result
>>>>>> cardinality is always much smaller, we've found that forcing a [SHUFFLE]
>>>>>> dimension join is actually faster, since each dim row is sent only once
>>>>>> rather than to every node. The degenerate performance of broadcast is
>>>>>> especially obvious when the query returns zero results. I don't have much
>>>>>> experience here, but it does seem that Kudu's efficient predicate scans
>>>>>> can sometimes "break" Impala's query plan.
>>>>>>
>>>>>> -Cliff
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> @Todd not to belabor the point, but when I suggested breaking up
>>>>>>> small dim tables into multiple parquet files (and, in this thread's
>>>>>>> context, perhaps partitioning the Kudu table, even if small, into
>>>>>>> multiple tablets), it was to speed up joins/exchanges, not to
>>>>>>> parallelize the scan.
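>>>>>>>
>>>>>>> Roughly what I have in mind for the Kudu case is something like this
>>>>>>> (hypothetical table and column names, just to show the shape of it):
>>>>>>>
>>>>>>>     CREATE TABLE dim_customer_kudu (
>>>>>>>       id BIGINT,
>>>>>>>       name STRING,
>>>>>>>       PRIMARY KEY (id)
>>>>>>>     )
>>>>>>>     PARTITION BY HASH (id) PARTITIONS 4
>>>>>>>     STORED AS KUDU;
>>>>>>>
>>>>>>> i.e. even a small dim gets a few tablets, so its scan (and the
>>>>>>> exchange/join fed by it) isn't pinned to a single node.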
>>>>>>>
>>>>>>> For example, we recently ran into a slow query where the 14M-record
>>>>>>> dimension fit into a single file & block, so it got scanned on a single
>>>>>>> node (still pretty quickly, ~300ms); however, it caused the join to take
>>>>>>> 25+ seconds and bogged down the entire query.  See the highlighted
>>>>>>> fragment and its parent.
>>>>>>>
>>>>>>> So we broke it into several small files the way I described in my
>>>>>>> previous post, and now the join and the overall query are fast (6s).
>>>>>>>
>>>>>>> -m
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I suppose in the case that the dimension table scan makes a
>>>>>>>> non-trivial portion of your workload time, then yea, parallelizing the 
>>>>>>>> scan
>>>>>>>> as you suggest would be beneficial. That said, in typical analytic 
>>>>>>>> queries,
>>>>>>>> scanning the dimension tables is very quick compared to scanning the
>>>>>>>> much-larger fact tables, so the extra parallelism on the dim table scan
>>>>>>>> isn't worth too much.
>>>>>>>>
>>>>>>>> -Todd
>>>>>>>>
>>>>>>>> On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> @Todd, working with Parquet in the past I've seen small dimensions
>>>>>>>>> that fit in a single file/block limit the parallelism of
>>>>>>>>> join/exchange/aggregation nodes, and I've forced those dims to spread
>>>>>>>>> across 20 or so blocks by using SET PARQUET_FILE_SIZE=8m; (or similar)
>>>>>>>>> when doing the INSERT OVERWRITE that loads them, which then allows these
>>>>>>>>> operations to parallelize across that many nodes (sketch below).
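>>>>>>>>>
>>>>>>>>> Roughly like this (the table names are placeholders, and 8m is just
>>>>>>>>> the value that happened to work for us):
>>>>>>>>>
>>>>>>>>>     SET PARQUET_FILE_SIZE=8m;
>>>>>>>>>     INSERT OVERWRITE dim_customer
>>>>>>>>>     SELECT * FROM dim_customer_staging;
>>>>>>>>>
>>>>>>>>> which writes many small Parquet files instead of one, so later scans
>>>>>>>>> and exchanges over that dim can run on more than one node.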
>>>>>>>>>
>>>>>>>>> Wouldn't it be useful here for Cliff's small dims to be
>>>>>>>>> partitioned into a couple of tablets to similarly improve parallelism?
>>>>>>>>>
>>>>>>>>> -m
>>>>>>>>>
>>>>>>>>> On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Todd,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for that explanation, as well as all the great work
>>>>>>>>>>> you're doing -- it's much appreciated! I just have one last follow-up
>>>>>>>>>>> question. Reading about BROADCAST operations (Kudu, Spark, Flink,
>>>>>>>>>>> etc.) it seems the smaller table is always copied in its entirety
>>>>>>>>>>> BEFORE the predicate is evaluated.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That's not quite true. If you have a predicate on a joined
>>>>>>>>>> column, or on one of the columns in the joined table, it will be 
>>>>>>>>>> pushed
>>>>>>>>>> down to the "scan" operator, which happens before the "exchange". In
>>>>>>>>>> addition, there is a feature called "runtime filters" that can push
>>>>>>>>>> dynamically-generated filters from one side of the exchange to the 
>>>>>>>>>> other.
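>>>>>>>>>>
>>>>>>>>>> To make that concrete with a made-up example (the table and column
>>>>>>>>>> names here are hypothetical), in a query like:
>>>>>>>>>>
>>>>>>>>>>     SELECT f.amount, d.region
>>>>>>>>>>     FROM fact_sales f
>>>>>>>>>>     JOIN dim_store d ON f.store_id = d.store_id
>>>>>>>>>>     WHERE d.region = 'EU';
>>>>>>>>>>
>>>>>>>>>> the d.region = 'EU' predicate is evaluated inside the scan of
>>>>>>>>>> dim_store, before its rows go through the exchange, and a runtime
>>>>>>>>>> filter on store_id built from the dim side can additionally be
>>>>>>>>>> applied to the fact_sales scan.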
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> But since the Kudu client provides a serialized scanner as part
>>>>>>>>>>> of the ScanToken API, why wouldn't Impala use that instead if it knows
>>>>>>>>>>> that the table is Kudu and the query has any type of predicate?
>>>>>>>>>>> Perhaps if I hash-partition the table I could force this (because that
>>>>>>>>>>> complicates a BROADCAST)? I guess this is really a question for
>>>>>>>>>>> Impala, but perhaps there is a more basic reason.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Impala could definitely be smarter; it's just a matter of programming
>>>>>>>>>> Kudu-specific join strategies into the optimizer. Today, the optimizer
>>>>>>>>>> isn't aware of the unique properties of Kudu scans vs other storage
>>>>>>>>>> mechanisms.
>>>>>>>>>>
>>>>>>>>>> -Todd
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -Cliff
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I thought I had read that the Kudu client can configure a scan
>>>>>>>>>>>>> for CLOSEST_REPLICA and assumed this was a way to take advantage 
>>>>>>>>>>>>> of data
>>>>>>>>>>>>> collocation.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yea, when a client uses CLOSEST_REPLICA it will read a local
>>>>>>>>>>>> one if available. However, that doesn't influence the higher level
>>>>>>>>>>>> operation of the Impala (or Spark) planner. The planner isn't 
>>>>>>>>>>>> aware of the
>>>>>>>>>>>> replication policy, so it will use one of the existing supported 
>>>>>>>>>>>> JOIN
>>>>>>>>>>>> strategies. Given statistics, it will choose to broadcast the 
>>>>>>>>>>>> small table,
>>>>>>>>>>>> which means that it will create a plan that looks like:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                                    +-------------------------+
>>>>>>>>>>>>                                    |                         |
>>>>>>>>>>>>                         +---------->build      JOIN          |
>>>>>>>>>>>>                         |          |                         |
>>>>>>>>>>>>                         |          |              probe      |
>>>>>>>>>>>>                  +--------------+  +-------------------------+
>>>>>>>>>>>>                  |              |                  |
>>>>>>>>>>>>                  | Exchange     |                  |
>>>>>>>>>>>>             +----+ (broadcast)  |                  |
>>>>>>>>>>>>             |    |              |                  |
>>>>>>>>>>>>             |    +--------------+                  |
>>>>>>>>>>>>             |                                      |
>>>>>>>>>>>>       +---------+                        +-----------------------+
>>>>>>>>>>>>       |         |                        |                       |
>>>>>>>>>>>>       |  SCAN   |                        |   SCAN (other side)   |
>>>>>>>>>>>>       |  KUDU   |                        |                       |
>>>>>>>>>>>>       +---------+                        +-----------------------+
>>>>>>>>>>>>
>>>>>>>>>>>> (hopefully the ASCII art comes through)
>>>>>>>>>>>>
>>>>>>>>>>>> In other words, the "scan kudu" operator scans the table once,
>>>>>>>>>>>> and then replicates the results of that scan into the JOIN 
>>>>>>>>>>>> operator. The
>>>>>>>>>>>> "scan kudu" operator of course will read its local copy, but it 
>>>>>>>>>>>> will still
>>>>>>>>>>>> go through the exchange process.
>>>>>>>>>>>>
>>>>>>>>>>>> For the use case you're talking about, where the join is just
>>>>>>>>>>>> looking up a single row by PK in a dimension table, ideally we'd 
>>>>>>>>>>>> be using
>>>>>>>>>>>> an altogether different join strategy such as nested-loop join, 
>>>>>>>>>>>> with the
>>>>>>>>>>>> inner "loop" actually being a Kudu PK lookup, but that strategy 
>>>>>>>>>>>> isn't
>>>>>>>>>>>> implemented by Impala.
>>>>>>>>>>>>
>>>>>>>>>>>> -Todd
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> If this exists, then how far out of context is my understanding of
>>>>>>>>>>>>> it? Reading about HDFS cache replication, I do know that Impala will
>>>>>>>>>>>>> choose a random replica there to more evenly distribute load. But
>>>>>>>>>>>>> especially compared to Kudu upsert, managing mutable data using
>>>>>>>>>>>>> Parquet is painful. So, perhaps to sum things up: if nearly 100% of
>>>>>>>>>>>>> my metadata scans are single primary-key lookups followed by a tiny
>>>>>>>>>>>>> broadcast, am I really just splitting hairs performance-wise between
>>>>>>>>>>>>> Kudu and HDFS-cached Parquet?
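>>>>>>>>>>>>>
>>>>>>>>>>>>> (For reference, the HDFS-cache route I'm weighing is roughly this --
>>>>>>>>>>>>> the pool and table names are placeholders, and the cache pool would
>>>>>>>>>>>>> have to be created beforehand with hdfs cacheadmin:
>>>>>>>>>>>>>
>>>>>>>>>>>>>     ALTER TABLE dim_metadata
>>>>>>>>>>>>>     SET CACHED IN 'dim_pool' WITH REPLICATION = 5;
>>>>>>>>>>>>>
>>>>>>>>>>>>> which pins the table's blocks in memory on 5 nodes.)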
>>>>>>>>>>>>>
>>>>>>>>>>>>> From:  Todd Lipcon <[email protected]>
>>>>>>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>>>>>>> Date: Friday, March 16, 2018 at 2:51 PM
>>>>>>>>>>>>>
>>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's worth noting that, even if your table is replicated,
>>>>>>>>>>>>> Impala's planner is unaware of this fact and it will give the 
>>>>>>>>>>>>> same plan
>>>>>>>>>>>>> regardless. That is to say, rather than every node scanning its 
>>>>>>>>>>>>> local copy,
>>>>>>>>>>>>> instead a single node will perform the whole scan (assuming it's 
>>>>>>>>>>>>> a small
>>>>>>>>>>>>> table) and broadcast it from there within the scope of a single 
>>>>>>>>>>>>> query. So,
>>>>>>>>>>>>> I don't think you'll see any performance improvements on Impala 
>>>>>>>>>>>>> queries by
>>>>>>>>>>>>> attempting something like an extremely high replication count.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I could see bumping the replication count to 5 for these
>>>>>>>>>>>>> tables since the extra storage cost is low and it will ensure 
>>>>>>>>>>>>> higher
>>>>>>>>>>>>> availability of the important central tables, but I'd be 
>>>>>>>>>>>>> surprised if there
>>>>>>>>>>>>> is any measurable perf impact.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Todd
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for that, glad I was wrong there! Aside from replication
>>>>>>>>>>>>>> considerations, is it also recommended that the number of tablet
>>>>>>>>>>>>>> servers be odd?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I will check the forums as you suggested, but from what I found
>>>>>>>>>>>>>> after searching, Impala relies on user-configured caching strategies
>>>>>>>>>>>>>> using the HDFS cache.  The workload for these tables is very light
>>>>>>>>>>>>>> on writes, maybe a dozen or so records per hour across 6 or 7
>>>>>>>>>>>>>> tables. The size of the tables ranges from thousands to low millions
>>>>>>>>>>>>>> of rows, so sub-partitioning would not be required. So perhaps this
>>>>>>>>>>>>>> is not a typical use case, but I think it could work quite well with
>>>>>>>>>>>>>> Kudu.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> From: Dan Burkert <[email protected]>
>>>>>>>>>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>>>>>>>>>> Date: Friday, March 16, 2018 at 2:09 PM
>>>>>>>>>>>>>> To: "[email protected]" <[email protected]>
>>>>>>>>>>>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The replication count is the number of tablet servers which
>>>>>>>>>>>>>> Kudu will host copies on.  So if you set the replication level 
>>>>>>>>>>>>>> to 5, Kudu
>>>>>>>>>>>>>> will put the data on 5 separate tablet servers.  There's no 
>>>>>>>>>>>>>> built-in
>>>>>>>>>>>>>> broadcast table feature; upping the replication factor is the 
>>>>>>>>>>>>>> closest
>>>>>>>>>>>>>> thing.  A couple of things to keep in mind:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Always use an odd replication count.  This is important due
>>>>>>>>>>>>>> to how the Raft algorithm works.  Recent versions of Kudu won't 
>>>>>>>>>>>>>> even let
>>>>>>>>>>>>>> you specify an even number without flipping some flags.
>>>>>>>>>>>>>> - We don't test much beyond 5 replicas.  It *should* work, but you
>>>>>>>>>>>>>> may run into issues since it's a relatively rare configuration.
>>>>>>>>>>>>>> With a heavy write workload and many replicas you are even more
>>>>>>>>>>>>>> likely to encounter issues.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It's also worth checking in an Impala forum whether it has features
>>>>>>>>>>>>>> that make joins against small broadcast tables better.  Perhaps
>>>>>>>>>>>>>> Impala can cache small tables locally when doing joins.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Dan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The problem is, AFAIK, that the replication count is not
>>>>>>>>>>>>>>> necessarily the distribution count, so you can't guarantee that all
>>>>>>>>>>>>>>> tablet servers will have a copy.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> I'm new to Kudu, but we are also going to use Impala mostly with
>>>>>>>>>>>>>>> Kudu. We have a few tables that are small but used a lot. My plan is
>>>>>>>>>>>>>>> to replicate them more than 3 times. When you create a Kudu table,
>>>>>>>>>>>>>>> you can specify the number of replicated copies (3 by default), and
>>>>>>>>>>>>>>> I guess you can set it to a number corresponding to your node count
>>>>>>>>>>>>>>> in the cluster (see the sketch below). The downside is that you
>>>>>>>>>>>>>>> cannot change that number unless you recreate the table.
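>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Something like this, if you create the table through Impala (the
>>>>>>>>>>>>>>> table and column names are just an example, not a real schema):
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     CREATE TABLE small_dim (
>>>>>>>>>>>>>>>       id BIGINT,
>>>>>>>>>>>>>>>       name STRING,
>>>>>>>>>>>>>>>       PRIMARY KEY (id)
>>>>>>>>>>>>>>>     )
>>>>>>>>>>>>>>>     PARTITION BY HASH (id) PARTITIONS 2
>>>>>>>>>>>>>>>     STORED AS KUDU
>>>>>>>>>>>>>>>     TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');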
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We will soon be moving our analytics from AWS Redshift to
>>>>>>>>>>>>>>>> Impala/Kudu. One Redshift feature that we will miss is its ALL
>>>>>>>>>>>>>>>> Distribution, where a copy of a table is maintained on each 
>>>>>>>>>>>>>>>> server. We
>>>>>>>>>>>>>>>> define a number of metadata tables this way since they are 
>>>>>>>>>>>>>>>> used in nearly
>>>>>>>>>>>>>>>> every query. We are considering using Parquet in the HDFS cache
>>>>>>>>>>>>>>>> for these; Kudu would be a much better fit for the update
>>>>>>>>>>>>>>>> semantics, but we are worried about the additional contention.
>>>>>>>>>>>>>>>> I'm wondering if a broadcast, or ALL, tablet replication might be
>>>>>>>>>>>>>>>> an easy feature to add to Kudu?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Cliff
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Todd Lipcon
>>>>>>>>>>>>> Software Engineer, Cloudera
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Todd Lipcon
>>>>>>>>>>>> Software Engineer, Cloudera
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Todd Lipcon
>>>>>>>>>> Software Engineer, Cloudera
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *MAURICIO ARISTIZABAL*
>>>>>>>>> Architect - Business Intelligence + Data Science
>>>>>>>>> [email protected](m)+1 323 309 4260 <(323)%20309-4260>
>>>>>>>>> 223 E. De La Guerra St. | Santa Barbara, CA 93101
>>>>>>>>> <https://maps.google.com/?q=223+E.+De+La+Guerra+St.+%7C+Santa+Barbara,+CA+93101&entry=gmail&source=g>
>>>>>>>>>
>>>>>>>>> Overview <http://www.impactradius.com/?src=slsap> | Twitter
>>>>>>>>> <https://twitter.com/impactradius> | Facebook
>>>>>>>>> <https://www.facebook.com/pages/Impact-Radius/153376411365183> |
>>>>>>>>> LinkedIn <https://www.linkedin.com/company/impact-radius-inc->
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Todd Lipcon
>>>>>>>> Software Engineer, Cloudera
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mauricio Aristizabal
>>>>>>> Architect - Data Pipeline
>>>>>>> *M * 323 309 4260
>>>>>>> *E  *[email protected]  |  *W * https://impact.com
>>>>>>> <https://www.linkedin.com/company/608678/>
>>>>>>> <https://www.facebook.com/ImpactMarTech/>
>>>>>>> <https://twitter.com/impactmartech>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mauricio Aristizabal
>>>>> Architect - Data Pipeline
>>>>> *M * 323 309 4260
>>>>> *E  *[email protected]  |  *W * https://impact.com
>>>>> <https://www.linkedin.com/company/608678/>
>>>>> <https://www.facebook.com/ImpactMarTech/>
>>>>> <https://twitter.com/impactmartech>
>>>>>
>>>>
>>>>
