Are you on the latest release of Impala? It switched from using Thrift for RPC to a new implementation (actually borrowed from Kudu), which might help broadcast performance a bit.
Todd

On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin <[email protected]> wrote:

Sorry to revive the old thread, but I am curious if there is a good way to speed up requests to frequently used tables in Kudu.

On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <[email protected]> wrote:

Bummer... after reading your conversation, I wish there was an easier way. We will have the same issue, since we have a few dozen tables which are used very frequently in joins, and I was hoping there was an easy way to replicate them on most of the nodes to avoid broadcasts every time.

On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <[email protected]> wrote:

The table in our case is 12x hashed and ranged by month, so the broadcasts were often to all (12) nodes.

On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <[email protected]> wrote:

Sorry I left that out Cliff, FWIW it does seem to have been broadcast.

Not sure though how a shuffle would be much different from a broadcast if the entire table is 1 file/block on 1 node.

On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <[email protected]> wrote:

From the screenshot it does not look like there was a broadcast of the dimension table(s), so it could be the case here that the multiple smaller sends help. Our dim tables are generally in the single-digit millions, and Impala chooses to broadcast them. Since the fact result cardinality is always much smaller, we've found that forcing a [shuffle] dimension join is actually faster, since it only sends the dims once rather than all to all nodes. The degenerate performance of broadcast is especially obvious when the query returns zero results. I don't have much experience here, but it does seem that Kudu's efficient predicate scans can sometimes "break" Impala's query plan.

-Cliff

On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd not to belabor the point, but when I suggested breaking up small dim tables into multiple parquet files (and, in this thread's context, perhaps partitioning a Kudu table, even if small, into multiple tablets), it was to speed up joins/exchanges, not to parallelize the scan.

For example, recently we ran into a slow query where a 14M-record dimension fit into a single file & block, so it got scanned on a single node, though still pretty quickly (300ms); however, it caused the join to take 25+ seconds and bogged down the entire query. See the highlighted fragment and its parent.

So we broke it into several small files the way I described in my previous post, and now the join and the query are fast (6s).

-m

On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <[email protected]> wrote:

I suppose in the case that the dimension table scan makes up a non-trivial portion of your workload time, then yeah, parallelizing the scan as you suggest would be beneficial. That said, in typical analytic queries, scanning the dimension tables is very quick compared to scanning the much-larger fact tables, so the extra parallelism on the dim table scan isn't worth too much.

-Todd
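For reference, the [shuffle] hint Cliff describes above is an ordinary Impala join hint; a minimal sketch, with made-up fact/dimension table and column names:

    -- Force a partitioned (shuffle) join instead of letting the planner
    -- broadcast the dimension table to every node.
    -- (fact_events, dim_customer, and the columns are hypothetical.)
    SELECT f.event_id, d.customer_name
    FROM fact_events f
    JOIN [SHUFFLE] dim_customer d
      ON f.customer_id = d.customer_id
    WHERE f.event_date = '2018-04-11';

With the hint, both join inputs are hash-partitioned on the join key, so each dimension row is sent to exactly one node, rather than the whole dimension being sent to every node as in a broadcast join.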
On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd, working with parquet in the past I've seen small dimensions that fit in a single file/block limit the parallelism of join/exchange/aggregation nodes, and I've forced those dims to spread across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or similar when doing INSERT OVERWRITE to load them, which then allows these operations to parallelize across that many nodes.

Wouldn't it be useful here for Cliff's small dims to be partitioned into a couple of tablets to similarly improve parallelism?

-m

On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]> wrote:
>
> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing -- it's much appreciated! I just have one last follow-up question. Reading about BROADCAST operations (Kudu, Spark, Flink, etc.) it seems the smaller table is always copied in its entirety BEFORE the predicate is evaluated.

That's not quite true. If you have a predicate on a joined column, or on one of the columns in the joined table, it will be pushed down to the "scan" operator, which happens before the "exchange". In addition, there is a feature called "runtime filters" that can push dynamically-generated filters from one side of the exchange to the other.

> But since the Kudu client provides a serialized scanner as part of the ScanToken API, why wouldn't Impala use that instead if it knows that the table is Kudu and the query has any type of predicate? Perhaps if I hash-partition the table I could force this (because that complicates a BROADCAST)? I guess this is really a question for Impala, but perhaps there is a more basic reason.

Impala could definitely be smarter; it's just a matter of programming Kudu-specific join strategies into the optimizer. Today, the optimizer isn't aware of the unique properties of Kudu scans vs other storage mechanisms.

-Todd

> -Cliff
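A rough sketch of the two variants Mauricio describes above, with illustrative table names, the 8 MB target file size he mentions, and an arbitrary tablet count:

    -- Parquet variant: cap each output file at ~8 MB so a small dimension
    -- ends up spread across many files/blocks instead of one.
    -- (dim_customer and dim_customer_staging are hypothetical names.)
    SET PARQUET_FILE_SIZE=8m;
    INSERT OVERWRITE TABLE dim_customer
    SELECT * FROM dim_customer_staging;

    -- Kudu variant: spread the same dimension across several tablets by
    -- hash-partitioning it, even though the table is small.
    CREATE TABLE dim_customer_kudu (
      customer_id   BIGINT,
      customer_name STRING,
      PRIMARY KEY (customer_id)
    )
    PARTITION BY HASH (customer_id) PARTITIONS 4
    STORED AS KUDU;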
On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <[email protected]> wrote:
>
> I thought I had read that the Kudu client can configure a scan for CLOSEST_REPLICA and assumed this was a way to take advantage of data collocation.

Yea, when a client uses CLOSEST_REPLICA it will read a local replica if one is available. However, that doesn't influence the higher-level operation of the Impala (or Spark) planner. The planner isn't aware of the replication policy, so it will use one of the existing supported JOIN strategies. Given statistics, it will choose to broadcast the small table, which means that it will create a plan that looks like:

                       +-------------------------+
                       |                         |
            +--------->| build    JOIN           |
            |          |                         |
            |          |          probe          |
            |          +------------^------------+
   +--------+-------+               |
   |                |               |
   |    Exchange    |               |
   |   (broadcast)  |               |
   |                |               |
   +--------+-------+               |
            ^                       |
            |                       |
   +--------+-------+   +-----------+-----------+
   |                |   |                       |
   |   SCAN KUDU    |   |   SCAN (other side)   |
   |                |   |                       |
   +----------------+   +-----------------------+

(hopefully the ASCII art comes through)

In other words, the "scan kudu" operator scans the table once, and then replicates the results of that scan into the JOIN operator. The "scan kudu" operator will of course read its local copy, but the data will still go through the exchange process.

For the use case you're talking about, where the join is just looking up a single row by PK in a dimension table, ideally we'd be using an altogether different join strategy such as a nested-loop join, with the inner "loop" actually being a Kudu PK lookup, but that strategy isn't implemented by Impala.

-Todd

> If this exists, then how far out of context is my understanding of it? Reading about HDFS cache replication, I do know that Impala will choose a random replica there to more evenly distribute load. But especially compared to Kudu upsert, managing mutable data using Parquet is painful. So, perhaps to sum things up: if nearly 100% of my metadata scans are single primary-key lookups followed by a tiny broadcast, then am I really just splitting hairs performance-wise between Kudu and HDFS-cached parquet?

On Friday, March 16, 2018 at 2:51 PM, Todd Lipcon <[email protected]> wrote:

It's worth noting that, even if your table is replicated, Impala's planner is unaware of this fact and it will give the same plan regardless. That is to say, rather than every node scanning its local copy, a single node will perform the whole scan (assuming it's a small table) and broadcast it from there within the scope of a single query. So I don't think you'll see any performance improvement on Impala queries by attempting something like an extremely high replication count.

I could see bumping the replication count to 5 for these tables, since the extra storage cost is low and it will ensure higher availability of these important central tables, but I'd be surprised if there is any measurable perf impact.

-Todd
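For the HDFS-cached parquet alternative Clifford raises above, the Impala side of the setup looks roughly like this. This is only a sketch: it assumes an HDFS cache pool (named 'dim_pool' here purely for illustration) has already been created by the cluster admin with hdfs cacheadmin, and the table name is made up:

    -- Pin the small dimension into an HDFS cache pool, keeping several
    -- cached replicas so Impala can spread its scans across the hosts
    -- that hold a cached copy.
    ALTER TABLE dim_customer SET CACHED IN 'dim_pool' WITH REPLICATION = 4;

    -- To release it later:
    -- ALTER TABLE dim_customer SET UNCACHED;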
On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <[email protected]> wrote:

Thanks for that, glad I was wrong there! Aside from replication considerations, is it also recommended that the number of tablet servers be odd?

I will check the forums as you suggested, but from what I read after searching, Impala relies on user-configured caching strategies using the HDFS cache. The workload for these tables is very light write, maybe a dozen or so records per hour across 6 or 7 tables. The size of the tables ranges from thousands to low millions of rows, so sub-partitioning would not be required. So perhaps this is not a typical use-case, but I think it could work quite well with Kudu.

On Friday, March 16, 2018 at 2:09 PM, Dan Burkert <[email protected]> wrote:

The replication count is the number of tablet servers on which Kudu will host copies. So if you set the replication level to 5, Kudu will put the data on 5 separate tablet servers. There's no built-in broadcast table feature; upping the replication factor is the closest thing. A couple of things to keep in mind:

- Always use an odd replication count. This is important due to how the Raft algorithm works. Recent versions of Kudu won't even let you specify an even number without flipping some flags.
- We don't test much beyond 5 replicas. It *should* work, but you may run into issues, since it's a relatively rare configuration. With a heavy write workload and many replicas you are even more likely to encounter issues.

It's also worth checking in an Impala forum whether it has features that make joins against small broadcast tables better; perhaps Impala can cache small tables locally when doing joins.

- Dan

On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <[email protected]> wrote:

The problem is, AFAIK, that the replication count is not necessarily the distribution count, so you can't guarantee all tablet servers will have a copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:

I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We have a few tables that are small but used a lot. My plan is to replicate them more than 3 times. When you create a Kudu table, you can specify the number of replicated copies (3 by default), and I guess you can put there a number corresponding to your node count in the cluster. The downside: you cannot change that number unless you recreate the table.
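For illustration, specifying the replication factor Dan and Boris mention at table-creation time through Impala looks roughly like this (the table, columns, and partition count are made up; per Dan's note the replica count should be odd):

    -- Small, frequently-joined lookup table created with a higher
    -- replication factor than the default of 3.
    CREATE TABLE lookup_dim (
      id   BIGINT,
      name STRING,
      PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 2
    STORED AS KUDU
    TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');

As noted in the thread, the replication factor has to be chosen when the table is created; changing it later means recreating the table.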
On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]> wrote:

We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One Redshift feature that we will miss is its ALL distribution, where a copy of a table is maintained on each server. We define a number of metadata tables this way since they are used in nearly every query. We are considering using parquet in the HDFS cache for these; Kudu would be a much better fit for the update semantics, but we are worried about the additional contention. I'm wondering if having a broadcast, or ALL, tablet replication might be an easy feature to add to Kudu?

-Cliff
