Impala 2.12. The external RPC protocol is still Thrift.

-Todd
On Mon, Jul 23, 2018, 7:02 AM Clifford Resnick <[email protected]> wrote:

Is this Impala 3.0? I'm concerned about breaking changes, and our RPC to Impala is Thrift-based.

From: Todd Lipcon <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, July 23, 2018 at 9:46 AM
To: "[email protected]" <[email protected]>
Subject: Re: "broadcast" tablet replication for kudu?

Are you on the latest release of Impala? It switched from using Thrift for RPC to a new implementation (actually borrowed from Kudu), which might help broadcast performance a bit.

Todd

On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin <[email protected]> wrote:

Sorry to revive the old thread, but I am curious if there is a good way to speed up requests to frequently used tables in Kudu.

On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <[email protected]> wrote:

Bummer... After reading your conversation, I wish there were an easier way. We will have the same issue, as we have a few dozen tables which are used very frequently in joins, and I was hoping there was an easy way to replicate them on most of the nodes to avoid broadcasts every time.

On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <[email protected]> wrote:

The table in our case is 12x hashed and ranged by month, so the broadcasts were often to all (12) nodes.

On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <[email protected]> wrote:

Sorry I left that out, Cliff. FWIW it does seem to have been broadcast.

Not sure, though, how a shuffle would be much different from a broadcast if the entire table is one file/block on one node.

On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <[email protected]> wrote:

From the screenshot it does not look like there was a broadcast of the dimension table(s), so it could be the case here that the multiple smaller sends help. Our dim tables are generally in the single-digit millions and Impala chooses to broadcast them. Since the fact result cardinality is always much smaller, we've found that forcing a [shuffle] dimension join is actually faster, since it only sends the dims once rather than to all nodes. The degenerate performance of broadcast is especially obvious when the query returns zero results. I don't have much experience here, but it does seem that Kudu's efficient predicate scans can sometimes "break" Impala's query plan.

-Cliff

On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd not to belabor the point, but when I suggested breaking up small dim tables into multiple parquet files (and, in this thread's context, perhaps partitioning the Kudu table, even if small, into multiple tablets), it was to speed up joins/exchanges, not to parallelize the scan.

For example, we recently ran into a slow query where a 14M-record dimension fit into a single file and block, so it got scanned on a single node, though still pretty quickly (300ms); however, it caused the join to take 25+ seconds and bogged down the entire query. See the highlighted fragment and its parent.

So we broke it into several small files the way I described in my previous post, and now the join and the query are fast (6s).

-m
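(A minimal sketch of the file-splitting trick Mauricio describes here and in his earlier post further down the thread. The table names, the staging copy, and the 8m target are illustrative assumptions, not details from the thread.)

  -- Sketch only: rewrite a one-file parquet dimension into many ~8MB files so the
  -- join/exchange can be scheduled across more nodes. Table names are hypothetical.
  SET PARQUET_FILE_SIZE=8m;

  INSERT OVERWRITE TABLE dim_customer_split
  SELECT * FROM dim_customer;

  -- Reset afterwards (0 falls back to Impala's default target size) so larger
  -- loads aren't also chopped into tiny files.
  SET PARQUET_FILE_SIZE=0;

With the dimension spread across multiple files/blocks, the subsequent join and exchange fragments can run on roughly as many nodes as there are blocks, which is the effect described above.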
On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <[email protected]> wrote:

I suppose in the case that the dimension table scan makes up a non-trivial portion of your workload time, then yeah, parallelizing the scan as you suggest would be beneficial. That said, in typical analytic queries, scanning the dimension tables is very quick compared to scanning the much larger fact tables, so the extra parallelism on the dim table scan isn't worth too much.

-Todd

On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd working with parquet in the past, I've seen small dimensions that fit in a single file/block limit the parallelism of join/exchange/aggregation nodes, and I've forced those dims to spread across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or similar when doing INSERT OVERWRITE to load them, which then allows these operations to parallelize across that many nodes.

Wouldn't it be useful here for Cliff's small dims to be partitioned into a couple of tablets to similarly improve parallelism?

-m

On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <[email protected]> wrote:

On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]> wrote:
> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing -- it's much appreciated! I just have one last follow-up question. Reading about BROADCAST operations (Kudu, Spark, Flink, etc.), it seems the smaller table is always copied in its entirety BEFORE the predicate is evaluated.

That's not quite true. If you have a predicate on a joined column, or on one of the columns in the joined table, it will be pushed down to the "scan" operator, which happens before the "exchange". In addition, there is a feature called "runtime filters" that can push dynamically generated filters from one side of the exchange to the other.

> But since the Kudu client provides a serialized scanner as part of the ScanToken API, why wouldn't Impala use that instead if it knows that the table is Kudu and the query has any type of predicate? Perhaps if I hash-partition the table I could force this (because that complicates a BROADCAST)? I guess this is really a question for Impala, but perhaps there is a more basic reason.
>
> -Cliff

Impala could definitely be smarter; it's just a matter of programming Kudu-specific join strategies into the optimizer. Today, the optimizer isn't aware of the unique properties of Kudu scans vs. other storage mechanisms.

-Todd
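(A quick way to see which of these mechanisms a given query actually uses is EXPLAIN. The sketch below uses hypothetical fact_events/dim_geo tables; the [SHUFFLE] hint is the one Cliff mentions above for forcing a partitioned join instead of a broadcast, and STRAIGHT_JOIN keeps the planner from reordering the tables around the hint.)

  -- Hypothetical tables: fact_events (large) and dim_geo (a small Kudu dimension).
  -- The plan output shows the chosen join strategy (broadcast vs. partitioned),
  -- the predicates pushed down to the Kudu scan node, and any runtime filters.
  EXPLAIN
  SELECT f.event_id, d.region
  FROM fact_events f
  JOIN dim_geo d ON f.geo_id = d.id
  WHERE d.country = 'US';

  -- Same query, forcing a partitioned (shuffle) join instead of a broadcast:
  SELECT STRAIGHT_JOIN f.event_id, d.region
  FROM fact_events f
  JOIN [SHUFFLE] dim_geo d ON f.geo_id = d.id
  WHERE d.country = 'US';

In the plan, predicates that were pushed down appear on the Kudu scan node and runtime filters show up as RF entries, which makes it easy to confirm that the filtering happens before the exchange rather than after it.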
On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]> wrote:

On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <[email protected]> wrote:
> I thought I had read that the Kudu client can configure a scan for CLOSEST_REPLICA and assumed this was a way to take advantage of data collocation.

Yea, when a client uses CLOSEST_REPLICA it will read a local replica if available. However, that doesn't influence the higher-level operation of the Impala (or Spark) planner. The planner isn't aware of the replication policy, so it will use one of the existing supported JOIN strategies. Given statistics, it will choose to broadcast the small table, which means that it will create a plan that looks like:

                       +-------------------------+
                       |                         |
        +------------->| build    JOIN           |
        |              |                         |
        |              |          probe          |
+-------+--------+     +------------+------------+
|                |                  |
|    Exchange    |                  |
|   (broadcast)  |                  |
|                |                  |
+-------+--------+                  |
        |                           |
+-------+--------+     +------------+------------+
|                |     |                         |
|   SCAN KUDU    |     |    SCAN (other side)    |
|                |     |                         |
+----------------+     +-------------------------+

(hopefully the ASCII art comes through)

In other words, the "scan kudu" operator scans the table once, and then replicates the results of that scan into the JOIN operator. The "scan kudu" operator of course will read its local copy, but it will still go through the exchange process.

For the use case you're talking about, where the join is just looking up a single row by PK in a dimension table, ideally we'd be using an altogether different join strategy such as a nested-loop join, with the inner "loop" actually being a Kudu PK lookup, but that strategy isn't implemented by Impala.

-Todd

> If this exists, then how far out of context is my understanding of it? Reading about HDFS cache replication, I do know that Impala will choose a random replica there to more evenly distribute load. But especially compared to Kudu upsert, managing mutable data using Parquet is painful. So, perhaps to sum things up, if nearly 100% of my metadata scans are single primary key lookups followed by a tiny broadcast, then am I really just splitting hairs performance-wise between Kudu and HDFS-cached parquet?

From: Todd Lipcon <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, March 16, 2018 at 2:51 PM
To: "[email protected]" <[email protected]>
Subject: Re: "broadcast" tablet replication for kudu?
It's worth noting that, even if your table is replicated, Impala's planner is unaware of this fact and it will give the same plan regardless. That is to say, rather than every node scanning its local copy, a single node will perform the whole scan (assuming it's a small table) and broadcast it from there within the scope of a single query. So I don't think you'll see any performance improvement on Impala queries by attempting something like an extremely high replication count.

I could see bumping the replication count to 5 for these tables, since the extra storage cost is low and it will ensure higher availability of the important central tables, but I'd be surprised if there is any measurable perf impact.

-Todd

On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <[email protected]> wrote:

Thanks for that, glad I was wrong there! Aside from replication considerations, is it also recommended that the number of tablet servers be odd?

I will check the forums as you suggested, but from what I read after searching, Impala relies on user-configured caching strategies using HDFS cache. The workload for these tables is very light on writes, maybe a dozen or so records per hour across 6 or 7 tables. The size of the tables ranges from thousands to low millions of rows, so sub-partitioning would not be required. So perhaps this is not a typical use case, but I think it could work quite well with Kudu.

From: Dan Burkert <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, March 16, 2018 at 2:09 PM
To: "[email protected]" <[email protected]>
Subject: Re: "broadcast" tablet replication for kudu?

The replication count is the number of tablet servers which Kudu will host copies on. So if you set the replication level to 5, Kudu will put the data on 5 separate tablet servers. There's no built-in broadcast table feature; upping the replication factor is the closest thing. A couple of things to keep in mind:

- Always use an odd replication count. This is important due to how the Raft algorithm works. Recent versions of Kudu won't even let you specify an even number without flipping some flags.
- We don't test much beyond 5 replicas. It *should* work, but you may run into issues since it's a relatively rare configuration. With a heavy write workload and many replicas, you are even more likely to encounter issues.

It's also worth checking in an Impala forum whether it has features that make joins against small broadcast tables better.
Perhaps Impala can cache small tables locally when doing joins.

- Dan

On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <[email protected]> wrote:

The problem is, AFAIK, that the replication count is not necessarily the distribution count, so you can't guarantee all tablet servers will have a copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:

I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We have a few tables that are small but used a lot. My plan is to replicate them more than 3 times. When you create a Kudu table, you can specify the number of replicated copies (3 by default), and I guess you can put there a number corresponding to your node count in the cluster. The downside: you cannot change that number unless you recreate the table.

On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]> wrote:

We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One Redshift feature that we will miss is its ALL distribution, where a copy of a table is maintained on each server. We define a number of metadata tables this way since they are used in nearly every query. We are considering using parquet in HDFS cache for these, and Kudu would be a much better fit for the update semantics, but we are worried about the additional contention. I'm wondering if having a broadcast, or ALL, tablet replication might be an easy feature to add to Kudu?

-Cliff
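(For reference, the closest available workaround discussed above, raising the replication factor when the table is created, can be expressed directly in Impala DDL. A sketch with hypothetical table and column names; per Dan's note the replica count should be odd, and per Boris's note it cannot be changed after creation.)

  -- Sketch only: a small, frequently joined metadata table kept in Kudu with a
  -- higher replication factor. Table and column names are hypothetical.
  CREATE TABLE dim_geo (
    id BIGINT,
    region STRING,
    country STRING,
    PRIMARY KEY (id)
  )
  PARTITION BY HASH (id) PARTITIONS 4    -- a few tablets, per Mauricio's parallelism point
  STORED AS KUDU
  TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');   -- odd count, fixed at creation time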
