Sorry to revive the old thread, but I am curious whether there is a better solution a year later. We have a few small tables (under 300k rows) that are used with practically every single query and, to make things worse, are joined more than once in the same query.

Is there a way to replicate such a table on every node to improve performance and avoid broadcasting it with every query?
On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon <[email protected]> wrote:

> On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin <[email protected]> wrote:
>
> Hi Todd,
>
> Are you saying that your earlier comment below is no longer valid with Impala 2.11, and that if I replicate a table to all our Kudu nodes, Impala can benefit from it?

No, the earlier comment is still valid. I'm just saying that in some cases the exchange can be faster in the new Impala version.

> "It's worth noting that, even if your table is replicated, Impala's planner is unaware of this fact and it will give the same plan regardless. That is to say, rather than every node scanning its local copy, a single node will perform the whole scan (assuming it's a small table) and broadcast it from there within the scope of a single query. So I don't think you'll see any performance improvements on Impala queries by attempting something like an extremely high replication count.
>
> I could see bumping the replication count to 5 for these tables, since the extra storage cost is low and it will ensure higher availability of the important central tables, but I'd be surprised if there is any measurable perf impact."

On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon <[email protected]> wrote:

Are you on the latest release of Impala? It switched from using Thrift for RPC to a new implementation (actually borrowed from Kudu), which might help broadcast performance a bit.

Todd

On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin <[email protected]> wrote:

Sorry to revive the old thread, but I am curious whether there is a good way to speed up requests to frequently used tables in Kudu.

On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <[email protected]> wrote:

Bummer. After reading your conversation, I wish there were an easier way. We will have the same issue, since we have a few dozen tables that are used very frequently in joins, and I was hoping there was an easy way to replicate them on most of the nodes to avoid broadcasts every time.

On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <[email protected]> wrote:

The table in our case is 12x hashed and ranged by month, so the broadcasts were often to all (12) nodes.

On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <[email protected]> wrote:

Sorry I left that out, Cliff. FWIW it does seem to have been broadcast.

I'm not sure, though, how a shuffle would be much different from a broadcast if the entire table is one file/block on one node.

On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <[email protected]> wrote:

From the screenshot it does not look like there was a broadcast of the dimension table(s), so it could be that the multiple smaller sends help here. Our dim tables are generally in the single-digit millions of rows, and Impala chooses to broadcast them. Since the fact-side result cardinality is always much smaller, we've found that forcing a [shuffle] dimension join is actually faster, since it only sends the dims once rather than from all nodes to all nodes. The degenerate performance of broadcast is especially obvious when the query returns zero results. I don't have much experience here, but it does seem that Kudu's efficient predicate scans can sometimes "break" Impala's query plan.

-Cliff
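A minimal sketch of the [shuffle] hint being described, using hypothetical tables fact_events and dim_customer (without the hint, the planner would typically choose a broadcast join for a small dimension):

    -- The hint goes immediately after JOIN; table and column names are illustrative only.
    SELECT d.name, COUNT(*) AS cnt
    FROM fact_events f
    JOIN [SHUFFLE] dim_customer d
      ON f.customer_id = d.id
    GROUP BY d.name;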
On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd not to belabor the point, but when I suggested breaking up small dim tables into multiple parquet files (and, in this thread's context, perhaps partitioning a Kudu table into multiple tablets even if it is small), it was to speed up joins/exchanges, not to parallelize the scan.

For example, we recently ran into a slow query where a 14M-record dimension fit into a single file and block, so it got scanned on a single node. The scan itself was still pretty quick (300ms), but it caused the join to take 25+ seconds and bogged down the entire query. See the highlighted fragment and its parent.

So we broke the dimension into several small files the way I described in my previous post, and now the join and the query are fast (6s).

-m

On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <[email protected]> wrote:

I suppose in the case that the dimension table scan makes up a non-trivial portion of your workload time, then yes, parallelizing the scan as you suggest would be beneficial. That said, in typical analytic queries, scanning the dimension tables is very quick compared to scanning the much larger fact tables, so the extra parallelism on the dim table scan isn't worth too much.

-Todd

On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd working with parquet in the past, I've seen small dimensions that fit in a single file/block limit the parallelism of join/exchange/aggregation nodes. I've forced those dims to spread across 20 or so blocks by setting SET PARQUET_FILE_SIZE=8m; (or similar) when doing the INSERT OVERWRITE that loads them, which then allows these operations to parallelize across that many nodes.

Wouldn't it be useful here for Cliff's small dims to be partitioned into a couple of tablets to similarly improve parallelism?

-m
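A minimal sketch of both suggestions, with hypothetical names (dim_customer as the small dimension); the 8m target file size and 12 hash partitions are illustrative values only:

    -- Parquet: rewrite the dimension into many small files so the join/exchange can use more nodes.
    SET PARQUET_FILE_SIZE=8m;
    INSERT OVERWRITE TABLE dim_customer_split SELECT * FROM dim_customer;

    -- Kudu: give even a small dimension several tablets up front via hash partitioning.
    CREATE TABLE dim_customer_kudu (
      id BIGINT,
      name STRING,
      PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 12
    STORED AS KUDU;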
On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]> wrote:
>
> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing -- it's much appreciated! I just have one last follow-up question. Reading about BROADCAST operations (Kudu, Spark, Flink, etc.), it seems the smaller table is always copied in its entirety BEFORE the predicate is evaluated.

That's not quite true. If you have a predicate on a joined column, or on one of the columns in the joined table, it will be pushed down to the "scan" operator, which happens before the "exchange". In addition, there is a feature called "runtime filters" that can push dynamically generated filters from one side of the exchange to the other.

> But since the Kudu client provides a serialized scanner as part of the ScanToken API, why wouldn't Impala use that instead if it knows that the table is Kudu and the query has any type of predicate? Perhaps if I hash-partition the table I could force this (because that complicates a BROADCAST)? I guess this is really a question for Impala, but perhaps there is a more basic reason.

Impala could definitely be smarter; it is just a matter of programming Kudu-specific join strategies into the optimizer. Today, the optimizer isn't aware of the unique properties of Kudu scans versus other storage mechanisms.

-Todd

> -Cliff

On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <[email protected]> wrote:
>
> I thought I had read that the Kudu client can configure a scan for CLOSEST_REPLICA and assumed this was a way to take advantage of data collocation.

Yes, when a client uses CLOSEST_REPLICA it will read a local replica if one is available. However, that doesn't influence the higher-level operation of the Impala (or Spark) planner. The planner isn't aware of the replication policy, so it will use one of the existing supported JOIN strategies. Given statistics, it will choose to broadcast the small table, which means it will create a plan that looks like:

                        +--------------------------+
                        |           JOIN           |
           +----------> | build              probe | <----------+
           |            +--------------------------+            |
   +----------------+                                +----------------------+
   |    Exchange    |                                |   SCAN (other side)  |
   |   (broadcast)  |                                +----------------------+
   +----------------+
           ^
           |
   +----------------+
   |   SCAN KUDU    |
   +----------------+

(hopefully the ASCII art comes through)

In other words, the "scan kudu" operator scans the table once and then replicates the results of that scan into the JOIN operator. The "scan kudu" operator will of course read its local copy, but the data will still go through the exchange process.

For the use case you're talking about, where the join is just looking up a single row by primary key in a dimension table, ideally we'd use an altogether different join strategy, such as a nested-loop join with the inner "loop" actually being a Kudu PK lookup, but that strategy isn't implemented by Impala.

-Todd
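To check which strategy the planner actually chose for a particular query, a minimal sketch (hypothetical table names again) is to look at the plan before running it; the exchange node shows whether the dimension side is broadcast or shuffled, and the scan nodes show which predicates and runtime filters were pushed down:

    -- Request a more detailed plan, then inspect the join/exchange/scan nodes.
    SET EXPLAIN_LEVEL=2;
    EXPLAIN
    SELECT f.id, d.name
    FROM fact_events f
    JOIN dim_customer d ON f.customer_id = d.id
    WHERE d.region = 'EU';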
> If this exists, then how far out of context is my understanding of it? Reading about HDFS cache replication, I do know that Impala will choose a random replica there to more evenly distribute load. But especially compared to Kudu upsert, managing mutable data using Parquet is painful. So, perhaps to sum things up: if nearly 100% of my metadata scans are single primary-key lookups followed by a tiny broadcast, am I really just splitting hairs performance-wise between Kudu and HDFS-cached Parquet?

From: Todd Lipcon <[email protected]>
Date: Friday, March 16, 2018 at 2:51 PM
Subject: Re: "broadcast" tablet replication for kudu?

It's worth noting that, even if your table is replicated, Impala's planner is unaware of this fact and it will give the same plan regardless. That is to say, rather than every node scanning its local copy, a single node will perform the whole scan (assuming it's a small table) and broadcast it from there within the scope of a single query. So I don't think you'll see any performance improvements on Impala queries by attempting something like an extremely high replication count.

I could see bumping the replication count to 5 for these tables, since the extra storage cost is low and it will ensure higher availability of the important central tables, but I'd be surprised if there is any measurable perf impact.

-Todd

On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <[email protected]> wrote:

Thanks for that, glad I was wrong there! Aside from replication considerations, is it also recommended that the number of tablet servers be odd?

I will check the forums as you suggested, but from what I read after searching, Impala relies on user-configured caching strategies using the HDFS cache. The workload for these tables is very light on writes, maybe a dozen or so records per hour across 6 or 7 tables. The size of the tables ranges from thousands to low millions of rows, so sub-partitioning would not be required. So perhaps this is not a typical use case, but I think it could work quite well with Kudu.
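For the HDFS-cached Parquet alternative mentioned above, a minimal sketch, assuming a hypothetical cache pool dim_pool and a Parquet copy of the dimension called dim_customer_parquet:

    -- One-time setup on the HDFS side (run by the HDFS admin):
    --   hdfs cacheadmin -addPool dim_pool
    -- Then pin the small dimension into the HDFS cache on several nodes:
    ALTER TABLE dim_customer_parquet SET CACHED IN 'dim_pool' WITH REPLICATION = 4;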
From: Dan Burkert <[email protected]>
Date: Friday, March 16, 2018 at 2:09 PM
Subject: Re: "broadcast" tablet replication for kudu?

The replication count is the number of tablet servers on which Kudu will host copies. So if you set the replication level to 5, Kudu will put the data on 5 separate tablet servers. There's no built-in broadcast table feature; upping the replication factor is the closest thing. A couple of things to keep in mind:

- Always use an odd replication count. This is important due to how the Raft algorithm works. Recent versions of Kudu won't even let you specify an even number without flipping some flags.
- We don't test much beyond 5 replicas. It *should* work, but you may run into issues, since it's a relatively rare configuration. With a heavy write workload and many replicas you are even more likely to encounter issues.

It's also worth checking in an Impala forum whether it has features that make joins against small broadcast tables better. Perhaps Impala can cache small tables locally when doing joins.

- Dan

On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <[email protected]> wrote:

The problem is, AFAIK, that the replication count is not necessarily the distribution count, so you can't guarantee that all tablet servers will have a copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:

I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We have a few tables that are small but used a lot. My plan is to replicate them more than 3 times. When you create a Kudu table, you can specify the number of replicated copies (3 by default), and I guess you could put a number there corresponding to your node count in the cluster. The downside is that you cannot change that number unless you recreate the table.
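A minimal sketch of setting that replica count from Impala at table-creation time (hypothetical table dim_lookup; as noted above, the count should be odd and cannot be changed without recreating the table):

    CREATE TABLE dim_lookup (
      id BIGINT,
      name STRING,
      PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 2
    STORED AS KUDU
    TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');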
On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]> wrote:

We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One Redshift feature that we will miss is its ALL distribution, where a copy of a table is maintained on each server. We define a number of metadata tables this way, since they are used in nearly every query. We are considering using Parquet in the HDFS cache for these; Kudu would be a much better fit for the update semantics, but we are worried about the additional contention. I'm wondering whether a broadcast, or ALL, tablet replication might be an easy feature to add to Kudu?

-Cliff
