Thanks for the idea, Clifford. Most of the time we are already doing that, but I will check a few queries to make sure.
Thanks, Todd. Not that this is causing any issues for us now, but I do see joins to a few tables like that in a lot of queries.

On Wed, Apr 24, 2019 at 1:05 PM Clifford Resnick <[email protected]> wrote:

Probably a narrow reach, but do these particular dimension tables possibly have a common column that can be transitively joined with other dimension tables, possibly with some light denormalization? If so, you can add a (redundant) predicate and only the filtered set from that dim table will be broadcast (at least with Impala).

For instance, this will broadcast all of DIM_2:

SELECT f.a, d1.b, d2.c
FROM FACT f
INNER JOIN DIM_1 d1 ON f.dim_1_id = d1.id
INNER JOIN DIM_2 d2 ON f.dim_2_id = d2.id
WHERE f.dim_1_id = 123;

This equivalent query will broadcast only a filtered rowset:

SELECT f.a, d1.b, d2.c
FROM FACT f
INNER JOIN DIM_1 d1 ON f.dim_1_id = d1.id
INNER JOIN DIM_2 d2 ON f.dim_2_id = d2.id
WHERE f.dim_1_id = 123
  AND d2.dim_1_id = d1.id;

On Wed, Apr 24, 2019 at 12:02 PM Boris Tyukin <[email protected]> wrote:

Sorry to revive the old thread, but I am curious whether there is a better solution one year later. We have a few small tables (under 300k rows) which are used with practically every single query and, to make things worse, joined more than once in the same query.

Is there a way to replicate such a table on every node to improve performance and avoid broadcasting it every time?

On Mon, Jul 23, 2018 at 10:52 AM Todd Lipcon <[email protected]> wrote:

> On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin <[email protected]> wrote:
>
> Hi Todd,
>
> Are you saying that your earlier comment below is no longer valid with Impala 2.11, and that if I replicate a table to all our Kudu nodes Impala can benefit from this?

No, the earlier comment is still valid. I am just saying that in some cases the exchange can be faster in the new Impala version.

> "It's worth noting that, even if your table is replicated, Impala's planner is unaware of this fact and it will give the same plan regardless. That is to say, rather than every node scanning its local copy, a single node will perform the whole scan (assuming it's a small table) and broadcast it from there within the scope of a single query. So, I don't think you'll see any performance improvements on Impala queries by attempting something like an extremely high replication count.
>
> I could see bumping the replication count to 5 for these tables since the extra storage cost is low and it will ensure higher availability of the important central tables, but I'd be surprised if there is any measurable perf impact."

On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon <[email protected]> wrote:

Are you on the latest release of Impala? It switched from using Thrift for RPC to a new implementation (actually borrowed from Kudu), which might help broadcast performance a bit.

Todd

On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin <[email protected]> wrote:

Sorry to revive the old thread, but I am curious if there is a good way to speed up requests to frequently used tables in Kudu.
On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <[email protected]> wrote:

Bummer. After reading you guys' conversation, I wish there was an easier way. We will have the same issue, as we have a few dozen tables which are used very frequently in joins, and I was hoping there was an easy way to replicate them on most of the nodes to avoid broadcasts every time.

On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <[email protected]> wrote:

The table in our case is 12x hashed and ranged by month, so the broadcasts were often to all (12) nodes.

On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <[email protected]> wrote:

Sorry I left that out, Cliff. FWIW it does seem to have been broadcast. [screenshot omitted]

Not sure though how a shuffle would be much different from a broadcast if the entire table is one file/block on one node.

On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <[email protected]> wrote:

From the screenshot it does not look like there was a broadcast of the dimension table(s), so it could be the case here that the multiple smaller sends help. Our dim tables are generally in the single-digit millions and Impala chooses to broadcast them. Since the fact result cardinality is always much smaller, we've found that forcing a [shuffle] dimension join is actually faster, since it only sends the dims once rather than to every node. The degenerate performance of broadcast is especially obvious when the query returns zero results. I don't have much experience here, but it does seem that Kudu's efficient predicate scans can sometimes "break" Impala's query plan.

-Cliff
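(As a concrete sketch of the hint Cliff mentions: Impala accepts a join hint between the JOIN keyword and the right-hand table. The table and column names below are made up for illustration.)

-- Force a partitioned (shuffle) join instead of letting the planner
-- broadcast the dimension table to every node.
SELECT f.a, d.b
FROM fact_table f
JOIN /* +SHUFFLE */ dim_table d
  ON f.dim_id = d.id
WHERE f.event_date = '2018-04-11';

(Impala also accepts the bracket form Cliff used: JOIN [SHUFFLE] dim_table d.)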
On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd not to belabor the point, but when I suggested breaking up small dim tables into multiple parquet files (and, in this thread's context, perhaps partitioning a Kudu table, even if small, into multiple tablets), it was to speed up joins/exchanges, not to parallelize the scan.

For example, recently we ran into a slow query where a 14M-record dimension fit into a single file and block, so it got scanned on a single node (though still pretty quickly, about 300 ms); however, it caused the join to take 25+ seconds and bogged down the entire query. See the highlighted fragment and its parent. [screenshot omitted]

So we broke it into several small files the way I described in my previous post, and now the join and query are fast (6s).

-m

On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <[email protected]> wrote:

I suppose in the case that the dimension table scan makes up a non-trivial portion of your workload time, then yes, parallelizing the scan as you suggest would be beneficial. That said, in typical analytic queries, scanning the dimension tables is very quick compared to scanning the much larger fact tables, so the extra parallelism on the dim table scan isn't worth too much.

-Todd

On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <[email protected]> wrote:

@Todd I know that, working with parquet in the past, I've seen small dimensions that fit in a single file/block limit the parallelism of join/exchange/aggregation nodes, and I've forced those dims to spread across 20 or so blocks by leveraging SET PARQUET_FILE_SIZE=8m; or similar when doing the INSERT OVERWRITE to load them, which then allows these operations to parallelize across that many nodes.

Wouldn't it be useful here for Cliff's small dims to be partitioned into a couple of tablets to similarly improve parallelism?

-m
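(For reference, the load pattern Mauricio describes looks roughly like this. The table names dim_parquet and dim_staging are illustrative, and the 8m value is simply the one he mentions.)

-- Write the dimension out as many small Parquet files so the scan and
-- the downstream join/exchange can spread across more nodes.
SET PARQUET_FILE_SIZE=8m;
INSERT OVERWRITE TABLE dim_parquet
SELECT * FROM dim_staging;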
The "scan kudu" > operator of course will read its local copy, but it will still go through > the exchange process. > > > > For the use case you're talking about, where the join is just looking up a > single row by PK in a dimension table, ideally we'd be using an altogether > different join strategy such as nested-loop join, with the inner "loop" > actually being a Kudu PK lookup, but that strategy isn't implemented by > Impala. > > > > -Todd > > > > > > If this exists then how far out of context is my understanding of it? > Reading about HDFS cache replication, I do know that Impala will choose a > random replica there to more evenly distribute load. But especially > compared to Kudu upsert, managing mutable data using Parquet is painful. > So, perhaps to sum thing up, if nearly 100% of my metadata scan are single > Primary Key lookups followed by a tiny broadcast then am I really just > splitting hairs performance-wise between Kudu and HDFS-cached parquet? > > > > *From: * Todd Lipcon <[email protected]> > *Reply-To: *"[email protected]" <[email protected]> > *Date: *Friday, March 16, 2018 at 2:51 PM > > > *To: *"[email protected]" <[email protected]> > *Subject: *Re: "broadcast" tablet replication for kudu? > > > > It's worth noting that, even if your table is replicated, Impala's planner > is unaware of this fact and it will give the same plan regardless. That is > to say, rather than every node scanning its local copy, instead a single > node will perform the whole scan (assuming it's a small table) and > broadcast it from there within the scope of a single query. So, I don't > think you'll see any performance improvements on Impala queries by > attempting something like an extremely high replication count. > > > > I could see bumping the replication count to 5 for these tables since the > extra storage cost is low and it will ensure higher availability of the > important central tables, but I'd be surprised if there is any measurable > perf impact. > > > > -Todd > > > > On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <[email protected]> > wrote: > > Thanks for that, glad I was wrong there! Aside from replication > considerations, is it also recommended the number of tablet servers be odd? > > > > I will check forums as you suggested, but from what I read after searching > is that Impala relies on user configured caching strategies using HDFS > cache. The workload for these tables is very light write, maybe a dozen or > so records per hour across 6 or 7 tables. The size of the tables ranges > from thousands to low millions of rows so so sub-partitioning would not be > required. So perhaps this is not a typical use-case but I think it could > work quite well with kudu. > > > > *From: *Dan Burkert <[email protected]> > *Reply-To: *"[email protected]" <[email protected]> > *Date: *Friday, March 16, 2018 at 2:09 PM > *To: *"[email protected]" <[email protected]> > *Subject: *Re: "broadcast" tablet replication for kudu? > > > > The replication count is the number of tablet servers which Kudu will host > copies on. So if you set the replication level to 5, Kudu will put the > data on 5 separate tablet servers. There's no built-in broadcast table > feature; upping the replication factor is the closest thing. A couple of > things to keep in mind: > > - Always use an odd replication count. This is important due to how the > Raft algorithm works. Recent versions of Kudu won't even let you specify > an even number without flipping some flags. > > - We don't test much much beyond 5 replicas. 
On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <[email protected]> wrote:

The problem is, AFAIK, that the replication count is not necessarily the distribution count, so you can't guarantee all tablet servers will have a copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:

I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We have a few tables that are small but used a lot. My plan is to replicate them more than 3 times. When you create a Kudu table, you can specify the number of replicated copies (3 by default), and I guess you can set that to a number corresponding to your node count in the cluster. The downside is that you cannot change that number unless you recreate the table.

On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]> wrote:

We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One Redshift feature that we will miss is its ALL distribution, where a copy of a table is maintained on each server. We define a number of metadata tables this way since they are used in nearly every query. We are considering using parquet in HDFS cache for these, and Kudu would be a much better fit for the update semantics, but we are worried about the additional contention. I'm wondering if a broadcast, or ALL, tablet replication might be an easy feature to add to Kudu?

-Cliff
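(For readers who haven't used Redshift, the feature Cliff is asking about is declared per table with DISTSTYLE ALL. A minimal sketch with a made-up schema:)

-- Redshift keeps a full copy of an ALL-distributed table on every node,
-- so joins against it never need to move its rows. Kudu/Impala have no
-- direct equivalent, which is what this thread is about.
CREATE TABLE dim_metadata (
  id   BIGINT,
  name VARCHAR(64)
)
DISTSTYLE ALL;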
