On Mon, Jul 23, 2018, 7:21 AM Boris Tyukin <[email protected]> wrote:
> Hi Todd,
>
> Are you saying that your earlier comment below is no longer valid with
> Impala 2.11, and that if I replicate a table to all our Kudu nodes Impala
> can benefit from this?

No, the earlier comment is still valid. I'm just saying that in some cases
the exchange can be faster in the new Impala version.

> "It's worth noting that, even if your table is replicated, Impala's
> planner is unaware of this fact and it will give the same plan
> regardless. That is to say, rather than every node scanning its local
> copy, a single node will perform the whole scan (assuming it's a small
> table) and broadcast it from there within the scope of a single query.
> So I don't think you'll see any performance improvements on Impala
> queries by attempting something like an extremely high replication
> count.
>
> I could see bumping the replication count to 5 for these tables since
> the extra storage cost is low and it will ensure higher availability of
> the important central tables, but I'd be surprised if there is any
> measurable perf impact."

On Mon, Jul 23, 2018 at 9:46 AM Todd Lipcon <[email protected]> wrote:

Are you on the latest release of Impala? It switched from using Thrift for
RPC to a new implementation (actually borrowed from Kudu), which might help
broadcast performance a bit.

Todd

On Mon, Jul 23, 2018, 6:43 AM Boris Tyukin <[email protected]> wrote:

Sorry to revive the old thread, but I am curious whether there is a good
way to speed up requests to frequently used tables in Kudu.

On Thu, Apr 12, 2018 at 8:19 AM Boris Tyukin <[email protected]> wrote:

Bummer... After reading your conversation, I wish there was an easier way.
We will have the same issue, since we have a few dozen tables that are used
very frequently in joins, and I was hoping there was an easy way to
replicate them on most of the nodes to avoid broadcasts every time.

On Thu, Apr 12, 2018 at 7:26 AM, Clifford Resnick <[email protected]>
wrote:

The table in our case is 12x hashed and ranged by month, so the broadcasts
were often to all (12) nodes.

On Apr 12, 2018 12:58 AM, Mauricio Aristizabal <[email protected]> wrote:

Sorry I left that out, Cliff. FWIW it does seem to have been broadcast.

Not sure, though, how a shuffle would be much different from a broadcast if
the entire table is 1 file/block on 1 node.

On Wed, Apr 11, 2018 at 8:52 PM, Cliff Resnick <[email protected]> wrote:

From the screenshot it does not look like there was a broadcast of the
dimension table(s), so it could be the case here that the multiple smaller
sends help. Our dim tables are generally in the single-digit millions and
Impala chooses to broadcast them. Since the fact result cardinality is
always much smaller, we've found that forcing a [shuffle] dimension join is
actually faster, since it only sends the dims once rather than to every
node. The degenerate performance of broadcast is especially obvious when
the query returns zero results. I don't have much experience here, but it
does seem that Kudu's efficient predicate scans can sometimes "break"
Impala's query plan.

-Cliff
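For reference, the hint Cliff mentions is spelled like this in Impala SQL.
A minimal sketch only; the fact and dimension table and column names here
are made up:

    -- [SHUFFLE] overrides the planner's default broadcast choice for this
    -- join; [BROADCAST] would force the opposite.
    SELECT f.order_id, d.customer_name
    FROM fact_orders f
      JOIN [SHUFFLE] dim_customer d
        ON f.customer_id = d.customer_id
    WHERE d.region = 'EMEA';

The hint only changes how the dimension rows are exchanged; whether it wins
depends on the relative sizes of the joined inputs, as discussed above.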
On Wed, Apr 11, 2018 at 5:41 PM, Mauricio Aristizabal <[email protected]>
wrote:

@Todd not to belabor the point, but when I suggested breaking up small dim
tables into multiple parquet files (and, in this thread's context, perhaps
partitioning a kudu table, even if small, into multiple tablets), it was to
speed up joins/exchanges, not to parallelize the scan.

For example, recently we ran into a slow query where the 14M record
dimension fit into a single file and block, so it got scanned on a single
node, though still pretty quickly (300ms). However, it caused the join to
take 25+ seconds and bogged down the entire query. See the highlighted
fragment and its parent.

So we broke it into several small files the way I described in my previous
post, and now the join and query are fast (6s).

-m

On Fri, Mar 16, 2018 at 3:55 PM, Todd Lipcon <[email protected]> wrote:

I suppose in the case that the dimension table scan makes up a non-trivial
portion of your workload time, then yes, parallelizing the scan as you
suggest would be beneficial. That said, in typical analytic queries,
scanning the dimension tables is very quick compared to scanning the
much-larger fact tables, so the extra parallelism on the dim table scan
isn't worth too much.

-Todd

On Fri, Mar 16, 2018 at 2:56 PM, Mauricio Aristizabal <[email protected]>
wrote:

@Todd working with parquet in the past, I've seen small dimensions that fit
in a single file/block limit the parallelism of join/exchange/aggregation
nodes, and I've forced those dims to spread across 20 or so blocks by
leveraging SET PARQUET_FILE_SIZE=8m; or similar when doing the INSERT
OVERWRITE to load them, which then allows these operations to parallelize
across that many nodes.

Wouldn't it be useful here for Cliff's small dims to be partitioned into a
couple of tablets to similarly improve parallelism?

-m
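A rough sketch of the reload Mauricio describes; the parquet table names
are hypothetical, and 8m simply caps each output file at roughly 8 MB so
the dimension ends up spread across many small files:

    -- Session option; applies to the INSERT that follows.
    SET PARQUET_FILE_SIZE=8m;

    -- Rewriting the table now produces many small parquet files instead
    -- of one, so downstream joins/exchanges can run on more nodes.
    INSERT OVERWRITE dim_customer
    SELECT * FROM dim_customer_staging;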
On Fri, Mar 16, 2018 at 2:29 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]> wrote:
>
> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing
> -- it's much appreciated! I just have one last follow-up question.
> Reading about BROADCAST operations (Kudu, Spark, Flink, etc.) it seems
> the smaller table is always copied in its entirety BEFORE the predicate
> is evaluated.

That's not quite true. If you have a predicate on a joined column, or on
one of the columns in the joined table, it will be pushed down to the
"scan" operator, which happens before the "exchange". In addition, there is
a feature called "runtime filters" that can push dynamically-generated
filters from one side of the exchange to the other.

> But since the Kudu client provides a serialized scanner as part of the
> ScanToken API, why wouldn't Impala use that instead if it knows that the
> table is Kudu and the query has any type of predicate? Perhaps if I
> hash-partition the table I could force this (because that complicates a
> BROADCAST)? I guess this is really a question for Impala, but perhaps
> there is a more basic reason.
>
> -Cliff

Impala could definitely be smarter; it's just a matter of programming
Kudu-specific join strategies into the optimizer. Today, the optimizer
isn't aware of the unique properties of Kudu scans vs. other storage
mechanisms.

-Todd

On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]> wrote:

> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <[email protected]>
> wrote:
>
> I thought I had read that the Kudu client can configure a scan for
> CLOSEST_REPLICA and assumed this was a way to take advantage of data
> collocation. If this exists, then how far out of context is my
> understanding of it? Reading about HDFS cache replication, I do know
> that Impala will choose a random replica there to more evenly distribute
> load. But especially compared to Kudu upsert, managing mutable data
> using Parquet is painful. So, perhaps to sum things up, if nearly 100%
> of my metadata scans are single Primary Key lookups followed by a tiny
> broadcast, am I really just splitting hairs performance-wise between
> Kudu and HDFS-cached parquet?

Yea, when a client uses CLOSEST_REPLICA it will read a local replica if one
is available. However, that doesn't influence the higher-level operation of
the Impala (or Spark) planner. The planner isn't aware of the replication
policy, so it will use one of the existing supported JOIN strategies. Given
statistics, it will choose to broadcast the small table, which means that
it will create a plan that looks like:

                            +-------------------------+
                            |                         |
                +----------->  build      JOIN        |
                |           |                         |
                |           |            probe        |
        +-------+--------+  +------------^------------+
        |    Exchange    |               |
        |   (broadcast)  |               |
        +-------^--------+               |
                |                        |
        +-------+--------+  +------------+------------+
        |                |  |                         |
        |   SCAN  KUDU   |  |    SCAN (other side)    |
        |                |  |                         |
        +----------------+  +-------------------------+

(hopefully the ASCII art comes through)

In other words, the "scan kudu" operator scans the table once, and then
replicates the results of that scan into the JOIN operator. The "scan kudu"
operator will of course read its local copy, but it will still go through
the exchange process.

For the use case you're talking about, where the join is just looking up a
single row by PK in a dimension table, ideally we'd be using an altogether
different join strategy such as a nested-loop join, with the inner "loop"
actually being a Kudu PK lookup, but that strategy isn't implemented by
Impala.

-Todd
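Statistics are what drive that broadcast-vs-shuffle choice, and the chosen
plan can be inspected with EXPLAIN. A minimal sketch, with made-up table
names:

    -- Up-to-date stats let the planner see that the dimension is small
    -- enough to broadcast; without them it may pick a different strategy.
    COMPUTE STATS dim_customer;
    COMPUTE STATS fact_orders;

    -- The resulting plan shows the exchange type (broadcast vs. hash
    -- partitioned) feeding the join, as in the diagram above.
    EXPLAIN
    SELECT f.order_id, d.customer_name
    FROM fact_orders f
      JOIN dim_customer d ON f.customer_id = d.customer_id;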
From: Todd Lipcon <[email protected]>
Date: Friday, March 16, 2018 at 2:51 PM
Subject: Re: "broadcast" tablet replication for kudu?

It's worth noting that, even if your table is replicated, Impala's planner
is unaware of this fact and it will give the same plan regardless. That is
to say, rather than every node scanning its local copy, a single node will
perform the whole scan (assuming it's a small table) and broadcast it from
there within the scope of a single query. So I don't think you'll see any
performance improvements on Impala queries by attempting something like an
extremely high replication count.

I could see bumping the replication count to 5 for these tables since the
extra storage cost is low and it will ensure higher availability of the
important central tables, but I'd be surprised if there is any measurable
perf impact.

-Todd

On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <[email protected]>
wrote:

Thanks for that, glad I was wrong there! Aside from replication
considerations, is it also recommended that the number of tablet servers
be odd?

I will check the forums as you suggested, but from what I read after
searching, Impala relies on user-configured caching strategies using the
HDFS cache. The workload for these tables is very light on writes, maybe a
dozen or so records per hour across 6 or 7 tables. The size of the tables
ranges from thousands to low millions of rows, so sub-partitioning would
not be required. So perhaps this is not a typical use case, but I think it
could work quite well with kudu.
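For the HDFS-cached parquet alternative, the caching strategies Clifford
mentions are configured with DDL along these lines. A sketch only; the
table name and the pre-created cache pool are hypothetical:

    -- Assumes an administrator already created a cache pool, e.g.:
    --   hdfs cacheadmin -addPool metadata_pool
    ALTER TABLE dim_customer SET CACHED IN 'metadata_pool' WITH REPLICATION = 4;

    -- To release it later:
    -- ALTER TABLE dim_customer SET UNCACHED;

The WITH REPLICATION clause controls how many cached copies exist, which is
what lets Impala spread reads across cached replicas as described above.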
From: Dan Burkert <[email protected]>
Date: Friday, March 16, 2018 at 2:09 PM
Subject: Re: "broadcast" tablet replication for kudu?

The replication count is the number of tablet servers which Kudu will host
copies on. So if you set the replication level to 5, Kudu will put the data
on 5 separate tablet servers. There's no built-in broadcast table feature;
upping the replication factor is the closest thing. A couple of things to
keep in mind:

- Always use an odd replication count. This is important due to how the
  Raft algorithm works. Recent versions of Kudu won't even let you specify
  an even number without flipping some flags.
- We don't test much beyond 5 replicas. It *should* work, but you may run
  into issues since it's a relatively rare configuration. With a heavy
  write workload and many replicas, you are even more likely to encounter
  issues.

It's also worth checking in an Impala forum whether it has features that
make joins against small broadcast tables better; perhaps Impala can cache
small tables locally when doing joins.

- Dan

On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <[email protected]>
wrote:

The problem is, AFAIK, that the replication count is not necessarily the
distribution count, so you can't guarantee all tablet servers will have a
copy.

On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:

I'm new to Kudu, but we are also going to use Impala mostly with Kudu. We
have a few tables that are small but used a lot. My plan is to replicate
them more than 3 times. When you create a kudu table, you can specify the
number of replicated copies (3 by default), and I guess you can put there a
number corresponding to your node count in the cluster. The downside: you
cannot change that number unless you recreate the table.
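A minimal sketch of what Boris describes, creating a Kudu table through
Impala with a higher replication factor (the table itself is made up):

    -- 'kudu.num_tablet_replicas' must be odd, and changing it later means
    -- recreating and reloading the table.
    CREATE TABLE dim_lookup (
      id BIGINT,
      name STRING,
      PRIMARY KEY (id)
    )
    PARTITION BY HASH (id) PARTITIONS 2
    STORED AS KUDU
    TBLPROPERTIES ('kudu.num_tablet_replicas' = '5');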
On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]> wrote:

We will soon be moving our analytics from AWS Redshift to Impala/Kudu. One
Redshift feature that we will miss is its ALL distribution, where a copy of
a table is maintained on each server. We define a number of metadata tables
this way since they are used in nearly every query. We are considering
using parquet in the HDFS cache for these, and Kudu would be a much better
fit for the update semantics, but we are worried about the additional
contention. I'm wondering if having a broadcast, or ALL, tablet replication
might be an easy feature to add to Kudu?

-Cliff
