On Fri, Mar 16, 2018 at 2:19 PM, Cliff Resnick <[email protected]> wrote:
> Hey Todd,
>
> Thanks for that explanation, as well as all the great work you're doing --
> it's much appreciated! I just have one last follow-up question. Reading
> about BROADCAST operations (Kudu, Spark, Flink, etc.) it seems the smaller
> table is always copied in its entirety BEFORE the predicate is evaluated.
>

That's not quite true. If you have a predicate on a joined column, or on one
of the columns in the joined table, it will be pushed down to the "scan"
operator, which happens before the "exchange". In addition, there is a
feature called "runtime filters" that can push dynamically generated filters
from one side of the exchange to the other.

> But since the Kudu client provides a serialized scanner as part of the
> ScanToken API, why wouldn't Impala use that instead if it knows that the
> table is Kudu and the query has any type of predicate? Perhaps if I
> hash-partition the table I could force this (because that complicates a
> BROADCAST)? I guess this is really a question for Impala, but perhaps there
> is a more basic reason.
>

Impala could definitely be smarter; it's just a matter of programming
Kudu-specific join strategies into the optimizer. Today, the optimizer isn't
aware of the unique properties of Kudu scans vs. other storage mechanisms.

-Todd

> -Cliff
>
> On Fri, Mar 16, 2018 at 4:10 PM, Todd Lipcon <[email protected]> wrote:
>
>> On Fri, Mar 16, 2018 at 12:30 PM, Clifford Resnick <
>> [email protected]> wrote:
>>
>>> I thought I had read that the Kudu client can configure a scan for
>>> CLOSEST_REPLICA and assumed this was a way to take advantage of data
>>> collocation.
>>>
>>
>> Yeah, when a client uses CLOSEST_REPLICA it will read a local replica if
>> one is available. However, that doesn't influence the higher-level
>> operation of the Impala (or Spark) planner. The planner isn't aware of the
>> replication policy, so it will use one of the existing supported JOIN
>> strategies. Given statistics, it will choose to broadcast the small table,
>> which means it will create a plan that looks like:
>>
>>                 +-----------------------------+
>>                 |                             |
>>      +--------->| build     JOIN        probe |
>>      |          |                             |
>>      |          +-----------------------------+
>>      |                                    ^
>>      |                                    |
>> +----+---------+                          |
>> |   Exchange   |                          |
>> |  (broadcast) |                          |
>> +--------------+                          |
>>        ^                                  |
>>        |                                  |
>> +------+------+           +---------------+------+
>> |    SCAN     |           |                      |
>> |    KUDU     |           |  SCAN (other side)   |
>> |             |           |                      |
>> +-------------+           +----------------------+
>>
>> (hopefully the ASCII art comes through)
>>
>> In other words, the "scan kudu" operator scans the table once, and then
>> replicates the results of that scan into the JOIN operator. The "scan
>> kudu" operator will of course read its local copy, but it will still go
>> through the exchange process.
>>
>> For the use case you're talking about, where the join is just looking up
>> a single row by PK in a dimension table, ideally we'd be using an
>> altogether different join strategy such as a nested-loop join, with the
>> inner "loop" actually being a Kudu PK lookup, but that strategy isn't
>> implemented by Impala.
>>
>> -Todd
>>
>>> If this exists, then how far out of context is my understanding of it?
>>> Reading about HDFS cache replication, I do know that Impala will choose
>>> a random replica there to more evenly distribute load. But especially
>>> compared to Kudu upsert, managing mutable data using Parquet is painful.
>>> So, perhaps to sum things up: if nearly 100% of my metadata scans are
>>> single primary-key lookups followed by a tiny broadcast, then am I
>>> really just splitting hairs performance-wise between Kudu and
>>> HDFS-cached Parquet?
>>>
>>> From: Todd Lipcon <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, March 16, 2018 at 2:51 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>
>>> It's worth noting that, even if your table is replicated, Impala's
>>> planner is unaware of this fact and will produce the same plan
>>> regardless. That is to say, rather than every node scanning its local
>>> copy, a single node will perform the whole scan (assuming it's a small
>>> table) and broadcast it from there within the scope of a single query.
>>> So, I don't think you'll see any performance improvement on Impala
>>> queries by attempting something like an extremely high replication
>>> count.
>>>
>>> I could see bumping the replication count to 5 for these tables, since
>>> the extra storage cost is low and it will ensure higher availability of
>>> the important central tables, but I'd be surprised if there is any
>>> measurable perf impact.
>>>
>>> -Todd
>>>
>>> On Fri, Mar 16, 2018 at 11:35 AM, Clifford Resnick <
>>> [email protected]> wrote:
>>>
>>>> Thanks for that, glad I was wrong there! Aside from replication
>>>> considerations, is it also recommended that the number of tablet
>>>> servers be odd?
>>>>
>>>> I will check the forums as you suggested, but from what I read after
>>>> searching, Impala relies on user-configured caching strategies using
>>>> the HDFS cache. The workload for these tables is very light on writes,
>>>> maybe a dozen or so records per hour across 6 or 7 tables. The size of
>>>> the tables ranges from thousands to low millions of rows, so
>>>> sub-partitioning would not be required. So perhaps this is not a
>>>> typical use case, but I think it could work quite well with Kudu.
>>>>
>>>> From: Dan Burkert <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Friday, March 16, 2018 at 2:09 PM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: "broadcast" tablet replication for kudu?
>>>>
>>>> The replication count is the number of tablet servers on which Kudu
>>>> will host copies of the data. So if you set the replication level to 5,
>>>> Kudu will put the data on 5 separate tablet servers. There's no
>>>> built-in broadcast table feature; upping the replication factor is the
>>>> closest thing. A couple of things to keep in mind:
>>>>
>>>> - Always use an odd replication count. This is important due to how the
>>>> Raft algorithm works. Recent versions of Kudu won't even let you
>>>> specify an even number without flipping some flags.
>>>> - We don't test much beyond 5 replicas. It *should* work, but you may
>>>> run into issues since it's a relatively rare configuration. With a
>>>> heavy write workload and many replicas, you are even more likely to
>>>> encounter issues.
>>>>
>>>> It's also worth checking in an Impala forum whether it has features
>>>> that make joins against small broadcast tables better. Perhaps Impala
>>>> can cache small tables locally when doing joins.
>>>>
>>>> - Dan
>>>>
>>>> On Fri, Mar 16, 2018 at 10:55 AM, Clifford Resnick <
>>>> [email protected]> wrote:
>>>>
>>>>> The problem is, AFAIK, that the replication count is not necessarily
>>>>> the distribution count, so you can't guarantee that all tablet servers
>>>>> will have a copy.
>>>>>
>>>>> On Mar 16, 2018 1:41 PM, Boris Tyukin <[email protected]> wrote:
>>>>> I'm new to Kudu, but we are also going to use Impala mostly with Kudu.
>>>>> We have a few tables that are small but used a lot. My plan is to
>>>>> replicate them more than 3 times. When you create a Kudu table, you
>>>>> can specify the number of replicated copies (3 by default), and I
>>>>> guess you can set that to match the node count of your cluster. The
>>>>> downside is that you cannot change that number unless you recreate the
>>>>> table.
>>>>>
>>>>> On Fri, Mar 16, 2018 at 10:42 AM, Cliff Resnick <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We will soon be moving our analytics from AWS Redshift to
>>>>>> Impala/Kudu. One Redshift feature that we will miss is its ALL
>>>>>> distribution, where a copy of a table is maintained on each server.
>>>>>> We define a number of metadata tables this way since they are used in
>>>>>> nearly every query. We are considering using Parquet in the HDFS
>>>>>> cache for these; Kudu would be a much better fit for the update
>>>>>> semantics, but we are worried about the additional contention. I'm
>>>>>> wondering if a broadcast, or ALL, tablet replication mode might be an
>>>>>> easy feature to add to Kudu?
>>>>>>
>>>>>> -Cliff
>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>

--
Todd Lipcon
Software Engineer, Cloudera
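
A few client-level sketches of the Kudu APIs discussed in this thread, using
the Kudu Java client. First, the predicate pushdown Todd describes: a
predicate attached to a Kudu scan is evaluated by the tablet server, so a
single-row primary-key lookup against a dimension table returns just that
row. This is only a minimal sketch; the master address, table name
("metadata_dim"), and key column ("id") are hypothetical.

    import org.apache.kudu.client.*;

    public class PkLookupSketch {
      public static void main(String[] args) throws KuduException {
        // Placeholder master address; substitute your cluster's masters.
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          KuduTable table = client.openTable("metadata_dim");  // hypothetical table

          // Equality predicate on the primary key column. Kudu evaluates it
          // server-side, so only the matching row crosses the wire.
          KuduPredicate byPk = KuduPredicate.newComparisonPredicate(
              table.getSchema().getColumn("id"),                // hypothetical PK column
              KuduPredicate.ComparisonOp.EQUAL, 42L);

          KuduScanner scanner = client.newScannerBuilder(table)
              .addPredicate(byPk)
              .build();

          while (scanner.hasMoreRows()) {
            for (RowResult row : scanner.nextRows()) {
              System.out.println(row.rowToString());
            }
          }
        }
      }
    }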
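
The ScanToken API Cliff mentions hands back serialized scan descriptors that
a planner process can distribute to workers; each worker turns its token back
into a ready-to-run scanner, with any predicates already attached. A rough
sketch of that round trip, under the same hypothetical names as above:

    import java.io.IOException;
    import java.util.List;
    import org.apache.kudu.client.*;

    public class ScanTokenSketch {
      public static void main(String[] args) throws IOException {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          KuduTable table = client.openTable("metadata_dim");

          // "Planner" side: build tokens with the predicate baked in.
          List<KuduScanToken> tokens = client.newScanTokenBuilder(table)
              .addPredicate(KuduPredicate.newComparisonPredicate(
                  table.getSchema().getColumn("id"),
                  KuduPredicate.ComparisonOp.EQUAL, 42L))
              .build();

          for (KuduScanToken token : tokens) {
            byte[] serialized = token.serialize();  // bytes shipped to a worker

            // "Worker" side: rehydrate the token into a scanner and read it.
            KuduScanner scanner =
                KuduScanToken.deserializeIntoScanner(serialized, client);
            while (scanner.hasMoreRows()) {
              for (RowResult row : scanner.nextRows()) {
                System.out.println(row.rowToString());
              }
            }
          }
        }
      }
    }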
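
The CLOSEST_REPLICA setting Clifford asked about is a per-scanner option in
the Kudu client; as Todd notes, it only affects which replica a given scan
reads from, not how the planner lays out the join. A minimal sketch, assuming
the same hypothetical table and a reasonably recent Java client that exposes
the replica-selection option on the scanner builder:

    import org.apache.kudu.client.*;

    public class ClosestReplicaSketch {
      public static void main(String[] args) throws KuduException {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          KuduTable table = client.openTable("metadata_dim");

          // Prefer a replica close to this client (e.g. a co-located tablet
          // server) instead of the default of always reading from the leader.
          KuduScanner scanner = client.newScannerBuilder(table)
              .replicaSelection(ReplicaSelection.CLOSEST_REPLICA)
              .build();

          while (scanner.hasMoreRows()) {
            scanner.nextRows();  // row handling elided in this sketch
          }
        }
      }
    }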
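
Finally, Dan and Boris's point that the replication factor is fixed when the
table is created: with the Java client it is set through CreateTableOptions,
roughly as below. The two-column schema and table name are made up for
illustration; pick an odd replica count, and note that (as discussed above)
changing it later means recreating the table.

    import java.util.Arrays;
    import java.util.Collections;
    import org.apache.kudu.ColumnSchema;
    import org.apache.kudu.Schema;
    import org.apache.kudu.Type;
    import org.apache.kudu.client.CreateTableOptions;
    import org.apache.kudu.client.KuduClient;
    import org.apache.kudu.client.KuduException;

    public class CreateReplicatedTableSketch {
      public static void main(String[] args) throws KuduException {
        try (KuduClient client =
                 new KuduClient.KuduClientBuilder("kudu-master:7051").build()) {
          // Hypothetical small dimension table: a PK plus one attribute column.
          Schema schema = new Schema(Arrays.asList(
              new ColumnSchema.ColumnSchemaBuilder("id", Type.INT64).key(true).build(),
              new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING).build()));

          CreateTableOptions opts = new CreateTableOptions()
              .setRangePartitionColumns(Collections.singletonList("id"))
              .setNumReplicas(5);  // odd count; fixed once the table exists

          client.createTable("metadata_dim", schema, opts);
        }
      }
    }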
