On Mon, Jun 26, 2017 at 8:53 AM, Jean-Daniel Cryans <[email protected]> wrote:
> Hi Pavel,
>
> I think the whole Kudu/Spark story needs more attention. For example, Spark
> SQL query plans don't have access to any Kudu stats, so you can end up with
> some really bad join decisions.
>
> It feels like KUDU-1454 should be really easy to solve at this point. What
> we need is to get the RDD to use CLOSEST_REPLICA and to set a propagated
> timestamp like Todd says in the jira. This is all stuff that's done in
> Impala's integration for Kudu. If you wanted to see if that solves your
> problem you could add the following code on this line
> http://github.mtv.cloudera.com/CDH/kudu/blob/cdh5-trunk/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226
>

Of course I meant a link more like this:
https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226

>
> builder.replicaSelection(ReplicaSelection.CLOSEST_REPLICA);
>
> The propagated timestamp part is also needed, but only for consistency
> purposes; it won't affect the locality.
>
> J-D
>
> On Mon, Jun 26, 2017 at 12:59 AM, Pavel Martynov <[email protected]>
> wrote:
>
>> Hi, guys!
>>
>> I am working on replacing the proprietary analytics platform Microsoft PDW
>> (aka Microsoft APS) at my company with an open source alternative.
>> Currently, I am experimenting with a Mesos/Spark/Kudu stack and it looks
>> promising.
>>
>> Recently I discovered some very strange behavior. Situation: I have a table
>> on a 5-server cluster with 50 tablets and run a simple Spark rdd.count()
>> against it. If the table has no replication, all is fine: every server runs
>> the count aggregation on local data. But if the table has replication > 1,
>> I see (with the iftop util) that Spark scans remote tablets, while the
>> Spark UI still shows me tasks with locality NODE_LOCAL, which is not true.
>>
>> I found issue https://issues.apache.org/jira/browse/KUDU-1454 "Spark and
>> MR jobs running without scan locality", which looks like my problem.
>>
>> IMHO Kudu-Spark can't be considered production-ready with such an issue.
>> Are there fundamental problems with fixing that issue?
>>
>> --
>> with best regards, Pavel Martynov
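
To illustrate why the replica selection policy matters for the locality problem described in this thread, here is a minimal, self-contained Java sketch. It is a toy model, not Kudu's implementation: the `Replica` class, the `choose` helper, and the host names are all hypothetical, and only the two policy names mirror the values of Kudu's `ReplicaSelection` enum. It assumes, as I understand the client's behavior, that the default policy reads from the leader replica, and it simplifies the "no local replica" case by falling back to the leader.

```java
import java.util.Arrays;
import java.util.List;

public class ReplicaSelectionSketch {
    // Policy names mirror Kudu's ReplicaSelection enum; the logic below is a toy model.
    public enum Selection { LEADER_ONLY, CLOSEST_REPLICA }

    // Hypothetical stand-in for one replica of a tablet.
    public static final class Replica {
        public final String host;
        public final boolean leader;

        public Replica(String host, boolean leader) {
            this.host = host;
            this.leader = leader;
        }
    }

    // Picks the replica a scan would read from, given the host the task runs on.
    public static Replica choose(List<Replica> replicas, String localHost, Selection selection) {
        Replica leader = null;
        Replica local = null;
        for (Replica r : replicas) {
            if (r.leader) {
                leader = r;
            }
            if (r.host.equals(localHost)) {
                local = r;
            }
        }
        if (leader == null) {
            throw new IllegalArgumentException("tablet has no leader replica");
        }
        // CLOSEST_REPLICA prefers a replica on the task's own host when one exists.
        if (selection == Selection.CLOSEST_REPLICA && local != null) {
            return local;
        }
        // Simplification (assumption): otherwise read from the leader.
        return leader;
    }

    public static void main(String[] args) {
        // One tablet with replication factor 3; the leader lives on node2.
        List<Replica> tablet = Arrays.asList(
                new Replica("node1", false),
                new Replica("node2", true),
                new Replica("node3", false));

        // Spark schedules the task on node1 because node1 holds a replica.
        String taskHost = "node1";
        for (Selection s : Selection.values()) {
            Replica chosen = choose(tablet, taskHost, s);
            String where = chosen.host.equals(taskHost) ? "local" : "remote";
            System.out.println(s + ": scan goes to " + chosen.host + " (" + where + ")");
        }
    }
}
```

In this model a task can be reported as NODE_LOCAL because its host holds a replica, yet under leader-only reads the scan still crosses the network whenever the leader lives elsewhere. With replication factor 1 the only replica is the leader, which matches the observation that unreplicated tables scan locally.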
