Thanks, this line helps me with locality. I agree with you that the Kudu/Spark story needs more attention, because here in 2017 Spark looks like the default "weapon of choice" for analytic data aggregations :)
2017-06-26 19:18 GMT+03:00 Jean-Daniel Cryans <[email protected]>:

> On Mon, Jun 26, 2017 at 8:53 AM, Jean-Daniel Cryans <[email protected]> wrote:
>
>> Hi Pavel,
>>
>> I think the whole Kudu/Spark story needs more attention. For example, Spark SQL query plans don't have access to any Kudu stats, so you can end up with some really bad join decisions.
>>
>> It feels like KUDU-1454 should be really easy to solve at this point. What we need is to get the RDD to use CLOSEST_REPLICA and to set a propagated timestamp, like Todd says in the jira. This is all stuff that's done in Impala's integration for Kudu. If you wanted to see if that solves your problem, you could add the following code on this line:
>> http://github.mtv.cloudera.com/CDH/kudu/blob/cdh5-trunk/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226
>
> Of course I meant a link more like this:
> https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226
>
>> builder.replicaSelection(ReplicaSelection.CLOSEST_REPLICA);
>>
>> The propagated timestamp part is also needed, but only for consistency purposes; it won't affect the locality.
>>
>> J-D
>>
>> On Mon, Jun 26, 2017 at 12:59 AM, Pavel Martynov <[email protected]> wrote:
>>
>>> Hi, guys!
>>>
>>> I am working on replacing the proprietary analytics platform Microsoft PDW (aka Microsoft APS) at my company with an open-source alternative. Currently, I am experimenting with a Mesos/Spark/Kudu stack, and it looks promising.
>>>
>>> Recently I discovered some very strange behavior. The situation: I have a table on a 5-server cluster with 50 tablets, and I run a simple Spark rdd.count() against it. If the table has no replication, all is fine: every server runs the count aggregation on local data. But if the table has replication > 1, I see (with the iftop utility) that Spark scans remote tablets, while the Spark UI still shows me tasks with locality NODE_LOCAL, which is not true.
>>>
>>> I found the issue https://issues.apache.org/jira/browse/KUDU-1454, "Spark and MR jobs running without scan locality", which looks like my problem.
>>>
>>> IMHO Kudu-Spark can't be considered production-ready with such an issue. Are there fundamental problems with fixing it?
>>>
>>> --
>>> with best regards, Pavel Martynov

--
with best regards, Pavel Martynov
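For readers following along: J-D's one-liner goes inside KuduScanToken's internal scanner construction, but in later Kudu releases the scan token builder exposes replica selection publicly, so an application can request the same behavior itself. The sketch below is a minimal, hedged illustration of that client-side usage; the master address ("kudu-master:7051") and table name ("my_table") are placeholders, not values from the thread, and it assumes a running Kudu cluster.

```java
import java.util.List;

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduScanToken;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.ReplicaSelection;

public class ClosestReplicaScanTokens {
    public static void main(String[] args) throws Exception {
        // Placeholder master address; point this at your own cluster.
        KuduClient client =
                new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("my_table"); // placeholder name

            // Ask scanners built from these tokens to read from the closest
            // replica (the local tablet server when one exists) rather than
            // always contacting the leader replica.
            List<KuduScanToken> tokens = client.newScanTokenBuilder(table)
                    .replicaSelection(ReplicaSelection.CLOSEST_REPLICA)
                    .build();

            // One token per tablet; each would normally be shipped to the
            // worker that hosts (a replica of) that tablet.
            System.out.println("tokens: " + tokens.size());
        } finally {
            client.close();
        }
    }
}
```

This only addresses the locality half of J-D's suggestion; the propagated-timestamp part would still need to be set on the client for consistency, as he notes above.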
