Hi guys! I'm working on replacing a proprietary analytics platform, Microsoft PDW (aka Microsoft APS), at my company with an open-source alternative. Currently I'm experimenting with a Mesos/Spark/Kudu stack, and it looks promising.
Recently I discovered some very strange behavior. The situation: I have a table with 50 tablets on a 5-server cluster, and I run a simple Spark rdd.count() against it (a minimal sketch of the job is below). If the table has no replication, all is fine: every server runs the count aggregation on its local data. But if the table has replication > 1, I can see (with the iftop utility) that Spark scans remote tablets, while the Spark UI still shows the tasks with locality NODE_LOCAL, which is not true.

I found issue https://issues.apache.org/jira/browse/KUDU-1454 "Spark and MR jobs running without scan locality", which looks like my problem. IMHO, Kudu-Spark can't be considered production-ready with such an issue. Are there fundamental problems with fixing it?
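For reference, here is roughly what I run; just a minimal sketch assuming the kudu-spark KuduContext API, with the master address, table name, and projected column as placeholders, not my actual job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.kudu.spark.kudu.KuduContext

object KuduCountRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kudu-count-repro"))

    // Placeholder master address; in my tests this points at the real cluster.
    val kuduContext = new KuduContext("kudu-master:7051", sc)

    // Scan the table as an RDD, projecting a single column to keep the
    // scan cheap, then count. With replication = 1 every task reads its
    // local tablet; with replication > 1 iftop shows remote reads even
    // though the Spark UI reports the tasks as NODE_LOCAL.
    val rdd = kuduContext.kuduRDD(sc, "my_table", Seq("key"))
    println(s"count = ${rdd.count()}")

    sc.stop()
  }
}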

-- with best regards, Pavel Martynov