Hi guys! I'm working on replacing a proprietary analytics platform, Microsoft PDW (aka Microsoft APS), at my company with an open-source alternative. Currently I'm experimenting with a Mesos/Spark/Kudu stack, and it looks promising.
Recently I discovered some very strange behavior. The situation: I have a table with 50 tablets on a 5-server cluster, and I run a simple Spark rdd.count() against it (a minimal sketch of the job is below). If the table has no replication, all is fine: every server runs the count aggregation on its local data. But if the table has replication > 1, I can see (with the iftop utility) that Spark scans remote tablets, while the Spark UI still shows the tasks with locality NODE_LOCAL, which is not true.

I found issue https://issues.apache.org/jira/browse/KUDU-1454 "Spark and MR jobs running without scan locality", which looks like my problem. IMHO, Kudu-Spark can't be considered production-ready with such an issue. Are there fundamental problems with fixing it?
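For reference, here is roughly what I run; just a minimal sketch assuming the kudu-spark KuduContext API, with the master address, table name, and projected column as placeholders, not my actual job:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.kudu.spark.kudu.KuduContext

object KuduCountRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kudu-count-repro"))

    // Placeholder master address; in my tests this points at the real cluster.
    val kuduContext = new KuduContext("kudu-master:7051", sc)

    // Scan the table as an RDD, projecting a single column to keep the
    // scan cheap, then count. With replication = 1 every task reads its
    // local tablet; with replication > 1 iftop shows remote reads even
    // though the Spark UI reports the tasks as NODE_LOCAL.
    val rdd = kuduContext.kuduRDD(sc, "my_table", Seq("key"))
    println(s"count = ${rdd.count()}")

    sc.stop()
  }
}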

-- with best regards, Pavel Martynov