Re: Spark locality issue

Jean-Daniel Cryans Mon, 26 Jun 2017 09:18:27 -0700

On Mon, Jun 26, 2017 at 8:53 AM, Jean-Daniel Cryans <[email protected]>
wrote:


> Hi Pavel,
>
> I think the whole Kudu/Spark story needs more attention. For example, Spark
> SQL query plans don't have access to any Kudu stats, so you can end up with
> some really bad join decisions.
>
> It feels like KUDU-1454 should be really easy to solve at this point. What
> we need is to get the RDD to use CLOSEST_REPLICA and to set a propagated
> timestamp like Todd says in the jira. This is all stuff that's done in
> Impala's integration for Kudu. If you wanted to see if that solves your
> problem you could add the following code on this line:
> http://github.mtv.cloudera.com/CDH/kudu/blob/cdh5-trunk/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226
>

Of course I meant a link more like this:
https://github.com/apache/kudu/blob/master/java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java#L226


>
> builder.replicaSelection(ReplicaSelection.CLOSEST_REPLICA);
>
> The propagated timestamp part is also needed, but only for consistency
> purposes; it won't affect the locality.
>
> J-D
>
> On Mon, Jun 26, 2017 at 12:59 AM, Pavel Martynov <[email protected]>
> wrote:
>
>> Hi, guys!
>>
>> I am working on replacing the proprietary analytics platform Microsoft PDW
>> (aka Microsoft APS) at my company with an open source alternative.
>> Currently, I am experimenting with a Mesos/Spark/Kudu stack and it looks
>> promising.
>>
>> Recently I discovered some very strange behavior. The situation: I have a
>> table with 50 tablets on a 5-server cluster and run a simple Spark
>> rdd.count() against it. If the table has no replication, all is fine: every
>> server runs the count aggregation on local data. But if the table has
>> replication > 1, I see (with the iftop utility) that Spark scans remote
>> tablets, while the Spark UI still shows tasks with locality NODE_LOCAL,
>> which is not true.
>>
>> I found issue https://issues.apache.org/jira/browse/KUDU-1454 "Spark and
>> MR jobs running without scan locality" which looks like my problem.
>>
>> IMHO Kudu-Spark can't be considered production-ready with such an
>> issue. Are there fundamental problems with fixing that issue?
>>
>> --
>> with best regards, Pavel Martynov
>>
>
>
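The behavior discussed in this thread can be illustrated with a small toy model. It is only a sketch, not Kudu's implementation: the `Replica` class, the `pickHost` helper, and the host names are hypothetical, and only the policy names LEADER_ONLY and CLOSEST_REPLICA come from the discussion. It assumes that a scan under LEADER_ONLY always reads from the leader replica, while CLOSEST_REPLICA prefers a replica on the task's own host. Under that assumption, a task placed on any replica host looks NODE_LOCAL to the scheduler even though the data is read from a remote leader.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only: none of these types are Kudu's actual classes. */
final class ReplicaSelectionSketch {

    /** Named after the two policies discussed in the thread. */
    enum Policy { LEADER_ONLY, CLOSEST_REPLICA }

    /** Hypothetical stand-in for one replica of a tablet. */
    static final class Replica {
        final String host;
        final boolean leader;

        Replica(String host, boolean leader) {
            this.host = host;
            this.leader = leader;
        }
    }

    /** Returns the host a scan would read from under the given policy. */
    static String pickHost(List<Replica> replicas, String taskHost, Policy policy) {
        if (policy == Policy.CLOSEST_REPLICA) {
            for (Replica r : replicas) {
                if (r.host.equals(taskHost)) {
                    return r.host; // a replica lives on the task's own host
                }
            }
        }
        // LEADER_ONLY, or no local replica: use the leader (a simplification).
        for (Replica r : replicas) {
            if (r.leader) {
                return r.host;
            }
        }
        throw new IllegalStateException("tablet has no leader replica");
    }

    public static void main(String[] args) {
        // Replication factor 3: leader on node1, task scheduled on node2.
        List<Replica> replicated = Arrays.asList(
                new Replica("node1", true),
                new Replica("node2", false),
                new Replica("node3", false));
        System.out.println(pickHost(replicated, "node2", Policy.LEADER_ONLY));
        System.out.println(pickHost(replicated, "node2", Policy.CLOSEST_REPLICA));
    }
}
```

In this model the first call reads from node1 (remote) and the second from node2 (local). With a single replica, the leader is the only copy, so both policies read locally, which matches the unreplicated case described above.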
