Do you also set any of spark.locality.wait.{process,node,rack}? Those
override spark.locality.wait for specific locality levels.
private def getLocalityWait(level: TaskLocality.TaskLocality): Long = {
  val defaultWait = System.getProperty("spark.locality.wait", "3000")
  level match {
    case TaskLocality.PROCESS_LOCAL =>
      System.getProperty("spark.locality.wait.process", defaultWait).toLong
    case TaskLocality.NODE_LOCAL =>
      System.getProperty("spark.locality.wait.node", defaultWait).toLong
    case TaskLocality.RACK_LOCAL =>
      System.getProperty("spark.locality.wait.rack", defaultWait).toLong
    case TaskLocality.ANY =>
      0L
  }
}
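Given that fallback chain, one thing you could try (just a sketch, assuming
you set system properties in the same pre-context-creation spot as your other
settings) is setting all three per-level properties explicitly, so none of
them silently falls back to a smaller default:

```scala
// Sketch: set the per-level overrides explicitly before the SparkContext
// is created. The property names match the getLocalityWait code above;
// the Int.MaxValue value is just an illustrative "wait forever" choice.
val wait = Int.MaxValue.toString
System.setProperty("spark.locality.wait", wait)
System.setProperty("spark.locality.wait.process", wait)
System.setProperty("spark.locality.wait.node", wait)
System.setProperty("spark.locality.wait.rack", wait)
// ... then create the SparkContext as usual
```

Note that the ANY case above returns 0L unconditionally, so a task that
arrives with no locality preference at all will be scheduled immediately on
any executor regardless of these settings.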
The other possibility I'm thinking of is that these tasks are jumping
straight to TaskLocality.ANY with no locality preference. Do you have any
logs you can share that show this fallback to less-preferred locality levels?
Did you have this working properly on 0.7.x?
On Tue, Nov 26, 2013 at 2:54 PM, Erik Freed <[email protected]> wrote:
> Hi Andrew - thanks - that's a good thought - unfortunately, I have those
> set in the same pre-context-creation place as all the other variables that
> I have been using happily for months and that seem to affect Spark as
> expected. I have it set to Int.MaxValue.toString, which I am guessing is
> large enough.
>
> It will very occasionally use all data-local nodes, and sometimes a mix,
> but mostly all process-local...
>
>
> On Tue, Nov 26, 2013 at 2:45 PM, Andrew Ash <[email protected]> wrote:
>
>> Hi Erik,
>>
>> I would guess that if you set spark.locality.wait to an absurdly large
>> value then you would have essentially that effect.
>>
>> Maybe you aren't setting the system property before creating your Spark
>> context?
>>
>> http://spark.incubator.apache.org/docs/latest/configuration.html
>>
>> Andrew
>>
>>
>> On Tue, Nov 26, 2013 at 2:40 PM, Erik Freed <[email protected]> wrote:
>>
>>> Hi All,
>>> After switching to 0.8 and reducing the number of partitions/tasks for
>>> a large-scale computation, I have been unable to force Spark to use only
>>> executors on nodes where the HBase data is local. I have not been able to
>>> find a setting for spark.locality.wait that makes any difference. It is
>>> not an option for us to let Spark choose non-data-local nodes. Is there
>>> some example code showing how to get this to work the way we want? We
>>> have our own input RDD that mimics NewHadoopRDD, and it seems to be doing
>>> the correct thing in all regards wrt preferred locations.
>>>
>>> Do I have to write my own compute Tasks and schedule them myself?
>>>
>>> Anyone have any suggestions? I am stumped.
>>>
>>> cheers,
>>> Erik
>>>
>>>
>>>
>>
>
>
> --
> Erik James Freed
> CoDecision Software
> 510.859.3360
> [email protected]
>
> 1480 Olympus Avenue
> Berkeley, CA
> 94708
>
> 179 Maria Lane
> Orcas, WA
> 98245
>