OK, even if my understanding of allowLocal is incorrect, nevertheless: (1) I'm loading a local file, and (2) the tasks appear to be getting executed on a slave node (ip-10-129-25-28), which is not my master node??
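(Illustration, not from the thread: the following spark-shell sketch shows the distinction Mark describes below. In 0.8-era Spark, actions like first()/take() pass allowLocal=true, while collect() passes allowLocal=false; the RDD name here is hypothetical.)

```scala
// Sketch, assuming an active spark-shell with SparkContext `sc`.
// A single-partition RDD with no dependencies in its lineage:
val rdd = sc.parallelize(1 to 100, 1)

// first()/take() are among the few actions that request allowLocal=true,
// so the DAGScheduler MAY run this job in a thread on the master node:
rdd.first()

// collect() requests allowLocal=false, so the job is always handed to
// executors on worker nodes -- which matches the log output below:
rdd.collect()
```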
On Thu, Oct 3, 2013 at 12:22 PM, Mark Hamstra <[email protected]> wrote:

> No, that is not what allowLocal means. For a very few actions, the
> DAGScheduler will run the job locally (in a separate thread on the master
> node) if the RDD in the action has a single partition and no dependencies
> in its lineage. If allowLocal is false, that doesn't mean that
> SparkContext.textFile called on a local file will magically turn that local
> file into a distributed file and allow more than just the node where the
> file is local to process that file.
>
>
> On Thu, Oct 3, 2013 at 11:05 AM, Shay Seng <[email protected]> wrote:
>
>> Inlined.
>>
>> On Wed, Oct 2, 2013 at 1:00 PM, Matei Zaharia <[email protected]> wrote:
>>
>>> Hi Shangyu,
>>>
>>> (1) When we read in a local file by SparkContext.textFile and do some
>>> map/reduce job on it, how will Spark decide which worker node to send
>>> data to? Will the data be divided/partitioned equally according to the
>>> number of worker nodes, with each worker node getting one piece of data?
>>>
>>> You actually can't run distributed jobs on local files. The local file
>>> URL only works on the same machine, or if the file is in a filesystem
>>> that's mounted on the same path on all worker nodes.
>>>
>> Is this really true?
>>
>> scala> val f = sc.textFile("/root/ue/ue_env.sh")
>> 13/10/03 17:55:45 INFO storage.MemoryStore: ensureFreeSpace(34870) called
>> with curMem=34870, maxMem=4081511301
>> 13/10/03 17:55:45 INFO storage.MemoryStore: Block broadcast_1 stored as
>> values to memory (estimated size 34.1 KB, free 3.8 GB)
>> f: spark.RDD[String] = MappedRDD[5] at textFile at <console>:12
>>
>> scala> f.map(l=>l.split(" ")).collect
>> 13/10/03 17:55:51 INFO mapred.FileInputFormat: Total input paths to
>> process : 1
>> 13/10/03 17:55:51 INFO spark.SparkContext: Starting job: collect at
>> <console>:15
>> 13/10/03 17:55:51 INFO scheduler.DAGScheduler: Got job 2 (collect at
>> <console>:15) with 2 output partitions (*allowLocal=false*)
>> ...
>> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Starting task 2.0:0 as TID
>> 4 on executor 0: ip-10-129-25-28 (preferred)
>> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Serialized task 2.0:0 as
>> 1517 bytes in 3 ms
>> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Starting task 2.0:1 as TID
>> 5 on executor 0: ip-10-129-25-28 (preferred)
>> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Serialized task 2.0:1 as
>> 1517 bytes in 0 ms
>>
>> Doesn't allowLocal=false mean the job is getting distributed to workers
>> rather than computed locally?
>>
>> tks
>>
>>> Matei
>>>
>>> Any help will be appreciated.
>>> Thanks!
>>>
>>> --
>>> Shangyu, Luo
>>> Department of Computer Science
>>> Rice University
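(Aside, not from the thread: a sketch of one way to work with a driver-local file that isn't mounted on the workers, using SparkContext.addFile to ship it to every node and SparkFiles.get to resolve the node-local copy inside a task. Assumes a 0.8-era spark-shell; the partition count and the length computation are hypothetical.)

```scala
// Sketch, assuming an active spark-shell with SparkContext `sc`.
// (In 0.8-era Spark the class lives in the `spark` package, i.e. spark.SparkFiles.)

// Ship the driver-local file to every node that runs tasks:
sc.addFile("/root/ue/ue_env.sh")

// Inside each task, resolve and read the node-local copy:
val sizes = sc.parallelize(1 to 4, 4).map { _ =>
  scala.io.Source.fromFile(SparkFiles.get("ue_env.sh")).mkString.length
}.collect()
```

This avoids the problem Matei describes: tasks never touch the original /root path on workers, only the copy that addFile distributed.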
