No, that is not what allowLocal means. For a very few actions (such as first and take), the DAGScheduler will run the job locally, in a separate thread on the master node, if the RDD in the action has a single partition and no dependencies in its lineage. allowLocal=false does not mean that SparkContext.textFile called on a local file will magically turn that local file into a distributed file and allow nodes other than the one where the file resides to process it.
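To make the condition concrete, here is a minimal plain-Scala sketch of the check described above, assuming the 0.8-era behavior: a job may run locally only when the caller permits it *and* the final stage has exactly one partition and no parent stages. (The `Stage` case class and `canRunLocally` helper are hypothetical simplifications for illustration, not Spark's actual source.)

```scala
// Hypothetical simplification of the DAGScheduler's local-run check.
case class Stage(numPartitions: Int, parents: List[Stage])

def canRunLocally(allowLocal: Boolean, finalStage: Stage): Boolean =
  allowLocal && finalStage.parents.isEmpty && finalStage.numPartitions == 1

// A one-partition, dependency-free stage qualifies when allowLocal is true;
// a two-partition collect (as in the log quoted below) never runs locally,
// regardless of the flag.
val simple  = Stage(1, Nil)
val twoPart = Stage(2, Nil)
println(canRunLocally(allowLocal = true,  finalStage = simple))   // true
println(canRunLocally(allowLocal = false, finalStage = simple))   // false
println(canRunLocally(allowLocal = true,  finalStage = twoPart))  // false
```

So in the transcript below, `allowLocal=false` simply records that collect did not ask for the local fast path; it says nothing about where the input file lives.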
On Thu, Oct 3, 2013 at 11:05 AM, Shay Seng <[email protected]> wrote:

> Inlined.
>
> On Wed, Oct 2, 2013 at 1:00 PM, Matei Zaharia <[email protected]> wrote:
>
>> Hi Shangyu,
>>
>> (1) When we read in a local file with SparkContext.textFile and run some
>> map/reduce job on it, how will Spark decide which worker node to send the
>> data to? Will the data be divided/partitioned equally according to the
>> number of worker nodes, with each worker node getting one piece?
>>
>> You actually can't run distributed jobs on local files. The local file
>> URL only works on the same machine, or if the file is in a filesystem
>> that's mounted on the same path on all worker nodes.
>
> Is this really true?
>
> scala> val f = sc.textFile("/root/ue/ue_env.sh")
> 13/10/03 17:55:45 INFO storage.MemoryStore: ensureFreeSpace(34870) called
> with curMem=34870, maxMem=4081511301
> 13/10/03 17:55:45 INFO storage.MemoryStore: Block broadcast_1 stored as
> values to memory (estimated size 34.1 KB, free 3.8 GB)
> f: spark.RDD[String] = MappedRDD[5] at textFile at <console>:12
>
> scala> f.map(l=>l.split(" ")).collect
> 13/10/03 17:55:51 INFO mapred.FileInputFormat: Total input paths to
> process : 1
> 13/10/03 17:55:51 INFO spark.SparkContext: Starting job: collect at
> <console>:15
> 13/10/03 17:55:51 INFO scheduler.DAGScheduler: Got job 2 (collect at
> <console>:15) with 2 output partitions (*allowLocal=false*)
> ...
> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Starting task 2.0:0 as TID
> 4 on executor 0: ip-10-129-25-28 (preferred)
> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Serialized task 2.0:0 as
> 1517 bytes in 3 ms
> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Starting task 2.0:1 as TID
> 5 on executor 0: ip-10-129-25-28 (preferred)
> 13/10/03 17:55:51 INFO cluster.TaskSetManager: Serialized task 2.0:1 as
> 1517 bytes in 0 ms
>
> Doesn't allowLocal=false mean the job is getting distributed to workers
> rather than computed locally?
>
> tks
>
>> Matei
>>
>> Any help will be appreciated.
>> Thanks!
>>
>> --
>> Shangyu, Luo
>> Department of Computer Science
>> Rice University
