You need to put your data files on a distributed file system (e.g. HDFS or S3) for distributed Spark jobs to work. Otherwise, the workers cannot read files that exist only on a single node.
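If the goal is just the total size rather than a Spark exercise, the work can be done entirely on the driver, which (unlike the remote executors) can actually see the files. A minimal sketch, with hypothetical helper and path names not taken from the thread:

```scala
import java.io.File
import scala.io.Source

// Hypothetical helper: resolve each listed file name against a directory on
// the driver's local filesystem and sum the sizes there. This sidesteps the
// problem because no executor ever has to open a driver-local path.
def totalSizeOnDriver(listFile: String, dataDir: String): Long = {
  val names = Source.fromFile(listFile).getLines().toSeq
  names.map(name => new File(dataDir, name).length).sum
}
```

To keep the computation distributed instead, both the file list and the data files need to live on storage every worker can reach, e.g. an hdfs:// path passed to sc.textFile.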
The reason "first" works is that, for very short actions like first/take, Spark can instead run the action locally on the master.

On Tue, Oct 15, 2013 at 1:27 AM, Markus Losoi <[email protected]> wrote:

> Hi
>
> Assume that a list of file names exists in the file "data/Files.dat":
>
> "
> File1.ext
> File2.ext
> ...
> "
>
> and these files exist in the directory "data". The following commands:
>
> "
> 1: import java.io.File
> 2:
> 3: val filesList = sc.textFile("data/Files.dat", 2)
> 4:
> 5: val fileSizes = filesList.map(file => new File("data/" + file).length)
> 6:
> 7: fileSizes.collect.sum
> "
>
> work correctly when they are run in the Spark shell in local mode. That is,
> I get the total size of the files listed in the file "data/Files.dat".
> (This is not supposed to be a good example of using Spark. I was just
> testing whether I understood how to use the system.) However, the same
> commands return 0 on line 7 ("fileSizes.collect.sum") when they are
> executed in distributed mode (with one worker).
>
> Have you any ideas what could be causing this? Running "fileSizes.collect"
> in distributed mode returns an array of zeros, but mysteriously the command
> "fileSizes.first" returns the correct answer (176034 > 0).
>
> The log messages (with the logging level DEBUG) on spark-shell are as
> follows.
> I have excluded the lines:
>
> "
> 13/10/15 10:58:14 DEBUG Configuration: java.io.IOException: config(config)
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:260)
>     at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:341)
>     at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:346)
>     at org.apache.spark.SparkContext.textFile(SparkContext.scala:319)
>     at $line4.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:13)
>     at $line4.$read$$iwC$$iwC$$iwC.<init>(<console>:18)
>     at $line4.$read$$iwC$$iwC.<init>(<console>:20)
>     at $line4.$read$$iwC.<init>(<console>:22)
>     at $line4.$read.<init>(<console>:24)
>     at $line4.$read$.<init>(<console>:28)
>     at $line4.$read$.<clinit>(<console>)
>     at $line4.$eval$.<init>(<console>:11)
>     at $line4.$eval$.<clinit>(<console>)
>     at $line4.$eval.$export(<console>)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
>     at org.apache.spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:890)
>     at scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
>     at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
>     at java.lang.Thread.run(Thread.java:662)
> "
>
> because they may have resulted from
> https://issues.apache.org/jira/browse/HADOOP-8995. I used Hadoop 1.0.4.
>
> --- Local ---
>
> val filesList = sc.textFile("data/Files.dat", 2):
>
> 13/10/15 10:58:14 DEBUG Groups: Creating new Groups object
> 13/10/15 10:58:14 DEBUG Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
> 13/10/15 10:58:14 DEBUG UserGroupInformation: hadoop login
> 13/10/15 10:58:14 DEBUG UserGroupInformation: hadoop login commit
> 13/10/15 10:58:14 DEBUG UserGroupInformation: using local user:UnixPrincipal: mlosoi
> 13/10/15 10:58:14 DEBUG UserGroupInformation: UGI loginUser:mlosoi
> 13/10/15 10:58:14 DEBUG FileSystem: Creating filesystem for file:///
> 13/10/15 10:58:14 INFO MemoryStore: ensureFreeSpace(36280) called with curMem=0, maxMem=339585269
> 13/10/15 10:58:14 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 35.4 KB, free 323.8 MB)
> 13/10/15 10:58:14 DEBUG BlockManager: Put block broadcast_0 locally took 58 ms
> filesList: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:13
>
> val fileSizes = filesList.map(file => new File("data/" + file).length):
>
> fileSizes: org.apache.spark.rdd.RDD[Long] = MappedRDD[2] at map at <console>:15
>
> fileSizes.collect.sum:
>
> 13/10/15 11:01:02 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
> 13/10/15 11:01:02 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
> 13/10/15 11:01:02 DEBUG NativeCodeLoader: java.library.path=
> 13/10/15 11:01:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 13/10/15 11:01:02 WARN LoadSnappy: Snappy native library not loaded
> 13/10/15 11:01:02 INFO FileInputFormat: Total input paths to process : 1
> 13/10/15 11:01:02 DEBUG FileInputFormat: Total # of splits: 2
> 13/10/15 11:01:02 INFO SparkContext: Starting job: collect at <console>:18
> 13/10/15 11:01:02 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.JobSubmitted
> 13/10/15 11:01:02 INFO DAGScheduler: Got job 0 (collect at <console>:18) with 2 output partitions (allowLocal=false)
> 13/10/15 11:01:02 INFO DAGScheduler: Final stage: Stage 0 (collect at <console>:18)
> 13/10/15 11:01:02 INFO DAGScheduler: Parents of final stage: List()
> 13/10/15 11:01:02 DEBUG BlockManager: Got multiple block location in 2 ms
> 13/10/15 11:01:02 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:01:02 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:01:02 INFO DAGScheduler: Missing parents: List()
> 13/10/15 11:01:02 DEBUG DAGScheduler: submitStage(Stage 0)
> 13/10/15 11:01:02 DEBUG DAGScheduler: missing: List()
> 13/10/15 11:01:02 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at map at <console>:15), which has no missing parents
> 13/10/15 11:01:02 DEBUG DAGScheduler: submitMissingTasks(Stage 0)
> 13/10/15 11:01:02 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[2] at map at <console>:15)
> 13/10/15 11:01:02 DEBUG DAGScheduler: New pending tasks: Set(ResultTask(0, 0), ResultTask(0, 1))
> 13/10/15 11:01:02 DEBUG LocalScheduler: parentName:,name:TaskSet_0,runningTasks:0
> 13/10/15 11:01:02 DEBUG LocalTaskSetManager: availableCpus:1, numFinished:0, numTasks:2
> 13/10/15 11:01:02 INFO LocalTaskSetManager: Size of task 0 is 1671 bytes
> 13/10/15 11:01:02 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.BeginEvent
> 13/10/15 11:01:02 DEBUG LocalTaskSetManager: availableCpus:0, numFinished:0, numTasks:2
> 13/10/15 11:01:02 INFO LocalScheduler: Running 0
> 13/10/15 11:01:02 DEBUG BlockManager: Getting local block broadcast_0
> 13/10/15 11:01:02 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(true, true, true, 1)
> 13/10/15 11:01:02 DEBUG BlockManager: Getting block broadcast_0 from memory
> 13/10/15 11:01:02 INFO HadoopRDD: Input split: file:/home/mlosoi/personal/testenv3/spark_apps/TestApp1/data/Files.dat:0+3375
> 13/10/15 11:01:02 INFO LocalScheduler: Finished 0
> 13/10/15 11:01:02 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.CompletionEvent
> 13/10/15 11:01:02 DEBUG LocalScheduler: parentName:,name:TaskSet_0,runningTasks:0
> 13/10/15 11:01:02 DEBUG LocalTaskSetManager: availableCpus:1, numFinished:1, numTasks:2
> 13/10/15 11:01:02 INFO LocalTaskSetManager: Size of task 1 is 1671 bytes
> 13/10/15 11:01:02 DEBUG LocalTaskSetManager: availableCpus:0, numFinished:1, numTasks:2
> 13/10/15 11:01:02 INFO LocalScheduler: Running 1
> 13/10/15 11:01:02 INFO DAGScheduler: Completed ResultTask(0, 0)
> 13/10/15 11:01:02 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.BeginEvent
> 13/10/15 11:01:02 DEBUG BlockManager: Getting local block broadcast_0
> 13/10/15 11:01:02 DEBUG BlockManager: Level for block broadcast_0 is StorageLevel(true, true, true, 1)
> 13/10/15 11:01:02 DEBUG BlockManager: Getting block broadcast_0 from memory
> 13/10/15 11:01:02 INFO HadoopRDD: Input split: file:/home/mlosoi/personal/testenv3/spark_apps/TestApp1/data/Files.dat:3375+3375
> 13/10/15 11:01:02 INFO LocalScheduler: Finished 1
> 13/10/15 11:01:02 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.CompletionEvent
> 13/10/15 11:01:02 INFO DAGScheduler: Completed ResultTask(0, 1)
> 13/10/15 11:01:02 INFO LocalScheduler: Remove TaskSet 0.0 from pool
> 13/10/15 11:01:02 INFO DAGScheduler: Stage 0 (collect at <console>:18) finished in 0.099 s
> 13/10/15 11:01:02 INFO SparkContext: Job finished: collect at <console>:18, took 0.191815601 s
> res0: Long = 74747646
>
> --- Distributed ---
>
> val filesList = sc.textFile("data/Files.dat", 2):
>
> 13/10/15 11:05:33 DEBUG Groups: Creating new Groups object
> 13/10/15 11:05:33 DEBUG Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
> 13/10/15 11:05:33 DEBUG UserGroupInformation: hadoop login
> 13/10/15 11:05:33 DEBUG UserGroupInformation: hadoop login commit
> 13/10/15 11:05:33 DEBUG UserGroupInformation: using local user:UnixPrincipal: mlosoi
> 13/10/15 11:05:33 DEBUG UserGroupInformation: UGI loginUser:mlosoi
> 13/10/15 11:05:33 DEBUG FileSystem: Creating filesystem for file:///
> 13/10/15 11:05:33 INFO MemoryStore: ensureFreeSpace(36280) called with curMem=0, maxMem=339585269
> 13/10/15 11:05:33 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 35.4 KB, free 323.8 MB)
> 13/10/15 11:05:33 DEBUG BlockManager: Put block broadcast_0 locally took 63 ms
> filesList: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:13
>
> val fileSizes = filesList.map(file => new File("data/" + file).length):
>
> fileSizes: org.apache.spark.rdd.RDD[Long] = MappedRDD[2] at map at <console>:15
>
> fileSizes.collect.sum:
>
> 13/10/15 11:06:58 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
> 13/10/15 11:06:58 DEBUG NativeCodeLoader: Failed to load native-hadoop with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
> 13/10/15 11:06:58 DEBUG NativeCodeLoader: java.library.path=
> 13/10/15 11:06:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 13/10/15 11:06:58 WARN LoadSnappy: Snappy native library not loaded
> 13/10/15 11:06:58 INFO FileInputFormat: Total input paths to process : 1
> 13/10/15 11:06:58 DEBUG FileInputFormat: Total # of splits: 2
> 13/10/15 11:06:58 INFO SparkContext: Starting job: collect at <console>:18
> 13/10/15 11:06:58 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.JobSubmitted
> 13/10/15 11:06:58 INFO DAGScheduler: Got job 0 (collect at <console>:18) with 2 output partitions (allowLocal=false)
> 13/10/15 11:06:58 INFO DAGScheduler: Final stage: Stage 0 (collect at <console>:18)
> 13/10/15 11:06:58 INFO DAGScheduler: Parents of final stage: List()
> 13/10/15 11:06:58 DEBUG BlockManager: Got multiple block location in 2 ms
> 13/10/15 11:06:58 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:06:58 DEBUG BlockManager: Got multiple block location in 2 ms
> 13/10/15 11:06:58 INFO DAGScheduler: Missing parents: List()
> 13/10/15 11:06:58 DEBUG DAGScheduler: submitStage(Stage 0)
> 13/10/15 11:06:58 DEBUG DAGScheduler: missing: List()
> 13/10/15 11:06:58 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[2] at map at <console>:15), which has no missing parents
> 13/10/15 11:06:58 DEBUG DAGScheduler: submitMissingTasks(Stage 0)
> 13/10/15 11:06:58 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[2] at map at <console>:15)
> 13/10/15 11:06:58 DEBUG DAGScheduler: New pending tasks: Set(ResultTask(0, 1), ResultTask(0, 0))
> 13/10/15 11:06:58 INFO ClusterScheduler: Adding task set 0.0 with 2 tasks
> 13/10/15 11:06:58 DEBUG ClusterTaskSetManager: Epoch for TaskSet 0.0: 0
> 13/10/15 11:06:58 DEBUG ClusterTaskSetManager: Valid locality levels for TaskSet 0.0: ANY
> 13/10/15 11:06:58 DEBUG ClusterScheduler: parentName: , name: TaskSet_0, runningTasks: 0
> 13/10/15 11:06:58 DEBUG ClusterScheduler: parentName: , name: TaskSet_0, runningTasks: 0
> 13/10/15 11:06:58 INFO ClusterTaskSetManager: Starting task 0.0:0 as TID 0 on executor 0: 130.233.193.25 (PROCESS_LOCAL)
> 13/10/15 11:06:58 INFO ClusterTaskSetManager: Serialized task 0.0:0 as 1671 bytes in 4 ms
> 13/10/15 11:06:58 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.BeginEvent
> 13/10/15 11:06:58 INFO ClusterTaskSetManager: Finished TID 0 in 482 ms on 130.233.193.25 (progress: 1/2)
> 13/10/15 11:06:58 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.CompletionEvent
> 13/10/15 11:06:58 DEBUG ClusterScheduler: parentName: , name: TaskSet_0, runningTasks: 0
> 13/10/15 11:06:58 DEBUG ClusterScheduler: parentName: , name: TaskSet_0, runningTasks: 0
> 13/10/15 11:06:58 INFO ClusterTaskSetManager: Starting task 0.0:1 as TID 1 on executor 0: 130.233.193.25 (PROCESS_LOCAL)
> 13/10/15 11:06:58 INFO DAGScheduler: Completed ResultTask(0, 0)
> 13/10/15 11:06:58 INFO ClusterTaskSetManager: Serialized task 0.0:1 as 1671 bytes in 1 ms
> 13/10/15 11:06:58 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.BeginEvent
> 13/10/15 11:06:58 INFO ClusterTaskSetManager: Finished TID 1 in 27 ms on 130.233.193.25 (progress: 2/2)
> 13/10/15 11:06:58 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.CompletionEvent
> 13/10/15 11:06:58 INFO DAGScheduler: Completed ResultTask(0, 1)
> 13/10/15 11:06:58 INFO ClusterScheduler: Remove TaskSet 0.0 from pool
> 13/10/15 11:06:58 INFO DAGScheduler: Stage 0 (collect at <console>:18) finished in 0.521 s
> 13/10/15 11:06:58 INFO SparkContext: Job finished: collect at <console>:18, took 0.586467999 s
> res0: Long = 0
>
> fileSizes.collect:
>
> 13/10/15 11:09:04 INFO SparkContext: Starting job: collect at <console>:18
> 13/10/15 11:09:04 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.JobSubmitted
> 13/10/15 11:09:04 INFO DAGScheduler: Got job 1 (collect at <console>:18) with 2 output partitions (allowLocal=false)
> 13/10/15 11:09:04 INFO DAGScheduler: Final stage: Stage 1 (collect at <console>:18)
> 13/10/15 11:09:04 INFO DAGScheduler: Parents of final stage: List()
> 13/10/15 11:09:04 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:09:04 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:09:04 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:09:04 INFO DAGScheduler: Missing parents: List()
> 13/10/15 11:09:04 DEBUG DAGScheduler: submitStage(Stage 1)
> 13/10/15 11:09:04 DEBUG DAGScheduler: missing: List()
> 13/10/15 11:09:04 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[2] at map at <console>:15), which has no missing parents
> 13/10/15 11:09:04 DEBUG DAGScheduler: submitMissingTasks(Stage 1)
> 13/10/15 11:09:04 INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MappedRDD[2] at map at <console>:15)
> 13/10/15 11:09:04 DEBUG DAGScheduler: New pending tasks: Set(ResultTask(1, 1), ResultTask(1, 0))
> 13/10/15 11:09:04 INFO ClusterScheduler: Adding task set 1.0 with 2 tasks
> 13/10/15 11:09:04 DEBUG ClusterTaskSetManager: Epoch for TaskSet 1.0: 0
> 13/10/15 11:09:04 DEBUG ClusterTaskSetManager: Valid locality levels for TaskSet 1.0: ANY
> 13/10/15 11:09:04 DEBUG ClusterScheduler: parentName: , name: TaskSet_1, runningTasks: 0
> 13/10/15 11:09:04 DEBUG ClusterScheduler: parentName: , name: TaskSet_1, runningTasks: 0
> 13/10/15 11:09:04 INFO ClusterTaskSetManager: Starting task 1.0:0 as TID 2 on executor 0: 130.233.193.25 (PROCESS_LOCAL)
> 13/10/15 11:09:04 INFO ClusterTaskSetManager: Serialized task 1.0:0 as 1675 bytes in 1 ms
> 13/10/15 11:09:04 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.BeginEvent
> 13/10/15 11:09:04 INFO ClusterTaskSetManager: Finished TID 2 in 26 ms on 130.233.193.25 (progress: 1/2)
> 13/10/15 11:09:04 DEBUG ClusterScheduler: parentName: , name: TaskSet_1, runningTasks: 0
> 13/10/15 11:09:04 DEBUG ClusterScheduler: parentName: , name: TaskSet_1, runningTasks: 0
> 13/10/15 11:09:04 INFO ClusterTaskSetManager: Starting task 1.0:1 as TID 3 on executor 0: 130.233.193.25 (PROCESS_LOCAL)
> 13/10/15 11:09:04 INFO ClusterTaskSetManager: Serialized task 1.0:1 as 1675 bytes in 0 ms
> 13/10/15 11:09:04 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.CompletionEvent
> 13/10/15 11:09:04 INFO DAGScheduler: Completed ResultTask(1, 0)
> 13/10/15 11:09:04 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.BeginEvent
> 13/10/15 11:09:04 INFO ClusterTaskSetManager: Finished TID 3 in 27 ms on 130.233.193.25 (progress: 2/2)
> 13/10/15 11:09:04 INFO ClusterScheduler: Remove TaskSet 1.0 from pool
> 13/10/15 11:09:04 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.CompletionEvent
> 13/10/15 11:09:04 INFO DAGScheduler: Completed ResultTask(1, 1)
> 13/10/15 11:09:04 INFO DAGScheduler: Stage 1 (collect at <console>:18) finished in 0.057 s
> 13/10/15 11:09:04 INFO SparkContext: Job finished: collect at <console>:18, took 0.06363878 s
> res1: Array[Long] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
>
> fileSizes.first:
>
> 13/10/15 11:10:14 INFO SparkContext: Starting job: first at <console>:18
> 13/10/15 11:10:14 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.JobSubmitted
> 13/10/15 11:10:14 INFO DAGScheduler: Got job 2 (first at <console>:18) with 1 output partitions (allowLocal=true)
> 13/10/15 11:10:14 INFO DAGScheduler: Final stage: Stage 2 (first at <console>:18)
> 13/10/15 11:10:14 INFO DAGScheduler: Parents of final stage: List()
> 13/10/15 11:10:14 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:10:14 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:10:14 DEBUG BlockManager: Got multiple block location in 0 ms
> 13/10/15 11:10:14 INFO DAGScheduler: Missing parents: List()
> 13/10/15 11:10:14 INFO DAGScheduler: Computing the requested partition locally
> 13/10/15 11:10:14 INFO HadoopRDD: Input split: file:/home/mlosoi/personal/testenv3/spark_apps/TestApp1/data/Files.dat:0+3375
> 13/10/15 11:10:14 INFO SparkContext: Job finished: first at <console>:18, took 0.016751241 s
> res2: Long = 176034
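One detail explains why the distributed run returns zeros rather than failing: java.io.File.length() is documented to return 0L when the path does not exist, so a worker whose local filesystem has no "data/" directory silently yields 0 for every file in the map() closure. A minimal sketch of that behavior (the helper name is hypothetical):

```scala
import java.io.File

// File.length() returns 0L if the file does not exist -- no exception is
// thrown, so a missing path is indistinguishable from an empty file. This is
// what the map() closure computed on a worker that cannot see "data/".
def sizeOrZero(path: String): Long = new File(path).length
```

This is why the failure mode was an array of zeros rather than a FileNotFoundException in the worker logs.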
