I can't seem to get Spark to run the tasks in parallel. My Spark code is the following:
// Create commands to be piped into a C++ program
List<String> commandList = makeCommandList(Integer.parseInt(step.first()), 100);
JavaRDD<String> commandListRDD = ctx.parallelize(commandList, commandList.size());

// Run the C++ application
JavaRDD<String> workerOutput = commandListRDD.pipe("RandomC++Application");
workerOutput.saveAsTextFile("output");

Running this code, all the tasks appear to execute in series rather than in parallel. Any ideas as to what could be wrong? I'm guessing there is an issue with the serializer, based on the sample output below:

14/05/12 17:17:32 INFO TaskSchedulerImpl: Adding task set 1.0 with 14 tasks
14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor 2: neuro-1-3.local (PROCESS_LOCAL)
14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:0 as 4888 bytes in 9 ms
14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:1 as TID 1 on executor 5: neuro-2-0.local (PROCESS_LOCAL)
14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:1 as 4890 bytes in 1 ms
14/05/12 17:17:32 INFO TaskSetManager: Starting task 1.0:2 as TID 2 on executor 12: neuro-1-4.local (PROCESS_LOCAL)
14/05/12 17:17:32 INFO TaskSetManager: Serialized task 1.0:2 as 4890 bytes in 1 ms
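
For reference, makeCommandList isn't shown above; the sketch below is only an illustration of the kind of helper it could be, assuming it simply formats one input string per task. The signature is taken from the call site, and the body is an assumption, not the actual code.

import java.util.ArrayList;
import java.util.List;

public class CommandListBuilder {
    // Assumed signature, matching the call makeCommandList(Integer.parseInt(step.first()), 100).
    // Body is illustrative only: it builds one command string per chunk of work,
    // so the RDD created from it has one element per intended task.
    public static List<String> makeCommandList(int step, int total) {
        List<String> commands = new ArrayList<String>();
        for (int start = 0; start < total; start += step) {
            // One input line per task; the actual arguments passed to the
            // C++ program would go here.
            commands.add(start + " " + Math.min(start + step, total));
        }
        return commands;
    }
}

Since parallelize() is called with commandList.size() slices, each element should end up in its own partition, and each task in the log corresponds to one of those elements.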