I set up a four-node cluster using the standalone Spark cluster scripts. I built the Spark 0.8.1 binaries with "sbt/sbt assembly" and "./make-distribution.sh", and copied the resulting distribution to each worker machine (each of which has Java and Python installed). DNS resolves in all directions, none of the machines run any firewall, and passwordless SSH from the master node works.
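For reference, the standalone-deploy configuration involved is minimal. A sketch of mine, assuming the standard conf/slaves mechanism (slave3 and slave4 are placeholder names here; only slave1 and slave2 actually appear in the logs below):

    # conf/slaves on the master (namenode1), one worker hostname per line
    slave1
    slave2
    slave3
    slave4

    # then, launched from the master with:
    ./bin/start-all.sh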
I am using the pyspark interface, launched with:

    MASTER=spark://namenode1:7077 ./pyspark

I am able to load the web interface at http://namenode1:8080/ and browse my four workers. On the pyspark command line, I run:

    >>> data = sc.textFile("my_text_file.txt")
    >>> data.count()

This launches a job, which shows up on the web interface. I can also see the Java processes running on the slave nodes:

    lobdellb@slave2:~/spark-0.8.1-incubating/conf$ jps
    22366 CoarseGrainedExecutorBackend
    22301 Worker
    22876 Jps

Additionally, I can see the Python daemon process running on the slaves:

    lobdellb@slave2:~/spark-0.8.1-incubating/conf$ ps aux | grep python
    lobdellb 22433 0.0 0.2 46028 11856 ? S 14:43 0:00 python /home/lobdellb/spark-0.8.1-incubating/python/pyspark/daemon.py

However, the slaves remain idle, i.e., no computation is done: the data.count() command never finishes, and the load average on all cluster machines stays near 0.0. I adjusted log4j.properties to log messages at all levels, but nothing that looks like an error appears.
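For concreteness, here is roughly the change I made, starting from the bundled conf/log4j.properties.template (reconstructed from memory, so treat the exact lines as approximate):

    # conf/log4j.properties -- copied from log4j.properties.template,
    # with the root category changed from INFO to ALL
    log4j.rootCategory=ALL, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n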
The master log (spark-username-org.apache.spark.deploy.master.Master-1-namenode1.out) is quite verbose but has no messages that seem indicative of a malfunction. The slave-node messages from the web interface are easier to parse and perhaps offer some clues, but I'm not sure what I'm looking at:

    14/01/24 14:42:04 INFO Slf4jEventHandler: Slf4jEventHandler started
    14/01/24 14:42:04 INFO SparkEnv: Connecting to BlockManagerMaster: akka://[email protected]:52181/user/BlockManagerMaster
    14/01/24 14:42:04 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140124144204-5097
    14/01/24 14:42:04 INFO MemoryStore: MemoryStore started with capacity 324.4 MB.
    14/01/24 14:42:04 INFO ConnectionManager: Bound socket to port 55782 with id = ConnectionManagerId(slave1.hsd1.il.comcast.net,55782)
    14/01/24 14:42:04 INFO BlockManagerMaster: Trying to register BlockManager
    14/01/24 14:42:04 INFO BlockManagerMaster: Registered BlockManager
    14/01/24 14:42:04 INFO SparkEnv: Connecting to MapOutputTracker: akka://[email protected]:52181/user/MapOutputTracker
    14/01/24 14:42:04 INFO HttpFileServer: HTTP File server directory is /tmp/spark-157b67f1-1be6-4289-bff2-b29cf23a7e67
    14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 2
    14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 6
    14/01/24 14:43:22 INFO Executor: Running task ID 6
    14/01/24 14:43:22 INFO Executor: Running task ID 2
    14/01/24 14:43:22 INFO HttpBroadcast: Started reading broadcast variable 0
    14/01/24 14:43:23 INFO MemoryStore: ensureFreeSpace(39200) called with curMem=0, maxMem=340147568
    14/01/24 14:43:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 38.3 KB, free 324.4 MB)
    14/01/24 14:43:23 INFO HttpBroadcast: Reading broadcast variable 0 took 0.333493883 s
    14/01/24 14:43:23 INFO BlockManager: Found block broadcast_0 locally
    14/01/24 14:43:23 INFO HadoopRDD: Input split: file:/home/lobdellb/data/Building_Permits.csv:67108864+33554432
    14/01/24 14:43:23 INFO HadoopRDD: Input split: file:/home/lobdellb/data/Building_Permits.csv:201326592+3555642
    14/01/24 14:43:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/01/24 14:43:23 WARN LoadSnappy: Snappy native library not loaded
    14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early
    14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early

The last two lines ("stdin writer to Python finished early") are the ones that look suspicious to me; nothing further is logged after them. My next step would be to sift through the pyspark Python source (which is only a few thousand lines) and try to understand what it does, probably by adding logging.
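To be clear about what I mean by adding logging: just a trace file per process, something like the sketch below pasted near the top of python/pyspark/daemon.py (the file path is real, but the placement and the messages are hypothetical, purely illustrative):

    # hypothetical instrumentation near the top of python/pyspark/daemon.py:
    # write a per-process trace file so I can see how far each daemon gets
    import logging
    import os

    logging.basicConfig(
        filename="/tmp/pyspark-daemon-%d.log" % os.getpid(),
        level=logging.DEBUG,
        format="%(asctime)s pid=%(process)d %(message)s",
    )
    logging.debug("daemon module loaded")

    # ...and then lines like this at each point of interest, e.g.:
    # logging.debug("worker forked")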
Before doing that, though, I am hoping someone might know:

(1) Is there additional logging I can access which will help identify the problem?
(2) Is there an obvious flaw in my setup?

Much appreciated,
-Bryce