I set up a four-node cluster using the standard Spark cluster setup.  I
built the Spark 0.8.1 binaries with "sbt/sbt assembly" and
"./make-distribution.sh", and copied the result to each machine (all of
which have Java and Python installed).  DNS resolves in all directions,
none of the machines runs a firewall of any sort, and the master node has
passwordless SSH to all of them.
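
In case the details matter, the build-and-deploy steps were essentially
these (the rsync line is just illustrative, and I'm assuming the dist/
directory that make-distribution.sh produces):

sbt/sbt assembly
./make-distribution.sh
rsync -a dist/ slave2:~/spark-0.8.1-incubating/   # repeated for each node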

I am using the pyspark interface, launched with: 
*MASTER=spark://namenode1:7077 ./pyspark*

I am able to load the web interface at *http://namenode1:8080/* and
browse my four workers.

On the pyspark command line, I run:
*>>> data = sc.textFile("my_text_file.txt")
>>> data.count()*
This launches a job, which can be seen on the web interface
(http://namenode1:8080/).  Also, I can see the Java processes running on the
slave nodes:

*lobdellb@slave2:~/spark-0.8.1-incubating/conf$ jps
22366 CoarseGrainedExecutorBackend
22301 Worker
22876 Jps*

Additionally, I can see the python daemon process running on the slaves:

*lobdellb@slave2:~/spark-0.8.1-incubating/conf$ ps aux | grep python
lobdellb 22433  0.0  0.2  46028 11856 ?        S    14:43   0:00 python
/home/lobdellb/spark-0.8.1-incubating/python/pyspark/daemon.py*

However, the slaves remain idle, i.e., no computation is done: the
"*data.count()*" command never finishes, and the load average on every
cluster machine stays near 0.0.

I adjusted log4j.properties to log at the ALL level (the exact change is
shown after the log excerpt below), but none of the extra messages pointed
to the problem.  The master log
(spark-username-org.apache.spark.deploy.master.Master-1-namenode1.out) is
quite verbose but has no messages which seem indicative of a malfunction.
The slave node messages from the web interface are easier to mentally parse
and perhaps offer some clues, but I'm not sure what I'm looking at:

14/01/24 14:42:04 INFO Slf4jEventHandler: Slf4jEventHandler started
14/01/24 14:42:04 INFO SparkEnv: Connecting to BlockManagerMaster:
akka://[email protected]:52181/user/BlockManagerMaster
14/01/24 14:42:04 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20140124144204-5097
14/01/24 14:42:04 INFO MemoryStore: MemoryStore started with capacity 324.4
MB.
14/01/24 14:42:04 INFO ConnectionManager: Bound socket to port 55782 with id
= ConnectionManagerId(slave1.hsd1.il.comcast.net,55782)
14/01/24 14:42:04 INFO BlockManagerMaster: Trying to register BlockManager
14/01/24 14:42:04 INFO BlockManagerMaster: Registered BlockManager
14/01/24 14:42:04 INFO SparkEnv: Connecting to MapOutputTracker:
akka://[email protected]:52181/user/MapOutputTracker
14/01/24 14:42:04 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-157b67f1-1be6-4289-bff2-b29cf23a7e67
14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 2
14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 6
14/01/24 14:43:22 INFO Executor: Running task ID 6
14/01/24 14:43:22 INFO Executor: Running task ID 2
14/01/24 14:43:22 INFO HttpBroadcast: Started reading broadcast variable 0
14/01/24 14:43:23 INFO MemoryStore: ensureFreeSpace(39200) called with
curMem=0, maxMem=340147568
14/01/24 14:43:23 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 38.3 KB, free 324.4 MB)
14/01/24 14:43:23 INFO HttpBroadcast: Reading broadcast variable 0 took
0.333493883 s
14/01/24 14:43:23 INFO BlockManager: Found block broadcast_0 locally
14/01/24 14:43:23 INFO HadoopRDD: Input split:
file:/home/lobdellb/data/Building_Permits.csv:67108864+33554432
14/01/24 14:43:23 INFO HadoopRDD: Input split:
file:/home/lobdellb/data/Building_Permits.csv:201326592+3555642
14/01/24 14:43:23 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
14/01/24 14:43:23 WARN LoadSnappy: Snappy native library not loaded
*14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early
14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early*
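
For reference, the logging change was just the root level in
conf/log4j.properties; the rest is the template that ships with Spark:

log4j.rootCategory=ALL, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n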

My next step would be to sift through the PySpark source (only a few
thousand lines of Python) and try to understand what it does, probably by
adding logging along the lines of the sketch below.  Before doing that, I
am hoping someone might know:

(1)  Is there additional logging I can access which will help identify the
problem?
(2)  Is there an obvious flaw in my setup?
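
The kind of instrumentation I have in mind is only a sketch; the trace
helper below is my own invention, not anything that exists in Spark:

import os
import sys
import time

def trace(msg):
    # Write to stderr so the message lands in the worker's stderr log,
    # which is viewable from the web interface.
    sys.stderr.write("[daemon.py pid=%d %s] %s\n"
                     % (os.getpid(), time.strftime("%H:%M:%S"), msg))
    sys.stderr.flush()

I'd call it from python/pyspark/daemon.py around the fork and
stdin-reading paths.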

Much appreciated
-Bryce


