Min Li created SPARK-1967:
-----------------------------

             Summary: Using parallelize method to create RDD, wordcount app just hanging there without errors or warnings
                 Key: SPARK-1967
                 URL: https://issues.apache.org/jira/browse/SPARK-1967
             Project: Spark
          Issue Type: Bug
    Affects Versions: 0.9.1
         Environment: Ubuntu 12.04, single-machine Spark standalone, 8 cores, 8 GB RAM, Spark 0.9.1, Java 1.7
            Reporter: Min Li
I was trying the parallelize method to create an RDD, using Java. It is a simple wordcount program, except that I first read the input into memory and then use the parallelize method to create the RDD, rather than the textFile method used in the bundled example. Pseudocode:

JavaSparkContext ctx = new JavaSparkContext($SparkMasterURL, $NAME, $SparkHome, $jars);
List<String> input = // read lines from the input file into an ArrayList<String>
JavaRDD<String> lines = ctx.parallelize(input); // followed by wordcount ---- this is NOT working
JavaRDD<String> lines = ctx.textFile(file);     // followed by wordcount ---- this is working

The log is:

14/05/29 16:18:43 INFO Slf4jLogger: Slf4jLogger started
14/05/29 16:18:43 INFO Remoting: Starting remoting
14/05/29 16:18:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@spark:55224]
14/05/29 16:18:43 INFO SparkEnv: Registering BlockManagerMaster
14/05/29 16:18:43 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140529161843-836a
14/05/29 16:18:43 INFO MemoryStore: MemoryStore started with capacity 1056.0 MB.
14/05/29 16:18:43 INFO ConnectionManager: Bound socket to port 42942 with id = ConnectionManagerId(spark,42942)
14/05/29 16:18:43 INFO BlockManagerMaster: Trying to register BlockManager
14/05/29 16:18:43 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager spark:42942 with 1056.0 MB RAM
14/05/29 16:18:43 INFO BlockManagerMaster: Registered BlockManager
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO HttpBroadcast: Broadcast server started at http://10.227.119.185:43522
14/05/29 16:18:43 INFO SparkEnv: Registering MapOutputTracker
14/05/29 16:18:43 INFO HttpFileServer: HTTP File server directory is /tmp/spark-3704a621-789c-4d97-b1fc-9654236dba3e
14/05/29 16:18:43 INFO HttpServer: Starting HTTP Server
14/05/29 16:18:43 INFO SparkUI: Started Spark Web UI at http://spark:4040
14/05/29 16:18:44 INFO SparkContext: Added JAR /home/maxmin/tmp/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar at http://10.227.119.185:55286/jars/spark-test-1.0-SNAPSHOT-jar-with-dependencies.jar with timestamp 1401394724045
14/05/29 16:18:44 INFO AppClient$ClientActor: Connecting to master spark://spark:7077...
14/05/29 16:18:44 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140529161844-0001
14/05/29 16:18:44 INFO AppClient$ClientActor: Executor added: app-20140529161844-0001/0 on worker-20140529155406-spark-59658 (spark:59658) with 8 cores

The app hangs here forever, and spark:8080 / spark:4040 do not show anything unusual. The Spark Stages page shows the active stage is reduceByKey, with tasks Succeeded/Total at 0/2. I have also tried calling lines.count directly after parallelize, and the app gets stuck at the count stage. I used Spark 0.9.1 with the default spark-env.sh, and the slaves file contains only one host. I used Maven to compile a fat jar with Spark marked as provided, and I modified the run-example script to submit the jar.

-- This message was sent by Atlassian JIRA (v6.2#6252)
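For reference, the computation the hanging job is expected to perform (flatMap lines into words, map each word to (word, 1), then reduceByKey summing the 1s) can be sketched in plain Java with no Spark dependency. The class name WordCountSketch and the sample input are placeholders for illustration only; this shows the intended output of the pipeline, not the Spark code itself:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java equivalent of the wordcount pipeline in the report:
// flatMap(line -> words) -> map(word -> (word, 1)) -> reduceByKey(_ + _).
public class WordCountSketch {
    static Map<String, Integer> wordCount(List<String> lines) {
        // TreeMap only so the printed result has a deterministic order
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {                     // the flatMap step
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) continue;           // skip blanks from leading whitespace
                counts.merge(word, 1, Integer::sum);    // the (word, 1) + reduceByKey step
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a b a", "b c");
        System.out.println(wordCount(input)); // prints {a=2, b=2, c=1}
    }
}
```

Since the textFile variant completes and the parallelize variant hangs at this same reduceByKey stage, the computation itself is not in question; only the RDD construction path differs between the two runs.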