Re: Pyspark Error
My best guess would be a networking issue: it looks like the Python socket library isn't able to connect to whatever hostname you're providing Spark in the configuration.

On 11/18/14 9:10 AM, amin mohebbi wrote:

Hi there,

I have already downloaded the pre-built spark-1.1.0. I want to run pyspark by typing ./bin/pyspark, but I get the following error. The Scala shell comes up and works fine:

hduser@master:~/Downloads/spark-1.1.0$ ./bin/spark-shell
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
...
14/11/18 04:33:13 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@master:34937/user/HeartbeatReceiver
14/11/18 04:33:13 INFO SparkILoop: Created spark context..
Spark context available as sc.
scala>
hduser@master:~/Downloads/spark-1.1.0$

But the Python shell does not work:

hduser@master:~/Downloads/spark-1.1.0$ ./bin/pyspark
Python 2.7.3 (default, Feb 27 2014, 20:00:17)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/11/18 04:36:06 INFO SecurityManager: Changing view acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: Changing modify acls to: hduser,
14/11/18 04:36:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser, ); users with modify permissions: Set(hduser, )
14/11/18 04:36:06 INFO Slf4jLogger: Slf4jLogger started
14/11/18 04:36:06 INFO Remoting: Starting remoting
14/11/18 04:36:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@master:52317]
14/11/18 04:36:06 INFO Utils: Successfully started service 'sparkDriver' on port 52317.
14/11/18 04:36:06 INFO SparkEnv: Registering MapOutputTracker
14/11/18 04:36:06 INFO SparkEnv: Registering BlockManagerMaster
14/11/18 04:36:06 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141118043606-c346
14/11/18 04:36:07 INFO Utils: Successfully started service 'Connection manager for block manager' on port 47507.
14/11/18 04:36:07 INFO ConnectionManager: Bound socket to port 47507 with id = ConnectionManagerId(master,47507)
14/11/18 04:36:07 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
14/11/18 04:36:07 INFO BlockManagerMaster: Trying to register BlockManager
14/11/18 04:36:07 INFO BlockManagerMasterActor: Registering block manager master:47507 with 267.3 MB RAM
14/11/18 04:36:07 INFO BlockManagerMaster: Registered BlockManager
14/11/18 04:36:07 INFO HttpFileServer: HTTP File server directory is /tmp/spark-8b29544a-c74b-4a3e-88e0-13801c8dcc65
14/11/18 04:36:07 INFO HttpServer: Starting HTTP Server
14/11/18 04:36:07 INFO Utils: Successfully started service 'HTTP file server' on port 40029.
14/11/18 04:36:12 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/11/18 04:36:12 INFO SparkUI: Started SparkUI at http://master:4040
14/11/18 04:36:12 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@master:52317/user/HeartbeatReceiver
14/11/18 04:36:12 INFO SparkUI: Stopped Spark web UI at http://master:4040
14/11/18 04:36:12 INFO DAGScheduler: Stopping DAGScheduler
14/11/18 04:36:13 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/11/18 04:36:13 INFO ConnectionManager: Selector thread was interrupted!
14/11/18 04:36:13 INFO ConnectionManager: ConnectionManager stopped
14/11/18 04:36:13 INFO MemoryStore: MemoryStore cleared
14/11/18 04:36:13 INFO BlockManager: BlockManager stopped
14/11/18 04:36:13 INFO BlockManagerMaster: BlockManagerMaster stopped
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/11/18 04:36:13 INFO SparkContext: Successfully stopped SparkContext
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/11/18 04:36:13 INFO Remoting: Remoting shut down
14/11/18 04:36:13 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
Traceback (most recent call last):
  File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/shell.py", line 44, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 107, in __init__
    conf)
  File "/home/hduser/Downloads/spark-1.1.0/python/pyspark/context.py", line 159, in _do_init
    self._accumulatorServer = accumulators._start_update_server()
  File
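[Editor's note: a quick way to sanity-check the networking guess above, not part of the original thread. PySpark's accumulator update server binds a plain Python socket on the driver, so if the machine's hostname ("master" here) doesn't resolve to a bindable address, shell startup can fail exactly like this. A minimal diagnostic sketch; nothing here is Spark-specific:]

import socket

# Check what the local hostname resolves to and whether we can bind to it;
# a stale /etc/hosts entry for "master" is a common culprit.
hostname = socket.gethostname()
try:
    addr = socket.gethostbyname(hostname)
    print("%s resolves to %s" % (hostname, addr))
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((addr, 0))  # grab an ephemeral port, as the update server does
    print("bind OK on %s:%d" % s.getsockname())
    s.close()
except socket.error as e:
    print("resolution or bind failed: %s" % e)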
Iterative transformations over RDD crashes in phantom reduce
Hi all,

This is somewhat related to my previous question (http://apache-spark-user-list.1001560.n3.nabble.com/Iterative-changes-to-RDD-and-broadcast-variables-tt19042.html, for additional context), but for all practical purposes this is its own issue.

As in my previous question, I'm making iterative changes to an RDD, where each iteration depends on the results of the previous one. I've stripped down what was previously a loop to just two sequential edits, to try to nail down where the problem is. It looks like this:

index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

M is basically a row-indexed matrix, where each index points to a dictionary (a sparse matrix, more or less, with some domain-specific modifications). This program crashes on the second-to-last (7th) line; the creepy part is that it says the crash happens in func2 with the broadcast variable INDEX == 1 (it attempts to access an entry that doesn't exist in a dictionary of one of the rows). How is that even possible? Am I missing something fundamental about how Spark works under the hood?

Thanks for your help!
Shannon
Re: Iterative transformations over RDD crashes in phantom reduce
To clarify what, precisely, is impossible: the crash happens with INDEX == 1 in func2, but func2 is only called in the reduceByKey transformation, when INDEX == 0. And according to the output of the foreach() on line 4, that reduceByKey(func2) works just fine. How is it then invoked again with INDEX == 1, when there clearly isn't another reduce call at line 7?

On 11/18/14 1:58 PM, Shannon Quinn wrote:
[...]
Re: Iterative transformations over RDD crashes in phantom reduce
Sorry everyone--turns out an oft-forgotten single line of code was required to make this work:

index = 0
INDEX = sc.broadcast(index)
M = M.flatMap(func1).reduceByKey(func2)
M.foreach(debug_output)
M.cache()
index = 1
INDEX = sc.broadcast(index)
M = M.flatMap(func1)
M.foreach(debug_output)

Works as expected now, and I understand why it was failing before: without cache(), the second action triggered recomputation of M's entire lineage, including the reduceByKey(func2), and by that point the broadcast variable held index == 1.

On 11/18/14 2:02 PM, Shannon Quinn wrote:
[...]
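[Editor's note: for anyone else caught by this, one way to make the hidden dependency visible (a standard RDD facility, not something used in the thread) is to print M's lineage before the second action; the reduceByKey step still appears in the chain that a fresh action would recompute:]

# Show the chain of transformations Spark would (re)execute for M;
# if reduceByKey appears here, a later action can re-invoke func2.
print(M.toDebugString())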
Iterative changes to RDD and broadcast variables
Hi all,

I'm iterating over an RDD (representing a distributed matrix; I have to roll my own in Python) and making changes to different submatrices at each iteration. The loop structure looks something like:

for i in range(x):
    VAR = sc.broadcast(i)
    rdd.map(func1).reduceByKey(func2)
M = rdd.collect()

where func1 and func2 use the current value of VAR for that iteration. Because there aren't any actions in the main loop, nothing actually happens until the collect method is called. I'm running into problems I can't diagnose (*extremely* long execution time for no particular reason, among others); is this code even valid? If not, how should I make in-place iterative edits to different portions of a matrix, where each subsequent edit depends on the edits from the previous iteration?

Thanks in advance!
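[Editor's note: a sketch of how the loop could be restructured, given the cache() resolution that came out of the later thread. Transformations return new RDDs rather than editing in place, so the result has to be reassigned, and forcing an action each iteration pins down which broadcast value the functions see. The names func1, func2, and x are from the question; the unpersist bookkeeping is my own assumption:]

M = rdd
for i in range(x):
    VAR = sc.broadcast(i)
    prev = M
    # Transformations build a new RDD; reassign, don't fire-and-forget.
    M = M.map(func1).reduceByKey(func2)
    M.cache()
    M.count()         # force evaluation while VAR holds this iteration's value
    prev.unpersist()  # drop the previous iteration's cached copy
result = M.collect()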
Re: Dividing tasks among Spark workers
The default # of partitions is the # of cores, correct?

On 7/18/14, 10:53 AM, Yanbo Liang wrote:

Check how many partitions are in your program. If there is only one, changing it to more partitions will make the execution parallel.

2014-07-18 20:57 GMT+08:00 Madhura <das.madhur...@gmail.com>:

I am running my program on a Spark cluster, but when I look at the UI while the job is running, I see that only one worker does most of the tasks. My cluster has one master and 4 workers, where the master is also a worker. I want my task to complete as quickly as possible, and I believe that if the tasks were divided equally among the workers, the job would finish faster. Is there any way I can customize the number of tasks on each worker?

http://apache-spark-user-list.1001560.n3.nabble.com/file/n10160/Question.png
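[Editor's note: for what it's worth, as I understand the defaults, spark.default.parallelism falls back to the total core count on a standalone cluster for operations like parallelize, while file-based RDDs get one partition per input split. To Yanbo's point, a sketch of the usual knobs; the path and counts are illustrative, not from the thread:]

# Ask for more partitions when reading...
rdd = sc.textFile("hdfs:///data/input.txt", minPartitions=32)
# ...reshuffle an existing RDD...
rdd = rdd.repartition(32)
# ...or set the partition count of a shuffle directly.
counts = rdd.map(lambda line: (line, 1)).reduceByKey(lambda a, b: a + b, 32)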
Re: Python: saving/reloading RDD
+1, had to learn this the hard way when some of my objects were written as pointers rather than translated correctly to strings :)

On 7/18/14, 11:52 AM, Xiangrui Meng wrote:

You can save RDDs to text files using RDD.saveAsTextFile and load them back using sc.textFile. But make sure the record-to-string conversion is correctly implemented if the type is not primitive, and that you have a parser to load them back. -Xiangrui

On Jul 18, 2014, at 8:39 AM, Roch Denis <rde...@exostatic.com> wrote:

Hello,

Just to make sure I correctly read the docs and the forums: it's my understanding that currently in Python, with Spark 1.0.1, there is no way to save my RDD to disk such that I can just reload it. The Hadoop RDDs are not yet present in Python. Is that correct? I just want to make sure that's the case before I write a workaround.

Thanks!
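[Editor's note: a minimal sketch of Xiangrui's suggestion for records that aren't primitive, assuming the records are JSON-serializable dictionaries; the path is illustrative:]

import json

# Serialize each record to one line of text, then parse it back on reload.
rdd.map(json.dumps).saveAsTextFile("hdfs:///tmp/my_rdd")
reloaded = sc.textFile("hdfs:///tmp/my_rdd").map(json.loads)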
Job aborted due to stage failure: TID x failed for unknown reasons
Hi all,

I'm dealing with some strange error messages that I *think* come down to a memory issue, but I'm having a hard time pinning it down and could use some guidance from the experts.

I have a 2-machine Spark (1.0.1) cluster. Both machines have 8 cores; one has 16GB memory, the other 32GB (which is the master). My application involves computing pairwise pixel affinities in images, though the images I've tested so far only get as big as 1920x1200, and as small as 16x16.

I did have to change a few memory and parallelism settings, otherwise I was getting explicit OutOfMemoryExceptions. In spark-defaults.conf:

spark.executor.memory    14g
spark.default.parallelism    32
spark.akka.frameSize    1000

In spark-env.sh:

SPARK_DRIVER_MEMORY=10G

With those settings, however, I get a bunch of WARN statements about lost TIDs (no task is successfully completed), in addition to lost executors, which are repeated 4 times until I finally get the following error message and crash:

---
14/07/18 12:06:20 INFO TaskSchedulerImpl: Cancelling stage 0
14/07/18 12:06:20 INFO DAGScheduler: Failed to run collect at /home/user/Programming/PySpark-Affinities/affinity.py:243
Traceback (most recent call last):
  File "/home/user/Programming/PySpark-Affinities/affinity.py", line 243, in <module>
    lambda x: np.abs(IMAGE.value[x[0]] - IMAGE.value[x[1]])
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/pyspark/rdd.py", line 583, in collect
    bytesInJava = self._jrdd.collect().iterator()
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/net/antonin/home/user/Spark/spark-1.0.1-bin-hadoop2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o27.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:13 failed 4 times, most recent failure: *TID 32 on host master.host.univ.edu failed for unknown reason*
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/07/18 12:06:20 INFO DAGScheduler: Executor lost: 4 (epoch 4)
14/07/18 12:06:20 INFO BlockManagerMasterActor: Trying to remove executor 4 from BlockManagerMaster.
14/07/18 12:06:20 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
user@master:~/Programming/PySpark-Affinities$
---

If I run the really small image instead (16x16), it *appears* to run to completion (it gives me the output I expect without any exceptions being thrown). However, in the stderr logs for the app that was run, it lists the state as KILLED, with the final message an "ERROR CoarseGrainedExecutorBackend: Driver Disassociated". If I run any larger images, I get the exception I pasted above.

Furthermore, if I just do a spark-submit with master=local[*], aside from still needing to set the aforementioned memory options, it will work for an image of *any* size (I've tested both machines independently; they both do this when running as local[*]), whereas running on the cluster results in the aforementioned crash at stage 0 with anything but the smallest images.

Any ideas what is going on? Thank you very much in advance!

Regards,
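[Editor's note: one hedged workaround, not from the thread itself. The traceback shows the crash inside collect(), which funnels the entire pairwise-affinity RDD through the Akka frame and into the driver's memory; for 1920x1200 images that result is enormous. Writing the result out from the executors, or pulling back only a sample, sidesteps that path. The RDD name here is hypothetical:]

# Persist results from the executors instead of collecting to the driver...
affinities.saveAsTextFile("hdfs:///tmp/affinities")
# ...or pull back only a manageable preview for inspection.
preview = affinities.take(100)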
Re: Spark standalone network configuration problems
I put the settings as you specified in spark-env.sh for the master. When I run start-all.sh, the web UI shows both the worker on the master (machine1) and the slave worker (machine2) as ALIVE and ready, with the master URL at spark://192.168.1.101. However, when I run spark-submit, it immediately crashes with:

py4j.protocol.Py4JJavaError
14/06/27 09:01:32 ERROR Remoting: Remoting error: [Startup failed]
akka.remote.RemoteTransportException: Startup failed
[...]
org.jboss.netty.channel.ChannelException: Failed to bind to /192.168.1.101:5060
[...]
java.net.BindException: Address already in use.
[...]

This seems entirely contrary to intuition; why would Spark be unable to bind to the exact IP:port set for the master?

On 6/27/14, 1:54 AM, Akhil Das wrote:

Hi Shannon,

How about a setting like the following? (just removed the quotes)

export SPARK_MASTER_IP=192.168.1.101
export SPARK_MASTER_PORT=5060
#export SPARK_LOCAL_IP=127.0.0.1

Not sure what's happening in your case; it could be that your system is not able to bind to the 192.168.1.101 address. What is the spark:// master URL that you are seeing in the web UI? (It should be spark://192.168.1.101:7077 in your case.)

Thanks
Best Regards

On Fri, Jun 27, 2014 at 5:47 AM, Shannon Quinn <squ...@gatech.edu> wrote:
[...]
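[Editor's note: a small diagnostic for the bind failure above; the address and port are the ones from this thread, but the script itself is my own sketch. Trying to bind the master's advertised IP and port directly from Python reproduces the same error Spark hits:]

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s.bind(("192.168.1.101", 5060))
    print("port 5060 is free on this interface")
except socket.error as e:
    print("bind failed; something already holds it: %s" % e)
finally:
    s.close()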
Re: Spark standalone network configuration problems
No joy, unfortunately. Same issue; see my previous email--it still crashes with "address already in use".

On 6/27/14, 1:54 AM, sujeetv wrote:

Try to explicitly set the spark.driver.host property to the master's IP.

Sujeet
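[Editor's note: for completeness, sujeetv's suggestion can also be set in code rather than in spark-defaults.conf. A hedged sketch; the IP and port are the master address from this thread:]

from pyspark import SparkConf, SparkContext

# Pin the driver's advertised hostname explicitly so executors
# don't try to call back to "localhost".
conf = (SparkConf()
        .setMaster("spark://192.168.1.101:5060")
        .set("spark.driver.host", "192.168.1.101"))
sc = SparkContext(conf=conf)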
Re: Spark standalone network configuration problems
Sorry, the master Spark URL in the web UI is *spark://192.168.1.101:5060*, exactly as configured.

On 6/27/14, 9:07 AM, Shannon Quinn wrote:
[...]
Re: numpy + pyspark
Would deploying virtualenv in each directory on the cluster be viable? The dependencies would get tricky, but I think this is the sort of situation it's built for.

On 6/27/14, 11:06 AM, Avishek Saha wrote:

I too felt the same, Nick, but I don't have root privileges on the cluster, unfortunately. Are there any alternatives?

On 27 June 2014 08:04, Nick Pentreath <nick.pentre...@gmail.com> wrote:

I've not tried this - but numpy is a tricky and complex package with many dependencies on Fortran/C libraries etc. I'd say by the time you figure out correctly deploying numpy in this manner, you may as well have just built it into your cluster bootstrap process, or PSSH-installed it on each node...

On Fri, Jun 27, 2014 at 4:58 PM, Avishek Saha <avishek.s...@gmail.com> wrote:

To clarify: I tried it, and it almost worked -- but I am getting some problems from the Random module in numpy. If anyone has successfully passed a numpy module (via the --py-files option) to spark-submit, then please let me know. Thanks!!

Avishek

On 26 June 2014 17:45, Avishek Saha <avishek.s...@gmail.com> wrote:

Hi all,

Instead of installing numpy on each worker node, is it possible to ship numpy (via the --py-files option, maybe) while invoking spark-submit?

Thanks,
Avishek
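[Editor's note: whatever deployment route ends up working, a quick way to verify it from the driver (my own sketch, not from the thread) is to ask each partition to import numpy and report back:]

def probe(_):
    # Runs on the workers: report the numpy version or its absence.
    try:
        import numpy
        return [numpy.__version__]
    except ImportError:
        return ["numpy missing"]

print(sc.parallelize(range(8), 8).mapPartitions(probe).collect())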
Re: numpy + pyspark
I suppose along those lines there's also Anaconda: https://store.continuum.io/cshop/anaconda/

On 6/27/14, 11:13 AM, Nick Pentreath wrote:

Hadoopy uses http://www.pyinstaller.org/ to package things up into an executable that should be runnable without root privileges. It says it supports numpy.

On Fri, Jun 27, 2014 at 5:08 PM, Shannon Quinn <squ...@gatech.edu> wrote:
[...]
Re: Spark standalone network configuration problems
For some reason, commenting out spark.driver.host and spark.driver.port fixed something...and broke something else (or at least revealed another problem). For reference, the only lines I have in my spark-defaults.conf now:

spark.app.name    myProg
spark.master    spark://192.168.1.101:5060
spark.executor.memory    8g
spark.files.overwrite    true

It starts up, but has problems with machine2. For some reason, machine2 is having trouble communicating with *itself*. Here are the worker logs from one of the failures (there are 10 before it quits):

Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/06/27 14:55:13 INFO ExecutorRunner: Launch command: java -cp ::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar -XX:MaxPermSize=128m -Xms8192M -Xmx8192M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@machine1:46378/user/CoarseGrainedScheduler 7 machine2 8 akka.tcp://sparkWorker@machine2:48019/user/Worker app-20140627144512-0001
14/06/27 14:56:54 INFO Worker: Executor app-20140627144512-0001/7 finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/27 14:56:54 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40130.49.226.148%3A53561-38#-1924573003] was not delivered. [10] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@machine2:48019] -> [akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@machine2:60949]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: machine2/130.49.226.148:60949
]
14/06/27 14:56:54 INFO Worker: Asked to launch executor app-20140627144512-0001/8 for Funtown, USA
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@machine2:48019] -> [akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@machine2:60949]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: machine2/130.49.226.148:60949
]
14/06/27 14:56:54 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@machine2:48019] -> [akka.tcp://sparkExecutor@machine2:60949]: Error [Association failed with [akka.tcp://sparkExecutor@machine2:60949]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@machine2:60949]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: machine2/130.49.226.148:60949
]

Port 48019 on machine2 is indeed open, connected, and listening. Any ideas?

Thanks!
Shannon

On 6/27/14, 1:54 AM, sujeetv wrote:
[...]
Re: Spark standalone network configuration problems
Apologies; can you advise as to how I would check that? I can certainly SSH from the master to machine2.

On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:

Looks like your driver is not able to connect to the remote executor on machine2/130.49.226.148:60949. Can you check if the master machine can route to 130.49.226.148?

Sujeet

On Fri, Jun 27, 2014 at 12:04 PM, Shannon Quinn <squ...@gatech.edu> wrote:
[...]
Re: Spark standalone network configuration problems
I switched which machine was the master and which was the dedicated worker, and now it works just fine. I discovered machine2 is in my department's DMZ; machine1 is not. I suspect the departmental firewall was causing problems, and moving the master to machine2 seems to have solved them.

Thank you all very much for your help. I'm sure I'll have other questions soon :)

Regards,
Shannon

On 6/27/14, 3:22 PM, Sujeet Varakhedi wrote:
[...]
Re: Spark standalone network configuration problems
My *best guess* (please correct me if I'm wrong) is that the master (machine1) is sending the command to the worker (machine2) with the localhost argument as-is; that is, machine2 isn't doing any weird address conversion on its end. Consequently, I've been focusing on the settings of the master/machine1. But I haven't found anything to indicate where the localhost argument could be coming from: /etc/hosts lists only 127.0.0.1 as localhost; spark-defaults.conf lists spark.master as the full IP address (not 127.0.0.1); spark-env.sh on the master also lists the full IP under SPARK_MASTER_IP. The *only* place on the master where it's associated with localhost is SPARK_LOCAL_IP.

In looking at the logs of the worker spawned on the master, it's also receiving a spark://localhost:5060 argument, but since it resides on the master, that works fine. Is it possible that the master is, for some reason, passing spark://{SPARK_LOCAL_IP}:5060 to the workers? That was my motivation behind commenting out SPARK_LOCAL_IP; however, that's when the master crashes immediately due to the address already being in use.

Any ideas? Thanks!
Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:

Can you paste your spark-env.sh file?

Thanks
Best Regards

On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squ...@gatech.edu> wrote:

Both /etc/hosts have each other's IP addresses in them. Telnetting from machine2 to machine1 on port 5060 works just fine. Here's the output of lsof:

user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
COMMAND   PID USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip (LISTEN)
java    23985 user   40u  IPv6 11099560      0t0  TCP machine1:sip->machine1:48315 (ESTABLISHED)
java    23985 user   52u  IPv6 11100405      0t0  TCP machine1:sip->machine2:54476 (ESTABLISHED)
java    24157 user   40u  IPv6 11092413      0t0  TCP machine1:48315->machine1:sip (ESTABLISHED)

Ubuntu seems to recognize 5060 as the standard port for sip; it's not actually running anything there besides Spark, it just does a s/5060/sip/g. Is there something to the fact that every time I comment out SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit due to the address already being in use? Or am I barking up the wrong tree on that one?

Thanks again for all your help; I hope we can knock this one out.
Shannon

On 6/26/14, 9:13 AM, Akhil Das wrote:

Do you have "ip machine1" in your worker's /etc/hosts also? If so, try telnetting from your machine2 to machine1 on port 5060. Also make sure nothing else is running on port 5060 other than Spark (lsof -i:5060).

Thanks
Best Regards

On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squ...@gatech.edu> wrote:

Still running into the same problem. /etc/hosts on the master says

127.0.0.1    localhost
ip    machine1

ip is the same address set in spark-env.sh for SPARK_MASTER_IP. Any other ideas?

On 6/26/14, 3:11 AM, Akhil Das wrote:

Hi Shannon,

It should be a configuration issue. Check your /etc/hosts and make sure localhost is not associated with the SPARK_MASTER_IP you provided.

Thanks
Best Regards

On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squ...@gatech.edu> wrote:
[...]
Re: Spark standalone network configuration problems
In the interest of completeness, this is how I invoke Spark:

[on master] sbin/start-all.sh
spark-submit --py-files extra.py main.py

iPhone'd

On Jun 26, 2014, at 17:29, Shannon Quinn <squ...@gatech.edu> wrote:
[...]
Spark standalone network configuration problems
Hi all,

I have a 2-machine Spark network set up: a master and worker on machine1, and a worker on machine2. When I run 'sbin/start-all.sh', everything starts up as it should: I see both workers listed on the UI page, and the logs of both workers indicate successful registration with the Spark master.

The problems begin when I attempt to submit a job: I get an "address already in use" exception that crashes the program. It says "Failed to bind to" and lists the exact port and address of the master. At this point, the only items I have set in my spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).

The next step I took, then, was to explicitly set SPARK_LOCAL_IP on the master to 127.0.0.1. This allows the master to successfully send out the jobs; however, it ends up canceling the stage after running this command several times:

14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: app-20140625210032-/8 on worker-20140625205623-machine2-53597 (machine2:53597) with 8 cores
14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140625210032-/8 on hostPort machine2:53597 with 8 cores, 8.0 GB RAM
14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: app-20140625210032-/8 is now RUNNING
14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: app-20140625210032-/8 is now FAILED (Command exited with code 1)

The /8 started at /1, eventually becomes /9, and then /10, at which point the program crashes. The worker on machine2 shows similar messages in its logs. Here are the last bunch:

14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-/9 finished with state FAILED message Command exited with code 1 exitStatus 1
14/06/25 21:00:31 INFO Worker: Asked to launch executor app-20140625210032-/10 for app_name
Spark assembly has been built with Hive, including Datanucleus jars on classpath
14/06/25 21:00:32 INFO ExecutorRunner: Launch command: java -cp ::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar -XX:MaxPermSize=128m -Xms8192M -Xmx8192M org.apache.spark.executor.CoarseGrainedExecutorBackend *akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler* 10 machine2 8 akka.tcp://sparkWorker@machine2:53597/user/Worker app-20140625210032-
14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-/10 finished with state FAILED message Command exited with code 1 exitStatus 1

I highlighted the part that seemed strange to me: that's the master port number (I set it to 5060), and yet it's referencing localhost? Is this the reason why machine2 apparently can't seem to give a confirmation to the master once the job is submitted? (The logs from the worker on the master node indicate that it's running just fine.)

I appreciate any assistance you can offer!

Regards,
Shannon Quinn