In the interest of completeness, this is how I invoke spark:

[on master]
> sbin/start-all.sh
> spark-submit --py-files extra.py main.py

iPhone'd

> On Jun 26, 2014, at 17:29, Shannon Quinn <squ...@gatech.edu> wrote:
>
> My *best guess* (please correct me if I'm wrong) is that the master
> (machine1) is sending the command to the worker (machine2) with the
> localhost argument as-is; that is, machine2 isn't doing any weird address
> conversion on its end.
>
> Consequently, I've been focusing on the settings of the master/machine1,
> but I haven't found anything to indicate where the localhost argument
> could be coming from. /etc/hosts lists only 127.0.0.1 as localhost;
> spark-defaults.conf lists spark.master as the full IP address (not
> 127.0.0.1); spark-env.sh on the master also lists the full IP under
> SPARK_MASTER_IP. The *only* place on the master where it's associated
> with localhost is SPARK_LOCAL_IP.
>
> In looking at the logs of the worker spawned on the master, it's also
> receiving a "spark://localhost:5060" argument, but since it resides on
> the master, that works fine. Is it possible that the master is, for some
> reason, passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>
> That was my motivation behind commenting out SPARK_LOCAL_IP; however,
> that's when the master crashes immediately due to the address already
> being in use.
>
> Any ideas? Thanks!
>
> Shannon
>
>> On 6/26/14, 10:14 AM, Akhil Das wrote:
>> Can you paste your spark-env.sh file?
>>
>> Thanks
>> Best Regards
>>
>>
>>> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squ...@gatech.edu> wrote:
>>> Both /etc/hosts have each other's IP addresses in them. Telneting from
>>> machine2 to machine1 on port 5060 works just fine.
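The /etc/hosts claim quoted above can be sanity-checked with a short script. This is only a sketch: 192.168.1.10 is a placeholder for machine1's real address, and it checks a sample file rather than the live /etc/hosts. It flags the misconfiguration discussed later in the thread, where the master's IP is also mapped to localhost:

```shell
# Sketch: verify that localhost and the master IP are not conflated in an
# /etc/hosts-style file. 192.168.1.10 is a placeholder for machine1's IP;
# a real check would read /etc/hosts instead of the sample file below.
MASTER_IP="192.168.1.10"
SAMPLE=$(mktemp)

# Sample matching the hosts file quoted later in the thread.
cat > "$SAMPLE" <<EOF
127.0.0.1 localhost
${MASTER_IP} machine1
EOF

if grep "^${MASTER_IP}[[:space:]].*localhost" "$SAMPLE" >/dev/null; then
    echo "WARNING: ${MASTER_IP} is mapped to localhost"
else
    echo "OK: ${MASTER_IP} is not mapped to localhost"
fi
rm -f "$SAMPLE"
```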
>>>
>>> Here's the output of lsof:
>>>
>>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>>> COMMAND   PID USER  FD  TYPE  DEVICE   SIZE/OFF NODE NAME
>>> java    23985 user  30u IPv6 11092354  0t0      TCP machine1:sip (LISTEN)
>>> java    23985 user  40u IPv6 11099560  0t0      TCP machine1:sip->machine1:48315 (ESTABLISHED)
>>> java    23985 user  52u IPv6 11100405  0t0      TCP machine1:sip->machine2:54476 (ESTABLISHED)
>>> java    24157 user  40u IPv6 11092413  0t0      TCP machine1:48315->machine1:sip (ESTABLISHED)
>>>
>>> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not
>>> actually running anything there besides Spark, it just does a
>>> s/5060/sip/g.
>>>
>>> Is there something to the fact that every time I comment out
>>> SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit
>>> due to the "address already being in use"? Or am I barking up the wrong
>>> tree on that one?
>>>
>>> Thanks again for all your help; I hope we can knock this one out.
>>>
>>> Shannon
>>>
>>>
>>>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>> Do you have <ip> machine1 in your workers' /etc/hosts also? If so, try
>>>> telneting from your machine2 to machine1 on port 5060. Also make sure
>>>> nothing else is running on port 5060 other than Spark (lsof -i:5060).
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>>
>>>>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squ...@gatech.edu> wrote:
>>>>> Still running into the same problem. /etc/hosts on the master says
>>>>>
>>>>> 127.0.0.1 localhost
>>>>> <ip>      machine1
>>>>>
>>>>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
>>>>> other ideas?
>>>>>
>>>>>
>>>>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>> Hi Shannon,
>>>>>>
>>>>>> It should be a configuration issue. Check your /etc/hosts and make
>>>>>> sure localhost is not associated with the SPARK_MASTER_IP you
>>>>>> provided.
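For reference, the master-side settings described across this thread amount to roughly the following spark-env.sh sketch; 192.168.1.10 is a placeholder for machine1's real address, and the comments mark which lines the thread experiments with:

```shell
# conf/spark-env.sh on machine1 (sketch; 192.168.1.10 stands in for the
# real IP that the thread redacts as <ip>)
SPARK_MASTER_IP=192.168.1.10   # full IP, not 127.0.0.1
SPARK_MASTER_PORT=5060         # non-standard master port used in the thread
SPARK_LOCAL_IP=127.0.0.1       # the only localhost association on the master;
                               # commenting this out reportedly triggers the
                               # "address already in use" crash on submit
```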
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squ...@gatech.edu> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a 2-machine Spark network I've set up: a master and worker on
>>>>>>> machine1, and a worker on machine2. When I run 'sbin/start-all.sh',
>>>>>>> everything starts up as it should. I see both workers listed on the
>>>>>>> UI page, and the logs of both workers indicate successful
>>>>>>> registration with the Spark master.
>>>>>>>
>>>>>>> The problems begin when I attempt to submit a job: I get an "address
>>>>>>> already in use" exception that crashes the program. It says "Failed
>>>>>>> to bind to " and lists the exact port and address of the master.
>>>>>>>
>>>>>>> At this point, the only items I have set in my spark-env.sh are
>>>>>>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>>>>
>>>>>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on
>>>>>>> the master to 127.0.0.1. This allows the master to successfully send
>>>>>>> out the jobs; however, it ends up canceling the stage after running
>>>>>>> this command several times:
>>>>>>>
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>>>>>>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>>>>>>> (machine2:53597) with 8 cores
>>>>>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor
>>>>>>> ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8
>>>>>>> cores, 8.0 GB RAM
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>>>>>>> app-20140625210032-0000/8 is now RUNNING
>>>>>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>>>>>>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>>>>>
>>>>>>> The "/8" started at "/1", eventually becomes "/9", and then "/10",
>>>>>>> at which point the program crashes. The worker on machine2 shows
>>>>>>> similar messages in its logs.
>>>>>>> Here are the last bunch:
>>>>>>>
>>>>>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>>>>>>> finished with state FAILED message Command exited with code 1
>>>>>>> exitStatus 1
>>>>>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>>>>>> app-20140625210032-0000/10 for app_name
>>>>>>> Spark assembly has been built with Hive, including Datanucleus jars
>>>>>>> on classpath
>>>>>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>>>>>>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>>>>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>>>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>>>>> "akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler" "10"
>>>>>>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>>>>> "app-20140625210032-0000"
>>>>>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>>>>>>> finished with state FAILED message Command exited with code 1
>>>>>>> exitStatus 1
>>>>>>>
>>>>>>> I highlighted the part that seemed strange to me
>>>>>>> ("spark@localhost:5060"): that's the master port number (I set it
>>>>>>> to 5060), and yet it's referencing localhost? Is this the reason
>>>>>>> why machine2 apparently can't seem to give a confirmation to the
>>>>>>> master once the job is submitted? (The logs from the worker on the
>>>>>>> master node indicate that it's running just fine.)
>>>>>>>
>>>>>>> I appreciate any assistance you can offer!
>>>>>>>
>>>>>>> Regards,
>>>>>>> Shannon Quinn
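The localhost in that launch command is consistent with the "spark://{SPARK_LOCAL_IP}:5060" guess made at the top of the thread. As a sketch only (not Spark's actual code), the address the master advertises to workers can arise like this when SPARK_LOCAL_IP is set to a loopback address that /etc/hosts resolves back to "localhost":

```shell
# Sketch (not Spark's actual code): how an advertised driver address like
# "spark://localhost:5060" can arise from the settings in this thread.
SPARK_LOCAL_IP="127.0.0.1"     # as explicitly set on the master above
SPARK_MASTER_PORT=5060

# If SPARK_LOCAL_IP is set, it wins over the machine's hostname.
ADVERTISED_HOST="${SPARK_LOCAL_IP:-$(hostname)}"

# 127.0.0.1 reverse-resolves to "localhost" via the /etc/hosts entry.
case "$ADVERTISED_HOST" in
    127.0.0.1) ADVERTISED_HOST=localhost ;;
esac

# machine2 receives this URL, tries to reach "localhost", and fails.
echo "spark://${ADVERTISED_HOST}:${SPARK_MASTER_PORT}"
```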