In the interest of completeness, this is how I invoke spark:

[on master]
> sbin/start-all.sh
> spark-submit --py-files extra.py main.py

iPhone'd

> On Jun 26, 2014, at 17:29, Shannon Quinn <squ...@gatech.edu> wrote:
>
> My *best guess* (please correct me if I'm wrong) is that the master
> (machine1) is sending the command to the worker (machine2) with the
> localhost argument as-is; that is, machine2 isn't doing any weird address
> conversion on its end.
>
> Consequently, I've been focusing on the settings of the master/machine1,
> but I haven't found anything to indicate where the localhost argument
> could be coming from. /etc/hosts lists only 127.0.0.1 as localhost;
> spark-defaults.conf lists spark.master as the full IP address (not
> 127.0.0.1); spark-env.sh on the master also lists the full IP under
> SPARK_MASTER_IP. The *only* place on the master where it's associated
> with localhost is SPARK_LOCAL_IP.
>
> In looking at the logs of the worker spawned on the master, it's also
> receiving a "spark://localhost:5060" argument, but since it resides on
> the master, that works fine. Is it possible that the master is, for some
> reason, passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?
>
> That was my motivation behind commenting out SPARK_LOCAL_IP; however,
> that's when the master crashes immediately due to the address already
> being in use.
>
> Any ideas? Thanks!
>
> Shannon
>
>> On 6/26/14, 10:14 AM, Akhil Das wrote:
>> Can you paste your spark-env.sh file?
>>
>> Thanks
>> Best Regards
>>
>>
>>> On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squ...@gatech.edu> wrote:
>>> Both /etc/hosts have each other's IP addresses in them. Telneting from
>>> machine2 to machine1 on port 5060 works just fine.
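The /etc/hosts claim quoted above can be sanity-checked with a short script. This is only a sketch: 192.168.1.10 is a placeholder for machine1's real address, and it checks a sample file rather than the live /etc/hosts. It flags the misconfiguration discussed later in the thread, where the master's IP is also mapped to localhost:

```shell
# Sketch: verify that localhost and the master IP are not conflated in an
# /etc/hosts-style file. 192.168.1.10 is a placeholder for machine1's IP;
# a real check would read /etc/hosts instead of the sample file below.
MASTER_IP="192.168.1.10"
SAMPLE=$(mktemp)

# Sample matching the hosts file quoted later in the thread.
cat > "$SAMPLE" <<EOF
127.0.0.1 localhost
${MASTER_IP} machine1
EOF

if grep "^${MASTER_IP}[[:space:]].*localhost" "$SAMPLE" >/dev/null; then
    echo "WARNING: ${MASTER_IP} is mapped to localhost"
else
    echo "OK: ${MASTER_IP} is not mapped to localhost"
fi
rm -f "$SAMPLE"
```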
>>>
>>> Here's the output of lsof:
>>>
>>> user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
>>> COMMAND   PID USER  FD  TYPE  DEVICE   SIZE/OFF NODE NAME
>>> java    23985 user  30u IPv6 11092354  0t0      TCP machine1:sip (LISTEN)
>>> java    23985 user  40u IPv6 11099560  0t0      TCP machine1:sip->machine1:48315 (ESTABLISHED)
>>> java    23985 user  52u IPv6 11100405  0t0      TCP machine1:sip->machine2:54476 (ESTABLISHED)
>>> java    24157 user  40u IPv6 11092413  0t0      TCP machine1:48315->machine1:sip (ESTABLISHED)
>>>
>>> Ubuntu seems to recognize 5060 as the standard port for "sip"; it's not
>>> actually running anything there besides Spark, it just does a
>>> s/5060/sip/g.
>>>
>>> Is there something to the fact that every time I comment out
>>> SPARK_LOCAL_IP in spark-env, it crashes immediately upon spark-submit
>>> due to the "address already being in use"? Or am I barking up the wrong
>>> tree on that one?
>>>
>>> Thanks again for all your help; I hope we can knock this one out.
>>>
>>> Shannon
>>>
>>>
>>>> On 6/26/14, 9:13 AM, Akhil Das wrote:
>>>> Do you have <ip> machine1 in your workers' /etc/hosts also? If so, try
>>>> telneting from your machine2 to machine1 on port 5060. Also make sure
>>>> nothing else is running on port 5060 other than Spark (lsof -i:5060).
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>>
>>>>> On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squ...@gatech.edu> wrote:
>>>>> Still running into the same problem. /etc/hosts on the master says
>>>>>
>>>>> 127.0.0.1 localhost
>>>>> <ip>      machine1
>>>>>
>>>>> <ip> is the same address set in spark-env.sh for SPARK_MASTER_IP. Any
>>>>> other ideas?
>>>>>
>>>>>
>>>>>> On 6/26/14, 3:11 AM, Akhil Das wrote:
>>>>>> Hi Shannon,
>>>>>>
>>>>>> It should be a configuration issue. Check your /etc/hosts and make
>>>>>> sure localhost is not associated with the SPARK_MASTER_IP you
>>>>>> provided.
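For reference, the master-side settings described across this thread amount to roughly the following spark-env.sh sketch; 192.168.1.10 is a placeholder for machine1's real address, and the comments mark which lines the thread experiments with:

```shell
# conf/spark-env.sh on machine1 (sketch; 192.168.1.10 stands in for the
# real IP that the thread redacts as <ip>)
SPARK_MASTER_IP=192.168.1.10   # full IP, not 127.0.0.1
SPARK_MASTER_PORT=5060         # non-standard master port used in the thread
SPARK_LOCAL_IP=127.0.0.1       # the only localhost association on the master;
                               # commenting this out reportedly triggers the
                               # "address already in use" crash on submit
```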
>>>>>>
>>>>>> Thanks
>>>>>> Best Regards
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squ...@gatech.edu> wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a 2-machine Spark network I've set up: a master and worker on
>>>>>>> machine1, and a worker on machine2. When I run 'sbin/start-all.sh',
>>>>>>> everything starts up as it should. I see both workers listed on the
>>>>>>> UI page, and the logs of both workers indicate successful
>>>>>>> registration with the Spark master.
>>>>>>>
>>>>>>> The problems begin when I attempt to submit a job: I get an "address
>>>>>>> already in use" exception that crashes the program. It says "Failed
>>>>>>> to bind to " and lists the exact port and address of the master.
>>>>>>>
>>>>>>> At this point, the only items I have set in my spark-env.sh are
>>>>>>> SPARK_MASTER_IP and SPARK_MASTER_PORT (non-standard, set to 5060).
>>>>>>>
>>>>>>> The next step I took, then, was to explicitly set SPARK_LOCAL_IP on
>>>>>>> the master to 127.0.0.1. This allows the master to successfully send
>>>>>>> out the jobs; however, it ends up canceling the stage after running
>>>>>>> this command several times:
>>>>>>>
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added:
>>>>>>> app-20140625210032-0000/8 on worker-20140625205623-machine2-53597
>>>>>>> (machine2:53597) with 8 cores
>>>>>>> 14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor
>>>>>>> ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8
>>>>>>> cores, 8.0 GB RAM
>>>>>>> 14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated:
>>>>>>> app-20140625210032-0000/8 is now RUNNING
>>>>>>> 14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated:
>>>>>>> app-20140625210032-0000/8 is now FAILED (Command exited with code 1)
>>>>>>>
>>>>>>> The "/8" started at "/1", eventually becomes "/9", and then "/10",
>>>>>>> at which point the program crashes. The worker on machine2 shows
>>>>>>> similar messages in its logs.
>>>>>>> Here are the last bunch:
>>>>>>>
>>>>>>> 14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9
>>>>>>> finished with state FAILED message Command exited with code 1
>>>>>>> exitStatus 1
>>>>>>> 14/06/25 21:00:31 INFO Worker: Asked to launch executor
>>>>>>> app-20140625210032-0000/10 for app_name
>>>>>>> Spark assembly has been built with Hive, including Datanucleus jars
>>>>>>> on classpath
>>>>>>> 14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp"
>>>>>>> "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar"
>>>>>>> "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M"
>>>>>>> "org.apache.spark.executor.CoarseGrainedExecutorBackend"
>>>>>>> "akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler" "10"
>>>>>>> "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker"
>>>>>>> "app-20140625210032-0000"
>>>>>>> 14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10
>>>>>>> finished with state FAILED message Command exited with code 1
>>>>>>> exitStatus 1
>>>>>>>
>>>>>>> I highlighted the part that seemed strange to me
>>>>>>> ("spark@localhost:5060"): that's the master port number (I set it
>>>>>>> to 5060), and yet it's referencing localhost? Is this the reason
>>>>>>> why machine2 apparently can't seem to give a confirmation to the
>>>>>>> master once the job is submitted? (The logs from the worker on the
>>>>>>> master node indicate that it's running just fine.)
>>>>>>>
>>>>>>> I appreciate any assistance you can offer!
>>>>>>>
>>>>>>> Regards,
>>>>>>> Shannon Quinn
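The localhost in that launch command is consistent with the "spark://{SPARK_LOCAL_IP}:5060" guess made at the top of the thread. As a sketch only (not Spark's actual code), the address the master advertises to workers can arise like this when SPARK_LOCAL_IP is set to a loopback address that /etc/hosts resolves back to "localhost":

```shell
# Sketch (not Spark's actual code): how an advertised driver address like
# "spark://localhost:5060" can arise from the settings in this thread.
SPARK_LOCAL_IP="127.0.0.1"     # as explicitly set on the master above
SPARK_MASTER_PORT=5060

# If SPARK_LOCAL_IP is set, it wins over the machine's hostname.
ADVERTISED_HOST="${SPARK_LOCAL_IP:-$(hostname)}"

# 127.0.0.1 reverse-resolves to "localhost" via the /etc/hosts entry.
case "$ADVERTISED_HOST" in
    127.0.0.1) ADVERTISED_HOST=localhost ;;
esac

# machine2 receives this URL, tries to reach "localhost", and fails.
echo "spark://${ADVERTISED_HOST}:${SPARK_MASTER_PORT}"
```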