My *best guess* (please correct me if I'm wrong) is that the master (machine1) is sending the command to the worker (machine2) with the localhost argument as-is; that is, machine2 isn't doing any weird address conversion on its end.

Consequently, I've been focusing on the settings of the master/machine1, but I haven't found anything to indicate where the localhost argument could be coming from. /etc/hosts lists only 127.0.0.1 as localhost; spark-defaults.conf lists spark.master as the full IP address (not 127.0.0.1); spark-env.sh on the master also lists the full IP under SPARK_MASTER_IP. The *only* place on the master where the address is set to localhost is SPARK_LOCAL_IP (which I'd set to 127.0.0.1).
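
For reference, here's roughly what the relevant files on the master contain (with <ip> standing in for machine1's actual address):

    # /etc/hosts
    127.0.0.1    localhost
    <ip>         machine1

    # conf/spark-defaults.conf
    spark.master    spark://<ip>:5060

    # conf/spark-env.sh
    export SPARK_MASTER_IP=<ip>
    export SPARK_MASTER_PORT=5060
    export SPARK_LOCAL_IP=127.0.0.1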

Looking at the logs of the worker spawned on the master, I see it also receives a "spark://localhost:5060" argument, but since that worker resides on the master, it works fine. Is it possible that the master is, for some reason, passing "spark://{SPARK_LOCAL_IP}:5060" to the workers?

That was my motivation for commenting out SPARK_LOCAL_IP; however, whenever I do that, the master crashes immediately because the address is already in use.

Any ideas? Thanks!

Shannon

On 6/26/14, 10:14 AM, Akhil Das wrote:
Can you paste your spark-env.sh file?

Thanks
Best Regards


On Thu, Jun 26, 2014 at 7:01 PM, Shannon Quinn <squ...@gatech.edu> wrote:

    Both /etc/hosts files have each other's IP addresses in them.
    Telnetting from machine2 to machine1 on port 5060 works just fine.

    Here's the output of lsof:

    user@machine1:~/spark/spark-1.0.0-bin-hadoop2$ lsof -i:5060
    COMMAND   PID   USER   FD   TYPE   DEVICE SIZE/OFF NODE NAME
    java    23985 user   30u  IPv6 11092354      0t0  TCP machine1:sip (LISTEN)
    java    23985 user   40u  IPv6 11099560      0t0  TCP machine1:sip->machine1:48315 (ESTABLISHED)
    java    23985 user   52u  IPv6 11100405      0t0  TCP machine1:sip->machine2:54476 (ESTABLISHED)
    java    24157 user   40u  IPv6 11092413      0t0  TCP machine1:48315->machine1:sip (ESTABLISHED)

    Ubuntu seems to recognize 5060 as the standard port for "sip".
    Nothing is actually running there besides Spark; lsof just does
    a s/5060/sip/g on the output.
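
    If I'm reading the lsof man page right, that name comes from
    /etc/services, and adding -P should print the raw port number
    instead:

        lsof -nP -i:5060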

    Is there something to the fact that every time I comment out
    SPARK_LOCAL_IP in spark-env.sh, it crashes immediately upon
    spark-submit with "address already in use"? Or am I barking up
    the wrong tree on that one?

    Thanks again for all your help; I hope we can knock this one out.

    Shannon


    On 6/26/14, 9:13 AM, Akhil Das wrote:
    Do you have "<ip>    machine1" in your worker's /etc/hosts
    also? If so, try telnetting from your machine2 to machine1 on
    port 5060. Also make sure nothing other than Spark is running
    on port 5060 (lsof -i:5060).
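
    For example, from machine2:

        telnet machine1 5060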

    Thanks
    Best Regards


    On Thu, Jun 26, 2014 at 6:35 PM, Shannon Quinn <squ...@gatech.edu> wrote:

        Still running into the same problem. /etc/hosts on the master
        says

        127.0.0.1    localhost
        <ip>            machine1

        <ip> is the same address set in spark-env.sh for
        SPARK_MASTER_IP. Any other ideas?


        On 6/26/14, 3:11 AM, Akhil Das wrote:
        Hi Shannon,

        It should be a configuration issue. Check your /etc/hosts
        and make sure localhost is not associated with the
        SPARK_MASTER_IP you provided.

        Thanks
        Best Regards


        On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn <squ...@gatech.edu> wrote:

            Hi all,

            I have a 2-machine Spark network set up: a master and a
            worker on machine1, and a second worker on machine2. When
            I run 'sbin/start-all.sh', everything starts up as it
            should. I see both workers listed on the UI page, and the
            logs of both workers indicate successful registration
            with the Spark master.

            The problems begin when I attempt to submit a job: I get
            an "address already in use" exception that crashes the
            program. It says "Failed to bind to" and then lists the
            exact address and port of the master.

            At this point, the only items I have set in my
            spark-env.sh are SPARK_MASTER_IP and SPARK_MASTER_PORT
            (non-standard, set to 5060).
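
            For concreteness, spark-env.sh at that point was roughly
            just the following (<ip> standing in for machine1's
            actual address):

                export SPARK_MASTER_IP=<ip>
                export SPARK_MASTER_PORT=5060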

            The next step I took, then, was to explicitly set
            SPARK_LOCAL_IP on the master to 127.0.0.1. This allows
            the master to successfully send out the jobs; however, it
            ends up canceling the stage after retrying the executor
            launch several times:

            14/06/25 21:00:47 INFO AppClient$ClientActor: Executor added: app-20140625210032-0000/8 on worker-20140625205623-machine2-53597 (machine2:53597) with 8 cores
            14/06/25 21:00:47 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140625210032-0000/8 on hostPort machine2:53597 with 8 cores, 8.0 GB RAM
            14/06/25 21:00:47 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now RUNNING
            14/06/25 21:00:49 INFO AppClient$ClientActor: Executor updated: app-20140625210032-0000/8 is now FAILED (Command exited with code 1)

            The "/8" started at "/1" and eventually reached "/9" and
            then "/10", at which point the program crashed. The
            worker on machine2 shows similar messages in its logs.
            Here are the last bunch:

            14/06/25 21:00:31 INFO Worker: Executor app-20140625210032-0000/9 finished with state FAILED message Command exited with code 1 exitStatus 1
            14/06/25 21:00:31 INFO Worker: Asked to launch executor app-20140625210032-0000/10 for app_name
            Spark assembly has been built with Hive, including Datanucleus jars on classpath
            14/06/25 21:00:32 INFO ExecutorRunner: Launch command: "java" "-cp" "::/home/spark/spark-1.0.0-bin-hadoop2/conf:/home/spark/spark-1.0.0-bin-hadoop2/lib/spark-assembly-1.0.0-hadoop2.2.0.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-rdbms-3.2.1.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-core-3.2.2.jar:/home/spark/spark-1.0.0-bin-hadoop2/lib/datanucleus-api-jdo-3.2.1.jar" "-XX:MaxPermSize=128m" "-Xms8192M" "-Xmx8192M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "*akka.tcp://spark@localhost:5060/user/CoarseGrainedScheduler*" "10" "machine2" "8" "akka.tcp://sparkWorker@machine2:53597/user/Worker" "app-20140625210032-0000"
            14/06/25 21:00:33 INFO Worker: Executor app-20140625210032-0000/10 finished with state FAILED message Command exited with code 1 exitStatus 1

            I highlighted the part that seemed strange to me: the
            port is the master port I set (5060), yet the host is
            localhost. Is that why machine2 apparently can't confirm
            back to the master once the job is submitted? (The logs
            from the worker on the master node indicate that it's
            running just fine.)
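
            (For comparison, I would have expected that launch
            command to reference something like
            akka.tcp://spark@machine1:5060/user/CoarseGrainedScheduler,
            i.e. the master's actual hostname or IP rather than
            localhost.)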

            I appreciate any assistance you can offer!

            Regards,
            Shannon Quinn
