Great hints, you guys!

Yes, spark-shell worked fine with Mesos as master. I haven't tried executing multiple RDD actions in a row, though (I did a couple of successful counts on the HBase tables I am working with in several experiments, but nothing comparable to what my Spark jobs are doing), but I will check whether the shell stalls on some decent RDD action.

Also, thanks a bunch for the links to the binaries. This will literally save me hours!

kind regards
reinis

On 25.05.2015 21:00, Dean Wampler wrote:
Here is a link for builds of 1.4 RC2:

http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/

For a mvn repo, I believe the RC2 artifacts are here:

https://repository.apache.org/content/repositories/orgapachespark-1104/
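
If your project uses sbt, here's a minimal sketch of pointing a build at that staging repo (the artifact version is an assumption; RC artifacts are usually staged under the final version number, but check the repo listing to be sure):

// add the Apache staging repo holding the Spark 1.4.0 RC2 artifacts
resolvers += "Apache Spark RC staging" at "https://repository.apache.org/content/repositories/orgapachespark-1104/"

// version "1.4.0" is assumed from the staging convention; verify in the repo listing
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

The equivalent <repository> entry in a Maven pom.xml should work the same way.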

A few experiments you might try:

1. Does spark-shell work? It might start fine, but make sure you can create an RDD and use it, e.g., something like:

val rdd = sc.parallelize(Seq(1,2,3,4,5,6))
rdd foreach println
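
To mimic a job that runs several actions back to back (in fine-grained mode each batch of tasks is negotiated with Mesos separately), here's a slightly bigger sketch you could paste into the shell; sc is the SparkContext the shell provides:

// a few actions in a row; each should return promptly rather than hang
val nums = sc.parallelize(1 to 1000, 4)   // 4 partitions, so tasks land on several slaves
println(nums.count())                     // action 1
println(nums.map(_ * 2).sum())            // action 2
println(nums.filter(_ % 2 == 0).count()) // action 3

If the first action works but a later one stalls, that points at executor lifecycle management rather than your job logic.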

2. Try coarse-grained mode, which has different logic for executor management.

You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:

spark.mesos.coarse   true

Or, from this page <http://spark.apache.org/docs/latest/running-on-mesos.html>, set the property in a SparkConf object used to construct the SparkContext:

conf.set("spark.mesos.coarse", "true")

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, May 25, 2015 at 12:06 PM, Reinis Vicups <sp...@orbit-x.de> wrote:

    Hello,

    I assume I am running Spark in fine-grained mode, since I haven't
    changed the default here.

    One question regarding 1.4.0-RC1: is there a Maven snapshot
    repository I could use for my project config? (I know that I also
    have to download the source and run make-distribution for the
    executor.)

    thanks
    reinis


    On 25.05.2015 17:07, Iulian Dragoș wrote:

    On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups <sp...@orbit-x.de> wrote:

        Hello,

        I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with
        ZooKeeper, running on a cluster with 3 nodes on 64-bit Ubuntu.

        My application is compiled against Spark 1.3.1 (apparently
        with a Mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503, and
        Akka 2.3.10. Only with this combination have I succeeded in
        running Spark jobs on Mesos at all; different versions cause
        class loader issues.

        I am submitting Spark jobs with spark-submit using
        mesos://zk://.../mesos.


    Are you using coarse-grained or fine-grained mode?

        The sandbox log of slave node app01 (the one that stalls)
        shows the following:

        10:01:25.815506 35409 fetcher.cpp:214] Fetching URI
        'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
        10:01:26.497764 35409 fetcher.cpp:99] Fetching URI
        'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
        using Hadoop Client
        10:01:26.497869 35409 fetcher.cpp:109] Downloading resource
        from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
        to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
        10:01:32.877717 35409 fetcher.cpp:78] Extracted resource
        '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
        into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
        Using Spark's default log4j profile:
        org/apache/spark/log4j-defaults.properties
        10:01:34 INFO MesosExecutorBackend: Registered signal
        handlers for [TERM, HUP, INT]
        10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
        *10:01:34 ERROR MesosExecutorBackend: Received launchTask but
        executor was null*
        10:01:34.540870 35765 exec.cpp:206] Executor registered on
        slave 20150511-150924-3410235146-5050-1903-S3
        10:01:34 INFO MesosExecutorBackend: Registered with Mesos as
        executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus


    It looks like an inconsistent state in the Mesos scheduler: it
    tries to launch a task on a given slave before the executor has
    registered. This code was improved/refactored in 1.4; could you
    try 1.4.0-RC1?


Yes, and note the second message after the error you highlighted; that's when the executor would be registered with Mesos and the local object created.


    iulian

        10:01:34 INFO SecurityManager: Changing view acls to...
        10:01:35 INFO Slf4jLogger: Slf4jLogger started
        10:01:35 INFO Remoting: Starting remoting
        10:01:35 INFO Remoting: Remoting started; listening on
        addresses :[akka.tcp://sparkExecutor@app01:xxx]
        10:01:35 INFO Utils: Successfully started service
        'sparkExecutor' on port xxx.
        10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker:
        akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker
        10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster:
        akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster
        10:01:36 INFO DiskBlockManager: Created local directory at
        /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252
        10:01:36 INFO MemoryStore: MemoryStore started with capacity
        88.3 MB
        10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop
        library for your platform... using builtin-java classes where
        applicable
        10:01:36 INFO AkkaUtils: Connecting to
        OutputCommitCoordinator:
        akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator
        10:01:36 INFO Executor: Starting executor ID
        20150511-150924-3410235146-5050-1903-S3 on host app01
        10:01:36 INFO NettyBlockTransferService: Server created on XXX
        10:01:36 INFO BlockManagerMaster: Trying to register BlockManager
        10:01:36 INFO BlockManagerMaster: Registered BlockManager
        10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver:
        akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver

        As soon as the Spark driver is aborted, the following log
        entries are added to the sandbox log of slave node app01:

        10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown
        10:17:29 WARN ReliableDeliverySupervisor: Association with
        remote system [akka.tcp://sparkDriver@dev-web01] has failed,
        address is now gated for [5000] ms. Reason is: [Disassociated]

        A successful job instead shows the following in the Spark
        driver log:

        08:03:19,862 INFO o.a.s.s.TaskSetManager - Finished task 3.0
        in stage 1.0 (TID 7) in 1688 ms on app01 (1/4)
        08:03:19,869 INFO o.a.s.s.TaskSetManager - Finished task 0.0
        in stage 1.0 (TID 4) in 1700 ms on app03 (2/4)
        08:03:19,874 INFO o.a.s.s.TaskSetManager - Finished task 1.0
        in stage 1.0 (TID 5) in 1703 ms on app02 (3/4)
        08:03:19,878 INFO o.a.s.s.TaskSetManager - Finished task 2.0
        in stage 1.0 (TID 6) in 1706 ms on app02 (4/4)
        08:03:19,878 INFO  o.a.s.s.DAGScheduler - Stage 1
        (saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90)
        finished in 1.718 s
        08:03:19,878 INFO o.a.s.s.TaskSchedulerImpl - Removed TaskSet
        1.0, whose tasks have all completed, from pool
        08:03:19,886 INFO  o.a.s.s.DAGScheduler - Job 0 finished:
        saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90, took
        16.946405 s

        This corresponds nicely to the sandbox logs of the slave nodes:

        08:03:19 INFO Executor: Finished task 3.0 in stage 1.0 (TID
        7). 872 bytes result sent to driver
        08:03:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID
        4). 872 bytes result sent to driver
        08:03:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID
        5). 872 bytes result sent to driver
        08:03:19 INFO Executor: Finished task 2.0 in stage 1.0 (TID
        6). 872 bytes result sent to driver
        08:03:20 WARN ReliableDeliverySupervisor: Association with
        remote system [akka.tcp://sparkDriver@dev-web01] has failed,
        address is now gated for [5000] ms. Reason is: [Disassociated].




    --
    Iulian Dragos

    ------
    Reactive Apps on the JVM
    www.typesafe.com
