Hi,
I just configured my cluster to run with 1.4.0-rc2; alas, the dependency
jungle does not let one just download, configure, and start. Instead one
will have to fiddle with sbt settings for the next couple of nights:
2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://driverPropsFetcher@app03:44805] has failed, address is now gated for [5000] ms. Reason is: [org.apache.spark.rpc.akka.AkkaMessage].
2015-05-26 14:52:55,707 ERROR Remoting - org.apache.spark.rpc.akka.AkkaMessage
java.lang.ClassNotFoundException: org.apache.spark.rpc.akka.AkkaMessage
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:626)
at akka.util.ClassLoaderObjectInputStream.resolveClass(ClassLoaderObjectInputStream.scala:19)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
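For reference, this is roughly the sbt fragment I am fiddling with (a sketch only; the resolver is the RC2 staging repo Dean linked below, and I am assuming the RC artifacts are published under the final 1.4.0 version string):

// build.sbt (excerpt) - pull the 1.4.0-rc2 artifacts from the Apache staging repo
resolvers += "apache-staging" at "https://repository.apache.org/content/repositories/orgapachespark-1104/"

// assumption: staging artifacts use the final version number, not "-rc2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"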
kind regards
reinis
On 25.05.2015 23:09, Reinis Vicups wrote:
Great hints, you guys!
Yes, spark-shell worked fine with Mesos as master. I haven't tried to
execute multiple RDD actions in a row though (I did a couple of
successful counts on the HBase tables I am working with in several
experiments, but nothing that would compare to what my Spark jobs
are doing), but I will check whether the shell stalls on some decent RDD action.
Also thanks a bunch for the links to binaries. This will literally
save me hours!
kind regards
reinis
On 25.05.2015 21:00, Dean Wampler wrote:
Here is a link for builds of 1.4 RC2:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/
For a mvn repo, I believe the RC2 artifacts are here:
https://repository.apache.org/content/repositories/orgapachespark-1104/
A few experiments you might try:
1. Does spark-shell work? It might start fine, but make sure you can
create an RDD and use it, e.g., something like:
val rdd = sc.parallelize(Seq(1,2,3,4,5,6))
rdd foreach println
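Note that on a cluster the println output lands in the executors' stdout, not in the shell; to see results in the shell, collect first, e.g. (a small variation on the above):

val doubled = rdd.map(_ * 2).collect() // collect() returns an Array to the driver
doubled foreach println                // now prints in the shell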
2. Try coarse-grained mode, which has different logic for executor
management.
You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:
spark.mesos.coarse true
Or, from this page
<http://spark.apache.org/docs/latest/running-on-mesos.html>, set the
property in a SparkConf object used to construct the SparkContext:
conf.set("spark.mesos.coarse", "true")
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
On Mon, May 25, 2015 at 12:06 PM, Reinis Vicups <sp...@orbit-x.de> wrote:
Hello,
I assume I am running Spark in fine-grained mode, since I haven't
changed the default here.
One question regarding 1.4.0-RC1 - is there an mvn snapshot
repository I could use for my project config? (I know that I have
to download the source and run make-distribution for the executor as well.)
thanks
reinis
On 25.05.2015 17:07, Iulian Dragoș wrote:
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups <sp...@orbit-x.de> wrote:
Hello,
I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with
ZooKeeper, running on a cluster of 3 nodes on 64-bit Ubuntu.
My application is compiled against Spark 1.3.1 (apparently with a
Mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and Akka
2.3.10. Only with this combination have I succeeded in running
Spark jobs on Mesos at all; different versions cause
class loader issues.
I am submitting Spark jobs with spark-submit using
mesos://zk://.../mesos as the master.
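For concreteness, the invocation looks roughly like this (class and jar names are placeholders, and the ZK connection string is elided as above):

spark-submit \
  --master mesos://zk://.../mesos \
  --class com.example.ImportSparkJob \
  /path/to/my-job-assembly.jar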
Are you using coarse-grained or fine-grained mode?
The sandbox log of slave node app01 (the one that stalls) shows
the following:
10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client
10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
*10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
It looks like an inconsistent state on the Mesos scheduler: it
tries to launch a task on a given slave before the executor has
registered. This code was improved/refactored in 1.4; could you
try 1.4.0-RC1?
Yes, and note the second message after the error you highlighted;
that's when the executor would be registered with Mesos and the local
object created.
iulian
10:01:34 INFO SecurityManager: Changing view acls to...
10:01:35 INFO Slf4jLogger: Slf4jLogger started
10:01:35 INFO Remoting: Starting remoting
10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
10:01:35 INFO Utils: Successfully started service 'sparkExecutor' on port xxx.
10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker
10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster
10:01:36 INFO DiskBlockManager: Created local directory at /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252
10:01:36 INFO MemoryStore: MemoryStore started with capacity 88.3 MB
10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10:01:36 INFO AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator
10:01:36 INFO Executor: Starting executor ID 20150511-150924-3410235146-5050-1903-S3 on host app01
10:01:36 INFO NettyBlockTransferService: Server created on XXX
10:01:36 INFO BlockManagerMaster: Trying to register BlockManager
10:01:36 INFO BlockManagerMaster: Registered BlockManager
10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver
As soon as the Spark driver is aborted, the following log entries
are added to the sandbox log of slave node app01:
10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown
10:17:29 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]
A successful job instead shows the following in the Spark driver log:
08:03:19,862 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 1.0 (TID 7) in 1688 ms on app01 (1/4)
08:03:19,869 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 4) in 1700 ms on app03 (2/4)
08:03:19,874 INFO o.a.s.s.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 5) in 1703 ms on app02 (3/4)
08:03:19,878 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 1.0 (TID 6) in 1706 ms on app02 (4/4)
08:03:19,878 INFO o.a.s.s.DAGScheduler - Stage 1 (saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90) finished in 1.718 s
08:03:19,878 INFO o.a.s.s.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
08:03:19,886 INFO o.a.s.s.DAGScheduler - Job 0 finished: saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90, took 16.946405 s
This corresponds nicely to the sandbox logs of the slave nodes:
08:03:19 INFO Executor: Finished task 3.0 in stage 1.0 (TID 7). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 4). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 5). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 2.0 in stage 1.0 (TID 6). 872 bytes result sent to driver
08:03:20 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
--
Iulian Dragos
------
Reactive Apps on the JVM
www.typesafe.com