Hi,
I just configured my cluster to run with 1.4.0-rc2; alas, the dependency
jungle does not let one just download, configure, and start. Instead one
will have to fiddle with sbt settings for the next couple of nights:
2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor - Association with remote system [akka.tcp://driverPropsFetcher@app03:44805] has failed, address is now gated for [5000] ms. Reason is: [org.apache.spark.rpc.akka.AkkaMessage].
2015-05-26 14:52:55,707 ERROR Remoting - org.apache.spark.rpc.akka.AkkaMessage
java.lang.ClassNotFoundException: org.apache.spark.rpc.akka.AkkaMessage
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:626)
at akka.util.ClassLoaderObjectInputStream.resolveClass(ClassLoaderObjectInputStream.scala:19)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
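For reference, this is roughly the sbt fragment I am fiddling with (a sketch only; the resolver is the RC2 staging repo Dean linked below, and I am assuming the RC artifacts are published under the final 1.4.0 version string):

// build.sbt (excerpt) - pull the 1.4.0-rc2 artifacts from the Apache staging repo
resolvers += "apache-staging" at "https://repository.apache.org/content/repositories/orgapachespark-1104/"

// assumption: staging artifacts use the final version number, not "-rc2"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"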
kind regards
reinis
On 25.05.2015 23:09, Reinis Vicups wrote:
Great hints, you guys!
Yes, spark-shell worked fine with Mesos as master. I haven't tried to
execute multiple RDD actions in a row though (I did a couple of
successful counts on the HBase tables I am working with in several
experiments, but nothing that would compare to what my Spark jobs
are doing), but I will check whether the shell stalls on some decent RDD action.
Also thanks a bunch for the links to binaries. This will literally
save me hours!
kind regards
reinis
On 25.05.2015 21:00, Dean Wampler wrote:
Here is a link for builds of 1.4 RC2:
http://people.apache.org/~pwendell/spark-releases/spark-1.4.0-rc2-bin/
For a mvn repo, I believe the RC2 artifacts are here:
https://repository.apache.org/content/repositories/orgapachespark-1104/
A few experiments you might try:
1. Does spark-shell work? It might start fine, but make sure you can
create an RDD and use it, e.g., something like:
val rdd = sc.parallelize(Seq(1,2,3,4,5,6))
rdd foreach println
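Note that on a cluster the println output lands in the executors' stdout, not in the shell; to see results in the shell, collect first, e.g. (a small variation on the above):

val doubled = rdd.map(_ * 2).collect() // collect() returns an Array to the driver
doubled foreach println                // now prints in the shell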
2. Try coarse-grained mode, which has different logic for executor
management.
You can set it in the $SPARK_HOME/conf/spark-defaults.conf file:
spark.mesos.coarse true
Or, from this page
<http://spark.apache.org/docs/latest/running-on-mesos.html>, set the
property in a SparkConf object used to construct the SparkContext:
conf.set("spark.mesos.coarse", "true")
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
On Mon, May 25, 2015 at 12:06 PM, Reinis Vicups <sp...@orbit-x.de> wrote:
Hello,
I assume I am running Spark in fine-grained mode, since I haven't
changed the default here.
One question regarding 1.4.0-RC1 - is there an mvn snapshot
repository I could use for my project config? (I know that I have
to download the source and run make-distribution for the executor as well.)
thanks
reinis
On 25.05.2015 17:07, Iulian Dragoș wrote:
On Mon, May 25, 2015 at 2:43 PM, Reinis Vicups <sp...@orbit-x.de> wrote:
Hello,
I am using Spark 1.3.1-hadoop2.4 with Mesos 0.22.1 with
ZooKeeper, running on a cluster of 3 nodes on 64-bit Ubuntu.
My application is compiled against Spark 1.3.1 (apparently with a
Mesos 0.21.0 dependency), hadoop 2.5.1-mapr-1503 and Akka
2.3.10. Only with this combination have I succeeded in running
Spark jobs on Mesos at all; different versions cause
class loader issues.
I am submitting Spark jobs with spark-submit using
mesos://zk://.../mesos as the master.
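For concreteness, the invocation looks roughly like this (class and jar names are placeholders, and the ZK connection string is elided as above):

spark-submit \
  --master mesos://zk://.../mesos \
  --class com.example.ImportSparkJob \
  /path/to/my-job-assembly.jar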
Are you using coarse-grained or fine-grained mode?
The sandbox log of slave node app01 (the one that stalls) shows
the following:
10:01:25.815506 35409 fetcher.cpp:214] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:26.497764 35409 fetcher.cpp:99] Fetching URI 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' using Hadoop Client
10:01:26.497869 35409 fetcher.cpp:109] Downloading resource from 'hdfs://dev-hadoop01/apps/spark-1.3.1-bin-hadoop2.4.tgz' to '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz'
10:01:32.877717 35409 fetcher.cpp:78] Extracted resource '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05/spark-1.3.1-bin-hadoop2.4.tgz' into '/tmp/mesos/slaves/20150511-150924-3410235146-5050-1903-S3/frameworks/20150511-150924-3410235146-5050-1903-0249/executors/20150511-150924-3410235146-5050-1903-S3/runs/ec3a0f13-2f44-4952-bb23-4d48abaacc05'
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
10:01:34 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
10:01:34.459292 35730 exec.cpp:132] Version: 0.22.0
*10:01:34 ERROR MesosExecutorBackend: Received launchTask but executor was null*
10:01:34.540870 35765 exec.cpp:206] Executor registered on slave 20150511-150924-3410235146-5050-1903-S3
10:01:34 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150511-150924-3410235146-5050-1903-S3 with 1 cpus
It looks like an inconsistent state on the Mesos scheduler: it
tries to launch a task on a given slave before the executor has
registered. This code was improved/refactored in 1.4; could you
try 1.4.0-RC1?
Yes, and note the second message after the error you highlighted;
that's when the executor would be registered with Mesos and the local
object created.
iulian
10:01:34 INFO SecurityManager: Changing view acls to...
10:01:35 INFO Slf4jLogger: Slf4jLogger started
10:01:35 INFO Remoting: Starting remoting
10:01:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@app01:xxx]
10:01:35 INFO Utils: Successfully started service 'sparkExecutor' on port xxx.
10:01:35 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@dev-web01/user/MapOutputTracker
10:01:35 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkDriver@dev-web01/user/BlockManagerMaster
10:01:36 INFO DiskBlockManager: Created local directory at /tmp/spark-52a6585a-f9f2-4ab6-bebc-76be99b0c51c/blockmgr-e6d79818-fe30-4b5c-bcd6-8fbc5a201252
10:01:36 INFO MemoryStore: MemoryStore started with capacity 88.3 MB
10:01:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
10:01:36 INFO AkkaUtils: Connecting to OutputCommitCoordinator: akka.tcp://sparkDriver@dev-web01/user/OutputCommitCoordinator
10:01:36 INFO Executor: Starting executor ID 20150511-150924-3410235146-5050-1903-S3 on host app01
10:01:36 INFO NettyBlockTransferService: Server created on XXX
10:01:36 INFO BlockManagerMaster: Trying to register BlockManager
10:01:36 INFO BlockManagerMaster: Registered BlockManager
10:01:36 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@dev-web01/user/HeartbeatReceiver
As soon as the Spark driver is aborted, the following log entries
are added to the sandbox log of slave node app01:
10:17:29.559433 35772 exec.cpp:379] Executor asked to shutdown
10:17:29 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]
A successful job instead shows the following in the Spark driver log:
08:03:19,862 INFO o.a.s.s.TaskSetManager - Finished task 3.0 in stage 1.0 (TID 7) in 1688 ms on app01 (1/4)
08:03:19,869 INFO o.a.s.s.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 4) in 1700 ms on app03 (2/4)
08:03:19,874 INFO o.a.s.s.TaskSetManager - Finished task 1.0 in stage 1.0 (TID 5) in 1703 ms on app02 (3/4)
08:03:19,878 INFO o.a.s.s.TaskSetManager - Finished task 2.0 in stage 1.0 (TID 6) in 1706 ms on app02 (4/4)
08:03:19,878 INFO o.a.s.s.DAGScheduler - Stage 1 (saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90) finished in 1.718 s
08:03:19,878 INFO o.a.s.s.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool
08:03:19,886 INFO o.a.s.s.DAGScheduler - Job 0 finished: saveAsNewAPIHadoopDataset at ImportSparkJob.scala:90, took 16.946405 s
This corresponds nicely to the sandbox logs of the slave nodes:
08:03:19 INFO Executor: Finished task 3.0 in stage 1.0 (TID 7). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 4). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 5). 872 bytes result sent to driver
08:03:19 INFO Executor: Finished task 2.0 in stage 1.0 (TID 6). 872 bytes result sent to driver
08:03:20 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@dev-web01] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
--
Iulian Dragos
------
Reactive Apps on the JVM
www.typesafe.com