Thanks for the information, Andres. Created the following PR: https://github.com/apache/spark/pull/7756
The gist of the change is to isolate each shutdown step inside SparkContext.stop() so that an exception in one step (such as the DAGScheduler failing while yarn tears down the executors) can no longer prevent clearActiveContext() from running. Roughly (a simplified sketch of the idea, not the literal diff; the field and helper names are abbreviated):

    // Simplified sketch of the approach in the PR, not the exact patch.
    def stop(): Unit = {
      if (!stopped.compareAndSet(false, true)) {
        logInfo("SparkContext already stopped.")
        return
      }
      // Isolate each shutdown step so that one failing step (e.g.
      // _dagScheduler.stop() blowing up after yarn killed the app)
      // cannot abort the steps that follow it.
      tryLogNonFatalError { postApplicationEnd() }
      tryLogNonFatalError { _ui.foreach(_.stop()) }
      tryLogNonFatalError { if (_dagScheduler != null) _dagScheduler.stop() }
      tryLogNonFatalError { _env.stop() }
      // The key point: this is now reached even when a step above failed.
      SparkContext.clearActiveContext()
      logInfo("Successfully stopped SparkContext")
    }

    // Log and swallow any non-fatal exception thrown by a shutdown step.
    private def tryLogNonFatalError(block: => Unit): Unit = {
      try {
        block
      } catch {
        case scala.util.control.NonFatal(e) =>
          logError("Error during SparkContext shutdown", e)
      }
    }
> 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: <address removed>
> 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
> 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down
> Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
>         at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
>         at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
>         at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
>         at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
>         at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
> Caused by: java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)
>         at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>         at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
>         ... 6 more
> 15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 2 on node09.tresata.com: remote Rpc client disassociated
> 15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 1 on node06.tresata.com: remote Rpc client disassociated
> 15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 128182 ms exceeds timeout 120000 ms
> 15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 2 (already removed): Executor heartbeat timed out after 128182 ms
> 15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 126799 ms exceeds timeout 120000 ms
> 15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 1 (already removed): Executor heartbeat timed out after 126799 ms
> 15/07/29 15:19:09 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
> 15/07/29 15:19:09 WARN YarnClientSchedulerBackend: Executor to kill 2 does not exist!
> 15/07/29 15:19:09 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkYarnAM@<address removed>]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /<address removed>
> 15/07/29 15:19:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: <address removed>
>
> It seems like the call
>
>     driverEndpoint.askWithRetry[Boolean](StopExecutors)
>
> inside the stopExecutors() method of CoarseGrainedSchedulerBackend is
> throwing an exception, which then bubbles up as a SparkException ("Error
> asking standalone scheduler to shut down executors"). This is then not
> caught by the SparkContext.stop() method, so stop() never reaches the
> clearActiveContext() call. Does this sound right?
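>
> To make that concrete, here is a toy sketch of the control flow I
> suspect (simplified, not the actual Spark code):
>
>     import java.util.concurrent.atomic.AtomicBoolean
>
>     object ToyContext {
>       private val stopped = new AtomicBoolean(false)
>       var active = true
>
>       // Stand-in for stopExecutors(): askWithRetry(StopExecutors) is
>       // interrupted once yarn has killed the application.
>       private def stopExecutors(): Unit =
>         throw new RuntimeException(
>           "Error asking standalone scheduler to shut down executors")
>
>       def stop(): Unit = {
>         if (!stopped.compareAndSet(false, true)) {
>           println("ToyContext already stopped.")
>           return
>         }
>         stopExecutors() // throws...
>         active = false  // ...so this line is never reached
>       }
>     }
>
> The first stop() throws and leaves active == true, and every later call
> hits the early return, so nothing can ever mark the context inactive.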
> Also, does bq == be quiet?
>
> Thanks for the reply!
>
> -Andres
>
> On Wed, Jul 29, 2015 at 1:10 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. it seems like we never get to the clearActiveContext() call by the end
>>
>> Looking at the stop() method, there is only one early return, right
>> after the stopped.compareAndSet() call. Is there any clue in the
>> driver log?
>>
>> Cheers
>>
>> On Wed, Jul 29, 2015 at 9:38 AM, Andres Perez <and...@tresata.com> wrote:
>>
>>> Hi everyone. I'm running into an issue with SparkContexts when running
>>> on Yarn. The issue is observable when I reproduce these steps in the
>>> spark-shell (version 1.4.1):
>>>
>>> scala> sc
>>> res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee
>>>
>>> (Note the pointer address of sc. Then run
>>> yarn application -kill <application-id> on the corresponding yarn
>>> application.)
>>>
>>> scala> val rdd = sc.parallelize(List(1,2,3))
>>> java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
>>>         at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>>         at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1914)
>>>         at org.apache.spark.SparkContext.parallelize$default$2(SparkContext.scala:695)
>>>         ... 49 elided
>>>
>>> (Great, the SparkContext has been stopped by the killed yarn
>>> application, as expected.)
>>>
>>> Alternatively:
>>>
>>> scala> sc.stop()
>>> 15/07/29 12:10:14 INFO SparkContext: SparkContext already stopped.
>>>
>>> (OK, so it's confirmed that it has been stopped.)
>>>
>>> scala> org.apache.spark.SparkContext.getOrCreate
>>> res3: org.apache.spark.SparkContext = org.apache.spark.SparkContext@7b965dee
>>>
>>> (Hm, that's the same SparkContext; note the pointer address.)
>>>
>>> The issue here is that the SparkContext.getOrCreate method returns the
>>> active SparkContext if one exists, or creates a new one otherwise. Here
>>> it returns the original SparkContext, meaning the one we verified was
>>> stopped above is still marked as active. How can we recover from this?
>>> We can't use the current one once it's been stopped (unless we allow
>>> multiple contexts with the spark.driver.allowMultipleContexts flag, but
>>> that's a band-aid solution), and we can't seem to create a new one,
>>> because the old one is still marked as active.
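>>>
>>> As I understand it, getOrCreate conceptually does something like this
>>> (paraphrased for illustration, not the actual source):
>>>
>>>     def getOrCreate(conf: SparkConf): SparkContext = synchronized {
>>>       if (activeContext == null) {
>>>         activeContext = new SparkContext(conf)
>>>       }
>>>       // Returned as-is, even if stop() ran but never got as far as
>>>       // clearing this reference.
>>>       activeContext
>>>     }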
>>>
>>> Digging a little deeper: in the body of SparkContext's stop() method,
>>> it seems we never get to the clearActiveContext() call at the end,
>>> which would have marked the context as inactive. Any future call to
>>> stop(), however, exits early because the stopped variable is already
>>> true (hence the "SparkContext already stopped." log message), so I
>>> don't see any other way to mark the context as not active. Something
>>> about how the SparkContext was stopped after the yarn application was
>>> killed is preventing it from cleaning up properly.
>>>
>>> Any ideas about this?
>>>
>>> Thanks,
>>>
>>> Andres
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/stopped-SparkContext-remaining-active-tp24065.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.