Thanks for the information, Andres.

Created the following PR:
https://github.com/apache/spark/pull/7756

BTW bq is used on JIRA to reference other people's comment.

Cheers

On Wed, Jul 29, 2015 at 12:46 PM, Andres Perez <and...@tresata.com> wrote:

> Hi Ted. Taking a look at the logs, I get the feeling like there may be an
> uncaught exception blowing up the SparkContext.stop method, causing it to
> not reach the line where it gets set as inactive. The line referenced below
> in SparkContext (SparkContext.scala:1644) is the call: _dagScheduler.stop()
>
> 15/07/29 15:17:09 INFO Client: Deleting staging directory
> .sparkStaging/application_1436825124867_0223
> 15/07/29 15:17:09 ERROR YarnClientSchedulerBackend: Yarn application has
> already exited with state KILLED!
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/metrics/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/api,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/static,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/executors/threadDump,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/executors/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/executors,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/environment/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/environment,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/storage/rdd,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/storage/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/storage,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/pool/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/pool,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/stage/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/stage,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/stages,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/jobs/job/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/jobs/job,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/jobs/json,null}
> 15/07/29 15:17:09 INFO ContextHandler: stopped
> o.s.j.s.ServletContextHandler{/jobs,null}
> 15/07/29 15:17:09 INFO SparkUI: Stopped Spark web UI at http://<address
> removed>
> 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
> ApplicationMaster has disassociated: <address removed>
> 15/07/29 15:17:09 INFO DAGScheduler: Stopping DAGScheduler
> 15/07/29 15:17:09 WARN ReliableDeliverySupervisor: Association with remote
> system [akka.tcp://sparkYarnAM@<address removed>] has failed, address is
> now gated for [5000] ms. Reason is: [Disassociated].
> 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
> ApplicationMaster has disassociated: <address removed>
> 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all
> executors
> Exception in thread "Yarn application state monitor"
> org.apache.spark.SparkException: Error asking standalone scheduler to shut
> down executors
>         at
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
>         at
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
>         at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
>         at
> org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
>         at
> org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
>         at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
>         at
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
> Caused by: java.lang.InterruptedException
>         at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
>         at
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>         at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>         at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>         at
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:190)15/07/29
> 15:17:09 INFO YarnClientSchedulerBackend: Asking each executor to shut down
>
>         at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
>         at
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
>         at
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
>         ... 6 more
> 15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 2 on
> node09.tresata.com: remote Rpc client disassociated
> 15/07/29 15:17:09 ERROR YarnScheduler: Lost executor 1 on
> node06.tresata.com: remote Rpc client disassociated
> 5/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 2 with no
> recent heartbeats: 128182 ms exceeds timeout 120000 ms
> 15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 2 (already
> removed): Executor heartbeat timed out after 128182 ms
> 15/07/29 15:19:09 WARN HeartbeatReceiver: Removing executor 1 with no
> recent heartbeats: 126799 ms exceeds timeout 120000 ms
> 15/07/29 15:19:09 ERROR YarnScheduler: Lost an executor 1 (already
> removed): Executor heartbeat timed out after 126799 ms
> 15/07/29 15:19:09 INFO YarnClientSchedulerBackend: Requesting to kill
> executor(s) 2
> 15/07/29 15:19:09 WARN YarnClientSchedulerBackend: Executor to kill 2 does
> not exist!
> 15/07/29 15:19:09 WARN Remoting: Tried to associate with unreachable
> remote address [akka.tcp://sparkYarnAM@<address removed>]. Address is now
> gated for 5000 ms, all messages to this address will be delivered to dead
> letters. Reason: Connection refused: /<address removed>
> 15/07/29 15:19:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
> ApplicationMaster has disassociated: <address removed>
>
> It seems like the call,
>
> driverEndpoint.askWithRetry[Boolean](StopExecutors)
>
> inside the stopExecutors() method of CoarseGrainedSchedulerBackend is
> throwing an exception, which then bubbles up as a SparkException ("Error
> asking standalone scheduler to shut down executors"). This is then not
> caught by the SparkContext.stop() method (and thus stop() never reaches
> the clearActiveContext() call). Does this sound right?
>
> Also, does bq == be quiet???
>
> Thanks for the reply!
>
> -Andres
>
> On Wed, Jul 29, 2015 at 1:10 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> bq. it seems like we never get to the clearActiveContext() call by the
>> end
>>
>> Looking at stop() method, there is only one early return
>> after stopped.compareAndSet() call.
>> Is there any clue from driver log ?
>>
>> Cheers
>>
>> On Wed, Jul 29, 2015 at 9:38 AM, Andres Perez <and...@tresata.com> wrote:
>>
>>> Hi everyone. I'm running into an issue with SparkContexts when running on
>>> Yarn. The issue is observable when I reproduce these steps in the
>>> spark-shell (version 1.4.1):
>>>
>>> scala> sc
>>> res0: org.apache.spark.SparkContext =
>>> org.apache.spark.SparkContext@7b965dee
>>>
>>> *Note the pointer address of sc.
>>>
>>> (Then yarn application -kill <application-id> on the corresponding yarn
>>> application)
>>>
>>> scala> val rdd = sc.parallelize(List(1,2,3))
>>> java.lang.IllegalStateException: Cannot call methods on a stopped
>>> SparkContext
>>>   at
>>> org.apache.spark.SparkContext.org
>>> $apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
>>>
>>>   at
>>> org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1914)
>>>   at
>>>
>>> org.apache.spark.SparkContext.parallelize$default$2(SparkContext.scala:695)
>>>   ... 49 elided
>>>
>>> (Great, the SparkContext has been stopped by the killed yarn
>>> application, as
>>> expected.)
>>>
>>> alternatively:
>>>
>>> scala> sc.stop()
>>> 15/07/29 12:10:14 INFO SparkContext: SparkContext already stopped.
>>>
>>> (OK, so it's confirmed that it has been stopped.)
>>>
>>> scala> org.apache.spark.SparkContext.getOrCreate
>>> res3: org.apache.spark.SparkContext =
>>> org.apache.spark.SparkContext@7b965dee
>>>
>>> (Hm, that's the same SparkContext, note the pointer address.)
>>>
>>> The issue here is that the SparkContext.getOrCreate method returns either
>>> the active SparkContext, if it exists, or creates a new one. Here it is
>>> returning the original SparkContext, meaning the one we verified was
>>> stopped
>>> above is still active. How can we recover from this? We can't use the
>>> current one once it's been stopped (unless we allow for multiple
>>> contexts to
>>> run using the spark.driver.allowMultipleContexts flag, but that's a
>>> band-aid
>>> solution), and we can't seem to create a new one, because the old one is
>>> still marked as active.
>>>
>>> Digging a little deeper, in the body of the stop() method of
>>> SparkContext,
>>> it seems like we never get to the clearActiveContext() call by the end,
>>> which would have marked the context as inactive. Any future call to
>>> stop(),
>>> however, will exit early since the stopped variable is true (hence the
>>> "SparkContext already stopped." log message). So I don't see any other
>>> way
>>> to mark the context as not active. Something about how the SparkContext
>>> was
>>> stopped after killing the yarn application is preventing the SparkContext
>>> from cleaning up properly.
>>>
>>> Any ideas about this?
>>>
>>> Thanks,
>>>
>>> Andres
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/stopped-SparkContext-remaining-active-tp24065.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>

Reply via email to