[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587461#comment-15587461 ]

Angus Gerry commented on SPARK-10872:
-

Will do. Thanks mate :).

> Derby error (XSDB6) when creating new HiveContext after restarting
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
> Issue Type: Bug
> Components: PySpark, SQL
> Affects Versions: 1.4.0, 1.4.1, 1.5.0
> Reporter: Dmytro Bielievtsov
>
> Starting from Spark 1.4.0 (works well on 1.3.1), the following code fails
> with "XSDB6: Another instance of Derby may have already booted the database
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart the Spark context
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start
> database 'metastore_db' with class loader
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see
> the next exception for details.
> at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
> at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
> at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
> at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
> at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
> at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
> at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
> at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
> at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
> at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
> at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
> at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
> at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
> at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
> at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
> at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
> at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
> at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
> at org.apache.hadoop.hive.metastore.Hi
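Until the underlying connection leak is fixed, one possible workaround is to point every new context at a fresh Derby directory, so a restarted context never tries to re-boot the database its predecessor left locked. The sketch below is only a hypothetical helper: {{javax.jdo.option.ConnectionURL}} is the standard Hive/DataNucleus setting for the Derby metastore, but whether Spark's internal execution Hive honours an override varies by version, and the function name is invented.

```python
import tempfile

def fresh_metastore_settings():
    """Build per-context Hive settings pointing Derby at a brand-new
    directory. Sketch only: the key is the standard Hive metastore
    setting, but Spark's internal executionHive may or may not honour
    it depending on version."""
    db_dir = tempfile.mkdtemp(prefix="metastore-")  # unique dir per context
    return {
        "javax.jdo.option.ConnectionURL":
            "jdbc:derby:;databaseName=%s/metastore_db;create=true" % db_dir,
    }
```

The returned pairs would then be supplied via {{hive-site.xml}} or a {{spark.hadoop.*}} configuration entry before each context is created.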
[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584052#comment-15584052 ]

Angus Gerry commented on SPARK-10872:
-

Hi [~srowen], I'm chasing down something in our code base at the moment that might be tangentially related to this issue. In our tests, we start and stop a new {{TestHiveContext}} for each test suite. Our builds recently started failing with this stack trace, ultimately caused by an {{IOException}} because of "Too many open files":

{noformat}
java.lang.IllegalStateException: failed to create a child event loop
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:68)
at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:104)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88)
at org.apache.spark.network.netty.NettyBlockTransferService.init(NettyBlockTransferService.scala:63)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:177)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:536)
...
Cause: io.netty.channel.ChannelException: failed to open a new selector
at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:128)
at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:120)
at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:64)
at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:104)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88)
...
Cause: java.io.IOException: Too many open files
at sun.nio.ch.IOUtil.makePipe(Native Method)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:65)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126)
at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:120)
at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:64)
at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
{noformat}

Running our test suite locally and keeping an eye on the JVM process with lsof, I can see that the number of open file handles keeps growing, and over 75% of the paths look something like this: {{/tmp/spark-a0ff08e6-ae94-42ad-8a9c-bc43dee0b283/metastore/seg0/c530.dat}}

My initial tracing through the code indicates that even though we're stopping the context, it's not closing its connection to the {{executionHive}} object, which runs as a Derby DB in a tmp directory as above. This is where my 'tangentially related' comes in: if the context were actually closing its Derby DB connections, then we mightn't be hitting this issue at all.

FWIW the [programming guide|http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark] does state the following, which at the very least _implies_ that stopping and then subsequently starting a context within one JVM is supported:

{quote}
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
{quote}

Personally I don't much care about said support other than needing it for our tests. If [~belevtsoff] doesn't start working on a PR for this, I'll start trying to work on a fix for my problems shortly.
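The lsof check described above can also be done from inside the leaking process itself. A rough stdlib equivalent of the author's lsof loop is sketched below; it is Linux-only (it reads {{/proc/self/fd}}), and the function names are invented for illustration.

```python
import os

def open_fd_paths():
    """Return the filesystem paths behind this process's open file
    descriptors. Linux-only: each entry in /proc/self/fd is a symlink
    to the open file."""
    fd_dir = "/proc/self/fd"
    paths = []
    for fd in os.listdir(fd_dir):
        try:
            paths.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            pass  # fd closed between listdir() and readlink()
    return paths

def count_open(substring):
    """Rough analogue of `lsof -p <pid> | grep <substring> | wc -l`."""
    return sum(substring in p for p in open_fd_paths())

# e.g. watch leaked Derby segment files accumulate across context restarts:
# count_open("metastore")
```

Called periodically (or after each {{TestHiveContext}} stop), a steadily rising count for paths under the Derby metastore directory would confirm the leak without shelling out to lsof.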
[jira] [Comment Edited] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392938#comment-15392938 ] Angus Gerry edited comment on SPARK-16702 at 7/26/16 12:54 AM: --- I'm not so sure about SPARK-12419. SPARK-16533 however definitely looks the same. The logs in my scenario are similar to what's described there. Effectively it's just repetitions of: {noformat} WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)] WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts {noformat} was (Author: ango...@gmail.com): I'm not so sure about SPARK-12419. SPARK-16355 however definitely looks the same. The logs in my scenario are similar to what's described there. Effectively it's just repetitions of: {noformat} WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)] WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts {noformat} > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. 
> I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a spark job, and executors are lost, the job occasionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by yarn, but I'm not > confident enough to state that it's restricted to just this scenario. > Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThre
[jira] [Commented] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392938#comment-15392938 ] Angus Gerry commented on SPARK-16702: - I'm not so sure about SPARK-12419. SPARK-16355 however definitely looks the same. The logs in my scenario are similar to what's described there. Effectively it's just repetitions of: {noformat} WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)] WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts {noformat}
[jira] [Updated] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Angus Gerry updated SPARK-16702: Attachment: SparkThreadsBlocked.txt
[jira] [Created] (SPARK-16702) Driver hangs after executors are lost
Angus Gerry created SPARK-16702:
---

Summary: Driver hangs after executors are lost
Key: SPARK-16702
URL: https://issues.apache.org/jira/browse/SPARK-16702
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.0, 1.6.2, 1.6.1
Reporter: Angus Gerry

It's my first time, please be kind.

I'm still trying to debug this error locally - at this stage I'm pretty convinced that it's a weird deadlock/livelock problem due to the use of {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is possibly tangentially related to the issues discussed in SPARK-1560 around the use of blocking calls within locks.

h4. Observed Behavior
When running a Spark job, and executors are lost, the job occasionally goes into a state where it makes no progress with tasks. Most commonly it seems that the issue occurs when executors are preempted by YARN, but I'm not confident enough to state that it's restricted to just this scenario.

Upon inspecting a thread dump from the driver, the following stack traces seem noteworthy (a full thread dump is attached):

{noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:190)
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447)
org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423)
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{noformat}

{noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)}
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
scala.Option.foreach(Option.scala:257)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{noformat}

{noformat:title=Thread 640: kill-executor-thread (BLOCKED)}
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:488)
org.apache.spark.SparkContext.killAndReplaceExecutor(SparkContext.scala:1499)
org.apache.spark.HeartbeatRec
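The blocked threads quoted above have the shape of a classic two-party deadlock: one thread blocks on a remote reply while holding a lock, and the thread that would deliver progress is blocked waiting on that same lock. A minimal sketch of that shape in plain Python threading follows; this is not Spark code, all names are invented stand-ins (`scheduler_lock` for the scheduler backend's lock, `rpc_reply` for the reply that `askWithRetry` awaits), and timeouts are used so the demonstration terminates instead of hanging.

```python
import threading

scheduler_lock = threading.Lock()   # stand-in for the scheduler backend's lock
rpc_reply = threading.Event()       # stand-in for the awaited RPC reply
holding_lock = threading.Event()    # test scaffolding: signals lock acquisition
results = {}

def allocation_thread():
    # Like spark-dynamic-executor-allocation: take the lock, then block
    # waiting for an RPC reply while still holding it.
    with scheduler_lock:
        holding_lock.set()
        results["allocation_got_reply"] = rpc_reply.wait(timeout=0.6)

def dispatcher_thread():
    # Like dispatcher-event-loop: must take the same lock before it can
    # ever deliver the reply the allocation thread is waiting for.
    got = scheduler_lock.acquire(timeout=0.3)
    results["dispatcher_got_lock"] = got
    if got:
        rpc_reply.set()
        scheduler_lock.release()

t1 = threading.Thread(target=allocation_thread)
t2 = threading.Thread(target=dispatcher_thread)
t1.start()
holding_lock.wait()   # ensure the allocation thread owns the lock first
t2.start()
t1.join(); t2.join()
print(results)        # neither thread made progress within its timeout
```

With real blocking calls (no timeouts) the two threads would wait on each other forever, which is consistent with the observed "no progress with tasks" state and with SPARK-1560's warning about blocking calls made while holding locks.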