[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-18 Thread Angus Gerry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587461#comment-15587461
 ] 

Angus Gerry commented on SPARK-10872:
-

Will do. Thanks mate :).

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from Spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart the Spark context 
> several times for isolated jobs, to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see the next exception for details.
>   at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
>   at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>   at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>   at 
> 

[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2016-10-17 Thread Angus Gerry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584052#comment-15584052
 ] 

Angus Gerry commented on SPARK-10872:
-

Hi [~srowen],

I'm chasing down something in our code base at the moment that might be 
tangentially related to this issue. In our tests, we start and stop a new 
{{TestHiveContext}} for each test suite. Our builds recently started failing 
with the stack trace below, ultimately caused by an {{IOException}} for "Too 
many open files":
{noformat}
java.lang.IllegalStateException: failed to create a child event loop
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:68)
at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:104)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88)
at org.apache.spark.network.netty.NettyBlockTransferService.init(NettyBlockTransferService.scala:63)
at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:177)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:536)
...
Cause: io.netty.channel.ChannelException: failed to open a new selector
at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:128)
at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:120)
at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:64)
at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:56)
at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:104)
at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:88)
...
Cause: java.io.IOException: Too many open files
at sun.nio.ch.IOUtil.makePipe(Native Method)
at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:65)
at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:126)
at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:120)
at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:87)
at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:64)
at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:61)
at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:52)
{noformat}

Running our test suite locally and keeping an eye on the JVM process with 
{{lsof}}, I can see that the number of open file handles keeps growing, and 
over 75% of the paths look something like this: 
{{/tmp/spark-a0ff08e6-ae94-42ad-8a9c-bc43dee0b283/metastore/seg0/c530.dat}}
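
For reference, this is roughly the check I've been running alongside the suite. It's just a sketch: the {{lsof -p}} invocation and the {{/metastore/seg0/}} pattern reflect what I observed above, while the helper name and the 10-second sampling interval are made up for illustration.
{code:python}
import subprocess
import sys
import time

def count_open_handles(pid, pattern="/metastore/seg0/"):
    """Return (total open files, files matching pattern) for a process, via lsof."""
    out = subprocess.check_output(["lsof", "-p", str(pid)]).decode("utf-8", "replace")
    # lsof prints one open file per line; the path is the last column.
    paths = [line.split()[-1] for line in out.splitlines()[1:] if line.strip()]
    return len(paths), sum(1 for p in paths if pattern in p)

if __name__ == "__main__":
    pid = int(sys.argv[1])  # pid of the test JVM, e.g. taken from `jps`
    while True:
        total, derby = count_open_handles(pid)
        print("open handles: %6d   derby metastore files: %6d" % (total, derby))
        time.sleep(10)
{code}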

My initial tracing through the code indicates that even though we're stopping 
the context, it's not closing its connection to the {{executionHive}} client, 
which is backed by a Derby DB in a tmp directory like the one above.

This is where my 'tangentially related' comes in - if the context were actually 
closing its derby DB connections, then we mightn't be hitting the issue at all.
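
On the original XSDB6 symptom, the workaround I'm experimenting with is to point each new {{HiveContext}} at a fresh Derby metastore location, so the second context never tries to boot the same {{metastore_db}}. This is only a sketch of that idea: {{javax.jdo.option.ConnectionURL}} is the standard Hive metastore JDBC setting, but whether routing it through the {{spark.hadoop.*}} prefix is picked up by the isolated Hive client, and whether it helps with the {{executionHive}} leak at all, are assumptions on my part.
{code:python}
import tempfile

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

def fresh_hive_context(app_name):
    """Create a SparkContext + HiveContext whose metastore lives in a brand-new temp dir."""
    metastore_dir = tempfile.mkdtemp(prefix="metastore-")
    conf = (SparkConf()
            .setMaster("local[*]")
            .setAppName(app_name)
            # Assumption: spark.hadoop.* entries are propagated into the Hive client
            # config, so each context boots its own Derby DB instead of ~/metastore_db.
            .set("spark.hadoop.javax.jdo.option.ConnectionURL",
                 "jdbc:derby:;databaseName=%s/metastore_db;create=true" % metastore_dir))
    sc = SparkContext(conf=conf)
    return sc, HiveContext(sc)

sc, sql = fresh_hive_context("app1")
sql.createDataFrame([[1]]).collect()
sc.stop()

sc, sql = fresh_hive_context("app2")
sql.createDataFrame([[1]]).collect()  # should no longer collide with the first metastore
sc.stop()
{code}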

FWIW the [programming 
guide|http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark]
 does state the following, which at the very least _implies_ that stopping and 
subsequently starting a context within one JVM is supported.
{quote}
Only one SparkContext may be active per JVM. You must stop() the active 
SparkContext before creating a new one.
{quote}

Personally I don't much care about said support other than needing it for our 
tests. If [~belevtsoff] doesn't start working on a PR for this, I'll start 
trying to work on a fix for my problems shortly.

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from Spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the 

[jira] [Comment Edited] (SPARK-16702) Driver hangs after executors are lost

2016-07-25 Thread Angus Gerry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392938#comment-15392938
 ] 

Angus Gerry edited comment on SPARK-16702 at 7/26/16 12:54 AM:
---

I'm not so sure about SPARK-12419.

SPARK-16533 however definitely looks the same. The logs in my scenario are 
similar to what's described there. Effectively it's just repetitions of:
{noformat}
WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation
org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)]
WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts
WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts
{noformat}


was (Author: ango...@gmail.com):
I'm not so sure about SPARK-12419.

SPARK-16355 however definitely looks the same. The logs in my scenario are 
similar to what's described there. Effectively it's just repetitions of:
{noformat}
WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation
org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)]
WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts
WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts
{noformat}

> Driver hangs after executors are lost
> -
>
> Key: SPARK-16702
> URL: https://issues.apache.org/jira/browse/SPARK-16702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Angus Gerry
> Attachments: SparkThreadsBlocked.txt
>
>
> It's my first time, please be kind.
> I'm still trying to debug this error locally - at this stage I'm pretty 
> convinced that it's a weird deadlock/livelock problem due to the use of 
> {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is 
> possibly tangentially related to the issues discussed in SPARK-1560 around 
> the use of blocking calls within locks.
> h4. Observed Behavior
> When running a Spark job, and executors are lost, the job occasionally goes 
> into a state where it makes no progress with tasks. Most commonly it seems 
> that the issue occurs when executors are preempted by YARN, but I'm not 
> confident enough to state that it's restricted to just this scenario.
> Upon inspecting a thread dump from the driver, the following stack traces 
> seem noteworthy (a full thread dump is attached):
> {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447)
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423)
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
> 

[jira] [Commented] (SPARK-16702) Driver hangs after executors are lost

2016-07-25 Thread Angus Gerry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392938#comment-15392938
 ] 

Angus Gerry commented on SPARK-16702:
-

I'm not so sure about SPARK-12419.

SPARK-16355 however definitely looks the same. The logs in my scenario are 
similar to what's described there. Effectively it's just repetitions of:
{noformat}
WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation
org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)]
WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts
WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts
{noformat}

> Driver hangs after executors are lost
> -
>
> Key: SPARK-16702
> URL: https://issues.apache.org/jira/browse/SPARK-16702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Angus Gerry
> Attachments: SparkThreadsBlocked.txt
>
>
> It's my first time, please be kind.
> I'm still trying to debug this error locally - at this stage I'm pretty 
> convinced that it's a weird deadlock/livelock problem due to the use of 
> {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is 
> possibly tangentially related to the issues discussed in SPARK-1560 around 
> the use of blocking calls within locks.
> h4. Observed Behavior
> When running a Spark job, and executors are lost, the job occasionally goes 
> into a state where it makes no progress with tasks. Most commonly it seems 
> that the issue occurs when executors are preempted by YARN, but I'm not 
> confident enough to state that it's restricted to just this scenario.
> Upon inspecting a thread dump from the driver, the following stack traces 
> seem noteworthy (a full thread dump is attached):
> {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447)
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423)
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {noformat}
> {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)}
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
> 

[jira] [Updated] (SPARK-16702) Driver hangs after executors are lost

2016-07-24 Thread Angus Gerry (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Angus Gerry updated SPARK-16702:

Attachment: SparkThreadsBlocked.txt

> Driver hangs after executors are lost
> -
>
> Key: SPARK-16702
> URL: https://issues.apache.org/jira/browse/SPARK-16702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Angus Gerry
> Attachments: SparkThreadsBlocked.txt
>
>
> It's my first time, please be kind.
> I'm still trying to debug this error locally - at this stage I'm pretty 
> convinced that it's a weird deadlock/livelock problem due to the use of 
> {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is 
> possibly tangentially related to the issues discussed in SPARK-1560 around 
> the use of blocking calls within locks.
> h4. Observed Behavior
> When running a Spark job, and executors are lost, the job occasionally goes 
> into a state where it makes no progress with tasks. Most commonly it seems 
> that the issue occurs when executors are preempted by YARN, but I'm not 
> confident enough to state that it's restricted to just this scenario.
> Upon inspecting a thread dump from the driver, the following stack traces 
> seem noteworthy (a full thread dump is attached):
> {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447)
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423)
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {noformat}
> {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)}
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
> scala.Option.foreach(Option.scala:257)
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
> org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 

[jira] [Created] (SPARK-16702) Driver hangs after executors are lost

2016-07-24 Thread Angus Gerry (JIRA)
Angus Gerry created SPARK-16702:
---

 Summary: Driver hangs after executors are lost
 Key: SPARK-16702
 URL: https://issues.apache.org/jira/browse/SPARK-16702
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2, 1.6.1
Reporter: Angus Gerry


It's my first time, please be kind.

I'm still trying to debug this error locally - at this stage I'm pretty 
convinced that it's a weird deadlock/livelock problem due to the use of 
{{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is 
possibly tangentially related to the issues discussed in SPARK-1560 around the 
use of blocking calls within locks.
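
To make the shape of the problem concrete, here is a toy, purely illustrative sketch (not Spark code; every name in it is made up) of the pattern I mean: a periodically scheduled task blocks on a reply while holding a lock, and the thread that would deliver that reply needs the same lock, so progress only happens when the blocking call times out (or never, if it has no timeout).
{code:python}
import threading
import time

state_lock = threading.Lock()   # stands in for the scheduler backend's lock
rpc_reply = threading.Event()   # stands in for the RPC reply the periodic task awaits

def periodic_allocation_tick():
    # Like a scheduleAtFixedRate task: take the lock, then block on a "remote" reply.
    with state_lock:
        print("allocation tick: holding lock, waiting for reply...")
        replied = rpc_reply.wait(timeout=5)  # blocking call made while the lock is held
        print("allocation tick: got reply" if replied else "allocation tick: timed out")

def event_loop_delivers_reply():
    # Like the dispatcher thread: it needs the same lock before it can deliver the reply.
    time.sleep(0.5)
    print("event loop: trying to take the lock to deliver the reply...")
    with state_lock:                         # blocked until the tick gives up (or forever)
        rpc_reply.set()
        print("event loop: reply delivered")

t1 = threading.Thread(target=periodic_allocation_tick)
t2 = threading.Thread(target=event_loop_delivers_reply)
t1.start(); t2.start()
t1.join(); t2.join()
{code}
The two threads in the dumps below ({{spark-dynamic-executor-allocation}} waiting inside {{askWithRetry}}, and {{dispatcher-event-loop-10}} blocked in {{disableExecutor}}) look like exactly this shape.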

h4. Observed Behavior
When running a Spark job, and executors are lost, the job occasionally goes 
into a state where it makes no progress with tasks. Most commonly it seems that 
the issue occurs when executors are preempted by YARN, but I'm not confident 
enough to state that it's restricted to just this scenario.

Upon inspecting a thread dump from the driver, the following stack traces seem 
noteworthy (a full thread dump is attached):
{noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)}
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
scala.concurrent.Await$.result(package.scala:190)
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447)
org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423)
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{noformat}

{noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)}
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:120)
scala.Option.foreach(Option.scala:257)
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint.onDisconnected(YarnSchedulerBackend.scala:120)
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:142)
org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{noformat}

{noformat:title=Thread 640: kill-executor-thread (BLOCKED)}
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(CoarseGrainedSchedulerBackend.scala:488)
org.apache.spark.SparkContext.killAndReplaceExecutor(SparkContext.scala:1499)