[jira] [Created] (SPARK-15966) Fix markdown for Spark Monitoring

2016-06-15 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-15966:


 Summary: Fix markdown for Spark Monitoring
 Key: SPARK-15966
 URL: https://issues.apache.org/jira/browse/SPARK-15966
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.0.0
Reporter: Dhruve Ashar
Priority: Trivial


The markdown for Spark monitoring needs to be fixed. 
http://spark.apache.org/docs/2.0.0-preview/monitoring.html







[jira] [Updated] (SPARK-15966) Fix markdown for Spark Monitoring

2016-06-16 Thread Dhruve Ashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-15966:
-
Description: 
The markdown for Spark monitoring needs to be fixed. 
http://spark.apache.org/docs/2.0.0-preview/monitoring.html 
The closing tag is missing for `spark.ui.view.acls.groups`, which is causing 
the markdown to render incorrectly.

  was:
The markdown for Spark monitoring needs to be fixed. 
http://spark.apache.org/docs/2.0.0-preview/monitoring.html



> Fix markdown for Spark Monitoring
> -
>
> Key: SPARK-15966
> URL: https://issues.apache.org/jira/browse/SPARK-15966
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: Dhruve Ashar
>Priority: Trivial
>
> The markdown for Spark monitoring needs to be fixed. 
> http://spark.apache.org/docs/2.0.0-preview/monitoring.html 
> The closing tag is missing for `spark.ui.view.acls.groups`, which is causing 
> the markdown to render incorrectly.






[jira] [Created] (SPARK-16018) Shade netty for shuffle to work on YARN

2016-06-17 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-16018:


 Summary: Shade netty for shuffle to work on YARN
 Key: SPARK-16018
 URL: https://issues.apache.org/jira/browse/SPARK-16018
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.0.0
Reporter: Dhruve Ashar
Priority: Blocker
 Fix For: 2.0.0


We replaced LazyFileRegion with netty's DefaultFileRegion. 

Because of this (https://issues.apache.org/jira/browse/SPARK-15178), we have a 
dependency on netty 4.0.25.Final or later, but hadoop is on 4.0.23.Final, so the 
shuffle jar won't load in the NodeManager, as it tries to load a newer version 
of the class. 

We need to shade netty's classes in order to load the latest version we need and 
not depend on the one loaded by hadoop.
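As a rough sketch of the relocation (shown with sbt-assembly shade rules for 
brevity; Spark's actual build does the equivalent with the maven-shade-plugin, 
and the shaded package name here is only illustrative):

{code:scala}
// build.sbt sketch, assuming the sbt-assembly plugin is available.
// Relocate netty 4.x classes into a private namespace so the yarn shuffle
// jar carries its own copy instead of resolving against the older netty
// already on the NodeManager classpath.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("io.netty.**" -> "org.spark_project.io.netty.@1").inAll
)
{code}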







[jira] [Created] (SPARK-14572) Update Config Doc to specify -Xms in extraJavaOptions

2016-04-12 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-14572:


 Summary: Update Config Doc to specify -Xms in extraJavaOptions
 Key: SPARK-14572
 URL: https://issues.apache.org/jira/browse/SPARK-14572
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.6.0
Reporter: Dhruve Ashar
Priority: Minor
 Fix For: 2.0.0


[SPARK-12384|https://issues.apache.org/jira/browse/SPARK-12384] allows the user 
to specify the initial heap memory (-Xms) through *.extraJavaOptions. 

We have to update the entries in the configuration docs to mention this 
behavior, compared to the previous one where passing heap memory settings 
through *.extraJavaOptions was not allowed.
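For example, a minimal sketch of what the updated docs would describe (the 
config keys are real Spark settings; the values are illustrative):

{code:scala}
import org.apache.spark.SparkConf

// With SPARK-12384, -Xms can now be passed via extraJavaOptions, so the
// initial heap can be set lower than the max heap. -Xmx is still derived
// from spark.executor.memory and must not be passed here.
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.extraJavaOptions", "-Xms1g")
{code}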






[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-07-25 Thread Dhruve Ashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-16441:
-
External issue URL:   (was: 
https://issues.apache.org/jira/browse/SPARK-15703)

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>

[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-07-25 Thread Dhruve Ashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-16441:
-
External issue URL: https://issues.apache.org/jira/browse/SPARK-15703

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at 

[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-07-25 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392049#comment-15392049
 ] 

Dhruve Ashar commented on SPARK-16441:
--

How large is the data being sent to the driver? And is this consistently 
reproducible?

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> 

[jira] [Updated] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-07-29 Thread Dhruve Ashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-16441:
-
Affects Version/s: 2.0.0

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
>   at 

[jira] [Updated] (SPARK-15703) Spark UI doesn't show all tasks as completed when it should

2016-07-19 Thread Dhruve Ashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-15703:
-
Attachment: SparkListenerBus .png
spark-dynamic-executor-allocation.png

> Spark UI doesn't show all tasks as completed when it should
> ---
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 100000 tasks but the Detail stage page says it completed 
> 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs page lists that only 89519/100000 tasks finished but 
> it's completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (100000/100000)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Updated] (SPARK-15703) Spark UI doesn't show all tasks as completed when it should

2016-07-19 Thread Dhruve Ashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-15703:
-
Component/s: Scheduler

> Spark UI doesn't show all tasks as completed when it should
> ---
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 100000 tasks but the Detail stage page says it completed 
> 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs page lists that only 89519/100000 tasks finished but 
> it's completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (100000/100000)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Commented] (SPARK-15703) Spark UI doesn't show all tasks as completed when it should

2016-07-19 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384331#comment-15384331
 ] 

Dhruve Ashar commented on SPARK-15703:
--

Here are some of the findings: 

LiveListenerBus replaces the AsynchronousListenerBus. With dynamic allocation 
enabled and maximum executors set to ~2000, I am consistently seeing excessive 
messages being dropped for an input data size of 300GB. These events are being 
dropped (the UI gets messed up here) because the event queue is not being 
drained fast enough. 

From the thread dumps, the event queue dispatcher freezes up momentarily, 
during which the queue gets full in a short span and messages are dropped; 
once it's active again, the queue clears up fast. The race condition happens in 
ExecutorAllocationManager because of the synchronization, and the dispatcher 
thread waits for the locks to be released. See the attached dumps.

The remedy for this is twofold:
1 - Decouple the event dispatch and the handling of dynamic executor allocation. 
2 - Make the listener event queue size configurable (see the sketch below). For 
users who want to run with smaller heartbeat intervals, the no. of events 
floating around would be large and it would be helpful to have the flexibility 
to tune this.
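A minimal sketch of the dropping behavior in question (a simplified model, not 
the actual LiveListenerBus code):

{code:scala}
import java.util.concurrent.LinkedBlockingQueue

// Bounded event queue with a single (possibly blocked) dispatcher thread.
// The capacity is the value proposed above to be made configurable.
val queueCapacity = 10000
val queue = new LinkedBlockingQueue[String](queueCapacity)

def post(event: String): Unit = {
  // offer() is non-blocking: once the dispatcher stalls and the queue
  // fills up, subsequent events are simply dropped.
  if (!queue.offer(event)) println(s"Dropped event: $event")
}
{code}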






> Spark UI doesn't show all tasks as completed when it should
> ---
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Priority: Critical
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, SparkListenerBus .png, 
> spark-dynamic-executor-allocation.png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 100000 tasks but the Detail stage page says it completed 
> 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs page lists that only 89519/100000 tasks finished but 
> it's completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (100000/100000)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-07-19 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384573#comment-15384573
 ] 

Dhruve Ashar commented on SPARK-16441:
--

Does the application hang consistently when dynamic allocation is enabled? Or 
do you hit it only infrequently?

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> 

[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-07-19 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15384732#comment-15384732
 ] 

Dhruve Ashar commented on SPARK-16441:
--

[~cenyuhai] Have you started looking at this one? If not, I can take a look at it.

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> 

[jira] [Commented] (SPARK-17417) Fix # of partitions for RDD while checkpointing - Currently limited by 10000(%05d)

2016-09-06 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468622#comment-15468622
 ] 

Dhruve Ashar commented on SPARK-17417:
--

Thanks for the suggestion. I'll work on the changes and submit a PR.

> Fix # of partitions for RDD while checkpointing - Currently limited by 
> 10000(%05d)
> --
>
> Key: SPARK-17417
> URL: https://issues.apache.org/jira/browse/SPARK-17417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Dhruve Ashar
>
> Spark currently assumes the # of partitions to be less than 100000 and uses 
> %05d padding. 
> If we exceed this no., the sort logic in ReliableCheckpointRDD gets messed up 
> and fails. This is because the part- files are sorted and compared as strings. 
> This leads the filename order to be part-10000, part-100000, ... instead of 
> part-10000, part-10001, ..., part-100000, and while reconstructing the 
> checkpointed RDD the job fails. 
> Possible solutions: 
> - Bump the padding to allow more partitions, or
> - Sort the part files extracting a sub-portion as a string and then verify the 
> RDD






[jira] [Created] (SPARK-17417) Fix # of partitions for RDD while checkpointing - Currently limited by 10000(%05d)

2016-09-06 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-17417:


 Summary: Fix # of partitions for RDD while checkpointing - 
Currently limited by 10000(%05d)
 Key: SPARK-17417
 URL: https://issues.apache.org/jira/browse/SPARK-17417
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Dhruve Ashar


Spark currently assumes the # of partitions to be less than 100000 and uses 
%05d padding. 

If we exceed this no., the sort logic in ReliableCheckpointRDD gets messed up 
and fails. This is because the part- files are sorted and compared as strings. 

This leads the filename order to be part-10000, part-100000, ... instead of 
part-10000, part-10001, ..., part-100000, and while reconstructing the 
checkpointed RDD the job fails (illustrated below). 

Possible solutions: 
- Bump the padding to allow more partitions, or
- Sort the part files extracting a sub-portion as a string and then verify the RDD
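A self-contained illustration of the sorting problem (plain Scala, not the 
ReliableCheckpointRDD code itself):

{code:scala}
// With %05d padding, part numbers >= 100000 exceed the padding width and
// lexicographic sorting no longer matches numeric order.
val names = Seq(100000, 10000, 10001).map(i => f"part-$i%05d")
println(names.sorted)
// List(part-10000, part-100000, part-10001)  <- part-100000 sorts too early
{code}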






[jira] [Created] (SPARK-17365) Kill multiple executors together to reduce lock contention

2016-09-01 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-17365:


 Summary: Kill multiple executors together to reduce lock contention
 Key: SPARK-17365
 URL: https://issues.apache.org/jira/browse/SPARK-17365
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Dhruve Ashar


To regulate pending and running executors, we determine the executors which are 
eligible to be killed and kill them iteratively, one at a time, rather than 
together in a single batch. Each kill does an RPC call and is synchronized, 
leading to lock contention on the SparkListenerBus. 

Side effect - the listener bus is blocked while we iteratively remove executors.
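A rough sketch of the proposed change (a simplified stand-in for the scheduler 
backend RPC, not Spark's actual code):

{code:scala}
object KillSketch {
  // Stand-in for the synchronized driver RPC that kills executors.
  def killRpc(ids: Seq[String]): Unit =
    println(s"one RPC killing ${ids.size} executor(s)")

  // Current behavior: one synchronized RPC per executor.
  def killIteratively(eligible: Seq[String]): Unit =
    eligible.foreach(id => killRpc(Seq(id)))

  // Proposed: a single synchronized RPC for the whole batch.
  def killTogether(eligible: Seq[String]): Unit =
    killRpc(eligible)
}
{code}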

 






[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-09-13 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488528#comment-15488528
 ] 

Dhruve Ashar commented on SPARK-16441:
--

So all of these are related in one way or another. I will share my 
observations made so far: 

I have taken multiple thread dumps of these and analyzed what is causing the 
SparkListenerBus to block. The ExecutorAllocationManager is heavily 
synchronized and it's a hotspot for contention, especially when schedule is 
being invoked every 100ms to balance the executors. So for a Spark job running 
thousands of executors, any change in the status of these executors with 
dynamic allocation enabled leads to frequent firing of events. This causes 
contention and leads to blocking, specifically when the calls are remote RPCs.

Also, the current design of the listener bus is such that it waits for every 
listener to process an event before it can proceed to deliver the next event 
from the queue. Any wait to acquire the locks leads to the event queue filling 
up fast and to events being dropped. 

Logging individual execution times of these does not necessarily show 
updateAndSync consuming the majority of the time, and hence we are working on 
reducing the lock contention and minimizing the RPC calls. Another parameter 
to look at would be the heartbeat interval. Having a small interval for a very 
large no. of executors will aggravate the problem, as you would be getting 
frequent ExecutorMetricsUpdate events from the running executors. 
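A stripped-down sketch of the contention pattern described above (not Spark's 
actual ExecutorAllocationManager code):

{code:scala}
class AllocationManagerSketch {
  // schedule() runs every 100ms and holds the manager lock across a
  // potentially slow remote RPC.
  def schedule(): Unit = synchronized {
    slowRemoteRpc()
  }

  // The listener-bus thread needs the same lock for every task event,
  // so it blocks whenever schedule() is in flight.
  def onTaskEnd(): Unit = synchronized { /* update bookkeeping */ }

  private def slowRemoteRpc(): Unit = Thread.sleep(100)
}
{code}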



> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> 

[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-09-13 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487224#comment-15487224
 ] 

Dhruve Ashar commented on SPARK-16441:
--

There was a patch recently contributed which reduces the deadlocking. 
You can check the details here => 
https://github.com/apache/spark/commit/a0aac4b775bc8c275f96ad0fbf85c9d8a3690588

Also, I am working on a patch which improves this further and should be able to 
complete it soon. It's a WIP and I have a PR up for it 
[https://github.com/apache/spark/pull/14926]. 

Let us know if your issue is resolved/minimized with the patch.  



> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listeners are blocked by dynamic allocation. Executors can not connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> 

[jira] [Commented] (SPARK-17417) Fix # of partitions for RDD while checkpointing - Currently limited by 10000(%05d)

2016-10-04 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15546792#comment-15546792
 ] 

Dhruve Ashar commented on SPARK-17417:
--

[~srowen] AFAIU the checkpointing mechanism in Spark core, the recovery of an 
RDD from a checkpoint is limited to an application attempt. Spark Streaming 
mentions that it can recover metadata/RDDs from checkpointed data across 
application attempts. Please correct me if I have missed something here. With 
this understanding, it wouldn't be necessary to parse the code for the old 
format, as the recovery would be done using the same Spark jar which was used 
to launch it. 

Also, why is it that we are not cleaning up the checkpointed directory on 
sc.close?

> Fix # of partitions for RDD while checkpointing - Currently limited by 
> 10000(%05d)
> --
>
> Key: SPARK-17417
> URL: https://issues.apache.org/jira/browse/SPARK-17417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Dhruve Ashar
>
> Spark currently assumes the # of partitions to be less than 100000 and uses 
> %05d padding. 
> If we exceed this no., the sort logic in ReliableCheckpointRDD gets messed up 
> and fails. This is because the part- files are sorted and compared as strings. 
> This leads the filename order to be part-10000, part-100000, ... instead of 
> part-10000, part-10001, ..., part-100000, and while reconstructing the 
> checkpointed RDD the job fails. 
> Possible solutions: 
> - Bump the padding to allow more partitions, or
> - Sort the part files extracting a sub-portion as a string and then verify the 
> RDD






[jira] [Updated] (SPARK-17417) Fix sorting of part files while reconstructing RDD/partition from checkpointed files.

2016-10-07 Thread Dhruve Ashar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-17417:
-
Summary: Fix sorting of part files while reconstructing RDD/partition from 
checkpointed files.  (was: Fix # of partitions for RDD while checkpointing - 
Currently limited by 10000(%05d))

> Fix sorting of part files while reconstructing RDD/partition from 
> checkpointed files.
> -
>
> Key: SPARK-17417
> URL: https://issues.apache.org/jira/browse/SPARK-17417
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Dhruve Ashar
>
> Spark currently assumes the # of partitions to be less than 100000 and uses 
> %05d padding. 
> If we exceed this no., the sort logic in ReliableCheckpointRDD gets messed up 
> and fails. This is because the part- files are sorted and compared as strings. 
> This leads the filename order to be part-10000, part-100000, ... instead of 
> part-10000, part-10001, ..., part-100000, and while reconstructing the 
> checkpointed RDD the job fails. 
> Possible solutions: 
> - Bump the padding to allow more partitions, or
> - Sort the part files extracting a sub-portion as a string and then verify the 
> RDD






[jira] [Commented] (SPARK-17649) Log how many Spark events got dropped in LiveListenerBus

2016-09-28 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531080#comment-15531080
 ] 

Dhruve Ashar commented on SPARK-17649:
--

I get the idea. While you certainly cannot filter out a specific type of 
message, you could increase the heartbeat interval to reduce the executor 
metrics update events to an acceptable degree. This would be significant while 
running with a large no. of executors. The stats would show the effect of it 
on the event bus. 

> Log how many Spark events got dropped in LiveListenerBus
> 
>
> Key: SPARK-17649
> URL: https://issues.apache.org/jira/browse/SPARK-17649
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.6.3, 2.0.2, 2.1.0
>
>
> Log how many Spark events got dropped in LiveListenerBus so that the user can 
> get insights on how to set a correct event queue size.






[jira] [Commented] (SPARK-17649) Log how many Spark events got dropped in LiveListenerBus

2016-09-27 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526426#comment-15526426
 ] 

Dhruve Ashar commented on SPARK-17649:
--

[~zsxwing] Since we are logging the dropped event count, it would be 
beneficial to provide some stats on how many events of a specific type were 
dropped. 

From what I have seen, the majority of the dropped events are executor metrics 
updates. It would make the log message a bit longer, but the information will 
be helpful. We could add it at debug level as well. Let me know your thoughts.  
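A small sketch of the kind of per-type accounting being suggested (simplified; 
not the actual patch):

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

val droppedByType = new ConcurrentHashMap[String, AtomicLong]()

// Called whenever an event is dropped; the per-class counts can then be
// included in (or logged alongside) the existing dropped-event message.
def recordDrop(event: AnyRef): Unit =
  droppedByType
    .computeIfAbsent(event.getClass.getSimpleName, _ => new AtomicLong())
    .incrementAndGet()
{code}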

> Log how many Spark events got dropped in LiveListenerBus
> 
>
> Key: SPARK-17649
> URL: https://issues.apache.org/jira/browse/SPARK-17649
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.6.3, 2.0.2, 2.1.0
>
>
> Log how many Spark events got dropped in LiveListenerBus so that the user can 
> get insights on how to set a correct event queue size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full

2017-07-20 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16094689#comment-16094689
 ] 

Dhruve Ashar commented on SPARK-21460:
--

[~Tagar] Can you attach the driver logs to help with the investigation? I am 
not able to reproduce this issue on my end and would like to check more on 
this. Also, how frequently are you hitting this?

> Spark dynamic allocation breaks when ListenerBus event queue runs full
> --
>
> Key: SPARK-21460
> URL: https://issues.apache.org/jira/browse/SPARK-21460
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, YARN
>Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
> Environment: Spark 2.1 
> Hadoop 2.6
>Reporter: Ruslan Dautkhanov
>Priority: Critical
>  Labels: dynamic_allocation, performance, scheduler, yarn
>
> When the ListenerBus event queue runs full, Spark dynamic allocation stops 
> working - Spark fails to shrink the number of executors when there are no 
> active jobs (the Spark driver "thinks" there are active jobs since it didn't 
> capture when they finished).
> P.S. What's worse, it also makes Spark flood the YARN RM with reservation 
> requests, so YARN preemption doesn't function properly either (we're on 
> Spark 2.1 / Hadoop 2.6). 
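
A sketch of the common mitigation for this symptom - assuming the Spark 
2.0-2.2 knob name; later releases rename it to 
spark.scheduler.listenerbus.eventqueue.capacity - is to enlarge the queue so 
bursts of events are buffered instead of dropped:

{code:java}
import org.apache.spark.SparkConf

// larger listener bus queue (default 10000); costs driver memory, but fewer
// dropped events means dynamic allocation keeps seeing job-end events
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")
{code}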



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21500) Update Shuffle Fetch bookkeeping immediately after receiving remote block

2017-07-21 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-21500:


 Summary: Update Shuffle Fetch bookkeeping immediately after 
receiving remote block
 Key: SPARK-21500
 URL: https://issues.apache.org/jira/browse/SPARK-21500
 Project: Spark
  Issue Type: Task
  Components: Shuffle, Spark Core
Affects Versions: 2.2.0
Reporter: Dhruve Ashar
Priority: Minor


Currently we update the remote shuffle fetch related parameters only when we 
start processing the FetchResult, rather than updating them immediately after 
receiving the block. Updating them right away would speed up sending requests 
for remote shuffle fetches.
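
Not Spark's internal code - a toy sketch with hypothetical names, just to 
illustrate the proposed change:

{code:java}
// decrement the in-flight counters when a block arrives, not when it is
// finally processed, so the next fetch request can go out sooner
case class FetchResult(blockId: String, size: Long)

object FetchBookkeeping {
  var bytesInFlight: Long = 0L

  def onBlockReceived(result: FetchResult): Unit = {
    bytesInFlight -= result.size   // update immediately on receipt...
    sendMoreRequestsIfUnderLimit() // ...instead of waiting for processing
  }

  def sendMoreRequestsIfUnderLimit(): Unit = {
    // issue further fetch requests while bytesInFlight stays under the cap
  }
}
{code}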



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21243) Limit the number of maps in a single shuffle fetch

2017-06-28 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-21243:


 Summary: Limit the number of maps in a single shuffle fetch
 Key: SPARK-21243
 URL: https://issues.apache.org/jira/browse/SPARK-21243
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1, 2.1.0
Reporter: Dhruve Ashar
Priority: Minor


Right now Spark can limit the # of parallel fetches and also limits the amount 
of data in one fetch, but one fetch to a host could be for 100's of blocks. In 
one instance we saw 450+ blocks. When you have 100's of those and 1000's of 
reducers fetching, that becomes a lot of metadata and can run the Node Manager 
out of memory. We should add a config to limit the # of maps per fetch to 
reduce the load on the NM.
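
To the best of my knowledge this proposal landed as 
spark.reducer.maxBlocksInFlightPerAddress (in 2.2.1/2.3.0); a usage sketch:

{code:java}
import org.apache.spark.SparkConf

// cap how many map output blocks a reducer requests from one external
// shuffle service address in a single fetch (default is effectively unlimited)
val conf = new SparkConf()
  .set("spark.reducer.maxBlocksInFlightPerAddress", "256")
{code}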




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-08-04 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114628#comment-16114628
 ] 

Dhruve Ashar commented on SPARK-20589:
--

I have a patch for this and would like to have some feedback/thoughts on this 
approach.

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your spark job might be accessing another 
> service and you don't want to DOS that service.  For instance Spark writing 
> to hbase or Spark doing http puts on a service.  Many times you want to do 
> this without limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20589) Allow limiting task concurrency per stage

2017-08-04 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114626#comment-16114626
 ] 

Dhruve Ashar commented on SPARK-20589:
--

Spark defines stages based on the shuffle dependencies, and the application 
developer does not have a direct way to control this behavior. Moreover, it 
would require the application developer to know exactly how the stage 
boundaries are defined and to set the max concurrency on the desired 
transformation. This would need a change in the API to support this at a 
transformation level and to resolve the max concurrent tasks across tasks in a 
given stage. 

Since the scope for this change is much broader and involved, in the meantime 
I would like to propose an approach where we limit the no. of concurrent tasks 
on a per-job basis rather than a per-stage basis. This way the developer has 
control over limiting only a single spark job in the application. This 
approach can be implemented in two steps:
1 - Specify the concurrency metric for the job.
2 - Tag the job to be limited in a specific job group.

So the application code looks something like this:

{code:java}
// limit concurrency for all the job/s under jobGroupId => myjob
conf.set("spark.job.myjob.maxConcurrentTasks", "10")
sc.parallelize(1 to Int.MaxValue, 1).map(x => x + 1).map(x => x - 1).map(x => x * 1).count()

// tag the job to be limited under the respective jobGroupId
sc.setJobGroup("myjob", "", false)
sc.parallelize(1 to Int.MaxValue, 1).map(x => x + 1).map(x => x - 1).map(x => x * 1).count()

// clear the tag or set it to a different value
sc.clearJobGroup()
sc.parallelize(1 to Int.MaxValue, 1).map(x => x + 1).map(x => x - 1).map(x => x * 1).count()
{code}

> Allow limiting task concurrency per stage
> -
>
> Key: SPARK-20589
> URL: https://issues.apache.org/jira/browse/SPARK-20589
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> It would be nice to have the ability to limit the number of concurrent tasks 
> per stage.  This is useful when your spark job might be accessing another 
> service and you don't want to DOS that service.  For instance Spark writing 
> to hbase or Spark doing http puts on a service.  Many times you want to do 
> this without limiting the number of partitions. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21181) Suppress memory leak errors reported by netty

2017-06-22 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-21181:


 Summary: Suppress memory leak errors reported by netty
 Key: SPARK-21181
 URL: https://issues.apache.org/jira/browse/SPARK-21181
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.0
Reporter: Dhruve Ashar
Priority: Minor


We are seeing netty report memory leak errors like the one below after 
switching to 2.1. 

{code}
ERROR ResourceLeakDetector: LEAK: ByteBuf.release() was not called before it's garbage-collected. Enable advanced leak reporting to find out where the leak occurred. To enable advanced leak reporting, specify the JVM option '-Dio.netty.leakDetection.level=advanced' or call ResourceLeakDetector.setLevel() See http://netty.io/wiki/reference-counted-objects.html for more information.
{code}

Looking a bit deeper, Spark is not leaking any memory here, but it is confusing 
for the user to see the error message in the driver logs. 

After enabling '-Dio.netty.leakDetection.level=advanced', netty reveals the 
SparkSaslServer to be the source of these leaks.

Sample trace: https://gist.github.com/dhruve/b299ebc35aa0a185c244a0468927daf1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24610) wholeTextFiles broken for small files

2018-06-20 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-24610:


 Summary: wholeTextFiles broken for small files
 Key: SPARK-24610
 URL: https://issues.apache.org/jira/browse/SPARK-24610
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.3.1, 2.2.1
Reporter: Dhruve Ashar


Spark is unable to read small files using the wholeTextFiles method when split 
size related configs are specified - either explicitly or if they are contained 
in other config files like hive-site.xml.

For small files, the maxSplitSize computed by `WholeTextFileInputFormat` is way 
smaller than the default or commonly used split size of 64/128M, and Spark 
throws an exception while trying to read them.

 

To reproduce the issue: 
{code:java}
$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client --conf "spark.hadoop.mapreduce.input.fileinputformat.split.minsize.per.node=123456"

scala> sc.wholeTextFiles("file:///etc/passwd").count
java.io.IOException: Minimum split size pernode 123456 cannot be larger than 
maximum split size 9962
at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
at 
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
... 48 elided


// For hdfs
sc.wholeTextFiles("smallFile").count
java.io.IOException: Minimum split size pernode 123456 cannot be larger than 
maximum split size 15
at 
org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:200)
at 
org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:50)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2096)
at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
... 48 elided
{code}
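
A sketch of a possible workaround until this is fixed, assuming the offending 
keys leak in from hive-site.xml: reset them on the job's Hadoop configuration 
before reading.

{code:java}
// 0 is the Hadoop default for these keys, meaning "no constraint", so the
// computed maxSplitSize can no longer violate the per-node/per-rack minimums
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.node", 0L)
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize.per.rack", 0L)
sc.wholeTextFiles("file:///etc/passwd").count
{code}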



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25468) Highlight current page index in the history server

2018-09-19 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-25468:


 Summary: Highlight current page index in the history server
 Key: SPARK-25468
 URL: https://issues.apache.org/jira/browse/SPARK-25468
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 2.3.1
Reporter: Dhruve Ashar


Spark History Server Web UI should highlight the current page index selected 
for better navigation. Without it being highlighted it is difficult to identify 
the current page you are looking at. 

 

For example: Page 1 should be highlighted here.

!image-2018-09-19-11-29-46-677.png!

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25468) Highlight current page index in the history server

2018-09-19 Thread Dhruve Ashar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-25468:
-
Description: 
Spark History Server Web UI should highlight the current page index selected 
for better navigation. Without it being highlighted it is difficult to identify 
the current page you are looking at. 

 

For example: Page 1 should be highlighted here.

 

  was:
Spark History Server Web UI should highlight the current page index selected 
for better navigation. Without it being highlighted it is difficult to identify 
the current page you are looking at. 

 

For example: Page 1 should be highlighted here.

!image-2018-09-19-11-29-46-677.png!

 


> Highlight current page index in the history server
> --
>
> Key: SPARK-25468
> URL: https://issues.apache.org/jira/browse/SPARK-25468
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Dhruve Ashar
>Priority: Trivial
>
> Spark History Server Web UI should highlight the current page index selected 
> for better navigation. Without it being highlighted it is difficult to 
> identify the current page you are looking at. 
>  
> For example: Page 1 should be highlighted here.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25468) Highlight current page index in the history server

2018-09-19 Thread Dhruve Ashar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-25468:
-
Description: 
Spark History Server Web UI should highlight the current page index selected 
for better navigation. Without it being highlighted it is difficult to identify 
the current page you are looking at. 

 

For example: Page 1 should be highlighted as shown in SparkHistoryServer.png 

  was:
Spark History Server Web UI should highlight the current page index selected 
for better navigation. Without it being highlighted it is difficult to identify 
the current page you are looking at. 

 

For example: Page 1 should be highlighted here.

 


> Highlight current page index in the history server
> --
>
> Key: SPARK-25468
> URL: https://issues.apache.org/jira/browse/SPARK-25468
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Dhruve Ashar
>Priority: Trivial
> Attachments: SparkHistoryServer.png
>
>
> Spark History Server Web UI should highlight the current page index selected 
> for better navigation. Without it being highlighted it is difficult to 
> identify the current page you are looking at. 
>  
> For example: Page 1 should be highlighted as shown in SparkHistoryServer.png 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25468) Highlight current page index in the history server

2018-09-19 Thread Dhruve Ashar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-25468:
-
Attachment: SparkHistoryServer.png

> Highlight current page index in the history server
> --
>
> Key: SPARK-25468
> URL: https://issues.apache.org/jira/browse/SPARK-25468
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.1
>Reporter: Dhruve Ashar
>Priority: Trivial
> Attachments: SparkHistoryServer.png
>
>
> Spark History Server Web UI should highlight the current page index selected 
> for better navigation. Without it being highlighted it is difficult to 
> identify the current page you are looking at. 
>  
> For example: Page 1 should be highlighted here.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-08 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-27107:


 Summary: Spark SQL Job failing because of Kryo buffer overflow 
with ORC
 Key: SPARK-27107
 URL: https://issues.apache.org/jira/browse/SPARK-27107
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2
Reporter: Dhruve Ashar


The issue occurs while trying to read ORC data and setting the SearchArgument.
{code:java}
 Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
Available: 0, required: 9
Serialization trace:
literalList 
(org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
at 
org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
at 
org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
at 
org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
at 
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)

[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-08 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16788021#comment-16788021
 ] 

Dhruve Ashar commented on SPARK-27107:
--

[~dongjoon] can you review the PR for ORC to fix this issue? 
[https://github.com/apache/orc/pull/372]

Once this is merged, we can fix the issue in Spark as well. Until then, the 
only workaround is to use the hive-based implementation, as sketched below.
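
A minimal sketch, assuming the spark.sql.orc.impl switch available in 2.3+:

{code:java}
// fall back to the hive-based ORC reader until the configurable Kryo buffer
// is available ("native" is the default implementation in 2.4)
spark.conf.set("spark.sql.orc.impl", "hive")
{code}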

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> 

[jira] [Commented] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-20 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797253#comment-16797253
 ] 

Dhruve Ashar commented on SPARK-27112:
--

[~irashid] - I think this is a critical bug, and since it is resolved we 
should include it in rc8.

> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.3.4, 2.4.2, 3.0.0
>
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few spark users in the organization have reported that their jobs 
> were getting stuck. On further analysis, it was found that there exist two 
> independent deadlocks, each of which occurs under different circumstances. 
> The screenshots for these two deadlocks are attached here. 
> We were able to reproduce the deadlocks with the following piece of code:
>  
> {code:java}
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileSystem, Path}
> import org.apache.spark._
> import org.apache.spark.TaskContext
> // Simple example of Word Count in Scala
> object ScalaWordCount {
>   def main(args: Array[String]) {
>     if (args.length < 2) {
>       System.err.println("Usage: ScalaWordCount <inputFilesUri> <outputFilesUri>")
>       System.exit(1)
>     }
>     val conf = new SparkConf().setAppName("Scala Word Count")
>     val sc = new SparkContext(conf)
>     // get the input file uri
>     val inputFilesUri = args(0)
>     // get the output file uri
>     val outputFilesUri = args(1)
>     while (true) {
>       val textFile = sc.textFile(inputFilesUri)
>       val counts = textFile.flatMap(line => line.split(" "))
>         .map(word => {if (TaskContext.get.partitionId == 5 && TaskContext.get.attemptNumber == 0) throw new Exception("Fail for blacklisting") else (word, 1)})
>         .reduceByKey(_ + _)
>       counts.saveAsTextFile(outputFilesUri)
>       val conf: Configuration = new Configuration()
>       val path: Path = new Path(outputFilesUri)
>       val hdfs: FileSystem = FileSystem.get(conf)
>       hdfs.delete(path, true)
>     }
>     sc.stop()
>   }
> }
> {code}
>  
> Additionally, to ensure that the deadlock surfaces soon enough, I also 
> added a small delay in the Spark code here:
> [https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]
>  
> {code:java}
> executorIdToFailureList.remove(exec)
> updateNextExpiryTime()
> Thread.sleep(2000)
> killBlacklistedExecutor(exec)
> {code}
>  
> Also make sure that the following configs are set when launching the above 
> spark job:
> *spark.blacklist.enabled=true*
> *spark.blacklist.killBlacklistedExecutors=true*
> *spark.blacklist.application.maxFailedTasksPerExecutor=1*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-11 Thread Dhruve Ashar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar updated SPARK-27107:
-
Description: 
The issue occurs while trying to read ORC data and setting the SearchArgument.
{code:java}
 Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
Available: 0, required: 9
Serialization trace:
literalList 
(org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
at 
com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
at 
com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
at 
com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
at 
org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
at 
org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
at 
org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
at 
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
at 
org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
at 

[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-11 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16789668#comment-16789668
 ] 

Dhruve Ashar commented on SPARK-27107:
--

{quote}Could you provide a reproducible test case here?
{quote}
This is something we can consistently reproduce every single time. I do not 
have an example with company details stripped off that I can share for this use 
case at this point. Is there anything specific that you are looking for? I can 
try to come up with a reproducible case, but it seems to be difficult to 
reproduce, as this is dependent on the user query and the parameters that are 
being passed to filter the data.
{quote}BTW, the Hive default is 4K instead of 4M, isn't it?
{quote}
The Hive default is 4K and not 4M (that was a typo). Thanks for correcting that.
{quote}Technically, Hive implementation also fails when it exceeds the 
limitation because it's a non-configurable parameter issue.
{quote}
Yes. The hive implementation should fail when it exceeds the 10M limit for a 
SArg, and the PR that I have against the ORC implementation tries to make this 
configurable so that spark can control the buffer size if we hit a buffer 
overflow error.

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> 

[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790610#comment-16790610
 ] 

Dhruve Ashar commented on SPARK-27107:
--

Update: The PR was merged in the ORC repository. My understanding is that we 
should update our pom once a new ORC release is cut.

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  

[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790911#comment-16790911
 ] 

Dhruve Ashar commented on SPARK-27107:
--

I verified the changes and we are no longer seeing the issue. Thanks for 
testing and voting on the ORC RC. I think I am not on the ORC mailing list, so 
I might have missed the voting. 

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> 

[jira] [Created] (SPARK-26801) Spark unable to read valid avro types

2019-01-31 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-26801:


 Summary: Spark unable to read valid avro types
 Key: SPARK-26801
 URL: https://issues.apache.org/jira/browse/SPARK-26801
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dhruve Ashar


Currently the external avro package reads avro schemas for the record type 
only. This is probably because of the representation of InternalRow in Spark 
SQL. As a result, if the avro file has anything other than a sequence of 
records, it fails to read it.

We faced this issue earlier while trying to read primitive types. We 
encountered it again while trying to read an array of records. Below are code 
examples trying to read valid avro data, showing the stack traces.
{code:java}
spark.read.format("avro").load("avroTypes/randomInt.avro").show
java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL 
StructType:

"int"

at org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 49 elided

==

scala> spark.read.format("avro").load("avroTypes/randomEnum.avro").show
java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL 
StructType:

{
"type" : "enum",
"name" : "Suit",
"symbols" : [ "SPADES", "HEARTS", "DIAMONDS", "CLUBS" ]
}

at org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 49 elided
{code}
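
A sketch of the usual workaround when you control the writer, using the 
standard Avro SchemaBuilder API: wrap the top-level primitive in a 
single-field record so Spark can map it onto a Row.

{code:java}
import org.apache.avro.SchemaBuilder

// wrap a bare "int" in a one-field record; the avro reader can map a record
// to a Row even though it rejects a top-level primitive schema
val wrapped = SchemaBuilder.record("Wrapper").fields()
  .requiredInt("value")
  .endRecord()
{code}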



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26801) Spark unable to read valid avro types

2019-02-04 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16760178#comment-16760178
 ] 

Dhruve Ashar commented on SPARK-26801:
--

I have given a short summary of the issue in the PR description. It would be 
great if you could review it, [~hyukjin.kwon].

Thanks.

> Spark unable to read valid avro types
> -
>
> Key: SPARK-26801
> URL: https://issues.apache.org/jira/browse/SPARK-26801
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> Currently the external avro package reads avro schemas for the record type 
> only. This is probably because of the representation of InternalRow in Spark 
> SQL. As a result, if the avro file has anything other than a sequence of 
> records, it fails to read it.
> We faced this issue earlier while trying to read primitive types. We 
> encountered it again while trying to read an array of records. Below are 
> code examples trying to read valid avro data, showing the stack traces.
> {code:java}
> spark.read.format("avro").load("avroTypes/randomInt.avro").show
> java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL 
> StructType:
> "int"
> at 
> org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> ... 49 elided
> ==
> scala> spark.read.format("avro").load("avroTypes/randomEnum.avro").show
> java.lang.RuntimeException: Avro schema cannot be converted to a Spark SQL 
> StructType:
> {
> "type" : "enum",
> "name" : "Suit",
> "symbols" : [ "SPADES", "HEARTS", "DIAMONDS", "CLUBS" ]
> }
> at 
> org.apache.spark.sql.avro.AvroFileFormat.inferSchema(AvroFileFormat.scala:95)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
> at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
> at scala.Option.orElse(Option.scala:289)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> ... 49 elided
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26827) Support importing python modules having shared objects(.so)

2019-02-05 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761072#comment-16761072
 ] 

Dhruve Ashar edited comment on SPARK-26827 at 2/5/19 6:08 PM:
--

Resolution: Pass the same archive with the py-files and archives options.


was (Author: dhruve ashar):
Pass the same archive with py-files and archives option.

> Support importing python modules having shared objects(.so)
> ---
>
> Key: SPARK-26827
> URL: https://issues.apache.org/jira/browse/SPARK-26827
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> If a user wants to import dynamic modules, specifically those having .so 
> files, Python currently disallows importing them from a zip file 
> ([https://docs.python.org/3/library/zipimport.html]), and currently Spark 
> doesn't support this either. 
> Files which are passed using the py-files option are placed on the PYTHONPATH 
> but are not extracted, while files which are passed as archives are extracted 
> but not placed on the PYTHONPATH. The dynamic modules can be loaded if they 
> are extracted and added to the PYTHONPATH.
>  
> Has anyone encountered this issue before and what is the best way to go about 
> it?
>  
> Some possible solutions:
> 1 - Get around this issue by passing the archive with both the py-files and 
> archives options; this extracts the archive as well as adds it to the path. 
> Gotcha - both have to be named the same. I have tested this and it works, but 
> it's just a workaround.
> 2 - We add a new config like py-archives which takes all the files, extracts 
> them, and also adds them to the PYTHONPATH. Or just examine the contents of 
> the zip file and, if it has dynamic modules, do the same. I am happy to work 
> on the fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26827) Support importing python modules having shared objects(.so)

2019-02-05 Thread Dhruve Ashar (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruve Ashar resolved SPARK-26827.
--
Resolution: Workaround

Pass the same archive with both the py-files and archives options.
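
A minimal sketch of that workaround, assuming the .so-bearing modules are 
bundled in an archive (mylibs.zip and my_job.py are placeholder names). Passing 
the same archive, under the same name, to both options makes Spark extract it 
(archives) while the matching path lands on the PYTHONPATH (py-files):

{code:bash}
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files mylibs.zip \
  --archives mylibs.zip \
  my_job.py
{code}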

> Support importing python modules having shared objects(.so)
> ---
>
> Key: SPARK-26827
> URL: https://issues.apache.org/jira/browse/SPARK-26827
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> If a user wants to import dynamic modules, specifically those having .so 
> files, Python currently disallows importing them from a zip file 
> ([https://docs.python.org/3/library/zipimport.html]), and currently Spark 
> doesn't support this either. 
> Files which are passed using the py-files option are placed on the PYTHONPATH 
> but are not extracted, while files which are passed as archives are extracted 
> but not placed on the PYTHONPATH. The dynamic modules can be loaded if they 
> are extracted and added to the PYTHONPATH.
>  
> Has anyone encountered this issue before and what is the best way to go about 
> it?
>  
> Some possible solutions:
> 1 - Get around this issue by passing the archive with both the py-files and 
> archives options; this extracts the archive as well as adds it to the path. 
> Gotcha - both have to be named the same. I have tested this and it works, but 
> it's just a workaround.
> 2 - We add a new config like py-archives which takes all the files, extracts 
> them, and also adds them to the PYTHONPATH. Or just examine the contents of 
> the zip file and, if it has dynamic modules, do the same. I am happy to work 
> on the fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26827) Support importing python modules having shared objects(.so)

2019-02-05 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761034#comment-16761034
 ] 

Dhruve Ashar commented on SPARK-26827:
--

[~holden.ka...@gmail.com], [~irashid] any thoughts on this one?

> Support importing python modules having shared objects(.so)
> ---
>
> Key: SPARK-26827
> URL: https://issues.apache.org/jira/browse/SPARK-26827
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> If a user wants to import dynamic modules, specifically those having .so 
> files, Python currently disallows importing them from a zip file 
> ([https://docs.python.org/3/library/zipimport.html]), and currently Spark 
> doesn't support this either. 
> Files which are passed using the py-files option are placed on the PYTHONPATH 
> but are not extracted, while files which are passed as archives are extracted 
> but not placed on the PYTHONPATH. The dynamic modules can be loaded if they 
> are extracted and added to the PYTHONPATH.
>  
> Has anyone encountered this issue before and what is the best way to go about 
> it?
>  
> Some possible solutions:
> 1 - Get around this issue by passing the archive with both the py-files and 
> archives options; this extracts the archive as well as adds it to the path. 
> Gotcha - both have to be named the same. I have tested this and it works, but 
> it's just a workaround.
> 2 - We add a new config like py-archives which takes all the files, extracts 
> them, and also adds them to the PYTHONPATH. Or just examine the contents of 
> the zip file and, if it has dynamic modules, do the same. I am happy to work 
> on the fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26827) Support importing python modules having shared objects(.so)

2019-02-05 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-26827:


 Summary: Support importing python modules having shared 
objects(.so)
 Key: SPARK-26827
 URL: https://issues.apache.org/jira/browse/SPARK-26827
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.4.0, 2.3.2
Reporter: Dhruve Ashar


If a user wants to import dynamic modules, specifically those having .so files, 
Python currently disallows importing them from a zip file 
([https://docs.python.org/3/library/zipimport.html]), and currently Spark 
doesn't support this either. 

Files which are passed using the py-files option are placed on the PYTHONPATH 
but are not extracted, while files which are passed as archives are extracted 
but not placed on the PYTHONPATH. The dynamic modules can be loaded if they are 
extracted and added to the PYTHONPATH.

 

Has anyone encountered this issue before and what is the best way to go about 
it?

 

Some possible solutions:

1 - Get around this issue by passing the archive with both the py-files and 
archives options; this extracts the archive as well as adds it to the path. 
Gotcha - both have to be named the same. I have tested this and it works, but 
it's just a workaround.

2 - We add a new config like py-archives which takes all the files, extracts 
them, and also adds them to the PYTHONPATH. Or just examine the contents of the 
zip file and, if it has dynamic modules, do the same. I am happy to work on the 
fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26827) Support importing python modules having shared objects(.so)

2019-02-05 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16761069#comment-16761069
 ] 

Dhruve Ashar commented on SPARK-26827:
--

Thanks for the response [~irashid] and [~hyukjin.kwon]. Will close this one out.

> Support importing python modules having shared objects(.so)
> ---
>
> Key: SPARK-26827
> URL: https://issues.apache.org/jira/browse/SPARK-26827
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> If a user wants to import dynamic modules, specifically those having .so 
> files, Python currently disallows importing them from a zip file 
> ([https://docs.python.org/3/library/zipimport.html]), and currently Spark 
> doesn't support this either. 
> Files which are passed using the py-files option are placed on the PYTHONPATH 
> but are not extracted, while files which are passed as archives are extracted 
> but not placed on the PYTHONPATH. The dynamic modules can be loaded if they 
> are extracted and added to the PYTHONPATH.
>  
> Has anyone encountered this issue before and what is the best way to go about 
> it?
>  
> Some possible solutions:
> 1 - Get around this issue by passing the archive with both the py-files and 
> archives options; this extracts the archive as well as adds it to the path. 
> Gotcha - both have to be named the same. I have tested this and it works, but 
> it's just a workaround.
> 2 - We add a new config like py-archives which takes all the files, extracts 
> them, and also adds them to the PYTHONPATH. Or just examine the contents of 
> the zip file and, if it has dynamic modules, do the same. I am happy to work 
> on the fix.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-24 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847929#comment-16847929
 ] 

Dhruve Ashar commented on SPARK-24149:
--

I also think that the HDFS client shouldn't be figuring out if HA is enabled 
and then adding both the NNs to the list.

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-24 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847923#comment-16847923
 ] 

Dhruve Ashar commented on SPARK-24149:
--

 

Spark should be agnostic about the namenodes for which it should get the 
tokens. It is either HDFS which figures this out or the user who should specify 
them explicitly.

You should get tokens for only those namenodes which you are going to access. 
If the namespaces are related, viewfs does this for you. If they are unrelated, 
the user explicitly provides them. Unrelated namespaces may or may not use HDFS 
federation, and there can also be more namespaces in the federation which a 
user job might not access at all.
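
As a sketch of the explicit route (the URIs below are placeholders), the 
existing config already lets the user list every extra filesystem to fetch 
tokens for:

{code:bash}
# Sketch only: list each extra namespace/namenode the job will actually access.
spark-submit \
  --master yarn \
  --conf spark.yarn.access.hadoopFileSystems=hdfs://nn2.example.com:8020,hdfs://ns2 \
  my_job.py
{code}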

 

 

 

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-24 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847854#comment-16847854
 ] 

Dhruve Ashar commented on SPARK-24149:
--

[~mgaido] [~vanzin],

The change this PR introduced tries to explicitly figure out the list of 
namenodes from the hadoop configs. I think we are duplicating the logic here, 
and this makes it confusing to understand, as figuring out the necessary 
namenodes should be transparent to the client.

 

Rationale:

- HDFS Federation is used to store the data from two different namespaces on 
the same data node (mostly used with unrelated namespaces).

- ViewFS on the other hand is used for better namespace management by having 
different namespaces on different namenodes. But in that case, you should 
always be using it with viewfs://, which takes care of getting the tokens for 
you; see the config sketch below. (Note: this may use HDFS federation or may 
not.)

In either case we should rely on hadoop to give us the requested namenodes.
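
For reference, a hedged sketch of what using viewfs:// means in config terms; 
the mount table name "cluster", the mount points and the ns1/ns2 nameservices 
are all placeholders:

{code:xml}
<!-- core-site.xml sketch: clients address everything through viewfs://cluster/
     and the mount table maps each path to its backing namespace, so token
     fetching covers every underlying namenode. -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://cluster/</value>
</property>
<property>
  <name>fs.viewfs.mounttable.cluster.link./data</name>
  <value>hdfs://ns1/data</value>
</property>
<property>
  <name>fs.viewfs.mounttable.cluster.link./logs</name>
  <value>hdfs://ns2/logs</value>
</property>
{code}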

In the use case where we want to access unrelated namespaces (often used in 
scenarios where different Hive tables are stored in different namespaces), we 
already have a config to pass in the other namenodes, and we really don't need 
this change.

 

There was a follow-up PR to fix an issue caused by this behavior, getting the 
FS only for the specified namenodes. IMHO both of these changes are unnecessary 
and we should revert to the original behavior.

 

Thoughts, comments?

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-28 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850096#comment-16850096
 ] 

Dhruve Ashar commented on SPARK-24149:
--

Thanks for the missing context.

The current behavior doesn't seem to address the use case mentioned. We could 
have a table which is partitioned across a different namespace which is on a 
different cluster (NN). In this case the user has to know about the underlying 
namespaces and their corresponding NNs to get the tokens.

The current logic seems to address only a specific use case where a table is 
stored across multiple namespaces (configured without viewfs) and in this case 
they luckily happen to be on the same cluster (using HDFS federation). What if 
these are on a different cluster?

I would expect that if data for a given table is to be stored across different 
namespaces, then these namespaces should be related and addressed using viewfs. 
This has the advantage of getting the tokens for all the NNs the data resides 
on, whether the partitions happen to reside on the same or a different cluster, 
and it is much better from a user-transparency standpoint as well, since it 
covers all the use cases.

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-30 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852365#comment-16852365
 ] 

Dhruve Ashar commented on SPARK-24149:
--

IMHO having a consistent way to reason about a behavior is preferable to 
specifying an additional config.

We encountered an issue while trying to get the filesystem using a logical 
nameservice (that is one more reason why this is broken); see 
[https://github.com/apache/spark/pull/21216/files#diff-f8659513cf91c15097428c3d8dfbcc35R213]

 

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27937) Revert changes introduced as a part of Automatic namespace discovery [SPARK-24149]

2019-06-07 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859001#comment-16859001
 ] 

Dhruve Ashar commented on SPARK-27937:
--

The exception that we started encountering occurs while Spark tries to create a 
Path for the logical nameservice (nameservice id) configured as a part of HDFS 
federation. 

 
{code:java}
19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 
19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication 
enabled; ui acls enabled; users  with view permissions: Set(...); groups with 
view permissions: Set(); users  with modify permissions: Set(); groups 
with modify permissions: Set(.)
19/05/20 08:48:43 INFO Client: Deleted staging directory 
hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456
Exception in thread "main" java.io.IOException: Cannot create proxy with 
unresolved address: abcabcabc-nn1:8020
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345)
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2821)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:100)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2892)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2874)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:215)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:214)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:214)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:213)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.hadoopFSsToAccess(YarnSparkHadoopUtil.scala:213)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.obtainDelegationTokens(HadoopFSDelegationTokenProvider.scala:48)
{code}
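
A minimal sketch of the call that trips this, assuming "ns1" is a logical 
nameservice id rather than a resolvable host (the name is a placeholder):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// If the client configuration cannot resolve "ns1" to actual namenode
// addresses, the DFS client falls back to treating it as host:port and
// fails with "Cannot create proxy with unresolved address".
val conf = new Configuration()
val fs = new Path("hdfs://ns1").getFileSystem(conf)
{code}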
 

> Revert changes introduced as a part of Automatic namespace discovery 
> [SPARK-24149]
> --
>
> Key: SPARK-27937
> URL: https://issues.apache.org/jira/browse/SPARK-27937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Dhruve Ashar
>Priority: Major
>
> Spark fails to launch for a valid deployment of HDFS while trying to get 
> tokens for a logical nameservice instead of an actual namenode (with HDFS 
> federation enabled). 
> On inspecting the source code closely, it is unclear why we were doing this; 
> based on the context from SPARK-24149, it solves a very specific use case of 
> getting the tokens for only those namenodes which are configured for HDFS 
> federation in the same cluster. IMHO these are better left for the user to 
> specify explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27937) Revert changes introduced as a part of Automatic namespace discovery [SPARK-24149]

2019-06-07 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859001#comment-16859001
 ] 

Dhruve Ashar edited comment on SPARK-27937 at 6/7/19 9:27 PM:
--

The exception that we started encountering occurs while Spark tries to create a 
Path for the logical nameservice (nameservice id) configured as a part of HDFS 
federation, in the code here:

https://github.com/apache/spark/blob/v2.4.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L215

 
{code:java}
19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 
19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication 
enabled; ui acls enabled; users  with view permissions: Set(...); groups with 
view permissions: Set(); users  with modify permissions: Set(); groups 
with modify permissions: Set(.)
19/05/20 08:48:43 INFO Client: Deleted staging directory 
hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456
Exception in thread "main" java.io.IOException: Cannot create proxy with 
unresolved address: abcabcabc-nn1:8020
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345)
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2821)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:100)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2892)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2874)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:215)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:214)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:214)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:213)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.hadoopFSsToAccess(YarnSparkHadoopUtil.scala:213)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.obtainDelegationTokens(HadoopFSDelegationTokenProvider.scala:48)
{code}
 


was (Author: dhruve ashar):
The exception that we started encountering occurs while Spark tries to create a 
Path for the logical nameservice (nameservice id) configured as a part of HDFS 
federation. 

 
{code:java}
19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 
19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication 
enabled; ui acls enabled; users  with view permissions: Set(...); groups with 
view permissions: Set(); users  with modify permissions: Set(); groups 
with modify permissions: Set(.)
19/05/20 08:48:43 INFO Client: Deleted staging directory 
hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456
Exception in thread "main" java.io.IOException: Cannot create proxy with 
unresolved address: abcabcabc-nn1:8020
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345)
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285)
at 

[jira] [Created] (SPARK-27937) Revert changes introduced as a part of Automatic namespace discovery [SPARK-24149]

2019-06-03 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-27937:


 Summary: Revert changes introduced as a part of Automatic 
namespace discovery [SPARK-24149]
 Key: SPARK-27937
 URL: https://issues.apache.org/jira/browse/SPARK-27937
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Dhruve Ashar


Spark fails to launch for a valid deployment of HDFS while trying to get tokens 
for a logical nameservice instead of an actual namenode (with HDFS federation 
enabled). 

On inspecting the source code closely, it is unclear why we were doing this; 
based on the context from SPARK-24149, it solves a very specific use case of 
getting the tokens for only those namenodes which are configured for HDFS 
federation in the same cluster. IMHO these are better left for the user to 
specify explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org