[jira] [Updated] (SPARK-25775) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-25775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ablimit A. Keskin updated SPARK-25775:
--------------------------------------
    Description:

The following issue comes from a production Spark job where executors die due to uncaught exceptions during block release. When the job runs with specific settings for _executor-cores_ and _total-executor-cores_ (e.g. 4 & 8 or 1 & 8), it consistently fails at the same code path. The following logs are from our run:

{code:java}
18/10/18 23:06:18 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 27 (PythonRDD[94] at RDD at PythonRDD.scala:49) (first 15 tasks are for partitions Vector(0))
18/10/18 23:06:18 INFO TaskSchedulerImpl: Adding task set 27.0 with 1 tasks
18/10/18 23:06:18 INFO TaskSetManager: Starting task 0.0 in stage 27.0 (TID 112, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:18 INFO BlockManagerInfo: Added broadcast_37_piece0 in memory on 10.248.7.2:44871 (size: 9.1 KB, free: 13.0 GB)
18/10/18 23:06:24 WARN TaskSetManager: Lost task 0.0 in stage 27.0 (TID 112, 10.248.7.2, executor 0): java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:156)
    at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
    at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
    at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
    at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
    at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
    at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/10/18 23:06:24 INFO TaskSetManager: Starting task 0.1 in stage 27.0 (TID 113, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 INFO TaskSetManager: Lost task 0.1 in stage 27.0 (TID 113) on 10.248.7.2, executor 0: java.lang.AssertionError (assertion failed) [duplicate 1]
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.2 in stage 27.0 (TID 114, 10.248.7.2, executor 0, partition 0, PROCESS_LOCAL, 5585 bytes)
18/10/18 23:06:31 ERROR TaskSchedulerImpl: Lost executor 0 on 10.248.7.2: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181018230546-0040/0 is now EXITED (Command exited with code 50)
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Executor app-20181018230546-0040/0 removed: Command exited with code 50
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20181018230546-0040/2 on worker-20181018173742-10.248.110.2-40787 (10.248.110.2:40787) with 4 cores
18/10/18 23:06:31 INFO StandaloneSchedulerBackend: Granted executor ID app-20181018230546-0040/2 on hostPort 10.248.110.2:40787 with 4 cores, 25.0 GB RAM
18/10/18 23:06:31 WARN TaskSetManager: Lost task 0.2 in stage 27.0 (TID 114, 10.248.7.2, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
18/10/18 23:06:31 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20181018230546-0040/2 is now RUNNING
18/10/18 23:06:31 INFO DAGScheduler: Executor lost: 0 (epoch 11)
18/10/18 23:06:31 INFO BlockManagerMaster: Removal of executor 0 requested
18/10/18 23:06:31 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
18/10/18 23:06:31 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
18/10/18 23:06:31 INFO TaskSetManager: Starting task 0.3 in stage 27.0 (TID 115, 10.248.21.2, executor 1, partition 0, ANY, 5585 bytes)
18/10/18 23:06:31 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_27_2 !
18/10/18 23:06:31 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_44_7 !
18/10/18 23:06:31 WARN
{code}
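The assertion in the stack trace fires in BlockInfo.checkInvariants, which requires a block's readerCount to stay non-negative. Below is a minimal, self-contained sketch of how a double release of the same read lock could trip it; this is a simplified model of the suspected race between end-of-task cleanup and the completion iterator's lock release, not Spark's actual BlockInfoManager code, and all names in it are illustrative only:

{code:scala}
// Simplified model -- NOT Spark's BlockInfoManager. It only illustrates the
// invariant (readerCount >= 0) that the assertion in the stack trace enforces.
object ReaderCountRaceSketch {

  final class BlockInfoModel {
    private var readerCount: Int = 0

    def lockForReading(): Unit = synchronized {
      readerCount += 1
    }

    def unlock(): Unit = synchronized {
      readerCount -= 1
      // Mirrors BlockInfo.checkInvariants (BlockInfoManager.scala:84):
      // a negative reader count means a lock was released twice.
      assert(readerCount >= 0, "assertion failed")
    }
  }

  def main(args: Array[String]): Unit = {
    val info = new BlockInfoModel
    info.lockForReading() // the task takes a read lock on a cached block

    // Release path 1: the completion iterator unlocks the block once the
    // task's iterator is fully consumed.
    info.unlock()

    // Release path 2: end-of-task cleanup (releaseAllLocksForTask) releases
    // every lock still registered for the task. If path 1 raced past the
    // bookkeeping that should have deregistered this lock, the second
    // release drives readerCount negative and the assertion fires.
    info.unlock() // java.lang.AssertionError: assertion failed
  }
}
{code}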
[jira] [Updated] (SPARK-25775) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-25775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ablimit A. Keskin updated SPARK-25775:
--------------------------------------
    Affects Version/s:     (was: 2.2.0)
                           2.2.2

> Race between end-of-task and completion iterator read lock release
> -------------------------------------------------------------------
>
>                 Key: SPARK-25775
>                 URL: https://issues.apache.org/jira/browse/SPARK-25775
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Spark Core
>    Affects Versions: 2.2.2
>            Reporter: Ablimit A. Keskin
>            Priority: Major
[jira] [Created] (SPARK-25775) Race between end-of-task and completion iterator read lock release
Ablimit A. Keskin created SPARK-25775:
------------------------------------------

             Summary: Race between end-of-task and completion iterator read lock release
                 Key: SPARK-25775
                 URL: https://issues.apache.org/jira/browse/SPARK-25775
             Project: Spark
          Issue Type: Bug
          Components: Block Manager, Spark Core
    Affects Versions: 2.2.0
            Reporter: Ablimit A. Keskin

The following issue comes from a production Spark job where executors die due to uncaught exceptions during block release. When the job runs with specific settings for "--executor-cores" and "--total-executor-cores" (e.g. 4 & 8 or 1 & 8), it consistently fails at the same code path (see the logs quoted above).
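For reference, a minimal sketch of a standalone-mode configuration matching one of the failing cores combinations described above (4 executor cores with 8 total cores). Here spark.executor.cores corresponds to --executor-cores and spark.cores.max to --total-executor-cores; the master URL, application name, and the cached-RDD workload are placeholders rather than the actual production job:

{code:scala}
import org.apache.spark.sql.SparkSession

object Spark25775ReproSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-25775-repro")         // placeholder application name
      .master("spark://master:7077")        // placeholder standalone master URL
      .config("spark.executor.cores", "4")  // --executor-cores 4
      .config("spark.cores.max", "8")       // --total-executor-cores 8
      .getOrCreate()

    // The "No more replicas available for rdd_27_2" warnings in the log
    // suggest the failing job reads cached RDD blocks, so a cached RDD with
    // several tasks per executor is a plausible workload shape.
    val cached = spark.sparkContext.parallelize(1 to 1000000, numSlices = 16).cache()
    println(cached.map(_ * 2).count())

    spark.stop()
  }
}
{code}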