[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845431#comment-16845431 ]

Apache Spark commented on SPARK-18406:
--------------------------------------

User 'rezasafi' has created a pull request for this issue:
https://github.com/apache/spark/pull/24670

> Race between end-of-task and completion iterator read lock release
> -------------------------------------------------------------------
>
>                 Key: SPARK-18406
>                 URL: https://issues.apache.org/jira/browse/SPARK-18406
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager, Spark Core
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Josh Rosen
>            Assignee: Xingbo Jiang
>            Priority: Major
>             Fix For: 2.0.3, 2.1.2, 2.2.0
>
> The following log comes from a production streaming job where executors
> periodically die due to uncaught exceptions during block release:
> {code}
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921
> 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922
> 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923
> 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable 2721
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924
> 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924)
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as bytes in memory (estimated size 5.0 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took 3 ms
> 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in memory (estimated size 9.4 KB, free 4.9 GB)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = 567, finish = 1
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = 541, finish = 6
> 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID 7923). 1429 bytes result sent to driver
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = 533, finish = 7
> 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID 7924). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Executor: Exception in task 0.0 in stage 2390.0 (TID 7921)
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
>   at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
>   at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
>   at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7925
> 16/11/07 17:11:06 INFO Executor: Running task 0.1 in stage 2390.0 (TID 7925)
> 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally
> 16/11/07 17:11:06 INFO PythonRunner: Times: total = 41, boot = -536, init = 576, finish = 1
> 16/11/07 17:11:06 INFO Executor: Finished task 1.0 in stage 2390.0 (TID 7922). 1429 bytes result sent to driver
> 16/11/07 17:11:06 ERROR Utils: Uncaught
> {code}
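The failure mode named in the issue title can be reproduced in shape outside Spark. The sketch below is a minimal, self-contained analogue, not Spark's actual BlockInfoManager API (all names here are hypothetical): a task thread runs its end-of-task lock cleanup while a helper thread is still draining a completion iterator, so the same per-block read lock is released twice and the reader-count invariant fails, just as in the trace above.

{code}
import java.util.concurrent.atomic.AtomicInteger

// Minimal sketch of the double-release race. BlockLock mirrors only the shape of
// BlockInfo's reader count; the names are illustrative, not Spark's API.
object LockReleaseRaceSketch {
  final class BlockLock {
    private val readerCount = new AtomicInteger(0)
    def lockForReading(): Unit = readerCount.incrementAndGet()
    def unlock(): Unit = {
      val remaining = readerCount.decrementAndGet()
      // Plays the role of the failing assert in BlockInfo.checkInvariants.
      assert(remaining >= 0, "block is not locked for reading")
    }
  }

  def main(args: Array[String]): Unit = {
    val lock = new BlockLock
    lock.lockForReading() // the task thread pins a cached block for reading

    // A helper thread (like PythonRunner's "stdout writer" thread) finishes
    // consuming the block later and then releases the lock, the way a
    // CompletionIterator's completion callback would.
    val writer = new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(10); lock.unlock() }
    }, "stdout-writer")
    writer.start()

    // Meanwhile the task thread finishes first and runs end-of-task cleanup,
    // releasing every lock registered for the task -- including the one above.
    lock.unlock() // plays the role of releaseAllLocksForTask(taskAttemptId)
    writer.join() // the writer's second release trips the assertion, uncaught
  }
}
{code}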
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835339#comment-16835339 ]

Apache Spark commented on SPARK-18406:
--------------------------------------

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/24552
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834276#comment-16834276 ]

Apache Spark commented on SPARK-18406:
--------------------------------------

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/24542
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830770#comment-16830770 ]

Xingbo Jiang commented on SPARK-18406:
--------------------------------------

This problem still exists in PythonRunner: the Python side uses a prefetch model to consume the upstream data, and opens another thread to serve output data to downstream operators. It is therefore possible for the task to finish first and trigger the task cleanup logic; the CompletionIterator then tries to release the read lock it holds on some blocks and finds the lock has already been released. I'll submit a PR to bypass the issue later.
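To make the shape of that interaction concrete, here is a minimal, runnable sketch. The stub names (writeToPythonWorker, readResultsFromPythonWorker) are hypothetical stand-ins, not Spark's real PythonRunner internals; the point is only that the task thread can complete while the writer thread is still consuming the upstream iterator.

{code}
// Illustrative sketch of the prefetch/writer-thread structure described above.
object PythonRunnerShapeSketch {
  // Hypothetical stubs standing in for the worker I/O.
  private def writeToPythonWorker(row: Int): Unit = Thread.sleep(1)
  private def readResultsFromPythonWorker(): Unit = () // results can arrive early

  def main(args: Array[String]): Unit = {
    // Stands in for the CompletionIterator over a cached block's rows.
    val upstream = Iterator.range(0, 1000)

    // Writer thread serves upstream data to the Python worker.
    val writer = new Thread(new Runnable {
      def run(): Unit = upstream.foreach(writeToPythonWorker)
    }, "stdout writer for python")
    writer.start()

    readResultsFromPythonWorker() // the task can report success here...
    println(s"task finished; writer still running: ${writer.isAlive}")
    // ...so end-of-task cleanup would release all locks now, while the writer
    // thread's iterator (and its completion callback) is still outstanding.
    writer.join()
  }
}
{code}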
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16254839#comment-16254839 ]

david zhang commented on SPARK-18406:
--------------------------------------

We still hit this issue in Spark 2.2. I checked the changes for this fix and confirmed that Spark 2.2 contains them, but our error log shows:

{code}
17/11/03 08:00:10 ERROR Utils: Uncaught exception in thread stdout writer for python
java.lang.AssertionError: assertion failed: Block input-0-1508745006978 is not locked for reading
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
  at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
  at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
  at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at org.apache.spark.util.CompletionIterator.foreach(CompletionIterator.scala:26)
  at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
  at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1954)
  at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
{code}
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167181#comment-16167181 ]

Wenchen Fan commented on SPARK-18406:
-------------------------------------

https://github.com/apache/spark/pull/18099 is the PR that backported the fix to 2.1
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues-test.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167180#comment-16167180 ]

Yongqin Xiao commented on SPARK-18406:
--------------------------------------

[~cloud_fan], I see there are 3 check-ins for this issue, touching multiple files. You mentioned the fix will be backported to Spark 2.1.0. Can you let me know which single commit in 2.1.0 will address the issue? The reason I am asking is that my company may not update its Spark version to 2.2 very soon, so I will have to port your fix to our company's versions of Spark 2.1.0 and 2.0.1. I cannot simply use the latest Spark 2.1.0 even after you backport the fix, because we carry other patches on top of Spark 2.1.0, some of which we fixed ourselves. Thanks for your help.
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113702#comment-16113702 ]

Taichi Sano commented on SPARK-18406:
-------------------------------------

Hello, I am experiencing an issue very similar to this. I am currently trying to do a groupByKeyAndWindow() with a batch size of 1, a window size of 80, and a shift size of 1 over data streamed from Kafka (ver 0.10) with direct streaming. Every once in a while, I encounter an AssertionError like so:

{code}
17/08/03 22:32:19 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 20936.0 (TID 4409)
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
  at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
  at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
17/08/03 22:32:19 ERROR org.apache.spark.executor.Executor: Exception in task 0.1 in stage 20936.0 (TID 4410)
java.lang.AssertionError: assertion failed
  at scala.Predef$.assert(Predef.scala:156)
  at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
  at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:367)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:366)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:366)
  at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:361)
  at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:736)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:342)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
17/08/03 22:32:19 ERROR org.apache.spark.util.Utils: Uncaught exception in thread stdout writer for /opt/conda/bin/python
java.lang.AssertionError: assertion failed: Block rdd_30291_0 is not locked for reading
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299)
  at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:720)
  at org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:516)
  at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:46)
  at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:35)
  at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
  at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
  at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
  at
{code}
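For context, the setup described in that comment looks roughly like the following Scala sketch. This is an assumption-laden analogue, not the reporter's code (the original job is PySpark, and the broker address, topic name, and group id here are placeholders; batch, window, and slide are assumed to be in seconds):

{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object WindowedJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("groupByKeyAndWindow-sketch")
    val ssc  = new StreamingContext(conf, Seconds(1)) // batch size 1

    // Placeholder Kafka 0.10 direct-stream configuration.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "window-sketch")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    // Window of 80 sliding by 1: many cached input blocks stay pinned for
    // reading across batches, which widens the window for the race above.
    stream.map(record => (record.key, record.value))
      .groupByKeyAndWindow(Seconds(80), Seconds(1))
      .foreachRDD(rdd => println(rdd.count()))

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}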
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023759#comment-16023759 ]

Apache Spark commented on SPARK-18406:
--------------------------------------

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18099
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16023462#comment-16023462 ] Apache Spark commented on SPARK-18406: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/18096 > Race between end-of-task and completion iterator read lock release > -- > > Key: SPARK-18406 > URL: https://issues.apache.org/jira/browse/SPARK-18406 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.0.0, 2.0.1 >Reporter: Josh Rosen >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > The following log comes from a production streaming job where executors > periodically die due to uncaught exceptions during block release: > {code} > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7921 > 16/11/07 17:11:06 INFO Executor: Running task 0.0 in stage 2390.0 (TID 7921) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7922 > 16/11/07 17:11:06 INFO Executor: Running task 1.0 in stage 2390.0 (TID 7922) > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7923 > 16/11/07 17:11:06 INFO Executor: Running task 2.0 in stage 2390.0 (TID 7923) > 16/11/07 17:11:06 INFO TorrentBroadcast: Started reading broadcast variable > 2721 > 16/11/07 17:11:06 INFO CoarseGrainedExecutorBackend: Got assigned task 7924 > 16/11/07 17:11:06 INFO Executor: Running task 3.0 in stage 2390.0 (TID 7924) > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721_piece0 stored as > bytes in memory (estimated size 5.0 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO TorrentBroadcast: Reading broadcast variable 2721 took > 3 ms > 16/11/07 17:11:06 INFO MemoryStore: Block broadcast_2721 stored as values in > memory (estimated size 9.4 KB, free 4.9 GB) > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_1 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_3 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_2 locally > 16/11/07 17:11:06 INFO BlockManager: Found block rdd_2741_4 locally > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 2, boot = -566, init = > 567, finish = 1 > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 7, boot = -540, init = > 541, finish = 6 > 16/11/07 17:11:06 INFO Executor: Finished task 2.0 in stage 2390.0 (TID > 7923). 1429 bytes result sent to driver > 16/11/07 17:11:06 INFO PythonRunner: Times: total = 8, boot = -532, init = > 533, finish = 7 > 16/11/07 17:11:06 INFO Executor: Finished task 3.0 in stage 2390.0 (TID > 7924). 
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022477#comment-16022477 ] Wenchen Fan commented on SPARK-18406: - we will backport this to 2.1 and 2.0 later
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022077#comment-16022077 ] Yongqin Xiao commented on SPARK-18406: -- Thanks for the fix. Which Spark release will have it? Can we get a patch on top of Spark 2.1.0?
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16022067#comment-16022067 ] Apache Spark commented on SPARK-18406: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/18076
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970496#comment-15970496 ] Yongqin Xiao commented on SPARK-18406: -- Thanks Josh for the quick response! This issue is critical to my company's use cases, where, for performance reasons, we have to use a custom RDD that takes input from multiple parent RDDs and uses existing computation logic (a black box) in the background to pull the result.
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970235#comment-15970235 ] Josh Rosen commented on SPARK-18406: I can see how allowing user-level code to call setTaskContext() could fix this issue, but it's not ideal because it still places the burden on end users to call setTaskContext() in their code. Instead, I think a cleaner fix would be to have the CompletionIterator record the task ID when it's instantiated, so that the same task ID can be used even if completion occurs in a different thread. The idea is to reduce our reliance on thread locals: there are reasons why we couldn't completely remove them (API changes), but there are parts of the internals where we can propagate task identity more explicitly. To move forward here, my suggestion is that we write a failing regression test based on the description provided by [~yxiao], then experiment with my suggested approach of more explicitly threading task IDs into closeable objects when they're first created. I'm on vacation this week and won't be able to help with this until Monday, April 24th, so someone else will need to help / review if this is urgent.
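A minimal sketch of the direction suggested here (hypothetical: this is not the actual Spark patch, and the class and parameter names below are invented for illustration). The iterator captures the task attempt ID on the thread that constructs it, so the release callback can address the correct task's locks even when completion fires on a different thread:

{code}
import org.apache.spark.TaskContext

// Hypothetical sketch, not the merged fix: capture the creating task's
// attempt ID eagerly so completion can run safely on another thread.
class TaskAwareCompletionIterator[A](sub: Iterator[A], release: Long => Unit)
  extends Iterator[A] {

  // Recorded once, on the thread that instantiates the iterator, while the
  // TaskContext thread-local is still guaranteed to belong to this task.
  private val taskAttemptId: Long = TaskContext.get().taskAttemptId()

  private var completed = false

  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) {
      completed = true
      // Use the captured ID instead of re-reading the thread-local here,
      // which may be unset (or belong to another task) on this thread.
      release(taskAttemptId)
    }
    more
  }

  override def next(): A = sub.next()
}
{code}

Whatever the final patch looks like, the principle is the same: capture task identity when the iterator is created rather than reading a thread-local at completion time.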
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15969384#comment-15969384 ] Yongqin Xiao commented on SPARK-18406: -- The same JIRA is found under https://issues-test.apache.org/jira/browse/SPARK-18406. The same issue is observed in Spark 2.1.0 as well, on a simple Spark query that compiles into 3 stages. A custom RDD is used in this case, and it is registered for persistence. In the compute() method of the custom RDD, I spawn a new thread to compute the data in the background and immediately return an abstract iterator (wrapped in an InterruptibleIterator) that pulls data from the background computation on demand. The assertion fires when the iterator of the parent RDD reaches the end of the data. It doesn't always happen when the custom RDD is used in the query, regardless of whether it is used once or multiple times.

The issue is related to the new thread I created, which consumes data from the input iterator of the parent RDD. The new thread is missing the TSS (thread-specific storage) for TaskContext, and I see that BlockInfoManager uses this thread-local TaskContext as the key when it looks up lock holders. Here is a log line showing the task ID being unset (-1024) in this thread:

{code}
Line 1819: 2017/04/13 15:01:01.674 [Thread-33]: TRACE storage.BlockInfoManager: Task -1024 releasing lock for rdd_25_0
{code}

However, I have no way to set the TSS for my thread now, because the setter is made protected, as below:

{code}
object TaskContext {
  ...
  private[this] val taskContext: ThreadLocal[TaskContext] = new ThreadLocal[TaskContext]

  // Note: protected[spark] instead of private[spark] to prevent the following two from
  // showing up in JavaDoc.
  /** Set the thread local TaskContext. Internal to Spark. */
  protected[spark] def setTaskContext(tc: TaskContext): Unit = taskContext.set(tc)
{code}

Just to confirm my theory, I made TaskContext.setTaskContext public and called it at the beginning of my thread. The use cases that previously failed consistently with the lock-release assertion now run successfully in all scenarios I have tried, including different numbers of source/shuffle partitions, different numbers of executors, and async vs. sequential execution (with multiple downstream custom RDDs pulling data from the upstream RDD).
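The pattern described above can be sketched as follows (a hypothetical illustration: it assumes setTaskContext is accessible, as in the experiment where it was made public, and the queue wiring is invented for the example). The task thread's context is captured in compute() and re-installed on the worker thread before it consumes the parent iterator:

{code}
import java.util.concurrent.LinkedBlockingQueue

import org.apache.spark.TaskContext

// Hypothetical sketch of the workaround described above. Note that
// setTaskContext is protected[spark], so real user code cannot call it
// unless it is made public (as in the experiment) or placed in a spark package.
def backgroundIterator[T](parent: Iterator[T]): Iterator[T] = {
  val ctx = TaskContext.get() // captured on the task's own thread
  val queue = new LinkedBlockingQueue[Option[T]]()
  val worker = new Thread(new Runnable {
    override def run(): Unit = {
      // Re-install the creating task's context; without this, the thread-local
      // is unset and BlockInfoManager attributes lock releases to task -1024.
      TaskContext.setTaskContext(ctx)
      try parent.foreach(t => queue.put(Some(t)))
      finally queue.put(None) // end-of-data marker
    }
  })
  worker.setDaemon(true)
  worker.start()
  Iterator.continually(queue.take()).takeWhile(_.isDefined).map(_.get)
}
{code}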
[jira] [Commented] (SPARK-18406) Race between end-of-task and completion iterator read lock release
[ https://issues.apache.org/jira/browse/SPARK-18406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830294#comment-15830294 ] John Myers commented on SPARK-18406: We see a similar issue when doing basic RDD operations (like checking for empty RDDs) within streaming jobs:

{code}
17/01/19 16:43:45 WARN BlockManager: Block input-0-1484840671538 replicated to only 0 peer(s) instead of 1 peers
17/01/19 16:43:46 WARN MetricsHelper: No metrics scope set in thread RecurringTimer - Kinesis Checkpointer - Worker localhost:65fe618d-c0a7-4fea-b710-0f1b5c6498f2, getMetricsScope returning NullMetricsScope.
17/01/19 16:44:51 ERROR Executor: Exception in task 0.0 in stage 212.0 (TID 4180)
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)
	at org.apache.spark.storage.BlockInfo.checkInvariants(BlockInfoManager.scala:84)
	at org.apache.spark.storage.BlockInfo.readerCount_$eq(BlockInfoManager.scala:66)
	at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:362)
	at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2$$anonfun$apply$2.apply(BlockInfoManager.scala:361)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:361)
	at org.apache.spark.storage.BlockInfoManager$$anonfun$releaseAllLocksForTask$2.apply(BlockInfoManager.scala:356)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:356)
	at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:646)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
17/01/19 16:44:51 WARN TaskSetManager: Lost task 0.0 in stage 212.0 (TID 4180, localhost): java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:156)
	... (same stack trace as above) ...
17/01/19 16:44:51 ERROR TaskSetManager: Task 0 in stage 212.0 failed 1 times; aborting job
17/01/19 16:44:51 ERROR JobScheduler: Error running job streaming job 148484427 ms.0
org.apache.spark.SparkException: An exception was raised by Python:
Traceback (most recent call last):
  File "/root/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 65, in call
    r = self.func(t, *rdds)
  File "/root/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 159, in <lambda>
    func = lambda t, rdd: old_func(rdd)
  File "/root/streamtest.py", line 519, in multiplex
    if not flow_rdd.isEmpty():
  File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in isEmpty
    return self.getNumPartitions() == 0 or len(self.take(1)) == 0
  File "/root/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1310, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/root/spark/python/lib/pyspark.zip/pyspark/context.py", line 933, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/root/spark/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line
{code}
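The failing Python pattern above boils down to running a small RDD action such as isEmpty() inside a streaming output operation. A rough Scala equivalent, for reference (application name, source, and batch interval are invented; this only illustrates the shape of the job that hits the race):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: isEmpty() internally runs take(1), and the completion
// iterator's read-lock release can race with end-of-task lock cleanup.
object EmptyCheckJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("empty-check")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {            // the operation this report flags
        rdd.take(1).foreach(println)   // any small action on the cached input
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}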