I'm using barrier execution in my Spark job, but I occasionally see deadlocks where the task scheduler is unable to place all the tasks. The failure is logged, but the job then hangs indefinitely. I have 2 executors with 16 cores each, running in standalone mode (I think? I'm on Databricks). The dataset has 31 partitions.
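As I understand it, a barrier stage only launches when every one of its tasks can get a slot in the same scheduling round. Here's a minimal sketch of that requirement as plain arithmetic (this is not Spark API, just my mental model of one task slot per core):

```python
def barrier_fits(num_executors: int, cores_per_executor: int,
                 num_partitions: int, busy_slots: int = 0) -> bool:
    """Can all barrier tasks be launched simultaneously?

    Assumes one task per core ("slot"); busy_slots models cores
    already occupied by tasks from other concurrent stages.
    """
    free_slots = num_executors * cores_per_executor - busy_slots
    return num_partitions <= free_slots

# My setup: 2 executors x 16 cores = 32 slots for 31 barrier tasks.
print(barrier_fits(2, 16, 31))                # True: 31 <= 32
# But if even two slots are taken by something else, the stage can't launch:
print(barrier_fits(2, 16, 31, busy_slots=2))  # False: 31 > 30
```

So with only one spare slot, any other task landing on the cluster would seem to prevent the barrier stage from ever fitting.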
One thing I've noticed when this occurs is that the number of "Active Tasks" shown in the UI exceeds the number of cores on one executor. How can an executor start more tasks than it has cores? spark.executor.cores is not set. While digging through the source I came across SPARK-24818 (https://issues.apache.org/jira/browse/SPARK-24818), which looks related, but I'm not sure how to address it properly.

Screenshot: http://apache-spark-user-list.1001560.n3.nabble.com/file/t10345/Screen_Shot_2019-09-10_at_11.png

Example log:

19/09/10 13:39:55 WARN FairSchedulableBuilder: A job was submitted with scheduler pool 7987057535266794448, which has not been configured. This can happen when the file that pools are read from isn't set, or when that file doesn't contain 7987057535266794448. Created 7987057535266794448 with default configuration (schedulingMode: FIFO, minShare: 0, weight: 1)
19/09/10 13:39:55 INFO FairSchedulableBuilder: Added task set TaskSet_20.0 tasks to pool 7987057535266794448
19/09/10 13:39:55 INFO TaskSetManager: Starting task 1.0 in stage 20.0 (TID 1705, 10.222.231.128, executor 0, partition 1, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 0.0 in stage 20.0 (TID 1706, 10.222.234.66, executor 1, partition 0, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 5.0 in stage 20.0 (TID 1707, 10.222.231.128, executor 0, partition 5, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 2.0 in stage 20.0 (TID 1708, 10.222.234.66, executor 1, partition 2, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 7.0 in stage 20.0 (TID 1709, 10.222.231.128, executor 0, partition 7, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 3.0 in stage 20.0 (TID 1710, 10.222.234.66, executor 1, partition 3, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 8.0 in stage 20.0 (TID 1711, 10.222.231.128, executor 0, partition 8, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 4.0 in stage 20.0 (TID 1712, 10.222.234.66, executor 1, partition 4, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 9.0 in stage 20.0 (TID 1713, 10.222.231.128, executor 0, partition 9, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 6.0 in stage 20.0 (TID 1714, 10.222.234.66, executor 1, partition 6, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 12.0 in stage 20.0 (TID 1715, 10.222.231.128, executor 0, partition 12, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 10.0 in stage 20.0 (TID 1716, 10.222.234.66, executor 1, partition 10, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 13.0 in stage 20.0 (TID 1717, 10.222.231.128, executor 0, partition 13, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 11.0 in stage 20.0 (TID 1718, 10.222.234.66, executor 1, partition 11, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 16.0 in stage 20.0 (TID 1719, 10.222.231.128, executor 0, partition 16, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 14.0 in stage 20.0 (TID 1720, 10.222.234.66, executor 1, partition 14, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 17.0 in stage 20.0 (TID 1721, 10.222.231.128, executor 0, partition 17, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 15.0 in stage 20.0 (TID 1722, 10.222.234.66, executor 1, partition 15, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 18.0 in stage 20.0 (TID 1723, 10.222.231.128, executor 0, partition 18, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 21.0 in stage 20.0 (TID 1724, 10.222.234.66, executor 1, partition 21, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 19.0 in stage 20.0 (TID 1725, 10.222.231.128, executor 0, partition 19, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 22.0 in stage 20.0 (TID 1726, 10.222.234.66, executor 1, partition 22, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 20.0 in stage 20.0 (TID 1727, 10.222.231.128, executor 0, partition 20, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 24.0 in stage 20.0 (TID 1728, 10.222.234.66, executor 1, partition 24, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 23.0 in stage 20.0 (TID 1729, 10.222.231.128, executor 0, partition 23, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 25.0 in stage 20.0 (TID 1730, 10.222.234.66, executor 1, partition 25, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 27.0 in stage 20.0 (TID 1731, 10.222.231.128, executor 0, partition 27, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 26.0 in stage 20.0 (TID 1732, 10.222.234.66, executor 1, partition 26, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 28.0 in stage 20.0 (TID 1733, 10.222.231.128, executor 0, partition 28, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 29.0 in stage 20.0 (TID 1734, 10.222.231.128, executor 0, partition 29, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Skip current round of resource offers for barrier stage 20 because only 30 out of a total number of 31 tasks got resource offers. The resource offers may have been blacklisted or cannot fulfill task locality requirements.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:489)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:461)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:461)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:253)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:245)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:709)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:245)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:140)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
19/09/10 13:39:55 INFO TaskSetManager: Starting task 30.0 in stage 20.0 (TID 1735, 10.222.231.128, executor 0, partition 30, NODE_LOCAL, 5669 bytes)
19/09/10 13:39:55 ERROR Inbox: Ignoring error
java.lang.IllegalArgumentException: requirement failed: Skip current round of resource offers for barrier stage 20 because only 1 out of a total number of 31 tasks got resource offers. The resource offers may have been blacklisted or cannot fulfill task locality requirements.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:489)
    at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:461)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:461)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:253)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$2.apply(CoarseGrainedSchedulerBackend.scala:245)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$$withLock(CoarseGrainedSchedulerBackend.scala:709)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.org$apache$spark$scheduler$cluster$CoarseGrainedSchedulerBackend$DriverEndpoint$$makeOffers(CoarseGrainedSchedulerBackend.scala:245)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:140)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:226)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
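For what it's worth, the workaround I'm considering is pinning the per-executor slot count and the barrier scheduling check explicitly. This is only a sketch; I'm assuming Databricks honors these confs in its standalone-style deployment, and the jar name is a placeholder:

```shell
# Sketch only: pin slots so the 31 barrier tasks always fit in 2 x 16 cores.
spark-submit \
  --conf spark.executor.cores=16 \
  --conf spark.task.cpus=1 \
  --conf spark.scheduler.barrier.maxConcurrentTasksCheck.interval=15s \
  --conf spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures=10 \
  my-app.jar
```

Does that seem like the right direction, or is there a better way to handle SPARK-24818-style scheduling failures?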