[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230807#comment-14230807 ] Josh Rosen commented on SPARK-4498:
---
[~andrewor14] and I just had a long discussion about this. To recap, I think there are two distinct bugs:
1. Driver disconnection is not properly detected by the Master.
2. The application failure detection logic introduced in SPARK-2425 is broken.

The second bug can mask the first one. Even if something is wrong with the logic for detecting a disconnected application / driver, the application might still eventually be removed if the "fail an application when its executors fail" logic worked correctly, since every executor assigned to the exited driver will fail.

To simulate bugs in the driver-disconnection logic, we can simply comment out the relevant line in the Master's {{DisassociatedEvent}} handler. After doing this, I was able to reproduce behavior similar to the issue reported in this ticket: with a local standalone cluster consisting of one master, one worker, and two executor slots, I started {{spark-shell}}, killed it with {{kill -9}}, then browsed to the Master web UI and saw executors continually launching and failing even though the driver had exited (this continued for 30+ minutes; it's still happening).

I think the logic from SPARK-2425 is completely broken, in the sense that there are some cluster configurations for which it will _never_ kill an application even though every executor it launches fails, and that there are very few scenarios in which it will ever fail an application. To help explain this, I've created a commit that factors the failure detection logic out into its own class: https://github.com/JoshRosen/spark/commit/87d7960d660b218a9a965fd7d344e2aae0250128. That commit also includes unit tests of the failure detection logic in isolation, which helps to illustrate the current bugs.

If you look at the current (broken) failure detection logic, two conditions must be met for an application to be marked as failed:
1. No executor can be running: {{!execs.exists(_.state == ExecutorState.RUNNING)}}
2. More than {{MAX_NUM_RETRY}} consecutive executor failure events must have been received: {{appInfo.incrementRetryCount() >= ApplicationState.MAX_NUM_RETRY}}

With the current logic, though, these conditions can almost never be met. The current (hardcoded) value of {{MAX_NUM_RETRY}} is 10. Imagine a cluster that only has room for 2 concurrent executors, and imagine that every executor fails immediately after launching. For the retry count to be incremented past the threshold, we must see at least 10 executor failures; for that to happen, though, executor launches must occur in between some of those failures, since only 2 executors can be running at the same time. When an executor fails and a new one is relaunched, the new executor enters the RUNNING state and sends a message back to the Master, which causes the Master to reset the retry count back to 0. Therefore, it's impossible for the failure detector to report failure in a cluster with fewer than 10 executor slots, even if every executor crashes. Even for larger clusters, it's still nearly impossible for the current logic to declare failure: because of the "no executor can be running" condition, all executors must be dead for it to declare failure.
The Master immediately calls {{schedule()}} after an executor failure, though, so in most cases a new executor will launch before we can see 10 consecutive failures (the test suite in my commit includes a contrived execution where it does declare failure, though).

So, what do we do, considering that we need to fix this before 1.2.0 and as a bugfix to be included in 1.1.2? I don't think that we should just revert SPARK-2425, since it's not okay to introduce one regression to fix another. A correct application failure detector needs to strike a balance between protecting long-running applications from transient failures and quickly detecting buggy applications that will never work. Long-running applications imply that we can't have counters / state that monotonically progress towards failure, since we can assume that over an infinite execution there will be an infinite number of executor failures. Intuitively, I think we want something window-based: if a large fraction of recently launched executors have failed, then declare that the application has failed. Unfortunately, the Master does not receive any signals about non-faulty executors (until they exit cleanly). One solution would be to add a driver-to-Master RPC that acknowledges that launched executors are in a good state (e.g. after a task has been sent to that executor). This would allow us to properly implement a "fail the application if the last _n_ launched ...
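To make the counter-reset livelock described above concrete, here is a small, self-contained simulation of the counting behavior (illustrative code only, not Spark source: the value 10 for {{MAX_NUM_RETRY}}, the 2-slot cluster, and the reset-on-RUNNING behavior come from the comment above; the object and variable names are made up):
{code}
// Simulates the 1.2-era failure detector: every executor crashes, but the
// Master relaunches replacements that report RUNNING and reset the counter.
object BrokenFailureDetectorSimulation {
  val MAX_NUM_RETRY = 10 // hardcoded threshold mentioned in the comment above

  def main(args: Array[String]): Unit = {
    val executorSlots = 2          // cluster only has room for 2 concurrent executors
    var retryCount = 0             // consecutive executor failures seen by the Master
    var applicationFailed = false

    // 100 rounds of "all running executors crash, Master schedules replacements".
    for (_ <- 1 to 100 if !applicationFailed) {
      for (_ <- 1 to executorSlots if !applicationFailed) {
        retryCount += 1            // an ExecutorStateChanged(..., FAILED, ...) arrives
        if (retryCount >= MAX_NUM_RETRY) applicationFailed = true
      }
      // Replacement executors launch, report RUNNING, and reset the counter,
      // so with only 2 slots the threshold of 10 can never be reached.
      if (!applicationFailed) retryCount = 0
    }

    // Prints: Application marked FAILED? false (final retryCount = 0)
    println(s"Application marked FAILED? $applicationFailed (final retryCount = $retryCount)")
  }
}
{code}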
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230895#comment-14230895 ] Patrick Wendell commented on SPARK-4498:
---
Hey Josh,

The proposal you gave here seems complex. Why not just revert SPARK-2425? It's only a regression from something that was released in the last few days and which we may need to revert anyway for a quick Spark 1.1.2 release due to its severity. It seems like it might be worth it to just revert SPARK-2425, put out a quick 1.1.2 release without the feature, and then revert it in the 1.2 branch also.

Standalone Master can fail to recognize completed/failed applications
----------------------------------------------------------------------
Key: SPARK-4498
URL: https://issues.apache.org/jira/browse/SPARK-4498
Project: Spark
Issue Type: Bug
Components: Deploy, Spark Core
Affects Versions: 1.1.1, 1.2.0
Environment:
- Linux dn11.chi.shopify.com 3.2.0-57-generic #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
- Standalone Spark built from apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
- Python 2.7.3
- java version 1.7.0_71, Java(TM) SE Runtime Environment (build 1.7.0_71-b14), Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
- 1 Spark master, 40 Spark workers with 32 cores apiece and 60-90 GB of memory apiece
- All client code is PySpark
Reporter: Harry Brundage
Priority: Blocker
Attachments: all-master-logs-around-blip.txt, one-applications-master-logs.txt

We observe the Spark standalone master not detecting that a driver application has completed after the driver process has shut down, leaving that driver's resources consumed indefinitely. The master reports such applications as Running, but the driver process has long since terminated. The master continually spawns one executor for the application. It boots, times out trying to connect to the driver application, and then dies with the exception below. The master then spawns another executor on a different worker, which does the same thing. The application lives until the master (and workers) are restarted.

This happens to many jobs at once, all right around the same time, two or three times a day, where they all get stuck. Before and after this blip, applications start, get resources, finish, and are marked as finished properly. The blip is mostly conjecture on my part; I have no hard evidence that it exists other than my identification of the pattern in the Running Applications table. See http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png : the applications started before the blip at 1.9 hours ago still have active drivers, all the applications started 1.9 hours ago do not, and the applications started less than 1.9 hours ago (at the top of the table) do in fact have active drivers.

Deploy mode:
- PySpark drivers running on one node outside the cluster, scheduled by a cron-like application, not master supervised

Other factoids:
- In most places, we call sc.stop() explicitly before shutting down our driver process
- Here's the sum total of Spark configuration options we don't set to the default:
{code}
spark.cores.max: 30
spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs
spark.eventLog.enabled: true
spark.executor.memory: 7g
spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/
spark.io.compression.codec: lzf
spark.ui.killEnabled: true
{code}
- The exception the executors die with is this:
{code}
14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
14/11/19 19:42:37 INFO Remoting: Starting remoting
14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682]
14/11/19 19:42:38 INFO Utils: Successfully started service 'driverPropsFetcher' on port 37682.
14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: spark-etl1.chi.shopify.com/172.16.126.88:58849
14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
Exception in thread main java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs
{code}
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230910#comment-14230910 ] Mark Hamstra commented on SPARK-4498:
---
I'd argue against reverting SPARK-2425 on the grounds that a long-running application being killed when it is still able to make progress is a worse bug than Executors repeatedly trying to run an application that no longer exists. Either way, it seems to me that the existing logic may not be too far from being right. The flaw simply seems to be that an Executor process starting and successfully connecting to stderr and stdout are necessary but not sufficient conditions for that Executor to be transitioned to RUNNING. If the Executor doesn't become RUNNING until it succeeds in connecting to its Application, then I think the problem is almost entirely, and perhaps completely, solved. (Although I think SPARK-2424 should also be implemented.)
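A rough sketch of the state-change semantics Mark describes, using hypothetical names (Spark's actual {{ExecutorState}} values and transitions may differ): distinguish an executor whose process has merely started from one that has registered with its application, and only treat the latter as RUNNING.
{code}
// Hypothetical illustration of the proposed semantics, not actual Spark code.
object ExecutorLifecycleSketch {
  sealed trait ExecutorState
  case object LAUNCHED extends ExecutorState // process started; stdout/stderr connected
  case object RUNNING  extends ExecutorState // successfully connected to its application
  case object FAILED   extends ExecutorState

  // Under this scheme, only a confirmed connection to the application justifies
  // resetting the Master's consecutive-failure counter; a bare process launch does
  // not, so executors of a dead driver can no longer keep the counter at zero.
  def shouldResetRetryCount(state: ExecutorState): Boolean = state == RUNNING
}
{code}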
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230926#comment-14230926 ] Andrew Or commented on SPARK-4498:
---
Ok, [~joshrosen] and I talked about this more offline. In summary, there are two problems:
(1) If enough executors fail over time, the application dies unconditionally. This is fixed by SPARK-2425.
(2) If the application exited and the Master doesn't know that, then the Master might keep on launching executors for that application indefinitely. This is caused by SPARK-2425.
The goal of this issue is to fix both. This means we can't just revert SPARK-2425, because that does not address (1) (and that patch was released in 1.1.1, so reverting it would be a regression).
We came up with the following solution that addresses both problems: the driver periodically sends a heartbeat to the Master to indicate whether it is healthy and ready to schedule tasks (i.e. whether at least one executor has registered). Meanwhile, as in the existing code, the Master keeps a counter of how many executors have failed. This counter is reset whenever the Master receives a healthy message from the driver. When the counter reaches a configurable threshold, meaning the driver has not reported healthy in a while, the Master fails the application. Additionally, when the Master hasn't received a heartbeat from the driver after some time, it fails the application.
In a nutshell, this prevents the following from happening:
- BEFORE SPARK-2425, we never reset the counter
- AFTER SPARK-2425, we always reset the counter
- Instead, we reset it periodically whenever we get a heartbeat from the driver
- The heartbeat also provides a mechanism for the Master to detect whether an application has exited, in addition to relying on the Akka DisassociatedEvent
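To make the bookkeeping in this proposal concrete, here is a hedged sketch (the names, threshold, and timeout are placeholders, not the actual patch or its configuration keys):
{code}
// Hypothetical illustration of the heartbeat-based detector described above.
object HeartbeatFailureDetectorSketch {
  val MaxConsecutiveFailures = 10     // assumed configurable threshold
  val HeartbeatTimeoutMs = 60 * 1000L // assumed configurable timeout

  final case class AppState(
      var consecutiveExecutorFailures: Int = 0,
      var lastHealthyHeartbeatMs: Long = 0L)

  // The driver reports that it is healthy (i.e. at least one executor has registered).
  def onDriverHeartbeat(app: AppState, healthy: Boolean, nowMs: Long): Unit =
    if (healthy) {
      app.consecutiveExecutorFailures = 0
      app.lastHealthyHeartbeatMs = nowMs
    }

  // The Master learns that an executor exited abnormally.
  def onExecutorFailure(app: AppState): Unit =
    app.consecutiveExecutorFailures += 1

  // Periodic check on the Master: fail the application either because executors keep
  // dying while the driver never reports healthy, or because heartbeats have stopped
  // arriving entirely (the driver has exited).
  def shouldFailApplication(app: AppState, nowMs: Long): Boolean =
    app.consecutiveExecutorFailures >= MaxConsecutiveFailures ||
      (nowMs - app.lastHealthyHeartbeatMs) > HeartbeatTimeoutMs
}
{code}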
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230928#comment-14230928 ] Andrew Or commented on SPARK-4498:
---
Yes [~markhamstra], SPARK-2424 will actually be included in this proposal.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230938#comment-14230938 ] Patrick Wendell commented on SPARK-4498:
---
Going back to the original comment by Josh. Just wondering - what is the basis you have for assuming that driver disassociation is not working in master? Is there a known reason for this, or is it just speculation based on observed logs?
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230948#comment-14230948 ] Josh Rosen commented on SPARK-4498:
---
{quote}
What is the basis you have for assuming that driver disassociation is not working in master? Is there a known reason for this or is it just speculation based on observed logs.
{quote}
I suppose it's just speculation, but I think it's a good idea to include a defensive / preemptive fix for this. We don't know for sure whether driver disassociation was broken / buggy in earlier releases, since the application would still eventually get killed when running with the old failure detection code; it's possible that the executor failure handling bug might have uncovered a long-standing bug that was masked by correct failure handling code. Based on chatting with Andrew, I think that we can ship a pretty small patch that adds a heartbeat mechanism and solves both issues. I'm working on a prototype now and hope to have a really rough version out soon for an initial round of reviews.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230987#comment-14230987 ] Apache Spark commented on SPARK-4498:
---
User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3548
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230989#comment-14230989 ] Josh Rosen commented on SPARK-4498:
---
I've submitted a really rough-draft WIP PR for the heartbeat mechanism that we've discussed. I'll be back later tonight to address any comments and finish writing tests. If you have any spare review time, please take a look.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231119#comment-14231119 ] Apache Spark commented on SPARK-4498:
---
User 'markhamstra' has created a pull request for this issue: https://github.com/apache/spark/pull/3550
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228937#comment-14228937 ] Josh Rosen commented on SPARK-4498:
---
Hi [~airhorns],

I finally got a chance to look into this and, based on reading the code, I have a theory about what might be happening. If you look at the [current Master.scala file|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L668], you'll notice that there are only two situations where the standalone Master removes applications:
- The Master receives a DisassociatedEvent due to the application actor shutting down and calls {{finishApplication}}.
- An executor exited with a non-zero exit status and the maximum number of executor failures has been exceeded.

Now, imagine that for some reason the standalone Master does not receive a DisassociatedEvent. When executors eventually start to die, the standalone Master will discover this via ExecutorStateChanged. If it hasn't hit the maximum number of executor failures, [it will attempt to re-schedule the application|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L325] and obtain new resources. If a new executor is granted, this will [cause the failed executor count to reset to zero|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L313], leading to a sort of livelock behavior where executors die because they can't contact the application but keep being launched, because executors keep entering the ExecutorState.RUNNING state ([it looks like|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L148] executors transition to this state when they launch, not once they've registered with the driver).

It looks like the line
{code}
if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
{code}
was introduced in SPARK-2425, after the earliest commit that you mentioned, so this seems to be a regression in 1.2.0. I don't think that we should revert SPARK-2425, since that fixes another fairly important bug. Instead, I'd like to try to figure out how an application could fail without a DisassociatedEvent causing it to be removed. Could this be due to our use of non-standard Akka timeout / failure detector settings? I would think that we'd still get a DisassociatedEvent when a network connection was closed or something. Maybe we could switch to relying on our own explicit heartbeats for failure detection, like we do elsewhere in Spark. [~markhamstra], do you have any ideas here?
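To condense the two removal paths above into one place, here is a simplified, hypothetical sketch (not the actual Master.scala control flow); if the DisassociatedEvent never arrives and the retry counter keeps being reset, neither branch ever fires and the application is never removed:
{code}
// Simplified illustration of the only two application-removal paths described above.
object ApplicationRemovalPathsSketch {
  sealed trait MasterEvent
  // Path 1: the application actor shut down and the Master saw the DisassociatedEvent.
  final case class DriverDisassociated(appId: String) extends MasterEvent
  // Path 2: an executor exited with a non-zero status, bumping the consecutive-failure count.
  final case class ExecutorFailed(appId: String, consecutiveFailures: Int) extends MasterEvent

  def shouldRemoveApplication(event: MasterEvent, maxNumRetry: Int = 10): Boolean =
    event match {
      case DriverDisassociated(_)      => true
      case ExecutorFailed(_, failures) => failures >= maxNumRetry
    }
}
{code}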
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228943#comment-14228943 ] Josh Rosen commented on SPARK-4498:
---
Adding 1.1.1 as an affected version, too, since SPARK-2425 was backported to that release.
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228947#comment-14228947 ] Mark Hamstra commented on SPARK-4498: --- On a quick look-through, your analysis looks likely to be correct, [~joshrosen]. Making sure that failed applications are always accompanied by a DisassociatedEvent would be a good thing. The belt-and-suspenders fix would be to also change the executor state-change semantics, so that either RUNNING means not just that the executor process is running but also that it has successfully connected to the application, or else an additional executor state (perhaps REGISTERED) is introduced, along with the state transitions and finer-grained state logic controlling executor restart and application removal.
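To make the REGISTERED idea concrete, here is a minimal, hypothetical sketch; the names and transitions below are illustrative only and do not reproduce the actual {{ExecutorState}} definition in Spark:

{code}
// Hypothetical sketch of the extra executor state suggested above. REGISTERED would
// mean "the executor has successfully connected back to the application", as opposed
// to RUNNING, which only means "the executor process was launched".
object SketchExecutorState extends Enumeration {
  type SketchExecutorState = Value
  val LAUNCHING, RUNNING, REGISTERED, KILLED, FAILED, LOST, EXITED = Value

  private val dead = Set(KILLED, FAILED, LOST, EXITED)

  // Only failures of executors that never reached REGISTERED would count
  // towards removing the application.
  def countsTowardsAppFailure(previous: Value, current: Value): Boolean =
    dead.contains(current) && previous != REGISTERED
}
{code}

With a state like this, the retry logic from SPARK-2425 could be restricted to executors that exited without ever registering with the driver.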
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228952#comment-14228952 ] Josh Rosen commented on SPARK-4498: --- In addition to exploring the missing-DisassociatedEvent theory, it might also be worthwhile to brainstorm whether problems at other steps in the cleanup process could cause an application to fail to be removed. I'm not sure that a single missing DisassociatedEvent could explain the blip behavior observed here, where an entire group of applications fails to be marked as completed / failed. In the DisassociatedEvent handler, we index into {{addressToApp}} to determine which app corresponded to the DisassociatedEvent:

{code}
case DisassociatedEvent(_, address, _) => {
  // The disconnected client could've been either a worker or an app; remove whichever it was
  logInfo(s"$address got disassociated, removing it.")
  addressToWorker.get(address).foreach(removeWorker)
  addressToApp.get(address).foreach(finishApplication)
  if (state == RecoveryState.RECOVERING && canCompleteRecovery) { completeRecovery() }
}
{code}

If the {{addressToApp}} entry were missing or wrong, then we wouldn't properly clean up the app. However, I don't think that there should be any problems here, because each application actor system should have its own distinct address and Akka's {{Address}} class properly implements hashCode / equals. Even if drivers run on the same host, their actor systems should have different port numbers. Continuing along:

{code}
def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
  if (apps.contains(app)) {
    logInfo("Removing app " + app.id)
{code}

Is there any way that the {{apps}} HashSet could fail to contain {{app}}? I don't think so: {{ApplicationInfo}} doesn't override equals/hashCode, but I don't think that's a problem since we only create one ApplicationInfo per app, so the default object-identity comparison should be fine. We should probably log an error if we call {{removeApplication}} on an application that has already been removed, though. (Also, why do we need the {{apps}} HashSet when we could just use {{idToApp.values}}?)
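As a self-contained illustration of the defensive logging suggested above, here is a hedged sketch; {{ApplicationInfo}} and the collections are simplified stand-ins for the Master's bookkeeping, not the real Spark classes:

{code}
import scala.collection.mutable

// Simplified stand-in for the Master's per-application record; not the real Spark class.
case class ApplicationInfo(id: String)

object RemovalSketch {
  val apps = new mutable.HashSet[ApplicationInfo]
  val idToApp = new mutable.HashMap[String, ApplicationInfo]

  def removeApplication(app: ApplicationInfo): Unit = {
    if (!apps.contains(app)) {
      // Today this case is silently ignored; logging it would make a missed
      // or double removal visible in the master logs.
      Console.err.println(s"Asked to remove app ${app.id}, but it is not registered")
    } else {
      println(s"Removing app ${app.id}")
      apps -= app
      idToApp -= app.id
      // ... the real method would also clean up addressToApp, executors, the UI, etc.
    }
  }
}
{code}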
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228966#comment-14228966 ] Josh Rosen commented on SPARK-4498: --- Here's an interesting pattern to grep for in all-master-logs-around-blip.txt: {{sparkdri...@spark-etl1.chi.shopify.com:52047}}. Note that this log is in reverse-chronological order. The earliest occurrence is in a DisassociatedEvent log message:

{code}
2014-11-19_18:48:31.34508 14/11/19 18:48:31 ERROR EndpointWriter: AssociationError [akka.tcp://sparkmas...@dn05.chi.shopify.com:7077] -> [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047]: Error [Shut down address: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047] [
2014-11-19_18:48:31.34510 akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047
2014-11-19_18:48:31.34511 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
2014-11-19_18:48:31.34512 ]
2014-11-19_18:48:31.34521 14/11/19 18:48:31 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.16.126.88%3A48040-1355#-59270061] was not delivered. [2859] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-11-19_18:48:31.34603 14/11/19 18:48:31 INFO Master: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047 got disassociated, removing it.
2014-11-19_18:48:31.20255 14/11/19 18:48:31 INFO Master: Removing executor app-20141119184815-1316/7 because it is EXITED
{code}

Even though INFO-level logging is enabled, there's no {{INFO Master: Removing app}} message near this event. The entire log contains many repetitions of this same DisassociatedEvent log. The same log also contains many executors that launch and immediately fail:

{code}
2014-11-19_18:52:51.84000 14/11/19 18:52:51 INFO Master: Launching executor app-20141119184815-1313/75 on worker worker-20141118172622-dn19.chi.shopify.com-38498
2014-11-19_18:52:51.83981 14/11/19 18:52:51 INFO Master: Removing executor app-20141119184815-1313/67 because it is EXITED
{code}

I couldn't find a {{Removing app app-20141119184815-1313}} event.
Another interesting thing: even though it looks like this log contains information for 39 drivers, there are 100 disassociation events:

{code}
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | uniq | wc -l
39
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | wc -l
100
{code}
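For reference, a throwaway Scala sketch of the same cross-check: it assumes the file name and message substrings shown in the log excerpts and grep commands above, and is only meant to compare disassociation events against logged application removals:

{code}
import scala.io.Source

// Compare how many "got disassociated" events the master logged against how many
// "Removing app" lines appear. The file name and message substrings are assumptions
// taken from the excerpts above, not guarantees about the log format.
object BlipCrossCheck extends App {
  val lines = Source.fromFile("all-master-logs-around-blip.txt").getLines().toVector
  val disassociations = lines.count(_.contains("got disassociated"))
  val appRemovals = lines.count(_.contains("Removing app app-"))
  println(s"disassociation events: $disassociations, app removals: $appRemovals")
}
{code}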
[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications
[ https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218491#comment-14218491 ] Harry Brundage commented on SPARK-4498: --- For the simple canary Spark application (whose master logs are attached), the first executor does exactly what it is supposed to, and then the driver shuts down and it disassociates. Then a second executor is started, when it never should have been. The first executor's stderr:

{code}
14/11/19 18:48:16 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
14/11/19 18:48:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/19 18:48:16 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:16 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:17 INFO Slf4jLogger: Slf4jLogger started
14/11/19 18:48:17 INFO Remoting: Starting remoting
14/11/19 18:48:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverpropsfetc...@dn42.chi.shopify.com:36365]
14/11/19 18:48:17 INFO Utils: Successfully started service 'driverPropsFetcher' on port 36365.
14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/11/19 18:48:18 INFO Slf4jLogger: Slf4jLogger started
14/11/19 18:48:18 INFO Remoting: Starting remoting
14/11/19 18:48:18 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkexecu...@dn42.chi.shopify.com:39974]
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
14/11/19 18:48:18 INFO Utils: Successfully started service 'sparkExecutor' on port 39974.
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/CoarseGrainedScheduler
14/11/19 18:48:18 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker
14/11/19 18:48:18 INFO WorkerWatcher: Successfully connected to akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:18 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/MapOutputTracker
14/11/19 18:48:18 INFO AkkaUtils: Connecting to BlockManagerMaster: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/BlockManagerMaster
14/11/19 18:48:18 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20141119184818-e5ae
14/11/19 18:48:18 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/11/19 18:48:18 INFO LogReporter: Creating metrics output file: /tmp/spark-metrics
14/11/19 18:48:18 INFO NettyBlockTransferService: Server created on 44406
14/11/19 18:48:18 INFO BlockManagerMaster: Trying to register BlockManager
14/11/19 18:48:18 INFO BlockManagerMaster: Registered BlockManager
14/11/19 18:48:18 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/HeartbeatReceiver
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Got assigned task 0
14/11/19 18:48:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/19 18:48:18 INFO Executor: Fetching hdfs://nn01.chi.shopify.com:8020/tmp/starscream_pyfiles_cache/packages-68badf293ff9e4d13929073572892d7f3d0a3546.egg with timestamp 1416422896241
14/11/19 18:48:19 INFO TorrentBroadcast: Started reading broadcast variable 1
14/11/19 18:48:19 INFO MemoryStore: ensureFreeSpace(3391) called with curMem=0, maxMem=278302556
14/11/19 18:48:19 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.3 KB, free 265.4 MB)
14/11/19 18:48:19 INFO BlockManagerMaster: Updated info of block
{code}