[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230807#comment-14230807
 ] 

Josh Rosen commented on SPARK-4498:
---

[~andrewor14] and I just had a long discussion about this.  To recap things, I 
think that there are two distinct bugs:

1. Driver disconnection is not properly detected by the Master.
2. The application failure detection logic introduced by SPARK-2425 is broken.

The second bug can mask the first one.  If something is wrong with the logic 
for detecting a disconnected application / driver, then the application might 
still eventually be removed if the "fail an application when its executors 
fail" logic worked correctly, since every executor assigned to the exited 
driver will fail.

To simulate bugs in the "driver disconnected" logic, we can simply comment out 
the relevant line in the Master's {{DisassociatedEvent}} handler.  After doing 
this, I was able to reproduce behavior similar to the issue reported in this 
ticket.  With a local standalone cluster with one master, one worker, and two 
executor slots, I was able to start {{spark-shell}}, kill it with {{kill -9}}, 
then browse to the Master web UI and see that executors were continually 
launching and failing even though the driver had exited (this continued for 30+ 
minutes; it's still happening).

I think that the logic for SPARK-2425 is completely broken in the sense that 
there are some cluster configurations for which it will _never_ kill an 
application even though every executor that it launches fails, and that there 
are very few scenarios where it will ever fail an application.  To help explain 
this, I've created a commit that factors out the failure detection logic into 
its own class: 
https://github.com/JoshRosen/spark/commit/87d7960d660b218a9a965fd7d344e2aae0250128.
  That commit also includes unit tests of the failure detection logic in 
isolation, which help to illustrate the current bugs.

If you look at the current (broken) failure detection logic, two conditions 
must be met for an application to be marked as failed:

1. No executor can be running: {{!execs.exists(_.state == 
ExecutorState.RUNNING)}}
2. More than {{MAX_NUM_RETRY}} consecutive executor failure events must have 
been received: {{appInfo.incrementRetryCount() >= ApplicationState.MAX_NUM_RETRY}}

With the current logic, though, these conditions can almost never be met.  The 
current (hardcoded) value of {{MAX_NUM_RETRY}} is 10.  Imagine that I have a 
cluster that only has room for 2 concurrent executors and imagine that all 
executors fail immediately after launching.  In order for the retry count to be 
incremented past the threshold, we must have at least 10 executor failures.  In 
order for this to happen, though, executor launches must occur in between some 
of those failures, since we can only have 2 executors running at the same time. 
 When an executor fails and a new one is relaunched, the new executor enters 
the RUNNING state and sends a message back to the Master, which causes the 
master to reset the retry count back to 0.  Therefore, it's impossible for the 
failure detector to report failure in a cluster with fewer than 10 executor 
slots, even if every executor crashes.
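For illustration, here is a minimal, self-contained sketch (hypothetical class 
and method names, not the actual Master code) of the behavior just described:

{code}
// Minimal sketch of the failure-detection behavior described above.
// The class and method names are made up for illustration only.
class RetryBasedFailureDetectorSketch(maxNumRetry: Int = 10) {
  private var retryCount = 0
  private var runningExecutors = Set.empty[String]

  // When a (re)launched executor reports RUNNING, the retry count is reset.
  def onExecutorRunning(execId: String): Unit = {
    runningExecutors += execId
    retryCount = 0
  }

  // On an abnormal executor exit, returns true iff the application would be
  // marked FAILED under the two conditions above: no executor is running AND
  // the consecutive-failure count has reached the maximum.
  def onExecutorFailed(execId: String): Boolean = {
    runningExecutors -= execId
    retryCount += 1
    runningExecutors.isEmpty && retryCount >= maxNumRetry
  }
}
{code}

In a two-slot cluster where every executor dies right after launch, 
{{onExecutorRunning}} keeps firing in between failures and zeroing the counter, 
so {{onExecutorFailed}} can never return true, which is exactly the livelock 
described above.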

Even for larger clusters, it's still nearly impossible for the current logic to 
declare failure.  Because of the "no executor can be running" condition, all 
executors must be dead in order for it to declare failure.  The Master 
immediately calls {{schedule()}} after an executor failure, though, so in most 
cases a new executor will launch before we can see 10 consecutive failures (the 
test suite in my commit includes a contrived execution where it does declare 
failure, though).

So, what do we do, considering that we need to fix this before 1.2.0 and as a 
bugfix to be included in 1.1.2?  I don't think that we should just revert 
SPARK-2425, since it's not okay to introduce one regression to fix another.  A 
correct application failure detector needs to strike a balance between 
protecting long-running applications from being killed by occasional failures 
and quickly detecting buggy applications that will never work.  Long-running 
applications imply that we can't have counters / state that monotonically 
progress towards failure, 
since we can assume that over an infinite execution there will be an infinite 
number of executor failures.  Intuitively, I think we want something that's 
window-based: if a large fraction of recently launched executors have failed, 
then declare that the application has failed.
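As a rough sketch of that window-based idea (the class name and thresholds 
below are made up, not a proposed API):

{code}
// Illustrative sliding-window failure detector; not Spark code.
import scala.collection.mutable

class WindowedAppFailureDetectorSketch(
    windowSize: Int = 10,
    maxFailureFraction: Double = 0.8) {

  // true = a launched executor failed, false = it was confirmed healthy
  private val recentOutcomes = mutable.Queue.empty[Boolean]

  private def record(failed: Boolean): Unit = {
    recentOutcomes.enqueue(failed)
    if (recentOutcomes.size > windowSize) recentOutcomes.dequeue()
  }

  def onExecutorConfirmedHealthy(): Unit = record(failed = false)
  def onExecutorFailed(): Unit = record(failed = true)

  // Fail only when we have a full window of recent launches and most of them
  // failed; a long-running app with occasional failures never trips this,
  // because healthy launches keep pushing old failures out of the window.
  def shouldFailApplication: Boolean = {
    recentOutcomes.size >= windowSize &&
      recentOutcomes.count(identity).toDouble / recentOutcomes.size >= maxFailureFraction
  }
}
{code}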

Unfortunately, the Master does not receive any signals about non-faulty 
executors (until they exit cleanly).  One solution would be to add a 
driver-to-Master RPC that acknowledges that launched executors are in a good 
state (e.g. after a task has been sent to that executor).  This would allow us 
to properly implement a fail the application if the last _n_ launched 

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230895#comment-14230895
 ] 

Patrick Wendell commented on SPARK-4498:


Hey Josh,

The proposal you gave here seems to be complex. Why not just revert SPARK-2425? 
It's only a regression from something that was released in the last few days 
and which we may need to revert anyways for a quick Spark 1.1.2 release due to 
its severity. It seems like it might be worth it to just revert SPARK-2425 and 
put out a quick 1.1.2 release without the feature, then revert it in the 1.2 
branch also.

 Standalone Master can fail to recognize completed/failed applications
 -

 Key: SPARK-4498
 URL: https://issues.apache.org/jira/browse/SPARK-4498
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.1.1, 1.2.0
 Environment:  - Linux dn11.chi.shopify.com 3.2.0-57-generic 
 #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
  - Standalone Spark built from 
 apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
  - Python 2.7.3
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
  - 1 Spark master, 40 Spark workers with 32 cores a piece and 60-90 GB of 
 memory a piece
  - All client code is PySpark
Reporter: Harry Brundage
Priority: Blocker
 Attachments: all-master-logs-around-blip.txt, 
 one-applications-master-logs.txt


 We observe the Spark standalone master indefinitely failing to detect that a 
 driver application has completed after the driver process has shut down, 
 leaving that driver's resources consumed indefinitely. The master reports 
 applications as Running, but the driver process has long since terminated. 
 The master continually spawns one executor for the application. It boots, 
 times out trying to connect to the driver application, and then dies with the 
 exception below. The master then spawns another executor on a different 
 worker, which does the same thing. The application lives until the master 
 (and workers) are restarted. 
 This happens to many jobs at once, all right around the same time, two or 
 three times a day, and they all get stuck. Before and after this blip, 
 applications start, get resources, finish, and are marked as finished 
 properly. The blip is mostly conjecture on my part; I have no hard evidence 
 that it exists other than my identification of the pattern in the Running 
 Applications table. See 
 http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png
  : the applications started before the blip at 1.9 hours ago still have 
 active drivers. All the applications started 1.9 hours ago do not, and the 
 applications started less than 1.9 hours ago (at the top of the table) do in 
 fact have active drivers.
 Deploy mode:
  - PySpark drivers running on one node outside the cluster, scheduled by a 
 cron-like application, not master supervised
  
 Other factoids:
  - In most places, we call sc.stop() explicitly before shutting down our 
 driver process
  - Here's the sum total of spark configuration options we don't set to the 
 default:
 {code}
 spark.cores.max: 30
 spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs
 spark.eventLog.enabled: true
 spark.executor.memory: 7g
 spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/
 spark.io.compression.codec: lzf
 spark.ui.killEnabled: true
 {code}
  - The exception the executors die with is this:
 {code}
 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal 
 handlers for [TERM, HUP, INT]
 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
 users with modify permissions: Set(spark, azkaban)
 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
 14/11/19 19:42:37 INFO Remoting: Starting remoting
 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682]
 14/11/19 19:42:38 INFO Utils: Successfully started service 
 'driverPropsFetcher' on port 37682.
 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote 
 address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is 
 now gated for 5000 ms, all messages to this address will be delivered to dead 
 letters. Reason: Connection refused: 
 spark-etl1.chi.shopify.com/172.16.126.88:58849
 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException 
 as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures 
 timed out after [30 seconds]
 Exception in thread main java.lang.reflect.UndeclaredThrowableException: 
 Unknown exception in doAs

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230910#comment-14230910
 ] 

Mark Hamstra commented on SPARK-4498:
-

I'd argue against reverting 2425 on the grounds that a long-running application 
being killed when it is still able to make progress is a worse bug than 
Executors repeatedly trying to run an application that no longer exists.

Either way, it seems to me that the existing logic may not be too far from 
being right.  The flaw simply seems to be that an Executor process starting and 
successfully connecting its stderr and stdout are treated as sufficient 
conditions for transitioning that Executor to RUNNING, when they should only be 
necessary ones.  If the Executor doesn't become RUNNING until it succeeds in 
connecting to its Application, then I think the problem is almost entirely and 
perhaps completely solved.  
(Although I think SPARK-2424 should also be implemented.) 

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230926#comment-14230926
 ] 

Andrew Or commented on SPARK-4498:
--

Ok, [~joshrosen] and I talked about this more offline.

In summary, there are two problems:
(1) If enough executors fail over time, the application dies unconditionally. 
This is fixed by SPARK-2425.
(2) If the application exited and the Master doesn't know that, then the Master 
might keep on launching executors for that application indefinitely. This is 
caused by SPARK-2425.

The goal of this issue is to fix both. This means we can't just revert 
SPARK-2425, because that does not address (1) (and that patch was released in 
1.1.1, so it would be a regression if we reverted it). We came up with the 
following solution that addresses both problems:

The driver periodically sends a heartbeat to the Master to indicate whether it 
is healthy and ready to schedule tasks (i.e. whether at least one executor has 
registered). Meanwhile, as in the existing code, the Master keeps a counter of 
how many executors have failed. This counter is reset whenever the Master 
receives a "healthy" message from the driver. When the counter reaches a 
configurable threshold, meaning the driver has not reported itself healthy in a 
while, the Master fails the application. Additionally, if the Master hasn't 
received a heartbeat from the driver after some time, it fails the application.

In a nutshell, this prevents the following from happening:
- BEFORE SPARK-2425, we never reset the counter
- AFTER SPARK-2425, we always reset the counter
- Instead, we reset it periodically whenever we get a heartbeat from the driver
- The heartbeat also provides a mechanism for the Master to detect whether an 
application has exited, in addition to relying on the Akka DisassociatedEvent
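
As a rough sketch of the Master-side bookkeeping in this proposal (the names 
and thresholds below are illustrative, not the actual patch):

{code}
// Illustrative sketch of the proposed bookkeeping; not the actual patch.
class AppMonitorSketch(maxFailedExecutors: Int, heartbeatTimeoutMs: Long) {
  private var failedExecutorCount = 0
  private var lastHealthyHeartbeatMs = System.currentTimeMillis()

  // A driver heartbeat saying "I am healthy and have at least one registered
  // executor" is the only thing that resets the failure counter.
  def onDriverHealthyHeartbeat(): Unit = {
    failedExecutorCount = 0
    lastHealthyHeartbeatMs = System.currentTimeMillis()
  }

  def onExecutorFailed(): Unit = {
    failedExecutorCount += 1
  }

  // Fail the app if executors keep failing without the driver ever reporting
  // healthy, or if the driver stops heartbeating entirely.
  def shouldFailApplication(nowMs: Long = System.currentTimeMillis()): Boolean = {
    failedExecutorCount >= maxFailedExecutors ||
      (nowMs - lastHealthyHeartbeatMs) > heartbeatTimeoutMs
  }
}
{code}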

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230928#comment-14230928
 ] 

Andrew Or commented on SPARK-4498:
--

Yes, [~markhamstra], SPARK-2424 will actually be included in this proposal.

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230938#comment-14230938
 ] 

Patrick Wendell commented on SPARK-4498:


Going back to the original comment by Josh. Just wondering: what is the basis 
you have for assuming that driver disassociation is not working in master? Is 
there a known reason for this, or is it just speculation based on observed logs?

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230948#comment-14230948
 ] 

Josh Rosen commented on SPARK-4498:
---

{quote}
What is the basis you have for assuming that driver disassociation is not 
working in master? Is there a known reason for this or is it just speculation 
based on observed logs?
{quote}

I suppose it's just speculation, but I think it's a good idea to include a 
defensive / preemptive fix for this.  We don't know for sure whether driver 
disassociation was broken / buggy in earlier releases, since the application 
would still eventually get killed when running with the old failure detection 
code; it's possible that the executor failure handling bug might have uncovered 
a long-standing bug that was masked by correct failure handling code.

Based on chatting with Andrew, I think that we can ship a pretty small patch 
that adds a heartbeat mechanism and solves both issues.  I'm working on a 
prototype now and hope to have a really rough version out soon for an initial 
round of reviews.

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230987#comment-14230987
 ] 

Apache Spark commented on SPARK-4498:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/3548

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230989#comment-14230989
 ] 

Josh Rosen commented on SPARK-4498:
---

I've submitted a really rough-draft WIP PR for the heartbeat mechanism that 
we've discussed.  I'll be back later tonight to address any comments and finish 
writing tests.  If you have any spare review time, please take a look.

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-12-01 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231119#comment-14231119
 ] 

Apache Spark commented on SPARK-4498:
-

User 'markhamstra' has created a pull request for this issue:
https://github.com/apache/spark/pull/3550

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-11-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228937#comment-14228937
 ] 

Josh Rosen commented on SPARK-4498:
---

Hi [~airhorns],

I finally got a chance to look into this and, based on reading the code, I have 
a theory about what might be happening.  If you look at the [current 
Master.scala 
file|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L668],
 you'll notice that there are only two situations where the standalone Master 
removes applications:

- The master receives a DisassociatedEvent due to the application actor 
shutting down and calls {{finishApplication}}.
- An executor exited with a non-zero exit status and the maximum number of 
executor failures has been exceeded.
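
Paraphrasing those two removal paths as a self-contained sketch (the helper 
names below are stubs, not the real Master.scala source):

{code}
// Sketch of the two removal paths described above, with stubbed helpers.
class MasterRemovalPathsSketch(maxNumRetry: Int = 10) {
  private var retryCount = 0

  private def finishApplication(appId: String): Unit = println(s"finishing $appId")
  private def removeApplication(appId: String): Unit = println(s"removing $appId")
  private def schedule(): Unit = println("re-scheduling executors")

  // Path 1: a DisassociatedEvent for the application's actor finishes the app.
  def onDriverDisassociated(appId: String): Unit = finishApplication(appId)

  // Path 2: an abnormal executor exit only removes the app once the retry
  // count reaches the maximum; otherwise the Master just re-schedules, which
  // may launch a replacement executor.
  def onExecutorExited(appId: String, normalExit: Boolean): Unit = {
    if (!normalExit) {
      retryCount += 1
      if (retryCount >= maxNumRetry) removeApplication(appId) else schedule()
    }
  }
}
{code}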

Now, imagine that for some reason the standalone Master does not receive a 
DisassociatedEvent.  When executors eventually start to die, the standalone 
master will discover this via ExecutorStateChanged.  If it hasn't hit the 
maximum number of executor failures, [it will attempt to re-schedule the 
application|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L325]
 and obtain new resources.  If a new executor is granted, this will [cause the 
maximum failed executors count to reset to 
zero|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L313],
 leading to a sort of livelock behavior where executors die because they can't 
contact the application but keep being launched because executors keep entering 
the ExecutorState.RUNNING state ([it looks 
like|https://github.com/apache/spark/blob/317e114e11669899618c7c06bbc0091b36618f36/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L148]
 executors transition to this state when they launch, not once they've 
registered with the driver).

It looks like the line

{code}
  if (state == ExecutorState.RUNNING) { appInfo.resetRetryCount() }
{code}

was introduced in SPARK-2425.  It looks like this was introduced after the 
earliest commit that you mentioned, so it seems like this is a regression in 
1.2.0.  I don't think that we should revert SPARK-2425, since that fixes 
another fairly important bug.  Instead, I'd like to try to figure out how an 
application could fail without a DisassociatedEvent causing it to be removed.

Could this be due to our use of non-standard Akka timeout / failure detector 
settings?  I would think that we'd still get a DisassociatedEvent when a 
network connection was closed or something.  Maybe we could switch to relying 
on our own explicit heartbeats for failure detection, like we do elsewhere in 
Spark.

[~markhamstra], do you have any ideas here?

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-11-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228943#comment-14228943
 ] 

Josh Rosen commented on SPARK-4498:
---

Adding 1.1.1 as an affected version as well, since SPARK-2425 was backported to 
that release.

 Standalone Master can fail to recognize completed/failed applications
 -

 Key: SPARK-4498
 URL: https://issues.apache.org/jira/browse/SPARK-4498
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Affects Versions: 1.1.1, 1.2.0
 Environment:  - Linux dn11.chi.shopify.com 3.2.0-57-generic 
 #87-Ubuntu SMP 3 x86_64 x86_64 x86_64 GNU/Linux
  - Standalone Spark built from 
 apache/spark#c6e0c2ab1c29c184a9302d23ad75e4ccd8060242
  - Python 2.7.3
 java version 1.7.0_71
 Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
 Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)
  - 1 Spark master, 40 Spark workers with 32 cores a piece and 60-90 GB of 
 memory a piece
  - All client code is PySpark
Reporter: Harry Brundage
Priority: Blocker
 Attachments: all-master-logs-around-blip.txt, 
 one-applications-master-logs.txt


 We observe the spark standalone master not detecting that a driver 
 application has completed after the driver process has shut down 
 indefinitely, leaving that driver's resources consumed indefinitely. The 
 master reports applications as Running, but the driver process has long since 
 terminated. The master continually spawns one executor for the application. 
 It boots, times out trying to connect to the driver application, and then 
 dies with the exception below. The master then spawns another executor on a 
 different worker, which does the same thing. The application lives until the 
 master (and workers) are restarted. 
 This happens to many jobs at once, all right around the same time, two or 
 three times a day, where they all get suck. Before and after this blip 
 applications start, get resources, finish, and are marked as finished 
 properly. The blip is mostly conjecture on my part, I have no hard evidence 
 that it exists other than my identification of the pattern in the Running 
 Applications table. See 
 http://cl.ly/image/2L383s0e2b3t/Screen%20Shot%202014-11-19%20at%203.43.09%20PM.png
  : the applications started before the blip at 1.9 hours ago still have 
 active drivers. All the applications started 1.9 hours ago do not, and the 
 applications started less than 1.9 hours ago (at the top of the table) do in 
 fact have active drivers.
 Deploy mode:
  - PySpark drivers running on one node outside the cluster, scheduled by a 
 cron-like application, not master supervised
  
 Other factoids:
  - In most places, we call sc.stop() explicitly before shutting down our 
 driver process
  - Here's the sum total of spark configuration options we don't set to the 
 default:
 {code}
 spark.cores.max: 30
 spark.eventLog.dir: hdfs://nn.shopify.com:8020/var/spark/event-logs
 spark.eventLog.enabled: true
 spark.executor.memory: 7g
 spark.hadoop.fs.defaultFS: hdfs://nn.shopify.com:8020/
 spark.io.compression.codec: lzf
 spark.ui.killEnabled: true
 {code}
  - The exception the executors die with is this:
 {code}
 14/11/19 19:42:37 INFO CoarseGrainedExecutorBackend: Registered signal 
 handlers for [TERM, HUP, INT]
 14/11/19 19:42:37 WARN NativeCodeLoader: Unable to load native-hadoop library 
 for your platform... using builtin-java classes where applicable
 14/11/19 19:42:37 INFO SecurityManager: Changing view acls to: spark,azkaban
 14/11/19 19:42:37 INFO SecurityManager: Changing modify acls to: spark,azkaban
 14/11/19 19:42:37 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
 users with modify permissions: Set(spark, azkaban)
 14/11/19 19:42:37 INFO Slf4jLogger: Slf4jLogger started
 14/11/19 19:42:37 INFO Remoting: Starting remoting
 14/11/19 19:42:38 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://driverpropsfetc...@dn13.chi.shopify.com:37682]
 14/11/19 19:42:38 INFO Utils: Successfully started service 
 'driverPropsFetcher' on port 37682.
 14/11/19 19:42:38 WARN Remoting: Tried to associate with unreachable remote 
 address [akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849]. Address is 
 now gated for 5000 ms, all messages to this address will be delivered to dead 
 letters. Reason: Connection refused: 
 spark-etl1.chi.shopify.com/172.16.126.88:58849
 14/11/19 19:43:08 ERROR UserGroupInformation: PriviledgedActionException 
 as:azkaban (auth:SIMPLE) cause:java.util.concurrent.TimeoutException: Futures 
 timed out after [30 seconds]
 Exception in thread "main" java.lang.reflect.UndeclaredThrowableException: 
 

[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-11-29 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228947#comment-14228947
 ] 

Mark Hamstra commented on SPARK-4498:
-

On a quick look-through, your analysis looks likely to be correct, [~joshrosen].

Making sure that failed applications are always accompanied by a 
DisassociatedEvent would be a good thing.  The belt-and-suspenders fix would be 
to also change the executor state-change semantics so that RUNNING means not 
just that the executor process is running but also that it has successfully 
connected to the application, or else to introduce an additional executor state 
(perhaps REGISTERED) along with state transitions and finer-grained state logic 
controlling executor restart and application removal.
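
To make the additional-state idea concrete, here is a minimal sketch 
(hypothetical names only; this is not Spark's actual {{ExecutorState}} 
enumeration), assuming the Master would distinguish "process is up" from 
"connected to the driver" and validate transitions before acting on a 
state-change message:

{code}
// Hypothetical sketch, not Spark code: adds a REGISTERED state between
// "process is running" and "connected to the driver application", plus a
// transition table the Master could consult for finer-grained restart /
// removal decisions.
object SketchExecutorState extends Enumeration {
  val LAUNCHING,   // executor process requested on a worker
      RUNNING,     // process is up, but may not have reached the driver yet
      REGISTERED,  // successfully connected to the driver application
      FAILED,      // exited abnormally
      EXITED = Value  // exited normally

  private val allowed: Map[Value, Set[Value]] = Map(
    LAUNCHING  -> Set(RUNNING, FAILED, EXITED),
    RUNNING    -> Set(REGISTERED, FAILED, EXITED),
    REGISTERED -> Set(FAILED, EXITED),
    FAILED     -> Set.empty[Value],
    EXITED     -> Set.empty[Value]
  )

  // Reject out-of-order or duplicate state-change messages.
  def canTransition(from: Value, to: Value): Boolean = allowed(from).contains(to)
}
{code}

Whether the Master keys application removal off REGISTERED or off a new message 
type is an open design question; the sketch only shows the shape of the state 
machine.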


[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-11-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228952#comment-14228952
 ] 

Josh Rosen commented on SPARK-4498:
---

In addition to exploring the missing DisassociatedEvent theory, it might also 
be worthwhile to brainstorm whether problems at other steps in the cleanup 
process could cause an application to fail to be removed.  I'm not sure that a 
single missing DisassociatedEvent could explain the blip behavior observed 
here, where an entire group of applications fail to be marked as completed / 
failed.

In the DisassociatedEvent handler, we index into {{addressToApp}} to determine 
which app corresponded to the DisassociatedEvent:

{code}
case DisassociatedEvent(_, address, _) => {
  // The disconnected client could've been either a worker or an app; remove whichever it was
  logInfo(s"$address got disassociated, removing it.")
  addressToWorker.get(address).foreach(removeWorker)
  addressToApp.get(address).foreach(finishApplication)
  if (state == RecoveryState.RECOVERING && canCompleteRecovery) { completeRecovery() }
}
{code}

If the {{addressToApp}} entry was empty / wrong, then we wouldn't properly 
clean up the app.  However, I don't think that there should be any problems 
here because each application actor system should have its own distinct address 
and Akka's {{Address}} class properly implements hashCode / equals.  Even if 
drivers run on the same host, their actor systems should have different port 
numbers.
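
As a quick sanity check on that reasoning, here is a tiny illustrative sketch 
(the system name, host, and ports are just example values, and it assumes 
Akka's {{akka.actor.Address(protocol, system, host, port)}} factory) showing 
that two driver actor systems on the same host still yield distinct 
{{addressToApp}} keys as long as their ports differ:

{code}
import akka.actor.Address

// Illustrative only: values are made up for the example. Address equality is
// structural, so addresses differing only in port are distinct map keys.
object AddressKeySketch extends App {
  val driverA = Address("akka.tcp", "sparkDriver", "spark-etl1.chi.shopify.com", 52047)
  val driverB = Address("akka.tcp", "sparkDriver", "spark-etl1.chi.shopify.com", 58849)

  println(driverA == driverB)                   // false: same host, different ports
  println(Map(driverA -> "appA").get(driverB))  // None: no accidental lookup collision
}
{code}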

Continuing along:

{code}
def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
  if (apps.contains(app)) {
    logInfo("Removing app " + app.id)
{code}

Is there any way that the {{apps}} HashSet could fail to contain {{app}}?  I 
don't think so: {{ApplicationInfo}} doesn't override equals/hashCode, but I 
don't think that's a problem since we only create one ApplicationInfo per app, 
so the default object identity comparison should be fine.  We should probably 
log an error if we call {{removeApplication}} on an application that has 
already been removed, though.  (Also, why do we need the {{apps}} HashSet when 
we could just use {{idToApp.values}}?)
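
A minimal sketch of the log-an-error suggestion (simplified, not a patch 
against the actual Master; it assumes the existing {{apps}} set and the 
{{logInfo}}/{{logError}} helpers from the Logging trait):

{code}
// Sketch only: the existing cleanup body is elided; the else branch is the new part.
def removeApplication(app: ApplicationInfo, state: ApplicationState.Value) {
  if (apps.contains(app)) {
    logInfo("Removing app " + app.id)
    // ... existing cleanup: drop the app from apps / idToApp / addressToApp,
    // kill its executors, persist the state change, etc.
  } else {
    logError("Asked to remove app " + app.id + " (final state: " + state +
      "), but it is not in the set of registered applications")
  }
}
{code}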


[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-11-29 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228966#comment-14228966
 ] 

Josh Rosen commented on SPARK-4498:
---

Here's an interesting pattern to grep for in all-master-logs-around-blip.txt: 
{{sparkdri...@spark-etl1.chi.shopify.com:52047}}.  Note that this log is in 
reverse-chronological order.

The earliest occurrence is in a DisassociatedEvent log message:

{code}
2014-11-19_18:48:31.34508 14/11/19 18:48:31 ERROR EndpointWriter: 
AssociationError [akka.tcp://sparkmas...@dn05.chi.shopify.com:7077] -> 
[akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047]: Error [Shut down 
address: akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047] [
2014-11-19_18:48:31.34510 akka.remote.ShutDownAssociation: Shut down address: 
akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047
2014-11-19_18:48:31.34511 Caused by: 
akka.remote.transport.Transport$InvalidAssociationException: The remote system 
terminated the association because it is shutting down.
2014-11-19_18:48:31.34512 ]
2014-11-19_18:48:31.34521 14/11/19 18:48:31 INFO LocalActorRef: Message 
[akka.remote.transport.AssociationHandle$Disassociated] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%40172.16.126.88%3A48040-1355#-59270061]
 was not delivered. [2859] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
2014-11-19_18:48:31.34603 14/11/19 18:48:31 INFO Master: 
akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:52047 got disassociated, 
removing it.
2014-11-19_18:48:31.20255 14/11/19 18:48:31 INFO Master: Removing executor 
app-20141119184815-1316/7 because it is EXITED
{code}

Even though INFO-level logging is enabled, there's no {{Removing app}} INFO 
message from the Master near this event.  The entire log contains many 
repetitions of this same DisassociatedEvent log message.

The same log also contains many executors that launch and immediately fail:

{code}
2014-11-19_18:52:51.84000 14/11/19 18:52:51 INFO Master: Launching executor 
app-20141119184815-1313/75 on worker 
worker-20141118172622-dn19.chi.shopify.com-38498
2014-11-19_18:52:51.83981 14/11/19 18:52:51 INFO Master: Removing executor 
app-20141119184815-1313/67 because it is EXITED
{code}

I couldn't find a {{removing app app-20141119184815-1313}} event.

Another interesting thing: even though it looks like this log contains 
information for 39 drivers, there are 100 disassociated events:

{code}
[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | 
grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | uniq | wc -l
  39

[joshrosen ~]$ cat /Users/joshrosen/Desktop/all-master-logs-around-blip.txt | 
grep -e "\d\d\d\d\d got disassociated" -o | cut -d ' ' -f 1 | sort | wc -l
 100
{code}


[jira] [Commented] (SPARK-4498) Standalone Master can fail to recognize completed/failed applications

2014-11-19 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14218491#comment-14218491
 ] 

Harry Brundage commented on SPARK-4498:
---

For the simple canary Spark application (whose master logs are attached), the 
first executor does exactly what it is supposed to, and then the driver shuts 
down and it disassociates. Then a second executor is started, when it never 
should have been. The first executor's stderr:

{code}
14/11/19 18:48:16 INFO CoarseGrainedExecutorBackend: Registered signal handlers 
for [TERM, HUP, INT]
14/11/19 18:48:16 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
14/11/19 18:48:16 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:16 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:16 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:17 INFO Slf4jLogger: Slf4jLogger started
14/11/19 18:48:17 INFO Remoting: Starting remoting
14/11/19 18:48:17 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://driverpropsfetc...@dn42.chi.shopify.com:36365]
14/11/19 18:48:17 INFO Utils: Successfully started service 'driverPropsFetcher' 
on port 36365.
14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
14/11/19 18:48:18 INFO Slf4jLogger: Slf4jLogger started
14/11/19 18:48:18 INFO Remoting: Starting remoting
14/11/19 18:48:18 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkexecu...@dn42.chi.shopify.com:39974]
14/11/19 18:48:18 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
14/11/19 18:48:18 INFO Utils: Successfully started service 'sparkExecutor' on 
port 39974.
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/CoarseGrainedScheduler
14/11/19 18:48:18 INFO WorkerWatcher: Connecting to worker 
akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker
14/11/19 18:48:18 INFO WorkerWatcher: Successfully connected to 
akka.tcp://sparkwor...@dn42.chi.shopify.com:41095/user/Worker
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Successfully registered 
with driver
14/11/19 18:48:18 INFO SecurityManager: Changing view acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: Changing modify acls to: spark,azkaban
14/11/19 18:48:18 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(spark, azkaban); 
users with modify permissions: Set(spark, azkaban)
14/11/19 18:48:18 INFO AkkaUtils: Connecting to MapOutputTracker: 
akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/MapOutputTracker
14/11/19 18:48:18 INFO AkkaUtils: Connecting to BlockManagerMaster: 
akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/BlockManagerMaster
14/11/19 18:48:18 INFO DiskBlockManager: Created local directory at 
/tmp/spark-local-20141119184818-e5ae
14/11/19 18:48:18 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
14/11/19 18:48:18 INFO LogReporter: Creating metrics output file: 
/tmp/spark-metrics
14/11/19 18:48:18 INFO NettyBlockTransferService: Server created on 44406
14/11/19 18:48:18 INFO BlockManagerMaster: Trying to register BlockManager
14/11/19 18:48:18 INFO BlockManagerMaster: Registered BlockManager
14/11/19 18:48:18 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
akka.tcp://sparkdri...@spark-etl1.chi.shopify.com:58849/user/HeartbeatReceiver
14/11/19 18:48:18 INFO CoarseGrainedExecutorBackend: Got assigned task 0
14/11/19 18:48:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
14/11/19 18:48:18 INFO Executor: Fetching 
hdfs://nn01.chi.shopify.com:8020/tmp/starscream_pyfiles_cache/packages-68badf293ff9e4d13929073572892d7f3d0a3546.egg
 with timestamp 1416422896241
14/11/19 18:48:19 INFO TorrentBroadcast: Started reading broadcast variable 1
14/11/19 18:48:19 INFO MemoryStore: ensureFreeSpace(3391) called with curMem=0, 
maxMem=278302556
14/11/19 18:48:19 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in 
memory (estimated size 3.3 KB, free 265.4 MB)
14/11/19 18:48:19 INFO BlockManagerMaster: Updated info of block