[GitHub] [spark] Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

2019-12-25 Thread GitBox
URL: https://github.com/apache/spark/pull/27004#discussion_r361369613
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala
 ##
 @@ -97,13 +97,40 @@ class MockWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf) extend
   }
 }
 
-class MockExecutorLaunchFailWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf)
-  extends MockWorker(master, conf) {
+// This class is designed to handle the lifecycle of only one application.
+class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)
+  extends MockWorker(master.self, conf) with Eventually {
+
+  val appRegistered = new CountDownLatch(1)
+  val launchExecutorReceived = new CountDownLatch(1)
+  val appIdsToLaunchExecutor = new mutable.HashSet[String]
   var failedCnt = 0
+
   override def receive: PartialFunction[Any, Unit] = {
+    case LaunchDriver(driverId, desc, resources_) =>
 
 Review comment:
   nit: Personally, I'd prefer `_` for those unused fields.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

2019-12-25 Thread GitBox
URL: https://github.com/apache/spark/pull/27004#discussion_r361355454
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala
 ##
 @@ -97,13 +97,40 @@ class MockWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf) extend
   }
 }
 
-class MockExecutorLaunchFailWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf)
-  extends MockWorker(master, conf) {
+// This class is designed to handle the lifecycle of only one application.
+class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)
+  extends MockWorker(master.self, conf) with Eventually {
+
+  val appRegistered = new CountDownLatch(1)
+  val launchExecutorReceived = new CountDownLatch(1)
+  val appIdsToLaunchExecutor = new mutable.HashSet[String]
   var failedCnt = 0
+
   override def receive: PartialFunction[Any, Unit] = {
+    case LaunchDriver(driverId, desc, resources_) =>
+      drivers += driverId
+      driverResources(driverId) = resources_.map(r => (r._1, r._2.addresses.toSet))
+      master.self.send(RegisterApplication(appDesc, newDriver(driverId)))
+
+      // The code below doesn't make the driver stuck, as newDriver opens another rpc
+      // endpoint for handling driver related messages. It guarantees that registering
+      // the application is done before handling the LaunchExecutor message.
+      eventually(timeout(10.seconds)) {
 
 Review comment:
   > this would block handling LaunchExecutor until we verify application registering is done successfully.
   
   How does this block handling `LaunchExecutor`? I believe `Worker` is not a thread-safe `RpcEndpoint`. Am I missing something?
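The disagreement turns on whether blocking inside `receive` delays later messages. As a minimal, self-contained illustration (plain Java, not Spark's RPC framework; the message names are just stand-ins): when an endpoint's messages are dispatched on a single thread, blocking in one handler keeps every later message queued, which is the behavior the `eventually` wait relies on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SingleThreadedEndpointDemo {

    static List<String> run() throws InterruptedException {
        // One dispatcher thread: messages are handled strictly in arrival order,
        // so blocking inside one handler keeps every later message queued.
        ExecutorService dispatcher = Executors.newSingleThreadExecutor();
        CountDownLatch appRegistered = new CountDownLatch(1);
        List<String> handled = Collections.synchronizedList(new ArrayList<>());

        // Handler for "LaunchDriver": blocks until registration is confirmed,
        // mimicking the eventually { ... } wait in the mock worker.
        dispatcher.submit(() -> {
            try {
                appRegistered.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            handled.add("LaunchDriver");
        });
        // "LaunchExecutor" arrives next; it cannot run while the handler above blocks.
        dispatcher.submit(() -> handled.add("LaunchExecutor"));

        appRegistered.countDown();  // "registration done": both handlers now proceed
        dispatcher.shutdown();
        dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        return handled;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run());  // prints [LaunchDriver, LaunchExecutor]
    }
}
```

If the dispatcher were a thread pool instead (the concurrent-dispatch behavior Ngone51 is asking about), `LaunchExecutor` could be handled while `LaunchDriver` was still blocked, and the ordering guarantee would disappear.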





[GitHub] [spark] Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

2019-12-25 Thread GitBox
URL: https://github.com/apache/spark/pull/27004#discussion_r361354710
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala
 ##
 @@ -97,13 +97,40 @@ class MockWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf) extend
   }
 }
 
-class MockExecutorLaunchFailWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf)
-  extends MockWorker(master, conf) {
+// This class is designed to handle the lifecycle of only one application.
+class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)
+  extends MockWorker(master.self, conf) with Eventually {
+
+  val appRegistered = new CountDownLatch(1)
+  val launchExecutorReceived = new CountDownLatch(1)
+  val appIdsToLaunchExecutor = new mutable.HashSet[String]
   var failedCnt = 0
+
   override def receive: PartialFunction[Any, Unit] = {
+    case LaunchDriver(driverId, desc, resources_) =>
+      drivers += driverId
+      driverResources(driverId) = resources_.map(r => (r._1, r._2.addresses.toSet))
 
 Review comment:
   Looks like `drivers`/`driverResources` are unused here.





[GitHub] [spark] Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

2019-12-24 Thread GitBox
URL: https://github.com/apache/spark/pull/27004#discussion_r361239993
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala
 ##
 @@ -682,7 +682,11 @@ class MasterSuite extends SparkFunSuite
     // an app would be registered with Master once Driver set up
     assert(worker.apps.nonEmpty)
     appId = worker.apps.head._1
-    assert(master.idToApp.contains(appId))
+
+    // We found a case where the test ran so fast that all steps were done within
+    // one interval - in that case, we have to check that the app is either still
+    // available in master or already marked as completed. See SPARK-30348 for details.
+    assert(master.idToApp.contains(appId) || master.completedApps.exists(_.id == appId))
 
 Review comment:
   My concern is, in this test:
   
   The 1st assert is to make sure that the application exists in `Master.idToApp`.
   
   The 2nd assert is to check that the application has been removed from `Master.idToApp` after MAX_EXECUTOR_RETRIES.
   
   But if we find at the 1st assert that the application has already completed, which means it doesn't exist in `Master.idToApp`, I think that breaks the original assumption.
   
   How about setting up a `CountDownLatch` in `MockExecutorLaunchFailWorker`, and counting it down before we reach MAX_EXECUTOR_RETRIES? Then we can be sure the application hasn't been removed from `Master.idToApp` yet.
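The latch suggestion can be sketched in isolation. Below is a minimal, hypothetical stand-in (plain Java, using the same `java.util.concurrent.CountDownLatch` class the Scala test would use; none of this is Spark's actual test code): the worker/master side counts a latch down while the app is still registered, so the checking side can safely assert on `idToApp` before the app is removed. A second latch makes this demo deterministic where the real test would rely on the check running before the remaining retries are processed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchCoordinationDemo {

    // Returns true iff the app was observable while registered, and was
    // removed afterwards - the two states the Spark test wants to assert in order.
    static boolean appAliveWhenChecked() throws InterruptedException {
        CountDownLatch appStillRegistered = new CountDownLatch(1);
        CountDownLatch assertionDone = new CountDownLatch(1);
        ConcurrentHashMap<String, String> idToApp = new ConcurrentHashMap<>();
        idToApp.put("app-1", "running");

        // Stand-in for the master/worker side: signal while the app is alive,
        // then wait for the checker to finish before removing the app
        // (modeling "retries exhausted").
        Thread masterSide = new Thread(() -> {
            appStillRegistered.countDown();  // app is guaranteed present here
            try {
                assertionDone.await();       // let the checker assert first
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            idToApp.remove("app-1");         // master gives up on the app
        });
        masterSide.start();

        // Checker side: once the latch fires, it is safe to assert on idToApp.
        boolean signalled = appStillRegistered.await(5, TimeUnit.SECONDS);
        boolean alive = signalled && idToApp.containsKey("app-1");
        assertionDone.countDown();
        masterSide.join();
        return alive && !idToApp.containsKey("app-1");
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(appAliveWhenChecked());  // prints true
    }
}
```

The key property is that `countDown()` happens at a point where the app is provably still registered, so the checker never races against removal - which is exactly what the `idToApp.contains(appId) || completedApps.exists(...)` disjunction was papering over.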





[GitHub] [spark] Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

2019-12-24 Thread GitBox
URL: https://github.com/apache/spark/pull/27004#discussion_r361194044
 
 

 ##
 File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala
 ##
 @@ -682,7 +682,11 @@ class MasterSuite extends SparkFunSuite
     // an app would be registered with Master once Driver set up
     assert(worker.apps.nonEmpty)
     appId = worker.apps.head._1
-    assert(master.idToApp.contains(appId))
+
+    // We found a case where the test ran so fast that all steps were done within
+    // one interval - in that case, we have to check that the app is either still
+    // available in master or already marked as completed. See SPARK-30348 for details.
+    assert(master.idToApp.contains(appId) || master.completedApps.exists(_.id == appId))
 
 Review comment:
   I think hitting `assert(master.completedApps.exists(_.id == appId))` isn't what we want to see in this test.
   
   Would a short delay above `assert(master.idToApp.contains(appId))` or increasing `MAX_EXECUTOR_RETRIES` work? I'd also prefer not to expose internal data for such a fix.

