[GitHub] [spark] Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."
Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

URL: https://github.com/apache/spark/pull/27004#discussion_r361369613

File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala

```diff
@@ -97,13 +97,40 @@ class MockWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf) extend
   }
 }

-class MockExecutorLaunchFailWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf)
-  extends MockWorker(master, conf) {
+// This class is designed to handle the lifecycle of only one application.
+class MockExecutorLaunchFailWorker(master: Master, conf: SparkConf = new SparkConf)
+  extends MockWorker(master.self, conf) with Eventually {
+
+  val appRegistered = new CountDownLatch(1)
+  val launchExecutorReceived = new CountDownLatch(1)
+  val appIdsToLaunchExecutor = new mutable.HashSet[String]
   var failedCnt = 0
+
+  override def receive: PartialFunction[Any, Unit] = {
+    case LaunchDriver(driverId, desc, resources_) =>
```

Review comment:
nit: Personally, I'd prefer `_` for those unused fields.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
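The nit above is about binding pattern fields that are never read. A minimal sketch of the suggested style (hypothetical message type, not the Spark classes): unused fields like `desc` and `resources_` are matched with `_` instead of a name.

```scala
// Hypothetical message type standing in for Spark's LaunchDriver.
sealed trait Msg
case class LaunchDriver(driverId: String, desc: String, resources: Map[String, Int]) extends Msg

object UnusedFieldsSketch {
  def handle(msg: Msg): String = msg match {
    // `desc` and `resources` are not used, so they are bound with `_`.
    case LaunchDriver(driverId, _, _) => s"launched driver $driverId"
  }
}
```

Using `_` makes it immediately clear to readers (and to unused-binding lints) that only `driverId` matters in this handler.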
Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

URL: https://github.com/apache/spark/pull/27004#discussion_r361355454

File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala

```diff
@@ -97,13 +97,40 @@ class MockWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf) extend
+  override def receive: PartialFunction[Any, Unit] = {
+    case LaunchDriver(driverId, desc, resources_) =>
+      drivers += driverId
+      driverResources(driverId) = resources_.map(r => (r._1, r._2.addresses.toSet))
+      master.self.send(RegisterApplication(appDesc, newDriver(driverId)))
+
+      // Below code doesn't make driver stuck, as newDriver opens another rpc endpoint for
+      // handling driver related messages. It guarantees registering application is done
+      // before handling LaunchExecutor message.
+      eventually(timeout(10.seconds)) {
```

Review comment:
> this would block handling LaunchExecutor until we verify application registering is done successfully.

How does this block handling `LaunchExecutor`? I believe `Worker` is not a thread safe `RpcEndpoint`. Am I missing something?
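The question above hinges on message-processing semantics: only if the endpoint handles one message at a time does blocking inside `receive` delay the messages queued behind it. The following is a minimal, self-contained sketch of that single-threaded-inbox behavior (hypothetical names; not Spark's actual RPC implementation): while the handler for the first message waits on a latch, analogous to `eventually(...)` inside `receive`, the `LaunchExecutor` message sits unprocessed in the queue.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object InboxSketch {
  def run(): Seq[String] = {
    val inbox = new LinkedBlockingQueue[String]()
    val registered = new CountDownLatch(1)
    val handled = new ArrayBuffer[String]()

    // Single-threaded message loop: one message at a time, the guarantee a
    // thread-safe endpoint provides. Blocking in one handler delays the rest.
    val loop = new Thread(() => {
      var running = true
      while (running) {
        inbox.take() match {
          case "RegisterApplication" =>
            // Blocking here (like eventually(...) inside receive) holds up
            // every message queued behind this one.
            registered.await(10, TimeUnit.SECONDS)
            handled += "RegisterApplication"
          case "LaunchExecutor" =>
            handled += "LaunchExecutor"
          case "Stop" =>
            running = false
        }
      }
    })
    loop.start()

    inbox.put("RegisterApplication")
    inbox.put("LaunchExecutor") // queued behind the blocked handler
    registered.countDown()      // now the loop can make progress
    inbox.put("Stop")
    loop.join()
    handled.toSeq
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If the endpoint were instead allowed to process messages concurrently (the reviewer's reading of a non-thread-safe `RpcEndpoint`), this ordering guarantee would not hold, which is exactly the point of the question.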
Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

URL: https://github.com/apache/spark/pull/27004#discussion_r361354710

File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala

```diff
@@ -97,13 +97,40 @@ class MockWorker(master: RpcEndpointRef, conf: SparkConf = new SparkConf) extend
+  override def receive: PartialFunction[Any, Unit] = {
+    case LaunchDriver(driverId, desc, resources_) =>
+      drivers += driverId
+      driverResources(driverId) = resources_.map(r => (r._1, r._2.addresses.toSet))
```

Review comment:
Looks like `drivers`/`driverResources` are unused here.
Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

URL: https://github.com/apache/spark/pull/27004#discussion_r361239993

File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala

```diff
@@ -682,7 +682,11 @@ class MasterSuite extends SparkFunSuite
     // an app would be registered with Master once Driver set up
     assert(worker.apps.nonEmpty)
     appId = worker.apps.head._1
-    assert(master.idToApp.contains(appId))
+
+    // we found the case where the test was too fast which all steps were done within
+    // an interval - in this case, we have to check either app is available in master
+    // or marked as completed. See SPARK-30348 for details.
+    assert(master.idToApp.contains(appId) || master.completedApps.exists(_.id == appId))
```

Review comment:
My concern is, in this test:
1. The 1st assert is to make sure that the application exists in `Master.idToApp`.
2. The 2nd assert is to check that the application has been removed from `Master.idToApp` after `MAX_EXECUTOR_RETRIES`.

But if we find the application has already completed in the 1st assert, which means it doesn't exist in `Master.idToApp`, I think that breaks the original assumption.

How about setting up a `CountDownLatch` in `MockExecutorLaunchFailWorker`, and counting it down before we reach `MAX_EXECUTOR_RETRIES`? That way we can be sure the application hasn't yet been removed from `Master.idToApp`.
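The latch handshake suggested above can be sketched deterministically. This is a minimal, self-contained illustration (hypothetical names such as `LatchHandshakeSketch` and the `idToApp` stand-in map; not the actual `MasterSuite` code): the mock worker signals the test after its first launch failure and then waits for permission before exhausting its retries, so the test can safely assert the app is still registered.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import scala.collection.mutable

object LatchHandshakeSketch {
  val maxExecutorRetries = 3

  // Returns (app still registered at first failure, app removed after all retries).
  def run(): (Boolean, Boolean) = {
    val idToApp = mutable.Map("app-1" -> "running") // stand-in for Master.idToApp
    val firstFailure = new CountDownLatch(1)
    val mayFinishRetries = new CountDownLatch(1)

    // Stand-in for the mock worker driving executor-launch failures.
    val worker = new Thread(() => {
      var failedCnt = 0
      while (failedCnt < maxExecutorRetries) {
        failedCnt += 1
        if (failedCnt == 1) {
          firstFailure.countDown()                      // tell the test: one failure seen
          mayFinishRetries.await(10, TimeUnit.SECONDS)  // pause before exhausting retries
        }
      }
      // Retries exhausted: the master gives up on the app.
      idToApp.synchronized { idToApp.remove("app-1") }
    })
    worker.start()

    firstFailure.await(10, TimeUnit.SECONDS)
    // Safe to assert: the worker is parked before MAX_EXECUTOR_RETRIES is hit.
    val stillRegistered = idToApp.synchronized(idToApp.contains("app-1"))
    mayFinishRetries.countDown()
    worker.join()
    val removedAfterRetries = !idToApp.contains("app-1")
    (stillRegistered, removedAfterRetries)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The handshake removes the race entirely: both assertions observe the states they were written for, with no dependency on timing intervals.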
Ngone51 commented on a change in pull request #27004: [SPARK-30348][CORE] Fix flaky test failure on "MasterSuite.SPARK-27510: Master should avoid ..."

URL: https://github.com/apache/spark/pull/27004#discussion_r361194044

File path: core/src/test/scala/org/apache/spark/deploy/master/MasterSuite.scala

```diff
@@ -682,7 +682,11 @@ class MasterSuite extends SparkFunSuite
     // an app would be registered with Master once Driver set up
     assert(worker.apps.nonEmpty)
     appId = worker.apps.head._1
-    assert(master.idToApp.contains(appId))
+
+    // we found the case where the test was too fast which all steps were done within
+    // an interval - in this case, we have to check either app is available in master
+    // or marked as completed. See SPARK-30348 for details.
+    assert(master.idToApp.contains(appId) || master.completedApps.exists(_.id == appId))
```

Review comment:
I think hitting `assert(master.completedApps.exists(_.id == appId))` isn't what we want to see in this test. Would a short delay above `assert(master.idToApp.contains(appId))`, or increasing `MAX_EXECUTOR_RETRIES`, work? I'd also prefer not to expose internal data for such a fix.