LuciferYang opened a new pull request, #37938:
URL: https://github.com/apache/spark/pull/37938

   ### What changes were proposed in this pull request?
   After SPARK-17321, `YarnShuffleService` will persist data to local shuffle 
state db/reload data from local shuffle state db only when Yarn NodeManager 
start with `YarnConfiguration#NM_RECOVERY_ENABLED = true`.
   
   `YarnShuffleIntegrationSuite` not set 
`YarnConfiguration#NM_RECOVERY_ENABLED` and the default value of the 
configuration is false,  so `YarnShuffleIntegrationSuite` will neither trigger 
data persistence to the db nor verify the reload of data.
   
   This pr aims to let `YarnShuffleIntegrationSuite` restart the verification 
of registeredExecFile reload scenarios, to achieve this goal, this pr make the 
following changes:
   
   1. Add a new un-document configuration 
`spark.yarn.shuffle.integrationTesting` to `YarnShuffleService`, and Initialize 
`_recoveryPath` when `_recoveryPath == null && 
spark.yarn.shuffle.integrationTesting == true`.
   
   2. Only set `spark.yarn.shuffle.integrationTesting = true` in 
`YarnShuffleIntegrationSuite`, and add assertions to check `registeredExecFile` 
is not null to ensure that registeredExecFile reload scenarios will be verified.
   
   ### Why are the changes needed?
   Fix registeredExecFile reload  test scenarios.
   
   Why not test by configuring `YarnConfiguration#NM_RECOVERY_ENABLED` as true?
   
   This configuration has been tried
   
   **Hadoop 3.3.4**
   
   ```
   build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none 
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite
 -Phadoop-3
   ```
   
   ```
   2022-09-10T11:44:42.1710230Z Cause: java.lang.ClassNotFoundException: 
org.apache.hadoop.shaded.org.iq80.leveldb.DBException
   2022-09-10T11:44:42.1715234Z at 
java.net.URLClassLoader.findClass(URLClassLoader.java:387)
   2022-09-10T11:44:42.1719347Z at 
java.lang.ClassLoader.loadClass(ClassLoader.java:419)
   2022-09-10T11:44:42.1723090Z at 
sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
   2022-09-10T11:44:42.1726759Z at 
java.lang.ClassLoader.loadClass(ClassLoader.java:352)
   2022-09-10T11:44:42.1731028Z at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:313)
   2022-09-10T11:44:42.1735424Z at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:370)
   2022-09-10T11:44:42.1740303Z at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
   2022-09-10T11:44:42.1745576Z at 
org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceInit(MiniYARNCluster.java:597)
   2022-09-10T11:44:42.1828858Z at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
   2022-09-10T11:44:42.1829712Z at 
org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:109)
   2022-09-10T11:44:42.1830633Z at 
org.apache.hadoop.yarn.server.MiniYARNCluster.serviceInit(MiniYARNCluster.java:327)
   2022-09-10T11:44:42.1831431Z at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
   2022-09-10T11:44:42.1832279Z at 
org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:112)
   ```
   
   **Hadoop 2.7.4**
   
   ```
   build/mvn clean install -pl resource-managers/yarn -Pyarn -Dtest=none 
-DwildcardSuites=org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite
 -Phadoop-2
   ```
   
   ```
   YarnShuffleIntegrationWithLevelDBBackendSuite:
   org.apache.spark.deploy.yarn.YarnShuffleIntegrationWithLevelDBBackendSuite 
*** ABORTED ***
     java.lang.IllegalArgumentException: Cannot support recovery with an 
ephemeral server port. Check the setting of yarn.nodemanager.address
     at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceStart(ContainerManagerImpl.java:395)
     at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
     at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
     at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     at 
org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper.serviceStart(MiniYARNCluster.java:560)
     at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     at 
org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
     at 
org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:278)
     at 
org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
     ...
   Run completed in 3 seconds, 992 milliseconds.
   Total number of tests run: 0
   Suites: completed 1, aborted 1
   Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
   *** 1 SUITE ABORTED ***
   ```
   
   From the above test, we need to use a fixed port to enable Yarn NodeManager 
recovery, but this is difficult to be guaranteed in UT, so this pr try a 
workaround way.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Pass GitHub Actions
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to