[ https://issues.apache.org/jira/browse/SPARK-39647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mridul Muralidharan resolved SPARK-39647.
-----------------------------------------
    Fix Version/s: 3.3.1
                   3.4.0
       Resolution: Fixed

Issue resolved by pull request 37052
[https://github.com/apache/spark/pull/37052]

> Block push fails with java.lang.IllegalArgumentException: Active local dirs
> list has not been updated by any executor registration even when the
> NodeManager hasn't been restarted
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-39647
>                 URL: https://issues.apache.org/jira/browse/SPARK-39647
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.2.0
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>             Fix For: 3.3.1, 3.4.0
>
>
> We saw these exceptions during block push:
> {code:java}
> 22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block shuffle_170_568_174, and will not retry (0 retries)
> org.apache.spark.network.shuffle.BlockPushException: !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException: Active local dirs list has not been updated by any executor registration
>   at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
>   at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
>   at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
>   at org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
>   at org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)
> 22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174 to BlockManagerId(, node-x, 7337, None) failed.
> {code}
> Note: The NodeManager on node-x (the node against which this exception was
> seen) was not restarted.
>
> This happened because the executor registers the block manager with
> {{BlockManagerMaster}} before it registers with the ESS. In push-based
> shuffle, the driver selects a block manager as a merger for the shuffle
> push. However, the ESS on that node can merge a pushed block only if it has
> already received the metadata about merged directories from the local
> executor (sent when the local executor registers with the ESS). If this
> local executor registration is delayed but the ESS host has already been
> picked as a merger, the ESS fails to merge the blocks pushed to it, which
> is what happened here.
>
> The local executor on node-x is executor 754, and the block manager
> registration happened at 13:28:11:
> {code:java}
> 22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has registered (new total is 1200)
> 22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
> {code}
> The application was registered with the shuffle server at node-x at
> 13:29:40:
> {code:java}
> 2022-06-24 13:29:40,343 INFO org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/, /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/, /grid/c/tmp/yarn/] for application application_1653753500486_3193550
> {code}
> node-x was selected as a merger by the driver after 13:28:11, and when the
> executors started pushing to it, all those pushes failed until 13:29:40.
>
> We can fix this by having the executor register with the ESS before it
> registers the block manager with the {{BlockManagerMaster}}.
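
The registration-order problem described in this issue can be illustrated with a small, self-contained Java sketch. It is not Spark code and does not reproduce the change in the pull request: ShuffleService, Driver, failedPushes and tryPush are hypothetical names, and the host, application id, block id and directory are placeholder values. The only assumptions taken from the issue are that a host becomes a merger candidate once its block manager is registered with the driver, and that the shuffle service rejects pushes until a local executor has registered its directories.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the race; none of these types are Spark classes.
class RegistrationOrderSketch {

    /** Stands in for the external shuffle service (ESS) on one node. */
    static final class ShuffleService {
        private final Map<String, List<String>> activeLocalDirs = new HashMap<>();

        /** A local executor registers and reports the application's merge directories. */
        void registerExecutor(String appId, List<String> localDirs) {
            activeLocalDirs.put(appId, localDirs);
        }

        /** Accepts a pushed block only if the application's directories are known. */
        void pushBlock(String appId, String blockId) {
            if (!activeLocalDirs.containsKey(appId)) {
                throw new IllegalArgumentException(
                    "Active local dirs list has not been updated by any executor registration");
            }
        }
    }

    /** Stands in for the driver: a host with a registered block manager may be picked as a merger. */
    static final class Driver {
        private final Set<String> mergerCandidates = new HashSet<>();

        void registerBlockManager(String host) {
            mergerCandidates.add(host);
        }

        boolean isMergerCandidate(String host) {
            return mergerCandidates.contains(host);
        }
    }

    /** Attempts one push if the host is selectable as a merger; returns 1 on failure, else 0. */
    static int tryPush(Driver driver, ShuffleService ess, String host, String appId) {
        if (!driver.isMergerCandidate(host)) {
            return 0; // the driver cannot select this host yet, so nothing is pushed to it
        }
        try {
            ess.pushBlock(appId, "shuffle_0_0_0"); // placeholder block id
            return 0;
        } catch (IllegalArgumentException e) {
            return 1;
        }
    }

    /** Runs the two registration steps in the given order, pushing after each step. */
    static int failedPushes(boolean registerWithEssFirst) {
        String host = "node-x";
        String appId = "app-1"; // placeholder application id
        List<String> dirs = Collections.singletonList("/tmp/merge-dir"); // placeholder dir
        ShuffleService ess = new ShuffleService();
        Driver driver = new Driver();
        int failures = 0;

        // Step 1: whichever registration the executor performs first.
        if (registerWithEssFirst) {
            ess.registerExecutor(appId, dirs);
        } else {
            driver.registerBlockManager(host);
        }
        failures += tryPush(driver, ess, host, appId); // a push arriving between the steps

        // Step 2: the remaining registration.
        if (registerWithEssFirst) {
            driver.registerBlockManager(host);
        } else {
            ess.registerExecutor(appId, dirs);
        }
        failures += tryPush(driver, ess, host, appId); // a push after both steps
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("block manager first: " + failedPushes(false) + " failed push(es)");
        System.out.println("ESS first: " + failedPushes(true) + " failed push(es)");
    }
}
```

In this model, registering the block manager first leaves a window in which the host is selectable as a merger but the shuffle service has no directories, so the push in that window fails. Registering with the shuffle service first closes the window, because the host only becomes selectable after the directories are known.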