[ 
https://issues.apache.org/jira/browse/SPARK-39647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-39647.
-----------------------------------------
    Fix Version/s: 3.3.1
                   3.4.0
       Resolution: Fixed

Issue resolved by pull request 37052
[https://github.com/apache/spark/pull/37052]

> Block push fails with java.lang.IllegalArgumentException: Active local dirs 
> list has not been updated by any executor registration even when the 
> NodeManager hasn't been restarted
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-39647
>                 URL: https://issues.apache.org/jira/browse/SPARK-39647
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 3.2.0
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>             Fix For: 3.3.1, 3.4.0
>
>
> We saw these exceptions during block push:
> {code:java}
> 22/06/24 13:29:14 ERROR RetryingBlockFetcher: Failed to fetch block 
> shuffle_170_568_174, and will not retry (0 retries)
> org.apache.spark.network.shuffle.BlockPushException: 
> !application_1653753500486_3193550shuffle_170_568_174java.lang.IllegalArgumentException:
>  Active local dirs list has not been updated by any executor registration
>       at 
> org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:92)
>       at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getActiveLocalDirs(RemoteBlockPushResolver.java:300)
>       at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getFile(RemoteBlockPushResolver.java:290)
>       at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.getMergedShuffleFile(RemoteBlockPushResolver.java:312)
>       at 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver.lambda$getOrCreateAppShufflePartitionInfo$1(RemoteBlockPushResolver.java:168)
> 22/06/24 13:29:14 WARN UnsafeShuffleWriter: Pushing block shuffle_170_568_174 
> to BlockManagerId(, node-x, 7337, None) failed.
> {code}
> Note: The NodeManager on node-x (the node against which this exception was 
> seen) was not restarted.
> This happened because the executor registers the block manager with 
> {{BlockManagerMaster}} before it registers with the external shuffle service 
> (ESS). In push-based shuffle, the driver selects a block manager as a merger 
> for the shuffle push. However, the ESS on that node can merge a pushed block 
> only after it has received the metadata about merged directories from the 
> local executor (sent when the local executor registers with the ESS). If this 
> local executor registration is delayed but the ESS host has already been 
> picked as a merger, the ESS fails to merge the blocks pushed to it, which is 
> what happened here.
> The local executor on node-x is executor 754, and its block manager 
> registration happened at 13:28:11:
> {code:java}
> 22/06/24 13:28:11 INFO ExecutorAllocationManager: New executor 754 has 
> registered (new total is 1200)
> 22/06/24 13:28:11 INFO BlockManagerMasterEndpoint: Registering block manager 
> node-x:16747 with 2004.6 MB RAM, BlockManagerId(754, node-x, 16747, None)
> {code}
> The application was registered with the shuffle server on node-x at 13:29:40:
> {code:java}
> 2022-06-24 13:29:40,343 INFO 
> org.apache.spark.network.shuffle.RemoteBlockPushResolver: Updated the active 
> local dirs [/grid/i/tmp/yarn/, /grid/g/tmp/yarn/, /grid/b/tmp/yarn/, 
> /grid/e/tmp/yarn/, /grid/h/tmp/yarn/, /grid/f/tmp/yarn/, /grid/d/tmp/yarn/, 
> /grid/c/tmp/yarn/] for application application_1653753500486_3193550
>  {code}
> node-x was selected as a merger by the driver after 13:28:11, and when the 
> executors started pushing to it, all of those pushes failed until 13:29:40.
> We can fix this by having the executor register with the ESS before it 
> registers the block manager with the {{BlockManagerMaster}}.
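
The ordering problem described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `ShuffleService`, `Driver`, and `firstPushSucceeds` are hypothetical stand-ins invented for illustration, not Spark classes, and the code is not the change made in the pull request. It only shows why a push fails when the block manager registration becomes visible to the driver before the ESS has received the local dirs, and why swapping the two registrations avoids that.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race; none of these types are Spark's.
class RegistrationOrderSketch {

    /** Stand-in for the external shuffle service (ESS) on one node. */
    static final class ShuffleService {
        private final Map<String, List<String>> activeLocalDirs = new ConcurrentHashMap<>();

        /** Executor registration delivers the merge directories for an application. */
        void registerExecutor(String appId, List<String> localDirs) {
            activeLocalDirs.put(appId, localDirs);
        }

        /** A pushed block can be merged only once a local executor has registered. */
        void pushBlock(String appId, String blockId) {
            if (!activeLocalDirs.containsKey(appId)) {
                throw new IllegalArgumentException(
                    "Active local dirs list has not been updated by any executor registration");
            }
            // Actual merging is omitted in this sketch.
        }
    }

    /** Stand-in for the driver: a host with a registered block manager may be picked as a merger. */
    static final class Driver {
        private final Set<String> mergerCandidates = ConcurrentHashMap.newKeySet();

        void registerBlockManager(String host) { mergerCandidates.add(host); }

        boolean isMergerCandidate(String host) { return mergerCandidates.contains(host); }
    }

    /**
     * Simulates executor startup followed by a push that arrives as soon as the
     * host is visible to the driver. Returns true if that push is accepted.
     */
    static boolean firstPushSucceeds(boolean registerWithEssFirst) {
        ShuffleService ess = new ShuffleService();
        Driver driver = new Driver();
        String appId = "app-1";                          // illustrative values only
        String host = "node-x";
        List<String> dirs = Arrays.asList("/tmp/merge-0");

        if (registerWithEssFirst) {
            ess.registerExecutor(appId, dirs);           // proposed order: ESS first
        }
        driver.registerBlockManager(host);               // host can now be chosen as a merger

        boolean accepted = false;
        if (driver.isMergerCandidate(host)) {
            try {
                ess.pushBlock(appId, "block-0");
                accepted = true;
            } catch (IllegalArgumentException e) {
                accepted = false;                        // the failure seen in the report
            }
        }
        if (!registerWithEssFirst) {
            ess.registerExecutor(appId, dirs);           // original order: arrives too late
        }
        return accepted;
    }

    public static void main(String[] args) {
        System.out.println("block manager first: " + firstPushSucceeds(false));
        System.out.println("ESS first: " + firstPushSucceeds(true));
    }
}
```

In this model the push is rejected when the block manager is registered first and accepted when the ESS registration comes first, which mirrors the window between 13:28:11 and 13:29:40 in the logs above.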



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

