Yuanbo Liu created HDFS-16657:
---------------------------------

             Summary: Changing pool-level lock to volume-level lock for 
invalidation of blocks
                 Key: HDFS-16657
                 URL: https://issues.apache.org/jira/browse/HDFS-16657
             Project: Hadoop HDFS
          Issue Type: Sub-task
            Reporter: Yuanbo Liu
         Attachments: image-2022-07-13-10-25-37-383.png, 
image-2022-07-13-10-27-01-386.png, image-2022-07-13-10-27-44-258.png

Recently we saw that dn heartbeats became slow in a very busy cluster. 
Here is the chart:

!image-2022-07-13-10-25-37-383.png!

 

After taking a jstack of the dn, we found that the dn heartbeat was stuck in 
the invalidation of blocks:

!image-2022-07-13-10-27-01-386.png!

!image-2022-07-13-10-27-44-258.png!

The key code is:
{code:java}
try {
  File blockFile = new File(info.getBlockURI());
  if (blockFile != null && blockFile.getParentFile() == null) {
    errors.add("Failed to delete replica " + invalidBlks[i]
        +  ". Parent not found for block file: " + blockFile);
    continue;
  }
} catch(IllegalArgumentException e) {
  LOG.warn("Parent directory check failed; replica " + info
      + " is not backed by a local file");
}
{code}
The DN tries to locate the parent path of the block file, so there is disk I/O 
while the pool-level lock is held. When the disk is very busy with high I/O 
wait, all pending threads are blocked on the pool-level lock, and the 
heartbeat time becomes high. We propose changing the pool-level lock to a 
volume-level lock for block invalidation, so that a slow disk only blocks 
operations on its own volume.
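
A minimal sketch of the idea, not the actual DataNode code: the class 
VolumeLockSketch, its methods lockFor and invalidate, and the string volume 
ids are all hypothetical names chosen for illustration. It keeps one lock per 
(block pool, volume) pair, so the parent-directory check for a replica only 
holds the lock of the volume that stores it:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration only; these are not Hadoop classes. */
class VolumeLockSketch {
  // One lock per (block pool id, volume id) pair, created lazily.
  private final Map<String, ReentrantLock> volumeLocks =
      new ConcurrentHashMap<>();

  /** Returns the lock that guards a single volume of a block pool. */
  ReentrantLock lockFor(String bpid, String volumeId) {
    return volumeLocks.computeIfAbsent(
        bpid + "/" + volumeId, key -> new ReentrantLock());
  }

  /**
   * Runs the parent-directory check for replicas on one volume while
   * holding only that volume's lock. Threads working on other volumes
   * of the same block pool are not blocked.
   */
  List<String> invalidate(String bpid, String volumeId,
      List<File> blockFiles) {
    List<String> errors = new ArrayList<>();
    ReentrantLock lock = lockFor(bpid, volumeId);
    lock.lock();
    try {
      for (File blockFile : blockFiles) {
        if (blockFile.getParentFile() == null) {
          errors.add("Failed to delete replica"
              + ". Parent not found for block file: " + blockFile);
        }
      }
    } finally {
      lock.unlock();
    }
    return errors;
  }
}
```

With this layout, a thread stuck on a busy disk holds only the lock of that 
volume. Whether the heartbeat path benefits in practice depends on which 
locks it actually takes, which this sketch does not model.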

cc: [~hexiaoqiao] [~Aiphag0] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

