Haibin Huang created HDFS-15744:
-----------------------------------

             Summary: Use cumulative counting way to improve the accuracy of slow disk detection
                 Key: HDFS-15744
                 URL: https://issues.apache.org/jira/browse/HDFS-15744
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Haibin Huang
            Assignee: Haibin Huang
         Attachments: image-2020-12-22-11-37-14-734.png, image-2020-12-22-11-37-35-280.png, image-2020-12-22-11-46-48-817.png
HDFS-11461 supports datanode disk outlier detection, and we can use it to find slow disks via the SlowDiskReport (HDFS-11551). However, I found that the slow-disk information may not be accurate enough in practice, because a large number of short-term writes can lead to miscalculation. Here is an example: the disk is healthy, but when it encounters a lot of writes within a few minutes, its write IO does get slow, and it will be considered a slow disk. The disk is slow for only a few minutes, but the SlowDiskReport will keep it until the information becomes invalid. This scenario confuses us, since we want to use the SlowDiskReport to detect genuinely bad disks.

!image-2020-12-22-11-37-14-734.png!
!image-2020-12-22-11-37-35-280.png!

To improve the detection accuracy, we use a cumulative counting way to detect slow disks: if, within the reportValidityMs interval, a disk is considered an outlier in more than 50% of the detection rounds, then it should be a really bad disk. For example, if reportValidityMs is one hour and the detection interval is five minutes, there will be 12 disk outlier detections in one hour; if a disk is considered an outlier more than 6 times, it should be a really bad disk. We use this approach to detect bad disks in our cluster, and it reaches over 90% accuracy.

!image-2020-12-22-11-46-48-817.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
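The cumulative counting scheme described in the issue could be sketched roughly as below. This is an illustrative standalone sketch, not the actual HDFS patch: the class and method names (`CumulativeSlowDiskDetector`, `record`, `isSlowDisk`) are hypothetical, and it tracks a single disk. Each detection round records whether the disk was flagged as an outlier; rounds older than reportValidityMs are dropped, and the disk is reported slow only when more than half of the rounds still inside the window flagged it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only -- not the actual HDFS implementation.
public class CumulativeSlowDiskDetector {

  // One detection round: when it ran and whether the disk was an outlier.
  private static final class Round {
    final long timeMs;
    final boolean outlier;
    Round(long timeMs, boolean outlier) {
      this.timeMs = timeMs;
      this.outlier = outlier;
    }
  }

  private final long reportValidityMs;            // sliding window, e.g. one hour
  private final Deque<Round> rounds = new ArrayDeque<>();

  public CumulativeSlowDiskDetector(long reportValidityMs) {
    this.reportValidityMs = reportValidityMs;
  }

  /** Record the result of one detection round at time nowMs. */
  public void record(long nowMs, boolean flaggedAsOutlier) {
    rounds.addLast(new Round(nowMs, flaggedAsOutlier));
    prune(nowMs);
  }

  /** True when more than 50% of the rounds inside the window flagged the disk. */
  public boolean isSlowDisk(long nowMs) {
    prune(nowMs);
    if (rounds.isEmpty()) {
      return false;
    }
    long flagged = rounds.stream().filter(r -> r.outlier).count();
    return flagged * 2 > rounds.size();
  }

  // Drop rounds that fell out of the reportValidityMs window.
  private void prune(long nowMs) {
    while (!rounds.isEmpty() && nowMs - rounds.peekFirst().timeMs >= reportValidityMs) {
      rounds.removeFirst();
    }
  }

  public static void main(String[] args) {
    // One-hour window, five-minute detection interval => 12 rounds per hour.
    CumulativeSlowDiskDetector d = new CumulativeSlowDiskDetector(3_600_000L);
    for (int i = 0; i < 12; i++) {
      d.record(i * 300_000L, i < 7);   // flagged as outlier in 7 of 12 rounds
    }
    System.out.println("slow disk: " + d.isSlowDisk(3_300_000L));
  }
}
```

With this shape, a disk that is briefly saturated by a burst of writes contributes only one or two outlier rounds and never crosses the 6-of-12 threshold, while a genuinely bad disk keeps accumulating outlier rounds across the whole window.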