Haibin Huang created HDFS-15744:
-----------------------------------

             Summary: Use cumulative counting way to improve the accuracy of slow disk detection
                 Key: HDFS-15744
                 URL: https://issues.apache.org/jira/browse/HDFS-15744
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Haibin Huang
            Assignee: Haibin Huang
         Attachments: image-2020-12-22-11-37-14-734.png, image-2020-12-22-11-37-35-280.png, image-2020-12-22-11-46-48-817.png
HDFS-11461 supports datanode disk outlier detection, and we can use it to find slow disks via the SlowDiskReport (HDFS-11551). However, I found that the slow-disk information may not be accurate enough in practice, because a large number of short-term writes can lead to miscalculation. Here is an example: the disk is healthy, but when it encounters a lot of writes within a few minutes, its write IO does get slow, and it will be considered a slow disk. The disk is slow for only a few minutes, but the SlowDiskReport will keep it until the information becomes invalid. This scenario confuses us, since we want to use the SlowDiskReport to detect genuinely bad disks.

!image-2020-12-22-11-37-14-734.png!
!image-2020-12-22-11-37-35-280.png!

To improve the detection accuracy, we use a cumulative counting way to detect slow disks: if, within the reportValidityMs interval, a disk is considered an outlier in more than 50% of the detection rounds, then it should be a really bad disk. For example, if reportValidityMs is one hour and the detection interval is five minutes, there will be 12 disk outlier detections in one hour; if a disk is considered an outlier more than 6 times, it should be a really bad disk. We use this approach to detect bad disks in our cluster, and it reaches over 90% accuracy.

!image-2020-12-22-11-46-48-817.png!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
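The cumulative counting scheme described in the issue could be sketched roughly as below. This is an illustrative standalone sketch, not the actual HDFS patch: the class and method names (`CumulativeSlowDiskDetector`, `record`, `isSlowDisk`) are hypothetical, and it tracks a single disk. Each detection round records whether the disk was flagged as an outlier; rounds older than reportValidityMs are dropped, and the disk is reported slow only when more than half of the rounds still inside the window flagged it.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only -- not the actual HDFS implementation.
public class CumulativeSlowDiskDetector {

  // One detection round: when it ran and whether the disk was an outlier.
  private static final class Round {
    final long timeMs;
    final boolean outlier;
    Round(long timeMs, boolean outlier) {
      this.timeMs = timeMs;
      this.outlier = outlier;
    }
  }

  private final long reportValidityMs;            // sliding window, e.g. one hour
  private final Deque<Round> rounds = new ArrayDeque<>();

  public CumulativeSlowDiskDetector(long reportValidityMs) {
    this.reportValidityMs = reportValidityMs;
  }

  /** Record the result of one detection round at time nowMs. */
  public void record(long nowMs, boolean flaggedAsOutlier) {
    rounds.addLast(new Round(nowMs, flaggedAsOutlier));
    prune(nowMs);
  }

  /** True when more than 50% of the rounds inside the window flagged the disk. */
  public boolean isSlowDisk(long nowMs) {
    prune(nowMs);
    if (rounds.isEmpty()) {
      return false;
    }
    long flagged = rounds.stream().filter(r -> r.outlier).count();
    return flagged * 2 > rounds.size();
  }

  // Drop rounds that fell out of the reportValidityMs window.
  private void prune(long nowMs) {
    while (!rounds.isEmpty() && nowMs - rounds.peekFirst().timeMs >= reportValidityMs) {
      rounds.removeFirst();
    }
  }

  public static void main(String[] args) {
    // One-hour window, five-minute detection interval => 12 rounds per hour.
    CumulativeSlowDiskDetector d = new CumulativeSlowDiskDetector(3_600_000L);
    for (int i = 0; i < 12; i++) {
      d.record(i * 300_000L, i < 7);   // flagged as outlier in 7 of 12 rounds
    }
    System.out.println("slow disk: " + d.isSlowDisk(3_300_000L));
  }
}
```

With this shape, a disk that is briefly saturated by a burst of writes contributes only one or two outlier rounds and never crosses the 6-of-12 threshold, while a genuinely bad disk keeps accumulating outlier rounds across the whole window.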