Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-04-09 Thread via GitHub


ptlrs commented on PR #8843:
URL: https://github.com/apache/ozone/pull/8843#issuecomment-4216327335

   Thanks for the reviews @ChenSammi @adoroszlai @Tejaskriya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-04-07 Thread via GitHub


ChenSammi commented on PR #8843:
URL: https://github.com/apache/ozone/pull/8843#issuecomment-4203858109

   Thanks @ptlrs , and @Tejaskriya @adoroszlai for review. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-04-07 Thread via GitHub


ChenSammi merged PR #8843:
URL: https://github.com/apache/ozone/pull/8843


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-04-07 Thread via GitHub


ptlrs commented on PR #8843:
URL: https://github.com/apache/ozone/pull/8843#issuecomment-4201301527

   I have removed the modifications from DeprecationDelta


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-04-07 Thread via GitHub


ChenSammi commented on PR #8843:
URL: https://github.com/apache/ozone/pull/8843#issuecomment-4198111549

   @ptlrs , it looks like we cannot add 
"hdds.datanode.disk.check.io.test.count" to DeprecationDelta. It's different 
property type. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-04-02 Thread via GitHub


ptlrs commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r3030881349


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -404,6 +409,19 @@ public class DatanodeConfiguration extends 
ReconfigurableConfig {
   )
   private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 
+  @Config(key = "hdds.datanode.disk.check.sliding.window.timeout",

Review Comment:
   Ok, I reverted the change which removed the config and updated the 
deprecated config list. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-31 Thread via GitHub


ChenSammi commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r3019611363


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -404,6 +409,19 @@ public class DatanodeConfiguration extends 
ReconfigurableConfig {
   )
   private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 
+  @Config(key = "hdds.datanode.disk.check.sliding.window.timeout",

Review Comment:
   To not break existing users, it's recommend to add new property 
"hdds.datanode.disk.check.io.test.enabled", instead of change the current 
property "hdds.datanode.disk.check.io.test.count" to 
"hdds.datanode.disk.check.io.test.enabled". 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-30 Thread via GitHub


ptlrs commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r3013230591


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -688,6 +706,23 @@ public void validate() {
   diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 }
 
+if (diskCheckSlidingWindowTimeout.isNegative()) {
+  LOG.warn("{} must be greater than zero and was set to {}. Defaulting to 
{}",
+  DISK_CHECK_SLIDING_WINDOW_TIMEOUT_KEY, diskCheckSlidingWindowTimeout,
+  DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT);
+  diskCheckSlidingWindowTimeout = 
DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT;
+}
+
+// Do not set window timeout <= periodic disk check interval period, or 
failures can be missed across sparse checks
+// e.g., every 120m interval with a 60m window rarely accumulates enough 
failed events
+if 
(diskCheckSlidingWindowTimeout.compareTo(Duration.ofMinutes(periodicDiskCheckIntervalMinutes))
 < 0) {
+  LOG.warn("{} must be greater than or equal to {} minutes and was set to 
{} minutes. Defaulting to {}",
+  DISK_CHECK_SLIDING_WINDOW_TIMEOUT_KEY, 
periodicDiskCheckIntervalMinutes,
+  diskCheckSlidingWindowTimeout.toMinutes(),
+  
DurationFormatUtils.formatDurationHMS(DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT.toMillis()));
+  diskCheckSlidingWindowTimeout = 
DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT;

Review Comment:
   Good catch. Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-30 Thread via GitHub


ptlrs commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r3013229707


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -404,6 +409,19 @@ public class DatanodeConfiguration extends 
ReconfigurableConfig {
   )
   private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 
+  @Config(key = "hdds.datanode.disk.check.sliding.window.timeout",

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-29 Thread via GitHub


ChenSammi commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r3007843398


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -404,6 +409,19 @@ public class DatanodeConfiguration extends 
ReconfigurableConfig {
   )
   private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 
+  @Config(key = "hdds.datanode.disk.check.sliding.window.timeout",

Review Comment:
   With this sliding window introduced,  
"hdds.datanode.disk.check.io.test.count" property function is half removed. We 
should consider deprecate "hdds.datanode.disk.check.io.test.count" and 
introduce a new boolean property with name, like 
"hdds.datanode.disk.check.io.test.enabled". 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-29 Thread via GitHub


ChenSammi commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r3007781558


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -688,6 +706,23 @@ public void validate() {
   diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 }
 
+if (diskCheckSlidingWindowTimeout.isNegative()) {
+  LOG.warn("{} must be greater than zero and was set to {}. Defaulting to 
{}",
+  DISK_CHECK_SLIDING_WINDOW_TIMEOUT_KEY, diskCheckSlidingWindowTimeout,
+  DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT);
+  diskCheckSlidingWindowTimeout = 
DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT;
+}
+
+// Do not set window timeout <= periodic disk check interval period, or 
failures can be missed across sparse checks
+// e.g., every 120m interval with a 60m window rarely accumulates enough 
failed events
+if 
(diskCheckSlidingWindowTimeout.compareTo(Duration.ofMinutes(periodicDiskCheckIntervalMinutes))
 < 0) {
+  LOG.warn("{} must be greater than or equal to {} minutes and was set to 
{} minutes. Defaulting to {}",
+  DISK_CHECK_SLIDING_WINDOW_TIMEOUT_KEY, 
periodicDiskCheckIntervalMinutes,
+  diskCheckSlidingWindowTimeout.toMinutes(),
+  
DurationFormatUtils.formatDurationHMS(DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT.toMillis()));
+  diskCheckSlidingWindowTimeout = 
DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT;

Review Comment:
   Instead of set diskCheckSlidingWindowTimeout to static  
DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT, how about set 
diskCheckSlidingWindowTimeout to 
Duration.ofMinutes(periodicDiskCheckInterval).plus(diskCheckTimeout)? 
   Same for above diskCheckSlidingWindowTimeout negative check. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-27 Thread via GitHub


adoroszlai commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r3002594733


##
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/volume/TestStorageVolumeHealthChecks.java:
##
@@ -57,19 +60,22 @@ public static Stream volumeBuilders() {
 new HddsVolume.Builder(volumePath.toString())
 .datanodeUuid(DATANODE_UUID)
 .conf(CONF)
-.usageCheckFactory(MockSpaceUsageCheckFactory.NONE);
+.usageCheckFactory(MockSpaceUsageCheckFactory.NONE)
+.clock(TEST_CLOCK);

Review Comment:
   No need to update the patch for this, just a note for the future: adding 
`.clock...` in a previous line would reduce unnecessary noise in the patch (due 
to moving around the `;`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-27 Thread via GitHub


ptlrs commented on PR #8843:
URL: https://github.com/apache/ozone/pull/8843#issuecomment-418630

   Hi @errose28 @ChenSammi @adoroszlai I have updated this PR, could you please 
take another look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-25 Thread via GitHub


adoroszlai commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r2987432293


##
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/SlidingWindow.java:
##
@@ -149,12 +149,16 @@ private long getCurrentTime() {
 return clock.millis();
   }
 
+  public long getExpiryDurationMillis() {
+return expiryDurationMillis;
+  }
+
   /**
* A custom monotonic clock implementation.
* Implementation of Clock that uses System.nanoTime() for real usage.
* See {@see org.apache.ozone.test.TestClock}

Review Comment:
   ```
   [WARNING] 
hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/utils/SlidingWindow.java:159:
 warning - invalid usage of tag {@see org.apache.ozone.test.TestClock}
   ```
   
   ```suggestion
  * @see org.apache.ozone.test.TestClock
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-25 Thread via GitHub


adoroszlai commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r2987420224


##
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/volume/TestStorageVolumeHealthChecks.java:
##
@@ -304,13 +308,23 @@ private void 
testCheckIOUntilFailure(StorageVolume.Builder builder,
 DatanodeConfiguration dnConf = CONF.getObject(DatanodeConfiguration.class);
 dnConf.setVolumeIOTestCount(ioTestCount);
 dnConf.setVolumeIOFailureTolerance(ioFailureTolerance);
+dnConf.setDiskCheckSlidingWindowTimeout(Duration.ofMillis(ioTestCount));
 CONF.setFromObject(dnConf);
 builder.conf(CONF);
 StorageVolume volume = builder.build();
 volume.format(CLUSTER_ID);
 volume.createTmpDirs(CLUSTER_ID);
+// Sliding window protocol transitioned from count-based to a time-based 
system
+// Update the default failure duration of the window from 60 minutes to a 
shorter duration for the test
+long eventRate = 1L;
+TestClock testClock = TestClock.newInstance();
+Field clock = SlidingWindow.class.getDeclaredField("clock");
+clock.setAccessible(true);
+clock.set(volume.getIoTestSlidingWindow(), testClock);

Review Comment:
   Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-24 Thread via GitHub


ptlrs commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r2985324862


##
hadoop-hdds/container-service/src/test/java/org/apache/hadoop/ozone/container/common/volume/TestStorageVolumeHealthChecks.java:
##
@@ -304,13 +308,23 @@ private void 
testCheckIOUntilFailure(StorageVolume.Builder builder,
 DatanodeConfiguration dnConf = CONF.getObject(DatanodeConfiguration.class);
 dnConf.setVolumeIOTestCount(ioTestCount);
 dnConf.setVolumeIOFailureTolerance(ioFailureTolerance);
+dnConf.setDiskCheckSlidingWindowTimeout(Duration.ofMillis(ioTestCount));
 CONF.setFromObject(dnConf);
 builder.conf(CONF);
 StorageVolume volume = builder.build();
 volume.format(CLUSTER_ID);
 volume.createTmpDirs(CLUSTER_ID);
+// Sliding window protocol transitioned from count-based to a time-based 
system
+// Update the default failure duration of the window from 60 minutes to a 
shorter duration for the test
+long eventRate = 1L;
+TestClock testClock = TestClock.newInstance();
+Field clock = SlidingWindow.class.getDeclaredField("clock");
+clock.setAccessible(true);
+clock.set(volume.getIoTestSlidingWindow(), testClock);

Review Comment:
   Done



##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java:
##
@@ -717,39 +718,25 @@ public synchronized VolumeCheckResult check(@Nullable 
Boolean unused)
 // We can check again if disk is full. If it is full,
 // in this case keep volume as healthy so that READ can still be served
 if (!diskChecksPassed && getCurrentUsage().getAvailable() < 
minimumDiskSpace) {
-  ioTestSlidingWindow.add(true);
   return VolumeCheckResult.HEALTHY;
 }
 
-// Move the sliding window of IO test results forward 1 by adding the
-// latest entry and removing the oldest entry from the window.
-// Update the failure counter for the new window.
-ioTestSlidingWindow.add(diskChecksPassed);
 if (!diskChecksPassed) {
-  currentIOFailureCount.incrementAndGet();
-}
-if (ioTestSlidingWindow.size() > ioTestCount &&
-Objects.equals(ioTestSlidingWindow.poll(), Boolean.FALSE)) {
-  currentIOFailureCount.decrementAndGet();
+  ioTestSlidingWindow.add();
 }
 
-// If the failure threshold has been crossed, fail the volume without
-// further scans.
+// If the failure threshold has been crossed, fail the volume without 
further scans.
 // Once the volume is failed, it will not be checked anymore.
 // The failure counts can be left as is.
-if (currentIOFailureCount.get() > ioFailureTolerance) {
-  LOG.error("Failed IO test for volume {}: the last {} runs " +
-  "encountered {} out of {} tolerated failures.", this,
-  ioTestSlidingWindow.size(), currentIOFailureCount,
-  ioFailureTolerance);
+if (ioTestSlidingWindow.isExceeded()) {
+  LOG.error("Failed IO test for volume {}: encountered more than the {} 
tolerated failures within the past {} ms.",
+  this, ioTestSlidingWindow.getWindowSize(), 
ioTestSlidingWindow.getExpiryDurationMillis());
   return VolumeCheckResult.FAILED;
-} else if (LOG.isDebugEnabled()) {
-  LOG.debug("IO test results for volume {}: the last {} runs encountered " 
+
-  "{} out of {} tolerated failures", this,
-  ioTestSlidingWindow.size(),
-  currentIOFailureCount, ioFailureTolerance);
 }
 
+LOG.debug("IO test results for volume {}: encountered {} out of {} 
tolerated failures",
+this, ioTestSlidingWindow.getNumEventsInWindow(), 
ioTestSlidingWindow.getWindowSize());

Review Comment:
   `getNumEventsInWindow` is correct here. It will show how many failures are 
in the actual window. The window size is the number of failures that are 
allowed. 
   
   This log line is logged every time in debug mode. Since our queue is larger 
than the window, logging `getNumEvents` may give us the false impression that 
we got more failures as it will include the errors outside the window as well. 
   
   We are using `getNumEvents` only for logging in tests. 



##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -404,6 +407,16 @@ public class DatanodeConfiguration extends 
ReconfigurableConfig {
   )
   private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 
+  @Config(key = "hdds.datanode.disk.check.sliding.window.timeout",
+  defaultValue = "60m",
+  type = ConfigType.TIME,
+  tags = {ConfigTag.DATANODE},
+  description = "Time interval after which a disk check"
+  + " failure result stored in the sliding window will expire."
+  + " Unit could be defined with postfix (ns,ms,s,m,h,d)."
+  )
+  private Duration diskCheckSlidingWindowTimeout = 
DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT;

Review Comment:
 

Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-10 Thread via GitHub


ChenSammi commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r2915854722


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java:
##
@@ -717,39 +718,25 @@ public synchronized VolumeCheckResult check(@Nullable 
Boolean unused)
 // We can check again if disk is full. If it is full,
 // in this case keep volume as healthy so that READ can still be served
 if (!diskChecksPassed && getCurrentUsage().getAvailable() < 
minimumDiskSpace) {
-  ioTestSlidingWindow.add(true);
   return VolumeCheckResult.HEALTHY;
 }
 
-// Move the sliding window of IO test results forward 1 by adding the
-// latest entry and removing the oldest entry from the window.
-// Update the failure counter for the new window.
-ioTestSlidingWindow.add(diskChecksPassed);
 if (!diskChecksPassed) {
-  currentIOFailureCount.incrementAndGet();
-}
-if (ioTestSlidingWindow.size() > ioTestCount &&
-Objects.equals(ioTestSlidingWindow.poll(), Boolean.FALSE)) {
-  currentIOFailureCount.decrementAndGet();
+  ioTestSlidingWindow.add();
 }
 
-// If the failure threshold has been crossed, fail the volume without
-// further scans.
+// If the failure threshold has been crossed, fail the volume without 
further scans.
 // Once the volume is failed, it will not be checked anymore.
 // The failure counts can be left as is.
-if (currentIOFailureCount.get() > ioFailureTolerance) {
-  LOG.error("Failed IO test for volume {}: the last {} runs " +
-  "encountered {} out of {} tolerated failures.", this,
-  ioTestSlidingWindow.size(), currentIOFailureCount,
-  ioFailureTolerance);
+if (ioTestSlidingWindow.isExceeded()) {
+  LOG.error("Failed IO test for volume {}: encountered more than the {} 
tolerated failures within the past {} ms.",
+  this, ioTestSlidingWindow.getWindowSize(), 
ioTestSlidingWindow.getExpiryDurationMillis());
   return VolumeCheckResult.FAILED;
-} else if (LOG.isDebugEnabled()) {
-  LOG.debug("IO test results for volume {}: the last {} runs encountered " 
+
-  "{} out of {} tolerated failures", this,
-  ioTestSlidingWindow.size(),
-  currentIOFailureCount, ioFailureTolerance);
 }
 
+LOG.debug("IO test results for volume {}: encountered {} out of {} 
tolerated failures",
+this, ioTestSlidingWindow.getNumEventsInWindow(), 
ioTestSlidingWindow.getWindowSize());

Review Comment:
   Shall we call getNumEvents() here, instead of getNumEventsInWindow()?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-10 Thread via GitHub


ChenSammi commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r2915835263


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/StorageVolume.java:
##
@@ -163,9 +159,8 @@ protected StorageVolume(Builder b) throws IOException {
   this.conf = b.conf;
   this.dnConf = conf.getObject(DatanodeConfiguration.class);
   this.ioTestCount = dnConf.getVolumeIOTestCount();
-  this.ioFailureTolerance = dnConf.getVolumeIOFailureTolerance();
-  this.ioTestSlidingWindow = new LinkedList<>();
-  this.currentIOFailureCount = new AtomicInteger(0);
+  this.ioTestSlidingWindow = new 
SlidingWindow(dnConf.getVolumeIOFailureTolerance(),

Review Comment:
   Thanks @ptlrs for this improvement.  Time window based solution is better 
than current accumulated count solution, volume has the chance to survive 
through the temporary failures.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HDDS-13108. Refactor StorageVolume to use SlidingWindow [ozone]

2026-03-10 Thread via GitHub


ChenSammi commented on code in PR #8843:
URL: https://github.com/apache/ozone/pull/8843#discussion_r2915827798


##
hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeConfiguration.java:
##
@@ -404,6 +407,16 @@ public class DatanodeConfiguration extends 
ReconfigurableConfig {
   )
   private Duration diskCheckTimeout = DISK_CHECK_TIMEOUT_DEFAULT;
 
+  @Config(key = "hdds.datanode.disk.check.sliding.window.timeout",
+  defaultValue = "60m",
+  type = ConfigType.TIME,
+  tags = {ConfigTag.DATANODE},
+  description = "Time interval after which a disk check"
+  + " failure result stored in the sliding window will expire."
+  + " Unit could be defined with postfix (ns,ms,s,m,h,d)."
+  )
+  private Duration diskCheckSlidingWindowTimeout = 
DISK_CHECK_SLIDING_WINDOW_TIMEOUT_DEFAULT;

Review Comment:
   What's the recommendation of value relationship between this new property 
and PERIODIC_DISK_CHECK_INTERVAL_MINUTES_DEFAULT?  Say if user reconfigured 
PERIODIC_DISK_CHECK_INTERVAL_MINUTES_DEFAULT to 2h, or 30m, shall we suggest 
user to reconfigure this property too? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]