Hi,
After our diagnosis, we found a misconfiguration in the Ceph file system layer, as well as some bugs in the Kubernetes CSI driver. We now believe that the exception is not caused by Ratis.

Riguz Lee
[email protected]

Original Email

Sender: "Riguz Lee" <[email protected]>
Sent Time: 2022/6/27 18:28
To: "user" <[email protected]>
Subject: Ratis start failed due to "OverlappingFileLockException"

Hi there,

I get an error when trying to start a Raft node that is deployed inside a Kubernetes cluster. Here's the error info:

Caused by: java.io.IOException: Failed to lock storage /data/ratis-data/dynamic-service-2.dynamic-service-gcek/43dea5d8-f076-11ec-8ea0-0242ac120002. The directory is already locked
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.tryLock(RaftStorageDirectoryImpl.java:236) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lambda$lock$0(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:166) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.FileUtils.attempt(FileUtils.java:40) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lock(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.analyzeStorage(RaftStorageDirectoryImpl.java:153) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.analyzeAndRecoverStorage(RaftStorageImpl.java:97) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:67) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:52) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:116) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$5(RaftServerProxy.java:274) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.nio.channels.OverlappingFileLockException
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.tryLock(RaftStorageDirectoryImpl.java:227) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lambda$lock$0(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:166) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.FileUtils.attempt(FileUtils.java:40) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lock(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.analyzeStorage(RaftStorageDirectoryImpl.java:153) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.analyzeAndRecoverStorage(RaftStorageImpl.java:97) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:67) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:52) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:116) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$5(RaftServerProxy.java:274) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]

I've tried recreating the Raft directory (by recreating the PVC) and restarting the pod, but I still get the same issue. Each pod has its own data storage, so there's no reason for the directory to be locked by two Ratis processes. So I guess it might be some kind of bug? I found a JIRA issue here: https://issues.apache.org/jira/browse/RATIS-538, which is almost the same. Any ideas on how to fix it?

Thanks,
Riguz Lee
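For background on the exception in the trace: the JDK documents that FileChannel.tryLock() throws OverlappingFileLockException when the same Java virtual machine already holds an overlapping lock on the file, and returns null when a different process holds it. The sketch below shows that JDK behaviour in isolation. It does not use any Ratis code and is not a reproduction of the Ceph/CSI problem; the class name, method name, and lock file name are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; not part of Ratis.
class OverlappingLockDemo {

    /**
     * Locks the file through one channel, then tries to lock it again
     * through a second channel in the same JVM. Returns true if the
     * second attempt throws OverlappingFileLockException.
     */
    static boolean secondLockOverlaps(Path lockFile) throws IOException {
        try (FileChannel first = FileChannel.open(lockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileChannel second = FileChannel.open(lockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {

            FileLock held = first.tryLock();
            if (held == null) {
                // Another process holds the lock, so this case cannot be shown.
                return false;
            }
            try {
                FileLock again = second.tryLock();
                if (again != null) {
                    again.release();
                }
                return false;
            } catch (OverlappingFileLockException e) {
                // The JVM-wide lock table already contains a lock on this file.
                return true;
            } finally {
                held.release();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical lock file in a fresh temporary directory.
        Path dir = Files.createTempDirectory("lock-demo");
        Path lockFile = dir.resolve("demo.lock");
        System.out.println("second tryLock overlapped: " + secondLockOverlaps(lockFile));
    }
}
```

On a local file system with a current JDK, the second attempt is expected to hit the catch branch. How file locks behave on a network or distributed file system depends on that file system and its mount options, so this sketch says nothing about the Ceph setup described above.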
