Hi,



After further diagnosis, we found a misconfiguration in the Ceph file system 
layer, as well as some bugs in the Kubernetes CSI driver. 




Now we believe that the exception is not caused by Ratis.






Riguz Lee
[email protected]









Original Email


Sender: "Riguz Lee" <[email protected]>

Sent Time: 2022/6/27 18:28

To: "user" <[email protected]>

Subject: Ratis start failed due to "OverlappingFileLockException"






Hi there,

I get an error when trying to start a Raft node that is deployed inside a 
Kubernetes cluster. Here's the error output:



Caused by: java.io.IOException: Failed to lock storage /data/ratis-data/dynamic-service-2.dynamic-service-gcek/43dea5d8-f076-11ec-8ea0-0242ac120002. The directory is already locked
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.tryLock(RaftStorageDirectoryImpl.java:236) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lambda$lock$0(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:166) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.FileUtils.attempt(FileUtils.java:40) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lock(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.analyzeStorage(RaftStorageDirectoryImpl.java:153) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.analyzeAndRecoverStorage(RaftStorageImpl.java:97) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:67) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:52) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:116) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$5(RaftServerProxy.java:274) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.nio.channels.OverlappingFileLockException
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.tryLock(RaftStorageDirectoryImpl.java:227) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lambda$lock$0(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.JavaUtils.attempt(JavaUtils.java:166) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.util.FileUtils.attempt(FileUtils.java:40) ~[ratis-common-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.lock(RaftStorageDirectoryImpl.java:194) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageDirectoryImpl.analyzeStorage(RaftStorageDirectoryImpl.java:153) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.analyzeAndRecoverStorage(RaftStorageImpl.java:97) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:67) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.storage.RaftStorageImpl.<init>(RaftStorageImpl.java:52) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.ServerState.<init>(ServerState.java:116) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:201) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$5(RaftServerProxy.java:274) ~[ratis-server-2.3.0.jar!/:2.3.0]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]




I've tried recreating the Raft directory (by recreating the PVC) and 
restarting the pod, but I still get the same error.




Each pod has its own data storage, so there is no reason for the directory to 
be locked by two Ratis processes.

So I guess it might be some kind of bug? I found a JIRA issue that looks 
almost the same: https://issues.apache.org/jira/browse/RATIS-538
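
For reference, the JDK documents OverlappingFileLockException as being thrown when an overlapping lock is already held by the same Java virtual machine; when a different process holds the lock, FileChannel.tryLock() returns null instead. Below is a minimal sketch of that behaviour. It is not Ratis code: the class name LockProbe, the method secondLockOverlaps, and the file name probe.lock are made up for illustration, and it assumes a file system where file locking works normally.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical probe class, not part of Ratis.
final class LockProbe {

    /**
     * Locks the file once, then tries to lock it again through a second
     * channel in the same JVM. Returns true if the second attempt throws
     * OverlappingFileLockException.
     */
    static boolean secondLockOverlaps(Path lockFile) throws IOException {
        try (FileChannel first = FileChannel.open(lockFile,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileChannel second = FileChannel.open(lockFile,
                     StandardOpenOption.WRITE)) {
            FileLock held = first.tryLock();
            if (held == null) {
                // null means a different process already holds the lock.
                throw new IOException("Locked by another process: " + lockFile);
            }
            try {
                FileLock again = second.tryLock();
                if (again != null) {
                    again.release();
                }
                return false;
            } catch (OverlappingFileLockException e) {
                // The lock is already held inside this JVM.
                return true;
            } finally {
                held.release();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Optional first argument: a directory to probe (assumed to exist).
        Path dir = args.length > 0
                ? Path.of(args[0])
                : Files.createTempDirectory("lock-probe");
        System.out.println(secondLockOverlaps(dir.resolve("probe.lock")));
    }
}
```

Passing a directory on the mounted volume as the first argument may help tell a storage-layer locking problem apart from an application-level one, although this probe alone cannot confirm either cause.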




Any ideas how to fix it?




Thanks,




Riguz Lee
