Hi everyone,

We have a 3-node SolrCloud cluster deployed with the Solr Operator, running version 8.11.1. We have created a collection with 2 shards and 3 replicas per shard.
We were able to index around 45 million documents, but for the past few days one node has been "missing" 4.3 million documents and is in the "recovering" state. The recovery never completes; it appears to be stuck. In the logs I get the following error messages:

10/11/2022, 9:33:37 PM ERROR false x:Documents_shard2_replica_n10 IndexFetcher Error fetching file, doing one retry...
10/11/2022, 9:33:37 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _itx6_Lucene84_0.doc (downloaded 583008256 of 798710576 bytes)
10/11/2022, 9:33:39 PM ERROR false x:Documents_shard2_replica_n10 IndexFetcher Error deleting file: _itx6_Lucene84_0.doc
10/11/2022, 9:33:39 PM ERROR false x:Documents_shard2_replica_n10 ReplicationHandler Index fetch failed
10/11/2022, 9:33:39 PM ERROR false x:Documents_shard2_replica_n10 RecoveryStrategy Error while trying to recover
10/11/2022, 9:33:39 PM ERROR false x:Documents_shard2_replica_n10 RecoveryStrategy Recovery failed - trying again... (0)
10/11/2022, 9:34:37 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _fby0.fdt (downloaded 2404384768 of 2924978973 bytes)
10/11/2022, 9:34:37 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _mkqu.fdt (downloaded 2736783360 of 2937231251 bytes)
10/11/2022, 9:34:40 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _mkqu.fdt (downloaded 2736783360 of 2937231251 bytes)
10/11/2022, 9:34:41 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _fby0.fdt (downloaded 2404384768 of 2924978973 bytes)
10/11/2022, 9:34:41 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _mkqu.fdt (downloaded 2736783360 of 2937231251 bytes)
10/11/2022, 9:34:42 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _mkqu.fdt (downloaded 2736783360 of 2937231251 bytes)
10/11/2022, 9:34:44 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _mkqu.fdt (downloaded 2736783360 of 2937231251 bytes)
10/11/2022, 9:34:44 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _fby0.fdt (downloaded 2404384768 of 2924978973 bytes)
10/11/2022, 9:34:45 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _mkqu.fdt (downloaded 2736783360 of 2937231251 bytes)
10/11/2022, 9:34:46 PM ERROR false x:Documents_shard1_replica_n4 IndexFetcher Error fetching file, doing one retry...
10/11/2022, 9:34:46 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _mkqu.fdt (downloaded 2736783360 of 2937231251 bytes)
10/11/2022, 9:34:47 PM ERROR false x:Documents_shard1_replica_n4 IndexFetcher Error deleting file: _mkqu.fdt
10/11/2022, 9:34:48 PM ERROR false x:Documents_shard1_replica_n4 ReplicationHandler Index fetch failed
10/11/2022, 9:34:48 PM ERROR false x:Documents_shard1_replica_n4 RecoveryStrategy Error while trying to recover
10/11/2022, 9:34:48 PM ERROR false x:Documents_shard1_replica_n4 RecoveryStrategy Recovery failed - trying again... (0)
10/11/2022, 9:36:23 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _mk2d.fdt (downloaded 1245708288 of 2947895758 bytes)
10/11/2022, 9:36:23 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _ismo_Lucene84_0.tim (downloaded 111149056 of 230410466 bytes)
10/11/2022, 9:36:25 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _ismo_Lucene84_0.tim (downloaded 111149056 of 230410466 bytes)
10/11/2022, 9:36:25 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _ismo_Lucene84_0.tim (downloaded 111149056 of 230410466 bytes)
10/11/2022, 9:36:26 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _ismo_Lucene84_0.tim (downloaded 111149056 of 230410466 bytes)
10/11/2022, 9:36:26 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _ismo_Lucene84_0.tim (downloaded 111149056 of 230410466 bytes)
10/11/2022, 9:36:27 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _ismo_Lucene84_0.tim (downloaded 111149056 of 230410466 bytes)
10/11/2022, 9:36:28 PM ERROR false x:Documents_shard1_replica_n4 IndexFetcher Error fetching file, doing one retry...
10/11/2022, 9:36:28 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _ismo_Lucene84_0.tim (downloaded 111149056 of 230410466 bytes)
10/11/2022, 9:36:28 PM ERROR false x:Documents_shard1_replica_n4 IndexFetcher Error deleting file: _ismo_Lucene84_0.tim
10/11/2022, 9:36:29 PM ERROR false x:Documents_shard1_replica_n4 ReplicationHandler Index fetch failed
10/11/2022, 9:36:29 PM ERROR false x:Documents_shard1_replica_n4 RecoveryStrategy Error while trying to recover
10/11/2022, 9:36:29 PM ERROR false x:Documents_shard1_replica_n4 RecoveryStrategy Recovery failed - trying again... (1)
10/11/2022, 9:37:20 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _q41p.fdt (downloaded 983564288 of 2951956658 bytes)
10/11/2022, 9:37:20 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _f1ie.fdt (downloaded 2085617664 of 2922856230 bytes)
10/11/2022, 9:37:30 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _f1ie.fdt (downloaded 2085617664 of 2922856230 bytes)
10/11/2022, 9:37:41 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _f1ie.fdt (downloaded 2085617664 of 2922856230 bytes)
10/11/2022, 9:37:44 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _q41p.fdt (downloaded 983564288 of 2951956658 bytes)
10/11/2022, 9:37:52 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _f1ie.fdt (downloaded 2085617664 of 2922856230 bytes)
10/11/2022, 9:38:04 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _f1ie.fdt (downloaded 2085617664 of 2922856230 bytes)
10/11/2022, 9:38:11 PM WARN false x:Documents_shard2_replica_n10 IndexFetcher Error in fetching file: _q41p.fdt (downloaded 983564288 of 2951956658 bytes)
10/11/2022, 9:38:15 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _f1ie.fdt (downloaded 2085617664 of 2922856230 bytes)
10/11/2022, 9:38:26 PM ERROR false x:Documents_shard1_replica_n4 IndexFetcher Error fetching file, doing one retry...
10/11/2022, 9:38:26 PM WARN false x:Documents_shard1_replica_n4 IndexFetcher Error in fetching file: _f1ie.fdt (downloaded 2085617664 of 2922856230 bytes)
10/11/2022, 9:38:37 PM ERROR false x:Documents_shard1_replica_n4 IndexFetcher Error deleting file: _f1ie.fdt
10/11/2022, 9:38:37 PM ERROR false x:Documents_shard1_replica_n4 ReplicationHandler Index fetch failed
10/11/2022, 9:38:37 PM ERROR false x:Documents_shard1_replica_n4 RecoveryStrategy Error while trying to recover
10/11/2022, 9:38:37 PM ERROR false x:Documents_shard1_replica_n4 RecoveryStrategy Recovery failed - trying again... (2)

At one point I also got this stack trace:

10/11/2022, 9:43:01 PM ERROR true x:Documents_shard1_replica_n4 IndexFetcher Error deleting file: _mkqu.fdt
java.nio.file.NoSuchFileException: /var/solr/data/Documents_shard1_replica_n4/data/index.20221011193935739/_mkqu.fdt
    at java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixFileSystemProvider.implDelete(Unknown Source)
    at java.base/sun.nio.fs.AbstractFileSystemProvider.delete(Unknown Source)
    at java.base/java.nio.file.Files.delete(Unknown Source)
    at org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:370)
    at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:339)
    at org.apache.lucene.store.NRTCachingDirectory.deleteFile(NRTCachingDirectory.java:118)
    at org.apache.solr.handler.IndexFetcher$DirectoryFile.delete(IndexFetcher.java:1948)
    at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1857)
    at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1743)
    at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1718)
    at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1109)
    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:619)
    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:384)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:458)
    at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:252)
    at org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:683)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:339)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

I ran the following command on each index (without the -exorcise argument):

java -cp lucene-core-8.11.1.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex

On all 6 indexes I got the same message: "No problems were detected with this index".

What should I do to recover from this situation?

Thank you in advance for any help you can give me!

Kind regards,
Alessandro