The cluster has 1 ZooKeeper node and 3 Solr nodes. There is only one collection, with 3 shards. Data is continuously indexed using the SolrJ API. The system runs on AWS, and I am taking backups on EFS (Elastic File System).
Observed behavior: If indexing is not in progress and I take a backup of the cluster using the collection API, the backup succeeds and restore works as expected. snapshotscli.sh also works as expected if I first take a snapshot of the index while indexing is in progress and then take the backup; there is no error during restore. However, most of the time I get an error when I try to restore the collection from a backup taken using the collection API while indexing was still in progress. The error is always a missing segment, and I can see that the segment file it is trying to read during restore does not exist in the backup shard directory.

Also, is there a way to take a snapshot of a SolrCloud collection using the collection API? The user guide only documents taking a snapshot of a core using the collection API.

2017-09-08 19:47:22.592 WARN (parallelCoreAdminExecutor-5-thread-8-processing-n:ec2-34-201-149-27.compute-1.amazonaws.com:8983_solr t1cloudbackuponefs-r2187461299681393 RESTORECORE) [ ] o.a.s.h.RestoreCore Could not switch to restored index. Rolling back to the current index
org.apache.lucene.index.CorruptIndexException: Unexpected file read error while reading index. (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/var/solr/data/t1cloud3_shard2_replica0/data/restore.20170908194722131/segments_y")))
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:930)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:118)
    at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:93)
    at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:248)
    at org.apache.solr.update.DefaultSolrCoreState.changeWriter(DefaultSolrCoreState.java:211)
    at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:220)
    at org.apache.solr.update.DirectUpdateHandler2.newIndexWriter(DirectUpdateHandler2.java:726)
    at org.apache.solr.handler.RestoreCore.doRestore(RestoreCore.java:108)
    at org.apache.solr.handler.admin.RestoreCoreOp.execute(RestoreCoreOp.java:65)
    at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:384)
    at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:388)
    at org.apache.solr.handler.admin.CoreAdminHandler.lambda$handleRequestBody$0(CoreAdminHandler.java:182)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.NoSuchFileException: /var/solr/data/t1cloud3_shard2_replica0/data/restore.20170908194722131/_4m.si
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
    at java.nio.channels.FileChannel.open(FileChannel.java:287)
    at java.nio.channels.FileChannel.open(FileChannel.java:335)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
    at org.apache.lucene.store.NRTCachingDirectory.openInput(NRTCachingDirectory.java:192)
    at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:137)
    at org.apache.lucene.codecs.lucene62.Lucene62SegmentInfoFormat.read(Lucene62SegmentInfoFormat.java:89)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:288)
    ... 17 more
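
For reference, collection API backup and restore requests of the kind described above have roughly the following shape. This is only a sketch: the host, backup name, collection names, and EFS mount path are placeholders, not the real values from this cluster, and `collection_api_url` is a hypothetical helper that only builds the URLs and sends nothing.

```python
from urllib.parse import urlencode


def collection_api_url(base_url, action, **params):
    """Build a collection API URL (hypothetical helper; no request is sent)."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/collections?{query}"


# All values below are placeholders for illustration only.
SOLR = "http://localhost:8983/solr"
LOCATION = "/mnt/efs/solr-backups"  # assumed EFS mount visible to every Solr node

# BACKUP writes the collection's index files and config under LOCATION/name.
backup_url = collection_api_url(
    SOLR, "BACKUP", name="backup1", collection="t1cloud", location=LOCATION
)

# RESTORE creates a collection from that backup.
restore_url = collection_api_url(
    SOLR, "RESTORE", name="backup1", collection="t1cloud_restored", location=LOCATION
)

print(backup_url)
print(restore_url)
```

The location must be a path that every Solr node can read and write, which is why a shared file system such as EFS is used here.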