Hi Shamik,
I am pleased to see you find SPM useful!
I think that your problems might be related to caches exhausting your memory.
You mentioned that your index is 70GB, but how many documents does it have?
Remember that each filter cache entry can take up to 1 bit/doc. With a filter
cache size of 4096, a full cache can take up to 0.5GB per 1 million documents
(and that is per core, so multiply by the number of replicas you host). If you
have fields without doc values, check the sizes of fieldCache and
fieldValueCache as well.
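A rough back-of-the-envelope sketch (worst case, assuming every one of the
4096 entries is a full bitset; the document counts below are made up for
illustration):

  per entry:  numDocs / 8 bytes              (1 bit per document)
  1M docs:    ~0.125 MB per entry  ->  4096 * 0.125 MB ≈ 0.5 GB per core
  10M docs:   ~1.25 MB per entry   ->  4096 * 1.25 MB  ≈ 5 GB per core

So with several million documents per core, the filter cache alone can take a
significant chunk of the heap.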
Another thing you should revisit is heap size: heaps larger than ~32GB prevent
the JVM from using compressed OOPs (ordinary object pointers), so a heap just
over 32GB can actually fit fewer objects than a 31GB heap (the exact boundary
depends on the JVM, so it is not exactly 32GB).
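
For example (just a sketch, adjust to your own sizing), in solr.in.sh you could
drop the heap back under the compressed-OOPs boundary:

  SOLR_JAVA_MEM="-Xms31g -Xmx31g"

and verify that compressed OOPs are actually in use with something like:

  java -Xms31g -Xmx31g -XX:+PrintFlagsFinal -version | grep UseCompressedOops

Keeping the heap smaller also leaves more RAM for the OS page cache, which
Lucene relies on heavily when reading a 70GB index.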

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 19 Oct 2017, at 01:12, Shamik Bandopadhyay <sham...@gmail.com> wrote:
> 
> Hi,
> 
>  I'm having this weird issue where Solr nodes suddenly go into recovery
> mode and eventually fail. That one failure kicks off a cascading effect
> and eventually impacts the other nodes. Without a restart, the entire
> cluster goes into limbo after a while. Looking at the logs and the SPM
> monitoring tool, the issue happens under the following circumstances:
> 1. A node gets a spike in query/index requests, exhausting its allocated
> memory.
> 2. GC forces the CPU to use 100% of its capacity
> 3. None of the above, when both JVM and CPU are within limits
> 
> I'm using Solr 6.6. Here are the details about the node :
> 
> Hardware type: AWS m4.4xlarge instance
> Total memory : 64 gb
> CPU : 16
> SSD
> SOLR_JAVA_MEM="-Xms35g -Xmx35g"
> GC_TUNE="-XX:+UseG1GC \
> -XX:+ParallelRefProcEnabled \
> -XX:G1HeapRegionSize=8m \
> -XX:MaxGCPauseMillis=200 \
> -XX:+UseLargePages \
> -XX:+AggressiveOpts"
> SOLR_OPTS="$SOLR_OPTS -Xss256k"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=600000"
> SOLR_OPTS="$SOLR_OPTS -Dsolr.clustering.enabled=true"
> SOLR_OPTS="$SOLR_OPTS -Dpkiauth.ttl=120000"
> 
> Cache Parameters:
> <maxBooleanClauses>4096</maxBooleanClauses>
> <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
> <filterCache class="solr.FastLRUCache" size="4096" initialSize="1024"
> autowarmCount="20"/>
> <queryResultCache class="solr.LRUCache" size="256" initialSize="256"
> autowarmCount="10"/>
> <documentCache class="solr.LRUCache" size="256" initialSize="256"
> autowarmCount="10"/>
> <!-- custom cache currently used by block join -->
> <cache name="perSegFilter" class="solr.search.LRUCache" size="10"
> initialSize="0" autowarmCount="10" regenerator="solr.NoOpRegenerator" />
> <fieldValueCache class="solr.FastLRUCache" size="256" autowarmCount="256"
> showItems="10" />
> <cache enable="${solr.ltr.enabled:false}" name="QUERY_DOC_FV"
> class="solr.search.LRUCache" size="4096" initialSize="2048"
> autowarmCount="4096" regenerator="solr.search.NoOpRegenerator" />
> <enableLazyFieldLoading>true</enableLazyFieldLoading>
> <queryResultWindowSize>60</queryResultWindowSize>
> 
> I've currently 2 shards each having 2 replicas. The index size is
> approximately 70gb.
> 
> Here's a solr log trace from the series of events once the node starts
> getting into trouble. I've posted only the relevant ones here.
> 
> 
> org.apache.solr.common.SolrException.log(SolrException.java:148) -
> org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates
> are disabled.
>        at
> org.apache.solr.update.processor.DistributedUpdateProcessor.zkCheck(DistributedUpdateProcessor.java:1738)
> 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$5(CoreAdminOperation.java:143)
> - It has been requested that we recover: core=knowledge
> INFO647718[qtp2039328061-1526] -
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:732)
> - [admin] webapp=null path=/admin/cores
> params={core=knowledge&action=REQUESTRECOVERY&wt=javabin&version=2}
> status=0 QTime=0
> INFO647808[qtp2039328061-1540] -
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:187)
> - [knowledge]  webapp=/solr path=/update
> params={update.distrib=FROMLEADER&distrib.from=
> http://xx.xxx.xxx.63:8983/solr/knowledge/&wt=javabin&version=2}{} 0 0
> 
> WARN657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
> - Socket timeout on send prep recovery cmd, retrying..
> INFO657500[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
> - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
> [WaitForState:
> action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
> WARN667514[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:659)
> - Socket timeout on send prep recovery cmd, retrying..
> 
> The retry happens a few times, then
> 
> INFO689389[qtp2039328061-1649] -
> org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
> - request has come without principal. failed permission {
>  "name":"select",
>  "collection":"knowledge",
>  "path":"/select",
>  "role":[
>    "admin",
>    "dev",
>    "read"],
>  "index":3}
> INFO689390[qtp2039328061-1649] -
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
> USER_REQUIRED auth header null context : userPrincipal: [null] type:
> [READ], collections: [knowledge,], Path: [/select] path : /select params
> :q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
> INFO907854[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:676)
> - Sending prep recovery command to [http://xx.xxx.xxx.63:8983/solr];
> [WaitForState:
> action=PREPRECOVERY&core=knowledge&nodeName=xx.xxx.xxx.251:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
> INFO913811[commitScheduler-13-thread-1] -
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:603)
> - start
> commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.SolrIndexWriter.setCommitData(SolrIndexWriter.java:174)
> - Calling setCommitData with
> IW:org.apache.solr.update.SolrIndexWriter@29f7c7fc
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [IW][commitScheduler-13-thread-1]: commit: start
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [IW][commitScheduler-13-thread-1]: commit: enter lock
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [IW][commitScheduler-13-thread-1]: commit: now prepare
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [IW][commitScheduler-13-thread-1]: prepareCommit: flush
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [IW][commitScheduler-13-thread-1]:   index before flush
> _1e0k(6.6.0):C4171894/2478712:delGen=4036
> _1pl8(6.6.0):C774233/198857:delGen=642
> _1g34(6.6.0):C1023729/434317:delGen=3518
> _1q40(6.6.0):C978521/95955:delGen=457 _1pzp(6.6.0):c202089/41205:delGen=494
> _1qqe(6.6.0):c64285/33487:delGen=251 _1qsz(6.6.0):c163910/69541:delGen=233
> _1rhp(6.6.0):c114624/15323:delGen=51 _1rc2(6.6.0):c47728/12355:delGen=93
> _1rm0(6.6.0):c43606/4233:delGen=14 _1rl6(6.6.0):c1349/425:delGen=18
> _1rk9(6.6.0):C2237/901:delGen=26 _1rmg(6.6.0):c3698/1000:delGen=10
> _1rni(6.6.0):c5632/602:delGen=2 _1rlw(6.6.0):C3/2:delGen=2
> _1rn1(6.6.0):c475/134:delGen=6 _1rmi(6.6.0):C709/361:delGen=9
> _1rnp(6.6.0):c194/21:delGen=1 _1rmr(6.6.0):C1058/525:delGen=7
> _1rn4(6.6.0):C661/316:delGen=6 _1rnc(6.6.0):C1111/534:delGen=4
> _1rne(6.6.0):C1103/523:delGen=3 _1rnf(6.6.0):C1087/466:delGen=3
> _1rng(6.6.0):C1099/535:delGen=3 _1rnl(6.6.0):C1222/473:delGen=2
> _1rnn(6.6.0):C445/162:delGen=2 _1rnm(6.6.0):C2455/1084:delGen=2
> _1rnq(6.6.0):C588/200:delGen=1 _1rns(6.6.0):C1153/306:delGen=1
> _1rnr(6.6.0):C1434/309:delGen=1 _1rnt(6.6.0):C7
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [DW][commitScheduler-13-thread-1]: startFullFlush
> INFO913812[commitScheduler-13-thread-1] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [DW][commitScheduler-13-thread-1]: anyChanges? numDocsInRam=1725
> deletes=true hasTickets:false pendingChangesInFullFlush: false
> INFO951165[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.core.SolrCore.getIndexFingerprint(SolrCore.java:1853) -
> Caching fingerprint for searcher:Searcher@61856c91[knowledge]
> main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_1e0k(6.6.0):C4171894/2478713:delGen=4037)
> Uninverting(_1pl8(6.6.0):C774233/198909:delGen=643)
> Uninverting(_1g34(6.6.0):C1023729/434329:delGen=3519)
> Uninverting(_1q40(6.6.0):C978521/96008:delGen=458)
> Uninverting(_1pzp(6.6.0):c202089/41209:delGen=495)
> Uninverting(_1qqe(6.6.0):c64285/33505:delGen=252)
> Uninverting(_1qsz(6.6.0):c163910/69577:delGen=234)
> Uninverting(_1rhp(6.6.0):c114624/15521:delGen=52)
> Uninverting(_1rc2(6.6.0):c47728/12387:delGen=94)
> Uninverting(_1rm0(6.6.0):c43606/4440:delGen=15)
> Uninverting(_1rl6(6.6.0):c1349/433:delGen=19)
> Uninverting(_1rk9(6.6.0):C2237/908:delGen=27)
> Uninverting(_1rmg(6.6.0):c3698/1044:delGen=11)
> Uninverting(_1rni(6.6.0):c5632/708:delGen=3)
> Uninverting(_1rlw(6.6.0):C3/2:delGen=2)
> Uninverting(_1rn1(6.6.0):c475/143:delGen=7)
> Uninverting(_1rmi(6.6.0):C709/367:delGen=10)
> Uninverting(_1rnp(6.6.0):c194/28:delGen=2)
> Uninverting(_1rmr(6.6.0):C1058/541:delGen=8)
> Uninverting(_1rn4(6.6.0):C661/325:delGen=7)
> Uninverting(_1rnc(6.6.0):C1111/550:delGen=5)
> Uninverting(_1rne(6.6.0):C1103/548:delGen=4)
> Uninverting(_1rnf(6.6.0):C1087/500:delGen=4)
> Uninverting(_1rng(6.6.0):C1099/553:delGen=4)
> Uninverting(_1rnl(6.6.0):C1222/518:delGen=3)
> Uninverting(_1rnn(6.6.0):C445/178:delGen=3)
> Uninverting(_1rnm(6.6.0):C2455/1138:delGen=3)
> Uninverting(_1rnq(6.6.0):C588/238:delGen=2)
> Uninverting(_1rns(6.6.0):C1153/397:delGen=2)
> Uninverting(_1rnr(6.6.0):C1434/479:delGen=2)
> Uninverting(_1rnt(6.6.0):C7/1:delGen=1)
> Uninverting(_1rnw(6.6.0):C4/1:delGen=1)
> Uninverting(_1rnx(6.6.0):C679/149:delGen=1)
> Uninverting(_1rnv(6.6.0):C107/33:delGen=1)
> Uninverting(_1rnu(6.6.0):C935/205:delGen=1)))}
> leafReaderContext:LeafReaderContext(FilterLeafReader(Uninverting(_1rnu(6.6.0):C935/205:delGen=1))
> docBase=7613129 ord=34) mavVersion:9223372036854775807
> 
> 
> This goes on for a while till I see this:
> 
> 
> index=_1e0k(6.6.0):C4171894/2478715:delGen=4037
> _1pl8(6.6.0):C774233/198956:delGen=643
> _1g34(6.6.0):C1023729/434341:delGen=3519
> _1q40(6.6.0):C978521/96071:delGen=458 _1pzp(6.6.0):c202089/41213:delGen=495
> _1qqe(6.6.0):c64285/33523:delGen=252 _1qsz(6.6.0):c163910/69626:delGen=234
> _1rhp(6.6.0):c114624/15701:delGen=52 _1rc2(6.6.0):c47728/12410:delGen=94
> _1rm0(6.6.0):c43606/4621:delGen=15 _1rl6(6.6.0):c1349/443:delGen=19
> _1rk9(6.6.0):C2237/913:delGen=27 _1rmg(6.6.0):c3698/1074:delGen=11
> _1rna(6.6.0):c6218/1527:delGen=5 _1rlw(6.6.0):C3/2:delGen=2
> _1rn1(6.6.0):c475/151:delGen=7 _1rmi(6.6.0):C709/374:delGen=10
> _1rnn(6.6.0):c233/33:delGen=2 _1rmr(6.6.0):C1058/553:delGen=8
> _1rmw(6.6.0):C1 _1rnd(6.6.0):C576/312:delGen=5
> _1rne(6.6.0):C568/277:delGen=5 _1rng(6.6.0):C2136/1055:delGen=4
> _1rnh(6.6.0):C2570/1260:delGen=4 _1rni(6.6.0):C1
> _1rnk(6.6.0):C1462/712:delGen=3 _1rnj(6.6.0):C2436/1106:delGen=3
> _1rnp(6.6.0):C1560/603:delGen=2 _1rno(6.6.0):C141/71:delGen=2
> _1rnq(6.6.0):C1393/517:delGen=2 _1rns(6.6.0):C4/1:delGen=1
> _1rnr(6.6.0):C12/5:delGen=1 _1rnu(6.6.0):C1846/484:delGen=1
> _1rnv(6.6.0):C513/106:delGen=1 _1rnt(6.6.0):C600/185:delGen=1
> _1rnw(6.6.0):C1 _1rnx(6.6.0):C3 _1rnz(6.6.0):C1
> version=6.6.0
> analyzer=org.apache.solr.update.SolrIndexConfig$DelayedSchemaAnalyzer
> ramBufferSizeMB=320.0
> maxBufferedDocs=-1
> maxBufferedDeleteTerms=-1
> mergedSegmentWarmer=null
> delPolicy=org.apache.solr.core.IndexDeletionPolicyWrapper
> commit=null
> openMode=APPEND
> similarity=org.apache.solr.search.similarities.SchemaSimilarityFactory$SchemaSimilarity
> mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=-1,
> maxMergeCount=-1, ioThrottle=true
> codec=Lucene62
> infoStream=org.apache.solr.update.LoggingInfoStream
> mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10,
> maxMergeAtOnceExplicit=30, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0,
> forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0,
> maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.1
> indexerThreadPool=org.apache.lucene.index.DocumentsWriterPerThreadPool@63603ac1
> readerPooling=false
> perThreadHardLimitMB=1945
> useCompoundFile=false
> commitOnClose=true
> indexSort=null
> writer=org.apache.solr.update.SolrIndexWriter@3360cb8
> 
> INFO265541[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.update.LoggingInfoStream.message(LoggingInfoStream.java:34)
> - [IW][recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9]:
> MMapDirectory.UNMAP_SUPPORTED=true
> INFO265541[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.update.DefaultSolrCoreState.changeWriter(DefaultSolrCoreState.java:212)
> - New IndexWriter is ready to be used.
> 
> ERROR267470[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:519)
> - Could not publish as ACTIVE after succesful recovery
> org.apache.solr.common.SolrException: Cannot publish state of core
> 'knowledge' as active without recovering first!
>        at
> org.apache.solr.cloud.ZkController.publish(ZkController.java:1175)
>        at
> org.apache.solr.cloud.ZkController.publish(ZkController.java:1131)
>        at
> org.apache.solr.cloud.ZkController.publish(ZkController.java:1127)
>        at
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:517)
>        at
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:284)
>        at
> com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
>        at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>        at
> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>        at java.lang.Thread.run(Thread.java:748)
> ERROR267471[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:541)
> - Recovery failed - trying again... (0)
> INFO267471[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:563)
> - Wait [2.0] seconds before trying to recover again (attempt=1)
> 
> 
> The entire cycle keeps repeating till it goes offline. I've also seen the
> following exceptions at times:
> 
> 
> WARN670422[recoveryExecutor-3-thread-4-processing-n:xx.xxx.xxx.251:8983_solr
> x:knowledge s:shard2 c:knowledge r:core_node9] -
> org.apache.solr.handler.IndexFetcher$FileFetcher.fetchPackets(IndexFetcher.java:1536)
> - Error in fetching file: _1e0k.fdt (downloaded 2385510400 of 6610887701
> bytes)
> org.apache.http.TruncatedChunkException: Truncated chunk ( expected size:
> 32768; actual size: 28517)
>        at
> org.apache.http.impl.io.ChunkedInputStream.read(ChunkedInputStream.java:200)
>        at
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> 
> INFO831840[qtp1389808948-125825] -
> org.apache.solr.security.RuleBasedAuthorizationPlugin.checkPathPerm(RuleBasedAuthorizationPlugin.java:147)
> - request has come without principal. failed permission {
>  "name":"select",
>  "collection":"knowledge",
>  "path":"/select",
>  "role":[
>    "admin",
>    "dev",
>    "read"]}
> INFO831841[qtp1389808948-125836] -
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
> USER_REQUIRED auth header null context : userPrincipal: [null] type:
> [READ], collections: [knowledge,], Path: [/select] path : /select params
> :q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
> INFO831840[qtp1389808948-125855] -
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:500) -
> USER_REQUIRED auth header null context : userPrincipal: [null] type:
> [READ], collections: [knowledge,], Path: [/select] path : /select params
> :q=*:*&distrib=false&sort=_docid_+asc&rows=0&wt=javabin&version=2
> ERROR831840[qtp1389808948-125740] -
> org.apache.solr.common.SolrException.log(SolrException.java:148) -
> org.apache.solr.common.SolrException: no servers hosting shard: shard1
>        at
> org.apache.solr.handler.component.HttpShardHandler.prepDistributed(HttpShardHandler.java:413)
>        at
> org.apache.solr.handler.component.SearchHandler.getAndPrepShardHandler(SearchHandler.java:227)
>        at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:265)
>        at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
> 
> 
> 
> Not sure what the issue is here or how to address it. I've played around
> with different memory settings but haven't been successful so far. Also,
> I'm not sure why it affects the entire cluster. When I restart the instance,
> it goes into recovery mode and updates its index with the delta, which is
> understandable. But at the same time, the other replica in the same shard
> stalls and goes offline. This starts a cascading effect and I end up
> restarting all the nodes.
> 
> Any pointers will be appreciated.
> 
> Thanks,
> Shamik
