One possibility is that index segments are being merged. When this happens, are you actively indexing? And are these NRT replicas or TLOG/PULL replicas? If the latter, are your TLOG leaders on the affected machines?
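Merges are usually easy to spot in a thread dump: with the default ConcurrentMergeScheduler the busy threads are named "Lucene Merge Thread #N". And if you're not sure which replica types you're running, the Collections API CLUSTERSTATUS call reports a "type" of NRT, TLOG or PULL for every replica; a quick sketch (host and collection name are just placeholders):

  curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=yourCollection"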
Best,
Erick

> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <marvin.lilleh...@gmail.com> wrote:
>
> Hi,
> We have a cluster with five Solr (8.5.1, Java 11) nodes, and sometimes one
> or two nodes have Solr running at 100% CPU on all cores, «load» over 400,
> and high IO. It usually lasts five to ten minutes, and the node is hardly
> responding.
> Does anyone have any experience with this type of behaviour? Is there any
> logging other than infostream that could give any information?
>
> We managed to trigger a thread dump:
>
>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
>
> But we are not sure whether this is from the incident itself or from just after it. It seems
> strange that an fsync should behave like this.
>
> Swappiness is set to the default for RHEL 7 (Ops have resisted turning it off).
>
> --
> Kind regards,
> Marvin B. Lillehaug