Hi,
We have a cluster of five Solr nodes (8.5.1, Java 11), and sometimes one
or two nodes have Solr running at 100% CPU on all cores, with «load» over 400
and high IO. It usually lasts five to ten minutes, during which the node
barely responds.
Does anyone have any experience with this type of behaviour? Is there any
logging other than infostream that could give any information?

We managed to trigger a thread dump:

> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)


However, we are not sure whether this dump is from the incident itself or
from just after it. It seems strange that an fsync would behave like this.
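
To settle that, we could take several dumps during and after a spike and
check which threads sit in the same frames each time. A minimal sketch in
Python, assuming jstack-style text dumps (quoted thread name on the header
line, thread blocks separated by blank lines); the function name is made up
for illustration and is not part of any Solr or JDK tooling:

```python
import re


def threads_matching(dump_text, needle):
    """Return the names of threads whose stack mentions `needle`.

    Assumes a jstack-style dump: each thread block starts with a header
    line beginning with the quoted thread name, and blocks are separated
    by blank lines.
    """
    matches = []
    for block in re.split(r"\n\s*\n", dump_text.strip()):
        lines = block.splitlines()
        header = lines[0] if lines else ""
        # Header lines look like: "thread-name" #12 prio=5 ...
        m = re.match(r'"([^"]+)"', header)
        # Search every line after the header for the frame of interest.
        if m and any(needle in line for line in lines[1:]):
            matches.append(m.group(1))
    return matches
```

Running this with a needle such as "IOUtils.fsync" over each dump would show
whether the fsync frames appear only during the spike or also afterwards.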

Swappiness is set to the default for RHEL 7 (Ops have resisted turning it off).

-- 
Kind regards,
Marvin B. Lillehaug
