Hi everyone,

My SolrCloud cluster (4.3.0) went into production a few days ago.
Docs are indexed into Solr through the "/update" request handler, as POST
requests with a text/xml content type.
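
For reference, each batch is posted roughly like this (the host, port,
collection and field names here are just placeholders, not the real schema):

  curl 'http://host:port/solr/collection1/update' \
       -H 'Content-Type: text/xml' \
       --data-binary '<add>
         <doc>
           <field name="id">doc-1</field>
           <field name="title">example title</field>
         </doc>
       </add>'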

The collection is split into 36 shards, and each shard has two replicas.
There are 36 nodes (each on a separate virtual machine), so each node
holds exactly two cores.
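
For context, the collection was created along these lines via the
Collections API (the name is a placeholder and the parameters are quoted
from memory):

  curl 'http://host:port/solr/admin/collections?action=CREATE&name=collection1&numShards=36&replicationFactor=2&maxShardsPerNode=2'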

Each update request contains 100 docs, which means 2-3 docs per shard.
There are 1-2 such requests every minute. Soft commits happen every 10
minutes, hard commits every 30 minutes, and ramBufferSizeMB is set to 128.
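
In solrconfig.xml that corresponds roughly to the following (maxTime
values are in milliseconds):

  <autoCommit>
    <maxTime>1800000</maxTime>   <!-- hard commit every 30 minutes -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>600000</maxTime>    <!-- soft commit every 10 minutes -->
  </autoSoftCommit>
  <ramBufferSizeMB>128</ramBufferSizeMB>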

After 48 hours with zero problems, one shard suddenly went down (both of
its cores). The log shows an OOM ("GC overhead limit exceeded"). The JVM
is set to Xmx=4G.
I'm fairly sure that a few minutes before the incident, JVM memory usage
wasn't that high (even the max memory usage indicator was below 2G).

Indexing requests did not stop, and they started getting HTTP 503 errors
("no server hosting shard"). Around the same time, some other cores started
to go down (I had all the colors of the rainbow: Active, Recovering, Down,
Recovery Failed and Gone :).

I then tried to restart Tomcat on the down nodes, but some of them failed
to start with the error message "we are not the leader". Only shutting
down both cores and bringing them up one at a time solved the problem,
and the whole cluster came back to a green state.

Solr is not yet exposed to users, so no queries were being made at the time
(though maybe some lightweight auto-warming queries were executed).

I don't think all of the 4GB was being used for justifiable reasons.
My guess is that adding more RAM will not solve the problem in the long term.

Where should I start my log investigation, both for the OOM itself and for
the chain reaction that followed it?
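
In case it helps, I'm thinking of adding something like this to Tomcat's
setenv.sh before the next run, so there is more data to go on next time
(the paths are placeholders):

  JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -Xloggc:/path/to/gc.log \
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps"

Does that sound like a reasonable starting point?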

I searched for similar issues reported previously. There are a lot, but most
of them talk about very old versions of Solr.

[Versions:
Solr: 4.3.0
Tomcat: 7
JVM: Oracle Java 7 (latest standard JRE), 64-bit
OS: Red Hat 6.3]
