Once you hit an OOM, the behavior of Java is indeterminate. There's no
expectation that things will just pick up where they left off when
memory is freed up. Lots of production systems have OOM killer
scripts that automatically kill/restart Java apps that OOM for just
that reason.
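
If you're on a recent-enough Java 8, the JVM can do this for you and
exit on the first OOM so a supervisor (systemd, monit, whatever) can
restart it cleanly. A sketch, assuming you launch Solr via solr.in.sh
(adjust to your setup):

    # exit the JVM immediately on the first OutOfMemoryError (8u92+)
    SOLR_OPTS="$SOLR_OPTS -XX:+ExitOnOutOfMemoryError"
    # on older JVMs, run an external command instead (%p = the pid)
    # SOLR_OPTS="$SOLR_OPTS -XX:OnOutOfMemoryError=\"kill -9 %p\""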

Yes, each replica has its own cache, but they all draw from the same
JVM heap. That's why "times the number of replicas". Perhaps a more
complete statement would be "times the number of replicas hosted in
the JVM".

Hmmm, 11M docs. Let's take 16M; that would give 2M bytes per
filterCache entry. Times 4096 gives around 8G that could be used up by
a cache that size.
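
To spell out the arithmetic (in the worst case each filterCache entry
is a bitset with one bit per document in the core):

    16M docs / 8 bits per byte  =  ~2M bytes per entry
    2M bytes * 4096 entries     =  ~8G per replica
    ...times the number of replicas hosted in that JVM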

Yeah, your hit ratio is poor at 15%. It's relatively unusual to
require that many entries, though. What do the fq clauses look like?
Or are you using something else that consumes cache (some facet
methods do, for instance)?
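
If some of your fq clauses are known one-offs (NOW-based date ranges
and per-user filters are the usual suspects), you can keep them out of
the cache with the cache local param so they don't evict the filters
that actually repeat. The field here is just an example:

    fq={!cache=false}timestamp:[NOW-1HOUR TO NOW]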

And do be sure to use docValues for any field you facet, sort or group on.
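
That's a per-field attribute in the schema, something like this (field
name is illustrative):

    <field name="category" type="string" indexed="true" stored="true"
           docValues="true"/>

Note you have to completely reindex after turning docValues on for an
existing field.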

Best,
Erick

On Thu, Oct 19, 2017 at 2:24 PM, shamik <sham...@gmail.com> wrote:
> Thanks Emir. The index is equally split between the two shards, each having
> approx 35gb. The total number of documents is around 11 million, which should
> be distributed equally between the two shards. So, each core should take 3gb
> of the heap for a full cache. Not sure I get the "multiply it by the number
> of replicas". Shouldn't each replica have its own cache of 3gb? Moreover,
> based on the SPM graph, the max filter cache size during the outages has
> been 1.5 million entries.
>
> The majority of our queries depend heavily on some implicit filters and
> user-selected ones. Reducing the filter cache size to the current value of
> 4096 has hurt performance. Earlier (in 5.5), I had a max cache size of
> 10,000 (running on a 15gb allocated heap), which produced a 95% hit rate.
> With the memory issues in 6.6, I started reducing it to the current value,
> which dropped the hit rate to 25%. I had earlier tried reducing it to
> <filterCache class="solr.FastLRUCache" size="256" initialSize="256"
> autowarmCount="0"/>.
> It still didn't help, which is when I decided to go for a higher-RAM machine.
> What I've noticed is that the heap sits consistently around the 22-23gb mark,
> of which G1 Old Gen takes close to 13gb and G1 Eden space around 6gb, with
> the rest shared by G1 Survivor space, Metaspace, and Code Cache.
>
> This issue has been bothering me as I seem to be running out of possible
> tuning options. What I could see from the monitoring tool is that the surge
> period saw around 400 requests/hr with 40 docs/sec getting indexed. Is that
> really a high load for a cluster of 6 nodes with 16 CPUs / 64gb RAM each?
> What other options should I be looking into?
>
> The other thing I'm still confused about is why recovery fails after the
> memory has been freed up.
>
>
>
