Shawn directed you over here to the user list, but I see this note on
SOLR-7030:
"All our searchers have 12 GB of RAM available and have quad core Intel(R)
Xeon(R) CPU X5570 @ 2.93GHz. There is only one java process running i.e
jboss and solr in it . All 12 GB is available as heap for the java
process..."

So you have 12G of physical memory and have allocated all 12G to the Java
process? That is an anti-pattern. If that's the case, your operating
system is being starved for memory, and the JVM is probably hitting a
state where it spends most of its time in stop-the-world garbage
collection. Eventually it fails to answer Zookeeper's ping within the
timeout, so Zookeeper thinks the node is down and puts it into recovery,
where it spends a lot of time doing... essentially nothing.
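
If you want to confirm that GC pauses are what's tripping the Zookeeper
timeout, one thing you could do is turn on GC logging on a node and look
for long stop-the-world pauses right before a replica drops into recovery.
For Oracle/OpenJDK 7 or 8 that's something along these lines (the log path
is just an example; it goes wherever your JBoss startup builds JAVA_OPTS):

    # GC logging: timestamps plus per-collection detail, written to a file
    JAVA_OPTS="$JAVA_OPTS -verbose:gc -XX:+PrintGCDetails \
        -XX:+PrintGCDateStamps -Xloggc:/var/log/solr/gc.log"

Full GC pauses longer than the Zookeeper session timeout would be the
smoking gun.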

About the hard and soft commits: I suspect these are entirely unrelated to
the recovery problem, but here's a blog on what they do. Pick the
configuration that supports your use case (i.e. how much latency can you
stand between indexing a document and being able to search it?):

https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
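
For what it's worth, the usual recommendation is a hard commit on a fixed
interval with openSearcher=false (which keeps the transaction logs trimmed
without the cost of opening a new searcher), plus a soft commit set to the
longest visibility delay you can live with. A sketch only, with
placeholder times you'd tune to your own latency requirements:

    <autoCommit>
        <!-- flush and roll over the transaction log regularly; cheap
             because no new searcher is opened -->
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
        <!-- how long newly indexed docs may wait before becoming
             searchable -->
        <maxTime>180000</maxTime>
    </autoSoftCommit>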

Here's one very good reason you shouldn't starve your op system by
allocating all the physical memory to the JVM:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html


But your biggest problem is that you have far too much of your physical
memory allocated to the JVM. This will cause you endless problems; you
really need more physical memory on those boxes. It's _possible_ you could
get by with less memory for the JVM: counterintuitive as it seems, try 8G
or maybe even 6G. At some point you'll hit OOM errors, but that will tell
you the lower limit on what the JVM needs.
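
Since Solr is running inside JBoss, the heap comes from whatever your
JBoss startup passes to the JVM (run.conf or standalone.conf, depending on
the version). The 6G below is just the experiment I'm suggesting, not a
magic number:

    # cap the heap and leave the rest of the 12G box to the OS page cache
    JAVA_OPTS="$JAVA_OPTS -Xms6g -Xmx6g"

Whatever physical memory the JVM doesn't take is what the OS can use to
cache the index files for Lucene's MMapDirectory (see the blog above), so
a smaller heap often helps rather than hurts.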

Unless I've misinterpreted what you've written, though, I doubt you'll get
the cluster stable with that much memory allocated to the JVM.

Best,
Erick



On Sun, Jan 25, 2015 at 1:02 PM, Vijay Sekhri <sekhrivi...@gmail.com> wrote:

> We have a cluster of SolrCloud servers with 10 shards and 4 replicas per
> shard in our stress environment. In our prod environment we will have 10
> shards and 15 replicas per shard. Our current commit settings are as
> follows:
>
>     <autoSoftCommit>
>         <maxDocs>500000</maxDocs>
>         <maxTime>180000</maxTime>
>     </autoSoftCommit>
>     <autoCommit>
>         <maxDocs>2000000</maxDocs>
>         <maxTime>180000</maxTime>
>         <openSearcher>false</openSearcher>
>     </autoCommit>
>
>
> We indexed roughly 90 million docs. We have two different ways to index
> documents:
> a) Full indexing. It takes 4 hours to index 90 million docs, and docs
> arrive at the searchers at a rate of around 6000 per second.
> b) Incremental indexing. It takes an hour to index the delta changes.
> There are roughly 3 million changes, and docs arrive at the searchers at
> a rate of 2500 per second.
>
> We have two collections, search1 and search2. When we do full indexing,
> we do it in the search2 collection while search1 is serving live traffic.
> After it finishes we swap the collections using aliases, so that search2
> serves live traffic while search1 becomes available for the next full
> indexing run. When we do incremental indexing we do it in the search1
> collection, which is serving live traffic.
>
> All our searchers have 12 GB of RAM available and a quad-core Intel(R)
> Xeon(R) CPU X5570 @ 2.93GHz. There is only one Java process running, i.e.
> JBoss with Solr inside it, and all 12 GB is available as heap for the
> Java process. We have observed that the heap usage of the Java process
> averages around 8-10 GB. Each searcher has a final index size of 9 GB, so
> in total there are 9 x 10 (shards) = 90 GB worth of index files.
>
> We have observed the following issue when we trigger indexing. About 10
> minutes after we trigger indexing on 14 parallel hosts, replicas start
> going into recovery mode. This happens across all the shards. Over the
> next 20 minutes more and more replicas go into recovery, and after about
> half an hour all replicas except the leader are in recovery mode. We
> cannot throttle the indexing load as that would increase our overall
> indexing time. So to work around this issue, we remove all the replicas
> before we trigger the indexing and add them back after the indexing
> finishes.
>
> We observe the same behavior of replicas going into recovery when we do
> incremental indexing. We cannot remove replicas during incremental
> indexing because the collection is also serving live traffic. We tried
> throttling our indexing speed, but the cluster still goes into recovery.
>
> If we leave the cluster as it is, it eventually recovers a while after
> the indexing finishes. But since it is serving live traffic we cannot
> have these replicas going into recovery mode, because our tests have
> shown that it also degrades search performance.
>
> We have tried different commit settings like the ones below:
>
> a) No auto soft commit, no auto hard commit, and a commit triggered at
> the end of indexing
> b) No auto soft commit, auto hard commit enabled, and a commit at the end
> of indexing
> c) Auto soft commit enabled, no auto hard commit
> d) Auto soft commit enabled, auto hard commit enabled
> e) Different frequency settings for the commits above. Please NOTE that
> we have tried a 15-minute soft commit with a 30-minute hard commit, the
> same time setting for both, and a 30-minute soft commit with a one-hour
> hard commit.
>
> Unfortunately all of the above yield the same behavior: the replicas
> still go into recovery. We have increased the Zookeeper timeout from 30
> seconds to 5 minutes and the problem persists. Is there any setting that
> would fix this issue?
>
> --
> *********************************************
> Vijay Sekhri
> *********************************************
>
