Hi,

I keep seeing nodes fall into recovery mode and then log the following
WARN every 10 seconds:

WARN   Stopping recovery for core=xxxx coreNodeName=core_node7

Sometimes this appears as well:

PERFORMANCE WARNING: Overlapping onDeckSearchers=2

During peak traffic this gets worse, and only 1 of the 4 nodes stays up.

I have 4 Solr nodes, each running two cores, A and B, of 13 GB and 1.5 GB
respectively. Core A gets a lot of index updates and more query traffic
than core B. Core A cycles through the active/recovery/down states
very often.

The nodes are coordinated via ZooKeeper; we have three ZooKeeper instances,
running on different machines than Solr.

Each machine has around 24 CPU cores and between 38 and 48 GB of RAM, and
each Solr instance gets 16 GB of heap memory.

I read this article:
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

and decided to apply:

     <autoCommit>
       <!-- Every 15 seconds -->
       <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>

and

     <autoSoftCommit>
       <!-- Every 10 minutes -->
       <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
     </autoSoftCommit>
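
For context, a sketch of how the two blocks sit together in solrconfig.xml,
assuming the stock layout. The enclosing updateHandler element and the
updateLog entry are assumptions taken from the default example config, not
copied from my file:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Assumed: transaction log enabled with the default directory property -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>

  <!-- Hard commit every 15 seconds: flushes to disk and rolls the tlog,
       but does not open a new searcher -->
  <autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every 10 minutes: this is what makes new documents
       visible to searches -->
  <autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
  </autoSoftCommit>
</updateHandler>
```

The ${name:default} syntax means either interval can be overridden with a
system property at startup (for example -Dsolr.autoCommit.maxTime=...)
without editing the file.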

I also have these cache configurations:

    <filterCache class="solr.LFUCache"
                 size="64"
                 initialSize="64"
                 autowarmCount="32"/>

    <queryResultCache class="solr.LRUCache"
                     size="512"
                     initialSize="512"
                     autowarmCount="0"/>

    <documentCache class="solr.LRUCache"
                   size="1024"
                   initialSize="1024"
                   autowarmCount="0"/>

    <cache name="perSegFilter"
      class="solr.search.LRUCache"
      size="10"
      initialSize="0"
      autowarmCount="10"
      regenerator="solr.NoOpRegenerator" />

    <fieldValueCache class="solr.FastLRUCache"
                     size="512"
                     autowarmCount="0"
                     showItems="32"/>

I also have this:

    <maxWarmingSearchers>6</maxWarmingSearchers>

The tlogs are usually between 1 MB and 8 MB in size.

I hoped the changes above would improve the situation, but I am not
convinced they did: about 15 minutes after applying them, one of the nodes
entered recovery mode again.

Any ideas?

Thanks in advance.

Cheers!

-- 
Lorenzo Fundaro
Backend Engineer
E-Mail: lorenzo.fund...@dawandamail.com

Fax       + 49 - (0)30 - 25 76 08 52
Tel        + 49 - (0)179 - 51 10 982

DaWanda GmbH
Windscheidstraße 18
10627 Berlin

Geschäftsführer: Claudia Helming, Michael Pütz
Amtsgericht Charlottenburg HRB 104695 B
