bq: and when one of the nodes enters recovery mode the leader says this:

Hmmm, nothing came through; the mail filter is pretty aggressive
about stripping attachments, though.

bq: You mean 10 seconds apart?

Hmmm, no, I mean 10 minutes. That would
explain the overlapping searchers, since the only
time you should be opening a new searcher is
when you do a soft commit. But from the log
fragment you pasted, that isn't happening, so
I'm totally mystified why you were getting those.
Or was that some time ago and things have
changed since then? I still don't
understand why you'd be seeing overlapping
searchers, so I'm grasping at straws. Given
what you've added, I suspect this is totally
off base.
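
One quick way to check how often searchers are
actually being opened (just a sketch, assuming the
log file is solr.log and the commit lines look like
the ones you pasted) is to grep for the openSearcher
flag and compare the timestamps:

    grep "openSearcher=true" solr.log

If those lines are closer together than 10 minutes,
something other than autoSoftCommit is opening
searchers.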

I see you're printing GC stats; are any of those
GC times exceeding 15 seconds for stop-the-world
pauses? That can trigger ZK marking the node as
down and the node going into recovery. With
16G heaps that is unlikely, but you can check in
your logs. If you are getting long stop-the-world
pauses, perhaps you're very close to your JVM limit?
In _some_ cases I've seen adding more memory
decrease GC thrashing; the scenario is this:

You're running very close to the limit of your JVM
and each collection only reclaims a tiny bit of
memory (say 0.05%), so the JVM goes immediately into
another GC cycle. This is a total shot in the dark,
though.
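
If you want to confirm or rule out long pauses, a
minimal sketch of GC logging (assuming an Oracle or
OpenJDK 7/8 JVM; adjust the log path to taste) would
be to add these flags to the Solr start parameters:

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
    -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/solr_gc.log

then look for "Total time for which application
threads were stopped" entries in the multi-second
range around the times a node drops into recovery.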

Anyway, it's mystifying so far, but from everything
you've written, your system should not be going into
recovery and you shouldn't be seeing overlapping
searchers, so I'm missing something.
Best,
Erick

On Wed, Sep 23, 2015 at 9:25 AM, Lorenzo Fundaró <
lorenzo.fund...@dawandamail.com> wrote:

> On 23 September 2015 at 18:08, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > Wow, this is not expected at all. There's no
> > way you should, on the face of it, get
> > overlapping on-deck searchers.
> >
> > I recommend you put your maxWarmingSearchers
> > back to 2, that's a fail-safe that is there to make
> > people look at why they're warming a bunch of
> > searchers at once.
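> >
> > That's just the setting you already have in
> > solrconfig.xml, turned back down:
> >
> >     <maxWarmingSearchers>2</maxWarmingSearchers>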
> >
>
> Ok, will do.
>
>
> >
> > With those settings, it's saying that autowarming is
> > taking over 10 minutes.
>
>
> What do you mean? I don't think the autowarming is taking 10 minutes here.
> Do you maybe mean 10 seconds apart?
> Every 10 minutes I issue a soft commit, which creates a new searcher
> with several autowarmings, one for each cache. Looking at the stats, the
> slowest autowarm is the filter cache, and it takes around 2 seconds.
>
>
> > This is absurdly long, so either
> > something is pathologically wrong with your Solr
> > or you're really committing more often than you think.
> > Possibly you have a client issuing commits?
>
>
> Mm, no. I checked my client and I am not issuing any commits.
> Here are some logs from the leader:
>
> INFO  - 2015-09-23 16:21:59.803;
> org.apache.solr.update.DirectUpdateHandler2; start
>
> commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> INFO  - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy;
> SolrDeletionPolicy.onCommit: commits: num=2
> commit{dir=NRTCachingDirectory(MMapDirectory@
> /srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229
> lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb;
> maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilp,generation=24109}
> commit{dir=NRTCachingDirectory(MMapDirectory@
> /srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229
> lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb;
> maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
> INFO  - 2015-09-23 16:22:00.981; org.apache.solr.core.SolrDeletionPolicy;
> newest commit generation = 24110
> INFO  - 2015-09-23 16:22:01.010;
> org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
> INFO  - 2015-09-23 16:22:16.967;
> org.apache.solr.update.DirectUpdateHandler2; start
>
> commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
> INFO  - 2015-09-23 16:22:18.452; org.apache.solr.core.SolrDeletionPolicy;
> SolrDeletionPolicy.onCommit: commits: num=2
> commit{dir=NRTCachingDirectory(MMapDirectory@
> /srv/loveos/solr/server/solr/dawanda/data/index.20150818133228229
> lockFactory=org.apache.lucene.store.SingleInstanceLockFactory@66cd8efb;
> maxCacheMB=48.0 maxMergeSizeMB=4.0),segFN=segments_ilq,generation=24110}
>
> > You can look at your Solr logs and see commits, just
> > look for the word "commit". When reading those lines,
> > it'll say whether it has openSearcher true or false.
> > Are the timestamps when openSearcher=true really
> > 10 minutes apart?
> >
>
> You mean 10 seconds apart?
>
>
> >
> > You'll also see autowarm times in your logs, see how
> > long they really take. If they really take 10 minutes,
> > we need to get to the bottom of that because the
> > autowarm counts you're showing in your cache
> > configurations don't indicate any problem here.
> >
> > Bottom line:
> > 1> you shouldn't be seeing nodes go into recovery in the
> > first place. Are your Solr logs showing any ERROR
> > level messages?
> >
> > 2> it's extremely surprising that you're getting any
> > overlapping on-deck searchers. If it turns out that
> > your autowarming is really taking more than a few
> > seconds, getting a stack trace to see where Solr is
> > spending all the time is warranted.
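> >
> > A quick way to grab one (assuming you can point
> > jstack at the Solr process; <solr-pid> here is just a
> > placeholder) is something like:
> >
> >     jstack <solr-pid> > solr-stack.txt
> >
> > taken a few times while a warming searcher is open.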
> >
>
> Will do.
>
> >
> > 3> Any clues from the logs _why_ they're going
> > into recovery? Also look at your leader's log file
> > and see if there are any messages about "leader
> > initiated recovery". If you see that, then perhaps
> > one of the timeouts is too short.
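> >
> > The usual suspect there is zkClientTimeout. A sketch
> > of setting it explicitly (assuming you use solr.xml;
> > the 30000 here is just an example value) would be:
> >
> >     <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
> >
> > so that a long GC pause doesn't immediately expire
> > the ZK session.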
> >
> > 4> the tlog sizes are quite reasonable. They're only relevant
> > when a node goes down for some reason anyway,
> > so I wouldn't expend too much energy worrying about
> > them until we get to the bottom of the overlapping
> > searchers and nodes going into recovery.
> >
> > BTW, nice job of laying out the relevant issues and
> > adding supporting information! I wish more problem
> > statements were as complete. If your Solr is 4.7.0,
> > there was a memory problem and you should definitely
> > go to 4.7.2. The symptom here is that you'll see
> > Out of Memory errors...
> >
> >
> > Best,
> > Erick
>
>
> Thank you very much!
>
> >
>
>
> > On Wed, Sep 23, 2015 at 8:48 AM, Lorenzo Fundaró <
> > lorenzo.fund...@dawandamail.com> wrote:
> >
> > > Hi!
> > >
> > > I keep getting nodes that fall into recovery mode and then issue the
> > > following log WARN every 10 seconds:
> > >
> > > WARN   Stopping recovery for core=xxxx coreNodeName=core_node7
> > >
> > > and sometimes this appears as well:
> > >
> > > PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> > >
> > > At higher traffic times this gets worse, and out of 4 nodes only 1 is up.
> > > I have 4 Solr nodes, each running two cores, A and B, of 13GB and 1.5GB
> > > respectively. Core A gets a lot of index updates and higher query traffic
> > > compared to core B. Core A is going through active/recovery/down states
> > > very often.
> > >
> > > Nodes are coordinated via ZooKeeper; we have three, running on different
> > > machines than Solr.
> > >
> > > Each machine has around 24 cores and between 38 and 48 GB of RAM, with
> > > each Solr getting 16GB of heap memory.
> > > I read this article:
> > >
> > > https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> > >
> > > and decided to apply:
> > >
> > >      <autoCommit>
> > >        <!-- Every 15 seconds -->
> > >        <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
> > >        <openSearcher>false</openSearcher>
> > >      </autoCommit>
> > >
> > > and
> > >
> > >      <autoSoftCommit>
> > >        <!-- Every 10 minutes -->
> > >        <maxTime>${solr.autoSoftCommit.maxTime:600000}</maxTime>
> > >      </autoSoftCommit>
> > >
> > > I also have these cache configurations:
> > >
> > >     <filterCache class="solr.LFUCache"
> > >                  size="64"
> > >                  initialSize="64"
> > >                  autowarmCount="32"/>
> > >
> > >     <queryResultCache class="solr.LRUCache"
> > >                      size="512"
> > >                      initialSize="512"
> > >                      autowarmCount="0"/>
> > >
> > >     <documentCache class="solr.LRUCache"
> > >                    size="1024"
> > >                    initialSize="1024"
> > >                    autowarmCount="0"/>
> > >
> > >     <cache name="perSegFilter"
> > >       class="solr.search.LRUCache"
> > >       size="10"
> > >       initialSize="0"
> > >       autowarmCount="10"
> > >       regenerator="solr.NoOpRegenerator" />
> > >
> > >        <fieldValueCache class="solr.FastLRUCache"
> > >                         size="512"
> > >                         autowarmCount="0"
> > >                         showItems="32" />
> > >
> > > I also have this:
> > >
> > > <maxWarmingSearchers>6</maxWarmingSearchers>
> > >
> > > The size of the tlogs is usually between 1MB and 8MB.
> > > I thought the changes above would improve the situation, but I am not 100%
> > > convinced they did, since after 15 minutes one of the nodes entered recovery
> > > mode again.
> > >
> > > Any ideas?
> > >
> > > Thanks in advance.
> > >
> > > Cheers!
> > >
> > > --
> > >
> > > --
> > > Lorenzo Fundaro
> > > Backend Engineer
> > > E-Mail: lorenzo.fund...@dawandamail.com
> > >
> > > Fax       + 49 - (0)30 - 25 76 08 52
> > > Tel        + 49 - (0)179 - 51 10 982
> > >
> > > DaWanda GmbH
> > > Windscheidstraße 18
> > > 10627 Berlin
> > >
> > > Geschäftsführer: Claudia Helming, Michael Pütz
> > > Amtsgericht Charlottenburg HRB 104695 B
> > >
> >
>
