In addition to Mark's comment, be sure you aren't starving the OS of memory by over-allocating it to the JVM.
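As a rough sketch (the numbers are hypothetical and depend entirely on index size and query load), on a 16 GB box you might cap the heap well below physical RAM so the remainder stays available to the OS page cache, which is what MMapDirectory relies on:

    # hypothetical sizing for a 16 GB machine: leave ~12 GB to the OS page cache
    java -Xms4g -Xmx4g -jar start.jar

rather than something like -Xmx14g, which looks generous but starves the page cache and slows searching down.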
FWIW,
Erick

On Tue, Dec 4, 2012 at 2:25 AM, John Nielsen <j...@mcb.dk> wrote:

> Success!
>
> I tried adding -XX:+UseConcMarkSweepGC to java to make it GC earlier. We
> haven't seen any spikes since.
>
> I'm cautiously optimistic, though, and will be monitoring the servers for a
> week or so before declaring final victory.
>
> The post about MMapDirectory is really interesting. We switched to it from
> NRTCachingDirectory and are monitoring performance as well. Initially
> performance doesn't look stellar, but I suspect that we lack memory in the
> server to really make it shine.
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> p...@mcb.dk
> www.mcb.dk
>
> On Fri, Nov 30, 2012 at 3:13 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Right, so here's what I'd check for.
> >
> > Your logs should show a replication pretty much coincident with the spike.
> > Note: the replication should complete just before the spike.
> >
> > Or you can just turn replication off and fire it manually to try to force
> > the situation at will; see
> > http://wiki.apache.org/solr/SolrReplication#HTTP_API (but note that you'll
> > have to wait until the index has changed on the master to see any action).
> >
> > So you should be able to create your spike at will. And this will be pretty
> > normal. When replication happens, a new searcher is opened, caches are
> > filled, autowarming is done, all kinds of stuff like that. During this
> > period the _old_ searcher is still open, which will both make the CPU
> > busier and require additional memory. Once the new searcher is warmed,
> > new queries go to it, and when the old searcher has finished serving all
> > its queries it shuts down and all its resources are freed. Which is why
> > commits are expensive operations.
> >
> > All of which means that so far I don't think there's a problem; this is
> > just normal Solr operation. If you're seeing responsiveness problems when
> > serving queries, you probably want to throw more hardware (particularly
> > memory) at the problem.
> >
> > But when thinking about how much memory to allocate to the JVM, _really_
> > read Uwe's post here:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > Best
> > Erick
> >
> > On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen <j...@mcb.dk> wrote:
> >
> > > Yup, you read it right.
> > >
> > > We originally intended to do all our indexing to varnish02, replicate to
> > > varnish01, and then search from varnish01 (through a fail-over IP which
> > > would switch the reader to varnish02 in case of trouble).
> > >
> > > When I saw the spikes, I tried to eliminate possibilities by searching
> > > from varnish02 instead, leaving varnish01 with nothing to do but receive
> > > replication data. This did not remove the spikes. As soon as this spike is
> > > fixed, I will start searching from varnish01 again. These sorts of debug
> > > antics are only possible because, although we do have customers using
> > > this, we are still in our beta phase.
> > >
> > > Varnish01 never receives any manual commit orders. Varnish02 does from
> > > time to time.
> > >
> > > Oh, and I accidentally misinformed you before (damn secondary language):
> > > we are actually seeing the spikes on both servers. I was just focusing on
> > > varnish01 because I use it to eliminate possibilities.
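> > > (For reference, Erick's suggestion of forcing replication by hand can be
> > > scripted against the slave roughly like this -- host, port and core name
> > > are placeholders, not our actual setup:
> > >
> > >     # stop the slave polling the master on its own schedule
> > >     curl 'http://localhost:8983/solr/mycore/replication?command=disablepoll'
> > >     # once the master index has changed, pull it on demand
> > >     curl 'http://localhost:8983/solr/mycore/replication?command=fetchindex'
> > >     # resume normal polling afterwards
> > >     curl 'http://localhost:8983/solr/mycore/replication?command=enablepoll'
> > >
> > > That should make the spike reproducible at will, as Erick describes.)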
> > > It just occurred to me now: we tried switching off our feeder/index tool
> > > for 24 hours, and we didn't see any spikes during this period, so receiving
> > > replication data certainly has something to do with it.
> > >
> > > Med venlig hilsen / Best regards
> > >
> > > *John Nielsen*
> > > Programmer
> > >
> > > *MCB A/S*
> > > Enghaven 15
> > > DK-7500 Holstebro
> > >
> > > Kundeservice: +45 9610 2824
> > > p...@mcb.dk
> > > www.mcb.dk
> > >
> > > On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > Am I reading this right? All you're doing on varnish1 is replicating to it?
> > > > You're not searching or indexing? I'm sure I'm misreading this.....
> > > >
> > > > "The spike, which only lasts for a couple of minutes, sends the disks
> > > > racing." This _sounds_ suspiciously like segment merging, especially the
> > > > "disks racing" bit. Or possibly replication. Neither of which makes much
> > > > sense. But is there any chance that somehow multiple commits are being
> > > > issued? Of course, if varnish1 is a slave, that shouldn't be happening
> > > > either.
> > > >
> > > > And the whole bit about nothing going to the logs is just bizarre. I'm
> > > > tempted to claim hardware gremlins, especially if you see nothing similar
> > > > on varnish2. Or some other process is pegging the machine. All of which is
> > > > a way of saying "I have no idea"....
> > > >
> > > > Yours in bewilderment,
> > > > Erick
> > > >
> > > > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <j...@mcb.dk> wrote:
> > > >
> > > > > I apologize for the late reply.
> > > > >
> > > > > The query load is more or less stable during the spikes. There are always
> > > > > fluctuations, but nothing on the order of magnitude that could explain this
> > > > > spike. In fact, the latest spike occurred last night when there was almost
> > > > > no one using it.
> > > > >
> > > > > To test a hunch of mine, I tried to deactivate all caches by commenting out
> > > > > all cache entries in solrconfig.xml. It still spikes, so I don't think it
> > > > > has anything to do with cache warming or hits/misses or anything of the
> > > > > sort.
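> > > > > (The entries in question are the standard cache definitions in the <query>
> > > > > section of solrconfig.xml; commenting out blocks roughly like these -- the
> > > > > sizes shown are the stock defaults, not necessarily what was in our config --
> > > > >
> > > > >     <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
> > > > >     <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
> > > > >     <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
> > > > >
> > > > > disables them entirely, so cache warming should be out of the picture.)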
> > > > > One interesting thing on the GC side, though. This is our latest spike,
> > > > > shown as CPU load (this server has 8 cores, so a load higher than 8 is
> > > > > potentially troublesome):
> > > > >
> > > > > 2012.Nov.27 19:58:18   2.27
> > > > > 2012.Nov.27 19:57:17   4.06
> > > > > 2012.Nov.27 19:56:18   8.95
> > > > > 2012.Nov.27 19:55:17  19.97
> > > > > 2012.Nov.27 19:54:17  32.27
> > > > > 2012.Nov.27 19:53:18   1.67
> > > > > 2012.Nov.27 19:52:17   1.6
> > > > > 2012.Nov.27 19:51:18   1.77
> > > > > 2012.Nov.27 19:50:17   1.89
> > > > >
> > > > > This is what the GC was doing around that time:
> > > > >
> > > > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)] 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00, real=0.09 secs]
> > > > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)] 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00, real=0.10 secs]
> > > > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)] 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00, real=0.09 secs]
> > > > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)] 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01, real=0.11 secs]
> > > > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)] 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01, real=0.07 secs]
> > > > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)] 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01, real=0.08 secs]
> > > > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)] 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00, real=0.10 secs]
> > > > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)] 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09, real=0.19 secs]
> > > > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12, real=0.35 secs]
> > > > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K) [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23 sys=0.00, real=5.23 secs]
> > > > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)] 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00, real=0.04 secs]
> > > > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)] 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00, real=0.05 secs]
> > > > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)] 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00, real=0.06 secs]
> > > > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)] 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
> > > > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)] 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01, real=0.12 secs]
> > > > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)] 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00, real=0.11 secs]
> > > > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)] 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00, real=0.12 secs]
> > > > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)] 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00, real=0.16 secs]
> > > > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)] 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01, real=0.11 secs]
> > > > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)] 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00, real=0.13 secs]
> > > > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)] 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01, real=0.15 secs]
> > > > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)] 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01, real=0.15 secs]
> > > > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)] 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00, real=0.12 secs]
> > > > >
> > > > > So the GC was doing a full collection around that time, but that only
> > > > > accounts for around 5 seconds, not the couple of minutes the spike lasts.
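> > > > > (For reference: GC output in this format comes from running the JVM with
> > > > > logging flags along the lines of
> > > > >
> > > > >     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log
> > > > >
> > > > > and the "make it GC earlier" change reported at the top of this thread
> > > > > boils down to CMS flags roughly like
> > > > >
> > > > >     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
> > > > >     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
> > > > >
> > > > > where the 70% occupancy threshold is illustrative, not a value taken from
> > > > > this setup. CMS trades the long stop-the-world Full GC above for shorter,
> > > > > mostly concurrent old-generation collections.)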
> > > > > I am really at a complete loss as to what this could be. Google has not
> > > > > given me any clues.
> > > > >
> > > > > Med venlig hilsen / Best regards
> > > > >
> > > > > *John Nielsen*
> > > > > Programmer
> > > > >
> > > > > *MCB A/S*
> > > > > Enghaven 15
> > > > > DK-7500 Holstebro
> > > > >
> > > > > Kundeservice: +45 9610 2824
> > > > > p...@mcb.dk
> > > > > www.mcb.dk
> > > > >
> > > > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> > > > >
> > > > > > Strange indeed. What about query rates during that time? What about GC?
> > > > > > And does the cache hit rate drop?
> > > > > >
> > > > > > Otis
> > > > > > --
> > > > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > > >
> > > > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <j...@mcb.dk> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We are seeing a strange CPU spike on one of our solr4 servers which we
> > > > > > > are unable to explain. The spike, which only lasts for a couple of
> > > > > > > minutes, sends the disks racing. This happens a few times a day.
> > > > > > > This is what the load looks like:
> > > > > > >
> > > > > > > 2012.Nov.14 13:37:17   2.77
> > > > > > > 2012.Nov.14 13:36:17   3.65
> > > > > > > 2012.Nov.14 13:35:18   3.92
> > > > > > > 2012.Nov.14 13:34:17   3.95
> > > > > > > 2012.Nov.14 13:33:18   6.56
> > > > > > > 2012.Nov.14 13:32:17  10.79
> > > > > > > 2012.Nov.14 13:31:17  24.38
> > > > > > > 2012.Nov.14 13:30:17  63.35
> > > > > > > 2012.Nov.14 13:29:17  24.68
> > > > > > > 2012.Nov.14 13:28:17   2.44
> > > > > > > 2012.Nov.14 13:27:18   3.51
> > > > > > > 2012.Nov.14 13:26:17   5.26
> > > > > > > 2012.Nov.14 13:25:18   5.71
> > > > > > > 2012.Nov.14 13:24:17   2.7
> > > > > > >
> > > > > > > The problem is that out of a 3-minute spike, I get about 40 seconds of
> > > > > > > silence in the logs. This log usually adds around a thousand lines every
> > > > > > > second. Not being able to communicate with the server for this long
> > > > > > > breaks our use case.
> > > > > > >
> > > > > > > We have two servers, varnish01 and varnish02. We used to feed data to
> > > > > > > varnish02, replicate it to varnish01, and then read the data from
> > > > > > > varnish01. When we discovered this issue, we moved all traffic to
> > > > > > > varnish02, so that varnish01 still receives replicated data but
> > > > > > > otherwise gets zero traffic. The spike did not disappear.
> > > > > > >
> > > > > > > The spike we are seeing is on varnish01 only.
> > > > > > >
> > > > > > > Please note that our use case requires us to continuously feed large
> > > > > > > amounts of data from our main system, on the order of up to 1,000
> > > > > > > registers every minute. Our solrconfig.xml is attached.
> > > > > > >
> > > > > > > Has anyone seen this phenomenon before?
> > > > > > >
> > > > > > > Med venlig hilsen / Best regards
> > > > > > >
> > > > > > > *John Nielsen*
> > > > > > > Programmer
> > > > > > >
> > > > > > > *MCB A/S*
> > > > > > > Enghaven 15
> > > > > > > DK-7500 Holstebro
> > > > > > >
> > > > > > > Kundeservice: +45 9610 2824
> > > > > > > p...@mcb.dk
> > > > > > > www.mcb.dk
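For completeness: the feed-to-varnish02 / replicate-to-varnish01 setup described
above maps onto the replication handler in solrconfig.xml roughly as follows.
Hosts, ports, core names and the poll interval are placeholders, and the actual
attachment may differ:

    <!-- master (e.g. varnish02): publish the index after each commit -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
      </lst>
    </requestHandler>

    <!-- slave (e.g. varnish01): poll the master and pull new index versions -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://varnish02:8983/solr/corename/replication</str>
        <str name="pollInterval">00:01:00</str>
      </lst>
    </requestHandler>

Every poll that actually fetches a new index ends with a commit and a new
searcher on the slave, which is exactly the warm-up work described earlier in
the thread.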