In addition to Mark's comment, be sure you aren't starving the OS of memory by over-allocating it to the JVM.
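As a rough sketch (the numbers are hypothetical and depend entirely on index size and query load), on a 16 GB box you might cap the heap well below physical RAM so the remainder stays available to the OS page cache, which is what MMapDirectory relies on:

    # hypothetical sizing for a 16 GB machine: leave ~12 GB to the OS page cache
    java -Xms4g -Xmx4g -jar start.jar

rather than something like -Xmx14g, which looks generous but starves the page cache and slows searching down.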
FWIW,
Erick

On Tue, Dec 4, 2012 at 2:25 AM, John Nielsen <j...@mcb.dk> wrote:

> Success!
>
> I tried adding -XX:+UseConcMarkSweepGC to java to make it GC earlier. We
> haven't seen any spikes since.
>
> I'm cautiously optimistic, though, and will be monitoring the servers for a
> week or so before declaring final victory.
>
> The post about MMapDirectory is really interesting. We switched to it from
> NRTCachingDirectory and are monitoring performance as well. Initially
> performance doesn't look stellar, but I suspect that we lack memory in the
> server to really make it shine.
>
> Med venlig hilsen / Best regards
>
> *John Nielsen*
> Programmer
>
> *MCB A/S*
> Enghaven 15
> DK-7500 Holstebro
>
> Kundeservice: +45 9610 2824
> p...@mcb.dk
> www.mcb.dk
>
> On Fri, Nov 30, 2012 at 3:13 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Right, so here's what I'd check for.
> >
> > Your logs should show a replication pretty much coincident with the spike.
> > Note: the replication should complete just before the spike.
> >
> > Or you can just turn replication off and fire it manually to try to force
> > the situation at will; see
> > http://wiki.apache.org/solr/SolrReplication#HTTP_API (but note that you'll
> > have to wait until the index has changed on the master to see any action).
> >
> > So you should be able to create your spike at will. And this will be pretty
> > normal. When replication happens, a new searcher is opened, caches are
> > filled, autowarming is done, all kinds of stuff like that. During this
> > period the _old_ searcher is still open, which will both make the CPU
> > busier and require additional memory. Once the new searcher is warmed,
> > new queries go to it, and when the old searcher has finished serving all
> > its queries it shuts down and all its resources are freed. Which is why
> > commits are expensive operations.
> >
> > All of which means that so far I don't think there's a problem; this is
> > just normal Solr operation. If you're seeing responsiveness problems when
> > serving queries, you probably want to throw more hardware (particularly
> > memory) at the problem.
> >
> > But when thinking about how much memory to allocate to the JVM, _really_
> > read Uwe's post here:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > Best
> > Erick
> >
> > On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen <j...@mcb.dk> wrote:
> >
> > > Yup, you read it right.
> > >
> > > We originally intended to do all our indexing to varnish02, replicate to
> > > varnish01, and then search from varnish01 (through a fail-over IP which
> > > would switch the reader to varnish02 in case of trouble).
> > >
> > > When I saw the spikes, I tried to eliminate possibilities by searching
> > > from varnish02 instead, leaving varnish01 with nothing to do but receive
> > > replication data. This did not remove the spikes. As soon as this spike is
> > > fixed, I will start searching from varnish01 again. These sorts of debug
> > > antics are only possible because, although we do have customers using
> > > this, we are still in our beta phase.
> > >
> > > Varnish01 never receives any manual commit orders. Varnish02 does from
> > > time to time.
> > >
> > > Oh, and I accidentally misinformed you before (damn secondary language):
> > > we are actually seeing the spikes on both servers. I was just focusing on
> > > varnish01 because I use it to eliminate possibilities.
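> > > (For reference, Erick's suggestion of forcing replication by hand can be
> > > scripted against the slave roughly like this -- host, port and core name
> > > are placeholders, not our actual setup:
> > >
> > >     # stop the slave polling the master on its own schedule
> > >     curl 'http://localhost:8983/solr/mycore/replication?command=disablepoll'
> > >     # once the master index has changed, pull it on demand
> > >     curl 'http://localhost:8983/solr/mycore/replication?command=fetchindex'
> > >     # resume normal polling afterwards
> > >     curl 'http://localhost:8983/solr/mycore/replication?command=enablepoll'
> > >
> > > That should make the spike reproducible at will, as Erick describes.)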
> > > It just occurred to me now: we tried switching off our feeder/index tool
> > > for 24 hours, and we didn't see any spikes during this period, so receiving
> > > replication data certainly has something to do with it.
> > >
> > > Med venlig hilsen / Best regards
> > >
> > > *John Nielsen*
> > > Programmer
> > >
> > > *MCB A/S*
> > > Enghaven 15
> > > DK-7500 Holstebro
> > >
> > > Kundeservice: +45 9610 2824
> > > p...@mcb.dk
> > > www.mcb.dk
> > >
> > > On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> > >
> > > > Am I reading this right? All you're doing on varnish1 is replicating to it?
> > > > You're not searching or indexing? I'm sure I'm misreading this.....
> > > >
> > > > "The spike, which only lasts for a couple of minutes, sends the disks
> > > > racing." This _sounds_ suspiciously like segment merging, especially the
> > > > "disks racing" bit. Or possibly replication. Neither of which makes much
> > > > sense. But is there any chance that somehow multiple commits are being
> > > > issued? Of course, if varnish1 is a slave, that shouldn't be happening
> > > > either.
> > > >
> > > > And the whole bit about nothing going to the logs is just bizarre. I'm
> > > > tempted to claim hardware gremlins, especially if you see nothing similar
> > > > on varnish2. Or some other process is pegging the machine. All of which is
> > > > a way of saying "I have no idea"....
> > > >
> > > > Yours in bewilderment,
> > > > Erick
> > > >
> > > > On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen <j...@mcb.dk> wrote:
> > > >
> > > > > I apologize for the late reply.
> > > > >
> > > > > The query load is more or less stable during the spikes. There are always
> > > > > fluctuations, but nothing on the order of magnitude that could explain this
> > > > > spike. In fact, the latest spike occurred last night when there was almost
> > > > > no one using it.
> > > > >
> > > > > To test a hunch of mine, I tried to deactivate all caches by commenting out
> > > > > all cache entries in solrconfig.xml. It still spikes, so I don't think it
> > > > > has anything to do with cache warming or hits/misses or anything of the
> > > > > sort.
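> > > > > (The entries in question are the standard cache definitions in the <query>
> > > > > section of solrconfig.xml; commenting out blocks roughly like these -- the
> > > > > sizes shown are the stock defaults, not necessarily what was in our config --
> > > > >
> > > > >     <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
> > > > >     <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
> > > > >     <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
> > > > >
> > > > > disables them entirely, so cache warming should be out of the picture.)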
> > > > > One interesting thing on the GC side, though. This is our latest spike,
> > > > > shown as CPU load (this server has 8 cores, so a load higher than 8 is
> > > > > potentially troublesome):
> > > > >
> > > > > 2012.Nov.27 19:58:18   2.27
> > > > > 2012.Nov.27 19:57:17   4.06
> > > > > 2012.Nov.27 19:56:18   8.95
> > > > > 2012.Nov.27 19:55:17  19.97
> > > > > 2012.Nov.27 19:54:17  32.27
> > > > > 2012.Nov.27 19:53:18   1.67
> > > > > 2012.Nov.27 19:52:17   1.6
> > > > > 2012.Nov.27 19:51:18   1.77
> > > > > 2012.Nov.27 19:50:17   1.89
> > > > >
> > > > > This is what the GC was doing around that time:
> > > > >
> > > > > 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)] 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00, real=0.09 secs]
> > > > > 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)] 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00, real=0.10 secs]
> > > > > 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)] 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00, real=0.09 secs]
> > > > > 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)] 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01, real=0.11 secs]
> > > > > 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)] 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01, real=0.07 secs]
> > > > > 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)] 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01, real=0.08 secs]
> > > > > 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)] 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00, real=0.10 secs]
> > > > > 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)] 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09, real=0.19 secs]
> > > > > 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12, real=0.35 secs]
> > > > > 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K) [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23 sys=0.00, real=5.23 secs]
> > > > > 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)] 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00, real=0.04 secs]
> > > > > 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)] 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00, real=0.05 secs]
> > > > > 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)] 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00, real=0.06 secs]
> > > > > 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)] 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
> > > > > 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)] 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01, real=0.12 secs]
> > > > > 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)] 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00, real=0.11 secs]
> > > > > 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)] 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00, real=0.12 secs]
> > > > > 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)] 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00, real=0.16 secs]
> > > > > 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)] 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01, real=0.11 secs]
> > > > > 2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)] 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00, real=0.13 secs]
> > > > > 2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)] 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01, real=0.15 secs]
> > > > > 2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)] 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01, real=0.15 secs]
> > > > > 2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)] 7418950K->3280306K(9847936K), 0.1244440 secs] [Times: user=0.94 sys=0.00, real=0.12 secs]
> > > > >
> > > > > So the GC was doing a full collection around that time, but that only
> > > > > accounts for around 5 seconds, not the couple of minutes the spike lasts.
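> > > > > (For reference: GC output in this format comes from running the JVM with
> > > > > logging flags along the lines of
> > > > >
> > > > >     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log
> > > > >
> > > > > and the "make it GC earlier" change reported at the top of this thread
> > > > > boils down to CMS flags roughly like
> > > > >
> > > > >     -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
> > > > >     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
> > > > >
> > > > > where the 70% occupancy threshold is illustrative, not a value taken from
> > > > > this setup. CMS trades the long stop-the-world Full GC above for shorter,
> > > > > mostly concurrent old-generation collections.)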
> > > > > I am really at a complete loss as to what this could be. Google has not
> > > > > given me any clues.
> > > > >
> > > > > Med venlig hilsen / Best regards
> > > > >
> > > > > *John Nielsen*
> > > > > Programmer
> > > > >
> > > > > *MCB A/S*
> > > > > Enghaven 15
> > > > > DK-7500 Holstebro
> > > > >
> > > > > Kundeservice: +45 9610 2824
> > > > > p...@mcb.dk
> > > > > www.mcb.dk
> > > > >
> > > > > On Fri, Nov 23, 2012 at 1:56 PM, Otis Gospodnetic <otis.gospodne...@gmail.com> wrote:
> > > > >
> > > > > > Strange indeed. What about query rates during that time? What about GC?
> > > > > > And does the cache hit rate drop?
> > > > > >
> > > > > > Otis
> > > > > > --
> > > > > > SOLR Performance Monitoring - http://sematext.com/spm
> > > > > >
> > > > > > On Nov 23, 2012 2:45 AM, "John Nielsen" <j...@mcb.dk> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We are seeing a strange CPU spike on one of our solr4 servers which we
> > > > > > > are unable to explain. The spike, which only lasts for a couple of
> > > > > > > minutes, sends the disks racing. This happens a few times a day.
> > > > > > > This is what the load looks like:
> > > > > > >
> > > > > > > 2012.Nov.14 13:37:17   2.77
> > > > > > > 2012.Nov.14 13:36:17   3.65
> > > > > > > 2012.Nov.14 13:35:18   3.92
> > > > > > > 2012.Nov.14 13:34:17   3.95
> > > > > > > 2012.Nov.14 13:33:18   6.56
> > > > > > > 2012.Nov.14 13:32:17  10.79
> > > > > > > 2012.Nov.14 13:31:17  24.38
> > > > > > > 2012.Nov.14 13:30:17  63.35
> > > > > > > 2012.Nov.14 13:29:17  24.68
> > > > > > > 2012.Nov.14 13:28:17   2.44
> > > > > > > 2012.Nov.14 13:27:18   3.51
> > > > > > > 2012.Nov.14 13:26:17   5.26
> > > > > > > 2012.Nov.14 13:25:18   5.71
> > > > > > > 2012.Nov.14 13:24:17   2.7
> > > > > > >
> > > > > > > The problem is that out of a 3-minute spike, I get about 40 seconds of
> > > > > > > silence in the logs. This log usually adds around a thousand lines every
> > > > > > > second. Not being able to communicate with the server for this long
> > > > > > > breaks our use case.
> > > > > > >
> > > > > > > We have two servers, varnish01 and varnish02. We used to feed data to
> > > > > > > varnish02, replicate it to varnish01, and then read the data from
> > > > > > > varnish01. When we discovered this issue, we moved all traffic to
> > > > > > > varnish02, so that varnish01 still receives replicated data but
> > > > > > > otherwise gets zero traffic. The spike did not disappear.
> > > > > > >
> > > > > > > The spike we are seeing is on varnish01 only.
> > > > > > >
> > > > > > > Please note that our use case requires us to continuously feed large
> > > > > > > amounts of data from our main system, on the order of up to 1,000
> > > > > > > registers every minute. Our solrconfig.xml is attached.
> > > > > > >
> > > > > > > Has anyone seen this phenomenon before?
> > > > > > >
> > > > > > > Med venlig hilsen / Best regards
> > > > > > >
> > > > > > > *John Nielsen*
> > > > > > > Programmer
> > > > > > >
> > > > > > > *MCB A/S*
> > > > > > > Enghaven 15
> > > > > > > DK-7500 Holstebro
> > > > > > >
> > > > > > > Kundeservice: +45 9610 2824
> > > > > > > p...@mcb.dk
> > > > > > > www.mcb.dk
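For completeness: the feed-to-varnish02 / replicate-to-varnish01 setup described
above maps onto the replication handler in solrconfig.xml roughly as follows.
Hosts, ports, core names and the poll interval are placeholders, and the actual
attachment may differ:

    <!-- master (e.g. varnish02): publish the index after each commit -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
      </lst>
    </requestHandler>

    <!-- slave (e.g. varnish01): poll the master and pull new index versions -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://varnish02:8983/solr/corename/replication</str>
        <str name="pollInterval">00:01:00</str>
      </lst>
    </requestHandler>

Every poll that actually fetches a new index ends with a commit and a new
searcher on the slave, which is exactly the warm-up work described earlier in
the thread.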