Re: SOLR4 cluster - strange CPU spike on slave
Not quite. Too much memory for the JVM means that it starves the operating system and the file-system caching that goes on there. Objects that consume memory are created all the time in Solr, and they won't be collected until some threshold is passed, so you can be sure that the more memory you allocate to the JVM, the more will be used. The GC pauses will also eventually get quite long. Over-allocating memory to the JVM is therefore discouraged.

Best,
Erick

On Wed, Dec 5, 2012 at 11:28 PM, John Nielsen j...@mcb.dk wrote:

I'm not sure I understand why this is important. Too much memory would just be unused.
Re: SOLR4 cluster - strange CPU spike on slave
In addition to Mark's comment, be sure you aren't starving the OS of memory by over-allocating it to the JVM.

FWIW,
Erick

On Tue, Dec 4, 2012 at 2:25 AM, John Nielsen j...@mcb.dk wrote:

Success! I tried adding -XX:+UseConcMarkSweepGC to java to make it GC earlier. We haven't seen any spikes since.
Re: SOLR4 cluster - strange CPU spike on slave
Very interesting! I've seen references to NRTCachingDirectory, MMapDirectory, FSDirectory, RamDirectory and NIOFSDirectory, and that's just what I can remember. I have tried to search for more information about these, but I'm not having much luck. Is there a place where I can read up on them?

Med venlig hilsen / Best regards

*John Nielsen*
Programmer

*MCB A/S*
Enghaven 15
DK-7500 Holstebro
Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk

On Wed, Dec 5, 2012 at 1:11 AM, Mark Miller markrmil...@gmail.com wrote:

NRTCachingDirectory delegates to another directory and simply caches small segments in RAM - usually it delegates to MMapDirectory by default.
Re: SOLR4 cluster - strange CPU spike on slave
I'm not sure I understand why this is important. Too much memory would just be unused. This is what the heap looks like now:

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 17179869184 (16384.0MB)
   NewSize          = 21757952 (20.75MB)
   MaxNewSize       = 283508736 (270.375MB)
   OldSize          = 65404928 (62.375MB)
   NewRatio         = 7
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 176160768 (168.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 255197184 (243.375MB)
   used     = 108828496 (103.78694152832031MB)
   free     = 146368688 (139.5880584716797MB)
   42.644865548359654% used
Eden Space:
   capacity = 226885632 (216.375MB)
   used     = 83498424 (79.63030242919922MB)
   free     = 143387208 (136.74469757080078MB)
   36.80198841326365% used
From Space:
   capacity = 28311552 (27.0MB)
   used     = 25330072 (24.156639099121094MB)
   free     = 2981480 (2.8433609008789062MB)
   89.46903370044849% used
To Space:
   capacity = 28311552 (27.0MB)
   used     = 0 (0.0MB)
   free     = 28311552 (27.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 16896360448 (16113.625MB)
   used     = 12452710200 (11875.829887390137MB)
   free     = 4443650248 (4237.795112609863MB)
   73.70054775005708% used
Perm Generation:
   capacity = 70578176 (67.30859375MB)
   used     = 37652032 (35.90777587890625MB)
   free     = 32926144 (31.40081787109375MB)
   53.347981109627995% used

On Thu, Nov 29, 2012 at 4:08 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

If this is caused by index segment merging you should be able to see that very clearly on the Index report in SPM.
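As a quick sanity check, the "% used" figures in a dump like the one above are simply used divided by capacity. For example (numbers copied verbatim from the dump; the awk one-liner is just illustrative arithmetic):

```shell
# Recompute the "% used" figures for the CMS old generation and the
# new generation from the used/capacity byte counts in the heap dump.
awk 'BEGIN {
  printf "CMS old gen: %.2f%% used\n", 100 * 12452710200 / 16896360448
  printf "New gen:     %.2f%% used\n", 100 * 108828496  / 255197184
}'
```

Both results match the dump (roughly 73.70% and 42.64%), confirming the old generation is carrying most of the 16 GB heap.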
Re: SOLR4 cluster - strange CPU spike on slave
Success! I tried adding -XX:+UseConcMarkSweepGC to java to make it GC earlier. We haven't seen any spikes since. I'm still only cautiously optimistic, though, and will be monitoring the servers for a week or so before declaring final victory.

The post about MMapDirectory is really interesting. We switched to using that from NRTCachingDirectory and I am monitoring performance as well. Initially performance doesn't look stellar, but I suspect that we lack memory in the server to really make it shine.

On Fri, Nov 30, 2012 at 3:13 PM, Erick Erickson erickerick...@gmail.com wrote:

Right, so here's what I'd check for. Your logs should show a replication pretty much coincident with the spike, and that should be in the log.
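For reference, the change John describes amounts to one extra switch on the Solr JVM's command line. A sketch is below; only -XX:+UseConcMarkSweepGC comes from the thread, the 16 GB heap size matches the MaxHeapSize quoted elsewhere in it, and the companion CMS/logging flags are common pairings shown as an illustration, not the poster's actual settings:

```shell
# Illustrative JVM options for a Solr 4 node with a 16 GB heap.
# Only -XX:+UseConcMarkSweepGC is confirmed by the thread; the rest
# are typical companions (CMS trigger threshold plus GC logging).
JAVA_OPTS="-Xmx16g -Xms16g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
echo "$JAVA_OPTS"
```

CMS starts collecting the old generation concurrently before it fills, which is exactly the "make it GC earlier" behavior John was after, instead of the stop-the-world full collections the parallel collector was producing.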
Re: SOLR4 cluster - strange CPU spike on slave
On Dec 4, 2012, at 2:25 AM, John Nielsen j...@mcb.dk wrote:

The post about mmapdirectory is really interesting. We switched to using that from NRTCachingDirectory and I am monitoring performance as well.

NRTCachingDirectory delegates to another directory and simply caches small segments in RAM - usually it delegates to MMapDirectory by default. So you likely won't notice any changes, because you most likely haven't really changed anything. NRTCachingDirectory simply helps in the NRT case and, from what I've seen, doesn't really hurt in the standard case. It's more of a helper directory than a replacement.

- Mark
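For what it's worth, the directory implementation is chosen in solrconfig.xml via the directoryFactory element. A minimal sketch follows; the factory class names are the stock Solr 4 ones, and the commented-out alternative is purely illustrative:

```xml
<!-- solrconfig.xml: select the Lucene Directory implementation.
     NRTCachingDirectoryFactory wraps another directory (typically
     MMapDirectory) and caches small segments in RAM for NRT search. -->
<directoryFactory name="DirectoryFactory"
                  class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

<!-- To use memory-mapped files directly instead:
<directoryFactory name="DirectoryFactory"
                  class="solr.MMapDirectoryFactory"/>
-->
```

Per Mark's point, switching between these two often changes little in practice, since the NRT-caching factory already delegates to MMapDirectory underneath.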
Re: SOLR4 cluster - strange CPU spike on slave
Right, so here's what I'd check for. Your logs should show a replication pretty much coincident with the spike, and that should be in the log. Note: the replication should complete just before the spike. Or you can just turn replication off and fire it manually to try to force the situation at will; see http://wiki.apache.org/solr/SolrReplication#HTTP_API (but note that you'll have to wait until the index has changed on the master to see any action). So you should be able to create your spike at will.

And this will be pretty normal. When replication happens, a new searcher is opened, caches are filled, autowarming is done, all kinds of stuff like that. During this period the _old_ searcher is still open, which will both make the CPU busier and require additional memory. Once the new searcher is warmed, new queries go to it, and when the old searcher has finished serving all its queries, it shuts down and all its resources are freed. This is why commits are expensive operations.

All of which means that so far I don't think there's a problem; this is just normal Solr operation. If you're seeing responsiveness problems when serving queries, you probably want to throw more hardware (particularly memory) at the problem. But when thinking about how much memory to allocate to the JVM, _really_ read Uwe's post here: http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best,
Erick

On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen j...@mcb.dk wrote:

Yup, you read it right. We originally intended to do all our indexing to varnish02, replicate to varnish01, and then search from varnish01 (through a fail-over IP which would switch the reader to varnish02 in case of trouble). When I saw the spikes, I tried to eliminate possibilities by searching from varnish02 instead, leaving varnish01 with nothing to do but receive replication data. This did not remove the spikes. As soon as this spike issue is fixed, I will start searching from varnish01 again. These sorts of debug antics are only possible because, although we do have customers using this, we are still in our beta phase.

Varnish01 never receives any manual commit orders. Varnish02 does from time to time.

Oh, and I accidentally misinformed you before (damn secondary language): we are actually seeing the spikes on both servers. I was just focusing on varnish01 because I use it to eliminate possibilities. It just occurred to me: we tried switching off our feeder/index tool for 24 hours, and we didn't see any spikes during that period, so receiving replication data certainly has something to do with it.

On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson erickerick...@gmail.com wrote:

Am I reading this right? All you're doing on varnish1 is replicating to it? You're not searching or indexing on it?
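The manual-replication experiment Erick suggests can be driven through the replication handler's HTTP API from the wiki page above. The commands below are printed rather than executed (they need a live slave), and the host, port and core name are placeholders:

```shell
# Replication handler commands on a Solr 4 slave (hypothetical host/core).
SLAVE="http://varnish01:8983/solr/core0"

# Stop the slave polling the master on its own schedule:
echo "curl '$SLAVE/replication?command=disablepoll'"

# Trigger one replication cycle by hand (it only fetches if the
# master's index has actually changed since the last fetch):
echo "curl '$SLAVE/replication?command=fetchindex'"

# Resume normal polling afterwards:
echo "curl '$SLAVE/replication?command=enablepoll'"
```

Firing fetchindex by hand while watching CPU and disk is a direct way to confirm that replication plus the searcher swap is what produces the spike.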
Re: SOLR4 cluster - strange CPU spike on slave
I apologize for the late reply.

The query load is more or less stable during the spikes. There are always fluctuations, but nothing of the order of magnitude that could explain this spike. In fact, the latest spike occurred last night when there was almost no one using it.

To test a hunch of mine, I tried deactivating all caches by commenting out all cache entries in solrconfig.xml. It still spikes, so I don't think it has anything to do with cache warming or hits/misses or anything of the sort. One interesting thing about the GC, though.

This is our latest spike's CPU load (this server has 8 cores, so a load higher than 8 is potentially troublesome):

2012.Nov.27 19:58:18   2.27
2012.Nov.27 19:57:17   4.06
2012.Nov.27 19:56:18   8.95
2012.Nov.27 19:55:17  19.97
2012.Nov.27 19:54:17  32.27
2012.Nov.27 19:53:18   1.67
2012.Nov.27 19:52:17   1.6
2012.Nov.27 19:51:18   1.77
2012.Nov.27 19:50:17   1.89

This is what the GC was doing around that time:

2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)] 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00, real=0.09 secs]
2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)] 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00, real=0.10 secs]
2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)] 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00, real=0.09 secs]
2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)] 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01, real=0.11 secs]
2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)] 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01, real=0.07 secs]
2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)] 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01, real=0.08 secs]
2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)] 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00, real=0.10 secs]
2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)] 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09, real=0.19 secs]
2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12, real=0.35 secs]
2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K) [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23 sys=0.00, real=5.23 secs]
2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)] 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00, real=0.04 secs]
2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)] 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00, real=0.05 secs]
2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)] 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00, real=0.06 secs]
2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)] 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)] 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01, real=0.12 secs]
2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)] 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00, real=0.11 secs]
2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)] 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00, real=0.12 secs]
2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)] 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00, real=0.16 secs]
2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)] 7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01, real=0.11 secs]
2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)] 7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00, real=0.13 secs]
2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)] 7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01, real=0.15 secs]
2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)] 7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01, real=0.15 secs]
2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)] 7418950K->3280306K(9847936K), 0.120 secs] [Times: user=0.94 sys=0.00, real=0.12 secs]

So the GC was doing a full collection around that time, but that only accounts for around 5 seconds, not the couple of minutes the spike lasts. I am really at a complete loss as to what is going on.
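Pauses like the 5.23-second full collection can be pulled out of a GC log mechanically. A small sketch follows; gc-sample.log holds two abbreviated entries copied from the log above (the [Times: ...] suffixes trimmed), and in practice you would point the grep at the real log file:

```shell
# Extract stop-the-world full-GC pause durations from a HotSpot GC log.
cat > gc-sample.log <<'EOF'
2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs]
2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K), 5.2310820 secs]
EOF

# Print the pause duration of every full collection:
grep 'Full GC' gc-sample.log | grep -o '[0-9][0-9.]* secs'
```

Comparing the summed full-GC pauses against the length of the CPU spike makes John's point concrete: one 5-second pause cannot account for minutes of load.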
Re: SOLR4 cluster - strange CPU spike on slave
Am I reading this right? All you're doing on varnish1 is replicating to it? You're not searching or indexing? I'm sure I'm misreading this. The spike, which only lasts for a couple of minutes, sends the disks racing This _sounds_ suspiciously like segment merging, especially the disks racing bit. Or possibly replication. Neither of which make much sense. But is there any chance that somehow multiple commits are being issued? Of course if varnish1 is a slave, that shouldn't be happening either. And the whole bit about nothing going to the logs is just bizarre. I'm tempted to claim hardware gremlins, especially if you see nothing similar on varnish2. Or some other process is pegging the machine. All of which is a way of saying I have no idea Yours in bewilderment, Erick On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote: I apologize for the late reply. The query load is more or less stable during the spikes. There are always fluctuations, but nothing on the order of magnitude that could explain this spike. In fact, the latest spike occured last night when there were almost noone using it. To test a hunch of mine, I tried to deactivate all caches by commenting out all cache entries in solrconfig.xml. It still spikes, so I dont think it has anything to do with cache warming or hits/misses or anything of the sort. One interesting thing GC though. 
This is our latest spike with CPU load (this server has 8 cores, so a load higher than 8 is potentially troublesome):

2012.Nov.27 19:58:18   2.27
2012.Nov.27 19:57:17   4.06
2012.Nov.27 19:56:18   8.95
2012.Nov.27 19:55:17  19.97
2012.Nov.27 19:54:17  32.27
2012.Nov.27 19:53:18   1.67
2012.Nov.27 19:52:17   1.6
2012.Nov.27 19:51:18   1.77
2012.Nov.27 19:50:17   1.89

This is what the GC was doing around that time:

2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)] 8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00, real=0.09 secs]
2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)] 8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00, real=0.10 secs]
2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)] 9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00, real=0.09 secs]
2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)] 9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01, real=0.11 secs]
2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)] 9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01, real=0.07 secs]
2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)] 8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01, real=0.08 secs]
2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)] 9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00, real=0.10 secs]
2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)] 9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09, real=0.19 secs]
2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12, real=0.35 secs]
2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K) [PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23 sys=0.00, real=5.23 secs]
2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)] 7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00, real=0.04 secs]
2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)] 6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00, real=0.05 secs]
2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)] 7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00, real=0.06 secs]
2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)] 7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00, real=0.06 secs]
2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)] 7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01, real=0.12 secs]
2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)] 7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00, real=0.11 secs]
2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)] 7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00, real=0.12 secs]
2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)] 7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00, real=0.16 secs]
2012-11-27T19:58:24.062+0100: [GC [PSYoungGen:
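The suspicious entry in the log above is the 5.23-second stop-the-world Full GC at 19:55:09, right inside the load spike. Logs like this can be scanned programmatically for long pauses. A minimal sketch, not from the original thread; the simplified regex assumes the ParallelGC (`PSYoungGen`/`PSOldGen`) log format shown above:

```python
import re

# Matches a HotSpot ParallelGC log entry and captures its timestamp,
# collection kind ("GC" or "Full GC"), and pause duration in seconds.
ENTRY = re.compile(
    r"^(?P<ts>\S+): \[(?P<kind>Full GC|GC).*?, (?P<secs>\d+\.\d+) secs\]"
)

def long_pauses(log_lines, threshold=1.0):
    """Return (timestamp, kind, seconds) for every pause above threshold."""
    hits = []
    for line in log_lines:
        m = ENTRY.match(line)
        if m and float(m.group("secs")) > threshold:
            hits.append((m.group("ts"), m.group("kind"), float(m.group("secs"))))
    return hits

# Two entries copied from the GC log above (Times sections trimmed).
log = [
    "2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)] 9156025K->5075285K(9547072K), 0.3511090 secs]",
    "2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)] [PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K), 5.2310820 secs]",
]
# Only the 5.23 s Full GC exceeds the 1-second threshold.
print(long_pauses(log))
```

A pause of this length would also explain the reported gap in the request logs: the whole JVM is stopped, so nothing is served or logged while the collection runs.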
Re: SOLR4 cluster - strange CPU spike on slave
If this is caused by index segment merging, you should be able to see that very clearly on the Index report in SPM, where you would see sudden graph movement at the time of the spike, corresponding to the CPU and disk activity. I think uncommenting the infoStream entry in solrconfig.xml would also show it.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm

On Nov 28, 2012 9:20 PM, Erick Erickson erickerick...@gmail.com wrote:

[quoted text trimmed; identical to Erick's and John's messages above]
Re: SOLR4 cluster - strange CPU spike on slave
Yup, you read it right. We originally intended to do all our indexing to varnish02, replicate to varnish01, and then search from varnish01 (through a fail-over IP which would switch the reader to varnish02 in case of trouble).

When I saw the spikes, I tried to eliminate possibilities by starting to search from varnish02, leaving varnish01 with nothing to do but receive replication data. This did not remove the spikes. As soon as this spike is fixed, I will start searching from varnish01 again. This sort of debug antics is only possible because, although we do have customers using this, we are still in our beta phase.

Varnish01 never receives any manual commit orders. Varnish02 does from time to time.

Oh, and I accidentally misinformed you before (damn secondary language): we are actually seeing the spikes on both servers. I was just focusing on varnish01 because I use it to eliminate possibilities.

It just occurred to me: we tried switching off our feeder/index tool for 24 hours, and we didn't see any spikes during that period, so receiving replication data certainly has something to do with it.

Med venlig hilsen / Best regards

John Nielsen
Programmer

MCB A/S
Enghaven 15
DK-7500 Holstebro
Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk

On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson erickerick...@gmail.com wrote:

[quoted text trimmed; identical to Erick's and John's messages above]
SOLR4 cluster - strange CPU spike on slave
Hi all,

We are seeing a strange CPU spike on one of our solr4 servers which we are unable to explain. The spike, which only lasts for a couple of minutes, sends the disks racing. This happens a few times a day. This is what the load looks like:

2012.Nov.14 13:37:17   2.77
2012.Nov.14 13:36:17   3.65
2012.Nov.14 13:35:18   3.92
2012.Nov.14 13:34:17   3.95
2012.Nov.14 13:33:18   6.56
2012.Nov.14 13:32:17  10.79
2012.Nov.14 13:31:17  24.38
2012.Nov.14 13:30:17  63.35
2012.Nov.14 13:29:17  24.68
2012.Nov.14 13:28:17   2.44
2012.Nov.14 13:27:18   3.51
2012.Nov.14 13:26:17   5.26
2012.Nov.14 13:25:18   5.71
2012.Nov.14 13:24:17   2.7

The problem is that out of a 3-minute spike, I get about 40 seconds of silence in the logs. This log usually adds about a thousand lines every second. Not being able to communicate with the server for this long breaks our use case.

We have two servers, varnish01 and varnish02. We used to feed data to varnish02 and replicate it to varnish01, where the data is then read from. When we discovered this issue, we moved all traffic to varnish02, so that varnish01 still receives replication data but otherwise gets zero traffic. The spike did not disappear. The spike we are seeing is on varnish01 only.

Please note that our use case requires us to continuously feed large amounts of data from our main system, on the order of up to 1,000 records every minute.

Has anyone seen this phenomenon before?

Best regards,
John Nielsen
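A load average is only alarming relative to the number of cores: sustained load above the core count means runnable work is queuing for CPUs. The spike window in the listing above can be isolated mechanically; a minimal sketch (sample values copied from the listing; the core count of 8 is an assumption, taken from a figure quoted later in the thread):

```python
# Flag load-average samples that exceed the machine's core count,
# i.e. moments when runnable work outstrips the available CPUs.
# NUM_CORES = 8 is an assumption based on figures elsewhere in the thread.
NUM_CORES = 8

samples = [
    ("2012.Nov.14 13:33:18", 6.56),
    ("2012.Nov.14 13:32:17", 10.79),
    ("2012.Nov.14 13:31:17", 24.38),
    ("2012.Nov.14 13:30:17", 63.35),
    ("2012.Nov.14 13:29:17", 24.68),
    ("2012.Nov.14 13:28:17", 2.44),
]

overloaded = [(ts, load) for ts, load in samples if load > NUM_CORES]
for ts, load in overloaded:
    print(f"{ts}  load {load:.2f}  ({load / NUM_CORES:.1f}x core count)")
```

On these samples the window 13:29-13:32 is flagged, peaking at roughly 8x the core count at 13:30, which matches the few-minute spike described above.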