Re: SOLR4 cluster - strange CPU spike on slave

2012-12-06 Thread Erick Erickson
Not quite. Too much memory for the JVM starves the operating system and the
file-system caching that goes on there. Solr creates memory-consuming objects
all the time, and they aren't collected until some threshold is passed, so you
can be sure that the more memory you allocate to the JVM, the more will
eventually be used. And the GC pauses will eventually get quite long.

So over-allocating memory to the JVM is discouraged.
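
As a rough sketch of the idea (the numbers are placeholders, not a
recommendation for any particular box): cap the heap well below physical RAM
so the rest stays available to the OS page cache, which is what actually holds
your index files.

  # e.g. on a 32 GB machine, assuming the stock Jetty start.jar launcher
  java -Xms4g -Xmx4g -jar start.jar
  # everything above the 4 GB heap is left for the OS to cache index files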

Best
Erick


On Wed, Dec 5, 2012 at 11:28 PM, John Nielsen j...@mcb.dk wrote:

 I'm not sure I understand why this is important. Too much memory would just
 be unused.

 This is what the heap looks now:

 Heap Configuration:
MinHeapFreeRatio = 40
MaxHeapFreeRatio = 70
MaxHeapSize  = 17179869184 (16384.0MB)
NewSize  = 21757952 (20.75MB)
MaxNewSize   = 283508736 (270.375MB)
OldSize  = 65404928 (62.375MB)
NewRatio = 7
SurvivorRatio= 8
PermSize = 21757952 (20.75MB)
MaxPermSize  = 176160768 (168.0MB)

 Heap Usage:
 New Generation (Eden + 1 Survivor Space):
capacity = 255197184 (243.375MB)
used = 108828496 (103.78694152832031MB)
free = 146368688 (139.5880584716797MB)
42.644865548359654% used
 Eden Space:
capacity = 226885632 (216.375MB)
used = 83498424 (79.63030242919922MB)
free = 143387208 (136.74469757080078MB)
36.80198841326365% used
 From Space:
capacity = 28311552 (27.0MB)
used = 25330072 (24.156639099121094MB)
free = 2981480 (2.8433609008789062MB)
89.46903370044849% used
 To Space:
capacity = 28311552 (27.0MB)
used = 0 (0.0MB)
free = 28311552 (27.0MB)
0.0% used
 concurrent mark-sweep generation:
capacity = 16896360448 (16113.625MB)
used = 12452710200 (11875.829887390137MB)
free = 4443650248 (4237.795112609863MB)
73.70054775005708% used
 Perm Generation:
capacity = 70578176 (67.30859375MB)
used = 37652032 (35.90777587890625MB)
free = 32926144 (31.40081787109375MB)
53.347981109627995% used



 Med venlig hilsen / Best regards

 *John Nielsen*
 Programmer



 *MCB A/S*
 Enghaven 15
 DK-7500 Holstebro

 Kundeservice: +45 9610 2824
 p...@mcb.dk
 www.mcb.dk



 On Thu, Nov 29, 2012 at 4:08 AM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  If this is caused by index segment merging you should be able to see that
  very clearly on the Index report in SPM, where you would see sudden graph
  movement at the time of spike and corresponding to CPU and disk activity.
  I think uncommenting that infostream in solrconfig would also show it.
 
  Otis
  --
  SOLR Performance Monitoring - http://sematext.com/spm
  On Nov 28, 2012 9:20 PM, Erick Erickson erickerick...@gmail.com
 wrote:
 
   Am I reading this right? All you're doing on varnish1 is replicating to
  it?
   You're not searching or indexing? I'm sure I'm misreading this.
  
  
   The spike, which only lasts for a couple of minutes, sends the disks
   racing This _sounds_ suspiciously like segment merging, especially the
   disks racing bit. Or possibly replication. Neither of which make much
   sense. But is there any chance that somehow multiple commits are being
   issued? Of course if varnish1 is a slave, that shouldn't be happening
   either.
  
   And the whole bit about nothing going to the logs is just bizarre. I'm
   tempted to claim hardware gremlins, especially if you see nothing
 similar
   on varnish2. Or some other process is pegging the machine. All of which
  is
   a way of saying I have no idea
  
   Yours in bewilderment,
   Erick
  
  
  
   On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:
  
I apologize for the late reply.
   
The query load is more or less stable during the spikes. There are
  always
fluctuations, but nothing on the order of magnitude that could
 explain
   this
spike. In fact, the latest spike occured last night when there were
   almost
noone using it.
   
To test a hunch of mine, I tried to deactivate all caches by
 commenting
   out
all cache entries in solrconfig.xml. It still spikes, so I dont think
  it
has anything to do with cache warming or hits/misses or anything of
 the
sort.
   
One interesting thing GC though. This is our latest spike with cpu
 load
(this server has 8 cores, so a load higher than 8 is potentially
troublesome):
   
2012.Nov.27 19:58:182.27
2012.Nov.27 19:57:174.06
2012.Nov.27 19:56:188.95
2012.Nov.27 19:55:1719.97
2012.Nov.27 19:54:1732.27
2012.Nov.27 19:53:181.67
2012.Nov.27 19:52:171.6
2012.Nov.27 19:51:181.77
2012.Nov.27 19:50:171.89
   
This is what the GC was doing around that time:
   
2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
   4777586K-277641K(4969216K)]
8887542K-4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
  sys=0.00,
real=0.09 secs]

Re: SOLR4 cluster - strange CPU spike on slave

2012-12-05 Thread Erick Erickson
In addition to Mark's comment, be sure you aren't starving the OS of memory
by over-allocating to the JVM.

FWIW,
Erick


On Tue, Dec 4, 2012 at 2:25 AM, John Nielsen j...@mcb.dk wrote:

 Success!

 I tried adding -XX:+UseConcMarkSweepGC to java to make it GC earlier. We
 haven't seen any spikes since.

 I'm cautiously optimistic though and will be monitoring the servers for a
 week or so before declaring final victory.

 The post about mmapdirectory is really interesting. We switched to using
 that from NRTCachingDirectory and am monitoring performance as well.
 Initially performance doesn't look stellar, but i suspect that we lack
 memory in the server to really make it shine.


 Med venlig hilsen / Best regards

 *John Nielsen*
 Programmer



 *MCB A/S*
 Enghaven 15
 DK-7500 Holstebro

 Kundeservice: +45 9610 2824
 p...@mcb.dk
 www.mcb.dk



 On Fri, Nov 30, 2012 at 3:13 PM, Erick Erickson erickerick...@gmail.com
 wrote:

  right, so here's what I'd check for.
 
  Your logs should show a replication pretty coincident with the spike and
  that should be in the log. Note: the replication should complete just
  before the spike.
 
  Or you can just turn replication off and fire it manually to try to force
  the situation at will, see:
  http://wiki.apache.org/solr/SolrReplication#HTTP_API. (but note that
  you'll
  have to wait until the index has changed on the master to see any
 action).
 
  So you should be able to create your spike at will. And this will be
 pretty
  normal. When replication happens, a new searcher is opened, caches are
  filled, autowarming is done, all kinds of stuff like that. During this
  period, the _old_ searcher is still open, which will both cause the CPU
 to
  be busier and require additional memory. Once the new searcher is warmed,
  new queries go to it, and when the old searcher has finished serving all
  the queries it shuts down and all the resources are freed. Which is why
  commits are expensive operations.
 
  All of which means that so far I don't think there's a problem, this is
  just normal Solr operation. If you're seeing responsiveness problems when
  serving queries you probably want to throw more hardware (particularly
  memory) at the problem.
 
  But when thinking about memory allocating to the JVM, _really_ read Uwe's
  post here:
  http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
 
  Best
  Erick
 
 
  On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen j...@mcb.dk wrote:
 
   Yup you read it right.
  
   We originally intended to do all our indexing to varnish02, replicate
 to
   varnish01 and then search from varnish01 (through a fail-over ip which
   would switch the reader to varnish02 in case of trouble).
  
   When I saw the spikes, I tried to eliminate possibilities by starting
   searching from varnish02, leaving varnish01 with nothing to do but to
   receive replication data. This did not remove the spikes. As soon as
 this
   spike is fixed, I will start searching from varnish01 again. These sort
  of
   debug antics are only possible because, although we do have customers
  using
   this, we are still in our beta phase.
  
   Varnish01 never receives any manual commit orders. Varnish02 does from
  time
   to time.
  
   Oh, and I accidentally misinformed you before. (damn secondary
 language)
  We
   are actually seeing the spikes on both servers. I was just focusing on
   varnish01 because I use it to eliminate possibilities.
  
   It just occurred to me now; We tried switching off our feeder/index
 tool
   for 24 hours, and we didn't see any spikes during this period, so
  receiving
   replication data certainly has something to do with it.
  
   Med venlig hilsen / Best regards
  
   *John Nielsen*
   Programmer
  
  
  
   *MCB A/S*
   Enghaven 15
   DK-7500 Holstebro
  
   Kundeservice: +45 9610 2824
   p...@mcb.dk
   www.mcb.dk
  
  
  
   On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson 
 erickerick...@gmail.com
   wrote:
  
Am I reading this right? All you're doing on varnish1 is replicating
 to
   it?
You're not searching or indexing? I'm sure I'm misreading this.
   
   
The spike, which only lasts for a couple of minutes, sends the disks
racing This _sounds_ suspiciously like segment merging, especially
 the
disks racing bit. Or possibly replication. Neither of which make
 much
sense. But is there any chance that somehow multiple commits are
 being
issued? Of course if varnish1 is a slave, that shouldn't be happening
either.
   
And the whole bit about nothing going to the logs is just bizarre.
 I'm
tempted to claim hardware gremlins, especially if you see nothing
  similar
on varnish2. Or some other process is pegging the machine. All of
 which
   is
a way of saying I have no idea
   
Yours in bewilderment,
Erick
   
   
   
On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:
   
 I apologize for the late reply.

 The 

Re: SOLR4 cluster - strange CPU spike on slave

2012-12-05 Thread John Nielsen
Very interesting!

I've seen references to NRTCachingDirectory, MMapDirectory, FSDirectory,
RAMDirectory and NIOFSDirectory, and that's just what I can remember. I have
tried to search for more information about these, but I'm not having much
luck.

Is there a place where I can read up on these?

Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk



On Wed, Dec 5, 2012 at 1:11 AM, Mark Miller markrmil...@gmail.com wrote:


 On Dec 4, 2012, at 2:25 AM, John Nielsen j...@mcb.dk wrote:

  The post about mmapdirectory is really interesting. We switched to using
  that from NRTCachingDirectory and am monitoring performance as well.
  Initially performance doesn't look stellar, but i suspect that we lack
  memory in the server to really make it shine.

 NRTCachingDirectory delegates to another directory and simply caches small
 segments in RAM - usually it delegates MMapDirectory by default. So likely
 you won't notice any changes, because you have not likely really changed
 anything. NRTCachingDirectory simply helps in the NRT case and doesn't
 really hurt that I've seen in the std case. It's more like a help dir than
 a replacement.

 - Mark


Re: SOLR4 cluster - strange CPU spike on slave

2012-12-05 Thread John Nielsen
I'm not sure I understand why this is important. Too much memory would just
be unused.

This is what the heap looks like now:

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize  = 17179869184 (16384.0MB)
   NewSize  = 21757952 (20.75MB)
   MaxNewSize   = 283508736 (270.375MB)
   OldSize  = 65404928 (62.375MB)
   NewRatio = 7
   SurvivorRatio= 8
   PermSize = 21757952 (20.75MB)
   MaxPermSize  = 176160768 (168.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 255197184 (243.375MB)
   used = 108828496 (103.78694152832031MB)
   free = 146368688 (139.5880584716797MB)
   42.644865548359654% used
Eden Space:
   capacity = 226885632 (216.375MB)
   used = 83498424 (79.63030242919922MB)
   free = 143387208 (136.74469757080078MB)
   36.80198841326365% used
From Space:
   capacity = 28311552 (27.0MB)
   used = 25330072 (24.156639099121094MB)
   free = 2981480 (2.8433609008789062MB)
   89.46903370044849% used
To Space:
   capacity = 28311552 (27.0MB)
   used = 0 (0.0MB)
   free = 28311552 (27.0MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 16896360448 (16113.625MB)
   used = 12452710200 (11875.829887390137MB)
   free = 4443650248 (4237.795112609863MB)
   73.70054775005708% used
Perm Generation:
   capacity = 70578176 (67.30859375MB)
   used = 37652032 (35.90777587890625MB)
   free = 32926144 (31.40081787109375MB)
   53.347981109627995% used
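
For reference, a heap summary in this format is what HotSpot's jmap tool
prints against a running JVM; roughly:

  # <solr-pid> is the process id of the Solr JVM
  jmap -heap <solr-pid>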



Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk



On Thu, Nov 29, 2012 at 4:08 AM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 If this is caused by index segment merging you should be able to see that
 very clearly on the Index report in SPM, where you would see sudden graph
 movement at the time of spike and corresponding to CPU and disk activity.
 I think uncommenting that infostream in solrconfig would also show it.

 Otis
 --
 SOLR Performance Monitoring - http://sematext.com/spm
 On Nov 28, 2012 9:20 PM, Erick Erickson erickerick...@gmail.com wrote:

  Am I reading this right? All you're doing on varnish1 is replicating to
 it?
  You're not searching or indexing? I'm sure I'm misreading this.
 
 
  The spike, which only lasts for a couple of minutes, sends the disks
  racing This _sounds_ suspiciously like segment merging, especially the
  disks racing bit. Or possibly replication. Neither of which make much
  sense. But is there any chance that somehow multiple commits are being
  issued? Of course if varnish1 is a slave, that shouldn't be happening
  either.
 
  And the whole bit about nothing going to the logs is just bizarre. I'm
  tempted to claim hardware gremlins, especially if you see nothing similar
  on varnish2. Or some other process is pegging the machine. All of which
 is
  a way of saying I have no idea
 
  Yours in bewilderment,
  Erick
 
 
 
  On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:
 
   I apologize for the late reply.
  
   The query load is more or less stable during the spikes. There are
 always
   fluctuations, but nothing on the order of magnitude that could explain
  this
   spike. In fact, the latest spike occured last night when there were
  almost
   noone using it.
  
   To test a hunch of mine, I tried to deactivate all caches by commenting
  out
   all cache entries in solrconfig.xml. It still spikes, so I dont think
 it
   has anything to do with cache warming or hits/misses or anything of the
   sort.
  
   One interesting thing GC though. This is our latest spike with cpu load
   (this server has 8 cores, so a load higher than 8 is potentially
   troublesome):
  
   2012.Nov.27 19:58:182.27
   2012.Nov.27 19:57:174.06
   2012.Nov.27 19:56:188.95
   2012.Nov.27 19:55:1719.97
   2012.Nov.27 19:54:1732.27
   2012.Nov.27 19:53:181.67
   2012.Nov.27 19:52:171.6
   2012.Nov.27 19:51:181.77
   2012.Nov.27 19:50:171.89
  
   This is what the GC was doing around that time:
  
   2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
  4777586K-277641K(4969216K)]
   8887542K-4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
 sys=0.00,
   real=0.09 secs]
   2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
  4749769K-325171K(5068096K)]
   8859821K-4435320K(9504704K), 0.0992160 secs] [Times: user=0.63
 sys=0.00,
   real=0.10 secs]
   2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
  4911603K-306181K(5071168K)]
   9021752K-4416617K(9507776K), 0.0957890 secs] [Times: user=0.62
 sys=0.00,
   real=0.09 secs]
   2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
  4892613K-376175K(5075328K)]
   9003049K-4486755K(9511936K), 0.1099830 secs] [Times: user=0.79
 sys=0.01,
   real=0.11 secs]
   2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
  4972847K-271468K(4868160K)]
   9083427K-4383520K(9304768K), 0.0699660 secs] [Times: 

Re: SOLR4 cluster - strange CPU spike on slave

2012-12-04 Thread John Nielsen
Success!

I tried adding -XX:+UseConcMarkSweepGC to the JVM options to make it collect
earlier. We haven't seen any spikes since.
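
For completeness, the change boils down to something like this on the startup
command line; the heap size and CMS occupancy settings below are illustrative
placeholders, not our actual production values:

  # assuming the stock Jetty start.jar launcher
  java -Xmx16g \
       -XX:+UseConcMarkSweepGC \
       -XX:CMSInitiatingOccupancyFraction=75 \
       -XX:+UseCMSInitiatingOccupancyOnly \
       -jar start.jar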

I'm cautiously optimistic though and will be monitoring the servers for a
week or so before declaring final victory.

The post about MMapDirectory is really interesting. We switched to using
that from NRTCachingDirectory and I am monitoring performance as well.
Initially performance doesn't look stellar, but I suspect that we lack
memory in the server to really make it shine.
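
For anyone wanting to try the same switch: the directory implementation is
selected by the directoryFactory element in solrconfig.xml, and if your config
uses the stock ${solr.directoryFactory:...} placeholder it can even be
overridden from the command line, for example:

  # hypothetical example; only works if solrconfig.xml reads this property
  java -Dsolr.directoryFactory=solr.MMapDirectoryFactory -jar start.jar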


Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk



On Fri, Nov 30, 2012 at 3:13 PM, Erick Erickson erickerick...@gmail.comwrote:

 right, so here's what I'd check for.

 Your logs should show a replication pretty coincident with the spike and
 that should be in the log. Note: the replication should complete just
 before the spike.

 Or you can just turn replication off and fire it manually to try to force
 the situation at will, see:
 http://wiki.apache.org/solr/SolrReplication#HTTP_API. (but note that
 you'll
 have to wait until the index has changed on the master to see any action).

 So you should be able to create your spike at will. And this will be pretty
 normal. When replication happens, a new searcher is opened, caches are
 filled, autowarming is done, all kinds of stuff like that. During this
 period, the _old_ searcher is still open, which will both cause the CPU to
 be busier and require additional memory. Once the new searcher is warmed,
 new queries go to it, and when the old searcher has finished serving all
 the queries it shuts down and all the resources are freed. Which is why
 commits are expensive operations.

 All of which means that so far I don't think there's a problem, this is
 just normal Solr operation. If you're seeing responsiveness problems when
 serving queries you probably want to throw more hardware (particularly
 memory) at the problem.

 But when thinking about memory allocating to the JVM, _really_ read Uwe's
 post here:
 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

 Best
 Erick


 On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen j...@mcb.dk wrote:

  Yup you read it right.
 
  We originally intended to do all our indexing to varnish02, replicate to
  varnish01 and then search from varnish01 (through a fail-over ip which
  would switch the reader to varnish02 in case of trouble).
 
  When I saw the spikes, I tried to eliminate possibilities by starting
  searching from varnish02, leaving varnish01 with nothing to do but to
  receive replication data. This did not remove the spikes. As soon as this
  spike is fixed, I will start searching from varnish01 again. These sort
 of
  debug antics are only possible because, although we do have customers
 using
  this, we are still in our beta phase.
 
  Varnish01 never receives any manual commit orders. Varnish02 does from
 time
  to time.
 
  Oh, and I accidentally misinformed you before. (damn secondary language)
 We
  are actually seeing the spikes on both servers. I was just focusing on
  varnish01 because I use it to eliminate possibilities.
 
  It just occurred to me now; We tried switching off our feeder/index tool
  for 24 hours, and we didn't see any spikes during this period, so
 receiving
  replication data certainly has something to do with it.
 
  Med venlig hilsen / Best regards
 
  *John Nielsen*
  Programmer
 
 
 
  *MCB A/S*
  Enghaven 15
  DK-7500 Holstebro
 
  Kundeservice: +45 9610 2824
  p...@mcb.dk
  www.mcb.dk
 
 
 
  On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson erickerick...@gmail.com
  wrote:
 
   Am I reading this right? All you're doing on varnish1 is replicating to
  it?
   You're not searching or indexing? I'm sure I'm misreading this.
  
  
   The spike, which only lasts for a couple of minutes, sends the disks
   racing This _sounds_ suspiciously like segment merging, especially the
   disks racing bit. Or possibly replication. Neither of which make much
   sense. But is there any chance that somehow multiple commits are being
   issued? Of course if varnish1 is a slave, that shouldn't be happening
   either.
  
   And the whole bit about nothing going to the logs is just bizarre. I'm
   tempted to claim hardware gremlins, especially if you see nothing
 similar
   on varnish2. Or some other process is pegging the machine. All of which
  is
   a way of saying I have no idea
  
   Yours in bewilderment,
   Erick
  
  
  
   On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:
  
I apologize for the late reply.
   
The query load is more or less stable during the spikes. There are
  always
fluctuations, but nothing on the order of magnitude that could
 explain
   this
spike. In fact, the latest spike occured last night when there were
   almost
noone using it.
   
To test a hunch of mine, I tried to deactivate all caches by
 commenting
   

Re: SOLR4 cluster - strange CPU spike on slave

2012-12-04 Thread Mark Miller

On Dec 4, 2012, at 2:25 AM, John Nielsen j...@mcb.dk wrote:

 The post about mmapdirectory is really interesting. We switched to using
 that from NRTCachingDirectory and am monitoring performance as well.
 Initially performance doesn't look stellar, but i suspect that we lack
 memory in the server to really make it shine.

NRTCachingDirectory delegates to another directory and simply caches small
segments in RAM - usually it delegates to MMapDirectory by default. So you
likely won't notice any changes, because you have likely not really changed
anything. NRTCachingDirectory simply helps in the NRT case and, from what I've
seen, doesn't really hurt in the standard case. It's more like a helper
directory than a replacement.

- Mark

Re: SOLR4 cluster - strange CPU spike on slave

2012-11-30 Thread Erick Erickson
right, so here's what I'd check for.

Your logs should show a replication roughly coincident with the spike.
Note: the replication should complete just before the spike.

Or you can just turn replication off and fire it manually to try to force
the situation at will, see:
http://wiki.apache.org/solr/SolrReplication#HTTP_API. (but note that you'll
have to wait until the index has changed on the master to see any action).
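
A minimal sketch of doing that by hand, run against the slave and assuming
the default port and a single-core URL layout (adjust the URL to your setup):

  # stop the slave from polling the master
  curl 'http://localhost:8983/solr/replication?command=disablepoll'
  # pull the index from the master once, by hand
  curl 'http://localhost:8983/solr/replication?command=fetchindex'
  # re-enable polling when done
  curl 'http://localhost:8983/solr/replication?command=enablepoll'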

So you should be able to create your spike at will. And this will be pretty
normal. When replication happens, a new searcher is opened, caches are
filled, autowarming is done, all kinds of stuff like that. During this
period, the _old_ searcher is still open, which will both cause the CPU to
be busier and require additional memory. Once the new searcher is warmed,
new queries go to it, and when the old searcher has finished serving all
the queries it shuts down and all the resources are freed. Which is why
commits are expensive operations.

All of which means that so far I don't think there's a problem, this is
just normal Solr operation. If you're seeing responsiveness problems when
serving queries you probably want to throw more hardware (particularly
memory) at the problem.

But when thinking about memory allocation to the JVM, _really_ read Uwe's
post here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Best
Erick


On Thu, Nov 29, 2012 at 2:39 AM, John Nielsen j...@mcb.dk wrote:

 Yup you read it right.

 We originally intended to do all our indexing to varnish02, replicate to
 varnish01 and then search from varnish01 (through a fail-over ip which
 would switch the reader to varnish02 in case of trouble).

 When I saw the spikes, I tried to eliminate possibilities by starting
 searching from varnish02, leaving varnish01 with nothing to do but to
 receive replication data. This did not remove the spikes. As soon as this
 spike is fixed, I will start searching from varnish01 again. These sort of
 debug antics are only possible because, although we do have customers using
 this, we are still in our beta phase.

 Varnish01 never receives any manual commit orders. Varnish02 does from time
 to time.

 Oh, and I accidentally misinformed you before. (damn secondary language) We
 are actually seeing the spikes on both servers. I was just focusing on
 varnish01 because I use it to eliminate possibilities.

 It just occurred to me now; We tried switching off our feeder/index tool
 for 24 hours, and we didn't see any spikes during this period, so receiving
 replication data certainly has something to do with it.

 Med venlig hilsen / Best regards

 *John Nielsen*
 Programmer



 *MCB A/S*
 Enghaven 15
 DK-7500 Holstebro

 Kundeservice: +45 9610 2824
 p...@mcb.dk
 www.mcb.dk



 On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson erickerick...@gmail.com
 wrote:

  Am I reading this right? All you're doing on varnish1 is replicating to
 it?
  You're not searching or indexing? I'm sure I'm misreading this.
 
 
  The spike, which only lasts for a couple of minutes, sends the disks
  racing This _sounds_ suspiciously like segment merging, especially the
  disks racing bit. Or possibly replication. Neither of which make much
  sense. But is there any chance that somehow multiple commits are being
  issued? Of course if varnish1 is a slave, that shouldn't be happening
  either.
 
  And the whole bit about nothing going to the logs is just bizarre. I'm
  tempted to claim hardware gremlins, especially if you see nothing similar
  on varnish2. Or some other process is pegging the machine. All of which
 is
  a way of saying I have no idea
 
  Yours in bewilderment,
  Erick
 
 
 
  On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:
 
   I apologize for the late reply.
  
   The query load is more or less stable during the spikes. There are
 always
   fluctuations, but nothing on the order of magnitude that could explain
  this
   spike. In fact, the latest spike occured last night when there were
  almost
   noone using it.
  
   To test a hunch of mine, I tried to deactivate all caches by commenting
  out
   all cache entries in solrconfig.xml. It still spikes, so I dont think
 it
   has anything to do with cache warming or hits/misses or anything of the
   sort.
  
   One interesting thing GC though. This is our latest spike with cpu load
   (this server has 8 cores, so a load higher than 8 is potentially
   troublesome):
  
   2012.Nov.27 19:58:182.27
   2012.Nov.27 19:57:174.06
   2012.Nov.27 19:56:188.95
   2012.Nov.27 19:55:1719.97
   2012.Nov.27 19:54:1732.27
   2012.Nov.27 19:53:181.67
   2012.Nov.27 19:52:171.6
   2012.Nov.27 19:51:181.77
   2012.Nov.27 19:50:171.89
  
   This is what the GC was doing around that time:
  
   2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
  4777586K-277641K(4969216K)]
   8887542K-4387693K(9405824K), 0.0856360 secs] [Times: user=0.54
 sys=0.00,
   real=0.09 

Re: SOLR4 cluster - strange CPU spike on slave

2012-11-28 Thread John Nielsen
I apologize for the late reply.

The query load is more or less stable during the spikes. There are always
fluctuations, but nothing of the order of magnitude that could explain this
spike. In fact, the latest spike occurred last night when almost no one was
using it.

To test a hunch of mine, I tried to deactivate all caches by commenting out
all cache entries in solrconfig.xml. It still spikes, so I don't think it
has anything to do with cache warming or hits/misses or anything of the
sort.

One interesting thing about GC though. This is our latest spike with CPU load
(this server has 8 cores, so a load higher than 8 is potentially
troublesome):

2012.Nov.27 19:58:18   2.27
2012.Nov.27 19:57:17   4.06
2012.Nov.27 19:56:18   8.95
2012.Nov.27 19:55:17  19.97
2012.Nov.27 19:54:17  32.27
2012.Nov.27 19:53:18   1.67
2012.Nov.27 19:52:17   1.6
2012.Nov.27 19:51:18   1.77
2012.Nov.27 19:50:17   1.89

This is what the GC was doing around that time:

2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K->277641K(4969216K)]
8887542K->4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
real=0.09 secs]
2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K->325171K(5068096K)]
8859821K->4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
real=0.10 secs]
2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K->306181K(5071168K)]
9021752K->4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
real=0.09 secs]
2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K->376175K(5075328K)]
9003049K->4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
real=0.11 secs]
2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K->271468K(4868160K)]
9083427K->4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
real=0.07 secs]
2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K->336421K(5090944K)]
8980192K->4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
real=0.08 secs]
2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K->340513K(5092864K)]
9062524K->4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
real=0.10 secs]
2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K->480488K(4952000K)]
9082167K->4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
real=0.19 secs]
2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K->560434K(5031936K)]
9156025K->5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
real=0.35 secs]
2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K->0K(5031936K)]
[PSOldGen: 4514851K->2793342K(5047296K)] 5075285K->2793342K(10079232K)
[PSPermGen: 35285K->35285K(44864K)], 5.2310820 secs] [Times: user=5.23
sys=0.00, real=5.23 secs]
2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K->87249K(4647168K)]
7264830K->2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
real=0.04 secs]
2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K->171306K(4839552K)]
6967312K->2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
real=0.05 secs]
2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K->156193K(4802368K)]
7051369K->2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
real=0.06 secs]
2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K->156028K(4826880K)]
7048096K->2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
real=0.06 secs]
2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K->413148K(4860288K)]
7047931K->3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
real=0.12 secs]
2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K->412801K(4866432K)]
7358235K->3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00,
real=0.11 secs]
2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K->434702K(4870592K)]
7357888K->3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00,
real=0.12 secs]
2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K->500171K(4657344K)]
7385165K->3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00,
real=0.16 secs]
2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 4657291K->453204K(4849536K)]
7450633K->3246547K(9896832K), 0.1099200 secs] [Times: user=0.83 sys=0.01,
real=0.11 secs]
2012-11-27T19:58:46.081+0100: [GC [PSYoungGen: 4560404K->533101K(4640320K)]
7353747K->3326443K(9687616K), 0.1252060 secs] [Times: user=0.96 sys=0.00,
real=0.13 secs]
2012-11-27T19:59:01.488+0100: [GC [PSYoungGen: 4640301K->480969K(4822720K)]
7433643K->3274312K(9870016K), 0.1495860 secs] [Times: user=0.99 sys=0.01,
real=0.15 secs]
2012-11-27T19:59:25.155+0100: [GC [PSYoungGen: 4534281K->572295K(4625664K)]
7327624K->3365638K(9672960K), 0.1492270 secs] [Times: user=1.10 sys=0.01,
real=0.15 secs]
2012-11-27T19:59:39.923+0100: [GC [PSYoungGen: 4625607K->486964K(4800640K)]
7418950K->3280306K(9847936K), 0.120 secs] [Times: user=0.94 sys=0.00,
real=0.12 secs]

So the GC was doing a full collection around that time, but that only
accounts for around 5 seconds, not the couple of minutes the spike lasts.
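
For what it's worth, logs in this format come from the standard HotSpot GC
logging flags, roughly along these lines (the log path is just a placeholder):

  java ... -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
       -Xloggc:/path/to/gc.log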

I am really at a complete loss as to what 

Re: SOLR4 cluster - strange CPU spike on slave

2012-11-28 Thread Erick Erickson
Am I reading this right? All you're doing on varnish1 is replicating to it?
You're not searching or indexing? I'm sure I'm misreading this.


"The spike, which only lasts for a couple of minutes, sends the disks
racing." This _sounds_ suspiciously like segment merging, especially the
"disks racing" bit. Or possibly replication. Neither of which makes much
sense. But is there any chance that somehow multiple commits are being
issued? Of course if varnish1 is a slave, that shouldn't be happening
either.

And the whole bit about nothing going to the logs is just bizarre. I'm
tempted to claim hardware gremlins, especially if you see nothing similar
on varnish2. Or some other process is pegging the machine. All of which is
a way of saying I have no idea
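
A quick, rough way to rule out the last two (the log path is a placeholder,
point it at wherever your Solr log actually lives):

  # is it really the Solr JVM that pegs the CPU during a spike?
  top -b -n 1 | head -n 20
  # any commits showing up in the Solr log around the spike?
  grep -i commit /path/to/solr/logfile | tail -n 50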

Yours in bewilderment,
Erick



On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:

 I apologize for the late reply.

 The query load is more or less stable during the spikes. There are always
 fluctuations, but nothing on the order of magnitude that could explain this
 spike. In fact, the latest spike occured last night when there were almost
 noone using it.

 To test a hunch of mine, I tried to deactivate all caches by commenting out
 all cache entries in solrconfig.xml. It still spikes, so I dont think it
 has anything to do with cache warming or hits/misses or anything of the
 sort.

 One interesting thing GC though. This is our latest spike with cpu load
 (this server has 8 cores, so a load higher than 8 is potentially
 troublesome):

 2012.Nov.27 19:58:182.27
 2012.Nov.27 19:57:174.06
 2012.Nov.27 19:56:188.95
 2012.Nov.27 19:55:1719.97
 2012.Nov.27 19:54:1732.27
 2012.Nov.27 19:53:181.67
 2012.Nov.27 19:52:171.6
 2012.Nov.27 19:51:181.77
 2012.Nov.27 19:50:171.89

 This is what the GC was doing around that time:

 2012-11-27T19:50:04.933+0100: [GC [PSYoungGen: 4777586K-277641K(4969216K)]
 8887542K-4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
 real=0.09 secs]
 2012-11-27T19:50:30.785+0100: [GC [PSYoungGen: 4749769K-325171K(5068096K)]
 8859821K-4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
 real=0.10 secs]
 2012-11-27T19:51:12.293+0100: [GC [PSYoungGen: 4911603K-306181K(5071168K)]
 9021752K-4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
 real=0.09 secs]
 2012-11-27T19:51:52.817+0100: [GC [PSYoungGen: 4892613K-376175K(5075328K)]
 9003049K-4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
 real=0.11 secs]
 2012-11-27T19:52:29.454+0100: [GC [PSYoungGen: 4972847K-271468K(4868160K)]
 9083427K-4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
 real=0.07 secs]
 2012-11-27T19:53:08.176+0100: [GC [PSYoungGen: 4868140K-336421K(5090944K)]
 8980192K-4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
 real=0.08 secs]
 2012-11-27T19:54:53.534+0100: [GC [PSYoungGen: 4950373K-340513K(5092864K)]
 9062524K-4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
 real=0.10 secs]
 2012-11-27T19:55:02.906+0100: [GC [PSYoungGen: 4954465K-480488K(4952000K)]
 9082167K-4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
 real=0.19 secs]
 2012-11-27T19:55:09.114+0100: [GC [PSYoungGen: 4951976K-560434K(5031936K)]
 9156025K-5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
 real=0.35 secs]
 2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen: 560434K-0K(5031936K)]
 [PSOldGen: 4514851K-2793342K(5047296K)] 5075285K-2793342K(10079232K)
 [PSPermGen: 35285K-35285K(44864K)], 5.2310820 secs] [Times: user=5.23
 sys=0.00, real=5.23 secs]
 2012-11-27T19:55:21.359+0100: [GC [PSYoungGen: 4471488K-87249K(4647168K)]
 7264830K-2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
 real=0.04 secs]
 2012-11-27T19:55:35.399+0100: [GC [PSYoungGen: 4173969K-171306K(4839552K)]
 6967312K-2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
 real=0.05 secs]
 2012-11-27T19:55:41.397+0100: [GC [PSYoungGen: 4258026K-156193K(4802368K)]
 7051369K-2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
 real=0.06 secs]
 2012-11-27T19:55:55.609+0100: [GC [PSYoungGen: 4254753K-156028K(4826880K)]
 7048096K-2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
 real=0.06 secs]
 2012-11-27T19:56:20.519+0100: [GC [PSYoungGen: 4254588K-413148K(4860288K)]
 7047931K-3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
 real=0.12 secs]
 2012-11-27T19:56:56.649+0100: [GC [PSYoungGen: 4564892K-412801K(4866432K)]
 7358235K-3206144K(9913728K), 0.1089870 secs] [Times: user=0.82 sys=0.00,
 real=0.11 secs]
 2012-11-27T19:57:24.304+0100: [GC [PSYoungGen: 4564545K-434702K(4870592K)]
 7357888K-3228045K(9917888K), 0.1172090 secs] [Times: user=0.86 sys=0.00,
 real=0.12 secs]
 2012-11-27T19:57:41.489+0100: [GC [PSYoungGen: 4591822K-500171K(4657344K)]
 7385165K-3293513K(9704640K), 0.1663600 secs] [Times: user=1.12 sys=0.00,
 real=0.16 secs]
 2012-11-27T19:58:24.062+0100: [GC [PSYoungGen: 

Re: SOLR4 cluster - strange CPU spike on slave

2012-11-28 Thread Otis Gospodnetic
If this is caused by index segment merging you should be able to see that
very clearly on the Index report in SPM, where you would see sudden graph
movement at the time of the spike, corresponding to CPU and disk activity.
I think uncommenting the infoStream entry in solrconfig.xml would also show it.

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Nov 28, 2012 9:20 PM, Erick Erickson erickerick...@gmail.com wrote:

 Am I reading this right? All you're doing on varnish1 is replicating to it?
 You're not searching or indexing? I'm sure I'm misreading this.


 The spike, which only lasts for a couple of minutes, sends the disks
 racing This _sounds_ suspiciously like segment merging, especially the
 disks racing bit. Or possibly replication. Neither of which make much
 sense. But is there any chance that somehow multiple commits are being
 issued? Of course if varnish1 is a slave, that shouldn't be happening
 either.

 And the whole bit about nothing going to the logs is just bizarre. I'm
 tempted to claim hardware gremlins, especially if you see nothing similar
 on varnish2. Or some other process is pegging the machine. All of which is
 a way of saying I have no idea

 Yours in bewilderment,
 Erick



 On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:

  I apologize for the late reply.
 
  The query load is more or less stable during the spikes. There are always
  fluctuations, but nothing on the order of magnitude that could explain
 this
  spike. In fact, the latest spike occured last night when there were
 almost
  noone using it.
 
  To test a hunch of mine, I tried to deactivate all caches by commenting
 out
  all cache entries in solrconfig.xml. It still spikes, so I dont think it
  has anything to do with cache warming or hits/misses or anything of the
  sort.
 
  One interesting thing GC though. This is our latest spike with cpu load
  (this server has 8 cores, so a load higher than 8 is potentially
  troublesome):
 
  2012.Nov.27 19:58:182.27
  2012.Nov.27 19:57:174.06
  2012.Nov.27 19:56:188.95
  2012.Nov.27 19:55:1719.97
  2012.Nov.27 19:54:1732.27
  2012.Nov.27 19:53:181.67
  2012.Nov.27 19:52:171.6
  2012.Nov.27 19:51:181.77
  2012.Nov.27 19:50:171.89
 
  This is what the GC was doing around that time:
 
  2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
 4777586K-277641K(4969216K)]
  8887542K-4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
  real=0.09 secs]
  2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
 4749769K-325171K(5068096K)]
  8859821K-4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
  real=0.10 secs]
  2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
 4911603K-306181K(5071168K)]
  9021752K-4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
  real=0.09 secs]
  2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
 4892613K-376175K(5075328K)]
  9003049K-4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
  real=0.11 secs]
  2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
 4972847K-271468K(4868160K)]
  9083427K-4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
  real=0.07 secs]
  2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
 4868140K-336421K(5090944K)]
  8980192K-4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
  real=0.08 secs]
  2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
 4950373K-340513K(5092864K)]
  9062524K-4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
  real=0.10 secs]
  2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
 4954465K-480488K(4952000K)]
  9082167K-4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
  real=0.19 secs]
  2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
 4951976K-560434K(5031936K)]
  9156025K-5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
  real=0.35 secs]
  2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
 560434K-0K(5031936K)]
  [PSOldGen: 4514851K-2793342K(5047296K)] 5075285K-2793342K(10079232K)
  [PSPermGen: 35285K-35285K(44864K)], 5.2310820 secs] [Times: user=5.23
  sys=0.00, real=5.23 secs]
  2012-11-27T19:55:21.359+0100: [GC [PSYoungGen:
 4471488K-87249K(4647168K)]
  7264830K-2880592K(9694464K), 0.0426640 secs] [Times: user=0.27 sys=0.00,
  real=0.04 secs]
  2012-11-27T19:55:35.399+0100: [GC [PSYoungGen:
 4173969K-171306K(4839552K)]
  6967312K-2964649K(9886848K), 0.0493570 secs] [Times: user=0.34 sys=0.00,
  real=0.05 secs]
  2012-11-27T19:55:41.397+0100: [GC [PSYoungGen:
 4258026K-156193K(4802368K)]
  7051369K-2949536K(9849664K), 0.0643190 secs] [Times: user=0.37 sys=0.00,
  real=0.06 secs]
  2012-11-27T19:55:55.609+0100: [GC [PSYoungGen:
 4254753K-156028K(4826880K)]
  7048096K-2949371K(9874176K), 0.0590410 secs] [Times: user=0.35 sys=0.00,
  real=0.06 secs]
  2012-11-27T19:56:20.519+0100: [GC [PSYoungGen:
 4254588K-413148K(4860288K)]
  7047931K-3206491K(9907584K), 0.1189810 secs] [Times: user=0.91 sys=0.01,
  real=0.12 secs]
  2012-11-27T19:56:56.649+0100: 

Re: SOLR4 cluster - strange CPU spike on slave

2012-11-28 Thread John Nielsen
Yup you read it right.

We originally intended to do all our indexing to varnish02, replicate to
varnish01 and then search from varnish01 (through a fail-over IP which
would switch the reader to varnish02 in case of trouble).

When I saw the spikes, I tried to eliminate possibilities by starting
searching from varnish02, leaving varnish01 with nothing to do but to
receive replication data. This did not remove the spikes. As soon as this
spike is fixed, I will start searching from varnish01 again. These sorts of
debug antics are only possible because, although we do have customers using
this, we are still in our beta phase.

Varnish01 never receives any manual commit orders. Varnish02 does from time
to time.

Oh, and I accidentally misinformed you before (damn second language). We
are actually seeing the spikes on both servers. I was just focusing on
varnish01 because I use it to eliminate possibilities.

It just occurred to me now: we tried switching off our feeder/index tool
for 24 hours, and we didn't see any spikes during this period, so receiving
replication data certainly has something to do with it.

Med venlig hilsen / Best regards

*John Nielsen*
Programmer



*MCB A/S*
Enghaven 15
DK-7500 Holstebro

Kundeservice: +45 9610 2824
p...@mcb.dk
www.mcb.dk



On Thu, Nov 29, 2012 at 3:20 AM, Erick Erickson erickerick...@gmail.comwrote:

 Am I reading this right? All you're doing on varnish1 is replicating to it?
 You're not searching or indexing? I'm sure I'm misreading this.


 The spike, which only lasts for a couple of minutes, sends the disks
 racing This _sounds_ suspiciously like segment merging, especially the
 disks racing bit. Or possibly replication. Neither of which make much
 sense. But is there any chance that somehow multiple commits are being
 issued? Of course if varnish1 is a slave, that shouldn't be happening
 either.

 And the whole bit about nothing going to the logs is just bizarre. I'm
 tempted to claim hardware gremlins, especially if you see nothing similar
 on varnish2. Or some other process is pegging the machine. All of which is
 a way of saying I have no idea

 Yours in bewilderment,
 Erick



 On Wed, Nov 28, 2012 at 6:15 AM, John Nielsen j...@mcb.dk wrote:

  I apologize for the late reply.
 
  The query load is more or less stable during the spikes. There are always
  fluctuations, but nothing on the order of magnitude that could explain
 this
  spike. In fact, the latest spike occured last night when there were
 almost
  noone using it.
 
  To test a hunch of mine, I tried to deactivate all caches by commenting
 out
  all cache entries in solrconfig.xml. It still spikes, so I dont think it
  has anything to do with cache warming or hits/misses or anything of the
  sort.
 
  One interesting thing GC though. This is our latest spike with cpu load
  (this server has 8 cores, so a load higher than 8 is potentially
  troublesome):
 
  2012.Nov.27 19:58:182.27
  2012.Nov.27 19:57:174.06
  2012.Nov.27 19:56:188.95
  2012.Nov.27 19:55:1719.97
  2012.Nov.27 19:54:1732.27
  2012.Nov.27 19:53:181.67
  2012.Nov.27 19:52:171.6
  2012.Nov.27 19:51:181.77
  2012.Nov.27 19:50:171.89
 
  This is what the GC was doing around that time:
 
  2012-11-27T19:50:04.933+0100: [GC [PSYoungGen:
 4777586K-277641K(4969216K)]
  8887542K-4387693K(9405824K), 0.0856360 secs] [Times: user=0.54 sys=0.00,
  real=0.09 secs]
  2012-11-27T19:50:30.785+0100: [GC [PSYoungGen:
 4749769K-325171K(5068096K)]
  8859821K-4435320K(9504704K), 0.0992160 secs] [Times: user=0.63 sys=0.00,
  real=0.10 secs]
  2012-11-27T19:51:12.293+0100: [GC [PSYoungGen:
 4911603K-306181K(5071168K)]
  9021752K-4416617K(9507776K), 0.0957890 secs] [Times: user=0.62 sys=0.00,
  real=0.09 secs]
  2012-11-27T19:51:52.817+0100: [GC [PSYoungGen:
 4892613K-376175K(5075328K)]
  9003049K-4486755K(9511936K), 0.1099830 secs] [Times: user=0.79 sys=0.01,
  real=0.11 secs]
  2012-11-27T19:52:29.454+0100: [GC [PSYoungGen:
 4972847K-271468K(4868160K)]
  9083427K-4383520K(9304768K), 0.0699660 secs] [Times: user=0.48 sys=0.01,
  real=0.07 secs]
  2012-11-27T19:53:08.176+0100: [GC [PSYoungGen:
 4868140K-336421K(5090944K)]
  8980192K-4448572K(9527552K), 0.0824350 secs] [Times: user=0.56 sys=0.01,
  real=0.08 secs]
  2012-11-27T19:54:53.534+0100: [GC [PSYoungGen:
 4950373K-340513K(5092864K)]
  9062524K-4468215K(9529472K), 0.1016770 secs] [Times: user=0.71 sys=0.00,
  real=0.10 secs]
  2012-11-27T19:55:02.906+0100: [GC [PSYoungGen:
 4954465K-480488K(4952000K)]
  9082167K-4684537K(9388608K), 0.1813290 secs] [Times: user=1.23 sys=0.09,
  real=0.19 secs]
  2012-11-27T19:55:09.114+0100: [GC [PSYoungGen:
 4951976K-560434K(5031936K)]
  9156025K-5075285K(9547072K), 0.3511090 secs] [Times: user=2.32 sys=0.12,
  real=0.35 secs]
  2012-11-27T19:55:09.465+0100: [Full GC [PSYoungGen:
 560434K-0K(5031936K)]
  [PSOldGen: 4514851K-2793342K(5047296K)] 5075285K-2793342K(10079232K)
  [PSPermGen: 35285K-35285K(44864K)], 5.2310820 secs] 

SOLR4 cluster - strange CPU spike on slave

2012-11-23 Thread John Nielsen
Hi all,

We are seeing a strange CPU spike on one of our Solr 4 servers which we are
unable to explain. The spike, which only lasts for a couple of minutes,
sends the disks racing. This happens a few times a day. This is what the
load looks like:

2012.Nov.14 13:37:17   2.77
2012.Nov.14 13:36:17   3.65
2012.Nov.14 13:35:18   3.92
2012.Nov.14 13:34:17   3.95
2012.Nov.14 13:33:18   6.56
2012.Nov.14 13:32:17  10.79
2012.Nov.14 13:31:17  24.38
2012.Nov.14 13:30:17  63.35
2012.Nov.14 13:29:17  24.68
2012.Nov.14 13:28:17   2.44
2012.Nov.14 13:27:18   3.51
2012.Nov.14 13:26:17   5.26
2012.Nov.14 13:25:18   5.71
2012.Nov.14 13:24:17   2.7

The problem is that out of a 3-minute spike, I get about 40 seconds of
silence in the logs. This log usually adds about a thousand lines every
second. Not being able to communicate with the server for this long breaks
our use case.

We have two servers, varnish01 and varnish02. We used to feed data to
varnish02 and replicate it to varnish01, from which the data is then read.
When we discovered this issue, we moved all traffic to varnish02, so that
varnish01 still receives replication data but otherwise gets zero traffic.
The spike did not disappear.

The spike we are seeing is on varnish01 only.

Please note that our use case requires us to continuously feed large
amounts of data from our main system, on the order of up to 1,000 records
every minute.

Has anyone seen this phenomenon before?

Best regards

John Nielsen