On Mon, Feb 07, 2011 at 02:06:00PM +0100, Markus Jelsma said:
> Heap usage can spike after a commit. Existing caches are still in use and new
> caches are being generated and/or auto warmed. Can you confirm this is the
> case?

We see spikes after replication which I suspect is, as you say, because of the ensuing commit.

What we seem to have found is that when we weren't using the concurrent GC, stop-the-world GC runs would kill the app. Now that we're using CMS we occasionally find ourselves in situations where the app still has memory "left over" but the load on the machine spikes, the GC duty cycle goes to 100 and the app never recovers.

Restarting usually helps but sometimes we have to take the machine out of the load balancer, wait for a number of minutes and then put it back in.

We're working on two hypotheses.

Firstly - we're CPU bound somehow and at some point we cross some threshold and GC or something else is just unable to keep up. So whilst it looks like instantaneous death of the app it's actually gradual resource exhaustion where the definition of 'gradual' is 'a very short period of time' (as opposed to some cataclysmic infinite loop bug somewhere).

Either that or ...

Secondly - there's some sort of Query Of Death that kills machines. We just haven't found it yet, even when replaying logs.

Or some combination of both. Or other things. It's maddeningly frustrating.

We're also going to try deploying a custom solr.war and try using the MMapDirectory to see if that helps with anything.
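
For anyone wanting to watch for the "GC duty cycle goes to 100" condition from inside the JVM, here is a minimal sketch using the standard java.lang.management beans. It isn't what we run in production; the class name GcDutyCycle and the sampling interval are made up for illustration. The figure it prints (share of wall-clock time the collectors report as spent collecting) is only a rough analogue of the CMS duty cycle, and how concurrent phases are counted may vary by JVM version.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: samples accumulated GC time and reports it as a
// fraction of elapsed wall-clock time.
class GcDutyCycle {

    // Sum of accumulated collection time (ms) across all collectors in this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Fraction of the interval spent in GC, given two readings of totalGcMillis().
    static double dutyCycle(long gcBefore, long gcAfter, long wallMillis) {
        if (wallMillis <= 0) {
            return 0.0;
        }
        return (double) (gcAfter - gcBefore) / wallMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMillis = 1000; // assumed sampling interval
        int samples = 5;            // finite number of samples for the demo
        long prevGc = totalGcMillis();
        long prevWall = System.nanoTime() / 1000000L;
        for (int i = 0; i < samples; i++) {
            Thread.sleep(intervalMillis);
            long nowGc = totalGcMillis();
            long nowWall = System.nanoTime() / 1000000L;
            double duty = dutyCycle(prevGc, nowGc, nowWall - prevWall);
            System.out.printf("GC share of last interval: %.1f%%%n", duty * 100.0);
            prevGc = nowGc;
            prevWall = nowWall;
        }
    }
}
```

Run in a background thread inside the webapp, with the values logged next to the request log, something like this might help distinguish the two hypotheses: a slow climb before the machine falls over would point towards resource exhaustion, a sudden jump right after one request towards a Query Of Death.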