On 05/21/2010 11:32 AM, Stephen Green wrote:
Right.  The system can be very memory-intensive, but at the time these
are occurring, it's not under a really heavy load, and there's plenty
of heap available. However, while looking at a thread dump from one of
the nodes, I realized that a very poor decision meant that I had more
than 1200 threads running.  I expect this is more of a problem than
the GC at this point.  I'm taking steps to correct this problem now.

Lately, I've had fewer and fewer problems with GC.  In a former life,
I sat down the hall from the folks who wrote Hotspot's GC and they're
pretty sharp folks :-)

GC as a cause is very common; however, had you mentioned the 1200 threads, I would have guessed that to be a potential issue. ;-)

Right.  I'd like to have as small a timeout as possible so that I
notice quickly when things disappear.  What's a reasonable minimum?  I
notice recommendations in other messages on the list that 20000 is a
good value.


The setting you should use is typically determined by your SLA requirements. How soon do you want ephemeral nodes to be cleaned up if a client fails? Say you were doing leader election: the timeout would gate re-election in the case where the current leader failed. Set it lower and you are more responsive (faster), but also more susceptible to "false positives" such as a temporary network glitch. Set it higher and you ride over the network glitches, but it takes longer to recover when a client really does go down.

In some cases (HBase, Solr) we've seen that the timeout had to be set artificially high due to the limitations of the current JVM GC algorithms. For example, some HBase users were seeing GC pause times of more than 4 minutes. So this raises the question: do you consider this a failure or not? (I could reboot the machine faster than it takes to run that GC...)
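
To make the trade-off concrete, here is a rough sketch of how a requested timeout relates to what the server will actually grant, and whether a given GC pause would outlast it. It assumes the server bounds the session timeout between minSessionTimeout and maxSessionTimeout, defaulting to 2x and 20x tickTime (check this against your version and config). The function names are made up for illustration and are not part of any ZooKeeper API, and the expiry check is a simplification that ignores heartbeat timing.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_session_ms=None, max_session_ms=None):
    """Clamp a requested session timeout into the server's allowed range.

    Assumption: the bounds default to 2 * tickTime and 20 * tickTime
    when minSessionTimeout / maxSessionTimeout are not configured.
    """
    if min_session_ms is None:
        min_session_ms = 2 * tick_time_ms
    if max_session_ms is None:
        max_session_ms = 20 * tick_time_ms
    return max(min_session_ms, min(requested_ms, max_session_ms))


def pause_outlasts_session(pause_ms, session_timeout_ms):
    """Simplified model: a client that stays silent for longer than the
    session timeout is treated as failed and its session expires."""
    return pause_ms > session_timeout_ms


if __name__ == "__main__":
    four_minute_pause_ms = 4 * 60 * 1000
    for requested in (1000, 20000, 240000):
        granted = negotiated_timeout_ms(requested)
        expired = pause_outlasts_session(four_minute_pause_ms, granted)
        print(requested, granted, expired)
```

Under these assumed defaults, asking for less than 4000 ms gets you 4000 ms, 20000 is granted as requested, and riding out a 4 minute pause would require raising the upper bound on the server as well as on the client.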

Good luck,

Patrick
