The last patch on that ticket is what we're running in prod. It's working well for us with disk_failure_mode: readwrite. On filesystem errors, the node shuts off thrift and gossip. While the gossip change is propagating, we can continue to serve some reads out of the caches.
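
For reference, a minimal sketch of how that setting would presumably look in the node's config. The key and value are exactly as quoted above; that it lives in cassandra.yaml is an assumption, since the option comes from the CASSANDRA-2118 patch rather than a stock release:

```yaml
# Assumed location: cassandra.yaml on a build carrying the CASSANDRA-2118 patch.
# Key and value as quoted in this message; other accepted values are not listed here.
disk_failure_mode: readwrite
```
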
-ryan

On Tue, Aug 2, 2011 at 9:27 AM, Jim Ancona <j...@anconafamily.com> wrote:
> On Mon, Aug 1, 2011 at 6:12 PM, Ryan King <r...@twitter.com> wrote:
>> On Fri, Jul 29, 2011 at 12:02 PM, Chris Burroughs
>> <chris.burrou...@gmail.com> wrote:
>>> On 07/25/2011 01:53 PM, Ryan King wrote:
>>>> Actually I was wrong: our patch will disable gossip and thrift but
>>>> leave the process running:
>>>>
>>>> https://issues.apache.org/jira/browse/CASSANDRA-2118
>>>>
>>>> If people are interested in that I can make sure it's up to date with
>>>> our latest version.
>>>
>>> Thanks Ryan.
>>>
>>> /me expresses interest.
>
> /me too!
>
>>>
>>> Zombie nodes when the file system does something "interesting" are not fun.
>>
>> In our experience this only gets triggered on hardware failures that
>> would otherwise seriously degrade the performance or cause lots of
>> errors.
>>
>> After the node's traffic coalesces we get an alert which we can then deal
>> with.
>>
>> -ryan