Sorry to follow up on my own post, but I just saw this issue: https://issues.apache.org/jira/browse/CASSANDRA-2118 linked in a neighboring thread ("cassandra server disk full"). It certainly implies that a disk IO failure resulting in a "zombie" node is a possibility.
Jim

On Tue, Aug 2, 2011 at 4:19 PM, Jim Ancona <j...@anconafamily.com> wrote:
> Ideally, I would hope that a bad disk wouldn't hang a node but would
> instead just cause writes to fail. But if that is not the case, perhaps
> the bad disk somehow wedged that server node completely, so that
> requests were not being processed at all (maybe not even being timed
> out). At that point you'd be depending on Hector's
> CassandraHostConfigurator.cassandraThriftSocketTimeout to expire, which
> would cause the request to fail over to a working node. But that value
> defaults to zero (i.e. forever), so if you didn't explicitly configure
> it, your client would hang along with the server node.
>
> Perhaps someone with more knowledge of Cassandra's internals could
> comment on the possibility of the server hanging completely. I would
> think that the logs from the bad node might help to diagnose that.
>
> Jim
>
> On Sun, Jul 31, 2011 at 4:58 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> Yup, it sounds like things may not have failed as they should. Do you
>> have a better definition of "stuck"? Was the client waiting for a
>> single request to complete, or was the client not cycling to another
>> node?
>> If there are some server log details, they may help us understand what
>> happened. Also, what setting did you have for commitlog_sync in the
>> yaml? And some info on the failure: did the disk stop dead, run
>> slowly, or fail intermittently?
>> AFAIK the wait on the writes to return should have timed out on the
>> coordinator. I may be behind on the expected behaviour; perhaps a
>> thread pool was shut down as part of handling the error, and this
>> prevents the error from returning.
>> I would check the rpc_timeout in the yaml, and that the client is
>> setting a client-side socket timeout. Timeouts should kick in. Then
>> check the expected behaviour for Hector when it gets a timeout.
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 1 Aug 2011, at 09:40, Lior Golan wrote:
>>
>> Thanks Aaron. We will try to pull the logs and post them in this forum.
>>
>> But what I don't understand is why the client should pause at all. We
>> are writing with CL.ONE, and the replication factor is 2. As far as we
>> understand, the client communicates with a certain node's StorageProxy
>> (any node for that matter), which then sends write requests to both
>> replicas but waits for just the first one of them to acknowledge the
>> write.
>>
>> So even if one node got stuck because of this commit log disk failure,
>> it should not have stuck the client. Can you explain why that happened
>> in the first place?
>>
>> And to add to that: when we took down the Cassandra node with the
>> faulty commit log disk, the client continued to write and didn't seem
>> to be bothered (which is what we expected to happen in the first
>> place, but didn't).
>>
>> From: aaron morton [mailto:aa...@thelastpickle.com]
>> Sent: Monday, August 01, 2011 12:19 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Damaged commit log disk causes Cassandra client to get stuck
>>
>> A couple of timeouts should have kicked in.
>>
>> First, the rpc_timeout on the server side should have kicked in and
>> given the client a (thrift) TimedOutException. Secondly, a client-side
>> socket timeout should be set so the client will time out the socket.
>> Did either of these appear in the client-side logs?
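For reference, explicitly setting that client-side socket timeout with Hector looks roughly like the sketch below. This is untested and based on the Hector 0.7/0.8 API as I remember it; the host names, cluster name and timeout values are placeholders, so check them against your version.

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorTimeoutConfig {
    public static Cluster connect() {
        // Host list and cluster name below are examples only.
        CassandraHostConfigurator config =
                new CassandraHostConfigurator("node1:9160,node2:9160,node3:9160");

        // Fail a hung Thrift socket after 10s instead of the default 0
        // (wait forever), so a wedged node can't hang the client with it.
        config.setCassandraThriftSocketTimeout(10000);

        // Periodically retry hosts that were marked down, so they rejoin
        // the connection pool once they recover.
        config.setRetryDownedHosts(true);
        config.setRetryDownedHostsDelayInSeconds(30);

        return HFactory.getOrCreateCluster("TestCluster", config);
    }
}

With the timeout left at its default of zero the Thrift socket can block indefinitely, which would match the hang described below.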
>>
>> In response to either of those my guess would be that Hector would
>> cycle the connection. (I've not checked this.)
>>
>> How did the disk fail? Was there anything in the server logs?
>>
>> Some background on handling disk failures:
>> https://issues.apache.org/jira/browse/CASSANDRA-809
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 1 Aug 2011, at 08:13, Lior Golan wrote:
>>
>> In one of our test clusters we had a damaged commit log disk in one of
>> the nodes.
>>
>> We have replication factor = 2 in this cluster and write with
>> consistency level = ONE, so we expected writes would not be affected
>> by such an issue. But what actually happened is that the client that
>> was writing with CL.ONE got stuck. The client could resume writing
>> when we stopped the server with the faulty disk (so this is another
>> indication it's not a replication factor or consistency level issue).
>>
>> We are running Cassandra 0.7.6, and the client we're using is Hector.
>>
>> Can anyone explain what happened here? Why did the client get stuck
>> when the commit log disk on one of the servers was damaged (and why
>> could it resume writing once we actually took that server down)?
>>
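For anyone following along, the CL.ONE write setup Lior describes looks roughly like this on the Hector side. Again just a sketch: the keyspace, column family, key and column names are invented, and the calls are from memory of the Hector 0.7/0.8 API, so double-check them against your version.

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class ClOneWriteExample {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster",
                new CassandraHostConfigurator("node1:9160,node2:9160"));

        // Write (and read) at consistency level ONE: as described above, the
        // coordinator's StorageProxy sends the mutation to both replicas
        // (RF=2) but only waits for the first acknowledgement.
        ConfigurableConsistencyLevel clPolicy = new ConfigurableConsistencyLevel();
        clPolicy.setDefaultWriteConsistencyLevel(HConsistencyLevel.ONE);
        clPolicy.setDefaultReadConsistencyLevel(HConsistencyLevel.ONE);

        Keyspace keyspace = HFactory.createKeyspace("TestKeyspace", cluster, clPolicy);

        // A single column insert; with RF=2 and CL.ONE this should succeed
        // as long as one replica acknowledges the write.
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        mutator.insert("row-key-1", "TestColumnFamily",
                HFactory.createStringColumn("column-name", "column-value"));
    }
}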