On Tue, Dec 7, 2010 at 4:00 PM, Reverend Chip <rev.c...@gmail.com> wrote:
> On 12/7/2010 1:10 PM, Jonathan Ellis wrote:
>> I'm inclined to think there's a bug in your client, then.
>
> That doesn't pass the smell test.  The very same client has logged
> timeout and unavailable exceptions on other occasions, e.g. when there
> are too many clients or (in a previous configuration) when the JVMs had
> insufficient memory.  It's too much of a coincidence to believe that the
> client's exception reporting happens to fail only at the same time that
> a server experiences unexplained and problematic gossip failures.
You're probably right.

>> DEBUG-level logs could confirm or refute this by logging for each
>> insert how many replicas are being blocked for, which nodes it got
>> responses from, and whether a TimedOutException from not getting ALL
>> replies was returned to the client.
>
> Full DEBUG level logs would be a space problem; I'm loading at least 1T
> per node (after 3x replication), and these events are rare.  Can the
> DEBUG logs be limited to the specific modules helpful for this diagnosis
> of the gossip problem and, secondarily, the failure to report
> replication failure?

The gossip problem is almost certainly due to a GC pause.  You can check
that by enabling verbose GC logging (uncomment the lines in
cassandra-env.sh).

The replication failure is what we want DEBUG logs for, and restricting
it to the right modules isn't going to help, since when you're
stress-testing writes, the write modules are going to be 99% of the log
volume anyway.

Maybe a script to constantly throw away all but the most recent log file
until you see the WARN line would be a sufficient workaround?

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
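For reference, the GC-logging block in cassandra-env.sh looks roughly
like this (exact flags and the log path vary by version, so check your
own copy rather than pasting this sketch):

```shell
# GC logging options -- uncomment in your cassandra-env.sh to enable.
# (Sketch only; the stock file's flags and log path vary by version.)
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```

PrintGCApplicationStoppedTime is the useful one here: a pause long
enough to make gossip mark the node dead will show up as a large "Total
time for which application threads were stopped" entry in gc.log that
lines up with the gossip flap.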
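Something like this untested sketch is what I have in mind -- the log
directory, the system.log naming, and the bare WARN pattern are
assumptions, so adjust them to match your log4j configuration:

```shell
#!/bin/sh
# Hypothetical workaround sketch: discard all but the newest rotated log
# until the WARN line we're hunting for shows up in the active log.
# Directory, file names, and the WARN pattern are assumptions.

prune_logs() {
    dir="$1"
    active="$dir/system.log"
    # Once a WARN appears, stop pruning so the surrounding context
    # survives for diagnosis.
    if grep -q 'WARN' "$active" 2>/dev/null; then
        echo "WARN found; keeping logs"
        return 0
    fi
    # Otherwise delete every rotated file except the most recent one.
    ls -1t "$dir"/system.log.* 2>/dev/null | tail -n +2 | xargs rm -f
}

# Run it from cron or a loop, e.g.:
#   while true; do prune_logs /var/log/cassandra; sleep 60; done
```

That keeps disk usage bounded to roughly two log files' worth while
still preserving everything once the interesting event has happened.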