On Tue, Dec 7, 2010 at 4:00 PM, Reverend Chip <rev.c...@gmail.com> wrote:
> On 12/7/2010 1:10 PM, Jonathan Ellis wrote:
>> I'm inclined to think there's a bug in your client, then.
>
> That doesn't pass the smell test.  The very same client has logged
> timeout and unavailable exceptions on other occasions, e.g. when there
> are too many clients or (in a previous configuration) when the JVMs had
> insufficient memory.  It's too much of a coincidence to believe that the
> client's exception reporting happens to fail only at the same time that
> a server experiences unexplained and problematic gossip failures.
You're probably right.

>> DEBUG-level logs could confirm or refute this by logging for each
>> insert how many replicas are being blocked for, which nodes it got
>> responses from, and whether a TimedOutException from not getting ALL
>> replies was returned to the client.
>
> Full DEBUG level logs would be a space problem; I'm loading at least 1T
> per node (after 3x replication), and these events are rare.  Can the
> DEBUG logs be limited to the specific modules helpful for this diagnosis
> of the gossip problem and, secondarily, the failure to report
> replication failure?

The gossip problem is almost certainly due to a GC pause.  You can check
that by enabling verbose GC logging (uncomment the lines in
cassandra-env.sh).

The replication failure is what we want DEBUG logs for, and restricting
it to the right modules isn't going to help, since when you're
stress-testing writes, the write modules are going to be 99% of the log
volume anyway.

Maybe a script to constantly throw away all but the most recent log file
until you see the WARN line would be a sufficient workaround?

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
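For reference, the GC-logging block in cassandra-env.sh looks roughly
like this (exact flags and the log path vary by version, so check your
own copy rather than pasting this sketch):

```shell
# GC logging options -- uncomment in your cassandra-env.sh to enable.
# (Sketch only; the stock file's flags and log path vary by version.)
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```

PrintGCApplicationStoppedTime is the useful one here: a pause long
enough to make gossip mark the node dead will show up as a large "Total
time for which application threads were stopped" entry in gc.log that
lines up with the gossip flap.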
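Something like this untested sketch is what I have in mind -- the log
directory, the system.log naming, and the bare WARN pattern are
assumptions, so adjust them to match your log4j configuration:

```shell
#!/bin/sh
# Hypothetical workaround sketch: discard all but the newest rotated log
# until the WARN line we're hunting for shows up in the active log.
# Directory, file names, and the WARN pattern are assumptions.

prune_logs() {
    dir="$1"
    active="$dir/system.log"
    # Once a WARN appears, stop pruning so the surrounding context
    # survives for diagnosis.
    if grep -q 'WARN' "$active" 2>/dev/null; then
        echo "WARN found; keeping logs"
        return 0
    fi
    # Otherwise delete every rotated file except the most recent one.
    ls -1t "$dir"/system.log.* 2>/dev/null | tail -n +2 | xargs rm -f
}

# Run it from cron or a loop, e.g.:
#   while true; do prune_logs /var/log/cassandra; sleep 60; done
```

That keeps disk usage bounded to roughly two log files' worth while
still preserving everything once the interesting event has happened.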