Ariel Weisberg created CASSANDRA-8732:
-----------------------------------------

             Summary: Make inter-node timeouts tolerate time skew
                 Key: CASSANDRA-8732
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8732
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Ariel Weisberg


Right now internode timeouts rely on currentTimeMillis() (and NTP) to make sure 
that tasks don't expire before they arrive.

Every receiver needs to deduce the offset between its nanoTime and the remote 
nanoTime. I don't think currentTimeMillis is a good choice because it is 
designed to be manipulated by operators and NTP. I would probably be 
comfortable assuming that nanoTime isn't going to move in significant ways 
without something that could be classified as operator error happening.

I suspect the one timing method you can rely on being accurate is nanoTime 
within a node (on average) and that a node can report on its own scheduling 
jitter (on average).

Finding the offset requires knowing what the network latency is in one 
direction.

One way to do this would be to periodically send a ping request which generates 
a series of ping responses at fixed intervals (maybe by UDP?). The responses 
should be corrected for scheduling jitter since the fixed intervals may not be 
exactly achieved by the sender. By measuring the time deviation between ping 
responses and their expected arrival time (based on the interval) and 
correcting for the remotely reported scheduling jitter, you should be able to 
measure latency in one direction.
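
A minimal sketch of the deviation measurement described above, as one possible
reading of it rather than a worked-out design. The class and parameter names
are hypothetical, and it assumes each ping response carries the sender's own
report of how late it left relative to its schedule. Note that this only
yields deviations relative to the first response in the series; turning those
into an absolute one-way latency is not something the sketch attempts.

```java
// Hypothetical helper: names and the exact correction are assumptions,
// not existing Cassandra code.
final class PingDeviation {
    /**
     * @param arrivalNanos        local System.nanoTime() at which each response arrived
     * @param reportedJitterNanos how late the sender says each response left,
     *                            relative to its own fixed schedule
     * @param intervalNanos       the fixed interval the sender was asked to use
     * @return deviation of each response from its expected arrival time,
     *         measured relative to the first response
     */
    static long[] deviations(long[] arrivalNanos, long[] reportedJitterNanos, long intervalNanos) {
        if (arrivalNanos.length != reportedJitterNanos.length) {
            throw new IllegalArgumentException("one jitter report is required per response");
        }
        long[] out = new long[arrivalNanos.length];
        for (int i = 0; i < arrivalNanos.length; i++) {
            // Arrival time if both the send schedule and the latency were perfectly regular.
            long expected = arrivalNanos[0] + i * intervalNanos;
            // Portion of the lateness the sender attributes to its own scheduling.
            long senderLateness = reportedJitterNanos[i] - reportedJitterNanos[0];
            out[i] = (arrivalNanos[i] - expected) - senderLateness;
        }
        return out;
    }
}
```

What remains after the correction is the part of the lateness that the sender
does not account for, which is attributed to the network and to receiver-side
scheduling.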

A weighted moving average (correcting only for drift, not readjustment) of these 
measurements would eventually converge on a close answer and would not be 
impacted by outlier measurements. It may also make sense to drop the largest N 
samples to improve accuracy.
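
A minimal sketch of such an average, assuming samples arrive in batches (for
example, one batch per ping series) and that an exponentially weighted moving
average is an acceptable form of "weighted". The class name, the alpha
parameter and the batch shape are all assumptions made for illustration.

```java
// Hypothetical estimator: an exponentially weighted moving average over
// batches of latency samples, dropping the largest N samples of each batch.
final class LatencyEstimator {
    private final double alpha;      // weight given to each new batch, in (0, 1]
    private double estimateNanos;    // current smoothed estimate
    private boolean seeded;          // false until the first batch has been seen

    LatencyEstimator(double alpha) {
        if (!(alpha > 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Folds one batch into the estimate after discarding its dropLargest biggest samples. */
    double update(long[] samplesNanos, int dropLargest) {
        long[] sorted = samplesNanos.clone();
        java.util.Arrays.sort(sorted);
        int keep = Math.max(0, sorted.length - Math.max(0, dropLargest));
        if (keep == 0) {
            return estimateNanos;    // nothing usable in this batch
        }
        double sum = 0;
        for (int i = 0; i < keep; i++) {
            sum += sorted[i];
        }
        double batchMean = sum / keep;
        if (!seeded) {
            estimateNanos = batchMean;   // the first batch seeds the estimate
            seeded = true;
        } else {
            estimateNanos += alpha * (batchMean - estimateNanos);
        }
        return estimateNanos;
    }
}
```

A small alpha makes the estimate follow slow drift while damping individual
bad batches; trimming the largest samples first keeps a single delayed ping
from entering the average at all.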

Once you know network latency you can add that to the timestamp of each ping and 
compare to the local clock and know what the offset is.
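
The arithmetic can be sketched as follows, assuming each ping carries the
sender's nanoTime and that a one-way latency estimate is already available.
The class and method names are hypothetical, including the idea that a timeout
travels as a deadline in the sender's nanoTime.

```java
// Hypothetical helper: converts between a remote node's nanoTime and the local one.
final class ClockOffset {
    /** Returns the offset such that remoteNanoTime + offset is approximately localNanoTime. */
    static long offsetNanos(long remoteSendNanos, long oneWayLatencyNanos, long localReceiveNanos) {
        return localReceiveNanos - (remoteSendNanos + oneWayLatencyNanos);
    }

    /** Translates a timestamp taken on the remote node into the local nanoTime domain. */
    static long toLocalNanos(long remoteNanos, long offsetNanos) {
        return remoteNanos + offsetNanos;
    }

    /** True once the local clock has reached a deadline expressed in remote nanoTime. */
    static boolean isExpired(long remoteDeadlineNanos, long offsetNanos, long localNowNanos) {
        // nanoTime values are compared by subtraction so the check survives numeric overflow.
        return localNowNanos - toLocalNanos(remoteDeadlineNanos, offsetNanos) >= 0;
    }
}
```

If a ping is delayed by more than the estimated latency, localReceiveNanos is
inflated and the computed offset comes out too large, so deadlines translate
to later local times.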

These measurements won't calculate the offset to be too small (timeouts fire 
early), but could calculate the offset to be too large (timeouts fire late). 
The conditions where the offset won't be accurate are the conditions where 
you also want timeouts firing reliably. This and bootstrapping in bad conditions 
are what I am most uncertain of.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
