Andrew Wong created KUDU-3310:
---------------------------------

             Summary: Checksum scan results for lagging replicas can be confusing
                 Key: KUDU-3310
                 URL: https://issues.apache.org/jira/browse/KUDU-3310
             Project: Kudu
          Issue Type: Improvement
          Components: ops-tooling
            Reporter: Andrew Wong


When running a checksum scan, we've seen cases where the following is reported:
{code}
Error: Remote error: Service unavailable: Timed out: could not wait for desired
snapshot timestamp to be consistent: Timed out waiting for ts: P:
1621906798986764 usec, L: 0 to be safe (mode: NON-LEADER). Current safe time: P:
1621906798962044 usec, L: 0 Physical time difference: 0.025s
{code}
and this results in messages like:
{code}
Aborted: checksum scan error: 1 errors were detected
{code}

To someone without much context about Kudu, this makes it seem like there is
some corruption between replicas, even though the issue is just that the replica
is lagging a bit (about 25 ms in the example above) and the scan timed out
waiting for its safe time to catch up. We should consider either:
- allowing the wait time to be configured when running the tool, or
- rewording the result so that it's clear the scan failed and no checksums were
verified for the tablet
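
As a rough illustration of the second option, here is a minimal Python sketch of
a summary that keeps "scan failed" separate from "checksums differ". It is not
Kudu's implementation: ReplicaResult, lag_usec and summarize are hypothetical
names, and the regular expression only assumes the message format quoted above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumption: the error text follows the format quoted in this issue.
_SAFE_TIME_RE = re.compile(
    r"waiting for ts: P: (\d+) usec.*?Current safe time: P: (\d+) usec",
    re.DOTALL,
)


@dataclass
class ReplicaResult:
    """Hypothetical per-replica outcome of a checksum scan."""
    replica: str
    checksum: Optional[int] = None  # None means the scan did not complete
    error: Optional[str] = None


def lag_usec(error: str) -> Optional[int]:
    """Return how far safe time trails the requested timestamp, in usec."""
    m = _SAFE_TIME_RE.search(error)
    return int(m.group(1)) - int(m.group(2)) if m else None


def summarize(tablet_id: str, results: List[ReplicaResult]) -> str:
    """Report failed scans as 'not verified' rather than as a mismatch."""
    failed = [r for r in results if r.checksum is None]
    if failed:
        return (f"tablet {tablet_id}: NOT VERIFIED, {len(failed)} of "
                f"{len(results)} replica scans failed; no checksums compared")
    if len({r.checksum for r in results}) > 1:
        return f"tablet {tablet_id}: MISMATCH, replicas returned different checksums"
    return f"tablet {tablet_id}: OK"


ERR = ("Timed out waiting for ts: P: 1621906798986764 usec, L: 0 to be safe "
       "(mode: NON-LEADER). Current safe time: P: 1621906798962044 usec, L: 0")

print(lag_usec(ERR))
print(summarize("t1", [ReplicaResult("a", checksum=42),
                       ReplicaResult("b", error=ERR)]))
```

With a summary along these lines, a lagging replica would show up as an
unverified tablet instead of looking like a checksum error.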



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
