> Thanks! I was keeping the discussion simple. But you make my case stronger
> that we need such monitoring since it looks like it should always be run but
> we want to run it as soon as it is required.

The way to deal with individual requests timing out, or with transient
flapping, is to use a consistency level that is appropriate for your
application, along with an appropriately configured level of read
repair.

If you *require* that reads see writes, use QUORUM for both reads and
writes. If you only softly require it for "99.x% of cases" or similar,
use CL.ONE with read repair turned on. If requirements are very lax,
maybe use CL.ONE with read repair turned off or set very low (only
useful for the performance improvement it will imply relative to full
read repair).
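
To make the QUORUM vs. CL.ONE distinction concrete, here is a small
sketch of the usual replica-overlap arithmetic. It is not Cassandra
code; the function names are made up for illustration. It only checks
whether the replicas that acknowledged a write must overlap with the
replicas consulted by a read.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_writes(replication_factor: int,
                     write_replicas: int,
                     read_replicas: int) -> bool:
    """True if every read must touch at least one replica that
    acknowledged the write, i.e. W + R > RF."""
    return write_replicas + read_replicas > replication_factor


rf = 3  # illustrative replication factor, not taken from this thread

# QUORUM writes + QUORUM reads: the two replica sets always overlap.
print(reads_see_writes(rf, quorum(rf), quorum(rf)))

# CL.ONE writes + CL.ONE reads: no guaranteed overlap, which is why
# read repair matters in that configuration.
print(reads_see_writes(rf, 1, 1))
```

With CL.ONE on both sides, nothing guarantees the overlap, so how soon
a read sees a write depends on read repair (and on the write
eventually reaching the other replicas).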

Running nodetool repair as soon as a single write to some node times
out is not the way to go (ok, I can think of situations where it might
be, but those would be very obscure cases unless I am overlooking
something).

Bottom line: If you want a flag that is set to true whenever some node
may ever have dropped a write, that functionality does not currently
exist. It may be possible to add, but I would be skeptical of it being
committed unless a clear need can be shown. Maybe if you describe your
situation we can better agree on what is appropriate.

For monitoring that repair does happen within desired time periods,
there *is* a clear need. Exposing something like a
time-of-start-of-last-successful-repair would be helpful I think, but
it doesn't currently exist (as far as I know), so the script (or
whatever) doing the repairs has to solve that problem itself.
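
As a rough sketch of what such a script could do itself: wrap the
repair invocation, and record the start time only if the repair
succeeds. Everything here is an assumption for illustration: the
function names, the stamp file, and in particular the idea that the
exit code of "nodetool repair" reliably indicates success on your
version.

```python
import subprocess
import time
from pathlib import Path


def run_repair_and_record(stamp_file, command=("nodetool", "repair"),
                          runner=subprocess.run, clock=time.time):
    """Run the repair command; on success, store the time it STARTED.

    Assumes a zero exit code means the repair succeeded.
    """
    started = clock()
    result = runner(list(command))
    if result.returncode != 0:
        return False  # leave the previous stamp untouched
    Path(stamp_file).write_text(f"{int(started)}\n")
    return True


def seconds_since_last_repair(stamp_file, clock=time.time):
    """Age of the last successful repair's start, or None if none is recorded."""
    path = Path(stamp_file)
    if not path.exists():
        return None
    return clock() - int(path.read_text().strip())
```

A monitoring check could then alert when seconds_since_last_repair()
returns None or a value beyond whatever repair interval you have
decided on.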

-- 
/ Peter Schuller
