> Thanks! I was keeping the discussion simple. But you make my case stronger
> that we need such monitoring since it looks like it should always be run but
> we want to run it as soon as it is required.

The way to deal with individual requests timing out, or with transient flapping, is to use a consistency level that is appropriate for your application, along with an appropriately configured level of read repair. If you *require* that reads see writes, use QUORUM. If you only softly require it, for "99.x% of cases" or similar, use CL.ONE with read repair turned on. If requirements are very lax, maybe use CL.ONE with read repair turned off or set very low (only useful for the performance improvement it implies relative to full read repair).

Running nodetool repair as soon as a single write to some node times out is not the way to go (OK, I can think of situations where it might be, but those would be very obscure cases unless I am overlooking something).

Bottom line: if you want a flag that is set to true whenever some node may have dropped a write, that functionality does not currently exist. It may be possible to add, but I would be skeptical of it being committed unless a clear need can be shown. Maybe if you describe your situation we can better agree on what is appropriate.

For monitoring that repair does happen within desired time periods, there *is* a clear need. Exposing something like a time-of-start-of-last-successful-repair would be helpful, I think, but it doesn't currently exist (as far as I know), so the script (or whatever) doing the repairs has to solve that problem itself.

-- 
/ Peter Schuller
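
As an illustration of the last point above, here is a minimal sketch of how a repair script might track a time-of-start-of-last-successful-repair on its own. This is not existing Cassandra functionality. The function names, the state-file format, and the JSON key are all hypothetical, and the sketch assumes that a zero exit status from `nodetool repair` means the repair succeeded.

```python
import json
import subprocess
import time
from pathlib import Path


def run_repair_and_record(state_file, command=("nodetool", "repair"),
                          runner=subprocess.run, clock=time.time):
    """Run the repair command; on success, persist the time it started."""
    started_at = clock()
    result = runner(list(command))
    # Assumption: a zero exit status means the repair completed successfully.
    if result.returncode != 0:
        return False
    # Hypothetical state format: one JSON object with a single key.
    Path(state_file).write_text(
        json.dumps({"last_successful_repair_start": started_at}))
    return True


def repair_is_overdue(state_file, max_age_seconds, clock=time.time):
    """Return True if no successful repair started within max_age_seconds."""
    path = Path(state_file)
    if not path.exists():
        # No successful repair has ever been recorded by this script.
        return True
    started_at = json.loads(path.read_text())["last_successful_repair_start"]
    return clock() - started_at > max_age_seconds
```

A scheduled job could call `run_repair_and_record` on each node, and a separate monitoring check could alert when `repair_is_overdue` returns true for whatever period you consider acceptable. The start time is recorded rather than the end time because writes that arrive while a repair is running are not guaranteed to be covered by it.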