Nice explanation.

Wanted to add that the importance of being the first node in the ordered
node list, even for CL ONE, is that this is the node the data request is
sent to, and it has to return before the CL is considered satisfied.
E.g. with CL ONE and RR running, the read is sent to all 5 replicas; if
3 digest requests have returned, the coordinator will still be blocking,
waiting for the one data request.
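
As a rough illustration (a toy sketch only, not the real ReadCallback
code; the class and method names are made up), the wait looks
conceptually like this:

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Toy model of a CL ONE read on the coordinator: the latch is only
// released by the *data* response, so digest-only responses leave the
// caller blocked. Illustrative names, not actual Cassandra classes.
public class ClOneReadSketch
{
    private final CountDownLatch dataResponse = new CountDownLatch(1);
    private volatile byte[] row;      // payload from the data read
    private int digestResponses = 0;  // counted, but never unblocks get()

    public synchronized void onDigestResponse(byte[] digest)
    {
        digestResponses++;            // useful for read repair later on
    }

    public void onDataResponse(byte[] data)
    {
        row = data;
        dataResponse.countDown();     // only now is CL ONE satisfied
    }

    public byte[] get(long timeoutMillis) throws Exception
    {
        if (!dataResponse.await(timeoutMillis, TimeUnit.MILLISECONDS))
            throw new Exception("timed out waiting for the data response");
        return row;
    }
}

So no matter how many digests arrive, there is nothing to hand back to
the client until that one data response does.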

Thanks
Aaron


On 7 Apr 2011, at 08:13, Peter Schuller wrote:

> Ok, I took this opportunity to look a bit more at this part of the
> code. My reading of StorageProxy.fetchRows() and related is as
> follows, but please allow for others to say I'm wrong/missing
> something (and sorry, this is more a stream of consciousness that is
> probably more useful to me for learning the code than as an answer to
> your question, but it's probably better to send it than to write it
> and then just discard the e-mail - maybe it helps someone ;)):
> 
> The endpoints obtained are sorted by the snitch's sortByProximity,
> against the local address. If the closest endpoint (as determined by
> that sorting) is the local address, the request is added directly to
> the local READ stage.
> 
> In the case of the SimpleSnitch, sortByProximity is a no-op, so the
> "sorted by proximity" should be the ring order. As per the comments to
> the SimpleSnitch, the intent is to allow "non-read-repaired reads to
> prefer a single endpoint, which improves cache locality".
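> 
> (As a tiny illustration of the no-op case - this is not the real
> SimpleSnitch code, just the shape of it:)
> 
> import java.net.InetAddress;
> import java.util.List;
> 
> // Sketch of a snitch whose proximity sort does nothing: the list is
> // left exactly as it was handed in, i.e. in ring order, so the
> // "closest" endpoint is simply the first replica on the ring for
> // that key. Illustrative only.
> public class SimpleSnitchSketch
> {
>     public void sortByProximity(InetAddress from, List<InetAddress> endpoints)
>     {
>         // intentionally empty: keeping ring order means reads for a
>         // given key keep going to the same endpoint, which is what
>         // keeps its cache hot
>     }
> }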
> 
> So my understanding is that in the case of the SimpleSnitch, ignoring
> any effect of the dynamic snitch, you will *not* always read from the
> local node, because the "closest" node (since ring order is used) is
> just whatever node is "first" on the ring within the replica set.
> 
> In the case of the network topology snitches (NTS for short below),
> the implementation inherited from AbstractNetworkTopologySnitch sorts
> by AbstractNetworkTopologySnitch.compareEndpoints(), which:
> 
> (1) Always prefers itself to any other node. So "myself" is always
> "closest", no matter what.
> (2) Else, always prefers a node in the same rack to a node in a
> different rack.
> (3) Else, always prefers a node in the same dc to a node in a
> different dc.
> 
> So in the NTS case, I believe, *disregarding the dynamic snitch*, you
> would in fact always read from the co-ordinator node if that node
> happens to be part of the replica set for the row.
> 
> (There is no tie-breaking if none of 1, 2 or 3 above gives precedence,
> and the sorting is done with Collections.sort(), which guarantees that
> the sort is stable. So for nodes where rack/dc awareness does not come
> into play, it should result in the ring order, as with the
> SimpleSnitch.)
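> 
> (To make that ordering concrete, here is a stripped-down comparator in
> the same spirit; it is not the real AbstractNetworkTopologySnitch
> code, and Node with its rack/dc fields is just a stand-in for the
> snitch lookups:)
> 
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
> 
> // Illustration of the "self, then same rack, then same dc" preference
> // order described above. Not the actual Cassandra comparator.
> public class TopologyOrderSketch
> {
>     static class Node
>     {
>         final String address, rack, dc;
>         Node(String address, String rack, String dc)
>         {
>             this.address = address; this.rack = rack; this.dc = dc;
>         }
>     }
> 
>     static void sortByProximity(final Node self, List<Node> endpoints)
>     {
>         // Collections.sort is stable, so endpoints that tie on all
>         // three rules keep their original (ring) order, as with the
>         // SimpleSnitch.
>         Collections.sort(endpoints, new Comparator<Node>()
>         {
>             public int compare(Node a, Node b)
>             {
>                 boolean aIsSelf = a.address.equals(self.address);
>                 boolean bIsSelf = b.address.equals(self.address);
>                 if (aIsSelf != bIsSelf)
>                     return aIsSelf ? -1 : 1;       // (1) prefer myself
>                 int byRack = prefer(a.rack.equals(self.rack), b.rack.equals(self.rack));
>                 if (byRack != 0)
>                     return byRack;                 // (2) prefer same rack
>                 return prefer(a.dc.equals(self.dc), b.dc.equals(self.dc)); // (3) same dc
>             }
>         });
>     }
> 
>     // -1 if only a matches, +1 if only b matches, 0 otherwise (no tie-break)
>     static int prefer(boolean aMatches, boolean bMatches)
>     {
>         if (aMatches == bMatches)
>             return 0;
>         return aMatches ? -1 : 1;
>     }
> }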
> 
> Now, so far this only determines the order of endpoints after
> proximity sorting. fetchRows() will route to "itself" directly,
> without messaging, if the closest node is itself. This determines from
> which node we read the *data* (not the digest).
> 
> Moving back to endpoint selection: after sorting by proximity, the
> endpoint list is actually filtered by getReadCallback. This is what
> determines how many endpoints will receive a request. If read repair
> doesn't happen, it'll be whatever is implied by the consistency level
> (so only one for CL.ONE). If read repair does happen, all endpoints
> are included and so none is filtered out.
> 
> Moving back out into fetchRows(), we're now past the sending or local
> scheduling of the data read. It then loops over the remainder (1
> through last of handler.endpoints) and submits digest read messages to
> each of those endpoints (either local or remote).
> 
> At this point we have determined (1) which node to send the data
> request to, and (2) which nodes, if any, to send digest reads to
> (regardless of whether that is due to read repair or consistency level
> requirements).
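> 
> (Again as a toy sketch of that flow, with made-up names rather than
> the real getReadCallback()/fetchRows() signatures:)
> 
> import java.net.InetAddress;
> import java.util.List;
> 
> // Sketch of the flow described above: decide how many endpoints take
> // part, send the data read to the closest one and digest reads to the
> // rest. Names and signatures are invented for illustration.
> public class ReadDispatchSketch
> {
>     static List<InetAddress> filterEndpoints(List<InetAddress> sortedByProximity,
>                                              int blockFor, boolean readRepair)
>     {
>         // With read repair, every replica takes part; otherwise only
>         // as many as the consistency level needs (one for CL.ONE).
>         int take = readRepair ? sortedByProximity.size() : blockFor;
>         return sortedByProximity.subList(0, Math.min(take, sortedByProximity.size()));
>     }
> 
>     static void dispatch(List<InetAddress> targets, InetAddress localAddress)
>     {
>         InetAddress closest = targets.get(0);
>         if (closest.equals(localAddress))
>             submitLocalDataRead();        // data read on the local READ stage
>         else
>             sendDataRead(closest);        // data read sent to the closest replica
>         for (InetAddress endpoint : targets.subList(1, targets.size()))
>             sendDigestRead(endpoint);     // everyone else only returns a digest
>     }
> 
>     static void submitLocalDataRead() { /* enqueue locally */ }
>     static void sendDataRead(InetAddress to) { /* full row requested */ }
>     static void sendDigestRead(InetAddress to) { /* only a hash of the row requested */ }
> }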
> 
> Now fetchRows() proceeds to iterate over all the ReadCallbacks,
> calling get() on each. This is where digest mismatch exceptions are
> raised if relevant. CL.ONE seems special-cased in the sense that if
> the number of responses to block/wait for is exactly 1, the data is
> returned without resolving to check for digest mismatches (once
> responses come back later on, read repair is triggered by
> ReadCallback.maybeResolveForRepair).
> 
> In the case of CL > ONE, a digest mismatch can be raised immediately,
> in which case fetchRows() triggers read repair.
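> 
> (As a sketch of that difference - again with invented names rather
> than the actual resolver code:)
> 
> import java.security.MessageDigest;
> import java.util.Arrays;
> import java.util.List;
> 
> // Sketch of the two paths described above: with only one response
> // required, the data is handed back as-is; with more, digests are
> // compared and a mismatch is raised so the caller can repair.
> public class ResolveSketch
> {
>     static class DigestMismatchException extends Exception {}
> 
>     static byte[] resolve(int blockFor, byte[] data, List<byte[]> digests)
>             throws Exception
>     {
>         if (blockFor == 1)
>             return data;     // CL.ONE path: no digest comparison here;
>                              // late responses are reconciled in the
>                              // background instead
>         byte[] expected = MessageDigest.getInstance("MD5").digest(data);
>         for (byte[] digest : digests)
>             if (!Arrays.equals(expected, digest))
>                 throw new DigestMismatchException(); // caller re-reads
>                                                      // full data + repairs
>         return data;
>     }
> }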
> 
> Now:
> 
>> However case (C) as I have described it does not allow for any notion of
>> 'pinning' as mentioned for dynamic_snitch_badness_threshold:
>> 
>> # if set greater than zero and read_repair_chance is < 1.0, this will allow
>> # 'pinning' of replicas to hosts in order to increase cache capacity.
>> # The badness threshold will control how much worse the pinned host has to be
>> # before the dynamic snitch will prefer other replicas over it.  This is
>> # expressed as a double which represents a percentage.  Thus, a value of
>> # 0.2 means Cassandra would continue to prefer the static snitch values
>> # until the pinned host was 20% worse than the fastest.
> 
> If you look at DynamicEndpointSnitch.sortByProximity(), it branches
> into two main cases: if BADNESS_THRESHOLD is exactly 0 (it's not a
> constant despite the caps; it's taken from the conf), it uses
> sortByProximityWithScore(). Otherwise it uses
> sortByProximityWithBadness().
> 
> ...withBadness() first asks the subsnitch (meaning normally either the
> SimpleSnitch or NTS) to sort by proximity. Then it iterates through
> the endpoints, and if any node is sufficiently good in comparison to
> the closest-as-determined-by-subsnitch endpoint, it falls back to
> sortByProximityWithScore(). "Sufficiently good" is where the actual
> value of BADNESS_THRESHOLD comes in (if the "would-be" closest node is
> sufficiently bad, that implies some other node is sufficiently good in
> comparison to it).
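> 
> (A toy version of that branch, with the scores map and the subsnitch
> call as stand-ins - this is not the actual DynamicEndpointSnitch
> code, and it assumes positive latency-style scores where lower is
> better:)
> 
> import java.net.InetAddress;
> import java.util.Collections;
> import java.util.Comparator;
> import java.util.List;
> import java.util.Map;
> 
> // Sketch of the badness-threshold branch described above. The real
> // scoring details are elided; lower score means "better" here.
> public class DynamicSortSketch
> {
>     static void sortByProximity(double badnessThreshold, InetAddress from,
>                                 List<InetAddress> endpoints,
>                                 Map<InetAddress, Double> scores)
>     {
>         if (badnessThreshold == 0)
>         {
>             sortByScore(endpoints, scores);
>             return;
>         }
>         // badness > 0: trust the subsnitch ("pinned") order...
>         subsnitchSortByProximity(from, endpoints);
>         double pinned = scores.get(endpoints.get(0));
>         for (InetAddress endpoint : endpoints)
>         {
>             // ...unless some other replica is better by more than the
>             // threshold, i.e. the pinned host is that much worse.
>             if ((pinned - scores.get(endpoint)) / pinned > badnessThreshold)
>             {
>                 sortByScore(endpoints, scores);
>                 return;
>             }
>         }
>         // otherwise keep the static (pinned) order
>     }
> 
>     static void subsnitchSortByProximity(InetAddress from, List<InetAddress> endpoints)
>     {
>         /* SimpleSnitch: no-op; NTS: self, then rack, then dc */
>     }
> 
>     static void sortByScore(List<InetAddress> endpoints, final Map<InetAddress, Double> scores)
>     {
>         Collections.sort(endpoints, new Comparator<InetAddress>()
>         {
>             public int compare(InetAddress a, InetAddress b)
>             {
>                 return Double.compare(scores.get(a), scores.get(b));
>             }
>         });
>     }
> }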
> 
> So, my reading of it is that the comment is correct. By setting it to
> a value greater than zero, you're making the dynamic snitch
> essentially a no-op for the purpose of proximity sorting *until* the
> badness threshold is reached, at which point it uses its scoring
> algorithm. Because the behavior of NTS and the SimpleSnitch is to
> always prefer the same node (for a given row key), that means pinning.
> 
> As far as I can tell, read-repair should not affect things either way
> since it doesn't have anything to do with which node gets asked for
> the data (as opposed to the digest).
> 
> One interesting aspect though: say you specifically *don't* want
> pinning, and rather *want* round-robin type behavior to keep caches
> hot. If you're at CL.ONE with read repair turned off or very low, this
> doesn't seem possible, except as may happen incidentally through
> dynamic snitch balancing, depending on the performance characteristics
> of the nodes.
> 
> -- 
> / Peter Schuller
