The difference is likely due to the DynamicEndpointSnitch (aka the dynamic snitch), which picks the replicas to send messages to based on recently observed latency and on self-reported load (which accounts for compactions, repairs, etc.). If you want to confirm this, you can disable the dynamic snitch by adding this line to cassandra.yaml: "dynamic_snitch: false".
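If you go that route, a minimal cassandra.yaml sketch is below; the commented-out knobs are the stock 2.x defaults for tuning the snitch instead of disabling it, so check the yaml shipped with your version:

    # Disable the dynamic snitch entirely; replica ordering then falls
    # back to the underlying snitch's static proximity sort.
    dynamic_snitch: false

    # Or leave it enabled and tune it instead (2.x defaults shown):
    # dynamic_snitch_update_interval_in_ms: 100
    # dynamic_snitch_reset_interval_in_ms: 600000
    # dynamic_snitch_badness_threshold: 0.1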
On Thu, Nov 20, 2014 at 9:52 AM, Nikolai Grigoriev <[email protected]> wrote:

> Hi,
>
> There is something odd I have observed when testing a configuration with
> two DCs for the first time. I wanted to do a simple functional test to
> prove to myself (and my pessimistic colleagues ;) ) that it works.
>
> I have a test cluster of 6 nodes, 3 in each DC, and a keyspace that is
> replicated as follows:
>
> CREATE KEYSPACE xxxxxxx WITH replication = {
>     'class': 'NetworkTopologyStrategy',
>     'DC2': '3',
>     'DC1': '3'
> };
>
> I have disabled the traffic compression between DCs to get more accurate
> numbers.
>
> I have set up a bunch of IP accounting rules on each node so they count
> the outgoing traffic from that node to each other node. I had rules for
> different ports but, of course, it is mostly about port 7000 (or 7001)
> when talking about inter-node traffic. Anyway, I have a table that shows
> the traffic from any node to any other node's port 7000.
>
> I ran a test with DCAwareRoundRobinPolicy and the client talking only to
> DC1 nodes. Everything looks fine - the client has sent an identical
> amount of data to each of the 3 nodes in DC1. The nodes inside DC1 (I was
> writing with LOCAL_ONE consistency) have sent similar amounts of data to
> each other, which represents exactly the two extra replicas.
>
> However, when I look at the traffic from the nodes in DC1 to the nodes in
> DC2, the picture is different:
>
> source        destination   port       bytes
> 10.3.45.156   10.3.45.159   dpt:7000   117,273,075
> 10.3.45.156   10.3.45.160   dpt:7000   228,326,091
> 10.3.45.156   10.3.45.161   dpt:7000    46,924,339
> 10.3.45.157   10.3.45.159   dpt:7000   118,978,269
> 10.3.45.157   10.3.45.160   dpt:7000   230,444,929
> 10.3.45.157   10.3.45.161   dpt:7000    47,394,179
> 10.3.45.158   10.3.45.159   dpt:7000   113,969,248
> 10.3.45.158   10.3.45.160   dpt:7000   225,844,838
> 10.3.45.158   10.3.45.161   dpt:7000    46,338,939
>
> Nodes 10.3.45.156-.158 are in DC1, .159-.161 in DC2. As you can see, each
> node in DC1 has sent a different amount of traffic to the remote nodes:
> ~117 MB, ~228 MB and ~46 MB respectively. Both DCs have one rack.
>
> So, here is my question: how does a node select the node in the remote DC
> to send the message to? I did a quick sweep through the code and I could
> only find the sorting by proximity (checking the rack and DC). So,
> considering that for each request I fire the targets are all 3 nodes in
> the remote DC, the list will contain all 3 nodes in DC2. And, if I
> understood correctly, the first node from the list is picked to send the
> message to.
>
> So, it seems to me that no round-robin-type logic is applied when
> selecting, from the list of targets in the remote DC, the node to forward
> the write to.
>
> If this is true (and the numbers kind of show it is, right?), then
> probably the list of nodes with equal proximity should be shuffled
> randomly? Or, instead of picking the first target, a random one should be
> picked?
>
> --
> Nikolai Grigoriev

--
Tyler Hobbs
DataStax <http://datastax.com/>
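For anyone reproducing Nikolai's test with the 2.x DataStax Java driver, the client setup he describes (DCAwareRoundRobinPolicy pinned to DC1, writes at LOCAL_ONE) looks roughly like the sketch below; the contact point, table name, and the statement itself are placeholders:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

    public class Dc1WriteTest {
        public static void main(String[] args) {
            // Pin the driver to DC1: it only routes requests to the
            // 10.3.45.156-158 coordinators, so all traffic to DC2 is
            // inter-node replication over port 7000 (or 7001 with SSL).
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.3.45.156")  // any DC1 node
                    .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1"))
                    .build();
            Session session = cluster.connect();

            // LOCAL_ONE: the write acks once one DC1 replica responds;
            // replication to DC2 proceeds asynchronously via the coordinator.
            SimpleStatement stmt = new SimpleStatement(
                    "INSERT INTO xxxxxxx.t (id, val) VALUES (1, 'x')");
            stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
            session.execute(stmt);

            cluster.close();
        }
    }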
