Hi,
There is something odd that I observed when testing a configuration with
two DCs for the first time. I wanted to do a simple functional test to prove
to myself (and my pessimistic colleagues ;) ) that it works.
I have a test cluster of 6 nodes, 3 in each DC, and a keyspace that is
replicated as follows:
CREATE KEYSPACE xxxxxxx WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC2': '3',
'DC1': '3'
};
I have disabled the traffic compression between DCs to get more accurate
numbers.
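For anyone who wants to reproduce this: I believe the relevant knob is
internode_compression in cassandra.yaml (please double-check the value names
against your version; as far as I know 'dc' compresses only cross-DC traffic,
so disabling it entirely requires 'none'):

```yaml
# cassandra.yaml - turn off inter-node traffic compression so that the
# byte counters reflect the actual payload sizes
internode_compression: none
```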
I have set up a bunch of IP accounting rules on each node so they count the
outgoing traffic from that node to each other node. I had rules for
different ports but, of course, when talking about inter-node traffic it is
mostly about port 7000 (or 7001). Anyway, I have a table that shows the
traffic from any node to any other node's port 7000.
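In case it is useful, the accounting rules looked roughly like this (my setup,
not anything Cassandra-specific; a rule with no -j target only counts matching
packets, and the byte counters are read back with `iptables -L -v -x -n`):

```shell
# Per-destination accounting rules for inter-node traffic (run as root).
# Addresses are the six nodes of my test cluster.
for dst in 10.3.45.156 10.3.45.157 10.3.45.158 \
           10.3.45.159 10.3.45.160 10.3.45.161; do
  # No -j target: the rule matches and counts, but does not affect traffic.
  iptables -A OUTPUT -p tcp -d "$dst" --dport 7000
done
```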
I ran a test with DCAwareRoundRobinPolicy and the client talking only to
DC1 nodes. Everything looks fine - the client sent identical amounts of
data to each of the 3 nodes in DC1. The nodes inside DC1 (I was writing
with LOCAL_ONE consistency) sent similar amounts of data to each other,
corresponding exactly to the two extra replicas.
However, when I look at the traffic from the nodes in DC1 to the nodes in
DC2, the picture is different:
Source         Destination    Port       Bytes sent
10.3.45.156    10.3.45.159    dpt:7000   117,273,075
10.3.45.156    10.3.45.160    dpt:7000   228,326,091
10.3.45.156    10.3.45.161    dpt:7000    46,924,339
10.3.45.157    10.3.45.159    dpt:7000   118,978,269
10.3.45.157    10.3.45.160    dpt:7000   230,444,929
10.3.45.157    10.3.45.161    dpt:7000    47,394,179
10.3.45.158    10.3.45.159    dpt:7000   113,969,248
10.3.45.158    10.3.45.160    dpt:7000   225,844,838
10.3.45.158    10.3.45.161    dpt:7000    46,338,939
Nodes 10.3.45.156-158 are in DC1, .159-.161 are in DC2. As you can see,
each of the nodes in DC1 has sent a different amount of traffic to the
remote nodes: roughly 117MB, 228MB and 46MB respectively. Both DCs have a
single rack.
So, here is my question: how does a node select the node in the remote DC
to forward the message to? I did a quick sweep through the code and I could
only find the sorting by proximity (checking the rack and DC). So,
considering that for each request I fire the targets are all 3 nodes in the
remote DC, the sorted list will contain all 3 nodes in DC2. And, if I
understood correctly, the first node from that list is picked to send the
message to.
So, it seems to me that no round-robin-type logic is applied when selecting,
from the list of targets in the remote DC, the node to forward the write to.
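To illustrate what I mean (this is my own toy model of the suspected
behavior, not the actual Cassandra code): if the coordinator always takes
the head of the proximity-sorted list and nothing breaks ties among
equally-close nodes, the same remote node wins every time for a given
sort order:

```java
import java.util.Arrays;
import java.util.List;

// Toy model: remote replicas are sorted by proximity, and the forward
// target is simply the head of that list.
public class PickFirstSketch {
    static String pickForwardTarget(List<String> proximitySortedRemotes) {
        // No tie-breaking among equally-close nodes: the head always wins.
        return proximitySortedRemotes.get(0);
    }

    public static void main(String[] args) {
        List<String> dc2 = Arrays.asList("10.3.45.159", "10.3.45.160", "10.3.45.161");
        // Simulate a few writes: the forward target never rotates.
        for (int i = 0; i < 5; i++) {
            System.out.println(pickForwardTarget(dc2)); // always the same node
        }
    }
}
```

(The real sort order presumably varies per request for other reasons, which
would explain why all three remote nodes still receive some traffic - just
very uneven amounts.)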
If this is true (and the numbers kind of show it is, right?), then perhaps
the list of nodes with equal proximity should be shuffled randomly? Or,
instead of picking the first target, a random one should be picked?
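A sketch of what I have in mind (my assumption of how a fix could look, not
a patch against the actual code): shuffle the equally-close candidates
before taking the head, so forwarded writes spread across the remote DC:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the proposed fix: randomize the order of equally-proximate
// remote replicas before picking the forward target.
public class ShuffledPickSketch {
    static String pickForwardTarget(List<String> remotes, Random rnd) {
        List<String> candidates = new ArrayList<>(remotes);
        Collections.shuffle(candidates, rnd); // break ties randomly
        return candidates.get(0);
    }

    public static void main(String[] args) {
        List<String> dc2 = Arrays.asList("10.3.45.159", "10.3.45.160", "10.3.45.161");
        Random rnd = new Random(42);
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < 3000; i++) {
            hits.merge(pickForwardTarget(dc2, rnd), 1, Integer::sum);
        }
        // Each remote node should now get roughly a third of the picks.
        System.out.println(hits);
    }
}
```

Picking a random element directly would work just as well; shuffling is only
convenient if the sorted list is reused for retries.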
--
Nikolai Grigoriev