Hi,
There is something odd that I observed when testing a configuration with
two DCs for the first time. I wanted to do a simple functional test to prove
to myself (and my pessimistic colleagues ;) ) that it works.
I have a test cluster of 6 nodes, 3 in each DC, and a keyspace that is
replicated as follows:
CREATE KEYSPACE xxxxxxx WITH replication = {
'class': 'NetworkTopologyStrategy',
'DC2': '3',
'DC1': '3'
};
I have disabled the traffic compression between DCs to get more accurate
numbers.
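For anyone who wants to reproduce this: I believe the relevant knob is
internode_compression in cassandra.yaml (please double-check the value names
against your version; as far as I know 'dc' compresses only cross-DC traffic,
so disabling it entirely requires 'none'):

```yaml
# cassandra.yaml - turn off inter-node traffic compression so that the
# byte counters reflect the actual payload sizes
internode_compression: none
```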
I have set up a bunch of IP accounting rules on each node so they count the
outgoing traffic from that node to each other node. I had rules for
different ports but, of course, when talking about inter-node traffic it is
mostly about port 7000 (or 7001). Anyway, I have a table that shows the
traffic from any node to any other node's port 7000.
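In case it is useful, the accounting rules looked roughly like this (my setup,
not anything Cassandra-specific; a rule with no -j target only counts matching
packets, and the byte counters are read back with `iptables -L -v -x -n`):

```shell
# Per-destination accounting rules for inter-node traffic (run as root).
# Addresses are the six nodes of my test cluster.
for dst in 10.3.45.156 10.3.45.157 10.3.45.158 \
           10.3.45.159 10.3.45.160 10.3.45.161; do
  # No -j target: the rule matches and counts, but does not affect traffic.
  iptables -A OUTPUT -p tcp -d "$dst" --dport 7000
done
```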
I ran a test with DCAwareRoundRobinPolicy and the client talking only to
DC1 nodes. Everything looks fine - the client sent identical amounts of
data to each of the 3 nodes in DC1. The nodes inside DC1 (I was writing
with LOCAL_ONE consistency) sent similar amounts of data to each other,
corresponding exactly to the two extra replicas.
However, when I look at the traffic from the nodes in DC1 to the nodes in
DC2, the picture is different:
Source         Destination    Port       Bytes sent
10.3.45.156    10.3.45.159    dpt:7000   117,273,075
10.3.45.156    10.3.45.160    dpt:7000   228,326,091
10.3.45.156    10.3.45.161    dpt:7000    46,924,339
10.3.45.157    10.3.45.159    dpt:7000   118,978,269
10.3.45.157    10.3.45.160    dpt:7000   230,444,929
10.3.45.157    10.3.45.161    dpt:7000    47,394,179
10.3.45.158    10.3.45.159    dpt:7000   113,969,248
10.3.45.158    10.3.45.160    dpt:7000   225,844,838
10.3.45.158    10.3.45.161    dpt:7000    46,338,939
Nodes 10.3.45.156-158 are in DC1, .159-.161 are in DC2. As you can see,
each of the nodes in DC1 has sent a different amount of traffic to the
remote nodes: roughly 117MB, 228MB and 46MB respectively. Both DCs have a
single rack.
So, here is my question: how does a node select the node in the remote DC
to forward the message to? I did a quick sweep through the code and I could
only find the sorting by proximity (checking the rack and DC). So,
considering that for each request I fire the targets are all 3 nodes in the
remote DC, the sorted list will contain all 3 nodes in DC2. And, if I
understood correctly, the first node from that list is picked to send the
message to.
So, it seems to me that no round-robin-type logic is applied when selecting,
from the list of targets in the remote DC, the node to forward the write to.
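To illustrate what I mean (this is my own toy model of the suspected
behavior, not the actual Cassandra code): if the coordinator always takes
the head of the proximity-sorted list and nothing breaks ties among
equally-close nodes, the same remote node wins every time for a given
sort order:

```java
import java.util.Arrays;
import java.util.List;

// Toy model: remote replicas are sorted by proximity, and the forward
// target is simply the head of that list.
public class PickFirstSketch {
    static String pickForwardTarget(List<String> proximitySortedRemotes) {
        // No tie-breaking among equally-close nodes: the head always wins.
        return proximitySortedRemotes.get(0);
    }

    public static void main(String[] args) {
        List<String> dc2 = Arrays.asList("10.3.45.159", "10.3.45.160", "10.3.45.161");
        // Simulate a few writes: the forward target never rotates.
        for (int i = 0; i < 5; i++) {
            System.out.println(pickForwardTarget(dc2)); // always the same node
        }
    }
}
```

(The real sort order presumably varies per request for other reasons, which
would explain why all three remote nodes still receive some traffic - just
very uneven amounts.)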
If this is true (and the numbers kind of show it is, right?), then perhaps
the list of nodes with equal proximity should be shuffled randomly? Or,
instead of picking the first target, a random one should be picked?
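A sketch of what I have in mind (my assumption of how a fix could look, not
a patch against the actual code): shuffle the equally-close candidates
before taking the head, so forwarded writes spread across the remote DC:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of the proposed fix: randomize the order of equally-proximate
// remote replicas before picking the forward target.
public class ShuffledPickSketch {
    static String pickForwardTarget(List<String> remotes, Random rnd) {
        List<String> candidates = new ArrayList<>(remotes);
        Collections.shuffle(candidates, rnd); // break ties randomly
        return candidates.get(0);
    }

    public static void main(String[] args) {
        List<String> dc2 = Arrays.asList("10.3.45.159", "10.3.45.160", "10.3.45.161");
        Random rnd = new Random(42);
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < 3000; i++) {
            hits.merge(pickForwardTarget(dc2, rnd), 1, Integer::sum);
        }
        // Each remote node should now get roughly a third of the picks.
        System.out.println(hits);
    }
}
```

Picking a random element directly would work just as well; shuffling is only
convenient if the sorted list is reused for retries.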
--
Nikolai Grigoriev