Hi All,
At some circumstances we face following scenario:
1. A new member(sc-rdops-vm07-dhcp-195-204) is trying to join distributed
system but it does not succeed - it seems due to not receiving response on join
request:
[fine 2018/08/08 15:09:16.100 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] searching for the membership coordinator
[fine 2018/08/08 15:09:16.100 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] sending
FindCoordinatorRequest(memberID=sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002,
rejected=[], lastViewId=-1) to [/10.193.40.129:6061]
[fine 2018/08/08 15:09:16.104 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] SSL Configuration:
ssl-enabled = true
[fine 2018/08/08 15:09:16.356 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] SSL Connection from peer OU=MBU, O="VMware, Inc.",
CN=vc-ops-slice-1
[fine 2018/08/08 15:09:16.369 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] TcpClient sending
FindCoordinatorRequest(memberID=sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002,
rejected=[], lastViewId=-1) to /10.193.40.129:6061
[fine 2018/08/08 15:09:16.373 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] SSL Connection from peer OU=MBU, O="VMware, Inc.",
CN=vc-ops-slice-1
[fine 2018/08/08 15:09:16.489 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] received response:
FindCoordinatorResponse(coordinator=sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002,
fromView=true, viewId=14299, registrants=0,
senderId=sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002, network
partition detection enabled=false, locators preferred as coordinators=false)
[fine 2018/08/08 15:09:16.490 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] Locator's address indicates it is part of a distributed
system so I will not become membership coordinator on this attempt to join
[fine 2018/08/08 15:09:16.490 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] found possible coordinator
sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002
[info 2018/08/08 15:09:16.490 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] Attempting to join the distributed system through coordinator
sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002 using address
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002
[fine 2018/08/08 15:09:16.491 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main
Thread> tid=0x11] sending via JGroups:
[JoinRequestMessage(sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002)
failureDetectionPort:10003] recipients:
[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002]
[fine 2018/08/08 15:13:16.499 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 <Main
Thread> tid=0x11] received no join response
[fine 2018/08/08 15:13:16.500 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 <Main
Thread> tid=0x11] sleeping for 1000 before making another attempt to find the
coordinator
[fine 2018/08/08 15:13:17.500 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 <Main
Thread> tid=0x11] searching for the membership coordinator
1. Snipper from Coordinator(in this case locator) is log:
[info 2018/08/08 15:09:16.498 UTC <unicast
receiver,sc2-rdops-vm09-dhcp-40-129-24030> tid=0x23] received join request from
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002
[info 2018/08/08 15:09:16.798 UTC <Geode Membership View Creator> tid=0x29]
preparing new view
View[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002|14300] members:
[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002,
sc-rdops-vm05-dhcp-130-203(dbe93506-f385-4542-9bda-55599273e96c:28849)<ec><v14279>:10002{lead},
sc2-rdops-vm09-dhcp-40-129(a3c1b724-aa54-427b-8325-20defec15b7e:482)<ec><v14291>:10002,
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002]
failure detection ports: 20007 10002 10006 10003
[info 2018/08/08 15:09:29.800 UTC <Geode Membership View Creator> tid=0x29]
finished waiting for responses to view preparation
[warning 2018/08/08 15:09:29.800 UTC <Geode Membership View Creator> tid=0x29]
these members failed to respond to the view change:
[sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002]
[info 2018/08/08 15:09:29.800 UTC <Geode View Creator verification thread 1>
tid=0x649d] checking state of member
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002
[info 2018/08/08 15:09:29.800 UTC <Geode View Creator verification thread 1>
tid=0x649d] member
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002
failed availability check
[info 2018/08/08 15:09:42.238 UTC <Geode Membership View Creator> tid=0x29]
adding these unresponsive members to the crash-set for the next view:
[sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002]
1. Snippet from other member log that is already member of distributed
system indicating that PrepareView request reaches
it(sc-rdops-vm05-dhcp-130-203) :
[fine 2018/08/08 15:09:16.802 UTC dbe93506-f385-4542-9bda-55599273e96c <unicast
receiver,sc-rdops-vm05-dhcp-130-203-36192> tid=0x2b] processing
InstallViewMessage(type=PREPARE; Current ViewID=14300; Previous View ID=0;
View[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002|14300] members:
[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002,
sc-rdops-vm05-dhcp-130-203(dbe93506-f385-4542-9bda-55599273e96c:28849)<ec><v14279>:10002{lead},
sc2-rdops-vm09-dhcp-40-129(a3c1b724-aa54-427b-8325-20defec15b7e:482)<ec><v14291>:10002,
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002];
cred=null)
So from coordinator log it is obvious that it gets join request from member
sc-rdops-vm07-dhcp-195-204, prepares new view and broadcasts it, but
sc-rdops-vm07-dhcp-195-204 member itself is not getting this response.
Could someone have some glue on why view preparation response is not reaching
the node that has initiated join?
Please note, that we are using Geode 1.1.0 and locator restart seem to fix this
issue.
Thanks,
Vahram.