Hi Bruce,

Thanks for your feedback. I was just thinking about a workaround: decreasing 
udp-fragment-size to a value within the 1000..60000 range that the 
documentation recommends. Is it risky even in this case?

We have checked the MTU on all members of the cluster and it is set to 1500. 
The paths from one member to another also report the same MTU value of 1500.

Thanks,
Vahram.

From: Bruce Schuchardt <[email protected]>
Sent: Tuesday, August 14, 2018 9:58 PM
To: [email protected]
Subject: Re: PrepareView is not reaching join initiator member


I've yet to try small packet sizes with the version of JGroups we're using in 
Geode.  In the old GemFire 8x code base I know that packets that small do not 
work, but we've reduced the reliance on JGroups packet size calculations in 
Geode so it's possible it will work.

Is it possible that you have MTU size differences between the machines?  TCP/IP 
stream socket formation usually negotiates the MTU but that doesn't happen for 
UDP datagram communications.  The MTUs really need to be consistent across the 
cluster.
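
As a rough illustration of the consistency check described above, the sketch 
below reads the MTU of a local interface and flags hosts whose value differs 
from the majority. It assumes Linux (the sysfs path) and an interface named 
eth0; the function names are hypothetical, and collecting the per-host values 
(for example over ssh) is left out.

```python
from collections import Counter
from pathlib import Path


def read_local_mtu(interface="eth0"):
    """Read the MTU of a local interface from sysfs (Linux only; assumption)."""
    return int(Path(f"/sys/class/net/{interface}/mtu").read_text().strip())


def mtu_outliers(mtu_by_host):
    """Return the hosts whose MTU differs from the most common value."""
    if not mtu_by_host:
        return {}
    common_mtu, _count = Counter(mtu_by_host.values()).most_common(1)[0]
    return {host: mtu for host, mtu in mtu_by_host.items() if mtu != common_mtu}
```

An empty result from mtu_outliers means every host reported the same MTU. This 
only covers the interface settings on the hosts themselves, not devices along 
the path between them.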

On 8/14/18 8:43 AM, Vahram Aharonyan wrote:
We have found that the problem is due to UDP packet size. If a UDP packet that 
the locator tries to send is larger than 1472 bytes (pure data, without 
headers, etc.), the packet does not reach its destination. We are still trying 
to figure out what can cause this kind of issue on the path from source to 
destination. Meanwhile, we want to understand whether tuning Geode's UDP 
configuration parameters could serve as a workaround. For example, if we 
decrease udp-fragment-size, will this ensure that "large" packets are 
fragmented into small pieces? And what side effects can this change cause?
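
For reference, the 1472-byte figure matches what an MTU of 1500 leaves for UDP 
data once the headers are subtracted. The small sketch below shows the 
arithmetic, assuming IPv4 with no IP options (20-byte header) and the 8-byte 
UDP header; the function names are only illustrative.

```python
IPV4_HEADER_BYTES = 20  # assumption: IPv4 header without options
UDP_HEADER_BYTES = 8


def max_udp_payload(mtu):
    """Largest UDP payload that fits in a single unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES


def needs_ip_fragmentation(payload_bytes, mtu=1500):
    """True if a datagram with this payload cannot be sent in one packet."""
    return payload_bytes > max_udp_payload(mtu)


print(max_udp_payload(1500))         # 1472
print(needs_ip_fragmentation(1472))  # False
print(needs_ip_fragmentation(1473))  # True
```

Anything above 1472 bytes has to be split into IP fragments on a 1500-byte 
MTU link, so a device on the path that drops fragments would be consistent 
with the behavior described above.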

Thanks,
Vahram.

From: Vahram Aharonyan <[email protected]>
Sent: Friday, August 10, 2018 7:49 PM
To: [email protected]
Subject: RE: PrepareView is not reaching join initiator member

Hi Bruce,

I've tried generating UDP traffic on port 10002 using netcat and dumping it on 
the other end via tcpdump - both directions work fine.
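
A similar check can be scripted so that the datagram size is controlled 
explicitly, which netcat does not make easy. The following is a minimal 
sketch, not a Geode tool: the function names are hypothetical, and the port 
must not be in use by a running member while probing.

```python
import socket


def open_receiver(port, bind_addr="0.0.0.0"):
    """Bind a UDP socket on the receiving host; port 0 lets the OS choose."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def send_probes(host, port, sizes):
    """Send one UDP datagram of each requested payload size."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for size in sizes:
            sock.sendto(b"x" * size, (host, port))


def collect_sizes(sock, expected_count, timeout=2.0):
    """Return the payload sizes that arrive before the socket goes quiet."""
    sock.settimeout(timeout)
    seen = set()
    try:
        while len(seen) < expected_count:
            data, _addr = sock.recvfrom(65535)
            seen.add(len(data))
    except socket.timeout:
        pass  # nothing more arrived within the timeout
    return seen
```

On the receiving host, call open_receiver(10002) and then collect_sizes on the 
returned socket; on the sending host, call send_probes with the receiver's 
address and a list of sizes such as 512, 1400 and 2000. Any size missing from 
the receiver's result did not make it across the path.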

Moreover, the tcpdump output on the node that is trying to join 
(sc-rdops-vm07-dhcp-195-204) shows that some UDP packets do arrive from the 
locator VM (sc2-rdops-vm09-dhcp-40-129) after the JOIN request is sent:


  1.  Point when the join request is sent:

[fine 2018/08/10 15:18:36.566 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] sending 
FindCoordinatorRequest(memberID=sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:11973)<ec>:10002,
 rejected=[], lastViewId=-1) to [/10.193.40.129:6061]



[fine 2018/08/10 15:18:36.566 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] TcpClient sending 
FindCoordinatorRequest(memberID=sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:11973)<ec>:10002,
 rejected=[], lastViewId=-1) to /10.193.40.129:6061



[fine 2018/08/10 15:18:36.668 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] SSL Connection from peer OU=MBU, O="Inc.", CN=vc-ops-slice-1



[fine 2018/08/10 15:18:36.688 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] received response: 
FindCoordinatorResponse(coordinator=sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002,
 fromView=true, viewId=16484, registrants=0, 
senderId=sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002, network 
partition detection enabled=false, locators preferred as coordinators=false)



[fine 2018/08/10 15:18:36.689 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] Locator's address indicates it is part of a distributed 
system so I will not become membership coordinator on this attempt to join



[fine 2018/08/10 15:18:36.689 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] found possible coordinator 
sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002



[info 2018/08/10 15:18:36.689 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] Attempting to join the distributed system through coordinator 
sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002 using address 
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:11973)<ec>:10002



[fine 2018/08/10 15:18:36.689 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] sending via JGroups: 
[JoinRequestMessage(sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:11973)<ec>:10002)
 failureDetectionPort:10009] recipients: 
[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002]



  2.  Capture from tcpdump output:

sc-rdops-vm07-dhcp-195-204:/# tcpdump -A port 10002

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes



15:18:36.690284 IP sc-rdops-vm07-dhcp-195-204.com.documentum > 
sc2-rdops-vm09-dhcp-40-129.com.commtact-http: UDP, length 778

E..&..@[email protected]

...

.(.'.N"..............

.(.N".......b...!F.2..|......

...'.....go0!.q`k...R..

15:18:37.046326 IP sc2-rdops-vm09-dhcp-40-129.com.commtact-http > 
sc-rdops-vm07-dhcp-195-204.com.documentum: UDP, length 89

E..u. @.:.e.

.(.

...N"'..a............

...'.....go0!.q`k...R........

.(.N".......b...!F.2..|

15:19:02.433981 IP sc2-rdops-vm09-dhcp-40-129.com.commtact-http > 
sc-rdops-vm07-dhcp-195-204.com.documentum: UDP, length 859

E..w..@.:.b.

.(.

...N"'..c............

...'...@ego0!.q`k...R........

.(.N".......b...!F.2..|

15:19:02.434960 IP sc-rdops-vm07-dhcp-195-204.com.documentum > 
sc2-rdops-vm09-dhcp-40-129.com.commtact-http: UDP, length 83

E..o..@.@..!

...

.(.'.N".[............

.(.N".......b...!F.2..|......

...'.....go0!.q`k...R..

As the traffic is encrypted, I'm not sure whether any of these packets is a 
view preparation request that reaches the destination but is somehow ignored 
by the upper layer.

Thanks,
Vahram.

From: Bruce Schuchardt <[email protected]>
Sent: Thursday, August 9, 2018 10:09 PM
To: [email protected]
Subject: Re: PrepareView is not reaching join initiator member


Have you checked that UDP traffic can get through in both directions for those 
two machines?

On 8/9/18 7:04 AM, Vahram Aharonyan wrote:
Just one correction:

I was wrong in stating that a locator restart solves the issue. More 
experiments show that it is not a solution in this case.

Thanks,
Vahram.

From: Vahram Aharonyan <[email protected]>
Sent: Wednesday, August 8, 2018 7:41 PM
To: [email protected]
Subject: PrepareView is not reaching join initiator member

Hi All,

Under some circumstances we face the following scenario:


  1.  A new member (sc-rdops-vm07-dhcp-195-204) tries to join the distributed 
system but does not succeed, apparently because it does not receive a response 
to its join request:

[fine 2018/08/08 15:09:16.100 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] searching for the membership coordinator



[fine 2018/08/08 15:09:16.100 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] sending 
FindCoordinatorRequest(memberID=sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002,
 rejected=[], lastViewId=-1) to [/10.193.40.129:6061]



[fine 2018/08/08 15:09:16.104 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] SSL Configuration:

    ssl-enabled = true





[fine 2018/08/08 15:09:16.356 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] SSL Connection from peer OU=MBU, O="VMware, Inc.", 
CN=vc-ops-slice-1



[fine 2018/08/08 15:09:16.369 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] TcpClient sending 
FindCoordinatorRequest(memberID=sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002,
 rejected=[], lastViewId=-1) to /10.193.40.129:6061



[fine 2018/08/08 15:09:16.373 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] SSL Connection from peer OU=MBU, O="VMware, Inc.", 
CN=vc-ops-slice-1



[fine 2018/08/08 15:09:16.489 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] received response: 
FindCoordinatorResponse(coordinator=sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002,
 fromView=true, viewId=14299, registrants=0, 
senderId=sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002, network 
partition detection enabled=false, locators preferred as coordinators=false)



[fine 2018/08/08 15:09:16.490 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] Locator's address indicates it is part of a distributed 
system so I will not become membership coordinator on this attempt to join



[fine 2018/08/08 15:09:16.490 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] found possible coordinator 
sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002



[info 2018/08/08 15:09:16.490 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] Attempting to join the distributed system through coordinator 
sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002 using address 
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002



[fine 2018/08/08 15:09:16.491 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 < Main 
Thread> tid=0x11] sending via JGroups: 
[JoinRequestMessage(sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002)
 failureDetectionPort:10003] recipients: 
[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002]



[fine 2018/08/08 15:13:16.499 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 <Main 
Thread> tid=0x11] received no join response



[fine 2018/08/08 15:13:16.500 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 <Main 
Thread> tid=0x11] sleeping for 1000 before making another attempt to find the 
coordinator



[fine 2018/08/08 15:13:17.500 UTC fc2d07ae-e61b-4c0f-b2ed-8216a326e249 <Main 
Thread> tid=0x11] searching for the membership coordinator



  2.  Snippet from the coordinator's log (in this case, the locator):



[info 2018/08/08 15:09:16.498 UTC  <unicast 
receiver,sc2-rdops-vm09-dhcp-40-129-24030> tid=0x23] received join request from 
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec>:10002



[info 2018/08/08 15:09:16.798 UTC  <Geode Membership View Creator> tid=0x29] 
preparing new view 
View[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002|14300] members: 
[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002, 
sc-rdops-vm05-dhcp-130-203(dbe93506-f385-4542-9bda-55599273e96c:28849)<ec><v14279>:10002{lead},
 
sc2-rdops-vm09-dhcp-40-129(a3c1b724-aa54-427b-8325-20defec15b7e:482)<ec><v14291>:10002,
 
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002]

  failure detection ports: 20007 10002 10006 10003



[info 2018/08/08 15:09:29.800 UTC  <Geode Membership View Creator> tid=0x29] 
finished waiting for responses to view preparation



[warning 2018/08/08 15:09:29.800 UTC  <Geode Membership View Creator> tid=0x29] 
these members failed to respond to the view change: 
[sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002]



[info 2018/08/08 15:09:29.800 UTC  <Geode View Creator verification thread 1> 
tid=0x649d] checking state of member 
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002



[info 2018/08/08 15:09:29.800 UTC  <Geode View Creator verification thread 1> 
tid=0x649d] member 
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002
 failed availability check



[info 2018/08/08 15:09:42.238 UTC  <Geode Membership View Creator> tid=0x29] 
adding these unresponsive members to the crash-set for the next view: 
[sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002]



  3.  Snippet from the log of another member (sc-rdops-vm05-dhcp-130-203) that 
is already part of the distributed system, indicating that the PrepareView 
request reaches it:



[fine 2018/08/08 15:09:16.802 UTC dbe93506-f385-4542-9bda-55599273e96c <unicast 
receiver,sc-rdops-vm05-dhcp-130-203-36192> tid=0x2b] processing 
InstallViewMessage(type=PREPARE; Current ViewID=14300; Previous View ID=0; 
View[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002|14300] members: 
[sc2-rdops-vm09-dhcp-40-129(17201:locator)<ec><v0>:20002, 
sc-rdops-vm05-dhcp-130-203(dbe93506-f385-4542-9bda-55599273e96c:28849)<ec><v14279>:10002{lead},
 
sc2-rdops-vm09-dhcp-40-129(a3c1b724-aa54-427b-8325-20defec15b7e:482)<ec><v14291>:10002,
 
sc-rdops-vm07-dhcp-195-204(fc2d07ae-e61b-4c0f-b2ed-8216a326e249:23205)<ec><v14300>:10002];
 cred=null)



So from the coordinator log it is clear that it gets the join request from 
member sc-rdops-vm07-dhcp-195-204, prepares a new view and broadcasts it, but 
sc-rdops-vm07-dhcp-195-204 itself does not receive this response.
Does anyone have a clue why the view preparation response is not reaching the 
node that initiated the join?

Please note that we are using Geode 1.1.0, and a locator restart seems to fix 
this issue.

Thanks,
Vahram.

