Hi Dharam, I assume the top command was run on the physical host; in that case I am not sure how it reflects the virtual machine stats and the processes running in it.
The error indicates that the server is unable to ping the peer/other member; do you see any errors or a stack trace in the peer member's log? The ping response wait time is based on "member-timeout"; you can try increasing that to see if it fixes the problem (a minimal sketch of setting it is at the end of this message). If it does, that indicates the responses are taking more time. When you move the client to the other host, do you still see the problem? If you haven't seen them already, here are a few guidelines about running Geode in a virtual environment.

-Anil.

On Thu, Sep 27, 2018 at 12:13 PM Thacker, Dharam <[email protected]> wrote:

> Hi Anil,
>
> Here are the ones in the member which failed.
>
> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <Geode Failure Detection Server thread 0> tid=0x37] GMSHealthMonitor server thread exiting
>
> [severe 2018/09/27 08:09:06.561 EDT xxx-server-1 <unicast receiver,xxx-27036> tid=0x33] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>     at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2520)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:998)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:635)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1702)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1286)
>     at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>     at org.jgroups.JChannel.up(JChannel.java:741)
>     at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>     at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>     at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>     at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1070)
>     at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:785)
>     at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:74)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
>     at org.jgroups.protocols.TP.passMessageUp(TP.java:1601)
>     at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1817)
>     at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>     at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1729)
>     at org.jgroups.protocols.TP.receive(TP.java:1654)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:160)
>     at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>     at java.lang.Thread.run(Thread.java:745)
>
> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <Geode Failure Detection Server thread 0> tid=0x37] GMSHealthMonitor server socket closed.
> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <unicast receiver,xxx-27036> tid=0x33] CacheServer configuration saved
>
> *Top command output:*
>
> top - 15:09:56 up 24 days, 18:47, 3 users, load average: 0.03, 0.08, 0.08
> Tasks: 427 total, 1 running, 426 sleeping, 0 stopped, 0 zombie
> Cpu(s): 2.8%us, 0.6%sy, 0.0%ni, 96.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 132159004k total, 83190332k used, 48968672k free, 21108k buffers
> Swap: 2097148k total, 0k used, 2097148k free, 7780808k cached
>
> *Visual VM [I have seen below trend only most of the time]*
>
> Thanks,
> Dharam
>
> *From:* Anilkumar Gingade [mailto:[email protected]]
> *Sent:* Thursday, September 27, 2018 11:24 PM
> *To:* [email protected]
> *Subject:* Re: Frequent GMS membership error
>
> Can you share the exact error message you are seeing; do you see any exception stack trace in the server log?
>
> Most probable cause is n/w or memory. Can you verify the specified memory is getting allocated to the JVM and the host (virtual machines) has sufficient memory to run all the servers/clients.
>
> -Anil.
>
> On Thu, Sep 27, 2018 at 9:48 AM Dharam Thacker <[email protected]> wrote:
>
> Hello Anthony,
>
> Yes, I am running in virtualized infrastructure. But when I checked %id and %st and logged a graph for them, I see %st as always 0.0 and %id in the range of 95-98 most of the time.
>
> Could number of connections for every client app or member-timeout or ack-wait-threshold help here?
>
> Thanks,
> - Dharam Thacker
>
> On Thu, Sep 27, 2018 at 8:37 PM Anthony Baker <[email protected]> wrote:
>
> Are you running on cloud or virtualized infrastructure? If so, check your steal time stats—you may have “noisy neighbors” causing members to become unresponsive. Geode detects this and fences off the unhealthy members to maintain consistency and availability.
>
> Anthony
>
> On Sep 27, 2018, at 10:31 AM, Dharam Thacker <[email protected]> wrote:
>
> Hi Team,
>
> I have the following topology for Geode currently, and all regions are replicated.
>
> Note: Unfortunately I am still on version 1.1.1
>
> *Host1*:
> Locator1
> Server1.1 (Group1) -- 24G
> Server2.1 (Group2) -- 24G
> Client1 (CQ listener only -- 20 CQs registered via locator pool)
> Client2 (Fires OQL queries and functions only via locator pool)
>
> *Host2*:
> Locator2
> Server1.2 (Group1) -- 24G
> Server2.2 (Group2) -- 24G
>
> As shown above, I have spring boot web app geode clients (client1 and client2) only on HOST1.
>
> If I scale them by putting them on HOST2 as well, it works.
>
> Now I see 40 CQs registered for the CQ listener client.
>
> But I now frequently see "GMS Membership error" complaining about "No heartbeat request and force disconnection of member" for all server nodes.
>
> Transient though, but really painful!
>
> Somehow with 1.1.1 it can't auto reconnect, which I know is fixed in a later version, but that's still fine.
>
> I did GC, CPU load and memory analysis very well, and at least these 3 look quite healthy as expected.
>
> What could be the possible other reasons where scaling client apps might result in this?
>
> Or if you can suggest anything else to look at?
> Thanks,
> Dharam
>
> This message is confidential and subject to terms at: http://www.jpmorgan.com/emaildisclaimer including on confidentiality, legal privilege, viruses and monitoring of electronic messages. If you are not the intended recipient, please delete this message and notify the sender immediately. Any unauthorized use is strictly prohibited.
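For reference, here is a minimal sketch of what raising "member-timeout" could look like if the server cache is created programmatically. The class name, the 10000 ms value, and the locator string are placeholders for illustration only, not recommendations for your environment:

import java.util.Properties;

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public class ServerWithLongerMemberTimeout {
  public static void main(String[] args) {
    Properties props = new Properties();
    // member-timeout is in milliseconds; the Geode default is 5000.
    // 10000 here is only an illustrative value.
    props.setProperty("member-timeout", "10000");
    // Placeholder locator addresses; replace with your own.
    props.setProperty("locators", "host1[10334],host2[10334]");

    // Creates a peer/server cache that joins the distributed system
    // using the longer member-timeout.
    Cache cache = new CacheFactory(props).create();
  }
}

If the servers are started through gfsh instead, the same property can be passed as --J=-Dgemfire.member-timeout=10000 or set in gemfire.properties; whichever way it is set, it is generally best to use the same value on all members, including the locators.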
