Hi Dharam, I assume the top command was run on the physical host; in that case I am not sure how it reflects the virtual machine stats and the processes running in it.
The error indicates that the server is unable to ping the peer/other member; do you see any errors or a stack trace in the peer member's log? The ping response wait time is based on "member-timeout"; you can try increasing that to see if it fixes the problem (a minimal sketch of setting it is at the end of this message). If it does, that indicates the responses are taking more time. When you move the client to the other host, do you still see the problem? If you haven't seen them already, here are a few guidelines about running Geode in a virtual environment.

-Anil.

On Thu, Sep 27, 2018 at 12:13 PM Thacker, Dharam <[email protected]> wrote:

> Hi Anil,
>
> Here are the ones in the member which failed.
>
> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <Geode Failure Detection Server thread 0> tid=0x37] GMSHealthMonitor server thread exiting
>
> [severe 2018/09/27 08:09:06.561 EDT xxx-server-1 <unicast receiver,xxx-27036> tid=0x33] Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.ForcedDisconnectException: Member isn't responding to heartbeat requests
>     at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2520)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:998)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:635)
>     at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1702)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1286)
>     at org.jgroups.JChannel.invokeCallback(JChannel.java:816)
>     at org.jgroups.JChannel.up(JChannel.java:741)
>     at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1030)
>     at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
>     at org.jgroups.protocols.FlowControl.up(FlowControl.java:390)
>     at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1070)
>     at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:785)
>     at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:74)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:72)
>     at org.jgroups.protocols.TP.passMessageUp(TP.java:1601)
>     at org.jgroups.protocols.TP$SingleMessageHandler.run(TP.java:1817)
>     at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
>     at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1729)
>     at org.jgroups.protocols.TP.receive(TP.java:1654)
>     at org.apache.geode.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:160)
>     at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:701)
>     at java.lang.Thread.run(Thread.java:745)
>
> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <Geode Failure Detection Server thread 0> tid=0x37] GMSHealthMonitor server socket closed.
> [info 2018/09/27 08:09:06.561 EDT xxx-server-1 <unicast receiver,xxx-27036> tid=0x33] CacheServer configuration saved
>
> *Top command output:*
>
> top - 15:09:56 up 24 days, 18:47, 3 users, load average: 0.03, 0.08, 0.08
> Tasks: 427 total, 1 running, 426 sleeping, 0 stopped, 0 zombie
> Cpu(s): 2.8%us, 0.6%sy, 0.0%ni, 96.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 132159004k total, 83190332k used, 48968672k free, 21108k buffers
> Swap: 2097148k total, 0k used, 2097148k free, 7780808k cached
>
> *Visual VM [I have seen below trend only most of the time]*
>
> Thanks,
> Dharam
>
> *From:* Anilkumar Gingade [mailto:[email protected]]
> *Sent:* Thursday, September 27, 2018 11:24 PM
> *To:* [email protected]
> *Subject:* Re: Frequent GMS membership error
>
> Can you share the exact error message you are seeing; do you see any exception stack trace in the server log?
>
> Most probable cause is n/w or memory. Can you verify the specified memory is getting allocated to the JVM and the host (virtual machines) has sufficient memory to run all the servers/clients.
>
> -Anil.
>
> On Thu, Sep 27, 2018 at 9:48 AM Dharam Thacker <[email protected]> wrote:
>
> Hello Anthony,
>
> Yes, I am running in virtualized infrastructure. But when I checked %id and %st and logged a graph for them, I see %st as always 0.0 and %id in the range of 95-98 most of the time.
>
> Could number of connections for every client app or member-timeout or ack-wait-threshold help here?
>
> Thanks,
> - Dharam Thacker
>
> On Thu, Sep 27, 2018 at 8:37 PM Anthony Baker <[email protected]> wrote:
>
> Are you running on cloud or virtualized infrastructure? If so, check your steal time stats—you may have “noisy neighbors” causing members to become unresponsive. Geode detects this and fences off the unhealthy members to maintain consistency and availability.
>
> Anthony
>
> On Sep 27, 2018, at 10:31 AM, Dharam Thacker <[email protected]> wrote:
>
> Hi Team,
>
> I have the following topology for Geode currently, and all regions are replicated.
>
> Note: Unfortunately I am still on version 1.1.1
>
> *Host1*:
> Locator1
> Server1.1 (Group1) -- 24G
> Server2.1 (Group2) -- 24G
> Client1 (CQ listener only -- 20 CQs registered via locator pool)
> Client2 (Fires OQL queries and functions only via locator pool)
>
> *Host2*:
> Locator2
> Server1.2 (Group1) -- 24G
> Server2.2 (Group2) -- 24G
>
> As shown above, I have spring boot web app geode clients (client1 and client2) only on HOST1.
>
> If I scale them by putting them on HOST2 as well, it works.
>
> Now I see 40 CQs registered for the CQ listener client.
>
> But I now frequently see "GMS Membership error" complaining about "No heartbeat request and force disconnection of member" for all server nodes.
>
> Transient though, but really painful!
>
> Somehow with 1.1.1 it can't auto reconnect, which I know is fixed in a later version, but that's still fine.
>
> I did GC, CPU load and memory analysis very well, and at least these 3 look quite healthy as expected.
>
> What could be the possible other reasons where scaling client apps might result in this?
>
> Or if you can suggest anything else to look at?
> Thanks,
> Dharam
>
> This message is confidential and subject to terms at: http://www.jpmorgan.com/emaildisclaimer including on confidentiality, legal privilege, viruses and monitoring of electronic messages. If you are not the intended recipient, please delete this message and notify the sender immediately. Any unauthorized use is strictly prohibited.
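For reference, here is a minimal sketch of what raising "member-timeout" could look like if the server cache is created programmatically. The class name, the 10000 ms value, and the locator string are placeholders for illustration only, not recommendations for your environment:

import java.util.Properties;

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public class ServerWithLongerMemberTimeout {
  public static void main(String[] args) {
    Properties props = new Properties();
    // member-timeout is in milliseconds; the Geode default is 5000.
    // 10000 here is only an illustrative value.
    props.setProperty("member-timeout", "10000");
    // Placeholder locator addresses; replace with your own.
    props.setProperty("locators", "host1[10334],host2[10334]");

    // Creates a peer/server cache that joins the distributed system
    // using the longer member-timeout.
    Cache cache = new CacheFactory(props).create();
  }
}

If the servers are started through gfsh instead, the same property can be passed as --J=-Dgemfire.member-timeout=10000 or set in gemfire.properties; whichever way it is set, it is generally best to use the same value on all members, including the locators.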
