A ForcedDisconnectException is generated when a node is kicked out of
the system. If all of the nodes are throwing it, then there was a
total meltdown of the cluster. The Geode logs should show how this came
about. If you want to zip them up and share them, I'll take a look.
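[Editor's note: a member that has been kicked out can be allowed to rejoin on its own. A minimal gemfire.properties sketch; the value shown for member-timeout is illustrative, not a tuned recommendation:

```properties
# Keep auto-reconnect enabled (the default) so a forcibly
# disconnected member attempts to rejoin the cluster on its own.
disable-auto-reconnect=false

# How long peers wait on heartbeat/availability checks before a
# member is suspected and removed; raising it makes forced
# disconnects less likely on a congested network (milliseconds).
member-timeout=10000
```

Raising member-timeout trades slower failure detection for fewer spurious removals under load.]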
On 5/4/2016 at 6:29 AM, Eugene Strokin wrote:
This is Geode.
After I set enable-network-partition-detection=true, I ran into the
following problem:
The cluster (10 nodes) was working under normal production load. One
node went down. All the other nodes started getting the exception (see
below).
The line I'm getting the exception on is: region.size()
I had hoped that if a node goes down, the system would function normally;
it would just lose a portion of the data, which is understood, but the
rest would continue to work.
Can anything be done here to avoid the exception?
Thanks,
Eugene
com.gemstone.gemfire.distributed.DistributedSystemDisconnectedException: GemFire on 10.132.49.101(3787)<ec><v6>:1024 started at Tue May 03 17:06:13 EDT 2016: Message distribution has terminated
	at com.gemstone.gemfire.distributed.internal.DistributionManager$Stopper.generateCancelledException(DistributionManager.java:745)
	at com.gemstone.gemfire.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:861)
	at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1453)
	at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(CancelCriterion.java:91)
	at com.gemstone.gemfire.internal.cache.LocalRegion.checkRegionDestroyed(LocalRegion.java:8118)
	at com.gemstone.gemfire.internal.cache.LocalRegion.checkReadiness(LocalRegion.java:2994)
	at com.gemstone.gemfire.internal.cache.LocalRegion.size(LocalRegion.java:9668)
	at ccio.image.ImageServer$2.run(ImageServer.java:135)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.gemstone.gemfire.ForcedDisconnectException: Member isn't responding to heartbeat requests
	at com.gemstone.gemfire.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2571)
	at com.gemstone.gemfire.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:811)
	at com.gemstone.gemfire.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:519)
	at com.gemstone.gemfire.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1459)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1051)
	at org.jgroups.JChannel.invokeCallback(JChannel.java:817)
	at org.jgroups.JChannel.up(JChannel.java:741)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1029)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
	at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1064)
	at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:779)
	at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:72)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:76)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
	at org.jgroups.protocols.TP$MyHandler.run(TP.java:1796)
	at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
	at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1693)
	at org.jgroups.protocols.TP.receive(TP.java:1630)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:165)
	at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:691)
	... 1 common frames omitted
On Tue, May 3, 2016 at 8:10 PM, Bruce Schuchardt
<[email protected]> wrote:
Is this using Geode or GemFire? Either way, if you continue to
have problems you can PM Udo and me directly. Send us a zip with
the log files and we'll help you figure it out.
On 5/3/2016 at 2:13 PM, Eugene Strokin wrote:
Udo, thanks for the hint. The property was indeed missing.
I've put it into my gemfire.properties file, and now the cluster waits
for all nodes to start before proceeding with any activity.
Eugene
On Tue, May 3, 2016 at 4:28 PM, Udo Kohlmeyer
<[email protected]> wrote:
Hi there Eugene,
Can you check whether the enable-network-partition-detection
property is set, as per the documentation:
Handling Network partitioning
<http://geode.docs.pivotal.io/docs/managing/network_partitioning/handling_network_partitioning.html>
--Udo
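[Editor's note: the property Udo refers to is a one-line entry in gemfire.properties; a minimal sketch, which per the documentation must be set to the same value on all members:

```properties
# Enable network-partition detection so that on a split, the side
# that has lost quorum shuts itself down instead of continuing to
# run as an independent cluster with duplicated data.
enable-network-partition-detection=true
```
]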
On 4/05/2016 6:22 am, Eugene Strokin wrote:
I'm testing my 10-node cluster under production load and
with production data.
I used an automated tool which created the nodes (VMs),
configured everything, and restarted all of them.
Everything worked, I mean I was getting the data I
expected, but when I checked the stats I noticed that I was
running 10 one-node clusters. My nodes didn't see each
other; each node had its own separate, duplicated set of data.
I stopped all the nodes, cleaned all the logs/storage files,
and restarted them.
Now I had one cluster of 7 nodes, with 3 nodes separate.
I stopped those 3 nodes, cleaned them up, and started them
one by one; they successfully joined the cluster. In the
end I had all 10 nodes working as a single cluster.
But I'm afraid that if nodes get restarted or the network
has problems, I could end up with a split cluster again.
I use the API to start the Cache with locators, and all the
locators' IPs are provided in the config. From the
documentation I had the impression that Geode would wait
until N/2+1 nodes had started before forming the cluster,
since the number of locators is preset. But it looks like
that is not the case.
Or should I set something to force such behavior?
Thank you,
Eugene
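[Editor's note: a sketch of how the locator list is typically supplied in gemfire.properties when starting the cache programmatically; the hostnames and ports below are placeholders. Listing multiple locators tells each member where to find the membership coordinator, but does not by itself enforce a startup quorum of N/2+1 members:

```properties
# Every member should point at the same full locator list so all
# nodes join one membership view instead of forming separate
# clusters. Format: host[port],host[port],...
locators=locator1[10334],locator2[10334],locator3[10334]

# Disable multicast discovery so membership goes through the
# locators only.
mcast-port=0
```
]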