A ForcedDisconnectException is generated when a node is kicked out of
the system. If all of the nodes are throwing it, then there was a
total meltdown of the cluster. The Geode logs should show how this came
about. If you want to zip them up and share them, I'll take a look.
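[Editor's note: a member that has been kicked out can be allowed to rejoin on its own. A minimal gemfire.properties sketch; the value shown for member-timeout is illustrative, not a tuned recommendation:

```properties
# Keep auto-reconnect enabled (the default) so a forcibly
# disconnected member attempts to rejoin the cluster on its own.
disable-auto-reconnect=false

# How long peers wait on heartbeat/availability checks before a
# member is suspected and removed; raising it makes forced
# disconnects less likely on a congested network (milliseconds).
member-timeout=10000
```

Raising member-timeout trades slower failure detection for fewer spurious removals under load.]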
On 5/4/2016 at 6:29 AM, Eugene Strokin wrote:
This is Geode.
After I set enable-network-partition-detection=true, I ran into the
following problem:
The cluster (10 nodes) was working under normal production load. One
node went down. All the other nodes started getting the exception (see
below).
The line I'm getting the exception on is: region.size()
I had hoped that if a node goes down, the system would function normally;
it would just lose a portion of the data, which is understood, but the
rest would continue to work.
Can anything be done here to avoid the exception?
Thanks,
Eugene
com.gemstone.gemfire.distributed.DistributedSystemDisconnectedException: GemFire on 10.132.49.101(3787)<ec><v6>:1024 started at Tue May 03 17:06:13 EDT 2016: Message distribution has terminated
	at com.gemstone.gemfire.distributed.internal.DistributionManager$Stopper.generateCancelledException(DistributionManager.java:745)
	at com.gemstone.gemfire.distributed.internal.InternalDistributedSystem$Stopper.generateCancelledException(InternalDistributedSystem.java:861)
	at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1453)
	at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(CancelCriterion.java:91)
	at com.gemstone.gemfire.internal.cache.LocalRegion.checkRegionDestroyed(LocalRegion.java:8118)
	at com.gemstone.gemfire.internal.cache.LocalRegion.checkReadiness(LocalRegion.java:2994)
	at com.gemstone.gemfire.internal.cache.LocalRegion.size(LocalRegion.java:9668)
	at ccio.image.ImageServer$2.run(ImageServer.java:135)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: com.gemstone.gemfire.ForcedDisconnectException: Member isn't responding to heartbeat requests
	at com.gemstone.gemfire.distributed.internal.membership.gms.mgr.GMSMembershipManager.forceDisconnect(GMSMembershipManager.java:2571)
	at com.gemstone.gemfire.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:811)
	at com.gemstone.gemfire.distributed.internal.membership.gms.membership.GMSJoinLeave.processRemoveRequest(GMSJoinLeave.java:519)
	at com.gemstone.gemfire.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:1459)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1051)
	at org.jgroups.JChannel.invokeCallback(JChannel.java:817)
	at org.jgroups.JChannel.up(JChannel.java:741)
	at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1029)
	at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
	at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
	at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1064)
	at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:779)
	at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:426)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.StatRecorder.up(StatRecorder.java:72)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.AddressManager.up(AddressManager.java:76)
	at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
	at org.jgroups.protocols.TP$MyHandler.run(TP.java:1796)
	at org.jgroups.util.DirectExecutor.execute(DirectExecutor.java:10)
	at org.jgroups.protocols.TP.handleSingleMessage(TP.java:1693)
	at org.jgroups.protocols.TP.receive(TP.java:1630)
	at com.gemstone.gemfire.distributed.internal.membership.gms.messenger.Transport.receive(Transport.java:165)
	at org.jgroups.protocols.UDP$PacketReceiver.run(UDP.java:691)
	... 1 common frames omitted
On Tue, May 3, 2016 at 8:10 PM, Bruce Schuchardt
<[email protected]> wrote:
Is this using Geode or GemFire? Either way, if you continue to
have problems you can PM Udo and me directly. Send us a zip with
the log files and we'll help you figure it out.
On 5/3/2016 at 2:13 PM, Eugene Strokin wrote:
Udo, thanks for the hint. The property was indeed missing.
I've put it into my gemfire.properties file, and now the cluster waits
for all nodes to start before proceeding with any activity.
Eugene
On Tue, May 3, 2016 at 4:28 PM, Udo Kohlmeyer
<[email protected]> wrote:
Hi there Eugene,
Can you check whether the enable-network-partition-detection
property is set, as per the documentation:
Handling Network partitioning
<http://geode.docs.pivotal.io/docs/managing/network_partitioning/handling_network_partitioning.html>
--Udo
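[Editor's note: the property Udo refers to is a one-line entry in gemfire.properties; a minimal sketch, which per the documentation must be set to the same value on all members:

```properties
# Enable network-partition detection so that on a split, the side
# that has lost quorum shuts itself down instead of continuing to
# run as an independent cluster with duplicated data.
enable-network-partition-detection=true
```
]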
On 4/05/2016 6:22 am, Eugene Strokin wrote:
I'm testing my 10-node cluster under production load and
with production data.
I used an automated tool which created the nodes (VMs),
configured everything, and restarted all of them.
Everything worked, I mean I was getting the data I
expected, but when I checked the stats I noticed that I was
running 10 one-node clusters. My nodes didn't see each
other; each node had its own separate, duplicated set of data.
I stopped all the nodes, cleaned all the logs/storage files,
and restarted them.
Now I had one cluster of 7 nodes, with 3 nodes separate.
I stopped those 3 nodes, cleaned them up, and started them
one by one; they successfully joined the cluster. In the
end I had all 10 nodes working as a single cluster.
But I'm afraid that if nodes get restarted or the network
has problems, I could end up with a split cluster again.
I use the API to start the Cache with locators, and all the
locators' IPs are provided in the config. From the
documentation I had the impression that Geode would wait
until N/2+1 nodes had started before forming the cluster,
since the number of locators is preset. But it looks like
that is not the case.
Or should I set something to force such behavior?
Thank you,
Eugene
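[Editor's note: a sketch of how the locator list is typically supplied in gemfire.properties when starting the cache programmatically; the hostnames and ports below are placeholders. Listing multiple locators tells each member where to find the membership coordinator, but does not by itself enforce a startup quorum of N/2+1 members:

```properties
# Every member should point at the same full locator list so all
# nodes join one membership view instead of forming separate
# clusters. Format: host[port],host[port],...
locators=locator1[10334],locator2[10334],locator3[10334]

# Disable multicast discovery so membership goes through the
# locators only.
mcast-port=0
```
]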