Re: ignite cluster hang with GridCachePartitionExchangeManager
Yes. It was blocked when doing cache operations in discovery event listeners while node-left events arrived concurrently. I now do the cache operation in another thread, so the listener is no longer blocked. The original cause may be that the discovery event processor holds the server latch: when a cache operation is done in the listener, it blocks on the server latch (why, I am not sure). -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/
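The workaround described above, handing the cache operation to another thread so the discovery listener returns immediately, can be sketched with plain JDK classes. This is an illustration only: the cache is faked with a ConcurrentHashMap, and ListenerOffload / onDiscoveryEvent are made-up names, not Ignite API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ListenerOffload {
    // The discovery listener must return quickly: instead of touching the
    // cache inline, hand the operation to a separate single-threaded executor
    // (which also keeps the cache updates in arrival order).
    static boolean onDiscoveryEvent(ExecutorService cacheOps, String item,
                                    Consumer<String> cacheOp) {
        cacheOps.execute(() -> cacheOp.accept(item)); // returns immediately
        return true; // keep the listener registered
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService cacheOps = Executors.newSingleThreadExecutor();
        // stand-in for an IgniteCache<String, String>
        ConcurrentMap<String, String> fakeCache = new ConcurrentHashMap<>();
        onDiscoveryEvent(cacheOps, "event_node1", item -> fakeCache.put(item, item));
        cacheOps.shutdown();
        cacheOps.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(fakeCache.get("event_node1")); // prints event_node1
    }
}
```

The single-threaded executor is a deliberate choice here: it preserves the order of node-join/node-left updates, which a common pool would not guarantee.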
Re: ignite cluster hang with GridCachePartitionExchangeManager
Hi Wangsan, So what was the original cause of the issue? Was it blocking the listening thread in your test code, or something else? We are having similar issues. Cheers, Eugene

On Mon, Sep 3, 2018 at 1:23 PM Ilya Kasnacheev wrote:
> Hello!
>
> The operation will execute after partition map exchange (or maybe several
> ones). Just be sure to avoid waiting on the operation from a discovery
> event listener.
>
> Regards,
> --
> Ilya Kasnacheev
>
> Mon, 3 Sep 2018 at 17:37, wangsan:
>
>> Thanks!
>>
>> Can I do cache operations (update a cache item) in another thread from
>> discovery event listeners? And will the operation execute concurrently,
>> or before partition map exchange?
Re: ignite cluster hang with GridCachePartitionExchangeManager
Hello! The operation will execute after partition map exchange (or maybe several ones). Just be sure to avoid waiting on the operation from a discovery event listener. Regards, -- Ilya Kasnacheev

Mon, 3 Sep 2018 at 17:37, wangsan:
> Thanks!
>
> Can I do cache operations (update a cache item) in another thread from
> discovery event listeners? And will the operation execute concurrently,
> or before partition map exchange?
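The advice above, start the operation but never wait on it inside the listener, is the fire-and-forget pattern. In Ignite itself this would be something along the lines of `cache.putAsync(k, v)` plus a completion listener (check the IgniteCache API of your version); the shape of the pattern, sketched with a plain CompletableFuture and a map standing in for the cache (FireAndForget and updateAsync are illustrative names):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class FireAndForget {
    // Kicks off the update on another pool and attaches a callback instead of
    // blocking: the calling (listener) thread returns immediately, and the
    // callback runs whenever the operation completes -- after any pending
    // partition map exchange, in Ignite terms.
    static CompletableFuture<String> updateAsync(ConcurrentMap<String, String> cache,
                                                 String key, String val,
                                                 ExecutorService pool) {
        return CompletableFuture
            .supplyAsync(() -> cache.put(key, val), pool)
            .whenComplete((prev, err) -> {
                if (err != null)
                    System.err.println("update failed: " + err);
            });
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        // join() is fine here in main, but this is exactly what a discovery
        // listener must NOT do.
        updateAsync(cache, "event_node1", "v1", pool).join();
        System.out.println(cache.get("event_node1")); // prints v1
        pool.shutdown();
    }
}
```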
Re: ignite cluster hang with GridCachePartitionExchangeManager
Thanks! Can I do cache operations (update a cache item) in another thread from discovery event listeners? And will the operation (update cache item) execute concurrently, or before the partition map exchange?
Re: ignite cluster hang with GridCachePartitionExchangeManager
Hello! I'm fairly confident that you should not attempt to do any cache operations from discovery threads, e.g. from event listeners. When topology changes, you can't expect any cache operations to be performed before the partition map exchange is done. Regards, -- Ilya Kasnacheev

Wed, 29 Aug 2018 at 6:24, wangsan:
> I can reproduce the bug; the log above is what the server (first) node
> prints when I stop the other nodes.
Re: ignite cluster hang with GridCachePartitionExchangeManager
I can reproduce the bug; the log above is what the server (first) node prints when I stop the other nodes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.internal.managers.discovery.IgniteDiscoverySpi;
import org.apache.ignite.logger.slf4j.Slf4jLogger;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder;
import org.junit.Test;

import com.google.common.collect.Lists;

/**
 * @author wangsan
 * @date 2018/08/23
 */
public class IgniteMultiThreadOnOffTest {
    private static final String CACHE_DATA_REGION_PERSISTENCE = "PERSISTENCE_REGION";
    private static final String stringCacheOneName = "StringCacheOneName";

    @Test
    public void testFirstServer() throws Exception {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(false);
        cfg.setDiscoverySpi(tcpDiscoverySpi());
        cfg.setConsistentId("server_test_1");
        cfg.setIgniteInstanceName("server_test_1");
        cfg.setGridLogger(new Slf4jLogger());
        Ignite ignite = Ignition.start(cfg);

        CacheConfiguration<String, String> cacheOneConfiguration =
            new CacheConfiguration<>(stringCacheOneName);
        cacheOneConfiguration.setCacheMode(CacheMode.REPLICATED);
        IgniteCache<String, String> cacheOne = ignite.getOrCreateCache(cacheOneConfiguration);

        ignite.events().localListen(event -> {
            System.err.println("get event " + event.name() + " " + event);
            if (event instanceof DiscoveryEvent) {
                ClusterNode clusterNode = ((DiscoveryEvent) event).eventNode();

                String item = "event_" + clusterNode.consistentId();
                System.err.println("-- oldest node process the message: " + item);

                switch (event.type()) {
                    case EventType.EVT_NODE_JOINED:
                        cacheOne.put(item, item);
                        // System.err.println("--- do add async " + item);
                        // ForkJoinPool.commonPool().execute(() -> cacheOne.put(item, item));
                        break;
                    case EventType.EVT_NODE_FAILED:
                    case EventType.EVT_NODE_LEFT:
                        // will block
                        cacheOne.remove(item);
                        // System.err.println("--- do remove async " + item);
                        // ForkJoinPool.commonPool().execute(() -> cacheOne.remove(item));
                        break;
                    default:
                        System.err.println("ignore discovery event: " + event);
                        break;
                }
                return true;
            } else {
                return false;
            }
        }, EventType.EVTS_DISCOVERY);

        while (true) {
            TimeUnit.SECONDS.sleep(5);
            System.err.println(" " + cacheOne.size());
            System.err.println(ignite.cacheNames().stream().collect(Collectors.joining(",")));
        }
    }

    @Test
    public void testMultiServer() throws IOException {
        ExecutorService executorService = Executors.newCachedThreadPool();
        long start = System.currentTimeMillis();
        CompletableFuture<ArrayList<Ignite>> allIgnite = IntStream.range(0, 5)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> testServer(i), executorService))
                .map(f -> f.thenApply(i -> Lists.newArrayList(i)))
                .reduce((x, y) -> x.thenCombine(y, (a, b) -> {
                    a.addAll(b);
                    return a;
                })).get();

        allIgnite.thenAccept(list -> {
            System.err.println("start use time ms " + (System.currentTimeMillis() - start));
            list.forEach(n -> System.err.println("n.id " + n.cluster().localNode().consistentId()));
            try {
                TimeUnit.SECONDS.sleep(20);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            System.err.println("going to close ignite, see first server");
            list.forEach(n -> n.close());
        }).join();
    }

    private Ignite testServer(int index) {
        String name =
Re: ignite cluster hang with GridCachePartitionExchangeManager
Hello! Please check that there are no connectivity problems in your cluster, i.e. that all nodes can open communication and discovery connections to all other nodes. From what I observe in the log, there are massive problems with cluster stability:

23:48:44.624 [tcp-disco-sock-reader-#48%test01%] DEBUG o.a.i.s.d.tcp.TcpDiscoverySpi - Message has been added to queue: TcpDiscoveryNodeLeftMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=230e516f-6c12-4391-b902-822afc6f7bc4, id=12c221c7561-710fac06-6272-42f7-a8a8-8f0861c36c63, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]

23:48:44.631 [tcp-disco-msg-worker-#2%test01%] DEBUG o.a.i.s.d.tcp.TcpDiscoverySpi - Removed node from topology: TcpDiscoveryNode [id=230e516f-6c12-4391-b902-822afc6f7bc4, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.10.103], sockAddrs=[/0:0:0:0:0:0:0:1:8315, /127.0.0.1:8315, /192.168.10.103:8315], discPort=8315, order=70, intOrder=38, lastExchangeTime=1535384891465, loc=false, ver=2.5.0#20180524-sha1:86e110c7, isClient=false]

23:48:44.637 [tcp-disco-msg-worker-#2%test01%] DEBUG o.a.i.s.d.tcp.TcpDiscoverySpi - Discarding node left message since sender node is not in topology: TcpDiscoveryNodeLeftMessage [super=TcpDiscoveryAbstractMessage [sndNodeId=230e516f-6c12-4391-b902-822afc6f7bc4, id=62c221c7561-c261b07c-9495-4058-889b-bd484be10477, verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]

and then finally

23:48:44.738 [grid-nio-worker-tcp-comm-1-#26%test01%] DEBUG o.a.i.s.c.tcp.TcpCommunicationSpi - Remote client closed connection: GridSelectorNioSessionImpl [worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=1, bytesRcvd=28387, bytesSent=563858, bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker [name=grid-nio-worker-tcp-comm-1, igniteInstanceName=test01, finished=false, hashCode=686001483, interrupted=false, runner=grid-nio-worker-tcp-comm-1-#26%test01%]]], writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768], inRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=28, sentCnt=28, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=c261b07c-9495-4058-889b-bd484be10477, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.10.103], sockAddrs=[/0:0:0:0:0:0:0:1:8312, /127.0.0.1:8312, /192.168.10.103:8312], discPort=8312, order=68, intOrder=36, lastExchangeTime=1535384891444, loc=false, ver=2.5.0#20180524-sha1:86e110c7, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], outRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=28, sentCnt=28, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode [id=c261b07c-9495-4058-889b-bd484be10477, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.10.103], sockAddrs=[/0:0:0:0:0:0:0:1:8312, /127.0.0.1:8312, /192.168.10.103:8312], discPort=8312, order=68, intOrder=36, lastExchangeTime=1535384891444, loc=false, ver=2.5.0#20180524-sha1:86e110c7, isClient=false], connected=true, connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false], super=GridNioSessionImpl [locAddr=/0:0:0:0:0:0:0:1:8410, rmtAddr=/0:0:0:0:0:0:0:1:64102, createTime=1535384893005, closeTime=0, bytesSent=116175, bytesRcvd=10585, bytesSent0=0, bytesRcvd0=0, sndSchedTime=1535384893005, lastSndTime=1535384920458, lastRcvTime=1535384920458, readsPaused=false, filterChain=FilterChain[filters=[GridNioCodecFilter [parser=org.apache.ignite.internal.util.nio.GridDirectParser@47fbc56, directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]

23:48:44.904 [tcp-disco-sock-reader-#48%test01%] ERROR o.a.i.s.d.tcp.TcpDiscoverySpi - Caught exception on message read [sock=Socket[addr=/0:0:0:0:0:0:0:1,port=64098,localport=8310], locNodeId=016f5c35-ac7d-4391-9142-1a7aea1c3378, rmtNodeId=230e516f-6c12-4391-b902-822afc6f7bc4]
org.apache.ignite.IgniteCheckedException: Failed to deserialize object with given class loader: sun.misc.Launcher$AppClassLoader@18b4aac2
    at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:147)
    at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
    at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9907)
    at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5981)
    at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.io.EOFException: null
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
    at
Re: ignite cluster hang with GridCachePartitionExchangeManager
About question 2, the debug-level log looks like this: I start node a, then nodes b, c, d, e, f concurrently in multiple threads, then close them all. In the debug log, a server latch is created with participantsSize=5, but only one countdown arrives, so the latch hangs. The key log lines are:

>>> ++
>>> Topology snapshot.
>>> ++
>>> Ignite instance name: test01
>>> Number of server nodes: 5
>>> Number of client nodes: 0
>>> Topology version: 71
>>> Local: 016F5C35-AC7D-4391-9142-1A7AEA1C3378, [192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 21, Mac OS X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment 1.8.0_131-b11
>>> Remote: 58973B59-DB04-4325-9966-D259A7AE3BBD, [192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 66, Mac OS X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment 1.8.0_131-b11
>>> Remote: A79A57F0-D3A4-4ECA-890B-0A043BB0D23F, [192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 67, Mac OS X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment 1.8.0_131-b11
>>> Remote: C261B07C-9495-4058-889B-BD484BE10477, [192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 68, Mac OS X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment 1.8.0_131-b11
>>> Remote: 230E516F-6C12-4391-B902-822AFC6F7BC4, [192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 70, Mac OS X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment 1.8.0_131-b11
>>> Total number of CPUs: 4
>>> Total heap size: 3.6GB
>>> Total offheap size: 6.4GB
..
23:48:44.629 [exchange-worker-#42%test01%] DEBUG o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager - Server latch is created [latch=IgniteBiTuple [val1=exchange, val2=AffinityTopologyVersion [topVer=71, minorTopVer=0]], participantsSize=5]
..
23:48:44.635 [exchange-worker-#42%test01%] DEBUG o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager - Count down + [latch=exchange-AffinityTopologyVersion [topVer=71, minorTopVer=0], remaining=4]
..
23:48:54.639 [exchange-worker-#42%test01%] WARN o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture - Unable to await partitions release latch within timeout: ServerLatch [permits=4, pendingAcks=[230e516f-6c12-4391-b902-822afc6f7bc4, 58973b59-db04-4325-9966-d259a7ae3bbd, a79a57f0-d3a4-4eca-890b-0a043bb0d23f, c261b07c-9495-4058-889b-bd484be10477], super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion [topVer=71, minorTopVer=0]]]

The full log is in the attachment, question2.log. Thanks.
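The failure mode in the log above can be modeled with a plain java.util.concurrent.CountDownLatch: the server latch is created for 5 participants, only one countdown ever arrives, and waiting on it can only time out, which is exactly the "Unable to await partitions release latch" warning. A minimal JDK analogy (LatchDemo is an illustrative name, not Ignite code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchDemo {
    // Mirrors the ServerLatch in the log: created for `participants` nodes,
    // counted down once per ack. If acks are missing, await() can only time
    // out and returns false.
    static boolean awaitExchange(int participants, int acksReceived, long timeoutMs)
            throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(participants);
        for (int i = 0; i < acksReceived; i++)
            latch.countDown();
        return latch.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // participantsSize=5, one countdown, remaining=4 -> hangs until timeout
        System.out.println(awaitExchange(5, 1, 100)); // prints false
        System.out.println(awaitExchange(5, 5, 100)); // prints true
    }
}
```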
Re: ignite cluster hang with GridCachePartitionExchangeManager
Hello! 1. As far as my understanding goes, there is no handling of OOM in Apache Ignite that would guarantee it will not crash the cluster, so you should be extra careful with that. This is because after an OOM the node doesn't get a chance to quit gracefully; other nodes may or may not be able to eventually remove it. 2. It's hard to say what happens here. Can you provide logs? Regards, -- Ilya Kasnacheev

Fri, 24 Aug 2018 at 19:49, wangsan:
> My cluster topology is nodes a, b, c, d, all with persistence enabled and
> peer class loading disabled. b, c, d have a different class (cache b) from a.
> 1. When any node crashes with OOM (heap or stack), all nodes hang with
> "Still waiting for initial partition map exchange".
> 2. When a starts first and b, c, d start concurrently in multiple threads,
> b, c, d hang with "Still waiting for initial partition map exchange" and a
> hangs with "Unable to await partitions release latch".
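Given point 1 above, a node that hits OOM cannot quit gracefully, a common mitigation is to make the JVM terminate immediately so the rest of the cluster detects the failure quickly. A sketch under stated assumptions: the `-XX:+ExitOnOutOfMemoryError` flag is available since JDK 8u92, and the failure-handler API shown below was added in Ignite 2.6 (this thread uses 2.5), so verify both against your versions:

```java
// JVM side: add to the node's JVM options so an OutOfMemoryError kills the
// process at once instead of leaving a half-dead node in the topology:
//     -XX:+ExitOnOutOfMemoryError

// Ignite side (2.6+): stop or halt the node on critical failures.
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class FailFastConfig {
    static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // First tries to stop the node gracefully; halts the JVM if that fails.
        cfg.setFailureHandler(new StopNodeOrHaltFailureHandler());
        return cfg;
    }
}
```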
ignite cluster hang with GridCachePartitionExchangeManager
My cluster topology is nodes a, b, c, d, all with persistence enabled and peer class loading disabled. b, c, d have a different class (cache b) from a. 1. When any node crashes with OOM (heap or stack), all nodes hang with "Still waiting for initial partition map exchange". 2. When a starts first and b, c, d start concurrently in multiple threads, b, c, d hang with "Still waiting for initial partition map exchange", and a hangs with "Unable to await partitions release latch".