Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-09-11 Thread wangsan
Yes, it was blocked when doing cache operations in the discovery event
listener while node-left events arrived concurrently. I now do the cache
operation in another thread, so the listener is no longer blocked.
The original cause may be that the discovery event processor holds the server
latch: when a cache operation is done in the listener, the server latch gets
blocked (why exactly, I am not sure).
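
A minimal sketch of that workaround (the class and method names here are
illustrative, not from my actual code; it assumes a replicated <String, String>
cache like the one in the test further down this thread):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;

public class NonBlockingDiscoveryListener {
    /** Registers a listener that never touches the cache on the discovery thread. */
    public static void register(Ignite ignite, IgniteCache<String, String> cache) {
        // Single worker thread: cache updates queue here and run once the
        // partition map exchange has completed, instead of blocking the
        // discovery thread that the exchange itself is waiting on.
        ExecutorService cacheUpdater = Executors.newSingleThreadExecutor();

        ignite.events().localListen(event -> {
            if (!(event instanceof DiscoveryEvent))
                return true;

            ClusterNode node = ((DiscoveryEvent) event).eventNode();
            String item = "event_" + node.consistentId();

            switch (event.type()) {
                case EventType.EVT_NODE_JOINED:
                    cacheUpdater.execute(() -> cache.put(item, item));
                    break;
                case EventType.EVT_NODE_LEFT:
                case EventType.EVT_NODE_FAILED:
                    cacheUpdater.execute(() -> cache.remove(item));
                    break;
                default:
                    break;
            }
            return true; // keep listening for discovery events
        }, EventType.EVTS_DISCOVERY);
    }
}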



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-09-07 Thread eugene miretsky
Hi Wangsan,
So what was the original cause of the issue? Was it blocking the listening
thread in your test code or something else?

We are having similar issues

Cheers,
Eugene

On Mon, Sep 3, 2018 at 1:23 PM Ilya Kasnacheev wrote:

> Hello!
>
> The operation will execute after the partition map exchange (or maybe
> several of them). Just be sure to avoid waiting on the operation from the
> discovery event listener.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> Mon, 3 Sep 2018 at 17:37, wangsan:
>
>> Thanks!
>>
>> Can I do cache operations (updating a cache item) in another thread from a
>> discovery event listener? And will that operation execute concurrently, or
>> execute before the partition map exchange?
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>
>


Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-09-03 Thread Ilya Kasnacheev
Hello!

The operation will execute after the partition map exchange (or maybe several
of them). Just be sure to avoid waiting on the operation from the discovery
event listener.
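
For example (just a sketch, using the cacheOne/item names from the test later
in this thread): hand the operation to another thread and, if you need its
result, wait for it anywhere except the discovery thread.

import java.util.concurrent.CompletableFuture;

import org.apache.ignite.IgniteCache;

class DeferredCacheUpdate {
    /** Call this from the discovery event listener: it only schedules the work. */
    static CompletableFuture<Void> scheduleRemove(IgniteCache<String, String> cache, String item) {
        // The remove itself still blocks until the partition map exchange
        // finishes, but it blocks a common-pool thread, not the discovery thread.
        return CompletableFuture.runAsync(() -> cache.remove(item));
    }
}

// Elsewhere (a test thread, a service thread, ...) it is safe to wait on it:
//     DeferredCacheUpdate.scheduleRemove(cacheOne, item).join();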

Regards,
-- 
Ilya Kasnacheev


Mon, 3 Sep 2018 at 17:37, wangsan:

> Thanks!
>
> Can I do cache operations (updating a cache item) in another thread from a
> discovery event listener? And will that operation execute concurrently, or
> execute before the partition map exchange?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-09-03 Thread wangsan
Thanks!

Can I do cache operations (updating a cache item) in another thread from a
discovery event listener? And will that operation execute concurrently, or
execute before the partition map exchange?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-08-31 Thread Ilya Kasnacheev
Hello!

I'm fairly confident that you should not attempt to do any cache operations
from discovery threads, e.g. from event listeners.

When the topology changes, you can't expect any cache operations to complete
before the partition map exchange is done.
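
Concretely, the commented-out lines in your test are the right direction.
A sketch of just that branch (assuming a java.util.concurrent.ForkJoinPool
import and the same cacheOne/item variables as in your listener):

// Inside the discovery event listener of IgniteMultiThreadOnOffTest:
case EventType.EVT_NODE_FAILED:
case EventType.EVT_NODE_LEFT:
    // Do NOT call cacheOne.remove(item) here: the discovery thread would then
    // wait on a cache operation that itself waits for the partition map
    // exchange this very listener is holding up.
    ForkJoinPool.commonPool().execute(() -> cacheOne.remove(item));
    break;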

Regards,
-- 
Ilya Kasnacheev


Wed, 29 Aug 2018 at 6:24, wangsan:

> I can reproduce the bug; the log above is what the server (first) node
> prints when I stop the other nodes.
>

Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-08-28 Thread wangsan
I can reproduce the bug; the log above is what the server (first) node prints
when I stop the other nodes.



import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.DiscoveryEvent;
import org.apache.ignite.events.EventType;
import org.apache.ignite.internal.managers.discovery.IgniteDiscoverySpi;
import org.apache.ignite.logger.slf4j.Slf4jLogger;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.multicast.TcpDiscoveryMulticastIpFinder;
import org.junit.Test;

import com.google.common.collect.Lists;

/**
 * @author wangsan
 * @date 2018/08/23
 */
public class IgniteMultiThreadOnOffTest {
    private static final String CACHE_DATA_REGION_PERSISTENCE = "PERSISTENCE_REGION";
    private static final String stringCacheOneName = "StringCacheOneName";

    @Test
    public void testFirstServer() throws Exception {
        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setClientMode(false);
        cfg.setDiscoverySpi(tcpDiscoverySpi());
        cfg.setConsistentId("server_test_1");
        cfg.setIgniteInstanceName("server_test_1");
        cfg.setGridLogger(new Slf4jLogger());
        Ignite ignite = Ignition.start(cfg);

        // The archive stripped the generic type arguments; <String, String> is assumed here.
        CacheConfiguration<String, String> cacheOneConfiguration =
            new CacheConfiguration<>(stringCacheOneName);
        cacheOneConfiguration.setCacheMode(CacheMode.REPLICATED);
        IgniteCache<String, String> cacheOne = ignite.getOrCreateCache(cacheOneConfiguration);

        ignite.events().localListen(event -> {
            System.err.println("get event " + event.name() + " " + event);
            if (event instanceof DiscoveryEvent) {
                ClusterNode clusterNode = ((DiscoveryEvent) event).eventNode();

                String item = "event_" + clusterNode.consistentId();
                System.err.println("--  oldest node process the message : " + item);

                switch (event.type()) {
                    case EventType.EVT_NODE_JOINED:
                        cacheOne.put(item, item);
                        // System.err.println("--- do add async" + item);
                        // ForkJoinPool.commonPool().execute(() -> cacheOne.put(item, item));
                        break;
                    case EventType.EVT_NODE_FAILED:
                    case EventType.EVT_NODE_LEFT:
                        // will block
                        cacheOne.remove(item);
                        // System.err.println("--- do remove async " + item);
                        // ForkJoinPool.commonPool().execute(() -> cacheOne.remove(item));
                        break;
                    default:
                        System.err.println("ignore discovery event:" + event);
                        break;
                }
                return true;
            } else {
                return false;
            }
        }, EventType.EVTS_DISCOVERY);

        while (true) {
            TimeUnit.SECONDS.sleep(5);

            System.err.println(" " + cacheOne.size());
            System.err.println(ignite.cacheNames().stream().collect(Collectors.joining(",")));
        }
    }

    @Test
    public void testMultiServer() throws IOException {
        ExecutorService executorService = Executors.newCachedThreadPool();
        long start = System.currentTimeMillis();
        // The archive stripped the generic type arguments; ArrayList<Ignite> is assumed here.
        CompletableFuture<ArrayList<Ignite>> allIgnite =
            IntStream.range(0, 5)
                .mapToObj(i -> CompletableFuture.supplyAsync(() -> testServer(i), executorService))
                .map(f -> f.thenApply(i -> Lists.newArrayList(i)))
                .reduce((x, y) -> x.thenCombine(y, (a, b) -> {
                    a.addAll(b);
                    return a;
                })).get();

        allIgnite.thenAccept(list -> {
            System.err.println("start use time ms " + (System.currentTimeMillis() - start));
            list.forEach(n -> System.err.println("n.id " + n.cluster().localNode().consistentId()));
            try {
                TimeUnit.SECONDS.sleep(20);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
            System.err.println("going to close ignite,see first server ");
            list.forEach(n -> n.close());
        }).join();
    }

    private Ignite testServer(int index) {
        String name =

Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-08-28 Thread Ilya Kasnacheev
Hello!

Please check that there are no problems with connectivity in your cluster,
i.e. that all nodes can open communication and discovery connections to all
other nodes.

From what I observe in the log, there are massive problems with cluster
stability:
23:48:44.624 [tcp-disco-sock-reader-#48%test01%] DEBUG
o.a.i.s.d.tcp.TcpDiscoverySpi  - Message has been added to queue:
TcpDiscoveryNodeLeftMessage [super=TcpDiscoveryAbstractMessage
[sndNodeId=230e516f-6c12-4391-b902-822afc6f7bc4,
id=12c221c7561-710fac06-6272-42f7-a8a8-8f0861c36c63, verifierNodeId=null,
topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]
23:48:44.631 [tcp-disco-msg-worker-#2%test01%] DEBUG
o.a.i.s.d.tcp.TcpDiscoverySpi  - Removed node from topology:
TcpDiscoveryNode [id=230e516f-6c12-4391-b902-822afc6f7bc4,
addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.10.103],
sockAddrs=[/0:0:0:0:0:0:0:1:8315, /127.0.0.1:8315, /192.168.10.103:8315],
discPort=8315, order=70, intOrder=38, lastExchangeTime=1535384891465,
loc=false, ver=2.5.0#20180524-sha1:86e110c7, isClient=false]
23:48:44.637 [tcp-disco-msg-worker-#2%test01%] DEBUG
o.a.i.s.d.tcp.TcpDiscoverySpi  - Discarding node left message since sender
node is not in topology: TcpDiscoveryNodeLeftMessage
[super=TcpDiscoveryAbstractMessage
[sndNodeId=230e516f-6c12-4391-b902-822afc6f7bc4,
id=62c221c7561-c261b07c-9495-4058-889b-bd484be10477, verifierNodeId=null,
topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]
and then finally
23:48:44.738 [grid-nio-worker-tcp-comm-1-#26%test01%] DEBUG
o.a.i.s.c.tcp.TcpCommunicationSpi  - Remote client closed connection:
GridSelectorNioSessionImpl [worker=DirectNioClientWorker
[super=AbstractNioClientWorker [idx=1, bytesRcvd=28387, bytesSent=563858,
bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-1, igniteInstanceName=test01,
finished=false, hashCode=686001483, interrupted=false,
runner=grid-nio-worker-tcp-comm-1-#26%test01%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=28,
sentCnt=28, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode
 [id=c261b07c-9495-4058-889b-bd484be10477, addrs=[0:0:0:0:0:0:0:1,
127.0.0.1, 192.168.10.103], sockAddrs=[/0:0:0:0:0:0:0:1:8312, /
127.0.0.1:8312, /192.168.10.103:8312], discPort=8312, order=68,
intOrder=36, lastExchangeTime=1535384891444, loc=false,
ver=2.5.0#20180524-sha1:86e110c7, isClient=false], connected=true,
connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=28,
sentCnt=28, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode
[id=c261b07c-9495-4058-889b-bd484be10477, addrs=[0:0:0:0:0:0:0:1,
127.0.0.1, 192.168.10.103], sockAddrs=[/0:0:0:0:0:0:0:1:8312, /
127.0.0.1:8312, /192.168.10.103:8312], discPort=8312, order=68,
intOrder=36, lastExchangeTime=1535384891444, loc=false,
ver=2.5.0#20180524-sha1:86e110c7, isClient=false], connected=true,
connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/0:0:0:0:0:0:0:1:8410,
rmtAddr=/0:0:0:0:0:0:0:1:64102, createTime=1535384893005, closeTime=0,
bytesSent=116175, bytesRcvd=10585, bytesSent0=0, bytesRcvd0=0,
sndSchedTime=1535384893005, lastSndTime=1535384920458,
lastRcvTime=1535384920458, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=org.apache.ignite.internal.util.nio.GridDirectParser@47fbc56,
directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]
23:48:44.904 [tcp-disco-sock-reader-#48%test01%] ERROR
o.a.i.s.d.tcp.TcpDiscoverySpi  - Caught exception on message read
[sock=Socket[addr=/0:0:0:0:0:0:0:1,port=64098,localport=8310],
locNodeId=016f5c35-ac7d-4391-9142-1a7aea1c3378,
rmtNodeId=230e516f-6c12-4391-b902-822afc6f7bc4]
org.apache.ignite.IgniteCheckedException: Failed to deserialize object with
given class loader: sun.misc.Launcher$AppClassLoader@18b4aac2
at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:147)
at
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
at
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9907)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5981)
at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.io.EOFException: null
at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
at
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
at
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
at

Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-08-27 Thread wangsan wang
About question 2, the debug-level log looks like this:
I start node a, then nodes b, c, d, e, f concurrently in multiple threads, and
then close them all. In the debug log, a server latch is created with
participantsSize=5 but only one count-down happens, so the latch hangs.

A short excerpt of the log:

> >>> ++
> >>> Topology snapshot.
> >>> ++
> >>> Ignite instance name: test01
> >>> Number of server nodes: 5
> >>> Number of client nodes: 0
> >>> Topology version: 71
> >>> Local: 016F5C35-AC7D-4391-9142-1A7AEA1C3378, [
> 192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 21, Mac OS
> X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment
> 1.8.0_131-b11
> >>> Remote: 58973B59-DB04-4325-9966-D259A7AE3BBD, [
> 192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 66, Mac OS
> X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment
> 1.8.0_131-b11
> >>> Remote: A79A57F0-D3A4-4ECA-890B-0A043BB0D23F, [
> 192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 67, Mac OS
> X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment
> 1.8.0_131-b11
> >>> Remote: C261B07C-9495-4058-889B-BD484BE10477, [
> 192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 68, Mac OS
> X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment
> 1.8.0_131-b11
> >>> Remote: 230E516F-6C12-4391-B902-822AFC6F7BC4, [
> 192.168.10.103/0:0:0:0:0:0:0:1, /127.0.0.1, /192.168.10.103], 70, Mac OS
> X x86_64 10.11.6, wangqingpeng, Java(TM) SE Runtime Environment
> 1.8.0_131-b11
> >>> Total number of CPUs: 4
> >>> Total heap size: 3.6GB
> >>> Total offheap size: 6.4GB
> ..
> 23:48:44.629 [exchange-worker-#42%test01%] DEBUG
> o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager  - Server latch is created
> [latch=IgniteBiTuple [val1=exchange, val2=AffinityTopologyVersion
> [topVer=71, minorTopVer=0]], participantsSize=5]
> ..
> 23:48:44.635 [exchange-worker-#42%test01%] DEBUG
> o.a.i.i.p.c.d.d.p.l.ExchangeLatchManager  - Count down +
> [latch=exchange-AffinityTopologyVersion [topVer=71, minorTopVer=0],
> remaining=4]
> ..
> 23:48:54.639 [exchange-worker-#42%test01%] WARN
> o.a.i.i.p.c.d.d.p.GridDhtPartitionsExchangeFuture  - Unable to await
> partitions release latch within timeout: ServerLatch [permits=4,
> pendingAcks=[230e516f-6c12-4391-b902-822afc6f7bc4,
> 58973b59-db04-4325-9966-d259a7ae3bbd, a79a57f0-d3a4-4eca-890b-0a043bb0d23f,
> c261b07c-9495-4058-889b-bd484be10477], super=CompletableLatch [id=exchange,
> topVer=AffinityTopologyVersion [topVer=71, minorTopVer=0]]]



The full log is in the attachment.

Thanks


question2.log
Description: Binary data


Re: ignite cluster hang with GridCachePartitionExchangeManager

2018-08-27 Thread Ilya Kasnacheev
Hello!

1. As far as my understanding goes, there is no handling of OOM in Apache
Ignite that would guarantee it does not take the cluster down, so you should
be extra careful with that. After an OOM the node doesn't get a chance to quit
gracefully; maybe the other nodes will eventually be able to remove it, maybe
not.

2. It's hard to say what happens here. Can you provide logs?

Regards,

-- 
Ilya Kasnacheev


Fri, 24 Aug 2018 at 19:49, wangsan:

> My cluster topology is nodes a, b, c, d, all with persistence enabled and
> peer class loading disabled. Nodes b, c, d have a different class (cache b)
> than a.
> 1. When any node crashes with an OOM (heap or stack), all nodes hang with
> "Still waiting for initial partition map exchange".
> 2. When a starts first and b, c, d start concurrently in multiple threads,
> b, c, d hang with "Still waiting for initial partition map exchange", and a
> hangs with "Unable to await partitions release latch".
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


ignite cluster hang with GridCachePartitionExchangeManager

2018-08-24 Thread wangsan
My cluster topology is nodes a, b, c, d, all with persistence enabled and
peer class loading disabled. Nodes b, c, d have a different class (cache b)
than a.
1. When any node crashes with an OOM (heap or stack), all nodes hang with
"Still waiting for initial partition map exchange".
2. When a starts first and b, c, d start concurrently in multiple threads,
b, c, d hang with "Still waiting for initial partition map exchange", and a
hangs with "Unable to await partitions release latch".





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/