Hello!

Please check that there are no problems with connectivity in your cluster,
i.e. that all nodes can open communication and discovery connections to all
other nodes.
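
If it helps, a plain socket probe run from each node is usually enough for
this check. Below is a minimal sketch, not Ignite code; the host and the
ports (8310-8315 for discovery, 8410 for communication) are only examples
taken from your log, so substitute the real addresses and port ranges of
every node:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ConnectivityCheck {
        public static void main(String[] args) {
            // Assumed values from the log below; replace with the actual
            // discovery and communication endpoints of all cluster nodes.
            String[] hosts = {"192.168.10.103"};
            int[] ports = {8310, 8315, 8410};

            for (String host : hosts) {
                for (int port : ports) {
                    try (Socket sock = new Socket()) {
                        // Fail fast if the port is unreachable or filtered.
                        sock.connect(new InetSocketAddress(host, port), 2000);
                        System.out.println("OK   " + host + ":" + port);
                    }
                    catch (IOException e) {
                        System.out.println("FAIL " + host + ":" + port + " (" + e + ")");
                    }
                }
            }
        }
    }

Run it from every node against every other node: one-way reachability is
not enough, since both discovery and communication connections are opened
in both directions.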

From what I observe in the log, there are massive problems with cluster
stability:
23:48:44.624 [tcp-disco-sock-reader-#48%test01%] DEBUG
o.a.i.s.d.tcp.TcpDiscoverySpi  - Message has been added to queue:
TcpDiscoveryNodeLeftMessage [super=TcpDiscoveryAbstractMessage
[sndNodeId=230e516f-6c12-4391-b902-822afc6f7bc4,
id=12c221c7561-710fac06-6272-42f7-a8a8-8f0861c36c63, verifierNodeId=null,
topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]
23:48:44.631 [tcp-disco-msg-worker-#2%test01%] DEBUG
o.a.i.s.d.tcp.TcpDiscoverySpi  - Removed node from topology:
TcpDiscoveryNode [id=230e516f-6c12-4391-b902-822afc6f7bc4,
addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 192.168.10.103],
sockAddrs=[/0:0:0:0:0:0:0:1:8315, /127.0.0.1:8315, /192.168.10.103:8315],
discPort=8315, order=70, intOrder=38, lastExchangeTime=1535384891465,
loc=false, ver=2.5.0#20180524-sha1:86e110c7, isClient=false]
23:48:44.637 [tcp-disco-msg-worker-#2%test01%] DEBUG
o.a.i.s.d.tcp.TcpDiscoverySpi  - Discarding node left message since sender
node is not in topology: TcpDiscoveryNodeLeftMessage
[super=TcpDiscoveryAbstractMessage
[sndNodeId=230e516f-6c12-4391-b902-822afc6f7bc4,
id=62c221c7561-c261b07c-9495-4058-889b-bd484be10477, verifierNodeId=null,
topVer=0, pendingIdx=0, failedNodes=null, isClient=false]]
Here node 230e516f-6c12-4391-b902-822afc6f7bc4 is removed from the topology,
and a later message from it is discarded because the sender is no longer a
cluster member. And then, finally:
23:48:44.738 [grid-nio-worker-tcp-comm-1-#26%test01%] DEBUG
o.a.i.s.c.tcp.TcpCommunicationSpi  - Remote client closed connection:
GridSelectorNioSessionImpl [worker=DirectNioClientWorker
[super=AbstractNioClientWorker [idx=1, bytesRcvd=28387, bytesSent=563858,
bytesRcvd0=0, bytesSent0=0, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-1, igniteInstanceName=test01,
finished=false, hashCode=686001483, interrupted=false,
runner=grid-nio-worker-tcp-comm-1-#26%test01%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=28,
sentCnt=28, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode
 [id=c261b07c-9495-4058-889b-bd484be10477, addrs=[0:0:0:0:0:0:0:1,
127.0.0.1, 192.168.10.103], sockAddrs=[/0:0:0:0:0:0:0:1:8312, /
127.0.0.1:8312, /192.168.10.103:8312], discPort=8312, order=68,
intOrder=36, lastExchangeTime=1535384891444, loc=false,
ver=2.5.0#20180524-sha1:86e110c7, isClient=false], connected=true,
connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=0, resendCnt=0, rcvCnt=28,
sentCnt=28, reserved=true, lastAck=0, nodeLeft=false, node=TcpDiscoveryNode
[id=c261b07c-9495-4058-889b-bd484be10477, addrs=[0:0:0:0:0:0:0:1,
127.0.0.1, 192.168.10.103], sockAddrs=[/0:0:0:0:0:0:0:1:8312, /
127.0.0.1:8312, /192.168.10.103:8312], discPort=8312, order=68,
intOrder=36, lastExchangeTime=1535384891444, loc=false,
ver=2.5.0#20180524-sha1:86e110c7, isClient=false], connected=true,
connectCnt=0, queueLimit=4096, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/0:0:0:0:0:0:0:1:8410,
rmtAddr=/0:0:0:0:0:0:0:1:64102, createTime=1535384893005, closeTime=0,
bytesSent=116175, bytesRcvd=10585, bytesSent0=0, bytesRcvd0=0,
sndSchedTime=1535384893005, lastSndTime=1535384920458,
lastRcvTime=1535384920458, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=org.apache.ignite.internal.util.nio.GridDirectParser@47fbc56,
directMode=true], GridConnectionBytesVerifyFilter], accepted=true]]
23:48:44.904 [tcp-disco-sock-reader-#48%test01%] ERROR
o.a.i.s.d.tcp.TcpDiscoverySpi  - Caught exception on message read
[sock=Socket[addr=/0:0:0:0:0:0:0:1,port=64098,localport=8310],
locNodeId=016f5c35-ac7d-4391-9142-1a7aea1c3378,
rmtNodeId=230e516f-6c12-4391-b902-822afc6f7bc4]
org.apache.ignite.IgniteCheckedException: Failed to deserialize object with
given class loader: sun.misc.Launcher$AppClassLoader@18b4aac2
        at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:147)
        at
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
        at
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9907)
        at
org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5981)
        at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.io.EOFException: null
        at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2638)
        at
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3113)
        at
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:853)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:349)
        at
org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:43)
        at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:137)
        ... 4 common frames omitted
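
By the way, the EOFException itself just means the remote side closed the
socket before the stream header of the next message arrived, so it is a
symptom of the dropped connection rather than a separate problem. A tiny
illustration with plain JDK serialization (not Ignite code): constructing
an ObjectInputStream over an already-exhausted stream fails exactly like
this, in readStreamHeader():

    import java.io.ByteArrayInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.ObjectInputStream;

    public class EofDemo {
        public static void main(String[] args) {
            // An empty stream stands in for a connection that was closed
            // before any bytes of the serialized message were sent.
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(new byte[0]))) {
                // Never reached: the constructor reads the stream header first.
            }
            catch (EOFException e) {
                System.out.println("EOFException from readStreamHeader: " + e);
            }
            catch (IOException e) {
                e.printStackTrace();
            }
        }
    }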

Regards,
-- 
Ilya Kasnacheev


On Tue, 28 Aug 2018 at 05:13, wangsan wang <[email protected]> wrote:

> Attaching the full log:
>
>
> On Mon, 27 Aug 2018 at 18:43, Ilya Kasnacheev <[email protected]> wrote:
>
>> Hello!
>>
>> 1. As far as my understanding goes, there is no handling of OOM in
>> Apache Ignite that would guarantee the cluster survives it, so you should
>> be extra careful there. After an OOM the node doesn't get a chance to
>> leave the cluster gracefully; other nodes may eventually manage to remove
>> it, or they may not.
>>
>> 2. It's hard to say what happens here. Can you provide logs?
>>
>> Regards,
>>
>> --
>> Ilya Kasnacheev
>>
>>
>> On Fri, 24 Aug 2018 at 19:49, wangsan <[email protected]> wrote:
>>
>>> My cluster topology is now nodes a, b, c, d, all with persistence
>>> enabled and peer class loading disabled. Nodes b, c, d have a different
>>> class (cache b) than node a.
>>> 1. When any node crashes with an OOM (heap or stack), all nodes hang
>>> with "Still waiting for initial partition map exchange".
>>> 2. When a starts first and b, c, d then start concurrently from
>>> multiple threads, b, c, d hang with "Still waiting for initial partition
>>> map exchange" and a hangs with "Unable to await partitions release
>>> latch".
>>>
>>>
>>
