Re: Starvation in striped pool

2019-10-18 Thread Ilya Kasnacheev
Hello!

Well, we had IGNITE_ENABLE_FORCIBLE_NODE_KILL, but the best solution is, in
my opinion, to avoid adding anything unstable to the cluster.
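
For reference, a rough sketch of turning that flag on (it is a JVM system
property, so -DIGNITE_ENABLE_FORCIBLE_NODE_KILL=true on the server nodes is
equivalent; this is an illustration, not a recommendation):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteSystemProperties;
    import org.apache.ignite.Ignition;

    public class ServerWithForcibleKill {
        public static void main(String[] args) {
            // Must be set before the node starts; lets the server forcibly drop
            // client nodes it cannot open a communication connection to.
            System.setProperty(IgniteSystemProperties.IGNITE_ENABLE_FORCIBLE_NODE_KILL, "true");

            Ignite ignite = Ignition.start(); // server mode, default configuration
        }
    }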

Regards,
-- 
Ilya Kasnacheev


Fri, 18 Oct 2019 at 08:35, ihalilaltun :

> Hi Ilya,
>
> From time to time, we have faced exactly the same problem. Are there any
> best practices for handling network issues? What I mean is, if there are any
> network issues between client(s) and server(s), we want the cluster to keep
> running. As for the clients, they can be disconnected from the servers.
>
> Regards.
>
>
>
> -
> İbrahim Halil Altun
> Senior Software Engineer @ Segmentify
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Starvation in striped pool

2019-10-17 Thread ihalilaltun
Hi Ilya,

From time to time, we have faced exactly the same problem. Are there any best
practices for handling network issues? What I mean is, if there are any
network issues between client(s) and server(s), we want the cluster to keep
running. As for the clients, they can be disconnected from the servers.

Regards.



-
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Starvation in striped pool

2019-10-11 Thread Ilya Kasnacheev
Hello!

I don't think there is any problem with idleConnectionTimeout, but you *should
not* use nodes which are not mutually connectable anyway.

I can't really comment on the feasibility of dropping a client when it can't
be reached via Communication. You can start a discussion about it on the
developers list.

Regards,
-- 
Ilya Kasnacheev


Fri, 11 Oct 2019 at 11:31, maheshkr76private :

> >>I'm almost certain that the problem is that the server node cannot open a
> connection to the client node (and while it tries, it will reject connection
> attempts from the client node)
>
> The default idleConnectionTimeout of the TCP communication SPI is 6 minutes.
> So I assume that after this timeout the connection is closed and probably
> restarted later on a request from the client.
> So I can imagine this issue happening when the client and server are trying
> to re-establish the connection, and your explanation makes sense.
>
> However, my concern still remains. The server has plenty of timeouts in its
> communication SPI. Why wouldn't the server throw that client out of the
> cluster and let the client fail gracefully? This incessant pinging of the
> server by the client is a problem in production environments.
>
> Currently, the only way out for me seems to be to set
> >>>communicationSpi.setIdleConnectionTimeout(Long.MAX_VALUE);
> Testing is currently in progress and has been holding up for 22 hours now.
>
> Do you see any issues with setting idleConnectionTimeout that high?
>
>
>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Starvation in striped pool

2019-10-11 Thread maheshkr76private
>>I'm almost certain that the problem is that the server node cannot open a
connection to the client node (and while it tries, it will reject connection
attempts from the client node)

The default idleConnectionTimeout of the TCP communication SPI is 6 minutes.
So I assume that after this timeout the connection is closed and probably
restarted later on a request from the client.
So I can imagine this issue happening when the client and server are trying
to re-establish the connection, and your explanation makes sense.

However, my concern still remains. The server has plenty of timeouts in its
communication SPI. Why wouldn't the server throw that client out of the
cluster and let the client fail gracefully? This incessant pinging of the
server by the client is a problem in production environments.

Currently, the only way out for me seems to be to set
>>>communicationSpi.setIdleConnectionTimeout(Long.MAX_VALUE);
Testing is currently in progress and has been holding up for 22 hours now.

Do you see any issues with setting idleConnectionTimeout that high?
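
For completeness, the change under test is essentially the following (a rough
sketch of our client-side setup; only the communication SPI part is shown):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

    public class ClientWithLongIdleTimeout {
        public static void main(String[] args) {
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            // Effectively never close idle communication connections.
            commSpi.setIdleConnectionTimeout(Long.MAX_VALUE);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);
            cfg.setCommunicationSpi(commSpi);

            Ignite client = Ignition.start(cfg);
        }
    }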



  




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Starvation in striped pool

2019-10-10 Thread Ilya Kasnacheev
Hello!

I'm almost certain that the problem is that the server node cannot open a
connection to the client node (and while it tries, it will reject connection
attempts from the client node).

clientReconnectDisabled=true only concerns discovery. In your case, there are
no problems with discovery; the problem is with communication (port 47100).
Unless discovery fails, the node will not be dropped.
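
To illustrate the distinction, a rough sketch with the default ports (not your
exact configuration):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    public class DiscoveryVsCommunication {
        public static IgniteConfiguration configure() {
            // Discovery SPI (default port 47500): this is where clientReconnectDisabled lives.
            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setClientReconnectDisabled(true);

            // Communication SPI (default port 47100): the accepted/rejected incoming
            // connections in your log belong to this layer, which has its own
            // timeouts (idleConnectionTimeout, connectTimeout, maxConnectTimeout).
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(discoSpi);
            cfg.setCommunicationSpi(commSpi);
            return cfg;
        }
    }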

Regards,
-- 
Ilya Kasnacheev


Thu, 10 Oct 2019 at 08:48, maheshkr76private :

> Ilya.
>
> What is most mysterious to me is that I disabled reconnect for the thick
> client (clientReconnectDisabled=true). Still, the server prints the
> following, where the same thick client makes an immediate attempt to
> reconnect to the cluster while the previous connection attempt is still not
> successful.
>
> [00:47:46,004][INFO][grid-nio-worker-tcp-comm-1-#29][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
> rmtAddr=/192.168.1.71:2007]
> *[00:47:46,004][INFO][grid-nio-worker-tcp-comm-1-#29][TcpCommunicationSpi]
> Received incoming connection from remote node while connecting to this
> node,
> rejecting [locNode=3240558e-72b4-4314-a970-0965654e7e6f, locNodeOrder=1,
> rmtNode=e1be7b6f-8691-4f81-a03f-c77ff14843ef, rmtNodeOrder=675]*
>
> This, I feel, is internal Ignite code logic which does not seem to be
> influenced by the user's IgniteConfiguration parameters. Can you cross-check
> this? My point is, the node should not attempt a reconnect when the user has
> set clientReconnectDisabled=true in IgniteConfiguration.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Starvation in striped pool

2019-10-09 Thread maheshkr76private
Ilya. 

What is most mysterious to me is that I disabled reconnect for the thick
client (clientReconnectDisabled=true). Still, the server prints the following,
where the same thick client makes an immediate attempt to reconnect to the
cluster while the previous connection attempt is still not successful.

[00:47:46,004][INFO][grid-nio-worker-tcp-comm-1-#29][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.71:2007]
*[00:47:46,004][INFO][grid-nio-worker-tcp-comm-1-#29][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=3240558e-72b4-4314-a970-0965654e7e6f, locNodeOrder=1,
rmtNode=e1be7b6f-8691-4f81-a03f-c77ff14843ef, rmtNodeOrder=675]*

This, I feel, is internal Ignite code logic which does not seem to be
influenced by the user's IgniteConfiguration parameters. Can you cross-check
this? My point is, the node should not attempt a reconnect when the user has
set clientReconnectDisabled=true in IgniteConfiguration.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Starvation in striped pool

2019-10-09 Thread maheshkr76private
Attached are the logs. In the server log, you will see the thick client
continuously pinging the server indefinitely; there is no recovery of the
thick client. So the problem is, we can't even reboot the thick client in a
production scenario, as it doesn't even fail (meaning the configured failure
handler is not even triggered).
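
As a side note, here is a minimal sketch of how a stock failure handler would
be wired on the client. Our real setup uses a custom lambda (visible as
failureHnd=... in the configuration dump below), so StopNodeOrHaltFailureHandler
here is only an illustration:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

    public class ClientWithFailureHandler {
        public static void main(String[] args) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);
            // Halts the JVM when Ignite reports a critical failure on this node.
            cfg.setFailureHandler(new StopNodeOrHaltFailureHandler());

            Ignite client = Ignition.start(cfg);
        }
    }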

thickclient-prining-always.zip

Below is the thick client configuration that we use 

Ignite thick client configuration...
IgniteConfiguration [igniteInstanceName=null, pubPoolSize=8,
svcPoolSize=null, callbackPoolSize=8, stripedPoolSize=8, sysPoolSize=8,
mgmtPoolSize=4, igfsPoolSize=1, dataStreamerPoolSize=8,
utilityCachePoolSize=8, utilityCacheKeepAliveTime=6, p2pPoolSize=2,
qryPoolSize=8, igniteHome=null, igniteWorkDir=null, mbeanSrv=null,
nodeId=null, marsh=null, marshLocJobs=false, daemon=false, p2pEnabled=true,
netTimeout=6, sndRetryDelay=1000, sndRetryCnt=3, metricsHistSize=100,
metricsUpdateFreq=2000, metricsExpTime=9223372036854775807,
discoSpi=TcpDiscoverySpi [addrRslvr=null, sockTimeout=0, ackTimeout=0,
marsh=null, reconCnt=10, reconDelay=2000, maxAckTimeout=60,
forceSrvMode=false, clientReconnectDisabled=true, internalLsnr=null],
segPlc=STOP, segResolveAttempts=2, waitForSegOnStart=true,
allResolversPassReq=true, segChkFreq=1, commSpi=TcpCommunicationSpi
[connectGate=null,
connPlc=org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi$FirstConnectionPolicy@74e47d7c,
enableForcibleNodeKill=false, enableTroubleshootingLog=false, locAddr=null,
locHost=null, locPort=47100, locPortRange=100, shmemPort=-1, directBuf=true,
directSndBuf=true, idleConnTimeout=60, connTimeout=5000,
maxConnTimeout=48000, reconCnt=10, sockSndBuf=32768, sockRcvBuf=32768,
msgQueueLimit=1000, slowClientQueueLimit=0, nioSrvr=null, shmemSrv=null,
usePairedConnections=false, connectionsPerNode=1, tcpNoDelay=true,
filterReachableAddresses=false, ackSndThreshold=32, unackedMsgsBufSize=0,
sockWriteTimeout=2000, boundTcpPort=-1, boundTcpShmemPort=-1,
selectorsCnt=4, selectorSpins=0, addrRslvr=null,
ctxInitLatch=java.util.concurrent.CountDownLatch@65072e7d[Count = 1],
stopping=false], evtSpi=null, colSpi=null, deploySpi=null, indexingSpi=null,
addrRslvr=null, encryptionSpi=null, clientMode=true,
rebalanceThreadPoolSize=1, txCfg=TransactionConfiguration
[txSerEnabled=false, dfltIsolation=REPEATABLE_READ,
dfltConcurrency=PESSIMISTIC, dfltTxTimeout=0,
txTimeoutOnPartitionMapExchange=0, pessimisticTxLogSize=0,
pessimisticTxLogLinger=1, tmLookupClsName=null, txManagerFactory=null,
useJtaSync=false], cacheSanityCheckEnabled=true, discoStartupDelay=6,
deployMode=SHARED, p2pMissedCacheSize=100, locHost=null,
timeSrvPortBase=31100, timeSrvPortRange=100, failureDetectionTimeout=1,
sysWorkerBlockedTimeout=12, clientFailureDetectionTimeout=12,
metricsLogFreq=3000, hadoopCfg=null, connectorCfg=ConnectorConfiguration
[jettyPath=null, host=null, port=11211, noDelay=true, directBuf=false,
sndBufSize=32768, rcvBufSize=32768, idleQryCurTimeout=60,
idleQryCurCheckFreq=6, sndQueueLimit=0, selectorCnt=1, idleTimeout=7000,
sslEnabled=false, sslClientAuth=false, sslCtxFactory=null, sslFactory=null,
portRange=100, threadPoolSize=8, msgInterceptor=null], odbcCfg=null,
warmupClos=null, atomicCfg=AtomicConfiguration [seqReserveSize=1000,
cacheMode=PARTITIONED, backups=1, aff=null, grpName=null], classLdr=null,
sslCtxFactory=null, platformCfg=null, binaryCfg=null, memCfg=null,
pstCfg=null, dsCfg=null, activeOnStart=true, autoActivation=true,
longQryWarnTimeout=3000, sqlConnCfg=null,
cliConnCfg=ClientConnectorConfiguration [host=null, port=10800,
portRange=100, sockSndBufSize=0, sockRcvBufSize=0, tcpNoDelay=true,
maxOpenCursorsPerConn=128, threadPoolSize=8, idleTimeout=0,
jdbcEnabled=true, odbcEnabled=true, thinCliEnabled=true, sslEnabled=false,
useIgniteSslCtxFactory=true, sslClientAuth=false, sslCtxFactory=null],
mvccVacuumThreadCnt=2, mvccVacuumFreq=5000, authEnabled=false,
failureHnd=net.aline.cloudedh.base.database.IgniteDatabase$$Lambda$19/1447342142@5fafe841,
commFailureRslvr=null]





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Starvation in striped pool

2019-10-09 Thread Ilya Kasnacheev
Hello!

It's hard to say what happens here. What timeout settings do you have? Can
you provide the complete log from the client node as well?

Regards,
-- 
Ilya Kasnacheev


Tue, 8 Oct 2019 at 19:25, maheshkr76private :

> Hello Ilya,
> Once the connection goes bad between client and server, which configuration
> parameter on the thick client side would force the thick client to stop
> pinging the server?
> I tried join timeout and connection timeouts in the communication and
> discovery SPIs, and nothing ever worked.
>
>
> I have seen in a few scenarios that a thick client never stops pinging the
> server if the network connection went bad for any reason. We continuously
> see these pings on the server, and the entire cluster slows down.
>
>
>
>
> [06:20:16,203][INFO][grid-nio-worker-tcp-comm-7-#47][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
> rmtAddr=/192.168.1.22:32935]
> [06:20:16,203][INFO][grid-nio-worker-tcp-comm-7-#47][TcpCommunicationSpi]
> Received incoming connection from remote node while connecting to this
> node,
> rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
> rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]
> [06:20:16,405][INFO][grid-nio-worker-tcp-comm-0-#40][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
> rmtAddr=/192.168.1.22:38152]
> [06:20:16,405][INFO][grid-nio-worker-tcp-comm-0-#40][TcpCommunicationSpi]
> Received incoming connection from remote node while connecting to this
> node,
> rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
> rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]
> [06:20:16,607][INFO][grid-nio-worker-tcp-comm-1-#41][TcpCommunicationSpi]
> Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
> rmtAddr=/192.168.1.22:49095]
> [06:20:16,607][INFO][grid-nio-worker-tcp-comm-1-#41][TcpCommunicationSpi]
> Received incoming connection from remote node while connecting to this
> node,
> rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
> rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]
>
>
>
> This issue was reported here.
>
> http://apache-ignite-users.70518.x6.nabble.com/Ignite-2-7-0-thick-client-has-all-system-threads-busy-indefinitely-td28880.html#a29018
>
>
>
>
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Starvation in striped pool

2019-10-08 Thread maheshkr76private
Hello Ilya,
Once the connection goes bad between client and server, which configuration
parameter on the thick client side would force the thick client to stop
pinging the server?
I tried join timeout and connection timeouts in the communication and
discovery SPIs, and nothing ever worked.
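
For reference, the kinds of settings I tried were along these lines (a rough
sketch; the exact values varied and none of them helped):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

    public class TimeoutsTried {
        public static IgniteConfiguration configure() {
            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setJoinTimeout(60_000);        // give up joining after 60 s

            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            commSpi.setConnectTimeout(5_000);       // initial connect timeout, ms
            commSpi.setMaxConnectTimeout(30_000);   // upper bound after backoff, ms
            commSpi.setReconnectCount(5);           // connect attempts before giving up

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setClientMode(true);
            cfg.setDiscoverySpi(discoSpi);
            cfg.setCommunicationSpi(commSpi);
            return cfg;
        }
    }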


I have seen in a few scenarios that a thick client never stops pinging the
server if the network connection went bad for any reason. We continuously see
these pings on the server, and the entire cluster slows down.




[06:20:16,203][INFO][grid-nio-worker-tcp-comm-7-#47][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.22:32935]
[06:20:16,203][INFO][grid-nio-worker-tcp-comm-7-#47][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]
[06:20:16,405][INFO][grid-nio-worker-tcp-comm-0-#40][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.22:38152]
[06:20:16,405][INFO][grid-nio-worker-tcp-comm-0-#40][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]
[06:20:16,607][INFO][grid-nio-worker-tcp-comm-1-#41][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/192.168.1.6:47100,
rmtAddr=/192.168.1.22:49095]
[06:20:16,607][INFO][grid-nio-worker-tcp-comm-1-#41][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=527c8173-b079-459a-8bb5-260f62d47797, locNodeOrder=1,
rmtNode=a89f3605-fe0b-4abf-8d98-245d75205743, rmtNodeOrder=587]



This issue was reported here. 
http://apache-ignite-users.70518.x6.nabble.com/Ignite-2-7-0-thick-client-has-all-system-threads-busy-indefinitely-td28880.html#a29018








--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Starvation in striped pool

2019-10-08 Thread Ilya Kasnacheev
Hello!

If the client node continues to respond via Discovery, the server node is not
going to drop it for unreachability. This is the default behavior.
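
If you want unresponsive clients dropped faster on the discovery side, the
relevant knob is clientFailureDetectionTimeout. A rough sketch with an
arbitrary value; it only helps when discovery heartbeats actually stop:

    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ServerClientTimeout {
        public static IgniteConfiguration configure() {
            IgniteConfiguration cfg = new IgniteConfiguration();
            // Time (ms) after which a client that stops responding to discovery
            // heartbeats is considered failed and removed from the topology.
            cfg.setClientFailureDetectionTimeout(30_000);
            return cfg;
        }
    }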

Regards,
-- 
Ilya Kasnacheev


Tue, 8 Oct 2019 at 15:43, maheshkr76private :

> OK. There could have been a temporary network issue between the server and
> the client node.
>
> However, I was expecting the server node to throw the client out of the
> cluster and resume normal functioning. But what bothers me is that the
> server node never recovered after the network issue and finally got stuck
> dumping threads. If there was a bigger network issue, I would have expected
> the server to report segmentation and go to the failure handler. This did
> not happen.
>
> It feels like the handling of this scenario is not proper in the Ignite code.
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Starvation in striped pool

2019-10-08 Thread maheshkr76private
OK. There could have been a temporary network issue between the server and
the client node.

However, I was expecting the server node to throw the client out of the
cluster and resume normal functioning. But what bothers me is that the server
node never recovered after the network issue and finally got stuck dumping
threads. If there was a bigger network issue, I would have expected the
server to report segmentation and go to the failure handler. This did not
happen.

It feels like the handling of this scenario is not proper in the Ignite code.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Starvation in striped pool

2019-10-08 Thread Ilya Kasnacheev
Hello!

Thread [name="sys-stripe-4-#5", id=27, state=RUNNABLE, blockCnt=0,
waitCnt=10004]
at sun.nio.ch.Net.poll(Native Method)
at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954)
at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
at
o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
at
o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)

Most likely this means that there is no communication connectivity between
your nodes, e.g., your server node can't open a connection to your client
node to send over the results of an operation in the striped pool.

Please check your network.
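
A quick way to verify that from the server host is to try opening a TCP
connection to the client's communication port yourself. A throwaway sketch
(the address is a placeholder for your client node, and 47100 is only the
default TcpCommunicationSpi local port; adjust it if you changed
locPort/locPortRange):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class CommPortProbe {
        public static void main(String[] args) throws Exception {
            String clientHost = args.length > 0 ? args[0] : "192.168.1.22"; // your client node's address
            int commPort = 47100; // default TcpCommunicationSpi local port

            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(clientHost, commPort), 5_000);
                System.out.println("Communication port is reachable from this host.");
            }
        }
    }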

Regards,
-- 
Ilya Kasnacheev


Tue, 8 Oct 2019 at 15:13, maheshkr76private :

> Hi,
>
> We are seeing a crash of the server node. This is with Ignite 2.7.6.
> Any insight and a possible workaround would be deeply appreciated.
> Attached is the log: ignite-29268843.zip
> <
> http://apache-ignite-users.70518.x6.nabble.com/file/t1596/ignite-29268843.zip>
>
>
>
> *
> [11:16:09,531][WARNING][grid-timeout-worker-#27][G] >>> Possible starvation
> in striped pool.*
> Thread name: sys-stripe-4-#5
> Queue: [Message closure [msg=GridIoMessage [plc=2, topic=TOPIC_CACHE,
> topicOrd=8, ordered=false, timeout=0, skipOnTimeout=false,
> msg=GridNearSingleGetRequest [futId=1570427398413, key=KeyCacheObjectImpl
> [part=212, val=null, hasValBytes=true], flags=1,
> topVer=AffinityTopologyVersion [topVer=618, minorTopVer=4],
> subjId=b3a51345-c6a5-4a6f-9117-374a80948ae8, taskNameHash=0, createTtl=-1,
> accessTtl=-1, mvccSnapshot=null]]], Message closure [msg=GridIoMessage
> [plc=2, topic=TOPIC_CACHE, topicOrd=8, ordered=false, timeout=0,
> skipOnTimeout=false, msg=GridNearSingleGetRequest [futId=1570427398737,
> key=KeyCacheObjectImpl [part=212, val=null, hasValBytes=true], flags=1,
> topVer=AffinityTopologyVersion [topVer=618, minorTopVer=4],
> subjId=b3a51345-c6a5-4a6f-9117-374a80948ae8, taskNameHash=0, createTtl=-1,
> accessTtl=-1, mvccSnapshot=null
> Deadlock: false
> Completed: 10057
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>