I think this is related to commit 906ed2ae369 for GEODE-3948. I agree that setting SoTimeout to 0 is incorrect. Shouldn’t it be a small multiple of the ping frequency?
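Something like the following is what I have in mind. This is only a rough
sketch, not the actual Geode code; the multiplier and the assumption that the
pool's ping-interval (10 seconds by default, if I remember right) is available
at socket-setup time are mine:

    import java.net.Socket;
    import java.net.SocketException;

    // Rough sketch: derive the ping socket's read timeout from the pool's
    // ping interval instead of passing 0, which disables the timeout.
    public class PingReadTimeoutSketch {
      public static void configure(Socket socket, int pingIntervalMillis)
          throws SocketException {
        // A small multiple of the ping frequency: generous enough for a
        // slow but healthy server to reply, small enough that a read on a
        // dead connection eventually returns control to the pool.
        socket.setSoTimeout(3 * pingIntervalMillis);
      }
    }

That way a ping against an unreachable server would fail with a
SocketTimeoutException after roughly 30 seconds instead of parking the pool
thread forever.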
Anthony

> On Apr 11, 2019, at 12:57 PM, Dan Smith <dsm...@pivotal.io> wrote:
>
> That's interesting. I agree we should be setting a read timeout during the
> ping, so if that thread is stuck it seems like something is wrong with how
> the read timeout is set.
>
> Would you mind sharing your pool configuration?
>
> -Dan
>
> On Thu, Apr 11, 2019 at 11:52 AM Vahram Aharonyan <vaharon...@vmware.com> wrote:
>
> One more note here - the thread remains stuck even after the network is
> back in normal shape. What I mean is that a network outage for a limited
> period of time (e.g. 15 minutes) leads to this situation.
>
> Thanks,
> Vahram.
>
> From: Vahram Aharonyan
> Sent: Thursday, April 11, 2019 7:06:31 PM
> To: user@geode.apache.org
> Subject: Stuck thread after network outage
>
> Hi All,
>
> We have 2 VMs, each running one Geode 1.7 server. Along with the Geode
> server, each VM also hosts one Geode 1.7 client, so the cluster has 2
> servers and 2 clients.
>
> While doing validation, we introduced packet loss (~65%) on the first VM
> “A”, and after about 1 minute the client on VM “B” reported the following:
>
> [warning 2019/04/11 16:20:27.502 AMT
> Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c]
> Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been stuck
> for <76.204 seconds> and number of thread monitor iteration <1>
> Thread Name <poolTimer-CollectorControllerPool-142>
> Thread state <RUNNABLE>
> Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
> Monitored metric <ResourceManagerStats.numThreadsStuck>
> Thread Stack:
> java.net.SocketInputStream.socketRead0(Native Method)
> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> java.net.SocketInputStream.read(SocketInputStream.java:171)
> java.net.SocketInputStream.read(SocketInputStream.java:141)
> sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> sun.security.ssl.InputRecord.read(InputRecord.java:503)
> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
> sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
> org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
> org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
> org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
> org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
> org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
> org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
> org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
> org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
> org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
> org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
> org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
> org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>
> This report and stack trace keep being repeated by the ThreadsMonitor over
> time; only the iteration count and the “stuck for” value increase. From the
> stack trace it appears to be a PingOp initiated by the client on VM “B”
> against the server on VM “A”. Because of the packet drop between the nodes,
> the response never reaches the calling client, and the thread stays blocked
> for hours. In the source I see that receiveWithHeaderReadTimeout receives
> NO_HEADER_READ_TIMEOUT as its timeout argument, which means we wait
> indefinitely. Is this reasonable? So the question is: why is the ping
> operation executed without a read timeout?
>
> Or could it be that this stuck thread will be interrupted by some
> monitoring logic at some point?
>
> Thanks,
> Vahram.
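P.S. To demonstrate the indefinite wait Vahram describes above, here is a
self-contained snippet (hostname and port are made up; 40404 just happens to
be the default cache-server port). With setSoTimeout(0), the read() call below
blocks exactly like the socketRead0 frame at the top of the quoted stack; with
a positive timeout it returns control to the caller:

    import java.io.InputStream;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class BlockingReadDemo {
      public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("server-a.example.com", 40404)) {
          socket.setSoTimeout(30_000); // 0 here means "block forever"
          InputStream in = socket.getInputStream();
          try {
            in.read(); // parks in socketRead0 while packets are dropped
          } catch (SocketTimeoutException e) {
            // With a positive timeout the thread regains control here and
            // the pool could mark the connection dead and retry the ping.
            System.out.println("read timed out; connection can be recycled");
          }
        }
      }
    }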