Re: Stucked thread after network outage

Bruce Schuchardt Thu, 11 Apr 2019 15:36:39 -0700

https://github.com/apache/geode/pull/3449


On 4/11/19 3:28 PM, Bruce Schuchardt wrote:

I've reopened GEODE-3948 to address this, Vahram. I'll have a pull request up shortly.
On 4/11/19 8:06 AM, Vahram Aharonyan wrote:
Hi All,
We have 2 VMs running Geode 1.7 servers, one server per VM. Along with the Geode server, each VM has one Geode 1.7 client. Hence we have 2 servers and 2 clients in the Geode cluster.
While doing validation, we introduced packet loss (~65%) on the first VM, “A”, and after about 1 minute the client on VM “B” reported the following:
[warning 2019/04/11 16:20:27.502 AMT Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c] Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been stuck for <76.204 seconds> and number of thread monitor iteration <1>
  Thread Name <poolTimer-CollectorControllerPool-142>
  Thread state <RUNNABLE>
  Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
  Monitored metric <ResourceManagerStats.numThreadsStuck>
  Thread Stack:
  java.net.SocketInputStream.socketRead0(Native Method)
  java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
  java.net.SocketInputStream.read(SocketInputStream.java:171)
  java.net.SocketInputStream.read(SocketInputStream.java:141)
  sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
  sun.security.ssl.InputRecord.read(InputRecord.java:503)
  sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
  sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
  sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
  org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
  org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
  org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
  org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
  org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
  org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
  org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
  org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
  org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
  org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
  org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
  org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
  java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
  org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  java.lang.Thread.run(Thread.java:748)
This report and stack trace are repeated continuously by ThreadsMonitor over time; only the iteration count and “stuck for” values increase. From the stack trace, it appears to be a ping operation initiated by the client on VM “B” against the server on VM “A”. Due to packet drop between the nodes, the response from the server never reaches the calling client, and this thread remains blocked for hours. In the source I see that receiveWithHeaderReadTimeout receives NO_HEADER_READ_TIMEOUT as its timeout argument, which means we will wait indefinitely. Is this reasonable? So the question is: why is the ping operation executed without a timeout?
Or could it be that this stuck thread will be interrupted by some monitoring logic at some point?
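For illustration only, here is a minimal sketch with plain JDK sockets (not Geode's code) of how a read timeout bounds the kind of blocking read seen at the bottom of the stack trace. The class and method names are hypothetical, and the assumption is that a header read timeout on the ping would behave much like SO_TIMEOUT does here:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; nothing here is part of Geode.
public class ReadTimeoutSketch {

    /**
     * Connects to a local peer that accepts the connection but never sends
     * a byte, then attempts a blocking read with the given SO_TIMEOUT.
     * Returns true if the read was abandoned because the timeout expired.
     * Pass a positive value: a timeout of 0 means "wait forever".
     */
    static boolean readTimesOut(int soTimeoutMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket silentPeer = server.accept()) {
            // Without this call the read below would block indefinitely,
            // which is the situation the stuck thread appears to be in.
            client.setSoTimeout(soTimeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read();      // the peer never writes, so this blocks
                return false;   // only reached if data or EOF arrived
            } catch (SocketTimeoutException expected) {
                return true;    // the read gave up after soTimeoutMillis
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a timeout set, the caller gets a SocketTimeoutException and can treat the server as unresponsive instead of holding the thread.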
Thanks,

Vahram.
