https://github.com/apache/geode/pull/3449

On 4/11/19 3:28 PM, Bruce Schuchardt wrote:

I've reopened GEODE-3948 to address this, Vahram. I'll have a pull request up shortly.

On 4/11/19 8:06 AM, Vahram Aharonyan wrote:

Hi All,

We have 2 VMs running Geode 1.7 servers, one server per VM. Alongside the Geode server, each VM also runs one Geode 1.7 client, so the Geode cluster has 2 servers and 2 clients.

During validation, we introduced packet loss (~65%) on the first VM, “A”, and after about 1 minute the client on VM “B” reported the following:

[warning 2019/04/11 16:20:27.502 AMT Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c] Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been stuck for <76.204 seconds> and number of thread monitor iteration <1>
  Thread Name <poolTimer-CollectorControllerPool-142>
  Thread state <RUNNABLE>
  Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
  Monitored metric <ResourceManagerStats.numThreadsStuck>
  Thread Stack:
  java.net.SocketInputStream.socketRead0(Native Method)
  java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
  java.net.SocketInputStream.read(SocketInputStream.java:171)
  java.net.SocketInputStream.read(SocketInputStream.java:141)
  sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
  sun.security.ssl.InputRecord.read(InputRecord.java:503)
  sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
  sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
  sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
  org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
  org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
  org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
  org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
  org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
  org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
  org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
  org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
  org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
  org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
  org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
  org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
  org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
  java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
  org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  java.lang.Thread.run(Thread.java:748)

This report and stack trace are repeated continuously by ThreadsMonitor over time; only the iteration count and “stuck for” values increase. From the stack trace, it appears to be a ping operation (PingOp) initiated by the client on VM “B” against the server on VM “A”. Because of the packet drop between the nodes, the server's response never reaches the calling client, and this thread remains blocked for hours. In the source I see that receiveWithHeaderReadTimeout receives NO_HEADER_READ_TIMEOUT as its timeout argument, which means we will wait indefinitely. Is this reasonable? In other words, why is the ping operation executed without a timeout?

Or could it be that this stuck thread will eventually be interrupted by some monitoring logic?
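
To illustrate the behaviour in question, here is a minimal sketch using only plain java.net sockets, not Geode code. It assumes a local server that accepts a connection and then never replies, standing in for a peer whose responses are lost. The class and method names (ReadTimeoutSketch, readTimesOut) are hypothetical. With a positive SO_TIMEOUT the blocked read is abandoned with SocketTimeoutException; with a timeout of 0, the read would block indefinitely, as in the stack trace above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

  /**
   * Connects to a local server that accepts the connection but never writes,
   * then attempts a blocking read on the client side.
   *
   * @param timeoutMillis read timeout to set on the client socket; must be > 0,
   *                      because 0 means "wait forever" for java.net.Socket
   * @return true if the read was abandoned with SocketTimeoutException
   */
  static boolean readTimesOut(int timeoutMillis) throws IOException {
    try (ServerSocket silentServer =
             new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
         Socket client =
             new Socket(silentServer.getInetAddress(), silentServer.getLocalPort());
         Socket accepted = silentServer.accept()) {

      // Without this call the default SO_TIMEOUT is 0 and read() never returns
      // unless data arrives or the connection is closed.
      client.setSoTimeout(timeoutMillis);

      InputStream in = client.getInputStream();
      try {
        in.read(); // the silent peer never sends anything
        return false;
      } catch (SocketTimeoutException e) {
        // The socket is still usable here; the caller decides whether to retry or close.
        return true;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("read timed out: " + readTimesOut(500));
  }
}
```

How a read timeout should be applied to the ping operation itself is a separate question for the Geode code; the sketch only shows the socket-level difference between a bounded and an unbounded read.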

Thanks,

Vahram.
