I think this is related to commit 906ed2ae369 for GEODE-3948. I agree that setting SoTimeout to 0 is incorrect. Shouldn’t it be a small multiple of the ping frequency?
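Something like the following is what I have in mind. This is only a rough
sketch, not the actual Geode code; the multiplier and the assumption that the
pool's ping-interval (10 seconds by default, if I remember right) is available
at socket-setup time are mine:

    import java.net.Socket;
    import java.net.SocketException;

    // Rough sketch: derive the ping socket's read timeout from the pool's
    // ping interval instead of passing 0, which disables the timeout.
    public class PingReadTimeoutSketch {
      public static void configure(Socket socket, int pingIntervalMillis)
          throws SocketException {
        // A small multiple of the ping frequency: generous enough for a
        // slow but healthy server to reply, small enough that a read on a
        // dead connection eventually returns control to the pool.
        socket.setSoTimeout(3 * pingIntervalMillis);
      }
    }

That way a ping against an unreachable server would fail with a
SocketTimeoutException after roughly 30 seconds instead of parking the pool
thread forever.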
Anthony

> On Apr 11, 2019, at 12:57 PM, Dan Smith <dsm...@pivotal.io> wrote:
>
> That's interesting. I agree we should be setting a read timeout during the
> ping, so if that thread is stuck it seems like something is wrong with how
> the read timeout is set.
>
> Would you mind sharing your pool configuration?
>
> -Dan
>
> On Thu, Apr 11, 2019 at 11:52 AM Vahram Aharonyan <vaharon...@vmware.com> wrote:
>
> One more note here - the thread remains stuck even after the network is
> back in normal shape. What I mean is that a network outage for a limited
> period of time (e.g. 15 minutes) leads to this situation.
>
> Thanks,
> Vahram.
>
> From: Vahram Aharonyan
> Sent: Thursday, April 11, 2019 7:06:31 PM
> To: user@geode.apache.org
> Subject: Stuck thread after network outage
>
> Hi All,
>
> We have 2 VMs, each running one Geode 1.7 server. Along with the Geode
> server, each VM also hosts one Geode 1.7 client, so the cluster has 2
> servers and 2 clients.
>
> While doing validation, we introduced packet loss (~65%) on the first VM
> “A”, and after about 1 minute the client on VM “B” reported the following:
>
> [warning 2019/04/11 16:20:27.502 AMT
> Collector-c0f1ee3e-366a-4ac3-8fda-60540cdd21c4 <ThreadsMonitor> tid=0x1c]
> Thread <2182> that was executed at <11 Apr 2019 16:19:11 AMT> has been stuck
> for <76.204 seconds> and number of thread monitor iteration <1>
> Thread Name <poolTimer-CollectorControllerPool-142>
> Thread state <RUNNABLE>
> Executor Group <ScheduledThreadPoolExecutorWithKeepAlive>
> Monitored metric <ResourceManagerStats.numThreadsStuck>
> Thread Stack:
> java.net.SocketInputStream.socketRead0(Native Method)
> java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> java.net.SocketInputStream.read(SocketInputStream.java:171)
> java.net.SocketInputStream.read(SocketInputStream.java:141)
> sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> sun.security.ssl.InputRecord.read(InputRecord.java:503)
> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
> sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
> org.apache.geode.internal.cache.tier.sockets.Message.fetchHeader(Message.java:809)
> org.apache.geode.internal.cache.tier.sockets.Message.readHeaderAndBody(Message.java:659)
> org.apache.geode.internal.cache.tier.sockets.Message.receiveWithHeaderReadTimeout(Message.java:1124)
> org.apache.geode.internal.cache.tier.sockets.Message.receive(Message.java:1135)
> org.apache.geode.cache.client.internal.AbstractOp.attemptReadResponse(AbstractOp.java:205)
> org.apache.geode.cache.client.internal.AbstractOp.attempt(AbstractOp.java:386)
> org.apache.geode.cache.client.internal.ConnectionImpl.execute(ConnectionImpl.java:276)
> org.apache.geode.cache.client.internal.QueueConnectionImpl.execute(QueueConnectionImpl.java:167)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeWithPossibleReAuthentication(OpExecutorImpl.java:894)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOnServer(OpExecutorImpl.java:387)
> org.apache.geode.cache.client.internal.OpExecutorImpl.executeOn(OpExecutorImpl.java:349)
> org.apache.geode.cache.client.internal.PoolImpl.executeOn(PoolImpl.java:827)
> org.apache.geode.cache.client.internal.PingOp.execute(PingOp.java:36)
> org.apache.geode.cache.client.internal.LiveServerPinger$PingTask.run2(LiveServerPinger.java:90)
> org.apache.geode.cache.client.internal.PoolImpl$PoolTask.run(PoolImpl.java:1338)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> org.apache.geode.internal.ScheduledThreadPoolExecutorWithKeepAlive$DelegatingScheduledFuture.run(ScheduledThreadPoolExecutorWithKeepAlive.java:271)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
>
> This report and stack trace keep being repeated by the ThreadsMonitor over
> time; only the iteration count and the “stuck for” value increase. From the
> stack trace it appears to be a PingOp initiated by the client on VM “B”
> against the server on VM “A”. Because of the packet drop between the nodes,
> the response never reaches the calling client, and the thread stays blocked
> for hours. In the source I see that receiveWithHeaderReadTimeout receives
> NO_HEADER_READ_TIMEOUT as its timeout argument, which means we wait
> indefinitely. Is this reasonable? So the question is: why is the ping
> operation executed without a read timeout?
>
> Or could it be that this stuck thread will be interrupted by some
> monitoring logic at some point?
>
> Thanks,
> Vahram.
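P.S. To demonstrate the indefinite wait Vahram describes above, here is a
self-contained snippet (hostname and port are made up; 40404 just happens to
be the default cache-server port). With setSoTimeout(0), the read() call below
blocks exactly like the socketRead0 frame at the top of the quoted stack; with
a positive timeout it returns control to the caller:

    import java.io.InputStream;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class BlockingReadDemo {
      public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("server-a.example.com", 40404)) {
          socket.setSoTimeout(30_000); // 0 here means "block forever"
          InputStream in = socket.getInputStream();
          try {
            in.read(); // parks in socketRead0 while packets are dropped
          } catch (SocketTimeoutException e) {
            // With a positive timeout the thread regains control here and
            // the pool could mark the connection dead and retry the ping.
            System.out.println("read timed out; connection can be recycled");
          }
        }
      }
    }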