I looked at the heap dump and identified the node that didn't send the 
response.

But the question now is why it didn't send it: did it run the function, or has 
it not run it yet?

________________________________
From: Darrel Schneider <dschnei...@pivotal.io>
Sent: Tuesday, December 15, 2015 9:58 PM
To: user@geode.incubator.apache.org
Subject: Re: How to troubleshoot stuck distributed function calls

Usually the member waiting for a response logs a warning that it has been 
waiting for longer than 15 seconds from a particular member. Use that member id 
to identify the member that is not responding. Get a stack dump on that member 
and look for a thread that is processing the unresponsive message. Sometimes 
this member also logs that it is waiting for another member to respond before 
it can respond to the first member.
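
As a rough illustration of the stack dump step, here is a minimal Python sketch 
that filters a thread dump for threads whose stack mentions a given class. It 
assumes a HotSpot-style dump (for example from jstack) in which each thread 
block starts with a quoted thread name and blocks are separated by blank lines. 
The function name threads_mentioning and the class com.example.MyFunction are 
hypothetical placeholders, not part of Geode.

```python
import re


def threads_mentioning(dump_text, keyword):
    """Return the thread blocks of a dump whose text contains keyword.

    Assumes each thread block begins with a quoted thread name and that
    blocks are separated by blank lines.
    """
    blocks = re.split(r"\n\s*\n", dump_text)
    return [
        block.strip()
        for block in blocks
        if block.lstrip().startswith('"') and keyword in block
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_threads.py dump.txt com.example.MyFunction
    # (both arguments are placeholders for your own dump file and class)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for block in threads_mentioning(handle.read(), sys.argv[2]):
            print(block)
            print()
```

If a thread shows up with your function's class in its stack, the function has 
started and is probably blocked or still running; if none does, it may never 
have been invoked on that member, or it may already have finished.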

The log message to look for is: "seconds have elapsed while waiting for 
replies:". It will be a warning and should be the last message logged by that 
thread. Sometimes it will log this warning and then get the response later, in 
which case it will log an info message that it did receive the reply.
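
A minimal sketch of searching a member's log for that warning, in Python. It 
only matches the phrase quoted above and makes no assumption about the rest of 
the log line format. The helper names find_reply_wait_warnings and scan_log are 
hypothetical.

```python
# The phrase quoted above; everything else about the line is left unparsed.
WAIT_MARKER = "seconds have elapsed while waiting for replies:"


def find_reply_wait_warnings(lines):
    """Return (line_number, text) pairs for lines containing the marker."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if WAIT_MARKER in line:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan one log file and return the matching lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_reply_wait_warnings(handle)


if __name__ == "__main__":
    import sys

    # Usage: python find_waits.py member.log  (the path is a placeholder)
    for number, text in scan_log(sys.argv[1]):
        print(f"{number}: {text}")
```

Running this against the log of the calling member should surface the warning 
lines, which name the member being waited on; that member's log and stack dump 
are the next place to look.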


On Tue, Dec 15, 2015 at 12:03 AM, Hovhannes Antonyan 
<hanton...@vmware.com> wrote:
Hello experts,

I have a multi node environment where one of the nodes has made a broadcast 
call to all other nodes and got stuck.
It is still waiting for responses from all nodes, and from the heap dump I see 
that the ResultCollector has N-1 elements, where N is the total number of 
nodes. So it looks like one of the nodes didn't return a response, or it did 
return one but for some reason the caller has not received it.
How can I troubleshoot this issue? How can I find out exactly which node failed 
to return the response, and why?

Thanks in advance,
Hovhannes
