RE: local-idle-timeout and idle timeout sequencing errors on several instances

Wiggelinkhuizen J (Jaap) Mon, 19 Jul 2021 00:02:59 -0700

Dear Cliff,

Thank you for your response. I have created 
PROTON-2411<https://issues.apache.org/jira/browse/PROTON-2411> with the 
information from my mail plus some additional details. Unfortunately, we have 
not been able to reproduce the issue in our test facilities so far, and 
without a clue as to the cause we do not know how to trigger it either.


Indeed, we build our own Proton libraries from source, so your offer to 
create a patch that helps reduce the impact and gathers more information 
would be very much appreciated.

P.S.: I am on holiday from tomorrow. Could you use reply-all, so that those in 
CC are included, when responding to this mail?

Thanks again!

With kind regards,

Jaap Wiggelinkhuizen

From: Cliff Jansen <[email protected]>
Sent: Friday, 16 July 2021 18:23
To: Wiggelinkhuizen J (Jaap) <[email protected]>; 
[email protected]
Subject: Re: local-idle-timeout and idle timeout sequencing errors on several 
instances

This is not a known bug. Despite your helpful, detailed account, I cannot see 
how a second “earlier” deadline could occur in the life of an AMQP connection, 
even allowing for an off-by-one error.

Please raise a JIRA including any additional information you can think of.

Obviously a reproducer would be ideal, but may be hard to provide.

Are you building your own Proton libraries from source? If so I could try to 
put together a patch that would be more resilient in the abort case and gather 
some additional bread crumbs to help analyze the circumstances of the failure.

Cliff



On Thu, Jul 15, 2021 at 3:31 AM Wiggelinkhuizen J (Jaap) 
<[email protected]> wrote:
Dear Qpid users,

In our mission-critical software for the Dutch government we use Qpid Proton 
0.34.0 in our C++ client software together with Qpid Dispatch Router 1.16.0. 
We updated to these versions fairly recently; before that we used Proton 
0.25.0 and Dispatch 1.3.0. Our application runs on several VMs with a router 
on each VM. All clients connect only to the local router, and the routers 
connect to each other in a hub-and-spoke pattern. In both the client 
configuration and the router configuration we have set an idle timeout of 30 
seconds.

Two weeks ago we had an incident in production in which many of our client 
processes reported problems related to the idle timeouts. These client 
processes had been running stably for more than 3 weeks. The problem appeared 
in two flavors:

  1.  Transport error “error: amqp:resource-limit-exceeded: local-idle-timeout 
expired”
  2.  epoll proactor failure in epoll_timer.c:263: “idle timeout sequencing 
error”

On each VM at least 3 processes showed one of these problems within a time 
window of less than a minute. So far we have not found any cause in the 
underlying hardware, hypervisor, network or operating system.
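
For readers unfamiliar with the first error: it is reported when no inbound 
traffic has arrived on a connection within the locally configured idle 
timeout. A minimal sketch of that check follows. IdleWatch and its members are 
hypothetical names chosen for illustration; this is a simplified model, not 
Proton's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not Proton code: remembers the time by which the next
// inbound frame must arrive, and reports expiry once that time has passed.
struct IdleWatch {
    std::int64_t timeout_ms;   // local idle timeout, e.g. 30000 for 30 seconds
    std::int64_t deadline_ms;  // time at which the peer is considered silent

    IdleWatch(std::int64_t now_ms, std::int64_t timeout)
        : timeout_ms(timeout), deadline_ms(now_ms + timeout) {
        assert(timeout > 0);  // a zero timeout would mean "disabled"
    }

    // Any inbound frame, including an empty heartbeat, pushes the deadline out.
    void on_frame_received(std::int64_t now_ms) {
        deadline_ms = now_ms + timeout_ms;
    }

    // True once a full timeout interval has passed without inbound traffic.
    bool expired(std::int64_t now_ms) const { return now_ms >= deadline_ms; }
};
```

With a 30 second timeout, a watch created at time 0 expires at 30000 ms unless 
a frame arrives first; each received frame restarts the interval.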

Although we don’t know the root cause of the problems, we can handle the first 
situation with proper reconnect settings. The second situation is stranger, 
however, because it results in an abort within Proton itself. The comments 
in epoll_timer.c explain that this error occurs when a connection timer is 
moved backwards a second time. We don’t understand how this can suddenly happen.
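
To make "reconnect settings" concrete: the Proton C++ binding exposes 
proton::reconnect_options with delay, delay_multiplier, max_delay and 
max_attempts settings, which together describe a capped, geometrically growing 
wait between attempts. The sketch below only models that arithmetic so the 
effect of chosen values can be inspected; reconnect_delays is a hypothetical 
helper, not part of Proton, and the exact schedule Proton applies may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Proton: lists the wait (in milliseconds)
// before each reconnect attempt, growing by `multiplier` up to `max_ms`.
std::vector<std::int64_t> reconnect_delays(std::int64_t first_ms,
                                           double multiplier,
                                           std::int64_t max_ms,
                                           int attempts) {
    assert(first_ms > 0 && multiplier >= 1.0);
    assert(max_ms >= first_ms && attempts >= 0);

    std::vector<std::int64_t> delays;
    double next = static_cast<double>(first_ms);
    for (int i = 0; i < attempts; ++i) {
        // Record the current wait, never exceeding the cap.
        delays.push_back(std::min(static_cast<std::int64_t>(next), max_ms));
        // Grow the wait for the following attempt, again bounded by the cap.
        next = std::min(next * multiplier, static_cast<double>(max_ms));
    }
    return delays;
}
```

For example, reconnect_delays(10, 2.0, 100, 6) gives waits of 10, 20, 40, 80, 
100 and 100 ms. These values are illustrative only and are not a 
recommendation for production.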

Has anyone experienced similar problems with recent Proton versions (the 
epoll_timer.c module was introduced in version 0.33.0)? And, even more 
importantly, is there a solution or workaround?

Looking forward to any response. Thanks in advance!

With kind regards,

Jaap Wiggelinkhuizen
Software architect & System integrator



E    [email protected]<mailto:[email protected]>
W   intraffic.nl<https://www.intraffic.nl/>

  <https://www.linkedin.com/company/intraffic>

Visiting address: Iepenhoeve 11, 3438 MR Nieuwegein
