On 18/12/2019 8:45 am, Toralf Lund wrote:
On one of our installations of one of our AMQP 0-10 applications using the QPid messaging API (in C++), we keep getting exceptions with the message

session-busy: Session detached by peer

- usually while waiting for incoming messages, but in some cases also when sending them. After resolving some issues with "dangling" session pointers (with help from the list), I thought this wasn't really a problem, but I just found that the broker has several extra connections for the application; qpid-stat reports the following:

$ qpid-stat -c
Connections
  connection                                  cproc            cpid mech       auth       connected        idle            msgIn  msgOut ===========================================================================================================================================   qpid.192.168.0.215:5672-192.168.0.61:58532  our-program      134233 ANONYMOUS  anonymous  4d 19h 42m 25s   4d 18h 37m 44s  47.3k   531k   qpid.192.168.0.215:5672-192.168.0.61:58564  our-program      134233 ANONYMOUS  anonymous  4d 5h 27m 9s     4d 4h 32m 4s    42.3k   488k   qpid.192.168.0.215:5672-192.168.0.61:58566  our-program      134233 ANONYMOUS  anonymous  4d 4h 32m 12s    4d 3h 33m 34s   44.6k   513k   qpid.192.168.0.215:5672-192.168.0.61:58580  our-program      134233 ANONYMOUS  anonymous  1d 17h 37m 49s   1d 16h 41m 54s  40.6k   515k   qpid.192.168.0.215:5672-192.168.0.61:58582  our-program      134233 ANONYMOUS  anonymous  1d 16h 42m 1s    1d 15h 46m 44s  41.3k   497k   qpid.192.168.0.215:5672-192.168.0.61:58584  our-program      134233 ANONYMOUS  anonymous  1d 15h 46m 46s   1d 14h 43m 24s  46.8k   535k   qpid.192.168.0.215:5672-192.168.0.61:58618  our-program      134233 ANONYMOUS  anonymous  22h 55m 50s      21h 48m 4s      44.5k   496k   qpid.192.168.0.215:5672-192.168.0.61:58846  our-program      134233 ANONYMOUS  anonymous  18h 21m 22s      0s                 0      0

(Names and addresses have been edited.)

The implied times seem to match instances of the session-busy exception, but a new connection doesn't appear every time. The exception has occurred 27 times during the last 4 days, and as you can see, there are a not that many connections.

It is possible some connections are actually closed. The session-busy can be a race, e.g. if the client times out a connection before the server (but the original connection and session is then eventually closed, just too late for the reattach attempt to succeed with the same session name).

Your heartbeat interval is set quite low. I don't know if that was deliberately so, but a slightly less sensitive setting might avoid timing out in some cases.

That said you still clearly also still have multiple connections open, which I'm assuming is not expected? I.e. you would expect only one connection from that process?

I don't see any signs of connection errors or explicit reopens but automatic reconnect with a limit has been enabled; the following connect options are used

   connection.setOption("reconnect", true);
   connection.setOption("reconnect_limit", 0);

(I believe this means trying reconnect once, as "reconnect_limit" specifies retries after an initial attempt.)

I believe a reconnect_limit of 0 will reconnect without any limit. If you want to only reconnect once, you would specify a value of 1. However that would I believe be reset every time it does successfully reconnect.


An additional, more serious issue is that the system has also locked up a couple of times following an exception during Sender::send(). Slightly simplified stack trace from one of the cases:

Thread 5 (Thread 0x7fb93bc7c700 (LWP 134270)):
#0  0x00007fb9400bd113 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007fb93f9c7677 in qpid::sys::Poller::wait(qpid::sys::Duration) ()
    from /usr/lib64/libqpidcommon.so.2
#2  0x00007fb93f9c9d5f in qpid::sys::Poller::run() ()
    from /usr/lib64/libqpidcommon.so.2
#3  0x00007fb93f9b8e4a in ?? ()
    from /usr/lib64/libqpidcommon.so.2
#4  0x00007fb941663dd5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fb9400bcb3d in clone () from /usr/lib64/libc.so.6

Thread 4 (Thread 0x7fb93b47b700 (LWP 134271)):
#0  0x00007fb941667cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
    from /usr/lib64/libpthread.so.0
#1  0x00007fb93fa35b35 in qpid::sys::Timer::run() ()
    from /usr/lib64/libqpidcommon.so.2
#2  0x00007fb93f9b8e4a in ?? ()
    from /usr/lib64/libqpidcommon.so.2
#3  0x00007fb941663dd5 in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fb9400bcb3d in clone () from /usr/lib64/libc.so.6

Thread 3 (Thread 0x7fb93987f700 (LWP 134344)):
[ ... ]

Thread 3's trace didn't make it through. Do you still have that?
Thread 2 (Thread 0x7fb93907e700 (LWP 395509)):
#0  0x00007fb9400bd113 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x00007fb93f9c7677 in qpid::sys::Poller::wait(qpid::sys::Duration) ()
    from /usr/lib64/libqpidcommon.so.2
#2  0x00007fb93f9c9d5f in qpid::sys::Poller::run() ()
    from /usr/lib64/libqpidcommon.so.2
#3  0x00007fb93f9b8e4a in ?? ()
    from /usr/lib64/libqpidcommon.so.2
#4  0x00007fb941663dd5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fb9400bcb3d in clone () from /usr/lib64/libc.so.6

Thread 1 (Thread 0x7fb942671300 (LWP 134233)):
#0  0x00007fb94039850f in ?? () from /usr/lib64/libgcc_s.so.1
#1  0x00007fb940399f5f in ?? () from /usr/lib64/libgcc_s.so.1
#2  0x00007fb94039a8ca in ?? () from /usr/lib64/libgcc_s.so.1
#3  0x00007fb94039add7 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#4  0x00007fb93fd8b927 in qpid::client::SessionImpl::sendCommand(qpid::framing::AMQBody const&, qpid::framing::MethodContent const*) ()
    from /usr/lib64/libqpidclient.so.2
#5  0x00007fb93fd8b97b in qpid::client::SessionImpl::send(qpid::framing::AMQBody const&) () from /usr/lib64/libqpidclient.so.2
#6  0x00007fb93fd876a6 in qpid::client::SessionBase_0_10::sync() ()
    from /usr/lib64/libqpidclient.so.2
#7  0x00007fb9422403f8 in ?? ()
    from /usr/lib64/libqpidmessaging.so.2
#8  0x00007fb942241e6a in ?? ()
    from /usr/lib64/libqpidmessaging.so.2
#9  0x00000000004f2dc9 in ?? ()

It looks like the send is stuck trying to unwind after an exception (possibly on the checkOpen()), due to not being able to unlock the sendLock semaphore. Is something in thread 3 holding any locks?

The application became responsive again after running a debug session to get the above info. There is a chance that it did another Sender::send() on the same sender (without trying to reset the session or whatever) following the exception.

Any idea what's going on here? Is there anything I can do to avoid the session-busy errors? Can I prevent extra connections from appearing if it should occur after all? Also, is there a way to prevent the apparent lock-ups after exception on send()? I think I've assumed that it's "safe" to try other operations as they will just fail directly, unless the session is actually able to recover, but perhaps that's a bad idea?

Note that besides what's been mentioned already, the following connection options are used:

connection.setOption("tcp_nodelay", true);
connection.setOption("heartbeat", 2);

Thanks,

- Toralf




---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
For additional commands, e-mail: users-h...@qpid.apache.org

Reply via email to