On 18/12/2019 8:45 am, Toralf Lund wrote:
On one of our installations of one of our AMQP 0-10 applications using
the QPid messaging API (in C++), we keep getting exceptions with the
message
session-busy: Session detached by peer
- usually while waiting for incoming messages, but in some cases also
when sending them. After resolving some issues with "dangling" session
pointers (with help from the list), I thought this wasn't really a
problem, but I just found that the broker has several extra connections
for the application; qpid-stat reports the following:
$ qpid-stat -c
Connections
connection cproc cpid
mech auth connected idle msgIn msgOut
===========================================================================================================================================
qpid.192.168.0.215:5672-192.168.0.61:58532 our-program 134233
ANONYMOUS anonymous 4d 19h 42m 25s 4d 18h 37m 44s 47.3k 531k
qpid.192.168.0.215:5672-192.168.0.61:58564 our-program 134233
ANONYMOUS anonymous 4d 5h 27m 9s 4d 4h 32m 4s 42.3k 488k
qpid.192.168.0.215:5672-192.168.0.61:58566 our-program 134233
ANONYMOUS anonymous 4d 4h 32m 12s 4d 3h 33m 34s 44.6k 513k
qpid.192.168.0.215:5672-192.168.0.61:58580 our-program 134233
ANONYMOUS anonymous 1d 17h 37m 49s 1d 16h 41m 54s 40.6k 515k
qpid.192.168.0.215:5672-192.168.0.61:58582 our-program 134233
ANONYMOUS anonymous 1d 16h 42m 1s 1d 15h 46m 44s 41.3k 497k
qpid.192.168.0.215:5672-192.168.0.61:58584 our-program 134233
ANONYMOUS anonymous 1d 15h 46m 46s 1d 14h 43m 24s 46.8k 535k
qpid.192.168.0.215:5672-192.168.0.61:58618 our-program 134233
ANONYMOUS anonymous 22h 55m 50s 21h 48m 4s 44.5k 496k
qpid.192.168.0.215:5672-192.168.0.61:58846 our-program 134233
ANONYMOUS anonymous 18h 21m 22s 0s 0 0
(Names and addresses have been edited.)
The implied times seem to match instances of the session-busy exception,
but a new connection doesn't appear every time. The exception has
occurred 27 times during the last 4 days, and as you can see, there are
a not that many connections.
It is possible some connections are actually closed. The session-busy
can be a race, e.g. if the client times out a connection before the
server (but the original connection and session is then eventually
closed, just too late for the reattach attempt to succeed with the same
session name).
Your heartbeat interval is set quite low. I don't know if that was
deliberately so, but a slightly less sensitive setting might avoid
timing out in some cases.
That said you still clearly also still have multiple connections open,
which I'm assuming is not expected? I.e. you would expect only one
connection from that process?
I don't see any signs of connection errors or explicit reopens but
automatic reconnect with a limit has been enabled; the following connect
options are used
connection.setOption("reconnect", true);
connection.setOption("reconnect_limit", 0);
(I believe this means trying reconnect once, as "reconnect_limit"
specifies retries after an initial attempt.)
I believe a reconnect_limit of 0 will reconnect without any limit. If
you want to only reconnect once, you would specify a value of 1. However
that would I believe be reset every time it does successfully reconnect.
An additional, more serious issue is that the system has also locked up
a couple of times following an exception during Sender::send(). Slightly
simplified stack trace from one of the cases:
Thread 5 (Thread 0x7fb93bc7c700 (LWP 134270)):
#0 0x00007fb9400bd113 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007fb93f9c7677 in qpid::sys::Poller::wait(qpid::sys::Duration) ()
from /usr/lib64/libqpidcommon.so.2
#2 0x00007fb93f9c9d5f in qpid::sys::Poller::run() ()
from /usr/lib64/libqpidcommon.so.2
#3 0x00007fb93f9b8e4a in ?? ()
from /usr/lib64/libqpidcommon.so.2
#4 0x00007fb941663dd5 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007fb9400bcb3d in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7fb93b47b700 (LWP 134271)):
#0 0x00007fb941667cf2 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
from /usr/lib64/libpthread.so.0
#1 0x00007fb93fa35b35 in qpid::sys::Timer::run() ()
from /usr/lib64/libqpidcommon.so.2
#2 0x00007fb93f9b8e4a in ?? ()
from /usr/lib64/libqpidcommon.so.2
#3 0x00007fb941663dd5 in start_thread () from /usr/lib64/libpthread.so.0
#4 0x00007fb9400bcb3d in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fb93987f700 (LWP 134344)):
[ ... ]
Thread 3's trace didn't make it through. Do you still have that?
Thread 2 (Thread 0x7fb93907e700 (LWP 395509)):
#0 0x00007fb9400bd113 in epoll_wait () from /usr/lib64/libc.so.6
#1 0x00007fb93f9c7677 in qpid::sys::Poller::wait(qpid::sys::Duration) ()
from /usr/lib64/libqpidcommon.so.2
#2 0x00007fb93f9c9d5f in qpid::sys::Poller::run() ()
from /usr/lib64/libqpidcommon.so.2
#3 0x00007fb93f9b8e4a in ?? ()
from /usr/lib64/libqpidcommon.so.2
#4 0x00007fb941663dd5 in start_thread () from /usr/lib64/libpthread.so.0
#5 0x00007fb9400bcb3d in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7fb942671300 (LWP 134233)):
#0 0x00007fb94039850f in ?? () from /usr/lib64/libgcc_s.so.1
#1 0x00007fb940399f5f in ?? () from /usr/lib64/libgcc_s.so.1
#2 0x00007fb94039a8ca in ?? () from /usr/lib64/libgcc_s.so.1
#3 0x00007fb94039add7 in _Unwind_Resume () from /usr/lib64/libgcc_s.so.1
#4 0x00007fb93fd8b927 in
qpid::client::SessionImpl::sendCommand(qpid::framing::AMQBody const&,
qpid::framing::MethodContent const*) ()
from /usr/lib64/libqpidclient.so.2
#5 0x00007fb93fd8b97b in
qpid::client::SessionImpl::send(qpid::framing::AMQBody const&) () from
/usr/lib64/libqpidclient.so.2
#6 0x00007fb93fd876a6 in qpid::client::SessionBase_0_10::sync() ()
from /usr/lib64/libqpidclient.so.2
#7 0x00007fb9422403f8 in ?? ()
from /usr/lib64/libqpidmessaging.so.2
#8 0x00007fb942241e6a in ?? ()
from /usr/lib64/libqpidmessaging.so.2
#9 0x00000000004f2dc9 in ?? ()
It looks like the send is stuck trying to unwind after an exception
(possibly on the checkOpen()), due to not being able to unlock the
sendLock semaphore. Is something in thread 3 holding any locks?
The application became responsive again after running a debug session to
get the above info. There is a chance that it did another Sender::send()
on the same sender (without trying to reset the session or whatever)
following the exception.
Any idea what's going on here? Is there anything I can do to avoid the
session-busy errors? Can I prevent extra connections from appearing if
it should occur after all? Also, is there a way to prevent the apparent
lock-ups after exception on send()? I think I've assumed that it's
"safe" to try other operations as they will just fail directly, unless
the session is actually able to recover, but perhaps that's a bad idea?
Note that besides what's been mentioned already, the following
connection options are used:
connection.setOption("tcp_nodelay", true);
connection.setOption("heartbeat", 2);
Thanks,
- Toralf
---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
For additional commands, e-mail: users-h...@qpid.apache.org