[jira] [Commented] (PROTON-1090) BlockingConnection client spins at 100% cpu on reconnect
[ https://issues.apache.org/jira/browse/PROTON-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087099#comment-15087099 ]

Pavel Moravec commented on PROTON-1090:
---

A side observation from the reproducer: testing it on downstream `python-qpid-proton-0.9-11.el7.x86_64`, I also see memory consumption increase. I just run the reproducer and restart `qdrouterd` every 5 seconds.

I even backported the fix for one known memory leak there by applying these patches:

https://git-wip-us.apache.org/repos/asf?p=qpid-proton.git;h=c799a29
https://git-wip-us.apache.org/repos/asf?p=qpid-proton.git;h=bbba61a

but the memory increase persists. Since I don't have the upstream version of the proton reactor, the memory leak may already be fixed upstream.

> BlockingConnection client spins at 100% cpu on reconnect
>
> Key: PROTON-1090
> URL: https://issues.apache.org/jira/browse/PROTON-1090
> Project: Qpid Proton
> Issue Type: Bug
> Components: proton-c, python-binding
> Affects Versions: 0.9.1, 0.12.0
> Reporter: Ken Giusti
> Priority: Blocker
> Fix For: 0.12.0
>
> Attachments: cputest.py
>
> Attached is a simple python client that connects to a server and waits
> forever for a message to be received, reconnecting on connection failure.
> When the server is restarted (in my case I'm using qdrouterd), the client
> reconnects then pins the cpu at 100%. It appears as if the
> BlockingConnection.wait() method in util.py is the source of the busy loop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
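To put numbers on the memory growth, the resident set size of the reproducer process could be sampled while `qdrouterd` is being restarted. A minimal sketch, assuming a Linux `/proc` filesystem; the helper names `parse_vm_rss_kb`, `read_rss_kb` and `rss_growth_kb` are illustrative and not part of Proton:

```python
import re


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1))


def read_rss_kb(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open("/proc/%d/status" % pid) as status_file:
        return parse_vm_rss_kb(status_file.read())


def rss_growth_kb(samples):
    """Difference between the last and the first RSS sample, in kB."""
    return samples[-1] - samples[0] if samples else 0
```

Calling `read_rss_kb(pid)` once per router restart and passing the collected list to `rss_growth_kb` would show whether the growth is steady or levels off after a few reconnects.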
[jira] [Commented] (PROTON-1090) BlockingConnection client spins at 100% cpu on reconnect
[ https://issues.apache.org/jira/browse/PROTON-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087502#comment-15087502 ]

Pavel Moravec commented on PROTON-1090:
---

Another observation: the problem seems to be at the link level, not the connection level (I *think*). I reproduced the same behaviour with link routing configured in qdrouterd to a qpid C++ broker, so the reproducer script in fact created a link via qdrouterd to qpidd. Then I was bouncing the qpid _broker_: qpidd, not qdrouterd. With lower probability than when bouncing qdrouterd, I got the spinning CPU and the same backtraces.
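One way to confirm the spin without a profiler is to compare CPU time against wall-clock time around the blocking call: a call that truly blocks uses close to no CPU while waiting, whereas a busy loop approaches a ratio of 1.0. A minimal sketch using only the Python 3 standard library; `cpu_ratio` and `busy_wait` are illustrative helpers, and `time.sleep` merely stands in for the blocking receive:

```python
import time


def cpu_ratio(func, *args, **kwargs):
    """Run func and return CPU seconds used divided by wall-clock seconds."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of the whole process
    func(*args, **kwargs)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def busy_wait(seconds):
    """Deliberately spin until the deadline, for comparison."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


if __name__ == "__main__":
    print("sleep ratio: %.2f" % cpu_ratio(time.sleep, 0.2))
    print("busy ratio:  %.2f" % cpu_ratio(busy_wait, 0.2))
```

Because `time.process_time()` covers all threads of the process, the ratio is only meaningful when the wrapped call is the main activity of the client.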
[jira] [Commented] (PROTON-1025) CLOSE_WAIT leak following reproducer for PROTON-1023 / PROTON-1024
[ https://issues.apache.org/jira/browse/PROTON-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980661#comment-14980661 ]

Pavel Moravec commented on PROTON-1025:
---

Good point. It makes sense that the programmer should call the "destructor" before assigning a new instance of the class to the variable.

So I am closing this JIRA; let's reopen it if the feature "BlockingConnection should close itself on unreference" is required.

> CLOSE_WAIT leak following reproducer for PROTON-1023 / PROTON-1024
>
> Key: PROTON-1025
> URL: https://issues.apache.org/jira/browse/PROTON-1025
> Project: Qpid Proton
> Issue Type: Bug
> Components: python-binding
> Affects Versions: 0.10
> Reporter: Pavel Moravec
> Priority: Minor
>
> Following the reproducer for PROTON-1023 or PROTON-1024 (attached at the bottom),
> the client leaves some sockets in CLOSE_WAIT state forever.
> I tested the reproducer before and after those two fixes and the leak is present in
> both, i.e. this bug is not a regression caused by PROTON-1023 or PROTON-1024.
> Reproducer:
> (assuming localhost runs qdrouterd that is restarted every 5 seconds in a
> loop):
> {code}
> #!/usr/bin/python
> from time import sleep
> from uuid import uuid4
> from proton import ConnectionException
> from proton.utils import BlockingConnection
> import traceback
> import random
>
> while True:
>     sleep(random.uniform(0.3,3))
>     try:
>         conn = BlockingConnection("proton+amqp://localhost:5672", ssl_domain=None, heartbeat=2)
>         rec = conn.create_receiver("another_address", name=str(uuid4()), dynamic=False, options=None)
>         print "sleeping.."
>         sleep(random.uniform(0.3,3))
>         rec2 = conn.create_receiver("some_address", name=str(uuid4()), dynamic=False, options=None)
>     except ConnectionException:
>         try:
>             if conn:
>                 conn.close()
>         except Exception, e:
>             print e
>             print(traceback.format_exc())
> {code}
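The calling discipline agreed on here, closing the previous connection explicitly before the variable is rebound, can be sketched as below. This is an illustration only: `FakeConnection` and `reconnect` are hypothetical names so that the snippet runs without a broker; in the reproducer the object would be a `BlockingConnection`, and its `close()` method is the only API assumed.

```python
class FakeConnection(object):
    """Hypothetical stand-in for a connection object with a close() method."""
    open_count = 0  # number of instances that have not been closed yet

    def __init__(self):
        FakeConnection.open_count += 1
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            FakeConnection.open_count -= 1


def reconnect(old_conn, factory):
    """Close old_conn (if any) before creating its replacement."""
    if old_conn is not None:
        try:
            old_conn.close()
        except Exception as exc:
            # close() may itself raise on a broken connection; report and go on.
            print("close failed: %s" % exc)
    return factory()


conn = None
for _ in range(3):
    conn = reconnect(conn, FakeConnection)
```

After the loop only the most recent instance is still open, which is the behaviour the reproducer loop lacks when it rebinds `conn` without closing it.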
[jira] [Reopened] (PROTON-1003) ssl transport layer does not define an error handler
[ https://issues.apache.org/jira/browse/PROTON-1003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Moravec reopened PROTON-1003:
---

Reopening both PROTON-1000 and PROTON-1003: at least the backport to 0.9 does not fix it.

Reproducer:

{code}
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException, Timeout
from proton import SSLDomain, SSLException
#from proton import Message
from proton.utils import BlockingConnection
import random
import threading

ROUTER_ADDRESS = "amqps://dispatch-router:5671"
ADDRESS = "some_destination"
HEARTBEAT = 2
TIMEOUT = 3

class ReceiverThread(threading.Thread):
    def __init__(self,domain=None):
        super(ReceiverThread, self).__init__()
        self.domain=domain
        self.running = True

    def connect(self):
        self.conn = BlockingConnection(ROUTER_ADDRESS, ssl_domain=self.domain, heartbeat=HEARTBEAT)
        self.recv = self.conn.create_receiver(ADDRESS, name=str(uuid4()), dynamic=False, options=None)

    def run(self):
        while self.running:
            self.connect()
            while self.running:
                try:
                    msg = self.recv.receive(TIMEOUT)
                    if (msg):
                        print "message received: %s" % msg
                        self.recv.accept()
                except:
                    print "receiver failed to accept msg, reconnecting.."
                    try:
                        self.conn.close() # underlying TCP connection never gone
                    except:
                        print "receiver thread: failed to close connection"
                        pass
                    self.connect()

    def stop(self):
        self.running = False

ca_certificate='/etc/rhsm/ca/katello-default-ca.pem'
client_certificate='/etc/pki/consumer/bundle.pem'
client_key=None

domain = SSLDomain(SSLDomain.MODE_CLIENT)
domain.set_trusted_ca_db(ca_certificate)
domain.set_credentials( client_certificate, client_key or client_certificate, None)
domain.set_peer_authentication(SSLDomain.VERIFY_PEER)

rcv_thread = ReceiverThread(domain)
rcv_thread.start()
_in = raw_input("Press Enter to exit:")
rcv_thread.stop()
rcv_thread.join()
{code}

With SSL enabled (as above), there is an ESTABLISHED connection leak, one per `receiver failed to accept msg, reconnecting` log line; `self.conn.close()` apparently has no effect.

With SSL disabled (just set `ssl_domain=None`), there is a CLOSE_WAIT connection leak, again one per `receiver failed to accept msg, reconnecting` log line.

> ssl transport layer does not define an error handler
>
> Key: PROTON-1003
> URL: https://issues.apache.org/jira/browse/PROTON-1003
> Project: Qpid Proton
> Issue Type: Bug
> Components: proton-c
> Affects Versions: 0.10
> Reporter: Gordon Sim
> Assignee: Ken Giusti
>
> When the local process times out an ssl based connection due to lack of
> heartbeats from its peer, the underlying socket is never closed. The cause of
> this appears to be that the ssl transport layer doesn't define an error
> handler, which is what is used to notify it of the locally initiated timeout.
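The ESTABLISHED and CLOSE_WAIT counts mentioned above can be tracked per iteration by parsing `/proc/net/tcp` on Linux, where the fourth column holds the socket state as a hex code (`01` is ESTABLISHED, `08` is CLOSE_WAIT). A minimal sketch; `count_tcp_states` is an illustrative helper, and it counts IPv4 sockets system-wide rather than for a single process:

```python
from collections import Counter

# Subset of the Linux TCP state codes used in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


def read_tcp_states(path="/proc/net/tcp"):
    """Read and count the current socket states (Linux only)."""
    with open(path) as tcp_file:
        return count_tcp_states(tcp_file.read())
```

Printing `read_tcp_states()` after every `reconnecting..` message would show whether one more socket stays behind per reconnect.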
[jira] [Reopened] (PROTON-1000) Connection leak on heartbeat-timeouted connections
[ https://issues.apache.org/jira/browse/PROTON-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Moravec reopened PROTON-1000:
---

Reopening both PROTON-1000 and PROTON-1003: at least the backport to 0.9 does not fix it.

> Connection leak on heartbeat-timeouted connections
>
> Key: PROTON-1000
> URL: https://issues.apache.org/jira/browse/PROTON-1000
> Project: Qpid Proton
> Issue Type: Bug
> Components: python-binding
> Affects Versions: 0.9
> Reporter: Pavel Moravec
> Assignee: Gordon Sim
> Fix For: 0.11
>
> Using gofer/katello-agent, which uses BlockingConnection from Proton Reactor
> with heartbeats set up, if a connection times out due to heartbeats,
> Proton does not close the TCP connection. That causes a TCP connection leak,
> even though gofer properly called BlockingConnection.close() and dropped every
> reference to that class instance.
> Checking tcpdump, Proton simply ignores the timed-out connections: it does
> not respond in any way to the communication partner, whatever the partner sends
> (in some scenarios the peer sends an AMQP performative that Proton was expected
> to respond to; in another scenario the peer dropped the TCP connection by
> sending a FIN+ACK packet but Proton did not send a FIN packet back; the only
> traffic seen in tcpdump is ACKing on the TCP layer done by the OS, not by Proton).
> And Proton ignores an attempt of Proton reactor to close the
> connection/container, raising:
> Sep 21 15:02:35 my-capsule goferd: File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 263, in on_transport_closed
> Sep 21 15:02:35 my-capsule goferd: raise ConnectionException("Connection %s disconnected" % self.url);
> Sep 21 15:02:35 my-capsule goferd: ConnectionException: Connection amqps://satellite.example.com:5647 disconnected
> for SSL connections, and raising:
> Sep 21 14:56:28 my-capsule goferd: File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 259, in on_transport_tail_closed
> Sep 21 14:56:28 my-capsule goferd: self.on_transport_closed(event)
> Sep 21 14:56:28 my-capsule goferd: File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 263, in on_transport_closed
> Sep 21 14:56:28 my-capsule goferd: raise ConnectionException("Connection %s disconnected" % self.url);
> Sep 21 14:56:28 my-capsule goferd: ConnectionException: Connection amqps://satellite.example.com:5647 disconnected
> (some difference between SSL
[jira] [Commented] (PROTON-1000) Connection leak on heartbeat-timeouted connections
[ https://issues.apache.org/jira/browse/PROTON-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902171#comment-14902171 ]

Pavel Moravec commented on PROTON-1000:
---

Reproducer for SSL:

{code}
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton import SSLDomain, SSLException
from proton.utils import BlockingConnection
import fileinput
import traceback
from gofer.messaging.adapter.proton.connection import Connection

ca_certificate='/path/to/ca.pem'
client_certificate='/path/to/client_cert.pem'
client_key=None

while True:
    domain = SSLDomain(SSLDomain.MODE_CLIENT)
    domain.set_trusted_ca_db(ca_certificate)
    domain.set_credentials( client_certificate, client_key or client_certificate, None)
    domain.set_peer_authentication(SSLDomain.VERIFY_PEER)
    conn = BlockingConnection("amqps://localhost:5671", ssl_domain=domain, heartbeat=5)
    rec = conn.create_receiver("another_address", name=str(uuid4()), dynamic=False, options=None)
    try:
        sleep(11)
        snd = conn.create_sender("another_address", name=str(uuid4()))
    except ConnectionException:
        try:
            conn.close()
        except Exception, e:
            print e
            pass
{code}

Now, after every iteration, one ESTABLISHED connection remains open.

Backtrace of the "e" exception is:

{code}
  File "proton-1000.py", line 35, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 219, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 231, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 143, in process
    self._check_errors()
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 3737, in dispatch
    ev.dispatch(self.handler)
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 3662, in dispatch
    result = dispatch(handler, type.method, self)
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 3551, in dispatch
    return m(*args)
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 257, in on_transport_tail_closed
    self.on_transport_closed(event)
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 261, in on_transport_closed
    raise ConnectionException("Connection %s disconnected" % self.url);
{code}
[jira] [Created] (PROTON-1000) Connection leak on heartbeat-timeouted connections
Pavel Moravec created PROTON-1000:
---

Summary: Connection leak on heartbeat-timeouted connections
Key: PROTON-1000
URL: https://issues.apache.org/jira/browse/PROTON-1000
Project: Qpid Proton
Issue Type: Bug
Components: python-binding
Affects Versions: 0.9
Reporter: Pavel Moravec

Using gofer/katello-agent, which uses BlockingConnection from Proton Reactor with heartbeats set up, if a connection times out due to heartbeats, Proton does not close the TCP connection. That causes a TCP connection leak, even though gofer properly called BlockingConnection.close() and dropped every reference to that class instance.

Checking tcpdump, Proton simply ignores the timed-out connections: it does not respond in any way to the communication partner, whatever the partner sends (in some scenarios the peer sends an AMQP performative that Proton was expected to respond to; in another scenario the peer dropped the TCP connection by sending a FIN+ACK packet but Proton did not send a FIN packet back; the only traffic seen in tcpdump is ACKing on the TCP layer done by the OS, not by Proton).

And Proton ignores an attempt of Proton reactor to close the connection/container, raising:

Sep 21 15:02:35 my-capsule goferd: File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 263, in on_transport_closed
Sep 21 15:02:35 my-capsule goferd: raise ConnectionException("Connection %s disconnected" % self.url);
Sep 21 15:02:35 my-capsule goferd: ConnectionException: Connection amqps://satellite.example.com:5647 disconnected

for SSL connections, and raising:

Sep 21 14:56:28 my-capsule goferd: File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 259, in on_transport_tail_closed
Sep 21 14:56:28 my-capsule goferd: self.on_transport_closed(event)
Sep 21 14:56:28 my-capsule goferd: File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 263, in on_transport_closed
Sep 21 14:56:28 my-capsule goferd: raise ConnectionException("Connection %s disconnected" % self.url);
Sep 21 14:56:28 my-capsule goferd: ConnectionException: Connection amqps://satellite.example.com:5647 disconnected

(some difference between SSL and non-SSL could come from the fact that in my case the server part - qdrouterd / Qpid Dispatch Router - sends a FIN+ACK packet for the non-SSL connection, while it does not send anything for the SSL connection and keeps sending empty AMQP frames forever because heartbeats are enabled)
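The CLOSE_WAIT state described above is what a socket enters when the peer has sent its FIN and the local application has not yet closed its own end. A minimal standard-library sketch of that mechanism, unrelated to Proton's internals: the peer's FIN shows up as an empty read, and the socket only leaves CLOSE_WAIT once the application calls `close()`. The function name `demo_peer_close` is illustrative.

```python
import socket


def demo_peer_close():
    """Show that a peer's FIN is seen as an empty read on the local socket."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free local port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.settimeout(5)

    client.close()                  # the peer sends FIN
    data = server_side.recv(1024)   # b"" means the peer has closed;
                                    # server_side is now in CLOSE_WAIT
    server_side.close()             # sends our FIN and releases the socket
    listener.close()
    return data


if __name__ == "__main__":
    print(repr(demo_peer_close()))
```

If the final `server_side.close()` were never reached, the socket would stay in CLOSE_WAIT for the lifetime of the process, which matches the non-SSL symptom reported here.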
[jira] [Commented] (PROTON-1000) Connection leak on heartbeat-timeouted connections
[ https://issues.apache.org/jira/browse/PROTON-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901368#comment-14901368 ]

Pavel Moravec commented on PROTON-1000:
---

I think I have a reproducer based on Proton Reactor (derived from what gofer does):

{code}
#!/usr/bin/python
from time import sleep
from uuid import uuid4
from proton import ConnectionException
from proton import SSLDomain, SSLException
from proton.utils import BlockingConnection
import fileinput

domain = None
conn = BlockingConnection("proton+amqp://localhost:5672", ssl_domain=domain, heartbeat=5)
rec = conn.create_receiver("some_address", name=str(uuid4()), dynamic=False, options=None)
try:
    sleep(9)
    snd = conn.create_sender("another_address", name=str(uuid4()))
except ConnectionException:
    try:
        conn.close()
    except Exception, e:
        print e
        pass
_in = raw_input("Check for CLOSE_WAIT before pressing Enter: ")
{code}

Execute that code and, at the prompt, check that the python process has a CLOSE_WAIT connection.

Backtrace of the caught exception "e" is:

{code}
  File "proton-1000.py", line 24, in <module>
    conn.close()
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 219, in close
    msg="Closing connection")
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 231, in wait
    self.container.process()
  File "/usr/lib64/python2.7/site-packages/proton/reactor.py", line 143, in process
    self._check_errors()
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 3737, in dispatch
    ev.dispatch(self.handler)
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 3662, in dispatch
    result = dispatch(handler, type.method, self)
  File "/usr/lib64/python2.7/site-packages/proton/__init__.py", line 3551, in dispatch
    return m(*args)
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 257, in on_transport_tail_closed
    self.on_transport_closed(event)
  File "/usr/lib64/python2.7/site-packages/proton/utils.py", line 261, in on_transport_closed
    raise ConnectionException("Connection %s disconnected" % self.url);
{code}

It is worth trying SSL as well, where I noticed slightly different behaviour; adding the SSL setup to the reproducer should be trivial, though.