[jira] [Commented] (QPID-7317) Deadlock on publish

2017-08-09 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120575#comment-16120575
 ] 

ASF subversion and git services commented on QPID-7317:
---

Commit 7c968c8318f4c4a70fbe0ebbcdbe0a09d8cfbb3e in qpid-python's branch 
refs/heads/master from [~aconway]
[ https://git-wip-us.apache.org/repos/asf?p=qpid-python.git;h=7c968c8 ]

QPID-7884: Python client should not raise on close() after stop.

The python client throws exceptions out of AMQP object methods (Connection, 
Session and Link objects) if the selector has been stopped, to prevent hanging 
(see QPID-7317 Deadlock on publish).

However, to be robust to shut-down order, the close() method should not throw an 
exception in this case, but should be a no-op.
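
For illustration, here is a minimal sketch of that behaviour. All names here 
(SelectorStopped, Selector.check_running, this Session wrapper) are hypothetical 
stand-ins, not the actual qpid-python API:

{code}
# Minimal sketch of close()-after-stop semantics; the names are
# illustrative, not the real qpid-python implementation.

class SelectorStopped(Exception):
    """The qpid.messaging selector thread has been stopped."""

class Selector(object):
    def __init__(self):
        self.stopped = False

    def check_running(self):
        if self.stopped:
            raise SelectorStopped("qpid.messaging thread has been stopped")

class Session(object):
    def __init__(self, selector):
        self._selector = selector
        self._closed = False

    def send(self, message):
        # Ordinary operations fail fast rather than hang.
        self._selector.check_running()

    def close(self):
        # Robust to shut-down order: once the selector is stopped there is
        # nothing left to release, so close() becomes a no-op instead of
        # raising like the other methods do.
        if self._closed:
            return
        try:
            self._selector.check_running()
        except SelectorStopped:
            self._closed = True
            return
        # ... the normal close handshake with the broker would go here ...
        self._closed = True
{code}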


> Deadlock on publish
> ---
>
> Key: QPID-7317
> URL: https://issues.apache.org/jira/browse/QPID-7317
> Project: Qpid
>  Issue Type: Bug
>  Components: Python Client
>Affects Versions: 0.32
> Environment: python-qpid-0.32-13.fc23.noarch
>Reporter: Brian Bouterse
>Assignee: Alan Conway
> Fix For: qpid-python-1.36.0
>
> Attachments: bad_child.py, bad_child.py, bt.txt, lsof.txt, 
> pystack.17806, spout-hang.py, spout-hang-trace.txt, taabt.txt, worker-stacks
>
>
> When publishing a task with qpid.messaging it deadlocks and our application 
> cannot continue. This has not been a problem for several releases, but within 
> a few days recently, another Satellite developer and I both experienced the 
> issue on separate machines running different distros. He is using an MRG-built 
> package (not sure of the version). I am using python-qpid-0.32-13.fc23.
> Both deadlocked machines had core dumps taken of the deadlocked processes, and 
> the dumps show only 1 Qpid thread when I expect there to be 2. There are other 
> mongo threads, but those are idle as expected and not related. The traces 
> show our application calling into qpid.messaging to publish a message to the 
> message bus.
> This problem happens intermittently, and in cases where message publish is 
> successful I've verified by core dump that there are the expected 2 threads 
> for Qpid.






[jira] [Commented] (QPID-7317) Deadlock on publish

2017-02-10 Thread Alan Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861827#comment-15861827
 ] 

Alan Conway commented on QPID-7317:
---

I hope this will address the pulp hang seen in the wild; I have been unable to 
reproduce it with the fix applied.

Note you can apply this patch by replacing 
/usr/lib/python2.7/site-packages/qpid/selector.py with the patched file; it is 
the only file modified, and it should work with any version of python-qpid 
released in the last year.

If you do see this hang again, please report it to this JIRA with the output of 
the following commands from the machine where the hung celery workers are:
{code}
rpm -q python-qpid               # or attach a copy of /usr/lib/python2.7/site-packages/qpid/selector.py
journalctl                       # use --since and --until to get a few minutes before/after the hang
yum install -y gdb python-debug  # needed for worker-stacks script
worker-stacks                    # script attached to this JIRA
{code}
Here is log output showing that pulp does indeed use qpid.messaging in an 
illegal state that could have caused a hang prior to this fix. However, it is 
not an exact match for the reported stack traces, so I'm not yet 100% sure the 
problem is solved. I am not able to reproduce the original hang, or traces that 
look like it, with the fix.

{code}
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416) illegal use of qpid.messaging at:
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib64/python2.7/threading.py", line 784, in __bootstrap
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self.__bootstrap_inner()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib64/python2.7/threading.py", line 811, in __bootstrap_inner
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self.run()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py", line 55, in run
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self.monitor_events()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/pulp/server/async/scheduler.py", line 82, in monitor_events
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     recv.capture(limit=None, timeout=None, wakeup=True)
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 715, in __exit__
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self.release()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 330, in release
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self._close()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 298, in _close
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self._do_close_self()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 288, in _do_close_self
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self.maybe_close_channel(self._default_channel)
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 280, in maybe_close_channel
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     channel.close()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/kombu/transport/qpid.py", line 983, in close
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self._broker.close()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/qpidtoollibs/broker.py", line 48, in close
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     self.sess.close()
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)   File "/usr/lib/python2.7/site-packages/qpid/selector.py", line 213, in log_raise
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416)     _check(exception, 1)
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416) qpid.messaging thread has been stopped
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416) qpid.messaging was previously stopped at:
Feb 10 14:50:13 pulp-server pulp[7427]: qpid.messaging:ERROR: (7427-28416) File
{code}

[jira] [Commented] (QPID-7317) Deadlock on publish

2017-02-10 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861790#comment-15861790
 ] 

ASF subversion and git services commented on QPID-7317:
---

Commit fda9594010b13d99134c10cff54b0ba9d82c0c27 in qpid-python's branch 
refs/heads/master from [~aconway]
[ https://git-wip-us.apache.org/repos/asf?p=qpid-python.git;h=fda9594 ]

QPID-7317: More robust qpid.selector with better logging

This commit disables the selector and related qpid.messaging objects when the
selector thread exits for any reason: process exit, fork, exception, etc. Any
subsequent use will throw an exception and log the locations of both the failed
call and where the selector thread was stopped. This should be slightly more
predictable and robust than commit 037c573, which tried to keep the selector
alive in a daemon thread.

I have not been able to hang the pulp_smash test suite with this patch. The new
logging shows that celery workers do sometimes use qpid.messaging in an illegal
state, which could cause the reported hang. So far I have not seen a stack trace
that is an exact match for the reported stacks. If this patch does not address
the pulp problem, it should at least provide much better debugging information
in the journalctl log output after the hang.
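
As a rough sketch of that scheme (hypothetical names; the real qpid/selector.py 
differs in detail): capture a stack trace at the moment the selector stops, and 
on any later use log both that location and the location of the offending call:

{code}
# Sketch only: remember where the selector stopped, and on any later use
# log that location plus the offending call site, then raise immediately.
import logging
import traceback

log = logging.getLogger("qpid.messaging")

class SelectorStopped(Exception):
    pass

class Selector(object):
    def __init__(self):
        self.exception = None        # set when the run() thread exits

    def stop(self):
        self.exception = SelectorStopped(
            "qpid.messaging thread has been stopped; previously stopped at:\n"
            + "".join(traceback.format_stack()))

    def log_raise(self):
        # Called on entry to every public operation.
        if self.exception is not None:
            log.error("illegal use of qpid.messaging at:\n%s\n%s",
                      "".join(traceback.format_stack()), self.exception)
            raise self.exception
{code}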








[jira] [Commented] (QPID-7317) Deadlock on publish

2017-01-17 Thread Alan Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15826589#comment-15826589
 ] 

Alan Conway commented on QPID-7317:
---

I have seen a similar hang. I'm starting to think that the missing thread is a 
consequence of the hang, not a cause, which would explain why our attempts to 
prevent or log that thread's death don't solve the problem or provide any new 
log data.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-12-28 Thread Brian Bouterse (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782996#comment-15782996
 ] 

Brian Bouterse commented on QPID-7317:
--

We haven't been able to reproduce this issue reliably (for example, in a unit or 
integration test). Today I read a report from a user which suggests their 
environment was deadlocked. Also, the Jenkins environment Pulp uses to run 
integration tests (which exercises the Qpid Python client) deadlocks roughly 5% 
of the time. With those reports in mind, I believe this issue is not resolved.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-12-24 Thread Robbie Gemmell (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15775154#comment-15775154
 ] 

Robbie Gemmell commented on QPID-7317:
--

The commit details show [~k-wall] was simply removing an exclude entry used 
when running the python test suite against the Java broker, to verify whether 
QPID-6122 was resolved by the changes made previously on this JIRA; the test 
had previously been excluded via QPID-6122, which is the main JIRA referenced. 
The change had no relation to the Java client, and it does not suggest this 
issue is fully resolved, just that the other one may have been.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-12-24 Thread Brian Bouterse (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15775097#comment-15775097
 ] 

Brian Bouterse commented on QPID-7317:
--

This issue is about the Python client, not the Java client. Perhaps commit 
1775853 resolves a similar issue in the Java client, but since that is a 
separate codebase it couldn't resolve this issue.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-12-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773258#comment-15773258
 ] 

ASF subversion and git services commented on QPID-7317:
---

Commit 1775853 from [~k-wall] in branch 'java/trunk'
[ https://svn.apache.org/r1775853 ]

QPID-6122: [Python Test Suite] Remove test exclusion 
qpid.tests.messaging.endpoints.TimeoutTests from Java Broker 0-10 Python 
excludes file

From this description of QPID-7317, it seems possible this is fixed.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-09-23 Thread Alan Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517632#comment-15517632
 ] 

Alan Conway commented on QPID-7317:
---

Not marking this resolved yet. The fix above has not been proven to resolve the 
issues seen in the field, but it definitely fixes several ways that an 
identical-looking hang can be created in the lab.

I will update this when I get confirmation one way or the other that it 
resolves the real problem.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-09-23 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15517624#comment-15517624
 ] 

ASF subversion and git services commented on QPID-7317:
---

Commit 037c5738734d8fecb7b7f7e7af4e4f14f9cd3a64 in qpid-python's branch 
refs/heads/master from [~aconway]
[ https://git-wip-us.apache.org/repos/asf?p=qpid-python.git;h=037c573 ]

QPID-7317: Fix hangs in qpid.messaging.

A hang is observed in processes using qpid.messaging: a thread is blocked 
waiting for the Selector to wake it, but there is no Selector.run thread.

This patch removes all the known ways that this hang can occur: either we 
function normally, or we immediately raise an exception and log a message 
starting with "qpid.messaging:" to the "qpid.messaging" logger.

The following issues are fixed:

1. The Selector.run() thread raises a fatal exception.

Use of qpid.messaging will re-raise the exception immediately, not hang.

2. The process forks, so the child has no Selector thread.

https://issues.apache.org/jira/browse/QPID-5637 resets the Selector after a 
fork. In addition, we now:

- Close Selector.waiter: its file descriptors are shared with the parent which
  can cause havoc if they "steal" each other's wakeups.

- Replace Endpoint._lock in related endpoints with a BrokenLock. If the parent
  is holding locks when it forks, they remain locked forever in the child.
  BrokenLock.acquire() raises instead of hanging.

3. Selector.stop() called on atexit.

Selector.stop was registered via atexit, which could cause a hang if 
qpid.messaging was used in a later-executing atexit function. That registration 
has been removed; Selector.run() is in a daemon thread, so there is no need for 
stop().

4. User calls Selector.stop() directly

There is no reason to do this for the default Selector used by qpid.messaging,
so for that case stop() is now ignored. It works as before for code that creates
its own qpid.Selector instances.
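
To make point 2 concrete, here is a sketch of the BrokenLock idea (the name 
comes from the commit message above; the code itself is an illustration, not 
the actual client). A lock the parent held at fork time stays locked forever 
in the child, so the child's replacement raises instead of blocking:

{code}
# Illustration of BrokenLock: after fork(), any Endpoint._lock the parent
# held would never be released in the child, so acquire() fails fast.

class BrokenLockError(Exception):
    pass

class BrokenLock(object):
    def acquire(self, blocking=True):
        raise BrokenLockError(
            "lock was held across fork(); this qpid.messaging object "
            "cannot be used in the child process")

    def release(self):
        raise BrokenLockError("lock was held across fork()")

    def __enter__(self):
        self.acquire()

    def __exit__(self, *exc_info):
        self.release()
{code}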








[jira] [Commented] (QPID-7317) Deadlock on publish

2016-08-18 Thread Brian Bouterse (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426432#comment-15426432
 ] 

Brian Bouterse commented on QPID-7317:
--

These issues are observed in the Pulp [0] codebase. The pulp codebase does very 
little directly with qpid.messaging, so I don't expect you'll find an issue 
there. Pulp does rely heavily on celery [1], which is what does the forking, 
using a dependency it maintains called billiard [2]. Billiard itself is a fork 
of the Python multiprocessing library. Celery uses Kombu [3], which has a 
plugin for Qpid [4] that I maintain.

Note the master branches of [1], [2], and [3] are not the versions we use and 
may be significantly different; browsing the right branch is important. A 
typical recent install gets these versions:

python-kombu-3.0.33-5.fc24.noarch
python2-celery-3.1.20-2.fc24.noarch
python-billiard-3.3.0.22-2.fc24.x86_64

[0]: https://github.com/pulp/pulp
[1]: https://github.com/celery/celery/tree/3.1
[2]: https://github.com/celery/billiard
[3]: https://github.com/celery/kombu/tree/3.0/kombu
[4]: https://github.com/celery/kombu/blob/3.0/kombu/transport/qpid.py







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-07-13 Thread Dennis Kliban (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375584#comment-15375584
 ] 

Dennis Kliban commented on QPID-7317:
-

I experienced this issue twice in the last 24 hours. I am attaching a core 
dump from my process.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-06-29 Thread Brian Bouterse (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355169#comment-15355169
 ] 

Brian Bouterse commented on QPID-7317:
--

One other point of information: I believe my machine resumed from sleep just 
before I attempted this publish. The order of events was:

1. create my vagrant VM and start the processes which use qpid.messaging
2. put my host (and thus the VM) to sleep
3. resume my host (and the VM)
4. ssh back to the VM
5. run the publish (no process restart)
6. observe the deadlock.

Note that two other processes did publish messages correctly, but the third 
process's publish deadlocked. Each message is published by a different process. 
My point is that not all processes were affected, just that one process; I 
don't know why.







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-06-23 Thread Michael Hrivnak (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346797#comment-15346797
 ] 

Michael Hrivnak commented on QPID-7317:
---

The process in question did not produce any log statements from or related to 
qpid. 

This is what I see from strace:

{code}
# strace -p 21739
Process 21739 attached
restart_syscall(<... resuming interrupted call ...>) = 0
poll([{fd=19, events=POLLIN}], 1, 3000) = 0 (Timeout)
poll([{fd=19, events=POLLIN}], 1, 3000) = 0 (Timeout)
poll([{fd=19, events=POLLIN}], 1, 3000) = 0 (Timeout)
poll([{fd=19, events=POLLIN}], 1, 3000) = 0 (Timeout)
poll([{fd=19, events=POLLIN}], 1, 3000) = 0 (Timeout)
{code}

Note that I do not see that polling happen for a healthy child worker process. 
I think thread 3, the one the attached backtrace is from, is stuck in a loop, 
polling FD 19 every 3 seconds for some condition that is never going to occur.

FD 19 appears to be a FIFO pipe. I will attach lsof output separately.

{code}
# ls -l /proc/21739/fd/19
lr-x--. 1 apache apache 64 Jun 21 13:45 /proc/21739/fd/19 -> pipe:[152836]
{code}
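
For context, the wait pattern strace shows can be reproduced in a few lines: a 
thread polling the read end of a pipe that nothing will ever write to. My 
reading is that FD 19 is the client's internal wakeup pipe, normally written by 
the selector thread; that interpretation is an assumption, not confirmed here:

{code}
# Tiny reproduction of the observed wait pattern (illustrative only):
# polling the read end of a pipe that no other thread will ever write to.
import os
import select

r, w = os.pipe()                  # stand-in for the client's wakeup pipe
poller = select.poll()
poller.register(r, select.POLLIN)

for _ in range(5):
    # Matches the strace output: poll([{fd=19, events=POLLIN}], 1, 3000)
    # returning 0 (Timeout) again and again.
    if poller.poll(3000):         # 3-second timeout, then retry
        break                     # a live selector thread would write to w
{code}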







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-06-23 Thread Michael Hrivnak (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346775#comment-15346775
 ] 

Michael Hrivnak commented on QPID-7317:
---

List of installed packages that go with that backtrace:

{code}
# rpm -qa | grep qpid
qpid-cpp-client-devel-0.30-11.el7sat.x86_64
qpid-java-client-0.30-3.el7.noarch
python-qpid-qmf-0.30-5.el7.x86_64
sat6-atom.refarch.bos.redhat.com-qpid-router-server-1.0-1.noarch
sat6-atom.refarch.bos.redhat.com-qpid-broker-1.0-1.noarch
qpid-dispatch-router-0.4-11.el7.x86_64
qpid-cpp-server-linearstore-0.30-11.el7sat.x86_64
qpid-cpp-server-0.30-11.el7sat.x86_64
libqpid-dispatch-0.4-11.el7.x86_64
python-gofer-qpid-2.7.6-1.el7sat.noarch
qpid-tools-0.30-4.el7.noarch
sat6-atom.refarch.bos.redhat.com-qpid-client-cert-1.0-1.noarch
sat6-atom.refarch.bos.redhat.com-qpid-router-client-1.0-1.noarch
qpid-proton-c-0.9-16.el7.x86_64
qpid-java-common-0.30-3.el7.noarch
qpid-qmf-0.30-5.el7.x86_64
qpid-cpp-client-0.30-11.el7sat.x86_64
python-qpid-0.30-9.el7sat.noarch
tfm-rubygem-qpid_messaging-0.30.0-7.el7sat.x86_64
{code}







[jira] [Commented] (QPID-7317) Deadlock on publish

2016-06-23 Thread Brian Bouterse (JIRA)

[ 
https://issues.apache.org/jira/browse/QPID-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346542#comment-15346542
 ] 

Brian Bouterse commented on QPID-7317:
--

There is no logging that the background thread is dying. On the EL6 machine the 
following turned up nothing: `grep -r Fatal /var/log/`. Also, on F23 
(journalctl) the following turned up nothing: `journalctl --no-pager -l | grep 
Fatal`.

We expect 'Fatal' to be in the logged message because both implementations were 
verified to contain that logging in this commit [0]. The line we expect to log 
is here [1]. Per the commit message in [0], I think the thread that is dying is 
called the Selector thread.

I also grepped with `grep -r "thread has" journalctl_output.txt`, thinking it 
could be logged by this line [2].

[0]: 
https://github.com/apache/qpid/commit/11368ef1a01233f253eb9eadbadaa9cb9b8465f3
[1]: 
https://github.com/apache/qpid/blob/trunk/qpid/python/qpid/messaging/driver.py#L420
[2]: 
https://github.com/apache/qpid/commit/11368ef1a01233f253eb9eadbadaa9cb9b8465f3#diff-a2870153748f29e8583ccdbe0c527e8dR157



