[jira] [Created] (PROTON-1170) closed links are never deleted
michael goulish created PROTON-1170: --- Summary: closed links are never deleted Key: PROTON-1170 URL: https://issues.apache.org/jira/browse/PROTON-1170 Project: Qpid Proton Issue Type: Bug Components: proton-c Environment: miserable Reporter: michael goulish I wrote a reactor-based application that makes a single connection, and then repeatedly makes-and-closes links (receivers) on that connection. It makes and closes the links as fast as possible: as soon as it gets the on_receiver_close event, it makes a new one. As soon as it gets the on_receiver_open event -- it closes that receiver. This application talks to a dispatch router. Problem: Both the router and my application grow their memory (RSS) rapidly -- and the router's ability to respond to new link creations slows down rapidly. Looking at the router with Valgrind/Callgrind, after about 15,000 links have been created and closed I see that 45% of all CPU time on the router is being consumed by pn_find_link(). Instrumenting that code, I see that the list it is looking at never decreases in size. I tried creating my links with the "lifetime_policy" set to DELETE_ON_CLOSE, but that had no effect. Grepping for that symbol, I see that it does not occur in the proton C code except in its definition, and in a printing convenience function. Major scalability bug. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
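The symptom reported above — pn_find_link() consuming 45% of CPU because the link list only ever grows — can be modeled with a small simulation. This is a sketch of the reported behaviour, not proton internals; all names here are hypothetical:

```python
class Link:
    def __init__(self, name):
        self.name = name
        self.closed = False

class Session:
    def __init__(self):
        self.links = []  # models the bug: closed links are never removed

    def open_link(self, name):
        link = Link(name)
        self.links.append(link)
        return link

    def find_link(self, name):
        # models pn_find_link(): a linear scan over every link ever
        # created, including ones that were closed long ago
        steps = 0
        for link in self.links:
            steps += 1
            if link.name == name and not link.closed:
                return link, steps
        return None, steps

s = Session()
for _ in range(15000):           # the JIRA's ~15,000 open/close cycles
    dead = s.open_link("receiver")
    dead.closed = True

link = s.open_link("receiver")
found, steps = s.find_link("receiver")
# the scan walks all 15,000 dead links before reaching the live one
```

Each new lookup pays for every link ever created, which matches the observed slowdown and the list that "never decreases in size".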
Re: Does anyone read this list?
I do think this list is meant to be mostly about the mechanical details of development, and any topics that would be of interest to actual users are meant to go on the users list. The fact that there is no documentation that might have told you this is actually intentional. It's to help get you ready to read the proton code. :-) - Original Message - > Troy, > > I monitor both this discussion (proton) list and the users list. > > It is true there is more going on in the users discussion list. I tend to > post to the users list because a lot of times discussion items posted here > (proton) are just not picked up here as readily as on users. > > I agree with your general assessment. It does pose the question "When is it > appropriate to post to this discussion list?" > > Paul Flores > > > > From: Troy Daniels [troy.dani...@stresearch.com] > Sent: Wednesday, March 23, 2016 10:02 AM > To: proton@qpid.apache.org > Subject: Does anyone read this list? > > It seems like there are two types of posts to this list: automated posts when > there is a commit to version control, and initial questions from new users. > There does not seem to be discussion or answers to questions. > > It seems like I should unsubscribe and find a different forum for my > questions. Is that an accurate assessment? > > Troy
Re: [VOTE] Release Qpid Proton 0.12.0
+1 Testing done: I used it with all of my performance tests: * point-to-point communication with C and CPP clients. * many CPP senders and receivers intermediated by a router The CPP clients exercise the proton::handler event interface. - Original Message - The artifacts proposed for release: https://dist.apache.org/repos/dist/dev/qpid/proton/0.12.0-rc/ Please indicate your vote below. If you favor releasing the 0.12.0 RC bits as 0.12.0 GA, vote +1. If you have reason to think the RC is not ready for release, vote -1. Thanks, Justin
Re: PN_REACTOR_QUIESCED
But it's obvious how this constant was chosen. With circular reasoning. - Original Message - > On Mon, 2015-10-12 at 16:05 -0400, aconway wrote: > > ... > > +1, that looks like the right fix. 3141 is an odd choice of default, > > even for a mathematician. > > > > At this point, I'm desperately trying to find an appropriate pi joke : > -) > > Andrew > >
[jira] [Created] (PROTON-1009) message.h does not have a set method for annotations
michael goulish created PROTON-1009: --- Summary: message.h does not have a set method for annotations Key: PROTON-1009 URL: https://issues.apache.org/jira/browse/PROTON-1009 Project: Qpid Proton Issue Type: Bug Components: proton-c Reporter: michael goulish Comments above the method pn_message_annotations() indicate that it can both set and get annotations -- but in fact it has no way to set. And it looks like there is no other way in the C API, either. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PROTON-1009) message.h does not have a set method for annotations
[ https://issues.apache.org/jira/browse/PROTON-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish resolved PROTON-1009. - Resolution: Not A Problem Oops. I didn't realize that the function is returning a pointer that can be used to change the annotations. *That's* how you set them. Sorry for the noise. > message.h does not have a set method for annotations > > > Key: PROTON-1009 > URL: https://issues.apache.org/jira/browse/PROTON-1009 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c > Reporter: michael goulish > > Comments above the method pn_message_annotations() indicate that it can both > set and get annotations -- but in fact it has no way to set. > And it looks like there is no other way in the C API, either. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
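The resolution's point — pn_message_annotations() returns a live handle, and mutating it is how you "set" — is the get-a-mutable-reference pattern. A generic sketch of that pattern (hypothetical class, not the actual proton binding):

```python
class Message:
    """Models a getter that exposes internal state for in-place editing."""
    def __init__(self):
        self._annotations = {}

    def annotations(self):
        # returns the live mapping itself, not a copy: callers "set"
        # annotations by mutating what they get back
        return self._annotations

msg = Message()
msg.annotations()["x-opt-example"] = 42   # this *is* the setter
```

In the C API the same idea applies: the pn_data_t* returned by the getter is writable, so there is no separate set function.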
[jira] [Closed] (PROTON-992) Proton's use of Cyrus SASL is not thread-safe.
[ https://issues.apache.org/jira/browse/PROTON-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish closed PROTON-992. -- Resolution: Duplicate this is a duplicate of PROTON-862 > Proton's use of Cyrus SASL is not thread-safe. > -- > > Key: PROTON-992 > URL: https://issues.apache.org/jira/browse/PROTON-992 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.10 > Reporter: michael goulish >Assignee: michael goulish >Priority: Critical > > Documentation for the Cyrus SASL library says that the library is believed to > be thread-safe only if the code that uses it meets several requirements. > The requirements are: > * you supply mutex functions (see sasl_set_mutex()) > * you make no libsasl calls until sasl_client/server_init() completes > * no libsasl calls are made after sasl_done() is begun > * when using GSSAPI, you use a thread-safe GSS / Kerberos 5 library. > It says explicitly that that sasl_set* calls are not thread safe, since they > set global state. > The proton library makes calls to sasl_set* functions in : > pni_init_client() > pni_init_server(), and > pni_process_init() > Since those are internal functions, there is no way for code that uses Proton > to lock around those calls. > I think proton needs a new API call to let applications call > sasl_set_mutex(). Or something. > We probably also need other protections to meet the other requirements > specified in the Cyrus documentation (and quoted above). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PROTON-992) Proton's use of Cyrus SASL is not thread-safe.
[ https://issues.apache.org/jira/browse/PROTON-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14902731#comment-14902731 ] michael goulish commented on PROTON-992: oops. this is a duplicate of PROTON-862 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Commented] (PROTON-992) Proton's use of Cyrus SASL is not thread-safe.
Thanks! I wondered about that (briefly) but thought there was nothing to be done. If you have a sketch, I would be happy to see it! - Original Message - [ https://issues.apache.org/jira/browse/PROTON-992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14802937#comment-14802937 ] Andrew Stitcher commented on PROTON-992: Upon a few days' reflection I've realised that you cannot fix this problem with a global init for proton: there are a couple of parameters of the Cyrus library that *must* be set before calling either sasl_server_init() or sasl_client_init(). These are the configuration file directory and the server name. Currently, if you want to customise these you must set them before the first use of SASL, and this works because SASL is initialised lazily. However, in the proposed API there is literally no place to set them: since the documented usage pattern would be that you must initialise the library before using it, the Cyrus SASL library would be initialised before you are allowed to use the APIs that set the path or name. So I think the only workable solution is to keep an atomic count of uses of the library, initialise on going from 0->1, and finalise on going from 1->0. An important point here is to make the count atomic so that we can be sure to avoid any re-entrance into the initialisation or finalisation code (this is doable using gcc/clang builtins - we don't really support Cyrus on Win32, so Visual Studio isn't too important, but it has atomic primitives too). I would note that we really should also be using atomic counts like this in the OpenSSL code. > Proton's use of Cyrus SASL is not thread-safe.
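The scheme Andrew describes — an atomic use count that initialises on the 0->1 transition and finalises on 1->0 — is a standard refcounted-init pattern. A minimal sketch, with a lock standing in for the C atomic builtins and stub functions replacing the real Cyrus sasl_client_init()/sasl_done() calls:

```python
import threading

class SaslLibrary:
    """Refcounted wrapper: init on first use, finalize on last release."""
    def __init__(self, init_fn, done_fn):
        self._init_fn = init_fn
        self._done_fn = done_fn
        self._count = 0
        self._lock = threading.Lock()   # stands in for atomic inc/dec

    def acquire(self):
        with self._lock:
            self._count += 1
            if self._count == 1:        # 0 -> 1: initialise exactly once
                self._init_fn()

    def release(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:        # 1 -> 0: finalise exactly once
                self._done_fn()

events = []
lib = SaslLibrary(lambda: events.append("init"), lambda: events.append("done"))
lib.acquire()   # init runs here
lib.acquire()   # already initialised: no-op
lib.release()
lib.release()   # last user gone: done runs here
```

Holding the transition and the init/done call under the same lock (or atomic operation) is what prevents the re-entrance Andrew warns about.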
[jira] [Created] (PROTON-992) Proton's use of Cyrus SASL is not thread-safe.
michael goulish created PROTON-992: -- Summary: Proton's use of Cyrus SASL is not thread-safe. Key: PROTON-992 URL: https://issues.apache.org/jira/browse/PROTON-992 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.10 Reporter: michael goulish Priority: Critical Documentation for the Cyrus SASL library says that the library is believed to be thread-safe only if the code that uses it meets several requirements. The requirements are: * you supply mutex functions (see sasl_set_mutex()) * you make no libsasl calls until sasl_client/server_init() completes * no libsasl calls are made after sasl_done() is begun * when using GSSAPI, you use a thread-safe GSS / Kerberos 5 library. It says explicitly that that sasl_set* calls are not thread safe, since they set global state. The proton library makes calls to sasl_set* functions in : pni_init_client() pni_init_server(), and pni_process_init() Since those are internal functions, there is no way for code that uses Proton to lock around those calls. I think proton needs a new API call to let applications call sasl_set_mutex(). Or something. We probably also need other protections to meet the other requirements specified in the Cyrus documentation (and quoted above). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (PROTON-919) make C impl behave like java wrt channel_max error
[ https://issues.apache.org/jira/browse/PROTON-919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish closed PROTON-919. -- Resolution: Fixed Fix Version/s: 0.10 commit 4ee726002804d7286a8c76b42e0a0717e0798822 please NOTE that this change also adds #define PN_OK (0) to the list of errors in error.h make C impl behave like java wrt channel_max error -- Key: PROTON-919 URL: https://issues.apache.org/jira/browse/PROTON-919 Project: Qpid Proton Issue Type: Improvement Components: proton-c, python-binding Reporter: michael goulish Assignee: michael goulish Priority: Minor Fix For: 0.10 In the Java impl, I made TransportImpl throw an exception if the application tries to change the local channel_max setting after we have already sent the OPEN frame to the remote peer. ( Because at that point we communicate our channel_max limit to the peer -- no fair changing it afterwards.) One reviewer suggested that it would be nice if the C impl worked the same way. That would mean that pn_set_channel_max() would have to return a result code, which the Python binding would detect -- Python binding throws exception, python tests detect it -- so it would work same way as Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
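The fix described above amounts to a setter that returns a result code once a state transition (sending the OPEN frame) has happened. A generic sketch of that contract (hypothetical names and error value; PN_OK mirrors the #define the commit adds, PN_STATE_ERR is invented for illustration):

```python
PN_OK = 0           # mirrors the "#define PN_OK (0)" added to error.h
PN_STATE_ERR = -1   # hypothetical error code for this sketch

class Transport:
    def __init__(self):
        self.channel_max = 32767
        self.open_sent = False

    def send_open(self):
        # after this, our channel_max has been advertised to the peer
        self.open_sent = True

    def set_channel_max(self, value):
        if self.open_sent:
            return PN_STATE_ERR   # too late: limit already communicated
        self.channel_max = value
        return PN_OK

t = Transport()
assert t.set_channel_max(100) == PN_OK   # allowed before OPEN is sent
t.send_open()
```

The Python binding can then turn the nonzero return into an exception, matching the Java behaviour.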
[jira] [Closed] (PROTON-864) don't crash when channel number goes high
[ https://issues.apache.org/jira/browse/PROTON-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish closed PROTON-864. -- Resolution: Fixed Fix Version/s: 0.10 This is a duplicate of PROTON-842 don't crash when channel number goes high - Key: PROTON-864 URL: https://issues.apache.org/jira/browse/PROTON-864 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish Fix For: 0.10 Code in transport.c, and a little in engine.c, looks at the topmost bit in channel numbers to decide if the channels are in use. This causes crashes when the number of channels in a single connection goes beyond 32767. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PROTON-949) proton doesn't build with ccache swig
michael goulish created PROTON-949: -- Summary: proton doesn't build with ccache swig Key: PROTON-949 URL: https://issues.apache.org/jira/browse/PROTON-949 Project: Qpid Proton Issue Type: Bug Components: proton-c Reporter: michael goulish Thanks to aconway for finding this and saving me a day of madness and horror. On freshly-downloaded proton tree, if I use this swig: /usr/lib64/ccache/swig the build fails this way: qpid-proton/build/proton-c/bindings/python/cprotonPYTHON_wrap.c:4993:25: error: 'PN_HANDLE' undeclared (first use in this function) PNI_PYTRACER = *((PN_HANDLE *)(argp)); -- but if I delete that swig executable, and use the one in /bin/swig , then everything works. yikes. aconway believes the bug is in ccache-swig, not in proton, but I want to put this here in case this bites someone else in Proton Land. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PROTON-946) remove generated data structure definitions from protocol.h
[ https://issues.apache.org/jira/browse/PROTON-946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish updated PROTON-946: --- Description: Currently protocol.h.py reads the AMQP 1.0 spec xml files and generates all of its output into protocol.h -- even the data structure definitions. Those definitions are currently protected by #ifdef DEFINE_FIELDS , which is defined only in codec.c -- so the definitions only show up in that file, while other .c files only see the declarations. If DEFINE_FIELDS is #defined in any other file, compilation will fail with multiple definition errors. The structure declarations should remain in the .h file , but the actual definitions should be moved into a generated .c file. was: Currently protocol.h.py reads the AMQP 1.0 spec xml files and generates all of its output into protocol.h -- evel the data structure definitions. Those definitions are currently protected by #ifdef DEFINE_FIELDS , which is defined only in codec.c -- so the definitions only show up in that file, while other .c files only see the declarations. If DEFINE_FIELDS is #defined in any other file, compilation will fail with multiple definition errors. The structure declarations should remain in the .h file , but the actual definitions should be moved into a generated .c file. Summary: remove generated data structure definitions from protocol.h (was: remove generated data structure definitions from .protocol.h) remove generated data structure definitions from protocol.h --- Key: PROTON-946 URL: https://issues.apache.org/jira/browse/PROTON-946 Project: Qpid Proton Issue Type: Improvement Components: proton-c Affects Versions: 0.10 Reporter: michael goulish Assignee: michael goulish Currently protocol.h.py reads the AMQP 1.0 spec xml files and generates all of its output into protocol.h -- even the data structure definitions. 
Those definitions are currently protected by #ifdef DEFINE_FIELDS , which is defined only in codec.c -- so the definitions only show up in that file, while other .c files only see the declarations. If DEFINE_FIELDS is #defined in any other file, compilation will fail with multiple definition errors. The structure declarations should remain in the .h file , but the actual definitions should be moved into a generated .c file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PROTON-946) remove generated data structure definitions from .protocol.h
michael goulish created PROTON-946: -- Summary: remove generated data structure definitions from .protocol.h Key: PROTON-946 URL: https://issues.apache.org/jira/browse/PROTON-946 Project: Qpid Proton Issue Type: Improvement Components: proton-c Affects Versions: 0.10 Reporter: michael goulish Assignee: michael goulish Currently protocol.h.py reads the AMQP 1.0 spec xml files and generates all of its output into protocol.h -- even the data structure definitions. Those definitions are currently protected by #ifdef DEFINE_FIELDS , which is defined only in codec.c -- so the definitions only show up in that file, while other .c files only see the declarations. If DEFINE_FIELDS is #defined in any other file, compilation will fail with multiple definition errors. The structure declarations should remain in the .h file, but the actual definitions should be moved into a generated .c file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PROTON-826) recent checkin causes frequent double-free or corruption crash
[ https://issues.apache.org/jira/browse/PROTON-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish resolved PROTON-826. Resolution: Fixed I recreated my test from February, and cannot reproduce the bug using latest dispatch + proton code. recent checkin causes frequent double-free or corruption crash -- Key: PROTON-826 URL: https://issues.apache.org/jira/browse/PROTON-826 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish Priority: Blocker In my dispatch testing I am seeing frequent crashes in the proton library that began with proton checkin 01cb00c on 2015-02-15 "report read and write errors through the transport" The output at crash-time says this: --- *** Error in `/home/mick/dispatch/install/sbin/qdrouterd': double free or corruption (fasttop): 0x020ee880 *** === Backtrace: = /lib64/libc.so.6[0x3e3d875a4f] /lib64/libc.so.6[0x3e3d87cd78] /lib64/libqpid-proton.so.2(pn_error_clear+0x18)[0x7f4f4f4e1f18] /lib64/libqpid-proton.so.2(pn_error_set+0x11)[0x7f4f4f4e1f41] /lib64/libqpid-proton.so.2(pn_error_vformat+0x3e)[0x7f4f4f4e1f9e] /lib64/libqpid-proton.so.2(pn_error_format+0x82)[0x7f4f4f4e2032] /lib64/libqpid-proton.so.2(pn_i_error_from_errno+0x67)[0x7f4f4f4fd737] /lib64/libqpid-proton.so.2(pn_recv+0x5a)[0x7f4f4f4fd16a] /home/mick/dispatch/install/lib64/libqpid-dispatch.so.0(qdpn_connector_process+0xd7)[0x7f4f4f759430] The backtrace from the core file looks like this: #0 0x003e3d835877 in raise () from /lib64/libc.so.6 #1 0x003e3d836f68 in abort () from /lib64/libc.so.6 #2 0x003e3d875a54 in __libc_message () from /lib64/libc.so.6 #3 0x003e3d87cd78 in _int_free () from /lib64/libc.so.6 #4 0x7fbf8a59b2e8 in pn_error_clear (error=error@entry=0x1501140) at /home/mick/rh-qpid-proton/proton-c/src/error.c:56 #5 0x7fbf8a59b311 in pn_error_set (error=error@entry=0x1501140, code=code@entry=-2, text=text@entry=0x7fbf801a69c0 "recv: Resource temporarily unavailable") at
/home/mick/rh-qpid-proton/proton-c/src/error.c:65 #6 0x7fbf8a59b36e in pn_error_vformat (error=0x1501140, code=-2, fmt=<optimized out>, ap=ap@entry=0x7fbf801a6de8) at /home/mick/rh-qpid-proton/proton-c/src/error.c:81 #7 0x7fbf8a59b402 in pn_error_format (error=error@entry=0x1501140, code=<optimized out>, fmt=fmt@entry=0x7fbf8a5bb21e "%s: %s") at /home/mick/rh-qpid-proton/proton-c/src/error.c:89 #8 0x7fbf8a5b6797 in pn_i_error_from_errno (error=0x1501140, msg=msg@entry=0x7fbf8a5bbe1a "recv") at /home/mick/rh-qpid-proton/proton-c/src/platform.c:119 #9 0x7fbf8a5b61ca in pn_recv (io=0x14e77b0, socket=<optimized out>, buf=<optimized out>, size=<optimized out>) at /home/mick/rh-qpid-proton/proton-c/src/posix/io.c:271 #10 0x7fbf8a812430 in qdpn_connector_process (c=0x7fbf7801c7f0) - And I can prevent the crash from happening, apparently forever, by commenting out this line: free(error->text); in the function pn_error_clear in the file proton-c/src/error.c The error text that is being freed which causes the crash looks like this: $2 = {text = 0x7f66e8104e30 "recv: Resource temporarily unavailable", root = 0x0, code = -2} My dispatch test creates a router network and then repeatedly kills and restarts a randomly-selected router. After this proton checkin it almost never gets through 5 iterations without this crash. After I commented out that line, it got through more than 500 iterations before I stopped it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
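The crash pattern above — pn_error_clear() freeing error->text that gets freed again — is the classic double-free, and the usual defensive idiom is to null the pointer after freeing it. A sketch of that idiom with Python standing in for C memory management (hypothetical names; an exception stands in for heap corruption):

```python
class FreedTwice(Exception):
    pass

class CString:
    """Stands in for a malloc'd C string."""
    def __init__(self, text):
        self.text = text
        self.freed = False
    def free(self):
        if self.freed:      # in C this is heap corruption, not an exception
            raise FreedTwice(self.text)
        self.freed = True

class PnError:
    def __init__(self):
        self.text = None
        self.code = 0

def pn_error_clear(err):
    # free once, then drop the pointer so a second clear is a
    # no-op instead of a double free
    if err.text is not None:
        err.text.free()     # in C: free(err->text);
        err.text = None     # in C: err->text = NULL;
    err.code = 0

err = PnError()
err.text = CString("recv: Resource temporarily unavailable")
pn_error_clear(err)
pn_error_clear(err)  # safe: the pointer was nulled after the first free
```

Commenting out the free (as the report did) merely trades the crash for a leak; nulling after free removes the double free without leaking on the first clear.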
Re: 0.10 alpha1
I just took 826 -- I think I can still re-create the test that found it. Let me see if I can repro... - Original Message - Yay Rafi! Thanks! A simple query of currently outstanding blocker JIRAs affecting 0.9+ shows only three: https://issues.apache.org/jira/browse/PROTON-826 (unassigned) https://issues.apache.org/jira/browse/PROTON-923 (asticher) https://issues.apache.org/jira/browse/PROTON-934 (rschloming) The remaining open bugs affecting 0.9+ are: https://issues.apache.org/jira/browse/PROTON-826?jql=project%20%3D%20PROTON%20AND%20status%20in%20%28Open%2C%20%22In%20Progress%22%2C%20Reopened%29%20AND%20affectedVersion%20in%20%280.9%2C%200.9.1%2C%200.10%29%20ORDER%20BY%20priority%20DESC - Original Message - From: Rafael Schloming r...@alum.mit.edu To: proton@qpid.apache.org Sent: Tuesday, July 7, 2015 1:28:17 AM Subject: 0.10 alpha1 As promised, here is the first alpha for 0.10. It's posted in the usual places: Source code is here: http://people.apache.org/~rhs/qpid-proton-0.10-alpha1/ Java binaries are here: https://repository.apache.org/content/repositories/orgapacheqpid-1036 Please check it out and follow up with any issues. --Rafael -- -K - To unsubscribe, e-mail: dev-unsubscr...@qpid.apache.org For additional commands, e-mail: dev-h...@qpid.apache.org
[jira] [Assigned] (PROTON-826) recent checkin causes frequent double-free or corruption crash
[ https://issues.apache.org/jira/browse/PROTON-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish reassigned PROTON-826: -- Assignee: michael goulish recent checkin causes frequent double-free or corruption crash -- Key: PROTON-826 URL: https://issues.apache.org/jira/browse/PROTON-826 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PROTON-826) recent checkin causes frequent double-free or corruption crash
[ https://issues.apache.org/jira/browse/PROTON-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616749#comment-14616749 ] michael goulish commented on PROTON-826: I see why I didn't follow this up earlier. Current dispatch will not compile against latest proton because of some SASL issues. But I need to test against latest proton. SO ... now attempting to hack up dispatch so that it doesn't have SASL but will still build and run against latest proton recent checkin causes frequent double-free or corruption crash -- Key: PROTON-826 URL: https://issues.apache.org/jira/browse/PROTON-826 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PROTON-930) add explicit AMQP 1.0 constants
michael goulish created PROTON-930: -- Summary: add explicit AMQP 1.0 constants Key: PROTON-930 URL: https://issues.apache.org/jira/browse/PROTON-930 Project: Qpid Proton Issue Type: Improvement Components: proton-c Reporter: michael goulish Assignee: michael goulish Priority: Minor Fix For: 0.10 Add an include file that has explicit defined constants for every numeric default value that is mandated by the AMQP 1.0 spec. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
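A few of the spec-mandated defaults such a header would carry, shown as Python constants for illustration (the values come from the AMQP 1.0 specification; the header file and macro names are not yet defined by the JIRA, so these names are invented):

```python
# AMQP 1.0 spec-mandated numeric defaults (illustrative names)
AMQP_CHANNEL_MAX_DEFAULT = 65535          # channel-max when absent from open
AMQP_MAX_FRAME_SIZE_DEFAULT = 4294967295  # max-frame-size when absent (uint max)
AMQP_MIN_MAX_FRAME_SIZE = 512             # smallest max-frame-size a peer may declare
AMQP_HANDLE_MAX_DEFAULT = 4294967295      # handle-max when absent from begin
```

Having these as named constants (rather than bare literals scattered through transport.c) is exactly what would have made bugs like PROTON-925 below easier to spot.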
[jira] [Resolved] (PROTON-925) proton-c seems to treat unspecified channel-max as implying 0
[ https://issues.apache.org/jira/browse/PROTON-925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish resolved PROTON-925. Resolution: Fixed commit fc38e86a6f5a1b265552708e674d3c8040c1985b proton-c seems to treat unspecified channel-max as implying 0 - Key: PROTON-925 URL: https://issues.apache.org/jira/browse/PROTON-925 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.10 Reporter: Gordon Sim Assignee: michael goulish Priority: Blocker Fix For: 0.10 If max-channels is not specified in the open, it appears the latest proton-c treats that as implying the maximum is 0 though the spec states the default is 65535. This breaks compatibility with previous proton releases. E.g. the following is the interaction between a sender using the latest 0.10 and a receiver using proton 0.9. {noformat} [0x151c710]: - AMQP [0x151c710]:0 - @open(16) [container-id=65A6602D-5D24-4D39-9C6F-7403D98F5E15, hostname=localhost, channel-max=32767] [0x151c710]:0 - @begin(17) [next-outgoing-id=0, incoming-window=2147483647, outgoing-window=1] [0x151c710]:1 - @begin(17) [next-outgoing-id=0, incoming-window=2147483647, outgoing-window=1] [0x151c710]:2 - @begin(17) [next-outgoing-id=0, incoming-window=2147483647, outgoing-window=1] [0x151c710]:0 - @attach(18) [name=sender-xxx, handle=0, role=false, snd-settle-mode=2, rcv-settle-mode=0, source=@source(40) [address=queue_a, durable=0, timeout=0, dynamic=false], target=@target(41) [address=queue_a, durable=0, timeout=0, dynamic=false], initial-delivery-count=0] [0x151c710]:1 - @attach(18) [name=sender-xxx, handle=0, role=false, snd-settle-mode=2, rcv-settle-mode=0, source=@source(40) [address=queue_b, durable=0, timeout=0, dynamic=false], target=@target(41) [address=queue_b, durable=0, timeout=0, dynamic=false], initial-delivery-count=0] [0x151c710]:2 - @attach(18) [name=sender-xxx, handle=0, role=false, snd-settle-mode=2, rcv-settle-mode=0, source=@source(40) [address=queue_c, durable=0, 
timeout=0, dynamic=false], target=@target(41) [address=queue_c, durable=0, timeout=0, dynamic=false], initial-delivery-count=0] [0x151c710]: - AMQP [0x151c710]:0 - @open(16) [container-id=abab56b0-c25e-427b-9f4f-d63da48d1973] [0x151c710]:0 - @begin(17) [remote-channel=0, next-outgoing-id=0, incoming-window=2147483647, outgoing-window=0] [0x151c710]:1 - @begin(17) [remote-channel=1, next-outgoing-id=0, incoming-window=2147483647, outgoing-window=0] [0x151c710]:2 - @begin(17) [remote-channel=2, next-outgoing-id=0, incoming-window=2147483647, outgoing-window=0] [0x151c710]:0 - @attach(18) [name=sender-xxx, handle=0, role=true, snd-settle-mode=2, rcv-settle-mode=0, source=@source(40) [address=queue_a, durable=0, timeout=0, dynamic=false], target=@target(41) [address=queue_a, durable=0, timeout=0, dynamic=false], initial-delivery-count=0] [0x151c710]:1 - @attach(18) [name=sender-xxx, handle=0, role=true, snd-settle-mode=2, rcv-settle-mode=0, source=@source(40) [address=queue_b, durable=0, timeout=0, dynamic=false], target=@target(41) [address=queue_b, durable=0, timeout=0, dynamic=false], initial-delivery-count=0] [0x151c710]:2 - @attach(18) [name=sender-xxx, handle=0, role=true, snd-settle-mode=2, rcv-settle-mode=0, source=@source(40) [address=queue_c, durable=0, timeout=0, dynamic=false], target=@target(41) [address=queue_c, durable=0, timeout=0, dynamic=false], initial-delivery-count=0] [0x151c710]:0 - @flow(19) [next-incoming-id=0, incoming-window=2147483647, next-outgoing-id=0, outgoing-window=0, handle=0, delivery-count=0, link-credit=341, drain=false] [0x151c710]:1 - @flow(19) [next-incoming-id=0, incoming-window=2147483647, next-outgoing-id=0, outgoing-window=0, handle=0, delivery-count=0, link-credit=341, drain=false] [0x151c710]:2 - @flow(19) [next-incoming-id=0, incoming-window=2147483647, next-outgoing-id=0, outgoing-window=0, handle=0, delivery-count=0, link-credit=341, drain=false] [0x151c710]:0 - @close(24) [error=@error(29) 
[condition=:amqp:connection:framing-error, description=remote channel 1 is above negotiated channel_max 0.]] [0x151c710]: - EOS [0x151c710]:0 - @close(24) [] [0x151c710]: - EOS {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
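The fix reduces to two small rules, modeled here with invented helper names (not proton API): an open frame that omits channel-max means the spec default of 65535, never 0, and the usable limit is the minimum of what the two sides advertised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the corrected decoding (helper names invented, not proton API):
 * an open frame that omits channel-max means 65535, never 0. */
static uint16_t decode_channel_max(bool field_present, uint16_t field_value) {
  return field_present ? field_value : (uint16_t)65535u;
}

/* The usable session limit is the smaller of the two advertised values. */
static uint16_t negotiated_channel_max(uint16_t local, uint16_t remote) {
  return local < remote ? local : remote;
}
```

In the trace above, the 0.9 receiver omitted channel-max; decoding that as 0 made channel 1 illegal and produced the framing-error close.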
[jira] [Resolved] (PROTON-842) proton-c should honor channel_max
[ https://issues.apache.org/jira/browse/PROTON-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish resolved PROTON-842. Resolution: Fixed Last checkin fixed java tests. proton-c should honor channel_max - Key: PROTON-842 URL: https://issues.apache.org/jira/browse/PROTON-842 Project: Qpid Proton Issue Type: Bug Components: proton-j Affects Versions: 0.9, 0.10 Reporter: michael goulish Assignee: michael goulish proton-c code should use transport->channel_max and transport->remote_channel_max to enforce a limit on the maximum number of simultaneously active sessions on a connection. I guess the limit should be the minimum of those two numbers, or, if neither side sets a limit, then 2^16. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PROTON-919) make C impl behave like java wrt channel_max error
michael goulish created PROTON-919: -- Summary: make C impl behave like java wrt channel_max error Key: PROTON-919 URL: https://issues.apache.org/jira/browse/PROTON-919 Project: Qpid Proton Issue Type: Improvement Components: proton-c, python-binding Reporter: michael goulish Assignee: michael goulish Priority: Minor In the Java impl, I made TransportImpl throw an exception if the application tries to change the local channel_max setting after we have already sent the OPEN frame to the remote peer. ( Because at that point we communicate our channel_max limit to the peer -- no fair changing it afterwards.) One reviewer suggested that it would be nice if the C impl worked the same way. That would mean that pn_set_channel_max() would have to return a result code, which the Python binding would detect -- Python binding throws exception, python tests detect it -- so it would work same way as Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PROTON-919) make C impl behave like java wrt channel_max error
[ https://issues.apache.org/jira/browse/PROTON-919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598095#comment-14598095 ] michael goulish commented on PROTON-919: ~~~ NOTE ~~~ The proposed change alters the public API in that it changes pn_transport_set_channel_max() to return an int, rather than void. make C impl behave like java wrt channel_max error -- Key: PROTON-919 URL: https://issues.apache.org/jira/browse/PROTON-919 Project: Qpid Proton Issue Type: Improvement Components: proton-c, python-binding Reporter: michael goulish Assignee: michael goulish Priority: Minor In the Java impl, I made TransportImpl throw an exception if the application tries to change the local channel_max setting after we have already sent the OPEN frame to the remote peer. ( Because at that point we communicate our channel_max limit to the peer -- no fair changing it afterwards.) One reviewer suggested that it would be nice if the C impl worked the same way. That would mean that pn_set_channel_max() would have to return a result code, which the Python binding would detect -- Python binding throws exception, python tests detect it -- so it would work same way as Java. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
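A sketch of the proposed semantics, using toy types rather than the real transport struct: the setter succeeds until the OPEN frame has gone out, then returns an error code instead of silently lying to the peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model (not proton's structs) of the proposed contract: once our
 * open frame is on the wire, the advertised channel_max is frozen. */
typedef struct {
  uint16_t channel_max;
  bool open_sent;
} model_transport_t;

enum { MODEL_OK = 0, MODEL_STATE_ERR = -1 };

static int model_set_channel_max(model_transport_t *t, uint16_t value) {
  if (t->open_sent) return MODEL_STATE_ERR;  /* too late: peer already told */
  t->channel_max = value;
  return MODEL_OK;
}
```

The Python binding would then turn the nonzero return into an exception, giving the same observable behavior as the Java implementation.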
[jira] [Resolved] (PROTON-842) proton-c should honor channel_max
[ https://issues.apache.org/jira/browse/PROTON-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish resolved PROTON-842. Resolution: Fixed commit e38957ae5115ec023993672ca5b7d5e3df414f7e proton-c should honor channel_max - Key: PROTON-842 URL: https://issues.apache.org/jira/browse/PROTON-842 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish proton-c code should use transport->channel_max and transport->remote_channel_max to enforce a limit on the maximum number of simultaneously active sessions on a connection. I guess the limit should be the minimum of those two numbers, or, if neither side sets a limit, then 2^16. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PROTON-842) proton-c should honor channel_max
[ https://issues.apache.org/jira/browse/PROTON-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591877#comment-14591877 ] michael goulish commented on PROTON-842: -- please note -- This fix changes API behavior in one way: pn_session can now return NULL if an attempt is made to create more sessions than are allowed by the value of channel_max. Previously, the limit on the number of sessions was enforced by SEGV. proton-c should honor channel_max - Key: PROTON-842 URL: https://issues.apache.org/jira/browse/PROTON-842 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish proton-c code should use transport->channel_max and transport->remote_channel_max to enforce a limit on the maximum number of simultaneously active sessions on a connection. I guess the limit should be the minimum of those two numbers, or, if neither side sets a limit, then 2^16. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
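The changed contract can be modeled in a few lines (toy structs and names, not proton's): creation fails cleanly with NULL once channels 0..channel_max are all in use, and callers are expected to check for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model (not proton's structs) of the new contract: once channels
 * 0..channel_max are all taken, session creation returns NULL instead
 * of corrupting memory. */
typedef struct { uint16_t channel_max; uint32_t active; } model_conn_t;

static void *model_session(model_conn_t *c) {
  /* channel numbers run 0..channel_max, so channel_max + 1 sessions fit */
  if (c->active >= (uint32_t)c->channel_max + 1) return NULL;
  c->active++;
  return c;   /* stand-in for a real session pointer */
}
```

Callers that previously assumed session creation could never fail now need a NULL check on that path.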
[jira] [Reopened] (PROTON-842) proton-c should honor channel_max
[ https://issues.apache.org/jira/browse/PROTON-842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish reopened PROTON-842: My fix for proton-c is making trouble for proton-j proton-c should honor channel_max - Key: PROTON-842 URL: https://issues.apache.org/jira/browse/PROTON-842 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish proton-c code should use transport->channel_max and transport->remote_channel_max to enforce a limit on the maximum number of simultaneously active sessions on a connection. I guess the limit should be the minimum of those two numbers, or, if neither side sets a limit, then 2^16. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (PROTON-896) change all static function names to begin with pni_
[ https://issues.apache.org/jira/browse/PROTON-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish reassigned PROTON-896: -- Assignee: michael goulish change all static function names to begin with pni_ --- Key: PROTON-896 URL: https://issues.apache.org/jira/browse/PROTON-896 Project: Qpid Proton Issue Type: Improvement Reporter: michael goulish Assignee: michael goulish Priority: Minor Change all the static function names to start with pni_ , and declare all functions as static that ought to be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: question about proton error philosophy
It is philosophically questionable but C is not a very philosophical language. I beg to differ. Here are some famous thoughts concerning the C language by one early practitioner, that greatest of all Roman philosophers, the incomparable Ibid. Non teneas aurum totum quod splendet ut aurum, et scribe in C. ("Do not take for gold all that glitters like gold -- write it in C.") Minima maxima sunt, sic haec scriberem C. ("The smallest things are the greatest; thus I would write these in C.") Mutantur omnia nos et mutamur in illis, nisi lingua C. ("All things change, and we change with them -- except the C language.")
[jira] [Created] (PROTON-896) change all statis function names to begin with pni_
michael goulish created PROTON-896: -- Summary: change all statis function names to begin with pni_ Key: PROTON-896 URL: https://issues.apache.org/jira/browse/PROTON-896 Project: Qpid Proton Issue Type: Improvement Reporter: michael goulish Priority: Minor Change all the static function names to start with pni_ , and declare all functions as static that ought to be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PROTON-896) change all static function names to begin with pni_
[ https://issues.apache.org/jira/browse/PROTON-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish updated PROTON-896: --- Summary: change all static function names to begin with pni_ (was: change all statis function names to begin with pni_) change all static function names to begin with pni_ --- Key: PROTON-896 URL: https://issues.apache.org/jira/browse/PROTON-896 Project: Qpid Proton Issue Type: Improvement Reporter: michael goulish Priority: Minor Change all the static function names to start with pni_ , and declare all functions as static that ought to be. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PROTON-864) don't crash when channel number goes high
[ https://issues.apache.org/jira/browse/PROTON-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish updated PROTON-864: --- Summary: don't crash when channel number goes high (was: avoid crashes when channel number goes high.) don't crash when channel number goes high - Key: PROTON-864 URL: https://issues.apache.org/jira/browse/PROTON-864 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish Code in transport.c, and a little in engine.c, looks at the topmost bit in channel numbers to decide if the channels are in use. This causes crashes when the number of channels in a single connection goes beyond 32767. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PROTON-864) avoid crashes when channel number goes high.
[ https://issues.apache.org/jira/browse/PROTON-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish updated PROTON-864: --- Summary: avoid crashes when channel number goes high. (was: don't overload top bit of channel numbers ) avoid crashes when channel number goes high. Key: PROTON-864 URL: https://issues.apache.org/jira/browse/PROTON-864 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish Code in transport.c, and a little in engine.c, looks at the topmost bit in channel numbers to decide if the channels are in use. This causes crashes when the number of channels in a single connection goes beyond 32767. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PROTON-888) allocate_alias linear search becomes slow at scale
michael goulish created PROTON-888: -- Summary: allocate_alias linear search becomes slow at scale Key: PROTON-888 URL: https://issues.apache.org/jira/browse/PROTON-888 Project: Qpid Proton Issue Type: Improvement Reporter: michael goulish Testing that I have done recently goes to large scale on number of sessions per connection. I noticed that the test was slowing down rapidly over time, in terms of how many sessions were being established per unit time. The function allocate_alias in file transport.c uses a linear search through an array to find the next available channel number for a session (or the next available handle number for a link). In a usage scenario like mine in which many sessions will be established, this becomes very slow as the array fills up. At the beginning of my test, this function is too fast to measure. By the end, it is using more than 82 milliseconds per call. Overall, this function alone is contributing more than 20 seconds to my 3-minute test. This is not an unrealistic scenario -- we already have one potential customer who is interested in going to this kind of scale. (Which is why I was doing this test.) Maybe we can find an implementation that does not slow down the common scale, and yet behaves better at the high end. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
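One possible constant-time replacement, sketched below (this is an illustration, not necessarily what proton adopted): pair a high-water mark for never-used aliases with a stack of freed ones, so neither allocation nor release ever scans the whole table.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an O(1) alias allocator: `next` hands out never-used aliases,
 * and `freed` recycles returned ones. `freed` must have room for
 * `capacity` entries; the caller owns that storage. */
typedef struct {
  uint32_t *freed;      /* stack of returned aliases */
  uint32_t nfreed;
  uint32_t next;        /* lowest alias never handed out */
  uint32_t capacity;    /* e.g. channel_max + 1 */
} alias_pool_t;

static int alias_alloc(alias_pool_t *p, uint32_t *out) {
  if (p->nfreed > 0) { *out = p->freed[--p->nfreed]; return 0; } /* recycle */
  if (p->next < p->capacity) { *out = p->next++; return 0; }     /* fresh */
  return -1;                                                     /* exhausted */
}

static void alias_free(alias_pool_t *p, uint32_t alias) {
  p->freed[p->nfreed++] = alias;   /* O(1) push; caller guarantees room */
}
```

Unlike the linear scan, both operations stay constant-time no matter how many of the 32K+ slots are occupied, so the 82 ms/call figure above should not recur at scale.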
[jira] [Created] (PROTON-886) make proton enforce handle-max
michael goulish created PROTON-886: -- Summary: make proton enforce handle-max Key: PROTON-886 URL: https://issues.apache.org/jira/browse/PROTON-886 Project: Qpid Proton Issue Type: Bug Reporter: michael goulish Make the code enforce limits on handles (and links) from section 2.7.2 of the AMQP 1.0 spec. The handle-max value is the highest handle value that can be used on the session. A peer MUST NOT attempt to attach a link using a handle value outside the range that its partner can handle. A peer that receives a handle outside the supported range MUST close the connection with the framing-error error-code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
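The quoted rule reduces to a single bounds check plus a prescribed outcome. A minimal model (helper names invented, not proton API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Model of the section 2.7.2 rule: a received attach whose handle exceeds
 * our advertised handle-max must end the connection with
 * amqp:connection:framing-error. */
static bool attach_handle_in_range(uint32_t handle, uint32_t local_handle_max) {
  return handle <= local_handle_max;
}

/* What the transport should do with the verdict. */
static const char *on_attach(uint32_t handle, uint32_t local_handle_max) {
  return attach_handle_in_range(handle, local_handle_max)
           ? "attach accepted"
           : "close connection: amqp:connection:framing-error";
}
```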
Re: [VOTE]: Release Proton 0.9.1-rc1 as 0.9.1
[ X ]: Yes, release Proton 0.9.1-rc1 as 0.9.1 [ ]: No, ... tested with my 30,000-link-in-one-connection code.
[jira] [Created] (PROTON-864) don't overload top bit of channel numbers
michael goulish created PROTON-864: -- Summary: don't overload top bit of channel numbers Key: PROTON-864 URL: https://issues.apache.org/jira/browse/PROTON-864 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Assignee: michael goulish Code in transport.c, and a little in engine.c, looks at the topmost bit in channel numbers to decide if the channels are in use. This causes crashes when the number of channels in a single connection goes beyond 32767. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
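The root cause in miniature: when the top bit of a 16-bit channel number doubles as an "in use" marker, only channels 0..32767 are representable, and channel 32768 itself is indistinguishable from the flag. A tiny demonstration (macro and struct names are illustrative, not proton's):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The buggy pattern: the flag lives inside the 16-bit channel value. */
#define IN_USE_BIT 0x8000u

static bool topbit_scheme_can_represent(uint16_t channel) {
  return (channel & IN_USE_BIT) == 0;   /* 32768 and up collide with the flag */
}

/* The shape of the fix: keep the in-use flag out of band, so the full
 * channel range permitted by channel-max works. */
typedef struct { uint16_t channel; bool in_use; } channel_slot_t;
```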
Re: I think that's a blocker...
Good point! I'm afraid it will take me the rest of my life to reproduce under valgrind .. but ... I'll see what I can do In the meantime -- I'm not sure what to do with a Jira if the provenance is in doubt... - Original Message - This isn't necessarily a proton bug. Nothing in the referenced checkin actually touches the logic around allocating/freeing error strings, it merely causes pn_send/pn_recv to make use of pn_io_t's pn_error_t where previously it threw away the error information. This would suggest that there is perhaps a pre-existing bug in dispatch where it is calling pn_send/pn_recv with a pn_io_t that has been freed, and it is only now triggering due to the additional asserts that are encountered due to not ignoring the error information. I could be mistaken, but I would try reproducing this under valgrind. That will tell you where the first free occurred and that should hopefully make it obvious whether this is indeed a proton bug or whether dispatch is somehow freeing the pn_io_t sooner than it should. (FWIW, if it is indeed a proton bug, then I would agree it is a blocker.) --Rafael On Wed, Feb 25, 2015 at 7:54 AM, Michael Goulish mgoul...@redhat.com wrote: ...but if not, somebody please feel free to correct me. The Jira that I just created -- PROTON-826 -- is for a bug I found with my topology testing of the Dispatch Router, in which I repeatedly kill and restart a router and make sure that the router network comes back to the same topology that it had before. As of checkin 01cb00c -- which had no Jira -- it is pretty easy for my test to blow core. It looks like an error string is being double-freed (maybe) in the proton library. ( full info in the Jira. https://issues.apache.org/jira/browse/PROTON-826 )
[jira] [Commented] (PROTON-826) recent checkin causes frequent double-free or corruption crash
[ https://issues.apache.org/jira/browse/PROTON-826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336824#comment-14336824 ] michael goulish commented on PROTON-826: It looks like the problem here is just that the error struct used in proton-c/src/error.c is not thread safe -- so I am opening a new Jira for Dispatch. I am leaving this one open for now, however, because other applications using proton will encounter this. Either something could be changed in proton to make this less thread-hostile, or ... it could be publicized better? Please feel free to close when appropriate. recent checkin causes frequent double-free or corruption crash -- Key: PROTON-826 URL: https://issues.apache.org/jira/browse/PROTON-826 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Priority: Blocker In my dispatch testing I am seeing frequent crashes in proton library that began with proton checkin 01cb00c on 2015-02-15 "report read and write errors through the transport" The output at crash-time says this: --- *** Error in `/home/mick/dispatch/install/sbin/qdrouterd': double free or corruption (fasttop): 0x020ee880 *** === Backtrace: = /lib64/libc.so.6[0x3e3d875a4f] /lib64/libc.so.6[0x3e3d87cd78] /lib64/libqpid-proton.so.2(pn_error_clear+0x18)[0x7f4f4f4e1f18] /lib64/libqpid-proton.so.2(pn_error_set+0x11)[0x7f4f4f4e1f41] /lib64/libqpid-proton.so.2(pn_error_vformat+0x3e)[0x7f4f4f4e1f9e] /lib64/libqpid-proton.so.2(pn_error_format+0x82)[0x7f4f4f4e2032] /lib64/libqpid-proton.so.2(pn_i_error_from_errno+0x67)[0x7f4f4f4fd737] /lib64/libqpid-proton.so.2(pn_recv+0x5a)[0x7f4f4f4fd16a] /home/mick/dispatch/install/lib64/libqpid-dispatch.so.0(qdpn_connector_process+0xd7)[0x7f4f4f759430] The backtrace from the core file looks like this: #0 0x003e3d835877 in raise () from /lib64/libc.so.6 #1 0x003e3d836f68 in abort () from /lib64/libc.so.6 #2 0x003e3d875a54 in __libc_message () from /lib64/libc.so.6 #3
0x003e3d87cd78 in _int_free () from /lib64/libc.so.6 #4 0x7fbf8a59b2e8 in pn_error_clear (error=error@entry=0x1501140) at /home/mick/rh-qpid-proton/proton-c/src/error.c:56 #5 0x7fbf8a59b311 in pn_error_set (error=error@entry=0x1501140, code=code@entry=-2, text=text@entry=0x7fbf801a69c0 "recv: Resource temporarily unavailable") at /home/mick/rh-qpid-proton/proton-c/src/error.c:65 #6 0x7fbf8a59b36e in pn_error_vformat (error=0x1501140, code=-2, fmt=optimized out, ap=ap@entry=0x7fbf801a6de8) at /home/mick/rh-qpid-proton/proton-c/src/error.c:81 #7 0x7fbf8a59b402 in pn_error_format (error=error@entry=0x1501140, code=optimized out, fmt=fmt@entry=0x7fbf8a5bb21e "%s: %s") at /home/mick/rh-qpid-proton/proton-c/src/error.c:89 #8 0x7fbf8a5b6797 in pn_i_error_from_errno (error=0x1501140, msg=msg@entry=0x7fbf8a5bbe1a "recv") at /home/mick/rh-qpid-proton/proton-c/src/platform.c:119 #9 0x7fbf8a5b61ca in pn_recv (io=0x14e77b0, socket=optimized out, buf=optimized out, size=optimized out) at /home/mick/rh-qpid-proton/proton-c/src/posix/io.c:271 #10 0x7fbf8a812430 in qdpn_connector_process (c=0x7fbf7801c7f0) - And I can prevent the crash from happening, apparently forever, by commenting out this line: free(error->text); in the function pn_error_clear in the file proton-c/src/error.c The error text that is being freed which causes the crash looks like this: $2 = {text = 0x7f66e8104e30 "recv: Resource temporarily unavailable", root = 0x0, code = -2} My dispatch test creates a router network and then repeatedly kills and restarts a randomly-selected router. After this proton checkin it almost never gets through 5 iterations without this crash. After I commented out that line, it got through more than 500 iterations before I stopped it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
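One cheap hardening against this class of double free, sketched as a generic pattern (this is not the proton fix, and it is not a cure for genuinely concurrent access, which still needs external locking): null the text pointer after freeing it, so a repeated clear frees NULL, which is defined to be a no-op, instead of freeing the same block twice. Unlike the "comment out the free" workaround, it does not leak.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy error slot (not proton's pn_error_t) demonstrating the pattern. */
typedef struct { char *text; int code; } toy_error_t;

static void toy_error_set(toy_error_t *e, int code, const char *text) {
  free(e->text);                       /* release any previous message */
  e->text = malloc(strlen(text) + 1);
  if (e->text) strcpy(e->text, text);
  e->code = code;
}

static void toy_error_clear(toy_error_t *e) {
  free(e->text);
  e->text = NULL;                      /* key line: no dangling pointer left */
  e->code = 0;
}
```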
[jira] [Created] (PROTON-826) recent checkin causes frequent double-free or corruption crash
michael goulish created PROTON-826: -- Summary: recent checkin causes frequent double-free or corruption crash Key: PROTON-826 URL: https://issues.apache.org/jira/browse/PROTON-826 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.9 Reporter: michael goulish Priority: Blocker In my dispatch testing I am seeing frequent crashes in proton library that began with proton checkin 01cb00c on 2015-02-15 "report read and write errors through the transport" The output at crash-time says this: --- *** Error in `/home/mick/dispatch/install/sbin/qdrouterd': double free or corruption (fasttop): 0x020ee880 *** === Backtrace: = /lib64/libc.so.6[0x3e3d875a4f] /lib64/libc.so.6[0x3e3d87cd78] /lib64/libqpid-proton.so.2(pn_error_clear+0x18)[0x7f4f4f4e1f18] /lib64/libqpid-proton.so.2(pn_error_set+0x11)[0x7f4f4f4e1f41] /lib64/libqpid-proton.so.2(pn_error_vformat+0x3e)[0x7f4f4f4e1f9e] /lib64/libqpid-proton.so.2(pn_error_format+0x82)[0x7f4f4f4e2032] /lib64/libqpid-proton.so.2(pn_i_error_from_errno+0x67)[0x7f4f4f4fd737] /lib64/libqpid-proton.so.2(pn_recv+0x5a)[0x7f4f4f4fd16a] /home/mick/dispatch/install/lib64/libqpid-dispatch.so.0(qdpn_connector_process+0xd7)[0x7f4f4f759430] The backtrace from the core file looks like this: #0 0x003e3d835877 in raise () from /lib64/libc.so.6 #1 0x003e3d836f68 in abort () from /lib64/libc.so.6 #2 0x003e3d875a54 in __libc_message () from /lib64/libc.so.6 #3 0x003e3d87cd78 in _int_free () from /lib64/libc.so.6 #4 0x7fbf8a59b2e8 in pn_error_clear (error=error@entry=0x1501140) at /home/mick/rh-qpid-proton/proton-c/src/error.c:56 #5 0x7fbf8a59b311 in pn_error_set (error=error@entry=0x1501140, code=code@entry=-2, text=text@entry=0x7fbf801a69c0 "recv: Resource temporarily unavailable") at /home/mick/rh-qpid-proton/proton-c/src/error.c:65 #6 0x7fbf8a59b36e in pn_error_vformat (error=0x1501140, code=-2, fmt=optimized out, ap=ap@entry=0x7fbf801a6de8) at /home/mick/rh-qpid-proton/proton-c/src/error.c:81 #7 0x7fbf8a59b402 in pn_error_format
(error=error@entry=0x1501140, code=optimized out, fmt=fmt@entry=0x7fbf8a5bb21e "%s: %s") at /home/mick/rh-qpid-proton/proton-c/src/error.c:89 #8 0x7fbf8a5b6797 in pn_i_error_from_errno (error=0x1501140, msg=msg@entry=0x7fbf8a5bbe1a "recv") at /home/mick/rh-qpid-proton/proton-c/src/platform.c:119 #9 0x7fbf8a5b61ca in pn_recv (io=0x14e77b0, socket=optimized out, buf=optimized out, size=optimized out) at /home/mick/rh-qpid-proton/proton-c/src/posix/io.c:271 #10 0x7fbf8a812430 in qdpn_connector_process (c=0x7fbf7801c7f0) - And I can prevent the crash from happening, apparently forever, by commenting out this line: free(error->text); in the function pn_error_clear in the file proton-c/src/error.c The error text that is being freed which causes the crash looks like this: $2 = {text = 0x7f66e8104e30 "recv: Resource temporarily unavailable", root = 0x0, code = -2} My dispatch test creates a router network and then repeatedly kills and restarts a randomly-selected router. After this proton checkin it almost never gets through 5 iterations without this crash. After I commented out that line, it got through more than 500 iterations before I stopped it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: no slowdown in 10 gigamessage test
Dominic -- I'm trying to get these into proton, under examples/engine/c , but I am having some problems because I am brain-damaged. Attempting to repair brain-damage now. ( And in the proton checkin, the licenses will be standard, not silly. ) ( Although I *did* like the idea of a PL that is meant to be (partly) sung... ) --- Mick . - Original Message - Hi Michael, Michael Goulish wrote: After getting complete data for a 10 gigamessage interbox test, ( proton-c, event interface, code here: https://github.com/mick-goulish/proton_c_clients.git ) ...I see that there is no gradual speed change at all over the duration of the test. I noticed that there wasn't a properly declared license in your repository, I wonder if you'd be willing to add a LICENSE for http://www.apache.org/licenses/LICENSE-2.0 and update the headers of psend.c / precv.c to match? Currently there seems to be a comedy copyright header instead :) This would allow other qpid-proton developers to contribute to these as generic event-driven samples and/or performance harnesses that could end up under either ./contrib/ or ./examples/ -- View this message in context: http://qpid.2158936.n2.nabble.com/no-slowdown-in-10-gigamessage-test-tp7616818p7617251.html Sent from the Apache Qpid Proton mailing list archive at Nabble.com.
interbox test: 446,000 messages per second
proton-c event interface soak test 24-25 nov 2014 results -- interbox test over 1 gig-e wire 1 connection, 1 session, 5 links [1] 10 billion ( 1e10 ) messages sent, 100 bytes payload per message, 140 bytes total per message messages per second average: 446,675 [2] bandwidth consumed: 500,275,553 bits per second [2] ( 50% of available bandwidth on 1 gig-e ) credit scheme: (per link) 400 initial credits, 200 more every time total falls to 200. behavior: Very stable during test. Low variability for time-reports after each 5 million messages delivered. No obvious changes in speed. memory growth: on receiver, from 4232 KB to 4264 KB [1] I did an earlier test that determined that these settings gave me the greatest number of messages per second on a single connection. However, these settings did not give the highest bandwidth consumption. That happened with larger message sizes. ( i.e. larger payloads. ) The test was able to saturate a 1 gig-e wire. Looking for a faster wire to test with ... [2] note: these data are from a new run that has only completed 2 billion messages so far. I lost the data from the first run. :-( I will do more analysis on the data when the current test completes.
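The credit scheme above reduces to a counter with a low-water top-up. Here is a toy model (plain C, no proton calls) that reproduces the 400-initial / 200-batch / 200-low numbers:

```c
#include <assert.h>

/* Toy model of the per-link credit scheme: grant `initial` credits up
 * front, then `batch` more whenever outstanding credit falls to `low`. */
typedef struct { int credit, initial, batch, low; } credit_model_t;

static int credit_start(credit_model_t *c) {
  c->credit = c->initial;
  return c->credit;
}

/* Called once per message received; returns how many credits were issued. */
static int credit_on_message(credit_model_t *c) {
  c->credit--;                         /* one credit consumed */
  if (c->credit <= c->low) {
    c->credit += c->batch;             /* top up before the sender stalls */
    return c->batch;
  }
  return 0;
}
```

In a real receiver the top-up would be a pn_link_flow() call; the point of the low-water batch is that the sender never sees credit hit zero, which is what keeps the throughput curve flat over 1e10 messages.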
proton-c event test stable and fast for 5 billion messages
I recently finished switching over my proton-c programs psend and precv to the new event-based interface, and my first test of them was a 5 billion message soak test. The programs survived this test with no memory growth, and no gradual slowdown. This test is meant to find the fastest possible speed of the proton-c code itself. (In future, we could make other similar tests designed to mimic realistic user scenarios.) In this test, I run both sender and receiver on one box, with the loopback interface. I have MTU == 64K, I use a credit scheme of 600 initial credits, and 300 new credits whenever credit falls below 300. The messages are small: exactly 100 bytes long. I am using two processors, both Intel Xeon E5420 @ 2.50GHz with 6144 KB cache. (Letting the OS decide which processors to use for my two processes.) On that system, with the above credit scheme, the test is sustaining throughput of 408,500 messages per second. That's over a single link, between two singly-threaded processes. This is significantly faster than my previous, non-event-based code, and I find the code *much* easier to understand. This may still not be the maximum possible speed on my box. It looks like the limiting factor will be the receiver, and right now it is using only 74% of its CPU -- so if we could get it to use 100% we *might* see a performance gain to the neighborhood of 550,000 messages per second. But I have not been able to get closer to 100% just by fooling with the credit scheme. Hmm. If you'd like to take a look, the code is here: https://github.com/mick-goulish/proton_c_clients.git
Re: [VOTE]: migrate the proton repo to use git
[ X ] Yes, migrate the proton repo over to git. [ ] No, keep it in svn. [ ] see if we can find a working copy of SCCS.
proton slowdown - I'm not doing it right
Bozzo -- I see you're right. The size of my delivery list -- i.e. the list at connection->work_head -- is slowly increasing, and that's the problem. There is something I'm not handling, which gets a lot worse under heavy system load, and my sender never digs itself out. But -- what I want here is a canonical example of how to get high throughput with proton-c at the engine level -- and what I am hearing is that I should be using the event-collector interface which Dispatch uses. That's what we want to steer new users toward. So far the simple fixes I've tried have all resulted in zero speed... So! Rather than spending more time fixing these examples I will switch to the event collector model and see if I can get a nice little example working that way. (And from what I hear, it will be much nicer and much littler.) At least I know better now what I am trying to do: Get a send/receive example at the lowest level, that we want to direct new users toward, that has throughput as high as possible, and that can run flat-out on a time scale of days or weeks without crashing, without growing, and without slowing down. - Original Message - On 28. 10. 14 20:18, Michael Goulish wrote: I have gotten callgrind call-graph pictures of my proton/engine sender and receiver when the test is running fast and when it slows down. The difference is in the sender -- when running fast, it is spending most of its time in the subtree of pn_connector_process() .. like 71%. When it slows down it is instead spending 47% in pn_delivery_writeable(), and only 17% in pn_connector_process(). Since it is still not instantly obvious to me what has happened, I thought I would share with you-all.
Please see cool pictures at: http://people.apache.org/~mgoulish/protonics/performance/slowdown/2014_10_28/svg/psend_fast.svg http://people.apache.org/~mgoulish/protonics/performance/slowdown/2014_10_28/svg/psend_slow.svg To recap -- I can trigger this condition by getting the box busy while my proton/engine test is running. I.e. by doing a build. Even though I stop the build, and all 6 other processors on the box go back to being idle -- the test never recovers. The receiver goes down to 50% CPU or worse -- but these pictures show that the behavior change is in the sender. Look at the call counts for pn_connector_process() vs. pn_delivery_writable(): fast: ratio 1 : 5; slow: ratio 1 : 244.5 (!) The iteration over the connection work list gets really expensive, which means the connection thinks it has to work on other stuff than what psend.c wants to work on. I still think that the call to pn_delivery() in psend.c is in a really unfortunate spot. btw, why do you iterate over the connection work list at all? You could just remember the delivery when calling pn_delivery(). Bozzo
proton slowdown - major clue
I have gotten callgrind call-graph pictures of my proton/engine sender and receiver when the test is running fast and when it slows down. The difference is in the sender -- when running fast, it is spending most of its time in the subtree of pn_connector_process() .. like 71%. When it slows down it is instead spending 47% in pn_delivery_writable(), and only 17% in pn_connector_process(). Since it is still not instantly obvious to me what has happened, I thought I would share with you-all. Please see cool pictures at: http://people.apache.org/~mgoulish/protonics/performance/slowdown/2014_10_28/svg/psend_fast.svg http://people.apache.org/~mgoulish/protonics/performance/slowdown/2014_10_28/svg/psend_slow.svg To recap -- I can trigger this condition by getting the box busy while my proton/engine test is running. I.e. by doing a build. Even though I stop the build, and all 6 other processors on the box go back to being idle -- the test never recovers. The receiver goes down to 50% CPU or worse -- but these pictures show that the behavior change is in the sender.
proton gradual slowdown -- I know how to cause it
Earlier I reported a very gradual slowdown in the performance of my simple 1-sender 1-receiver test, on RHEL 7.0 and Fedora 20 but not on RHEL 6.3 . The slowdown caused the test to end up running at half speed after a billion or two billion messages. ( Which took hours to run. )

I now know how to cause this slowdown to happen any time, and it works just as well on RHEL 6.3 as it does on RHEL 7.0 . All I have to do is make the machine busy. Even though I do not swamp all processors -- in fact, I leave a couple processors idle -- my receiver program slows down when the machine becomes busy -- ***and it never recovers***. I have been doing qpid builds, for example. Even after I interrupt the build -- many minutes later, long after the box has become idle but for my sender and receiver, the receiver's CPU usage is still depressed, and performance has been cut to 1/2 or 1/3 of what it was at the beginning of the test. It never comes back. In fact, I can make it ratchet down again by running another build. There is nothing magic about builds -- that was just a convenient way of making the box busy.

I will be making a Jira for this later today. I was able to make a callgrind picture for the receiver from when the test was fast and from when it was slow. I will attach all my info to the Jira.
Re: proton gradual slowdown -- I know how to cause it
You know, I thought of something along those lines, but I can't see how it makes the receiver actually use less CPU permanently. It seems like it ought to simply get a backlog, but go back to normal CPU usage. Can you think of any way that a backlog would cause the receiver to stay at low CPU? Now that I can make this happen easily instead of waiting forever-and-a-day, I will get callgrind snapshots of both programs when the test is fast and slow. It seems like that just must show me something. - Original Message - On 27. 10. 14 09:10, Michael Goulish wrote: Earlier I reported a very gradual slowdown in the performance of my simple 1-sender 1-receiver test, on RHEL 7.0 and Fedora 20 but not on RHEL 6.3 . The slowdown caused the test to end up running at half speed after a billion or two billion messages. ( Which took hours to run. ) I now know how to cause this slowdown to happen any time, and it works just as well on RHEL 6.3 as it does on RHEL 7.0 . All I have to do is make the machine busy. Even though I do not swamp all processors -- in fact, I leave a couple processors idle -- my receiver program slows down when the machine becomes busy -- ***and it never recovers***. Michael, this is totally a wild guess, but looking at your psend.c the only thing that jumps out is that you couple 1:1 the number of deliveries created and the number of calls to pn_driver_wait(). So if anything (which I cannot explain /what/) happens where the sender starts to lag in talking to the receiver, it may not be able to dig itself out. Maybe try to create the first delivery before the loop and create the next delivery after pn_link_advance(). Bozzo
proton engine perfectly stable on RHEL 6 after 1.5 billion messages
Unlike my recent experience on Fedora, I have just seen my psend and precv clients ( written against the proton engine/driver interface ) survive a 1.5 billion message test completely unscathed. On the machine I am using, that is about 4.5 hours of sending messages as fast as they will fly. Memory use is absolutely stable -- no increase at all in RSS as measured by 'top'. Time per 5 million messages has always been between 56 and 57 seconds. This is exactly the same code (for the send/recv clients) that I used on Fedora when I saw the gradual slowdown. ( I downloaded new proton code in the wee hours today, but it sure doesn't look like anything that got checked in in the last 3 days is at all relevant to a gradual slowdown.) SO ! please give me your opinion but ... I think that we DO NOT CARE about behavior on Fedora. The reason I am doing these soak tests is to assure potential users that the code is stable enough for prolonged use in a production environment. Which Fedora is not. Does that make sense to everybody ?
Re: proton engine perfectly stable on RHEL 6 after 1.5 billion messages
Yes, you're right. I hadn't thought of the Fedora-as-the-future angle. sigh. But then at least I must see whether it happens on another Fedora box, and a more up to date one -- i.e. F20, which I now have. My test that showed the slowdown ran on F17. I will start that today. I did not get the callgrind data yet, because I was unable to install callgrind-devel on my F17 box, which is why I finally installed F20 on another box. so, I will report on what happens here. - Original Message - Hi Mick, That's a real head-scratcher - I'm at a loss to explain what you are seeing. On a lark I thought that perhaps generating new random uuids for each message send may be involved - perhaps over time the entropy pool would shrink and slow down the allocation of new uuids. I even wrote a little python loop that did nothing but allocate uuids. Ran overnight, no change in allocation rate on my Fedora 19 laptop. Yeah, I really, really need to get a larger tinfoil hat. Otherwise, while I personally agree with your opinion regarding Fedora support, I'm hesitant to dismiss the problem without root causing it since what is in Fedora today often ends up in RHEL tomorrow. Did your callgrind tracing show anything of interest? -K - Original Message - From: Michael Goulish mgoul...@redhat.com To: proton@qpid.apache.org Sent: Monday, October 20, 2014 9:53:05 AM Subject: proton engine perfectly stable on RHEL 6 after 1.5 billion messages unlike my recent experience on Fedora, I have just seen my psend and precv clients ( written against proton engine/driver interface ) survive a 1.5 billion message test completely unscathed. On the machine I am using, that is about 4.5 hours of sending messages as fast as they will fly. Memory use is absolutely stable -- no increase at all in RSS as measured by 'top'. Time per 5 million messages has always between 56 and 57 seconds. This is exactly the same code (for the send/recv clients) that I used on Fedora when I saw the gradual slowdown. 
( I downloaded new proton code in the wee hours today, but it sure doesn't look like anything that got checked in in the last 3 days is at all relevant to a gradual slowdown.) SO ! please give me your opinion but ... I think that we DO NOT CARE about behavior on Fedora. The reason I am doing these soak tests is to assure potential users that the code is stable enough for prolonged use in a production environment. Which Fedora is not. Does that make sense to everybody ? -- -K
very weird proton soak-test result
I just want to mention this to the list in case anyone has an immediate brilliant idea. Something spooky is happening. I have a 2-process test, one sender, one receiver, written at the proton engine level. This is what I've been using for my nightly performance measurements lately, which are here: http://people.apache.org/~mgoulish/protonics/performance/results/nightly.svg Recently changed this test to be perpetual, and the receiver now only reports the timing for the most recent 5 million messages. So ... the weird result is that, over many messages, the test is slowing down ... and it is doing this without the RSS memory of either process growing! Arg! The virtual mem of the sender increased, but only slightly. It might very well have fallen again if I let the test keep running. ( That happened in the receiver. ) The effect is very gradual, but after 500 million messages it is taking about 50% more time to get each batch of 5 million messages received !! And it looks like the effect is accelerating. Also -- the receiver CPU usage is slowly going down. CPU usage on the sender is constant. For the receiver, CPU usage started out around 77% at the beginning of the test, and after 500 million msgs has fallen to 64% or so. My plan is to use callgrind to get a snapshot of sender behavior (I suspect sender is culprit) at the beginning of a run, and a separate snapshot later after the slowdown has started. But i wanted to just mention it here, just in case anybody has a Great Idea, or in case I get hit by a truck or something.
Re: nightly graphical proton-c performance results
excellent idea! so, just a horizontal line for each release something like this? : How about: the lines do not extend indefinitely rightward -- but only until the next release? - Original Message - Michael Goulish wrote From now on until ebola gets me (and maybe long after that!) new proton-c code will be downloaded, built, performance-tested, and the results posted in tasteful and attractive graphical form here: http://people.apache.org/~mgoulish/protonics/performance/results/nightly.svg The testing is done with my proton-C engine-level clients. Each test consists of 50 trials, 5 million small messages each trial. Michael, nice work. Would it also be possible to add baselines for some of the stable releases? 0.5, 0.6, 0.7, 0.8rc1 ? Either on a separate Releases graph, or if it can elegantly be inlined at the beginning of the nightly graph that might work too. Cheers, Dom
Re: nightly graphical proton-c performance results
B-but ... that's me! I am that process. I guess you could say that I am automated... - Original Message - On Mon, 2014-10-13 at 20:09 -0400, Michael Goulish wrote: From now on until ebola gets me (and maybe long after that!) new proton-c code will be downloaded, built, performance-tested, and the results posted in tasteful and attractive graphical form here: http://people.apache.org/~mgoulish/protonics/performance/results/nightly.svg The testing is done with my proton-C engine-level clients. Each test consists of 50 trials, 5 million small messages each trial. the graphics show you at a glance the mean value of the 50 tests for that day, as well as the plus-one sigma and minus-one sigma range. (That is the range in which about two-thirds of the test results will fall.) The standard deviation (sigma) is important, because if you see that suddenly increasing -- even if the mean value remains relatively constant -- that means we have a problem. A while ago we had a significant performance issue in qpidd that went undetected for several months. The goal here is to make sure that any significant proton performance regression will become obvious within 24 hours. (8.64e19 femtoseconds) As always, I would be happy to hear any thoughts, questions, criticisms, ideas, proposals, desires, hopes, dreams, or schemes relating to this system or anything else. Excellent stuff. Now all we need is an automated process to watch for significant performance regressions and send email listing all those commits that might have been responsible.
nightly graphical proton-c performance results
From now on until ebola gets me (and maybe long after that!) new proton-c code will be downloaded, built, performance-tested, and the results posted in tasteful and attractive graphical form here: http://people.apache.org/~mgoulish/protonics/performance/results/nightly.svg The testing is done with my proton-C engine-level clients. Each test consists of 50 trials, 5 million small messages each trial. The graphics show you at a glance the mean value of the 50 tests for that day, as well as the plus-one-sigma and minus-one-sigma range. (That is the range in which about two-thirds of the test results will fall.) The standard deviation (sigma) is important, because if you see that suddenly increasing -- even if the mean value remains relatively constant -- that means we have a problem. A while ago we had a significant performance issue in qpidd that went undetected for several months. The goal here is to make sure that any significant proton performance regression will become obvious within 24 hours. (8.64e19 femtoseconds) As always, I would be happy to hear any thoughts, questions, criticisms, ideas, proposals, desires, hopes, dreams, or schemes relating to this system or anything else.
proton callgrind pictures online
...together with the callgrind data that generated the pictures, scripts, licenses, witty illuminating commentary, advice to the lovelorn -- Everything Necessary to the Enjoyment and Success of the Aspiring Proton Programmer. http://people.apache.org/~mgoulish/protonics/performance/results/2014_09_05_perf88/
LTO: link-time optimization effect on proton performance
Link-time optimization can be turned on by adding the -flto flag to the proton library build, in both the compilation and linking steps. It offers the possibility of optimizations using deeper knowledge of the whole program than is available before linking. I have also been trying to get some extra performance by hand-inlining functions that I select based on valgrind/callgrind profiling data.

My test procedure has been to run 50 trials, where each trial is a run of two programs: my psend and precv proton-C clients written at the Engine level. Each trial involves sending and receiving 5 million small messages. The result from each trial is a single high-resolution timing number. (From just before the sender sends the first message, to just after the receiver receives the last message.) The result of each test is a list of 50 of those numbers.

I compare tests using an online Student's t-test calculator. (Student was the pen-name of the guy who invented it. His real name was Gosset, and he was working at the Guinness Brewery in Dublin when he invented it. I am not making this up.) The t-test gives a number that indicates the likelihood that the difference between two tests could have happened randomly. A small t-test result indicates that the difference between two tests is unlikely to have happened randomly. For example, a t-test result of 0.01 means that the difference between your two tests should only happen 1 time out of 100 due to random chance. Smaller results are better. With 50 sample points in each test, you can get nice high certainty as to whether you are seeing real or random results.

All of the results below are hyper-significant. The *worst* t-test result was 2.9e-8, i.e. 3 chances out of 100 million that the difference between the two tests could happen randomly.

So .. here are the results. (in seconds) ( builds used throughout are normal release-with-debug-info, with -O2 optimization. )

1. Proton code as of 0800 EDT yesterday, with no changes: mean 41.267825, sigma 0.834826
2. LTO build: mean 40.073661, sigma 1.108513 (improvement: 2.9%)
3. manual inlining changes: mean 39.011794, sigma 1.056831 (improvement: 5.5%)
4. LTO build plus my changes: mean 39.211283, sigma 1.041303 (improvement: 5.0%)

So! The LTO technology really works, but it's not as good as manual inlining based on profiling. In fact it slows that down a little, probably because it is choosing some inlining candidates that don't help enough to offset cache thrash due to code-size increase. so there you go.
[jira] [Created] (PROTON-703) inlining performance improvements
michael goulish created PROTON-703: -- Summary: inlining performance improvements Key: PROTON-703 URL: https://issues.apache.org/jira/browse/PROTON-703 Project: Qpid Proton Issue Type: Improvement Components: proton-c Reporter: michael goulish Assignee: michael goulish Priority: Minor omnibus jira for any other inlining performance improvements i may find. notes to self: * don't affect public APIs. * don't forget to test Debug build. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (PROTON-700) small performance improvement from inlining one fn.
michael goulish created PROTON-700: -- Summary: small performance improvement from inlining one fn. Key: PROTON-700 URL: https://issues.apache.org/jira/browse/PROTON-700 Project: Qpid Proton Issue Type: Improvement Components: proton-c Reporter: michael goulish Assignee: michael goulish Priority: Minor inlining the internal function pn_data_node() improves speed somewhere between 2.6% and 6%, depending on architecture. This is based on testing I did with two C-based clients written at the engine interface level. The higher 6% figure was seen on a more modern machine with recent Intel processors; the lower figure was seen on an older box with AMD processors. But the effect is real: after 50 repetitions before the change and 50 after, a t-test indicates the odds of this happening by chance are 2.0e-18 .
proton performance: Release vs. RelWithDebInfo
A little confirmation -- in my testing of proton engine send/receive clients, I just now tested the performance of those clients built against the Proton Release version versus the Proton RelWithDebInfo version. I did indeed find that both versions performed the same on my tests, after averaging the speed of 5 tests with each version. In fact, RelWithDebInfo came in a mite faster -- a couple percent -- but probably still within the variability of the observations.
Proton Performance Pictures (1 of 2)
[ resend : I am attaching only 1 image here, so hopefully the apache mail gadget will not become upset. Next one in next email. ] Attached, please find two cool pictures of the valgrind/callgrind data I got with a test run of the psend and precv clients I mentioned before. ( Sorry, I keep saying 'clients'. These are pure Peer-to-Peer. ) ( Hey -- if we ever sell this technology to maritime transport companies, could we call it Pier-to-Pier ? ) This was from a run of 100,000 messages, using credit strategy of 200, 100, 100. i.e. start at 200, every time you get down to 100, add 100. That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. ) By the way, I actually got repeatably better performance (maybe 1.5% better (which resulted in the 123,500 number)) by using processors 1 and 3 on my laptop, rather than 1 and 2. Looking at /proc/cpuinfo, I see that processors 1 and 3 have different core IDs. OK, whatever. ( And it's an Intel system... ) I think there are no shockers here: psend uses its time in pn_post_transfer_frame (44%) precv uses its time in pn_dispatch_frame (67%) The code is at https://github.com/mick-goulish/proton_c_clients.git I will put all this performance info in there too, shortly.
proton performance pic 1 of 2
I hope these are still legible. I would like to have a Little Talk with the Apache mail server... I will put full-resolution versions in my git repo...
proton performance: OK, so I can't send you pictures.
Fabulous. OK, I put them in my git repo: https://github.com/mick-goulish/proton_c_clients.git And there, they are full-res. sigh.
Re: Proton Performance Pictures (1 of 2)
- Original Message - On 09/03/2014 08:51 AM, Michael Goulish wrote: That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. ) If you are sending direct between the sender and receiver process (i.e. no intermediary process), then why are you doubling the number of messages sent to get 'transfers per second'? One transfer is the sending of a message from one process to another, which in this case is the same as messages sent or received. Yes, this is interesting. I need a way to make a fair comparison between something like this setup (simple peer-to-peer) and the Dispatch Router numbers I was getting earlier. For the router, the analogous topology is writer -- router -- reader in which case I counted each message twice. But it does not seem right to count a single message in writer -- router -- reader as 2 transfers, while counting a single message in writer -- reader as only 1 transfer. Because -- from the application point of view, those two topologies are doing the same work. Also I think that I *need* to count writer -- router -- reader as 2, because in *this* case: writer -- router -- reader_1 \ \-- reader_2 ...I need to count that as 3 . ? Thoughts ?
Re: Proton Performance Pictures (1 of 2)
OK -- I just had a quick talk with Ted, and this makes sense to me now: count *receives* per second. I had it turned around and was worried about *sends* per second, and then got confused by issues of fanout. If you only count *receives* per second, and assume no discards, it seems to me that you can indeed make a fair speed comparison between sender -- receiver sender -- intermediary -- receiver and sender -- intermediary -- {receiver_1 ... receiver_n} and even sender -- {arbitrary network of intermediaries} -- {receiver_1 ... receiver_n} phew. So I will do it that way. This is from the application perspective, asking how fast is your messaging system. It doesn't care about how fancy the intermediation is, it only cares about results. This seems like the right way to judge that. - Original Message - On 09/03/2014 11:35 AM, Michael Goulish wrote: - Original Message - On 09/03/2014 08:51 AM, Michael Goulish wrote: That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. ) If you are sending direct between the sender and receiver process (i.e. no intermediary process), then why are you doubling the number of messages sent to get 'transfers per second'? One transfer is the sending of a message from one process to another, which in this case is the same as messages sent or received. Yes, this is interesting. I need a way to make a fair comparison between something like this setup (simple peer-to-peer) and the Dispatch Router numbers I was getting earlier. For the router, the analogous topology is writer -- router -- reader in which case I counted each message twice. But it does not seem right to count a single message in writer -- router -- reader as 2 transfers, while counting a single message in writer -- reader as only 1 transfer. Because -- from the application point of view, those two topologies are doing the same work. 
You should probably be using throughput and not transfers in this case. Also I think that I *need* to count writer -- router -- reader as 2, because in *this* case: writer -- router -- reader_1 \ \-- reader_2 ...I need to count that as 3 . ? Thoughts ?
Re: proton performance: OK, so I can't send you pictures.
- Original Message - On Wed, 2014-09-03 at 04:18 -0400, Michael Goulish wrote: Fabulous. OK, I put them in my git repo: https://github.com/mick-goulish/proton_c_clients.git And there, they are full-res. sigh. pn_data_node and pn_data_add look interesting. Might be worth inlining pn_data_node if the compiler isn't doing that already (did you build with -O3?) Also shaving a few instructions off pn_data_add might pay off. I just went into ccmake and told it to do a Release build. Is there any way I can turn it up to 11 ?
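On the "turn it up to 11" question: CMake's Release build type typically already implies -O3 for GCC (RelWithDebInfo uses -O2), so one approach is to set the flags explicitly. A hypothetical invocation -- the CMAKE_* cache variables are standard CMake, but I have not verified proton's own build options, and -flto must reach both the compile and link steps:

```shell
# Hedged sketch: force -O3 plus link-time optimization on a CMake build.
# Run from an out-of-source build directory alongside the proton checkout.
cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DCMAKE_C_FLAGS="-O3 -flto" \
         -DCMAKE_SHARED_LINKER_FLAGS="-flto"
make
```

This is a config fragment rather than something to run standalone; inlining hints from profiling (as in the PROTON-700 work) may still beat what -O3/-flto choose on their own.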
proton engine performance: two strong credit management effects
conclusion = Using the proton engine interface (in C) I am seeing two aspects of credit management that can strongly affect throughput. The first is credit stall. If you frequently allow the credit available to the sender to drop to zero, so that he has to cool his heels waiting for more, that can have a very strong effect even in my simple test case, in which the receiver is granting new credit as quickly as possible, and is serving only 1 sender. The second effect is credit replenishment amortization. It looks like the granting of new credit is kind of an expensive operation. If you do it too frequently, that will also have a noticeable effect on throughput. These tests use the C clients, written against the engine/driver level, that I recently put at https://github.com/mick-goulish/proton_c_clients.git test setup = * single sender, single receiver, on laptop. * sender only sends, receiver only receives. * one link * sender locked onto CPU core 1, receiver locked onto CPU core 2 * system is otherwise quiet - only OS and XFCE running. no browser, no internet. * sender sends as fast as possible, receiver receives as fast as possible. * each test consists of 5,000,000 messages, about 50 bytes of payload each. * each test is repeated 3 times, and the results averaged to make the number that is graphed. stall test result = scenario: start out with 200 credits. Every time we get down to X, add 100 to credit level. X axis: point at which credit gets refilled. Y axis: messages received per second. Note: When we let credit go to 10 or less before replenishment, throughput falls off a cliff. amortization test result = scenario: start out with 200 credits. Every time we get down to 200-X, add X to credit level. X axis: credit increment Y axis: messages received per second. Note: In this case, if X has a value of 5, that means we add 5 new units of credit every time we see that it has fallen by 5. The smaller the X value, the more frequently we replenish credit. 
If we replenish too frequently, throughput is affected.
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051826#comment-14051826 ] michael goulish commented on PROTON-625: Here's what happens, and a fix.

1. pni_map_entry() calls pni_map_ensure() to make sure the map has enough capacity.
2. The capacity-increasing loop in pni_map_ensure() has two conditions on it: increase the capacity if map->capacity is too small, or if the map 'load' is greater than map->load_factor. ( Map load is ... meaning not obvious to me. )
3. If pni_map_ensure() returns true, then pni_map_entry() will call itself recursively, and keep doing that until pni_map_ensure() returns false. 'False' means 'I made no change.'
4. But it is possible for pni_map_ensure() to make no change, and yet return true. Here is how it happened in my most recent test: map->capacity 512, capacity 331, pni_map_load(map) 0.75, map->load_factor 0.75
5. Those values made *both* conditions on the capacity-increasing loop in pni_map_ensure() false. So it didn't do anything to change the map. But it returned true. So pni_map_entry() called itself. But nothing had changed. And away we go.

FIX: Make the test on the if at the top of pni_map_ensure say this: if (capacity <= map->capacity && load <= map->load_factor) { ( Added '=' to the load test. )

After that, I ran twenty tests with no failure. Previously, failure probability on my system was 0.3. So the odds of 20 in a row happening by chance is a little less than 1 in 1000.

Biggest Backtrace Ever! --- Key: PROTON-625 URL: https://issues.apache.org/jira/browse/PROTON-625 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.8 Reporter: michael goulish I am saving all my stuff so I can repro on demand. It doesn't happen every time, but it's about 50%. -- On one box, I have a dispatch router. On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 qpid-messaging-based senders.
Each client will handle 100 addresses, of the form mick/0 ... mick/1 ... c. 100 messages will be sent to each address. I start the 5 receivers first. They start OK. Dispatch router happy stable. Wait a few seconds. I start the 5 senders, from a bash script. The first sender is already sending when the 2nd, 3rd, 4th start. After a few of them start,but before all have finished starting, a few seconds into the script, the crash occurs. ( If they all start up successfully, no crash. ) The crash occurs in the dispatch router. Here is the biggest backtrace ever: #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at malloc.c:4383 #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 #2 0x0039c6c1650a in pni_map_allocate () from /usr/lib64/libqpid-proton.so.2 #3 0x0039c6c16a3a in pni_map_ensure () from /usr/lib64/libqpid-proton.so.2 #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 . . . . 
#93549 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93550 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93551 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93552 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93553 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93554 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93555 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93556 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93557 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 #93558
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049897#comment-14049897 ] michael goulish commented on PROTON-625: I had some confusion about what libraries were being picked up. Sorry! This bug is *not* present on 0.7 ! I was able to run 0.7-based dispatch-router 10 times with no failure. Then, switching to latest proton trunk code as of today -- 2 out of first 3 tests resulted in this failure. Biggest Backtrace Ever! --- Key: PROTON-625 URL: https://issues.apache.org/jira/browse/PROTON-625 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.8 Reporter: michael goulish I am saving all my stuff so I can repro on demand. It doesn't happen every time, but it's about 50%. -- On one box, I have a dispatch router. On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 qpid-messaging-based senders. Each client will handle 100 addresses, of the form mick/0 ... mick/1 ... c. 100 messages will be sent to each address. I start the 5 receivers first. They start OK. Dispatch router happy stable. Wait a few seconds. I start the 5 senders, from a bash script. The first sender is already sending when the 2nd, 3rd, 4th start. After a few of them start,but before all have finished starting, a few seconds into the script, the crash occurs. ( If they all start up successfully, no crash. ) The crash occurs in the dispatch router. 
Here is the biggest backtrace ever:

#0  0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at malloc.c:4383
#1  0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664
#2  0x0039c6c1650a in pni_map_allocate () from /usr/lib64/libqpid-proton.so.2
#3  0x0039c6c16a3a in pni_map_ensure () from /usr/lib64/libqpid-proton.so.2
#4  0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2
#5  0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2
. . . (the same pni_map_entry frame, 0x0039c6c16c64, repeats through #93558) . . .
#93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2
#93560 0x0039c6c17226 in pn_hash_put () from /usr/lib64/libqpid-proton.so.2
#93561 0x0039c6c2a643 in pn_delivery_map_push () from /usr/lib64/libqpid-proton.so.2
#93562 0x0039c6c2c44b in pn_do_transfer () from /usr/lib64/libqpid-proton.so.2
#93563 0x0039c6c24385 in pn_dispatch_frame () from /usr/lib64/libqpid-proton.so.2
#93564 0x0039c6c2448f in pn_dispatcher_input () from /usr/lib64/libqpid-proton.so.2
#93565 0x0039c6c2d68b in pn_input_read_amqp () from /usr/lib64/libqpid-proton.so.2
#93566 0x0039c6c3011a in pn_io_layer_input_passthru () from /usr/lib64/libqpid-proton.so.2
#93567 0x0039c6c3011a in pn_io_layer_input_passthru () from /usr/lib64/libqpid-proton.so.2
#93568 0x0039c6c2d275 in transport_consume () from /usr/lib64/libqpid-proton.so.2
#93569 0x0039c6c304cd in pn_transport_process () from /usr/lib64/libqpid-proton.so.2
#93570 0x0039c6c3e40c in pn_connector_process () from /usr/lib64/libqpid-proton.so.2
#93571 0x7f1060c60460 in process_connector () from /home/mick/dispatch/build/libqpid-dispatch.so.0
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051007#comment-14051007 ]

michael goulish commented on PROTON-625:

Here is a hack that fixes it. A little new code in pni_map_ensure(). Tested this on latest proton trunk, revision 1607485.

Without hack: 3 failures out of 10 tests. (Similar to what I have been seeing on other versions.)
With hack: 0 failures out of 13 tests. (Probability this happened by chance: less than 1%.)

So, now I'm trying to see how it should *really* be fixed...

--- code --- code --- code --- code --- code --- code ---

  // This loop is what is already there, in pni_map_ensure. No change.
  while (map->capacity < capacity || pni_map_load(map) > map->load_factor) {
    map->capacity *= 2;
    map->addressable = (size_t) (0.86 * map->capacity);
  }

  /*---
    If ever we get past the above while-loop without actually
    having changed map->cap, we are doomed to eternal torment.
    So, force it.
  ---*/
  if ( oldcap == map->capacity ) {
    fprintf ( stderr, "Fiery the angels fell; deep thunder rolled around their shores, burning with the fires of Orc!\n" );
    map->capacity *= 2;
    map->addressable = (size_t) (0.86 * map->capacity);
  }
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048569#comment-14048569 ]

michael goulish commented on PROTON-625:

BTW -- I kill and restart the router after each test.
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048573#comment-14048573 ]

michael goulish commented on PROTON-625:

When I put usleep(1000) after each message sent, I have zero failures in 10 tries.
Re: [jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
Yes! Great idea -- I will attempt.

- Original Message -

[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048701#comment-14048701 ]

Rafael H. Schloming commented on PROTON-625:

I think the easiest way to track down this bug would be to put some sort of detection inside of pni_map_entry and if it recurses more than some limit, e.g. 32 times or something, then print out a representation of the map's internal structure. It might also help to use a debug build so you have line numbers. Is that something you feel comfortable trying? You should be able to find the relevant code around line 551 of object.c.
[jira] [Created] (PROTON-625) Biggest Backtrace Ever!
michael goulish created PROTON-625:
--
Summary: Biggest Backtrace Ever!
Key: PROTON-625
URL: https://issues.apache.org/jira/browse/PROTON-625
Project: Qpid Proton
Issue Type: Bug
Components: proton-c
Reporter: michael goulish

I am saving all my stuff so I can repro on demand. It doesn't happen every time, but it's about 50%.

On one box, I have a dispatch router. On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 qpid-messaging-based senders. Each client will handle 100 addresses, of the form mick/0 ... mick/1 ... etc. 100 messages will be sent to each address.

I start the 5 receivers first. They start OK. Dispatch router happy and stable. Wait a few seconds. I start the 5 senders, from a bash script. The first sender is already sending when the 2nd, 3rd, and 4th start. After a few of them start, but before all have finished starting, a few seconds into the script, the crash occurs. (If they all start up successfully, no crash.)

Here is the biggest backtrace ever:

#0  0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at malloc.c:4383
#1  0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664
#2  0x0039c6c1650a in pni_map_allocate () from /usr/lib64/libqpid-proton.so.2
#3  0x0039c6c16a3a in pni_map_ensure () from /usr/lib64/libqpid-proton.so.2
#4  0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2
. . . (the same pni_map_entry frame, 0x0039c6c16c64, repeats from #5 through #93558) . . .
#93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2
#93560 0x0039c6c17226 in pn_hash_put () from /usr/lib64/libqpid-proton.so.2
#93561 0x0039c6c2a643 in pn_delivery_map_push () from /usr/lib64/libqpid-proton.so.2
#93562 0x0039c6c2c44b in pn_do_transfer () from /usr/lib64/libqpid-proton.so.2
#93563 0x0039c6c24385 in pn_dispatch_frame () from /usr/lib64/libqpid-proton.so.2
#93564 0x0039c6c2448f in pn_dispatcher_input () from /usr/lib64/libqpid-proton.so.2
#93565 0x0039c6c2d68b in pn_input_read_amqp () from /usr/lib64/libqpid-proton.so.2
#93566 0x0039c6c3011a in pn_io_layer_input_passthru () from /usr/lib64/libqpid-proton.so.2
#93567 0x0039c6c3011a in pn_io_layer_input_passthru () from /usr/lib64/libqpid-proton.so.2
#93568 0x0039c6c2d275 in transport_consume () from /usr/lib64/libqpid-proton.so.2
#93569 0x0039c6c304cd in pn_transport_process () from /usr/lib64/libqpid-proton.so.2
#93570 0x0039c6c3e40c in pn_connector_process () from /usr/lib64/libqpid-proton.so.2
#93571 0x7f1060c60460 in process_connector () from /home/mick/dispatch/build/libqpid-dispatch.so.0
#93572 0x7f1060c61017 in thread_run () from /home/mick/dispatch/build/libqpid-dispatch.so.0
#93573 0x003cf9c07851 in start_thread (arg=0x7f1052bfd700) at pthread_create.c:301
#93574 0x003cf98e890d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

michael goulish updated PROTON-625:
---

Description: the same reproduction steps and backtrace as in the original report, with one line added: "The crash occurs in the dispatch router."
big improvement in memory usage for proton address scale-up
In my original testing for address scale-up with a Proton Messenger based client, I was measuring a memory cost of 115 KB per subscribed address in the client. Now, after Rafi's recent changes, I am seeing a better than 7x improvement, to just under 16 KB per subscribed address. The downside, of course, is that this will make it about 7x harder for me to persuade my boss to buy me a Really Big Box. ( But I'll think of something... ) Thanks for the memory!
[jira] [Commented] (PROTON-577) CollectorImpl creates a lot of unnecessary garbage
[ https://issues.apache.org/jira/browse/PROTON-577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13989829#comment-13989829 ]

michael goulish commented on PROTON-577:

What the engineer *means* to say is "superfluous paraphernalia."

CollectorImpl creates a lot of unnecessary garbage
--
Key: PROTON-577
URL: https://issues.apache.org/jira/browse/PROTON-577
Project: Qpid Proton
Issue Type: Improvement
Components: proton-j
Affects Versions: 0.7
Reporter: Rafael H. Schloming
Assignee: Rafael H. Schloming
[jira] [Created] (PROTON-566) crash in pn_transport_set_max_frame
michael goulish created PROTON-566:
--
Summary: crash in pn_transport_set_max_frame
Key: PROTON-566
URL: https://issues.apache.org/jira/browse/PROTON-566
Project: Qpid Proton
Issue Type: Bug
Components: proton-c
Affects Versions: 0.7
Environment: 3 boxes. 1 with senders, 1 with receivers, and 1 in the middle with a single router.
Reporter: michael goulish

Here's what I do: (I have saved all relevant software so I can repro this.)

1. On the router box, start 1 router.
2. On the receiver box, start 1000 receivers, with delays in between each group of 50, so as to avoid a backlog problem.
3. After the receivers are all started, start 1000 senders, also with delays. The senders start up but do not begin sending until I manually signal them by touching a file.
4. A short time after the senders start sending, qdrouterd crashes in proton code, with this traceback:

Core was generated by `/home/mick/dispatch/build/router/qdrouterd --config ./config_1/X.conf'.
Program terminated with signal 11, Segmentation fault.
#0  0x7f29d3c0f3c0 in pn_transport_set_max_frame (transport=0x0, size=65536) at /home/mick/proton/proton-c/src/transport/transport.c:1915
1915        transport->local_max_frame = size;
#0  0x7f8ad5a613c0 in pn_transport_set_max_frame (transport=0x0, size=65536) at /home/mick/proton/proton-c/src/transport/transport.c:1915
#1  0x7f8ad5cdd4bd in thread_process_listeners (qd_server=0x14f8e10) at /home/mick/dispatch/src/server.c:100
#2  0x7f8ad5cddedb in thread_run (arg=0x1490bf0) at /home/mick/dispatch/src/server.c:416
#3  0x003638c07de3 in start_thread () from /lib64/libpthread.so.0
#4  0x0036388f616d in clone () from /lib64/libc.so.6
[jira] [Updated] (PROTON-566) crash in pn_transport_set_max_frame
[ https://issues.apache.org/jira/browse/PROTON-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish updated PROTON-566: --- Description: Looks like this is not a proton problem, but something in dispatch. I'm closing this and moving it.
[jira] [Closed] (PROTON-566) crash in pn_transport_set_max_frame
[ https://issues.apache.org/jira/browse/PROTON-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish closed PROTON-566. -- Resolution: Fixed It looks like this is not a proton issue, but a dispatch issue. I'm closing this and moving it. -- This message was sent by Atlassian JIRA (v6.2#6252)
Plan for Improving Engine API Documentation, and Request for Suggestions, Etc.
Hello, Proton List! I'd like to take a whack at adding to the documentation of the Proton Engine API, and I'd be happy to hear any relevant suggestions, advice, opinions, speculation, calumny, slander, warnings, desires, requests, demands, tall tales, theories, predictions, personal experiences, or amusing anecdotes that anyone might wish to share. In that order.

My plan is to attempt documentation of the functions that are mentioned in Rafi's excellent UML diagram: https://cwiki.apache.org/confluence/display/qpid/Proton+Architecture that are not already documented, and that are in the public interface ( engine.h ).

As I work on those functions, I will write little examples, minimal and focused on one function at a time wherever humanly possible. I will probably start with some kind of Absolutely Minimal Hello World, then use that as a template to show off other functions. These examples are for myself, but if they seem useful I will show them to you and see if they ought to be published somewhere.

Here is the initial list of functions. After this I will go after whatever other functions are in the engine.h file that are not totally glaringly obvious, like pn_return_char_ptr(). ( I made that one up... )

pn_condition_clear
pn_condition_is_set
pn_connection_reset
pn_delivery_buffered
pn_delivery_clear
pn_delivery_readable
pn_delivery_settled
pn_delivery_updated
pn_delivery_writable
pn_error_clear
pn_error_set
pn_link_available
pn_link_credit
pn_link_get_drain
pn_link_is_receiver
pn_link_is_sender
pn_link_queued
pn_link_drain
pn_link_draining
pn_link_flow
pn_link_recv
pn_link_set_drain
pn_link_drained
pn_link_offered
pn_link_send
pn_session_incoming_bytes
pn_session_outgoing_bytes
pn_terminus_copy
pn_transport_unbind
please review: Ruby Messenger Doc
I wanted to show you this in lovely HTML, but all my attempts thus far (outside of the usual Ruby framework) have created only travesties of the proper format: diseased and horrible things, lurching through the stygian depths of my browser like ... like... Ah. Sorry. Anyway, so --- so, I'm not doing that. Just settling for practical ASCII. (It will show up the usual way in the rdoc-generated HTML, once I check this stuff in.) This Ruby text is almost identical to the Python text I sent out a while ago, just a few tweaks attempting to increase perceived Rubiosity. There are a few places where the Ruby API lacked some of the other bindings' (and C code's) interfaces -- those are noted here, but I left the text in this doc, anticipating that those APIs will show up shortly. (Thanks to mcpierce.) Those places are noted, with the word NOTE. = - class comments - { The Messenger class defines a high level interface for sending and receiving Messages. Every Messenger contains a single logical queue of incoming messages and a single logical queue of outgoing messages. The messages in these queues may be destined for, or originate from, a variety of addresses. The messenger interface is single-threaded. All methods except one (interrupt) are intended to be used from within the messenger thread. Sending & Receiving Messages { The Messenger class works in conjunction with the Message class. The Message class is a mutable holder of message content. The put method copies its Message to the outgoing queue, and may send queued messages if it can do so without blocking. The send method blocks until it has sent the requested number of messages, or until a timeout interrupts the attempt. Similarly, the recv method receives messages into the incoming queue, and may block as it attempts to receive the requested number of messages, or until the timeout is reached. It may receive fewer than the requested number. 
The get method pops the eldest Message off the incoming queue and copies it into the Message object that you supply. It will not block. The blocking attribute allows you to turn off blocking behavior entirely, in which case send and recv will do whatever they can without blocking, and then return. You can then look at the number of incoming and outgoing messages to see how much outstanding work still remains. } } - method details - { __init__ { Construct a new Messenger with the given name. The name has global scope. If a NULL name is supplied, a unique name will be chosen. } __del__ { Destroy the Messenger. This will close all connections that are managed by the Messenger. Call the stop method before destroying the Messenger. } start { Currently a no-op placeholder. For future compatibility, do not send or recv messages before starting the Messenger. } stop { Transitions the Messenger to an inactive state. An inactive Messenger will not send or receive messages from its internal queues. A Messenger should be stopped before being discarded to ensure a clean shutdown handshake occurs on any internally managed connections. } subscribe { Subscribes the Messenger to messages originating from the specified source. The source is an address as specified in the Messenger introduction with the following addition. If the domain portion of the address begins with the '~' character, the Messenger will interpret the domain as host/port, bind to it, and listen for incoming messages. For example ~0.0.0.0, amqp://~0.0.0.0, and amqps://~0.0.0.0 will all bind to any local interface and listen for incoming messages, with the last variant only permitting incoming SSL connections. } put { Places the content contained in the message onto the outgoing queue of the Messenger. This method will never block; however, it will send any unblocked Messages in the outgoing queue immediately and leave any blocked Messages remaining in the outgoing queue. 
The send call may be used to block until the outgoing queue is empty. The outgoing property may be used to check the depth of the outgoing queue. When the content in a given Message object is copied to the outgoing message queue, you may then modify or discard the Message object without having any impact on the content in the outgoing queue. This method returns an outgoing tracker for the Message. The tracker can be used to determine the delivery status of the
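The copy-on-put behavior described above can be illustrated with a small self-contained sketch. This is plain Python with made-up class names, not the actual Ruby or Python Messenger binding: put() copies the message content onto the queue, so reusing the Message object afterwards does not disturb what is already queued.

```python
class Message:
    """A mutable holder of message content (toy stand-in)."""
    def __init__(self, body=None):
        self.body = body

class OutgoingQueue:
    """Toy model of the outgoing queue: put() copies the message's
    content, so the caller may modify or discard the Message object
    without any impact on the content already in the queue."""
    def __init__(self):
        self.queue = []

    def put(self, message):
        # Copy the content, not a reference to the caller's object.
        self.queue.append(Message(message.body))

msgr = OutgoingQueue()
msg = Message("first")
msgr.put(msg)
msg.body = "second"      # reuse the same object for the next message...
msgr.put(msg)
print([m.body for m in msgr.queue])   # ['first', 'second']
```

If put() stored a reference instead of a copy, both queued entries would read "second" -- the copy is what makes the Message object safely reusable.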
[jira] [Created] (PROTON-452) Ruby API doesn't have pn_messenger_interrupt()
michael goulish created PROTON-452: -- Summary: Ruby API doesn't have pn_messenger_interrupt() Key: PROTON-452 URL: https://issues.apache.org/jira/browse/PROTON-452 Project: Qpid Proton Issue Type: Bug Affects Versions: 0.5 Reporter: michael goulish It looks like the Ruby binding doesn't cover the new-ish C function pn_messenger_interrupt(). -- This message was sent by Atlassian JIRA (v6.1#6144)
proposed Python API doc changes -- will check in on All Hallow's Eve
Dear Proton Proponents -- Here is my proposed text for Python Messenger API documentation. If you'd like to comment, please do so within the next week. I will incorporate feedback and check in the resulting changes to the codebase at the stroke of midnight, on All Hallows Eve. ( Samhain. ) I have given you the current text for each method and property, and then my changes. My changes are either proposed replacements ( NEW_TEXT ) or proposed additions ( ADD_TEXT ). Mostly, this is highly similar to the C API text, but with minor changes for Pythonification. -- Mick . Class Comments { CURRENT_TEXT { The Messenger class defines a high level interface for sending and receiving Messages. Every Messenger contains a single logical queue of incoming messages and a single logical queue of outgoing messages. These messages in these queues may be destined for, or originate from, a variety of addresses. } ADD_TEXT { The messenger interface is single-threaded. All methods except one ( interrupt ) are intended to be used from within the messenger thread. } } Sending & Receiving Messages { CURRENT_TEXT { The L{Messenger} class works in conjuction with the L{Message} class. The L{Message} class is a mutable holder of message content. The L{put} method will encode the content in a given L{Message} object into the outgoing message queue leaving that L{Message} object free to be modified or discarded without having any impact on the content in the outgoing queue. Similarly, the L{get} method will decode the content in the incoming message queue into the supplied L{Message} object. } NEW_TEXT { The Messenger class works in conjunction with the Message class. The Message class is a mutable holder of message content. The put method copies its message to the outgoing queue, and may send queued messages if it can do so without blocking. The send method blocks until it has sent the requested number of messages, or until a timeout interrupts the attempt. 
Similarly, the recv() method receives messages into the incoming queue, and may block until it has received the requested number of messages, or until timeout is reached. The get method pops the eldest message off the incoming queue and copies it into the message object that you supply. It will not block. } NOTE { I thought it would be better in this comment to only emphasize the blocking and non-blocking differences between get/put and recv/send. Details about how the arg message is handled are moved to the comments for specific methods. } } Method Details { __init__ { CURRENT_TEXT { Construct a new L{Messenger} with the given name. The name has global scope. If a NULL name is supplied, a L{uuid.UUID} based name will be chosen. } NEW_TEXT { // no change } } __del__ { CURRENT_TEXT { // none } NEW_TEXT { Destroy the messenger. This will close all connections that are managed by the messenger. Call the stop method before destroying the messenger. } } start { CURRENT_TEXT { Transitions the L{Messenger} to an active state. A L{Messenger} is initially created in an inactive state. When inactive a L{Messenger} will not send or receive messages from its internal queues. A L{Messenger} must be started before calling L{send} or L{recv}. } NEW_TEXT { Currently a no-op placeholder. For future compatibility, do not send or receive messages before starting the messenger. } } stop { CURRENT_TEXT { Transitions the L{Messenger} to an inactive state. An inactive L{Messenger} will not send or receive messages from its internal queues. A L{Messenger} should be stopped before being discarded to ensure a clean shutdown handshake occurs on any internally managed connections. } NEW_TEXT { // no change } } subscribe { CURRENT_TEXT { Subscribes the L{Messenger} to messages originating from the specified source. The source is an address as specified in the L{Messenger} introduction with the following addition. If the domain portion of the address begins with the '~'
[jira] [Comment Edited] (PROTON-260) Messenger Documentation
[ https://issues.apache.org/jira/browse/PROTON-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797027#comment-13797027 ] michael goulish edited comment on PROTON-260 at 10/16/13 5:30 PM: -- rev 152 -- checked in new C API doxygen comments in messenger.h was (Author: mgoulish): rev r152 -- checked in new C API doxygen comments in messenger.h Messenger Documentation --- Key: PROTON-260 URL: https://issues.apache.org/jira/browse/PROTON-260 Project: Qpid Proton Issue Type: Improvement Components: proton-c Affects Versions: 0.5 Reporter: michael goulish Assignee: michael goulish Write documentation for the Proton Messenger interface, to include: introduction API explanations theory of operation example programs programming idioms tutorials quickstarts troubleshooting Documents should use MarkDown markup language. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (PROTON-260) Messenger Documentation
[ https://issues.apache.org/jira/browse/PROTON-260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797027#comment-13797027 ] michael goulish commented on PROTON-260: rev r152 -- checked in new C API doxygen comments in messenger.h Messenger Documentation --- Key: PROTON-260 URL: https://issues.apache.org/jira/browse/PROTON-260 Project: Qpid Proton Issue Type: Improvement Components: proton-c Affects Versions: 0.5 Reporter: michael goulish Assignee: michael goulish Write documentation for the Proton Messenger interface, to include: introduction API explanations theory of operation example programs programming idioms tutorials quickstarts troubleshooting Documents should use MarkDown markup language. -- This message was sent by Atlassian JIRA (v6.1#6144)
please take a look at new C API descriptions
These are expanded descriptions that I'd like to add to the C API documentation. ( These are the descriptions only -- where the current info already explains the parameters and return values I will just leave those in place. ) Please take a look to see 1. whether the description matches your understanding of what the functions do, and how they fit together. 2. whether you, as a developer using this code, would find the description useful, sufficient, understandable, etc. Question 2 is still very valuable even if you have no idea about Question 1. This is not yet a complete list. Some of the functions are clear already, and some I have no clue about as yet. Here they are: pn_messenger_accept { Signal the sender that you have received and have acted on the message pointed to by the tracker. If the PN_CUMULATIVE flag is set, all messages prior to the tracker will also be accepted, back to the beginning of your incoming window. } pn_messenger_errno { Return the code for the most recent error. Initialized to zero at messenger creation. Error numbers are sticky, i.e. they are not reset to 0 at the end of successful API calls. (NOTE! This is the only description that is intentionally false. There *is* one API call that resets errno to 0 -- but I think it shouldn't, and I will complain about it Real Soon Now.) } pn_messenger_error { Return a text description of the most recent error. Initialized to null at messenger creation. Error text is sticky, i.e. not reset to null at the end of successful API calls. } pn_messenger_get { Pop the oldest message off your incoming message queue, and copy it into the given message structure. If the given pointer to a message structure is NULL, the popped message is discarded. Returns PN_EOS if there are no messages to get. Returns an error code only if there is a problem in decoding the message. 
} pn_messenger_incoming_subscription { Returns a pointer to the subscription of the message returned by the most recent call to pn_messenger_get(), or NULL if pn_messenger_get() has never been called. } pn_messenger_incoming_tracker { Returns a tracker for the message most recently fetched by pn_messenger_get(). The tracker allows you to accept or reject its message, or its message plus all prior messages that are still within your incoming window. } pn_messenger_outgoing_tracker { Returns a tracker for the outgoing message most recently given to pn_messenger_put. Use this tracker with pn_messenger_status to determine the delivery status of the message, as long as the message is still within your outgoing window. } pn_messenger_put { Puts the message onto the messenger's outgoing queue. The message may also be sent if transmission would not cause blocking. This call will not block. } pn_messenger_reject { Rejects the message indicated by the tracker. If the PN_CUMULATIVE flag is used this call will also reject all prior messages that have not already been settled. The semantics of message rejection are application-specific. If messages represent work requests, then rejection would leave the sender free to try another receiver, without fear of having the same task done twice. } pn_messenger_rewrite { Similar to pn_messenger_route(), except that the destination of the message is determined before the message address is rewritten. If a message has an outgoing address of amqp://0.0.0.0:5678, and a rewriting rule that changes its outgoing address to foo, it will still arrive at the peer that is listening on amqp://0.0.0.0:5678, but when it arrives there, its outgoing address will have been changed to foo. } pn_messenger_send { If blocking has been set with pn_messenger_set_blocking, this call will block until n messages have been sent. A value of -1 for n means all messages in the outgoing queue. 
In addition, if a nonzero size has been set for the outgoing window, this call will block until all messages within that window have been received. Any blocking will end upon timeout, if one has been set by pn_messenger_timeout. If blocking has not been set, this call will stop transmitting messages when further transmission would require blocking, or when the outgoing queue is empty, or when n messages have been sent. } pn_messenger_set_blocking { Enable or disable blocking behavior during calls to pn_messenger_send and pn_messenger_recv. } pn_messenger_set_incoming_window { The size of your incoming window limits the number of messages that can be accepted or rejected using trackers. Messages do not enter this window when they have been received (pn_messenger_recv) onto your incoming queue. Messages enter this window only when you take them into your application using pn_messenger_get. If your incoming window size is N, and you get N+1 messages without explicitly accepting or
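As a rough illustration of the accept/cumulative-accept bookkeeping described in these sections, here is a small self-contained model. All names here are hypothetical, plain Python -- this is not the Proton API, just the semantics: get() assigns a tracker, and accepting with the CUMULATIVE flag also settles every earlier message still in the window.

```python
class Window:
    """Toy model of the incoming window: get() assigns trackers, and
    accept() with the CUMULATIVE flag settles the tracked message
    plus all prior messages still in the window."""
    CUMULATIVE = 1

    def __init__(self):
        self.entries = []   # [tracker, message, disposition]

    def get(self, message):
        tracker = len(self.entries)       # tracker assigned on get()
        self.entries.append([tracker, message, None])
        return tracker

    def accept(self, tracker, flags=0):
        for entry in self.entries:
            earlier = (flags & self.CUMULATIVE) and entry[0] < tracker
            if (entry[0] == tracker or earlier) and entry[2] is None:
                entry[2] = 'accepted'

w = Window()
trackers = [w.get(m) for m in ('a', 'b', 'c')]
w.accept(trackers[2], flags=Window.CUMULATIVE)
print([e[2] for e in w.entries])   # ['accepted', 'accepted', 'accepted']
```

Accepting only trackers[2] without the flag would leave 'a' and 'b' undisposed; the cumulative form is the "back to the beginning of your incoming window" behavior described for pn_messenger_accept.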
another proton error question
OK, so now I understand that we are using the standard errno philosophy in which: errno always contains the most recent error from anywhere within messenger code. ( and is initialized to 0 on creation of the messenger struct. ) But another part of this philosophy is: No system call (in our case, translate that to 'Messenger API call' or maybe 'Messenger function') ever sets errno to zero. Yet, I am seeing it get set to zero, sometimes. ( The only fn I have actually observed doing this so far is pn_output_write_amqp(), but there may be others. ) Is there a reason why we depart from the standard errno philosophy here? Or is it just an oversight? If it's an oversight, I'd like to put something like assert(code); as the first line in pn_error_set(). If it's not an oversight, I'd like to know the reasoning so I can document it.
question about proton error philosophy
I was expecting errno inside the messenger to be reset to 0 at the end of any successful API call. It isn't: instead it looks like the idea is that errno preserves the most recent error that happened, regardless of how long ago that might be. Is this intentional? I am having a hard time understanding why we would not want errno to always represent the messenger state as of the completion of the most recent API call. I would be happy to submit a patch to make it work this way, and see what people think - but not if I am merely exhibiting my own philosophical ignorance here.
Re: question about proton error philosophy
No, you're right. errno is never set to zero by any system call or library function. ( That's from the Linux doco. ) OK, I was just philosophically challenged. I think what confused me was the line in the current Proton C doc (about errno) that says "an error code or zero if there is no error". I'll just remove that line. OK, I withdraw the question. ( I still don't like this philosophy, but the whole world is using it, and the whole world is bigger than I am... ) - Original Message - Do other APIs reset the errno? I could have sworn they didn't. On Mon, Sep 16, 2013 at 12:01 PM, Michael Goulish mgoul...@redhat.com wrote: I was expecting errno inside the messenger to be reset to 0 at the end of any successful API call. It isn't: instead it looks like the idea is that errno preserves the most recent error that happened, regardless of how long ago that might be. Is this intentional? I am having a hard time understanding why we would not want errno to always represent the messenger state as of the completion of the most recent API call. I would be happy to submit a patch to make it work this way, and see what people think - but not if I am merely exhibiting my own philosophical ignorance here. -- Hiram Chirino Engineering | Red Hat, Inc. hchir...@redhat.com | fusesource.com | redhat.com skype: hiramchirino | twitter: @hiramchirino blog: Hiram Chirino's Bit Mojo
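The "sticky errno" convention this thread settles on -- the error code is recorded when something fails, is never reset by a successful call, and a guard like assert(code) would catch any code path that tries to set it to zero -- can be sketched as a toy model. Plain Python with hypothetical names, not the Proton API:

```python
class StickyError:
    """Toy model of the sticky-errno convention discussed above:
    a successful call never resets the error code; only an explicit
    clear does."""

    def __init__(self):
        self.errno = 0        # initialized to zero at creation
        self.text = None

    def set_error(self, code, text):
        assert code != 0      # the proposed assert(code) guard
        self.errno = code
        self.text = text

    def successful_call(self):
        # Deliberately does NOT touch self.errno: errors are sticky.
        return "ok"

    def clear(self):
        self.errno = 0
        self.text = None

m = StickyError()
m.set_error(2, "resource busy")
m.successful_call()
print(m.errno)    # still 2: success does not reset the error
```

Under this convention, callers check the return value of each call and consult errno only after a failure, exactly as with the C library's errno.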
[MESSENGER] multilingual docs - reviews welcome
What I'm doing with Messenger Tutorial Docs -- Reviews Welcome == { 1. How it works -- { There will be a custom cmake target, probably called 'docs'. When you 'make' this target, it runs a bunch of simple Messenger tests in all supported languages. The purpose of these tests is not the same as the other tests that already exist; they are not trying to prove correctness of all the Messenger features. Their purpose is to provide the documents with good code snippets -- that do the same thing in each language. If any of these tests fail, we stop right there and the docs do not get made. After the tests run successfully, the code snippets are all extracted. The docs I write are kind of ... templates for docs. Each doc gets expanded into L distinct docs, where L is the number of languages that Messenger supports. So you get the cross-product of docs and languages. Each doc has little markers in it, like pmdocproc 13 that tell the processor which snippet to put there. Each program has little snippet-markers like: /* pmdocproc 13 c */ code ( goes, here ); /* pmdocproc 13 end */ to tell the processor where to get the snippet from, and what language it's in. The markers that are in the code are always comments -- however that language makes its comments -- and must be on a line by themselves. } 2. Why it works that way { We get two cool benefits out of this: * You can see the whole doc tree in your favorite language, and compare languages. * The code snippets will never go out of date. If they quit working, the docs don't get built. } 3. Where I am Now -- { * the python version of the doc-maker is working * I know how to integrate with cmake, and I am doing that now. * I'll be away next week, but I would like to check stuff in shortly after returning. } 4. Feedback I would like - { Anything you are inspired to volunteer, about any aspect of this. I am attaching the Python doc maker below just in case anyone wants to look at it. 
( I was so uncomfortable with Python that I prototyped the project in C, so I don't know how *pythonic* my code is ) Thanks in advance! And please send any comments to the list. } 5. The Python doc processor -- {

#! /usr/bin/python

import os
import subprocess
import shutil

#
# First, run all the tests for each language.
# We will not create any documents if any of
# these tests fail.
#
def run_examples(languages):
    saved_dir = os.getcwd()
    for language in languages:
        test_dir = './doc_examples/' + language
        os.chdir ( test_dir )
        subprocess.check_call ( "./run_all" )
        print "-"
        print "Tests in", test_dir, "were successful."
        print "-"
        os.chdir ( saved_dir )
    print "\n="
    print "All language example tests were successful."
    print "=\n"

#
# Make new output dirs for each language.
# These will hold final docs.
#
def make_output_dirs ( output_dir, html_dir, languages ):
    if os.path.exists ( output_dir ):
        shutil.rmtree ( output_dir )
    os.mkdir ( output_dir )
    if os.path.exists ( html_dir ):
        shutil.rmtree ( html_dir )
    os.mkdir ( html_dir )
    for language in languages:
        os.mkdir ( output_dir + '/' + language )
        os.mkdir ( html_dir + '/' + language )

#
# For each example program name, there
# should be an instance of it in each
# example/language directory.
# I.e., example program foo should exist as
#   doc_examples/c/foo.c
#   doc_examples/rb/foo.rb
#   doc_examples/py/foo.py
#
def make_example_file_names ( example_dir, languages, example_names, example_file_names ):
    for language in languages:
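For what it's worth, the snippet-marker extraction step described in the plan above could look something like the sketch below. This is my own guess at the idea, not the actual doc-maker code: scan each example program for lines containing "pmdocproc N lang" and "pmdocproc N end" markers, and collect the lines in between, keyed by snippet number.

```python
import re

def extract_snippets(lines, language):
    """Sketch of the marker-based snippet extraction described in the
    plan: collect code between 'pmdocproc N <lang>' and
    'pmdocproc N end' comment markers (markers are on lines by
    themselves), keyed by snippet number. Hypothetical helper, not
    the real doc-maker."""
    snippets = {}
    current = None
    for line in lines:
        m = re.search(r'pmdocproc\s+(\d+)\s+(\S+)', line.strip())
        if m and m.group(2) == language:
            current = int(m.group(1))         # start of snippet N
            snippets[current] = []
        elif m and m.group(2) == 'end':
            current = None                    # end of snippet
        elif current is not None:
            snippets[current].append(line)
    return snippets

source = [
    '/* pmdocproc 13 c */',
    'code ( goes, here );',
    '/* pmdocproc 13 end */',
]
print(extract_snippets(source, 'c'))   # {13: ['code ( goes, here );']}
```

Because the markers are ordinary comments in each language, the same extractor works on C, Ruby, and Python sources alike.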
Re: message disposition question
Oh! Oh! Let me try! (see inline) - Original Message - On 04/18/2013 06:21 AM, Rafael Schloming wrote: I spoke a bit too soon in my first reply. The tracking windows are *supposed* to be measured from the point where the tracker is first assigned, so from when you call put or get. This means that it shouldn't matter how many times you call recv or how much credit recv gives out, the only thing that matters is whether you've called get() more than WINDOW times. That should be fine as calling get() is very much in your control. Now the reason I was confused yesterday is that from looking at the code it appears that due to a recent commit, incoming trackers are actually assigned earlier than they should be. This has not been the case for any released code, however, only for a short time quite recently on trunk. --Rafael On Wed, Apr 17, 2013 at 2:26 PM, Rafael Schloming r...@alum.mit.edu wrote: That's a good question and now that you mention it nothing prevents it. That was an intentional choice when the feature was added, and it wasn't a problem at the time because we didn't have recv(-1). This meant that you were always asking for an explicit amount and if you asked for more than your window, you were (hopefully knowingly) asking for trouble. With recv(-1), however, you are no longer explicitly controlling the credit window so this could be a problem. One possibility might be to define incoming/outgoing window sizes of -1 to allow for unlimited sizes. --Rafael On Wed, Apr 17, 2013 at 1:32 PM, Michael Goulish mgoul...@redhat.com wrote: ( a question inspired by a question from a reviewer of one of my docs... ) If you set an incoming message window of a certain size, and if Messenger can receive messages even when you call, e.g., send() - - - what's to stop some messages from falling off the edge of that window, and thus getting accepted-by-default, before your app code ever gets a chance to make a real decision about the message's disposition? 
I'm still not clear on how this answers the original question: If messages can be received in the background when I call send() or other functions, and that can cause messages to fall out of the receive window, then how do I ensure that I get a chance to see and ack/reject every message? I have no control over the background message delivery.

First, just to be clear, it's not in the background in the sense of a separate thread; it's just ... 'unexpectedly'. But the main thing is that it is not a *receive* window. If you set a window size of N, that window exists only relative to the position of the first message for which you create a tracker. Creating a tracker is done not with recv() but with get(), and the window only exists in the get-space, not in the recv-space. So the only way to make a message fall off the window is to:

0. define an incoming window size of N
1. call get()
2. make a tracker ( it will track starting with the most recent message got() )
3. call get() N more times, but don't bother to dispose of the messages one way or the other.
4. You just had 1 message fall off the edge of your window and get accepted by default.
5. If you keep calling get() now, you will have 1 new message fall over the edge for each call.

A related question: how can I apply flow control if I'm getting too many messages? The naive answer is "stop calling recv()", but if messages can also be received when I call send(), then I have no way to limit the messages that pile up -- or worse, that drop off my receive window and into oblivion.

Don't know. Brain tired. I think you can't -- but at least they won't fall out of the window. Cheers, Alan.
[jira] [Created] (PROTON-300) qpidd --help should show sasl config path default
michael goulish created PROTON-300: -- Summary: qpidd --help should show sasl config path default Key: PROTON-300 URL: https://issues.apache.org/jira/browse/PROTON-300 Project: Qpid Proton Issue Type: Bug Reporter: michael goulish Priority: Minor qpidd --help does not show the sasl config path default, which is /etc/sasl2 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PROTON-300) qpidd --help should show sasl config path default
[ https://issues.apache.org/jira/browse/PROTON-300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] michael goulish updated PROTON-300: --- Assignee: michael goulish qpidd --help should show sasl config path default - Key: PROTON-300 URL: https://issues.apache.org/jira/browse/PROTON-300 Project: Qpid Proton Issue Type: Bug Reporter: michael goulish Assignee: michael goulish Priority: Minor qpidd --help does not show the sasl config path default, which is /etc/sasl2
message disposition question
( a question inspired by a question from a reviewer of one of my docs... ) If you set an incoming message window of a certain size, and if Messenger can receive messages even when you call, e.g., send() -- what's to stop some messages from falling off the edge of that window, and thus getting accepted-by-default, before your app code ever gets a chance to make a real decision about the message's disposition ?
[jira] [Created] (PROTON-295) recv(-1) + incoming_window == bad
michael goulish created PROTON-295: -- Summary: recv(-1) + incoming_window == bad Key: PROTON-295 URL: https://issues.apache.org/jira/browse/PROTON-295 Project: Qpid Proton Issue Type: Bug Components: proton-c Affects Versions: 0.4 Reporter: michael goulish Use of recv(-1) could receive enough messages that some would exceed the incoming window size and be automatically accepted -- with app logic never getting a say in the matter.
messenger routing suggestion
The idea I put forward for a change to pn_messenger_route() may have seemed not very well motivated. Here is a more complete example, and a more complete suggestion.

Example
===

All of these nodes are Messenger nodes.

1. Sender

You have a node that's a sender. It is sending only to abstract addresses. It uses its routing table to map all its abstract addresses to the same receiver -- because that receiver is where the system centralizes knowledge about changing network conditions.

Sender's routing table:

  COLOSSUS -- 1.2.3.4:
  GUARDIAN -- 1.2.3.4:
  HAL9000  -- 1.2.3.4:
  SKYNET   -- 1.2.3.4:

2. Router

You have a router node that is listening on 1.2.3.4: . It receives and then forwards messages. It can do this because the messages it receives from Sender still have their untranslated addresses. It has this routing table:

  COLOSSUS -- 5.6.7.8:1234
  COLOSSUS -- 5.6.7.9:1234
  GUARDIAN -- 5.6.7.23:3456
  HAL9000  -- null
  SKYNET   -- 5.6.7.99:6969

3. Adapting to changing conditions

The 'router' node can change its address-translation table based on messages that it receives from other parts of the network. For example, maybe it does load-balancing this way. But in general, it needs to change its translation table at run time because conditions in the network are changing. It is the node that encapsulates knowledge about those changes, so that the rest of our nodes do not need to worry about them. They just send to COLOSSUS or whatever. This implies that we should be able to change routing dynamically.

4. Fanout

Note that there are two translations for COLOSSUS. We send to both of them. ( This is a change. ) This is how we implement fanout with the address translation.

5. Load and Store

Since the translation table can change due to changing network conditions, the Router node should be able to store its table to a file, and load it from that file. That way, the information it has learned during operation is not lost. Or it can use this facility to fork off another copy of itself.

6.
API changes

It seems to me that the address translation functionality is potentially very powerful, with a few teensy changes. Here they are:

/*=
   At send time, the messenger examines its translation table, and sends
   a copy of the message to each matching address. ( this is a change )
   The address stored in the message is not changed. ( this is true now. )
=*/

/*--- Append the given translation to the list. ---*/
pn_messenger_route_add ( pn_messenger_t *messenger, const char *pattern, const char *address );

/*--- If the given pattern already exists in the list, replace its first
      occurrence with this translation. Otherwise add this translation
      to the list. ---*/
pn_messenger_route_replace ( pn_messenger_t *messenger, const char *pattern, const char *address );

/*--- Delete the given translation from the list. ( Else, NOOP. ) ---*/
pn_messenger_route_delete ( pn_messenger_t *messenger, const char *pattern, const char *address );

/*--- Delete from the list all translations with this pattern. ---*/
pn_messenger_route_delete_pattern ( pn_messenger_t *messenger, const char *pattern );

/*--- Clear the translation table. ---*/
pn_messenger_route_clear_table ( pn_messenger_t *messenger );

/*--- Load the table from the given fp. ---*/
pn_messenger_route_load_table ( pn_messenger_t *messenger, FILE *fp );

/*--- Store the table to the given fp. ---*/
pn_messenger_route_store_table ( pn_messenger_t *messenger, FILE *fp );
possible cool change to pn_messenger_route()
While working on docs, I was getting excited about something cool that I could do with pn_messenger_route(). And then I realized that I couldn't. What would you think about allowing replacement of an old route by a new route with the same pattern ? It seems like this would allow some cool, adaptive behavior in Proton networks. I just coded and tested a simple case successfully -- sending 2 messages to an abstract address, and having them go to 2 different receivers because a new route got set in between. Would this behavior cause any problems that would outweigh the coolness? The code seems pretty straightforward:

int pn_messenger_route(pn_messenger_t *messenger, const char *pattern, const char *address)
{
  if (strlen(pattern) > PN_MAX_PATTERN || strlen(address) > PN_MAX_ROUTE) {
    return PN_ERR;
  }

  pn_route_t *new_route = (pn_route_t *) malloc(sizeof(pn_route_t));
  if (!new_route) return PN_ERR;
  strcpy(new_route->pattern, pattern);
  strcpy(new_route->address, address);
  new_route->next = NULL;

  /* The list is empty. */
  if (! messenger->routes) {
    messenger->routes = new_route;
    return 0;
  }

  pn_route_t *old;

  /* The route to be replaced is first on the list. */
  if (! strcmp(messenger->routes->pattern, new_route->pattern)) {
    old = messenger->routes;
    new_route->next = old->next;
    messenger->routes = new_route;
    free((char *) old);
    return 0;
  }

  pn_route_t *route = messenger->routes;

  /* The route to be replaced is somewhere down the list, or not there. */
  while (1) {
    /* No route in the list had the same pattern. */
    if (! route->next) {
      route->next = new_route;
      return 0;
    }

    /* Bingo ! */
    if (! strcmp(route->next->pattern, new_route->pattern)) {
      old = route->next;
      new_route->next = old->next;
      route->next = new_route;
      free((char *) old);
      return 0;
    }

    route = route->next;
  }

  return 0;
}
problem with multiple senders
Is this a bug, or am I Doing Something Wrong ?

Scenario {
  My sender sends a single message, and hopes to see that the receiver has accepted it.
  I launch 3 copies of the sender very close together -- they all talk to the same address.
  My receiver receives in a loop, and accepts every message that it receives.
}

Result {
  Sometimes my receiver gets 1 of the 3 messages. Usually it gets 2. It never gets all 3.
  The 3rd sender hangs in pn_messenger_send(). While the 3rd sender is hanging in send(),
  the receiver is patiently waiting in recv().
}

Sender Code

/* Launch 3 of these from a script like so:
     ./sender
     ./sender
     ./sender
*/

#include <proton/message.h>
#include <proton/messenger.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <unistd.h>

char * status_2_str ( pn_status_t status )
{
  switch ( status )
  {
    case PN_STATUS_UNKNOWN:  return "unknown";   break;
    case PN_STATUS_PENDING:  return "pending";   break;
    case PN_STATUS_ACCEPTED: return "accepted";  break;
    case PN_STATUS_REJECTED: return "rejected";  break;
    default:                 return "bad value"; break;
  }
}

pid_t my_pid = 0;

void check ( char * label, int result )
{
  fprintf ( stderr, "%d %s result: %d\n", my_pid, label, result );
}

int main(int argc, char** argv)
{
  int c;
  char addr [ 1000 ];
  char msgtext [ 100 ];
  pn_message_t   * message;
  pn_messenger_t * messenger;
  pn_data_t      * body;
  pn_tracker_t     tracker;
  pn_status_t      status;
  int result;

  my_pid = getpid();
  sprintf ( addr, "amqp://0.0.0.0:%s", argv[1] );

  message   = pn_message ( );
  messenger = pn_messenger ( NULL );
  pn_messenger_start ( messenger );
  pn_messenger_set_outgoing_window ( messenger, 1 );
  pn_message_set_address ( message, addr );

  body = pn_message_body ( message );
  sprintf ( msgtext, "Message from %d", getpid() );
  pn_data_put_string ( body, pn_bytes ( strlen ( msgtext ), msgtext ));

  pn_messenger_put ( messenger, message );
  tracker = pn_messenger_outgoing_tracker ( messenger );
  pn_messenger_send ( messenger );

  status = pn_messenger_status ( messenger, tracker );
  fprintf ( stderr, "status : %s\n", status_2_str(status) );

  pn_messenger_stop ( messenger );
  pn_messenger_free ( messenger );
  pn_message_free ( message );
  return 0;
}

Receiver Code

/* Launch like this:
     ./receiver
*/

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <proton/message.h>
#include <proton/messenger.h>

#define BUFSIZE 1024

int main(int argc, char** argv)
{
  size_t bufsize = BUFSIZE;
  char buffer [ BUFSIZE ];
  char addr [ 1000 ];
  pn_message_t   * message;
  pn_messenger_t * messenger;
  pn_data_t      * body;
  pn_tracker_t     tracker;

  sprintf ( addr, "amqp://~0.0.0.0:%s", argv[1] );
  message   = pn_message();
  messenger = pn_messenger ( NULL );
  pn_messenger_start(messenger);
  pn_messenger_subscribe ( messenger, addr );
  pn_messenger_set_incoming_window ( messenger, 5 );

  /*-
    Receive and accept the messages.
  -*/
  while ( 1 )
  {
    fprintf ( stderr, "receiving...\n" );
    pn_messenger_recv ( messenger, 3 );
    while ( pn_messenger_incoming ( messenger ) > 0 )
    {
      fprintf ( stderr, "getting message...\n" );
      pn_messenger_get ( messenger, message );
      tracker = pn_messenger_incoming_tracker ( messenger );
      pn_messenger_accept ( messenger, tracker, 0 );
      body = pn_message_body ( message );
      pn_data_format ( body, buffer, &bufsize );
      fprintf ( stdout, "Address: %s\n", pn_message_get_address ( message ) );
      fprintf ( stdout, "Content: %s\n", buffer );
    }
  }

  pn_messenger_stop(messenger);
  pn_messenger_free(messenger);
  return 0;
}
Re: problem with multiple senders
] - CLOSE @24 [null] [0x24fae10:0] - EOS Closed localhost:42468 [0x2538e40:1] - DETACH @22 [1, true, null] [0x2538e40:0] - CLOSE @24 [null] [0x2538e40:0] - EOS [0x2538e40:1] - DETACH @22 [1, true, null] [0x2538e40:0] - CLOSE @24 [null] [0x2538e40:0] - EOS Closed localhost:42469 - end trace -

- Original Message -
Any clues from a trace of the receiver? $ PN_TRACE_FRM=1 ./receiver -Ted

On 04/04/2013 02:09 PM, Michael Goulish wrote: Is this a bug, or am I Doing Something Wrong ? Scenario { My sender sends a single message, and hopes to see that the receiver has accepted it. I launch 3 copies of the sender very close together -- they all talk to the same address. My receiver receives in a loop, and accepts every message that it receives. } Result { Sometimes my receiver gets 1 of the 3 messages. Usually it gets 2. It never gets all 3. The 3rd sender hangs in pn_messenger_send(). While the 3rd sender is hanging in send(), the receiver is patiently waiting in recv(). } [ ... sender and receiver code snipped; quoted in full earlier in the thread ... ]
Re: problem with multiple senders
Yes! -1 did it. Thanks! - Original Message - I think this is the same bug we've seen before with passing fixed (positive) credit limits to recv. The implementation isn't smart enough to pay attention to who actually is offering messages when it allocates credit, and so it ends up giving out all of its credit to a sender that has no use for it instead of to the senders that are blocked. I suspect if you replace your 3 with -1 in your call to pn_messenger_recv, then you will see the hang go away. --Rafael On Thu, Apr 4, 2013 at 3:06 PM, Michael Goulish mgoul...@redhat.com wrote: OK, I'm looking at trace from receiver, and I thought I would post it here so I can't be accused of hogging all the fun for myself. ( Remember, three senders all send to same receiver address, only two get 'accepted' replies. Last sender ends up hanging in send(), while receiver (in infinite loop) blocks on recv(). ) I have marked the lines of application output with APPLICATION OUTPUT: Note: I see these 3 lines: Accepted from localhost:42468 Accepted from localhost:42469 Accepted from localhost:42470 But only two get closed: Closed localhost:42468 Closed localhost:42469 - begin trace --- Listening on 0.0.0.0: APPLICATION OUTPUT: receiving... 
Accepted from localhost:42468 Accepted from localhost:42469 - SASL [0x25013c0:0] - SASL-INIT @65 [:ANONYMOUS, b] [0x25013c0:0] - SASL-MECHANISMS @64 [@PN_SYMBOL[:ANONYMOUS]] [0x25013c0:0] - SASL-OUTCOME @68 [0] - SASL - AMQP [0x24fae10:0] - OPEN @16 [a03b1f27-5053-47f0-ae85-c543782480b5, null, null, null, null, null, null, null, null] Accepted from localhost:42470 - SASL [0x253f490:0] - SASL-INIT @65 [:ANONYMOUS, b] [0x253f490:0] - SASL-MECHANISMS @64 [@PN_SYMBOL[:ANONYMOUS]] [0x253f490:0] - SASL-OUTCOME @68 [0] - SASL - AMQP [0x2538e40:0] - OPEN @16 [a03b1f27-5053-47f0-ae85-c543782480b5, null, null, null, null, null, null, null, null] - AMQP [0x24fae10:0] - OPEN @16 [1425753e-bda0-48af-a60f-b8a23c0933d3, 0.0.0.0, null, null, null, null, null, null, null] [0x24fae10:1] - BEGIN @17 [null, 0, 1024, 1024] [0x24fae10:1] - ATTACH @18 [sender-xxx, 1, false, null, null, @40 [null, 0, null, 0, false, null, null, null, null, null, null], @41 [null, 0, null, 0, false, null, null], null, null, 0] [0x24fae10:1] - BEGIN @17 [1, 0, 1024, 1024] [0x24fae10:1] - ATTACH @18 [sender-xxx, 1, true, null, null, null, null, null, null, 0] [0x24fae10:1] - FLOW @19 [0, 1024, 0, 1024, 1, 0, 3, null, false] - SASL [0x2563350:0] - SASL-INIT @65 [:ANONYMOUS, b] [0x2563350:0] - SASL-MECHANISMS @64 [@PN_SYMBOL[:ANONYMOUS]] [0x2563350:0] - SASL-OUTCOME @68 [0] - SASL - AMQP [0x255cd00:0] - OPEN @16 [a03b1f27-5053-47f0-ae85-c543782480b5, null, null, null, null, null, null, null, null] - AMQP [0x2538e40:0] - OPEN @16 [35806640-4a26-47a2-a6e2-7fe7505938cf, 0.0.0.0, null, null, null, null, null, null, null] [0x2538e40:1] - BEGIN @17 [null, 0, 1024, 1024] [0x2538e40:1] - ATTACH @18 [sender-xxx, 1, false, null, null, @40 [null, 0, null, 0, false, null, null, null, null, null, null], @41 [null, 0, null, 0, false, null, null], null, null, 0] [0x2538e40:1] - BEGIN @17 [1, 0, 1024, 1024] [0x2538e40:1] - ATTACH @18 [sender-xxx, 1, true, null, null, null, null, null, null, 0] [0x2538e40:1] - FLOW @19 [0, 
1024, 0, 1024, 1, 0, 0, null, false] - AMQP [0x255cd00:0] - OPEN @16 [c8b87edf-6971-4d73-9790-e6f44772cebb, 0.0.0.0, null, null, null, null, null, null, null] [0x255cd00:1] - BEGIN @17 [null, 0, 1024, 1024] [0x255cd00:1] - ATTACH @18 [sender-xxx, 1, false, null, null, @40 [null, 0, null, 0, false, null, null, null, null, null, null], @41 [null, 0, null, 0, false, null, null], null, null, 0] [0x255cd00:1] - BEGIN @17 [1, 0, 1024, 1024] [0x255cd00:1] - ATTACH @18 [sender-xxx, 1, true, null, null, null, null, null, null, 0] [0x255cd00:1] - FLOW @19 [0, 1024, 0, 1024, 1, 0, 0, null, false] [0x24fae10:1] - TRANSFER @20 [1, 0, b\x00\x00\x00\x00\x00\x00\x00\x00, 0, false, false] (148) \x00Sp\xd0\x00\x00\x00\x0b\x00\x00\x00\x05BP\x04@BR \x00\x00Ss\xd0\x00\x00\x00b\x00\x00\x00\x0d@@\xa1\x13amqp://0.0.0.0: @\xa1+amqp://1425753e-bda0-48af-a60f-b8a23c0933d3@ @@\x83\x00\x00\x00\x00\x00\x00\x00\x00\x83\x00\x00\x00\x00\x00\x00\x00\x00@R \x00@\x00Sw\xa1\x12Message from 22470 APPLICATION OUTPUT: getting message... APPLICATION OUTPUT: Address: amqp://0.0.0.0: APPLICATION OUTPUT: Content: Message from 22470 APPLICATION OUTPUT: receiving... [0x24fae10:1] - DISPOSITION @21 [true, 0, 0, false, @36 []] [0x2538e40:1] - FLOW @19 [0, 1024, 0, 1024, 1, 0, 1, null, false] [0x24fae10:1] - DISPOSITION @21 [false, 0, 0, false, @36 []] [0x2538e40:1] - TRANSFER @20 [1, 0, b\x00\x00\x00\x00\x00\x00\x00\x00, 0, false