[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070257#comment-14070257 ] ASF subversion and git services commented on PROTON-625: Commit 1612569 from [~mgoulish] in branch 'proton/trunk' [ https://svn.apache.org/r1612569 ] PROTON-625 In edge-case where map->load_factor exactly equals load, it was possible for pni_map_ensure() and pni_map_entry() to get into a semi-infinite loop. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.8 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_consume () from > /usr/lib64/libqpid-proton.so.2 > #93569 0x0039c6c304cd in pn_transport_process () from > /usr/lib64/libqpid-proton.so.2 > #93570 0x0039c6c3e40c in pn_connector_process () from > /usr/lib64/libqpid-proton.so.2 > #93571 0x0
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051836#comment-14051836 ] Rafael H. Schloming commented on PROTON-625: Nice catch. FYI the load of the map is the ratio of the size of the map to the number of hash buckets in the map. If this gets too large then more and more collisions occur and get/put operations take more time. Of course keeping this small requires having more buckets and so is less space efficient. The load_factor controls the time/space tradeoff. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.8 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_consume () from > /usr/lib64/libqpid-proton.so.2 > #93569 0x0039c6c304cd in pn_transport_process () from > /usr/lib64/libqpid-proton.so.2 > #93570 0x0039c6c3e40
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051826#comment-14051826 ] michael goulish commented on PROTON-625: Here's what happens, and a fix. 1. pni_map_entry() calls pni_map_ensure() to make sure map has enough capacity. 2. The capacity-increasing loop in pni_map_ensure() has two conditions on it: increase the capacity if map->capacity is too small, or if map 'load' is greater than map->load_factor. ( Map load is ... meaning not obvious to me. ) 3. If pni_map_ensure() returns true, then pni_map_entry() will call itself recursively, and keep doing that until pni_map_ensure() returns false. 'False' means 'I made no change.' 4. But it is possible for pni_map_ensure() to make no change, and yet return true. Here is how it happened in my most recent test: map->capacity 512 capacity 331 pni_map_load(map) 0.75 map->load_factor 0.75 5. Those values made *both* conditions on the capacity- increasing loop in pni_map_ensure() false. So it didn't do anything to change the map. But it returned true. So pni_map_entry() called itself. But nothing had changed. And away we go. FIX Make the test on the if at the top of pni_map_ensure say this: if (capacity <= map->capacity && load <= map->load_factor) { ( Added '=' to the load test. ) After that, I ran twenty tests with no failure. Previously, failure probability on my system was 0.3.So odds of 20 in a row happening by chance is a little less than 1 in 1000. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.8 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14051007#comment-14051007 ] michael goulish commented on PROTON-625: Here is a hack that fixes it. A little new code in pni_map_ensure(). Tested this on latest protonics, version 1607485. Without hack: 3 failures out of 10 tests. (similar to what I have been seeing on other versions.) With hack: 0 failures out of 13 tests. ( probability this happened by chance: less that 1% ) So, now I'm trying to see how it should *really* be fixed... --- code --- code --- code --- code --- code --- code --- code --- code --- code --- // This loop is what is already there, in pni_map_ensure. No change. while (map->capacity < capacity || pni_map_load(map) > map->load_factor) { map->capacity *= 2; map->addressable = (size_t) (0.86 * map->capacity); } /*--- If ever we get past the above while-loop without actually having changed map->cap, we are doomed to eternal torment. So, force it. ---*/ if ( oldcap == map->capacity ) { fprintf ( stderr, "Fiery the angels fell; deep thunder rolled around their shores, burning with the fires of Orc!\n" ); map->capacity *= 2; map->addressable = (size_t) (0.86 * map->capacity); } > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.8 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a6
Re: [jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
That's a good point. I suspect something (extra) funny is going on here. I really want to see line numbers against that stack trace. I'm not sure I believe it's a real trace, there may be some sort of stack corruption happening here. On Wed, Jul 2, 2014 at 10:54 AM, Cliff Jansen wrote: > As far as I can tell, pni_map_entry() only allocates space for a > single additional entry at a time, and hence should never recurse more > than once. I.e. pni_map_ensure should work the first time and prevent > further recursion. > > On Wed, Jul 2, 2014 at 7:32 AM, Alan Conway wrote: > > On Tue, 2014-07-01 at 07:15 -0400, Michael Goulish wrote: > >> Yes! > >> Great idea -- > >> I will attempt. > > > > I would put #ifndef NDEBUG around this code. We will never test it but > > someday on a vital production server at our biggest customer, somebody > > will use a map with 33 levels of nesting. I can guarantee it. > > > > It would be even better to do proper loop detection , i.e. check if > > you've seen the same map before. That would affect performance but I > > think the effect would be negligible for maps with a "normal" amount of > > nesting. > > > >> > >> > >> > >> - Original Message - > >> > >> [ > https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048701#comment-14048701 > ] > >> > >> Rafael H. Schloming commented on PROTON-625: > >> > >> > >> I think the easiest way to track down this bug would be to put some > sort of detection inside of pni_map_entry and if it recurses more than some > limit, e.g. 32 times or something, then print out a representation of the > maps internal structure. It might also help to use a debug build so you > have line numbers. Is that something you feel comfortable trying? You > should be able to find the relevant code around line 551 of object.c. > >> > >> > Biggest Backtrace Ever! > >> > --- > >> > > >> > Key: PROTON-625 > >> > URL: https://issues.apache.org/jira/browse/PROTON-625 > >> > Project: Qpid Proton > >> > Issue Type: Bug > >> > Components: proton-c > >> >Affects Versions: 0.7 > >> >Reporter: michael goulish > >> > > >> > I am saving all my stuff so I can repro on demand. > >> > It doesn't happen every time, but it's about 50%. > >> > -- > >> > On one box, I have a dispatch router. > >> > On the other box, I have 10 clients: 5 Messenger-based receivers, and > 5 qpid-messaging-based senders. > >> > Each client will handle 100 addresses, of the form "mick/0" ... > "mick/1" ... & c. > >> > 100 messages will be sent to each address. > >> > I start the 5 receivers first. They start OK. Dispatch router happy > & stable. > >> > Wait a few seconds. > >> > I start the 5 senders, from a bash script. > >> > The first sender is already sending when the 2nd, 3rd, 4th start. > >> > After a few of them start,but before all have finished starting, a > few seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > >> > The crash occurs in the dispatch router. > >> > Here is the biggest backtrace ever: > >> > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, > bytes=16384) at malloc.c:4383 > >> > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > >> > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > >> > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > >> > #4 0x0039c6c16c45 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #5 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #6 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #7 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #8 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #9 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #10 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #11 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #12 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #13 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #14 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > . > >> > . > >> > . > >> > . > >> > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > >> > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/lib
Re: [jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
As far as I can tell, pni_map_entry() only allocates space for a single additional entry at a time, and hence should never recurse more than once. I.e. pni_map_ensure should work the first time and prevent further recursion. On Wed, Jul 2, 2014 at 7:32 AM, Alan Conway wrote: > On Tue, 2014-07-01 at 07:15 -0400, Michael Goulish wrote: >> Yes! >> Great idea -- >> I will attempt. > > I would put #ifndef NDEBUG around this code. We will never test it but > someday on a vital production server at our biggest customer, somebody > will use a map with 33 levels of nesting. I can guarantee it. > > It would be even better to do proper loop detection , i.e. check if > you've seen the same map before. That would affect performance but I > think the effect would be negligible for maps with a "normal" amount of > nesting. > >> >> >> >> - Original Message - >> >> [ >> https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048701#comment-14048701 >> ] >> >> Rafael H. Schloming commented on PROTON-625: >> >> >> I think the easiest way to track down this bug would be to put some sort of >> detection inside of pni_map_entry and if it recurses more than some limit, >> e.g. 32 times or something, then print out a representation of the maps >> internal structure. It might also help to use a debug build so you have line >> numbers. Is that something you feel comfortable trying? You should be able >> to find the relevant code around line 551 of object.c. >> >> > Biggest Backtrace Ever! >> > --- >> > >> > Key: PROTON-625 >> > URL: https://issues.apache.org/jira/browse/PROTON-625 >> > Project: Qpid Proton >> > Issue Type: Bug >> > Components: proton-c >> >Affects Versions: 0.7 >> >Reporter: michael goulish >> > >> > I am saving all my stuff so I can repro on demand. >> > It doesn't happen every time, but it's about 50%. >> > -- >> > On one box, I have a dispatch router. >> > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 >> > qpid-messaging-based senders. >> > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" >> > ... & c. >> > 100 messages will be sent to each address. >> > I start the 5 receivers first. They start OK. Dispatch router happy & >> > stable. >> > Wait a few seconds. >> > I start the 5 senders, from a bash script. >> > The first sender is already sending when the 2nd, 3rd, 4th start. >> > After a few of them start,but before all have finished starting, a few >> > seconds into the script, the crash occurs. ( If they all start up >> > successfully, no crash. ) >> > The crash occurs in the dispatch router. >> > Here is the biggest backtrace ever: >> > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at >> > malloc.c:4383 >> > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 >> > #2 0x0039c6c1650a in pni_map_allocate () from >> > /usr/lib64/libqpid-proton.so.2 >> > #3 0x0039c6c16a3a in pni_map_ensure () from >> > /usr/lib64/libqpid-proton.so.2 >> > #4 0x0039c6c16c45 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #5 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #6 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #7 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #8 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #9 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #10 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #11 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #12 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #13 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #14 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > . >> > . >> > . >> > . >> > #93549 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #93550 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #93551 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #93552 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #93553 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #93554 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #93555 0x0039c6c16c64 in pni_map_entry () from >> > /usr/lib64/libqpid-proton.so.2 >> > #93556 0x0039c6c16c64 in pni_map_entry
Re: [jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
On Tue, 2014-07-01 at 07:15 -0400, Michael Goulish wrote: > Yes! > Great idea -- > I will attempt. I would put #ifndef NDEBUG around this code. We will never test it but someday on a vital production server at our biggest customer, somebody will use a map with 33 levels of nesting. I can guarantee it. It would be even better to do proper loop detection , i.e. check if you've seen the same map before. That would affect performance but I think the effect would be negligible for maps with a "normal" amount of nesting. > > > > - Original Message - > > [ > https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048701#comment-14048701 > ] > > Rafael H. Schloming commented on PROTON-625: > > > I think the easiest way to track down this bug would be to put some sort of > detection inside of pni_map_entry and if it recurses more than some limit, > e.g. 32 times or something, then print out a representation of the maps > internal structure. It might also help to use a debug build so you have line > numbers. Is that something you feel comfortable trying? You should be able to > find the relevant code around line 551 of object.c. > > > Biggest Backtrace Ever! > > --- > > > > Key: PROTON-625 > > URL: https://issues.apache.org/jira/browse/PROTON-625 > > Project: Qpid Proton > > Issue Type: Bug > > Components: proton-c > >Affects Versions: 0.7 > >Reporter: michael goulish > > > > I am saving all my stuff so I can repro on demand. > > It doesn't happen every time, but it's about 50%. > > -- > > On one box, I have a dispatch router. > > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > > qpid-messaging-based senders. > > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" > > ... & c. > > 100 messages will be sent to each address. > > I start the 5 receivers first. They start OK. Dispatch router happy & > > stable. > > Wait a few seconds. > > I start the 5 senders, from a bash script. > > The first sender is already sending when the 2nd, 3rd, 4th start. > > After a few of them start,but before all have finished starting, a few > > seconds into the script, the crash occurs. ( If they all start up > > successfully, no crash. ) > > The crash occurs in the dispatch router. > > Here is the biggest backtrace ever: > > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > > malloc.c:4383 > > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > > #2 0x0039c6c1650a in pni_map_allocate () from > > /usr/lib64/libqpid-proton.so.2 > > #3 0x0039c6c16a3a in pni_map_ensure () from > > /usr/lib64/libqpid-proton.so.2 > > #4 0x0039c6c16c45 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #5 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #6 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #7 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #8 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #9 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #10 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #11 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #12 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #13 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #14 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > . > > . > > . > > . > > #93549 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93550 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93551 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93552 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93553 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93554 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93555 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93556 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93557 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93558 0x0039c6c16c64 in pni_map_entry () from > > /usr/lib64/libqpid-proton.so.2 > > #93559 0x0039c6c16dc0 in pn_map_put () from > > /usr/lib64/libqpid-proton.so.2 > > #93560 0x0039c6c17226 in pn_hash_put () from > > /usr/lib64/libqpi
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14049897#comment-14049897 ] michael goulish commented on PROTON-625: I had some confusion about what libraries were being picked up. Sorry! This bug is *not* present on 0.7 ! I was able to run 0.7-based dispatch-router 10 times with no failure. Then, switching to latest proton trunk code as of today -- 2 out of first 3 tests resulted in this failure. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.8 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_consume () from > /usr/lib64/libqpid-proton.so.2 > #93569 0x0039c6c304cd in pn_transport_process () from > /usr/lib64/libqpid-proton.so.2 > #93570 0x0039c6c3e40c in pn_connector_process () from > /usr/lib64/libqpid-proton.so.2 > #935
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048826#comment-14048826 ] Rafael H. Schloming commented on PROTON-625: Judging from the impact of adding the sleep in there, it could also be possible that the table is getting corrupted from (unprotected) concurrent accesses. Logging the sequence of put/del operations might provide some additional insight. If this is in fact a logic bug then you should be able to come up with a deterministic sequence of put/del operations that trigger the same bug. If it is corruption due to concurrent access then hopefully logging the current thread of the put/del operation will make that clear. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.7 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_con
Re: [jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
Yes! Great idea -- I will attempt. - Original Message - [ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048701#comment-14048701 ] Rafael H. Schloming commented on PROTON-625: I think the easiest way to track down this bug would be to put some sort of detection inside of pni_map_entry and if it recurses more than some limit, e.g. 32 times or something, then print out a representation of the maps internal structure. It might also help to use a debug build so you have line numbers. Is that something you feel comfortable trying? You should be able to find the relevant code around line 551 of object.c. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.7 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_consume () from > /usr/
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048701#comment-14048701 ] Rafael H. Schloming commented on PROTON-625: I think the easiest way to track down this bug would be to put some sort of detection inside of pni_map_entry and if it recurses more than some limit, e.g. 32 times or something, then print out a representation of the maps internal structure. It might also help to use a debug build so you have line numbers. Is that something you feel comfortable trying? You should be able to find the relevant code around line 551 of object.c. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.7 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_consume () from > /usr/lib64/libqpid-proton.so.2 > #93569 0x0039c6c304cd in pn_transpor
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048573#comment-14048573 ] michael goulish commented on PROTON-625: When I put usleep(1000) after each message sent, I have zero failures in 10 tries. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.7 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_consume () from > /usr/lib64/libqpid-proton.so.2 > #93569 0x0039c6c304cd in pn_transport_process () from > /usr/lib64/libqpid-proton.so.2 > #93570 0x0039c6c3e40c in pn_connector_process () from > /usr/lib64/libqpid-proton.so.2 > #93571 0x7f1060c60460 in process_connector () from > /home/mick/dispatch/build/libqpid-dispatch.so.0 > #93572 0x7f1060c61017 in thread_run () from > /home/mick/dispatch/build/libqpid-dispatch.so.0 > #9
[jira] [Commented] (PROTON-625) Biggest Backtrace Ever!
[ https://issues.apache.org/jira/browse/PROTON-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048569#comment-14048569 ] michael goulish commented on PROTON-625: BTW -- I kill and restart the router after each test. > Biggest Backtrace Ever! > --- > > Key: PROTON-625 > URL: https://issues.apache.org/jira/browse/PROTON-625 > Project: Qpid Proton > Issue Type: Bug > Components: proton-c >Affects Versions: 0.7 >Reporter: michael goulish > > I am saving all my stuff so I can repro on demand. > It doesn't happen every time, but it's about 50%. > -- > On one box, I have a dispatch router. > On the other box, I have 10 clients: 5 Messenger-based receivers, and 5 > qpid-messaging-based senders. > Each client will handle 100 addresses, of the form "mick/0" ... "mick/1" ... > & c. > 100 messages will be sent to each address. > I start the 5 receivers first. They start OK. Dispatch router happy & > stable. > Wait a few seconds. > I start the 5 senders, from a bash script. > The first sender is already sending when the 2nd, 3rd, 4th start. > After a few of them start,but before all have finished starting, a few > seconds into the script, the crash occurs. ( If they all start up > successfully, no crash. ) > The crash occurs in the dispatch router. > Here is the biggest backtrace ever: > #0 0x003cf9879ad1 in _int_malloc (av=0x7f101c20, bytes=16384) at > malloc.c:4383 > #1 0x003cf987a911 in __libc_malloc (bytes=16384) at malloc.c:3664 > #2 0x0039c6c1650a in pni_map_allocate () from > /usr/lib64/libqpid-proton.so.2 > #3 0x0039c6c16a3a in pni_map_ensure () from > /usr/lib64/libqpid-proton.so.2 > #4 0x0039c6c16c45 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #5 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #6 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #7 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #8 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #9 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #10 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #11 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #12 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #13 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > #14 0x0039c6c16c64 in pni_map_entry () from /usr/lib64/libqpid-proton.so.2 > . > . > . > . > #93549 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93550 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93551 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93552 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93553 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93554 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93555 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93556 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93557 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93558 0x0039c6c16c64 in pni_map_entry () from > /usr/lib64/libqpid-proton.so.2 > #93559 0x0039c6c16dc0 in pn_map_put () from /usr/lib64/libqpid-proton.so.2 > #93560 0x0039c6c17226 in pn_hash_put () from > /usr/lib64/libqpid-proton.so.2 > #93561 0x0039c6c2a643 in pn_delivery_map_push () from > /usr/lib64/libqpid-proton.so.2 > #93562 0x0039c6c2c44b in pn_do_transfer () from > /usr/lib64/libqpid-proton.so.2 > #93563 0x0039c6c24385 in pn_dispatch_frame () from > /usr/lib64/libqpid-proton.so.2 > #93564 0x0039c6c2448f in pn_dispatcher_input () from > /usr/lib64/libqpid-proton.so.2 > #93565 0x0039c6c2d68b in pn_input_read_amqp () from > /usr/lib64/libqpid-proton.so.2 > #93566 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93567 0x0039c6c3011a in pn_io_layer_input_passthru () from > /usr/lib64/libqpid-proton.so.2 > #93568 0x0039c6c2d275 in transport_consume () from > /usr/lib64/libqpid-proton.so.2 > #93569 0x0039c6c304cd in pn_transport_process () from > /usr/lib64/libqpid-proton.so.2 > #93570 0x0039c6c3e40c in pn_connector_process () from > /usr/lib64/libqpid-proton.so.2 > #93571 0x7f1060c60460 in process_connector () from > /home/mick/dispatch/build/libqpid-dispatch.so.0 > #93572 0x7f1060c61017 in thread_run () from > /home/mick/dispatch/build/libqpid-dispatch.so.0 > #93573 0x003cf9c07851 in sta