[jira] [Commented] (PROTON-829) Possible reference counting bug in pn_clear_tpwork

2015-02-26 Thread Rafael H. Schloming (JIRA)

[ 
https://issues.apache.org/jira/browse/PROTON-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338658#comment-14338658
 ] 

Rafael H. Schloming commented on PROTON-829:


What really motivates the current scheme is not the pooling, but the need to 
avoid circular references. The object graph simply has circular references and 
there is no reasonable way to avoid this in the current design, yet all these 
objects need to be wrappable into python, which means you need to be able to 
present an incref/decref style memory management interface. As far as I can 
tell, that pretty much demands that you have a hybrid refcount/incremental gc 
scheme of some sort. I fully agree that pn_incref(...)/pn_decref(...) is an 
incredibly obscure way to trigger an incremental garbage collect though and the 
code needs to be better factored to make the design more transparent.

 Possible reference counting bug in pn_clear_tpwork
 --

 Key: PROTON-829
 URL: https://issues.apache.org/jira/browse/PROTON-829
 Project: Qpid Proton
  Issue Type: Bug
  Components: proton-c
Affects Versions: 0.8
Reporter: Alan Conway
Assignee: Alan Conway
 Fix For: 0.9


 See QPID-6415 which describes a core dump in the qpid tests that appears when 
 using the current 0.9 proton master. The qpid tests pass OK with proton 0.8.
 The valgrind output in QPID-6415 shows that a connection is deleted while it 
 is being finalized by  a call from pn_connection_unbound to pn_clear_tpwork.
 I do not yet understand the details, but removing the following strange code 
 fixes the problem and passes the proton test suite without valgrind errors:
 {noformat}
 --- a/proton-c/src/engine/engine.c
 +++ b/proton-c/src/engine/engine.c
 @@ -690,10 +690,10 @@ void pn_clear_tpwork(pn_delivery_t *delivery)
{
  LL_REMOVE(connection, tpwork, delivery);
  delivery-tpwork = false;
 -if (pn_refcount(delivery)  0) {
 -  pn_incref(delivery);
 -  pn_decref(delivery);
 -}
}
  }
 {noformat}
 The code is strange because
 a) you should never examine a refcount except for debugging purposes
 b) under normal refcounting semantics incref+decref is a no-op.
 Is removing this code OK?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PROTON-829) Possible reference counting bug in pn_clear_tpwork

2015-02-26 Thread Alan Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/PROTON-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338534#comment-14338534
 ] 

Alan Conway commented on PROTON-829:


Thanks for the info, I agree that it is not clear yet whether the bug is in 
proton or qpid, I'll keep both this and QPID-6415 open till it is.

My feeling is that this is an abuse of a simple and widely-known memory 
management pattern that will lead people into errors (e.g. how would anybody 
coming to the code know when they need to do the apparent incref/decref no-op 
and when they don't?) My feeling is that memory pooling and re-use should 
happen at a lower level and not be mixed in with object finalizers (e.g. like 
qpid-dispatchs allocators)  but I don't understand the context very well yet so 
I'll say no more.

 Possible reference counting bug in pn_clear_tpwork
 --

 Key: PROTON-829
 URL: https://issues.apache.org/jira/browse/PROTON-829
 Project: Qpid Proton
  Issue Type: Bug
  Components: proton-c
Affects Versions: 0.8
Reporter: Alan Conway
Assignee: Alan Conway
 Fix For: 0.9


 See QPID-6415 which describes a core dump in the qpid tests that appears when 
 using the current 0.9 proton master. The qpid tests pass OK with proton 0.8.
 The valgrind output in QPID-6415 shows that a connection is deleted while it 
 is being finalized by  a call from pn_connection_unbound to pn_clear_tpwork.
 I do not yet understand the details, but removing the following strange code 
 fixes the problem and passes the proton test suite without valgrind errors:
 {noformat}
 --- a/proton-c/src/engine/engine.c
 +++ b/proton-c/src/engine/engine.c
 @@ -690,10 +690,10 @@ void pn_clear_tpwork(pn_delivery_t *delivery)
{
  LL_REMOVE(connection, tpwork, delivery);
  delivery-tpwork = false;
 -if (pn_refcount(delivery)  0) {
 -  pn_incref(delivery);
 -  pn_decref(delivery);
 -}
}
  }
 {noformat}
 The code is strange because
 a) you should never examine a refcount except for debugging purposes
 b) under normal refcounting semantics incref+decref is a no-op.
 Is removing this code OK?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PROTON-829) Possible reference counting bug in pn_clear_tpwork

2015-02-26 Thread Rafael H. Schloming (JIRA)

[ 
https://issues.apache.org/jira/browse/PROTON-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338377#comment-14338377
 ] 

Rafael H. Schloming commented on PROTON-829:


I added that code to get the python tests to pass without valgrind errors, so 
I'm surprised that removing it doesn't cause issues.

I'll try to explain a little bit about what's going on. (I've commented this 
elsewhere in the code, but I missed this spot. I think andrew pointed out that 
this might be clearer with a helper function instead of/in addition to a 
comment.)

There are definitely non-debug cases where it is useful to examine the 
refcount. In particular, a finalizer gets fired when the refcount hits zero, 
but the finalizer can sometimes run code that creates a new reference to the 
object that is being finalized. There are a few places where this pattern is 
very useful:

  - The finalizer can run some logic that decides whether or not the object 
should be pooled and if so sticking it in the pool adds a reference.
  - The finalizer might run some code that posts an event, thereby creating a 
reference from the collector to the object.
  - The connection/session/link/delivery/transport objects all have pointers to 
each other, and to just naively refcount all those pointers would result in 
circular references and memory leaks. In order to avoid circular references, we 
don't refcount internal pointers, e.g. the parent-child pointers, instead the 
finalizers of all the children have logic that runs and figures out if it is 
safe for the child to go away (basically does a tiny amount of incremental 
garbage collection). If the finalizer's garbage collection finds an internal 
reference, the parent of that object takes over the last external reference. In 
the latter case an external pointer to a child may be created again, and to 
account for this, the incref of all the child objects is overriden to remove 
the reference that the parent took over.

In all these cases, the logic that may/may not create a new reference is run at 
the beginning of the finalizer. The finalizer then needs to check if a new 
reference has actually been created in order to figure out whether or not to 
actually free the resources used by the object or to return. This pattern could 
perhaps be formalized by distinguishing between finalizers (the part that runs 
application logic which can create new references) and destructors (the part 
that actually frees anything malloced in the initializer) in the object code.

This should also explain why incref/decref is not necessarily a noop, it is 
basically retriggering the gc portion of the code in the finalizer that 
decides whether or not an object should really go away or not, and in the rare 
case that you toggle some state that might influence that gc logic, you need to 
get it to run again. Clearing the tp work queue is one such case. The reason it 
is guarded by a check on the refcount is that we also call that function from 
inside the delivery finalizer, and in that case if the refcount is zero we've 
already decided that the object should go away.

Again, this gc part of the pattern could probably be formalized a bit, e.g. 
possibly having some sort of mark/sweep style thing that is in fact triggered 
by external references reaching zero, but probably not a good idea to attempt 
anytime soon since what complicates this picture is not so much the fact that 
it isn't formalized, but rather that the engine uses ad-hoc data structures 
that predate the development of the collections that are part of the object 
library and therefore has to manually account and search for references in a 
type specific way rather than using the automatic ref counting and/or search 
routines available in the collections. This manual/type specific accounting is 
what makes the gc logic a bit brittle, and could possibly be the source of 
the bug you are seeing. At some point relatively soon (probably after 0.9 but 
before the next release) I expect to have to update a significant chunk of 
those data structures to properly use the collections, but until then if you 
manage to pare this down to a minimal reproducer in terms of proton API calls, 
that would definitely suggest the bug is in proton and I can probably help 
debug more from there. As it is, it looks like it could possibly just be a 
plain old double free style bug in messaging, and removing the above code just 
masks it by introducing a memory leak.

 Possible reference counting bug in pn_clear_tpwork
 --

 Key: PROTON-829
 URL: https://issues.apache.org/jira/browse/PROTON-829
 Project: Qpid Proton
  Issue Type: Bug
  Components: proton-c
Affects Versions: 0.8
Reporter: Alan Conway
Assignee: Alan Conway
 Fix For: 0.9


 

[jira] [Commented] (PROTON-829) Possible reference counting bug in pn_clear_tpwork

2015-02-26 Thread Alan Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/PROTON-829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14338757#comment-14338757
 ] 

Alan Conway commented on PROTON-829:


Found the fix to the qpid problem as follows, I think there is still a (minor) 
issue for proton. 

The fix is to call pn_transport_free *after* pn_connection_free. The other way 
round works fine with proton 0.7 or 0.8 but causes memory errors and crashes 
proton 0.9.  
The API doc for transport_free says:
{code}
/**
 * Free a transport object.
 *
 * When a transport is freed, it is automatically unbound from its
 * associated connection.
 *
 * @param[in] transport a transport object or NULL
PN_EXTERN void pn_transport_free(pn_transport_t *transport);
*/
{code}
 
It does NOT say that the connection is freed. So either we need to modify the 
behavior so the connection is not freed (preferable since this is backwards 
compatible) or we need to change the API doc text to make it clear and release 
note the change.

 Possible reference counting bug in pn_clear_tpwork
 --

 Key: PROTON-829
 URL: https://issues.apache.org/jira/browse/PROTON-829
 Project: Qpid Proton
  Issue Type: Bug
  Components: proton-c
Affects Versions: 0.8
Reporter: Alan Conway
Assignee: Alan Conway
 Fix For: 0.9


 See QPID-6415 which describes a core dump in the qpid tests that appears when 
 using the current 0.9 proton master. The qpid tests pass OK with proton 0.8.
 The valgrind output in QPID-6415 shows that a connection is deleted while it 
 is being finalized by  a call from pn_connection_unbound to pn_clear_tpwork.
 I do not yet understand the details, but removing the following strange code 
 fixes the problem and passes the proton test suite without valgrind errors:
 {noformat}
 --- a/proton-c/src/engine/engine.c
 +++ b/proton-c/src/engine/engine.c
 @@ -690,10 +690,10 @@ void pn_clear_tpwork(pn_delivery_t *delivery)
{
  LL_REMOVE(connection, tpwork, delivery);
  delivery-tpwork = false;
 -if (pn_refcount(delivery)  0) {
 -  pn_incref(delivery);
 -  pn_decref(delivery);
 -}
}
  }
 {noformat}
 The code is strange because
 a) you should never examine a refcount except for debugging purposes
 b) under normal refcounting semantics incref+decref is a no-op.
 Is removing this code OK?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)