Re: codec changes
On Mon, May 11, 2015 at 3:49 PM, Alan Conway acon...@redhat.com wrote:

On Thu, 2015-05-07 at 15:53 -0400, Rafael Schloming wrote: I believe where we ended up was standardizing on a single snprintf-style signature for all encode functions, i.e.:

    // always returns the encoded size, never overwrites limit; pass in NULL, 0 to compute the size in advance
    // returns < 0 if there was an actual error of some sort, e.g. xxx_t cannot be validly encoded for some reason
    ssize_t pn_xxx_encode(xxx_t, char *buf, size_t limit);

And transitioning to this with a feature macro.

I'm good with sprintf-style. I'm not sure what you mean by a feature macro. Does that address binary compatibility or is it just a source-level switch? If we break binary compatibility we need to make sure we bump the library version appropriately to avoid breaking existing apps. If we set the switch to default to the new behavior we should make sure the compiler barfs on old code and it's fairly obvious what to do about it.

I've never actually defined feature test macros before, just used them, so I'm just making this up as we go along. What I imagine could work, though, is to define pn_xxx_encode2 in the .c file to provide the new behavior. By default it would not be visible from the header files unless the appropriate feature macro was defined before inclusion, in which case we would alias pn_xxx_encode to pn_xxx_encode2, e.g.:

    #include <proton/codec.h>
    #include <proton/message.h>
    ...
    int err = pn_message_encode(msg, buf, in_out_size); // we could make this spew a deprecated warning by default

versus:

    #define PN_STANDARD_ENCODE_SIGNATURE
    #include <proton/codec.h>
    #include <proton/message.h>
    ...
    ssize_t encoded_size = pn_message_encode(msg, buf, bufsize);

This would allow people to incrementally upgrade and would not break binary compatibility. We could of course at some later point break binary compatibility if we want to and clean up the encode2 symbols, especially if we have other reasons to change the ABI.
One thing I'm not sure of is the granularity of the feature test, e.g. should the macro be specific to the change (PN_STANDARD_ENCODE_SIGNATURE) or should it be related to the version somehow, e.g. (PN_0_10_SOURCE) or something like that. I guess both could be an option too, e.g. PN_0_10_SOURCE could be an alias for all the new features in 0.10, presumably just this and possibly the sasl stuff if there is a way to make that work. --Rafael
Re: codec changes
On Thu, 2015-05-07 at 15:53 -0400, Rafael Schloming wrote: I believe where we ended up was standardizing on a single snprintf-style signature for all encode functions, i.e.:

    // always returns the encoded size, never overwrites limit; pass in NULL, 0 to compute the size in advance
    // returns < 0 if there was an actual error of some sort, e.g. xxx_t cannot be validly encoded for some reason
    ssize_t pn_xxx_encode(xxx_t, char *buf, size_t limit);

And transitioning to this with a feature macro.

I'm good with sprintf-style. I'm not sure what you mean by a feature macro. Does that address binary compatibility or is it just a source-level switch? If we break binary compatibility we need to make sure we bump the library version appropriately to avoid breaking existing apps. If we set the switch to default to the new behavior we should make sure the compiler barfs on old code and it's fairly obvious what to do about it.

Cheers, Alan.

--Rafael On Thu, May 7, 2015 at 3:28 PM, Andrew Stitcher astitc...@redhat.com wrote: On Wed, 2015-05-06 at 14:03 -0400, Rafael Schloming wrote: We seem to have reached consensus here, but I haven't seen any commits on this. We should probably fix this before 0.10 so we don't end up putting out a new API and then deprecating it in the next release. Is anyone actually working on this?

Could you articulate the consensus then. Asserting we have reached consensus without explicitly stating what you think the consensus to be is remarkably likely to be proven wrong by subsequent events!

Andrew
Re: codec changes
On Wed, 2015-05-06 at 14:03 -0400, Rafael Schloming wrote: We seem to have reached consensus here, but I haven't seen any commits on this. We should probably fix this before 0.10 so we don't end up putting out a new API and then deprecating it in the next release. Is anyone actually working on this? Could you articulate the consensus then. Asserting we have reached consensus without explicitly stating what you think the consensus to be is remarkably likely to be proven wrong by subsequent events! Andrew
Re: codec changes
I believe where we ended up was standardizing on a single snprintf-style signature for all encode functions, i.e.:

    // always returns the encoded size, never overwrites limit; pass in NULL, 0 to compute the size in advance
    // returns < 0 if there was an actual error of some sort, e.g. xxx_t cannot be validly encoded for some reason
    ssize_t pn_xxx_encode(xxx_t, char *buf, size_t limit);

And transitioning to this with a feature macro.

--Rafael

On Thu, May 7, 2015 at 3:28 PM, Andrew Stitcher astitc...@redhat.com wrote: On Wed, 2015-05-06 at 14:03 -0400, Rafael Schloming wrote: We seem to have reached consensus here, but I haven't seen any commits on this. We should probably fix this before 0.10 so we don't end up putting out a new API and then deprecating it in the next release. Is anyone actually working on this?

Could you articulate the consensus then. Asserting we have reached consensus without explicitly stating what you think the consensus to be is remarkably likely to be proven wrong by subsequent events!

Andrew
Re: codec changes
Rafael Schloming r...@alum.mit.edu wrote: On Tue, Apr 14, 2015 at 1:27 PM, Alan Conway acon...@redhat.com wrote: That works for me, now how do we manage the transition? I don't think we can afford to inflict yum update proton; all proton apps crash on our users. That means we cannot change the behavior of existing function names. I don't much like adding encode_foo2 for every encode_foo but I don't see what else to do.

We could mark the old ones as deprecated and add the new ones as pn_xxx_encode2 and provide a feature macro that #defines pn_xxx_encode to pn_xxx_encode2. Then after a sufficient transition period we can get rid of the old versions. --Rafael

+1
Re: codec changes
On Wed, 2015-04-15 at 07:13 -0400, Rafael Schloming wrote: On Tue, Apr 14, 2015 at 1:27 PM, Alan Conway acon...@redhat.com wrote: That works for me, now how do we manage the transition? I don't think we can afford to inflict yum update proton; all proton apps crash on our users. That means we cannot change the behavior of existing function names. I don't much like adding encode_foo2 for every encode_foo but I don't see what else to do. We could mark the old ones as deprecated and add the new ones as pn_xxx_encode2 and provide a feature macro that #defines pn_xxx_encode to pn_xxx_encode2. Then after a sufficient transition period we can get rid of the old versions. Along these lines it is also possible to keep ABI (binary library compatibility) by using versioned library symbols as well. There is a level of linker magic needed, but at least on Unix it's well understood, if a little arcane. Another perfectly reasonable approach would be to create a new name for the new API and deprecate the old name. So for example deprecate pn_data_t in favour of pn_value_t (or whatever better but new name). Where pn_value_t has all the old functionality of pn_data_t and indeed may be the same internal structure initially, but with a different interface. Incidentally C++ does make this easier because it allows function overloading. Andrew
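Andrew's mention of versioned library symbols might look roughly like this with GNU toolchain symbol versioning. This is a sketch only, assuming a GNU/Linux build; the version node names and the proton.map file are hypothetical, and the fragment only links when built into a shared library with the accompanying version script:

```c
/* encode.c -- built into libproton.so with the version script below.
 * Both implementations are exported under the same public name,
 * pn_data_encode, bound to different symbol versions, so old binaries
 * keep resolving the old behavior. */
#include <stddef.h>
#include <sys/types.h>

ssize_t pn_data_encode_v1(void *data, char *buf, size_t size); /* old ABI */
ssize_t pn_data_encode_v2(void *data, char *buf, size_t size); /* new ABI */

__asm__(".symver pn_data_encode_v1, pn_data_encode@PROTON_0.9");
__asm__(".symver pn_data_encode_v2, pn_data_encode@@PROTON_0.10");

/* proton.map, passed to the linker via -Wl,--version-script=proton.map:
 *
 *   PROTON_0.9  { global: pn_data_encode; local: *; };
 *   PROTON_0.10 { global: pn_data_encode; } PROTON_0.9;
 */
```

The `@@` marks the default version new code links against; `@` keeps the old entry point reachable for already-linked applications.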
Re: codec changes
On Tue, Apr 14, 2015 at 1:27 PM, Alan Conway acon...@redhat.com wrote: That works for me, now how do we manage the transition? I don't think we can afford to inflict yum update proton; all proton apps crash on our users. That means we cannot change the behavior of existing function names. I don't much like adding encode_foo2 for every encode_foo but I don't see what else to do. We could mark the old ones as deprecated and add the new ones as pn_xxx_encode2 and provide a feature macro that #defines pn_xxx_encode to pn_xxx_encode2. Then after a sufficient transition period we can get rid of the old versions. --Rafael
Re: codec changes
On Sun, 2015-04-12 at 13:12 -0400, Rafael Schloming wrote: On Fri, Apr 10, 2015 at 11:37 AM, Alan Conway acon...@redhat.com wrote: On Wed, 2015-04-08 at 14:50 -0400, Andrew Stitcher wrote: On Wed, 2015-04-08 at 10:48 -0400, Alan Conway wrote: ... The issue is that we need buffer growth AND control over allocation. pn_buffer_t forces use of malloc/free/realloc. That won't help if you're trying to get the data into a buffer allocated by Go for example. I agree a growable buffer type is a nicer API than raw function pointers, but the buffer type itself would need to use function pointers so we can replace the functions used for alloc/free/realloc. I think this is a pretty general issue actually that also crops up in embedded systems, where you want some different control over memory allocation. I think it might be better addressed by making malloc/free/realloc replaceable for the whole of proton - would that solve your go issue? Sadly no. A proton realloc surrogate is just going to be passed void* and return void*. There's no memory safe way to implement that in Go, unless someone keeps a Go pointer to the returned buffer it gets garbage collected. There's no way to communicate a Go pointer thru a C proton interface. Given all that, I think I prefer the sprintf-style solution for growing buffers. It keeps the memory management strictly on the caller side of the pn_ interface which is simpler. If there is a big pn_data overhaul in the works then maybe we should not be talking about pn_data_encode and instead make sure we fix it in the overhaul. 
I notice we have these other APIs:

    codec.h:499:   PN_EXTERN int pn_data_format(pn_data_t *data, char *bytes, size_t *size);
    message.h:739: PN_EXTERN int pn_message_encode(pn_message_t *msg, char *bytes, size_t *size);
    ssl.h:320:     PN_EXTERN int pn_ssl_get_peer_hostname(pn_ssl_t *ssl, char *hostname, size_t *bufsize);

That is a better signature IMO. The new (backward compatible) convention for such functions would be:

    return 0: *size is updated to be the actual size.
    return PN_OVERFLOW: *size is updated to be the required size.
    return < 0 (any other error): *size is undefined.
    never return > 0.

Perhaps for the overhaul we should ensure all such functions (including the replacement for pn_data_encode) follow this pattern?

I'm a big +1 on standardizing on a single snprintf-style signature for all encode functions. The signatures you mention above (similar to sprintf but using an IN/OUT parameter) were my first attempt at a standard encode/decode signature. I agree with you that at first it looks a bit more appealing for some reason; however, after using it for a while I actually found it quite cumbersome relative to the traditional snprintf style. In the common case it takes almost twice as much code to use, since you always have to declare, initialize, and check an extra variable in separate steps due to the IN/OUT parameter. It also doesn't map that nicely when you swig it, since most languages don't do the IN/OUT parameter thing well, so you have to have more cumbersome/custom typemaps and/or wrapper code to deal with the impedance mismatch. For those reasons I propose that we standardize on the traditional snprintf style rather than using the IN/OUT variant.

That works for me, now how do we manage the transition? I don't think we can afford to inflict yum update proton; all proton apps crash on our users. That means we cannot change the behavior of existing function names. I don't much like adding encode_foo2 for every encode_foo but I don't see what else to do.
Re: codec changes
On Tue, 2015-04-07 at 17:57 -0400, Rafael Schloming wrote: Maybe I'm not following something, but I don't see how passing around allocation functions actually solves any ownership problems. I would think the ownership problems come from pn_data_t holding onto a pointer regardless of whether that pointer was gotten from malloc or from a callback. I'm assuming you want to be able to do something like use a pn_data_t that is owned by one thread to encode into a buffer and pass that buffer over to another thread. How does passing in a bunch of allocators help with this? If the pn_data_t holds a pointer to whatever those allocators return, then aren't you going to have ownership issues no matter what?

To answer Bozzo's original question, I think that it's good to keep the encoder/decoder decoupled from the buffer for a number of reasons. In addition to the ownership issues that Alan points out, the encoded data may have a different lifecycle from the pn_data_t that created it for non-thread-related reasons, or you may simply want to encode directly into a frame buffer and avoid an extra copy.

If the goal here is simply to provide a convenient way to avoid having to repeat the resizing loop, then I would suggest simply providing a convenience API that accepts a growable buffer of some sort. This provides both convenience and avoids ownership issues with holding pointers. I'm thinking something along the lines of:

    pn_data_t data = ...;
    pn_string/buffer_t buf = ...;
    pn_data_encode_conveniently(data, buf); // obviously needs a better name

It is perhaps not *quite* as convenient in the minimal case as pn_data_t holding the buffer internally, but it is an improvement in general and probably simpler than having to mess with function pointers in the case where the buffer's lifecycle is independent from the pn_data_t. (Not that I really understand how that would work anyways.)

The issue is that we need buffer growth AND control over allocation.
pn_buffer_t forces use of malloc/free/realloc. That won't help if you're trying to get the data into a buffer allocated by Go for example. I agree a growable buffer type is a nicer API than raw function pointers, but the buffer type itself would need to use function pointers so we can replace the functions used for alloc/free/realloc.
Re: codec changes
On Wed, 2015-04-08 at 10:48 -0400, Alan Conway wrote: ... The issue is that we need buffer growth AND control over allocation. pn_buffer_t forces use of malloc/free/realloc. That won't help if you're trying to get the data into a buffer allocated by Go for example. I agree a growable buffer type is a nicer API than raw function pointers, but the buffer type itself would need to use function pointers so we can replace the functions used for alloc/free/realloc. I think this is a pretty general issue actually that also crops up in embedded systems, where you want some different control over memory allocation. I think it might be better addressed by making malloc/free/realloc replaceable for the whole of proton - would that solve your go issue? Andrew
Re: codec changes
On Wed, Apr 8, 2015 at 10:48 AM, Alan Conway acon...@redhat.com wrote: On Tue, 2015-04-07 at 17:57 -0400, Rafael Schloming wrote: Maybe I'm not following something, but I don't see how passing around allocation functions actually solves any ownership problems. I would think the ownership problems come from pn_data_t holding onto a pointer regardless of whether that pointer was gotten from malloc or from a callback. I'm assuming you want to be able to do something like use a pn_data_t that is owned by one thread to encode into a buffer and pass that buffer over to another thread. How does passing in a bunch of allocators help with this? If the pn_data_t holds a pointer to whatever those allocators return, then aren't you going to have ownership issues no matter what? To answer Bozzo's original question, I think that it's good to keep the encoder/decoder decoupled from the buffer for a number of reasons. In addition to the ownership issues that Alan points out, the encoded data may have a different lifecycle from the pn_data_t that created it for non thread related reasons, or you may simply want to encode directly into a frame buffer and avoid an extra copy. If the goal here is simply to provide a convenient way to avoid having to repeat the resizing loop then I would suggest simply providing a convenience API that accepts a growable buffer of some sort. This provides both convenience and avoids ownership issues with holding pointers. I'm thinking something along the lines of: pn_data_t data = ...; pn_string/buffer_t buf = ...; pn_data_encode_conveniently(data, buf); // obviously needs a better name It is perhaps not *quite* as convenient in the minimal case as pn_data_t holding the buffer internally, but it is an improvement in general and probably simpler than having to mess with function pointers in the case where the buffer's lifecycle is independent from the pn_data_t. (Not that I really understand how that would work anyways.) 
The issue is that we need buffer growth AND control over allocation. pn_buffer_t forces use of malloc/free/realloc. That won't help if you're trying to get the data into a buffer allocated by Go for example. I agree a growable buffer type is a nicer API than raw function pointers, but the buffer type itself would need to use function pointers so we can replace the functions used for alloc/free/realloc.

I'm skeptical that passing around a set of function pointers, whether directly to pn_data_t or to pn_buffer_t, is actually fewer lines of code than simply writing the analogous convenience encoding function for that language. For python it would likely be fewer lines of code to simply define something like this:

    ssize_t pn_data_encode2bytearray(pn_data_t *data, PyObject *bytearray) {
      ssize_t size = pn_data_encode(data, PyByteArray_AsString(bytearray), PyByteArray_Size(bytearray));
      if (size < 0) {
        return size;
      }
      if (size > PyByteArray_Size(bytearray)) {
        PyByteArray_Resize(bytearray, size);
        return pn_data_encode(data, PyByteArray_AsString(bytearray), PyByteArray_Size(bytearray));
      }
      return size;
    }

This would also be more flexible since you could pass in any bytearray object from python, reuse it as much as you like, and have full control over the lifecycle of that object, whereas I suspect given the signature of the function pointers you are defining you would need to choose some sort of predefined allocation strategy for creating whatever go object you are encoding into. This sounds to me like it is both less convenient and less flexible, but it's entirely possible I'm just not understanding how you intend to use this.

Perhaps it would clarify things if you could post an actual example usage of how you imagine using this that hopefully illustrates why it is fewer lines of code than the alternatives?

--Rafael
Re: codec changes
On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote: Given the memory overhead of a pn_data_t before encoding, why not have it own an encode buffer? It could get by with exactly that grow_buffer() callback if ownership is the issue.

I like this idea a lot. Ownership is an issue for me in Go, so I would want an alloc/free callback, but we could have a non-callback version as well (a simple wrapper for the callback version with default callbacks) for users that don't care about ownership - what do others think? Something like:

    // if allocfn or freefn is NULL, or allocfn returns NULL, return EOS on running out of space.
    ssize_t pn_data_encode_alloc(pn_data_t *data, char *bytes, size_t size,
                                 void *(*allocfn)(size_t), void (*freefn)(void *));

    // Compatibility
    ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size) {
      return pn_data_encode_alloc(data, bytes, size, NULL, NULL);
    }

    // Convenience when the caller would like encode to do buffer management for them.
    // If *bytes is not NULL it is the initial buffer; if NULL, encode allocates.
    // *bytes is updated to point to the final buffer, which must be freed with free().
    int pn_data_encode_grow(pn_data_t *data, char **bytes, size_t size) {
      return pn_data_encode_alloc(data, *bytes, size, pni_realloc, free);
    }

Note:
- alloc is called 0 or 1 times, and is passed the exact size needed to encode.
- the user is free to allocate more (e.g. doubling strategies)
- free is called after alloc, if and only if alloc is called.
- separate free instead of a single realloc call: the encode impl decides if/how much data to copy from the old to the new buffer.

This can probably be made a bit simpler.

Bozzo On Mar 31, 2015 6:10 PM, Rafael Schloming r...@alum.mit.edu wrote: Hi Alan, Sorry I didn't comment on this sooner; I didn't have time to comment on your original review request during my travels, but I do have some thoughts on the changes you made to the codec interface.
I noticed you added a separate accessor for the size:

    ssize_t pn_data_encoded_size(pn_data_t *data);

This is alongside the original encode method:

    ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size);

I think this API choice, while nice in that it is backwards compatible, is also going to result in code that is roughly twice as slow as it needs to be in the most common case. Based on my experience implementing and profiling codec in C, Python, and Java, computing the size of the encoded data is usually roughly the same amount of work as actually encoding it, regardless of the implementation language. Therefore code like this:

    if (buffer_size() < pn_data_encoded_size(data)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

Can end up being roughly twice as slow as code like this:

    ssize_t err;
    while ((err = pn_data_encode(data, buffer, buffer_size())) == PN_OVERFLOW) {
      grow_buffer();
    }

Admittedly the latter form is much more awkward in those cases where you don't care about performance, so I'm all for providing something nicer, but I think a better API change would be to steal a page from the C stdio.h APIs and have pn_data_encode always return the number of bytes that would have been written had there been enough space. This allows you to write the simplified encode as above:

    if (buffer_size() < pn_data_encode(data, NULL, 0)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

Or use a more optimal form:

    ssize_t n = pn_data_encode(data, buffer, buffer_size());
    if (n > buffer_size()) {
      grow_buffer();
      pn_data_encode(data, buffer, buffer_size());
    }

This makes the slow/convenient form possible, and provides some options that are a bit less awkward than the loop, but it also makes it very clear that when you use the slow/convenient form you are incurring roughly twice the cost of the alternative.
Normally I wouldn't be overly fussed by something like this, and I realize what I'm suggesting is a breaking change relative to what you provided, but based on what profiling we've done in the past, codec is probably the most significant source of overhead that we add to an application, and exactly this sort of double encode effect is almost always one of the first things you hit when you try to optimize. Given this, I think it would be a good thing if the API accurately reflects the relative cost of the different styles of use. Thoughts? --Rafael
Re: codec changes
On Tue, 2015-04-07 at 09:38 -0400, Alan Conway wrote: On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote: Given the memory overhead of a pn_data_t before encoding, why not have it own an encode buffer? It could get by with exactly that grow_buffer() callback if ownership is the issue.

I think the best way to do this would be to introduce a new class to sit on top of the existing pn_data_t which does this, rather than extending the current pn_data_t. So I think the below is fine, but I'd prefer to avoid stuffing it all into pn_data_t, especially as I think the class is, long term, going away.

Andrew

I like this idea a lot. Ownership is an issue for me in Go, so I would want an alloc/free callback, but we could have a non-callback version as well (a simple wrapper for the callback version with default callbacks) for users that don't care about ownership - what do others think? Something like:

    // if allocfn or freefn is NULL, or allocfn returns NULL, return EOS on running out of space.
    ssize_t pn_data_encode_alloc(pn_data_t *data, char *bytes, size_t size,
                                 void *(*allocfn)(size_t), void (*freefn)(void *));

    // Compatibility
    ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size) {
      return pn_data_encode_alloc(data, bytes, size, NULL, NULL);
    }

    // Convenience when the caller would like encode to do buffer management for them.
    // If *bytes is not NULL it is the initial buffer; if NULL, encode allocates.
    // *bytes is updated to point to the final buffer, which must be freed with free().
    int pn_data_encode_grow(pn_data_t *data, char **bytes, size_t size) {
      return pn_data_encode_alloc(data, *bytes, size, pni_realloc, free);
    }

Note:
- alloc is called 0 or 1 times, and is passed the exact size needed to encode.
- the user is free to allocate more (e.g. doubling strategies)
- free is called after alloc, if and only if alloc is called.
- separate free instead of a single realloc call: the encode impl decides if/how much data to copy from the old to the new buffer.

This can probably be made a bit simpler.
Bozzo On Mar 31, 2015 6:10 PM, Rafael Schloming r...@alum.mit.edu wrote: Hi Alan, Sorry I didn't comment on this sooner; I didn't have time to comment on your original review request during my travels, but I do have some thoughts on the changes you made to the codec interface. I noticed you added a separate accessor for the size:

    ssize_t pn_data_encoded_size(pn_data_t *data);

This is alongside the original encode method:

    ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size);

I think this API choice, while nice in that it is backwards compatible, is also going to result in code that is roughly twice as slow as it needs to be in the most common case. Based on my experience implementing and profiling codec in C, Python, and Java, computing the size of the encoded data is usually roughly the same amount of work as actually encoding it, regardless of the implementation language. Therefore code like this:

    if (buffer_size() < pn_data_encoded_size(data)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

Can end up being roughly twice as slow as code like this:

    ssize_t err;
    while ((err = pn_data_encode(data, buffer, buffer_size())) == PN_OVERFLOW) {
      grow_buffer();
    }

Admittedly the latter form is much more awkward in those cases where you don't care about performance, so I'm all for providing something nicer, but I think a better API change would be to steal a page from the C stdio.h APIs and have pn_data_encode always return the number of bytes that would have been written had there been enough space.
This allows you to write the simplified encode as above:

    if (buffer_size() < pn_data_encode(data, NULL, 0)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

Or use a more optimal form:

    ssize_t n = pn_data_encode(data, buffer, buffer_size());
    if (n > buffer_size()) {
      grow_buffer();
      pn_data_encode(data, buffer, buffer_size());
    }

This makes the slow/convenient form possible, and provides some options that are a bit less awkward than the loop, but it also makes it very clear that when you use the slow/convenient form you are incurring roughly twice the cost of the alternative. Normally I wouldn't be overly fussed by something like this, and I realize what I'm suggesting is a breaking change relative to what you provided, but based on what profiling we've done in the past, codec is probably the most significant source of overhead that we add to an application, and exactly this sort of double encode effect is almost always one of the first things you hit when you try to optimize. Given this, I think it would be a good thing if the API accurately reflects the relative cost of the different styles of use. Thoughts? --Rafael
Re: codec changes
On 7. 04. 15 15.38, Alan Conway wrote: On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote: Given the memory overhead of a pn_data_t before encoding, why not have it own an encode buffer? It could get by with exactly that grow_buffer() callback if ownership is the issue.

I like this idea a lot. Ownership is an issue for me in Go so I would want an alloc/free callback but we could have a non-callback version as well (simple wrapper for the callback version with default callbacks) for users that don't care about ownership - what do others think?

[snip]

Note: - alloc called 0 or 1 times, and is passed the exact size needed to encode.

I would hate to have the encoder know a priori the exact size of the allocation, especially if we want, at some point, to be able to generate {LIST|MAP|ARRAY}8 encodings. I think we can aim for an upper bound, but leave out exactness. Calculating the encoding size for non-composite atoms is easy and cheap, so it could be moved from encoding to the pn_data_put_*() mutators. As long as uuid is part of the atom union, adding another int will not hurt :P Calculating a conservative size for composites is also cheap.

Maybe pn_data_t could even be made to encode as it goes, and would thus not need the extra buffer *and* copying for variable-sized atoms (strings and binary), and each atom could be reduced to an offset into the encoded buffer.

- user is free to allocate more (e.g. doubling strategies) - free called after alloc, if and only if alloc is called. - separate free instead of single realloc call: encode impl decides if/how much data to copy from old to new buffer.

Partial copying of encoded buffers is also problematic, because the encoder has to backtrack to fill out byte sizes for structured types. So at best we could do a scatter-gather thing, which quickly becomes another mess as the pn_data_t then needs much better arithmetic for offset and size calculations. I really think we should just offer exactly realloc() for now.

Bozzo
Re: codec changes
On Tue, 2015-04-07 at 11:00 -0400, Andrew Stitcher wrote: On Tue, 2015-04-07 at 09:38 -0400, Alan Conway wrote: On Tue, 2015-03-31 at 19:17 +0200, Božo Dragojevič wrote: Given the memory overhead of a pn_data_t before encoding, why not have it own an encode buffer? it could get by with exactly that grow_buffer() callback if ownership is the issue . I think the best way to do this would be to introduce a new class to sit on top of the existing pn_data_t which does this, rather than extending the current pn_data_t. So I think the below is fine, but I'd prefer to avoid stuffing it all into pn_data_t especially as I think the class is, long term, going away. Who's replacing it ;) ? Are they taking notes? Cheers, Alan.
Re: codec changes
Maybe I'm not following something, but I don't see how passing around allocation functions actually solves any ownership problems. I would think the ownership problems come from pn_data_t holding onto a pointer regardless of whether that pointer was gotten from malloc or from a callback. I'm assuming you want to be able to do something like use a pn_data_t that is owned by one thread to encode into a buffer and pass that buffer over to another thread. How does passing in a bunch of allocators help with this? If the pn_data_t holds a pointer to whatever those allocators return, then aren't you going to have ownership issues no matter what? To answer Bozzo's original question, I think that it's good to keep the encoder/decoder decoupled from the buffer for a number of reasons. In addition to the ownership issues that Alan points out, the encoded data may have a different lifecycle from the pn_data_t that created it for non thread related reasons, or you may simply want to encode directly into a frame buffer and avoid an extra copy. If the goal here is simply to provide a convenient way to avoid having to repeat the resizing loop then I would suggest simply providing a convenience API that accepts a growable buffer of some sort. This provides both convenience and avoids ownership issues with holding pointers. I'm thinking something along the lines of: pn_data_t data = ...; pn_string/buffer_t buf = ...; pn_data_encode_conveniently(data, buf); // obviously needs a better name It is perhaps not *quite* as convenient in the minimal case as pn_data_t holding the buffer internally, but it is an improvement in general and probably simpler than having to mess with function pointers in the case where the buffer's lifecycle is independent from the pn_data_t. (Not that I really understand how that would work anyways.) 
--Rafael

On Tue, Apr 7, 2015 at 1:21 PM, Alan Conway acon...@redhat.com wrote:
> [quote of Alan's message above snipped]
codec changes
Hi Alan,

Sorry I didn't comment on this sooner; I didn't have time to comment on your original review request during my travels. However, I do have some thoughts on the changes you made to the codec interface. I noticed you added a separate accessor for the size:

    ssize_t pn_data_encoded_size(pn_data_t *data);

This is alongside the original encode method:

    ssize_t pn_data_encode(pn_data_t *data, char *bytes, size_t size);

I think this API choice, while nice in that it is backwards compatible, is also going to result in code that is roughly twice as slow as it needs to be in the most common case. Based on my experience implementing and profiling codec in C, Python, and Java, computing the size of the encoded data is usually roughly the same amount of work as actually encoding it, regardless of the implementation language. Therefore code like this:

    if (buffer_size() < pn_data_encoded_size(data)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

can end up being roughly twice as slow as code like this:

    ssize_t err;
    while ((err = pn_data_encode(data, buffer, buffer_size())) == PN_OVERFLOW) {
        grow_buffer();
    }

Admittedly the latter form is much more awkward in those cases where you don't care about performance, so I'm all for providing something nicer, but I think a better API change would be to steal a page from the C stdio.h APIs and have pn_data_encode always return the number of bytes that would have been written had there been enough space.
This allows you to write the simplified encode as above:

    if (buffer_size() < pn_data_encode(data, NULL, 0)) grow_buffer();
    pn_data_encode(data, buffer, buffer_size());

Or use a more optimal form:

    ssize_t n = pn_data_encode(data, buffer, buffer_size());
    if (n > buffer_size()) {
        grow_buffer();
        pn_data_encode(data, buffer, buffer_size());
    }

This makes the slow/convenient form possible and provides some options that are a bit less awkward than the loop, but it also makes it very clear that when you use the slow/convenient form you are incurring roughly twice the cost of the alternative.

Normally I wouldn't be overly fussed by something like this, and I realize what I'm suggesting is a breaking change relative to what you provided. But based on the profiling we've done in the past, codec is probably the most significant source of overhead that we add to an application, and exactly this sort of double-encode effect is almost always one of the first things you hit when you try to optimize. Given this, I think it would be a good thing if the API accurately reflected the relative cost of the different styles of use.

Thoughts?

--Rafael
Re: codec changes
Given the memory overhead of a pn_data_t before encoding, why not have it own an encode buffer? It could get by with exactly that grow_buffer() callback if ownership is the issue.

Bozzo

On Mar 31, 2015 6:10 PM, Rafael Schloming r...@alum.mit.edu wrote:
> [quote of the original "codec changes" message above snipped]