Maybe zmq_msg_init_allocator, which accepts the allocator. With that pattern we do need the release method: the zmq_msg will handle it internally and register the release method as the free method of the zmq_msg. They do need to have the same signature.
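A minimal sketch of how such a helper could sit on top of the existing zmq_msg_init_data(), assuming an allocator descriptor along the lines of Jens's proposal below (zmq_allocator_t and zmq_msg_init_allocator are hypothetical names, not existing libzmq API):

    #include <stddef.h>
    #include <zmq.h>

    /* Hypothetical allocator descriptor, following Jens's sketch below. */
    typedef struct zmq_allocator_t {
        void *obj;                               /* allocator state, e.g. a pool  */
        void *(*allocate) (size_t n, void *obj);
        void (*release) (void *ptr, void *obj);  /* same signature as zmq_free_fn */
    } zmq_allocator_t;

    /* Hypothetical init helper: allocate the buffer through the allocator and
       register its release method as the message's free function, passing the
       allocator state as the hint. */
    int zmq_msg_init_allocator (zmq_msg_t *msg, size_t size, zmq_allocator_t *a)
    {
        void *buf = a->allocate (size, a->obj);
        if (!buf)
            return -1;
        return zmq_msg_init_data (msg, buf, size, a->release, a->obj);
    }

As Francesco notes further down, going through zmq_msg_init_data() still pays the small malloc of the internal content_t control block; eliminating that one as well needs changes inside libzmq.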
On Thu, Aug 15, 2019 at 12:35 PM Francesco <[email protected]> wrote:
> Hi Doron, hi Jens,
> Yes, the allocator method is a nice solution.
> I think it would be nice to have libzmq also provide a memory-pool implementation, but keep the malloc/free implementation as the default for backward compatibility.
>
> It's also important to have a smart allocator that internally contains not just one but several pools for different packet-size classes, to avoid memory waste. But I think this can fit easily in the allocator pattern sketched out by Jens.
>
> Btw, another issue unrelated to the allocator API but regarding performance: I think it's important to avoid not only the allocation of the msg buffer but also that of the content_t structure, and indeed in my preliminary merge request I did modify zmq_msg_t of type_lmsg to use the first 40 bytes inside the pooled buffer.
> Of course this approach is not backward compatible with the _init_data() semantics.
> How do you think this would best be approached?
> I guess we may have a new _init_data_and_controlblock() helper that does the trick of taking the first 40 bytes of the provided buffer?
>
> Thanks,
> Francesco
>
>
> On Wed, Aug 14, 2019, 22:23 Doron Somech <[email protected]> wrote:
>
>> Jens, I like the idea.
>>
>> We actually don't need the release method.
>> The signature of allocate should receive the zmq_msg and allocate it:
>>
>> int (&allocate)(zmq_msg *msg, size_t size, void *obj);
>>
>> When the allocator creates the zmq_msg, it will provide the release method to the zmq_msg in the constructor.
>>
>> This is important in order to forward messages between sockets, so the release method is part of the msg. This is already supported by zmq_msg, which accepts a free method with a hint (obj in your example).
>>
>> The return value of allocate will be a success indication, like the rest of the zeromq methods.
>>
>> zeromq actually already supports a pool mechanism when sending, using the zmq_msg API. Receiving is the problem; your suggestion solves it nicely.
>>
>> By the way, a memory pool is already supported in NetMQ, with a solution very similar to the one you suggested. (It is global for all sockets, without override.)
>>
>>
>> On Wed, Aug 14, 2019, 22:41 Jens Auer <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Maybe this can be combined with a request that I have seen a couple of times to be able to configure the allocator used in libzmq? I am thinking of something like
>>>
>>> struct zmq_allocator {
>>>     void* obj;
>>>     void* (&allocate)(size_t n, void* obj);
>>>     void (&release)(void* ptr, void* obj);
>>> };
>>>
>>> void* useMalloc(size_t n, void*) {return malloc(n);}
>>> void freeMalloc(void* ptr, void*) {free(ptr);}
>>>
>>> zmq_allocator& zmq_default_allocator() {
>>>     static zmq_allocator defaultAllocator = {nullptr, useMalloc, freeMalloc};
>>>     return defaultAllocator;
>>> }
>>>
>>> The context could then store the allocator for libzmq, and users could set a specific allocator as a context option, e.g. with zmq_ctx_set. A socket created for a context can then inherit the default allocator or set a special allocator as a socket option.
>>>
>>> class MemoryPool {…}; // hopefully thread-safe
>>>
>>> MemoryPool pool;
>>>
>>> void* allocatePool(size_t n, void* pool) {return static_cast<MemoryPool*>(pool)->allocate(n);}
>>> void releasePool(void* ptr, void* pool) {static_cast<MemoryPool*>(pool)->release(ptr);}
>>>
>>> zmq_allocator pooledAllocator {
>>>     &pool, allocatePool, releasePool
>>> };
>>>
>>> void* ctx = zmq_ctx_new();
>>> zmq_ctx_set(ctx, ZMQ_ALLOCATOR, &pooledAllocator);
>>>
>>> Cheers,
>>> Jens
>>>
>>> On 13.08.2019 at 13:24, Francesco <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> today I've taken some time to attempt building a memory-pooling mechanism into the ZMQ local_thr/remote_thr benchmarking utilities.
>>> Here's the result: https://github.com/zeromq/libzmq/pull/3631
>>> This PR is a work in progress and is a simple modification to show the effects of avoiding malloc/free when creating zmq_msg_t with the standard benchmark utils of ZMQ.
>>>
>>> In particular, the very fast, lock-free, single-producer/single-consumer queue from https://github.com/cameron314/readerwriterqueue is used to maintain a list of free buffers shared between the "remote_thr" main thread and its ZMQ background I/O thread.
>>>
>>> Here are the graphical results:
>>> with mallocs / no memory pool:
>>> https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
>>> with memory pool:
>>> https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png
>>>
>>> Doing the math, the memory-pooled approach shows:
>>>
>>> - mostly the same performance for messages <= 32B
>>> - +15% pps/throughput increase @ 64B
>>> - +60% pps/throughput increase @ 128B
>>> - +70% pps/throughput increase @ 210B
>>>
>>> [The tests were stopped at 210B because my current quick-and-dirty memory pool has a fixed max msg size of about 210B.]
>>>
>>> Honestly this is not a huge speedup, even if it is still interesting.
>>> Indeed, with these changes the performance now seems to be bounded by the "local_thr" side and not by "remote_thr" anymore: the ZMQ background I/O thread for local_thr is the only thread at 100% on the two systems, and its "perf top" now shows:
>>>
>>> 15,02%  libzmq.so.5.2.3  [.] zmq::metadata_t::add_ref
>>> 14,91%  libzmq.so.5.2.3  [.] zmq::v2_decoder_t::size_ready
>>>  8,94%  libzmq.so.5.2.3  [.] zmq::ypipe_t<zmq::msg_t, 256>::write
>>>  6,97%  libzmq.so.5.2.3  [.] zmq::msg_t::close
>>>  5,48%  libzmq.so.5.2.3  [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
>>>  5,40%  libzmq.so.5.2.3  [.] zmq::pipe_t::write
>>>  4,94%  libzmq.so.5.2.3  [.] zmq::shared_message_memory_allocator::inc_ref
>>>  2,59%  libzmq.so.5.2.3  [.] zmq::msg_t::init_external_storage
>>>  1,63%  [kernel]         [k] copy_user_enhanced_fast_string
>>>  1,56%  libzmq.so.5.2.3  [.] zmq::msg_t::data
>>>  1,43%  libzmq.so.5.2.3  [.] zmq::msg_t::init
>>>  1,34%  libzmq.so.5.2.3  [.] zmq::pipe_t::check_write
>>>  1,24%  libzmq.so.5.2.3  [.] zmq::stream_engine_base_t::in_event_internal
>>>  1,24%  libzmq.so.5.2.3  [.] zmq::msg_t::size
>>>
>>> Do you know what this profile might mean?
>>> I would expect that ZMQ background thread to be topping out in its read() system call (from the TCP socket)...
>>>
>>> Thanks,
>>> Francesco
>>>
>>>
>>> On Fri, Jul 19, 2019 at 18:15 Francesco <[email protected]> wrote:
>>>
>>> Hi Yan,
>>> Unfortunately I have interrupted my attempts in this area after getting some strange results (possibly due to the fact that I tried in a complex application context... I should probably try hacking a simple zeromq example instead!).
>>>
>>> I'm also a bit surprised that nobody has tried and posted online a way to achieve something similar (memory-pooled zmq send)... But anyway it remains in my plans to try that out when I have a bit more spare time...
>>> If you manage to get some results earlier, I would be eager to know :-)
>>>
>>> Francesco
>>>
>>>
>>> On Fri, Jul 19, 2019, 04:02 Yan, Liming (NSB - CN/Hangzhou) <[email protected]> wrote:
>>>
>>> Hi, Francesco
>>> Could you please share the final solution and benchmark results for plan 2? Big thanks.
>>> I'm asking because I had tried something similar before with zmq_msg_init_data() and zmq_msg_send(), but failed because of two issues. 1) My process runs in the background for a long time, and I eventually found that it occupies more and more memory until it exhausts the system memory; it seems there is a memory leak this way. 2) I provided *ffn for deallocation, but memory is freed back much more slowly than it is consumed, so eventually my own customized pool could also be exhausted. How do you solve this?
>>> I had to go back to using zmq_send(). I know it has a memory-copy penalty, but it's the easiest and most stable way to send a message. I'm still using 0MQ 4.1.x.
>>> Thanks.
>>>
>>> BR
>>> Yan Limin
>>>
>>> -----Original Message-----
>>> From: zeromq-dev [mailto:[email protected]] On Behalf Of Luca Boccassi
>>> Sent: Friday, July 05, 2019 4:58 PM
>>> To: ZeroMQ development list <[email protected]>
>>> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>>>
>>> There's no need to change the source for experimenting: you can just use _init_data without a callback and with a callback (yes, the first case will leak memory, but it's just a test), and measure the difference between the two cases. You can then immediately see if it's worth pursuing further optimisations or not.
>>>
>>> _external_storage is an implementation detail, and it's non-shared because it's used in the receive case only, with a reference to the TCP buffer used in the system call for zero-copy receives. Exposing it means that those kinds of messages could not be used with pub-sub or radio-dish, as they can't have multiple references without copying them, which means there would be a semantic difference between the different message initialisation APIs, unlike now, when the difference is only in who owns the buffer. It would make the API quite messy in my opinion, and be quite confusing, as pub/sub is probably the most well-known pattern.
>>>
>>> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>>>
>>> Hi Luca,
>>> thanks for the details. Indeed I understand why the "content_t" needs to be allocated dynamically: it's just like the control block used by STL's std::shared_ptr<>.
>>>
>>> And you're right: I'm not sure how much gain there is in removing 100% of malloc operations from my TX path... I would still be curious to find out, but right now it seems I need to patch the ZMQ source code to achieve that.
>>>
>>> Anyway, I wonder if it would be possible to expose in the public API a method like "zmq::msg_t::init_external_storage()" which, AFAICS, allows creating a non-shared zero-copy long message... it appears to be used only by the v2 decoder internally right now...
>>> Is there a specific reason why that's not accessible from the public API?
>>>
>>> Thanks,
>>> Francesco
>>>
>>>
>>> On Thu, Jul 4, 2019 at 20:25 Luca Boccassi <[email protected]> wrote:
>>>
>>> Another reason for that small struct to be on the heap is so that it can be shared among all the copies of the message (e.g. a pub socket has N copies of the message on the stack, one for each subscriber). The struct has an atomic counter in it, so that when all the copies of the message on the stack have been closed, the userspace buffer deallocation callback can be invoked. If the atomic counter were on the stack, inlined in the message, this wouldn't work.
>>> So even if room were to be found, a malloc would still be needed.
>>>
>>> If you _really_ are worried about it, and testing shows it makes a difference, then one option could be to pre-allocate a set of these metadata structures at startup, and just assign them when the message is created. It's possible, but it increases complexity quite a bit, so it needs to be worth it.
>>>
>>> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>>>
>>> The second malloc cannot be avoided, but it's tiny and fixed in size at compile time, so the compiler and glibc will be able to optimize it to death.
>>>
>>> The reason for that is that there's not enough room in the 64 bytes to store that structure, and increasing the message allocation on the stack past 64 bytes means it will no longer fit in a single cache line, which will incur a performance penalty far worse than the small malloc (I tested this some time ago). That is of course unless you are running on s390 or a POWER with a 256-byte cacheline, but given it's part of the ABI it would be a bit of a mess for the benefit of very few users, if any.
>>>
>>> So I'd recommend just going with the second plan, and comparing the result when passing a deallocation function vs not passing it (yes, it will leak the memory, but it's just for the test). My bet is that the difference will not be that large.
>>>
>>> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>>>
>>> Hi Stephan, hi Luca,
>>>
>>> thanks for your hints. However, I inspected
>>> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>>> and I don't think it's saving any malloc()... see my point 2) below.
>>>
>>> Indeed, I realized that probably the current ZMQ API does not allow me to achieve 100% of what I intended to do.
>>> Let me rephrase my target: my target is to be able to
>>> - memory pool creation: do a large memory allocation of, say, 1M zmq_msg_t only at the start of my program; let's say I create all these zmq_msg_t with a size of 2k bytes each (let's assume this is the max message size possible in my app);
>>> - during the application lifetime: call zmq_msg_send() at any time, always avoiding malloc() operations (just picking the first available unused zmq_msg_t entry from the memory pool).
>>>
>>> Initially I thought that was possible, but I think I have identified 2 blocking issues:
>>> 1) If I try to recycle zmq_msg_t directly: in this case I will fail, because I cannot really change only the "size" member of a zmq_msg_t without reallocating it... so I'm forced (in my example) to always send 2k bytes out (!!)
>>> 2) If I create only a memory pool of 2k-byte buffers and then wrap the first available buffer inside a zmq_msg_t (allocated on the stack, not on the heap): in this case I need to know when the internals of ZMQ have finished using the zmq_msg_t, and thus when I can mark that buffer as available again in my memory pool. However, I see that the zmq_msg_init_data() ZMQ code contains:
>>>
>>>     //  Initialize constant message if there's no need to deallocate
>>>     if (ffn_ == NULL) {
>>>         ...
>>>         _u.cmsg.data = data_;
>>>         _u.cmsg.size = size_;
>>>         ...
>>>     } else {
>>>         ...
>>>         _u.lmsg.content = static_cast<content_t *> (malloc (sizeof (content_t)));
>>>         ...
>>>         _u.lmsg.content->data = data_;
>>>         _u.lmsg.content->size = size_;
>>>         _u.lmsg.content->ffn = ffn_;
>>>         _u.lmsg.content->hint = hint_;
>>>         new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>>>     }
>>>
>>> So I skip the malloc() operation only if I pass ffn_ == NULL. The problem is that if I pass ffn_ == NULL, then I have no way to know when the internals of ZMQ have finished using the zmq_msg_t...
>>>
>>> Any way to work around either issue 1) or issue 2)?
>>>
>>> I understand that the malloc is just of sizeof(content_t) ~= 40B... but still I'd like to avoid it...
>>>
>>> Thanks!
>>> Francesco
>>>
>>>
>>> On Thu, Jul 4, 2019 at 14:58 Stephan Opfer <[email protected]> wrote:
>>>
>>> On 04.07.19 14:29, Luca Boccassi wrote:
>>>> How users make use of these primitives is up to them though, I don't think anything special was shared before, as far as I remember.
>>>
>>> Some examples can be found here:
>>> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>>>
>>> The classes Publisher and Subscriber should replace the publisher and subscriber in a former Robot-Operating-System-based system. I hope that the subscriber is actually using the method Luca is talking about on the receiving side.
>>>
>>> The message data here is a Cap'n Proto container that we "simply" serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
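For reference, the send-side pattern that Doron and Luca describe above (a pooled buffer wrapped with zmq_msg_init_data(), whose free function hands it back to the pool) boils down to roughly the sketch below. my_pool_t, pool_get and pool_put are hypothetical stand-ins for the user's own pool (e.g. the lock-free SPSC queue used in Francesco's PR); here they simply forward to malloc/free so the example compiles:

    #include <stdlib.h>
    #include <string.h>
    #include <zmq.h>

    /* Stand-in "pool": forwards to malloc/free just to keep the sketch
       self-contained. A real pool must be thread-safe, because the free
       function below is typically invoked from libzmq's I/O thread once
       the message has been written out. */
    typedef struct my_pool_t { int unused; } my_pool_t;
    static void *pool_get (my_pool_t *p, size_t n) { (void) p; return malloc (n); }
    static void pool_put (my_pool_t *p, void *buf) { (void) p; free (buf); }

    /* Called by libzmq when the last copy of the message is closed. */
    static void pool_free_fn (void *data, void *hint)
    {
        pool_put ((my_pool_t *) hint, data);
    }

    static int send_pooled (void *socket, my_pool_t *pool,
                            const void *payload, size_t len)
    {
        void *buf = pool_get (pool, len);
        if (!buf)
            return -1;
        memcpy (buf, payload, len);

        zmq_msg_t msg;
        /* Registers pool_free_fn/pool as the buffer's deallocator; this is
           the zmq_msg_init_data() branch that allocates the ~40-byte
           content_t discussed above. */
        if (zmq_msg_init_data (&msg, buf, len, pool_free_fn, pool) != 0) {
            pool_put (pool, buf);
            return -1;
        }
        if (zmq_msg_send (&msg, socket, 0) == -1) {
            zmq_msg_close (&msg); /* triggers pool_free_fn */
            return -1;
        }
        return 0;
    }

This covers buffer reuse on the send path, but, as the rest of the thread points out, it neither avoids the content_t allocation nor helps the receive side, which is what the allocator proposal above is meant to address.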
_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
