Hi all,
today I've taken some time to attempt building a memory-pooling mechanism in the ZMQ local_thr/remote_thr benchmarking utilities. Here's the result: https://github.com/zeromq/libzmq/pull/3631

This PR is a work in progress and is a simple modification that shows the effect of avoiding malloc/free when creating zmq_msg_t in the standard ZMQ benchmark utilities. In particular, the very fast, lock-free, single-producer/single-consumer queue from https://github.com/cameron314/readerwriterqueue is used to maintain, between the "remote_thr" main thread and its ZMQ background IO thread, a list of free buffers that can be reused.
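To make the buffer-recycling idea concrete, here is a minimal sketch of the scheme (illustrative only, not the code in the PR: the pool sizes, the return_to_pool() callback and the PUSH socket setup are invented for the example, error handling is omitted, and it assumes the deallocation callback only ever fires on the background IO thread, so the queue stays strictly single-producer/single-consumer):

    // Sketch: recycle fixed-size payload buffers through a lock-free SPSC queue
    // instead of calling malloc()/free() for every zmq_msg_t payload.
    #include <zmq.h>
    #include <string.h>
    #include <vector>
    #include "readerwriterqueue.h" // moodycamel::ReaderWriterQueue

    static const size_t MAX_MSG_SIZE = 256;  // pool entries have a fixed maximum size
    static const size_t POOL_ENTRIES = 8192;

    // Free-buffer list shared by exactly two threads:
    //  - producer: the ZMQ background IO thread, which gives buffers back via the
    //    zmq_msg_t deallocation callback once it has finished with them;
    //  - consumer: the application thread, which picks free buffers to send.
    static moodycamel::ReaderWriterQueue<void *> free_buffers (POOL_ENTRIES);

    // Deallocation callback passed to zmq_msg_init_data(): instead of free(),
    // push the payload buffer back into the pool.
    static void return_to_pool (void *data_, void *)
    {
        free_buffers.enqueue (data_);
    }

    int main ()
    {
        // Allocate the whole pool up front, before the context (and its IO
        // thread) exists; after this point the send path no longer
        // mallocs/frees the payload.
        std::vector<char> slab (POOL_ENTRIES * MAX_MSG_SIZE);
        for (size_t i = 0; i < POOL_ENTRIES; i++)
            free_buffers.enqueue (&slab[i * MAX_MSG_SIZE]);

        void *ctx = zmq_ctx_new ();
        void *sock = zmq_socket (ctx, ZMQ_PUSH);
        zmq_connect (sock, "tcp://127.0.0.1:5555");

        const size_t msg_size = 128;
        for (int i = 0; i < 1000000; i++) {
            void *buf = NULL;
            while (!free_buffers.try_dequeue (buf))
                ; // pool temporarily empty: wait for the IO thread to return a buffer

            memset (buf, 'A', msg_size);

            // Zero-copy send: libzmq owns the buffer until it calls return_to_pool().
            zmq_msg_t msg;
            zmq_msg_init_data (&msg, buf, msg_size, return_to_pool, NULL);
            zmq_msg_send (&msg, sock, 0);
        }

        zmq_close (sock);
        zmq_ctx_term (ctx);
        return 0;
    }

Note that even with a non-NULL deallocation function, libzmq still performs one small malloc per message for its internal content_t reference-counting block (exactly the point discussed further down in this thread), so the pool removes the payload allocation but not that one.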
Here are the graphical results:

  with mallocs / no memory pool: https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
  with memory pool:              https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png

Doing the math, the memory-pooled approach shows:
 - roughly the same performance for messages <= 32B;
 - +15% pps/throughput at 64B;
 - +60% pps/throughput at 128B;
 - +70% pps/throughput at 210B.
(The tests were stopped at 210B because my current quick-and-dirty memory pool has a fixed maximum message size of about 210B.)

Honestly this is not a huge speedup, even if it is still interesting. With these changes the performance now seems to be bounded by the "local_thr" side rather than by "remote_thr": the ZMQ background IO thread of local_thr is the only thread at 100% CPU across the two systems, and its "perf top" now shows:

  15,02%  libzmq.so.5.2.3  [.] zmq::metadata_t::add_ref
  14,91%  libzmq.so.5.2.3  [.] zmq::v2_decoder_t::size_ready
   8,94%  libzmq.so.5.2.3  [.] zmq::ypipe_t<zmq::msg_t, 256>::write
   6,97%  libzmq.so.5.2.3  [.] zmq::msg_t::close
   5,48%  libzmq.so.5.2.3  [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
   5,40%  libzmq.so.5.2.3  [.] zmq::pipe_t::write
   4,94%  libzmq.so.5.2.3  [.] zmq::shared_message_memory_allocator::inc_ref
   2,59%  libzmq.so.5.2.3  [.] zmq::msg_t::init_external_storage
   1,63%  [kernel]         [k] copy_user_enhanced_fast_string
   1,56%  libzmq.so.5.2.3  [.] zmq::msg_t::data
   1,43%  libzmq.so.5.2.3  [.] zmq::msg_t::init
   1,34%  libzmq.so.5.2.3  [.] zmq::pipe_t::check_write
   1,24%  libzmq.so.5.2.3  [.] zmq::stream_engine_base_t::in_event_internal
   1,24%  libzmq.so.5.2.3  [.] zmq::msg_t::size

Do you know what this profile might mean? I would have expected that ZMQ background thread to spend most of its time in the read() system call (reading from the TCP socket)...

Thanks,
Francesco

On Fri, 19 Jul 2019 at 18:15, Francesco <[email protected]> wrote:
>
> Hi Yan,
> Unfortunately I have interrupted my attempts in this area after getting some
> strange results (possibly due to the fact that I tried in a complex
> application context... I should probably try hacking a simple zeromq example
> instead!).
>
> I'm also a bit surprised that nobody has tried and posted online a way to
> achieve something similar (memory-pooled zmq send)... But anyway it remains
> in my plans to try that out when I have a bit more spare time...
> If you manage to have some results earlier, I would be eager to know :-)
>
> Francesco
>
> On Fri, 19 Jul 2019 at 04:02, Yan, Liming (NSB - CN/Hangzhou)
> <[email protected]> wrote:
>>
>> Hi, Francesco
>> Could you please share the final solution and benchmark result for plan
>> 2? Big thanks.
>> I'm asking about this because I had tried something similar before with
>> zmq_msg_init_data() and zmq_msg_send() but failed because of two issues.
>> 1) My process runs in the background for a long time and I finally found
>> that it occupies more and more memory, until it exhausts the system
>> memory. It seems there's a memory leak with this approach. 2) I provided
>> *ffn for deallocation, but the memory is freed back much more slowly than
>> the consumer needs it, so eventually my own customized pool could also be
>> exhausted. How do you solve this?
>> I had to turn back to using zmq_send().
>> I know it has a memory-copy penalty but it's the easiest and most stable
>> way to send messages. I'm still using 0MQ 4.1.x.
>> Thanks.
>>
>> BR
>> Yan Limin
>>
>> -----Original Message-----
>> From: zeromq-dev [mailto:[email protected]] On Behalf Of Luca Boccassi
>> Sent: Friday, July 05, 2019 4:58 PM
>> To: ZeroMQ development list <[email protected]>
>> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>>
>> There's no need to change the source for experimenting, you can just use
>> _init_data without a callback and with a callback (yes the first case will
>> leak memory but it's just a test), and measure the difference between the
>> two cases. You can then immediately see if it's worth pursuing further
>> optimisations or not.
>>
>> _external_storage is an implementation detail, and it's non-shared because
>> it's used in the receive case only, as it's used with a reference to the TCP
>> buffer used in the system call for zero-copy receives. Exposing that means
>> that those kinds of messages could not be used with pub-sub or radio-dish,
>> as they can't have multiple references without copying them, which means
>> there would be a semantic difference between the different message
>> initialisation APIs, unlike now, when the difference is only in who owns the
>> buffer. It would make the API quite messy in my opinion, and be quite
>> confusing, as pub/sub is probably the most well known pattern.
>>
>> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>> > Hi Luca,
>> > thanks for the details. Indeed I understand why the "content_t" needs
>> > to be allocated dynamically: it's just like the control block used by
>> > STL's std::shared_ptr<>.
>> >
>> > And you're right: I'm not sure how much gain there is in removing 100%
>> > of malloc operations from my TX path... still I would be curious to
>> > find out, but right now it seems I need to patch the ZMQ source code to
>> > achieve that.
>> >
>> > Anyway I wonder if it could be possible to expose in the public API a
>> > method like "zmq::msg_t::init_external_storage()" that, AFAICS, allows
>> > creating a non-shared zero-copy long message... it appears to be used
>> > only by the v2 decoder internally right now...
>> > Is there a specific reason why that's not accessible from the public
>> > API?
>> >
>> > Thanks,
>> > Francesco
>> >
>> > On Thu, 4 Jul 2019 at 20:25, Luca Boccassi <[email protected]> wrote:
>> > > Another reason for that small struct to be on the heap is so that it
>> > > can be shared among all the copies of the message (eg: a pub socket
>> > > has N copies of the message on the stack, one for each subscriber).
>> > > The struct has an atomic counter in it, so that when all the copies
>> > > of the message on the stack have been closed, the userspace buffer
>> > > deallocation callback can be invoked. If the atomic counter were on
>> > > the stack, inlined in the message, this wouldn't work.
>> > > So even if room were to be found, a malloc would still be needed.
>> > >
>> > > If you _really_ are worried about it, and testing shows it makes a
>> > > difference, then one option could be to pre-allocate a set of these
>> > > metadata structures at startup, and just assign them when the
>> > > message is created. It's possible, but increases complexity quite a
>> > > bit, so it needs to be worth it.
>> > > On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>> > > > The second malloc cannot be avoided, but it's tiny and fixed in
>> > > > size at compile time, so the compiler and glibc will be able to
>> > > > optimize it to death.
>> > > >
>> > > > The reason for that is that there's not enough room in the 64
>> > > > bytes to store that structure, and increasing the message
>> > > > allocation on the stack past 64 bytes means it will no longer fit
>> > > > in a single cache line, which will incur a performance penalty far
>> > > > worse than the small malloc (I tested this some time ago). That is
>> > > > of course unless you are running on s390 or a POWER with a
>> > > > 256-byte cacheline, but given it's part of the ABI it would be a
>> > > > bit of a mess for the benefit of very few users, if any.
>> > > >
>> > > > So I'd recommend to just go with the second plan, and compare what
>> > > > the result is when passing a deallocation function vs not passing
>> > > > it (yes it will leak the memory but it's just for the test). My
>> > > > bet is that the difference will not be that large.
>> > > >
>> > > > On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>> > > > > Hi Stephan, Hi Luca,
>> > > > >
>> > > > > thanks for your hints. However I inspected
>> > > > > https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>> > > > > and I don't think it avoids malloc()... see my point 2) below.
>> > > > >
>> > > > > Indeed I realized that probably the current ZMQ API does not
>> > > > > allow me to achieve 100% of what I intended to do.
>> > > > > Let me rephrase my target: my target is to be able to
>> > > > > - memory pool creation: do a large memory allocation of, say, 1M
>> > > > >   zmq_msg_t only at the start of my program; let's say I create
>> > > > >   all these zmq_msg_t with a size of 2k bytes each (let's assume
>> > > > >   this is the maximum message size possible in my app)
>> > > > > - during application lifetime: call zmq_msg_send() at any time,
>> > > > >   always avoiding malloc() operations (just picking the first
>> > > > >   available unused entry of zmq_msg_t from the memory pool).
>> > > > >
>> > > > > Initially I thought that was possible, but I think I have
>> > > > > identified 2 blocking issues:
>> > > > > 1) If I try to recycle zmq_msg_t directly: in this case I will
>> > > > > fail because I cannot really change only the "size" member of a
>> > > > > zmq_msg_t without reallocating it... so I'm forced (in my
>> > > > > example) to always send 2k bytes out (!!)
>> > > > > 2) if I create only a memory pool of buffers of 2k bytes and
>> > > > > then wrap the first available buffer inside a zmq_msg_t
>> > > > > (allocated on the stack, not in the heap): in this case I need
>> > > > > to know when the internals of ZMQ have completed using the
>> > > > > zmq_msg_t and thus when I can mark that buffer as available
>> > > > > again in my memory pool. However I see that the
>> > > > > zmq_msg_init_data() ZMQ code contains:
>> > > > >
>> > > > >     // Initialize constant message if there's no need to deallocate
>> > > > >     if (ffn_ == NULL) {
>> > > > >         ...
>> > > > >         _u.cmsg.data = data_;
>> > > > >         _u.cmsg.size = size_;
>> > > > >         ...
>> > > > >     } else {
>> > > > >         ...
>> > > > >         _u.lmsg.content =
>> > > > >           static_cast<content_t *> (malloc (sizeof (content_t)));
>> > > > >         ...
>> > > > >         _u.lmsg.content->data = data_;
>> > > > >         _u.lmsg.content->size = size_;
>> > > > >         _u.lmsg.content->ffn = ffn_;
>> > > > >         _u.lmsg.content->hint = hint_;
>> > > > >         new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>> > > > >     }
>> > > > >
>> > > > > So I skip the malloc() operation only if I pass ffn_ == NULL. The
>> > > > > problem is that if I pass ffn_ == NULL, then I have no way to
>> > > > > know when the internals of ZMQ have completed using the
>> > > > > zmq_msg_t...
>> > > > >
>> > > > > Any way to work around either issue 1) or issue 2)?
>> > > > >
>> > > > > I understand that the malloc is just of sizeof(content_t) ~= 40B...
>> > > > > but still I'd like to avoid it...
>> > > > >
>> > > > > Thanks!
>> > > > > Francesco
>> > > > >
>> > > > > On Thu, 4 Jul 2019 at 14:58, Stephan Opfer <[email protected]> wrote:
>> > > > > > On 04.07.19 14:29, Luca Boccassi wrote:
>> > > > > > > How users make use of these primitives is up to them though,
>> > > > > > > I don't think anything special was shared before, as far as
>> > > > > > > I remember.
>> > > > > >
>> > > > > > Some examples can be found here:
>> > > > > > https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>> > > > > >
>> > > > > > The classes Publisher and Subscriber should replace the
>> > > > > > publisher and subscriber in a former Robot-Operating-System-based
>> > > > > > system. I hope that the subscriber is actually using the method
>> > > > > > Luca is talking about on the receiving side.
>> > > > > >
>> > > > > > The message data here is a Cap'n Proto container that we "simply"
>> > > > > > serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
>>
>> --
>> Kind regards,
>> Luca Boccassi

_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
