Hi Luca,

Thanks for the explanation. It seems like there is no need to do memory pooling for packet RX then, right? One allocation every ~19KB seems pretty efficient already (nice work! :))
Still, I wonder if we can somehow improve the performance of zmq::v2_decoder_t::size_ready, since that function appears to be the bottleneck in my latest performance benchmarks (see my previous email). My feeling is that if memory management is not a problem along the RX path, then a single zmq background IO thread/core (on a fast CPU) should be able to do more than the ~2 Mpps limit that I found...

My concern is that this is a fundamental limit of zmq scalability: since a single zmq socket is always handled by a single zmq background thread, even if I buy 100 Gbps of bandwidth I will not be able to use more than 2-3 Gbps of it when sending 64B-long messages on that socket.

Thanks for any hint or comment,
Francesco

Il ven 16 ago 2019, 17:20 Luca Boccassi <luca.bocca...@gmail.com> ha scritto:
> The message structures themselves are always on the stack. The TCP receive
> is batched, and if there are multiple messages in an 8KB kernel buffer,
> each message's content_t simply points to the right place for its data in
> that shared buffer, which is refcounted. The content_t structures are also
> in the same memory zone, which is split to leave room for
> 8KB/minimum_msg_size + 1 of them - so in practice there is one allocation
> of ~19KB which is shared by as many messages as can fit their data in the
> 8KB received in one TCP read.
>
> On Fri, 2019-08-16 at 16:46 +0200, Francesco wrote:
> Hi Doron,
> Ok, the zmq_msg_init_allocator approach looks fine to me. I hope I have
> time to work on that in the next couple of weeks (unless someone else
> wants to step in, of course :-) ).
>
> Anyway, the current approach works for sending messages... I wonder how
> the RX side works and whether we could exploit memory pooling there as
> well... Is there any kind of documentation (or some email thread,
> perhaps) on how the engine works for RX?
>
> I know there is some zero-copy mechanism in place, but it's not totally
> clear to me: is the zmq_msg_t coming out of the zmq API pointing directly
> to the kernel buffers?
>
> Thanks,
> Francesco
>
> Il gio 15 ago 2019, 11:39 Doron Somech <somdo...@gmail.com> ha scritto:
> Maybe zmq_msg_init_allocator, which accepts the allocator.
>
> With that pattern we do need the release method; the zmq_msg will handle
> it internally and register the release method as the free method of the
> zmq_msg. They need to have the same signature.
>
> On Thu, Aug 15, 2019 at 12:35 PM Francesco <francesco.monto...@gmail.com> wrote:
> Hi Doron, hi Jens,
> Yes, the allocator method is a nice solution. I think it would be nice to
> have libzmq also provide a memory pool implementation, but keep the
> malloc/free implementation as the default for backward compatibility.
>
> It's also important to have a smart allocator that internally contains
> not just one but several pools for different packet size classes, to
> avoid memory waste. But I think this can fit easily into the allocator
> pattern sketched out by Jens.
>
> Btw, another issue unrelated to the allocator API but related to
> performance: I think it's important to avoid not only the allocation of
> the msg buffer but also the allocation of the content_t structure.
> Indeed, in my preliminary merge request I modified zmq_msg_t of
> type_lmsg to use the first 40B inside the pooled buffer. Of course this
> approach is not backward compatible with the _init_data() semantics.
> How do you think this would best be approached? I guess we may have a
> new _init_data_and_controlblock() helper that does the trick of taking
> the first 40 bytes of the provided buffer?
>
> Thanks,
> Francesco
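(For illustration, a minimal sketch of the buffer layout that proposal implies. Everything here is invented for the example: pool_entry_t, the two size constants, and zmq_msg_init_data_and_controlblock itself, which is only the hypothetical helper discussed above, not an existing libzmq call.)

    // Sketch: each pool entry reserves its first ~40 bytes for the
    // content_t control block, so the control block and the payload come
    // from one up-front allocation instead of a per-message malloc.
    #include <cstddef>

    static const size_t CONTROL_BLOCK_SIZE = 40; // approx. sizeof (content_t) on 64-bit
    static const size_t MAX_PAYLOAD = 2048;      // max message size of this pool

    struct pool_entry_t
    {
        unsigned char storage[CONTROL_BLOCK_SIZE + MAX_PAYLOAD];
        unsigned char *payload () { return storage + CONTROL_BLOCK_SIZE; }
    };

    // Hypothetical usage: the helper would carve the control block out of
    // the head of 'storage' and point the message data at 'payload ()':
    //   zmq_msg_init_data_and_controlblock (&msg, entry->storage,
    //       CONTROL_BLOCK_SIZE + used_len, release_fn, pool);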
> Il mer 14 ago 2019, 22:23 Doron Somech <somdo...@gmail.com> ha scritto:
> Jens, I like the idea.
>
> We actually don't need the release method. The signature of allocate
> should receive the zmq_msg and allocate it:
>
>     int (*allocate) (zmq_msg_t *msg, size_t size, void *obj);
>
> When the allocator creates the zmq_msg, it will provide the release
> method to the zmq_msg in the constructor.
>
> This is important in order to forward messages between sockets, so the
> release method is part of the msg. This is already supported by zmq_msg,
> which accepts a free method with a hint (obj in your example).
>
> The return value of allocate will be a success indication, like the rest
> of the zeromq methods.
>
> zeromq actually already supports a pool mechanism when sending, using
> the zmq_msg API. Receiving is the problem, and your suggestion solves it
> nicely.
>
> By the way, a memory pool is already supported in NetMQ, with a solution
> very similar to what you suggested (it is global for all sockets,
> without override).
>
> On Wed, Aug 14, 2019, 22:41 Jens Auer <jens.a...@betaversion.net> wrote:
> Hi,
>
> Maybe this can be combined with a request that I have seen a couple of
> times, to be able to configure the allocator used in libzmq? I am
> thinking of something like:
>
>     struct zmq_allocator {
>         void *obj;
>         void *(*allocate) (size_t n, void *obj);
>         void (*release) (void *ptr, void *obj);
>     };
>
>     void *useMalloc (size_t n, void *) { return malloc (n); }
>     void freeMalloc (void *ptr, void *) { free (ptr); }
>
>     zmq_allocator &zmq_default_allocator () {
>         static zmq_allocator defaultAllocator = {nullptr, useMalloc, freeMalloc};
>         return defaultAllocator;
>     }
>
> The context could then store the allocator for libzmq, and users could
> set a specific allocator as a context option, e.g. with zmq_ctx_set. A
> socket created for a context can then inherit the default allocator or
> set a special allocator as a socket option.
>
>     class MemoryPool { /*...*/ }; // hopefully thread-safe
>
>     MemoryPool pool;
>
>     void *allocatePool (size_t n, void *pool) {
>         return static_cast<MemoryPool *> (pool)->allocate (n);
>     }
>     void releasePool (void *ptr, void *pool) {
>         static_cast<MemoryPool *> (pool)->release (ptr);
>     }
>
>     zmq_allocator pooledAllocator = {&pool, allocatePool, releasePool};
>
>     void *ctx = zmq_ctx_new ();
>     zmq_ctx_set (ctx, ZMQ_ALLOCATOR, &pooledAllocator); // proposed option
>
> Cheers,
> Jens
>
> Am 13.08.2019 um 13:24 schrieb Francesco <francesco.monto...@gmail.com>:
> Hi all,
>
> today I've taken some time to attempt building a memory-pooling
> mechanism into the ZMQ local_thr/remote_thr benchmarking utilities.
> Here's the result:
> https://github.com/zeromq/libzmq/pull/3631
> This PR is a work in progress and is a simple modification to show the
> effect of avoiding malloc/free when creating zmq_msg_t with the standard
> benchmark utils of ZMQ.
>
> In particular, the very fast, zero-lock,
> single-producer/single-consumer queue from:
> https://github.com/cameron314/readerwriterqueue
> is used to maintain, between the "remote_thr" main thread and its ZMQ
> background IO thread, a list of free buffers that can be reused.
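(For illustration, a heavily simplified sketch of that pattern - this is not the actual PR code; the pool depth and buffer size are made up, the pool is assumed to be pre-filled at startup, and len_ is assumed to be <= MSG_SIZE. The SPSC constraint holds because exactly one thread dequeues - the application - and one thread enqueues - the IO thread running the free callback.)

    #include <cstdlib>
    #include <cstring>
    #include <zmq.h>
    #include "readerwriterqueue.h" // https://github.com/cameron314/readerwriterqueue

    static const size_t MSG_SIZE = 210; // fixed max message size of the pool
    static moodycamel::ReaderWriterQueue<void *> free_list (8192);

    // zmq invokes this (on the background IO thread) once it is completely
    // done with the buffer; recycle it instead of free()ing it.
    static void return_to_pool (void *data_, void *)
    {
        free_list.enqueue (data_);
    }

    static int pooled_send (void *socket_, const void *payload_, size_t len_)
    {
        void *buf;
        if (!free_list.try_dequeue (buf))
            buf = malloc (MSG_SIZE); // pool empty: fall back to malloc
        memcpy (buf, payload_, len_);
        zmq_msg_t msg;
        zmq_msg_init_data (&msg, buf, len_, return_to_pool, NULL);
        return zmq_msg_send (&msg, socket_, 0);
    }

Note that this removes the malloc/free of the payload, but still pays the malloc (sizeof (content_t)) inside zmq_msg_init_data that is discussed further down in this thread.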
> Here are the graphical results:
>
> with mallocs / no memory pool:
> https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
> with memory pool:
> https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png
>
> Doing the math, the memory-pooled approach shows:
>
> mostly the same performance for messages <= 32B,
> +15% pps/throughput increase @ 64B,
> +60% pps/throughput increase @ 128B,
> +70% pps/throughput increase @ 210B
>
> [the tests were stopped at 210B because my current quick-and-dirty
> memory pool approach has a fixed max msg size of about 210B].
>
> Honestly this is not a huge speedup, even if it is still interesting.
> Indeed, with these changes the performance now seems to be bounded by
> the "local_thr" side and not by "remote_thr" anymore: the zmq background
> IO thread of local_thr is the only thread at 100% across the two
> systems, and its "perf top" now shows:
>
>     15,02%  libzmq.so.5.2.3  [.] zmq::metadata_t::add_ref
>     14,91%  libzmq.so.5.2.3  [.] zmq::v2_decoder_t::size_ready
>      8,94%  libzmq.so.5.2.3  [.] zmq::ypipe_t<zmq::msg_t, 256>::write
>      6,97%  libzmq.so.5.2.3  [.] zmq::msg_t::close
>      5,48%  libzmq.so.5.2.3  [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
>      5,40%  libzmq.so.5.2.3  [.] zmq::pipe_t::write
>      4,94%  libzmq.so.5.2.3  [.] zmq::shared_message_memory_allocator::inc_ref
>      2,59%  libzmq.so.5.2.3  [.] zmq::msg_t::init_external_storage
>      1,63%  [kernel]         [k] copy_user_enhanced_fast_string
>      1,56%  libzmq.so.5.2.3  [.] zmq::msg_t::data
>      1,43%  libzmq.so.5.2.3  [.] zmq::msg_t::init
>      1,34%  libzmq.so.5.2.3  [.] zmq::pipe_t::check_write
>      1,24%  libzmq.so.5.2.3  [.] zmq::stream_engine_base_t::in_event_internal
>      1,24%  libzmq.so.5.2.3  [.] zmq::msg_t::size
>
> Do you know what this profile might mean? I would expect that ZMQ
> background thread to be topping out in its read() system call (from the
> TCP socket)...
>
> Thanks,
> Francesco
>
> Il giorno ven 19 lug 2019 alle ore 18:15 Francesco
> <francesco.monto...@gmail.com> ha scritto:
> Hi Yan,
> Unfortunately I have interrupted my attempts in this area after getting
> some strange results (possibly due to the fact that I tried in a complex
> application context... I should probably try hacking a simple zeromq
> example instead!).
>
> I'm also a bit surprised that nobody has tried and posted online a way
> to achieve something similar (memory-pooled zmq send)... But anyway, it
> remains in my plans to try that out when I have a bit more spare time...
> If you manage to get some results earlier, I would be eager to know :-)
>
> Francesco
>
> Il ven 19 lug 2019, 04:02 Yan, Liming (NSB - CN/Hangzhou)
> <liming....@nokia-sbell.com> ha scritto:
> Hi, Francesco
> Could you please share the final solution and benchmark result for plan
> 2? Big thanks.
> I'm concerned about this because I had tried something similar before
> with zmq_msg_init_data() and zmq_msg_send() but failed, because of two
> issues. 1) My process runs in the background for a long time, and I
> eventually found that it occupies more and more memory, until it
> exhausts the system memory. It seems there's a memory leak this way.
> 2) I provided *ffn for deallocation, but memory is freed back much more
> slowly than the consumer drains the pool, so eventually my own
> customized pool could also be exhausted. How do you solve this?
> I had to turn back to zmq_send(). I know it has a memory-copy penalty,
> but it's the easiest and most stable way to send a message. I'm still
> using 0MQ 4.1.x.
> Thanks.
>
> BR,
> Yan Limin
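(On Yan's second issue: one way to keep the pool from being silently exhausted is to make acquisition block until the free callback hands a buffer back, so the pool size bounds the number of messages in flight. A minimal sketch under that assumption - the class is invented here, requires C++11, and trades send-side throughput for bounded memory:)

    #include <condition_variable>
    #include <mutex>
    #include <vector>

    class blocking_pool_t
    {
        std::mutex _mtx;
        std::condition_variable _cv;
        std::vector<void *> _free;

      public:
        blocking_pool_t (size_t count_, size_t size_)
        {
            for (size_t i = 0; i < count_; ++i)
                _free.push_back (::operator new (size_));
        }

        void *acquire ()
        {
            std::unique_lock<std::mutex> lock (_mtx);
            // Block instead of exhausting the pool:
            _cv.wait (lock, [this] { return !_free.empty (); });
            void *buf = _free.back ();
            _free.pop_back ();
            return buf;
        }

        // Signature matches zmq_free_fn, so it can be passed directly to
        // zmq_msg_init_data () with the pool itself as the hint:
        static void release_cb (void *data_, void *hint_)
        {
            blocking_pool_t *self = static_cast<blocking_pool_t *> (hint_);
            std::lock_guard<std::mutex> lock (self->_mtx);
            self->_free.push_back (data_);
            self->_cv.notify_one ();
        }
    };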
> -----Original Message-----
> From: zeromq-dev [mailto:zeromq-dev-boun...@lists.zeromq.org] On Behalf
> Of Luca Boccassi
> Sent: Friday, July 05, 2019 4:58 PM
> To: ZeroMQ development list <zeromq-dev@lists.zeromq.org>
> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>
> There's no need to change the source for experimenting: you can just use
> _init_data without a callback and with a callback (yes, the first case
> will leak memory, but it's just a test), and measure the difference
> between the two cases. You can then immediately see whether it's worth
> pursuing further optimisations or not.
>
> _external_storage is an implementation detail, and it's non-shared
> because it's used in the receive case only, with a reference to the TCP
> buffer used in the system call, for zero-copy receives. Exposing it
> would mean that those kinds of messages could not be used with pub-sub
> or radio-dish, as they can't have multiple references without copying;
> there would then be a semantic difference between the different message
> initialisation APIs, unlike now, when the difference is only in who owns
> the buffer. It would make the API quite messy in my opinion, and be
> quite confusing, as pub/sub is probably the most well-known pattern.
>
> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
> Hi Luca,
> thanks for the details. Indeed I understand why the "content_t" needs to
> be allocated dynamically: it's just like the control block used by STL's
> std::shared_ptr<>.
>
> And you're right: I'm not sure how much gain there is in removing 100%
> of the malloc operations from my TX path... still, I would be curious to
> find out, but right now it seems I would need to patch the ZMQ source
> code to achieve that.
>
> Anyway, I wonder if it would be possible to expose in the public API a
> method like "zmq::msg_t::init_external_storage()" which, AFAICS, allows
> creating a non-shared, zero-copy long message... it appears to be used
> only by the v2 decoder internally right now... Is there a specific
> reason why that's not accessible from the public API?
>
> Thanks,
> Francesco
>
> Il giorno gio 4 lug 2019 alle ore 20:25 Luca Boccassi
> <luca.bocca...@gmail.com> ha scritto:
> Another reason for that small struct to be on the heap is so that it can
> be shared among all the copies of the message (eg: a pub socket has N
> copies of the message on the stack, one for each subscriber). The struct
> has an atomic counter in it, so that when all the copies of the message
> on the stack have been closed, the userspace buffer deallocation
> callback can be invoked. If the atomic counter were inlined in the
> message on the stack, this wouldn't work. So even if room were to be
> found, a malloc would still be needed.
>
> If you _really_ are worried about it, and testing shows it makes a
> difference, then one option could be to pre-allocate a set of these
> metadata structures at startup, and just assign them when the message is
> created. It's possible, but it increases complexity quite a bit, so it
> needs to be worth it.
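(A conceptual sketch of that last suggestion - this code does not exist in libzmq, the 40-byte figure is the approximate sizeof (content_t) quoted below, and a real version would need to be thread-safe, which is exactly the complexity being warned about:)

    #include <cstdlib>

    // Pre-allocate blocks big enough to hold a content_t and chain them
    // into an intrusive freelist: handing one out is a pointer pop instead
    // of a malloc. While a block is in use, its whole 40-byte footprint
    // (including the 'next' field) is reused as content_t storage.
    struct content_block_t
    {
        content_block_t *next;
        unsigned char pad[40 - sizeof (content_block_t *)];
    };

    class content_pool_t
    {
        content_block_t *_head;

      public:
        explicit content_pool_t (size_t n_) : _head (NULL)
        {
            for (size_t i = 0; i < n_; ++i) {
                content_block_t *b =
                  static_cast<content_block_t *> (malloc (sizeof *b));
                b->next = _head;
                _head = b;
            }
        }

        // Returns NULL when exhausted; the caller falls back to malloc.
        void *get ()
        {
            if (!_head)
                return NULL;
            content_block_t *b = _head;
            _head = b->next;
            return b;
        }

        void put (void *p_)
        {
            content_block_t *b = static_cast<content_block_t *> (p_);
            b->next = _head;
            _head = b;
        }
    };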
> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
> The second malloc cannot be avoided, but it's tiny and fixed in size at
> compile time, so the compiler and glibc will be able to optimize it to
> death.
>
> The reason for that is that there's not enough room in the 64 bytes to
> store that structure, and increasing the message allocation on the stack
> past 64 bytes means it will no longer fit in a single cache line, which
> incurs a performance penalty far worse than the small malloc (I tested
> this some time ago). That is of course unless you are running on s390 or
> a POWER with a 256-byte cacheline, but given that it's part of the ABI,
> changing it would be a bit of a mess for the benefit of very few users,
> if any.
>
> So I'd recommend just going with the second plan, and comparing the
> results when passing a deallocation function vs not passing one (yes, it
> will leak the memory, but it's just for the test). My bet is that the
> difference will not be that large.
>
> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
> Hi Stephan, hi Luca,
> thanks for your hints. However, I inspected
> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
> and I don't think it avoids malloc()... see my point 2) below.
>
> Indeed, I realized that the current ZMQ API probably does not allow me
> to achieve 100% of what I intended to do. Let me rephrase my target. My
> target is to be able to:
> - memory pool creation: do a large memory allocation of, say, 1M
>   zmq_msg_t only at the start of my program; let's say I create all
>   these zmq_msg_t with a size of 2k bytes each (let's assume this is the
>   max message size possible in my app);
> - during the application lifetime: call zmq_msg_send() at any time,
>   always avoiding malloc() operations (just picking the first available
>   unused entry of zmq_msg_t from the memory pool).
>
> Initially I thought that was possible, but I think I have identified 2
> blocking issues:
> 1) If I try to recycle zmq_msg_t directly: in this case I will fail,
> because I cannot really change only the "size" member of a zmq_msg_t
> without reallocating it... so I'm forced (in my example) to always send
> 2k bytes out (!!)
> 2) If I create only a memory pool of 2k-byte buffers and then wrap the
> first available buffer inside a zmq_msg_t (allocated on the stack, not
> on the heap): in this case I need to know when the internals of ZMQ have
> finished using the zmq_msg_t, and thus when I can mark that buffer as
> available again in my memory pool. However, I see that the
> zmq_msg_init_data() ZMQ code contains:
>
>     //  Initialize constant message if there's no need to deallocate
>     if (ffn_ == NULL) {
>         ...
>         _u.cmsg.data = data_;
>         _u.cmsg.size = size_;
>         ...
>     } else {
>         ...
>         _u.lmsg.content =
>           static_cast<content_t *> (malloc (sizeof (content_t)));
>         ...
>         _u.lmsg.content->data = data_;
>         _u.lmsg.content->size = size_;
>         _u.lmsg.content->ffn = ffn_;
>         _u.lmsg.content->hint = hint_;
>         new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>     }
>
> So I skip the malloc() operation only if I pass ffn_ == NULL. The
> problem is that if I pass ffn_ == NULL, then I have no way to know when
> the internals of ZMQ have finished using the zmq_msg_t...
>
> Any way to work around either issue 1) or issue 2)?
>
> I understand that the malloc is just of sizeof (content_t) ~= 40B... but
> still, I'd like to avoid it...
>
> Thanks!
> Francesco
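(For reference, a sketch of the usual way around both issues - the pool type and its release hook are hypothetical, while the zmq_msg_* calls are the real C API. For issue 1, zmq_msg_init_data() is simply told the number of bytes actually used; it does not care that the underlying pool buffer is really 2k. For issue 2, a non-NULL ffn_ is exactly the "ZMQ is done with this buffer" notification, at the cost of the small content_t malloc discussed above:)

    #include <zmq.h>

    struct my_pool_t;                                   // hypothetical pool type
    void my_pool_return (my_pool_t *pool_, void *buf_); // hypothetical release hook, defined elsewhere

    // Matches zmq_free_fn: called by zmq once all copies of the message
    // have been closed.
    static void pool_free_fn (void *data_, void *hint_)
    {
        my_pool_return (static_cast<my_pool_t *> (hint_), data_);
    }

    // buf_ is a 2k pool buffer; len_ is the number of bytes actually used.
    static int send_pooled (void *socket_, my_pool_t *pool_, void *buf_, size_t len_)
    {
        zmq_msg_t msg;
        zmq_msg_init_data (&msg, buf_, len_, pool_free_fn, pool_); // no copy
        return zmq_msg_send (&msg, socket_, 0); // buffer returns to the pool via pool_free_fn
    }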
> Il giorno gio 4 lug 2019 alle ore 14:58 Stephan Opfer
> <op...@vs.uni-kassel.de> ha scritto:
> On 04.07.19 14:29, Luca Boccassi wrote:
> > How users make use of these primitives is up to them though, I don't
> > think anything special was shared before, as far as I remember.
>
> Some examples can be found here:
> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>
> The classes Publisher and Subscriber should replace the publisher and
> subscriber in a former Robot-Operating-System-based system. I hope that
> the subscriber is actually using the method Luca is talking about on the
> receiving side.
>
> The message data here is a Cap'n Proto container that we "simply"
> serialize and send via ZeroMQ - therefore the name Cap'nZero ;-)
_______________________________________________
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev