Hi Luca,
Thanks for the explanation. It seems like there is no need to do memory
pooling for packet RX then, right?
One allocation per ~19kB seems pretty efficient already (nice work! :))

Still, I wonder if we can somehow improve the performance of
zmq::v2_decoder_t::size_ready,
since that function appears to be the bottleneck in my latest performance
benchmarks (see my previous email).
My feeling is that if memory management is not a problem along the RX path,
then a single zmq background IO thread/core (on a fast CPU) should be able
to do more than the approx 2 Mpps limit that I found...
My concern is that this is a fundamental limit in zmq scalability: since a
single zmq socket is always handled by a single zmq background thread, that
means that even if I buy 100 Gbps of bandwidth, I will not be able to use
more than 2-3 Gbps of it when sending 64-byte messages on that socket.
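(For reference: 2 Mpps of 64-byte payloads is only about
2,000,000 * 64 B * 8 ≈ 1 Gbps of application data, so even allowing for
framing overhead a single IO thread stays in the low single digits of Gbps
on a 100 Gbps link.)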

Thanks for any hint or comment,
Francesco



On Fri, 16 Aug 2019 at 17:20, Luca Boccassi <luca.bocca...@gmail.com>
wrote:

> The message structures themselves are always on the stack. The TCP
> receive is batched, and if there are multiple messages in an 8KB kernel
> buffer, each message's content_t simply points to the right place for its
> data in that shared buffer, which is refcounted. The content_t structures
> are also in the same memory zone, which is split to allow enough content_t
> for 8KB/minimum_size_msg+1 messages - so in practice there is one
> allocation of ~19KB which is shared by as many messages as can fit their
> data in the 8KB received in one TCP read.
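>
> (For readers following along, here is a rough illustration of that scheme -
> not the actual libzmq code, just a sketch of a refcounted receive buffer
> whose tail provides the per-message control blocks; sizes are indicative
> only:)
>
> #include <atomic>
> #include <cstddef>
>
> // One allocation holds the receive area plus one control block per
> // message that could fit into it; each decoded message points into the
> // data area and shares the same reference count.
> struct content_block {
>     unsigned char *data;          // points inside shared_rx_buffer::data
>     size_t size;
> };
>
> struct shared_rx_buffer {
>     std::atomic<int> refs{1};
>     unsigned char data[8192];     // filled by a single recv() call
>     content_block blocks[256];    // real count depends on the min msg size
> };
>
> // Called when a message is closed: the last reference frees the block.
> void release (shared_rx_buffer *buf)
> {
>     if (buf->refs.fetch_sub (1) == 1)
>         delete buf;
> }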
>
> On Fri, 2019-08-16 at 16:46 +0200, Francesco wrote:
>
> Hi Doron,
> Ok the zmq_msg_init_allocator approach looks fine to me. I hope I have
> time to work on that in the next couple of weeks (unless someone else wants
> to step in of course :-) ).
>
> Anyway the current approach works for sending messages... I wonder how
> the RX side works and whether we could exploit memory pooling for that
> too... Is there any documentation (or maybe an email thread) on how the
> engine works for RX?
>
> I know there is some zero-copy mechanism in place, but it's not totally
> clear to me: does the zmq_msg_t coming out of the zmq API point directly
> at the kernel buffers?
>
> Thanks
> Francesco
>
>
> On Thu, 15 Aug 2019 at 11:39, Doron Somech <somdo...@gmail.com> wrote:
>
> Maybe zmq_msg_init_allocator, which accepts the allocator.
>
> With that pattern we do need the release method; the zmq_msg will handle
> it internally and register the release method as the free method of the
> zmq_msg. They do need to have the same signature.
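>
> Something along these lines, perhaps (the name and signature are only a
> proposal - nothing like this exists in libzmq today - and the release
> callback deliberately matches zmq_free_fn so it can be registered as the
> message's free method):
>
> /* Hypothetical API sketch - not part of libzmq. */
> typedef struct zmq_allocator_t {
>     void *obj;                                   /* pool instance      */
>     void *(*allocate) (size_t size, void *obj);  /* grab a buffer      */
>     void (*release) (void *data, void *hint);    /* same signature as  */
>                                                  /* zmq_free_fn        */
> } zmq_allocator_t;
>
> int zmq_msg_init_allocator (zmq_msg_t *msg, size_t size,
>                             zmq_allocator_t *allocator);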
>
> On Thu, Aug 15, 2019 at 12:35 PM Francesco <francesco.monto...@gmail.com>
> wrote:
>
> Hi Doron, hi Jens,
> Yes, the allocator approach is a nice solution.
> I think it would be nice to have libzmq also provide a memory pool
> implementation, but use the malloc/free implementation as the default for
> backward compatibility.
>
> It's also important to have a smart allocator that internally contains
> not just one but several pools for different packet size classes, to avoid
> memory waste. But I think this can easily fit into the allocator pattern
> sketched out by Jens.
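>
> For example (just a sketch of the size-class idea; the class boundaries
> and names here are made up):
>
> #include <cstddef>
> #include <cstdint>
>
> // Round a request up to a small set of size classes so a 70-byte message
> // does not consume a 2KB pool slot.
> static const size_t size_classes[] = {64, 256, 1024, 4096};
>
> size_t pick_pool (size_t n)
> {
>     for (size_t i = 0; i != sizeof (size_classes) / sizeof (size_classes[0]); ++i)
>         if (n <= size_classes[i])
>             return i;        // index of the pool to allocate from
>     return SIZE_MAX;         // larger than any class: fall back to malloc
> }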
>
> Btw, another issue unrelated to the allocator API but related to
> performance: I think it's important to avoid allocating not only the msg
> buffer but also the content_t structure, and indeed in my preliminary
> merge request I modified zmq_msg_t of type_lmsg to use the first 40 bytes
> inside the pooled buffer.
> Of course this approach is not backward compatible with the _init_data()
> semantics.
> How do you think this would best be approached?
> I guess we could have a new _init_data_and_controlblock() helper that does
> the trick of taking over the first 40 bytes of the provided buffer?
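>
> To illustrate (a layout sketch only; the 40-byte figure and the helper
> name are whatever sizeof(content_t) and the eventual API turn out to be):
>
> #include <cstddef>
>
> // Hypothetical layout of one pooled slot: the first bytes back the lmsg
> // control block, and the payload starts right after them.
> struct pooled_slot {
>     unsigned char control[40];         // would hold content_t (size illustrative)
>     unsigned char payload[2048 - 40];  // zmq_msg_data() would point here
> };
>
> static_assert (sizeof (pooled_slot) == 2048, "one slot per 2KB pool entry");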
>
> Thanks
> Francesco
>
>
> On Wed, 14 Aug 2019 at 22:23, Doron Somech <somdo...@gmail.com> wrote:
>
> Jens I like the idea.
>
> We actually don't need the release method.
> The signature of allocate should receive the zmq_msg and allocate it:
>
> int (*allocate)(zmq_msg_t *msg, size_t size, void *obj);
>
> When the allocator creates the zmq_msg it will provide the release
> method to the zmq_msg in the constructor.
>
> This is important in order to forward messages between sockets, so the
> release method is part of the msg. This is already supported by zmq_msg,
> which accepts a free method with a hint (obj in your example).
>
> The return value of allocate will be a success indication, like the rest
> of the zeromq methods.
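>
> Something like this, in other words (a sketch only; the my_pool_* helpers
> are placeholders, but zmq_msg_init_data() with a free function and hint is
> today's API):
>
> #include <zmq.h>
> #include <stddef.h>
> #include <stdlib.h>
>
> /* Stand-in "pool" using malloc/free just to keep the sketch complete. */
> static void *my_pool_alloc (void *pool, size_t size)
> {
>     (void) pool;
>     return malloc (size);
> }
>
> /* Matches zmq_free_fn, so it can be registered as the msg's free method. */
> static void my_pool_release (void *data, void *pool)
> {
>     (void) pool;
>     free (data);
> }
>
> /* The allocator's allocate(): fills in the zmq_msg and registers the
>    pool's release as the message's free function, with the pool as the
>    hint, so the buffer goes back to the pool even if the message is
>    forwarded to another socket. */
> int my_allocate (zmq_msg_t *msg, size_t size, void *pool)
> {
>     void *data = my_pool_alloc (pool, size);
>     if (!data)
>         return -1;
>     return zmq_msg_init_data (msg, data, size, my_pool_release, pool);
> }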
>
> zeromq actually already supports a pool mechanism when sending, using the
> zmq_msg api. Receiving is the problem; your suggestion solves it nicely.
>
> By the way, memory pooling is already supported in NetMQ with a very
> similar solution to the one you suggested. (It is global for all sockets,
> without an override.)
>
>
>
> On Wed, Aug 14, 2019, 22:41 Jens Auer <jens.a...@betaversion.net> wrote:
>
> Hi,
>
> Maybe this can be combined with a request that I have seen a couple of
> times to be able to configure the allocator used in libzmq? I am thinking
> of something like
>
> struct zmq_allocator {
>     void* obj;
>     void* (*allocate)(size_t n, void* obj);
>     void (*release)(void* ptr, void* obj);
> };
>
> void* useMalloc(size_t n, void*) {return malloc(n);}
> void freeMalloc(void* ptr, void*) {free(ptr);}
>
> zmq_allocator& zmq_default_allocator() {
>     static zmq_allocator defaultAllocator = {nullptr, useMalloc, freeMalloc};
>     return defaultAllocator;
> }
>
> The context could then store the allocator for libzmq, and users could set
> a specific allocator as a context option, e.g. with a zmq_ctx_set. A socket
> created for a context can then inherit the default allocator or set a
> special allocator as a socket option.
>
> class MemoryPool {…}; // hopefully thread-safe
>
> MemoryPool pool;
>
> void* allocatePool(size_t n, void* pool)
> {return static_cast<MemoryPool*>(pool)->allocate(n);}
> void releasePool(void* ptr, void* pool)
> {static_cast<MemoryPool*>(pool)->release(ptr);}
>
> zmq_allocator pooledAllocator {
>     &pool, allocatePool, releasePool
> };
>
> void* ctx = zmq_ctx_new();
> zmq_ctx_set(ctx, ZMQ_ALLOCATOR, &pooledAllocator);
>
> Cheers,
> Jens
>
> On 13 Aug 2019, at 13:24, Francesco <francesco.monto...@gmail.com> wrote:
>
> Hi all,
>
> today I've taken some time to attempt building a memory-pooling
> mechanism into the ZMQ local_thr/remote_thr benchmarking utilities.
> Here's the result:
> https://github.com/zeromq/libzmq/pull/3631
> This PR is a work in progress and is a simple modification to show the
> effects of avoiding malloc/free when creating zmq_msg_t with the
> standard benchmark utils of ZMQ.
>
> In particular, the very fast, lock-free single-producer/single-consumer
> queue from:
> https://github.com/cameron314/readerwriterqueue
> is used to maintain a list of reusable free buffers between the
> "remote_thr" main thread and its ZMQ background IO thread.
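>
> Roughly, the recycling works like this (a simplified sketch, not the exact
> PR code):
>
> #include <zmq.h>
> #include <cstdlib>
> #include "readerwriterqueue.h"   // moodycamel::ReaderWriterQueue
>
> // SPSC queue of buffers the ZMQ IO thread has finished with and the
> // sending thread can reuse.
> static moodycamel::ReaderWriterQueue<void *> free_buffers (1024);
>
> // Registered as the message's free function: called by libzmq (from the
> // IO thread) once the message is done, it pushes the buffer back for
> // reuse instead of freeing it.
> static void recycle (void *data, void *)
> {
>     free_buffers.enqueue (data);
> }
>
> // Sender side: reuse a recycled buffer if available, otherwise allocate.
> static int send_pooled (void *socket, size_t msg_size)
> {
>     void *buf = NULL;
>     if (!free_buffers.try_dequeue (buf))
>         buf = malloc (msg_size);
>     // ... fill buf with the payload ...
>     zmq_msg_t msg;
>     zmq_msg_init_data (&msg, buf, msg_size, recycle, NULL);
>     return zmq_msg_send (&msg, socket, 0);
> }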
>
> Here are the graphical results:
> with mallocs / no memory pool:
>
> https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
> with memory pool:
>
> https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png
>
> Doing the math, the memory-pooled approach shows:
>
> - mostly the same performance for messages <= 32B
> - +15% pps/throughput increase @ 64B
> - +60% pps/throughput increase @ 128B
> - +70% pps/throughput increase @ 210B
>
> [the tests were stopped at 210B because my current quick-and-dirty memory
> pool approach has a fixed max msg size of about 210B].
>
> Honestly this is not a huge speedup, even if it is still interesting.
> With these changes the performance now seems to be bounded by the
> "local_thr" side rather than by "remote_thr": the zmq background IO
> thread of local_thr is the only thread at 100% across the two systems,
> and its "perf top" now shows:
>
>  15,02%  libzmq.so.5.2.3     [.] zmq::metadata_t::add_ref
>  14,91%  libzmq.so.5.2.3     [.] zmq::v2_decoder_t::size_ready
>   8,94%  libzmq.so.5.2.3     [.] zmq::ypipe_t<zmq::msg_t, 256>::write
>   6,97%  libzmq.so.5.2.3     [.] zmq::msg_t::close
>   5,48%  libzmq.so.5.2.3     [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
>   5,40%  libzmq.so.5.2.3     [.] zmq::pipe_t::write
>   4,94%  libzmq.so.5.2.3     [.] zmq::shared_message_memory_allocator::inc_ref
>   2,59%  libzmq.so.5.2.3     [.] zmq::msg_t::init_external_storage
>   1,63%  [kernel]            [k] copy_user_enhanced_fast_string
>   1,56%  libzmq.so.5.2.3     [.] zmq::msg_t::data
>   1,43%  libzmq.so.5.2.3     [.] zmq::msg_t::init
>   1,34%  libzmq.so.5.2.3     [.] zmq::pipe_t::check_write
>   1,24%  libzmq.so.5.2.3     [.] zmq::stream_engine_base_t::in_event_internal
>   1,24%  libzmq.so.5.2.3     [.] zmq::msg_t::size
>
> Do you know what this profile might mean?
> I would expect that ZMQ background thread to spend most of its time in
> its read() system call (on the TCP socket)...
>
> Thanks,
> Francesco
>
>
> On Fri, 19 Jul 2019 at 18:15, Francesco
> <francesco.monto...@gmail.com> wrote:
>
>
> Hi Yan,
> Unfortunately I have paused my attempts in this area after getting some
> strange results (possibly because I tried it in a complex application
> context... I should probably try hacking a simple zeromq example
> instead!).
>
> I'm also a bit surprised that nobody has tried and posted online a way to
> achieve something similar (memory-pooled zmq send)... But anyway it
> remains in my plans to try that out when I have a bit more spare time...
> If you manage to get some results earlier, I would be eager to know :-)
>
> Francesco
>
>
> On Fri, 19 Jul 2019 at 04:02, Yan, Liming (NSB - CN/Hangzhou)
> <liming....@nokia-sbell.com> wrote:
>
>
> Hi Francesco,
>   Could you please share the final solution and benchmark results for plan
> 2? Big thanks.
>   I'm concerned about this because I had tried something similar before
> with zmq_msg_init_data() and zmq_msg_send() but failed because of two
> issues.
>   1) My process runs in the background for a long time and I eventually
> found that it occupies more and more memory, until it exhausts the system
> memory. It seems there's a memory leak this way.
>   2) I provided *ffn for deallocation, but the memory is freed back much
> more slowly than the consumer uses it, so eventually my own customized
> pool could also be exhausted. How do you solve this?
>   I had to go back to using zmq_send(). I know it has a memory copy
> penalty but it's the easiest and most stable way to send a message. I'm
> still using 0MQ 4.1.x.
>   Thanks.
>
> BR
> Yan Limin
>
> -----Original Message-----
> From: zeromq-dev [mailto:zeromq-dev-boun...@lists.zeromq.org] On Behalf
> Of Luca Boccassi
> Sent: Friday, July 05, 2019 4:58 PM
> To: ZeroMQ development list <zeromq-dev@lists.zeromq.org>
> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>
> There's no need to change the source for experimenting: you can just use
> _init_data without a callback and with a callback (yes, the first case will
> leak memory, but it's just a test), and measure the difference between the
> two cases. You can then immediately see whether it's worth pursuing further
> optimisations or not.
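>
> Concretely, the two variants to time against each other would look
> something like this (a sketch; buffer handling is deliberately naive and
> error checking is omitted):
>
> #include <zmq.h>
> #include <stdlib.h>
>
> /* Case B's callback: libzmq calls this once every copy of the message
>    has been closed. */
> static void free_cb (void *data, void *hint)
> {
>     (void) hint;
>     free (data);
> }
>
> static void send_case_a (void *socket, size_t n)   /* no callback */
> {
>     zmq_msg_t msg;
>     /* ffn == NULL: no content_t malloc inside libzmq, but the buffer is
>        never released - it leaks, which is acceptable for a benchmark. */
>     zmq_msg_init_data (&msg, malloc (n), n, NULL, NULL);
>     zmq_msg_send (&msg, socket, 0);
> }
>
> static void send_case_b (void *socket, size_t n)   /* with callback */
> {
>     zmq_msg_t msg;
>     /* non-NULL ffn: libzmq allocates the small content_t and frees the
>        buffer via free_cb when it is done with it. */
>     zmq_msg_init_data (&msg, malloc (n), n, free_cb, NULL);
>     zmq_msg_send (&msg, socket, 0);
> }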
>
> _external_storage is an implementation detail, and it's non-shared because
> it's used in the receive case only, with a reference to the TCP buffer used
> in the system call for zero-copy receives. Exposing it would mean that
> those kinds of messages could not be used with pub-sub or radio-dish, as
> they can't have multiple references without being copied, so there would
> be a semantic difference between the different message initialisation
> APIs, unlike now, when the difference is only in who owns the buffer. It
> would make the API quite messy in my opinion, and be quite confusing, as
> pub/sub is probably the most well-known pattern.
>
> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>
> Hi Luca,
> thanks for the details. Indeed I understand why the "content_t" needs
> to be allocated dynamically: it's just like the control block used by
> STL's std::shared_ptr<>.
>
> And you're right: I'm not sure how much gain there is in removing 100%
> of the malloc operations from my TX path... still, I would be curious to
> find out, but right now it seems I would need to patch the ZMQ source code
> to achieve that.
>
> Anyway, I wonder if it would be possible to expose in the public API a
> method like "zmq::msg_t::init_external_storage()" that, AFAICS, allows
> creating a non-shared zero-copy long message... it appears to be used
> only by the v2 decoder internally right now...
> Is there a specific reason why that's not accessible from the public
> API?
>
> Thanks,
> Francesco
>
>
>
>
>
> On Thu, 4 Jul 2019 at 20:25, Luca Boccassi <luca.bocca...@gmail.com>
> wrote:
>
> Another reason for that small struct to be on the heap is so that it
> can be shared among all the copies of the message (eg: a pub socket
> has N copies of the message on the stack, one for each subscriber).
> The struct has an atomic counter in it, so that when all the copies
> of the message on the stack have been closed, the userspace buffer
> deallocation callback can be invoked. If the atomic counter were on
> the stack inlined in the message, this wouldn't work.
> So even if room were to be found, a malloc would still be needed.
>
> If you _really_ are worried about it, and testing shows it makes a
> difference, then one option could be to pre-allocate a set of these
> metadata structures at startup, and just assign them when the
> message is created. It's possible, but increases complexity quite a
> bit, so it needs to be worth it.
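>
> A very rough sketch of that option, for anyone who wants to experiment
> (purely illustrative: the names are made up, and a real version would have
> to hook into msg_t and be sized and synchronised carefully):
>
> #include <cstddef>
> #include <mutex>
> #include <vector>
>
> // Pre-allocate N control blocks at startup and hand them out instead of
> // calling malloc() per message.
> struct content_block { void *data; size_t size; /* ffn, hint, refcnt... */ };
>
> class content_pool {
> public:
>     explicit content_pool (size_t n) : _storage (n) {
>         for (size_t i = 0; i != n; ++i)
>             _free.push_back (&_storage[i]);
>     }
>     content_block *get () {
>         std::lock_guard<std::mutex> lock (_sync);
>         if (_free.empty ())
>             return nullptr;               // caller falls back to malloc
>         content_block *c = _free.back ();
>         _free.pop_back ();
>         return c;
>     }
>     void put (content_block *c) {
>         std::lock_guard<std::mutex> lock (_sync);
>         _free.push_back (c);
>     }
> private:
>     std::vector<content_block> _storage;  // fixed at startup
>     std::vector<content_block *> _free;
>     std::mutex _sync;
> };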
>
> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>
> The second malloc cannot be avoided, but it's tiny and fixed in size at
> compile time, so the compiler and glibc will be able to optimize it to
> death.
>
> The reason for that is that there's not enough room in the 64 bytes to
> store that structure, and increasing the message allocation on the stack
> past 64 bytes means it will no longer fit in a single cache line, which
> incurs a performance penalty far worse than the small malloc (I tested
> this some time ago). That is of course unless you are running on s390 or
> a POWER with a 256-byte cache line, but given it's part of the ABI it
> would be a bit of a mess for the benefit of very few users, if any.
>
> So I'd recommend just going with the second plan, and comparing the
> result when passing a deallocation function vs not passing it (yes, it
> will leak the memory, but it's just for the test). My bet is that the
> difference will not be that large.
>
> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>
> Hi Stephan, hi Luca,
>
> thanks for your hints. However I inspected
> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
> and I don't think it's saving any malloc()... see my point 2) below.
>
> Indeed I realized that the current ZMQ API probably does not allow me to
> achieve 100% of what I intended to do.
> Let me rephrase my target: I want to be able to
> - memory pool creation: do a large memory allocation of, say, 1M zmq_msg_t
> only at the start of my program; let's say I create all these zmq_msg_t
> with a size of 2k bytes each (let's assume this is the max message size
> possible in my app)
> - during the application lifetime: call zmq_msg_send() at any time, always
> avoiding malloc() operations (just picking the first available unused
> entry of zmq_msg_t from the memory pool).
>
> Initially I thought that was possible, but I think I have identified 2
> blocking issues:
> 1) If I try to recycle zmq_msg_t directly: in this case I will fail,
> because I cannot really change only the "size" member of a zmq_msg_t
> without reallocating it... so I'm forced (in my example) to always send
> 2k bytes out (!!)
> 2) If I create only a memory pool of 2k-byte buffers and then wrap the
> first available buffer inside a zmq_msg_t (allocated on the stack, not on
> the heap): in this case I need to know when the internals of ZMQ have
> completed using the zmq_msg_t and thus when I can mark that buffer as
> available again in my memory pool. However I see that the
> zmq_msg_init_data() ZMQ code contains:
>
>     //  Initialize constant message if there's no need to deallocate
>     if (ffn_ == NULL) {
> ...
>         _u.cmsg.data = data_;
>         _u.cmsg.size = size_;
> ...
>     } else {
> ...
>         _u.lmsg.content =
>           static_cast<content_t *> (malloc (sizeof (content_t)));
> ...
>         _u.lmsg.content->data = data_;
>         _u.lmsg.content->size = size_;
>         _u.lmsg.content->ffn = ffn_;
>         _u.lmsg.content->hint = hint_;
>         new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>     }
>
> So I skip the malloc() operation only if I pass ffn_ == NULL. The problem
> is that if I pass ffn_ == NULL, then I have no way to know when the
> internals of ZMQ have completed using the zmq_msg_t...
>
> Any way to work around either issue 1) or issue 2)?
>
> I understand that the malloc is just of sizeof(content_t) ~= 40B... but
> still I'd like to avoid it...
>
> Thanks!
> Francesco
>
>
>
>
>
> On Thu, 4 Jul 2019 at 14:58, Stephan Opfer <op...@vs.uni-kassel.de>
> wrote:
> On 04.07.19 14:29, Luca Boccassi wrote:
>
> How users make use of these primitives is up to them though, I don't
> think anything special was shared before, as far as I remember.
>
> Some examples can be found here:
> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>
> The classes Publisher and Subscriber should replace the publisher and
> subscriber in a former Robot-Operating-System-based system. I hope that
> the subscriber is actually using the method Luca is talking about on the
> receiving side.
>
> The message data here is a Cap'n Proto container that we "simply"
> serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
>
>
>
> --
> Kind regards,
> Luca Boccassi
>
>
>
_______________________________________________
zeromq-dev mailing list
zeromq-dev@lists.zeromq.org
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
