Maybe zmq_msg_init_allocator, which accepts the allocator. With that pattern we do need the release method: the zmq_msg will handle it internally and register the release method as the free method of the zmq_msg. They do need to have the same signature.
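A minimal sketch of how such a helper could sit on top of the existing zmq_msg_init_data(), assuming an allocator descriptor along the lines of Jens's proposal below (zmq_allocator_t and zmq_msg_init_allocator are hypothetical names, not existing libzmq API):

    #include <stddef.h>
    #include <zmq.h>

    /* Hypothetical allocator descriptor, following Jens's sketch below. */
    typedef struct zmq_allocator_t {
        void *obj;                               /* allocator state, e.g. a pool  */
        void *(*allocate) (size_t n, void *obj);
        void (*release) (void *ptr, void *obj);  /* same signature as zmq_free_fn */
    } zmq_allocator_t;

    /* Hypothetical init helper: allocate the buffer through the allocator and
       register its release method as the message's free function, passing the
       allocator state as the hint. */
    int zmq_msg_init_allocator (zmq_msg_t *msg, size_t size, zmq_allocator_t *a)
    {
        void *buf = a->allocate (size, a->obj);
        if (!buf)
            return -1;
        return zmq_msg_init_data (msg, buf, size, a->release, a->obj);
    }

As Francesco notes further down, going through zmq_msg_init_data() still pays the small malloc of the internal content_t control block; eliminating that one as well needs changes inside libzmq.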
On Thu, Aug 15, 2019 at 12:35 PM Francesco <[email protected]> wrote:
> Hi Doron, hi Jens,
> Yes, the allocator method is a nice solution.
> I think it would be nice to have libzmq also provide a memory-pool implementation, but keep the malloc/free implementation as the default for backward compatibility.
>
> It's also important to have a smart allocator that internally contains not just one but several pools for different packet-size classes, to avoid memory waste. But I think this can fit easily in the allocator pattern sketched out by Jens.
>
> Btw, another issue unrelated to the allocator API but regarding performance: I think it's important to avoid not only the allocation of the msg buffer but also that of the content_t structure, and indeed in my preliminary merge request I did modify zmq_msg_t of type_lmsg to use the first 40 bytes inside the pooled buffer.
> Of course this approach is not backward compatible with the _init_data() semantics.
> How do you think this would best be approached?
> I guess we may have a new _init_data_and_controlblock() helper that does the trick of taking the first 40 bytes of the provided buffer?
>
> Thanks,
> Francesco
>
>
> On Wed, Aug 14, 2019, 22:23 Doron Somech <[email protected]> wrote:
>
>> Jens, I like the idea.
>>
>> We actually don't need the release method.
>> The signature of allocate should receive the zmq_msg and allocate it:
>>
>> int (&allocate)(zmq_msg *msg, size_t size, void *obj);
>>
>> When the allocator creates the zmq_msg, it will provide the release method to the zmq_msg in the constructor.
>>
>> This is important in order to forward messages between sockets, so the release method is part of the msg. This is already supported by zmq_msg, which accepts a free method with a hint (obj in your example).
>>
>> The return value of allocate will be a success indication, like the rest of the zeromq methods.
>>
>> zeromq actually already supports a pool mechanism when sending, using the zmq_msg API. Receiving is the problem; your suggestion solves it nicely.
>>
>> By the way, a memory pool is already supported in NetMQ, with a solution very similar to the one you suggested. (It is global for all sockets, without override.)
>>
>>
>> On Wed, Aug 14, 2019, 22:41 Jens Auer <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Maybe this can be combined with a request that I have seen a couple of times to be able to configure the allocator used in libzmq? I am thinking of something like
>>>
>>> struct zmq_allocator {
>>>     void* obj;
>>>     void* (&allocate)(size_t n, void* obj);
>>>     void (&release)(void* ptr, void* obj);
>>> };
>>>
>>> void* useMalloc(size_t n, void*) {return malloc(n);}
>>> void freeMalloc(void* ptr, void*) {free(ptr);}
>>>
>>> zmq_allocator& zmq_default_allocator() {
>>>     static zmq_allocator defaultAllocator = {nullptr, useMalloc, freeMalloc};
>>>     return defaultAllocator;
>>> }
>>>
>>> The context could then store the allocator for libzmq, and users could set a specific allocator as a context option, e.g. with zmq_ctx_set. A socket created for a context can then inherit the default allocator or set a special allocator as a socket option.
>>>
>>> class MemoryPool {…}; // hopefully thread-safe
>>>
>>> MemoryPool pool;
>>>
>>> void* allocatePool(size_t n, void* pool) {return static_cast<MemoryPool*>(pool)->allocate(n);}
>>> void releasePool(void* ptr, void* pool) {static_cast<MemoryPool*>(pool)->release(ptr);}
>>>
>>> zmq_allocator pooledAllocator {
>>>     &pool, allocatePool, releasePool
>>> };
>>>
>>> void* ctx = zmq_ctx_new();
>>> zmq_ctx_set(ctx, ZMQ_ALLOCATOR, &pooledAllocator);
>>>
>>> Cheers,
>>> Jens
>>>
>>> On 13.08.2019 at 13:24, Francesco <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> today I've taken some time to attempt building a memory-pooling mechanism into the ZMQ local_thr/remote_thr benchmarking utilities.
>>> Here's the result: https://github.com/zeromq/libzmq/pull/3631
>>> This PR is a work in progress and is a simple modification to show the effects of avoiding malloc/free when creating zmq_msg_t with the standard benchmark utils of ZMQ.
>>>
>>> In particular, the very fast, lock-free, single-producer/single-consumer queue from https://github.com/cameron314/readerwriterqueue is used to maintain a list of free buffers shared between the "remote_thr" main thread and its ZMQ background I/O thread.
>>>
>>> Here are the graphical results:
>>> with mallocs / no memory pool:
>>> https://cdn1.imggmi.com/uploads/2019/8/13/9f009b91df394fa945cd2519fd993f50-full.png
>>> with memory pool:
>>> https://cdn1.imggmi.com/uploads/2019/8/13/f3ae0d6d58e9721b63129c23fe7347a6-full.png
>>>
>>> Doing the math, the memory-pooled approach shows:
>>>
>>> - mostly the same performance for messages <= 32B
>>> - +15% pps/throughput increase @ 64B
>>> - +60% pps/throughput increase @ 128B
>>> - +70% pps/throughput increase @ 210B
>>>
>>> [The tests were stopped at 210B because my current quick-and-dirty memory pool has a fixed max msg size of about 210B.]
>>>
>>> Honestly this is not a huge speedup, even if it is still interesting.
>>> Indeed, with these changes the performance now seems to be bounded by the "local_thr" side and not by "remote_thr" anymore: the ZMQ background I/O thread for local_thr is the only thread at 100% on the two systems, and its "perf top" now shows:
>>>
>>> 15,02%  libzmq.so.5.2.3  [.] zmq::metadata_t::add_ref
>>> 14,91%  libzmq.so.5.2.3  [.] zmq::v2_decoder_t::size_ready
>>>  8,94%  libzmq.so.5.2.3  [.] zmq::ypipe_t<zmq::msg_t, 256>::write
>>>  6,97%  libzmq.so.5.2.3  [.] zmq::msg_t::close
>>>  5,48%  libzmq.so.5.2.3  [.] zmq::decoder_base_t<zmq::v2_decoder_t, zmq::shared_message_memory_allo
>>>  5,40%  libzmq.so.5.2.3  [.] zmq::pipe_t::write
>>>  4,94%  libzmq.so.5.2.3  [.] zmq::shared_message_memory_allocator::inc_ref
>>>  2,59%  libzmq.so.5.2.3  [.] zmq::msg_t::init_external_storage
>>>  1,63%  [kernel]         [k] copy_user_enhanced_fast_string
>>>  1,56%  libzmq.so.5.2.3  [.] zmq::msg_t::data
>>>  1,43%  libzmq.so.5.2.3  [.] zmq::msg_t::init
>>>  1,34%  libzmq.so.5.2.3  [.] zmq::pipe_t::check_write
>>>  1,24%  libzmq.so.5.2.3  [.] zmq::stream_engine_base_t::in_event_internal
>>>  1,24%  libzmq.so.5.2.3  [.] zmq::msg_t::size
>>>
>>> Do you know what this profile might mean?
>>> I would expect that ZMQ background thread to be topping out in its read() system call (from the TCP socket)...
>>>
>>> Thanks,
>>> Francesco
>>>
>>>
>>> On Fri, Jul 19, 2019 at 18:15 Francesco <[email protected]> wrote:
>>>
>>> Hi Yan,
>>> Unfortunately I have interrupted my attempts in this area after getting some strange results (possibly due to the fact that I tried in a complex application context... I should probably try hacking a simple zeromq example instead!).
>>>
>>> I'm also a bit surprised that nobody has tried and posted online a way to achieve something similar (memory-pooled zmq send)... But anyway it remains in my plans to try that out when I have a bit more spare time...
>>> If you manage to get some results earlier, I would be eager to know :-)
>>>
>>> Francesco
>>>
>>>
>>> On Fri, Jul 19, 2019, 04:02 Yan, Liming (NSB - CN/Hangzhou) <[email protected]> wrote:
>>>
>>> Hi, Francesco
>>> Could you please share the final solution and benchmark results for plan 2? Big thanks.
>>> I'm asking because I had tried something similar before with zmq_msg_init_data() and zmq_msg_send(), but failed because of two issues. 1) My process runs in the background for a long time, and I eventually found that it occupies more and more memory until it exhausts the system memory; it seems there is a memory leak this way. 2) I provided *ffn for deallocation, but memory is freed back much more slowly than it is consumed, so eventually my own customized pool could also be exhausted. How do you solve this?
>>> I had to go back to using zmq_send(). I know it has a memory-copy penalty, but it's the easiest and most stable way to send a message. I'm still using 0MQ 4.1.x.
>>> Thanks.
>>>
>>> BR
>>> Yan Limin
>>>
>>> -----Original Message-----
>>> From: zeromq-dev [mailto:[email protected]] On Behalf Of Luca Boccassi
>>> Sent: Friday, July 05, 2019 4:58 PM
>>> To: ZeroMQ development list <[email protected]>
>>> Subject: Re: [zeromq-dev] Memory pool for zmq_msg_t
>>>
>>> There's no need to change the source for experimenting: you can just use _init_data without a callback and with a callback (yes, the first case will leak memory, but it's just a test), and measure the difference between the two cases. You can then immediately see if it's worth pursuing further optimisations or not.
>>>
>>> _external_storage is an implementation detail, and it's non-shared because it's used in the receive case only, with a reference to the TCP buffer used in the system call for zero-copy receives. Exposing it means that those kinds of messages could not be used with pub-sub or radio-dish, as they can't have multiple references without copying them, which means there would be a semantic difference between the different message initialisation APIs, unlike now, when the difference is only in who owns the buffer. It would make the API quite messy in my opinion, and be quite confusing, as pub/sub is probably the most well-known pattern.
>>>
>>> On Thu, 2019-07-04 at 23:20 +0200, Francesco wrote:
>>>
>>> Hi Luca,
>>> thanks for the details. Indeed I understand why the "content_t" needs to be allocated dynamically: it's just like the control block used by STL's std::shared_ptr<>.
>>>
>>> And you're right: I'm not sure how much gain there is in removing 100% of malloc operations from my TX path... I would still be curious to find out, but right now it seems I need to patch the ZMQ source code to achieve that.
>>>
>>> Anyway, I wonder if it would be possible to expose in the public API a method like "zmq::msg_t::init_external_storage()" which, AFAICS, allows creating a non-shared zero-copy long message... it appears to be used only by the v2 decoder internally right now...
>>> Is there a specific reason why that's not accessible from the public API?
>>>
>>> Thanks,
>>> Francesco
>>>
>>>
>>> On Thu, Jul 4, 2019 at 20:25 Luca Boccassi <[email protected]> wrote:
>>>
>>> Another reason for that small struct to be on the heap is so that it can be shared among all the copies of the message (e.g. a pub socket has N copies of the message on the stack, one for each subscriber). The struct has an atomic counter in it, so that when all the copies of the message on the stack have been closed, the userspace buffer deallocation callback can be invoked. If the atomic counter were on the stack, inlined in the message, this wouldn't work.
>>> So even if room were to be found, a malloc would still be needed.
>>>
>>> If you _really_ are worried about it, and testing shows it makes a difference, then one option could be to pre-allocate a set of these metadata structures at startup, and just assign them when the message is created. It's possible, but it increases complexity quite a bit, so it needs to be worth it.
>>>
>>> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
>>>
>>> The second malloc cannot be avoided, but it's tiny and fixed in size at compile time, so the compiler and glibc will be able to optimize it to death.
>>>
>>> The reason for that is that there's not enough room in the 64 bytes to store that structure, and increasing the message allocation on the stack past 64 bytes means it will no longer fit in a single cache line, which will incur a performance penalty far worse than the small malloc (I tested this some time ago). That is of course unless you are running on s390 or a POWER with a 256-byte cacheline, but given it's part of the ABI it would be a bit of a mess for the benefit of very few users, if any.
>>>
>>> So I'd recommend just going with the second plan, and comparing the result when passing a deallocation function vs not passing it (yes, it will leak the memory, but it's just for the test). My bet is that the difference will not be that large.
>>>
>>> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
>>>
>>> Hi Stephan, hi Luca,
>>>
>>> thanks for your hints. However, I inspected
>>> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>>> and I don't think it's saving any malloc()... see my point 2) below.
>>>
>>> Indeed, I realized that probably the current ZMQ API does not allow me to achieve 100% of what I intended to do.
>>> Let me rephrase my target: my target is to be able to
>>> - memory pool creation: do a large memory allocation of, say, 1M zmq_msg_t only at the start of my program; let's say I create all these zmq_msg_t with a size of 2k bytes each (let's assume this is the max message size possible in my app);
>>> - during the application lifetime: call zmq_msg_send() at any time, always avoiding malloc() operations (just picking the first available unused zmq_msg_t entry from the memory pool).
>>>
>>> Initially I thought that was possible, but I think I have identified 2 blocking issues:
>>> 1) If I try to recycle zmq_msg_t directly: in this case I will fail, because I cannot really change only the "size" member of a zmq_msg_t without reallocating it... so I'm forced (in my example) to always send 2k bytes out (!!)
>>> 2) If I create only a memory pool of 2k-byte buffers and then wrap the first available buffer inside a zmq_msg_t (allocated on the stack, not on the heap): in this case I need to know when the internals of ZMQ have finished using the zmq_msg_t, and thus when I can mark that buffer as available again in my memory pool. However, I see that the zmq_msg_init_data() ZMQ code contains:
>>>
>>>     //  Initialize constant message if there's no need to deallocate
>>>     if (ffn_ == NULL) {
>>>         ...
>>>         _u.cmsg.data = data_;
>>>         _u.cmsg.size = size_;
>>>         ...
>>>     } else {
>>>         ...
>>>         _u.lmsg.content = static_cast<content_t *> (malloc (sizeof (content_t)));
>>>         ...
>>>         _u.lmsg.content->data = data_;
>>>         _u.lmsg.content->size = size_;
>>>         _u.lmsg.content->ffn = ffn_;
>>>         _u.lmsg.content->hint = hint_;
>>>         new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
>>>     }
>>>
>>> So I skip the malloc() operation only if I pass ffn_ == NULL. The problem is that if I pass ffn_ == NULL, then I have no way to know when the internals of ZMQ have finished using the zmq_msg_t...
>>>
>>> Any way to work around either issue 1) or issue 2)?
>>>
>>> I understand that the malloc is just of sizeof(content_t) ~= 40B... but still I'd like to avoid it...
>>>
>>> Thanks!
>>> Francesco
>>>
>>>
>>> On Thu, Jul 4, 2019 at 14:58 Stephan Opfer <[email protected]> wrote:
>>>
>>> On 04.07.19 14:29, Luca Boccassi wrote:
>>>> How users make use of these primitives is up to them though, I don't think anything special was shared before, as far as I remember.
>>>
>>> Some examples can be found here:
>>> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>>>
>>> The classes Publisher and Subscriber should replace the publisher and subscriber in a former Robot-Operating-System-based system. I hope that the subscriber is actually using the method Luca is talking about on the receiving side.
>>>
>>> The message data here is a Cap'n Proto container that we "simply" serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
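For reference, the send-side pattern that Doron and Luca describe above (a pooled buffer wrapped with zmq_msg_init_data(), whose free function hands it back to the pool) boils down to roughly the sketch below. my_pool_t, pool_get and pool_put are hypothetical stand-ins for the user's own pool (e.g. the lock-free SPSC queue used in Francesco's PR); here they simply forward to malloc/free so the example compiles:

    #include <stdlib.h>
    #include <string.h>
    #include <zmq.h>

    /* Stand-in "pool": forwards to malloc/free just to keep the sketch
       self-contained. A real pool must be thread-safe, because the free
       function below is typically invoked from libzmq's I/O thread once
       the message has been written out. */
    typedef struct my_pool_t { int unused; } my_pool_t;
    static void *pool_get (my_pool_t *p, size_t n) { (void) p; return malloc (n); }
    static void pool_put (my_pool_t *p, void *buf) { (void) p; free (buf); }

    /* Called by libzmq when the last copy of the message is closed. */
    static void pool_free_fn (void *data, void *hint)
    {
        pool_put ((my_pool_t *) hint, data);
    }

    static int send_pooled (void *socket, my_pool_t *pool,
                            const void *payload, size_t len)
    {
        void *buf = pool_get (pool, len);
        if (!buf)
            return -1;
        memcpy (buf, payload, len);

        zmq_msg_t msg;
        /* Registers pool_free_fn/pool as the buffer's deallocator; this is
           the zmq_msg_init_data() branch that allocates the ~40-byte
           content_t discussed above. */
        if (zmq_msg_init_data (&msg, buf, len, pool_free_fn, pool) != 0) {
            pool_put (pool, buf);
            return -1;
        }
        if (zmq_msg_send (&msg, socket, 0) == -1) {
            zmq_msg_close (&msg); /* triggers pool_free_fn */
            return -1;
        }
        return 0;
    }

This covers buffer reuse on the send path, but, as the rest of the thread points out, it neither avoids the content_t allocation nor helps the receive side, which is what the allocator proposal above is meant to address.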
_______________________________________________
zeromq-dev mailing list
[email protected]
https://lists.zeromq.org/mailman/listinfo/zeromq-dev
