Re: [zeromq-dev] Memory pool for zmq_msg_t

2019-07-04 Thread Francesco
Hi Luca,
thanks for the details. Indeed I understand why the "content_t" needs to be
allocated dynamically: it's just like the control block used by STL's
std::shared_ptr<>.

And you're right: I'm not sure how much gain there is in removing 100% of
malloc operations from my TX path... still, I'd be curious to find out, but
right now it seems I'd need to patch the ZMQ source code to achieve that.

Anyway, I wonder if it would be possible to expose in the public API a
method like "zmq::msg_t::init_external_storage()" which, AFAICS, allows
creating a non-shared, zero-copy long message... it appears to be used only
by the v2 decoder internally right now...
Is there a specific reason why that's not accessible from the public API?
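
For concreteness, below is roughly the kind of pooling I have in mind using
only the current public API (buffer_pool_t and send_pooled are just names I
made up for this sketch, nothing from libzmq, and the small content_t
allocation still happens because a non-NULL free function is passed):

//  Illustration only: buffer_pool_t / send_pooled are made-up names.
#include <zmq.h>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <vector>

struct buffer_pool_t
{
    buffer_pool_t (size_t count, size_t size) :
        _size (size), _storage (count * size)
    {
        for (size_t i = 0; i < count; ++i)
            _free.push_back (&_storage[i * size]);
    }

    void *acquire ()
    {
        std::lock_guard<std::mutex> lock (_mutex);
        if (_free.empty ())
            return NULL;                  //  pool exhausted
        void *buf = _free.back ();
        _free.pop_back ();
        return buf;
    }

    //  Matches zmq_free_fn: libzmq calls this (from an I/O thread) once the
    //  last copy of the message has been closed, so the slot can be reused.
    static void release (void *data_, void *hint_)
    {
        buffer_pool_t *self = static_cast<buffer_pool_t *> (hint_);
        std::lock_guard<std::mutex> lock (self->_mutex);
        self->_free.push_back (data_);
    }

    size_t _size;
    std::vector<unsigned char> _storage;  //  one big allocation at startup
    std::vector<void *> _free;
    std::mutex _mutex;
};

int send_pooled (void *socket_, buffer_pool_t &pool_,
                 const void *payload_, size_t len_)
{
    if (len_ > pool_._size)
        return -1;                        //  payload does not fit a pool slot
    void *buf = pool_.acquire ();
    if (!buf)
        return -1;                        //  pool exhausted
    //  In a real application the payload would be built directly in buf.
    memcpy (buf, payload_, len_);
    zmq_msg_t msg;
    int rc = zmq_msg_init_data (&msg, buf, len_, buffer_pool_t::release, &pool_);
    if (rc != 0) {
        buffer_pool_t::release (buf, &pool_);
        return rc;
    }
    int sent = zmq_msg_send (&msg, socket_, 0);
    if (sent < 0)
        zmq_msg_close (&msg);             //  hands the slot back via release()
    return sent;
}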

Thanks,
Francesco




On Thu, 4 Jul 2019 at 20:25 Luca Boccassi <luca.bocca...@gmail.com> wrote:

> Another reason for that small struct to be on the heap is so that it
> can be shared among all the copies of the message (eg: a pub socket has
> N copies of the message on the stack, one for each subscriber). The
> struct has an atomic counter in it, so that when all the copies of the
> message on the stack have been closed, the userspace buffer
> deallocation callback can be invoked. If the atomic counter were on the
> stack inlined in the message, this wouldn't work.
> So even if room were to be found, a malloc would still be needed.
>
> If you _really_ are worried about it, and testing shows it makes a
> difference, then one option could be to pre-allocate a set of these
> metadata structures at startup, and just assign them when the message
> is created. It's possible, but increases complexity quite a bit, so it
> needs to be worth it.
>
> On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
> > The second malloc cannot be avoided, but it's tiny and fixed in size
> > at
> > compile time, so the compiler and glibc will be able to optimize it
> > to
> > death.
> >
> > The reason for that is that there's not enough room in the 64 bytes
> > to
> > store that structure, and increasing the message allocation on the
> > stack past 64 bytes means it will no longer fit in a single cache
> > line,
> > which will incur a performance penalty far worse than the small
> > malloc (I tested this some time ago). That is of course unless you
> > are
> > running on s390 or a POWER with a 256-byte cache line, but given it's
> > part of the ABI it would be a bit of a mess for the benefit of very
> > few
> > users if any.
> >
> > So I'd recommend just going with the second plan, and comparing what
> > the
> > result is when passing a deallocation function vs not passing it (yes
> > it will leak the memory but it's just for the test). My bet is that
> > the
> > difference will not be that large.
> >
> > On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
> > > Hi Stephan, Hi Luca,
> > >
> > > thanks for your hints. However I inspected
> > >
> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
> > >
> > >  and I don't think it avoids the malloc()...  see my point 2)
> > > below:
> > >
> > > Indeed I realized that probably current ZMQ API does not allow me
> > > to
> > > achieve 100% of what I intended to do.
> > > Let me rephrase my target: my target is to be able to
> > >  - memory pool creation: do a large memory allocation of, say, 1M
> > > zmq_msg_t only at the start of my program; let's say I create all
> > > these zmq_msg_t of a size of 2k bytes each (let's assume this is
> > > the
> > > max size of message possible in my app)
> > >  - during application lifetime: call zmq_msg_send() at any time
> > > always
> > > avoiding malloc() operations (just picking the first available
> > > unused
> > > entry of zmq_msg_t from the memory pool).
> > >
> > > Initially I thought that was possible but I think I have identified
> > > 2
> > > blocking issues:
> > > 1) If I try to recycle zmq_msg_t directly: in this case I will fail
> > > because I cannot really change only the "size" member of a
> > > zmq_msg_t
> > > without reallocating it... so that I'm forced (in my example) to
> > > always send 2k bytes out (!!)
> > > 2) if I do create only a memory pool of buffers of 2k bytes and
> > > then
> > > wrap the first available buffer inside a zmq_msg_t (allocated on
> > > the
> > > stack, not in the heap): in this case I need to know when the
> > > internals of ZMQ have completed using the zmq_msg_t and thus when I
> > > can mark that buffer as available again in my memory pool. However
> > > I
> > > see that zmq_msg_init_data() ZMQ code contains:
> > >
> > > //  Initialize constant message if there's no need to deallocate
> > > if (ffn_ == NULL) {
> > > ...
> > > _u.cmsg.data = data_;
> > > _u.cmsg.size = size_;
> > > ...
> > > } else {
> > > ...
> > > _u.lmsg.content =
> > >   static_cast<content_t *> (malloc (sizeof (content_t)));
> > > ...
> > > _u.lmsg.content->data = data_;
> > > _u.lmsg.content->size = size_;
> 

Re: [zeromq-dev] Memory pool for zmq_msg_t

2019-07-04 Thread Luca Boccassi
Another reason for that small struct to be on the heap is so that it
can be shared among all the copies of the message (eg: a pub socket has
N copies of the message on the stack, one for each subscriber). The
struct has an atomic counter in it, so that when all the copies of the
message on the stack have been closed, the userspace buffer
deallocation callback can be invoked. If the atomic counter were on the
stack inlined in the message, this wouldn't work.
So even if room were to be found, a malloc would still be needed.

If you _really_ are worried about it, and testing shows it makes a
difference, then one option could be to pre-allocate a set of these
metadata structures at startup, and just assign them when the message
is created. It's possible, but increases complexity quite a bit, so it
needs to be worth it.
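
Purely as an illustration of that idea (none of these names exist in
libzmq, and this is not the library's code), a pre-allocated pool of
per-message metadata structs could look roughly like this:

#include <atomic>
#include <cstddef>
#include <mutex>
#include <vector>

struct metadata_t
{
    void *data;
    size_t size;
    std::atomic<unsigned> refcnt;   //  shared by all copies of one message
};

class metadata_pool_t
{
  public:
    explicit metadata_pool_t (size_t count_) : _slots (count_)
    {
        for (metadata_t &slot : _slots)
            _free.push_back (&slot);
    }

    //  Used instead of malloc (sizeof (content_t)) when a message is created.
    metadata_t *get ()
    {
        std::lock_guard<std::mutex> lock (_mutex);
        if (_free.empty ())
            return NULL;            //  pool exhausted: fall back to malloc
        metadata_t *m = _free.back ();
        _free.pop_back ();
        return m;
    }

    //  Used instead of free () when the last copy of the message is closed.
    void put (metadata_t *m_)
    {
        std::lock_guard<std::mutex> lock (_mutex);
        _free.push_back (m_);
    }

  private:
    std::vector<metadata_t> _slots;   //  allocated once, at startup
    std::vector<metadata_t *> _free;
    std::mutex _mutex;
};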

On Thu, 2019-07-04 at 17:42 +0100, Luca Boccassi wrote:
> The second malloc cannot be avoided, but it's tiny and fixed in size
> at
> compile time, so the compiler and glibc will be able to optimize it
> to
> death.
> 
> The reason for that is that there's not enough room in the 64 bytes
> to
> store that structure, and increasing the message allocation on the
> stack past 64 bytes means it will no longer fit in a single cache
> line,
> which will incur a performance penalty far worse than the small
> malloc (I tested this some time ago). That is of course unless you
> are
> running on s390 or a POWER with a 256-byte cache line, but given it's
> part of the ABI it would be a bit of a mess for the benefit of very
> few
> users if any.
> 
> So I'd recommend just going with the second plan, and comparing what
> the
> result is when passing a deallocation function vs not passing it (yes
> it will leak the memory but it's just for the test). My bet is that
> the
> difference will not be that large.
> 
> On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
> > Hi Stephan, Hi Luca,
> > 
> > thanks for your hints. However I inspected 
> > https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
> > 
> >  and I don't think it avoids the malloc()...  see my point 2)
> > below:
> > 
> > Indeed I realized that probably current ZMQ API does not allow me
> > to
> > achieve 100% of what I intended to do.
> > Let me rephrase my target: my target is to be able to 
> >  - memory pool creation: do a large memory allocation of, say, 1M
> > zmq_msg_t only at the start of my program; let's say I create all
> > these zmq_msg_t of a size of 2k bytes each (let's assume this is
> > the
> > max size of message possible in my app) 
> >  - during application lifetime: call zmq_msg_send() at any time
> > always
> > avoiding malloc() operations (just picking the first available
> > unused
> > entry of zmq_msg_t from the memory pool).
> > 
> > Initially I thought that was possible but I think I have identified
> > 2
> > blocking issues:
> > 1) If I try to recycle zmq_msg_t directly: in this case I will fail
> > because I cannot really change only the "size" member of a
> > zmq_msg_t
> > without reallocating it... so that I'm forced (in my example) to
> > always send 2k bytes out (!!)
> > 2) if I do create only a memory pool of buffers of 2k bytes and
> > then
> > wrap the first available buffer inside a zmq_msg_t (allocated on
> > the
> > stack, not in the heap): in this case I need to know when the
> > internals of ZMQ have completed using the zmq_msg_t and thus when I
> > can mark that buffer as available again in my memory pool. However
> > I
> > see that zmq_msg_init_data() ZMQ code contains:
> > 
> > //  Initialize constant message if there's no need to deallocate
> > if (ffn_ == NULL) {
> > ...
> > _u.cmsg.data = data_;
> > _u.cmsg.size = size_;
> > ...
> > } else {
> > ...
> > _u.lmsg.content =
> >   static_cast<content_t *> (malloc (sizeof (content_t)));
> > ...
> > _u.lmsg.content->data = data_;
> > _u.lmsg.content->size = size_;
> > _u.lmsg.content->ffn = ffn_;
> > _u.lmsg.content->hint = hint_;
> > new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
> > }
> > 
> > So I skip the malloc() operation only if I pass ffn_ == NULL. The
> > problem is that if I pass ffn_ == NULL, then I have no way to know
> > when the internals of ZMQ have completed using the zmq_msg_t...
> > 
> > Any way to work around either issue 1) or issue 2)?
> > 
> > I understand that the malloc is just sizeof(content_t) ~= 40 bytes...
> > but
> > still I'd like to avoid it...
> > 
> > Thanks!
> > Francesco
> > 
> > 
> > 
> > 
> > 
> > On Thu, 4 Jul 2019 at 14:58 Stephan Opfer <op...@vs.uni-kassel.de> wrote:
> > > On 04.07.19 14:29, Luca Boccassi wrote:
> > > > How users make use of these primitives is up to them though, I
> > > 
> > > don't
> > > > think anything special was shared before, as far as I remember.
> > > 
> > > Some examples can be found here:
> > > 

Re: [zeromq-dev] Memory pool for zmq_msg_t

2019-07-04 Thread Luca Boccassi
The second malloc cannot be avoided, but it's tiny and fixed in size at
compile time, so the compiler and glibc will be able to optimize it to
death.

The reason for that is that there's not enough room in the 64 bytes to
store that structure, and increasing the message allocation on the
stack past 64 bytes means it will no longer fit in a single cache line,
which will incur a performance penalty far worse than the small
malloc (I tested this some time ago). That is of course unless you are
running on s390 or a POWER with a 256-byte cache line, but given it's
part of the ABI it would be a bit of a mess for the benefit of very few
users if any.

So I'd recommend just going with the second plan, and comparing what the
result is when passing a deallocation function vs not passing it (yes
it will leak the memory but it's just for the test). My bet is that the
difference will not be that large.
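
Something along these lines would do for the comparison (noop_free and
send_variant are illustrative names only; the no-op callback deliberately
never frees the buffer, which is fine just for the test):

#include <zmq.h>
#include <stddef.h>

//  No-op deallocation callback: forces libzmq to allocate the internal
//  content_t control block, but never releases the buffer (test only).
static void noop_free (void *data_, void *hint_)
{
    (void) data_;
    (void) hint_;
}

//  Send the same caller-owned buffer either with ffn == NULL (no content_t
//  malloc, "constant" message) or with the no-op callback (content_t malloc).
int send_variant (void *socket_, void *buf_, size_t len_, int with_ffn_)
{
    zmq_msg_t msg;
    int rc = zmq_msg_init_data (&msg, buf_, len_,
                                with_ffn_ ? noop_free : NULL, NULL);
    if (rc != 0)
        return rc;
    return zmq_msg_send (&msg, socket_, 0);
}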

On Thu, 2019-07-04 at 16:30 +0200, Francesco wrote:
> Hi Stephan, Hi Luca,
> 
> thanks for your hints. However I inspected 
> https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
>  and I don't think it avoids the malloc()...  see my point 2)
> below:
> 
> Indeed I realized that probably current ZMQ API does not allow me to
> achieve 100% of what I intended to do.
> Let me rephrase my target: my target is to be able to 
>  - memory pool creation: do a large memory allocation of, say, 1M
> zmq_msg_t only at the start of my program; let's say I create all
> these zmq_msg_t of a size of 2k bytes each (let's assume this is the
> max size of message possible in my app) 
>  - during application lifetime: call zmq_msg_send() at any time, always
> avoiding malloc() operations (just picking the first available unused
> entry of zmq_msg_t from the memory pool).
> 
> Initially I thought that was possible but I think I have identified 2
> blocking issues:
> 1) If I try to recycle zmq_msg_t directly: in this case I will fail
> because I cannot really change only the "size" member of a zmq_msg_t
> without reallocating it... so that I'm forced (in my example) to
> always send 2k bytes out (!!)
> 2) if I do create only a memory pool of buffers of 2k bytes and then
> wrap the first available buffer inside a zmq_msg_t (allocated on the
> stack, not in the heap): in this case I need to know when the
> internals of ZMQ have completed using the zmq_msg_t and thus when I
> can mark that buffer as available again in my memory pool. However I
> see that zmq_msg_init_data() ZMQ code contains:
> 
> //  Initialize constant message if there's no need to deallocate
> if (ffn_ == NULL) {
> ...
> _u.cmsg.data = data_;
> _u.cmsg.size = size_;
> ...
> } else {
> ...
> _u.lmsg.content =
>   static_cast<content_t *> (malloc (sizeof (content_t)));
> ...
> _u.lmsg.content->data = data_;
> _u.lmsg.content->size = size_;
> _u.lmsg.content->ffn = ffn_;
> _u.lmsg.content->hint = hint_;
> new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
> }
> 
> So I skip the malloc() operation only if I pass ffn_ == NULL. The
> problem is that if I pass ffn_ == NULL, then I have no way to know
> when the internals of ZMQ have completed using the zmq_msg_t...
> 
> Any way to work around either issue 1) or issue 2)?
> 
> I understand that the malloc is just sizeof(content_t) ~= 40 bytes... but
> still I'd like to avoid it...
> 
> Thanks!
> Francesco
> 
> 
> 
> 
> 
> On Thu, 4 Jul 2019 at 14:58 Stephan Opfer <op...@vs.uni-kassel.de> wrote:
> > On 04.07.19 14:29, Luca Boccassi wrote:
> > > How users make use of these primitives is up to them though, I
> > don't
> > > think anything special was shared before, as far as I remember.
> > 
> > Some examples can be found here:
> > https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
> > 
> > The classes Publisher and Subscriber should replace the publisher
> > and 
> > subscriber in a former Robot-Operating-System-based System. I hope
> > that 
> > the subscriber is actually using the method Luca is talking about
> > on the 
> > receiving side.
> > 
> > The message data here is a Cap'n Proto container that we "simply" 
> > serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
> > 
> 
> 
-- 
Kind regards,
Luca Boccassi




Re: [zeromq-dev] Memory pool for zmq_msg_t

2019-07-04 Thread Francesco
Hi Stephan, Hi Luca,

thanks for your hints. However, I inspected
https://github.com/dasys-lab/capnzero/blob/master/capnzero/src/Publisher.cpp
and I don't think it avoids the malloc()... see my point 2) below:

Indeed I realized that the current ZMQ API probably does not allow me to
achieve 100% of what I intended to do.
Let me rephrase my target: I want to be able to
 - memory pool creation: do one large memory allocation of, say, 1M zmq_msg_t
only at the start of my program; let's say I create all these zmq_msg_t with
a size of 2k bytes each (let's assume this is the max message size possible
in my app)
 - during application lifetime: call zmq_msg_send() at any time, always
avoiding malloc() operations (just picking the first available unused entry
of zmq_msg_t from the memory pool).

Initially I thought that was possible, but I think I have identified 2
blocking issues:
1) If I try to recycle a zmq_msg_t directly: in this case I will fail because
I cannot change only the "size" member of a zmq_msg_t without
reallocating it... so I'm forced (in my example) to always send 2k
bytes out (!!)
2) If I create only a memory pool of 2k-byte buffers and then wrap
the first available buffer inside a zmq_msg_t (allocated on the stack, not
on the heap): in this case I need to know when the internals of ZMQ have
finished using the zmq_msg_t and thus when I can mark that buffer as
available again in my memory pool. However, I see that the zmq_msg_init_data()
code in ZMQ contains:

//  Initialize constant message if there's no need to deallocate
if (ffn_ == NULL) {
...
_u.cmsg.data = data_;
_u.cmsg.size = size_;
...
} else {
...
_u.lmsg.content =
  static_cast<content_t *> (malloc (sizeof (content_t)));
...
_u.lmsg.content->data = data_;
_u.lmsg.content->size = size_;
_u.lmsg.content->ffn = ffn_;
_u.lmsg.content->hint = hint_;
new (&_u.lmsg.content->refcnt) zmq::atomic_counter_t ();
}

So I skip the malloc() operation only if I pass ffn_ == NULL. The problem
is that if I pass ffn_ == NULL, then I have no way to know when the
internals of ZMQ have finished using the zmq_msg_t...

Any way to work around either issue 1) or issue 2)?

I understand that the malloc is just sizeof(content_t) ~= 40 bytes... but
still I'd like to avoid it...

Thanks!
Francesco




On Thu, 4 Jul 2019 at 14:58 Stephan Opfer <op...@vs.uni-kassel.de> wrote:

>
> On 04.07.19 14:29, Luca Boccassi wrote:
> > How users make use of these primitives is up to them though, I don't
> > think anything special was shared before, as far as I remember.
>
> Some examples can be found here:
> https://github.com/dasys-lab/capnzero/tree/master/capnzero/src
>
> The classes Publisher and Subscriber should replace the publisher and
> subscriber in a former Robot-Operating-System-based System. I hope that
> the subscriber is actually using the method Luca is talking about on the
> receiving side.
>
> The message data here is a Cap'n Proto container that we "simply"
> serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)
>
> --
> Distributed Systems Research Group
> Stephan Opfer  T. +49 561 804-6279  F. +49 561 804-6277
> Univ. Kassel,  FB 16,  Wilhelmshöher Allee 73,  D-34121 Kassel
> WWW: http://www.uni-kassel.de/go/vs_stephan-opfer/
>


Re: [zeromq-dev] Memory pool for zmq_msg_t

2019-07-04 Thread Stephan Opfer


On 04.07.19 14:29, Luca Boccassi wrote:

> How users make use of these primitives is up to them though, I don't
> think anything special was shared before, as far as I remember.


Some examples can be found here:
https://github.com/dasys-lab/capnzero/tree/master/capnzero/src


The classes Publisher and Subscriber should replace the publisher and 
subscriber in a former Robot-Operating-System-based System. I hope that 
the subscriber is actually using the method Luca is talking about on the 
receiving side.


The message data here is a Cap'n Proto container that we "simply" 
serialize and send via ZeroMQ -> therefore the name Cap'nZero ;-)


--
Distributed Systems Research Group
Stephan Opfer  T. +49 561 804-6279  F. +49 561 804-6277
Univ. Kassel,  FB 16,  Wilhelmshöher Allee 73,  D-34121 Kassel
WWW: http://www.uni-kassel.de/go/vs_stephan-opfer/



Re: [zeromq-dev] Memory pool for zmq_msg_t

2019-07-04 Thread Luca Boccassi
On Thu, 2019-07-04 at 14:21 +0200, Francesco wrote:
> Hi all,
> 
> I'm doing some benchmarking of a library I wrote based on ZMQ.
> In most of my use cases if I do a "perf top" on my application thread
> I see something like this:
> 
>   12,09%  [kernel]  [k] sysret_check
>7,48%  [kernel]  [k] system_call_after_swapgs
>5,64%  libc-2.25.so  [.] _int_malloc
>3,40%  libzmq.so.5.2.1   [.] zmq::socket_base_t::send
>3,20%  [kernel]  [k] do_sys_poll
> 
> 
> That is, ignoring the calls to Linux kernel, I see that malloc() is
> the most time-consuming operation my software is doing. After some
> investigation that's due to the use I do of zmq_msg_init_size().
> 
> Now I wonder: has anybody ever tried to avoid this kind of malloc()
> by using the zmq_msg_init_data() API instead, together with some sort of
> memory pool for zmq_msg_t objects?
> 
> 
> I've seen a proposal in this email thread:
> 
> https://lists.zeromq.org/mailman/private/zeromq-dev/2016-November/031131.html
> but as far as I know nothing was submitted to the zmq community,
> right?
> 
> Thanks,
> Francesco

Hi,

The zmq_msg_init_data is there for that purpose and it works well - if
you pass your own buffer, it won't allocate new ones internally. The same
applies on receive since v4.2.0: the buffer returned by the kernel syscall
is used directly in the message.
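
For example (just a sketch, not from the documentation; note the buffer must
not be reused or freed until the free function has been called):

#include <zmq.h>
#include <string.h>

static void my_free (void *data_, void *hint_)
{
    //  Called by libzmq from an I/O thread once every copy of the message
    //  has been closed; only now is it safe to reuse or free the buffer.
    (void) data_;
    (void) hint_;
}

int send_own_buffer (void *socket_)
{
    static char buf[2048];              //  caller-owned storage
    memcpy (buf, "hello", 5);
    zmq_msg_t msg;
    int rc = zmq_msg_init_data (&msg, buf, 5, my_free, NULL);
    if (rc != 0)
        return rc;
    return zmq_msg_send (&msg, socket_, 0);   //  no payload copy here
}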

How users make use of these primitives is up to them though, I don't
think anything special was shared before, as far as I remember.

-- 
Kind regards,
Luca Boccassi




[zeromq-dev] Memory pool for zmq_msg_t

2019-07-04 Thread Francesco
Hi all,

I'm doing some benchmarking of a library I wrote based on ZMQ.
In most of my use cases if I do a "perf top" on my application thread I see
something like this:

  12,09%  [kernel]          [k] sysret_check
   7,48%  [kernel]          [k] system_call_after_swapgs
   5,64%  libc-2.25.so      [.] _int_malloc
   3,40%  libzmq.so.5.2.1   [.] zmq::socket_base_t::send
   3,20%  [kernel]          [k] do_sys_poll

That is, ignoring the calls into the Linux kernel, I see that malloc() is the
most time-consuming operation my software is doing. After some
investigation, that's due to my use of zmq_msg_init_size().

Now I wonder: has anybody ever tried to avoid this kind of malloc() by
using the zmq_msg_init_data() API instead, together with some sort of memory
pool for zmq_msg_t objects?

I've seen a proposal in this email thread:

https://lists.zeromq.org/mailman/private/zeromq-dev/2016-November/031131.html
but as far as I know nothing was submitted to the zmq community, right?

Thanks,
Francesco