Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Jinpu Wang
Hi Peter

On Thu, May 2, 2024 at 6:20 PM Peter Xu  wrote:
>
> On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> > Hi Michael, Hi Peter,
> >
> >
> > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
> > >
> > > Yu Zhang / Jinpu,
> > >
> > > Any possibility (at your leisure, and within the disclosure rules of
> > > your company, IONOS) that you could share any of your performance
> > > information to educate the group?
> > >
> > > NICs have indeed changed, but not everybody has 100GbE Mellanox cards at
> > > their disposal. Some people don't.
> > Our staging environment uses 100 Gb/s InfiniBand. We will have a new
> > setup with Ethernet (RoCE) in the coming months and will run a
> > performance comparison once that environment is ready.
>
> Thanks both.  Please keep us posted.
>
> Just to double check: we're comparing "tcp:" vs. "rdma:", and RoCE is not
> involved, am I right?
Kind of. Our new hardware is RDMA-capable, and we can configure it to run
either the "rdma" or the "tcp" transport, which makes for a more direct
comparison. When the "rdma" transport is used, RoCE is involved, e.g. the
rdma-core/ibverbs/rdmacm stack and the vendor's verbs driver are used.
>
> The other note is that the comparison needs to be with multifd enabled for
> the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.
>
> I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
> without even waiting for months.  If you want I can try to see how we can
> test together.  And btw I don't think we need a cluster, IIUC we simply
> need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
> two cards just for experiments, systems that can drive the cards, and a
> wire supporting 100G?

Yes, the simple setup can be just two hosts directly connected. This reminds
me that I may also be able to find a test setup with a 100 G NIC in our lab;
I will keep you posted.

Regards!
>
> >
> > >
> > > - Michael
> >
> > Thx!
> > Jinpu
> > >
> > > On 5/1/24 11:16, Peter Xu wrote:
> > > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > > >>> What I worry about more is whether this is really how we want
> > > >>> to keep rdma in qemu, and that's also why I was trying to
> > > >>> request some serious performance measurements comparing rdma
> > > >>> vs. nics.  And here when I said "we" I mean both the QEMU
> > > >>> community and any company that will support keeping rdma around.
> > > >>>
> > > >>> The problem is: if NICs are now fast enough to perform at least
> > > >>> on par with rdma, and if they have a lower overall maintenance
> > > >>> cost, does that mean rdma migration will only be used by whoever
> > > >>> already has it in their products and wants to keep it?  In that
> > > >>> case we should simply ask new users to stick with tcp, and the
> > > >>> number of rdma users should only drop, not increase.
> > > >>>
> > > >>> It also seems destined that most new migration features will not
> > > >>> support rdma: see how many old features we drop in migration now
> > > >>> (which rdma _might_ still leverage, but maybe not), and how much
> > > >>> of what we add is mostly multifd-related, which will probably
> > > >>> not apply to rdma at all.  So in general what I am worried about
> > > >>> is a lose-lose situation, where it might be easier for the
> > > >>> company to either stick with an old qemu (depending on whether
> > > >>> other new features are requested besides RDMA alone) or do
> > > >>> periodic rebases with RDMA carried downstream only.
> > > >> I don't know much about the origins of RDMA support in QEMU and why
> > > >> this particular design was taken. It is indeed a huge maint burden to
> > > >> have a completely different code flow for RDMA with 4000+ lines of
> > > >> custom protocol signalling which is barely understandable.
> > > >>
> > > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > > >> type could almost[1] trivially have supported RDMA. There would have
> > > >> been almost no RDMA code required in the migration subsystem, and all
> > > >> the modern features like compression, multifd, post-copy, etc would
> > > >> "just work".
> > > >>
> > > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > > >> performance gains, but it might still have been a better tradeoff
> > > >> to have not quite so good peak performance, but with massively
> > > >> less maint burden.
> > > > My understanding so far is that RDMA is solely for performance and
> > > > nothing else; then it's a question of whether existing rdma users
> > > > would be willing to do so if it will run slower.
> > > >
> > > > Jinpu mentioned on the explicit usages of ib 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Peter Xu
On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> Hi Michael, Hi Peter,
> 
> 
> On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
> >
> > Yu Zhang / Jinpu,
> >
> > Any possibility (at your leisure, and within the disclosure rules of
> > your company, IONOS) that you could share any of your performance
> > information to educate the group?
> >
> > NICs have indeed changed, but not everybody has 100GbE Mellanox cards at
> > their disposal. Some people don't.
> Our staging environment uses 100 Gb/s InfiniBand. We will have a new
> setup with Ethernet (RoCE) in the coming months and will run a
> performance comparison once that environment is ready.

Thanks both.  Please keep us posted.

Just to double check: we're comparing "tcp:" vs. "rdma:", and RoCE is not
involved, am I right?

The other note is that the comparison needs to be with multifd enabled for
the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.

I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
without even waiting for months.  If you want I can try to see how we can
test together.  And btw I don't think we need a cluster, IIUC we simply
need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
two cards just for experiments, systems that can drive the cards, and a
wire supporting 100G?
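
For reference, a minimal sketch of what such a comparison run could look
like from the source VM's HMP monitor (addresses, port and channel count
are placeholders; the destination side would need the same capabilities
set and a matching -incoming URI):

  (qemu) migrate_set_capability multifd on
  (qemu) migrate_set_parameter multifd-channels 8
  (qemu) migrate -d tcp:192.0.2.2:4444

versus the RDMA run, which skips the multifd setup and simply uses:

  (qemu) migrate -d rdma:192.0.2.2:4444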

> 
> >
> > - Michael
> 
> Thx!
> Jinpu
> >
> > On 5/1/24 11:16, Peter Xu wrote:
> > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > >>> What I worry about more is whether this is really how we want to keep
> > >>> rdma in qemu, and that's also why I was trying to request some serious
> > >>> performance measurements comparing rdma vs. nics.  And here when I said
> > >>> "we" I mean both the QEMU community and any company that will support
> > >>> keeping rdma around.
> > >>>
> > >>> The problem is: if NICs are now fast enough to perform at least on par
> > >>> with rdma, and if they have a lower overall maintenance cost, does that
> > >>> mean rdma migration will only be used by whoever already has it in
> > >>> their products and wants to keep it?  In that case we should simply ask
> > >>> new users to stick with tcp, and the number of rdma users should only
> > >>> drop, not increase.
> > >>>
> > >>> It also seems destined that most new migration features will not
> > >>> support rdma: see how many old features we drop in migration now (which
> > >>> rdma _might_ still leverage, but maybe not), and how much of what we
> > >>> add is mostly multifd-related, which will probably not apply to rdma at
> > >>> all.  So in general what I am worried about is a lose-lose situation,
> > >>> where it might be easier for the company to either stick with an old
> > >>> qemu (depending on whether other new features are requested besides
> > >>> RDMA alone) or do periodic rebases with RDMA carried downstream only.
> > >> I don't know much about the origins of RDMA support in QEMU and why
> > >> this particular design was taken. It is indeed a huge maint burden to
> > >> have a completely different code flow for RDMA with 4000+ lines of
> > >> custom protocol signalling which is barely understandable.
> > >>
> > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > >> type could almost[1] trivially have supported RDMA. There would have
> > >> been almost no RDMA code required in the migration subsystem, and all
> > >> the modern features like compression, multifd, post-copy, etc would
> > >> "just work".
> > >>
> > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > >> performance gains, but it might still have been a better tradeoff
> > >> to have not quite so good peak performance, but with massively
> > >> less maint burden.
> > > My understanding so far is that RDMA is solely for performance and
> > > nothing else; then it's a question of whether existing rdma users would
> > > be willing to do so if it will run slower.
> > >
> > > Jinpu mentioned the explicit usage of ib verbs, but I am mostly just
> > > quoting that term as I don't really know such details:
> > >
> > > https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/
> > >
> > > So not sure whether that applies here too, in that having qiochannel
> > > wrapper may not allow direct access to those ib verbs.
> > >
> > > Thanks,
> > >
> > >> With regards,
> > >> Daniel
> > >>
> > >> [1] "almost" trivially, because the poll() integration for rsockets
> > >>  requires a bit more magic sauce since rsockets FDs are not
> > >>  really FDs from the 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Jinpu Wang
Hi Michael, Hi Peter,


On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
>
> Yu Zhang / Jinpu,
>
> Any possibility (at your leisure, and within the disclosure rules of
> your company, IONOS) that you could share any of your performance
> information to educate the group?
>
> NICs have indeed changed, but not everybody has 100GbE Mellanox cards at
> their disposal. Some people don't.
Our staging environment uses 100 Gb/s InfiniBand. We will have a new
setup with Ethernet (RoCE) in the coming months and will run a
performance comparison once that environment is ready.

>
> - Michael

Thx!
Jinpu
>
> On 5/1/24 11:16, Peter Xu wrote:
> > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> >>> What I worry about more is whether this is really how we want to keep
> >>> rdma in qemu, and that's also why I was trying to request some serious
> >>> performance measurements comparing rdma vs. nics.  And here when I said
> >>> "we" I mean both the QEMU community and any company that will support
> >>> keeping rdma around.
> >>>
> >>> The problem is: if NICs are now fast enough to perform at least on par
> >>> with rdma, and if they have a lower overall maintenance cost, does that
> >>> mean rdma migration will only be used by whoever already has it in their
> >>> products and wants to keep it?  In that case we should simply ask new
> >>> users to stick with tcp, and the number of rdma users should only drop,
> >>> not increase.
> >>>
> >>> It also seems destined that most new migration features will not support
> >>> rdma: see how many old features we drop in migration now (which rdma
> >>> _might_ still leverage, but maybe not), and how much of what we add is
> >>> mostly multifd-related, which will probably not apply to rdma at all.
> >>> So in general what I am worried about is a lose-lose situation, where it
> >>> might be easier for the company to either stick with an old qemu
> >>> (depending on whether other new features are requested besides RDMA
> >>> alone) or do periodic rebases with RDMA carried downstream only.
> >> I don't know much about the origins of RDMA support in QEMU and why
> >> this particular design was taken. It is indeed a huge maint burden to
> >> have a completely different code flow for RDMA with 4000+ lines of
> >> custom protocol signalling which is barely understandable.
> >>
> >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> >> type could almost[1] trivially have supported RDMA. There would have
> >> been almost no RDMA code required in the migration subsystem, and all
> >> the modern features like compression, multifd, post-copy, etc would
> >> "just work".
> >>
> >> I guess the 'rsocket.h' shim may well limit some of the possible
> >> performance gains, but it might still have been a better tradeoff
> >> to have not quite so good peak performance, but with massively
> >> less maint burden.
> > My understanding so far is that RDMA is solely for performance and nothing
> > else; then it's a question of whether existing rdma users would be willing
> > to do so if it will run slower.
> >
> > Jinpu mentioned the explicit usage of ib verbs, but I am mostly just
> > quoting that term as I don't really know such details:
> >
> > https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/
> >
> > So not sure whether that applies here too, in that having qiochannel
> > wrapper may not allow direct access to those ib verbs.
> >
> > Thanks,
> >
> >> With regards,
> >> Daniel
> >>
> >> [1] "almost" trivially, because the poll() integration for rsockets
> >>  requires a bit more magic sauce since rsockets FDs are not
> >>  really FDs from the kernel's POV. Still, QIOChannel likely can
> >>  abstract that problem.
> >> --
> >> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> >> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> >> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Michael Galaxy

Yu Zhang / Jinpu,

Any possibility (at your leisure, and within the disclosure rules of
your company, IONOS) that you could share any of your performance
information to educate the group?


NICs have indeed changed, but not everybody has 100GbE Mellanox cards at
their disposal. Some people don't.


- Michael

On 5/1/24 11:16, Peter Xu wrote:

On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:

On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:

What I worry about more is whether this is really how we want to keep rdma in
qemu, and that's also why I was trying to request some serious performance
measurements comparing rdma vs. nics.  And here when I said "we" I mean both
the QEMU community and any company that will support keeping rdma around.

The problem is: if NICs are now fast enough to perform at least on par with
rdma, and if they have a lower overall maintenance cost, does that mean rdma
migration will only be used by whoever already has it in their products and
wants to keep it?  In that case we should simply ask new users to stick with
tcp, and the number of rdma users should only drop, not increase.

It also seems destined that most new migration features will not support
rdma: see how many old features we drop in migration now (which rdma _might_
still leverage, but maybe not), and how much of what we add is mostly
multifd-related, which will probably not apply to rdma at all.  So in general
what I am worried about is a lose-lose situation, where it might be easier
for the company to either stick with an old qemu (depending on whether other
new features are requested besides RDMA alone) or do periodic rebases with
RDMA carried downstream only.

I don't know much about the origins of RDMA support in QEMU and why
this particular design was taken. It is indeed a huge maint burden to
have a completely different code flow for RDMA with 4000+ lines of
custom protocol signalling which is barely understandable.

I would note that /usr/include/rdma/rsocket.h provides a higher level
API that is a 1-1 match of the normal kernel 'sockets' API. If we had
leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
type could almost[1] trivially have supported RDMA. There would have
been almost no RDMA code required in the migration subsystem, and all
the modern features like compression, multifd, post-copy, etc would
"just work".

I guess the 'rsocket.h' shim may well limit some of the possible
performance gains, but it might still have been a better tradeoff
to have not quite so good peak performance, but with massively
less maint burden.

My understanding so far is that RDMA is solely for performance and nothing
else; then it's a question of whether existing rdma users would be willing to
do so if it will run slower.

Jinpu mentioned the explicit usage of ib verbs, but I am mostly just
quoting that term as I don't really know such details:

https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/

So not sure whether that applies here too, in that having qiochannel
wrapper may not allow direct access to those ib verbs.

Thanks,


With regards,
Daniel

[1] "almost" trivially, because the poll() integration for rsockets
 requires a bit more magic sauce since rsockets FDs are not
 really FDs from the kernel's POV. Still, QIOChannel likely can
 abstract that problem.
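
As a concrete illustration of the rsockets point, here is a minimal,
hypothetical C sketch of a client using librdmacm's rsocket API (compile
with -lrdmacm; the function name, host, port and error handling are
illustrative only).  The calls mirror the regular sockets API one-to-one,
and the rpoll() call is exactly the "magic sauce" the footnote refers to,
since rsockets FDs cannot be handed to the kernel's poll():

#include <sys/types.h>
#include <sys/socket.h>
#include <poll.h>
#include <netdb.h>
#include <rdma/rsocket.h>   /* socket-style RDMA API from librdmacm */

/* Connect over RDMA and send one buffer, sockets-style. */
static int rdma_send_buf(const char *host, const char *port,
                         const void *buf, size_t len)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
    struct addrinfo *res;

    if (getaddrinfo(host, port, &hints, &res)) {
        return -1;
    }

    int fd = rsocket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0) {
        freeaddrinfo(res);
        return -1;
    }
    if (rconnect(fd, res->ai_addr, res->ai_addrlen) < 0) {
        freeaddrinfo(res);
        rclose(fd);
        return -1;
    }
    freeaddrinfo(res);

    /* rsockets FDs are not kernel FDs, so wait with rpoll(), not poll(). */
    struct pollfd pfd = { .fd = fd, .events = POLLOUT };
    if (rpoll(&pfd, 1, 5000) <= 0) {
        rclose(fd);
        return -1;
    }

    ssize_t n = rsend(fd, buf, len, 0);
    rclose(fd);
    return n == (ssize_t)len ? 0 : -1;
}

The server side has the same r-prefixed equivalents (rbind(), rlisten(),
raccept()), which is why a QIOChannel wrapper over rsockets would need
comparatively little RDMA-specific code.
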
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|





Re: [PATCH v4 3/3] qapi: introduce device-sync-config

2024-05-02 Thread Markus Armbruster
Vladimir Sementsov-Ogievskiy  writes:

> On 30.04.24 11:31, Vladimir Sementsov-Ogievskiy wrote:
>> On 30.04.24 11:19, Markus Armbruster wrote:
>>> Vladimir Sementsov-Ogievskiy  writes:
>>>
 Add a command to sync the config from the vhost-user backend to the
 device. It may be helpful when VHOST_USER_SLAVE_CONFIG_CHANGE_MSG
 failed, did not trigger an interrupt to the guest, or is simply not
 available (not supported by the vhost-user server).

 The command result is racy if we allow it during migration. Let's
 allow the sync only in the RUNNING state.

 Signed-off-by: Vladimir Sementsov-Ogievskiy 

[...]

 diff --git a/system/qdev-monitor.c b/system/qdev-monitor.c
 index 264978aa40..47bfc0506e 100644
 --- a/system/qdev-monitor.c
 +++ b/system/qdev-monitor.c
 @@ -23,6 +23,7 @@
   #include "monitor/monitor.h"
   #include "monitor/qdev.h"
   #include "sysemu/arch_init.h"
 +#include "sysemu/runstate.h"
   #include "qapi/error.h"
   #include "qapi/qapi-commands-qdev.h"
   #include "qapi/qmp/dispatch.h"
 @@ -971,6 +972,53 @@ void qmp_device_del(const char *id, Error **errp)
  }
  }

 +int qdev_sync_config(DeviceState *dev, Error **errp)
 +{
 +    DeviceClass *dc = DEVICE_GET_CLASS(dev);
 +
 +    if (!dc->sync_config) {
 +    error_setg(errp, "device-sync-config is not supported for '%s'",
 +   object_get_typename(OBJECT(dev)));
 +    return -ENOTSUP;
 +    }
 +
 +    return dc->sync_config(dev, errp);
 +}
 +
 +void qmp_device_sync_config(const char *id, Error **errp)
 +{
 +    DeviceState *dev;
 +
 +    /*
 + * During migration there is a race between syncing configuration and
 + * migrating it (if migrate first, that target would get outdated version),
 + * so let's just not allow it.
>>>
>>> Wrap comment lines around column 70 for legibility, please.
>>>
 + *
 + * Moreover, let's not rely on setting up interrupts in paused
 + * state, which may be a part of migration process.
>>>
>>> We discussed this in review of v3.  You wanted to check whether the
>>> problem is real.  Is it?
>>
>> We discussed it after I had sent v4 :) No, I haven't checked yet.
>
> Checked. It seems this works (in the pause -> migrate -> resume scheme,
> interrupts are migrated and triggered on the target after resume), so my
> doubt was wrong.

Thanks!

[...]
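
As background for readers, a hypothetical sketch of how a device could opt
in to the new DeviceClass hook that qdev_sync_config() dispatches to in the
diff above.  The "my_dev" names are placeholders, not part of the patch;
only the dc->sync_config assignment and the callback signature reflect the
proposed API:

#include "qemu/osdep.h"
#include "hw/qdev-core.h"   /* DeviceState, DeviceClass */
#include "qapi/error.h"

/* Callback invoked by qdev_sync_config() / the device-sync-config command. */
static int my_dev_sync_config(DeviceState *dev, Error **errp)
{
    /*
     * Re-read the backend's configuration here and refresh the
     * guest-visible copy; return 0 on success, or set errp and return
     * a negative value on failure.
     */
    return 0;
}

static void my_dev_class_init(ObjectClass *klass, void *data)
{
    DeviceClass *dc = DEVICE_CLASS(klass);

    /* Devices that don't set this callback get the "not supported" error. */
    dc->sync_config = my_dev_sync_config;
}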