RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-07 Thread Gonglei (Arei)
Hi,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Thursday, June 6, 2024 5:19 AM
> To: Dr. David Alan Gilbert 
> Cc: Michael Galaxy ; zhengchuan
> ; Gonglei (Arei) ;
> Daniel P. Berrangé ; Markus Armbruster
> ; Yu Zhang ; Zhijian Li (Fujitsu)
> ; Jinpu Wang ; Elmar Gerdes
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; integrat...@gluster.org; Paolo
> Bonzini ; qemu-block@nongnu.org;
> de...@lists.libvirt.org; Hanna Reitz ; Michael S. Tsirkin
> ; Thomas Huth ; Eric Blake
> ; Song Gao ; Marc-André
> Lureau ; Alex Bennée
> ; Wainer dos Santos Moschetta
> ; Beraldo Leal ; Pannengyuan
> ; Xiexiangyou 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, Jun 05, 2024 at 08:48:28PM +, Dr. David Alan Gilbert wrote:
> > > > I just noticed this thread; some random notes from a somewhat
> > > > fragmented memory of this:
> > > >
> > > >   a) Long long ago, I also tried rsocket;
> > > >
> https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > > >  as I remember the library was quite flaky at the time.
> > >
> > > Hmm interesting.  There also looks like a thread doing rpoll().
> >
> > Yeh, I can't actually remember much more about what I did back then!
> 
> Heh, that's understandable and fair. :)
> 
> > > I hope Lei and his team has tested >4G mem, otherwise definitely
> > > worth checking.  Lei also mentioned there're rsocket bugs they found
> > > in the cover letter, but not sure what's that about.
> >
> > It would probably be a good idea to keep track of what bugs are in
> > flight with it, and try it on a few RDMA cards to see what problems
> > get triggered.
> > I think I reported a few at the time, but I gave up after feeling it
> > was getting very hacky.
> 
> Agreed.  Maybe we can have a list of that in the cover letter or even QEMU's
> migration/rdma doc page.
> 
> Lei, if you think that makes sense please do so in your upcoming posts.
> There'll need to be a list of things you encountered in the kernel driver,
> and it'll be even better if there are further links to read on each problem.
> 
OK, no problem. There are two bugs:

Bug 1:

https://github.com/linux-rdma/rdma-core/commit/23985e25aebb559b761872313f8cab4e811c5a3d#diff-5ddbf83c6f021688166096ca96c9bba874dffc3cab88ded2e9d8b2176faa084cR3302-R3303

This commit introduces a bug that can hang QEMU: when rpoll() is called with a
timeout other than -1 or 0, the program occasionally gets stuck.

Problem analysis:
During the first rpoll(), rs_poll_enter() at line 3297 does pollcnt++, so
pollcnt becomes 1.
At line 3302 the timeout expires and the function returns. Note that
rs_poll_exit() is not called here, so pollcnt is never decremented and stays
at 1.
During the second rpoll(), rs_poll_enter() at line 3297 does pollcnt++ again,
so pollcnt becomes 2.
If this time the timeout does not expire and poll() returns a value greater
than 0, rs_poll_stop() runs. Because --pollcnt leaves pollcnt at 1 (still
non-zero, due to the leak from the first call), rs_poll_stop() sets
suspendpoll = 1.
Control then goes back to the do/while loop inside rpoll() and calls
rs_poll_enter() again. This time the if (suspendpoll) condition is true, so it
calls pthread_yield() and returns -EBUSY. Back in rpoll()'s do/while loop,
if (rs_poll_enter()) is true, so the loop simply continues and calls
rs_poll_enter() once more. The program spins there and appears hung.

Root cause: on the timeout path at line 3302, the function returns without
calling rs_poll_exit().
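
A simplified, standalone model of the counter logic described above (the names
mirror rsocket.c, but this is an illustration of the sequence, not the
library's actual code):

#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
static int pollcnt;
static int suspendpoll;

static int rs_poll_enter(void)
{
    pthread_mutex_lock(&mut);
    if (suspendpoll) {              /* another poller asked us to back off */
        pthread_mutex_unlock(&mut);
        sched_yield();
        return -EBUSY;              /* caller is expected to retry */
    }
    pollcnt++;
    pthread_mutex_unlock(&mut);
    return 0;
}

static void rs_poll_stop(void)
{
    pthread_mutex_lock(&mut);
    if (--pollcnt)                  /* someone else still "inside" rpoll() */
        suspendpoll = 1;
    pthread_mutex_unlock(&mut);
}

/* Buggy timeout path: returns without undoing the pollcnt++ above. */
static int rpoll_timeout_path(void)
{
    if (rs_poll_enter())
        return -EBUSY;
    /* ... timeout expires, function exits ... */
    return 0;                       /* BUG: no matching rs_poll_exit() */
}

int main(void)
{
    rpoll_timeout_path();           /* 1st rpoll(): leaks pollcnt == 1     */
    rs_poll_enter();                /* 2nd rpoll(): pollcnt == 2           */
    rs_poll_stop();                 /* --pollcnt == 1 -> suspendpoll = 1   */
    /* From now on rs_poll_enter() always yields and returns -EBUSY, so a
     * retry loop around it never makes progress. */
    printf("pollcnt=%d suspendpoll=%d enter=%d\n",
           pollcnt, suspendpoll, rs_poll_enter());
    return 0;
}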


Bug 2:

In rsocket.c there is an accept queue, int accept_queue[2], implemented with a
socketpair. The listen_svc thread in rsocket.c receives incoming connections
and writes them to accept_queue[1]; when raccept() is called, the connection is
read from accept_queue[0].
In our test case, qio_channel_wait(QIO_CHANNEL(lioc), G_IO_IN) waits for a
readable event (i.e. for an incoming connection). rpoll() is expected to report
a readable event on accept_queue[0], but this poll does not actually include
accept_queue[0] in its fd set. Only after the timeout expires does rpoll() go
through rs_poll_arm() again and pick up the readable event on accept_queue[0].

Impact:
The accept can only complete after 5000 ms. This delay can be shortened by
echoing a smaller value, in milliseconds, into
/etc/rdma/rsocket/wake_up_interval.
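
A standalone sketch of the socketpair-based accept_queue hand-off described
above (a simplification for illustration, not the rsocket.c implementation):

#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int accept_queue[2];

static void *listen_svc(void *arg)
{
    int newconn = 42;                   /* stand-in for an accepted connection */

    (void)arg;
    sleep(1);                           /* the connection arrives "later" */
    write(accept_queue[1], &newconn, sizeof(newconn));
    return NULL;
}

int main(void)
{
    pthread_t thr;
    struct pollfd pfd;
    int conn;

    socketpair(AF_UNIX, SOCK_STREAM, 0, accept_queue);
    pthread_create(&thr, NULL, listen_svc, NULL);

    /* The accept side wakes up promptly only if accept_queue[0] is in the poll
     * set.  In the bug above, the poll inside rpoll() omitted it, so the caller
     * sat out the full timeout instead of waking up immediately. */
    pfd.fd = accept_queue[0];
    pfd.events = POLLIN;
    if (poll(&pfd, 1, 5000) > 0 && (pfd.revents & POLLIN)) {
        read(accept_queue[0], &conn, sizeof(conn));
        printf("accepted connection token %d\n", conn);
    }

    pthread_join(thr, NULL);
    return 0;
}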


Regards,
-Gonglei


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 08:48:28PM +, Dr. David Alan Gilbert wrote:
> > > I just noticed this thread; some random notes from a somewhat
> > > fragmented memory of this:
> > > 
> > >   a) Long long ago, I also tried rsocket; 
> > >   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> > >  as I remember the library was quite flaky at the time.
> > 
> > Hmm interesting.  There also looks like a thread doing rpoll().
> 
> Yeh, I can't actually remember much more about what I did back then!

Heh, that's understandable and fair. :)

> > I hope Lei and his team has tested >4G mem, otherwise definitely worth
> > checking.  Lei also mentioned there're rsocket bugs they found in the cover
> > letter, but not sure what's that about.
> 
> It would probably be a good idea to keep track of what bugs
> are in flight with it, and try it on a few RDMA cards to see
> what problems get triggered.
> I think I reported a few at the time, but I gave up after
> feeling it was getting very hacky.

Agreed.  Maybe we can have a list of that in the cover letter or even
QEMU's migration/rdma doc page.

Lei, if you think that makes sense please do so in your upcoming posts.
There'll need to be a list of things you encountered in the kernel driver,
and it'll be even better if there are further links to read on each problem.

> > > 
> > >   e) Someone made a good suggestion (sorry can't remember who) - that the
> > >  RDMA migration structure was the wrong way around - it should be the
> > >  destination which initiates an RDMA read, rather than the source
> > >  doing a write; then things might become a LOT simpler; you just need
> > >  to send page ranges to the destination and it can pull it.
> > >  That might work nicely for postcopy.
> > 
> > I'm not sure whether it'll still be a problem if rdma recv side is based on
> > zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> > that we don't want the guest vcpus to see a partially copied page during
> > on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> > that.
> 
> Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if
> the destination can issue an RDMA read itself, it doesn't need to send 
> messages
> to the source to ask for a page fetch; it just goes and grabs it itself,
> that's got to be good for latency.

Oh, that's pretty internal stuff of rdma to me and beyond my knowledge..
but from what I can tell it sounds very reasonable indeed!

Thanks!

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Dr. David Alan Gilbert
* Peter Xu (pet...@redhat.com) wrote:
> Hey, Dave!

Hey!

> On Wed, Jun 05, 2024 at 12:31:56AM +, Dr. David Alan Gilbert wrote:
> > * Michael Galaxy (mgal...@akamai.com) wrote:
> > > One thing to keep in mind here (despite me not having any hardware to 
> > > test)
> > > was that one of the original goals here
> > > in the RDMA implementation was not simply raw throughput nor raw latency,
> > > but a lack of CPU utilization in kernel
> > > space due to the offload. While it is entirely possible that newer 
> > > hardware
> > > w/ TCP might compete, the significant
> > > reductions in CPU usage in the TCP/IP stack were a big win at the time.
> > > 
> > > Just something to consider while you're doing the testing
> > 
> > I just noticed this thread; some random notes from a somewhat
> > fragmented memory of this:
> > 
> >   a) Long long ago, I also tried rsocket; 
> >   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
> >  as I remember the library was quite flaky at the time.
> 
> Hmm interesting.  There also looks like a thread doing rpoll().

Yeh, I can't actually remember much more about what I did back then!

> Btw, not sure whether you noticed, but there's the series posted for the
> latest rsocket conversion here:
> 
> https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gong...@huawei.com

Oh I hadn't; I think all of the stack of qemu's file abstractions had
changed in the ~10 years since I wrote my version!

> I hope Lei and his team has tested >4G mem, otherwise definitely worth
> checking.  Lei also mentioned there're rsocket bugs they found in the cover
> letter, but not sure what's that about.

It would probably be a good idea to keep track of what bugs
are in flight with it, and try it on a few RDMA cards to see
what problems get triggered.
I think I reported a few at the time, but I gave up after
feeling it was getting very hacky.

> Yes, and zero-copy requires multifd for now. I think it's because we didn't
> want to complicate the header processings in the migration stream where it
> may not be page aligned.

Ah yes.

> > 
> >   e) Someone made a good suggestion (sorry can't remember who) - that the
> >  RDMA migration structure was the wrong way around - it should be the
> >  destination which initiates an RDMA read, rather than the source
> >  doing a write; then things might become a LOT simpler; you just need
> >  to send page ranges to the destination and it can pull it.
> >  That might work nicely for postcopy.
> 
> I'm not sure whether it'll still be a problem if rdma recv side is based on
> zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> that we don't want the guest vcpus to see a partially copied page during
> on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> that.

Yes, but even ignoring that (and the UFFDIO_CONTINUE idea you mention), if
the destination can issue an RDMA read itself, it doesn't need to send messages
to the source to ask for a page fetch; it just goes and grabs it itself,
that's got to be good for latency.

Dave

> 
> Thanks,
> 
> -- 
> Peter Xu
> 
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\dave @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
On Wed, Jun 05, 2024 at 10:10:57AM -0400, Peter Xu wrote:
> >   e) Someone made a good suggestion (sorry can't remember who) - that the
> >  RDMA migration structure was the wrong way around - it should be the
> >  destination which initiates an RDMA read, rather than the source
> >  doing a write; then things might become a LOT simpler; you just need
> >  to send page ranges to the destination and it can pull it.
> >  That might work nicely for postcopy.
> 
> I'm not sure whether it'll still be a problem if rdma recv side is based on
> zero-copy.  It would be a matter of whether atomicity can be guaranteed so
> that we don't want the guest vcpus to see a partially copied page during
> on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
> that.

And when thinking about this (of UFFDIO_COPY's nature on not being able to
do zero-copy...), the only way this will be able to do zerocopy is to use
file memories (shmem/hugetlbfs), as page cache can be prepopulated. So that
when we do DMA we pass over the page cache, which can be mapped in another
virtual address besides what the vcpus are using.
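
(A tiny illustration of the double-mapping property this relies on, sketch
only; the userfaultfd/UFFDIO_CONTINUE step that would atomically install the
page into the vcpu page tables is not shown:)

#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4096;
    int fd = memfd_create("pagecache-demo", 0);

    ftruncate(fd, len);

    /* "DMA" view: where received data would land. */
    char *dma_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* "vcpu" view: a second mapping of the same page-cache page. */
    char *vcpu_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    memcpy(dma_view, "migrated page data", 19);    /* prepopulate page cache */
    assert(memcmp(vcpu_view, "migrated page data", 19) == 0);

    close(fd);
    return 0;
}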

Then we can use UFFDIO_CONTINUE (rather than UFFDIO_COPY) to do atomic
updates on the vcpu pgtables, avoiding the copy.  QEMU doesn't have it, but
it looks like there's one more reason we may want to have better use of
shmem.. than anonymous.  And actually when working on 4k faults on 1G
hugetlb I added CONTINUE support.

https://github.com/xzpeter/qemu/tree/doublemap
https://github.com/xzpeter/qemu/commit/b8aff3a9d7654b1cf2c089a06894ff4899740dc5

Maybe it's worthwhile on its own now, because it also means we can use that
in multifd to avoid one extra layer of buffering when supporting
multifd+postcopy (which has the same issue here on directly copying data
into guest pages).  It'll also work with things like rmda I think in
similar ways.  It's just that it'll not work on anonymous.

I definitely hijacked the thread to somewhere too far away.  I'll stop
here..

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-05 Thread Peter Xu
Hey, Dave!

On Wed, Jun 05, 2024 at 12:31:56AM +, Dr. David Alan Gilbert wrote:
> * Michael Galaxy (mgal...@akamai.com) wrote:
> > One thing to keep in mind here (despite me not having any hardware to test)
> > was that one of the original goals here
> > in the RDMA implementation was not simply raw throughput nor raw latency,
> > but a lack of CPU utilization in kernel
> > space due to the offload. While it is entirely possible that newer hardware
> > w/ TCP might compete, the significant
> > reductions in CPU usage in the TCP/IP stack were a big win at the time.
> > 
> > Just something to consider while you're doing the testing
> 
> I just noticed this thread; some random notes from a somewhat
> fragmented memory of this:
> 
>   a) Long long ago, I also tried rsocket; 
>   https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
>  as I remember the library was quite flaky at the time.

Hmm interesting.  There also looks like a thread doing rpoll().

Btw, not sure whether you noticed, but there's the series posted for the
latest rsocket conversion here:

https://lore.kernel.org/r/1717503252-51884-1-git-send-email-arei.gong...@huawei.com

I hope Lei and his team has tested >4G mem, otherwise definitely worth
checking.  Lei also mentioned there're rsocket bugs they found in the cover
letter, but not sure what's that about.

> 
>   b) A lot of the complexity in the rdma migration code comes from
> emulating a stream to carry the migration control data and interleaving
> that with the actual RAM copy.   I believe the original design used
> a separate TCP socket for the control data, and just used the RDMA
> for the data - that should be a lot simpler (but alas was rejected
> in review early on)
> 
> >   c) I can't remember the last benchmarks I did; but I think I did
> > manage to beat RDMA with multifd; but yes, multifd does eat host CPU
> > whereas RDMA barely uses a whisper.

I think my first impression on this matter came from you on this one. :)

> 
>   d) The 'zero-copy-send' option in migrate may well get some of that
> >  CPU time back; but if I remember we were still bottlenecked on
>  the receive side. (I can't remember if zero-copy-send worked with
>  multifd?)

Yes, and zero-copy requires multifd for now. I think it's because we didn't
want to complicate the header processings in the migration stream where it
may not be page aligned.

> 
>   e) Someone made a good suggestion (sorry can't remember who) - that the
>  RDMA migration structure was the wrong way around - it should be the
>  destination which initiates an RDMA read, rather than the source
>  doing a write; then things might become a LOT simpler; you just need
>  to send page ranges to the destination and it can pull it.
>  That might work nicely for postcopy.

I'm not sure whether it'll still be a problem if rdma recv side is based on
zero-copy.  It would be a matter of whether atomicity can be guaranteed so
that we don't want the guest vcpus to see a partially copied page during
on-flight DMAs.  UFFDIO_COPY (or friend) is currently the only solution for
that.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-06-04 Thread Dr. David Alan Gilbert
* Michael Galaxy (mgal...@akamai.com) wrote:
> One thing to keep in mind here (despite me not having any hardware to test)
> was that one of the original goals here
> in the RDMA implementation was not simply raw throughput nor raw latency,
> but a lack of CPU utilization in kernel
> space due to the offload. While it is entirely possible that newer hardware
> w/ TCP might compete, the significant
> reductions in CPU usage in the TCP/IP stack were a big win at the time.
> 
> Just something to consider while you're doing the testing

I just noticed this thread; some random notes from a somewhat
fragmented memory of this:

  a) Long long ago, I also tried rsocket; 
  https://lists.gnu.org/archive/html/qemu-devel/2015-01/msg02040.html
 as I remember the library was quite flaky at the time.

  b) A lot of the complexity in the rdma migration code comes from
emulating a stream to carry the migration control data and interleaving
that with the actual RAM copy.   I believe the original design used
a separate TCP socket for the control data, and just used the RDMA
for the data - that should be a lot simpler (but alas was rejected
in review early on)

  c) I can't remember the last benchmarks I did; but I think I did
manage to beat RDMA with multifd; but yes, multifd does eat host CPU
whereas RDMA barely uses a whisper.

  d) The 'zero-copy-send' option in migrate may well get some of that
 CPU time back; but if I remember we were still bottlenecked on
 the receive side. (I can't remember if zero-copy-send worked with
 multifd?)

  e) Someone made a good suggestion (sorry can't remember who) - that the
 RDMA migration structure was the wrong way around - it should be the
 destination which initiates an RDMA read, rather than the source
 doing a write; then things might become a LOT simpler; you just need
 to send page ranges to the destination and it can pull it.
 That might work nicely for postcopy.
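
(For reference, the destination-initiated pull in (e) boils down to posting an
IBV_WR_RDMA_READ work request.  A minimal libibverbs sketch, assuming the QP is
already connected and the remote address/rkey were exchanged out of band:)

#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one RDMA READ: pull 'len' bytes from remote_addr/rkey into local_buf.
 * The completion shows up later on the QP's send CQ. */
static int post_rdma_read(struct ibv_qp *qp, struct ibv_mr *local_mr,
                          void *local_buf, size_t len,
                          uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = (uint32_t)len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,
        .send_flags = IBV_SEND_SIGNALED,
        .wr         = { .rdma = { .remote_addr = remote_addr, .rkey = rkey } },
    };
    struct ibv_send_wr *bad = NULL;

    return ibv_post_send(qp, &wr, &bad);
}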

Dave

> - Michael
> 
> On 5/9/24 03:58, Zheng Chuan wrote:
> > Hi, Peter,Lei,Jinpu.
> > 
> > On 2024/5/8 0:28, Peter Xu wrote:
> > > On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:
> > > > Hello,
> > > > 
> > > > > -Original Message-
> > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > Sent: Monday, May 6, 2024 11:18 PM
> > > > > To: Gonglei (Arei) 
> > > > > Cc: Daniel P. Berrangé ; Markus Armbruster
> > > > > ; Michael Galaxy ; Yu Zhang
> > > > > ; Zhijian Li (Fujitsu) ; 
> > > > > Jinpu Wang
> > > > > ; Elmar Gerdes ;
> > > > > qemu-de...@nongnu.org; Yuval Shaia ; Kevin 
> > > > > Wolf
> > > > > ; Prasanna Kumar Kalever
> > > > > ; Cornelia Huck ;
> > > > > Michael Roth ; Prasanna Kumar Kalever
> > > > > ; integrat...@gluster.org; Paolo Bonzini
> > > > > ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> > > > > Hanna Reitz ; Michael S. Tsirkin ;
> > > > > Thomas Huth ; Eric Blake ; Song
> > > > > Gao ; Marc-André Lureau
> > > > > ; Alex Bennée ;
> > > > > Wainer dos Santos Moschetta ; Beraldo Leal
> > > > > ; Pannengyuan ;
> > > > > Xiexiangyou 
> > > > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol 
> > > > > handling
> > > > > 
> > > > > On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> > > > > > Hi, Peter
> > > > > Hey, Lei,
> > > > > 
> > > > > Happy to see you around again after years.
> > > > > 
> > > > Haha, me too.
> > > > 
> > > > > > RDMA features high bandwidth, low latency (in non-blocking lossless
> > > > > > network), and direct remote memory access by bypassing the CPU (As 
> > > > > > you
> > > > > > know, CPU resources are expensive for cloud vendors, which is one of
> > > > > > the reasons why we introduced offload cards.), which TCP does not 
> > > > > > have.
> > > > > It's another cost to use offload cards, v.s. preparing more cpu 
> > > > > resources?
> > > > > 
> > > > Software and hardware offload converged architecture is the way to go 
> > > > for all cloud vendors
> > > > (Including comprehensive benefits in terms of performance, cost, 
> > > > security, and innovation speed),
> > > > it's not just a matter of adding the resource of a

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-30 Thread Sean Hefty
> > For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> >
> > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > different responsibilities. rdma_cm fd is used to notify connection
> > establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> event can be monitored, which means that an rdma_cm event occurs. When
> the verbs fd is directly polled/epolled, only the pollin event can be 
> listened,
> which indicates that a new CQE is generated.
> >
> > Rsocket is a sub-module attached to the rdma_cm library and provides
> > rdma calls that are completely similar to socket interfaces. However,
> > this library returns only the rdma_cm fd for listening to link setup-related
> events and does not expose the verbs fd (readable and writable events for
> listening to data). Only the rpoll interface provided by the RSocket can be 
> used
> to listen to related events. However, QEMU uses the ppoll interface to listen 
> to
> the rdma_cm fd (gotten by raccept API).
> > And cannot listen to the verbs fd event. Only some hacking methods can be
> used to address this problem.
> >
> > Do you guys have any ideas? Thanks.

The current rsocket code allows calling rpoll() with non-rsocket fd's, so an 
app can use rpoll() directly in place of poll().  It may be easiest to add an 
rppoll() call to rsockets and call that when using RDMA.
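
A rough sketch of the shape such a call could take (a hypothetical shim built
on the existing rpoll(); it is not part of librdmacm today, and the signal-mask
swap below is non-atomic, unlike what a real in-library rppoll() could do):

#include <poll.h>
#include <signal.h>
#include <time.h>
#include <rdma/rsocket.h>

int rppoll_shim(struct pollfd *fds, nfds_t nfds,
                const struct timespec *tmo, const sigset_t *sigmask)
{
    sigset_t origmask;
    int timeout_ms, ret;

    /* ppoll() semantics: a NULL timespec means block indefinitely. */
    timeout_ms = tmo ? (int)(tmo->tv_sec * 1000 + tmo->tv_nsec / 1000000) : -1;

    if (sigmask)
        pthread_sigmask(SIG_SETMASK, sigmask, &origmask);

    ret = rpoll(fds, nfds, timeout_ms);   /* rpoll() also accepts non-rsocket fds */

    if (sigmask)
        pthread_sigmask(SIG_SETMASK, &origmask, NULL);

    return ret;
}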

In case the easy path isn't feasible:

An extension could allow extracting the actual fd's under an rsocket, in order 
to allow a user to call poll()/ppoll() directly.  But it would be non-trivial.

The 'fd' that represents an rsocket happens to be the fd related to the RDMA 
CM.  That's because an rsocket needs a unique integer value to report as an 
'fd' value which will not conflict with any other fd value that the app may 
have.  I would consider the fd value an implementation detail, rather than 
something which an app should depend upon.  (For example, the 'fd' value 
returned for a datagram rsocket is actually a UDP socket fd).

Once an rsocket is in the connected state, it's possible an extended 
rgetsockopt() or rfcntl() call could return the fd related to the CQ.  But if 
an app tried to call poll() on that fd, the results would not be as expected.  
For example, it's possible for data to be available to receive on the rsocket 
without the CQ fd being signaled.  Calling poll() on the CQ fd in this state 
could leave the app hanging.  This is a natural? result of races in the RDMA CQ 
signaling.  If you look at the rsocket rpoll() implementation, you'll see that 
it checks for data prior to sleeping.

For an app to safely wait in poll/ppoll on the CQ fd, it would need to invoke 
some sort of 'pre-poll' routine, which would perform the same checks done in 
rpoll() prior to blocking.  As a reference to a similar pre-poll routine, see 
the fi_trywait() call from this man page: 

https://ofiwg.github.io/libfabric/v1.21.0/man/fi_poll.3.html

This is for a different library but deals with the same underlying problem.  
Obviously adding an rtrywait() to rsockets is possible but wouldn't align with 
any socket API equivalent.

- Sean


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Peter Xu
Lei,

On Wed, May 29, 2024 at 02:43:46AM +, Gonglei (Arei) wrote:
> For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> rdma_cm and ibverbs create two FDs respectively. The two FDs have
> different responsibilities. rdma_cm fd is used to notify connection
> establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a
> pollin event can be monitored, which means that an rdma_cm event
> occurs. When the verbs fd is directly polled/epolled, only the pollin
> event can be observed, which indicates that a new CQE has been generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides
> rdma calls that are completely similar to socket interfaces. However,
> this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and
> writable events for listening to data). Only the rpoll interface provided
> by the RSocket can be used to listen to related events. However, QEMU
> uses the ppoll interface to listen to the rdma_cm fd (gotten by raccept
> API).  And cannot listen to the verbs fd event. Only some hacking methods
> can be used to address this problem.  Do you guys have any ideas? Thanks.
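
(For context on the two descriptors being discussed: librdmacm exposes the
connection-event fd via rdma_create_event_channel(), and libibverbs exposes the
completion-notification fd via ibv_create_comp_channel().  An application that
owns both would wait on them together roughly as below, an illustrative sketch,
not QEMU or rsocket code.)

#include <poll.h>
#include <rdma/rdma_cma.h>
#include <infiniband/verbs.h>

/* Returns 1 for an rdma_cm event, 2 for a new CQE notification, 0 on timeout,
 * negative on error.  'ec' and 'cc' are assumed to have been created with
 * rdma_create_event_channel() and ibv_create_comp_channel() elsewhere. */
static int wait_cm_or_cqe(struct rdma_event_channel *ec,
                          struct ibv_comp_channel *cc, int timeout_ms)
{
    struct pollfd pfds[2] = {
        { .fd = ec->fd, .events = POLLIN },  /* connection establishment events */
        { .fd = cc->fd, .events = POLLIN },  /* completion (CQE) notifications  */
    };
    int ret = poll(pfds, 2, timeout_ms);

    if (ret <= 0)
        return ret;
    if (pfds[0].revents & POLLIN)
        return 1;
    if (pfds[1].revents & POLLIN)
        return 2;
    return 0;
}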

I saw that you mentioned this elsewhere:

> Right. But the question is QEMU does not use rpoll but glib's ppoll. :(

So what I'm thinking may not make much sense, as I mentioned I don't think
I know rdma at all.. and my idea also has involvement on coroutine stuff
which I also don't know well. But just in case it shed some light in some
form.

IIUC we do iochannel blocking like this for both read and write:

if (len == QIO_CHANNEL_ERR_BLOCK) {
    if (qemu_in_coroutine()) {
        qio_channel_yield(ioc, G_IO_XXX);
    } else {
        qio_channel_wait(ioc, G_IO_XXX);
    }
    continue;
}

One thing I'm wondering is whether we can provide a new feature bit for
qiochannel, e.g., QIO_CHANNEL_FEATURE_POLL, so that the iochannel can
define its own poll routine rather than using the default when possible.

I think it may not work if it's in a coroutine, as I guess that'll block
other fds from being waked up.  Hence it should look like this:

if (len == QIO_CHANNEL_ERR_BLOCK) {
    if (qemu_in_coroutine()) {
        qio_channel_yield(ioc, G_IO_XXX);
    } else if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_POLL)) {
        qio_channel_poll(ioc, G_IO_XXX);
    } else {
        qio_channel_wait(ioc, G_IO_XXX);
    }
    continue;
}

Maybe we even want to forbid such channel to be used in coroutine already,
as when QIO_CHANNEL_FEATURE_POLL set it may mean that this iochannel simply
won't work with poll() like in rdma's use case.

Then rdma iochannel can implement qio_channel_poll() using rpoll().
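
A standalone model of that idea (hypothetical names, not QEMU code;
QIO_CHANNEL_FEATURE_POLL and qio_channel_poll() do not exist today):

#include <poll.h>
#include <stdbool.h>
#include <rdma/rsocket.h>

enum { FEATURE_POLL = 1 << 0 };          /* stand-in for QIO_CHANNEL_FEATURE_POLL */

typedef struct Channel {
    unsigned features;
    int fd;                              /* rsocket fd for the rdma channel */
} Channel;

static bool channel_has_feature(Channel *c, unsigned f)
{
    return c->features & f;
}

/* What an rdma channel's qio_channel_poll() could do: block in rpoll(), which
 * internally arms both the rdma_cm fd and the verbs/CQ fd. */
static int channel_poll_rdma(Channel *c, short events)
{
    struct pollfd pfd = { .fd = c->fd, .events = events };

    return rpoll(&pfd, 1, -1);
}

/* Caller side, mirroring the if/else chain above (coroutine path omitted):
 * prefer the channel's own poll routine when the feature bit is set. */
static int channel_wait(Channel *c, short events)
{
    if (channel_has_feature(c, FEATURE_POLL))
        return channel_poll_rdma(c, events);

    struct pollfd pfd = { .fd = c->fd, .events = events };
    return poll(&pfd, 1, -1);            /* default path: plain poll()/ppoll() */
}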

There's one other dependent issue here in that I _think_ the migration recv
side is still in a coroutine.. so we may need to move that into a thread
first.  IIRC we don't yet have a major blocker to do that, but I didn't
further check either.  I've put that issue aside just to see whether this
may or may not make sense.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Greg Sword
On Wed, May 29, 2024 at 12:33 PM Jinpu Wang  wrote:
>
> On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)  
> wrote:
> >
> > Hi,
> >
> > > -Original Message-
> > > From: Peter Xu [mailto:pet...@redhat.com]
> > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > > widely used for production in our data center. The network
> > > > > > adapters are
> > > > > >
> > > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > >
> > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > > reasonable.
> > > > >
> > > > >
> > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > >
> > > > > Appreciate a lot for everyone helping on the testings.
> > > > >
> > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > [ConnectX-5]
> > > > > >
> > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > > Ethernet on these two hosts. One is standby while the other is 
> > > > > > active.
> > > > > >
> > > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > > network adapters. One of them has:
> > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > > >
> > > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > > more
> > > > > sense.
> > > > >
> > > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > > >
> > > > > Is there anyone who started looking into that direction?  Would it
> > > > > make sense we start some PoC now?
> > > > >
> > > >
> > > > My team has finished the PoC refactoring which works well.
> > > >
> > > > Progress:
> > > > 1.  Implement io/channel-rdma.c,
> > > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > Rewrite the rdma_start_outgoing_migration and
> > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > > from migration/ram.c. (to prevent RDMA live migration from polluting the
> > > core logic of live migration), 6.  The soft-RoCE implemented by software 
> > > is
> > > used to test the RDMA live migration. It's successful.
> > > >
> > > > We will be submit the patchset later.
> > >
> > > That's great news, thank you!
> > >
> > > --
> > > Peter Xu
> >
> > For rdma programming, the current mainstream implementation is to use 
> > rdma_cm to establish a connection, and then use verbs to transmit data.
> >
> > rdma_cm and ibverbs create two FDs respectively. The two FDs have different 
> > responsibilities. rdma_cm fd is used to notify connection establishment 
> > events,
> > and verbs fd is used to notify new CQEs. When poll/epoll monitoring is 
> > directly performed on the rdma_cm fd, only a pollin event can be monitored, 
> > which means
> > that an rdma_cm event occurs. When the verbs fd is directly polled/epolled, 
> > only the pollin event can be listened, which indicates that a new CQE is 
> > generated.
> >
> > Rsocket is a sub-module attached to the rdma_cm library and provides rdma 
> > calls that are completely similar to socket interfaces. However, this 
> > library returns
> > only the rdma_cm fd for listening to link setup-related events and does not 
> > expose the verbs fd (readable and writable events for listening to data). 
> > Only the rpoll
> > interface provided by the RSocket can be used to listen to related events. 
> > However, QEMU uses the ppoll interface to listen to the rdma_cm fd (gotten 
> > by raccept API).
> > And cannot listen to the verbs fd event. Only some hacking methods can be 
> > used to address this problem.
> >
> > Do you guys have any ideas? Thanks.
> +cc linux-rdma

Why include rdma community?

> +cc Sean
>
>
>
> >
> >
> > Regards,
> > -Gonglei
>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Haris Iqbal
Hello,

I am part of the storage kernel team which develops and maintains the
RDMA block storage in IONOS.
We work closely with Jinpu/Yu, and currently I am supporting Jinpu
with this Qemu RDMA work.

On Wed, May 29, 2024 at 11:49 AM Gonglei (Arei) via
 wrote:
>
> Hi,
>
> > -Original Message-
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > > 15
> > > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > > >
> > > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > > >
> > > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > > Family [ConnectX-5]
> > > > > > > > > >
> > > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > > while the other
> > > > is active.
> > > > > > > > > >
> > > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > > (rev
> > > > > > > > > > 01)
> > > > > > > > > >
> > > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > > could make more
> > > > > > > > > sense.
> > > > > > > > >
> > > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > > to be refactored
> > > > using rsocket.
> > > > > > > > >
> > > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > > Would it make sense we start some PoC now?
> > > > > > > > >
> > > > > > > >
> > > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > > >
> > > > > > > > Progress:
> > > > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > > > successful, 3.  Remove the original code from migration/rdma.c, 
> > > > > > > > 4.
> > > > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > > > functions from migration/ram.c. (to prevent RDMA live
> > > > > > > > migration from polluting the
> > > > > > > core logic of live migration), 6.  The soft-RoCE implemented
> > > > > > > by software is used to test the RDMA live migration. It's 
> > > > > > > successful.
> > > > > > > >
> > > > > > > > We will be submit the patchset later.
> > > > > > >
> > > > > > > That's great news, thank you!
> > > > > > >
> > > > > > > --
> > > > > > > Peter Xu
> > > > > >
> > > > > > For rdma programming, the current mainstream implementation is
> > > > > > to use
> > > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > > > >
> > > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > > connection establishment events, and verbs fd is used to notify
> > > > > > new CQEs. When
> > > > poll/epoll monitoring is directly performed on the rdma_cm fd, only
> > > > a pollin event can be monitored, which means that an rdma_cm event
> > > > occurs. When the verbs fd is directly polled/epolled, only the
> > > > pollin event can be listened, which indicates that a new CQE is 
> > > > generated.
> > > > > >
> > > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > > provides rdma calls that are completely similar to socket 
> > > > > > interfaces.
> > > > > > However, this library returns only the rdma_cm fd for listening
> > > > > > to link
> > > > setup-related events and does not expose the verbs fd (readable and
> > > > writable events for listening to data). Only the rpoll interface
> > > > provided by the RSocket can be used to listen to related events.
> > > > However, QEMU uses the ppoll interface to listen to the rdma_cm fd
> > (gotten by raccept API).
> > > > > > And cannot listen to the verbs fd event.
> > I'm confused, the rs_poll_arm:
> > https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#L3290
> > For STREAM, rpoll sets up fds for both the cq fd and the cm fd.
> >
>
> Right. But the question is QEMU does not use rpoll but glib's ppoll. :(

I have a query around this topic. Are the fds used in socket migration
polled through ppoll?
If yes, then can someone point out where; I couldn't find that piece of code.

I could only find that sendmsg/send and recvmsg/recv is being used.

>
>
> Regards,
> -Gonglei
>



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)
Hi,

> -Original Message-
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > 15
> > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make sense we start some PoC now?
> > > > > > > >
> > > > > > >
> > > > > > > My team has finished the PoC refactoring which works well.
> > > > > > >
> > > > > > > Progress:
> > > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > > functions from migration/ram.c. (to prevent RDMA live
> > > > > > > migration from polluting the
> > > > > > core logic of live migration), 6.  The soft-RoCE implemented
> > > > > > by software is used to test the RDMA live migration. It's 
> > > > > > successful.
> > > > > > >
> > > > > > > We will be submit the patchset later.
> > > > > >
> > > > > > That's great news, thank you!
> > > > > >
> > > > > > --
> > > > > > Peter Xu
> > > > >
> > > > > For rdma programming, the current mainstream implementation is
> > > > > to use
> > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > > > >
> > > > > rdma_cm and ibverbs create two FDs respectively. The two FDs
> > > > > have different responsibilities. rdma_cm fd is used to notify
> > > > > connection establishment events, and verbs fd is used to notify
> > > > > new CQEs. When
> > > poll/epoll monitoring is directly performed on the rdma_cm fd, only
> > > a pollin event can be monitored, which means that an rdma_cm event
> > > occurs. When the verbs fd is directly polled/epolled, only the
> > > pollin event can be listened, which indicates that a new CQE is generated.
> > > > >
> > > > > Rsocket is a sub-module attached to the rdma_cm library and
> > > > > provides rdma calls that are completely similar to socket interfaces.
> > > > > However, this library returns only the rdma_cm fd for listening
> > > > > to link
> > > setup-related events and does not expose the verbs fd (readable and
> > > writable events for listening to data). Only the rpoll interface
> > > provided by the RSocket can be used to listen to related events.
> > > However, QEMU uses the ppoll interface to listen to the rdma_cm fd
> (gotten by raccept API).
> > > > > And cannot listen to the verbs fd event.
> I'm confused, the rs_poll_arm:
> https://github.com/linux-rdma/rdma-core/blob/master/librdmacm/rsocket.c#L3290
> For STREAM, rpoll sets up fds for both the cq fd and the cm fd.
> 

Right. But the question is QEMU does not use rpoll but glib's ppoll. :(


Regards,
-Gonglei



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Jinpu Wang
On Wed, May 29, 2024 at 11:35 AM Gonglei (Arei) 
wrote:

>
>
> > -Original Message-
> > From: Jinpu Wang [mailto:jinpu.w...@ionos.com]
> > Sent: Wednesday, May 29, 2024 5:18 PM
> > To: Gonglei (Arei) 
> > Cc: Greg Sword ; Peter Xu ;
> > Yu Zhang ; Michael Galaxy ;
> > Elmar Gerdes ; zhengchuan
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-de...@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas ;
> > RDMA mailing list ; she...@nvidia.com; Haris
> > Iqbal 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> handling
> >
> > Hi Gonglei,
> >
> > On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei)  >
> > wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Greg Sword [mailto:gregswo...@gmail.com]
> > > > Sent: Wednesday, May 29, 2024 2:06 PM
> > > > To: Jinpu Wang 
> > > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> > > > handling
> > > >
> > > > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> > > > wrote:
> > > > >
> > > > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)
> > > > > 
> > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > > > servers widely used for production in our data center.
> > > > > > > > > > The network adapters are
> > > > > > > > > >
> > > > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > > > NetXtreme
> > > > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > > > >
> > > > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6
> > > > > > > > > looks more
> > > > > > > reasonable.
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > > 15
> > > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > > >
> > > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > > >
> > > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > > Family [ConnectX-5]
> > > > > > > > > >
> > > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > > while the other
> > > > is active.
> > > > > > > > > >
> > > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > > (rev
> > > > > > > > > > 01)
> > > > > > > > > >
> > > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > > could make more
> > > > > > > > > sense.
> > > > > > > > >
> > 

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)


> -Original Message-
> From: Jinpu Wang [mailto:jinpu.w...@ionos.com]
> Sent: Wednesday, May 29, 2024 5:18 PM
> To: Gonglei (Arei) 
> Cc: Greg Sword ; Peter Xu ;
> Yu Zhang ; Michael Galaxy ;
> Elmar Gerdes ; zhengchuan
> ; Daniel P. Berrangé ;
> Markus Armbruster ; Zhijian Li (Fujitsu)
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; Paolo Bonzini
> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou ; Fabiano Rosas ;
> RDMA mailing list ; she...@nvidia.com; Haris
> Iqbal 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> Hi Gonglei,
> 
> On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei) 
> wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Greg Sword [mailto:gregswo...@gmail.com]
> > > Sent: Wednesday, May 29, 2024 2:06 PM
> > > To: Jinpu Wang 
> > > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol
> > > handling
> > >
> > > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> > > wrote:
> > > >
> > > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)
> > > > 
> > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > > -Original Message-
> > > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > > servers widely used for production in our data center.
> > > > > > > > > The network adapters are
> > > > > > > > >
> > > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > > NetXtreme
> > > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > > >
> > > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6
> > > > > > > > looks more
> > > > > > reasonable.
> > > > > > > >
> > > > > > > >
> > > > > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > > 15
> > > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > > >
> > > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > > >
> > > > > > > > > InfiniBand controller: Mellanox Technologies MT27800
> > > > > > > > > Family [ConnectX-5]
> > > > > > > > >
> > > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP
> > > > > > > > > for VM migration. RDMA traffic is through InfiniBand and
> > > > > > > > > TCP through Ethernet on these two hosts. One is standby
> > > > > > > > > while the other
> > > is active.
> > > > > > > > >
> > > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller
> > > > > > > > > (rev
> > > > > > > > > 01)
> > > > > > > > >
> > > > > > > > > The comparison between RDMA and TCP on the same NIC
> > > > > > > > > could make more
> > > > > > > > sense.
> > > > > > > >
> > > > > > > > It looks to me NICs are powerful now, but again as I
> > > > > > > > mentioned I don't think it's a reason we need to deprecate
> > > > > > > > rdma, especially if QEMU's rdma migration has the chance
> > > > > > > > to be refactored
> > > using rsocket.
> > > > > > > >
> > > > > > > > Is there anyone who started looking into that direction?
> > > > > > > > Would it make

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Jinpu Wang
Hi Gonglei,

On Wed, May 29, 2024 at 10:31 AM Gonglei (Arei)  wrote:
>
>
>
> > -Original Message-
> > From: Greg Sword [mailto:gregswo...@gmail.com]
> > Sent: Wednesday, May 29, 2024 2:06 PM
> > To: Jinpu Wang 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> >
> > On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> > wrote:
> > >
> > > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) 
> > wrote:
> > > >
> > > > Hi,
> > > >
> > > > > -Original Message-
> > > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > > servers widely used for production in our data center. The
> > > > > > > > network adapters are
> > > > > > > >
> > > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > > NetXtreme
> > > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > > >
> > > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks
> > > > > > > more
> > > > > reasonable.
> > > > > > >
> > > > > > >
> > > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > > 15
> > > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > > >
> > > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > > >
> > > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > > [ConnectX-5]
> > > > > > > >
> > > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for
> > > > > > > > VM migration. RDMA traffic is through InfiniBand and TCP
> > > > > > > > through Ethernet on these two hosts. One is standby while the 
> > > > > > > > other
> > is active.
> > > > > > > >
> > > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev
> > > > > > > > 01)
> > > > > > > >
> > > > > > > > The comparison between RDMA and TCP on the same NIC could
> > > > > > > > make more
> > > > > > > sense.
> > > > > > >
> > > > > > > It looks to me NICs are powerful now, but again as I mentioned
> > > > > > > I don't think it's a reason we need to deprecate rdma,
> > > > > > > especially if QEMU's rdma migration has the chance to be 
> > > > > > > refactored
> > using rsocket.
> > > > > > >
> > > > > > > Is there anyone who started looking into that direction?
> > > > > > > Would it make sense we start some PoC now?
> > > > > > >
> > > > > >
> > > > > > My team has finished the PoC refactoring which works well.
> > > > > >
> > > > > > Progress:
> > > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > > functions from migration/ram.c. (to prevent RDMA live migration
> > > > > > from polluting the
> > > > > core logic of live migration), 6.  The soft-RoCE implemented by
> > > > > software is used to test the RDMA live migration. It's successful.
> > > > > >
> > > > > > We will be submit the patchset later.
> > > > >
> > > > > That's great news, thank you!
> > > > >
> > > > > --
> > > > > Peter Xu
> > > >
> > > > For rdma programming, the current mainstream impl

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Gonglei (Arei)


> -Original Message-
> From: Greg Sword [mailto:gregswo...@gmail.com]
> Sent: Wednesday, May 29, 2024 2:06 PM
> To: Jinpu Wang 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Wed, May 29, 2024 at 12:33 PM Jinpu Wang 
> wrote:
> >
> > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei) 
> wrote:
> > >
> > > Hi,
> > >
> > > > -Original Message-
> > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > Exactly, not so compelling, as I did it first only on
> > > > > > > servers widely used for production in our data center. The
> > > > > > > network adapters are
> > > > > > >
> > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries
> > > > > > > NetXtreme
> > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > >
> > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks
> > > > > > more
> > > > reasonable.
> > > > > >
> > > > > >
> > > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda
> > > > 15
> > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > >
> > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > >
> > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > [ConnectX-5]
> > > > > > >
> > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for
> > > > > > > VM migration. RDMA traffic is through InfiniBand and TCP
> > > > > > > through Ethernet on these two hosts. One is standby while the 
> > > > > > > other
> is active.
> > > > > > >
> > > > > > > Now I'll try on a server with more recent Ethernet and
> > > > > > > InfiniBand network adapters. One of them has:
> > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev
> > > > > > > 01)
> > > > > > >
> > > > > > > The comparison between RDMA and TCP on the same NIC could
> > > > > > > make more
> > > > > > sense.
> > > > > >
> > > > > > It looks to me NICs are powerful now, but again as I mentioned
> > > > > > I don't think it's a reason we need to deprecate rdma,
> > > > > > especially if QEMU's rdma migration has the chance to be refactored
> using rsocket.
> > > > > >
> > > > > > Is there anyone who started looking into that direction?
> > > > > > Would it make sense we start some PoC now?
> > > > > >
> > > > >
> > > > > My team has finished the PoC refactoring which works well.
> > > > >
> > > > > Progress:
> > > > > 1.  Implement io/channel-rdma.c, 2.  Add unit test
> > > > > tests/unit/test-io-channel-rdma.c and verifying it is
> > > > > successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx
> > > > > functions from migration/ram.c. (to prevent RDMA live migration
> > > > > from polluting the
> > > > core logic of live migration), 6.  The soft-RoCE implemented by
> > > > software is used to test the RDMA live migration. It's successful.
> > > > >
> > > > > We will be submit the patchset later.
> > > >
> > > > That's great news, thank you!
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > For rdma programming, the current mainstream implementation is to use
> rdma_cm to establish a connection, and then use verbs to transmit data.
> > >
> > > rdma_cm and ibverbs create two FDs respectively. The two FDs have
> > > different responsibilities. rdma_cm fd is used to notify connection
> > > establishment events, and verbs fd is used to notify new CQEs. When
> poll/epoll monitoring is directly performed on the rdma_cm fd, only a pollin
> event can be monitored, which means that an rdma_cm event occurs. When
> the verbs fd is directly polled/epolled, only the pollin event can be 
> listened,
> which indicates that a new CQE is generated.
> > >
> > > Rsocket is a sub-module attached to the rdma_cm library and provides
> > > rdma calls that are completely similar to socket interfaces.
> > > However, this library returns only the rdma_cm fd for listening to link
> setup-related events and does not expose the verbs fd (readable and writable
> events for listening to data). Only the rpoll interface provided by the 
> RSocket
> can be used to listen to related events. However, QEMU uses the ppoll
> interface to listen to the rdma_cm fd (gotten by raccept API).
> > > And cannot listen to the verbs fd event. Only some hacking methods can be
> used to address this problem.
> > >
> > > Do you guys have any ideas? Thanks.
> > +cc linux-rdma
> 
> Why include rdma community?
> 

Can rdma/rsocket provide an API to expose the verbs fd? 


Regards,
-Gonglei

> > +cc Sean
> >
> >
> >
> > >
> > >
> > > Regards,
> > > -Gonglei
> >


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-29 Thread Jinpu Wang
On Wed, May 29, 2024 at 8:08 AM Greg Sword  wrote:
>
> On Wed, May 29, 2024 at 12:33 PM Jinpu Wang  wrote:
> >
> > On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)  
> > wrote:
> > >
> > > Hi,
> > >
> > > > -Original Message-
> > > > From: Peter Xu [mailto:pet...@redhat.com]
> > > > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > > > widely used for production in our data center. The network
> > > > > > > adapters are
> > > > > > >
> > > > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > > > >
> > > > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > > > reasonable.
> > > > > >
> > > > > >
> > > > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > > > >
> > > > > > Appreciate a lot for everyone helping on the testings.
> > > > > >
> > > > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > > > [ConnectX-5]
> > > > > > >
> > > > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > > > Ethernet on these two hosts. One is standby while the other is 
> > > > > > > active.
> > > > > > >
> > > > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > > > network adapters. One of them has:
> > > > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > > > >
> > > > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > > > more
> > > > > > sense.
> > > > > >
> > > > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > > > >
> > > > > > Is there anyone who started looking into that direction?  Would it
> > > > > > make sense we start some PoC now?
> > > > > >
> > > > >
> > > > > My team has finished the PoC refactoring which works well.
> > > > >
> > > > > Progress:
> > > > > 1.  Implement io/channel-rdma.c,
> > > > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > > > Rewrite the rdma_start_outgoing_migration and
> > > > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > > > from migration/ram.c. (to prevent RDMA live migration from polluting 
> > > > > the
> > > > core logic of live migration), 6.  The soft-RoCE implemented by 
> > > > software is
> > > > used to test the RDMA live migration. It's successful.
> > > > >
> > > > > We will be submit the patchset later.
> > > >
> > > > That's great news, thank you!
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > For rdma programming, the current mainstream implementation is to use 
> > > rdma_cm to establish a connection, and then use verbs to transmit data.
> > >
> > > rdma_cm and ibverbs create two FDs respectively. The two FDs have 
> > > different responsibilities. rdma_cm fd is used to notify connection 
> > > establishment events,
> > > and verbs fd is used to notify new CQEs. When poll/epoll monitoring is 
> > > directly performed on the rdma_cm fd, only a pollin event can be 
> > > monitored, which means
> > > that an rdma_cm event occurs. When the verbs fd is directly 
> > > polled/epolled, only the pollin event can be listened, which indicates 
> > > that a new CQE is generated.
> > >
> > > Rsocket is a sub-module attached to the rdma_cm library and provides rdma 
> > > calls that are completely similar to socket interfaces. However, this 
> > > library returns
> > > only the rdma_cm fd for listening to link setup-related events and does 
> > > not expose the verbs fd (readable and writable events for listening to 
> > > data). Only the rpoll
> > > interface provided by the RSocket can be used to listen to related 
> > > events. However, QEMU uses the ppoll interface to listen to the rdma_cm 
> > > fd (gotten by raccept API).
> > > And cannot listen to the verbs fd event. Only some hacking methods can be 
> > > used to address this problem.
> > >
> > > Do you guys have any ideas? Thanks.
> > +cc linux-rdma
>
> Why include rdma community?
The rdma community has a lot of people with experience in rdma/rsocket.
>
> > +cc Sean
> >
> >
> >
> > >
> > >
> > > Regards,
> > > -Gonglei
> >



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Jinpu Wang
On Wed, May 29, 2024 at 4:43 AM Gonglei (Arei)  wrote:
>
> Hi,
>
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Tuesday, May 28, 2024 11:55 PM
> > > > > Exactly, not so compelling, as I did it first only on servers
> > > > > widely used for production in our data center. The network
> > > > > adapters are
> > > > >
> > > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > > BCM5720 2-port Gigabit Ethernet PCIe
> > > >
> > > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> > reasonable.
> > > >
> > > >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > > >
> > > > Appreciate a lot for everyone helping on the testings.
> > > >
> > > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > > [ConnectX-5]
> > > > >
> > > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > > Ethernet on these two hosts. One is standby while the other is active.
> > > > >
> > > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > > network adapters. One of them has:
> > > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > > >
> > > > > The comparison between RDMA and TCP on the same NIC could make
> > > > > more
> > > > sense.
> > > >
> > > > It looks to me NICs are powerful now, but again as I mentioned I
> > > > don't think it's a reason we need to deprecate rdma, especially if
> > > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > > >
> > > > Is there anyone who started looking into that direction?  Would it
> > > > make sense we start some PoC now?
> > > >
> > >
> > > My team has finished the PoC refactoring which works well.
> > >
> > > Progress:
> > > 1.  Implement io/channel-rdma.c,
> > > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > > Rewrite the rdma_start_outgoing_migration and
> > > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > > from migration/ram.c. (to prevent RDMA live migration from polluting the
> > core logic of live migration), 6.  The soft-RoCE implemented by software is
> > used to test the RDMA live migration. It's successful.
> > >
> > > We will be submit the patchset later.
> >
> > That's great news, thank you!
> >
> > --
> > Peter Xu
>
> For rdma programming, the current mainstream implementation is to use rdma_cm 
> to establish a connection, and then use verbs to transmit data.
>
> rdma_cm and ibverbs create two FDs respectively. The two FDs have different 
> responsibilities. rdma_cm fd is used to notify connection establishment 
> events,
> and verbs fd is used to notify new CQEs. When poll/epoll monitoring is 
> directly performed on the rdma_cm fd, only a pollin event can be monitored, 
> which means
> that an rdma_cm event occurs. When the verbs fd is directly polled/epolled, 
> only the pollin event can be listened, which indicates that a new CQE is 
> generated.
>
> Rsocket is a sub-module attached to the rdma_cm library and provides rdma 
> calls that are completely similar to socket interfaces. However, this library 
> returns
> only the rdma_cm fd for listening to link setup-related events and does not 
> expose the verbs fd (readable and writable events for listening to data). 
> Only the rpoll
> interface provided by the RSocket can be used to listen to related events. 
> However, QEMU uses the ppoll interface to listen to the rdma_cm fd (gotten by 
> raccept API).
> And cannot listen to the verbs fd event. Only some hacking methods can be 
> used to address this problem.
>
> Do you guys have any ideas? Thanks.
+cc linux-rdma
+cc Sean



>
>
> Regards,
> -Gonglei



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Gonglei (Arei)
Hi,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Tuesday, May 28, 2024 11:55 PM
> > > > Exactly, not so compelling, as I did it first only on servers
> > > > widely used for production in our data center. The network
> > > > adapters are
> > > >
> > > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme
> > > > BCM5720 2-port Gigabit Ethernet PCIe
> > >
> > > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more
> reasonable.
> > >
> > >
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > > wvaqk81vxtkzx-l...@mail.gmail.com/
> > >
> > > Appreciate a lot for everyone helping on the testings.
> > >
> > > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > > [ConnectX-5]
> > > >
> > > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > > migration. RDMA traffic is through InfiniBand and TCP through
> > > > Ethernet on these two hosts. One is standby while the other is active.
> > > >
> > > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > > network adapters. One of them has:
> > > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > > >
> > > > The comparison between RDMA and TCP on the same NIC could make
> > > > more
> > > sense.
> > >
> > > It looks to me NICs are powerful now, but again as I mentioned I
> > > don't think it's a reason we need to deprecate rdma, especially if
> > > QEMU's rdma migration has the chance to be refactored using rsocket.
> > >
> > > Is there anyone who started looking into that direction?  Would it
> > > make sense we start some PoC now?
> > >
> >
> > My team has finished the PoC refactoring which works well.
> >
> > Progress:
> > 1.  Implement io/channel-rdma.c,
> > 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it
> > is successful, 3.  Remove the original code from migration/rdma.c, 4.
> > Rewrite the rdma_start_outgoing_migration and
> > rdma_start_incoming_migration logic, 5.  Remove all rdma_xxx functions
> > from migration/ram.c. (to prevent RDMA live migration from polluting the
> core logic of live migration), 6.  The soft-RoCE implemented by software is
> used to test the RDMA live migration. It's successful.
> >
> > We will be submit the patchset later.
> 
> That's great news, thank you!
> 
> --
> Peter Xu

For rdma programming, the current mainstream implementation is to use rdma_cm
to establish a connection, and then use verbs to transmit data.

rdma_cm and ibverbs each create their own FD, and the two FDs have different
responsibilities: the rdma_cm fd is used to notify connection establishment
events, and the verbs fd is used to notify new CQEs. When poll/epoll is
performed directly on the rdma_cm fd, only a POLLIN event can be observed,
which means an rdma_cm event has occurred. When the verbs fd is polled/epolled
directly, only a POLLIN event can be observed, which indicates that a new CQE
has been generated.

Rsocket is a sub-module of the rdma_cm library and provides rdma calls that
closely mirror the socket interfaces. However, this library returns only the
rdma_cm fd (for listening to link setup-related events) and does not expose
the verbs fd (for readable/writable data events). Only the rpoll interface
provided by rsocket can be used to listen for those events, whereas QEMU uses
the ppoll interface to listen to the rdma_cm fd (obtained via the raccept API)
and therefore cannot see verbs fd events. Only hacks can work around this
problem.

Do you guys have any ideas? Thanks.
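
To make the mismatch concrete, here is a minimal sketch (not QEMU code, and assuming a blocking rsocket fd) of why the fd returned by the rsocket API has to be waited on with rpoll() rather than with the ppoll() that QEMU's event loop uses:

/* Minimal sketch, not QEMU code: the fd handed back by rsocket()/raccept()
 * belongs to rdma_cm, so a plain poll()/ppoll() on it only reports
 * connection events.  Data readiness (new CQEs on the hidden verbs fd) is
 * only visible through rsocket's own rpoll(). */
#include <rdma/rsocket.h>
#include <poll.h>

static int wait_readable(int rfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = rfd, .events = POLLIN };

    /* Correct for an rsocket fd: rpoll() internally watches both the
     * rdma_cm fd and the verbs completion channel. */
    int n = rpoll(&pfd, 1, timeout_ms);

    /* A plain ppoll(&pfd, 1, ...) here would compile, but it would miss
     * CQE notifications and can stall a blocking migration stream. */
    return (n > 0 && (pfd.revents & POLLIN)) ? 0 : -1;
}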


Regards,
-Gonglei


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Peter Xu
On Tue, May 28, 2024 at 09:06:04AM +, Gonglei (Arei) wrote:
> Hi Peter,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang 
> > Cc: Michael Galaxy ; Jinpu Wang
> > ; Elmar Gerdes ;
> > zhengchuan ; Gonglei (Arei)
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-de...@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> > 
> > Hi,
> > 
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> > 
> > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more 
> > reasonable.
> > 
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wvaqk81vxtkzx-l...@mail.gmail.com/
> > 
> > Appreciate a lot for everyone helping on the testings.
> > 
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> > 
> > It looks to me NICs are powerful now, but again as I mentioned I don't 
> > think it's
> > a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> > the chance to be refactored using rsocket.
> > 
> > Is there anyone who started looking into that direction?  Would it make 
> > sense
> > we start some PoC now?
> > 
> 
> My team has finished the PoC refactoring which works well. 
> 
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it is 
> successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and 
> rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live 
> migration from polluting the core logic of live migration),
> 6.  The soft-RoCE implemented by software is used to test the RDMA live 
> migration. It's successful.
> 
> We will be submit the patchset later.

That's great news, thank you!

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Jinpu Wang
Hi Gonglei,

On Tue, May 28, 2024 at 11:06 AM Gonglei (Arei)  wrote:
>
> Hi Peter,
>
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Wednesday, May 22, 2024 6:15 AM
> > To: Yu Zhang 
> > Cc: Michael Galaxy ; Jinpu Wang
> > ; Elmar Gerdes ;
> > zhengchuan ; Gonglei (Arei)
> > ; Daniel P. Berrangé ;
> > Markus Armbruster ; Zhijian Li (Fujitsu)
> > ; qemu-de...@nongnu.org; Yuval Shaia
> > ; Kevin Wolf ; Prasanna
> > Kumar Kalever ; Cornelia Huck
> > ; Michael Roth ; Prasanna
> > Kumar Kalever ; Paolo Bonzini
> > ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou ; Fabiano Rosas 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> >
> > On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > > Hello Michael and Peter,
> >
> > Hi,
> >
> > >
> > > Exactly, not so compelling, as I did it first only on servers widely
> > > used for production in our data center. The network adapters are
> > >
> > > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > > 2-port Gigabit Ethernet PCIe
> >
> > Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more 
> > reasonable.
> >
> > https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> > wvaqk81vxtkzx-l...@mail.gmail.com/
> >
> > Appreciate a lot for everyone helping on the testings.
> >
> > > InfiniBand controller: Mellanox Technologies MT27800 Family
> > > [ConnectX-5]
> > >
> > > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > > on these two hosts. One is standby while the other is active.
> > >
> > > Now I'll try on a server with more recent Ethernet and InfiniBand
> > > network adapters. One of them has:
> > > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> > >
> > > The comparison between RDMA and TCP on the same NIC could make more
> > sense.
> >
> > It looks to me NICs are powerful now, but again as I mentioned I don't 
> > think it's
> > a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> > the chance to be refactored using rsocket.
> >
> > Is there anyone who started looking into that direction?  Would it make 
> > sense
> > we start some PoC now?
> >
>
> My team has finished the PoC refactoring which works well.
>
> Progress:
> 1.  Implement io/channel-rdma.c,
> 2.  Add unit test tests/unit/test-io-channel-rdma.c and verifying it is 
> successful,
> 3.  Remove the original code from migration/rdma.c,
> 4.  Rewrite the rdma_start_outgoing_migration and 
> rdma_start_incoming_migration logic,
> 5.  Remove all rdma_xxx functions from migration/ram.c. (to prevent RDMA live 
> migration from polluting the core logic of live migration),
> 6.  The soft-RoCE implemented by software is used to test the RDMA live 
> migration. It's successful.
>
> We will be submit the patchset later.
>
Thanks for working on this PoC and sharing the progress; we are
looking forward to the patchset.

>
> Regards,
> -Gonglei
Regards!
Jinpu
>
> > Thanks,
> >
> > --
> > Peter Xu
>



RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-28 Thread Gonglei (Arei)
Hi Peter,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, May 22, 2024 6:15 AM
> To: Yu Zhang 
> Cc: Michael Galaxy ; Jinpu Wang
> ; Elmar Gerdes ;
> zhengchuan ; Gonglei (Arei)
> ; Daniel P. Berrangé ;
> Markus Armbruster ; Zhijian Li (Fujitsu)
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; Paolo Bonzini
> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou ; Fabiano Rosas 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> > Hello Michael and Peter,
> 
> Hi,
> 
> >
> > Exactly, not so compelling, as I did it first only on servers widely
> > used for production in our data center. The network adapters are
> >
> > Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> > 2-port Gigabit Ethernet PCIe
> 
> Hmm... I definitely thinks Jinpu's Mellanox ConnectX-6 looks more reasonable.
> 
> https://lore.kernel.org/qemu-devel/CAMGffEn-DKpMZ4tA71MJYdyemg0Zda15
> wvaqk81vxtkzx-l...@mail.gmail.com/
> 
> Appreciate a lot for everyone helping on the testings.
> 
> > InfiniBand controller: Mellanox Technologies MT27800 Family
> > [ConnectX-5]
> >
> > which doesn't meet our purpose. I can choose RDMA or TCP for VM
> > migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> > on these two hosts. One is standby while the other is active.
> >
> > Now I'll try on a server with more recent Ethernet and InfiniBand
> > network adapters. One of them has:
> > BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> >
> > The comparison between RDMA and TCP on the same NIC could make more
> sense.
> 
> It looks to me NICs are powerful now, but again as I mentioned I don't think 
> it's
> a reason we need to deprecate rdma, especially if QEMU's rdma migration has
> the chance to be refactored using rsocket.
> 
> Is there anyone who started looking into that direction?  Would it make sense
> we start some PoC now?
> 

My team has finished the PoC refactoring, and it works well.

Progress:
1.  Implement io/channel-rdma.c,
2.  Add the unit test tests/unit/test-io-channel-rdma.c and verify that it
passes,
3.  Remove the original code from migration/rdma.c,
4.  Rewrite the rdma_start_outgoing_migration and rdma_start_incoming_migration
logic,
5.  Remove all rdma_xxx functions from migration/ram.c (to prevent RDMA live
migration from polluting the core logic of live migration),
6.  Test RDMA live migration over soft-RoCE (software-implemented RDMA). It
works.

We will submit the patchset later.
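
As a rough illustration of the direction item 1 describes (a sketch built directly on the rsocket data calls, not the actual io/channel-rdma.c from the upcoming patchset), the write path of such a channel can simply delegate to rsend(), so the migration core only ever sees a socket-like stream:

/* Sketch only, NOT the PoC code: shows how an RDMA-backed channel's write
 * path can delegate to the rsocket data calls so that migration/ram.c no
 * longer needs any rdma_xxx helpers. */
#include <rdma/rsocket.h>
#include <sys/uio.h>
#include <errno.h>

static ssize_t rdma_channel_writev(int rfd, const struct iovec *iov, int niov)
{
    ssize_t total = 0;

    for (int i = 0; i < niov; i++) {
        const char *p = iov[i].iov_base;
        size_t len = iov[i].iov_len;

        while (len > 0) {
            /* rsend() is the rsocket analogue of send(2). */
            ssize_t n = rsend(rfd, p, len, 0);
            if (n < 0) {
                if (errno == EINTR) {
                    continue;
                }
                return total > 0 ? total : -1;
            }
            p += n;
            len -= n;
            total += n;
        }
    }
    return total;
}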


Regards,
-Gonglei

> Thanks,
> 
> --
> Peter Xu



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-21 Thread Peter Xu
On Fri, May 17, 2024 at 03:01:59PM +0200, Yu Zhang wrote:
> Hello Michael and Peter,

Hi,

> 
> Exactly, not so compelling, as I did it first only on servers widely
> used for production in our data center. The network adapters are
> 
> Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
> 2-port Gigabit Ethernet PCIe

Hmm... I definitely think Jinpu's Mellanox ConnectX-6 looks more
reasonable.

https://lore.kernel.org/qemu-devel/camgffen-dkpmz4ta71mjydyemg0zda15wvaqk81vxtkzx-l...@mail.gmail.com/

Thanks a lot to everyone helping with the testing.

> InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5]
> 
> which doesn't meet our purpose. I can choose RDMA or TCP for VM
> migration. RDMA traffic is through InfiniBand and TCP through Ethernet
> on these two hosts. One is standby while the other is active.
> 
> Now I'll try on a server with more recent Ethernet and InfiniBand
> network adapters. One of them has:
> BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)
> 
> The comparison between RDMA and TCP on the same NIC could make more sense.

It looks to me like NICs are powerful now, but again, as I mentioned, I don't
think that's a reason we need to deprecate rdma, especially if QEMU's rdma
migration has the chance to be refactored using rsocket.

Has anyone started looking into that direction?  Would it make sense to start
some PoC now?

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-17 Thread Yu Zhang
Hello Michael and Peter,

Exactly, not so compelling, as I did it first only on servers widely
used for production in our data center. The network adapters are

Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720
2-port Gigabit Ethernet PCIe
InfiniBand controller: Mellanox Technologies MT27800 Family [ConnectX-5]

which doesn't meet our purpose. I can choose RDMA or TCP for VM
migration. RDMA traffic is through InfiniBand and TCP through Ethernet
on these two hosts. One is standby while the other is active.

Now I'll try on a server with more recent Ethernet and InfiniBand
network adapters. One of them has:
BCM57414 NetXtreme-E 10Gb/25Gb RDMA Ethernet Controller (rev 01)

The comparison between RDMA and TCP on the same NIC could make more sense.

Best regards,
Yu Zhang @ IONOS Cloud







On Thu, May 16, 2024 at 7:30 PM Michael Galaxy  wrote:
>
> These are very compelling results, no?
>
> (40gbps cards, right? Are the cards active/active? or active/standby?)
>
> - Michael
>
> On 5/14/24 10:19, Yu Zhang wrote:
> > Hello Peter and all,
> >
> > I did a comparison of the VM live-migration speeds between RDMA and
> > TCP/IP on our servers
> > and plotted the results to get an initial impression. Unfortunately,
> > the Ethernet NICs are not the
> > recent ones, therefore, it may not make much sense. I can do it on
> > servers with more recent Ethernet
> > NICs and keep you updated.
> >
> > It seems that the benefits of RDMA becomes obviously when the VM has
> > large memory and is
> > running memory-intensive workload.
> >
> > Best regards,
> > Yu Zhang @ IONOS Cloud
> >
> > On Thu, May 9, 2024 at 4:14 PM Peter Xu  wrote:
> >> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> >>> That's a good news to see the socket abstraction for RDMA!
> >>> When I was developed the series above, the most pain is the RDMA 
> >>> migration has no QIOChannel abstraction and i need to take a 'fake 
> >>> channel'
> >>> for it which is awkward in code implementation.
> >>> So, as far as I know, we can do this by
> >>> i. the first thing is that we need to evaluate the rsocket is good enough 
> >>> to satisfy our QIOChannel fundamental abstraction
> >>> ii. if it works right, then we will continue to see if it can give us 
> >>> opportunity to hide the detail of rdma protocol
> >>>  into rsocket by remove most of code in rdma.c and also some hack in 
> >>> migration main process.
> >>> iii. implement the advanced features like multi-fd and multi-uri for rdma 
> >>> migration.
> >>>
> >>> Since I am not familiar with rsocket, I need some times to look at it and 
> >>> do some quick verify with rdma migration based on rsocket.
> >>> But, yes, I am willing to involved in this refactor work and to see if we 
> >>> can make this migration feature more better:)
> >> Based on what we have now, it looks like we'd better halt the deprecation
> >> process a bit, so I think we shouldn't need to rush it at least in 9.1
> >> then, and we'll need to see how it goes on the refactoring.
> >>
> >> It'll be perfect if rsocket works, otherwise supporting multifd with little
> >> overhead / exported APIs would also be a good thing in general with
> >> whatever approach.  And obviously all based on the facts that we can get
> >> resources from companies to support this feature first.
> >>
> >> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
> >> any of us can provide some test results please do so.  Many people are
> >> saying RDMA is better, but I yet didn't see any numbers comparing it with
> >> modern TCP networks.  I don't want to have old impressions floating around
> >> even if things might have changed..  When we have consolidated results, we
> >> should share them out and also reflect that in QEMU's migration docs when a
> >> rdma document page is ready.
> >>
> >> Chuan, please check the whole thread discussion, it may help to understand
> >> what we are looking for on rdma migrations [1].  Meanwhile please feel free
> >> to sync with Jinpu's team and see how to move forward with such a project.
> >>
> >> [1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/
> >>
> >> Thanks,
> >>
> >> --
> >> Peter Xu
> >>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-16 Thread Michael Galaxy

These are very compelling results, no?

(40gbps cards, right? Are the cards active/active? or active/standby?)

- Michael

On 5/14/24 10:19, Yu Zhang wrote:

Hello Peter and all,

I did a comparison of the VM live-migration speeds between RDMA and
TCP/IP on our servers
and plotted the results to get an initial impression. Unfortunately,
the Ethernet NICs are not the
recent ones, therefore, it may not make much sense. I can do it on
servers with more recent Ethernet
NICs and keep you updated.

It seems that the benefits of RDMA becomes obviously when the VM has
large memory and is
running memory-intensive workload.

Best regards,
Yu Zhang @ IONOS Cloud

On Thu, May 9, 2024 at 4:14 PM Peter Xu  wrote:

On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:

That's a good news to see the socket abstraction for RDMA!
When I was developed the series above, the most pain is the RDMA migration has 
no QIOChannel abstraction and i need to take a 'fake channel'
for it which is awkward in code implementation.
So, as far as I know, we can do this by
i. the first thing is that we need to evaluate the rsocket is good enough to 
satisfy our QIOChannel fundamental abstraction
ii. if it works right, then we will continue to see if it can give us 
opportunity to hide the detail of rdma protocol
 into rsocket by remove most of code in rdma.c and also some hack in 
migration main process.
iii. implement the advanced features like multi-fd and multi-uri for rdma 
migration.

Since I am not familiar with rsocket, I need some times to look at it and do 
some quick verify with rdma migration based on rsocket.
But, yes, I am willing to involved in this refactor work and to see if we can 
make this migration feature more better:)

Based on what we have now, it looks like we'd better halt the deprecation
process a bit, so I think we shouldn't need to rush it at least in 9.1
then, and we'll need to see how it goes on the refactoring.

It'll be perfect if rsocket works, otherwise supporting multifd with little
overhead / exported APIs would also be a good thing in general with
whatever approach.  And obviously all based on the facts that we can get
resources from companies to support this feature first.

Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
any of us can provide some test results please do so.  Many people are
saying RDMA is better, but I yet didn't see any numbers comparing it with
modern TCP networks.  I don't want to have old impressions floating around
even if things might have changed..  When we have consolidated results, we
should share them out and also reflect that in QEMU's migration docs when a
rdma document page is ready.

Chuan, please check the whole thread discussion, it may help to understand
what we are looking for on rdma migrations [1].  Meanwhile please feel free
to sync with Jinpu's team and see how to move forward with such a project.

[1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/

Thanks,

--
Peter Xu





Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-14 Thread Yu Zhang
Hello Peter and all,

I did a comparison of the VM live-migration speeds between RDMA and TCP/IP on
our servers and plotted the results to get an initial impression.
Unfortunately, the Ethernet NICs are not recent ones, so it may not mean much.
I can redo it on servers with more recent Ethernet NICs and keep you updated.

It seems that the benefits of RDMA become obvious when the VM has large memory
and is running a memory-intensive workload.

Best regards,
Yu Zhang @ IONOS Cloud

On Thu, May 9, 2024 at 4:14 PM Peter Xu  wrote:
>
> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> > That's a good news to see the socket abstraction for RDMA!
> > When I was developed the series above, the most pain is the RDMA migration 
> > has no QIOChannel abstraction and i need to take a 'fake channel'
> > for it which is awkward in code implementation.
> > So, as far as I know, we can do this by
> > i. the first thing is that we need to evaluate the rsocket is good enough 
> > to satisfy our QIOChannel fundamental abstraction
> > ii. if it works right, then we will continue to see if it can give us 
> > opportunity to hide the detail of rdma protocol
> > into rsocket by remove most of code in rdma.c and also some hack in 
> > migration main process.
> > iii. implement the advanced features like multi-fd and multi-uri for rdma 
> > migration.
> >
> > Since I am not familiar with rsocket, I need some times to look at it and 
> > do some quick verify with rdma migration based on rsocket.
> > But, yes, I am willing to involved in this refactor work and to see if we 
> > can make this migration feature more better:)
>
> Based on what we have now, it looks like we'd better halt the deprecation
> process a bit, so I think we shouldn't need to rush it at least in 9.1
> then, and we'll need to see how it goes on the refactoring.
>
> It'll be perfect if rsocket works, otherwise supporting multifd with little
> overhead / exported APIs would also be a good thing in general with
> whatever approach.  And obviously all based on the facts that we can get
> resources from companies to support this feature first.
>
> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
> any of us can provide some test results please do so.  Many people are
> saying RDMA is better, but I yet didn't see any numbers comparing it with
> modern TCP networks.  I don't want to have old impressions floating around
> even if things might have changed..  When we have consolidated results, we
> should share them out and also reflect that in QEMU's migration docs when a
> rdma document page is ready.
>
> Chuan, please check the whole thread discussion, it may help to understand
> what we are looking for on rdma migrations [1].  Meanwhile please feel free
> to sync with Jinpu's team and see how to move forward with such a project.
>
> [1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/
>
> Thanks,
>
> --
> Peter Xu
>


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-13 Thread Michael Galaxy
One thing to keep in mind here (despite me not having any hardware to test)
is that one of the original goals of the RDMA implementation was not simply
raw throughput or raw latency, but the lack of CPU utilization in kernel
space due to the offload. While it is entirely possible that newer hardware
with TCP might compete, the significant reduction in CPU usage versus the
TCP/IP stack was a big win at the time.

Just something to consider while you're doing the testing.

- Michael

On 5/9/24 03:58, Zheng Chuan wrote:

Hi, Peter,Lei,Jinpu.

On 2024/5/8 0:28, Peter Xu wrote:

On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:

Hello,


-Original Message-
From: Peter Xu [mailto:pet...@redhat.com]
Sent: Monday, May 6, 2024 11:18 PM
To: Gonglei (Arei) 
Cc: Daniel P. Berrangé ; Markus Armbruster
; Michael Galaxy ; Yu Zhang
; Zhijian Li (Fujitsu) ; Jinpu Wang
; Elmar Gerdes ;
qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
; Prasanna Kumar Kalever
; Cornelia Huck ;
Michael Roth ; Prasanna Kumar Kalever
; integrat...@gluster.org; Paolo Bonzini
; qemu-block@nongnu.org; de...@lists.libvirt.org;
Hanna Reitz ; Michael S. Tsirkin ;
Thomas Huth ; Eric Blake ; Song
Gao ; Marc-André Lureau
; Alex Bennée ;
Wainer dos Santos Moschetta ; Beraldo Leal
; Pannengyuan ;
Xiexiangyou 
Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:

Hi, Peter

Hey, Lei,

Happy to see you around again after years.


Haha, me too.


RDMA features high bandwidth, low latency (in non-blocking lossless
network), and direct remote memory access by bypassing the CPU (As you
know, CPU resources are expensive for cloud vendors, which is one of
the reasons why we introduced offload cards.), which TCP does not have.

It's another cost to use offload cards, v.s. preparing more cpu resources?


Software and hardware offload converged architecture is the way to go for all 
cloud vendors
(Including comprehensive benefits in terms of performance, cost, security, and 
innovation speed),
it's not just a matter of adding the resource of a DPU card.


In some scenarios where fast live migration is needed (extremely short
interruption duration and migration duration) is very useful. To this
end, we have also developed RDMA support for multifd.

Will any of you upstream that work?  I'm curious how intrusive would it be
when adding it to multifd, if it can keep only 5 exported functions like what
rdma.h does right now it'll be pretty nice.  We also want to make sure it works
with arbitrary sized loads and buffers, e.g. vfio is considering to add IO 
loads to
multifd channels too.


In fact, we sent the patchset to the community in 2021. Pls see:
https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/

Yes, I have sent the patchset of multifd support for rdma migration by taking 
over my colleague, and also
sorry for not keeping on this work at that time due to some reasons.
And also I am strongly agree with Lei that the RDMA protocol has some special 
advantages against with TCP
in some scenario, and we are indeed to use it in our product.


I wasn't aware of that for sure in the past..

Multifd has changed quite a bit in the last 9.0 release, that may not apply
anymore.  One thing to mention is please look at Dan's comment on possible
use of rsocket.h:

https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/

And Jinpu did help provide an initial test result over the library:

https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/

It looks like we have a chance to apply that in QEMU.




One thing to note that the question here is not about a pure performance
comparison between rdma and nics only.  It's about help us make a decision
on whether to drop rdma, iow, even if rdma performs well, the community still
has the right to drop it if nobody can actively work and maintain it.
It's just that if nics can perform as good it's more a reason to drop, unless
companies can help to provide good support and work together.


We are happy to provide the necessary review and maintenance work for RDMA
if the community needs it.

CC'ing Chuan Zheng.

I'm not sure whether you and Jinpu's team would like to work together and
provide a final solution for rdma over multifd.  It could be much simpler
than the original 2021 proposal if the rsocket API will work out.

Thanks,


That's a good news to see the socket abstraction for RDMA!
When I was developed the series above, the most pain is the RDMA migration has 
no QIOChannel

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-13 Thread Jinpu Wang
Hi Peter, Hi Chuan,

On Thu, May 9, 2024 at 4:14 PM Peter Xu  wrote:
>
> On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> > That's a good news to see the socket abstraction for RDMA!
> > When I was developed the series above, the most pain is the RDMA migration 
> > has no QIOChannel abstraction and i need to take a 'fake channel'
> > for it which is awkward in code implementation.
> > So, as far as I know, we can do this by
> > i. the first thing is that we need to evaluate the rsocket is good enough 
> > to satisfy our QIOChannel fundamental abstraction
> > ii. if it works right, then we will continue to see if it can give us 
> > opportunity to hide the detail of rdma protocol
> > into rsocket by remove most of code in rdma.c and also some hack in 
> > migration main process.
> > iii. implement the advanced features like multi-fd and multi-uri for rdma 
> > migration.
> >
> > Since I am not familiar with rsocket, I need some times to look at it and 
> > do some quick verify with rdma migration based on rsocket.
> > But, yes, I am willing to involved in this refactor work and to see if we 
> > can make this migration feature more better:)
>
> Based on what we have now, it looks like we'd better halt the deprecation
> process a bit, so I think we shouldn't need to rush it at least in 9.1
> then, and we'll need to see how it goes on the refactoring.
>
> It'll be perfect if rsocket works, otherwise supporting multifd with little
> overhead / exported APIs would also be a good thing in general with
> whatever approach.  And obviously all based on the facts that we can get
> resources from companies to support this feature first.
>
> Note that so far nobody yet compared with rdma v.s. nic perf, so I hope if
> any of us can provide some test results please do so.  Many people are
> saying RDMA is better, but I yet didn't see any numbers comparing it with
> modern TCP networks.  I don't want to have old impressions floating around
> even if things might have changed..  When we have consolidated results, we
> should share them out and also reflect that in QEMU's migration docs when a
> rdma document page is ready.
I also did tests with a Mellanox ConnectX-6 100 G RoCE NIC; the results are
mixed: with fewer than 3 streams native Ethernet is faster, and with more than
3 streams rsocket performs better.

root@x4-right:~# iperf -c 1.1.1.16 -P 1

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[  3] local 1.1.1.15 port 44214 connected with 1.1.1.16 port 5001
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10. sec  52.9 GBytes  45.4 Gbits/sec
root@x4-right:~# iperf -c 1.1.1.16 -P 2
[  3] local 1.1.1.15 port 33118 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 33130 connected with 1.1.1.16 port 5001

Client connecting to 1.1.1.16, TCP port 5001
TCP window size: 4.00 MByte (default)

[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0001 sec  45.0 GBytes  38.7 Gbits/sec
[  4] 0.-10. sec  43.9 GBytes  37.7 Gbits/sec
[SUM] 0.-10. sec  88.9 GBytes  76.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.172/0.189/0.205/0.172 ms (tot/err) = 2/0
root@x4-right:~# iperf -c 1.1.1.16 -P 4

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[  5] local 1.1.1.15 port 50748 connected with 1.1.1.16 port 5001
[  4] local 1.1.1.15 port 50734 connected with 1.1.1.16 port 5001
[  6] local 1.1.1.15 port 50764 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 50730 connected with 1.1.1.16 port 5001
[ ID] Interval   Transfer Bandwidth
[  6] 0.-10. sec  24.7 GBytes  21.2 Gbits/sec
[  3] 0.-10.0004 sec  23.6 GBytes  20.3 Gbits/sec
[  4] 0.-10. sec  27.8 GBytes  23.9 Gbits/sec
[  5] 0.-10. sec  28.0 GBytes  24.0 Gbits/sec
[SUM] 0.-10. sec   104 GBytes  89.4 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.104/0.156/0.204/0.124 ms (tot/err) = 4/0
root@x4-right:~# iperf -c 1.1.1.16 -P 8
[  4] local 1.1.1.15 port 55588 connected with 1.1.1.16 port 5001
[  5] local 1.1.1.15 port 55600 connected with 1.1.1.16 port 5001

Client connecting to 1.1.1.16, TCP port 5001
TCP window size:  325 KByte (default)

[ 10] local 1.1.1.15 port 55628 connected with 1.1.1.16 port 5001
[ 15] local 1.1.1.15 port 55648 connected with 1.1.1.16 port 5001
[  7] local 1.1.1.15 port 55620 connected with 1.1.1.16 port 5001
[  3] local 1.1.1.15 port 55584 connected with 1.1.1.16 port 5001
[ 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Peter Xu
On Thu, May 09, 2024 at 04:58:34PM +0800, Zheng Chuan via wrote:
> That's a good news to see the socket abstraction for RDMA!
> When I was developed the series above, the most pain is the RDMA migration 
> has no QIOChannel abstraction and i need to take a 'fake channel'
> for it which is awkward in code implementation.
> So, as far as I know, we can do this by
> i. the first thing is that we need to evaluate the rsocket is good enough to 
> satisfy our QIOChannel fundamental abstraction
> ii. if it works right, then we will continue to see if it can give us 
> opportunity to hide the detail of rdma protocol
> into rsocket by remove most of code in rdma.c and also some hack in 
> migration main process.
> iii. implement the advanced features like multi-fd and multi-uri for rdma 
> migration.
> 
> Since I am not familiar with rsocket, I need some times to look at it and do 
> some quick verify with rdma migration based on rsocket.
> But, yes, I am willing to involved in this refactor work and to see if we can 
> make this migration feature more better:)

Based on what we have now, it looks like we'd better halt the deprecation
process a bit, so I think we shouldn't need to rush it at least in 9.1
then, and we'll need to see how it goes on the refactoring.

It'll be perfect if rsocket works, otherwise supporting multifd with little
overhead / exported APIs would also be a good thing in general with
whatever approach.  And obviously all based on the facts that we can get
resources from companies to support this feature first.

Note that so far nobody has compared rdma vs. nic performance, so if any of us
can provide some test results, please do so.  Many people are saying RDMA is
better, but I haven't yet seen any numbers comparing it with modern TCP
networks.  I don't want old impressions floating around even if things might
have changed.  When we have consolidated results, we should share them and
also reflect them in QEMU's migration docs once an rdma documentation page is
ready.

Chuan, please check the whole thread discussion, it may help to understand
what we are looking for on rdma migrations [1].  Meanwhile please feel free
to sync with Jinpu's team and see how to move forward with such a project.

[1] https://lore.kernel.org/qemu-devel/87frwatp7n@suse.de/

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-09 Thread Zheng Chuan via
Hi, Peter,Lei,Jinpu.

On 2024/5/8 0:28, Peter Xu wrote:
> On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:
>> Hello,
>>
>>> -Original Message-
>>> From: Peter Xu [mailto:pet...@redhat.com]
>>> Sent: Monday, May 6, 2024 11:18 PM
>>> To: Gonglei (Arei) 
>>> Cc: Daniel P. Berrangé ; Markus Armbruster
>>> ; Michael Galaxy ; Yu Zhang
>>> ; Zhijian Li (Fujitsu) ; Jinpu 
>>> Wang
>>> ; Elmar Gerdes ;
>>> qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
>>> ; Prasanna Kumar Kalever
>>> ; Cornelia Huck ;
>>> Michael Roth ; Prasanna Kumar Kalever
>>> ; integrat...@gluster.org; Paolo Bonzini
>>> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
>>> Hanna Reitz ; Michael S. Tsirkin ;
>>> Thomas Huth ; Eric Blake ; Song
>>> Gao ; Marc-André Lureau
>>> ; Alex Bennée ;
>>> Wainer dos Santos Moschetta ; Beraldo Leal
>>> ; Pannengyuan ;
>>> Xiexiangyou 
>>> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
>>>
>>> On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
>>>> Hi, Peter
>>>
>>> Hey, Lei,
>>>
>>> Happy to see you around again after years.
>>>
>> Haha, me too.
>>
>>>> RDMA features high bandwidth, low latency (in non-blocking lossless
>>>> network), and direct remote memory access by bypassing the CPU (As you
>>>> know, CPU resources are expensive for cloud vendors, which is one of
>>>> the reasons why we introduced offload cards.), which TCP does not have.
>>>
>>> It's another cost to use offload cards, v.s. preparing more cpu resources?
>>>
>> Software and hardware offload converged architecture is the way to go for 
>> all cloud vendors 
>> (Including comprehensive benefits in terms of performance, cost, security, 
>> and innovation speed), 
>> it's not just a matter of adding the resource of a DPU card.
>>
>>>> In some scenarios where fast live migration is needed (extremely short
>>>> interruption duration and migration duration) is very useful. To this
>>>> end, we have also developed RDMA support for multifd.
>>>
>>> Will any of you upstream that work?  I'm curious how intrusive would it be
>>> when adding it to multifd, if it can keep only 5 exported functions like 
>>> what
>>> rdma.h does right now it'll be pretty nice.  We also want to make sure it 
>>> works
>>> with arbitrary sized loads and buffers, e.g. vfio is considering to add IO 
>>> loads to
>>> multifd channels too.
>>>
>>
>> In fact, we sent the patchset to the community in 2021. Pls see:
>> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/
> 

Yes, I sent the patchset adding multifd support for rdma migration after
taking it over from my colleague, and I am also sorry for not keeping up
this work at that time for various reasons.
I also strongly agree with Lei that the RDMA protocol has some special
advantages over TCP in some scenarios, and we do indeed use it in our
product.

> I wasn't aware of that for sure in the past..
> 
> Multifd has changed quite a bit in the last 9.0 release, that may not apply
> anymore.  One thing to mention is please look at Dan's comment on possible
> use of rsocket.h:
> 
> https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/
> 
> And Jinpu did help provide an initial test result over the library:
> 
> https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/
> 
> It looks like we have a chance to apply that in QEMU.
> 
>>
>>
>>> One thing to note that the question here is not about a pure performance
>>> comparison between rdma and nics only.  It's about help us make a decision
>>> on whether to drop rdma, iow, even if rdma performs well, the community 
>>> still
>>> has the right to drop it if nobody can actively work and maintain it.
>>> It's just that if nics can perform as good it's more a reason to drop, 
>>> unless
>>> companies can help to provide good support and work together.
>>>
>>
>> We are happy to provide the necessary review and maintenance work for RDMA
>> if the community needs it.
>>
>> CC'ing Chuan Zheng.
> 
> I'm not sure whether you and Jinpu's team would like to work together and
> provide a final solution for rdma over multifd.  It could be much simpler
> than the original 2021 proposal if the rsocket API will work out.
> 
> T

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-08 Thread Daniel P . Berrangé
On Tue, May 07, 2024 at 06:52:50AM +0200, Jinpu Wang wrote:
> Hi Peter, hi Daniel,
> On Mon, May 6, 2024 at 5:29 PM Peter Xu  wrote:
> >
> > On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> > > Hi Peter, hi Daniel,
> >
> > Hi, Jinpu,
> >
> > Thanks for sharing this test results.  Sounds like a great news.
> >
> > What's your plan next?  Would it then be worthwhile / possible moving QEMU
> > into that direction?  Would that greatly simplify rdma code as Dan
> > mentioned?
> I'm rather not familiar with QEMU migration yet,  from the test
> result, I think it's a possible direction,
> just we need to at least based on a rather recent release like
> rdma-core v33 with proper 'fork' support.
> 
> Maybe Dan or you could give more detail about what you have in mind
> for using rsocket as a replacement for the future.
> We will also look into the implementation details in the meantime.

The migration/socket.c file is the entry point for the traditional TCP-based
migration code. It uses the QIOChannelSocket class, which is written against
the traditional sockets APIs, and uses the QAPI SocketAddress data type to
configure it.

My thought was that potentially SocketAddress could be extended to offer RDMA
addressing, e.g.


{ 'union': 'SocketAddress',
  'base': { 'type': 'SocketAddressType' },
  'discriminator': 'type',
  'data': { 'inet': 'InetSocketAddress',
            'unix': 'UnixSocketAddress',
            'vsock': 'VsockSocketAddress',
            'fd': 'FdSocketAddress',
            'rdma': 'InetSocketAddress' } }

And then QIOChannelSocket could also be extended to call the alternative
'rsockets' APIs where needed. That would mean the existing sockets migration
code would almost "just work" with RDMA. Theoretically, any other part of QEMU
using QIOChannelSocket would then magically support RDMA too, with very little
(if any) extra work.
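
As a very rough sketch of what "call the alternative 'rsockets' APIs where needed" could look like (illustrative only: the helper below is not QEMU's QIOChannelSocket code, and the use_rdma flag merely stands in for whatever the new SocketAddress variant would carry):

/* Illustrative sketch, not QEMU code: a socket-style connect helper that
 * branches to the rsockets API when an RDMA-style address is requested.
 * The rsocket()/rconnect()/rclose() calls come from <rdma/rsocket.h>. */
#include <rdma/rsocket.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
#include <stdbool.h>

static int channel_connect(const char *host, const char *port, bool use_rdma)
{
    struct addrinfo hints = { .ai_socktype = SOCK_STREAM };
    struct addrinfo *res = NULL;
    int fd = -1;

    if (getaddrinfo(host, port, &hints, &res) != 0) {
        return -1;
    }

    fd = use_rdma
        ? rsocket(res->ai_family, res->ai_socktype, res->ai_protocol)
        : socket(res->ai_family, res->ai_socktype, res->ai_protocol);

    if (fd >= 0) {
        int rc = use_rdma ? rconnect(fd, res->ai_addr, res->ai_addrlen)
                          : connect(fd, res->ai_addr, res->ai_addrlen);
        if (rc != 0) {
            use_rdma ? rclose(fd) : close(fd);
            fd = -1;
        }
    }

    freeaddrinfo(res);
    return fd;   /* >= 0 on success; data then flows via rsend()/rrecv() etc. */
}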

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-07 Thread Peter Xu
On Tue, May 07, 2024 at 01:50:43AM +, Gonglei (Arei) wrote:
> Hello,
> 
> > -Original Message-
> > From: Peter Xu [mailto:pet...@redhat.com]
> > Sent: Monday, May 6, 2024 11:18 PM
> > To: Gonglei (Arei) 
> > Cc: Daniel P. Berrangé ; Markus Armbruster
> > ; Michael Galaxy ; Yu Zhang
> > ; Zhijian Li (Fujitsu) ; Jinpu 
> > Wang
> > ; Elmar Gerdes ;
> > qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
> > ; Prasanna Kumar Kalever
> > ; Cornelia Huck ;
> > Michael Roth ; Prasanna Kumar Kalever
> > ; integrat...@gluster.org; Paolo Bonzini
> > ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> > Hanna Reitz ; Michael S. Tsirkin ;
> > Thomas Huth ; Eric Blake ; Song
> > Gao ; Marc-André Lureau
> > ; Alex Bennée ;
> > Wainer dos Santos Moschetta ; Beraldo Leal
> > ; Pannengyuan ;
> > Xiexiangyou 
> > Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> > 
> > On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> > > Hi, Peter
> > 
> > Hey, Lei,
> > 
> > Happy to see you around again after years.
> > 
> Haha, me too.
> 
> > > RDMA features high bandwidth, low latency (in non-blocking lossless
> > > network), and direct remote memory access by bypassing the CPU (As you
> > > know, CPU resources are expensive for cloud vendors, which is one of
> > > the reasons why we introduced offload cards.), which TCP does not have.
> > 
> > It's another cost to use offload cards, v.s. preparing more cpu resources?
> > 
> Software and hardware offload converged architecture is the way to go for all 
> cloud vendors 
> (Including comprehensive benefits in terms of performance, cost, security, 
> and innovation speed), 
> it's not just a matter of adding the resource of a DPU card.
> 
> > > In some scenarios where fast live migration is needed (extremely short
> > > interruption duration and migration duration) is very useful. To this
> > > end, we have also developed RDMA support for multifd.
> > 
> > Will any of you upstream that work?  I'm curious how intrusive would it be
> > when adding it to multifd, if it can keep only 5 exported functions like 
> > what
> > rdma.h does right now it'll be pretty nice.  We also want to make sure it 
> > works
> > with arbitrary sized loads and buffers, e.g. vfio is considering to add IO 
> > loads to
> > multifd channels too.
> > 
> 
> In fact, we sent the patchset to the community in 2021. Pls see:
> https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/

I definitely wasn't aware of that in the past.

Multifd has changed quite a bit in the 9.0 release, so that may not apply
anymore.  One thing to mention is: please look at Dan's comment on the
possible use of rsocket.h:

https://lore.kernel.org/all/zjjm6rcqs5eho...@redhat.com/

And Jinpu did help provide an initial test result over the library:

https://lore.kernel.org/qemu-devel/camgffek8wiknqmouyxcathgtiem2dwocf_w7t0vmcd-i30t...@mail.gmail.com/

It looks like we have a chance to apply that in QEMU.

> 
> 
> > One thing to note that the question here is not about a pure performance
> > comparison between rdma and nics only.  It's about help us make a decision
> > on whether to drop rdma, iow, even if rdma performs well, the community 
> > still
> > has the right to drop it if nobody can actively work and maintain it.
> > It's just that if nics can perform as good it's more a reason to drop, 
> > unless
> > companies can help to provide good support and work together.
> > 
> 
> We are happy to provide the necessary review and maintenance work for RDMA
> if the community needs it.
> 
> CC'ing Chuan Zheng.

I'm not sure whether you and Jinpu's team would like to work together and
provide a final solution for rdma over multifd.  It could be much simpler
than the original 2021 proposal if the rsocket API works out.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Jinpu Wang
Hi Peter, hi Daniel,
On Mon, May 6, 2024 at 5:29 PM Peter Xu  wrote:
>
> On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> > Hi Peter, hi Daniel,
>
> Hi, Jinpu,
>
> Thanks for sharing this test results.  Sounds like a great news.
>
> What's your plan next?  Would it then be worthwhile / possible moving QEMU
> into that direction?  Would that greatly simplify rdma code as Dan
> mentioned?
I'm not very familiar with QEMU migration yet. From the test results, I think
it's a possible direction; we would just need to be based on at least a fairly
recent release, like rdma-core v33, with proper 'fork' support.
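
For readers unfamiliar with the fork issue mentioned here, the following is a small background sketch of the standard libibverbs mechanism that older rdma-core releases relied on (assumed context, not something taken from this thread): either set RDMAV_FORK_SAFE=1 in the environment or call ibv_fork_init() before any other verbs usage.

/* Background sketch: opting into fork-safe verbs behaviour on older
 * rdma-core stacks.  Newer kernels and rdma-core releases handle fork()
 * after memory registration more gracefully, which is why a recent
 * release such as v33 is attractive here. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int rc = ibv_fork_init();   /* must run before any other verbs call */
    if (rc) {
        fprintf(stderr, "ibv_fork_init failed: %d\n", rc);
        return 1;
    }

    /* ... rsocket/verbs objects can now be created, and the process may
     * fork() without corrupting registered memory regions ... */
    return 0;
}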

Maybe Dan or you could give more detail about what you have in mind
for using rsocket as a replacement for the future.
We will also look into the implementation details in the meantime.

Thx!
J

>
> Thanks,
>
> >
> > On Fri, May 3, 2024 at 4:33 PM Peter Xu  wrote:
> > >
> > > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > > I had a brief check in the rsocket changelog, there seems some
> > > > improvement over time,
> > > >  might be worth revisiting this. due to socket abstraction, we can't
> > > > use some feature like
> > > >  ODP, it won't be a small and easy task.
> > >
> > > It'll be good to know whether Dan's suggestion would work first, without
> > > rewritting everything yet so far.  Not sure whether some perf test could
> > > help with the rsocket APIs even without QEMU's involvements (or looking 
> > > for
> > > test data supporting / invalidate such conversions).
> > >
> > I did a quick test with iperf on 100 G environment and 40 G
> > environment, in summary rsocket works pretty well.
> >
> > iperf tests between 2 hosts with 40 G (IB),
> > first  a few test with different num. of threads on top of ipoib
> > interface, later with preload rsocket on top of same ipoib interface.
> >
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> > [ ID] Interval   Transfer Bandwidth
> > [  3] 0.-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> > [ ID] Interval   Transfer Bandwidth
> > [  3] 0.-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> > [SUM] 0.-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> > [ ID] Interval   Transfer Bandwidth
> > [  3] 0.-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> > [  4] 0.-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> > [  5] 0.-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> > [  6] 0.-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> > [SUM] 0.-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> > [ CT] final connect times (min/avg/max/stdev) =
> > 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> > jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> > 
> > Client connecting to 10.43.3.145, TCP port 52000
> > TCP window size:  165 KByte (default)
> > 
> > [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
> > [  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
> > [  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
> > [  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
> > [  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
> > [ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
> > [  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
> > [  4] local 10.43.3.146 port 35050 connected with 

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Gonglei (Arei)
Hello,

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Monday, May 6, 2024 11:18 PM
> To: Gonglei (Arei) 
> Cc: Daniel P. Berrangé ; Markus Armbruster
> ; Michael Galaxy ; Yu Zhang
> ; Zhijian Li (Fujitsu) ; Jinpu Wang
> ; Elmar Gerdes ;
> qemu-de...@nongnu.org; Yuval Shaia ; Kevin Wolf
> ; Prasanna Kumar Kalever
> ; Cornelia Huck ;
> Michael Roth ; Prasanna Kumar Kalever
> ; integrat...@gluster.org; Paolo Bonzini
> ; qemu-block@nongnu.org; de...@lists.libvirt.org;
> Hanna Reitz ; Michael S. Tsirkin ;
> Thomas Huth ; Eric Blake ; Song
> Gao ; Marc-André Lureau
> ; Alex Bennée ;
> Wainer dos Santos Moschetta ; Beraldo Leal
> ; Pannengyuan ;
> Xiexiangyou 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> > Hi, Peter
> 
> Hey, Lei,
> 
> Happy to see you around again after years.
> 
Haha, me too.

> > RDMA features high bandwidth, low latency (in non-blocking lossless
> > network), and direct remote memory access by bypassing the CPU (As you
> > know, CPU resources are expensive for cloud vendors, which is one of
> > the reasons why we introduced offload cards.), which TCP does not have.
> 
> It's another cost to use offload cards, v.s. preparing more cpu resources?
> 
A converged software and hardware offload architecture is the way to go for
all cloud vendors (considering the combined benefits in performance, cost,
security, and speed of innovation); it's not just a matter of adding the
resources of a DPU card.

> > In some scenarios where fast live migration is needed (extremely short
> > interruption duration and migration duration) is very useful. To this
> > end, we have also developed RDMA support for multifd.
> 
> Will any of you upstream that work?  I'm curious how intrusive would it be
> when adding it to multifd, if it can keep only 5 exported functions like what
> rdma.h does right now it'll be pretty nice.  We also want to make sure it 
> works
> with arbitrary sized loads and buffers, e.g. vfio is considering to add IO 
> loads to
> multifd channels too.
> 

In fact, we sent the patchset to the community in 2021. Pls see:
https://lore.kernel.org/all/20210203185906.GT2950@work-vm/T/


> One thing to note that the question here is not about a pure performance
> comparison between rdma and nics only.  It's about help us make a decision
> on whether to drop rdma, iow, even if rdma performs well, the community still
> has the right to drop it if nobody can actively work and maintain it.
> It's just that if nics can perform as good it's more a reason to drop, unless
> companies can help to provide good support and work together.
> 

We are happy to provide the necessary review and maintenance work for RDMA
if the community needs it.

CC'ing Chuan Zheng.


Regards,
-Gonglei



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Peter Xu
On Mon, May 06, 2024 at 12:08:43PM +0200, Jinpu Wang wrote:
> Hi Peter, hi Daniel,

Hi, Jinpu,

Thanks for sharing these test results.  Sounds like great news.

What's your plan next?  Would it then be worthwhile / possible to move QEMU
in that direction?  Would that greatly simplify the rdma code as Dan
mentioned?

Thanks,

> 
> On Fri, May 3, 2024 at 4:33 PM Peter Xu  wrote:
> >
> > On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > > I had a brief check in the rsocket changelog, there seems some
> > > improvement over time,
> > >  might be worth revisiting this. due to socket abstraction, we can't
> > > use some feature like
> > >  ODP, it won't be a small and easy task.
> >
> > It'll be good to know whether Dan's suggestion would work first, without
> > rewritting everything yet so far.  Not sure whether some perf test could
> > help with the rsocket APIs even without QEMU's involvements (or looking for
> > test data supporting / invalidate such conversions).
> >
> I did a quick test with iperf on 100 G environment and 40 G
> environment, in summary rsocket works pretty well.
> 
> iperf tests between 2 hosts with 40 G (IB),
> first  a few test with different num. of threads on top of ipoib
> interface, later with preload rsocket on top of same ipoib interface.
> 
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
> [SUM] 0.-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  3] 0.-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
> [  4] 0.-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
> [  6] 0.-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
> [SUM] 0.-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
> jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8
> 
> Client connecting to 10.43.3.145, TCP port 52000
> TCP window size:  165 KByte (default)
> 
> [  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
> [  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
> [  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
> [  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
> [  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
> [ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
> [  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
> [  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
> [ ID] Interval   Transfer Bandwidth
> [  4] 0.-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
> [  8] 0.-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
> [  5] 0.-10. sec  2.85 GBytes  2.45 Gbits/sec
> [ 12] 0.-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
> [  3] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> [  7] 0.-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
> [  9] 0.-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
> [  6] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
> [SUM] 0.-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
> [ CT] final connect times (min/avg/max/stdev) =
> 0.096/0.226/0.339/0.109 ms 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Peter Xu
On Mon, May 06, 2024 at 02:06:28AM +, Gonglei (Arei) wrote:
> Hi, Peter

Hey, Lei,

Happy to see you around again after years.

> RDMA features high bandwidth, low latency (in non-blocking lossless
> network), and direct remote memory access by bypassing the CPU (As you
> know, CPU resources are expensive for cloud vendors, which is one of the
> reasons why we introduced offload cards.), which TCP does not have.

It's another cost to use offload cards, v.s. preparing more cpu resources?

> In some scenarios where fast live migration is needed (extremely short
> interruption duration and migration duration) is very useful. To this
> end, we have also developed RDMA support for multifd.

Will any of you upstream that work?  I'm curious how intrusive it would be
to add it to multifd; if it can keep to only 5 exported functions like
what rdma.h does right now, that would be pretty nice.  We also want to make
sure it works with arbitrary sized loads and buffers, e.g. vfio is
considering adding IO loads to multifd channels too.

One thing to note is that the question here is not about a pure performance
comparison between rdma and nics only.  It's about helping us make a decision
on whether to drop rdma; iow, even if rdma performs well, the community
still has the right to drop it if nobody can actively work on and maintain it.
It's just that if nics can perform as well, that is one more reason to drop
it, unless companies can help to provide good support and work together.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-06 Thread Jinpu Wang
Hi Peter, hi Daniel,

On Fri, May 3, 2024 at 4:33 PM Peter Xu  wrote:
>
> On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> > I had a brief check in the rsocket changelog, there seems some
> > improvement over time,
> >  might be worth revisiting this. due to socket abstraction, we can't
> > use some feature like
> >  ODP, it won't be a small and easy task.
>
> It'll be good to know whether Dan's suggestion would work first, without
> rewritting everything yet so far.  Not sure whether some perf test could
> help with the rsocket APIs even without QEMU's involvements (or looking for
> test data supporting / invalidate such conversions).
>
I did a quick test with iperf in a 100 G environment and a 40 G
environment; in summary, rsocket works pretty well.

iperf tests between 2 hosts with 40 G (IB):
first a few tests with different numbers of threads on top of the ipoib
interface, later with rsocket preloaded on top of the same ipoib interface.

jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  3] local 10.43.3.146 port 55602 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0001 sec  2.85 GBytes  2.44 Gbits/sec
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 2

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  4] local 10.43.3.146 port 39640 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 39626 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0012 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.-10.0026 sec  2.86 GBytes  2.45 Gbits/sec
[SUM] 0.-10.0026 sec  5.71 GBytes  4.90 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.281/0.300/0.318/0.318 ms (tot/err) = 2/0
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 4

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  4] local 10.43.3.146 port 46956 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 46978 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 46944 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 46962 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  3] 0.-10.0017 sec  2.85 GBytes  2.45 Gbits/sec
[  4] 0.-10.0015 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.-10.0026 sec  2.85 GBytes  2.45 Gbits/sec
[  6] 0.-10.0005 sec  2.85 GBytes  2.45 Gbits/sec
[SUM] 0.-10.0005 sec  11.4 GBytes  9.80 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.274/0.312/0.360/0.212 ms (tot/err) = 4/0
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 8

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  7] local 10.43.3.146 port 35062 connected with 10.43.3.145 port 52000
[  6] local 10.43.3.146 port 35058 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 35066 connected with 10.43.3.145 port 52000
[  9] local 10.43.3.146 port 35074 connected with 10.43.3.145 port 52000
[  3] local 10.43.3.146 port 35038 connected with 10.43.3.145 port 52000
[ 12] local 10.43.3.146 port 35088 connected with 10.43.3.145 port 52000
[  5] local 10.43.3.146 port 35048 connected with 10.43.3.145 port 52000
[  4] local 10.43.3.146 port 35050 connected with 10.43.3.145 port 52000
[ ID] Interval   Transfer Bandwidth
[  4] 0.-10.0005 sec  2.85 GBytes  2.44 Gbits/sec
[  8] 0.-10.0011 sec  2.85 GBytes  2.45 Gbits/sec
[  5] 0.-10. sec  2.85 GBytes  2.45 Gbits/sec
[ 12] 0.-10.0021 sec  2.85 GBytes  2.44 Gbits/sec
[  3] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[  7] 0.-10.0065 sec  2.50 GBytes  2.14 Gbits/sec
[  9] 0.-10.0077 sec  2.52 GBytes  2.16 Gbits/sec
[  6] 0.-10.0003 sec  2.85 GBytes  2.44 Gbits/sec
[SUM] 0.-10.0003 sec  22.1 GBytes  19.0 Gbits/sec
[ CT] final connect times (min/avg/max/stdev) =
0.096/0.226/0.339/0.109 ms (tot/err) = 8/0
jw...@ps401a-914.nst:~$ iperf -p 52000 -c 10.43.3.145 -P 16
[  3] local 10.43.3.146 port 49540 connected with 10.43.3.145 port 52000

Client connecting to 10.43.3.145, TCP port 52000
TCP window size:  165 KByte (default)

[  6] local 10.43.3.146 port 49554 connected with 10.43.3.145 port 52000
[  8] local 10.43.3.146 port 49584 connected with 

RE: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-05 Thread Gonglei (Arei)
Hi, Peter

RDMA features high bandwidth, low latency (in a non-blocking lossless
network), and direct remote memory access that bypasses the CPU (as you know,
CPU resources are expensive for cloud vendors, which is one of the reasons
why we introduced offload cards), which TCP does not have.

In some scenarios where fast live migration is needed (extremely short
interruption and migration durations), RDMA is very useful. To this end, we
have also developed RDMA support for multifd.

Regards,
-Gonglei

> -Original Message-
> From: Peter Xu [mailto:pet...@redhat.com]
> Sent: Wednesday, May 1, 2024 11:31 PM
> To: Daniel P. Berrangé 
> Cc: Markus Armbruster ; Michael Galaxy
> ; Yu Zhang ; Zhijian Li (Fujitsu)
> ; Jinpu Wang ; Elmar Gerdes
> ; qemu-de...@nongnu.org; Yuval Shaia
> ; Kevin Wolf ; Prasanna
> Kumar Kalever ; Cornelia Huck
> ; Michael Roth ; Prasanna
> Kumar Kalever ; integrat...@gluster.org; Paolo
> Bonzini ; qemu-block@nongnu.org;
> de...@lists.libvirt.org; Hanna Reitz ; Michael S. Tsirkin
> ; Thomas Huth ; Eric Blake
> ; Song Gao ; Marc-André
> Lureau ; Alex Bennée
> ; Wainer dos Santos Moschetta
> ; Beraldo Leal ; Gonglei (Arei)
> ; Pannengyuan 
> Subject: Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling
> 
> On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > > Peter Xu  writes:
> > >
> > > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > > >> Hi All (and Peter),
> > > >
> > > > Hi, Michael,
> > > >
> > > >>
> > > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my
> > > >> last name (highly irregular for a male) and yes, that's my real last 
> > > >> name:
> > > >> https://www.linkedin.com/in/mrgalaxy/)
> > > >>
> > > >> I'm the original author of the RDMA implementation. I've been
> > > >> discussing with Yu Zhang for a little bit about potentially
> > > >> handing over maintainership of the codebase to his team.
> > > >>
> > > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > > >> unfortunately. so I've never been able to run tests or use what I
> > > >> wrote at work, and as all of you know, if you don't have a way to
> > > >> test something, then you can't maintain it.
> > > >>
> > > >> Yu Zhang put a (very kind) proposal forward to me to ask the
> > > >> community if they feel comfortable training his team to maintain
> > > >> the codebase (and run
> > > >> tests) while they learn about it.
> > > >
> > > > The "while learning" part is fine at least to me.  IMHO the
> > > > "ownership" to the code, or say, taking over the responsibility,
> > > > may or may not need 100% mastering the code base first.  There
> > > > should still be some fundamental confidence to work on the code
> > > > though as a starting point, then it's about serious use case to
> > > > back this up, and careful testings while getting more familiar with it.
> > >
> > > How much experience we expect of maintainers depends on the
> > > subsystem and other circumstances.  The hard requirement isn't
> > > experience, it's trust.  See the recent attack on xz.
> > >
> > > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > > I'm merely reminding y'all what's at stake.
> >
> > I think we shouldn't overly obsess[1] about 'xz', because the
> > overwhealmingly common scenario is that volunteer maintainers are
> > honest people. QEMU is in a massively better peer review situation.
> > With xz there was basically no oversight of the new maintainer. With
> > QEMU, we have oversight from 1000's of people on the list, a huge pool
> > of general maintainers, the specific migration maintainers, and the release
> manager merging code.
> >
> > With a lack of historical experiance with QEMU maintainership, I'd
> > suggest that new RDMA volunteers would start by adding themselves to the
> "MAINTAINERS"
> > file with only the 'Reviewer' classification. The main migration
> > maintainers would still handle pull requests, but wait for a R-b from
> > one of the RMDA volunteers. After some period of time the RDMA folks
> > could graduate to full maintainer status if the migration maintainers needed
> to reduce their load.
> > I suspect

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-03 Thread Peter Xu
On Fri, May 03, 2024 at 08:40:03AM +0200, Jinpu Wang wrote:
> I had a brief check in the rsocket changelog, there seems some
> improvement over time,
>  might be worth revisiting this. due to socket abstraction, we can't
> use some feature like
>  ODP, it won't be a small and easy task.

It'll be good to know first whether Dan's suggestion would work, without
rewriting everything so far.  Not sure whether some perf test could
help evaluate the rsocket APIs even without QEMU's involvement (or looking for
test data supporting / invalidating such a conversion).

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-03 Thread Jinpu Wang
Hi Daniel,

On Wed, May 1, 2024 at 6:00 PM Daniel P. Berrangé  wrote:
>
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry more is whether this is really what we want to keep rdma in
> > qemu, and that's also why I was trying to request for some serious
> > performance measurements comparing rdma v.s. nics.  And here when I said
> > "we" I mean both QEMU community and any company that will support keeping
> > rdma around.
> >
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> >
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
>
> I don't know much about the originals of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
>
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
I guess at the time rsocket was less mature and less performant
compared to using uverbs directly.



>
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.
I had a brief check of the rsocket changelog; there seems to be some
improvement over time, so it might be worth revisiting this. Due to the
socket abstraction we can't use some features like ODP, so it won't be
a small and easy task.
> With regards,
> Daniel
Thanks for the suggestion.
>
> [1] "almost" trivially, because the poll() integration for rsockets
> requires a bit more magic sauce since rsockets FDs are not
> really FDs from the kernel's POV. Still, QIOCHannel likely can
> abstract that probme.
> --
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Jinpu Wang
Hi Peter

On Thu, May 2, 2024 at 6:20 PM Peter Xu  wrote:
>
> On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> > Hi Michael, Hi Peter,
> >
> >
> > On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
> > >
> > > Yu Zhang / Jinpu,
> > >
> > > Any possibility (at your lesiure, and within the disclosure rules of
> > > your company, IONOS) if you could share any of your performance
> > > information to educate the group?
> > >
> > > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > > their disposal. Some people don't.
> > Our staging env is with 100 Gb/s IB environment.
> > We will have a new setup in the coming months with Ethernet (RoCE), we
> > will run some performance
> > comparison when we have the environment ready.
>
> Thanks both.  Please keep us posted.
>
> Just to double check, we're comparing "tcp:" v.s. "rdma:", RoCE is not
> involved, am I right?
Kind of. Our new hardware is RDMA capable; we can configure it to run
with the "rdma" transport or with "tcp", so it is a more direct comparison.
When running the "rdma" transport, RoCE is involved, e.g. the
rdma-core/ibverbs/rdmacm/vendor verbs drivers are used.
>
> The other note is that the comparison needs to be with multifd enabled for
> the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.
>
> I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
> without even waiting for months.  If you want I can try to see how we can
> test together.  And btw I don't think we need a cluster, IIUC we simply
> need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
> two cards just for experiments, systems that can drive the cards, and a
> wire supporting 100G?

Yes, the simple setup can be just two hosts directly connected. This reminds
me, I may also be able to find a test setup with a 100 G nic in the lab; I
will keep you posted.

Regards!
>
> >
> > >
> > > - Michael
> >
> > Thx!
> > Jinpu
> > >
> > > On 5/1/24 11:16, Peter Xu wrote:
> > > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > > >>> What I worry more is whether this is really what we want to keep rdma 
> > > >>> in
> > > >>> qemu, and that's also why I was trying to request for some serious
> > > >>> performance measurements comparing rdma v.s. nics.  And here when I 
> > > >>> said
> > > >>> "we" I mean both QEMU community and any company that will support 
> > > >>> keeping
> > > >>> rdma around.
> > > >>>
> > > >>> The problem is if NICs now are fast enough to perform at least equally
> > > >>> against rdma, and if it has a lower cost of overall maintenance, does 
> > > >>> it
> > > >>> mean that rdma migration will only be used by whoever wants to keep 
> > > >>> them in
> > > >>> the products and existed already?  In that case we should simply ask 
> > > >>> new
> > > >>> users to stick with tcp, and rdma users should only drop but not 
> > > >>> increase.
> > > >>>
> > > >>> It seems also destined that most new migration features will not 
> > > >>> support
> > > >>> rdma: see how much we drop old features in migration now (which rdma
> > > >>> _might_ still leverage, but maybe not), and how much we add mostly 
> > > >>> multifd
> > > >>> relevant which will probably not apply to rdma at all.  So in general 
> > > >>> what
> > > >>> I am worrying is a both-loss condition, if the company might be 
> > > >>> easier to
> > > >>> either stick with an old qemu (depending on whether other new 
> > > >>> features are
> > > >>> requested to be used besides RDMA alone), or do periodic rebase with 
> > > >>> RDMA
> > > >>> downstream only.
> > > >> I don't know much about the originals of RDMA support in QEMU and why
> > > >> this particular design was taken. It is indeed a huge maint burden to
> > > >> have a completely different code flow for RDMA with 4000+ lines of
> > > >> custom protocol signalling which is barely understandable.
> > > >>
> > > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > > >> type could almost[1] trivially have supported RDMA. There would have
> > > >> been almost no RDMA code required in the migration subsystem, and all
> > > >> the modern features like compression, multifd, post-copy, etc would
> > > >> "just work".
> > > >>
> > > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > > >> performance gains, but it might still have been a better tradeoff
> > > >> to have not quite so good peak performance, but with massively
> > > >> less maint burden.
> > > > My understanding so far is RDMA is sololy for performance but nothing 
> > > > else,
> > > > then it's a question on whether rdma existing users would like to do so 
> > > > if
> > > > it will run slower.
> > > >
> > > > Jinpu mentioned on the explicit usages of ib 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Peter Xu
On Thu, May 02, 2024 at 03:30:58PM +0200, Jinpu Wang wrote:
> Hi Michael, Hi Peter,
> 
> 
> On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
> >
> > Yu Zhang / Jinpu,
> >
> > Any possibility (at your lesiure, and within the disclosure rules of
> > your company, IONOS) if you could share any of your performance
> > information to educate the group?
> >
> > NICs have indeed changed, but not everybody has 100ge mellanox cards at
> > their disposal. Some people don't.
> Our staging env is with 100 Gb/s IB environment.
> We will have a new setup in the coming months with Ethernet (RoCE), we
> will run some performance
> comparison when we have the environment ready.

Thanks both.  Please keep us posted.

Just to double check: we're comparing "tcp:" vs. "rdma:", and RoCE is not
involved, am I right?

The other note is that the comparison needs to be with multifd enabled for
the "tcp:" case.  I'd suggest we start with 8 threads if it's 100Gbps.

I think I can still fetch some 100Gbps or even 200Gbps nics around our labs
without even waiting for months.  If you want I can try to see how we can
test together.  And btw I don't think we need a cluster, IIUC we simply
need two hosts, 100G nic on both sides?  IOW, it seems to me we only need
two cards just for experiments, systems that can drive the cards, and a
wire supporting 100G?

> 
> >
> > - Michael
> 
> Thx!
> Jinpu
> >
> > On 5/1/24 11:16, Peter Xu wrote:
> > > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> > >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > >>> What I worry more is whether this is really what we want to keep rdma in
> > >>> qemu, and that's also why I was trying to request for some serious
> > >>> performance measurements comparing rdma v.s. nics.  And here when I said
> > >>> "we" I mean both QEMU community and any company that will support 
> > >>> keeping
> > >>> rdma around.
> > >>>
> > >>> The problem is if NICs now are fast enough to perform at least equally
> > >>> against rdma, and if it has a lower cost of overall maintenance, does it
> > >>> mean that rdma migration will only be used by whoever wants to keep 
> > >>> them in
> > >>> the products and existed already?  In that case we should simply ask new
> > >>> users to stick with tcp, and rdma users should only drop but not 
> > >>> increase.
> > >>>
> > >>> It seems also destined that most new migration features will not support
> > >>> rdma: see how much we drop old features in migration now (which rdma
> > >>> _might_ still leverage, but maybe not), and how much we add mostly 
> > >>> multifd
> > >>> relevant which will probably not apply to rdma at all.  So in general 
> > >>> what
> > >>> I am worrying is a both-loss condition, if the company might be easier 
> > >>> to
> > >>> either stick with an old qemu (depending on whether other new features 
> > >>> are
> > >>> requested to be used besides RDMA alone), or do periodic rebase with 
> > >>> RDMA
> > >>> downstream only.
> > >> I don't know much about the originals of RDMA support in QEMU and why
> > >> this particular design was taken. It is indeed a huge maint burden to
> > >> have a completely different code flow for RDMA with 4000+ lines of
> > >> custom protocol signalling which is barely understandable.
> > >>
> > >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> > >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> > >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> > >> type could almost[1] trivially have supported RDMA. There would have
> > >> been almost no RDMA code required in the migration subsystem, and all
> > >> the modern features like compression, multifd, post-copy, etc would
> > >> "just work".
> > >>
> > >> I guess the 'rsocket.h' shim may well limit some of the possible
> > >> performance gains, but it might still have been a better tradeoff
> > >> to have not quite so good peak performance, but with massively
> > >> less maint burden.
> > > My understanding so far is RDMA is sololy for performance but nothing 
> > > else,
> > > then it's a question on whether rdma existing users would like to do so if
> > > it will run slower.
> > >
> > > Jinpu mentioned on the explicit usages of ib verbs but I am just mostly
> > > quotting that word as I don't really know such details:
> > >
> > > https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOew9oW_kg$
> > >
> > > So not sure whether that applies here too, in that having qiochannel
> > > wrapper may not allow direct access to those ib verbs.
> > >
> > > Thanks,
> > >
> > >> With regards,
> > >> Daniel
> > >>
> > >> [1] "almost" trivially, because the poll() integration for rsockets
> > >>  requires a bit more magic sauce since rsockets FDs are not
> > >>  really FDs from the 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Jinpu Wang
Hi Michael, Hi Peter,


On Thu, May 2, 2024 at 3:23 PM Michael Galaxy  wrote:
>
> Yu Zhang / Jinpu,
>
> Any possibility (at your lesiure, and within the disclosure rules of
> your company, IONOS) if you could share any of your performance
> information to educate the group?
>
> NICs have indeed changed, but not everybody has 100ge mellanox cards at
> their disposal. Some people don't.
Our staging env is a 100 Gb/s IB environment.
We will have a new setup with Ethernet (RoCE) in the coming months; we
will run some performance comparisons when the environment is ready.

>
> - Michael

Thx!
Jinpu
>
> On 5/1/24 11:16, Peter Xu wrote:
> > On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> >> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> >>> What I worry more is whether this is really what we want to keep rdma in
> >>> qemu, and that's also why I was trying to request for some serious
> >>> performance measurements comparing rdma v.s. nics.  And here when I said
> >>> "we" I mean both QEMU community and any company that will support keeping
> >>> rdma around.
> >>>
> >>> The problem is if NICs now are fast enough to perform at least equally
> >>> against rdma, and if it has a lower cost of overall maintenance, does it
> >>> mean that rdma migration will only be used by whoever wants to keep them 
> >>> in
> >>> the products and existed already?  In that case we should simply ask new
> >>> users to stick with tcp, and rdma users should only drop but not increase.
> >>>
> >>> It seems also destined that most new migration features will not support
> >>> rdma: see how much we drop old features in migration now (which rdma
> >>> _might_ still leverage, but maybe not), and how much we add mostly multifd
> >>> relevant which will probably not apply to rdma at all.  So in general what
> >>> I am worrying is a both-loss condition, if the company might be easier to
> >>> either stick with an old qemu (depending on whether other new features are
> >>> requested to be used besides RDMA alone), or do periodic rebase with RDMA
> >>> downstream only.
> >> I don't know much about the originals of RDMA support in QEMU and why
> >> this particular design was taken. It is indeed a huge maint burden to
> >> have a completely different code flow for RDMA with 4000+ lines of
> >> custom protocol signalling which is barely understandable.
> >>
> >> I would note that /usr/include/rdma/rsocket.h provides a higher level
> >> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> >> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> >> type could almost[1] trivially have supported RDMA. There would have
> >> been almost no RDMA code required in the migration subsystem, and all
> >> the modern features like compression, multifd, post-copy, etc would
> >> "just work".
> >>
> >> I guess the 'rsocket.h' shim may well limit some of the possible
> >> performance gains, but it might still have been a better tradeoff
> >> to have not quite so good peak performance, but with massively
> >> less maint burden.
> > My understanding so far is RDMA is sololy for performance but nothing else,
> > then it's a question on whether rdma existing users would like to do so if
> > it will run slower.
> >
> > Jinpu mentioned on the explicit usages of ib verbs but I am just mostly
> > quotting that word as I don't really know such details:
> >
> > https://urldefense.com/v3/__https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOew9oW_kg$
> >
> > So not sure whether that applies here too, in that having qiochannel
> > wrapper may not allow direct access to those ib verbs.
> >
> > Thanks,
> >
> >> With regards,
> >> Daniel
> >>
> >> [1] "almost" trivially, because the poll() integration for rsockets
> >>  requires a bit more magic sauce since rsockets FDs are not
> >>  really FDs from the kernel's POV. Still, QIOCHannel likely can
> >>  abstract that probme.
> >> --
> >> |: 
> >> https://urldefense.com/v3/__https://berrange.com__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfyTmFFUQ$
> >>-o-
> >> https://urldefense.com/v3/__https://www.flickr.com/photos/dberrange__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOf8A5OC0Q$
> >>   :|
> >> |: 
> >> https://urldefense.com/v3/__https://libvirt.org__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOf3gffAdg$
> >>   -o-
> >> https://urldefense.com/v3/__https://fstop138.berrange.com__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOfPMofYqw$
> >>   :|
> >> |: 
> >> https://urldefense.com/v3/__https://entangle-photo.org__;!!GjvTz_vk!W6-HGWM-XkF_52am249DrLIDQeZctVOHg72LvOHGUcwxqQM5mY0GNYYl-yNJslN7A5GfLOeQ5jjAeQ$
> >>  -o-
> >> 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-02 Thread Michael Galaxy

Yu Zhang / Jinpu,

Any possibility (at your leisure, and within the disclosure rules of
your company, IONOS) that you could share any of your performance
information to educate the group?


NICs have indeed changed, but not everybody has 100ge mellanox cards at 
their disposal. Some people don't.


- Michael

On 5/1/24 11:16, Peter Xu wrote:

On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:

On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:

What I worry more is whether this is really what we want to keep rdma in
qemu, and that's also why I was trying to request for some serious
performance measurements comparing rdma v.s. nics.  And here when I said
"we" I mean both QEMU community and any company that will support keeping
rdma around.

The problem is if NICs now are fast enough to perform at least equally
against rdma, and if it has a lower cost of overall maintenance, does it
mean that rdma migration will only be used by whoever wants to keep them in
the products and existed already?  In that case we should simply ask new
users to stick with tcp, and rdma users should only drop but not increase.

It seems also destined that most new migration features will not support
rdma: see how much we drop old features in migration now (which rdma
_might_ still leverage, but maybe not), and how much we add mostly multifd
relevant which will probably not apply to rdma at all.  So in general what
I am worrying is a both-loss condition, if the company might be easier to
either stick with an old qemu (depending on whether other new features are
requested to be used besides RDMA alone), or do periodic rebase with RDMA
downstream only.

I don't know much about the origins of RDMA support in QEMU and why
this particular design was taken. It is indeed a huge maint burden to
have a completely different code flow for RDMA with 4000+ lines of
custom protocol signalling which is barely understandable.

I would note that /usr/include/rdma/rsocket.h provides a higher level
API that is a 1-1 match of the normal kernel 'sockets' API. If we had
leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
type could almost[1] trivially have supported RDMA. There would have
been almost no RDMA code required in the migration subsystem, and all
the modern features like compression, multifd, post-copy, etc would
"just work".

I guess the 'rsocket.h' shim may well limit some of the possible
performance gains, but it might still have been a better tradeoff
to have not quite so good peak performance, but with massively
less maint burden.

My understanding so far is that RDMA is solely for performance and nothing
else, so then it's a question of whether existing rdma users would still
want to use it if it will run slower.

Jinpu mentioned the explicit usage of ib verbs, but I am mostly just
quoting that word as I don't really know such details:

https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/

So I'm not sure whether that applies here too, in that having a qiochannel
wrapper may not allow direct access to those ib verbs.

Thanks,


With regards,
Daniel

[1] "almost" trivially, because the poll() integration for rsockets
 requires a bit more magic sauce since rsockets FDs are not
 really FDs from the kernel's POV. Still, QIOChannel likely can
 abstract that problem.
--
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|





Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Wed, May 01, 2024 at 04:59:38PM +0100, Daniel P. Berrangé wrote:
> On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> > What I worry more is whether this is really what we want to keep rdma in
> > qemu, and that's also why I was trying to request for some serious
> > performance measurements comparing rdma v.s. nics.  And here when I said
> > "we" I mean both QEMU community and any company that will support keeping
> > rdma around.
> > 
> > The problem is if NICs now are fast enough to perform at least equally
> > against rdma, and if it has a lower cost of overall maintenance, does it
> > mean that rdma migration will only be used by whoever wants to keep them in
> > the products and existed already?  In that case we should simply ask new
> > users to stick with tcp, and rdma users should only drop but not increase.
> > 
> > It seems also destined that most new migration features will not support
> > rdma: see how much we drop old features in migration now (which rdma
> > _might_ still leverage, but maybe not), and how much we add mostly multifd
> > relevant which will probably not apply to rdma at all.  So in general what
> > I am worrying is a both-loss condition, if the company might be easier to
> > either stick with an old qemu (depending on whether other new features are
> > requested to be used besides RDMA alone), or do periodic rebase with RDMA
> > downstream only.
> 
> I don't know much about the originals of RDMA support in QEMU and why
> this particular design was taken. It is indeed a huge maint burden to
> have a completely different code flow for RDMA with 4000+ lines of
> custom protocol signalling which is barely understandable.
> 
> I would note that /usr/include/rdma/rsocket.h provides a higher level
> API that is a 1-1 match of the normal kernel 'sockets' API. If we had
> leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
> type could almost[1] trivially have supported RDMA. There would have
> been almost no RDMA code required in the migration subsystem, and all
> the modern features like compression, multifd, post-copy, etc would
> "just work".
> 
> I guess the 'rsocket.h' shim may well limit some of the possible
> performance gains, but it might still have been a better tradeoff
> to have not quite so good peak performance, but with massively
> less maint burden.

My understanding so far is that RDMA is solely for performance and nothing
else, so then it's a question of whether existing rdma users would still
want to use it if it will run slower.

Jinpu mentioned the explicit usage of ib verbs, but I am mostly just
quoting that word as I don't really know such details:

https://lore.kernel.org/qemu-devel/camgffem2twjxopcnqtq1sjytf5395dbztcmyikrqfxdzjws...@mail.gmail.com/

So I'm not sure whether that applies here too, in that having a qiochannel
wrapper may not allow direct access to those ib verbs.

Thanks,

> 
> With regards,
> Daniel
> 
> [1] "almost" trivially, because the poll() integration for rsockets
> requires a bit more magic sauce since rsockets FDs are not
> really FDs from the kernel's POV. Still, QIOCHannel likely can
> abstract that probme.
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
> 

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Daniel P . Berrangé
On Wed, May 01, 2024 at 11:31:13AM -0400, Peter Xu wrote:
> What I worry more is whether this is really what we want to keep rdma in
> qemu, and that's also why I was trying to request for some serious
> performance measurements comparing rdma v.s. nics.  And here when I said
> "we" I mean both QEMU community and any company that will support keeping
> rdma around.
> 
> The problem is if NICs now are fast enough to perform at least equally
> against rdma, and if it has a lower cost of overall maintenance, does it
> mean that rdma migration will only be used by whoever wants to keep them in
> the products and existed already?  In that case we should simply ask new
> users to stick with tcp, and rdma users should only drop but not increase.
> 
> It seems also destined that most new migration features will not support
> rdma: see how much we drop old features in migration now (which rdma
> _might_ still leverage, but maybe not), and how much we add mostly multifd
> relevant which will probably not apply to rdma at all.  So in general what
> I am worrying is a both-loss condition, if the company might be easier to
> either stick with an old qemu (depending on whether other new features are
> requested to be used besides RDMA alone), or do periodic rebase with RDMA
> downstream only.

I don't know much about the origins of RDMA support in QEMU and why
this particular design was taken. It is indeed a huge maint burden to
have a completely different code flow for RDMA with 4000+ lines of
custom protocol signalling which is barely understandable.

I would note that /usr/include/rdma/rsocket.h provides a higher level
API that is a 1-1 match of the normal kernel 'sockets' API. If we had
leveraged that, then QIOChannelSocket class and the QAPI SocketAddress
type could almost[1] trivially have supported RDMA. There would have
been almost no RDMA code required in the migration subsystem, and all
the modern features like compression, multifd, post-copy, etc would
"just work".

I guess the 'rsocket.h' shim may well limit some of the possible
performance gains, but it might still have been a better tradeoff
to have not quite so good peak performance, but with massively
less maint burden.

With regards,
Daniel

[1] "almost" trivially, because the poll() integration for rsockets
requires a bit more magic sauce since rsockets FDs are not
really FDs from the kernel's POV. Still, QIOChannel likely can
abstract that problem.
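
To make that footnote concrete: rsockets ships its own poll entry point, so
an event loop has to route rsocket fds through rpoll() (or go through the
rsockets preload shim) instead of the kernel's poll().  A minimal, untested
sketch with the raw API, assuming the fd came from rsocket()/raccept():

#include <poll.h>
#include <rdma/rsocket.h>

/*
 * Wait for an rsocket fd to become readable.  Note rpoll(), not poll():
 * an rsocket fd is not a kernel fd, so handing it to poll()/epoll()
 * directly would not work; that is the "magic sauce" a QIOChannel
 * integration would have to hide.
 */
static int wait_readable(int rfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = rfd, .events = POLLIN };
    int ret = rpoll(&pfd, 1, timeout_ms);

    if (ret <= 0) {
        return ret;                     /* 0 = timeout, <0 = error */
    }
    return (pfd.revents & POLLIN) ? 1 : 0;
}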
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-05-01 Thread Peter Xu
On Tue, Apr 30, 2024 at 09:00:49AM +0100, Daniel P. Berrangé wrote:
> On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> > Peter Xu  writes:
> > 
> > > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > >> Hi All (and Peter),
> > >
> > > Hi, Michael,
> > >
> > >> 
> > >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > >> (highly irregular for a male) and yes, that's my real last name:
> > >> https://www.linkedin.com/in/mrgalaxy/)
> > >> 
> > >> I'm the original author of the RDMA implementation. I've been discussing
> > >> with Yu Zhang for a little bit about potentially handing over 
> > >> maintainership
> > >> of the codebase to his team.
> > >> 
> > >> I simply have zero access to RoCE or Infiniband hardware at all,
> > >> unfortunately. so I've never been able to run tests or use what I wrote 
> > >> at
> > >> work, and as all of you know, if you don't have a way to test something,
> > >> then you can't maintain it.
> > >> 
> > >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > >> they feel comfortable training his team to maintain the codebase (and run
> > >> tests) while they learn about it.
> > >
> > > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > > the code, or say, taking over the responsibility, may or may not need 100%
> > > mastering the code base first.  There should still be some fundamental
> > > confidence to work on the code though as a starting point, then it's about
> > > serious use case to back this up, and careful testings while getting more
> > > familiar with it.
> > 
> > How much experience we expect of maintainers depends on the subsystem
> > and other circumstances.  The hard requirement isn't experience, it's
> > trust.  See the recent attack on xz.
> > 
> > I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> > I'm merely reminding y'all what's at stake.
> 
> I think we shouldn't overly obsess[1] about 'xz', because the overwhealmingly
> common scenario is that volunteer maintainers are honest people. QEMU is
> in a massively better peer review situation. With xz there was basically no
> oversight of the new maintainer. With QEMU, we have oversight from 1000's
> of people on the list, a huge pool of general maintainers, the specific
> migration maintainers, and the release manager merging code.
> 
> With a lack of historical experiance with QEMU maintainership, I'd suggest
> that new RDMA volunteers would start by adding themselves to the "MAINTAINERS"
> file with only the 'Reviewer' classification. The main migration maintainers
> would still handle pull requests, but wait for a R-b from one of the RMDA
> volunteers. After some period of time the RDMA folks could graduate to full
> maintainer status if the migration maintainers needed to reduce their load.
> I suspect that might prove unneccesary though, given RDMA isn't an area of
> code with a high turnover of patches.

Right, and we can do that as a start; it also follows our normal rule of
starting as a Reviewer before maintaining something.  I even considered Zhijian
to be the previous rdma goto guy / maintainer no matter what role he used
to have in the MAINTAINERS file.

Here IMHO it's more about whether any company would like to stand up and
provide help, without yet binding that to be able to send pull requests in
the near future or even longer term.

What I worry more is whether this is really what we want to keep rdma in
qemu, and that's also why I was trying to request for some serious
performance measurements comparing rdma v.s. nics.  And here when I said
"we" I mean both QEMU community and any company that will support keeping
rdma around.

The problem is if NICs now are fast enough to perform at least equally
against rdma, and if it has a lower cost of overall maintenance, does it
mean that rdma migration will only be used by whoever wants to keep them in
the products and existed already?  In that case we should simply ask new
users to stick with tcp, and rdma users should only drop but not increase.

It seems also destined that most new migration features will not support
rdma: see how much we drop old features in migration now (which rdma
_might_ still leverage, but maybe not), and how much we add mostly multifd
relevant which will probably not apply to rdma at all.  So in general what
I am worrying is a both-loss condition, if the company might be easier to
either stick with an old qemu (depending on whether other new features are
requested to be used besides RDMA alone), or do periodic rebase with RDMA
downstream only.

So even if we want to keep RDMA around I hope with this chance we can at
least have clear picture on when we should still suggest any new user to
use RDMA (with the reasons behind).  Or we simply shouldn't suggest any new
user to use RDMA at all (because at least it'll lose many new migration
features).

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-30 Thread Daniel P . Berrangé
On Tue, Apr 30, 2024 at 09:15:03AM +0200, Markus Armbruster wrote:
> Peter Xu  writes:
> 
> > On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> >> Hi All (and Peter),
> >
> > Hi, Michael,
> >
> >> 
> >> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> >> (highly irregular for a male) and yes, that's my real last name:
> >> https://www.linkedin.com/in/mrgalaxy/)
> >> 
> >> I'm the original author of the RDMA implementation. I've been discussing
> >> with Yu Zhang for a little bit about potentially handing over 
> >> maintainership
> >> of the codebase to his team.
> >> 
> >> I simply have zero access to RoCE or Infiniband hardware at all,
> >> unfortunately. so I've never been able to run tests or use what I wrote at
> >> work, and as all of you know, if you don't have a way to test something,
> >> then you can't maintain it.
> >> 
> >> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> >> they feel comfortable training his team to maintain the codebase (and run
> >> tests) while they learn about it.
> >
> > The "while learning" part is fine at least to me.  IMHO the "ownership" to
> > the code, or say, taking over the responsibility, may or may not need 100%
> > mastering the code base first.  There should still be some fundamental
> > confidence to work on the code though as a starting point, then it's about
> > serious use case to back this up, and careful testings while getting more
> > familiar with it.
> 
> How much experience we expect of maintainers depends on the subsystem
> and other circumstances.  The hard requirement isn't experience, it's
> trust.  See the recent attack on xz.
> 
> I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
> I'm merely reminding y'all what's at stake.

I think we shouldn't overly obsess[1] about 'xz', because the overwhelmingly
common scenario is that volunteer maintainers are honest people. QEMU is
in a massively better peer review situation. With xz there was basically no
oversight of the new maintainer. With QEMU, we have oversight from 1000's
of people on the list, a huge pool of general maintainers, the specific
migration maintainers, and the release manager merging code.

With a lack of historical experience with QEMU maintainership, I'd suggest
that new RDMA volunteers start by adding themselves to the "MAINTAINERS"
file with only the 'Reviewer' classification. The main migration maintainers
would still handle pull requests, but wait for an R-b from one of the RDMA
volunteers. After some period of time the RDMA folks could graduate to full
maintainer status if the migration maintainers needed to reduce their load.
I suspect that might prove unnecessary though, given RDMA isn't an area of
code with a high turnover of patches.
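
As a rough illustration of what that could look like in the MAINTAINERS
file (the name, address and file globs below are placeholders, not a real
proposal):

  RDMA Migration
  M: <current migration maintainers, unchanged>
  R: New Volunteer <rdma-volunteer@example.org>
  S: Odd Fixes
  F: migration/rdma.*

The point being that the 'R:' entry gets the volunteers CC'd on patches and
signals who is expected to review, without changing who merges.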

With regards,
Daniel

[1] If we do want to obsess about something bad though, we should
look at our handling of binary blobs in the repo and tarballs.
    i.e. the firmware binaries that all get built in an arbitrary
    environment of their respective maintainer. If we need firmware
    blobs in tree, we should strive to come up with a reproducible
build environment that gives us byte-for-byte identical results,
so the blobs can be verified. This is rather a tangent from this
thread though :)
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-30 Thread Markus Armbruster
Peter Xu  writes:

> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
>> Hi All (and Peter),
>
> Hi, Michael,
>
>> 
>> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
>> (highly irregular for a male) and yes, that's my real last name:
>> https://www.linkedin.com/in/mrgalaxy/)
>> 
>> I'm the original author of the RDMA implementation. I've been discussing
>> with Yu Zhang for a little bit about potentially handing over maintainership
>> of the codebase to his team.
>> 
>> I simply have zero access to RoCE or Infiniband hardware at all,
>> unfortunately. so I've never been able to run tests or use what I wrote at
>> work, and as all of you know, if you don't have a way to test something,
>> then you can't maintain it.
>> 
>> Yu Zhang put a (very kind) proposal forward to me to ask the community if
>> they feel comfortable training his team to maintain the codebase (and run
>> tests) while they learn about it.
>
> The "while learning" part is fine at least to me.  IMHO the "ownership" to
> the code, or say, taking over the responsibility, may or may not need 100%
> mastering the code base first.  There should still be some fundamental
> confidence to work on the code though as a starting point, then it's about
> serious use case to back this up, and careful testings while getting more
> familiar with it.

How much experience we expect of maintainers depends on the subsystem
and other circumstances.  The hard requirement isn't experience, it's
trust.  See the recent attack on xz.

I do not mean to express any doubts whatsoever on Yu Zhang's integrity!
I'm merely reminding y'all what's at stake.

[...]




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-29 Thread Michael Galaxy

Reviewed-by: Michael Galaxy 

Thanks Yu Zhang and Peter.

- Michael

On 4/29/24 15:45, Yu Zhang wrote:

Hello Michael and Peter,

We are very glad at your quick and kind reply about our plan to take
over the maintenance of your code. The message is for presenting our
plan and working together.
If we were able to obtain the maintainer's role, our plan is:

1. Create the necessary unit-test cases and get them integrated into
the current QEMU GitLab-CI pipeline
2. Review and test the code changes by other developers to ensure that
nothing is broken in the changed code before being merged by the
community
3. Based on our current practice and application scenario, look for
possible improvements when necessary

Besides that, a patch is attached to announce this change in the community.

With your generous support, we hope that the development community
will make a positive decision for us.

Kind regards,
Yu Zhang@ IONOS Cloud

On Mon, Apr 29, 2024 at 4:57 PM Peter Xu  wrote:

On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:

Hi All (and Peter),

Hi, Michael,


My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
(highly irregular for a male) and yes, that's my real last name:
https://urldefense.com/v3/__https://www.linkedin.com/in/mrgalaxy/__;!!GjvTz_vk!TZmnCE90EK692dSjZGr-2cpOEZBQTBsTO2bW5z3rSbpZgNVCexZkxwDXhmIOWG2GAKZAUovQ5xe5coQ$
 )

I'm the original author of the RDMA implementation. I've been discussing
with Yu Zhang for a little bit about potentially handing over maintainership
of the codebase to his team.

I simply have zero access to RoCE or Infiniband hardware at all,
unfortunately. so I've never been able to run tests or use what I wrote at
work, and as all of you know, if you don't have a way to test something,
then you can't maintain it.

Yu Zhang put a (very kind) proposal forward to me to ask the community if
they feel comfortable training his team to maintain the codebase (and run
tests) while they learn about it.

The "while learning" part is fine at least to me.  IMHO the "ownership" to
the code, or say, taking over the responsibility, may or may not need 100%
mastering the code base first.  There should still be some fundamental
confidence to work on the code though as a starting point, then it's about
serious use case to back this up, and careful testings while getting more
familiar with it.


If you don't mind, I'd like to let him send over his (very detailed)
proposal,

Yes please, it's exactly the time to share the plan.  The hope is we try to
reach a consensus before or around the middle of this release (9.1).
Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
not yet out, but I think it means we make a decision before or around
middle of June.

Thanks,


- Michael

On 4/11/24 11:36, Yu Zhang wrote:

1) Either a CI test covering at least the major RDMA paths, or at least
  periodically tests for each QEMU release will be needed.

We use a batch of regression test cases for the stack, which covers the
test for QEMU. I did such test for most of the QEMU releases planned as
candidates for rollout.

The migration test needs a pair of (either physical or virtual) servers with
InfiniBand network, which makes it difficult to do on a single server. The
nested VM could be a possible approach, for which we may need virtual
InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.

[1]  
https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$

Thanks and best regards!

On Thu, Apr 11, 2024 at 4:20 PM Peter Xu  wrote:

On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:

On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:

on 4/10/2024 3:46 AM, Peter Xu wrote:


Is there document/link about the unittest/CI for migration tests, Why
are those tests missing?
Is it hard or very special to set up an environment for that? maybe we
can help in this regards.

See tests/qtest/migration-test.c.  We put most of our migration tests
there and that's covered in CI.

I think one major issue is CI systems don't normally have rdma devices.
Can rdma migration test be carried out without a real hardware?

Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
$ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
then we can get a new RDMA interface "rxe_eth0".
This new RDMA interface is able to do the QEMU RDMA migration.

Also, the loopback(lo) device is able to emulate the RDMA interface
"rxe_lo", however when
I tried(years ago) to do RDMA migration over this
interface(rdma:127.0.0.1:) , it got something wrong.
So i gave up enabling the RDMA migration qtest at that time.

Thanks, Zhijian.

I'm not sure adding an emu-link for rdma is doable for CI systems, though.
Maybe someone more familiar with how CI works can chime in.

Some people got dropped 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-29 Thread Yu Zhang
Hello Michael and Peter,

We are very glad about your quick and kind reply regarding our plan to take
over the maintenance of your code. This message presents our plan and how
we would work together.
If we obtain the maintainer role, our plan is:

1. Create the necessary unit-test cases and get them integrated into
the current QEMU GitLab-CI pipeline (see the sketch after this list)
2. Review and test code changes by other developers to ensure that
nothing is broken before the changes are merged by the
community
3. Based on our current practice and application scenarios, look for
possible improvements when necessary
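
For item 1, a hypothetical GitLab CI job sketch, just to show the rough
shape we have in mind (job name, image and runner capabilities are
assumptions, not QEMU's actual CI configuration):

  migration-rdma-qtest:
    stage: test
    image: registry.example.org/qemu-ci-fedora:latest   # placeholder image
    script:
      # SoftRoCE needs the rdma_rxe module; assumes a runner that allows this.
      - modprobe rdma_rxe
      - rdma link add rxe_eth0 type rxe netdev eth0
      - mkdir build && cd build
      - ../configure --target-list=x86_64-softmmu
      - make -j"$(nproc)"
      # Hypothetical test path; these RDMA qtests are exactly what item 1 would add.
      - QTEST_QEMU_BINARY=./qemu-system-x86_64 ./tests/qtest/migration-test -p /x86_64/migration/rdma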

Besides that, a patch is attached to announce this change in the community.

With your generous support, we hope that the development community
will make a positive decision for us.

Kind regards,
Yu Zhang@ IONOS Cloud

On Mon, Apr 29, 2024 at 4:57 PM Peter Xu  wrote:
>
> On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> > Hi All (and Peter),
>
> Hi, Michael,
>
> >
> > My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> > (highly irregular for a male) and yes, that's my real last name:
> > https://www.linkedin.com/in/mrgalaxy/)
> >
> > I'm the original author of the RDMA implementation. I've been discussing
> > with Yu Zhang for a little bit about potentially handing over maintainership
> > of the codebase to his team.
> >
> > I simply have zero access to RoCE or Infiniband hardware at all,
> > unfortunately. so I've never been able to run tests or use what I wrote at
> > work, and as all of you know, if you don't have a way to test something,
> > then you can't maintain it.
> >
> > Yu Zhang put a (very kind) proposal forward to me to ask the community if
> > they feel comfortable training his team to maintain the codebase (and run
> > tests) while they learn about it.
>
> The "while learning" part is fine at least to me.  IMHO the "ownership" to
> the code, or say, taking over the responsibility, may or may not need 100%
> mastering the code base first.  There should still be some fundamental
> confidence to work on the code though as a starting point, then it's about
> serious use case to back this up, and careful testings while getting more
> familiar with it.
>
> >
> > If you don't mind, I'd like to let him send over his (very detailed)
> > proposal,
>
> Yes please, it's exactly the time to share the plan.  The hope is we try to
> reach a consensus before or around the middle of this release (9.1).
> Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
> not yet out, but I think it means we make a decision before or around
> middle of June.
>
> Thanks,
>
> >
> > - Michael
> >
> > On 4/11/24 11:36, Yu Zhang wrote:
> > > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > > >  periodically tests for each QEMU release will be needed.
> > > We use a batch of regression test cases for the stack, which covers the
> > > test for QEMU. I did such test for most of the QEMU releases planned as
> > > candidates for rollout.
> > >
> > > The migration test needs a pair of (either physical or virtual) servers 
> > > with
> > > InfiniBand network, which makes it difficult to do on a single server. The
> > > nested VM could be a possible approach, for which we may need virtual
> > > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you 
> > > know.
> > >
> > > [1]  
> > > https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
> > >
> > > Thanks and best regards!
> > >
> > > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu  wrote:
> > > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via 
> > > > > wrote:
> > > > > >
> > > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > >
> > > > > > > > Is there document/link about the unittest/CI for migration 
> > > > > > > > tests, Why
> > > > > > > > are those tests missing?
> > > > > > > > Is it hard or very special to set up an environment for that? 
> > > > > > > > maybe we
> > > > > > > > can help in this regards.
> > > > > > > See tests/qtest/migration-test.c.  We put most of our migration 
> > > > > > > tests
> > > > > > > there and that's covered in CI.
> > > > > > >
> > > > > > > I think one major issue is CI systems don't normally have rdma 
> > > > > > > devices.
> > > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > >
> > > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > > "rxe_lo", however 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-29 Thread Peter Xu
On Mon, Apr 29, 2024 at 08:08:10AM -0500, Michael Galaxy wrote:
> Hi All (and Peter),

Hi, Michael,

> 
> My name is Michael Galaxy (formerly Hines). Yes, I changed my last name
> (highly irregular for a male) and yes, that's my real last name:
> https://www.linkedin.com/in/mrgalaxy/)
> 
> I'm the original author of the RDMA implementation. I've been discussing
> with Yu Zhang for a little bit about potentially handing over maintainership
> of the codebase to his team.
> 
> I simply have zero access to RoCE or Infiniband hardware at all,
> unfortunately. so I've never been able to run tests or use what I wrote at
> work, and as all of you know, if you don't have a way to test something,
> then you can't maintain it.
> 
> Yu Zhang put a (very kind) proposal forward to me to ask the community if
> they feel comfortable training his team to maintain the codebase (and run
> tests) while they learn about it.

The "while learning" part is fine at least to me.  IMHO the "ownership" to
the code, or say, taking over the responsibility, may or may not need 100%
mastering the code base first.  There should still be some fundamental
confidence to work on the code though as a starting point, then it's about
serious use case to back this up, and careful testings while getting more
familiar with it.

> 
> If you don't mind, I'd like to let him send over his (very detailed)
> proposal,

Yes please, it's exactly the time to share the plan.  The hope is we try to
reach a consensus before or around the middle of this release (9.1).
Normally QEMU has a 3~4 months window for each release and 9.1 schedule is
not yet out, but I think it means we make a decision before or around
middle of June.

Thanks,

> 
> - Michael
> 
> On 4/11/24 11:36, Yu Zhang wrote:
> > > 1) Either a CI test covering at least the major RDMA paths, or at least
> > >  periodically tests for each QEMU release will be needed.
> > We use a batch of regression test cases for the stack, which covers the
> > test for QEMU. I did such test for most of the QEMU releases planned as
> > candidates for rollout.
> > 
> > The migration test needs a pair of (either physical or virtual) servers with
> > InfiniBand network, which makes it difficult to do on a single server. The
> > nested VM could be a possible approach, for which we may need virtual
> > InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you 
> > know.
> > 
> > [1]  
> > https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$
> > 
> > Thanks and best regards!
> > 
> > On Thu, Apr 11, 2024 at 4:20 PM Peter Xu  wrote:
> > > On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > > > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via 
> > > > wrote:
> > > > > 
> > > > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > > > > 
> > > > > > > Is there document/link about the unittest/CI for migration tests, 
> > > > > > > Why
> > > > > > > are those tests missing?
> > > > > > > Is it hard or very special to set up an environment for that? 
> > > > > > > maybe we
> > > > > > > can help in this regards.
> > > > > > See tests/qtest/migration-test.c.  We put most of our migration 
> > > > > > tests
> > > > > > there and that's covered in CI.
> > > > > > 
> > > > > > I think one major issue is CI systems don't normally have rdma 
> > > > > > devices.
> > > > > > Can rdma migration test be carried out without a real hardware?
> > > > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > > > then we can get a new RDMA interface "rxe_eth0".
> > > > > This new RDMA interface is able to do the QEMU RDMA migration.
> > > > > 
> > > > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > > > "rxe_lo", however when
> > > > > I tried(years ago) to do RDMA migration over this
> > > > > interface(rdma:127.0.0.1:) , it got something wrong.
> > > > > So i gave up enabling the RDMA migration qtest at that time.
> > > > Thanks, Zhijian.
> > > > 
> > > > I'm not sure adding an emu-link for rdma is doable for CI systems, 
> > > > though.
> > > > Maybe someone more familiar with how CI works can chime in.
> > > Some people got dropped on the cc list for unknown reason, I'm adding them
> > > back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> > > accident.
> > > 
> > > I'll try to summarize what is still missing, and I think these will be
> > > greatly helpful if we don't want to deprecate rdma migration:
> > > 
> > >1) Either a CI test covering at least the major RDMA paths, or at least
> > >   periodically tests for each QEMU release will be needed.
> > > 
> > >2) Some performance tests between modern RDMA and NIC devices are
> > >   welcomed.  The current knowledge is modern NIC can work similarly to
> > 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-29 Thread Michael Galaxy

Hi All (and Peter),

My name is Michael Galaxy (formerly Hines). Yes, I changed my last name 
(highly irregular for a male) and yes, that's my real last name: 
https://www.linkedin.com/in/mrgalaxy/


I'm the original author of the RDMA implementation. I've been discussing 
with Yu Zhang for a little bit about potentially handing over 
maintainership of the codebase to his team.


I simply have zero access to RoCE or InfiniBand hardware at all, 
unfortunately, so I've never been able to run tests or use what I wrote 
at work, and as all of you know, if you don't have a way to test 
something, then you can't maintain it.


Yu Zhang put a (very kind) proposal forward to me to ask the community 
if they feel comfortable training his team to maintain the codebase (and 
run tests) while they learn about it.


If you don't mind, I'd like to let him send over his (very detailed) 
proposal,


- Michael

On 4/11/24 11:36, Yu Zhang wrote:

1) Either a CI test covering at least the major RDMA paths, or at least
 periodically tests for each QEMU release will be needed.

We use a batch of regression test cases for the stack, which covers the
test for QEMU. I did such test for most of the QEMU releases planned as
candidates for rollout.

The migration test needs a pair of (either physical or virtual) servers with
InfiniBand network, which makes it difficult to do on a single server. The
nested VM could be a possible approach, for which we may need virtual
InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.

[1]  
https://urldefense.com/v3/__https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWvGHE3ig$

Thanks and best regards!

On Thu, Apr 11, 2024 at 4:20 PM Peter Xu  wrote:

On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:

On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:


on 4/10/2024 3:46 AM, Peter Xu wrote:


Is there document/link about the unittest/CI for migration tests, Why
are those tests missing?
Is it hard or very special to set up an environment for that? maybe we
can help in this regards.

See tests/qtest/migration-test.c.  We put most of our migration tests
there and that's covered in CI.

I think one major issue is CI systems don't normally have rdma devices.
Can rdma migration test be carried out without a real hardware?

Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
$ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
then we can get a new RDMA interface "rxe_eth0".
This new RDMA interface is able to do the QEMU RDMA migration.

Also, the loopback(lo) device is able to emulate the RDMA interface
"rxe_lo", however when
I tried(years ago) to do RDMA migration over this
interface(rdma:127.0.0.1:) , it got something wrong.
So i gave up enabling the RDMA migration qtest at that time.

Thanks, Zhijian.

I'm not sure adding an emu-link for rdma is doable for CI systems, though.
Maybe someone more familiar with how CI works can chime in.

Some people got dropped on the cc list for unknown reason, I'm adding them
back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
accident.

I'll try to summarize what is still missing, and I think these will be
greatly helpful if we don't want to deprecate rdma migration:

   1) Either a CI test covering at least the major RDMA paths, or at least
  periodically tests for each QEMU release will be needed.

   2) Some performance tests between modern RDMA and NIC devices are
  welcomed.  The current knowledge is modern NIC can work similarly to
  RDMA in performance, then it's debatable why we still maintain so much
  rdma specific code.

   3) No need to be solid patchsets for this one, but some plan to improve
  RDMA migration code so that it is not almost isolated from the rest
  protocols.

   4) Someone to look after this code for real.

For 2) and 3) more info is here:

https://urldefense.com/v3/__https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n__;!!GjvTz_vk!VEqNfg3Kdf58Oh1FkGL6ErDLfvUXZXPwMTaXizuIQeIgJiywPzuwbqx8wM0KUsyopw_EYQxWpIWYBhQ$

Here 4) can be the most important as Markus pointed out.  We just didn't
get there yet on the discussions, but maybe Markus is right that we should
talk that first.

Thanks,

--
Peter Xu





Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-12 Thread Peter Xu
Yu,

On Thu, Apr 11, 2024 at 06:36:54PM +0200, Yu Zhang wrote:
> > 1) Either a CI test covering at least the major RDMA paths, or at least
> > periodically tests for each QEMU release will be needed.
> We use a batch of regression test cases for the stack, which covers the
> test for QEMU. I did such test for most of the QEMU releases planned as
> candidates for rollout.

The least I can think of is a few tests per release.  Definitely too
few if a single release can already break it.

> 
> The migration test needs a pair of (either physical or virtual) servers with
> InfiniBand network, which makes it difficult to do on a single server. The
> nested VM could be a possible approach, for which we may need virtual
> InfiniBand network. Is SoftRoCE [1] a choice? I will try it and let you know.
> 
> [1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce

Does it require a kernel driver?  The fewer host kernel / hardware
dependencies, the better.

I am wondering whether there can be a library doing everything in
userspace, translating RDMA into e.g. socket messages (so maybe ultimately
that's something like IP->rdma->IP, just to cover the "rdma" procedures);
that would then work reliably for CI.

Please also see my full list, though, especially entry 4).  Thanks already
for looking for solutions on the tests, but I don't want to waste your time
only to find that tests are not enough even when ready.  I think we need
people who understand this stuff well enough, have dedicated time, and will
look after it.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-11 Thread Yu Zhang
> 1) Either a CI test covering at least the major RDMA paths, or at least
> periodically tests for each QEMU release will be needed.
We use a batch of regression test cases for our stack, which covers
testing of QEMU. I ran such tests for most of the QEMU releases planned as
candidates for rollout.

The migration test needs a pair of (either physical or virtual) servers with
an InfiniBand network, which makes it difficult to do on a single server.
Nested VMs could be a possible approach, for which we may need a virtual
InfiniBand network. Is SoftRoCE [1] an option? I will try it and let you know.

[1]  https://enterprise-support.nvidia.com/s/article/howto-configure-soft-roce

Thanks and best regards!

On Thu, Apr 11, 2024 at 4:20 PM Peter Xu  wrote:
>
> On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> > On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:
> > >
> > >
> > > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > >
> > > >> Is there document/link about the unittest/CI for migration tests, Why
> > > >> are those tests missing?
> > > >> Is it hard or very special to set up an environment for that? maybe we
> > > >> can help in this regards.
> > > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > > there and that's covered in CI.
> > > >
> > > > I think one major issue is CI systems don't normally have rdma devices.
> > > > Can rdma migration test be carried out without a real hardware?
> > >
> > > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > > then we can get a new RDMA interface "rxe_eth0".
> > > This new RDMA interface is able to do the QEMU RDMA migration.
> > >
> > > Also, the loopback(lo) device is able to emulate the RDMA interface
> > > "rxe_lo", however when
> > > I tried(years ago) to do RDMA migration over this
> > > interface(rdma:127.0.0.1:) , it got something wrong.
> > > So i gave up enabling the RDMA migration qtest at that time.
> >
> > Thanks, Zhijian.
> >
> > I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> > Maybe someone more familiar with how CI works can chime in.
>
> Some people got dropped on the cc list for unknown reason, I'm adding them
> back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
> accident.
>
> I'll try to summarize what is still missing, and I think these will be
> greatly helpful if we don't want to deprecate rdma migration:
>
>   1) Either a CI test covering at least the major RDMA paths, or at least
>  periodically tests for each QEMU release will be needed.
>
>   2) Some performance tests between modern RDMA and NIC devices are
>  welcomed.  The current knowledge is modern NIC can work similarly to
>  RDMA in performance, then it's debatable why we still maintain so much
>  rdma specific code.
>
>   3) No need to be solid patchsets for this one, but some plan to improve
>  RDMA migration code so that it is not almost isolated from the rest
>  protocols.
>
>   4) Someone to look after this code for real.
>
> For 2) and 3) more info is here:
>
> https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n
>
> Here 4) can be the most important as Markus pointed out.  We just didn't
> get there yet on the discussions, but maybe Markus is right that we should
> talk that first.
>
> Thanks,
>
> --
> Peter Xu
>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-11 Thread Jinpu Wang
Hi Peter,

On Tue, Apr 9, 2024 at 9:47 PM Peter Xu  wrote:
>
> On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> > Hi Peter,
> >
> > On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
> > >
> > > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > > Hi Peter,
> > >
> > > Jinpu,
> > >
> > > Thanks for joining the discussion.
> > >
> > > >
> > > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > > > >
> > > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > > Hello Peter und Zhjian,
> > > > > >
> > > > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > > > surprised at
> > > > > > the plan for deprecating the RDMA migration subsystem.
> > > > >
> > > > > It's not too late, since it looks like we do have users not yet 
> > > > > notified
> > > > > from this, we'll redo the deprecation procedure even if it'll be the 
> > > > > final
> > > > > plan, and it'll be 2 releases after this.
> > > > >
> > > > > >
> > > > > > > IMHO it's more important to know whether there are still users 
> > > > > > > and whether
> > > > > > > they would still like to see it around.
> > > > > >
> > > > > > > I admit RDMA migration was lack of testing(unit/CI test), which 
> > > > > > > led to the a few
> > > > > > > obvious bugs being noticed too late.
> > > > > >
> > > > > > Yes, we are a user of this subsystem. I was unaware of the lack of 
> > > > > > test coverage
> > > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > > migration test
> > > > > > cases failed and came to realize that there might be a bug between 
> > > > > > 8.1
> > > > > > and 8.2, but
> > > > > > was unable to confirm and report it quickly to you.
> > > > > >
> > > > > > The maintenance of this part could be too costly or difficult from
> > > > > > your point of view.
> > > > >
> > > > > It may or may not be too costly, it's just that we need real users of 
> > > > > RDMA
> > > > > taking some care of it.  Having it broken easily for >1 releases 
> > > > > definitely
> > > > > is a sign of lack of users.  It is an implication to the community 
> > > > > that we
> > > > > should consider dropping some features so that we can get the best 
> > > > > use of
> > > > > the community resources for the things that may have a broader 
> > > > > audience.
> > > > >
> > > > > One thing majorly missing is a RDMA tester to guard all the merges to 
> > > > > not
> > > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA 
> > > > > hardwares
> > > > > but just to sanity check the migration+rdma code running all fine.  
> > > > > RDMA
> > > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > > features that will be merged at least for migration subsystem, so 
> > > > > that we
> > > > > plan to not merge anything that is not covered by CI unless extremely
> > > > > necessary in the future.
> > > > >
> > > > > For sure CI is not the only missing part, but I'd say we should start 
> > > > > with
> > > > > it, then someone should also take care of the code even if only in
> > > > > maintenance mode (no new feature to add on top).
> > > > >
> > > > > >
> > > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > > many) like us
> > > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > > version of QEMU,
> > > > > > or to abandon the currently used RDMA migration.
> > > > >
> > > > > RDMA doesn't get new features anyway, if there's specific use case 
> > > > > for RDMA
> > > > > migrations, would it work if such a scenario uses the old binary?  Is 
> > > > > it
> > > > > possible to switch to the TCP protocol with some good NICs?
> > > > We have used rdma migration with HCA from Nvidia for years, our
> > > > experience is RDMA migration works better than tcp (over ipoib).
> > >
> > > Please bear with me, as I know little on rdma stuff.
> > >
> > > I'm actually pretty confused (and since a long time ago..) on why we need
> > > to operate with rdma contexts when ipoib seems to provide all the tcp
> > > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > > even if there's rdma/ib hardwares underneath?  Is it because of 
> > > performance
> > > improvements so that we must use a separate path comparing to generic
> > > "tcp:" protocol here?
> > using rdma protocol with ib verbs , we can leverage the full benefit of 
> > RDMA by
> > talking directly to NIC which bypasses the kernel overhead, less cpu
> > utilization and better performance.
> >
> > While IPoIB is more for compatibility to  applications using tcp, but
> > can't get full benefit of RDMA.  When you have mix generation of IB
> > devices, there are performance issue on IPoIB, we've seen 40G HCA can
> > only reach 2 Gb/s on IPoIB, but with raw RDMA can reach full line
> > speed.
> >
> > I just run a simple iperf3 test via ipoib and ib_send_bw on same hosts:
> >
> > iperf 3.9
> > 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-11 Thread Peter Xu
On Wed, Apr 10, 2024 at 09:49:15AM -0400, Peter Xu wrote:
> On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:
> > 
> > 
> > on 4/10/2024 3:46 AM, Peter Xu wrote:
> > 
> > >> Is there document/link about the unittest/CI for migration tests, Why
> > >> are those tests missing?
> > >> Is it hard or very special to set up an environment for that? maybe we
> > >> can help in this regards.
> > > See tests/qtest/migration-test.c.  We put most of our migration tests
> > > there and that's covered in CI.
> > >
> > > I think one major issue is CI systems don't normally have rdma devices.
> > > Can rdma migration test be carried out without a real hardware?
> > 
> > Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> > $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> > then we can get a new RDMA interface "rxe_eth0".
> > This new RDMA interface is able to do the QEMU RDMA migration.
> > 
> > Also, the loopback(lo) device is able to emulate the RDMA interface 
> > "rxe_lo", however when
> > I tried(years ago) to do RDMA migration over this 
> > interface(rdma:127.0.0.1:) , it got something wrong.
> > So i gave up enabling the RDMA migration qtest at that time.
> 
> Thanks, Zhijian.
> 
> I'm not sure adding an emu-link for rdma is doable for CI systems, though.
> Maybe someone more familiar with how CI works can chime in.

Some people got dropped on the cc list for unknown reason, I'm adding them
back (Fabiano, Peter Maydell, Phil).  Let's make sure nobody is dropped by
accident.

I'll try to summarize what is still missing, and I think these will be
greatly helpful if we don't want to deprecate rdma migration:

  1) Either a CI test covering at least the major RDMA paths, or at least
     periodic tests for each QEMU release, will be needed.

  2) Some performance tests between modern RDMA and NIC devices are
     welcomed.  The current understanding is that modern NICs can perform
     similarly to RDMA, in which case it's debatable why we still maintain
     so much RDMA-specific code.

  3) These don't need to be solid patchsets yet, but some plan to improve
     the RDMA migration code so that it is not almost isolated from the
     rest of the protocols.

  4) Someone to look after this code for real.

For 2) and 3) more info is here:

https://lore.kernel.org/r/ZhWa0YeAb9ySVKD1@x1n

Here 4) can be the most important, as Markus pointed out.  We just didn't
get there yet in the discussion, but maybe Markus is right that we should
talk about that first.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-10 Thread Peter Xu
On Wed, Apr 10, 2024 at 02:28:59AM +, Zhijian Li (Fujitsu) via wrote:
> 
> 
> on 4/10/2024 3:46 AM, Peter Xu wrote:
> 
> >> Is there document/link about the unittest/CI for migration tests, Why
> >> are those tests missing?
> >> Is it hard or very special to set up an environment for that? maybe we
> >> can help in this regards.
> > See tests/qtest/migration-test.c.  We put most of our migration tests
> > there and that's covered in CI.
> >
> > I think one major issue is CI systems don't normally have rdma devices.
> > Can rdma migration test be carried out without a real hardware?
> 
> Yeah,  RXE aka. SOFT-RoCE is able to emulate the RDMA, for example
> $ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
> then we can get a new RDMA interface "rxe_eth0".
> This new RDMA interface is able to do the QEMU RDMA migration.
> 
> Also, the loopback(lo) device is able to emulate the RDMA interface 
> "rxe_lo", however when
> I tried(years ago) to do RDMA migration over this 
> interface(rdma:127.0.0.1:) , it got something wrong.
> So i gave up enabling the RDMA migration qtest at that time.

Thanks, Zhijian.

I'm not sure adding an emu-link for rdma is doable for CI systems, though.
Maybe someone more familiar with how CI works can chime in.

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Zhijian Li (Fujitsu)


on 4/10/2024 3:46 AM, Peter Xu wrote:

>> Is there document/link about the unittest/CI for migration tests, Why
>> are those tests missing?
>> Is it hard or very special to set up an environment for that? maybe we
>> can help in this regards.
> See tests/qtest/migration-test.c.  We put most of our migration tests
> there and that's covered in CI.
>
> I think one major issue is CI systems don't normally have rdma devices.
> Can rdma migration test be carried out without a real hardware?

Yeah, RXE (aka SoftRoCE) is able to emulate RDMA, for example:
$ sudo rdma link add rxe_eth0 type rxe netdev eth0  # on host
Then we get a new RDMA interface "rxe_eth0", and this new RDMA interface
is able to do the QEMU RDMA migration.

Also, the loopback (lo) device is able to emulate an RDMA interface
"rxe_lo"; however, when I tried (years ago) to do RDMA migration over this
interface (rdma:127.0.0.1:), something went wrong.
So I gave up enabling the RDMA migration qtest at that time.
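
For completeness, a minimal manual smoke test over SoftRoCE could look
roughly like the following; the interface name, addresses and port are
assumptions, and the QEMU command lines are trimmed to the essentials:

  # On both hosts: create a SoftRoCE (rxe) link on top of a normal Ethernet NIC.
  $ sudo modprobe rdma_rxe
  $ sudo rdma link add rxe_eth0 type rxe netdev eth0
  $ rdma link show                     # rxe_eth0 should now be listed

  # Destination host (e.g. 192.168.1.20): wait for an incoming RDMA migration.
  $ qemu-system-x86_64 -M q35 -m 2G -nographic -incoming rdma:192.168.1.20:4444

  # Source host: run the guest with an HMP monitor and trigger the migration.
  $ qemu-system-x86_64 -M q35 -m 2G -nographic -monitor stdio
  (qemu) migrate -d rdma:192.168.1.20:4444
  (qemu) info migrate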



Thanks
Zhijian



 

>
>>> It seems there can still be people joining this discussion.  I'll hold off
>>> a bit on merging this patch to provide enough window for anyone to chim in.
>> Thx for discussion and understanding.
> Thanks for all these inputs so far.  These can help us make a wiser and
> clearer step no matter which way we choose.


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Peter Xu
On Tue, Apr 09, 2024 at 09:32:46AM +0200, Jinpu Wang wrote:
> Hi Peter,
> 
> On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
> >
> > On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > > Hi Peter,
> >
> > Jinpu,
> >
> > Thanks for joining the discussion.
> >
> > >
> > > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > > >
> > > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > > Hello Peter und Zhjian,
> > > > >
> > > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > > surprised at
> > > > > the plan for deprecating the RDMA migration subsystem.
> > > >
> > > > It's not too late, since it looks like we do have users not yet notified
> > > > from this, we'll redo the deprecation procedure even if it'll be the 
> > > > final
> > > > plan, and it'll be 2 releases after this.
> > > >
> > > > >
> > > > > > IMHO it's more important to know whether there are still users and 
> > > > > > whether
> > > > > > they would still like to see it around.
> > > > >
> > > > > > I admit RDMA migration was lack of testing(unit/CI test), which led 
> > > > > > to the a few
> > > > > > obvious bugs being noticed too late.
> > > > >
> > > > > Yes, we are a user of this subsystem. I was unaware of the lack of 
> > > > > test coverage
> > > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > > migration test
> > > > > cases failed and came to realize that there might be a bug between 8.1
> > > > > and 8.2, but
> > > > > was unable to confirm and report it quickly to you.
> > > > >
> > > > > The maintenance of this part could be too costly or difficult from
> > > > > your point of view.
> > > >
> > > > It may or may not be too costly, it's just that we need real users of 
> > > > RDMA
> > > > taking some care of it.  Having it broken easily for >1 releases 
> > > > definitely
> > > > is a sign of lack of users.  It is an implication to the community that 
> > > > we
> > > > should consider dropping some features so that we can get the best use 
> > > > of
> > > > the community resources for the things that may have a broader audience.
> > > >
> > > > One thing majorly missing is a RDMA tester to guard all the merges to 
> > > > not
> > > > break RDMA paths, hopefully in CI.  That should not rely on RDMA 
> > > > hardwares
> > > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > > taught us the lesson so we're requesting CI coverage for all other new
> > > > features that will be merged at least for migration subsystem, so that 
> > > > we
> > > > plan to not merge anything that is not covered by CI unless extremely
> > > > necessary in the future.
> > > >
> > > > For sure CI is not the only missing part, but I'd say we should start 
> > > > with
> > > > it, then someone should also take care of the code even if only in
> > > > maintenance mode (no new feature to add on top).
> > > >
> > > > >
> > > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > > many) like us
> > > > > either to stick to the RDMA migration by using an increasingly older
> > > > > version of QEMU,
> > > > > or to abandon the currently used RDMA migration.
> > > >
> > > > RDMA doesn't get new features anyway, if there's specific use case for 
> > > > RDMA
> > > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > > possible to switch to the TCP protocol with some good NICs?
> > > We have used rdma migration with HCA from Nvidia for years, our
> > > experience is RDMA migration works better than tcp (over ipoib).
> >
> > Please bear with me, as I know little on rdma stuff.
> >
> > I'm actually pretty confused (and since a long time ago..) on why we need
> > to operate with rdma contexts when ipoib seems to provide all the tcp
> > layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> > even if there's rdma/ib hardwares underneath?  Is it because of performance
> > improvements so that we must use a separate path comparing to generic
> > "tcp:" protocol here?
> using rdma protocol with ib verbs , we can leverage the full benefit of RDMA 
> by
> talking directly to NIC which bypasses the kernel overhead, less cpu
> utilization and better performance.
> 
> While IPoIB is more for compatibility to  applications using tcp, but
> can't get full benefit of RDMA.  When you have mix generation of IB
> devices, there are performance issue on IPoIB, we've seen 40G HCA can
> only reach 2 Gb/s on IPoIB, but with raw RDMA can reach full line
> speed.
> 
> I just run a simple iperf3 test via ipoib and ib_send_bw on same hosts:
> 
> iperf 3.9
> Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
> 07:19:34 UTC 2024 x86_64
> ---
> Server listening on 5201
> ---
> Time: Tue, 09 Apr 2024 06:55:02 GMT
> Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
> 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Markus Armbruster
Peter Xu  writes:

> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
>> Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
>> 
>> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
>> >
>> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
>> > > Hello Peter und Zhjian,
>> > >
>> > > Thank you so much for letting me know about this. I'm also a bit 
>> > > surprised at
>> > > the plan for deprecating the RDMA migration subsystem.
>> >
>> > It's not too late, since it looks like we do have users not yet notified
>> > from this, we'll redo the deprecation procedure even if it'll be the final
>> > plan, and it'll be 2 releases after this.

[...]

>> > Per our best knowledge, RDMA users are rare, and please let anyone know if
>> > you are aware of such users.  IIUC the major reason why RDMA stopped being
>> > the trend is because the network is not like ten years ago; I don't think I
>> > have good knowledge in RDMA at all nor network, but my understanding is
>> > it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
>> > little sense to maintain multiple protocols, considering RDMA migration
>> > code is so special so that it has the most custom code comparing to other
>> > protocols.
>> +cc some guys from Huawei.
>> 
>> I'm surprised RDMA users are rare,  I guess maybe many are just
>> working with different code base.
>
> Yes, please cc whoever might be interested (or surprised.. :) to know this,
> and let's be open to all possibilities.
>
> I don't think it makes sense if there're a lot of users of a feature then
> we deprecate that without a good reason.  However there's always the
> resource limitation issue we're facing, so it could still have the
> possibility that this gets deprecated if nobody is working on our upstream
> branch. Say, if people use private branches anyway to support rdma without
> collaborating upstream, keeping such feature upstream then may not make
> much sense either, unless there's some way to collaborate.  We'll see.
>
> It seems there can still be people joining this discussion.  I'll hold off
> a bit on merging this patch to provide enough window for anyone to chim in.

Users are not enough.  Only maintainers are.

At some point, people cared enough about RDMA in QEMU to contribute the
code.  That's why we have the code.

To keep the code, we need people who care enough about RDMA in QEMU to
maintain it.  Without such people, the case for keeping it remains
dangerously weak, and no amount of talk or even benchmarks can change
that.




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-09 Thread Jinpu Wang
Hi Peter,

On Mon, Apr 8, 2024 at 6:18 PM Peter Xu  wrote:
>
> On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> > Hi Peter,
>
> Jinpu,
>
> Thanks for joining the discussion.
>
> >
> > On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> > >
> > > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > > Hello Peter und Zhjian,
> > > >
> > > > Thank you so much for letting me know about this. I'm also a bit 
> > > > surprised at
> > > > the plan for deprecating the RDMA migration subsystem.
> > >
> > > It's not too late, since it looks like we do have users not yet notified
> > > from this, we'll redo the deprecation procedure even if it'll be the final
> > > plan, and it'll be 2 releases after this.
> > >
> > > >
> > > > > IMHO it's more important to know whether there are still users and 
> > > > > whether
> > > > > they would still like to see it around.
> > > >
> > > > > I admit RDMA migration was lack of testing(unit/CI test), which led 
> > > > > to the a few
> > > > > obvious bugs being noticed too late.
> > > >
> > > > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > > > coverage
> > > > for this part. As soon as 8.2 was released, I saw that many of the
> > > > migration test
> > > > cases failed and came to realize that there might be a bug between 8.1
> > > > and 8.2, but
> > > > was unable to confirm and report it quickly to you.
> > > >
> > > > The maintenance of this part could be too costly or difficult from
> > > > your point of view.
> > >
> > > It may or may not be too costly, it's just that we need real users of RDMA
> > > taking some care of it.  Having it broken easily for >1 releases 
> > > definitely
> > > is a sign of lack of users.  It is an implication to the community that we
> > > should consider dropping some features so that we can get the best use of
> > > the community resources for the things that may have a broader audience.
> > >
> > > One thing majorly missing is a RDMA tester to guard all the merges to not
> > > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > > but just to sanity check the migration+rdma code running all fine.  RDMA
> > > taught us the lesson so we're requesting CI coverage for all other new
> > > features that will be merged at least for migration subsystem, so that we
> > > plan to not merge anything that is not covered by CI unless extremely
> > > necessary in the future.
> > >
> > > For sure CI is not the only missing part, but I'd say we should start with
> > > it, then someone should also take care of the code even if only in
> > > maintenance mode (no new feature to add on top).
> > >
> > > >
> > > > My concern is, this plan will forces a few QEMU users (not sure how
> > > > many) like us
> > > > either to stick to the RDMA migration by using an increasingly older
> > > > version of QEMU,
> > > > or to abandon the currently used RDMA migration.
> > >
> > > RDMA doesn't get new features anyway, if there's specific use case for 
> > > RDMA
> > > migrations, would it work if such a scenario uses the old binary?  Is it
> > > possible to switch to the TCP protocol with some good NICs?
> > We have used rdma migration with HCA from Nvidia for years, our
> > experience is RDMA migration works better than tcp (over ipoib).
>
> Please bear with me, as I know little on rdma stuff.
>
> I'm actually pretty confused (and since a long time ago..) on why we need
> to operate with rdma contexts when ipoib seems to provide all the tcp
> layers.  I meant, can it work with the current "tcp:" protocol with ipoib
> even if there's rdma/ib hardwares underneath?  Is it because of performance
> improvements so that we must use a separate path comparing to generic
> "tcp:" protocol here?
Using the rdma protocol with ib verbs, we can leverage the full benefit of
RDMA by talking directly to the NIC, which bypasses the kernel overhead and
gives lower CPU utilization and better performance.

IPoIB, on the other hand, is more for compatibility with applications using
tcp, and can't get the full benefit of RDMA.  When you have mixed
generations of IB devices, there are performance issues on IPoIB: we've
seen a 40G HCA reach only 2 Gb/s over IPoIB, while raw RDMA can reach full
line speed.

I just ran a simple iperf3 test via ipoib and ib_send_bw on the same hosts:

iperf 3.9
Linux ps404a-3 5.15.137-pserver #5.15.137-6~deb11 SMP Thu Jan 4
07:19:34 UTC 2024 x86_64
---
Server listening on 5201
---
Time: Tue, 09 Apr 2024 06:55:02 GMT
Accepted connection from 2a02:247f:401:4:2:0:b:3, port 41130
  Cookie: cer2hexlldrowclq6izh7gbg5toviffqbcwt
  TCP MSS: 0 (default)
[  5] local 2a02:247f:401:4:2:0:a:3 port 5201 connected to
2a02:247f:401:4:2:0:b:3 port 41136
Starting Test: protocol: TCP, 1 streams, 131072 byte blocks, omitting
0 seconds, 10 second test, tos 0
[ ID] Interval   Transfer Bitrate
[  5]   0.00-1.00   sec  
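
For reference, the commands behind this kind of comparison are roughly the
following; the addresses and the IB device name are assumptions:

  # TCP over IPoIB: plain iperf3 between the two hosts' IPoIB addresses.
  server$ iperf3 -s
  client$ iperf3 -c <server-ipoib-address> -t 10

  # Raw RDMA: perftest's ib_send_bw between the same pair of hosts.
  server$ ib_send_bw -d mlx5_0 --report_gbits
  client$ ib_send_bw -d mlx5_0 --report_gbits <server-address>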

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-08 Thread Peter Xu
On Mon, Apr 08, 2024 at 04:07:20PM +0200, Jinpu Wang wrote:
> Hi Peter,

Jinpu,

Thanks for joining the discussion.

> 
> On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
> >
> > On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > > Hello Peter und Zhjian,
> > >
> > > Thank you so much for letting me know about this. I'm also a bit 
> > > surprised at
> > > the plan for deprecating the RDMA migration subsystem.
> >
> > It's not too late, since it looks like we do have users not yet notified
> > from this, we'll redo the deprecation procedure even if it'll be the final
> > plan, and it'll be 2 releases after this.
> >
> > >
> > > > IMHO it's more important to know whether there are still users and 
> > > > whether
> > > > they would still like to see it around.
> > >
> > > > I admit RDMA migration was lack of testing(unit/CI test), which led to 
> > > > the a few
> > > > obvious bugs being noticed too late.
> > >
> > > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > > coverage
> > > for this part. As soon as 8.2 was released, I saw that many of the
> > > migration test
> > > cases failed and came to realize that there might be a bug between 8.1
> > > and 8.2, but
> > > was unable to confirm and report it quickly to you.
> > >
> > > The maintenance of this part could be too costly or difficult from
> > > your point of view.
> >
> > It may or may not be too costly, it's just that we need real users of RDMA
> > taking some care of it.  Having it broken easily for >1 releases definitely
> > is a sign of lack of users.  It is an implication to the community that we
> > should consider dropping some features so that we can get the best use of
> > the community resources for the things that may have a broader audience.
> >
> > One thing majorly missing is a RDMA tester to guard all the merges to not
> > break RDMA paths, hopefully in CI.  That should not rely on RDMA hardwares
> > but just to sanity check the migration+rdma code running all fine.  RDMA
> > taught us the lesson so we're requesting CI coverage for all other new
> > features that will be merged at least for migration subsystem, so that we
> > plan to not merge anything that is not covered by CI unless extremely
> > necessary in the future.
> >
> > For sure CI is not the only missing part, but I'd say we should start with
> > it, then someone should also take care of the code even if only in
> > maintenance mode (no new feature to add on top).
> >
> > >
> > > My concern is, this plan will forces a few QEMU users (not sure how
> > > many) like us
> > > either to stick to the RDMA migration by using an increasingly older
> > > version of QEMU,
> > > or to abandon the currently used RDMA migration.
> >
> > RDMA doesn't get new features anyway, if there's specific use case for RDMA
> > migrations, would it work if such a scenario uses the old binary?  Is it
> > possible to switch to the TCP protocol with some good NICs?
> We have used rdma migration with HCA from Nvidia for years, our
> experience is RDMA migration works better than tcp (over ipoib).

Please bear with me, as I know little about rdma stuff.

I'm actually pretty confused (and have been since a long time ago..) about
why we need to operate with rdma contexts when ipoib seems to provide all
the tcp layers.  I mean, can it work with the current "tcp:" protocol over
ipoib even if there's rdma/ib hardware underneath?  Is it because of
performance improvements that we must use a separate path compared to the
generic "tcp:" protocol here?

> 
> Switching back to TCP will lead us to the old problems which was
> solved by RDMA migration.

Can you elaborate on the problems, and why tcp won't work in this case?
They may not be directly relevant to the issue we're discussing, but I'm
happy to learn more.

What NICs were you testing with before?  Was the test carried out with
modern ones (50Gbps-200Gbps NICs), or was it done when such hardware was
not yet common?

Per my recent knowledge of the new Intel hardware, at least the parts that
support QPL, it's easy to achieve 50Gbps+ on a single core.

https://lore.kernel.org/r/ph7pr11mb5941a91ac1e514bcc32896a6a3...@ph7pr11mb5941.namprd11.prod.outlook.com

Quote from Yuan:

  Yes, I use iperf3 to check the bandwidth for one core, the bandwidth is 60Gbps.
  [ ID] Interval   Transfer Bitrate Retr  Cwnd
  [  5]   0.00-1.00   sec  7.00 GBytes  60.1 Gbits/sec0   2.87 MBytes
  [  5]   1.00-2.00   sec  7.05 GBytes  60.6 Gbits/sec0   2.87 Mbytes

  And in the live migration test, a multifd thread's CPU utilization is almost
  100%

It boils down to what old problems were there with tcp first, though.
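
For reference, the multifd setup behind numbers like the above is driven
purely from the monitor.  A minimal sketch, assuming HMP, a destination
already listening on a hypothetical tcp:10.0.0.2:4444, and an illustrative
channel count (the compression parameter is optional; zlib/zstd are the
in-tree choices, while QPL is what the linked series is about):

  # run on both source and destination monitors before migrating
  (qemu) migrate_set_capability multifd on
  (qemu) migrate_set_parameter multifd-channels 8
  # optional: compress the multifd streams
  (qemu) migrate_set_parameter multifd-compression zstd

  # then start it from the source
  (qemu) migrate -d tcp:10.0.0.2:4444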

> 
> >
> > Per our best knowledge, RDMA users are rare, and please let anyone know if
> > you are aware of such users.  IIUC the major reason why RDMA stopped being
> > the trend is because the network is not like ten years ago; I don't think I
> > have good knowledge in RDMA at all nor network, but my 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-08 Thread Jinpu Wang
Hi Peter,

On Tue, Apr 2, 2024 at 11:24 PM Peter Xu  wrote:
>
> On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> > Hello Peter and Zhijian,
> >
> > Thank you so much for letting me know about this. I'm also a bit surprised 
> > at
> > the plan for deprecating the RDMA migration subsystem.
>
> It's not too late, since it looks like we do have users not yet notified
> of this, we'll redo the deprecation procedure even if it'll be the final
> plan, and it'll be 2 releases after this.
>
> >
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around.
> >
> > > I admit RDMA migration lacked testing (unit/CI tests), which led to a few
> > > obvious bugs being noticed too late.
> >
> > Yes, we are a user of this subsystem. I was unaware of the lack of test 
> > coverage
> > for this part. As soon as 8.2 was released, I saw that many of the
> > migration test
> > cases failed and came to realize that there might be a bug between 8.1
> > and 8.2, but
> > was unable to confirm and report it quickly to you.
> >
> > The maintenance of this part could be too costly or difficult from
> > your point of view.
>
> It may or may not be too costly, it's just that we need real users of RDMA
> taking some care of it.  Having it broken easily for >1 releases definitely
> is a sign of lack of users.  It is an implication to the community that we
> should consider dropping some features so that we can get the best use of
> the community resources for the things that may have a broader audience.
>
> One thing majorly missing is an RDMA tester to guard all merges against
> breaking the RDMA paths, hopefully in CI.  That should not rely on RDMA
> hardware but just sanity-check that the migration+rdma code runs fine.  RDMA
> taught us the lesson, so we're requesting CI coverage for all other new
> features that will be merged, at least for the migration subsystem, so that
> we plan not to merge anything that is not covered by CI unless extremely
> necessary in the future.
>
> For sure CI is not the only missing part, but I'd say we should start with
> it, then someone should also take care of the code even if only in
> maintenance mode (no new feature to add on top).
>
> >
> > My concern is, this plan will force a few QEMU users (not sure how many)
> > like us either to stick with RDMA migration by using an increasingly older
> > version of QEMU, or to abandon the currently used RDMA migration.
>
> RDMA doesn't get new features anyway; if there's a specific use case for
> RDMA migration, would it work for such a scenario to use the old binary?
> Is it possible to switch to the TCP protocol with some good NICs?
We have used rdma migration with HCAs from Nvidia for years; our
experience is that RDMA migration works better than tcp (over ipoib).

Switching back to TCP will lead us to the old problems which were
solved by RDMA migration.

>
> Per our best knowledge, RDMA users are rare, and please let anyone know if
> you are aware of such users.  IIUC the major reason why RDMA stopped being
> the trend is because the network is not like ten years ago; I don't think I
> have good knowledge in RDMA at all nor network, but my understanding is
> it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
> little sense to maintain multiple protocols, considering RDMA migration
> code is so special so that it has the most custom code comparing to other
> protocols.
+cc some guys from Huawei.

I'm surprised RDMA users are rare; I guess maybe many are just
working with a different code base.
>
> Thanks,
>
> --
> Peter Xu

Thx!
Jinpu Wang
>



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-02 Thread Peter Xu
On Mon, Apr 01, 2024 at 11:26:25PM +0200, Yu Zhang wrote:
> Hello Peter and Zhijian,
> 
> Thank you so much for letting me know about this. I'm also a bit surprised at
> the plan for deprecating the RDMA migration subsystem.

It's not too late, since it looks like we do have users not yet notified
of this, we'll redo the deprecation procedure even if it'll be the final
plan, and it'll be 2 releases after this.

> 
> > IMHO it's more important to know whether there are still users and whether
> > they would still like to see it around.
> 
> I admit RDMA migration lacked testing (unit/CI tests), which led to a few
> obvious bugs being noticed too late.
> 
> Yes, we are a user of this subsystem. I was unaware of the lack of test 
> coverage
> for this part. As soon as 8.2 was released, I saw that many of the
> migration test
> cases failed and came to realize that there might be a bug between 8.1
> and 8.2, but
> was unable to confirm and report it quickly to you.
> 
> The maintenance of this part could be too costly or difficult from
> your point of view.

It may or may not be too costly, it's just that we need real users of RDMA
taking some care of it.  Having it broken easily for >1 releases definitely
is a sign of lack of users.  It is an implication to the community that we
should consider dropping some features so that we can get the best use of
the community resources for the things that may have a broader audience.

One thing majorly missing is an RDMA tester to guard all merges against
breaking the RDMA paths, hopefully in CI.  That should not rely on RDMA
hardware but just sanity-check that the migration+rdma code runs fine.  RDMA
taught us the lesson, so we're requesting CI coverage for all other new
features that will be merged, at least for the migration subsystem, so that
we plan not to merge anything that is not covered by CI unless extremely
necessary in the future.

For sure CI is not the only missing part, but I'd say we should start with
it, then someone should also take care of the code even if only in
maintenance mode (no new feature to add on top).
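
As a rough illustration of the no-hardware point: a CI job could in principle
create a software-only RDMA device with the kernel's soft-RoCE driver
(rdma_rxe) and then exercise the existing "rdma:" migration URI end to end;
whether migration/rdma actually copes with rxe would itself be part of what
such a test verifies.  A minimal sketch, assuming root in the CI environment,
an interface named eth0, and hypothetical addresses/ports:

  # create a soft-RoCE device on top of a plain Ethernet interface
  modprobe rdma_rxe
  rdma link add rxe0 type rxe netdev eth0

  # destination
  qemu-system-x86_64 ... -incoming rdma:192.168.1.2:4444

  # source, from the HMP monitor
  (qemu) migrate -d rdma:192.168.1.2:4444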

> 
> My concern is, this plan will force a few QEMU users (not sure how many)
> like us either to stick with RDMA migration by using an increasingly older
> version of QEMU, or to abandon the currently used RDMA migration.

RDMA doesn't get new features anyway; if there's a specific use case for
RDMA migration, would it work for such a scenario to use the old binary?
Is it possible to switch to the TCP protocol with some good NICs?

Per our best knowledge, RDMA users are rare, and please let anyone know if
you are aware of such users.  IIUC the major reason why RDMA stopped being
the trend is because the network is not like ten years ago; I don't think I
have good knowledge in RDMA at all nor network, but my understanding is
it's pretty easy to fetch modern NIC to outperform RDMAs, then it may make
little sense to maintain multiple protocols, considering RDMA migration
code is so special so that it has the most custom code comparing to other
protocols.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-01 Thread Yu Zhang
Hello Peter and Zhijian,

Thank you so much for letting me know about this. I'm also a bit surprised at
the plan for deprecating the RDMA migration subsystem.

> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

> I admit RDMA migration lacked testing (unit/CI tests), which led to a few
> obvious bugs being noticed too late.

Yes, we are a user of this subsystem. I was unaware of the lack of test coverage
for this part. As soon as 8.2 was released, I saw that many of the
migration test
cases failed and came to realize that there might be a bug between 8.1
and 8.2, but
was unable to confirm and report it quickly to you.

The maintenance of this part could be too costly or difficult from
your point of view.

My concern is, this plan will force a few QEMU users (not sure how many)
like us either to stick with RDMA migration by using an increasingly older
version of QEMU, or to abandon the currently used RDMA migration.

Best regards,
Yu Zhang

On Mon, Apr 1, 2024 at 9:56 AM Zhijian Li (Fujitsu)
 wrote:
>
> Phil,
>
> on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
> >>
> >>
> >>> IMHO it's more important to know whether there are still users and
> >>> whether
> >>> they would still like to see it around.
> >>
> >> Agree.
> >> I didn't immediately express my opinion in V1 because I'm also
> >> consulting our
> >> customers for this feature in the future.
> >>
> >> Personally, I agree with Peter's idea that "I'd slightly prefer
> >> postponing it one
> >> more release which might help a bit of our downstream maintenance"
> >
> > Do you mind posting a deprecation patch to clarify the situation?
> >
>
> No problem, I just posted a deprecation patch; please take a look.
> https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhij...@fujitsu.com/T/#u
>
> Thanks
> Zhijian



Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-04-01 Thread Zhijian Li (Fujitsu)
Phil,

on 3/29/2024 6:28 PM, Philippe Mathieu-Daudé wrote:
>>
>>
>>> IMHO it's more important to know whether there are still users and 
>>> whether
>>> they would still like to see it around.
>>
>> Agree.
>> I didn't immediately express my opinion in V1 because I'm also 
>> consulting our
>> customers for this feature in the future.
>>
>> Personally, I agree with Peter's idea that "I'd slightly prefer
>> postponing it one
>> more release which might help a bit of our downstream maintenance"
>
> Do you mind posting a deprecation patch to clarify the situation?
>

No problem, I just posted a deprecation patch; please take a look.
https://lore.kernel.org/qemu-devel/20240401035947.3310834-1-lizhij...@fujitsu.com/T/#u

Thanks
Zhijian


Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-29 Thread Daniel P . Berrangé
On Fri, Mar 29, 2024 at 11:28:54AM +0100, Philippe Mathieu-Daudé wrote:
> Hi Zhijian,
> 
> On 29/3/24 02:53, Zhijian Li (Fujitsu) wrote:
> > 
> > 
> > On 28/03/2024 23:01, Peter Xu wrote:
> > > On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
> > > > Philippe Mathieu-Daudé  writes:
> > > > 
> > > > > The whole RDMA subsystem was deprecated in commit e9a54265f5
> > > > > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> > > > > released in v8.2.
> > > > > 
> > > > > Remove:
> > > > >- RDMA handling from migration
> > > > >- dependencies on libibumad, libibverbs and librdmacm
> > > > > 
> > > > > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> > > > > in old migration streams.
> > > > > 
> > > > > Cc: Peter Xu 
> > > > > Cc: Li Zhijian 
> > > > > Acked-by: Fabiano Rosas 
> > > > > Signed-off-by: Philippe Mathieu-Daudé 
> > > > 
> > > > Just to be clear, because people raised the point in the last version,
> > > > the first link in the deprecation commit links to a thread comprising
> > > > entirely of rdma migration patches. I don't see any ambiguity on whether
> > > > the deprecation was intended to include migration. There's even an ack
> > > > from Juan.
> > > 
> > > Yes I remember that's the plan.
> > > 
> > > > 
> > > > So on the basis of not reverting the previous maintainer's decision, my
> > > > Ack stands here.
> > > > 
> > > > We also had pretty obvious bugs ([1], [2]) in the past that would have
> > > > been caught if we had any kind of testing for the feature, so I can't
> > > > even say this thing works currently.
> > > > 
> > > > @Peter Xu, @Li Zhijian, what are your thoughts on this?
> > > 
> > > Generally I definitely agree with such a removal sooner or later, as 
> > > that's
> > > how deprecation works, and even after Juan's left I'm not aware of any
> > > other new RDMA users.  Personally, I'd slightly prefer postponing it one
> > > more release which might help a bit of our downstream maintenance, however
> > > I assume that's not a blocker either, as I think we can also manage it.
> > > 
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around. That's also one thing I notice 
> > > that
> > > e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
> > > if they're rare. According to [2] it could be that such user may only rely
> > > on the release versions of QEMU when it broke things.
> > > 
> > > So I'm copying Yu too (while Zhijian is already in the loop), just in case
> > > someone would like to stand up and speak.
> > 
> > 
> > I admit RDMA migration lacked testing (unit/CI tests), which led to a few
> > obvious bugs being noticed too late.
> > However I was a bit surprised when I saw the removal of the RDMA migration.
> > I wasn't aware that this feature has not been marked as deprecated (at
> > least there is no prompt to the end user).
> > 
> > 
> > > IMHO it's more important to know whether there are still users and whether
> > > they would still like to see it around.
> > 
> > Agree.
> > I didn't immediately express my opinion in V1 because I'm also consulting 
> > our
> > customers for this feature in the future.
> > 
> > Personally, I agree with Peter's idea that "I'd slightly prefer postponing
> > it one
> > more release which might help a bit of our downstream maintenance"
> 
> Do you mind posting a deprecation patch to clarify the situation?

The key thing the first deprecation patch missed was that it failed
to issue a warning message when RDMA migration was actually used.
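
For illustration only, a minimal sketch of what such a runtime warning could
look like; the helper name and call sites are assumptions, not the actual
patch, but warn_report() from "qemu/error-report.h" is the usual vehicle for
this kind of message:

  /* Hypothetical sketch: warn the moment an "rdma:" migration URI is
   * actually used, so existing users notice before the code goes away. */
  #include "qemu/osdep.h"
  #include "qemu/error-report.h"

  static void rdma_migration_warn_deprecated(void)
  {
      warn_report("RDMA migration support is deprecated and will be "
                  "removed in a future release");
  }

Something along those lines, called from the incoming and outgoing rdma setup
paths in migration/rdma.c, would have made the 8.2 deprecation visible to the
users discussed above.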

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-29 Thread Philippe Mathieu-Daudé

Hi Zhijian,

On 29/3/24 02:53, Zhijian Li (Fujitsu) wrote:



On 28/03/2024 23:01, Peter Xu wrote:

On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:

Philippe Mathieu-Daudé  writes:


The whole RDMA subsystem was deprecated in commit e9a54265f5
("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
released in v8.2.

Remove:
   - RDMA handling from migration
   - dependencies on libibumad, libibverbs and librdmacm

Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
in old migration streams.

Cc: Peter Xu 
Cc: Li Zhijian 
Acked-by: Fabiano Rosas 
Signed-off-by: Philippe Mathieu-Daudé 


Just to be clear, because people raised the point in the last version,
the first link in the deprecation commit links to a thread comprising
entirely of rdma migration patches. I don't see any ambiguity on whether
the deprecation was intended to include migration. There's even an ack
from Juan.


Yes I remember that's the plan.



So on the basis of not reverting the previous maintainer's decision, my
Ack stands here.

We also had pretty obvious bugs ([1], [2]) in the past that would have
been caught if we had any kind of testing for the feature, so I can't
even say this thing works currently.

@Peter Xu, @Li Zhijian, what are your thoughts on this?


Generally I definitely agree with such a removal sooner or later, as that's
how deprecation works, and even after Juan's left I'm not aware of any
other new RDMA users.  Personally, I'd slightly prefer postponing it one
more release which might help a bit of our downstream maintenance, however
I assume that's not a blocker either, as I think we can also manage it.

IMHO it's more important to know whether there are still users and whether
they would still like to see it around. That's also one thing I notice that
e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
if they're rare. According to [2] it could be that such user may only rely
on the release versions of QEMU when it broke things.

So I'm copying Yu too (while Zhijian is already in the loop), just in case
someone would like to stand up and speak.



I admit RDMA migration lacked testing (unit/CI tests), which led to a few
obvious bugs being noticed too late.
However I was a bit surprised when I saw the removal of the RDMA migration. I
wasn't aware that this feature has not been marked as deprecated (at least
there is no prompt to the end user).



IMHO it's more important to know whether there are still users and whether
they would still like to see it around.


Agree.
I didn't immediately express my opinion in V1 because I'm also consulting our
customers for this feature in the future.

Personally, I agree with Peter's idea that "I'd slightly prefer postponing it
one more release which might help a bit of our downstream maintenance"


Do you mind posting a deprecation patch to clarify the situation?

Thanks,

Phil.



Thanks
Zhijian



Thanks,



1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com
2- 
https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com






Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Zhijian Li (Fujitsu)


On 28/03/2024 23:01, Peter Xu wrote:
> On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
>> Philippe Mathieu-Daudé  writes:
>>
>>> The whole RDMA subsystem was deprecated in commit e9a54265f5
>>> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
>>> released in v8.2.
>>>
>>> Remove:
>>>   - RDMA handling from migration
>>>   - dependencies on libibumad, libibverbs and librdmacm
>>>
>>> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
>>> in old migration streams.
>>>
>>> Cc: Peter Xu 
>>> Cc: Li Zhijian 
>>> Acked-by: Fabiano Rosas 
>>> Signed-off-by: Philippe Mathieu-Daudé 
>>
>> Just to be clear, because people raised the point in the last version,
>> the first link in the deprecation commit links to a thread comprising
>> entirely of rdma migration patches. I don't see any ambiguity on whether
>> the deprecation was intended to include migration. There's even an ack
>> from Juan.
> 
> Yes I remember that's the plan.
> 
>>
>> So on the basis of not reverting the previous maintainer's decision, my
>> Ack stands here.
>>
>> We also had pretty obvious bugs ([1], [2]) in the past that would have
>> been caught if we had any kind of testing for the feature, so I can't
>> even say this thing works currently.
>>
>> @Peter Xu, @Li Zhijian, what are your thoughts on this?
> 
> Generally I definitely agree with such a removal sooner or later, as that's
> how deprecation works, and even after Juan's left I'm not aware of any
> other new RDMA users.  Personally, I'd slightly prefer postponing it one
> more release which might help a bit of our downstream maintenance, however
> I assume that's not a blocker either, as I think we can also manage it.
> 
> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around. That's also one thing I notice that
> e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
> if they're rare. According to [2] it could be that such user may only rely
> on the release versions of QEMU when it broke things.
> 
> So I'm copying Yu too (while Zhijian is already in the loop), just in case
> someone would like to stand up and speak.


I admit RDMA migration lacked testing (unit/CI tests), which led to a few
obvious bugs being noticed too late.
However I was a bit surprised when I saw the removal of the RDMA migration. I
wasn't aware that this feature has not been marked as deprecated (at least
there is no prompt to the end user).


> IMHO it's more important to know whether there are still users and whether
> they would still like to see it around.

Agree.
I didn't immediately express my opinion in V1 because I'm also consulting our
customers for this feature in the future.

Personally, I agree with Peter's idea that "I'd slightly prefer postponing it
one more release which might help a bit of our downstream maintenance"

Thanks
Zhijian

> 
> Thanks,
> 
>>
>> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com
>> 2- 
>> https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com
>>
> 

Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Peter Xu
On Thu, Mar 28, 2024 at 04:22:27PM +0100, Thomas Huth wrote:
> Since e9a54265f5 was not very clear about rdma migration code, should we
> maybe rather add a separate deprecation note for the migration part, and add
> a proper warning message to the migration code in case someone tries to use
> it there, and then only remove the rdma migration code after two more
> releases?

Definitely a valid option to me.

So far RDMA isn't covered in tests (actually the same goes for COLO, and I
wonder about our position on COLO too in this case..), so unfortunately we
don't even know when it'll break, just like before.

From other activities that I can see when new code comes in, maintaining RDMA
code should be fairly manageable so far (and whoever writes new rdma code in
those two releases will also need to take on the maintainer's role). We did
it for those years, and we can keep that up for two more releases. Hopefully
that can ring a louder alarm for the current users with such warnings, so
that people can either stick with old binaries or invest developer/test
resources in the community.

Thanks,

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Thomas Huth

On 28/03/2024 16.01, Peter Xu wrote:

On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:

Philippe Mathieu-Daudé  writes:


The whole RDMA subsystem was deprecated in commit e9a54265f5
("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
released in v8.2.

Remove:
  - RDMA handling from migration
  - dependencies on libibumad, libibverbs and librdmacm

Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
in old migration streams.

Cc: Peter Xu 
Cc: Li Zhijian 
Acked-by: Fabiano Rosas 
Signed-off-by: Philippe Mathieu-Daudé 


Just to be clear, because people raised the point in the last version,
the first link in the deprecation commit links to a thread comprising
entirely of rdma migration patches. I don't see any ambiguity on whether
the deprecation was intended to include migration. There's even an ack
from Juan.


Yes I remember that's the plan.



So on the basis of not reverting the previous maintainer's decision, my
Ack stands here.

We also had pretty obvious bugs ([1], [2]) in the past that would have
been caught if we had any kind of testing for the feature, so I can't
even say this thing works currently.

@Peter Xu, @Li Zhijian, what are your thoughts on this?


Generally I definitely agree with such a removal sooner or later, as that's
how deprecation works, and even after Juan's left I'm not aware of any
other new RDMA users.  Personally, I'd slightly prefer postponing it one
more release which might help a bit of our downstream maintenance, however
I assume that's not a blocker either, as I think we can also manage it.

IMHO it's more important to know whether there are still users and whether
they would still like to see it around.


Since e9a54265f5 was not very clear about rdma migration code, should we 
maybe rather add a separate deprecation note for the migration part, and add 
a proper warning message to the migration code in case someone tries to use 
it there, and then only remove the rdma migration code after two more releases?


 Thomas





Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Peter Xu
On Thu, Mar 28, 2024 at 11:18:04AM -0300, Fabiano Rosas wrote:
> Philippe Mathieu-Daudé  writes:
> 
> > The whole RDMA subsystem was deprecated in commit e9a54265f5
> > ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> > released in v8.2.
> >
> > Remove:
> >  - RDMA handling from migration
> >  - dependencies on libibumad, libibverbs and librdmacm
> >
> > Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> > in old migration streams.
> >
> > Cc: Peter Xu 
> > Cc: Li Zhijian 
> > Acked-by: Fabiano Rosas 
> > Signed-off-by: Philippe Mathieu-Daudé 
> 
> Just to be clear, because people raised the point in the last version,
> the first link in the deprecation commit links to a thread comprising
> entirely of rdma migration patches. I don't see any ambiguity on whether
> the deprecation was intended to include migration. There's even an ack
> from Juan.

Yes I remember that's the plan.

> 
> So on the basis of not reverting the previous maintainer's decision, my
> Ack stands here.
> 
> We also had pretty obvious bugs ([1], [2]) in the past that would have
> been caught if we had any kind of testing for the feature, so I can't
> even say this thing works currently.
> 
> @Peter Xu, @Li Zhijian, what are your thoughts on this?

Generally I definitely agree with such a removal sooner or later, as that's
how deprecation works, and even after Juan's left I'm not aware of any
other new RDMA users.  Personally, I'd slightly prefer postponing it one
more release which might help a bit of our downstream maintenance, however
I assume that's not a blocker either, as I think we can also manage it.

IMHO it's more important to know whether there are still users and whether
they would still like to see it around. That's also one thing I notice that
e9a54265f533f didn't yet get acks from RDMA users that we are aware, even
if they're rare. According to [2] it could be that such user may only rely
on the release versions of QEMU when it broke things.

So I'm copying Yu too (while Zhijian is already in the loop), just in case
someone would like to stand up and speak.

Thanks,

> 
> 1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com
> 2- 
> https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com
> 

-- 
Peter Xu




Re: [PATCH-for-9.1 v2 2/3] migration: Remove RDMA protocol handling

2024-03-28 Thread Fabiano Rosas
Philippe Mathieu-Daudé  writes:

> The whole RDMA subsystem was deprecated in commit e9a54265f5
> ("hw/rdma: Deprecate the pvrdma device and the rdma subsystem")
> released in v8.2.
>
> Remove:
>  - RDMA handling from migration
>  - dependencies on libibumad, libibverbs and librdmacm
>
> Keep the RAM_SAVE_FLAG_HOOK definition since it might appears
> in old migration streams.
>
> Cc: Peter Xu 
> Cc: Li Zhijian 
> Acked-by: Fabiano Rosas 
> Signed-off-by: Philippe Mathieu-Daudé 

Just to be clear, because people raised the point in the last version,
the first link in the deprecation commit links to a thread comprising
entirely of rdma migration patches. I don't see any ambiguity on whether
the deprecation was intended to include migration. There's even an ack
from Juan.

So on the basis of not reverting the previous maintainer's decision, my
Ack stands here.

We also had pretty obvious bugs ([1], [2]) in the past that would have
been caught if we had any kind of testing for the feature, so I can't
even say this thing works currently.

@Peter Xu, @Li Zhijian, what are your thoughts on this?

1- https://lore.kernel.org/r/20230920090412.726725-1-lizhij...@fujitsu.com
2- 
https://lore.kernel.org/r/cahecvy7hxswn4ow_kog+q+tn6f_kmeichevz1qgm-fbxbpp...@mail.gmail.com