Re: [Qemu-block] [Qemu-devel] [PATCH 2/8] block: fix unbounded stack for dump_qdict

2016-03-08 Thread Peter Xu
On Tue, Mar 08, 2016 at 02:47:31PM +0100, Markus Armbruster wrote:
> Fam Zheng  writes:
> > Also I think the double underscore identifiers are considered reserved in C,
> > no?
> 
> Correct.  C99 7.1.3 Reserved identifiers: All identifiers that begin
> with an underscore and either an uppercase letter or another underscore
> are always reserved for any use.
> 
> [...]

Fam, Markus,

Thanks for pointing that out. Will fix.

Peter



Re: [Qemu-block] [Qemu-devel] [PATCH 2/8] block: fix unbounded stack for dump_qdict

2016-03-08 Thread Peter Xu
On Tue, Mar 08, 2016 at 01:17:03PM +0100, Paolo Bonzini wrote:
> 
> 
> On 08/03/2016 09:12, Markus Armbruster wrote:
> > I'm afraid this isn't a good idea.  It relies on the non-local argument
> > that nobody will ever put a key longer than 255 into a qdict that gets
> > dumped.  That may even be the case, but you need to *prove* it, not just
> > assert it.  The weakest acceptable proof might be assertions in every
> > place that put keys into a dict that might get dumped.  I suspect that's
> > practical and maintainable only if there's a single place that does it.
> > 
> > If this was a good idea, I'd recommend to avoid the awkward macro:
> > 
> >char key[256];
> >int i;
> >
> >assert(strlen(entry->key) + 1 <= ARRAY_SIZE(key));
> > 
> > There are several other ways to limit the stack usage:
> > 
> > 1. Move the array from stack to heap.  Fine unless it's on a hot path.
> >As far as I can tell, this dumping business is for HMP and qemu-io,
> >i.e. not hot.
> 
> I think this is the best.  You can just g_strdup, modify in place, print
> and free.

g_strdup() would add one more loop, right? One to copy the string,
and one to replace "-" with " ". Though I would first need to replace
g_malloc0() with g_malloc(), which seems more suitable here. :)
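Something like this standalone sketch, I suppose (the key string is
made up, and glib is assumed; error handling omitted):

    #include <glib.h>
    #include <stdio.h>

    int main(void)
    {
        const char *orig = "some-key-name";   /* stand-in for entry->key */
        char *key = g_strdup(orig);           /* loop 1: copy the string */
        char *p;

        for (p = key; *p; p++) {              /* loop 2: '-' to ' ' */
            if (*p == '-') {
                *p = ' ';
            }
        }
        printf("%s\n", key);                  /* prints "some key name" */
        g_free(key);
        return 0;
    }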

Thanks!
Peter



[Qemu-block] [PATCH 0/2] block/qapi: trivial fixes

2016-03-08 Thread Peter Xu
One is to use literal printf() formats where possible.

The other is to fix an unbounded usage of the stack.

Peter Xu (2):
  block/qapi: make two printf() formats literal
  block/qapi: fix unbounded stack for dump_qdict

 block/qapi.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

-- 
2.4.3




[Qemu-block] [PATCH 1/2] block/qapi: make two printf() formats literal

2016-03-08 Thread Peter Xu
Fix two places to use literal printf() formats where possible.

Signed-off-by: Peter Xu <pet...@redhat.com>
---
 block/qapi.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index db2d3fb..c4c2115 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -619,9 +619,8 @@ static void dump_qlist(fprintf_function func_fprintf, void *f, int indentation,
 for (entry = qlist_first(list); entry; entry = qlist_next(entry), i++) {
 QType type = qobject_type(entry->value);
 bool composite = (type == QTYPE_QDICT || type == QTYPE_QLIST);
-const char *format = composite ? "%*s[%i]:\n" : "%*s[%i]: ";
-
-func_fprintf(f, format, indentation * 4, "", i);
+func_fprintf(f, "%*s[%i]:%c", indentation * 4, "", i,
+ composite ? '\n' : ' ');
 dump_qobject(func_fprintf, f, indentation + 1, entry->value);
 if (!composite) {
 func_fprintf(f, "\n");
@@ -637,7 +636,6 @@ static void dump_qdict(fprintf_function func_fprintf, void *f, int indentation,
 for (entry = qdict_first(dict); entry; entry = qdict_next(dict, entry)) {
 QType type = qobject_type(entry->value);
 bool composite = (type == QTYPE_QDICT || type == QTYPE_QLIST);
-const char *format = composite ? "%*s%s:\n" : "%*s%s: ";
 char key[strlen(entry->key) + 1];
 int i;
 
@@ -646,8 +644,8 @@ static void dump_qdict(fprintf_function func_fprintf, void *f, int indentation,
 key[i] = entry->key[i] == '-' ? ' ' : entry->key[i];
 }
 key[i] = 0;
-
-func_fprintf(f, format, indentation * 4, "", key);
+func_fprintf(f, "%*s%s:%c", indentation * 4, "", key,
+ composite ? '\n' : ' ');
 dump_qobject(func_fprintf, f, indentation + 1, entry->value);
 if (!composite) {
 func_fprintf(f, "\n");
-- 
2.4.3




Re: [Qemu-block] [PATCH 1/2] block/qapi: make two printf() formats literal

2016-03-09 Thread Peter Xu
On Wed, Mar 09, 2016 at 03:14:03PM -0700, Eric Blake wrote:
> > +func_fprintf(f, "%*s[%i]:%c", indentation * 4, "", i,
> > + composite ? '\n' : ' ');
> 
> [The nerd in me wants to point out that you could avoid the ternary by
> writing '"\n "[composite]', but that's too ugly to use outside of IOCCC
> submissions, and I wouldn't be surprised if it (rightfully) triggers
> clang warnings]

Do you mean something like:

int i = 0;
printf("%c", "\n "[i]);

Is this valid C grammar, btw?

Peter



Re: [Qemu-block] [PATCH 1/2] block/qapi: make two printf() formats literal

2016-03-21 Thread Peter Xu
On Mon, Mar 21, 2016 at 03:14:48PM -0600, Eric Blake wrote:
> On 03/09/2016 06:46 PM, Peter Xu wrote:
> > 
> > Is this a grammar btw?
> 
> Yes, C has an ugly grammar, because [] is just syntactic sugar for
> dereferencing a pointer addition with nicer operator precedence.  Quoting
> C99 6.5.2.1:
> 
> "The definition of the subscript operator [] is that E1[E2] is identical
> to (*((E1)+(E2))).  Because of the conversion rules that apply to the
> binary + operator, if E1 is an array object (equivalently, a pointer to
> the initial element of an array object) and E2 is an integer, E1[E2]
> designates the E2-th element of E1 (counting from zero)."
> 
> And a string literal is just a fancy way of writing the address of an
> array of characters (where the address is chosen by the compiler).
> 
> Thus, it IS valid to dereference the addition of an integer offset with
> the address implied by a string literal in order to obtain a character
> within the string.  And since the [] operator is commutative (even
> though no one in their right mind commutes the operands), you can also
> write the even-uglier:
> 
> composite["\n "]
> 
> But now we've gone far astray from the original patch review :)

Interesting thing to know.  Thanks. :)
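For the record, the equivalence is easy to check with a tiny standalone
program (both forms print a space when composite is 1):

    #include <stdio.h>

    int main(void)
    {
        int composite = 1;

        /* All three expressions denote the same character: */
        printf("%c", "\n "[composite]);       /* string literal indexed */
        printf("%c", composite["\n "]);       /* commuted operands */
        printf("%c", *("\n " + composite));   /* the desugared form */
        return 0;
    }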

-- peterx



Re: [Qemu-block] [Qemu-devel] [PATCH 2/8] block: fix unbounded stack for dump_qdict

2016-03-08 Thread Peter Xu
On Tue, Mar 08, 2016 at 09:12:45AM +0100, Markus Armbruster wrote:
> Peter Xu <pet...@redhat.com> writes:
> >  const char *format = composite ? "%*s%s:\n" : "%*s%s: ";
> 
> Unrelated to your patch: ugh!
> 
> Printf formats should be literals whenever possible, to make it easy for
> the compiler to warn you when you screw up.  It's trivially possible
> here!  Instead of
> 
>func_fprintf(f, format, indentation * 4, "", key);
> 
> do
> 
>func_fprintf(f, "%*s%s:%c", indentation * 4, "", key,
> composite ? '\n' : ' ');

Yes, I can do that too. Do you think I should split the patchset
into several smaller ones? That way I could use a 2-patch series for
this specific function: one patch for the printf() issue, one for the
stack allocation issue.

> 
> > -char key[strlen(entry->key) + 1];
> > +#define __KEY_LEN (256)
> > +char key[__KEY_LEN];
> >  int i;
> >  
> > +assert(strlen(entry->key) + 1 <= __KEY_LEN);
> > +#undef __KEY_LEN
> >  /* replace dashes with spaces in key (variable) names */
> >  for (i = 0; entry->key[i]; i++) {
> >  key[i] = entry->key[i] == '-' ? ' ' : entry->key[i];
> 
> I'm afraid this isn't a good idea.  It relies on the non-local argument
> that nobody will ever put a key longer than 255 into a qdict that gets
> dumped.  That may even be the case, but you need to *prove* it, not just
> assert it.  The weakest acceptable proof might be assertions in every
> place that put keys into a dict that might get dumped.  I suspect that's
> practical and maintainable only if there's a single place that does it.

Yes. Actually I had doubts about whether I should make this change...
since I am not experienced enough to estimate a value that will be
100% safe all the time. Here I just assumed 256 is big enough for
qdict keys, which indeed lacks proof.

> 
> If this was a good idea, I'd recommend to avoid the awkward macro:
> 
>char key[256];
>int i;
>
>assert(strlen(entry->key) + 1 <= ARRAY_SIZE(key));

Yes, possibly! I forgot about the ARRAY_SIZE() macro. Thanks for pointing it out.

> 
> There are several other ways to limit the stack usage:
> 
> 1. Move the array from stack to heap.  Fine unless it's on a hot path.
>As far as I can tell, this dumping business is for HMP and qemu-io,
>i.e. not hot.
> 
> 2. Use a stack array for short strings, switch to the heap for large
>strings.  More complex, but may be worth it on a hot path where most
>strings are short.
> 
> 3. Instead of creating a new string with '-' replaced for printing,
>print characters.  Can be okay with buffered I/O, but obviously not
>on a hot path.
> 
> 4. Like 3., but print substrings not containing '-'.

Thanks for all the suggestions and ideas.

To avoid the 256-byte limitation (and also considering that this path
is not hot), I'd like to choose (1) if you don't mind, maybe in a
split patch.
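(For completeness, option (2) would presumably look something like the
sketch below; print_key() and the 128-byte threshold are made up for
illustration:)

    #include <glib.h>
    #include <stdio.h>
    #include <string.h>

    static void print_key(const char *orig)
    {
        char stack_buf[128];
        size_t len = strlen(orig) + 1;
        char *key = len <= sizeof(stack_buf) ? stack_buf : g_malloc(len);

        memcpy(key, orig, len);
        g_strdelimit(key, "-", ' ');      /* replace dashes with spaces */
        printf("%s\n", key);
        if (key != stack_buf) {
            g_free(key);                  /* free only the heap fallback */
        }
    }

    int main(void)
    {
        print_key("dash-separated-key");
        return 0;
    }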

Thanks.
Peter



Re: [Qemu-block] [Qemu-devel] [PATCH] vmdk: Switch to heap arrays for vmdk_write_cid

2016-03-08 Thread Peter Xu
On Tue, Mar 08, 2016 at 02:18:35PM +0800, Fam Zheng wrote:
> It is only called once for each opened image, so we can do it the easy
> way.
> 
> Signed-off-by: Fam Zheng <f...@redhat.com>
> ---
>  block/vmdk.c | 25 ++---
>  1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index a8db5d9..1ec2452 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -274,36 +274,39 @@ static uint32_t vmdk_read_cid(BlockDriverState *bs, int parent)
>  
>  static int vmdk_write_cid(BlockDriverState *bs, uint32_t cid)

If we use the heap in the write path, do we need to do the same for
e.g. vmdk_parent_open() and vmdk_read_cid(), to at least keep
everything consistent?

>  {
> -char desc[DESC_SIZE], tmp_desc[DESC_SIZE];
> +char *desc, *tmp_desc;
>  char *p_name, *tmp_str;
>  BDRVVmdkState *s = bs->opaque;
> -int ret;
> +int ret = 0;
>  
> +desc = g_malloc0(DESC_SIZE);
> +tmp_desc = g_malloc0(DESC_SIZE);
>  ret = bdrv_pread(bs->file->bs, s->desc_offset, desc, DESC_SIZE);
>  if (ret < 0) {
> -return ret;
> +goto out;
>  }
>  
>  desc[DESC_SIZE - 1] = '\0';
>  tmp_str = strstr(desc, "parentCID");
>  if (tmp_str == NULL) {
> -return -EINVAL;
> +ret = -EINVAL;
> +goto out;
>  }
>  
> -pstrcpy(tmp_desc, sizeof(tmp_desc), tmp_str);
> +pstrcpy(tmp_desc, DESC_SIZE, tmp_str);
>  p_name = strstr(desc, "CID");
>  if (p_name != NULL) {
>  p_name += sizeof("CID");
> -snprintf(p_name, sizeof(desc) - (p_name - desc), "%" PRIx32 "\n", cid);
> -pstrcat(desc, sizeof(desc), tmp_desc);
> +snprintf(p_name, DESC_SIZE - (p_name - desc), "%" PRIx32 "\n", cid);
> +pstrcat(desc, DESC_SIZE, tmp_desc);
>  }
>  
>  ret = bdrv_pwrite_sync(bs->file->bs, s->desc_offset, desc, DESC_SIZE);
> -if (ret < 0) {
> -return ret;
> -}
>  
> -return 0;
> +out:
> +g_free(desc);
> +g_free(tmp_desc);
> +return ret;
>  }
>  
>  static int vmdk_is_cid_valid(BlockDriverState *bs)
> -- 
> 2.4.3
> 
> 

Besides the above nit-pick:

Reviewed-by: Peter Xu <pet...@redhat.com>



[Qemu-block] [PATCH 0/8] Fix several unbounded stack usage

2016-03-08 Thread Peter Xu
Suggested by Paolo.

This patchset fixes several of the warnings generated by
"-Wstack-usage=10".  There are about 20-30 unbounded stack cases
during my build, and this patchset only fixes several of them,
those which are obvious and easy.  For the rest, most of them need
some knowledge of a specific area (e.g., USB, net, block) to better
assess the limit values, and are not covered in this patchset.

One thing to mention about patch 4: I still cannot figure out why
the function xhci_dma_write_u32s() cannot be inlined.  However, the
current fix at least keeps the code behavior unchanged while making
the stack usage bounded.  Please let me know if anyone knows why.

Thanks.
Peter

CC: Markus Armbruster <arm...@redhat.com>
CC: Kevin Wolf <kw...@redhat.com>
CC: "Michael S. Tsirkin" <m...@redhat.com>
CC: Paolo Bonzini <pbonz...@redhat.com>
CC: Richard Henderson <r...@twiddle.net>
CC: Eduardo Habkost <ehabk...@redhat.com>
CC: Gerd Hoffmann <kra...@redhat.com>
CC: Juan Quintela <quint...@redhat.com>
CC: Amit Shah <amit.s...@redhat.com>
CC: Luiz Capitulino <lcapitul...@redhat.com>
CC: qemu-block@nongnu.org

Peter Xu (8):
  qdict: fix unbounded stack for qdict_array_entries
  block: fix unbounded stack for dump_qdict
  usb: fix unbounded stack for ohci_td_pkt
  usb: fix unbounded stack for xhci_dma_write_u32s
  usb: fix unbounded stack for inotify_watchfn
  usb: fix unbounded stack for usb_mtp_add_str
  migration: fix unbounded stack for source_return_path_thread
  hw/i386: fix unbounded stack for load_multiboot

 block/qapi.c  |  5 -
 hw/i386/multiboot.c   | 10 +-
 hw/usb/dev-mtp.c  | 13 +
 hw/usb/hcd-ohci.c |  7 ---
 hw/usb/hcd-xhci.c | 12 
 migration/migration.c |  7 ---
 qobject/qdict.c   | 15 +--
 7 files changed, 47 insertions(+), 22 deletions(-)

-- 
2.4.3




[Qemu-block] [PATCH 2/8] block: fix unbounded stack for dump_qdict

2016-03-08 Thread Peter Xu
Suggested-by: Paolo Bonzini <pbonz...@redhat.com>
CC: Markus Armbruster <arm...@redhat.com>
CC: Kevin Wolf <kw...@redhat.com>
CC: qemu-block@nongnu.org
Signed-off-by: Peter Xu <pet...@redhat.com>
---
 block/qapi.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/qapi.c b/block/qapi.c
index db2d3fb..687e577 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -638,9 +638,12 @@ static void dump_qdict(fprintf_function func_fprintf, void *f, int indentation,
 QType type = qobject_type(entry->value);
 bool composite = (type == QTYPE_QDICT || type == QTYPE_QLIST);
 const char *format = composite ? "%*s%s:\n" : "%*s%s: ";
-char key[strlen(entry->key) + 1];
+#define __KEY_LEN (256)
+char key[__KEY_LEN];
 int i;
 
+assert(strlen(entry->key) + 1 <= __KEY_LEN);
+#undef __KEY_LEN
 /* replace dashes with spaces in key (variable) names */
 for (i = 0; entry->key[i]; i++) {
 key[i] = entry->key[i] == '-' ? ' ' : entry->key[i];
-- 
2.4.3




Re: [Qemu-block] [PATCH v2 1/1] qemu/migration: fix the double free problem on from_src_file

2017-06-06 Thread Peter Xu
On Tue, Jun 06, 2017 at 06:42:18PM +0100, Dr. David Alan Gilbert wrote:
> * Kevin Wolf (kw...@redhat.com) wrote:
> > Am 06.06.2017 um 07:24 hat QingFeng Hao geschrieben:
> > > In load_snapshot, mis->from_src_file is freed twice: the first free is by
> > > qemu_fclose, the second by migration_incoming_state_destroy, and this
> > > causes an illegal instruction exception.  The fix is just to remove the
> > > first free.
> > > 
> > > This problem is found by qemu-iotests case 068 since commit
> > > "660819b migration: shut src return path unconditionally". The error is:
> > > 068 1s ... - output mismatch (see 068.out.bad)
> > > --- tests/qemu-iotests/068.out    2017-05-06 01:00:26.417270437 +0200
> > > +++ 068.out.bad   2017-06-03 13:59:55.360274640 +0200
> > > @@ -6,6 +6,8 @@
> > >  QEMU X.Y.Z monitor - type 'help' for more information
> > >  (qemu) savevm 0
> > >  (qemu) quit
> > > +./common.config: line 107: 242472 Illegal instruction (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
> > > +echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
> > > +fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
> > >  QEMU X.Y.Z monitor - type 'help' for more information
> > > -(qemu) quit
> > > -*** done
> > > +(qemu) *** done
> > > 
> > > Signed-off-by: QingFeng Hao <ha...@linux.vnet.ibm.com>
> > > Reviewed-by: Dr. David Alan Gilbert <dgilb...@redhat.com>
> > > Reviewed-by: Peter Xu <pet...@redhat.com>
> > 
> > Dave, as you only gave R-b rather than merging the patch, should this be
> > merged through the block tree?
> 
> I'm happy for it to go via block but also happy for it to go via
> migration; Juan is mostly doing the migration set at the moment since
> they're dominated by his cleanups.
> 
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 9c320f59d0..853e14e34e 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -2290,7 +2290,6 @@ int load_snapshot(const char *name, Error **errp)
> > >  
> > >  aio_context_acquire(aio_context);
> > >  ret = qemu_loadvm_state(f);
> > > -qemu_fclose(f);
> > >  aio_context_release(aio_context);
> > >  
> > >  migration_incoming_state_destroy();
> > 
> > Did we check other callers of migration_incoming_state_destroy()?
> > 
> > For example, qmp_xen_load_devices_state() looks suspicious, too.
> 
> Hmm, it looks suspicious in the opposite direction; it never sets
> mis->from_src_file as was added by b4b076da into the load_snapshot path.

Agree.

Does qmp_xen_load_devices_state() need to call
migration_incoming_state_destroy() after all?  The latter function
only cleans up MigrationIncomingState, and it looks like the Xen
code didn't really use it at all.

> 
> > I can't tell for postcopy_ram_listen_thread() - commit 660819b didn't
> > seem to remove a qemu_fclose() call there, but I can't see one left
> > behind either. Was the file leaked before commit 660819b or am I
> > missing something?
> 
> I don't think there's a problem in the postcopy path, although hmm was
> I missing a close before?
> 
> Dave
> > 
> > Kevin
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

-- 
Peter Xu



Re: [Qemu-block] [PATCH v1 1/1] qemu/migration: fix the double free problem on from_src_file

2017-06-05 Thread Peter Xu
On Mon, Jun 05, 2017 at 12:48:51PM +0200, QingFeng Hao wrote:
> In load_vmstate, mis->from_src_file is freed twice: the first free is by
> qemu_fclose, the second by migration_incoming_state_destroy, and this
> causes an illegal instruction exception.  The fix is just to remove the
> first free.
> 
> This problem is found by qemu-iotests case 068 since commit
> "660819b migration: shut src return path unconditionally". The error is:
> 068 1s ... - output mismatch (see 068.out.bad)
> --- tests/qemu-iotests/068.out    2017-05-06 01:00:26.417270437 +0200
> +++ 068.out.bad   2017-06-03 13:59:55.360274640 +0200
> @@ -6,6 +6,8 @@
>  QEMU X.Y.Z monitor - type 'help' for more information
>  (qemu) savevm 0
>  (qemu) quit
> +./common.config: line 107: 242472 Illegal instruction (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
> +echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
> +fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
>  QEMU X.Y.Z monitor - type 'help' for more information
> -(qemu) quit
> -*** done
> +(qemu) *** done
> 
> Signed-off-by: QingFeng Hao <ha...@linux.vnet.ibm.com>
> ---
>  migration/savevm.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 9c320f59d0..853e14e34e 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2290,7 +2290,6 @@ int load_snapshot(const char *name, Error **errp)
>  
>  aio_context_acquire(aio_context);
>  ret = qemu_loadvm_state(f);
> -qemu_fclose(f);
>  aio_context_release(aio_context);
>  
>  migration_incoming_state_destroy();

Thanks QingFeng for the fix!

Reviewed-by: Peter Xu <pet...@redhat.com>

Though instead of removing the fclose(), I'm not sure whether this
would be nicer:

diff --git a/migration/savevm.c b/migration/savevm.c
index 9c320f5..1feb519 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2233,7 +2233,6 @@ int load_snapshot(const char *name, Error **errp)
 QEMUFile *f;
 int ret;
 AioContext *aio_context;
-MigrationIncomingState *mis = migration_incoming_get_current();
 
 if (!bdrv_all_can_snapshot()) {
 error_setg(errp,
@@ -2286,7 +2285,6 @@ int load_snapshot(const char *name, Error **errp)
 }
 
 qemu_system_reset(SHUTDOWN_CAUSE_NONE);
-mis->from_src_file = f;
 
 aio_context_acquire(aio_context);
 ret = qemu_loadvm_state(f);

Since I see we set up from_src_file but never really use it.  If we
remove that line, we can also remove the
migration_incoming_get_current() call altogether.

Either way works for me, so my r-b stands regardless. :-)

-- 
Peter Xu



Re: [Qemu-block] [PATCH v1 1/1] qemu/migration: fix the double free problem on from_src_file

2017-06-05 Thread Peter Xu
On Tue, Jun 06, 2017 at 11:38:05AM +0800, QingFeng Hao wrote:
> 
> 
> On 2017/6/6 11:03, Peter Xu wrote:
> >On Mon, Jun 05, 2017 at 12:48:51PM +0200, QingFeng Hao wrote:
> >>In load_vmstate, mis->from_src_file is freed twice: the first free is by
> >>qemu_fclose, the second by migration_incoming_state_destroy, and this
> >>causes an illegal instruction exception.  The fix is just to remove the
> >>first free.
> >>
> >>This problem is found by qemu-iotests case 068 since commit
> >>"660819b migration: shut src return path unconditionally". The error is:
> >>068 1s ... - output mismatch (see 068.out.bad)
> >> --- tests/qemu-iotests/068.out 2017-05-06 01:00:26.417270437 +0200
> >> +++ 068.out.bad    2017-06-03 13:59:55.360274640 +0200
> >> @@ -6,6 +6,8 @@
> >>  QEMU X.Y.Z monitor - type 'help' for more information
> >>  (qemu) savevm 0
> >>  (qemu) quit
> >> +./common.config: line 107: 242472 Illegal instruction (core dumped) ( if [ -n "${QEMU_NEED_PID}" ]; then
> >> +echo $BASHPID > "${QEMU_TEST_DIR}/qemu-${_QEMU_HANDLE}.pid";
> >> +fi; exec "$QEMU_PROG" $QEMU_OPTIONS "$@" )
> >>  QEMU X.Y.Z monitor - type 'help' for more information
> >> -(qemu) quit
> >> -*** done
> >> +(qemu) *** done
> >>
> >>Signed-off-by: QingFeng Hao <ha...@linux.vnet.ibm.com>
> >>---
> >>  migration/savevm.c | 1 -
> >>  1 file changed, 1 deletion(-)
> >>
> >>diff --git a/migration/savevm.c b/migration/savevm.c
> >>index 9c320f59d0..853e14e34e 100644
> >>--- a/migration/savevm.c
> >>+++ b/migration/savevm.c
> >>@@ -2290,7 +2290,6 @@ int load_snapshot(const char *name, Error **errp)
> >>  aio_context_acquire(aio_context);
> >>  ret = qemu_loadvm_state(f);
> >>-qemu_fclose(f);
> >>  aio_context_release(aio_context);
> >>  migration_incoming_state_destroy();
> >Thanks QingFeng for the fix!
> >
> >Reviewed-by: Peter Xu <pet...@redhat.com>
> >
> >Though here instead of removing the fclose(), not sure whether this is
> >nicer:
> >
> >diff --git a/migration/savevm.c b/migration/savevm.c
> >index 9c320f5..1feb519 100644
> >--- a/migration/savevm.c
> >+++ b/migration/savevm.c
> >@@ -2233,7 +2233,6 @@ int load_snapshot(const char *name, Error **errp)
> >  QEMUFile *f;
> >  int ret;
> >  AioContext *aio_context;
> >-MigrationIncomingState *mis = migration_incoming_get_current();
> >
> >  if (!bdrv_all_can_snapshot()) {
> >  error_setg(errp,
> >@@ -2286,7 +2285,6 @@ int load_snapshot(const char *name, Error **errp)
> >  }
> >
> >  qemu_system_reset(SHUTDOWN_CAUSE_NONE);
> >-mis->from_src_file = f;
> >
> >  aio_context_acquire(aio_context);
> >  ret = qemu_loadvm_state(f);
> Thanks Peter. I have a doubt about this change, because I see there are
> several places referencing from_src_file, e.g.
> loadvm_postcopy_ram_handle_discard, which seems to get bytes from the
> file and is called by qemu_loadvm_state.
> Anyway, you reminded me that the description should say "In
> load_snapshot" rather than "In load_vmstate". Thanks

Oh, I didn't really notice that. :)

It shouldn't matter, IMHO, since loading a snapshot won't trigger the
postcopy code.  But sure, the current fix is good enough for me at least.

Thanks,

-- 
Peter Xu



Re: [Qemu-block] [PATCH v2] iothread: fix epollfd leak in the process of delIOThread

2018-05-16 Thread Peter Xu
On Wed, May 16, 2018 at 02:39:44PM +0800, Jie Wang wrote:
> From: w00251574 <wangji...@huawei.com>

(Maybe you'd prefer to still use "Jie Wang" here? :)

> 
> When we call addIOThread, the epollfd is created in aio_context_setup,
> but it is not closed in the process of delIOThread, so the epollfd leaks.
> 
> Signed-off-by: Jie Wang <wangji...@huawei.com>

[...]

> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index d8f0cb4af8..bd81455851 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -713,6 +713,13 @@ void aio_context_setup(AioContext *ctx)
>  #endif
>  }
>  
> +void aio_context_destroy(AioContext *ctx)
> +{
> +#ifdef CONFIG_EPOLL_CREATE1
> +close(ctx->epollfd);

Would it be better to call aio_epoll_disable() here?  Otherwise it
looks good to me.

> +#endif
> +}
> +

Regards,

-- 
Peter Xu



Re: [Qemu-block] [PATCH v4] iothread: fix epollfd leak in the process of delIOThread

2018-05-16 Thread Peter Xu
On Wed, May 16, 2018 at 07:14:53PM +0800, WangJie (Pluto) wrote:
> Hi, Peter Xu:
>   If we call aio_epoll_disable() here, it will return before closing
> ctx->epollfd, because ctx->epoll_enabled is false at that moment.
>   In the process of addIOThread, aio_context_setup creates the epoll
> fd without calling aio_epoll_try_enable, so ctx->epoll_enabled never
> gets set to true.

I see that epoll_available will only be set if epollfd != -1, so it
seems to me it would make more sense to swap the two variables in
aio_epoll_disable(), going from the current version:

static void aio_epoll_disable(AioContext *ctx)
{
ctx->epoll_available = false;
if (!ctx->epoll_enabled) {
return;
}
ctx->epoll_enabled = false;
close(ctx->epollfd);
}

To:

static void aio_epoll_disable(AioContext *ctx)
{
ctx->epoll_enabled = false;
if (!ctx->epoll_available) {
return;
}
ctx->epoll_available = false;
close(ctx->epollfd);
}

What do you think?  And Fam?

> 
> On 2018/5/16 16:36, Jie Wang wrote:
> > +void aio_context_destroy(AioContext *ctx)
> > +{
> > +#ifdef CONFIG_EPOLL_CREATE1
> > +if (ctx->epollfd >= 0) {
> > +close(ctx->epollfd);
> > +}
> > +#endif
> > +}
> > +
> >  void aio_context_set_poll_params(AioContext *ctx, int64_t max_ns,
> >   int64_t grow, int64_t shrink, Error 
> > **errp)
> 

-- 
Peter Xu



Re: [Qemu-block] [PATCH] iothread: fix epollfd leak in the process of delIOThread

2018-05-15 Thread Peter Xu
On Wed, May 16, 2018 at 09:38:31AM +0800, Fam Zheng wrote:
> On Tue, 05/15 20:00, Jie Wang wrote:
> > When we call addIOThread, the epollfd is created in aio_context_setup,
> > but it is not closed in the process of delIOThread, so the epollfd leaks.
> > 
> > Signed-off-by: Jie Wang <wangji...@huawei.com>
> > ---
> >  iothread.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/iothread.c b/iothread.c
> > index aff1281..23ac0a8 100644
> > --- a/iothread.c
> > +++ b/iothread.c
> > @@ -128,6 +128,7 @@ static void iothread_instance_finalize(Object *obj)
> >   * GSources first before destroying any GMainContext.
> >   */
> >  if (iothread->ctx) {
> > +close(iothread->ctx->epollfd);
> >  aio_context_unref(iothread->ctx);
> >  iothread->ctx = NULL;
> >  }
> > -- 
> > 1.8.3.1
> > 
> 
> Please add an aio_context_destroy() function in aio-posix.c and call it from
> aio_context_finalize(). IOThread code should not touch AioContext internals.

I believe Fam means aio_ctx_finalize().

> Also please remember to wrap the close() code in CONFIG_EPOLL_CREATE1.  An 
> empty
> function may need to be added to aio-win32.c, too.

Agreed.
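(For aio-win32.c, presumably just an empty stub would do, something
like:)

    void aio_context_destroy(AioContext *ctx)
    {
    }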

-- 
Peter Xu



Re: [Qemu-block] [PATCH v6 2/2] iothread: let aio_epoll_disable fit to aio_context_destroy

2018-05-17 Thread Peter Xu
On Thu, May 17, 2018 at 10:26:17AM +0800, Jie Wang wrote:
> epoll_available will only be set if epollfd != -1, so we
> can swap the two variables in aio_epoll_disable, and
> aio_context_destroy can call aio_epoll_disable directly.
> 
> Signed-off-by: Jie Wang <wangji...@huawei.com>
> ---
>  util/aio-posix.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 0ade2c7..118bf57 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -45,11 +45,11 @@ struct AioHandler
>  
>  static void aio_epoll_disable(AioContext *ctx)
>  {
> -ctx->epoll_available = false;
> -if (!ctx->epoll_enabled) {
> +ctx->epoll_enabled = false;
> +if (!ctx->epoll_available) {
>  return;
>  }
> -ctx->epoll_enabled = false;
> +ctx->epoll_available = false;
>  close(ctx->epollfd);
>  }
>  
> @@ -716,9 +716,7 @@ void aio_context_setup(AioContext *ctx)
>  void aio_context_destroy(AioContext *ctx)
>  {
>  #ifdef CONFIG_EPOLL_CREATE1
> -if (ctx->epollfd >= 0) {
> -close(ctx->epollfd);
> -}
> +aio_epoll_disable(ctx);

Hmm... I think this patch should have been the first one if we were going to split the series.

Anyway, IMHO version 5 is already good enough and has got r-bs, so
there's no need to bother reposting a version 7.  The maintainer could
touch up the commit message if necessary.

Thanks,

>  #endif
>  }
>  
> -- 
> 1.8.3.1
> 

-- 
Peter Xu



Re: [Qemu-block] raw iotest regressions in 2.12.0-rc0

2018-03-22 Thread Peter Xu
On Wed, Mar 21, 2018 at 05:58:48PM -0400, John Snow wrote:
> ./check -v -raw
> Failures: 109 132 136 148 152 183
> 
> 3fd2457d18edf5736f713dfe1ada9c87a9badab1 is the first bad commit
> commit 3fd2457d18edf5736f713dfe1ada9c87a9badab1
> Author: Peter Xu <pet...@redhat.com>
> Date:   Fri Mar 9 17:00:03 2018 +0800
> 
> monitor: enable IO thread for (qmp & !mux) typed
> 
> Start to use dedicate IO thread for QMP monitors that are not using
> MUXed chardev.
> 
> Reviewed-by: Fam Zheng <f...@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> Message-Id: <20180309090006.10018-21-pet...@redhat.com>
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> 
> 
> The symptom appears to be extra "RESUME" events in the stream that
> weren't expected by the original output for tests 109 and 183; the rest
> are python and I didn't dig yet.
> 
> ./check -v raw
> Failures: 055
> Failed 5 of 5 tests
> 
> 91ad45061af0fe44ac5dadb5bedaf4d7a08077c8 is the first bad commit
> commit 91ad45061af0fe44ac5dadb5bedaf4d7a08077c8
> Author: Peter Xu <pet...@redhat.com>
> Date:   Fri Mar 9 17:00:05 2018 +0800
> 
> tests: qmp-test: verify command batching
> 
> OOB introduced DROP event for flow control.  This should not affect old
> QMP clients.  Add a command batching check to make sure of it.
> 
> Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> Message-Id: <20180309090006.10018-23-pet...@redhat.com>
> Reviewed-by: Eric Blake <ebl...@redhat.com>
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> 
> 
> 
> Maybe these are known, but I wanted to consolidate them for rc0 for
> something easy to search for. There are others for qcow2 which I'll post
> in a bit...!
> 
> 
> Thanks,
> --js

CCing Max, Fam.

Now I think I know how to solve some of the tests already (109, 132,
148, 152, 183), while I am still working (or have not yet started to
work) on some others (055, 136, 205).

205 is interesting - it doesn't fail every time, only randomly:

205 1s ... [failed, exit status 1] - output mismatch (see 205.out.bad)
--- /home/peterx/git/qemu/tests/qemu-iotests/205.out    2018-03-08 19:36:27.452220803 +0800
+++ /home/peterx/git/qemu/bin/tests/qemu-iotests/205.out.bad    2018-03-22 21:16:52.727152006 +0800
@@ -1,5 +1,19 @@
-...
+F..
+==
+FAIL: test_connect_after_remove_default (__main__.TestNbdServerRemove)
+--
+Traceback (most recent call last):
+  File "205", line 96, in test_connect_after_remove_default
+    self.do_test_connect_after_remove()
+  File "205", line 90, in do_test_connect_after_remove
+    self.assert_qmp(result, 'return', {})
+  File "/home/peterx/git/qemu/tests/qemu-iotests/iotests.py", line 422, in assert_qmp
+    result = self.dictpath(d, path)
+  File "/home/peterx/git/qemu/tests/qemu-iotests/iotests.py", line 381, in dictpath
+    self.fail('failed path traversal for "%s" in "%s"' % (path, str(d)))
+AssertionError: failed path traversal for "return" in "{u'error': {u'class': u'GenericError', u'desc': u"export 'exp' still in use"}}"
+
--
Ran 7 tests

Haven't dug into it yet.

136 fails every time; this is the error:

136 4s ... [failed, exit status 1] - output mismatch (see 136.out.bad)
--- /home/peterx/git/qemu/tests/qemu-iotests/136.out    2018-01-12 12:46:42.069915434 +0800
+++ /home/peterx/git/qemu/bin/tests/qemu-iotests/136.out.bad    2018-03-22 21:16:13.981116000 +0800
@@ -1,5 +1,125 @@
-...
+.EE.EE.EE.EE.EE
+==
+ERROR: test_read_only (__main__.BlockDeviceStatsTestAccountBoth)
+--
+Traceback (most recent call last):
+  File "136", line 286, in test_read_only
+    self.do_test_stats(rd_size = i[0], rd_ops = i[1])
+  File "136", line 278, in do_test_stats
+    self.check_values()
+  File "136", line 204, in check_values
+    self.assertLess(0, stats['idle_time_ns'])
+KeyError: 'idle_time_ns'

Re: [Qemu-block] raw iotest regressions in 2.12.0-rc0

2018-03-21 Thread Peter Xu
On Wed, Mar 21, 2018 at 05:58:48PM -0400, John Snow wrote:
> ./check -v -raw
> Failures: 109 132 136 148 152 183
> 
> 3fd2457d18edf5736f713dfe1ada9c87a9badab1 is the first bad commit
> commit 3fd2457d18edf5736f713dfe1ada9c87a9badab1
> Author: Peter Xu <pet...@redhat.com>
> Date:   Fri Mar 9 17:00:03 2018 +0800
> 
> monitor: enable IO thread for (qmp & !mux) typed
> 
> Start to use dedicate IO thread for QMP monitors that are not using
> MUXed chardev.
> 
> Reviewed-by: Fam Zheng <f...@redhat.com>
> Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> Message-Id: <20180309090006.10018-21-pet...@redhat.com>
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> 
> 
> The symptom appears to be extra "RESUME" events in the stream that
> weren't expected by the original output for tests 109 and 183; the rest
> are python and I didn't dig yet.
> 
> ./check -v raw
> Failures: 055
> Failed 5 of 5 tests
> 
> 91ad45061af0fe44ac5dadb5bedaf4d7a08077c8 is the first bad commit
> commit 91ad45061af0fe44ac5dadb5bedaf4d7a08077c8
> Author: Peter Xu <pet...@redhat.com>
> Date:   Fri Mar 9 17:00:05 2018 +0800
> 
> tests: qmp-test: verify command batching
> 
> OOB introduced DROP event for flow control.  This should not affect old
> QMP clients.  Add a command batching check to make sure of it.
> 
> Reviewed-by: Stefan Hajnoczi <stefa...@redhat.com>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> Message-Id: <20180309090006.10018-23-pet...@redhat.com>
> Reviewed-by: Eric Blake <ebl...@redhat.com>
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> 
> 
> 
> Maybe these are known, but I wanted to consolidate them for rc0 for
> something easy to search for. There are others for qcow2 which I'll post
> in a bit...!

I'll have a look at this today and see whether it's the same problem
as reported by Max.  Sorry for that, and thanks for reporting!

-- 
Peter Xu



Re: [Qemu-block] [PATCH 2/2] virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

2018-09-13 Thread Peter Xu
On Thu, Sep 13, 2018 at 10:00:43AM -0600, Alex Williamson wrote:
> On Thu, 13 Sep 2018 12:04:34 +0200
> Paolo Bonzini  wrote:
> 
> > On 13/09/2018 11:11, Paolo Bonzini wrote:
> > > On 13/09/2018 08:03, Fam Zheng wrote:  
> > >> On Wed, 09/12 14:42, Paolo Bonzini wrote:  
> > >>> On 12/09/2018 13:50, Fam Zheng wrote:  
> > >>>>> I think it's okay if it is invoked.  The sequence is first you stop 
> > >>>>> the
> > >>>>> vq, then you drain the BlockBackends, then you switch AioContext.  All
> > >>>>> that matters is the outcome when virtio_scsi_dataplane_stop returns.  
> > >>>> Yes, but together with vIOMMU, it also effectively leads to a 
> > >>>> virtio_error(),
> > >>>> which is not clean. QEMU stderr when this call happens (with patch 1 
> > >>>> but not
> > >>>> this patch):
> > >>>>
> > >>>> 2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: 
> > >>>> detected translation failure (dev=02:00:00, iova=0x0)
> > >>>> 2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not 
> > >>>> recorded due to compression of faults
> > >>>> 2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized 
> > >>>> buffers are not allowed  
> > >>>
> > >>> But with iothread, virtio_scsi_dataplane_stop runs in another thread
> > >>> than the iothread; in that case you still have a race where the iothread
> > >>> can process the vq before aio_disable_external and print the error.
> > >>>
> > >>> IIUC the guest has cleared the IOMMU page tables _before_ clearing the
> > >>> DRIVER_OK bit in the status field.  Could this be a guest bug?  
> > >>
> > >> I'm not sure if it is a bug or not. I think what happens is the device 
> > >> is left
> > >> enabled by Seabios, and then reset by kernel.  
> > > 
> > > That makes sense, though I'm not sure why QEMU needs to process a
> > > request long after SeaBIOS has left control to Linux.  Maybe it's just
> > > that the messages should not go on QEMU stderr, and rather trace-point
> > > should be enough.  
> > 
> > Aha, it's not that QEMU needs to poll, it's just that polling mode is
> > enabled, and it decides to do one last iteration.  In general the virtio
> > spec allows the hardware to poll whenever it wants, hence:
> > 
> > 1) I'm not sure that translation failures should mark the device as
> > broken---definitely not when doing polling, possibly not even in
> > response to the guest "kicking" the virtqueue.  Alex, does the PCI spec
> > say anything about this?
> 
> AFAIK the PCI spec doesn't define anything about the IOMMU or response
> to translation failures.  Depending on whether it's a read or write,
> the device might see an unsupported request or not even be aware of the
> error.  It's really a platform RAS question whether to have any more
> significant response, most don't, but at least one tends to consider
> IOMMU faults to be a data integrity issue worth bring the system down.
> We've struggled with handling ongoing DMA generating IOMMU faults
> during kexec for a long time, so any sort of marking a device broken
> for a fault should be thoroughly considered, especially when a device
> could be assigned to a user who can trivially trigger a fault.
>  
> > 2) translation faliures should definitely not print messages to stderr.
> 
> Yep, easy DoS vector for a malicious guest, or malicious userspace
> driver within the guest.  Thanks,

Note that it's using error_report_once() upstream, so it'll only print
once during the whole lifetime of the QEMU process, and it's still a
tracepoint downstream, so no error will be dumped by default.  So AFAIU
it's not a DoS target either way.

I would consider it a good hint for strange bugs, since AFAIU DMA errors
should never occur on well-behaved guests.  However, I'd also be fine
posting a patch to make it an explicit tracepoint again if any of us
would still like it to go away.

Thanks,

-- 
Peter Xu



Re: [Qemu-block] [Qemu-devel] [PATCH v2 04/11] monitor: Create MonitorHMP with readline state

2019-06-12 Thread Peter Xu
On Wed, Jun 12, 2019 at 11:07:01AM +0200, Markus Armbruster wrote:

[...]

> > +struct MonitorHMP {
> > +Monitor common;
> > +/*
> > + * State used only in the thread "owning" the monitor.
> > + * If @use_io_thread, this is @mon_iothread.
> > + * Else, it's the main thread.
> > + * These members can be safely accessed without locks.
> > + */
> > +ReadLineState *rs;
> > +};
> > +
> 
> Hmm.
> 
> The monitor I/O thread code makes an effort not to restrict I/O thread
> use to QMP, even though we only use it there.  Whether the code would
> actually work for HMP as well we don't know.
> 
> Readline was similar until your PATCH 02: the code made an effort not to
> restrict it to HMP, even though we only use it there.  Whether the code
> would actually work for QMP as well we don't know.
> 
> Should we stop pretending and hard-code "I/O thread only for QMP"?
> 
> If yes, the comment above gets simplified by the patch that hard-codes
> "I/O thread only for QMP".
> 
> If no, we should perhaps point out that we currently don't use an I/O
> thread with HMP.  The comment above seems like a good place for that.

Yes, I agree that if we're refactoring the comment then we can make it
more explicit here.  My own preference would be the latter; we could
even have a bigger comment above MonitorHMP mentioning that it's only
used in the main thread, so no lock is needed for any of the HMP-only
fields (until someone wants to hammer on HMP again).
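Something along these lines, perhaps (a sketch of the comment, not
final wording):

    /*
     * HMP-specific monitor state.  HMP monitors are only ever run from
     * the main thread (never from the monitor I/O thread), so the
     * fields below can be accessed without locking.
     */
    struct MonitorHMP {
        Monitor common;
        ReadLineState *rs;
    };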

Thanks,

-- 
Peter Xu



Re: [Qemu-block] [PATCH] cutils: Move size_to_str() from "qemu-common.h" to "qemu/cutils.h"

2019-09-03 Thread Peter Xu
On Tue, Sep 03, 2019 at 02:57:31PM -0400, John Snow wrote:

[...]

> Seems proper. It must be an oversight to begin with that we declared it
> in qemu-common but defined it in cutils.

Probably true...

> 
> Reviewed-by: John Snow 

Reviewed-by: Peter Xu 

Thanks,

-- 
Peter Xu



Re: [PATCH v2 0/4] apic: Fix migration breakage of >255 vcpus

2019-10-24 Thread Peter Xu
On Thu, Oct 24, 2019 at 01:49:11PM -0400, John Snow wrote:
> 
> 
> On 10/23/19 4:17 AM, Kevin Wolf wrote:
> > The important difference here is legacy IDE (which works) vs. AHCI
> > (which doesn't work). If you add a -device ahci to the -M pc case, it
> > starts failing, too.
> > 
> > Not sure why AHCI fails, but I'll just CC John who is the lucky
> > maintainer of this device. :-)
> 
> Hm... It looks like SeaBIOS is identifying the drive correctly and
> perfectly well, but we're failing at boot_disk(u8 bootdrv, int
> checksig), about here:
> 
> call16_int(0x13, );
> 
> if (br.flags & F_CF) {
> printf("Boot failed: could not read the boot disk\n\n");
> return;
> }
> 
> Looking at AHCI tracing (From the QEMU side), it looks like we set up
> the drive correctly, and then never touch the port ever again -- I don't
> see an attempted read on QEMU's end.
> 
> I'll need to look through SeaBIOS source for hints, I'm not sure right
> yet. If anyone is more familiar with the SeaBIOS boot code, maybe they
> can give a pointer faster than I'll figure it out myself.

Hi, John,

I don't know seabios well, but I did have a pointer in my previous
email on where it faulted.  It seems to me that the issue is that
SeaBIOS may have got incorrect secs/cyls/heads information (and
explicitly setting secs=1,cyls=1,heads=1 on the block device fixes the
issue).

Thanks,

-- 
Peter Xu



Re: [PATCH v2 0/4] apic: Fix migration breakage of >255 vcpus

2019-10-23 Thread Peter Xu
On Sat, Oct 19, 2019 at 11:41:53AM +0800, Peter Xu wrote:
> On Wed, Oct 16, 2019 at 11:40:01AM -0300, Eduardo Habkost wrote:
> > On Wed, Oct 16, 2019 at 10:29:29AM +0800, Peter Xu wrote:
> > > v2:
> > > - use uint32_t rather than int64_t [Juan]
> > > - one more patch (patch 4) to check dup SaveStateEntry [Dave]
> > > - one more patch to define a macro (patch 1) to simplify patch 2
> > > 
> > > Please review, thanks.
> > 
> > I wonder how hard it is to write a simple test case to reproduce
> > the original bug.  We can extend tests/migration-test.c or
> > tests/acceptance/migration.py.  If using -device with explicit
> > apic-id, we probably don't even need to create >255 VCPUs.
> 
> I can give it a shot next week. :)

When trying this, I probably noticed a block layer issue: q35 seems to
have a problem booting from a very small block device (like 512 bytes,
which is the image size currently used by migration-test.c).
For example, this cmdline boots successfully into the test image:

$qemu -M pc -m 200m -accel kvm -nographic \
  -drive file=$image,id=drive0,index=0,format=raw \
  -device ide-hd,drive=drive0

While this cannot:

$qemu -M q35 -m 200m -accel kvm -nographic \
  -drive file=$image,id=drive0,index=0,format=raw \
  -device ide-hd,drive=drive0

With error (BIOS debug messages on):

Booting from Hard Disk..invalid basic_access:143:
   a=0201  b=  c=0001  d=0080 ds= es=07c0 ss=d980
  si= di= bp= sp=fd8e cs=f000 ip=cb81  f=0202
invalid basic_access:144:
   a=0201  b=  c=0001  d=0080 ds= es=07c0 ss=d980
  si= di= bp= sp=fd8e cs=f000 ip=cb81  f=0202
.
Boot failed: could not read the boot disk
enter handle_18:
  NULL

This corresponds to this SeaBIOS check error:

static void noinline
basic_access(struct bregs *regs, struct drive_s *drive_fl, u16 command)
{
...
// sanity check on cyl heads, sec
if (cylinder >= nlc || head >= nlh || sector > nls) {
warn_invalid(regs);
disk_ret(regs, DISK_RET_EPARAM);
return;
}
...
}

And... the below cmdline works even for q35 (as suggested by Fam when
we talked offline):

$qemu -M q35 -m 200m -accel kvm -nographic \
  -drive file=$image,id=drive0,index=0,format=raw \
  -device ide-hd,drive=drive0,secs=1,cyls=1,heads=1

I think for the migration test we can work around it as above, but I'm
also curious whether this is a real bug somewhere, because I don't see
a reason for q35 to refuse to boot from a one-sector image.

Thanks,

-- 
Peter Xu



Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-08-31 Thread Peter Xu
On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > Call qio_channel_set_zerocopy(true) in the start of every multifd thread.
> > 
> > Change the send_write() interface of multifd, allowing it to pass down
> > flags for qio_channel_write*().
> > 
> > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > other data being sent at the default copying approach.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  migration/multifd-zlib.c | 7 ---
> >  migration/multifd-zstd.c | 7 ---
> >  migration/multifd.c  | 9 ++---
> >  migration/multifd.h  | 3 ++-
> >  4 files changed, 16 insertions(+), 10 deletions(-)
> 
> > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> >  }
> >  
> >  if (used) {
> > -ret = multifd_send_state->ops->send_write(p, used, 
> > _err);
> > +ret = multifd_send_state->ops->send_write(p, used, 
> > MSG_ZEROCOPY,
> > +  _err);
> 
> I don't think it is valid to unconditionally enable this feature due to the
> resource usage implications
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
>if the socket option was not set, the socket exceeds its optmem 
>limit or the user exceeds its ulimit on locked pages."
> 
> The limit on locked pages is something that looks very likely to be
> exceeded unless you happen to be running a QEMU config that already
> implies locked memory (eg PCI assignment)

Yes, it would be great to have this as a migration capability in
parallel to multifd.  In the initial phase, if it's easier to implement
on multifd only, we can add a dependency between the caps.  In the
future we can remove that dependency when the code is ready to go
without multifd.
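For reference, the kernel-level usage under discussion is roughly the
following (a sketch only, not the actual QIOChannel code; fd is assumed
to be a connected TCP socket):

    int one = 1;

    /* Opt in once per socket. */
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

    /* Request zerocopy on each send. */
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };
    sendmsg(fd, &msg, MSG_ZEROCOPY);

    /*
     * buf must stay untouched until a completion notification for this
     * send arrives on the socket's error queue; ENOBUFS from sendmsg()
     * means the locked-pages limit was hit, as noted above.
     */

Thanks,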

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-08-31 Thread Peter Xu
On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > send calls. It does so by avoiding copying user data into kernel buffers.
> > 
> > To make it work, three steps are needed:
> > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > 3 - Process the socket's error queue, dealing with any error
> 
> AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> 
> It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> from a synchronous call to an asynchronous call.
> 
> It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> until an asynchronous completion notification has been received from
> the socket error queue. These notifications are not required to
> arrive in-order, even for a TCP stream, because the kernel hangs on
> to the buffer if a re-transmit is needed.
> 
> https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> 
>   "Page pinning also changes system call semantics. It temporarily 
>shares the buffer between process and network stack. Unlike with
>copying, the process cannot immediately overwrite the buffer 
>after system call return without possibly modifying the data in 
>flight. Kernel integrity is not affected, but a buggy program
>can possibly corrupt its own data stream."
> 
> AFAICT, the design added in this patch does not provide any way
> to honour these requirements around buffer lifetime.
> 
> I can't see how we can introduce MSG_ZEROCOPY in any seemless
> way. The buffer lifetime requirements imply need for an API
> design that is fundamentally different for asynchronous usage,
> with a callback to notify when the write has finished/failed.

Regarding buffer reuse - it indeed has very deep implications for buffer
availability, and it's not obvious at all.  Just to mention that the
initial user of this work will make sure all zero-copy buffers are guest
pages only (as it's only used in multifd), so they should always be
there during the process.

I think asking for a complete design still makes sense.  E.g., for precopy
before we flush device states and completes the migration, we may want to at
least have a final ack on all the zero-copies of guest pages to guarantee they
are flushed.

IOW, we need to make sure the last piece of migration stream lands after the
guest pages so that the dest VM will always contain the latest page data when
dest VM starts.  So far I don't see how current code guaranteed that.

In short, we may just want to at least have a way to make sure all
zero-copied buffers are done being used, and that they are sent, after
some function returns (e.g., qio_channel_flush()).  That may require us
to do some accounting of when we called sendmsg(MSG_ZEROCOPY); meanwhile
we would need to read out the ee_data field within the
SO_EE_ORIGIN_ZEROCOPY msg when we do recvmsg() on the error queue, and
keep that information somewhere too.
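A very rough sketch of what that error-queue processing could look like
(based on the kernel documentation above; a non-blocking socket is
assumed and error handling is omitted):

    #include <linux/errqueue.h>
    #include <sys/socket.h>

    /*
     * Drain zerocopy completion notifications from the error queue.
     * Sends numbered [serr->ee_info, serr->ee_data] have completed and
     * their buffers may be reused.
     */
    static void drain_zerocopy_completions(int fd)
    {
        char control[CMSG_SPACE(sizeof(struct sock_extended_err))];
        struct msghdr msg = { 0 };

        msg.msg_control = control;
        msg.msg_controllen = sizeof(control);

        while (recvmsg(fd, &msg, MSG_ERRQUEUE) >= 0) {
            struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
            struct sock_extended_err *serr =
                (struct sock_extended_err *)CMSG_DATA(cm);

            if (serr->ee_errno == 0 &&
                serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                /* account for completions ee_info..ee_data here */
            }
            msg.msg_controllen = sizeof(control);   /* reset for next call */
        }
    }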

Some other side notes that reached my mind..

The qio_channel_writev_full() may not be suitable for async operations, as the
name "full" implies synchronous to me.  So maybe we can add a new helper for
zero copy on the channel?

We may also want a new QIOChannelFeature as QIO_CHANNEL_FEATURE_ZEROCOPY, then
we fail qio_channel_writv_zerocopy() (or whatever name we come up with) if that
bit is not set in qio channel features.

Thanks,

-- 
Peter Xu




Re: [PATCH v1 0/3] QIOChannel flags + multifd zerocopy

2021-08-31 Thread Peter Xu
On Tue, Aug 31, 2021 at 08:02:36AM -0300, Leonardo Bras wrote:
> Results:
> So far, the resource usage of __sys_sendmsg() was reduced 15 times, and the
> overall migration took 13-18% less time, based on a synthetic workload.

Leo,

Could you share some of the details of your tests?  E.g., what's the
configuration of your VM for testing?  What's the migration time
before/after the patchset is applied?  What network are you using?

Thanks,

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-01 Thread Peter Xu
On Wed, Sep 01, 2021 at 09:53:07AM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 04:29:09PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 02:16:42PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:39AM -0300, Leonardo Bras wrote:
> > > > Call qio_channel_set_zerocopy(true) in the start of every multifd 
> > > > thread.
> > > > 
> > > > Change the send_write() interface of multifd, allowing it to pass down
> > > > flags for qio_channel_write*().
> > > > 
> > > > Pass down MSG_ZEROCOPY flag for sending memory pages, while keeping the
> > > > other data being sent at the default copying approach.
> > > > 
> > > > Signed-off-by: Leonardo Bras 
> > > > ---
> > > >  migration/multifd-zlib.c | 7 ---
> > > >  migration/multifd-zstd.c | 7 ---
> > > >  migration/multifd.c  | 9 ++---
> > > >  migration/multifd.h  | 3 ++-
> > > >  4 files changed, 16 insertions(+), 10 deletions(-)
> > > 
> > > > @@ -675,7 +676,8 @@ static void *multifd_send_thread(void *opaque)
> > > >  }
> > > >  
> > > >  if (used) {
> > > > -ret = multifd_send_state->ops->send_write(p, used, 
> > > > _err);
> > > > +ret = multifd_send_state->ops->send_write(p, used, 
> > > > MSG_ZEROCOPY,
> > > > +  _err);
> > > 
> > > I don't think it is valid to unconditionally enable this feature due to 
> > > the
> > > resource usage implications
> > > 
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > 
> > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens 
> > >if the socket option was not set, the socket exceeds its optmem 
> > >limit or the user exceeds its ulimit on locked pages."
> > > 
> > > The limit on locked pages is something that looks very likely to be
> > > exceeded unless you happen to be running a QEMU config that already
> > > implies locked memory (eg PCI assignment)
> > 
> > Yes it would be great to be a migration capability in parallel to multifd. 
> > At
> > initial phase if it's easy to be implemented on multi-fd only, we can add a
> > dependency between the caps.  In the future we can remove that dependency 
> > when
> > the code is ready to go without multifd.  Thanks,
> 
> Also, I'm wondering how zerocopy support interacts with kernel support
> for kTLS and multipath-TCP, both of which we want to be able to use
> with migration.

Copying Jason Wang for net implications between these features on kernel side
and whether they can be enabled together (MSG_ZEROCOPY, mptcp, kTLS).

To be on the safe side, we may want to only enable one of them until we
prove they'll work together, I guess..

Not an immediate concern, as I don't think any of them is explicitly
supported in QEMU yet.

kTLS may be implicitly included by a new gnutls, but we need to mark TLS and
ZEROCOPY as mutually exclusive anyway, because at least the userspace TLS code of
gnutls won't have a way to maintain the TLS buffers used by zerocopy.  So at
least we need some knob to detect whether kTLS is enabled in gnutls.

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-01 Thread Peter Xu
On Wed, Sep 01, 2021 at 04:44:30PM +0100, Daniel P. Berrangé wrote:
> QEMU has mptcp support already:
> 
>   commit 8bd1078aebcec5eac196a83ef1a7e74be0ba67b7
>   Author: Dr. David Alan Gilbert 
>   Date:   Wed Apr 21 12:28:34 2021 +0100
> 
> sockets: Support multipath TCP
> 
> Multipath TCP allows combining multiple interfaces/routes into a single
> socket, with very little work for the user/admin.
> 
> It's enabled by 'mptcp' on most socket addresses:
> 
>./qemu-system-x86_64 -nographic -incoming tcp:0:,mptcp

Oops, I totally forgot about that, sorry!

> 
> > kTLS may be implicitly included by a new gnutls, but we need to mark TLS and
> > ZEROCOPY as mutually exclusive anyway, because at least the userspace TLS code of
> > gnutls won't have a way to maintain the TLS buffers used by zerocopy.  So at
> > least we need some knob to detect whether kTLS is enabled in gnutls.
> 
> It isn't possible for gnutls to transparently enable KTLS, because
> GNUTLS doesn't get to see the actual socket directly - it'll need
> some work in QEMU to enable it.  We know MPTCP and KTLS are currently
> mutually exclusive as they both use the same kernel network hooks
> framework.

Then we may need to at least figure out whether zerocopy needs to mask out
mptcp.

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-01 Thread Peter Xu
On Wed, Sep 01, 2021 at 09:50:56AM +0100, Daniel P. Berrangé wrote:
> On Tue, Aug 31, 2021 at 04:27:04PM -0400, Peter Xu wrote:
> > On Tue, Aug 31, 2021 at 01:57:33PM +0100, Daniel P. Berrangé wrote:
> > > On Tue, Aug 31, 2021 at 08:02:38AM -0300, Leonardo Bras wrote:
> > > > MSG_ZEROCOPY is a feature that enables copy avoidance in TCP/UDP socket
> > > > send calls. It does so by avoiding copying user data into kernel 
> > > > buffers.
> > > > 
> > > > To make it work, three steps are needed:
> > > > 1 - A setsockopt() system call, enabling SO_ZEROCOPY
> > > > 2 - Passing down the MSG_ZEROCOPY flag for each send*() syscall
> > > > 3 - Process the socket's error queue, dealing with any error
> > > 
> > > AFAICT, this is missing the single most critical aspect of MSG_ZEROCOPY.
> > > 
> > > It is non-obvious, but setting the MSG_ZEROCOPY flag turns sendmsg()
> > > from a synchronous call to an asynchronous call.
> > > 
> > > It is forbidden to overwrite/reuse/free the buffer passed to sendmsg
> > > until an asynchronous completion notification has been received from
> > > the socket error queue. These notifications are not required to
> > > arrive in-order, even for a TCP stream, because the kernel hangs on
> > > to the buffer if a re-transmit is needed.
> > > 
> > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > 
> > >   "Page pinning also changes system call semantics. It temporarily 
> > >shares the buffer between process and network stack. Unlike with
> > >copying, the process cannot immediately overwrite the buffer 
> > >after system call return without possibly modifying the data in 
> > >flight. Kernel integrity is not affected, but a buggy program
> > >can possibly corrupt its own data stream."
> > > 
> > > AFAICT, the design added in this patch does not provide any way
> > > to honour these requirements around buffer lifetime.
> > > 
> > > I can't see how we can introduce MSG_ZEROCOPY in any seemless
> > > way. The buffer lifetime requirements imply need for an API
> > > design that is fundamentally different for asynchronous usage,
> > > with a callback to notify when the write has finished/failed.
> > 
> > Regarding buffer reuse - it indeed has a very deep implication on the
> > buffer being available and it's not obvious at all.  Just to mention that
> > the initial user of this work will make sure all zero copy buffers will be
> > guest pages only (as it's only used in multi-fd), so they should always be
> > there during the process.
> 
> That is not the case when migration is using TLS, because the buffers
> transmitted are the ciphertext buffer, not the plaintext guest page.

Agreed.

> 
> > In short, we may just want to at least have a way to make sure all
> > zero-copied buffers are done being used and sent after some function
> > returns (e.g., qio_channel_flush()).  That may require us to do some
> > accounting on when we called sendmsg(MSG_ZEROCOPY); meanwhile we would
> > need to read out the ee_data field within the SO_EE_ORIGIN_ZEROCOPY msg
> > when we do recvmsg() on the error queue and keep that information
> > somewhere too.
> > 
> > Some other side notes that reached my mind..
> > 
> > The qio_channel_writev_full() may not be suitable for async operations, as 
> > the
> > name "full" implies synchronous to me.  So maybe we can add a new helper for
> > zero copy on the channel?
> 
> All the APIs that exist today are fundamentally only suitable for sync
> operations. Supporting async correctly will definitely a brand new APIs
> separate from what exists today.

Yes.  What I wanted to say is that maybe we can keep the io_writev() interface
untouched and just add a new interface at the qio_channel_writev_full() level.

IOW, we could document on io_writev() that it can now be either sync or async,
just like sendmsg() carries that implication depending on the flags passed in.
Depending on which upper helper is called, the QIO channel layer can decide
which flag to pass down to io_writev().
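
For reference, the three MSG_ZEROCOPY steps quoted at the top of this thread
look roughly like this in plain C (assumes Linux >= 4.14 kernel and headers;
illustrative only, not QEMU code, and error handling trimmed):

    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_zerocopy(int fd, const void *buf, size_t len)
    {
        int one = 1;
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

        /* Step 1: opt in to zero-copy on this socket. */
        if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
            return -1;
        }
        /* Step 2: queue the buffer; the send is now asynchronous with
         * respect to 'buf', which must stay untouched until completed. */
        if (sendmsg(fd, &msg, MSG_ZEROCOPY) < 0) {
            return -1;
        }
        /* Step 3 (not shown here): read the socket error queue with
         * recvmsg(fd, ..., MSG_ERRQUEUE) to learn when 'buf' is reusable. */
        return 0;
    }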

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 05:25:50PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 8:06 AM Dr. David Alan Gilbert
>  wrote:
> > > Possibly, yes. This really need David G's input since he understands
> > > the code in way more detail than me.
> >
> > Hmm I'm not entirely sure why we have the sync after each iteration;
> > the case I can think of is if we're doing async sending, we could have
> > two versions of the same page in flight (one from each iteration) -
> > you'd want those to get there in the right order.
> >
> > Dave
> 
> Well, that's the thing: as we don't copy the buffer in MSG_ZEROCOPY, we
> will in fact have the same page in flight twice, instead of two versions,
> given the buffer is sent as it is during transmission.

That's an interesting point, which looks even valid... :)

There can still be two versions depending on when the page is read and fed to
the NIC, as the page can change during that window, but as long as the
later-sent page always lands after the earlier one on the destination node,
it looks ok.

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 05:13:40PM -0300, Leonardo Bras Soares Passos wrote:
> On Tue, Sep 7, 2021 at 1:44 PM Peter Xu  wrote:
> >
> > On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> > > I also suggested something like that, but I thought it could be good if 
> > > we could
> > > fall back to io_writev() if we didn't have the zerocopy feature (or
> > > the async feature).
> > >
> > > What do you think?
> >
> > That fallback looks safe and ok, I'm just not sure whether it'll be of great
> > help.  E.g. if we provide an QIO api that allows both sync write and 
> > zero-copy
> > write (then we do the fallback when necessary), it means the buffer 
> > implication
> > applies too to this api, so it's easier to me to just detect the zero copy
> > capability and use one alternative.  Thanks,
> >
> > --
> > Peter Xu
> >
> 
> I was thinking this way (async method with fallback) we would allow code
> using the QIO API to just try io_async_writev() whenever it seems fit,
> let the QIO channel layer decide when it can be used (implementation
> provided, features available), and just fall back to io_writev() when it
> can't be used.

Sure, I think it depends on whether the whole sync/async interface will be
the same; e.g., if the async API needs some callback registered, then the
interface won't be the same and the caller will be aware of that anyway.
Otherwise it looks fine indeed.  Thanks,
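
As a rough sketch of the detect-capability-and-pick-one approach (the helper
below is hypothetical; the qio_channel_* names match what later revisions of
this series ended up using):

    static int send_iov(QIOChannel *ioc, const struct iovec *iov,
                        size_t niov, Error **errp)
    {
        if (qio_channel_has_feature(ioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
            /* Caller must keep the buffers stable until a later flush. */
            return qio_channel_writev_full_all(ioc, iov, niov, NULL, 0,
                                               QIO_CHANNEL_WRITE_FLAG_ZERO_COPY,
                                               errp);
        }
        /* Fall back to the plain synchronous writev. */
        return qio_channel_writev_all(ioc, iov, niov, errp);
    }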

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 10:57:06PM +0100, Daniel P. Berrangé wrote:
> We think we're probably ok with migration as we are going to rely on the
> face that we eventually pause the guest to stop page changes during the
> final switchover. None the less I really strongly dislike the idea of
> not honouring the kernel API contract, despite the potential performance
> benefits it brings.

Yes, understandable, and it does look tricky.  But these are guest pages, and
to me that's just by nature how it works (sending the page happens in
parallel with the page being modified).

I think the MSG_ZEROCOPY doc page mentions that, and it's the userspace
program's own choice if that happens.  So even if it's not by design and not
suggested, it doesn't seem to be forbidden either.

We could wr-protect the pages (using things like userfaultfd-wp) during
sending and unprotect them when it's done via a qio callback; that would
guarantee the buffer does not change during sending, but then we'd gain
nothing besides the "API cleanness"..

Thanks,
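
For the record, wr-protecting a range with userfaultfd-wp would look roughly
like this - a sketch only, assuming the range was already registered with
UFFDIO_REGISTER_MODE_WP, and not part of this series:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <stdbool.h>

    static int wp_range(int uffd, unsigned long start, unsigned long len,
                        bool protect)
    {
        struct uffdio_writeprotect wp = {
            .range = { .start = start, .len = len },
            .mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0,
        };
        /* Protect before sendmsg(MSG_ZEROCOPY); unprotect from the
         * completion callback once the buffer is no longer in flight. */
        return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }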

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-09 Thread Peter Xu
On Thu, Sep 09, 2021 at 01:58:39AM -0300, Leonardo Bras Soares Passos wrote:
> FWIW, what I had in mind for a (theoretical) migration setup with
> io_async_writev() + io_async_flush():

One trivial concern is that it's not strictly just "async", because "async"
can happen on any nonblocking fd; here it's more "zerocopy"-specific.  But as
long as Dan is okay with it I'm fine too with starting with "async", as long
as we properly document the requirements on the lifecycle of the buffers.

> - For guest RAM we can decide not to rw_lock memory on zerocopy,
> because there is no need,
> - For headers, we can decide to not use async  (use io_writev() instead),
> - flush() can happen each iteration of migration, or at each N
> seconds, or at the end.

I think we should only need the flush at the end of precopy, and that should
also cover the switch to postcopy.  We could merge this flush() into the
existing per-iteration sync of multi-fd, but it's not strictly necessary,
imho.  We can see which is easier.  The rest looks good to me.  Thanks,
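
In pseudo-code, the shape being discussed is something like the following;
every identifier below is a placeholder taken from this thread, not an
existing QEMU function:

    /* Placeholder names only; no such API exists at this point. */
    while (precopy_has_dirty_pages()) {
        DirtyPage *p = next_dirty_page();
        io_async_writev(channel, p->iov, p->niov);   /* zero-copy, async */
    }
    /* One flush at the end of precopy (also covering the switch to
     * postcopy) waits until every queued buffer has really been sent. */
    io_async_flush(channel);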

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-07 Thread Peter Xu
On Thu, Sep 02, 2021 at 03:59:25AM -0300, Leonardo Bras Soares Passos wrote:
> I also suggested something like that, but I thought it could be good if we 
> could
> fall back to io_writev() if we didn't have the zerocopy feature (or
> the async feature).
> 
> What do you think?

That fallback looks safe and OK; I'm just not sure whether it'll be of great
help.  E.g. if we provide a QIO API that allows both sync write and
zero-copy write (then we do the fallback when necessary), it means the
buffer implication applies to this API too, so it's easier to me to just
detect the zero-copy capability and use one alternative.  Thanks,

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-07 Thread Peter Xu
On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > What if we do the 'flush()' before we start post-copy, instead of
> > > after each iteration? Would that be enough?
> > 
> > Possibly, yes. This really need David G's input since he understands
> > the code in way more detail than me.
> 
> Hmm I'm not entirely sure why we have the sync after each iteration;

We don't have that yet, do we?

> the case I can think of is if we're doing async sending, we could have
> two versions of the same page in flight (one from each iteration) -
> you'd want those to get there in the right order.

From the MSG_ZEROCOPY document:

For protocols that acknowledge data in-order, like TCP, each
notification can be squashed into the previous one, so that no more
than one notification is outstanding at any one point.

Ordered delivery is the common case, but not guaranteed. Notifications
may arrive out of order on retransmission and socket teardown.

So no matter whether we had a per-iteration flush() before, it seems we may
want one when zero copy is enabled?

Thanks,
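
For reference, draining those notifications looks roughly like this (a
sketch per the kernel documentation; each notification carries an
[ee_info, ee_data] range of completed sends, possibly batched):

    #include <linux/errqueue.h>
    #include <sys/socket.h>
    #include <stdint.h>

    static int read_zerocopy_completion(int fd, uint32_t *lo, uint32_t *hi)
    {
        char control[128];
        struct msghdr msg = { .msg_control = control,
                              .msg_controllen = sizeof(control) };
        struct cmsghdr *cm;

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
            return -1;                  /* e.g. EAGAIN: nothing pending yet */
        }
        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
            if (serr->ee_errno == 0 &&
                serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
                *lo = serr->ee_info;    /* first completed sendmsg() */
                *hi = serr->ee_data;    /* last completed sendmsg()  */
            }
        }
        return 0;
    }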

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Peter Xu
On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > I don't think it is valid to unconditionally enable this feature due to the
> > resource usage implications
> >
> > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> >
> >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> >if the socket option was not set, the socket exceeds its optmem
> >limit or the user exceeds its ulimit on locked pages."
> 
> You are correct, I unfortunately missed this part in the docs :(
> 
> > The limit on locked pages is something that looks very likely to be
> > exceeded unless you happen to be running a QEMU config that already
> > implies locked memory (eg PCI assignment)
> 
> Do you mean the limit an user has on locking memory?
> 
> If so, that makes sense. I remember I needed to set the upper limit of locked
> memory for the user before using it, or adding a capability to qemu before.

So I'm a bit confused about why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.

The thing is, IIUC that accounting is for pages pinned with either mlock()
(FOLL_MLOCK) or vfio (FOLL_PIN).

I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
__zerocopy_sg_from_iter() -> iov_iter_get_pages().

Say, I think there are plenty of iov_iter_get_pages() callers and normal
GUPs, and I think most of them do not need such accounting.
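
(For anyone reproducing the ENOBUFS side of this: the limit being charged
against can be inspected with plain getrlimit(), e.g.:)

    #include <sys/resource.h>
    #include <stdio.h>

    int main(void)
    {
        struct rlimit rl;

        /* The locked-memory budget that pinned-page accounting checks. */
        if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0) {
            printf("RLIMIT_MEMLOCK: cur=%llu max=%llu\n",
                   (unsigned long long)rl.rlim_cur,
                   (unsigned long long)rl.rlim_max);
        }
        return 0;
    }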

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-07 Thread Peter Xu
On Wed, Sep 08, 2021 at 10:59:57AM +0800, Jason Wang wrote:
> On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
> >
> > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos wrote:
> > > > I don't think it is valid to unconditionally enable this feature due to 
> > > > the
> > > > resource usage implications
> > > >
> > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > >
> > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > >if the socket option was not set, the socket exceeds its optmem
> > > >limit or the user exceeds its ulimit on locked pages."
> > >
> > > You are correct, I unfortunately missed this part in the docs :(
> > >
> > > > The limit on locked pages is something that looks very likely to be
> > > > exceeded unless you happen to be running a QEMU config that already
> > > > implies locked memory (eg PCI assignment)
> > >
> > > Do you mean the limit an user has on locking memory?
> > >
> > > If so, that makes sense. I remember I needed to set the upper limit of 
> > > locked
> > > memory for the user before using it, or adding a capability to qemu 
> > > before.
> >
> > So I'm a bit confused on why MSG_ZEROCOPY requires checking RLIMIT_MEMLOCK.
> >
> > The thing is IIUC that's accounting for pinned pages only with either 
> > mlock()
> > (FOLL_MLOCK) or vfio (FOLL_PIN).
> >
> > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> 
> It happens probably here:
> 
> E.g
> 
> __ip_append_data()
> msg_zerocopy_realloc()
> mm_account_pinned_pages()

Right. :)

But my previous question was more about the reason behind it - I thought that
was a common GUP and it shouldn't rely on locked_vm, because it should still
be a temporary GUP rather than a long-term GUP; IMHO that's how we use
locked_vm (we don't account for temp GUPs, only long-term ones).  IOW, I'm
wondering whether all the rest of the iov_iter_get_pages() callers should
check locked_vm too, and AFAIU they're not doing that right now..

Thanks,

-- 
Peter Xu




Re: [PATCH v1 3/3] migration: multifd: Enable zerocopy

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 09:19:27AM +0100, Dr. David Alan Gilbert wrote:
> * Jason Wang (jasow...@redhat.com) wrote:
> > On Wed, Sep 8, 2021 at 2:32 AM Peter Xu  wrote:
> > >
> > > On Thu, Sep 02, 2021 at 04:22:55AM -0300, Leonardo Bras Soares Passos 
> > > wrote:
> > > > > I don't think it is valid to unconditionally enable this feature due 
> > > > > to the
> > > > > resource usage implications
> > > > >
> > > > > https://www.kernel.org/doc/html/v5.4/networking/msg_zerocopy.html
> > > > >
> > > > >   "A zerocopy failure will return -1 with errno ENOBUFS. This happens
> > > > >if the socket option was not set, the socket exceeds its optmem
> > > > >limit or the user exceeds its ulimit on locked pages."
> > > >
> > > > You are correct, I unfortunately missed this part in the docs :(
> > > >
> > > > > The limit on locked pages is something that looks very likely to be
> > > > > exceeded unless you happen to be running a QEMU config that already
> > > > > implies locked memory (eg PCI assignment)
> > > >
> > > > Do you mean the limit an user has on locking memory?
> > > >
> > > > If so, that makes sense. I remember I needed to set the upper limit of 
> > > > locked
> > > > memory for the user before using it, or adding a capability to qemu 
> > > > before.
> > >
> > > So I'm a bit confused on why MSG_ZEROCOPY requires checking 
> > > RLIMIT_MEMLOCK.
> > >
> > > The thing is IIUC that's accounting for pinned pages only with either 
> > > mlock()
> > > (FOLL_MLOCK) or vfio (FOLL_PIN).
> > >
> > > I don't really think MSG_ZEROCOPY is doing that at all...  I'm looking at
> > > __zerocopy_sg_from_iter() -> iov_iter_get_pages().
> > 
> > It happens probably here:
> > 
> > E.g
> > 
> > __ip_append_data()
> > msg_zerocopy_realloc()
> > mm_account_pinned_pages()
> 
> When do they get uncounted?  i.e. is it just the data that's in flight
> that's marked as pinned?

I think so - there's __msg_zerocopy_callback() -> mm_unaccount_pinned_pages()
correspondingly.  Thanks,

-- 
Peter Xu




Re: [PATCH v1 2/3] io: Add zerocopy and errqueue

2021-09-08 Thread Peter Xu
On Wed, Sep 08, 2021 at 09:30:58AM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (pet...@redhat.com) wrote:
> > On Tue, Sep 07, 2021 at 12:06:15PM +0100, Dr. David Alan Gilbert wrote:
> > > > > What if we do the 'flush()' before we start post-copy, instead of 
> > > > > after each
> > > > > iteration? would that be enough?
> > > > 
> > > > Possibly, yes. This really need David G's input since he understands
> > > > the code in way more detail than me.
> > > 
> > > Hmm I'm not entirely sure why we have the sync after each iteration;
> > 
> > We don't have that yet, do we?
> 
> I think multifd does; I think it calls multifd_send_sync_main sometimes,
> which I *think* corresponds to iterations.

Oh.. Then we may need to:

  (1) Make that sync work for zero-copy too; say, semaphores may not be enough
  with it - we need to make sure all async buffers are consumed by checking
  the error queue of the sockets,

  (2) When we make zero-copy work without multi-fd, we may want some similar
  sync on the main stream too

Thanks,

> 
> Dave
> 
> > > the case I can think of is if we're doing async sending, we could have
> > > two versions of the same page in flight (one from each iteration) -
> > > you'd want those to get there in the right order.
> > 
> > From MSG_ZEROCOPY document:
> > 
> > For protocols that acknowledge data in-order, like TCP, each
> > notification can be squashed into the previous one, so that no more
> > than one notification is outstanding at any one point.
> > 
> > Ordered delivery is the common case, but not guaranteed. 
> > Notifications
> > may arrive out of order on retransmission and socket teardown.
> > 
> > So no matter whether we had a per-iteration flush() before, it seems we may
> > want one when zero copy is enabled?
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu
> > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 

-- 
Peter Xu




Re: [PATCH v2 8/9] hw/dma: Use dma_addr_t type definition when relevant

2022-01-04 Thread Peter Xu
On Tue, Jan 04, 2022 at 09:54:30AM +0100, Philippe Mathieu-Daudé wrote:
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 462f79a1f60..c3c49176110 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1147,7 +1147,7 @@ static uint16_t nvme_tx(NvmeCtrl *n, NvmeSg *sg, 
> uint8_t *ptr, uint32_t len,
>  
>  if (sg->flags & NVME_SG_DMA) {
>  const MemTxAttrs attrs = MEMTXATTRS_UNSPECIFIED;
> -uint64_t residual;
> +dma_addr_t residual;
>  
>  if (dir == NVME_TX_DIRECTION_TO_DEVICE) {
>  residual = dma_buf_write(ptr, len, >qsg, attrs);

If there's a new version: Maybe also change the return value types of
dma_buf_write|read() to dma_addr_t?

It'll be changed anyway in the next patch, so not a big deal.

The rest of the patches look good to me.  Thanks.

-- 
Peter Xu




Re: [PATCH v3 00/10] hw/dma: Use dma_addr_t type definition when relevant

2022-01-11 Thread Peter Xu
On Tue, Jan 11, 2022 at 07:42:59PM +0100, Philippe Mathieu-Daudé wrote:
> Since v2:
> - Split meson patch restricting fw_cfg (Richard)
> - Reorder pci_dma_map() docstring (Peter, Richard)
> - Move QEMUSGList in previous patch (David)
> - Have dma_buf_read/dma_buf_write return dma_addr_t (Peter)
> - Drop 'propagate MemTxResult' patch (David)
> - Added R-b tags

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-02-28 Thread Peter Xu
On Fri, Feb 18, 2022 at 05:57:13PM +0100, Juan Quintela wrote:
> I did a change on:
> 
> commit d48c3a044537689866fe44e65d24c7d39a68868a
> Author: Juan Quintela 
> Date:   Fri Nov 19 15:35:58 2021 +0100
> 
> multifd: Use a single writev on the send side
> 
> Until now, we wrote the packet header with write(), and the rest of the
> pages with writev().  Just increase the size of the iovec and do a
> single writev().
> 
> Signed-off-by: Juan Quintela 
> Reviewed-by: Dr. David Alan Gilbert 
> 
> And now we need to "preserve" this header until we do the sync,
> otherwise we are overwriting it with other things.
> 
> What testing have you done after this commit?
> 
> Notice that it is not _complicated_ to fix it, I will try to come up with
> some idea on Monday, but basically it is having an array of buffers for
> each thread, and storing them until we call a sync().

Or can we conditionally merge the two write()s?  IMHO the array of buffers
idea sounds too complicated, and I'm not extremely sure whether it'll pay
off in the end.  We could keep the two write()s with ZEROCOPY enabled, and
use the merged version otherwise.

Btw, are there any performance measurements for the above commit
d48c3a044537?  I have a feeling that the single write() may not help that
much, because for multifd the bottleneck should be on the NIC, not on the
processor.

IOW, we could find that the majority of the time is not spent in the
user<->kernel switches (which is where the extra overhead of a write()
syscall comes from, IIUC), but that we simply block on one of the write()s
because the socket write buffer is full...  So we could have saved some CPU
cycles by merging the calls, but performance-wise we may not gain much.

Thanks,

-- 
Peter Xu




Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-02-07 Thread Peter Xu
On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> -void multifd_send_sync_main(QEMUFile *f)
> +int multifd_send_sync_main(QEMUFile *f)
>  {
>  int i;
> +bool flush_zero_copy;
>  
>  if (!migrate_use_multifd()) {
> -return;
> +return 0;
>  }
>  if (multifd_send_state->pages->num) {
>  if (multifd_send_pages(f) < 0) {
>  error_report("%s: multifd_send_pages fail", __func__);
> -return;
> +return 0;

I've not checked what it used to do if multifd_send_pages() failed, but..
should it return -1 rather than 0 now that there is a return code?

>  }
>  }
> +
> +/*
> + * When using zero-copy, it's necessary to flush after each iteration to
> + * make sure pages from earlier iterations don't end up replacing newer
> + * pages.
> + */
> +flush_zero_copy = migrate_use_zero_copy_send();
> +
>  for (i = 0; i < migrate_multifd_channels(); i++) {
>  MultiFDSendParams *p = &multifd_send_state->params[i];
>  
> @@ -591,7 +600,7 @@ void multifd_send_sync_main(QEMUFile *f)
>  if (p->quit) {
>  error_report("%s: channel %d has already quit", __func__, i);
>  qemu_mutex_unlock(&p->mutex);
> -return;
> +return 0;

Same question here.

>  }

The rest looks good.  Thanks,

-- 
Peter Xu




Re: [PATCH v8 5/5] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-02-07 Thread Peter Xu
On Mon, Feb 07, 2022 at 11:49:38PM -0300, Leonardo Bras Soares Passos wrote:
> Hello Peter, thanks for reviewing!
> 
> On Mon, Feb 7, 2022 at 11:22 PM Peter Xu  wrote:
> >
> > On Tue, Feb 01, 2022 at 03:29:03AM -0300, Leonardo Bras wrote:
> > > -void multifd_send_sync_main(QEMUFile *f)
> > > +int multifd_send_sync_main(QEMUFile *f)
> > >  {
> > >  int i;
> > > +bool flush_zero_copy;
> > >
> > >  if (!migrate_use_multifd()) {
> > > -return;
> > > +return 0;
> > >  }
> > >  if (multifd_send_state->pages->num) {
> > >  if (multifd_send_pages(f) < 0) {
> > >  error_report("%s: multifd_send_pages fail", __func__);
> > > -return;
> > > +return 0;
> >
> > I've not checked what it used to do if multifd_send_pages() failed, but..
> > should it return -1 rather than 0 now that there is a return code?
> 
> Yeah, that makes sense.
> The point here is that I was trying not to modify much of the current 
> behavior.
> 
> I mean, multifd_send_sync_main() would previously return void, so any
> other errors would not matter to the caller of this function, which
> will continue to run as if nothing happened.
> 
> Now, if it fails with flush_zero_copy, the operation needs to be aborted.

Right, the way I understand it is that we'll fail anyway, and this allows us
to fail (probably) sooner.

> 
> Maybe, I should make it different:
> - In any error, return -1.
> - Create/use a specific error code in the case of a failing
> flush_zero_copy, so I can test the return value for it on the caller
> function and return early.
> 
> Or alternatively, the other errors could also return early, but since
> this will change how the code currently works, I would probably need
> another patch for that change. (so it can be easily reverted if
> needed)

Yeah, should work too to add a patch before this one.

> 
> What do you think is better?

I just don't see how it could continue if e.g. multifd_send_pages() failed.

The other thing is that returning zero looks weird in itself when there's
obviously an error.  Normally we could allow that, but better with a comment
explaining why.  For this case it feels more natural to me to just fail
early.

Juan?

-- 
Peter Xu




Re: [PATCH v8 1/5] QIOChannel: Add flags on io_writev and introduce io_flush callback

2022-02-07 Thread Peter Xu
On Tue, Feb 01, 2022 at 03:28:59AM -0300, Leonardo Bras wrote:
> Add flags to io_writev and introduce io_flush as optional callback to
> QIOChannelClass, allowing the implementation of zero copy writes by
> subclasses.
> 
> How to use them:
> - Write data using qio_channel_writev*(...,QIO_CHANNEL_WRITE_FLAG_ZERO_COPY),
> - Wait write completion with qio_channel_flush().
> 
> Notes:
> As some zero copy write implementations work asynchronously, it's
> recommended to keep the write buffer untouched until the return of
> qio_channel_flush(), to avoid the risk of sending an updated buffer
> instead of the buffer state during write.
> 
> As io_flush callback is optional, if a subclass does not implement it, then:
> - io_flush will return 0 without changing anything.
> 
> Also, some functions like qio_channel_writev_full_all() were adapted to
> receive a flag parameter. That allows shared code between zero copy and
> non-zero copy writev, and also an easier implementation on new flags.
> 
> Signed-off-by: Leonardo Bras 

With Dan's comment addressed on removing the redundant assertion:

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v3] hw/dma: Let dma_buf_read() / dma_buf_write() propagate MemTxResult

2022-01-17 Thread Peter Xu
On Mon, Jan 17, 2022 at 01:51:30PM +0100, Philippe Mathieu-Daudé wrote:
> From: Philippe Mathieu-Daudé 
> 
> Since commit 292e13142d2, dma_buf_rw() returns a MemTxResult type.
> Do not discard it, return it to the caller. Pass the previously
> returned value (the QEMUSGList residual size, which was rarely used)
> as an optional argument.
> 
> With this new API, SCSIRequest::residual might now be accessed via
> a pointer. Since the size_t type does not have the same size on
> 32 and 64-bit host architectures, convert it to a uint64_t, which
> is big enough to hold the residual size, and the type is constant
> on both 32/64-bit hosts.
> 
> Update the few dma_buf_read() / dma_buf_write() callers to the new
> API.
> 
> Reviewed-by: Klaus Jensen 
> Signed-off-by: Philippe Mathieu-Daudé 
> Signed-off-by: Philippe Mathieu-Daudé 

Acked-by: Peter Xu 
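
The resulting call pattern, as far as can be inferred from the commit
message above - the exact signature below is an assumption, not copied from
the final tree:

    /* Inferred from the description: MemTxResult return value, residual
     * reported through an optional out-parameter. */
    dma_addr_t residual;
    MemTxResult res = dma_buf_read(ptr, len, &residual, &req->qsg,
                                   MEMTXATTRS_UNSPECIFIED);
    if (res != MEMTX_OK) {
        /* Propagate the bus error instead of silently discarding it. */
    }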

-- 
Peter Xu




Re: [PATCH v2 21/22] softmmu/memory: Clean up local variable shadowing

2023-09-05 Thread Peter Xu
On Mon, Sep 04, 2023 at 06:12:33PM +0200, Philippe Mathieu-Daudé wrote:
> Fix:
> 
>   softmmu/memory.c: In function ‘mtree_print_mr’:
>   softmmu/memory.c:3236:27: warning: declaration of ‘ml’ shadows a previous 
> local [-Wshadow=compatible-local]
>3236 | MemoryRegionList *ml;
> |   ^~
>   softmmu/memory.c:3213:32: note: shadowed declaration is here
>3213 | MemoryRegionList *new_ml, *ml, *next_ml;
> |^~
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 1/2] vmstate: Mark VMStateInfo.get/put() coroutine_mixed_fn

2023-09-05 Thread Peter Xu
On Tue, Sep 05, 2023 at 04:50:01PM +0200, Kevin Wolf wrote:
> Migration code can run both in coroutine context (the usual case) and
> non-coroutine context (at least savevm/loadvm for snapshots). This also
> affects the VMState callbacks, and devices must consider this. Change
> the callback definition in VMStateInfo to be explicit about it.
> 
> Signed-off-by: Kevin Wolf 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [RFC PATCH v2 22/22] softmmu/physmem: Clean up local variable shadowing

2023-09-05 Thread Peter Xu
On Mon, Sep 04, 2023 at 05:31:30PM +0100, Daniel P. Berrangé wrote:
> On Mon, Sep 04, 2023 at 06:12:34PM +0200, Philippe Mathieu-Daudé wrote:
> > Fix:
> > 
> >   softmmu/physmem.c: In function 
> > ‘cpu_physical_memory_snapshot_and_clear_dirty’:
> >   softmmu/physmem.c:916:27: warning: declaration of ‘offset’ shadows a 
> > parameter [-Wshadow=compatible-local]
> > 916 | unsigned long offset = page % DIRTY_MEMORY_BLOCK_SIZE;
> > |   ^~
> >   softmmu/physmem.c:892:31: note: shadowed declaration is here
> > 892 | (MemoryRegion *mr, hwaddr offset, hwaddr length, unsigned 
> > client)
> > |~~~^~
> > 
> > Signed-off-by: Philippe Mathieu-Daudé 
> > ---
> RFC: Please double-check how 'offset' is used a few lines later.
> 
> I don't see an issue - those lines are in an outer scope, so won't
> be accessing the 'offset' you've changed, they'll be the parameter
> instead. If you want to sanity check though, presumably the asm
> dissassembly for this method should be the same before/after this
> change

(and if it didn't do so then it's a bug..)

> 
> > ---
> >  softmmu/physmem.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> Reviewed-by: Daniel P. Berrangé 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 2/7] migration: Clean up local variable shadowing

2023-08-31 Thread Peter Xu
On Thu, Aug 31, 2023 at 03:25:41PM +0200, Markus Armbruster wrote:
> Local variables shadowing other local variables or parameters make the
> code needlessly hard to understand.  Tracked down with -Wshadow=local.
> Clean up: delete inner declarations when they are actually redundant,
> else rename variables.
> 
> Signed-off-by: Markus Armbruster 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 1/7] migration/rdma: Fix save_page method to fail on polling error

2023-08-31 Thread Peter Xu
On Thu, Aug 31, 2023 at 03:25:40PM +0200, Markus Armbruster wrote:
> qemu_rdma_save_page() reports polling error with error_report(), then
> succeeds anyway.  This is because the variable holding the polling
> status *shadows* the variable the function returns.  The latter
> remains zero.
> 
> Broken since day one, and duplicated more recently.
> 
> Fixes: 2da776db4846 (rdma: core logic)
> Fixes: b390afd8c50b (migration/rdma: Fix out of order wrid)
> Signed-off-by: Markus Armbruster 

Reviewed-by: Peter Xu 
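
For readers unfamiliar with this bug class, a reduced illustration with
hypothetical names (poll_completion() is invented for the example):

    static int save_page(void)
    {
        int ret = 0;

        {
            /* This 'ret' shadows the outer one. */
            int ret = poll_completion();    /* hypothetical helper */
            if (ret < 0) {
                error_report("polling failed");
                /* Intended to fail the function, but only the inner
                 * 'ret' was set; the outer one is still 0. */
            }
        }
        return ret;     /* always 0: the error is lost */
    }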

-- 
Peter Xu




Re: [PATCH V2 3/6] cpr: relax blockdev migration blockers

2023-10-31 Thread Peter Xu
Copy qemu-block and maintainers.

On Wed, Oct 25, 2023 at 12:44:26PM -0700, Steve Sistare wrote:
> Some blockdevs block migration because they do not support sharing across
> hosts and/or do not support dirty bitmaps.  These prohibitions do not apply
> if the old and new qemu processes do not run concurrently, and if new qemu
> starts on the same host as old, which is the case for cpr.  Narrow the scope
> of these blockers so they only apply to normal mode.  They will not block
> cpr modes when they are added in subsequent patches.
> 
> No functional change until a new mode is added.
> 
> Signed-off-by: Steve Sistare 
> Reviewed-by: Juan Quintela 
> ---
>  block/parallels.c | 2 +-
>  block/qcow.c  | 2 +-
>  block/vdi.c   | 2 +-
>  block/vhdx.c  | 2 +-
>  block/vmdk.c  | 2 +-
>  block/vpc.c   | 2 +-
>  block/vvfat.c | 2 +-
>  7 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/block/parallels.c b/block/parallels.c
> index 1697a2e..8a520db 100644
> --- a/block/parallels.c
> +++ b/block/parallels.c
> @@ -1369,7 +1369,7 @@ static int parallels_open(BlockDriverState *bs, QDict 
> *options, int flags,
> bdrv_get_device_or_node_name(bs));
>  bdrv_graph_rdunlock_main_loop();
>  
> -ret = migrate_add_blocker(&s->migration_blocker, errp);
> +ret = migrate_add_blocker_normal(&s->migration_blocker, errp);
>  if (ret < 0) {
>  error_setg(errp, "Migration blocker error");
>  goto fail;
> diff --git a/block/qcow.c b/block/qcow.c
> index fdd4c83..eab68e3 100644
> --- a/block/qcow.c
> +++ b/block/qcow.c
> @@ -307,7 +307,7 @@ static int qcow_open(BlockDriverState *bs, QDict 
> *options, int flags,
> bdrv_get_device_or_node_name(bs));
>  bdrv_graph_rdunlock_main_loop();
>  
> -ret = migrate_add_blocker(&s->migration_blocker, errp);
> +ret = migrate_add_blocker_normal(&s->migration_blocker, errp);
>  if (ret < 0) {
>  goto fail;
>  }
> diff --git a/block/vdi.c b/block/vdi.c
> index fd7e365..c647d72 100644
> --- a/block/vdi.c
> +++ b/block/vdi.c
> @@ -498,7 +498,7 @@ static int vdi_open(BlockDriverState *bs, QDict *options, 
> int flags,
> bdrv_get_device_or_node_name(bs));
>  bdrv_graph_rdunlock_main_loop();
>  
> -ret = migrate_add_blocker(&s->migration_blocker, errp);
> +ret = migrate_add_blocker_normal(&s->migration_blocker, errp);
>  if (ret < 0) {
>  goto fail_free_bmap;
>  }
> diff --git a/block/vhdx.c b/block/vhdx.c
> index e37f8c0..a9d0874 100644
> --- a/block/vhdx.c
> +++ b/block/vhdx.c
> @@ -1096,7 +1096,7 @@ static int vhdx_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  error_setg(&s->migration_blocker, "The vhdx format used by node '%s' "
> "does not support live migration",
> bdrv_get_device_or_node_name(bs));
> -ret = migrate_add_blocker(&s->migration_blocker, errp);
> +ret = migrate_add_blocker_normal(&s->migration_blocker, errp);
>  if (ret < 0) {
>  goto fail;
>  }
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 1335d39..85864b8 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -1386,7 +1386,7 @@ static int vmdk_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  error_setg(&s->migration_blocker, "The vmdk format used by node '%s' "
> "does not support live migration",
> bdrv_get_device_or_node_name(bs));
> -ret = migrate_add_blocker(&s->migration_blocker, errp);
> +ret = migrate_add_blocker_normal(&s->migration_blocker, errp);
>  if (ret < 0) {
>  goto fail;
>  }
> diff --git a/block/vpc.c b/block/vpc.c
> index c30cf86..aa1a48a 100644
> --- a/block/vpc.c
> +++ b/block/vpc.c
> @@ -452,7 +452,7 @@ static int vpc_open(BlockDriverState *bs, QDict *options, 
> int flags,
> bdrv_get_device_or_node_name(bs));
>  bdrv_graph_rdunlock_main_loop();
>  
> -ret = migrate_add_blocker(&s->migration_blocker, errp);
> +ret = migrate_add_blocker_normal(&s->migration_blocker, errp);
>  if (ret < 0) {
>  goto fail;
>  }
> diff --git a/block/vvfat.c b/block/vvfat.c
> index 266e036..9d050ba 100644
> --- a/block/vvfat.c
> +++ b/block/vvfat.c
> @@ -1268,7 +1268,7 @@ static int vvfat_open(BlockDriverState *bs, QDict 
> *options, int flags,
> "The vvfat (rw) format used by node '%s' "
> "does not support live migration",
> bdrv_get_device_or_node_name(bs));
> -ret = migrate_add_blocker(&s->migration_blocker, errp);
> +ret = migrate_add_blocker_normal(&s->migration_blocker, errp);
>  if (ret < 0) {
>  goto fail;
>  }
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




Re: [PATCH V2 4/6] cpr: relax vhost migration blockers

2023-10-31 Thread Peter Xu
Copy qemu-block, virtio

On Wed, Oct 25, 2023 at 12:44:27PM -0700, Steve Sistare wrote:
> vhost blocks migration if logging is not supported to track dirty
> memory, and vhost-user blocks it if the log cannot be saved to a shm fd.
> 
> vhost-vdpa blocks migration if both hosts do not support all the device's
> features using a shadow VQ, for tracking requests and dirty memory.
> 
> vhost-scsi blocks migration if storage cannot be shared across hosts,
> or if state cannot be migrated.
> 
> None of these conditions apply if the old and new qemu processes do
> not run concurrently, and if new qemu starts on the same host as old,
> which is the case for cpr.
> 
> Narrow the scope of these blockers so they only apply to normal mode.
> They will not block cpr modes when they are added in subsequent patches.
> 
> No functional change until a new mode is added.
> 
> Signed-off-by: Steve Sistare 
> Reviewed-by: Juan Quintela 
> ---
>  hw/scsi/vhost-scsi.c | 2 +-
>  hw/virtio/vhost.c| 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
> index 14e23cc..bf528d5 100644
> --- a/hw/scsi/vhost-scsi.c
> +++ b/hw/scsi/vhost-scsi.c
> @@ -208,7 +208,7 @@ static void vhost_scsi_realize(DeviceState *dev, Error 
> **errp)
>  "When external environment supports it (Orchestrator 
> migrates "
>  "target SCSI device state or use shared storage over 
> network), "
>  "set 'migratable' property to true to enable migration.");
> -if (migrate_add_blocker(&vsc->migration_blocker, errp) < 0) {
> +if (migrate_add_blocker_normal(&vsc->migration_blocker, errp) < 0) {
>  goto free_virtio;
>  }
>  }
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index d737671..f5e9625 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -1527,7 +1527,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>  }
>  
>  if (hdev->migration_blocker != NULL) {
> -r = migrate_add_blocker(&hdev->migration_blocker, errp);
> +r = migrate_add_blocker_normal(&hdev->migration_blocker, errp);
>  if (r < 0) {
>  goto fail_busyloop;
>  }
> -- 
> 1.8.3.1
> 

-- 
Peter Xu




Re: [PATCH v2 5/9] migration: check required subsections are loaded, once

2023-10-24 Thread Peter Xu
On Tue, Oct 24, 2023 at 12:41:56PM +0200, Juan Quintela wrote:
> > @@ -509,6 +538,13 @@ static int vmstate_subsection_load(QEMUFile *f, const 
> > VMStateDescription *vmsd,
> >  }
> >  }
> >  
> > +for (i = 0; i < n; i++) {
> > +if (!visited[i] && vmstate_section_needed(vmsd->subsections[i], 
> > opaque)) {
> > +trace_vmstate_subsection_load_bad(vmsd->name, 
> > vmsd->subsections[i]->name, "(not visited)");
> > +return -ENOENT;
> > +}
> > +}
> > +
> >  trace_vmstate_subsection_load_good(vmsd->name);
> >  return 0;
> >  }
> 
> This part is the only one where I can see there could be some
> discussion.  So I will wait to see what other people think.

Previous email:

https://lore.kernel.org/qemu-devel/ZR2P1RbxCfBdYBaQ@x1n/

I still think it is safer not to fail unless it's justified that we won't
hit surprises in the ->needed() hooks.  There are a lot of them, so I assume
it's non-trivial to justify.

We can switch the tracepoint into error_report() in that case, though, as
long as it won't fail the migration.

Thanks,

-- 
Peter Xu




Re: [PATCH 4/5] RFC: migration: check required subsections are loaded, once

2023-10-04 Thread Peter Xu
On Tue, Sep 26, 2023 at 07:59:24PM +0400, marcandre.lur...@redhat.com wrote:
> @@ -484,6 +513,13 @@ static int vmstate_subsection_load(QEMUFile *f, const 
> VMStateDescription *vmsd,
>  }
>  }
>  
> +for (i = 0; i < n; i++) {
> +if (!visited[i] && vmstate_save_needed(vmsd->subsections[i], 
> opaque)) {
> +trace_vmstate_subsection_load_bad(vmsd->name, 
> vmsd->subsections[i]->name, "(not visited)");
> +return -ENOENT;
> +}
> +}

One thing that might be tricky about calling needed() on the loading side is
that, IMHO, the needed() hooks were normally designed to be called only on a
complete VM state.  IOW, I think they can reference any machine/device
state, or whatever variable, assuming all of them contain valid data.

But the load side may not yet contain everything..  We can guarantee here
that we've loaded the full device state of this one (subsections should be
the last to come), plus everything loaded so far.  But what if it references
something else outside of what we've loaded?  It looks possible that some
special .needed() hook could return something unexpected.

I assume most needed() hooks are fine (and it does look like we can find
bugs with this, which means this might prove useful already, at least in
some form or another).  I just worry that something will start to break
after we become strict on this.

Maybe.. make the check only throw warnings, but not yet fail the migration?

> +
>  trace_vmstate_subsection_load_good(vmsd->name);
>  return 0;
>  }
> -- 
> 2.41.0
> 
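
To make the concern concrete, here is a hypothetical .needed() hook (all
names invented) of the kind that would be risky to call on the incoming
side:

    static bool foo_subsection_needed(void *opaque)
    {
        FooState *s = opaque;

        /* Fine: only consults the device's own, fully-loaded state. */
        if (!s->feature_enabled) {
            return false;
        }
        /* Risky on load: machine-wide state may not be populated yet. */
        return some_machine_wide_condition();
    }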

-- 
Peter Xu




Re: [PATCH 0/5] RFC: migration: check required entries and sections are loaded

2023-10-04 Thread Peter Xu
On Tue, Sep 26, 2023 at 07:59:20PM +0400, marcandre.lur...@redhat.com wrote:
> Marc-André Lureau (5):
>   block/fdc: 'phase' is not needed on load
>   virtio: make endian_needed() work during loading
>   net/slirp: use different IDs for each instance

The first 3 patches are bug fixes, am I right?

If so, it'll be great if they can be acked (or even picked up earlier?) by
subsystem maintainers, and copied to stable if justified.

Thanks,

-- 
Peter Xu




Re: introducing vrc :)

2022-04-21 Thread Peter Xu
On Thu, Apr 21, 2022 at 05:04:52PM +0200, Paolo Bonzini wrote:
> On 4/20/22 20:12, Peter Xu wrote:
> > > a while ago I looked at tools that could be used too build a call graph.
> > > The simplest but most effective that I found was a small Perl program
> > > (called "egypt", which is rot13 for "rtlcg" aka RTL call graph) that used
> > > the GCC dumps to build the graph.
> > 
> > Do you have any plan to put it into some git repository?
> 
> Good idea, it's now at https://github.com/bonzini/vrc.
> 
> It can be installed using "pip install --user ." after checking out the
> repository.

Looks great.  Another trivial request: would you like to add a sentence to
the 1st paragraph of README.md showing why it's named "vrc"?

I'm just curious :-D

Thanks again!

-- 
Peter Xu




Re: introducing vrc :)

2022-04-20 Thread Peter Xu
On Tue, Apr 19, 2022 at 04:39:13PM +0200, Paolo Bonzini wrote:
> Hi all,
> 
> a while ago I looked at tools that could be used to build a call graph.
> The simplest but most effective that I found was a small Perl program
> (called "egypt", which is rot13 for "rtlcg" aka RTL call graph) that used
> the GCC dumps to build the graph.

Paolo,

Do you have any plan to put it into some git repository?

Thanks!

-- 
Peter Xu




Re: [PATCH v9 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-26 Thread Peter Xu
> +ret = qio_channel_flush(p->c, &err);
> +if (ret < 0) {
> +error_report_err(err);
> +return -1;
> +}
> +}
>  }
>  for (i = 0; i < migrate_multifd_channels(); i++) {
>  MultiFDSendParams *p = &multifd_send_state->params[i];
> @@ -688,10 +708,9 @@ static void *multifd_send_thread(void *opaque)
>  p->iov[0].iov_base = p->packet;
>  }
>  
> -ret = qio_channel_writev_all(p->c, p->iov + iov_offset,
> - p->iovs_num - iov_offset,
> - &local_err);
> -
> +ret = qio_channel_writev_full_all(p->c, p->iov + iov_offset,
> +  p->iovs_num - iov_offset, NULL,
> +  0, p->write_flags, &local_err);

I kind of agree with Dan on the previous patch - this iov_offset is
confusing; better to drop it.

>  if (ret != 0) {
>  break;
>  }
> @@ -920,6 +939,13 @@ int multifd_save_setup(Error **errp)
>  /* We need one extra place for the packet header */
>  p->iov = g_new0(struct iovec, page_count + 1);
>  p->normal = g_new0(ram_addr_t, page_count);
> +
> +if (migrate_use_zero_copy_send()) {
> +p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
> +} else {
> +p->write_flags = 0;
> +}
> +
>  socket_send_channel_create(multifd_new_send_channel_async, p);
>  }
>  
> diff --git a/migration/socket.c b/migration/socket.c
> index 3754d8f72c..4fd5e85f50 100644
> --- a/migration/socket.c
> +++ b/migration/socket.c
> @@ -79,8 +79,9 @@ static void socket_outgoing_migration(QIOTask *task,
>  
>  trace_migration_socket_outgoing_connected(data->hostname);
>  
> -if (migrate_use_zero_copy_send()) {
> -error_setg(&err, "Zero copy send not available in migration");
> +if (migrate_use_zero_copy_send() &&
> +!qio_channel_has_feature(sioc, QIO_CHANNEL_FEATURE_WRITE_ZERO_COPY)) {
> +error_setg(&err, "Zero copy send feature not detected in host kernel");
>  }
>  
>  out:
> -- 
> 2.36.0
> 

-- 
Peter Xu




Re: [PATCH v9 5/7] multifd: multifd_send_sync_main now returns negative on error

2022-04-26 Thread Peter Xu
On Mon, Apr 25, 2022 at 06:50:54PM -0300, Leonardo Bras wrote:
> Even though multifd_send_sync_main() currently emits error_reports, its
> callers don't really check it before continuing.
> 
> Change multifd_send_sync_main() to return -1 on error and 0 on success.
> Also change all its callers to make use of this change and possibly fail
> earlier.
> 
> (This change is important to the next patch on the multifd zero copy
> implementation, to make sure an error in the zero-copy flush does not go
> unnoticed.)
> 
> Signed-off-by: Leonardo Bras 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v10 6/7] multifd: Send header packet without flags if zero-copy-send is enabled

2022-04-26 Thread Peter Xu
On Tue, Apr 26, 2022 at 08:06:55PM -0300, Leonardo Bras wrote:
> Since d48c3a0445 ("multifd: Use a single writev on the send side"),
> sending the header packet and the memory pages happens in the same
> writev, which can potentially make the migration faster.
> 
> Using channel-socket as example, this works well with the default copying
> mechanism of sendmsg(), but with zero-copy-send=true, it will cause
> the migration to often break.
> 
> This happens because the header packet buffer gets reused quite often,
> and there is a high chance that by the time the MSG_ZEROCOPY mechanism gets
> to send the buffer, it has already changed, sending the wrong data and
> causing the migration to abort.
> 
> It means that, as it is, the buffer for the header packet is not suitable
> for sending with MSG_ZEROCOPY.
> 
> In order to enable zero copy for multifd, send the header packet on an
> individual write(), without any flags, and the remaining pages with a
> writev(), as it was happening before. This only changes how a migration
> with zero-copy-send=true works, not changing any current behavior for
> migrations with zero-copy-send=false.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  migration/multifd.c | 23 ---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 15fb668e64..07b2e92d8d 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -617,6 +617,7 @@ static void *multifd_send_thread(void *opaque)
>  MultiFDSendParams *p = opaque;
>  Error *local_err = NULL;
>  int ret = 0;
> +bool use_zero_copy_send = migrate_use_zero_copy_send();
>  
>  trace_multifd_send_thread_start(p->id);
>  rcu_register_thread();
> @@ -639,9 +640,14 @@ static void *multifd_send_thread(void *opaque)
>  if (p->pending_job) {
>  uint64_t packet_num = p->packet_num;
>  uint32_t flags = p->flags;
> -p->iovs_num = 1;
>  p->normal_num = 0;
>  
> +if (use_zero_copy_send) {
> +p->iovs_num = 0;
> +} else {
> +p->iovs_num = 1;
> +}
> +
>  for (int i = 0; i < p->pages->num; i++) {
>  p->normal[p->normal_num] = p->pages->offset[i];
>  p->normal_num++;
> @@ -665,8 +671,19 @@ static void *multifd_send_thread(void *opaque)
>  trace_multifd_send(p->id, packet_num, p->normal_num, flags,
> p->next_packet_size);
>  
> -p->iov[0].iov_len = p->packet_len;
> -p->iov[0].iov_base = p->packet;
> +if (use_zero_copy_send) {
> +/* Send header first, without zerocopy */
> +ret = qio_channel_write_all(p->c, (void *)p->packet,
> +p->packet_len, &local_err);
> +if (ret != 0) {
> +break;
> +}
> +

Extra but useless newline.. but not worth a repost.  Looks good here:

Reviewed-by: Peter Xu 

Thanks,

> +} else {
> +/* Send header using the same writev call */
> +p->iov[0].iov_len = p->packet_len;
> +p->iov[0].iov_base = p->packet;
> +}
>  
>  ret = qio_channel_writev_all(p->c, p->iov, p->iovs_num,
>   &local_err);
> -- 
> 2.36.0
> 

-- 
Peter Xu




Re: [PATCH v10 7/7] multifd: Implement zero copy write in multifd migration (multifd-zero-copy)

2022-04-26 Thread Peter Xu
On Tue, Apr 26, 2022 at 08:06:56PM -0300, Leonardo Bras wrote:
> Implement zero copy send on nocomp_send_write(), by making use of QIOChannel
> writev + flags & flush interface.
> 
> Change multifd_send_sync_main() so flush_zero_copy() can be called
> after each iteration in order to make sure all dirty pages are sent before
> a new iteration is started. It will also flush at the beginning and at the
> end of migration.
> 
> Also make it return -1 if flush_zero_copy() fails, in order to cancel
> the migration process, and avoid resuming the guest in the target host
> without receiving all current RAM.
> 
> This will work fine on RAM migration because the RAM pages are not
> usually freed, and there is no problem with changing the page contents
> between writev_zero_copy() and the actual sending of the buffer, because
> this change will dirty the page and cause it to be re-sent on a next
> iteration anyway.
> 
> A lot of locked memory may be needed in order to use multifd migration
> with zero-copy enabled, so disabling the feature should be necessary for
> low-privileged users trying to perform multifd migrations.
> 
> Signed-off-by: Leonardo Bras 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-01 Thread Peter Xu
On Wed, Jun 01, 2022 at 05:37:10PM +0800, 徐闯 wrote:
> 
> On 2022/5/13 2:28 PM, Leonardo Bras wrote:
> > For CONFIG_LINUX, implement the new zero copy flag and the optional callback
> > io_flush on QIOChannelSocket, but enables it only when MSG_ZEROCOPY
> > feature is available in the host kernel, which is checked on
> > qio_channel_socket_connect_sync()
> > 
> > qio_channel_socket_flush() was implemented by counting how many times
> > sendmsg(...,MSG_ZEROCOPY) was successfully called, and then reading the
> > socket's error queue, in order to find how many of them finished sending.
> > Flush will loop until those counters are the same, or until some error 
> > occurs.
> > 
> > Notes on using writev() with QIO_CHANNEL_WRITE_FLAG_ZERO_COPY:
> > 1: Buffer
> > - As MSG_ZEROCOPY tells the kernel to use the same user buffer to avoid 
> > copying,
> > some caution is necessary to avoid overwriting any buffer before it's sent.
> > If something like this happens, a newer version of the buffer may be
> > sent instead.
> > - If this is a problem, it's recommended to call qio_channel_flush() before 
> > freeing
> > or re-using the buffer.
> > 
> > 2: Locked memory
> > - When using MSG_ZERCOCOPY, the buffer memory will be locked after queued, 
> > and
> > unlocked after it's sent.
> > - Depending on the size of each buffer, and how often it's sent, it may 
> > require
> > a larger amount of locked memory than usually available to non-root user.
> > - If the required amount of locked memory is not available, writev_zero_copy
> > will return an error, which can abort an operation like migration,
> > - Because of this, when an user code wants to add zero copy as a feature, it
> > requires a mechanism to disable it, so it can still be accessible to less
> > privileged users.
> > 
> > Signed-off-by: Leonardo Bras 
> > Reviewed-by: Peter Xu 
> > Reviewed-by: Daniel P. Berrangé 
> > Reviewed-by: Juan Quintela 
> > ---
> >   include/io/channel-socket.h |   2 +
> >   io/channel-socket.c | 116 ++--
> >   2 files changed, 114 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/io/channel-socket.h b/include/io/channel-socket.h
> > index e747e63514..513c428fe4 100644
> > --- a/include/io/channel-socket.h
> > +++ b/include/io/channel-socket.h
> > @@ -47,6 +47,8 @@ struct QIOChannelSocket {
> >   socklen_t localAddrLen;
> >   struct sockaddr_storage remoteAddr;
> >   socklen_t remoteAddrLen;
> > +ssize_t zero_copy_queued;
> > +ssize_t zero_copy_sent;
> >   };
> Hi, Leonardo. I've also been paying attention to the use of MSG_ZEROCOPY
> in live migration recently. I noticed that you defined a member
> `zero_copy_queued` in struct QIOChannelSocket, but I can't find where the
> value of this member is ever changed in your patch. Can you clarify that
> for me?
> 

Good point.. it should probably be increased when queuing the pages.  We'd
better fix it up, or it seems the flush() will be literally a no-op..

Two things in qio_channel_socket_flush() we can do to make sure it'll work
as expected, imo:

  1) make ret=-1 the initial value, rather than 1 - we only check for
 negative errors in the caller, so we could have missed a positive "1"

  2) add a tracepoint into the loop of updating zero_copy_sent

Leo, what's your take?

Thanks,
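
Roughly, the kind of fix being suggested is the fragment below (placement
illustrative, not the actual patch; the field name comes from the struct
quoted above):

    /* After each successful zero-copy send, remember that a completion
     * notification is owed, so flush() has something to wait for. */
    ret = sendmsg(fd, &msg, sflags);
    if (ret >= 0 && (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY)) {
        sioc->zero_copy_queued++;
    }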

-- 
Peter Xu




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-08 Thread Peter Xu
On Wed, Jun 08, 2022 at 02:37:28AM -0300, Leonardo Bras Soares Passos wrote:
> (1) is not an option, as the interface currently uses ret=1 to make
> sure MSG_ZEROCOPY is getting used,
> I added that so the user of qio_channel can switch off zero-copy if
> it's not getting used, and save some cpu.

Yes, (1) is not, but could you explain what you mean by making sure
MSG_ZEROCOPY is being used?  Why is it relevant to the retval here?

I just figured it's a bit weird to return >0 here in flush().

> 
> (2) is not a problem, but I fail to see how useful that would be. Is
> the idea manually keeping track of flush happening?

Yes, if we can check this it'll be good enough for me.  The tracepoint could
also help in some cases in the future, to monitor the behavior of the
kernel's MSG_ERRQUEUE, but if you don't like it then it's okay.

-- 
Peter Xu




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-08 Thread Peter Xu
On Wed, Jun 08, 2022 at 03:14:36PM -0300, Leonardo Bras Soares Passos wrote:
> On Wed, Jun 8, 2022 at 8:41 AM Peter Xu  wrote:
> >
> > On Wed, Jun 08, 2022 at 02:37:28AM -0300, Leonardo Bras Soares Passos wrote:
> > > (1) is not an option, as the interface currently uses ret=1 to make
> > > sure MSG_ZEROCOPY is getting used,
> > > I added that so the user of qio_channel can switch off zero-copy if
> > > it's not getting used, and save some cpu.
> >
> > Yes (1) is not, but could you explain what do you mean by making sure
> > MSG_ZEROCOPY being used?  Why is it relevant to the retval here?
> 
> If sendmsg() is called with MSG_ZEROCOPY, and everything is configured
> correctly, the kernel will attempt to send the buffer using zero-copy.
> 
> Even with the right configuration on a recent enough kernel, there are
> factors that can prevent zero-copy from happening, and the kernel will
> fall back to the copying mechanism.
> An example being the net device not supporting 'Scatter-Gather'
> feature (NETIF_F_SG).
> 
> When this happens, there is an overhead for 'trying zero-copy first',
> instead of just opting for the copying mechanism.
> 
> In a previous iteration of the patchset, it was made clear that it's
> desirable to detect when the kernel falls back to copying mechanism,
> so the user of 'QIOChannelSocket' can switch to copying and avoid the
> overhead. This was done by the return value of flush(), which is 1 if
> that occurs.

Two questions..

  1) When that happens, will MSG_ERRQUEUE keep working just as if zerocopy
 were functional?

 If the answer is yes, I don't see how ret=1 will ever be
 returned.. because we'll also go into the same loop in
 qio_channel_socket_flush() anyway.

 If the answer is no, then since we'll have a non-zero zero_copy_queued,
 will the loop in qio_channel_socket_flush() become a dead loop?  How
 could it return?

  2) Even if we have the correct ret=1 returned when that happens, which
 caller detects that ret==1 and warns the admin?

Thanks,

-- 
Peter Xu




Re: [External] [PATCH v13 3/8] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-06-13 Thread Peter Xu
On Mon, Jun 13, 2022 at 05:58:44PM -0300, Leonardo Bras Soares Passos wrote:
> Hello Peter,
> 
> On Wed, Jun 8, 2022 at 5:23 PM Peter Xu  wrote:
> [...]
> > > In a previous iteration of the patchset, it was made clear that it's
> > > desirable to detect when the kernel falls back to copying mechanism,
> > > so the user of 'QIOChannelSocket' can switch to copying and avoid the
> > > overhead. This was done by the return value of flush(), which is 1 if
> > > that occurs.
> >
> > Two questions..
> >
> >   1) When that happens, will MSG_ERRQUEUE keeps working just like zerocopy
> >  is functional?
> 
> I am not sure about what exactly you meant by 'like zerocopy is
> functional', but the
> idea is that reading from MSG_ERRQUEUE should return a msg for each sendmsg
> syscall with MSG_ZEROCOPY that previously happened. This does not depend on
> the outcome (like falling back to the copying mechanism).
> btw, most of those messages may be batched to reduce overhead.
> 
> At some point, zero-copy may fail, and fall back to copying, so in
> those messages
> an error code SO_EE_CODE_ZEROCOPY_COPIED can be seen. Having only
> those messages in a flush will trigger the returning of 1 from the
> flush function.

Ah I think I missed the "reset ret==0 when !SO_EE_CODE_ZEROCOPY_COPIED"
path..  Sorry.

> 
> >
> >  If the answer is yes, I don't see how ret=1 will ever be
> >  returned.. because we'll also go into the same loop in
> >  qio_channel_socket_flush() anyway.
> 
> 
> We set ret to 1 at function entry, and then for each message in the
> MSG_ERRQUEUE we test if it has an error code different than
> SO_EE_CODE_ZEROCOPY_COPIED.  If it ever has a different error code, we
> set ret=0.
> 
> So, in our previous example, if we have a net device not supporting the
> 'Scatter-Gather' feature (NETIF_F_SG), every error message will be
> SO_EE_CODE_ZEROCOPY_COPIED, and it will return 1.
> 
> 
> >
> >  If the answer is no, then since we'll have non-zero zero_copy_queued,
> >  will the loop in qio_channel_socket_flush() go into a dead one?  How
> >  could it return?
> 
> No, because it will go through all packets sent with MSG_ZEROCOPY,
> including the ones that fell back to copying, so the counter should be
> fine.  If any code disables zero-copy, it will both stop sending stuff
> with MSG_ZEROCOPY and stop flushing, so it should be fine.
> 
> >
> >   2) Even if we have the correct ret=1 returned when that happens, which
> >  caller is detecting that ret==1 and warn the admin?
> >
> 
> No caller is using that right now.
> It's supposed to be a QIOChannel interface feature, and any
> user/implementation could use that information to warn if zero-copy is
> not being used, fall back to copying directly (to avoid the overhead of
> testing zero-copy), or even use it to cancel the sending if wanted.
> 
> It was a suggestion of Daniel on top of [PATCH v5 1/6] IIRC.

OK the detection makes sense, thanks for the details.

Now I'm wondering whether we should already have warned the admin if
zero-copy send is not fully functional in live migration.  Should we add an
error_report_once() somewhere for the ret==1 case already?  After all, the
user specified zero_copy_send=true explicitly.  Did I miss something again?
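To make it concrete, something like the following is what I have in mind
(just a sketch; the exact call site, e.g. right after a qio_channel_flush()
in the multifd sync path, is up for discussion):

    /* Sketch only: report once when the kernel silently fell back to copying */
    ret = qio_channel_flush(ioc, &local_err);
    if (ret < 0) {
        /* a real error; propagate local_err as usual */
    } else if (ret == 1) {
        error_report_once("zero-copy send requested but the kernel fell back"
                          " to copying; check host support (e.g. NETIF_F_SG)");
    }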

Thanks,

-- 
Peter Xu




Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-05-04 Thread Peter Xu
On Wed, May 04, 2022 at 04:18:31PM -0300, Leonardo Bras wrote:
> +/*
> + * Zero-copy defines below are included to avoid breaking builds on systems
> + * that don't support MSG_ZEROCOPY, while keeping the functions more readable
> + * (without a lot of ifdefs).
> + */
> +#ifndef MSG_ZEROCOPY
> +#define MSG_ZEROCOPY 0x400
> +#endif
> +#ifndef SO_ZEROCOPY
> +#define SO_ZEROCOPY 60
> +#endif

So this will define these two values on e.g. FreeBSD, while they do not
make sense at all there, because these numbers are pure magic and
meaningless outside Linux..

I don't think it's anything dangerous, but IMHO it's another way of being
unclean compared to using some "#ifdef"s.  Compared to this approach, the
"use #ifdef" approach actually looks slightly cleaner to me. :)

Let's wait for some other inputs.

-- 
Peter Xu




Re: [PATCH v12 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-05-07 Thread Peter Xu
On Fri, May 06, 2022 at 10:57:54PM -0300, Leonardo Bras wrote:
> diff --git a/io/channel-socket.c b/io/channel-socket.c
> index 05c425abb8..f03a068f25 100644
> --- a/io/channel-socket.c
> +++ b/io/channel-socket.c
> @@ -25,9 +25,18 @@
>  #include "io/channel-watch.h"
>  #include "trace.h"
>  #include "qapi/clone-visitor.h"
> +#ifdef CONFIG_LINUX
> +#include 
> +#include 
> +
> +#if (defined(MSG_ZEROCOPY) && defined(SO_ZEROCOPY))
> +#define QEMU_MSG_ZEROCOPY
> +#endif
> +#endif
>  
>  #define SOCKET_MAX_FDS 16
>  
> +

This line can be dropped when merging.

>  SocketAddress *
>  qio_channel_socket_get_local_address(QIOChannelSocket *ioc,
>   Error **errp)

This does look nicer, imho. :)
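With the guard in place, the write path can then read like this (a rough
sketch from memory, not the exact patch):

    ssize_t ret;
    int sflags = 0;

    #ifdef QEMU_MSG_ZEROCOPY
        if (flags & QIO_CHANNEL_WRITE_FLAG_ZERO_COPY) {
            sflags = MSG_ZEROCOPY;    /* only compiled in when supported */
        }
    #endif

        ret = sendmsg(sioc->fd, &msg, sflags);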

Thanks!

-- 
Peter Xu




Re: [PATCH v11 2/7] QIOChannelSocket: Implement io_writev zero copy flag & io_flush for CONFIG_LINUX

2022-05-05 Thread Peter Xu
On Thu, May 05, 2022 at 01:20:04AM -0300, Leonardo Bras Soares Passos wrote:
> (2) is already implemented in v11, but I have no issue implementing
> (1) for v12 if it's ok to create this 'global' define.

Dan's suggestion in the other thread sounds good to me with current
approach, on having CONFIG_LINUX to wrap the defines.

But note that it's still prone to change in the future, e.g., when other
qemu modules start to use MSG_ZEROCOPY, we'll probably at least move the
defines into some other headers.  We'll probably need to clean things up
when that happens.

But I won't strongly ask for that - we can leave that for later.

Thanks,

-- 
Peter Xu




Re: QMP (without OOB) function running in thread different from the main thread as part of aio_poll

2023-04-27 Thread Peter Xu
vm.c:2954
> >> #15 0x558ed3e17fb1 in snapshot_save_job_bh (opaque=0x558ed68c5170) at 
> >> ../migration/savevm.c:3253
> >> #16 0x558ed42f050a in aio_bh_call (bh=0x558ed671ae00) at 
> >> ../util/async.c:155
> >> #17 0x558ed42f0615 in aio_bh_poll (ctx=0x558ed5c62910) at 
> >> ../util/async.c:184
> >> #18 0x558ed42d47b8 in aio_poll (ctx=0x558ed5c62910, blocking=true) at 
> >> ../util/aio-posix.c:721
> >> #19 0x558ed412df1c in bdrv_poll_co (s=0x7f8ffcff3eb0) at 
> >> /home/febner/repos/qemu/block/block-gen.h:43
> >> #20 0x558ed4130c3a in blk_pwrite (blk=0x558ed5ed4f60,
> >> offset=230912, bytes=512, buf=0x7f8ffc438600, flags=0) at
> >> block/block-gen.c:1650
> >> #21 0x558ed3ba9078 in pflash_update (pfl=0x558ed5eb7b30, 
> >> offset=230912, size=1) at ../hw/block/pflash_cfi01.c:394
> >> #22 0x558ed3ba9749 in pflash_write (pfl=0x558ed5eb7b30,
> >> offset=231232, value=0, width=1, be=0) at
> >> ../hw/block/pflash_cfi01.c:522
> >> #23 0x558ed3ba9cda in pflash_mem_write_with_attrs
> >> (opaque=0x558ed5eb7b30, addr=231232, value=0, len=1, attrs=...) at
> >> ../hw/block/pflash_cfi01.c:681
> >> #24 0x558ed402a36a in memory_region_write_with_attrs_accessor
> >> (mr=0x558ed5eb7ef0, addr=231232, value=0x7f8ffcff40c8, size=1,
> >> shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
> >> #25 0x558ed402a4a9 in access_with_adjusted_size (addr=231232,
> >> value=0x7f8ffcff40c8, size=1, access_size_min=1, access_size_max=4,
> >> access_fn=0x558ed402a270 ,
> >> mr=0x558ed5eb7ef0, attrs=...) at ../softmmu/memory.c:555
> >> #26 0x558ed402d5de in memory_region_dispatch_write
> >> (mr=0x558ed5eb7ef0, addr=231232, data=0, op=MO_8, attrs=...) at
> >> ../softmmu/memory.c:1522
> >> #27 0x558ed403a6f4 in flatview_write_continue
> >> (fv=0x558ed66d62c0, addr=4291004224, attrs=..., ptr=0x7f9029957028,
> >> len=1, addr1=231232, l=1, mr=0x558ed5eb7ef0) at
> >> ../softmmu/physmem.c:2641
> >> #28 0x558ed403a857 in flatview_write (fv=0x558ed66d62c0,
> >> addr=4291004224, attrs=..., buf=0x7f9029957028, len=1) at
> >> ../softmmu/physmem.c:2683
> >> #29 0x558ed403ac07 in address_space_write (as=0x558ed4ca2b20
> >> , addr=4291004224, attrs=...,
> >> buf=0x7f9029957028, len=1) at ../softmmu/physmem.c:2779
> >> #30 0x558ed403ac74 in address_space_rw (as=0x558ed4ca2b20
> >> , addr=4291004224, attrs=...,
> >> buf=0x7f9029957028, len=1, is_write=true) at
> >> ../softmmu/physmem.c:2789
> >> #31 0x558ed40cea88 in kvm_cpu_exec (cpu=0x558ed622a910) at 
> >> ../accel/kvm/kvm-all.c:2989
> >> #32 0x558ed40d179a in kvm_vcpu_thread_fn (arg=0x558ed622a910) at 
> >> ../accel/kvm/kvm-accel-ops.c:51
> >> #33 0x558ed42d925f in qemu_thread_start (args=0x558ed5c68c80) at 
> >> ../util/qemu-thread-posix.c:541
> >> #34 0x7f9028ab7ea7 in start_thread (arg=) at 
> >> pthread_create.c:477
> >> #35 0x7f9027c18a2f in clone () at 
> >> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Totally unfamiliar with block jobs, but... does it mean that
snapshot_*_job_bh()s should just always make sure the BQL is taken?
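E.g. something like this (a blind sketch - I haven't checked what else the
BH relies on, and the save_snapshot() signature here is from memory):

    static void snapshot_save_job_bh(void *opaque)
    {
        Job *job = opaque;
        SnapshotJob *s = container_of(job, SnapshotJob, common);

        /* sketch: hold the BQL unconditionally across the snapshot work */
        qemu_mutex_lock_iothread();
        s->ret = save_snapshot(s->tag, false, s->vmstate, true, s->devices,
                               s->errp);
        qemu_mutex_unlock_iothread();
    }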

Thanks,

-- 
Peter Xu




Re: [PULL 00/26] Next patches

2023-02-06 Thread Peter Xu
On Sat, Feb 04, 2023 at 10:19:34AM +, Peter Maydell wrote:
> On Thu, 2 Feb 2023 at 16:07, Juan Quintela  wrote:
> >
> > The following changes since commit deabea6e88f7c4c3c12a36ee30051c6209561165:
> >
> >   Merge tag 'for_upstream' of 
> > https://git.kernel.org/pub/scm/virt/kvm/mst/qemu into staging (2023-02-02 
> > 10:10:07 +)
> >
> > are available in the Git repository at:
> >
> >   https://gitlab.com/juan.quintela/qemu.git tags/next-pull-request
> >
> > for you to fetch changes up to 5ee6d3d1eeccd85aa2a835e82b8d9e1b4f7441e1:
> >
> >   migration: check magic value for deciding the mapping of channels 
> > (2023-02-02 17:04:16 +0100)
> >
> > 
> > Migration PULL request, new try
> 
> Fails to build on anything that isn't Linux:
> 
> In file included from ../migration/postcopy-ram.c:40:
> /private/var/folders/76/zy5ktkns50v6gt5g8r0sf6scgn/T/cirrus-ci-build/include/qemu/userfaultfd.h:18:10:
> fatal error: 'linux/userfaultfd.h' file not found

Oops, my fault.

Juan, please feel free to drop patch "util/userfaultfd: Add uffd_open()".
I'll respin with the whole set.

-- 
Peter Xu




Re: [PATCH v2 04/20] qemu-file: We only call qemu_file_transferred_* on the sending side

2023-06-14 Thread Peter Xu
On Tue, Jun 13, 2023 at 06:02:05PM +0200, Juan Quintela wrote:
> Peter Xu  wrote:
> > On Tue, May 30, 2023 at 08:39:25PM +0200, Juan Quintela wrote:
> >> Remove the increase in qemu_file_fill_buffer() and add asserts to
> >> qemu_file_transferred* functions.
> >> 
> >> Signed-off-by: Juan Quintela 
> >
> > The read side accounting does look a bit weird and never caught my notice..
> >
> > Maybe worth also touching up the documentation of QEMUFile::total_transferred
> > to clarify what it accounts?
> >
> > Reviewed-by: Peter Xu 
> >
> > Though when I'm looking at the counters (didn't follow every single recent
> > patch on this..), I found that now reading the transferred value is actually
> > more expensive - qemu_file_transferred() needs flushing, and even the fast
> > version, qemu_file_transferred_fast(), loops over all possible iovs,
> > which can be as large as MAX_IOV_SIZE==64.
> >
> > To be explicit, I _think_ for each guest page we now need to flush...
> >
> >   ram_save_iterate
> > migration_rate_exceeded
> >   migration_transferred_bytes
> > qemu_file_transferred
> >
> > I hope I'm wrong..
> 
> See patch 7:
> 
> diff --git a/migration/migration-stats.c b/migration/migration-stats.c
> index 79eea8d865..1696185694 100644
> --- a/migration/migration-stats.c
> +++ b/migration/migration-stats.c
> @@ -62,7 +62,7 @@ uint64_t migration_transferred_bytes(QEMUFile *f)
>  {
>  uint64_t multifd = stat64_get(&mig_stats.multifd_bytes);
>  uint64_t rdma = stat64_get(&mig_stats.rdma_bytes);
> -uint64_t qemu_file = qemu_file_transferred(f);
> +uint64_t qemu_file = stat64_get(&mig_stats.qemu_file_transferred);
>  
>  trace_migration_transferred_bytes(qemu_file, multifd, rdma);
>  return qemu_file + multifd + rdma;

If this is a known regression, should we make a first patchset to fix it
and give it higher priority to merge?

It seems this isn't even mentioned in the cover letter.. while IMHO it's
the most important bit to have in it..

> 
> 
> > Does it mean that perhaps we simply need "sent and put into send buffer"
> > more than "what really got transferred"?  So I start to wonder what's the
> > original purpose of this change, and which one is better..
> 
> That is basically what patch 5 and 6 do O:-)
> 
> Problem is arriving to something that is bisectable (for correctness)
> and is easy to review.
> 
> And yes, my choices can be different from the ones tat you do.
> 
> The other reason for the small patches is that:
> a - sometimes I found a different test where things broke, and have to
> bisect
> b - small patches are much easier to rebase (that I am doing a lot)

That's okay.  Thanks,

-- 
Peter Xu




Re: [PATCH v2 06/20] qemu_file: total_transferred is not used anymore

2023-06-14 Thread Peter Xu
On Tue, May 30, 2023 at 08:39:27PM +0200, Juan Quintela wrote:
> Signed-off-by: Juan Quintela 
> ---
>  migration/qemu-file.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index eb0497e532..6b6deea19b 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -41,9 +41,6 @@ struct QEMUFile {
>  QIOChannel *ioc;
>  bool is_writable;
>  
> -/* The sum of bytes transferred on the wire */
> -uint64_t total_transferred;
> -
>  int buf_index;
>  int buf_size; /* 0 when writing */
>  uint8_t buf[IO_BUF_SIZE];
> @@ -287,7 +284,6 @@ void qemu_fflush(QEMUFile *f)
>  qemu_file_set_error_obj(f, -EIO, local_error);
>  } else {
>  uint64_t size = iov_size(f->iov, f->iovcnt);
> -f->total_transferred += size;

I think this patch is another example of how the way the patches are split
can sometimes add more complexity to the review...

Here we removed a variable update, which seems all fine since it's not
used anywhere.  But it also means the current code base (before this patch
is applied) already doesn't make sense, because it contains this useless
addition.  So IMHO some previous patch just got it wrong.

I think it's caused by a wrong split of patches, where each patch stops
making much sense as a standalone one.

I can go back and try to find whatever patch on the list that will explain
this.  But it'll also go into the git log.  Anyone reading this later will
be confused once again, and it'll be even harder for them to figure out
what happened.

Do you think we could reorganize the patches so that each single patch
explains itself?

The other thing is about priority of patches - I still have ~80 patches
pending review on migration alone.. Do you think it makes sense to pick
up the important ones first and merge them with higher priority?

What I have in mind are:

  - The regression you mentioned with qemu_fflush() during ram save (I hope I
understood the problem right, though...).

  - The slowness of migration-test.  I'm not sure whether you have anything
better than either Dan's approach or the switchover-hold flags, I just
think that seems more important to resolve for us upstream.  We can
merge Dan's or mine, you can also propose something else, but IMHO that
seems to be a higher priority?

And whatever else I haven't noticed.  I'll continue reading, but I'm sure
you know this best..  so I'd really rely on you.

What do you think?

Thanks,

-- 
Peter Xu




Re: [PATCH v2 16/20] migration/rdma: Split qemu_fopen_rdma() into input/output functions

2023-06-14 Thread Peter Xu
On Tue, May 30, 2023 at 08:39:37PM +0200, Juan Quintela wrote:
> This is how everything else in QEMUFile is structured.
> As a bonus, it is three fewer lines of code.
> 
> Signed-off-by: Juan Quintela 
> ---
>  migration/rdma.c | 35 ---
>  1 file changed, 16 insertions(+), 19 deletions(-)
> 
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 3ef35fc635..606837bd45 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -4046,25 +4046,22 @@ static void qio_channel_rdma_register_types(void)
>  
>  type_init(qio_channel_rdma_register_types);
>  
> -static QEMUFile *qemu_fopen_rdma(RDMAContext *rdma, const char *mode)
> +static QEMUFile *rdma_new_input(RDMAContext *rdma)
>  {
> -QIOChannelRDMA *rioc;
> +QIOChannelRDMA *rioc = 
> QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
> +rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc));
> +rioc->rdmain = rdma;
> +rioc->rdmaout = rdma->return_path;
>  
> -if (qemu_file_mode_is_not_valid(mode)) {

We can also drop this function altogether.  With that:

Reviewed-by: Peter Xu 

> -return NULL;
> -}
> +return rioc->file;
> +}
>  
> -rioc = QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
> -
> -if (mode[0] == 'w') {
> -rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc));
> -rioc->rdmaout = rdma;
> -rioc->rdmain = rdma->return_path;
> -} else {
> -rioc->file = qemu_file_new_input(QIO_CHANNEL(rioc));
> -rioc->rdmain = rdma;
> -rioc->rdmaout = rdma->return_path;
> -}
> +static QEMUFile *rdma_new_output(RDMAContext *rdma)
> +{
> +QIOChannelRDMA *rioc = 
> QIO_CHANNEL_RDMA(object_new(TYPE_QIO_CHANNEL_RDMA));
> +rioc->file = qemu_file_new_output(QIO_CHANNEL(rioc));
> +rioc->rdmaout = rdma;
> +rioc->rdmain = rdma->return_path;
>  
>  return rioc->file;
>  }
> @@ -4090,9 +4087,9 @@ static void rdma_accept_incoming_migration(void *opaque)
>  return;
>  }
>  
> -f = qemu_fopen_rdma(rdma, "rb");
> +f = rdma_new_input(rdma);
>  if (f == NULL) {
> -fprintf(stderr, "RDMA ERROR: could not qemu_fopen_rdma\n");
> +fprintf(stderr, "RDMA ERROR: could not open RDMA for input\n");
>  qemu_rdma_cleanup(rdma);
>  return;
>  }
> @@ -4217,7 +4214,7 @@ void rdma_start_outgoing_migration(void *opaque,
>  trace_rdma_start_outgoing_migration_after_rdma_connect();
>  
>  s->rdma_migration = true;
> -s->to_dst_file = qemu_fopen_rdma(rdma, "wb");
> +s->to_dst_file = rdma_new_output(rdma);
>  migrate_fd_connect(s, NULL);
>  return;
>  return_path_err:
> -- 
> 2.40.1
> 

-- 
Peter Xu




Re: [PATCH v2 20/20] qemu-file: Make qemu_file_get_error_obj() static

2023-06-14 Thread Peter Xu
On Tue, May 30, 2023 at 08:39:41PM +0200, Juan Quintela wrote:
> It was not used outside of qemu_file.c anyways.
> 
> Signed-off-by: Juan Quintela 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 19/20] qemu-file: Simplify qemu_file_shutdown()

2023-06-14 Thread Peter Xu
On Tue, May 30, 2023 at 08:39:40PM +0200, Juan Quintela wrote:
> Signed-off-by: Juan Quintela 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 18/20] qemu_file: Make qemu_file_is_writable() static

2023-06-14 Thread Peter Xu
On Tue, May 30, 2023 at 08:39:39PM +0200, Juan Quintela wrote:
> It is not used outside of qemu_file, and it shouldn't be.
> 
> Signed-off-by: Juan Quintela 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-12 Thread Peter Xu
On Mon, Jun 12, 2023 at 09:33:42PM +0200, Juan Quintela wrote:
> Only "defer" is recommended.  After setting all migation parameters,
> start incoming migration with "migrate-incoming uri" command.
> 
> Signed-off-by: Juan Quintela 
> ---
>  docs/about/deprecated.rst | 7 +++
>  softmmu/vl.c  | 2 ++
>  2 files changed, 9 insertions(+)
> 
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 47e98dc95e..518672722d 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -447,3 +447,10 @@ The new way to modify migration is using migration 
> parameters.
>  ``blk`` functionality can be achieved using
>  ``migrate_set_parameter block-incremental true``.
>  
> +``-incoming uri`` (since 8.1)
> +'''''''''''''''''''''''''''''
> +
> +Everything except ``-incoming defer`` is deprecated.  This allows one to
> +set up parameters before launching the proper migration with
> +``migrate-incoming uri``.
> +
> diff --git a/softmmu/vl.c b/softmmu/vl.c
> index b0b96f67fa..7fe865ab59 100644
> --- a/softmmu/vl.c
> +++ b/softmmu/vl.c
> @@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp)
>  if (incoming) {
>  Error *local_err = NULL;
>  if (strcmp(incoming, "defer") != 0) {
> +warn_report("-incoming %s is deprecated, use -incoming defer and 
> "
> +" set the uri with migrate-incoming.", incoming);

I still use uri for all my scripts, alongside "-global migration.xxx",
and it works.

Shall we just leave it there?  Or does deprecating it help us in any way?

-- 
Peter Xu




Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-12 Thread Peter Xu
On Mon, Jun 12, 2023 at 10:51:08PM +0200, Juan Quintela wrote:
> Peter Xu  wrote:
> > On Mon, Jun 12, 2023 at 09:33:42PM +0200, Juan Quintela wrote:
> >> Only "defer" is recommended.  After setting all migation parameters,
> >> start incoming migration with "migrate-incoming uri" command.
> >> 
> >> Signed-off-by: Juan Quintela 
> >> ---
> >>  docs/about/deprecated.rst | 7 +++
> >>  softmmu/vl.c  | 2 ++
> >>  2 files changed, 9 insertions(+)
> >> 
> >> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> >> index 47e98dc95e..518672722d 100644
> >> --- a/docs/about/deprecated.rst
> >> +++ b/docs/about/deprecated.rst
> >> @@ -447,3 +447,10 @@ The new way to modify migration is using migration 
> >> parameters.
> >>  ``blk`` functionality can be achieved using
> >>  ``migrate_set_parameter block-incremental true``.
> >>  
> >> +``-incoming uri`` (since 8.1)
> >> +'''''''''''''''''''''''''''''
> >> +
> >> +Everything except ``-incoming defer`` is deprecated.  This allows one to
> >> +set up parameters before launching the proper migration with
> >> +``migrate-incoming uri``.
> >> +
> >> diff --git a/softmmu/vl.c b/softmmu/vl.c
> >> index b0b96f67fa..7fe865ab59 100644
> >> --- a/softmmu/vl.c
> >> +++ b/softmmu/vl.c
> >> @@ -2651,6 +2651,8 @@ void qmp_x_exit_preconfig(Error **errp)
> >>  if (incoming) {
> >>  Error *local_err = NULL;
> >>  if (strcmp(incoming, "defer") != 0) {
> >> +warn_report("-incoming %s is deprecated, use -incoming defer 
> >> and "
> >> +" set the uri with migrate-incoming.", incoming);
> >
> > I still use uri for all my scripts, alongside "-global migration.xxx",
> > and it works.
> 
> You know what you are doing (TM).
> And remember that we don't support -global migration.x-foo.
> Yes, I know, we should drop the "x-" prefixes.

I hope they'll always be there. :) They're pretty handy for tests, when we
want to boot a VM without the need to script sequences of QMP commands.

Yes, we probably should just always drop the x-.  We can always declare
debugging purpose for all -global migration.* fields.

> 
> > Shall we just leave it there?  Or does deprecating it help us in any way?
> 
> See the patches two weeks ago when people complained that listen(.., num)
> was too low.  And there are other parameters that work the same way
> (that I conveniently had forgotten).  So the easiest way to get things
> right is to use "defer" always.  Using -incoming "uri" should only be
> for people that "know what they are doing", so we had two ways to do it:
> - review all migration options and see which ones work without defer
>   and document it
> - deprecate everything that is not defer.
> 
> Anything else is going to be very user unfriendly.
> What do you think?

IIRC Wei Wang had a series just for that, so after that patchset is applied
we should have fixed all the issues cleanly?  Is there anything more that's
not working right there?

> 
> Later, Juan.
> 
> PD.  This series is an RFC for multiple reasons O:-)

Happy to hear the rest (besides the one which I know will break my script :).

-- 
Peter Xu




Re: [PATCH v2 05/20] qemu_file: Use a stat64 for qemu_file_transferred

2023-06-05 Thread Peter Xu
On Tue, May 30, 2023 at 08:39:26PM +0200, Juan Quintela wrote:
> This way we can read it from any thread.
> I checked that it gives the same value as the current one.  We never
> use two qemu_files at the same time.
> 
> Signed-off-by: Juan Quintela 

The follow-up patch may be better squashed into this one, otherwise it's
very confusing..

Why do we need to convert mostly everything into atomics?  Is it modified
outside the migration thread?

AFAIR atomic ops are still expensive, and will get more expensive on larger
hosts, IOW it'll be good for us to keep things non-atomic when possible
(that's why, when I changed some counters in the postcopy work, I only
changed a limited set, so that the rest are still accessed in a single
thread and keep running fast).
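(E.g. a pattern like the following would keep the hot path non-atomic;
just a sketch - the threshold and the fold point are made up:)

    /* sketch: accumulate locally, fold into the shared Stat64 in batches */
    static uint64_t local_bytes;             /* touched by one thread only */

    local_bytes += size;
    if (local_bytes >= 256 * 1024) {         /* arbitrary fold threshold */
        stat64_add(&mig_stats.qemu_file_transferred, local_bytes);
        local_bytes = 0;
    }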

> ---
>  migration/migration-stats.h | 4 
>  migration/qemu-file.c   | 5 +++--
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/migration-stats.h b/migration/migration-stats.h
> index 2358caad63..b7795e7914 100644
> --- a/migration/migration-stats.h
> +++ b/migration/migration-stats.h
> @@ -81,6 +81,10 @@ typedef struct {
>   * Number of bytes sent during precopy stage.
>   */
>  Stat64 precopy_bytes;
> +/*
> + * Number of bytes transferred with QEMUFile.
> + */
> +Stat64 qemu_file_transferred;
>  /*
>   * Amount of transferred data at the start of current cycle.
>   */
> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> index be3dab85cb..eb0497e532 100644
> --- a/migration/qemu-file.c
> +++ b/migration/qemu-file.c
> @@ -288,6 +288,7 @@ void qemu_fflush(QEMUFile *f)
>  } else {
>  uint64_t size = iov_size(f->iov, f->iovcnt);
>  f->total_transferred += size;
> +stat64_add(&mig_stats.qemu_file_transferred, size);
>  }
>  
>  qemu_iovec_release_ram(f);
> @@ -628,7 +629,7 @@ int coroutine_mixed_fn qemu_get_byte(QEMUFile *f)
>  
>  uint64_t qemu_file_transferred_noflush(QEMUFile *f)
>  {
> -uint64_t ret = f->total_transferred;
> +uint64_t ret = stat64_get(&mig_stats.qemu_file_transferred);
>  int i;
>  
>  g_assert(qemu_file_is_writable(f));
> @@ -644,7 +645,7 @@ uint64_t qemu_file_transferred(QEMUFile *f)
>  {
>  g_assert(qemu_file_is_writable(f));
>  qemu_fflush(f);
> -return f->total_transferred;
> +return stat64_get(&mig_stats.qemu_file_transferred);
>  }
>  
>  void qemu_put_be16(QEMUFile *f, unsigned int v)
> -- 
> 2.40.1
> 

-- 
Peter Xu




Re: [PATCH v2 04/20] qemu-file: We only call qemu_file_transferred_* on the sending side

2023-06-05 Thread Peter Xu
On Tue, May 30, 2023 at 08:39:25PM +0200, Juan Quintela wrote:
> Remove the increase in qemu_file_fill_buffer() and add asserts to
> qemu_file_transferred* functions.
> 
> Signed-off-by: Juan Quintela 

The read side accounting does look a bit weird and never caught my notice..

Maybe worth also touching up the documentation of QEMUFile::total_transferred
to clarify what it accounts?

Reviewed-by: Peter Xu 

Though when I'm looking at the counters (didn't follow every single recent
patch on this..), I found that now reading the transferred value is actually
more expensive - qemu_file_transferred() needs flushing, and even the fast
version, qemu_file_transferred_fast(), loops over all possible iovs,
which can be as large as MAX_IOV_SIZE==64.

To be explicit, I _think_ for each guest page we now need to flush...

  ram_save_iterate
migration_rate_exceeded
  migration_transferred_bytes
qemu_file_transferred

I hope I'm wrong..

Does it mean that perhaps we simply need "sent and put into send buffer"
more than "what really got transferred"?  So I start to wonder what's the
original purpose of this change, and which one is better..

-- 
Peter Xu




Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-22 Thread Peter Xu
On Thu, Jun 22, 2023 at 12:01:55PM +0200, Juan Quintela wrote:
> Paolo Bonzini  wrote:
> > On 6/22/23 10:52, Juan Quintela wrote:
> >> User friendliness.
> >> The problem is that if you use more than two channels with multifd, on
> >> the incoming side, you need to do:
> >
> > You're sacrificing user-friendliness for the 99.99% that don't use
> > multifd, for an error (i.e. it's not even fixing the issue) for the
> > 0.01% that use multifd.  That's not user-friendly.
> 
> You are forgetting the 0.01% that uses postcopy preempt (that is easy
> just changing the default value to 2), and the 0.001% that uses
> compression (they have exactly the same problem with compress_threads,
> compress_zlib, etc).
> 
> >> - migrate_set_parameter multifd-channels 16
> >> - migrate_incoming <uri>
> >> 
> >>> The issue is not how many features the command line has, but how
> >>> they're implemented.
> >> Or if they are confusing for the user?
> >
> > Anyone using multifd is not a typical user anyway.
> 
> >>> If they're just QMP wrappers and as such they're self-contained in
> >>> softmmu/vl.c, that's fine.
> >>>
> >>> In fact, even for parameters, we could use keyval to parse "-incoming"
> >> What is keyval?
> >
> > util/keyval.c and include/qemu/keyval.h.  It parses a list of
> > key=value pairs into a QDict.  Once you have removed the "source" key
> > from the QDict you can use a visitor to parse the rest into a
> > MigrateSetParameters.  See the handling of QEMU_OPTION_audio, it could
> > be something like
> >
> >
> > case QEMU_OPTION_incoming: {
> > Visitor *v;
> > MigrateSetParameters *incoming_params = NULL;
> > QDict *dict = keyval_parse(optarg, "source", NULL,
> > &error_fatal);
> >
> > if (incoming) {
> > if (qdict_haskey(dict, "source")) {
> > error_setg(_fatal, "Parameter 'source'
> > is duplicate");
> > }
> > } else {
> > if (!qdict_haskey(dict, "source")) {
> > error_setg(_fatal, "Parameter 'source'
> > is missing");
> > }
> > runstate_set(RUN_STATE_INMIGRATE);
> > incoming = g_strdup(qdict_get_str(dict, "source"));
> > qdict_del(dict, "source");
> > }
> >
> > v = qobject_input_visitor_new_keyval(QOBJECT(dict));
> > qobject_unref(dict);
> > visit_type_MigrateSetParameters(v, NULL,
> > &incoming_params, &error_fatal);
> > visit_free(v);
> > qmp_migration_set_parameters(incoming_params,

PS: we may want to postpone this to be later than migration_object_init(),
when/if there's a real patch.

> > &error_fatal);
> > qapi_free_MigrateSetParameters(incoming_params);
> > }
> >
> >
> > For example "-incoming [source=]tcp:foo,multifd-channels=16" would
> > desugar to
> >
> >   migrate_set_parameter multifd-channels 16
> >   migrate_incoming tcp:foo
> >
> > The only incompatibility is for people who are using "," in a URI,
> > which is rare and only an issue for the "exec" protocol.

If we worry about breaking anyone, we can apply the keyval parsing only
when !exec.  Not sure whether it matters a lot..

> 
> Aha, that makes sense.  And will allow us to deprecate/remove the
> --global migration.* stuff.

We may still need a way to set the caps/params for src qemu?..

Thanks,

-- 
Peter Xu




Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-22 Thread Peter Xu
On Thu, Jun 22, 2023 at 11:22:56AM +0200, Thomas Huth wrote:
> Then simply forbid "migrate_set_parameter multifd-channels ..." if the uri
> has been specified on the command line?

Yeah, actually already in a pull (even though the pr may need a new one..):

https://lore.kernel.org/r/20230622021320.66124-23-quint...@redhat.com

-- 
Peter Xu




Re: [PATCH v2 06/20] qemu_file: total_transferred is not used anymore

2023-06-22 Thread Peter Xu
On Thu, Jun 22, 2023 at 01:05:49AM +0200, Juan Quintela wrote:
> Peter Xu  wrote:
> > On Tue, May 30, 2023 at 08:39:27PM +0200, Juan Quintela wrote:
> >> Signed-off-by: Juan Quintela 
> >> ---
> >>  migration/qemu-file.c | 4 
> >>  1 file changed, 4 deletions(-)
> >> 
> >> diff --git a/migration/qemu-file.c b/migration/qemu-file.c
> >> index eb0497e532..6b6deea19b 100644
> >> --- a/migration/qemu-file.c
> >> +++ b/migration/qemu-file.c
> >> @@ -41,9 +41,6 @@ struct QEMUFile {
> >>  QIOChannel *ioc;
> >>  bool is_writable;
> >>  
> >> -/* The sum of bytes transferred on the wire */
> >> -uint64_t total_transferred;
> >> -
> >>  int buf_index;
> >>  int buf_size; /* 0 when writing */
> >>  uint8_t buf[IO_BUF_SIZE];
> >> @@ -287,7 +284,6 @@ void qemu_fflush(QEMUFile *f)
> >>  qemu_file_set_error_obj(f, -EIO, local_error);
> >>  } else {
> >>  uint64_t size = iov_size(f->iov, f->iovcnt);
> >> -f->total_transferred += size;
> >
> > I think this patch is another example of how the way the patches are split
> > can sometimes add more complexity to the review...
> 
> It depends on taste.
> 
> You are doing one thing in way1.
> Then you find a better way to do it, lets call it way2.
> 
> Now we have two options to see how we arrived there.
> 
> a- You got any declarations/definition/initializations for way2
> b- You write way2 alongside way1
> c- You test that both ways give the same result, and you see that they
>give the same result.
> d- You remove way1.
> 
> Or you squash the four patches into a single patch.  But then the reviewer
> loses the place where one can see why it is the same as the old one.

For this patch that removes total_transferred, IMHO as a reviewer it'll be
easier for me if it's put in the same patch where it got replaced.

It might be different if we're going to remove a large chunk of code, but
for this patch it's a few lines of change.

> 
> Sometimes is better the longer way, sometimes is better the short one.
> 
> Clearly we don't agree about what is the best way in this case.
> 
> > Here we removed a variable update, which seems all fine since it's not
> > used anywhere.  But it also means the current code base (before this patch
> > is applied) already doesn't make sense, because it contains this useless
> > addition.  So IMHO some previous patch just got it wrong.
> 
> No.  It is how it was developed.  And it's being respectful to the
> reviewer, giving them enough information to do a proper review.

I never doubted about that.  I trust that you were trying to provide the
best for a reviewer and I appreciate that.

> 
> During the development of this series, there were lots of:
> 
> if (old_counter != new_counter)
>printf("");

I assume you meant when debugging?  I may not have read all the patches; I
hope we just don't do that if possible in any real git commit.

> 
> traces were several thousand lines long.  If I have to review
> that change, I would love any help the writer can give me.  That is why
> it is done this way.

Yeah, even from a reviewer's perspective the needs can probably differ
between individuals.

> 
> > I think it's caused by a wrong split of patches, where each patch stops
> > making much sense as a standalone one.
> 
> It stops making sense if you want each feature to be a single patch.
> Before the patch no feature.  After the patch full feature.  That brings
> us to very long patches.
> 
> What is easier to review (to do the same)
> 
> a - 1 x 1000 lines patch
> b - 10 x 100 lines patch
> 
> I will go with b any time.  Except if the split is arbitrary.

AFAIU, this is a different thing.  I'm not against splitting patches, but
about how to split them.  I was never asking for a 1000 LOC patch, right? :)

> 
> > I can go back and try to find whatever patch on the list that will explain
> > this.  But it'll also go into the git log.  Anyone reading this later will
> > be confused once again, and it'll be even harder for them to figure out
> > what happened.
> 
> As said before, I completely disagree here.  And what is worse, if it
> goes wrong, with your approach git bisect will not help as much as
> with my approach.
> 
> > Do you think we could reorganize the patches so that each single patch
> > explains itself?
> 
> No.  See before.  We go from very spaghetti code to much less
> spaghetti code.
> 
> > The other thing is about priority of patches - I sti

Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-22 Thread Peter Xu
On Thu, Jun 22, 2023 at 10:59:58AM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 22, 2023 at 10:52:12AM +0200, Juan Quintela wrote:
> > Paolo Bonzini  wrote:
> > > On 6/12/23 22:51, Juan Quintela wrote:
> > >>> Shall we just leave it there?  Or is deprecating it helps us in any 
> > >>> form?
> > >> See the patches two weeks ago when people complained that listen(.., num)
> > >> was too low.  And there are other parameters that work the same way
> > >> (that I conveniently had forgotten).  So the easiest way to get things
> > >> right is to use "defer" always.  Using -incoming "uri" should only be
> > >> for people that "know what they are doing", so we had two ways to do it:
> > >> - review all migration options and see which ones work without defer
> > >>   and document it
> > >> - deprecate everything that is not defer.
> > >
> > > "-incoming " is literally the same as running "migrate-incoming"
> > > as the first thing on the monitor:
> > >
> > > if (incoming) {
> > > Error *local_err = NULL;
> > > if (strcmp(incoming, "defer") != 0) {
> > > qmp_migrate_incoming(incoming, &local_err);
> > > if (local_err) {
> > > error_reportf_err(local_err, "-incoming %s: ", incoming);
> > > exit(1);
> > > }
> > > }
> > > } else if (autostart) {
> > > qmp_cont(NULL);
> > > }
> > >
> > > It's the only piece of code which distinguishes "-incoming defer" from
> > > "-incoming ".
> > >
> > > So I'm not sure what the problem would be with keeping it?
> > 
> > User friendliness.
> > 
> > First of all, I use it all the time.  And I know that it is useful for
> > developers.  I was the one asking Peter to implement -global
> > migration.foo to be able to test multifd with it.
> > 
> > The problem is that if you use more than two channels with multifd, on
> > the incoming side, you need to do:
> > 
> > - migrate_set_parameter multifd-channels 16
> > - migrate_incoming <uri>
> > 
> > And people continue to do:
> > 
> > - qemu -incoming <uri>
> > - migrate_set_parameter multifd-channels 16 (on the command line)
> > 
> > And they complain that it fails, because we are calling listen with the
> > wrong value.
> 
> IMHO if we want to improve user friendliness then arguing about use
> of the CLI vs QMP for migration is completely missing the bigger
> picture.
> 
> I've mentioned several times before that the user should never need to
> set this multifd-channels parameter (nor many other parameters) on the
> destination in the first place.
> 
> The QEMU migration stream should be changed to add a full
> bi-directional handshake, with negotiation of most parameters.
> IOW, the src QEMU should be configured with 16 channels, and
> it should connect the primary control channel, and then directly
> tell the dest that it wants to use 16 multifd channels.
> 
> If we're expecting the user to pass this info across to the dest
> manually we've already spectacularly failed wrt user friendliness.

I can try to move the todo even higher.  Trying to list the initial goals
here:

- One extra phase of handshake between src/dst (maybe the time to boost
  QEMU_VM_FILE_VERSION) before anything else happens.

- Dest shouldn't need to apply any cap/param, it should get all from src.
  Dest still needs to be set up with a URI, and that should be all it needs.

- Src shouldn't need to worry about the binary version of dst anymore as long
  as dest qemu supports handshake, because src can fetch it from dest.

- Handshake can always fail gracefully if anything wrong happened, it
  normally should mean dest qemu is not compatible with src's setup (either
  machine, device, or migration configs) for whatever reason.  Src should
  be able to get a solid error from dest if so.

- Handshake protocol should always be self-bootstrap-able, it means when we
  change the handshake protocol it should always work with old binaries.

  - When src is newer it should be able to know what's missing on dest and
skip the new bits.

  - When dst is newer it should all rely on src (which is older) and it
should always understand src's language.

- All !main channels need to be established later than the handshake - if
  we're going to do this anyway we probably should do it altogether to make
  channels named, so each channel used in migration needs to have a common
  header.  Prepare to deprecate the old tricks of channel orderings.

Does it look reasonable?  Anything to add/remove?
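For the last point, what I'd imagine as the common header is something as
simple as (pure sketch; the name and field list are made up and to be
discussed):

    #include <stdint.h>

    /* Sent first on every migration channel, before any payload */
    typedef struct {
        uint32_t magic;         /* identifies a QEMU migration channel */
        uint32_t version;       /* handshake protocol version */
        uint32_t channel_type;  /* main / multifd / postcopy-preempt / ... */
        uint32_t channel_id;    /* e.g. the multifd channel index */
    } MigChannelHeader;         /* hypothetical name */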

Thanks,

-- 
Peter Xu




Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-22 Thread Peter Xu
On Thu, Jun 22, 2023 at 05:33:29PM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote:
> > I can try to move the todo even higher.  Trying to list the initial goals
> > here:
> > 
> > - One extra phase of handshake between src/dst (maybe the time to boost
> >   QEMU_VM_FILE_VERSION) before anything else happens.
> > 
> > - Dest shouldn't need to apply any cap/param, it should get all from src.
> >   Dest still needs to be set up with a URI, and that should be all it needs.
> > 
> > - Src shouldn't need to worry about the binary version of dst anymore as long
> >   as dest qemu supports handshake, because src can fetch it from dest.
> 
> I'm not sure that works in general. Even if we have a handshake and
> bi-directional comms for live migration, we still haave the save/restore
> to file codepath to deal with. The dst QEMU doesn't exist at the time
> the save process is done, so we can't add logic to VMSate handling that
> assumes knowledge of the dst version at time of serialization.

My current thought was still based on a new cap or anything the user would
need to specify first on both sides (but hopefully the last cap to set on
dest).

E.g. with a new handshake cap we shouldn't set it on an exec: or file:
protocol migration, and qmp_migrate() should just fail, telling that the
URI is not supported when the cap is set.  A return path is definitely
required here.

> 
> > - Handshake can always fail gracefully if anything wrong happened, it
> >   normally should mean dest qemu is not compatible with src's setup (either
> >   machine, device, or migration configs) for whatever reason.  Src should
> >   be able to get a solid error from dest if so.
> > 
> > - Handshake protocol should always be self-bootstrap-able, it means when we
> >   change the handshake protocol it should always work with old binaries.
> > 
> >   - When src is newer it should be able to know what's missing on dest and
> > skip the new bits.
> > 
> >   - When dst is newer it should all rely on src (which is older) and it
> > should always understand src's language.
> 
> I'm not convinced it can reliably self-bootstrap in a backwards
> compatible manner, precisely because the current migration stream
> has no handshake and only requires a unidirectional channel.

Yes, please see above.  I meant when we grow the handshake protocol we
should make sure we don't need anything new to be set up on either the src
or dst qemu.  It won't apply to pre-handshake binaries.

> I don't think its possible for QEMU to validate that it has a fully
> bi-directional channel, without adding timeouts to its detection which I
> think we should strive to avoid.
> 
> I don't think we actually need self-bootstrapping anyway.
> 
> I think the mgmt app can just indicate the new v2 bi-directional
> protocol when issuing the 'migrate' and 'migrate-incoming'
> commands.  This becomes trivial when Het's refactoring of the
> migrate address QAPI is accepted:
> 
>   https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04851.html
> 
> eg:
> 
> { "execute": "migrate",
>   "arguments": {
>   "channels": [ { "channeltype": "main",
>   "addr": { "transport": "socket", "type": "inet",
>"host": "10.12.34.9",
> "port": "1050" } } ] } }
> 
> note the 'channeltype' parameter here. If we declare the 'main'
> refers to the existing migration protocol, then we merely need
> to define a new 'channeltype' to use as an indicator for the
> v2 migration handshake protocol.

Using a new channeltype would also work at least on src qemu, but I'm not
sure how dest qemu would know that it needs a handshake in that case,
because it knows nothing until the connection is established.

Maybe we still need QEMU_VM_FILE_VERSION to be boosted at least in this
case, so dest can read it at the very beginning: old binaries will fail
immediately, new binaries will start to talk the v2 language.
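Roughly (a sketch; QEMU_VM_FILE_VERSION_V2 and the handshake entry point
are made-up names):

    /* dest-side bootstrap, assuming the file version gets bumped for v2 */
    v = qemu_get_be32(f);                 /* QEMU_VM_FILE_MAGIC */
    if (v != QEMU_VM_FILE_MAGIC) {
        return -EINVAL;
    }
    v = qemu_get_be32(f);                 /* stream version */
    if (v >= QEMU_VM_FILE_VERSION_V2) {
        /* new binaries run the bi-directional handshake first */
        ret = migration_incoming_handshake(mis, errp);
    }
    /* old binaries reading a v2 stream would fail right here, gracefully */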

> 
> > - All !main channels need to be established later than the handshake - if
> >   we're going to do this anyway we probably should do it altogether to make
> >   channels named, so each channel used in migration needs to have a common
> >   header.  Prepare to deprecate the old tricks of channel orderings.
> 
> Once the primary channel involves a bi-directional handshake,
> we'll trivially ensure ordering - similar to how the existing
> code worked fine in TLS mode which had a bi-directional TLS

Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-23 Thread Peter Xu
On Fri, Jun 23, 2023 at 08:17:46AM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 22, 2023 at 03:20:01PM -0400, Peter Xu wrote:
> > On Thu, Jun 22, 2023 at 05:33:29PM +0100, Daniel P. Berrangé wrote:
> > > On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote:
> > > > I can try to move the todo even higher.  Trying to list the initial 
> > > > goals
> > > > here:
> > > > 
> > > > - One extra phase of handshake between src/dst (maybe the time to boost
> > > >   QEMU_VM_FILE_VERSION) before anything else happens.
> > > > 
> > > > - Dest shouldn't need to apply any cap/param, it should get all from 
> > > > src.
> > > >   Dest still needs to be set up with a URI, and that should be all it 
> > > > needs.
> > > > 
> > > > - Src shouldn't need to worry about the binary version of dst anymore as 
> > > > long
> > > >   as dest qemu supports handshake, because src can fetch it from dest.
> > > 
> > > I'm not sure that works in general. Even if we have a handshake and
> > > bi-directional comms for live migration, we still haave the save/restore
> > > to file codepath to deal with. The dst QEMU doesn't exist at the time
> > > the save process is done, so we can't add logic to VMSate handling that
> > > assumes knowledge of the dst version at time of serialization.
> > 
> > My current thought was still based on a new cap or anything the user would
> > need to specify first on both sides (but hopefully the last cap to set on
> > dest).
> > 
> > E.g. with a new handshake cap we shouldn't set it on an exec: or file:
> > protocol migration, and qmp_migrate() should just fail, telling that the
> > URI is not supported when the cap is set.  A return path is definitely
> > required here.
> 
> exec can support bi-directional migration - we have both stdin + stdout
> for the command. For exec it is mostly a documentation problem - you
> can't merely use 'cat' for example, but if you used 'socat' that could
> be made to work bi-directionally.

Okay.  It's just an example that the handshake cannot work for all cases,
and it should always be able to fail.

So when exec doesn't properly provide a return path, I think with
handshake=on we should time out waiting for a response and fail the
migration after the timeout.

There're a bunch of implications and details that need to be investigated
around such a handshake if it'll be proposed for real, so I'm not yet sure
whether there's something that may be surprising.  For channeltypes it
seems all fine for now.  Hopefully nothing obvious has been overlooked.

> 
> > > I don't think its possible for QEMU to validate that it has a fully
> > > bi-directional channel, without adding timeouts to its detection which I
> > > think we should strive to avoid.
> > > 
> > > I don't think we actually need self-bootstrapping anyway.
> > > 
> > > I think the mgmt app can just indicate the new v2 bi-directional
> > > protocol when issuing the 'migrate' and 'migrate-incoming'
> > > commands.  This becomes trivial when Het's refactoring of the
> > > migrate address QAPI is accepted:
> > > 
> > >   https://lists.gnu.org/archive/html/qemu-devel/2023-05/msg04851.html
> > > 
> > > eg:
> > > 
> > > { "execute": "migrate",
> > >   "arguments": {
> > >   "channels": [ { "channeltype": "main",
> > >   "addr": { "transport": "socket", "type": "inet",
> > >"host": "10.12.34.9",
> > > "port": "1050" } } ] } }
> > > 
> > > note the 'channeltype' parameter here. If we declare the 'main'
> > > refers to the existing migration protocol, then we merely need
> > > to define a new 'channeltype' to use as an indicator for the
> > > v2 migration handshake protocol.
> > 
> > Using a new channeltype would also work at least on src qemu, but I'm not
> > sure how dest qemu would know that it needs a handshake in that case,
> > because it knows nothing until the connection is established.
> 
In Het's series the 'migrate_incoming' command similarly has a channeltype
parameter.

Oh, yeah then that'll just work.  Thanks.

-- 
Peter Xu




Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-23 Thread Peter Xu
On Fri, Jun 23, 2023 at 09:23:18AM +0100, Daniel P. Berrangé wrote:
> On Thu, Jun 22, 2023 at 11:54:43AM -0400, Peter Xu wrote:
> > On Thu, Jun 22, 2023 at 10:59:58AM +0100, Daniel P. Berrangé wrote:
> > > I've mentioned several times before that the user should never need to
> > > set this multifd-channels parameter (nor many other parameters) on the
> > > destination in the first place.
> > > 
> > > The QEMU migration stream should be changed to add a full
> > > bi-directional handshake, with negotiation of most parameters.
> > > IOW, the src QEMU should be configured with 16 channels, and
> > > it should connect the primary control channel, and then directly
> > > tell the dest that it wants to use 16 multifd channels.
> > > 
> > > If we're expecting the user to pass this info across to the dest
> > > manually we've already spectacularly failed wrt user friendliness.
> > 
> > I can try to move the todo even higher.  Trying to list the initial goals
> > here:
> > 
> > - One extra phase of handshake between src/dst (maybe the time to boost
> >   QEMU_VM_FILE_VERSION) before anything else happens.
> > 
> > - Dest shouldn't need to apply any cap/param, it should get all from src.
> >   Dest still need to be setup with an URI and that should be all it needs.
> 
> There are a few that the dest will still need set explicitly. Specifically
> the TLS parameters - tls-authz and tls-creds, because those are both
> related to --object parameters configured on the dst QEMU. Potentially
> there's an argument to be made for the TLS parameters to be part fo the
> initial 'migrate' and 'migrate-incoming' command data, as they're
> specifically related to the connection establishment, while (most) of
> the other params are related to the migration protocol running inside
> the connection.

Ideally we can even make the tls options apply after the main connection is
established, IOW the gnutls handshake can be part of the generic handshake.
But yeah, I agree that may involve much more work, so we may start by
assuming the v2 handshake just happens on the already-built tls channel for
now.

I think the new protocol should allow extensions, so when we want to move
the tls handshake into it, the v2 protocol should be able to first detect
src/dst binary support for that, and switch to it if we want - then we can
even get a src qemu migration failure which says "dest qemu forgot to set
up tls credentials on the cmdline", or report anything that went wrong on
dest during tls setup.

> 
> A few other parameters are also related to the connection establishment,
> most notably the enablement multifd, postcopy and postcopy-pre-empt.

As I mentioned in the list, I plan to make this part of the default v2,
where the v2 handshake will take care of managing the connections rather
than relying on the old code.  I'm not sure how complicated it'll be, but
the v2 protocol just sounds like a good fit for such a major change in how
we set up the channels, and a chance to get things right from the start.

> 
> I think with those ones we don't need to set them on the src either.
> With the new migration handshake we should probably use multifd
> codepaths unconditionally, with a single channel.

The v2 handshake will be beneficial to !multifd as well.  Right now I tend
to make it also work for !multifd, e.g., it always makes sense to do a
device tree comparison before migration, even if someone uses special
tunneling so multifd cannot be enabled for whatever reason, as long as a
return path is available so the two sides can talk.

> By matching with the introduction of new protocol, we have a nice point
> against which to deprecate the old non-multifd codepaths. We'll need to
> keep the non-multifd code around *a lot* longer than the normal
> deprecation cycle though, as we need migration to/from very old QEMUs.

I actually had a feeling that we should always keep it..  I'm not sure
whether we must combine a new handshake with "making multifd the default".
I do think we can make the !multifd path very simple though, e.g., I'd
think we should start considering deprecating things like
!multifd+compression etc. earlier than that.

> 
> The enablement of postcopy could be automatic too - src & dst can
> both detect if their host OS supports it. That would make all
> migrations post-copy capable. The mgmt app just needs to trigger
> the switch to post-copy mode *if* they want to use it.

Sounds doable.

> 
> Likewise we can just always assume postcopy-pre-empt is available.
> 
> I think 'return-path' becomes another one we can just assume too.

Right, the handshake cap (or the new QAPI URI replacement) should imply a
return path already.

Thanks,

-- 
Peter Xu




Re: [PATCH v2 1/5] migration: Use proper indentation for migration.json

2023-06-27 Thread Peter Xu
On Thu, Jun 22, 2023 at 09:50:15PM +0200, Juan Quintela wrote:
> We broke it with dirtyrate limit patches.
> 
> Signed-off-by: Juan Quintela 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH v2 5/5] migration: Deprecate old compression method

2023-06-27 Thread Peter Xu
On Thu, Jun 22, 2023 at 09:50:19PM +0200, Juan Quintela wrote:
> Signed-off-by: Juan Quintela 

Acked-by: Peter Xu 

-- 
Peter Xu




Re: [RFC 4/6] migration: Deprecate -incoming

2023-06-20 Thread Peter Xu
On Tue, Jun 20, 2023 at 01:10:55PM +0100, Daniel P. Berrangé wrote:
> In some cases it is worth having a convenience option for user friendliness.
> 
> In this case, however, users are already needing to use QMP/HMP on the
> source side to set migration parameters. I think it is reasonable to say
> that doing *exactly* the same with QMP/HMP on the destination side is
> actually more friendly than requiring use of -global on the dest side
> which is different syntax.

The -global parameters also work for the src side in my scripts, so I only
need to attach "-incoming tcp:XXX" on the dst cmdline here.

> 
> We don't document the way to use -global with migration parameters so
> when people see '-incoming' I think we are steering them into a trap,
> making it look like -incoming is sufficient on its own. Hence that user's
> mistake recently about not setting migration parameters.
> 
> Overall I agree with deprecating  '-incoming' != 'defer', as I think it
> will better guide users to following the correct steps, as is not a
> usability burden as they're already using QMP/HMP on the source side.

I'd say -global is definitely not for users but for developers.  Just like
we keep HMP around - I'm not sure whether it's used in production; I
thought it was only for developers, and we don't deprecate it eagerly.

No strong opinions here if everyone wants to get rid of that, but before
that I hope we can have some kind of cmdline tool that can easily tunnel
QMP commands, so developers using pure cmdlines can switch to the new way
easily.

Thanks,

-- 
Peter Xu




Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks

2023-05-05 Thread Peter Xu
On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote:
> To fix it, ensure that the BQL is held during setup. To avoid changing
> the behavior for migration too, introduce conditionals for the setup
> callbacks that need the BQL and only take the lock if it's not already
> held.

The major complexity of this patch is the "conditionally taking" part.

Pure question: what is the benefit of not holding BQL always during
save_setup(), if after all we have this coroutine issue to be solved?

I can understand that it helps us to avoid taking the BQL for too long, but
we'll need to take it anyway during ramblock dirty tracking initialization,
and so far IIUC that's where most of the time is consumed during setup().

Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync()
call needs the iothread lock". Firstly I think it's also covering
enablement of dirty tracking:

+qemu_mutex_lock_iothread();
+qemu_mutex_lock_ramlist();
+bytes_transferred = 0;
+reset_ram_globals();
+
 memory_global_dirty_log_start();
 migration_bitmap_sync();
+qemu_mutex_unlock_iothread();

And I think enablement itself can be slow too, maybe even slower than
migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET
supported in the kernel.

Meanwhile I was always confused about why we need to sync the dirty bitmap
during setup at all.  Say, what if we drop migration_bitmap_sync() here?
After all, shouldn't all pages be dirty from the start
(ram_list_init_bitmaps())?
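From memory, the init path effectively does:

    /* paraphrasing ram_list_init_bitmaps(), from memory */
    RAMBLOCK_FOREACH_NOT_IGNORED(block) {
        unsigned long pages = block->max_length >> TARGET_PAGE_BITS;

        block->bmap = bitmap_new(pages);
        bitmap_set(block->bmap, 0, pages);  /* every page starts out dirty */
    }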

This is slightly off-topic, but I'd like to know if someone can help
answer.

My whole point is still to question whether we can unconditionally take the
BQL during save_setup().

I could have missed something, though, that we want to do in setup() while
avoiding the BQL.  Help is needed to figure this out (and if there is such
a thing, IMHO it'll be worthwhile to put it into a comment on the
save_setup() hook).
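IOW, what I'm really asking is whether the generic layer could simply do
(a sketch over qemu_savevm_state_setup(); error handling omitted):

    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
        if (!se->ops || !se->ops->save_setup) {
            continue;
        }
        /* sketch: hold the BQL for every setup hook, unconditionally */
        qemu_mutex_lock_iothread();
        ret = se->ops->save_setup(f, se->opaque);
        qemu_mutex_unlock_iothread();
        if (ret < 0) {
            break;
        }
    }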

Thanks,

-- 
Peter Xu




Re: [PATCH 0/9] QEMU file cleanups

2023-05-04 Thread Peter Xu
On Thu, May 04, 2023 at 01:38:32PM +0200, Juan Quintela wrote:
> - convince and review code to see that everything is uint64_t.

One general question about these patches - what's the major benefit
of using uint64_t?

It doubles the range of numbers we can hold, but it's already 64 bits so I
don't think it matters a lot.  The thing is we're removing some code trying
to detect negative values, which still seems helpful for detecting e.g.
overflows (even though I don't think they'll happen).  I just still think
it's good to know when an overflow happens, and I'm not sure what I missed
on the benefits of using unsigned here.

I've reviewed all the rest patches and all look good here.

Thanks,

-- 
Peter Xu




Re: [PATCH 0/9] QEMU file cleanups

2023-05-04 Thread Peter Xu
On Thu, May 04, 2023 at 04:56:46PM +0200, Juan Quintela wrote:
> Peter Xu  wrote:
> > On Thu, May 04, 2023 at 01:38:32PM +0200, Juan Quintela wrote:
> >> - convince and review code to see that everything is uint64_t.
> >
> > One general question about these patches - what's the major benefit
> > of using uint64_t?
> >
> > It doubles the range of numbers we can hold, but it's already 64 bits so I
> > don't think it matters a lot.
> 
> We were checking for negatives even when that can't be.
> And we are doing this dance of
> 
> int64_t x, y;
> uint64_t a, b;
> 
> x = a;
> b = y;
> 
> This is always confusing and not always right.

Yeah, this is confusing, but if anything can go wrong with this I assume
we'd have some bigger problem anyway..

> 
> > The thing is we're removing some code trying
> > to detect negative values, which still seems helpful for detecting e.g.
> > overflows (even though I don't think they'll happen).  I just still think
> > it's good to know when an overflow happens, and I'm not sure what I missed
> > on the benefits of using unsigned here.
> 
> If you grep through the code, you see that half of the things are
> int64_t and the other half is uint64_t.  I find it always confusing.

Right, I'm personally curious whether we should just use int64_t always
unless necessary. :) Another good thing with int64_t is it's also suitable
for error reporting when used in retvals.

But no strong opinion here, I don't think that's a huge deal for now.
Having such an alignment on types makes sense to me.

Thanks,

-- 
Peter Xu




Re: [PATCH 8/9] qemu-file: Make ram_control_save_page() use accessors for rate_limit

2023-05-04 Thread Peter Xu
On Thu, May 04, 2023 at 01:38:40PM +0200, Juan Quintela wrote:
> Signed-off-by: Juan Quintela 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 6/9] qemu-file: remove shutdown member

2023-05-04 Thread Peter Xu
On Thu, May 04, 2023 at 01:38:38PM +0200, Juan Quintela wrote:
> The first thing that we do after setting the shutdown value is to set
> the error to -EIO if there is no previous error.
> 
> So this value is redundant.  Just remove it and use
> qemu_file_get_error() in the places where it was tested.
> 
> Signed-off-by: Juan Quintela 
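
The resulting pattern would be something like this (a sketch; only
qemu_file_get_error() is the real accessor, the rest is illustrative):

    /* Before: shutdown flag and error state tracked separately */
    if (f->shutdown || qemu_file_get_error(f)) {
        return -EIO;
    }

    /* After: qemu_file_shutdown() already forces the error to -EIO,
     * so a single error check covers both cases */
    int ret = qemu_file_get_error(f);
    if (ret) {
        return ret;
    }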

Nit: I'd squash this with previous.

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 9/9] qemu-file: Account for rate_limit usage on qemu_fflush()

2023-05-04 Thread Peter Xu
On Thu, May 04, 2023 at 01:38:41PM +0200, Juan Quintela wrote:
> That is the moment we know we have transferred something.
> 
> Signed-off-by: Juan Quintela 

There'll be a slight side effect that qemu_file_rate_limit() can be
triggered later than before, because data cached in the qemufile won't be
accounted for until it is flushed.

Two limits here:

- IO_BUF_SIZE==32K, the real internal buffer
- MAX_IOV_SIZE==64 (I think), the iov slots used to queue guest page
  pointers directly; on x86 that's 64*4K=256K

So the impact should be no more than 288KB on x86.  Looks still fine..
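
Spelling the bound out (assuming 4K target pages; the macro values mirror
the ones above):

    #include <stddef.h>

    #define IO_BUF_SIZE   (32 * 1024)   /* internal buffer      */
    #define MAX_IOV_SIZE  64            /* queued iov slots     */
    #define PAGE_SIZE_4K  (4 * 1024)    /* x86 target page size */

    /* 32K + 64 * 4K = 288K of buffered-but-unaccounted data, worst case */
    size_t worst_case = IO_BUF_SIZE + MAX_IOV_SIZE * PAGE_SIZE_4K;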

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH 5/9] qemu-file: No need to check for shutdown in qemu_file_rate_limit

2023-05-04 Thread Peter Xu
On Thu, May 04, 2023 at 01:38:37PM +0200, Juan Quintela wrote:
> After calling qemu_file_shutdown() we set the error to -EIO if there
> is no previous error, so there is no need to check it here.
> 
> Signed-off-by: Juan Quintela 

Reviewed-by: Peter Xu 

-- 
Peter Xu




Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks

2023-05-17 Thread Peter Xu
On Wed, May 10, 2023 at 08:31:13AM +0200, Juan Quintela wrote:
> Peter Xu  wrote:
> 
> Hi
> 
> [Adding Kevin to the party]
> 
> > On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote:
> >> To fix it, ensure that the BQL is held during setup. To avoid changing
> >> the behavior for migration too, introduce conditionals for the setup
> >> callbacks that need the BQL and only take the lock if it's not already
> >> held.
> >
> > The major complexity of this patch is the "conditionally taking" part.
> 
> Yeap.
> 
> I don't want that bit.
> 
> This is another case of:
> - I have a problem
> - I will use recursive mutexes to solve it
> 
> Now you have two problems O:-)
> 
> > Pure question: what is the benefit of not holding BQL always during
> > save_setup(), if after all we have this coroutine issue to be solved?
> 
> Dunno.
> 
> I would like that paolo commented on this.  I "reviewed the code" 10
> years ago.  I don't remember at all why we wanted to change that.
> 
> > I can understand that it helps us to avoid taking BQL too long, but we'll
> > need to take it anyway during ramblock dirty track initializations, and so
> > far IIUC it's the major time to be consumed during setup().
> >
> > Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync()
> > call needs the iothread lock". Firstly I think it's also covering
> > enablement of dirty tracking:
> >
> > +qemu_mutex_lock_iothread();
> > +qemu_mutex_lock_ramlist();
> > +bytes_transferred = 0;
> > +reset_ram_globals();
> > +
> >  memory_global_dirty_log_start();
> >  migration_bitmap_sync();
> > +qemu_mutex_unlock_iothread();
> >
> > And I think the enablement itself can be slow too, maybe even slower than
> > migration_bitmap_sync(), especially with KVM_DIRTY_LOG_INITIALLY_SET
> > supported in the kernel.
> >
> > Meanwhile I've always been confused about why we need to sync the dirty
> > bitmap during setup at all.  Say, what if we drop migration_bitmap_sync()
> > here?  After all, shouldn't all pages be dirty from the start
> > (ram_list_init_bitmaps())?
> 
> How do you convince KVM (or the other lists) to start doing dirty
> tracking?  Doing a bitmap sync.

I think memory_global_dirty_log_start() kicks off the tracking already.

Take KVM as an example: normally KVM_MEM_LOG_DIRTY_PAGES is set there, and
then ioctl(KVM_SET_USER_MEMORY_REGION) will start dirty tracking for all of
the guest memory slots necessary (including write-protecting all pages).

KVM_DIRTY_LOG_INITIALLY_SET is slightly special: it'll skip that procedure
during ioctl(KVM_SET_USER_MEMORY_REGION), but that also means the kernel
assumes the userspace app (QEMU) marked all pages dirty initially (always
the case for QEMU, I think..).  Hence in this case the sync doesn't help
either, because we'll simply get no new dirty bits in this shot..
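
For reference, this is roughly how a memslot gets flipped into dirty-logging
mode (a sketch with placeholder values; the ioctl, struct and flag are the
real KVM uAPI):

    #include <linux/kvm.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct kvm_userspace_memory_region region = {
        .slot            = slot_id,                 /* placeholder */
        .flags           = KVM_MEM_LOG_DIRTY_PAGES, /* start tracking */
        .guest_phys_addr = gpa,                     /* placeholder */
        .memory_size     = size,                    /* placeholder */
        .userspace_addr  = (uint64_t)(uintptr_t)hva,
    };
    /* Without KVM_DIRTY_LOG_INITIALLY_SET, this is also where all the
     * pages in the slot get write-protected. */
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);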

> 
> And yeap, probably there is a better way of doing it.
> 
> > This is slightly off-topic, but I'd like to know if someone can help
> > answer.
> >
> > My whole point is still questioning whether we can unconditionally take bql
> > during save_setup().
> 
> I agree with you.
> 
> > I could have missed something, though, where we want to do in setup() but
> > avoid holding BQL.  Help needed on figuring this out (and if there is, IMHO
> > it'll be worthwhile to put that into comment of save_setup() hook).
> 
> I am more towards revert completely
> 9b0950375277467fd74a9075624477ae43b9bb22
> 
> and call it a day.  On migration we don't use coroutines on the sending
> side (I mean the migration code, the block layer uses coroutines for
> everything/anything).
> 
> Paolo, Stefan any clues for not run setup with the BKL?
> 
> Later, Juan.
> 

-- 
Peter Xu
