Re: [PATCH] xen-hvm: Avoid livelock while handling buffered ioreqs

2024-04-09 Thread Ross Lagerwall
On Tue, Apr 9, 2024 at 11:20 AM Anthony PERARD  wrote:
>
> On Thu, Apr 04, 2024 at 03:08:33PM +0100, Ross Lagerwall wrote:
> > diff --git a/hw/xen/xen-hvm-common.c b/hw/xen/xen-hvm-common.c
> > index 1627da739822..1116b3978938 100644
> > --- a/hw/xen/xen-hvm-common.c
> > +++ b/hw/xen/xen-hvm-common.c
> > @@ -521,22 +521,30 @@ static bool handle_buffered_iopage(XenIOState *state)
> [...]
> >
> >  static void handle_buffered_io(void *opaque)
> >  {
> > +unsigned int handled;
> >  XenIOState *state = opaque;
> >
> > -if (handle_buffered_iopage(state)) {
> > +handled = handle_buffered_iopage(state);
> > +if (handled >= IOREQ_BUFFER_SLOT_NUM) {
> > +/* We handled a full page of ioreqs. Schedule a timer to continue
> > + * processing while giving other stuff a chance to run.
> > + */
>
> ./scripts/checkpatch.pl report a style issue here:
> WARNING: Block comments use a leading /* on a separate line
>
> I can try to remember to fix that on commit.

I copied the comment style from a few lines above but I guess it was
wrong.

>
> >  timer_mod(state->buffered_io_timer,
> > -BUFFER_IO_MAX_DELAY + 
> > qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
> > -} else {
> > +qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
> > +} else if (handled == 0) {
>
> Just curious, why did you check for `handled == 0` here instead of
> `handled != 0`? That would have avoided to invert the last 2 cases, and
> the patch would just have introduce a new case without changing the
> order of the existing ones. But not that important I guess.
>

In general I try to use conditionals with the least amount of negation
since I think it is easier to read. I can change it if you would prefer?

Ross



Re: [PATCH] xen-hvm: Avoid livelock while handling buffered ioreqs

2024-04-08 Thread Ross Lagerwall
On Sat, Apr 6, 2024 at 11:58 AM Durrant, Paul  wrote:
>
> On 04/04/2024 15:08, Ross Lagerwall wrote:
> > A malicious or buggy guest may generated buffered ioreqs faster than
> > QEMU can process them in handle_buffered_iopage(). The result is a
> > livelock - QEMU continuously processes ioreqs on the main thread without
> > iterating through the main loop which prevents handling other events,
> > processing timers, etc. Without QEMU handling other events, it often
> > results in the guest becoming unsable and makes it difficult to stop the
> > source of buffered ioreqs.
> >
> > To avoid this, if we process a full page of buffered ioreqs, stop and
> > reschedule an immediate timer to continue processing them. This lets
> > QEMU go back to the main loop and catch up.
> >
>
> Do PV backends potentially cause the same scheduling issue (if not using
> io threads)?
>

>From what I can tell:

xen-block: It reads req_prod / req_cons once before entering the loop
so it should be fine, I think.

xen_console: Same as xen-block

xen_nic: It reads req_prod / req_cons once before entering the loop.
However, once the loop ends it checks for more requests and if there
are more requests it restarts from the beginning. It seems like this
could be susceptible to the same issue.

(These PV backends generally aren't used by XenServer's system QEMU
so I didn't spend too much time looking into it.)

Thanks,
Ross



[PATCH] xen-hvm: Avoid livelock while handling buffered ioreqs

2024-04-04 Thread Ross Lagerwall
A malicious or buggy guest may generated buffered ioreqs faster than
QEMU can process them in handle_buffered_iopage(). The result is a
livelock - QEMU continuously processes ioreqs on the main thread without
iterating through the main loop which prevents handling other events,
processing timers, etc. Without QEMU handling other events, it often
results in the guest becoming unsable and makes it difficult to stop the
source of buffered ioreqs.

To avoid this, if we process a full page of buffered ioreqs, stop and
reschedule an immediate timer to continue processing them. This lets
QEMU go back to the main loop and catch up.

Signed-off-by: Ross Lagerwall 
---
 hw/xen/xen-hvm-common.c | 26 +-
 1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/hw/xen/xen-hvm-common.c b/hw/xen/xen-hvm-common.c
index 1627da739822..1116b3978938 100644
--- a/hw/xen/xen-hvm-common.c
+++ b/hw/xen/xen-hvm-common.c
@@ -463,11 +463,11 @@ static void handle_ioreq(XenIOState *state, ioreq_t *req)
 }
 }
 
-static bool handle_buffered_iopage(XenIOState *state)
+static unsigned int handle_buffered_iopage(XenIOState *state)
 {
 buffered_iopage_t *buf_page = state->buffered_io_page;
 buf_ioreq_t *buf_req = NULL;
-bool handled_ioreq = false;
+unsigned int handled = 0;
 ioreq_t req;
 int qw;
 
@@ -480,7 +480,7 @@ static bool handle_buffered_iopage(XenIOState *state)
 req.count = 1;
 req.dir = IOREQ_WRITE;
 
-for (;;) {
+do {
 uint32_t rdptr = buf_page->read_pointer, wrptr;
 
 xen_rmb();
@@ -521,22 +521,30 @@ static bool handle_buffered_iopage(XenIOState *state)
 assert(!req.data_is_ptr);
 
 qatomic_add(_page->read_pointer, qw + 1);
-handled_ioreq = true;
-}
+handled += qw + 1;
+} while (handled < IOREQ_BUFFER_SLOT_NUM);
 
-return handled_ioreq;
+return handled;
 }
 
 static void handle_buffered_io(void *opaque)
 {
+unsigned int handled;
 XenIOState *state = opaque;
 
-if (handle_buffered_iopage(state)) {
+handled = handle_buffered_iopage(state);
+if (handled >= IOREQ_BUFFER_SLOT_NUM) {
+/* We handled a full page of ioreqs. Schedule a timer to continue
+ * processing while giving other stuff a chance to run.
+ */
 timer_mod(state->buffered_io_timer,
-BUFFER_IO_MAX_DELAY + qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
-} else {
+qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
+} else if (handled == 0) {
 timer_del(state->buffered_io_timer);
 qemu_xen_evtchn_unmask(state->xce_handle, state->bufioreq_local_port);
+} else {
+timer_mod(state->buffered_io_timer,
+BUFFER_IO_MAX_DELAY + qemu_clock_get_ms(QEMU_CLOCK_REALTIME));
 }
 }
 
-- 
2.43.0




Re: [PATCH] main-loop: Avoid some unnecessary poll calls

2024-03-06 Thread Ross Lagerwall
On Mon, Feb 12, 2024 at 11:45 AM Ross Lagerwall
 wrote:
>
> A common pattern is seen where a timer fires, the callback does some
> work, then rearms the timer which implicitly calls qemu_notify_event().
>
> qemu_notify_event() is supposed to interrupt the main loop's poll() by
> calling qemu_bh_schedule(). In the case that this is being called from a
> main loop callback, the main loop is already not waiting on poll() and
> instead it means the main loop does an addition iteration with a timeout
> of 0 to handle the bottom half wakeup, before once again polling with
> the expected timeout value.
>
> Detect this situation by skipping the qemu_bh_schedule() call if the
> default main context is currently owned by the caller. i.e. it is being
> called as part of a poll / timer callback. Adjust the scope of the main
> context acquire / release to cover the timer callbacks in
> qemu_clock_run_all_timers().
>
> Signed-off-by: Ross Lagerwall 
> ---
>  util/main-loop.c | 24 +---
>  1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/util/main-loop.c b/util/main-loop.c
> index a0386cfeb60c..a2afbb7d0e13 100644
> --- a/util/main-loop.c
> +++ b/util/main-loop.c
> @@ -145,10 +145,16 @@ AioContext *qemu_get_aio_context(void)
>
>  void qemu_notify_event(void)
>  {
> +GMainContext *context;
> +
>  if (!qemu_aio_context) {
>  return;
>  }
> -qemu_bh_schedule(qemu_notify_bh);
> +
> +context = g_main_context_default();
> +if (!g_main_context_is_owner(context)) {
> +qemu_bh_schedule(qemu_notify_bh);
> +}
>  }
>
>  static GArray *gpollfds;
> @@ -292,11 +298,8 @@ static void glib_pollfds_poll(void)
>
>  static int os_host_main_loop_wait(int64_t timeout)
>  {
> -GMainContext *context = g_main_context_default();
>  int ret;
>
> -g_main_context_acquire(context);
> -
>  glib_pollfds_fill();
>
>  bql_unlock();
> @@ -309,8 +312,6 @@ static int os_host_main_loop_wait(int64_t timeout)
>
>  glib_pollfds_poll();
>
> -g_main_context_release(context);
> -
>  return ret;
>  }
>  #else
> @@ -470,15 +471,12 @@ static int os_host_main_loop_wait(int64_t timeout)
>  fd_set rfds, wfds, xfds;
>  int nfds;
>
> -g_main_context_acquire(context);
> -
>  /* XXX: need to suppress polling by better using win32 events */
>  ret = 0;
>  for (pe = first_polling_entry; pe != NULL; pe = pe->next) {
>  ret |= pe->func(pe->opaque);
>  }
>  if (ret != 0) {
> -g_main_context_release(context);
>  return ret;
>  }
>
> @@ -538,8 +536,6 @@ static int os_host_main_loop_wait(int64_t timeout)
>  g_main_context_dispatch(context);
>  }
>
> -g_main_context_release(context);
> -
>  return select_ret || g_poll_ret;
>  }
>  #endif
> @@ -559,6 +555,7 @@ void main_loop_poll_remove_notifier(Notifier *notify)
>
>  void main_loop_wait(int nonblocking)
>  {
> +GMainContext *context = g_main_context_default();
>  MainLoopPoll mlpoll = {
>  .state = MAIN_LOOP_POLL_FILL,
>  .timeout = UINT32_MAX,
> @@ -586,7 +583,10 @@ void main_loop_wait(int nonblocking)
>timerlistgroup_deadline_ns(
>_loop_tlg));
>
> +g_main_context_acquire(context);
> +
>  ret = os_host_main_loop_wait(timeout_ns);
> +
>  mlpoll.state = ret < 0 ? MAIN_LOOP_POLL_ERR : MAIN_LOOP_POLL_OK;
>  notifier_list_notify(_loop_poll_notifiers, );
>
> @@ -598,6 +598,8 @@ void main_loop_wait(int nonblocking)
>  icount_start_warp_timer();
>  }
>  qemu_clock_run_all_timers();
> +
> +g_main_context_release(context);
>  }
>
>  /* Functions to operate on the main QEMU AioContext.  */
> --
> 2.43.0
>

Ping?

Ross



[PATCH] main-loop: Avoid some unnecessary poll calls

2024-02-12 Thread Ross Lagerwall via
A common pattern is seen where a timer fires, the callback does some
work, then rearms the timer which implicitly calls qemu_notify_event().

qemu_notify_event() is supposed to interrupt the main loop's poll() by
calling qemu_bh_schedule(). In the case that this is being called from a
main loop callback, the main loop is already not waiting on poll() and
instead it means the main loop does an addition iteration with a timeout
of 0 to handle the bottom half wakeup, before once again polling with
the expected timeout value.

Detect this situation by skipping the qemu_bh_schedule() call if the
default main context is currently owned by the caller. i.e. it is being
called as part of a poll / timer callback. Adjust the scope of the main
context acquire / release to cover the timer callbacks in
qemu_clock_run_all_timers().

Signed-off-by: Ross Lagerwall 
---
 util/main-loop.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/util/main-loop.c b/util/main-loop.c
index a0386cfeb60c..a2afbb7d0e13 100644
--- a/util/main-loop.c
+++ b/util/main-loop.c
@@ -145,10 +145,16 @@ AioContext *qemu_get_aio_context(void)
 
 void qemu_notify_event(void)
 {
+GMainContext *context;
+
 if (!qemu_aio_context) {
 return;
 }
-qemu_bh_schedule(qemu_notify_bh);
+
+context = g_main_context_default();
+if (!g_main_context_is_owner(context)) {
+qemu_bh_schedule(qemu_notify_bh);
+}
 }
 
 static GArray *gpollfds;
@@ -292,11 +298,8 @@ static void glib_pollfds_poll(void)
 
 static int os_host_main_loop_wait(int64_t timeout)
 {
-GMainContext *context = g_main_context_default();
 int ret;
 
-g_main_context_acquire(context);
-
 glib_pollfds_fill();
 
 bql_unlock();
@@ -309,8 +312,6 @@ static int os_host_main_loop_wait(int64_t timeout)
 
 glib_pollfds_poll();
 
-g_main_context_release(context);
-
 return ret;
 }
 #else
@@ -470,15 +471,12 @@ static int os_host_main_loop_wait(int64_t timeout)
 fd_set rfds, wfds, xfds;
 int nfds;
 
-g_main_context_acquire(context);
-
 /* XXX: need to suppress polling by better using win32 events */
 ret = 0;
 for (pe = first_polling_entry; pe != NULL; pe = pe->next) {
 ret |= pe->func(pe->opaque);
 }
 if (ret != 0) {
-g_main_context_release(context);
 return ret;
 }
 
@@ -538,8 +536,6 @@ static int os_host_main_loop_wait(int64_t timeout)
 g_main_context_dispatch(context);
 }
 
-g_main_context_release(context);
-
 return select_ret || g_poll_ret;
 }
 #endif
@@ -559,6 +555,7 @@ void main_loop_poll_remove_notifier(Notifier *notify)
 
 void main_loop_wait(int nonblocking)
 {
+GMainContext *context = g_main_context_default();
 MainLoopPoll mlpoll = {
 .state = MAIN_LOOP_POLL_FILL,
 .timeout = UINT32_MAX,
@@ -586,7 +583,10 @@ void main_loop_wait(int nonblocking)
   timerlistgroup_deadline_ns(
   _loop_tlg));
 
+g_main_context_acquire(context);
+
 ret = os_host_main_loop_wait(timeout_ns);
+
 mlpoll.state = ret < 0 ? MAIN_LOOP_POLL_ERR : MAIN_LOOP_POLL_OK;
 notifier_list_notify(_loop_poll_notifiers, );
 
@@ -598,6 +598,8 @@ void main_loop_wait(int nonblocking)
 icount_start_warp_timer();
 }
 qemu_clock_run_all_timers();
+
+g_main_context_release(context);
 }
 
 /* Functions to operate on the main QEMU AioContext.  */
-- 
2.43.0




[PATCH] xen/pt: Emulate multifunction bit in header type

2023-11-03 Thread Ross Lagerwall via
The intention of the code appears to have been to unconditionally set
the multifunction bit but since the emulation mask is 0x00 it has no
effect. Instead, emulate the bit and set it based on the multifunction
property of the PCIDevice (which can be set using QAPI).

This allows making passthrough devices appear as functions in a Xen
guest.

Signed-off-by: Ross Lagerwall 
---
 hw/xen/xen_pt_config_init.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index 2b8680b112fa..e6ec32e3ccd2 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -291,7 +291,10 @@ static int 
xen_pt_header_type_reg_init(XenPCIPassthroughState *s,
uint32_t *data)
 {
 /* read PCI_HEADER_TYPE */
-*data = reg->init_val | 0x80;
+*data = reg->init_val;
+if ((PCI_DEVICE(s)->cap_present & QEMU_PCI_CAP_MULTIFUNCTION)) {
+*data |= PCI_HEADER_TYPE_MULTI_FUNCTION;
+}
 return 0;
 }
 
@@ -676,7 +679,7 @@ static XenPTRegInfo xen_pt_emu_reg_header0[] = {
 .size   = 1,
 .init_val   = 0x00,
 .ro_mask= 0xFF,
-.emu_mask   = 0x00,
+.emu_mask   = PCI_HEADER_TYPE_MULTI_FUNCTION,
 .init   = xen_pt_header_type_reg_init,
 .u.b.read   = xen_pt_byte_reg_read,
 .u.b.write  = xen_pt_byte_reg_write,
-- 
2.41.0




[PATCH] ps2: Don't send key release event for Lang1, Lang2 keys

2023-02-27 Thread Ross Lagerwall via
The scancodes for the Lang1 and Lang2 keys (i.e. Hangeul, Hanja) are
special since they already have the 0x80 bit set which is commonly used
to indicate a key release in AT set 1. Reportedly, real hardware does
not send a key release scancode. So, skip sending a release for these
keys. This ensures that Windows behaves correctly and interprets it as a
single keypress rather than two consecutive keypresses.

Signed-off-by: Ross Lagerwall 
---
 hw/input/ps2.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/hw/input/ps2.c b/hw/input/ps2.c
index 3253ab6a92..45af76a837 100644
--- a/hw/input/ps2.c
+++ b/hw/input/ps2.c
@@ -402,6 +402,9 @@ static void ps2_keyboard_event(DeviceState *dev, 
QemuConsole *src,
 ps2_put_keycode(s, 0xaa);
 }
 }
+} else if ((qcode == Q_KEY_CODE_LANG1 || qcode == Q_KEY_CODE_LANG2)
+   && !key->down) {
+/* Ignore release for these keys */
 } else {
 if (qcode < qemu_input_map_qcode_to_atset1_len) {
 keycode = qemu_input_map_qcode_to_atset1[qcode];
@@ -497,6 +500,9 @@ static void ps2_keyboard_event(DeviceState *dev, 
QemuConsole *src,
 ps2_put_keycode(s, 0x12);
 }
 }
+} else if ((qcode == Q_KEY_CODE_LANG1 || qcode == Q_KEY_CODE_LANG2) &&
+   !key->down) {
+/* Ignore release for these keys */
 } else {
 if (qcode < qemu_input_map_qcode_to_atset2_len) {
 keycode = qemu_input_map_qcode_to_atset2[qcode];
-- 
2.31.1




[PATCH] tpm_crb: Avoid backend startup just before shutdown under Xen

2022-08-26 Thread Ross Lagerwall via
When running under Xen and the guest reboots, it boots into a new domain
with a new QEMU process (and a new swtpm process if using the emulator
backend). The existing reset function is triggered just before the old
QEMU process exists which causes QEMU to startup the TPM backend and
then immediately shut it down. This is probably harmless but when using
the emulated backend, it wastes CPU and IO time reloading state, etc.

Fix this by calling the reset function directly from realize() when
running under Xen. During a reboot, this will be called by the QEMU
process for the new domain.

Signed-off-by: Ross Lagerwall 
---

This conditional logic is ugly. Is there a cleaner way of doing this?

 hw/tpm/tpm_crb.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/hw/tpm/tpm_crb.c b/hw/tpm/tpm_crb.c
index 67db594c48..ea930da545 100644
--- a/hw/tpm/tpm_crb.c
+++ b/hw/tpm/tpm_crb.c
@@ -26,6 +26,7 @@
 #include "sysemu/tpm_backend.h"
 #include "sysemu/tpm_util.h"
 #include "sysemu/reset.h"
+#include "sysemu/xen.h"
 #include "tpm_prop.h"
 #include "tpm_ppi.h"
 #include "trace.h"
@@ -308,7 +309,11 @@ static void tpm_crb_realize(DeviceState *dev, Error **errp)
  TPM_PPI_ADDR_BASE, OBJECT(s));
 }
 
-qemu_register_reset(tpm_crb_reset, dev);
+if (xen_enabled()) {
+tpm_crb_reset(dev);
+} else {
+qemu_register_reset(tpm_crb_reset, dev);
+}
 }
 
 static void tpm_crb_class_init(ObjectClass *klass, void *data)
-- 
2.31.1




[PATCH] tpm_emulator: Avoid double initialization during migration

2022-08-01 Thread Ross Lagerwall via
When resuming after a migration, the backend sends CMD_INIT to the
emulator from the startup callback, then it sends the migration state
from the vmstate to the emulator, then it sends CMD_INIT again. Skip the
first CMD_INIT during a migration to avoid initializing the TPM twice.

Signed-off-by: Ross Lagerwall 
---
 backends/tpm/tpm_emulator.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/backends/tpm/tpm_emulator.c b/backends/tpm/tpm_emulator.c
index 87d061e9bb..9b50c5b3e2 100644
--- a/backends/tpm/tpm_emulator.c
+++ b/backends/tpm/tpm_emulator.c
@@ -32,6 +32,7 @@
 #include "qemu/sockets.h"
 #include "qemu/lockable.h"
 #include "io/channel-socket.h"
+#include "sysemu/runstate.h"
 #include "sysemu/tpm_backend.h"
 #include "sysemu/tpm_util.h"
 #include "tpm_int.h"
@@ -383,6 +384,15 @@ err_exit:
 
 static int tpm_emulator_startup_tpm(TPMBackend *tb, size_t buffersize)
 {
+/* TPM startup will be done from post_load hook */
+if (runstate_check(RUN_STATE_INMIGRATE)) {
+if (buffersize != 0) {
+return tpm_emulator_set_buffer_size(tb, buffersize, NULL);
+}
+
+return 0;
+}
+
 return tpm_emulator_startup_tpm_resume(tb, buffersize, false);
 }
 
-- 
2.31.1




[PATCH] xen/pt: Avoid initializing BARs from the host ones

2022-04-27 Thread Ross Lagerwall via
The BAR emulated register definition does not set emu_mask because it
varies depending on bar_flag.  If emu_mask is not set, then the BAR is
initialized based on the host value which causes the BAR to be initially
mapped at whatever value the host device was using. Although it does
eventually get mapped at the correct location, it causes unnecessary
mapping/unmappings.

To fix this, initialize a per-register emu_mask in XenPTReg from the
initial value in XenPTRegInfo and then let the register's init() function
set/modify the emu_mask if necessary. Update the code to use emu_mask
in XenPTReg consistently and rename the existing emu_mask in
XenPTRegInfo to emu_mask_init to help with refactoring.

Signed-off-by: Ross Lagerwall 
---
 hw/xen/xen_pt.c |   2 +-
 hw/xen/xen_pt.h |   5 +-
 hw/xen/xen_pt_config_init.c | 221 ++--
 3 files changed, 115 insertions(+), 113 deletions(-)

diff --git a/hw/xen/xen_pt.c b/hw/xen/xen_pt.c
index 027190fa44..f0af1cfcec 100644
--- a/hw/xen/xen_pt.c
+++ b/hw/xen/xen_pt.c
@@ -327,7 +327,7 @@ static void xen_pt_pci_write_config(PCIDevice *d, uint32_t 
addr,
 uint32_t real_offset = reg_grp_entry->base_offset + reg->offset;
 uint32_t valid_mask = 0x >> ((4 - emul_len) << 3);
 uint8_t *ptr_val = NULL;
-uint32_t wp_mask = reg->emu_mask | reg->ro_mask;
+uint32_t wp_mask = reg_entry->emu_mask | reg->ro_mask;
 
 valid_mask <<= (find_addr - real_offset) << 3;
 ptr_val = (uint8_t *) + (real_offset & 3);
diff --git a/hw/xen/xen_pt.h b/hw/xen/xen_pt.h
index 6b8e13cdee..dbb917a46c 100644
--- a/hw/xen/xen_pt.h
+++ b/hw/xen/xen_pt.h
@@ -46,7 +46,7 @@ void igd_write_opregion(XenPCIPassthroughState *s, uint32_t 
val);
 
 /* function type for config reg */
 typedef int (*xen_pt_conf_reg_init)
-(XenPCIPassthroughState *, XenPTRegInfo *, uint32_t real_offset,
+(XenPCIPassthroughState *, XenPTReg *, uint32_t real_offset,
  uint32_t *data);
 typedef int (*xen_pt_conf_dword_write)
 (XenPCIPassthroughState *, XenPTReg *cfg_entry,
@@ -117,7 +117,7 @@ struct XenPTRegInfo {
 /* reg read/write-1-clear field mask (ON:RW1C/RW1CS, OFF:other) */
 uint32_t rw1c_mask;
 /* reg emulate field mask (ON:emu, OFF:passthrough) */
-uint32_t emu_mask;
+uint32_t emu_mask_init;
 xen_pt_conf_reg_init init;
 /* read/write function pointer
  * for double_word/word/byte size */
@@ -146,6 +146,7 @@ struct XenPTReg {
 uint16_t *half_word;
 uint32_t *word;
 } ptr; /* pointer to dev.config. */
+uint32_t emu_mask;
 };
 
 typedef const struct XenPTRegGroupInfo XenPTRegGroupInfo;
diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index c5c4e943a8..2a934494ed 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -25,7 +25,7 @@
 
 /* prototype */
 
-static int xen_pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTRegInfo *reg,
+static int xen_pt_ptr_reg_init(XenPCIPassthroughState *s, XenPTReg *reg_entry,
uint32_t real_offset, uint32_t *data);
 
 
@@ -98,9 +98,10 @@ XenPTReg *xen_pt_find_reg(XenPTRegGroup *reg_grp, uint32_t 
address)
 }
 
 static uint32_t get_throughable_mask(const XenPCIPassthroughState *s,
- XenPTRegInfo *reg, uint32_t valid_mask)
+ XenPTReg *reg_entry, uint32_t valid_mask)
 {
-uint32_t throughable_mask = ~(reg->emu_mask | reg->ro_mask);
+XenPTRegInfo *reg = reg_entry->reg;
+uint32_t throughable_mask = ~(reg_entry->emu_mask | reg->ro_mask);
 
 if (!s->permissive) {
 throughable_mask &= ~reg->res_mask;
@@ -116,10 +117,10 @@ static uint32_t get_throughable_mask(const 
XenPCIPassthroughState *s,
 /* register initialization function */
 
 static int xen_pt_common_reg_init(XenPCIPassthroughState *s,
-  XenPTRegInfo *reg, uint32_t real_offset,
+  XenPTReg *reg_entry, uint32_t real_offset,
   uint32_t *data)
 {
-*data = reg->init_val;
+*data = reg_entry->reg->init_val;
 return 0;
 }
 
@@ -128,12 +129,11 @@ static int xen_pt_common_reg_init(XenPCIPassthroughState 
*s,
 static int xen_pt_byte_reg_read(XenPCIPassthroughState *s, XenPTReg *cfg_entry,
 uint8_t *value, uint8_t valid_mask)
 {
-XenPTRegInfo *reg = cfg_entry->reg;
 uint8_t valid_emu_mask = 0;
 uint8_t *data = cfg_entry->ptr.byte;
 
 /* emulate byte register */
-valid_emu_mask = reg->emu_mask & valid_mask;
+valid_emu_mask = cfg_entry->emu_mask & valid_mask;
 *value = XEN_PT_MERGE_VALUE(*value, *data, ~valid_emu_mask);
 
 return 0;
@@ -141,12 +141,11 @@ static int xen_pt_byte_reg_read(XenPCIPassthroughState 
*s, 

[PATCH v2] xen-mapcache: Avoid entry->lock overflow

2022-01-24 Thread Ross Lagerwall via
In some cases, a particular mapcache entry may be mapped 256 times
causing the lock field to wrap to 0. For example, this may happen when
using emulated NVME and the guest submits a large scatter-gather write.
At this point, the entry map be remapped causing QEMU to write the wrong
data or crash (since remap is not atomic).

Avoid this overflow by increasing the lock field to a uint32_t and also
detect it and abort rather than continuing regardless.

Signed-off-by: Ross Lagerwall 
---
Changes in v2: Change type to uint32_t since there is a hole there
anyway. The struct size remains at 48 bytes on x86_64.

 hw/i386/xen/xen-mapcache.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
index bd47c3d672..f2ef977963 100644
--- a/hw/i386/xen/xen-mapcache.c
+++ b/hw/i386/xen/xen-mapcache.c
@@ -52,7 +52,7 @@ typedef struct MapCacheEntry {
 hwaddr paddr_index;
 uint8_t *vaddr_base;
 unsigned long *valid_mapping;
-uint8_t lock;
+uint32_t lock;
 #define XEN_MAPCACHE_ENTRY_DUMMY (1 << 0)
 uint8_t flags;
 hwaddr size;
@@ -355,6 +355,12 @@ tryagain:
 if (lock) {
 MapCacheRev *reventry = g_malloc0(sizeof(MapCacheRev));
 entry->lock++;
+if (entry->lock == 0) {
+fprintf(stderr,
+"mapcache entry lock overflow: "TARGET_FMT_plx" -> %p\n",
+entry->paddr_index, entry->vaddr_base);
+abort();
+}
 reventry->dma = dma;
 reventry->vaddr_req = mapcache->last_entry->vaddr_base + 
address_offset;
 reventry->paddr_index = mapcache->last_entry->paddr_index;
-- 
2.27.0




[PATCH] xen-mapcache: Avoid entry->lock overflow

2022-01-21 Thread Ross Lagerwall via
In some cases, a particular mapcache entry may be mapped 256 times
causing the lock field to wrap to 0. For example, this may happen when
using emulated NVME and the guest submits a large scatter-gather write.
At this point, the entry map be remapped causing QEMU to write the wrong
data or crash (since remap is not atomic).

Avoid this overflow by increasing the lock field to a uint16_t and also
detect it and abort rather than continuing regardless.

Signed-off-by: Ross Lagerwall 
---
 hw/i386/xen/xen-mapcache.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/hw/i386/xen/xen-mapcache.c b/hw/i386/xen/xen-mapcache.c
index bd47c3d672..82dc495a60 100644
--- a/hw/i386/xen/xen-mapcache.c
+++ b/hw/i386/xen/xen-mapcache.c
@@ -52,7 +52,7 @@ typedef struct MapCacheEntry {
 hwaddr paddr_index;
 uint8_t *vaddr_base;
 unsigned long *valid_mapping;
-uint8_t lock;
+uint16_t lock;
 #define XEN_MAPCACHE_ENTRY_DUMMY (1 << 0)
 uint8_t flags;
 hwaddr size;
@@ -355,6 +355,12 @@ tryagain:
 if (lock) {
 MapCacheRev *reventry = g_malloc0(sizeof(MapCacheRev));
 entry->lock++;
+if (entry->lock == 0) {
+fprintf(stderr,
+"mapcache entry lock overflow: "TARGET_FMT_plx" -> %p\n",
+entry->paddr_index, entry->vaddr_base);
+abort();
+}
 reventry->dma = dma;
 reventry->vaddr_req = mapcache->last_entry->vaddr_base + 
address_offset;
 reventry->paddr_index = mapcache->last_entry->paddr_index;
-- 
2.27.0




Re: [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-11-04 Thread Ross Lagerwall
On 11/4/19 9:04 AM, Klaus Birkelund wrote:
> On Mon, Nov 04, 2019 at 08:46:29AM +0000, Ross Lagerwall wrote:
>> On 8/23/19 9:10 AM, Klaus Birkelund wrote:
>>> On Thu, Aug 22, 2019 at 02:18:05PM +0100, Ross Lagerwall wrote:
>>>> On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:
>>>>
>>>> I tried this patch series by installing Windows with a single NVME
>>>> controller having two namespaces. QEMU crashed in get_feature /
>>>> NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.
>>>>
>>>
>>> Hi Ross,
>>>
>>> Good catch!
>>>
>>>> nvme_get_feature / nvme_set_feature look wrong to me since I can't see how
>>>> req->ns would have been set. Should they have similar code to nvme_io_cmd 
>>>> to
>>>> set req->ns from cmd->nsid?
>>>
>>> Definitely. I will fix that for v2.
>>>
>>>>
>>>> After working around this issue everything else seemed to be working well.
>>>> Thanks for your work on this patch series.
>>>>
>>>
>>> And thank you for trying out my patches!
>>>
>>
>> One more thing... it doesn't handle inactive namespaces properly so if you
>> have two namespaces with e.g. nsid=1 and nsid=3 QEMU ends up crashing in
>> certain situations. The patch below adds support for inactive namespaces.
>>
>> Still hoping to see a v2 some day :-)
>>
>  
> Hi Ross,
> 
> v2[1] is actually out, but only CC'ed Paul. Sorry about that! It fixes
> the support for discontiguous nsid's, but does not handle inactive
> namespaces correctly in identify.
> 
> I'll incorporate that in a v3 along with a couple of other fixes I did.
> 
> Thanks!
> 
> 
>   [1]: https://patchwork.kernel.org/cover/11190045/
> 

Thanks for pointing me at that, I missed it.

Regards,
-- 
Ross Lagerwall



Re: [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-11-04 Thread Ross Lagerwall
On 8/23/19 9:10 AM, Klaus Birkelund wrote:
> On Thu, Aug 22, 2019 at 02:18:05PM +0100, Ross Lagerwall wrote:
>> On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:
>>
>> I tried this patch series by installing Windows with a single NVME
>> controller having two namespaces. QEMU crashed in get_feature /
>> NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.
>>
> 
> Hi Ross,
> 
> Good catch!
> 
>> nvme_get_feature / nvme_set_feature look wrong to me since I can't see how
>> req->ns would have been set. Should they have similar code to nvme_io_cmd to
>> set req->ns from cmd->nsid?
> 
> Definitely. I will fix that for v2.
> 
>>
>> After working around this issue everything else seemed to be working well.
>> Thanks for your work on this patch series.
>>
> 
> And thank you for trying out my patches!
> 

One more thing... it doesn't handle inactive namespaces properly so if you
have two namespaces with e.g. nsid=1 and nsid=3 QEMU ends up crashing in
certain situations. The patch below adds support for inactive namespaces.

Still hoping to see a v2 some day :-)

Thanks,
Ross

-- 8-< --
nvme-ns: Allow inactive namespaces

The controller advertises the maximum namespace number but not all of these
slots may have proper namespaces. These are defined as inactive namespaces by
the spec.  Implement support for inactive namespaces instead of crashing.

Changes are needed in a few places:
* When identify_ns is used with an inactive namespace, the controller should
  return all zeroes.
* Only active namespaces should be returned by identify_ns_list.
* When the controller is unplugged, only cleanup active namespaces.
* Keep track of and advertise the maximum valid namespace number rather than
* the number of active namespaces.
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 1b8a09853d..29ea5c2023 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1302,6 +1302,7 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
 NvmeNamespace *ns;
+NvmeIdNs *id_ns, invalid_ns_id = {0};
 uint32_t nsid = le32_to_cpu(cmd->nsid);
 
 trace_nvme_identify_ns(nsid);
@@ -1312,9 +1313,13 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
 }
 
 ns = n->namespaces[nsid - 1];
+if (ns) {
+id_ns = >id_ns;
+} else {
+id_ns = _ns_id;
+}
 
-return nvme_dma_read(n, (uint8_t *) >id_ns, sizeof(ns->id_ns), cmd,
-req);
+return nvme_dma_read(n, (uint8_t *) id_ns, sizeof(*id_ns), cmd, req);
 }
 
 static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
@@ -1330,7 +1335,7 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, 
NvmeCmd *cmd,
 
 list = g_malloc0(data_len);
 for (i = 0; i < n->num_namespaces; i++) {
-if (i < min_nsid) {
+if (i < min_nsid || !n->namespaces[i]) {
 continue;
 }
 list[j++] = cpu_to_le32(i + 1);
@@ -1861,7 +1866,9 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 int i;
 
 for (i = 0; i < n->num_namespaces; i++) {
-blk_drain(n->namespaces[i]->conf.blk);
+if (n->namespaces[i]) {
+blk_drain(n->namespaces[i]->conf.blk);
+}
 }
 
 for (i = 0; i < n->params.num_queues; i++) {
@@ -1886,7 +1893,9 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 }
 
 for (i = 0; i < n->num_namespaces; i++) {
-blk_flush(n->namespaces[i]->conf.blk);
+if (n->namespaces[i]) {
+blk_flush(n->namespaces[i]->conf.blk);
+}
 }
 
 n->bar.cc = 0;
@@ -2464,8 +2473,9 @@ int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace 
*ns, Error **errp)
 trace_nvme_register_namespace(nsid);
 
 n->namespaces[nsid - 1] = ns;
-n->num_namespaces++;
-n->id_ctrl.nn++;
+if (nsid > n->num_namespaces)
+n->num_namespaces = nsid;
+n->id_ctrl.nn = n->num_namespaces;
 
 return 0;
 }



Re: [Qemu-devel] [PATCH 16/16] nvme: support multiple namespaces

2019-08-22 Thread Ross Lagerwall

On 7/5/19 8:23 AM, Klaus Birkelund Jensen wrote:

This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named from the device name
('id'). The nvme-ns devices then connect to this and registers
themselves with the nvme device.

This changes how an nvme device is created. Example with two namespaces:

   -drive file=nvme0n1.img,if=none,id=disk1
   -drive file=nvme0n2.img,if=none,id=disk2
   -device nvme,serial=deadbeef,id=nvme0
   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

A maximum of 256 namespaces can be configured.


snip

  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
  {
+NvmeNamespace *ns = req->ns;
+
  uint32_t dw10 = le32_to_cpu(cmd->cdw10);
  uint32_t dw11 = le32_to_cpu(cmd->cdw11);
  uint32_t result;
@@ -1464,7 +1474,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd 
*cmd, NvmeRequest *req)
  result = cpu_to_le32(n->features.err_rec);
  break;
  case NVME_VOLATILE_WRITE_CACHE:
-result = blk_enable_write_cache(n->conf.blk);
+result = blk_enable_write_cache(ns->conf.blk);
  trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
  break;


I tried this patch series by installing Windows with a single NVME 
controller having two namespaces. QEMU crashed in get_feature / 
NVME_VOLATILE_WRITE_CACHE because req->ns was NULL.


nvme_get_feature / nvme_set_feature look wrong to me since I can't see 
how req->ns would have been set. Should they have similar code to 
nvme_io_cmd to set req->ns from cmd->nsid?


After working around this issue everything else seemed to be working 
well. Thanks for your work on this patch series.


Thanks,
--
Ross Lagerwall



[Qemu-devel] [PATCH] xen_pt: Present the size of 64 bit BARs correctly

2018-05-14 Thread Ross Lagerwall
The full size of the BAR is stored in the lower PCIIORegion.size. The
upper PCIIORegion.size is 0.  Calculate the size of the upper half
correctly from the lower half otherwise the size read by the guest will
be incorrect.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
 hw/xen/xen_pt_config_init.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/xen/xen_pt_config_init.c b/hw/xen/xen_pt_config_init.c
index a3ce33e..aee31c6 100644
--- a/hw/xen/xen_pt_config_init.c
+++ b/hw/xen/xen_pt_config_init.c
@@ -504,6 +504,8 @@ static int xen_pt_bar_reg_write(XenPCIPassthroughState *s, 
XenPTReg *cfg_entry,
 bar_ro_mask = XEN_PT_BAR_IO_RO_MASK | (r_size - 1);
 break;
 case XEN_PT_BAR_FLAG_UPPER:
+assert(index > 0);
+r_size = d->io_regions[index - 1].size >> 32;
 bar_emu_mask = XEN_PT_BAR_ALLF;
 bar_ro_mask = r_size ? r_size - 1 : 0;
 break;
-- 
2.9.5




Re: [Qemu-devel] [PATCH] vga: fix region calculation

2018-03-08 Thread Ross Lagerwall

On 03/08/2018 07:31 AM, Gerd Hoffmann wrote:

Typically the scanline length and the line offset are identical.  But
case they are not our calculation for region_end is incorrect.  Using
line_offset is fine for all scanlines, except the last one where we have
to use the actual scanline length.

Fixes: CVE-2018-
Cc: P J P <ppan...@redhat.com>
Cc: Ross Lagerwall <ross.lagerw...@citrix.com>
Signed-off-by: Gerd Hoffmann <kra...@redhat.com>
---


Tested-by: Ross Lagerwall <ross.lagerw...@citrix.com>

Thanks!
--
Ross Lagerwall



[Qemu-devel] [PATCH] migration/xen: Check return value of qemu_fclose

2018-02-06 Thread Ross Lagerwall
QEMUFile uses buffered IO so when writing small amounts (such as the Xen
device state file), the actual write call and any errors that may occur
only happen as part of qemu_fclose(). Therefore, report IO errors when
saving the device state under Xen by checking the return value of
qemu_fclose().

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
 migration/savevm.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/migration/savevm.c b/migration/savevm.c
index b7908f6..4b9d5be 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2267,8 +2267,7 @@ void qmp_xen_save_devices_state(const char *filename, 
bool has_live, bool live,
 qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
 f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
 ret = qemu_save_device_state(f);
-qemu_fclose(f);
-if (ret < 0) {
+if (ret < 0 || qemu_fclose(f) < 0) {
 error_setg(errp, QERR_IO_ERROR);
 } else {
 /* libxl calls the QMP command "stop" before calling
-- 
2.9.5




Re: [Qemu-devel] [PATCH v5 0/8] xen: xen-domid-restrict improvements

2018-01-24 Thread Ross Lagerwall

On 10/20/2017 02:37 PM, Ian Jackson wrote:

Anthony PERARD writes ("Re: [PATCH v5 0/8] xen: xen-domid-restrict 
improvements"):

The patches in this v5 appear to be the same the one from the patch
series v4.


Erk, so they are.

I'll post a v5.1 in reply to this email.



What's the status of this patch series? There don't seem to be many 
outstanding complaints but they haven't been pushed into master. At 
least the Xen changes have all been reviewed by Anthony (except for 
configure changes) so they could probably go in.


Thanks,
--
Ross Lagerwall



Re: [Qemu-devel] [PATCH v2 1/4] migration: Don't leak IO channels

2018-01-18 Thread Ross Lagerwall

On 11/01/2017 02:25 PM, Ross Lagerwall wrote:

Since qemu_fopen_channel_{in,out}put take references on the underlying
IO channels, make sure to release our references to them.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
New in v2.

  migration/savevm.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index 4a88228..87557dd 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2259,6 +2259,7 @@ void qmp_xen_save_devices_state(const char *filename, 
Error **errp)
  }
  qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
  f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
+object_unref(OBJECT(ioc));
  ret = qemu_save_device_state(f);
  qemu_fclose(f);
  if (ret < 0) {
@@ -2292,6 +2293,7 @@ void qmp_xen_load_devices_state(const char *filename, 
Error **errp)
  }
  qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
  f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
+object_unref(OBJECT(ioc));
  
  ret = qemu_loadvm_state(f);

  qemu_fclose(f);



Ping for review...

Thanks,
--
Ross Lagerwall



Re: [Qemu-devel] [PATCH v2 0/4] QIOChannelFile bug fixes

2018-01-18 Thread Ross Lagerwall

On 11/01/2017 02:25 PM, Ross Lagerwall wrote:

Hi,

Here is a bug fix with the use of QIOChannelFile and 2 bug fixes and an
improvement to implementation of QIOChannelFile.

Regards,
Ross Lagerwall

Ross Lagerwall (4):
   migration: Don't leak IO channels
   io: Fix QIOChannelFile when creating and opening read-write
   io: Don't call close multiple times in QIOChannelFile
   io: Add /dev/fdset/ support to QIOChannelFile

  include/io/channel-file.h|  2 +-
  io/channel-file.c| 11 ---
  migration/savevm.c   |  2 ++
  tests/test-io-channel-file.c | 29 +
  4 files changed, 32 insertions(+), 12 deletions(-)



Ping for reviews...

v1:
Got feedback from Daniel P. Berrange and Marc-André Lureau.

v2:
Patch 1: Unreviewed
Patch 2: Unreviewed
Patch 3: Reviewed by Marc-André Lureau
Patch 4: Reviewed by Marc-André Lureau

The patch series still applies cleanly on top of master.

Thanks,
--
Ross Lagerwall



Re: [Qemu-devel] QMP event missed during startup

2017-11-10 Thread Ross Lagerwall

On 11/09/2017 02:14 PM, Markus Armbruster wrote:

"Dr. David Alan Gilbert" <dgilb...@redhat.com> writes:


* Ross Lagerwall (ross.lagerw...@citrix.com) wrote:

Hi,

I have found an issue where QEMU emits the RESUME event during startup when
it starts VM execution, but it is not possible to receive this event.

To repro this, run:
qemu-system-i386 -m 256 -trace
enable=monitor_protocol_event_emit,file=/tmp/out -qmp
unix:/tmp/qmp,server,wait

QEMU will not start execution of the VM until something connects to the QMP
socket (e.g. qmp-shell). Once connected, no event is received on the QMP
connection but the tracepoint is hit indicating that an event has been
emitted. I suspect that the event is emitted while the QMP client is doing
the initial negotiation.

The reason I want to receive this event is that QEMU currently uses xenstore
to communicate this information to the Xen toolstack (see
xen-common.c:xen_change_state_handler) but we want to move to using QMP
rather than xenstore for this kind of thing.

Is this a known issue or just a bug that should be fixed?


I'll leave it to Markus to say if it's a bug or not, but can't
you work around this by starting qemu with -S which leaves the guest
paused, and then continuing the guest when you have your QMP ?


You can:

<-- {"QMP": {"version": {"qemu": {"micro": 50, "minor": 10, "major": 2}, "package": " 
(v2.10.0-613-g10656079e1-dirty)"}, "capabilities": []}}
--> { "execute": "qmp_capabilities" }
<-- {"return": {}}
--> { "execute": "cont" }
<-- {"timestamp": {"seconds": 1510235984, "microseconds": 108550}, "event": 
"RESUME"}
<-- {"return": {}}

RESUME is sent in vm_prepare_start(), called from main() via vm_start(),
but only if @autostart, i.e. no -S.

The "wait" in the argument of -qmp makes QEMU wait for a QMP client to
connect to the QMP socket, long before vm_start() gets called.  However,
having connected is not sufficient for receiving events, you also have
to exit capabilities negotiation mode.  Not possible until QEMU is
running the main loop, which runs after the vm_start() quoted above.

If QMP monitors became usable before entering main_loop(), we'd have a
race condition instead.  The only reliable way to get the RESUME event
is -S.

This adds one minor item to the long list of reasons why management
software should pass -S.

All clear now?



Yeah that makes sense thanks. I've now tested with -S and it works fine.

Cheers,
--
Ross Lagerwall



[Qemu-devel] [PATCH v2 4/4] io: Add /dev/fdset/ support to QIOChannelFile

2017-11-01 Thread Ross Lagerwall
Add /dev/fdset/ support to QIOChannelFile by calling qemu_open() instead
of open() and qemu_close() instead of close(). There is a subtle
semantic change since qemu_open() automatically sets O_CLOEXEC, but this
doesn't affect any of the users of the function.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
Changed in v2:
* Split into separate patch.

 io/channel-file.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/io/channel-file.c b/io/channel-file.c
index 1f2f710..db948ab 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -50,7 +50,7 @@ qio_channel_file_new_path(const char *path,
 
 ioc = QIO_CHANNEL_FILE(object_new(TYPE_QIO_CHANNEL_FILE));
 
-ioc->fd = open(path, flags, mode);
+ioc->fd = qemu_open(path, flags, mode);
 if (ioc->fd < 0) {
 object_unref(OBJECT(ioc));
 error_setg_errno(errp, errno,
@@ -74,7 +74,7 @@ static void qio_channel_file_finalize(Object *obj)
 {
 QIOChannelFile *ioc = QIO_CHANNEL_FILE(obj);
 if (ioc->fd != -1) {
-close(ioc->fd);
+qemu_close(ioc->fd);
 ioc->fd = -1;
 }
 }
@@ -173,7 +173,7 @@ static int qio_channel_file_close(QIOChannel *ioc,
 {
 QIOChannelFile *fioc = QIO_CHANNEL_FILE(ioc);
 
-if (close(fioc->fd) < 0) {
+if (qemu_close(fioc->fd) < 0) {
 error_setg_errno(errp, errno,
  "Unable to close file");
 return -1;
-- 
2.9.5




[Qemu-devel] [PATCH v2 2/4] io: Fix QIOChannelFile when creating and opening read-write

2017-11-01 Thread Ross Lagerwall
The code wrongly passes the mode to open() only if O_WRONLY is set.
Instead, the mode should be passed when O_CREAT is set (or O_TMPFILE on
Linux). Fix this by always passing the mode since open() will correctly
ignore the mode if it is not needed. Add a testcase which exercises this
bug and also change the existing testcase to check that the mode of the
created file is correct.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
Changed in v2:
* Separated from qemu_open() change.

 include/io/channel-file.h|  2 +-
 io/channel-file.c|  6 +-
 tests/test-io-channel-file.c | 29 +
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/include/io/channel-file.h b/include/io/channel-file.h
index 79245f1..ebfe54e 100644
--- a/include/io/channel-file.h
+++ b/include/io/channel-file.h
@@ -73,7 +73,7 @@ qio_channel_file_new_fd(int fd);
  * qio_channel_file_new_path:
  * @path: the file path
  * @flags: the open flags (O_RDONLY|O_WRONLY|O_RDWR, etc)
- * @mode: the file creation mode if O_WRONLY is set in @flags
+ * @mode: the file creation mode if O_CREAT is set in @flags
  * @errp: pointer to initialized error object
  *
  * Create a new IO channel object for a file represented
diff --git a/io/channel-file.c b/io/channel-file.c
index b383273..16bf7ed 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -50,11 +50,7 @@ qio_channel_file_new_path(const char *path,
 
 ioc = QIO_CHANNEL_FILE(object_new(TYPE_QIO_CHANNEL_FILE));
 
-if (flags & O_WRONLY) {
-ioc->fd = open(path, flags, mode);
-} else {
-ioc->fd = open(path, flags);
-}
+ioc->fd = open(path, flags, mode);
 if (ioc->fd < 0) {
 object_unref(OBJECT(ioc));
 error_setg_errno(errp, errno,
diff --git a/tests/test-io-channel-file.c b/tests/test-io-channel-file.c
index 6bfede6..2e94f63 100644
--- a/tests/test-io-channel-file.c
+++ b/tests/test-io-channel-file.c
@@ -24,16 +24,21 @@
 #include "io-channel-helpers.h"
 #include "qapi/error.h"
 
-static void test_io_channel_file(void)
+#define TEST_FILE "tests/test-io-channel-file.txt"
+#define TEST_MASK 0600
+
+static void test_io_channel_file_helper(int flags)
 {
 QIOChannel *src, *dst;
 QIOChannelTest *test;
+struct stat st;
+mode_t mask;
+int ret;
 
-#define TEST_FILE "tests/test-io-channel-file.txt"
 unlink(TEST_FILE);
 src = QIO_CHANNEL(qio_channel_file_new_path(
   TEST_FILE,
-  O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0600,
+  flags, TEST_MASK,
   _abort));
 dst = QIO_CHANNEL(qio_channel_file_new_path(
   TEST_FILE,
@@ -45,18 +50,33 @@ static void test_io_channel_file(void)
 qio_channel_test_run_reader(test, dst);
 qio_channel_test_validate(test);
 
+/* Check that the requested mode took effect. */
+mask = umask(0);
+umask(mask);
+ret = stat(TEST_FILE, );
+g_assert_cmpint(ret, >, -1);
+g_assert_cmpuint(TEST_MASK & ~mask, ==, st.st_mode & 0777);
+
 unlink(TEST_FILE);
 object_unref(OBJECT(src));
 object_unref(OBJECT(dst));
 }
 
+static void test_io_channel_file(void)
+{
+test_io_channel_file_helper(O_WRONLY | O_CREAT | O_TRUNC | O_BINARY);
+}
+
+static void test_io_channel_file_rdwr(void)
+{
+test_io_channel_file_helper(O_RDWR | O_CREAT | O_TRUNC | O_BINARY);
+}
 
 static void test_io_channel_fd(void)
 {
 QIOChannel *ioc;
 int fd = -1;
 
-#define TEST_FILE "tests/test-io-channel-file.txt"
 fd = open(TEST_FILE, O_CREAT | O_TRUNC | O_WRONLY, 0600);
 g_assert_cmpint(fd, >, -1);
 
@@ -114,6 +134,7 @@ int main(int argc, char **argv)
 g_test_init(, , NULL);
 
 g_test_add_func("/io/channel/file", test_io_channel_file);
+g_test_add_func("/io/channel/file/rdwr", test_io_channel_file_rdwr);
 g_test_add_func("/io/channel/file/fd", test_io_channel_fd);
 #ifndef _WIN32
 g_test_add_func("/io/channel/pipe/sync", test_io_channel_pipe_sync);
-- 
2.9.5




[Qemu-devel] [PATCH v2 0/4] QIOChannelFile bug fixes

2017-11-01 Thread Ross Lagerwall
Hi,

Here is a bug fix with the use of QIOChannelFile and 2 bug fixes and an
improvement to implementation of QIOChannelFile.

Regards,
Ross Lagerwall

Ross Lagerwall (4):
  migration: Don't leak IO channels
  io: Fix QIOChannelFile when creating and opening read-write
  io: Don't call close multiple times in QIOChannelFile
  io: Add /dev/fdset/ support to QIOChannelFile

 include/io/channel-file.h|  2 +-
 io/channel-file.c| 11 ---
 migration/savevm.c   |  2 ++
 tests/test-io-channel-file.c | 29 +
 4 files changed, 32 insertions(+), 12 deletions(-)

-- 
2.9.5




[Qemu-devel] [PATCH v2 1/4] migration: Don't leak IO channels

2017-11-01 Thread Ross Lagerwall
Since qemu_fopen_channel_{in,out}put take references on the underlying
IO channels, make sure to release our references to them.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
New in v2.

 migration/savevm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/migration/savevm.c b/migration/savevm.c
index 4a88228..87557dd 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2259,6 +2259,7 @@ void qmp_xen_save_devices_state(const char *filename, 
Error **errp)
 }
 qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-save-state");
 f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
+object_unref(OBJECT(ioc));
 ret = qemu_save_device_state(f);
 qemu_fclose(f);
 if (ret < 0) {
@@ -2292,6 +2293,7 @@ void qmp_xen_load_devices_state(const char *filename, 
Error **errp)
 }
 qio_channel_set_name(QIO_CHANNEL(ioc), "migration-xen-load-state");
 f = qemu_fopen_channel_input(QIO_CHANNEL(ioc));
+object_unref(OBJECT(ioc));
 
 ret = qemu_loadvm_state(f);
 qemu_fclose(f);
-- 
2.9.5




[Qemu-devel] [PATCH v2 3/4] io: Don't call close multiple times in QIOChannelFile

2017-11-01 Thread Ross Lagerwall
If the file descriptor underlying QIOChannelFile is closed in the
io_close() method, don't close it again in the finalize() method since
the file descriptor number may have been reused in the meantime.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
New in v2.

 io/channel-file.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/io/channel-file.c b/io/channel-file.c
index 16bf7ed..1f2f710 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -178,6 +178,7 @@ static int qio_channel_file_close(QIOChannel *ioc,
  "Unable to close file");
 return -1;
 }
+fioc->fd = -1;
 return 0;
 }
 
-- 
2.9.5




Re: [Qemu-devel] [PATCH] io: Fix QIOChannelFile when creating and opening read-write

2017-11-01 Thread Ross Lagerwall

On 11/01/2017 10:04 AM, Daniel P. Berrange wrote:

On Tue, Oct 31, 2017 at 04:09:02PM +, Ross Lagerwall wrote:

The code wrongly passes the mode to open() only if O_WRONLY is set.
Instead, the mode should be passed when O_CREAT is set (or O_TMPFILE on
Linux). Fix this by always passing the mode since open() will correctly
ignore the mode if it is not needed. Add a testcase which exercises this
bug.

While at it, add /dev/fdset/ support to QIOChannelFile by calling
qemu_open() instead open(). There is a subtle semantic change since
qemu_open() automatically sets O_CLOEXEC, but this doesn't affect any of
the users of the function.


Can you split the use of qemu_open() into a separate patch - its bad pratice
to mix two different functional changes in one patch. Also as Marc-Andre
mentions, we would need qemu_close() here too I think.


OK, sure, I will split into two patches and call qemu_close().





Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
  include/io/channel-file.h|  2 +-
  io/channel-file.c|  6 +-
  tests/test-io-channel-file.c | 27 +++
  3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/io/channel-file.h b/include/io/channel-file.h
index 79245f1..ebfe54e 100644
--- a/include/io/channel-file.h
+++ b/include/io/channel-file.h
@@ -73,7 +73,7 @@ qio_channel_file_new_fd(int fd);
   * qio_channel_file_new_path:
   * @path: the file path
   * @flags: the open flags (O_RDONLY|O_WRONLY|O_RDWR, etc)
- * @mode: the file creation mode if O_WRONLY is set in @flags
+ * @mode: the file creation mode if O_CREAT is set in @flags
   * @errp: pointer to initialized error object
   *
   * Create a new IO channel object for a file represented
diff --git a/io/channel-file.c b/io/channel-file.c
index b383273..ae8fb62 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -50,11 +50,7 @@ qio_channel_file_new_path(const char *path,
  
  ioc = QIO_CHANNEL_FILE(object_new(TYPE_QIO_CHANNEL_FILE));
  
-if (flags & O_WRONLY) {

-ioc->fd = open(path, flags, mode);
-} else {
-ioc->fd = open(path, flags);
-}
+ioc->fd = qemu_open(path, flags, mode);
  if (ioc->fd < 0) {
  object_unref(OBJECT(ioc));
  error_setg_errno(errp, errno,
diff --git a/tests/test-io-channel-file.c b/tests/test-io-channel-file.c
index 6bfede6..0c7303d 100644
--- a/tests/test-io-channel-file.c
+++ b/tests/test-io-channel-file.c
@@ -50,6 +50,32 @@ static void test_io_channel_file(void)
  object_unref(OBJECT(dst));
  }
  
+static void test_io_channel_file_rdwr(void)

+{
+QIOChannel *src, *dst;
+QIOChannelTest *test;
+
+#define TEST_FILE "tests/test-io-channel-file.txt"
+unlink(TEST_FILE);
+src = QIO_CHANNEL(qio_channel_file_new_path(
+  TEST_FILE,
+  O_RDWR | O_CREAT | O_TRUNC | O_BINARY, 0600,
+  _abort));
+dst = QIO_CHANNEL(qio_channel_file_new_path(
+  TEST_FILE,
+  O_RDONLY | O_BINARY, 0,
+  _abort));
+
+test = qio_channel_test_new();
+qio_channel_test_run_writer(test, src);
+qio_channel_test_run_reader(test, dst);
+qio_channel_test_validate(test);


Since we're specifically fixing a problem with mode, it would be good to
check the mode of the created file to show that its applied - ideally the
test would fail with previous code in this way.


At least with glibc, the test does fail with the previous code because 
it aborts if O_CREAT is present in flags without the mode. But I can add 
a check anyway.


Regards,
--
Ross Lagerwall



[Qemu-devel] [PATCH] io: Fix QIOChannelFile when creating and opening read-write

2017-10-31 Thread Ross Lagerwall
The code wrongly passes the mode to open() only if O_WRONLY is set.
Instead, the mode should be passed when O_CREAT is set (or O_TMPFILE on
Linux). Fix this by always passing the mode since open() will correctly
ignore the mode if it is not needed. Add a testcase which exercises this
bug.

While at it, add /dev/fdset/ support to QIOChannelFile by calling
qemu_open() instead open(). There is a subtle semantic change since
qemu_open() automatically sets O_CLOEXEC, but this doesn't affect any of
the users of the function.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
 include/io/channel-file.h|  2 +-
 io/channel-file.c|  6 +-
 tests/test-io-channel-file.c | 27 +++
 3 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/io/channel-file.h b/include/io/channel-file.h
index 79245f1..ebfe54e 100644
--- a/include/io/channel-file.h
+++ b/include/io/channel-file.h
@@ -73,7 +73,7 @@ qio_channel_file_new_fd(int fd);
  * qio_channel_file_new_path:
  * @path: the file path
  * @flags: the open flags (O_RDONLY|O_WRONLY|O_RDWR, etc)
- * @mode: the file creation mode if O_WRONLY is set in @flags
+ * @mode: the file creation mode if O_CREAT is set in @flags
  * @errp: pointer to initialized error object
  *
  * Create a new IO channel object for a file represented
diff --git a/io/channel-file.c b/io/channel-file.c
index b383273..ae8fb62 100644
--- a/io/channel-file.c
+++ b/io/channel-file.c
@@ -50,11 +50,7 @@ qio_channel_file_new_path(const char *path,
 
 ioc = QIO_CHANNEL_FILE(object_new(TYPE_QIO_CHANNEL_FILE));
 
-if (flags & O_WRONLY) {
-ioc->fd = open(path, flags, mode);
-} else {
-ioc->fd = open(path, flags);
-}
+ioc->fd = qemu_open(path, flags, mode);
 if (ioc->fd < 0) {
 object_unref(OBJECT(ioc));
 error_setg_errno(errp, errno,
diff --git a/tests/test-io-channel-file.c b/tests/test-io-channel-file.c
index 6bfede6..0c7303d 100644
--- a/tests/test-io-channel-file.c
+++ b/tests/test-io-channel-file.c
@@ -50,6 +50,32 @@ static void test_io_channel_file(void)
 object_unref(OBJECT(dst));
 }
 
+static void test_io_channel_file_rdwr(void)
+{
+QIOChannel *src, *dst;
+QIOChannelTest *test;
+
+#define TEST_FILE "tests/test-io-channel-file.txt"
+unlink(TEST_FILE);
+src = QIO_CHANNEL(qio_channel_file_new_path(
+  TEST_FILE,
+  O_RDWR | O_CREAT | O_TRUNC | O_BINARY, 0600,
+  _abort));
+dst = QIO_CHANNEL(qio_channel_file_new_path(
+  TEST_FILE,
+  O_RDONLY | O_BINARY, 0,
+  _abort));
+
+test = qio_channel_test_new();
+qio_channel_test_run_writer(test, src);
+qio_channel_test_run_reader(test, dst);
+qio_channel_test_validate(test);
+
+unlink(TEST_FILE);
+object_unref(OBJECT(src));
+object_unref(OBJECT(dst));
+}
+
 
 static void test_io_channel_fd(void)
 {
@@ -114,6 +140,7 @@ int main(int argc, char **argv)
 g_test_init(, , NULL);
 
 g_test_add_func("/io/channel/file", test_io_channel_file);
+g_test_add_func("/io/channel/file/rdwr", test_io_channel_file_rdwr);
 g_test_add_func("/io/channel/file/fd", test_io_channel_fd);
 #ifndef _WIN32
 g_test_add_func("/io/channel/pipe/sync", test_io_channel_pipe_sync);
-- 
2.9.5




[Qemu-devel] QMP event missed during startup

2017-10-30 Thread Ross Lagerwall

Hi,

I have found an issue where QEMU emits the RESUME event during startup 
when it starts VM execution, but it is not possible to receive this event.


To repro this, run:
qemu-system-i386 -m 256 -trace 
enable=monitor_protocol_event_emit,file=/tmp/out -qmp 
unix:/tmp/qmp,server,wait


QEMU will not start execution of the VM until something connects to the 
QMP socket (e.g. qmp-shell). Once connected, no event is received on the 
QMP connection but the tracepoint is hit indicating that an event has 
been emitted. I suspect that the event is emitted while the QMP client 
is doing the initial negotiation.


The reason I want to receive this event is that QEMU currently uses 
xenstore to communicate this information to the Xen toolstack (see 
xen-common.c:xen_change_state_handler) but we want to move to using QMP 
rather than xenstore for this kind of thing.


Is this a known issue or just a bug that should be fixed?

Thanks,
--
Ross Lagerwall



Re: [Qemu-devel] [PATCH v1] os-posix: Add -unshare option

2017-10-23 Thread Ross Lagerwall

On 10/23/2017 03:50 PM, Daniel P. Berrange wrote:

On Mon, Oct 23, 2017 at 03:30:05PM +0100, Ross Lagerwall wrote:

On 10/19/2017 05:24 PM, Daniel P. Berrange wrote:

On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:

Add an option to allow calling unshare() just before starting guest
execution. The option allows unsharing one or more of the mount
namespace, the network namespace, and the IPC namespace. This is useful
to restrict the ability of QEMU to cause damage to the system should it
be compromised.

An example of using this would be to have QEMU open a QMP socket at
startup and unshare the network namespace. The instance of QEMU could
still be controlled by the QMP socket since that belongs in the original
namespace, but if QEMU were compromised it wouldn't be able to open any
new connections, even to other processes on the same machine.


Unless I'm misunderstanding you, what's described here is already possible
by just using the 'unshare' command to spawn QEMU:

# unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server 
-vnc :1
qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection 
on: disconnected:unix:/tmp/foo,server

And in another shell I can still access the QMP socket from the original host
namespace


So that works because UNIX domains sockets are not restricted by network
namespaces. But if you try to connect to the VNC server listening on TCP
port 5901, it won't work.



# ./scripts/qmp/qmp-shell /tmp/foo
Welcome to the QMP low-level shell!
Connected to QEMU 2.9.1

(QEMU) query-kvm
{"return": {"enabled": false, "present": true}}


FWIW, even if that were not possible, you could still do it by wrapping the
qmp-shell in an 'nsenter' call. eg

nsenter --target $QEMUPID --net ./scripts/qmp/qmp-shell /tmp/foo


I have a single process which connects to all the QEMUs' listening VNC
sockets so I'm not sure that this would work.


Yes, it can still work - you simply need to use set() to temporarily
switch into QEMU's namespace, and then switch back again afterwards

   oldns = open("/proc/self/ns/net")
   newns = open("/proc/$PID-OF-QEMU/ns/net")
   setns(newns, CLONE_NEWNET)

   ...open connection to VNC...
   
   setns(oldns, CLONE_NEWNET)


   ...use connection to VNC...

The setns() call is thread-local, so you can safely use different
namespaces in each thread if you need to have concurrent comms with
many QEMUs.



Ah, I didn't realize that the namespace could be set locally for a 
thread. That's helpful, thanks!


--
Ross Lagerwall



Re: [Qemu-devel] [PATCH v1] os-posix: Add -unshare option

2017-10-23 Thread Ross Lagerwall

On 10/19/2017 05:24 PM, Daniel P. Berrange wrote:

On Thu, Oct 19, 2017 at 05:04:19PM +0100, Ross Lagerwall wrote:

Add an option to allow calling unshare() just before starting guest
execution. The option allows unsharing one or more of the mount
namespace, the network namespace, and the IPC namespace. This is useful
to restrict the ability of QEMU to cause damage to the system should it
be compromised.

An example of using this would be to have QEMU open a QMP socket at
startup and unshare the network namespace. The instance of QEMU could
still be controlled by the QMP socket since that belongs in the original
namespace, but if QEMU were compromised it wouldn't be able to open any
new connections, even to other processes on the same machine.


Unless I'm misunderstanding you, what's described here is already possible
by just using the 'unshare' command to spawn QEMU:

   # unshare --ipc --mount --net qemu-system-x86_64 -qmp unix:/tmp/foo,server 
-vnc :1
   qemu-system-x86_64: -qmp unix:/tmp/foo,server: QEMU waiting for connection 
on: disconnected:unix:/tmp/foo,server

And in another shell I can still access the QMP socket from the original host
namespace


So that works because UNIX domains sockets are not restricted by network 
namespaces. But if you try to connect to the VNC server listening on TCP 
port 5901, it won't work.




   # ./scripts/qmp/qmp-shell /tmp/foo
   Welcome to the QMP low-level shell!
   Connected to QEMU 2.9.1

   (QEMU) query-kvm
   {"return": {"enabled": false, "present": true}}


FWIW, even if that were not possible, you could still do it by wrapping the
qmp-shell in an 'nsenter' call. eg

   nsenter --target $QEMUPID --net ./scripts/qmp/qmp-shell /tmp/foo


I have a single process which connects to all the QEMUs' listening VNC 
sockets so I'm not sure that this would work.





Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
  os-posix.c  | 34 ++
  qemu-options.hx | 14 ++
  2 files changed, 48 insertions(+)

diff --git a/os-posix.c b/os-posix.c
index b9c2343..cfc5c38 100644
--- a/os-posix.c
+++ b/os-posix.c
@@ -45,6 +45,7 @@ static struct passwd *user_pwd;
  static const char *chroot_dir;
  static int daemonize;
  static int daemon_pipe;
+static int unshare_flags;
  
  void os_setup_early_signal_handling(void)

  {
@@ -160,6 +161,28 @@ void os_parse_cmd_args(int index, const char *optarg)
  fips_set_state(true);
  break;
  #endif
+#ifdef CONFIG_SETNS
+case QEMU_OPTION_unshare:
+{
+char *flag;
+char *opts = g_strdup(optarg);
+
+while ((flag = qemu_strsep(, ",")) != NULL) {
+if (!strcmp(flag, "mount")) {
+unshare_flags |= CLONE_NEWNS;
+} else if (!strcmp(flag, "net")) {
+unshare_flags |= CLONE_NEWNET;
+} else if (!strcmp(flag, "ipc")) {
+unshare_flags |= CLONE_NEWIPC;
+} else {
+fprintf(stderr, "Unknown unshare option: %s\n", flag);
+exit(1);
+}
+}
+g_free(opts);
+}
+break;
+#endif
  }
  }
  
@@ -201,6 +224,16 @@ static void change_root(void)
  
  }
  
+static void unshare_namespaces(void)

+{
+if (unshare_flags) {
+if (unshare(unshare_flags) < 0) {
+perror("could not unshare");
+exit(1);
+}
+}
+}
+
  void os_daemonize(void)
  {
  if (daemonize) {
@@ -266,6 +299,7 @@ void os_setup_post(void)
  }
  
  change_root();

+unshare_namespaces();
  change_process_uid();


This has some really bad implications.  All the command line options that are
given are processed *beforfe* os_setup_post() is called. IOW, -chardev, -vnc,
-migrate, -net, etc will all be configured in the context of the host namespace.

If you then use the QMP monitor to run  chardev_add,  device_add, migrate,
hostnet_add, etc this will all take place in the new namespace.

So the exact same args give as ARGV now have completely different semantics
when given via QMP.

I think this is really very undesirable.


I consider this to be broadly similar to using -chroot -- adding devices 
and so on after chrooting would have a different effect compared with 
adding them after chrooting. I do agree though that both -chroot and 
-unshare could have confusing semantics.




If you wrap QEMU execution in 'unshare' as I illustrate above, then the
semantics of ARGV & QMP remain consistent.

FWIW, as a further point that might be of interest, libvirt will now spawn
a new private mount namespace for QEMU by default. We do this so that we can
give QEMU a private /dev filesystem with only the devices its permitted to
use present as device nodes.  The ability to do such setup tasks inbetween
namespace creation and QEMU l

[Qemu-devel] [PATCH v1] os-posix: Add -unshare option

2017-10-19 Thread Ross Lagerwall
Add an option to allow calling unshare() just before starting guest
execution. The option allows unsharing one or more of the mount
namespace, the network namespace, and the IPC namespace. This is useful
to restrict the ability of QEMU to cause damage to the system should it
be compromised.

An example of using this would be to have QEMU open a QMP socket at
startup and unshare the network namespace. The instance of QEMU could
still be controlled by the QMP socket since that belongs in the original
namespace, but if QEMU were compromised it wouldn't be able to open any
new connections, even to other processes on the same machine.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
 os-posix.c  | 34 ++
 qemu-options.hx | 14 ++
 2 files changed, 48 insertions(+)

diff --git a/os-posix.c b/os-posix.c
index b9c2343..cfc5c38 100644
--- a/os-posix.c
+++ b/os-posix.c
@@ -45,6 +45,7 @@ static struct passwd *user_pwd;
 static const char *chroot_dir;
 static int daemonize;
 static int daemon_pipe;
+static int unshare_flags;
 
 void os_setup_early_signal_handling(void)
 {
@@ -160,6 +161,28 @@ void os_parse_cmd_args(int index, const char *optarg)
 fips_set_state(true);
 break;
 #endif
+#ifdef CONFIG_SETNS
+case QEMU_OPTION_unshare:
+{
+char *flag;
+char *opts = g_strdup(optarg);
+
+while ((flag = qemu_strsep(, ",")) != NULL) {
+if (!strcmp(flag, "mount")) {
+unshare_flags |= CLONE_NEWNS;
+} else if (!strcmp(flag, "net")) {
+unshare_flags |= CLONE_NEWNET;
+} else if (!strcmp(flag, "ipc")) {
+unshare_flags |= CLONE_NEWIPC;
+} else {
+fprintf(stderr, "Unknown unshare option: %s\n", flag);
+exit(1);
+}
+}
+g_free(opts);
+}
+break;
+#endif
 }
 }
 
@@ -201,6 +224,16 @@ static void change_root(void)
 
 }
 
+static void unshare_namespaces(void)
+{
+if (unshare_flags) {
+if (unshare(unshare_flags) < 0) {
+perror("could not unshare");
+exit(1);
+}
+}
+}
+
 void os_daemonize(void)
 {
 if (daemonize) {
@@ -266,6 +299,7 @@ void os_setup_post(void)
 }
 
 change_root();
+unshare_namespaces();
 change_process_uid();
 
 if (daemonize) {
diff --git a/qemu-options.hx b/qemu-options.hx
index 3728e9b..5cfcc51 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -3972,6 +3972,20 @@ Immediately before starting guest execution, chroot to 
the specified
 directory.  Especially useful in combination with -runas.
 ETEXI
 
+#ifdef CONFIG_SETNS
+DEF("unshare", HAS_ARG, QEMU_OPTION_unshare, \
+"-unshare [mount][,net][,ipc]\n" \
+"unshare namespaces just before starting the VM\n",
+QEMU_ARCH_ALL)
+#endif
+STEXI
+@item -unshare @code{[mount][,net][,ipc]}
+@findex -unshare
+Immediately before starting guest execution, unshare the specified namespaces.
+The namespaces that can be unshared are the mount namespace, the network
+namespace and the IPC namespace.
+ETEXI
+
 #ifndef _WIN32
 DEF("runas", HAS_ARG, QEMU_OPTION_runas, \
 "-runas user change to user id user just before starting the VM\n",
-- 
2.9.5




Re: [Qemu-devel] [PATCH 3/8] xen: defer call to xen_restrict until just before os_setup_post

2017-10-13 Thread Ross Lagerwall

On 10/09/2017 05:01 PM, Ian Jackson wrote:

We need to restrict *all* the control fds that qemu opens.  Looking in
/proc/PID/fd shows there are many; their allocation seems scattered
throughout Xen support code in qemu.

We must postpone the restrict call until roughly the same time as qemu
changes its uid, chroots (if applicable), and so on.

There doesn't seem to be an appropriate hook already.  The RunState
change hook fires at different times depending on exactly what mode
qemu is operating in.

And it appears that no-one but the Xen code wants a hook at this phase
of execution.  So, introduce a bare call to a new function
xen_setup_post, just before os_setup_post.  Also provide the
appropriate stub for when Xen compilation is disabled.

We do the restriction before rather than after os_setup_post, because
xen_restrict may need to open /dev/null, and os_setup_post might have
called chroot.

This works for normally starting a VM but doesn't seem to work when 
resuming/migrating.


Here is the order of selected operations when starting a VM normally:
VNC server running on 127.0.0.1:5901
xen_change_state_handler()
xenstore_record_dm_state: running
xen_setup_post()
xentoolcore_restrict_all: rc = 0
os_setup_post()
main_loop()

Here is the order of selected operations when starting QEMU with 
-incoming fd:... :

VNC server running on 127.0.0.1:5902
migration_fd_incoming()
xen_setup_post()
xentoolcore_restrict_all: rc = 0
os_setup_post()
main_loop()
migration_set_incoming_channel()
migrate_set_state()
xen_change_state_handler()
xenstore_record_dm_state: running
error recording dm state
qemu exited with code 1

The issue is that QEMU needs xenstore access to write "running" but this 
is after it has already been restricted. Moving xen_setup_post() into 
xen_change_state_handler() works fine. The only issue is that in the 
migration case, it executes after os_setup_post() so QEMU might be 
chrooted in which case opening /dev/null to restrict fds doesn't work 
(unless its new root has a /dev/null).


--
Ross Lagerwall



[Qemu-devel] [PATCH] xen: Log errno rather than return value

2017-10-11 Thread Ross Lagerwall
xen_modified_memory() sets errno to communicate what went wrong so log
this rather than the return value which is not interesting.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
 hw/i386/xen/xen-hvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index d9ccd5d..8028bed 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -1446,7 +1446,7 @@ void xen_hvm_modified_memory(ram_addr_t start, ram_addr_t 
length)
 if (rc) {
 fprintf(stderr,
 "%s failed for "RAM_ADDR_FMT" ("RAM_ADDR_FMT"): %i, %s\n",
-__func__, start, nb_pages, rc, strerror(-rc));
+__func__, start, nb_pages, errno, strerror(errno));
 }
 }
 }
-- 
2.9.5




Re: [Qemu-devel] [Xen-devel] [PATCH v2 0/*] xen: xen-domid-restrict improvements

2017-10-10 Thread Ross Lagerwall

On 10/06/2017 02:19 PM, Paul Durrant wrote:

-Original Message-
From: Xen-devel [mailto:xen-devel-boun...@lists.xen.org] On Behalf Of
Ross Lagerwall
Sent: 06 October 2017 13:58
To: Ian Jackson <ian.jack...@citrix.com>; qemu-devel@nongnu.org
Cc: Anthony Perard <anthony.per...@citrix.com>; xen-
de...@lists.xenproject.org; Stefano Stabellini <sstabell...@kernel.org>;
Juergen Gross <jgr...@suse.com>
Subject: Re: [Xen-devel] [PATCH v2 0/*] xen: xen-domid-restrict
improvements

On 10/04/2017 05:18 PM, Ian Jackson wrote:

(Resending this because 1. I got the CC for xen-devel wrong; 2. I got
the subject wrong: there are actually 8 patches; 3. I mangled
Anthony's name in theheaders.  Sorry for the noise.)

I have been working on trying to get qemu, when running as a Xen
device model, to _actually_ not have power equivalent to root.

I think I have achieved this, with some limitations (which will be
discussed in my series against xen.git.

However, there are changes to qemu needed.  In particular

   * The -xen-domid-restrict option does not work properly right now.
 It only restricts a small subset of the descriptors qemu has open.
 I am introducing a new library call in the Xen libraries for this,
 xentoolcore_restrict_all.



Hi Ian,

I'm testing your QEMU and Xen patch series and found that after being
restricted, QEMU fails to setup up the VGA memory properly which causes
a complete stall with stdvga. With cirrus it mostly works although it
seems to have reduced performance.

I think it happens when the VM sets up the BAR some time after
xen_restrict() has been called. The failure comes from QEMU calling
xc_domain_add_to_physmap() which calls do_memory_op() and finally
xencall2(). But the underlying xencall fd has been replaced with /dev/null.


Oh, that's a PITA. This is because of the slightly hacky way that the VRAM is 
dealt with (it needing to be moved into the BAR of the PCI VGA device at the 
point where guest firmware programs it). Maybe we need a new dm_op for this?

If no one objects, I propose adding the following calls to 
libxendevicemodel (with underlying Xen implementations, of course) that 
would be usable after the xendevicemodel handle has been restricted.


xendevicemodel_add_to_physmap(xendevicemodel_handle *dmod,
  domid_t domid,
  unsigned long idx,
  xen_pfn_t gpfn);

This is equivalent to xc_domain_add_to_physmap() with
space==XENMAPSPACE_gmfn.

xendevicemodel_pin_memory_cacheattr(xendevicemodel_handle *dmod,
domid_t domid,
uint64_t start,
uint64_t end,
uint32_t type);

This is equivalent to xc_domain_pin_memory_cacheattr().

--
Ross Lagerwall



Re: [Qemu-devel] [PATCH v2 0/*] xen: xen-domid-restrict improvements

2017-10-06 Thread Ross Lagerwall
 and -chroot at the same 
time. The restriction happens after chrooting, so the chroot directory 
has to contain a valid /dev/null. This is a bit annoying and prevents 
the chroot being on a "nodev" mount.


Regards,
--
Ross Lagerwall



Re: [Qemu-devel] [PATCH 7/8] os-posix: Provide new -runasid option

2017-10-06 Thread Ross Lagerwall

On 10/04/2017 05:18 PM, Ian Jackson wrote:

This allows the caller to specify a uid and gid to use, even if there
is no corresponding password entry.  This will be useful in certain
Xen configurations.

Signed-off-by: Ian Jackson <ian.jack...@eu.citrix.com>

snip

@@ -166,17 +187,19 @@ void os_parse_cmd_args(int index, const char *optarg)
  
  static void change_process_uid(void)

  {
-if (user_pwd) {
-if (setgid(user_pwd->pw_gid) < 0) {
+if (user_pwd || user_uid != (uid_t)-1) {
+if (setgid(user_pwd ? user_pwd->pw_gid : user_gid) < 0) {
  fprintf(stderr, "Failed to setgid(%d)\n", user_pwd->pw_gid);
  exit(1);
  }
-if (initgroups(user_pwd->pw_name, user_pwd->pw_gid) < 0) {
+if ((user_pwd
+ ? initgroups(user_pwd->pw_name, user_pwd->pw_gid)
+ : setgroups(1, _gid)) < 0) {
  fprintf(stderr, "Failed to initgroups(\"%s\", %d)\n",
  user_pwd->pw_name, user_pwd->pw_gid);
  exit(1);
  }
-if (setuid(user_pwd->pw_uid) < 0) {
+if (setuid(user_pwd ? user_pwd->pw_uid : user_gid) < 0) {
  fprintf(stderr, "Failed to setuid(%d)\n", user_pwd->pw_uid);
      exit(1);
  }


This last one should be user_uid, not user_gid.

--
Ross Lagerwall



[Qemu-devel] [PATCH v2] xen: Emit RTC_CHANGE upon TIMEOFFSET ioreq

2017-08-23 Thread Ross Lagerwall
When the guest writes to the RTC, Xen emulates it and broadcasts a
TIMEOFFSET ioreq. Emit an RTC_CHANGE QMP event to all QMP monitors when
this happens rather than ignoring it so that something useful can be
done with the information. This is the same event that QEMU generates
when it emulates the RTC.

This patch by itself doesn't affect any of the toolstacks that I
checked; the libxl toolstack doesn't currently handle this event nor
does the XAPI toolstack. If nothing handles the event, it is simply
ignored. We plan on modifying XAPI to handle it.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---

Changed in v2:
* Expanded commit message.

 hw/i386/xen/xen-hvm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index d9ccd5d..ffd20dc 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -16,6 +16,7 @@
 #include "hw/i386/apic-msidef.h"
 #include "hw/xen/xen_common.h"
 #include "hw/xen/xen_backend.h"
+#include "qapi-event.h"
 #include "qmp-commands.h"
 
 #include "qemu/error-report.h"
@@ -967,6 +968,7 @@ static void handle_ioreq(XenIOState *state, ioreq_t *req)
 handle_vmport_ioreq(state, req);
 break;
 case IOREQ_TYPE_TIMEOFFSET:
+qapi_event_send_rtc_change((int64_t)req->data, _abort);
 break;
 case IOREQ_TYPE_INVALIDATE:
 xen_invalidate_map_cache();
-- 
2.9.5




Re: [Qemu-devel] [PATCH] xen: Emit RTC_CHANGE upon TIMEOFFSET ioreq

2017-08-22 Thread Ross Lagerwall

On 08/21/2017 11:30 PM, Stefano Stabellini wrote:

On Mon, 21 Aug 2017, Ross Lagerwall wrote:

When the guest writes to the RTC, Xen emulates it and broadcasts a
TIMEOFFSET ioreq. Emit an RTC_CHANGE QMP message when this happens
rather than ignoring it so that something useful can be done with the
information.


Are there any handlers of the RTC_CHANGE QMP message today? What happens
if there are no handlers?


The libxl toolstack doesn't handle it nor does the XAPI project 
currently, although we plan on modifying XAPI to handle it.


It is simply an event that is broadcast to any QMP monitors. If nothing 
handles the event, then it is the same behavior as before. If something 
is interested in the event, then it can make use of the time offset 
however it wants.


--
Ross Lagerwall



[Qemu-devel] [PATCH] xen: Emit RTC_CHANGE upon TIMEOFFSET ioreq

2017-08-21 Thread Ross Lagerwall
When the guest writes to the RTC, Xen emulates it and broadcasts a
TIMEOFFSET ioreq. Emit an RTC_CHANGE QMP message when this happens
rather than ignoring it so that something useful can be done with the
information.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
 hw/i386/xen/xen-hvm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/i386/xen/xen-hvm.c b/hw/i386/xen/xen-hvm.c
index d9ccd5d..ffd20dc 100644
--- a/hw/i386/xen/xen-hvm.c
+++ b/hw/i386/xen/xen-hvm.c
@@ -16,6 +16,7 @@
 #include "hw/i386/apic-msidef.h"
 #include "hw/xen/xen_common.h"
 #include "hw/xen/xen_backend.h"
+#include "qapi-event.h"
 #include "qmp-commands.h"
 
 #include "qemu/error-report.h"
@@ -967,6 +968,7 @@ static void handle_ioreq(XenIOState *state, ioreq_t *req)
 handle_vmport_ioreq(state, req);
 break;
 case IOREQ_TYPE_TIMEOFFSET:
+qapi_event_send_rtc_change((int64_t)req->data, _abort);
 break;
 case IOREQ_TYPE_INVALIDATE:
 xen_invalidate_map_cache();
-- 
2.9.5




[Qemu-devel] [PATCH v2] xen-platform: Cleanup network infrastructure when emulated NICs are unplugged

2017-06-30 Thread Ross Lagerwall
When the guest unplugs the emulated NICs, cleanup the peer for each NIC
as it is not needed anymore. Most importantly, this allows the tap
interfaces which QEMU holds open to be closed and removed.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---

In v2: Don't call nic_cleanup(), just remove the peer of the NIC which
will cleanup up the tap devices. This means that QEMU doesn't segv
when shutting down.

 hw/i386/xen/xen_platform.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/hw/i386/xen/xen_platform.c b/hw/i386/xen/xen_platform.c
index 1419fc9..f231558 100644
--- a/hw/i386/xen/xen_platform.c
+++ b/hw/i386/xen/xen_platform.c
@@ -102,8 +102,19 @@ static void unplug_nic(PCIBus *b, PCIDevice *d, void *o)
 }
 }
 
+/* Remove the peer of the NIC device. Normally, this would be a tap device. */
+static void del_nic_peer(NICState *nic, void *opaque)
+{
+NetClientState *nc;
+
+nc = qemu_get_queue(nic);
+if (nc->peer)
+qemu_del_net_client(nc->peer);
+}
+
 static void pci_unplug_nics(PCIBus *bus)
 {
+qemu_foreach_nic(del_nic_peer, NULL);
 pci_for_each_device(bus, 0, unplug_nic, NULL);
 }
 
-- 
2.9.4




[Qemu-devel] [PATCH] xen-platform: Cleanup network infrastructure when emulated NICs are unplugged

2017-06-19 Thread Ross Lagerwall
When the guest unplugs the emulated NICs, call net_cleanup() to cleanup
the network infrastructure in QEMU as it is not needed anymore. Most
importantly, this allows the tap interfaces which QEMU holds open to be
closed and removed.

Signed-off-by: Ross Lagerwall <ross.lagerw...@citrix.com>
---
 hw/i386/xen/xen_platform.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/xen/xen_platform.c b/hw/i386/xen/xen_platform.c
index 1419fc9..180abc7 100644
--- a/hw/i386/xen/xen_platform.c
+++ b/hw/i386/xen/xen_platform.c
@@ -105,6 +105,7 @@ static void unplug_nic(PCIBus *b, PCIDevice *d, void *o)
 static void pci_unplug_nics(PCIBus *bus)
 {
 pci_for_each_device(bus, 0, unplug_nic, NULL);
+net_cleanup();
 }
 
 static void unplug_disks(PCIBus *b, PCIDevice *d, void *opaque)
-- 
2.9.4