date:20181205

Re: [Qemu-devel] [PATCH for-4.0 1/6] char-socket: Enable "wait" option for client mode

2018-12-05 Thread Yongji Xie

On Thu, 6 Dec 2018 at 15:23, Marc-André Lureau
 wrote:
>
> Hi
>
> On Thu, Dec 6, 2018 at 10:38 AM  wrote:
> >
> > From: Xie Yongji 
> >
> > Now we attempt to connect asynchronously for "reconnect socket"
> > during open(). But a vhost-user device prefers a connected socket
> > during initialization. That means we may still need to support
> > sync connection during open() for the "reconnect socket".
> >
> > Signed-off-by: Xie Yongji 
> > Signed-off-by: Zhang Yu 
>
> I am not sure this makes much sense, since upon reconnect it won't
> "wait" (if I am not mistaken)
>

Yes, qemu won't wait when reconnecting. The "wait" here just means that
we should wait for the connection the first time (during open()). I'm not
sure whether reusing the "wait" option is OK here.

Without this option, current qemu will fail to initialize the vhost-user-blk
device when the "reconnect" option is specified, regardless of whether the
backend is running or not.

Thanks,
Yongji

Re: [Qemu-devel] [PATCH for-4.0 0/6] vhost-user-blk: Add support for backend reconnecting

2018-12-05 Thread Yongji Xie

On Thu, 6 Dec 2018 at 15:23, Marc-André Lureau
 wrote:
>
> Hi
>
> On Thu, Dec 6, 2018 at 10:36 AM  wrote:
> >
> > From: Xie Yongji 
> >
> > This patchset aims to allow qemu to reconnect to the
> > vhost-user-blk backend after the backend crashes or
> > restarts.
> >
> > Patch 1 tries to implement the sync connection for
> > "reconnect socket".
> >
> > Patch 2 introduces a new message VHOST_USER_SET_VRING_INFLIGHT
> > to support offering shared memory to the backend to record
> > its inflight I/O.
> >
> > Patches 3 and 4 are the corresponding libvhost-user patches for
> > patch 2. They make libvhost-user support VHOST_USER_SET_VRING_INFLIGHT.
> >
> > Patch 5 allows vhost-user-blk to reconnect to the backend when the
> > connection is closed.
> >
> > Patch 6 tells qemu that we support reconnecting now.
> >
> > To use it, we could start qemu with:
> >
> > qemu-system-x86_64 \
> > -chardev socket,id=char0,path=/path/vhost.socket,reconnect=1,wait \
> > -device vhost-user-blk-pci,chardev=char0 \
>
> Why do you want qemu to be the client since it is actually the one
> that serves and remains alive?  Why make it try to reconnect regularly
> when it could instead wait for a connection to come up?
>

Actually, this patchset should also work when qemu is in server mode.
The reason I make qemu the client is that some vhost-user backends,
such as spdk and vhost-user-blk, may still work in server mode. And it
seems we cannot be sure that all vhost-user backends work in
client mode.

Thanks,
Yongji
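With qemu as the client, "reconnect=1" means qemu keeps retrying the connection to a backend that listens on the socket as the server. The retry idea can be sketched with plain POSIX calls as below. `connect_unix_with_retry()` is a hypothetical helper, and this blocking loop is only an illustration: QEMU's own reconnect logic is asynchronous and timer driven.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: connect to a UNIX-domain socket on which a
 * vhost-user backend listens (server mode). Retries up to max_attempts
 * times, sleeping delay_s seconds between attempts. Returns the
 * connected fd, or -1 if every attempt failed.
 */
static int connect_unix_with_retry(const char *path, int max_attempts,
                                   unsigned int delay_s)
{
    struct sockaddr_un addr;

    if (strlen(path) >= sizeof(addr.sun_path)) {
        return -1; /* path does not fit in sun_path */
    }
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0) {
            return -1;
        }
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0) {
            return fd; /* backend is up and accepted the connection */
        }
        close(fd);
        if (attempt + 1 < max_attempts) {
            sleep(delay_s); /* backend not started yet, or restarting */
        }
    }
    return -1;
}
```

For example, `connect_unix_with_retry("/path/vhost.socket", 10, 1)` makes 10 attempts, one second apart, before giving up.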

Re: [Qemu-devel] Logging dirty pages from vhost-net in-kernel with vIOMMU

2018-12-05 Thread Jason Wang




On 2018/12/5 10:47 PM, Jintack Lim wrote:

On Tue, Dec 4, 2018 at 8:30 PM Jason Wang  wrote:


On 2018/12/5 2:37 AM, Jintack Lim wrote:

Hi,

I'm wondering how the current implementation works when logging dirty
pages during migration from vhost-net (in kernel) when vIOMMU is used.

I understand how vhost-net logs GPAs when not using vIOMMU. But when
we use vhost with vIOMMU, then shouldn't vhost-net need to log the
translated address (GPA) instead of the address written in the
descriptor (IOVA)? From the current implementation, it looks like vhost-net
just logs the IOVA without translation in vhost_get_vq_desc() (defined in
drivers/vhost/vhost.c and called from drivers/vhost/net.c). It seems like QEMU doesn't do any further
translation of the dirty log when syncing.

I might be missing something. Could somebody shed some light on this?


Good catch. It looks like a bug to me. Want to post a patch for this?

Thanks for the confirmation.

What would be a good setup to catch this kind of migration bug? I
tried to observe it in the VM expecting to see network applications
not getting data correctly on the destination, but it was not
successful (i.e. the VM on the destination just worked fine). I didn't
even see anything going wrong when I disabled the vhost logging
completely without using vIOMMU.

What I did is I ran multiple network benchmarks (e.g. netperf tcp
stream and my own one to check correctness of received data) in a VM
without vhost dirty page logging, and the benchmarks just ran fine on
the destination. I checked the used ring at the time the VM was stopped
on the source for migration, and it had multiple descriptors that were
(probably) not yet processed by the VM. Do you have any insight into how it
could just work, and what would be a good setup to catch this?



From past experience, it could be reproduced by running scp from the
host to the guest during migration.





About sending a patch, as Michael suggested, I think it's better for
you to handle this case - this is not my area of expertise, yet :-)



No problem, I will fix this.

Thanks for spotting this issue.



Thanks



Thanks,
Jintack
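The bug discussed in this thread can be illustrated with a toy model of the vhost dirty log: a bitmap with one bit per 4 KiB guest-physical page. With a vIOMMU, the address in the descriptor is an IOVA, so it has to go through the IOTLB before the page number is computed. `DirtyLog`, `IotlbEntry` and the helpers below are hypothetical and much simpler than the kernel's (which writes a bitmap in userspace memory and has a multi-entry IOTLB).

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4 KiB pages */
#define LOG_PAGES  64 /* toy bitmap covers 64 pages */

/* Toy dirty bitmap indexed by guest-physical page number (GPA >> 12). */
typedef struct {
    uint64_t bits;
} DirtyLog;

/* Toy single-entry IOTLB: [iova, iova + size) maps to [gpa, gpa + size). */
typedef struct {
    uint64_t iova;
    uint64_t gpa;
    uint64_t size;
} IotlbEntry;

static bool iotlb_translate(const IotlbEntry *e, uint64_t iova, uint64_t *gpa)
{
    if (iova < e->iova || iova - e->iova >= e->size) {
        return false; /* no mapping for this IOVA */
    }
    *gpa = e->gpa + (iova - e->iova);
    return true;
}

/* Mark the page containing `gpa` as dirty. */
static void log_dirty_gpa(DirtyLog *log, uint64_t gpa)
{
    uint64_t pfn = gpa >> PAGE_SHIFT;

    if (pfn < LOG_PAGES) {
        log->bits |= 1ull << pfn;
    }
}

static bool page_is_dirty(const DirtyLog *log, uint64_t gpa)
{
    uint64_t pfn = gpa >> PAGE_SHIFT;

    return pfn < LOG_PAGES && ((log->bits >> pfn) & 1);
}

/* Expected behaviour with a vIOMMU: translate IOVA -> GPA, then log the GPA. */
static bool log_write_translated(DirtyLog *log, const IotlbEntry *e,
                                 uint64_t iova)
{
    uint64_t gpa;

    if (!iotlb_translate(e, iova, &gpa)) {
        return false;
    }
    log_dirty_gpa(log, gpa);
    return true;
}
```

With an IOTLB entry mapping IOVA 0x1000 to GPA 0x5000, a write to IOVA 0x1010 must dirty page 5. Passing the raw IOVA to `log_dirty_gpa()` dirties page 1 instead, so page 5 would not be re-sent to the destination.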

Re: [Qemu-devel] Logging dirty pages from vhost-net in-kernel with vIOMMU

2018-12-05 Thread Jason Wang




On 2018/12/5 9:32 PM, Michael S. Tsirkin wrote:

On Wed, Dec 05, 2018 at 11:02:11AM +0800, Jason Wang wrote:

On 2018/12/5 9:59 AM, Michael S. Tsirkin wrote:

On Wed, Dec 05, 2018 at 09:30:19AM +0800, Jason Wang wrote:

On 2018/12/5 2:37 AM, Jintack Lim wrote:

Hi,

I'm wondering how the current implementation works when logging dirty
pages during migration from vhost-net (in kernel) when vIOMMU is used.

I understand how vhost-net logs GPAs when not using vIOMMU. But when
we use vhost with vIOMMU, then shouldn't vhost-net need to log the
translated address (GPA) instead of the address written in the
descriptor (IOVA)? From the current implementation, it looks like vhost-net
just logs the IOVA without translation in vhost_get_vq_desc() (defined in
drivers/vhost/vhost.c and called from drivers/vhost/net.c). It seems like QEMU doesn't do any further
translation of the dirty log when syncing.

I might be missing something. Could somebody shed some light on this?

Good catch. It looks like a bug to me. Want to post a patch for this?

This isn't going to be a quick fix: IOTLB UAPI is translating
IOVA values directly to uaddr.

So to fix it, we need to change IOVA messages to translate to GPA
so GPA can be logged.

For existing userspace, we can try reverse translation uaddr->gpa as a
hack for logging, but that translation was never guaranteed to be unique.


We have the memory table in vhost as well, so it looks like we can do this
in the kernel without disturbing the UAPI?

Thanks

Let me try to rephrase.

Yes, as a temporary bugfix we can do the uaddr to gpa translations.
It is probably good enough for what QEMU does now.

However, it can break some legal userspace, since it is possible to
have multiple UADDR mappings for a single GPA.
In that setup the vhost table would only have one of these
and it's possible that IOTLB would use another one.



Since we are logging GPAs, it doesn't matter which UADDR is used in this
case: we end up with the same GPA either way. Maybe you mean multiple GPA
mappings for a single UADDR? Then we may want to log all possible GPAs in
this case.





And generally it's a better idea security-wise to make
iotlb talk in GPA terms. This way whoever sets the static
GPA-to-UADDR mappings controls security, and the dynamic
and more fragile iova mappings cannot break QEMU security.



AFAIK, this may only help if the memory table and IOTLB entries are set by
different processes. Since it's all set by qemu, and qemu goes through the
GPA-UADDR mapping before setting the device IOTLB, it's probably not a
gain for us now.





So we need a UAPI extension with a feature flag.



Yes.

Thanks



Jason, I think you'll have to work on it given the complexity.


Thanks



Thanks,
Jintack
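Jason's suggestion of logging every possible GPA when one UADDR is mapped more than once can be sketched as a reverse lookup over the memory table. `MemRegion` and `uaddr_to_gpas()` are hypothetical names; the fields loosely mirror the guest_phys_addr / memory_size / userspace_addr triple of vhost memory regions. Because the lookup has to walk every region, it also shows why the uaddr->gpa translation is not necessarily unique.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region: [gpa, gpa + size) is backed by [uaddr, uaddr + size). */
typedef struct {
    uint64_t gpa;
    uint64_t uaddr;
    uint64_t size;
} MemRegion;

/*
 * Reverse-translate a userspace address to every GPA whose region maps it.
 * Stores up to max_out GPAs in out[] and returns how many regions matched,
 * so a caller can log all of them when the mapping is not unique.
 */
static size_t uaddr_to_gpas(const MemRegion *regions, size_t nregions,
                            uint64_t uaddr, uint64_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < nregions; i++) {
        const MemRegion *r = &regions[i];

        if (uaddr >= r->uaddr && uaddr - r->uaddr < r->size) {
            if (found < max_out) {
                out[found] = r->gpa + (uaddr - r->uaddr);
            }
            found++;
        }
    }
    return found;
}
```

If two regions alias the same host buffer, a single dirty UADDR yields two GPAs, and both pages would have to be marked dirty.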

Re: [Qemu-devel] [PATCH for-4.0 1/6] char-socket: Enable "wait" option for client mode

2018-12-05 Thread Marc-André Lureau

Hi

On Thu, Dec 6, 2018 at 10:38 AM  wrote:
>
> From: Xie Yongji 
>
> Now we attempt to connect asynchronously for "reconnect socket"
> during open(). But a vhost-user device prefers a connected socket
> during initialization. That means we may still need to support
> sync connection during open() for the "reconnect socket".
>
> Signed-off-by: Xie Yongji 
> Signed-off-by: Zhang Yu 

I am not sure this makes much sense, since upon reconnect it won't
"wait" (if I am not mistaken)

> ---
>  chardev/char-socket.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
> index eaa8e8b68f..f2819d52e7 100644
> --- a/chardev/char-socket.c
> +++ b/chardev/char-socket.c
> @@ -1072,7 +1072,7 @@ static void qmp_chardev_open_socket(Chardev *chr,
>  s->reconnect_time = reconnect;
>  }
>
> -if (s->reconnect_time) {
> +if (s->reconnect_time && !is_waitconnect) {
>  tcp_chr_connect_async(chr);
>  } else {
>  if (s->is_listen) {
> @@ -1120,7 +1120,8 @@ static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
>Error **errp)
>  {
>  bool is_listen  = qemu_opt_get_bool(opts, "server", false);
> -bool is_waitconnect = is_listen && qemu_opt_get_bool(opts, "wait", true);
> +bool is_waitconnect = is_listen ? qemu_opt_get_bool(opts, "wait", true) :
> +  qemu_opt_get_bool(opts, "wait", false);
>  bool is_telnet  = qemu_opt_get_bool(opts, "telnet", false);
>  bool is_tn3270  = qemu_opt_get_bool(opts, "tn3270", false);
>  bool is_websock = qemu_opt_get_bool(opts, "websocket", false);
> --
> 2.17.1
>
>


--
Marc-André Lureau
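The two hunks above change how the "wait" default is computed and when the asynchronous reconnect path is taken. A simplified C model of that decision, assuming only these two hunks matter, could look like this. `ConnectMode`, `effective_wait()` and `choose_connect_mode()` are hypothetical names, not QEMU code, and `has_wait`/`wait_value` stand in for the `qemu_opt_get_bool(opts, "wait", ...)` lookup.

```c
#include <stdbool.h>

/* Hypothetical outcomes of qmp_chardev_open_socket() after the patch. */
typedef enum {
    CONNECT_ASYNC,       /* tcp_chr_connect_async() path */
    CONNECT_SYNC_CLIENT, /* blocking connect during open() */
    LISTEN_WAIT,         /* server: block until a client connects */
    LISTEN_NOWAIT        /* server: return without waiting */
} ConnectMode;

/* "wait" defaults to true for servers and to false for clients. */
static bool effective_wait(bool is_listen, bool has_wait, bool wait_value)
{
    return has_wait ? wait_value : is_listen;
}

static ConnectMode choose_connect_mode(bool is_listen, int reconnect_time,
                                       bool has_wait, bool wait_value)
{
    bool is_waitconnect = effective_wait(is_listen, has_wait, wait_value);

    /* Patched condition: only go async if the user did not ask to wait. */
    if (reconnect_time && !is_waitconnect) {
        return CONNECT_ASYNC;
    }
    if (is_listen) {
        return is_waitconnect ? LISTEN_WAIT : LISTEN_NOWAIT;
    }
    return CONNECT_SYNC_CLIENT;
}
```

A client with "reconnect=1" and no "wait" keeps the old asynchronous behaviour; adding "wait" makes the first connection synchronous. As confirmed in the reply to this review, later reconnects still do not wait.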

Re: [Qemu-devel] [PATCH for-4.0 2/6] vhost-user: Add shared memory to record inflight I/O

2018-12-05 Thread Yongji Xie

On Thu, 6 Dec 2018 at 15:19, Marc-André Lureau
 wrote:
>
> Hi
> On Thu, Dec 6, 2018 at 10:40 AM  wrote:
> >
> > From: Xie Yongji 
> >
> > This introduces a new message VHOST_USER_SET_VRING_INFLIGHT
> > to support offering shared memory to backend to record
> > its inflight I/O.
> >
> > With this new message, the backend is able to restart without
> > missing I/O, which would otherwise cause an I/O hang for the block device.
> >
> > Signed-off-by: Xie Yongji 
> > Signed-off-by: Chai Wen 
> > Signed-off-by: Zhang Yu 
> > ---
> >  hw/virtio/vhost-user.c| 69 +++
> >  hw/virtio/vhost.c |  8 
> >  include/hw/virtio/vhost-backend.h |  4 ++
> >  include/hw/virtio/vhost-user.h|  8 
>
> Please update docs/interop/vhost-user.txt to describe the new message
>

Will do it in v2.

Thanks,
Yongji

> >  4 files changed, 89 insertions(+)
> >
> > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > index e09bed0e4a..4c0e64891d 100644
> > --- a/hw/virtio/vhost-user.c
> > +++ b/hw/virtio/vhost-user.c
> > @@ -19,6 +19,7 @@
> >  #include "sysemu/kvm.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/sockets.h"
> > +#include "qemu/memfd.h"
> >  #include "sysemu/cryptodev.h"
> >  #include "migration/migration.h"
> >  #include "migration/postcopy-ram.h"
> > @@ -52,6 +53,7 @@ enum VhostUserProtocolFeature {
> >  VHOST_USER_PROTOCOL_F_CONFIG = 9,
> >  VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
> >  VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
> > +VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
> >  VHOST_USER_PROTOCOL_F_MAX
> >  };
> >
> > @@ -89,6 +91,7 @@ typedef enum VhostUserRequest {
> >  VHOST_USER_POSTCOPY_ADVISE  = 28,
> >  VHOST_USER_POSTCOPY_LISTEN  = 29,
> >  VHOST_USER_POSTCOPY_END = 30,
> > +VHOST_USER_SET_VRING_INFLIGHT = 31,
>
> Why VRING? It seems to be a free/arbitrary memory area.
>
> Oh, I see now that this has an explicit layout and behaviour,
> described later in "libvhost-user: Support recording inflight I/O in
> shared memory"
>
> Please update the vhost-user spec first to describe expected usage/behaviour.
>
>
> >  VHOST_USER_MAX
> >  } VhostUserRequest;
> >
> > @@ -147,6 +150,11 @@ typedef struct VhostUserVringArea {
> >  uint64_t offset;
> >  } VhostUserVringArea;
> >
> > +typedef struct VhostUserVringInflight {
> > +uint32_t size;
> > +uint32_t idx;
> > +} VhostUserVringInflight;
> > +
> >  typedef struct {
> >  VhostUserRequest request;
> >
> > @@ -169,6 +177,7 @@ typedef union {
> >  VhostUserConfig config;
> >  VhostUserCryptoSession session;
> >  VhostUserVringArea area;
> > +VhostUserVringInflight inflight;
> >  } VhostUserPayload;
> >
> >  typedef struct VhostUserMsg {
> > @@ -1739,6 +1748,58 @@ static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
> >  return result;
> >  }
> >
> > +static int vhost_user_set_vring_inflight(struct vhost_dev *dev, int idx)
> > +{
> > +struct vhost_user *u = dev->opaque;
> > +
> > +if (!virtio_has_feature(dev->protocol_features,
> > +VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> > +return 0;
> > +}
> > +
> > +if (!u->user->inflight[idx].addr) {
> > +Error *err = NULL;
> > +
> > +u->user->inflight[idx].size = qemu_real_host_page_size;
> > +u->user->inflight[idx].addr = qemu_memfd_alloc("vhost-inflight",
> > +  u->user->inflight[idx].size,
> > +  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
> > +  &u->user->inflight[idx].fd, &err);
> > +if (err) {
> > +error_report_err(err);
> > +u->user->inflight[idx].addr = NULL;
> > +return -1;
> > +}
> > +}
> > +
> > +VhostUserMsg msg = {
> > +.hdr.request = VHOST_USER_SET_VRING_INFLIGHT,
> > +.hdr.flags = VHOST_USER_VERSION,
> > +.payload.inflight.size = u->user->inflight[idx].size,
> > +.payload.inflight.idx = idx,
> > +.hdr.size = sizeof(msg.payload.inflight),
> > +};
> > +
> > +if (vhost_user_write(dev, &msg, &u->user->inflight[idx].fd, 1) < 0) {
> > +return -1;
> > +}
> > +
> > +return 0;
> > +}
> > +
> > +void vhost_user_inflight_reset(VhostUserState *user)
> > +{
> > +int i;
> > +
> > +for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
> > +if (!user->inflight[i].addr) {
> > +continue;
> > +}
> > +
> > +memset(user->inflight[i].addr, 0, user->inflight[i].size);
> > +}
> > +}
> > +
> >  VhostUserState *vhost_user_init(void)
> >  {
> >  VhostUserState *user = g_new0(struct VhostUserState, 1);
> > @@ -1756,6 +1817,13 @@ void vhost_user_cleanup(VhostUserState *user)
> >  munmap(user->notifier[i].addr, qemu_real_host_page_size);
> >  user->notifier[i].addr = NULL;
> >  }
> > +
> > +if

Re: [Qemu-devel] [PATCH for-4.0 0/6] vhost-user-blk: Add support for backend reconnecting

2018-12-05 Thread Marc-André Lureau

Hi

On Thu, Dec 6, 2018 at 10:36 AM  wrote:
>
> From: Xie Yongji 
>
> This patchset aims to allow qemu to reconnect to the
> vhost-user-blk backend after the backend crashes or
> restarts.
>
> Patch 1 tries to implement the sync connection for
> "reconnect socket".
>
> Patch 2 introduces a new message VHOST_USER_SET_VRING_INFLIGHT
> to support offering shared memory to the backend to record
> its inflight I/O.
>
> Patches 3 and 4 are the corresponding libvhost-user patches for
> patch 2. They make libvhost-user support VHOST_USER_SET_VRING_INFLIGHT.
>
> Patch 5 allows vhost-user-blk to reconnect to the backend when the
> connection is closed.
>
> Patch 6 tells qemu that we support reconnecting now.
>
> To use it, we could start qemu with:
>
> qemu-system-x86_64 \
> -chardev socket,id=char0,path=/path/vhost.socket,reconnect=1,wait \
> -device vhost-user-blk-pci,chardev=char0 \

Why do you want qemu to be the client since it is actually the one
that serves and remains alive?  Why make it try to reconnect regularly
when it could instead wait for a connection to come up?

>
> and start vhost-user-blk backend with:
>
> vhost-user-blk -b /path/file -s /path/vhost.socket
>
> Then we can restart vhost-user-blk at any time during VM running.
>
> Xie Yongji (6):
>   char-socket: Enable "wait" option for client mode
>   vhost-user: Add shared memory to record inflight I/O
>   libvhost-user: Introduce vu_queue_map_desc()
>   libvhost-user: Support recording inflight I/O in shared memory
>   vhost-user-blk: Add support for reconnecting backend
>   contrib/vhost-user-blk: enable inflight I/O recording
>
>  chardev/char-socket.c   |   5 +-
>  contrib/libvhost-user/libvhost-user.c   | 215 
>  contrib/libvhost-user/libvhost-user.h   |  19 +++
>  contrib/vhost-user-blk/vhost-user-blk.c |   3 +-
>  hw/block/vhost-user-blk.c   | 169 +--
>  hw/virtio/vhost-user.c  |  69 
>  hw/virtio/vhost.c   |   8 +
>  include/hw/virtio/vhost-backend.h   |   4 +
>  include/hw/virtio/vhost-user-blk.h  |   4 +
>  include/hw/virtio/vhost-user.h  |   8 +
>  10 files changed, 452 insertions(+), 52 deletions(-)
>
> --
> 2.17.1
>
>


-- 
Marc-André Lureau
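The idea of recording inflight I/O in shared memory can be sketched with a deliberately simple, hypothetical layout: one flag per descriptor head. The real layout comes from patch 4/6 ("libvhost-user: Support recording inflight I/O in shared memory"); here only the principle is shown. Because QEMU provides the memory (the memfd sent with VHOST_USER_SET_VRING_INFLIGHT in patch 2/6), the flags survive a backend restart, and requests that were popped but never completed can be resubmitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QUEUE_SIZE 8

/*
 * Hypothetical inflight area: one flag per descriptor head. In the
 * patchset the memory is provided by QEMU, so it outlives the backend.
 */
typedef struct {
    uint8_t inflight[QUEUE_SIZE];
} InflightArea;

/* The backend popped request `head` from the avail ring. */
static void inflight_mark(InflightArea *a, unsigned int head)
{
    if (head < QUEUE_SIZE) {
        a->inflight[head] = 1;
    }
}

/* The backend pushed `head` to the used ring: the request completed. */
static void inflight_clear(InflightArea *a, unsigned int head)
{
    if (head < QUEUE_SIZE) {
        a->inflight[head] = 0;
    }
}

/* Zero everything, as vhost_user_inflight_reset() does with memset(). */
static void inflight_reset(InflightArea *a)
{
    memset(a->inflight, 0, sizeof(a->inflight));
}

/*
 * After a restart: collect heads that were popped but never completed,
 * so they can be resubmitted instead of being lost. Returns the count.
 */
static size_t inflight_collect(const InflightArea *a, unsigned int *out)
{
    size_t n = 0;

    for (unsigned int i = 0; i < QUEUE_SIZE; i++) {
        if (a->inflight[i]) {
            out[n++] = i;
        }
    }
    return n;
}
```

If heads 1, 3 and 5 were popped and only 3 completed before the backend crashed, the restarted backend finds heads 1 and 5 still marked and can process them again.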

Re: [Qemu-devel] [PATCH for-4.0 2/6] vhost-user: Add shared memory to record inflight I/O

2018-12-05 Thread Marc-André Lureau

Hi
On Thu, Dec 6, 2018 at 10:40 AM  wrote:
>
> From: Xie Yongji 
>
> This introduces a new message VHOST_USER_SET_VRING_INFLIGHT
> to support offering shared memory to backend to record
> its inflight I/O.
>
> With this new message, the backend is able to restart without
> missing I/O, which would otherwise cause an I/O hang for the block device.
>
> Signed-off-by: Xie Yongji 
> Signed-off-by: Chai Wen 
> Signed-off-by: Zhang Yu 
> ---
>  hw/virtio/vhost-user.c| 69 +++
>  hw/virtio/vhost.c |  8 
>  include/hw/virtio/vhost-backend.h |  4 ++
>  include/hw/virtio/vhost-user.h|  8 

Please update docs/interop/vhost-user.txt to describe the new message

>  4 files changed, 89 insertions(+)
>
> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> index e09bed0e4a..4c0e64891d 100644
> --- a/hw/virtio/vhost-user.c
> +++ b/hw/virtio/vhost-user.c
> @@ -19,6 +19,7 @@
>  #include "sysemu/kvm.h"
>  #include "qemu/error-report.h"
>  #include "qemu/sockets.h"
> +#include "qemu/memfd.h"
>  #include "sysemu/cryptodev.h"
>  #include "migration/migration.h"
>  #include "migration/postcopy-ram.h"
> @@ -52,6 +53,7 @@ enum VhostUserProtocolFeature {
>  VHOST_USER_PROTOCOL_F_CONFIG = 9,
>  VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
>  VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
> +VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
>  VHOST_USER_PROTOCOL_F_MAX
>  };
>
> @@ -89,6 +91,7 @@ typedef enum VhostUserRequest {
>  VHOST_USER_POSTCOPY_ADVISE  = 28,
>  VHOST_USER_POSTCOPY_LISTEN  = 29,
>  VHOST_USER_POSTCOPY_END = 30,
> +VHOST_USER_SET_VRING_INFLIGHT = 31,

Why VRING? It seems to be a free/arbitrary memory area.

Oh, I see now that this has an explicit layout and behaviour,
described later in "libvhost-user: Support recording inflight I/O in
shared memory"

Please update the vhost-user spec first to describe expected usage/behaviour.


>  VHOST_USER_MAX
>  } VhostUserRequest;
>
> @@ -147,6 +150,11 @@ typedef struct VhostUserVringArea {
>  uint64_t offset;
>  } VhostUserVringArea;
>
> +typedef struct VhostUserVringInflight {
> +uint32_t size;
> +uint32_t idx;
> +} VhostUserVringInflight;
> +
>  typedef struct {
>  VhostUserRequest request;
>
> @@ -169,6 +177,7 @@ typedef union {
>  VhostUserConfig config;
>  VhostUserCryptoSession session;
>  VhostUserVringArea area;
> +VhostUserVringInflight inflight;
>  } VhostUserPayload;
>
>  typedef struct VhostUserMsg {
> @@ -1739,6 +1748,58 @@ static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
>  return result;
>  }
>
> +static int vhost_user_set_vring_inflight(struct vhost_dev *dev, int idx)
> +{
> +struct vhost_user *u = dev->opaque;
> +
> +if (!virtio_has_feature(dev->protocol_features,
> +VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
> +return 0;
> +}
> +
> +if (!u->user->inflight[idx].addr) {
> +Error *err = NULL;
> +
> +u->user->inflight[idx].size = qemu_real_host_page_size;
> +u->user->inflight[idx].addr = qemu_memfd_alloc("vhost-inflight",
> +  u->user->inflight[idx].size,
> +  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
> +  &u->user->inflight[idx].fd, &err);
> +if (err) {
> +error_report_err(err);
> +u->user->inflight[idx].addr = NULL;
> +return -1;
> +}
> +}
> +
> +VhostUserMsg msg = {
> +.hdr.request = VHOST_USER_SET_VRING_INFLIGHT,
> +.hdr.flags = VHOST_USER_VERSION,
> +.payload.inflight.size = u->user->inflight[idx].size,
> +.payload.inflight.idx = idx,
> +.hdr.size = sizeof(msg.payload.inflight),
> +};
> +
> +if (vhost_user_write(dev, &msg, &u->user->inflight[idx].fd, 1) < 0) {
> +return -1;
> +}
> +
> +return 0;
> +}
> +
> +void vhost_user_inflight_reset(VhostUserState *user)
> +{
> +int i;
> +
> +for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
> +if (!user->inflight[i].addr) {
> +continue;
> +}
> +
> +memset(user->inflight[i].addr, 0, user->inflight[i].size);
> +}
> +}
> +
>  VhostUserState *vhost_user_init(void)
>  {
>  VhostUserState *user = g_new0(struct VhostUserState, 1);
> @@ -1756,6 +1817,13 @@ void vhost_user_cleanup(VhostUserState *user)
>  munmap(user->notifier[i].addr, qemu_real_host_page_size);
>  user->notifier[i].addr = NULL;
>  }
> +
> +if (user->inflight[i].addr) {
> +munmap(user->inflight[i].addr, user->inflight[i].size);
> +user->inflight[i].addr = NULL;
> +close(user->inflight[i].fd);
> +user->inflight[i].fd = -1;
> +}
>  }
>  }
>
> @@ -1790,4 +1858,5 @@ const VhostOps user_ops = {
>  .vhost_crypto_create_session = vhost_user_crypto_create_session,

Re: [Qemu-devel] [PATCH for-4.0 3/6] libvhost-user: Introduce vu_queue_map_desc()

2018-12-05 Thread Marc-André Lureau

On Thu, Dec 6, 2018 at 10:40 AM  wrote:
>
> From: Xie Yongji 
>
> Introduce vu_queue_map_desc(), which should be
> independent of vu_queue_pop().
>
> Signed-off-by: Xie Yongji 
> Signed-off-by: Zhang Yu 

Reviewed-by: Marc-André Lureau 

> ---
>  contrib/libvhost-user/libvhost-user.c | 86 +++
>  1 file changed, 49 insertions(+), 37 deletions(-)
>
> diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
> index a6b46cdc03..4432bd8bb4 100644
> --- a/contrib/libvhost-user/libvhost-user.c
> +++ b/contrib/libvhost-user/libvhost-user.c
> @@ -1853,49 +1853,20 @@ virtqueue_alloc_element(size_t sz,
>  return elem;
>  }
>
> -void *
> -vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
> +static void *
> +vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
>  {
> -unsigned int i, head, max, desc_len;
> +struct vring_desc *desc = vq->vring.desc;
>  uint64_t desc_addr, read_len;
> +unsigned int desc_len;
> +unsigned int max = vq->vring.num;
> +unsigned int i = idx;
>  VuVirtqElement *elem;
> -unsigned out_num, in_num;
> +unsigned int out_num = 0, in_num = 0;
>  struct iovec iov[VIRTQUEUE_MAX_SIZE];
>  struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
> -struct vring_desc *desc;
>  int rc;
>
> -if (unlikely(dev->broken) ||
> -unlikely(!vq->vring.avail)) {
> -return NULL;
> -}
> -
> -if (vu_queue_empty(dev, vq)) {
> -return NULL;
> -}
> -/* Needed after virtio_queue_empty(), see comment in
> - * virtqueue_num_heads(). */
> -smp_rmb();
> -
> -/* When we start there are none of either input nor output. */
> -out_num = in_num = 0;
> -
> -max = vq->vring.num;
> -if (vq->inuse >= vq->vring.num) {
> -vu_panic(dev, "Virtqueue size exceeded");
> -return NULL;
> -}
> -
> -if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
> -return NULL;
> -}
> -
> -if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
> -vring_set_avail_event(vq, vq->last_avail_idx);
> -}
> -
> -i = head;
> -desc = vq->vring.desc;
>  if (desc[i].flags & VRING_DESC_F_INDIRECT) {
>  if (desc[i].len % sizeof(struct vring_desc)) {
>  vu_panic(dev, "Invalid size for indirect buffer table");
> @@ -1947,12 +1918,13 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
>  } while (rc == VIRTQUEUE_READ_DESC_MORE);
>
>  if (rc == VIRTQUEUE_READ_DESC_ERROR) {
> +vu_panic(dev, "read descriptor error");
>  return NULL;
>  }
>
>  /* Now copy what we have collected and mapped */
>  elem = virtqueue_alloc_element(sz, out_num, in_num);
> -elem->index = head;
> +elem->index = idx;
>  for (i = 0; i < out_num; i++) {
>  elem->out_sg[i] = iov[i];
>  }
> @@ -1960,6 +1932,46 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
>  elem->in_sg[i] = iov[out_num + i];
>  }
>
> +return elem;
> +}
> +
> +void *
> +vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
> +{
> +unsigned int head;
> +VuVirtqElement *elem;
> +
> +if (unlikely(dev->broken) ||
> +unlikely(!vq->vring.avail)) {
> +return NULL;
> +}
> +
> +if (vu_queue_empty(dev, vq)) {
> +return NULL;
> +}
> +/* Needed after virtio_queue_empty(), see comment in
> + * virtqueue_num_heads(). */
> +smp_rmb();
> +
> +if (vq->inuse >= vq->vring.num) {
> +vu_panic(dev, "Virtqueue size exceeded");
> +return NULL;
> +}
> +
> +if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
> +return NULL;
> +}
> +
> +if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
> +vring_set_avail_event(vq, vq->last_avail_idx);
> +}
> +
> +elem = vu_queue_map_desc(dev, vq, head, sz);
> +
> +if (!elem) {
> +return NULL;
> +}
> +
>  vq->inuse++;
>
>  return elem;
> --
> 2.17.1
>
>


-- 
Marc-André Lureau
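After this refactor, vu_queue_pop() only picks the next head index, and vu_queue_map_desc() walks the descriptor chain for a given index, presumably so that a known head can be mapped again later without popping it. The chain walk can be sketched as follows. `Desc` copies the split-ring `struct vring_desc` field layout and the two flag values from the virtio split ring, while `count_chain()` is a hypothetical helper that only counts buffers and skips indirect tables and address mapping.

```c
#include <stdint.h>

/* Flag values from the virtio split-ring layout. */
#define VRING_DESC_F_NEXT  1 /* chain continues via `next` */
#define VRING_DESC_F_WRITE 2 /* buffer is written by the device ("in") */

/* Same field layout as the split-ring struct vring_desc. */
typedef struct {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
} Desc;

/*
 * Walk the chain starting at head `idx` and count device-readable ("out")
 * and device-writable ("in") buffers, as a map_desc-style helper does
 * before filling out_sg/in_sg. Returns 0, or -1 on a malformed chain.
 */
static int count_chain(const Desc *desc, unsigned int num, unsigned int idx,
                       unsigned int *out_num, unsigned int *in_num)
{
    unsigned int steps = 0;

    *out_num = *in_num = 0;
    for (;;) {
        if (idx >= num || ++steps > num) {
            return -1; /* index out of range, or a loop in the chain */
        }
        if (desc[idx].flags & VRING_DESC_F_WRITE) {
            (*in_num)++;
        } else if (*in_num) {
            return -1; /* readable buffer after a writable one */
        } else {
            (*out_num)++;
        }
        if (!(desc[idx].flags & VRING_DESC_F_NEXT)) {
            return 0; /* end of chain */
        }
        idx = desc[idx].next;
    }
}
```

A typical virtio-blk request (request header, data buffer the device fills, one-byte status) yields one "out" and two "in" buffers.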

Re: [Qemu-devel] [PATCH] docs: Update references to JSON RFC

2018-12-05 Thread Markus Armbruster

Markus Armbruster  writes:

> Eric Blake  writes:
>
>> RFC8259 obsoletes RFC7159. Fix a couple of URLs to point to the
>> newer version.
>>
>> Signed-off-by: Eric Blake 
>
> Reviewed-by: Markus Armbruster 

Queued, thanks!

Re: [Qemu-devel] [PATCH v11 0/3] wakeup-from-suspend and system_wakeup changes

2018-12-05 Thread Markus Armbruster

Daniel Henrique Barboza  writes:

> changes in v11:
> - fixed typos, changed version to 4.0 in patches 1 and 3
> - changed text in patch 2 to be less alarming
> - patch 3: changed error handling
> - previous version link:
> http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg01774.html

Looks ready to me.  Who's going to merge it?

Re: [Qemu-devel] [PATCH v11 3/3] qmp hmp: Make system_wakeup check wake-up support and run state

2018-12-05 Thread Markus Armbruster

Daniel Henrique Barboza  writes:

> The qmp/hmp command 'system_wakeup' is simply a direct call to
> 'qemu_system_wakeup_request' from vl.c. This function verifies if
> runstate is SUSPENDED and if the wake up reason is valid before
> proceeding. However, no error or warning is thrown if any of those
> prerequisites isn't met. There is no way for the caller to
> differentiate between a successful wakeup and an error state caused
> when trying to wake up a guest that wasn't suspended.
>
> This means that system_wakeup is silently failing, which can be
> considered a bug. Adding error handling isn't an API break in this
> case - applications that didn't check the result will remain broken,
> the ones that check it will have a chance to deal with it.
>
> In addition, the commit before the previous one created a new QMP API called
> query-current-machine, with a new flag called wakeup-suspend-support,
> that indicates if the guest has the capability of waking up from suspended
> state. Although such a guest will never reach SUSPENDED state and erroring
> it out in this scenario would suffice, it is more informative for the user
> to differentiate between a failure because the guest isn't suspended versus
> a failure because the guest does not have support for wake up at all.
>
> All this considered, this patch changes qmp_system_wakeup to check if
> the guest is capable of waking up from suspend, and if it is suspended.
> After this patch, this is the output of system_wakeup in a guest that
> does not have wake-up from suspend support (ppc64):
>
> (qemu) system_wakeup
> wake-up from suspend is not supported by this guest
> (qemu)
>
> And this is the output of system_wakeup in an x86 guest that has the
> support but isn't suspended:
>
> (qemu) system_wakeup
> Unable to wake up: guest is not in suspended state
> (qemu)
>
> Reported-by: Balamuruhan S 
> Signed-off-by: Daniel Henrique Barboza 

Reviewed-by: Markus Armbruster
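The order of checks described in the commit message (wake-up support first, then the run state) can be modelled as below. `RunState` is a simplified stand-in defined locally, and `system_wakeup_check()` is a hypothetical helper that returns the error text instead of reporting it through QEMU's `Error **` mechanism; the two messages are the ones shown in the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for QEMU's run states. */
typedef enum { RUN_STATE_RUNNING, RUN_STATE_SUSPENDED } RunState;

/*
 * Hypothetical model of the checks in qmp_system_wakeup(): returns NULL
 * if the wakeup may proceed, otherwise the message the caller reports.
 */
static const char *system_wakeup_check(bool wakeup_suspend_support,
                                       RunState state)
{
    if (!wakeup_suspend_support) {
        return "wake-up from suspend is not supported by this guest";
    }
    if (state != RUN_STATE_SUSPENDED) {
        return "Unable to wake up: guest is not in suspended state";
    }
    return NULL; /* OK: go on and request the wakeup */
}
```

Checking support first means a ppc64 guest always gets the "not supported" message, even though it is also not suspended.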

Re: [Qemu-devel] [PATCH v11 2/3] qga: update guest-suspend-ram and guest-suspend-hybrid descriptions

2018-12-05 Thread Markus Armbruster

Daniel Henrique Barboza  writes:

> This patch updates the descriptions of 'guest-suspend-ram' and
> 'guest-suspend-hybrid' to mention that both commands now rely
> on the proper support for wake up from suspend, retrieved by the
> 'wakeup-suspend-support' attribute of the 'query-current-machine'
> QMP command.
>
> Reported-by: Balamuruhan S 
> Signed-off-by: Daniel Henrique Barboza 
> Reviewed-by: Michael Roth 

Reviewed-by: Markus Armbruster

Re: [Qemu-devel] [PATCH v11 1/3] qmp: query-current-machine with wakeup-suspend-support

2018-12-05 Thread Markus Armbruster

Daniel Henrique Barboza  writes:

> When issuing the qmp/hmp 'system_wakeup' command, what happens in a
> nutshell is:
>
> - qemu_system_wakeup_request sets runstate to RUNNING, sets a wakeup_reason
> and notifies the event
> - in the main_loop, all vcpus are paused, a system reset is issued, all
> subscribers of wakeup_notifiers receive a notification, vcpus are then
> resumed and the wake up QAPI event is fired
>
> Note that this procedure alone doesn't ensure that the guest will awake
> from SUSPENDED state - the subscribers of the wake up event must take
> action to resume the guest, otherwise the guest will simply reboot. At
> this moment, only the ACPI machines via acpi_pm1_cnt_init and xen_hvm_init
> have wake-up from suspend support.
>
> However, only the presence of 'system_wakeup' is required for QGA to
> support 'guest-suspend-ram' and 'guest-suspend-hybrid' at this moment.
> This means that the user/management will expect to suspend the guest using
> one of those suspend commands and then resume execution using system_wakeup,
> regardless of the support offered in system_wakeup in the first place.
>
> This patch creates a new API called query-current-machine [1], which holds
> a new flag called 'wakeup-suspend-support' that indicates if the guest
> supports wake up from suspend via system_wakeup. The machine is considered
> to implement wake-up support if a call to a new 'qemu_register_wakeup_support'
> is made during its init, as it is now being done inside acpi_pm1_cnt_init
> and xen_hvm_init. This allows any other machine type to declare wake-up
> support regardless of ACPI state or wakeup_notifiers subscription, making it
> easier for newer implementations that might have their own mechanisms in the future.
>
> This is the expected output of query-current-machine when running an x86
> guest:
>
> {"execute" : "query-current-machine"}
> {"return": {"wakeup-suspend-support": true}}
>
> Running the same x86 guest, but with the --no-acpi option:
>
> {"execute" : "query-current-machine"}
> {"return": {"wakeup-suspend-support": false}}
>
> This is the output when running a pseries guest:
>
> {"execute" : "query-current-machine"}
> {"return": {"wakeup-suspend-support": false}}
>
> With this extra tool, management can avoid situations where a guest
> that does not have proper suspend/wake capabilities ends up in
> inconsistent state (e.g.
> https://github.com/open-power-host-os/qemu/issues/31).
>
> [1] the decision of creating the query-current-machine API is based
> on discussions in the QEMU mailing list where it was decided that
> query-target wasn't a proper place to store the wake-up flag, neither
> was query-machines because this isn't a static property of the
> machine object. This new API can then be used to store other
> dynamic machine properties that are scattered around the code
> ATM. More info at:
> https://lists.gnu.org/archive/html/qemu-devel/2018-05/msg04235.html
>
> Reported-by: Balamuruhan S 
> Signed-off-by: Daniel Henrique Barboza 

Reviewed-by: Markus Armbruster
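The registration scheme described in this patch (a machine declares wake-up support during init, and query-current-machine reports it) can be reduced to a flag plus a reply formatter. `wakeup_suspend_enabled`, `register_wakeup_support()` and `format_current_machine_reply()` are hypothetical stand-ins for the patch's qemu_register_wakeup_support() and QAPI plumbing; the reply text matches the examples in the commit message.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical global standing in for the machine's wake-up capability. */
static bool wakeup_suspend_enabled;

/* Modelled on qemu_register_wakeup_support(): called from machine init. */
static void register_wakeup_support(void)
{
    wakeup_suspend_enabled = true;
}

/* Render a reply shaped like the query-current-machine examples. */
static int format_current_machine_reply(char *buf, size_t len)
{
    return snprintf(buf, len,
                    "{\"return\": {\"wakeup-suspend-support\": %s}}",
                    wakeup_suspend_enabled ? "true" : "false");
}
```

A machine that never calls the registration function (pseries, or x86 with --no-acpi in the examples) simply reports false.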

[Qemu-devel] [PATCH for-4.0 6/6] contrib/vhost-user-blk: enable inflight I/O recording

2018-12-05 Thread elohimes

From: Xie Yongji 

This patch tells qemu that we now support inflight I/O recording
so that qemu can offer shared memory to it.

Signed-off-by: Xie Yongji 
Signed-off-by: Zhang Yu 
---
 contrib/vhost-user-blk/vhost-user-blk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/contrib/vhost-user-blk/vhost-user-blk.c b/contrib/vhost-user-blk/vhost-user-blk.c
index 571f114a56..f87f9de8cd 100644
--- a/contrib/vhost-user-blk/vhost-user-blk.c
+++ b/contrib/vhost-user-blk/vhost-user-blk.c
@@ -328,7 +328,8 @@ vub_get_features(VuDev *dev)
 static uint64_t
 vub_get_protocol_features(VuDev *dev)
 {
-return 1ull << VHOST_USER_PROTOCOL_F_CONFIG;
+return 1ull << VHOST_USER_PROTOCOL_F_CONFIG |
+   1uLL << VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD;
 }
 
 static int
-- 
2.17.1
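The new return value relies on `<<` binding more tightly than `|`, so it ORs two single-bit masks. The front end only sends VHOST_USER_SET_VRING_INFLIGHT when it sees the INFLIGHT_SHMFD bit (the `virtio_has_feature()` check in vhost_user_set_vring_inflight() in patch 2/6). A small model of both sides follows; the enum names are hypothetical shorthands for the bit numbers in patch 2/6 (CONFIG = 9, INFLIGHT_SHMFD = 12), and `has_protocol_feature()` only mimics the idea of virtio_has_feature().

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit numbers as listed in VhostUserProtocolFeature in patch 2/6. */
enum {
    PROTO_F_CONFIG = 9,
    PROTO_F_INFLIGHT_SHMFD = 12,
};

/* Same expression shape as vub_get_protocol_features() after this patch. */
static uint64_t backend_protocol_features(void)
{
    return 1ull << PROTO_F_CONFIG |
           1ull << PROTO_F_INFLIGHT_SHMFD;
}

/* Front-end side test for a single negotiated bit. */
static bool has_protocol_feature(uint64_t features, unsigned int bit)
{
    return (features >> bit) & 1;
}
```

The result is 0x1200 (bits 9 and 12). The patch spells the second suffix `1uLL`; C accepts it as the same unsigned long long suffix as `1ull`.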

[Qemu-devel] [PATCH for-4.0 2/6] vhost-user: Add shared memory to record inflight I/O

2018-12-05 Thread elohimes

From: Xie Yongji 

This introduces a new message, VHOST_USER_SET_VRING_INFLIGHT,
which lets qemu offer shared memory to the backend so that the
backend can record its inflight I/O there.

With this new message, the backend is able to restart without
losing I/O, which would otherwise cause I/O hangs for block devices.

Signed-off-by: Xie Yongji 
Signed-off-by: Chai Wen 
Signed-off-by: Zhang Yu 
---
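A minimal sketch of how a SET_VRING_INFLIGHT request could be framed on the wire, assuming the usual vhost-user layout of a 12-byte header (request, flags, size) followed by the payload, with the memfd carried as SCM_RIGHTS ancillary data over the UNIX socket. The request value 31 comes from this patch and the flag 0x1 is `VHOST_USER_VERSION`; the struct and function names are illustrative, not qemu's.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define REQ_SET_VRING_INFLIGHT 31   /* value proposed by this patch */
#define MSG_VERSION_FLAG 0x1        /* VHOST_USER_VERSION */

/* Illustrative wire layout: header followed by the inflight payload. */
typedef struct {
    uint32_t request;
    uint32_t flags;
    uint32_t size;          /* payload size in bytes */
} __attribute__((packed)) MsgHdr;

typedef struct {
    uint32_t size;          /* size of the shared inflight region */
    uint32_t idx;           /* virtqueue index */
} InflightPayload;

typedef struct {
    MsgHdr hdr;
    InflightPayload inflight;
} __attribute__((packed)) InflightMsg;

/* Send one message with a single fd attached as SCM_RIGHTS. */
static int send_inflight(int sock, uint32_t size, uint32_t idx, int fd)
{
    InflightMsg msg = {
        .hdr = { REQ_SET_VRING_INFLIGHT, MSG_VERSION_FLAG,
                 sizeof(InflightPayload) },
        .inflight = { size, idx },
    };
    struct iovec iov = { .iov_base = &msg, .iov_len = sizeof(msg) };
    char control[CMSG_SPACE(sizeof(int))];
    memset(control, 0, sizeof(control));
    struct msghdr mh = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = control, .msg_controllen = sizeof(control),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&mh);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &mh, 0) == (ssize_t)sizeof(msg) ? 0 : -1;
}

/* Receive the message; returns the passed fd, or -1 on error. */
static int recv_inflight(int sock, InflightMsg *out)
{
    struct iovec iov = { .iov_base = out, .iov_len = sizeof(*out) };
    char control[CMSG_SPACE(sizeof(int))];
    struct msghdr mh = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = control, .msg_controllen = sizeof(control),
    };
    if (recvmsg(sock, &mh, 0) != (ssize_t)sizeof(*out)) {
        return -1;
    }
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&mh);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiving side then maps the fd with `mmap(..., MAP_SHARED, fd, 0)`, as `vu_set_vring_inflight()` does in patch 4/6, so both processes see the same pages.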
 hw/virtio/vhost-user.c| 69 +++
 hw/virtio/vhost.c |  8 
 include/hw/virtio/vhost-backend.h |  4 ++
 include/hw/virtio/vhost-user.h|  8 
 4 files changed, 89 insertions(+)

diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index e09bed0e4a..4c0e64891d 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -19,6 +19,7 @@
 #include "sysemu/kvm.h"
 #include "qemu/error-report.h"
 #include "qemu/sockets.h"
+#include "qemu/memfd.h"
 #include "sysemu/cryptodev.h"
 #include "migration/migration.h"
 #include "migration/postcopy-ram.h"
@@ -52,6 +53,7 @@ enum VhostUserProtocolFeature {
 VHOST_USER_PROTOCOL_F_CONFIG = 9,
 VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10,
 VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11,
+VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12,
 VHOST_USER_PROTOCOL_F_MAX
 };
 
@@ -89,6 +91,7 @@ typedef enum VhostUserRequest {
 VHOST_USER_POSTCOPY_ADVISE  = 28,
 VHOST_USER_POSTCOPY_LISTEN  = 29,
 VHOST_USER_POSTCOPY_END = 30,
+VHOST_USER_SET_VRING_INFLIGHT = 31,
 VHOST_USER_MAX
 } VhostUserRequest;
 
@@ -147,6 +150,11 @@ typedef struct VhostUserVringArea {
 uint64_t offset;
 } VhostUserVringArea;
 
+typedef struct VhostUserVringInflight {
+uint32_t size;
+uint32_t idx;
+} VhostUserVringInflight;
+
 typedef struct {
 VhostUserRequest request;
 
@@ -169,6 +177,7 @@ typedef union {
 VhostUserConfig config;
 VhostUserCryptoSession session;
 VhostUserVringArea area;
+VhostUserVringInflight inflight;
 } VhostUserPayload;
 
 typedef struct VhostUserMsg {
@@ -1739,6 +1748,58 @@ static bool vhost_user_mem_section_filter(struct vhost_dev *dev,
 return result;
 }
 
+static int vhost_user_set_vring_inflight(struct vhost_dev *dev, int idx)
+{
+struct vhost_user *u = dev->opaque;
+
+if (!virtio_has_feature(dev->protocol_features,
+VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD)) {
+return 0;
+}
+
+if (!u->user->inflight[idx].addr) {
+Error *err = NULL;
+
+u->user->inflight[idx].size = qemu_real_host_page_size;
+u->user->inflight[idx].addr = qemu_memfd_alloc("vhost-inflight",
+  u->user->inflight[idx].size,
+  F_SEAL_GROW | F_SEAL_SHRINK | F_SEAL_SEAL,
+  &u->user->inflight[idx].fd, &err);
+if (err) {
+error_report_err(err);
+u->user->inflight[idx].addr = NULL;
+return -1;
+}
+}
+
+VhostUserMsg msg = {
+.hdr.request = VHOST_USER_SET_VRING_INFLIGHT,
+.hdr.flags = VHOST_USER_VERSION,
+.payload.inflight.size = u->user->inflight[idx].size,
+.payload.inflight.idx = idx,
+.hdr.size = sizeof(msg.payload.inflight),
+};
+
+if (vhost_user_write(dev, &msg, &u->user->inflight[idx].fd, 1) < 0) {
+return -1;
+}
+
+return 0;
+}
+
+void vhost_user_inflight_reset(VhostUserState *user)
+{
+int i;
+
+for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
+if (!user->inflight[i].addr) {
+continue;
+}
+
+memset(user->inflight[i].addr, 0, user->inflight[i].size);
+}
+}
+
 VhostUserState *vhost_user_init(void)
 {
 VhostUserState *user = g_new0(struct VhostUserState, 1);
@@ -1756,6 +1817,13 @@ void vhost_user_cleanup(VhostUserState *user)
 munmap(user->notifier[i].addr, qemu_real_host_page_size);
 user->notifier[i].addr = NULL;
 }
+
+if (user->inflight[i].addr) {
+munmap(user->inflight[i].addr, user->inflight[i].size);
+user->inflight[i].addr = NULL;
+close(user->inflight[i].fd);
+user->inflight[i].fd = -1;
+}
 }
 }
 
@@ -1790,4 +1858,5 @@ const VhostOps user_ops = {
 .vhost_crypto_create_session = vhost_user_crypto_create_session,
 .vhost_crypto_close_session = vhost_user_crypto_close_session,
 .vhost_backend_mem_section_filter = vhost_user_mem_section_filter,
+.vhost_set_vring_inflight = vhost_user_set_vring_inflight,
 };
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 569c4053ea..2ca7b4e841 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -973,6 +973,14 @@ static int vhost_virtqueue_start(struct vhost_dev *dev,
 return -errno;
 }
 
+if (dev->vhost_ops->vhost_set_vring_inflight) {
+r = dev->vhost_ops->vhost_set_vring_inflight(dev, vhost_vq_index);
+if (r) {
+VHOST_OPS_DEBUG("vhost_set_vring_inflight failed");

[Qemu-devel] [PATCH for-4.0 3/6] libvhost-user: Introduce vu_queue_map_desc()

2018-12-05 Thread elohimes

From: Xie Yongji 

Introduce vu_queue_map_desc(), which maps a descriptor chain
independently of vu_queue_pop().

Signed-off-by: Xie Yongji 
Signed-off-by: Zhang Yu 
---
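To illustrate what the factored-out helper does, here is a simplified, hypothetical sketch: starting from a head index, follow `VRING_DESC_F_NEXT` links and count device-readable ("out") and device-writable ("in") buffers, the way `vu_queue_map_desc()` splits them into `out_sg` and `in_sg`. Indirect tables and guest address translation are left out; the flag values are those of the virtio split ring.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_F_NEXT  1  /* VRING_DESC_F_NEXT */
#define DESC_F_WRITE 2  /* VRING_DESC_F_WRITE */

struct desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Walk the chain starting at head; return 0 with the counts filled in,
 * or -1 if the chain is malformed. */
static int map_chain(const struct desc *table, unsigned int num,
                     unsigned int head, unsigned int *out_num,
                     unsigned int *in_num)
{
    unsigned int i = head, seen = 0;

    *out_num = *in_num = 0;
    for (;;) {
        if (i >= num || ++seen > num) {
            return -1;  /* index out of range, or a loop in the chain */
        }
        if (table[i].flags & DESC_F_WRITE) {
            (*in_num)++;
        } else if (*in_num) {
            return -1;  /* readable buffer after a writable one */
        } else {
            (*out_num)++;
        }
        if (!(table[i].flags & DESC_F_NEXT)) {
            return 0;
        }
        i = table[i].next;
    }
}
```

Because the walk takes only a head index, it can be reused for descriptors that were never popped in the current process, such as inflight ones recovered after a restart.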
 contrib/libvhost-user/libvhost-user.c | 86 +++
 1 file changed, 49 insertions(+), 37 deletions(-)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index a6b46cdc03..4432bd8bb4 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -1853,49 +1853,20 @@ virtqueue_alloc_element(size_t sz,
 return elem;
 }
 
-void *
-vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
+static void *
+vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
 {
-unsigned int i, head, max, desc_len;
+struct vring_desc *desc = vq->vring.desc;
 uint64_t desc_addr, read_len;
+unsigned int desc_len;
+unsigned int max = vq->vring.num;
+unsigned int i = idx;
 VuVirtqElement *elem;
-unsigned out_num, in_num;
+unsigned int out_num = 0, in_num = 0;
 struct iovec iov[VIRTQUEUE_MAX_SIZE];
 struct vring_desc desc_buf[VIRTQUEUE_MAX_SIZE];
-struct vring_desc *desc;
 int rc;
 
-if (unlikely(dev->broken) ||
-unlikely(!vq->vring.avail)) {
-return NULL;
-}
-
-if (vu_queue_empty(dev, vq)) {
-return NULL;
-}
-/* Needed after virtio_queue_empty(), see comment in
- * virtqueue_num_heads(). */
-smp_rmb();
-
-/* When we start there are none of either input nor output. */
-out_num = in_num = 0;
-
-max = vq->vring.num;
-if (vq->inuse >= vq->vring.num) {
-vu_panic(dev, "Virtqueue size exceeded");
-return NULL;
-}
-
-if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
-return NULL;
-}
-
-if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
-vring_set_avail_event(vq, vq->last_avail_idx);
-}
-
-i = head;
-desc = vq->vring.desc;
 if (desc[i].flags & VRING_DESC_F_INDIRECT) {
 if (desc[i].len % sizeof(struct vring_desc)) {
 vu_panic(dev, "Invalid size for indirect buffer table");
@@ -1947,12 +1918,13 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 } while (rc == VIRTQUEUE_READ_DESC_MORE);
 
 if (rc == VIRTQUEUE_READ_DESC_ERROR) {
+vu_panic(dev, "read descriptor error");
 return NULL;
 }
 
 /* Now copy what we have collected and mapped */
 elem = virtqueue_alloc_element(sz, out_num, in_num);
-elem->index = head;
+elem->index = idx;
 for (i = 0; i < out_num; i++) {
 elem->out_sg[i] = iov[i];
 }
@@ -1960,6 +1932,46 @@ vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 elem->in_sg[i] = iov[out_num + i];
 }
 
+return elem;
+}
+
+void *
+vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
+{
+unsigned int head;
+VuVirtqElement *elem;
+
+if (unlikely(dev->broken) ||
+unlikely(!vq->vring.avail)) {
+return NULL;
+}
+
+if (vu_queue_empty(dev, vq)) {
+return NULL;
+}
+/* Needed after virtio_queue_empty(), see comment in
+ * virtqueue_num_heads(). */
+smp_rmb();
+
+if (vq->inuse >= vq->vring.num) {
+vu_panic(dev, "Virtqueue size exceeded");
+return NULL;
+}
+
+if (!virtqueue_get_head(dev, vq, vq->last_avail_idx++, &head)) {
+return NULL;
+}
+
+if (vu_has_feature(dev, VIRTIO_RING_F_EVENT_IDX)) {
+vring_set_avail_event(vq, vq->last_avail_idx);
+}
+
+elem = vu_queue_map_desc(dev, vq, head, sz);
+
+if (!elem) {
+return NULL;
+}
+
 vq->inuse++;
 
 return elem;
-- 
2.17.1

[Qemu-devel] [PATCH for-4.0 4/6] libvhost-user: Support recording inflight I/O in shared memory

2018-12-05 Thread elohimes

From: Xie Yongji 

This patch adds support for the VHOST_USER_SET_VRING_INFLIGHT
message. We now maintain a "bitmap" of all descriptors in the
shared memory for each queue, setting a descriptor's entry in
vu_queue_pop() and clearing it in vu_queue_push().

Signed-off-by: Xie Yongji 
Signed-off-by: Zhang Yu 
---
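A self-contained model of the scheme, assuming one byte per descriptor in the shared region (as with `vq->inflight.addr[desc_idx]` below): pop sets the byte, push clears it, and after a restart the backend rescans the bytes and recomputes `last_avail_idx` as `used_idx + inuse`, as `vu_check_queue_inflights()` does. The struct and function names are hypothetical, and the used ring is reduced to a counter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QSIZE 8

/* Shared region: survives a backend restart. */
struct inflight_region {
    uint8_t desc[QSIZE];
};

/* Backend-private state: lost when the backend restarts. */
struct backend_vq {
    struct inflight_region *shm;
    uint16_t last_avail_idx;
    uint16_t used_idx;
    unsigned int inuse;
    uint16_t resubmit[QSIZE];   /* descriptors to resubmit after restart */
    unsigned int resubmit_num;
};

static void vq_pop(struct backend_vq *vq, uint16_t head)
{
    vq->shm->desc[head] = 1;    /* mark in flight before processing */
    vq->last_avail_idx++;
    vq->inuse++;
}

static void vq_push(struct backend_vq *vq, uint16_t head)
{
    vq->shm->desc[head] = 0;    /* completed: clear the mark */
    vq->used_idx++;
    vq->inuse--;
}

/* After a restart: rebuild state from the shared region and used index. */
static void vq_recover(struct backend_vq *vq, uint16_t used_idx_from_ring)
{
    vq->used_idx = used_idx_from_ring;
    vq->inuse = 0;
    vq->resubmit_num = 0;
    for (uint16_t i = 0; i < QSIZE; i++) {
        if (vq->shm->desc[i]) {
            vq->resubmit[vq->resubmit_num++] = i;
            vq->inuse++;
        }
    }
    /* Every consumed avail entry is either used or still in flight. */
    vq->last_avail_idx = (uint16_t)(vq->used_idx + vq->inuse);
}
```

In the example below, three requests are popped and one completes before the "crash"; the restarted backend finds two inflight descriptors and resumes consuming the avail ring at the same index the old process had reached.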
 contrib/libvhost-user/libvhost-user.c | 129 ++
 contrib/libvhost-user/libvhost-user.h |  19 
 2 files changed, 148 insertions(+)

diff --git a/contrib/libvhost-user/libvhost-user.c b/contrib/libvhost-user/libvhost-user.c
index 4432bd8bb4..38ef1f5898 100644
--- a/contrib/libvhost-user/libvhost-user.c
+++ b/contrib/libvhost-user/libvhost-user.c
@@ -100,6 +100,7 @@ vu_request_to_string(unsigned int req)
 REQ(VHOST_USER_POSTCOPY_ADVISE),
 REQ(VHOST_USER_POSTCOPY_LISTEN),
 REQ(VHOST_USER_POSTCOPY_END),
+REQ(VHOST_USER_SET_VRING_INFLIGHT),
 REQ(VHOST_USER_MAX),
 };
 #undef REQ
@@ -890,6 +891,41 @@ vu_check_queue_msg_file(VuDev *dev, VhostUserMsg *vmsg)
 return true;
 }
 
+static int
+vu_check_queue_inflights(VuDev *dev, VuVirtq *vq)
+{
+int i = 0;
+
+if ((dev->protocol_features &
+ VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD) == 0) {
+return 0;
+}
+
+if (unlikely(!vq->inflight.addr)) {
+return -1;
+}
+
+vq->used_idx = vq->vring.used->idx;
+vq->inflight_num = 0;
+for (i = 0; i < vq->vring.num; i++) {
+if (vq->inflight.addr[i] == 0) {
+continue;
+}
+
+vq->inflight_desc[vq->inflight_num++] = i;
+vq->inuse++;
+}
+vq->shadow_avail_idx = vq->last_avail_idx = vq->inuse + vq->used_idx;
+
+/* in case of I/O hang after reconnecting */
+if (eventfd_write(vq->kick_fd, 1) ||
+eventfd_write(vq->call_fd, 1)) {
+return -1;
+}
+
+return 0;
+}
+
 static bool
 vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -925,6 +961,10 @@ vu_set_vring_kick_exec(VuDev *dev, VhostUserMsg *vmsg)
dev->vq[index].kick_fd, index);
 }
 
+if (vu_check_queue_inflights(dev, &dev->vq[index])) {
+vu_panic(dev, "Failed to check inflights for vq: %d\n", index);
+}
+
 return false;
 }
 
@@ -1215,6 +1255,44 @@ vu_set_postcopy_end(VuDev *dev, VhostUserMsg *vmsg)
 return true;
 }
 
+static bool
+vu_set_vring_inflight(VuDev *dev, VhostUserMsg *vmsg)
+{
+int fd;
+uint32_t size, idx;
+void *rc;
+
+if (vmsg->fd_num != 1 ||
+vmsg->size != sizeof(vmsg->payload.inflight)) {
+vu_panic(dev, "Invalid vring_inflight message size:%d fds:%d",
+ vmsg->size, vmsg->fd_num);
+return false;
+}
+
+fd = vmsg->fds[0];
+idx = vmsg->payload.inflight.idx;
+size = vmsg->payload.inflight.size;
+DPRINT("vring_inflight idx: %"PRIu32"\n", idx);
+DPRINT("vring_inflight size: %"PRIu32"\n", size);
+
+rc = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+
+close(fd);
+
+if (rc == MAP_FAILED) {
+vu_panic(dev, "vring_inflight mmap error: %s", strerror(errno));
+return false;
+}
+
+if (dev->vq[idx].inflight.addr) {
+munmap(dev->vq[idx].inflight.addr, dev->vq[idx].inflight.size);
+}
+dev->vq[idx].inflight.addr = (char *)rc;
+dev->vq[idx].inflight.size = size;
+
+return false;
+}
+
 static bool
 vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 {
@@ -1292,6 +1370,8 @@ vu_process_message(VuDev *dev, VhostUserMsg *vmsg)
 return vu_set_postcopy_listen(dev, vmsg);
 case VHOST_USER_POSTCOPY_END:
 return vu_set_postcopy_end(dev, vmsg);
+case VHOST_USER_SET_VRING_INFLIGHT:
+return vu_set_vring_inflight(dev, vmsg);
 default:
 vmsg_close_fds(vmsg);
 vu_panic(dev, "Unhandled request: %d", vmsg->request);
@@ -1359,6 +1439,11 @@ vu_deinit(VuDev *dev)
 close(vq->err_fd);
 vq->err_fd = -1;
 }
+
+if (vq->inflight.addr) {
+munmap(vq->inflight.addr, vq->inflight.size);
+vq->inflight.addr = NULL;
+}
 }
 
 
@@ -1935,9 +2020,44 @@ vu_queue_map_desc(VuDev *dev, VuVirtq *vq, unsigned int idx, size_t sz)
 return elem;
 }
 
+static int
+vu_queue_inflight_get(VuDev *dev, VuVirtq *vq, int desc_idx)
+{
+if ((dev->protocol_features &
+ VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD) == 0) {
+return 0;
+}
+
+if (unlikely(!vq->inflight.addr)) {
+return -1;
+}
+
+vq->inflight.addr[desc_idx] = 1;
+
+return 0;
+}
+
+static int
+vu_queue_inflight_put(VuDev *dev, VuVirtq *vq, int desc_idx)
+{
+if ((dev->protocol_features &
+ VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD) == 0) {
+return 0;
+}
+
+if (unlikely(!vq->inflight.addr)) {
+return -1;
+}
+
+vq->inflight.addr[desc_idx] = 0;
+
+return 0;
+}
+
 void *
 vu_queue_pop(VuDev *dev, VuVirtq *vq, size_t sz)
 {
+int i;
 unsigned int head;
 VuVirtqElement *elem;
 
@@ -1946,6

[Qemu-devel] [PATCH for-4.0 5/6] vhost-user-blk: Add support for reconnecting backend

2018-12-05 Thread elohimes

From: Xie Yongji 

With the new message VHOST_USER_SET_VRING_INFLIGHT,
the backend is able to restart safely. This patch
allows qemu to reconnect to the backend after the
connection is closed.

Signed-off-by: Xie Yongji 
Signed-off-by: Ni Xun 
Signed-off-by: Zhang Yu 
---
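The `set_status` change separates what the guest asked for (`should_start`) from whether vhost is actually running (`dev.started`), which can only happen while the socket is `connected`. Below is a hypothetical, simplified model of that logic. Since the connect and close handlers are cut off above, their behaviour here (start vhost on connect if `should_start` is set, stop it on close while keeping `should_start`) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced device state. */
struct blk_state {
    bool connected;     /* backend socket is up */
    bool should_start;  /* guest driver wants the device running */
    bool started;       /* vhost is actually running */
};

static void blk_set_status(struct blk_state *s, bool driver_ok)
{
    if (s->should_start == driver_ok) {
        return;
    }
    s->should_start = driver_ok;
    if (!s->connected) {
        return;             /* remember the request; act on connect */
    }
    s->started = driver_ok; /* stands in for vhost start/stop */
}

/* Assumed connect handler: resume the device if the guest wants it. */
static void blk_connect(struct blk_state *s)
{
    s->connected = true;
    if (s->should_start && !s->started) {
        s->started = true;
    }
}

/* Assumed close handler: stop vhost but keep the guest's request. */
static void blk_disconnect(struct blk_state *s)
{
    s->connected = false;
    s->started = false;
}
```

Keeping `should_start` across a disconnect is what lets a restarted backend pick up where the old one left off without any action from the guest.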
 hw/block/vhost-user-blk.c  | 169 +++--
 include/hw/virtio/vhost-user-blk.h |   4 +
 2 files changed, 161 insertions(+), 12 deletions(-)

diff --git a/hw/block/vhost-user-blk.c b/hw/block/vhost-user-blk.c
index 1451940845..663e91bcf6 100644
--- a/hw/block/vhost-user-blk.c
+++ b/hw/block/vhost-user-blk.c
@@ -101,7 +101,7 @@ const VhostDevConfigOps blk_ops = {
 .vhost_dev_config_notifier = vhost_user_blk_handle_config_change,
 };
 
-static void vhost_user_blk_start(VirtIODevice *vdev)
+static int vhost_user_blk_start(VirtIODevice *vdev)
 {
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
 BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
@@ -110,13 +110,13 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
 
 if (!k->set_guest_notifiers) {
 error_report("binding does not support guest notifiers");
-return;
+return -ENOSYS;
 }
 
 ret = vhost_dev_enable_notifiers(&s->dev, vdev);
 if (ret < 0) {
 error_report("Error enabling host notifiers: %d", -ret);
-return;
+return ret;
 }
 
 ret = k->set_guest_notifiers(qbus->parent, s->dev.nvqs, true);
@@ -140,12 +140,13 @@ static void vhost_user_blk_start(VirtIODevice *vdev)
 vhost_virtqueue_mask(&s->dev, vdev, i, false);
 }
 
-return;
+return ret;
 
 err_guest_notifiers:
 k->set_guest_notifiers(qbus->parent, s->dev.nvqs, false);
 err_host_notifiers:
 vhost_dev_disable_notifiers(&s->dev, vdev);
+return ret;
 }
 
 static void vhost_user_blk_stop(VirtIODevice *vdev)
@@ -164,7 +165,6 @@ static void vhost_user_blk_stop(VirtIODevice *vdev)
 ret = k->set_guest_notifiers(qbus->parent, s->dev.nvqs, false);
 if (ret < 0) {
 error_report("vhost guest notifier cleanup failed: %d", ret);
-return;
 }
 
 vhost_dev_disable_notifiers(&s->dev, vdev);
@@ -174,21 +174,39 @@ static void vhost_user_blk_set_status(VirtIODevice *vdev, uint8_t status)
 {
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
 bool should_start = status & VIRTIO_CONFIG_S_DRIVER_OK;
+int ret;
 
 if (!vdev->vm_running) {
 should_start = false;
 }
 
-if (s->dev.started == should_start) {
+if (s->should_start == should_start) {
+return;
+}
+
+if (!s->connected || s->dev.started == should_start) {
+s->should_start = should_start;
 return;
 }
 
 if (should_start) {
-vhost_user_blk_start(vdev);
+s->should_start = true;
+/* make sure we ignore fake guest kick by
+ * vhost_dev_enable_notifiers() */
+barrier();
+ret = vhost_user_blk_start(vdev);
+if (ret < 0) {
+error_report("vhost-user-blk: vhost start failed: %s",
+ strerror(-ret));
+qemu_chr_fe_disconnect(&s->chardev);
+}
 } else {
 vhost_user_blk_stop(vdev);
+/* make sure we ignore fake guest kick by
+ * vhost_dev_disable_notifiers() */
+barrier();
+s->should_start = false;
 }
-
 }
 
 static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev,
@@ -218,13 +236,22 @@ static uint64_t vhost_user_blk_get_features(VirtIODevice *vdev,
 static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
 VHostUserBlk *s = VHOST_USER_BLK(vdev);
-int i;
+int i, ret;
 
 if (!(virtio_host_has_feature(vdev, VIRTIO_F_VERSION_1) &&
 !virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1))) {
 return;
 }
 
+if (s->should_start) {
+return;
+}
+s->should_start = true;
+
+if (!s->connected) {
+return;
+}
+
 if (s->dev.started) {
 return;
 }
@@ -232,7 +259,13 @@ static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
  * vhost here instead of waiting for .set_status().
  */
-vhost_user_blk_start(vdev);
+ret = vhost_user_blk_start(vdev);
+if (ret < 0) {
+error_report("vhost-user-blk: vhost start failed: %s",
+ strerror(-ret));
+qemu_chr_fe_disconnect(&s->chardev);
+return;
+}
 
 /* Kick right away to begin processing requests already in vring */
 for (i = 0; i < s->dev.nvqs; i++) {
@@ -245,6 +278,106 @@ static void vhost_user_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 }
 }
 
+static void vhost_user_blk_reset(VirtIODevice *vdev)
+{
+VHostUserBlk *s = VHOST_USER_BLK(vdev);
+
+if (s->vhost_user) {
+vhost_user_inflight_reset(s->vhost_user);
+}
+}
+
+static int vhost_user_blk_connect(DeviceState *dev)
+{
+VirtIODevice *vdev = VIRTIO_DEVICE(dev);
+

[Qemu-devel] [PATCH for-4.0 1/6] char-socket: Enable "wait" option for client mode

2018-12-05 Thread elohimes

From: Xie Yongji 

Currently we attempt to connect asynchronously for a "reconnect socket"
during open(). But the vhost-user device prefers a connected socket
during initialization. That means we may still need to support a
synchronous connection during open() for the "reconnect socket".

Signed-off-by: Xie Yongji 
Signed-off-by: Zhang Yu 
---
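The option handling in this patch reduces to a small truth table. A hypothetical restatement in plain C (the helper names are illustrative, not qemu's): `wait` still defaults to on for servers, now defaults to off for clients instead of being forced off, and a client with `reconnect` set only connects asynchronously when `wait` is off.

```c
#include <assert.h>
#include <stdbool.h>

/* wait_opt: -1 if "wait" was not given, otherwise 0 or 1. */
static bool is_waitconnect(bool is_listen, int wait_opt)
{
    bool dflt = is_listen;      /* servers wait by default, clients don't */
    return wait_opt < 0 ? dflt : wait_opt != 0;
}

/* Mirrors "if (s->reconnect_time && !is_waitconnect)". */
static bool connects_async(int reconnect_time, bool waitconnect)
{
    return reconnect_time && !waitconnect;
}
```

Before the patch, `is_waitconnect` was always false for clients, so `reconnect=1,wait` had no effect and open() always returned before the socket was connected.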
 chardev/char-socket.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/chardev/char-socket.c b/chardev/char-socket.c
index eaa8e8b68f..f2819d52e7 100644
--- a/chardev/char-socket.c
+++ b/chardev/char-socket.c
@@ -1072,7 +1072,7 @@ static void qmp_chardev_open_socket(Chardev *chr,
 s->reconnect_time = reconnect;
 }
 
-if (s->reconnect_time) {
+if (s->reconnect_time && !is_waitconnect) {
 tcp_chr_connect_async(chr);
 } else {
 if (s->is_listen) {
@@ -1120,7 +1120,8 @@ static void qemu_chr_parse_socket(QemuOpts *opts, ChardevBackend *backend,
   Error **errp)
 {
 bool is_listen  = qemu_opt_get_bool(opts, "server", false);
-bool is_waitconnect = is_listen && qemu_opt_get_bool(opts, "wait", true);
+bool is_waitconnect = is_listen ? qemu_opt_get_bool(opts, "wait", true) :
+  qemu_opt_get_bool(opts, "wait", false);
 bool is_telnet  = qemu_opt_get_bool(opts, "telnet", false);
 bool is_tn3270  = qemu_opt_get_bool(opts, "tn3270", false);
 bool is_websock = qemu_opt_get_bool(opts, "websocket", false);
-- 
2.17.1

[Qemu-devel] [PATCH for-4.0 0/6] vhost-user-blk: Add support for backend reconnecting

2018-12-05 Thread elohimes

From: Xie Yongji 

This patchset aims to allow qemu to reconnect to the vhost-user-blk
backend after the backend crashes or restarts.

Patch 1 implements a synchronous connection during open() for the
"reconnect socket".

Patch 2 introduces a new message, VHOST_USER_SET_VRING_INFLIGHT,
which lets qemu offer shared memory to the backend so that the
backend can record its inflight I/O.

Patches 3 and 4 are the corresponding libvhost-user changes for
patch 2; they make libvhost-user support VHOST_USER_SET_VRING_INFLIGHT.

Patch 5 enables vhost-user-blk to reconnect to the backend when the
connection is closed.

Patch 6 tells qemu that the vhost-user-blk backend now supports
reconnecting.

To use it, we could start qemu with:

qemu-system-x86_64 \
-chardev socket,id=char0,path=/path/vhost.socket,reconnect=1,wait \
-device vhost-user-blk-pci,chardev=char0

and start vhost-user-blk backend with:

vhost-user-blk -b /path/file -s /path/vhost.socket

Then we can restart vhost-user-blk at any time while the VM is running.

Xie Yongji (6):
  char-socket: Enable "wait" option for client mode
  vhost-user: Add shared memory to record inflight I/O
  libvhost-user: Introduce vu_queue_map_desc()
  libvhost-user: Support recording inflight I/O in shared memory
  vhost-user-blk: Add support for reconnecting backend
  contrib/vhost-user-blk: enable inflight I/O recording

 chardev/char-socket.c   |   5 +-
 contrib/libvhost-user/libvhost-user.c   | 215 
 contrib/libvhost-user/libvhost-user.h   |  19 +++
 contrib/vhost-user-blk/vhost-user-blk.c |   3 +-
 hw/block/vhost-user-blk.c   | 169 +--
 hw/virtio/vhost-user.c  |  69 
 hw/virtio/vhost.c   |   8 +
 include/hw/virtio/vhost-backend.h   |   4 +
 include/hw/virtio/vhost-user-blk.h  |   4 +
 include/hw/virtio/vhost-user.h  |   8 +
 10 files changed, 452 insertions(+), 52 deletions(-)

-- 
2.17.1

Re: [Qemu-devel] [PATCH RFC v2 3/5] migration: fix the multifd code when receiving less channels

2018-12-05 Thread Fei Li





On 11/30/2018 11:45 AM, Fei Li wrote:



On 11/29/2018 10:46 PM, Philippe Mathieu-Daudé wrote:

Hi Fei,

On 29/11/18 11:03, Fei Li wrote:

In our current code, when multifd is used during migration, if there
is an error before the destination receives all new channels, the
source keeps running, however the destination does not exit but keeps
waiting until the source is killed deliberately.

Fix this by dumping the specific error and let users decide whether
to quit from the destination side when failing to receive packet via
some channel.

Signed-off-by: Fei Li 
Reviewed-by: Peter Xu 
---
  migration/channel.c   | 11 ++-
  migration/migration.c |  9 +++--
  migration/migration.h |  2 +-
  migration/ram.c   |  7 ++-
  migration/ram.h   |  2 +-
  5 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/migration/channel.c b/migration/channel.c
index 33e0e9b82f..20e4c8e2dc 100644
--- a/migration/channel.c
+++ b/migration/channel.c
@@ -30,6 +30,7 @@
  void migration_channel_process_incoming(QIOChannel *ioc)
  {
  MigrationState *s = migrate_get_current();
+    Error *local_err = NULL;
    trace_migration_set_incoming_channel(
  ioc, object_get_typename(OBJECT(ioc)));
@@ -38,13 +39,13 @@ void migration_channel_process_incoming(QIOChannel *ioc)
  *s->parameters.tls_creds &&
  !object_dynamic_cast(OBJECT(ioc),
   TYPE_QIO_CHANNEL_TLS)) {
-    Error *local_err = NULL;
  migration_tls_channel_process_incoming(s, ioc, &local_err);
-    if (local_err) {
-    error_report_err(local_err);
-    }
  } else {
-    migration_ioc_process_incoming(ioc);
+    migration_ioc_process_incoming(ioc, &local_err);
+    }
+
+    if (local_err) {
+    error_report_err(local_err);
  }
  }
diff --git a/migration/migration.c b/migration/migration.c
index 49ffb9997a..72106bddf0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -541,7 +541,7 @@ void migration_fd_process_incoming(QEMUFile *f)
  migration_incoming_process();
  }
-void migration_ioc_process_incoming(QIOChannel *ioc)
+void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp)
  {
  MigrationIncomingState *mis = migration_incoming_get_current();
  bool start_migration;
@@ -563,9 +563,14 @@ void migration_ioc_process_incoming(QIOChannel *ioc)
   */
  start_migration = !migrate_use_multifd();
  } else {
+    Error *local_err = NULL;
  /* Multiple connections */
  assert(migrate_use_multifd());
-    start_migration = multifd_recv_new_channel(ioc);
+    start_migration = multifd_recv_new_channel(ioc, &local_err);
+    if (local_err) {
+    error_propagate(errp, local_err);
+    return;
+    }
  }
    if (start_migration) {
diff --git a/migration/migration.h b/migration/migration.h
index e413d4d8b6..02b7304610 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -229,7 +229,7 @@ struct MigrationState
  void migrate_set_state(int *state, int old_state, int new_state);
    void migration_fd_process_incoming(QEMUFile *f);
-void migration_ioc_process_incoming(QIOChannel *ioc);
+void migration_ioc_process_incoming(QIOChannel *ioc, Error **errp);
  void migration_incoming_process(void);
    bool  migration_has_all_channels(void);
diff --git a/migration/ram.c b/migration/ram.c
index 7e7deec4d8..e13b9349d0 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1323,7 +1323,7 @@ bool multifd_recv_all_channels_created(void)
  }
    /* Return true if multifd is ready for the migration, otherwise false */
-bool multifd_recv_new_channel(QIOChannel *ioc)
+bool multifd_recv_new_channel(QIOChannel *ioc, Error **errp)
  {
  MultiFDRecvParams *p;
  Error *local_err = NULL;
@@ -1331,6 +1331,10 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
    id = multifd_recv_initial_packet(ioc, &local_err);
  if (id < 0) {
+    error_propagate_prepend(errp, local_err,
+    "failed to receive packet"
+    " via multifd channel %d: ",
+ multifd_recv_state->count);

Shouldn't we use atomic_read(&multifd_recv_state->count) here?

Right, will update this in next version. Thanks for pointing it out. :)
BTW, should we make the same change for the statement below:
` return multifd_recv_state->count == migrate_multifd_channels();`?

Have a nice day
Fei
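For context on the `atomic_read()` question: `count` is updated with an atomic increment, so other reads of it should also be atomic. A sketch of the pattern using C11 `<stdatomic.h>` as a stand-in for qemu's own atomic macros; the variable and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for multifd_recv_state->count. */
static atomic_int recv_count;

/* Register one channel; return true if it was the last one expected.
 * Using the value returned by the increment, rather than a separate
 * read, means exactly one caller sees the count reach `expected`. */
static bool register_channel(int expected)
{
    int now = atomic_fetch_add(&recv_count, 1) + 1;
    return now == expected;
}

/* Readers elsewhere use an explicit atomic load. */
static int channels_received(void)
{
    return atomic_load(&recv_count);
}
```

Either approach (an atomic load, or reusing the increment's result) avoids a plain, unsynchronized read of a counter that is updated atomically.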

Kindly ping. :)
Thanks in advance.


Patch looks good otherwise.

Regards,

Phil.


multifd_recv_terminate_threads(local_err);
  return false;
  }
@@ -1340,6 +1344,7 @@ bool multifd_recv_new_channel(QIOChannel *ioc)
  error_setg(&local_err, "multifd: received id '%d' already setup'",
 id);
  multifd_recv_terminate_threads(local_err);
+    error_propagate(errp, local_err);
  return false;
  }
  p->c = ioc;
diff --git a/migration/ram.h b/migration/ram.h
index 83ff1bc11a..046d3074be 100644
---
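The patch threads an `Error **errp` parameter through the incoming-channel path so that the failure is reported once, in `migration_channel_process_incoming()`, with context added on the way up. A toy, self-contained model of that convention (set an error only if the caller asked for one, propagate it upward, prefix context); the `MiniError` type and its helpers are hypothetical and are not qemu's `qapi/error.h` API.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char msg[256]; } MiniError;

/* Create an error in *errp, if the caller passed a place to put it. */
static void mini_error_setg(MiniError **errp, const char *fmt, ...)
{
    if (!errp) {
        return;
    }
    MiniError *err = calloc(1, sizeof(*err));
    if (!err) {
        return;
    }
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(err->msg, sizeof(err->msg), fmt, ap);
    va_end(ap);
    *errp = err;
}

/* Hand local to the caller with a prefix; free it if nobody wants it. */
static void mini_error_propagate_prepend(MiniError **dst, MiniError *local,
                                         const char *prefix)
{
    if (!local) {
        return;
    }
    if (!dst) {
        free(local);
        return;
    }
    char buf[sizeof(local->msg)];
    snprintf(buf, sizeof(buf), "%s%s", prefix, local->msg);
    memcpy(local->msg, buf, sizeof(buf));
    *dst = local;
}

/* Hypothetical lower layer that fails to read the initial packet. */
static int recv_initial_packet(MiniError **errp)
{
    mini_error_setg(errp, "unexpected magic");
    return -1;
}

/* Middle layer, shaped like multifd_recv_new_channel() after the patch. */
static bool recv_new_channel(int channel, MiniError **errp)
{
    MiniError *local_err = NULL;
    if (recv_initial_packet(&local_err) < 0) {
        char prefix[64];
        snprintf(prefix, sizeof(prefix),
                 "failed to receive packet via multifd channel %d: ", channel);
        mini_error_propagate_prepend(errp, local_err, prefix);
        return false;
    }
    return true;
}
```

The top-level caller owns the error and decides how to report it, which is the point of moving `error_report_err()` up into `migration_channel_process_incoming()`.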

Re: [Qemu-devel] [PATCH v6 06/37] ppc/xive: add support for the END Event State buffers

2018-12-05 Thread Cédric Le Goater

On 12/6/18 5:09 AM, David Gibson wrote:
> On Thu, Dec 06, 2018 at 12:22:20AM +0100, Cédric Le Goater wrote:
>> The Event Notification Descriptor (END) XIVE structure also contains
>> two Event State Buffers providing further coalescing of interrupts,
>> one for the notification event (ESn) and one for the escalation events
>> (ESe). An MMIO page is assigned for each to control the EOI through
>> loads only. Stores are not allowed.
>>
>> The END ESBs are modeled through an object resembling the 'XiveSource'.
>> It is stateless as the END state bits are backed in the XiveEND
>> structure under the XiveRouter and the MMIO accesses follow the same
>> rules as for the standard source ESBs.
>>
>> END ESBs are not supported by the Linux drivers, either on OPAL or on
>> sPAPR. Nevertheless, they provide a means to study the question in the
>> future and further validate the XIVE model.
>>
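The `xive_esb_trigger()` call in this patch drives the ESn bits through a two-bit P/Q state machine (P = pending, Q = queued). A standalone sketch of the trigger and EOI transitions, assuming the standard ESB semantics used for XiveSource interrupts; the constant and function names are illustrative, not qemu's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PQ_P 0x2
#define PQ_Q 0x1
enum { ESB_RESET = 0x0, ESB_OFF = PQ_Q, ESB_PENDING = PQ_P,
       ESB_QUEUED = PQ_P | PQ_Q };

/* Trigger: returns true if a notification must be forwarded. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce: remember one more event */
        return false;
    default:                /* ESB_OFF: masked, drop the event */
        return false;
    }
}

/* EOI: returns true if a queued event must be notified again. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    default:                /* ESB_OFF stays off */
        return false;
    }
}
```

In the END case, a `false` result from the trigger is what makes `xive_router_end_notify()` return early ("ESn[Q]=1 : end of notification").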
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  include/hw/ppc/xive.h |  22 ++
>>  hw/intc/xive.c| 173 +-
>>  2 files changed, 193 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index d1b4c6c78ec5..d67b0785df7c 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -305,6 +305,8 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>  
>>  typedef struct XiveRouter {
>>  SysBusDeviceparent;
>> +
>> +uint32_t   chip_id;
> 
> I still don't think you need this..

I know :) 

> 
>>  } XiveRouter;
>>  
>>  #define TYPE_XIVE_ROUTER "xive-router"
>> @@ -336,6 +338,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>  int xive_router_write_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>XiveEND *end, uint8_t word_number);
>>  
>> +/*
>> + * XIVE END ESBs
>> + */
>> +
>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
>> +#define XIVE_END_SOURCE(obj) \
>> +OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
>> +
>> +typedef struct XiveENDSource {
>> +DeviceState parent;
>> +
>> +uint32_tnr_ends;
>> +
>> +/* ESB memory region */
>> +uint32_tesb_shift;
>> +MemoryRegionesb_mmio;
>> +
>> +XiveRouter  *xrtr;
> 
> ..or this..
> 
>> +} XiveENDSource;
>> +
>>  /*
>>   * For legacy compatibility, the exceptions define up to 256 different
>>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 41d8ba1540d0..83686e260df5 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -612,8 +612,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>   * even futher coalescing in the Router
>>   */
>>  if (!xive_end_is_notify(&end)) {
>> -qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
>> -return;
>> +uint8_t pq = GETFIELD_BE32(END_W1_ESn, end.w1);
>> +bool notify = xive_esb_trigger(&pq);
>> +
>> +if (pq != GETFIELD_BE32(END_W1_ESn, end.w1)) {
>> +end.w1 = SETFIELD_BE32(END_W1_ESn, end.w1, pq);
>> +xive_router_write_end(xrtr, end_blk, end_idx, &end, 1);
>> +}
>> +
>> +/* ESn[Q]=1 : end of notification */
>> +if (!notify) {
>> +return;
>> +}
>>  }
>>  
>>  /*
>> @@ -658,12 +668,18 @@ static void xive_router_notify(XiveNotifier *xn, uint32_t lisn)
>> GETFIELD_BE64(EAS_END_DATA,  eas.w));
>>  }
>>  
>> +static Property xive_router_properties[] = {
>> +DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
>> +DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>>  static void xive_router_class_init(ObjectClass *klass, void *data)
>>  {
>>  DeviceClass *dc = DEVICE_CLASS(klass);
>>  XiveNotifierClass *xnc = XIVE_NOTIFIER_CLASS(klass);
>>  
>>  dc->desc= "XIVE Router Engine";
>> +dc->props   = xive_router_properties;
>>  xnc->notify = xive_router_notify;
>>  }
>>  
>> @@ -692,6 +708,158 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>> (uint32_t) GETFIELD_BE64(EAS_END_DATA, eas->w));
>>  }
>>  
>> +/*
>> + * END ESB MMIO loads
>> + */
>> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
>> +XiveRouter *xrtr = xsrc->xrtr;
>> +uint32_t offset = addr & 0xFFF;
>> +uint8_t end_blk;
>> +uint32_t end_idx;
>> +XiveEND end;
>> +uint32_t end_esmask;
>> +uint8_t pq;
>> +uint64_t ret = -1;
>> +
>> +end_blk = xrtr->chip_id;
> 
> .. instead I think it makes more sense to just configure the end_blk
> directly on the end_source, rather than reaching into another object
> to 

Ah. That's what I was asking in an email. Maybe I missed the answer.
Let's drop it; the sPAPRXive block will then be 0.

> 
>> +end_idx =

Re: [Qemu-devel] [PATCH for-4.0 v4 4/7] monitor: check if chardev can switch gcontext for OOB

2018-12-05 Thread Marc-André Lureau

Hi
On Thu, Dec 6, 2018 at 10:08 AM Markus Armbruster  wrote:
>
> One more question...
>
> Marc-André Lureau  writes:
>
> > Not all backends are able to switch gcontext. Those backends cannot
> > drive an OOB monitor (the monitor would then be blocking on the main
> > thread).
> >
> > For example, ringbuf, spice, or more esoteric input chardevs like
> > braille or MUX.
>
> These chardevs don't provide QEMU_CHAR_FEATURE_GCONTEXT.
>
> > We currently forbid MUX because not all frontends are ready to run
> > outside the main loop. Extend this with a context-switching feature check.
>
> Why check CHARDEV_IS_MUX() when chardev-mux already fails the
> qemu_char_feature_gcontext(chr, QEMU_CHAR_FEATURE_GCONTEXT) check?
>


It currently fails, but with "[PATCH 4/9] char: update the mux
hanlders in class callback", it won't.

But the main reason to keep an explicit check on mux is that the
monitor frontend doesn't know whether the other mux frontends can be
called from any context (when you set a context, it is set on the
backend side, and events are dispatched by the backend).

We may want to merge this extra frontend-side capability limitation
into the FEATURE_GCONTEXT flag, but they are fundamentally different:
one is the ability to set a backend context, the other is whether the
attached mux frontends can be
dispatched from any context.
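The distinction can be sketched as two independent predicates: a backend-side capability (can the chardev be attached to another context?) and a frontend-side property (is it a mux whose other frontends might then be dispatched from that context?). This is a hypothetical model; the struct, the flag, and the function names are illustrative and are not the chardev API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEATURE_GCONTEXT (1u << 0)   /* hypothetical feature bit */

struct chardev_model {
    const char *type;
    uint32_t features;   /* backend capabilities */
    bool is_mux;         /* shared by several frontends */
};

/* Backend-side question: can the backend run in another context? */
static bool backend_can_switch_context(const struct chardev_model *chr)
{
    return chr->features & FEATURE_GCONTEXT;
}

/* OOB needs both: a switchable backend and no other mux frontends. */
static bool oob_allowed(const struct chardev_model *chr)
{
    return !chr->is_mux && backend_can_switch_context(chr);
}
```

Even if a mux backend gains the context-switching capability, as the mux handler rework would allow, the `is_mux` check still rejects it, which is why the two conditions are kept separate.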


> > Signed-off-by: Marc-André Lureau 
> > ---
> >  monitor.c | 6 --
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/monitor.c b/monitor.c
> > index 79afe99079..25cf4223e8 100644
> > --- a/monitor.c
> > +++ b/monitor.c
> > @@ -4562,9 +4562,11 @@ void monitor_init(Chardev *chr, int flags)
> >  bool use_oob = flags & MONITOR_USE_OOB;
> >
> >  if (use_oob) {
> > -if (CHARDEV_IS_MUX(chr)) {
> > +if (CHARDEV_IS_MUX(chr) ||
> > +!qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_GCONTEXT)) {
> >  error_report("Monitor out-of-band is not supported with "
> > - "MUX typed chardev backend");
> > + "%s typed chardev backend",
> > + object_get_typename(OBJECT(chr)));
> >  exit(1);
> >  }
> >  if (use_readline) {

Re: [Qemu-devel] [PATCH for-4.0 v4 5/7] colo: check chardev can switch context

2018-12-05 Thread Zhang Chen

On Thu, Dec 6, 2018 at 4:38 AM Marc-André Lureau <
marcandre.lur...@redhat.com> wrote:

> COLO uses a worker context (iothread) to drive the chardev. Not all
> backends are able to switch the context, so report an error in
> this case.
>
> Signed-off-by: Marc-André Lureau 
>

Reviewed-by: Zhang Chen 


> ---
>  net/colo-compare.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/net/colo-compare.c b/net/colo-compare.c
> index a39191d522..9156ab3349 100644
> --- a/net/colo-compare.c
> +++ b/net/colo-compare.c
> @@ -957,6 +957,12 @@ static int find_and_check_chardev(Chardev **chr,
>          return 1;
>      }
>
> +    if (!qemu_chr_has_feature(*chr, QEMU_CHAR_FEATURE_GCONTEXT)) {
> +        error_setg(errp, "chardev \"%s\" cannot switch context",
> +                   chr_name);
> +        return 1;
> +    }
> +
>      return 0;
>  }
>
> --
> 2.20.0.rc1
>
>

Re: [Qemu-devel] [PATCH v6 04/37] ppc/xive: introduce the XiveRouter model

2018-12-05 Thread Cédric Le Goater

On 12/6/18 4:41 AM, David Gibson wrote:
> On Thu, Dec 06, 2018 at 12:22:18AM +0100, Cédric Le Goater wrote:
>> The XiveRouter models the second sub-engine of the XIVE architecture :
>> the Interrupt Virtualization Routing Engine (IVRE).
>>
>> The IVRE handles event notifications of the IVSE and performs the
>> interrupt routing process. For this purpose, it uses a set of tables
>> stored in system memory, the first of which is the Event Assignment
>> Structure (EAS) table (EAT).
>>
>> The EAT associates an interrupt source number with an Event Notification
>> Descriptor (END) which will be used in a second phase of the routing
>> process to identify a Notification Virtual Target.
>>
>> The XiveRouter is an abstract class which must be subclassed to
>> provide storage for the EAT and the other upcoming tables.
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  include/hw/ppc/xive.h  | 31 
>>  include/hw/ppc/xive_regs.h | 50 +
>>  hw/intc/xive.c | 76 ++
>>  3 files changed, 157 insertions(+)
>>  create mode 100644 include/hw/ppc/xive_regs.h
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 6770cffec67d..57ec9f84f527 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -141,6 +141,8 @@
>>  #define PPC_XIVE_H
>>  
>>  #include "hw/qdev-core.h"
>> +#include "hw/sysbus.h"
>> +#include "hw/ppc/xive_regs.h"
>>  
>>  /*
>>   * XIVE Fabric (Interface between Source and Router)
>> @@ -297,4 +299,33 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>  }
>>  }
>>  
>> +/*
>> + * XIVE Router
>> + */
>> +
>> +typedef struct XiveRouter {
>> +    SysBusDevice    parent;
> 
> I thought the plan was to make XiveRouter as well as XiveSource a
> TYPE_DEVICE descendant rather than a SysBusDevice?

We started talking about that, indeed, but then:

https://lists.gnu.org/archive/html/qemu-devel/2018-11/msg06407.html

I thought we concluded that it was going to get too complex.

Also, sPAPRXive is a direct descendant of XiveRouter and we want sPAPRXive 
on SysBus.

C.

> 
>> +} XiveRouter;
>> +
>> +#define TYPE_XIVE_ROUTER "xive-router"
>> +#define XIVE_ROUTER(obj)                                    \
>> +    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
>> +#define XIVE_ROUTER_CLASS(klass)                            \
>> +    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
>> +#define XIVE_ROUTER_GET_CLASS(obj)                          \
>> +    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
>> +
>> +typedef struct XiveRouterClass {
>> +    SysBusDeviceClass parent;
>> +
>> +    /* XIVE table accessors */
>> +    int (*get_eas)(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
>> +                   XiveEAS *eas);
>> +} XiveRouterClass;
>> +
>> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>> +
>> +int xive_router_get_eas(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
>> +                        XiveEAS *eas);
>> +
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> new file mode 100644
>> index ..15f2470ed9cc
>> --- /dev/null
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -0,0 +1,50 @@
>> +/*
>> + * QEMU PowerPC XIVE internal structure definitions
>> + *
>> + *
>> + * The XIVE structures are accessed by the HW and their format is
>> + * architected to be big-endian. Some macros are provided to ease
>> + * access to the different fields.
>> + *
>> + *
>> + * Copyright (c) 2016-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_XIVE_REGS_H
>> +#define PPC_XIVE_REGS_H
>> +
>> +/*
>> + * Interrupt source number encoding on PowerBUS
>> + */
>> +#define XIVE_SRCNO_BLOCK(srcno) (((srcno) >> 28) & 0xf)
>> +#define XIVE_SRCNO_INDEX(srcno) ((srcno) & 0x0fff)
>> +#define XIVE_SRCNO(blk, idx)    ((uint32_t)(blk) << 28 | (idx))
>> +
>> +/* EAS (Event Assignment Structure)
>> + *
>> + * One per interrupt source. Targets an interrupt to a given Event
>> + * Notification Descriptor (END) and provides the corresponding
>> + * logical interrupt number (END data)
>> + */
>> +typedef struct XiveEAS {
>> +    /* Use a single 64-bit definition to make it easier to
>> +     * perform atomic updates
>> +     */
>> +    uint64_t        w;
>> +#define EAS_VALID       PPC_BIT(0)
>> +#define EAS_END_BLOCK   PPC_BITMASK(4, 7)    /* Destination END block# */
>> +#define EAS_END_INDEX   PPC_BITMASK(8, 31)   /* Destination END index */
>> +#define EAS_MASKED      PPC_BIT(32)          /* Masked */
>> +#define EAS_END_DATA    PPC_BITMASK(33, 63)  /* Data written to the END */
>> +} XiveEAS;
>> +
>> +#define xive_eas_is_valid(eas)   (be64_to_cpu((eas)->w) & EAS_VALID)
>> +#define
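To make the IBM bit numbering of the EAS word concrete, here is a minimal standalone sketch that decodes and builds an EAS using the field masks from the patch. It is not the XIVE model's code: `getfield64()`, `setfield64()`, `EasFields`, `eas_decode()` and `eas_encode()` are hypothetical helpers, `__builtin_ctzll` (a GCC/Clang builtin) stands in for QEMU's `ctz64()`, and the word is assumed to be already converted to host byte order (the real table is big-endian and is read through `be64_to_cpu()`).

```c
#include <stdbool.h>
#include <stdint.h>

/* IBM (big-endian) bit numbering: bit 0 is the most significant bit. */
#define PPC_BIT(bit)        (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be) ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* Field layout copied from the XiveEAS definition above. */
#define EAS_VALID      PPC_BIT(0)
#define EAS_END_BLOCK  PPC_BITMASK(4, 7)
#define EAS_END_INDEX  PPC_BITMASK(8, 31)
#define EAS_MASKED     PPC_BIT(32)
#define EAS_END_DATA   PPC_BITMASK(33, 63)

/* Hypothetical helpers: read/write the field selected by a contiguous mask. */
static inline uint64_t getfield64(uint64_t mask, uint64_t word)
{
    return (word & mask) >> __builtin_ctzll(mask);
}

static inline uint64_t setfield64(uint64_t mask, uint64_t word, uint64_t value)
{
    return (word & ~mask) | ((value << __builtin_ctzll(mask)) & mask);
}

typedef struct {
    bool valid;
    bool masked;
    uint8_t end_blk;    /* destination END block */
    uint32_t end_idx;   /* destination END index */
    uint32_t end_data;  /* data written to the END (logical interrupt number) */
} EasFields;

/* Decode a host-endian EAS word into its fields. */
static EasFields eas_decode(uint64_t w)
{
    EasFields f;

    f.valid = (w & EAS_VALID) != 0;
    f.masked = (w & EAS_MASKED) != 0;
    f.end_blk = (uint8_t)getfield64(EAS_END_BLOCK, w);
    f.end_idx = (uint32_t)getfield64(EAS_END_INDEX, w);
    f.end_data = (uint32_t)getfield64(EAS_END_DATA, w);
    return f;
}

/* Build a valid, unmasked EAS targeting END blk/idx with the given data. */
static uint64_t eas_encode(uint8_t blk, uint32_t idx, uint32_t data)
{
    uint64_t w = EAS_VALID;

    w = setfield64(EAS_END_BLOCK, w, blk);
    w = setfield64(EAS_END_INDEX, w, idx);
    w = setfield64(EAS_END_DATA, w, data);
    return w;
}
```

For instance, `PPC_BITMASK(4, 7)` evaluates to `0x0F00000000000000`, so the END block sits in the second-highest nibble of the word.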

Re: [Qemu-devel] [PATCH for-4.0 v4 4/7] monitor: check if chardev can switch gcontext for OOB

2018-12-05 Thread Markus Armbruster

One more question...

Marc-André Lureau  writes:

> Not all backends are able to switch gcontext. Those backends cannot
> drive an OOB monitor (the monitor would then be blocking on the main
> thread).
>
> For example, ringbuf, spice, or more esoteric input chardevs like
> braille or MUX.

These chardevs don't provide QEMU_CHAR_FEATURE_GCONTEXT.

> We currently forbid MUX because not all frontends are ready to run
> outside the main loop. Extend to add a context-switching feature check.

Why check CHARDEV_IS_MUX() when chardev-mux already fails the
qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_GCONTEXT) check?

> Signed-off-by: Marc-André Lureau 
> ---
>  monitor.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/monitor.c b/monitor.c
> index 79afe99079..25cf4223e8 100644
> --- a/monitor.c
> +++ b/monitor.c
> @@ -4562,9 +4562,11 @@ void monitor_init(Chardev *chr, int flags)
>      bool use_oob = flags & MONITOR_USE_OOB;
>  
>      if (use_oob) {
> -        if (CHARDEV_IS_MUX(chr)) {
> +        if (CHARDEV_IS_MUX(chr) ||
> +            !qemu_chr_has_feature(chr, QEMU_CHAR_FEATURE_GCONTEXT)) {
>              error_report("Monitor out-of-band is not supported with "
> -                         "MUX typed chardev backend");
> +                         "%s typed chardev backend",
> +                         object_get_typename(OBJECT(chr)));
>              exit(1);
>          }
>          if (use_readline) {
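The combined condition can be modeled outside QEMU as two independent predicates: a frontend-side restriction (mux) and a backend-side capability (can switch context). The sketch below is only an illustration under that reading; `FakeChardev`, `FEATURE_GCONTEXT`, `has_feature()` and `oob_supported()` are hypothetical stand-ins for `Chardev`, `QEMU_CHAR_FEATURE_GCONTEXT` and `qemu_chr_has_feature()`, and the error text mirrors the patch's message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for QEMU's Chardev and its feature bitmap. */
enum { FEATURE_GCONTEXT = 1u << 0 };

typedef struct {
    const char *type_name;  /* e.g. "chardev-mux", "chardev-socket" */
    bool is_mux;
    unsigned features;
} FakeChardev;

static bool has_feature(const FakeChardev *chr, unsigned feature)
{
    return (chr->features & feature) != 0;
}

/*
 * Reject mux even if it advertises the feature (its other frontends may not
 * run outside the main loop), and reject any backend unable to switch
 * GMainContext. On failure, write an error naming the backend type.
 */
static bool oob_supported(const FakeChardev *chr, char *err, size_t errlen)
{
    if (chr->is_mux || !has_feature(chr, FEATURE_GCONTEXT)) {
        snprintf(err, errlen,
                 "Monitor out-of-band is not supported with "
                 "%s typed chardev backend", chr->type_name);
        return false;
    }
    return true;
}
```

Keeping the mux test separate means the check still holds once mux starts advertising the feature.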

Re: [Qemu-devel] [PATCH v6 00/37] ppc: support for the XIVE interrupt controller (POWER9)

2018-12-05 Thread Cédric Le Goater

Hello,

> Your patch has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> Checking PATCH 25/37: spapr/xive: add state synchronization with KVM...
> Checking PATCH 26/37: spapr/xive: introduce a VM state change handler...
> ERROR: spaces required around that '*' (ctx:WxV)
> #38: FILE: hw/intc/spapr_xive_kvm.c:341:
> + static void kvmppc_xive_sync_all(sPAPRXive *xive, Error **errp)
>  ^
> 
> total: 1 errors, 0 warnings, 135 lines checked

This looks like a false positive.

C.

Re: [Qemu-devel] [PATCH v6 03/37] ppc/xive: introduce the XiveNotifier interface

2018-12-05 Thread Cédric Le Goater

On 12/6/18 4:25 AM, David Gibson wrote:
> On Thu, Dec 06, 2018 at 12:22:17AM +0100, Cédric Le Goater wrote:
>> The XiveNotifier offers a simple interface between the XiveSource
>> object and the main interrupt controller of the machine. It will
>> forward event notifications to the XIVE Interrupt Virtualization
>> Routing Engine (IVRE).
>>
>> Signed-off-by: Cédric Le Goater 
>> ---
>>  include/hw/ppc/xive.h | 23 +++
>>  hw/intc/xive.c| 25 +
>>  2 files changed, 48 insertions(+)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 7cebc32eba4c..6770cffec67d 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -142,6 +142,27 @@
>>  
>>  #include "hw/qdev-core.h"
>>  
>> +/*
>> + * XIVE Fabric (Interface between Source and Router)
>> + */
>> +
>> +typedef struct XiveNotifier {
>> +Object parent;
>> +} XiveNotifier;
>> +
>> +#define TYPE_XIVE_NOTIFIER "xive-fabric"
> 
> I'm applying this, but changing the string here from "xive-fabric" to
> "xive-notifier".

Ah yes. My sed command missed that.

Thanks,

C.

> 
> 
>> +#define XIVE_NOTIFIER(obj) \
>> +OBJECT_CHECK(XiveNotifier, (obj), TYPE_XIVE_NOTIFIER)
>> +#define XIVE_NOTIFIER_CLASS(klass) \
>> +OBJECT_CLASS_CHECK(XiveNotifierClass, (klass), TYPE_XIVE_NOTIFIER)
>> +#define XIVE_NOTIFIER_GET_CLASS(obj)   \
>> +OBJECT_GET_CLASS(XiveNotifierClass, (obj), TYPE_XIVE_NOTIFIER)
>> +
>> +typedef struct XiveNotifierClass {
>> +InterfaceClass parent;
>> +void (*notify)(XiveNotifier *xn, uint32_t lisn);
>> +} XiveNotifierClass;
>> +
>>  /*
>>   * XIVE Interrupt Source
>>   */
>> @@ -171,6 +192,8 @@ typedef struct XiveSource {
>>      uint64_t        esb_flags;
>>      uint32_t        esb_shift;
>>      MemoryRegion    esb_mmio;
>> +
>> +    XiveNotifier    *xive;
>>  } XiveSource;
>>  
>>  /*
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 11c7aac962de..79238eb57fae 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -155,7 +155,11 @@ static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>>   */
>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>>  {
>> +XiveNotifierClass *xnc = XIVE_NOTIFIER_GET_CLASS(xsrc->xive);
>>  
>> +if (xnc->notify) {
>> +xnc->notify(xsrc->xive, srcno);
>> +}
>>  }
>>  
>>  /*
>> @@ -362,6 +366,17 @@ static void xive_source_reset(void *dev)
>>  static void xive_source_realize(DeviceState *dev, Error **errp)
>>  {
>>  XiveSource *xsrc = XIVE_SOURCE(dev);
>> +    Object *obj;
>> +    Error *local_err = NULL;
>> +
>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
>> +    if (!obj) {
>> +        error_propagate(errp, local_err);
>> +        error_prepend(errp, "required link 'xive' not found: ");
>> +        return;
>> +    }
>> +
>> +    xsrc->xive = XIVE_NOTIFIER(obj);
>>  
>>  if (!xsrc->nr_irqs) {
>>  error_setg(errp, "Number of interrupt needs to be greater than 0");
>> @@ -428,9 +443,19 @@ static const TypeInfo xive_source_info = {
>>  .class_init= xive_source_class_init,
>>  };
>>  
>> +/*
>> + * XIVE Fabric
>> + */
>> +static const TypeInfo xive_fabric_info = {
>> +.name = TYPE_XIVE_NOTIFIER,
>> +.parent = TYPE_INTERFACE,
>> +.class_size = sizeof(XiveNotifierClass),
>> +};
>> +
>>  static void xive_register_types(void)
>>  {
>>      type_register_static(&xive_source_info);
>> +    type_register_static(&xive_fabric_info);
>>  }
>>  
>>  type_init(xive_register_types)
>
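The notifier is essentially a class with one optional hook that the source calls through the link set at realize time. Here is a minimal plain-C model of that dispatch, assuming no QOM at all: `Notifier`, `NotifierClass`, `Source`, `Router` and `router_notify()` are hypothetical names, and embedding the base as the first member is just the usual C idiom standing in for QOM's casts.

```c
#include <stdint.h>

/* Plain-C model: a "class" holding an optional notify hook. */
typedef struct Notifier Notifier;

typedef struct {
    void (*notify)(Notifier *xn, uint32_t lisn);
} NotifierClass;

struct Notifier {
    const NotifierClass *klass;
};

typedef struct {
    Notifier *xive;  /* resolved from the "xive" link at realize time */
} Source;

/* Forward an event if the target class implements the hook. */
static void source_notify(Source *xsrc, uint32_t srcno)
{
    const NotifierClass *xnc = xsrc->xive->klass;

    if (xnc->notify) {
        xnc->notify(xsrc->xive, srcno);
    }
}

/* A hypothetical router implementation that records the last LISN. */
typedef struct {
    Notifier parent;  /* first member, so a Notifier * can be cast back */
    uint32_t last_lisn;
    unsigned count;
} Router;

static void router_notify(Notifier *xn, uint32_t lisn)
{
    Router *r = (Router *)xn;

    r->last_lisn = lisn;
    r->count++;
}

static const NotifierClass router_class = { .notify = router_notify };
```

The source knows only the interface, so any machine-level controller implementing `notify` can sit behind the link.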

Re: [Qemu-devel] [RFC 0/3] QEMU changes to do PVH boot

2018-12-05 Thread Maran Wilson


On 12/5/2018 2:37 PM, Liam Merwick wrote:

For certain applications it is desirable to rapidly boot a KVM virtual
machine. In cases where legacy hardware and software support within the
guest is not needed, QEMU should be able to boot directly into the
uncompressed Linux kernel binary with minimal firmware involvement.

There already exists an ABI to allow this for Xen PVH guests and the ABI
is supported by Linux and FreeBSD:

https://xenbits.xen.org/docs/unstable/misc/pvh.html

Details on the Linux changes: https://lkml.org/lkml/2018/4/16/1002


In case anyone wants to grab the patches and give it a try, I've just 
posted an updated version of the Linux patches rebased to the latest 
mainline code:


https://lkml.org/lkml/2018/12/6/26

No functional changes from before, just some minor conflict resolution 
as part of the rebase.


Thanks,
-Maran


qboot patches: http://patchwork.ozlabs.org/project/qemu-devel/list/?series=80020

This patch series provides QEMU support to read the ELF header of an
uncompressed kernel binary and get the 32-bit PVH kernel entry point
from an ELF Note.  This is called when initialising the machine state
in pc_memory_init().  Later on in load_linux() if the kernel entry
address is present, the uncompressed kernel binary (ELF) is loaded
and qboot does further initialisation of the guest (e820, etc.) and
jumps to the kernel entry address and boots the guest.
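The ELF-note lookup described above can be sketched as a walk over the raw contents of a `PT_NOTE` segment. This is a standalone illustration, not the code in the series' `hw/i386/pc.c`: `find_pvh_entry()` and `align4()` are hypothetical helpers, the note type value 18 comes from Xen's public `elfnote.h`, and note fields are assumed to be in host byte order (an x86 guest image on a little-endian host).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Note type used by PVH-capable kernels (value from Xen's public elfnote.h). */
#define XEN_ELFNOTE_PHYS32_ENTRY 18

/* ELF note name and descriptor fields are padded to 4-byte boundaries. */
static size_t align4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/* Return the 32-bit entry point from a "Xen" PHYS32_ENTRY note, if any. */
static bool find_pvh_entry(const uint8_t *buf, size_t len, uint32_t *entry)
{
    size_t off = 0;

    while (len - off >= 12) {          /* room for namesz, descsz, type */
        uint32_t namesz, descsz, type;

        memcpy(&namesz, buf + off, 4);
        memcpy(&descsz, buf + off + 4, 4);
        memcpy(&type, buf + off + 8, 4);
        off += 12;

        size_t name_len = align4(namesz);
        size_t desc_len = align4(descsz);
        if (name_len > len - off || desc_len > len - off - name_len) {
            return false;              /* truncated or malformed note */
        }

        if (type == XEN_ELFNOTE_PHYS32_ENTRY && namesz == 4 &&
            memcmp(buf + off, "Xen", 4) == 0 && descsz >= 4) {
            /* Descriptor may be 4 or 8 bytes; on a little-endian host the
             * first 4 bytes hold the low 32 bits. */
            memcpy(entry, buf + off + name_len, 4);
            return true;
        }
        off += name_len + desc_len;
    }
    return false;
}
```

Non-matching notes (for example a "GNU" build-id note) are skipped by advancing past their padded name and descriptor.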


Using the method/scripts documented by the NEMU team at

https://github.com/intel/nemu/wiki/Measuring-Boot-Latency
https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg00200.html

below are some timings (in ms) measured with vmlinux and bzImage from the
same build. Time to kernel start is almost halved (95ms -> 48ms).

QEMU + qboot + vmlinux (PVH + 4.20-rc4)
  qemu_init_end: 41.550521
  fw_start: 41.667139 (+0.116618)
  fw_do_boot: 47.448495 (+5.781356)
  linux_startup_64: 47.720785 (+0.27229)
  linux_start_kernel: 48.399541 (+0.678756)
  linux_start_user: 296.952056 (+248.552515)

QEMU + qboot + bzImage:
  qemu_init_end: 29.209276
  fw_start: 29.317342 (+0.108066)
  linux_start_boot: 36.679362 (+7.36202)
  linux_startup_64: 94.531349 (+57.851987)
  linux_start_kernel: 94.900913 (+0.369564)
  linux_start_user: 401.060971 (+306.160058)

QEMU + bzImage:
  qemu_init_end: 30.424430
  linux_startup_64: 893.770334 (+863.345904)
  linux_start_kernel: 894.17049 (+0.400156)
  linux_start_user: 1208.679768 (+314.509278)


Liam Merwick (3):
   pvh: Add x86/HVM direct boot ABI header file
   pc: Read PVH entry point from ELF note in kernel binary
   pvh: Boot uncompressed kernel using direct boot ABI

  hw/i386/pc.c| 344 +++-
  include/elf.h   |  10 ++
  include/hw/xen/start_info.h | 146 +++
  3 files changed, 499 insertions(+), 1 deletion(-)
  create mode 100644 include/hw/xen/start_info.h

Re: [Qemu-devel] [PATCH for-4.0 v4 5/7] colo: check chardev can switch context

2018-12-05 Thread Li Zhijian





On 12/06/2018 04:37 AM, Marc-André Lureau wrote:

COLO uses a worker context (iothread) to drive the chardev. Not all
backends are able to switch the context; report an error in that
case.

Signed-off-by: Marc-André Lureau 


Reviewed-by: Li Zhijian 



---
  net/colo-compare.c | 6 ++++++
  1 file changed, 6 insertions(+)

diff --git a/net/colo-compare.c b/net/colo-compare.c
index a39191d522..9156ab3349 100644
--- a/net/colo-compare.c
+++ b/net/colo-compare.c
@@ -957,6 +957,12 @@ static int find_and_check_chardev(Chardev **chr,
         return 1;
     }
 
+    if (!qemu_chr_has_feature(*chr, QEMU_CHAR_FEATURE_GCONTEXT)) {
+        error_setg(errp, "chardev \"%s\" cannot switch context",
+                   chr_name);
+        return 1;
+    }
+
     return 0;
 }

Re: [Qemu-devel] [PATCH for-4.0 v4 5/7] colo: check chardev can switch context

2018-12-05 Thread Markus Armbruster

I'd like an Acked-by or Reviewed-by from Zhang Chen or Li Zhijian.

Marc-André Lureau  writes:

> COLO uses a worker context (iothread) to drive the chardev. Not all
> backends are able to switch the context; report an error in that
> case.
>
> Signed-off-by: Marc-André Lureau 
> ---
>  net/colo-compare.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/net/colo-compare.c b/net/colo-compare.c
> index a39191d522..9156ab3349 100644
> --- a/net/colo-compare.c
> +++ b/net/colo-compare.c
> @@ -957,6 +957,12 @@ static int find_and_check_chardev(Chardev **chr,
>          return 1;
>      }
>  
> +    if (!qemu_chr_has_feature(*chr, QEMU_CHAR_FEATURE_GCONTEXT)) {
> +        error_setg(errp, "chardev \"%s\" cannot switch context",
> +                   chr_name);
> +        return 1;
> +    }
> +
>      return 0;
>  }

Re: [Qemu-devel] [PATCH for-4.0 v4 4/7] monitor: check if chardev can switch gcontext for OOB

2018-12-05 Thread Markus Armbruster

Marc-André Lureau  writes:

> Not all backends are able to switch gcontext. Those backends cannot
> drive an OOB monitor (the monitor would then be blocking on the main
> thread).
>
> For example, ringbuf, spice, or more esoteric input chardevs like
> braille or MUX.
>
> We currently forbid MUX because not all frontends are ready to run
> outside the main loop. Extend to add a context-switching feature check.
>
> Signed-off-by: Marc-André Lureau 

Reviewed-by: Markus Armbruster

Re: [Qemu-devel] [PATCH for-4.0 v4 3/7] char: add a QEMU_CHAR_FEATURE_GCONTEXT flag

2018-12-05 Thread Markus Armbruster

Marc-André Lureau  writes:

> QEMU_CHAR_FEATURE_GCONTEXT declares the character device can switch
> GMainContext.
>
> Assert we don't switch context when the character device doesn't
> provide this feature.  Character device users must not violate this
> restriction.  In particular, user configurations that violate it
> must be rejected.
>
> Existing frontends that rely on context switching would now assert() if
> the backend doesn't allow it (instead of silently producing undesired
> events in the default context). Following patches improve the
> situation by reporting an error earlier instead, on the frontend side.
>
> Signed-off-by: Marc-André Lureau 

Reviewed-by: Markus Armbruster
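The division of labour in this patch (a backend feature bit, an assertion on the backend side, and an early error on the frontend side) can be sketched as follows. This is a hypothetical model, not QEMU's chardev code: `Backend`, `CHAR_FEATURE_GCONTEXT`, `frontend_check()` and `backend_set_context()` are illustrative names, and a `NULL` context is assumed to mean "default main context".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CHAR_FEATURE_GCONTEXT = 1u << 0 };  /* hypothetical feature bit */

typedef struct {
    unsigned features;
    const void *gcontext;  /* context events are dispatched in; NULL = default */
} Backend;

static bool backend_has_feature(const Backend *b, unsigned feature)
{
    return (b->features & feature) != 0;
}

/* Frontend side: reject the configuration up front with an error code. */
static int frontend_check(const Backend *b)
{
    return backend_has_feature(b, CHAR_FEATURE_GCONTEXT) ? 0 : -1;
}

/*
 * Backend side: moving away from the default context without the feature
 * is a programming error, so it is asserted rather than reported.
 */
static void backend_set_context(Backend *b, const void *ctx)
{
    assert(ctx == NULL || backend_has_feature(b, CHAR_FEATURE_GCONTEXT));
    b->gcontext = ctx;
}
```

With the frontend check in place, the assertion should only fire if a caller skips the check.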

Re: [Qemu-devel] [PATCH v6 07/37] ppc/xive: introduce the XIVE interrupt thread context

2018-12-05 Thread David Gibson

On Thu, Dec 06, 2018 at 12:22:21AM +0100, Cédric Le Goater wrote:
> Each POWER9 processor chip has a XIVE presenter that can generate four
> different exceptions to its threads:
> 
>   - hypervisor exception,
>   - O/S exception
>   - Event-Based Branch (EBB)
>   - msgsnd (doorbell).
> 
> Each exception has a state independent from the others called a Thread
> Interrupt Management context. This context is a set of registers which
> lets the thread handle priority management and interrupt acknowledgment
> among other things. The most important ones being :
> 
>   - Interrupt Priority Register  (PIPR)
>   - Interrupt Pending Buffer (IPB)
>   - Current Processor Priority   (CPPR)
>   - Notification Source Register (NSR)
> 
> These registers are accessible through a specific MMIO region, called
> the Thread Interrupt Management Area (TIMA), four aligned pages, each
> exposing a different view of the registers. The first page (page address
> ending in 0b00) gives access to the entire context and is reserved for
> the ring 0 view for the physical thread context. The second (page
> address ending in 0b01) is for the hypervisor, ring 1 view. The third
> (page address ending in 0b10) is for the operating system, ring 2
> view. The fourth (page address ending in 0b11) is for user level, ring
> 3 view.
> 
> The thread interrupt context is modeled with a XiveTCTX object
> containing the values of the different exception registers. The TIMA
> region is mapped at the same address for each CPU.
> 
> Signed-off-by: Cédric Le Goater 

Reviewed-by: David Gibson 

> ---
>  include/hw/ppc/xive.h  |  44 
>  include/hw/ppc/xive_regs.h |  82 
>  hw/intc/xive.c | 419 +
>  3 files changed, 545 insertions(+)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index d67b0785df7c..74b547707b17 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -368,4 +368,48 @@ typedef struct XiveENDSource {
>  void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>  void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width, Monitor *mon);
>  
> +/*
> + * XIVE Thread interrupt Management (TM) context
> + */
> +
> +#define TYPE_XIVE_TCTX "xive-tctx"
> +#define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
> +
> +/*
> + * XIVE Thread interrupt Management register rings :
> + *
> + *   QW-0  User       event-based exception state
> + *   QW-1  O/S        OS context for priority management, interrupt acks
> + *   QW-2  Pool       hypervisor pool context for virtual processors dispatched
> + *   QW-3  Physical   physical thread context and security context
> + */
> +#define XIVE_TM_RING_COUNT  4
> +#define XIVE_TM_RING_SIZE   0x10
> +
> +typedef struct XiveTCTX {
> +    DeviceState parent_obj;
> +
> +    CPUState    *cs;
> +    qemu_irq    output;
> +
> +    uint8_t     regs[XIVE_TM_RING_COUNT * XIVE_TM_RING_SIZE];
> +} XiveTCTX;
> +
> +/*
> + * XIVE Thread Interrupt Management Area (TIMA)
> + *
> + * This region gives access to the registers of the thread interrupt
> + * management context. It is four pages wide, each page providing a
> + * different view of the registers. The page with the lower offset is
> + * the most privileged and gives access to the entire context.
> + */
> +#define XIVE_TM_HW_PAGE 0x0
> +#define XIVE_TM_HV_PAGE 0x1
> +#define XIVE_TM_OS_PAGE 0x2
> +#define XIVE_TM_USER_PAGE   0x3
> +
> +extern const MemoryRegionOps xive_tm_ops;
> +
> +void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
> +
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> index 3c0ebad18b69..ede3d04c5eda 100644
> --- a/include/hw/ppc/xive_regs.h
> +++ b/include/hw/ppc/xive_regs.h
> @@ -23,6 +23,88 @@
>  #define XIVE_SRCNO_INDEX(srcno) ((srcno) & 0x0fff)
>  #define XIVE_SRCNO(blk, idx)    ((uint32_t)(blk) << 28 | (idx))
>  
> +#define TM_SHIFT                16
> +
> +/* TM register offsets */
> +#define TM_QW0_USER             0x000 /* All rings */
> +#define TM_QW1_OS               0x010 /* Ring 0..2 */
> +#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
> +#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
> +
> +/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
> +#define TM_NSR                  0x0  /*  +   +   -   +  */
> +#define TM_CPPR                 0x1  /*  -   +   -   +  */
> +#define TM_IPB                  0x2  /*  -   +   +   +  */
> +#define TM_LSMFB                0x3  /*  -   +   +   +  */
> +#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
> +#define TM_INC                  0x5  /*  -   +   -   +  */
> +#define TM_AGE                  0x6  /*  -   +   -   +  */
> +#define TM_PIPR                 0x7  /*  -   +   -   +  */
> +
> +#define TM_WORD0                0x0
> +#define TM_WORD1                0x4
> +
> +/*
> + * QW word 2 contains the valid bit at the top and other

Re: [Qemu-devel] [PATCH for-4.0 1/7] configure: Add a test for the minimum compiler version

2018-12-05 Thread Thomas Huth

On 2018-12-05 18:30, Philippe Mathieu-Daudé wrote:
> On 12/3/18 3:05 PM, Thomas Huth wrote:
>> So far we only had implicit requirements for the minimum compiler version,
>> e.g. we require at least GCC 4.1 for the support of atomics. However,
>> such old compiler versions are not tested anymore by the developers, so
>> they are not really supported anymore. Since we recently declared explicitly
>> what platforms we intend to support, we can also get more explicit on the
>> compiler version now. The supported distributions use the following versions
>> of GCC:
>>
>>   RHEL-7: 4.8.5
>>   Debian (Stretch): 6.3.0
>>   Debian (Jessie): 4.8.4
>>   OpenBSD (ports): 4.9.4
>>   FreeBSD (ports): 8.2.0
>>   OpenSUSE Leap 15: 7.3.1
>>   Ubuntu (Xenial): 5.3.1
>>   macOS (Homebrew): 8.2.0
> 
> I'd like to track this in a machine-parsable format, but I'm not sure
> where it fits best. However, I'd prefer the git repo, with the wiki
> pointing to the git repo.

I don't think that it makes sense to put fixed version numbers into the
git or wiki - the information will expire soon, and it is additional
maintenance to keep them up to date. We already have the generic
description here:

https://qemu.weilnetz.de/doc/qemu-doc.html#Supported-build-platforms

So you just have to follow these instructions to get to the supported
versions.

 Thomas
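The configure test from the patch is not reproduced in this thread. As a rough illustration of what a minimum-compiler check can look like, here is a compile-time probe that assumes the floor is GCC 4.8, the oldest release in the list above (RHEL-7 4.8.5, Debian Jessie 4.8.4); that threshold and `gcc_version()` are assumptions for this sketch. Clang also defines `__GNUC__` (reporting 4.2 compatibility), so it is excluded explicitly.

```c
/* Hypothetical floor: GCC 4.8, the oldest version in the distribution list.
 * clang defines __GNUC__ as 4 and __GNUC_MINOR__ as 2, so skip it here. */
#if defined(__GNUC__) && !defined(__clang__)
# if __GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 8)
#  error "GCC 4.8 or newer is required"
# endif
#endif

/* Report the detected GCC version as major * 100 + minor, or 0 if not GCC. */
static int gcc_version(void)
{
#if defined(__GNUC__) && !defined(__clang__)
    return __GNUC__ * 100 + __GNUC_MINOR__;
#else
    return 0;
#endif
}
```

A configure script would typically try to compile such a snippet and abort when it fails.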

Re: [Qemu-devel] [PATCH v6 03/37] ppc/xive: introduce the XiveNotifier interface

2018-12-05 Thread David Gibson

On Thu, Dec 06, 2018 at 12:22:17AM +0100, Cédric Le Goater wrote:
> The XiveNotifier offers a simple interface between the XiveSource
> object and the main interrupt controller of the machine. It will
> forward event notifications to the XIVE Interrupt Virtualization
> Routing Engine (IVRE).
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  include/hw/ppc/xive.h | 23 +++
>  hw/intc/xive.c| 25 +
>  2 files changed, 48 insertions(+)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 7cebc32eba4c..6770cffec67d 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -142,6 +142,27 @@
>  
>  #include "hw/qdev-core.h"
>  
> +/*
> + * XIVE Fabric (Interface between Source and Router)
> + */
> +
> +typedef struct XiveNotifier {
> +Object parent;
> +} XiveNotifier;
> +
> +#define TYPE_XIVE_NOTIFIER "xive-fabric"

I'm applying this, but changing the string here from "xive-fabric" to
"xive-notifier".


> +#define XIVE_NOTIFIER(obj) \
> +OBJECT_CHECK(XiveNotifier, (obj), TYPE_XIVE_NOTIFIER)
> +#define XIVE_NOTIFIER_CLASS(klass) \
> +OBJECT_CLASS_CHECK(XiveNotifierClass, (klass), TYPE_XIVE_NOTIFIER)
> +#define XIVE_NOTIFIER_GET_CLASS(obj)   \
> +OBJECT_GET_CLASS(XiveNotifierClass, (obj), TYPE_XIVE_NOTIFIER)
> +
> +typedef struct XiveNotifierClass {
> +InterfaceClass parent;
> +void (*notify)(XiveNotifier *xn, uint32_t lisn);
> +} XiveNotifierClass;
> +
>  /*
>   * XIVE Interrupt Source
>   */
> @@ -171,6 +192,8 @@ typedef struct XiveSource {
>      uint64_t        esb_flags;
>      uint32_t        esb_shift;
>      MemoryRegion    esb_mmio;
> +
> +    XiveNotifier    *xive;
>  } XiveSource;
>  
>  /*
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 11c7aac962de..79238eb57fae 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -155,7 +155,11 @@ static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>   */
>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>  {
> +XiveNotifierClass *xnc = XIVE_NOTIFIER_GET_CLASS(xsrc->xive);
>  
> +if (xnc->notify) {
> +xnc->notify(xsrc->xive, srcno);
> +}
>  }
>  
>  /*
> @@ -362,6 +366,17 @@ static void xive_source_reset(void *dev)
>  static void xive_source_realize(DeviceState *dev, Error **errp)
>  {
>  XiveSource *xsrc = XIVE_SOURCE(dev);
> +    Object *obj;
> +    Error *local_err = NULL;
> +
> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
> +    if (!obj) {
> +        error_propagate(errp, local_err);
> +        error_prepend(errp, "required link 'xive' not found: ");
> +        return;
> +    }
> +
> +    xsrc->xive = XIVE_NOTIFIER(obj);
>  
>  if (!xsrc->nr_irqs) {
>  error_setg(errp, "Number of interrupt needs to be greater than 0");
> @@ -428,9 +443,19 @@ static const TypeInfo xive_source_info = {
>  .class_init= xive_source_class_init,
>  };
>  
> +/*
> + * XIVE Fabric
> + */
> +static const TypeInfo xive_fabric_info = {
> +.name = TYPE_XIVE_NOTIFIER,
> +.parent = TYPE_INTERFACE,
> +.class_size = sizeof(XiveNotifierClass),
> +};
> +
>  static void xive_register_types(void)
>  {
>      type_register_static(&xive_source_info);
> +    type_register_static(&xive_fabric_info);
>  }
>  
>  type_init(xive_register_types)

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [PATCH v6 06/37] ppc/xive: add support for the END Event State buffers

2018-12-05 Thread David Gibson

On Thu, Dec 06, 2018 at 12:22:20AM +0100, Cédric Le Goater wrote:
> The Event Notification Descriptor (END) XIVE structure also contains
> two Event State Buffers providing further coalescing of interrupts,
> one for the notification event (ESn) and one for the escalation events
> (ESe). An MMIO page is assigned to each to control the EOI through
> loads only. Stores are not allowed.
> 
> The END ESBs are modeled through an object resembling the 'XiveSource'.
> It is stateless as the END state bits are backed in the XiveEND
> structure under the XiveRouter, and the MMIO accesses follow the same
> rules as for the standard source ESBs.
> 
> END ESBs are supported by the Linux drivers neither on OPAL nor on
> sPAPR. Nevertheless, this provides a means to study the question in the
> future and further validates the XIVE model.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  include/hw/ppc/xive.h |  22 ++
>  hw/intc/xive.c| 173 +-
>  2 files changed, 193 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index d1b4c6c78ec5..d67b0785df7c 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -305,6 +305,8 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>  
>  typedef struct XiveRouter {
>      SysBusDevice    parent;
> +
> +    uint32_t        chip_id;

I still don't think you need this..

>  } XiveRouter;
>  
>  #define TYPE_XIVE_ROUTER "xive-router"
> @@ -336,6 +338,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>  int xive_router_write_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>                            XiveEND *end, uint8_t word_number);
>  
> +/*
> + * XIVE END ESBs
> + */
> +
> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
> +#define XIVE_END_SOURCE(obj) \
> +OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
> +
> +typedef struct XiveENDSource {
> +    DeviceState     parent;
> +
> +    uint32_t        nr_ends;
> +
> +    /* ESB memory region */
> +    uint32_t        esb_shift;
> +    MemoryRegion    esb_mmio;
> +
> +    XiveRouter      *xrtr;

..or this..

> +} XiveENDSource;
> +
>  /*
>   * For legacy compatibility, the exceptions define up to 256 different
>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 41d8ba1540d0..83686e260df5 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -612,8 +612,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>       * even futher coalescing in the Router
>       */
>      if (!xive_end_is_notify(&end)) {
> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> -        return;
> +        uint8_t pq = GETFIELD_BE32(END_W1_ESn, end.w1);
> +        bool notify = xive_esb_trigger(&pq);
> +
> +        if (pq != GETFIELD_BE32(END_W1_ESn, end.w1)) {
> +            end.w1 = SETFIELD_BE32(END_W1_ESn, end.w1, pq);
> +            xive_router_write_end(xrtr, end_blk, end_idx, &end, 1);
> +        }
> +
> +        /* ESn[Q]=1 : end of notification */
> +        if (!notify) {
> +            return;
> +        }
>      }
>  
>  /*
> @@ -658,12 +668,18 @@ static void xive_router_notify(XiveNotifier *xn, uint32_t lisn)
> GETFIELD_BE64(EAS_END_DATA,  eas.w));
>  }
>  
> +static Property xive_router_properties[] = {
> +DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
> +DEFINE_PROP_END_OF_LIST(),
> +};
> +
>  static void xive_router_class_init(ObjectClass *klass, void *data)
>  {
>  DeviceClass *dc = DEVICE_CLASS(klass);
>  XiveNotifierClass *xnc = XIVE_NOTIFIER_CLASS(klass);
>  
>  dc->desc= "XIVE Router Engine";
> +dc->props   = xive_router_properties;
>  xnc->notify = xive_router_notify;
>  }
>  
> @@ -692,6 +708,158 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
> (uint32_t) GETFIELD_BE64(EAS_END_DATA, eas->w));
>  }
>  
> +/*
> + * END ESB MMIO loads
> + */
> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
> +    XiveRouter *xrtr = xsrc->xrtr;
> +    uint32_t offset = addr & 0xFFF;
> +    uint8_t end_blk;
> +    uint32_t end_idx;
> +    XiveEND end;
> +    uint32_t end_esmask;
> +    uint8_t pq;
> +    uint64_t ret = -1;
> +
> +    end_blk = xrtr->chip_id;

.. instead I think it makes more sense to just configure the end_blk
directly on the end_source, rather than reaching into another object
to 

> +    end_idx = addr >> (xsrc->esb_shift + 1);
> +
> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
> +                      end_idx);
> +        return -1;
> +    }
> +
> +    if (!xive_end_is_valid(&end)) {
> +qemu_log_mask(LOG_GUEST_ERROR, "XIVE:
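The `!UCOND_NOTIFY` path above relies on the ESB PQ state machine: a trigger forwards a notification only from the reset state, and otherwise coalesces ("ESn[Q]=1 : end of notification"). Here is a minimal sketch of those transitions plus the END ESB address decode (two pages per END, hence `esb_shift + 1`). The state names, `esb_trigger()`, `end_esb_index()` and `end_esb_is_escalation()` are hypothetical, and the ordering "even page = ESn, odd page = ESe" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings: P is the high bit, Q the low bit. Names are illustrative. */
enum {
    ESB_RESET   = 0x0,  /* P=0 Q=0: idle */
    ESB_OFF     = 0x1,  /* P=0 Q=1: masked */
    ESB_PENDING = 0x2,  /* P=1 Q=0: one notification forwarded */
    ESB_QUEUED  = 0x3,  /* P=1 Q=1: further events coalesced */
};

/* Apply a trigger; return true if a notification should be forwarded. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;
        return false;
    default:            /* ESB_OFF stays off */
        *pq = ESB_OFF;
        return false;
    }
}

/* Each END owns two consecutive ESB pages, hence the extra shift by one. */
static uint32_t end_esb_index(uint64_t addr, uint32_t esb_shift)
{
    return (uint32_t)(addr >> (esb_shift + 1));
}

/* Assumed layout: even page = notification (ESn), odd page = escalation (ESe). */
static bool end_esb_is_escalation(uint64_t addr, uint32_t esb_shift)
{
    return ((addr >> esb_shift) & 1) != 0;
}
```

Because the result only depends on the stored PQ bits, the END source itself can stay stateless, as the commit message says.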

Re: [Qemu-devel] [PATCH v6 05/37] ppc/xive: introduce the XIVE Event Notification Descriptors

2018-12-05 Thread David Gibson

On Thu, Dec 06, 2018 at 12:22:19AM +0100, Cédric Le Goater wrote:
> To complete the event routing, the IVRE sub-engine uses a second table
> containing Event Notification Descriptor (END) structures.
> 
> An END specifies on which Event Queue (EQ) the event notification
> data, defined in the associated EAS, should be posted when an
> exception occurs. It also defines which Notification Virtual Target
> (NVT) should be notified.
> 
> The Event Queue is a memory page provided by the O/S defining a
> circular buffer, one per server and priority couple, containing Event
> Queue entries. These are 4 bytes long, the first bit being a
> 'generation' bit and the 31 following bits the END Data field. They
> are pulled by the O/S when the exception occurs.
> 
> The END Data field is a way to set an invariant logical event source
> number for an IRQ. On sPAPR machines, it is set with the
> H_INT_SET_SOURCE_CONFIG hcall when the EISN flag is used.
> 
> Signed-off-by: Cédric Le Goater 

Reviewed-by: David Gibson 

> ---
>  include/hw/ppc/xive.h  |  18 
>  include/hw/ppc/xive_regs.h |  57 
>  hw/intc/xive.c | 174 +
>  3 files changed, 249 insertions(+)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 57ec9f84f527..d1b4c6c78ec5 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -321,11 +321,29 @@ typedef struct XiveRouterClass {
>  /* XIVE table accessors */
>  int (*get_eas)(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
> XiveEAS *eas);
> +int (*get_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +   XiveEND *end);
> +int (*write_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> + XiveEND *end, uint8_t word_number);

I'm not sure if this is the best interface long term, but it doesn't
impact on public or migration interfaces, so I'm happy to run with it
for the time being.

>  } XiveRouterClass;
>  
>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>  
>  int xive_router_get_eas(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
>  XiveEAS *eas);
> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +XiveEND *end);
> +int xive_router_write_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +  XiveEND *end, uint8_t word_number);
> +
> +/*
> + * For legacy compatibility, the exceptions define up to 256 different
> + * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> + * and the least favored level 0xFF.
> + */
> +#define XIVE_PRIORITY_MAX  7
> +
> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
> +void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width, Monitor *mon);
>  
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> index 15f2470ed9cc..3c0ebad18b69 100644
> --- a/include/hw/ppc/xive_regs.h
> +++ b/include/hw/ppc/xive_regs.h
> @@ -47,4 +47,61 @@ typedef struct XiveEAS {
>  #define GETFIELD_BE64(m, v)  GETFIELD(m, be64_to_cpu(v))
>  #define SETFIELD_BE64(m, v, val) cpu_to_be64(SETFIELD(m, be64_to_cpu(v), val))
>  
> +/* Event Notification Descriptor (END) */
> +typedef struct XiveEND {
> +uint32_t w0;
> +#define END_W0_VALID PPC_BIT32(0) /* "v" bit */
> +#define END_W0_ENQUEUE   PPC_BIT32(1) /* "q" bit */
> +#define END_W0_UCOND_NOTIFY  PPC_BIT32(2) /* "n" bit */
> +#define END_W0_BACKLOG   PPC_BIT32(3) /* "b" bit */
> +#define END_W0_PRECL_ESC_CTL PPC_BIT32(4) /* "p" bit */
> +#define END_W0_ESCALATE_CTL  PPC_BIT32(5) /* "e" bit */
> +#define END_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
> +#define END_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
> +#define END_W0_QSIZE PPC_BITMASK32(12, 15)
> +#define END_W0_SW0   PPC_BIT32(16)
> +#define END_W0_FIRMWARE  END_W0_SW0 /* Owned by FW */
> +#define END_QSIZE_4K 0
> +#define END_QSIZE_64K 4
> +#define END_W0_HWDEP PPC_BITMASK32(24, 31)
> +uint32_t w1;
> +#define END_W1_ESn   PPC_BITMASK32(0, 1)
> +#define END_W1_ESn_P PPC_BIT32(0)
> +#define END_W1_ESn_Q PPC_BIT32(1)
> +#define END_W1_ESe   PPC_BITMASK32(2, 3)
> +#define END_W1_ESe_P PPC_BIT32(2)
> +#define END_W1_ESe_Q PPC_BIT32(3)
> +#define END_W1_GENERATION PPC_BIT32(9)
> +#define END_W1_PAGE_OFF  PPC_BITMASK32(10, 31)
> +uint32_t w2;
> +#define END_W2_MIGRATION_REG PPC_BITMASK32(0, 3)
> +#define END_W2_OP_DESC_HI PPC_BITMASK32(4, 31)
> +uint32_t w3;
> +#define END_W3_OP_DESC_LO PPC_BITMASK32(0, 31)
> +uint32_t w4;
> +#define END_W4_ESC_END_BLOCK

Re: [Qemu-devel] [PATCH v6 04/37] ppc/xive: introduce the XiveRouter model

2018-12-05 Thread David Gibson

On Thu, Dec 06, 2018 at 12:22:18AM +0100, Cédric Le Goater wrote:
> The XiveRouter models the second sub-engine of the XIVE architecture :
> the Interrupt Virtualization Routing Engine (IVRE).
> 
> The IVRE handles event notifications of the IVSE and performs the
> interrupt routing process. For this purpose, it uses a set of tables
> stored in system memory, the first of which is the Event Assignment
> Structure (EAS) table.
> 
> The EAT associates an interrupt source number with an Event Notification
> Descriptor (END) which will be used in a second phase of the routing
> process to identify a Notification Virtual Target.
> 
> The XiveRouter is an abstract class which needs to be inherited from
> to define a storage for the EAT, and other upcoming tables.
> 
> Signed-off-by: Cédric Le Goater 
> ---
>  include/hw/ppc/xive.h  | 31 
>  include/hw/ppc/xive_regs.h | 50 +
>  hw/intc/xive.c | 76 ++
>  3 files changed, 157 insertions(+)
>  create mode 100644 include/hw/ppc/xive_regs.h
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 6770cffec67d..57ec9f84f527 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -141,6 +141,8 @@
>  #define PPC_XIVE_H
>  
>  #include "hw/qdev-core.h"
> +#include "hw/sysbus.h"
> +#include "hw/ppc/xive_regs.h"
>  
>  /*
>   * XIVE Fabric (Interface between Source and Router)
> @@ -297,4 +299,33 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>  }
>  }
>  
> +/*
> + * XIVE Router
> + */
> +
> +typedef struct XiveRouter {
> +SysBusDevice parent;

I thought the plan was to make XiveRouter as well as XiveSource a
TYPE_DEVICE descendant rather than a SysBusDevice?

> +} XiveRouter;
> +
> +#define TYPE_XIVE_ROUTER "xive-router"
> +#define XIVE_ROUTER(obj)\
> +OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
> +#define XIVE_ROUTER_CLASS(klass)\
> +OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
> +#define XIVE_ROUTER_GET_CLASS(obj)  \
> +OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
> +
> +typedef struct XiveRouterClass {
> +SysBusDeviceClass parent;
> +
> +/* XIVE table accessors */
> +int (*get_eas)(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
> +   XiveEAS *eas);
> +} XiveRouterClass;
> +
> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> +
> +int xive_router_get_eas(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
> +XiveEAS *eas);
> +
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> new file mode 100644
> index ..15f2470ed9cc
> --- /dev/null
> +++ b/include/hw/ppc/xive_regs.h
> @@ -0,0 +1,50 @@
> +/*
> + * QEMU PowerPC XIVE internal structure definitions
> + *
> + *
> + * The XIVE structures are accessed by the HW and their format is
> + * architected to be big-endian. Some macros are provided to ease
> + * access to the different fields.
> + *
> + *
> + * Copyright (c) 2016-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_XIVE_REGS_H
> +#define PPC_XIVE_REGS_H
> +
> +/*
> + * Interrupt source number encoding on PowerBUS
> + */
> +#define XIVE_SRCNO_BLOCK(srcno) (((srcno) >> 28) & 0xf)
> +#define XIVE_SRCNO_INDEX(srcno) ((srcno) & 0x0fff)
> +#define XIVE_SRCNO(blk, idx) ((uint32_t)(blk) << 28 | (idx))
> +
> +/* EAS (Event Assignment Structure)
> + *
> + * One per interrupt source. Targets an interrupt to a given Event
> + * Notification Descriptor (END) and provides the corresponding
> + * logical interrupt number (END data)
> + */
> +typedef struct XiveEAS {
> +/* Use a single 64-bit definition to make it easier to
> + * perform atomic updates
> + */
> +uint64_t w;
> +#define EAS_VALID   PPC_BIT(0)
> +#define EAS_END_BLOCK   PPC_BITMASK(4, 7)/* Destination END block# */
> +#define EAS_END_INDEX   PPC_BITMASK(8, 31)   /* Destination END index */
> +#define EAS_MASKED  PPC_BIT(32)  /* Masked */
> +#define EAS_END_DATA    PPC_BITMASK(33, 63)  /* Data written to the END */
> +} XiveEAS;
> +
> +#define xive_eas_is_valid(eas)   (be64_to_cpu((eas)->w) & EAS_VALID)
> +#define xive_eas_is_masked(eas)  (be64_to_cpu((eas)->w) & EAS_MASKED)
> +
> +#define GETFIELD_BE64(m, v)  GETFIELD(m, be64_to_cpu(v))
> +#define SETFIELD_BE64(m, v, val) cpu_to_be64(SETFIELD(m, be64_to_cpu(v), val))
> +
> +#endif /* PPC_XIVE_REGS_H */
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 79238eb57fae..d21df6674d8c 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -443,6 +443,81 @@ static const TypeInfo xive_source_info = {
>

[Qemu-devel] [PATCH for-4.0 v4 1/4] unify len and addr type for memory/address APIs

2018-12-05 Thread Li Zhijian

Some address/memory APIs use different types for 'hwaddr/target_ulong addr'
and 'int len'. This is very unsafe, especially since some callers pass a
non-int len, which can quietly overflow.
Below is a potential overflow case:
dma_memory_read(uint32_t len)
  -> dma_memory_rw(uint32_t len)
-> dma_memory_rw_relaxed(uint32_t len)
  -> address_space_rw(int len) # len overflow

CC: Paolo Bonzini 
CC: Peter Crosthwaite 
CC: Richard Henderson 
CC: Peter Maydell 
Signed-off-by: Li Zhijian 
Reviewed-by: Peter Maydell 
Reviewed-by: Richard Henderson 

---
V4: minor fix at commit message and add Reviewed-by tag
V3: use the same type between len and addr(Peter Maydell)
rebase code basing on 
https://patchew.org/QEMU/20181122133507.30950-1-peter.mayd...@linaro.org/
---
 exec.c| 47 +++
 include/exec/cpu-all.h|  2 +-
 include/exec/cpu-common.h |  8 
 include/exec/memory.h | 22 +++---
 4 files changed, 39 insertions(+), 40 deletions(-)

diff --git a/exec.c b/exec.c
index 6e875f0..f475974 100644
--- a/exec.c
+++ b/exec.c
@@ -2848,10 +2848,10 @@ static const MemoryRegionOps watch_mem_ops = {
 };
 
 static MemTxResult flatview_read(FlatView *fv, hwaddr addr,
-  MemTxAttrs attrs, uint8_t *buf, int len);
+  MemTxAttrs attrs, uint8_t *buf, hwaddr len);
 static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
-  const uint8_t *buf, int len);
-static bool flatview_access_valid(FlatView *fv, hwaddr addr, int len,
+  const uint8_t *buf, hwaddr len);
+static bool flatview_access_valid(FlatView *fv, hwaddr addr, hwaddr len,
   bool is_write, MemTxAttrs attrs);
 
 static MemTxResult subpage_read(void *opaque, hwaddr addr, uint64_t *data,
@@ -3099,10 +3099,10 @@ MemoryRegion *get_system_io(void)
 /* physical memory access (slow version, mainly for debug) */
 #if defined(CONFIG_USER_ONLY)
 int cpu_memory_rw_debug(CPUState *cpu, target_ulong addr,
-uint8_t *buf, int len, int is_write)
+uint8_t *buf, target_ulong len, int is_write)
 {
-int l, flags;
-target_ulong page;
+int flags;
+target_ulong l, page;
 void * p;
 
 while (len > 0) {
@@ -3215,7 +3215,7 @@ static bool prepare_mmio_access(MemoryRegion *mr)
 static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
MemTxAttrs attrs,
const uint8_t *buf,
-   int len, hwaddr addr1,
+   hwaddr len, hwaddr addr1,
hwaddr l, MemoryRegion *mr)
 {
 uint8_t *ptr;
@@ -3260,7 +3260,7 @@ static MemTxResult flatview_write_continue(FlatView *fv, hwaddr addr,
 
 /* Called from RCU critical section.  */
 static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
-  const uint8_t *buf, int len)
+  const uint8_t *buf, hwaddr len)
 {
 hwaddr l;
 hwaddr addr1;
@@ -3278,7 +3278,7 @@ static MemTxResult flatview_write(FlatView *fv, hwaddr addr, MemTxAttrs attrs,
 /* Called within RCU critical section.  */
 MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
MemTxAttrs attrs, uint8_t *buf,
-   int len, hwaddr addr1, hwaddr l,
+   hwaddr len, hwaddr addr1, hwaddr l,
MemoryRegion *mr)
 {
 uint8_t *ptr;
@@ -3321,7 +3321,7 @@ MemTxResult flatview_read_continue(FlatView *fv, hwaddr addr,
 
 /* Called from RCU critical section.  */
 static MemTxResult flatview_read(FlatView *fv, hwaddr addr,
- MemTxAttrs attrs, uint8_t *buf, int len)
+ MemTxAttrs attrs, uint8_t *buf, hwaddr len)
 {
 hwaddr l;
 hwaddr addr1;
@@ -3334,7 +3334,7 @@ static MemTxResult flatview_read(FlatView *fv, hwaddr addr,
 }
 
 MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
-MemTxAttrs attrs, uint8_t *buf, int len)
+MemTxAttrs attrs, uint8_t *buf, hwaddr len)
 {
 MemTxResult result = MEMTX_OK;
 FlatView *fv;
@@ -3351,7 +3351,7 @@ MemTxResult address_space_read_full(AddressSpace *as, hwaddr addr,
 
 MemTxResult address_space_write(AddressSpace *as, hwaddr addr,
 MemTxAttrs attrs,
-const uint8_t *buf, int len)
+const uint8_t *buf, hwaddr len)
 {
 MemTxResult result = MEMTX_OK;
 FlatView *fv;
@@ -3367,7 +3367,7 @@

[Qemu-devel] [PATCH for-4.0 v4 2/4] refactor load_image_size

2018-12-05 Thread Li Zhijian

Don't expect read(2) to always read as many bytes as requested.

Signed-off-by: Li Zhijian 
Reviewed-by: Richard Henderson 

---
V4: add reviewed-by tag
---
 hw/core/loader.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/hw/core/loader.c b/hw/core/loader.c
index fa41842..9cbceab 100644
--- a/hw/core/loader.c
+++ b/hw/core/loader.c
@@ -77,21 +77,20 @@ int64_t get_image_size(const char *filename)
 ssize_t load_image_size(const char *filename, void *addr, size_t size)
 {
 int fd;
-ssize_t actsize;
+ssize_t actsize, l = 0;
 
 fd = open(filename, O_RDONLY | O_BINARY);
 if (fd < 0) {
 return -1;
 }
 
-actsize = read(fd, addr, size);
-if (actsize < 0) {
-close(fd);
-return -1;
+while ((actsize = read(fd, addr + l, size - l)) > 0) {
+l += actsize;
 }
+
 close(fd);
 
-return actsize;
+return actsize < 0 ? -1 : l;
 }
 
 /* read()-like version */
-- 
2.7.4

[Qemu-devel] [PATCH for-4.0 v4 0/4] allow to load initrd below 4G for recent kernel

2018-12-05 Thread Li Zhijian

The Linux kernel has long supported initrds of up to 4G, but its header
still hard-codes a limit that allows loading the initrd below 2G only.
 Quoting from arch/x86/head.S:
 # (Header version 0x0203 or later) the highest safe address for the contents
 # of an initrd. The current kernel allows up to 4 GB, but leave it at 2 GB to
 # avoid possible bootloader bugs.

To support initrds larger than 2G, QEMU must allow loading the initrd
above the 2G address. Luckily, recent kernels introduced a new setup header
field, xloadflags, whose XLF_CAN_BE_LOADED_ABOVE_4G bit tells the bootloader
that the initrd can safely be loaded above 4G.

QEMU/BIOS currently always loads the initrd below below_4g_mem_size, which
is always less than 4G, so simply limiting initrd_max to 4G - 1 is enough
when this bit is set.

The default ROMs (SeaBIOS + optionrom (linuxboot_dma)) work as expected
with this patchset.

changes:
V4:
  - add Reviewed-by tag to 1/4 and 2/4
  - use scripts/update-linux-headers.sh to import bootparam.h
  - minor fix at commit log
V3:
 - rebase code basing on http://patchwork.ozlabs.org/cover/1005990 and
   https://patchew.org/QEMU/20181122133507.30950-1-peter.mayd...@linaro.org
 - add new patch 3/4 to import header bootparam.h (Michael S. Tsirkin)

V2: add 2 patches(3/5, 4/5) to fix potential loading issue.

Li Zhijian (4):
  unify len and addr type for memory/address APIs
  refactor load_image_size
  i386: import & use bootparam.h
  i386: allow to load initrd below 4G for recent linux

 exec.c   | 47 ++--
 hw/core/loader.c | 11 +++
 hw/i386/pc.c | 18 ++-
 include/exec/cpu-all.h   |  2 +-
 include/exec/cpu-common.h|  8 ++---
 include/exec/memory.h| 22 ++---
 include/standard-headers/asm-x86/bootparam.h | 34 
 scripts/update-linux-headers.sh  |  4 +++
 8 files changed, 92 insertions(+), 54 deletions(-)
 create mode 100644 include/standard-headers/asm-x86/bootparam.h

-- 
2.7.4

[Qemu-devel] [PATCH for-4.0 v4 3/4] i386: import & use bootparam.h

2018-12-05 Thread Li Zhijian

It is imported from Linux v4.20-rc5.

CC: Michael S. Tsirkin 
Signed-off-by: Li Zhijian 

---
V4: use a script to import bootparam.h (Michael S. Tsirkin)
V3: new patch

Signed-off-by: Li Zhijian 
---
 hw/i386/pc.c |  8 +--
 include/standard-headers/asm-x86/bootparam.h | 34 
 scripts/update-linux-headers.sh  |  4 
 3 files changed, 39 insertions(+), 7 deletions(-)
 create mode 100644 include/standard-headers/asm-x86/bootparam.h

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 067d23a..3b10726 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -74,6 +74,7 @@
 #include "hw/nmi.h"
 #include "hw/i386/intel_iommu.h"
 #include "hw/net/ne2000-isa.h"
+#include "standard-headers/asm-x86/bootparam.h"
 
 /* debug PC/ISA interrupts */
 //#define DEBUG_IRQ
@@ -820,13 +821,6 @@ static long get_file_size(FILE *f)
 return size;
 }
 
-/* setup_data types */
-#define SETUP_NONE 0
-#define SETUP_E820_EXT 1
-#define SETUP_DTB  2
-#define SETUP_PCI  3
-#define SETUP_EFI  4
-
 struct setup_data {
 uint64_t next;
 uint32_t type;
diff --git a/include/standard-headers/asm-x86/bootparam.h b/include/standard-headers/asm-x86/bootparam.h
new file mode 100644
index 000..67d4f01
--- /dev/null
+++ b/include/standard-headers/asm-x86/bootparam.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _ASM_X86_BOOTPARAM_H
+#define _ASM_X86_BOOTPARAM_H
+
+/* setup_data types */
+#define SETUP_NONE 0
+#define SETUP_E820_EXT 1
+#define SETUP_DTB  2
+#define SETUP_PCI  3
+#define SETUP_EFI  4
+#define SETUP_APPLE_PROPERTIES 5
+#define SETUP_JAILHOUSE 6
+
+/* ram_size flags */
+#define RAMDISK_IMAGE_START_MASK   0x07FF
+#define RAMDISK_PROMPT_FLAG 0x8000
+#define RAMDISK_LOAD_FLAG  0x4000
+
+/* loadflags */
+#define LOADED_HIGH (1<<0)
+#define KASLR_FLAG (1<<1)
+#define QUIET_FLAG (1<<5)
+#define KEEP_SEGMENTS  (1<<6)
+#define CAN_USE_HEAP   (1<<7)
+
+/* xloadflags */
+#define XLF_KERNEL_64  (1<<0)
+#define XLF_CAN_BE_LOADED_ABOVE_4G (1<<1)
+#define XLF_EFI_HANDOVER_32 (1<<2)
+#define XLF_EFI_HANDOVER_64 (1<<3)
+#define XLF_EFI_KEXEC  (1<<4)
+
+
+#endif /* _ASM_X86_BOOTPARAM_H */
diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 0a964fe..77ec108 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -120,6 +120,10 @@ for arch in $ARCHLIST; do
 cp "$tmpdir/include/asm/unistd_x32.h" "$output/linux-headers/asm-x86/"
 cp "$tmpdir/include/asm/unistd_64.h" "$output/linux-headers/asm-x86/"
 cp_portable "$tmpdir/include/asm/kvm_para.h" "$output/include/standard-headers/asm-$arch"
+# Remove everything except the macros from bootparam.h avoiding the
+# unnecessary import of several video/ist/etc headers
+sed -e '/__ASSEMBLY__/,/__ASSEMBLY__/d' $tmpdir/include/asm/bootparam.h > $tmpdir/bootparam.h
+cp_portable $tmpdir/bootparam.h "$output/include/standard-headers/asm-$arch"
 fi
 done
 
-- 
2.7.4

[Qemu-devel] [PATCH for-4.0 v4 4/4] i386: allow to load initrd below 4G for recent linux

2018-12-05 Thread Li Zhijian

A new field, xloadflags, was added to the setup header of recent x86 Linux
kernels; its bit 1, XLF_CAN_BE_LOADED_ABOVE_4G, tells the bootloader that
the initrd can safely be loaded above 4G.

QEMU/BIOS currently always loads the initrd below below_4g_mem_size, which
is always less than 4G, so simply limiting initrd_max to 4G - 1 is enough
when this bit is set.

CC: Paolo Bonzini 
CC: Richard Henderson 
CC: Eduardo Habkost 
CC: "Michael S. Tsirkin" 
CC: Marcel Apfelbaum 
Signed-off-by: Li Zhijian 

---
V3: correct grammar and check XLF_CAN_BE_LOADED_ABOVE_4G first (Michael S. Tsirkin)

Signed-off-by: Li Zhijian 
---
 hw/i386/pc.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 3b10726..baa99c0 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -904,7 +904,15 @@ static void load_linux(PCMachineState *pcms,
 #endif
 
 /* highest address for loading the initrd */
-if (protocol >= 0x203) {
+if (protocol >= 0x20c &&
+lduw_p(header+0x236) & XLF_CAN_BE_LOADED_ABOVE_4G) {
+/*
+ * Although kernel allows initrd loading to above 4G,
+ * it just makes it as large as possible while still staying below 4G
+ * since current BIOS always loads initrd below pcms->below_4g_mem_size
+ */
+initrd_max = UINT32_MAX;
+} else if (protocol >= 0x203) {
 initrd_max = ldl_p(header+0x22c);
 } else {
 initrd_max = 0x37ff;
-- 
2.7.4

Re: [Qemu-devel] [PATCH for-4.0 v3 3/4] i386: import bootparam.h

2018-12-05 Thread Li Zhijian




On 12/05/2018 11:33 PM, Michael S. Tsirkin wrote:

On Wed, Dec 05, 2018 at 06:28:11PM +0800, Li Zhijian wrote:

Hi Michael

I cooked a draft with cp_portable to import bootparam.h, could you have a look.

diff --git a/scripts/update-linux-headers.sh b/scripts/update-linux-headers.sh
index 0a964fe..1beeceb 100755
--- a/scripts/update-linux-headers.sh
+++ b/scripts/update-linux-headers.sh
@@ -44,6 +44,12 @@ cp_portable() {
   -e 'linux/kernel' \
   -e 'linux/sysinfo' \
   -e 'asm-generic/kvm_para' \
+ -e 'linux/screen_info.h' \
+ -e 'linux/apm_bios.h' \
+ -e 'linux/edd.h' \
+ -e 'video/edid.h' \
+ -e 'asm/ist.h' \
+ -e 'linux/ioctl.h' \
   > /dev/null
  then
  echo "Unexpected #include in input file $f".
@@ -59,6 +65,8 @@ cp_portable() {
  -e 's/__be\([0-9][0-9]*\)/uint\1_t/g' \
  -e 's/"\(input-event-codes\.h\)"/"standard-headers\/linux\/\1"/' \
  -e 's/]*\)>/"standard-headers\/linux\/\1"/' \
+-e "s/]*\)>/\"standard-headers\/asm-$arch\/\1\"/" \
+-e 's/]*\)>/"standard-headers\/video\/\1"/' \
  -e 's/__bitwise//' \
  -e 's/__attribute__((packed))/QEMU_PACKED/' \
  -e 's/__inline__/inline/' \
@@ -74,6 +82,23 @@ cp_portable() {
  "$f" > "$to/$header";
  }

+rm -rf "$output/include/standard-headers/linux"
+mkdir -p "$output/include/standard-headers/linux"
+
+cp_bootparam()
+{
+mkdir -p $output/include/standard-headers/video
+cp "$tmpdir"/include/linux/ioctl.h "$output/include/standard-headers/linux"
+cp_portable "$tmpdir"/include/linux/screen_info.h "$output/include/standard-headers/linux"
+cp_portable "$tmpdir/include/linux/apm_bios.h" "$output/include/standard-headers/linux"
+cp_portable "$tmpdir/include/linux/edd.h" "$output/include/standard-headers/linux"
+cp_portable "$tmpdir/include/asm/ist.h" $output/include/standard-headers/asm-$arch
+cp_portable "$tmpdir/include/video/edid.h" $output/include/standard-headers/video
+
+# bootparam.h includes above headers
+cp_portable "$tmpdir/include/asm/bootparam.h" "$output/include/standard-headers/asm-$arch"
+}
+
  # This will pick up non-directories too (eg "Kconfig") but we will
  # ignore them in the next loop.
  ARCHLIST=$(cd "$linux/arch" && echo *)
@@ -120,6 +145,7 @@ for arch in $ARCHLIST; do
  cp "$tmpdir/include/asm/unistd_x32.h" "$output/linux-headers/asm-x86/"
  cp "$tmpdir/include/asm/unistd_64.h" "$output/linux-headers/asm-x86/"
  cp_portable "$tmpdir/include/asm/kvm_para.h" "$output/include/standard-headers/asm-$arch"
+cp_bootparam
  fi
  done

@@ -163,8 +189,6 @@ cat <$output/linux-headers/linux/virtio_ring.h
  #include "standard-headers/linux/virtio_ring.h"
  EOF

-rm -rf "$output/include/standard-headers/linux"
-mkdir -p "$output/include/standard-headers/linux"
  for i in "$tmpdir"/include/linux/*virtio*.h \
   "$tmpdir/include/linux/qemu_fw_cfg.h" \
   "$tmpdir/include/linux/input.h" \

Thanks
Zhijian


So arch specific asm including asm doesn't work well right now :(
You can either fix the path to ist to pull it from asm-x86,



+-e "s/]*\)>/\"standard-headers\/asm-$arch\/\1\"/" \
+-e 's/]*\)>/"standard-headers\/video\/\1"/' \



Actually, the changes above fix the asm path as well.
But I prefer the solution below, which is simpler and clearer.



or, if you don't actually need anything in that header except the
macros, you can just cut out everything around __ASSEMBLY__
with a bit of e.g. sed magic. E.g. pvrdma does this.

Something like:

# Remove everything except the macros from bootparam.h avoiding the unnecessary
# import of several video/ist/etc headers
sed  -e '/__ASSEMBLY__/,/__ASSEMBLY__/d' arch/x86/include/uapi/asm/bootparam.h

should do the job.


Thanks
Zhijian

Re: [Qemu-devel] [PATCH for-4.0 v3 1/4] unify len and addr type for memory/address APIs

2018-12-05 Thread Li Zhijian




On 12/05/2018 01:40 AM, Philippe Mathieu-Daudé wrote:

Hi Li,

On 3/12/18 15:48, Li Zhijian wrote:

Some address/memory APIs have different type between
'hwaddr/target_ulong addr' and 'int len'. It is very unsafety, espcially

I'm not a native English speaker, but I think this should be spelled:

... "very unsafe, especially" ...


thanks, Google said so.

Thanks
Zhijian

Re: [Qemu-devel] [Qemu-arm] [PATCH V11 0/8] add pvpanic mmio support

2018-12-05 Thread peng.hao2

>On Wed, 5 Dec 2018 at 00:28,  wrote:
>>
>> >I'm afraid I don't understand. If it's a PCI device then
>> >it does not need to be listed in the device tree or the
>> >ACPI tables at all, because it is probeable by the guest.
>> >This also significantly simplifies the changes needed in QEMU.
>> >
>>
>> It is precisely because PCI devices cannot be controlled by FDT or ACPI
>> tables that I do not want to implement it as a PCI device.
>> x86/pvpanic is implemented as an ISA device in QEMU and as an ACPI device
>> in the kernel. My implementation extends x86/pvpanic and reuses a large
>> part of its code. If a PCI device were implemented in QEMU, then an ACPI
>> device and a PCI device could appear in the kernel simultaneously. This
>> would add both devices to the crash notifier list, which is odd. I want
>> to see only one device at any time.
>
>Yes, certainly we only need one pvpanic device. If it's implemented
>as a PCI device, then that's what appears. We don't need and
>would not implement the MMIO version. On x86 a user could
>in theory use the command line to request both ISA and PCI
>pvpanic devices. That would not be very sensible, but there
>are lots of QEMU command lines the user can request that
>don't make sense.
>
>> Of course, many architectures can use PCI devices, but we are currently
>> reusing the x86/pvpanic code as much as possible in QEMU and the kernel,
>> rather than reimplementing it. At the same time, backward compatibility
>> also needs to be considered.
>>
>>                      pvpanic in guest kernel
>> ARM:  ACPI table     ACPI device
>>       FDT            MMIO device (guest started bypassing UEFI)
>> x86:  ACPI table     ACPI device
>
>For Arm, there is no backward compatibility issue, as we have
>not yet implemented or shipped anything.
>

Sorry, I did not express that clearly enough. I meant that x86 needs
backward compatibility if we intend to reuse the x86/pvpanic code.

>> >> Secondly, I don't want it to be a pluggable device. If the user
>> >> deletes the device by mistake, it may lead to unpredictable results.
>> >
>> >If the user deletes the PCI device they're using for their
>> >disk or networking this will also lead to unpredictable
>> >results. We expect users not to randomly unplug things from
>> >their system if they want it to continue to work. In any
>> >case your guest driver can easily handle the unplug: the
>> >guest would then just lose the ability to notify on panic,
>> >falling back to as if the pvpanic device had never been
>> >present.
>>
>> If the code were modified so that both devices could exist simultaneously,
>> then, because the ACPI device relies on the PCI device, the ACPI device
>> would not work when a panic is triggered if the PCI device had been
>> dynamically unplugged.
>
>If somebody modifies the code to QEMU or the guest kernel
>such that something breaks, that's their issue to deal with.
>My proposal is that we would ship:
>* a QEMU with a PCI pvpanic device (which you could plug in
>if you wanted it)
>* no changes to the Arm virt board, so nothing in the ACPI
>or device tree
>* no "mmio pvpanic" device

ok, I will try it.
thanks.
>
>thanks
>-- PMM

Re: [Qemu-devel] [PATCH v4] hw/arm: Add arm SBSA reference machine

2018-12-05 Thread Hongbo Zhang

On Wed, 5 Dec 2018 at 18:36, Leif Lindholm  wrote:
>
> On Wed, Dec 05, 2018 at 05:50:23PM +0800, Hongbo Zhang wrote:
> > > > +static
> > > > +void sbsa_ref_machine_done(Notifier *notifier, void *data)
> > > > +{
> > > > +VirtMachineState *vms = container_of(notifier, VirtMachineState,
> > > > + machine_done);
> > > > +ARMCPU *cpu = ARM_CPU(first_cpu);
> > > > +struct arm_boot_info *info = &vms->bootinfo;
> > > > +AddressSpace *as = arm_boot_address_space(cpu, info);
> > > > +
> > > > +if (arm_load_dtb(info->dtb_start, info, info->dtb_limit, as) < 0) {
> > > > +exit(1);
> > > > +}
> > > > +}
> > >
> > > The virt board needs a machine-done notifier because it has
> > > to add extra things to the DTB here. You don't, so you don't
> > > need one. Don't set bootinfo.skip_dtb_autoload to true, and
> > > then the boot.c code will do the arm_load_dtb() call for you.
> > >
> > After testing and checking, I think we still need this machine_done
> > callback to call arm_load_dtb().
> > If only a kernel is loaded via -kernel without any firmware, letting
> > arm_load_kernel() call arm_load_dtb() should work. In our case, however,
> > we have firmware loaded but no kernel, so arm_load_kernel() returns
> > before calling arm_load_dtb(). That is, the firmware runs without
> > arm_load_dtb() ever being called, so we get an error message and QEMU
> > quits.
> > Moving arm_load_dtb() to an earlier place in arm_load_kernel() cannot
> > fix this issue either.
>
> I don't see the value in using -kernel on the SBSA machine.
> If it causes complexity, can we drop the functionality?
>
We don't use the -kernel parameter on the SBSA machine.
It doesn't add complexity. It was suggested that we drop the previous
machine_done() callback to reduce the number of code lines, but I think
we still have to use it.

> Regards,
>
> Leif

Re: [Qemu-devel] [PATCH v6 00/37] ppc: support for the XIVE interrupt controller (POWER9)

2018-12-05 Thread no-reply

Patchew URL: https://patchew.org/QEMU/20181205232251.10446-1-...@kaod.org/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 20181205232251.10446-1-...@kaod.org
Subject: [Qemu-devel] [PATCH v6 00/37] ppc: support for the XIVE interrupt controller (POWER9)
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
failed=1
echo
fi
n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
b29ce00 spapr: add KVM support to the 'dual' machine
957ca7b spapr: check for KVM IRQ device activation
20c8779 spapr: introduce routines to delete the KVM IRQ device
6577bbe sysbus: add a sysbus_mmio_unmap() helper
9ae96ee spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers
643bf4b ppc/xics: introduce a icp_kvm_connect() routine
f727182 spapr: add a 'pseries-3.1-dual' machine type
d69c736 spapr/xive: enable XIVE MMIOs at reset
114495a spapr: set the interrupt presenter at reset
d28fbd6 spapr/xive: fix migration of the XiveTCTX under TCG
af33d19 spapr/xive: add migration support for KVM
ca0d71d spapr/xive: introduce a VM state change handler
6b44f52 spapr/xive: add state synchronization with KVM
543c65b spapr/xive: add KVM support
11b9fc0 linux-headers: update to 4.20-rc5
de7e220 spapr: add a 'pseries-3.1-xive' machine type
9a3ad0b spapr: add a 'reset' method to the sPAPR IRQ backend
e414def spapr: extend the sPAPR IRQ backend for XICS migration
7ee593a spapr: allocate the interrupt thread context under the CPU core
5830c26 spapr: add device tree support for the XIVE exploitation mode
7c6ed94 spapr: add hcalls support for the XIVE exploitation interrupt mode
8b0265d spapr: introdude a new machine IRQ backend for XIVE
7438740 spapr: export and rename the xics_max_server_number() routine
6c85b1a spapr: modify the irq backend 'init' method
cbdc5f2 spapr: introduce a spapr_irq_init() routine
7b96ab4 spapr: initialize VSMT before initializing the IRQ backend
99ed715 spapr/xive: use the VCPU id as a NVT identifier
12a7762 spapr/xive: introduce a XIVE interrupt controller
043a1e5 ppc/xive: notify the CPU when the interrupt priority is more privileged
4e2626f ppc/xive: introduce a simplified XIVE presenter
58d0fa5 ppc/xive: introduce the XIVE interrupt thread context
28efbe5 ppc/xive: add support for the END Event State buffers
3f3bd6c ppc/xive: introduce the XIVE Event Notification Descriptors
871c908 ppc/xive: introduce the XiveRouter model
caf164f ppc/xive: introduce the XiveNotifier interface
eef5a2f ppc/xive: add support for the LSI interrupt sources
7c9d387 ppc/xive: introduce a XIVE interrupt source model

=== OUTPUT BEGIN ===
Checking PATCH 1/37: ppc/xive: introduce a XIVE interrupt source model...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#60: 
new file mode 100644

total: 0 errors, 1 warnings, 656 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 2/37: ppc/xive: add support for the LSI interrupt sources...
Checking PATCH 3/37: ppc/xive: introduce the XiveNotifier interface...
Checking PATCH 4/37: ppc/xive: introduce the XiveRouter model...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#169: 
new file mode 100644

total: 0 errors, 1 warnings, 179 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 5/37: ppc/xive: introduce the XIVE Event Notification Descriptors...
Checking PATCH 6/37: ppc/xive: add support for the END Event State buffers...
Checking PATCH 7/37: ppc/xive: introduce the XIVE interrupt thread context...
Checking PATCH 8/37: ppc/xive: introduce a simplified XIVE presenter...
Checking PATCH 9/37: ppc/xive: notify the CPU when the interrupt priority is more privileged...
Checking PATCH 10/37: spapr/xive: introduce a XIVE interrupt controller...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#57: 
new file mode 100644

total: 0 errors, 1 warnings, 425 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 11/37: spapr/xive: use the VCPU id as a NVT identifier...
Checking PATCH 12/37: spapr: initialize VSMT before initializing the IRQ backend...
Checking PATCH 13/37:

Re: [Qemu-devel] Hosted CI for FreeBSD - Cirrus CI

2018-12-05 Thread Kamil Rytarowski

On 05.12.2018 22:58, Ed Maste wrote:
> On Wed, 5 Dec 2018 at 15:59, Kamil Rytarowski  wrote:
>>
>> There are already FreeBSD, OpenBSD and NetBSD test scripts in the qemu
>> project.
> 
> I see scripts under tests/vm/ for FreeBSD, NetBSD, and OpenBSD, but
> they're for testing BSD guests on QEMU, while I'm interested in
> building and testing QEMU on a FreeBSD host. Is there something else
> that I've missed?
> 

These tests build QEMU in the guest and execute its tests there, so the
BSDs are tested as an [emulated] host.




[Qemu-devel] [Bug 1807052] Re: Qemu hangs during migration

2018-12-05 Thread Matthew Schumacher

If I remove iothreads and writeback caching, it seems more reliable, but
I can still get it to hang.

This time the source server shows the VM as running; the backtrace
looks like:

(gdb) bt full
#0  0x7f27eab0028c in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x7f27eaaf9d35 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00865419 in qemu_mutex_lock_impl (mutex=mutex@entry=0x115b8e0 
, file=file@entry=0x8fdf14 "/tmp/qemu-3.0.0/cpus.c", 
line=line@entry=1768)
at util/qemu-thread-posix.c:66
err = 
__PRETTY_FUNCTION__ = "qemu_mutex_lock_impl"
__func__ = "qemu_mutex_lock_impl"
#3  0x00477578 in qemu_mutex_lock_iothread () at /tmp/qemu-3.0.0/cpus.c:1768
#4  0x008622b0 in main_loop_wait (timeout=) at util/main-loop.c:236
context = 0x1e72810
ret = 1
ret = 1
timeout = 4294967295
timeout_ns = 
#5  0x008622b0 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
ret = 1
timeout = 4294967295
timeout_ns = 
#6  0x00595dee in main_loop () at vl.c:1866
#7  0x0041f35d in main (argc=, argv=, envp=) at vl.c:4644
i = 
snapshot = 0
linux_boot = 
initrd_filename = 0x0
kernel_filename = 
kernel_cmdline = 
boot_order = 0x918f44 "cad"
boot_once = 0x0
ds = 
opts = 
machine_opts = 
icount_opts = 
accel_opts = 0x0
olist = 
optind = 71
optarg = 0x7fff5edcff69 "timestamp=on"
loadvm = 0x0
machine_class = 0x0
cpu_model = 0x7fff5edcf88a "Skylake-Server-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,umip=on,pku=on,ssbd=on,xsaves=on,topoext=on,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer"...
vga_model = 0x0
qtest_chrdev = 0x0
qtest_log = 0x0
pid_file = 
incoming = 0x7fff5edcff0a "defer"
userconfig = 
nographic = false
display_remote = 
log_mask = 
log_file = 
trace_file = 
maxram_size = 4294967296
ram_slots = 0
vmstate_dump_file = 0x0
main_loop_err = 0x0
err = 0x0
list_data_dirs = false
dir = 
dirs = 
bdo_queue = {sqh_first = 0x0, sqh_last = 0x7fff5edcd670}
__func__ = "main"


Dest server is paused, and looks like this:

#0  0x7f11c48bc3c1 in ppoll () at /lib64/libc.so.6
#1  0x00861659 in qemu_poll_ns (fds=, nfds=, timeout=timeout@entry=2999892383) at util/qemu-timer.c:334
ts = {tv_sec = 2, tv_nsec = 999892383}
Python Exception  That operation is not available on integers of more than 8 bytes.:
#2  0x008622a4 in main_loop_wait (timeout=) at util/main-loop.c:233
context = 0x2342810
ret = 
ret = -1295074913
timeout = 4294967295
timeout_ns = 
#3  0x008622a4 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
ret = -1295074913
timeout = 4294967295
timeout_ns = 
#4  0x00595dee in main_loop () at vl.c:1866
#5  0x0041f35d in main (argc=, argv=, envp=) at vl.c:4644
i = 
snapshot = 0
linux_boot = 
initrd_filename = 0x0
kernel_filename = 
kernel_cmdline = 
boot_order = 0x918f44 "cad"
boot_once = 0x0
ds = 
opts = 
machine_opts = 
icount_opts = 
accel_opts = 0x0
olist = 
optind = 71
optarg = 0x7ffe6b899f69 "timestamp=on"
loadvm = 0x0
machine_class = 0x0
cpu_model = 0x7ffe6b89988a "Skylake-Server-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,umip=on,pku=on,ssbd=on,xsaves=on,topoext=on,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer"...
vga_model = 0x0
qtest_chrdev = 0x0
qtest_log = 0x0
pid_file = 
incoming = 0x7ffe6b899f0a "defer"
userconfig = 
nographic = false
display_remote = 
log_mask = 
log_file = 
trace_file = 
maxram_size = 4294967296
ram_slots = 0
vmstate_dump_file = 0x0
main_loop_err = 0x0
err = 0x0
list_data_dirs = false
dir = 
dirs = 
bdo_queue = {sqh_first = 0x0, sqh_last = 0x7ffe6b8988e0}
__func__ = "main"

Honestly, this looks pretty much like the same bug.

-- 
You received this bug notification because you are a member of qemu-devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1807052

Title:
  Qemu hangs during migration

Status in QEMU:
  New

Bug description:
  Source server: linux 4.19.5 qemu-3.0.0 from source, libvirt 4.9
  Dest server: linux 4.18.19 qemu-3.0.0 from source, libvirt 4.9

[Qemu-devel] [Bug 1807052] [NEW] Qemu hangs during migration

2018-12-05 Thread Matthew Schumacher

Public bug reported:

Source server: linux 4.19.5 qemu-3.0.0 from source, libvirt 4.9
Dest server: linux 4.18.19 qemu-3.0.0 from source, libvirt 4.9

When this VM is running on source server:

/usr/bin/qemu-system-x86_64 -name guest=testvm,debug-threads=on -S
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-13-testvm/master-key.aes
-machine pc-q35-3.0,accel=kvm,usb=off,dump-guest-core=off
-cpu Skylake-Server-IBRS,ss=on,hypervisor=on,tsc_adjust=on,clflushopt=on,umip=on,pku=on,ssbd=on,xsaves=on,topoext=on,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vpindex,hv_runtime,hv_synic,hv_stimer,hv_reset,hv_vendor_id=KVM Hv
-m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1
-object iothread,id=iothread1 -uuid 3b00b788-ee91-4e45-80a6-c7319da71225
-no-user-config -nodefaults
-chardev socket,id=charmonitor,fd=23,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control
-rtc base=localtime,driftfix=slew -global kvm-pit.lost_tick_policy=delay
-no-hpet -no-shutdown -boot strict=on
-device pcie-root-port,port=0x10,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x2
-device pcie-root-port,port=0x11,chassis=2,id=pci.2,bus=pcie.0,addr=0x2.0x1
-device pcie-pci-bridge,id=pci.3,bus=pci.1,addr=0x0
-device pcie-root-port,port=0x12,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x2
-device pcie-root-port,port=0x13,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x3
-device piix3-usb-uhci,id=usb,bus=pci.3,addr=0x1
-device virtio-scsi-pci,iothread=iothread1,id=scsi0,bus=pci.4,addr=0x0
-drive file=/dev/zvol/datastore/vm/testvm-vda,format=raw,if=none,id=drive-scsi0-0-0-0,cache=writeback,aio=threads
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=2,write-cache=on
-drive if=none,id=drive-sata0-0-4,media=cdrom,readonly=on
-device ide-cd,bus=ide.4,drive=drive-sata0-0-4,id=sata0-0-4,bootindex=1
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=26
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:a2:b7:a1,bus=pci.2,addr=0x0
-device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 127.0.0.1:0
-device cirrus-vga,id=video0,bus=pcie.0,addr=0x1 -s
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
-msg timestamp=on

I try to migrate it and the disks to the other side:

virsh migrate --live --undefinesource --persistent --verbose --copy-storage-all testvm qemu+ssh://wasvirt1/system

We get to 99% and then hang, with both sides in the paused state.

Source server is stuck here:
(gdb) bt full
#0  0x7f327994f3c1 in ppoll () at /lib64/libc.so.6
#1  0x0086167b in qemu_poll_ns (fds=, nfds=nfds@entry=1, timeout=) at util/qemu-timer.c:322
#2  0x00863302 in aio_poll (ctx=0x21044e0, blocking=blocking@entry=true) at util/aio-posix.c:629
node = 
i = 
ret = 0
progress = 
timeout = 
start = 
__PRETTY_FUNCTION__ = "aio_poll"
#3  0x007e0d52 in nbd_client_close (bs=0x2ba2400) at block/nbd-client.c:62
waited_ = 
wait_ = 0x2ba563c
ctx_ = 0x2109bb0
bs_ = 0x2ba2400
client = 0x31287e0
client = 
request = {handle = 0, from = 0, len = 0, flags = 0, type = 2}
#4  0x007e0d52 in nbd_client_close (bs=0x2ba2400) at block/nbd-client.c:965
client = 
request = {handle = 0, from = 0, len = 0, flags = 0, type = 2}
#5  0x007de5ca in nbd_close (bs=) at block/nbd.c:491
s = 0x31287e0
#6  0x007823d6 in bdrv_unref (bs=0x2ba2400) at block.c:3352
ban = 
ban_next = 
child = 
next = 
#7  0x007823d6 in bdrv_unref (bs=0x2ba2400) at block.c:3560
#8  0x007823d6 in bdrv_unref (bs=0x2ba2400) at block.c:4616
#9  0x00782403 in bdrv_unref (bs=0x2af96f0) at block.c:3359
ban = 
ban_next = 
child = 
next = 
#10 0x00782403 in bdrv_unref (bs=0x2af96f0) at block.c:3560
#11 0x00782403 in bdrv_unref (bs=0x2af96f0) at block.c:4616
#12 0x00785784 in block_job_remove_all_bdrv (job=job@entry=0x2f32570) at blockjob.c:200
c = 0x23bac30
l = 0x20dd330 = {0x23bac30, 0x2b89410}
#13 0x007ceb5f in mirror_exit (job=0x2f32570, opaque=0x7f326407a350) at block/mirror.c:700
s = 0x2f32570
bjob = 0x2f32570
data = 0x7f326407a350
bs_opaque = 0x30d5600
replace_aio_context = 
src = 0x2131080
target_bs = 0x2af96f0
mirror_top_bs = 0x210eb70
local_err = 0x0
#14 0x00786452 in job_defer_to_main_loop_bh (opaque=0x7f32640786a0) at job.c:973
data = 0x7f32640786a0
job = 
aio_context = 0x2109bb0
#15 0x0085fd3f in aio_bh_poll (ctx=ctx@entry=0x21044e0) at util/async.c:118
bh = 
bhp = 
next = 0x2ea86e0
ret = 1
deleted = false
#16 0x008631b0 in aio_dispatch (ctx=0x21044e0) at

Re: [Qemu-devel] [RFC 0/3] QEMU changes to do PVH boot

2018-12-05 Thread no-reply

Patchew URL: 
https://patchew.org/QEMU/1544049446-6359-1-git-send-email-liam.merw...@oracle.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#!/bin/bash
time make docker-test-mingw@fedora SHOW_ENV=1 J=8
=== TEST SCRIPT END ===

  CC  x86_64-softmmu/target/i386/hax-windows.o
  CC  x86_64-softmmu/target/i386/sev-stub.o
/tmp/qemu-test/src/hw/i386/pc.c: In function 'get_elf_note_type':
/tmp/qemu-test/src/hw/i386/pc.c:884:42: error: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'size_t {aka long long unsigned int}' [-Werror=format=]
 error_report("Note type (0x%lx) not found in ELF Note section",
~~^
%llx
/tmp/qemu-test/src/hw/i386/pc.c: In function 'read_pvh_start_addr_elf_note':
/tmp/qemu-test/src/hw/i386/pc.c:982:12: error: implicit declaration of function 'mmap'; did you mean 'max'? [-Werror=implicit-function-declaration]
 ehdr = mmap(0, statbuf.st_size,
^~~~
max
/tmp/qemu-test/src/hw/i386/pc.c:982:12: error: nested extern declaration of 'mmap' [-Werror=nested-externs]
/tmp/qemu-test/src/hw/i386/pc.c:983:9: error: 'PROT_READ' undeclared (first use in this function); did you mean 'OF_READ'?
 PROT_READ | PROT_WRITE, MAP_PRIVATE, fileno(file), 0);
 ^
 OF_READ
/tmp/qemu-test/src/hw/i386/pc.c:983:9: note: each undeclared identifier is reported only once for each function it appears in
/tmp/qemu-test/src/hw/i386/pc.c:983:21: error: 'PROT_WRITE' undeclared (first use in this function); did you mean 'OF_WRITE'?
 PROT_READ | PROT_WRITE, MAP_PRIVATE, fileno(file), 0);
 ^~
 OF_WRITE
/tmp/qemu-test/src/hw/i386/pc.c:983:33: error: 'MAP_PRIVATE' undeclared (first use in this function); did you mean 'MEM_PRIVATE'?
 PROT_READ | PROT_WRITE, MAP_PRIVATE, fileno(file), 0);
 ^~~
 MEM_PRIVATE
/tmp/qemu-test/src/hw/i386/pc.c:984:17: error: 'MAP_FAILED' undeclared (first use in this function); did you mean 'WAIT_FAILED'?
 if (ehdr == MAP_FAILED) {
 ^~
 WAIT_FAILED
/tmp/qemu-test/src/hw/i386/pc.c:1058:44: error: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'long long int' [-Werror=format=]
 error_report("ELF Nhdr offset (0x%lx) exceeds file (%s) bounds (%ld)",
  ~~^
  %llx
 (nhdr - ehdr), filename, statbuf.st_size);
 ~   
/tmp/qemu-test/src/hw/i386/pc.c:1058:75: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'long long int' [-Werror=format=]
 error_report("ELF Nhdr offset (0x%lx) exceeds file (%s) bounds (%ld)",
 ~~^
 %lld
 (nhdr - ehdr), filename, statbuf.st_size);
  ~~~   
/tmp/qemu-test/src/hw/i386/pc.c:1075:46: error: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'long long unsigned int' [-Werror=format=]
 error_report("ELF Nhdr contents (0x%lx) exceeds file bounds (%ld)",
~~^
%llx
/tmp/qemu-test/src/hw/i386/pc.c:1075:72: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'long long int' [-Werror=format=]
 error_report("ELF Nhdr contents (0x%lx) exceeds file bounds (%ld)",
  ~~^
  %lld
/tmp/qemu-test/src/hw/i386/pc.c:1077:53:
 QEMU_ALIGN_UP(nhdr_descsz, phdr_align), statbuf.st_size);
 ~~~ 
/tmp/qemu-test/src/hw/i386/pc.c:1091:46: error: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'long long unsigned int' [-Werror=format=]
 error_report("PVH ELF note addr (0x%lx) exceeds file (%s) bounds (%ld)",
~~^
%llx
 (elf_note_data_addr - (size_t)ehdr), filename, statbuf.st_size);
 ~~~
/tmp/qemu-test/src/hw/i386/pc.c:1091:77: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'long long int' [-Werror=format=]
 error_report("PVH ELF note addr
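The mingw failures above have two sources: `%lx`/`%ld` assume that `long` is as wide as `size_t`, `ptrdiff_t` and `off_t`, which does not hold on 64-bit Windows, where `long` stays 32 bits; and `mmap()` with its `PROT_*`/`MAP_*` constants comes from `<sys/mman.h>`, which the mingw build does not find. A minimal sketch of one way around both, not the fix the series eventually adopted: cast to `unsigned long long`/`long long` before printing, and read the file with stdio instead of mapping it. `format_nhdr_error()` and `read_whole_file()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from the series): build the "Nhdr offset"
 * message without assuming sizeof(long) == sizeof(size_t). Casting to
 * unsigned long long / long long and printing with %llx / %lld matches
 * what C99 specifies for those types on LP64 and LLP64 hosts alike.
 */
static int format_nhdr_error(char *buf, size_t len, size_t offset,
                             const char *filename, long long file_size)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len,
                    "ELF Nhdr offset (0x%llx) exceeds file (%s) bounds (%lld)",
                    (unsigned long long)offset, filename, file_size);
}

/*
 * Hypothetical replacement for mmap(): read the whole file into a heap
 * buffer with plain stdio. Returns NULL on error; on success *out_len
 * holds the number of bytes read and the caller must free() the buffer.
 */
static unsigned char *read_whole_file(const char *filename, size_t *out_len)
{
    FILE *f = fopen(filename, "rb");
    unsigned char *buf = NULL;
    size_t cap = 0;
    size_t used = 0;

    if (!f) {
        return NULL;
    }
    for (;;) {
        size_t n;

        if (used == cap) {
            /* Grow geometrically so large images need few reallocs. */
            size_t new_cap = cap ? cap * 2 : 4096;
            unsigned char *tmp = realloc(buf, new_cap);

            if (!tmp) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        n = fread(buf + used, 1, cap - used, f);
        used += n;
        if (n == 0) {
            break;   /* EOF or read error; ferror() tells them apart */
        }
    }
    if (ferror(f)) {
        free(buf);
        fclose(f);
        return NULL;
    }
    fclose(f);
    *out_len = used;
    return buf;
}
```

Casting at the call site keeps the format string independent of the host ABI; the standard `PRIx64` macros from `<inttypes.h>` with an explicit `(uint64_t)` cast would work equally well.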

Re: [Qemu-devel] [libvirt] [PATCH for-4.0 v4 0/2] virtio: Provide version-specific variants of virtio PCI devices

2018-12-05 Thread no-reply

Patchew URL: 
https://patchew.org/QEMU/20181205195704.17605-1-ehabk...@redhat.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Message-id: 20181205195704.17605-1-ehabk...@redhat.com
Subject: [libvirt] [PATCH for-4.0 v4 0/2] virtio: Provide version-specific variants of virtio PCI devices
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
486a758 virtio: Provide version-specific variants of virtio PCI devices
85361d9 virtio: Helper for registering virtio device types

=== OUTPUT BEGIN ===
Checking PATCH 1/2: virtio: Helper for registering virtio device types...
WARNING: line over 80 characters
#496: FILE: hw/virtio/virtio-pci.h:443:
+ * Implements both INTERFACE_PCIE_DEVICE and INTERFACE_CONVENTIONAL_PCI_DEVICE,

WARNING: line over 80 characters
#505: FILE: hw/virtio/virtio-pci.h:452:
+ * Implements both INTERFACE_PCIE_DEVICE and INTERFACE_CONVENTIONAL_PCI_DEVICE.

total: 0 errors, 2 warnings, 469 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
Checking PATCH 2/2: virtio: Provide version-specific variants of virtio PCI devices...
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#316: 
new file mode 100644

ERROR: line over 90 characters
#372: FILE: tests/acceptance/virtio_version.py:52:
+return devtype in [d['name'] for d in vm.command('qom-list-types', implements=implements)]

WARNING: line over 80 characters
#427: FILE: tests/acceptance/virtio_version.py:107:
+dev_1_0, nt_ifaces = self.run_device('%s-non-transitional' % (qemu_devtype))

WARNING: line over 80 characters
#451: FILE: tests/acceptance/virtio_version.py:131:
+dev_trans, trans_ifaces = self.run_device('%s-transitional' % (qemu_devtype))

total: 1 errors, 3 warnings, 404 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20181205195704.17605-1-ehabk...@redhat.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-de...@redhat.com

[Qemu-devel] [PATCH v6 36/37] spapr: check for KVM IRQ device activation

2018-12-05 Thread Cédric Le Goater

The KVM IRQ device activation will depend on the interrupt mode chosen
at CAS time by the machine, and some methods used at reset or by
migration need to be protected.

Signed-off-by: Cédric Le Goater 
---
 hw/intc/spapr_xive_kvm.c | 28 
 hw/intc/xics_kvm.c   | 25 -
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index dba3344831c6..6135b8c11e63 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -94,9 +94,15 @@ static void kvmppc_xive_cpu_set_state(XiveTCTX *tctx, Error **errp)
 
 void kvmppc_xive_cpu_get_state(XiveTCTX *tctx, Error **errp)
 {
+    sPAPRXive *xive = SPAPR_MACHINE(qdev_get_machine())->xive;
     uint64_t state[4] = { 0 };
     int ret;
 
+    /* The KVM XIVE device is not in use */
+    if (xive->fd == -1) {
+        return;
+    }
+
     ret = kvm_get_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
     if (ret != 0) {
         error_setg_errno(errp, errno, "Could capture KVM XIVE CPU %ld state",
@@ -132,6 +138,11 @@ void kvmppc_xive_cpu_connect(XiveTCTX *tctx, Error **errp)
     unsigned long vcpu_id;
     int ret;
 
+    /* The KVM XIVE device is not in use */
+    if (xive->fd == -1) {
+        return;
+    }
+
     /* Check if CPU was hot unplugged and replugged. */
     if (kvm_cpu_is_enabled(tctx->cs)) {
         return;
@@ -215,9 +226,13 @@ static void kvmppc_xive_source_get_state(XiveSource *xsrc)
 void kvmppc_xive_source_set_irq(void *opaque, int srcno, int val)
 {
     XiveSource *xsrc = opaque;
+    sPAPRXive *xive = SPAPR_XIVE(xsrc->xive);
     struct kvm_irq_level args;
     int rc;
 
+    /* The KVM XIVE device should be in use */
+    assert(xive->fd != -1);
+
     args.irq = srcno;
     if (!xive_source_irq_is_lsi(xsrc, srcno)) {
         if (!val) {
@@ -564,6 +579,11 @@ int kvmppc_xive_pre_save(sPAPRXive *xive)
     Error *local_err = NULL;
     CPUState *cs;
 
+    /* The KVM XIVE device is not in use */
+    if (xive->fd == -1) {
+        return 0;
+    }
+
     /* Grab the EAT */
     kvmppc_xive_get_eas_state(xive, &local_err);
     if (local_err) {
@@ -596,6 +616,9 @@ int kvmppc_xive_post_load(sPAPRXive *xive, int version_id)
     Error *local_err = NULL;
     CPUState *cs;
 
+    /* The KVM XIVE device should be in use */
+    assert(xive->fd != -1);
+
     /* Restore the ENDT first. The targetting depends on it. */
     CPU_FOREACH(cs) {
         kvmppc_xive_set_eq_state(xive, cs, &local_err);
@@ -632,6 +655,11 @@ void kvmppc_xive_synchronize_state(sPAPRXive *xive)
     XiveSource *xsrc = &xive->source;
     CPUState *cs;
 
+    /* The KVM XIVE device is not in use */
+    if (xive->fd == -1) {
+        return;
+    }
+
     /*
      * When the VM is stopped, the sources are masked and the previous
      * state is saved in anticipation of a migration. We should not
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index 2a60ae71730b..4355c9d12160 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -68,6 +68,11 @@ static void icp_get_kvm_state(ICPState *icp)
     uint64_t state;
     int ret;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return;
+    }
+
     /* ICP for this CPU thread is not in use, exiting */
     if (!icp->cs) {
         return;
@@ -104,6 +109,11 @@ static int icp_set_kvm_state(ICPState *icp, int version_id)
     uint64_t state;
     int ret;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
     /* ICP for this CPU thread is not in use, exiting */
     if (!icp->cs) {
         return 0;
@@ -140,8 +150,8 @@ static void icp_kvm_connect(ICPState *icp, Error **errp)
     unsigned long vcpu_id;
     int ret;
 
+    /* The KVM XICS device is not in use */
     if (kernel_xics_fd == -1) {
-        abort();
         return;
     }
 
@@ -220,6 +230,11 @@ static void ics_get_kvm_state(ICSState *ics)
     uint64_t state;
     int i;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return;
+    }
+
     for (i = 0; i < ics->nr_irqs; i++) {
         ICSIRQState *irq = &ics->irqs[i];
 
@@ -279,6 +294,11 @@ static int ics_set_kvm_state(ICSState *ics, int version_id)
     int i;
     Error *local_err = NULL;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
     for (i = 0; i < ics->nr_irqs; i++) {
         ICSIRQState *irq = &ics->irqs[i];
         int ret;
@@ -325,6 +345,9 @@ static void ics_kvm_set_irq(void *opaque, int srcno, int val)
     struct kvm_irq_level args;
     int rc;
 
+    /* The KVM XICS device should be in use */
+    assert(kernel_xics_fd != -1);
+
     args.irq = srcno + ics->offset;
     if (ics->irqs[srcno].flags & XICS_FLAGS_IRQ_MSI) {
         if (!val) {
-- 
2.17.2
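The patch applies one convention throughout: `xive->fd == -1` (or `kernel_xics_fd == -1`) means the KVM device for that interrupt mode is not active, so helpers reachable from reset or migration for either mode return early, while paths that should only run with an active device (`kvmppc_xive_source_set_irq()`, `ics_kvm_set_irq()`, `kvmppc_xive_post_load()`) assert instead. A sketch isolating that split, using a hypothetical `FakeKvmIrqDev` type that never calls KVM:

```c
#include <assert.h>

/*
 * Hypothetical stand-in for sPAPRXive (xive->fd) or the global
 * kernel_xics_fd: fd == -1 means the KVM device for this interrupt
 * mode was never created, or was destroyed at reset.
 */
typedef struct {
    int fd;
    int kvm_calls;   /* counts calls that would reach KVM (illustration) */
} FakeKvmIrqDev;

/* Reset/migration path: must cope with the inactive mode, so it is a no-op. */
static int fake_pre_save(FakeKvmIrqDev *dev)
{
    /* The KVM device is not in use */
    if (dev->fd == -1) {
        return 0;
    }
    dev->kvm_calls++;        /* e.g. fetch the EAT/ENDT or ICS state */
    return 0;
}

/* IRQ delivery path: only reachable when the device is active. */
static void fake_set_irq(FakeKvmIrqDev *dev, int srcno, int val)
{
    /* The KVM device should be in use */
    assert(dev->fd != -1);
    (void)srcno;
    (void)val;
    dev->kvm_calls++;        /* e.g. the KVM_IRQ_LINE ioctl */
}
```

Returning early keeps a single migration stream valid whichever mode CAS picked, while the asserts turn a mis-wired IRQ path into an immediate failure rather than a silently lost interrupt.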

[Qemu-devel] [PATCH v6 34/37] sysbus: add a sysbus_mmio_unmap() helper

2018-12-05 Thread Cédric Le Goater

This will be used to remove the MMIO regions of the POWER9 XIVE
interrupt controller when the sPAPR machine is reset.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 include/hw/sysbus.h |  1 +
 hw/core/sysbus.c| 10 ++
 2 files changed, 11 insertions(+)

diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
index 0b59a3b8d605..bc641984b5da 100644
--- a/include/hw/sysbus.h
+++ b/include/hw/sysbus.h
@@ -92,6 +92,7 @@ qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
 void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
 void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
                              int priority);
+void sysbus_mmio_unmap(SysBusDevice *dev, int n);
 void sysbus_add_io(SysBusDevice *dev, hwaddr addr,
                    MemoryRegion *mem);
 MemoryRegion *sysbus_address_space(SysBusDevice *dev);
diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
index 7ac36ad3e707..09f202167dcb 100644
--- a/hw/core/sysbus.c
+++ b/hw/core/sysbus.c
@@ -153,6 +153,16 @@ static void sysbus_mmio_map_common(SysBusDevice *dev, int n, hwaddr addr,
     }
 }
 
+void sysbus_mmio_unmap(SysBusDevice *dev, int n)
+{
+    assert(n >= 0 && n < dev->num_mmio);
+
+    if (dev->mmio[n].addr != (hwaddr)-1) {
+        memory_region_del_subregion(get_system_memory(), dev->mmio[n].memory);
+        dev->mmio[n].addr = (hwaddr)-1;
+    }
+}
+
 void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
 {
     sysbus_mmio_map_common(dev, n, addr, false, 0);
-- 
2.17.2
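`sysbus_mmio_unmap()` relies on `(hwaddr)-1` as the "not mapped" sentinel, which makes it safe to call on a region that was never mapped or was already unmapped, a useful property when the machine is reset repeatedly. The following stand-alone model counts subregion additions and deletions to show that nothing is deleted twice. The map side is modeled loosely on `sysbus_mmio_map_common()`, assuming it skips a remap to the same address and deletes any previous mapping first; all `fake_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_MMIO_UNMAPPED ((uint64_t)-1)   /* plays the role of (hwaddr)-1 */
#define FAKE_MAX_MMIO 4

typedef struct {
    uint64_t addr;   /* FAKE_MMIO_UNMAPPED when not in the address space */
    int adds;        /* memory_region_add_subregion() calls */
    int dels;        /* memory_region_del_subregion() calls */
} FakeMmio;

typedef struct {
    int num_mmio;
    FakeMmio mmio[FAKE_MAX_MMIO];
} FakeSysBusDevice;

static void fake_sysbus_init(FakeSysBusDevice *dev, int num_mmio)
{
    assert(num_mmio >= 0 && num_mmio <= FAKE_MAX_MMIO);
    dev->num_mmio = num_mmio;
    for (int i = 0; i < FAKE_MAX_MMIO; i++) {
        dev->mmio[i].addr = FAKE_MMIO_UNMAPPED;
        dev->mmio[i].adds = 0;
        dev->mmio[i].dels = 0;
    }
}

static void fake_mmio_map(FakeSysBusDevice *dev, int n, uint64_t addr)
{
    assert(n >= 0 && n < dev->num_mmio);
    if (dev->mmio[n].addr == addr) {
        return;                          /* already mapped at this address */
    }
    if (dev->mmio[n].addr != FAKE_MMIO_UNMAPPED) {
        dev->mmio[n].dels++;             /* drop the previous mapping */
    }
    dev->mmio[n].addr = addr;
    dev->mmio[n].adds++;
}

static void fake_mmio_unmap(FakeSysBusDevice *dev, int n)
{
    assert(n >= 0 && n < dev->num_mmio);
    if (dev->mmio[n].addr != FAKE_MMIO_UNMAPPED) {
        dev->mmio[n].dels++;
        dev->mmio[n].addr = FAKE_MMIO_UNMAPPED;
    }
}
```

Because the sentinel is reset on unmap, a later map call re-adds the region exactly once, which is the sequence a reset that switches interrupt modes back and forth would produce.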

[Qemu-devel] [PATCH v6 32/37] ppc/xics: introduce a icp_kvm_connect() routine

2018-12-05 Thread Cédric Le Goater

This routine gathers all the KVM initialization of the XICS KVM
presenter. It will be useful when the initialization of the KVM XICS
device is moved to a global routine.

Signed-off-by: Cédric Le Goater 
---
 hw/intc/xics_kvm.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index e8fa9a53aeba..de86e1d0b1ab 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -123,11 +123,8 @@ static void icp_kvm_reset(DeviceState *dev)
     icp_set_kvm_state(ICP(dev), 1);
 }
 
-static void icp_kvm_realize(DeviceState *dev, Error **errp)
+static void icp_kvm_connect(ICPState *icp, Error **errp)
 {
-    ICPState *icp = ICP(dev);
-    ICPStateClass *icpc = ICP_GET_CLASS(icp);
-    Error *local_err = NULL;
     CPUState *cs;
     KVMEnabledICP *enabled_icp;
     unsigned long vcpu_id;
@@ -135,11 +132,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
 
     if (kernel_xics_fd == -1) {
         abort();
-    }
-
-    icpc->parent_realize(dev, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
         return;
     }
 
@@ -168,6 +160,25 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
     QLIST_INSERT_HEAD(&kvm_enabled_icps, enabled_icp, node);
 }
 
+static void icp_kvm_realize(DeviceState *dev, Error **errp)
+{
+    ICPStateClass *icpc = ICP_GET_CLASS(dev);
+    Error *local_err = NULL;
+
+    icpc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    /* Connect the presenter to the VCPU (required for CPU hotplug) */
+    icp_kvm_connect(ICP(dev), &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
 static void icp_kvm_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
-- 
2.17.2
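Splitting `icp_kvm_realize()` this way follows QEMU's usual `Error **errp` pattern: collect into a local error, propagate and return on failure, and keep the KVM-specific step in a routine that can later be called on its own from a global init. The sketch models that with hypothetical `FakeError`/`FakeICP` types. Unlike this patch, where `icp_kvm_connect()` still calls `abort()` when `kernel_xics_fd == -1`, it returns early, as patch 36 of the series later does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal stand-in for QEMU's Error object. */
typedef struct {
    char msg[64];
} FakeError;

static void fake_error_setg(FakeError **errp, const char *msg)
{
    FakeError *err;

    if (!errp) {
        return;                          /* caller ignores errors */
    }
    err = calloc(1, sizeof(*err));
    assert(err != NULL);
    snprintf(err->msg, sizeof(err->msg), "%s", msg);
    *errp = err;
}

/* Hand a local error to the caller, or drop it if nobody is listening. */
static void fake_error_propagate(FakeError **dst, FakeError *local)
{
    if (!local) {
        return;
    }
    if (dst && !*dst) {
        *dst = local;
    } else {
        free(local);
    }
}

typedef struct {
    int kvm_fd;        /* -1 when the KVM XICS device is not in use */
    bool parent_ok;    /* hypothetical knob: make parent_realize fail */
    bool connected;    /* presenter attached to its vCPU */
} FakeICP;

static void fake_parent_realize(FakeICP *icp, FakeError **errp)
{
    if (!icp->parent_ok) {
        fake_error_setg(errp, "parent realize failed");
    }
}

/* KVM-only step, callable from realize or later from a reset-time init. */
static void fake_icp_kvm_connect(FakeICP *icp, FakeError **errp)
{
    (void)errp;
    if (icp->kvm_fd == -1) {
        return;                          /* KVM device not in use */
    }
    icp->connected = true;   /* the real code enables the in-kernel presenter */
}

static void fake_icp_kvm_realize(FakeICP *icp, FakeError **errp)
{
    FakeError *local_err = NULL;

    fake_parent_realize(icp, &local_err);
    if (local_err) {
        fake_error_propagate(errp, local_err);
        return;
    }
    fake_icp_kvm_connect(icp, &local_err);
    if (local_err) {
        fake_error_propagate(errp, local_err);
        return;
    }
}
```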

[Qemu-devel] [PATCH v6 35/37] spapr: introduce routines to delete the KVM IRQ device

2018-12-05 Thread Cédric Le Goater

If a new interrupt mode is chosen by CAS, the machine generates a
reset to reconfigure. At this point, the connection with the previous
KVM device needs to be closed and a new connection needs to be opened
with the KVM device operating the chosen interrupt mode.

New routines are introduced to destroy the XICS and the XIVE KVM
devices. They make use of a new KVM device ioctl which destroys the
device and also disconnects the IRQ presenters from the vCPUs.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_xive.h |  1 +
 include/hw/ppc/xics.h   |  1 +
 hw/intc/spapr_xive_kvm.c| 61 +
 hw/intc/xics_kvm.c  | 57 ++
 4 files changed, 120 insertions(+)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 735e916d3844..250de7fdc943 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -71,6 +71,7 @@ void spapr_xive_enable_mmio(sPAPRXive *xive, bool enable);
  * KVM XIVE device helpers
  */
 void kvmppc_xive_connect(sPAPRXive *xive, Error **errp);
+void kvmppc_xive_disconnect(sPAPRXive *xive, Error **errp);
 void kvmppc_xive_synchronize_state(sPAPRXive *xive);
 int kvmppc_xive_pre_save(sPAPRXive *xive);
 int kvmppc_xive_post_load(sPAPRXive *xive, int version_id);
diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
index 14afda198cdb..f7e5f8f9b5d7 100644
--- a/include/hw/ppc/xics.h
+++ b/include/hw/ppc/xics.h
@@ -205,6 +205,7 @@ typedef struct sPAPRMachineState sPAPRMachineState;
 void spapr_dt_xics(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
                    uint32_t phandle);
 int xics_kvm_init(sPAPRMachineState *spapr, Error **errp);
+int xics_kvm_disconnect(sPAPRMachineState *spapr, Error **errp);
 void xics_spapr_init(sPAPRMachineState *spapr);
 
 Object *icp_create(Object *cpu, const char *type, XICSFabric *xi,
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 04f997479e8f..dba3344831c6 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -57,6 +57,16 @@ static void kvm_cpu_enable(CPUState *cs)
     QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
 }
 
+static void kvm_cpu_disable_all(void)
+{
+    KVMEnabledCPU *enabled_cpu, *next;
+
+    QLIST_FOREACH_SAFE(enabled_cpu, &kvm_enabled_cpus, node, next) {
+        QLIST_REMOVE(enabled_cpu, node);
+        g_free(enabled_cpu);
+    }
+}
+
 /*
  * XIVE Thread Interrupt Management context (KVM)
  */
@@ -743,3 +753,54 @@ void kvmppc_xive_connect(sPAPRXive *xive, Error **errp)
     /* Map all regions */
     spapr_xive_map_mmio(xive);
 }
+
+void kvmppc_xive_disconnect(sPAPRXive *xive, Error **errp)
+{
+    XiveSource *xsrc;
+    struct kvm_create_device xive_destroy_device = { 0 };
+    size_t esb_len;
+    int rc;
+
+    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
+        error_setg(errp,
+                   "IRQ_XIVE capability must be present for KVM XIVE device");
+        return;
+    }
+
+    /* The KVM XIVE device is not in use */
+    if (!xive || xive->fd == -1) {
+        return;
+    }
+
+    /* Clear the KVM mapping */
+    xsrc = &xive->source;
+    esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
+
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 0);
+    munmap(xsrc->esb_mmap, esb_len);
+
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 1);
+
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 2);
+    munmap(xive->tm_mmap, 4ull << TM_SHIFT);
+
+    /* Destroy the KVM device. This also clears the VCPU presenters */
+    xive_destroy_device.fd = xive->fd;
+    xive_destroy_device.type = KVM_DEV_TYPE_XIVE;
+    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xive_destroy_device);
+    if (rc < 0) {
+        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XIVE");
+    }
+    close(xive->fd);
+    xive->fd = -1;
+
+    kvm_kernel_irqchip = false;
+    kvm_msi_via_irqfd_allowed = false;
+    kvm_gsi_direct_mapping = false;
+
+    /* Clear the local list of presenter (hotplug) */
+    kvm_cpu_disable_all();
+
+    /* VM Change state handler is not needed anymore */
+    qemu_del_vm_change_state_handler(xive->change);
+}
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index de86e1d0b1ab..2a60ae71730b 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -50,6 +50,16 @@ typedef struct KVMEnabledICP {
 static QLIST_HEAD(, KVMEnabledICP)
     kvm_enabled_icps = QLIST_HEAD_INITIALIZER(&kvm_enabled_icps);
 
+static void kvm_disable_icps(void)
+{
+    KVMEnabledICP *enabled_icp, *next;
+
+    QLIST_FOREACH_SAFE(enabled_icp, &kvm_enabled_icps, node, next) {
+        QLIST_REMOVE(enabled_icp, node);
+        g_free(enabled_icp);
+    }
+}
+
 /*
  * ICP-KVM
  */
@@ -456,6 +466,53 @@ fail:
     return -1;
 }
 
+int xics_kvm_disconnect(sPAPRMachineState *spapr, Error **errp)
+{
+    int rc;
+    struct kvm_create_device xics_create_device = {
+        .fd = kernel_xics_fd,
+        .type = KVM_DEV_TYPE_XICS,
+        .flags = 0,
+    };
+
+    /* The KVM XICS device is
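The disconnect path in this patch undoes the connect in a fixed order: unmap the ESB and TIMA regions (the ESB area being `(1ull << esb_shift) * nr_irqs` bytes), destroy the KVM device, close its fd and reset it to `-1`, then free the list of connected presenters with a safe iteration. A stand-alone sketch of that order with hypothetical types; the `KVM_DESTROY_DEVICE` ioctl is new with this series, so it is only represented by a comment. With `esb_shift = 16` and 8 IRQs, for instance, the ESB area is 2^16 * 8 = 524288 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a KVMEnabledCPU / KVMEnabledICP list node. */
typedef struct FakeEnabledCPU {
    unsigned long vcpu_id;
    struct FakeEnabledCPU *next;
} FakeEnabledCPU;

/* Hypothetical subset of the sPAPRXive / XiveSource state. */
typedef struct {
    int fd;                          /* -1 when no KVM device is attached */
    unsigned esb_shift;              /* log2 of the ESB area per IRQ */
    unsigned nr_irqs;
    FakeEnabledCPU *enabled_cpus;    /* presenters connected (hotplug) */
} FakeXive;

/* Same arithmetic as the patch: one 2^esb_shift byte ESB area per IRQ. */
static uint64_t fake_esb_len(const FakeXive *x)
{
    return (UINT64_C(1) << x->esb_shift) * x->nr_irqs;
}

static void fake_enable_cpu(FakeXive *x, unsigned long vcpu_id)
{
    FakeEnabledCPU *cpu = malloc(sizeof(*cpu));

    assert(cpu != NULL);
    cpu->vcpu_id = vcpu_id;
    cpu->next = x->enabled_cpus;
    x->enabled_cpus = cpu;
}

/* Free every node, saving 'next' first as QLIST_FOREACH_SAFE does. */
static void fake_disable_all(FakeXive *x)
{
    FakeEnabledCPU *cpu = x->enabled_cpus;

    while (cpu) {
        FakeEnabledCPU *next = cpu->next;

        free(cpu);
        cpu = next;
    }
    x->enabled_cpus = NULL;
}

/*
 * Tear-down in the order used by the patch. Returns the ESB bytes that
 * would be munmap()ed (the TIMA pages are left out), or 0 if inactive.
 */
static uint64_t fake_disconnect(FakeXive *x)
{
    uint64_t esb_len;

    if (x->fd == -1) {
        return 0;                    /* KVM device not in use */
    }
    esb_len = fake_esb_len(x);       /* 1. unmap ESB (and TIMA) pages */
    /* 2. destroy the KVM device, 3. close(fd): modelled by fd = -1 */
    x->fd = -1;
    fake_disable_all(x);             /* 4. forget connected presenters */
    return esb_len;
}
```

Calling the sketch twice is harmless because the `fd == -1` check short-circuits, which mirrors the early return at the top of `kvmppc_xive_disconnect()`.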

[Qemu-devel] [PATCH v6 37/37] spapr: add KVM support to the 'dual' machine

2018-12-05 Thread Cédric Le Goater

The interrupt mode is chosen by the CAS negotiation process and
activated after a reset to take into account the required changes in
the machine. This brings new constraints on how the associated KVM IRQ
device is initialized.

Currently, each model takes care of the initialization of the KVM
device in its realize method, but this is not possible anymore as the
initialization needs to be done globally when the interrupt mode is
known, i.e. when the machine is reset. It also means that we need a way
to delete a KVM device when another mode is chosen.

Also, to support migration, the QEMU objects holding the state to
transfer should always be available but not necessarily activated.

The overall approach of this proposal is to initialize both interrupt
modes at the QEMU level and keep the IRQ number space in sync to allow
switching from one mode to another. For the KVM side of things, the
whole initialization of the KVM device, sources and presenters, is
grouped in a single routine. The XICS and XIVE sPAPR IRQ reset
handlers are modified accordingly to handle the init and the delete
sequences of the KVM device. The post_load handlers are also modified,
to take into account a possible change of interrupt mode after transfer.

As KVM is now initialized at reset, we lose the possibility to
fall back to the QEMU emulated mode in case of failure, and failures
become fatal to the machine.

Signed-off-by: Cédric Le Goater 
---
 hw/intc/spapr_xive.c |  8 +
 hw/intc/spapr_xive_kvm.c | 26 +++
 hw/intc/xics_kvm.c   | 25 +++
 hw/intc/xive.c   |  4 ---
 hw/ppc/spapr_irq.c   | 68 +++-
 5 files changed, 98 insertions(+), 33 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 68d2a6fd8177..cdbcf27f9544 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -331,13 +331,7 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
 xive->eat = g_new0(XiveEAS, xive->nr_irqs);
 xive->endt = g_new0(XiveEND, xive->nr_ends);
 
-if (kvmppc_xive_enabled()) {
-kvmppc_xive_connect(xive, &local_err);
-if (local_err) {
-error_propagate(errp, local_err);
-return;
-}
-} else {
+if (!kvmppc_xive_enabled()) {
 /* TIMA initialization */
 memory_region_init_io(>tm_mmio, OBJECT(xive), _tm_ops, xive,
   "xive.tima", 4ull << TM_SHIFT);
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 6135b8c11e63..d7d499db1b5d 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -712,6 +712,14 @@ void kvmppc_xive_connect(sPAPRXive *xive, Error **errp)
 Error *local_err = NULL;
 size_t esb_len;
 size_t tima_len;
+CPUState *cs;
+
+/* The KVM XIVE device is already in use. This is the case when
+ * rebooting XIVE -> XIVE
+ */
+if (xive->fd != -1) {
+return;
+}
 
 if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
 error_setg(errp,
@@ -774,6 +782,24 @@ void kvmppc_xive_connect(sPAPRXive *xive, Error **errp)
 xive->change = qemu_add_vm_change_state_handler(
 kvmppc_xive_change_state_handler, xive);
 
+/* Connect the presenters to the initial VCPUs of the machine */
+CPU_FOREACH(cs) {
+PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+kvmppc_xive_cpu_connect(XIVE_TCTX(cpu->intc), &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+
+/* Update the KVM sources */
+kvmppc_xive_source_reset(xsrc, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
 kvm_kernel_irqchip = true;
 kvm_msi_via_irqfd_allowed = true;
 kvm_gsi_direct_mapping = true;
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index 4355c9d12160..393e4e0bd79c 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -431,6 +431,15 @@ static void rtas_dummy(PowerPCCPU *cpu, sPAPRMachineState *spapr,
 int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
 {
 int rc;
+CPUState *cs;
+Error *local_err = NULL;
+
+/* The KVM XICS device is already in use. This is the case when
+ * rebooting XICS -> XICS
+ */
+if (kernel_xics_fd != -1) {
+return 0;
+}
 
 if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
 error_setg(errp,
@@ -479,6 +488,22 @@ int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
 kvm_msi_via_irqfd_allowed = true;
 kvm_gsi_direct_mapping = true;
 
+/* Connect the presenters to the initial VCPUs of the machine */
+CPU_FOREACH(cs) {
+PowerPCCPU *cpu = POWERPC_CPU(cs);
+ICPState *icp = ICP(cpu->intc);
+
+icp_kvm_connect(icp, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+goto fail;
+}
+icp_set_kvm_state(icp, 1);
+}
+
+/* Update the KVM sources
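
The 37/37 description above explains connecting the KVM IRQ device at machine reset, skipping the connect when the device already exists (the `fd != -1` checks added to `kvmppc_xive_connect()` and `xics_kvm_init()`), and deleting the device of the mode that was not chosen. A minimal, KVM-free model of that sequencing follows; `struct kvm_irq_dev`, `dev_connect()`, `dev_disconnect()` and `irq_reset()` are hypothetical names, and the fake fd counter stands in for the real device creation:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one KVM IRQ device: fd == -1 means "not created". */
struct kvm_irq_dev {
    int fd;
    int creations;          /* how many times the device was created */
};

static int next_fd = 10;    /* fake fd allocator standing in for KVM */

/* Idempotent connect: an existing device is reused across resets. */
static void dev_connect(struct kvm_irq_dev *d)
{
    if (d->fd != -1) {
        return;
    }
    d->fd = next_fd++;
    d->creations++;
}

/* Delete the device if present; the real code would destroy and close it. */
static void dev_disconnect(struct kvm_irq_dev *d)
{
    if (d->fd == -1) {
        return;
    }
    d->fd = -1;
}

/* Machine reset, once CAS has settled the interrupt mode. */
static void irq_reset(bool xive_mode, struct kvm_irq_dev *xics,
                      struct kvm_irq_dev *xive)
{
    if (xive_mode) {
        dev_disconnect(xics);
        dev_connect(xive);
    } else {
        dev_disconnect(xive);
        dev_connect(xics);
    }
}
```

Rebooting XICS -> XICS leaves `creations` at 1, which is the purpose of the early return in the patch; switching modes tears down one device and creates the other.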

[Qemu-devel] [PATCH v6 29/37] spapr: set the interrupt presenter at reset

2018-12-05 Thread Cédric Le Goater

Currently, the interrupt presenter of the vCPU is set at realize
time. Setting it at reset will become useful when the new machine
supporting both interrupt modes is introduced. In this machine, the
interrupt mode is chosen at CAS time and activated after a reset.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_cpu_core.h |  2 ++
 hw/ppc/spapr_cpu_core.c | 26 ++
 hw/ppc/spapr_irq.c  | 11 +++
 3 files changed, 39 insertions(+)

diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
index 9e2821e4b31f..fc8ea9021656 100644
--- a/include/hw/ppc/spapr_cpu_core.h
+++ b/include/hw/ppc/spapr_cpu_core.h
@@ -53,4 +53,6 @@ static inline sPAPRCPUState *spapr_cpu_state(PowerPCCPU *cpu)
 return (sPAPRCPUState *)cpu->machine_data;
 }
 
+void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type);
+
 #endif
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 1811cd48db90..529de0b6b9c8 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -398,3 +398,29 @@ static const TypeInfo spapr_cpu_core_type_infos[] = {
 };
 
 DEFINE_TYPES(spapr_cpu_core_type_infos)
+
+typedef struct ForeachFindIntCArgs {
+const char *intc_type;
+Object *intc;
+} ForeachFindIntCArgs;
+
+static int spapr_cpu_core_find_intc(Object *child, void *opaque)
+{
+ForeachFindIntCArgs *args = opaque;
+
+if (object_dynamic_cast(child, args->intc_type)) {
+args->intc = child;
+}
+
+return args->intc != NULL;
+}
+
+void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type)
+{
+ForeachFindIntCArgs args = { intc_type, NULL };
+
+object_child_foreach(OBJECT(cpu), spapr_cpu_core_find_intc, &args);
+g_assert(args.intc);
+
+cpu->intc = args.intc;
+}
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 951d4ff1350a..f9b5564b271c 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -211,6 +211,11 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
 
 static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
 {
+CPUState *cs;
+
+CPU_FOREACH(cs) {
+spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
+}
 }
 
 #define SPAPR_IRQ_XICS_NR_IRQS 0x1000
@@ -349,6 +354,12 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
 
 static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
 {
+CPUState *cs;
+
+CPU_FOREACH(cs) {
+spapr_cpu_core_set_intc(POWERPC_CPU(cs), TYPE_XIVE_TCTX);
+}
+
 /*
  * Set the OS CAM line of the cpu interrupt thread context. Needs
  * to come after the XiveTCTX reset handlers.
-- 
2.17.2
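
`spapr_cpu_core_set_intc()` relies on the callback's return value to stop `object_child_foreach()` at the first child of the requested type. A standalone sketch of that find-first pattern, assuming (as the patch does) that a nonzero return ends the walk; `struct child`, `child_foreach()` and `find_by_type()` are hypothetical, the type strings are made up, and the exact string compare is a simplification of `object_dynamic_cast()`, which also matches subtypes:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a QOM child object: only its type name matters. */
struct child {
    const char *type;
};

/* Callback shape modelled on object_child_foreach(): nonzero stops the walk. */
typedef int (*child_fn)(struct child *c, void *opaque);

static int child_foreach(struct child *children, size_t n, child_fn fn,
                         void *opaque)
{
    for (size_t i = 0; i < n; i++) {
        int ret = fn(&children[i], opaque);

        if (ret) {
            return ret;     /* stop at the first nonzero return */
        }
    }
    return 0;
}

struct find_args {
    const char *type;
    struct child *found;
};

/* Record the first child whose type matches, then ask the walk to stop. */
static int find_by_type(struct child *c, void *opaque)
{
    struct find_args *args = opaque;

    if (strcmp(c->type, args->type) == 0) {
        args->found = c;
    }
    return args->found != NULL;
}
```

As in the patch, the caller then asserts that something was found before using it.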

[Qemu-devel] [PATCH v6 33/37] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers

2018-12-05 Thread Cédric Le Goater

Removing RTAS handlers will become necessary when the new pseries
machine supporting multiple interrupt modes is introduced.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr.h | 4 
 hw/ppc/spapr_rtas.c| 2 +-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index daced428a42c..ca38b9d9c046 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -649,6 +649,10 @@ typedef void (*spapr_rtas_fn)(PowerPCCPU *cpu, sPAPRMachineState *sm,
   uint32_t nargs, target_ulong args,
   uint32_t nret, target_ulong rets);
 void spapr_rtas_register(int token, const char *name, spapr_rtas_fn fn);
+static inline void spapr_rtas_unregister(int token)
+{
+spapr_rtas_register(token, NULL, NULL);
+}
 target_ulong spapr_rtas_call(PowerPCCPU *cpu, sPAPRMachineState *sm,
  uint32_t token, uint32_t nargs, target_ulong args,
  uint32_t nret, target_ulong rets);
diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index d6a0952154ac..e005d5d08151 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -404,7 +404,7 @@ void spapr_rtas_register(int token, const char *name, spapr_rtas_fn fn)
 
 token -= RTAS_TOKEN_BASE;
 
-assert(!rtas_table[token].name);
+assert(!name || !rtas_table[token].name);
 
 rtas_table[token].name = name;
 rtas_table[token].fn = fn;
-- 
2.17.2
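
The relaxed assertion in `spapr_rtas_register()` lets a NULL name overwrite an occupied slot, which is how `spapr_rtas_unregister()` works, while still catching a second registration on the same token. A minimal sketch of that rule with a small hypothetical table (zero-based tokens, no `RTAS_TOKEN_BASE` offset, made-up call name):

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8        /* hypothetical size for the sketch */

typedef void (*rtas_fn)(void);

struct rtas_entry {
    const char *name;
    rtas_fn fn;
};

static struct rtas_entry table[TABLE_SIZE];

/* Registering a name on a used slot is a bug; a NULL name clears the slot. */
static void rtas_register(int token, const char *name, rtas_fn fn)
{
    assert(token >= 0 && token < TABLE_SIZE);
    assert(!name || !table[token].name);

    table[token].name = name;
    table[token].fn = fn;
}

static inline void rtas_unregister(int token)
{
    rtas_register(token, NULL, NULL);
}

static void dummy_call(void)
{
}
```

Unregistering and then registering again succeeds, whereas calling `rtas_register()` twice with a name on the same token would still trip the assertion.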

[Qemu-devel] [PATCH v6 24/37] spapr/xive: add KVM support

2018-12-05 Thread Cédric Le Goater

This introduces a set of helpers to activate the KVM XIVE device when
KVM is in use.

They handle the initialization of the TIMA and the source ESB memory
regions, which have a different type under KVM. These are 'ram device'
memory mappings, similar to VFIO, exposed to the guest; the
associated VMAs on the host are populated dynamically with the
appropriate pages using a fault handler.

Signed-off-by: Cédric Le Goater 
---
 default-configs/ppc64-softmmu.mak |   1 +
 include/hw/ppc/spapr_xive.h   |  10 ++
 include/hw/ppc/xive.h |  20 +++
 target/ppc/kvm_ppc.h  |   6 +
 hw/intc/spapr_xive.c  |  31 ++--
 hw/intc/spapr_xive_kvm.c  | 253 ++
 hw/intc/xive.c|  30 +++-
 hw/ppc/spapr.c|   7 +-
 hw/ppc/spapr_irq.c|   9 --
 target/ppc/kvm.c  |   7 +
 hw/intc/Makefile.objs |   1 +
 11 files changed, 349 insertions(+), 26 deletions(-)
 create mode 100644 hw/intc/spapr_xive_kvm.c

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index 7f34ad0528ed..c1bf5cd951f5 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -18,6 +18,7 @@ CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_XIVE=$(CONFIG_PSERIES)
 CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
+CONFIG_XIVE_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_MEM_DEVICE=y
 CONFIG_DIMM=y
 CONFIG_SPAPR_RNG=y
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 7244a6231ce6..ced187ee49e5 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -35,6 +35,10 @@ typedef struct sPAPRXive {
 /* TIMA mapping address */
 hwaddrtm_base;
 MemoryRegion  tm_mmio;
+
+/* KVM support */
+int   fd;
+void  *tm_mmap;
 } sPAPRXive;
 
 bool spapr_xive_irq_claim(sPAPRXive *xive, uint32_t lisn, bool lsi);
@@ -48,5 +52,11 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr);
 void spapr_dt_xive(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
uint32_t phandle);
 void spapr_xive_reset_tctx(sPAPRXive *xive);
+void spapr_xive_map_mmio(sPAPRXive *xive);
+
+/*
+ * KVM XIVE device helpers
+ */
+void kvmppc_xive_connect(sPAPRXive *xive, Error **errp);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 60c335ce0e1e..3684d8e4f6be 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -140,6 +140,7 @@
 #ifndef PPC_XIVE_H
 #define PPC_XIVE_H
 
+#include "sysemu/kvm.h"
 #include "hw/qdev-core.h"
 #include "hw/sysbus.h"
 #include "hw/ppc/xive_regs.h"
@@ -195,6 +196,9 @@ typedef struct XiveSource {
 uint32_tesb_shift;
 MemoryRegionesb_mmio;
 
+/* KVM support */
+void*esb_mmap;
+
 XiveNotifier*xive;
 } XiveSource;
 
@@ -428,4 +432,20 @@ static inline uint32_t xive_nvt_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
 return (nvt_blk << 19) | nvt_idx;
 }
 
+/*
+ * KVM XIVE device helpers
+ */
+
+/* Keep inlined to discard compile of KVM code sections */
+static inline bool kvmppc_xive_enabled(void)
+{
+MachineState *machine = MACHINE(qdev_get_machine());
+
+return kvm_enabled() && machine_kernel_irqchip_allowed(machine);
+}
+
+void kvmppc_xive_source_reset(XiveSource *xsrc, Error **errp);
+void kvmppc_xive_source_set_irq(void *opaque, int srcno, int val);
+void kvmppc_xive_cpu_connect(XiveTCTX *tctx, Error **errp);
+
 #endif /* PPC_XIVE_H */
diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
index bdfaa4e70a83..d2159660f9f2 100644
--- a/target/ppc/kvm_ppc.h
+++ b/target/ppc/kvm_ppc.h
@@ -59,6 +59,7 @@ bool kvmppc_has_cap_fixup_hcalls(void);
 bool kvmppc_has_cap_htm(void);
 bool kvmppc_has_cap_mmu_radix(void);
 bool kvmppc_has_cap_mmu_hash_v3(void);
+bool kvmppc_has_cap_xive(void);
 int kvmppc_get_cap_safe_cache(void);
 int kvmppc_get_cap_safe_bounds_check(void);
 int kvmppc_get_cap_safe_indirect_branch(void);
@@ -307,6 +308,11 @@ static inline bool kvmppc_has_cap_mmu_hash_v3(void)
 return false;
 }
 
+static inline bool kvmppc_has_cap_xive(void)
+{
+return false;
+}
+
 static inline int kvmppc_get_cap_safe_cache(void)
 {
 return 0;
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 3cddc9332acb..256108914001 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -174,7 +174,7 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 }
 }
 
-static void spapr_xive_map_mmio(sPAPRXive *xive)
+void spapr_xive_map_mmio(sPAPRXive *xive)
 {
 sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->vc_base);
 sysbus_mmio_map(SYS_BUS_DEVICE(xive), 1, xive->end_base);
@@ -250,6 +250,9 @@ static void spapr_xive_instance_init(Object *obj)
   TYPE_XIVE_END_SOURCE);
 object_property_add_child(obj, "end_source", OBJECT(>end_source),
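
The `kvmppc_has_cap_xive()` stub in `kvm_ppc.h` and the inline `kvmppc_xive_enabled()` ("Keep inlined to discard compile of KVM code sections") follow a common C pattern: when KVM support is not built in, the helper is an inline function returning a constant, so the compiler can see that the KVM-only branch is dead. A sketch of the pattern, assuming a made-up `CONFIG_KVM_SKETCH` macro and hypothetical function names:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical header pattern: a real declaration when KVM support is
 * built in, an always-false inline stub otherwise.
 */
#ifdef CONFIG_KVM_SKETCH
bool has_cap_xive(void);
#else
static inline bool has_cap_xive(void)
{
    return false;
}
#endif

static int kvm_path_taken;

/* With the stub, the condition is a constant false and the branch can be dropped. */
static int connect_irqchip(void)
{
    if (has_cap_xive()) {
        kvm_path_taken = 1;     /* would call into KVM-only code */
        return 1;
    }
    return 0;                   /* emulated path */
}
```

Because the stub is visible at every call site, no KVM symbol needs to be linked when the KVM configuration is off.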

[Qemu-devel] [PATCH v6 26/37] spapr/xive: introduce a VM state change handler

2018-12-05 Thread Cédric Le Goater

This handler is in charge of stabilizing the flow of event notifications
in the XIVE controller before migrating a guest. This is a requirement
before transferring the guest EQ pages to a destination.

When the VM is stopped, the handler masks the sources (PQ=01) to stop
the flow of events and saves their previous state. The XIVE controller
is then synced through KVM to flush any in-flight event notification
and to stabilize the EQs. At this stage, the EQ pages are marked dirty
to make sure the EQ pages are transferred if a migration sequence is
in progress.

The previous configuration of the sources is restored when the VM
resumes, after a migration or a stop.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_xive.h |   1 +
 hw/intc/spapr_xive_kvm.c| 111 +++-
 2 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index bd81bb4d7608..c447534b6b29 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -39,6 +39,7 @@ typedef struct sPAPRXive {
 /* KVM support */
 int   fd;
 void  *tm_mmap;
+VMChangeStateEntry *change;
 } sPAPRXive;
 
 bool spapr_xive_irq_claim(sPAPRXive *xive, uint32_t lisn, bool lsi);
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 7cdf08ca368c..5cb7461e9743 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -334,12 +334,118 @@ static void kvmppc_xive_get_eas_state(sPAPRXive *xive, Error **errp)
 }
 }
 
+/*
+ * Sync the XIVE controller through KVM to flush any in-flight event
+ * notification and stabilize the EQs.
+ */
+static void kvmppc_xive_sync_all(sPAPRXive *xive, Error **errp)
+{
+XiveSource *xsrc = &xive->source;
+Error *local_err = NULL;
+int i;
+
+/* Sync the KVM source. This reaches the XIVE HW through OPAL */
+for (i = 0; i < xsrc->nr_irqs; i++) {
+XiveEAS *eas = &xive->eat[i];
+
+if (!xive_eas_is_valid(eas)) {
+continue;
+}
+
+kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SYNC, i, NULL, true,
+  &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+}
+
+/*
+ * The primary goal of the XIVE VM change handler is to mark the EQ
+ * pages dirty when all XIVE event notifications have stopped.
+ *
+ * Whenever the VM is stopped, the VM change handler masks the sources
+ * (PQ=01) to stop the flow of events and saves the previous state in
+ * anticipation of a migration. The XIVE controller is then synced
+ * through KVM to flush any in-flight event notification and stabilize
+ * the EQs.
+ *
+ * At this stage, we can mark the EQ pages dirty and let a migration
+ * sequence transfer the EQ pages to the destination, which is done
+ * just after the stop state.
+ *
+ * The previous configuration of the sources is restored when the VM
+ * runs again.
+ */
+static void kvmppc_xive_change_state_handler(void *opaque, int running,
+ RunState state)
+{
+sPAPRXive *xive = opaque;
+XiveSource *xsrc = &xive->source;
+Error *local_err = NULL;
+int i;
+
+/*
+ * Restore the sources to their initial state. This is called when
+ * the VM resumes after a stop or a migration.
+ */
+if (running) {
+for (i = 0; i < xsrc->nr_irqs; i++) {
+uint8_t pq = xive_source_esb_get(xsrc, i);
+if (xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8)) != 0x1) {
+error_report("XIVE: IRQ %d has an invalid state", i);
+}
+}
+
+return;
+}
+
+/*
+ * Mask the sources, to stop the flow of event notifications, and
+ * save the PQs locally in the XiveSource object. The XiveSource
+ * state will be collected later on by its vmstate handler if a
+ * migration is in progress.
+ */
+for (i = 0; i < xsrc->nr_irqs; i++) {
+uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_01);
+xive_source_esb_set(xsrc, i, pq);
+}
+
+/*
+ * Sync the XIVE controller in KVM, to flush in-flight event
+ * notification that should be enqueued in the EQs.
+ */
+kvmppc_xive_sync_all(xive, &local_err);
+if (local_err) {
+error_report_err(local_err);
+return;
+}
+
+/*
+ * Mark the XIVE EQ pages dirty to collect all updates.
+ */
+kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL,
+  KVM_DEV_XIVE_SAVE_EQ_PAGES, NULL, true, &local_err);
+if (local_err) {
+error_report_err(local_err);
+}
+}
+
 void kvmppc_xive_synchronize_state(sPAPRXive *xive)
 {
 XiveSource *xsrc = &xive->source;
 CPUState *cs;
 
-kvmppc_xive_source_get_state(xsrc);
+/*
+ * When the VM is stopped, the sources are masked and the previous
+ * state is saved in anticipation of a migration. We should not
+ * synchronize the source state
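
The VM change state handler from 26/37 depends on the ESB load semantics: a load from a `SET_PQ_xx` page returns the previous PQ bits and installs new ones, so the handler can mask a source and save its state in one access, then restore it later and check that the source was still masked (PQ=01). A toy model of that round trip follows; the offsets follow the `XIVE_ESB_SET_PQ_00 + (pq << 8)` arithmetic in the patch, but the constants, arrays and function names here are assumptions for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page offsets: SET_PQ_00..11 are 0xc00..0xf00, one page per PQ value. */
#define ESB_SET_PQ_00 0xc00
#define ESB_SET_PQ_01 0xd00

#define NR_IRQS 4

static uint8_t hw_pq[NR_IRQS];      /* PQ bits held by the "hardware" */
static uint8_t saved_pq[NR_IRQS];   /* PQ bits saved while the VM is stopped */

/* Load from a SET_PQ page: return the old PQ, install the page's PQ value. */
static uint8_t esb_load(int srcno, uint32_t offset)
{
    uint8_t old = hw_pq[srcno];

    hw_pq[srcno] = (uint8_t)((offset >> 8) & 0x3);
    return old;
}

/* VM stops: mask every source (PQ=01) and remember its previous PQ. */
static void vm_stop(void)
{
    for (int i = 0; i < NR_IRQS; i++) {
        saved_pq[i] = esb_load(i, ESB_SET_PQ_01);
    }
}

/* VM resumes: restore the saved PQ; count sources that were not still 01. */
static int vm_resume(void)
{
    int bad = 0;

    for (int i = 0; i < NR_IRQS; i++) {
        if (esb_load(i, ESB_SET_PQ_00 + (saved_pq[i] << 8)) != 0x1) {
            bad++;
        }
    }
    return bad;
}
```

A nonzero count corresponds to the "IRQ %d has an invalid state" error in the handler.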

[Qemu-devel] [PATCH v6 31/37] spapr: add a 'pseries-3.1-dual' machine type

2018-12-05 Thread Cédric Le Goater

This pseries machine makes use of a new sPAPR IRQ backend supporting
both interrupt modes, XIVE and XICS, the default being XICS.

The interrupt mode is chosen by the CAS negotiation process and
activated after a reset to take into account the required changes in
the machine. These impact the device tree layout, the interrupt
presenter object and the exposed MMIO regions in the case of XIVE.

KVM is not yet supported.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_irq.h |   1 +
 hw/ppc/spapr.c |  18 -
 hw/ppc/spapr_hcall.c   |  13 
 hw/ppc/spapr_irq.c | 142 +
 4 files changed, 173 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 26727a7263a5..af429148ce2d 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -51,6 +51,7 @@ typedef struct sPAPRIrq {
 extern sPAPRIrq spapr_irq_xics;
 extern sPAPRIrq spapr_irq_xics_legacy;
 extern sPAPRIrq spapr_irq_xive;
+extern sPAPRIrq spapr_irq_dual;
 
 void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 3cdc66484f42..232956116518 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2634,7 +2634,8 @@ static void spapr_machine_init(MachineState *machine)
 spapr_ovec_set(spapr->ov5, OV5_DRMEM_V2);
 
 /* advertise XIVE */
-if (smc->irq->ov5 == SPAPR_OV5_XIVE_EXPLOIT) {
+if (smc->irq->ov5 == SPAPR_OV5_XIVE_EXPLOIT ||
+smc->irq->ov5 == SPAPR_OV5_XIVE_BOTH) {
 spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
 }
 
@@ -4002,6 +4003,21 @@ static void spapr_machine_3_1_xive_class_options(MachineClass *mc)
 
 DEFINE_SPAPR_MACHINE(3_1_xive, "3.1-xive", false);
 
+static void spapr_machine_3_1_dual_instance_options(MachineState *machine)
+{
+spapr_machine_3_1_instance_options(machine);
+}
+
+static void spapr_machine_3_1_dual_class_options(MachineClass *mc)
+{
+sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
+spapr_machine_3_1_class_options(mc);
+smc->irq = &spapr_irq_dual;
+}
+
+DEFINE_SPAPR_MACHINE(3_1_dual, "3.1-dual", false);
+
 /*
  * pseries-3.0
  */
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index ae913d070f50..186b6a65543f 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1654,6 +1654,19 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
 (spapr_h_cas_compose_response(spapr, args[1], args[2],
   ov5_updates) != 0);
 }
+
+/*
+ * Generate a machine reset when we have an update of the
+ * interrupt mode. Only required on the machine supporting both
+ * modes.
+ */
+if (!spapr->cas_reboot) {
+sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+
+spapr->cas_reboot = spapr_ovec_test(ov5_updates, OV5_XIVE_EXPLOIT)
+&& smc->irq->ov5 == SPAPR_OV5_XIVE_BOTH;
+}
+
 spapr_ovec_cleanup(ov5_updates);
 
 if (spapr->cas_reboot) {
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 94bb4d27758a..157c335f6f8d 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -399,6 +399,148 @@ sPAPRIrq spapr_irq_xive = {
 .reset   = spapr_irq_reset_xive,
 };
 
+/*
+ * Dual XIVE and XICS IRQ backend.
+ *
+ * The objects of both interrupt modes, XIVE and XICS, are created but
+ * the machine starts in legacy interrupt mode (XICS). It can be
+ * changed by the CAS negotiation process and, in that case, the new
+ * mode is activated after an extra machine reset.
+ */
+
+/*
+ * Returns the sPAPR IRQ backend negotiated by CAS. XICS is the
+ * default.
+ */
+static sPAPRIrq *spapr_irq_current(sPAPRMachineState *spapr)
+{
+return spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT) ?
+&spapr_irq_xive : &spapr_irq_xics;
+}
+
+static void spapr_irq_init_dual(sPAPRMachineState *spapr, int nr_irqs,
+Error **errp)
+{
+MachineState *machine = MACHINE(spapr);
+Error *local_err = NULL;
+
+if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+error_setg(errp, "No KVM support for the 'dual' machine");
+return;
+}
+
+spapr_irq_xics.init(spapr, spapr_irq_xics.nr_irqs, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+
+spapr_irq_xive.init(spapr, spapr_irq_xive.nr_irqs, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+
+static int spapr_irq_claim_dual(sPAPRMachineState *spapr, int irq, bool lsi,
+Error **errp)
+{
+int ret;
+Error *local_err = NULL;
+
+ret = spapr_irq_xive.claim(spapr, irq, lsi, &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return ret;
+}
+
+ret = spapr_irq_xics.claim(spapr, irq, lsi, &local_err);
+if (local_err) {
+

[Qemu-devel] [PATCH v6 30/37] spapr/xive: enable XIVE MMIOs at reset

2018-12-05 Thread Cédric Le Goater

Depending on the interrupt mode chosen, enable or disable the XIVE
MMIOs.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_xive.h | 1 +
 hw/intc/spapr_xive.c| 9 +
 hw/ppc/spapr_irq.c  | 8 
 3 files changed, 18 insertions(+)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 21eeb4d5..735e916d3844 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -65,6 +65,7 @@ void spapr_dt_xive(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
uint32_t phandle);
 void spapr_xive_reset_tctx(sPAPRXive *xive);
 void spapr_xive_map_mmio(sPAPRXive *xive);
+void spapr_xive_enable_mmio(sPAPRXive *xive, bool enable);
 
 /*
  * KVM XIVE device helpers
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index b030cfe7f136..68d2a6fd8177 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -196,6 +196,15 @@ void spapr_xive_map_mmio(sPAPRXive *xive)
 sysbus_mmio_map(SYS_BUS_DEVICE(xive), 2, xive->tm_base);
 }
 
+void spapr_xive_enable_mmio(sPAPRXive *xive, bool enable)
+{
+memory_region_set_enabled(&xive->source.esb_mmio, enable);
+memory_region_set_enabled(&xive->tm_mmio, enable);
+
+/* Disable the END ESBs until a guest OS makes use of them */
+memory_region_set_enabled(&xive->end_source.esb_mmio, false);
+}
+
 /*
  * When a Virtual Processor is scheduled to run on a HW thread, the
  * hypervisor pushes its identifier in the OS CAM line. Emulate the
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index f9b5564b271c..94bb4d27758a 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -216,6 +216,11 @@ static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
 CPU_FOREACH(cs) {
 spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
 }
+
+/* Deactivate the XIVE MMIOs */
+if (spapr->xive) {
+spapr_xive_enable_mmio(spapr->xive, false);
+}
 }
 
 #define SPAPR_IRQ_XICS_NR_IRQS 0x1000
@@ -365,6 +370,9 @@ static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
  * to come after the XiveTCTX reset handlers.
  */
 spapr_xive_reset_tctx(spapr->xive);
+
+/* Activate the XIVE MMIOs */
+spapr_xive_enable_mmio(spapr->xive, true);
 }
 
 /*
-- 
2.17.2
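
A reduced model of `spapr_xive_enable_mmio()` as wired into the reset handlers: the XICS reset turns the XIVE regions off, the XIVE reset turns them on, and the END ESB region stays disabled either way. The struct of booleans is a hypothetical stand-in for the three `MemoryRegion`s:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags standing in for the three XIVE MemoryRegions. */
struct xive_mmio {
    bool source_esb;    /* interrupt source ESB pages */
    bool tima;          /* thread interrupt management area */
    bool end_esb;       /* END ESB pages */
};

/* Same shape as spapr_xive_enable_mmio(): END ESBs always start disabled. */
static void enable_mmio(struct xive_mmio *m, bool enable)
{
    m->source_esb = enable;
    m->tima = enable;
    m->end_esb = false;
}
```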

[Qemu-devel] [PATCH v6 21/37] spapr: add a 'reset' method to the sPAPR IRQ backend

2018-12-05 Thread Cédric Le Goater

For the time being, the XIVE reset handler updates the OS CAM line of
the vCPU as it is done under a real hypervisor when a vCPU is
scheduled to run on a HW thread.

This method will become even more useful when the machine supporting
both interrupt modes, XIVE and XICS, is introduced. In this machine,
the interrupt mode is chosen by the CAS negotiation process and
activated after a reset.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_irq.h  |  2 ++
 include/hw/ppc/spapr_xive.h |  1 +
 hw/intc/spapr_xive.c| 24 
 hw/ppc/spapr.c  |  5 +
 hw/ppc/spapr_irq.c  | 25 +
 5 files changed, 57 insertions(+)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 91ac5784919c..bdb1c66125c9 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -44,6 +44,7 @@ typedef struct sPAPRIrq {
 Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
Error **errp);
 int (*post_load)(sPAPRMachineState *spapr, int version_id);
+void (*reset)(sPAPRMachineState *spapr, Error **errp);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
@@ -55,6 +56,7 @@ int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
 int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id);
+void spapr_irq_reset(sPAPRMachineState *spapr, Error **errp);
 
 /*
  * XICS legacy routines
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 728a5e8dc163..7244a6231ce6 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -47,5 +47,6 @@ typedef struct sPAPRMachineState sPAPRMachineState;
 void spapr_xive_hcall_init(sPAPRMachineState *spapr);
 void spapr_dt_xive(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
uint32_t phandle);
+void spapr_xive_reset_tctx(sPAPRXive *xive);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index fd02dc6b91e4..3cddc9332acb 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -181,6 +181,30 @@ static void spapr_xive_map_mmio(sPAPRXive *xive)
 sysbus_mmio_map(SYS_BUS_DEVICE(xive), 2, xive->tm_base);
 }
 
+/*
+ * When a Virtual Processor is scheduled to run on a HW thread, the
+ * hypervisor pushes its identifier in the OS CAM line. Emulate the
+ * same behavior under QEMU.
+ */
+void spapr_xive_reset_tctx(sPAPRXive *xive)
+{
+CPUState *cs;
+uint8_t  nvt_blk;
+uint32_t nvt_idx;
+uint32_t nvt_cam;
+
+CPU_FOREACH(cs) {
+PowerPCCPU *cpu = POWERPC_CPU(cs);
+XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
+
+spapr_xive_cpu_to_nvt(xive, cpu, &nvt_blk, &nvt_idx);
+
+nvt_cam = cpu_to_be32(TM_QW1W2_VO |
+  xive_nvt_cam_line(nvt_blk, nvt_idx));
+memcpy(&tctx->regs[TM_QW1_OS + TM_WORD2], &nvt_cam, 4);
+}
+}
+
 static void spapr_xive_end_reset(XiveEND *end)
 {
 memset(end, 0, sizeof(*end));
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 8911465e32cf..530aee8d143d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1621,6 +1621,11 @@ static void spapr_machine_reset(void)
 
 qemu_devices_reset();
 
+/* This is fixing some of the default configuration of the XIVE
+ * devices. To be called after the reset of the machine devices.
+ */
+spapr_irq_reset(spapr, _fatal);
+
 /* DRC reset may cause a device to be unplugged. This will cause troubles
  * if this device is used by another device (eg, a running vhost backend
  * will crash QEMU if the DIMM holding the vring goes away). To avoid such
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 8943e28fc11b..58ce124c1501 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -13,6 +13,7 @@
 #include "qapi/error.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/spapr_cpu_core.h"
 #include "hw/ppc/xics.h"
 #include "sysemu/kvm.h"
 
@@ -208,6 +209,10 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
 return 0;
 }
 
+static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
+{
+}
+
 #define SPAPR_IRQ_XICS_NR_IRQS 0x1000
 #define SPAPR_IRQ_XICS_NR_MSIS \
 (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
@@ -224,6 +229,7 @@ sPAPRIrq spapr_irq_xics = {
 .dt_populate = spapr_dt_xics,
 .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
 .post_load   = spapr_irq_post_load_xics,
+.reset   = spapr_irq_reset_xics,
 };
 
 /*
@@ -331,6 +337,15 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
 return 0;
 }
 
+static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
+{
+/*
+ * Set the OS CAM line of the cpu interrupt thread context. Needs
+ * to come after the
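
`spapr_xive_reset_tctx()` builds the OS CAM word from the NVT block and index and stores it big-endian in word 2 of the OS ring. A standalone sketch of that arithmetic: the packing matches `xive_nvt_cam_line()` in the patch, while the value of the VO bit (taken here as the top bit, as with `PPC_BIT32(0)`) is an assumption, and the byte-wise store replaces the patch's `cpu_to_be32()` plus `memcpy()`:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of the VO (valid) bit: the most significant bit of the word. */
#define QW1W2_VO 0x80000000u

/* Same packing as xive_nvt_cam_line() in the patch. */
static uint32_t nvt_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
{
    return ((uint32_t)nvt_blk << 19) | nvt_idx;
}

/* Store the OS CAM word big-endian, independent of host byte order. */
static void store_os_cam(uint8_t word2[4], uint8_t nvt_blk, uint32_t nvt_idx)
{
    uint32_t cam = QW1W2_VO | nvt_cam_line(nvt_blk, nvt_idx);

    word2[0] = (uint8_t)(cam >> 24);
    word2[1] = (uint8_t)(cam >> 16);
    word2[2] = (uint8_t)(cam >> 8);
    word2[3] = (uint8_t)cam;
}
```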

[Qemu-devel] [PATCH v6 28/37] spapr/xive: fix migration of the XiveTCTX under TCG

2018-12-05 Thread Cédric Le Goater

When the thread interrupt management state is retrieved from the KVM
VCPU, word2 is saved under the QEMU XIVE thread context to print out
the OS CAM line under the QEMU monitor.

This breaks the migration on a TCG guest (or on KVM with
kernel_irqchip=off) because the matching algorithm of the presenter
relies on the OS CAM value. Fix with an extra reset of the thread
contexts to restore the expected value.

Signed-off-by: Cédric Le Goater 
---
 hw/ppc/spapr_irq.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 7b401dc1d47c..951d4ff1350a 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -326,7 +326,25 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
 
 static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
 {
-return spapr_xive_post_load(spapr->xive, version_id);
+int ret;
+
+ret = spapr_xive_post_load(spapr->xive, version_id);
+if (ret) {
+return ret;
+}
+
+/*
+ * When the states are collected from the KVM XIVE device, word2
+ * of the XiveTCTX is set to print out the OS CAM line under the
+ * QEMU monitor.
+ *
+ * This breaks the migration on a TCG guest (or on KVM with
+ * kernel_irqchip=off) because the matching algorithm of the
+ * presenter relies on the OS CAM value. Fix with an extra reset
+ * of the thread contexts to restore the expected value.
+ */
+spapr_xive_reset_tctx(spapr->xive);
+return 0;
 }
 
 static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
-- 
2.17.2
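
A toy illustration of why 28/37 resets the thread contexts in `post_load`: if the presenter picks its target by comparing the OS CAM word, a word 2 value left over from the KVM state capture makes the match fail until the expected value is rewritten. The real presenter matching in `hw/intc/xive.c` is more involved; `struct tctx`, `presenter_match()`, `reset_tctx()` and `post_load()` here are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thread context: only the OS CAM word matters here. */
struct tctx {
    uint32_t os_cam;
};

/* Toy presenter match: a thread is a target if its OS CAM equals the NVT's. */
static bool presenter_match(const struct tctx *t, uint32_t nvt_cam)
{
    return t->os_cam == nvt_cam;
}

/* Stand-in for spapr_xive_reset_tctx(): rewrite the expected CAM value. */
static void reset_tctx(struct tctx *t, uint32_t nvt_cam)
{
    t->os_cam = nvt_cam;
}

/* Same shape as the new post_load: return early on error, then fix up. */
static int post_load(struct tctx *t, uint32_t nvt_cam, int base_ret)
{
    if (base_ret) {
        return base_ret;
    }
    reset_tctx(t, nvt_cam);
    return 0;
}
```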

[Qemu-devel] [PATCH v6 27/37] spapr/xive: add migration support for KVM

2018-12-05 Thread Cédric Le Goater

When the VM is stopped, the VM state handler stabilizes the XIVE IC
and marks the EQ pages dirty. These are then transferred to the
destination before the transfer of the device vmstates starts.

The sPAPRXive interrupt controller model captures the XIVE internal
tables, EAT and ENDT, and the XiveTCTX model does the same for the
thread interrupt context registers.

At restart, the sPAPRXive 'post_load' method restores all the XIVE
states. It is called by the sPAPR machine 'post_load' method, when all
XIVE states have been transferred and loaded.

Finally, the source states are restored in the VM change state handler
when the machine reaches the running state.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_xive.h |   5 +
 include/hw/ppc/xive.h   |   1 +
 hw/intc/spapr_xive.c|  34 +++
 hw/intc/spapr_xive_kvm.c| 187 +++-
 hw/intc/xive.c  |  17 
 hw/ppc/spapr_irq.c  |   2 +-
 6 files changed, 244 insertions(+), 2 deletions(-)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index c447534b6b29..21eeb4d5 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -47,6 +47,7 @@ bool spapr_xive_irq_free(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
 bool spapr_xive_priority_is_reserved(uint8_t priority);
+int spapr_xive_post_load(sPAPRXive *xive, int version_id);
 
 void spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
@@ -54,6 +55,8 @@ void spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
uint8_t *out_end_blk, uint32_t *out_end_idx);
 int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
  uint8_t *out_end_blk, uint32_t *out_end_idx);
+int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
+ uint32_t *out_server, uint8_t *out_prio);
 
 typedef struct sPAPRMachineState sPAPRMachineState;
 
@@ -68,5 +71,7 @@ void spapr_xive_map_mmio(sPAPRXive *xive);
  */
 void kvmppc_xive_connect(sPAPRXive *xive, Error **errp);
 void kvmppc_xive_synchronize_state(sPAPRXive *xive);
+int kvmppc_xive_pre_save(sPAPRXive *xive);
+int kvmppc_xive_post_load(sPAPRXive *xive, int version_id);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 7330c11d31c8..d06bcae28e9d 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -448,5 +448,6 @@ void kvmppc_xive_source_reset(XiveSource *xsrc, Error **errp);
 void kvmppc_xive_source_set_irq(void *opaque, int srcno, int val);
 void kvmppc_xive_cpu_connect(XiveTCTX *tctx, Error **errp);
 void kvmppc_xive_cpu_synchronize_state(XiveTCTX *tctx);
+void kvmppc_xive_cpu_get_state(XiveTCTX *tctx, Error **errp);
 
 #endif /* PPC_XIVE_H */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 87f60dd4e453..b030cfe7f136 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -82,6 +82,19 @@ static int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
  * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
  * priorities per CPU
  */
+int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
+ uint32_t *out_server, uint8_t *out_prio)
+{
+if (out_server) {
+*out_server = end_idx >> 3;
+}
+
+if (out_prio) {
+*out_prio = end_idx & 0x7;
+}
+return 0;
+}
+
 void spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
uint8_t *out_end_blk, uint32_t *out_end_idx)
 {
@@ -426,10 +439,31 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
 },
 };
 
+static int vmstate_spapr_xive_pre_save(void *opaque)
+{
+if (kvmppc_xive_enabled()) {
+return kvmppc_xive_pre_save(SPAPR_XIVE(opaque));
+}
+
+return 0;
+}
+
+/* Called by the sPAPR machine 'post_load' method */
+int spapr_xive_post_load(sPAPRXive *xive, int version_id)
+{
+if (kvmppc_xive_enabled()) {
+return kvmppc_xive_post_load(xive, version_id);
+}
+
+return 0;
+}
+
 static const VMStateDescription vmstate_spapr_xive = {
 .name = TYPE_SPAPR_XIVE,
 .version_id = 1,
 .minimum_version_id = 1,
+.pre_save = vmstate_spapr_xive_pre_save,
+.post_load = NULL, /* handled at the machine level */
 .fields = (VMStateField[]) {
 VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
 VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 5cb7461e9743..04f997479e8f 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -60,7 +60,29 @@ static void kvm_cpu_enable(CPUState *cs)
 /*
  * XIVE Thread Interrupt Management
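
The END indexing used by spapr_xive_cpu_to_end() and the new spapr_xive_end_to_target() is plain bit packing: 8 priorities per vCPU, so the index is (vcpu_id << 3) + prio and decoding is a shift plus a mask. A standalone sketch of the round trip, assuming hypothetical helpers end_index_encode()/end_index_decode() that take a bare vcpu_id instead of the sPAPRXive and PowerPCCPU objects:

```c
#include <stddef.h>
#include <stdint.h>

/* 8 priority ENDs per vCPU: the low 3 bits of the END index hold the priority */
#define END_PRIO_BITS 3
#define END_PRIO_MASK 0x7

/* Hypothetical stand-in for spapr_xive_cpu_to_end(); prio must be < 8 */
static uint32_t end_index_encode(uint32_t vcpu_id, uint8_t prio)
{
    return (vcpu_id << END_PRIO_BITS) + prio;
}

/* Hypothetical stand-in for spapr_xive_end_to_target(); either output
 * pointer may be NULL, as in the patch */
static void end_index_decode(uint32_t end_idx, uint32_t *out_server,
                             uint8_t *out_prio)
{
    if (out_server) {
        *out_server = end_idx >> END_PRIO_BITS;
    }
    if (out_prio) {
        *out_prio = end_idx & END_PRIO_MASK;
    }
}
```

For example, vCPU 2 at priority 5 maps to END index 21, and decoding 21 gives back server 2 and priority 5. The round trip only holds because priorities never exceed 7.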

[Qemu-devel] [PATCH v6 15/37] spapr: export and rename the xics_max_server_number() routine

2018-12-05 Thread Cédric Le Goater

The XIVE sPAPR IRQ backend will use it to define the number of ENDs of
the interrupt controller.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr.h | 1 +
 hw/ppc/spapr.c | 8 
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 6279711fe8f7..198764066dc9 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -737,6 +737,7 @@ int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
 void spapr_reallocate_hpt(sPAPRMachineState *spapr, int shift,
   Error **errp);
 void spapr_clear_pending_events(sPAPRMachineState *spapr);
+int spapr_max_server_number(sPAPRMachineState *spapr);
 
 /* CPU and LMB DRC release callbacks. */
 void spapr_core_release(DeviceState *dev);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e470efe7993c..a689f853e020 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -150,7 +150,7 @@ static void pre_2_10_vmstate_unregister_dummy_icp(int i)
(void *)(uintptr_t) i);
 }
 
-static int xics_max_server_number(sPAPRMachineState *spapr)
+int spapr_max_server_number(sPAPRMachineState *spapr)
 {
 assert(spapr->vsmt);
 return DIV_ROUND_UP(max_cpus * spapr->vsmt, smp_threads);
@@ -1270,7 +1270,7 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr,
 _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
 
 /* /interrupt controller */
-spapr_dt_xics(xics_max_server_number(spapr), fdt, PHANDLE_XICP);
+spapr_dt_xics(spapr_max_server_number(spapr), fdt, PHANDLE_XICP);
 
 ret = spapr_populate_memory(spapr, fdt);
 if (ret < 0) {
@@ -2469,7 +2469,7 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
 if (smc->pre_2_10_has_unused_icps) {
 int i;
 
-for (i = 0; i < xics_max_server_number(spapr); i++) {
+for (i = 0; i < spapr_max_server_number(spapr); i++) {
 /* Dummy entries get deregistered when real ICPState objects
  * are registered during CPU core hotplug.
  */
@@ -2589,7 +2589,7 @@ static void spapr_machine_init(MachineState *machine)
 load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
 
 /* VSMT must be set in order to be able to compute VCPU ids, ie to
- * call xics_max_server_number() or spapr_vcpu_id().
+ * call spapr_max_server_number() or spapr_vcpu_id().
  */
 spapr_set_vsmt_mode(spapr, &error_fatal);
 
-- 
2.17.2
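
spapr_max_server_number() computes DIV_ROUND_UP(max_cpus * vsmt, smp_threads), the size of the vCPU id space once the VSMT mode spreads cores apart, and the XIVE backend then allocates 8 ENDs per server. A sketch of that arithmetic, assuming a locally defined DIV_ROUND_UP and passing max_cpus, vsmt and smp_threads as parameters instead of the QEMU globals; max_server_number() and nr_ends_for() are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the usual round-up integer division macro */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Mirrors spapr_max_server_number(), with the globals turned into parameters */
static int max_server_number(int max_cpus, int vsmt, int smp_threads)
{
    assert(vsmt);   /* VSMT must be set first, as in the original */
    return DIV_ROUND_UP(max_cpus * vsmt, smp_threads);
}

/* 8 priority ENDs per server, as in the XIVE IRQ backend */
static uint32_t nr_ends_for(int nr_servers)
{
    return (uint32_t)nr_servers << 3;
}
```

With max_cpus = 8, smp_threads = 4 and vsmt = 8, the vCPU ids spread over DIV_ROUND_UP(64, 4) = 16 servers, so 128 ENDs are needed. When vsmt equals smp_threads, the result is simply max_cpus.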

[Qemu-devel] [PATCH v6 25/37] spapr/xive: add state synchronization with KVM

2018-12-05 Thread Cédric Le Goater

This extends the KVM XIVE device backend with 'synchronize_state'
methods used to retrieve the state from KVM. The HW state of the
sources, of the KVM device and of the thread interrupt contexts is
collected for use by the monitor and for migration.

These get operations rely on their KVM counterparts in the host
kernel, which act as a proxy for OPAL, the host firmware. The set
operations will be added later for migration support.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_xive.h |   9 ++
 include/hw/ppc/xive.h       |   1 +
 hw/intc/spapr_xive.c        |  20 ++--
 hw/intc/spapr_xive_kvm.c    | 198 
 hw/intc/xive.c              |   4 +
 5 files changed, 223 insertions(+), 9 deletions(-)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index ced187ee49e5..bd81bb4d7608 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -45,6 +45,14 @@ bool spapr_xive_irq_claim(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_free(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
+bool spapr_xive_priority_is_reserved(uint8_t priority);
+
+void spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
+   uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
+void spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
+   uint8_t *out_end_blk, uint32_t *out_end_idx);
+int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
+ uint8_t *out_end_blk, uint32_t *out_end_idx);
 
 typedef struct sPAPRMachineState sPAPRMachineState;
 
@@ -58,5 +66,6 @@ void spapr_xive_map_mmio(sPAPRXive *xive);
  * KVM XIVE device helpers
  */
 void kvmppc_xive_connect(sPAPRXive *xive, Error **errp);
+void kvmppc_xive_synchronize_state(sPAPRXive *xive);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 3684d8e4f6be..7330c11d31c8 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -447,5 +447,6 @@ static inline bool kvmppc_xive_enabled(void)
 void kvmppc_xive_source_reset(XiveSource *xsrc, Error **errp);
 void kvmppc_xive_source_set_irq(void *opaque, int srcno, int val);
 void kvmppc_xive_cpu_connect(XiveTCTX *tctx, Error **errp);
+void kvmppc_xive_cpu_synchronize_state(XiveTCTX *tctx);
 
 #endif /* PPC_XIVE_H */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 256108914001..87f60dd4e453 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -48,8 +48,8 @@ static uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
 return nvt_idx - SPAPR_XIVE_NVT_BASE;
 }
 
-static void spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
-  uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
+void spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
+   uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
 {
 XiveRouter *xrtr = XIVE_ROUTER(xive);
 
@@ -82,9 +82,8 @@ static int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
  * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
  * priorities per CPU
  */
-static void spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu,
-  uint8_t prio, uint8_t *out_end_blk,
-  uint32_t *out_end_idx)
+void spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
+   uint8_t *out_end_blk, uint32_t *out_end_idx)
 {
 XiveRouter *xrtr = XIVE_ROUTER(xive);
 
@@ -100,9 +99,8 @@ static void spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu,
 }
 }
 
-static int spapr_xive_target_to_end(sPAPRXive *xive,
-uint32_t target, uint8_t prio,
-uint8_t *out_end_blk, uint32_t *out_end_idx)
+int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
+ uint8_t *out_end_blk, uint32_t *out_end_idx)
 {
PowerPCCPU *cpu = spapr_find_cpu(target);
 
@@ -141,6 +139,10 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 XiveSource *xsrc = &xive->source;
 int i;
 
+if (kvmppc_xive_enabled()) {
+kvmppc_xive_synchronize_state(xive);
+}
+
 monitor_printf(mon, "  LSIN PQEISN CPU/PRIO EQ\n");
 
 for (i = 0; i < xive->nr_irqs; i++) {
@@ -539,7 +541,7 @@ qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn)
  * interrupts (DD2.X POWER9). So we only allow the guest to use
  * priorities [0..6].
  */
-static bool spapr_xive_priority_is_reserved(uint8_t priority)
+bool spapr_xive_priority_is_reserved(uint8_t priority)
 {
 switch (priority) {
 case 0 ... 6:
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 8f773646aa3a..7cdf08ca368c 100644
---
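
spapr_xive_priority_is_reserved(), now exported, lets the guest use only priorities 0 to 6 (the patch comment ties this to DD2.X POWER9). A portable sketch of the same policy: the original uses the GCC `case 0 ... 6:` range extension, replaced here by a plain comparison, and priority_is_reserved() is a hypothetical name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Same policy as spapr_xive_priority_is_reserved(): only 0..6 are usable,
 * 7 and any out-of-range value are treated as reserved */
static bool priority_is_reserved(uint8_t priority)
{
    return priority > 6;
}
```

Treating every value above 6 as reserved matches the `default: return true` branch of the original switch, so an hcall handler can reject bad input with a single test.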

[Qemu-devel] [PATCH v6 17/37] spapr: add hcalls support for the XIVE exploitation interrupt mode

2018-12-05 Thread Cédric Le Goater

The different XIVE virtualization structures (sources and event queues)
are configured with a set of hypervisor calls:

 - H_INT_GET_SOURCE_INFO

   used to obtain the address of the MMIO page of the Event State
   Buffer (ESB) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines which "target" and "priority" are assigned to a source.

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and "priority".
   It is also used to set the notification configuration associated
   with the queue; only unconditional notification is supported for
   the moment. Reset is performed with a queue size of 0, in which
   case queueing is disabled.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the guest's internal interrupt structures to their
   initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issues a synchronisation on a source to make sure all notifications
   have reached their queue.

Calls that still need to be addressed:

   H_INT_SET_OS_REPORTING_LINE
   H_INT_GET_OS_REPORTING_LINE

See the code for more documentation on each hcall.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr.h      |  15 +-
 include/hw/ppc/spapr_xive.h |   4 +
 hw/intc/spapr_xive.c        | 964 
 hw/ppc/spapr_irq.c          |   2 +
 4 files changed, 984 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index cb3082d319af..6bf028a02fe2 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -452,7 +452,20 @@ struct sPAPRMachineState {
 #define H_INVALIDATE_PID    0x378
 #define H_REGISTER_PROC_TBL 0x37C
 #define H_SIGNAL_SYS_RESET  0x380
-#define MAX_HCALL_OPCODE    H_SIGNAL_SYS_RESET
+
+#define H_INT_GET_SOURCE_INFO   0x3A8
+#define H_INT_SET_SOURCE_CONFIG 0x3AC
+#define H_INT_GET_SOURCE_CONFIG 0x3B0
+#define H_INT_GET_QUEUE_INFO    0x3B4
+#define H_INT_SET_QUEUE_CONFIG  0x3B8
+#define H_INT_GET_QUEUE_CONFIG  0x3BC
+#define H_INT_SET_OS_REPORTING_LINE 0x3C0
+#define H_INT_GET_OS_REPORTING_LINE 0x3C4
+#define H_INT_ESB               0x3C8
+#define H_INT_SYNC              0x3CC
+#define H_INT_RESET             0x3D0
+
+#define MAX_HCALL_OPCODE        H_INT_RESET
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index f087959b9924..9506a8f4d10a 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -42,4 +42,8 @@ bool spapr_xive_irq_free(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
 
+typedef struct sPAPRMachineState sPAPRMachineState;
+
+void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 8da7a8bee949..f54100b175a5 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -47,6 +47,72 @@ static uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
 return nvt_idx - SPAPR_XIVE_NVT_BASE;
 }
 
+static void spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
+  uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
+{
+XiveRouter *xrtr = XIVE_ROUTER(xive);
+
+assert(cpu);
+
+if (out_nvt_blk) {
+/* For testing purpose, we could use 0 for nvt_blk */
+*out_nvt_blk = xrtr->chip_id;
+}
+
+if (out_nvt_idx) {
+*out_nvt_idx = SPAPR_XIVE_NVT_BASE + cpu->vcpu_id;
+}
+}
+
+static int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
+uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
+{
+PowerPCCPU *cpu = spapr_find_cpu(target);
+
+if (!cpu) {
+return -1;
+}
+
+spapr_xive_cpu_to_nvt(xive, cpu, out_nvt_blk, out_nvt_idx);
+return 0;
+}
+
+/*
+ * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
+ * priorities per CPU
+ */
+static void spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu,
+  uint8_t prio, uint8_t *out_end_blk,
+  uint32_t *out_end_idx)
+{
+XiveRouter *xrtr = XIVE_ROUTER(xive);
+
+assert(cpu);
+
+if (out_end_blk) {
+/* For testing purpose, we could use 0 for end_blk */
+*out_end_blk = xrtr->chip_id;
+}
+
+if (out_end_idx) {
+*out_end_idx = (cpu->vcpu_id << 3) + prio;
+}
+}
+
+static int spapr_xive_target_to_end(sPAPRXive *xive,
+uint32_t target, uint8_t
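
Raising MAX_HCALL_OPCODE to H_INT_RESET matters because an opcode-indexed hcall table is bounded by it: with the old maximum, H_SIGNAL_SYS_RESET (0x380), every H_INT_* opcode (0x3A8 and up) would be out of range. A minimal sketch of such a table, assuming one slot per 4-byte-aligned opcode; hcall_table, register_hcall(), lookup_hcall() and the stub handler are hypothetical and not QEMU's own registration code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define H_INT_GET_SOURCE_INFO   0x3A8
#define H_INT_RESET             0x3D0
#define MAX_HCALL_OPCODE        H_INT_RESET

typedef long (*hcall_fn)(uint64_t *args);

/* One slot per 4-byte-aligned opcode, up to and including MAX_HCALL_OPCODE */
static hcall_fn hcall_table[(MAX_HCALL_OPCODE / 4) + 1];

static void register_hcall(uint32_t opcode, hcall_fn fn)
{
    assert((opcode & 0x3) == 0);        /* opcodes are multiples of 4 */
    assert(opcode <= MAX_HCALL_OPCODE); /* fails if the maximum is too low */
    hcall_table[opcode / 4] = fn;
}

/* Returns the registered handler, or NULL for an unknown opcode */
static hcall_fn lookup_hcall(uint32_t opcode)
{
    if ((opcode & 0x3) || opcode > MAX_HCALL_OPCODE) {
        return NULL;
    }
    return hcall_table[opcode / 4];
}

/* Hypothetical placeholder handler */
static long h_int_reset_stub(uint64_t *args)
{
    (void)args;
    return 0;
}
```

A static table keeps dispatch O(1); the cost is that every new hcall range must bump the maximum, which is exactly what the spapr.h hunk does.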

[Qemu-devel] [PATCH v6 16/37] spapr: introduce a new machine IRQ backend for XIVE

2018-12-05 Thread Cédric Le Goater

The XIVE IRQ backend uses the same layout as the new XICS backend but
covers the full range of the IRQ number space. The IRQ numbers for the
CPU IPIs are allocated at the bottom of this space, below 4K, to
preserve compatibility with XICS, which does not use that range.

This should be enough given that the maximum number of CPUs is 1024
for the sPAPR machine under QEMU. For the record, the biggest POWER8
or POWER9 system has a maximum of 1536 HW threads (16 sockets, 192
cores, SMT8).

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr.h |   2 +
 include/hw/ppc/spapr_irq.h |   2 +
 hw/ppc/spapr_irq.c | 112 +
 3 files changed, 116 insertions(+)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 198764066dc9..cb3082d319af 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -16,6 +16,7 @@ typedef struct sPAPREventLogEntry sPAPREventLogEntry;
 typedef struct sPAPREventSource sPAPREventSource;
 typedef struct sPAPRPendingHPT sPAPRPendingHPT;
 typedef struct ICSState ICSState;
+typedef struct sPAPRXive sPAPRXive;
 
 #define HPTE64_V_HPTE_DIRTY 0x0040ULL
 #define SPAPR_ENTRY_POINT   0x100
@@ -175,6 +176,7 @@ struct sPAPRMachineState {
 const char *icp_type;
 int32_t irq_map_nr;
 unsigned long *irq_map;
+sPAPRXive  *xive;
 
 bool cmd_line_caps[SPAPR_CAP_NUM];
 sPAPRCapabilities def, eff, mig;
diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 0e9229bf219e..eec3159cd8d8 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -13,6 +13,7 @@
 /*
  * IRQ range offsets per device type
  */
+#define SPAPR_IRQ_IPI        0x0
 #define SPAPR_IRQ_EPOW       0x1000  /* XICS_IRQ_BASE offset */
 #define SPAPR_IRQ_HOTPLUG    0x1001
 #define SPAPR_IRQ_VIO        0x1100  /* 256 VIO devices */
@@ -42,6 +43,7 @@ typedef struct sPAPRIrq {
 
 extern sPAPRIrq spapr_irq_xics;
 extern sPAPRIrq spapr_irq_xics_legacy;
+extern sPAPRIrq spapr_irq_xive;
 
 void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index bac45023..f05aa5a94959 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -12,6 +12,7 @@
 #include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
 #include "hw/ppc/xics.h"
 #include "sysemu/kvm.h"
 
@@ -204,6 +205,117 @@ sPAPRIrq spapr_irq_xics = {
 .print_info  = spapr_irq_print_info_xics,
 };
 
+/*
+ * XIVE IRQ backend.
+ */
+static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr, int nr_irqs,
+int nr_servers, Error **errp)
+{
+sPAPRXive *xive;
+Error *local_err = NULL;
+Object *obj;
+uint32_t nr_ends = nr_servers << 3; /* 8 priority ENDs per CPU */
+int i;
+
+/* TODO : use qdev_create() ? */
+obj = object_new(TYPE_SPAPR_XIVE);
+object_property_set_int(obj, nr_irqs, "nr-irqs", &error_abort);
+object_property_set_int(obj, nr_ends, "nr-ends", &error_abort);
+object_property_set_bool(obj, true, "realized", &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return NULL;
+}
+qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
+xive = SPAPR_XIVE(obj);
+
+/* Enable the CPU IPIs */
+for (i = 0; i < nr_servers; ++i) {
+spapr_xive_irq_claim(xive, SPAPR_IRQ_IPI + i, false);
+}
+
+return xive;
+}
+
+static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
+Error **errp)
+{
+MachineState *machine = MACHINE(spapr);
+Error *local_err = NULL;
+
+/* No KVM support */
+if (kvm_enabled()) {
+if (machine_kernel_irqchip_required(machine)) {
+error_setg(errp, "kernel_irqchip requested. no XIVE support");
+return;
+}
+}
+
+spapr->xive = spapr_xive_create(spapr, nr_irqs,
+spapr_max_server_number(spapr), &local_err);
+if (local_err) {
+error_propagate(errp, local_err);
+return;
+}
+}
+
+static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
+Error **errp)
+{
+if (!spapr_xive_irq_claim(spapr->xive, irq, lsi)) {
+error_setg(errp, "IRQ %d is invalid", irq);
+return -1;
+}
+return 0;
+}
+
+static void spapr_irq_free_xive(sPAPRMachineState *spapr, int irq, int num)
+{
+int i;
+
+for (i = irq; i < irq + num; ++i) {
+spapr_xive_irq_free(spapr->xive, i);
+}
+}
+
+static qemu_irq spapr_qirq_xive(sPAPRMachineState *spapr, int irq)
+{
+return spapr_xive_qirq(spapr->xive, irq);
+}
+
+static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
+  Monitor *mon)
+{
+CPUState *cs;
+
+CPU_FOREACH(cs) {
+PowerPCCPU *cpu =
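
The XIVE backend claims one IPI per server starting at SPAPR_IRQ_IPI (0x0), while the first device range begins at SPAPR_IRQ_EPOW (0x1000), so the IPIs must stay below 4K. A sketch of that bound check using the offsets from spapr_irq.h; ipi_number() and ipis_fit() are hypothetical helpers:

```c
#include <stdbool.h>
#include <stdint.h>

#define SPAPR_IRQ_IPI   0x0
#define SPAPR_IRQ_EPOW  0x1000  /* first non-IPI range */

/* IRQ number of the IPI for a given server, as claimed at XIVE creation */
static uint32_t ipi_number(uint32_t server)
{
    return SPAPR_IRQ_IPI + server;
}

/* True if the IPIs of nr_servers servers all stay below SPAPR_IRQ_EPOW */
static bool ipis_fit(uint32_t nr_servers)
{
    return nr_servers == 0 || ipi_number(nr_servers - 1) < SPAPR_IRQ_EPOW;
}
```

Both 1024 vCPUs and 1536 HW threads fit comfortably; the hard limit is 4096 servers. Since nr_servers comes from spapr_max_server_number(), a VSMT mode larger than the thread count enlarges it beyond max_cpus.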

[Qemu-devel] [PATCH v6 20/37] spapr: extend the sPAPR IRQ backend for XICS migration

2018-12-05 Thread Cédric Le Goater

Introduce a new sPAPR IRQ handler to handle resend after migration
when the machine is using a KVM XICS interrupt controller model.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_irq.h |  2 ++
 hw/ppc/spapr.c | 13 +
 hw/ppc/spapr_irq.c | 27 +++
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 689176455e51..91ac5784919c 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -43,6 +43,7 @@ typedef struct sPAPRIrq {
 void *fdt, uint32_t phandle);
 Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
Error **errp);
+int (*post_load)(sPAPRMachineState *spapr, int version_id);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
@@ -53,6 +54,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
+int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id);
 
 /*
  * XICS legacy routines
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4dae32049d0a..8911465e32cf 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1732,14 +1732,6 @@ static int spapr_post_load(void *opaque, int version_id)
 return err;
 }
 
-if (!object_dynamic_cast(OBJECT(spapr->ics), TYPE_ICS_KVM)) {
-CPUState *cs;
-CPU_FOREACH(cs) {
-PowerPCCPU *cpu = POWERPC_CPU(cs);
-icp_resend(ICP(cpu->intc));
-}
-}
-
 /* In earlier versions, there was no separate qdev for the PAPR
  * RTC, so the RTC offset was stored directly in sPAPREnvironment.
  * So when migrating from those versions, poke the incoming offset
@@ -1760,6 +1752,11 @@ static int spapr_post_load(void *opaque, int version_id)
 }
 }
 
+err = spapr_irq_post_load(spapr, version_id);
+if (err) {
+return err;
+}
+
 return err;
 }
 
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index e16265f29d74..8943e28fc11b 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -196,6 +196,18 @@ static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
 return icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), errp);
 }
 
+static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
+{
+if (!object_dynamic_cast(OBJECT(spapr->ics), TYPE_ICS_KVM)) {
+CPUState *cs;
+CPU_FOREACH(cs) {
+PowerPCCPU *cpu = POWERPC_CPU(cs);
+icp_resend(ICP(cpu->intc));
+}
+}
+return 0;
+}
+
 #define SPAPR_IRQ_XICS_NR_IRQS 0x1000
 #define SPAPR_IRQ_XICS_NR_MSIS \
 (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
@@ -211,6 +223,7 @@ sPAPRIrq spapr_irq_xics = {
 .print_info  = spapr_irq_print_info_xics,
 .dt_populate = spapr_dt_xics,
 .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
+.post_load   = spapr_irq_post_load_xics,
 };
 
 /*
@@ -313,6 +326,11 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
 return xive_tctx_create(cpu, XIVE_ROUTER(spapr->xive), errp);
 }
 
+static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
+{
+return 0;
+}
+
 /*
  * XIVE uses the full IRQ number space. Set it to 8K to be compatible
  * with XICS.
@@ -332,6 +350,7 @@ sPAPRIrq spapr_irq_xive = {
 .print_info  = spapr_irq_print_info_xive,
 .dt_populate = spapr_dt_xive,
 .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
+.post_load   = spapr_irq_post_load_xive,
 };
 
 /*
@@ -370,6 +389,13 @@ qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
 return smc->irq->qirq(spapr, irq);
 }
 
+int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id)
+{
+sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+
+return smc->irq->post_load(spapr, version_id);
+}
+
 /*
  * XICS legacy routines - to deprecate one day
  */
@@ -438,4 +464,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
 .print_info  = spapr_irq_print_info_xics,
 .dt_populate = spapr_dt_xics,
 .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
+.post_load   = spapr_irq_post_load_xics,
 };
-- 
2.17.2
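
The new post_load hook follows the same pattern as the other sPAPRIrq methods: spapr_post_load() calls spapr_irq_post_load(), which dispatches through the backend table of the machine class. A reduced sketch of that function-pointer dispatch; IrqBackend, the two backends and the resend counter are hypothetical stand-ins for sPAPRIrq, the XICS icp_resend() loop and the empty XIVE handler, and the TYPE_ICS_KVM check is left out:

```c
/* Hypothetical reduced backend descriptor, modelled on sPAPRIrq */
typedef struct IrqBackend {
    const char *name;
    int (*post_load)(int version_id);
} IrqBackend;

static int resent;  /* counts simulated icp_resend() calls */

/* XICS-like backend: resend pending interrupts on each CPU */
static int post_load_xics(int version_id)
{
    const int nr_cpus = 4;  /* assumption for the sketch */
    (void)version_id;
    for (int i = 0; i < nr_cpus; i++) {
        resent++;           /* stands in for icp_resend(ICP(cpu->intc)) */
    }
    return 0;
}

/* XIVE-like backend: nothing to do yet, as in the patch */
static int post_load_xive(int version_id)
{
    (void)version_id;
    return 0;
}

static const IrqBackend irq_xics = { "xics", post_load_xics };
static const IrqBackend irq_xive = { "xive", post_load_xive };

/* Counterpart of spapr_irq_post_load(): dispatch to the active backend */
static int irq_post_load(const IrqBackend *irq, int version_id)
{
    return irq->post_load(version_id);
}
```

Moving the resend loop behind the hook keeps spapr_post_load() independent of the interrupt controller model; each backend decides what must be replayed after migration.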

[Qemu-devel] [PATCH v6 23/37] linux-headers: update to 4.20-rc5

2018-12-05 Thread Cédric Le Goater

These changes provide the initial interface with the KVM device
implementing the XIVE native exploitation interrupt mode. They are also
used to retrieve the state of the KVM device for the monitor and for
migration.

Available from:

  https://github.com/legoater/linux/commits/xive-4.20

Signed-off-by: Cédric Le Goater 
---
 linux-headers/asm-powerpc/kvm.h | 46 +
 linux-headers/linux/kvm.h   |  6 +
 2 files changed, 52 insertions(+)

diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
index 8c876c166ef2..10fe86c21e8f 100644
--- a/linux-headers/asm-powerpc/kvm.h
+++ b/linux-headers/asm-powerpc/kvm.h
@@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
 #define  KVM_REG_PPC_ICP_PPRI_SHIFT 16  /* pending irq priority */
 #define  KVM_REG_PPC_ICP_PPRI_MASK 0xff
 
+#define KVM_REG_PPC_NVT_STATE  (KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
+
 /* Device control API: PPC-specific devices */
 #define KVM_DEV_MPIC_GRP_MISC  1
 #define   KVM_DEV_MPIC_BASE_ADDR   0   /* 64-bit */
@@ -675,4 +677,48 @@ struct kvm_ppc_cpu_char {
 #define  KVM_XICS_PRESENTED (1ULL << 43)
 #define  KVM_XICS_QUEUED   (1ULL << 44)
 
+/* POWER9 XIVE Native Interrupt Controller */
+#define KVM_DEV_XIVE_GRP_CTRL  1
+#define   KVM_DEV_XIVE_GET_ESB_FD  1
+#define   KVM_DEV_XIVE_GET_TIMA_FD 2
+#define   KVM_DEV_XIVE_VC_BASE 3
+#define   KVM_DEV_XIVE_SAVE_EQ_PAGES   4
+#define KVM_DEV_XIVE_GRP_SOURCES   2   /* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_SYNC  3   /* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_EAS   4   /* 64-bit eas attributes */
+#define KVM_DEV_XIVE_GRP_EQ        5   /* 64-bit eq attributes */
+
+/* Layout of 64-bit XIVE source attribute values */
+#define KVM_XIVE_LEVEL_SENSITIVE   (1ULL << 0)
+#define KVM_XIVE_LEVEL_ASSERTED    (1ULL << 1)
+
+/* Layout of 64-bit eas attribute values */
+#define KVM_XIVE_EAS_PRIORITY_SHIFT 0
+#define KVM_XIVE_EAS_PRIORITY_MASK 0x7
+#define KVM_XIVE_EAS_SERVER_SHIFT  3
+#define KVM_XIVE_EAS_SERVER_MASK   0xfff8ULL
+#define KVM_XIVE_EAS_MASK_SHIFT    32
+#define KVM_XIVE_EAS_MASK_MASK 0x1ULL
+#define KVM_XIVE_EAS_EISN_SHIFT    33
+#define KVM_XIVE_EAS_EISN_MASK 0xfffeULL
+
+/* Layout of 64-bit eq attribute */
+#define KVM_XIVE_EQ_PRIORITY_SHIFT 0
+#define KVM_XIVE_EQ_PRIORITY_MASK  0x7
+#define KVM_XIVE_EQ_SERVER_SHIFT   3
+#define KVM_XIVE_EQ_SERVER_MASK    0xfff8ULL
+
+/* Layout of 64-bit eq attribute values */
+struct kvm_ppc_xive_eq {
+   __u32 flags;
+   __u32 qsize;
+   __u64 qpage;
+   __u32 qtoggle;
+   __u32 qindex;
+};
+
+#define KVM_XIVE_EQ_FLAG_ENABLED   0x0001
+#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY 0x0002
+#define KVM_XIVE_EQ_FLAG_ESCALATE  0x0004
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index f11a7eb49cfa..b7a74c58d0db 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -965,6 +965,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COALESCED_PIO 162
 #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
 #define KVM_CAP_EXCEPTION_PAYLOAD 164
+#define KVM_CAP_ARM_VM_IPA_SIZE 165
+#define KVM_CAP_PPC_IRQ_XIVE 166
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1188,6 +1190,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_V3   KVM_DEV_TYPE_ARM_VGIC_V3
KVM_DEV_TYPE_ARM_VGIC_ITS,
 #define KVM_DEV_TYPE_ARM_VGIC_ITS  KVM_DEV_TYPE_ARM_VGIC_ITS
+   KVM_DEV_TYPE_XIVE,
+#define KVM_DEV_TYPE_XIVE  KVM_DEV_TYPE_XIVE
KVM_DEV_TYPE_MAX,
 };
 
@@ -1305,6 +1309,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+#define KVM_DESTROY_DEVICE   _IOWR(KVMIO,  0xf0, struct kvm_create_device)
+
 /*
  * ioctls for vcpu fds
  */
-- 
2.17.2
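
For the KVM_DEV_XIVE_GRP_EQ group, the attribute presumably identifies an event queue by server and priority, using the KVM_XIVE_EQ_PRIORITY_* and KVM_XIVE_EQ_SERVER_SHIFT values from the header, which lines up with the sPAPR END indexing. A sketch of packing and unpacking such an attribute; eq_attr_pack()/eq_attr_unpack() are hypothetical helpers, and the server is recovered with a plain shift rather than KVM_XIVE_EQ_SERVER_MASK, assuming no other fields sit above the server bits:

```c
#include <stdint.h>

/* Values copied from the KVM_XIVE_EQ_* definitions in the header */
#define KVM_XIVE_EQ_PRIORITY_SHIFT 0
#define KVM_XIVE_EQ_PRIORITY_MASK  0x7
#define KVM_XIVE_EQ_SERVER_SHIFT   3

/* Hypothetical: build the 64-bit EQ attribute for (server, priority) */
static uint64_t eq_attr_pack(uint32_t server, uint8_t priority)
{
    return ((uint64_t)server << KVM_XIVE_EQ_SERVER_SHIFT) |
           (((uint64_t)priority & KVM_XIVE_EQ_PRIORITY_MASK)
            << KVM_XIVE_EQ_PRIORITY_SHIFT);
}

/* Hypothetical: split the attribute back into server and priority */
static void eq_attr_unpack(uint64_t attr, uint32_t *server, uint8_t *priority)
{
    *priority = (uint8_t)((attr >> KVM_XIVE_EQ_PRIORITY_SHIFT) &
                          KVM_XIVE_EQ_PRIORITY_MASK);
    *server = (uint32_t)(attr >> KVM_XIVE_EQ_SERVER_SHIFT);
}
```

Server 2 at priority 5 packs to 21, the same value as the sPAPR END index of vCPU 2 at priority 5 when the server number equals the vCPU id.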

[Qemu-devel] [PATCH v6 11/37] spapr/xive: use the VCPU id as a NVT identifier

2018-12-05 Thread Cédric Le Goater

The IVPE scans the O/S CAM line of the XIVE thread interrupt contexts
to find a matching Notification Virtual Target (NVT) among the NVTs
dispatched on the HW processor threads.

On a real system, the thread interrupt contexts are updated by the
hypervisor when a Virtual Processor is scheduled to run on a HW
thread. Under QEMU, the model will emulate the same behavior by
hardwiring the NVT identifier in the thread context registers at
reset.

The NVT identifier used by the sPAPRXive model is the VCPU id. The END
identifier is also derived from the VCPU id. A set of helpers converting
between these identifiers is provided for the hcalls that configure the
sources and the ENDs.

The model does not need an NVT table, but the XiveRouter NVT operations
are provided to perform some extra checks in the routing algorithm.

Signed-off-by: Cédric Le Goater 
---
 hw/intc/spapr_xive.c | 53 +++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index eef5830d45c6..8da7a8bee949 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -26,6 +26,27 @@
 #define SPAPR_XIVE_VC_BASE   0x00060100ull
 #define SPAPR_XIVE_TM_BASE   0x000603020318ull
 
+/*
+ * The allocation of VP blocks is a complex operation in OPAL and the
+ * VP identifiers have a relation with the number of HW chips, the
+ * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
+ * controller model does not have the same constraints and can use a
+ * simple mapping scheme of the CPU vcpu_id.
+ *
+ * These identifiers are never returned to the OS.
+ */
+
+#define SPAPR_XIVE_NVT_BASE 0x400
+
+/*
+ * sPAPR NVT and END indexing helpers
+ */
+static uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
+  uint32_t nvt_idx)
+{
+return nvt_idx - SPAPR_XIVE_NVT_BASE;
+}
+
 /*
  * On sPAPR machines, use a simplified output for the XIVE END
  * structure dumping only the information related to the OS EQ.
@@ -40,7 +61,8 @@ static void spapr_xive_end_pic_print_info(sPAPRXive *xive, XiveEND *end,
 uint32_t nvt = GETFIELD_BE32(END_W6_NVT_INDEX, end->w6);
 uint8_t priority = GETFIELD_BE32(END_W7_F0_PRIORITY, end->w7);
 
-monitor_printf(mon, "%3d/%d % 6d/%5d ^%d", nvt,
+monitor_printf(mon, "%3d/%d % 6d/%5d ^%d",
+   spapr_xive_nvt_to_target(xive, 0, nvt),
priority, qindex, qentries, qgen);
 
 xive_end_queue_pic_print_info(end, 6, mon);
@@ -246,6 +268,33 @@ static int spapr_xive_write_end(XiveRouter *xrtr, uint8_t end_blk,
 return 0;
 }
 
+static int spapr_xive_get_nvt(XiveRouter *xrtr,
+  uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
+{
+sPAPRXive *xive = SPAPR_XIVE(xrtr);
+uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
+PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
+
+if (!cpu) {
+return -1;
+}
+
+/*
+ * sPAPR does not maintain a NVT table. Return that the NVT is
+ * valid if we have found a matching CPU
+ */
+nvt->w0 = cpu_to_be32(NVT_W0_VALID);
+return 0;
+}
+
+static int spapr_xive_write_nvt(XiveRouter *xrtr, uint8_t nvt_blk,
+uint32_t nvt_idx, XiveNVT *nvt,
+uint8_t word_number)
+{
+/* no NVT table */
+return 0;
+}
+
 static const VMStateDescription vmstate_spapr_xive_end = {
 .name = TYPE_SPAPR_XIVE "/end",
 .version_id = 1,
@@ -308,6 +357,8 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
 xrc->get_eas = spapr_xive_get_eas;
 xrc->get_end = spapr_xive_get_end;
 xrc->write_end = spapr_xive_write_end;
+xrc->get_nvt = spapr_xive_get_nvt;
+xrc->write_nvt = spapr_xive_write_nvt;
 }
 
 static const TypeInfo spapr_xive_info = {
-- 
2.17.2
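
The NVT identifiers are the vCPU ids offset by SPAPR_XIVE_NVT_BASE (0x400), and spapr_xive_nvt_to_target() subtracts the base back. A standalone sketch of the round trip, with the CPU lookup replaced by a plain vcpu_id; vcpu_to_nvt(), nvt_to_target() and the chip_id parameter are hypothetical, and each output pointer is tested separately before being written:

```c
#include <stddef.h>
#include <stdint.h>

#define SPAPR_XIVE_NVT_BASE 0x400

/* Like spapr_xive_cpu_to_nvt(), but taking a vcpu_id instead of a PowerPCCPU */
static void vcpu_to_nvt(uint8_t chip_id, uint32_t vcpu_id,
                        uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
{
    if (out_nvt_blk) {
        *out_nvt_blk = chip_id;   /* block number: the router's chip id */
    }
    if (out_nvt_idx) {
        *out_nvt_idx = SPAPR_XIVE_NVT_BASE + vcpu_id;
    }
}

/* Like spapr_xive_nvt_to_target(): the block number is ignored */
static uint32_t nvt_to_target(uint8_t nvt_blk, uint32_t nvt_idx)
{
    (void)nvt_blk;
    return nvt_idx - SPAPR_XIVE_NVT_BASE;
}
```

vCPU 3 therefore maps to NVT index 0x403, and converting 0x403 back yields target 3. Because these identifiers are never returned to the OS, the base value only needs to be consistent inside QEMU.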

[Qemu-devel] [PATCH v6 10/37] spapr/xive: introduce a XIVE interrupt controller

2018-12-05 Thread Cédric Le Goater

sPAPRXive models the XIVE interrupt controller of the sPAPR machine.
It inherits from the XiveRouter and provisions storage for the routing
tables:

  - Event Assignment Structure (EAS)
  - Event Notification Descriptor (END)

The sPAPRXive model incorporates an internal XiveSource for the IPIs
and for the interrupts of the virtual devices of the guest. This model
is consistent with the XIVE architecture, which also incorporates an
internal IVSE for IPIs and accelerator interrupts in the IVRE
sub-engine.

The sPAPRXive model exports two memory regions, one for the ESB
trigger and management pages used to control the sources and one for
the TIMA pages. They are mapped by default at the addresses found on
chip 0 of a baremetal system. This is also consistent with the XIVE
architecture which defines a Virtualization Controller BAR for the
internal IVSE ESB pages and a Thread Management BAR for the TIMA.

Signed-off-by: Cédric Le Goater 
---
 default-configs/ppc64-softmmu.mak |   1 +
 include/hw/ppc/spapr_xive.h   |  45 
 hw/intc/spapr_xive.c  | 366 ++
 hw/intc/Makefile.objs |   1 +
 4 files changed, 413 insertions(+)
 create mode 100644 include/hw/ppc/spapr_xive.h
 create mode 100644 hw/intc/spapr_xive.c

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index 2d1e7c5c4668..7f34ad0528ed 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_XIVE=$(CONFIG_PSERIES)
+CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
 CONFIG_MEM_DEVICE=y
 CONFIG_DIMM=y
 CONFIG_SPAPR_RNG=y
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
new file mode 100644
index 000000000000..f087959b9924
--- /dev/null
+++ b/include/hw/ppc/spapr_xive.h
@@ -0,0 +1,45 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_SPAPR_XIVE_H
+#define PPC_SPAPR_XIVE_H
+
+#include "hw/ppc/xive.h"
+
+#define TYPE_SPAPR_XIVE "spapr-xive"
+#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
+
+typedef struct sPAPRXive {
+XiveRouter    parent;
+
+/* Internal interrupt source for IPIs and virtual devices */
+XiveSource    source;
+hwaddr        vc_base;
+
+/* END ESB MMIOs */
+XiveENDSource end_source;
+hwaddr        end_base;
+
+/* Routing table */
+XiveEAS       *eat;
+uint32_t      nr_irqs;
+XiveEND       *endt;
+uint32_t      nr_ends;
+
+/* TIMA mapping address */
+hwaddr        tm_base;
+MemoryRegion  tm_mmio;
+} sPAPRXive;
+
+bool spapr_xive_irq_claim(sPAPRXive *xive, uint32_t lisn, bool lsi);
+bool spapr_xive_irq_free(sPAPRXive *xive, uint32_t lisn);
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
+qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
+
+#endif /* PPC_SPAPR_XIVE_H */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
new file mode 100644
index 000000000000..eef5830d45c6
--- /dev/null
+++ b/hw/intc/spapr_xive.c
@@ -0,0 +1,366 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "qemu/error-report.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive.h"
+#include "hw/ppc/xive_regs.h"
+
+/*
+ * XIVE Virtualization Controller BAR and Thread Management BAR that we
+ * use for the ESB pages and the TIMA pages
+ */
+#define SPAPR_XIVE_VC_BASE   0x0006010000000000ull
+#define SPAPR_XIVE_TM_BASE   0x0006030203180000ull
+
+/*
+ * On sPAPR machines, use a simplified output for the XIVE END
+ * structure dumping only the information related to the OS EQ.
+ */
+static void spapr_xive_end_pic_print_info(sPAPRXive *xive, XiveEND *end,
+  Monitor *mon)
+{
+uint32_t qindex = GETFIELD_BE32(END_W1_PAGE_OFF, end->w1);
+uint32_t qgen = GETFIELD_BE32(END_W1_GENERATION, end->w1);
+uint32_t qsize = GETFIELD_BE32(END_W0_QSIZE, end->w0);
+uint32_t qentries = 1 << (qsize + 10);
+uint32_t nvt = GETFIELD_BE32(END_W6_NVT_INDEX, end->w6);
+uint8_t priority = GETFIELD_BE32(END_W7_F0_PRIORITY, end->w7);
+
+monitor_printf(mon, "%3d/%d % 6d/%5d ^%d", nvt,
+   priority, qindex, qentries, qgen);
+
+xive_end_queue_pic_print_info(end, 6, mon);
+monitor_printf(mon, "]");
+}
+
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)

[Qemu-devel] [PATCH v6 19/37] spapr: allocate the interrupt thread context under the CPU core

2018-12-05 Thread Cédric Le Goater

Each interrupt mode has its own specific interrupt presenter object,
stored under the CPU object: one for XICS and one for XIVE.
The XIVE model hardwires the NVT identifier in the thread context
model to emulate the push/pull of the hypervisor when a vCPU is
dispatched on a HW thread.
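The hardwired identifier is the HW CAM line that xive_tctx_create() derives from the PIR through hw_cam_line(). A minimal standalone sketch reproducing the expression used by the block_group == false branch of that helper in this series, together with the same PIR split; it runs outside QEMU, and only the function bodies are taken from the patch. Note that this expression places the chip number above the fixed 1 bit, the order the helper's comment gives for the block-grouping case.

```c
#include <stdint.h>

/* Same expression as the block_group == false branch of hw_cam_line():
 * chip number in bits 11..14, a fixed 1 in bit 7, thread in bits 0..6 */
static uint32_t hw_cam_line(uint8_t chip_id, uint8_t tid)
{
    return (uint32_t)(chip_id & 0xf) << 11 | 1u << 7 | (tid & 0x7f);
}

/* PIR decoding used by xive_tctx_create(): chip number from bits 8..11,
 * thread number from bits 0..6 (LSB-0 numbering) */
static uint32_t hw_cam_from_pir(uint32_t pir)
{
    return hw_cam_line((pir >> 8) & 0xf, pir & 0x7f);
}
```

For a PIR of 0x105 (chip 1, thread 5), hw_cam_from_pir() yields 0x885.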

The sPAPR IRQ backend is extended with a new handler to support them
both.
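The new handler follows the function-pointer dispatch already used by sPAPRIrq: the vCPU realize path calls smc->irq->cpu_intc_create() without knowing which interrupt mode is active. A stripped-down sketch of that pattern; Presenter, IrqBackend and the creation functions are hypothetical stand-ins, not QEMU types.

```c
/* Hypothetical stand-in for the per-CPU presenter object */
typedef struct Presenter {
    const char *kind;   /* "icp" for XICS, "tctx" for XIVE */
    int cpu_index;
} Presenter;

/* Hypothetical backend table, modelled on the cpu_intc_create hook of sPAPRIrq */
typedef struct IrqBackend {
    const char *name;
    int (*cpu_intc_create)(int cpu_index, Presenter *out);
} IrqBackend;

static int intc_create_xics(int cpu_index, Presenter *out)
{
    out->kind = "icp";
    out->cpu_index = cpu_index;
    return 0;
}

static int intc_create_xive(int cpu_index, Presenter *out)
{
    out->kind = "tctx";
    out->cpu_index = cpu_index;
    return 0;
}

static const IrqBackend backend_xics = { "xics", intc_create_xics };
static const IrqBackend backend_xive = { "xive", intc_create_xive };

/* The vCPU realize path only calls through the backend chosen by the machine */
static int realize_vcpu_intc(const IrqBackend *irq, int cpu_index, Presenter *out)
{
    return irq->cpu_intc_create(cpu_index, out);
}
```

Keeping the choice in the backend table is what lets spapr_cpu_core.c drop its direct icp_create() call and the xics.h include.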

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---

 Changes since v5:

 - hardwires the NVT identifier in the thread context
 
 include/hw/ppc/spapr_irq.h |  2 ++
 include/hw/ppc/xive.h  |  1 +
 hw/intc/xive.c | 31 +++
 hw/ppc/spapr_cpu_core.c|  5 ++---
 hw/ppc/spapr_irq.c | 15 +++
 5 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 457239826b8f..689176455e51 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -41,6 +41,8 @@ typedef struct sPAPRIrq {
 void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
 void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
 void *fdt, uint32_t phandle);
+Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
+   Error **errp);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index e9b06e75fc1c..60c335ce0e1e 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -421,6 +421,7 @@ typedef struct XiveTCTX {
 extern const MemoryRegionOps xive_tm_ops;
 
 void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
+Object *xive_tctx_create(Object *cpu, XiveRouter *xrtr, Error **errp);
 
 static inline uint32_t xive_nvt_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
 {
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 0db77107ab15..7638592da20f 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -552,6 +552,37 @@ static const TypeInfo xive_tctx_info = {
 .class_init= xive_tctx_class_init,
 };
 
+Object *xive_tctx_create(Object *cpu, XiveRouter *xrtr, Error **errp)
+{
+
+CPUPPCState *env = &POWERPC_CPU(cpu)->env;
+uint32_t pir = env->spr_cb[SPR_PIR].default_value;
+uint32_t hw_cam = hw_cam_line((pir >> 8) & 0xf, pir & 0x7f);
+Error *local_err = NULL;
+Object *obj;
+
+obj = object_new(TYPE_XIVE_TCTX);
+object_property_add_child(cpu, TYPE_XIVE_TCTX, obj, &error_abort);
+object_unref(obj);
+object_property_add_const_link(obj, "cpu", cpu, &error_abort);
+object_property_add_const_link(obj, "xive", OBJECT(xrtr), &error_abort);
+object_property_set_int(obj, hw_cam, "hw-cam", &local_err);
+if (local_err) {
+goto error;
+}
+object_property_set_bool(obj, true, "realized", &local_err);
+if (local_err) {
+goto error;
+}
+
+return obj;
+
+error:
+object_unparent(obj);
+error_propagate(errp, local_err);
+return NULL;
+}
+
 /*
  * XIVE ESB helpers
  */
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 2398ce62c0e7..1811cd48db90 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -11,7 +11,6 @@
 #include "hw/ppc/spapr_cpu_core.h"
 #include "target/ppc/cpu.h"
 #include "hw/ppc/spapr.h"
-#include "hw/ppc/xics.h" /* for icp_create() - to be removed */
 #include "hw/boards.h"
 #include "qapi/error.h"
 #include "sysemu/cpus.h"
@@ -215,6 +214,7 @@ static void spapr_cpu_core_unrealize(DeviceState *dev, Error **errp)
 static void spapr_realize_vcpu(PowerPCCPU *cpu, sPAPRMachineState *spapr,
sPAPRCPUCore *sc, Error **errp)
 {
+sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
 CPUPPCState *env = &cpu->env;
 CPUState *cs = CPU(cpu);
 Error *local_err = NULL;
@@ -233,8 +233,7 @@ static void spapr_realize_vcpu(PowerPCCPU *cpu, sPAPRMachineState *spapr,
 qemu_register_reset(spapr_cpu_reset, cpu);
 spapr_cpu_reset(cpu);
 
-cpu->intc = icp_create(OBJECT(cpu), spapr->icp_type, XICS_FABRIC(spapr),
-   &local_err);
+cpu->intc = smc->irq->cpu_intc_create(spapr, OBJECT(cpu), &local_err);
 if (local_err) {
 goto error_unregister;
 }
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 8401c75fdbe4..e16265f29d74 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -190,6 +190,12 @@ static void spapr_irq_print_info_xics(sPAPRMachineState *spapr, Monitor *mon)
 ics_pic_print_info(spapr->ics, mon);
 }
 
+static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
+  Object *cpu, Error **errp)
+{
+return icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), errp);
+}
+
 #define SPAPR_IRQ_XICS_NR_IRQS 0x1000
 #define SPAPR_IRQ_XICS_NR_MSIS \
 (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
@@ -204,6 +210,7 @@ sPAPRIrq spapr_irq_xics = {
 .qirq= spapr_qirq_xics,
 .print_info  = spapr_irq_print_info_xics,

[Qemu-devel] [PATCH v6 22/37] spapr: add a 'pseries-3.1-xive' machine type

2018-12-05 Thread Cédric Le Goater

The interrupt mode is statically defined to XIVE only for this machine.
The guest OS is required to have support for the XIVE exploitation
mode of the POWER9 interrupt controller.
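The mode is advertised to the guest through byte 23 of option vector 5. A minimal sketch of the selection logic that this patch adds to spapr_dt_ov5_platform_support() and spapr_machine_init(), using plain C types instead of the QEMU machine class; fill_xive_mode() and advertise_xive_exploit() are hypothetical names, while the constants come from spapr.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Option vector 5, byte 23 values from spapr.h */
#define SPAPR_OV5_XIVE_LEGACY   0x0
#define SPAPR_OV5_XIVE_EXPLOIT  0x40
#define SPAPR_OV5_XIVE_BOTH     0x80

/* Fill the (byte index, value) pair for byte 23 of
 * ibm,arch-vec-5-platform-support: legacy only under KVM (no XIVE KVM
 * support yet), otherwise whatever the machine's IRQ backend declares. */
static void fill_xive_mode(uint8_t val[8], bool kvm, uint8_t backend_ov5)
{
    val[0] = 23;
    val[1] = kvm ? SPAPR_OV5_XIVE_LEGACY : backend_ov5;
}

/* spapr_machine_init() sets OV5_XIVE_EXPLOIT only for an exploitation backend */
static bool advertise_xive_exploit(uint8_t backend_ov5)
{
    return backend_ov5 == SPAPR_OV5_XIVE_EXPLOIT;
}
```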

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr.h |  6 ++
 include/hw/ppc/spapr_irq.h |  1 +
 hw/ppc/spapr.c | 36 +++-
 hw/ppc/spapr_irq.c |  3 +++
 4 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 6bf028a02fe2..daced428a42c 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -824,5 +824,11 @@ int spapr_caps_post_migration(sPAPRMachineState *spapr);
 
 void spapr_check_pagesize(sPAPRMachineState *spapr, hwaddr pagesize,
   Error **errp);
+/*
+ * XIVE definitions
+ */
+#define SPAPR_OV5_XIVE_LEGACY   0x0
+#define SPAPR_OV5_XIVE_EXPLOIT  0x40
+#define SPAPR_OV5_XIVE_BOTH 0x80
 
 #endif /* HW_SPAPR_H */
diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index bdb1c66125c9..26727a7263a5 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -33,6 +33,7 @@ void spapr_irq_msi_reset(sPAPRMachineState *spapr);
 typedef struct sPAPRIrq {
 uint32_tnr_irqs;
 uint32_tnr_msis;
+uint8_t ov5;
 
 void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
 int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 530aee8d143d..817dd1b2c442 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1097,12 +1097,14 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
 spapr_dt_rtas_tokens(fdt, rtas);
 }
 
-/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
- * that the guest may request and thus the valid values for bytes 24..26 of
- * option vector 5: */
-static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
+/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
+ * and the XIVE features that the guest may request and thus the valid
+ * values for bytes 23..26 of option vector 5: */
+static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
+  int chosen)
 {
 PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
+sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
 
 char val[2 * 4] = {
 23, 0x00, /* Xive mode, filled in below. */
@@ -1123,7 +1125,11 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
 } else {
 val[3] = 0x00; /* Hash */
 }
+/* No KVM support */
+val[1] = SPAPR_OV5_XIVE_LEGACY;
 } else {
+val[1] = smc->irq->ov5;
+
 /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
 val[3] = 0xC0;
 }
@@ -1191,7 +1197,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
 _FDT(fdt_setprop_string(fdt, chosen, "stdout-path", stdout_path));
 }
 
-spapr_dt_ov5_platform_support(fdt, chosen);
+spapr_dt_ov5_platform_support(spapr, fdt, chosen);
 
 g_free(stdout_path);
 g_free(bootlist);
@@ -2624,6 +2630,11 @@ static void spapr_machine_init(MachineState *machine)
 /* advertise support for ibm,dyamic-memory-v2 */
 spapr_ovec_set(spapr->ov5, OV5_DRMEM_V2);
 
+/* advertise XIVE */
+if (smc->irq->ov5 == SPAPR_OV5_XIVE_EXPLOIT) {
+spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
+}
+
 /* init CPUs */
 spapr_init_cpus(spapr);
 
@@ -3973,6 +3984,21 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
 
 DEFINE_SPAPR_MACHINE(3_1, "3.1", true);
 
+static void spapr_machine_3_1_xive_instance_options(MachineState *machine)
+{
+spapr_machine_3_1_instance_options(machine);
+}
+
+static void spapr_machine_3_1_xive_class_options(MachineClass *mc)
+{
+sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
+spapr_machine_3_1_class_options(mc);
+smc->irq = &spapr_irq_xive;
+}
+
+DEFINE_SPAPR_MACHINE(3_1_xive, "3.1-xive", false);
+
 /*
  * pseries-3.0
  */
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 58ce124c1501..8eead17c8f36 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -220,6 +220,7 @@ static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
 sPAPRIrq spapr_irq_xics = {
 .nr_irqs = SPAPR_IRQ_XICS_NR_IRQS,
 .nr_msis = SPAPR_IRQ_XICS_NR_MSIS,
+.ov5 = SPAPR_OV5_XIVE_LEGACY,
 
 .init= spapr_irq_init_xics,
 .claim   = spapr_irq_claim_xics,
@@ -357,6 +358,7 @@ static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
 sPAPRIrq spapr_irq_xive = {
 .nr_irqs = SPAPR_IRQ_XIVE_NR_IRQS,
 .nr_msis = SPAPR_IRQ_XIVE_NR_MSIS,
+.ov5 = SPAPR_OV5_XIVE_EXPLOIT,
 
 .init= spapr_irq_init_xive,
 .claim   = spapr_irq_claim_xive,
@@ -481,6 +483,7 @@ int

[Qemu-devel] [PATCH v6 05/37] ppc/xive: introduce the XIVE Event Notification Descriptors

2018-12-05 Thread Cédric Le Goater

To complete the event routing, the IVRE sub-engine uses a second table
containing Event Notification Descriptor (END) structures.

An END specifies on which Event Queue (EQ) the event notification
data, defined in the associated EAS, should be posted when an
exception occurs. It also defines which Notification Virtual Target
(NVT) should be notified.
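The END designates its NVT through word 6, whose fields use the IBM (MSB-0) bit numbering of PPC_BIT32()/PPC_BITMASK32(). A minimal sketch of extracting and setting those fields; getfield32()/setfield32() are hypothetical re-implementations of the GETFIELD/SETFIELD idea, they assume a GCC or Clang __builtin_ctz, and they work on host-order values, whereas QEMU stores the END words big-endian (hence GETFIELD_BE32).

```c
#include <stdint.h>

/* MSB-0 bit helpers, equivalent to QEMU's PPC_BIT32()/PPC_BITMASK32() */
#define PPC_BIT32(bit)         (0x80000000u >> (bit))
#define PPC_BITMASK32(bs, be)  ((PPC_BIT32(bs) - PPC_BIT32(be)) | PPC_BIT32(bs))

/* END word 6 fields, as defined in xive_regs.h */
#define END_W6_NVT_BLOCK  PPC_BITMASK32(9, 12)
#define END_W6_NVT_INDEX  PPC_BITMASK32(13, 31)

/* Extract a field: mask it, then shift right by the mask's lowest set bit */
static uint32_t getfield32(uint32_t mask, uint32_t word)
{
    return (word & mask) >> __builtin_ctz(mask);
}

/* Replace a field: clear it, then insert the shifted value under the mask */
static uint32_t setfield32(uint32_t mask, uint32_t word, uint32_t val)
{
    return (word & ~mask) | ((val << __builtin_ctz(mask)) & mask);
}
```

With NVT block 1 and index 5, word 6 becomes 0x80005, the same value xive_nvt_cam_line(1, 5) computes.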

The Event Queue is a memory page provided by the O/S defining a
circular buffer, one per (server, priority) pair, containing Event
Queue entries. These are 4 bytes long, the first bit being a
'generation' bit and the 31 following bits the END Data field. They
are pulled by the O/S when the exception occurs.

The END Data field is a way to set an invariant logical event source
number for an IRQ. On sPAPR machines, it is set with the
H_INT_SET_SOURCE_CONFIG hcall when the EISN flag is used.
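The generation bit lets the O/S tell fresh entries from stale ones without reading a producer index. A minimal single-threaded sketch, assuming the producer starts with generation 1 and flips it on every wrap, and that the O/S treats an entry whose generation bit equals its own toggle as already consumed (modelled on how a guest such as Linux polls its queues). EventQueue and the eq_* functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EQ_ENTRIES 4   /* tiny power-of-two queue for illustration */

/* Hypothetical model of one Event Queue page shared by producer and O/S */
typedef struct EventQueue {
    uint32_t entries[EQ_ENTRIES];
    uint32_t prod_idx;     /* producer write index (cf. END_W1_PAGE_OFF) */
    uint32_t prod_gen;     /* producer generation (cf. END_W1_GENERATION) */
    uint32_t cons_idx;     /* O/S read index */
    uint32_t cons_toggle;  /* generation value marking a stale entry */
} EventQueue;

static void eq_init(EventQueue *q)
{
    for (int i = 0; i < EQ_ENTRIES; i++) {
        q->entries[i] = 0;
    }
    q->prod_idx = 0;
    q->prod_gen = 1;
    q->cons_idx = 0;
    q->cons_toggle = 0;
}

/* Producer: post 31 bits of END data tagged with the current generation */
static void eq_push(EventQueue *q, uint32_t end_data)
{
    q->entries[q->prod_idx] = (q->prod_gen << 31) | (end_data & 0x7fffffff);
    q->prod_idx = (q->prod_idx + 1) & (EQ_ENTRIES - 1);
    if (q->prod_idx == 0) {
        q->prod_gen ^= 1;      /* wrapped: flip the generation bit */
    }
}

/* O/S: an entry is new only if its generation bit differs from the toggle */
static bool eq_pop(EventQueue *q, uint32_t *end_data)
{
    uint32_t cur = q->entries[q->cons_idx];

    if ((cur >> 31) == q->cons_toggle) {
        return false;          /* stale entry: nothing pending */
    }
    *end_data = cur & 0x7fffffff;
    q->cons_idx = (q->cons_idx + 1) & (EQ_ENTRIES - 1);
    if (q->cons_idx == 0) {
        q->cons_toggle ^= 1;
    }
    return true;
}
```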

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/xive.h  |  18 
 include/hw/ppc/xive_regs.h |  57 
 hw/intc/xive.c | 174 +
 3 files changed, 249 insertions(+)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 57ec9f84f527..d1b4c6c78ec5 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -321,11 +321,29 @@ typedef struct XiveRouterClass {
 /* XIVE table accessors */
 int (*get_eas)(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
XiveEAS *eas);
+int (*get_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+   XiveEND *end);
+int (*write_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+ XiveEND *end, uint8_t word_number);
 } XiveRouterClass;
 
 void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
 
 int xive_router_get_eas(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
 XiveEAS *eas);
+int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+XiveEND *end);
+int xive_router_write_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+  XiveEND *end, uint8_t word_number);
+
+/*
+ * For legacy compatibility, the exceptions define up to 256 different
+ * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
+ * and the least favored level 0xFF.
+ */
+#define XIVE_PRIORITY_MAX  7
+
+void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
+void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width, Monitor *mon);
 
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index 15f2470ed9cc..3c0ebad18b69 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -47,4 +47,61 @@ typedef struct XiveEAS {
 #define GETFIELD_BE64(m, v)  GETFIELD(m, be64_to_cpu(v))
 #define SETFIELD_BE64(m, v, val) cpu_to_be64(SETFIELD(m, be64_to_cpu(v), val))
 
+/* Event Notification Descriptor (END) */
+typedef struct XiveEND {
+uint32_t    w0;
+#define END_W0_VALID PPC_BIT32(0) /* "v" bit */
+#define END_W0_ENQUEUE   PPC_BIT32(1) /* "q" bit */
+#define END_W0_UCOND_NOTIFY  PPC_BIT32(2) /* "n" bit */
+#define END_W0_BACKLOG   PPC_BIT32(3) /* "b" bit */
+#define END_W0_PRECL_ESC_CTL PPC_BIT32(4) /* "p" bit */
+#define END_W0_ESCALATE_CTL  PPC_BIT32(5) /* "e" bit */
+#define END_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
+#define END_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
+#define END_W0_QSIZE PPC_BITMASK32(12, 15)
+#define END_W0_SW0   PPC_BIT32(16)
+#define END_W0_FIRMWARE  END_W0_SW0 /* Owned by FW */
+#define END_QSIZE_4K 0
+#define END_QSIZE_64K    4
+#define END_W0_HWDEP PPC_BITMASK32(24, 31)
+uint32_t    w1;
+#define END_W1_ESn   PPC_BITMASK32(0, 1)
+#define END_W1_ESn_P PPC_BIT32(0)
+#define END_W1_ESn_Q PPC_BIT32(1)
+#define END_W1_ESe   PPC_BITMASK32(2, 3)
+#define END_W1_ESe_P PPC_BIT32(2)
+#define END_W1_ESe_Q PPC_BIT32(3)
+#define END_W1_GENERATION    PPC_BIT32(9)
+#define END_W1_PAGE_OFF  PPC_BITMASK32(10, 31)
+uint32_t    w2;
+#define END_W2_MIGRATION_REG PPC_BITMASK32(0, 3)
+#define END_W2_OP_DESC_HI    PPC_BITMASK32(4, 31)
+uint32_t    w3;
+#define END_W3_OP_DESC_LO    PPC_BITMASK32(0, 31)
+uint32_t    w4;
+#define END_W4_ESC_END_BLOCK PPC_BITMASK32(4, 7)
+#define END_W4_ESC_END_INDEX PPC_BITMASK32(8, 31)
+uint32_t    w5;
+#define END_W5_ESC_END_DATA  PPC_BITMASK32(1, 31)
+uint32_t    w6;
+#define END_W6_FORMAT_BIT    PPC_BIT32(8)
+#define END_W6_NVT_BLOCK PPC_BITMASK32(9, 12)
+#define END_W6_NVT_INDEX PPC_BITMASK32(13, 31)
+uint32_t    w7;
+#define END_W7_F0_IGNORE PPC_BIT32(0)
+#define END_W7_F0_BLK_GROUPING

[Qemu-devel] [PATCH v6 08/37] ppc/xive: introduce a simplified XIVE presenter

2018-12-05 Thread Cédric Le Goater

The last sub-engine of the XIVE architecture is the Interrupt
Virtualization Presentation Engine (IVPE). On HW, the IVRE and the
IVPE share elements (the Power Bus interface (CQ) and the routing
table descriptors) and can be combined in the same HW logic. We do
the same in QEMU and combine both engines in the XiveRouter for
simplicity.

When the IVRE has completed its job of matching an event source with a
Notification Virtual Target (NVT) to notify, it forwards the event
notification to the IVPE sub-engine. The IVPE scans the thread
interrupt contexts of the Notification Virtual Targets (NVT)
dispatched on the HW processor threads and if a match is found, it
signals the thread. If not, the IVPE escalates the notification to
some other targets and records the notification in a backlog queue.

For each of its NVTs that is not dispatched on a HW processor thread,
the IVPE maintains the thread interrupt context state in the
Notification Virtual Target table (NVTT).

The model currently only supports single NVT notifications.
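The single-NVT match can be pictured as a scan of the dispatched thread contexts for the NVT's CAM line, built the same way as xive_nvt_cam_line(). A deliberately simplified sketch: ThreadCtx is a hypothetical stand-in holding one CAM word per thread, and the escalation and backlog recording that follow a failed match are not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Same packing as xive_nvt_cam_line() in xive.h */
static uint32_t nvt_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
{
    return ((uint32_t)nvt_blk << 19) | nvt_idx;
}

/* Hypothetical, simplified thread context: whether an NVT is dispatched
 * on the thread and the CAM line of that NVT */
typedef struct ThreadCtx {
    int      valid;
    uint32_t os_cam;
} ThreadCtx;

/* Return the index of the thread whose dispatched NVT matches, or -1 when
 * the NVT is not dispatched (where HW would escalate and backlog) */
static int presenter_match(const ThreadCtx *tctx, size_t n,
                           uint8_t nvt_blk, uint32_t nvt_idx)
{
    uint32_t cam = nvt_cam_line(nvt_blk, nvt_idx);

    for (size_t i = 0; i < n; i++) {
        if (tctx[i].valid && tctx[i].os_cam == cam) {
            return (int)i;
        }
    }
    return -1;
}
```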

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/xive.h  |  15 +++
 include/hw/ppc/xive_regs.h |  24 
 hw/intc/xive.c | 227 +
 3 files changed, 266 insertions(+)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 74b547707b17..e9b06e75fc1c 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -327,6 +327,10 @@ typedef struct XiveRouterClass {
XiveEND *end);
 int (*write_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
  XiveEND *end, uint8_t word_number);
+int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+   XiveNVT *nvt);
+int (*write_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+ XiveNVT *nvt, uint8_t word_number);
 } XiveRouterClass;
 
 void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
@@ -337,6 +341,11 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
 XiveEND *end);
 int xive_router_write_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
   XiveEND *end, uint8_t word_number);
+int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+XiveNVT *nvt);
+int xive_router_write_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+  XiveNVT *nvt, uint8_t word_number);
+
 
 /*
  * XIVE END ESBs
@@ -393,6 +402,7 @@ typedef struct XiveTCTX {
 qemu_irq    output;
 
 uint8_t regs[XIVE_TM_RING_COUNT * XIVE_TM_RING_SIZE];
+uint32_t    hw_cam;
 } XiveTCTX;
 
 /*
@@ -412,4 +422,9 @@ extern const MemoryRegionOps xive_tm_ops;
 
 void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
 
+static inline uint32_t xive_nvt_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
+{
+return (nvt_blk << 19) | nvt_idx;
+}
+
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index ede3d04c5eda..85557e730cd8 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -186,4 +186,28 @@ typedef struct XiveEND {
 #define GETFIELD_BE32(m, v)   GETFIELD(m, be32_to_cpu(v))
 #define SETFIELD_BE32(m, v, val)  cpu_to_be32(SETFIELD(m, be32_to_cpu(v), val))
 
+/* Notification Virtual Target (NVT) */
+typedef struct XiveNVT {
+uint32_t    w0;
+#define NVT_W0_VALID PPC_BIT32(0)
+uint32_t    w1;
+uint32_t    w2;
+uint32_t    w3;
+uint32_t    w4;
+uint32_t    w5;
+uint32_t    w6;
+uint32_t    w7;
+uint32_t    w8;
+#define NVT_W8_GRP_VALID PPC_BIT32(0)
+uint32_t    w9;
+uint32_t    wa;
+uint32_t    wb;
+uint32_t    wc;
+uint32_t    wd;
+uint32_t    we;
+uint32_t    wf;
+} XiveNVT;
+
+#define xive_nvt_is_valid(nvt)(be32_to_cpu((nvt)->w0) & NVT_W0_VALID)
+
 #endif /* PPC_XIVE_REGS_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 80a965c14200..891542920683 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -358,6 +358,25 @@ void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
 }
 }
 
+/* The HW CAM (23bits) is hardwired to :
+ *
+ *   0x000||0b1||4Bit chip number||7Bit Thread number.
+ *
+ * and when the block grouping extension is enabled :
+ *
+ *   4Bit chip number||0x001||7Bit Thread number.
+ */
+static uint32_t hw_cam_line(uint8_t chip_id, uint8_t tid)
+{
+bool block_group = false; /* TODO (PowerNV) */
+
+if (block_group) {
+return 1 << 11 | (chip_id & 0xf) << 7 | (tid & 0x7f);
+} else {
+return (chip_id & 0xf) << 11 | 1 << 7 | (tid & 0x7f);
+}
+}
+
 static void xive_tctx_reset(void *dev)
 {
 XiveTCTX *tctx = XIVE_TCTX(dev);
@@ -388,6 +407,12 @@ static void xive_tctx_realize(DeviceState *dev, Error

[Qemu-devel] [PATCH v6 07/37] ppc/xive: introduce the XIVE interrupt thread context

2018-12-05 Thread Cédric Le Goater

Each POWER9 processor chip has a XIVE presenter that can generate four
different exceptions to its threads:

  - hypervisor exception,
  - O/S exception,
  - Event-Based Branch (EBB),
  - msgsnd (doorbell).

Each exception has a state independent from the others called a Thread
Interrupt Management context. This context is a set of registers which
lets the thread handle priority management and interrupt acknowledgment
among other things. The most important ones are :

  - Interrupt Priority Register  (PIPR)
  - Interrupt Pending Buffer (IPB)
  - Current Processor Priority   (CPPR)
  - Notification Source Register (NSR)

These registers are accessible through a specific MMIO region, called
the Thread Interrupt Management Area (TIMA), four aligned pages, each
exposing a different view of the registers. The first page (page
address ending in 0b00) gives access to the entire context and is
reserved for the ring 0 view of the physical thread context. The second (page
address ending in 0b01) is for the hypervisor, ring 1 view. The third
(page address ending in 0b10) is for the operating system, ring 2
view. The fourth (page address ending in 0b11) is for user level, ring
3 view.

The thread interrupt context is modeled with a XiveTCTX object
containing the values of the different exception registers. The TIMA
region is mapped at the same address for each CPU.
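With TM_SHIFT equal to 16, each view is a 64K page, so bits 16-17 of a TIMA offset select the ring view and the low bits select a byte of the XiveTCTX regs[] array (four rings of 0x10 bytes). A minimal decoding sketch using constants from this series; the helper names are hypothetical, and masking the register offset with 0x3f is an assumption of the sketch. The per-ring access permissions are not modelled.

```c
#include <stdint.h>

#define TM_SHIFT           16     /* one 64K page per view */
#define XIVE_TM_HW_PAGE    0x0
#define XIVE_TM_HV_PAGE    0x1
#define XIVE_TM_OS_PAGE    0x2
#define XIVE_TM_USER_PAGE  0x3

#define TM_QW1_OS          0x010
#define TM_CPPR            0x1

/* Which view (ring) an access into the four-page TIMA goes through */
static unsigned tima_page(uint64_t offset)
{
    return (unsigned)((offset >> TM_SHIFT) & 0x3);
}

/* Byte offset within the page: index into XiveTCTX::regs[] (4 x 0x10 bytes) */
static unsigned tima_reg(uint64_t offset)
{
    return (unsigned)(offset & 0x3f);
}

/* Offset an O/S would use to reach its own CPPR through the OS view */
static uint64_t tima_os_cppr_offset(void)
{
    return ((uint64_t)XIVE_TM_OS_PAGE << TM_SHIFT) | TM_QW1_OS | TM_CPPR;
}
```

The OS view of the CPPR is thus at offset 0x20011: page 0b10, byte 0x11 of the register array.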

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/xive.h  |  44 
 include/hw/ppc/xive_regs.h |  82 
 hw/intc/xive.c | 419 +
 3 files changed, 545 insertions(+)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index d67b0785df7c..74b547707b17 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -368,4 +368,48 @@ typedef struct XiveENDSource {
 void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
 void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width, Monitor *mon);
 
+/*
+ * XIVE Thread interrupt Management (TM) context
+ */
+
+#define TYPE_XIVE_TCTX "xive-tctx"
+#define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
+
+/*
+ * XIVE Thread interrupt Management register rings :
+ *
+ *   QW-0  User   event-based exception state
+ *   QW-1  O/S        OS context for priority management, interrupt acks
+ *   QW-2  Pool   hypervisor pool context for virtual processors dispatched
+ *   QW-3  Physical   physical thread context and security context
+ */
+#define XIVE_TM_RING_COUNT  4
+#define XIVE_TM_RING_SIZE   0x10
+
+typedef struct XiveTCTX {
+DeviceState parent_obj;
+
+CPUState    *cs;
+qemu_irq    output;
+
+uint8_t regs[XIVE_TM_RING_COUNT * XIVE_TM_RING_SIZE];
+} XiveTCTX;
+
+/*
+ * XIVE Thread Interrupt Management Area (TIMA)
+ *
+ * This region gives access to the registers of the thread interrupt
+ * management context. It is four pages wide, each page providing a
+ * different view of the registers. The page with the lower offset is
+ * the most privileged and gives access to the entire context.
+ */
+#define XIVE_TM_HW_PAGE 0x0
+#define XIVE_TM_HV_PAGE 0x1
+#define XIVE_TM_OS_PAGE 0x2
+#define XIVE_TM_USER_PAGE   0x3
+
+extern const MemoryRegionOps xive_tm_ops;
+
+void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
+
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index 3c0ebad18b69..ede3d04c5eda 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -23,6 +23,88 @@
 #define XIVE_SRCNO_INDEX(srcno) ((srcno) & 0x0fff)
 #define XIVE_SRCNO(blk, idx)((uint32_t)(blk) << 28 | (idx))
 
+#define TM_SHIFT    16
+
+/* TM register offsets */
+#define TM_QW0_USER 0x000 /* All rings */
+#define TM_QW1_OS   0x010 /* Ring 0..2 */
+#define TM_QW2_HV_POOL  0x020 /* Ring 0..1 */
+#define TM_QW3_HV_PHYS  0x030 /* Ring 0..1 */
+
+/* Byte offsets inside a QW QW0 QW1 QW2 QW3 */
+#define TM_NSR  0x0  /*  +   +   -   +  */
+#define TM_CPPR 0x1  /*  -   +   -   +  */
+#define TM_IPB  0x2  /*  -   +   +   +  */
+#define TM_LSMFB    0x3  /*  -   +   +   +  */
+#define TM_ACK_CNT  0x4  /*  -   +   -   -  */
+#define TM_INC  0x5  /*  -   +   -   +  */
+#define TM_AGE  0x6  /*  -   +   -   +  */
+#define TM_PIPR 0x7  /*  -   +   -   +  */
+
+#define TM_WORD0    0x0
+#define TM_WORD1    0x4
+
+/*
+ * QW word 2 contains the valid bit at the top and other fields
+ * depending on the QW.
+ */
+#define TM_WORD2    0x8
+#define   TM_QW0W2_VU   PPC_BIT32(0)
+#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
+#define   TM_QW1W2_VO   PPC_BIT32(0)
+#define   TM_QW1W2_OS_CAM   PPC_BITMASK32(8, 31)
+#define   TM_QW2W2_VP   PPC_BIT32(0)
+#define

[Qemu-devel] [PATCH v6 18/37] spapr: add device tree support for the XIVE exploitation mode

2018-12-05 Thread Cédric Le Goater

The XIVE interface for the guest is described in the device tree under
the "interrupt-controller" node. A couple of new properties are
specific to XIVE :

 - "reg"

   contains the base address and size of the thread interrupt
   management areas (TIMA), for the User level and for the Guest OS
   level. Only the Guest OS level is taken into account today.

 - "ibm,xive-eq-sizes"

   the sizes of the event queues: one cell per supported size, each
   containing the log2 of the size, in ascending order.

 - "ibm,xive-lisn-ranges"

   the IRQ interrupt number ranges assigned to the guest for the IPIs.

and also under the root node :

 - "ibm,plat-res-int-priorities"

   contains a list of priorities that the hypervisor has reserved for
   its own use. OPAL uses the priority 7 queue to automatically
   escalate interrupts for all other queues (DD2.X POWER9). So only
   priorities [0..6] are allowed for the guest.
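The values placed in these properties reduce to a little arithmetic. A sketch under stated assumptions: the TIMA pairs follow the timas[] computation in spapr_dt_xive() before the cpu_to_be64() conversion (the size of the OS entry is assumed to equal the User one, since that line is cut off here), the reserved range is the (start 7, count 0xf8) pair from plat_res_int_priorities, and a queue of 2^n bytes is assumed to hold 2^(n-2) four-byte entries. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TM_SHIFT           16
#define XIVE_TM_OS_PAGE    0x2
#define XIVE_TM_USER_PAGE  0x3

/* Reserved range advertised in ibm,plat-res-int-priorities */
#define RES_PRIO_START     7
#define RES_PRIO_COUNT     0xf8

/* (address, size) pairs of the "reg" property: User view, then OS view */
static void xive_dt_timas(uint64_t tm_base, uint64_t timas[4])
{
    timas[0] = tm_base + XIVE_TM_USER_PAGE * (1ull << TM_SHIFT);
    timas[1] = 1ull << TM_SHIFT;
    timas[2] = tm_base + XIVE_TM_OS_PAGE * (1ull << TM_SHIFT);
    timas[3] = 1ull << TM_SHIFT;   /* assumed: same size as the User view */
}

/* A priority is reserved if it falls in [start, start + count) */
static bool prio_reserved(uint32_t prio)
{
    return prio >= RES_PRIO_START && prio < RES_PRIO_START + RES_PRIO_COUNT;
}

/* Number of 4-byte entries in a queue whose size is 2^log2_size bytes */
static uint32_t eq_entries(uint32_t log2_size)
{
    return 1u << (log2_size - 2);
}
```

With a hypothetical base of 0x100000, the User view lands at 0x130000 and the OS view at 0x120000, each 64K long; a 4K queue (log2 12) holds 1024 entries, which matches 1 << (qsize + 10) for END_QSIZE_4K.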

Extend the sPAPR IRQ backend with a new handler to populate the DT
with the appropriate "interrupt-controller" node.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_irq.h  |  2 ++
 include/hw/ppc/spapr_xive.h |  2 ++
 include/hw/ppc/xics.h   |  4 +--
 hw/intc/spapr_xive.c| 64 +
 hw/intc/xics_spapr.c|  3 +-
 hw/ppc/spapr.c  |  3 +-
 hw/ppc/spapr_irq.c  |  3 ++
 7 files changed, 77 insertions(+), 4 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index eec3159cd8d8..457239826b8f 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -39,6 +39,8 @@ typedef struct sPAPRIrq {
 void (*free)(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
 void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
+void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
+void *fdt, uint32_t phandle);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 9506a8f4d10a..728a5e8dc163 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -45,5 +45,7 @@ qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
 typedef struct sPAPRMachineState sPAPRMachineState;
 
 void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+void spapr_dt_xive(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
+   uint32_t phandle);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
index 9958443d1984..14afda198cdb 100644
--- a/include/hw/ppc/xics.h
+++ b/include/hw/ppc/xics.h
@@ -181,8 +181,6 @@ typedef struct XICSFabricClass {
 ICPState *(*icp_get)(XICSFabric *xi, int server);
 } XICSFabricClass;
 
-void spapr_dt_xics(int nr_servers, void *fdt, uint32_t phandle);
-
 ICPState *xics_icp_get(XICSFabric *xi, int server);
 
 /* Internal XICS interfaces */
@@ -204,6 +202,8 @@ void icp_resend(ICPState *ss);
 
 typedef struct sPAPRMachineState sPAPRMachineState;
 
+void spapr_dt_xics(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
+   uint32_t phandle);
 int xics_kvm_init(sPAPRMachineState *spapr, Error **errp);
 void xics_spapr_init(sPAPRMachineState *spapr);
 
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index f54100b175a5..fd02dc6b91e4 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -14,6 +14,7 @@
 #include "target/ppc/cpu.h"
 #include "sysemu/cpus.h"
 #include "monitor/monitor.h"
+#include "hw/ppc/fdt.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_xive.h"
 #include "hw/ppc/xive.h"
@@ -1379,3 +1380,66 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr)
 spapr_register_hypercall(H_INT_SYNC, h_int_sync);
 spapr_register_hypercall(H_INT_RESET, h_int_reset);
 }
+
+void spapr_dt_xive(sPAPRMachineState *spapr, uint32_t nr_servers, void *fdt,
+   uint32_t phandle)
+{
+sPAPRXive *xive = spapr->xive;
+int node;
+uint64_t timas[2 * 2];
+/* Interrupt number ranges for the IPIs */
+uint32_t lisn_ranges[] = {
+cpu_to_be32(0),
+cpu_to_be32(nr_servers),
+};
+uint32_t eq_sizes[] = {
+cpu_to_be32(12), /* 4K */
+cpu_to_be32(16), /* 64K */
+cpu_to_be32(21), /* 2M */
+cpu_to_be32(24), /* 16M */
+};
+/* The following array is in sync with the reserved priorities
+ * defined by the 'spapr_xive_priority_is_reserved' routine.
+ */
+uint32_t plat_res_int_priorities[] = {
+cpu_to_be32(7),/* start */
+cpu_to_be32(0xf8), /* count */
+};
+gchar *nodename;
+
+/* Thread Interrupt Management Area : User (ring 3) and OS (ring 2) */
+timas[0] = cpu_to_be64(xive->tm_base +
+   XIVE_TM_USER_PAGE * (1ull << TM_SHIFT));
+timas[1] = cpu_to_be64(1ull << TM_SHIFT);
+timas[2] = cpu_to_be64(xive->tm_base +
+   XIVE_TM_OS_PAGE * (1ull <<

[Qemu-devel] [PATCH v6 14/37] spapr: modify the irq backend 'init' method

2018-12-05 Thread Cédric Le Goater

Add a 'nr_irqs' parameter to the 'init' method to remove the use of
the machine class. This will be useful when we introduce the machine
supporting the two sPAPR IRQ backends: XICS and XIVE.
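A minimal sketch of the pattern this patch applies, for illustration only: the backend's init hook receives the interrupt count as an argument instead of looking it up in the machine class, so the caller decides which count each backend gets. The names Machine, IrqBackend, backend_init_xics() and irq_init() are hypothetical stand-ins for sPAPRMachineState, sPAPRIrq and the real routines, and error reporting is left out.

```c
#include <assert.h>

/* Hypothetical stand-in for sPAPRMachineState. */
typedef struct Machine {
    int nr_irqs_configured;   /* size the backend was initialized with */
} Machine;

/* Hypothetical stand-in for sPAPRIrq: the count is now an explicit
 * argument of the init hook. */
typedef struct IrqBackend {
    const char *name;
    int nr_irqs;
    void (*init)(Machine *m, int nr_irqs);
} IrqBackend;

/* The backend no longer reads any class data to size itself. */
static void backend_init_xics(Machine *m, int nr_irqs)
{
    assert(nr_irqs > 0);
    m->nr_irqs_configured = nr_irqs;
}

/* The frontend chooses which count to hand down to the backend. */
static void irq_init(Machine *m, const IrqBackend *irq)
{
    irq->init(m, irq->nr_irqs);
}
```

Keeping the count out of the backend means a caller driving two backends can pick the value for each one without the backends knowing about each other.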

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/spapr_irq.h | 2 +-
 hw/ppc/spapr_irq.c         | 7 +++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index bd7301e6d9c6..0e9229bf219e 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -33,7 +33,7 @@ typedef struct sPAPRIrq {
     uint32_t    nr_irqs;
     uint32_t    nr_msis;
 
-    void (*init)(sPAPRMachineState *spapr, Error **errp);
+    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
     int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
     void (*free)(sPAPRMachineState *spapr, int irq, int num);
     qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index f8b651de0ec9..bac45023 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -90,11 +90,10 @@ error:
     return NULL;
 }
 
-static void spapr_irq_init_xics(sPAPRMachineState *spapr, Error **errp)
+static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
+                                Error **errp)
 {
     MachineState *machine = MACHINE(spapr);
-    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
-    int nr_irqs = smc->irq->nr_irqs;
     Error *local_err = NULL;
 
     if (kvm_enabled()) {
@@ -217,7 +216,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
         spapr_irq_msi_init(spapr, smc->irq->nr_msis);
     }
 
-    smc->irq->init(spapr, errp);
+    smc->irq->init(spapr, smc->irq->nr_irqs, errp);
 }
 
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
-- 
2.17.2

[Qemu-devel] [PATCH v6 03/37] ppc/xive: introduce the XiveNotifier interface

2018-12-05 Thread Cédric Le Goater

The XiveNotifier offers a simple interface between the XiveSource
object and the main interrupt controller of the machine. It will
forward event notifications to the XIVE Interrupt Virtualization
Routing Engine (IVRE).
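The decoupling can be pictured with a plain C sketch, assuming a simplified model: the patch uses a QOM interface (TYPE_INTERFACE with a XiveNotifierClass holding the notify hook), whereas the sketch below replaces QOM with an ordinary ops pointer. Notifier, NotifierOps, Router, Source and the functions are hypothetical names. The point it shows is that the source only holds a pointer to "something that can be notified" and never sees the router type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Notifier Notifier;

/* Hypothetical ops table playing the role of XiveNotifierClass. */
typedef struct NotifierOps {
    void (*notify)(Notifier *xn, uint32_t lisn);
} NotifierOps;

struct Notifier {
    const NotifierOps *ops;
};

/* A toy router implementing the interface. */
typedef struct Router {
    Notifier parent;      /* first member, so a Notifier* can be cast back */
    uint32_t last_lisn;
    unsigned count;
} Router;

static void router_notify(Notifier *xn, uint32_t lisn)
{
    Router *r = (Router *)xn;

    r->last_lisn = lisn;
    r->count++;
}

static const NotifierOps router_ops = { .notify = router_notify };

/* The source knows only the Notifier, never the Router type. */
typedef struct Source {
    Notifier *xive;
} Source;

static void source_notify(Source *src, uint32_t srcno)
{
    assert(src->xive != NULL);
    if (src->xive->ops->notify) {
        src->xive->ops->notify(src->xive, srcno);
    }
}
```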

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/xive.h | 23 +++++++++++++++++++++++
 hw/intc/xive.c        | 25 +++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 7cebc32eba4c..6770cffec67d 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -142,6 +142,27 @@
 
 #include "hw/qdev-core.h"
 
+/*
+ * XIVE Fabric (Interface between Source and Router)
+ */
+
+typedef struct XiveNotifier {
+    Object parent;
+} XiveNotifier;
+
+#define TYPE_XIVE_NOTIFIER "xive-fabric"
+#define XIVE_NOTIFIER(obj) \
+    OBJECT_CHECK(XiveNotifier, (obj), TYPE_XIVE_NOTIFIER)
+#define XIVE_NOTIFIER_CLASS(klass) \
+    OBJECT_CLASS_CHECK(XiveNotifierClass, (klass), TYPE_XIVE_NOTIFIER)
+#define XIVE_NOTIFIER_GET_CLASS(obj)   \
+    OBJECT_GET_CLASS(XiveNotifierClass, (obj), TYPE_XIVE_NOTIFIER)
+
+typedef struct XiveNotifierClass {
+    InterfaceClass parent;
+    void (*notify)(XiveNotifier *xn, uint32_t lisn);
+} XiveNotifierClass;
+
 /*
  * XIVE Interrupt Source
  */
@@ -171,6 +192,8 @@ typedef struct XiveSource {
     uint64_t        esb_flags;
     uint32_t        esb_shift;
     MemoryRegion    esb_mmio;
+
+    XiveNotifier    *xive;
 } XiveSource;
 
 /*
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 11c7aac962de..79238eb57fae 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -155,7 +155,11 @@ static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
   */
  static void xive_source_notify(XiveSource *xsrc, int srcno)
  {
+    XiveNotifierClass *xnc = XIVE_NOTIFIER_GET_CLASS(xsrc->xive);
 
+    if (xnc->notify) {
+        xnc->notify(xsrc->xive, srcno);
+    }
 }
 
 /*
@@ -362,6 +366,17 @@ static void xive_source_reset(void *dev)
 static void xive_source_realize(DeviceState *dev, Error **errp)
 {
     XiveSource *xsrc = XIVE_SOURCE(dev);
+    Object *obj;
+    Error *local_err = NULL;
+
+    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'xive' not found: ");
+        return;
+    }
+
+    xsrc->xive = XIVE_NOTIFIER(obj);
 
     if (!xsrc->nr_irqs) {
         error_setg(errp, "Number of interrupt needs to be greater than 0");
@@ -428,9 +443,19 @@ static const TypeInfo xive_source_info = {
     .class_init    = xive_source_class_init,
 };
 
+/*
+ * XIVE Fabric
+ */
+static const TypeInfo xive_fabric_info = {
+    .name = TYPE_XIVE_NOTIFIER,
+    .parent = TYPE_INTERFACE,
+    .class_size = sizeof(XiveNotifierClass),
+};
+
 static void xive_register_types(void)
 {
     type_register_static(&xive_source_info);
+    type_register_static(&xive_fabric_info);
 }
 
 type_init(xive_register_types)
-- 
2.17.2

[Qemu-devel] [PATCH v6 13/37] spapr: introduce a spapr_irq_init() routine

2018-12-05 Thread Cédric Le Goater

Initialize the MSI bitmap from this new routine, as this will be
necessary for the sPAPR IRQ backend for XIVE.

Signed-off-by: Cédric Le Goater 
Reviewed-by: David Gibson 
---
 include/hw/ppc/spapr_irq.h |  1 +
 hw/ppc/spapr.c             |  2 +-
 hw/ppc/spapr_irq.c         | 16 +++++++++++-----
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index a467ce696ee4..bd7301e6d9c6 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -43,6 +43,7 @@ typedef struct sPAPRIrq {
 extern sPAPRIrq spapr_irq_xics;
 extern sPAPRIrq spapr_irq_xics_legacy;
 
+void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 50cb9f9f4a02..e470efe7993c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
     spapr_set_vsmt_mode(spapr, &error_fatal);
 
     /* Set up Interrupt Controller before we create the VCPUs */
-    smc->irq->init(spapr, &error_fatal);
+    spapr_irq_init(spapr, &error_fatal);
 
     /* Set up containers for ibm,client-architecture-support negotiated options
      */
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index e77b94cc685e..f8b651de0ec9 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -97,11 +97,6 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, Error **errp)
     int nr_irqs = smc->irq->nr_irqs;
     Error *local_err = NULL;
 
-    /* Initialize the MSI IRQ allocator. */
-    if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
-        spapr_irq_msi_init(spapr, smc->irq->nr_msis);
-    }
-
     if (kvm_enabled()) {
         if (machine_kernel_irqchip_allowed(machine) &&
             !xics_kvm_init(spapr, &local_err)) {
@@ -213,6 +208,17 @@ sPAPRIrq spapr_irq_xics = {
 /*
  * sPAPR IRQ frontend routines for devices
  */
+void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
+{
+    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+
+    /* Initialize the MSI IRQ allocator. */
+    if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
+        spapr_irq_msi_init(spapr, smc->irq->nr_msis);
+    }
+
+    smc->irq->init(spapr, errp);
+}
 
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
 {
-- 
2.17.2

[Qemu-devel] [PATCH v6 01/37] ppc/xive: introduce a XIVE interrupt source model

2018-12-05 Thread Cédric Le Goater

The first sub-engine of the overall XIVE architecture is the Interrupt
Virtualization Source Engine (IVSE). An IVSE can be integrated into
other logic, such as a PCI PHB, or into the main interrupt controller
to manage IPIs.

Each IVSE instance is associated with an Event State Buffer (ESB) that
contains a two-bit state entry for each possible event source. When an
event is signaled to the IVSE, by MMIO or some other means, the
associated interrupt state bits are fetched from the ESB and
modified. Depending on the resulting ESB state, the event is forwarded
to the IVRE sub-engine of the controller doing the routing.

Each supported ESB entry is associated with either a single page or an
even/odd pair of pages, which provide commands to manage the source,
for instance to EOI it or to turn it off.

On a sPAPR machine, the O/S obtains the page address of the ESB
entry associated with a source, and its characteristics, using the
H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is used.

The xive_source_notify() routine is in charge of forwarding the source
event notification to the routing engine. It will be filled in later on.

Signed-off-by: Cédric Le Goater 
---
 default-configs/ppc64-softmmu.mak |   1 +
 include/hw/ppc/xive.h             | 260 
 hw/intc/xive.c                    | 382 ++
 hw/intc/Makefile.objs             |   1 +
 4 files changed, 644 insertions(+)
 create mode 100644 include/hw/ppc/xive.h
 create mode 100644 hw/intc/xive.c

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index aec2855750d6..2d1e7c5c4668 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -16,6 +16,7 @@ CONFIG_VIRTIO_VGA=y
 CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
+CONFIG_XIVE=$(CONFIG_PSERIES)
 CONFIG_MEM_DEVICE=y
 CONFIG_DIMM=y
 CONFIG_SPAPR_RNG=y
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
new file mode 100644
index 000000000000..7aa2e3801222
--- /dev/null
+++ b/include/hw/ppc/xive.h
@@ -0,0 +1,260 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ *
+ * The POWER9 processor comes with a new interrupt controller, called
+ * XIVE as "eXternal Interrupt Virtualization Engine".
+ *
+ * = Overall architecture
+ *
+ *
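+ *                              XIVE Interrupt Controller
+ *                              +------------------------------------+      IPIs
+ *                              | +---------+ +---------+ +--------+ |    +-------+
+ *                              | |VC       | |CQ       | |PC      |----> | CORES |
+ *                              | |     esb | |         | |        |----> |       |
+ *                              | |     eas | |  Bridge | |   tctx |----> |       |
+ *                              | |SC   end | |         | |    nvt | |    |       |
+ *  +------+                    | +---------+ +----+----+ +--------+ |    +-+-+-+-+
+ *  | RAM  |                    +------------------|-----------------+      | | |
+ *  |      |                                       |                        | | |
+ *  |      |                                       |                        | | |
+ *  |      |  +------------------------------------v------------------------v-v-v--+    other
+ *  |      <--+                                                      Power Bus     +--> chips
+ *  |  esb |  +---------+-----------------------+----------------------------------+
+ *  |  eas |            |                       |
+ *  |  end |         +--|------+                |
+ *  |  nvt |       +----+----+ |           +----+----+
+ *  +------+       |SC       | |           |SC       |
+ *                 |         | |           |         |
+ *                 | PQ-bits | |           | PQ-bits |
+ *                 | local   |-+           |  in VC  |
+ *                 +---------+             +---------+
+ *                    PCIe                 NX,NPU,CAPI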
+ *
+ *   SC: Source Controller (aka. IVSE)
+ *   VC: Virtualization Controller (aka. IVRE)
+ *   PC: Presentation Controller (aka. IVPE)
+ *   CQ: Common Queue (Bridge)
+ *
+ *  PQ-bits: 2 bits source state machine (P:pending Q:queued)
+ *  esb: Event State Buffer (Array of PQ bits in an IVSE)
+ *  eas: Event Assignment Structure
+ *  end: Event Notification Descriptor
+ *  nvt: Notification Virtual Target
+ * tctx: Thread interrupt Context
+ *
+ *
+ * The XIVE IC is composed of three sub-engines :
+ *
+ * - Interrupt Virtualization Source Engine (IVSE), or Source
+ *   Controller (SC). These are found in PCI PHBs, in the PSI host
+ *   bridge controller, but also inside the main controller for the
+ *   core IPIs and other sub-chips (NX, CAP, NPU) of the
+ *   chip/processor. They are configured to feed the IVRE with events.
+ *
+ * - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
+ *   Controller (VC). Its job is to match an event source with an
+ *   Event Notification Descriptor

[Qemu-devel] [PATCH v6 12/37] spapr: initialize VSMT before initializing the IRQ backend

2018-12-05 Thread Cédric Le Goater

We will need to use xics_max_server_number() to create the sPAPRXive
object modeling the interrupt controller of the machine, which is
created before the CPUs. VSMT must therefore be set first.

Signed-off-by: Cédric Le Goater 
Reviewed-by: Greg Kurz 
---
 hw/ppc/spapr.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7afd1a175bf2..50cb9f9f4a02 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
         boot_cores_nr = possible_cpus->len;
     }
 
-    /* VSMT must be set in order to be able to compute VCPU ids, ie to
-     * call xics_max_server_number() or spapr_vcpu_id().
-     */
-    spapr_set_vsmt_mode(spapr, &error_fatal);
-
     if (smc->pre_2_10_has_unused_icps) {
         int i;
 
@@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
     /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
     load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
 
+    /* VSMT must be set in order to be able to compute VCPU ids, ie to
+     * call xics_max_server_number() or spapr_vcpu_id().
+     */
+    spapr_set_vsmt_mode(spapr, &error_fatal);
+
     /* Set up Interrupt Controller before we create the VCPUs */
     smc->irq->init(spapr, &error_fatal);
 
-- 
2.17.2

[Qemu-devel] [PATCH v6 04/37] ppc/xive: introduce the XiveRouter model

2018-12-05 Thread Cédric Le Goater

The XiveRouter models the second sub-engine of the XIVE architecture:
the Interrupt Virtualization Routing Engine (IVRE).

The IVRE handles event notifications of the IVSE and performs the
interrupt routing process. For this purpose, it uses a set of tables
stored in system memory, the first of which is the Event Assignment
Structure (EAS) table, or EAT.

The EAT associates an interrupt source number with an Event Notification
Descriptor (END) which will be used in a second phase of the routing
process to identify a Notification Virtual Target.

The XiveRouter is an abstract class: subclasses must provide the
storage for the EAT and for the other upcoming tables.
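To make the EAS layout introduced in xive_regs.h concrete, here is a decoding sketch. It assumes PPC_BIT() follows IBM MSB-0 bit numbering (bit 0 is the most significant bit of the 64-bit word) and that GETFIELD() masks and shifts the field down to bit 0; the local macros and eas_decode() are illustrative re-creations, not the QEMU helpers. In QEMU the word is stored big-endian and would go through be64_to_cpu() first; the sketch works on a host-order value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit (assumed). */
#define PPC_BIT(b)          (0x8000000000000000ULL >> (b))
#define PPC_BITMASK(bs, be) ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

#define EAS_VALID       PPC_BIT(0)
#define EAS_END_BLOCK   PPC_BITMASK(4, 7)
#define EAS_END_INDEX   PPC_BITMASK(8, 31)
#define EAS_MASKED      PPC_BIT(32)
#define EAS_END_DATA    PPC_BITMASK(33, 63)

/* Extract the field selected by 'mask', shifted down to bit 0. */
static uint64_t getfield(uint64_t mask, uint64_t word)
{
    assert(mask != 0);
    unsigned shift = 0;
    while (!((mask >> shift) & 1)) {
        shift++;
    }
    return (word & mask) >> shift;
}

typedef struct EasTarget {
    bool     valid;
    bool     masked;
    uint8_t  end_blk;   /* destination END block */
    uint32_t end_idx;   /* destination END index */
    uint32_t end_data;  /* data written to the END */
} EasTarget;

/* Decode a host-order EAS word into its routing target. */
static EasTarget eas_decode(uint64_t w)
{
    EasTarget t = {
        .valid    = (w & EAS_VALID) != 0,
        .masked   = (w & EAS_MASKED) != 0,
        .end_blk  = (uint8_t)getfield(EAS_END_BLOCK, w),
        .end_idx  = (uint32_t)getfield(EAS_END_INDEX, w),
        .end_data = (uint32_t)getfield(EAS_END_DATA, w),
    };
    return t;
}
```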

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/xive.h      | 31 
 include/hw/ppc/xive_regs.h | 50 +
 hw/intc/xive.c             | 76 ++
 3 files changed, 157 insertions(+)
 create mode 100644 include/hw/ppc/xive_regs.h

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 6770cffec67d..57ec9f84f527 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -141,6 +141,8 @@
 #define PPC_XIVE_H
 
 #include "hw/qdev-core.h"
+#include "hw/sysbus.h"
+#include "hw/ppc/xive_regs.h"
 
 /*
  * XIVE Fabric (Interface between Source and Router)
@@ -297,4 +299,33 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
     }
 }
 
+/*
+ * XIVE Router
+ */
+
+typedef struct XiveRouter {
+    SysBusDevice    parent;
+} XiveRouter;
+
+#define TYPE_XIVE_ROUTER "xive-router"
+#define XIVE_ROUTER(obj) \
+    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
+#define XIVE_ROUTER_CLASS(klass) \
+    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
+#define XIVE_ROUTER_GET_CLASS(obj) \
+    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
+
+typedef struct XiveRouterClass {
+    SysBusDeviceClass parent;
+
+    /* XIVE table accessors */
+    int (*get_eas)(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
+                   XiveEAS *eas);
+} XiveRouterClass;
+
+void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
+
+int xive_router_get_eas(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
+                        XiveEAS *eas);
+
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
new file mode 100644
index 000000000000..15f2470ed9cc
--- /dev/null
+++ b/include/hw/ppc/xive_regs.h
@@ -0,0 +1,50 @@
+/*
+ * QEMU PowerPC XIVE internal structure definitions
+ *
+ *
+ * The XIVE structures are accessed by the HW and their format is
+ * architected to be big-endian. Some macros are provided to ease
+ * access to the different fields.
+ *
+ *
+ * Copyright (c) 2016-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_XIVE_REGS_H
+#define PPC_XIVE_REGS_H
+
+/*
+ * Interrupt source number encoding on PowerBUS
+ */
+#define XIVE_SRCNO_BLOCK(srcno) (((srcno) >> 28) & 0xf)
+#define XIVE_SRCNO_INDEX(srcno) ((srcno) & 0x0fffffff)
+#define XIVE_SRCNO(blk, idx)    ((uint32_t)(blk) << 28 | (idx))
+
+/* EAS (Event Assignment Structure)
+ *
+ * One per interrupt source. Targets an interrupt to a given Event
+ * Notification Descriptor (END) and provides the corresponding
+ * logical interrupt number (END data)
+ */
+typedef struct XiveEAS {
+    /* Use a single 64-bit definition to make it easier to
+     * perform atomic updates
+     */
+    uint64_t        w;
+#define EAS_VALID   PPC_BIT(0)
+#define EAS_END_BLOCK   PPC_BITMASK(4, 7)/* Destination END block# */
+#define EAS_END_INDEX   PPC_BITMASK(8, 31)   /* Destination END index */
+#define EAS_MASKED  PPC_BIT(32)  /* Masked */
+#define EAS_END_DATA    PPC_BITMASK(33, 63)  /* Data written to the END */
+} XiveEAS;
+
+#define xive_eas_is_valid(eas)   (be64_to_cpu((eas)->w) & EAS_VALID)
+#define xive_eas_is_masked(eas)  (be64_to_cpu((eas)->w) & EAS_MASKED)
+
+#define GETFIELD_BE64(m, v)  GETFIELD(m, be64_to_cpu(v))
+#define SETFIELD_BE64(m, v, val) cpu_to_be64(SETFIELD(m, be64_to_cpu(v), val))
+
+#endif /* PPC_XIVE_REGS_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 79238eb57fae..d21df6674d8c 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -443,6 +443,81 @@ static const TypeInfo xive_source_info = {
     .class_init    = xive_source_class_init,
 };
 
+/*
+ * XIVE Router (aka. Virtualization Controller or IVRE)
+ */
+
+int xive_router_get_eas(XiveRouter *xrtr, uint8_t eas_blk, uint32_t eas_idx,
+                        XiveEAS *eas)
+{
+    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
+
+    return xrc->get_eas(xrtr, eas_blk, eas_idx, eas);
+}
+
+static void xive_router_notify(XiveNotifier *xn, uint32_t lisn)
+{
+    XiveRouter *xrtr =

[Qemu-devel] [PATCH v6 09/37] ppc/xive: notify the CPU when the interrupt priority is more privileged

2018-12-05 Thread Cédric Le Goater

After the event data has been enqueued in the O/S Event Queue, the IVPE
raises the bit corresponding to the priority of the pending interrupt
in the IPB (Interrupt Pending Buffer) register to indicate that an
event is pending in one of the 8 priority queues. The Pending Interrupt
Priority Register (PIPR) is also updated from the IPB. This register
represents the priority of the most favored pending notification.

The PIPR is then compared to the Current Processor Priority Register
(CPPR). If it is more favored (numerically lower), the CPU interrupt
line is raised and the EO bit of the Notification Source Register
(NSR) is set to signal the presence of an exception to the O/S. This
check needs to be done whenever the PIPR or the CPPR changes.

The O/S acknowledges the interrupt with a special load in the Thread
Interrupt Management Area. If the EO bit of the NSR is set, the CPPR
takes the value of the PIPR. The bit in the IPB corresponding to the
priority of the pending interrupt is reset, and so is the EO bit of
the NSR.
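The IPB/PIPR/CPPR interplay described above can be exercised outside QEMU with a small model. It mirrors priority_to_ipb(), ipb_to_pipr(), xive_tctx_notify() and xive_tctx_accept() from the patch, but Ring and the ring_* functions are hypothetical simplifications: the NSR is reduced to a single EO flag, the interrupt line is a boolean, and a portable loop replaces QEMU's clz32().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PRIORITY_MAX 7   /* 8 active priorities: 0 (most favored) to 7 */

/* Priority p sets IPB bit (7 - p): priority 0 is the MSB. */
static uint8_t priority_to_ipb(uint8_t priority)
{
    return priority > PRIORITY_MAX ? 0 : (uint8_t)(1u << (PRIORITY_MAX - priority));
}

/* PIPR = leading zeros of the 8-bit IPB; 0xff when nothing is pending. */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio = 0;

    if (!ipb) {
        return 0xff;
    }
    while (!(ipb & 0x80)) {
        ipb <<= 1;
        prio++;
    }
    return prio;
}

typedef struct Ring {
    uint8_t ipb, pipr, cppr;
    bool nsr_eo;     /* exception pending for the O/S */
    bool irq_line;   /* CPU interrupt line */
} Ring;

static void ring_notify(Ring *r)
{
    if (r->pipr < r->cppr) {   /* numerically lower = more favored */
        r->nsr_eo = true;
        r->irq_line = true;
    }
}

static void ring_enqueue(Ring *r, uint8_t priority)
{
    r->ipb |= priority_to_ipb(priority);
    r->pipr = ipb_to_pipr(r->ipb);
    ring_notify(r);
}

/* Acknowledge: returns the new CPPR. */
static uint8_t ring_accept(Ring *r)
{
    r->irq_line = false;
    if (r->nsr_eo) {
        r->cppr = r->pipr;
        r->ipb &= (uint8_t)~priority_to_ipb(r->cppr);
        r->pipr = ipb_to_pipr(r->ipb);
        r->nsr_eo = false;
    }
    return r->cppr;
}
```

After an acknowledge the CPPR stays at the accepted priority, so a later, less favored event is recorded in the IPB without raising the line.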

Signed-off-by: Cédric Le Goater 
---
 hw/intc/xive.c | 94 +-
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 891542920683..0db77107ab15 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -22,9 +22,73 @@
  * XIVE Thread Interrupt Management context
  */
 
+/* Convert a priority number to an Interrupt Pending Buffer (IPB)
+ * register, which indicates a pending interrupt at the priority
+ * corresponding to the bit number
+ */
+static uint8_t priority_to_ipb(uint8_t priority)
+{
+    return priority > XIVE_PRIORITY_MAX ?
+        0 : 1 << (XIVE_PRIORITY_MAX - priority);
+}
+
+/* Convert an Interrupt Pending Buffer (IPB) register to a Pending
+ * Interrupt Priority Register (PIPR), which contains the priority of
+ * the most favored pending notification.
+ */
+static uint8_t ipb_to_pipr(uint8_t ibp)
+{
+    return ibp ? clz32((uint32_t)ibp << 24) : 0xff;
+}
+
+static void ipb_update(uint8_t *regs, uint8_t priority)
+{
+    regs[TM_IPB] |= priority_to_ipb(priority);
+    regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
+}
+
+static uint8_t exception_mask(uint8_t ring)
+{
+    switch (ring) {
+    case TM_QW1_OS:
+        return TM_QW1_NSR_EO;
+    default:
+        g_assert_not_reached();
+    }
+}
+
 static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
 {
-    return 0;
+    uint8_t *regs = &tctx->regs[ring];
+    uint8_t nsr = regs[TM_NSR];
+    uint8_t mask = exception_mask(ring);
+
+    qemu_irq_lower(tctx->output);
+
+    if (regs[TM_NSR] & mask) {
+        uint8_t cppr = regs[TM_PIPR];
+
+        regs[TM_CPPR] = cppr;
+
+        /* Reset the pending buffer bit */
+        regs[TM_IPB] &= ~priority_to_ipb(cppr);
+        regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
+
+        /* Drop Exception bit */
+        regs[TM_NSR] &= ~mask;
+    }
+
+    return (nsr << 8) | regs[TM_CPPR];
+}
+
+static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
+{
+    uint8_t *regs = &tctx->regs[ring];
+
+    if (regs[TM_PIPR] < regs[TM_CPPR]) {
+        regs[TM_NSR] |= exception_mask(ring);
+        qemu_irq_raise(tctx->output);
+    }
 }
 
 static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
@@ -34,6 +98,9 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
     }
 
     tctx->regs[ring + TM_CPPR] = cppr;
+
+    /* CPPR has changed, check if we need to raise a pending exception */
+    xive_tctx_notify(tctx, ring);
 }
 
 /*
@@ -189,6 +256,17 @@ static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
     xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
 }
 
+/*
+ * Adjust the IPB to allow a CPU to process event queues of other
+ * priorities during one physical interrupt cycle.
+ */
+static void xive_tm_set_os_pending(XiveTCTX *tctx, hwaddr offset,
+                                   uint64_t value, unsigned size)
+{
+    ipb_update(&tctx->regs[TM_QW1_OS], value & 0xff);
+    xive_tctx_notify(tctx, TM_QW1_OS);
+}
+
 /*
  * Define a mapping of "special" operations depending on the TIMA page
  * offset and the size of the operation.
@@ -211,6 +289,7 @@ static const XiveTmOp xive_tm_operations[] = {
 
     /* MMIOs above 2K : special operations with side effects */
     { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG, 2, NULL, xive_tm_ack_os_reg },
+    { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
 };
 
 static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
@@ -387,6 +466,13 @@ static void xive_tctx_reset(void *dev)
     tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
     tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
     tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
+
+    /*
+     * Initialize PIPR to 0xFF to avoid phantom interrupts when the
+     * CPPR is first set.
+     */
+    tctx->regs[TM_QW1_OS + TM_PIPR] =
+        ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
 }
 
 static void

[Qemu-devel] [PATCH v6 02/37] ppc/xive: add support for the LSI interrupt sources

2018-12-05 Thread Cédric Le Goater

The 'sent' status of the LSI interrupt source is modeled with the 'P'
bit of the ESB, and the assertion status of the source is maintained
in an extra bit of the XiveSource status array. The type of the source
(LSI or MSI) is recorded in a separate 'lsi_map' bitmap.

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/xive.h | 19 -
 hw/intc/xive.c        | 66 +++
 2 files changed, 78 insertions(+), 7 deletions(-)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 7aa2e3801222..7cebc32eba4c 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -162,8 +162,9 @@ typedef struct XiveSource {
     /* IRQs */
     uint32_t        nr_irqs;
     qemu_irq        *qirqs;
+    unsigned long   *lsi_map;
 
-    /* PQ bits */
+    /* PQ bits and LSI assertion bit */
     uint8_t         *status;
 
     /* ESB memory region */
@@ -219,6 +220,7 @@ static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
   * When doing an EOI, the Q bit will indicate if the interrupt
   * needs to be re-triggered.
   */
+#define XIVE_STATUS_ASSERTED  0x4  /* Extra bit for LSI */
 #define XIVE_ESB_VAL_P        0x2
 #define XIVE_ESB_VAL_Q        0x1
 
@@ -257,4 +259,19 @@ static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
     return xsrc->qirqs[srcno];
 }
 
+static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return test_bit(srcno, xsrc->lsi_map);
+}
+
+static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
+                                       bool lsi)
+{
+    assert(srcno < xsrc->nr_irqs);
+    if (lsi) {
+        bitmap_set(xsrc->lsi_map, srcno, 1);
+    }
+}
+
 #endif /* PPC_XIVE_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 6389bd832371..11c7aac962de 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -89,14 +89,42 @@ uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
     return xive_esb_set(&xsrc->status[srcno], pq);
 }
 
+/*
+ * Returns whether the event notification should be forwarded.
+ */
+static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
+{
+    uint8_t old_pq = xive_source_esb_get(xsrc, srcno);
+
+    xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
+
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+        xive_source_esb_set(xsrc, srcno, XIVE_ESB_PENDING);
+        return true;
+    default:
+        return false;
+    }
+}
+
 /*
  * Returns whether the event notification should be forwarded.
  */
 static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
 {
+    bool ret;
+
     assert(srcno < xsrc->nr_irqs);
 
-    return xive_esb_trigger(&xsrc->status[srcno]);
+    ret = xive_esb_trigger(&xsrc->status[srcno]);
+
+    if (xive_source_irq_is_lsi(xsrc, srcno) &&
+        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
+    }
+
+    return ret;
 }
 
 /*
@@ -104,9 +132,22 @@ static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
   */
  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
  {
+    bool ret;
+
     assert(srcno < xsrc->nr_irqs);
 
-    return xive_esb_eoi(&xsrc->status[srcno]);
+    ret = xive_esb_eoi(&xsrc->status[srcno]);
+
+    /* LSI sources do not set the Q bit but they can still be
+     * asserted, in which case we should forward a new event
+     * notification
+     */
+    if (xive_source_irq_is_lsi(xsrc, srcno) &&
+        xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
+        ret = xive_source_lsi_trigger(xsrc, srcno);
+    }
+
+    return ret;
 }
 
 /*
@@ -271,8 +312,16 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
     XiveSource *xsrc = XIVE_SOURCE(opaque);
     bool notify = false;
 
-    if (val) {
-        notify = xive_source_esb_trigger(xsrc, srcno);
+    if (xive_source_irq_is_lsi(xsrc, srcno)) {
+        if (val) {
+            notify = xive_source_lsi_trigger(xsrc, srcno);
+        } else {
+            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
+        }
+    } else {
+        if (val) {
+            notify = xive_source_esb_trigger(xsrc, srcno);
+        }
     }
 
     /* Forward the source event notification for routing */
@@ -292,9 +341,11 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
             continue;
         }
 
-        monitor_printf(mon, "  %08x %c%c\n", i + offset,
+        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
+                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
                        pq & XIVE_ESB_VAL_P ? 'P' : '-',
-                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
+                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
+                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
     }
 }
 
@@ -302,6 +353,8 @@ static void xive_source_reset(void *dev)
 {
     XiveSource

[Qemu-devel] [PATCH v6 06/37] ppc/xive: add support for the END Event State buffers

2018-12-05 Thread Cédric Le Goater

The Event Notification Descriptor (END) XIVE structure also contains
two Event State Buffers providing further coalescing of interrupts:
one for the notification event (ESn) and one for the escalation events
(ESe). An MMIO page is assigned to each of them to control the EOI
through loads only. Stores are not allowed.

The END ESBs are modeled through an object resembling the 'XiveSource'.
It is stateless, as the END state bits are backed by the XiveEND
structure under the XiveRouter, and the MMIO accesses follow the same
rules as for the standard source ESBs.

END ESBs are not supported by the Linux drivers, either on OPAL or on
sPAPR. Nevertheless, modeling them provides a means to study the
question in the future and further validates the XIVE model.
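The MMIO address decoding used by xive_end_source_read() can be isolated as follows. This sketch assumes each END owns a pair of 2^esb_shift pages, so the END index is addr >> (esb_shift + 1), and assumes that addr_is_even(), whose body is not shown here, tests bit esb_shift of the address to pick the even (ESn) or odd (ESe) page. EndEsbAddr and end_esb_decode() are hypothetical names, and the lower bound on esb_shift is only a sanity check for the sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { END_ESN, END_ESE } EndEsb;

typedef struct EndEsbAddr {
    uint32_t end_idx;  /* which END the access targets */
    EndEsb   esb;      /* even page: notification (ESn), odd page: escalation (ESe) */
    uint32_t offset;   /* ESB operation offset within the page */
} EndEsbAddr;

static EndEsbAddr end_esb_decode(uint64_t addr, unsigned esb_shift)
{
    assert(esb_shift >= 12);   /* at least 4K pages (sketch assumption) */

    EndEsbAddr d = {
        .end_idx = (uint32_t)(addr >> (esb_shift + 1)),
        .esb     = ((addr >> esb_shift) & 1) ? END_ESE : END_ESN,
        .offset  = (uint32_t)(addr & 0xFFF),
    };
    return d;
}
```

With 64K pages (esb_shift = 16), an access at offset 0x800 in the odd page of END 3 decodes to END 3, ESe, offset 0x800.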

Signed-off-by: Cédric Le Goater 
---
 include/hw/ppc/xive.h |  22 ++
 hw/intc/xive.c        | 173 +-
 2 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index d1b4c6c78ec5..d67b0785df7c 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -305,6 +305,8 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
 
  typedef struct XiveRouter {
     SysBusDevice    parent;
+
+    uint32_t        chip_id;
 } XiveRouter;
 
 #define TYPE_XIVE_ROUTER "xive-router"
@@ -336,6 +338,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
  int xive_router_write_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
                            XiveEND *end, uint8_t word_number);
 
+/*
+ * XIVE END ESBs
+ */
+
+#define TYPE_XIVE_END_SOURCE "xive-end-source"
+#define XIVE_END_SOURCE(obj) \
+    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
+
+typedef struct XiveENDSource {
+    DeviceState parent;
+
+    uint32_t        nr_ends;
+
+    /* ESB memory region */
+    uint32_t        esb_shift;
+    MemoryRegion    esb_mmio;
+
+    XiveRouter      *xrtr;
+} XiveENDSource;
+
 /*
  * For legacy compatibility, the exceptions define up to 256 different
  * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 41d8ba1540d0..83686e260df5 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -612,8 +612,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
      * even futher coalescing in the Router
      */
     if (!xive_end_is_notify(&end)) {
-        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
-        return;
+        uint8_t pq = GETFIELD_BE32(END_W1_ESn, end.w1);
+        bool notify = xive_esb_trigger(&pq);
+
+        if (pq != GETFIELD_BE32(END_W1_ESn, end.w1)) {
+            end.w1 = SETFIELD_BE32(END_W1_ESn, end.w1, pq);
+            xive_router_write_end(xrtr, end_blk, end_idx, &end, 1);
+        }
+
+        /* ESn[Q]=1 : end of notification */
+        if (!notify) {
+            return;
+        }
     }
 
 /*
@@ -658,12 +668,18 @@ static void xive_router_notify(XiveNotifier *xn, uint32_t lisn)
GETFIELD_BE64(EAS_END_DATA,  eas.w));
 }
 
+static Property xive_router_properties[] = {
+    DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
 static void xive_router_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     XiveNotifierClass *xnc = XIVE_NOTIFIER_CLASS(klass);
 
     dc->desc    = "XIVE Router Engine";
+    dc->props   = xive_router_properties;
     xnc->notify = xive_router_notify;
 }
 
@@ -692,6 +708,158 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
(uint32_t) GETFIELD_BE64(EAS_END_DATA, eas->w));
 }
 
+/*
+ * END ESB MMIO loads
+ */
+static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
+{
+    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
+    XiveRouter *xrtr = xsrc->xrtr;
+    uint32_t offset = addr & 0xFFF;
+    uint8_t end_blk;
+    uint32_t end_idx;
+    XiveEND end;
+    uint32_t end_esmask;
+    uint8_t pq;
+    uint64_t ret = -1;
+
+    end_blk = xrtr->chip_id;
+    end_idx = addr >> (xsrc->esb_shift + 1);
+
+    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
+                      end_idx);
+        return -1;
+    }
+
+    if (!xive_end_is_valid(&end)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
+                      end_blk, end_idx);
+        return -1;
+    }
+
+    end_esmask = addr_is_even(addr, xsrc->esb_shift) ? END_W1_ESn : END_W1_ESe;
+    pq = GETFIELD_BE32(end_esmask, end.w1);
+
+    switch (offset) {
+    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
+        ret = xive_esb_eoi(&pq);
+
+        /* Forward the source event notification for routing ?? */
+        break;
+
+    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
+        ret = pq;
+        break;
+
+    case XIVE_ESB_SET_PQ_00 ...

[Qemu-devel] [PATCH v6 00/37] ppc: support for the XIVE interrupt controller (POWER9)

2018-12-05 Thread Cédric Le Goater

Hello,

Here is version 6 of the QEMU models adding support for the XIVE
interrupt controller to the sPAPR machine, under TCG and KVM. Support
for the PowerNV POWER9 machine will be proposed in a separate PowerNV
patchset sometime next year.

The most important changes for sPAPR are the removal of the SysBusDevice
inheritance, the removal of the KVM classes, support for XIVE structures
in big-endian format only, and the introduction of a VM change state
handler for KVM migration.

Thanks,

C.


Changes in v6 :

 Common XIVE models :

 - included documentation in xive.h
 - removed SysBusDevice inheritance from Xive Sources
 - set ASSERTED bit in xive_source_lsi_trigger()
 - renamed XiveFabric to XiveNotifier
 - reworked XIVE tables accessors, introduced a _write method for words
 - introduced the source number encoding on PowerBUS in accessors
 - used fixed big-endian format for XIVE structures
 - reworked the presenter matching routine
 - renamed *cam_line helpers
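With the XIVE structures kept in a fixed big-endian format, a field such as the PQ bits that the END read handler pulls out of word 1 is obtained in two steps: convert the stored word from big-endian, then mask and shift. A portable sketch of that extraction follows. The bit positions chosen for the ESn/ESe masks (IBM numbering, bit 0 = MSB) and all macro names are assumptions for illustration; QEMU's own GETFIELD_BE32() and END_W1_* definitions are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of a 32-bit word. */
#define BIT32_IBM(b)          (0x80000000u >> (b))
#define BITMASK32_IBM(bs, be) ((BIT32_IBM(bs) - BIT32_IBM(be)) | BIT32_IBM(bs))

/* Hypothetical stand-ins for the END word 1 ESn and ESe fields. */
#define EX_END_W1_ESN BITMASK32_IBM(0, 1)
#define EX_END_W1_ESE BITMASK32_IBM(2, 3)

/* Read a big-endian 32-bit word from memory, whatever the host order. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Extract the field selected by a contiguous, non-zero mask. */
static uint32_t getfield32(uint32_t mask, uint32_t word)
{
    uint32_t m = mask;
    unsigned shift = 0;

    assert(mask != 0);
    while (!(m & 1u)) { /* shift = number of trailing zero bits in mask */
        m >>= 1;
        shift++;
    }
    return (word & mask) >> shift;
}
```

With word 1 stored as the bytes 90 00 00 00, this sketch yields ESn = 2 (P set, Q clear) and ESe = 1 under the assumed bit layout.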

 sPAPR models :

 - reworked the 'info pic' output
 - moved the END reset at the sPAPR level
 - renamed the spapr_xive_irq_enable/disable routines to claim/free
 - removed the reset_tctx hook
 - renamed xics_max_server_number() and fixed the spapr_irq_init() prototype
 - removed the use of the xive_router routines in the sPAPR XIVE hcalls
 - used address_space_map() to validate the EQ
 - introduced a spapr_xive_reset_tctx() to set the OS CAM line at reset
 - introduced OV5 defines for the XIVE mode
 - removed the XIVE classes
 - enabled/disabled the XIVE MMIOs depending on the mode
 - introduced a spapr_rtas_unregister() helper
 - miscellaneous enhancements

 KVM :

 - removed the KVM XIVE models and reworked KVM support with helpers
 - introduced a VM change state handler to quiesce XIVE before
   transferring the EQ pages
 - improved KVM support for the dual machine (removed extra cleanups)

 PowerNV:

 - postponed for a PowerNV patchset only

Changes in v5 :

 Common XIVE models :

 - renamed the XIVE structures to fit the changes of the XIVE
   architecture documents: IVE, EQD, VPD -> EAS, END, NVT
 - reworked the monitor output to print the EQ contents

 sPAPR models :

 - introduced a XIVE Router 'reset' method for the Xive Thread Context
   to set the OS CAM line of the VCPU
 - introduced a spapr_irq_init() routine to the sPAPR IRQ backend
   and reworked the XIVE-only machine to fit mainline QEMU
 - introduced a reset() method to the sPAPR IRQ backend to handle
   changes in the interrupt mode after machine reset
 - introduced a 'dual' machine supporting both interrupt modes

 KVM :

 - introduced some more sPAPR NVT and END indexing helpers for KVM support
 - fixed the virtual LSIs in KVM by using the H_INT_ESB source flag
 - improved the KVM support with better common classes and cleaner
   QEMU<->KVM interfaces
 - improved KVM migration with a better control on the capture sequence.
   Still some issues with 'ceded' VCPUs 
 - introduced KVM support for the 'dual' machine

 PowerNV:

 - introduced address spaces for the IPI and END set translation
   tables


Changes in v4 :

   See https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg01672.html


= XIVE =


The POWER9 processor comes with a new interrupt controller, called
XIVE, for "eXternal Interrupt Virtualization Engine".


* Overall architecture


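                  XIVE Interrupt Controller
                  +------------------------------------+      IPIs
                  | +---------+ +---------+ +--------+ |    +-------+
                  | |VC       | |CQ       | |PC      |----> | CORES |
                  | |     esb | |         | |        |----> |       |
                  | |     eas | |  Bridge | |   tctx |----> |       |
                  | |SC   end | |         | |    nvt | |    |       |
  +------+        | +---------+ +----+----+ +--------+ |    +-+-+-+-+
  | RAM  |        +------------------|-----------------+      | | |
  |      |                           |                        | | |
  |      |                           |                        | | |
  |      |  +------------------------v------------------------v-v-v--+    other
  |      <--+                       Power Bus                        +--> chips
  |  esb |  +---------+-----------------------+----------------------+
  |  eas |            |                       |
  |  end |         +--|------+                |
  |  nvt |       +----+----+ |           +----+----+
  +------+       |SC       | |           |SC       |
                 |         | |           |         |
                 | PQ-bits | |           | PQ-bits |
                 | local   |-+           |  in VC  |
                 +---------+             +---------+
                    PCIe                 NX,NPU,CAPI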

  SC: Source Controller (aka. IVSE)
  VC: Virtualization Controller (aka. IVRE)
  PC: Presentation Controller (aka. IVPE)
  CQ: Common Queue (Bridge)

 PQ-bits: 2 bits source state
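The two PQ bits act as a small per-source state machine. A minimal sketch of the trigger and EOI transitions follows, modelled on the behaviour commonly described for XIVE ESBs and on the xive_esb_eoi() call in the END read handler; the enum names and the convention that a true return means "forward (or replay) the event for routing" are this sketch's assumptions, not QEMU's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings: P is bit 1, Q is bit 0. */
enum {
    ESB_RESET   = 0x0, /* P=0 Q=0: ready to forward the next event */
    ESB_OFF     = 0x1, /* P=0 Q=1: source masked, events are dropped */
    ESB_PENDING = 0x2, /* P=1 Q=0: one event forwarded, awaiting EOI */
    ESB_QUEUED  = 0x3, /* P=1 Q=1: another event arrived while pending */
};

/* Event trigger: returns true when the event should be forwarded. */
static bool esb_trigger(uint8_t *pq)
{
    assert(*pq <= ESB_QUEUED);
    switch (*pq) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;  /* coalesce: remember that something arrived */
        return false;
    default:               /* ESB_OFF: stay masked */
        return false;
    }
}

/* End of interrupt: returns true when a queued event must be replayed. */
static bool esb_eoi(uint8_t *pq)
{
    assert(*pq <= ESB_QUEUED);
    switch (*pq) {
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING; /* the queued event is now the pending one */
        return true;
    default:               /* ESB_OFF: stay masked */
        return false;
    }
}
```

The Q bit is what lets a burst of triggers collapse into a single replay at EOI time instead of being lost or delivered many times.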

Re: [Qemu-devel] [RFCv2 for-4.0 4/5] virtio-balloon: Use ram_block_discard_range() instead of raw madvise()

2018-12-05 Thread David Gibson

On Wed, Dec 05, 2018 at 08:59:06AM +0100, David Hildenbrand wrote:
> On 05.12.18 06:06, David Gibson wrote:
> > Currently, virtio-balloon uses madvise() with MADV_DONTNEED to actually
> > discard RAM pages inserted into the balloon.  This is basically a
> > Linux-only interface (MADV_DONTNEED exists on some other platforms, but doesn't
> > always have the same semantics).  It also doesn't work on hugepages and has
> > some other limitations.
> > 
> > It turns out that postcopy also needs to discard chunks of memory, and uses
> > a better interface for it: ram_block_discard_range().  It doesn't cover
> > every case, but it covers more than going direct to madvise() and this
> > gives us a single place to update for more possibilities in future.
> > 
> > There are some subtleties here to maintain the current balloon behaviour:
> > 
> > * For now, we just ignore requests to balloon in a hugepage backed region.
> >   That matches current behaviour, because MADV_DONTNEED on a hugepage would
> >   simply fail, and we ignore the error.
> > 
> > * If host page size is > BALLOON_PAGE_SIZE we can frequently call this on
> >   non-host-page-aligned addresses.  These would also fail in madvise(),
> >   which we then ignored.  ram_block_discard_range() error_report()s calls
> >   on unaligned addresses, so we explicitly check that case to avoid
> >   spamming the logs.
> > 
> > * We now call ram_block_discard_range() with the *host* page size, whereas
> >   we previously called madvise() with BALLOON_PAGE_SIZE.  Surprisingly,
> >   this also matches existing behaviour.  Although the kernel fails madvise
> >   on unaligned addresses, it will round unaligned sizes *up* to the host
> >   page size.  Yes, this means that if BALLOON_PAGE_SIZE < guest page size
> >   we can incorrectly discard more memory than the guest asked us to.  I'm
> >   planning to address that soon.
> > 
> > Errors other than the ones discussed above will now be reported by
> > ram_block_discard_range(), rather than silently ignored, which means we
> > have a much better chance of seeing when something is going wrong.
> > 
> > Signed-off-by: David Gibson 
> > Reviewed-by: Michael S. Tsirkin 
> > ---
> >  hw/virtio/virtio-balloon.c | 23 ++-
> >  1 file changed, 22 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
> > index c3a19aa27d..4435905c87 100644
> > --- a/hw/virtio/virtio-balloon.c
> > +++ b/hw/virtio/virtio-balloon.c
> > @@ -37,8 +37,29 @@ static void balloon_inflate_page(VirtIOBalloon *balloon,
> >                                   MemoryRegion *mr, hwaddr offset)
> >  {
> >      void *addr = memory_region_get_ram_ptr(mr) + offset;
> > +    RAMBlock *rb;
> > +    size_t rb_page_size;
> > +    ram_addr_t ram_offset;
> >  
> > -    qemu_madvise(addr, BALLOON_PAGE_SIZE, QEMU_MADV_DONTNEED);
> > +    /* XXX is there a better way to get to the RAMBlock than via a
> > +     * host address? */
> 
> We have qemu_get_ram_block(). That one should work as long as we know
> that it is a valid guest ram address. (would have to make it !static)
> 
> Then we would only have to pass to this function the original ram_addr_t
> handed over by the guest (which looks somewhat cleaner to me than going
> via memory regions)

So, I didn't use that because it's a hwaddr, not a ram_addr_t that the
guest gives us.  I think they have the same value for guest RAM
addresses, but I wasn't sure if it was safe to rely on that.

> 
> > +    rb = qemu_ram_block_from_host(addr, false, &ram_offset);
> > +    rb_page_size = qemu_ram_pagesize(rb);
> > +
> > +    /* Silently ignore hugepage RAM blocks */
> > +    if (rb_page_size != getpagesize()) {
> > +        return;
> > +    }
> > +
> > +    /* Silently ignore unaligned requests */
> > +    if (ram_offset & (rb_page_size - 1)) {
> > +        return;
> > +    }
> > +
> > +    ram_block_discard_range(rb, ram_offset, rb_page_size);
> > +    /* We ignore errors from ram_block_discard_range(), because it has
> > +     * already reported them, and failing to discard a balloon page is
> > +     * not fatal */
> >  }
> >  
> >  static const char *balloon_stat_names[] = {
> > 
> 
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



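The two early returns in the new balloon_inflate_page() boil down to a predicate on the RAM block's page size and the offset within the block. A standalone sketch of that decision follows, with hypothetical function names and BALLOON_PAGE_SIZE assumed to be 4 KiB. It also shows the effect described in the commit message: with 64 KiB host pages, only one 4 KiB balloon request in sixteen lands on a host-page boundary and leads to a discard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: should a balloon request at ram_offset become a
 * discard of one host page? Mirrors the two checks in the patch.
 */
static bool balloon_should_discard(size_t rb_page_size, size_t host_page_size,
                                   uint64_t ram_offset)
{
    /* Hugepage-backed block: silently ignored, as madvise() failed before. */
    if (rb_page_size != host_page_size) {
        return false;
    }
    /* Page sizes are powers of two, so this tests host-page alignment. */
    assert((rb_page_size & (rb_page_size - 1)) == 0);
    return (ram_offset & (rb_page_size - 1)) == 0;
}

/* Count how many consecutive balloon requests would trigger a discard. */
static unsigned count_discards(size_t page_size, uint64_t start, unsigned nreq)
{
    const uint64_t balloon_page = 4096; /* assumed BALLOON_PAGE_SIZE */
    unsigned n = 0;

    for (unsigned i = 0; i < nreq; i++) {
        if (balloon_should_discard(page_size, page_size,
                                   start + i * balloon_page)) {
            n++;
        }
    }
    return n;
}
```

On a 4 KiB-page host every request is aligned, so count_discards(4096, 0, 16) is 16, while count_discards(65536, 0, 16) is 1.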
Re: [Qemu-devel] [PATCH for-4.0 2/5] spapr: Use default_machine_opts to set use_hotplug_event_source

2018-12-05 Thread David Gibson

On Wed, Dec 05, 2018 at 06:58:24PM -0200, Eduardo Habkost wrote:
> Instead of setting use_hotplug_event_source at instance_init
> time, set default_machine_opts on spapr_machine_2_7_class_options()
> to implement equivalent behavior.
> 
> This will let us eliminate the need for separate instance_init
> functions for each spapr machine-type.
> 
> Signed-off-by: Eduardo Habkost 

Acked-by: David Gibson 

> ---
>  hw/ppc/spapr.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 80d8498867..f6b60e6fbd 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -4240,10 +4240,7 @@ static void phb_placement_2_7(sPAPRMachineState *spapr, uint32_t index,
>  
>  static void spapr_machine_2_7_instance_options(MachineState *machine)
>  {
> -    sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
> -
>      spapr_machine_2_8_instance_options(machine);
> -    spapr->use_hotplug_event_source = false;
>  }
>  
>  static void spapr_machine_2_7_class_options(MachineClass *mc)
> @@ -4252,6 +4249,7 @@ static void spapr_machine_2_7_class_options(MachineClass *mc)
>  
>      spapr_machine_2_8_class_options(mc);
>      mc->default_cpu_type = POWERPC_CPU_TYPE_NAME("power7_v2.3");
> +    mc->default_machine_opts = "modern-hotplug-events=off";
>      SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_7);
>      smc->phb_placement = phb_placement_2_7;
>  }


Re: [Qemu-devel] [PATCH for-4.0 3/5] spapr: Use default_machine_opts to set suppress_vmdesc

2018-12-05 Thread David Gibson

On Wed, Dec 05, 2018 at 06:58:25PM -0200, Eduardo Habkost wrote:
> Instead of setting suppress_vmdesc at instance_init time, set
> default_machine_opts on spapr_machine_2_2_class_options() to
> implement equivalent behavior.
> 
> This will let us eliminate the need for separate instance_init
> functions for each spapr machine-type.
> 
> Signed-off-by: Eduardo Habkost 

Acked-by: David Gibson 

> ---
>  hw/ppc/spapr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index f6b60e6fbd..0c3b27a8cc 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -4368,13 +4368,13 @@ DEFINE_SPAPR_MACHINE(2_3, "2.3", false);
>  static void spapr_machine_2_2_instance_options(MachineState *machine)
>  {
>      spapr_machine_2_3_instance_options(machine);
> -    machine->suppress_vmdesc = true;
>  }
>  
>  static void spapr_machine_2_2_class_options(MachineClass *mc)
>  {
>      spapr_machine_2_3_class_options(mc);
>      SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_2);
> +    mc->default_machine_opts = "modern-hotplug-events=off,suppress-vmdesc=on";
>  }
>  DEFINE_SPAPR_MACHINE(2_2, "2.2", false);
>  


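Patches 2 and 3 rely on each older machine's class_options() calling the next newer one first and then overriding fields. Because default_machine_opts is a single string that is assigned rather than appended to, pseries-2.2 has to repeat "modern-hotplug-events=off", which it would otherwise inherit from pseries-2.7. A toy model of that chaining follows, with hypothetical types and only three versions; pseries-2.8 is given no overrides purely to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the one MachineClass field that matters here. */
typedef struct {
    const char *default_machine_opts;
} ToyMachineClass;

static void toy_2_8_class_options(ToyMachineClass *mc)
{
    mc->default_machine_opts = NULL; /* newer machines: no overrides */
}

static void toy_2_7_class_options(ToyMachineClass *mc)
{
    toy_2_8_class_options(mc);
    mc->default_machine_opts = "modern-hotplug-events=off";
}

static void toy_2_2_class_options(ToyMachineClass *mc)
{
    toy_2_7_class_options(mc); /* stands in for the 2.6 ... 2.3 chain */
    /* Plain assignment discards the inherited string, so repeat it. */
    mc->default_machine_opts = "modern-hotplug-events=off,suppress-vmdesc=on";
}

/* Naive substring check, good enough for this toy example. */
static int toy_has_opt(const ToyMachineClass *mc, const char *opt)
{
    assert(opt != NULL);
    return mc->default_machine_opts != NULL &&
           strstr(mc->default_machine_opts, opt) != NULL;
}
```

Had the 2.2 override set only "suppress-vmdesc=on", the older machine would silently lose the 2.7 hotplug setting.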
Re: [Qemu-devel] [PATCH for-4.0 4/5] spapr: Delete instance_options functions

2018-12-05 Thread David Gibson

On Wed, Dec 05, 2018 at 06:58:26PM -0200, Eduardo Habkost wrote:
> Now that all instance_options functions for spapr are empty,
> delete them.
> 
> Signed-off-by: Eduardo Habkost 

Acked-by: David Gibson 

Do you want me to stage the ppc patches in my ppc-for-4.0 tree, or
would you prefer to keep your series together to go in via your tree?

> ---
>  hw/ppc/spapr.c | 85 --
>  1 file changed, 85 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 0c3b27a8cc..523e5d83f8 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3939,16 +3939,10 @@ static const TypeInfo spapr_machine_info = {
>              mc->is_default = 1;                                      \
>          }                                                            \
>      }                                                                \
> -    static void spapr_machine_##suffix##_instance_init(Object *obj)  \
> -    {                                                                \
> -        MachineState *machine = MACHINE(obj);                        \
> -        spapr_machine_##suffix##_instance_options(machine);          \
> -    }                                                                \
>      static const TypeInfo spapr_machine_##suffix##_info = {          \
>          .name = MACHINE_TYPE_NAME("pseries-" verstr),                \
>          .parent = TYPE_SPAPR_MACHINE,                                \
>          .class_init = spapr_machine_##suffix##_class_init,           \
> -        .instance_init = spapr_machine_##suffix##_instance_init,     \
>      };                                                               \
>      static void spapr_machine_register_##suffix(void)                \
>      {                                                                \
> @@ -3959,10 +3953,6 @@ static const TypeInfo spapr_machine_info = {
>  /*
>   * pseries-4.0
>   */
> -static void spapr_machine_4_0_instance_options(MachineState *machine)
> -{
> -}
> -
>  static void spapr_machine_4_0_class_options(MachineClass *mc)
>  {
>      /* Defaults for the latest behaviour inherited from the base class */
> @@ -3976,11 +3966,6 @@ DEFINE_SPAPR_MACHINE(4_0, "4.0", true);
>  #define SPAPR_COMPAT_3_1  \
>      HW_COMPAT_3_1
>  
> -static void spapr_machine_3_1_instance_options(MachineState *machine)
> -{
> -    spapr_machine_4_0_instance_options(machine);
> -}
> -
>  static void spapr_machine_3_1_class_options(MachineClass *mc)
>  {
>      spapr_machine_3_1_class_options(mc);
> @@ -3995,11 +3980,6 @@ DEFINE_SPAPR_MACHINE(3_1, "3.1", false);
>  #define SPAPR_COMPAT_3_0  \
>      HW_COMPAT_3_0
>  
> -static void spapr_machine_3_0_instance_options(MachineState *machine)
> -{
> -    spapr_machine_3_1_instance_options(machine);
> -}
> -
>  static void spapr_machine_3_0_class_options(MachineClass *mc)
>  {
>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> @@ -4029,11 +4009,6 @@ DEFINE_SPAPR_MACHINE(3_0, "3.0", false);
>          .value = "on",  \
>      },
>  
> -static void spapr_machine_2_12_instance_options(MachineState *machine)
> -{
> -    spapr_machine_3_0_instance_options(machine);
> -}
> -
>  static void spapr_machine_2_12_class_options(MachineClass *mc)
>  {
>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> @@ -4051,11 +4026,6 @@ static void spapr_machine_2_12_class_options(MachineClass *mc)
>  
>  DEFINE_SPAPR_MACHINE(2_12, "2.12", false);
>  
> -static void spapr_machine_2_12_sxxm_instance_options(MachineState *machine)
> -{
> -    spapr_machine_2_12_instance_options(machine);
> -}
> -
>  static void spapr_machine_2_12_sxxm_class_options(MachineClass *mc)
>  {
>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> @@ -4074,11 +4044,6 @@ DEFINE_SPAPR_MACHINE(2_12_sxxm, "2.12-sxxm", false);
>  #define SPAPR_COMPAT_2_11  \
>      HW_COMPAT_2_11
>  
> -static void spapr_machine_2_11_instance_options(MachineState *machine)
> -{
> -    spapr_machine_2_12_instance_options(machine);
> -}
> -
>  static void spapr_machine_2_11_class_options(MachineClass *mc)
>  {
>      sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> @@ -4096,11 +4061,6 @@ DEFINE_SPAPR_MACHINE(2_11, "2.11", false);
>  #define SPAPR_COMPAT_2_10  \
>      HW_COMPAT_2_10
>  
> -static void spapr_machine_2_10_instance_options(MachineState *machine)
> -{
> -    spapr_machine_2_11_instance_options(machine);
> -}
> -
>  static void spapr_machine_2_10_class_options(MachineClass *mc)
>  {
>      spapr_machine_2_11_class_options(mc);
> @@ -4120,11 +4080,6 @@ DEFINE_SPAPR_MACHINE(2_10, "2.10", false);
>          .value = "on",  \
>      }, \
>  
> -static void

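Patch 4 can drop the instance_init hook entirely because DEFINE_SPAPR_MACHINE generates each per-version function and TypeInfo by pasting the version suffix into identifiers with ##. A toy, self-contained version of the same macro pattern follows. All names are hypothetical, and a cut-down descriptor stands in for QEMU's TypeInfo.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, much reduced counterpart of a QOM type descriptor. */
typedef struct {
    const char *name;
    void (*class_init)(const char **alias);
} ToyTypeInfo;

/* One expansion = one class_init function plus one type descriptor. */
#define DEFINE_TOY_MACHINE(suffix, verstr, latest)                    \
    static void toy_machine_##suffix##_class_init(const char **alias) \
    {                                                                 \
        *alias = (latest) ? "pseries" : NULL;                         \
    }                                                                 \
    static const ToyTypeInfo toy_machine_##suffix##_info = {          \
        .name = "pseries-" verstr,                                    \
        .class_init = toy_machine_##suffix##_class_init,              \
    };

DEFINE_TOY_MACHINE(4_0, "4.0", 1)
DEFINE_TOY_MACHINE(3_1, "3.1", 0)
```

Since every expansion only needs a class_init, removing the empty per-version instance hooks shrinks each expansion without changing behaviour.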
[Qemu-devel] [RFC 0/3] QEMU changes to do PVH boot

2018-12-05 Thread Liam Merwick

For certain applications it is desirable to rapidly boot a KVM virtual
machine. In cases where legacy hardware and software support within the
guest is not needed, QEMU should be able to boot directly into the
uncompressed Linux kernel binary with minimal firmware involvement.

There already exists an ABI to allow this for Xen PVH guests and the ABI
is supported by Linux and FreeBSD:

   https://xenbits.xen.org/docs/unstable/misc/pvh.html

Details on the Linux changes: https://lkml.org/lkml/2018/4/16/1002
qboot patches: http://patchwork.ozlabs.org/project/qemu-devel/list/?series=80020

This patch series provides QEMU support to read the ELF header of an
uncompressed kernel binary and get the 32-bit PVH kernel entry point
from an ELF Note.  This is called when initialising the machine state
in pc_memory_init().  Later on in load_linux() if the kernel entry
address is present, the uncompressed kernel binary (ELF) is loaded
and qboot does further initialisation of the guest (e820, etc.) and
jumps to the kernel entry address and boots the guest.
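The first step above, getting the 32-bit PVH entry point from an ELF note, amounts to walking the kernel's note records for the one owned by "Xen" with type XEN_ELFNOTE_PHYS32_ENTRY. The sketch below does that over a raw little-endian note buffer (as found in a PT_NOTE segment). The value 18 for XEN_ELFNOTE_PHYS32_ENTRY comes from the Xen ELF note definitions; the helper names are hypothetical, accepting a 4- or 8-byte descriptor is this sketch's own assumption, and this is not the code of patch 2/3, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XEN_ELFNOTE_PHYS32_ENTRY 18

static uint32_t rd_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static size_t align4(size_t n)
{
    return (n + 3) & ~(size_t)3;
}

/*
 * Hypothetical helper: walk ELF note records (namesz, descsz, type, then
 * name and desc, each padded to 4 bytes) and return the PVH entry point,
 * or 0 when no "Xen" PHYS32_ENTRY note is present.
 */
static uint64_t find_pvh_entry(const uint8_t *buf, size_t len)
{
    size_t off = 0;

    assert(buf != NULL || len == 0);
    while (len - off >= 12) {
        uint32_t namesz = rd_le32(buf + off);
        uint32_t descsz = rd_le32(buf + off + 4);
        uint32_t type = rd_le32(buf + off + 8);
        size_t name_off = off + 12;
        size_t desc_off = name_off + align4(namesz);
        size_t next = desc_off + align4(descsz);

        if (next > len || next <= off) {
            break; /* truncated or malformed record */
        }
        if (type == XEN_ELFNOTE_PHYS32_ENTRY && namesz == 4 &&
            memcmp(buf + name_off, "Xen", 4) == 0) {
            if (descsz == 4) {
                return rd_le32(buf + desc_off);
            }
            if (descsz == 8) {
                return rd_le32(buf + desc_off) |
                       ((uint64_t)rd_le32(buf + desc_off + 4) << 32);
            }
        }
        off = next;
    }
    return 0;
}
```

A zero result doubles as "not found", which matches the PVH convention that nothing is loaded at physical address 0.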


Using the method/scripts documented by the NEMU team at

   https://github.com/intel/nemu/wiki/Measuring-Boot-Latency
   https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg00200.html

the timings below were measured (vmlinux and bzImage from the same build).
Time to get to kernel start is almost halved (95ms -> 48ms).

QEMU + qboot + vmlinux (PVH + 4.20-rc4)
 qemu_init_end: 41.550521
 fw_start: 41.667139 (+0.116618)
 fw_do_boot: 47.448495 (+5.781356)
 linux_startup_64: 47.720785 (+0.27229)
 linux_start_kernel: 48.399541 (+0.678756)
 linux_start_user: 296.952056 (+248.552515)

QEMU + qboot + bzImage:
 qemu_init_end: 29.209276
 fw_start: 29.317342 (+0.108066)
 linux_start_boot: 36.679362 (+7.36202)
 linux_startup_64: 94.531349 (+57.851987)
 linux_start_kernel: 94.900913 (+0.369564)
 linux_start_user: 401.060971 (+306.160058)

QEMU + bzImage:
 qemu_init_end: 30.424430
 linux_startup_64: 893.770334 (+863.345904)
 linux_start_kernel: 894.17049 (+0.400156)
 linux_start_user: 1208.679768 (+314.509278)


Liam Merwick (3):
  pvh: Add x86/HVM direct boot ABI header file
  pc: Read PVH entry point from ELF note in kernel binary
  pvh: Boot uncompressed kernel using direct boot ABI

 hw/i386/pc.c                | 344 +++-
 include/elf.h               |  10 ++
 include/hw/xen/start_info.h | 146 +++
 3 files changed, 499 insertions(+), 1 deletion(-)
 create mode 100644 include/hw/xen/start_info.h

-- 
1.8.3.1

[Qemu-devel] [RFC 3/3] pvh: Boot uncompressed kernel using direct boot ABI

2018-12-05 Thread Liam Merwick

These changes (along with corresponding qboot and Linux kernel changes)
enable a guest to be booted using the x86/HVM direct boot ABI.

This commit adds a load_elfboot() routine to pass the size and
location of the kernel entry point to qboot (which will fill in
the start_info struct information needed to boot the guest).
Having loaded the ELF binary, load_linux() will run qboot
which continues the boot.

The address for the kernel entry point has already been read
from an ELF Note in the uncompressed kernel binary earlier
in pc_memory_init().

Signed-off-by: George Kennedy 
Signed-off-by: Liam Merwick 
---
 hw/i386/pc.c | 72 
 1 file changed, 72 insertions(+)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index 056aa46d99b9..d3012cbd8597 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -54,6 +54,7 @@
 #include "sysemu/qtest.h"
 #include "kvm_i386.h"
 #include "hw/xen/xen.h"
+#include "hw/xen/start_info.h"
 #include "ui/qemu-spice.h"
 #include "exec/memory.h"
 #include "exec/address-spaces.h"
@@ -1098,6 +1099,50 @@ done:
     return pvh_start_addr != 0;
 }
 
+static bool load_elfboot(const char *kernel_filename,
+                         int kernel_file_size,
+                         uint8_t *header,
+                         size_t pvh_xen_start_addr,
+                         FWCfgState *fw_cfg)
+{
+    uint32_t flags = 0;
+    uint32_t mh_load_addr = 0;
+    uint32_t elf_kernel_size = 0;
+    uint64_t elf_entry;
+    uint64_t elf_low, elf_high;
+    int kernel_size;
+
+    if (ldl_p(header) != 0x464c457f) {
+        return false; /* no elfboot */
+    }
+
+    bool elf_is64 = header[EI_CLASS] == ELFCLASS64;
+    flags = elf_is64 ?
+        ((Elf64_Ehdr *)header)->e_flags : ((Elf32_Ehdr *)header)->e_flags;
+
+    if (flags & 0x00010004) { /* LOAD_ELF_HEADER_HAS_ADDR */
+        error_report("elfboot unsupported flags = %x", flags);
+        exit(1);
+    }
+
+    kernel_size = load_elf(kernel_filename, NULL, NULL, &elf_entry,
+                           &elf_low, &elf_high, 0, I386_ELF_MACHINE,
+                           0, 0);
+
+    if (kernel_size < 0) {
+        error_report("Error while loading elf kernel");
+        exit(1);
+    }
+    mh_load_addr = elf_low;
+    elf_kernel_size = elf_high - elf_low;
+
+    fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ENTRY, pvh_xen_start_addr);
+    fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, mh_load_addr);
+    fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, elf_kernel_size);
+
+    return true;
+}
+
 static void load_linux(PCMachineState *pcms,
                        FWCfgState *fw_cfg)
 {
@@ -1138,6 +1183,33 @@ static void load_linux(PCMachineState *pcms,
     if (ldl_p(header+0x202) == 0x53726448) {
         protocol = lduw_p(header+0x206);
     } else {
+        /* If the kernel address for using the x86/HVM direct boot ABI has
+         * been saved then proceed with booting the uncompressed kernel */
+        if (pvh_start_addr) {
+            if (load_elfboot(kernel_filename, kernel_size,
+                             header, pvh_start_addr, fw_cfg)) {
+                struct hvm_modlist_entry ramdisk_mod = { 0 };
+
+                fclose(f);
+
+                fw_cfg_add_i32(fw_cfg, FW_CFG_CMDLINE_SIZE,
+                               strlen(kernel_cmdline) + 1);
+                fw_cfg_add_string(fw_cfg, FW_CFG_CMDLINE_DATA, kernel_cmdline);
+
+                assert(machine->device_memory != NULL);
+                ramdisk_mod.paddr = machine->device_memory->base;
+                ramdisk_mod.size =
+                    memory_region_size(&machine->device_memory->mr);
+
+                fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, &ramdisk_mod,
+                                 sizeof(ramdisk_mod));
+                fw_cfg_add_i32(fw_cfg, FW_CFG_SETUP_SIZE, sizeof(header));
+                fw_cfg_add_bytes(fw_cfg, FW_CFG_SETUP_DATA,
+                                 header, sizeof(header));
+
+                return;
+            }
+        }
         /* This looks like a multiboot kernel. If it is, let's stop
            treating it like a Linux kernel. */
         if (load_multiboot(fw_cfg, f, kernel_filename, initrd_filename,
-- 
1.8.3.1

[Qemu-devel] [RFC 1/3] pvh: Add x86/HVM direct boot ABI header file

2018-12-05 Thread Liam Merwick

From: Liam Merwick 

The x86/HVM direct boot ABI permits QEMU to boot directly into the
uncompressed Linux kernel binary without the need to run firmware.

https://xenbits.xen.org/docs/unstable/misc/pvh.html

This commit adds the header file that defines the start_info struct
that needs to be populated in order to use this ABI.
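The layout table in the start_info.h comment maps directly onto C structures. The sketch below restates it with fixed-width types and lets the compiler check the documented offsets. The ex_-prefixed names are hypothetical (the patch's own struct definitions fall outside this excerpt), and the magic is derived from the comment's description: "xEn3" with the 0x80 bit of "E" set, read as a little-endian word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ex_hvm_start_info {
    uint32_t magic;          /* offset  0 */
    uint32_t version;        /* offset  4 */
    uint32_t flags;          /* offset  8, SIF_xxx */
    uint32_t nr_modules;     /* offset 12 */
    uint64_t modlist_paddr;  /* offset 16 */
    uint64_t cmdline_paddr;  /* offset 24 */
    uint64_t rsdp_paddr;     /* offset 32 */
    uint64_t memmap_paddr;   /* offset 40, version >= 1 */
    uint32_t memmap_entries; /* offset 48, version >= 1 */
    uint32_t reserved;       /* offset 52, version >= 1 */
};

struct ex_hvm_modlist_entry {
    uint64_t paddr;          /* offset  0 */
    uint64_t size;           /* offset  8 */
    uint64_t cmdline_paddr;  /* offset 16 */
    uint64_t reserved;       /* offset 24 */
};

struct ex_hvm_memmap_entry {
    uint64_t addr;           /* offset  0 */
    uint64_t size;           /* offset  8 */
    uint32_t type;           /* offset 16, E820_TYPE_xxx for example */
    uint32_t reserved;       /* offset 20 */
};

/* Compile-time checks against the documented offsets (C11 static_assert). */
static_assert(offsetof(struct ex_hvm_start_info, modlist_paddr) == 16,
              "modlist_paddr offset");
static_assert(offsetof(struct ex_hvm_start_info, memmap_paddr) == 40,
              "memmap_paddr offset");
static_assert(sizeof(struct ex_hvm_start_info) == 56, "start_info size");
static_assert(sizeof(struct ex_hvm_modlist_entry) == 32, "modlist size");
static_assert(offsetof(struct ex_hvm_memmap_entry, type) == 16, "type offset");
static_assert(sizeof(struct ex_hvm_memmap_entry) == 24, "memmap size");

/* "xEn3" with the 0x80 bit of 'E' set, read as a little-endian word. */
static uint32_t ex_hvm_start_magic(void)
{
    const uint8_t bytes[4] = { 'x', 'E' | 0x80, 'n', '3' };

    return (uint32_t)bytes[0] | ((uint32_t)bytes[1] << 8) |
           ((uint32_t)bytes[2] << 16) | ((uint32_t)bytes[3] << 24);
}
```

The checks hold on both 32- and 64-bit x86, because every 64-bit field in the documented layout already sits at a multiple of 8.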

Signed-off-by: Maran Wilson 
Signed-off-by: Liam Merwick 
Reviewed-by: Konrad Rzeszutek Wilk 
---
 include/hw/xen/start_info.h | 146 
 1 file changed, 146 insertions(+)
 create mode 100644 include/hw/xen/start_info.h

diff --git a/include/hw/xen/start_info.h b/include/hw/xen/start_info.h
new file mode 100644
index ..348779eb10cd
--- /dev/null
+++ b/include/hw/xen/start_info.h
@@ -0,0 +1,146 @@
+/*
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2016, Citrix Systems, Inc.
+ */
+
+#ifndef __XEN_PUBLIC_ARCH_X86_HVM_START_INFO_H__
+#define __XEN_PUBLIC_ARCH_X86_HVM_START_INFO_H__
+
+/*
+ * Start of day structure passed to PVH guests and to HVM guests in %ebx.
+ *
+ * NOTE: nothing will be loaded at physical address 0, so a 0 value in any
+ * of the address fields should be treated as not present.
+ *
+ *  0 ++
+ *| magic  | Contains the magic value XEN_HVM_START_MAGIC_VALUE
+ *|| ("xEn3" with the 0x80 bit of the "E" set).
+ *  4 ++
+ *| version| Version of this structure. Current version is 1. New
+ *|| versions are guaranteed to be backwards-compatible.
+ *  8 ++
+ *| flags  | SIF_xxx flags.
+ * 12 ++
+ *| nr_modules | Number of modules passed to the kernel.
+ * 16 ++
+ *| modlist_paddr  | Physical address of an array of modules
+ *|| (layout of the structure below).
+ * 24 ++
+ *| cmdline_paddr  | Physical address of the command line,
+ *|| a zero-terminated ASCII string.
+ * 32 ++
+ *| rsdp_paddr | Physical address of the RSDP ACPI data structure.
+ * 40 ++
+ *| memmap_paddr   | Physical address of the (optional) memory map. Only
+ *|| present in version 1 and newer of the structure.
+ * 48 ++
+ *| memmap_entries | Number of entries in the memory map table. Only
+ *|| present in version 1 and newer of the structure.
+ *|| Zero if there is no memory map being provided.
+ * 52 ++
+ *| reserved   | Version 1 and newer only.
+ * 56 ++
+ *
+ * The layout of each entry in the module structure is the following:
+ *
+ *  0 ++
+ *| paddr  | Physical address of the module.
+ *  8 ++
+ *| size   | Size of the module in bytes.
+ * 16 ++
+ *| cmdline_paddr  | Physical address of the command line,
+ *|| a zero-terminated ASCII string.
+ * 24 ++
+ *| reserved   |
+ * 32 ++
+ *
+ * The layout of each entry in the memory map table is as follows:
+ *
+ *  0 ++
+ *| addr   | Base address
+ *  8 ++
+ *| size   | Size of mapping in bytes
+ * 16 ++
+ *| type   | Type of mapping as defined between the hypervisor
+ *|| and guest it's starting. E820_TYPE_xxx, for example.
+ * 20 +|
+ *| reserved   |
+ * 24 ++
+ *
+ * The address and sizes are always a 64bit little endian unsigned integer.
+ *
+ * NB: Xen on x86 will always try to place all the data below the 4GiB
+ * boundary.
+ *
+ * Version numbers of the hvm_start_info structure have evolved like this:
+ *
+ * Version 0:
+ *
+ * Version 1:   Added the memmap_paddr/memmap_entries
