Re: [Virtio-fs] [virtiofsd PATCH v4 3/4] virtiofsd: support per-file DAX negotiation in FUSE_INIT

2021-08-19 Thread Dr. David Alan Gilbert
* JeffleXu (jeffl...@linux.alibaba.com) wrote:
> 
> 
> On 8/18/21 1:15 AM, Dr. David Alan Gilbert wrote:
> > * Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
> >> In FUSE_INIT negotiating phase, server/client should advertise if it
> >> supports per-file DAX.
> >>
> >> Once advertising support for per-file DAX feature, virtiofsd should
> >> support storing FS_DAX_FL flag persistently passed by
> >> FS_IOC_SETFLAGS/FS_IOC_FSSETXATTR ioctl, and set FUSE_ATTR_DAX in
> >> FUSE_LOOKUP accordingly if the file is capable of per-file DAX.
> >>
> >> Currently only ext4/xfs since linux kernel v5.8 support storing
> >> FS_DAX_FL flag persistently, and thus advertise support for per-file
> >> DAX feature only when the backend fs type is ext4 and xfs.
> > 
> > I'm a little worried about the meaning of the flags we're storing and
> > the fact we're storing them in the normal host DAX flags.
> > 
> > Doesn't this mean that we're using a single host flag to mean:
> >   a) It can be mapped as DAX on the host if it was a real DAX device
> >   b) We can map it as DAX inside the guest with virtiofs?
> 
> Yes the side effect is that the host file is also dax enabled if the
> backend fs is built upon real nvdimm device.
> 
> The rationale here is that, fuse daemon shall be capable of *marking*
> the file as dax capable *persistently*, so that it can be informed that
> this file is capable of dax later.

Right, so my worry here is that the untrusted guest changes both its
own behaviour (fine) and also the behaviour of the host (less fine).

> I'm not sure if xattr (extent attribute) is a better option for this?

Well, if you used an xattr for it, it wouldn't clash with whatever the
host did (especially if it used the xattr mapping).
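
A minimal sketch of that idea, assuming a hypothetical attribute name
"user.virtiofsd.dax" (nothing in these patches defines it; error handling
trimmed):

/* Sketch only: persist the guest-facing DAX hint in a user xattr rather
 * than in FS_DAX_FL, so the host's own DAX handling of the file is left
 * alone.  The attribute name is made up for illustration. */
#include <stdbool.h>
#include <sys/xattr.h>

static int set_guest_dax_hint(int fd, bool enable)
{
    return fsetxattr(fd, "user.virtiofsd.dax", enable ? "1" : "0", 1, 0);
}

static bool get_guest_dax_hint(int fd)
{
    char val = '0';

    /* A missing attribute simply means "no per-file DAX" for the guest. */
    if (fgetxattr(fd, "user.virtiofsd.dax", &val, sizeof(val)) <= 0) {
        return false;
    }
    return val == '1';
}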

Dave

> 
> > 
> > what happens when we're using usernamespaces for the guest?
> > 
> > Dave
> > 
> > 
> >> Signed-off-by: Jeffle Xu 
> >> ---
> >>  tools/virtiofsd/fuse_common.h|  5 +
> >>  tools/virtiofsd/fuse_lowlevel.c  |  6 ++
> >>  tools/virtiofsd/passthrough_ll.c | 29 +
> >>  3 files changed, 40 insertions(+)
> >>
> >> diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
> >> index 8a75729be9..ee6fc64c23 100644
> >> --- a/tools/virtiofsd/fuse_common.h
> >> +++ b/tools/virtiofsd/fuse_common.h
> >> @@ -372,6 +372,11 @@ struct fuse_file_info {
> >>   */
> >>  #define FUSE_CAP_HANDLE_KILLPRIV_V2 (1 << 28)
> >>  
> >> +/**
> >> + * Indicates support for per-file DAX.
> >> + */
> >> +#define FUSE_CAP_PERFILE_DAX (1 << 29)
> >> +
> >>  /**
> >>   * Ioctl flags
> >>   *
> >> diff --git a/tools/virtiofsd/fuse_lowlevel.c 
> >> b/tools/virtiofsd/fuse_lowlevel.c
> >> index 50fc5c8d5a..04a4f17423 100644
> >> --- a/tools/virtiofsd/fuse_lowlevel.c
> >> +++ b/tools/virtiofsd/fuse_lowlevel.c
> >> @@ -2065,6 +2065,9 @@ static void do_init(fuse_req_t req, fuse_ino_t 
> >> nodeid,
> >>  if (arg->flags & FUSE_HANDLE_KILLPRIV_V2) {
> >>  se->conn.capable |= FUSE_CAP_HANDLE_KILLPRIV_V2;
> >>  }
> >> +if (arg->flags & FUSE_PERFILE_DAX) {
> >> +se->conn.capable |= FUSE_CAP_PERFILE_DAX;
> >> +}
> >>  #ifdef HAVE_SPLICE
> >>  #ifdef HAVE_VMSPLICE
> >>  se->conn.capable |= FUSE_CAP_SPLICE_WRITE | FUSE_CAP_SPLICE_MOVE;
> >> @@ -2180,6 +2183,9 @@ static void do_init(fuse_req_t req, fuse_ino_t 
> >> nodeid,
> >>  if (se->conn.want & FUSE_CAP_POSIX_ACL) {
> >>  outarg.flags |= FUSE_POSIX_ACL;
> >>  }
> >> +if (se->op.ioctl && (se->conn.want & FUSE_CAP_PERFILE_DAX)) {
> >> +outarg.flags |= FUSE_PERFILE_DAX;
> >> +}
> >>  outarg.max_readahead = se->conn.max_readahead;
> >>  outarg.max_write = se->conn.max_write;
> >>  if (se->conn.max_background >= (1 << 16)) {
> >> diff --git a/tools/virtiofsd/passthrough_ll.c 
> >> b/tools/virtiofsd/passthrough_ll.c
> >> index e170b17adb..5b6228210f 100644
> >> --- a/tools/virtiofsd/passthrough_ll.c
> >> +++ b/tools/virtiofsd/passthrough_ll.c
> >> @@ -53,8 +53,10 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  #include 
> >>  #include 
> >> +#include 
> >>  

Re: [Virtio-fs] [virtiofsd PATCH v4 4/4] virtiofsd: support per-file DAX in FUSE_LOOKUP

2021-08-19 Thread Dr. David Alan Gilbert
* JeffleXu (jeffl...@linux.alibaba.com) wrote:
> 
> 
> On 8/18/21 3:00 AM, Dr. David Alan Gilbert wrote:
> > * Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
> >> For passthrough, when the corresponding virtiofs in guest is mounted
> >> with '-o dax=inode', advertise that the file is capable of per-file
> >> DAX if the inode in the backend fs is marked with FS_DAX_FL flag.
> >>
> >> Signed-off-by: Jeffle Xu 
> >> ---
> >>  tools/virtiofsd/passthrough_ll.c | 43 
> >>  1 file changed, 43 insertions(+)
> >>
> >> diff --git a/tools/virtiofsd/passthrough_ll.c 
> >> b/tools/virtiofsd/passthrough_ll.c
> >> index 5b6228210f..4cbd904248 100644
> >> --- a/tools/virtiofsd/passthrough_ll.c
> >> +++ b/tools/virtiofsd/passthrough_ll.c
> >> @@ -171,6 +171,7 @@ struct lo_data {
> >>  int allow_direct_io;
> >>  int announce_submounts;
> >>  int perfile_dax_cap; /* capability of backend fs */
> >> +bool perfile_dax; /* enable per-file DAX or not */
> >>  bool use_statx;
> >>  struct lo_inode root;
> >>  GHashTable *inodes; /* protected by lo->mutex */
> >> @@ -716,6 +717,10 @@ static void lo_init(void *userdata, struct 
> >> fuse_conn_info *conn)
> >>  
> >>  if (conn->capable & FUSE_CAP_PERFILE_DAX && lo->perfile_dax_cap ) {
> >>  conn->want |= FUSE_CAP_PERFILE_DAX;
> >> +  lo->perfile_dax = 1;
> >> +}
> >> +else {
> >> +  lo->perfile_dax = 0;
> >>  }
> >>  }
> >>  
> >> @@ -983,6 +988,41 @@ static int do_statx(struct lo_data *lo, int dirfd, 
> >> const char *pathname,
> >>  return 0;
> >>  }
> >>  
> >> +/*
> >> + * If the file is marked with FS_DAX_FL or FS_XFLAG_DAX, then DAX should 
> >> be
> >> + * enabled for this file.
> >> + */
> >> +static bool lo_should_enable_dax(struct lo_data *lo, struct lo_inode *dir,
> >> +   const char *name)
> >> +{
> >> +int res, fd;
> >> +int ret = false;;
> >> +unsigned int attr;
> >> +struct fsxattr xattr;
> >> +
> >> +if (!lo->perfile_dax)
> >> +  return false;
> >> +
> >> +/* Open file without O_PATH, so that ioctl can be called. */
> >> +fd = openat(dir->fd, name, O_NOFOLLOW);
> >> +if (fd == -1)
> >> +return false;
> > 
> > Doesn't that defeat the whole benefit of using O_PATH - i.e. that we
> > might stumble into a /dev node or something else we're not allowed to
> > open?
> 
> As far as I know, virtiofsd will pivot_root/chroot to the source
> directory, and can only access files inside the source directory
> specified by "-o source=". Then where do these unexpected files come
> from? Besides, fd opened without O_PATH here is temporary and used for
> FS_IOC_GETFLAGS/FS_IOC_FSGETXATTR ioctl only. It's closed when the
> function returns.

The guest is still allowed to mknod.
See:
   https://lists.gnu.org/archive/html/qemu-devel/2021-01/msg05461.html

also it's legal to expose a root filesystem for a guest; the virtiofsd
should *never* open a device node with anything other than O_PATH - and it's
really tricky to do a race-free check for whether something is a device.
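
One way to keep O_PATH and still get an fd that can take the ioctl is to
fstat the O_PATH fd first and then reopen the same inode through
/proc/self/fd; a rough sketch (the helper name is illustrative):

/* Sketch only: never open() a device node.  Pin the object with O_PATH,
 * check it is a regular file, then reopen the *pinned inode* via
 * /proc/self/fd so there is no second path lookup to race against. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

static int open_regular_for_ioctl(int dirfd, const char *name)
{
    char procname[64];
    struct stat st;
    int path_fd, fd;

    path_fd = openat(dirfd, name, O_PATH | O_NOFOLLOW);
    if (path_fd < 0) {
        return -1;
    }
    if (fstat(path_fd, &st) < 0 || !S_ISREG(st.st_mode)) {
        close(path_fd);             /* device node, fifo, ... : refuse */
        return -1;
    }
    snprintf(procname, sizeof(procname), "/proc/self/fd/%d", path_fd);
    fd = open(procname, O_RDONLY);  /* reopens the pinned inode */
    close(path_fd);
    return fd;
}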


> > 
> >> +if (lo->perfile_dax_cap == DAX_CAP_FLAGS) {
> >> +res = ioctl(fd, FS_IOC_GETFLAGS, &attr);
> >> +if (!res && (attr & FS_DAX_FL))
> >> +  ret = true;
> >> +}
> >> +else if (lo->perfile_dax_cap == DAX_CAP_XATTR) {
> >> +  res = ioctl(fd, FS_IOC_FSGETXATTR, &xattr);
> >> +  if (!res && (xattr.fsx_xflags & FS_XFLAG_DAX))
> >> +  ret = true;
> >> +}
> > 
> > This all looks pretty expensive for each lookup.
> 
> Yes. it can be somehow optimized if we can agree on the way of storing
> the dax flag persistently.

Dave

> -- 
> Thanks,
> Jeffle
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Virtio-fs] [virtiofsd PATCH v4 4/4] virtiofsd: support per-file DAX in FUSE_LOOKUP

2021-08-17 Thread Dr. David Alan Gilbert
* Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
> For passthrough, when the corresponding virtiofs in guest is mounted
> with '-o dax=inode', advertise that the file is capable of per-file
> DAX if the inode in the backend fs is marked with FS_DAX_FL flag.
> 
> Signed-off-by: Jeffle Xu 
> ---
>  tools/virtiofsd/passthrough_ll.c | 43 
>  1 file changed, 43 insertions(+)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 5b6228210f..4cbd904248 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -171,6 +171,7 @@ struct lo_data {
>  int allow_direct_io;
>  int announce_submounts;
>  int perfile_dax_cap; /* capability of backend fs */
> +bool perfile_dax; /* enable per-file DAX or not */
>  bool use_statx;
>  struct lo_inode root;
>  GHashTable *inodes; /* protected by lo->mutex */
> @@ -716,6 +717,10 @@ static void lo_init(void *userdata, struct 
> fuse_conn_info *conn)
>  
>  if (conn->capable & FUSE_CAP_PERFILE_DAX && lo->perfile_dax_cap ) {
>  conn->want |= FUSE_CAP_PERFILE_DAX;
> + lo->perfile_dax = 1;
> +}
> +else {
> + lo->perfile_dax = 0;
>  }
>  }
>  
> @@ -983,6 +988,41 @@ static int do_statx(struct lo_data *lo, int dirfd, const 
> char *pathname,
>  return 0;
>  }
>  
> +/*
> + * If the file is marked with FS_DAX_FL or FS_XFLAG_DAX, then DAX should be
> + * enabled for this file.
> + */
> +static bool lo_should_enable_dax(struct lo_data *lo, struct lo_inode *dir,
> +  const char *name)
> +{
> +int res, fd;
> +int ret = false;;
> +unsigned int attr;
> +struct fsxattr xattr;
> +
> +if (!lo->perfile_dax)
> + return false;
> +
> +/* Open file without O_PATH, so that ioctl can be called. */
> +fd = openat(dir->fd, name, O_NOFOLLOW);
> +if (fd == -1)
> +return false;

Doesn't that defeat the whole benefit of using O_PATH - i.e. that we
might stumble into a /dev node or something else we're not allowed to
open?

> +if (lo->perfile_dax_cap == DAX_CAP_FLAGS) {
> +res = ioctl(fd, FS_IOC_GETFLAGS, &attr);
> +if (!res && (attr & FS_DAX_FL))
> + ret = true;
> +}
> +else if (lo->perfile_dax_cap == DAX_CAP_XATTR) {
> + res = ioctl(fd, FS_IOC_FSGETXATTR, &xattr);
> + if (!res && (xattr.fsx_xflags & FS_XFLAG_DAX))
> + ret = true;
> +}

This all looks pretty expensive for each lookup.

Dave


> +close(fd);
> +return ret;
> +}
> +
>  /*
>   * Increments nlookup on the inode on success. unref_inode_lolocked() must be
>   * called eventually to decrement nlookup again. If inodep is non-NULL, the
> @@ -1038,6 +1078,9 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
> parent, const char *name,
>  e->attr_flags |= FUSE_ATTR_SUBMOUNT;
>  }
>  
> +if (lo_should_enable_dax(lo, dir, name))
> +     e->attr_flags |= FUSE_ATTR_DAX;
> +
>  inode = lo_find(lo, &e->attr, mnt_id);
>  if (inode) {
>  close(newfd);
> -- 
> 2.27.0
> 
> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Virtio-fs] [virtiofsd PATCH v4 3/4] virtiofsd: support per-file DAX negotiation in FUSE_INIT

2021-08-17 Thread Dr. David Alan Gilbert
> @@ -3826,6 +3841,20 @@ static void setup_root(struct lo_data *lo, struct 
> lo_inode *root)
>  root->posix_locks = g_hash_table_new_full(
>  g_direct_hash, g_direct_equal, NULL, posix_locks_value_destroy);
>  }
> +
> +/*
> + * Currently only ext4/xfs since linux kernel v5.8 support storing
> + * FS_DAX_FL flag persistently. Ext4 accesses this flag through
> + * FS_IOC_G[S]ETFLAGS ioctl, while xfs accesses this flag through
> +     * FS_IOC_FSG[S]ETXATTR ioctl.
> + */
> +res = fstatfs(fd, &statfs);
> +if (!res) {
> + if (statfs.f_type == EXT4_SUPER_MAGIC)
> + lo->perfile_dax_cap = DAX_CAP_FLAGS;
> + else if (statfs.f_type == XFS_SUPER_MAGIC)
> + lo->perfile_dax_cap = DAX_CAP_XATTR;
> +}
>  }
>  
>  static guint lo_key_hash(gconstpointer key)
> -- 
> 2.27.0
> 
> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-17 Thread Dr. David Alan Gilbert
* Miklos Szeredi (mik...@szeredi.hu) wrote:
> On Tue, 17 Aug 2021 at 11:32, Dr. David Alan Gilbert
>  wrote:
> >
> > * Miklos Szeredi (mik...@szeredi.hu) wrote:
> > > On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  
> > > wrote:
> > > >
> > > > This patchset adds support of per-file DAX for virtiofs, which is
> > > > inspired by Ira Weiny's work on ext4[1] and xfs[2].
> > >
> > > Can you please explain the background of this change in detail?
> > >
> > > Why would an admin want to enable DAX for a particular virtiofs file
> > > and not for others?
> >
> > Where we're contending on virtiofs dax cache size it makes a lot of
> > sense; it's quite expensive for us to map something into the cache
> > (especially if we push something else out), so selectively DAXing files
> > that are expected to be hot could help reduce cache churn.
> 
> If this is a performance issue, it should be fixed in a way that
> doesn't require hand tuning like you suggest, I think.

I'd agree that would be nice; however:
  a) It looks like other filesystems already gave something admin
selectable
  b) Trying to write clever heuristics is only going to work in some
cases; being able to say 'DAX this directory' might work better in
practice.
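
For reference, a minimal sketch (not from the thread) of the per-file knob an
admin would be flipping on an xfs-style backend - setting FS_XFLAG_DAX through
FS_IOC_FSSETXATTR; ext4 exposes the same hint as FS_DAX_FL via FS_IOC_SETFLAGS:

/* Sketch: persistently mark one file as DAX-capable via FS_IOC_FSSETXATTR. */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    struct fsxattr fsx;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
        perror(argv[1]);
        return 1;
    }
    fsx.fsx_xflags |= FS_XFLAG_DAX;              /* the per-file DAX hint */
    if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
        perror("FS_IOC_FSSETXATTR");
        return 1;
    }
    close(fd);
    return 0;
}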

> I'm not sure what the  ext4/xfs case for per-file DAX is.  Maybe that
> can help understand the virtiofs case as well.

Yep, I don't understand the case with real nvdimm hardware.

Dave

> Thanks,
> Miklos
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Virtio-fs] [PATCH v4 6/8] fuse: mark inode DONT_CACHE when per-file DAX indication changes

2021-08-17 Thread Dr. David Alan Gilbert
* Jeffle Xu (jeffl...@linux.alibaba.com) wrote:
> When the per-file DAX indication changes while the file is still
> *opened*, it is quite complicated and maybe fragile to dynamically
> change the DAX state.
> 
> Hence mark the inode and corresponding dentries as DONE_CACHE once the

 ^^
typo as DONT ?

Dave

> per-file DAX indication changes, so that the inode instance will be
> evicted and freed as soon as possible once the file is closed and the
> last reference to the inode is put. And then when the file gets reopened
> next time, the inode will reflect the new DAX state.
> 
> In summary, when the per-file DAX indication changes for an *opened*
> file, the state of the file won't be updated until this file is closed
> and reopened later.
> 
> Signed-off-by: Jeffle Xu 
> ---
>  fs/fuse/dax.c| 9 +
>  fs/fuse/fuse_i.h | 1 +
>  fs/fuse/inode.c  | 3 +++
>  3 files changed, 13 insertions(+)
> 
> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> index 30833f8d37dd..f7ede0be4e00 100644
> --- a/fs/fuse/dax.c
> +++ b/fs/fuse/dax.c
> @@ -1364,6 +1364,15 @@ void fuse_dax_inode_init(struct inode *inode, unsigned 
> int flags)
>   inode->i_data.a_ops = _dax_file_aops;
>  }
>  
> +void fuse_dax_dontcache(struct inode *inode, bool newdax)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> +
> + if (fc->dax_mode == FUSE_DAX_INODE &&
> + fc->perfile_dax && (!!IS_DAX(inode) != newdax))
> + d_mark_dontcache(inode);
> +}
> +
>  bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int 
> map_alignment)
>  {
>   if (fc->dax && (map_alignment > FUSE_DAX_SHIFT)) {
> diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> index 7b7b4c208af2..56fe1c4d2136 100644
> --- a/fs/fuse/fuse_i.h
> +++ b/fs/fuse/fuse_i.h
> @@ -1260,6 +1260,7 @@ void fuse_dax_conn_free(struct fuse_conn *fc);
>  bool fuse_dax_inode_alloc(struct super_block *sb, struct fuse_inode *fi);
>  void fuse_dax_inode_init(struct inode *inode, unsigned int flags);
>  void fuse_dax_inode_cleanup(struct inode *inode);
> +void fuse_dax_dontcache(struct inode *inode, bool newdax);
>  bool fuse_dax_check_alignment(struct fuse_conn *fc, unsigned int 
> map_alignment);
>  void fuse_dax_cancel_work(struct fuse_conn *fc);
>  
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 8080f78befed..8c9774c6a210 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -269,6 +269,9 @@ void fuse_change_attributes(struct inode *inode, struct 
> fuse_attr *attr,
>   if (inval)
>   invalidate_inode_pages2(inode->i_mapping);
>   }
> +
> + if (IS_ENABLED(CONFIG_FUSE_DAX))
> + fuse_dax_dontcache(inode, attr->flags & FUSE_ATTR_DAX);
>  }
>  
>  static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
> -- 
> 2.27.0
> 
> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [Virtio-fs] [PATCH v4 0/8] fuse,virtiofs: support per-file DAX

2021-08-17 Thread Dr. David Alan Gilbert
* Miklos Szeredi (mik...@szeredi.hu) wrote:
> On Tue, 17 Aug 2021 at 04:22, Jeffle Xu  wrote:
> >
> > This patchset adds support of per-file DAX for virtiofs, which is
> > inspired by Ira Weiny's work on ext4[1] and xfs[2].
> 
> Can you please explain the background of this change in detail?
> 
> Why would an admin want to enable DAX for a particular virtiofs file
> and not for others?

Where we're contending on virtiofs dax cache size it makes a lot of
sense; it's quite expensive for us to map something into the cache
(especially if we push something else out), so selectively DAXing files
that are expected to be hot could help reduce cache churn.

Dave

> Thanks,
> Miklos
> 
> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: [RFC PATCH 01/19] tools/virtiofsd: add support for --socket-group

2020-10-07 Thread Dr. David Alan Gilbert
* Alex Bennée (alex.ben...@linaro.org) wrote:
> If you like running QEMU as a normal user (very common for TCG runs)
> but you have to run virtiofsd as a root user you run into connection
> problems. Adding support for an optional --socket-group allows the
> users to keep using the command line.
> 
> Signed-off-by: Alex Bennée 
> Reviewed-by: Stefan Hajnoczi 

Queued

> ---
> v1
>   - tweak documentation and commentary
> ---
>  docs/tools/virtiofsd.rst|  4 
>  tools/virtiofsd/fuse_i.h|  1 +
>  tools/virtiofsd/fuse_lowlevel.c |  6 ++
>  tools/virtiofsd/fuse_virtio.c   | 20 ++--
>  4 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
> index e33c81ed41f1..085f9b12a6a3 100644
> --- a/docs/tools/virtiofsd.rst
> +++ b/docs/tools/virtiofsd.rst
> @@ -87,6 +87,10 @@ Options
>  
>Listen on vhost-user UNIX domain socket at PATH.
>  
> +.. option:: --socket-group=GROUP
> +
> +  Set the vhost-user UNIX domain socket gid to GROUP.
> +
>  .. option:: --fd=FDNUM
>  
>Accept connections from vhost-user UNIX domain socket file descriptor 
> FDNUM.
> diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
> index 1240828208ab..492e002181e2 100644
> --- a/tools/virtiofsd/fuse_i.h
> +++ b/tools/virtiofsd/fuse_i.h
> @@ -68,6 +68,7 @@ struct fuse_session {
>  size_t bufsize;
>  int error;
>  char *vu_socket_path;
> +char *vu_socket_group;
>  int   vu_listen_fd;
>  int   vu_socketfd;
>  struct fv_VuDev *virtio_dev;
> diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
> index 2dd36ec03b6e..4d1ba2925d1b 100644
> --- a/tools/virtiofsd/fuse_lowlevel.c
> +++ b/tools/virtiofsd/fuse_lowlevel.c
> @@ -2523,6 +2523,7 @@ static const struct fuse_opt fuse_ll_opts[] = {
>  LL_OPTION("--debug", debug, 1),
>  LL_OPTION("allow_root", deny_others, 1),
>  LL_OPTION("--socket-path=%s", vu_socket_path, 0),
> +LL_OPTION("--socket-group=%s", vu_socket_group, 0),
>  LL_OPTION("--fd=%d", vu_listen_fd, 0),
>  LL_OPTION("--thread-pool-size=%d", thread_pool_size, 0),
>  FUSE_OPT_END
> @@ -2630,6 +2631,11 @@ struct fuse_session *fuse_session_new(struct fuse_args 
> *args,
>   "fuse: --socket-path and --fd cannot be given together\n");
>  goto out4;
>  }
> +if (se->vu_socket_group && !se->vu_socket_path) {
> +fuse_log(FUSE_LOG_ERR,
> + "fuse: --socket-group can only be used with 
> --socket-path\n");
> +goto out4;
> +}
>  
>  se->bufsize = FUSE_MAX_MAX_PAGES * getpagesize() + 
> FUSE_BUFFER_HEADER_SIZE;
>  
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 9e5537506c16..7942d3d11a87 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -31,6 +31,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  
>  #include "contrib/libvhost-user/libvhost-user.h"
> @@ -924,15 +926,29 @@ static int fv_create_listen_socket(struct fuse_session 
> *se)
>  
>  /*
>   * Unfortunately bind doesn't let you set the mask on the socket,
> - * so set umask to 077 and restore it later.
> + * so set umask appropriately and restore it later.
>   */
> -old_umask = umask(0077);
> +if (se->vu_socket_group) {
> +old_umask = umask(S_IROTH | S_IWOTH | S_IXOTH);
> +} else {
> +old_umask = umask(S_IRGRP | S_IWGRP | S_IXGRP | S_IROTH | S_IWOTH | 
> S_IXOTH);
> +}
>  if (bind(listen_sock, (struct sockaddr *)&addr, addr_len) == -1) {
>  fuse_log(FUSE_LOG_ERR, "vhost socket bind: %m\n");
>  close(listen_sock);
>  umask(old_umask);
>  return -1;
>  }
> +if (se->vu_socket_group) {
> +struct group *g = getgrnam(se->vu_socket_group);
> +if (g) {
> +if (!chown(se->vu_socket_path, -1, g->gr_gid)) {
> +fuse_log(FUSE_LOG_WARNING,
> + "vhost socket failed to set group to %s (%d)\n",
> + se->vu_socket_group, g->gr_gid);
> +}
> +}
> +}
>  umask(old_umask);
>  
>  if (listen(listen_sock, 1) == -1) {
> -- 
> 2.20.1
> 
-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



Re: virtq questions

2019-10-02 Thread Dr. David Alan Gilbert
* Miklos Szeredi (mik...@szeredi.hu) wrote:
> Looking at the ugly retry logic in virtiofs and have some questions.
> First one is, where do these features come from:
> 
> VIRTIO_F_RING_PACKED
> VIRTIO_RING_F_INDIRECT_DESC
> 
> I see that in virtiofs "packed" is off and "indirect" is on.  Is this
> guaranteed?

I think the indirect one is coming from qemu's hw/virtio/virtio.h 
DEFINE_VIRTIO_COMMON_FEATURES
which has:
DEFINE_PROP_BIT64("indirect_desc", _state, _field,\
  VIRTIO_RING_F_INDIRECT_DESC, true), \

Dave

> Thanks,
> Miklos
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


[PATCH 0/4] virtio: Add definitions for shared memory regions

2019-07-25 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Add the virtio shared memory region definitions that
have just got merged into the spec.

Dr. David Alan Gilbert (4):
  virtio_pci: Define id field
  virtio_pci: Define virtio_pci_cap64
  virtio_pci: Defined shared memory capability
  virito_mmio: Define shared memory region registers

 include/uapi/linux/virtio_mmio.h | 11 +++
 include/uapi/linux/virtio_pci.h  | 11 ++-
 2 files changed, 21 insertions(+), 1 deletion(-)

-- 
2.21.0



[PATCH 2/4] virtio_pci: Define virtio_pci_cap64

2019-07-25 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Define 'virtio_pci_cap64' to allow capabilities to describe
memory regions larger than, or with an offset larger than 4GiB.

This will be used by the shared memory region capability.

Defined in virtio spec commit 8100dcfcd622 ("pci: Define virtio_pci_cap64")

Signed-off-by: Dr. David Alan Gilbert 
---
 include/uapi/linux/virtio_pci.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 9defe4a124c5..11e508719dfd 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -127,6 +127,12 @@ struct virtio_pci_cap {
__le32 length;  /* Length of the structure, in bytes. */
 };
 
+struct virtio_pci_cap64 {
+   struct virtio_pci_cap cap;
+   __le32 offset_hi;   /* Most sig 32 bits of offset */
+   __le32 length_hi;   /* Most sig 32 bits of length */
+};
+
 struct virtio_pci_notify_cap {
struct virtio_pci_cap cap;
__le32 notify_off_multiplier;   /* Multiplier for queue_notify_off. */
-- 
2.21.0
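
As an aside (not part of the patch), a consumer of this capability assembles
the 64-bit values from the two halves; a minimal sketch, assuming the structure
above is available from linux/virtio_pci.h:

/* Sketch: combine the 32-bit halves of a virtio_pci_cap64 into full
 * 64-bit offset/length values (fields are little-endian per the spec). */
#include <endian.h>
#include <stdint.h>
#include <linux/virtio_pci.h>

static void cap64_offset_len(const struct virtio_pci_cap64 *cap,
                             uint64_t *offset, uint64_t *len)
{
    *offset = ((uint64_t)le32toh(cap->offset_hi) << 32) |
              le32toh(cap->cap.offset);
    *len    = ((uint64_t)le32toh(cap->length_hi) << 32) |
              le32toh(cap->cap.length);
}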



[PATCH 4/4] virito_mmio: Define shared memory region registers

2019-07-25 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Define an MMIO interface to discover and map shared
memory regions.

Defined in virtio spec commit 2dd2d468f69b ("shared memory: Define mmio 
registers")

Signed-off-by: Sebastien Boeuf 
Signed-off-by: Dr. David Alan Gilbert 
---
 include/uapi/linux/virtio_mmio.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/uapi/linux/virtio_mmio.h b/include/uapi/linux/virtio_mmio.h
index c4b09689ab64..0650f91bea6c 100644
--- a/include/uapi/linux/virtio_mmio.h
+++ b/include/uapi/linux/virtio_mmio.h
@@ -122,6 +122,17 @@
 #define VIRTIO_MMIO_QUEUE_USED_LOW 0x0a0
 #define VIRTIO_MMIO_QUEUE_USED_HIGH0x0a4
 
+/* Shared memory region id */
+#define VIRTIO_MMIO_SHM_SEL 0x0ac
+
+/* Shared memory region length, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_LEN_LOW 0x0b0
+#define VIRTIO_MMIO_SHM_LEN_HIGH0x0b4
+
+/* Shared memory region base address, 64 bits in two halves */
+#define VIRTIO_MMIO_SHM_BASE_LOW0x0b8
+#define VIRTIO_MMIO_SHM_BASE_HIGH   0x0bc
+
 /* Configuration atomicity value */
 #define VIRTIO_MMIO_CONFIG_GENERATION  0x0fc
 
-- 
2.21.0
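
A sketch of the discovery sequence these registers imply - select a region id,
then read length and base in two 32-bit halves.  The MMIO accessors below are
placeholders for whatever the guest driver uses; the spec describes a length of
all-ones for a non-existent region:

/* Sketch: discover shared memory region 'id' through the registers above. */
#include <stdint.h>
#include <linux/virtio_mmio.h>   /* register offsets added by this patch */

extern uint32_t mmio_read32(uint32_t reg);            /* placeholder accessor */
extern void mmio_write32(uint32_t reg, uint32_t val); /* placeholder accessor */

static int read_shm_region(uint32_t id, uint64_t *base, uint64_t *len)
{
    mmio_write32(VIRTIO_MMIO_SHM_SEL, id);

    *len = ((uint64_t)mmio_read32(VIRTIO_MMIO_SHM_LEN_HIGH) << 32) |
           mmio_read32(VIRTIO_MMIO_SHM_LEN_LOW);
    if (*len == UINT64_MAX) {        /* all-ones length: no such region */
        return -1;
    }
    *base = ((uint64_t)mmio_read32(VIRTIO_MMIO_SHM_BASE_HIGH) << 32) |
            mmio_read32(VIRTIO_MMIO_SHM_BASE_LOW);
    return 0;
}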



[PATCH 3/4] virtio_pci: Defined shared memory capability

2019-07-25 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

Define the PCI capability used for enumerating shared memory regions.
Each capability contains a 'struct virtio_pci_cap64'
Multiple capabilities of this type may exist for a device and may
be distinguished using the 'cap.id' field.

Defined in virtio spec commit 855ad7af2bd64 ("shared memory: Define PCI 
capability")

Signed-off-by: Dr. David Alan Gilbert 
---
 include/uapi/linux/virtio_pci.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 11e508719dfd..401a5d6851f1 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -113,6 +113,8 @@
 #define VIRTIO_PCI_CAP_DEVICE_CFG  4
 /* PCI configuration access */
 #define VIRTIO_PCI_CAP_PCI_CFG 5
+/* Additional shared memory capability */
+#define VIRTIO_PCI_CAP_SHARED_MEMORY_CFG 8
 
 /* This is the PCI capability header: */
 struct virtio_pci_cap {
-- 
2.21.0



[PATCH 1/4] virtio_pci: Define id field

2019-07-25 Thread Dr. David Alan Gilbert (git)
From: "Dr. David Alan Gilbert" 

The new virtio-shared memory region system allows multiple regions
to be defined for a device; each of these is the same type of
capability.
To allow multiple capabilities of the same type, create
an 'id' field to differentiate them.

Defined in virtio spec commit 39dfc8afc0b93 ("pci: Define id field")

Signed-off-by: Dr. David Alan Gilbert 
---
 include/uapi/linux/virtio_pci.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
index 90007a1abcab..9defe4a124c5 100644
--- a/include/uapi/linux/virtio_pci.h
+++ b/include/uapi/linux/virtio_pci.h
@@ -121,7 +121,8 @@ struct virtio_pci_cap {
__u8 cap_len;   /* Generic PCI field: capability length */
__u8 cfg_type;  /* Identifies the structure. */
__u8 bar;   /* Where to find it. */
-   __u8 padding[3];/* Pad to full dword. */
+   __u8 id;/* Multiple capabilities of the same type */
+   __u8 padding[2];/* Pad to full dword. */
__le32 offset;  /* Offset within bar. */
__le32 length;  /* Length of the structure, in bytes. */
 };
-- 
2.21.0



Re: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-09-07 Thread Dr. David Alan Gilbert
* Wei Wang (wei.w.w...@intel.com) wrote:
> On 07/23/2018 10:36 PM, Dr. David Alan Gilbert wrote:
> > * Michael S. Tsirkin (m...@redhat.com) wrote:
> > > On Fri, Jul 20, 2018 at 04:33:00PM +0800, Wei Wang wrote:
> > > > This patch series is separated from the previous "Virtio-balloon
> > > > Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,
> > > > implemented by this series enables the virtio-balloon driver to report
> > > > hints of guest free pages to the host. It can be used to accelerate live
> > > > migration of VMs. Here is an introduction of this usage:
> > > > 
> > > > Live migration needs to transfer the VM's memory from the source machine
> > > > to the destination round by round. For the 1st round, all the VM's 
> > > > memory
> > > > is transferred. From the 2nd round, only the pieces of memory that were
> > > > written by the guest (after the 1st round) are transferred. One method
> > > > that is popularly used by the hypervisor to track which part of memory 
> > > > is
> > > > written is to write-protect all the guest memory.
> > > > 
> > > > This feature enables the optimization by skipping the transfer of guest
> > > > free pages during VM live migration. It is not concerned that the memory
> > > > pages are used after they are given to the hypervisor as a hint of the
> > > > free pages, because they will be tracked by the hypervisor and 
> > > > transferred
> > > > in the subsequent round if they are used and written.
> > > > 
> > > > * Tests
> > > > - Test Environment
> > > >  Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > > >  Guest: 8G RAM, 4 vCPU
> > > >  Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 
> > > > second
> > > > 
> > > > - Test Results
> > > >  - Idle Guest Live Migration Time (results are averaged over 10 
> > > > runs):
> > > >  - Optimization v.s. Legacy = 409ms vs 1757ms --> ~77% reduction
> > > > (setting page poisoning zero and enabling ksm don't affect the
> > > >   comparison result)
> > > >  - Guest with Linux Compilation Workload (make bzImage -j4):
> > > >  - Live Migration Time (average)
> > > >Optimization v.s. Legacy = 1407ms v.s. 2528ms --> ~44% 
> > > > reduction
> > > >  - Linux Compilation Time
> > > >Optimization v.s. Legacy = 5min4s v.s. 5min12s
> > > >--> no obvious difference
> > > I'd like to see dgilbert's take on whether this kind of gain
> > > justifies adding a PV interfaces, and what kind of guest workload
> > > is appropriate.
> > > 
> > > Cc'd.
> > Well, 44% is great ... although the measurement is a bit weird.
> > 
> > a) A 2 second downtime is very large; 300-500ms is more normal
> > b) I'm not sure what the 'average' is  - is that just between a bunch of
> > repeated migrations?
> > c) What load was running in the guest during the live migration?
> > 
> > An interesting measurement to add would be to do the same test but
> > with a VM with a lot more RAM but the same load;  you'd hope the gain
> > would be even better.
> > It would be interesting, especially because the users who are interested
> > are people creating VMs allocated with lots of extra memory (for the
> > worst case) but most of the time migrating when it's fairly idle.
> > 
> > Dave
> > 
> 
> Hi Dave,
> 
> The results of the added experiments have been shown in the v37 cover
> letter.
> Could you have a look at https://lkml.org/lkml/2018/8/27/29 . Thanks.

OK, that's much better.
The ~50% reduction with an 8G VM and a real workload is great,
and it does what you expect when you put a lot more RAM in and see the
84% reduction on a guest with 128G RAM - 54s vs ~9s is a big win!

(The migrate_set_speed is a bit high, since that's in bytes/s - but it's
not important).

That looks good,

Thanks!

Dave

> Best,
> Wei
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH v36 0/5] Virtio-balloon: support free page reporting

2018-07-23 Thread Dr. David Alan Gilbert
* Michael S. Tsirkin (m...@redhat.com) wrote:
> On Fri, Jul 20, 2018 at 04:33:00PM +0800, Wei Wang wrote:
> > This patch series is separated from the previous "Virtio-balloon
> > Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,  
> > implemented by this series enables the virtio-balloon driver to report
> > hints of guest free pages to the host. It can be used to accelerate live
> > migration of VMs. Here is an introduction of this usage:
> > 
> > Live migration needs to transfer the VM's memory from the source machine
> > to the destination round by round. For the 1st round, all the VM's memory
> > is transferred. From the 2nd round, only the pieces of memory that were
> > written by the guest (after the 1st round) are transferred. One method
> > that is popularly used by the hypervisor to track which part of memory is
> > written is to write-protect all the guest memory.
> > 
> > This feature enables the optimization by skipping the transfer of guest
> > free pages during VM live migration. It is not concerned that the memory
> > pages are used after they are given to the hypervisor as a hint of the
> > free pages, because they will be tracked by the hypervisor and transferred
> > in the subsequent round if they are used and written.
> > 
> > * Tests
> > - Test Environment
> > Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> > Guest: 8G RAM, 4 vCPU
> > Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second
> > 
> > - Test Results
> > - Idle Guest Live Migration Time (results are averaged over 10 runs):
> > - Optimization v.s. Legacy = 409ms vs 1757ms --> ~77% reduction
> > (setting page poisoning zero and enabling ksm don't affect the
> >  comparison result)
> > - Guest with Linux Compilation Workload (make bzImage -j4):
> > - Live Migration Time (average)
> >   Optimization v.s. Legacy = 1407ms v.s. 2528ms --> ~44% reduction
> > - Linux Compilation Time
> >   Optimization v.s. Legacy = 5min4s v.s. 5min12s
> >   --> no obvious difference
> 
> I'd like to see dgilbert's take on whether this kind of gain
> justifies adding a PV interfaces, and what kind of guest workload
> is appropriate.
> 
> Cc'd.

Well, 44% is great ... although the measurement is a bit weird.

a) A 2 second downtime is very large; 300-500ms is more normal
b) I'm not sure what the 'average' is  - is that just between a bunch of
repeated migrations?
c) What load was running in the guest during the live migration?

An interesting measurement to add would be to do the same test but
with a VM with a lot more RAM but the same load;  you'd hope the gain
would be even better.
It would be interesting, especially because the users who are interested
are people creating VMs allocated with lots of extra memory (for the
worst case) but most of the time migrating when it's fairly idle.

Dave

> 
> 
> > ChangeLog:
> > v35->v36:
> > - remove the mm patch, as Linus has a suggestion to get free page
> >   addresses via allocation, instead of reading from the free page
> >   list.
> > - virtio-balloon:
> > - replace oom notifier with shrinker;
> > - the guest to host communication interface remains the same as
> >   v32.
> > - allocate free page blocks and send to host one by one, and free
> >   them after sending all the pages.
> > 
> > For ChangeLogs from v22 to v35, please reference
> > https://lwn.net/Articles/759413/
> > 
> > For ChangeLogs before v21, please reference
> > https://lwn.net/Articles/743660/
> > 
> > Wei Wang (5):
> >   virtio-balloon: remove BUG() in init_vqs
> >   virtio_balloon: replace oom notifier with shrinker
> >   virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT
> >   mm/page_poison: expose page_poisoning_enabled to kernel modules
> >   virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON
> > 
> >  drivers/virtio/virtio_balloon.c | 456 
> > ++--
> >  include/uapi/linux/virtio_balloon.h |   7 +
> >  mm/page_poison.c|   6 +
> >  3 files changed, 394 insertions(+), 75 deletions(-)
> > 
> > -- 
> > 2.7.4
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [PATCH v6 kernel 3/5] virtio-balloon: speed up inflate/deflate process

2017-01-20 Thread Dr. David Alan Gilbert
* Liang Li (liang.z...@intel.com) wrote:



> +static void free_extended_page_bitmap(struct virtio_balloon *vb)
> +{
> + int i, bmap_count = vb->nr_page_bmap;
> +
> + for (i = 1; i < bmap_count; i++) {
> + kfree(vb->page_bitmap[i]);
> + vb->page_bitmap[i] = NULL;
> + vb->nr_page_bmap--;
> + }
> +}
> +
> +static void kfree_page_bitmap(struct virtio_balloon *vb)
> +{
> + int i;
> +
> + for (i = 0; i < vb->nr_page_bmap; i++)
> + kfree(vb->page_bitmap[i]);
> +}

It might be worth commenting that pair of functions to make it clear
why they are so different; I guess the kfree_page_bitmap
is used just before you free the structure above it so you
don't need to keep the count/pointers updated?
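
A sketch of the sort of comments being asked for - the wording guesses at the
intent from the code above rather than coming from the patch author:

/*
 * Drop the extra bitmaps that were allocated on demand, keeping only
 * page_bitmap[0]; nr_page_bmap and the pointers are kept accurate
 * because the balloon device lives on and may grow the array again.
 */
static void free_extended_page_bitmap(struct virtio_balloon *vb)
{
	int i, bmap_count = vb->nr_page_bmap;

	for (i = 1; i < bmap_count; i++) {
		kfree(vb->page_bitmap[i]);
		vb->page_bitmap[i] = NULL;
		vb->nr_page_bmap--;
	}
}

/*
 * Free every bitmap without updating the count or pointers: this is
 * only called on the teardown path, just before the structure holding
 * them is freed, so nothing will look at them afterwards.
 */
static void kfree_page_bitmap(struct virtio_balloon *vb)
{
	int i;

	for (i = 0; i < vb->nr_page_bmap; i++)
		kfree(vb->page_bitmap[i]);
}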

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-15 Thread Dr. David Alan Gilbert
* Li, Liang Z (liang.z...@intel.com) wrote:
> > On Mon, Mar 14, 2016 at 05:03:34PM +0000, Dr. David Alan Gilbert wrote:
> > > * Li, Liang Z (liang.z...@intel.com) wrote:
> > > > >
> > > > > Hi,
> > > > >   I'm just catching back up on this thread; so without reference
> > > > > to any particular previous mail in the thread.
> > > > >
> > > > >   1) How many of the free pages do we tell the host about?
> > > > >  Your main change is telling the host about all the
> > > > >  free pages.
> > > >
> > > > Yes, all the guest's free pages.
> > > >
> > > > >  If we tell the host about all the free pages, then we might
> > > > >  end up needing to allocate more pages and update the host
> > > > >  with pages we now want to use; that would have to wait for the
> > > > >  host to acknowledge that use of these pages, since if we don't
> > > > >  wait for it then it might have skipped migrating a page we
> > > > >  just started using (I don't understand how your series solves 
> > > > > that).
> > > > >  So the guest probably needs to keep some free pages - how many?
> > > >
> > > > Actually, there is no need to care about whether the free pages will be
> > used by the host.
> > > > We only care about some of the free pages we get reused by the guest,
> > right?
> > > >
> > > > The dirty page logging can be used to solve this, starting the dirty
> > > > page logging before getting the free pages informant from guest.
> > > > Even some of the free pages are modified by the guest during the
> > > > process of getting the free pages information, these modified pages will
> > be traced by the dirty page logging mechanism. So in the following
> > migration_bitmap_sync() function.
> > > > The pages in the free pages bitmap, but latter was modified, will be
> > > > reset to dirty. We won't omit any dirtied pages.
> > > >
> > > > So, guest doesn't need to keep any free pages.
> > >
> > > OK, yes, that works; so we do:
> > >   * enable dirty logging
> > >   * ask guest for free pages
> > >   * initialise the migration bitmap as everything-free
> > >   * then later we do the normal sync-dirty bitmap stuff and it all just 
> > > works.
> > >
> > > That's nice and simple.
> > 
> > This works once, sure. But there's an issue is that you have to defer 
> > migration
> > until you get the free page list, and this only works once. So you end up 
> > with
> > heuristics about how long to wait.
> > 
> > Instead I propose:
> > 
> > - mark all pages dirty as we do now.
> > 
> > - at start of migration, start tracking dirty
> >   pages in kvm, and tell guest to start tracking free pages
> > 
> > we can now introduce any kind of delay, for example wait for ack from guest,
> > or do whatever else, or even just start migrating pages
> > 
> > - repeatedly:
> > - get list of free pages from guest
> > - clear them in migration bitmap
> > - get dirty list from kvm
> > 
> > - at end of migration, stop tracking writes in kvm,
> >   and tell guest to stop tracking free pages
> 
> I had thought of filtering out the free pages in each migration bitmap 
> synchronization. 
> The advantage is we can skip process as many free pages as possible. Not just 
> once.
> The disadvantage is that we should change the current memory management code 
> to track the free pages,
> instead of traversing the free page list to construct the free pages bitmap, 
> to reduce the overhead to get the free pages bitmap.
> I am not sure the if the Kernel people would like it.
> 
> If keeping the traversing mechanism, because of the overhead, maybe it's not 
> worth to filter out the free pages repeatedly.

Well, Michael's idea of not waiting for the dirty
bitmap to be filled does make that idea of constantly
using the free-bitmap better.

In that case, is it easier if something (guest/host?)
allocates some memory in the guest's physical RAM space
and just points the host to it, rather than having an 
explicit 'send'.

Dave

> Liang
> 
> 
> 
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-14 Thread Dr. David Alan Gilbert
> > of using pagemap to spot zero pages that would help find other zero'd
> > pages, but perhaps ballooned is enough?
> > 
> Could you describe your ideal with more details?

At the moment the migration code spends a fair amount of time checking if a page
is zero; I was thinking perhaps the qemu could just open /proc/self/pagemap
and check if the page was mapped; that would seem cheap if we're checking big
ranges; and that would find all the balloon pages.
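
A rough sketch of that check: each page of the address space has a 64-bit
entry in /proc/self/pagemap, where bit 63 means present in RAM and bit 62
means swapped; if both are clear, nothing is backing the page:

/* Sketch: is this virtual address backed by RAM or swap? */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static int page_is_mapped(int pagemap_fd, uintptr_t vaddr)
{
    uint64_t entry;
    long pagesize = sysconf(_SC_PAGESIZE);
    off_t offset = (off_t)(vaddr / pagesize) * sizeof(entry);

    if (pread(pagemap_fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        return -1;
    }
    return !!(entry & ((1ULL << 63) | (1ULL << 62)));
}

/* caller would do: int fd = open("/proc/self/pagemap", O_RDONLY); */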

> >   5) Second-migrate
> > Given a VM where you've done all those tricks on, what happens when
> > you migrate it a second time?   I guess you're aiming for the guest
> > to update it's bitmap;  HPe's solution is to migrate it's balloon
> > bitmap along with the migration data.
> 
> Nothing is special in the second migration, QEMU will request the guest for 
> free pages
> Information, and the guest will traverse it's current free page list to 
> construct a
> new free page bitmap and send it to QEMU. Just like in the first migration.

Right.

Dave

> Liang
> > 
> > Dave
> > 
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-10 Thread Dr. David Alan Gilbert
Hi,
  I'm just catching back up on this thread; so without reference to any
particular previous mail in the thread.

  1) How many of the free pages do we tell the host about?
 Your main change is telling the host about all the
 free pages.
 If we tell the host about all the free pages, then we might
 end up needing to allocate more pages and update the host
 with pages we now want to use; that would have to wait for the
 host to acknowledge that use of these pages, since if we don't
 wait for it then it might have skipped migrating a page we
 just started using (I don't understand how your series solves that).
 So the guest probably needs to keep some free pages - how many?

  2) Clearing out caches
 Does it make sense to clean caches?  They're apparently useful data
 so if we clean them it's likely to slow the guest down; I guess
 they're also likely to be fairly static data - so at least fairly
 easy to migrate.
 The answer here partially depends on what you want from your migration;
 if you're after the fastest possible migration time it might make
 sense to clean the caches and avoid migrating them; but that might
 be at the cost of more disruption to the guest - there's a trade off
 somewhere and it's not clear to me how you set that depending on your
 guest/network/requirements.

  3) Why is ballooning slow?
 You've got a figure of 5s to balloon on an 8GB VM - but an 
 8GB VM isn't huge; so I worry about how long it would take
 on a big VM.   We need to understand why it's slow 
   * is it due to the guest shuffling pages around? 
   * is it due to the virtio-balloon protocol sending one page
 at a time?
 + Do balloon pages normally clump in physical memory
- i.e. would a 'large balloon' message help
- or do we need a bitmap because it tends not to clump?

   * is it due to the madvise on the host?
 If we were using the normal balloon messages, then we
 could, during migration, just route those to the migration
 code rather than bothering with the madvise.
 If they're clumping together we could just turn that into
 one big madvise; if they're not then would we benefit from
 a call that lets us madvise lots of areas?

  4) Speeding up the migration of those free pages
You're using the bitmap to avoid migrating those free pages; HPe's
patchset is reconstructing a bitmap from the balloon data;  OK, so
this all makes sense to avoid migrating them - I'd also been thinking
of using pagemap to spot zero pages that would help find other zero'd
pages, but perhaps ballooned is enough?

  5) Second-migrate
Given a VM where you've done all those tricks on, what happens when
you migrate it a second time?   I guess you're aiming for the guest
to update it's bitmap;  HPe's solution is to migrate it's balloon
bitmap along with the migration data.
 
Dave

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Dr. David Alan Gilbert
* Paolo Bonzini (pbonz...@redhat.com) wrote:
> 
> 
> On 04/03/2016 15:26, Li, Liang Z wrote:
> >> > 
> >> > The memory usage will keep increasing due to ever growing caches, etc, so
> >> > you'll be left with very little free memory fairly soon.
> >> > 
> > I don't think so.
> > 
> 
> Roman is right.  For example, here I am looking at a 64 GB (physical)
> machine which was booted about 30 minutes ago, and which is running
> disk-heavy workloads (installing VMs).
> 
> Since I have started writing this email (2 minutes?), the amount of free
> memory has already gone down from 37 GB to 33 GB.  I expect that by the
> time I have finished running the workload, in two hours, it will not
> have any free memory.

But what about a VM sitting idle, or one that just has more RAM assigned to it
than it is currently using?
 I've got a host here that's been up for 46 days and did some
heavy VM debugging a few days ago, but today:

# free -m
              total        used        free      shared  buff/cache   available
Mem:          96536        1146       44834         184       50555       94735

I very rarely use all its RAM, so it's got a big chunk of free RAM, and yes
it's got a big chunk of cache as well.

Dave

> 
> Paolo
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [Qemu-devel] [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-04 Thread Dr. David Alan Gilbert
* Roman Kagan (rka...@virtuozzo.com) wrote:
> On Fri, Mar 04, 2016 at 08:23:09AM +, Li, Liang Z wrote:
> > > On Thu, Mar 03, 2016 at 05:46:15PM +0000, Dr. David Alan Gilbert wrote:
> > > > * Liang Li (liang.z...@intel.com) wrote:
> > > > > The current QEMU live migration implementation mark the all the
> > > > > guest's RAM pages as dirtied in the ram bulk stage, all these pages
> > > > > will be processed and that takes quit a lot of CPU cycles.
> > > > >
> > > > > From guest's point of view, it doesn't care about the content in
> > > > > free pages. We can make use of this fact and skip processing the
> > > > > free pages in the ram bulk stage, it can save a lot CPU cycles and
> > > > > reduce the network traffic significantly while speed up the live
> > > > > migration process obviously.
> > > > >
> > > > > This patch set is the QEMU side implementation.
> > > > >
> > > > > The virtio-balloon is extended so that QEMU can get the free pages
> > > > > information from the guest through virtio.
> > > > >
> > > > > After getting the free pages information (a bitmap), QEMU can use it
> > > > > to filter out the guest's free pages in the ram bulk stage. This
> > > > > make the live migration process much more efficient.
> > > >
> > > > Hi,
> > > >   An interesting solution; I know a few different people have been
> > > > looking at how to speed up ballooned VM migration.
> > > >
> > > >   I wonder if it would be possible to avoid the kernel changes by
> > > > parsing /proc/self/pagemap - if that can be used to detect
> > > > unmapped/zero mapped pages in the guest ram, would it achieve the
> > > same result?
> > > 
> > > Yes I was about to suggest the same thing: it's simple and makes use of 
> > > the
> > > existing infrastructure.  And you wouldn't need to care if the pages were
> > > unmapped by ballooning or anything else (alternative balloon
> > > implementations, not yet touched by the guest, etc.).  Besides, you 
> > > wouldn't
> > > need to synchronize with the guest.
> > > 
> > > Roman.
> > 
> > The unmapped/zero mapped pages can be detected by parsing 
> > /proc/self/pagemap,
> > but the free pages can't be detected by this. Imaging an application 
> > allocates a large amount
> > of memory , after using, it frees the memory, then live migration happens. 
> > All these free pages
> > will be process and sent to the destination, it's not optimal.
> 
> First, the likelihood of such a situation is marginal, there's no point
> optimizing for it specifically.
> 
> And second, even if that happens, you inflate the balloon right before
> the migration and the free memory will get umapped very quickly, so this
> case is covered nicely by the same technique that works for more
> realistic cases, too.

Although I wonder which is cheaper; that would be fairly expensive for
the guest wouldn't it? And you'd somehow have to kick the guest
before migration to do the ballooning - and how long would you wait
for it to finish?

Dave

> 
> Roman.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Re: [RFC qemu 0/4] A PV solution for live migration optimization

2016-03-03 Thread Dr. David Alan Gilbert
* Liang Li (liang.z...@intel.com) wrote:
> The current QEMU live migration implementation mark the all the
> guest's RAM pages as dirtied in the ram bulk stage, all these pages
> will be processed and that takes quit a lot of CPU cycles.
> 
> From guest's point of view, it doesn't care about the content in free
> pages. We can make use of this fact and skip processing the free
> pages in the ram bulk stage, it can save a lot CPU cycles and reduce
> the network traffic significantly while speed up the live migration
> process obviously.
> 
> This patch set is the QEMU side implementation.
> 
> The virtio-balloon is extended so that QEMU can get the free pages
> information from the guest through virtio.
> 
> After getting the free pages information (a bitmap), QEMU can use it
> to filter out the guest's free pages in the ram bulk stage. This make
> the live migration process much more efficient.

Hi,
  An interesting solution; I know a few different people have been looking
at how to speed up ballooned VM migration.

  I wonder if it would be possible to avoid the kernel changes by
parsing /proc/self/pagemap - if that can be used to detect unmapped/zero
mapped pages in the guest ram, would it achieve the same result?

> This RFC version doesn't take the post-copy and RDMA into
> consideration, maybe both of them can benefit from this PV solution
> by with some extra modifications.

For postcopy to be safe, you would still need to send a message to the
destination telling it that there were zero pages, otherwise the destination
can't tell if it's supposed to request the page from the source or
treat the page as zero.

Dave

> 
> Performance data
> 
> 
> Test environment:
> 
> CPU: Intel (R) Xeon(R) CPU ES-2699 v3 @ 2.30GHz
> Host RAM: 64GB
> Host Linux Kernel:  4.2.0   Host OS: CentOS 7.1
> Guest Linux Kernel:  4.5.rc6Guest OS: CentOS 6.6
> Network:  X540-AT2 with 10 Gigabit connection
> Guest RAM: 8GB
> 
> Case 1: Idle guest just boots:
> 
> | original  |pv
> ---
> total time(ms)  |1894   |   421
> 
> transferred ram(KB) |   398017  |  353242
> 
> 
> 
> Case 2: The guest has ever run some memory consuming workload, the
> workload is terminated just before live migration.
> 
>                     | original  |    pv
> ------------------------------------------------
> total time(ms)      |    7436   |   552
>
> transferred ram(KB) |  8146291  |  361375
> 
> 
> Liang Li (4):
>   pc: Add code to get the lowmem form PCMachineState
>   virtio-balloon: Add a new feature to balloon device
>   migration: not set migration bitmap in setup stage
>   migration: filter out guest's free pages in ram bulk stage
> 
>  balloon.c   | 30 -
>  hw/i386/pc.c|  5 ++
>  hw/i386/pc_piix.c   |  1 +
>  hw/i386/pc_q35.c|  1 +
>  hw/virtio/virtio-balloon.c  | 81 
> -
>  include/hw/i386/pc.h|  3 +-
>  include/hw/virtio/virtio-balloon.h  | 17 +-
>  include/standard-headers/linux/virtio_balloon.h |  1 +
>  include/sysemu/balloon.h    | 10 ++-
>  migration/ram.c | 64 +++
>  10 files changed, 195 insertions(+), 18 deletions(-)
> 
> -- 
> 1.8.3.1
> 
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK


Signed bit field; int have_hotplug_status_watch:1

2011-04-03 Thread Dr. David Alan Gilbert
Hi Ian,
   I've been going through some sparse scans of the kernel and
it threw up:

  CHECK   drivers/net/xen-netback/xenbus.c
drivers/net/xen-netback/xenbus.c:29:40: error: dubious one-bit signed bitfield

int have_hotplug_status_watch:1;

from your patch f942dc2552b8bfdee607be867b12a8971bb9cd85 

It does look like that should be an unsigned (given it's assigned
0 and 1)
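
The trap sparse is warning about, as a tiny illustration (the behaviour of
the signed field shown here is what typical compilers do; formally it is
implementation-defined):

/* A signed 1-bit field can only hold 0 and -1, so comparisons with 1 fail. */
#include <stdio.h>

struct flags {
    int signed_bit:1;            /* what the warning is about */
    unsigned int unsigned_bit:1; /* the suggested fix */
};

int main(void)
{
    struct flags f = { .signed_bit = 1, .unsigned_bit = 1 };

    printf("%d %u\n", f.signed_bit, f.unsigned_bit);   /* typically: -1 1 */
    printf("%d\n", f.signed_bit == 1);                 /* 0 - never matches */
    return 0;
}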

Dave

-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert|   Running GNU/Linux   | Happy  \ 
\ gro.gilbert @ treblig.org |   | In Hex /
 \ _|_ http://www.treblig.org   |___/


Re: [Xen-users] Re: [Xen-devel] State of Xen in upstream Linux

2008-07-31 Thread Dr. David Alan Gilbert
* Jeremy Fitzhardinge ([EMAIL PROTECTED]) wrote:
 Grant McWilliams wrote:

snip

  Does this mean in the future all Fedora kernels will be Xen kernels?
  Is this wise? If I try to run VirtualBox on a Xen kernel the machine
  will reboot. If the Vbox module is loaded at runtime it will reboot
  forever. Yes, I know it's a Vbox issue but what about KVM. Can we run
  KVM on a Xen kernel?
 
  Or am I reading this completely wrong?
 
 If you boot the kernel on bare hardware, the Xen parts of the kernel
 will basically be switched off, and play no part in runtime.  All
 hardware features and device drivers should be available as normal.  I
 run kvm on pvops/xen kernels all the time.

Does that include things like /proc/cpuinfo?  Current Dom0's seem
to lose some of the physical IDs information from there.

Dave
-- 
 -Open up your eyes, open up your mind, open up your code ---   
/ Dr. David Alan Gilbert| Running GNU/Linux on Alpha,68K| Happy  \ 
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC  HPPA | In Hex /
 \ _|_ http://www.treblig.org   |___/