kvm crash with virtiofs

2024-06-06 Thread Miklos Szeredi
Hi,

I get the below crash when running virtio-fs on fedora 39.

Note: weirdly, this also makes Chrome running on the host crash.

Eric Sandeen also reported some bad behavior of virtio-fs on fc39,
which might be related.

Versions:
kernel-6.8.4-200.fc39.x86_64
qemu-kvm-8.1.3-5.fc39.x86_64
virtiofsd-1.10.1-1.fc39.x86_64

Thanks,
Miklos

/usr/libexec/virtiofsd --socket-path=/tmp/vhostqemu --shared-dir /home &

qemu-system-x86_64 -enable-kvm -s -serial none -parallel none -kernel
/home/mszeredi/git/linux/arch/x86_64/boot/bzImage -drive
format=raw,file=/home/mszeredi/root_fs,index=0,if=virtio -drive
format=raw,file=/home/mszeredi/images/ubd1,index=1,if=virtio -chardev
stdio,id=virtiocon0,signal=off -device virtio-serial -device
virtconsole,chardev=virtiocon0 -cpu host -m 16G -smp 8 -object
memory-backend-file,id=mem,size=16G,mem-path=/dev/shm,share=on -numa
node,memdev=mem -net user -net nic,model=virtio-net-pci -fsdev
local,security_model=none,id=fsdev0,path=/home -device virtio-rng-pci
-chardev socket,id=char0,path=/tmp/vhostqemu -device
vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs -device
virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare -append "root=/dev/vda
console=hvc0 "
[...]
root@kvm:~# time md5sum /host/mszeredi/images/ubd1
error: kvm run failed Bad address
RAX= RBX=888100044240 RCX=
RDX=888420c59ff0
RSI=0020 RDI=888420c59ff8 RBP=
RSP=c900016d3898
R8 =888420c59da8 R9 =0040 R10=00036140
R11=0005
R12=888420c59ff0 R13=000d R14=ea0010831600
R15=888420c59da8
RIP=82168d80 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   00c0
CS =0010   00a09b00 DPL=0 CS64 [-RA]
SS =0018   00c09300 DPL=0 DS   [-WA]
DS =   00c0
FS = 7fb83cea8740  00c0
GS = 88842fd4  00c0
LDT=   00c0
TR =0040 fe12a000 4087 8b00 DPL=0 TSS64-busy
GDT= fe128000 007f
IDT= fe00 0fff
CR0=80050033 CR2=7f2d3bd9b0f0 CR3=0001036ee005 CR4=00770ef0
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=0d01
Code=90 90 90 90 48 c7 07 00 00 00 00 48 89 fa 48 8d 7f 08 31 c0 <48>
c7 87 30 02 00 00 00 00 00 00 48 89 d1 48 83 e7 f8 48 29 f9 81 c1 40
02 00 00 c1 e9 03




Re: [PATCH] virtiofsd: Don't allow file creation with FUSE_OPEN

2021-06-18 Thread Miklos Szeredi
On Fri, 18 Jun 2021 at 11:21, Greg Kurz  wrote:
>
> On Fri, 18 Jun 2021 10:58:33 +0200
> Miklos Szeredi  wrote:
>
> > On Thu, 17 Jun 2021 at 16:15, Greg Kurz  wrote:
> > >
> > > A well-behaved FUSE client uses FUSE_CREATE to create files. It isn't
> > > supposed to pass O_CREAT along a FUSE_OPEN request, as documented in
> > > the "fuse_lowlevel.h" header:
> > >
> > > /**
> > >  * Open a file
> > >  *
> > >  * Open flags are available in fi->flags. The following rules
> > >  * apply.
> > >  *
> > >  *  - Creation (O_CREAT, O_EXCL, O_NOCTTY) flags will be
> > >  *    filtered out / handled by the kernel.
> > >
> > > But if it does anyway, virtiofsd crashes with:
> > >
> > > *** invalid openat64 call: O_CREAT or O_TMPFILE without mode ***: terminated
> > >
> > > This is because virtiofsd ends up passing this flag to openat() without
> > > passing a mode_t 4th argument which is mandatory with O_CREAT, and glibc
> > > aborts.
> > >
> > > The offending path is:
> > >
> > > lo_open()
> > > lo_do_open()
> > > lo_inode_open()
> > >
> > > Other callers of lo_inode_open() only pass O_RDWR and lo_create()
> > > passes a valid fd to lo_do_open() which thus doesn't even call
> > > lo_inode_open() in this case.
> > >
> > > Specifying O_CREAT with FUSE_OPEN is a protocol violation. Check this
> > > in lo_open() and return an error to the client: EINVAL, since this is
> > > already what glibc returns for other illegal flag combinations.
> > >
> > > The FUSE filesystem doesn't currently support O_TMPFILE, but the very
> > > same would happen if O_TMPFILE was passed in a FUSE_OPEN request. Check
> > > that as well.
> > >
> > > Signed-off-by: Greg Kurz 
> > > ---
> > >  tools/virtiofsd/passthrough_ll.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> > > index 49c21fd85570..14f62133131c 100644
> > > --- a/tools/virtiofsd/passthrough_ll.c
> > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > @@ -2145,6 +2145,12 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
> > >          return;
> > >      }
> > >
> > > +    /* File creation is handled by lo_create() */
> > > +    if (fi->flags & (O_CREAT | O_TMPFILE)) {
> > > +        fuse_reply_err(req, EINVAL);
> > > +        return;
> > > +    }
> > > +
> >
> > Okay.  The question comes to mind whether the check should be even
> > stricter, possibly allowing just a specific set of flags and erroring
> > out on everything else.
> >
>
> I've focused on O_CREAT and O_TMPFILE because they cause an explicit abort()
> in glibc when the code is compiled with -D_FORTIFY_SOURCE=2, but yes,
> maybe it could make sense to check more of them.
>
> > AFAICS the Linux kernel should never pass anything to FUSE_OPEN outside
> > of this set:
> >
> > O_RDONLY
> > O_WRONLY
> > O_RDWR
> > O_APPEND
> > O_NDELAY
> > O_NONBLOCK
> > __O_SYNC
> > O_DSYNC
> > FASYNC
> > O_DIRECT
> > O_LARGEFILE
> > O_NOFOLLOW
> > O_NOATIME
> >
> > A separate question is whether virtiofsd should also be silently
> > ignoring some of the above flags.
> >
>
> Dunno off the top of my head...

Let's discuss this separately as this is mostly unrelated.  Added an
item to the virtiofs-todo etherpad.

>
> BTW, as suggested by Dave, I've submitted a similar patch to upstream
> libfuse:
>
> https://github.com/libfuse/libfuse/pull/615
>
> And I got interesting suggestions:
> 1) do it in core FUSE, i.e. fuse_lowlevel.c, since this isn't specific to
>    passthrough_ll AFAICT
> 2) print out an error
> 3) exit
>
> 1 makes a lot of sense.  I guess 2 is fine as long as it cannot be used
> by a buggy guest to flood some log file on the host.  3 doesn't seem to
> be an acceptable solution, and it wouldn't change the outcome much
> compared to what we have now.
>
> So I will go for 1 and 2.

Okay, good.

Thanks,
Miklos



Re: [PATCH] virtiofsd: Don't allow file creation with FUSE_OPEN

2021-06-18 Thread Miklos Szeredi
On Thu, 17 Jun 2021 at 16:15, Greg Kurz  wrote:
>
> A well-behaved FUSE client uses FUSE_CREATE to create files. It isn't
> supposed to pass O_CREAT along a FUSE_OPEN request, as documented in
> the "fuse_lowlevel.h" header:
>
> /**
>  * Open a file
>  *
>  * Open flags are available in fi->flags. The following rules
>  * apply.
>  *
>  *  - Creation (O_CREAT, O_EXCL, O_NOCTTY) flags will be
>  *    filtered out / handled by the kernel.
>
> But if it does anyway, virtiofsd crashes with:
>
> *** invalid openat64 call: O_CREAT or O_TMPFILE without mode ***: terminated
>
> This is because virtiofsd ends up passing this flag to openat() without
> passing a mode_t 4th argument which is mandatory with O_CREAT, and glibc
> aborts.
>
> The offending path is:
>
> lo_open()
> lo_do_open()
> lo_inode_open()
>
> Other callers of lo_inode_open() only pass O_RDWR and lo_create()
> passes a valid fd to lo_do_open() which thus doesn't even call
> lo_inode_open() in this case.
>
> Specifying O_CREAT with FUSE_OPEN is a protocol violation. Check this
> in lo_open() and return an error to the client: EINVAL, since this is
> already what glibc returns for other illegal flag combinations.
>
> The FUSE filesystem doesn't currently support O_TMPFILE, but the very
> same would happen if O_TMPFILE was passed in a FUSE_OPEN request. Check
> that as well.
>
> Signed-off-by: Greg Kurz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> index 49c21fd85570..14f62133131c 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2145,6 +2145,12 @@ static void lo_open(fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
>          return;
>      }
>
> +    /* File creation is handled by lo_create() */
> +    if (fi->flags & (O_CREAT | O_TMPFILE)) {
> +        fuse_reply_err(req, EINVAL);
> +        return;
> +    }
> +

Okay.  The question comes to mind whether the check should be even
stricter, possibly allowing just a specific set of flags and erroring
out on everything else.

AFAICS the Linux kernel should never pass anything to FUSE_OPEN outside of this set:

O_RDONLY
O_WRONLY
O_RDWR
O_APPEND
O_NDELAY
O_NONBLOCK
__O_SYNC
O_DSYNC
FASYNC
O_DIRECT
O_LARGEFILE
O_NOFOLLOW
O_NOATIME

A separate question is whether virtiofsd should also be silently
ignoring some of the above flags.

Thanks,
Miklos



>      err = lo_do_open(lo, inode, -1, fi);
>      lo_inode_put(lo, &inode);
>  if (err) {
> --
> 2.31.1
>



Re: [PATCH 1/3] virtiofsd: Find original inode ID of mount points

2021-06-02 Thread Miklos Szeredi
On Wed, 2 Jun 2021 at 20:20, Vivek Goyal  wrote:
>
> On Wed, May 12, 2021 at 02:55:42PM +0200, Max Reitz wrote:
> > Mount point directories represent two inodes: On one hand, they are a
> > normal directory on their parent filesystem.  On the other, they are the
> > root node of the filesystem mounted there.  Thus, they have two inode
> > IDs.
> >
> > Right now, we only report the latter inode ID (i.e. the inode ID of the
> > mounted filesystem's root node).  This is fine once the guest has
> > auto-mounted a submount there (so this inode ID goes with a device ID
> > that is distinct from the parent filesystem), but before the auto-mount,
> > they have the device ID of the parent and the inode ID for the submount.
> > This is problematic because this is likely exactly the same
> > st_dev/st_ino combination as the parent filesystem's root node.  This
> > leads to problems for example with `find`, which will thus complain
> > about a filesystem loop if it has visited the parent filesystem's root
> > node before, and then refuse to descend into the submount.
> >
> > There is a way to find the mount directory's original inode ID, and that
> > is to readdir(3) the parent directory, look for the mount directory, and
> > read the dirent.d_ino field.  Using this, we can let lookup and
> > readdirplus return that original inode ID, which the guest will thus
> > show until the submount is auto-mounted.  (Then, it will invoke getattr
> > and that stat(2) call will return the inode ID for the submount.)
>
> [ CC miklos ]
>
> Hi Max,
>
> Though we discussed this in chat room, I am still responding to this
> email with the concern I have, so that there is a record of it.
>
> So with this patch, for FUSE_LOOKUP we always return the submount's parent
> inode ID, and with the GETATTR request we return the actual inode ID of
> the submount. That bothers me a bit, as we are assuming that there is
> always going to be a GETATTR request after FUSE_LOOKUP. FUSE_LOOKUP itself
> has attrs returned with it, and it might happen that after FUSE_LOOKUP,
> FUSE_GETATTR is not called at all, because FUSE_LOOKUP itself got the
> latest attrs with a certain timeout.
>
> For example, if I call stat on a normal file (not a submount), I see that
> after FUSE_LOOKUP, no FUSE_GETATTR request goes out, and
> fuse_update_get_attr() uses the locally cached inode attrs.
>
> But the same thing does not seem to be happening in the case of submounts.
> Once the submount is created in the guest, I see that we never seem to call
> ->revalidate() on the newly created dentry of the submount root. I am not
> sure why. And that means we don't do FUSE_LOOKUP, and that means
> FUSE_GETATTR always gets called.

Yes, this sounds normal.

The lookup sequence for "/mnt/virtiofs/dir1/submount/file" will be:

LOOKUP(1, "dir1")
create inode I1 in sb1
create dentry "dir1" with inode I1 in sb1
LOOKUP(I1, "submount")
create inode I2 in sb1, set S_AUTOMOUNT
create dentry "submount" with inode I2 in sb1
automount triggered:
create sb2
create inode I2 in sb2
create dentry "/" with inode I2 in sb2
GETATTR(I2)
 fill inode I2
LOOKUP(I2, "file")
 create inode I3
 create dentry "file" with inode I3 in sb2

Notice how there's two inodes numbered I2, but in separate sb's.
However the server has only the notion of a single I2 inode which is
the root of the submount, since the mountpoint is not visible (except
for d_ino in readdir()).

Now AFAICS what this patch does is set the inode number in the
attributes returned by LOOKUP(I1, "submount") to the covered inode, so
that AT_NO_AUTOMOUNT stat will return the correct value.   While this
seems to work, it's not a fundamental fix to the problem, since the
attributes on sb1/I2 will time out and the next stat(...,
AT_NO_AUTOMOUNT) will trigger a GETATTR request, which will return the
inode number of the submount root.

Solving this properly would mean that the server would have to
generate separate nodeid's for the mountpoint and the submount root
and the protocol would have to be extended so that this information
could be transferred to the client.

Not sure whether this is worth it or not.

Thanks,
Miklos



Re: [Virtio-fs] [for-6.1 v3 3/3] virtiofsd: Add support for FUSE_SYNCFS request

2021-05-11 Thread Miklos Szeredi
On Tue, May 11, 2021 at 4:49 PM Vivek Goyal  wrote:
>
> On Tue, May 11, 2021 at 08:54:09AM -0400, Vivek Goyal wrote:
> > On Tue, May 11, 2021 at 02:31:14PM +0200, Miklos Szeredi wrote:
> > > On Mon, May 10, 2021 at 5:55 PM Greg Kurz  wrote:
> > > >
> > > > Honor the expected behavior of syncfs() to synchronously flush all data
> > > > and metadata on linux systems. Simply loop on all known submounts and
> > > > call syncfs() on them.
> > >
> > > Why not pass the submount's root to the server, so it can do just one
> > > targeted syncfs?
> > >
> > > E.g. something like this in fuse_sync_fs():
> > >
> > > args.nodeid = get_node_id(sb->s_root->d_inode);
> >
> > Hi Miklos,
> >
> > I think the current proposal was due to a lack of full understanding on my
> > part. I was assuming we have one super block in the client, and it looks
> > like that's not the case. For every submount, we will have another superblock known
> > to vfs, IIUC. That means when sync() happens, we will receive ->syncfs()
> > for each of those super blocks. And that means file server does not
> > have to keep track of submounts explicitly and it will either receive
> > a single targeted SYNCFS (for the case of syncfs(fd)) or receive
> > multiple SYNCFS calls (one for each submount when sync() is called).
>
> Tried sync() with submounts enabled and we are seeing a SYNCFS call
> only for top level super block and not for submounts.
>
> Greg noticed that it is probably due to the fact that iterate_supers()
> skips super blocks which don't have the SB_BORN flag set.
>
> Only vfs_get_tree() seems to set SB_BORN and for our submounts we
> are not calling vfs_get_tree(), hence SB_BORN is not set. NFS seems
> to call vfs_get_tree() and hence SB_BORN must be set for submounts.
>
> Maybe we need to modify virtio_fs_get_tree() so that it can deal with
> mounts as well as submounts, and then fuse_dentry_automount() should
> probably call vfs_get_tree(); that should set SB_BORN and hopefully
> sync() will work with it. Greg is planning to give it a try.
>
> Does that sound reasonable?

Just setting SB_BORN sounds much simpler.  What's the disadvantage?

Thanks,
Miklos




Re: [Virtio-fs] [for-6.1 v3 3/3] virtiofsd: Add support for FUSE_SYNCFS request

2021-05-11 Thread Miklos Szeredi
On Mon, May 10, 2021 at 5:55 PM Greg Kurz  wrote:
>
> Honor the expected behavior of syncfs() to synchronously flush all data
> and metadata on linux systems. Simply loop on all known submounts and
> call syncfs() on them.

Why not pass the submount's root to the server, so it can do just one
targeted syncfs?

E.g. something like this in fuse_sync_fs():

args.nodeid = get_node_id(sb->s_root->d_inode);

Thanks,
Miklos




Re: [PATCH v2 0/3] virtiofsd: Add options to enable/disable posix acl

2021-02-19 Thread Miklos Szeredi
On Fri, Feb 19, 2021 at 3:34 PM Vivek Goyal  wrote:
>
> On Fri, Feb 19, 2021 at 11:50:54AM +, Luis Henriques wrote:
> > Vivek Goyal  writes:
> >
> > > Hi,
> > >
> > > This is V2 of the patches. Changes since v1 are.
> > >
> > > - Rebased on top of latest master.
> > > - Took care of Miklos's comments to block acl xattrs if user
> > >   explicitly disabled posix acl.
> > >
> > > Luis Henriques reported that fstest generic/099 fails with virtiofs.
> > > A little debugging showed that we don't enable acl support. So this
> > > patch series provides option to enable/disable posix acl support. By
> > > default it is disabled.
> > >
> > > I have run blogbench and pjdfstests with posix acl enabled and
> > > things work fine.
> > >
> > > Luis, can you please apply these patches, and run virtiofsd with
> > > "-o posix_acl" and see if it fixes the failure you are seeing. I
> > > ran the steps you provided manually and it fixes the issue for
> > > me.
> >
> > Sorry for the delay.  I've finally tested these patches and they indeed
> > fix the problem I reported.  My only question about this fix is why is
> > this option not enabled by default, since this is the documented behavior
> > in acl(5) and umask(2)?  In fact, why is this an option at all?
>
> You mean why to not enable acl by default?
>
> I am concerned about the performance drop this can lead to, because of the
> extra GETXATTR(system.posix_acl_*) messages which will be triggered if acls
> are enabled. And not all users might require these. That's why I preferred
> to not enable acl by default. Those who need it can enable it explicitly.
>
> Another example is xattr support. Due to performance concerns, we don't
> enable xattrs by default either.

Actually, generic xattrs are much worse, since there's no caching for
them currently, as opposed to posix acls, which are cached both
positively and negatively.

If we enable ACL by default in case xattrs are enabled, we should be
safe, I think.  Having an option to disable acls still makes sense,
but it's an optional plus.

Thanks,
Miklos



Re: [PATCH 1/3] virtiofsd: Add an option to enable/disable posix acls

2021-02-17 Thread Miklos Szeredi
On Wed, Feb 17, 2021 at 4:07 PM Vivek Goyal  wrote:
>
> On Wed, Feb 17, 2021 at 09:53:04AM +0100, Miklos Szeredi wrote:
> > On Wed, Feb 17, 2021 at 12:36 AM Vivek Goyal  wrote:
> > >
> > > fuse has an option FUSE_POSIX_ACL which needs to be opted in by fuse
> > > server to enable posix acls.
> > >
> > > Add virtiofsd option "-o posix_acl/no_posix_acl" to let users 
> > > enable/disable
> > > posix acl support. By default it is disabled as of now.
> >
> > If I read the code correctly, then no_posix_acl will still result in
> > system.posix_acl_* xattr ops being passed through to virtiofsd, which
> > will forward them to the underlying fs, resulting in posix acls
> > appearing to work, but doing so incorrectly (i.e. no change from
> > previous behavior).
>
> Yes, and this is confusing me a lot. The fuse server has not indicated
> support for POSIX_ACL, yet users can still get and set ACLs. fuse_xattr_get()
> and fuse_xattr_set() must be kicking in.
>
> I do see that we have fuse_no_acl_xattr_handlers and that should
> be able to block setting/getting acls if acl support is not there
> but we register it only if we are not mounted in init_user_ns.
>
> if (sb->s_user_ns != &init_user_ns)
>     sb->s_xattr = fuse_no_acl_xattr_handlers;
>
> So question is, should fuse client be fixed as well to block setting
> and getting acls if fuse server does not support ACL? Or we now need
> to keep it around for backward compatibility.

Yes, this is a compatibility thing.  User namespaces don't work without
actual ACL ops, so this was disabled in that case, with no backward
compatibility worries.  We should have disabled this for virtiofs from
the start, but at this point we are again stuck with a backward
compatibility issue.

Alternatively, make posix_acl the default; then fixing the bad
behavior is unlikely to cause a regression.

>
> > Possibly better would be to have three different
> > modes of operation:
> >
> > 1) no option: default fall back to broken acl support for backward
> > compat (this could be removed in the future)
>
> What about FUSE_DONT_MASK in this mode. ACLs are not enabled but
> user can get/set these. Should that mean we still honor default
> acl and not apply umask?
>
> Probably I should opt for FUSE_DONT_MASK only if posix_acl support is
> enabled. Given that this does not work even today (at least for virtiofs),
> it is not a backward compatibility issue. And it's confusing anyway.
>
> > 2) no_posix_acl: really disable acl support
>
> That is block getting and setting system.posix_acl xattr. Will do that.
> I think we will have to block it even if somebody has remapped xattrs
> in virtiofsd.

Okay.

Thanks,
Miklos



Re: [PATCH 1/3] virtiofsd: Add an option to enable/disable posix acls

2021-02-17 Thread Miklos Szeredi
On Wed, Feb 17, 2021 at 12:36 AM Vivek Goyal  wrote:
>
> fuse has an option FUSE_POSIX_ACL which needs to be opted in by fuse
> server to enable posix acls.
>
> Add virtiofsd option "-o posix_acl/no_posix_acl" to let users enable/disable
> posix acl support. By default it is disabled as of now.

If I read the code correctly, then no_posix_acl will still result in
system.posix_acl_* xattr ops being passed through to virtiofsd, which
will forward them to the underlying fs, resulting in posix acls
appearing to work, but doing so incorrectly (i.e. no change from
previous behavior).   Possibly better would be to have three different
modes of operation:

1) no option: default fall back to broken acl support for backward
compat (this could be removed in the future)
2) no_posix_acl: really disable acl support
3) posix_acl: enable proper acl support

Thanks,
Miklos



Re: [Virtio-fs] [PATCH v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-28 Thread Miklos Szeredi
On Thu, Jan 28, 2021 at 1:15 PM Greg Kurz  wrote:
>
> On Wed, 27 Jan 2021 16:52:56 +0100
> Miklos Szeredi  wrote:
>
> > On Wed, Jan 27, 2021 at 4:47 PM Miklos Szeredi  wrote:
> > >
> > > On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz  wrote:
> > > >
> > > > On Wed, 27 Jan 2021 16:22:49 +0100
> > > > Miklos Szeredi  wrote:
> > > >
> > > > > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz  wrote:
> > > > > >
> > > > > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > > > > Miklos Szeredi  wrote:
> > > > > > > The semantics of O_CREAT are that it can fail neither because the
> > > > > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > > > > exported tree is not modified outside of a single guest, because 
> > > > > > > of
> > > > > > > locking provided by the guest kernel.
> > > > > > >
> > > > > >
> > > > > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> > > > >
> > > > > Let me make my statement more precise:
> > > > >
> > > > > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > > > > the operation.
> > > > >
> > > >
> > > > True, but I still don't see what guarantees to guest userspace that
> > > > the parent directory doesn't go away... I must have missed something.
> > > > Please elaborate.
> > >
> > > Generally there's no guarantee, however there can be certain
> > > situations where the caller can indeed rely on the existence of the
> > > parent (e.g. /tmp).
> >
> > Example from the virtiofs repo:
> >
> > https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L315
> > https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L382
> >
> > In that case breaking O_CREAT would be harmless, since no two
> > instances are allowed anyway, so it would just give a confusing error.
> > But it is breakage and it probably wouldn't be hard to find much worse
> > breakage in real life applications.
> >
>
> Ok, I get your point: a guest userspace application can't expect
> to hit ENOENT when doing open("/some_dir/foo", O_CREAT) even if
> someone else is doing unlink("/some_dir/foo"), which is a different
> case from somebody doing rmdir("/some_dir").
>
> But we still have a TOCTOU: the open(O_CREAT|O_EXCL) acts as
> the check, the open(O_PATH) as the use, and retry+timeout can't fix it
> reliably.

Right.

> A possible fix for the case where the race comes from the
> client itself would be to serialize FUSE requests that
> create/remove paths in the same directory. I don't know
> the code well enough yet to assess whether it's doable, though.
>
> Then this would leave the case where something else on
> the host is racing with virtiofsd. IMHO this belongs to
> the broader family of "bad things the host can do
> behind our back". This requires a bigger hammer than
> adding band-aids here and there. So in the
> scope of this patch, I don't think we should retry
> and time out, but just return whatever errno
> makes sense.

I never suggested a timeout, that would indeed be nonsense.

Just do a simple retry loop with a counter.  I'd set the counter to a
small number (5 or whatever), so that basically any accidental races
are cared for, and the only case that would trigger the EIO is if the
file was constantly removed and recreated (and even in that case it
would be pretty difficult to trigger).  This would add only minimal
complexity or overhead.

The proper solution might be adding O_REGULAR, and it actually would
be useful for other O_CREAT users, since it's probably what they want
anyway (but existing semantics can't be changed).

Thanks,
Miklos




Re: [Virtio-fs] [PATCH v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Wed, Jan 27, 2021 at 4:47 PM Miklos Szeredi  wrote:
>
> On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz  wrote:
> >
> > On Wed, 27 Jan 2021 16:22:49 +0100
> > Miklos Szeredi  wrote:
> >
> > > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz  wrote:
> > > >
> > > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > > Miklos Szeredi  wrote:
> > > > > The semantics of O_CREAT are that it can fail neither because the
> > > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > > exported tree is not modified outside of a single guest, because of
> > > > > locking provided by the guest kernel.
> > > > >
> > > >
> > > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> > >
> > > Let me make my statement more precise:
> > >
> > > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > > the operation.
> > >
> >
> > True, but I still don't see what guarantees to guest userspace that
> > the parent directory doesn't go away... I must have missed something.
> > Please elaborate.
>
> Generally there's no guarantee, however there can be certain
> situations where the caller can indeed rely on the existence of the
> parent (e.g. /tmp).

Example from the virtiofs repo:

https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L315
https://gitlab.com/virtio-fs/ireg/-/blob/master/ireg.c#L382

In that case breaking O_CREAT would be harmless, since no two
instances are allowed anyway, so it would just give a confusing error.
But it is breakage and it probably wouldn't be hard to find much worse
breakage in real life applications.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Wed, Jan 27, 2021 at 4:35 PM Greg Kurz  wrote:
>
> On Wed, 27 Jan 2021 16:22:49 +0100
> Miklos Szeredi  wrote:
>
> > On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz  wrote:
> > >
> > > On Wed, 27 Jan 2021 15:09:50 +0100
> > > Miklos Szeredi  wrote:
> > > > The semantics of O_CREAT are that it can fail neither because the
> > > > file exists nor because it doesn't.  This doesn't matter if the
> > > > exported tree is not modified outside of a single guest, because of
> > > > locking provided by the guest kernel.
> > > >
> > >
> > > Wrong. O_CREAT can legitimately fail with ENOENT if one element
> >
> > Let me make my statement more precise:
> >
> > O_CREAT cannot fail with ENOENT if parent directory exists throughout
> > the operation.
> >
>
> True, but I still don't see what guarantees to guest userspace that
> the parent directory doesn't go away... I must have missed something.
> Please elaborate.

Generally there's no guarantee, however there can be certain
situations where the caller can indeed rely on the existence of the
parent (e.g. /tmp).

Thanks,
Miklos




Re: [Virtio-fs] [PATCH v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Wed, Jan 27, 2021 at 4:09 PM Greg Kurz  wrote:
>
> On Wed, 27 Jan 2021 15:09:50 +0100
> Miklos Szeredi  wrote:
> > The semantics of O_CREAT are that it can fail neither because the
> > file exists nor because it doesn't.  This doesn't matter if the
> > exported tree is not modified outside of a single guest, because of
> > locking provided by the guest kernel.
> >
>
> Wrong. O_CREAT can legitimately fail with ENOENT if one element

Let me make my statement more precise:

O_CREAT cannot fail with ENOENT if parent directory exists throughout
the operation.

I'm sure this property is used all over the place in userspace code,
and hence should be supported in strict coherence (cache=none) mode.

For relaxed coherence (cache=auto) I'm not quite sure.  NFS is usually
the reference; we'd need to look into what guarantees it provides wrt
O_CREAT and a racing remote unlink.

Thanks,
Miklos




Re: [PATCH v3] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Wed, Jan 27, 2021 at 3:14 PM Stefan Hajnoczi  wrote:
>
> On Wed, Jan 27, 2021 at 02:01:54PM +0100, Miklos Szeredi wrote:

> > The problem here is there can also be a race between the open and the
> > subsequent lo_do_lookup().
> >
> > At this point it's probably enough to verify that fuse_entry_param
> > refers to the same object as the fh (using fstat and comparing st_dev
> > and st_ino).
>
> Can you describe the race in detail? FUSE_CREATE vs FUSE_OPEN?
> FUSE_CREATE vs FUSE_CREATE?

A race between FUSE_CREATE and external modification:

VIRTIOFSD: lo_create() {
VIRTIOFSD: fd = open(foo, O_CREAT | O_EXCL)
EXTERNAL:  unlink(foo)
EXTERNAL:  open(foo, O_CREAT)
VIRTIOFSD: lo_do_lookup() {
VIRTIOFSD: newfd = open(foo, O_PATH | O_NOFOLLOW)

Nothing serious will happen, but there will be a discrepancy between
the open file and the inode that it references.  I.e.  the following
in the client will yield weird results:

open(foo, O_CREAT) -> fd
sprintf(procname, "/proc/self/fd/%i", fd);
open(procname, O_RDONLY) -> fd2
write(fd, buf, bufsize)
read(fd2, buf, bufsize)

This is probably not a security issue, more of a quality of
implementation issue.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Wed, Jan 27, 2021 at 2:49 PM Greg Kurz  wrote:
>
> On Wed, 27 Jan 2021 11:34:52 +0100
> Miklos Szeredi  wrote:

> > Another solution specifically for O_CREAT without O_EXCL would be to
> > turn it into an exclusive create.
>
> Would this added O_EXCL then appear on the client side, e.g. to
> guest userspace doing fcntl(F_GETFL) ?

No.  Guest kernel keeps track of open flags.

> > If that fails with EEXIST then try
> > the normal open path (open with O_PATH, fstat, open proc symlink).  If
>
> open(O_PATH | O_NOFOLLOW) + fstatat(AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW)
> would indeed allow filtering out anything that isn't a directory and
> safely opening the proc symlink.
>
> > that fails with ENOENT, then retry the whole thing a certain number of
>
> Indeed someone could have unlinked the file in the meantime, in which
> case the open(O_PATH | O_NOFOLLOW) would fail, but if it succeeds then
> we cannot hit ENOENT anymore AFAICT.

Right.

> > times.  If it still fails then somebody is definitely messing with us
> > and we can fail the request with EIO.
> >
>
> Not sure what the retry+timeout is trying to mitigate here... why not
> return EIO right away?

The semantics of O_CREAT are that it can fail neither because the
file exists nor because it doesn't.  This doesn't matter if the
exported tree is not modified outside of a single guest, because of
locking provided by the guest kernel.

However if we want to support shared access to a tree then O_CREAT
semantics should work even in the face of races due to external
modification of the tree.  I.e. following race:

virtiofsd: open(foo, O_CREAT | O_EXCL) -> EEXIST
other task: unlink(foo)
virtiofsd: open(foo, O_PATH | O_NOFOLLOW) -> ENOENT

To properly support the above the O_CREAT | O_EXCL open would need to
be retried.

Thanks,
Miklos




Re: [PATCH v3] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Wed, Jan 27, 2021 at 12:21 PM Stefan Hajnoczi  wrote:
  }
> @@ -1654,9 +1677,11 @@ static void update_open_flags(int writeback, int 
> allow_direct_io,
>  static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
>mode_t mode, struct fuse_file_info *fi)
>  {
> +int open_flags = (fi->flags | O_CREAT) & ~O_NOFOLLOW;
>  int fd;
>  struct lo_data *lo = lo_data(req);
>  struct lo_inode *parent_inode;
> +struct lo_inode *existing_inode = NULL;
>  struct fuse_entry_param e;
>  int err;
>  struct lo_cred old = {};
> @@ -1682,11 +1707,23 @@ static void lo_create(fuse_req_t req, fuse_ino_t 
> parent, const char *name,
>
>  update_open_flags(lo->writeback, lo->allow_direct_io, fi);
>
> -fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
> -mode);
> +/* First, try to create a new file but don't open existing files */
> +fd = openat(parent_inode->fd, name, open_flags | O_EXCL, mode);
>  err = fd == -1 ? errno : 0;
> +
>  lo_restore_cred();
>
> +/* Second, open existing files if O_EXCL was not specified */
> +if (err == EEXIST && !(fi->flags & O_EXCL)) {
> +existing_inode = lookup_name(req, parent, name);
> +if (existing_inode) {
> +fd = lo_inode_open(lo, existing_inode, open_flags);
> +if (fd < 0) {
> +err = -fd;
> +}
> +}
> +}
> +
>  if (!err) {
>  ssize_t fh;

It's more of a mess than I thought.

The problem here is there can also be a race between the open and the
subsequent lo_do_lookup().

At this point it's probably enough to verify that fuse_entry_param
refers to the same object as the fh (using fstat and comparing st_dev
and st_ino).

Also, an O_CREAT open is not supposed to return ENOENT, so a failure to
open without O_CREAT (a race between the O_CREAT open and the plain
open) should at least translate the error to ESTALE or EIO.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Wed, Jan 27, 2021 at 11:20 AM Greg Kurz  wrote:
>
> On Wed, 27 Jan 2021 10:25:28 +0100
> Miklos Szeredi  wrote:
>
> > On Tue, Jan 26, 2021 at 6:18 PM Greg Kurz  wrote:
> > >
> > > On Tue, 26 Jan 2021 10:35:02 +
> > > Stefan Hajnoczi  wrote:
> >
> > > The patch looks pretty good to me. It just seems to be missing a change in
> > > lo_create():
> > >
> > > fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & 
> > > ~O_NOFOLLOW,
> > > mode);
> > >
> > > A malicious guest could have created anything called ${name} in this 
> > > directory
> > > before calling FUSE_CREATE and we'll open it blindly, or I'm missing 
> > > something ?
> >
> > Right, this seems like an omission.
> >
> > Also the "& ~O_NOFOLLOW" looks like a copy-paste bug, since unlike
> > lo_open(), lo_create() is not opening a proc symlink.
> >
> > So that should be replaced with "| O_NOFOLLOW"
> >
>
>
> Yes, I've realized that later on. We should definitely enforce O_NOFOLLOW
> to avoid symlink escapes.
>
> Then comes the case of special files... A well-known case is the FIFO that
> causes openat() to block as described in my response. FWIW, we addressed
> this one in 9P by adding O_NONBLOCK and fixing the flags to the client
> expectation with fcntl(F_SETFL). But this is just a protection against
> being blocked. Blindly opening a special file can lead to any kind of
> trouble you can think of... so it really looks like the only sane way
> to be safe from such an attack is to forbid openat() of special files at
> the filesystem level.

Another solution specifically for O_CREAT without O_EXCL would be to
turn it into an exclusive create.   If that fails with EEXIST then try
the normal open path (open with O_PATH, fstat, open proc symlink).  If
that fails with ENOENT, then retry the whole thing a certain number of
times.  If it still fails then somebody is definitely messing with us
and we can fail the request with EIO.

Rather ugly, but I can't think of anything better.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH v2] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-27 Thread Miklos Szeredi
On Tue, Jan 26, 2021 at 6:18 PM Greg Kurz  wrote:
>
> On Tue, 26 Jan 2021 10:35:02 +
> Stefan Hajnoczi  wrote:

> The patch looks pretty good to me. It just seems to be missing a change in
> lo_create():
>
> fd = openat(parent_inode->fd, name, (fi->flags | O_CREAT) & ~O_NOFOLLOW,
> mode);
>
> A malicious guest could have created anything called ${name} in this directory
> before calling FUSE_CREATE and we'll open it blindly, or I'm missing 
> something ?

Right, this seems like an omission.

Also the "& ~O_NOFOLLOW" looks like a copy-paste bug, since unlike
lo_open(), lo_create() is not opening a proc symlink.

So that should be replaced with "| O_NOFOLLOW"

Thanks,
Miklos




Re: [PATCH] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-26 Thread Miklos Szeredi
On Tue, Jan 26, 2021 at 11:18 AM Stefan Hajnoczi  wrote:
>
> On Mon, Jan 25, 2021 at 05:12:23PM +0100, Miklos Szeredi wrote:
> > On Thu, Jan 21, 2021 at 3:44 PM Stefan Hajnoczi  wrote:
> >
> > > This patch adds the missing checks to virtiofsd. This is a short-term
> > > solution because it does not prevent a compromised virtiofsd process
> > > from opening device nodes on the host.
> >
> > I think the proper solution is adding support to the host in order to
> > restrict opens on filesystems that virtiofsd has access to.
> >
> > My idea was to add a "force_nodev" mount option that cannot be
> > disabled and will make propagated mounts  also be marked
> > "force_nodev,nodev".
>
> Interesting idea! Mount options that are relevant:
>  * noexec
>  * nosuid
>  * nodev
>  * nosymfollow
>
> Do you have time to work on the force_* mount options?

Not at the moment, but first we need to probe Al to see if this idea sticks...

> > A possibly simpler solution is to extend seccomp to restrict the
> > process itself from being able to open special files.  Not sure if
> > that's within the scope of seccomp though.
>
> I don't think seccomp can provide that restriction since it's unrelated
> to the syscall or its arguments.

How about SELinux, then?

Thanks,
Miklos




Re: [PATCH] virtiofsd: prevent opening of special files (CVE-2020-35517)

2021-01-25 Thread Miklos Szeredi
On Thu, Jan 21, 2021 at 3:44 PM Stefan Hajnoczi  wrote:

> This patch adds the missing checks to virtiofsd. This is a short-term
> solution because it does not prevent a compromised virtiofsd process
> from opening device nodes on the host.

I think the proper solution is adding support to the host in order to
restrict opens on filesystems that virtiofsd has access to.

My idea was to add a "force_nodev" mount option that cannot be
disabled and will make propagated mounts also be marked
"force_nodev,nodev".

A possibly simpler solution is to extend seccomp to restrict the
process itself from being able to open special files.  Not sure if
that's within the scope of seccomp though.

Thanks,
Miklos




Re: Some performance numbers for virtiofs, DAX and virtio-9p

2021-01-05 Thread Miklos Szeredi
On Fri, Dec 11, 2020 at 8:25 PM Vivek Goyal  wrote:
>
> On Fri, Dec 11, 2020 at 06:29:56PM +, Dr. David Alan Gilbert wrote:
>
> [..]
> > > >
> > > > Could we measure at what point does a large window size actually make
> > > > performance worse?
> > >
> > > Will do. Will run tests with varying window sizes (small to large)
> > > and see how does it impact performance for same workload with
> > > same guest memory.
> >
> > I wonder how realistic it is though;  it makes some sense if you have a
> > scenario like a fairly small root filesystem - something tractable;  but
> > if you have a large FS you're not realistically going to be able to set
> > the cache size to match it - that's why it's a cache!
>
> I think it's more about active dataset size and not necessarily total
> FS size. FS might be big, but if the application is not accessing all
> of it in a reasonably small time window, then it does not matter.
>
> What worries me most is that the cost of reclaiming a dax range seems
> so high (or keeps the process blocked for long enough) that it
> kills the performance. I will need to revisit the reclaim path
> and see if I can optimize something.

Can we dynamically increase the dax window size?  Or hot plug
additional dax ranges on demand?

That would solve the problem of trying to find the active set size in advance.

We'd still need reclaim in order to prevent the window from growing
indefinitely, but that would always be background reclaim and be done
after some time of inactivity.

Thanks,
Miklos




Re: Some performance numbers for virtiofs, DAX and virtio-9p

2020-12-10 Thread Miklos Szeredi
On Thu, Dec 10, 2020 at 5:11 PM Vivek Goyal  wrote:

> Conclusion
> ---
> - virtiofs DAX seems to help a lot in many workloads.
>
>   Note, DAX performs well only if the data fits in the cache window. My total
>   data is 16G and the cache window size is 16G as well. If the data is larger
>   than the DAX cache window, then dax performance suffers a lot. The overhead
>   of reclaiming the old mapping and setting up a new one is very high.

Which begs the question: what is the optimal window size?

What is the cost per GB of window to the host and guest?

Could we measure at what point does a large window size actually make
performance worse?

>
> NAMEWORKLOADBandwidth   IOPS
> 9p-none seqread-psync   98.6mb  24.6k
> 9p-mmap seqread-psync   97.5mb  24.3k
> 9p-looseseqread-psync   91.6mb  22.9k
> vtfs-none   seqread-psync   98.4mb  24.6k
> vtfs-none-dax   seqread-psync   660.3mb 165.0k
> vtfs-auto   seqread-psync   650.0mb 162.5k
> vtfs-auto-dax   seqread-psync   703.1mb 175.7k
> vtfs-always seqread-psync   671.3mb 167.8k
> vtfs-always-dax seqread-psync   687.2mb 171.8k
>
> 9p-none seqread-psync-multi 397.6mb 99.4k
> 9p-mmap seqread-psync-multi 382.7mb 95.6k
> 9p-looseseqread-psync-multi 350.5mb 87.6k
> vtfs-none   seqread-psync-multi 360.0mb 90.0k
> vtfs-none-dax   seqread-psync-multi 2281.1mb570.2k
> vtfs-auto   seqread-psync-multi 2530.7mb632.6k
> vtfs-auto-dax   seqread-psync-multi 2423.9mb605.9k
> vtfs-always seqread-psync-multi 2535.7mb633.9k
> vtfs-always-dax seqread-psync-multi 2406.1mb601.5k

Seems like in all the -multi tests 9p-none performs consistently
better than vtfs-none.  Could that be due to the single queue?

Thanks,
Miklos




Re: [Virtio-fs] [PATCH] virtiofsd: Use --thread-pool-size=0 to mean no thread pool

2020-11-23 Thread Miklos Szeredi
On Wed, Nov 18, 2020 at 8:52 PM Vivek Goyal  wrote:
>
> On Thu, Nov 12, 2020 at 10:06:37AM +0100, Miklos Szeredi wrote:
> > On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal  wrote:
> > >
> > > On Fri, Nov 06, 2020 at 08:33:50PM +, Venegas Munoz, Jose Carlos 
> > > wrote:
> > > > Hi Vivek,
> > > >
> > > > I have tested with Kata 1.12-alpha0, and the results seem to be better
> > > > for the fio config I am tracking.
> > > >
> > > > The fio config does randrw:
> > > >
> > > > fio --direct=1 --gtod_reduce=1 --name=test 
> > > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M 
> > > > --readwrite=randrw --rwmixread=75
> > > >
> > >
> > > Hi Carlos,
> > >
> > > Thanks for the testing.
> > >
> > > So basically two conclusions from your tests.
> > >
> > > - for virtiofs, --thread-pool-size=0 is performing better as compared
> > >   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately
> > >   35-40% better.
> > >
> > > - virtio-9p is still approximately 30% better than virtiofs
> > >   --thread-pool-size=0.
> > >
> > > As per my analysis, this particular workload (mixed read and
> > > write) is bad with virtiofs because after every write we are invalidating
> > > attrs and cache, so the next read ends up fetching attrs again. I had posted
> > > patches to regain some of the performance.
> > >
> > > https://lore.kernel.org/linux-fsdevel/20200929185015.gg220...@redhat.com/
> > >
> > > But I got the feedback to look into implementing file leases instead.
> >
> > Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy; how about turning it
> > off for now?  9p doesn't have it, so there's no point in enabling it for
> > virtiofs by default.
>
> If we disable FUSE_AUTO_INVAL_DATA, then client page cache will not
> be invalidated even after 1 sec, right? (for cache=auto).

Unless FOPEN_KEEP_CACHE is used (only cache=always, AFAICS) data cache
will be invalidated on open.

I think it's what NFS does, except NFS does invalidation based on
mtime *at the time of open*.   At least that's what I remember from
past reading of NFS code.

> Given now we also want to target sharing directory tree among multiple
> clients, keeping FUSE_AUTO_INVAL_DATA enabled should help.

Depends on what applications expect.  The close-to-open coherency
provided even without FUSE_AUTO_INVAL_DATA should be enough for the
case when strong coherency is not required.

>
> >
> > Also I think some confusion comes from cache=auto being the default
> > for virtiofs.  Not sure what the default is for 9p, but comparing
> > default to default will definitely not be apples to apples since this
> > mode is nonexistent in 9p.
> >
> > 9p:cache=none  <-> virtiofs:cache=none
> > 9p:cache=loose <-> virtiofs:cache=always
> >
> > "9p:cache=mmap" and "virtiofs:cache=auto" have no match.
>
> Agreed from performance comparison point of view.
>
> I will prefer cache=auto default (over cache=none) for virtiofsd. During
> some kernel compilation tests over virtiofs, cache=none was painfully
> slow as compared to cache=auto.

It would also be interesting to know the exact bottleneck in the
kernel compile with cache=none case.

Thanks,
Miklos



Re: [Virtio-fs] [PATCH] virtiofsd: Use --thread-pool-size=0 to mean no thread pool

2020-11-12 Thread Miklos Szeredi
On Thu, Nov 12, 2020 at 12:34 PM Christian Schoenebeck
 wrote:
>
> On Donnerstag, 12. November 2020 10:06:37 CET Miklos Szeredi wrote:
> >
> > 9p:cache=none  <-> virtiofs:cache=none
> > 9p:cache=loose <-> virtiofs:cache=always
> >
> > "9p:cache=mmap" and "virtiofs:cache=auto" have no match.
>
> What does 'auto' do exactly?

It has a configurable timeout (default is 1s) for the attribute and
lookup cache and close-to-open consistency for data (flush on close,
invalidate on open).  This is similar to, but less refined than, NFS.
It's something that could easily be improved if there's a need for it.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH] virtiofsd: Use --thread-pool-size=0 to mean no thread pool

2020-11-12 Thread Miklos Szeredi
On Thu, Nov 12, 2020 at 10:06 AM Miklos Szeredi  wrote:
>
> On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal  wrote:
> >
> > On Fri, Nov 06, 2020 at 08:33:50PM +, Venegas Munoz, Jose Carlos wrote:
> > > Hi Vivek,
> > >
> > > I have tested with Kata 1.12-alpha0, and the results seem to be better for
> > > the fio config I am tracking.
> > >
> > > The fio config does randrw:
> > >
> > > fio --direct=1 --gtod_reduce=1 --name=test 
> > > --filename=random_read_write.fio --bs=4k --iodepth=64 --size=200M 
> > > --readwrite=randrw --rwmixread=75
> > >
> >
> > Hi Carlos,
> >
> > Thanks for the testing.
> >
> > So basically two conclusions from your tests.
> >
> > - for virtiofs, --thread-pool-size=0 is performing better as compared
> >   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately
> >   35-40% better.
> >
> > - virtio-9p is still approximately 30% better than virtiofs
> >   --thread-pool-size=0.
> >
> > As per my analysis, this particular workload (mixed read and
> > write) is bad with virtiofs because after every write we are invalidating
> > attrs and cache, so the next read ends up fetching attrs again. I had posted
> > patches to regain some of the performance.
> >
> > https://lore.kernel.org/linux-fsdevel/20200929185015.gg220...@redhat.com/
> >
> > But I got the feedback to look into implementing file leases instead.
>
> Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy; how about turning it
> off for now?  9p doesn't have it, so there's no point in enabling it for
> virtiofs by default.
>
> Also I think some confusion comes from cache=auto being the default
> for virtiofs.  Not sure what the default is for 9p, but comparing
> default to default will definitely not be apples to apples since this
> mode is nonexistent in 9p.
>
> 9p:cache=none  <-> virtiofs:cache=none
> 9p:cache=loose <-> virtiofs:cache=always
>
> "9p:cache=mmap" and "virtiofs:cache=auto" have no match.
>
> Untested patch attached.

Another performance issue with virtiofs could be due to the strict
page writeback rules in fuse that are meant to prevent misuse of
kernel memory by unprivileged processes.   Since virtiofs isn't
subject to that limitation, these could be relaxed.

Attaching a patch that does one half of this.  The other half is
getting rid of the page copying, that's a bit more involved, but
shouldn't be too difficult.  Just need to duplicate the cached
writeback callbacks for virtiofs and do away with the complex temp
page stuff.

Thanks,
Miklos
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index d414c787e362..92c92c482c57 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -502,6 +502,7 @@ struct fuse_fs_context {
 	bool no_force_umount:1;
 	bool legacy_opts_show:1;
 	bool dax:1;
+	bool relax_writeback:1;
 	unsigned int max_read;
 	unsigned int blksize;
 	const char *subtype;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 36ab05315828..029325ebd1b3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1130,7 +1130,8 @@ void fuse_free_conn(struct fuse_conn *fc)
 }
 EXPORT_SYMBOL_GPL(fuse_free_conn);
 
-static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
+static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb,
+			 struct fuse_fs_context *ctx)
 {
 	int err;
 	char *suffix = "";
@@ -1151,21 +1152,24 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 
 	/* fuse does it's own writeback accounting */
 	sb->s_bdi->capabilities &= ~BDI_CAP_WRITEBACK_ACCT;
-	sb->s_bdi->capabilities |= BDI_CAP_STRICTLIMIT;
 
-	/*
-	 * For a single fuse filesystem use max 1% of dirty +
-	 * writeback threshold.
-	 *
-	 * This gives about 1M of write buffer for memory maps on a
-	 * machine with 1G and 10% dirty_ratio, which should be more
-	 * than enough.
-	 *
-	 * Privileged users can raise it by writing to
-	 *
-	 *    /sys/class/bdi/<bdi>/max_ratio
-	 */
-	bdi_set_max_ratio(sb->s_bdi, 1);
+	if (!ctx->relax_writeback) {
+		sb->s_bdi->capabilities |= BDI_CAP_STRICTLIMIT;
+
+		/*
+		 * For a single fuse filesystem use max 1% of dirty +
+		 * writeback threshold.
+		 *
+		 * This gives about 1M of write buffer for memory maps on a
+		 * machine with 1G and 10% dirty_ratio, which should be more
+		 * than enough.
+		 *
+		 * Privileged users can raise it by writing to
+		 *
+		 *    /sys/class/bdi/<bdi>/max_ratio
+		 */
+		bdi_set_max_ratio(sb->s_bdi, 1);
+	}
 
 	return 0;
 }
@@ -1354,7 +1358,7 @@ int fuse_fill_super_common(struct super_block *sb, struct fuse_fs_context *ctx)
 
 	fc->dev = sb->s_dev;
 	fm->sb = sb;
-	err = fuse_bdi_init(fc, sb);
+	err = fuse_bdi_init(fc, sb, ctx);
 	if (err

Re: [Virtio-fs] [PATCH] virtiofsd: Use --thread-pool-size=0 to mean no thread pool

2020-11-12 Thread Miklos Szeredi
On Fri, Nov 6, 2020 at 11:35 PM Vivek Goyal  wrote:
>
> On Fri, Nov 06, 2020 at 08:33:50PM +, Venegas Munoz, Jose Carlos wrote:
> > Hi Vivek,
> >
> > I have tested with Kata 1.12-alpha0, and the results seem to be better for
> > the fio config I am tracking.
> >
> > The fio config does randrw:
> >
> > fio --direct=1 --gtod_reduce=1 --name=test --filename=random_read_write.fio 
> > --bs=4k --iodepth=64 --size=200M --readwrite=randrw --rwmixread=75
> >
>
> Hi Carlos,
>
> Thanks for the testing.
>
> So basically two conclusions from your tests.
>
> - for virtiofs, --thread-pool-size=0 is performing better as compared
>   to --thread-pool-size=1 as well as --thread-pool-size=64. Approximately
>   35-40% better.
>
> - virtio-9p is still approximately 30% better than virtiofs
>   --thread-pool-size=0.
>
> As per my analysis, this particular workload (mixed read and
> write) is bad with virtiofs because after every write we are invalidating
> attrs and cache, so the next read ends up fetching attrs again. I had posted
> patches to regain some of the performance.
>
> https://lore.kernel.org/linux-fsdevel/20200929185015.gg220...@redhat.com/
>
> But I got the feedback to look into implementing file leases instead.

Hmm, the FUSE_AUTO_INVAL_DATA feature is buggy; how about turning it
off for now?  9p doesn't have it, so there's no point in enabling it for
virtiofs by default.

Also I think some confusion comes from cache=auto being the default
for virtiofs.  Not sure what the default is for 9p, but comparing
default to default will definitely not be apples to apples since this
mode is nonexistent in 9p.

9p:cache=none  <-> virtiofs:cache=none
9p:cache=loose <-> virtiofs:cache=always

"9p:cache=mmap" and "virtiofs:cache=auto" have no match.

Untested patch attached.

Thanks,
Miklos
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index ec1008bceba8..d474c553bb5c 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -618,6 +618,9 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn)
 lo->announce_submounts = false;
 }
 #endif
+
+/* This is currently buggy with mixed read-write load, so disable */
+conn->want &= ~FUSE_CAP_AUTO_INVAL_DATA;
 }
 
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
@@ -3444,7 +3447,7 @@ int main(int argc, char *argv[])
 lo.inodes = g_hash_table_new(lo_key_hash, lo_key_equal);
 lo.root.fd = -1;
 lo.root.fuse_ino = FUSE_ROOT_ID;
-lo.cache = CACHE_AUTO;
+lo.cache = CACHE_ALWAYS;
 
 /*
  * Set up the ino map like this:


Re: [PATCH v3 5/7] virtiofsd: Announce sub-mount points

2020-11-03 Thread Miklos Szeredi
On Tue, Nov 3, 2020 at 10:00 AM Max Reitz  wrote:
>
> On 03.11.20 09:10, Miklos Szeredi wrote:
> > On Mon, Nov 2, 2020 at 5:19 PM Max Reitz  wrote:
> >>
> >> Whenever we encounter a directory with an st_dev or mount ID that
> >> differs from that of its parent, we set the FUSE_ATTR_SUBMOUNT flag so
> >> the guest can create a submount for it.
> >>
> >> We only need to do so in lo_do_lookup().  The following functions return
> >> a fuse_attr object:
> >> - lo_create(), though fuse_reply_create(): Calls lo_do_lookup().
> >> - lo_lookup(), though fuse_reply_entry(): Calls lo_do_lookup().
> >> - lo_mknod_symlink(), through fuse_reply_entry(): Calls lo_do_lookup().
> >> - lo_link(), through fuse_reply_entry(): Creating a link cannot create a
> >>   submount, so there is no need to check for it.
> >> - lo_getattr(), through fuse_reply_attr(): Announcing submounts when the
> >>   node is first detected (at lookup) is sufficient.  We do not need to
> >>   return the submount attribute later.
> >> - lo_do_readdir(), through fuse_add_direntry_plus(): Calls
> >>   lo_do_lookup().
> >>
> >> Make announcing submounts optional, so submounts are only announced to
> >> the guest with the announce_submounts option.  Some users may prefer the
> >> current behavior, so that the guest learns nothing about the host mount
> >> structure.
> >>
> >> (announce_submounts is force-disabled when the guest does not present
> >> the FUSE_SUBMOUNTS capability, or when there is no statx().)
> >>
> >> Signed-off-by: Max Reitz 
> >> Reviewed-by: Stefan Hajnoczi 
> >> ---
> >>  tools/virtiofsd/helper.c |  1 +
> >>  tools/virtiofsd/passthrough_ll.c | 22 ++
> >>  2 files changed, 23 insertions(+)
> >>
> >> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> >> index 2e181a49b5..4433724d3a 100644
> >> --- a/tools/virtiofsd/helper.c
> >> +++ b/tools/virtiofsd/helper.c
> >> @@ -190,6 +190,7 @@ void fuse_cmdline_help(void)
> >> "   retain/discard O_DIRECT flags 
> >> passed down\n"
> >> "   to virtiofsd from guest 
> >> applications.\n"
> >> "   default: no_allow_direct_io\n"
> >> +   "-o announce_submounts  Announce sub-mount points to 
> >> the guest\n"
> >> );
> >>  }
> >>
> >> diff --git a/tools/virtiofsd/passthrough_ll.c 
> >> b/tools/virtiofsd/passthrough_ll.c
> >> index 34d107975f..ec1008bceb 100644
> >> --- a/tools/virtiofsd/passthrough_ll.c
> >> +++ b/tools/virtiofsd/passthrough_ll.c
> >> @@ -40,6 +40,7 @@
> >>  #include "fuse_virtio.h"
> >>  #include "fuse_log.h"
> >>  #include "fuse_lowlevel.h"
> >> +#include "standard-headers/linux/fuse.h"
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -167,6 +168,7 @@ struct lo_data {
> >>  int readdirplus_set;
> >>  int readdirplus_clear;
> >>  int allow_direct_io;
> >> +int announce_submounts;
> >>  bool use_statx;
> >>  struct lo_inode root;
> >>  GHashTable *inodes; /* protected by lo->mutex */
> >> @@ -207,6 +209,7 @@ static const struct fuse_opt lo_opts[] = {
> >>  { "no_readdirplus", offsetof(struct lo_data, readdirplus_clear), 1 },
> >>  { "allow_direct_io", offsetof(struct lo_data, allow_direct_io), 1 },
> >>  { "no_allow_direct_io", offsetof(struct lo_data, allow_direct_io), 0 
> >> },
> >> +{ "announce_submounts", offsetof(struct lo_data, announce_submounts), 
> >> 1 },
> >>  FUSE_OPT_END
> >>  };
> >>  static bool use_syslog = false;
> >> @@ -601,6 +604,20 @@ static void lo_init(void *userdata, struct 
> >> fuse_conn_info *conn)
> >>  fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling readdirplus\n");
> >>  conn->want &= ~FUSE_CAP_READDIRPLUS;
> >>  }
> >> +
> >> +if (!(conn->capable & FUSE_CAP_SUBMOUNTS) && lo->announce_submounts) {
> >> +fuse_log(FUSE_LOG_WARNING, "lo_init: Cannot announce submounts, 
> >> client "
> >> + "doe

Re: [PATCH v3 0/7] virtiofsd: Announce submounts to the guest

2020-11-03 Thread Miklos Szeredi
On Mon, Nov 2, 2020 at 5:19 PM Max Reitz  wrote:
>
> RFC: https://www.redhat.com/archives/virtio-fs/2020-May/msg00024.html
> v1: https://lists.nongnu.org/archive/html/qemu-devel/2020-09/msg03598.html
> v2: https://lists.nongnu.org/archive/html/qemu-devel/2020-10/msg09117.html
>
> Branch: https://github.com/XanClic/qemu.git virtiofs-submounts-v4
> Branch: https://git.xanclic.moe/XanClic/qemu.git virtiofs-submounts-v4
>
>
> Hi,
>
> In an effort to cut this cover letter shorter, I’ll defer to the cover
> letter of v2 linked above concerning the general description of this
> series, and limit myself to describing the differences from v2:


Other than the issues in 5/7:

Reviewed-by: Miklos Szeredi 

Thanks,
Miklos




Re: [PATCH v3 5/7] virtiofsd: Announce sub-mount points

2020-11-03 Thread Miklos Szeredi
On Mon, Nov 2, 2020 at 5:19 PM Max Reitz  wrote:
>
> Whenever we encounter a directory with an st_dev or mount ID that
> differs from that of its parent, we set the FUSE_ATTR_SUBMOUNT flag so
> the guest can create a submount for it.
>
> We only need to do so in lo_do_lookup().  The following functions return
> a fuse_attr object:
> - lo_create(), though fuse_reply_create(): Calls lo_do_lookup().
> - lo_lookup(), though fuse_reply_entry(): Calls lo_do_lookup().
> - lo_mknod_symlink(), through fuse_reply_entry(): Calls lo_do_lookup().
> - lo_link(), through fuse_reply_entry(): Creating a link cannot create a
>   submount, so there is no need to check for it.
> - lo_getattr(), through fuse_reply_attr(): Announcing submounts when the
>   node is first detected (at lookup) is sufficient.  We do not need to
>   return the submount attribute later.
> - lo_do_readdir(), through fuse_add_direntry_plus(): Calls
>   lo_do_lookup().
>
> Make announcing submounts optional, so submounts are only announced to
> the guest with the announce_submounts option.  Some users may prefer the
> current behavior, so that the guest learns nothing about the host mount
> structure.
>
> (announce_submounts is force-disabled when the guest does not present
> the FUSE_SUBMOUNTS capability, or when there is no statx().)
>
> Signed-off-by: Max Reitz 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  tools/virtiofsd/helper.c |  1 +
>  tools/virtiofsd/passthrough_ll.c | 22 ++
>  2 files changed, 23 insertions(+)
>
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index 2e181a49b5..4433724d3a 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -190,6 +190,7 @@ void fuse_cmdline_help(void)
> "   retain/discard O_DIRECT flags 
> passed down\n"
> "   to virtiofsd from guest 
> applications.\n"
> "   default: no_allow_direct_io\n"
> +   "-o announce_submounts  Announce sub-mount points to the 
> guest\n"
> );
>  }
>
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 34d107975f..ec1008bceb 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -40,6 +40,7 @@
>  #include "fuse_virtio.h"
>  #include "fuse_log.h"
>  #include "fuse_lowlevel.h"
> +#include "standard-headers/linux/fuse.h"
>  #include 
>  #include 
>  #include 
> @@ -167,6 +168,7 @@ struct lo_data {
>  int readdirplus_set;
>  int readdirplus_clear;
>  int allow_direct_io;
> +int announce_submounts;
>  bool use_statx;
>  struct lo_inode root;
>  GHashTable *inodes; /* protected by lo->mutex */
> @@ -207,6 +209,7 @@ static const struct fuse_opt lo_opts[] = {
>  { "no_readdirplus", offsetof(struct lo_data, readdirplus_clear), 1 },
>  { "allow_direct_io", offsetof(struct lo_data, allow_direct_io), 1 },
>  { "no_allow_direct_io", offsetof(struct lo_data, allow_direct_io), 0 },
> +{ "announce_submounts", offsetof(struct lo_data, announce_submounts), 1 
> },
>  FUSE_OPT_END
>  };
>  static bool use_syslog = false;
> @@ -601,6 +604,20 @@ static void lo_init(void *userdata, struct 
> fuse_conn_info *conn)
>  fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling readdirplus\n");
>  conn->want &= ~FUSE_CAP_READDIRPLUS;
>  }
> +
> +if (!(conn->capable & FUSE_CAP_SUBMOUNTS) && lo->announce_submounts) {
> +fuse_log(FUSE_LOG_WARNING, "lo_init: Cannot announce submounts, 
> client "
> + "does not support it\n");
> +lo->announce_submounts = false;
> +}
> +
> +#ifndef CONFIG_STATX
> +if (lo->announce_submounts) {
> +fuse_log(FUSE_LOG_WARNING, "lo_init: Cannot announce submounts, 
> there "
> + "is no statx()\n");
> +lo->announce_submounts = false;

Minor issue: this warns only when libc doesn't have statx, and not
when kernel doesn't have it.

> +}
> +#endif
>  }
>
>  static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
> @@ -877,6 +894,11 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
> parent, const char *name,
>  goto out_err;
>  }
>
> +if (S_ISDIR(e->attr.st_mode) && lo->announce_submounts &&
> +(e->attr.st_dev != dir->key.dev || mnt_id != dir->key.mnt_id)) {
> +e->attr_flags |= FUSE_ATTR_SUBMOUNT;
> +}

... and since announce_submounts isn't turned off in case the kernel
doesn't have STATX_MNT_ID, this will trigger a submount when crossing
into a different filesystem.

Possible solutions:

a) test and warn at startup in case kernel doesn't have statx

b) do not test st_dev, only mnt_id (which will always be zero if not supported)

c) merge announce_submounts and use_statx, which are basically doing
the same thing

> +
>  inode = lo_find(lo, &e->attr, mnt_id);
>  if (inode) {
>  close(newfd);
> --
> 2.26.2
>




Re: [PULL 00/32] VFIO updates 2020-10-26 (for QEMU 5.2 soft-freeze)

2020-10-28 Thread Miklos Szeredi
On Wed, Oct 28, 2020 at 10:19 AM Dr. David Alan Gilbert
 wrote:
>
> > I'm not comfortable trying to update Max's series to try to determine
> > if FUSE_SUBMOUNTS can be interchanged with FUSE_ATTR_FLAGS, where the

FUSE_SUBMOUNTS is the correct one, FUSE_ATTR_FLAGS was never merged
into mainline linux.

The only difference is that FUSE_ATTR_FLAGS was meant to be negotiated
(AFAIR), while FUSE_SUBMOUNTS is just announcing that the client
supports submounts and will honour the FUSE_ATTR_SUBMOUNT flag from
the server.

Thanks,
Miklos




Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Miklos Szeredi
On Tue, Sep 29, 2020 at 4:01 PM Vivek Goyal  wrote:
>
> On Tue, Sep 29, 2020 at 03:49:04PM +0200, Miklos Szeredi wrote:
> > On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal  wrote:
> >
> > > - virtiofs cache=none mode is faster than cache=auto mode for this
> > >   workload.
> >
> > Not sure why.  One cause could be that readahead is not perfect at
> > detecting the random pattern.  Could we compare total I/O on the
> > server vs. total I/O by fio?
>
> Hi Miklos,
>
> I will instrument virtiosd code to figure out total I/O.
>
> One more potential issue I am staring at is refreshing the attrs on
> READ if fc->auto_inval_data is set.
>
> fuse_cache_read_iter() {
> /*
>  * In auto invalidate mode, always update attributes on read.
>  * Otherwise, only update if we attempt to read past EOF (to ensure
>  * i_size is up to date).
>  */
> if (fc->auto_inval_data ||
> (iocb->ki_pos + iov_iter_count(to) > i_size_read(inode))) {
> int err;
> err = fuse_update_attributes(inode, iocb->ki_filp);
> if (err)
> return err;
> }
> }
>
> Given this is a mixed READ/WRITE workload, every WRITE will invalidate
> attrs. And next READ will first do GETATTR() from server (and potentially
> invalidate page cache) before doing READ.
>
> This sounds suboptimal especially from the point of view of WRITEs
> done by this client itself. I mean if another client has modified
> the file, then doing GETATTR after a second makes sense. But there
> should be some optimization to make sure our own WRITEs don't end
> up doing GETATTR and invalidate page cache (because cache contents
> are still valid).

Yeah, that sucks.

> I disabled ->auto_invalid_data and that seemed to result in 8-10%
> gain in performance for this workload.

Need to wrap my head around these caching issues.

Thanks,
Miklos




Re: [Virtio-fs] virtiofs vs 9p performance(Re: tools/virtiofs: Multi threading seems to hurt performance)

2020-09-29 Thread Miklos Szeredi
On Tue, Sep 29, 2020 at 3:18 PM Vivek Goyal  wrote:

> - virtiofs cache=none mode is faster than cache=auto mode for this
>   workload.

Not sure why.  One cause could be that readahead is not perfect at
detecting the random pattern.  Could we compare total I/O on the
server vs. total I/O by fio?

Thanks,
Miklos




Re: [PATCH] virtiofsd: Used glib "shared" thread pool

2020-09-22 Thread Miklos Szeredi
On Mon, Sep 21, 2020 at 11:32 PM Vivek Goyal  wrote:

> glib offers thread pools and it seems to support "exclusive" and "shared"
> thread pools.
>
>
> https://developer.gnome.org/glib/stable/glib-Thread-Pools.html#g-thread-pool-new
>
> Currently we use "exclusive" thread pools but their performance seems to be
> poor. I tried using "shared" thread pools and performance seems much
> better. I posted performance results here.
>
> https://www.redhat.com/archives/virtio-fs/2020-September/msg00080.html
>
> So let's switch to shared thread pools. We can think of making it optional
> once somebody can show in what cases exclusive thread pools offer better
> results. For now, my simple performance tests across the board see
> better results with shared thread pools.
>

Needs this as well:

--- qemu.orig/tools/virtiofsd/passthrough_seccomp.c 2020-09-16 20:21:13.168686176 +0200
+++ qemu/tools/virtiofsd/passthrough_seccomp.c 2020-09-22 14:01:38.499164501 +0200
@@ -94,6 +94,8 @@ static const int syscall_whitelist[] = {
 SCMP_SYS(rt_sigaction),
 SCMP_SYS(rt_sigprocmask),
 SCMP_SYS(rt_sigreturn),
+SCMP_SYS(sched_getattr),
+SCMP_SYS(sched_setattr),
 SCMP_SYS(sendmsg),
 SCMP_SYS(setresgid),
 SCMP_SYS(setresuid),

Thanks,
Miklos


Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

2020-06-19 Thread Miklos Szeredi
On Fri, Jun 19, 2020 at 4:25 PM Vivek Goyal  wrote:
>
> On Fri, Jun 19, 2020 at 04:16:30PM +0200, Miklos Szeredi wrote:
> > On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal  wrote:
> > >
> > > On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > > > virtiofsd doesn't need all Linux capabilities(7) available to root.
> > > > Keep a whitelisted set of capabilities that we require.  This improves
> > > > security in case virtiofsd is compromised by making it hard for an
> > > > attacker to gain further access to the system.
> > >
> > > Hi Stefan,
> > >
> > > I just noticed that this patch set breaks overlayfs on top of virtiofs.
> >
> > How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
> > requires CAP_SYS_ADMIN, not the accessor.
>
> virtiofsd needs CAP_SYS_ADMIN, otherwise fsetxattr(trusted.overlay.opaque)
> fails in lo_setxattr().
>
> This is triggered when we mount overlayfs on top of virtiofs and overlayfs
> tries to set OVL_XATTR_OPAQUE on upper to check if trusted xattrs are
> supported or not.

Ah, right.

Plan is to use "user.*" xattr for unprivileged overlay.  This would be
a good way to eliminate this attack surface in the overlay on virtiofs
case as well.

Other ways to minimize risk is to separate operations requiring
CAP_SYS_ADMIN into a separate process, preferably a separate
executable, that communicates with virtiofsd using a pipe and contains
the minimum amount of code.

Thanks,
Miklos



Re: [Virtio-fs] [PATCH 0/2] virtiofsd: drop Linux capabilities(7)

2020-06-19 Thread Miklos Szeredi
On Thu, Jun 18, 2020 at 9:08 PM Vivek Goyal  wrote:
>
> On Thu, Apr 16, 2020 at 05:49:05PM +0100, Stefan Hajnoczi wrote:
> > virtiofsd doesn't need all Linux capabilities(7) available to root.  Keep a
> > whitelisted set of capabilities that we require.  This improves security in
> > case virtiofsd is compromised by making it hard for an attacker to gain
> > further access to the system.
>
> Hi Stefan,
>
> I just noticed that this patch set breaks overlayfs on top of virtiofs.

How so?  Virtiofs isn't mounting overlayfs, is it?  Only the mounter
requires CAP_SYS_ADMIN, not the accessor.

Thanks,
Miklos



Re: [Virtio-fs] [PATCH] virtiofsd: remove symlink fallbacks

2020-05-14 Thread Miklos Szeredi
On Fri, May 15, 2020 at 5:43 AM Liu Bo  wrote:
>
> On Thu, May 14, 2020 at 04:07:36PM +0200, Miklos Szeredi wrote:
> > Path lookup in the kernel has special rules for looking up magic symlinks
> > under /proc.  If a filesystem operation is instructed to follow symlinks
> > (e.g. via AT_SYMLINK_FOLLOW or lack of AT_SYMLINK_NOFOLLOW), and the final
> > component is such a proc symlink, then the target of the magic symlink is
> > used for the operation, even if the target itself is a symlink.  I.e. path
> > lookup is always terminated after following a final magic symlink.
> >
>
> Hi Miklos,
>
> Are these mentioned special rules supported by a recent kernel version
> or are they there from day one linux?

Hi Liubo,

AFAICS, these rules have been there from day one.

Thanks,
Miklos




[PATCH] virtiofsd: remove symlink fallbacks

2020-05-14 Thread Miklos Szeredi
Path lookup in the kernel has special rules for looking up magic symlinks
under /proc.  If a filesystem operation is instructed to follow symlinks
(e.g. via AT_SYMLINK_FOLLOW or lack of AT_SYMLINK_NOFOLLOW), and the final
component is such a proc symlink, then the target of the magic symlink is
used for the operation, even if the target itself is a symlink.  I.e. path
lookup is always terminated after following a final magic symlink.

I was erroneously assuming that in the above case the target symlink would
also be followed, and so workarounds were added for a couple of operations
to handle the symlink case.  Since the symlink can be handled simply by
following the proc symlink, these workarounds are not needed.

Also remove the "norace" option, which disabled the workarounds.

Commit bdfd66788349 ("virtiofsd: Fix xattr operations") already dealt with
the same issue for xattr operations.

Signed-off-by: Miklos Szeredi 
---
 tools/virtiofsd/passthrough_ll.c | 175 ++-
 1 file changed, 6 insertions(+), 169 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 3ba1d9098460..2ce7c96085bf 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -140,7 +140,6 @@ enum {
 struct lo_data {
 pthread_mutex_t mutex;
 int debug;
-int norace;
 int writeback;
 int flock;
 int posix_lock;
@@ -176,7 +175,6 @@ static const struct fuse_opt lo_opts[] = {
 { "cache=none", offsetof(struct lo_data, cache), CACHE_NONE },
 { "cache=auto", offsetof(struct lo_data, cache), CACHE_AUTO },
 { "cache=always", offsetof(struct lo_data, cache), CACHE_ALWAYS },
-{ "norace", offsetof(struct lo_data, norace), 1 },
 { "readdirplus", offsetof(struct lo_data, readdirplus_set), 1 },
 { "no_readdirplus", offsetof(struct lo_data, readdirplus_clear), 1 },
 FUSE_OPT_END
@@ -592,136 +590,6 @@ static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
 fuse_reply_attr(req, , lo->timeout);
 }
 
-/*
- * Increments parent->nlookup and caller must release refcount using
- * lo_inode_put().
- */
-static int lo_parent_and_name(struct lo_data *lo, struct lo_inode *inode,
-  char path[PATH_MAX], struct lo_inode **parent)
-{
-char procname[64];
-char *last;
-struct stat stat;
-struct lo_inode *p;
-int retries = 2;
-int res;
-
-retry:
-sprintf(procname, "%i", inode->fd);
-
-res = readlinkat(lo->proc_self_fd, procname, path, PATH_MAX);
-if (res < 0) {
-fuse_log(FUSE_LOG_WARNING, "%s: readlink failed: %m\n", __func__);
-goto fail_noretry;
-}
-
-if (res >= PATH_MAX) {
-fuse_log(FUSE_LOG_WARNING, "%s: readlink overflowed\n", __func__);
-goto fail_noretry;
-}
-path[res] = '\0';
-
-last = strrchr(path, '/');
-if (last == NULL) {
-/* Shouldn't happen */
-fuse_log(
-FUSE_LOG_WARNING,
-"%s: INTERNAL ERROR: bad path read from proc\n", __func__);
-goto fail_noretry;
-}
-if (last == path) {
-p = &lo->root;
-pthread_mutex_lock(>mutex);
-p->nlookup++;
-g_atomic_int_inc(>refcount);
-pthread_mutex_unlock(>mutex);
-} else {
-*last = '\0';
-res = fstatat(AT_FDCWD, last == path ? "/" : path, &stat, 0);
-if (res == -1) {
-if (!retries) {
-fuse_log(FUSE_LOG_WARNING,
- "%s: failed to stat parent: %m\n", __func__);
-}
-goto fail;
-}
-p = lo_find(lo, &stat);
-if (p == NULL) {
-if (!retries) {
-fuse_log(FUSE_LOG_WARNING,
- "%s: failed to find parent\n", __func__);
-}
-goto fail;
-}
-}
-last++;
-res = fstatat(p->fd, last, &stat, AT_SYMLINK_NOFOLLOW);
-if (res == -1) {
-if (!retries) {
-fuse_log(FUSE_LOG_WARNING,
- "%s: failed to stat last\n", __func__);
-}
-goto fail_unref;
-}
-if (stat.st_dev != inode->key.dev || stat.st_ino != inode->key.ino) {
-if (!retries) {
-fuse_log(FUSE_LOG_WARNING,
- "%s: failed to match last\n", __func__);
-}
-goto fail_unref;
-}
-*parent = p;
-memmove(path, last, strlen(last) + 1);
-
-return 0;
-
-fail_unref:
-unref_inode_lolocked(lo, p, 1);
-lo_inode_put(lo, &p);
-fail:
-if (retries) {
-retries--;
-goto retry;
-}
-fail_noretry:
-errno = EIO;
-return -1;
-}
-
-static int utimensat_empty(struct lo_data *lo, struct lo_inode *inode,
-   const struct timespec *tv)
-{
-int res;
-stru

Re: [Virtio-fs] [PATCH] virtiofsd: jail lo->proc_self_fd

2020-04-29 Thread Miklos Szeredi
On Wed, Apr 29, 2020 at 5:00 PM Vivek Goyal  wrote:
>
> On Wed, Apr 29, 2020 at 04:47:19PM +0200, Miklos Szeredi wrote:
> > On Wed, Apr 29, 2020 at 4:36 PM Vivek Goyal  wrote:
> > >
> > > On Wed, Apr 29, 2020 at 02:47:33PM +0200, Miklos Szeredi wrote:
> > > > While it's not possible to escape the proc filesystem through
> > > > lo->proc_self_fd, it is possible to escape to the root of the proc
> > > > filesystem itself through "../..".
> > >
> > > Hi Miklos,
> > >
> > > So this attack will work with some form of *at(lo->proc_self_fd, "../..")
> > > call?
> >
> > Right.
> >
> > >
> > > >
> > > > Use a temporary mount for opening lo->proc_self_fd, that has its root
> > > > at /proc/self/fd/, preventing access to the ancestor directories.
> > >
> > > Does this mean that now similar attack can happen using "../.." on tmpdir
> > > fd instead and be able to look at peers of tmpdir. Or it is blocked
> > > due to mount point or something else.
> >
> > No, because tmpdir is detached, the root of that tree will be the
> > directory pointed to by the fd.  ".." will just lead to the same
> > directory.
>
> Aha.. got it. Thanks.
>
> One more question. We seem to be using PID namespaces. Can't we mount
> fresh instance of proc so that we don't see other processes apart from
> virtiofd. May be we are already doing it. I have not checked it yet. If
> yes, that should mitigate this concern?

Yes, it's doing that already just above this patch:

/* The child must remount /proc to use the new pid namespace */
if (mount("proc", "/proc", "proc",
  MS_NODEV | MS_NOEXEC | MS_NOSUID | MS_RELATIME, NULL) < 0) {
fuse_log(FUSE_LOG_ERR, "mount(/proc): %m\n");
exit(1);
}

Thanks,
Miklos




Re: [Virtio-fs] [PATCH] virtiofsd: jail lo->proc_self_fd

2020-04-29 Thread Miklos Szeredi
On Wed, Apr 29, 2020 at 4:47 PM Miklos Szeredi  wrote:
>
> On Wed, Apr 29, 2020 at 4:36 PM Vivek Goyal  wrote:
> >
> > On Wed, Apr 29, 2020 at 02:47:33PM +0200, Miklos Szeredi wrote:
> > > While it's not possible to escape the proc filesystem through
> > > lo->proc_self_fd, it is possible to escape to the root of the proc
> > > filesystem itself through "../..".
> >
> > Hi Miklos,
> >
> > So this attack will work with some form of *at(lo->proc_self_fd, "../..")
> > call?
>
> Right.
>
> >
> > >
> > > Use a temporary mount for opening lo->proc_self_fd, that has its root at
> > > /proc/self/fd/, preventing access to the ancestor directories.
> >
> > Does this mean that now similar attack can happen using "../.." on tmpdir
> > fd instead and be able to look at peers of tmpdir. Or it is blocked
> > due to mount point or something else.
>
> No, because tmpdir is detached, the root of that tree will be the
> directory pointed to by the fd.  ".." will just lead to the same
> directory.

BTW, I would have liked to do this without a temp directory, but
apparently the fancy new mount stuff isn't up to this task, or at
least I haven't figured it out yet.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH] virtiofsd: jail lo->proc_self_fd

2020-04-29 Thread Miklos Szeredi
On Wed, Apr 29, 2020 at 4:36 PM Vivek Goyal  wrote:
>
> On Wed, Apr 29, 2020 at 02:47:33PM +0200, Miklos Szeredi wrote:
> > While it's not possible to escape the proc filesystem through
> > lo->proc_self_fd, it is possible to escape to the root of the proc
> > filesystem itself through "../..".
>
> Hi Miklos,
>
> So this attack will work with some form of *at(lo->proc_self_fd, "../..")
> call?

Right.

>
> >
> > Use a temporary mount for opening lo->proc_self_fd, that has its root at
> > /proc/self/fd/, preventing access to the ancestor directories.
>
> Does this mean that now similar attack can happen using "../.." on tmpdir
> fd instead and be able to look at peers of tmpdir. Or it is blocked
> due to mount point or something else.

No, because tmpdir is detached, the root of that tree will be the
directory pointed to by the fd.  ".." will just lead to the same
directory.

Thanks,
Miklos




[PATCH] virtiofsd: jail lo->proc_self_fd

2020-04-29 Thread Miklos Szeredi
While it's not possible to escape the proc filesystem through
lo->proc_self_fd, it is possible to escape to the root of the proc
filesystem itself through "../..".

Use a temporary mount for opening lo->proc_self_fd, that has its root at
/proc/self/fd/, preventing access to the ancestor directories.

Signed-off-by: Miklos Szeredi 
---
 tools/virtiofsd/passthrough_ll.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4c35c95b256c..bc9c44c760f4 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2536,6 +2536,8 @@ static void print_capabilities(void)
 static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
 {
 pid_t child;
+char template[] = "virtiofsd-XXXXXX";
+char *tmpdir;
 
 /*
  * Create a new pid namespace for *child* processes.  We'll have to
@@ -2597,12 +2599,33 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
 exit(1);
 }
 
+tmpdir = mkdtemp(template);
+if (!tmpdir) {
+fuse_log(FUSE_LOG_ERR, "tmpdir(%s): %m\n", template);
+exit(1);
+}
+
+if (mount("/proc/self/fd", tmpdir, NULL, MS_BIND, NULL) < 0) {
+fuse_log(FUSE_LOG_ERR, "mount(/proc/self/fd, %s, MS_BIND): %m\n",
+ tmpdir);
+exit(1);
+}
+
 /* Now we can get our /proc/self/fd directory file descriptor */
-lo->proc_self_fd = open("/proc/self/fd", O_PATH);
+lo->proc_self_fd = open(tmpdir, O_PATH);
 if (lo->proc_self_fd == -1) {
-fuse_log(FUSE_LOG_ERR, "open(/proc/self/fd, O_PATH): %m\n");
+fuse_log(FUSE_LOG_ERR, "open(%s, O_PATH): %m\n", tmpdir);
 exit(1);
 }
+
+if (umount2(tmpdir, MNT_DETACH) < 0) {
+fuse_log(FUSE_LOG_ERR, "umount2(%s, MNT_DETACH): %m\n", tmpdir);
+exit(1);
+}
+
+if (rmdir(tmpdir) < 0) {
+fuse_log(FUSE_LOG_ERR, "rmdir(%s): %m\n", tmpdir);
+}
 }
 
 /*
-- 
2.21.1




Re: [Virtio-fs] [PATCH] virtiofsd: Show submounts

2020-04-29 Thread Miklos Szeredi
On Wed, Apr 29, 2020 at 9:59 AM Miklos Szeredi  wrote:
>
> On Tue, Apr 28, 2020 at 9:15 PM Dr. David Alan Gilbert
>  wrote:
>
> > So our current sequence is:
> >
> >(new namespace)
> >  1)if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
> >  2)   if (mount("proc", "/proc", "proc",
> >
> >  3)   if (mount(source, source, NULL, MS_BIND | MS_REC, NULL) < 0) {
> >  4)  (chdir newroot, pivot, chdir oldroot)
> >  5)   if (mount("", ".", "", MS_SLAVE | MS_REC, NULL) < 0) {
> >  6)   if (umount2(".", MNT_DETACH) < 0) {
> >
> > So are you saying we need a:
> >if (mount(NULL, "/", NULL, MS_REC | MS_SHARED, NULL) < 0) {
> >
> >   and can this go straight after (1) ?
>
> Or right before (3).   Important thing is that that new mount will
> only receive propagation if the type of the mount at source (before
> (3) is performed) is shared.

And seems I was wrong.  Bind mounting clones the slave property, hence
no need to set MS_SHARED.  I.e. if the source was a slave, the bind
mount will be a slave to the same master as well; the two slaves won't
receive propagation between each other, but both will receive
propagation from the master.

The only reason to set MS_SHARED would be if the bind mount wanted to
receive propagation from within the cloned namespace.   Which is not
the case.

Didn't I tell ya it was complicated ;)

Thanks,
Miklos




Re: [Virtio-fs] [PATCH] virtiofsd: Show submounts

2020-04-29 Thread Miklos Szeredi
On Wed, Apr 29, 2020 at 10:31 AM Max Reitz  wrote:
>
> On 28.04.20 21:15, Dr. David Alan Gilbert wrote:

> > So are you saying we need a:
> >if (mount(NULL, "/", NULL, MS_REC | MS_SHARED, NULL) < 0) {
> >
> >   and can this go straight after (1) ?
>
> Isn’t MS_SHARED and MS_SLAVE mutually exclusive, that is, both are just
> different propagation types?  So shouldn’t putting this after (1) be
> effectively the same as replacing (1)?

No, think of it more like a set of groups (mounts in a group have the
same "shared:$ID" tag in /proc/self/mountinfo) and these groups being
arranged into a tree, where child groups get propagation from the
parent group (mounts have a "master:$PARENT_ID" tag), but not the
other way round.

So if a mount is part of a shared group and this group is also the
child of another shared group, then it will have both a "shared:$ID"
and a "master:$PARENT_ID" tag in /proc/self/mountinfo.

For the gory details see Documentation/filesystems/sharedsubtree.txt

Thanks,
Miklos




Re: [Virtio-fs] [PATCH] virtiofsd: Show submounts

2020-04-29 Thread Miklos Szeredi
On Tue, Apr 28, 2020 at 9:15 PM Dr. David Alan Gilbert
 wrote:

> So our current sequence is:
>
>(new namespace)
>  1)if (mount(NULL, "/", NULL, MS_REC | MS_SLAVE, NULL) < 0) {
>  2)   if (mount("proc", "/proc", "proc",
>
>  3)   if (mount(source, source, NULL, MS_BIND | MS_REC, NULL) < 0) {
>  4)  (chdir newroot, pivot, chdir oldroot)
>  5)   if (mount("", ".", "", MS_SLAVE | MS_REC, NULL) < 0) {
>  6)   if (umount2(".", MNT_DETACH) < 0) {
>
> So are you saying we need a:
>if (mount(NULL, "/", NULL, MS_REC | MS_SHARED, NULL) < 0) {
>
>   and can this go straight after (1) ?

Or right before (3).   Important thing is that that new mount will
only receive propagation if the type of the mount at source (before
(3) is performed) is shared.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH] virtiofsd: Show submounts

2020-04-28 Thread Miklos Szeredi
On Tue, Apr 28, 2020 at 4:52 PM Stefan Hajnoczi  wrote:
>
> On Mon, Apr 27, 2020 at 06:59:02PM +0100, Dr. David Alan Gilbert wrote:
> > * Max Reitz (mre...@redhat.com) wrote:
> > > Currently, setup_mounts() bind-mounts the shared directory without
> > > MS_REC.  This makes all submounts disappear.
> > >
> > > Pass MS_REC so that the guest can see submounts again.
> >
> > Thanks!
> >
> > > Fixes: 3ca8a2b1c83eb185c232a4e87abbb65495263756
> >
> > Should this actually be 5baa3b8e95064c2434bd9e2f312edd5e9ae275dc ?
> >
> > > Signed-off-by: Max Reitz 
> > > ---
> > >  tools/virtiofsd/passthrough_ll.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
> > > index 4c35c95b25..9d7f863e66 100644
> > > --- a/tools/virtiofsd/passthrough_ll.c
> > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > @@ -2643,7 +2643,7 @@ static void setup_mounts(const char *source)
> > >  int oldroot;
> > >  int newroot;
> > >
> > > -if (mount(source, source, NULL, MS_BIND, NULL) < 0) {
> > > +if (mount(source, source, NULL, MS_BIND | MS_REC, NULL) < 0) {
> > >  fuse_log(FUSE_LOG_ERR, "mount(%s, %s, MS_BIND): %m\n", source, source);
> > >  exit(1);
> > >  }
> >
> > Do we want MS_SLAVE to pick up future mounts that might happenf rom the
> > host?
>
> There are two separate concepts:
>
> 1. Mount namespaces.  The virtiofsd process is sandboxed and lives in
>its own mount namespace.  Therefore it does not share the mounts that
>the rest of the host system sees.
>
> 2. Propagation type.  This is related to bind mounts so that mount
>operations that happen in one bind-mounted location can also appear
>in other bind-mounted locations.
>
> Since virtiofsd is in a separate mount namespace, does the propagation
> type even have any effect?

It's a complicated thing.  Current setup results in propagation
happening to the cloned namespace, but not to the bind mounted root.

Why?  Because setting mounts "slave" after unshare results in the
propagation being stopped at that point.  To make it propagate
further, change it back to "shared".  Note: changing to "slave" and
then to "shared" breaks the backward propagation to the original
namespace, but still allows propagation further down the chain.

Thanks,
Miklos




Re: [PATCH 072/104] virtiofsd: passthrough_ll: fix refcounting on remove/rename

2020-01-17 Thread Miklos Szeredi
On Thu, Jan 16, 2020 at 5:45 PM Dr. David Alan Gilbert
 wrote:
>
> * Misono Tomohiro (misono.tomoh...@jp.fujitsu.com) wrote:
> > > From: Miklos Szeredi 
> > >
> > > Signed-off-by: Miklos Szeredi 
> >
> > I'm not familiar with qemu convention but shouldn't we put
> > at least one line of description like linux kernel?
>
> Miklos: would you like to suggest a better commit message?

Hmm, the patch doesn't really make sense, since the looked up inode is not used.

Not sure what happened here, this seems to be for supporting shared
versions, and these changes are part of commit 06f78a397f00
("virtiofsd: add initial support for shared versions") in our gitlab
qemu tree.  Was this intentionally split out?

Thanks,
Miklos




Re: [PATCH 068/104] virtiofsd: passthrough_ll: control readdirplus

2020-01-10 Thread Miklos Szeredi
On Fri, Jan 10, 2020 at 4:40 PM Vivek Goyal  wrote:
>
> On Fri, Jan 10, 2020 at 04:30:01PM +0100, Miklos Szeredi wrote:
> > On Fri, Jan 10, 2020 at 4:18 PM Daniel P. Berrangé  
> > wrote:
> > >
> > > On Fri, Jan 10, 2020 at 04:13:08PM +0100, Miklos Szeredi wrote:
> > > > On Fri, Jan 10, 2020 at 4:04 PM Dr. David Alan Gilbert
> > > >  wrote:
> > > > >
> > > > > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > > > > On Thu, Dec 12, 2019 at 04:38:28PM +, Dr. David Alan Gilbert 
> > > > > > (git) wrote:
> > > > > > > From: Miklos Szeredi 
> > > > > > >
> > > > > >
> > > > > > What is readdirplus and what do we need a command line option to
> > > > > > control it ? What's the user benefit of changing the setting ?
> > > > >
> > > > > cc'ing Miklos who understands this better than me.
> > > > >
> > > > > My understanding is that readdirplus is a heuristic inherited from NFS
> > > > > where when you iterate over the directory you also pick up stat() data
> > > > > for each entry in the directory.  You then cache that stat data
> > > > > somewhere.
> > > > > The Plus-ness is that a lot of directory operations involve you 
> > > > > stating
> > > > > each entry (e.g. to figure out if you can access it etc) so rolling it
> > > > > into one op avoids the separate stat.  The unplus-ness is that it's an
> > > > > overhead and I think changes some of the caching behaviour.
> > > >
> > > > Yeah, so either may give better performance and it's hard to pick a
> > > > clear winner.  NFS also has an option to control this.
> > >
> > > IIUC from the man page, the NFS option for controlling this is a client
> > > side mount option. This makes sense as only the client really has 
> > > knowledge
> > > of whether its workload will benefit.
> > >
> > > With this in mind, should the readdirplus control for virtio-fs also be a
> > > guest mount option instead of a host virtiofsd CLI option ? The guest 
> > > admin
> > > seems best placed to know whether their workload will benefit or not.
> >
> > Definitely.   In fact other options, e.g. ones that control caching,
> > should probably also be client side (cache=XXX, writeback,
> > timeout=XXX, etc).
>
> I am not sure about cache options. So if we want to share a directory
> between multiple guests with stronger coherency (cache=none), then admin
> should decide that cache=always/auto is not supported on this export.
>
> Also, how will one client know whether there are other clients same
> directory with strong coherency requirements and it should use cache=none
> instead of cache=always/auto.
>
> Having said that, it also makes sense that client knows its workoad
> and can decide if cache=auto works best for it and use that instead.
>
> May be we need both client and server side options. Client will request
> certain cache=xxx options and server can deny these if admin decides
> not to enable that option for that particular mount.
>
> For example, if admin decides that we can only support cache=none on
> this particular dir due to other guest sharing it, then daemon should
> be able to deny cache=auto/always requests from client.

Makes sense.  The server dictates policy, the client just passes the
options onto the server.

Thanks,
Miklos




Re: [PATCH 068/104] virtiofsd: passthrough_ll: control readdirplus

2020-01-10 Thread Miklos Szeredi
On Fri, Jan 10, 2020 at 4:18 PM Daniel P. Berrangé  wrote:
>
> On Fri, Jan 10, 2020 at 04:13:08PM +0100, Miklos Szeredi wrote:
> > On Fri, Jan 10, 2020 at 4:04 PM Dr. David Alan Gilbert
> >  wrote:
> > >
> > > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > > On Thu, Dec 12, 2019 at 04:38:28PM +, Dr. David Alan Gilbert (git) 
> > > > wrote:
> > > > > From: Miklos Szeredi 
> > > > >
> > > >
> > > > What is readdirplus and what do we need a command line option to
> > > > control it ? What's the user benefit of changing the setting ?
> > >
> > > cc'ing Miklos who understands this better than me.
> > >
> > > My understanding is that readdirplus is a heuristic inherited from NFS
> > > where when you iterate over the directory you also pick up stat() data
> > > for each entry in the directory.  You then cache that stat data
> > > somewhere.
> > > The Plus-ness is that a lot of directory operations involve you stating
> > > each entry (e.g. to figure out if you can access it etc) so rolling it
> > > into one op avoids the separate stat.  The unplus-ness is that it's an
> > > overhead and I think changes some of the caching behaviour.
> >
> > Yeah, so either may give better performance and it's hard to pick a
> > clear winner.  NFS also has an option to control this.
>
> IIUC from the man page, the NFS option for controlling this is a client
> side mount option. This makes sense as only the client really has knowledge
> of whether its workload will benefit.
>
> With this in mind, should the readdirplus control for virtio-fs also be a
> guest mount option instead of a host virtiofsd CLI option ? The guest admin
> seems best placed to know whether their workload will benefit or not.

Definitely.   In fact other options, e.g. ones that control caching,
should probably also be client side (cache=XXX, writeback,
timeout=XXX, etc).

This needs an extension of the INIT message, so options can be passed
to the server.   Added this to our TODO list.

Thanks,
Miklos




Re: [PATCH 068/104] virtiofsd: passthrough_ll: control readdirplus

2020-01-10 Thread Miklos Szeredi
On Fri, Jan 10, 2020 at 4:04 PM Dr. David Alan Gilbert
 wrote:
>
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > On Thu, Dec 12, 2019 at 04:38:28PM +, Dr. David Alan Gilbert (git) 
> > wrote:
> > > From: Miklos Szeredi 
> > >
> >
> > What is readdirplus and what do we need a command line option to
> > control it ? What's the user benefit of changing the setting ?
>
> cc'ing Miklos who understands this better than me.
>
> My understanding is that readdirplus is a heuristic inherited from NFS
> where when you iterate over the directory you also pick up stat() data
> for each entry in the directory.  You then cache that stat data
> somewhere.
> The Plus-ness is that a lot of directory operations involve you stating
> each entry (e.g. to figure out if you can access it etc) so rolling it
> into one op avoids the separate stat.  The unplus-ness is that it's an
> overhead and I think changes some of the caching behaviour.

Yeah, so either may give better performance and it's hard to pick a
clear winner.  NFS also has an option to control this.

Thanks,
Miklos




Re: [Virtio-fs] [PATCH 0/2] virtiofsd: Two fix for xattr operation

2019-10-18 Thread Miklos Szeredi
On Thu, Oct 17, 2019 at 6:48 PM Miklos Szeredi  wrote:

> Even simpler: allow O_PATH descriptors for f*xattr().

Attached patch.  Will post shortly.

However, I think it would make sense to fix virtiofsd as well, as this
will take time to percolate down, even if Al doesn't find anything
wrong with it.

Doing unshare(CLONE_FS) after thread startup seems safe, though must
be careful to change the working directory to the root of the mount
*before* starting any threads.

Thanks,
Miklos
diff --git a/fs/xattr.c b/fs/xattr.c
index 90dd78f0eb27..fd1335b86e60 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -495,7 +495,7 @@ SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
 SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name,
 		const void __user *,value, size_t, size, int, flags)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_raw(fd);
 	int error = -EBADF;
 
 	if (!f.file)
@@ -587,7 +587,7 @@ SYSCALL_DEFINE4(lgetxattr, const char __user *, pathname,
 SYSCALL_DEFINE4(fgetxattr, int, fd, const char __user *, name,
 		void __user *, value, size_t, size)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_raw(fd);
 	ssize_t error = -EBADF;
 
 	if (!f.file)
@@ -662,7 +662,7 @@ SYSCALL_DEFINE3(llistxattr, const char __user *, pathname, char __user *, list,
 
 SYSCALL_DEFINE3(flistxattr, int, fd, char __user *, list, size_t, size)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_raw(fd);
 	ssize_t error = -EBADF;
 
 	if (!f.file)
@@ -727,7 +727,7 @@ SYSCALL_DEFINE2(lremovexattr, const char __user *, pathname,
 
 SYSCALL_DEFINE2(fremovexattr, int, fd, const char __user *, name)
 {
-	struct fd f = fdget(fd);
+	struct fd f = fdget_raw(fd);
 	int error = -EBADF;
 
 	if (!f.file)


Re: [Virtio-fs] [PATCH 0/2] virtiofsd: Two fix for xattr operation

2019-10-17 Thread Miklos Szeredi
On Thu, Oct 17, 2019 at 6:09 PM Stefan Hajnoczi  wrote:
>
> On Thu, Oct 17, 2019 at 01:23:57PM +0200, Miklos Szeredi wrote:
> > On Thu, Oct 17, 2019 at 12:05 PM Stefan Hajnoczi  
> > wrote:
> > >
> > > On Wed, Oct 16, 2019 at 07:37:52PM +0900, Misono Tomohiro wrote:
> > > > Hello,
> > > >
> > > > I tested xattr operations on virtiofs using xfstests generic/062
> > > > (with the -o xattr option and an XFS backend) and saw some problems.
> > > >
> > > > These patches fix two of the problems.
> > > >
> > > > The remaining problems are:
> > > >  1. we cannot set or get xattrs on a block device created by mknod
> > > > that does not have an actual device (since open in virtiofsd fails)
> > > >  2. we cannot set or get xattrs on a symbolic link
> > > >
> > > > I don't think 1 is a big problem, but can we fix 2?
> > >
> > > Sorry, I don't know the answer.  Maybe it would be necessary to add a
> > > new O_SYMLINK open flag to open(2) so that fgetxattr()/fsetxattr()
> > > operations can be performed.  A kernel change like that would take some
> > > time to get accepted upstream and shipped by distros, but it might be
> > > the only way since the current syscall interface doesn't seem to offer a
> > > way to do this.
> >
> > The real problem is that open() on a non-regular, non-directory file
> > may have side effects (unless O_PATH is used).  These patches try to
> > paper over that, but the fact is: opening special files from a file
> > server is forbidden.
> >
> > I see why this is being done, and it's not easy to fix properly
> > without the ..at() versions of these syscalls.  One idea is to fork()
> > + fchdir(lo->proc_self_fd) + ..xattr().  Another related idea is to do
> > a unshare(CLONE_FS) after each thread's startup (will pthread library
> > balk?  I don't know) so that it's safe to do fchdir(lo->proc_self_fd)
> > + ...xattr() + fchdir(lo->root_fd).
>
> Looking at the f*xattr() code in the kernel, it doesn't really care
> about the file descriptor, it wants the inode instead.  So the O_SYMLINK
> idea I mentioned could also be called O_XATTR and be similar to O_PATH,
> except that only f*xattr() calls are allowed on this file descriptor.

Even simpler: allow O_PATH descriptors for f*xattr().

Thanks,
Miklos



Re: [Virtio-fs] [PATCH 0/2] virtiofsd: Two fix for xattr operation

2019-10-17 Thread Miklos Szeredi
On Thu, Oct 17, 2019 at 5:25 PM Vivek Goyal  wrote:
>
> On Thu, Oct 17, 2019 at 01:23:57PM +0200, Miklos Szeredi wrote:
> > On Thu, Oct 17, 2019 at 12:05 PM Stefan Hajnoczi  
> > wrote:
> > >
> > > On Wed, Oct 16, 2019 at 07:37:52PM +0900, Misono Tomohiro wrote:
> > > > Hello,
> > > >
> > > > I tested xattr operations on virtiofs using xfstests generic/062
> > > > (with the -o xattr option and an XFS backend) and saw some problems.
> > > >
> > > > These patches fix two of the problems.
> > > >
> > > > The remaining problems are:
> > > >  1. we cannot set or get xattrs on a block device created by mknod
> > > > that does not have an actual device (since open in virtiofsd fails)
> > > >  2. we cannot set or get xattrs on a symbolic link
> > > >
> > > > I don't think 1 is a big problem, but can we fix 2?
> > >
> > > Sorry, I don't know the answer.  Maybe it would be necessary to add a
> > > new O_SYMLINK open flag to open(2) so that fgetxattr()/fsetxattr()
> > > operations can be performed.  A kernel change like that would take some
> > > time to get accepted upstream and shipped by distros, but it might be
> > > the only way since the current syscall interface doesn't seem to offer a
> > > way to do this.
> >
> > The real problem is that open() on a non-regular, non-directory file
> > may have side effects (unless O_PATH is used).  These patches try to
> > paper over that, but the fact is: opening special files from a file
> > server is forbidden.
> >
> > I see why this is being done, and it's not easy to fix properly
> > without the ..at() versions of these syscalls.  One idea is to fork()
> > + fchdir(lo->proc_self_fd) + ..xattr().  Another related idea is to do
> > a unshare(CLONE_FS) after each thread's startup (will pthread library
> > balk?  I don't know) so that it's safe to do fchdir(lo->proc_self_fd)
> > + ...xattr() + fchdir(lo->root_fd).
>
> Hi Miklos,
>
> Trying to understand your proposal. So if we want to do an ..xattr()
> operation on a special file (and we don't have _at() variants), how
> will fchdir() help?  Are you suggesting an fchdir() to the parent and
> then doing something special?
>
> Can you please elaborate on your proposal a bit more?  I think I have
> missed the core idea completely.
>
> I understand that you want to do unshare(CLONE_FS) to make sure one
> thread's fchdir() does not impact other threads.  But I did not
> understand how fchdir() alone is enough to do getxattr()/setxattr() on
> a special file.

The fchdir() call is to avoid having to do openat().  The openat() was
needed because /proc/self/fd/ is only accessible through a file
descriptor (lo->proc_self_fd).

Thanks,
Miklos



Re: [Virtio-fs] [PATCH 0/2] virtiofsd: Two fix for xattr operation

2019-10-17 Thread Miklos Szeredi
On Thu, Oct 17, 2019 at 1:23 PM Miklos Szeredi  wrote:

> I see why this is being done, and it's not easy to fix properly
> without the ..at() versions of these syscalls.  One idea is to fork()
> + fchdir(lo->proc_self_fd) + ..xattr().  Another related idea is to do
> a unshare(CLONE_FS) after each thread's startup (will pthread library

Apparently Samba does this, so it should be safe.

Thanks,
Miklos



Re: [Virtio-fs] [PATCH 0/2] virtiofsd: Two fix for xattr operation

2019-10-17 Thread Miklos Szeredi
On Thu, Oct 17, 2019 at 12:05 PM Stefan Hajnoczi  wrote:
>
> On Wed, Oct 16, 2019 at 07:37:52PM +0900, Misono Tomohiro wrote:
> > Hello,
> >
> > I tested xattr operations on virtiofs using xfstests generic/062
> > (with the -o xattr option and an XFS backend) and saw some problems.
> >
> > These patches fix two of the problems.
> >
> > The remaining problems are:
> >  1. we cannot set or get xattrs on a block device created by mknod
> > that does not have an actual device (since open in virtiofsd fails)
> >  2. we cannot set or get xattrs on a symbolic link
> >
> > I don't think 1 is a big problem, but can we fix 2?
>
> Sorry, I don't know the answer.  Maybe it would be necessary to add a
> new O_SYMLINK open flag to open(2) so that fgetxattr()/fsetxattr()
> operations can be performed.  A kernel change like that would take some
> time to get accepted upstream and shipped by distros, but it might be
> the only way since the current syscall interface doesn't seem to offer a
> way to do this.

The real problem is that open() on a non-regular, non-directory file
may have side effects (unless O_PATH is used).  These patches try to
paper over that, but the fact is: opening special files from a file
server is forbidden.

I see why this is being done, and it's not easy to fix properly
without the ..at() versions of these syscalls.  One idea is to fork()
+ fchdir(lo->proc_self_fd) + ..xattr().  Another related idea is to do
an unshare(CLONE_FS) after each thread's startup (will the pthread library
balk?  I don't know) so that it's safe to do fchdir(lo->proc_self_fd)
+ ...xattr() + fchdir(lo->root_fd).

Thanks,
Miklos