from:"Vivek Goyal"

Re: Status of DAX for virtio-fs/virtiofsd?

2023-05-18 Thread Vivek Goyal

On Wed, May 17, 2023 at 12:26:18PM -0400, Stefan Hajnoczi wrote:
> On Wed, 17 May 2023 at 11:54, Alex Bennée  wrote:
> Hi Alex,
> There were two unresolved issues:
> 
> 1. How to inject SIGBUS when the guest accesses a page that's beyond
> the end-of-file.
> 2. Implementing the vhost-user messages for mapping ranges of files to
> the vhost-user frontend.
> 
> The harder problem is SIGBUS. An mmap area may be larger than the
> length of the file. Or another process could truncate the file while
> it's mmapped, causing a previously correctly sized mmap to become
> longer than the actual file. When a page beyond the end of file is
> accessed, the kernel raises SIGBUS.
> 
> When this scenario occurs in the DAX Window, kvm.ko gets some type of
> vmexit (fault) and the code currently enters an infinite loop because
> it expects KVM memory regions to resolve faults. Since there is no
> page backing that part of the vma, the fault handling fails and the
> code loops trying to do this forever.
> 
> There needs to be a way to inject this fault back into the guest.
> However, we did not find a way to do that. We considered Machine
> Check Exceptions (MCEs), x86 interrupts, and paravirtualized
> approaches. None of them looked like a clean and sane way to do this.
> The Linux maintainers for MCEs and kvm.ko were not excited about
> supporting this.
> 
> So in the end, SIGBUS was never solved. It leads to a DoS because the
> host kernel will enter an infinite loop. We decided that until there
> is progress on SIGBUS, we can't go ahead with DAX Windows in
> production.
> 
> The easier problem is adding new vhost-user messages. It does lead to
> a fundamental change in the vhost-user protocol: the presence of the
> DAX Window means there are memory ranges that cannot be accessed via
> shared memory. Imagine Device A has a DAX Window and Device B needs to
> DMA to/from it. That doesn't work because the mmaps happen inside the
> frontend (QEMU), so Device B doesn't have access to the current
> mappings. The fundamental change to vhost-user is that virtqueue
> descriptor mapping code must now deal with the situation where guest
> addresses are absent from the shared memory regions and instead send
> vhost-user protocol messages to read/write to/from bounce buffers
> instead. The rest of the device backend does not require modification.
> This is a slow path, but at least it works whereas currently the I/O
> would fail because the memory is absent. Other solutions to the
> vhost-user DMA problem exist, but this is the one that Dave and I last
> discussed.
> 
> In the end, there is still work to do to make the DAX Window
> supportable. There is experimental code out there that kind of works,
> but we felt it was incomplete.

I feel it would be good if someone could solve the vhost-user problem
first and get the patches upstream. The C virtiofsd has been removed from
QEMU, so someone will have to add DAX support to the Rust virtiofsd
(and make the corresponding vhost-user changes in QEMU).

Once that is done, someone can look into the MCE issue.

With the vhost-user problem solved, DAX will be usable in non-shared mode,
that is, when a host filesystem is simply passed through to the guest and
not even the host modifies it. That should steer us clear of the truncation
issue.

virtiofs DAX is a good piece of technology and provides a speedup in many
cases. It would be sad to see the patches lost.

People are now posting fixes to the kernel side of DAX and there is no good
way to test them. I will try to get the old DAX branch David had working so
that I can test kernel changes, but I am sure it will stop working at some
point, and I don't want the virtiofs kernel DAX code to become unstable.

It would be good if somebody took up this project and made it happen.

Thanks
Vivek

> 
> To your specific questions:
> 
> >  * What VMM/daemon combinations has DAX been tested on?
> 
> Only the experimental virtio-fs Kata Containers kernels and QEMU
> builds that were available a few years ago. I don't think the code has
> been rebased.
> 
> >  * Isn't it time the vhost-user spec is updated?
> 
> I don't know if Dave ever wrote the spec for or implemented the final
> version of the vhost-user protocol messages we discussed.
> 
> >  * Is anyone picking up Dave's patches for the QEMU side of support?
> 
> Not at the moment. It would be nice to support, but someone needs the
> energy/time/focus to deal with the outstanding issues I mentioned.
> 
> If you want to work on it, feel free to include me. I can help dig up
> old discussions and give input.
> 
> Stefan
>

Re: Use of unshare(CLONE_FS) in virtiofsd

2022-11-04 Thread Vivek Goyal



On Fri, Nov 04, 2022 at 08:50:45AM +0100, Florian Weimer wrote:
> I've got a proposed extension for glibc's pthread_create which allows
> the creation of threads with a dedicated current working
> directory/umask/chroot:
> 
>   [PATCH 0/2] Introduce per-thread file system properties on Linux
>   <https://sourceware.org/pipermail/libc-alpha/2022-October/142640.html>
> 
> I expect that glibc integration will work around the seccomp issue
> mentioned in a comment (also brought up by the Samba people for their
> use) because glibc will perform the unshare directly during the clone
> system call, and not via a separate system call.
> 
> I see that unshare(CLONE_FS) was introduced in this commit:
> 
> commit bdfd66788349acc43cd3f1298718ad491663cfcc
> Author: Misono Tomohiro 
> Date:   Thu Feb 27 14:59:27 2020 +0900
> 
> virtiofsd: Fix xattr operations
> 
> Current virtiofsd has problems about xattr operations and
> they does not work properly for directory/symlink/special file.
> 
> The fundamental cause is that virtiofsd uses openat() + f...xattr()
> systemcalls for xattr operation but we should not open symlink/special
> file in the daemon. Therefore the function is restricted.
> 
> Fix this problem by:
>  1. during setup of each thread, call unshare(CLONE_FS)
>  2. in xattr operations (i.e. lo_getxattr), if inode is not a regular
> file or directory, use fchdir(proc_loot_fd) + ...xattr() +
> fchdir(root.fd) instead of openat() + f...xattr()
> 
> (Note: for a regular file/directory openat() + f...xattr()
>  is still used for performance reason)
> 
> With this patch, xfstests generic/062 passes on virtiofs.
> 
> This fix is suggested by Miklos Szeredi and Stefan Hajnoczi.
> The original discussion can be found here:
>   https://www.redhat.com/archives/virtio-fs/2019-October/msg00046.html
>     
>     Signed-off-by: Misono Tomohiro 
> Message-Id: <20200227055927.24566-3-misono.tomoh...@jp.fujitsu.com>
> Acked-by: Vivek Goyal 
> Reviewed-by: Dr. David Alan Gilbert 
> Signed-off-by: Dr. David Alan Gilbert 
> 
> Now the question has come up on the libc-coord list why the *at
> interfaces are not used in such cases:
> 
>   <https://www.openwall.com/lists/libc-coord/2022/10/24/3>
> 
> Clearly the kernel lacks support for fgetxattrat today.

[ CC German, Sergio ]

fgetxattrat() will be nice.

>  The usual
> recommendation for emulating it is to use openat with O_PATH, and then
> use getxattr on the virtual /proc/self/fd path.  This needs an
> additional system call (openat, getxattr, close instead of fchdir,
> getxattr),

openat(O_PATH) + getxattr(/proc/self/fd) + close() sounds reasonable
too. I am not sure why we did not take that path. Maybe it was the extra
syscall, or something else.
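
A minimal sketch of that emulation, assuming Linux with /proc mounted. The helper name getxattr_at_nofollow() is hypothetical and this is not the code virtiofsd uses. It opens the target with O_PATH | O_NOFOLLOW, so a symlink or special file is never actually opened, then reads the attribute through the /proc/self/fd entry of that descriptor.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for O_PATH */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the missing fgetxattrat(): read xattr "attr"
 * of "name" (resolved relative to dirfd) without following a final symlink
 * and without changing the working directory.
 * Returns the attribute size, or -1 with errno set.
 */
static ssize_t getxattr_at_nofollow(int dirfd, const char *name,
                                    const char *attr, void *buf, size_t size)
{
    char procpath[64];
    ssize_t ret;
    int saved_errno;
    int fd;

    /* O_PATH yields a handle to the inode; the file itself is not opened. */
    fd = openat(dirfd, name, O_PATH | O_NOFOLLOW | O_CLOEXEC);
    if (fd < 0) {
        return -1;
    }

    /* The /proc/self/fd entry refers directly to the already-open inode. */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    ret = getxattr(procpath, attr, buf, size);

    /* Preserve getxattr()'s errno across close(). */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return ret;
}
```

This costs three syscalls (openat, getxattr, close) but needs neither fchdir() nor unshare(CLONE_FS).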

> but it avoids the unshare(CLONE_FS) call behind libc's back.

Hmm, I did not know that libc does not like threads calling
unshare(CLONE_FS). I am not sure why that is a problem.

BTW, we need a separate umask per thread as well. During file creation
we might switch to the umask provided in the FUSE protocol message
and then switch back. Since multiple threads might be doing this
in parallel, the umask of course needs to be a per-thread
property.
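
A minimal sketch of why unshare(CLONE_FS) gives a per-thread umask, assuming Linux and pthreads. The function names are hypothetical and this is not virtiofsd code. The worker detaches its filesystem attributes and then changes its umask; the check in the creating thread confirms the process-wide umask was not affected.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for unshare() and CLONE_FS */
#endif
#include <pthread.h>
#include <sched.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Worker: take a private copy of cwd/root/umask, then change the umask. */
static void *umask_worker(void *arg)
{
    int *status = arg;

    if (unshare(CLONE_FS) != 0) {
        *status = -1;           /* e.g. blocked by a seccomp policy */
        return NULL;
    }
    umask(077);                 /* Affects only this thread now. */
    *status = 0;
    return NULL;
}

/*
 * Hypothetical check: returns 1 if the worker's umask change stayed
 * private to the worker, 0 if it leaked into the calling thread, and
 * -1 if the thread could not be created or unshare() failed.
 */
static int thread_umask_is_private(void)
{
    pthread_t tid;
    int status = -1;
    mode_t saved, seen;

    saved = umask(022);         /* Set a known value, remember the old one. */
    if (pthread_create(&tid, NULL, umask_worker, &status) != 0) {
        umask(saved);
        return -1;
    }
    pthread_join(tid, NULL);

    seen = umask(saved);        /* Read the current value, restore the old. */
    if (status != 0) {
        return -1;
    }
    return seen == 022;
}
```

Without the unshare(CLONE_FS) call, the umask(077) in the worker would change the umask of every thread in the process.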

So if your patches for pthread_create() with per-thread filesystem
attributes finally go upstream, I guess we should be able to
make use of them and drop unshare(CLONE_FS).

Thanks
Vivek

> The directory entries in /proc/self/fd present as symbolic links, but
> are not implemented as such by the kernel: there is no separate pathname
> lookup for already-open O_PATH descriptors, so there is no race.
> 
> Thoughts?
> 
> Thanks,
> Florian
>

Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-10-05 Thread Vivek Goyal

On Mon, Oct 03, 2022 at 06:51:42PM -0400, Colin Walters wrote:
> 
> 
> On Thu, Sep 29, 2022, at 1:03 PM, Vivek Goyal wrote:
> > 
> > So rust version of virtiofsd, already supports running unprivileged
> > (inside a user namespace).
> 
> I know, but as I already said, the use case here is running inside an 
> OpenShift unprivileged pod where *we are already in a container*.
> 
> > host$ podman unshare -- virtiofsd --socket-path=/tmp/vfsd.sock --shared-dir /mnt \
> >       --announce-submounts --sandbox chroot &
> 
> Yes, but in current OCP 4.11 our seccomp policy denies CLONE_NEWUSER:

Hmm..., no user namespaces allowed. 

So sandbox=none should in theory work once we fix it for unprivileged
users.

https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/136

Given that you are already running inside a pod/container, I am not sure
locking down virtiofsd with openat2(RESOLVE_IN_ROOT)/landlock is a
must for you from a security point of view. virtiofsd should not be
able to access anything outside the pod/container anyway and can
only affect things inside the pod/container.

Once we add support for openat2(), the next issue is whether you need
arbitrary uid/gid support. By default it will be a single uid/gid
filesystem. Is that enough for your use case? Or do you need to be
able to switch between arbitrary uids/gids on this virtiofs
filesystem inside the guest?

> 
> ```
> $ unshare -m
> unshare: unshare failed: Function not implemented
> ```
> 
> https://docs.openshift.com/container-platform/4.11/security/seccomp-profiles.html
> 
> > I think only privileged operation it needs is assigning a range of
> > subuid/subgid to the uid you are using on host.
> 
> We also turn on NO_NEW_PRIVILEGES by default in OCP pods.  
> 
> Now, I *could* in general get elevated permissions where I need to today.  
> But it's also really important to me to have a long term goal of having 
> operating system builds and tests work well as "just another workload" in our 
> production container platform (now, one *does* want to bind in /dev/kvm, but 
> that's generally safe, and even that strictly speaking is optional if one can 
> stomach the ~10x perf hit).

I am assuming this 10x performance hit is being compared with native
container build and test where no VM will be launched.

> 
> > Can you give rust virtiofsd (unprivileged) a try.
> 
> I admit to not actually trying it in a pod, but I think we all agree it can't 
> work, and the only thing that can today is openat2.

Agreed. Right now we rely on user namespaces for the unprivileged use
case.

We should be able to enable sandbox=none for unprivileged users (no user
namespace) and possibly add openat2() support as well.

I think providing arbitrary uid/gid support will be trickier
and more work. It will need to store the actual uid/gid in some
sort of user xattr (as done by 9pfs, fuse-overlay, libkrun, etc.).
And I will not be surprised if there are a bunch of corner cases with
that approach (automatic setuid/setgid clearing, etc.).

Thanks
Vivek

Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-29 Thread Vivek Goyal

On Thu, Sep 29, 2022 at 11:47:32AM -0400, Colin Walters wrote:
> 
> 
> On Thu, Sep 29, 2022, at 10:10 AM, Vivek Goyal wrote:
> 
> > What's your use case. How do you plan to use virtiofs.
> 
> At the current time, the Kubernetes that we run does not support user 
> namespaces.  We want to do the production builds of our operating system 
> (Fedora CoreOS and RHEL CoreOS) today inside an *unprivileged* Kubernetes pod 
> (actually in OpenShift using anyuid, i.e. random unprivileged uid too), just 
> with /dev/kvm exposed from the host (which is safe).  Operating system builds 
> *and* tests in qemu are just another workload that can be shared with other 
> tenants.
> 
> qemu works fine in this model, as does 9p.  It's just the virtiofs isolation 
> requires privileges to be used today.

[ cc German ]

Hi Colin,

So the Rust version of virtiofsd already supports running unprivileged
(inside a user namespace).

https://gitlab.com/virtio-fs/virtiofsd/-/blob/main/README.md#running-as-non-privileged-user

host$ podman unshare -- virtiofsd --socket-path=/tmp/vfsd.sock --shared-dir /mnt \
      --announce-submounts --sandbox chroot &

I think the only privileged operation it needs is assigning a range of
subuid/subgid to the uid you are using on the host.

I think that should be usable for you as of now.

Having said that, openat2() and landlock are interesting improvements,
especially when somebody does not want to use user namespaces. Without
user namespaces, one will not be able to do arbitrary switching of uid/gid.
IOW, inside the guest, you will be limited to one uid/gid.

I am hoping German or somebody else can have a look at the openat2() and
landlock improvements in the near future.

I am assuming you are fine with using user namespaces on the host. Assigning
a subuid/subgid range will then allow arbitrary switching
of uid/gid inside the guest.

Can you give rust virtiofsd (unprivileged) a try?

Thanks
Vivek

Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-29 Thread Vivek Goyal

On Thu, Sep 29, 2022 at 10:04:36AM -0400, Colin Walters wrote:
> On Wed, Sep 28, 2022, at 3:28 PM, Vivek Goyal wrote:
> 
> > Sounds reasonable. In fact, we could probably do someting similar
> > for "landlock" as well. 
> 
> Thanks for the discussion all!  Can someone (vaguely) commit to look into 
> this in say the next few months?  It's not *urgent*, we can live with the 9p 
> flakes and problems short term, just trying to figure out if this needs to be 
> on our medium-term radar or not.  Thanks!

Hi Colin,

What's your use case? How do you plan to use virtiofs?

Thanks
Vivek

Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-28 Thread Vivek Goyal

On Wed, Sep 28, 2022 at 10:33:40AM +0200, Sergio Lopez wrote:
> On Tue, Sep 27, 2022 at 04:14:20PM -0400, Stefan Hajnoczi wrote:
> > On Tue, Sep 27, 2022 at 01:51:41PM -0400, Colin Walters wrote:
> > > 
> > > 
> > > On Tue, Sep 27, 2022, at 1:27 PM, German Maglione wrote:
> > > >
> > > >> > Now all the development has moved to rust virtiofsd.
> > > 
> > > Oh, awesome!!  The code there looks great.
> > > 
> > > > I could work on this for the next major version and see if anything 
> > > > breaks.
> > > > But I prefer to add this as a compilation feature, instead of a command 
> > > > line
> > > > option that we will then have to maintain for a while.
> > > 
> > > Hmm, what would be the issue with having the code there by default?  I 
> > > think rather than any new command line option, we automatically use 
> > > `openat2+RESOLVE_IN_ROOT` if the process is run as a nonzero uid.
> > > 
> > > > Also, I don't see it as a sandbox feature, as Stefan mentioned, a 
> > > > compromised
> > > > process can call openat2() without RESOLVE_IN_ROOT. 
> > > 
> > > I'm a bit skeptical honestly about how secure the existing namespace code 
> > > is against a compromised virtiofsd process.  The primary worry is guest 
> > > filesystem traversals, right?  openat2+RESOLVE_IN_ROOT addresses that.  
> > > Plus being in Rust makes this dramatically safer.
> > > 
> > > > I did some test with
> > > > Landlock to lock virtiofsd inside the shared directory, but IIRC it 
> > > > requires a
> > > > kernel 5.13
> > > 
> > > But yes, landlock and other things make sense, I just don't see these 
> > > things as strongly linked.  IOW we shouldn't in my opinion block 
> > > unprivileged virtiofsd on more sandboxing than openat2 already gives us.
> > 
> > I think openat2(RESOLVE_IN_ROOT) support should be added unless there is
> > another unprivileged mechanism that is stronger.
> > 
> > The security implications need to be covered in the user documentation
> > so people can decide whether using this mode is appropriate.
> > 
> > We should continue to explain the difference between a voluntary
> > mechanism like openat2(RESOLVE_IN_ROOT) and a mandatory mechanism like
> > mount namespaces with pivot_root(2). Rust programs are not immune to
> > arbitrary code execution, but it's less likely than with a C program.
> 
> I agree. Perhaps we could modify the "none" sandbox mode to use
> openat2, if available, and add an "openat2" mode which does basically
> the same thing, but bailing out if openat2 is not available.

Sounds reasonable. In fact, we could probably do something similar
for "landlock" as well.

Vivek

> 
> And explain this clearly in the docs, of course.
> 
> Sergio.

Re: virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-28 Thread Vivek Goyal

On Tue, Sep 27, 2022 at 07:27:02PM +0200, German Maglione wrote:
> On Tue, Sep 27, 2022 at 6:57 PM Vivek Goyal  wrote:
> >
> > On Tue, Sep 27, 2022 at 12:37:15PM -0400, Vivek Goyal wrote:
> > > On Fri, Sep 09, 2022 at 05:24:03PM -0400, Colin Walters wrote:
> > > > We previously had a chat here 
> > > > https://lore.kernel.org/all/348d4774-bd5f-4832-bd7e-a21491fda...@www.fastmail.com/T/
> > > > around virtiofsd and privileges and the case of trying to run virtiofsd 
> > > > inside an unprivileged (Kubernetes) container.
> > > >
> > > > Right now we're still using 9p, and it has bugs (basically it seems 
> > > > like the 9p inode flushing callback tries to allocate memory to send an 
> > > > RPC, and this causes OOM problems)
> > > > https://github.com/coreos/coreos-assembler/issues/1812
> > > >
> > > > Coming back to this...as of lately in Linux, there's support for 
> > > > strongly isolated filesystem access via openat2():
> > > > https://lwn.net/Articles/796868/
> > > >
> > > > Is there any reason we couldn't do an -o sandbox=openat2 ?  This 
> > > > operates without any privileges at all, and should be usable (and 
> > > > secure enough) in our use case.
> > >
> > > [ cc virtio-fs-list, german, sergio ]
> > >
> > > Hi Colin,
> > >
> > > Using openat2(RESOLVE_IN_ROOT) (if kernel is new enough), sounds like a
> > > good idea. We talked about it few times but nobody ever wrote a patch to
> > > implement it.
> > >
> > > And it probably makes sense with all the sandboxes (chroot(), namespaces).
> > >
> > > I am wondering that it probably should not be a new sandbox mode at all.
> > > It probably should be the default if kernel offers openat2() syscall.
> > >
> > > Now all the development has moved to rust virtiofsd.
> > >
> > > https://gitlab.com/virtio-fs/virtiofsd
> > >
> > > C version of virtiofsd is just seeing small critical fixes.
> > >
> > > And rust version allows running unprivileged (inside a user namespace).
> > > German is also working on allowing running unprivileged without
> > > user namespaces but this will not allow arbitrary uid/gid switching.
> > >
> > > https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/136
> > >
> > > If one wants to run unprivileged and also do arbitrary uid/gid switching,
> > > then you need to use user namespaces and map a range of subuid/subgid
> > > into the user namespace virtiofsd is running in.
> > >
> > > If possible, please try to use rust virtiofsd for your situation. Its
> > > already packaged for fedora.
> > >
> > > Coming back to original idea of using openat2(), I think we should
> > > probably give it a try in rust virtiofsd and if it works, it should
> > > work across all the sandboxing modes.
> >
> > Thinking more about it, enabling openat2() usage conditionally based on
> > some option probably is not a bad idea. I was assuming that using
> > openat2() by default will not break any of the existing use cases. But
> > I am not sure. I have burnt my fingers so many times and had to back
> > out on default settings that enabling usage of openat2() conditionally
> > will probably be a safer choice. :-)
> >
> 
> I could work on this for the next major version and see if anything breaks.
> But I prefer to add this as a compilation feature, instead of a command line
> option that we will then have to maintain for a while.

What does compilation feature mean? One can compile it out? If it is
compiled in, is it enabled by default?

> 
> Also, I don't see it as a sandbox feature, as Stefan mentioned, a compromised
> process can call openat2() without RESOLVE_IN_ROOT. I did some test with
> Landlock to lock virtiofsd inside the shared directory, but IIRC it requires a
> kernel 5.13

Landlock sounds interesting. Maybe use it by default if the kernel offers it.

The question will be: of the security mechanisms we are using, how many
are mutually exclusive and how many can be used together?

A. pivot_root()
B. chroot()
C. openat2()
D. landlock
E. seccomp

Seccomp goes well with everything. 
landlock probably will go well as well.

pivot_root() and chroot() are currently mutually exclusive.

openat2() is probably redundant if pivot_root()/chroot()/landlock is
being used. But should work anyway.

Something to document as Stefan suggested.

Vivek

Re: virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-27 Thread Vivek Goyal

On Tue, Sep 27, 2022 at 12:37:15PM -0400, Vivek Goyal wrote:
> On Fri, Sep 09, 2022 at 05:24:03PM -0400, Colin Walters wrote:
> > We previously had a chat here 
> > https://lore.kernel.org/all/348d4774-bd5f-4832-bd7e-a21491fda...@www.fastmail.com/T/
> > around virtiofsd and privileges and the case of trying to run virtiofsd 
> > inside an unprivileged (Kubernetes) container.
> > 
> > Right now we're still using 9p, and it has bugs (basically it seems like 
> > the 9p inode flushing callback tries to allocate memory to send an RPC, and 
> > this causes OOM problems)
> > https://github.com/coreos/coreos-assembler/issues/1812
> > 
> > Coming back to this...as of lately in Linux, there's support for strongly 
> > isolated filesystem access via openat2():
> > https://lwn.net/Articles/796868/
> > 
> > Is there any reason we couldn't do an -o sandbox=openat2 ?  This operates 
> > without any privileges at all, and should be usable (and secure enough) in 
> > our use case.
> 
> [ cc virtio-fs-list, german, sergio ]
> 
> Hi Colin,
> 
> Using openat2(RESOLVE_IN_ROOT) (if kernel is new enough), sounds like a
> good idea. We talked about it few times but nobody ever wrote a patch to
> implement it.
> 
> And it probably makes sense with all the sandboxes (chroot(), namespaces).
> 
> I am wondering that it probably should not be a new sandbox mode at all.
> It probably should be the default if kernel offers openat2() syscall.
> 
> Now all the development has moved to rust virtiofsd.
> 
> https://gitlab.com/virtio-fs/virtiofsd
> 
> C version of virtiofsd is just seeing small critical fixes.
> 
> And rust version allows running unprivileged (inside a user namespace).
> German is also working on allowing running unprivileged without
> user namespaces but this will not allow arbitrary uid/gid switching.
> 
> https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/136
> 
> If one wants to run unprivileged and also do arbitrary uid/gid switching,
> then you need to use user namespaces and map a range of subuid/subgid
> into the user namespace virtiofsd is running in.
> 
> If possible, please try to use rust virtiofsd for your situation. Its
> already packaged for fedora.
> 
> Coming back to original idea of using openat2(), I think we should
> probably give it a try in rust virtiofsd and if it works, it should
> work across all the sandboxing modes.

Thinking more about it, enabling openat2() usage conditionally based on
some option probably is not a bad idea. I was assuming that using
openat2() by default will not break any of the existing use cases. But
I am not sure. I have burnt my fingers so many times and had to back
out on default settings that enabling usage of openat2() conditionally
will probably be a safer choice. :-)

Vivek

Re: virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-27 Thread Vivek Goyal

On Fri, Sep 09, 2022 at 05:24:03PM -0400, Colin Walters wrote:
> We previously had a chat here 
> https://lore.kernel.org/all/348d4774-bd5f-4832-bd7e-a21491fda...@www.fastmail.com/T/
> around virtiofsd and privileges and the case of trying to run virtiofsd 
> inside an unprivileged (Kubernetes) container.
> 
> Right now we're still using 9p, and it has bugs (basically it seems like the 
> 9p inode flushing callback tries to allocate memory to send an RPC, and this 
> causes OOM problems)
> https://github.com/coreos/coreos-assembler/issues/1812
> 
> Coming back to this...as of lately in Linux, there's support for strongly 
> isolated filesystem access via openat2():
> https://lwn.net/Articles/796868/
> 
> Is there any reason we couldn't do an -o sandbox=openat2 ?  This operates 
> without any privileges at all, and should be usable (and secure enough) in 
> our use case.

[ cc virtio-fs-list, german, sergio ]

Hi Colin,

Using openat2(RESOLVE_IN_ROOT) (if the kernel is new enough) sounds like a
good idea. We talked about it a few times but nobody ever wrote a patch to
implement it.

And it probably makes sense with all the sandboxes (chroot(), namespaces).

I am wondering that it probably should not be a new sandbox mode at all.
It probably should be the default if kernel offers openat2() syscall.

Now all the development has moved to rust virtiofsd.

https://gitlab.com/virtio-fs/virtiofsd

C version of virtiofsd is just seeing small critical fixes.

And rust version allows running unprivileged (inside a user namespace).
German is also working on allowing running unprivileged without
user namespaces but this will not allow arbitrary uid/gid switching.

https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/136

If one wants to run unprivileged and also do arbitrary uid/gid switching,
then one needs to use user namespaces and map a range of subuid/subgid
into the user namespace virtiofsd is running in.

If possible, please try to use rust virtiofsd for your situation. It's
already packaged for Fedora.

Coming back to original idea of using openat2(), I think we should
probably give it a try in rust virtiofsd and if it works, it should
work across all the sandboxing modes.

Thanks
Vivek

> 
> I may try a patch if this sounds OK...
>

Re: [Virtio-fs] [PATCH] virtiofsd: use g_date_time_get_microsecond to get subsecond

2022-09-20 Thread Vivek Goyal

On Wed, Aug 24, 2022 at 01:41:29PM -0400, Stefan Hajnoczi wrote:
> On Thu, Aug 18, 2022 at 02:46:19PM -0400, Yusuke Okada wrote:
> > From: Yusuke Okada 
> > 
> > The "%f" specifier in g_date_time_format() is only available in glib
> > 2.65.2 or later. If combined with older glib, the function returns null
> > and the timestamp displayed as "(null)".
> > 
> > For backward compatibility, g_date_time_get_microsecond should be used
> > to retrieve subsecond.
> > 
> > In this patch the g_date_time_format() leaves subsecond field as "%06d"
> > and let next snprintf to format with g_date_time_get_microsecond.
> > 
> > Signed-off-by: Yusuke Okada 
> > ---
> >  tools/virtiofsd/passthrough_ll.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> Thanks, applied to my block tree for QEMU 7.2:
> https://gitlab.com/stefanha/qemu/commits/block

Hi Stefan,

I am wondering when you plan to send it for merge. This seems like
a simple fix. Not sure why it does not qualify as a fix for
7.1 instead.

Thanks
Vivek

[PATCH] virtiofsd: Disable killpriv_v2 by default

2022-07-29 Thread Vivek Goyal

We are having a bunch of issues with killpriv_v2 enabled by default. First
of all, it relies on clearing suid/sgid bits as needed by dropping
capability CAP_FSETID. This does not work for remote filesystems like
NFS (and possibly others).

Secondly, we are noticing other issues related to clearing of SGID
which leads to failures for xfstests generic/355 and generic/193.

Thirdly, there are other issues w.r.t. caching of metadata (suid/sgid
bits) in the fuse client with killpriv_v2 enabled. The guest can cache that
data for some time even if it has been cleared on the server.

The second and third issues are fixable. It just might take a little
while to get them fixed in the kernel. The first one will probably not see
any movement for a long time.

Given these issues, killpriv_v2 does not seem to be a good candidate
for enabling by default. We have already disabled it by default in
rust version of virtiofsd.

Hence this patch disables killpriv_v2 by default. The user can choose to
enable it by passing the option "-o killpriv_v2".

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c |   13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

Index: rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c
===================================================================
--- rhvgoyal-qemu.orig/tools/virtiofsd/passthrough_ll.c 2022-07-29 08:19:05.925119947 -0400
+++ rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c  2022-07-29 08:27:08.048049096 -0400
@@ -767,19 +767,10 @@ static void lo_init(void *userdata, stru
 fuse_log(FUSE_LOG_DEBUG, "lo_init: enabling killpriv_v2\n");
 conn->want |= FUSE_CAP_HANDLE_KILLPRIV_V2;
 lo->killpriv_v2 = 1;
-} else if (lo->user_killpriv_v2 == -1 &&
-   conn->capable & FUSE_CAP_HANDLE_KILLPRIV_V2) {
-/*
- * User did not specify a value for killpriv_v2. By default enable it
- * if connection offers this capability
- */
-fuse_log(FUSE_LOG_DEBUG, "lo_init: enabling killpriv_v2\n");
-conn->want |= FUSE_CAP_HANDLE_KILLPRIV_V2;
-lo->killpriv_v2 = 1;
 } else {
 /*
- * Either user specified to disable killpriv_v2, or connection does
- * not offer this capability. Disable killpriv_v2 in both the cases
+ * Either user specified to disable killpriv_v2, or did not
+ * specify anything. Disable killpriv_v2 in both the cases.
  */
 fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling killpriv_v2\n");
 conn->want &= ~FUSE_CAP_HANDLE_KILLPRIV_V2;
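
For reference, the decision left after this change can be sketched as a small standalone function. This is only an illustration under stated assumptions: the names are hypothetical, the capability bit is a placeholder rather than the real FUSE_CAP_HANDLE_KILLPRIV_V2 value, and the condition of the first branch (which the hunk above does not show) is assumed to be user_killpriv_v2 == 1.

```c
#include <stdint.h>

/* Placeholder bit for illustration; not the real FUSE capability value. */
#define EXAMPLE_CAP_KILLPRIV_V2 (1u << 0)

/*
 * user_killpriv_v2 is a tri-state: -1 when the user did not specify the
 * option (as in the removed branch above), 1 when "-o killpriv_v2" was
 * passed (assumed), anything else when it was explicitly disabled.
 *
 * After the patch only an explicit request enables the capability; both
 * "not specified" and "disabled" clear it from the wanted set.
 */
static uint32_t apply_killpriv_v2(uint32_t want, int user_killpriv_v2)
{
    if (user_killpriv_v2 == 1) {
        return want | EXAMPLE_CAP_KILLPRIV_V2;
    }
    return want & ~EXAMPLE_CAP_KILLPRIV_V2;
}
```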

Re: Question about performance comparison between virtio-fs and virtio-blk

2022-07-26 Thread Vivek Goyal

On Tue, Jul 26, 2022 at 10:41:23PM +0800, Hao Xu wrote:
> On 7/26/22 21:17, Vivek Goyal wrote:
> > On Tue, Jul 26, 2022 at 08:55:38AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, 26 Jul 2022 at 08:24, Hao Xu  wrote:
> > > > I watched your presentation about virtiofs in 2020,
> > > > 
> > > > https://www.youtube.com/watch?v=EIVOzTsGMMI&t=232s
> > > > 
> > > > which is really helpful to me, but I have a question about the graph at
> > > > 3:53, could you give
> > > > 
> > > > me more info about the test, like what tool you use for the test, if
> > > > it's fio, what is the parameters.
> > > > 
> > > > I used fio to do randread test in a qemu box, but turns out the iops of
> > > > virtio-blk and virtio-fs are similar.
> > > 
> > 
> > Hi Hao,
> > 
> > My impression in general is that virtio-blk is much faster than virtiofs.
> 
> When testing virtio-blk, did you use the device directly or mount it and
> test against a file.

Frankly speaking, I don't recall any of the details right now. I do
remember that I ran some kernel compilation tests on virtio-blk, and
that of course needed a filesystem mounted on virtio-blk.

> 
> > A simple macro test is do a kernel compilation and compare time taken
> > between the two.
> 
> Good idea, I just tested with single file.

A single file with fio is good as a micro benchmark, which primarily
exercises data operations. Kernel compilation is a good macro
benchmark, a workload which stresses the filesystem for both
data and metadata operations.

Thanks
Vivek

> 
> Thanks,
> Hao
> 
> > 
> > > I have CCed Vivek Goyal, who has done more virtiofs benchmarking and
> > > might have ideas to share.
> > > 
> > > The benchmarking tool was fio with the stated blocksize and I/O
> > > pattern. The benchmark was probably run with direct=1. Based on the
> > > virtio-blk numbers I think iodepth was greater than 1 but I don't have
> > > the exact fio job parameters.
> > 
> > I had basically used fio jobs. I wrote some simple wrapper scripts to
> > run fio and parse and report numbers.
> > 
> > https://github.com/rhvgoyal/virtiofs-tests
> > 
> > I don't have data for virtio-blk but I do seem to have some comparison
> > numbers of virtiofs and virtio-9p.
> > 
> > https://github.com/rhvgoyal/virtiofs-tests/tree/master/performance-results/feb-23-2021
> > 
> > Thanks
> > Vivek
> > 
> > 
>

Re: Question about performance comparison between virtio-fs and virtio-blk

2022-07-26 Thread Vivek Goyal

On Tue, Jul 26, 2022 at 08:55:38AM -0400, Stefan Hajnoczi wrote:
> On Tue, 26 Jul 2022 at 08:24, Hao Xu  wrote:
> > I watched your presentation about virtiofs in 2020,
> >
> > https://www.youtube.com/watch?v=EIVOzTsGMMI&t=232s
> >
> > which is really helpful to me, but I have a question about the graph at
> > 3:53, could you give
> >
> > me more info about the test, like what tool you use for the test, if
> > it's fio, what is the parameters.
> >
> > I used fio to do randread test in a qemu box, but turns out the iops of
> > virtio-blk and virtio-fs are similar.
> 

Hi Hao,

My impression in general is that virtio-blk is much faster than virtiofs.
A simple macro test is to do a kernel compilation and compare the time
taken between the two.

> I have CCed Vivek Goyal, who has done more virtiofs benchmarking and
> might have ideas to share.
> 
> The benchmarking tool was fio with the stated blocksize and I/O
> pattern. The benchmark was probably run with direct=1. Based on the
> virtio-blk numbers I think iodepth was greater than 1 but I don't have
> the exact fio job parameters.

I had basically used fio jobs. I wrote some simple wrapper scripts to
run fio and parse and report numbers.

https://github.com/rhvgoyal/virtiofs-tests

I don't have data for virtio-blk but I do seem to have some comparison
numbers of virtiofs and virtio-9p.

https://github.com/rhvgoyal/virtiofs-tests/tree/master/performance-results/feb-23-2021

Thanks
Vivek

Re: [Virtio-fs] [Qemu] how to use viriofs in qemu without NUMA

2022-07-12 Thread Vivek Goyal

On Tue, Jul 12, 2022 at 07:06:50AM +, Zhao, Shirley wrote:
> Hi, all, 
> 
> I have another question want to consult you. 
> To enable DAX in virtiofs, according to the memu 
> https://virtio-fs.gitlab.io/howto-qemu.html. 
> I need to add "cache-size=2G" as below. 
> -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs,cache-size=2G
> 
> My qemu command is: 
> sudo qemu-system-x86_64 -M pc -cpu host --enable-kvm -smp 2 -m 4G -drive 
> if=virtio,file=ubuntu.img -object 
> memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on -machine 
> q35,memory-backend=mem -chardev socket,id=char0,path=/tmp/vhostqemu -device 
> vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs_root,cache-size=2G 
> -chardev stdio,mux=on,id=mon -mon chardev=mon,mode=readline -device 
> virtio-serial-pci -device virtconsole,chardev=mon -vga none -display none
> 
> And virtiofsd command is:
> sudo ./virtiofsd --socket-path=/tmp/vhostqemu -o source=/home/shirley/testdir 
> -o cache=always
> 
> But there is no option of "cache-size" in qemu 6.0, like below. So how to 
> enable it? 

Hi Shirley,

DAX support in qemu is not upstream yet. We are carrying the DAX patches
out of tree on a branch here.

https://gitlab.com/virtio-fs/qemu/-/commits/virtio-fs-dev

There are some changes required, and David Gilbert is looking into
making them. I am hoping these patches will make it upstream at some
point.

So for the time being, to test DAX you will have to fetch the above
branch, build it and use that qemu.

Thanks
Vivek

> qemu-6.0.0$ qemu-system-x86_64 -device vhost-user-fs-pci,help
> vhost-user-fs-pci options:
>   acpi-index=-  (default: 0)
>   addr=   - Slot and optional function number, example: 06.0 
> or 06 (default: -1)
>   aer= - on/off (default: false)
>   any_layout=  - on/off (default: true)
>   ats= - on/off (default: false)
>   bootindex=
>   chardev=  - ID of a chardev to use as a backend
>   event_idx=   - on/off (default: true)
>   failover_pair_id=
>   indirect_desc=   - on/off (default: true)
>   iommu_platform=  - on/off (default: false)
>   migrate-extra=   - on/off (default: true)
>   modern-pio-notify= - on/off (default: false)
>   multifunction=   - on/off (default: false)
>   notify_on_empty= - on/off (default: true)
>   num-request-queues= -  (default: 1)
>   packed=  - on/off (default: false)
>   page-per-vq= - on/off (default: false)
>   queue-size=-  (default: 128)
>   rombar=-  (default: 1)
>   romfile=
>   romsize=   -  (default: 4294967295)
>   tag=
>   use-disabled-flag= -  (default: true)
>   use-started= -  (default: true)
>   vectors=   -  (default: 4294967295)
>   virtio-backend=>
>   virtio-pci-bus-master-bug-migration= - on/off (default: false)
>   x-ats-page-aligned= - on/off (default: true)
>   x-disable-legacy-check= -  (default: false)
>   x-disable-pcie=  - on/off (default: false)
>   x-ignore-backend-features= -  (default: false)
>   x-pcie-deverr-init= - on/off (default: true)
>   x-pcie-extcap-init= - on/off (default: true)
>   x-pcie-flr-init= - on/off (default: true)
>   x-pcie-lnkctl-init= - on/off (default: true)
>   x-pcie-lnksta-dllla= - on/off (default: true)
>   x-pcie-pm-init=  - on/off (default: true) 
> 
> 
> -Original Message-
> From: Zhao, Shirley 
> Sent: Friday, July 8, 2022 8:40 AM
> To: Dr. David Alan Gilbert 
> Cc: Thomas Huth ; qemu-devel@nongnu.org; 
> virtio...@redhat.com; Stefan Hajnoczi 
> Subject: RE: [Qemu] how to use viriofs in qemu without NUMA
> 
> Yes, the qemu version is too old. 
> My previous qemu version is 4.2, and I upgraded it into 6.0, and it worked 
> now. 
> Thanks a lot. 
> 
> - Shirley 
> 
> -Original Message-
> From: Dr. David Alan Gilbert 
> Sent: Tuesday, July 5, 2022 5:37 PM
> To: Zhao, Shirley 
> Cc: Thomas Huth ; qemu-devel@nongnu.org; 
> virtio...@redhat.com; Stefan Hajnoczi 
> Subject: Re: [Qemu] how to use viriofs in qemu without NUMA
> 
> * Zhao, Shirley (shirley.z...@intel.com) wrote:
> > Thanks for the information. 
> > Yes, I also found the memory backend options on s390x, and also copy the 
> > command to x86, but failed. 
> > 
> > The following is the command used to start qemu + virtiofs + ubuntu 20.04. 
> > One is worked well using NUMA, another one is failed without NUMA. 
> > Is there anything wrong? 
> > 
> > The worked one with NUMA options: 
> > 
> > qemu-system-x86_64 -M pc -cpu host --enable-kvm -smp 2 -m 4G -object 
> > memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on -numa 
> > node,memdev=mem -chardev socket,id=char0,path=/tmp/vfsd.sock -device 
> > vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=myfs -chardev 
> > stdio,mux=on,id=mon -mon chardev=mon,mode=readline -device 
> > virtio-serial-pci -device virtconsole,chardev=mon -vga none -display 
> > none -drive if=virtio,file=ubuntu.img
> > 
> > The failed one without NUMA options: 
> > 
> > qemu-system-x86_64 -M pc -cpu

Re: [PATCH] docs: Correct the default thread-pool-size

2022-04-14 Thread Vivek Goyal

On Wed, Apr 13, 2022 at 12:20:54PM +0800, Liu Yiding wrote:
> Refer to 26ec190964 virtiofsd: Do not use a thread pool by default
> 
> Signed-off-by: Liu Yiding 

Looks good. Our default used to be --thread-pool-size=64, but we changed
it to use no thread pool because it performed better on the lower end of
workloads. A thread pool helps when multiple threads are doing parallel
I/O, so people who want to do lots of parallel I/O should enable the
thread pool manually.

Acked-by: Vivek Goyal 

Vivek
> ---
>  docs/tools/virtiofsd.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
> index 0c0560203c..33fed08c6f 100644
> --- a/docs/tools/virtiofsd.rst
> +++ b/docs/tools/virtiofsd.rst
> @@ -127,7 +127,7 @@ Options
>  .. option:: --thread-pool-size=NUM
>  
>Restrict the number of worker threads per request queue to NUM.  The 
> default
> -  is 64.
> +  is 0.
>  
>  .. option:: --cache=none|auto|always
>  
> -- 
> 2.31.1
> 
> 
> 
>

Re: [PULL 09/12] virtiofsd: Create new file with security context

2022-04-07 Thread Vivek Goyal

On Thu, Apr 07, 2022 at 01:44:35PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Maydell (peter.mayd...@linaro.org) wrote:
> > On Thu, 17 Feb 2022 at 17:40, Dr. David Alan Gilbert (git)
> >  wrote:
> > >
> > > From: Vivek Goyal 
> > >
> > > This patch adds support for creating new file with security context
> > > as sent by client. It basically takes three paths.
> > >
> > > - If no security context enabled, then it continues to create files 
> > > without
> > >   security context.
> > >
> > > - If security context is enabled and but security.selinux has not been
> > >   remapped, then it uses /proc/thread-self/attr/fscreate knob to set
> > >   security context and then create the file. This will make sure that
> > >   newly created file gets the security context as set in "fscreate" and
> > >   this is atomic w.r.t file creation.
> > >
> > >   This is useful and host and guest SELinux policies don't conflict and
> > >   can work with each other. In that case, guest security.selinux xattr
> > >   is not remapped and it is passthrough as "security.selinux" xattr
> > >   on host.
> > >
> > > - If security context is enabled but security.selinux xattr has been
> > >   remapped to something else, then it first creates the file and then
> > >   uses setxattr() to set the remapped xattr with the security context.
> > >   This is a non-atomic operation w.r.t file creation.
> > >
> > >   This mode will be most versatile and allow host and guest to have their
> > >   own separate SELinux xattrs and have their own separate SELinux 
> > > policies.
> > >
> > > Reviewed-by: Dr. David Alan Gilbert 
> > > Signed-off-by: Vivek Goyal 
> > > Message-Id: <20220208204813.682906-9-vgo...@redhat.com>
> > > Signed-off-by: Dr. David Alan Gilbert 
> > 
> > Hi; Coverity reports some issues (CID 1487142, 1487195), because
> > it is not a fan of the error-handling pattern used in this code:
> > 
> > > +static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
> > > +   const char *name, const char 
> > > *secctx_name)
> > > +{
> > > +int path_fd, err;
> > > +char procname[64];
> > > +struct lo_data *lo = lo_data(req);
> > > +
> > > +if (!req->secctx.ctxlen) {
> > > +return 0;
> > > +}
> > > +
> > > +/* Open newly created element with O_PATH */
> > > +path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
> > > +err = path_fd == -1 ? errno : 0;
> > > +if (err) {
> > > +return err;
> > > +}
> > 
> > We set err based on whether path_fd is -1 or not, but we decide
> > whether to early-return based on the value of err. Coverity
> > doesn't know that openat() will always set errno to something
> > non-zero if it returns -1, so it complains because it thinks
> > there's a code path where openat() returns -1, but errno is 0,
> > and so we don't take the early-return and instead continue
> > through all the code below to the "close(path_fd)", which
> > should not be being passed a negative value for the filedescriptor.
> > 
> > I could just mark these as false-positives, but it does seem a bit
> > odd that we are using two different conditions here. Perhaps it would
> > be better to rephrase? For instance, for the openat() we could write:
> > 
> >path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
> >if (path_fd == -1) {
> >return errno;
> >}
> 
> That looks OK to me; please send a patch.
> 
> Some of the cases look like they need to just be a little careful that
> 'err' always gets set to 0 if there are later cases that might set err.

I think the pattern of using "err" to save errno is there because in
some cases we can't return immediately after an error. Instead we have
to take some actions to restore state and then return.

So for this specific case the suggested change looks fine, because we
don't have to restore any state before returning.

Vivek
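
To make the two patterns discussed above concrete, here is a minimal
sketch. It is not virtiofsd code: probe_path() and probe_in_dir() are
hypothetical helpers, and chdir()/fchdir() only stand in for whatever
state a real function would have to restore. The first helper returns
errno right at the failure site, as Peter suggested; the second saves
errno in "err" because it still has work to do before returning.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for O_PATH */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Pattern 1: nothing to undo, so return errno at the failure site.
 * Hypothetical helper, not part of virtiofsd.
 */
static int probe_path(int dirfd, const char *name)
{
    int fd;

    assert(name != NULL);
    fd = openat(dirfd, name, O_PATH | O_NOFOLLOW);
    if (fd == -1) {
        return errno;
    }
    close(fd);
    return 0;
}

/*
 * Pattern 2: the working directory is changed before the call that may
 * fail, so errno is saved in "err" and the function falls through to
 * the restore step instead of returning early.
 */
static int probe_in_dir(const char *dir, const char *name)
{
    int err = 0;
    int saved_cwd, fd;

    assert(dir != NULL && name != NULL);
    saved_cwd = open(".", O_RDONLY | O_DIRECTORY);
    if (saved_cwd == -1) {
        return errno;           /* no state changed yet */
    }
    if (chdir(dir) == -1) {
        err = errno;
        goto out;               /* cwd unchanged, only close the fd */
    }

    fd = open(name, O_RDONLY);
    if (fd == -1) {
        err = errno;            /* save before later calls clobber errno */
    } else {
        close(fd);
    }

    /* Restore state; keep the first error if there already was one. */
    if (fchdir(saved_cwd) == -1 && !err) {
        err = errno;
    }
out:
    close(saved_cwd);
    return err;
}
```

In the second helper an early "return errno" after the failed open()
would leave the process in the wrong directory, which is the kind of
case where the "err" variable earns its keep.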
> 
> Dave
> 
> > and similarly for the openat() in open_set_proc_fscreate().
> > 
> > > +sprintf(procname, "%i", path_fd);
> > > +FCHDIR_NOFAIL(lo->proc_self_fd);
> > > +/* Set security context. This is not atomic w.r.t file creation */
> > > +err = setxattr(procname, secctx_name, req->secctx.ctx, 
> > > req->secctx.ctxlen,
> > > +   0);
> > > +if (err) {
> > > +err = errno;
> > > +}
> > 
> > > +FCHDIR_NOFAIL(lo->root.fd);
> > > +close(path_fd);
> > > +return err;
> > > +}
> > 
> > thanks
> > -- PMM
> > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>

Re: [PATCH 0/2] virtiofsd: Support FUSE_SYNCFS on unannounced submounts

2022-03-04 Thread Vivek Goyal

On Thu, Mar 03, 2022 at 06:13:21PM +0100, Greg Kurz wrote:
> This is the current patches I have : one to track submounts
> and the other to call syncfs() on them. Tested on simple
> cases only.
> 
> I won't be able to work on this anymore, so I'm posting for the
> records. Anyone is welcome to pick it up as there won't be a v2
> from my side.

Thanks Greg. Hopefully somebody else will be able to pick it up.

What are the TODO items to take this patch series to completion?

Vivek

> 
> Cheers,
> 
> --
> Greg
> 
> Greg Kurz (2):
>   virtiofsd: Track submounts
>   virtiofsd: Support FUSE_SYNCFS on unannounced submounts
> 
>  tools/virtiofsd/passthrough_ll.c | 61 
>  1 file changed, 55 insertions(+), 6 deletions(-)
> 
> -- 
> 2.34.1
> 
>

Re: [Virtio-fs] [PULL 00/12] virtiofs queue

2022-02-16 Thread Vivek Goyal

On Wed, Feb 16, 2022 at 07:40:14PM +, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (git) (dgilb...@redhat.com) wrote:
> > From: "Dr. David Alan Gilbert" 
> > 
> > The following changes since commit c13b8e9973635f34f3ce4356af27a311c993729c:
> > 
> >   Merge remote-tracking branch 
> > 'remotes/alistair/tags/pull-riscv-to-apply-20220216' into staging 
> > (2022-02-16 09:57:11 +)
> > 
> > are available in the Git repository at:
> > 
> >   https://gitlab.com/dagrh/qemu.git tags/pull-virtiofs-20220216
> > 
> > for you to fetch changes up to 47cc3ef597b2ee926c13c9433f4f73645429e128:
> > 
> >   virtiofsd: Add basic support for FUSE_SYNCFS request (2022-02-16 17:29:32 
> > +)
> 
> NAK
> this doesn't build on older Linuxes.
> 
> Rework version in the works.

Hi David,

I think it is patch 8 which is using gettid(). I have updated that
patch and am now using syscall(SYS_gettid) instead. Here is the
updated patch. I hope this solves the build issue on older Linux.
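
For readers who have not hit this before, a minimal sketch of the
workaround, assuming a Linux host. The helper names sketch_gettid() and
sketch_fscreate_path() are made up for illustration and do not appear
in the patch; only syscall(SYS_gettid) and the "<tid>/attr/fscreate"
format are taken from it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() */
#endif
#include <assert.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Older C libraries have no gettid() wrapper, but the raw system call
 * is available. Hypothetical helper name.
 */
static pid_t sketch_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

/*
 * Build "<tid>/attr/fscreate", a path relative to a descriptor opened
 * on /proc/self/task. Returns 0 on success, -1 if buf is too small.
 */
static int sketch_fscreate_path(char *buf, size_t len)
{
    int n;

    assert(buf != NULL && len > 0);
    n = snprintf(buf, len, "%d/attr/fscreate", (int)sketch_gettid());
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Unlike the sprintf() calls in the patch, the sketch uses snprintf() and
reports truncation, which costs nothing and avoids relying on the
64-byte buffer always being large enough.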


Subject: virtiofsd: Add helpers to work with /proc/self/task/tid/attr/fscreate

Soon we will be able to create and also set security context on the file
atomically using /proc/self/task/tid/attr/fscreate knob. If this knob
is available on the system, first set the knob with the desired context
and then create the file. It will be created with the context set in
fscreate. This basically works for SELinux, and it is per-thread.

This patch just introduces the helper functions. Subsequent patches will
make use of these helpers.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c |   92 +++
 1 file changed, 92 insertions(+)

Index: rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c
===
--- rhvgoyal-qemu.orig/tools/virtiofsd/passthrough_ll.c 2022-02-16 
15:53:13.657015138 -0500
+++ rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c  2022-02-16 
15:55:14.911234993 -0500
@@ -173,10 +173,14 @@ struct lo_data {
 
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
+/* An O_PATH file descriptor to /proc/self/task/ */
+int proc_self_task;
 int user_killpriv_v2, killpriv_v2;
 /* If set, virtiofsd is responsible for setting umask during creation */
 bool change_umask;
 int user_posix_acl, posix_acl;
+/* Keeps track if /proc/<pid>/attr/fscreate should be used or not */
+bool use_fscreate;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -257,6 +261,72 @@ static struct lo_data *lo_data(fuse_req_
 }
 
 /*
+ * Tries to figure out if /proc/<pid>/attr/fscreate is usable or not. With
+ * selinux=0, read from fscreate returns -EINVAL.
+ *
+ * TODO: Link with libselinux and use is_selinux_enabled() instead down
+ * the line. It probably will be more reliable indicator.
+ */
+static bool is_fscreate_usable(struct lo_data *lo)
+{
+char procname[64];
+int fscreate_fd;
+size_t bytes_read;
+
+sprintf(procname, "%ld/attr/fscreate", syscall(SYS_gettid));
+fscreate_fd = openat(lo->proc_self_task, procname, O_RDWR);
+if (fscreate_fd == -1) {
+return false;
+}
+
+bytes_read = read(fscreate_fd, procname, 64);
+close(fscreate_fd);
+if (bytes_read == -1) {
+return false;
+}
+return true;
+}
+
+/* Helpers to set/reset fscreate */
+__attribute__((unused))
+static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
+  size_t ctxlen,int *fd)
+{
+char procname[64];
+int fscreate_fd, err = 0;
+size_t written;
+
+sprintf(procname, "%ld/attr/fscreate", syscall(SYS_gettid));
+fscreate_fd = openat(lo->proc_self_task, procname, O_WRONLY);
+err = fscreate_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+
+written = write(fscreate_fd, ctx, ctxlen);
+err = written == -1 ? errno : 0;
+if (err) {
+goto out;
+}
+
+*fd = fscreate_fd;
+return 0;
+out:
+close(fscreate_fd);
+return err;
+}
+
+__attribute__((unused))
+static void close_reset_proc_fscreate(int fd)
+{
+if ((write(fd, NULL, 0)) == -1) {
+fuse_log(FUSE_LOG_WARNING, "Failed to reset fscreate. err=%d\n", 
errno);
+}
+close(fd);
+return;
+}
+
+/*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
  * returns 0 on success
@@ -3522,6 +3592,15 @@ static void setup_namespaces(struct lo_d
 exit(1);
 }
 
+/* Get the /proc/self/task descriptor */
+lo->proc_self_task = open("/proc/self/task/", O_PATH);
+if (lo->proc_self_task == -1) {
+fuse_log(FUSE_LOG_ERR, "open(/proc/self/task, O_PATH): %m\n");
+exit(1);
+}
+
+lo->use_fscreate = is_fscreate_usab

Re: [PATCH v6 1/1] virtiofsd: Add basic support for FUSE_SYNCFS request

2022-02-15 Thread Vivek Goyal

On Tue, Feb 15, 2022 at 07:15:29PM +0100, Greg Kurz wrote:
> Honor the expected behavior of syncfs() to synchronously flush all data
> and metadata to disk on linux systems.
> 
> If virtiofsd is started with '-o announce_submounts', the client is
> expected to send a FUSE_SYNCFS request for each individual submount.
> In this case, we just create a new file descriptor on the submount
> inode with lo_inode_open(), call syncfs() on it and close it. The
> intermediary file is needed because O_PATH descriptors aren't
> backed by an actual file and syncfs() would fail with EBADF.
> 
> If virtiofsd is started without '-o announce_submounts' or if the
> client doesn't have the FUSE_CAP_SUBMOUNTS capability, the client
> only sends a single FUSE_SYNCFS request for the root inode. The
> server would thus need to track submounts internally and call
> syncfs() on each of them. This will be implemented later.
> 
> Note that syncfs() might suffer from a time penalty if the submounts
> are being hammered by some unrelated workload on the host. The only
> solution to prevent that is to avoid shared mounts.
> 
> Signed-off-by: Greg Kurz 

Looks good to me. Thanks Greg.

Reviewed-by: Vivek Goyal 

Vivek
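
A standalone sketch of the open/syncfs/close sequence the commit
message describes, for anyone who wants to try it outside virtiofsd.
sketch_syncfs_path() is a hypothetical helper: it takes a path where
the patch takes a struct lo_inode and goes through lo_inode_open(). The
O_PATH/EBADF behaviour in the comment is as stated in the commit
message.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syncfs() */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Flush the filesystem that contains "path".
 * Returns 0 on success or a positive errno value on failure.
 */
static int sketch_syncfs_path(const char *path)
{
    int fd, ret = 0;

    assert(path != NULL);

    /*
     * Open a regular read-only descriptor. Per the commit message, an
     * O_PATH descriptor is not good enough here: syncfs() on it fails
     * with EBADF.
     */
    fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1) {
        return errno;
    }

    if (syncfs(fd) == -1) {
        ret = errno;
    }

    close(fd);
    return ret;
}
```

Calling it on any file or directory flushes the whole filesystem that
object lives on, which is why one call per submount is enough.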

> ---
>  tools/virtiofsd/fuse_lowlevel.c   | 11 +++
>  tools/virtiofsd/fuse_lowlevel.h   | 13 
>  tools/virtiofsd/passthrough_ll.c  | 44 +++
>  tools/virtiofsd/passthrough_seccomp.c |  1 +
>  4 files changed, 69 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
> index e4679c73abc2..e02d8b25a5f6 100644
> --- a/tools/virtiofsd/fuse_lowlevel.c
> +++ b/tools/virtiofsd/fuse_lowlevel.c
> @@ -1876,6 +1876,16 @@ static void do_lseek(fuse_req_t req, fuse_ino_t nodeid,
>  }
>  }
>  
> +static void do_syncfs(fuse_req_t req, fuse_ino_t nodeid,
> +  struct fuse_mbuf_iter *iter)
> +{
> +if (req->se->op.syncfs) {
> +req->se->op.syncfs(req, nodeid);
> +} else {
> +fuse_reply_err(req, ENOSYS);
> +}
> +}
> +
>  static void do_init(fuse_req_t req, fuse_ino_t nodeid,
>  struct fuse_mbuf_iter *iter)
>  {
> @@ -2280,6 +2290,7 @@ static struct {
>  [FUSE_RENAME2] = { do_rename2, "RENAME2" },
>  [FUSE_COPY_FILE_RANGE] = { do_copy_file_range, "COPY_FILE_RANGE" },
>  [FUSE_LSEEK] = { do_lseek, "LSEEK" },
> +[FUSE_SYNCFS] = { do_syncfs, "SYNCFS" },
>  };
>  
>  #define FUSE_MAXOP (sizeof(fuse_ll_ops) / sizeof(fuse_ll_ops[0]))
> diff --git a/tools/virtiofsd/fuse_lowlevel.h b/tools/virtiofsd/fuse_lowlevel.h
> index c55c0ca2fc1c..b889dae4de0e 100644
> --- a/tools/virtiofsd/fuse_lowlevel.h
> +++ b/tools/virtiofsd/fuse_lowlevel.h
> @@ -1226,6 +1226,19 @@ struct fuse_lowlevel_ops {
>   */
>  void (*lseek)(fuse_req_t req, fuse_ino_t ino, off_t off, int whence,
>struct fuse_file_info *fi);
> +
> +/**
> + * Synchronize file system content
> + *
> + * If this request is answered with an error code of ENOSYS,
> + * this is treated as success and future calls to syncfs() will
> + * succeed automatically without being sent to the filesystem
> + * process.
> + *
> + * @param req request handle
> + * @param ino the inode number
> + */
> +void (*syncfs)(fuse_req_t req, fuse_ino_t ino);
>  };
>  
>  /**
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index b3d0674f6d2f..0f65e6423cf5 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -3357,6 +3357,49 @@ static void lo_lseek(fuse_req_t req, fuse_ino_t ino, 
> off_t off, int whence,
>  }
>  }
>  
> +static int lo_do_syncfs(struct lo_data *lo, struct lo_inode *inode)
> +{
> +int fd, ret = 0;
> +
> +fuse_log(FUSE_LOG_DEBUG, "lo_do_syncfs(ino=%" PRIu64 ")\n",
> + inode->fuse_ino);
> +
> +fd = lo_inode_open(lo, inode, O_RDONLY);
> +if (fd < 0) {
> +return -fd;
> +}
> +
> +if (syncfs(fd) < 0) {
> +ret = errno;
> +}
> +
> +close(fd);
> +return ret;
> +}
> +
> +static void lo_syncfs(fuse_req_t req, fuse_ino_t ino)
> +{
> +struct lo_data *lo = lo_data(req);
> +struct lo_inode *inode = lo_inode(req, ino);
> +int err;
> +
> +if (!inode) {
> +fuse_reply_err(req, EBADF);
> +return;
> +}
> +
> +err = lo_do_syncfs(lo, inode);
> +lo_inode_put(lo, &inode);
> +
> +/*
> + * If submounts aren't announced, the client onl

Re: [PATCH v5 3/3] virtiofsd: Add support for FUSE_SYNCFS request without announce_submounts

2022-02-15 Thread Vivek Goyal

On Tue, Feb 15, 2022 at 10:18:03AM +0100, Greg Kurz wrote:
> On Mon, 14 Feb 2022 14:09:47 -0500
> Vivek Goyal  wrote:
> 
> > On Mon, Feb 14, 2022 at 01:56:08PM -0500, Vivek Goyal wrote:
> > > On Mon, Feb 14, 2022 at 01:27:22PM -0500, Vivek Goyal wrote:
> > > > On Mon, Feb 14, 2022 at 02:58:20PM +0100, Greg Kurz wrote:
> > > > > This adds the missing bits to support FUSE_SYNCFS in the case 
> > > > > submounts
> > > > > aren't announced to the client.
> > > > > 
> > > > > Iterate over all inodes and call syncfs() on the ones marked as 
> > > > > submounts.
> > > > > Since syncfs() can block for an indefinite time, we cannot call it 
> > > > > with
> > > > > lo->mutex held as it would prevent the server to process other 
> > > > > requests.
> > > > > This is thus broken down in two steps. First build a list of submounts
> > > > > with lo->mutex held, drop the mutex and finally process the list. A
> > > > > reference is taken on the inodes to ensure they don't go away when
> > > > > lo->mutex is dropped.
> > > > > 
> > > > > Signed-off-by: Greg Kurz 
> > > > > ---
> > > > >  tools/virtiofsd/passthrough_ll.c | 38 
> > > > > ++--
> > > > >  1 file changed, 36 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > > > > b/tools/virtiofsd/passthrough_ll.c
> > > > > index e94c4e6f8635..7ce944bfe2a0 100644
> > > > > --- a/tools/virtiofsd/passthrough_ll.c
> > > > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > > > @@ -3400,8 +3400,42 @@ static void lo_syncfs(fuse_req_t req, 
> > > > > fuse_ino_t ino)
> > > > >  err = lo_do_syncfs(lo, inode);
> > > > > >  lo_inode_put(lo, &inode);
> > > > >  } else {
> > > > > -/* Requires the sever to track submounts. Not implemented 
> > > > > yet */
> > > > > -err = ENOSYS;
> > > > > +g_autoptr(GSList) submount_list = NULL;
> > > > > +GSList *elem;
> > > > > +GHashTableIter iter;
> > > > > +gpointer key, value;
> > > > > +
> > > > > > +pthread_mutex_lock(&lo->mutex);
> > > > > > +
> > > > > > +g_hash_table_iter_init(&iter, lo->inodes);
> > > > > > +while (g_hash_table_iter_next(&iter, &key, &value)) {
> > > > 
> > > > Going through all the inodes sounds very inefficient. If there are large
> > > > number of inodes (say 1 million or more), and if frequent syncfs 
> > > > requests
> > > > are coming this can consume lot of cpu cycles.
> > > > 
> > > > Given C virtiofsd is slowly going away, so I don't want to be too
> > > > particular about it. But, I would have thought to put submount
> > > > inodes into another list or hash map (using mount id as key) and just
> > > > traverse through that list instead. Given number of submounts should
> > > > be small, it should be pretty quick to walk through that list.
> > > > 
> > > > > +struct lo_inode *inode = value;
> > > > > +
> > > > > +if (inode->is_submount) {
> > > > > > +g_atomic_int_inc(&inode->refcount);
> > > > > +submount_list = g_slist_prepend(submount_list, 
> > > > > inode);
> > > > > +}
> > > > > +}
> > > > > +
> > > > > > +pthread_mutex_unlock(&lo->mutex);
> > > > > +
> > > > > +/* The root inode is always present and not tracked in the 
> > > > > hash table */
> > > > > > +err = lo_do_syncfs(lo, &lo->root);
> > > > > +
> > > > > +for (elem = submount_list; elem; elem = g_slist_next(elem)) {
> > > > > +struct lo_inode *inode = elem->data;
> > > > > +int r;
> > > > > +
> > > > > +r = lo_do_syncfs(lo, inode);
> > > > > +if (r) {
> > > > > +/*
> > > > > + * Try to sync as much as possible. Only one error 
> > > > > can be
> > > >

Re: [PATCH v5 3/3] virtiofsd: Add support for FUSE_SYNCFS request without announce_submounts

2022-02-14 Thread Vivek Goyal

On Mon, Feb 14, 2022 at 01:56:08PM -0500, Vivek Goyal wrote:
> On Mon, Feb 14, 2022 at 01:27:22PM -0500, Vivek Goyal wrote:
> > On Mon, Feb 14, 2022 at 02:58:20PM +0100, Greg Kurz wrote:
> > > This adds the missing bits to support FUSE_SYNCFS in the case submounts
> > > aren't announced to the client.
> > > 
> > > Iterate over all inodes and call syncfs() on the ones marked as submounts.
> > > Since syncfs() can block for an indefinite time, we cannot call it with
> > > lo->mutex held as it would prevent the server to process other requests.
> > > This is thus broken down in two steps. First build a list of submounts
> > > with lo->mutex held, drop the mutex and finally process the list. A
> > > reference is taken on the inodes to ensure they don't go away when
> > > lo->mutex is dropped.
> > > 
> > > Signed-off-by: Greg Kurz 
> > > ---
> > >  tools/virtiofsd/passthrough_ll.c | 38 ++--
> > >  1 file changed, 36 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > > b/tools/virtiofsd/passthrough_ll.c
> > > index e94c4e6f8635..7ce944bfe2a0 100644
> > > --- a/tools/virtiofsd/passthrough_ll.c
> > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > @@ -3400,8 +3400,42 @@ static void lo_syncfs(fuse_req_t req, fuse_ino_t 
> > > ino)
> > >  err = lo_do_syncfs(lo, inode);
> > > >  lo_inode_put(lo, &inode);
> > >  } else {
> > > -/* Requires the sever to track submounts. Not implemented yet */
> > > -err = ENOSYS;
> > > +g_autoptr(GSList) submount_list = NULL;
> > > +GSList *elem;
> > > +GHashTableIter iter;
> > > +gpointer key, value;
> > > +
> > > > +pthread_mutex_lock(&lo->mutex);
> > > > +
> > > > +g_hash_table_iter_init(&iter, lo->inodes);
> > > > +while (g_hash_table_iter_next(&iter, &key, &value)) {
> > 
> > Going through all the inodes sounds very inefficient. If there are large
> > number of inodes (say 1 million or more), and if frequent syncfs requests
> > are coming this can consume lot of cpu cycles.
> > 
> > Given C virtiofsd is slowly going away, so I don't want to be too
> > particular about it. But, I would have thought to put submount
> > inodes into another list or hash map (using mount id as key) and just
> > traverse through that list instead. Given number of submounts should
> > be small, it should be pretty quick to walk through that list.
> > 
> > > +struct lo_inode *inode = value;
> > > +
> > > +if (inode->is_submount) {
> > > > +g_atomic_int_inc(&inode->refcount);
> > > +submount_list = g_slist_prepend(submount_list, inode);
> > > +}
> > > +}
> > > +
> > > > +pthread_mutex_unlock(&lo->mutex);
> > > +
> > > +/* The root inode is always present and not tracked in the hash 
> > > table */
> > > > +err = lo_do_syncfs(lo, &lo->root);
> > > +
> > > +for (elem = submount_list; elem; elem = g_slist_next(elem)) {
> > > +struct lo_inode *inode = elem->data;
> > > +int r;
> > > +
> > > +r = lo_do_syncfs(lo, inode);
> > > +if (r) {
> > > +/*
> > > + * Try to sync as much as possible. Only one error can be
> > > + * reported to the client though, arbitrarily the last 
> > > one.
> > > + */
> > > +err = r;
> > > +}
> > > > +lo_inode_put(lo, &inode);
> > > +}
> > 
> > One more minor nit. What happens if virtiofsd is processing syncfs list
> > and then somebody hard reboots qemu and mounts virtiofs again. That
> > will trigger FUSE_INIT and will call lo_destroy() first.
> > 
> > fuse_lowlevel.c
> > 
> > fuse_session_process_buf_int()
> > {
> > fuse_log(FUSE_LOG_DEBUG, "%s: reinit\n", __func__);
> > se->got_destroy = 1;
> > se->got_init = 0;
> > if (se->op.destroy) {
> > se->op.destroy(se->userdata);
> > }
> > }
> > 
> > IIUC, there is no synchronization with this path. If we are running with
> > thread pool enabled, it could very well happen that one threa

Re: [PATCH v5 3/3] virtiofsd: Add support for FUSE_SYNCFS request without announce_submounts

2022-02-14 Thread Vivek Goyal

On Mon, Feb 14, 2022 at 01:27:22PM -0500, Vivek Goyal wrote:
> On Mon, Feb 14, 2022 at 02:58:20PM +0100, Greg Kurz wrote:
> > This adds the missing bits to support FUSE_SYNCFS in the case submounts
> > aren't announced to the client.
> > 
> > Iterate over all inodes and call syncfs() on the ones marked as submounts.
> > Since syncfs() can block for an indefinite time, we cannot call it with
> > lo->mutex held as it would prevent the server to process other requests.
> > This is thus broken down in two steps. First build a list of submounts
> > with lo->mutex held, drop the mutex and finally process the list. A
> > reference is taken on the inodes to ensure they don't go away when
> > lo->mutex is dropped.
> > 
> > Signed-off-by: Greg Kurz 
> > ---
> >  tools/virtiofsd/passthrough_ll.c | 38 ++--
> >  1 file changed, 36 insertions(+), 2 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index e94c4e6f8635..7ce944bfe2a0 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -3400,8 +3400,42 @@ static void lo_syncfs(fuse_req_t req, fuse_ino_t ino)
> >  err = lo_do_syncfs(lo, inode);
> >  lo_inode_put(lo, &inode);
> >  } else {
> > -/* Requires the sever to track submounts. Not implemented yet */
> > -err = ENOSYS;
> > +g_autoptr(GSList) submount_list = NULL;
> > +GSList *elem;
> > +GHashTableIter iter;
> > +gpointer key, value;
> > +
> > +pthread_mutex_lock(&lo->mutex);
> > +
> > +g_hash_table_iter_init(&iter, lo->inodes);
> > +while (g_hash_table_iter_next(&iter, &key, &value)) {
> 
> Going through all the inodes sounds very inefficient. If there are large
> number of inodes (say 1 million or more), and if frequent syncfs requests
> are coming this can consume lot of cpu cycles.
> 
> Given C virtiofsd is slowly going away, so I don't want to be too
> particular about it. But, I would have thought to put submount
> inodes into another list or hash map (using mount id as key) and just
> traverse through that list instead. Given number of submounts should
> be small, it should be pretty quick to walk through that list.
> 
> > +struct lo_inode *inode = value;
> > +
> > +if (inode->is_submount) {
> > +g_atomic_int_inc(&inode->refcount);
> > +submount_list = g_slist_prepend(submount_list, inode);
> > +}
> > +}
> > +
> > +pthread_mutex_unlock(>mutex);
> > +
> > +/* The root inode is always present and not tracked in the hash 
> > table */
> > +err = lo_do_syncfs(lo, >root);
> > +
> > +for (elem = submount_list; elem; elem = g_slist_next(elem)) {
> > +struct lo_inode *inode = elem->data;
> > +int r;
> > +
> > +r = lo_do_syncfs(lo, inode);
> > +if (r) {
> > +/*
> > + * Try to sync as much as possible. Only one error can be
> > + * reported to the client though, arbitrarily the last one.
> > + */
> > +err = r;
> > +}
> > +lo_inode_put(lo, );
> > +}
> 
> One more minor nit. What happens if virtiofsd is processing syncfs list
> and then somebody hard reboots qemu and mounts virtiofs again. That
> will trigger FUSE_INIT and will call lo_destroy() first.
> 
> fuse_lowlevel.c
> 
> fuse_session_process_buf_int()
> {
> fuse_log(FUSE_LOG_DEBUG, "%s: reinit\n", __func__);
> se->got_destroy = 1;
> se->got_init = 0;
> if (se->op.destroy) {
> se->op.destroy(se->userdata);
> }
> }
> 
> IIUC, there is no synchronization with this path. If we are running with
> thread pool enabled, it could very well happen that one thread is still
> doing syncfs while other thread is executing do_init(). That sounds
> like little bit of a problem. It will be good if there is a way
> to either abort syncfs() or do_destroy() waits for all the previous
> syncfs() to finish.
> 
> Greg, if you like, you could break down this work in two patch series.
> First patch series just issues syncfs() on inode id sent with FUSE_SYNCFS.
> That's easy fix and can get merged now.

Actually, I think even a single syncfs() will have a synchronization issue
with do_init() upon hard reboot if we drop lo->mutex during syncfs().

Vivek

> 
> And second patch series take care of above issues and will be little bit
> more work.
> 
> Thanks
> Vivek

Re: [PATCH v5 3/3] virtiofsd: Add support for FUSE_SYNCFS request without announce_submounts

2022-02-14 Thread Vivek Goyal

On Mon, Feb 14, 2022 at 02:58:20PM +0100, Greg Kurz wrote:
> This adds the missing bits to support FUSE_SYNCFS in the case submounts
> aren't announced to the client.
> 
> Iterate over all inodes and call syncfs() on the ones marked as submounts.
> Since syncfs() can block for an indefinite time, we cannot call it with
> lo->mutex held as it would prevent the server to process other requests.
> This is thus broken down in two steps. First build a list of submounts
> with lo->mutex held, drop the mutex and finally process the list. A
> reference is taken on the inodes to ensure they don't go away when
> lo->mutex is dropped.
> 
> Signed-off-by: Greg Kurz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 38 ++--
>  1 file changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index e94c4e6f8635..7ce944bfe2a0 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -3400,8 +3400,42 @@ static void lo_syncfs(fuse_req_t req, fuse_ino_t ino)
>  err = lo_do_syncfs(lo, inode);
>  lo_inode_put(lo, );
>  } else {
> -/* Requires the sever to track submounts. Not implemented yet */
> -err = ENOSYS;
> +g_autoptr(GSList) submount_list = NULL;
> +GSList *elem;
> +GHashTableIter iter;
> +gpointer key, value;
> +
> +pthread_mutex_lock(>mutex);
> +
> +g_hash_table_iter_init(, lo->inodes);
> +while (g_hash_table_iter_next(, , )) {

Going through all the inodes sounds very inefficient. If there are a large
number of inodes (say 1 million or more) and syncfs requests arrive
frequently, this can consume a lot of CPU cycles.

Given that C virtiofsd is slowly going away, I don't want to be too
particular about it. But I would have put submount inodes into a separate
list or hash map (using mount id as key) and traversed only that. Since
the number of submounts should be small, walking that list should be
pretty quick.
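
To make the separate-list suggestion concrete, here is a rough sketch. It is
not virtiofsd code: struct demo_inode, submount_add(), submount_remove(),
submount_sync_all() and demo_sync_stub() are hypothetical names, a plain
linked list stands in for whatever container the daemon would really use,
and locking is left out (the daemon would still need lo->mutex or similar
around list updates).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for virtiofsd's struct lo_inode. */
struct demo_inode {
    uint64_t mnt_id; /* mount id of the filesystem this inode lives on */
    int fd;          /* fd the daemon would hand to syncfs() */
};

struct submount_node {
    struct demo_inode *inode;
    struct submount_node *next;
};

struct submount_list {
    struct submount_node *head;
    size_t count;
};

/* Track one inode per mount id. Returns 1 if added, 0 if the mount is
 * already tracked, -1 on allocation failure. */
static int submount_add(struct submount_list *l, struct demo_inode *inode)
{
    struct submount_node *n;

    for (n = l->head; n; n = n->next) {
        if (n->inode->mnt_id == inode->mnt_id) {
            return 0;
        }
    }
    n = malloc(sizeof(*n));
    if (!n) {
        return -1;
    }
    n->inode = inode;
    n->next = l->head;
    l->head = n;
    l->count++;
    return 1;
}

/* Stop tracking a mount. Returns 1 if removed, 0 if it was not tracked. */
static int submount_remove(struct submount_list *l, uint64_t mnt_id)
{
    struct submount_node **pp = &l->head;

    while (*pp) {
        if ((*pp)->inode->mnt_id == mnt_id) {
            struct submount_node *dead = *pp;
            *pp = dead->next;
            free(dead);
            l->count--;
            return 1;
        }
        pp = &(*pp)->next;
    }
    return 0;
}

/* Placeholder for syncfs(inode->fd): fails with EBADF on a negative fd. */
static int demo_sync_stub(struct demo_inode *inode)
{
    return inode->fd < 0 ? EBADF : 0;
}

/* Walk only the tracked submounts, so the cost scales with the number of
 * mounts rather than the number of inodes. Like the patch, keep going on
 * error and report the last one seen. */
static int submount_sync_all(const struct submount_list *l,
                             int (*sync_fn)(struct demo_inode *))
{
    const struct submount_node *n;
    int err = 0;

    for (n = l->head; n; n = n->next) {
        int r = sync_fn(n->inode);
        if (r) {
            err = r;
        }
    }
    return err;
}
```

With something like this, lo_syncfs() would call submount_sync_all() instead
of iterating over lo->inodes; entries would be added when a submount inode is
first looked up and removed when it is forgotten.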

> +struct lo_inode *inode = value;
> +
> +if (inode->is_submount) {
> +g_atomic_int_inc(>refcount);
> +submount_list = g_slist_prepend(submount_list, inode);
> +}
> +}
> +
> +pthread_mutex_unlock(>mutex);
> +
> +/* The root inode is always present and not tracked in the hash 
> table */
> +err = lo_do_syncfs(lo, >root);
> +
> +for (elem = submount_list; elem; elem = g_slist_next(elem)) {
> +struct lo_inode *inode = elem->data;
> +int r;
> +
> +r = lo_do_syncfs(lo, inode);
> +if (r) {
> +/*
> + * Try to sync as much as possible. Only one error can be
> + * reported to the client though, arbitrarily the last one.
> + */
> +err = r;
> +}
> +lo_inode_put(lo, );
> +}

One more minor nit. What happens if virtiofsd is processing the syncfs list
and somebody hard reboots qemu and mounts virtiofs again? That will
trigger FUSE_INIT, which will call lo_destroy() first.

fuse_lowlevel.c

fuse_session_process_buf_int()
{
fuse_log(FUSE_LOG_DEBUG, "%s: reinit\n", __func__);
se->got_destroy = 1;
se->got_init = 0;
if (se->op.destroy) {
se->op.destroy(se->userdata);
}
}

IIUC, there is no synchronization with this path. If we are running with
the thread pool enabled, it could very well happen that one thread is still
doing syncfs while another thread is executing do_init(). That sounds
like a bit of a problem. It would be good to have a way to either abort
syncfs() or make do_destroy() wait for all previous syncfs() calls to
finish.
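
For the second option (destroy waits for earlier syncfs() calls), one
possible shape is an in-flight counter guarded by a mutex and condition
variable. This is only a sketch under assumptions: struct inflight_gate and
the gate_*() helpers are hypothetical and do not exist in virtiofsd, and
where exactly the destroy path would call gate_drain() is left open.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical gate tracking long-running requests such as syncfs(). */
struct inflight_gate {
    pthread_mutex_t lock;
    pthread_cond_t idle;   /* signalled when inflight drops to zero */
    unsigned int inflight; /* requests currently between enter and leave */
    bool draining;         /* true while destroy is in progress */
};

static void gate_init(struct inflight_gate *g)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->idle, NULL);
    g->inflight = 0;
    g->draining = false;
}

/* Called before starting syncfs(). Returns false if destroy is running,
 * in which case the request should be failed instead of started. */
static bool gate_enter(struct inflight_gate *g)
{
    bool ok;

    pthread_mutex_lock(&g->lock);
    ok = !g->draining;
    if (ok) {
        g->inflight++;
    }
    pthread_mutex_unlock(&g->lock);
    return ok;
}

/* Called after syncfs() returns; wakes a waiting destroy when idle. */
static void gate_leave(struct inflight_gate *g)
{
    pthread_mutex_lock(&g->lock);
    if (--g->inflight == 0) {
        pthread_cond_broadcast(&g->idle);
    }
    pthread_mutex_unlock(&g->lock);
}

/* Called from the destroy path: refuse new requests, then block until
 * every request that already entered has left. */
static void gate_drain(struct inflight_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->draining = true;
    while (g->inflight > 0) {
        pthread_cond_wait(&g->idle, &g->lock);
    }
    pthread_mutex_unlock(&g->lock);
}

/* Called once re-initialisation is done so requests are accepted again. */
static void gate_reopen(struct inflight_gate *g)
{
    pthread_mutex_lock(&g->lock);
    g->draining = false;
    pthread_mutex_unlock(&g->lock);
}
```

The catch is that syncfs() can block for an indefinite time, so a destroy
that waits like this can block just as long. That is the trade-off against
finding a way to abort syncfs().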

Greg, if you like, you could break this work down into two patch series.
The first series just issues syncfs() on the inode id sent with FUSE_SYNCFS.
That's an easy fix and can get merged now.

The second series takes care of the above issues and will be a little
more work.

Thanks
Vivek

Re: [PATCH v2] Deprecate C virtiofsd

2022-02-14 Thread Vivek Goyal

On Mon, Feb 14, 2022 at 11:30:03AM +, Dr. David Alan Gilbert wrote:
> * Richard W.M. Jones (rjo...@redhat.com) wrote:
> > On Thu, Feb 10, 2022 at 05:47:14PM +, Dr. David Alan Gilbert (git) 
> > wrote:
> > > From: "Dr. David Alan Gilbert" 
> > > 
> > > There's a nice new Rust implementation out there; recommend people
> > > do new work on that.
> > > 
> > > Signed-off-by: Dr. David Alan Gilbert 
> > > ---
> > >  docs/about/deprecated.rst | 17 +
> > >  1 file changed, 17 insertions(+)
> > > 
> > > diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> > > index 47a594a3b6..3c73d22729 100644
> > > --- a/docs/about/deprecated.rst
> > > +++ b/docs/about/deprecated.rst
> > > @@ -454,3 +454,20 @@ nanoMIPS ISA
> > >  
> > >  The ``nanoMIPS`` ISA has never been upstreamed to any compiler toolchain.
> > >  As it is hard to generate binaries for it, declare it deprecated.
> > > +
> > > +Tools
> > > +-
> > > +
> > > +virtiofsd
> > > +'
> > > +
> > > +There is a new Rust implementation of ``virtiofsd`` at
> > > +``https://gitlab.com/virtio-fs/virtiofsd``;
> > > +since this is now marked stable, new development should be done on that
> > > +rather than the existing C version in the QEMU tree.
> > > +The C version will still accept fixes and patches that
> > > +are already in development for the moment, but will eventually
> > > +be deleted from this tree.
> > > +New deployments should use the Rust version, and existing systems
> > > +should consider moving to it.  The command line and feature set
> > > +is very close and moving should be simple.
> > 
> > I'm not qualified to say if the Rust impl is complete enough
> > to replace the C version, so I won't add a reviewed tag.
> 
> We believe it is a complete replacement at this point, with compatible
> command line.

I think it's not a complete replacement yet. For example, POSIX ACLs are
not supported yet. German is looking into making it work.

There might be other small things here and there, but nothing major, I
think.

Vivek
> 
> Dave
> 
> > However I want to say that from the point of view of downstream
> > packagers of qemu -- especially Fedora -- it would be helpful if we
> > could direct both upstream development effort and downstream packaging
> > into just the one virtiofsd.  So I agree in principle with this.
> > 
> > Rich.
> > 
> > -- 
> > Richard Jones, Virtualization Group, Red Hat 
> > http://people.redhat.com/~rjones
> > Read my programming and virtualization blog: http://rwmj.wordpress.com
> > virt-p2v converts physical machines to virtual machines.  Boot with a
> > live CD or over the network (PXE) and turn machines into KVM guests.
> > http://libguestfs.org/virt-v2v
> > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>

Re: [PATCH v6 10/10] virtiofsd: Add an option to enable/disable security label

2022-02-14 Thread Vivek Goyal

On Mon, Feb 14, 2022 at 01:32:38PM +, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Provide an option "-o security_label/no_security_label" to enable/disable
> > security label functionality. By default these are turned off.
> > 
> > If enabled, server will indicate to client that it is capable of handling
> > one security label during file creation. Typically this is expected to
> > be a SELinux label. File server will set this label on the file. It will
> > try to set it atomically wherever possible. But its not possible in
> > all the cases.
> > 
> > Signed-off-by: Vivek Goyal 
> 
> Reviewed-by: Dr. David Alan Gilbert 
> 
> OK, but you have missed some of the docs typos I mentined in the last
> review; they can be cleared up any time.

Hi David,

I could not find any comments in V5 w.r.t. doc typos. I am not sure which
email I have missed.

Anyway, it would be nice if I could take care of these typos in a follow-up
patch so that these patches can be merged.

Thanks
Vivek
> 
> > ---
> >  docs/tools/virtiofsd.rst | 32 
> >  tools/virtiofsd/helper.c |  1 +
> >  tools/virtiofsd/passthrough_ll.c | 15 +++
> >  3 files changed, 48 insertions(+)
> > 
> > diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
> > index 07ac0be551..0c0560203c 100644
> > --- a/docs/tools/virtiofsd.rst
> > +++ b/docs/tools/virtiofsd.rst
> > @@ -104,6 +104,13 @@ Options
> >* posix_acl|no_posix_acl -
> >  Enable/disable posix acl support.  Posix ACLs are disabled by default.
> >  
> > +  * security_label|no_security_label -
> > +Enable/disable security label support. Security labels are disabled by
> > +default. This will allow client to send a MAC label of file during
> > +file creation. Typically this is expected to be SELinux security
> > +label. Server will try to set that label on newly created file
> > +atomically wherever possible.
> > +
> >  .. option:: --socket-path=PATH
> >  
> >Listen on vhost-user UNIX domain socket at PATH.
> > @@ -348,6 +355,31 @@ client arguments or lists returned from the host.  
> > This stops
> >  the client seeing any 'security.' attributes on the server and
> >  stops it setting any.
> >  
> > +SELinux support
> > +---
> > +One can enable support for SELinux by running virtiofsd with option
> > +"-o security_label". But this will try to save guest's security context
> > +in xattr security.selinux on host and it might fail if host's SELinux
> > +policy does not permit virtiofsd to do this operation.
> > +
> > +Hence, it is preferred to remap guest's "security.selinux" xattr to say
> > +"trusted.virtiofs.security.selinux" on host.
> > +
> > +"-o xattrmap=:map:security.selinux:trusted.virtiofs.:"
> > +
> > +This will make sure that guest and host's SELinux xattrs on same file
> > +remain separate and not interfere with each other. And will allow both
> > +host and guest to implement their own separate SELinux policies.
> > +
> > +Setting trusted xattr on host requires CAP_SYS_ADMIN. So one will need
> > +add this capability to daemon.
> > +
> > +"-o modcaps=+sys_admin"
> > +
> > +Giving CAP_SYS_ADMIN increases the risk on system. Now virtiofsd is more
> > +powerful and if gets compromised, it can do lot of damage to host system.
> > +So keep this trade-off in my mind while making a decision.
> > +
> >  Examples
> >  
> >  
> > diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> > index a8295d975a..e226fc590f 100644
> > --- a/tools/virtiofsd/helper.c
> > +++ b/tools/virtiofsd/helper.c
> > @@ -187,6 +187,7 @@ void fuse_cmdline_help(void)
> > "   default: no_allow_direct_io\n"
> > "-o announce_submounts  Announce sub-mount points to 
> > the guest\n"
> > "-o posix_acl/no_posix_acl  Enable/Disable posix_acl. 
> > (default: disabled)\n"
> > +   "-o security_label/no_security_label  Enable/Disable 
> > security label. (default: disabled)\n"
> > );
> >  }
> >  
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index d49128a58d..f3ec6aafe5 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -181,6 +181,7 @@ struct lo_data {

Re: [PATCH] Deprecate C virtiofsd

2022-02-09 Thread Vivek Goyal

On Wed, Feb 09, 2022 at 04:50:40PM +, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> There's a nice new Rust implementation out there; recommend people
> do new work on that.
> 
> Signed-off-by: Dr. David Alan Gilbert 

Acked-by: Vivek Goyal 

Vivek

> ---
>  docs/about/deprecated.rst | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
> index 47a594a3b6..3a0e15f8f5 100644
> --- a/docs/about/deprecated.rst
> +++ b/docs/about/deprecated.rst
> @@ -454,3 +454,17 @@ nanoMIPS ISA
>  
>  The ``nanoMIPS`` ISA has never been upstreamed to any compiler toolchain.
>  As it is hard to generate binaries for it, declare it deprecated.
> +
> +Tools
> +-
> +
> +virtiofsd
> +'
> +
> +There is a new Rust implementation of ``virtiofs`` at
> +``https://gitlab.com/virtio-fs/virtiofsd``;
> +since this is now marked stable, new development should be done on that
> +rather than the existing C version in the QEMU tree.
> +The C version will still accept fixes and patches that
> +are already in development for the moment.
> +
> -- 
> 2.34.1
>

Re: [Virtio-fs] [PATCH v5 0/9] virtiofsd: Add support for file security context at file creation

2022-02-09 Thread Vivek Goyal

On Wed, Feb 09, 2022 at 11:24:29AM +0100, German Maglione wrote:
> On Tue, Feb 8, 2022 at 11:44 PM Daniel P. Berrangé 
> wrote:
> 
> > On Mon, Feb 07, 2022 at 04:19:38PM -0500, Vivek Goyal wrote:
> > > On Mon, Feb 07, 2022 at 01:05:16PM +, Daniel P. Berrangé wrote:
> > > > On Wed, Feb 02, 2022 at 02:39:26PM -0500, Vivek Goyal wrote:
> > > > > Hi,
> > > > >
> > > > > This is V5 of the patches. I posted V4 here.
> > > > >
> > > > >
> > https://listman.redhat.com/archives/virtio-fs/2022-January/msg00041.html
> > > > >
> > > > > These will allow us to support SELinux with virtiofs. This will send
> > > > > SELinux context at file creation to server and server can set it on
> > > > > file.
> > > >
> > > > I've not entirely figured it out from the code, so easier for me
> > > > to ask...
> > > >
> > > > How is the SELinux labelled stored on the host side ? It is stored
> > > > directly in the security.* xattr namespace,
> > >
> > > [ CC Dan Walsh ]
> > >
> > > I just tried to test the mode where I don't do xattr remapping and try
> > > to set /proc/pid/attr/fscreate with the context I want to set. It will
> > > set security.selinux xattr on host.
> > >
> > > But write to /proc/pid/attr/fscreate fails if host does not recognize
> > > the label sent by guest. I am running virtiofsd with unconfined_t but
> > > it still fails because guest is trying to create a file with
> > > "test_filesystem_filetranscon_t" and host does not recognize this
> > > label. Seeing following in audit logs.
> > >
> > > type=SELINUX_ERR msg=audit(1644268262.666:8111): op=fscreate
> > invalid_context="unconfined_u:object_r:test_filesystem_filetranscon_t:s0"
> >
> > Yes, that's to be expected if the host policy doesn't know about the
> > label that the guest is using.
> >
> > IOW, non-mapping case is only useful if you have a very good match
> > between host + guest OS policy. This could be useful for an app
> > like Kata because their guest is not a full OS, it is something
> > special purpose and tightly controlled.
> >
> > > So if we don't remap xattrs and host has SELinux enabled, then it
> > probably
> > > work in very limited circumstances where host and guest policies don't
> > > conflict. I guess its like running fedora 34 guest on fedora 34 host.
> > > I suspect that this will see very limited use. Though I have put the
> > > code in for the sake of completeness.
> >
> > For general purpose guest OS virtualization remapping is going to be
> > effectuively mandatory.  The non-mapped case only usable when you tightly
> > control the guest OS packages from the host.
> >
> 
> 
> If remap is recommended, why not make it mandatory or automatic?,
> for instance, '-o security_label' either requires '-o xattrmap=' or
> automatically makes
> the mapping with the 'trusted' prefix, while  '-o security_label=nomap'
> doesn't, so you
> can choose whatever you want.

It is a recommended setting but not a mandatory one. Enforcing any kind
of policy will work for some and not for others.

We could refine it down the line depending on how it is used and where
people find it useful.

For now, the primary focus is to get the basic support patches into the tree.

Thanks
Vivek

> 
> (I'm not suggesting the 'nomap' name, I'm terrible choosing names)
> 
> -- 
> German

[PATCH v6 06/10] virtiofsd: Move core file creation code in separate function

2022-02-08 Thread Vivek Goyal

Move core file creation bits into a separate function. Soon this is going
to get more complex, as file creation needs to set the security context as
well, and there will be multiple modes of file creation in the next patch.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 36 ++--
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index b3d0674f6d..82023bf3d4 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2001,6 +2001,30 @@ static int lo_do_open(struct lo_data *lo, struct 
lo_inode *inode,
 return 0;
 }
 
+static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi, int* open_fd)
+{
+int err = 0, fd;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+
+err = lo_change_cred(req, &old, lo->change_umask);
+if (err) {
+return err;
+}
+
+/* Try to create a new file but don't open existing files */
+fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
+if (fd == -1) {
+err = errno;
+} else {
+*open_fd = fd;
+}
+lo_restore_cred(&old, lo->change_umask);
+return err;
+}
+
 static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
   mode_t mode, struct fuse_file_info *fi)
 {
@@ -2010,7 +2034,6 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, 
const char *name,
 struct lo_inode *inode = NULL;
 struct fuse_entry_param e;
 int err;
-struct lo_cred old = {};
 
 fuse_log(FUSE_LOG_DEBUG, "lo_create(parent=%" PRIu64 ", name=%s)"
  " kill_priv=%d\n", parent, name, fi->kill_priv);
@@ -2026,18 +2049,9 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, 
const char *name,
 return;
 }
 
-err = lo_change_cred(req, &old, lo->change_umask);
-if (err) {
-goto out;
-}
-
 update_open_flags(lo->writeback, lo->allow_direct_io, fi);
 
-/* Try to create a new file but don't open existing files */
-fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
-err = fd == -1 ? errno : 0;
-
-lo_restore_cred(&old, lo->change_umask);
+err = do_lo_create(req, parent_inode, name, mode, fi, &fd);
 
 /* Ignore the error if file exists and O_EXCL was not given */
 if (err && (err != EEXIST || (fi->flags & O_EXCL))) {
-- 
2.34.1

[PATCH v6 00/10] virtiofsd: Add support for file security context at file creation

2022-02-08 Thread Vivek Goyal

Hi,

This is V6 of the patches. I posted V5 here.

https://listman.redhat.com/archives/virtio-fs/2022-February/msg00012.html

This patch series allows the client to send a security context (expected
to be the security.selinux xattr and its content) to virtiofsd, which
sets that security context on the file during creation, depending on
various settings. In effect, this series makes it possible to support
SELinux with virtiofs.

There are primarily 3 modes.

- If security context is not enabled, then it continues to create files
  without a security context.

- If security context is enabled but security.selinux has not been
  remapped, then it uses the /proc/thread-self/attr/fscreate knob to set
  the security context and then creates the file. This makes sure that the
  newly created file gets the security context as set in "fscreate", and
  this is atomic w.r.t. file creation.

  This is useful if host and guest SELinux policies don't conflict and
  can work with each other. In that case, the guest security.selinux xattr
  is not remapped and is passed through as the "security.selinux" xattr
  on the host.

- If security context is enabled but the security.selinux xattr has been
  remapped to something else, then it first creates the file and then
  uses setxattr() to set the remapped xattr with the security context.
  This is a non-atomic operation w.r.t. file creation.

  This mode will be the most versatile and allows host and guest to have
  their own separate SELinux xattrs and their own separate SELinux policies.
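
The choice between the three modes above boils down to two inputs. A
minimal sketch, assuming a hypothetical pick_create_mode() helper and enum
names that are not part of the series (the actual patches also look at
other state, such as whether fscreate is usable):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the three modes described in the cover letter. */
enum create_mode {
    CREATE_NO_SECCTX,     /* security context disabled */
    CREATE_FSCREATE,      /* write fscreate, then create: atomic */
    CREATE_THEN_SETXATTR, /* create, then setxattr(): non-atomic */
};

/*
 * secctx_enabled: whether security label support was requested.
 * mapped_name:    the name security.selinux is remapped to, or NULL when
 *                 the xattr is passed through unchanged.
 */
static enum create_mode pick_create_mode(bool secctx_enabled,
                                         const char *mapped_name)
{
    if (!secctx_enabled) {
        return CREATE_NO_SECCTX;
    }
    if (mapped_name == NULL) {
        return CREATE_FSCREATE;
    }
    return CREATE_THEN_SETXATTR;
}
```

For example, with "-o security_label" and no xattrmap the second mode
applies; adding an xattrmap rule for security.selinux selects the third.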

Changes since V5:

- Added some documentation to recommend using xattr remapping to remap
  "security.selinux" to "trusted.virtiofs.security.selinux" and also
  give CAP_SYS_ADMIN to the daemon. Also put in a warning to make users
  aware of the trade-off involved here. ("Daniel P. Berrangé")

- Used macro endof() to determine end of fuse_init_in struct. (David
  Gilbert).

- Added a check to make sure fsecctx->size is not zero. Also added
  a "return" statement at a few places where it was required. (David Gilbert)

- Split patch 7 in the series. Some of the handling of setting and
  clearing the fscreate knob has been moved into a separate patch. Found
  it hard to break it down further. So it helps a bit but not too
  much. (David Gilbert).

Thanks
Vivek

Vivek Goyal (10):
  virtiofsd: Fix breakage due to fuse_init_in size change
  linux-headers: Update headers to v5.17-rc1
  virtiofsd: Parse extended "struct fuse_init_in"
  virtiofsd: Extend size of fuse_conn_info->capable and ->want fields
  virtiofsd, fuse_lowlevel.c: Add capability to parse security context
  virtiofsd: Move core file creation code in separate function
  virtiofsd: Add helpers to work with /proc/self/task/tid/attr/fscreate
  virtiofsd: Create new file with security context
  virtiofsd: Create new file using O_TMPFILE and set security context
  virtiofsd: Add an option to enable/disable security label

 docs/tools/virtiofsd.rst  |  32 ++
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h |  11 +
 include/standard-headers/linux/ethtool.h  |   1 +
 include/standard-headers/linux/fuse.h |  60 ++-
 include/standard-headers/linux/pci_regs.h | 142 +++---
 include/standard-headers/linux/virtio_gpio.h  |  72 +++
 include/standard-headers/linux/virtio_i2c.h   |  47 ++
 include/standard-headers/linux/virtio_iommu.h |   8 +-
 .../standard-headers/linux/virtio_pcidev.h|  65 +++
 include/standard-headers/linux/virtio_scmi.h  |  24 +
 linux-headers/asm-generic/unistd.h|   5 +-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-riscv/bitsperlong.h |  14 +
 linux-headers/asm-riscv/mman.h|   1 +
 linux-headers/asm-riscv/unistd.h  |  44 ++
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |  16 +-
 linux-headers/asm-x86/unistd_32.h |   1 +
 linux-headers/asm-x86/unistd_64.h |   1 +
 linux-headers/asm-x86/unistd_x32.h|   1 +
 linux-headers/linux/kvm.h |  17 +
 tools/virtiofsd/fuse_common.h |   9 +-
 tools/virtiofsd/fuse_i.h  |   7 +
 tools/virtiofsd/fuse_lowlevel.c   | 168 +--
 tools/virtiofsd/helper.c  |   1 +
 tools/virtiofsd/passthrough_ll.c  | 414 --
 32 files changed, 1044 insertions(+), 132 deletions(-)
 create mode 100644 include/standard-headers/linux/virtio_gpio.h
 create mode 100644 include/standard-headers/linux/virtio_i2c.h
 create mode 100644 incl

[PATCH v6 02/10] linux-headers: Update headers to v5.17-rc1

2022-02-08 Thread Vivek Goyal

Update headers to 5.17-rc1. I need the latest fuse changes.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h |  11 ++
 include/standard-headers/linux/ethtool.h  |   1 +
 include/standard-headers/linux/fuse.h |  60 +++-
 include/standard-headers/linux/pci_regs.h | 142 +-
 include/standard-headers/linux/virtio_gpio.h  |  72 +
 include/standard-headers/linux/virtio_i2c.h   |  47 ++
 include/standard-headers/linux/virtio_iommu.h |   8 +-
 .../standard-headers/linux/virtio_pcidev.h|  65 
 include/standard-headers/linux/virtio_scmi.h  |  24 +++
 linux-headers/asm-generic/unistd.h|   5 +-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-riscv/bitsperlong.h |  14 ++
 linux-headers/asm-riscv/mman.h|   1 +
 linux-headers/asm-riscv/unistd.h  |  44 ++
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |  16 +-
 linux-headers/asm-x86/unistd_32.h |   1 +
 linux-headers/asm-x86/unistd_64.h |   1 +
 linux-headers/asm-x86/unistd_x32.h|   1 +
 linux-headers/linux/kvm.h |  17 +++
 26 files changed, 469 insertions(+), 76 deletions(-)
 create mode 100644 include/standard-headers/linux/virtio_gpio.h
 create mode 100644 include/standard-headers/linux/virtio_i2c.h
 create mode 100644 include/standard-headers/linux/virtio_pcidev.h
 create mode 100644 include/standard-headers/linux/virtio_scmi.h
 create mode 100644 linux-headers/asm-riscv/bitsperlong.h
 create mode 100644 linux-headers/asm-riscv/mman.h
 create mode 100644 linux-headers/asm-riscv/unistd.h

diff --git a/include/standard-headers/asm-x86/kvm_para.h 
b/include/standard-headers/asm-x86/kvm_para.h
index 204cfb8640..f0235e58a1 100644
--- a/include/standard-headers/asm-x86/kvm_para.h
+++ b/include/standard-headers/asm-x86/kvm_para.h
@@ -8,6 +8,7 @@
  * should be used to determine that a VM is running under KVM.
  */
 #define KVM_CPUID_SIGNATURE0x4000
+#define KVM_SIGNATURE "KVMKVMKVM\0\0\0"
 
 /* This CPUID returns two feature bitmaps in eax, edx. Before enabling
  * a particular paravirtualization, the appropriate feature bit should
diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 2c025cb4fe..4888f85f69 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -313,6 +313,13 @@ extern "C" {
  */
 #define DRM_FORMAT_P016 fourcc_code('P', '0', '1', '6') /* 2x2 subsampled Cr:Cb plane 16 bits per channel */
 
+/* 2 plane YCbCr420.
+ * 3 10 bit components and 2 padding bits packed into 4 bytes.
+ * index 0 = Y plane, [31:0] x:Y2:Y1:Y0 2:10:10:10 little endian
+ * index 1 = Cr:Cb plane, [63:0] x:Cr2:Cb2:Cr1:x:Cb1:Cr0:Cb0 
[2:10:10:10:2:10:10:10] little endian
+ */
+#define DRM_FORMAT_P030 fourcc_code('P', '0', '3', '0') /* 2x2 subsampled Cr:Cb plane 10 bits per channel packed */
+
 /* 3 plane non-subsampled (444) YCbCr
  * 16 bits per component, but only 10 bits are used and 6 bits are padded
  * index 0: Y plane, [15:0] Y:x [10:6] little endian
@@ -853,6 +860,10 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t 
modifier)
  * and UV.  Some SAND-using hardware stores UV in a separate tiled
  * image from Y to reduce the column height, which is not supported
  * with these modifiers.
+ *
+ * The DRM_FORMAT_MOD_BROADCOM_SAND128_COL_HEIGHT modifier is also
+ * supported for DRM_FORMAT_P030 where the columns remain as 128 bytes
+ * wide, but as this is a 10 bpp format that translates to 96 pixels.
  */
 
 #define DRM_FORMAT_MOD_BROADCOM_SAND32_COL_HEIGHT(v) \
diff --git a/include/standard-headers/linux/ethtool.h 
b/include/standard-headers/linux/ethtool.h
index 688eb8dc39..38d5a4cd6e 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -231,6 +231,7 @@ enum tunable_id {
ETHTOOL_RX_COPYBREAK,
ETHTOOL_TX_COPYBREAK,
ETHTOOL_PFC_PREVENTION_TOUT, /* timeout in msecs */
+   ETHTOOL_TX_COPYBREAK_BUF_SIZE,
/*
 * Add your fresh new tunable attribute above and remember to update
 * tunable_strings[] in net/ethtool/common.c
diff --git a/include/standard-headers/linux/fuse.h 
b/include/standard-headers/linux/fuse.h
index 23ea31708b..bda06258be 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -184,6 +184,16 @@
  *
  *  7.34
  *  - add FUSE_SYNCFS
+ *

[PATCH v6 09/10] virtiofsd: Create new file using O_TMPFILE and set security context

2022-02-08 Thread Vivek Goyal

If guest and host policies can't work with each other, then the guest
security context (SELinux label) needs to be stored in a remapped xattr.
For example, remap the guest's security.selinux xattr to
trusted.virtiofs.security.selinux.

That means setting "fscreate" is not going to help, as that's only useful
for the security.selinux xattr on the host.

So we need another method which is atomic. Use O_TMPFILE to create the new
file, set the xattr and then linkat() it into its proper place.

But this works only for regular files, so directories and symlinks will
continue to be non-atomic.

Also, if the host filesystem does not support O_TMPFILE, we fall back to
non-atomic behavior.
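
Outside of virtiofsd, the O_TMPFILE sequence described above can be sketched
as follows. This is an illustration under assumptions, not the patch itself:
create_with_xattr() is a hypothetical helper, it links through an absolute
/proc/self/fd path instead of the daemon's proc_self_fd descriptor, and it
skips credential switching entirely.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Create "name" under dirfd so that the xattr is already set when the
 * name becomes visible. Returns 0 on success or a positive errno value.
 */
static int create_with_xattr(int dirfd, const char *name, mode_t mode,
                             const char *xattr_name,
                             const void *value, size_t len)
{
#ifdef O_TMPFILE
    char path[64];
    int err = 0;
    int fd;

    /* Unnamed file in the target directory. O_EXCL is deliberately not
     * passed, since it would forbid linking the file in later. */
    fd = openat(dirfd, ".", O_TMPFILE | O_RDWR, mode);
    if (fd == -1) {
        return errno;
    }

    if (fsetxattr(fd, xattr_name, value, len, 0) == -1) {
        err = errno;
    } else {
        /* Give the unnamed file its final name via its /proc entry. */
        snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
        if (linkat(AT_FDCWD, path, dirfd, name, AT_SYMLINK_FOLLOW) == -1) {
            err = errno;
        }
    }

    close(fd);
    return err;
#else
    /* Headers without O_TMPFILE: report it as unsupported. */
    (void)dirfd;
    (void)name;
    (void)mode;
    (void)xattr_name;
    (void)value;
    (void)len;
    return EOPNOTSUPP;
#endif
}
```

A caller would treat an error from the openat() step (for example
EOPNOTSUPP on a filesystem without O_TMPFILE support) as the cue to fall
back to the non-atomic create-then-setxattr path.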

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 80 
 1 file changed, 72 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 68fa542fac..d49128a58d 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2153,14 +2153,29 @@ static int lo_do_open(struct lo_data *lo, struct 
lo_inode *inode,
 
 static int do_create_nosecctx(fuse_req_t req, struct lo_inode *parent_inode,
const char *name, mode_t mode,
-   struct fuse_file_info *fi, int *open_fd)
+   struct fuse_file_info *fi, int *open_fd,
+  bool tmpfile)
 {
 int err, fd;
 struct lo_cred old = {};
 struct lo_data *lo = lo_data(req);
 int flags;
 
-flags = fi->flags | O_CREAT | O_EXCL;
+if (tmpfile) {
+flags = fi->flags | O_TMPFILE;
+/*
+ * Don't use O_EXCL as we want to link file later. Also reset O_CREAT
+ * otherwise openat() returns -EINVAL.
+ */
+flags &= ~(O_CREAT | O_EXCL);
+
+/* O_TMPFILE needs either O_RDWR or O_WRONLY */
+if ((flags & O_ACCMODE) == O_RDONLY) {
+flags |= O_RDWR;
+}
+} else {
+flags = fi->flags | O_CREAT | O_EXCL;
+}
 
 err = lo_change_cred(req, &old, lo->change_umask);
 if (err) {
@@ -2191,7 +2206,7 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 
 close_reset_proc_fscreate(fscreate_fd);
 if (!err) {
@@ -2200,6 +2215,44 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
+static int do_create_secctx_tmpfile(fuse_req_t req,
+struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi,
+const char *secctx_name, int *open_fd)
+{
+int err, fd = -1;
+struct lo_data *lo = lo_data(req);
+char procname[64];
+
+err = do_create_nosecctx(req, parent_inode, ".", mode, fi, &fd, true);
+if (err) {
+return err;
+}
+
+err = fsetxattr(fd, secctx_name, req->secctx.ctx, req->secctx.ctxlen, 0);
+if (err) {
+err = errno;
+goto out;
+}
+
+/* Security context set on file. Link it in place */
+sprintf(procname, "%d", fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+err = linkat(AT_FDCWD, procname, parent_inode->fd, name,
+ AT_SYMLINK_FOLLOW);
+err = err == -1 ? errno : 0;
+FCHDIR_NOFAIL(lo->root.fd);
+
+out:
+if (!err) {
+*open_fd = fd;
+} else if (fd != -1) {
+close(fd);
+}
+return err;
+}
+
 static int do_create_secctx_noatomic(fuse_req_t req,
  struct lo_inode *parent_inode,
  const char *name, mode_t mode,
@@ -2208,7 +2261,7 @@ static int do_create_secctx_noatomic(fuse_req_t req,
 {
 int err = 0, fd = -1;
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 if (err) {
 goto out;
 }
@@ -2250,20 +2303,31 @@ static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
 if (secctx_enabled) {
 /*
  * If security.selinux has not been remapped and selinux is enabled,
- * use fscreate to set context before file creation.
- * Otherwise fallback to non-atomic method of file creation
- * and xattr settting.
+ * use fscreate to set context before file creation. If not, use
+ * tmpfile method for regular files. Otherwise fallback to
+ * non-atomic method of file creation and xattr setting.
  */
 if (!mapped_name && lo->use_fscreate) {
 err = do_create_secctx_fscreate(req, parent_inode, name, mode, fi,
 open_fd);

[PATCH v6 10/10] virtiofsd: Add an option to enable/disable security label

2022-02-08 Thread Vivek Goyal

Provide an option "-o security_label/no_security_label" to enable/disable
security label functionality. By default it is turned off.

If enabled, the server will indicate to the client that it is capable of
handling one security label during file creation. Typically this is expected
to be a SELinux label. The file server will set this label on the file. It
will try to set it atomically wherever possible, but that is not possible in
all cases.

Signed-off-by: Vivek Goyal 
---
 docs/tools/virtiofsd.rst | 32 
 tools/virtiofsd/helper.c |  1 +
 tools/virtiofsd/passthrough_ll.c | 15 +++
 3 files changed, 48 insertions(+)

diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
index 07ac0be551..0c0560203c 100644
--- a/docs/tools/virtiofsd.rst
+++ b/docs/tools/virtiofsd.rst
@@ -104,6 +104,13 @@ Options
   * posix_acl|no_posix_acl -
 Enable/disable posix acl support.  Posix ACLs are disabled by default.
 
+  * security_label|no_security_label -
+Enable/disable security label support. Security labels are disabled by
+default. This will allow client to send a MAC label of file during
+file creation. Typically this is expected to be SELinux security
+label. Server will try to set that label on newly created file
+atomically wherever possible.
+
 .. option:: --socket-path=PATH
 
   Listen on vhost-user UNIX domain socket at PATH.
@@ -348,6 +355,31 @@ client arguments or lists returned from the host.  This stops
 the client seeing any 'security.' attributes on the server and
 stops it setting any.
 
+SELinux support
+---------------
+One can enable support for SELinux by running virtiofsd with option
+"-o security_label". But this will try to save guest's security context
+in xattr security.selinux on host and it might fail if host's SELinux
+policy does not permit virtiofsd to do this operation.
+
+Hence, it is preferred to remap guest's "security.selinux" xattr to say
+"trusted.virtiofs.security.selinux" on host.
+
+"-o xattrmap=:map:security.selinux:trusted.virtiofs.:"
+
+This will make sure that guest and host's SELinux xattrs on same file
+remain separate and not interfere with each other. And will allow both
+host and guest to implement their own separate SELinux policies.
+
+Setting trusted xattr on host requires CAP_SYS_ADMIN. So one will need
+to add this capability to the daemon.
+
+"-o modcaps=+sys_admin"
+
+Giving CAP_SYS_ADMIN increases the risk on system. Now virtiofsd is more
+powerful and if it gets compromised, it can do a lot of damage to host system.
+So keep this trade-off in mind while making a decision.
+
 Examples
 --------
 
diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index a8295d975a..e226fc590f 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -187,6 +187,7 @@ void fuse_cmdline_help(void)
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the guest\n"
"-o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: disabled)\n"
+   "-o security_label/no_security_label  Enable/Disable security label. (default: disabled)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index d49128a58d..f3ec6aafe5 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -181,6 +181,7 @@ struct lo_data {
 int user_posix_acl, posix_acl;
 /* Keeps track if /proc/<pid>/attr/fscreate should be used or not */
 bool use_fscreate;
+int user_security_label;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -215,6 +216,8 @@ static const struct fuse_opt lo_opts[] = {
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
 { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
 { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
+{ "security_label", offsetof(struct lo_data, user_security_label), 1 },
+{ "no_security_label", offsetof(struct lo_data, user_security_label), 0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -808,6 +811,17 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn)
 fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling posix_acl\n");
 conn->want &= ~FUSE_CAP_POSIX_ACL;
 }
+
+if (lo->user_security_label == 1) {
+if (!(conn->capable & FUSE_CAP_SECURITY_CTX)) {
+fuse_log(FUSE_LOG_ERR, "lo_init: Can not enable security label."
+ " kernel does not support FUSE_SECURITY_CTX capability.\n");
+}
+conn->want |= FUSE_CAP_SECURITY_CTX;
+} else {
+fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling security label

[PATCH v6 04/10] virtiofsd: Extend size of fuse_conn_info->capable and ->want fields

2022-02-08 Thread Vivek Goyal

->capable keeps track of what capabilities the kernel supports and ->want
keeps track of what capabilities the filesystem wants.

Right now these fields are 32 bits in size. But fuse has now run out of
bits, and capabilities can have bit numbers higher than 31.

That means 32-bit fields are not sufficient anymore. Increase the size to
64 bits so that we can add newer capabilities and still be able to use
existing code to check and set the capabilities.
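
As an illustration only (not part of this patch), here is a minimal sketch of
why a capability bit above 31 cannot live in a 32-bit field. EXAMPLE_CAP_HIGH
and both helper names are hypothetical, and the sketch assumes "unsigned" is
32 bits wide, as it is on the usual Linux targets.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability that uses a bit number above 31. */
#define EXAMPLE_CAP_HIGH (1ULL << 32)

/* Returns 1 if the capability survives being stored in a 32-bit field. */
static int fits_in_32bit_field(uint64_t cap)
{
    uint32_t narrow = (uint32_t)cap; /* what a 32-bit field would keep */

    return narrow == cap;
}

/* Returns 1 if cap is set in a 64-bit capable/want style field. */
static int cap_is_set(uint64_t field, uint64_t cap)
{
    assert(cap != 0);
    return (field & cap) != 0;
}
```

With 64-bit fields, the usual "field |= CAP" and "field & CAP" idioms keep
working unchanged for both low and high bits.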

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   | 4 ++--
 tools/virtiofsd/fuse_lowlevel.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 0c2665b977..6f8a988202 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -439,7 +439,7 @@ struct fuse_conn_info {
 /**
  * Capability flags that the kernel supports (read-only)
  */
-unsigned capable;
+uint64_t capable;
 
 /**
  * Capability flags that the filesystem wants to enable.
@@ -447,7 +447,7 @@ struct fuse_conn_info {
  * libfuse attempts to initialize this field with
  * reasonable default values before calling the init() handler.
  */
-unsigned want;
+uint64_t want;
 
 /**
  * Maximum number of pending "background" requests. A
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index b6712b763a..d91cd9743a 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2069,7 +2069,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 if (se->conn.want & (~se->conn.capable)) {
 fuse_log(FUSE_LOG_ERR,
  "fuse: error: filesystem requested capabilities "
- "0x%x that are not supported by kernel, aborting.\n",
+ "0x%llx that are not supported by kernel, aborting.\n",
  se->conn.want & (~se->conn.capable));
 fuse_reply_err(req, EPROTO);
 se->error = -EPROTO;
-- 
2.34.1

[PATCH v6 07/10] virtiofsd: Add helpers to work with /proc/self/task/tid/attr/fscreate

2022-02-08 Thread Vivek Goyal

Soon we will be able to create a file and set its security context
atomically using the /proc/self/task/tid/attr/fscreate knob. If this knob
is available on the system, first set the knob with the desired context
and then create the file. It will be created with the context set in
fscreate. This basically works for SELinux, and the knob is per thread.

This patch just introduces the helper functions. Subsequent patches will
make use of these helpers.
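
For readers unfamiliar with the knob, the sequence described above can be
sketched roughly as follows. This is a simplified illustration, not the code
in this patch: fscreate_path() and create_with_context() are hypothetical
names, and the sketch opens the knob by absolute path instead of relative to
an O_PATH descriptor. It only does something useful on a host with SELinux
enabled.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the per-thread fscreate path for thread id `tid` into buf. */
static int fscreate_path(char *buf, size_t len, long tid)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/proc/self/task/%ld/attr/fscreate", tid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write ctx to the fscreate knob, create the file,
 * then reset the knob. Returns the new file's fd, or -1 with errno set.
 */
static int create_with_context(const char *knob, const char *file,
                               const char *ctx, mode_t mode)
{
    int fd = -1, saved_errno;
    int knob_fd = open(knob, O_WRONLY);

    if (knob_fd == -1) {
        return -1;
    }
    if (write(knob_fd, ctx, strlen(ctx)) != -1) {
        /* The file picks up the context at creation time. */
        fd = open(file, O_CREAT | O_EXCL | O_WRONLY, mode);
    }
    saved_errno = errno;
    /* A zero-length write resets the knob for this thread. */
    if (write(knob_fd, NULL, 0) == -1) {
        fprintf(stderr, "failed to reset fscreate\n");
    }
    close(knob_fd);
    errno = saved_errno;
    return fd;
}
```

Because the knob is per thread, the thread that writes the context must be
the same thread that creates the file.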

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 92 
 1 file changed, 92 insertions(+)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 82023bf3d4..7762bf0d2c 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -173,10 +173,14 @@ struct lo_data {
 
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
+/* An O_PATH file descriptor to /proc/self/task/ */
+int proc_self_task;
 int user_killpriv_v2, killpriv_v2;
 /* If set, virtiofsd is responsible for setting umask during creation */
 bool change_umask;
 int user_posix_acl, posix_acl;
+/* Keeps track if /proc/<pid>/attr/fscreate should be used or not */
+bool use_fscreate;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -256,6 +260,72 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/*
+ * Tries to figure out if /proc/<pid>/attr/fscreate is usable or not. With
+ * selinux=0, read from fscreate returns -EINVAL.
+ *
+ * TODO: Link with libselinux and use is_selinux_enabled() instead down
+ * the line. It probably will be more reliable indicator.
+ */
+static bool is_fscreate_usable(struct lo_data *lo)
+{
+char procname[64];
+int fscreate_fd;
+size_t bytes_read;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_RDWR);
+if (fscreate_fd == -1) {
+return false;
+}
+
+bytes_read = read(fscreate_fd, procname, 64);
+close(fscreate_fd);
+if (bytes_read == -1) {
+return false;
+}
+return true;
+}
+
+/* Helpers to set/reset fscreate */
+__attribute__((unused))
+static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
+  size_t ctxlen,int *fd)
+{
+char procname[64];
+int fscreate_fd, err = 0;
+size_t written;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_WRONLY);
+err = fscreate_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+
+written = write(fscreate_fd, ctx, ctxlen);
+err = written == -1 ? errno : 0;
+if (err) {
+goto out;
+}
+
+*fd = fscreate_fd;
+return 0;
+out:
+close(fscreate_fd);
+return err;
+}
+
+__attribute__((unused))
+static void close_reset_proc_fscreate(int fd)
+{
+if ((write(fd, NULL, 0)) == -1) {
+fuse_log(FUSE_LOG_WARNING, "Failed to reset fscreate. err=%d\n", errno);
+}
+close(fd);
+return;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -3522,6 +3592,15 @@ static void setup_namespaces(struct lo_data *lo, struct fuse_session *se)
 exit(1);
 }
 
+/* Get the /proc/self/task descriptor */
+lo->proc_self_task = open("/proc/self/task/", O_PATH);
+if (lo->proc_self_task == -1) {
+fuse_log(FUSE_LOG_ERR, "open(/proc/self/task, O_PATH): %m\n");
+exit(1);
+}
+
+lo->use_fscreate = is_fscreate_usable(lo);
+
 /*
  * We only need /proc/self/fd. Prevent ".." from accessing parent
  * directories of /proc/self/fd by bind-mounting it over /proc. Since / was
@@ -3738,6 +3817,14 @@ static void setup_chroot(struct lo_data *lo)
 exit(1);
 }
 
+lo->proc_self_task = open("/proc/self/task", O_PATH);
+if (lo->proc_self_task == -1) {
+fuse_log(FUSE_LOG_ERR, "open(\"/proc/self/task\", O_PATH): %m\n");
+exit(1);
+}
+
+lo->use_fscreate = is_fscreate_usable(lo);
+
 /*
  * Make the shared directory the file system root so that FUSE_OPEN
  * (lo_open()) cannot escape the shared directory by opening a symlink.
@@ -3923,6 +4010,10 @@ static void fuse_lo_data_cleanup(struct lo_data *lo)
 close(lo->proc_self_fd);
 }
 
+if (lo->proc_self_task >= 0) {
+close(lo->proc_self_task);
+}
+
 if (lo->root.fd >= 0) {
 close(lo->root.fd);
 }
@@ -3950,6 +4041,7 @@ int main(int argc, char *argv[])
 .posix_lock = 0,
 .allow_direct_io = 0,
 .proc_self_fd = -1,
+.proc_self_task = -1,
 .user_killpriv_v2 = -1,
 .user_posix_acl = -1,
 };
-- 
2.34.1

[PATCH v6 03/10] virtiofsd: Parse extended "struct fuse_init_in"

2022-02-08 Thread Vivek Goyal

Add some code to parse the extended "struct fuse_init_in", and use a local
variable "flags" to represent the 64-bit flags. This will make it easier
to add more features without having to worry about two 32-bit flags (->flags
and ->flags2) in "struct fuse_init_in".
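
A minimal sketch of the idea, for illustration only: the two 32-bit words are
folded into one 64-bit value, and the upper word is trusted only when the
sender marks the struct as extended. EX_INIT_EXT, EX_HIGH_FLAG and
combine_init_flags() are hypothetical stand-ins, not names from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the "extended init" marker and a high flag. */
#define EX_INIT_EXT  (1U << 30)
#define EX_HIGH_FLAG (1ULL << 32)

/* Fold the two 32-bit flag words into a single 64-bit value. */
static uint64_t combine_init_flags(uint32_t flags, uint32_t flags2)
{
    uint64_t all = flags;

    /* Only trust flags2 when the sender says the extended part is there. */
    if (flags & EX_INIT_EXT) {
        all |= (uint64_t)flags2 << 32;
    }
    /* The low word is always carried over unchanged. */
    assert((all & 0xffffffffULL) == flags);
    return all;
}
```

After this, every capability check can test a single variable regardless of
which 32-bit word the bit arrived in.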

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_lowlevel.c | 61 +
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index ce29a70253..b6712b763a 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1881,11 +1881,14 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 {
 size_t compat_size = offsetof(struct fuse_init_in, max_readahead);
 size_t compat2_size = offsetof(struct fuse_init_in, flags) + sizeof(uint32_t);
+/* Fuse structure extended with minor version 36 */
+size_t compat3_size = endof(struct fuse_init_in, unused);
 struct fuse_init_in *arg;
 struct fuse_init_out outarg;
 struct fuse_session *se = req->se;
 size_t bufsize = se->bufsize;
 size_t outargsize = sizeof(outarg);
+uint64_t flags = 0;
 
 (void)nodeid;
 
@@ -1902,11 +1905,25 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 fuse_reply_err(req, EINVAL);
 return;
 }
+flags |= arg->flags;
+}
+
+/*
+ * fuse_init_in was extended again with minor version 36. Just read
+ * current known size of fuse_init so that future extension and
+ * header rebase does not cause breakage.
+ */
+if (sizeof(*arg) > compat2_size && (arg->flags & FUSE_INIT_EXT)) {
+if (!fuse_mbuf_iter_advance(iter, compat3_size - compat2_size)) {
+fuse_reply_err(req, EINVAL);
+return;
+}
+flags |= (uint64_t) arg->flags2 << 32;
 }
 
 fuse_log(FUSE_LOG_DEBUG, "INIT: %u.%u\n", arg->major, arg->minor);
 if (arg->major == 7 && arg->minor >= 6) {
-fuse_log(FUSE_LOG_DEBUG, "flags=0x%08x\n", arg->flags);
+fuse_log(FUSE_LOG_DEBUG, "flags=0x%016llx\n", flags);
 fuse_log(FUSE_LOG_DEBUG, "max_readahead=0x%08x\n", arg->max_readahead);
 }
 se->conn.proto_major = arg->major;
@@ -1934,68 +1951,68 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 if (arg->max_readahead < se->conn.max_readahead) {
 se->conn.max_readahead = arg->max_readahead;
 }
-if (arg->flags & FUSE_ASYNC_READ) {
+if (flags & FUSE_ASYNC_READ) {
 se->conn.capable |= FUSE_CAP_ASYNC_READ;
 }
-if (arg->flags & FUSE_POSIX_LOCKS) {
+if (flags & FUSE_POSIX_LOCKS) {
 se->conn.capable |= FUSE_CAP_POSIX_LOCKS;
 }
-if (arg->flags & FUSE_ATOMIC_O_TRUNC) {
+if (flags & FUSE_ATOMIC_O_TRUNC) {
 se->conn.capable |= FUSE_CAP_ATOMIC_O_TRUNC;
 }
-if (arg->flags & FUSE_EXPORT_SUPPORT) {
+if (flags & FUSE_EXPORT_SUPPORT) {
 se->conn.capable |= FUSE_CAP_EXPORT_SUPPORT;
 }
-if (arg->flags & FUSE_DONT_MASK) {
+if (flags & FUSE_DONT_MASK) {
 se->conn.capable |= FUSE_CAP_DONT_MASK;
 }
-if (arg->flags & FUSE_FLOCK_LOCKS) {
+if (flags & FUSE_FLOCK_LOCKS) {
 se->conn.capable |= FUSE_CAP_FLOCK_LOCKS;
 }
-if (arg->flags & FUSE_AUTO_INVAL_DATA) {
+if (flags & FUSE_AUTO_INVAL_DATA) {
 se->conn.capable |= FUSE_CAP_AUTO_INVAL_DATA;
 }
-if (arg->flags & FUSE_DO_READDIRPLUS) {
+if (flags & FUSE_DO_READDIRPLUS) {
 se->conn.capable |= FUSE_CAP_READDIRPLUS;
 }
-if (arg->flags & FUSE_READDIRPLUS_AUTO) {
+if (flags & FUSE_READDIRPLUS_AUTO) {
 se->conn.capable |= FUSE_CAP_READDIRPLUS_AUTO;
 }
-if (arg->flags & FUSE_ASYNC_DIO) {
+if (flags & FUSE_ASYNC_DIO) {
 se->conn.capable |= FUSE_CAP_ASYNC_DIO;
 }
-if (arg->flags & FUSE_WRITEBACK_CACHE) {
+if (flags & FUSE_WRITEBACK_CACHE) {
 se->conn.capable |= FUSE_CAP_WRITEBACK_CACHE;
 }
-if (arg->flags & FUSE_NO_OPEN_SUPPORT) {
+if (flags & FUSE_NO_OPEN_SUPPORT) {
 se->conn.capable |= FUSE_CAP_NO_OPEN_SUPPORT;
 }
-if (arg->flags & FUSE_PARALLEL_DIROPS) {
+if (flags & FUSE_PARALLEL_DIROPS) {
 se->conn.capable |= FUSE_CAP_PARALLEL_DIROPS;
 }
-if (arg->flags & FUSE_POSIX_ACL) {
+if (flags & FUSE_POSIX_ACL) {
 se->conn.capable |= FUSE_CAP_POSIX_ACL;
 }
-if (arg->flags & FUSE_HANDLE_KILLPRIV) {
+if (flags & FUSE_HANDLE_KILLPRIV) {
 se->conn.capable |= FUSE_CAP_HANDLE_KILLPRIV;
 }
-if (arg->flags & FUSE_NO_OPENDIR_SUPPORT) {
+

[PATCH v6 01/10] virtiofsd: Fix breakage due to fuse_init_in size change

2022-02-08 Thread Vivek Goyal

Kernel version 5.17 has increased the size of "struct fuse_init_in".
Previously this struct was 16 bytes and now it has been extended to
64 bytes.

Once qemu headers are updated to the latest, do_init() will expect to
receive a 64-byte struct (for protocol version major 7 and minor >= 6). But
if the guest is booting an older kernel (older than 5.17), it still sends
the older 16-byte fuse_init_in. do_init() then fails because it is expecting
a 64-byte struct, and as a result the virtiofs mount fails.

Fix this by parsing only 16 bytes for now. Separate patches will be
posted which will parse the rest of the bytes and enable new functionality.
Right now we don't support any of the new functionality, so we don't
lose anything by not parsing bytes beyond 16.
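
To make the sizes concrete, here is a small sketch of the approach. The
struct below is a simplified, hypothetical stand-in for the extended init
struct (the field names are assumptions chosen to give a 64-byte layout), and
init_in_bytes_to_parse() is not a function in this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for the extended 64-byte init struct. */
struct ex_init_in {
    uint32_t major;
    uint32_t minor;
    uint32_t max_readahead;
    uint32_t flags;
    uint32_t flags2;
    uint32_t unused[11];
};

/* Bytes the server should consume for a given protocol version. */
static size_t init_in_bytes_to_parse(uint32_t major, uint32_t minor)
{
    size_t compat_size = offsetof(struct ex_init_in, max_readahead);
    size_t compat2_size = offsetof(struct ex_init_in, flags) + sizeof(uint32_t);

    assert(compat2_size <= sizeof(struct ex_init_in));
    if (major == 7 && minor >= 6) {
        /* Stop after ->flags (16 bytes) instead of using sizeof(struct). */
        return compat2_size;
    }
    /* Very old clients send only major and minor. */
    return compat_size;
}
```

Pinning the size to a named field, rather than to sizeof(), means a later
header update that grows the struct does not change how many bytes are
required from an older guest.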

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_lowlevel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index e4679c73ab..ce29a70253 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1880,6 +1880,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 struct fuse_mbuf_iter *iter)
 {
 size_t compat_size = offsetof(struct fuse_init_in, max_readahead);
+size_t compat2_size = offsetof(struct fuse_init_in, flags) + sizeof(uint32_t);
 struct fuse_init_in *arg;
 struct fuse_init_out outarg;
 struct fuse_session *se = req->se;
@@ -1897,7 +1898,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 
 /* ...and now consume the new fields. */
 if (arg->major == 7 && arg->minor >= 6) {
-if (!fuse_mbuf_iter_advance(iter, sizeof(*arg) - compat_size)) {
+if (!fuse_mbuf_iter_advance(iter, compat2_size - compat_size)) {
 fuse_reply_err(req, EINVAL);
 return;
 }
-- 
2.34.1

[PATCH v6 08/10] virtiofsd: Create new file with security context

2022-02-08 Thread Vivek Goyal

This patch adds support for creating a new file with the security context
sent by the client. It basically takes one of three paths.

- If security context is not enabled, then it continues to create files
  without a security context.

- If security context is enabled but security.selinux has not been
  remapped, then it uses the /proc/thread-self/attr/fscreate knob to set
  the security context and then creates the file. This makes sure that the
  newly created file gets the security context set in "fscreate", and
  this is atomic w.r.t. file creation.

  This is useful when host and guest SELinux policies don't conflict and
  can work with each other. In that case, the guest security.selinux xattr
  is not remapped and is passed through as the "security.selinux" xattr
  on the host.

- If security context is enabled but the security.selinux xattr has been
  remapped to something else, then it first creates the file and then
  uses setxattr() to set the remapped xattr with the security context.
  This is a non-atomic operation w.r.t. file creation.

  This mode will be the most versatile and allows host and guest to have
  their own separate SELinux xattrs and their own separate SELinux policies.
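
The three paths above can be sketched as a small decision function. This is
an illustration under assumptions, not the code in this patch: the enum and
pick_create_method() are hypothetical names. The sketch also falls back to
the non-atomic path when the fscreate knob is not usable on the host.

```c
#include <assert.h>
#include <stdbool.h>

enum create_method {
    CREATE_PLAIN,         /* no security context to apply */
    CREATE_FSCREATE,      /* atomic: set fscreate, then create the file */
    CREATE_THEN_SETXATTR, /* non-atomic: create the file, then setxattr() */
};

/* Pick how a new file should be created, given the current configuration. */
static enum create_method pick_create_method(bool secctx_enabled,
                                             bool xattr_remapped,
                                             bool fscreate_usable)
{
    if (!secctx_enabled) {
        return CREATE_PLAIN;
    }
    if (!xattr_remapped && fscreate_usable) {
        return CREATE_FSCREATE;
    }
    return CREATE_THEN_SETXATTR;
}
```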

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 229 +++
 1 file changed, 200 insertions(+), 29 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 7762bf0d2c..68fa542fac 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -234,6 +234,11 @@ static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
+#define FCHDIR_NOFAIL(fd) do { \
+int fchdir_res = fchdir(fd);   \
+assert(fchdir_res == 0);   \
+} while (0)
+
 static bool is_dot_or_dotdot(const char *name)
 {
 return name[0] == '.' &&
@@ -288,7 +293,6 @@ static bool is_fscreate_usable(struct lo_data *lo)
 }
 
 /* Helpers to set/reset fscreate */
-__attribute__((unused))
 static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
   size_t ctxlen,int *fd)
 {
@@ -316,7 +320,6 @@ out:
 return err;
 }
 
-__attribute__((unused))
 static void close_reset_proc_fscreate(int fd)
 {
 if ((write(fd, NULL, 0)) == -1) {
@@ -1354,16 +1357,103 @@ static void lo_restore_cred_gain_cap(struct lo_cred *old, bool restore_umask,
 }
 }
 
+static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
+   const char *name, const char *secctx_name)
+{
+int path_fd, err;
+char procname[64];
+struct lo_data *lo = lo_data(req);
+
+if (!req->secctx.ctxlen) {
+return 0;
+}
+
+/* Open newly created element with O_PATH */
+path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+err = path_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+sprintf(procname, "%i", path_fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+/* Set security context. This is not atomic w.r.t file creation */
+err = setxattr(procname, secctx_name, req->secctx.ctx, req->secctx.ctxlen,
+   0);
+if (err) {
+err = errno;
+}
+FCHDIR_NOFAIL(lo->root.fd);
+close(path_fd);
+return err;
+}
+
+static int do_mknod_symlink(fuse_req_t req, struct lo_inode *dir,
+const char *name, mode_t mode, dev_t rdev,
+const char *link)
+{
+int err, fscreate_fd = -1;
+const char *secctx_name = req->secctx.name;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+char *mapped_name = NULL;
+bool secctx_enabled = req->secctx.ctxlen;
+bool do_fscreate = false;
+
+if (secctx_enabled && lo->xattrmap) {
+err = xattr_map_client(lo, req->secctx.name, &mapped_name);
+if (err < 0) {
+return -err;
+}
+secctx_name = mapped_name;
+}
+
+/*
+ * If security xattr has not been remapped and selinux is enabled on
+ * host, set fscreate and no need to do a setxattr() after file creation
+ */
+if (secctx_enabled && !mapped_name && lo->use_fscreate) {
+do_fscreate = true;
+err = open_set_proc_fscreate(lo, req->secctx.ctx, req->secctx.ctxlen,
+ &fscreate_fd);
+if (err) {
+goto out;
+}
+}
+
+err = lo_change_cred(req, &old, lo->change_umask && !S_ISLNK(mode));
+if (err) {
+goto out;
+}
+
+err = mknod_wrapper(dir->fd, name, link, mode, rdev);
+err = err == -1 ? errno : 0;
+lo_restore_cred(&old, lo->change_umask && !S_ISLNK(mode));
+i

[PATCH v6 05/10] virtiofsd, fuse_lowlevel.c: Add capability to parse security context

2022-02-08 Thread Vivek Goyal

Add capability to enable and parse security context as sent by client
and put into fuse_req. Filesystems now can get security context from
request and set it on files during creation.
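
As a rough sketch of the parsing this patch adds: a header, then at most one
entry, then a NUL-terminated xattr name, then the context bytes. The struct
layouts and all ex_* names below are assumptions made for illustration and
are not taken from the protocol headers; the sketch also ignores any padding
rules.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed, simplified wire layout. */
struct ex_secctx_header { uint32_t size; uint32_t nr_secctx; };
struct ex_secctx { uint32_t size; uint32_t padding; };

struct ex_parsed {
    const char *name; /* xattr name for the context */
    const void *ctx;  /* context bytes, pointing into the input buffer */
    uint32_t ctxlen;  /* 0 when no context was sent */
};

/* Returns 0 on success, -1 on malformed or truncated input. */
static int parse_secctx(const uint8_t *buf, size_t len, struct ex_parsed *out)
{
    struct ex_secctx_header hdr;
    struct ex_secctx ent;
    const uint8_t *name, *nul;
    size_t off = 0;

    assert(buf != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    if (len - off < sizeof(hdr)) {
        return -1;
    }
    memcpy(&hdr, buf + off, sizeof(hdr));
    off += sizeof(hdr);

    if (hdr.nr_secctx > 1) {
        return -1; /* only one context is supported */
    }
    if (hdr.nr_secctx == 0) {
        return 0;  /* nothing sent, not an error */
    }

    if (len - off < sizeof(ent)) {
        return -1;
    }
    memcpy(&ent, buf + off, sizeof(ent));
    off += sizeof(ent);
    if (ent.size == 0) {
        return -1; /* an entry with an empty context is not expected */
    }

    name = buf + off;
    nul = memchr(name, '\0', len - off);
    if (!nul) {
        return -1; /* name is not NUL-terminated within the buffer */
    }
    off += (size_t)(nul - name) + 1;

    if (len - off < ent.size) {
        return -1;
    }
    out->name = (const char *)name;
    out->ctx = buf + off;
    out->ctxlen = ent.size;
    return 0;
}
```

Every step checks the remaining length before reading, so a truncated request
is rejected instead of being read past its end.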

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   |   5 ++
 tools/virtiofsd/fuse_i.h|   7 +++
 tools/virtiofsd/fuse_lowlevel.c | 102 +++-
 3 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 6f8a988202..bf46954dab 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -377,6 +377,11 @@ struct fuse_file_info {
  */
 #define FUSE_CAP_SETXATTR_EXT (1 << 29)
 
+/**
+ * Indicates that file server supports creating file security context
+ */
+#define FUSE_CAP_SECURITY_CTX (1ULL << 32)
+
 /**
  * Ioctl flags
  *
diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 492e002181..a5572fa4ae 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -15,6 +15,12 @@
 struct fv_VuDev;
 struct fv_QueueInfo;
 
+struct fuse_security_context {
+const char *name;
+uint32_t ctxlen;
+const void *ctx;
+};
+
 struct fuse_req {
 struct fuse_session *se;
 uint64_t unique;
@@ -35,6 +41,7 @@ struct fuse_req {
 } u;
 struct fuse_req *next;
 struct fuse_req *prev;
+struct fuse_security_context secctx;
 };
 
 struct fuse_notify_req {
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index d91cd9743a..2909122b23 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -886,11 +886,63 @@ static void do_readlink(fuse_req_t req, fuse_ino_t nodeid,
 }
 }
 
+static int parse_secctx_fill_req(fuse_req_t req, struct fuse_mbuf_iter *iter)
+{
+struct fuse_secctx_header *fsecctx_header;
+struct fuse_secctx *fsecctx;
+const void *secctx;
+const char *name;
+
+fsecctx_header = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx_header));
+if (!fsecctx_header) {
+return -EINVAL;
+}
+
+/*
+ * As of now maximum of one security context is supported. It can
+ * change in future though.
+ */
+if (fsecctx_header->nr_secctx > 1) {
+return -EINVAL;
+}
+
+/* No security context sent. Maybe no LSM supports it */
+if (!fsecctx_header->nr_secctx) {
+return 0;
+}
+
+fsecctx = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx));
+if (!fsecctx) {
+return -EINVAL;
+}
+
+/* struct fsecctx with zero sized context is not expected */
+if (!fsecctx->size) {
+return -EINVAL;
+}
+name = fuse_mbuf_iter_advance_str(iter);
+if (!name) {
+return -EINVAL;
+}
+
+secctx = fuse_mbuf_iter_advance(iter, fsecctx->size);
+if (!secctx) {
+return -EINVAL;
+}
+
+req->secctx.name = name;
+req->secctx.ctx = secctx;
+req->secctx.ctxlen = fsecctx->size;
+return 0;
+}
+
 static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
  struct fuse_mbuf_iter *iter)
 {
 struct fuse_mknod_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -901,6 +953,14 @@ static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+return;
+}
+}
+
 if (req->se->op.mknod) {
 req->se->op.mknod(req, nodeid, name, arg->mode, arg->rdev);
 } else {
@@ -913,6 +973,8 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 {
 struct fuse_mkdir_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -923,6 +985,14 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+return;
+}
+}
+
 if (req->se->op.mkdir) {
 req->se->op.mkdir(req, nodeid, name, arg->mode);
 } else {
@@ -969,12 +1039,22 @@ static void do_symlink(fuse_req_t req, fuse_ino_t nodeid,
 {
 const char *name = fuse_mbuf_iter_advance_str(iter);
 const char *linkname = fuse_mbuf_iter_advance_str(iter);
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 if (!name || !linkname) {
 fuse_reply_err(req, EINVAL);
 return;
 }
 
+if (secctx_enabled) {
+err

Re: [PATCH v5 0/9] virtiofsd: Add support for file security context at file creation

2022-02-07 Thread Vivek Goyal

On Mon, Feb 07, 2022 at 01:05:16PM +, Daniel P. Berrangé wrote:
> On Wed, Feb 02, 2022 at 02:39:26PM -0500, Vivek Goyal wrote:
> > Hi,
> > 
> > This is V5 of the patches. I posted V4 here.
> > 
> > https://listman.redhat.com/archives/virtio-fs/2022-January/msg00041.html
> > 
> > These will allow us to support SELinux with virtiofs. This will send
> > SELinux context at file creation to server and server can set it on
> > file.
> 
> I've not entirely figured it out from the code, so easier for me
> to ask...
> 
> How is the SELinux labelled stored on the host side ? It is stored
> directly in the security.* xattr namespace,

[ CC Dan Walsh ]

I just tried to test the mode where I don't do xattr remapping and instead
set /proc/pid/attr/fscreate with the context I want. This sets the
security.selinux xattr on the host.

But the write to /proc/pid/attr/fscreate fails if the host does not recognize
the label sent by the guest. I am running virtiofsd as unconfined_t but
it still fails because the guest is trying to create a file with
"test_filesystem_filetranscon_t" and the host does not recognize this
label. I am seeing the following in the audit logs.

type=SELINUX_ERR msg=audit(1644268262.666:8111): op=fscreate invalid_context="unconfined_u:object_r:test_filesystem_filetranscon_t:s0"

So if we don't remap xattrs and the host has SELinux enabled, then it will
probably work only in very limited circumstances where host and guest
policies don't conflict. I guess it's like running a Fedora 34 guest on a
Fedora 34 host. I suspect that this will see very limited use, though I have
put the code in for the sake of completeness.

Thanks
Vivek

> or is is subject to
> xattr remapping that virtiofsd already supports.
> 
> Storing directly means virtiofsd has to run in an essentially
> unconfined context, to let it do arbitrary  changes on security.*
> xattrs without being blocked by SELinux) and has risk that guest
> initiated changes can open holes in the host confinement if
> the exported FS is generally visible to processes on the host.
> 
> 
> Using remapping lets virtiofsd be strictly isolated by SELinux
> policy on the host, and ensures that guest context changes
> can't open up holes in the host.
> 
> Both are valid use cases, so I'd ultimately expect us to want
> to support both, but my preference for a "default" behaviour
> would be remapping.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>

Re: [PATCH v4 09/11] 9p: darwin: Implement compatibility for mknodat

2022-02-07 Thread Vivek Goyal

On Mon, Feb 07, 2022 at 11:49:12AM +0100, Greg Kurz wrote:
> On Mon, 7 Feb 2022 11:30:18 +0100
> Philippe Mathieu-Daudé  wrote:
> 
> > On 7/2/22 09:47, Greg Kurz wrote:
> > > On Sun, 6 Feb 2022 20:10:23 -0500
> > > Will Cohen  wrote:
> > > 
> > >> This patch set currently places it in 9p-util only because 9p is the only
> > >> place where this issue seems to have come up so far and we were wary of
> > >> editing files too far afield, but I have no attachment to its specific
> > >> location!
> > >>
> > > 
> > > Inline comments are preferred on qemu-devel. Please don't top post !
> > > This complicates the review a lot.
> > > 
> > > This is indeed a good candidate for osdep. This being said, unless there's
> > > some other user in the QEMU code base, it is acceptable to leave it under
> > > 9pfs.
> > 
> > virtiofsd could eventually use it.
> 
> 
> Indeed but virtiofsd is for linux hosts only AFAICT and I'm not aware of any
> work to support any other host OS.

[ CC Sergio ]

We would like to support virtiofs on other host OSes. Getting rid of the
Linux-specific parts should be doable. I think the bigger challenge is how
to make the vhost-user stuff work on other OSes, like macOS.

If virtiofsd was somehow running as part of qemu (and not as a separate
process), then making the rest of the filesystem code work on other
OSes should not be too hard, I guess.

So the question is, can one somehow run the same virtiofsd code both as part
of qemu and as a separate daemon, based on need (so that one does not have
to maintain two separate code bases)?

Thanks
Vivek

> 
> Cc'ing virtio-fs people for inputs on this topic.
>

Re: [PATCH v5 0/9] virtiofsd: Add support for file security context at file creation

2022-02-07 Thread Vivek Goyal

On Mon, Feb 07, 2022 at 01:30:16PM +0000, Daniel P. Berrangé wrote:
> On Mon, Feb 07, 2022 at 08:24:08AM -0500, Vivek Goyal wrote:
> > On Mon, Feb 07, 2022 at 01:05:16PM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Feb 02, 2022 at 02:39:26PM -0500, Vivek Goyal wrote:
> > > > Hi,
> > > > 
> > > > This is V5 of the patches. I posted V4 here.
> > > > 
> > > > https://listman.redhat.com/archives/virtio-fs/2022-January/msg00041.html
> > > > 
> > > > These will allow us to support SELinux with virtiofs. This will send
> > > > SELinux context at file creation to server and server can set it on
> > > > file.
> > > 
> > > I've not entirely figured it out from the code, so easier for me
> > > to ask...
> > > 
> > > > How is the SELinux label stored on the host side? Is it stored
> > > > directly in the security.* xattr namespace, or is it subject to the
> > > > xattr remapping that virtiofsd already supports?
> > > > 
> > > > Storing directly means virtiofsd has to run in an essentially
> > > > unconfined context (to let it do arbitrary changes on security.*
> > > > xattrs without being blocked by SELinux) and has the risk that guest
> > > initiated changes can open holes in the host confinement if
> > > the exported FS is generally visible to processes on the host.
> > > 
> > > 
> > > Using remapping lets virtiofsd be strictly isolated by SELinux
> > > policy on the host, and ensures that guest context changes
> > > can't open up holes in the host.
> > > 
> > > Both are valid use cases, so I'd ultimately expect us to want
> > > to support both, but my preference for a "default" behaviour
> > > would be remapping.
> > 
> > I am expecting users to configure virtiofsd to remap "security.selinux"
> > to "trusted.virtiofsd.security.selinux" and that will allow guest
> > and host security selinux to co-exist and allow separate SELinux policies
> > for guest and host.
> > 
> > I agree that my preference for a default behavior is remapping as well.
> > That makes most sense. 
> > 
> > One downside of mapping to trusted namespace is that it requires
> > CAP_SYS_ADMIN for virtiofsd.
> > 
> > Having said that, these patches don't enforce the remapping default. That
> > has to come from the user because it also needs to be given CAP_SYS_ADMIN.
> > So out of box default is no remapping and passthrough SELinux.
> 
> Ok, that all makes sense then. My only suggestion then is to put something
> more explicit in the man page docs to highlight the implications /
> interaction between the new command line options for labelling and the
> likely need for remapping security.*

Ok, will do. While describing this new command line option, I will also
mention the likely need for remapping, the additional capability, and the
security implications. Or maybe I will create a small new section for
SELinux in the same file.

Thanks
Vivek

Re: [PATCH v5 0/9] virtiofsd: Add support for file security context at file creation

2022-02-07 Thread Vivek Goyal

On Mon, Feb 07, 2022 at 12:49:24PM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Hi,
> > 
> > This is V5 of the patches. I posted V4 here.
> > 
> > https://listman.redhat.com/archives/virtio-fs/2022-January/msg00041.html
> > 
> > These will allow us to support SELinux with virtiofs. This will send
> > SELinux context at file creation to server and server can set it on
> > file.
> 
> I think that's pretty close; I've got some minor comments I've replied
> to on the individual patches.
> 
> I do worry that the number of different paths for each operation is now
> quite large so hard to test.

There are indeed many combinations to test. During development, I made
sure to test every path at least once to make sure it works.

> I also wonder what happens on something other than SELinux.

As of now this pretty much works only for SELinux. In particular, the usage
of the fscreate knob is very specific to SELinux.

In some cases, it will work with an LSM other than SELinux
as well. But let's not go there.

If we want to support multiple security contexts at some point,
the fuse protocol changes have been written in such a way that fuse
can send multiple security contexts, and then we will have to modify
the code to be able to deal with that.

In short, for now, this code pretty much expects one security
context, and that too SELinux. This is very much in line with ceph and
nfs.

Vivek
> 
> Dave
> 
> > Changes since V4
> > 
> > - Parse only known current size of fuse_init_in. This will make sure
> >   that future extension does not break existing code upon header
> >   update. (David Gilbert)
> > 
> > - Changed the order of one of the patches. It is now the first patch in
> >   the series. This will help fix the breakage before the header update
> >   patch, and the code remains git bisectable. (David Gilbert)
> > 
> > - Changed %lx to %llx at one place. (David Gilbert).
> > 
> > Thanks
> > Vivek
> >  
> > Vivek Goyal (9):
> >   virtiofsd: Fix breakage due to fuse_init_in size change
> >   linux-headers: Update headers to v5.17-rc1
> >   virtiofsd: Parse extended "struct fuse_init_in"
> >   virtiofsd: Extend size of fuse_conn_info->capable and ->want fields
> >   virtiofsd, fuse_lowlevel.c: Add capability to parse security context
> >   virtiofsd: Move core file creation code in separate function
> >   virtiofsd: Create new file with fscreate set
> >   virtiofsd: Create new file using O_TMPFILE and set security context
> >   virtiofsd: Add an option to enable/disable security label
> > 
> >  docs/tools/virtiofsd.rst  |   7 +
> >  include/standard-headers/asm-x86/kvm_para.h   |   1 +
> >  include/standard-headers/drm/drm_fourcc.h |  11 +
> >  include/standard-headers/linux/ethtool.h  |   1 +
> >  include/standard-headers/linux/fuse.h |  60 ++-
> >  include/standard-headers/linux/pci_regs.h | 142 +++---
> >  include/standard-headers/linux/virtio_gpio.h  |  72 +++
> >  include/standard-headers/linux/virtio_i2c.h   |  47 ++
> >  include/standard-headers/linux/virtio_iommu.h |   8 +-
> >  .../standard-headers/linux/virtio_pcidev.h|  65 +++
> >  include/standard-headers/linux/virtio_scmi.h  |  24 +
> >  linux-headers/asm-generic/unistd.h|   5 +-
> >  linux-headers/asm-mips/unistd_n32.h   |   2 +
> >  linux-headers/asm-mips/unistd_n64.h   |   2 +
> >  linux-headers/asm-mips/unistd_o32.h   |   2 +
> >  linux-headers/asm-powerpc/unistd_32.h |   2 +
> >  linux-headers/asm-powerpc/unistd_64.h |   2 +
> >  linux-headers/asm-riscv/bitsperlong.h |  14 +
> >  linux-headers/asm-riscv/mman.h|   1 +
> >  linux-headers/asm-riscv/unistd.h  |  44 ++
> >  linux-headers/asm-s390/unistd_32.h|   2 +
> >  linux-headers/asm-s390/unistd_64.h|   2 +
> >  linux-headers/asm-x86/kvm.h   |  16 +-
> >  linux-headers/asm-x86/unistd_32.h |   1 +
> >  linux-headers/asm-x86/unistd_64.h |   1 +
> >  linux-headers/asm-x86/unistd_x32.h|   1 +
> >  linux-headers/linux/kvm.h |  17 +
> >  tools/virtiofsd/fuse_common.h |   9 +-
> >  tools/virtiofsd/fuse_i.h  |   7 +
> >  tools/virtiofsd/fuse_lowlevel.c   | 162 +--
> >  tools/virtiofsd/helper.c  |   1 +
> >  tools/virtiofsd/passthrough_ll.c  | 414 --
> >  32 files changed, 1013 insertions(+), 132 deleti

Re: [PATCH v5 9/9] virtiofsd: Add an option to enable/disable security label

2022-02-07 Thread Vivek Goyal

On Mon, Feb 07, 2022 at 12:40:21PM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Provide an option "-o security_label/no_security_label" to enable/disable
> > security label functionality. By default these are turned off.
> > 
> > If enabled, server will indicate to client that it is capable of handling
> > one security label during file creation. Typically this is expected to
> > be a SELinux label. File server will set this label on the file. It will
> > try to set it atomically wherever possible. But it's not possible in
> > all cases.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  docs/tools/virtiofsd.rst |  7 +++
> >  tools/virtiofsd/helper.c |  1 +
> >  tools/virtiofsd/passthrough_ll.c | 15 +++
> >  3 files changed, 23 insertions(+)
> > 
> > diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
> > index 07ac0be551..a2c005f4a0 100644
> > --- a/docs/tools/virtiofsd.rst
> > +++ b/docs/tools/virtiofsd.rst
> > @@ -104,6 +104,13 @@ Options
> >* posix_acl|no_posix_acl -
> >  Enable/disable posix acl support.  Posix ACLs are disabled by default.
> >  
> > +  * security_label|no_security_label -
> > +Enable/disable security label support. Security labels are disabled by
> > +default. This will allow client to send a MAC label of file during
> ^ the^ a
> > +file creation. Typically this is expected to be SELinux security
>   ^ an
> 
> > +label. Server will try to set that label on newly created file
>   ^The server
> > +atomically wherever possible.
> > +
> >  .. option:: --socket-path=PATH
> >  
> >Listen on vhost-user UNIX domain socket at PATH.
> > diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> > index a8295d975a..e226fc590f 100644
> > --- a/tools/virtiofsd/helper.c
> > +++ b/tools/virtiofsd/helper.c
> > @@ -187,6 +187,7 @@ void fuse_cmdline_help(void)
> > "   default: no_allow_direct_io\n"
> > "-o announce_submounts  Announce sub-mount points to 
> > the guest\n"
> > "-o posix_acl/no_posix_acl  Enable/Disable posix_acl. 
> > (default: disabled)\n"
> > +   "-o security_label/no_security_label  Enable/Disable 
> > security label. (default: disabled)\n"
> > );
> >  }
> >  
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index 43c9b6dbe5..fe8f3ccbb6 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -181,6 +181,7 @@ struct lo_data {
> >  int user_posix_acl, posix_acl;
> >  /* Keeps track if /proc/<pid>/attr/fscreate should be used or not */
> >  bool use_fscreate;
> > +int user_security_label;
> >  };
> >  
> >  static const struct fuse_opt lo_opts[] = {
> > @@ -215,6 +216,8 @@ static const struct fuse_opt lo_opts[] = {
> >  { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
> >  { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
> >  { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
> > +{ "security_label", offsetof(struct lo_data, user_security_label), 1 },
> > +{ "no_security_label", offsetof(struct lo_data, user_security_label), 
> > 0 },
> >  FUSE_OPT_END
> >  };
> >  static bool use_syslog = false;
> > @@ -771,6 +774,17 @@ static void lo_init(void *userdata, struct 
> > fuse_conn_info *conn)
> >  fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling posix_acl\n");
> >  conn->want &= ~FUSE_CAP_POSIX_ACL;
> >  }
> > +
> > +if (lo->user_security_label == 1) {
> > +if (!(conn->capable & FUSE_CAP_SECURITY_CTX)) {
> > +fuse_log(FUSE_LOG_ERR, "lo_init: Can not enable security 
> > label."
> > + " kernel does not support FUSE_SECURITY_CTX 
> > capability.\n");
> > +}
> 
> Do you need to exit in this case - or at least clear the flag?

Actually we don't necessarily have to exit here because fuse_lowlevel.c
has a check which makes it exit. And that's why I do not clear the
flag from ->want to signify that filesystem wants tha

Re: [PATCH v5 7/9] virtiofsd: Create new file with fscreate set

2022-02-07 Thread Vivek Goyal

On Mon, Feb 07, 2022 at 11:38:12AM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > This patch adds support to set /proc/thread-self/attr/fscreate before
> > file creation. It is set to a value as sent by client. This will allow
> > for atomic creation of security context on files w.r.t file creation.
> > 
> > This is primarily useful when either there is no SELinux enabled on
> > host or host and guest policies are in sync and don't conflict.
> > 
> > Signed-off-by: Vivek Goyal 
> 
> Minor nit below, but I think this is right:
> 
> Reviewed-by: Dr. David Alan Gilbert 
> 
> I would however prefer if you could split this patch; it's a bit long to
> review.

Ok, I will look into splitting it.

> 
> 
> > ---
> >  tools/virtiofsd/passthrough_ll.c | 317 ---
> >  1 file changed, 290 insertions(+), 27 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index 82023bf3d4..acb99aa2fc 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -173,10 +173,14 @@ struct lo_data {
> >  
> >  /* An O_PATH file descriptor to /proc/self/fd/ */
> >  int proc_self_fd;
> > +/* An O_PATH file descriptor to /proc/self/task/ */
> > +int proc_self_task;
> >  int user_killpriv_v2, killpriv_v2;
> >  /* If set, virtiofsd is responsible for setting umask during creation 
> > */
> >  bool change_umask;
> >  int user_posix_acl, posix_acl;
> > +/* Keeps track if /proc/<pid>/attr/fscreate should be used or not */
> > +bool use_fscreate;
> >  };
> >  
> >  static const struct fuse_opt lo_opts[] = {
> > @@ -230,6 +234,11 @@ static struct lo_inode *lo_find(struct lo_data *lo, 
> > struct stat *st,
> >  static int xattr_map_client(const struct lo_data *lo, const char 
> > *client_name,
> >  char **out_name);
> >  
> > +#define FCHDIR_NOFAIL(fd) do { \
> > +int fchdir_res = fchdir(fd);   \
> > +assert(fchdir_res == 0);   \
> > +} while (0)
> > +
> >  static bool is_dot_or_dotdot(const char *name)
> >  {
> >  return name[0] == '.' &&
> > @@ -256,6 +265,33 @@ static struct lo_data *lo_data(fuse_req_t req)
> >  return (struct lo_data *)fuse_req_userdata(req);
> >  }
> >  
> > +/*
> > + * Tries to figure out if /proc/<pid>/attr/fscreate is usable or not. With
> > + * selinux=0, read from fscreate returns -EINVAL.
> > + *
> > + * TODO: Link with libselinux and use is_selinux_enabled() instead down
> > + * the line. It probably will be more reliable indicator.
> > + */
> > +static bool is_fscreate_usable(struct lo_data *lo)
> > +{
> > +char procname[64];
> > +int fscreate_fd;
> > +ssize_t bytes_read;
> > +
> > +sprintf(procname, "%d/attr/fscreate", gettid());
> > +fscreate_fd = openat(lo->proc_self_task, procname, O_RDWR);
> > +if (fscreate_fd == -1) {
> > +return false;
> > +}
> > +
> > +bytes_read = read(fscreate_fd, procname, 64);
> > +close(fscreate_fd);
> > +if (bytes_read == -1) {
> > +return false;
> > +}
> > +return true;
> > +}
> > +
> >  /*
> >   * Load capng's state from our saved state if the current thread
> >   * hadn't previously been loaded.
> > @@ -1284,16 +1320,140 @@ static void lo_restore_cred_gain_cap(struct 
> > lo_cred *old, bool restore_umask,
> >  }
> >  }
> >  
> > +/* Helpers to set/reset fscreate */
> > +static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
> > +  size_t ctxlen, int *fd)
> > +{
> > +char procname[64];
> > +int fscreate_fd, err = 0;
> > +ssize_t written;
> > +
> > +sprintf(procname, "%d/attr/fscreate", gettid());
> > +fscreate_fd = openat(lo->proc_self_task, procname, O_WRONLY);
> > +err = fscreate_fd == -1 ? errno : 0;
> > +if (err) {
> > +return err;
> > +}
> > +
> > +written = write(fscreate_fd, ctx, ctxlen);
> > +err = written == -1 ? errno : 0;
> > +if (err) {
> > +goto out;
> > +}
> > +
> > +*fd = fscreate_fd;
> > +return 0;
> > +out:
> > +close(fscreate_fd);
> > +return err

Re: [PATCH v5 5/9] virtiofsd, fuse_lowlevel.c: Add capability to parse security context

2022-02-07 Thread Vivek Goyal

On Thu, Feb 03, 2022 at 07:41:27PM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Add capability to enable and parse security context as sent by client
> > and put into fuse_req. Filesystems now can get security context from
> > request and set it on files during creation.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  tools/virtiofsd/fuse_common.h   |  5 ++
> >  tools/virtiofsd/fuse_i.h|  7 +++
> >  tools/virtiofsd/fuse_lowlevel.c | 95 -
> >  3 files changed, 106 insertions(+), 1 deletion(-)
> > 
> > diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
> > index 6f8a988202..bf46954dab 100644
> > --- a/tools/virtiofsd/fuse_common.h
> > +++ b/tools/virtiofsd/fuse_common.h
> > @@ -377,6 +377,11 @@ struct fuse_file_info {
> >   */
> >  #define FUSE_CAP_SETXATTR_EXT (1 << 29)
> >  
> > +/**
> > + * Indicates that file server supports creating file security context
> > + */
> > +#define FUSE_CAP_SECURITY_CTX (1ULL << 32)
> > +
> >  /**
> >   * Ioctl flags
> >   *
> > diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
> > index 492e002181..a5572fa4ae 100644
> > --- a/tools/virtiofsd/fuse_i.h
> > +++ b/tools/virtiofsd/fuse_i.h
> > @@ -15,6 +15,12 @@
> >  struct fv_VuDev;
> >  struct fv_QueueInfo;
> >  
> > +struct fuse_security_context {
> > +const char *name;
> > +uint32_t ctxlen;
> > +const void *ctx;
> > +};
> > +
> >  struct fuse_req {
> >  struct fuse_session *se;
> >  uint64_t unique;
> > @@ -35,6 +41,7 @@ struct fuse_req {
> >  } u;
> >  struct fuse_req *next;
> >  struct fuse_req *prev;
> > +struct fuse_security_context secctx;
> >  };
> >  
> >  struct fuse_notify_req {
> > diff --git a/tools/virtiofsd/fuse_lowlevel.c 
> > b/tools/virtiofsd/fuse_lowlevel.c
> > index 83d29762a4..cd9ef97b3c 100644
> > --- a/tools/virtiofsd/fuse_lowlevel.c
> > +++ b/tools/virtiofsd/fuse_lowlevel.c
> > @@ -886,11 +886,59 @@ static void do_readlink(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  }
> >  }
> >  
> > +static int parse_secctx_fill_req(fuse_req_t req, struct fuse_mbuf_iter 
> > *iter)
> > +{
> > +struct fuse_secctx_header *fsecctx_header;
> > +struct fuse_secctx *fsecctx;
> > +const void *secctx;
> > +const char *name;
> > +
> > +fsecctx_header = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx_header));
> > +if (!fsecctx_header) {
> > +return -EINVAL;
> > +}
> > +
> > +/*
> > + * As of now maximum of one security context is supported. It can
> > + * change in future though.
> > + */
> > +if (fsecctx_header->nr_secctx > 1) {
> > +return -EINVAL;
> > +}
> > +
> > +/* No security context sent. Maybe no LSM supports it */
> > +if (!fsecctx_header->nr_secctx) {
> > +return 0;
> > +}
> > +
> > +fsecctx = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx));
> > +if (!fsecctx) {
> > +return -EINVAL;
> > +}
> 
> Are there any sanity checks to be done on fsecctx->size?

Maybe we can check that fsecctx->size is not 0. Also "man xattr" says...

"The  VFS  imposes limitations that an attribute names is limited to 255
bytes and an attribute value is limited to 64 kB."

We could probably check those limits. But if the VFS decides to raise these
limits at some point, then newer kernels will break if they decide
to send a longer security context to an older virtiofsd. Very unlikely
though. So if you prefer putting limits of "255" on the name and
64kB on the size, I can do that.
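
A minimal sketch of the checks discussed above, assuming the limits quoted
from "man xattr" (255 bytes for the name, 64 kB for the value). The helper
name and the constants are hypothetical and are not part of virtiofsd; a real
version would still have to decide how to cope with the VFS raising its
limits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Limits as quoted from "man xattr"; the names are local to this sketch. */
#define SKETCH_XATTR_NAME_MAX 255
#define SKETCH_XATTR_SIZE_MAX (64 * 1024)

/*
 * Hypothetical helper: returns 0 if the xattr name and the security
 * context size look sane, -EINVAL otherwise.
 */
static int sketch_check_secctx(const char *name, uint32_t ctx_size)
{
    size_t name_len;

    assert(name != NULL);
    name_len = strlen(name);

    /* Reject an empty or over-long xattr name. */
    if (name_len == 0 || name_len > SKETCH_XATTR_NAME_MAX) {
        return -EINVAL;
    }

    /* Reject an empty or over-long security context. */
    if (ctx_size == 0 || ctx_size > SKETCH_XATTR_SIZE_MAX) {
        return -EINVAL;
    }

    return 0;
}
```

With this sketch, a name such as "security.selinux" with a 32-byte context
passes, while a zero-length context or a context larger than 64 kB is
rejected with -EINVAL.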


> 
> > +name = fuse_mbuf_iter_advance_str(iter);
> > +if (!name) {
> > +return -EINVAL;
> > +}
> > +
> > +secctx = fuse_mbuf_iter_advance(iter, fsecctx->size);
> > +if (!secctx) {
> > +return -EINVAL;
> > +}
> > +
> > +req->secctx.name = name;
> > +req->secctx.ctx = secctx;
> > +req->secctx.ctxlen = fsecctx->size;
> 
> It's OK to use the pointers into the iter here rather than take copies?

I see lookup() and setxattr() already do that. They pass "name" which
is just a pointer into iter. 

My understanding is that req will not be freed. So iter and associated
memory will continue to be valid. So it should be ok and not make
c

Re: [PATCH v5 0/9] virtiofsd: Add support for file security context at file creation

2022-02-07 Thread Vivek Goyal

On Mon, Feb 07, 2022 at 01:05:16PM +0000, Daniel P. Berrangé wrote:
> On Wed, Feb 02, 2022 at 02:39:26PM -0500, Vivek Goyal wrote:
> > Hi,
> > 
> > This is V5 of the patches. I posted V4 here.
> > 
> > https://listman.redhat.com/archives/virtio-fs/2022-January/msg00041.html
> > 
> > These will allow us to support SELinux with virtiofs. This will send
> > SELinux context at file creation to server and server can set it on
> > file.
> 
> I've not entirely figured it out from the code, so easier for me
> to ask...
> 
> How is the SELinux label stored on the host side? Is it stored
> directly in the security.* xattr namespace, or is it subject to the
> xattr remapping that virtiofsd already supports?
> 
> Storing directly means virtiofsd has to run in an essentially
> unconfined context (to let it do arbitrary changes on security.*
> xattrs without being blocked by SELinux) and has the risk that guest
> initiated changes can open holes in the host confinement if
> the exported FS is generally visible to processes on the host.
> 
> 
> Using remapping lets virtiofsd be strictly isolated by SELinux
> policy on the host, and ensures that guest context changes
> can't open up holes in the host.
> 
> Both are valid use cases, so I'd ultimately expect us to want
> to support both, but my preference for a "default" behaviour
> would be remapping.

I am expecting users to configure virtiofsd to remap "security.selinux"
to "trusted.virtiofsd.security.selinux". That will allow guest
and host SELinux labels to co-exist and allow separate SELinux policies
for guest and host.

I agree; my preference for a default behavior is remapping as well.
That makes the most sense.

One downside of mapping to the trusted namespace is that it requires
CAP_SYS_ADMIN for virtiofsd.

Having said that, these patches don't enforce remapping by default. That
has to come from the user because virtiofsd also needs to be given
CAP_SYS_ADMIN. So the out-of-the-box default is no remapping and
passthrough SELinux.
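
To illustrate the name translation only, here is a small sketch of the
remapping described above. It assumes a fixed "trusted.virtiofsd." prefix and
a single remapped name; the function is hypothetical and is not virtiofsd's
actual xattr mapping code, which is rule based and configured by the user.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Prefix taken from the example above; the helper itself is hypothetical. */
#define SKETCH_REMAP_PREFIX "trusted.virtiofsd."

/*
 * Map a guest-visible xattr name to the name stored on the host.
 * Only "security.selinux" is remapped; every other name passes through.
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int sketch_remap_xattr(const char *guest_name, char *out, size_t outlen)
{
    const char *prefix = "";
    int n;

    assert(guest_name != NULL && out != NULL);

    if (strcmp(guest_name, "security.selinux") == 0) {
        prefix = SKETCH_REMAP_PREFIX;
    }

    /* snprintf reports the length it needed, so truncation is detectable. */
    n = snprintf(out, outlen, "%s%s", prefix, guest_name);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

The guest keeps seeing "security.selinux", while the host file carries the
label under the trusted namespace, so the host's own security.selinux xattr
is left alone.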

Thanks
Vivek

> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
>

Re: [PATCH v5 3/9] virtiofsd: Parse extended "struct fuse_init_in"

2022-02-07 Thread Vivek Goyal

On Thu, Feb 03, 2022 at 06:56:58PM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Add some code to parse extended "struct fuse_init_in". And use a local
> > variable "flags" to represent 64 bit flags. This will make it easier
> > to add more features without having to worry about two 32bit flags (->flags
> > and ->flags2) in "struct fuse_init_in".
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  tools/virtiofsd/fuse_lowlevel.c | 62 +
> >  1 file changed, 40 insertions(+), 22 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/fuse_lowlevel.c 
> > b/tools/virtiofsd/fuse_lowlevel.c
> > index ce29a70253..1f10dcc75b 100644
> > --- a/tools/virtiofsd/fuse_lowlevel.c
> > +++ b/tools/virtiofsd/fuse_lowlevel.c
> > @@ -1881,11 +1881,15 @@ static void do_init(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  {
> >  size_t compat_size = offsetof(struct fuse_init_in, max_readahead);
> >  size_t compat2_size = offsetof(struct fuse_init_in, flags) + 
> > sizeof(uint32_t);
> > +/* Fuse structure extended with minor version 36 */
> > +size_t compat3_size = offsetof(struct fuse_init_in, unused) +
> > +  (11 * sizeof(uint32_t));
> 
> Hmm that's actually quite difficult; what we have at the moment is:
> 
> struct fuse_init_in {
>     uint32_t major;
>     uint32_t minor;
>     uint32_t max_readahead;
>     uint32_t flags;
>     uint32_t flags2;
>     uint32_t unused[11];
> };
> 
> so imagine someone comes along and changes that to:
> 
> struct fuse_init_in {
>     uint32_t major;
>     uint32_t minor;
>     uint32_t max_readahead;
>     uint32_t flags;
>     uint32_t flags2;
>     uint32_t flags3;
>     uint32_t unused[10];
> };
> 
> Then this code will break (oddly!), where the old code that didn't reference
> the unused field wouldn't.

Good catch. I did not think about it.

> It looks like qemu defines an 'endof' macro, so I think you can do:
> 
>   size_t compat3_size = endof(struct fuse_init_in, unused);
> 
> I think that should work as long as people nibble away at unused from
> the top.

Will use "endof" macro.
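
A small sketch of why the hard-coded "11 * sizeof(uint32_t)" is fragile and
an end-of-field calculation is not. The SKETCH_ENDOF macro and the struct
names below are stand-ins written for this illustration (not copied from
qemu), and the second layout is the hypothetical future change from the
review comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for an "end of field" macro; an assumption of this sketch. */
#define SKETCH_ENDOF(type, field) \
    (offsetof(type, field) + sizeof(((type *)0)->field))

/* Layout as quoted in the review. */
struct sketch_init_in_v1 {
    uint32_t major;
    uint32_t minor;
    uint32_t max_readahead;
    uint32_t flags;
    uint32_t flags2;
    uint32_t unused[11];
};

/* Hypothetical future layout: flags3 carved out of unused[]. */
struct sketch_init_in_v2 {
    uint32_t major;
    uint32_t minor;
    uint32_t max_readahead;
    uint32_t flags;
    uint32_t flags2;
    uint32_t flags3;
    uint32_t unused[10];
};

/* Both layouts occupy the same number of bytes on the wire. */
static_assert(sizeof(struct sketch_init_in_v1) ==
              sizeof(struct sketch_init_in_v2),
              "nibbling at unused[] must not change the struct size");

/* End offset computed with a hard-coded element count, as in the v5 patch. */
static size_t sketch_hardcoded_end_v1(void)
{
    return offsetof(struct sketch_init_in_v1, unused) + 11 * sizeof(uint32_t);
}

static size_t sketch_hardcoded_end_v2(void)
{
    return offsetof(struct sketch_init_in_v2, unused) + 11 * sizeof(uint32_t);
}

/* End offset computed from the field itself. */
static size_t sketch_endof_v1(void)
{
    return SKETCH_ENDOF(struct sketch_init_in_v1, unused);
}

static size_t sketch_endof_v2(void)
{
    return SKETCH_ENDOF(struct sketch_init_in_v2, unused);
}
```

With the first layout both calculations agree. With the second layout the
hard-coded version runs four bytes past the end of the structure, while the
end-of-field version still lands on the last byte of unused[].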

Thanks
Vivek

> 
> Dave
> 
> 
> >  struct fuse_init_in *arg;
> >  struct fuse_init_out outarg;
> >  struct fuse_session *se = req->se;
> >  size_t bufsize = se->bufsize;
> >  size_t outargsize = sizeof(outarg);
> > +uint64_t flags = 0;
> >  
> >  (void)nodeid;
> >  
> > @@ -1902,11 +1906,25 @@ static void do_init(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  fuse_reply_err(req, EINVAL);
> >  return;
> >  }
> > +flags |= arg->flags;
> > +}
> > +
> > +/*
> > + * fuse_init_in was extended again with minor version 36. Just read
> > + * current known size of fuse_init so that future extension and
> > + * header rebase does not cause breakage.
> > + */
> > +if (sizeof(*arg) > compat2_size && (arg->flags & FUSE_INIT_EXT)) {
> > +if (!fuse_mbuf_iter_advance(iter, compat3_size - compat2_size)) {
> > +fuse_reply_err(req, EINVAL);
> > +return;
> > +}
> > +flags |= (uint64_t) arg->flags2 << 32;
> >  }
> >  
> >  fuse_log(FUSE_LOG_DEBUG, "INIT: %u.%u\n", arg->major, arg->minor);
> >  if (arg->major == 7 && arg->minor >= 6) {
> > -fuse_log(FUSE_LOG_DEBUG, "flags=0x%08x\n", arg->flags);
> > +fuse_log(FUSE_LOG_DEBUG, "flags=0x%016llx\n", flags);
> >  fuse_log(FUSE_LOG_DEBUG, "max_readahead=0x%08x\n", 
> > arg->max_readahead);
> >  }
> >  se->conn.proto_major = arg->major;
> > @@ -1934,68 +1952,68 @@ static void do_init(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  if (arg->max_readahead < se->conn.max_readahead) {
> >  se->conn.max_readahead = arg->max_readahead;
> >  }
> > -if (arg->flags & FUSE_ASYNC_READ) {
> > +if (flags & FUSE_ASYNC_READ) {
> >  se->conn.capable |= FUSE_CAP_ASYNC_READ;
> >  }
> > -if (arg->flags & FUSE_POSIX_LOCKS) {
> > +if (flags & FUSE_POSIX_LOCKS) {
> >  se->conn.c

[PATCH v5 5/9] virtiofsd, fuse_lowlevel.c: Add capability to parse security context

2022-02-02 Thread Vivek Goyal

Add capability to enable and parse security context as sent by client
and put into fuse_req. Filesystems now can get security context from
request and set it on files during creation.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   |  5 ++
 tools/virtiofsd/fuse_i.h|  7 +++
 tools/virtiofsd/fuse_lowlevel.c | 95 -
 3 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 6f8a988202..bf46954dab 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -377,6 +377,11 @@ struct fuse_file_info {
  */
 #define FUSE_CAP_SETXATTR_EXT (1 << 29)
 
+/**
+ * Indicates that file server supports creating file security context
+ */
+#define FUSE_CAP_SECURITY_CTX (1ULL << 32)
+
 /**
  * Ioctl flags
  *
diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 492e002181..a5572fa4ae 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -15,6 +15,12 @@
 struct fv_VuDev;
 struct fv_QueueInfo;
 
+struct fuse_security_context {
+const char *name;
+uint32_t ctxlen;
+const void *ctx;
+};
+
 struct fuse_req {
 struct fuse_session *se;
 uint64_t unique;
@@ -35,6 +41,7 @@ struct fuse_req {
 } u;
 struct fuse_req *next;
 struct fuse_req *prev;
+struct fuse_security_context secctx;
 };
 
 struct fuse_notify_req {
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 83d29762a4..cd9ef97b3c 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -886,11 +886,59 @@ static void do_readlink(fuse_req_t req, fuse_ino_t nodeid,
 }
 }
 
+static int parse_secctx_fill_req(fuse_req_t req, struct fuse_mbuf_iter *iter)
+{
+struct fuse_secctx_header *fsecctx_header;
+struct fuse_secctx *fsecctx;
+const void *secctx;
+const char *name;
+
+fsecctx_header = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx_header));
+if (!fsecctx_header) {
+return -EINVAL;
+}
+
+/*
+ * As of now maximum of one security context is supported. It can
+ * change in future though.
+ */
+if (fsecctx_header->nr_secctx > 1) {
+return -EINVAL;
+}
+
+/* No security context sent. Maybe no LSM supports it */
+if (!fsecctx_header->nr_secctx) {
+return 0;
+}
+
+fsecctx = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx));
+if (!fsecctx) {
+return -EINVAL;
+}
+
+name = fuse_mbuf_iter_advance_str(iter);
+if (!name) {
+return -EINVAL;
+}
+
+secctx = fuse_mbuf_iter_advance(iter, fsecctx->size);
+if (!secctx) {
+return -EINVAL;
+}
+
+req->secctx.name = name;
+req->secctx.ctx = secctx;
+req->secctx.ctxlen = fsecctx->size;
+return 0;
+}
+
 static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
  struct fuse_mbuf_iter *iter)
 {
 struct fuse_mknod_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -901,6 +949,13 @@ static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.mknod) {
 req->se->op.mknod(req, nodeid, name, arg->mode, arg->rdev);
 } else {
@@ -913,6 +968,8 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 {
 struct fuse_mkdir_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -923,6 +980,13 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.mkdir) {
 req->se->op.mkdir(req, nodeid, name, arg->mode);
 } else {
@@ -969,12 +1033,21 @@ static void do_symlink(fuse_req_t req, fuse_ino_t nodeid,
 {
 const char *name = fuse_mbuf_iter_advance_str(iter);
 const char *linkname = fuse_mbuf_iter_advance_str(iter);
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 if (!name || !linkname) {
 fuse_reply_err(req, EINVAL);
 return;
 }
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.symlink) {
 req

[PATCH v5 4/9] virtiofsd: Extend size of fuse_conn_info->capable and ->want fields

2022-02-02 Thread Vivek Goyal

->capable keeps track of what capabilities the kernel supports and ->want
keeps track of what capabilities the filesystem wants.

Right now these fields are 32 bits in size. But fuse has now run out of
bits, and capabilities can now have bit numbers higher than 31.

That means 32-bit fields are not sufficient anymore. Increase the size to 64
bits so that we can add newer capabilities and still be able to use existing
code to check and set the capabilities.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   | 4 ++--
 tools/virtiofsd/fuse_lowlevel.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 0c2665b977..6f8a988202 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -439,7 +439,7 @@ struct fuse_conn_info {
 /**
  * Capability flags that the kernel supports (read-only)
  */
-unsigned capable;
+uint64_t capable;
 
 /**
  * Capability flags that the filesystem wants to enable.
@@ -447,7 +447,7 @@ struct fuse_conn_info {
  * libfuse attempts to initialize this field with
  * reasonable default values before calling the init() handler.
  */
-unsigned want;
+uint64_t want;
 
 /**
  * Maximum number of pending "background" requests. A
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 1f10dcc75b..83d29762a4 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2070,7 +2070,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 if (se->conn.want & (~se->conn.capable)) {
 fuse_log(FUSE_LOG_ERR,
  "fuse: error: filesystem requested capabilities "
- "0x%x that are not supported by kernel, aborting.\n",
+ "0x%llx that are not supported by kernel, aborting.\n",
  se->conn.want & (~se->conn.capable));
 fuse_reply_err(req, EPROTO);
 se->error = -EPROTO;
-- 
2.34.1
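
As an illustration of the widening described in the commit message above,
the sketch below shows what happens to a capability at bit 32 when it is
stored in a 32-bit field, and how the two 32-bit INIT flag words can be
combined into one 64-bit value. All names are local to this sketch; only the
bit position (1ULL << 32) is taken from the FUSE_CAP_SECURITY_CTX definition
in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit position as used for FUSE_CAP_SECURITY_CTX in this series. */
#define SKETCH_CAP_SECURITY_CTX (1ULL << 32)

static_assert(SKETCH_CAP_SECURITY_CTX > UINT32_MAX,
              "the capability bit lies above the low 32 bits");

/* Combine the low and high 32-bit flag words into one 64-bit value. */
static uint64_t sketch_combine_flags(uint32_t flags, uint32_t flags2)
{
    return (uint64_t)flags | ((uint64_t)flags2 << 32);
}

/* Capabilities the filesystem wants but the kernel does not offer. */
static uint64_t sketch_unsupported(uint64_t want, uint64_t capable)
{
    return want & ~capable;
}

/* True if the capability mask survives being stored in a 32-bit field. */
static bool sketch_survives_32bit(uint64_t cap)
{
    uint32_t narrow = (uint32_t)cap;   /* drops bits 32 and above */

    return narrow == cap;
}
```

A capability such as (1 << 29) fits in the old "unsigned" fields, whereas
the bit-32 capability is silently lost there, which is why ->capable and
->want need to be 64 bits wide before such capabilities can be checked.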

[PATCH v5 0/9] virtiofsd: Add support for file security context at file creation

2022-02-02 Thread Vivek Goyal

Hi,

This is V5 of the patches. I posted V4 here.

https://listman.redhat.com/archives/virtio-fs/2022-January/msg00041.html

These will allow us to support SELinux with virtiofs. This will send
SELinux context at file creation to server and server can set it on
file.

Changes since V4

- Parse only known current size of fuse_init_in. This will make sure
  that future extension does not break existing code upon header
  update. (David Gilbert)

- Changed the order of one of the patches. It is now the first patch in
  the series. This fixes the breakage before the header update patch, so
  the code remains git bisectable. (David Gilbert)

- Changed %lx to %llx at one place. (David Gilbert).

Thanks
Vivek
 
Vivek Goyal (9):
  virtiofsd: Fix breakage due to fuse_init_in size change
  linux-headers: Update headers to v5.17-rc1
  virtiofsd: Parse extended "struct fuse_init_in"
  virtiofsd: Extend size of fuse_conn_info->capable and ->want fields
  virtiofsd, fuse_lowlevel.c: Add capability to parse security context
  virtiofsd: Move core file creation code in separate function
  virtiofsd: Create new file with fscreate set
  virtiofsd: Create new file using O_TMPFILE and set security context
  virtiofsd: Add an option to enable/disable security label

 docs/tools/virtiofsd.rst  |   7 +
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h |  11 +
 include/standard-headers/linux/ethtool.h  |   1 +
 include/standard-headers/linux/fuse.h |  60 ++-
 include/standard-headers/linux/pci_regs.h | 142 +++---
 include/standard-headers/linux/virtio_gpio.h  |  72 +++
 include/standard-headers/linux/virtio_i2c.h   |  47 ++
 include/standard-headers/linux/virtio_iommu.h |   8 +-
 .../standard-headers/linux/virtio_pcidev.h|  65 +++
 include/standard-headers/linux/virtio_scmi.h  |  24 +
 linux-headers/asm-generic/unistd.h|   5 +-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-riscv/bitsperlong.h |  14 +
 linux-headers/asm-riscv/mman.h|   1 +
 linux-headers/asm-riscv/unistd.h  |  44 ++
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |  16 +-
 linux-headers/asm-x86/unistd_32.h |   1 +
 linux-headers/asm-x86/unistd_64.h |   1 +
 linux-headers/asm-x86/unistd_x32.h|   1 +
 linux-headers/linux/kvm.h |  17 +
 tools/virtiofsd/fuse_common.h |   9 +-
 tools/virtiofsd/fuse_i.h  |   7 +
 tools/virtiofsd/fuse_lowlevel.c   | 162 +--
 tools/virtiofsd/helper.c  |   1 +
 tools/virtiofsd/passthrough_ll.c  | 414 --
 32 files changed, 1013 insertions(+), 132 deletions(-)
 create mode 100644 include/standard-headers/linux/virtio_gpio.h
 create mode 100644 include/standard-headers/linux/virtio_i2c.h
 create mode 100644 include/standard-headers/linux/virtio_pcidev.h
 create mode 100644 include/standard-headers/linux/virtio_scmi.h
 create mode 100644 linux-headers/asm-riscv/bitsperlong.h
 create mode 100644 linux-headers/asm-riscv/mman.h
 create mode 100644 linux-headers/asm-riscv/unistd.h

-- 
2.34.1

[PATCH v5 2/9] linux-headers: Update headers to v5.17-rc1

2022-02-02 Thread Vivek Goyal

Update headers to 5.17-rc1. I need latest fuse changes.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h |  11 ++
 include/standard-headers/linux/ethtool.h  |   1 +
 include/standard-headers/linux/fuse.h |  60 +++-
 include/standard-headers/linux/pci_regs.h | 142 +-
 include/standard-headers/linux/virtio_gpio.h  |  72 +
 include/standard-headers/linux/virtio_i2c.h   |  47 ++
 include/standard-headers/linux/virtio_iommu.h |   8 +-
 .../standard-headers/linux/virtio_pcidev.h|  65 
 include/standard-headers/linux/virtio_scmi.h  |  24 +++
 linux-headers/asm-generic/unistd.h|   5 +-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-riscv/bitsperlong.h |  14 ++
 linux-headers/asm-riscv/mman.h|   1 +
 linux-headers/asm-riscv/unistd.h  |  44 ++
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |  16 +-
 linux-headers/asm-x86/unistd_32.h |   1 +
 linux-headers/asm-x86/unistd_64.h |   1 +
 linux-headers/asm-x86/unistd_x32.h|   1 +
 linux-headers/linux/kvm.h |  17 +++
 26 files changed, 469 insertions(+), 76 deletions(-)
 create mode 100644 include/standard-headers/linux/virtio_gpio.h
 create mode 100644 include/standard-headers/linux/virtio_i2c.h
 create mode 100644 include/standard-headers/linux/virtio_pcidev.h
 create mode 100644 include/standard-headers/linux/virtio_scmi.h
 create mode 100644 linux-headers/asm-riscv/bitsperlong.h
 create mode 100644 linux-headers/asm-riscv/mman.h
 create mode 100644 linux-headers/asm-riscv/unistd.h

diff --git a/include/standard-headers/asm-x86/kvm_para.h 
b/include/standard-headers/asm-x86/kvm_para.h
index 204cfb8640..f0235e58a1 100644
--- a/include/standard-headers/asm-x86/kvm_para.h
+++ b/include/standard-headers/asm-x86/kvm_para.h
@@ -8,6 +8,7 @@
  * should be used to determine that a VM is running under KVM.
  */
 #define KVM_CPUID_SIGNATURE	0x40000000
+#define KVM_SIGNATURE "KVMKVMKVM\0\0\0"
 
 /* This CPUID returns two feature bitmaps in eax, edx. Before enabling
  * a particular paravirtualization, the appropriate feature bit should
diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 2c025cb4fe..4888f85f69 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -313,6 +313,13 @@ extern "C" {
  */
 #define DRM_FORMAT_P016fourcc_code('P', '0', '1', '6') /* 2x2 
subsampled Cr:Cb plane 16 bits per channel */
 
+/* 2 plane YCbCr420.
+ * 3 10 bit components and 2 padding bits packed into 4 bytes.
+ * index 0 = Y plane, [31:0] x:Y2:Y1:Y0 2:10:10:10 little endian
+ * index 1 = Cr:Cb plane, [63:0] x:Cr2:Cb2:Cr1:x:Cb1:Cr0:Cb0 
[2:10:10:10:2:10:10:10] little endian
+ */
+#define DRM_FORMAT_P030fourcc_code('P', '0', '3', '0') /* 2x2 
subsampled Cr:Cb plane 10 bits per channel packed */
+
 /* 3 plane non-subsampled (444) YCbCr
  * 16 bits per component, but only 10 bits are used and 6 bits are padded
  * index 0: Y plane, [15:0] Y:x [10:6] little endian
@@ -853,6 +860,10 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t 
modifier)
  * and UV.  Some SAND-using hardware stores UV in a separate tiled
  * image from Y to reduce the column height, which is not supported
  * with these modifiers.
+ *
+ * The DRM_FORMAT_MOD_BROADCOM_SAND128_COL_HEIGHT modifier is also
+ * supported for DRM_FORMAT_P030 where the columns remain as 128 bytes
+ * wide, but as this is a 10 bpp format that translates to 96 pixels.
  */
 
 #define DRM_FORMAT_MOD_BROADCOM_SAND32_COL_HEIGHT(v) \
diff --git a/include/standard-headers/linux/ethtool.h 
b/include/standard-headers/linux/ethtool.h
index 688eb8dc39..38d5a4cd6e 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -231,6 +231,7 @@ enum tunable_id {
ETHTOOL_RX_COPYBREAK,
ETHTOOL_TX_COPYBREAK,
ETHTOOL_PFC_PREVENTION_TOUT, /* timeout in msecs */
+   ETHTOOL_TX_COPYBREAK_BUF_SIZE,
/*
 * Add your fresh new tunable attribute above and remember to update
 * tunable_strings[] in net/ethtool/common.c
diff --git a/include/standard-headers/linux/fuse.h 
b/include/standard-headers/linux/fuse.h
index 23ea31708b..bda06258be 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -184,6 +184,16 @@
  *
  *  7.34
  *  - add FUSE_SYNCFS
+ *

[PATCH v5 7/9] virtiofsd: Create new file with fscreate set

2022-02-02 Thread Vivek Goyal

This patch adds support to set /proc/thread-self/attr/fscreate before
file creation. It is set to a value as sent by client. This will allow
for atomic creation of security context on files w.r.t file creation.

This is primarily useful when either there is no SELinux enabled on
host or host and guest policies are in sync and don't conflict.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 317 ---
 1 file changed, 290 insertions(+), 27 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 82023bf3d4..acb99aa2fc 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -173,10 +173,14 @@ struct lo_data {
 
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
+/* An O_PATH file descriptor to /proc/self/task/ */
+int proc_self_task;
 int user_killpriv_v2, killpriv_v2;
 /* If set, virtiofsd is responsible for setting umask during creation */
 bool change_umask;
 int user_posix_acl, posix_acl;
+/* Keeps track if /proc/<pid>/attr/fscreate should be used or not */
+bool use_fscreate;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -230,6 +234,11 @@ static struct lo_inode *lo_find(struct lo_data *lo, struct 
stat *st,
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
+#define FCHDIR_NOFAIL(fd) do { \
+int fchdir_res = fchdir(fd);   \
+assert(fchdir_res == 0);   \
+} while (0)
+
 static bool is_dot_or_dotdot(const char *name)
 {
 return name[0] == '.' &&
@@ -256,6 +265,33 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/*
+ * Tries to figure out if /proc/<pid>/attr/fscreate is usable or not. With
+ * selinux=0, read from fscreate returns -EINVAL.
+ *
+ * TODO: Link with libselinux and use is_selinux_enabled() instead down
+ * the line. It probably will be more reliable indicator.
+ */
+static bool is_fscreate_usable(struct lo_data *lo)
+{
+char procname[64];
+int fscreate_fd;
+size_t bytes_read;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_RDWR);
+if (fscreate_fd == -1) {
+return false;
+}
+
+bytes_read = read(fscreate_fd, procname, 64);
+close(fscreate_fd);
+if (bytes_read == -1) {
+return false;
+}
+return true;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -1284,16 +1320,140 @@ static void lo_restore_cred_gain_cap(struct lo_cred 
*old, bool restore_umask,
 }
 }
 
+/* Helpers to set/reset fscreate */
+static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
+  size_t ctxlen, int *fd)
+{
+char procname[64];
+int fscreate_fd, err = 0;
+size_t written;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_WRONLY);
+err = fscreate_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+
+written = write(fscreate_fd, ctx, ctxlen);
+err = written == -1 ? errno : 0;
+if (err) {
+goto out;
+}
+
+*fd = fscreate_fd;
+return 0;
+out:
+close(fscreate_fd);
+return err;
+}
+
+static void close_reset_proc_fscreate(int fd)
+{
+if ((write(fd, NULL, 0)) == -1) {
+fuse_log(FUSE_LOG_WARNING, "Failed to reset fscreate. err=%d\n", 
errno);
+}
+close(fd);
+return;
+}
+
+static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
+   const char *name, const char *secctx_name)
+{
+int path_fd, err;
+char procname[64];
+struct lo_data *lo = lo_data(req);
+
+if (!req->secctx.ctxlen) {
+return 0;
+}
+
+/* Open newly created element with O_PATH */
+path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+err = path_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+sprintf(procname, "%i", path_fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+/* Set security context. This is not atomic w.r.t file creation */
+err = setxattr(procname, secctx_name, req->secctx.ctx, req->secctx.ctxlen,
+   0);
+if (err) {
+err = errno;
+}
+FCHDIR_NOFAIL(lo->root.fd);
+close(path_fd);
+return err;
+}
+
+static int do_mknod_symlink(fuse_req_t req, struct lo_inode *dir,
+const char *name, mode_t mode, dev_t rdev,
+const char *link)
+{
+int err, fscreate_fd = -1;
+const char *secctx_name = req->secctx.name;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+char *mapped_name =

[PATCH v5 3/9] virtiofsd: Parse extended "struct fuse_init_in"

2022-02-02 Thread Vivek Goyal

Add some code to parse extended "struct fuse_init_in". And use a local
variable "flags" to represent 64 bit flags. This will make it easier
to add more features without having to worry about two 32bit flags (->flags
and ->flags2) in "struct fuse_init_in".
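
As a rough sketch of the idea, the two 32 bit words can be folded into one
64 bit value as below. struct example_init_in, EXAMPLE_INIT_EXT and
merge_init_flags() are hypothetical names used only for this illustration;
the real structure and the FUSE_INIT_EXT bit come from the fuse header.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two flag words carried in the init message. */
struct example_init_in {
    uint32_t flags;
    uint32_t flags2;
};

/* Hypothetical marker bit meaning "flags2 is valid"; value chosen for the demo. */
#define EXAMPLE_INIT_EXT (1u << 30)

/* Fold both words into one 64-bit value; flags2 counts only if announced. */
static uint64_t merge_init_flags(const struct example_init_in *in)
{
    uint64_t flags = in->flags;

    if (in->flags & EXAMPLE_INIT_EXT) {
        /* Cast before shifting so the shift happens in 64 bits. */
        flags |= (uint64_t)in->flags2 << 32;
    }
    return flags;
}
```

After merging, every feature test can use the same "flags & BIT" form
regardless of which word the bit originally lived in.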

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_lowlevel.c | 62 +
 1 file changed, 40 insertions(+), 22 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index ce29a70253..1f10dcc75b 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1881,11 +1881,15 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 {
 size_t compat_size = offsetof(struct fuse_init_in, max_readahead);
 size_t compat2_size = offsetof(struct fuse_init_in, flags) + 
sizeof(uint32_t);
+/* Fuse structure extended with minor version 36 */
+size_t compat3_size = offsetof(struct fuse_init_in, unused) +
+  (11 * sizeof(uint32_t));
 struct fuse_init_in *arg;
 struct fuse_init_out outarg;
 struct fuse_session *se = req->se;
 size_t bufsize = se->bufsize;
 size_t outargsize = sizeof(outarg);
+uint64_t flags = 0;
 
 (void)nodeid;
 
@@ -1902,11 +1906,25 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 fuse_reply_err(req, EINVAL);
 return;
 }
+flags |= arg->flags;
+}
+
+/*
+ * fuse_init_in was extended again with minor version 36. Just read
+ * current known size of fuse_init so that future extension and
+ * header rebase does not cause breakage.
+ */
+if (sizeof(*arg) > compat2_size && (arg->flags & FUSE_INIT_EXT)) {
+if (!fuse_mbuf_iter_advance(iter, compat3_size - compat2_size)) {
+fuse_reply_err(req, EINVAL);
+return;
+}
+flags |= (uint64_t) arg->flags2 << 32;
 }
 
 fuse_log(FUSE_LOG_DEBUG, "INIT: %u.%u\n", arg->major, arg->minor);
 if (arg->major == 7 && arg->minor >= 6) {
-fuse_log(FUSE_LOG_DEBUG, "flags=0x%08x\n", arg->flags);
+fuse_log(FUSE_LOG_DEBUG, "flags=0x%016llx\n", flags);
 fuse_log(FUSE_LOG_DEBUG, "max_readahead=0x%08x\n", arg->max_readahead);
 }
 se->conn.proto_major = arg->major;
@@ -1934,68 +1952,68 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 if (arg->max_readahead < se->conn.max_readahead) {
 se->conn.max_readahead = arg->max_readahead;
 }
-if (arg->flags & FUSE_ASYNC_READ) {
+if (flags & FUSE_ASYNC_READ) {
 se->conn.capable |= FUSE_CAP_ASYNC_READ;
 }
-if (arg->flags & FUSE_POSIX_LOCKS) {
+if (flags & FUSE_POSIX_LOCKS) {
 se->conn.capable |= FUSE_CAP_POSIX_LOCKS;
 }
-if (arg->flags & FUSE_ATOMIC_O_TRUNC) {
+if (flags & FUSE_ATOMIC_O_TRUNC) {
 se->conn.capable |= FUSE_CAP_ATOMIC_O_TRUNC;
 }
-if (arg->flags & FUSE_EXPORT_SUPPORT) {
+if (flags & FUSE_EXPORT_SUPPORT) {
 se->conn.capable |= FUSE_CAP_EXPORT_SUPPORT;
 }
-if (arg->flags & FUSE_DONT_MASK) {
+if (flags & FUSE_DONT_MASK) {
 se->conn.capable |= FUSE_CAP_DONT_MASK;
 }
-if (arg->flags & FUSE_FLOCK_LOCKS) {
+if (flags & FUSE_FLOCK_LOCKS) {
 se->conn.capable |= FUSE_CAP_FLOCK_LOCKS;
 }
-if (arg->flags & FUSE_AUTO_INVAL_DATA) {
+if (flags & FUSE_AUTO_INVAL_DATA) {
 se->conn.capable |= FUSE_CAP_AUTO_INVAL_DATA;
 }
-if (arg->flags & FUSE_DO_READDIRPLUS) {
+if (flags & FUSE_DO_READDIRPLUS) {
 se->conn.capable |= FUSE_CAP_READDIRPLUS;
 }
-if (arg->flags & FUSE_READDIRPLUS_AUTO) {
+if (flags & FUSE_READDIRPLUS_AUTO) {
 se->conn.capable |= FUSE_CAP_READDIRPLUS_AUTO;
 }
-if (arg->flags & FUSE_ASYNC_DIO) {
+if (flags & FUSE_ASYNC_DIO) {
 se->conn.capable |= FUSE_CAP_ASYNC_DIO;
 }
-if (arg->flags & FUSE_WRITEBACK_CACHE) {
+if (flags & FUSE_WRITEBACK_CACHE) {
 se->conn.capable |= FUSE_CAP_WRITEBACK_CACHE;
 }
-if (arg->flags & FUSE_NO_OPEN_SUPPORT) {
+if (flags & FUSE_NO_OPEN_SUPPORT) {
 se->conn.capable |= FUSE_CAP_NO_OPEN_SUPPORT;
 }
-if (arg->flags & FUSE_PARALLEL_DIROPS) {
+if (flags & FUSE_PARALLEL_DIROPS) {
 se->conn.capable |= FUSE_CAP_PARALLEL_DIROPS;
 }
-if (arg->flags & FUSE_POSIX_ACL) {
+if (flags & FUSE_POSIX_ACL) {
 se->conn.capable |= FUSE_CAP_POSIX_ACL;
 }
-if (arg->flags & FUSE_HANDLE_KILLPRIV) {
+if (flags & FUSE_HANDLE_KILLPRIV) {
 se->conn.capable |= FUSE_CAP_HANDLE_KILLPRIV;
 }

[PATCH v5 6/9] virtiofsd: Move core file creation code in separate function

2022-02-02 Thread Vivek Goyal

Move core file creation bits into a separate function. Soon this is going
to get more complex as file creation needs to set security context also.
And there will be multiple modes of file creation in the next patch.

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 36 ++--
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index b3d0674f6d..82023bf3d4 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2001,6 +2001,30 @@ static int lo_do_open(struct lo_data *lo, struct 
lo_inode *inode,
 return 0;
 }
 
+static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi, int* open_fd)
+{
+int err = 0, fd;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+
+err = lo_change_cred(req, &old, lo->change_umask);
+if (err) {
+return err;
+}
+
+/* Try to create a new file but don't open existing files */
+fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
+if (fd == -1) {
+err = errno;
+} else {
+*open_fd = fd;
+}
+lo_restore_cred(&old, lo->change_umask);
+return err;
+}
+
 static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
   mode_t mode, struct fuse_file_info *fi)
 {
@@ -2010,7 +2034,6 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, 
const char *name,
 struct lo_inode *inode = NULL;
 struct fuse_entry_param e;
 int err;
-struct lo_cred old = {};
 
 fuse_log(FUSE_LOG_DEBUG, "lo_create(parent=%" PRIu64 ", name=%s)"
  " kill_priv=%d\n", parent, name, fi->kill_priv);
@@ -2026,18 +2049,9 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, 
const char *name,
 return;
 }
 
-err = lo_change_cred(req, &old, lo->change_umask);
-if (err) {
-goto out;
-}
-
 update_open_flags(lo->writeback, lo->allow_direct_io, fi);
 
-/* Try to create a new file but don't open existing files */
-fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
-err = fd == -1 ? errno : 0;
-
-lo_restore_cred(&old, lo->change_umask);
+err = do_lo_create(req, parent_inode, name, mode, fi, &fd);
 
 /* Ignore the error if file exists and O_EXCL was not given */
 if (err && (err != EEXIST || (fi->flags & O_EXCL))) {
-- 
2.34.1

[PATCH v5 9/9] virtiofsd: Add an option to enable/disable security label

2022-02-02 Thread Vivek Goyal

Provide an option "-o security_label/no_security_label" to enable/disable
security label functionality. By default these are turned off.

If enabled, server will indicate to client that it is capable of handling
one security label during file creation. Typically this is expected to
be a SELinux label. File server will set this label on the file. It will
try to set it atomically wherever possible, but that is not possible in
all cases.
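
The option handling can be sketched roughly as follows. This is a
simplified variant, not the code in the diff below: the names
EXAMPLE_CAP_SECURITY_CTX and apply_security_label_opt() are hypothetical,
and unlike lo_init() the sketch reports an unsupported request to the
caller instead of logging it and setting the bit anyway.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical capability bit; the real FUSE_CAP_SECURITY_CTX may differ. */
#define EXAMPLE_CAP_SECURITY_CTX (1ULL << 32)

/*
 * user_opt: -1 = not specified, 0 = explicitly disabled, 1 = explicitly
 * enabled. Updates *want and returns 0, or returns -1 (leaving *want
 * untouched) if the user asked for a capability that was not advertised.
 */
static int apply_security_label_opt(int user_opt, uint64_t capable,
                                    uint64_t *want)
{
    if (user_opt == 1) {
        if (!(capable & EXAMPLE_CAP_SECURITY_CTX)) {
            return -1;
        }
        *want |= EXAMPLE_CAP_SECURITY_CTX;
    } else {
        /* Both "not specified" and "disabled" leave the feature off. */
        *want &= ~EXAMPLE_CAP_SECURITY_CTX;
    }
    return 0;
}
```

Keeping -1 as a distinct "not specified" value leaves room to change the
default later without breaking users who pass the option explicitly.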

Signed-off-by: Vivek Goyal 
---
 docs/tools/virtiofsd.rst |  7 +++
 tools/virtiofsd/helper.c |  1 +
 tools/virtiofsd/passthrough_ll.c | 15 +++
 3 files changed, 23 insertions(+)

diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
index 07ac0be551..a2c005f4a0 100644
--- a/docs/tools/virtiofsd.rst
+++ b/docs/tools/virtiofsd.rst
@@ -104,6 +104,13 @@ Options
   * posix_acl|no_posix_acl -
 Enable/disable posix acl support.  Posix ACLs are disabled by default.
 
+  * security_label|no_security_label -
+Enable/disable security label support. Security labels are disabled by
+default. This will allow client to send a MAC label of file during
+file creation. Typically this is expected to be SELinux security
+label. Server will try to set that label on newly created file
+atomically wherever possible.
+
 .. option:: --socket-path=PATH
 
   Listen on vhost-user UNIX domain socket at PATH.
diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index a8295d975a..e226fc590f 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -187,6 +187,7 @@ void fuse_cmdline_help(void)
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the 
guest\n"
"-o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: 
disabled)\n"
+   "-o security_label/no_security_label  Enable/Disable security 
label. (default: disabled)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 43c9b6dbe5..fe8f3ccbb6 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -181,6 +181,7 @@ struct lo_data {
 int user_posix_acl, posix_acl;
 /* Keeps track if /proc/<pid>/attr/fscreate should be used or not */
 bool use_fscreate;
+int user_security_label;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -215,6 +216,8 @@ static const struct fuse_opt lo_opts[] = {
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
 { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
 { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
+{ "security_label", offsetof(struct lo_data, user_security_label), 1 },
+{ "no_security_label", offsetof(struct lo_data, user_security_label), 0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -771,6 +774,17 @@ static void lo_init(void *userdata, struct fuse_conn_info 
*conn)
 fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling posix_acl\n");
 conn->want &= ~FUSE_CAP_POSIX_ACL;
 }
+
+if (lo->user_security_label == 1) {
+if (!(conn->capable & FUSE_CAP_SECURITY_CTX)) {
+fuse_log(FUSE_LOG_ERR, "lo_init: Can not enable security label."
+ " kernel does not support FUSE_SECURITY_CTX 
capability.\n");
+}
+conn->want |= FUSE_CAP_SECURITY_CTX;
+} else {
+fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling security label\n");
+conn->want &= ~FUSE_CAP_SECURITY_CTX;
+}
 }
 
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
@@ -4279,6 +4293,7 @@ int main(int argc, char *argv[])
 .proc_self_task = -1,
 .user_killpriv_v2 = -1,
 .user_posix_acl = -1,
+.user_security_label = -1,
 };
 struct lo_map_elem *root_elem;
 struct lo_map_elem *reserve_elem;
-- 
2.34.1

[PATCH v5 8/9] virtiofsd: Create new file using O_TMPFILE and set security context

2022-02-02 Thread Vivek Goyal

If guest and host policies can't work with each other, then guest security
context (selinux label) needs to be set into an xattr. For example, remap
guest security.selinux xattr to trusted.virtiofs.security.selinux.

That means setting "fscreate" is not going to help as that's only useful
for security.selinux xattr on host.

So we need another method which is atomic. Use O_TMPFILE to create new
file, set xattr and then linkat() to proper place.

But this works only for regular files. So directories and symlinks will
continue to be non-atomic.

Also if host filesystem does not support O_TMPFILE, we fall back to
non-atomic behavior.
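
Outside of virtiofsd, the same sequence can be sketched as a standalone
helper. create_with_xattr() and its parameters are hypothetical names for
this illustration. The sketch links the file through /proc/self/fd, so it
assumes /proc is mounted, and it leaves out the credential switching that
the real code does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Create 'name' under 'dirfd' so that it only becomes visible once the
 * xattr is already set. Returns 0 and stores the open fd in *open_fd,
 * or returns an errno value and leaves no file behind.
 */
static int create_with_xattr(int dirfd, const char *name, mode_t mode,
                             const char *xattr_name, const void *value,
                             size_t len, int *open_fd)
{
    char path[64];
    int err = 0;

    /* Unnamed file in the target directory; needs O_RDWR or O_WRONLY. */
    int fd = openat(dirfd, ".", O_TMPFILE | O_RDWR, mode);
    if (fd == -1) {
        return errno;
    }

    if (fsetxattr(fd, xattr_name, value, len, 0) == -1) {
        err = errno;
        goto out;
    }

    /* Give the labelled file its name; fails with EEXIST if name exists. */
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, path, dirfd, name, AT_SYMLINK_FOLLOW) == -1) {
        err = errno;
    }

out:
    if (err) {
        close(fd);
    } else {
        *open_fd = fd;
    }
    return err;
}
```

If any step fails, the unnamed file disappears when the descriptor is
closed, so other processes never observe a file without its label.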

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 80 
 1 file changed, 72 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index acb99aa2fc..43c9b6dbe5 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2153,14 +2153,29 @@ static int lo_do_open(struct lo_data *lo, struct 
lo_inode *inode,
 
 static int do_create_nosecctx(fuse_req_t req, struct lo_inode *parent_inode,
const char *name, mode_t mode,
-   struct fuse_file_info *fi, int *open_fd)
+   struct fuse_file_info *fi, int *open_fd,
+  bool tmpfile)
 {
 int err, fd;
 struct lo_cred old = {};
 struct lo_data *lo = lo_data(req);
 int flags;
 
-flags = fi->flags | O_CREAT | O_EXCL;
+if (tmpfile) {
+flags = fi->flags | O_TMPFILE;
+/*
+ * Don't use O_EXCL as we want to link file later. Also reset O_CREAT
+ * otherwise openat() returns -EINVAL.
+ */
+flags &= ~(O_CREAT | O_EXCL);
+
+/* O_TMPFILE needs either O_RDWR or O_WRONLY */
+if ((flags & O_ACCMODE) == O_RDONLY) {
+flags |= O_RDWR;
+}
+} else {
+flags = fi->flags | O_CREAT | O_EXCL;
+}
 
 err = lo_change_cred(req, &old, lo->change_umask);
 if (err) {
@@ -2191,7 +2206,7 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 
 close_reset_proc_fscreate(fscreate_fd);
 if (!err) {
@@ -2200,6 +2215,44 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
+static int do_create_secctx_tmpfile(fuse_req_t req,
+struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi,
+const char *secctx_name, int *open_fd)
+{
+int err, fd = -1;
+struct lo_data *lo = lo_data(req);
+char procname[64];
+
+err = do_create_nosecctx(req, parent_inode, ".", mode, fi, &fd, true);
+if (err) {
+return err;
+}
+
+err = fsetxattr(fd, secctx_name, req->secctx.ctx, req->secctx.ctxlen, 0);
+if (err) {
+err = errno;
+goto out;
+}
+
+/* Security context set on file. Link it in place */
+sprintf(procname, "%d", fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+err = linkat(AT_FDCWD, procname, parent_inode->fd, name,
+ AT_SYMLINK_FOLLOW);
+err = err == -1 ? errno : 0;
+FCHDIR_NOFAIL(lo->root.fd);
+
+out:
+if (!err) {
+*open_fd = fd;
+} else if (fd != -1) {
+close(fd);
+}
+return err;
+}
+
 static int do_create_secctx_noatomic(fuse_req_t req,
  struct lo_inode *parent_inode,
  const char *name, mode_t mode,
@@ -2208,7 +2261,7 @@ static int do_create_secctx_noatomic(fuse_req_t req,
 {
 int err = 0, fd = -1;
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 if (err) {
 goto out;
 }
@@ -2250,20 +2303,31 @@ static int do_lo_create(fuse_req_t req, struct lo_inode 
*parent_inode,
 if (secctx_enabled) {
 /*
  * If security.selinux has not been remapped and selinux is enabled,
- * use fscreate to set context before file creation.
- * Otherwise fallback to non-atomic method of file creation
- * and xattr settting.
+ * use fscreate to set context before file creation. If not, use
+ * tmpfile method for regular files. Otherwise fallback to
+ * non-atomic method of file creation and xattr settting.
  */
 if (!mapped_name && lo->use_fscreate) {
 err = do_create_secctx_fscreate(req, parent_inode, name, mode, fi,
 open_fd);
 goto out;
+} else if (S_ISREG(

[PATCH v5 1/9] virtiofsd: Fix breakage due to fuse_init_in size change

2022-02-02 Thread Vivek Goyal

Kernel version 5.17 has increased the size of "struct fuse_init_in".
Previously this struct was 16 bytes and now it has been extended to
64 bytes in size.

Once qemu headers are updated to latest, it will expect to receive 64 byte
size struct (for protocol version major 7 and minor >= 6). But if guest is
booting older kernel (older than 5.17), then it still sends older
fuse_init_in of size 16 bytes. do_init() then fails because it is expecting
a 64 byte struct, and the mount of virtiofs fails.

Fix this by parsing 16 bytes only for now. Separate patches will be
posted which will parse rest of the bytes and enable new functionality.
Right now we don't support any of the new functionality, so we don't
lose anything by not parsing bytes beyond 16.
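
The size computation can be illustrated with a small sketch. struct
example_init_in below is a hypothetical layout that mirrors the 16 and 64
byte sizes mentioned above; legacy_init_size() and init_len_ok() are
made-up helper names, not virtiofsd functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 16 legacy bytes followed by later additions. */
struct example_init_in {
    uint32_t major;
    uint32_t minor;
    uint32_t max_readahead;
    uint32_t flags;
    uint32_t flags2;      /* added by a later protocol revision */
    uint32_t unused[11];  /* added by a later protocol revision */
};

/* Size older senders still emit: everything up to and including 'flags'. */
static size_t legacy_init_size(void)
{
    return offsetof(struct example_init_in, flags) + sizeof(uint32_t);
}

/* Accept a message as long as it carries at least the legacy fields. */
static int init_len_ok(size_t received_len)
{
    return received_len >= legacy_init_size();
}
```

Because the size is derived from the offset of a known field rather than
from sizeof() on the whole struct, growing the struct in a later header
update does not change what an older sender is required to provide.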

Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_lowlevel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index e4679c73ab..ce29a70253 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1880,6 +1880,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 struct fuse_mbuf_iter *iter)
 {
 size_t compat_size = offsetof(struct fuse_init_in, max_readahead);
+size_t compat2_size = offsetof(struct fuse_init_in, flags) + 
sizeof(uint32_t);
 struct fuse_init_in *arg;
 struct fuse_init_out outarg;
 struct fuse_session *se = req->se;
@@ -1897,7 +1898,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 
 /* ...and now consume the new fields. */
 if (arg->major == 7 && arg->minor >= 6) {
-if (!fuse_mbuf_iter_advance(iter, sizeof(*arg) - compat_size)) {
+if (!fuse_mbuf_iter_advance(iter, compat2_size - compat_size)) {
 fuse_reply_err(req, EINVAL);
 return;
 }
-- 
2.34.1

Re: [PATCH v4 3/9] virtiofsd: Parse extended "struct fuse_init_in"

2022-01-27 Thread Vivek Goyal

On Thu, Jan 27, 2022 at 05:50:50PM +0000, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Add some code to parse extended "struct fuse_init_in". And use a local
> > variable "flag" to represent 64 bit flags. This will make it easier
> > to add more features without having to worry about two 32bit flags (->flags
> > and ->flags2) in "fuse_struct_in".
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  tools/virtiofsd/fuse_lowlevel.c | 55 -
> >  1 file changed, 33 insertions(+), 22 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/fuse_lowlevel.c 
> > b/tools/virtiofsd/fuse_lowlevel.c
> > index ce29a70253..c3af5ede08 100644
> > --- a/tools/virtiofsd/fuse_lowlevel.c
> > +++ b/tools/virtiofsd/fuse_lowlevel.c
> > @@ -1886,6 +1886,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
> >  struct fuse_session *se = req->se;
> >  size_t bufsize = se->bufsize;
> >  size_t outargsize = sizeof(outarg);
> > +uint64_t flags = 0;
> >  
> >  (void)nodeid;
> >  
> > @@ -1902,11 +1903,21 @@ static void do_init(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  fuse_reply_err(req, EINVAL);
> >  return;
> >  }
> > +flags |= arg->flags;
> > +}
> > +
> > +/* fuse_init_in was extended again with minor version 36 */
> > +if (sizeof(*arg) > compat2_size && (arg->flags & FUSE_INIT_EXT)) {
> > +if (!fuse_mbuf_iter_advance(iter, sizeof(*arg) - compat2_size)) {
> > +fuse_reply_err(req, EINVAL);
> > +return;
> > +}
> 
> Instead of reading upto sizeof(*arg) should we actually read up to the
> size of the last field we know about?  That way if fuse_init_in grows
> again this wont break?

I think that makes sense. Can probably call it init_size_v7_36, which
represents size of fuse_init till protocol version 7.36 extension.
Will update the patch.

Vivek


> 
> (Other than that OK)
> 
> > +flags |= (uint64_t) arg->flags2 << 32;
> >  }
> >  
> >  fuse_log(FUSE_LOG_DEBUG, "INIT: %u.%u\n", arg->major, arg->minor);
> >  if (arg->major == 7 && arg->minor >= 6) {
> > -fuse_log(FUSE_LOG_DEBUG, "flags=0x%08x\n", arg->flags);
> > +fuse_log(FUSE_LOG_DEBUG, "flags=0x%016llx\n", flags);
> >  fuse_log(FUSE_LOG_DEBUG, "max_readahead=0x%08x\n", 
> > arg->max_readahead);
> >  }
> >  se->conn.proto_major = arg->major;
> > @@ -1934,68 +1945,68 @@ static void do_init(fuse_req_t req, fuse_ino_t 
> > nodeid,
> >  if (arg->max_readahead < se->conn.max_readahead) {
> >  se->conn.max_readahead = arg->max_readahead;
> >  }
> > -if (arg->flags & FUSE_ASYNC_READ) {
> > +if (flags & FUSE_ASYNC_READ) {
> >  se->conn.capable |= FUSE_CAP_ASYNC_READ;
> >  }
> > -if (arg->flags & FUSE_POSIX_LOCKS) {
> > +if (flags & FUSE_POSIX_LOCKS) {
> >  se->conn.capable |= FUSE_CAP_POSIX_LOCKS;
> >  }
> > -if (arg->flags & FUSE_ATOMIC_O_TRUNC) {
> > +if (flags & FUSE_ATOMIC_O_TRUNC) {
> >  se->conn.capable |= FUSE_CAP_ATOMIC_O_TRUNC;
> >  }
> > -if (arg->flags & FUSE_EXPORT_SUPPORT) {
> > +if (flags & FUSE_EXPORT_SUPPORT) {
> >  se->conn.capable |= FUSE_CAP_EXPORT_SUPPORT;
> >  }
> > -if (arg->flags & FUSE_DONT_MASK) {
> > +if (flags & FUSE_DONT_MASK) {
> >  se->conn.capable |= FUSE_CAP_DONT_MASK;
> >  }
> > -if (arg->flags & FUSE_FLOCK_LOCKS) {
> > +if (flags & FUSE_FLOCK_LOCKS) {
> >  se->conn.capable |= FUSE_CAP_FLOCK_LOCKS;
> >  }
> > -if (arg->flags & FUSE_AUTO_INVAL_DATA) {
> > +if (flags & FUSE_AUTO_INVAL_DATA) {
> >  se->conn.capable |= FUSE_CAP_AUTO_INVAL_DATA;
> >  }
> > -if (arg->flags & FUSE_DO_READDIRPLUS) {
> > +if (flags & FUSE_DO_READDIRPLUS) {
> >  se->conn.capable |= FUSE_CAP_READDIRPLUS;
> >  }
> > -if (arg->flags & FUSE_READDIRPLUS_AUTO) {
> > +if (flags & FUSE_READDIRPLUS_AUTO) {
> >  se->conn.capable |= FUSE_CAP_READDIRPLUS_AUTO;
> >  }
> > -if (arg->flags & FUSE_ASYNC_DIO) {
> >

Re: [PATCH v4 4/9] virtiofsd: Extend size of fuse_conn_info->capable and ->want fields

2022-01-27 Thread Vivek Goyal

On Thu, Jan 27, 2022 at 05:53:20PM +, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > ->capable keeps track of what capabilities the kernel supports and ->want
> > keeps track of what capabilities the filesystem wants.
> > 
> > Right now these fields are 32 bits in size. But fuse has now run out of
> > bits, and capabilities can have bit numbers higher than 31.
> > 
> > That means 32-bit fields are not sufficient anymore. Increase the size to
> > 64 bits so that we can add newer capabilities and still be able to use
> > existing code to check and set the capabilities.
> > 
> > Signed-off-by: Vivek Goyal 
> > ---
> >  tools/virtiofsd/fuse_common.h   | 4 ++--
> >  tools/virtiofsd/fuse_lowlevel.c | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
> > index 0c2665b977..6f8a988202 100644
> > --- a/tools/virtiofsd/fuse_common.h
> > +++ b/tools/virtiofsd/fuse_common.h
> > @@ -439,7 +439,7 @@ struct fuse_conn_info {
> >  /**
> >   * Capability flags that the kernel supports (read-only)
> >   */
> > -unsigned capable;
> > +uint64_t capable;
> >  
> >  /**
> >   * Capability flags that the filesystem wants to enable.
> > @@ -447,7 +447,7 @@ struct fuse_conn_info {
> >   * libfuse attempts to initialize this field with
> >   * reasonable default values before calling the init() handler.
> >   */
> > -unsigned want;
> > +uint64_t want;
> >  
> >  /**
> >   * Maximum number of pending "background" requests. A
> > diff --git a/tools/virtiofsd/fuse_lowlevel.c 
> > b/tools/virtiofsd/fuse_lowlevel.c
> > index c3af5ede08..f3f5e70be6 100644
> > --- a/tools/virtiofsd/fuse_lowlevel.c
> > +++ b/tools/virtiofsd/fuse_lowlevel.c
> > @@ -2063,7 +2063,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
> >  if (se->conn.want & (~se->conn.capable)) {
> >  fuse_log(FUSE_LOG_ERR,
> >   "fuse: error: filesystem requested capabilities "
> > - "0x%x that are not supported by kernel, aborting.\n",
> > + "0x%lx that are not supported by kernel, aborting.\n",
> 
> I think this will be OK in practice (need to check 32 bit); but weren't
> you using llx in the last patch?

Probably I should use %llx so that it works fine on 32 bit. Will change
it.
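
For reference, a small sketch of two portable ways to print a uint64_t
(%lx is only correct where long is 64 bits wide). The helper names here
are hypothetical and not part of the patch:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* %llx with an explicit cast: unsigned long long is at least 64 bits. */
static int demo_fmt_llx(char *buf, size_t len, uint64_t caps)
{
    return snprintf(buf, len, "0x%llx", (unsigned long long)caps);
}

/* PRIx64 expands to the matching conversion for uint64_t on the target. */
static int demo_fmt_prix64(char *buf, size_t len, uint64_t caps)
{
    return snprintf(buf, len, "0x%" PRIx64, caps);
}
```

Both variants should produce the same text on 32-bit and 64-bit hosts.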

Vivek

> 
> 
> 
> Reviewed-by: Dr. David Alan Gilbert 
> 
> Dave
> 
> >   se->conn.want & (~se->conn.capable));
> >  fuse_reply_err(req, EPROTO);
> >  se->error = -EPROTO;
> > -- 
> > 2.31.1
> > 
> -- 
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>

Re: [PATCH v4 1/9] linux-headers: Update headers to v5.17-rc1

2022-01-27 Thread Vivek Goyal

On Thu, Jan 27, 2022 at 05:21:02PM +, Dr. David Alan Gilbert wrote:
> * Vivek Goyal (vgo...@redhat.com) wrote:
> > Update headers to 5.17-rc1. I need latest fuse changes.
> > 
> > Signed-off-by: Vivek Goyal 
> 
> Can you just confirm that this lot was generated by running qemu's
> scripts/update-linux-headers.sh   ?

Yes, I generated it using scripts/update-linux-headers.sh

Vivek

> 
> Dave
> 
> > ---
> >  include/standard-headers/asm-x86/kvm_para.h   |   1 +
> >  include/standard-headers/drm/drm_fourcc.h |  11 ++
> >  include/standard-headers/linux/ethtool.h  |   1 +
> >  include/standard-headers/linux/fuse.h |  60 +++-
> >  include/standard-headers/linux/pci_regs.h | 142 +-
> >  include/standard-headers/linux/virtio_gpio.h  |  72 +
> >  include/standard-headers/linux/virtio_i2c.h   |  47 ++
> >  include/standard-headers/linux/virtio_iommu.h |   8 +-
> >  .../standard-headers/linux/virtio_pcidev.h|  65 
> >  include/standard-headers/linux/virtio_scmi.h  |  24 +++
> >  linux-headers/asm-generic/unistd.h|   5 +-
> >  linux-headers/asm-mips/unistd_n32.h   |   2 +
> >  linux-headers/asm-mips/unistd_n64.h   |   2 +
> >  linux-headers/asm-mips/unistd_o32.h   |   2 +
> >  linux-headers/asm-powerpc/unistd_32.h |   2 +
> >  linux-headers/asm-powerpc/unistd_64.h |   2 +
> >  linux-headers/asm-riscv/bitsperlong.h |  14 ++
> >  linux-headers/asm-riscv/mman.h|   1 +
> >  linux-headers/asm-riscv/unistd.h  |  44 ++
> >  linux-headers/asm-s390/unistd_32.h|   2 +
> >  linux-headers/asm-s390/unistd_64.h|   2 +
> >  linux-headers/asm-x86/kvm.h   |  16 +-
> >  linux-headers/asm-x86/unistd_32.h |   1 +
> >  linux-headers/asm-x86/unistd_64.h |   1 +
> >  linux-headers/asm-x86/unistd_x32.h|   1 +
> >  linux-headers/linux/kvm.h |  17 +++
> >  26 files changed, 469 insertions(+), 76 deletions(-)
> >  create mode 100644 include/standard-headers/linux/virtio_gpio.h
> >  create mode 100644 include/standard-headers/linux/virtio_i2c.h
> >  create mode 100644 include/standard-headers/linux/virtio_pcidev.h
> >  create mode 100644 include/standard-headers/linux/virtio_scmi.h
> >  create mode 100644 linux-headers/asm-riscv/bitsperlong.h
> >  create mode 100644 linux-headers/asm-riscv/mman.h
> >  create mode 100644 linux-headers/asm-riscv/unistd.h
> > 
> > diff --git a/include/standard-headers/asm-x86/kvm_para.h 
> > b/include/standard-headers/asm-x86/kvm_para.h
> > index 204cfb8640..f0235e58a1 100644
> > --- a/include/standard-headers/asm-x86/kvm_para.h
> > +++ b/include/standard-headers/asm-x86/kvm_para.h
> > @@ -8,6 +8,7 @@
> >   * should be used to determine that a VM is running under KVM.
> >   */
> >  #define KVM_CPUID_SIGNATURE0x4000
> > +#define KVM_SIGNATURE "KVMKVMKVM\0\0\0"
> >  
> >  /* This CPUID returns two feature bitmaps in eax, edx. Before enabling
> >   * a particular paravirtualization, the appropriate feature bit should
> > diff --git a/include/standard-headers/drm/drm_fourcc.h 
> > b/include/standard-headers/drm/drm_fourcc.h
> > index 2c025cb4fe..4888f85f69 100644
> > --- a/include/standard-headers/drm/drm_fourcc.h
> > +++ b/include/standard-headers/drm/drm_fourcc.h
> > @@ -313,6 +313,13 @@ extern "C" {
> >   */
> >  #define DRM_FORMAT_P016fourcc_code('P', '0', '1', '6') /* 2x2 
> > subsampled Cr:Cb plane 16 bits per channel */
> >  
> > +/* 2 plane YCbCr420.
> > + * 3 10 bit components and 2 padding bits packed into 4 bytes.
> > + * index 0 = Y plane, [31:0] x:Y2:Y1:Y0 2:10:10:10 little endian
> > + * index 1 = Cr:Cb plane, [63:0] x:Cr2:Cb2:Cr1:x:Cb1:Cr0:Cb0 
> > [2:10:10:10:2:10:10:10] little endian
> > + */
> > +#define DRM_FORMAT_P030fourcc_code('P', '0', '3', '0') /* 2x2 
> > subsampled Cr:Cb plane 10 bits per channel packed */
> > +
> >  /* 3 plane non-subsampled (444) YCbCr
> >   * 16 bits per component, but only 10 bits are used and 6 bits are padded
> >   * index 0: Y plane, [15:0] Y:x [10:6] little endian
> > @@ -853,6 +860,10 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t 
> > modifier)
> >   * and UV.  Some SAND-using hardware stores UV in a separate tiled
> >   * image from Y to reduce the column height, which is not supported
> >   * with these modifiers.
> > + *
> > + * The DRM_FO

Re: [Virtio-fs] [PATCH v4 1/2] virtiofsd: Track mounts

2022-01-26 Thread Vivek Goyal

On Wed, Jan 26, 2022 at 05:47:09PM -0500, Vivek Goyal wrote:
> On Tue, Jan 25, 2022 at 03:12:11PM +0100, Greg Kurz wrote:
> > The upcoming implementation of ->sync_fs() needs to know about all
> > submounts in order to call syncfs() on them when virtiofsd is started
> > without '-o announce_submounts'.
> > 
> > Track every inode that comes up with a new mount id in a GHashTable.
> > If the mount id isn't available, e.g. no statx() on the host, fallback
> > on the device id for the key. This is done during lookup because we
> > only care for the submounts that the client knows about. The inode
> > is removed from the hash table when ultimately unreferenced. This
> > can happen on a per-mount basis when the client posts a FUSE_FORGET
> > request or for all submounts at once with FUSE_DESTROY.
> > 
> > Signed-off-by: Greg Kurz 
> > ---
> >  tools/virtiofsd/passthrough_ll.c | 43 +---
> >  1 file changed, 40 insertions(+), 3 deletions(-)
> > 
> > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > b/tools/virtiofsd/passthrough_ll.c
> > index 64b5b4fbb186..7bf31fc129c8 100644
> > --- a/tools/virtiofsd/passthrough_ll.c
> > +++ b/tools/virtiofsd/passthrough_ll.c
> > @@ -117,6 +117,7 @@ struct lo_inode {
> >  GHashTable *posix_locks; /* protected by lo_inode->plock_mutex */
> >  
> >  mode_t filetype;
> > +bool is_mnt;
> >  };
> >  
> >  struct lo_cred {
> > @@ -164,6 +165,7 @@ struct lo_data {
> >  bool use_statx;
> >  struct lo_inode root;
> >  GHashTable *inodes; /* protected by lo->mutex */
> > +GHashTable *mnt_inodes; /* protected by lo->mutex */
> >  struct lo_map ino_map; /* protected by lo->mutex */
> >  struct lo_map dirp_map; /* protected by lo->mutex */
> >  struct lo_map fd_map; /* protected by lo->mutex */
> > @@ -1000,6 +1002,31 @@ static int do_statx(struct lo_data *lo, int dirfd, 
> > const char *pathname,
> >  return 0;
> >  }
> >  
> 
> Hi Greg,
> 
> Thanks for the patches. Had a quick look. Overall these patches look
> pretty good to me. I will spend more time testing and having a 
> closer look. Some quick thoughts below.
> 
> > +static uint64_t mnt_inode_key(struct lo_inode *inode)
> > +{
> > +/* Prefer mnt_id, fallback on dev */
> > +return inode->key.mnt_id ? inode->key.mnt_id : inode->key.dev;
> > +}
> 
> I am not sure we should use inode->key.dev. This might create a problem
> if the same file system is bind mounted at two paths in the shared dir.
> Say /dev/sdb is mounted at foo1/ and then bind mounted at foo2/ in the
> shared dir. A user looks up foo1/ and does some writes. Then we
> look up foo2/ and release that inode. The release of foo2 will drop the
> inode from the hash. That means if another write later happens
> in foo1/ followed by syncfs(), we will not issue syncfs() on the
> filesystem backed by /dev/sdb.
> 
> So what are the options.
> 
> A. Make mnt_id mandatory and do not implement submount syncfs if mnt_id
>    is not available.
> 
> B. Don't do anything and live with this. It is a corner case and
>    still better than not implementing submount syncfs at all.
> 
> C. Instead of adding lo_inode to hash, create another kind of object
>and reference count that. It could be a mount fd which we open
>when we add object for the first time. So when foo1/ inode is
>instantiated, create mountfd object, add it to hash table using
>device id as the key. When foo2 comes along, we find the object
>in the hash and just bump up the ref. Now this mountfd object
>will go away when both foo1 and foo2 inodes have been evicted
>and will take care of the issue I am referring to.

And we could take a ref on the mountfd object only when we find an
inode whose device id/mnt_id differs from its parent's. That way we
don't go through this exercise for every inode in the system, only
for those directory inodes which are mount points.

Vivek

> 
> I guess C is a little extra complexity but probably not too bad.
> WDYT? It sounds a little better than options A and B.
> 
> 
> > +
> > +static void add_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> > +{
> > +uint64_t mnt_key = mnt_inode_key(inode);
> > +
> > +if (!g_hash_table_contains(lo->mnt_inodes, &mnt_key)) {
> > +inode->is_mnt = true;
> > +g_hash_table_insert(lo->mnt_inodes, &mnt_key, inode);
> > +}
> > +}
> > +
> > +static void remove_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> > +{
> > +uint64_t mnt

Re: [PATCH v4 1/2] virtiofsd: Track mounts

2022-01-26 Thread Vivek Goyal

On Tue, Jan 25, 2022 at 03:12:11PM +0100, Greg Kurz wrote:
> The upcoming implementation of ->sync_fs() needs to know about all
> submounts in order to call syncfs() on them when virtiofsd is started
> without '-o announce_submounts'.
> 
> Track every inode that comes up with a new mount id in a GHashTable.
> If the mount id isn't available, e.g. no statx() on the host, fallback
> on the device id for the key. This is done during lookup because we
> only care for the submounts that the client knows about. The inode
> is removed from the hash table when ultimately unreferenced. This
> can happen on a per-mount basis when the client posts a FUSE_FORGET
> request or for all submounts at once with FUSE_DESTROY.
> 
> Signed-off-by: Greg Kurz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 43 +---
>  1 file changed, 40 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 64b5b4fbb186..7bf31fc129c8 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -117,6 +117,7 @@ struct lo_inode {
>  GHashTable *posix_locks; /* protected by lo_inode->plock_mutex */
>  
>  mode_t filetype;
> +bool is_mnt;
>  };
>  
>  struct lo_cred {
> @@ -164,6 +165,7 @@ struct lo_data {
>  bool use_statx;
>  struct lo_inode root;
>  GHashTable *inodes; /* protected by lo->mutex */
> +GHashTable *mnt_inodes; /* protected by lo->mutex */
>  struct lo_map ino_map; /* protected by lo->mutex */
>  struct lo_map dirp_map; /* protected by lo->mutex */
>  struct lo_map fd_map; /* protected by lo->mutex */
> @@ -1000,6 +1002,31 @@ static int do_statx(struct lo_data *lo, int dirfd, 
> const char *pathname,
>  return 0;
>  }
>  

Hi Greg,

Thanks for the patches. Had a quick look. Overall these patches look
pretty good to me. I will spend more time testing and having a 
closer look. Some quick thoughts below.

> +static uint64_t mnt_inode_key(struct lo_inode *inode)
> +{
> +/* Prefer mnt_id, fallback on dev */
> +return inode->key.mnt_id ? inode->key.mnt_id : inode->key.dev;
> +}

I am not sure we should use inode->key.dev. This might create a problem
if the same file system is bind mounted at two paths in the shared dir.
Say /dev/sdb is mounted at foo1/ and then bind mounted at foo2/ in the
shared dir. A user looks up foo1/ and does some writes. Then we
look up foo2/ and release that inode. The release of foo2 will drop the
inode from the hash. That means if another write later happens
in foo1/ followed by syncfs(), we will not issue syncfs() on the
filesystem backed by /dev/sdb.

So what are the options.

A. Make mnt_id mandatory and do not implement submount syncfs if mnt_id
   is not available.

B. Don't do anything and live with this. It is a corner case and
   still better than not implementing submount syncfs at all.

C. Instead of adding lo_inode to hash, create another kind of object
   and reference count that. It could be a mount fd which we open
   when we add object for the first time. So when foo1/ inode is
   instantiated, create mountfd object, add it to hash table using
   device id as the key. When foo2 comes along, we find the object
   in the hash and just bump up the ref. Now this mountfd object
   will go away when both foo1 and foo2 inodes have been evicted
   and will take care of the issue I am referring to.

I guess C is a little extra complexity but probably not too bad.
WDYT? It sounds a little better than options A and B.
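
To make option C a bit more concrete, here is a minimal sketch of a
reference-counted per-mount object keyed by device id. It deliberately
avoids GLib and uses a small fixed table; every name in it (demo_mount,
demo_mount_get, demo_mount_put, ...) is hypothetical, and the real thing
would presumably hold a mount fd and live in a GHashTable under lo->mutex:

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_MOUNTS 16

/* One tracked mount; refcount == 0 marks a free slot. */
struct demo_mount {
    uint64_t key;       /* device id (or mount id) used as the key */
    unsigned refcount;  /* number of inodes currently pinning this mount */
};

struct demo_mount_table {
    struct demo_mount slots[DEMO_MAX_MOUNTS];
};

static struct demo_mount *demo_mount_find(struct demo_mount_table *t,
                                          uint64_t key)
{
    for (size_t i = 0; i < DEMO_MAX_MOUNTS; i++) {
        if (t->slots[i].refcount && t->slots[i].key == key) {
            return &t->slots[i];
        }
    }
    return NULL;
}

/* Take a reference; returns the new refcount, or 0 if the table is full. */
static unsigned demo_mount_get(struct demo_mount_table *t, uint64_t key)
{
    struct demo_mount *m = demo_mount_find(t, key);

    if (m) {
        return ++m->refcount;
    }
    for (size_t i = 0; i < DEMO_MAX_MOUNTS; i++) {
        if (!t->slots[i].refcount) {
            t->slots[i].key = key;
            t->slots[i].refcount = 1;
            return 1;
        }
    }
    return 0;
}

/* Drop a reference; returns the remaining refcount (0 means untracked). */
static unsigned demo_mount_put(struct demo_mount_table *t, uint64_t key)
{
    struct demo_mount *m = demo_mount_find(t, key);

    if (!m) {
        return 0;
    }
    return --m->refcount;
}
```

In the foo1/foo2 example, both lookups would call demo_mount_get() with
the same key, and the entry would only disappear after both inodes have
called demo_mount_put().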

> +
> +static void add_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> +{
> +uint64_t mnt_key = mnt_inode_key(inode);
> +
> +if (!g_hash_table_contains(lo->mnt_inodes, &mnt_key)) {
> +inode->is_mnt = true;
> +g_hash_table_insert(lo->mnt_inodes, &mnt_key, inode);
> +}
> +}
> +
> +static void remove_mnt_inode(struct lo_data *lo, struct lo_inode *inode)
> +{
> +uint64_t mnt_key = mnt_inode_key(inode);
> +
> +if (inode->is_mnt) {
> +g_hash_table_remove(lo->mnt_inodes, &mnt_key);
> +}
> +}

Should we issue syncfs() on this inode when we are removing it? It
is possible that the guest did some writes, let go of the inode and later
issued a syncfs(). By that time the inode is gone and we will not issue any
syncfs() on this filesystem, leaving data in the host page cache.

Thanks
Vivek

> +
>  /*
>   * Increments nlookup on the inode on success. unref_inode_lolocked() must be
>   * called eventually to decrement nlookup again. If inodep is non-NULL, the
> @@ -1086,10 +1113,15 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
> parent, const char *name,
>  pthread_mutex_lock(&lo->mutex);
>  inode->fuse_ino = lo_add_inode_mapping(req, inode);
>  g_hash_table_insert(lo->inodes, &inode->key, inode);
> +add_mnt_inode(lo, inode);
>  pthread_mutex_unlock(&lo->mutex);
>  }
>  e->ino = inode->fuse_ino;
>  
> +fuse_log(FUSE_LOG_DEBUG, "  %lli/%s -> %lli%s\n",
> + (unsigned long long)

[PATCH] virtiofsd: Drop membership of all supplementary groups (CVE-2022-0358)

2022-01-25 Thread Vivek Goyal

At the start, drop membership of all supplementary groups. virtiofsd
does not need them.

If we have membership of the "root" supplementary group, then when we
switch uid/gid using setresuid/setresgid we still retain membership of the
existing supplementary groups. That can allow some operations which are
not normally allowed.

For example, if root in guest creates a dir as follows.

$ mkdir -m 03777 test_dir

This sets SGID on dir as well as allows unprivileged users to write into
this dir. 

And now as unprivileged user open file as follows.

$ su test
$ fd = open("test_dir/priviledge_id", O_RDWR|O_CREAT|O_EXCL, 02755);

This will create an SGID-set executable in test_dir/.

And that's a problem because now an unprivileged user can execute it,
get egid=0 and get access to resources owned by the "root" group. This is
privilege escalation.

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2044863
Fixes: CVE-2022-0358
Reported-by: JIETAO XIAO 
Suggested-by: Miklos Szeredi 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Dr. David Alan Gilbert 
Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c |   26 ++
 1 file changed, 26 insertions(+)

Index: rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c
===
--- rhvgoyal-qemu.orig/tools/virtiofsd/passthrough_ll.c 2022-01-25 
13:38:59.349534531 -0500
+++ rhvgoyal-qemu/tools/virtiofsd/passthrough_ll.c  2022-01-25 
13:39:10.140177868 -0500
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "qemu/cutils.h"
 #include "passthrough_helpers.h"
@@ -1161,6 +1162,29 @@ static void lo_lookup(fuse_req_t req, fu
 #define OURSYS_setresuid SYS_setresuid
 #endif
 
+static void drop_supplementary_groups(void)
+{
+int ret;
+
+ret = getgroups(0, NULL);
+if (ret == -1) {
+fuse_log(FUSE_LOG_ERR, "getgroups() failed with error=%d:%s\n",
+ errno, strerror(errno));
+exit(1);
+}
+
+if (!ret)
+return;
+
+/* Drop all supplementary groups. We should not need it */
+ret = setgroups(0, NULL);
+if (ret == -1) {
+fuse_log(FUSE_LOG_ERR, "setgroups() failed with error=%d:%s\n",
+ errno, strerror(errno));
+exit(1);
+}
+}
+
 /*
  * Change to uid/gid of caller so that file is created with
  * ownership of caller.
@@ -3926,6 +3950,8 @@ int main(int argc, char *argv[])
 
 qemu_init_exec_dir(argv[0]);
 
+drop_supplementary_groups();
+
 pthread_mutex_init(&lo.mutex, NULL);
 lo.inodes = g_hash_table_new(lo_key_hash, lo_key_equal);
 lo.root.fd = -1;

[PATCH v4 1/9] linux-headers: Update headers to v5.17-rc1

2022-01-24 Thread Vivek Goyal

Update headers to 5.17-rc1. I need latest fuse changes.

Signed-off-by: Vivek Goyal 
---
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h |  11 ++
 include/standard-headers/linux/ethtool.h  |   1 +
 include/standard-headers/linux/fuse.h |  60 +++-
 include/standard-headers/linux/pci_regs.h | 142 +-
 include/standard-headers/linux/virtio_gpio.h  |  72 +
 include/standard-headers/linux/virtio_i2c.h   |  47 ++
 include/standard-headers/linux/virtio_iommu.h |   8 +-
 .../standard-headers/linux/virtio_pcidev.h|  65 
 include/standard-headers/linux/virtio_scmi.h  |  24 +++
 linux-headers/asm-generic/unistd.h|   5 +-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-riscv/bitsperlong.h |  14 ++
 linux-headers/asm-riscv/mman.h|   1 +
 linux-headers/asm-riscv/unistd.h  |  44 ++
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |  16 +-
 linux-headers/asm-x86/unistd_32.h |   1 +
 linux-headers/asm-x86/unistd_64.h |   1 +
 linux-headers/asm-x86/unistd_x32.h|   1 +
 linux-headers/linux/kvm.h |  17 +++
 26 files changed, 469 insertions(+), 76 deletions(-)
 create mode 100644 include/standard-headers/linux/virtio_gpio.h
 create mode 100644 include/standard-headers/linux/virtio_i2c.h
 create mode 100644 include/standard-headers/linux/virtio_pcidev.h
 create mode 100644 include/standard-headers/linux/virtio_scmi.h
 create mode 100644 linux-headers/asm-riscv/bitsperlong.h
 create mode 100644 linux-headers/asm-riscv/mman.h
 create mode 100644 linux-headers/asm-riscv/unistd.h

diff --git a/include/standard-headers/asm-x86/kvm_para.h 
b/include/standard-headers/asm-x86/kvm_para.h
index 204cfb8640..f0235e58a1 100644
--- a/include/standard-headers/asm-x86/kvm_para.h
+++ b/include/standard-headers/asm-x86/kvm_para.h
@@ -8,6 +8,7 @@
  * should be used to determine that a VM is running under KVM.
  */
 #define KVM_CPUID_SIGNATURE0x4000
+#define KVM_SIGNATURE "KVMKVMKVM\0\0\0"
 
 /* This CPUID returns two feature bitmaps in eax, edx. Before enabling
  * a particular paravirtualization, the appropriate feature bit should
diff --git a/include/standard-headers/drm/drm_fourcc.h 
b/include/standard-headers/drm/drm_fourcc.h
index 2c025cb4fe..4888f85f69 100644
--- a/include/standard-headers/drm/drm_fourcc.h
+++ b/include/standard-headers/drm/drm_fourcc.h
@@ -313,6 +313,13 @@ extern "C" {
  */
 #define DRM_FORMAT_P016fourcc_code('P', '0', '1', '6') /* 2x2 
subsampled Cr:Cb plane 16 bits per channel */
 
+/* 2 plane YCbCr420.
+ * 3 10 bit components and 2 padding bits packed into 4 bytes.
+ * index 0 = Y plane, [31:0] x:Y2:Y1:Y0 2:10:10:10 little endian
+ * index 1 = Cr:Cb plane, [63:0] x:Cr2:Cb2:Cr1:x:Cb1:Cr0:Cb0 
[2:10:10:10:2:10:10:10] little endian
+ */
+#define DRM_FORMAT_P030fourcc_code('P', '0', '3', '0') /* 2x2 
subsampled Cr:Cb plane 10 bits per channel packed */
+
 /* 3 plane non-subsampled (444) YCbCr
  * 16 bits per component, but only 10 bits are used and 6 bits are padded
  * index 0: Y plane, [15:0] Y:x [10:6] little endian
@@ -853,6 +860,10 @@ drm_fourcc_canonicalize_nvidia_format_mod(uint64_t 
modifier)
  * and UV.  Some SAND-using hardware stores UV in a separate tiled
  * image from Y to reduce the column height, which is not supported
  * with these modifiers.
+ *
+ * The DRM_FORMAT_MOD_BROADCOM_SAND128_COL_HEIGHT modifier is also
+ * supported for DRM_FORMAT_P030 where the columns remain as 128 bytes
+ * wide, but as this is a 10 bpp format that translates to 96 pixels.
  */
 
 #define DRM_FORMAT_MOD_BROADCOM_SAND32_COL_HEIGHT(v) \
diff --git a/include/standard-headers/linux/ethtool.h 
b/include/standard-headers/linux/ethtool.h
index 688eb8dc39..38d5a4cd6e 100644
--- a/include/standard-headers/linux/ethtool.h
+++ b/include/standard-headers/linux/ethtool.h
@@ -231,6 +231,7 @@ enum tunable_id {
ETHTOOL_RX_COPYBREAK,
ETHTOOL_TX_COPYBREAK,
ETHTOOL_PFC_PREVENTION_TOUT, /* timeout in msecs */
+   ETHTOOL_TX_COPYBREAK_BUF_SIZE,
/*
 * Add your fresh new tunable attribute above and remember to update
 * tunable_strings[] in net/ethtool/common.c
diff --git a/include/standard-headers/linux/fuse.h 
b/include/standard-headers/linux/fuse.h
index 23ea31708b..bda06258be 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -184,6 +184,16 @@
  *
  *  7.34
  *  - add FUSE_SYNCFS
+ *
+ *  7.35
+ *  - add FOPEN_N

[PATCH v4 3/9] virtiofsd: Parse extended "struct fuse_init_in"

2022-01-24 Thread Vivek Goyal

Add some code to parse the extended "struct fuse_init_in", and use a local
variable "flags" to represent the 64-bit flags. This will make it easier
to add more features without having to worry about the two 32-bit flag
fields (->flags and ->flags2) in "struct fuse_init_in".
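
A minimal sketch of the combining step, for illustration. DEMO_INIT_EXT is
a stand-in for FUSE_INIT_EXT and its value here is an assumption (check
linux/fuse.h for the real definition); demo_init_flags is a made-up name:

```c
#include <stdint.h>

/* Assumed bit position of the "extended init" flag; see linux/fuse.h. */
#define DEMO_INIT_EXT (1u << 30)

/*
 * Merge the two 32-bit flag words into one 64-bit value. The upper word
 * is only trusted when the sender set the "extended" bit in the lower one.
 */
static uint64_t demo_init_flags(uint32_t flags, uint32_t flags2)
{
    uint64_t all = flags;

    if (flags & DEMO_INIT_EXT) {
        all |= (uint64_t)flags2 << 32;
    }
    return all;
}
```

After this, feature checks can test a single 64-bit value regardless of
which of the two words carries the bit.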

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_lowlevel.c | 55 -
 1 file changed, 33 insertions(+), 22 deletions(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index ce29a70253..c3af5ede08 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1886,6 +1886,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 struct fuse_session *se = req->se;
 size_t bufsize = se->bufsize;
 size_t outargsize = sizeof(outarg);
+uint64_t flags = 0;
 
 (void)nodeid;
 
@@ -1902,11 +1903,21 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 fuse_reply_err(req, EINVAL);
 return;
 }
+flags |= arg->flags;
+}
+
+/* fuse_init_in was extended again with minor version 36 */
+if (sizeof(*arg) > compat2_size && (arg->flags & FUSE_INIT_EXT)) {
+if (!fuse_mbuf_iter_advance(iter, sizeof(*arg) - compat2_size)) {
+fuse_reply_err(req, EINVAL);
+return;
+}
+flags |= (uint64_t) arg->flags2 << 32;
 }
 
 fuse_log(FUSE_LOG_DEBUG, "INIT: %u.%u\n", arg->major, arg->minor);
 if (arg->major == 7 && arg->minor >= 6) {
-fuse_log(FUSE_LOG_DEBUG, "flags=0x%08x\n", arg->flags);
+fuse_log(FUSE_LOG_DEBUG, "flags=0x%016llx\n", flags);
 fuse_log(FUSE_LOG_DEBUG, "max_readahead=0x%08x\n", arg->max_readahead);
 }
 se->conn.proto_major = arg->major;
@@ -1934,68 +1945,68 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 if (arg->max_readahead < se->conn.max_readahead) {
 se->conn.max_readahead = arg->max_readahead;
 }
-if (arg->flags & FUSE_ASYNC_READ) {
+if (flags & FUSE_ASYNC_READ) {
 se->conn.capable |= FUSE_CAP_ASYNC_READ;
 }
-if (arg->flags & FUSE_POSIX_LOCKS) {
+if (flags & FUSE_POSIX_LOCKS) {
 se->conn.capable |= FUSE_CAP_POSIX_LOCKS;
 }
-if (arg->flags & FUSE_ATOMIC_O_TRUNC) {
+if (flags & FUSE_ATOMIC_O_TRUNC) {
 se->conn.capable |= FUSE_CAP_ATOMIC_O_TRUNC;
 }
-if (arg->flags & FUSE_EXPORT_SUPPORT) {
+if (flags & FUSE_EXPORT_SUPPORT) {
 se->conn.capable |= FUSE_CAP_EXPORT_SUPPORT;
 }
-if (arg->flags & FUSE_DONT_MASK) {
+if (flags & FUSE_DONT_MASK) {
 se->conn.capable |= FUSE_CAP_DONT_MASK;
 }
-if (arg->flags & FUSE_FLOCK_LOCKS) {
+if (flags & FUSE_FLOCK_LOCKS) {
 se->conn.capable |= FUSE_CAP_FLOCK_LOCKS;
 }
-if (arg->flags & FUSE_AUTO_INVAL_DATA) {
+if (flags & FUSE_AUTO_INVAL_DATA) {
 se->conn.capable |= FUSE_CAP_AUTO_INVAL_DATA;
 }
-if (arg->flags & FUSE_DO_READDIRPLUS) {
+if (flags & FUSE_DO_READDIRPLUS) {
 se->conn.capable |= FUSE_CAP_READDIRPLUS;
 }
-if (arg->flags & FUSE_READDIRPLUS_AUTO) {
+if (flags & FUSE_READDIRPLUS_AUTO) {
 se->conn.capable |= FUSE_CAP_READDIRPLUS_AUTO;
 }
-if (arg->flags & FUSE_ASYNC_DIO) {
+if (flags & FUSE_ASYNC_DIO) {
 se->conn.capable |= FUSE_CAP_ASYNC_DIO;
 }
-if (arg->flags & FUSE_WRITEBACK_CACHE) {
+if (flags & FUSE_WRITEBACK_CACHE) {
 se->conn.capable |= FUSE_CAP_WRITEBACK_CACHE;
 }
-if (arg->flags & FUSE_NO_OPEN_SUPPORT) {
+if (flags & FUSE_NO_OPEN_SUPPORT) {
 se->conn.capable |= FUSE_CAP_NO_OPEN_SUPPORT;
 }
-if (arg->flags & FUSE_PARALLEL_DIROPS) {
+if (flags & FUSE_PARALLEL_DIROPS) {
 se->conn.capable |= FUSE_CAP_PARALLEL_DIROPS;
 }
-if (arg->flags & FUSE_POSIX_ACL) {
+if (flags & FUSE_POSIX_ACL) {
 se->conn.capable |= FUSE_CAP_POSIX_ACL;
 }
-if (arg->flags & FUSE_HANDLE_KILLPRIV) {
+if (flags & FUSE_HANDLE_KILLPRIV) {
 se->conn.capable |= FUSE_CAP_HANDLE_KILLPRIV;
 }
-if (arg->flags & FUSE_NO_OPENDIR_SUPPORT) {
+if (flags & FUSE_NO_OPENDIR_SUPPORT) {
 se->conn.capable |= FUSE_CAP_NO_OPENDIR_SUPPORT;
 }
-if (!(arg->flags & FUSE_MAX_PAGES)) {
+if (!(flags & FUSE_MAX_PAGES)) {
 size_t max_bufsize = FUSE_DEFAULT_MAX_PAGES_PER_REQ * getpagesize() +
  FUSE_BUFFER_HEADER_SIZE;
 if (bufsize > max_bufsize) {
 bufsize = max_bufsize;
 }
 }
-if (arg->flags & FUSE_SUBMOUNTS) {
+

[PATCH v4 7/9] virtiofsd: Create new file with fscreate set

2022-01-24 Thread Vivek Goyal

This patch adds support for setting /proc/thread-self/attr/fscreate before
file creation. It is set to the value sent by the client. This allows
the security context to be set atomically w.r.t. file creation.

This is primarily useful when either SELinux is not enabled on the
host, or host and guest policies are in sync and don't conflict.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 317 ---
 1 file changed, 290 insertions(+), 27 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 54978b7fae..7a714b1b5e 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -172,10 +172,14 @@ struct lo_data {
 
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
+/* An O_PATH file descriptor to /proc/self/task/ */
+int proc_self_task;
 int user_killpriv_v2, killpriv_v2;
 /* If set, virtiofsd is responsible for setting umask during creation */
 bool change_umask;
 int user_posix_acl, posix_acl;
+/* Keeps track if /proc//attr/fscreate should be used or not */
+bool use_fscreate;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -229,6 +233,11 @@ static struct lo_inode *lo_find(struct lo_data *lo, struct 
stat *st,
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
+#define FCHDIR_NOFAIL(fd) do { \
+int fchdir_res = fchdir(fd);   \
+assert(fchdir_res == 0);   \
+} while (0)
+
 static bool is_dot_or_dotdot(const char *name)
 {
 return name[0] == '.' &&
@@ -255,6 +264,33 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/*
+ * Tries to figure out if /proc//attr/fscreate is usable or not. With
+ * selinux=0, read from fscreate returns -EINVAL.
+ *
+ * TODO: Link with libselinux and use is_selinux_enabled() instead down
+ * the line. It probably will be more reliable indicator.
+ */
+static bool is_fscreate_usable(struct lo_data *lo)
+{
+char procname[64];
+int fscreate_fd;
+size_t bytes_read;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_RDWR);
+if (fscreate_fd == -1) {
+return false;
+}
+
+bytes_read = read(fscreate_fd, procname, 64);
+close(fscreate_fd);
+if (bytes_read == -1) {
+return false;
+}
+return true;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -1259,16 +1295,140 @@ static void lo_restore_cred_gain_cap(struct lo_cred 
*old, bool restore_umask,
 }
 }
 
+/* Helpers to set/reset fscreate */
+static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
+  size_t ctxlen, int *fd)
+{
+char procname[64];
+int fscreate_fd, err = 0;
+size_t written;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_WRONLY);
+err = fscreate_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+
+written = write(fscreate_fd, ctx, ctxlen);
+err = written == -1 ? errno : 0;
+if (err) {
+goto out;
+}
+
+*fd = fscreate_fd;
+return 0;
+out:
+close(fscreate_fd);
+return err;
+}
+
+static void close_reset_proc_fscreate(int fd)
+{
+if ((write(fd, NULL, 0)) == -1) {
+fuse_log(FUSE_LOG_WARNING, "Failed to reset fscreate. err=%d\n", 
errno);
+}
+close(fd);
+return;
+}
+
+static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
+   const char *name, const char *secctx_name)
+{
+int path_fd, err;
+char procname[64];
+struct lo_data *lo = lo_data(req);
+
+if (!req->secctx.ctxlen) {
+return 0;
+}
+
+/* Open newly created element with O_PATH */
+path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+err = path_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+sprintf(procname, "%i", path_fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+/* Set security context. This is not atomic w.r.t file creation */
+err = setxattr(procname, secctx_name, req->secctx.ctx, req->secctx.ctxlen,
+   0);
+if (err) {
+err = errno;
+}
+FCHDIR_NOFAIL(lo->root.fd);
+close(path_fd);
+return err;
+}
+
+static int do_mknod_symlink(fuse_req_t req, struct lo_inode *dir,
+const char *name, mode_t mode, dev_t rdev,
+const char *link)
+{
+int err, fscreate_fd = -1;
+const char *secctx_name = req->secctx.name;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+char *mapped_name =

[PATCH v4 5/9] virtiofsd, fuse_lowlevel.c: Add capability to parse security context

2022-01-24 Thread Vivek Goyal

Add capability to enable and parse security context as sent by client
and put into fuse_req. Filesystems now can get security context from
request and set it on files during creation.
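
To make the wire layout concrete, here is a minimal standalone sketch of the
parsing steps described above. It assumes the request tail is laid out as a
header, then one context descriptor, then a NUL-terminated xattr name, then
the context bytes. The demo_* names and the struct definitions are
illustrative assumptions for this sketch, not the actual virtiofsd or kernel
definitions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative mirrors of the wire structs (assumed layout). */
struct demo_secctx_header {
    uint32_t size;      /* size of the whole security context area */
    uint32_t nr_secctx; /* number of contexts that follow */
};

struct demo_secctx {
    uint32_t size;      /* length of the context value in bytes */
    uint32_t padding;
};

struct demo_parsed {
    const char *name;   /* xattr name, e.g. "security.selinux" */
    const void *ctx;    /* context bytes */
    uint32_t ctxlen;
};

/* Return a pointer to the next n bytes, or NULL if the buffer is too short. */
static const void *demo_advance(const uint8_t *buf, size_t len, size_t *pos,
                                size_t n)
{
    const void *p;

    if (n > len - *pos) {
        return NULL;
    }
    p = buf + *pos;
    *pos += n;
    return p;
}

/* Returns 0 on success (out->ctxlen == 0 if no context was sent), else -EINVAL. */
static int demo_parse_secctx(const uint8_t *buf, size_t len,
                             struct demo_parsed *out)
{
    struct demo_secctx_header hdr;
    struct demo_secctx sc;
    const void *p, *nul;
    const char *name;
    size_t pos = 0;

    memset(out, 0, sizeof(*out));

    p = demo_advance(buf, len, &pos, sizeof(hdr));
    if (!p) {
        return -EINVAL;
    }
    memcpy(&hdr, p, sizeof(hdr));

    /* At most one security context is accepted. */
    if (hdr.nr_secctx > 1) {
        return -EINVAL;
    }
    if (hdr.nr_secctx == 0) {
        return 0;
    }

    p = demo_advance(buf, len, &pos, sizeof(sc));
    if (!p) {
        return -EINVAL;
    }
    memcpy(&sc, p, sizeof(sc));

    /* The name must be NUL-terminated inside the remaining bytes. */
    name = (const char *)(buf + pos);
    nul = memchr(name, '\0', len - pos);
    if (!nul) {
        return -EINVAL;
    }
    pos += (size_t)((const char *)nul - name) + 1;

    p = demo_advance(buf, len, &pos, sc.size);
    if (!p) {
        return -EINVAL;
    }

    out->name = name;
    out->ctx = p;
    out->ctxlen = sc.size;
    return 0;
}
```

A caller would treat a non-zero return as a malformed request, reply with the
positive errno value and stop processing that request.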

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   |  5 ++
 tools/virtiofsd/fuse_i.h|  7 +++
 tools/virtiofsd/fuse_lowlevel.c | 95 -
 3 files changed, 106 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 6f8a988202..bf46954dab 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -377,6 +377,11 @@ struct fuse_file_info {
  */
 #define FUSE_CAP_SETXATTR_EXT (1 << 29)
 
+/**
+ * Indicates that file server supports creating file security context
+ */
+#define FUSE_CAP_SECURITY_CTX (1ULL << 32)
+
 /**
  * Ioctl flags
  *
diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 492e002181..a5572fa4ae 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -15,6 +15,12 @@
 struct fv_VuDev;
 struct fv_QueueInfo;
 
+struct fuse_security_context {
+const char *name;
+uint32_t ctxlen;
+const void *ctx;
+};
+
 struct fuse_req {
 struct fuse_session *se;
 uint64_t unique;
@@ -35,6 +41,7 @@ struct fuse_req {
 } u;
 struct fuse_req *next;
 struct fuse_req *prev;
+struct fuse_security_context secctx;
 };
 
 struct fuse_notify_req {
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index f3f5e70be6..0bb6f7f316 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -886,11 +886,59 @@ static void do_readlink(fuse_req_t req, fuse_ino_t nodeid,
 }
 }
 
+static int parse_secctx_fill_req(fuse_req_t req, struct fuse_mbuf_iter *iter)
+{
+struct fuse_secctx_header *fsecctx_header;
+struct fuse_secctx *fsecctx;
+const void *secctx;
+const char *name;
+
+fsecctx_header = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx_header));
+if (!fsecctx_header) {
+return -EINVAL;
+}
+
+/*
+ * As of now maximum of one security context is supported. It can
+ * change in future though.
+ */
+if (fsecctx_header->nr_secctx > 1) {
+return -EINVAL;
+}
+
+/* No security context sent. Maybe no LSM supports it */
+if (!fsecctx_header->nr_secctx) {
+return 0;
+}
+
+fsecctx = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx));
+if (!fsecctx) {
+return -EINVAL;
+}
+
+name = fuse_mbuf_iter_advance_str(iter);
+if (!name) {
+return -EINVAL;
+}
+
+secctx = fuse_mbuf_iter_advance(iter, fsecctx->size);
+if (!secctx) {
+return -EINVAL;
+}
+
+req->secctx.name = name;
+req->secctx.ctx = secctx;
+req->secctx.ctxlen = fsecctx->size;
+return 0;
+}
+
 static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
  struct fuse_mbuf_iter *iter)
 {
 struct fuse_mknod_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -901,6 +949,13 @@ static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.mknod) {
 req->se->op.mknod(req, nodeid, name, arg->mode, arg->rdev);
 } else {
@@ -913,6 +968,8 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 {
 struct fuse_mkdir_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -923,6 +980,13 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.mkdir) {
 req->se->op.mkdir(req, nodeid, name, arg->mode);
 } else {
@@ -969,12 +1033,21 @@ static void do_symlink(fuse_req_t req, fuse_ino_t nodeid,
 {
 const char *name = fuse_mbuf_iter_advance_str(iter);
 const char *linkname = fuse_mbuf_iter_advance_str(iter);
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 if (!name || !linkname) {
 fuse_reply_err(req, EINVAL);
 return;
 }
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.symlink) {
 req

[PATCH v4 2/9] virtiofsd: Fix breakage due to fuse_init_in size change

2022-01-24 Thread Vivek Goyal

Kernel version 5.17 increased the size of "struct fuse_init_in".
Previously this struct was 16 bytes; it has now been extended to
64 bytes.

Once the qemu headers are updated to the latest version, do_init() will
expect to receive a 64-byte struct (for protocol version major 7 and
minor >= 6). But if the guest is booting an older kernel (older than 5.17),
it still sends the older 16-byte fuse_init_in, so do_init() fails and
the virtiofs mount fails.

Fix this by parsing 16 bytes only for now. Separate patches will be
posted which will parse rest of the bytes and enable new functionality.
Right now we don't support any of the new functionality, so we don't
lose anything by not parsing bytes beyond 16.
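
As a rough illustration of the size arithmetic, the sketch below uses a
hypothetical struct demo_init_in whose field layout is assumed to match the
extended 64-byte struct (four 32-bit legacy fields followed by the new ones).
The demo_* names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: 16 legacy bytes followed by 48 bytes of new fields. */
struct demo_init_in {
    uint32_t major;
    uint32_t minor;
    uint32_t max_readahead;
    uint32_t flags;
    uint32_t flags2;
    uint32_t unused[11];
};

/* Bytes that every protocol version sends: major and minor. */
static size_t demo_compat_size(void)
{
    return offsetof(struct demo_init_in, max_readahead);
}

/* Bytes sent by pre-5.17 kernels: everything up to and including flags. */
static size_t demo_compat2_size(void)
{
    return offsetof(struct demo_init_in, flags) + sizeof(uint32_t);
}

/* Bytes still to consume after major/minor when only the legacy part is parsed. */
static size_t demo_bytes_after_version(void)
{
    return demo_compat2_size() - demo_compat_size();
}
```

With this layout the legacy prefix is 16 bytes and the full struct is 64, so
consuming sizeof(struct) would demand 48 bytes that an older guest never sends.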

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_lowlevel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index e4679c73ab..ce29a70253 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -1880,6 +1880,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 struct fuse_mbuf_iter *iter)
 {
 size_t compat_size = offsetof(struct fuse_init_in, max_readahead);
+size_t compat2_size = offsetof(struct fuse_init_in, flags) + sizeof(uint32_t);
 struct fuse_init_in *arg;
 struct fuse_init_out outarg;
 struct fuse_session *se = req->se;
@@ -1897,7 +1898,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 
 /* ...and now consume the new fields. */
 if (arg->major == 7 && arg->minor >= 6) {
-if (!fuse_mbuf_iter_advance(iter, sizeof(*arg) - compat_size)) {
+if (!fuse_mbuf_iter_advance(iter, compat2_size - compat_size)) {
 fuse_reply_err(req, EINVAL);
 return;
 }
-- 
2.31.1

[PATCH v4 0/9] virtiofsd: Add support for file security context at creation

2022-01-24 Thread Vivek Goyal

Hi,

This is V4 of the patches. I posted V3 here.

https://listman.redhat.com/archives/virtio-fs/2021-November/msg00058.html

The corresponding kernel patches have now been merged in 5.17-rc1, so I am
requesting inclusion of these patches.

These will allow us to support SELinux with virtiofs. The client will send
the SELinux context to the server at file creation and the server can set
it on the file.

Please have a look and consider for inclusion.

Thanks
Vivek

Vivek Goyal (9):
  linux-headers: Update headers to v5.17-rc1
  virtiofsd: Fix breakage due to fuse_init_in size change
  virtiofsd: Parse extended "struct fuse_init_in"
  virtiofsd: Extend size of fuse_conn_info->capable and ->want fields
  virtiofsd, fuse_lowlevel.c: Add capability to parse security context
  virtiofsd: Move core file creation code in separate function
  virtiofsd: Create new file with fscreate set
  virtiofsd: Create new file using O_TMPFILE and set security context
  virtiofsd: Add an option to enable/disable security label

 docs/tools/virtiofsd.rst  |   7 +
 include/standard-headers/asm-x86/kvm_para.h   |   1 +
 include/standard-headers/drm/drm_fourcc.h |  11 +
 include/standard-headers/linux/ethtool.h  |   1 +
 include/standard-headers/linux/fuse.h |  60 ++-
 include/standard-headers/linux/pci_regs.h | 142 +++---
 include/standard-headers/linux/virtio_gpio.h  |  72 +++
 include/standard-headers/linux/virtio_i2c.h   |  47 ++
 include/standard-headers/linux/virtio_iommu.h |   8 +-
 .../standard-headers/linux/virtio_pcidev.h|  65 +++
 include/standard-headers/linux/virtio_scmi.h  |  24 +
 linux-headers/asm-generic/unistd.h|   5 +-
 linux-headers/asm-mips/unistd_n32.h   |   2 +
 linux-headers/asm-mips/unistd_n64.h   |   2 +
 linux-headers/asm-mips/unistd_o32.h   |   2 +
 linux-headers/asm-powerpc/unistd_32.h |   2 +
 linux-headers/asm-powerpc/unistd_64.h |   2 +
 linux-headers/asm-riscv/bitsperlong.h |  14 +
 linux-headers/asm-riscv/mman.h|   1 +
 linux-headers/asm-riscv/unistd.h  |  44 ++
 linux-headers/asm-s390/unistd_32.h|   2 +
 linux-headers/asm-s390/unistd_64.h|   2 +
 linux-headers/asm-x86/kvm.h   |  16 +-
 linux-headers/asm-x86/unistd_32.h |   1 +
 linux-headers/asm-x86/unistd_64.h |   1 +
 linux-headers/asm-x86/unistd_x32.h|   1 +
 linux-headers/linux/kvm.h |  17 +
 tools/virtiofsd/fuse_common.h |   9 +-
 tools/virtiofsd/fuse_i.h  |   7 +
 tools/virtiofsd/fuse_lowlevel.c   | 155 +--
 tools/virtiofsd/helper.c  |   1 +
 tools/virtiofsd/passthrough_ll.c  | 414 --
 32 files changed, 1006 insertions(+), 132 deletions(-)
 create mode 100644 include/standard-headers/linux/virtio_gpio.h
 create mode 100644 include/standard-headers/linux/virtio_i2c.h
 create mode 100644 include/standard-headers/linux/virtio_pcidev.h
 create mode 100644 include/standard-headers/linux/virtio_scmi.h
 create mode 100644 linux-headers/asm-riscv/bitsperlong.h
 create mode 100644 linux-headers/asm-riscv/mman.h
 create mode 100644 linux-headers/asm-riscv/unistd.h

-- 
2.31.1

[PATCH v4 6/9] virtiofsd: Move core file creation code in separate function

2022-01-24 Thread Vivek Goyal

Move the core file creation bits into a separate function. Soon this is going
to get more complex, as file creation also needs to set the security context,
and the next patch will add multiple modes of file creation.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 36 ++--
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 64b5b4fbb1..54978b7fae 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1976,6 +1976,30 @@ static int lo_do_open(struct lo_data *lo, struct lo_inode *inode,
 return 0;
 }
 
+static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi, int* open_fd)
+{
+int err = 0, fd;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+
+err = lo_change_cred(req, &old, lo->change_umask);
+if (err) {
+return err;
+}
+
+/* Try to create a new file but don't open existing files */
+fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
+if (fd == -1) {
+err = errno;
+} else {
+*open_fd = fd;
+}
+lo_restore_cred(&old, lo->change_umask);
+return err;
+}
+
 static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
   mode_t mode, struct fuse_file_info *fi)
 {
@@ -1985,7 +2009,6 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
 struct lo_inode *inode = NULL;
 struct fuse_entry_param e;
 int err;
-struct lo_cred old = {};
 
 fuse_log(FUSE_LOG_DEBUG, "lo_create(parent=%" PRIu64 ", name=%s)"
  " kill_priv=%d\n", parent, name, fi->kill_priv);
@@ -2001,18 +2024,9 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
 return;
 }
 
-err = lo_change_cred(req, &old, lo->change_umask);
-if (err) {
-goto out;
-}
-
 update_open_flags(lo->writeback, lo->allow_direct_io, fi);
 
-/* Try to create a new file but don't open existing files */
-fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
-err = fd == -1 ? errno : 0;
-
-lo_restore_cred(&old, lo->change_umask);
+err = do_lo_create(req, parent_inode, name, mode, fi, &fd);
 
 /* Ignore the error if file exists and O_EXCL was not given */
 if (err && (err != EEXIST || (fi->flags & O_EXCL))) {
-- 
2.31.1

[PATCH v4 4/9] virtiofsd: Extend size of fuse_conn_info->capable and ->want fields

2022-01-24 Thread Vivek Goyal

->capable keeps track of which capabilities the kernel supports and ->want
keeps track of which capabilities the filesystem wants.

Right now these fields are 32 bits in size. But fuse has now run out of
bits, and capabilities can have bit numbers higher than 31.

That means 32-bit fields are not sufficient anymore. Increase the size to
64 bits so that we can add newer capabilities and still use the existing
code to check and set the capabilities.
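
A small sketch of why the field width matters: a capability at bit 32 is
silently lost when stored in a 32-bit field, while a 64-bit field keeps it.
The demo_* helpers are hypothetical names used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Store cap in a 32-bit field, then test whether the bit is still set. */
static bool demo_cap_survives_32(uint64_t cap)
{
    uint32_t field = (uint32_t)cap; /* bits above 31 are truncated here */

    return (field & cap) != 0;
}

/* Same check with a 64-bit field: no truncation. */
static bool demo_cap_survives_64(uint64_t cap)
{
    uint64_t field = cap;

    return (field & cap) != 0;
}
```

For example, demo_cap_survives_32(1ULL << 32) is false while
demo_cap_survives_64(1ULL << 32) is true; a bit such as 1ULL << 29 survives in
both. When printing such a 64-bit value, the PRIx64 macro from <inttypes.h> is
the portable format specifier.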

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   | 4 ++--
 tools/virtiofsd/fuse_lowlevel.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 0c2665b977..6f8a988202 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -439,7 +439,7 @@ struct fuse_conn_info {
 /**
  * Capability flags that the kernel supports (read-only)
  */
-unsigned capable;
+uint64_t capable;
 
 /**
  * Capability flags that the filesystem wants to enable.
@@ -447,7 +447,7 @@ struct fuse_conn_info {
  * libfuse attempts to initialize this field with
  * reasonable default values before calling the init() handler.
  */
-unsigned want;
+uint64_t want;
 
 /**
  * Maximum number of pending "background" requests. A
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index c3af5ede08..f3f5e70be6 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2063,7 +2063,7 @@ static void do_init(fuse_req_t req, fuse_ino_t nodeid,
 if (se->conn.want & (~se->conn.capable)) {
 fuse_log(FUSE_LOG_ERR,
  "fuse: error: filesystem requested capabilities "
- "0x%x that are not supported by kernel, aborting.\n",
+ "0x%lx that are not supported by kernel, aborting.\n",
  se->conn.want & (~se->conn.capable));
 fuse_reply_err(req, EPROTO);
 se->error = -EPROTO;
-- 
2.31.1

[PATCH v4 8/9] virtiofsd: Create new file using O_TMPFILE and set security context

2022-01-24 Thread Vivek Goyal

If guest and host policies can't work with each other, then the guest security
context (selinux label) needs to be set into an xattr, say by remapping the
guest security.selinux xattr to trusted.virtiofs.security.selinux.

That means setting "fscreate" is not going to help, as that's only useful
for the security.selinux xattr on the host.

So we need another method which is atomic. Use O_TMPFILE to create the new
file, set the xattr and then linkat() it to its proper place.

But this works only for regular files, so directories and symlinks will
continue to be non-atomic.

Also, if the host filesystem does not support O_TMPFILE, we fall back to
non-atomic behavior.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 80 
 1 file changed, 72 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 7a714b1b5e..4505c0c363 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2128,14 +2128,29 @@ static int lo_do_open(struct lo_data *lo, struct lo_inode *inode,
 
 static int do_create_nosecctx(fuse_req_t req, struct lo_inode *parent_inode,
const char *name, mode_t mode,
-   struct fuse_file_info *fi, int *open_fd)
+   struct fuse_file_info *fi, int *open_fd,
+  bool tmpfile)
 {
 int err, fd;
 struct lo_cred old = {};
 struct lo_data *lo = lo_data(req);
 int flags;
 
-flags = fi->flags | O_CREAT | O_EXCL;
+if (tmpfile) {
+flags = fi->flags | O_TMPFILE;
+/*
+ * Don't use O_EXCL as we want to link file later. Also reset O_CREAT
+ * otherwise openat() returns -EINVAL.
+ */
+flags &= ~(O_CREAT | O_EXCL);
+
+/* O_TMPFILE needs either O_RDWR or O_WRONLY */
+if ((flags & O_ACCMODE) == O_RDONLY) {
+flags |= O_RDWR;
+}
+} else {
+flags = fi->flags | O_CREAT | O_EXCL;
+}
 
 err = lo_change_cred(req, &old, lo->change_umask);
 if (err) {
@@ -2166,7 +2181,7 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 
 close_reset_proc_fscreate(fscreate_fd);
 if (!err) {
@@ -2175,6 +2190,44 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
+static int do_create_secctx_tmpfile(fuse_req_t req,
+struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi,
+const char *secctx_name, int *open_fd)
+{
+int err, fd = -1;
+struct lo_data *lo = lo_data(req);
+char procname[64];
+
+err = do_create_nosecctx(req, parent_inode, ".", mode, fi, &fd, true);
+if (err) {
+return err;
+}
+
+err = fsetxattr(fd, secctx_name, req->secctx.ctx, req->secctx.ctxlen, 0);
+if (err) {
+err = errno;
+goto out;
+}
+
+/* Security context set on file. Link it in place */
+sprintf(procname, "%d", fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+err = linkat(AT_FDCWD, procname, parent_inode->fd, name,
+ AT_SYMLINK_FOLLOW);
+err = err == -1 ? errno : 0;
+FCHDIR_NOFAIL(lo->root.fd);
+
+out:
+if (!err) {
+*open_fd = fd;
+} else if (fd != -1) {
+close(fd);
+}
+return err;
+}
+
 static int do_create_secctx_noatomic(fuse_req_t req,
  struct lo_inode *parent_inode,
  const char *name, mode_t mode,
@@ -2183,7 +2236,7 @@ static int do_create_secctx_noatomic(fuse_req_t req,
 {
 int err = 0, fd = -1;
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 if (err) {
 goto out;
 }
@@ -2225,20 +2278,31 @@ static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
 if (secctx_enabled) {
 /*
  * If security.selinux has not been remapped and selinux is enabled,
- * use fscreate to set context before file creation.
- * Otherwise fallback to non-atomic method of file creation
- * and xattr settting.
+ * use fscreate to set context before file creation. If not, use
+ * tmpfile method for regular files. Otherwise fall back to
+ * non-atomic method of file creation and xattr setting.
  */
 if (!mapped_name && lo->use_fscreate) {
 err = do_create_secctx_fscreate(req, parent_inode, name, mode, fi,
 open_fd);
 goto out;
+} else if (S_ISREG(

[PATCH v4 9/9] virtiofsd: Add an option to enable/disable security label

2022-01-24 Thread Vivek Goyal

Provide an option "-o security_label/no_security_label" to enable/disable
security label functionality. By default it is turned off.

If enabled, the server will indicate to the client that it is capable of
handling one security label during file creation. Typically this is expected
to be a SELinux label. The file server will set this label on the file. It
will try to set it atomically wherever possible, but that is not possible in
all cases.
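
The option handling can be pictured with the following sketch of a tri-state
setting (-1 = not given, 0 = disabled, 1 = enabled) feeding a capability mask.
The demo_* names and the DEMO_CAP_SECURITY_CTX value are assumptions for
illustration; this is not the code in lo_init().

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CAP_SECURITY_CTX (1ULL << 32) /* illustrative bit position */

/*
 * Apply the user's choice to the "want" mask.
 * Returns false when the feature was requested but the peer lacks it,
 * so the caller can log an error before negotiation is rejected.
 */
static bool demo_apply_security_label(int user_opt, uint64_t capable,
                                      uint64_t *want)
{
    if (user_opt == 1) {
        *want |= DEMO_CAP_SECURITY_CTX;
        return (capable & DEMO_CAP_SECURITY_CTX) != 0;
    }
    /* Not given (-1) or explicitly disabled (0): keep the feature off. */
    *want &= ~DEMO_CAP_SECURITY_CTX;
    return true;
}

/* Negotiation succeeds only if nothing unsupported is wanted. */
static bool demo_want_is_supported(uint64_t want, uint64_t capable)
{
    return (want & ~capable) == 0;
}
```

Leaving the bit set in "want" even when it is unsupported lets the later
want-versus-capable check fail loudly instead of silently dropping a feature
the user explicitly asked for.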

Signed-off-by: Vivek Goyal 
---
 docs/tools/virtiofsd.rst |  7 +++
 tools/virtiofsd/helper.c |  1 +
 tools/virtiofsd/passthrough_ll.c | 15 +++
 3 files changed, 23 insertions(+)

diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
index 07ac0be551..a2c005f4a0 100644
--- a/docs/tools/virtiofsd.rst
+++ b/docs/tools/virtiofsd.rst
@@ -104,6 +104,13 @@ Options
   * posix_acl|no_posix_acl -
 Enable/disable posix acl support.  Posix ACLs are disabled by default.
 
+  * security_label|no_security_label -
+Enable/disable security label support. Security labels are disabled by
+default. This will allow client to send a MAC label of file during
+file creation. Typically this is expected to be SELinux security
+label. Server will try to set that label on newly created file
+atomically wherever possible.
+
 .. option:: --socket-path=PATH
 
   Listen on vhost-user UNIX domain socket at PATH.
diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index a8295d975a..e226fc590f 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -187,6 +187,7 @@ void fuse_cmdline_help(void)
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the guest\n"
"-o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: disabled)\n"
+   "-o security_label/no_security_label  Enable/Disable security label. (default: disabled)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4505c0c363..4334885619 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -180,6 +180,7 @@ struct lo_data {
 int user_posix_acl, posix_acl;
 /* Keeps track if /proc//attr/fscreate should be used or not */
 bool use_fscreate;
+int user_security_label;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -214,6 +215,8 @@ static const struct fuse_opt lo_opts[] = {
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
 { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
 { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
+{ "security_label", offsetof(struct lo_data, user_security_label), 1 },
+{ "no_security_label", offsetof(struct lo_data, user_security_label), 0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -770,6 +773,17 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn)
 fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling posix_acl\n");
 conn->want &= ~FUSE_CAP_POSIX_ACL;
 }
+
+if (lo->user_security_label == 1) {
+if (!(conn->capable & FUSE_CAP_SECURITY_CTX)) {
+fuse_log(FUSE_LOG_ERR, "lo_init: Can not enable security label."
+ " kernel does not support FUSE_SECURITY_CTX capability.\n");
+}
+conn->want |= FUSE_CAP_SECURITY_CTX;
+} else {
+fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling security label\n");
+conn->want &= ~FUSE_CAP_SECURITY_CTX;
+}
 }
 
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
@@ -4254,6 +4268,7 @@ int main(int argc, char *argv[])
 .proc_self_task = -1,
 .user_killpriv_v2 = -1,
 .user_posix_acl = -1,
+.user_security_label = -1,
 };
 struct lo_map_elem *root_elem;
 struct lo_map_elem *reserve_elem;
-- 
2.31.1

Re: [Virtio-fs] [PATCH v2] virtiofsd: Do not support blocking flock

2022-01-14 Thread Vivek Goyal

On Thu, Jan 13, 2022 at 04:32:49PM +0100, Sebastian Hasler wrote:
> With the current implementation, blocking flock can lead to
> deadlock. Thus, it's better to return EOPNOTSUPP if a user attempts
> to perform a blocking flock request.
> 
> Signed-off-by: Sebastian Hasler 

Reviewed-by: Vivek Goyal 

Thanks Sebastian. Good fix. I can easily reproduce the deadlock.

shell1> flock foo.txt -c "sleep 10"
shell2> flock foo.txt -c echo

The first command takes a flock on foo.txt. The second command blocks on the
lock, which means the only virtiofsd thread serving the virt messages blocks
in flock(). Now the first command never exits. I think it will try to release
the lock once the sleep is over, and that will deadlock: the virtiofsd thread
is blocked and will never wake up, because the lock release operation can
never make progress.

This will be a little painful for people as they will start seeing
errors. But I guess erroring out early is better than a potential
deadlock later.
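
For reference, a minimal sketch of what a non-blocking flock looks like from
userspace. With LOCK_NB, a contended lock fails immediately with EWOULDBLOCK
instead of putting the calling thread to sleep. The demo_* names and the lock
file path used by a caller are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Try to take a lock without blocking. Returns 0 on success, else errno. */
static int demo_try_flock(int fd, int op)
{
    if (flock(fd, op | LOCK_NB) == -1) {
        return errno;
    }
    return 0;
}

/*
 * Open the same file twice (two open file descriptions), lock it through
 * the first descriptor, then attempt the lock through the second one.
 * Returns the result of the second attempt, or -1 if setup failed.
 */
static int demo_second_locker_result(const char *path)
{
    int rc = -1;
    int fd1 = open(path, O_RDWR | O_CREAT, 0600);
    int fd2 = open(path, O_RDWR);

    if (fd1 != -1 && fd2 != -1 && demo_try_flock(fd1, LOCK_EX) == 0) {
        rc = demo_try_flock(fd2, LOCK_EX);
    }
    if (fd1 != -1) {
        close(fd1);
    }
    if (fd2 != -1) {
        close(fd2);
    }
    unlink(path);
    return rc;
}
```

The second attempt reports EWOULDBLOCK right away, so a single-threaded server
issuing only such calls cannot get stuck waiting for a lock release that it
would itself have to process.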

Vivek

> ---
>  tools/virtiofsd/passthrough_ll.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 64b5b4fbb1..faa62278c5 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -2442,6 +2442,15 @@ static void lo_flock(fuse_req_t req, fuse_ino_t ino, 
> struct fuse_file_info *fi,
>  int res;
>  (void)ino;
>  
> +if (!(op & LOCK_NB)) {
> +/*
> + * Blocking flock can deadlock as there is only one thread
> + * serving the queue.
> + */
> +fuse_reply_err(req, EOPNOTSUPP);
> +return;
> +}
> +
>  res = flock(lo_fi_fd(req, fi), op);
>  
>  fuse_reply_err(req, res == -1 ? errno : 0);
> -- 
> 2.33.1
> 
> ___
> Virtio-fs mailing list
> virtio...@redhat.com
> https://listman.redhat.com/mailman/listinfo/virtio-fs
>

[PATCH v3 2/6] virtiofsd, fuse_lowlevel.c: Add capability to parse security context

2021-11-10 Thread Vivek Goyal

Add capability to enable and parse security context as sent by client
and put into fuse_req. Filesystems now can get security context from
request and set it on files during creation.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   |  5 ++
 tools/virtiofsd/fuse_i.h|  7 +++
 tools/virtiofsd/fuse_lowlevel.c | 91 +
 3 files changed, 103 insertions(+)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 0c2665b977..6f3485d1dc 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -377,6 +377,11 @@ struct fuse_file_info {
  */
 #define FUSE_CAP_SETXATTR_EXT (1 << 29)
 
+/**
+ * Indicates that file server supports creating file security context
+ */
+#define FUSE_CAP_SECURITY_CTX (1 << 30)
+
 /**
  * Ioctl flags
  *
diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 492e002181..a5572fa4ae 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -15,6 +15,12 @@
 struct fv_VuDev;
 struct fv_QueueInfo;
 
+struct fuse_security_context {
+const char *name;
+uint32_t ctxlen;
+const void *ctx;
+};
+
 struct fuse_req {
 struct fuse_session *se;
 uint64_t unique;
@@ -35,6 +41,7 @@ struct fuse_req {
 } u;
 struct fuse_req *next;
 struct fuse_req *prev;
+struct fuse_security_context secctx;
 };
 
 struct fuse_notify_req {
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index e4679c73ab..3ef1aff9e0 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -886,11 +886,59 @@ static void do_readlink(fuse_req_t req, fuse_ino_t nodeid,
 }
 }
 
+static int parse_secctx_fill_req(fuse_req_t req, struct fuse_mbuf_iter *iter)
+{
+struct fuse_secctx_header *fsecctx_header;
+struct fuse_secctx *fsecctx;
+const void *secctx;
+const char *name;
+
+fsecctx_header = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx_header));
+if (!fsecctx_header) {
+return -EINVAL;
+}
+
+/*
+ * As of now maximum of one security context is supported. It can
+ * change in future though.
+ */
+if (fsecctx_header->nr_secctx > 1) {
+return -EINVAL;
+}
+
+/* No security context sent. Maybe no LSM supports it */
+if (!fsecctx_header->nr_secctx) {
+return 0;
+}
+
+fsecctx = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx));
+if (!fsecctx) {
+return -EINVAL;
+}
+
+name = fuse_mbuf_iter_advance_str(iter);
+if (!name) {
+return -EINVAL;
+}
+
+secctx = fuse_mbuf_iter_advance(iter, fsecctx->size);
+if (!secctx) {
+return -EINVAL;
+}
+
+req->secctx.name = name;
+req->secctx.ctx = secctx;
+req->secctx.ctxlen = fsecctx->size;
+return 0;
+}
+
 static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
  struct fuse_mbuf_iter *iter)
 {
 struct fuse_mknod_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -901,6 +949,13 @@ static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.mknod) {
 req->se->op.mknod(req, nodeid, name, arg->mode, arg->rdev);
 } else {
@@ -913,6 +968,8 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 {
 struct fuse_mkdir_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -923,6 +980,13 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.mkdir) {
 req->se->op.mkdir(req, nodeid, name, arg->mode);
 } else {
@@ -969,12 +1033,21 @@ static void do_symlink(fuse_req_t req, fuse_ino_t nodeid,
 {
 const char *name = fuse_mbuf_iter_advance_str(iter);
 const char *linkname = fuse_mbuf_iter_advance_str(iter);
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 if (!name || !linkname) {
 fuse_reply_err(req, EINVAL);
 return;
 }
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.symlink) {
 req->se->op.symlink(req,

[PATCH v3 6/6] virtiofsd: Add an option to enable/disable security label

2021-11-10 Thread Vivek Goyal

Provide an option "-o security_label/no_security_label" to enable/disable
security label functionality. By default it is turned off.

If enabled, the server will indicate to the client that it is capable of
handling one security label during file creation. Typically this is expected
to be a SELinux label. The file server will set this label on the file. It
will try to set it atomically wherever possible, but that is not possible in
all cases.

Signed-off-by: Vivek Goyal 
---
 docs/tools/virtiofsd.rst |  7 +++
 tools/virtiofsd/helper.c |  1 +
 tools/virtiofsd/passthrough_ll.c | 15 +++
 3 files changed, 23 insertions(+)

diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
index cc31402830..54699b2013 100644
--- a/docs/tools/virtiofsd.rst
+++ b/docs/tools/virtiofsd.rst
@@ -104,6 +104,13 @@ Options
   * posix_acl|no_posix_acl -
 Enable/disable posix acl support.  Posix ACLs are disabled by default.
 
+  * security_label|no_security_label -
+Enable/disable security label support. Security labels are disabled by
+default. This will allow client to send a MAC label of file during
+file creation. Typically this is expected to be SELinux security
+label. Server will try to set that label on newly created file
+atomically wherever possible.
+
 .. option:: --socket-path=PATH
 
   Listen on vhost-user UNIX domain socket at PATH.
diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index a8295d975a..e226fc590f 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -187,6 +187,7 @@ void fuse_cmdline_help(void)
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the guest\n"
"-o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: disabled)\n"
+   "-o security_label/no_security_label  Enable/Disable security label. (default: disabled)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4505c0c363..4334885619 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -180,6 +180,7 @@ struct lo_data {
 int user_posix_acl, posix_acl;
 /* Keeps track if /proc//attr/fscreate should be used or not */
 bool use_fscreate;
+int user_security_label;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -214,6 +215,8 @@ static const struct fuse_opt lo_opts[] = {
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
 { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
 { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
+{ "security_label", offsetof(struct lo_data, user_security_label), 1 },
+{ "no_security_label", offsetof(struct lo_data, user_security_label), 0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -770,6 +773,17 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn)
 fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling posix_acl\n");
 conn->want &= ~FUSE_CAP_POSIX_ACL;
 }
+
+if (lo->user_security_label == 1) {
+if (!(conn->capable & FUSE_CAP_SECURITY_CTX)) {
+fuse_log(FUSE_LOG_ERR, "lo_init: Can not enable security label."
+ " kernel does not support FUSE_SECURITY_CTX capability.\n");
+}
+conn->want |= FUSE_CAP_SECURITY_CTX;
+} else {
+fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling security label\n");
+conn->want &= ~FUSE_CAP_SECURITY_CTX;
+}
 }
 
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
@@ -4254,6 +4268,7 @@ int main(int argc, char *argv[])
 .proc_self_task = -1,
 .user_killpriv_v2 = -1,
 .user_posix_acl = -1,
+.user_security_label = -1,
 };
 struct lo_map_elem *root_elem;
 struct lo_map_elem *reserve_elem;
-- 
2.31.1

[PATCH v3 0/6] virtiofsd: Add support for file security context at creation

2021-11-10 Thread Vivek Goyal

Hi,

This is V3 of the patches. I posted V2 here.

https://lore.kernel.org/qemu-devel/20211014153126.575173-1-vgo...@redhat.com/

Kernel patches are not upstream yet. So header files will need to be
updated once kernel patches are merged. I posted V3 of kernel patches
here.

https://lore.kernel.org/linux-fsdevel/2020225528.48601-1-vgo...@redhat.com/T/#m08352d3d46f948c6c507c28f9db83098d175ca54

Changes since v2:

- Renamed "struct fuse_secctxs" to "struct fuse_secctx_header".
- Added a size field to fuse_secctx_header.
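
For illustration only (this is not code from the series): a minimal sketch of
how a server might walk the buffer described by these two structs. It assumes
the layout is a fuse_secctx_header, then one fuse_secctx, then a
NUL-terminated xattr name, then fuse_secctx.size bytes of context. That
layout, struct parsed_secctx and parse_one_secctx() are assumptions made for
the example, and only a single context is handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same fields as the structs added to fuse.h in patch 1/6. */
struct fuse_secctx {
    uint32_t size;
    uint32_t padding;
};

struct fuse_secctx_header {
    uint32_t size;
    uint32_t nr_secctx;
};

/* Hypothetical result type, used by this example only. */
struct parsed_secctx {
    const char *name;   /* xattr name, e.g. "security.selinux" */
    const void *ctx;    /* context bytes, points into the input buffer */
    uint32_t ctxlen;
};

/*
 * Returns 0 on success (out->name stays NULL if no context was sent),
 * -1 if the buffer is malformed under the assumed layout.
 */
static int parse_one_secctx(const void *buf, size_t len,
                            struct parsed_secctx *out)
{
    const unsigned char *p = buf;
    const unsigned char *name, *nul;
    struct fuse_secctx_header hdr;
    struct fuse_secctx ent;
    size_t off;

    assert(buf != NULL && out != NULL);
    memset(out, 0, sizeof(*out));

    if (len < sizeof(hdr)) {
        return -1;
    }
    memcpy(&hdr, p, sizeof(hdr));
    off = sizeof(hdr);

    /* The advertised total size must fit in what we actually received. */
    if (hdr.size > len) {
        return -1;
    }
    if (hdr.nr_secctx == 0) {
        return 0;               /* nothing sent, e.g. no LSM in the guest */
    }
    if (hdr.nr_secctx > 1) {
        return -1;              /* this sketch handles one context only */
    }

    if (len - off < sizeof(ent)) {
        return -1;
    }
    memcpy(&ent, p + off, sizeof(ent));
    off += sizeof(ent);
    if (ent.size == 0) {
        return -1;
    }

    /* The name must be NUL-terminated inside the buffer. */
    name = p + off;
    nul = memchr(name, '\0', len - off);
    if (nul == NULL) {
        return -1;
    }
    off = (size_t)(nul - p) + 1;

    if (len - off < ent.size) {
        return -1;
    }
    out->name = (const char *)name;
    out->ctx = p + off;
    out->ctxlen = ent.size;
    return 0;
}
```

Copying the fixed-size structs out with memcpy() avoids alignment
assumptions about the request buffer; every variable-length step is bounds
checked before the result is filled in.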

Thanks
Vivek

Vivek Goyal (6):
  fuse: Header file changes for FUSE_SECURITY_CTX
  virtiofsd, fuse_lowlevel.c: Add capability to parse security context
  virtiofsd: Move core file creation code in separate function
  virtiofsd: Create new file with fscreate set
  virtiofsd: Create new file using O_TMPFILE and set security context
  virtiofsd: Add an option to enable/disable security label

 docs/tools/virtiofsd.rst  |   7 +
 include/standard-headers/linux/fuse.h |  19 +-
 tools/virtiofsd/fuse_common.h |   5 +
 tools/virtiofsd/fuse_i.h  |   7 +
 tools/virtiofsd/fuse_lowlevel.c   |  91 ++
 tools/virtiofsd/helper.c  |   1 +
 tools/virtiofsd/passthrough_ll.c  | 414 --
 7 files changed, 514 insertions(+), 30 deletions(-)

-- 
2.31.1

[PATCH v3 1/6] fuse: Header file changes for FUSE_SECURITY_CTX

2021-11-10 Thread Vivek Goyal

These are just header file changes which should show up in qemu if
corresponding kernel changes get merged.

Signed-off-by: Vivek Goyal 
---
 include/standard-headers/linux/fuse.h | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/standard-headers/linux/fuse.h 
b/include/standard-headers/linux/fuse.h
index cce105bfba..f412ff7f50 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -181,6 +181,10 @@
  *  - add FUSE_OPEN_KILL_SUIDGID
  *  - extend fuse_setxattr_in, add FUSE_SETXATTR_EXT
  *  - add FUSE_SETXATTR_ACL_KILL_SGID
+ *
+ *  7.35
+ *  - add FUSE_SECURITY_CTX flag for fuse_init_out
+ *  - add security context to create, mkdir, symlink, and mknod requests
  */
 
 #ifndef _LINUX_FUSE_H
@@ -212,7 +216,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 33
+#define FUSE_KERNEL_MINOR_VERSION 35
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -329,6 +333,8 @@ struct fuse_file_lock {
  * write/truncate sgid is killed only if file has group
  * execute permission. (Same as Linux VFS behavior).
  * FUSE_SETXATTR_EXT:  Server supports extended struct fuse_setxattr_in
+ * FUSE_SECURITY_CTX:  add security context to create, mkdir, symlink, and
+ * mknod
  */
 #define FUSE_ASYNC_READ(1 << 0)
 #define FUSE_POSIX_LOCKS   (1 << 1)
@@ -360,6 +366,7 @@ struct fuse_file_lock {
 #define FUSE_SUBMOUNTS (1 << 27)
 #define FUSE_HANDLE_KILLPRIV_V2(1 << 28)
 #define FUSE_SETXATTR_EXT  (1 << 29)
+#define FUSE_SECURITY_CTX  (1 << 30)
 
 /**
  * CUSE INIT request/reply flags
@@ -967,4 +974,14 @@ struct fuse_removemapping_one {
 #define FUSE_REMOVEMAPPING_MAX_ENTRY   \
(PAGE_SIZE / sizeof(struct fuse_removemapping_one))
 
+struct fuse_secctx {
+   uint32_tsize;
+   uint32_tpadding;
+};
+
+struct fuse_secctx_header {
+   uint32_tsize;
+   uint32_tnr_secctx;
+};
+
 #endif /* _LINUX_FUSE_H */
-- 
2.31.1

[PATCH v3 5/6] virtiofsd: Create new file using O_TMPFILE and set security context

2021-11-10 Thread Vivek Goyal

If guest and host policies can't work with each other, then guest security
context (selinux label) needs to be set into an xattr. Say remap guest
security.selinux xattr to trusted.virtiofs.security.selinux.

That means setting "fscreate" is not going to help as that's only useful
for security.selinux xattr on host.

So we need another method which is atomic. Use O_TMPFILE to create new
file, set xattr and then linkat() to proper place.

But this works only for regular files. So dir, symlinks will continue
to be non-atomic.

Also if host filesystem does not support O_TMPFILE, we fall back to
non-atomic behavior.
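
As an illustration of the sequence described above (a standalone sketch, not
the code from this patch): create an unnamed file with O_TMPFILE, set the
xattr on it, and only then give it a name with linkat(). The function name
create_tmpfile_with_xattr() and its parameters are made up for the example,
and it goes through /proc/self/fd directly instead of virtiofsd's
proc_self_fd.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Returns an open fd for the new file, or -errno. On any failure no
 * directory entry called "name" has been created by this function.
 */
static int create_tmpfile_with_xattr(int dirfd, const char *name, mode_t mode,
                                     const char *xattr_name,
                                     const void *value, size_t len)
{
    char procpath[64];
    int fd, err;

    assert(name != NULL && xattr_name != NULL);

    /* Unnamed inode in the target directory; needs O_WRONLY or O_RDWR. */
    fd = openat(dirfd, ".", O_TMPFILE | O_RDWR, mode);
    if (fd == -1) {
        return -errno;          /* e.g. EOPNOTSUPP: caller may fall back */
    }

    /* Label the file while nobody else can see it yet. */
    if (fsetxattr(fd, xattr_name, value, len, 0) == -1) {
        err = errno;
        goto fail;
    }

    /* Make it visible under its final name. Fails with EEXIST if taken. */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, procpath, dirfd, name, AT_SYMLINK_FOLLOW) == -1) {
        err = errno;
        goto fail;
    }
    return fd;

fail:
    close(fd);                  /* the unnamed inode disappears with the fd */
    return -err;
}
```

Because the name only appears after the xattr is in place, other processes
either see no file or a fully labeled one. A negative return leaves it to the
caller to decide whether to retry with the non-atomic path.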

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 80 
 1 file changed, 72 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 7a714b1b5e..4505c0c363 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2128,14 +2128,29 @@ static int lo_do_open(struct lo_data *lo, struct lo_inode *inode,
 
 static int do_create_nosecctx(fuse_req_t req, struct lo_inode *parent_inode,
const char *name, mode_t mode,
-   struct fuse_file_info *fi, int *open_fd)
+   struct fuse_file_info *fi, int *open_fd,
+  bool tmpfile)
 {
 int err, fd;
 struct lo_cred old = {};
 struct lo_data *lo = lo_data(req);
 int flags;
 
-flags = fi->flags | O_CREAT | O_EXCL;
+if (tmpfile) {
+flags = fi->flags | O_TMPFILE;
+/*
+ * Don't use O_EXCL as we want to link file later. Also reset O_CREAT
+ * otherwise openat() returns -EINVAL.
+ */
+flags &= ~(O_CREAT | O_EXCL);
+
+/* O_TMPFILE needs either O_RDWR or O_WRONLY */
+if ((flags & O_ACCMODE) == O_RDONLY) {
+flags |= O_RDWR;
+}
+} else {
+flags = fi->flags | O_CREAT | O_EXCL;
+}
 
 err = lo_change_cred(req, &old, lo->change_umask);
 if (err) {
@@ -2166,7 +2181,7 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 
 close_reset_proc_fscreate(fscreate_fd);
 if (!err) {
@@ -2175,6 +2190,44 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
+static int do_create_secctx_tmpfile(fuse_req_t req,
+struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi,
+const char *secctx_name, int *open_fd)
+{
+int err, fd = -1;
+struct lo_data *lo = lo_data(req);
+char procname[64];
+
+err = do_create_nosecctx(req, parent_inode, ".", mode, fi, &fd, true);
+if (err) {
+return err;
+}
+
+err = fsetxattr(fd, secctx_name, req->secctx.ctx, req->secctx.ctxlen, 0);
+if (err) {
+err = errno;
+goto out;
+}
+
+/* Security context set on file. Link it in place */
+sprintf(procname, "%d", fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+err = linkat(AT_FDCWD, procname, parent_inode->fd, name,
+ AT_SYMLINK_FOLLOW);
+err = err == -1 ? errno : 0;
+FCHDIR_NOFAIL(lo->root.fd);
+
+out:
+if (!err) {
+*open_fd = fd;
+} else if (fd != -1) {
+close(fd);
+}
+return err;
+}
+
 static int do_create_secctx_noatomic(fuse_req_t req,
  struct lo_inode *parent_inode,
  const char *name, mode_t mode,
@@ -2183,7 +2236,7 @@ static int do_create_secctx_noatomic(fuse_req_t req,
 {
 int err = 0, fd = -1;
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 if (err) {
 goto out;
 }
@@ -2225,20 +2278,31 @@ static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
 if (secctx_enabled) {
 /*
  * If security.selinux has not been remapped and selinux is enabled,
- * use fscreate to set context before file creation.
- * Otherwise fallback to non-atomic method of file creation
- * and xattr settting.
+ * use fscreate to set context before file creation. If not, use
+ * tmpfile method for regular files. Otherwise fallback to
+ * non-atomic method of file creation and xattr settting.
  */
 if (!mapped_name && lo->use_fscreate) {
 err = do_create_secctx_fscreate(req, parent_inode, name, mode, fi,
 open_fd);
 goto out;
+} else if (S_ISREG(

[PATCH v3 3/6] virtiofsd: Move core file creation code in separate function

2021-11-10 Thread Vivek Goyal

Move the core file creation bits into a separate function. Soon this is going
to get more complex, as file creation needs to set the security context as
well. And there will be multiple modes of file creation in the next patch.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 36 ++--
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 64b5b4fbb1..54978b7fae 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1976,6 +1976,30 @@ static int lo_do_open(struct lo_data *lo, struct lo_inode *inode,
 return 0;
 }
 
+static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi, int* open_fd)
+{
+int err = 0, fd;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+
+err = lo_change_cred(req, &old, lo->change_umask);
+if (err) {
+return err;
+}
+
+/* Try to create a new file but don't open existing files */
+fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
+if (fd == -1) {
+err = errno;
+} else {
+*open_fd = fd;
+}
+lo_restore_cred(&old, lo->change_umask);
+return err;
+}
+
 static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
   mode_t mode, struct fuse_file_info *fi)
 {
@@ -1985,7 +2009,6 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
 struct lo_inode *inode = NULL;
 struct fuse_entry_param e;
 int err;
-struct lo_cred old = {};
 
 fuse_log(FUSE_LOG_DEBUG, "lo_create(parent=%" PRIu64 ", name=%s)"
  " kill_priv=%d\n", parent, name, fi->kill_priv);
@@ -2001,18 +2024,9 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
 return;
 }
 
-err = lo_change_cred(req, &old, lo->change_umask);
-if (err) {
-goto out;
-}
-
 update_open_flags(lo->writeback, lo->allow_direct_io, fi);
 
-/* Try to create a new file but don't open existing files */
-fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
-err = fd == -1 ? errno : 0;
-
-lo_restore_cred(&old, lo->change_umask);
+err = do_lo_create(req, parent_inode, name, mode, fi, &fd);
 
 /* Ignore the error if file exists and O_EXCL was not given */
 if (err && (err != EEXIST || (fi->flags & O_EXCL))) {
-- 
2.31.1

[PATCH v3 4/6] virtiofsd: Create new file with fscreate set

2021-11-10 Thread Vivek Goyal

This patch adds support for setting /proc/thread-self/attr/fscreate before
file creation. It is set to the value sent by the client. This allows the
security context to be set atomically w.r.t. file creation.

This is primarily useful when either there is no SELinux enabled on the
host, or host and guest policies are in sync and don't conflict.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 317 ---
 1 file changed, 290 insertions(+), 27 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 54978b7fae..7a714b1b5e 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -172,10 +172,14 @@ struct lo_data {
 
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
+/* An O_PATH file descriptor to /proc/self/task/ */
+int proc_self_task;
 int user_killpriv_v2, killpriv_v2;
 /* If set, virtiofsd is responsible for setting umask during creation */
 bool change_umask;
 int user_posix_acl, posix_acl;
+/* Keeps track if /proc//attr/fscreate should be used or not */
+bool use_fscreate;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -229,6 +233,11 @@ static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
+#define FCHDIR_NOFAIL(fd) do { \
+int fchdir_res = fchdir(fd);   \
+assert(fchdir_res == 0);   \
+} while (0)
+
 static bool is_dot_or_dotdot(const char *name)
 {
 return name[0] == '.' &&
@@ -255,6 +264,33 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/*
+ * Tries to figure out if /proc//attr/fscreate is usable or not. With
+ * selinux=0, read from fscreate returns -EINVAL.
+ *
+ * TODO: Link with libselinux and use is_selinux_enabled() instead down
+ * the line. It probably will be more reliable indicator.
+ */
+static bool is_fscreate_usable(struct lo_data *lo)
+{
+char procname[64];
+int fscreate_fd;
+size_t bytes_read;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_RDWR);
+if (fscreate_fd == -1) {
+return false;
+}
+
+bytes_read = read(fscreate_fd, procname, 64);
+close(fscreate_fd);
+if (bytes_read == -1) {
+return false;
+}
+return true;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -1259,16 +1295,140 @@ static void lo_restore_cred_gain_cap(struct lo_cred *old, bool restore_umask,
 }
 }
 
+/* Helpers to set/reset fscreate */
+static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
+  size_t ctxlen, int *fd)
+{
+char procname[64];
+int fscreate_fd, err = 0;
+size_t written;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_WRONLY);
+err = fscreate_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+
+written = write(fscreate_fd, ctx, ctxlen);
+err = written == -1 ? errno : 0;
+if (err) {
+goto out;
+}
+
+*fd = fscreate_fd;
+return 0;
+out:
+close(fscreate_fd);
+return err;
+}
+
+static void close_reset_proc_fscreate(int fd)
+{
+if ((write(fd, NULL, 0)) == -1) {
+fuse_log(FUSE_LOG_WARNING, "Failed to reset fscreate. err=%d\n", errno);
+}
+close(fd);
+return;
+}
+
+static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
+   const char *name, const char *secctx_name)
+{
+int path_fd, err;
+char procname[64];
+struct lo_data *lo = lo_data(req);
+
+if (!req->secctx.ctxlen) {
+return 0;
+}
+
+/* Open newly created element with O_PATH */
+path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+err = path_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+sprintf(procname, "%i", path_fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+/* Set security context. This is not atomic w.r.t file creation */
+err = setxattr(procname, secctx_name, req->secctx.ctx, req->secctx.ctxlen,
+   0);
+if (err) {
+err = errno;
+}
+FCHDIR_NOFAIL(lo->root.fd);
+close(path_fd);
+return err;
+}
+
+static int do_mknod_symlink(fuse_req_t req, struct lo_inode *dir,
+const char *name, mode_t mode, dev_t rdev,
+const char *link)
+{
+int err, fscreate_fd = -1;
+const char *secctx_name = req->secctx.name;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+char *mapped_name =

Re: [for-6.1 v3 0/3] virtiofsd: Add support for FUSE_SYNCFS request

2021-11-10 Thread Vivek Goyal

Hi Greg,

I don't see FUSE_SYNCFS support in virtiofsd. I see that the kernel
patches got merged. Did you post another version of the patches?

It would be nice to add syncfs support in virtiofsd/virtiofsd_rs as well.

Thanks
Vivek

On Mon, May 10, 2021 at 05:55:36PM +0200, Greg Kurz wrote:
> FUSE_SYNCFS allows the client to flush the host page cache.
> This isn't available in upstream linux yet, but the following
> tree can be used to test:
> 
> https://gitlab.com/gkurz/linux/-/tree/virtio-fs-sync
> 
> v3: - track submounts and do per-submount syncfs() (Vivek)
> - based on new version of FUSE_SYNCFS (still not upstream)
>   https://listman.redhat.com/archives/virtio-fs/2021-May/msg00025.html
> 
> v2: - based on new version of FUSE_SYNCFS
>   https://listman.redhat.com/archives/virtio-fs/2021-April/msg00166.html
> - propagate syncfs() errors to client (Vivek)
> 
> Greg Kurz (3):
>   Update linux headers to 5.13-rc1 + FUSE_SYNCFS
>   virtiofsd: Track mounts
>   virtiofsd: Add support for FUSE_SYNCFS request
> 
>  .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.h   |  35 -
>  include/standard-headers/drm/drm_fourcc.h |  23 +-
>  include/standard-headers/linux/ethtool.h  | 109 ++-
>  include/standard-headers/linux/fuse.h |  27 +-
>  include/standard-headers/linux/input.h|   2 +-
>  include/standard-headers/linux/virtio_ids.h   |   2 +
>  .../standard-headers/rdma/vmw_pvrdma-abi.h|   7 +
>  linux-headers/asm-generic/unistd.h|  13 +-
>  linux-headers/asm-mips/unistd_n32.h   | 752 +++
>  linux-headers/asm-mips/unistd_n64.h   | 704 +++---
>  linux-headers/asm-mips/unistd_o32.h   | 844 -
>  linux-headers/asm-powerpc/kvm.h   |   2 +
>  linux-headers/asm-powerpc/unistd_32.h | 857 +-
>  linux-headers/asm-powerpc/unistd_64.h | 801 
>  linux-headers/asm-s390/unistd_32.h|   5 +
>  linux-headers/asm-s390/unistd_64.h|   5 +
>  linux-headers/asm-x86/kvm.h   |   1 +
>  linux-headers/asm-x86/unistd_32.h |   5 +
>  linux-headers/asm-x86/unistd_64.h |   5 +
>  linux-headers/asm-x86/unistd_x32.h|   5 +
>  linux-headers/linux/kvm.h | 134 +++
>  linux-headers/linux/userfaultfd.h |  36 +-
>  linux-headers/linux/vfio.h|  35 +
>  tools/virtiofsd/fuse_lowlevel.c   |  11 +
>  tools/virtiofsd/fuse_lowlevel.h   |  12 +
>  tools/virtiofsd/passthrough_ll.c  |  80 +-
>  tools/virtiofsd/passthrough_seccomp.c |   1 +
>  27 files changed, 2465 insertions(+), 2048 deletions(-)
> 
> -- 
> 2.26.3
> 
>

Re: [PATCH v4 11/12] virtiofsd: Optionally fill lo_inode.fhandle

2021-10-20 Thread Vivek Goyal

On Wed, Oct 20, 2021 at 12:00:07PM +0200, Hanna Reitz wrote:

[..]
> > > @@ -1302,13 +1512,26 @@ static int lo_do_lookup(fuse_req_t req, 
> > > fuse_ino_t parent, const char *name,
> > >   goto out;
> > >   }
> > > -newfd = openat(dir_path_fd.fd, name, O_PATH | O_NOFOLLOW);
> > > -if (newfd == -1) {
> > > -goto out_err;
> > > +fh = get_file_handle(lo, dir_path_fd.fd, name, &can_open_handle);
> > > +if (!fh || !can_open_handle) {
> > > +/*
> > > + * If we will not be able to open the file handle again
> > > + * (can_open_handle is false), open an FD that we can put into
> > > + * lo_inode (in case we need to create a new lo_inode).
> > > + */
> > > +newfd = openat(dir_path_fd.fd, name, O_PATH | O_NOFOLLOW);
> > > +if (newfd == -1) {
> > > +goto out_err;
> > > +}
> > >   }
> > > -res = do_statx(lo, newfd, "", >attr, AT_EMPTY_PATH | 
> > > AT_SYMLINK_NOFOLLOW,
> > > -   _id);
> > > +if (newfd >= 0) {
> > > +res = do_statx(lo, newfd, "", >attr,
> > > +   AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, _id);
> > > +} else {
> > > +res = do_statx(lo, dir_path_fd.fd, name, >attr,
> > > +   AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW, _id);
> > > +}
> > >   if (res == -1) {
> > >   goto out_err;
> > >   }
> > > @@ -1318,9 +1541,19 @@ static int lo_do_lookup(fuse_req_t req, fuse_ino_t 
> > > parent, const char *name,
> > >   e->attr_flags |= FUSE_ATTR_SUBMOUNT;
> > Can this FUSE_ATTR_SUBMOUNT check be racy w.r.t. file handles? I mean,
> > say we open the file handle, and before we call do_statx(), another
> > mount shows up on the directory in question. So the stats now belong
> > to the file in the new mount and we will think it is a SUBMOUNT. So
> > effectively we now have an fh belonging to the old file but stats
> > belonging to the new file in the new mount?
> 
> Yes.  Not just the submount, but the whole stat information, so also the
> file type that goes into the lo_inode.
> 
> I thought this wasn’t too bad, though now I don’t really know why. Perhaps
> it was just how I started the implementation and I never could get myself to
> care enough (not good, I know).  Thanks for making me care! :)
> 
> We could theoretically open an O_PATH FD from the file handle to get the
> stat information from it, but that wouldn’t work for un-openable file
> handles.
> 
> So I think the best is to open an O_PATH FD unconditionally first, and then
> generate the file handle from it.  Then we can stat the FD.

Yes, this sounds like a more reasonable approach. This O_PATH fd will be
temporary in nature, so it should not be a problem.
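
Something along these lines, as a rough standalone sketch (names such as
struct lookup_result and lookup_via_path_fd() are made up here, and it uses
fstatat() where virtiofsd would use its do_statx() helper): open the O_PATH
fd first, then derive both the stat data and the file handle from that same
fd, so they are guaranteed to describe the same inode.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical container for what a lookup needs. */
struct lookup_result {
    int path_fd;                /* O_PATH fd, caller closes it */
    struct file_handle *fh;     /* NULL if no handle could be generated */
    int mount_id;               /* valid only if fh != NULL */
    struct stat st;
};

/* Returns 0 on success or -errno; on failure nothing is left open. */
static int lookup_via_path_fd(int dirfd, const char *name,
                              struct lookup_result *res)
{
    struct file_handle *fh;
    int err;

    assert(res != NULL);

    /* Pin the inode first; everything below refers to this fd only. */
    res->path_fd = openat(dirfd, name, O_PATH | O_NOFOLLOW);
    if (res->path_fd == -1) {
        return -errno;
    }

    if (fstatat(res->path_fd, "", &res->st,
                AT_EMPTY_PATH | AT_SYMLINK_NOFOLLOW) == -1) {
        err = errno;
        close(res->path_fd);
        return -err;
    }

    /* Handle generation is best effort: fall back to keeping the fd. */
    fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (fh != NULL) {
        fh->handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(res->path_fd, "", fh, &res->mount_id,
                              AT_EMPTY_PATH) == -1) {
            free(fh);
            fh = NULL;
        }
    }
    res->fh = fh;
    return 0;
}
```

A mount appearing on top of the name after openat() no longer matters,
because neither the stat nor the handle is looked up by name again. If a
handle was obtained, the caller can close path_fd right away; otherwise it
keeps the fd in the inode as before.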

[..]
> > > + *
> > > + * Passing true for cap_dac_read_search adds CAP_DAC_READ_SEARCH to the
> > > + * allowlist.
> > >*/
> > > -static void setup_capabilities(char *modcaps_in)
> > > +static void setup_capabilities(char *modcaps_in, bool 
> > > cap_dac_read_search)
> > >   {
> > >   char *modcaps = modcaps_in;
> > >   pthread_mutex_lock();
> > > @@ -4012,6 +4266,17 @@ static void setup_capabilities(char *modcaps_in)
> > >   exit(1);
> > >   }
> > > +/*
> > > + * If we need CAP_DAC_READ_SEARCH (for file handles), add that, too.
> > > + */
> > > +if (cap_dac_read_search &&
> > > +capng_update(CAPNG_ADD, CAPNG_PERMITTED | CAPNG_EFFECTIVE,
> > > + CAP_DAC_READ_SEARCH)) {
> > > +fuse_log(FUSE_LOG_ERR, "%s: capng_update failed for "
> > > + "CAP_DAC_READ_SEARCH\n", __func__);
> > > +exit(1);
> > > +}
> > > +
> > >   /*
> > >* The modcaps option is a colon separated list of caps,
> > >* each preceded by either + or -.
> > > @@ -4158,7 +4423,7 @@ static void setup_sandbox(struct lo_data *lo, 
> > > struct fuse_session *se,
> > >   }
> > >   setup_seccomp(enable_syslog);
> > > -setup_capabilities(g_strdup(lo->modcaps));
> > > +setup_capabilities(g_strdup(lo->modcaps), lo->inode_file_handles);
> > >   }
> > >   /* Set the maximum number of open file descriptors */
> > > @@ -4498,6 +4763,14 @@ int main(int argc, char *argv[])
> > >   lo.use_statx = true;
> > > +#if !defined(CONFIG_STATX) || !defined(STATX_MNT_ID)
> > > +if (lo.inode_file_handles) {
> > > +fuse_log(FUSE_LOG_WARNING,
> > > + "No statx() or mount ID support: Will not be able to 
> > > use file "
> > > + "handles for inodes\n");
> > > +}
> > Again, I think we should error out if user asked for file handle support
> > explicitly and we can't enable it. But if we end up enabling by default,
> > it probably is fine to just log a message and not use it.
> > 
> > This begs the question what happens if filesystem does not support the
> > file handles. Ideally, I would think that we can error out. But for
> > submounts check will happen much later. For root mount at least we
> > should be

Re: [PATCH v4 01/12] virtiofsd: Keep /proc/self/mountinfo open

2021-10-20 Thread Vivek Goyal

On Wed, Oct 20, 2021 at 11:04:31AM +0200, Hanna Reitz wrote:
> On 18.10.21 19:07, Vivek Goyal wrote:
> > On Thu, Sep 16, 2021 at 10:40:34AM +0200, Hanna Reitz wrote:
> > > File handles are specific to mounts, and so name_to_handle_at() returns
> > > the respective mount ID.  However, open_by_handle_at() is not content
> > > with an ID, it wants a file descriptor for some inode on the mount,
> > > which we have to open.
> > > 
> > > We want to use /proc/self/mountinfo to find the mounts' root directories
> > > so we can open them and pass the respective FDs to open_by_handle_at().
> > > (We need to use the root directory, because we want the inode belonging
> > > to every mount FD be deletable.  Before the root directory can be
> > > deleted, all entries within must have been closed, and so when it is
> > > deleted, there should not be any file handles left that need its FD as
> > > their mount FD.  Thus, we can then close that FD and the inode can be
> > > deleted.[1])
> > > 
> > > That is why we need to open /proc/self/mountinfo so that we can use it
> > > to translate mount IDs into root directory paths.  We have to open it
> > > after setup_mounts() was called, because if we try to open it before, it
> > > will appear as an empty file after setup_mounts().
> > > 
> > > [1] Note that in practice, you still cannot delete the mount root
> > > directory.  It is a mount point on the host, after all, and mount points
> > > cannot be deleted.  But by using the mount point as the mount FD, we
> > > will at least not hog any actually deletable inodes.
> > > 
> > > Signed-off-by: Hanna Reitz 
> > > ---
> > >   tools/virtiofsd/passthrough_ll.c | 40 
> > >   1 file changed, 40 insertions(+)
> > > 
> > > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > > b/tools/virtiofsd/passthrough_ll.c
> > > index 38b2af8599..6511a6acb4 100644
> > > --- a/tools/virtiofsd/passthrough_ll.c
> > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > @@ -172,6 +172,8 @@ struct lo_data {
> > >   /* An O_PATH file descriptor to /proc/self/fd/ */
> > >   int proc_self_fd;
> > > +/* A read-only FILE pointer for /proc/self/mountinfo */
> > > +FILE *mountinfo_fp;
> > >   int user_killpriv_v2, killpriv_v2;
> > >   /* If set, virtiofsd is responsible for setting umask during 
> > > creation */
> > >   bool change_umask;
> > > @@ -3718,6 +3720,19 @@ static void setup_chroot(struct lo_data *lo)
> > >   static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
> > > bool enable_syslog)
> > >   {
> > > +int proc_self, mountinfo_fd;
> > > +int saverr;
> > > +
> > > +/*
> > > + * Open /proc/self before we pivot to the new root so we can still
> > > + * open /proc/self/mountinfo afterwards
> > > + */
> > > +proc_self = open("/proc/self", O_PATH);
> > > +if (proc_self < 0) {
> > > +fuse_log(FUSE_LOG_WARNING, "Failed to open /proc/self: %m; "
> > > + "will not be able to use file handles\n");
> > > +}
> > > +
> > Hi Hanna,
> > 
> > Should we open /proc/self and /proc/self/mountinfo only if the user wants
> > to use file handles? We have already parsed options by now, so we know.
> 
> I didn’t think it would matter given that it wouldn’t have an adverse
> effect.  If we can’t open them (and I can’t imagine a case where we’d be
> unable to open them), the only result is a warning.
> 
> > Also, if user asked for file handles, and we can't open /proc/self or
> > /proc/self/mountinfo successfully, I would think we should error out
> > and not continue (instead of just log it and continue).
> 
> Well, that would break the assumption I had above.  Not that that’s really
> relevant, I just want to mention it.
> 
> File handles are a best effort in any case.  If they don’t work, we always
> fall back.  So I don’t know whether we must error out.
> 
> OTOH if we know they can never work, then perhaps it would be more sensible
> to error out.

Yes. If they can't work because the filesystem does not have the capability,
or we don't have CAP_DAC_READ_SEARCH, or any other necessary component is
missing, then we should fail.
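
For example, a startup check could look roughly like the sketch below. This
is only an illustration under my own assumptions: probe_file_handles() is a
made-up helper, and it tests the shared directory itself rather than every
mount. It tries to generate a handle and reopen it, and reports the errno so
the caller can refuse to start when file handles were explicitly requested.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 0 if a file handle for dir_fd can be generated and reopened,
 * otherwise -errno (for instance -EOPNOTSUPP if the filesystem cannot
 * export handles, or -EPERM without CAP_DAC_READ_SEARCH).
 */
static int probe_file_handles(int dir_fd)
{
    struct file_handle *fh;
    int mount_id, fd, err = 0;

    fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    if (fh == NULL) {
        return -ENOMEM;
    }
    fh->handle_bytes = MAX_HANDLE_SZ;

    /* Step 1: can the filesystem produce a handle at all? */
    if (name_to_handle_at(dir_fd, "", fh, &mount_id, AT_EMPTY_PATH) == -1) {
        err = errno;
        goto out;
    }

    /* Step 2: are we allowed to turn the handle back into an fd? */
    fd = open_by_handle_at(dir_fd, fh, O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        err = errno;
        goto out;
    }
    close(fd);

out:
    free(fh);
    return -err;
}
```

With something like this, an explicit request for file handles can turn a
non-zero result into a fatal error, while a default-on setting could just log
it and carry on with O_PATH fds.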
> 
> FWIW I’ve ported the relevant v1..v4 changes to virtiofsd-rs, and there it
> errors out.  The error is unconditional, though, so even

Re: [PATCH v4 10/12] virtiofsd: Add inodes_by_handle hash table

2021-10-20 Thread Vivek Goyal

On Wed, Oct 20, 2021 at 04:10:51PM +0200, Hanna Reitz wrote:
> On 20.10.21 14:29, Vivek Goyal wrote:
> > On Wed, Oct 20, 2021 at 12:02:32PM +0200, Hanna Reitz wrote:
> > > On 19.10.21 22:02, Vivek Goyal wrote:
> > > > On Thu, Sep 16, 2021 at 10:40:43AM +0200, Hanna Reitz wrote:
> > > > > Currently, lo_inode.fhandle is always NULL and so always keep an 
> > > > > O_PATH
> > > > > FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
> > > > > its inode ID will remain in use until we drop our lo_inode (and
> > > > > lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely 
> > > > > use
> > > > > the inode ID as an lo_inode key, because any inode with an inode ID we
> > > > > find in lo_data.inodes (on the same filesystem) must be the exact same
> > > > > file.
> > > > > 
> > > > > This will change when we start setting lo_inode.fhandle so we do not
> > > > > have to keep an O_PATH FD open.  Then, unlinking such an inode will
> > > > > immediately remove it, so its ID can then be reused by newly created
> > > > > files, even while the lo_inode object is still there[1].
> > > > > 
> > > > > So creating a new file can then reuse the old file's inode ID, and
> > > > > looking up the new file would lead to us finding the old file's
> > > > > lo_inode, which is not ideal.
> > > > > 
> > > > > Luckily, just as file handles cause this problem, they also solve it: 
> > > > >  A
> > > > > file handle contains a generation ID, which changes when an inode ID 
> > > > > is
> > > > > reused, so the new file can be distinguished from the old one.  So all
> > > > > we need to do is to add a second map besides lo_data.inodes that maps
> > > > > file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
> > > > > clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.
> > > > > 
> > > > > Unfortunately, we cannot rely on being able to generate file handles
> > > > > every time.  Therefore, we still enter every lo_inode object into
> > > > > inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
> > > > > potential inodes_by_handle entry then has precedence, the 
> > > > > inodes_by_ids
> > > > > entry is just a fallback.
> > > > > 
> > > > > Note that we do not generate lo_fhandle objects yet, and so we also do
> > > > > not enter anything into the inodes_by_handle map yet.  Also, all 
> > > > > lookups
> > > > > skip that map.  We might manually create file handles with some code
> > > > > that is immediately removed by the next patch again, but that would
> > > > > break the assumption in lo_find() that every lo_inode with a non-NULL
> > > > > .fhandle must have an entry in inodes_by_handle and vice versa.  So we
> > > > > leave actually using the inodes_by_handle map for the next patch.
> > > > > 
> > > > > [1] If some application in the guest still has the file open, there is
> > > > > going to be a corresponding FD mapping in lo_data.fd_map.  In such a
> > > > > case, the inode will only go away once every application in the guest
> > > > > has closed it.  The problem described only applies to cases where the
> > > > > guest does not have the file open, and it is just in the dentry cache,
> > > > > basically.
> > > > > 
> > > > > Signed-off-by: Hanna Reitz 
> > > > > ---
> > > > >tools/virtiofsd/passthrough_ll.c | 81 
> > > > > +---
> > > > >1 file changed, 65 insertions(+), 16 deletions(-)
> > > > > 
> > > > > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > > > > b/tools/virtiofsd/passthrough_ll.c
> > > > > index bd8fc922ea..b7d6aa7f9d 100644
> > > > > --- a/tools/virtiofsd/passthrough_ll.c
> > > > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > > > @@ -186,7 +186,8 @@ struct lo_data {
> > > > >int announce_submounts;
> > > > >bool use_statx;
> > > > >struct lo_inode root;
> > > > > -GHashTable *inodes; /* protected by lo->mutex */
> > > > > +GHashTable *inodes_b

Re: [PATCH v4 10/12] virtiofsd: Add inodes_by_handle hash table

2021-10-20 Thread Vivek Goyal

On Wed, Oct 20, 2021 at 12:02:32PM +0200, Hanna Reitz wrote:
> On 19.10.21 22:02, Vivek Goyal wrote:
> > On Thu, Sep 16, 2021 at 10:40:43AM +0200, Hanna Reitz wrote:
> > > Currently, lo_inode.fhandle is always NULL and so always keep an O_PATH
> > > FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
> > > its inode ID will remain in use until we drop our lo_inode (and
> > > lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
> > > the inode ID as an lo_inode key, because any inode with an inode ID we
> > > find in lo_data.inodes (on the same filesystem) must be the exact same
> > > file.
> > > 
> > > This will change when we start setting lo_inode.fhandle so we do not
> > > have to keep an O_PATH FD open.  Then, unlinking such an inode will
> > > immediately remove it, so its ID can then be reused by newly created
> > > files, even while the lo_inode object is still there[1].
> > > 
> > > So creating a new file can then reuse the old file's inode ID, and
> > > looking up the new file would lead to us finding the old file's
> > > lo_inode, which is not ideal.
> > > 
> > > Luckily, just as file handles cause this problem, they also solve it:  A
> > > file handle contains a generation ID, which changes when an inode ID is
> > > reused, so the new file can be distinguished from the old one.  So all
> > > we need to do is to add a second map besides lo_data.inodes that maps
> > > file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
> > > clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.
> > > 
> > > Unfortunately, we cannot rely on being able to generate file handles
> > > every time.  Therefore, we still enter every lo_inode object into
> > > inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
> > > potential inodes_by_handle entry then has precedence, the inodes_by_ids
> > > entry is just a fallback.
> > > 
> > > Note that we do not generate lo_fhandle objects yet, and so we also do
> > > not enter anything into the inodes_by_handle map yet.  Also, all lookups
> > > skip that map.  We might manually create file handles with some code
> > > that is immediately removed by the next patch again, but that would
> > > break the assumption in lo_find() that every lo_inode with a non-NULL
> > > .fhandle must have an entry in inodes_by_handle and vice versa.  So we
> > > leave actually using the inodes_by_handle map for the next patch.
> > > 
> > > [1] If some application in the guest still has the file open, there is
> > > going to be a corresponding FD mapping in lo_data.fd_map.  In such a
> > > case, the inode will only go away once every application in the guest
> > > has closed it.  The problem described only applies to cases where the
> > > guest does not have the file open, and it is just in the dentry cache,
> > > basically.
> > > 
> > > Signed-off-by: Hanna Reitz 
> > > ---
> > >   tools/virtiofsd/passthrough_ll.c | 81 +---
> > >   1 file changed, 65 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > > b/tools/virtiofsd/passthrough_ll.c
> > > index bd8fc922ea..b7d6aa7f9d 100644
> > > --- a/tools/virtiofsd/passthrough_ll.c
> > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > @@ -186,7 +186,8 @@ struct lo_data {
> > >   int announce_submounts;
> > >   bool use_statx;
> > >   struct lo_inode root;
> > > -GHashTable *inodes; /* protected by lo->mutex */
> > > +GHashTable *inodes_by_ids; /* protected by lo->mutex */
> > > +GHashTable *inodes_by_handle; /* protected by lo->mutex */
> > >   struct lo_map ino_map; /* protected by lo->mutex */
> > >   struct lo_map dirp_map; /* protected by lo->mutex */
> > >   struct lo_map fd_map; /* protected by lo->mutex */
> > > @@ -275,8 +276,9 @@ static struct {
> > >   /* That we loaded cap-ng in the current thread from the saved */
> > >   static __thread bool cap_loaded = 0;
> > > -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
> > > -uint64_t mnt_id);
> > > +static struct lo_inode *lo_find(struct lo_data *lo,
> > > +const struct lo_fhandle *fhandle,
> > > +struct stat *st, uint64_t mnt_id);

Re: [PATCH v4 10/12] virtiofsd: Add inodes_by_handle hash table

2021-10-20 Thread Vivek Goyal

On Wed, Oct 20, 2021 at 12:02:32PM +0200, Hanna Reitz wrote:
> On 19.10.21 22:02, Vivek Goyal wrote:
> > On Thu, Sep 16, 2021 at 10:40:43AM +0200, Hanna Reitz wrote:
> > > Currently, lo_inode.fhandle is always NULL and so we always keep an O_PATH
> > > FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
> > > its inode ID will remain in use until we drop our lo_inode (and
> > > lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
> > > the inode ID as an lo_inode key, because any inode with an inode ID we
> > > find in lo_data.inodes (on the same filesystem) must be the exact same
> > > file.
> > > 
> > > This will change when we start setting lo_inode.fhandle so we do not
> > > have to keep an O_PATH FD open.  Then, unlinking such an inode will
> > > immediately remove it, so its ID can then be reused by newly created
> > > files, even while the lo_inode object is still there[1].
> > > 
> > > So creating a new file can then reuse the old file's inode ID, and
> > > looking up the new file would lead to us finding the old file's
> > > lo_inode, which is not ideal.
> > > 
> > > Luckily, just as file handles cause this problem, they also solve it:  A
> > > file handle contains a generation ID, which changes when an inode ID is
> > > reused, so the new file can be distinguished from the old one.  So all
> > > we need to do is to add a second map besides lo_data.inodes that maps
> > > file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
> > > clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.
> > > 
> > > Unfortunately, we cannot rely on being able to generate file handles
> > > every time.  Therefore, we still enter every lo_inode object into
> > > inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
> > > potential inodes_by_handle entry then has precedence, the inodes_by_ids
> > > entry is just a fallback.
> > > 
> > > Note that we do not generate lo_fhandle objects yet, and so we also do
> > > not enter anything into the inodes_by_handle map yet.  Also, all lookups
> > > skip that map.  We might manually create file handles with some code
> > > that is immediately removed by the next patch again, but that would
> > > break the assumption in lo_find() that every lo_inode with a non-NULL
> > > .fhandle must have an entry in inodes_by_handle and vice versa.  So we
> > > leave actually using the inodes_by_handle map for the next patch.
> > > 
> > > [1] If some application in the guest still has the file open, there is
> > > going to be a corresponding FD mapping in lo_data.fd_map.  In such a
> > > case, the inode will only go away once every application in the guest
> > > has closed it.  The problem described only applies to cases where the
> > > guest does not have the file open, and it is just in the dentry cache,
> > > basically.
> > > 
> > > Signed-off-by: Hanna Reitz 
> > > ---
> > >   tools/virtiofsd/passthrough_ll.c | 81 +---
> > >   1 file changed, 65 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/tools/virtiofsd/passthrough_ll.c 
> > > b/tools/virtiofsd/passthrough_ll.c
> > > index bd8fc922ea..b7d6aa7f9d 100644
> > > --- a/tools/virtiofsd/passthrough_ll.c
> > > +++ b/tools/virtiofsd/passthrough_ll.c
> > > @@ -186,7 +186,8 @@ struct lo_data {
> > >   int announce_submounts;
> > >   bool use_statx;
> > >   struct lo_inode root;
> > > -GHashTable *inodes; /* protected by lo->mutex */
> > > +GHashTable *inodes_by_ids; /* protected by lo->mutex */
> > > +GHashTable *inodes_by_handle; /* protected by lo->mutex */
> > >   struct lo_map ino_map; /* protected by lo->mutex */
> > >   struct lo_map dirp_map; /* protected by lo->mutex */
> > >   struct lo_map fd_map; /* protected by lo->mutex */
> > > @@ -275,8 +276,9 @@ static struct {
> > >   /* That we loaded cap-ng in the current thread from the saved */
> > >   static __thread bool cap_loaded = 0;
> > > -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
> > > -uint64_t mnt_id);
> > > +static struct lo_inode *lo_find(struct lo_data *lo,
> > > +const struct lo_fhandle *fhandle,
> > > +struct stat *st, uint64_t mnt_id);

Re: [PATCH v4 10/12] virtiofsd: Add inodes_by_handle hash table

2021-10-19 Thread Vivek Goyal

On Thu, Sep 16, 2021 at 10:40:43AM +0200, Hanna Reitz wrote:
> Currently, lo_inode.fhandle is always NULL and so we always keep an O_PATH
> FD in lo_inode.fd.  Therefore, when the respective inode is unlinked,
> its inode ID will remain in use until we drop our lo_inode (and
> lo_inode_put() thus closes the FD).  Therefore, lo_find() can safely use
> the inode ID as an lo_inode key, because any inode with an inode ID we
> find in lo_data.inodes (on the same filesystem) must be the exact same
> file.
> 
> This will change when we start setting lo_inode.fhandle so we do not
> have to keep an O_PATH FD open.  Then, unlinking such an inode will
> immediately remove it, so its ID can then be reused by newly created
> files, even while the lo_inode object is still there[1].
> 
> So creating a new file can then reuse the old file's inode ID, and
> looking up the new file would lead to us finding the old file's
> lo_inode, which is not ideal.
> 
> Luckily, just as file handles cause this problem, they also solve it:  A
> file handle contains a generation ID, which changes when an inode ID is
> reused, so the new file can be distinguished from the old one.  So all
> we need to do is to add a second map besides lo_data.inodes that maps
> file handles to lo_inodes, namely lo_data.inodes_by_handle.  For
> clarity, lo_data.inodes is renamed to lo_data.inodes_by_ids.
> 
> Unfortunately, we cannot rely on being able to generate file handles
> every time.  Therefore, we still enter every lo_inode object into
> inodes_by_ids, but having an entry in inodes_by_handle is optional.  A
> potential inodes_by_handle entry then has precedence, the inodes_by_ids
> entry is just a fallback.
> 
> Note that we do not generate lo_fhandle objects yet, and so we also do
> not enter anything into the inodes_by_handle map yet.  Also, all lookups
> skip that map.  We might manually create file handles with some code
> that is immediately removed by the next patch again, but that would
> break the assumption in lo_find() that every lo_inode with a non-NULL
> .fhandle must have an entry in inodes_by_handle and vice versa.  So we
> leave actually using the inodes_by_handle map for the next patch.
> 
> [1] If some application in the guest still has the file open, there is
> going to be a corresponding FD mapping in lo_data.fd_map.  In such a
> case, the inode will only go away once every application in the guest
> has closed it.  The problem described only applies to cases where the
> guest does not have the file open, and it is just in the dentry cache,
> basically.
> 
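
A minimal sketch of the lookup order described above: file handle first, inode IDs as a fallback. All sk_* names are hypothetical stand-ins, a small linear array replaces the GHashTable, and locking is omitted. The rule that the ID fallback only accepts entries holding an open FD is an assumption derived from the commit message's reasoning (an open O_PATH FD keeps the inode ID from being reused); it is not copied from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for lo_fhandle, lo_key and lo_inode. */
struct sk_fhandle { uint64_t ino; uint32_t generation; };
struct sk_ids { uint64_t ino; uint64_t dev; uint64_t mnt_id; };

struct sk_inode {
    struct sk_ids ids;                /* every inode has an IDs key */
    const struct sk_fhandle *fhandle; /* optional handle key, NULL if none */
    int fd;                           /* O_PATH-style FD, or -1 if none */
};

/* Tiny linear "map"; the patch uses two GHashTables under lo->mutex. */
#define SK_MAX 8
struct sk_table { struct sk_inode *slots[SK_MAX]; size_t n; };

static void sk_insert(struct sk_table *t, struct sk_inode *inode)
{
    assert(t->n < SK_MAX);
    t->slots[t->n++] = inode;
}

static struct sk_inode *sk_find(const struct sk_table *t,
                                const struct sk_fhandle *fhandle,
                                const struct sk_ids *ids)
{
    size_t i;

    /* First choice: the file handle, which includes the generation ID. */
    if (fhandle) {
        for (i = 0; i < t->n; i++) {
            const struct sk_fhandle *h = t->slots[i]->fhandle;

            if (h && h->ino == fhandle->ino &&
                h->generation == fhandle->generation) {
                return t->slots[i];
            }
        }
    }

    /*
     * Fallback: the (ino, dev, mnt_id) triple.  Assumption of this sketch:
     * only entries that hold an open FD are trusted here, because only an
     * open FD pins the inode ID against reuse.
     */
    for (i = 0; i < t->n; i++) {
        const struct sk_inode *p = t->slots[i];

        if (p->fd >= 0 && p->ids.ino == ids->ino &&
            p->ids.dev == ids->dev && p->ids.mnt_id == ids->mnt_id) {
            return t->slots[i];
        }
    }
    return NULL;
}
```

With this ordering, a new file that reuses inode ID 42 with a different generation does not resolve to the stale entry, while inodes that only have an FD are still found through their IDs.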
> Signed-off-by: Hanna Reitz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 81 +---
>  1 file changed, 65 insertions(+), 16 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index bd8fc922ea..b7d6aa7f9d 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -186,7 +186,8 @@ struct lo_data {
>  int announce_submounts;
>  bool use_statx;
>  struct lo_inode root;
> -GHashTable *inodes; /* protected by lo->mutex */
> +GHashTable *inodes_by_ids; /* protected by lo->mutex */
> +GHashTable *inodes_by_handle; /* protected by lo->mutex */
>  struct lo_map ino_map; /* protected by lo->mutex */
>  struct lo_map dirp_map; /* protected by lo->mutex */
>  struct lo_map fd_map; /* protected by lo->mutex */
> @@ -275,8 +276,9 @@ static struct {
>  /* That we loaded cap-ng in the current thread from the saved */
>  static __thread bool cap_loaded = 0;
>  
> -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
> -uint64_t mnt_id);
> +static struct lo_inode *lo_find(struct lo_data *lo,
> +const struct lo_fhandle *fhandle,
> +struct stat *st, uint64_t mnt_id);
>  static int xattr_map_client(const struct lo_data *lo, const char 
> *client_name,
>  char **out_name);
>  
> @@ -1143,18 +1145,40 @@ out_err:
>  fuse_reply_err(req, saverr);
>  }
>  
> -static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
> -uint64_t mnt_id)

> +static struct lo_inode *lo_find(struct lo_data *lo,
> +const struct lo_fhandle *fhandle,
> +struct stat *st, uint64_t mnt_id)
>  {
> -struct lo_inode *p;
> -struct lo_key key = {
> +struct lo_inode *p = NULL;
> +struct lo_key ids_key = {
>  .ino = st->st_ino,
>  .dev = st->st_dev,
>  .mnt_id = mnt_id,
>  };
>  
>  pthread_mutex_lock(&lo->mutex);
> -p = g_hash_table_lookup(lo->inodes, &key);
> +if (fhandle) {
> +p = g_hash_table_lookup(lo->inodes_by_handle, fhandle);
> +}
> +if (!p) {
> +p = g_hash_table_lookup(lo->inodes_by_ids, &ids_key);
> +/*
> + * When we had to fall back to looking up an

Re: [PATCH v4 11/12] virtiofsd: Optionally fill lo_inode.fhandle

2021-10-19 Thread Vivek Goyal

On Thu, Sep 16, 2021 at 10:40:44AM +0200, Hanna Reitz wrote:
> When the inode_file_handles option is set, try to generate a file handle
> for new inodes instead of opening an O_PATH FD.
> 
> Being able to open these again will require CAP_DAC_READ_SEARCH, so
> setting this option will result in us taking that capability.
> 
> Generating a file handle returns the mount ID it is valid for.  Opening
> it will require an FD instead.  We have mount_fds to map an ID to an FD.
> get_file_handle() scans /proc/self/mountinfo to map mount IDs to their
> mount points, which we open to get the mount FD we need.  To verify that
> the resulting FD indeed represents the handle's mount ID, we use
> statx().  Therefore, using file handles requires statx() support.
> 
> Signed-off-by: Hanna Reitz 
> ---
>  tools/virtiofsd/helper.c  |   3 +
>  tools/virtiofsd/passthrough_ll.c  | 297 --
>  tools/virtiofsd/passthrough_seccomp.c |   1 +
>  3 files changed, 289 insertions(+), 12 deletions(-)
> 
> diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
> index a8295d975a..311f05c7ee 100644
> --- a/tools/virtiofsd/helper.c
> +++ b/tools/virtiofsd/helper.c
> @@ -187,6 +187,9 @@ void fuse_cmdline_help(void)
> "   default: no_allow_direct_io\n"
> "-o announce_submounts  Announce sub-mount points to the guest\n"
> "-o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: disabled)\n"
> +   "-o inode_file_handles  Use file handles to reference inodes\n"
> +   "   instead of O_PATH file descriptors\n"
> +   "   (adds +dac_read_search to modcaps)\n"
> );
>  }
>  
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index b7d6aa7f9d..e86fad8b2f 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -206,6 +206,8 @@ struct lo_data {
>  /* Maps (integer) mount IDs to lo_mount_fd objects */
>  GHashTable *mount_fds;
>  pthread_rwlock_t mount_fds_lock;
> +
> +int inode_file_handles;
>  };
>  
>  /**
> @@ -262,6 +264,10 @@ static const struct fuse_opt lo_opts[] = {
>  { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
>  { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
>  { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
> +{ "inode_file_handles", offsetof(struct lo_data, inode_file_handles), 1 },
> +{ "no_inode_file_handles",
> +  offsetof(struct lo_data, inode_file_handles),
> +  0 },
>  FUSE_OPT_END
>  };
>  static bool use_syslog = false;
> @@ -359,8 +365,15 @@ static void free_lo_mount_fd(gpointer data)
>   *
>   * Pass @drop_mount_fd_ref == true if and only if this handle has a
>   * strong reference to an lo_mount_fd object in the mount_fds hash
> - * table.  That is always the case for file handles stored in lo_inode
> - * objects.
> + * table, i.e. if this file handle has been returned by a
> + * get_file_handle() call where *can_open_handle was returned to be
> + * true.  (That is always the case for file handles stored in lo_inode
> + * objects, because those file handles must be open-able.)
> + *
> + * Conversely, pass @drop_mount_fd_ref == false if and only if this
> + * file handle has been returned by a get_file_handle() call where
> + * either NULL was passed for @can_open_handle, or where
> + * *can_open_handle was returned to be false.
>   */
>  static void release_file_handle(struct lo_data *lo, struct lo_fhandle *fh,
>  bool drop_mount_fd_ref)
> @@ -399,6 +412,196 @@ static void release_file_handle(struct lo_data *lo, 
> struct lo_fhandle *fh,
>  g_free(fh);
>  }
>  
> +/**
> + * Generate a file handle for the given dirfd/name combination.
> + *
> + * If mount_fds does not yet contain an entry for the handle's mount
> + * ID, (re)open dirfd/name in O_RDONLY mode and add it to mount_fds
> + * as the FD for that mount ID.  (That is the file that we have
> + * generated a handle for, so it should be representative for the
> + * mount ID.  However, to be sure (and to rule out races), we use
> + * statx() to verify that our assumption is correct.)
> + *
> + * Opening a mount FD can fail in various ways, and independently of
> + * whether generating a file handle was possible.  Many callers only
> + * care about getting a file handle for a lookup, though, and so do
> + * not necessarily need it to be usable.  (You need a valid mount FD
> + * for the handle to be usable.)
> + * *can_open_handle will be set to true if the file handle can be
> + * opened (i.e., we have a mount FD for it), and to false otherwise.
> + * By passing NULL for @can_open_handle, the caller indicates that
> + * they do not care about the handle being open-able, and so
> + * generating a mount FD will be skipped

Re: [PATCH v4 07/12] virtiofsd: Let lo_inode_open() return a TempFd

2021-10-18 Thread Vivek Goyal

On Thu, Sep 16, 2021 at 10:40:40AM +0200, Hanna Reitz wrote:
> Strictly speaking, this is not necessary, because lo_inode_open() will
> always return a new FD owned by the caller, so TempFd.owned will always
> be true.
> 
> The auto-cleanup is nice, though.  Also, we get a more unified interface
> where you always get a TempFd when you need an FD for an lo_inode
> (regardless of whether it is an O_PATH FD or a non-O_PATH FD).
> 
> Signed-off-by: Hanna Reitz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 156 +++
>  1 file changed, 75 insertions(+), 81 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 3bf20b8659..d257eda129 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -293,10 +293,8 @@ static void temp_fd_clear(TempFd *temp_fd)
>  /**
>   * Return an owned fd from *temp_fd that will not be closed when
>   * *temp_fd goes out of scope.
> - *
> - * (TODO: Remove __attribute__ once this is used.)
>   */
> -static __attribute__((unused)) int temp_fd_steal(TempFd *temp_fd)
> +static int temp_fd_steal(TempFd *temp_fd)
>  {
>  if (temp_fd->owned) {
>  temp_fd->owned = false;
> @@ -309,10 +307,8 @@ static __attribute__((unused)) int temp_fd_steal(TempFd 
> *temp_fd)
>  /**
>   * Create a borrowing copy of an existing TempFd.  Note that *to is
>   * only valid as long as *from is valid.
> - *
> - * (TODO: Remove __attribute__ once this is used.)
>   */
> -static __attribute__((unused)) void temp_fd_copy(const TempFd *from, TempFd 
> *to)
> +static void temp_fd_copy(const TempFd *from, TempFd *to)
>  {
>  *to = (TempFd) {
>  .fd = from->fd,
> @@ -689,9 +685,12 @@ static int lo_fd(fuse_req_t req, fuse_ino_t ino, TempFd 
> *tfd)
>   * when a malicious client opens special files such as block device nodes.
>   * Symlink inodes are also rejected since symlinks must already have been
>   * traversed on the client side.
> + *
> + * The fd is returned in tfd->fd.  The return value is 0 on success and 
> -errno
> + * otherwise.
>   */
>  static int lo_inode_open(struct lo_data *lo, struct lo_inode *inode,
> - int open_flags)
> + int open_flags, TempFd *tfd)
>  {
>  g_autofree char *fd_str = g_strdup_printf("%d", inode->fd);
>  int fd;
> @@ -710,7 +709,13 @@ static int lo_inode_open(struct lo_data *lo, struct 
> lo_inode *inode,
>  if (fd < 0) {
>  return -errno;
>  }
> -return fd;
> +
> +*tfd = (TempFd) {
> +.fd = fd,
> +.owned = true,
> +};
> +
> +return 0;
>  }
>  
>  static void lo_init(void *userdata, struct fuse_conn_info *conn)
> @@ -854,7 +859,8 @@ static int lo_fi_fd(fuse_req_t req, struct fuse_file_info 
> *fi)
>  static void lo_setattr(fuse_req_t req, fuse_ino_t ino, struct stat *attr,
> int valid, struct fuse_file_info *fi)
>  {
> -g_auto(TempFd) path_fd = TEMP_FD_INIT;
> +g_auto(TempFd) path_fd = TEMP_FD_INIT; /* at least an O_PATH fd */

What does "at least an O_PATH fd" mean?

> +g_auto(TempFd) rw_fd = TEMP_FD_INIT; /* O_RDWR fd */
>  int saverr;
>  char procname[64];
>  struct lo_data *lo = lo_data(req);
> @@ -868,7 +874,15 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
> struct stat *attr,
>  return;
>  }
>  
> -res = lo_inode_fd(inode, &path_fd);
> +if (!fi && (valid & FUSE_SET_ATTR_SIZE)) {
> +/* We need an O_RDWR FD for ftruncate() */
> +res = lo_inode_open(lo, inode, O_RDWR, &rw_fd);
> +if (res >= 0) {
> +temp_fd_copy(&rw_fd, &path_fd);

I am lost here. If lo_inode_open() failed, why are we calling this
temp_fd_copy()? path_fd is not even a valid fd yet.

It still beats me why we open rw_fd now instead of further down in the
FUSE_SET_ATTR_SIZE block. I think we had this discussion and you
had some reasons to move it up.

Vivek
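
The ownership rules behind TempFd that this patch relies on (an owning FD, a borrowing copy, and stealing) can be sketched roughly as follows. This is a hypothetical re-creation, not the virtiofsd code: the Sketch* names are made up, the GCC/Clang cleanup attribute stands in for GLib's g_auto(), and the dup() fallback in the steal helper is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical re-creation of the TempFd idea. */
typedef struct SketchTempFd {
    int fd;
    bool owned; /* true: closing fd is this object's job */
} SketchTempFd;

#define SKETCH_TEMP_FD_INIT { .fd = -1, .owned = false }

/* Close the fd if (and only if) we own it, then reset to the empty state. */
static void sketch_temp_fd_clear(SketchTempFd *t)
{
    if (t->owned && t->fd >= 0) {
        close(t->fd);
    }
    t->fd = -1;
    t->owned = false;
}

/* Auto-cleanup when the variable goes out of scope (GCC/Clang extension). */
#define SKETCH_AUTO_TEMP_FD \
    __attribute__((cleanup(sketch_temp_fd_clear))) SketchTempFd

/* Borrowing copy: *to is only valid as long as *from is valid. */
static void sketch_temp_fd_copy(const SketchTempFd *from, SketchTempFd *to)
{
    *to = (SketchTempFd) { .fd = from->fd, .owned = false };
}

/*
 * Return an fd the caller owns.  If *t owns the fd, ownership is handed
 * over; otherwise (assumption of this sketch) a duplicate is returned.
 */
static int sketch_temp_fd_steal(SketchTempFd *t)
{
    if (t->owned) {
        t->owned = false;
        return t->fd;
    }
    return dup(t->fd);
}
```

In lo_setattr() terms, rw_fd would be the owning object and path_fd the borrowing copy, so the underlying FD is closed exactly once when rw_fd goes out of scope.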

> +}
> +} else {
> +res = lo_inode_fd(inode, &path_fd);
> +}
>  if (res < 0) {
>  saverr = -res;
>  goto out_err;
> @@ -916,18 +930,12 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
> struct stat *attr,
>  if (fi) {
>  truncfd = fd;
>  } else {
> -truncfd = lo_inode_open(lo, inode, O_RDWR);
> -if (truncfd < 0) {
> -saverr = -truncfd;
> -goto out_err;
> -}
> +assert(rw_fd.fd >= 0);
> +truncfd = rw_fd.fd;
>  }
>  
>  saverr = drop_security_capability(lo, truncfd);
>  if (saverr) {
> -if (!fi) {
> -close(truncfd);
> -}
>  goto out_err;
>  }
>  
> @@ -935,9 +943,6 @@ static void lo_setattr(fuse_req_t req, fuse_ino_t ino, 
> struct stat *attr,
>  res = drop_effective_cap("FSETID", &cap_fsetid_dropped);
>  if (res != 0) {
>  saverr = res;
> -if

Re: [PATCH v4 00/12] virtiofsd: Allow using file handles instead of O_PATH FDs

2021-10-18 Thread Vivek Goyal

On Thu, Sep 16, 2021 at 10:40:33AM +0200, Hanna Reitz wrote:

[..]
> Second, I’ve renamed the TempFd objects to reflect what kind of FDs they
> contain; i.e. they’re no longer called “inode_fd” or “dir_fd”, but
> “path_fd”, “rw_fd”, or “dir_path_fd” instead.

This change is really helpful. Makes it easier to read code and figure
out which fd we are referring to.

Vivek

Re: [PATCH v4 02/12] virtiofsd: Limit setxattr()'s creds-dropped region

2021-10-18 Thread Vivek Goyal

On Thu, Sep 16, 2021 at 10:40:35AM +0200, Hanna Reitz wrote:
> We only need to drop/switch our credentials for the (f)setxattr() call
> alone, not for the openat() or fchdir() around it.
> 
> (Right now, this may not be that big of a problem, but with inodes being
> identified by file handles instead of an O_PATH fd, we will need
> open_by_handle_at() calls here, which is really fickle when it comes to
> credentials being dropped.)
> 
> Signed-off-by: Hanna Reitz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 34 +++-
>  1 file changed, 25 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 6511a6acb4..b43afdfbd3 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -3123,6 +3123,7 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t ino, 
> const char *in_name,
>  bool switched_creds = false;
>  bool cap_fsetid_dropped = false;
>  struct lo_cred old = {};
> +bool changed_cwd = false;
>  
>  if (block_xattr(lo, in_name)) {
>  fuse_reply_err(req, EOPNOTSUPP);
> @@ -3158,6 +3159,24 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t 
> ino, const char *in_name,
>   ", name=%s value=%s size=%zd)\n", ino, name, value, size);
>  
>  sprintf(procname, "%i", inode->fd);
> +/*
> + * We can only open regular files or directories.  If the inode is
> + * something else, we have to enter /proc/self/fd and use
> + * setxattr() on the link's filename there.
> + */
> +if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
> +fd = openat(lo->proc_self_fd, procname, O_RDONLY);
> +if (fd < 0) {
> +saverr = errno;
> +goto out;
> +}
> +} else {
> +/* fchdir should not fail here */
> +FCHDIR_NOFAIL(lo->proc_self_fd);
> +/* Set flag so the clean-up path will chdir back */
> +changed_cwd = true;

Is there a need to move the FCHDIR_NOFAIL() call earlier too? I am assuming
it will not be impacted by the file handle changes, so we could probably
leave it in place. That would be easier to read.

Vivek

> +}
> +
>  /*
>   * If we are setting posix access acl and if SGID needs to be
>   * cleared, then switch to caller's gid and drop CAP_FSETID
> @@ -3178,20 +3197,12 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t 
> ino, const char *in_name,
>  }
>  switched_creds = true;
>  }
> -if (S_ISREG(inode->filetype) || S_ISDIR(inode->filetype)) {
> -fd = openat(lo->proc_self_fd, procname, O_RDONLY);
> -if (fd < 0) {
> -saverr = errno;
> -goto out;
> -}
> +if (fd >= 0) {
>  ret = fsetxattr(fd, name, value, size, flags);
>  saverr = ret == -1 ? errno : 0;
>  } else {
> -/* fchdir should not fail here */
> -FCHDIR_NOFAIL(lo->proc_self_fd);
>  ret = setxattr(procname, name, value, size, flags);
>  saverr = ret == -1 ? errno : 0;
> -FCHDIR_NOFAIL(lo->root.fd);
>  }
>  if (switched_creds) {
>  if (cap_fsetid_dropped)
> @@ -3201,6 +3212,11 @@ static void lo_setxattr(fuse_req_t req, fuse_ino_t 
> ino, const char *in_name,
>  }
>  
>  out:
> +if (changed_cwd) {
> +/* Change CWD back, fchdir should not fail here */
> +FCHDIR_NOFAIL(lo->root.fd);
> +}
> +
>  if (fd >= 0) {
>  close(fd);
>  }
> -- 
> 2.31.1
>

Re: [PATCH v4 01/12] virtiofsd: Keep /proc/self/mountinfo open

2021-10-18 Thread Vivek Goyal

On Thu, Sep 16, 2021 at 10:40:34AM +0200, Hanna Reitz wrote:
> File handles are specific to mounts, and so name_to_handle_at() returns
> the respective mount ID.  However, open_by_handle_at() is not content
> with an ID, it wants a file descriptor for some inode on the mount,
> which we have to open.
> 
> We want to use /proc/self/mountinfo to find the mounts' root directories
> so we can open them and pass the respective FDs to open_by_handle_at().
> (We need to use the root directory, because we want the inode belonging
> to every mount FD be deletable.  Before the root directory can be
> deleted, all entries within must have been closed, and so when it is
> deleted, there should not be any file handles left that need its FD as
> their mount FD.  Thus, we can then close that FD and the inode can be
> deleted.[1])
> 
> That is why we need to open /proc/self/mountinfo so that we can use it
> to translate mount IDs into root directory paths.  We have to open it
> after setup_mounts() was called, because if we try to open it before, it
> will appear as an empty file after setup_mounts().
> 
> [1] Note that in practice, you still cannot delete the mount root
> directory.  It is a mount point on the host, after all, and mount points
> cannot be deleted.  But by using the mount point as the mount FD, we
> will at least not hog any actually deletable inodes.
> 
> Signed-off-by: Hanna Reitz 
> ---
>  tools/virtiofsd/passthrough_ll.c | 40 
>  1 file changed, 40 insertions(+)
> 
> diff --git a/tools/virtiofsd/passthrough_ll.c 
> b/tools/virtiofsd/passthrough_ll.c
> index 38b2af8599..6511a6acb4 100644
> --- a/tools/virtiofsd/passthrough_ll.c
> +++ b/tools/virtiofsd/passthrough_ll.c
> @@ -172,6 +172,8 @@ struct lo_data {
>  
>  /* An O_PATH file descriptor to /proc/self/fd/ */
>  int proc_self_fd;
> +/* A read-only FILE pointer for /proc/self/mountinfo */
> +FILE *mountinfo_fp;
>  int user_killpriv_v2, killpriv_v2;
>  /* If set, virtiofsd is responsible for setting umask during creation */
>  bool change_umask;
> @@ -3718,6 +3720,19 @@ static void setup_chroot(struct lo_data *lo)
>  static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
>bool enable_syslog)
>  {
> +int proc_self, mountinfo_fd;
> +int saverr;
> +
> +/*
> + * Open /proc/self before we pivot to the new root so we can still
> + * open /proc/self/mountinfo afterwards
> + */
> +proc_self = open("/proc/self", O_PATH);
> +if (proc_self < 0) {
> +fuse_log(FUSE_LOG_WARNING, "Failed to open /proc/self: %m; "
> + "will not be able to use file handles\n");
> +}
> +

Hi Hanna,

Should we open /proc/self and /proc/self/mountinfo only if the user wants
to use file handles? We have already parsed the options by now, so we know.

Also, if the user asked for file handles and we can't open /proc/self or
/proc/self/mountinfo successfully, I would think we should error out
and not continue (instead of just logging it and continuing).

That seems to be the general theme: if the user asked for a feature and
we can't enable it, we error out and let the user retry without that
particular feature.

>  if (lo->sandbox == SANDBOX_NAMESPACE) {
>  setup_namespaces(lo, se);
>  setup_mounts(lo->source);
> @@ -3725,6 +3740,31 @@ static void setup_sandbox(struct lo_data *lo, struct 
> fuse_session *se,
>  setup_chroot(lo);
>  }
>  
> +/*
> + * Opening /proc/self/mountinfo before the umount2() call in
> + * setup_mounts() leads to the file appearing empty.  That is why
> + * we defer opening it until here.
> + */
> +lo->mountinfo_fp = NULL;
> +if (proc_self >= 0) {
> +mountinfo_fd = openat(proc_self, "mountinfo", O_RDONLY);
> +if (mountinfo_fd < 0) {
> +saverr = errno;
> +} else if (mountinfo_fd >= 0) {
> +lo->mountinfo_fp = fdopen(mountinfo_fd, "r");
> +if (!lo->mountinfo_fp) {
> +saverr = errno;
> +close(mountinfo_fd);
> +}
> +}
> +if (!lo->mountinfo_fp) {
> +fuse_log(FUSE_LOG_WARNING, "Failed to open /proc/self/mountinfo: 
> "
> + "%s; will not be able to use file handles\n",
> + strerror(saverr));
> +}
> +close(proc_self);
> +}
> +

The above code could probably be moved into a helper function. That would
make setup_sandbox() easier to read. Same here: open mountinfo only if the
user wants file handle support, and error out if file handle support
can't be enabled.

Thanks
Vivek
>  setup_seccomp(enable_syslog);
>  setup_capabilities(g_strdup(lo->modcaps));
>  }
> -- 
> 2.31.1
>

Re: [Virtio-fs] [PATCH] virtiofsd: Error on bad socket group name

2021-10-18 Thread Vivek Goyal

On Thu, Oct 14, 2021 at 01:25:54PM +0100, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" 
> 
> Make the '--socket-group=' option fail if the group name is unknown:
> 
> ./tools/virtiofsd/virtiofsd  --socket-group=zaphod
> vhost socket: unable to find group 'zaphod'
> 
> Reported-by: Xiaoling Gao 
> Signed-off-by: Dr. David Alan Gilbert 

Hi Dave,

This looks good to me. Just a minor nit for code cleanup. It could
be done in a separate patch or sometime later as well.

Reviewed-by: Vivek Goyal 

> ---
>  tools/virtiofsd/fuse_virtio.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/tools/virtiofsd/fuse_virtio.c b/tools/virtiofsd/fuse_virtio.c
> index 8f4fd165b9..39eebffb62 100644
> --- a/tools/virtiofsd/fuse_virtio.c
> +++ b/tools/virtiofsd/fuse_virtio.c
> @@ -999,6 +999,13 @@ static int fv_create_listen_socket(struct fuse_session 
> *se)
>   "vhost socket failed to set group to %s (%d): %m\n",
>   se->vu_socket_group, g->gr_gid);
>  }
> +} else {
> +fuse_log(FUSE_LOG_ERR,
> + "vhost socket: unable to find group '%s'\n",
> + se->vu_socket_group);
> +close(listen_sock);
> +umask(old_umask);
^^^
> +return -1;
>  }
>  }
>  umask(old_umask);

This umask() call could be moved a little earlier, right after the bind()
call. That way we don't have to take care of calling umask(old_umask) in the
error path if the group name could not be found.

Vivek
> -- 
> 2.31.1
> 
>
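
The cleanup suggested above, restoring the umask right after bind(), can be sketched as follows. This is a stripped-down, hypothetical stand-in for the listen-socket setup (the function name, the umask values and the listen backlog are assumptions of this sketch), meant only to show that later error paths no longer need to restore the umask.

```c
#include <assert.h>
#include <grp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical, simplified listen-socket setup.  Returns the fd or -1. */
static int sketch_create_listen_socket(const char *path, const char *group)
{
    struct sockaddr_un un;
    mode_t old_umask;
    int sock, ret;

    if (strlen(path) >= sizeof(un.sun_path)) {
        return -1;
    }
    sock = socket(AF_UNIX, SOCK_STREAM, 0);
    if (sock < 0) {
        return -1;
    }

    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    strcpy(un.sun_path, path);
    unlink(path);

    /* Restrict permissions only for the duration of bind(). */
    old_umask = umask(group ? 0007 : 0077);
    ret = bind(sock, (struct sockaddr *)&un, sizeof(un));
    umask(old_umask); /* restored once, before any later error path */
    if (ret < 0) {
        close(sock);
        return -1;
    }

    if (group) {
        struct group *g = getgrnam(group);

        if (!g) {
            fprintf(stderr, "vhost socket: unable to find group '%s'\n",
                    group);
            close(sock); /* no umask() call needed here any more */
            return -1;
        }
        if (chown(path, (uid_t)-1, g->gr_gid) < 0) {
            fprintf(stderr, "vhost socket: failed to set group to %s\n",
                    group);
        }
    }

    if (listen(sock, 1) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```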

[PATCH v2 6/6] virtiofsd: Add an option to enable/disable security label

2021-10-14 Thread Vivek Goyal

Provide an option "-o security_label/no_security_label" to enable/disable
security label functionality. By default it is turned off.

If enabled, the server will indicate to the client that it is capable of
handling one security label during file creation. Typically this is expected
to be an SELinux label. The file server will set this label on the file. It
will try to set it atomically wherever possible, but that is not possible in
all cases.

Signed-off-by: Vivek Goyal 
---
 docs/tools/virtiofsd.rst |  7 +++
 tools/virtiofsd/helper.c |  1 +
 tools/virtiofsd/passthrough_ll.c | 15 +++
 3 files changed, 23 insertions(+)

diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
index cc31402830..54699b2013 100644
--- a/docs/tools/virtiofsd.rst
+++ b/docs/tools/virtiofsd.rst
@@ -104,6 +104,13 @@ Options
   * posix_acl|no_posix_acl -
 Enable/disable posix acl support.  Posix ACLs are disabled by default.
 
+  * security_label|no_security_label -
+Enable/disable security label support. Security labels are disabled by
+default. This will allow client to send a MAC label of file during
+file creation. Typically this is expected to be SELinux security
+label. Server will try to set that label on newly created file
+atomically wherever possible.
+
 .. option:: --socket-path=PATH
 
   Listen on vhost-user UNIX domain socket at PATH.
diff --git a/tools/virtiofsd/helper.c b/tools/virtiofsd/helper.c
index a8295d975a..e226fc590f 100644
--- a/tools/virtiofsd/helper.c
+++ b/tools/virtiofsd/helper.c
@@ -187,6 +187,7 @@ void fuse_cmdline_help(void)
"   default: no_allow_direct_io\n"
"-o announce_submounts  Announce sub-mount points to the guest\n"
"-o posix_acl/no_posix_acl  Enable/Disable posix_acl. (default: disabled)\n"
+   "-o security_label/no_security_label  Enable/Disable security label. (default: disabled)\n"
);
 }
 
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4505c0c363..4334885619 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -180,6 +180,7 @@ struct lo_data {
 int user_posix_acl, posix_acl;
 /* Keeps track if /proc//attr/fscreate should be used or not */
 bool use_fscreate;
+int user_security_label;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -214,6 +215,8 @@ static const struct fuse_opt lo_opts[] = {
 { "no_killpriv_v2", offsetof(struct lo_data, user_killpriv_v2), 0 },
 { "posix_acl", offsetof(struct lo_data, user_posix_acl), 1 },
 { "no_posix_acl", offsetof(struct lo_data, user_posix_acl), 0 },
+{ "security_label", offsetof(struct lo_data, user_security_label), 1 },
+{ "no_security_label", offsetof(struct lo_data, user_security_label), 0 },
 FUSE_OPT_END
 };
 static bool use_syslog = false;
@@ -770,6 +773,17 @@ static void lo_init(void *userdata, struct fuse_conn_info *conn)
 fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling posix_acl\n");
 conn->want &= ~FUSE_CAP_POSIX_ACL;
 }
+
+if (lo->user_security_label == 1) {
+if (!(conn->capable & FUSE_CAP_SECURITY_CTX)) {
+fuse_log(FUSE_LOG_ERR, "lo_init: Can not enable security label."
+ " kernel does not support FUSE_SECURITY_CTX capability.\n");
+}
+conn->want |= FUSE_CAP_SECURITY_CTX;
+} else {
+fuse_log(FUSE_LOG_DEBUG, "lo_init: disabling security label\n");
+conn->want &= ~FUSE_CAP_SECURITY_CTX;
+}
 }
 
 static void lo_getattr(fuse_req_t req, fuse_ino_t ino,
@@ -4254,6 +4268,7 @@ int main(int argc, char *argv[])
 .proc_self_task = -1,
 .user_killpriv_v2 = -1,
 .user_posix_acl = -1,
+.user_security_label = -1,
 };
 struct lo_map_elem *root_elem;
 struct lo_map_elem *reserve_elem;
-- 
2.31.1

[PATCH v2 5/6] virtiofsd: Create new file using O_TMPFILE and set security context

2021-10-14 Thread Vivek Goyal

If guest and host policies can't work with each other, then the guest security
context (SELinux label) needs to be stored in a remapped xattr, for example by
remapping the guest's security.selinux xattr to
trusted.virtiofs.security.selinux.

In that case setting "fscreate" does not help, as it is only useful
for the security.selinux xattr on the host.

So we need another method that is atomic. Use O_TMPFILE to create the new
file, set the xattr, and then linkat() it into its proper place.

But this works only for regular files, so directories and symlinks continue
to be labeled non-atomically.

Also, if the host filesystem does not support O_TMPFILE, we fall back to
non-atomic behavior.
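
As a standalone illustration of the sequence above (create an unnamed file,
label it, then give it a name), here is a minimal sketch. It is not the
virtiofsd code: the function names are hypothetical, credential switching is
omitted, and it links through an absolute /proc/self/fd/N path instead of
changing directory to a pre-opened /proc/self/fd descriptor as the patch
does. The flag handling mirrors the do_create_nosecctx() hunk below.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for O_TMPFILE */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/* Derive open flags for the O_TMPFILE path from the client's flags. */
static int tmpfile_open_flags(int client_flags)
{
    /* O_EXCL would forbid linking later; O_CREAT makes openat() fail. */
    int flags = (client_flags | O_TMPFILE) & ~(O_CREAT | O_EXCL);

    /* O_TMPFILE needs either O_RDWR or O_WRONLY. */
    if ((flags & O_ACCMODE) == O_RDONLY) {
        flags |= O_RDWR;
    }
    return flags;
}

/*
 * Hypothetical helper: create "name" in dirfd with the xattr already set
 * when the name becomes visible. Returns 0 and stores the open descriptor
 * in *out_fd, or returns a positive errno value.
 */
static int create_labeled_file(int dirfd, const char *name, mode_t mode,
                               int client_flags, const char *xattr_name,
                               const void *value, size_t len, int *out_fd)
{
    char path[64];
    int fd, err;

    assert(name && xattr_name && out_fd);

    /* Unnamed file in the target directory; invisible to other users. */
    fd = openat(dirfd, ".", tmpfile_open_flags(client_flags), mode);
    if (fd == -1) {
        return errno; /* e.g. EOPNOTSUPP: caller falls back to non-atomic */
    }

    if (fsetxattr(fd, xattr_name, value, len, 0) == -1) {
        err = errno;
        goto fail;
    }

    /* Label is in place; now link the file under its final name. */
    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, path, dirfd, name, AT_SYMLINK_FOLLOW) == -1) {
        err = errno;
        goto fail;
    }

    *out_fd = fd;
    return 0;

fail:
    close(fd);
    return err;
}
```

Because linkat() fails with EEXIST if the name already exists, the O_EXCL
semantics of a normal create are preserved even though O_EXCL itself is
cleared on the O_TMPFILE open.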

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 80 
 1 file changed, 72 insertions(+), 8 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 7a714b1b5e..4505c0c363 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2128,14 +2128,29 @@ static int lo_do_open(struct lo_data *lo, struct lo_inode *inode,
 
 static int do_create_nosecctx(fuse_req_t req, struct lo_inode *parent_inode,
const char *name, mode_t mode,
-   struct fuse_file_info *fi, int *open_fd)
+   struct fuse_file_info *fi, int *open_fd,
+  bool tmpfile)
 {
 int err, fd;
 struct lo_cred old = {};
 struct lo_data *lo = lo_data(req);
 int flags;
 
-flags = fi->flags | O_CREAT | O_EXCL;
+if (tmpfile) {
+flags = fi->flags | O_TMPFILE;
+/*
+ * Don't use O_EXCL as we want to link file later. Also reset O_CREAT
+ * otherwise openat() returns -EINVAL.
+ */
+flags &= ~(O_CREAT | O_EXCL);
+
+/* O_TMPFILE needs either O_RDWR or O_WRONLY */
+if ((flags & O_ACCMODE) == O_RDONLY) {
+flags |= O_RDWR;
+}
+} else {
+flags = fi->flags | O_CREAT | O_EXCL;
+}
 
 err = lo_change_cred(req, &old, lo->change_umask);
 if (err) {
@@ -2166,7 +2181,7 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 
 close_reset_proc_fscreate(fscreate_fd);
 if (!err) {
@@ -2175,6 +2190,44 @@ static int do_create_secctx_fscreate(fuse_req_t req,
 return err;
 }
 
+static int do_create_secctx_tmpfile(fuse_req_t req,
+struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi,
+const char *secctx_name, int *open_fd)
+{
+int err, fd = -1;
+struct lo_data *lo = lo_data(req);
+char procname[64];
+
+err = do_create_nosecctx(req, parent_inode, ".", mode, fi, &fd, true);
+if (err) {
+return err;
+}
+
+err = fsetxattr(fd, secctx_name, req->secctx.ctx, req->secctx.ctxlen, 0);
+if (err) {
+err = errno;
+goto out;
+}
+
+/* Security context set on file. Link it in place */
+sprintf(procname, "%d", fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+err = linkat(AT_FDCWD, procname, parent_inode->fd, name,
+ AT_SYMLINK_FOLLOW);
+err = err == -1 ? errno : 0;
+FCHDIR_NOFAIL(lo->root.fd);
+
+out:
+if (!err) {
+*open_fd = fd;
+} else if (fd != -1) {
+close(fd);
+}
+return err;
+}
+
 static int do_create_secctx_noatomic(fuse_req_t req,
  struct lo_inode *parent_inode,
  const char *name, mode_t mode,
@@ -2183,7 +2236,7 @@ static int do_create_secctx_noatomic(fuse_req_t req,
 {
 int err = 0, fd = -1;
 
-err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd);
+err = do_create_nosecctx(req, parent_inode, name, mode, fi, &fd, false);
 if (err) {
 goto out;
 }
@@ -2225,20 +2278,31 @@ static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
 if (secctx_enabled) {
 /*
  * If security.selinux has not been remapped and selinux is enabled,
- * use fscreate to set context before file creation.
- * Otherwise fallback to non-atomic method of file creation
- * and xattr settting.
+ * use fscreate to set context before file creation. If not, use
+ * tmpfile method for regular files. Otherwise fallback to
+ * non-atomic method of file creation and xattr settting.
  */
 if (!mapped_name && lo->use_fscreate) {
 err = do_create_secctx_fscreate(req, parent_inode, name, mode, fi,
 open_fd);
 goto out;
+} else if (S_ISREG(

[PATCH v2 3/6] virtiofsd: Move core file creation code in separate function

2021-10-14 Thread Vivek Goyal

Move the core file creation bits into a separate function. This is soon going
to get more complex, as file creation also needs to set the security context,
and there will be multiple modes of file creation in the next patch.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 36 ++--
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 64b5b4fbb1..54978b7fae 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -1976,6 +1976,30 @@ static int lo_do_open(struct lo_data *lo, struct lo_inode *inode,
 return 0;
 }
 
+static int do_lo_create(fuse_req_t req, struct lo_inode *parent_inode,
+const char *name, mode_t mode,
+struct fuse_file_info *fi, int* open_fd)
+{
+int err = 0, fd;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+
+err = lo_change_cred(req, &old, lo->change_umask);
+if (err) {
+return err;
+}
+
+/* Try to create a new file but don't open existing files */
+fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
+if (fd == -1) {
+err = errno;
+} else {
+*open_fd = fd;
+}
+lo_restore_cred(&old, lo->change_umask);
+return err;
+}
+
 static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
   mode_t mode, struct fuse_file_info *fi)
 {
@@ -1985,7 +2009,6 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
 struct lo_inode *inode = NULL;
 struct fuse_entry_param e;
 int err;
-struct lo_cred old = {};
 
 fuse_log(FUSE_LOG_DEBUG, "lo_create(parent=%" PRIu64 ", name=%s)"
  " kill_priv=%d\n", parent, name, fi->kill_priv);
@@ -2001,18 +2024,9 @@ static void lo_create(fuse_req_t req, fuse_ino_t parent, const char *name,
 return;
 }
 
-err = lo_change_cred(req, &old, lo->change_umask);
-if (err) {
-goto out;
-}
-
 update_open_flags(lo->writeback, lo->allow_direct_io, fi);
 
-/* Try to create a new file but don't open existing files */
-fd = openat(parent_inode->fd, name, fi->flags | O_CREAT | O_EXCL, mode);
-err = fd == -1 ? errno : 0;
-
-lo_restore_cred(&old, lo->change_umask);
+err = do_lo_create(req, parent_inode, name, mode, fi, &fd);
 
 /* Ignore the error if file exists and O_EXCL was not given */
 if (err && (err != EEXIST || (fi->flags & O_EXCL))) {
-- 
2.31.1

[PATCH v2 1/6] fuse: Header file changes for FUSE_SECURITY_CTX

2021-10-14 Thread Vivek Goyal

These are just header file changes which should show up in qemu if
corresponding kernel changes get merged.

Signed-off-by: Vivek Goyal 
---
 include/standard-headers/linux/fuse.h | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/standard-headers/linux/fuse.h b/include/standard-headers/linux/fuse.h
index cce105bfba..adf70b884c 100644
--- a/include/standard-headers/linux/fuse.h
+++ b/include/standard-headers/linux/fuse.h
@@ -181,6 +181,10 @@
  *  - add FUSE_OPEN_KILL_SUIDGID
  *  - extend fuse_setxattr_in, add FUSE_SETXATTR_EXT
  *  - add FUSE_SETXATTR_ACL_KILL_SGID
+ *
+ *  7.35
+ *  - add FUSE_SECURITY_CTX flag for fuse_init_out
+ *  - add security context to create, mkdir, symlink, and mknod requests
  */
 
 #ifndef _LINUX_FUSE_H
@@ -212,7 +216,7 @@
 #define FUSE_KERNEL_VERSION 7
 
 /** Minor version number of this interface */
-#define FUSE_KERNEL_MINOR_VERSION 33
+#define FUSE_KERNEL_MINOR_VERSION 35
 
 /** The node ID of the root inode */
 #define FUSE_ROOT_ID 1
@@ -329,6 +333,8 @@ struct fuse_file_lock {
  * write/truncate sgid is killed only if file has group
  * execute permission. (Same as Linux VFS behavior).
  * FUSE_SETXATTR_EXT:  Server supports extended struct fuse_setxattr_in
+ * FUSE_SECURITY_CTX:  add security context to create, mkdir, symlink, and
+ * mknod
  */
 #define FUSE_ASYNC_READ    (1 << 0)
 #define FUSE_POSIX_LOCKS   (1 << 1)
@@ -360,6 +366,7 @@ struct fuse_file_lock {
 #define FUSE_SUBMOUNTS (1 << 27)
 #define FUSE_HANDLE_KILLPRIV_V2 (1 << 28)
 #define FUSE_SETXATTR_EXT  (1 << 29)
+#define FUSE_SECURITY_CTX  (1 << 30)
 
 /**
  * CUSE INIT request/reply flags
@@ -967,4 +974,14 @@ struct fuse_removemapping_one {
 #define FUSE_REMOVEMAPPING_MAX_ENTRY   \
(PAGE_SIZE / sizeof(struct fuse_removemapping_one))
 
+struct fuse_secctx {
+   uint32_t size;
+   uint32_t padding;
+};
+
+struct fuse_secctxs {
+   uint32_t nr_secctx;
+   uint32_t padding;
+};
+
 #endif /* _LINUX_FUSE_H */
-- 
2.31.1

[PATCH v2 4/6] virtiofsd: Create new file with fscreate set

2021-10-14 Thread Vivek Goyal

This patch adds support for setting /proc/thread-self/attr/fscreate before
file creation. It is set to the value sent by the client. This allows the
security context to be set atomically with respect to file creation.

This is primarily useful when either SELinux is not enabled on the
host, or host and guest policies are in sync and don't conflict.

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/passthrough_ll.c | 317 ---
 1 file changed, 290 insertions(+), 27 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 54978b7fae..7a714b1b5e 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -172,10 +172,14 @@ struct lo_data {
 
 /* An O_PATH file descriptor to /proc/self/fd/ */
 int proc_self_fd;
+/* An O_PATH file descriptor to /proc/self/task/ */
+int proc_self_task;
 int user_killpriv_v2, killpriv_v2;
 /* If set, virtiofsd is responsible for setting umask during creation */
 bool change_umask;
 int user_posix_acl, posix_acl;
+/* Keeps track if /proc//attr/fscreate should be used or not */
+bool use_fscreate;
 };
 
 static const struct fuse_opt lo_opts[] = {
@@ -229,6 +233,11 @@ static struct lo_inode *lo_find(struct lo_data *lo, struct stat *st,
 static int xattr_map_client(const struct lo_data *lo, const char *client_name,
 char **out_name);
 
+#define FCHDIR_NOFAIL(fd) do { \
+int fchdir_res = fchdir(fd);   \
+assert(fchdir_res == 0);   \
+} while (0)
+
 static bool is_dot_or_dotdot(const char *name)
 {
 return name[0] == '.' &&
@@ -255,6 +264,33 @@ static struct lo_data *lo_data(fuse_req_t req)
 return (struct lo_data *)fuse_req_userdata(req);
 }
 
+/*
+ * Tries to figure out if /proc//attr/fscreate is usable or not. With
+ * selinux=0, read from fscreate returns -EINVAL.
+ *
+ * TODO: Link with libselinux and use is_selinux_enabled() instead down
+ * the line. It probably will be more reliable indicator.
+ */
+static bool is_fscreate_usable(struct lo_data *lo)
+{
+char procname[64];
+int fscreate_fd;
+size_t bytes_read;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_RDWR);
+if (fscreate_fd == -1) {
+return false;
+}
+
+bytes_read = read(fscreate_fd, procname, 64);
+close(fscreate_fd);
+if (bytes_read == -1) {
+return false;
+}
+return true;
+}
+
 /*
  * Load capng's state from our saved state if the current thread
  * hadn't previously been loaded.
@@ -1259,16 +1295,140 @@ static void lo_restore_cred_gain_cap(struct lo_cred *old, bool restore_umask,
 }
 }
 
+/* Helpers to set/reset fscreate */
+static int open_set_proc_fscreate(struct lo_data *lo, const void *ctx,
+  size_t ctxlen, int *fd)
+{
+char procname[64];
+int fscreate_fd, err = 0;
+size_t written;
+
+sprintf(procname, "%d/attr/fscreate", gettid());
+fscreate_fd = openat(lo->proc_self_task, procname, O_WRONLY);
+err = fscreate_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+
+written = write(fscreate_fd, ctx, ctxlen);
+err = written == -1 ? errno : 0;
+if (err) {
+goto out;
+}
+
+*fd = fscreate_fd;
+return 0;
+out:
+close(fscreate_fd);
+return err;
+}
+
+static void close_reset_proc_fscreate(int fd)
+{
+if ((write(fd, NULL, 0)) == -1) {
+fuse_log(FUSE_LOG_WARNING, "Failed to reset fscreate. err=%d\n", errno);
+}
+close(fd);
+return;
+}
+
+static int do_mknod_symlink_secctx(fuse_req_t req, struct lo_inode *dir,
+   const char *name, const char *secctx_name)
+{
+int path_fd, err;
+char procname[64];
+struct lo_data *lo = lo_data(req);
+
+if (!req->secctx.ctxlen) {
+return 0;
+}
+
+/* Open newly created element with O_PATH */
+path_fd = openat(dir->fd, name, O_PATH | O_NOFOLLOW);
+err = path_fd == -1 ? errno : 0;
+if (err) {
+return err;
+}
+sprintf(procname, "%i", path_fd);
+FCHDIR_NOFAIL(lo->proc_self_fd);
+/* Set security context. This is not atomic w.r.t file creation */
+err = setxattr(procname, secctx_name, req->secctx.ctx, req->secctx.ctxlen,
+   0);
+if (err) {
+err = errno;
+}
+FCHDIR_NOFAIL(lo->root.fd);
+close(path_fd);
+return err;
+}
+
+static int do_mknod_symlink(fuse_req_t req, struct lo_inode *dir,
+const char *name, mode_t mode, dev_t rdev,
+const char *link)
+{
+int err, fscreate_fd = -1;
+const char *secctx_name = req->secctx.name;
+struct lo_cred old = {};
+struct lo_data *lo = lo_data(req);
+char *mapped_name =

[PATCH v2 2/6] virtiofsd, fuse_lowlevel.c: Add capability to parse security context

2021-10-14 Thread Vivek Goyal

Add capability to enable and parse the security context sent by the client
and store it in fuse_req. Filesystems can now get the security context from
the request and set it on files during creation.
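
To make the parsing step concrete, here is a self-contained sketch that walks
a flat byte buffer in the order parse_secctx_fill_req() below consumes it: a
count header, one per-context header, a NUL-terminated xattr name, then the
context bytes. The struct and function names here are hypothetical stand-ins
(the real code uses fuse_mbuf_iter and the structures from the header patch),
and any alignment padding a real client might add is ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins with the same field layout as the structs in patch 1/6. */
struct secctxs_hdr { uint32_t nr_secctx; uint32_t padding; };
struct secctx_hdr  { uint32_t size;      uint32_t padding; };

struct parsed_secctx {
    const char *name;   /* xattr name, e.g. "security.selinux" */
    const void *ctx;    /* context bytes, not NUL-terminated */
    uint32_t ctxlen;
};

/* Bounds-checked cursor over the request payload. */
struct cursor { const unsigned char *p; size_t left; };

static const void *take(struct cursor *c, size_t n)
{
    const void *r = c->p;

    if (n > c->left) {
        return NULL;
    }
    c->p += n;
    c->left -= n;
    return r;
}

static const char *take_str(struct cursor *c)
{
    const unsigned char *nul = memchr(c->p, 0, c->left);

    return nul ? take(c, (size_t)(nul - c->p) + 1) : NULL;
}

/* Returns 0 on success (out->name == NULL if no context was sent). */
static int parse_secctx(const void *buf, size_t len,
                        struct parsed_secctx *out)
{
    struct cursor c = { buf, len };
    struct secctxs_hdr hdrs;
    struct secctx_hdr hdr;
    const char *name;
    const void *p, *ctx;

    assert(out != NULL);
    memset(out, 0, sizeof(*out));

    p = take(&c, sizeof(hdrs));
    if (!p) {
        return -EINVAL;
    }
    memcpy(&hdrs, p, sizeof(hdrs));
    if (hdrs.nr_secctx > 1) {
        return -EINVAL;         /* at most one context is supported */
    }
    if (hdrs.nr_secctx == 0) {
        return 0;               /* no LSM supplied a context */
    }

    p = take(&c, sizeof(hdr));
    if (!p) {
        return -EINVAL;
    }
    memcpy(&hdr, p, sizeof(hdr));

    name = take_str(&c);
    ctx = name ? take(&c, hdr.size) : NULL;
    if (!ctx) {
        return -EINVAL;
    }

    out->name = name;
    out->ctx = ctx;
    out->ctxlen = hdr.size;
    return 0;
}
```

A caller that gets a negative value back would presumably reply with the
positive errno and return, without going on to invoke the filesystem
operation.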

Signed-off-by: Vivek Goyal 
---
 tools/virtiofsd/fuse_common.h   |  5 ++
 tools/virtiofsd/fuse_i.h|  7 +++
 tools/virtiofsd/fuse_lowlevel.c | 91 +
 3 files changed, 103 insertions(+)

diff --git a/tools/virtiofsd/fuse_common.h b/tools/virtiofsd/fuse_common.h
index 0c2665b977..6f3485d1dc 100644
--- a/tools/virtiofsd/fuse_common.h
+++ b/tools/virtiofsd/fuse_common.h
@@ -377,6 +377,11 @@ struct fuse_file_info {
  */
 #define FUSE_CAP_SETXATTR_EXT (1 << 29)
 
+/**
+ * Indicates that file server supports creating file security context
+ */
+#define FUSE_CAP_SECURITY_CTX (1 << 30)
+
 /**
  * Ioctl flags
  *
diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 492e002181..a5572fa4ae 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -15,6 +15,12 @@
 struct fv_VuDev;
 struct fv_QueueInfo;
 
+struct fuse_security_context {
+const char *name;
+uint32_t ctxlen;
+const void *ctx;
+};
+
 struct fuse_req {
 struct fuse_session *se;
 uint64_t unique;
@@ -35,6 +41,7 @@ struct fuse_req {
 } u;
 struct fuse_req *next;
 struct fuse_req *prev;
+struct fuse_security_context secctx;
 };
 
 struct fuse_notify_req {
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index e4679c73ab..94bea4a3c9 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -886,11 +886,59 @@ static void do_readlink(fuse_req_t req, fuse_ino_t nodeid,
 }
 }
 
+static int parse_secctx_fill_req(fuse_req_t req, struct fuse_mbuf_iter *iter)
+{
+struct fuse_secctxs *fsecctxs;
+struct fuse_secctx *fsecctx;
+const void *secctx;
+const char *name;
+
+fsecctxs = fuse_mbuf_iter_advance(iter, sizeof(*fsecctxs));
+if (!fsecctxs) {
+return -EINVAL;
+}
+
+/*
+ * As of now maximum of one security context is supported. It can
+ * change in future though.
+ */
+if (fsecctxs->nr_secctx > 1) {
+return -EINVAL;
+}
+
+/* No security context sent. Maybe no LSM supports it */
+if (!fsecctxs->nr_secctx) {
+return 0;
+}
+
+fsecctx = fuse_mbuf_iter_advance(iter, sizeof(*fsecctx));
+if (!fsecctx) {
+return -EINVAL;
+}
+
+name = fuse_mbuf_iter_advance_str(iter);
+if (!name) {
+return -EINVAL;
+}
+
+secctx = fuse_mbuf_iter_advance(iter, fsecctx->size);
+if (!secctx) {
+return -EINVAL;
+}
+
+req->secctx.name = name;
+req->secctx.ctx = secctx;
+req->secctx.ctxlen = fsecctx->size;
+return 0;
+}
+
 static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
  struct fuse_mbuf_iter *iter)
 {
 struct fuse_mknod_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -901,6 +949,13 @@ static void do_mknod(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, -err);
+}
+}
+
 if (req->se->op.mknod) {
 req->se->op.mknod(req, nodeid, name, arg->mode, arg->rdev);
 } else {
@@ -913,6 +968,8 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 {
 struct fuse_mkdir_in *arg;
 const char *name;
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 arg = fuse_mbuf_iter_advance(iter, sizeof(*arg));
 name = fuse_mbuf_iter_advance_str(iter);
@@ -923,6 +980,13 @@ static void do_mkdir(fuse_req_t req, fuse_ino_t nodeid,
 
 req->ctx.umask = arg->umask;
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, err);
+}
+}
+
 if (req->se->op.mkdir) {
 req->se->op.mkdir(req, nodeid, name, arg->mode);
 } else {
@@ -969,12 +1033,21 @@ static void do_symlink(fuse_req_t req, fuse_ino_t nodeid,
 {
 const char *name = fuse_mbuf_iter_advance_str(iter);
 const char *linkname = fuse_mbuf_iter_advance_str(iter);
+bool secctx_enabled = req->se->conn.want & FUSE_CAP_SECURITY_CTX;
+int err;
 
 if (!name || !linkname) {
 fuse_reply_err(req, EINVAL);
 return;
 }
 
+if (secctx_enabled) {
+err = parse_secctx_fill_req(req, iter);
+if (err) {
+fuse_reply_err(req, err);
+}
+}
+
 if (req->se->op.symlink) {
 req->se->op.symlink(req, linkname, nodeid, name);
 } else {
@@
