Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-10-03 Thread Colin Walters



On Thu, Sep 29, 2022, at 1:03 PM, Vivek Goyal wrote:
> 
> So rust version of virtiofsd, already supports running unprivileged
> (inside a user namespace).

I know, but as I already said, the use case here is running inside an OpenShift 
unprivileged pod where *we are already in a container*.

> host$ podman unshare -- virtiofsd --socket-path=/tmp/vfsd.sock 
> --shared-dir /mnt \
> --announce-submounts --sandbox chroot &

Yes, but in current OCP 4.11 our seccomp policy denies CLONE_NEWUSER:

```
$ unshare -m
unshare: unshare failed: Function not implemented
```

https://docs.openshift.com/container-platform/4.11/security/seccomp-profiles.html

> I think only privileged operation it needs is assigning a range of
> subuid/subgid to the uid you are using on host.

We also turn on NO_NEW_PRIVILEGES by default in OCP pods.  

Now, I *could* in general get elevated permissions where I need to today.  But 
it's also really important to me to have a long term goal of having operating 
system builds and tests work well as "just another workload" in our production 
container platform (now, one *does* want to bind in /dev/kvm, but that's 
generally safe, and even that strictly speaking is optional if one can stomach 
the ~10x perf hit).

> Can you give rust virtiofsd (unprivileged) a try.

I admit to not actually trying it in a pod, but I think we all agree it can't 
work, and the only thing that can today is openat2.



Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-29 Thread Colin Walters



On Thu, Sep 29, 2022, at 10:10 AM, Vivek Goyal wrote:

> What's your use case. How do you plan to use virtiofs.

At the current time, the Kubernetes that we run does not support user 
namespaces.  We want to do the production builds of our operating system 
(Fedora CoreOS and RHEL CoreOS) today inside an *unprivileged* Kubernetes pod 
(actually in OpenShift using anyuid, i.e. random unprivileged uid too), just 
with /dev/kvm exposed from the host (which is safe).  Operating system builds 
*and* tests in qemu are just another workload that can be shared with other 
tenants.

qemu works fine in this model, as does 9p.  It's just the virtiofs isolation 
requires privileges to be used today.





Re: [Virtio-fs] virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-29 Thread Colin Walters
On Wed, Sep 28, 2022, at 3:28 PM, Vivek Goyal wrote:

> Sounds reasonable. In fact, we could probably do someting similar
> for "landlock" as well. 

Thanks for the discussion all!  Can someone (vaguely) commit to look into this 
in say the next few months?  It's not *urgent*, we can live with the 9p flakes 
and problems short term, just trying to figure out if this needs to be on our 
medium-term radar or not.  Thanks!



Re: virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-27 Thread Colin Walters



On Tue, Sep 27, 2022, at 1:27 PM, German Maglione wrote:
>
>> > Now all the development has moved to rust virtiofsd.

Oh, awesome!!  The code there looks great.

> I could work on this for the next major version and see if anything breaks.
> But I prefer to add this as a compilation feature, instead of a command line
> option that we will then have to maintain for a while.

Hmm, what would be the issue with having the code there by default?  I think 
rather than any new command line option, we automatically use 
`openat2+RESOLVE_IN_ROOT` if the process is run as a nonzero uid.

> Also, I don't see it as a sandbox feature, as Stefan mentioned, a compromised
> process can call openat2() without RESOLVE_IN_ROOT. 

I'm a bit skeptical honestly about how secure the existing namespace code is 
against a compromised virtiofsd process.  The primary worry is guest filesystem 
traversals, right?  openat2+RESOLVE_IN_ROOT addresses that.  Plus being in Rust 
makes this dramatically safer.

> I did some test with
> Landlock to lock virtiofsd inside the shared directory, but IIRC it requires a
> kernel 5.13

But yes, landlock and other things make sense, I just don't see these things as 
strongly linked.  IOW we shouldn't in my opinion block unprivileged virtiofsd 
on more sandboxing than openat2 already gives us.



virtiofsd: Any reason why there's not an "openat2" sandbox mode?

2022-09-09 Thread Colin Walters
We previously had a chat here 
https://lore.kernel.org/all/348d4774-bd5f-4832-bd7e-a21491fda...@www.fastmail.com/T/
around virtiofsd and privileges and the case of trying to run virtiofsd inside 
an unprivileged (Kubernetes) container.

Right now we're still using 9p, and it has bugs (basically it seems like the 9p 
inode flushing callback tries to allocate memory to send an RPC, and this 
causes OOM problems)
https://github.com/coreos/coreos-assembler/issues/1812

Coming back to this...as of lately in Linux, there's support for strongly 
isolated filesystem access via openat2():
https://lwn.net/Articles/796868/

Is there any reason we couldn't do an -o sandbox=openat2 ?  This operates 
without any privileges at all, and should be usable (and secure enough) in our 
use case.

I may try a patch if this sounds OK...



Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-06-17 Thread Colin Walters



On Wed, Jun 17, 2020, at 8:50 AM, Stefan Hajnoczi wrote:

> Something along these lines should work. Hopefully seccomp can be
> retained. It would also be necessary to check how not having the shared
> directory as / in the mount namespace affects functionality. For one,
> I'm pretty sure symlink escapes and similar path traversals outside the
> shared directory will be possible since virtiofsd normally relies on /
> as protection.

Yes, though two points:

- As I said, I don't care about that for my use case; the operating system 
we're testing is going to e.g. run on bare metal hosting workloads itself, so 
if it's malicious we have already lost (reliability against *accidental* damage 
is always nice though, like a stray rm -rf in some test script walking into the 
host)
- Probably the best long term fix would be to use 
https://lwn.net/Articles/796868/ anyways



Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-06-02 Thread Colin Walters



On Tue, Jun 2, 2020, at 5:55 AM, Stefan Hajnoczi wrote:
> 
> Ping Colin. It would be great if you have time to share your thoughts on
> this discussion and explain how you are using this patch.

Yeah sorry about not replying in this thread earlier, this was just a quick 
Friday side project for me and the thread obviously exploded =)

Thinking about this more, probably what would be good enough for now is an 
option to just disable internal containerization/sandboxing.  In fact per the 
discussion our production pipeline runs inside OpenShift 4 and because 
Kubernetes doesn't support user namespaces yet it also doesn't support 
recursive containerization, so we need an option to turn off the internal 
containerization.

Our use case is somewhat specialized - for what we're doing we generally trust 
the guest.  We use VMs for operating system testing and development of content 
we trust, as opposed to e.g. something like kata.

It's fine for us to run virtiofs as the same user/security context as qemu.

So...something like this?  (Only compile tested)

diff --git a/tools/virtiofsd/fuse_i.h b/tools/virtiofsd/fuse_i.h
index 1240828208..603773c505 100644
--- a/tools/virtiofsd/fuse_i.h
+++ b/tools/virtiofsd/fuse_i.h
@@ -51,6 +51,7 @@ struct fuse_session {
 int fd;
 int debug;
 int deny_others;
+int no_namespaces;
 struct fuse_lowlevel_ops op;
 int got_init;
 struct cuse_data *cuse_data;
diff --git a/tools/virtiofsd/fuse_lowlevel.c b/tools/virtiofsd/fuse_lowlevel.c
index 2dd36ec03b..263134f792 100644
--- a/tools/virtiofsd/fuse_lowlevel.c
+++ b/tools/virtiofsd/fuse_lowlevel.c
@@ -2522,6 +2522,7 @@ static const struct fuse_opt fuse_ll_opts[] = {
 LL_OPTION("-d", debug, 1),
 LL_OPTION("--debug", debug, 1),
 LL_OPTION("allow_root", deny_others, 1),
+LL_OPTION("--no-namespaces", no_namespaces, 1),
 LL_OPTION("--socket-path=%s", vu_socket_path, 0),
 LL_OPTION("--fd=%d", vu_listen_fd, 0),
 LL_OPTION("--thread-pool-size=%d", thread_pool_size, 0),
@@ -2542,6 +2543,7 @@ void fuse_lowlevel_help(void)
  */
 printf(
 "-o allow_root  allow access by root\n"
+"--no-namespacesDisable internal use of 
unshare()/clone(UNSHARE)\n"
 "--socket-path=PATH path for the vhost-user socket\n"
 "--fd=FDNUM fd number of vhost-user socket\n"
 "--thread-pool-size=NUM thread pool size limit (default %d)\n",
diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 3ba1d90984..7c54a9cde3 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2551,15 +2551,15 @@ static void setup_namespaces(struct lo_data *lo, struct 
fuse_session *se)
 char *tmpdir;
 
 /*
- * Create a new pid namespace for *child* processes.  We'll have to
- * fork in order to enter the new pid namespace.  A new mount namespace
- * is also needed so that we can remount /proc for the new pid
- * namespace.
- *
- * Our UNIX domain sockets have been created.  Now we can move to
- * an empty network namespace to prevent TCP/IP and other network
- * activity in case this process is compromised.
- */
+* Create a new pid namespace for *child* processes.  We'll have to
+* fork in order to enter the new pid namespace.  A new mount namespace
+* is also needed so that we can remount /proc for the new pid
+* namespace.
+*
+* Our UNIX domain sockets have been created.  Now we can move to
+* an empty network namespace to prevent TCP/IP and other network
+* activity in case this process is compromised.
+*/
 if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
 fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
 exit(1);
@@ -2775,6 +2775,8 @@ static void setup_capabilities(void)
 static void setup_sandbox(struct lo_data *lo, struct fuse_session *se,
   bool enable_syslog)
 {
+if (se->no_namespaces)
+return;
 setup_namespaces(lo, se);
 setup_mounts(lo->source);
 setup_seccomp(enable_syslog);



Re: [PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-04 Thread Colin Walters



On Mon, May 4, 2020, at 10:07 AM, Marc-André Lureau wrote:

> Now that systemd-nspawn works without privileges, isn't that also a
> solution? One that would fit both system and session level
> permissions, and integration with other services?

This is a complex topic and one I should probably write up in the bubblewrap 
README.md.  Today for example for CoreOS, our build and CI processes run inside 
OpenShift (Kubernetes) - we aren't running systemd inside our containers.

bubblewrap is a small self-contained C wrapper around the container system 
calls basically.  In contrast, AFAICS right now, nspawn requires systemd - 
which won't work for our use case.

Really the contention point here is systemd's dependency on cgroups for process 
tracking; in a "nested containerization" scenario you often just want the 
cgroups from the "outer" container to apply.  But having nested mounts/pid 
namespaces are still very useful.  (That said, cgroups v2 allows sane nesting, 
but we aren't there yet)

Also related is https://github.com/kubernetes/enhancements/issues/127 - without 
that one requires privileged containers to do nesting.

Now honestly, probably an even easier fix is `virtiofsd --disable-sandboxing` 
because we fully trust the code running in these VMs.

Or to directly respond again to your proposal: systemd-nspawn as an option may 
work for some cases but won't for mine (I don't want virtiofsd/qemu instances 
to "escape" the build container or run separately).




[PATCH] virtiofsd: Use clone() and not unshare(), support non-root

2020-05-01 Thread Colin Walters
I'd like to make use of virtiofs as part of our tooling in
https://github.com/coreos/coreos-assembler
Most of the code runs as non-root today; qemu also runs as non-root.
We use 9p right now.

virtiofsd's builtin sandboxing effectively assumes it runs as
root.

First, change the code to use `clone()` and not `unshare()+fork()`.

Next, automatically use `CLONE_NEWUSER` if we're running as non root.

This is similar logic to that in https://github.com/containers/bubblewrap
(Which...BTW, it could make sense for virtiofs to depend on bubblewrap
 and re-exec itself rather than re-implementing the containerization
 itself)

Signed-off-by: Colin Walters 
---
 tools/virtiofsd/passthrough_ll.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/tools/virtiofsd/passthrough_ll.c b/tools/virtiofsd/passthrough_ll.c
index 4c35c95b25..468617f6d6 100644
--- a/tools/virtiofsd/passthrough_ll.c
+++ b/tools/virtiofsd/passthrough_ll.c
@@ -2530,6 +2530,21 @@ static void print_capabilities(void)
 printf("}\n");
 }
 
+/* Copied from bubblewrap */
+static int
+raw_clone(unsigned long flags, void *child_stack)
+{
+#if defined(__s390__) || defined(__CRIS__)
+  /*
+   * On s390 and cris the order of the first and second arguments
+   * of the raw clone() system call is reversed.
+   */
+return (int) syscall(__NR_clone, child_stack, flags);
+#else
+return (int) syscall(__NR_clone, flags, child_stack);
+#endif
+}
+
 /*
  * Move to a new mount, net, and pid namespaces to isolate this process.
  */
@@ -2547,14 +2562,15 @@ static void setup_namespaces(struct lo_data *lo, struct 
fuse_session *se)
  * an empty network namespace to prevent TCP/IP and other network
  * activity in case this process is compromised.
  */
-if (unshare(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET) != 0) {
-fuse_log(FUSE_LOG_ERR, "unshare(CLONE_NEWPID | CLONE_NEWNS): %m\n");
-exit(1);
+int clone_flags = SIGCHLD | CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET;
+/* If we're non root, we need a new user namespace */
+if (getuid() != 0) {
+clone_flags |= CLONE_NEWUSER;
 }
 
-child = fork();
+child = raw_clone(clone_flags, NULL);
 if (child < 0) {
-fuse_log(FUSE_LOG_ERR, "fork() failed: %m\n");
+fuse_log(FUSE_LOG_ERR, "clone() failed: %m\n");
 exit(1);
 }
 if (child > 0) {
-- 
2.24.1