Re: [PATCH] uts_namespace: Move boot_id in uts namespace
On 04/05/2018 03:35 AM, Eric W. Biederman wrote:
> Marian Marinov <ker...@kyup.com> writes:
>
>> On 04/04/2018 07:02 PM, Eric W. Biederman wrote:
>>> Angel Shtilianov <ker...@kyup.com> writes:
>>>
>>>> Currently the same boot_id is reported for all containers running
>>>> on a host node, including the host node itself. Even after restarting
>>>> a container it will still have the same persistent boot_id.
>>>>
>>>> This can cause trouble in cases where you have multiple containers
>>>> from the same cluster on one host node. The software inside each
>>>> container will get the same boot_id and thus fail to join the cluster
>>>> after the first container from the node has already joined.
>>>>
>>>> The UTS namespace, on the other hand, keeps the machine-specific data,
>>>> so it seems to be the correct place to move the boot_id and
>>>> instantiate it, so each container will have a unique id for its own
>>>> boot lifetime, if it has its own UTS namespace.
>>>
>>> Technically this really needs to use the sysctl infrastructure that
>>> allows you to register different files in different namespaces. That
>>> way the value you read from proc_do_uuid will be based on who opens
>>> the file, not on who is reading the file.
>>
>> Ok, so would you accept a patch that reimplements boot_id through the
>> sysctl infrastructure?
>
> Assuming I am convinced this makes sense to do on the semantic level.
>
>>> Practically, why doesn't a bind mount on top of boot_id work? What
>>> makes this a general problem worth solving in the kernel? Why is
>>> hiding the fact that you are running the same instance of the same
>>> kernel a useful thing? That is the reality.
>>
>> The problem is that the distros do not know that they are in a
>> container and don't know that they have to bind mount something on top
>> of boot_id. You would need to tell Docker, LXC/LXD and all other
>> container runtimes that they need to do this bind mount for boot_id.
>
> Yes. Anything like this is the responsibility of the container runtime
> one way or another. Magic to get around fixing the small set of
> container runtimes you care about is a questionable activity.
>
>> I consider this to be a general issue that lacks a good general
>> solution in userspace. The kernel is providing this boot_id
>> interface, but it is giving wrong data in the context of containers.
>
> I disagree. Namespaces have never been about hiding that you are on a
> given machine or a single machine. They are about encapsulating global
> identifiers so that process migration can happen, and so that processes
> can be better isolated. The boot_id is not an identifier of an object
> in the kernel at all, and it is designed to be truly globally unique
> across time and space, so I am not at all certain that it makes the
> least bit of sense to do anything with a boot_id.
>
> That said, my memory of boot_id is that it was added so that emacs (and
> related programs) could create lock files on nfs and could tell if the
> current machine owns the file, and if so be able to tell if the owner
> of the lock file is alive.
>
> So there is an argument to be made that boot_id is too coarse. That
> argument suggests that boot_id is a pid_namespace property.
>
> I have not looked at the users of boot_id, and I don't have a
> definition of boot_id that makes me think it is too coarse.
>
> If you can provide a clear description of what the semantics are and
> what they should be for boot_id, showing how boot_id fits into a
> namespace and making it clear what should happen with
> checkpoint/restart, we can definitely look at changing how the kernel
> supports boot_id.
>
> The reason I suggested the bind mount is that there are lots of cases
> where people want to lie to applications about the reality of what is
> going on, for whatever reason, and we leave those lies to userspace.
> Things like changing the contents of /proc/cpuinfo.

Even though we have previously sent patches that are supposed to lie to
the software running inside a container, this one is not of that sort.

You are pointing to Emacs, and I'm talking about Consul
(https://www.consul.io/). Systemd is also using boot_id:
"journalctl --list-boots" actually uses the boot_id to distinguish
between different boots:

-2 caf0524a1d394ce0bdbcff75b9fe     Tue 2015-02-03 21:48:52 UTC—Tue 2015-02-03 22:17:00 UTC
-1 13883d180dc0420db0abcb5fa26d6198 Tue 2015-02-03 22:17:03 UTC—Tue 2015-02-03 22:19:08 UTC
 0 bed718b17a73415fade0e4e7f4bea609 Tue 2015-02-03 22:19:12 UTC—Tue 2015-02-03 23:01:01 UTC

There may be a lot of other software…
Re: [PATCH] uts_namespace: Move boot_id in uts namespace
On 04/04/2018 07:02 PM, Eric W. Biederman wrote:
> Angel Shtilianov <ker...@kyup.com> writes:
>
>> Currently the same boot_id is reported for all containers running
>> on a host node, including the host node itself. Even after restarting
>> a container it will still have the same persistent boot_id.
>>
>> This can cause trouble in cases where you have multiple containers
>> from the same cluster on one host node. The software inside each
>> container will get the same boot_id and thus fail to join the cluster
>> after the first container from the node has already joined.
>>
>> The UTS namespace, on the other hand, keeps the machine-specific data,
>> so it seems to be the correct place to move the boot_id and
>> instantiate it, so each container will have a unique id for its own
>> boot lifetime, if it has its own UTS namespace.
>
> Technically this really needs to use the sysctl infrastructure that
> allows you to register different files in different namespaces. That
> way the value you read from proc_do_uuid will be based on who opens
> the file, not on who is reading the file.

Ok, so would you accept a patch that reimplements boot_id through the
sysctl infrastructure?

> Practically, why doesn't a bind mount on top of boot_id work? What
> makes this a general problem worth solving in the kernel? Why is
> hiding the fact that you are running the same instance of the same
> kernel a useful thing? That is the reality.

The problem is that the distros do not know that they are in a container
and don't know that they have to bind mount something on top of boot_id.
You would need to tell Docker, LXC/LXD and all other container runtimes
that they need to do this bind mount for boot_id.

I consider this to be a general issue that lacks a good general solution
in userspace. The kernel is providing this boot_id interface, but it is
giving wrong data in the context of containers. Proposing to fix this
problem in userspace seems like ignoring the issue.

You could have said to the Consul guys that they should simply stop
using boot_id, because it doesn't work correctly in containers.

Marian

> Eric
>
>> Signed-off-by: Angel Shtilianov <ker...@kyup.com>
>> ---
>>  drivers/char/random.c   | 4 ++++
>>  include/linux/utsname.h | 1 +
>>  kernel/utsname.c        | 4 +++-
>>  3 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/char/random.c b/drivers/char/random.c
>> index ec42c8bb9b0d..e05daf7f38f4 100644
>> --- a/drivers/char/random.c
>> +++ b/drivers/char/random.c
>> @@ -1960,6 +1960,10 @@ static int proc_do_uuid(struct ctl_table *table, int write,
>>  	unsigned char buf[64], tmp_uuid[16], *uuid;
>>
>>  	uuid = table->data;
>> +#ifdef CONFIG_UTS_NS
>> +	if (!!uuid && (uuid == (unsigned char *)sysctl_bootid))
>> +		uuid = current->nsproxy->uts_ns->sysctl_bootid;
>> +#endif
>>  	if (!uuid) {
>>  		uuid = tmp_uuid;
>>  		generate_random_uuid(uuid);
>> diff --git a/include/linux/utsname.h b/include/linux/utsname.h
>> index c8060c2ecd04..f704aca3e95a 100644
>> --- a/include/linux/utsname.h
>> +++ b/include/linux/utsname.h
>> @@ -27,6 +27,7 @@ struct uts_namespace {
>>  	struct user_namespace *user_ns;
>>  	struct ucounts *ucounts;
>>  	struct ns_common ns;
>> +	char sysctl_bootid[16];
>>  } __randomize_layout;
>>  extern struct uts_namespace init_uts_ns;
>>
>> diff --git a/kernel/utsname.c b/kernel/utsname.c
>> index 913fe4336d2b..f1749cdcd341 100644
>> --- a/kernel/utsname.c
>> +++ b/kernel/utsname.c
>> @@ -34,8 +34,10 @@ static struct uts_namespace *create_uts_ns(void)
>>  	struct uts_namespace *uts_ns;
>>
>>  	uts_ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL);
>> -	if (uts_ns)
>> +	if (uts_ns) {
>>  		kref_init(&uts_ns->kref);
>> +		memset(uts_ns->sysctl_bootid, 0, 16);
>> +	}
>>  	return uts_ns;
>>  }

--
Marian Marinov
Co-founder & CTO of Kyup.com
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
IRC: hackman @ irc.freenode.net
Mobile: +359 886 660 270
Re: Patch breaks suspend
On 05/05/2015 02:37 AM, Rafael J. Wysocki wrote:
> On Saturday, May 02, 2015 11:27:32 PM Marian Marinov wrote:
>> Hi guys,
>> I have a Lenovo T520 with one SSD and one SATA drive.
>>
>> I tried to upgrade to Linux 4.0 and found that after suspend and
>> resume I can't access the second (SATA) drive.
>> Both drives have BIOS encryption enabled.
>>
>> I did a bisect and found that the following patch causes the issue:
>>
>> commit 5d5132059a1f652de9dc2d62a8ff15561e648d11
>> Author: Rafael J. Wysocki
>> Date:   Sat Feb 22 00:48:31 2014 +0100
>>
>>     ACPI / ATA: Add hotplug contexts to ACPI companions of SATA devices
>>
>>     Modify the SATA subsystem to add hotplug contexts to ACPI companions
>>     of SATA devices and ports instead of registering special ACPI dock
>>     operations using register_hotplug_dock_device().
>>
>>     That change will allow the entire code handling those special ACPI
>>     dock operations to be dropped in the next commit.
>>
>>     Signed-off-by: Rafael J. Wysocki
>>     Reviewed-by: Aaron Lu
>>     Acked-by: Tejun Heo
>>
>> Unfortunately I do not understand this part of the code and have no
>> idea what I can do.
>> Any pointers would be very much appreciated.
>
> Well, not right from the top of my head, but this looks really
> suspicious to me.
>
> Can you please file a bug entry for this at bugzilla.kernel.org (in the
> ACPI/BIOS category), assign it to me and CC Aaron?

Bug created: https://bugzilla.kernel.org/show_bug.cgi?id=97731
I added you and Aaron to the CC list.

What additional info can I provide? Would you like any debug info from
the kernel itself? Dmesg output?

Marian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Patch breaks suspend
Hi guys,
I have a Lenovo T520 with one SSD and one SATA drive.

I tried to upgrade to Linux 4.0 and found that after suspend and resume
I can't access the second (SATA) drive.
Both drives have BIOS encryption enabled.

I did a bisect and found that the following patch causes the issue:

commit 5d5132059a1f652de9dc2d62a8ff15561e648d11
Author: Rafael J. Wysocki
Date:   Sat Feb 22 00:48:31 2014 +0100

    ACPI / ATA: Add hotplug contexts to ACPI companions of SATA devices

    Modify the SATA subsystem to add hotplug contexts to ACPI companions
    of SATA devices and ports instead of registering special ACPI dock
    operations using register_hotplug_dock_device().

    That change will allow the entire code handling those special ACPI
    dock operations to be dropped in the next commit.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Aaron Lu
    Acked-by: Tejun Heo

Unfortunately I do not understand this part of the code and have no idea
what I can do.
Any pointers would be very much appreciated.

Best regards,
Marian
Re: [PATCH v3 2/2] cgroups: add a pids subsystem
Hi Aleksa,
would you be willing to put your patches online in a repo, like what
Dwight Engen did 3 years ago?
https://github.com/dwengen/linux/tree/cpuacct-task-limit-3.14

I have been using his patchset for more than a year now. However, I
would be happy to experiment with your patches as well. And having a
public repo from which I can pull is extremely good for people like me:
it is easier for testing and long-term support.

Thank you for your work on this matter,
Marian

On 03/04/2015 10:23 PM, Aleksa Sarai wrote:
> Adds a new single-purpose pids subsystem to limit the number of
> tasks that can run inside a cgroup. Essentially this is an
> implementation of RLIMIT_NPROC that applies to a cgroup rather than
> a process tree.
>
> PIDs are fundamentally a global resource, and it is possible to reach
> PID exhaustion inside a cgroup without hitting any reasonable kmemcg
> policy. Once you've hit PID exhaustion, you're only in a marginally
> better state than OOM. This subsystem allows PID exhaustion inside a
> cgroup to be prevented.
>
> Signed-off-by: Aleksa Sarai <cyp...@cyphar.com>
> ---
>  include/linux/cgroup_subsys.h |   4 +
>  init/Kconfig                  |  12 ++
>  kernel/Makefile               |   1 +
>  kernel/cgroup_pids.c          | 281 ++++++++++++++++++++++++++++++++++
>  4 files changed, 298 insertions(+)
>  create mode 100644 kernel/cgroup_pids.c
>
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index e4a96fb..a198822 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -47,6 +47,10 @@ SUBSYS(net_prio)
>  SUBSYS(hugetlb)
>  #endif
>
> +#if IS_ENABLED(CONFIG_CGROUP_PIDS)
> +SUBSYS(pids)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index f5dbc6d..58f104a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1054,6 +1054,18 @@ config CGROUP_HUGETLB
>  	  control group is tracked in the third page lru pointer. This means
>  	  that we cannot use the controller with huge page less than 3 pages.
>
> +config CGROUP_PIDS
> +	bool "Process number limiting on cgroups"
> +	depends on PAGE_COUNTER
> +	help
> +	  This option enables the setting of process number limits in the scope
> +	  of a cgroup. Any attempt to fork more processes than is allowed in the
> +	  cgroup will fail. PIDs are fundamentally a global resource because it
> +	  is fairly trivial to reach PID exhaustion before you reach even a
> +	  conservative kmemcg limit. As a result, it is possible to grind a
> +	  system to a halt without being limited by other cgroup policies. The
> +	  pids cgroup subsystem is designed to stop this from happening.
> +
>  config CGROUP_PERF
>  	bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>  	depends on PERF_EVENTS && CGROUPS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 1408b33..e823592 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
>  obj-$(CONFIG_CGROUPS) += cgroup.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
> +obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
>  obj-$(CONFIG_USER_NS) += user_namespace.o
> diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
> new file mode 100644
> index 0000000..65cbab3
> --- /dev/null
> +++ b/kernel/cgroup_pids.c
> @@ -0,0 +1,281 @@
> +/*
> + * Process number limiting subsys for cgroups.
> + *
> + * Copyright (C) 2015 Aleksa Sarai <cyp...@cyphar.com>
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/atomic.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +
> +#define PIDS_UNLIMITED -1
> +
> +struct pids {
> +	struct pids *parent;
> +	struct cgroup_subsys_state css;
> +
> +	atomic_long_t counter;
> +	long limit;
> +};
> +
> +static inline struct pids *css_pids(struct cgroup_subsys_state *css)
> +{
> +	return css ? container_of(css, struct pids, css) : NULL;
> +}
> +
> +static inline struct pids *task_pids(struct task_struct *task)
> +{
> +	return css_pids(task_css(task, pids_cgrp_id));
> +}
> +
> +static struct pids *parent_pids(struct pids *pids)
> +{
> +	return css_pids(pids->css.parent);
> +}
> +
> +static struct cgroup_subsys_state *
> +pids_css_alloc(struct cgroup_subsys_state *parent)
> +{
> +	struct pids *pids;
> +
> +	pids = kzalloc(sizeof(struct pids), GFP_KERNEL);
> +	if (!pids)
> +		return ERR_PTR(-ENOMEM);
> +
> +	return &pids->css;
> +}
> +
> +static int pids_css_online(struct cgroup_subsys_state *css)
> +{
> +	struct pids *pids = css_pids(css);
> +	long limit = -1;
> +
> +	pids->parent = parent_pids(pids);
> +	if (pids->parent)
> +		limit = pids->parent->limit;
> +
> +	pids->limit = limit;
> +	atomic_long_set(&pids->counter, 0);
> +	return 0;
> +}
> +
> +static void pids_css_free(struct cgroup_subsys_state *css)
Re: RFC: sysctl fixes
On 03/04/2015 05:48 PM, Stephan Mueller wrote:
> On Wednesday, 4 March 2015, 15:18:19, Marian Marinov wrote:
>
> Hi Marian,
>
>> Hello,
>> we are running hundreds of containers and we got some valid requests
>> from customers that want to be able to change their container's
>> hostname using the sysctl command.
>>
>> The current implementation does not allow each UTS namespace to change
>> its own hostname.
>>
>> So we sponsored the development of a fix for this issue.
>>
>> Please check out the changes here:
>>
>> https://github.com/1HLtd/linux/compare/b24e2bdde4af656bb0679a101265ebb8f8735d3c...sysctl-hostname-fixes
>>
>> If this fix is acceptable I'll send the patches in the usual way. But
>> first I want to know if you think this is the way to go.
>
> The description in http://lwn.net/Articles/531381/ does not work?

Stephan, the users can change the hostname of the container using the hostname command. I can also change their hostname when I'm creating the containers.

However, the customers need to be able to change the hostname of the machine using the standard system commands:

sysctl kernel.hostname=newhost
sysctl kernel.domainname=newdomain

Both of these commands talk to the kernel via the /proc filesystem. This is why we propose these fixes. Without them, only root on the host node can do this.

Correct me if I'm wrong.

Marian

> Ciao
> Stephan

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
RFC: sysctl fixes
Hello,

we are running hundreds of containers and we got some valid requests from customers that want to be able to change their container's hostname using the sysctl command.

The current implementation does not allow each UTS namespace to change its own hostname.

So we sponsored the development of a fix for this issue. Please check out the changes here:

https://github.com/1HLtd/linux/compare/b24e2bdde4af656bb0679a101265ebb8f8735d3c...sysctl-hostname-fixes

If this fix is acceptable I'll send the patches in the usual way. But first I want to know if you think this is the way to go.

Thank you,
Marian
Re: [RFC] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
On 06/10/2014 02:52 PM, Michal Hocko wrote:
> [More people to CC]
> On Tue 10-06-14 14:35:02, Marian Marinov wrote:
>> Hello,
>
> Hi,
>
>> a while back in 2012 there was a request for this functionality: oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
>>
>> This is the thread: https://lkml.org/lkml/2012/10/16/168
>>
>> Now we run several machines with around 10k processes on each machine, using containers.
>>
>> Regularly we see OOM from within a container that causes performance degradation.
>
> What kind of performance degradation and which parts of the system are affected?

The responsiveness of SSH terminals and DB queries on the host machine is significantly slowed. I'm still unsure what exactly to measure.

> memcg oom killer happens outside of any locks currently so the only bottleneck I can see is the per-cgroup container which iterates all tasks in the group. Is this what is going on here?

When the container has 1000s of processes it seems that there is a problem. But I'm not sure, and will be happy to put some diagnostic lines of code there.

>> We are running 3.12.20 with the following OOM configuration and memcg oom enabled:
>>
>> vm.oom_dump_tasks = 0
>> vm.oom_kill_allocating_task = 1
>> vm.panic_on_oom = 0
>>
>> When OOM occurs we see very high numbers for the loadavg and the overall responsiveness of the machine degrades.
>
> What is the system waiting for?

I don't know, since I was not the one to actually handle the case. However, my guys are instructed to collect iostat and vmstat information from the machines the next time this happens.

>> During these OOM states the load of the machine gradually increases from 25 up to 120 in the interval of 10 minutes.
>>
>> Once we manually bring down the memory usage of a container (killing some tasks) the load drops down to 25 within 5 to 7 minutes.
>
> So the OOM killer is not able to find a victim to kill?

It was constantly killing tasks: 245 oom invocations in less than 6 minutes for that particular cgroup, with a peak of 61 oom invocations in one minute. It was killing...

In that particular case, the problem was a web server that was under attack. New php processes were spawned very often, and instead of killing each newly created process (which is allocating memory) the kernel tries to find a more suitable task. Which in this case was not desired.

>> I read the whole thread from 2012 but I do not see the expected behavior that is described by the people that commented on the issue.
>
> Why do you think that killing the allocating task would be helpful in your case?

As mentioned above, the usual case with hosting companies is that the allocating task should not be allowed to run, so killing it is the proper solution there. Essentially we solved the issue by setting a process limit on that particular cgroup using the task-limit patches of Dwight Engen.

>> In this case, with real usage for this patch, would it be considered for inclusion?
>
> I would still prefer to fix the real issue which is not clear from your description yet.

I would love to have a better way to solve the issue.

Marian

--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270
[RFC] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening
Hello,

a while back in 2012 there was a request for this functionality: oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening

This is the thread: https://lkml.org/lkml/2012/10/16/168

Now we run several machines with around 10k processes on each machine, using containers. Regularly we see OOM from within a container that causes performance degradation.

We are running 3.12.20 with the following OOM configuration and memcg oom enabled:

vm.oom_dump_tasks = 0
vm.oom_kill_allocating_task = 1
vm.panic_on_oom = 0

When OOM occurs we see very high numbers for the loadavg and the overall responsiveness of the machine degrades. During these OOM states the load of the machine gradually increases from 25 up to 120 in the interval of 10 minutes. Once we manually bring down the memory usage of a container (killing some tasks) the load drops down to 25 within 5 to 7 minutes.

I read the whole thread from 2012 but I do not see the expected behavior that is described by the people that commented on the issue.

In this case, with real usage for this patch, would it be considered for inclusion?

Marian
Re: [RFC] Per-user namespace process accounting
On 06/03/2014 08:54 PM, Eric W. Biederman wrote:
> Serge Hallyn writes:
>
>> Quoting Pavel Emelyanov (xe...@parallels.com):
>>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>>>> Quoting Marian Marinov (m...@1h.com):
>>>>> We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes, it can be done and no, I do not believe we should go backwards.
>>>>>
>>>>> We do not share filesystems between containers, we offer them block devices.
>>>>
>>>> Yes, this is a real nuisance for openstack style deployments.
>>>>
>>>> One nice solution to this imo would be a very thin stackable filesystem which does uid shifting, or, better yet, a non-stackable way of shifting uids at mount.
>>>
>>> I vote for the non-stackable way too. Maybe on the generic VFS level so that filesystems don't bother with it. From what I've seen, even simple stacking is quite a challenge.
>>
>> Do you have any ideas for how to go about it? It seems like we'd have to have separate inodes per mapping for each file, which is why of course stacking seems "natural" here.
>>
>> Trying to catch the uid/gid at every kernel-userspace crossing seems like a design regression from the current userns approach. I suppose we could continue in the kuid theme and introduce an iuid/igid for the in-kernel inode uid/gid owners. Then allow a user privileged in some ns to create a new mount associated with a different mapping for any ids over which he is privileged.
>
> There is a simple solution.
>
> We pick the filesystems we choose to support. We add privileged mounting in a user namespace. We create the user and mount namespace. Global root goes into the target mount namespace with setns and performs the mounts.
>
> 90% of that work is already done.
> As long as we don't plan to support XFS (as XFS likes to expose its implementation details to userspace) it should be quite straightforward.
>
> The permission check change would probably only need to be:
>
> @@ -2180,6 +2245,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
>  		return -ENODEV;
>  
>  	if (user_ns != &init_user_ns) {
> +		if (!(type->fs_flags & FS_UNPRIV_MOUNT) && !capable(CAP_SYS_ADMIN)) {
> +			put_filesystem(type);
> +			return -EPERM;
> +		}
>  		if (!(type->fs_flags & FS_USERNS_MOUNT)) {
>  			put_filesystem(type);
>  			return -EPERM;
>
> There are also a few funnies with capturing the user namespace of the filesystem when we perform the mount (in the superblock?), and not allowing a mount of that same filesystem in a different user namespace.
>
> But as long as the kuid conversions don't measurably slow down the filesystem when mounted in the initial mount and user namespaces I don't see how this would be a problem for anyone, and is very little code.

This may solve one of the problems, but it does not solve the issue with UID/GID maps that overlap in different user namespaces. In our case, this means breaking container migration mechanisms.

Will this be addressed at all, or am I the only one here that has this sort of requirement?

Marian

> Eric
Re: [RFC] Per-user namespace process accounting
On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
> Marian Marinov writes:
>
>> Hello,
>>
>> I have the following proposition.
>>
>> The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that multiple containers in different user namespaces share the process counters.
>
> That is deliberate.

And I understand that very well ;)

>> So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit of above 100 in order to execute any processes with its own UID 99.
>>
>> I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but this brings another problem.
>>
>> We are provisioning the containers from a template. The template has a lot of files, 500k and more. And chowning these causes a lot of I/O and also slows down provisioning considerably.
>>
>> The other problem is that when we migrate one container from one host machine to another, the IDs may be already in use on the new machine and we need to chown all the files again.
>
> You should have the same uid allocations for all machines in your fleet as much as possible. That has been true ever since NFS was invented and is not new here. You can avoid the cost of chowning if you untar your files inside of your user namespace. You can have different maps per machine if you are crazy enough to do that. You can even have shared uids that you use to share files between containers as long as none of those files is setuid. And map those shared files to some kind of nobody user in your user namespace.

We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is extremely cheap and fast. Comparing that with untar is comparing a race car with a Smart. Yes, it can be done and no, I do not believe we should go backwards.
We do not share filesystems between containers, we offer them block devices.

>> Finally, if we use different UID/GID maps we can not do live migration to another node because the UIDs may be already in use.
>>
>> So I'm proposing one hack modifying unshare_userns() to allocate a new user_struct for the cred that is created for the first task creating the user_ns, and free it in exit_creds().
>
> I do not like the idea of having user_structs be per user namespace, and deliberately made the code not work that way.
>
>> Can you please comment on that?
>
> I have been pondering having some recursive resource limits that are per user namespace, and if all you are worried about are process counts that might work. I don't honestly know what makes sense at the moment.

It seems to me that the only limits (from RLIMIT) that are generally a problem for the namespaces are the number of processes and pending signals. This is why I proposed the above modification. However, I'm not sure if the places I have chosen are right, and I'm also not really convinced that having a per-namespace user_struct is the right approach for the process counter.

> Eric

Marian
[RFC] Per-user namespace process accounting
Hello,

I have the following proposition.

The number of currently running processes is accounted in the root user namespace. The problem I'm facing is that multiple containers in different user namespaces share the process counters.

So if containerX runs 100 processes with UID 99, containerY should have an NPROC limit of above 100 in order to execute any processes with its own UID 99.

I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but this brings another problem.

We are provisioning the containers from a template. The template has a lot of files, 500k and more. And chowning these causes a lot of I/O and also slows down provisioning considerably.

The other problem is that when we migrate one container from one host machine to another, the IDs may be already in use on the new machine and we need to chown all the files again.

Finally, if we use different UID/GID maps we can not do live migration to another node because the UIDs may be already in use.

So I'm proposing one hack modifying unshare_userns() to allocate a new user_struct for the cred that is created for the first task creating the user_ns, and free it in exit_creds().

Can you please comment on that? Or suggest a better solution?

Best regards,
Marian
[RFC] Per-user namespace process accounting
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hello, I have the following proposition. Number of currently running processes is accounted at the root user namespace. The problem I'm facing is that multiple containers in different user namespaces share the process counters. So if containerX runs 100 with UID 99, containerY should have NPROC limit of above 100 in order to execute any processes with ist own UID 99. I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but this brings another problem. We are provisioning the containers from a template. The template has a lot of files 500k and more. And chowning these causes a lot of I/O and also slows down provisioning considerably. The other problem is that when we migrate one container from one host machine to another the IDs may be already in use on the new machine and we need to chown all the files again. Finally if we use different UID/GID maps we can not do live migration to another node because the UIDs may be already in use. So I'm proposing one hack modifying unshare_userns() to allocate new user_struct for the cred that is created for the first task creating the user_ns and free it in exit_creds(). Can you please comment on that? Or suggest a better solution? Best regards, Marian - -- Marian Marinov Founder CEO of 1H Ltd. Jabber/GTalk: hack...@jabber.org ICQ: 7556201 Mobile: +359 886 660 270 -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.22 (GNU/Linux) iEYEARECAAYFAlOG1Y0ACgkQ4mt9JeIbjJQREACbBrax+ztBj2Y0P2jY3qYEUY9T JJ0AnAqLj3pqFFjXCkczEydV1V0LdzQ3 =8M+P -END PGP SIGNATURE- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] Per-user namespace process accounting
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 05/29/2014 01:06 PM, Eric W. Biederman wrote: Marian Marinov m...@1h.com writes: Hello, I have the following proposition. Number of currently running processes is accounted at the root user namespace. The problem I'm facing is that multiple containers in different user namespaces share the process counters. That is deliberate. And I understand that very well ;) So if containerX runs 100 with UID 99, containerY should have NPROC limit of above 100 in order to execute any processes with ist own UID 99. I know that some of you will tell me that I should not provision all of my containers with the same UID/GID maps, but this brings another problem. We are provisioning the containers from a template. The template has a lot of files 500k and more. And chowning these causes a lot of I/O and also slows down provisioning considerably. The other problem is that when we migrate one container from one host machine to another the IDs may be already in use on the new machine and we need to chown all the files again. You should have the same uid allocations for all machines in your fleet as much as possible. That has been true ever since NFS was invented and is not new here. You can avoid the cost of chowning if you untar your files inside of your user namespace. You can have different maps per machine if you are crazy enough to do that. You can even have shared uids that you use to share files between containers as long as none of those files is setuid. And map those shared files to some kind of nobody user in your user namespace. We are not using NFS. We are using a shared block storage that offers us snapshots. So provisioning new containers is extremely cheep and fast. Comparing that with untar is comparing a race car with Smart. Yes it can be done and no, I do not believe we should go backwards. We do not share filesystems between containers, we offer them block devices. 
>> Finally, if we use different UID/GID maps we cannot do live migration to another node because the UIDs may already be in use. So I'm proposing one hack: modify unshare_userns() to allocate a new user_struct for the cred that is created for the first task creating the user_ns, and free it in exit_creds().
>
> I do not like the idea of having user_structs be per user namespace, and deliberately made the code not work that way.
>
>> Can you please comment on that?
>
> I have been pondering having some recursive resource limits that are per user namespace, and if all you are worried about are process counts that might work. I don't honestly know what makes sense at the moment.
>
> Eric

It seems to me that the only limits (from RLIMIT) that are generally a problem for the namespaces are the number of processes and pending signals. This is why I proposed the above modification. However, I'm not sure if the places I have chosen are right, and I'm also not really convinced that having a per-namespace user_struct is the right approach for the process counter.

Marian
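For context, the hack under discussion amounts to something like the following. This is a rough, non-runnable sketch of the proposal, not a real patch: unshare_userns() exists in kernel/user_namespace.c, but the alloc_uid_in_ns() helper and the reassignment are purely hypothetical (and, as noted above, Eric deliberately made the code not work this way):

```c
/* HYPOTHETICAL sketch of the proposed hack -- not actual kernel code.
 * Idea: give the first task in a new user namespace its own user_struct,
 * so RLIMIT_NPROC-style per-UID counters start from zero inside the
 * namespace instead of sharing the global kuid counters. */
static int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
{
	struct cred *cred;
	int err;
	...
	err = create_user_ns(cred);
	if (err)
		goto err_put;

	/* Proposed addition: detach from the parent's per-UID accounting.
	 * alloc_uid_in_ns() is a hypothetical helper; the user_struct it
	 * allocates would be freed in exit_creds(). */
	cred->user = alloc_uid_in_ns(cred->uid, cred->user_ns);
	...
}
```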
Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default
Hi Manfred,

On 05/25/2014 11:01 PM, Manfred Spraul wrote:
> Hi Marian,
>
> On 05/22/2014 03:01 PM, Marian Marinov wrote:
>> On 05/05/2014 10:59 PM, Marian Marinov wrote:
>>>
>>> In my tests it worked exactly as expected. Here is an example:
>>>
>>> [root@sp2 ~]# sysctl -a | grep shmmax
>>> kernel.shmmax = 68719476736
>>> [root@sp2 ~]# lxc-attach -n cent_plain
>>> [root@localhost ~]# sysctl -a | grep shmmax
>>> kernel.shmmax = 68719476736
>>> [root@localhost ~]# halt
>>> [root@sp2 ~]# sysctl -a | grep shmmax
>>> kernel.shmmax = 68719476736
>>> [root@sp2 ~]# sysctl kernel.shmmax=34359738368
>>> kernel.shmmax = 34359738368
>>> [root@sp2 ~]# lxc-start -n cent_plain -d
>>> [root@sp2 ~]# lxc-attach -n cent_plain
>>> [root@localhost ~]# sysctl -a | grep shmmax
>>> kernel.shmmax = 34359738368
>>> [root@localhost ~]#
>>>
>>> So it seems to work as expected :)
>>>
>>> It works because when you set up a new shmmax limit it is actually the limit in the init_ipc_ns. So when we are creating a new ipc_ns it is ok to copy the values from init_ipc_ns.
>>>
>>> -Marian
>>>
>> Ping?
>>
>> So will there be any more comments on that?

It seems logical. I like your approach even better. I'll test it tomorrow to confirm that all of my tests are successful.

Marian

> Attached is an untested idea:
> - each new namespace copies from its parent, i.e. nested namespaces should work.
> - msg, sem and shm updated
>
> -- Manfred

--
Marian Marinov
Founder & CEO of 1H Ltd.
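Manfred's attached idea amounts to seeding the new ipc_namespace from the creating namespace rather than from the compile-time defaults. A rough, non-runnable sketch of the shm part (the function name and fields follow the 3.x ipc/shm.c layout; the parent-copy logic is the hypothetical part, and the real patch also updates msg and sem):

```c
/* HYPOTHETICAL sketch -- seed the new namespace's shm limits from the
 * namespace of the task calling unshare(CLONE_NEWIPC)/clone(), instead
 * of the SHMMAX/SHMALL/SHMMNI compile-time defaults. */
void shm_init_ns(struct ipc_namespace *ns)
{
	struct ipc_namespace *parent = current->nsproxy->ipc_ns;

	ns->shm_ctlmax = parent ? parent->shm_ctlmax : SHMMAX;
	ns->shm_ctlall = parent ? parent->shm_ctlall : SHMALL;
	ns->shm_ctlmni = parent ? parent->shm_ctlmni : SHMMNI;
	ns->shm_rmid_forced = 0;
	ns->shm_tot = 0;
	ipc_init_ids(&shm_ids(ns));
}
```

With this shape, nested namespaces inherit whatever the admin had configured at creation time, which matches the lxc-attach transcript above.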
Re: inotify, new idea?
On 05/24/2014 03:34 PM, Richard Weinberger wrote:
> Am 24.05.2014 09:52, schrieb Michael Kerrisk (man-pages):
>> On 04/21/2014 10:42 AM, Richard Weinberger wrote:
>>> Am 21.04.2014 09:24, schrieb Michael Kerrisk:
>>>>> Does recursive monitoring even work with inotify? Last time I tried, it failed as soon as I did a mkdir -p a/b/c/d, because mkdir() raced against the thread which installs the new watches.
>>>>
>>>> As I understand it, you have to program to deal with the races (rescan directories after adding watches). I recently did a lot of work updating the inotify(7) man page to discuss all the issues that I know of, and their remedies. If I missed anything, I'd appreciate a note on it, so that it can be added. See http://man7.org/linux/man-pages/man7/inotify.7.html#NOTES
>>>
>>> I'm aware of the rescan hack, but in my case it does not help because my program must not miss any event. Currently I'm using a FUSE overlay filesystem to log everything. Not perfect, but it works... :-)
>>
>> Richard,
>>
>> A late follow-up question. How does your application deal with the event overflow problem (i.e., when you get a large number of events much faster than your application can deal with them)?
>
> The downside of the FUSE approach is that you have to intercept every filesystem function. This can be a performance issue. But due to this design the overflow problem cannot happen, as the FUSE filesystem blocks until the event has been processed.

Has anyone of you looked at this: https://github.com/1and1/linux-filemon/

I haven't stress tested it, but in the past I ported it to more recent kernels: https://github.com/hackman/filemon-patches

It is not polished, but it works. I'm considering fixing some of the issues it has, for use in my home setup.
Marian

> Thanks,
> //richard
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> Quoting Andy Lutomirski (l...@amacapital.net):
>> On May 15, 2014 1:26 PM, "Serge E. Hallyn" wrote:
>>>
>>> Quoting Richard Weinberger (rich...@nod.at):
>>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
>>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman wrote:
>>>>>>> Then don't use a container to build such a thing, or fix the build scripts to not do that :)
>>>>>>
>>>>>> I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would fit much better. Please don't put more complexity into containers. They are already horribly complex and error prone.
>>>>>
>>>>> I, naturally, disagree :) The only use case which is inherently not valid for containers is running a kernel. Practically speaking there are other things which likely will never be possible, but if someone offers a way to do something in containers, "you can't do that in containers" is not an apropos response.
>>>>>
>>>>> "That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected, resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem.
>>>>>
>>>>> Finally, saying "containers are complex and error prone" is conflating several large suites of userspace code and many kernel features which support them. Being more precise would, if the argument is valid, lend it a lot more weight.
>>>>
>>>> We (my company) have used Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the internals better I also wrote my own userspace to create/start containers.
>>>> There are so many things which can hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. suddenly a user is allowed to mount filesystems.
>>>
>>> That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the kernel.
>>>
>>>> Ask Andy, he found already lots of nasty things...
>>
>> I don't think I have anything brilliant to add to this discussion right now, except possibly:
>>
>> ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an untrusted user can cause a block device to appear. That user doesn't need permission to mount it
>
> Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in the container does not also show up in the host.

Can I suggest the usage of the devices cgroup to achieve that?

Marian

>> or even necessarily to change its contents on the fly.
>>
>> E.g. what happens if you boot a machine that contains a malicious disk image that has the same partition UUID as /? Nothing good, I imagine.
>>
>> So if we're going to go down this road, we really need some way to tell the host that certain devices are not trusted.
>>
>> --Andy
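For reference, the devices-cgroup restriction suggested above looks roughly like this in the v1 hierarchy. This is an illustrative, privileged configuration fragment, not something to run as-is: the container name and cgroup mount point are assumptions, and loop devices use block major 7:

```sh
# Deny the container all access to loop devices (block major 7);
# 'lxc/mycontainer' is an illustrative cgroup path.
echo 'b 7:* rwm' > /sys/fs/cgroup/devices/lxc/mycontainer/devices.deny

# Optionally allow one specific device back, e.g. /dev/loop3:
echo 'b 7:3 rwm' > /sys/fs/cgroup/devices/lxc/mycontainer/devices.allow
```

Note the devices cgroup mediates open()/mknod() on device nodes; it does not by itself prevent a loop device created in the container from being visible on the host, which is the concern Serge raises.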
Re: [RFC PATCH 11/11] loop: Allow priveleged operations for root in the namespace which owns a device
One question about this patch. Why don't you use the devices cgroup to check whether the root user in that namespace is allowed to use this device? This way you can be sure that the root in that namespace cannot access devices to which the host system did not give it access.

Marian

On 05/15/2014 12:34 AM, Seth Forshee wrote:
> Signed-off-by: Seth Forshee
> ---
>  drivers/block/loop.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 66bd938bcc1c..2cc19868ea0d 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1074,7 +1074,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
>
>  	if (lo->lo_encrypt_key_size && !uid_eq(lo->lo_key_owner, uid) &&
> -	    !capable(CAP_SYS_ADMIN))
> +	    !ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  	if (lo->lo_state != Lo_bound)
>  		return -ENXIO;
> @@ -1164,7 +1164,8 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
>  	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
>  	info->lo_encrypt_type = lo->lo_encryption ? lo->lo_encryption->number : 0;
> -	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
> +	if (lo->lo_encrypt_key_size &&
> +	    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN)) {
>  		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
>  		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key, lo->lo_encrypt_key_size);
> @@ -1309,7 +1310,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		break;
>  	case LOOP_SET_STATUS:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> +		if ((mode & FMODE_WRITE) ||
> +		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  			err = loop_set_status_old(lo, (struct loop_info __user *)arg);
>  		break;
> @@ -1318,7 +1320,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		break;
>  	case LOOP_SET_STATUS64:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> +		if ((mode & FMODE_WRITE) ||
> +		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  			err = loop_set_status64(lo, (struct loop_info64 __user *) arg);
>  		break;
> @@ -1327,7 +1330,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		break;
>  	case LOOP_SET_CAPACITY:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> +		if ((mode & FMODE_WRITE) ||
> +		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  			err = loop_set_capacity(lo, bdev);
>  		break;
>  	default:
Re: Pondering per-process vsyscall disablement
On 05/23/2014 02:04 AM, Andy Lutomirski wrote:
> It would be nice to have a way for new programs to declare that they don't need vsyscalls. What's the right way to do this? An ELF header entry in the loader? An ELF header entry in the program? A new arch_prctl?
>
> As background, there's an old part of the x86_64 ABI that allows programs to do gettimeofday, clock_gettime, and getcpu by calling to fixed addresses of the form 0xff600n00, where n indicates which of those three syscalls is being invoked. This is a security issue.
>
> Since Linux 3.1, vsyscalls are emulated using NX and page faults. As a result, vsyscalls no longer offer any performance advantage over normal syscalls; in fact, they're much slower. As far as I know, nothing newer than 2012 will attempt to use vsyscalls if a vdso is present. (Sadly, a lot of things will still fall back to the vsyscall page if there is no vdso, but that shouldn't matter, since there is always a vdso.)
>
> Despite the emulation, they could still be used as a weird form of ROP gadget that lives at a fixed address. I'd like to offer a way for new runtimes to indicate that they don't use vsyscalls so that the kernel can selectively disable emulation and remove the fixed-address executable code issue.

Wouldn't it be more useful if the check were against a bitmask added as an extended attribute of the executable? This way administrators will have the flexibility to simply add the new attribute to an existing executable.

Marian

> --Andy
Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default
On 05/05/2014 10:59 PM, Marian Marinov wrote:
> On 05/04/2014 02:17 PM, Manfred Spraul wrote:
>> Hi Marian,
>>
>> Note: The limits will soon be increased to (nearly) ULONG_MAX. I.e.: if you propose the patch because you are running into issues with a too-small SEMMAX after an unshare(CLONE_NEWIPC), then this will be fixed soon.
>>
>> On 05/04/2014 01:53 AM, Davidlohr Bueso wrote:
>>> On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote:
>>>> When we are creating a new IPC namespace that should be cloned from the current namespace, it is a good idea to copy the values of the current shmmax and shmall to the new namespace.
>>
>> The idea sounds reasonable: if an admin has reduced the limits, then the reduction should also apply after an unshare(CLONE_NEWIPC).
>>
>> But: your patch doesn't use the current shmmax, it uses the shmmax from init_ipc_ns. Would it be possible to use the current values?
>
> In my tests it worked exactly as expected. Here is an example:
>
> [root@sp2 ~]# sysctl -a | grep shmmax
> kernel.shmmax = 68719476736
> [root@sp2 ~]# lxc-attach -n cent_plain
> [root@localhost ~]# sysctl -a | grep shmmax
> kernel.shmmax = 68719476736
> [root@localhost ~]# halt
> [root@sp2 ~]# sysctl -a | grep shmmax
> kernel.shmmax = 68719476736
> [root@sp2 ~]# sysctl kernel.shmmax=34359738368
> kernel.shmmax = 34359738368
> [root@sp2 ~]# lxc-start -n cent_plain -d
> [root@sp2 ~]# lxc-attach -n cent_plain
> [root@localhost ~]# sysctl -a | grep shmmax
> kernel.shmmax = 34359738368
> [root@localhost ~]#
>
> So it seems to work as expected :)
>
> It works because when you set up a new shmmax limit it is actually the limit in the init_ipc_ns. So when we are creating a new ipc_ns it is ok to copy the values from init_ipc_ns.
>
> -Marian

Ping?

So will there be any more comments on that?

Marian

>>> Why is this a good idea?
>>>
>>> This would break userspace that relies on the current behavior. Furthermore, we've recently changed the default value of both these limits to be as large as you can get, thus deprecating them. I don't like the idea of this being replaced by namespaces.
>>
>> Davidlohr: We are not deprecating them, we make the default huge. The limits should stay as usable as they were.
>>
>> With regards to breaking user space, I must think about it a bit more. Right now, each new namespace starts with SEMMAX=32MB, i.e. an often unusable default.
>>
>> -- Manfred
Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 05/05/2014 10:59 PM, Marian Marinov wrote: On 05/04/2014 02:17 PM, Manfred Spraul wrote: Hi Marian, Note: The limits will soon be increased to (nearly) ULONG_MAX. I.e.: If you propose the patch because you are running into issues with a too small SEMMAX after an unshare(CLONE_NEWIPC), then this will be fixed soon. On 05/04/2014 01:53 AM, Davidlohr Bueso wrote: On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote: When we are creating new IPC namespace that should be cloned from the current namespace it is a good idea to copy the values of the current shmmax and shmall to the new namespace. The idea sounds reasonable: If an admin has reduced the limits, then the reduction should also apply after a unshare(CLONE_NEWIPC). But: Your patch doesn't use the current shmmax, it uses the shmmax from init_ipc_ns. Would it be possible to use the current values? In my tests it worked exactly as expected. Here is an example: [root@sp2 ~]# sysctl -a|grep shmmax kernel.shmmax = 68719476736 [root@sp2 ~]# lxc-attach -n cent_plain [root@localhost ~]# sysctl -a|grep shmmax kernel.shmmax = 68719476736 [root@localhost ~]# halt [root@sp2 ~]# sysctl -a|grep shmmax kernel.shmmax = 68719476736 [root@sp2 ~]# sysctl kernel.shmmax=34359738368 kernel.shmmax = 34359738368 [root@sp2 ~]# lxc-start -n cent_plain -d [root@sp2 ~]# lxc-attach -n cent_plain [root@localhost ~]# sysctl -a|grep shmmax kernel.shmmax = 34359738368 [root@localhost ~]# So it seams to work as expected :) It works because wen you setup a new shmmax limit it is actually the limit in the init_ipc_ns. So when we are creating a new ipc_ns its ok to copy the values from init_ipc_ns. -Marian Ping? So will there be any more comments on that? Marian Why is this a good idea? This would break userspace that relies on the current behavior. Furthermore we've recently changed the default value of both these limits to be as large as you can get, thus deprecating them. 
I don't like the idea of this being replaced by namespaces.

Davidlohr: We are not deprecating them, we make the default huge. The limits should stay as usable as they were. With regards to breaking user space, I must think about it a bit more. Right now, each new namespace starts with SHMMAX=32MB, i.e. an often unusable default. -- Manfred

--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
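The behavior being debated above can be poked at from userspace. A minimal Python sketch (not part of the thread; assumes a Linux host with /proc mounted — CLONE_NEWIPC is the standard kernel constant, and unshare() needs CAP_SYS_ADMIN, so the demo degrades gracefully when unprivileged):

```python
# Sketch only: reads the effective kernel.shmmax and, when privileged,
# unshares into a new IPC namespace to see which value a fresh namespace
# starts with. Only the constant and /proc path are standard Linux;
# everything else is illustrative, not code from the patch.
import ctypes

CLONE_NEWIPC = 0x08000000  # from <linux/sched.h>

def read_shmmax():
    """Return the current namespace's kernel.shmmax as an int."""
    with open("/proc/sys/kernel/shmmax") as f:
        return int(f.read().strip())

def try_unshare_ipc():
    """Attempt unshare(CLONE_NEWIPC); True on success.

    Fails (typically EPERM) unless the caller has CAP_SYS_ADMIN."""
    libc = ctypes.CDLL(None, use_errno=True)
    return libc.unshare(CLONE_NEWIPC) == 0

if __name__ == "__main__":
    print("current shmmax:", read_shmmax())
    if try_unshare_ipc():
        # With the unpatched kernel this shows the compile-time default
        # rather than the host's current setting -- the thread's complaint.
        print("fresh IPC ns shmmax:", read_shmmax())
    else:
        print("unshare(CLONE_NEWIPC) denied; run privileged to compare")
```

Run unprivileged it only prints the host value; run with CAP_SYS_ADMIN it shows what a new namespace inherits.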
Re: Pondering per-process vsyscall disablement
On 05/23/2014 02:04 AM, Andy Lutomirski wrote: It would be nice to have a way for new programs to declare that they don't need vsyscalls. What's the right way to do this? An ELF header entry in the loader? An ELF header entry in the program? A new arch_prctl?

As background, there's an old part of the x86_64 ABI that allows programs to do gettimeofday, clock_gettime, and getcpu by calling to fixed addresses of the form 0xffffffffff600n00, where n indicates which of those three syscalls is being invoked. This is a security issue. Since Linux 3.1, vsyscalls are emulated using NX and page faults. As a result, vsyscalls no longer offer any performance advantage over normal syscalls; in fact, they're much slower. As far as I know, nothing newer than 2012 will attempt to use vsyscalls if a vdso is present. (Sadly, a lot of things will still fall back to the vsyscall page if there is no vdso, but that shouldn't matter, since there is always a vdso.) Despite the emulation, they could still be used as a weird form of ROP gadget that lives at a fixed address. I'd like to offer a way for new runtimes to indicate that they don't use vsyscalls so that the kernel can selectively disable emulation and remove the fixed-address executable code issue.

Wouldn't it be more useful if the check were against a bitmask added as an extended attribute of that executable? This way the administrators will have the flexibility to simply add the new attribute to the executable.

Marian

--Andy
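The fixed-address page Andy describes is visible in every process's memory map. A small Python sketch (illustrative only; assumes Linux, where the page appears as [vsyscall] in /proc/self/maps unless the kernel was booted with vsyscall=none or the architecture is not x86-64):

```python
# Sketch: look for the legacy vsyscall page in this process's mappings.
# On x86-64 it sits at the fixed address ffffffffff600000 -- the
# fixed-address ROP-gadget concern raised in the thread -- unless the
# kernel was booted with vsyscall=none.
def vsyscall_mapping():
    """Return the /proc/self/maps line for [vsyscall], or None."""
    with open("/proc/self/maps") as f:
        for line in f:
            if "[vsyscall]" in line:
                return line.strip()
    return None

if __name__ == "__main__":
    m = vsyscall_mapping()
    if m:
        print("vsyscall page present:", m)
    else:
        print("no vsyscall page (vsyscall=none or non-x86-64)")
```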
Re: [RFC PATCH 11/11] loop: Allow priveleged operations for root in the namespace which owns a device
One question about this patch. Why don't you use the devices cgroup to check whether the root user in that namespace is allowed to use this device? This way you can be sure that root in that namespace cannot access devices to which the host system did not give it access.

Marian

On 05/15/2014 12:34 AM, Seth Forshee wrote: Signed-off-by: Seth Forshee seth.fors...@canonical.com
---
 drivers/block/loop.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 66bd938bcc1c..2cc19868ea0d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1074,7 +1074,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
 	if (lo->lo_encrypt_key_size &&
 	    !uid_eq(lo->lo_key_owner, uid) &&
-	    !capable(CAP_SYS_ADMIN))
+	    !ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	if (lo->lo_state != Lo_bound)
 		return -ENXIO;
@@ -1164,7 +1164,8 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
 	info->lo_encrypt_type = lo->lo_encryption ? lo->lo_encryption->number : 0;
-	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
+	if (lo->lo_encrypt_key_size &&
+	    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN)) {
 		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
 		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
 		       lo->lo_encrypt_key_size);
@@ -1309,7 +1310,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		break;
 	case LOOP_SET_STATUS:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+		if ((mode & FMODE_WRITE) ||
+		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 			err = loop_set_status_old(lo, (struct loop_info __user *)arg);
 		break;
@@ -1318,7 +1320,8 @@
 		break;
 	case LOOP_SET_STATUS64:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+		if ((mode & FMODE_WRITE) ||
+		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 			err = loop_set_status64(lo, (struct loop_info64 __user *) arg);
 		break;
@@ -1327,7 +1330,8 @@
 		break;
 	case LOOP_SET_CAPACITY:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+		if ((mode & FMODE_WRITE) ||
+		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 			err = loop_set_capacity(lo, bdev);
 		break;
 	default:
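The patch replaces capable(CAP_SYS_ADMIN), a check against the initial user namespace, with ns_capable() against the namespace that owns the device. From userspace you can at least inspect whether CAP_SYS_ADMIN is in your own capability bounding set. A Python sketch (illustrative, not from the thread; the prctl and capability constants are the standard Linux values):

```python
# Sketch: check whether CAP_SYS_ADMIN is in this process's capability
# bounding set, i.e. whether a plain capable(CAP_SYS_ADMIN)-style check
# could ever succeed for us. Constants are the standard Linux values.
import ctypes

PR_CAPBSET_READ = 23   # from <linux/prctl.h>
CAP_SYS_ADMIN = 21     # from <linux/capability.h>

def cap_in_bounding_set(cap):
    """Return True/False via prctl(PR_CAPBSET_READ, cap)."""
    libc = ctypes.CDLL(None, use_errno=True)
    r = libc.prctl(PR_CAPBSET_READ, cap, 0, 0, 0)
    if r < 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_CAPBSET_READ) failed")
    return bool(r)

if __name__ == "__main__":
    print("CAP_SYS_ADMIN in bounding set:",
          cap_in_bounding_set(CAP_SYS_ADMIN))
```

Inside an unprivileged container with a dropped bounding set this prints False, even though a root user in the container's own user namespace could still pass an ns_capable() check scoped to that namespace.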
Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default
On 05/04/2014 04:20 AM, Davidlohr Bueso wrote: On Sun, 2014-05-04 at 03:28 +0300, Marian Marinov wrote: On 05/04/2014 02:53 AM, Davidlohr Bueso wrote: On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote: When we are creating a new IPC namespace that should be cloned from the current namespace, it is a good idea to copy the values of the current shmmax and shmall to the new namespace.

Why is this a good idea? This would break userspace that relies on the current behavior. Furthermore, we've recently changed the default value of both these limits to be as large as you can get, thus deprecating them. I don't like the idea of this being replaced by namespaces. Thanks, Davidlohr

The current behavior is create_ipc_ns()->shm_init_ns():

void shm_init_ns(struct ipc_namespace *ns)
{
	ns->shm_ctlmax = SHMMAX;
	ns->shm_ctlall = SHMALL;
	ns->shm_ctlmni = SHMMNI;
	ns->shm_rmid_forced = 0;
	ns->shm_tot = 0;
	ipc_init_ids(&shm_ids(ns));
}

This means that whenever you are creating an IPC namespace, it gets its SHMMAX and SHMALL values from the kernel defaults. This is exactly what I meant by 'current behavior'. If for some reason you want a smaller (or, on older kernels, bigger) limit, that means changing the values in /proc/sys/kernel/shmmax and /proc/sys/kernel/shmall. However, the program that is started with the new IPC namespace may lack privileges to write to these files and so cannot modify them.

I see no reason why namespaces should behave any different than the rest of the system, wrt this. And this changes how *and* when these limits are set, which impacts at a userspace level with no justification.

What I'm proposing is simply to copy the current values of the host machine, as set by a privileged process before the namespace creation. Maybe a better approach would be to allow the changes to be done by processes having CAP_SYS_RESOURCE inside the new namespace?

Why do you need this? Is there any real impact/issue you're seeing?
I'm using Linux Containers and I need to be able to either start containers with a different SHMMAX or set a different SHMMAX on already running containers without giving them full root access.

-Marian
Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default
On 05/04/2014 02:53 AM, Davidlohr Bueso wrote: On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote: When we are creating a new IPC namespace that should be cloned from the current namespace, it is a good idea to copy the values of the current shmmax and shmall to the new namespace.

Why is this a good idea? This would break userspace that relies on the current behavior. Furthermore, we've recently changed the default value of both these limits to be as large as you can get, thus deprecating them. I don't like the idea of this being replaced by namespaces. Thanks, Davidlohr

The current behavior is create_ipc_ns()->shm_init_ns():

void shm_init_ns(struct ipc_namespace *ns)
{
	ns->shm_ctlmax = SHMMAX;
	ns->shm_ctlall = SHMALL;
	ns->shm_ctlmni = SHMMNI;
	ns->shm_rmid_forced = 0;
	ns->shm_tot = 0;
	ipc_init_ids(&shm_ids(ns));
}

This means that whenever you are creating an IPC namespace, it gets its SHMMAX and SHMALL values from the kernel defaults. If for some reason you want a smaller (or, on older kernels, bigger) limit, that means changing the values in /proc/sys/kernel/shmmax and /proc/sys/kernel/shmall. However, the program that is started with the new IPC namespace may lack privileges to write to these files and so cannot modify them.

What I'm proposing is simply to copy the current values of the host machine, as set by a privileged process before the namespace creation. Maybe a better approach would be to allow the changes to be done by processes having CAP_SYS_RESOURCE inside the new namespace?

Marian
[PATCH] IPC initialize shmmax and shmall from the current value not the default
When we are creating a new IPC namespace that should be cloned from the current namespace, it is a good idea to copy the values of the current shmmax and shmall to the new namespace. Copying the values from init_ipc_ns would allow us to create new ipc namespaces with different values without granting them privileges to change those values.

Here is the proposed patch:
---
 ipc/shm.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 7a51443..b7a4728 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -74,8 +74,13 @@ static int sysvipc_shm_proc_show(struct seq_file *s, void *it);
 void shm_init_ns(struct ipc_namespace *ns)
 {
-	ns->shm_ctlmax = SHMMAX;
-	ns->shm_ctlall = SHMALL;
+	if (ns == &init_ipc_ns) {
+		ns->shm_ctlmax = SHMMAX;
+		ns->shm_ctlall = SHMALL;
+	} else {
+		ns->shm_ctlmax = init_ipc_ns.shm_ctlmax;
+		ns->shm_ctlall = init_ipc_ns.shm_ctlall;
+	}
 	ns->shm_ctlmni = SHMMNI;
 	ns->shm_rmid_forced = 0;
 	ns->shm_tot = 0;
--
1.8.4
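The decision the patch adds to shm_init_ns() can be mirrored in a tiny pure function, which makes the semantics easy to reason about (a sketch only — the names and the default constants used here are illustrative, not kernel code):

```python
# Sketch of the patched shm_init_ns() decision: the initial IPC
# namespace gets the compile-time defaults, while every other new
# namespace inherits the values currently set in init_ipc_ns.
# Constants and names below are illustrative.
SHMMAX_DEFAULT = 33554432   # historical 32MB default
SHMALL_DEFAULT = 2097152    # historical default, in pages

def shm_limits_for_new_ns(is_init_ns, init_ns_current):
    """Return (shm_ctlmax, shm_ctlall) for a namespace being set up.

    init_ns_current: the (shmmax, shmall) pair currently set in
    init_ipc_ns, possibly lowered or raised by the admin since boot."""
    if is_init_ns:
        return (SHMMAX_DEFAULT, SHMALL_DEFAULT)
    return init_ns_current

if __name__ == "__main__":
    # Admin lowered shmmax on the host; a new container namespace
    # inherits that value instead of the compile-time default.
    host = (34359738368, 2097152)
    print(shm_limits_for_new_ns(False, host))
```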
Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace
On 04/30/2014 03:01 AM, Stéphane Graber wrote: On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote: On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber wrote: On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote: On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov wrote: On 04/30/2014 01:45 AM, Andy Lutomirski wrote: On 04/29/2014 03:29 PM, Serge Hallyn wrote: Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org): On 04/30/2014 01:02 AM, Serge Hallyn wrote: Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org): On 04/29/2014 09:52 PM, Serge Hallyn wrote: Quoting Theodore Ts'o (tytso-3s7wtutd...@public.gmane.org): On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:

I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE) check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).

Um, wouldn't it be better to simply fix the capable() function?

/**
 * capable - Determine if the current task has a superior capability in effect
 * @cap: The capability to be tested for
 *
 * Return true if the current task has the given superior capability currently
 * available for use, false if not.
 *
 * This sets PF_SUPERPRIV on the task if the capability is available on the
 * assumption that it's about to be used.
 */
bool capable(int cap)
{
	return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I can't imagine any use case, where user namespaces are in effect, where using init_user_ns would ever make sense.

The init_user_ns represents the user_ns owning the object, not the subject. The patch by Marian is wrong. Anyone can do clone(CLONE_NEWUSER), setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' by definition. So NACK to that particular patch. I'm not sure, but IIUC it should be safe to check against the userns owning the inode?
So what you are proposing is to replace 'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' with 'inode_capable(inode, CAP_SYS_IMMUTABLE)'? I agree that this is more sane. Right, and I think the two operations you're looking at seem sane to allow. If you are OK with this patch, I will fix all file systems and send patches. Sounds good, thanks. Signed-off-by: Marian Marinov Acked-by: Serge E. Hallyn

Wait, what? Inodes aren't owned by user namespaces; they're owned by users. And any user can arrange to have a user namespace in which they pass an inode_capable check on any inode that they own. Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE entirely.

The problem I'm trying to solve is this: a container with its own user namespace and CAP_SYS_IMMUTABLE should be able to use chattr on all files which this container has access to. Unfortunately, with the capable(CAP_SYS_IMMUTABLE) check this is not working. With the proposed two fixes, CAP_SYS_IMMUTABLE started working in the container. The first solution got its user namespace from the currently running process and the second gets its user namespace from the currently opened inode. So what would be the best solution in this case?

I'd suggest adding a mount option like fs_owner_uid that names a uid that owns, in the sense of having unlimited access to, a filesystem. Then anyone with caps on a namespace owned by that uid could do whatever. Eric? --Andy

The most obvious problem I can think of with "do whatever" is that this will likely include mknod of char and block devices, which you can then chown/chmod as you wish and use to access any devices on the system from an unprivileged container. This can however be mitigated by using the devices cgroup controller. Or 'nodev'. setuid/setgid may have the same problem, too. Implementing something like this would also make CAP_DAC_READ_SEARCH and CAP_DAC_OVERRIDE work.
Arguably it should be impossible to mount such a thing in the first place without global privilege. You also probably wouldn't want any unprivileged user from the host to find a way to access that mounted filesystem, but so long as you do the mount in a separate mountns and don't share uids between the host and the container, that should be fine too.

This part should be a nonissue -- an unprivileged user who has the right uid owns the namespace anyway, so this is the least of your worries. --Andy

It should be a nonissue so long as we make sure that a file owned by a uid outside the scope of the container may not be changed even though fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on, say, a shell, and anyone who can see the fs from the host will be getting a root shell (assuming said file is owned by the host's uid 0). So that's restricting slightly what "do whatever" would do in this case. In my case I give an
Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace
On 04/30/2014 01:45 AM, Andy Lutomirski wrote: On 04/29/2014 03:29 PM, Serge Hallyn wrote: Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org): On 04/30/2014 01:02 AM, Serge Hallyn wrote: Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org): On 04/29/2014 09:52 PM, Serge Hallyn wrote: Quoting Theodore Ts'o (tytso-3s7wtutd...@public.gmane.org): On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:

I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE) check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).

Um, wouldn't it be better to simply fix the capable() function?

/**
 * capable - Determine if the current task has a superior capability in effect
 * @cap: The capability to be tested for
 *
 * Return true if the current task has the given superior capability currently
 * available for use, false if not.
 *
 * This sets PF_SUPERPRIV on the task if the capability is available on the
 * assumption that it's about to be used.
 */
bool capable(int cap)
{
	return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I can't imagine any use case, where user namespaces are in effect, where using init_user_ns would ever make sense.

The init_user_ns represents the user_ns owning the object, not the subject. The patch by Marian is wrong. Anyone can do clone(CLONE_NEWUSER), setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' by definition. So NACK to that particular patch. I'm not sure, but IIUC it should be safe to check against the userns owning the inode?

So what you are proposing is to replace 'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' with 'inode_capable(inode, CAP_SYS_IMMUTABLE)'? I agree that this is more sane. Right, and I think the two operations you're looking at seem sane to allow. If you are OK with this patch, I will fix all file systems and send patches. Sounds good, thanks.
Signed-off-by: Marian Marinov Acked-by: Serge E. Hallyn

Wait, what? Inodes aren't owned by user namespaces; they're owned by users. And any user can arrange to have a user namespace in which they pass an inode_capable check on any inode that they own. Presumably there's a reason that CAP_SYS_IMMUTABLE is needed. If this gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE entirely.

The problem I'm trying to solve is this: a container with its own user namespace and CAP_SYS_IMMUTABLE should be able to use chattr on all files which this container has access to. Unfortunately, with the capable(CAP_SYS_IMMUTABLE) check this is not working. With the proposed two fixes, CAP_SYS_IMMUTABLE started working in the container. The first solution got its user namespace from the currently running process and the second gets its user namespace from the currently opened inode. So what would be the best solution in this case?

Marian

Nacked-by: Andy Lutomirski
Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace
On 04/30/2014 01:02 AM, Serge Hallyn wrote: Quoting Marian Marinov (m...@1h.com): On 04/29/2014 09:52 PM, Serge Hallyn wrote: Quoting Theodore Ts'o (ty...@mit.edu): On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:

I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE) check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).

Um, wouldn't it be better to simply fix the capable() function?

/**
 * capable - Determine if the current task has a superior capability in effect
 * @cap: The capability to be tested for
 *
 * Return true if the current task has the given superior capability currently
 * available for use, false if not.
 *
 * This sets PF_SUPERPRIV on the task if the capability is available on the
 * assumption that it's about to be used.
 */
bool capable(int cap)
{
	return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I can't imagine any use case, where user namespaces are in effect, where using init_user_ns would ever make sense.

The init_user_ns represents the user_ns owning the object, not the subject. The patch by Marian is wrong. Anyone can do clone(CLONE_NEWUSER), setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' by definition. So NACK to that particular patch. I'm not sure, but IIUC it should be safe to check against the userns owning the inode?

So what you are proposing is to replace 'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' with 'inode_capable(inode, CAP_SYS_IMMUTABLE)'? I agree that this is more sane. Right, and I think the two operations you're looking at seem sane to allow. If you are ok with this patch, I will fix all file systems and send patches.
Signed-off-by: Marian Marinov --- fs/ext4/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c index d011b69..9418634 100644 --- a/fs/ext4/ioctl.c +++ b/fs/ext4/ioctl.c @@ -265,7 +265,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) * This test looks nicer. Thanks to Pauline Middelink */ if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) { - if (!capable(CAP_LINUX_IMMUTABLE)) + if (!inode_capable(inode, CAP_LINUX_IMMUTABLE)) goto flags_out; } --- 1.8.4 Marian thanks, -serge -- Marian Marinov Founder & CEO of 1H Ltd. Jabber/GTalk: hack...@jabber.org ICQ: 7556201 Mobile: +359 886 660 270 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace
On 04/29/2014 09:52 PM, Serge Hallyn wrote:
> Quoting Theodore Ts'o (ty...@mit.edu):
>> On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:
>>> I'm proposing a fix to this, by replacing the
>>> capable(CAP_LINUX_IMMUTABLE) check with
>>> ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).
>>
>> Um, wouldn't it be better to simply fix the capable() function?
>>
>> /**
>>  * capable - Determine if the current task has a superior capability in effect
>>  * @cap: The capability to be tested for
>>  *
>>  * Return true if the current task has the given superior capability currently
>>  * available for use, false if not.
>>  *
>>  * This sets PF_SUPERPRIV on the task if the capability is available on the
>>  * assumption that it's about to be used.
>>  */
>> bool capable(int cap)
>> {
>> 	return ns_capable(&init_user_ns, cap);
>> }
>> EXPORT_SYMBOL(capable);
>>
>> The documentation states that it is for "the current task", and I can't
>> imagine any use case, where user namespaces are in effect, where using
>> init_user_ns would ever make sense.
>>
>> No? Otherwise, pretty much every single use of capable() would be
>> broken, not just this one instance in ext4/ioctl.c.
>>
>> - Ted
>
> The init_user_ns represents the user_ns owning the object, not the
> subject. The patch by Marian is wrong. Anyone can do
> 'clone(CLONE_NEWUSER)', setuid(0), execve, and end up satisfying
> 'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' by definition.
> So NACK to that particular patch.
>
> I'm not sure, but IIUC it should be safe to check against the userns
> owning the inode?

So what you are proposing is to replace
'ns_capable(current_cred()->user_ns, CAP_SYS_IMMUTABLE)' with
'inode_capable(inode, CAP_SYS_IMMUTABLE)'? I agree that this is more sane.

Marian

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers
ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace
Hello,

When using user namespaces I found a bug in the capability checks done by
ioctl. If someone tries to use chattr +i while in a different user
namespace, they will get the following:

  ioctl(3, EXT2_IOC_SETFLAGS, 0x7fffa4fedacc) = -1 EPERM (Operation not permitted)

I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE). If
you agree, I can send patches for all filesystems.

I'm proposing the following patch:

---
 fs/ext4/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d011b69..25683d0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -265,7 +265,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		 * This test looks nicer. Thanks to Pauline Middelink
 		 */
 		if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
-			if (!capable(CAP_LINUX_IMMUTABLE))
+			if (!ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE))
 				goto flags_out;
 		}
--
1.8.4

--
Marian Marinov
Founder & CEO of 1H Ltd.
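[Archive editor's note] The objection raised elsewhere in this thread -- that any unprivileged task can clone(CLONE_NEWUSER) and gain a full capability set valid only in the child namespace, trivially satisfying a check targeted at its own namespace -- can be illustrated with a toy model. This is plain Python, not kernel code; the helper names mirror the real kernel functions but the logic is deliberately simplified:

```python
# Toy model of namespace-targeted capability checks (NOT kernel code).

class UserNs:
    def __init__(self, parent=None):
        self.parent = parent

init_user_ns = UserNs()

class Cred:
    def __init__(self, user_ns, caps):
        self.user_ns = user_ns
        self.caps = set(caps)

def ns_capable(cred, target_ns, cap):
    # A cred holds `cap` against target_ns if its own user namespace is
    # target_ns or an ancestor of it, and the cap is in its set.
    ns = target_ns
    while ns is not None:
        if ns is cred.user_ns:
            return cap in cred.caps
        ns = ns.parent
    return False

def capable(cred, cap):
    # The unmodified check in ext4_ioctl(): the capability must hold
    # against the initial user namespace.
    return ns_capable(cred, init_user_ns, cap)

CAP_LINUX_IMMUTABLE = "CAP_LINUX_IMMUTABLE"

# Any unprivileged task can clone(CLONE_NEWUSER) and become "root" in the
# child namespace, gaining a full capability set -- valid only there.
child_ns = UserNs(parent=init_user_ns)
task = Cred(user_ns=child_ns, caps={CAP_LINUX_IMMUTABLE})

print(capable(task, CAP_LINUX_IMMUTABLE))               # False: no cap in init ns
print(ns_capable(task, child_ns, CAP_LINUX_IMMUTABLE))  # True: a check targeted
# at the caller's own namespace is satisfiable by anyone
```

This is why checking against the caller's own user namespace is rejected: the target of the check must be a namespace the caller cannot simply create for itself.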
Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace
On 04/30/2014 03:01 AM, Stéphane Graber wrote:
> On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:
>> On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber <stgra...@ubuntu.com> wrote:
>>> On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:
>>>> On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov <m...@1h.com> wrote:
>>>>> [the earlier exchange between Marian, Serge, Ted and Andy is quoted
>>>>> here in full, ending with Marian's question:]
>>>>> So what would be the best solution in this case?
>>>>
>>>> I'd suggest adding a mount option like fs_owner_uid that names a uid
>>>> that owns, in the sense of having unlimited access to, a filesystem.
>>>> Then anyone with caps on a namespace owned by that uid could do
>>>> whatever. Eric?
>>>>
>>>> --Andy
>>>
>>> The most obvious problem I can think of with "do whatever" is that
>>> this will likely include mknod of char and block devices, which you
>>> can then chown/chmod as you wish and use to access any devices on the
>>> system from an unprivileged container. This can however be mitigated
>>> by using the devices cgroup controller.
>>
>> Or 'nodev'. setuid/setgid may have the same problem, too.
>>
>> Implementing something like this would also make CAP_DAC_READ_SEARCH
>> and CAP_DAC_OVERRIDE work. Arguably it should be impossible to mount
>> such a thing in the first place without global privilege.
>>
>>> You also probably wouldn't want any unprivileged user from the host to
>>> find a way to access that mounted filesystem, but so long as you do
>>> the mount in a separate mountns and don't share uids between the host
>>> and the container, that should be fine too.
>>
>> This part should be a nonissue -- an unprivileged user who has the
>> right uid owns the namespace anyway, so this is the least of your
>> worries.
>>
>> --Andy
>
> It should be a nonissue so long as we make sure that a file owned by a
> uid outside the scope of the container may not be changed even though
> fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on, say,
> a shell, and anyone who can see the fs from the host will be getting a
> root shell (assuming said file is owned by the host's uid 0). So that's
> restricting
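[Archive editor's note] The nodev/nosuid mitigation mentioned above can be sketched as a mount invocation. The device path and mountpoint below are made-up examples, and fs_owner_uid is only the idea floated in this thread -- it is not a real mount option:

```shell
# nodev/nosuid are real generic mount options: device nodes and setuid
# bits inside the mounted filesystem become inert.
mount -o nodev,nosuid /dev/vg0/ct1 /var/lib/ct1/rootfs

# With the option proposed in this thread (hypothetical, never merged),
# the mount might have looked like:
#   mount -o nodev,nosuid,fs_owner_uid=100000 /dev/vg0/ct1 /var/lib/ct1/rootfs
```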
kernel trees
Hello,

I'm maintaining a few patches out of the main tree. Currently I keep one
repository that is a clone of this:

  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

But if I'm monitoring the 3.13 version, I can see only the rc versions in
that repository. Which repository should I follow in order to get the
3.13.* versions?

Thank you very much!
Marian
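[Archive editor's note] Linus's tree only carries the -rc tags between major releases; the 3.13.* point releases were tagged in the separate stable tree. A sketch of following it (the URL is the historical one from that era and may have moved since):

```shell
# Stable point releases (v3.13.1, v3.13.2, ...) live in the stable tree,
# not in Linus's tree. URL as used circa 2014; it may have changed.
git remote add stable \
    git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
git fetch stable
git tag -l 'v3.13.*'   # lists the stable point-release tags
```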
Re: RFC: cgroups aware proc
On 01/13/2014 07:12 PM, Peter Zijlstra wrote:
> On Mon, Jan 13, 2014 at 06:23:50PM +0200, Marian Marinov wrote:
>> Hello Peter,
>> I need help with the scheduler. I'm currently trying to patch
>> /proc/loadavg to show the load that is only related to the processes
>> from the current cgroup. I looked through the code and I was hoping
>> that the tsk->sched_task_group->cfs_rq struct would give me the needed
>> information, but unfortunately for me, it did not.
>> Can you advise me how to approach this problem?
>
> Yeah, don't :-) Really, loadavg is a stupid metric.

Yes... stupid, but unfortunately everyone is looking at it :(

>> I'm totally new to the scheduler code.
>
> Luckily you won't actually have to touch much of it. Most of the actual
> loadavg code lives in the first ~400 lines of kernel/sched/proc.c, read
> and weep. It's one of the best documented bits around.

I looked through it, but I don't understand how to introduce the
per-cgroup calculation. I looked through the headers and found the
following, which is already implemented:

  task->sched_task_group->load_avg
  task->sched_task_group->cfs_rq->load_avg
  task->sched_task_group->cfs_rq->load.weight
  task->sched_task_group->cfs_rq->runnable_load_avg

Unfortunately there is almost no documentation for these elements of the
cfs_rq and task_group structs. It seems to me that part of the per task
group loadavg code is already present.

> Your proposition however is extremely expensive: you turn something
> that's already expensive, O(nr_cpus), into something
> O(nr_cpus * nr_cgroups). I'm fairly sure people will not like that,
> esp. for something of such questionable use as the loadavg -- it's
> really only a pretty number that doesn't mean all that much.

I know that its use is questionable, but in my case I need to have it, or
I will not be able to offer correct loadavg values in the containers.

>  Original Message 
> From: Li Zefan
>> Then you should add Peter, Ingo and LKML to your Cc list. :)
>
> You failed that, let me fix that.
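[Archive editor's note] The loadavg code Peter points at computes a fixed-point exponential moving average once per 5-second tick. A userspace model of that update (constants as in the 3.x kernels' calc_load(); this is a sketch of the arithmetic, not the kernel code):

```python
# Userspace model of the kernel's avenrun update (see the top of
# kernel/sched/proc.c in 3.x kernels). Fixed point, 11 fraction bits.
FSHIFT  = 11
FIXED_1 = 1 << FSHIFT   # 2048 represents 1.0
EXP_1   = 1884          # ~ 2048 * exp(-5s / 1min)
EXP_5   = 2014          # ~ 2048 * exp(-5s / 5min)
EXP_15  = 2037          # ~ 2048 * exp(-5s / 15min)

def calc_load(load, exp, active):
    # One 5-second tick: load = load*e + active*(1 - e), in fixed point,
    # with rounding. `active` is the runnable+uninterruptible count, scaled.
    newload = load * exp + active * (FIXED_1 - exp)
    newload += 1 << (FSHIFT - 1)
    return newload >> FSHIFT

# One task runnable continuously: the 1-minute average climbs toward 1.0.
avenrun = 0
for _ in range(12):                 # 12 ticks = one minute
    avenrun = calc_load(avenrun, EXP_1, 1 * FIXED_1)
print(round(avenrun / FIXED_1, 2))  # ~0.63 after one minute
```

A per-cgroup variant would have to run this update for every cgroup on every tick, which is exactly the O(nr_cpus * nr_cgroups) cost Peter objects to.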