Re: [PATCH] uts_namespace: Move boot_id in uts namespace

2018-04-04 Thread Marian Marinov
On 04/05/2018 03:35 AM, Eric W. Biederman wrote:
> Marian Marinov <ker...@kyup.com> writes:
> 
>> On 04/04/2018 07:02 PM, Eric W. Biederman wrote:
>>> Angel Shtilianov <ker...@kyup.com> writes:
>>>
>>>> Currently the same boot_id is reported for all containers running
>>>> on a host node, including the host node itself. Even after restarting
>>>> a container it will still have the same persistent boot_id.
>>>>
>>>> This can cause troubles in cases where you have multiple containers
>>>> from the same cluster on one host node. The software inside each
>>>> container will get the same boot_id and thus fail to join the cluster,
>>>> after the first container from the node has already joined.
>>>>
>>>> The UTS namespace, on the other hand, keeps machine-specific data,
>>>> so it seems to be the correct place to move the boot_id and
>>>> instantiate it, so that each container with its own UTS namespace
>>>> gets a unique id for its own boot lifetime.
>>>
>>> Technically this really needs to use the sysctl infrastructure that
>>> allows you to register different files in different namespaces.  That
>>> way the value you read from proc_do_uuid will be based on who opens the
>>> file not on who is reading the file.
>>
>> Ok, so would you accept a patch that reimplements boot_id through the
>> sysctl infrastructure?
> 
> Assuming I am convinced this makes sense to do on the semantic level.
> 
>>> Practically why does a bind mount on top of boot_id work?  What makes
>>> this a general problem worth solving in the kernel?  Why is hiding the
>>> fact that you are running the same instance of the same kernel a useful
>>> thing? That is the reality.
>>
>> The problem is that distros do not know they are running in a
>> container and don't know that they have to bind mount something on top
>> of boot_id.  You would need to tell Docker, LXC/LXD, and all other
>> container runtimes that they need to do this bind mount for boot_id.
> 
> Yes.  Anything like this is the responsibility of the container runtime
> one way or another.   Magic to get around fixing the small set of
> container runtimes you care about is a questionable activity.
> 
>> I consider this to be a general issue that lacks a good general
>> solution in userspace.  The kernel provides this boot_id
>> interface, but it gives wrong data in the context of containers.
> 
> I disagree.  Namespaces have never been about hiding that you are on a
> given machine or a single machine.  They are about encapsulating global
> identifiers so that process migration can happen, and so that processes
> can be better isolated.  The boot_id is not the identity of an object in
> the kernel at all, and it is designed to be truly globally unique
> across time and space, so I am not at all certain that it makes the least
> bit of sense to do anything with a boot_id.
> 
> 
> That said, my memory of boot_id is that it was added so that emacs (and
> related programs) could create lock files on NFS and could tell if the
> current machine owns the file, and if so be able to tell if the owner
> of the lock file is alive.
> 
> So there is an argument to be made that boot_id is too coarse.  That
> argument suggests that boot_id is a pid_namespace property.
> 
> I have not looked at the users of boot_id, and I don't have a definition
> of boot_id that makes me think it is too coarse.
> 
> If you can provide a clear description of what the semantics are and
> what they should be for boot_id, showing how boot_id fits into a
> namespace and making it clear what should happen with checkpoint/restart,
> we can definitely look at changing how the kernel supports boot_id.
> 
> The reason I suggested the bind mount is there are lots of cases where
> people want to lie to applications about the reality of what is going on
> for whatever reason, and we leave those lies to userspace.  Things
> like changing the contents of /proc/cpuinfo.

Even though we have previously sent patches that are meant to lie to the
software running inside a container, this one is not of that sort.

You are pointing to Emacs, and I'm talking about Consul: https://www.consul.io/
systemd also uses boot_id: "journalctl --list-boots" actually uses the
boot_id to distinguish between different boots.

  -2 caf0524a1d394ce0bdbcff75b9fe     Tue 2015-02-03 21:48:52 UTC—Tue 2015-02-03 22:17:00 UTC
  -1 13883d180dc0420db0abcb5fa26d6198 Tue 2015-02-03 22:17:03 UTC—Tue 2015-02-03 22:19:08 UTC
   0 bed718b17a73415fade0e4e7f4bea609 Tue 2015-02-03 22:19:12 UTC—Tue 2015-02-03 23:01:01 UTC
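The failure mode described above — cluster software treating a shared boot_id as "same node" — can be illustrated with a small self-contained Python sketch. The join logic below is hypothetical, for illustration only, and is not Consul's actual protocol:

```python
import uuid

# What /proc/sys/kernel/random/boot_id effectively returns today: one
# value shared by the host and every container on it.
HOST_BOOT_ID = str(uuid.uuid4())

def read_boot_id(container, per_namespace=False, _ns_ids={}):
    """Return the boot_id a container sees.

    per_namespace=False models current kernel behaviour; True models the
    proposed per-UTS-namespace boot_id.
    """
    if not per_namespace:
        return HOST_BOOT_ID
    return _ns_ids.setdefault(container, str(uuid.uuid4()))

def join_cluster(containers, per_namespace=False):
    """Hypothetical cluster that rejects members with a duplicate boot_id."""
    seen, results = set(), []
    for c in containers:
        bid = read_boot_id(c, per_namespace)
        if bid in seen:
            results.append((c, "rejected"))  # mistaken for the same node
        else:
            seen.add(bid)
            results.append((c, "joined"))
    return results

# Today: only the first container manages to join.
print(join_cluster(["web1", "web2", "db1"]))
# With per-namespace boot_ids, all three would join.
print(join_cluster(["web1", "web2", "db1"], per_namespace=True))
```

With the shared id, every container after the first is rejected; with per-namespace ids, each gets a unique value and all join.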

Re: [PATCH] uts_namespace: Move boot_id in uts namespace

2018-04-04 Thread Marian Marinov
On 04/04/2018 07:02 PM, Eric W. Biederman wrote:
> Angel Shtilianov <ker...@kyup.com> writes:
> 
>> Currently the same boot_id is reported for all containers running
>> on a host node, including the host node itself. Even after restarting
>> a container it will still have the same persistent boot_id.
>>
>> This can cause troubles in cases where you have multiple containers
>> from the same cluster on one host node. The software inside each
>> container will get the same boot_id and thus fail to join the cluster,
>> after the first container from the node has already joined.
>>
>> The UTS namespace, on the other hand, keeps machine-specific data,
>> so it seems to be the correct place to move the boot_id and
>> instantiate it, so that each container with its own UTS namespace
>> gets a unique id for its own boot lifetime.
> 
> Technically this really needs to use the sysctl infrastructure that
> allows you to register different files in different namespaces.  That
> way the value you read from proc_do_uuid will be based on who opens the
> file not on who is reading the file.

Ok, so would you accept a patch that reimplements boot_id through the sysctl
infrastructure?

> Practically why does a bind mount on top of boot_id work?  What makes
> this a general problem worth solving in the kernel?  Why is hiding the
> fact that you are running the same instance of the same kernel a useful
> thing? That is the reality.

The problem is that distros do not know they are running in a container and
don't know that they have to bind mount something on top of boot_id.
You would need to tell Docker, LXC/LXD, and all other container runtimes that
they need to do this bind mount for boot_id.
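For reference, the bind-mount workaround under discussion can be sketched as a per-container-start step a runtime could perform. The helper name and file path below are illustrative, not any real runtime's API, and the mount step is guarded since it requires root inside the container's mount namespace:

```python
import os
import subprocess
import uuid

def setup_container_boot_id(statedir="/tmp", do_mount=False):
    """Sketch of a per-container-start step a runtime could perform.

    Writes a fresh UUID in the same format as boot_id, then (optionally,
    and only with privilege) bind-mounts it over the kernel's file so
    software inside the container reads a unique value.
    """
    path = os.path.join(statedir, "container-boot-id")
    with open(path, "w") as f:
        f.write(str(uuid.uuid4()) + "\n")
    if do_mount and os.geteuid() == 0:
        # This is the bind mount distros would otherwise never know to do.
        subprocess.run(
            ["mount", "--bind", path, "/proc/sys/kernel/random/boot_id"],
            check=True)
    return path

path = setup_container_boot_id()
print(open(path).read().strip())  # a fresh UUID for this "boot"
```

The point of the thread is precisely that every runtime would need to carry a step like this, which is why a kernel-side solution is being proposed.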
I consider this to be a general issue that lacks a good general solution in
userspace.
The kernel provides this boot_id interface, but it gives wrong data in the
context of containers.
Proposing to fix this problem in userspace seems like ignoring the issue.
You could have said to the Consul guys that they should simply stop using
boot_id, because it doesn't work correctly in containers.

Marian

> 
> Eric
> 
> 
> 
> 
>> Signed-off-by: Angel Shtilianov <ker...@kyup.com>
>> ---
>>  drivers/char/random.c   | 4 ++++
>>  include/linux/utsname.h | 1 +
>>  kernel/utsname.c        | 4 +++-
>>  3 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/char/random.c b/drivers/char/random.c
>> index ec42c8bb9b0d..e05daf7f38f4 100644
>> --- a/drivers/char/random.c
>> +++ b/drivers/char/random.c
>> @@ -1960,6 +1960,10 @@ static int proc_do_uuid(struct ctl_table *table, int write,
>>  unsigned char buf[64], tmp_uuid[16], *uuid;
>>  
>>  uuid = table->data;
>> +#ifdef CONFIG_UTS_NS
>> +if (!!uuid && (uuid == (unsigned char *)sysctl_bootid))
>> +uuid = current->nsproxy->uts_ns->sysctl_bootid;
>> +#endif
>>  if (!uuid) {
>>  uuid = tmp_uuid;
>>  generate_random_uuid(uuid);
>> diff --git a/include/linux/utsname.h b/include/linux/utsname.h
>> index c8060c2ecd04..f704aca3e95a 100644
>> --- a/include/linux/utsname.h
>> +++ b/include/linux/utsname.h
>> @@ -27,6 +27,7 @@ struct uts_namespace {
>>  struct user_namespace *user_ns;
>>  struct ucounts *ucounts;
>>  struct ns_common ns;
>> +char sysctl_bootid[16];
>>  } __randomize_layout;
>>  extern struct uts_namespace init_uts_ns;
>>  
>> diff --git a/kernel/utsname.c b/kernel/utsname.c
>> index 913fe4336d2b..f1749cdcd341 100644
>> --- a/kernel/utsname.c
>> +++ b/kernel/utsname.c
>> @@ -34,8 +34,10 @@ static struct uts_namespace *create_uts_ns(void)
>>  struct uts_namespace *uts_ns;
>>  
>>  uts_ns = kmalloc(sizeof(struct uts_namespace), GFP_KERNEL);
>> -if (uts_ns)
>> +if (uts_ns) {
>>  kref_init(&uts_ns->kref);
>> +memset(uts_ns->sysctl_bootid, 0, 16);
>> +}
>>  return uts_ns;
>>  }
> 


-- 
Marian Marinov
Co-founder & CTO of Kyup.com
Jabber/GTalk: hack...@jabber.org
IRQ: 7556201
IRC: hackman @ irc.freenode.net
Mobile: +359 886 660 270



Re: Patch breaks suspend

2015-05-05 Thread Marian Marinov
On 05/05/2015 02:37 AM, Rafael J. Wysocki wrote:
> On Saturday, May 02, 2015 11:27:32 PM Marian Marinov wrote:
>> Hi guys,
>> I have Lenovo T520 with one SSD and one SATA drive.
>>
>> I tried to upgrade to Linux 4.0 and found that after suspend and resume I 
>> can't access the second (SATA) drive.
>> Both drives have bios encryption enabled.
>>
>> I did a bisect and found that the following patch causes the issue:
>> commit 5d5132059a1f652de9dc2d62a8ff15561e648d11
>> Author: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
>> Date:   Sat Feb 22 00:48:31 2014 +0100
>>
>> ACPI / ATA: Add hotplug contexts to ACPI companions of SATA devices
>>
>> Modify the SATA subsystem to add hotplug contexts to ACPI companions
>> of SATA devices and ports instead of registering special ACPI dock
>> operations using register_hotplug_dock_device().
>>
>> That change will allow the entire code handling those special ACPI
>> dock operations to be dropped in the next commit.
>>
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
>> Reviewed-by: Aaron Lu <aaron...@intel.com>
>> Acked-by: Tejun Heo <t...@kernel.org>
>>
>> Unfortunately I do not understand this part of the code and have no idea 
>> what I can do.
>> Any pointers would be very appreciated.
> Well, not right from the top of my head, but this looks really suspicious to 
> me.
>
> Can you please file a bug entry for this at bugzilla.kernel.org (in the 
> ACPI/BIOS
> category), assign it to me and CC Aaron?
BUG created: https://bugzilla.kernel.org/show_bug.cgi?id=97731

Added you and Aaron to the CC list.

What additional info can I provide you? Would you like any debug info from the
kernel itself?
Dmesg output?

Marian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Patch breaks suspend

2015-05-02 Thread Marian Marinov
Hi guys,
I have Lenovo T520 with one SSD and one SATA drive.

I tried to upgrade to Linux 4.0 and found that after suspend and resume I can't 
access the second (SATA) drive.
Both drives have bios encryption enabled.

I did a bisect and found that the following patch causes the issue:
commit 5d5132059a1f652de9dc2d62a8ff15561e648d11
Author: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Date:   Sat Feb 22 00:48:31 2014 +0100

ACPI / ATA: Add hotplug contexts to ACPI companions of SATA devices

Modify the SATA subsystem to add hotplug contexts to ACPI companions
of SATA devices and ports instead of registering special ACPI dock
operations using register_hotplug_dock_device().

That change will allow the entire code handling those special ACPI
dock operations to be dropped in the next commit.

Signed-off-by: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
Reviewed-by: Aaron Lu <aaron...@intel.com>
Acked-by: Tejun Heo <t...@kernel.org>

Unfortunately I do not understand this part of the code and have no idea what I 
can do.
Any pointers would be very appreciated.

Best regards,
Marian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH v3 2/2] cgroups: add a pids subsystem

2015-03-05 Thread Marian Marinov
Hi Aleksa,
would you be willing to put your patches online in a repo, like what Dwight
Engen did 3 years ago?
  https://github.com/dwengen/linux/tree/cpuacct-task-limit-3.14

I'm using his patchset for more than a year now. However, I would be happy to
experiment with your patches as well.

And having a public repo from where I can pull is extremely good for people
like me. It is easier for testing and long-term support.

Thank you for your work on this matter,
Marian

On 03/04/2015 10:23 PM, Aleksa Sarai wrote:
> Adds a new single-purpose pids subsystem to limit the number of
> tasks that can run inside a cgroup. Essentially this is an
> implementation of RLIMIT_NPROC that applies to a cgroup rather than
> a process tree.
>
> PIDs are fundamentally a global resource, and it is possible to reach
> PID exhaustion inside a cgroup without hitting any reasonable kmemcg
> policy. Once you've hit PID exhaustion, you're only in a marginally
> better state than OOM. This subsystem allows PID exhaustion inside a
> cgroup to be prevented.
>
> Signed-off-by: Aleksa Sarai <cyp...@cyphar.com>
> ---
>  include/linux/cgroup_subsys.h |   4 +
>  init/Kconfig  |  12 ++
>  kernel/Makefile   |   1 +
>  kernel/cgroup_pids.c  | 281 
> ++
>  4 files changed, 298 insertions(+)
>  create mode 100644 kernel/cgroup_pids.c
>
> diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
> index e4a96fb..a198822 100644
> --- a/include/linux/cgroup_subsys.h
> +++ b/include/linux/cgroup_subsys.h
> @@ -47,6 +47,10 @@ SUBSYS(net_prio)
>  SUBSYS(hugetlb)
>  #endif
>  
> +#if IS_ENABLED(CONFIG_CGROUP_PIDS)
> +SUBSYS(pids)
> +#endif
> +
>  /*
>   * The following subsystems are not supported on the default hierarchy.
>   */
> diff --git a/init/Kconfig b/init/Kconfig
> index f5dbc6d..58f104a 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1054,6 +1054,18 @@ config CGROUP_HUGETLB
> control group is tracked in the third page lru pointer. This means
> that we cannot use the controller with huge page less than 3 pages.
>  
> +config CGROUP_PIDS
> + bool "Process number limiting on cgroups"
> + depends on PAGE_COUNTER
> + help
> +   This option enables the setting of process number limits in the scope
> +   of a cgroup. Any attempt to fork more processes than is allowed in the
> +   cgroup will fail. PIDs are fundamentally a global resource because it
> +   is fairly trivial to reach PID exhaustion before you reach even a
> +   conservative kmemcg limit. As a result, it is possible to grind a
> +   system to a halt without being limited by other cgroup policies. The
> +   pids cgroup subsystem is designed to stop this from happening.
> +
>  config CGROUP_PERF
>   bool "Enable perf_event per-cpu per-container group (cgroup) monitoring"
>   depends on PERF_EVENTS && CGROUPS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 1408b33..e823592 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>  obj-$(CONFIG_COMPAT) += compat.o
>  obj-$(CONFIG_CGROUPS) += cgroup.o
>  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
> +obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
>  obj-$(CONFIG_CPUSETS) += cpuset.o
>  obj-$(CONFIG_UTS_NS) += utsname.o
>  obj-$(CONFIG_USER_NS) += user_namespace.o
> diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
> new file mode 100644
> index 000..65cbab3
> --- /dev/null
> +++ b/kernel/cgroup_pids.c
> @@ -0,0 +1,281 @@
> +/*
> + * Process number limiting subsys for cgroups.
> + *
> + * Copyright (C) 2015 Aleksa Sarai <cyp...@cyphar.com>
> + *
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/atomic.h>
> +#include <linux/cgroup.h>
> +#include <linux/slab.h>
> +
> +#define PIDS_UNLIMITED -1
> +
> +struct pids {
> + struct pids *parent;
> + struct cgroup_subsys_state css;
> +
> + atomic_long_t counter;
> + long limit;
> +};
> +
> +static inline struct pids *css_pids(struct cgroup_subsys_state *css)
> +{
> + return css ? container_of(css, struct pids, css) : NULL;
> +}
> +
> +static inline struct pids *task_pids(struct task_struct *task)
> +{
> + return css_pids(task_css(task, pids_cgrp_id));
> +}
> +
> +static struct pids *parent_pids(struct pids *pids)
> +{
> + return css_pids(pids->css.parent);
> +}
> +
> +static struct cgroup_subsys_state *
> +pids_css_alloc(struct cgroup_subsys_state *parent)
> +{
> + struct pids *pids;
> +
> + pids = kzalloc(sizeof(struct pids), GFP_KERNEL);
> + if (!pids)
> + return ERR_PTR(-ENOMEM);
> +
> + return &pids->css;
> +}
> +
> +static int pids_css_online(struct cgroup_subsys_state *css)
> +{
> + struct pids *pids = css_pids(css);
> + long limit = -1;
> +
> + pids->parent = parent_pids(pids);
> + if (pids->parent)
> + limit = pids->parent->limit;
> +
> + pids->limit = limit;
> + atomic_long_set(&pids->counter, 0);
> + return 0;
> +}
> +
> +static void pids_css_free(struct cgroup_subsys_state *css)
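The charging scheme the patch implements — a per-cgroup counter checked against a possibly inherited limit, walking up the hierarchy — can be modelled in a few lines of Python. This is a toy model for illustration only, not the kernel code, which uses atomic counters and the css lifecycle shown above:

```python
PIDS_UNLIMITED = -1

class Pids:
    """Toy model of the pids subsystem's bookkeeping."""
    def __init__(self, parent=None, limit=None):
        self.parent = parent
        # Mirrors pids_css_online(): a new css inherits its parent's limit.
        if limit is None:
            limit = parent.limit if parent else PIDS_UNLIMITED
        self.limit = limit
        self.counter = 0

    def try_charge(self, n=1):
        """Charge n tasks against this cgroup and every ancestor.

        Fails (and rolls back) if any level would exceed its limit,
        which is what makes a fork beyond the limit fail.
        """
        node = self
        while node:
            if node.limit != PIDS_UNLIMITED and node.counter + n > node.limit:
                # Roll back the charges already applied below this level.
                cur = self
                while cur is not node:
                    cur.counter -= n
                    cur = cur.parent
                return False
            node.counter += n
            node = node.parent
        return True

root = Pids(limit=4)
child = Pids(parent=root, limit=2)
results = [child.try_charge() for _ in range(3)]
print(results)        # the third fork exceeds the child's limit of 2
print(root.counter)   # only successful charges remain counted
```

The hierarchical walk is what distinguishes this from a plain RLIMIT_NPROC: a child can be stopped by its own limit or by any ancestor's.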

Re: [PATCH v3 2/2] cgroups: add a pids subsystem

2015-03-05 Thread Marian Marinov
Hi Aleksa,
would you be willing to put your patches online in a repo like what Dwight 
Engen did 3 years ago.
  https://github.com/dwengen/linux/tree/cpuacct-task-limit-3.14

I'm using his patchset for more then a year now. However I would be happy to 
experiment with your patches as well.

And heaving a public repo from where I can pull is extremely good for people 
like me. It is easier for testing and long
term support.

Thank you for your work on this matter,
Marian

On 03/04/2015 10:23 PM, Aleksa Sarai wrote:
 Adds a new single-purpose pids subsystem to limit the number of
 tasks that can run inside a cgroup. Essentially this is an
 implementation of RLIMIT_NPROC that will applies to a cgroup rather than
 a process tree.

 PIDs are fundamentally a global resource, and it is possible to reach
 PID exhaustion inside a cgroup without hitting any reasonable kmemcg
 policy. Once you've hit PID exhaustion, you're only in a marginally
 better state than OOM. This subsystem allows PID exhaustion inside a
 cgroup to be prevented.

 Signed-off-by: Aleksa Sarai cyp...@cyphar.com
 ---
  include/linux/cgroup_subsys.h |   4 +
  init/Kconfig  |  12 ++
  kernel/Makefile   |   1 +
  kernel/cgroup_pids.c  | 281 
 ++
  4 files changed, 298 insertions(+)
  create mode 100644 kernel/cgroup_pids.c

 diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
 index e4a96fb..a198822 100644
 --- a/include/linux/cgroup_subsys.h
 +++ b/include/linux/cgroup_subsys.h
 @@ -47,6 +47,10 @@ SUBSYS(net_prio)
  SUBSYS(hugetlb)
  #endif
  
 +#if IS_ENABLED(CONFIG_CGROUP_PIDS)
 +SUBSYS(pids)
 +#endif
 +
  /*
   * The following subsystems are not supported on the default hierarchy.
   */
 diff --git a/init/Kconfig b/init/Kconfig
 index f5dbc6d..58f104a 100644
 --- a/init/Kconfig
 +++ b/init/Kconfig
 @@ -1054,6 +1054,18 @@ config CGROUP_HUGETLB
 control group is tracked in the third page lru pointer. This means
 that we cannot use the controller with huge page less than 3 pages.
  
 +config CGROUP_PIDS
 + bool Process number limiting on cgroups
 + depends on PAGE_COUNTER
 + help
 +   This options enables the setting of process number limits in the scope
 +   of a cgroup. Any attempt to fork more processes than is allowed in the
 +   cgroup will fail. PIDs are fundamentally a global resource because it
 +   is fairly trivial to reach PID exhaustion before you reach even a
 +   conservative kmemcg limit. As a result, it is possible to grind a
 +   system to halt without being limited by other cgroup policies. The 
 pids
 +   cgroup subsystem is designed to stop this from happening.
 +
  config CGROUP_PERF
   bool Enable perf_event per-cpu per-container group (cgroup) monitoring
   depends on PERF_EVENTS  CGROUPS
 diff --git a/kernel/Makefile b/kernel/Makefile
 index 1408b33..e823592 100644
 --- a/kernel/Makefile
 +++ b/kernel/Makefile
 @@ -53,6 +53,7 @@ obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
  obj-$(CONFIG_COMPAT) += compat.o
  obj-$(CONFIG_CGROUPS) += cgroup.o
  obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
 +obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
  obj-$(CONFIG_CPUSETS) += cpuset.o
  obj-$(CONFIG_UTS_NS) += utsname.o
  obj-$(CONFIG_USER_NS) += user_namespace.o
 diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
 new file mode 100644
 index 000..65cbab3
 --- /dev/null
 +++ b/kernel/cgroup_pids.c
 @@ -0,0 +1,281 @@
 +/*
 + * Process number limiting subsys for cgroups.
 + *
 + * Copyright (C) 2015 Aleksa Sarai cyp...@cyphar.com
 + *
 + */
 +
 +#include <linux/kernel.h>
 +#include <linux/atomic.h>
 +#include <linux/cgroup.h>
 +#include <linux/slab.h>
 +
 +#define PIDS_UNLIMITED -1
 +
 +struct pids {
 + struct pids *parent;
 + struct cgroup_subsys_state css;
 +
 + atomic_long_t counter;
 + long limit;
 +};
 +
 +static inline struct pids *css_pids(struct cgroup_subsys_state *css)
 +{
 + return css ? container_of(css, struct pids, css) : NULL;
 +}
 +
 +static inline struct pids *task_pids(struct task_struct *task)
 +{
 + return css_pids(task_css(task, pids_cgrp_id));
 +}
 +
 +static struct pids *parent_pids(struct pids *pids)
 +{
 + return css_pids(pids->css.parent);
 +}
 +
 +static struct cgroup_subsys_state *
 +pids_css_alloc(struct cgroup_subsys_state *parent)
 +{
 + struct pids *pids;
 +
 + pids = kzalloc(sizeof(struct pids), GFP_KERNEL);
 + if (!pids)
 + return ERR_PTR(-ENOMEM);
 +
 + return &pids->css;
 +}
 +
 +static int pids_css_online(struct cgroup_subsys_state *css)
 +{
 + struct pids *pids = css_pids(css);
 + long limit = -1;
 +
 + pids->parent = parent_pids(pids);
 + if (pids->parent)
 + limit = pids->parent->limit;
 +
 + pids->limit = limit;
 + atomic_long_set(&pids->counter, 0);
 + return 0;
 +}
 +
 +static void pids_css_free(struct cgroup_subsys_state *css)
 +{
 + kfree(css_pids(css));
 +}

Re: RFC: sysctl fixes

2015-03-04 Thread Marian Marinov
On 03/04/2015 05:48 PM, Stephan Mueller wrote:
> Am Mittwoch, 4. März 2015, 15:18:19 schrieb Marian Marinov:
>
> Hi Marian,
>
>> Hello,
>> we are running hundreds of containers and we got some valid requests
>> from customers who want to be able to change their container's
>> hostname using the sysctl command.
>>
>> The current implementation does not allow each UTS namespace to change
>> its own hostname.
>>
>> So we sponsored the development of fix for this issue.
>>
>> Please checkout the changes here:
>>
>> https://github.com/1HLtd/linux/compare/b24e2bdde4af656bb0679a101265ebb
>> 8f8735d3c...sysctl-hostname-fixes
>>
>> If this fix is acceptable I'll send the patches as it is common. But
>> first I want to know if you think this is the way to go.
> The description in http://lwn.net/Articles/531381/ does not work?

Stephan, the users can change the hostname of the container using the hostname
command. I can also change their hostname when I'm creating the containers.
However, the customers need to be able to change the hostname of the machine
using the standard system commands:

sysctl kernel.hostname=newhost
sysctl kernel.domainname=newdomain

Both of these commands talk to the kernel via the /proc filesystem. This is why
we propose these fixes. Without them, only root on the host node can do this.

Correct me if I'm wrong.

Marian

>> Thank you,
>> Marian
>>
>
> Ciao
> Stephan
>
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RFC: sysctl fixes

2015-03-04 Thread Marian Marinov
Hello,
we are running hundreds of containers, and we got some valid requests from
customers who want to be able to change their container's hostname using the
sysctl command.

The current implementation does not allow each UTS namespace to change its own 
hostname.

So we sponsored the development of a fix for this issue.

Please check out the changes here:
  
https://github.com/1HLtd/linux/compare/b24e2bdde4af656bb0679a101265ebb8f8735d3c...sysctl-hostname-fixes

If this fix is acceptable I'll send the patches in the usual way. But first I
want to know whether you think this is the way to go.

Thank you,
Marian




Re: [RFC] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening

2014-06-10 Thread Marian Marinov

On 06/10/2014 02:52 PM, Michal Hocko wrote:
> [More people to CC] On Tue 10-06-14 14:35:02, Marian Marinov wrote:
>> -BEGIN PGP SIGNED MESSAGE- Hash: SHA1
>> 
>> Hello,
> 
> Hi,
> 
>> a while back in 2012 there was a request for this functionality. oom, memcg: 
>> handle sysctl
>> oom_kill_allocating_task while memcg oom happening
>> 
>> This is the thread: https://lkml.org/lkml/2012/10/16/168
>> 
>> Now we run a several machines with around 10k processes on each machine, 
>> using containers.
>> 
>> Regularly we see OOM from within a container that causes performance 
>> degradation.
> 
> What kind of performance degradation and which parts of the system are 
> affected?

The responsiveness to SSH terminals and DB queries on the host machine is 
significantly slowed.
I'm still unsure what exactly to measure.

> 
> memcg oom killer happens outside of any locks currently so the only 
> bottleneck I can see is the per-cgroup
> container which iterates all tasks in the group. Is this what is going on 
> here?

When the container has 1000s of processes it seems that there is a problem. But
I'm not sure, and will be happy to put some diagnostic lines of code there.

> 
>> We are running 3.12.20 with the following OOM configuration and memcg oom 
>> enabled:
>> 
>> vm.oom_dump_tasks = 0 vm.oom_kill_allocating_task = 1 vm.panic_on_oom = 0
>> 
>> When OOM occurs we see very high numbers for the loadavg and the overall 
>> responsiveness of the machine degrades.
> 
> What is the system waiting for?

I don't know, since I was not the one to actually handle the case. However, my
guys are instructed to collect iostat and vmstat information from the machines
the next time this happens.

> 
>> During these OOM states the load of the machine gradualy increases from 25 
>> up to 120 in the interval of
>> 10minutes.
>> 
>> Once we manually bring down the memory usage of a container(killing some 
>> tasks) the load drops down to 25 within
>> 5 to 7 minutes.
> 
> So the OOM killer is not able to find a victim to kill?

It was constantly killing tasks: 245 oom invocations in less than 6 minutes for
that particular cgroup, with a peak of 61 oom invocations in one minute.

It was killing... In that particular case, the problem was a web server that
was under attack. New php processes were spawned very often, and instead of
killing each newly created process (which is allocating memory) the kernel
tries to find a more suitable task. Which, in this case, was not desired.

> 
>> I read the whole thread from 2012 but I do not see the expected behavior 
>> that is described by the people that
>> commented the issue.
> 
> Why do you think that killing the allocating task would be helpful in your 
> case?

As mentioned above, the usual case with hosting companies is that the
allocating task should not be allowed to run. So killing it is the proper
solution there.

Essentially we solved the issue by setting a process limit on that particular
cgroup using the task-limit patches of Dwight Engen.

> 
>> In this case, with real usage for this patch, would it be considered for 
>> inclusion?
> 
> I would still prefer to fix the real issue which is not clear from your 
> description yet.

I would love to have a better way to solve the issue.

Marian



--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


[RFC] oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening

2014-06-10 Thread Marian Marinov

Hello,
a while back in 2012 there was a request for this functionality.
  oom, memcg: handle sysctl oom_kill_allocating_task while memcg oom happening

This is the thread:  https://lkml.org/lkml/2012/10/16/168

Now we run several machines with around 10k processes on each machine, using
containers.

Regularly we see OOM from within a container that causes performance 
degradation.

We are running 3.12.20 with the following OOM configuration and memcg oom 
enabled:

vm.oom_dump_tasks = 0
vm.oom_kill_allocating_task = 1
vm.panic_on_oom = 0

When OOM occurs we see very high numbers for the loadavg and the overall 
responsiveness of the machine degrades.

During these OOM states the load of the machine gradually increases from 25 up
to 120 over an interval of 10 minutes.

Once we manually bring down the memory usage of a container (killing some
tasks) the load drops back to 25 within 5 to 7 minutes.

I read the whole thread from 2012, but I do not see the expected behavior that
is described by the people who commented on the issue.

In this case, with real usage for this patch, would it be considered for 
inclusion?

Marian


--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270



Re: [RFC] Per-user namespace process accounting

2014-06-03 Thread Marian Marinov

On 06/03/2014 08:54 PM, Eric W. Biederman wrote:
> Serge Hallyn  writes:
> 
>> Quoting Pavel Emelyanov (xe...@parallels.com):
>>> On 05/29/2014 07:32 PM, Serge Hallyn wrote:
>>>> Quoting Marian Marinov (m...@1h.com):
>>>>> We are not using NFS. We are using a shared block storage that offers us 
>>>>> snapshots. So provisioning new
>>>>> containers is extremely cheep and fast. Comparing that with untar is 
>>>>> comparing a race car with Smart. Yes
>>>>> it can be done and no, I do not believe we should go backwards.
>>>>> 
>>>>> We do not share filesystems between containers, we offer them block 
>>>>> devices.
>>>> 
>>>> Yes, this is a real nuisance for openstack style deployments.
>>>> 
>>>> One nice solution to this imo would be a very thin stackable filesystem 
>>>> which does uid shifting, or, better
>>>> yet, a non-stackable way of shifting uids at mount.
>>> 
>>> I vote for non-stackable way too. Maybe on generic VFS level so that 
>>> filesystems don't bother with it. From
>>> what I've seen, even simple stacking is quite a challenge.
>> 
>> Do you have any ideas for how to go about it?  It seems like we'd have to 
>> have separate inodes per mapping for
>> each file, which is why of course stacking seems "natural" here.
>> 
>> Trying to catch the uid/gid at every kernel-userspace crossing seems like a 
>> design regression from the current
>> userns approach.  I suppose we could continue in the kuid theme and 
>> introduce a iiud/igid for the in-kernel inode
>> uid/gid owners.  Then allow a user privileged in some ns to create a new 
>> mount associated with a different
>> mapping for any ids over which he is privileged.
> 
> There is a simple solution.
> 
> We pick the filesystems we choose to support. We add privileged mounting in a 
> user namespace. We create the user
> and mount namespace. Global root goes into the target mount namespace with 
> setns and performs the mounts.
> 
> 90% of that work is already done.
> 
> As long as we don't plan to support XFS (as it XFS likes to expose it's 
> implementation details to userspace) it
> should be quite straight forward.
> 
> The permission check change would probably only need to be:
> 
> 
> @@ -2180,6 +2245,10 @@ static int do_new_mount(struct path *path, const char *fstype, int flags,
> 		return -ENODEV;
> 
> 	if (user_ns != init_user_ns) {
> +		if (!(type->fs_flags & FS_UNPRIV_MOUNT) && !capable(CAP_SYS_ADMIN)) {
> +			put_filesystem(type);
> +			return -EPERM;
> +		}
> 		if (!(type->fs_flags & FS_USERNS_MOUNT)) {
> 			put_filesystem(type);
> 			return -EPERM;
> 
> 
> There are also a few funnies with capturing the user namespace of the 
> filesystem when we perform the mount (in the
> superblock?), and not allowing a mount of that same filesystem in a different 
> user namespace.
> 
> But as long as the kuid conversions don't measurably slow down the filesystem 
> when mounted in the initial mount and
> user namespaces I don't see how this would be a problem for anyone, and is 
> very little code.
> 

This may solve one of the problems, but it does not solve the issue with
UID/GID maps that overlap in different user namespaces.
In our case, this means breaking container migration mechanisms.

Will this be addressed at all, or am I the only one here who has this sort of
requirement?

Marian


> 
> Eric
> 


--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270



Re: [RFC] Per-user namespace process accounting

2014-05-29 Thread Marian Marinov

On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
> Marian Marinov  writes:
> 
>> Hello,
>> 
>> I have the following proposition.
>> 
>> Number of currently running processes is accounted at the root user 
>> namespace. The problem I'm facing is that
>> multiple containers in different user namespaces share the process counters.
> 
> That is deliberate.

And I understand that very well ;)

> 
>> So if containerX runs 100 with UID 99, containerY should have NPROC limit of 
>> above 100 in order to execute any 
>> processes with ist own UID 99.
>> 
>> I know that some of you will tell me that I should not provision all of my 
>> containers with the same UID/GID maps,
>> but this brings another problem.
>> 
>> We are provisioning the containers from a template. The template has a lot 
>> of files 500k and more. And chowning
>> these causes a lot of I/O and also slows down provisioning considerably.
>> 
>> The other problem is that when we migrate one container from one host 
>> machine to another the IDs may be already
>> in use on the new machine and we need to chown all the files again.
> 
> You should have the same uid allocations for all machines in your fleet as 
> much as possible.   That has been true
> ever since NFS was invented and is not new here.  You can avoid the cost of 
> chowning if you untar your files inside
> of your user namespace.  You can have different maps per machine if you are 
> crazy enough to do that.  You can even
> have shared uids that you use to share files between containers as long as 
> none of those files is setuid.  And map
> those shared files to some kind of nobody user in your user namespace.

We are not using NFS. We are using a shared block storage that offers us
snapshots, so provisioning new containers is extremely cheap and fast.
Comparing that with untar is comparing a race car with a Smart car. Yes, it can
be done, and no, I do not believe we should go backwards.

We do not share filesystems between containers, we offer them block devices.

> 
>> Finally if we use different UID/GID maps we can not do live migration to 
>> another node because the UIDs may be
>> already in use.
>> 
>> So I'm proposing one hack modifying unshare_userns() to allocate new 
>> user_struct for the cred that is created for
>> the first task creating the user_ns and free it in exit_creds().
> 
> I do not like the idea of having user_structs be per user namespace, and 
> deliberately made the code not work that
> way.
> 
>> Can you please comment on that?
> 
> I have been pondering having some recursive resources limits that are per 
> user namespace and if all you are worried
> about are process counts that might work.  I don't honestly know what makes 
> sense at the moment.

It seems to me that the only limits (from RLIMIT) that are generally a problem
for the namespaces are the number of processes and pending signals.
This is why I proposed the above modification. However, I'm not sure if the
places I have chosen are right, and I'm also not really convinced that having a
per-namespace user_struct is the right approach for the process counter.

> 
> Eric
> 
Marian

--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


[RFC] Per-user namespace process accounting

2014-05-29 Thread Marian Marinov

Hello,

I have the following proposition.

The number of currently running processes is accounted at the root user
namespace. The problem I'm facing is that multiple containers in different
user namespaces share the process counters.

So if containerX runs 100 processes with UID 99, containerY should have an
NPROC limit above 100 in order to execute any processes with its own UID 99.

I know that some of you will tell me that I should not provision all of my 
containers with the same UID/GID maps, but
this brings another problem.

We are provisioning the containers from a template. The template has a lot of
files, 500k and more, and chowning these causes a lot of I/O and also slows
down provisioning considerably.

The other problem is that when we migrate one container from one host machine
to another, the IDs may already be in use on the new machine, and we need to
chown all the files again.

Finally, if we use different UID/GID maps we cannot do live migration to
another node because the UIDs may already be in use.

So I'm proposing one hack: modifying unshare_userns() to allocate a new
user_struct for the cred that is created for the first task creating the
user_ns, and freeing it in exit_creds().

Can you please comment on that?

Or suggest a better solution?

Best regards,
Marian


--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


Re: [RFC] Per-user namespace process accounting

2014-05-29 Thread Marian Marinov

On 05/29/2014 01:06 PM, Eric W. Biederman wrote:
> Marian Marinov m...@1h.com writes:
>
>> Hello,
>>
>> I have the following proposition.
>>
>> Number of currently running processes is accounted at the root user
>> namespace. The problem I'm facing is that multiple containers in
>> different user namespaces share the process counters.
>
> That is deliberate.

And I understand that very well ;)

>> So if containerX runs 100 processes with UID 99, containerY should have
>> an NPROC limit of above 100 in order to execute any processes with its
>> own UID 99.
>>
>> I know that some of you will tell me that I should not provision all of
>> my containers with the same UID/GID maps, but this brings another
>> problem.
>>
>> We are provisioning the containers from a template. The template has a
>> lot of files, 500k and more. And chowning these causes a lot of I/O and
>> also slows down provisioning considerably.
>>
>> The other problem is that when we migrate one container from one host
>> machine to another, the IDs may already be in use on the new machine
>> and we need to chown all the files again.
>
> You should have the same uid allocations for all machines in your fleet
> as much as possible.  That has been true ever since NFS was invented and
> is not new here.  You can avoid the cost of chowning if you untar your
> files inside of your user namespace.  You can have different maps per
> machine if you are crazy enough to do that.  You can even have shared
> uids that you use to share files between containers as long as none of
> those files is setuid.  And map those shared files to some kind of
> nobody user in your user namespace.

We are not using NFS. We are using a shared block storage that offers us
snapshots. So provisioning new containers is extremely cheap and fast.
Comparing that with untar is comparing a race car with a Smart. Yes, it
can be done, but no, I do not believe we should go backwards.

We do not share filesystems between containers, we offer them block devices.

>> Finally if we use different UID/GID maps we can not do live migration
>> to another node because the UIDs may be already in use.
>>
>> So I'm proposing one hack modifying unshare_userns() to allocate a new
>> user_struct for the cred that is created for the first task creating
>> the user_ns and free it in exit_creds().
>
> I do not like the idea of having user_structs be per user namespace, and
> deliberately made the code not work that way.
>
>> Can you please comment on that?
>
> I have been pondering having some recursive resource limits that are
> per user namespace, and if all you are worried about are process counts
> that might work.  I don't honestly know what makes sense at the moment.

It seems to me that the only limits (from RLIMIT) that are generally a
problem for the namespaces are the number of processes and pending
signals. This is why I proposed the above modification. However, I'm not
sure if the places I have chosen are right, and I'm also not really
convinced that having a per-namespace user_struct is the right approach
for the process counter.

> Eric

Marian

- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


[RFC] Per user namespace process accounting

2014-05-28 Thread Marian Marinov

Hello,

I have the following proposition.

The number of currently running processes is accounted in the root user
namespace. The problem I'm facing is that multiple containers in
different user namespaces share the process counters.

So if containerX runs 100 processes with UID 99, containerY should have
an NPROC limit of above 100 in order to execute any processes with its
own UID 99.

I know that some of you will tell me that I should not provision all of my 
containers with the same UID/GID maps, but
this brings another problem.

We are provisioning the containers from a template. The template has a lot
of files, 500k and more, and chowning these causes a lot of I/O and also
slows down provisioning considerably.

The other problem is that when we migrate one container from one host
machine to another, the IDs may already be in use on the new machine and
we need to chown all the files again.

Finally, if we use different UID/GID maps we cannot do live migration to
another node, because the UIDs may already be in use.

So I'm proposing one hack: modify unshare_userns() to allocate a new
user_struct for the cred that is created for the first task creating the
user_ns, and free it in exit_creds().

Can you please comment on that?

Or suggest a better solution?

Best regards,
Marian


- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270




Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default

2014-05-25 Thread Marian Marinov

Hi Manfred,

On 05/25/2014 11:01 PM, Manfred Spraul wrote:
> Hi Marian,
> 
> On 05/22/2014 03:01 PM, Marian Marinov wrote:
>> On 05/05/2014 10:59 PM, Marian Marinov wrote:
>>>
>>> In my tests it worked exactly as expected. Here is an example:
>>>
>>> [root@sp2 ~]# sysctl -a|grep shmmax
>>> kernel.shmmax = 68719476736
>>> [root@sp2 ~]# lxc-attach -n cent_plain
>>> [root@localhost ~]# sysctl -a|grep shmmax
>>> kernel.shmmax = 68719476736
>>> [root@localhost ~]# halt
>>> [root@sp2 ~]# sysctl -a|grep shmmax
>>> kernel.shmmax = 68719476736
>>> [root@sp2 ~]# sysctl kernel.shmmax=34359738368
>>> kernel.shmmax = 34359738368
>>> [root@sp2 ~]# lxc-start -n cent_plain -d
>>> [root@sp2 ~]# lxc-attach -n cent_plain
>>> [root@localhost ~]# sysctl -a|grep shmmax
>>> kernel.shmmax = 34359738368
>>> [root@localhost ~]#
>>>
>>> So it seems to work as expected :)
>>>
>>> It works because when you set up a new shmmax limit it is actually the
>>> limit in the init_ipc_ns. So when we are creating a new ipc_ns it is
>>> ok to copy the values from init_ipc_ns.
>>>
>>> -Marian
>>>
>> Ping?
>> 
>> So will there be any more comments on that?

It seems logical. I like your approach even better.
I'll test it tomorrow to confirm that all of my tests are successful.

Marian

> Attached is an untested idea:
> - each new namespace copies from its parent, i.e. nested namespaces
>   should work
> - msg, sem and shm updated
> 
> -- Manfred
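A toy model, not kernel code, of the inheritance rule in that idea: each new ipc namespace copies its limits from its parent's current values rather than from the compiled-in defaults. The class name and numbers below are illustrative (the numbers reuse the shmmax values from the transcript in this thread):

```python
# Toy model (not kernel code) of copying ipc limits from the parent
# namespace at creation time instead of from the compiled-in default.

DEFAULT_SHMMAX = 68719476736  # stand-in for the compiled-in default

class IpcNamespace:
    def __init__(self, parent=None):
        # Copy the parent's *current* value, not the default.
        self.shmmax = parent.shmmax if parent else DEFAULT_SHMMAX

init_ipc_ns = IpcNamespace()
init_ipc_ns.shmmax = 34359738368            # host admin lowers the limit

child = IpcNamespace(parent=init_ipc_ns)    # container started afterwards
assert child.shmmax == 34359738368          # sees the host's current value

grandchild = IpcNamespace(parent=child)     # nesting works the same way
assert grandchild.shmmax == child.shmmax

init_ipc_ns.shmmax = 1 << 20                # a later host change...
assert child.shmmax == 34359738368          # ...does not leak into children
print("child shmmax:", child.shmmax)
```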


- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


Re: inotify, new idea?

2014-05-25 Thread Marian Marinov

On 05/24/2014 03:34 PM, Richard Weinberger wrote:
> Am 24.05.2014 09:52, schrieb Michael Kerrisk (man-pages):
>> On 04/21/2014 10:42 AM, Richard Weinberger wrote:
>>> Am 21.04.2014 09:24, schrieb Michael Kerrisk:
>>>>> Does recursive monitoring even work with inotify? Last time I tried,
>>>>> it failed as soon as I did a mkdir -p a/b/c/d, because mkdir() raced
>>>>> against the thread which installs the new watches.
>>>> 
>>>> As I understand it, you have to program to deal with the races (rescan 
>>>> directories after adding watches). I
>>>> recently did a lot of work updating the inotify(7) man page to discuss all 
>>>> the issues that I know of, and
>>>> their remedies. If I missed anything, I'd appreciate a note on it, so that 
>>>> it can be added. See 
>>>> http://man7.org/linux/man-pages/man7/inotify.7.html#NOTES
>>> 
>>> I'm aware of the rescan hack, but in my case it does not help because my 
>>> program must not miss any event. 
>>> Currently I'm using a fuse overlay filesystem to log everything. Not 
>>> perfect but works... :-)
>> 
>> Richard,
>> 
>> A late follow-up question: how does your application deal with the
>> event overflow problem (i.e., when you get a large number of events
>> much faster than your application can deal with them)?
> 
> The downside of the FUSE approach is that you have to intercept every
> filesystem function. This can be a performance issue. But due to this
> design the overflow problem cannot happen, as the FUSE filesystem
> blocks until the event has been processed.
> 

Has any of you looked at this: https://github.com/1and1/linux-filemon/

I haven't stress tested it, but in the past I ported it to more recent kernels:
  https://github.com/hackman/filemon-patches

It is not polished, but it works.

I'm considering fixing some of the issues it has, for use in my home setup.

Marian

> Thanks, //richard


- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270




Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Marian Marinov

On 05/20/2014 05:19 PM, Serge Hallyn wrote:
> Quoting Andy Lutomirski (l...@amacapital.net):
>> On May 15, 2014 1:26 PM, "Serge E. Hallyn"  wrote:
>>> 
>>> Quoting Richard Weinberger (rich...@nod.at):
>>>> Am 15.05.2014 21:50, schrieb Serge Hallyn:
>>>>> Quoting Richard Weinberger (richard.weinber...@gmail.com):
>>>>>> On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
>>>>>>  wrote:
>>>>>>> Then don't use a container to build such a thing, or fix the build 
>>>>>>> scripts to not do that :)
>>>>>> 
>>>>>> I second this. To me it looks like some folks try to (ab)use Linux 
>>>>>> containers for purposes where KVM
>>>>>> would much better fit in. Please don't put more complexity into 
>>>>>> containers. They are already horrible
>>>>>> complex and error prone.
>>>>> 
>>>>> I, naturally, disagree :)  The only use case which is inherently not 
>>>>> valid for containers is running a
>>>>> kernel.  Practically speaking there are other things which likely will 
>>>>> never be possible, but if someone 
>>>>> offers a way to do something in containers, "you can't do that in 
>>>>> containers" is not an apropos response.
>>>>> 
>>>>> "That abstraction is wrong" is certainly valid, as when vpids were 
>>>>> originally proposed and rejected,
>>>>> resulting in the development of pid namespaces.  "We have to work out (x) 
>>>>> first" can be valid (and I can
>>>>> think of examples here), assuming it's not just trying to hide behind a 
>>>>> catch-22/chicken-egg problem.
>>>>> 
>>>>> Finally, saying "containers are complex and error prone" is conflating 
>>>>> several large suites of userspace
>>>>> code and many kernel features which support them.  Being more precise 
>>>>> would, if the argument is valid, lend
>>>>> it a lot more weight.
>>>> 
>>>> We (my company) have used Linux containers in production since 2011:
>>>> first LXC, now libvirt-lxc. To understand the internals better I also
>>>> wrote my own userspace to create/start containers. There are so many
>>>> things which can hurt you badly. With user namespaces we expose a
>>>> really big attack surface to regular users, i.e. suddenly a user is
>>>> allowed to mount filesystems.
>>> 
>>> That is currently not the case.  They can mount some virtual filesystems 
>>> and do bind mounts, but cannot mount
>>> most real filesystems.  This keeps us protected (for now) from potentially 
>>> unsafe superblock readers in the 
>>> kernel.
>>> 
>>>> Ask Andy, he found already lots of nasty things...
>> 
>> I don't think I have anything brilliant to add to this discussion right now, 
>> except possibly:
>> 
>> ISTM that Linux distributions are, in general, vulnerable to all kinds of 
>> shenanigans that would happen if an
>> untrusted user can cause a block device to appear.  That user doesn't need 
>> permission to mount it
> 
> Interesting point.  This would further suggest that we absolutely must ensure 
> that a loop device which shows up in
> the container does not also show up in the host.

Can I suggest using the devices cgroup to achieve that?
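For context, a devices cgroup (v1) rule set along these lines could keep host loop devices out of a container. This is a non-runnable configuration sketch: the cgroup path "container1" and the minor number 3 are invented examples, and writing these files requires root; only the loop-device block major (7) is a fixed fact.

```shell
# Illustrative devices cgroup (v1) fragment; names and minors are examples.
# Deny the container access to all loop devices (block major 7)...
echo 'b 7:* rwm' > /sys/fs/cgroup/devices/container1/devices.deny
# ...then allow only the specific device the host assigned to it.
echo 'b 7:3 rwm' > /sys/fs/cgroup/devices/container1/devices.allow
```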

Marian

> 
>> or even necessarily to change its contents on the fly.
>> 
>> E.g. what happens if you boot a machine that contains a malicious disk image 
>> that has the same partition UUID as
>> /?  Nothing good, I imagine.
>> 
>> So if we're going to go down this road, we really need some way to tell the 
>> host that certain devices are not
>> trusted.
>> 
>> --Andy


- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270




Re: [RFC PATCH 11/11] loop: Allow priveleged operations for root in the namespace which owns a device

2014-05-22 Thread Marian Marinov

One question about this patch.

Why don't you use the devices cgroup to check whether the root user in
that namespace is allowed to use this device?

This way you can be sure that root in that namespace cannot access
devices to which the host system did not give it access.

Marian

On 05/15/2014 12:34 AM, Seth Forshee wrote:
> Signed-off-by: Seth Forshee
> ---
>  drivers/block/loop.c | 14 +++++++++-----
>  1 file changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 66bd938bcc1c..2cc19868ea0d 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1074,7 +1074,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)
>  
>  	if (lo->lo_encrypt_key_size && !uid_eq(lo->lo_key_owner, uid) &&
> -	    !capable(CAP_SYS_ADMIN))
> +	    !ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  		return -EPERM;
>  	if (lo->lo_state != Lo_bound)
>  		return -ENXIO;
> @@ -1164,7 +1164,8 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
>  	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
>  	info->lo_encrypt_type = lo->lo_encryption ? lo->lo_encryption->number : 0;
> -	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
> +	if (lo->lo_encrypt_key_size &&
> +	    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN)) {
>  		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
>  		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
>  		       lo->lo_encrypt_key_size);
> @@ -1309,7 +1310,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		break;
>  	case LOOP_SET_STATUS:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> +		if ((mode & FMODE_WRITE) ||
> +		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  			err = loop_set_status_old(lo,
>  					(struct loop_info __user *)arg);
>  		break;
> @@ -1318,7 +1320,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		break;
>  	case LOOP_SET_STATUS64:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> +		if ((mode & FMODE_WRITE) ||
> +		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  			err = loop_set_status64(lo,
>  					(struct loop_info64 __user *) arg);
>  		break;
> @@ -1327,7 +1330,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
>  		break;
>  	case LOOP_SET_CAPACITY:
>  		err = -EPERM;
> -		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
> +		if ((mode & FMODE_WRITE) ||
> +		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
>  			err = loop_set_capacity(lo, bdev);
>  		break;
>  	default:
> 


- -- 
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


Re: Pondering per-process vsyscall disablement

2014-05-22 Thread Marian Marinov

On 05/23/2014 02:04 AM, Andy Lutomirski wrote:
> It would be nice to have a way for new programs to declare that they don't 
> need vsyscalls.  What's the right way to
> do this?  An ELF header entry in the loader?  An ELF header entry in the 
> program?  A new arch_prctl?
> 
> As background, there's an old part of the x86_64 ABI that allows
> programs to do gettimeofday, clock_gettime, and getcpu by calling to
> fixed addresses of the form 0xffffffffff600n00, where n indicates which
> of those three syscalls is being invoked.  This is a security issue.
> 
> Since Linux 3.1, vsyscalls are emulated using NX and page faults.  As a 
> result, vsyscalls no longer offer any
> performance advantage over normal syscalls; in fact, they're much slower.  As 
> far as I know, nothing newer than
> 2012 will attempt to use vsyscalls if a vdso is present.  (Sadly, a lot of 
> things will still fall back to the
> vsyscall page if there is no vdso, but that shouldn't matter, since there is 
> always a vdso.)
> 
> Despite the emulation, they could still be used as a weird form of ROP gadget 
> that lives at a fixed address.  I'd
> like to offer a way for new runtimes to indicate that they don't use 
> vsyscalls so that the kernel can selectively
> disable emulation and remove the fixed-address executable code issue.
> 
> 
Wouldn't it be more useful if the check were against a bitmask added as an
extended attribute on the executable?
That way administrators would have the flexibility to simply add the new
attribute to the executable.

Marian

> --Andy -- To unsubscribe from this list: send the line "unsubscribe 
> linux-kernel" in the body of a message to
> majord...@vger.kernel.org More majordomo info at  
> http://vger.kernel.org/majordomo-info.html Please read the FAQ at
> http://www.tux.org/lkml/
> 




Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default

2014-05-22 Thread Marian Marinov

On 05/05/2014 10:59 PM, Marian Marinov wrote:
> On 05/04/2014 02:17 PM, Manfred Spraul wrote:
>> Hi Marian,
>> 
>> Note: The limits will soon be increased to (nearly) ULONG_MAX. I.e.: If you 
>> propose the patch because you are
>> running into issues with a too small SEMMAX after an unshare(CLONE_NEWIPC), 
>> then this will be fixed soon.
>> 
>> 
>> On 05/04/2014 01:53 AM, Davidlohr Bueso wrote:
>>> On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote:
>>>> When we are creating new IPC namespace that should be cloned from the 
>>>> current namespace it is a good idea to
>>>> copy the values of the current shmmax and shmall to the new namespace.
>> The idea sounds reasonable: If an admin has reduced the limits, then the 
>> reduction should also apply after a
>> unshare(CLONE_NEWIPC).
>> 
>> But: Your patch doesn't use the current shmmax, it uses the shmmax from 
>> init_ipc_ns. Would it be possible to use
>> the current values?
> 
> In my tests it worked exactly as expected. Here is an example:
> 
> [root@sp2 ~]# sysctl -a|grep shmmax kernel.shmmax = 68719476736 [root@sp2 ~]# 
> lxc-attach -n cent_plain 
> [root@localhost ~]# sysctl -a|grep shmmax kernel.shmmax = 68719476736 
> [root@localhost ~]# halt [root@sp2 ~]# sysctl
> -a|grep shmmax kernel.shmmax = 68719476736 [root@sp2 ~]# sysctl 
> kernel.shmmax=34359738368 kernel.shmmax =
> 34359738368 [root@sp2 ~]# lxc-start -n cent_plain -d [root@sp2 ~]# lxc-attach 
> -n cent_plain [root@localhost ~]#
> sysctl -a|grep shmmax kernel.shmmax = 34359738368 [root@localhost ~]#
> 
> So it seems to work as expected :)
> 
> It works because when you set up a new shmmax limit, it is actually the limit in 
> the init_ipc_ns. So when we are
> creating a new ipc_ns it is OK to copy the values from init_ipc_ns.
> 
> -Marian
> 

Ping?

So will there be any more comments on that?

Marian

>> 
>>> Why is this a good idea?
>>> 
>>> This would break userspace that relies on the current behavior. Furthermore 
>>> we've recently changed the default
>>> value of both these limits to be as large as you can get, thus deprecating 
>>> them. I don't like the idea of this
>>> being replaced by namespaces.
>> Davidlohr: We are not deprecating them, we make the default huge. The limits 
>> should stay as usable as they were.
>> 
>> With regards to breaking user space, I must think about it a bit more. Right 
>> now, each new namespace starts with
>> SEMMAX=32MB, i.e. an often unusable default.
>> 
>> -- Manfred
>> 
> 
> 




Re: [RFC PATCH 11/11] loop: Allow priveleged operations for root in the namespace which owns a device

2014-05-22 Thread Marian Marinov

One question about this patch.

Why don't you use the devices cgroup to check whether the root user in that
namespace is allowed to use this device?

This way you can be sure that root in that namespace cannot access devices to
which the host system did not give it access.

Marian

On 05/15/2014 12:34 AM, Seth Forshee wrote:
Signed-off-by: Seth Forshee <seth.fors...@canonical.com>
---
 drivers/block/loop.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 66bd938bcc1c..2cc19868ea0d 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1074,7 +1074,7 @@ loop_set_status(struct loop_device *lo, const struct loop_info64 *info)

 	if (lo->lo_encrypt_key_size && !uid_eq(lo->lo_key_owner, uid) &&
-	    !capable(CAP_SYS_ADMIN))
+	    !ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 		return -EPERM;
 	if (lo->lo_state != Lo_bound)
 		return -ENXIO;
@@ -1164,7 +1164,8 @@ loop_get_status(struct loop_device *lo, struct loop_info64 *info)
 	memcpy(info->lo_crypt_name, lo->lo_crypt_name, LO_NAME_SIZE);
 	info->lo_encrypt_type = lo->lo_encryption ? lo->lo_encryption->number : 0;
-	if (lo->lo_encrypt_key_size && capable(CAP_SYS_ADMIN)) {
+	if (lo->lo_encrypt_key_size &&
+	    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN)) {
 		info->lo_encrypt_key_size = lo->lo_encrypt_key_size;
 		memcpy(info->lo_encrypt_key, lo->lo_encrypt_key,
 		       lo->lo_encrypt_key_size);
@@ -1309,7 +1310,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		break;
 	case LOOP_SET_STATUS:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+		if ((mode & FMODE_WRITE) ||
+		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 			err = loop_set_status_old(lo,
 					(struct loop_info __user *)arg);
 		break;
@@ -1318,7 +1320,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		break;
 	case LOOP_SET_STATUS64:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+		if ((mode & FMODE_WRITE) ||
+		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 			err = loop_set_status64(lo,
 					(struct loop_info64 __user *) arg);
 		break;
@@ -1327,7 +1330,8 @@ static int lo_ioctl(struct block_device *bdev, fmode_t mode,
 		break;
 	case LOOP_SET_CAPACITY:
 		err = -EPERM;
-		if ((mode & FMODE_WRITE) || capable(CAP_SYS_ADMIN))
+		if ((mode & FMODE_WRITE) ||
+		    ns_capable(disk_to_dev(lo->lo_disk)->ns, CAP_SYS_ADMIN))
 			err = loop_set_capacity(lo, bdev);
 		break;
 	default:
 




Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default

2014-05-05 Thread Marian Marinov

On 05/04/2014 02:17 PM, Manfred Spraul wrote:

Hi Marian,

Note: The limits will soon be increased to (nearly) ULONG_MAX.
I.e.: If you propose the patch because you are running into issues with a too 
small SEMMAX after an
unshare(CLONE_NEWIPC), then this will be fixed soon.


On 05/04/2014 01:53 AM, Davidlohr Bueso wrote:

On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote:

When we are creating new IPC namespace that should be cloned from the current 
namespace it is a good idea to copy the
values of the current shmmax and shmall to the new namespace.

The idea sounds reasonable:
If an admin has reduced the limits, then the reduction should also apply after 
a unshare(CLONE_NEWIPC).

But:
Your patch doesn't use the current shmmax, it uses the shmmax from init_ipc_ns.
Would it be possible to use the current values?


In my tests it worked exactly as expected.
Here is an example:

[root@sp2 ~]# sysctl -a|grep shmmax
kernel.shmmax = 68719476736
[root@sp2 ~]# lxc-attach -n cent_plain
[root@localhost ~]# sysctl -a|grep shmmax
kernel.shmmax = 68719476736
[root@localhost ~]# halt
[root@sp2 ~]# sysctl -a|grep shmmax
kernel.shmmax = 68719476736
[root@sp2 ~]# sysctl kernel.shmmax=34359738368
kernel.shmmax = 34359738368
[root@sp2 ~]# lxc-start -n cent_plain -d
[root@sp2 ~]# lxc-attach -n cent_plain
[root@localhost ~]# sysctl -a|grep shmmax
kernel.shmmax = 34359738368
[root@localhost ~]#

So it seems to work as expected :)

It works because when you set up a new shmmax limit, it is actually the limit in 
the init_ipc_ns.
So when we are creating a new ipc_ns it is OK to copy the values from init_ipc_ns.

-Marian




Why is this a good idea?

This would break userspace that relies on the current behavior.
Furthermore we've recently changed the default value of both these
limits to be as large as you can get, thus deprecating them. I don't
like the idea of this being replaced by namespaces.

Davidlohr: We are not deprecating them, we make the default huge.
The limits should stay as usable as they were.

With regards to breaking user space, I must think about it a bit more.
Right now, each new namespace starts with SEMMAX=32MB, i.e. an often unusable 
default.

--
 Manfred






Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default

2014-05-04 Thread Marian Marinov

On 05/04/2014 04:20 AM, Davidlohr Bueso wrote:

On Sun, 2014-05-04 at 03:28 +0300, Marian Marinov wrote:

On 05/04/2014 02:53 AM, Davidlohr Bueso wrote:

On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote:

When we are creating new IPC namespace that should be cloned from the current 
namespace it is a good idea to copy the
values of the current shmmax and shmall to the new namespace.


Why is this a good idea?

This would break userspace that relies on the current behavior.
Furthermore we've recently changed the default value of both these
limits to be as large as you can get, thus deprecating them. I don't
like the idea of this being replaced by namespaces.

Thanks,
Davidlohr



The current behavior is create_ipc_ns()->shm_init_ns()

void shm_init_ns(struct ipc_namespace *ns)
{
  ns->shm_ctlmax = SHMMAX;
  ns->shm_ctlall = SHMALL;
  ns->shm_ctlmni = SHMMNI;
  ns->shm_rmid_forced = 0;
  ns->shm_tot = 0;
  ipc_init_ids(&shm_ids(ns));
}

This means that whenever you are creating an IPC namespace it gets its SHMMAX 
and SHMALL values from the defaults for
the kernel.


This is exactly what I meant by 'current behavior'.


If for some reason you want a smaller (or, on older kernels, bigger) limit,
this means changing the values in
/proc/sys/kernel/shmmax and /proc/sys/kernel/shmall. However, the program that
is started with the new IPC namespace may
lack the privileges to write to these files, and so it cannot modify them.


I see no reason why namespaces should behave any different than the rest
of the system, wrt this. And this changes how *and* when these limits
are set, which impacts at a userspace level with no justification.


What I'm proposing is simply to copy the current values of the host machine, as 
set by a privileged process before the
namespace creation.

Maybe a better approach would be to allow the changes to be done by processes 
having CAP_SYS_RESOURCE inside the new
namespace?


Why do you need this? Is there any real impact/issue you're seeing?

I'm using Linux Containers and I need to be able to either start containers with a different SHMMAX or set a different 
SHMMAX for already running containers, without giving them full root access.


-Marian


Re: [PATCH] IPC initialize shmmax and shmall from the current value not the default

2014-05-03 Thread Marian Marinov

On 05/04/2014 02:53 AM, Davidlohr Bueso wrote:

On Sun, 2014-05-04 at 01:48 +0300, Marian Marinov wrote:

When we are creating new IPC namespace that should be cloned from the current 
namespace it is a good idea to copy the
values of the current shmmax and shmall to the new namespace.


Why is this a good idea?

This would break userspace that relies on the current behavior.
Furthermore we've recently changed the default value of both these
limits to be as large as you can get, thus deprecating them. I don't
like the idea of this being replaced by namespaces.

Thanks,
Davidlohr



The current behavior is create_ipc_ns()->shm_init_ns()

void shm_init_ns(struct ipc_namespace *ns)
{
ns->shm_ctlmax = SHMMAX;
ns->shm_ctlall = SHMALL;
ns->shm_ctlmni = SHMMNI;
ns->shm_rmid_forced = 0;
ns->shm_tot = 0;
ipc_init_ids(&shm_ids(ns));
}

This means that whenever you are creating an IPC namespace it gets its SHMMAX and SHMALL values from the defaults for 
the kernel.
If for some reason you want a smaller (or, on older kernels, bigger) limit, this means changing the values in 
/proc/sys/kernel/shmmax and /proc/sys/kernel/shmall. However, the program that is started with the new IPC namespace may 
lack the privileges to write to these files, and so it cannot modify them.


What I'm proposing is simply to copy the current values of the host machine, as set by a privileged process before the 
namespace creation.


Maybe a better approach would be to allow the changes to be done by processes having CAP_SYS_RESOURCE inside the new 
namespace?


Marian



[PATCH] IPC initialize shmmax and shmall from the current value not the default

2014-05-03 Thread Marian Marinov
When we are creating a new IPC namespace that should be cloned from the current namespace, it is a good idea to copy the 
values of the current shmmax and shmall to the new namespace.


Copying the values of the init_ipc_ns would allow us to create new ipc namespaces with different values without granting 
them privileges to change those values.


Here is the proposed patch:

---
 ipc/shm.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 7a51443..b7a4728 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -74,8 +74,13 @@ static int sysvipc_shm_proc_show(struct seq_file *s, void *it);

 void shm_init_ns(struct ipc_namespace *ns)
 {
-   ns->shm_ctlmax = SHMMAX;
-   ns->shm_ctlall = SHMALL;
+   if (ns == &init_ipc_ns) {
+   ns->shm_ctlmax = SHMMAX;
+   ns->shm_ctlall = SHMALL;
+   } else {
+   ns->shm_ctlmax = init_ipc_ns.shm_ctlmax;
+   ns->shm_ctlall = init_ipc_ns.shm_ctlall;
+   }
ns->shm_ctlmni = SHMMNI;
ns->shm_rmid_forced = 0;
ns->shm_tot = 0;
--
1.8.4



Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace

2014-04-29 Thread Marian Marinov

On 04/30/2014 03:01 AM, Stéphane Graber wrote:

On Tue, Apr 29, 2014 at 04:51:54PM -0700, Andy Lutomirski wrote:

On Tue, Apr 29, 2014 at 4:47 PM, Stéphane Graber  wrote:

On Tue, Apr 29, 2014 at 04:22:55PM -0700, Andy Lutomirski wrote:

On Tue, Apr 29, 2014 at 4:20 PM, Marian Marinov  wrote:

On 04/30/2014 01:45 AM, Andy Lutomirski wrote:


On 04/29/2014 03:29 PM, Serge Hallyn wrote:


Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org):


On 04/30/2014 01:02 AM, Serge Hallyn wrote:


Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org):


On 04/29/2014 09:52 PM, Serge Hallyn wrote:


Quoting Theodore Ts'o (tytso-3s7wtutd...@public.gmane.org):


On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:



I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).



Um, wouldn't it be better to simply fix the capable() function?

/**
 * capable - Determine if the current task has a superior capability in effect
 * @cap: The capability to be tested for
 *
 * Return true if the current task has the given superior capability currently
 * available for use, false if not.
 *
 * This sets PF_SUPERPRIV on the task if the capability is available on the
 * assumption that it's about to be used.
 */
bool capable(int cap)
{
	return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I
can't imagine any use case, where user namespaces are in effect,
where
using init_user_ns would ever make sense.



the init_user_ns represents the user_ns owning the object, not the
subject.

The patch by Marian is wrong.  Anyone can do 'clone(CLONE_NEWUSER)',
setuid(0), execve, and end up satisfying
'ns_capable(current_cred()->userns,
CAP_SYS_IMMUTABLE)' by definition.

So NACK to that particular patch.  I'm not sure, but IIUC it should
be
safe to check against the userns owning the inode?



So what you are proposing is to replace
'ns_capable(current_cred()->userns, CAP_SYS_IMMUTABLE)' with
'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?

I agree that this is more sane.



Right, and I think the two operations you're looking at seem sane
to allow.



If you are ok with this patch, I will fix all file systems and send
patches.



Sounds good, thanks.


Signed-off-by: Marian Marinov 



Acked-by: Serge E. Hallyn




Wait, what?

Inodes aren't owned by user namespaces; they're owned by users.  And any
user can arrange to have a user namespace in which they pass an
inode_capable check on any inode that they own.

Presumably there's a reason that CAP_SYS_IMMUTABLE is needed.  If this
gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
entirely.



The problem I'm trying to solve is this:

A container with its own user namespace and CAP_SYS_IMMUTABLE should be able
to use chattr on all files which this container has access to.

Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.

With the proposed two fixes CAP_SYS_IMMUTABLE started working in the
container.

The first solution takes its user namespace from the currently running process
and the second takes it from the inode being operated on.

So what would be the best solution in this case?


I'd suggest adding a mount option like fs_owner_uid that names a uid
that owns, in the sense of having unlimited access to, a filesystem.
Then anyone with caps on a namespace owned by that uid could do
whatever.

Eric?

--Andy


The most obvious problem I can think of with "do whatever" is that this
will likely include mknod of char and block devices which you can then
chown/chmod as you wish and use to access any devices on the system from
an unprivileged container.
This can however be mitigated by using the devices cgroup controller.


Or 'nodev'.  setuid/setgid may have the same problem, too.

Implementing something like this would also make CAP_DAC_READ_SEARCH
and CAP_DAC_OVERRIDE work.

Arguably it should be impossible to mount such a thing in the first
place without global privilege.



You also probably wouldn't want any unprivileged user from the host to
find a way to access that mounted filesystem, but so long as you do the
mount in a separate mountns and don't share uids between the host and
the container, that should be fine too.


This part should be a nonissue -- an unprivileged user who has the
right uid owns the namespace anyway, so this is the least of your
worries.

--Andy


It should be a nonissue so long as we make sure that a file owned by a
uid outside the scope of the container may not be changed even though
fs_owner_uid is set. Otherwise, it's just a matter of chmod +S on say
a shell and anyone who can see the fs from the host will be getting a
root shell (assuming said file is owned by the host's uid 0).

So that's restricting slightly what "do whatever" would do in this case.



In my case I give an

Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace

2014-04-29 Thread Marian Marinov

On 04/30/2014 01:45 AM, Andy Lutomirski wrote:

On 04/29/2014 03:29 PM, Serge Hallyn wrote:

Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org):

On 04/30/2014 01:02 AM, Serge Hallyn wrote:

Quoting Marian Marinov (mm-108mbtlg...@public.gmane.org):

On 04/29/2014 09:52 PM, Serge Hallyn wrote:

Quoting Theodore Ts'o (tytso-3s7wtutd...@public.gmane.org):

On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:


I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).


Um, wouldn't it be better to simply fix the capable() function?

/**
  * capable - Determine if the current task has a superior capability in effect
  * @cap: The capability to be tested for
  *
  * Return true if the current task has the given superior capability currently
  * available for use, false if not.
  *
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
  */
bool capable(int cap)
{
return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I
can't imagine any use case, where user namespaces are in effect, where
using init_user_ns would ever make sense.


the init_user_ns represents the user_ns owning the object, not the
subject.

The patch by Marian is wrong.  Anyone can do 'clone(CLONE_NEWUSER)',
setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
CAP_SYS_IMMUTABLE)' by definition.

So NACK to that particular patch.  I'm not sure, but IIUC it should be
safe to check against the userns owning the inode?



So what you are proposing is to replace 'ns_capable(current_cred()->userns, 
CAP_SYS_IMMUTABLE)' with
'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?

I agree that this is more sane.


Right, and I think the two operations you're looking at seem sane
to allow.


If you are ok with this patch, I will fix all file systems and send patches.


Sounds good, thanks.


Signed-off-by: Marian Marinov 


Acked-by: Serge E. Hallyn 


Wait, what?

Inodes aren't owned by user namespaces; they're owned by users.  And any
user can arrange to have a user namespace in which they pass an
inode_capable check on any inode that they own.

Presumably there's a reason that CAP_SYS_IMMUTABLE is needed.  If this
gets merged, then it would be better to just drop CAP_SYS_IMMUTABLE
entirely.


The problem I'm trying to solve is this:

A container with its own user namespace and CAP_SYS_IMMUTABLE should be able to use chattr on all files which this 
container has access to.


Unfortunately with the capable(CAP_SYS_IMMUTABLE) check this is not working.

With the proposed two fixes CAP_SYS_IMMUTABLE started working in the container.

The first solution takes its user namespace from the currently running process and the second takes it from the inode 
being operated on.


So what would be the best solution in this case?

Marian



Nacked-by: Andy Lutomirski 






--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace

2014-04-29 Thread Marian Marinov

On 04/30/2014 01:02 AM, Serge Hallyn wrote:

Quoting Marian Marinov (m...@1h.com):

On 04/29/2014 09:52 PM, Serge Hallyn wrote:

Quoting Theodore Ts'o (ty...@mit.edu):

On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:


I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).


Um, wouldn't it be better to simply fix the capable() function?

/**
  * capable - Determine if the current task has a superior capability in effect
  * @cap: The capability to be tested for
  *
  * Return true if the current task has the given superior capability currently
  * available for use, false if not.
  *
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
  */
bool capable(int cap)
{
return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I
can't imagine any use case, where user namespaces are in effect, where
using init_user_ns would ever make sense.


the init_user_ns represents the user_ns owning the object, not the
subject.

The patch by Marian is wrong.  Anyone can do 'clone(CLONE_NEWUSER)',
setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
CAP_SYS_IMMUTABLE)' by definition.

So NACK to that particular patch.  I'm not sure, but IIUC it should be
safe to check against the userns owning the inode?



So what you are proposing is to replace 'ns_capable(current_cred()->userns, 
CAP_SYS_IMMUTABLE)' with
'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?

I agree that this is more sane.


Right, and I think the two operations you're looking at seem sane
to allow.


If you are ok with this patch, I will fix all file systems and send patches.

Signed-off-by: Marian Marinov 
---
 fs/ext4/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d011b69..9418634 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -265,7 +265,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	 * This test looks nicer. Thanks to Pauline Middelink
 	 */
 	if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
-		if (!capable(CAP_LINUX_IMMUTABLE))
+		if (!inode_capable(inode, CAP_LINUX_IMMUTABLE))
 			goto flags_out;
 	}

---
1.8.4

Marian




thanks,
-serge




--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


Re: ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace

2014-04-29 Thread Marian Marinov

On 04/29/2014 09:52 PM, Serge Hallyn wrote:

Quoting Theodore Ts'o (ty...@mit.edu):

On Tue, Apr 29, 2014 at 04:49:14PM +0300, Marian Marinov wrote:


I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE)
check with ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).


Um, wouldn't it be better to simply fix the capable() function?

/**
  * capable - Determine if the current task has a superior capability in effect
  * @cap: The capability to be tested for
  *
  * Return true if the current task has the given superior capability currently
  * available for use, false if not.
  *
  * This sets PF_SUPERPRIV on the task if the capability is available on the
  * assumption that it's about to be used.
  */
bool capable(int cap)
{
return ns_capable(&init_user_ns, cap);
}
EXPORT_SYMBOL(capable);

The documentation states that it is for "the current task", and I
can't imagine any use case, where user namespaces are in effect, where
using init_user_ns would ever make sense.


the init_user_ns represents the user_ns owning the object, not the
subject.

The patch by Marian is wrong.  Anyone can do 'clone(CLONE_NEWUSER)',
setuid(0), execve, and end up satisfying 'ns_capable(current_cred()->userns,
CAP_SYS_IMMUTABLE)' by definition.

So NACK to that particular patch.  I'm not sure, but IIUC it should be
safe to check against the userns owning the inode?



So what you are proposing is to replace 'ns_capable(current_cred()->userns, 
CAP_SYS_IMMUTABLE)' with
'inode_capable(inode, CAP_SYS_IMMUTABLE)' ?

I agree that this is more sane.

Marian


No?  Otherwise, pretty much every single use of capable() would be
broken, not just this one instance in ext4/ioctl.c.

- Ted
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/containers





--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270


ioctl CAP_LINUX_IMMUTABLE is checked in the wrong namespace

2014-04-29 Thread Marian Marinov

Hello,
when using user namespaces I found a bug in the capability checks done by ioctl.

If someone tries to use chattr +i while in a different user namespace it will 
get the following:

ioctl(3, EXT2_IOC_SETFLAGS, 0x7fffa4fedacc) = -1 EPERM (Operation not permitted)

I'm proposing a fix to this, by replacing the capable(CAP_LINUX_IMMUTABLE) check with 
ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE).


If you agree I can send patches for all filesystems.

I'm proposing the following patch:

---
 fs/ext4/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d011b69..25683d0 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -265,7 +265,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	 * This test looks nicer. Thanks to Pauline Middelink
 	 */
 	if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
-		if (!capable(CAP_LINUX_IMMUTABLE))
+		if (!ns_capable(current_cred()->user_ns, CAP_LINUX_IMMUTABLE))
 			goto flags_out;
 	}

--
1.8.4


--
Marian Marinov
Founder & CEO of 1H Ltd.
Jabber/GTalk: hack...@jabber.org
ICQ: 7556201
Mobile: +359 886 660 270











kernel trees

2014-03-31 Thread Marian Marinov

Hello,
I'm maintaining a few patches out of the main tree.
Currently I keep one repository that is a clone of this:
  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

But if I'm monitoring the 3.13 version I can see only the rc versions in that 
repository.

Which repository should I follow in order to get the 3.13.* versions?

Thank you very much!
Marian




Re: RFC: cgroups aware proc

2014-01-13 Thread Marian Marinov

On 01/13/2014 07:12 PM, Peter Zijlstra wrote:

On Mon, Jan 13, 2014 at 06:23:50PM +0200, Marian Marinov wrote:

Hello Peter,

I need help with the scheduler.

I'm currently trying to patch the /proc/loadavg to show the load that is
only related to the processes from the current cgroup.

I looked through the code and I was hoping that tsk->sched_task_group->cfs_rq
struct will give me the needed information, but unfortunately for me, it did
not.

Can you advise me, how to approach this problem?


Yeah, don't :-) Really, loadavg is a stupid metric.


Yes... stupid, but unfortunately everyone is looking at it :(




I'm totally new to the scheduler code.


Luckily you won't actually have to touch much of it. Most of the actual
loadavg code lives in the first ~400 lines of kernel/sched/proc.c, read
and weep. Its one of the best documented bits around.


I looked through it but I don't understand how to introduce the per-cgroup 
calculation.

I looked through the headers and found the following, which is already
implemented.

  task->sched_task_group->load_avg
  task->sched_task_group->cfs_rq->load_avg
  task->sched_task_group->cfs_rq->load.weight
  task->sched_task_group->cfs_rq->runnable_load_avg

Unfortunately there is almost no documentation for these elements of the cfs_rq 
and task_group structs.

It seems to me that part of the per-task-group loadavg code is already present.



Your proposition, however, is extremely expensive: you turn something
that is already expensive, O(nr_cpus), into something that is
O(nr_cpus * nr_cgroups).

I'm fairly sure people will not like that, especially for something of such
questionable use as the loadavg -- it's really only a pretty number that
doesn't mean all that much.


I know that its use is questionable, but in my case I need to have it, or I will not be able to offer correct loadavg
values in the containers.






 Original Message 
From: Li Zefan 

Then you should add Peter, Ingo and LKML to your Cc list. :)


You failed that, let me fix that.



