Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-10 Thread Laurent Vivier
On 09/10/2018 17:19, Laurent Vivier wrote:
> Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
>> On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>   */
>>>  #define MAX_REGISTER_LENGTH 1920
>>>  
>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>> +{
>>> +   struct binfmt_namespace *b_ns;
>>> +
>>> +   while (ns) {
>>> +   b_ns = READ_ONCE(ns->binfmt_ns);
>>> +   if (b_ns)
>>> +   return b_ns;
>>> +   ns = ns->parent;
>>> +   }
>>> +   WARN_ON_ONCE(1);
>>
>> It looks like we warn here,
>>
>>> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>> struct file *interp_file = NULL;
>>> int retval;
>>> int fd_binary = -1;
>>> +   struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>>  
>>> retval = -ENOEXEC;
>>> -   if (!enabled)
>>> +   if (!ns->enabled)
>>
>> ...but then in cases like this we immediately dereference the pointer
>> anyways and crash. Can we return some other error code here in the !ns
>> case so we don't crash?
> 
> My concern here is I don't want to add code to check an error case that
> cannot happen. The first namespace binfmt_ns pointer is initialized with
> _binfmt_ns, so the return value cannot be NULL.

Perhaps it could be reasonable to return _binfmt_ns rather than
NULL in this case?

Thanks,
Laurent



Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-10 Thread Laurent Vivier
On 09/10/2018 17:19, Laurent Vivier wrote:
> Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
>> On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>   */
>>>  #define MAX_REGISTER_LENGTH 1920
>>>  
>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>> +{
>>> +   struct binfmt_namespace *b_ns;
>>> +
>>> +   while (ns) {
>>> +   b_ns = READ_ONCE(ns->binfmt_ns);
>>> +   if (b_ns)
>>> +   return b_ns;
>>> +   ns = ns->parent;
>>> +   }
>>> +   WARN_ON_ONCE(1);
>>
>> It looks like we warn here,
>>
>>> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>> struct file *interp_file = NULL;
>>> int retval;
>>> int fd_binary = -1;
>>> +   struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>>  
>>> retval = -ENOEXEC;
>>> -   if (!enabled)
>>> +   if (!ns->enabled)
>>
>> ...but then in cases like this we immediately dereference the pointer
>> anyways and crash. Can we return some other error code here in the !ns
>> case so we don't crash?
> 
> My concern here is I don't want to add code to check an error case that
> cannot happen. The first namespace binfmt_ns pointer is initialized with
> _binfmt_ns, so the return value cannot be NULL.

Perhaps it could be reasonable to return _binfmt_ns rather than
NULL in this case?

Thanks,
Laurent



Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-10 Thread Aleksa Sarai
On 2018-10-09, Laurent Vivier  wrote:
> Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
> > On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
> >> @@ -80,18 +74,32 @@ static int entry_count;
> >>   */
> >>  #define MAX_REGISTER_LENGTH 1920
> >>  
> >> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> >> +{
> >> +  struct binfmt_namespace *b_ns;
> >> +
> >> +  while (ns) {
> >> +  b_ns = READ_ONCE(ns->binfmt_ns);
> >> +  if (b_ns)
> >> +  return b_ns;
> >> +  ns = ns->parent;
> >> +  }
> >> +  WARN_ON_ONCE(1);
> > 
> > It looks like we warn here,
> > 
> >> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm 
> >> *bprm)
> >>struct file *interp_file = NULL;
> >>int retval;
> >>int fd_binary = -1;
> >> +  struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
> >>  
> >>retval = -ENOEXEC;
> >> -  if (!enabled)
> >> +  if (!ns->enabled)
> > 
> > ...but then in cases like this we immediately dereference the pointer
> > anyways and crash. Can we return some other error code here in the !ns
> > case so we don't crash?
> 
> My concern here is I don't want to add code to check an error case that
> cannot happen. The first namespace binfmt_ns pointer is initialized with
> _binfmt_ns, so the return value cannot be NULL.

I'd argue that BUG() is a better thing to do then -- if doing a dummy
error path makes no sense. Though IIRC BUG() is no longer a popular
thing to do.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-10 Thread Aleksa Sarai
On 2018-10-09, Laurent Vivier  wrote:
> Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
> > On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
> >> @@ -80,18 +74,32 @@ static int entry_count;
> >>   */
> >>  #define MAX_REGISTER_LENGTH 1920
> >>  
> >> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> >> +{
> >> +  struct binfmt_namespace *b_ns;
> >> +
> >> +  while (ns) {
> >> +  b_ns = READ_ONCE(ns->binfmt_ns);
> >> +  if (b_ns)
> >> +  return b_ns;
> >> +  ns = ns->parent;
> >> +  }
> >> +  WARN_ON_ONCE(1);
> > 
> > It looks like we warn here,
> > 
> >> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm 
> >> *bprm)
> >>struct file *interp_file = NULL;
> >>int retval;
> >>int fd_binary = -1;
> >> +  struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
> >>  
> >>retval = -ENOEXEC;
> >> -  if (!enabled)
> >> +  if (!ns->enabled)
> > 
> > ...but then in cases like this we immediately dereference the pointer
> > anyways and crash. Can we return some other error code here in the !ns
> > case so we don't crash?
> 
> My concern here is I don't want to add code to check an error case that
> cannot happen. The first namespace binfmt_ns pointer is initialized with
> _binfmt_ns, so the return value cannot be NULL.

I'd argue that BUG() is a better thing to do then -- if doing a dummy
error path makes no sense. Though IIRC BUG() is no longer a popular
thing to do.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 19:01, Kirill Tkhai a écrit :
> On 09.10.2018 19:45, Laurent Vivier wrote:
>> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>>> On 09.10.2018 13:37, Laurent Vivier wrote:
 This patch allows to have a different binfmt_misc configuration
 for each new user namespace. By default, the binfmt_misc configuration
 is the one of the previous level, but if the binfmt_misc filesystem is
 mounted in the new namespace a new empty binfmt instance is created and
 used in this namespace.

 For instance, using "unshare" we can start a chroot of an another
 architecture and configure the binfmt_misc interpreter without being root
 to run the binaries in this chroot.

 Signed-off-by: Laurent Vivier 
 ---
  fs/binfmt_misc.c   | 106 -
  include/linux/user_namespace.h |  13 
  kernel/user.c  |  13 
  kernel/user_namespace.c|   3 +
  4 files changed, 107 insertions(+), 28 deletions(-)

 diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
 index aa4a7a23ff99..1e0029d097d9 100644
 --- a/fs/binfmt_misc.c
 +++ b/fs/binfmt_misc.c
>> ...
 @@ -80,18 +74,32 @@ static int entry_count;
   */
  #define MAX_REGISTER_LENGTH 1920
  
 +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
 +{
 +  struct binfmt_namespace *b_ns;
 +
 +  while (ns) {
 +  b_ns = READ_ONCE(ns->binfmt_ns);
 +  if (b_ns)
 +  return b_ns;
 +  ns = ns->parent;
 +  }
 +  WARN_ON_ONCE(1);
 +  return NULL;
 +}
 +
>> ...
 @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
  static int bm_fill_super(struct super_block *sb, void *data, int silent)
  {
int err;
 +  struct user_namespace *ns = sb->s_user_ns;
static const struct tree_descr bm_files[] = {
[2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
[3] = {"register", _register_operations, S_IWUSR},
/* last one */ {""}
};
  
 +  /* create a new binfmt namespace
 +   * if we are not in the first user namespace
 +   * but the binfmt namespace is the first one
 +   */
 +  if (READ_ONCE(ns->binfmt_ns) == NULL) {
 +  struct binfmt_namespace *new_ns;
 +
 +  new_ns = kmalloc(sizeof(struct binfmt_namespace),
 +   GFP_KERNEL);
 +  if (new_ns == NULL)
 +  return -ENOMEM;
 +  INIT_LIST_HEAD(_ns->entries);
 +  new_ns->enabled = 1;
 +  rwlock_init(_ns->entries_lock);
 +  new_ns->bm_mnt = NULL;
 +  new_ns->entry_count = 0;
 +  /* ensure new_ns is completely initialized before sharing it */
 +  smp_wmb();
>>>
>>> (I haven't dived into patch logic, here just small barrier remark from 
>>> quick sight).
>>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>>> you want something like below in read hunk:
>>>
>>> +   b_ns = READ_ONCE(ns->binfmt_ns);
>>> +   if (b_ns) {
>>> +   smp_rmb();
>>> +   return b_ns;
>>> +   }
>>>
>>>
>>
>> The write barrier is here to ensure the structure is fully written
>> before we set the pointer.
>>
>> I don't understand how read barrier can change something at this level,
>> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
>> have correctly initialized the pointer and the structure when we read
>> the pointer back.
>>
>> I think the pointer itself is the "barrier" to access the memory
>> modified before.
> 
> smp_rmb() guarantees you see stores in the order you want. If you have:
> 
> [cpu0][cpu1]
> new_ns->entry_count = 0; 
> smp_wmb();
> WRITE_ONCE(ns->binfmt_ns, new_ns);b_ns = READ_ONCE(ns->binfmt_ns);
>   smp_rmb();
>   entry_count>
> 
> smp_rmb() guarantees you see true entry_count on the cpu1. Without
> smp_rmb() you may see old value of new_ns->entry_count.
>   
> See Documentation/memory-barriers.txt

Yes, I tried to read this document several times...

What I understand from example line 1077 (7696f9910a9a
Documentation/memory-barriers.txt) is we only need a data dependency
barrier, and as explained by Jann it comes with the READ_ONCE() (and is
only needed for alpha).

Thanks,
Laurent



Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 19:01, Kirill Tkhai a écrit :
> On 09.10.2018 19:45, Laurent Vivier wrote:
>> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>>> On 09.10.2018 13:37, Laurent Vivier wrote:
 This patch allows to have a different binfmt_misc configuration
 for each new user namespace. By default, the binfmt_misc configuration
 is the one of the previous level, but if the binfmt_misc filesystem is
 mounted in the new namespace a new empty binfmt instance is created and
 used in this namespace.

 For instance, using "unshare" we can start a chroot of an another
 architecture and configure the binfmt_misc interpreter without being root
 to run the binaries in this chroot.

 Signed-off-by: Laurent Vivier 
 ---
  fs/binfmt_misc.c   | 106 -
  include/linux/user_namespace.h |  13 
  kernel/user.c  |  13 
  kernel/user_namespace.c|   3 +
  4 files changed, 107 insertions(+), 28 deletions(-)

 diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
 index aa4a7a23ff99..1e0029d097d9 100644
 --- a/fs/binfmt_misc.c
 +++ b/fs/binfmt_misc.c
>> ...
 @@ -80,18 +74,32 @@ static int entry_count;
   */
  #define MAX_REGISTER_LENGTH 1920
  
 +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
 +{
 +  struct binfmt_namespace *b_ns;
 +
 +  while (ns) {
 +  b_ns = READ_ONCE(ns->binfmt_ns);
 +  if (b_ns)
 +  return b_ns;
 +  ns = ns->parent;
 +  }
 +  WARN_ON_ONCE(1);
 +  return NULL;
 +}
 +
>> ...
 @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
  static int bm_fill_super(struct super_block *sb, void *data, int silent)
  {
int err;
 +  struct user_namespace *ns = sb->s_user_ns;
static const struct tree_descr bm_files[] = {
[2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
[3] = {"register", _register_operations, S_IWUSR},
/* last one */ {""}
};
  
 +  /* create a new binfmt namespace
 +   * if we are not in the first user namespace
 +   * but the binfmt namespace is the first one
 +   */
 +  if (READ_ONCE(ns->binfmt_ns) == NULL) {
 +  struct binfmt_namespace *new_ns;
 +
 +  new_ns = kmalloc(sizeof(struct binfmt_namespace),
 +   GFP_KERNEL);
 +  if (new_ns == NULL)
 +  return -ENOMEM;
 +  INIT_LIST_HEAD(_ns->entries);
 +  new_ns->enabled = 1;
 +  rwlock_init(_ns->entries_lock);
 +  new_ns->bm_mnt = NULL;
 +  new_ns->entry_count = 0;
 +  /* ensure new_ns is completely initialized before sharing it */
 +  smp_wmb();
>>>
>>> (I haven't dived into patch logic, here just small barrier remark from 
>>> quick sight).
>>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>>> you want something like below in read hunk:
>>>
>>> +   b_ns = READ_ONCE(ns->binfmt_ns);
>>> +   if (b_ns) {
>>> +   smp_rmb();
>>> +   return b_ns;
>>> +   }
>>>
>>>
>>
>> The write barrier is here to ensure the structure is fully written
>> before we set the pointer.
>>
>> I don't understand how read barrier can change something at this level,
>> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
>> have correctly initialized the pointer and the structure when we read
>> the pointer back.
>>
>> I think the pointer itself is the "barrier" to access the memory
>> modified before.
> 
> smp_rmb() guarantees you see stores in the order you want. If you have:
> 
> [cpu0][cpu1]
> new_ns->entry_count = 0; 
> smp_wmb();
> WRITE_ONCE(ns->binfmt_ns, new_ns);b_ns = READ_ONCE(ns->binfmt_ns);
>   smp_rmb();
>   entry_count>
> 
> smp_rmb() guarantees you see true entry_count on the cpu1. Without
> smp_rmb() you may see old value of new_ns->entry_count.
>   
> See Documentation/memory-barriers.txt

Yes, I tried to read this document several times...

What I understand from example line 1077 (7696f9910a9a
Documentation/memory-barriers.txt) is we only need a data dependency
barrier, and as explained by Jann it comes with the READ_ONCE() (and is
only needed for alpha).

Thanks,
Laurent



Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 6:57 PM Laurent Vivier  wrote:
> Le 09/10/2018 à 18:53, Jann Horn a écrit :
> > On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier  wrote:
> >> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> >>> On 09.10.2018 13:37, Laurent Vivier wrote:
>  This patch allows to have a different binfmt_misc configuration
>  for each new user namespace. By default, the binfmt_misc configuration
>  is the one of the previous level, but if the binfmt_misc filesystem is
>  mounted in the new namespace a new empty binfmt instance is created and
>  used in this namespace.
> 
>  For instance, using "unshare" we can start a chroot of an another
>  architecture and configure the binfmt_misc interpreter without being root
>  to run the binaries in this chroot.
> 
>  Signed-off-by: Laurent Vivier 
>  ---
>   fs/binfmt_misc.c   | 106 -
>   include/linux/user_namespace.h |  13 
>   kernel/user.c  |  13 
>   kernel/user_namespace.c|   3 +
>   4 files changed, 107 insertions(+), 28 deletions(-)
> 
>  diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>  index aa4a7a23ff99..1e0029d097d9 100644
>  --- a/fs/binfmt_misc.c
>  +++ b/fs/binfmt_misc.c
> >> ...
>  @@ -80,18 +74,32 @@ static int entry_count;
>    */
>   #define MAX_REGISTER_LENGTH 1920
> 
>  +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>  +{
>  +struct binfmt_namespace *b_ns;
>  +
>  +while (ns) {
>  +b_ns = READ_ONCE(ns->binfmt_ns);
>  +if (b_ns)
>  +return b_ns;
>  +ns = ns->parent;
>  +}
>  +WARN_ON_ONCE(1);
>  +return NULL;
>  +}
>  +
> >> ...
>  @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>   static int bm_fill_super(struct super_block *sb, void *data, int silent)
>   {
>   int err;
>  +struct user_namespace *ns = sb->s_user_ns;
>   static const struct tree_descr bm_files[] = {
>   [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>   [3] = {"register", _register_operations, S_IWUSR},
>   /* last one */ {""}
>   };
> 
>  +/* create a new binfmt namespace
>  + * if we are not in the first user namespace
>  + * but the binfmt namespace is the first one
>  + */
>  +if (READ_ONCE(ns->binfmt_ns) == NULL) {
>  +struct binfmt_namespace *new_ns;
>  +
>  +new_ns = kmalloc(sizeof(struct binfmt_namespace),
>  + GFP_KERNEL);
>  +if (new_ns == NULL)
>  +return -ENOMEM;
>  +INIT_LIST_HEAD(_ns->entries);
>  +new_ns->enabled = 1;
>  +rwlock_init(_ns->entries_lock);
>  +new_ns->bm_mnt = NULL;
>  +new_ns->entry_count = 0;
>  +/* ensure new_ns is completely initialized before sharing 
>  it */
>  +smp_wmb();
> >>>
> >>> (I haven't dived into patch logic, here just small barrier remark from 
> >>> quick sight).
> >>> smp_wmb() has no sense without paired smp_rmb() on the read side. 
> >>> Possible,
> >>> you want something like below in read hunk:
> >>>
> >>> + b_ns = READ_ONCE(ns->binfmt_ns);
> >>> + if (b_ns) {
> >>> + smp_rmb();
> >>> + return b_ns;
> >>> + }
> >>>
> >>>
> >>
> >> The write barrier is here to ensure the structure is fully written
> >> before we set the pointer.
> >>
> >> I don't understand how read barrier can change something at this level,
> >> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> >> have correctly initialized the pointer and the structure when we read
> >> the pointer back.
> >>
> >> I think the pointer itself is the "barrier" to access the memory
> >> modified before.
> >
> > Things don't work that way on alpha, but that's why READ_ONCE()
> > includes an smp_read_barrier_depends():
> >
> > #define __READ_ONCE(x, check)   \
> > ({  \
> > union { typeof(x) __val; char __c[1]; } __u;\
> > if (check)  \
> > __read_once_size(&(x), __u.__c, sizeof(x)); \
> > else\
> > __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
> > smp_read_barrier_depends(); /* Enforce dependency ordering from x 
> > */ \
> > __u.__val;  \
> > })
> > #define READ_ONCE(x) __READ_ONCE(x, 1)

Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 6:57 PM Laurent Vivier  wrote:
> Le 09/10/2018 à 18:53, Jann Horn a écrit :
> > On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier  wrote:
> >> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> >>> On 09.10.2018 13:37, Laurent Vivier wrote:
>  This patch allows to have a different binfmt_misc configuration
>  for each new user namespace. By default, the binfmt_misc configuration
>  is the one of the previous level, but if the binfmt_misc filesystem is
>  mounted in the new namespace a new empty binfmt instance is created and
>  used in this namespace.
> 
>  For instance, using "unshare" we can start a chroot of an another
>  architecture and configure the binfmt_misc interpreter without being root
>  to run the binaries in this chroot.
> 
>  Signed-off-by: Laurent Vivier 
>  ---
>   fs/binfmt_misc.c   | 106 -
>   include/linux/user_namespace.h |  13 
>   kernel/user.c  |  13 
>   kernel/user_namespace.c|   3 +
>   4 files changed, 107 insertions(+), 28 deletions(-)
> 
>  diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>  index aa4a7a23ff99..1e0029d097d9 100644
>  --- a/fs/binfmt_misc.c
>  +++ b/fs/binfmt_misc.c
> >> ...
>  @@ -80,18 +74,32 @@ static int entry_count;
>    */
>   #define MAX_REGISTER_LENGTH 1920
> 
>  +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>  +{
>  +struct binfmt_namespace *b_ns;
>  +
>  +while (ns) {
>  +b_ns = READ_ONCE(ns->binfmt_ns);
>  +if (b_ns)
>  +return b_ns;
>  +ns = ns->parent;
>  +}
>  +WARN_ON_ONCE(1);
>  +return NULL;
>  +}
>  +
> >> ...
>  @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>   static int bm_fill_super(struct super_block *sb, void *data, int silent)
>   {
>   int err;
>  +struct user_namespace *ns = sb->s_user_ns;
>   static const struct tree_descr bm_files[] = {
>   [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>   [3] = {"register", _register_operations, S_IWUSR},
>   /* last one */ {""}
>   };
> 
>  +/* create a new binfmt namespace
>  + * if we are not in the first user namespace
>  + * but the binfmt namespace is the first one
>  + */
>  +if (READ_ONCE(ns->binfmt_ns) == NULL) {
>  +struct binfmt_namespace *new_ns;
>  +
>  +new_ns = kmalloc(sizeof(struct binfmt_namespace),
>  + GFP_KERNEL);
>  +if (new_ns == NULL)
>  +return -ENOMEM;
>  +INIT_LIST_HEAD(_ns->entries);
>  +new_ns->enabled = 1;
>  +rwlock_init(_ns->entries_lock);
>  +new_ns->bm_mnt = NULL;
>  +new_ns->entry_count = 0;
>  +/* ensure new_ns is completely initialized before sharing 
>  it */
>  +smp_wmb();
> >>>
> >>> (I haven't dived into patch logic, here just small barrier remark from 
> >>> quick sight).
> >>> smp_wmb() has no sense without paired smp_rmb() on the read side. 
> >>> Possible,
> >>> you want something like below in read hunk:
> >>>
> >>> + b_ns = READ_ONCE(ns->binfmt_ns);
> >>> + if (b_ns) {
> >>> + smp_rmb();
> >>> + return b_ns;
> >>> + }
> >>>
> >>>
> >>
> >> The write barrier is here to ensure the structure is fully written
> >> before we set the pointer.
> >>
> >> I don't understand how read barrier can change something at this level,
> >> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> >> have correctly initialized the pointer and the structure when we read
> >> the pointer back.
> >>
> >> I think the pointer itself is the "barrier" to access the memory
> >> modified before.
> >
> > Things don't work that way on alpha, but that's why READ_ONCE()
> > includes an smp_read_barrier_depends():
> >
> > #define __READ_ONCE(x, check)   \
> > ({  \
> > union { typeof(x) __val; char __c[1]; } __u;\
> > if (check)  \
> > __read_once_size(&(x), __u.__c, sizeof(x)); \
> > else\
> > __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
> > smp_read_barrier_depends(); /* Enforce dependency ordering from x 
> > */ \
> > __u.__val;  \
> > })
> > #define READ_ONCE(x) __READ_ONCE(x, 1)

Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Kirill Tkhai
On 09.10.2018 19:45, Laurent Vivier wrote:
> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>> On 09.10.2018 13:37, Laurent Vivier wrote:
>>> This patch allows to have a different binfmt_misc configuration
>>> for each new user namespace. By default, the binfmt_misc configuration
>>> is the one of the previous level, but if the binfmt_misc filesystem is
>>> mounted in the new namespace a new empty binfmt instance is created and
>>> used in this namespace.
>>>
>>> For instance, using "unshare" we can start a chroot of an another
>>> architecture and configure the binfmt_misc interpreter without being root
>>> to run the binaries in this chroot.
>>>
>>> Signed-off-by: Laurent Vivier 
>>> ---
>>>  fs/binfmt_misc.c   | 106 -
>>>  include/linux/user_namespace.h |  13 
>>>  kernel/user.c  |  13 
>>>  kernel/user_namespace.c|   3 +
>>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>>> index aa4a7a23ff99..1e0029d097d9 100644
>>> --- a/fs/binfmt_misc.c
>>> +++ b/fs/binfmt_misc.c
> ...
>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>   */
>>>  #define MAX_REGISTER_LENGTH 1920
>>>  
>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>> +{
>>> +   struct binfmt_namespace *b_ns;
>>> +
>>> +   while (ns) {
>>> +   b_ns = READ_ONCE(ns->binfmt_ns);
>>> +   if (b_ns)
>>> +   return b_ns;
>>> +   ns = ns->parent;
>>> +   }
>>> +   WARN_ON_ONCE(1);
>>> +   return NULL;
>>> +}
>>> +
> ...
>>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>>  {
>>> int err;
>>> +   struct user_namespace *ns = sb->s_user_ns;
>>> static const struct tree_descr bm_files[] = {
>>> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>>> [3] = {"register", _register_operations, S_IWUSR},
>>> /* last one */ {""}
>>> };
>>>  
>>> +   /* create a new binfmt namespace
>>> +* if we are not in the first user namespace
>>> +* but the binfmt namespace is the first one
>>> +*/
>>> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
>>> +   struct binfmt_namespace *new_ns;
>>> +
>>> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
>>> +GFP_KERNEL);
>>> +   if (new_ns == NULL)
>>> +   return -ENOMEM;
>>> +   INIT_LIST_HEAD(_ns->entries);
>>> +   new_ns->enabled = 1;
>>> +   rwlock_init(_ns->entries_lock);
>>> +   new_ns->bm_mnt = NULL;
>>> +   new_ns->entry_count = 0;
>>> +   /* ensure new_ns is completely initialized before sharing it */
>>> +   smp_wmb();
>>
>> (I haven't dived into patch logic, here just small barrier remark from quick 
>> sight).
>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>> you want something like below in read hunk:
>>
>> +b_ns = READ_ONCE(ns->binfmt_ns);
>> +if (b_ns) {
>> +smp_rmb();
>> +return b_ns;
>> +}
>>
>>
> 
> The write barrier is here to ensure the structure is fully written
> before we set the pointer.
> 
> I don't understand how read barrier can change something at this level,
> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> have correctly initialized the pointer and the structure when we read
> the pointer back.
> 
> I think the pointer itself is the "barrier" to access the memory
> modified before.

smp_rmb() guarantees you see stores in the order you want. If you have:

[cpu0]  [cpu1]
new_ns->entry_count = 0; 
smp_wmb();
WRITE_ONCE(ns->binfmt_ns, new_ns);  b_ns = READ_ONCE(ns->binfmt_ns);
smp_rmb();
entry_count>

smp_rmb() guarantees you see true entry_count on the cpu1. Without
smp_rmb() you may see old value of new_ns->entry_count.

See Documentation/memory-barriers.txt


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Kirill Tkhai
On 09.10.2018 19:45, Laurent Vivier wrote:
> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>> On 09.10.2018 13:37, Laurent Vivier wrote:
>>> This patch allows to have a different binfmt_misc configuration
>>> for each new user namespace. By default, the binfmt_misc configuration
>>> is the one of the previous level, but if the binfmt_misc filesystem is
>>> mounted in the new namespace a new empty binfmt instance is created and
>>> used in this namespace.
>>>
>>> For instance, using "unshare" we can start a chroot of an another
>>> architecture and configure the binfmt_misc interpreter without being root
>>> to run the binaries in this chroot.
>>>
>>> Signed-off-by: Laurent Vivier 
>>> ---
>>>  fs/binfmt_misc.c   | 106 -
>>>  include/linux/user_namespace.h |  13 
>>>  kernel/user.c  |  13 
>>>  kernel/user_namespace.c|   3 +
>>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>>
>>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>>> index aa4a7a23ff99..1e0029d097d9 100644
>>> --- a/fs/binfmt_misc.c
>>> +++ b/fs/binfmt_misc.c
> ...
>>> @@ -80,18 +74,32 @@ static int entry_count;
>>>   */
>>>  #define MAX_REGISTER_LENGTH 1920
>>>  
>>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>>> +{
>>> +   struct binfmt_namespace *b_ns;
>>> +
>>> +   while (ns) {
>>> +   b_ns = READ_ONCE(ns->binfmt_ns);
>>> +   if (b_ns)
>>> +   return b_ns;
>>> +   ns = ns->parent;
>>> +   }
>>> +   WARN_ON_ONCE(1);
>>> +   return NULL;
>>> +}
>>> +
> ...
>>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>>  {
>>> int err;
>>> +   struct user_namespace *ns = sb->s_user_ns;
>>> static const struct tree_descr bm_files[] = {
>>> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>>> [3] = {"register", _register_operations, S_IWUSR},
>>> /* last one */ {""}
>>> };
>>>  
>>> +   /* create a new binfmt namespace
>>> +* if we are not in the first user namespace
>>> +* but the binfmt namespace is the first one
>>> +*/
>>> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
>>> +   struct binfmt_namespace *new_ns;
>>> +
>>> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
>>> +GFP_KERNEL);
>>> +   if (new_ns == NULL)
>>> +   return -ENOMEM;
>>> +   INIT_LIST_HEAD(_ns->entries);
>>> +   new_ns->enabled = 1;
>>> +   rwlock_init(_ns->entries_lock);
>>> +   new_ns->bm_mnt = NULL;
>>> +   new_ns->entry_count = 0;
>>> +   /* ensure new_ns is completely initialized before sharing it */
>>> +   smp_wmb();
>>
>> (I haven't dived into patch logic, here just small barrier remark from quick 
>> sight).
>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>> you want something like below in read hunk:
>>
>> +b_ns = READ_ONCE(ns->binfmt_ns);
>> +if (b_ns) {
>> +smp_rmb();
>> +return b_ns;
>> +}
>>
>>
> 
> The write barrier is here to ensure the structure is fully written
> before we set the pointer.
> 
> I don't understand how read barrier can change something at this level,
> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> have correctly initialized the pointer and the structure when we read
> the pointer back.
> 
> I think the pointer itself is the "barrier" to access the memory
> modified before.

smp_rmb() guarantees you see stores in the order you want. If you have:

[cpu0]  [cpu1]
new_ns->entry_count = 0; 
smp_wmb();
WRITE_ONCE(ns->binfmt_ns, new_ns);  b_ns = READ_ONCE(ns->binfmt_ns);
smp_rmb();
entry_count>

smp_rmb() guarantees you see true entry_count on the cpu1. Without
smp_rmb() you may see old value of new_ns->entry_count.

See Documentation/memory-barriers.txt


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 18:53, Jann Horn a écrit :
> On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier  wrote:
>> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>>> On 09.10.2018 13:37, Laurent Vivier wrote:
 This patch allows to have a different binfmt_misc configuration
 for each new user namespace. By default, the binfmt_misc configuration
 is the one of the previous level, but if the binfmt_misc filesystem is
 mounted in the new namespace a new empty binfmt instance is created and
 used in this namespace.

 For instance, using "unshare" we can start a chroot of an another
 architecture and configure the binfmt_misc interpreter without being root
 to run the binaries in this chroot.

 Signed-off-by: Laurent Vivier 
 ---
  fs/binfmt_misc.c   | 106 -
  include/linux/user_namespace.h |  13 
  kernel/user.c  |  13 
  kernel/user_namespace.c|   3 +
  4 files changed, 107 insertions(+), 28 deletions(-)

 diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
 index aa4a7a23ff99..1e0029d097d9 100644
 --- a/fs/binfmt_misc.c
 +++ b/fs/binfmt_misc.c
>> ...
 @@ -80,18 +74,32 @@ static int entry_count;
   */
  #define MAX_REGISTER_LENGTH 1920

 +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
 +{
 +struct binfmt_namespace *b_ns;
 +
 +while (ns) {
 +b_ns = READ_ONCE(ns->binfmt_ns);
 +if (b_ns)
 +return b_ns;
 +ns = ns->parent;
 +}
 +WARN_ON_ONCE(1);
 +return NULL;
 +}
 +
>> ...
 @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
  static int bm_fill_super(struct super_block *sb, void *data, int silent)
  {
  int err;
 +struct user_namespace *ns = sb->s_user_ns;
  static const struct tree_descr bm_files[] = {
  [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
  [3] = {"register", _register_operations, S_IWUSR},
  /* last one */ {""}
  };

 +/* create a new binfmt namespace
 + * if we are not in the first user namespace
 + * but the binfmt namespace is the first one
 + */
 +if (READ_ONCE(ns->binfmt_ns) == NULL) {
 +struct binfmt_namespace *new_ns;
 +
 +new_ns = kmalloc(sizeof(struct binfmt_namespace),
 + GFP_KERNEL);
 +if (new_ns == NULL)
 +return -ENOMEM;
 +INIT_LIST_HEAD(_ns->entries);
 +new_ns->enabled = 1;
 +rwlock_init(_ns->entries_lock);
 +new_ns->bm_mnt = NULL;
 +new_ns->entry_count = 0;
 +/* ensure new_ns is completely initialized before sharing it 
 */
 +smp_wmb();
>>>
>>> (I haven't dived into patch logic, here just small barrier remark from 
>>> quick sight).
>>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>>> you want something like below in read hunk:
>>>
>>> + b_ns = READ_ONCE(ns->binfmt_ns);
>>> + if (b_ns) {
>>> + smp_rmb();
>>> + return b_ns;
>>> + }
>>>
>>>
>>
>> The write barrier is here to ensure the structure is fully written
>> before we set the pointer.
>>
>> I don't understand how read barrier can change something at this level,
>> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
>> have correctly initialized the pointer and the structure when we read
>> the pointer back.
>>
>> I think the pointer itself is the "barrier" to access the memory
>> modified before.
> 
> Things don't work that way on alpha, but that's why READ_ONCE()
> includes an smp_read_barrier_depends():
> 
> #define __READ_ONCE(x, check)   \
> ({  \
> union { typeof(x) __val; char __c[1]; } __u;\
> if (check)  \
> __read_once_size(&(x), __u.__c, sizeof(x)); \
> else\
> __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
> smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
> __u.__val;  \
> })
> #define READ_ONCE(x) __READ_ONCE(x, 1)
> 

So my questions are:

- do we need a smp_wmb() barrier if we use READ_ONCE() and WRITE_ONCE()?

- if we need an smp_wmb() barrier, do we need an smp_rmb() barrier as
the data we want to "protect" are behind an access to the pointer?

Thanks,
Laurent


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 18:53, Jann Horn a écrit :
> On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier  wrote:
>> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
>>> On 09.10.2018 13:37, Laurent Vivier wrote:
 This patch allows to have a different binfmt_misc configuration
 for each new user namespace. By default, the binfmt_misc configuration
 is the one of the previous level, but if the binfmt_misc filesystem is
 mounted in the new namespace a new empty binfmt instance is created and
 used in this namespace.

 For instance, using "unshare" we can start a chroot of an another
 architecture and configure the binfmt_misc interpreter without being root
 to run the binaries in this chroot.

 Signed-off-by: Laurent Vivier 
 ---
  fs/binfmt_misc.c   | 106 -
  include/linux/user_namespace.h |  13 
  kernel/user.c  |  13 
  kernel/user_namespace.c|   3 +
  4 files changed, 107 insertions(+), 28 deletions(-)

 diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
 index aa4a7a23ff99..1e0029d097d9 100644
 --- a/fs/binfmt_misc.c
 +++ b/fs/binfmt_misc.c
>> ...
 @@ -80,18 +74,32 @@ static int entry_count;
   */
  #define MAX_REGISTER_LENGTH 1920

 +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
 +{
 +struct binfmt_namespace *b_ns;
 +
 +while (ns) {
 +b_ns = READ_ONCE(ns->binfmt_ns);
 +if (b_ns)
 +return b_ns;
 +ns = ns->parent;
 +}
 +WARN_ON_ONCE(1);
 +return NULL;
 +}
 +
>> ...
 @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
  static int bm_fill_super(struct super_block *sb, void *data, int silent)
  {
  int err;
 +struct user_namespace *ns = sb->s_user_ns;
  static const struct tree_descr bm_files[] = {
  [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
  [3] = {"register", _register_operations, S_IWUSR},
  /* last one */ {""}
  };

 +/* create a new binfmt namespace
 + * if we are not in the first user namespace
 + * but the binfmt namespace is the first one
 + */
 +if (READ_ONCE(ns->binfmt_ns) == NULL) {
 +struct binfmt_namespace *new_ns;
 +
 +new_ns = kmalloc(sizeof(struct binfmt_namespace),
 + GFP_KERNEL);
 +if (new_ns == NULL)
 +return -ENOMEM;
 +INIT_LIST_HEAD(_ns->entries);
 +new_ns->enabled = 1;
 +rwlock_init(_ns->entries_lock);
 +new_ns->bm_mnt = NULL;
 +new_ns->entry_count = 0;
 +/* ensure new_ns is completely initialized before sharing it 
 */
 +smp_wmb();
>>>
>>> (I haven't dived into patch logic, here just small barrier remark from 
>>> quick sight).
>>> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
>>> you want something like below in read hunk:
>>>
>>> + b_ns = READ_ONCE(ns->binfmt_ns);
>>> + if (b_ns) {
>>> + smp_rmb();
>>> + return b_ns;
>>> + }
>>>
>>>
>>
>> The write barrier is here to ensure the structure is fully written
>> before we set the pointer.
>>
>> I don't understand how read barrier can change something at this level,
>> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
>> have correctly initialized the pointer and the structure when we read
>> the pointer back.
>>
>> I think the pointer itself is the "barrier" to access the memory
>> modified before.
> 
> Things don't work that way on alpha, but that's why READ_ONCE()
> includes an smp_read_barrier_depends():
> 
> #define __READ_ONCE(x, check)   \
> ({  \
> union { typeof(x) __val; char __c[1]; } __u;\
> if (check)  \
> __read_once_size(&(x), __u.__c, sizeof(x)); \
> else\
> __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
> smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
> __u.__val;  \
> })
> #define READ_ONCE(x) __READ_ONCE(x, 1)
> 

So my questions are:

- do we need a smp_wmb() barrier if we use READ_ONCE() and WRITE_ONCE()?

- if we need an smp_wmb() barrier, do we need an smp_rmb() barrier as
the data we want to "protect" are behind an access to the pointer?

Thanks,
Laurent


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier  wrote:
> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> > On 09.10.2018 13:37, Laurent Vivier wrote:
> >> This patch allows to have a different binfmt_misc configuration
> >> for each new user namespace. By default, the binfmt_misc configuration
> >> is the one of the previous level, but if the binfmt_misc filesystem is
> >> mounted in the new namespace a new empty binfmt instance is created and
> >> used in this namespace.
> >>
> >> For instance, using "unshare" we can start a chroot of an another
> >> architecture and configure the binfmt_misc interpreter without being root
> >> to run the binaries in this chroot.
> >>
> >> Signed-off-by: Laurent Vivier 
> >> ---
> >>  fs/binfmt_misc.c   | 106 -
> >>  include/linux/user_namespace.h |  13 
> >>  kernel/user.c  |  13 
> >>  kernel/user_namespace.c|   3 +
> >>  4 files changed, 107 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> >> index aa4a7a23ff99..1e0029d097d9 100644
> >> --- a/fs/binfmt_misc.c
> >> +++ b/fs/binfmt_misc.c
> ...
> >> @@ -80,18 +74,32 @@ static int entry_count;
> >>   */
> >>  #define MAX_REGISTER_LENGTH 1920
> >>
> >> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> >> +{
> >> +struct binfmt_namespace *b_ns;
> >> +
> >> +while (ns) {
> >> +b_ns = READ_ONCE(ns->binfmt_ns);
> >> +if (b_ns)
> >> +return b_ns;
> >> +ns = ns->parent;
> >> +}
> >> +WARN_ON_ONCE(1);
> >> +return NULL;
> >> +}
> >> +
> ...
> >> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
> >>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
> >>  {
> >>  int err;
> >> +struct user_namespace *ns = sb->s_user_ns;
> >>  static const struct tree_descr bm_files[] = {
> >>  [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
> >>  [3] = {"register", _register_operations, S_IWUSR},
> >>  /* last one */ {""}
> >>  };
> >>
> >> +/* create a new binfmt namespace
> >> + * if we are not in the first user namespace
> >> + * but the binfmt namespace is the first one
> >> + */
> >> +if (READ_ONCE(ns->binfmt_ns) == NULL) {
> >> +struct binfmt_namespace *new_ns;
> >> +
> >> +new_ns = kmalloc(sizeof(struct binfmt_namespace),
> >> + GFP_KERNEL);
> >> +if (new_ns == NULL)
> >> +return -ENOMEM;
> >> +INIT_LIST_HEAD(_ns->entries);
> >> +new_ns->enabled = 1;
> >> +rwlock_init(_ns->entries_lock);
> >> +new_ns->bm_mnt = NULL;
> >> +new_ns->entry_count = 0;
> >> +/* ensure new_ns is completely initialized before sharing it 
> >> */
> >> +smp_wmb();
> >
> > (I haven't dived into patch logic, here just small barrier remark from 
> > quick sight).
> > smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
> > you want something like below in read hunk:
> >
> > + b_ns = READ_ONCE(ns->binfmt_ns);
> > + if (b_ns) {
> > + smp_rmb();
> > + return b_ns;
> > + }
> >
> >
>
> The write barrier is here to ensure the structure is fully written
> before we set the pointer.
>
> I don't understand how read barrier can change something at this level,
> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> have correctly initialized the pointer and the structure when we read
> the pointer back.
>
> I think the pointer itself is the "barrier" to access the memory
> modified before.

Things don't work that way on alpha, but that's why READ_ONCE()
includes an smp_read_barrier_depends():

#define __READ_ONCE(x, check)   \
({  \
union { typeof(x) __val; char __c[1]; } __u;\
if (check)  \
__read_once_size(&(x), __u.__c, sizeof(x)); \
else\
__read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
__u.__val;  \
})
#define READ_ONCE(x) __READ_ONCE(x, 1)


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 6:45 PM Laurent Vivier  wrote:
> Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> > On 09.10.2018 13:37, Laurent Vivier wrote:
> >> This patch allows to have a different binfmt_misc configuration
> >> for each new user namespace. By default, the binfmt_misc configuration
> >> is the one of the previous level, but if the binfmt_misc filesystem is
> >> mounted in the new namespace a new empty binfmt instance is created and
> >> used in this namespace.
> >>
> >> For instance, using "unshare" we can start a chroot of an another
> >> architecture and configure the binfmt_misc interpreter without being root
> >> to run the binaries in this chroot.
> >>
> >> Signed-off-by: Laurent Vivier 
> >> ---
> >>  fs/binfmt_misc.c   | 106 -
> >>  include/linux/user_namespace.h |  13 
> >>  kernel/user.c  |  13 
> >>  kernel/user_namespace.c|   3 +
> >>  4 files changed, 107 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> >> index aa4a7a23ff99..1e0029d097d9 100644
> >> --- a/fs/binfmt_misc.c
> >> +++ b/fs/binfmt_misc.c
> ...
> >> @@ -80,18 +74,32 @@ static int entry_count;
> >>   */
> >>  #define MAX_REGISTER_LENGTH 1920
> >>
> >> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> >> +{
> >> +struct binfmt_namespace *b_ns;
> >> +
> >> +while (ns) {
> >> +b_ns = READ_ONCE(ns->binfmt_ns);
> >> +if (b_ns)
> >> +return b_ns;
> >> +ns = ns->parent;
> >> +}
> >> +WARN_ON_ONCE(1);
> >> +return NULL;
> >> +}
> >> +
> ...
> >> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
> >>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
> >>  {
> >>  int err;
> >> +struct user_namespace *ns = sb->s_user_ns;
> >>  static const struct tree_descr bm_files[] = {
> >>  [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
> >>  [3] = {"register", _register_operations, S_IWUSR},
> >>  /* last one */ {""}
> >>  };
> >>
> >> +/* create a new binfmt namespace
> >> + * if we are not in the first user namespace
> >> + * but the binfmt namespace is the first one
> >> + */
> >> +if (READ_ONCE(ns->binfmt_ns) == NULL) {
> >> +struct binfmt_namespace *new_ns;
> >> +
> >> +new_ns = kmalloc(sizeof(struct binfmt_namespace),
> >> + GFP_KERNEL);
> >> +if (new_ns == NULL)
> >> +return -ENOMEM;
> >> +INIT_LIST_HEAD(_ns->entries);
> >> +new_ns->enabled = 1;
> >> +rwlock_init(_ns->entries_lock);
> >> +new_ns->bm_mnt = NULL;
> >> +new_ns->entry_count = 0;
> >> +/* ensure new_ns is completely initialized before sharing it 
> >> */
> >> +smp_wmb();
> >
> > (I haven't dived into patch logic, here just small barrier remark from 
> > quick sight).
> > smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
> > you want something like below in read hunk:
> >
> > + b_ns = READ_ONCE(ns->binfmt_ns);
> > + if (b_ns) {
> > + smp_rmb();
> > + return b_ns;
> > + }
> >
> >
>
> The write barrier is here to ensure the structure is fully written
> before we set the pointer.
>
> I don't understand how read barrier can change something at this level,
> IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
> have correctly initialized the pointer and the structure when we read
> the pointer back.
>
> I think the pointer itself is the "barrier" to access the memory
> modified before.

Things don't work that way on alpha, but that's why READ_ONCE()
includes an smp_read_barrier_depends():

#define __READ_ONCE(x, check)   \
({  \
union { typeof(x) __val; char __c[1]; } __u;\
if (check)  \
__read_once_size(&(x), __u.__c, sizeof(x)); \
else\
__read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
__u.__val;  \
})
#define READ_ONCE(x) __READ_ONCE(x, 1)


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> On 09.10.2018 13:37, Laurent Vivier wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of an another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
>>
>> Signed-off-by: Laurent Vivier 
>> ---
>>  fs/binfmt_misc.c   | 106 -
>>  include/linux/user_namespace.h |  13 
>>  kernel/user.c  |  13 
>>  kernel/user_namespace.c|   3 +
>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>> index aa4a7a23ff99..1e0029d097d9 100644
>> --- a/fs/binfmt_misc.c
>> +++ b/fs/binfmt_misc.c
...
>> @@ -80,18 +74,32 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +struct binfmt_namespace *b_ns;
>> +
>> +while (ns) {
>> +b_ns = READ_ONCE(ns->binfmt_ns);
>> +if (b_ns)
>> +return b_ns;
>> +ns = ns->parent;
>> +}
>> +WARN_ON_ONCE(1);
>> +return NULL;
>> +}
>> +
...
>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>  {
>>  int err;
>> +struct user_namespace *ns = sb->s_user_ns;
>>  static const struct tree_descr bm_files[] = {
>>  [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>>  [3] = {"register", _register_operations, S_IWUSR},
>>  /* last one */ {""}
>>  };
>>  
>> +/* create a new binfmt namespace
>> + * if we are not in the first user namespace
>> + * but the binfmt namespace is the first one
>> + */
>> +if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +struct binfmt_namespace *new_ns;
>> +
>> +new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> + GFP_KERNEL);
>> +if (new_ns == NULL)
>> +return -ENOMEM;
>> +INIT_LIST_HEAD(_ns->entries);
>> +new_ns->enabled = 1;
>> +rwlock_init(_ns->entries_lock);
>> +new_ns->bm_mnt = NULL;
>> +new_ns->entry_count = 0;
>> +/* ensure new_ns is completely initialized before sharing it */
>> +smp_wmb();
> 
> (I haven't dived into patch logic, here just small barrier remark from quick 
> sight).
> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
> you want something like below in read hunk:
> 
> + b_ns = READ_ONCE(ns->binfmt_ns);
> + if (b_ns) {
> + smp_rmb();
> + return b_ns;
> + }
> 
> 

The write barrier is here to ensure the structure is fully written
before we set the pointer.

I don't understand how read barrier can change something at this level,
IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
have correctly initialized the pointer and the structure when we read
the pointer back.

I think the pointer itself is the "barrier" to access the memory
modified before.

Thanks,
Laurent


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 18:15, Kirill Tkhai a écrit :
> On 09.10.2018 13:37, Laurent Vivier wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of an another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
>>
>> Signed-off-by: Laurent Vivier 
>> ---
>>  fs/binfmt_misc.c   | 106 -
>>  include/linux/user_namespace.h |  13 
>>  kernel/user.c  |  13 
>>  kernel/user_namespace.c|   3 +
>>  4 files changed, 107 insertions(+), 28 deletions(-)
>>
>> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
>> index aa4a7a23ff99..1e0029d097d9 100644
>> --- a/fs/binfmt_misc.c
>> +++ b/fs/binfmt_misc.c
...
>> @@ -80,18 +74,32 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +struct binfmt_namespace *b_ns;
>> +
>> +while (ns) {
>> +b_ns = READ_ONCE(ns->binfmt_ns);
>> +if (b_ns)
>> +return b_ns;
>> +ns = ns->parent;
>> +}
>> +WARN_ON_ONCE(1);
>> +return NULL;
>> +}
>> +
...
>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>  {
>>  int err;
>> +struct user_namespace *ns = sb->s_user_ns;
>>  static const struct tree_descr bm_files[] = {
>>  [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>>  [3] = {"register", _register_operations, S_IWUSR},
>>  /* last one */ {""}
>>  };
>>  
>> +/* create a new binfmt namespace
>> + * if we are not in the first user namespace
>> + * but the binfmt namespace is the first one
>> + */
>> +if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +struct binfmt_namespace *new_ns;
>> +
>> +new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> + GFP_KERNEL);
>> +if (new_ns == NULL)
>> +return -ENOMEM;
>> +INIT_LIST_HEAD(_ns->entries);
>> +new_ns->enabled = 1;
>> +rwlock_init(_ns->entries_lock);
>> +new_ns->bm_mnt = NULL;
>> +new_ns->entry_count = 0;
>> +/* ensure new_ns is completely initialized before sharing it */
>> +smp_wmb();
> 
> (I haven't dived into patch logic, here just small barrier remark from quick 
> sight).
> smp_wmb() has no sense without paired smp_rmb() on the read side. Possible,
> you want something like below in read hunk:
> 
> + b_ns = READ_ONCE(ns->binfmt_ns);
> + if (b_ns) {
> + smp_rmb();
> + return b_ns;
> + }
> 
> 

The write barrier is here to ensure the structure is fully written
before we set the pointer.

I don't understand how read barrier can change something at this level,
IMHO the couple WRITE_ONCE()/READ_ONCE() should be enough to ensure we
have correctly initialized the pointer and the structure when we read
the pointer back.

I think the pointer itself is the "barrier" to access the memory
modified before.

Thanks,
Laurent


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Kirill Tkhai
On 09.10.2018 13:37, Laurent Vivier wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
> 
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
> 
> Signed-off-by: Laurent Vivier 
> ---
>  fs/binfmt_misc.c   | 106 -
>  include/linux/user_namespace.h |  13 
>  kernel/user.c  |  13 
>  kernel/user_namespace.c|   3 +
>  4 files changed, 107 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index aa4a7a23ff99..1e0029d097d9 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -38,9 +38,6 @@ enum {
>   VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>  };
>  
> -static LIST_HEAD(entries);
> -static int enabled = 1;
> -
>  enum {Enabled, Magic};
>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>  #define MISC_FMT_OPEN_BINARY (1 << 30)
> @@ -60,10 +57,7 @@ typedef struct {
>   struct file *interp_file;
>  } Node;
>  
> -static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -80,18 +74,32 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> + struct binfmt_namespace *b_ns;
> +
> + while (ns) {
> + b_ns = READ_ONCE(ns->binfmt_ns);
> + if (b_ns)
> + return b_ns;
> + ns = ns->parent;
> + }
> + WARN_ON_ONCE(1);
> + return NULL;
> +}
> +
>  /*
>   * Check if we support the binfmt
>   * if we do, return the node, else NULL
>   * locking is done in load_misc_binary
>   */
> -static Node *check_file(struct linux_binprm *bprm)
> +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm 
> *bprm)
>  {
>   char *p = strrchr(bprm->interp, '.');
>   struct list_head *l;
>  
>   /* Walk all the registered handlers. */
> - list_for_each(l, ) {
> + list_for_each(l, >entries) {
>   Node *e = list_entry(l, Node, list);
>   char *s;
>   int j;
> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>   struct file *interp_file = NULL;
>   int retval;
>   int fd_binary = -1;
> + struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>   retval = -ENOEXEC;
> - if (!enabled)
> + if (!ns->enabled)
>   return retval;
>  
>   /* to keep locking time low, we copy the interpreter string */
> - read_lock(_lock);
> - fmt = check_file(bprm);
> + read_lock(>entries_lock);
> + fmt = check_file(ns, bprm);
>   if (fmt)
>   dget(fmt->dentry);
> - read_unlock(_lock);
> + read_unlock(>entries_lock);
>   if (!fmt)
>   return retval;
>  
> @@ -609,19 +618,19 @@ static void bm_evict_inode(struct inode *inode)
>   kfree(e);
>  }
>  
> -static void kill_node(Node *e)
> +static void kill_node(struct binfmt_namespace *ns, Node *e)
>  {
>   struct dentry *dentry;
>  
> - write_lock(_lock);
> + write_lock(>entries_lock);
>   list_del_init(>list);
> - write_unlock(_lock);
> + write_unlock(>entries_lock);
>  
>   dentry = e->dentry;
>   drop_nlink(d_inode(dentry));
>   d_drop(dentry);
>   dput(dentry);
> - simple_release_fs(_mnt, _count);
> + simple_release_fs(>bm_mnt, >entry_count);
>  }
>  
>  /* / */
> @@ -651,6 +660,9 @@ static ssize_t bm_entry_write(struct file *file, const 
> char __user *buffer,
>   struct dentry *root;
>   Node *e = file_inode(file)->i_private;
>   int res = parse_command(buffer, count);
> + struct binfmt_namespace *ns;
> +
> + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
>  
>   switch (res) {
>   case 1:
> @@ -667,7 +679,7 @@ static ssize_t bm_entry_write(struct file *file, const 
> char __user *buffer,
>   inode_lock(d_inode(root));
>  
>   if (!list_empty(>list))
> - kill_node(e);
> + kill_node(ns, e);
>  
>   inode_unlock(d_inode(root));
>   break;
> @@ -693,6 +705,7 @@ static ssize_t bm_register_write(struct file *file, const 
> char __user *buffer,
>   struct inode *inode;
>   struct super_block *sb = file_inode(file)->i_sb;
>   struct dentry *root = sb->s_root, *dentry;
> + struct binfmt_namespace *ns;
>   int err = 0;
>  
>   e = create_entry(buffer, count);
> @@ 

Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Kirill Tkhai
On 09.10.2018 13:37, Laurent Vivier wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
> 
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
> 
> Signed-off-by: Laurent Vivier 
> ---
>  fs/binfmt_misc.c   | 106 -
>  include/linux/user_namespace.h |  13 
>  kernel/user.c  |  13 
>  kernel/user_namespace.c|   3 +
>  4 files changed, 107 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
> index aa4a7a23ff99..1e0029d097d9 100644
> --- a/fs/binfmt_misc.c
> +++ b/fs/binfmt_misc.c
> @@ -38,9 +38,6 @@ enum {
>   VERBOSE_STATUS = 1 /* make it zero to save 400 bytes kernel memory */
>  };
>  
> -static LIST_HEAD(entries);
> -static int enabled = 1;
> -
>  enum {Enabled, Magic};
>  #define MISC_FMT_PRESERVE_ARGV0 (1 << 31)
>  #define MISC_FMT_OPEN_BINARY (1 << 30)
> @@ -60,10 +57,7 @@ typedef struct {
>   struct file *interp_file;
>  } Node;
>  
> -static DEFINE_RWLOCK(entries_lock);
>  static struct file_system_type bm_fs_type;
> -static struct vfsmount *bm_mnt;
> -static int entry_count;
>  
>  /*
>   * Max length of the register string.  Determined by:
> @@ -80,18 +74,32 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> + struct binfmt_namespace *b_ns;
> +
> + while (ns) {
> + b_ns = READ_ONCE(ns->binfmt_ns);
> + if (b_ns)
> + return b_ns;
> + ns = ns->parent;
> + }
> + WARN_ON_ONCE(1);
> + return NULL;
> +}
> +
>  /*
>   * Check if we support the binfmt
>   * if we do, return the node, else NULL
>   * locking is done in load_misc_binary
>   */
> -static Node *check_file(struct linux_binprm *bprm)
> +static Node *check_file(struct binfmt_namespace *ns, struct linux_binprm 
> *bprm)
>  {
>   char *p = strrchr(bprm->interp, '.');
>   struct list_head *l;
>  
>   /* Walk all the registered handlers. */
> - list_for_each(l, ) {
> + list_for_each(l, >entries) {
>   Node *e = list_entry(l, Node, list);
>   char *s;
>   int j;
> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>   struct file *interp_file = NULL;
>   int retval;
>   int fd_binary = -1;
> + struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>   retval = -ENOEXEC;
> - if (!enabled)
> + if (!ns->enabled)
>   return retval;
>  
>   /* to keep locking time low, we copy the interpreter string */
> - read_lock(_lock);
> - fmt = check_file(bprm);
> + read_lock(>entries_lock);
> + fmt = check_file(ns, bprm);
>   if (fmt)
>   dget(fmt->dentry);
> - read_unlock(_lock);
> + read_unlock(>entries_lock);
>   if (!fmt)
>   return retval;
>  
> @@ -609,19 +618,19 @@ static void bm_evict_inode(struct inode *inode)
>   kfree(e);
>  }
>  
> -static void kill_node(Node *e)
> +static void kill_node(struct binfmt_namespace *ns, Node *e)
>  {
>   struct dentry *dentry;
>  
> - write_lock(_lock);
> + write_lock(>entries_lock);
>   list_del_init(>list);
> - write_unlock(_lock);
> + write_unlock(>entries_lock);
>  
>   dentry = e->dentry;
>   drop_nlink(d_inode(dentry));
>   d_drop(dentry);
>   dput(dentry);
> - simple_release_fs(_mnt, _count);
> + simple_release_fs(>bm_mnt, >entry_count);
>  }
>  
>  /* / */
> @@ -651,6 +660,9 @@ static ssize_t bm_entry_write(struct file *file, const 
> char __user *buffer,
>   struct dentry *root;
>   Node *e = file_inode(file)->i_private;
>   int res = parse_command(buffer, count);
> + struct binfmt_namespace *ns;
> +
> + ns = binfmt_ns(file->f_path.dentry->d_sb->s_user_ns);
>  
>   switch (res) {
>   case 1:
> @@ -667,7 +679,7 @@ static ssize_t bm_entry_write(struct file *file, const 
> char __user *buffer,
>   inode_lock(d_inode(root));
>  
>   if (!list_empty(>list))
> - kill_node(e);
> + kill_node(ns, e);
>  
>   inode_unlock(d_inode(root));
>   break;
> @@ -693,6 +705,7 @@ static ssize_t bm_register_write(struct file *file, const 
> char __user *buffer,
>   struct inode *inode;
>   struct super_block *sb = file_inode(file)->i_sb;
>   struct dentry *root = sb->s_root, *dentry;
> + struct binfmt_namespace *ns;
>   int err = 0;
>  
>   e = create_entry(buffer, count);
> @@ 

Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
> On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
>> @@ -80,18 +74,32 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +struct binfmt_namespace *b_ns;
>> +
>> +while (ns) {
>> +b_ns = READ_ONCE(ns->binfmt_ns);
>> +if (b_ns)
>> +return b_ns;
>> +ns = ns->parent;
>> +}
>> +WARN_ON_ONCE(1);
> 
> It looks like we warn here,
> 
>> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>  struct file *interp_file = NULL;
>>  int retval;
>>  int fd_binary = -1;
>> +struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>  
>>  retval = -ENOEXEC;
>> -if (!enabled)
>> +if (!ns->enabled)
> 
> ...but then in cases like this we immediately dereference the pointer
> anyways and crash. Can we return some other error code here in the !ns
> case so we don't crash?

My concern here is I don't want to add code to check an error case that
cannot happen. The first namespace binfmt_ns pointer is initialized with
_binfmt_ns, so the return value cannot be NULL.

Thanks,
Laurent



Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 17:16, Tycho Andersen a écrit :
> On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
>> @@ -80,18 +74,32 @@ static int entry_count;
>>   */
>>  #define MAX_REGISTER_LENGTH 1920
>>  
>> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
>> +{
>> +struct binfmt_namespace *b_ns;
>> +
>> +while (ns) {
>> +b_ns = READ_ONCE(ns->binfmt_ns);
>> +if (b_ns)
>> +return b_ns;
>> +ns = ns->parent;
>> +}
>> +WARN_ON_ONCE(1);
> 
> It looks like we warn here,
> 
>> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>>  struct file *interp_file = NULL;
>>  int retval;
>>  int fd_binary = -1;
>> +struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>>  
>>  retval = -ENOEXEC;
>> -if (!enabled)
>> +if (!ns->enabled)
> 
> ...but then in cases like this we immediately dereference the pointer
> anyways and crash. Can we return some other error code here in the !ns
> case so we don't crash?

My concern here is I don't want to add code to check an error case that
cannot happen. The first namespace binfmt_ns pointer is initialized with
_binfmt_ns, so the return value cannot be NULL.

Thanks,
Laurent



Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Tycho Andersen
On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
> @@ -80,18 +74,32 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> + struct binfmt_namespace *b_ns;
> +
> + while (ns) {
> + b_ns = READ_ONCE(ns->binfmt_ns);
> + if (b_ns)
> + return b_ns;
> + ns = ns->parent;
> + }
> + WARN_ON_ONCE(1);

It looks like we warn here,

> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>   struct file *interp_file = NULL;
>   int retval;
>   int fd_binary = -1;
> + struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>   retval = -ENOEXEC;
> - if (!enabled)
> + if (!ns->enabled)

...but then in cases like this we immediately dereference the pointer
anyways and crash. Can we return some other error code here in the !ns
case so we don't crash?

Tycho


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Tycho Andersen
On Tue, Oct 09, 2018 at 12:37:52PM +0200, Laurent Vivier wrote:
> @@ -80,18 +74,32 @@ static int entry_count;
>   */
>  #define MAX_REGISTER_LENGTH 1920
>  
> +static struct binfmt_namespace *binfmt_ns(struct user_namespace *ns)
> +{
> + struct binfmt_namespace *b_ns;
> +
> + while (ns) {
> + b_ns = READ_ONCE(ns->binfmt_ns);
> + if (b_ns)
> + return b_ns;
> + ns = ns->parent;
> + }
> + WARN_ON_ONCE(1);

It looks like we warn here,

> @@ -133,17 +141,18 @@ static int load_misc_binary(struct linux_binprm *bprm)
>   struct file *interp_file = NULL;
>   int retval;
>   int fd_binary = -1;
> + struct binfmt_namespace *ns = binfmt_ns(current_user_ns());
>  
>   retval = -ENOEXEC;
> - if (!enabled)
> + if (!ns->enabled)

...but then in cases like this we immediately dereference the pointer
anyways and crash. Can we return some other error code here in the !ns
case so we don't crash?

Tycho


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 3:06 PM Laurent Vivier  wrote:
>
> Le 09/10/2018 à 14:43, Jann Horn a écrit :
> > On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier  wrote:
> >> This patch allows to have a different binfmt_misc configuration
> >> for each new user namespace. By default, the binfmt_misc configuration
> >> is the one of the previous level, but if the binfmt_misc filesystem is
> >> mounted in the new namespace a new empty binfmt instance is created and
> >> used in this namespace.
> >>
> >> For instance, using "unshare" we can start a chroot of an another
> >> architecture and configure the binfmt_misc interpreter without being root
> >> to run the binaries in this chroot.
> > [...]
> >> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
> >>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
> >>  {
> >> int err;
> >> +   struct user_namespace *ns = sb->s_user_ns;
> >> static const struct tree_descr bm_files[] = {
> >> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
> >> [3] = {"register", _register_operations, S_IWUSR},
> >> /* last one */ {""}
> >> };
> >>
> >> +   /* create a new binfmt namespace
> >> +* if we are not in the first user namespace
> >> +* but the binfmt namespace is the first one
> >> +*/
> >> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
> >> +   struct binfmt_namespace *new_ns;
> >> +
> >> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
> >> +GFP_KERNEL);
> >> +   if (new_ns == NULL)
> >> +   return -ENOMEM;
> >> +   INIT_LIST_HEAD(_ns->entries);
> >> +   new_ns->enabled = 1;
> >> +   rwlock_init(_ns->entries_lock);
> >> +   new_ns->bm_mnt = NULL;
> >> +   new_ns->entry_count = 0;
> >> +   /* ensure new_ns is completely initialized before sharing 
> >> it */
> >> +   smp_wmb();
> >> +   WRITE_ONCE(ns->binfmt_ns, new_ns);
> >> +   }
> >
> > You're still not preventing a concurrent race of two mount() calls,
> > right? What prevents two instances of this code block from running
> > concurrently in two different namespaces? I think you want to take
> > some sort of global lock around this.
> >
>
> My guess was we have only one binfmt superblock by user namespace, so as
> we can't have duplicate superblock, we will not have duplicate binfmt_ns
> structure. This function is only called once in the namespace and I
> think the superblock creation is already protected by some kind of lock.

Ah! Nevermind, I missed the mount_ns().


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 3:06 PM Laurent Vivier  wrote:
>
> Le 09/10/2018 à 14:43, Jann Horn a écrit :
> > On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier  wrote:
> >> This patch allows to have a different binfmt_misc configuration
> >> for each new user namespace. By default, the binfmt_misc configuration
> >> is the one of the previous level, but if the binfmt_misc filesystem is
> >> mounted in the new namespace a new empty binfmt instance is created and
> >> used in this namespace.
> >>
> >> For instance, using "unshare" we can start a chroot of an another
> >> architecture and configure the binfmt_misc interpreter without being root
> >> to run the binaries in this chroot.
> > [...]
> >> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
> >>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
> >>  {
> >> int err;
> >> +   struct user_namespace *ns = sb->s_user_ns;
> >> static const struct tree_descr bm_files[] = {
> >> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
> >> [3] = {"register", _register_operations, S_IWUSR},
> >> /* last one */ {""}
> >> };
> >>
> >> +   /* create a new binfmt namespace
> >> +* if we are not in the first user namespace
> >> +* but the binfmt namespace is the first one
> >> +*/
> >> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
> >> +   struct binfmt_namespace *new_ns;
> >> +
> >> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
> >> +GFP_KERNEL);
> >> +   if (new_ns == NULL)
> >> +   return -ENOMEM;
> >> +   INIT_LIST_HEAD(_ns->entries);
> >> +   new_ns->enabled = 1;
> >> +   rwlock_init(_ns->entries_lock);
> >> +   new_ns->bm_mnt = NULL;
> >> +   new_ns->entry_count = 0;
> >> +   /* ensure new_ns is completely initialized before sharing 
> >> it */
> >> +   smp_wmb();
> >> +   WRITE_ONCE(ns->binfmt_ns, new_ns);
> >> +   }
> >
> > You're still not preventing a concurrent race of two mount() calls,
> > right? What prevents two instances of this code block from running
> > concurrently in two different namespaces? I think you want to take
> > some sort of global lock around this.
> >
>
> My guess was we have only one binfmt superblock by user namespace, so as
> we can't have duplicate superblock, we will not have duplicate binfmt_ns
> structure. This function is only called once in the namespace and I
> think the superblock creation is already protected by some kind of lock.

Ah! Nevermind, I missed the mount_ns().


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 14:43, Jann Horn a écrit :
> On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier  wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of an another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
> [...]
>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>  {
>> int err;
>> +   struct user_namespace *ns = sb->s_user_ns;
>> static const struct tree_descr bm_files[] = {
>> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>> [3] = {"register", _register_operations, S_IWUSR},
>> /* last one */ {""}
>> };
>>
>> +   /* create a new binfmt namespace
>> +* if we are not in the first user namespace
>> +* but the binfmt namespace is the first one
>> +*/
>> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +   struct binfmt_namespace *new_ns;
>> +
>> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> +GFP_KERNEL);
>> +   if (new_ns == NULL)
>> +   return -ENOMEM;
>> +   INIT_LIST_HEAD(_ns->entries);
>> +   new_ns->enabled = 1;
>> +   rwlock_init(_ns->entries_lock);
>> +   new_ns->bm_mnt = NULL;
>> +   new_ns->entry_count = 0;
>> +   /* ensure new_ns is completely initialized before sharing it 
>> */
>> +   smp_wmb();
>> +   WRITE_ONCE(ns->binfmt_ns, new_ns);
>> +   }
> 
> You're still not preventing a concurrent race of two mount() calls,
> right? What prevents two instances of this code block from running
> concurrently in two different namespaces? I think you want to take
> some sort of global lock around this.
> 

My guess was we have only one binfmt superblock by user namespace, so as
we can't have duplicate superblock, we will not have duplicate binfmt_ns
structure. This function is only called once in the namespace and I
think the superblock creation is already protected by some kind of lock.

But I'm not a VFS expert, if someone wants to clarify the situation,
please go ahead.

Thanks,
Laurent





Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Laurent Vivier
Le 09/10/2018 à 14:43, Jann Horn a écrit :
> On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier  wrote:
>> This patch allows to have a different binfmt_misc configuration
>> for each new user namespace. By default, the binfmt_misc configuration
>> is the one of the previous level, but if the binfmt_misc filesystem is
>> mounted in the new namespace a new empty binfmt instance is created and
>> used in this namespace.
>>
>> For instance, using "unshare" we can start a chroot of an another
>> architecture and configure the binfmt_misc interpreter without being root
>> to run the binaries in this chroot.
> [...]
>> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>>  {
>> int err;
>> +   struct user_namespace *ns = sb->s_user_ns;
>> static const struct tree_descr bm_files[] = {
>> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
>> [3] = {"register", _register_operations, S_IWUSR},
>> /* last one */ {""}
>> };
>>
>> +   /* create a new binfmt namespace
>> +* if we are not in the first user namespace
>> +* but the binfmt namespace is the first one
>> +*/
>> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
>> +   struct binfmt_namespace *new_ns;
>> +
>> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
>> +GFP_KERNEL);
>> +   if (new_ns == NULL)
>> +   return -ENOMEM;
>> +   INIT_LIST_HEAD(_ns->entries);
>> +   new_ns->enabled = 1;
>> +   rwlock_init(_ns->entries_lock);
>> +   new_ns->bm_mnt = NULL;
>> +   new_ns->entry_count = 0;
>> +   /* ensure new_ns is completely initialized before sharing it 
>> */
>> +   smp_wmb();
>> +   WRITE_ONCE(ns->binfmt_ns, new_ns);
>> +   }
> 
> You're still not preventing a concurrent race of two mount() calls,
> right? What prevents two instances of this code block from running
> concurrently in two different namespaces? I think you want to take
> some sort of global lock around this.
> 

My guess was we have only one binfmt superblock by user namespace, so as
we can't have duplicate superblock, we will not have duplicate binfmt_ns
structure. This function is only called once in the namespace and I
think the superblock creation is already protected by some kind of lock.

But I'm not a VFS expert, if someone wants to clarify the situation,
please go ahead.

Thanks,
Laurent





Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier  wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
>
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
[...]
> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>  {
> int err;
> +   struct user_namespace *ns = sb->s_user_ns;
> static const struct tree_descr bm_files[] = {
> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
> [3] = {"register", _register_operations, S_IWUSR},
> /* last one */ {""}
> };
>
> +   /* create a new binfmt namespace
> +* if we are not in the first user namespace
> +* but the binfmt namespace is the first one
> +*/
> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
> +   struct binfmt_namespace *new_ns;
> +
> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
> +GFP_KERNEL);
> +   if (new_ns == NULL)
> +   return -ENOMEM;
> +   INIT_LIST_HEAD(_ns->entries);
> +   new_ns->enabled = 1;
> +   rwlock_init(_ns->entries_lock);
> +   new_ns->bm_mnt = NULL;
> +   new_ns->entry_count = 0;
> +   /* ensure new_ns is completely initialized before sharing it 
> */
> +   smp_wmb();
> +   WRITE_ONCE(ns->binfmt_ns, new_ns);
> +   }

You're still not preventing a concurrent race of two mount() calls,
right? What prevents two instances of this code block from running
concurrently in two different namespaces? I think you want to take
some sort of global lock around this.


Re: [RFC v5 1/1] ns: add binfmt_misc to the user namespace

2018-10-09 Thread Jann Horn
On Tue, Oct 9, 2018 at 12:38 PM Laurent Vivier  wrote:
> This patch allows to have a different binfmt_misc configuration
> for each new user namespace. By default, the binfmt_misc configuration
> is the one of the previous level, but if the binfmt_misc filesystem is
> mounted in the new namespace a new empty binfmt instance is created and
> used in this namespace.
>
> For instance, using "unshare" we can start a chroot of an another
> architecture and configure the binfmt_misc interpreter without being root
> to run the binaries in this chroot.
[...]
> @@ -823,12 +847,34 @@ static const struct super_operations s_ops = {
>  static int bm_fill_super(struct super_block *sb, void *data, int silent)
>  {
> int err;
> +   struct user_namespace *ns = sb->s_user_ns;
> static const struct tree_descr bm_files[] = {
> [2] = {"status", _status_operations, S_IWUSR|S_IRUGO},
> [3] = {"register", _register_operations, S_IWUSR},
> /* last one */ {""}
> };
>
> +   /* create a new binfmt namespace
> +* if we are not in the first user namespace
> +* but the binfmt namespace is the first one
> +*/
> +   if (READ_ONCE(ns->binfmt_ns) == NULL) {
> +   struct binfmt_namespace *new_ns;
> +
> +   new_ns = kmalloc(sizeof(struct binfmt_namespace),
> +GFP_KERNEL);
> +   if (new_ns == NULL)
> +   return -ENOMEM;
> +   INIT_LIST_HEAD(_ns->entries);
> +   new_ns->enabled = 1;
> +   rwlock_init(_ns->entries_lock);
> +   new_ns->bm_mnt = NULL;
> +   new_ns->entry_count = 0;
> +   /* ensure new_ns is completely initialized before sharing it 
> */
> +   smp_wmb();
> +   WRITE_ONCE(ns->binfmt_ns, new_ns);
> +   }

You're still not preventing a concurrent race of two mount() calls,
right? What prevents two instances of this code block from running
concurrently in two different namespaces? I think you want to take
some sort of global lock around this.