Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-14 Thread Stanislav Kinsburskiy


14.09.2017 13:45, Ian Kent пишет:
> On 14/09/17 19:39, Stanislav Kinsburskiy wrote:
>>
>>
>> 14.09.2017 13:29, Ian Kent пишет:
>>> On 14/09/17 17:24, Stanislav Kinsburskiy wrote:
>>>>
>>>>
>>>> 14.09.2017 02:38, Ian Kent пишет:
>>>>> On 01/09/17 19:21, Stanislav Kinsburskiy wrote:
>>>>>> Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
>>>>>> ---
>>>>>>  fs/autofs4/autofs_i.h  |3 +++
>>>>>>  fs/autofs4/dev-ioctl.c |3 +++
>>>>>>  fs/autofs4/inode.c |4 +++-
>>>>>>  3 files changed, 9 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
>>>>>> index 4737615..3da105f 100644
>>>>>> --- a/fs/autofs4/autofs_i.h
>>>>>> +++ b/fs/autofs4/autofs_i.h
>>>>>> @@ -120,6 +120,9 @@ struct autofs_sb_info {
>>>>>>  struct list_head active_list;
>>>>>>  struct list_head expiring_list;
>>>>>>  struct rcu_head rcu;
>>>>>> +#ifdef CONFIG_COMPAT
>>>>>> +unsigned is32bit:1;
>>>>>> +#endif
>>>>>>  };
>>>>>>  
>>>>>>  static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
>>>>>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
>>>>>> index b7c816f..467d6c4 100644
>>>>>> --- a/fs/autofs4/dev-ioctl.c
>>>>>> +++ b/fs/autofs4/dev-ioctl.c
>>>>>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file 
>>>>>> *fp,
>>>>>>  sbi->pipefd = pipefd;
>>>>>>  sbi->pipe = pipe;
>>>>>>  sbi->catatonic = 0;
>>>>>> +#ifdef CONFIG_COMPAT
>>>>>> +sbi->is32bit = is_compat_task();
>>>>>> +#endif
>>>>>>  }
>>>>>>  out:
>>>>>>  put_pid(new_pid);
>>>>>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
>>>>>> index 09e7d68..21d3c0b 100644
>>>>>> --- a/fs/autofs4/inode.c
>>>>>> +++ b/fs/autofs4/inode.c
>>>>>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void 
>>>>>> *data, int silent)
>>>>>>  } else {
>>>>>>  sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
>>>>>>  }
>>>>>> -
>>>>>> +#ifdef CONFIG_COMPAT
>>>>>> +sbi->is32bit = is_compat_task();
>>>>>> +#endif
>>>>>>  if (autofs_type_trigger(sbi->type))
>>>>>>  __managed_dentry_set_managed(root);
>>>>>>  
>>>>>>
>>>>>
>>>>> Not sure about this.
>>>>>
>>>>> Don't you think it would be better to avoid the in code #ifdefs by doing 
>>>>> some
>>>>> checks and defines in the header file and defining what's need to just use
>>>>> is_compat_task().
>>>>>
>>>>
>>>> Yes, might be...
>>>>
>>>>> Not sure 2 patches are needed for this either ..
>>>>>
>>>>
>>>> Well, I found this issue occasionally.
>>>
>>> I'm wondering what the symptoms are?
>>>
>>
>> Size of struct autofs_v5_packet is 300 bytes for x86 and 304 bytes for 
>> x86_64.
>> Which means, that 32bit task can read more than size of autofs_v5_packet on 
>> 64bit kernel.
> 
> Are you sure?
> 
> Shouldn't that be a short read on the x86 side of a 4 bytes longer
> structure on the x86_64 side.
> 
> I didn't think you could have a 64 bit client on a 32 bit kernel
> so the converse (the read past end of struct) doesn't apply.
> 

Sorry for the confusion, I had to add brackets like this:

> Which means, that 32bit task can read more than size of autofs_v5_packet (on 
> 64bit kernel).

IOW, 32bit task expects to read 300 bytes (size of struct autofs_v5_packet) 
while there are 304 bytes available on the "wire" from the 64bit kernel.


> Ian
> 


Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-14 Thread Stanislav Kinsburskiy


14.09.2017 13:45, Ian Kent пишет:
> On 14/09/17 19:39, Stanislav Kinsburskiy wrote:
>>
>>
>> 14.09.2017 13:29, Ian Kent пишет:
>>> On 14/09/17 17:24, Stanislav Kinsburskiy wrote:
>>>>
>>>>
>>>> 14.09.2017 02:38, Ian Kent пишет:
>>>>> On 01/09/17 19:21, Stanislav Kinsburskiy wrote:
>>>>>> Signed-off-by: Stanislav Kinsburskiy 
>>>>>> ---
>>>>>>  fs/autofs4/autofs_i.h  |3 +++
>>>>>>  fs/autofs4/dev-ioctl.c |3 +++
>>>>>>  fs/autofs4/inode.c |4 +++-
>>>>>>  3 files changed, 9 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
>>>>>> index 4737615..3da105f 100644
>>>>>> --- a/fs/autofs4/autofs_i.h
>>>>>> +++ b/fs/autofs4/autofs_i.h
>>>>>> @@ -120,6 +120,9 @@ struct autofs_sb_info {
>>>>>>  struct list_head active_list;
>>>>>>  struct list_head expiring_list;
>>>>>>  struct rcu_head rcu;
>>>>>> +#ifdef CONFIG_COMPAT
>>>>>> +unsigned is32bit:1;
>>>>>> +#endif
>>>>>>  };
>>>>>>  
>>>>>>  static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
>>>>>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
>>>>>> index b7c816f..467d6c4 100644
>>>>>> --- a/fs/autofs4/dev-ioctl.c
>>>>>> +++ b/fs/autofs4/dev-ioctl.c
>>>>>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file 
>>>>>> *fp,
>>>>>>  sbi->pipefd = pipefd;
>>>>>>  sbi->pipe = pipe;
>>>>>>  sbi->catatonic = 0;
>>>>>> +#ifdef CONFIG_COMPAT
>>>>>> +sbi->is32bit = is_compat_task();
>>>>>> +#endif
>>>>>>  }
>>>>>>  out:
>>>>>>  put_pid(new_pid);
>>>>>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
>>>>>> index 09e7d68..21d3c0b 100644
>>>>>> --- a/fs/autofs4/inode.c
>>>>>> +++ b/fs/autofs4/inode.c
>>>>>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void 
>>>>>> *data, int silent)
>>>>>>  } else {
>>>>>>  sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
>>>>>>  }
>>>>>> -
>>>>>> +#ifdef CONFIG_COMPAT
>>>>>> +sbi->is32bit = is_compat_task();
>>>>>> +#endif
>>>>>>  if (autofs_type_trigger(sbi->type))
>>>>>>  __managed_dentry_set_managed(root);
>>>>>>  
>>>>>>
>>>>>
>>>>> Not sure about this.
>>>>>
>>>>> Don't you think it would be better to avoid the in code #ifdefs by doing 
>>>>> some
>>>>> checks and defines in the header file and defining what's need to just use
>>>>> is_compat_task().
>>>>>
>>>>
>>>> Yes, might be...
>>>>
>>>>> Not sure 2 patches are needed for this either ..
>>>>>
>>>>
>>>> Well, I found this issue occasionally.
>>>
>>> I'm wondering what the symptoms are?
>>>
>>
>> Size of struct autofs_v5_packet is 300 bytes for x86 and 304 bytes for 
>> x86_64.
>> Which means, that 32bit task can read more than size of autofs_v5_packet on 
>> 64bit kernel.
> 
> Are you sure?
> 
> Shouldn't that be a short read on the x86 side of a 4 bytes longer
> structure on the x86_64 side.
> 
> I didn't think you could have a 64 bit client on a 32 bit kernel
> so the converse (the read past end of struct) doesn't apply.
> 

Sorry for the confusion, I had to add brackets like this:

> Which means, that 32bit task can read more than size of autofs_v5_packet (on 
> 64bit kernel).

IOW, 32bit task expects to read 300 bytes (size of struct autofs_v5_packet) 
while there are 304 bytes available on the "wire" from the 64bit kernel.


> Ian
> 


Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-14 Thread Stanislav Kinsburskiy


14.09.2017 13:29, Ian Kent пишет:
> On 14/09/17 17:24, Stanislav Kinsburskiy wrote:
>>
>>
>> 14.09.2017 02:38, Ian Kent пишет:
>>> On 01/09/17 19:21, Stanislav Kinsburskiy wrote:
>>>> Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
>>>> ---
>>>>  fs/autofs4/autofs_i.h  |3 +++
>>>>  fs/autofs4/dev-ioctl.c |3 +++
>>>>  fs/autofs4/inode.c |4 +++-
>>>>  3 files changed, 9 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
>>>> index 4737615..3da105f 100644
>>>> --- a/fs/autofs4/autofs_i.h
>>>> +++ b/fs/autofs4/autofs_i.h
>>>> @@ -120,6 +120,9 @@ struct autofs_sb_info {
>>>>struct list_head active_list;
>>>>struct list_head expiring_list;
>>>>struct rcu_head rcu;
>>>> +#ifdef CONFIG_COMPAT
>>>> +  unsigned is32bit:1;
>>>> +#endif
>>>>  };
>>>>  
>>>>  static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
>>>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
>>>> index b7c816f..467d6c4 100644
>>>> --- a/fs/autofs4/dev-ioctl.c
>>>> +++ b/fs/autofs4/dev-ioctl.c
>>>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
>>>>sbi->pipefd = pipefd;
>>>>sbi->pipe = pipe;
>>>>sbi->catatonic = 0;
>>>> +#ifdef CONFIG_COMPAT
>>>> +  sbi->is32bit = is_compat_task();
>>>> +#endif
>>>>}
>>>>  out:
>>>>put_pid(new_pid);
>>>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
>>>> index 09e7d68..21d3c0b 100644
>>>> --- a/fs/autofs4/inode.c
>>>> +++ b/fs/autofs4/inode.c
>>>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void 
>>>> *data, int silent)
>>>>} else {
>>>>sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
>>>>}
>>>> -
>>>> +#ifdef CONFIG_COMPAT
>>>> +  sbi->is32bit = is_compat_task();
>>>> +#endif
>>>>if (autofs_type_trigger(sbi->type))
>>>>__managed_dentry_set_managed(root);
>>>>  
>>>>
>>>
>>> Not sure about this.
>>>
>>> Don't you think it would be better to avoid the in code #ifdefs by doing 
>>> some
>>> checks and defines in the header file and defining what's need to just use
>>> is_compat_task().
>>>
>>
>> Yes, might be...
>>
>>> Not sure 2 patches are needed for this either ..
>>>
>>
>> Well, I found this issue occasionally.
> 
> I'm wondering what the symptoms are?
> 

Size of struct autofs_v5_packet is 300 bytes for x86 and 304 bytes for x86_64.
Which means, that 32bit task can read more than size of autofs_v5_packet on 
64bit kernel.

>> And, frankly speaking, it's not clear to me, whether this issue is important 
>> at all, so I wanted to clarify this first.
>> Thanks to O_DIRECT, the only way to catch the issue is to try to read more, 
>> than expected, in compat task (that's how I found it).
> 
> Right, the O_DIRECT patch from Linus was expected to fix the structure
> alignment problem. The stuct field offsets are ok aren't they?
> 

Yes, they are ok.

>> I don't see any other flaw so far. And if so, that, probably, we shouldn't 
>> care about the issue at all.
>> What do you think?
> 
> If we are seeing hangs, incorrect struct fields or similar something
> should be done about it but if all is actually working ok then the
> O_DIRECT fix is doing it's job and further changes aren't necessary.
> 

Well, yes. O_DIRECT fix covers the issue.
Ok then.
Thanks for the clarification!

> Ian
> 


Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-14 Thread Stanislav Kinsburskiy


14.09.2017 13:29, Ian Kent пишет:
> On 14/09/17 17:24, Stanislav Kinsburskiy wrote:
>>
>>
>> 14.09.2017 02:38, Ian Kent пишет:
>>> On 01/09/17 19:21, Stanislav Kinsburskiy wrote:
>>>> Signed-off-by: Stanislav Kinsburskiy 
>>>> ---
>>>>  fs/autofs4/autofs_i.h  |3 +++
>>>>  fs/autofs4/dev-ioctl.c |3 +++
>>>>  fs/autofs4/inode.c |4 +++-
>>>>  3 files changed, 9 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
>>>> index 4737615..3da105f 100644
>>>> --- a/fs/autofs4/autofs_i.h
>>>> +++ b/fs/autofs4/autofs_i.h
>>>> @@ -120,6 +120,9 @@ struct autofs_sb_info {
>>>>struct list_head active_list;
>>>>struct list_head expiring_list;
>>>>struct rcu_head rcu;
>>>> +#ifdef CONFIG_COMPAT
>>>> +  unsigned is32bit:1;
>>>> +#endif
>>>>  };
>>>>  
>>>>  static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
>>>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
>>>> index b7c816f..467d6c4 100644
>>>> --- a/fs/autofs4/dev-ioctl.c
>>>> +++ b/fs/autofs4/dev-ioctl.c
>>>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
>>>>sbi->pipefd = pipefd;
>>>>sbi->pipe = pipe;
>>>>sbi->catatonic = 0;
>>>> +#ifdef CONFIG_COMPAT
>>>> +  sbi->is32bit = is_compat_task();
>>>> +#endif
>>>>}
>>>>  out:
>>>>put_pid(new_pid);
>>>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
>>>> index 09e7d68..21d3c0b 100644
>>>> --- a/fs/autofs4/inode.c
>>>> +++ b/fs/autofs4/inode.c
>>>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void 
>>>> *data, int silent)
>>>>} else {
>>>>sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
>>>>}
>>>> -
>>>> +#ifdef CONFIG_COMPAT
>>>> +  sbi->is32bit = is_compat_task();
>>>> +#endif
>>>>if (autofs_type_trigger(sbi->type))
>>>>__managed_dentry_set_managed(root);
>>>>  
>>>>
>>>
>>> Not sure about this.
>>>
>>> Don't you think it would be better to avoid the in code #ifdefs by doing 
>>> some
>>> checks and defines in the header file and defining what's need to just use
>>> is_compat_task().
>>>
>>
>> Yes, might be...
>>
>>> Not sure 2 patches are needed for this either ..
>>>
>>
>> Well, I found this issue occasionally.
> 
> I'm wondering what the symptoms are?
> 

Size of struct autofs_v5_packet is 300 bytes for x86 and 304 bytes for x86_64.
Which means, that 32bit task can read more than size of autofs_v5_packet on 
64bit kernel.

>> And, frankly speaking, it's not clear to me, whether this issue is important 
>> at all, so I wanted to clarify this first.
>> Thanks to O_DIRECT, the only way to catch the issue is to try to read more, 
>> than expected, in compat task (that's how I found it).
> 
> Right, the O_DIRECT patch from Linus was expected to fix the structure
> alignment problem. The stuct field offsets are ok aren't they?
> 

Yes, they are ok.

>> I don't see any other flaw so far. And if so, that, probably, we shouldn't 
>> care about the issue at all.
>> What do you think?
> 
> If we are seeing hangs, incorrect struct fields or similar something
> should be done about it but if all is actually working ok then the
> O_DIRECT fix is doing it's job and further changes aren't necessary.
> 

Well, yes. O_DIRECT fix covers the issue.
Ok then.
Thanks for the clarification!

> Ian
> 


Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-14 Thread Stanislav Kinsburskiy


14.09.2017 02:38, Ian Kent пишет:
> On 01/09/17 19:21, Stanislav Kinsburskiy wrote:
>> Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
>> ---
>>  fs/autofs4/autofs_i.h  |3 +++
>>  fs/autofs4/dev-ioctl.c |3 +++
>>  fs/autofs4/inode.c |4 +++-
>>  3 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
>> index 4737615..3da105f 100644
>> --- a/fs/autofs4/autofs_i.h
>> +++ b/fs/autofs4/autofs_i.h
>> @@ -120,6 +120,9 @@ struct autofs_sb_info {
>>  struct list_head active_list;
>>  struct list_head expiring_list;
>>  struct rcu_head rcu;
>> +#ifdef CONFIG_COMPAT
>> +unsigned is32bit:1;
>> +#endif
>>  };
>>  
>>  static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
>> index b7c816f..467d6c4 100644
>> --- a/fs/autofs4/dev-ioctl.c
>> +++ b/fs/autofs4/dev-ioctl.c
>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
>>  sbi->pipefd = pipefd;
>>  sbi->pipe = pipe;
>>  sbi->catatonic = 0;
>> +#ifdef CONFIG_COMPAT
>> +sbi->is32bit = is_compat_task();
>> +#endif
>>  }
>>  out:
>>  put_pid(new_pid);
>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
>> index 09e7d68..21d3c0b 100644
>> --- a/fs/autofs4/inode.c
>> +++ b/fs/autofs4/inode.c
>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void 
>> *data, int silent)
>>  } else {
>>  sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
>>  }
>> -
>> +#ifdef CONFIG_COMPAT
>> +sbi->is32bit = is_compat_task();
>> +#endif
>>  if (autofs_type_trigger(sbi->type))
>>  __managed_dentry_set_managed(root);
>>  
>>
> 
> Not sure about this.
> 
> Don't you think it would be better to avoid the in code #ifdefs by doing some
> checks and defines in the header file and defining what's need to just use
> is_compat_task().
> 

Yes, might be...

> Not sure 2 patches are needed for this either ..
> 

Well, I found this issue occasionally.
And, frankly speaking, it's not clear to me, whether this issue is important at 
all, so I wanted to clarify this first.
Thanks to O_DIRECT, the only way to catch the issue is to try to read more, 
than expected, in compat task (that's how I found it).
I don't see any other flaw so far. And if so, that, probably, we shouldn't care 
about the issue at all.
What do you think?


> Ian
> 


 


Re: [RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-14 Thread Stanislav Kinsburskiy


14.09.2017 02:38, Ian Kent пишет:
> On 01/09/17 19:21, Stanislav Kinsburskiy wrote:
>> Signed-off-by: Stanislav Kinsburskiy 
>> ---
>>  fs/autofs4/autofs_i.h  |3 +++
>>  fs/autofs4/dev-ioctl.c |3 +++
>>  fs/autofs4/inode.c |4 +++-
>>  3 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
>> index 4737615..3da105f 100644
>> --- a/fs/autofs4/autofs_i.h
>> +++ b/fs/autofs4/autofs_i.h
>> @@ -120,6 +120,9 @@ struct autofs_sb_info {
>>  struct list_head active_list;
>>  struct list_head expiring_list;
>>  struct rcu_head rcu;
>> +#ifdef CONFIG_COMPAT
>> +unsigned is32bit:1;
>> +#endif
>>  };
>>  
>>  static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
>> diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
>> index b7c816f..467d6c4 100644
>> --- a/fs/autofs4/dev-ioctl.c
>> +++ b/fs/autofs4/dev-ioctl.c
>> @@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
>>  sbi->pipefd = pipefd;
>>  sbi->pipe = pipe;
>>  sbi->catatonic = 0;
>> +#ifdef CONFIG_COMPAT
>> +sbi->is32bit = is_compat_task();
>> +#endif
>>  }
>>  out:
>>  put_pid(new_pid);
>> diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
>> index 09e7d68..21d3c0b 100644
>> --- a/fs/autofs4/inode.c
>> +++ b/fs/autofs4/inode.c
>> @@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void 
>> *data, int silent)
>>  } else {
>>  sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
>>  }
>> -
>> +#ifdef CONFIG_COMPAT
>> +sbi->is32bit = is_compat_task();
>> +#endif
>>  if (autofs_type_trigger(sbi->type))
>>  __managed_dentry_set_managed(root);
>>  
>>
> 
> Not sure about this.
> 
> Don't you think it would be better to avoid the in code #ifdefs by doing some
> checks and defines in the header file and defining what's need to just use
> is_compat_task().
> 

Yes, might be...

> Not sure 2 patches are needed for this either ..
> 

Well, I found this issue occasionally.
And, frankly speaking, it's not clear to me, whether this issue is important at 
all, so I wanted to clarify this first.
Thanks to O_DIRECT, the only way to catch the issue is to try to read more, 
than expected, in compat task (that's how I found it).
I don't see any other flaw so far. And if so, that, probably, we shouldn't care 
about the issue at all.
What do you think?


> Ian
> 


 


[RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-01 Thread Stanislav Kinsburskiy
Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/autofs4/autofs_i.h  |3 +++
 fs/autofs4/dev-ioctl.c |3 +++
 fs/autofs4/inode.c |4 +++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 4737615..3da105f 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -120,6 +120,9 @@ struct autofs_sb_info {
struct list_head active_list;
struct list_head expiring_list;
struct rcu_head rcu;
+#ifdef CONFIG_COMPAT
+   unsigned is32bit:1;
+#endif
 };
 
 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index b7c816f..467d6c4 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
sbi->pipefd = pipefd;
sbi->pipe = pipe;
sbi->catatonic = 0;
+#ifdef CONFIG_COMPAT
+   sbi->is32bit = is_compat_task();
+#endif
}
 out:
put_pid(new_pid);
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 09e7d68..21d3c0b 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void *data, 
int silent)
} else {
sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
}
-
+#ifdef CONFIG_COMPAT
+   sbi->is32bit = is_compat_task();
+#endif
if (autofs_type_trigger(sbi->type))
__managed_dentry_set_managed(root);
 



[RFC PATCH 1/2] autofs: set compat flag on sbi when daemon uses 32bit addressation

2017-09-01 Thread Stanislav Kinsburskiy
Signed-off-by: Stanislav Kinsburskiy 
---
 fs/autofs4/autofs_i.h  |3 +++
 fs/autofs4/dev-ioctl.c |3 +++
 fs/autofs4/inode.c |4 +++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 4737615..3da105f 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -120,6 +120,9 @@ struct autofs_sb_info {
struct list_head active_list;
struct list_head expiring_list;
struct rcu_head rcu;
+#ifdef CONFIG_COMPAT
+   unsigned is32bit:1;
+#endif
 };
 
 static inline struct autofs_sb_info *autofs4_sbi(struct super_block *sb)
diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index b7c816f..467d6c4 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -397,6 +397,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
sbi->pipefd = pipefd;
sbi->pipe = pipe;
sbi->catatonic = 0;
+#ifdef CONFIG_COMPAT
+   sbi->is32bit = is_compat_task();
+#endif
}
 out:
put_pid(new_pid);
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 09e7d68..21d3c0b 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -301,7 +301,9 @@ int autofs4_fill_super(struct super_block *s, void *data, 
int silent)
} else {
sbi->oz_pgrp = get_task_pid(current, PIDTYPE_PGID);
}
-
+#ifdef CONFIG_COMPAT
+   sbi->is32bit = is_compat_task();
+#endif
if (autofs_type_trigger(sbi->type))
__managed_dentry_set_managed(root);
 



[RFC PATCH 2/2] autofs: sent 32-bit sized packet for 32-bit process

2017-09-01 Thread Stanislav Kinsburskiy
The structure autofs_v5_packet (except name) is not aligned by 8 bytes, which
leads to different sizes in 32 and 64-bit architectures.
Let's form 32-bit compatible packet when daemon has 32-bit addressation.

Suggested-by: Dmitry V. Levin <l...@altlinux.org>
Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/autofs4/waitq.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 24a58bf..1f9b7d8 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -151,6 +151,11 @@ static void autofs4_notify_daemon(struct autofs_sb_info 
*sbi,
struct autofs_v5_packet *packet = _pkt.v5_packet;
struct user_namespace *user_ns = sbi->pipe->f_cred->user_ns;
 
+#ifdef CONFIG_COMPAT
+   if (sbi->is32bit)
+   pktsz = offsetofend(struct autofs_v5_packet, name);
+   else
+#endif
pktsz = sizeof(*packet);
 
packet->wait_queue_token = wq->wait_queue_token;



[RFC PATCH 2/2] autofs: sent 32-bit sized packet for 32-bit process

2017-09-01 Thread Stanislav Kinsburskiy
The structure autofs_v5_packet (except name) is not aligned by 8 bytes, which
leads to different sizes in 32 and 64-bit architectures.
Let's form 32-bit compatible packet when daemon has 32-bit addressation.

Suggested-by: Dmitry V. Levin 
Signed-off-by: Stanislav Kinsburskiy 
---
 fs/autofs4/waitq.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/autofs4/waitq.c b/fs/autofs4/waitq.c
index 24a58bf..1f9b7d8 100644
--- a/fs/autofs4/waitq.c
+++ b/fs/autofs4/waitq.c
@@ -151,6 +151,11 @@ static void autofs4_notify_daemon(struct autofs_sb_info 
*sbi,
struct autofs_v5_packet *packet = _pkt.v5_packet;
struct user_namespace *user_ns = sbi->pipe->f_cred->user_ns;
 
+#ifdef CONFIG_COMPAT
+   if (sbi->is32bit)
+   pktsz = offsetofend(struct autofs_v5_packet, name);
+   else
+#endif
pktsz = sizeof(*packet);
 
packet->wait_queue_token = wq->wait_queue_token;



[RFC PATCH 0/2] autofs: fix autofs_v5_packet dlivery in compat mode

2017-09-01 Thread Stanislav Kinsburskiy
The problem is that in compat mode struct autofs_v5_packet has to have 
different size
(i.e. 4 bytes less).

This is RFC because:
1) This issue is hidden, because autofs pipe has O_DIRECT and the rest of the
epacket is truncated when read.
2) X86 arch doesn't have is_compat_task() helper
3) It's now clear, what to do if "pgrp" option is specified.

The following series implements...

---

Stanislav Kinsburskiy (2):
  autofs: set compat flag on sbi when daemon uses 32bit addressation
  autofs: sent 32-bit sized packet for 32-bit process


 fs/autofs4/autofs_i.h  |3 +++
 fs/autofs4/dev-ioctl.c |3 +++
 fs/autofs4/inode.c |4 +++-
 fs/autofs4/waitq.c |5 +
 4 files changed, 14 insertions(+), 1 deletion(-)


[RFC PATCH 0/2] autofs: fix autofs_v5_packet dlivery in compat mode

2017-09-01 Thread Stanislav Kinsburskiy
The problem is that in compat mode struct autofs_v5_packet has to have 
different size
(i.e. 4 bytes less).

This is RFC because:
1) This issue is hidden, because autofs pipe has O_DIRECT and the rest of the
epacket is truncated when read.
2) X86 arch doesn't have is_compat_task() helper
3) It's now clear, what to do if "pgrp" option is specified.

The following series implements...

---

Stanislav Kinsburskiy (2):
  autofs: set compat flag on sbi when daemon uses 32bit addressation
  autofs: sent 32-bit sized packet for 32-bit process


 fs/autofs4/autofs_i.h  |3 +++
 fs/autofs4/dev-ioctl.c |3 +++
 fs/autofs4/inode.c |4 +++-
 fs/autofs4/waitq.c |5 +
 4 files changed, 14 insertions(+), 1 deletion(-)


Re: [PATCH v2] prctl: remove one-shot limitation for changing exe link

2016-10-20 Thread Stanislav Kinsburskiy

Gentlemen, ping.

Let's decide something, how to get rid of this strange solution.

It doesn't provide the security it was aimed to, looks ugly and 
obfuscates the user of the feature.


It looks like it can be just thrown away.

But if not, please, advice, what should be changed to make is safe and 
solid.



27.09.2016 17:39, Stanislav Kinsburskiy пишет:

This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

This explanation doesn't look sufficient.
The only thing "exe" link is indicating is the file, used to execve, which is
basically nothing and not reliable immediately after process has returned from
execve system call.

Moreover, to use this feture, all the mappings to previous exe file have to be
unmapped and all the new exe file permissions must be satisfied.

Which means, that changing exe link is very similar to calling execve on the
binary.

The need to remove this limitations comes from migration of NFS mount point,
which is not accessible during restore and replaced by other file system.
Because of this exe link has to be changed twice.

v2:
Rebased on current linux-next
de
---
  include/linux/sched.h |4 +++-
  kernel/sys.c  |   10 --
  2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1c9b42..ad48b7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,7 +520,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
  #define MMF_VM_MERGEABLE  16  /* KSM may merge identical pages */
  #define MMF_VM_HUGEPAGE   17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
  
  #define MMF_HAS_UPROBES		19	/* has uprobes */

  #define MMF_RECALC_UPROBES20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
  
-	/*

-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);





Re: [PATCH v2] prctl: remove one-shot limitation for changing exe link

2016-10-20 Thread Stanislav Kinsburskiy

Gentlemen, ping.

Let's decide something, how to get rid of this strange solution.

It doesn't provide the security it was aimed to, looks ugly and 
obfuscates the user of the feature.


It looks like it can be just thrown away.

But if not, please, advice, what should be changed to make is safe and 
solid.



27.09.2016 17:39, Stanislav Kinsburskiy пишет:

This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

This explanation doesn't look sufficient.
The only thing "exe" link is indicating is the file, used to execve, which is
basically nothing and not reliable immediately after process has returned from
execve system call.

Moreover, to use this feture, all the mappings to previous exe file have to be
unmapped and all the new exe file permissions must be satisfied.

Which means, that changing exe link is very similar to calling execve on the
binary.

The need to remove this limitations comes from migration of NFS mount point,
which is not accessible during restore and replaced by other file system.
Because of this exe link has to be changed twice.

v2:
Rebased on current linux-next
de
---
  include/linux/sched.h |4 +++-
  kernel/sys.c  |   10 --
  2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1c9b42..ad48b7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,7 +520,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
  #define MMF_VM_MERGEABLE  16  /* KSM may merge identical pages */
  #define MMF_VM_HUGEPAGE   17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
  
  #define MMF_HAS_UPROBES		19	/* has uprobes */

  #define MMF_RECALC_UPROBES20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
  
-	/*

-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);





[PATCH v2] prctl: remove one-shot limitation for changing exe link

2016-09-27 Thread Stanislav Kinsburskiy
This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

This explanation doesn't look sufficient.
The only thing "exe" link is indicating is the file, used to execve, which is
basically nothing and not reliable immediately after process has returned from
execve system call.

Moreover, to use this feture, all the mappings to previous exe file have to be
unmapped and all the new exe file permissions must be satisfied.

Which means, that changing exe link is very similar to calling execve on the
binary.

The need to remove this limitations comes from migration of NFS mount point,
which is not accessible during restore and replaced by other file system.
Because of this exe link has to be changed twice.

v2:
Rebased on current linux-next

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
Acked-by: Oleg Nesterov <o...@redhat.com>
Acked-by: Cyrill Gorcunov <gorcu...@openvz.org>
---
 include/linux/sched.h |4 +++-
 kernel/sys.c  |   10 --
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1c9b42..ad48b7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,7 +520,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE   16  /* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
 
 #define MMF_HAS_UPROBES19  /* has uprobes */
 #define MMF_RECALC_UPROBES 20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
 
-   /*
-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);



[PATCH v2] prctl: remove one-shot limitation for changing exe link

2016-09-27 Thread Stanislav Kinsburskiy
This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

This explanation doesn't look sufficient.
The only thing "exe" link is indicating is the file, used to execve, which is
basically nothing and not reliable immediately after process has returned from
execve system call.

Moreover, to use this feture, all the mappings to previous exe file have to be
unmapped and all the new exe file permissions must be satisfied.

Which means, that changing exe link is very similar to calling execve on the
binary.

The need to remove this limitations comes from migration of NFS mount point,
which is not accessible during restore and replaced by other file system.
Because of this exe link has to be changed twice.

v2:
Rebased on current linux-next

Signed-off-by: Stanislav Kinsburskiy 
Acked-by: Oleg Nesterov 
Acked-by: Cyrill Gorcunov 
---
 include/linux/sched.h |4 +++-
 kernel/sys.c  |   10 --
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a1c9b42..ad48b7d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -520,7 +520,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE   16  /* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
 
 #define MMF_HAS_UPROBES19  /* has uprobes */
 #define MMF_RECALC_UPROBES 20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
 
-   /*
-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);



Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-08-10 Thread Stanislav Kinsburskiy



30.07.2016 19:31, Eric W. Biederman пишет:

Cyrill Gorcunov  writes:


On Mon, Jul 25, 2016 at 02:56:43PM -0500, Eric W. Biederman wrote:
...

Also there is a big fat bug in prctl_set_mm_exe_file.  It doesn't
validate that the new file is a actually mmaped executable.  We would
definitely need that to be fixed before even considering removing the
limit.

Could you please elaborate? We check for inode being executable,
what else needed?

That the inode is mmaped into the process with executable mappings.

Effectively what we check the old mapping for and refuse to remove the old
mm_exe_file if it exists.

I think a reasonable argument can be made that if the file is
executable, and it is mmaped with executable pages that exe_file is not
a complete lie.

I might be missing something obvious, so sorry for the question --
when criu setups old exe link the inode we obtain from file open
is not mapped into memory, the old exe not read by anyone because
it's not even executed anyhow. So I don't really understand which
mapping we should check here. Mind to point me?

That sounds like an out and out bug that should not be preserved.
Of course we should mmap the executable and set it up so that it can be
executed (at least as much as the executable was previously mapped).
Anything else is a buggy restart, and lying to userspace.


Which is the important part.  At the end of the day how much can
userspace trust /proc/pid/exe?  If we are too lax it is just a random
file descriptor we can not trust at all.  At which point there is
exactly no point in preserving it in checkpoint/restart, because nothing
will trust or look at it.

You know, I think we should not trust exe link much, and in real we
never could: this link is rather a hint pointing which executable a
process has been using on execve call, once the process start working
one can't be sure if the code currently running is exactly from the
file pointed by exe link. It just a hint suitable for debuggin and
obtain clean view of which processes are running on noncompromised
system. Monitoring exe link change won't help much if there are
malicious software running on the system.

But it is not just a hint.  It is a record of which executable we called
execve on.  Knowing which file was executed doesn't guarantee what is
running now but it provides a very strong hint.

At then end of a restart the state of a process should be (by
definition) exactly the state the process was before a checkpoint
and thus a state the original executable could have gotten into.

I admit it is possible for an application to unmap itself.  I honestly
have not met that application (except perhaps criu).


If the only user is checkpoint/restart perhaps it should be only ptrace
that can set this and not the process itself with a prctl.  I don't
know.  All I know is that we should work on making it a very trustable
value even though in some specific instances we can set it.

Since as I said I suppose nobody except us using this feature, we can
setup some sysctl trigger for it (I personally think this is an
overkill, but OTOH if people rely on the exe link and not going
to use criu at all, this trigger will help).

Some clarity of thought came to me, and I apologise for not replying
sooner with it sooner.

My problem with the original patch submission is that it was
justifying changing prctl_set_mm_exe_file based on what
prctl_set_mm_exe_file does today.  As prctl_set_mm_exe_file was added
for the checkpoint/restart case that is justifying changing code based
on a buggy implementation.

It is necessary to look at the ordinary situation.  Without
prctl_set_mm_exe /proc/[pid]/exe can be counted on as a record
of which executable was last passed to execve.  Furthermore the state of
a process can be counted on to be a state reachable from calling
execve on /proc/[pid]/exe.

Which means to preserve those expectations prctl_set_mm_exe_file should
in practice just be a nicer less cumbersome interface to things you can
already achieve with execve.

Justifying removale of the one-short nature for prctl_set_mm_exe_file
is as straight forward as noting that a process can call execve on
any executable file.

However when I compare the invariants that execve has on a file (such as
the executable being mmaped) I see some noticable disparities between
what prctl_set_mm_exe_file allows and what execve allows.  With
prctl_set_mm_exe being less strict.

So what I am requesting is very simple.  That the checks in
prctl_set_mm_exe_file be tightened up to more closely approach what
execve requires.  Thus preserving the value of the /proc/[pid]/exe for
the applications that want to use the exe link.


Eric, could you elaborate on the checks, you mentioned?
I don't see any significant checks, which are applicable to
prctl_set_mm_exe_file.
Say, there is a check for PF_NPROC_EXCEEDED in execve, but this is not 
applicable.
Some of the checks are related to file open, but this is not done in 

Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-08-10 Thread Stanislav Kinsburskiy



30.07.2016 19:31, Eric W. Biederman пишет:

Cyrill Gorcunov  writes:


On Mon, Jul 25, 2016 at 02:56:43PM -0500, Eric W. Biederman wrote:
...

Also there is a big fat bug in prctl_set_mm_exe_file.  It doesn't
validate that the new file is a actually mmaped executable.  We would
definitely need that to be fixed before even considering removing the
limit.

Could you please elaborate? We check for inode being executable,
what else needed?

That the inode is mmaped into the process with executable mappings.

Effectively what we check the old mapping for and refuse to remove the old
mm_exe_file if it exists.

I think a reasonable argument can be made that if the file is
executable, and it is mmaped with executable pages that exe_file is not
a complete lie.

I might be missing something obvious, so sorry for the question --
when criu setups old exe link the inode we obtain from file open
is not mapped into memory, the old exe not read by anyone because
it's not even executed anyhow. So I don't really understand which
mapping we should check here. Mind to point me?

That sounds like an out and out bug that should not be preserved.
Of course we should mmap the executable and set it up so that it can be
executed (at least as much as the executable was previously mapped).
Anything else is a buggy restart, and lying to userspace.


Which is the important part.  At the end of the day how much can
userspace trust /proc/pid/exe?  If we are too lax it is just a random
file descriptor we can not trust at all.  At which point there is
exactly no point in preserving it in checkpoint/restart, because nothing
will trust or look at it.

You know, I think we should not trust exe link much, and in real we
never could: this link is rather a hint pointing which executable a
process has been using on execve call, once the process start working
one can't be sure if the code currently running is exactly from the
file pointed by exe link. It just a hint suitable for debuggin and
obtain clean view of which processes are running on noncompromised
system. Monitoring exe link change won't help much if there are
malicious software running on the system.

But it is not just a hint.  It is a record of which executable we called
execve on.  Knowing which file was executed doesn't guarantee what is
running now but it provides a very strong hint.

At then end of a restart the state of a process should be (by
definition) exactly the state the process was before a checkpoint
and thus a state the original executable could have gotten into.

I admit it is possible for an application to unmap itself.  I honestly
have not met that application (except perhaps criu).


If the only user is checkpoint/restart perhaps it should be only ptrace
that can set this and not the process itself with a prctl.  I don't
know.  All I know is that we should work on making it a very trustable
value even though in some specific instances we can set it.

Since as I said I suppose nobody except us using this feature, we can
setup some sysctl trigger for it (I personally think this is an
overkill, but OTOH if people rely on the exe link and not going
to use criu at all, this trigger will help).

Some clarity of thought came to me, and I apologise for not replying
sooner with it sooner.

My problem with the original patch submission is that it was
justifying changing prctl_set_mm_exe_file based on what
prctl_set_mm_exe_file does today.  As prctl_set_mm_exe_file was added
for the checkpoint/restart case that is justifying changing code based
on a buggy implementation.

It is necessary to look at the ordinary situation.  Without
prctl_set_mm_exe /proc/[pid]/exe can be counted on as a record
of which executable was last passed to execve.  Furthermore the state of
a process can be counted on to be a state reachable from calling
execve on /proc/[pid]/exe.

Which means to preserve those expectations prctl_set_mm_exe_file should
in practice just be a nicer less cumbersome interface to things you can
already achieve with execve.

Justifying removale of the one-short nature for prctl_set_mm_exe_file
is as straight forward as noting that a process can call execve on
any executable file.

However when I compare the invariants that execve has on a file (such as
the executable being mmaped) I see some noticable disparities between
what prctl_set_mm_exe_file allows and what execve allows.  With
prctl_set_mm_exe being less strict.

So what I am requesting is very simple.  That the checks in
prctl_set_mm_exe_file be tightened up to more closely approach what
execve requires.  Thus preserving the value of the /proc/[pid]/exe for
the applications that want to use the exe link.


Eric, could you elaborate on the checks, you mentioned?
I don't see any significant checks, which are applicable to
prctl_set_mm_exe_file.
Say, there is a check for PF_NPROC_EXCEEDED in execve, but this is not 
applicable.
Some of the checks are related to file open, but this is not done in 

Re: [RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine

2016-08-04 Thread Stanislav Kinsburskiy



04.08.2016 15:16, Jeff Layton пишет:

On Thu, 2016-08-04 at 12:55 +0200, Stanislav Kinsburskiy wrote:

03.08.2016 19:36, Jeff Layton пишет:

On Wed, 2016-08-03 at 20:54 +0400, Stanislav Kinsburskiy wrote:

Otherwise freezer cgroup state might never become "FROZEN".

Here is a deadlock scheme for 2 processes in one freezer cgroup,
which is
freezing:

CPU 0   CPU 1

do_last
inode_lock(dir->d_inode)
vfs_create
nfs_create
...
__rpc_execute
rpc_wait_bit_killable
__refrigerator
  do_last
  inode_lock(dir->d_inode)

So, the problem is that one process takes directory inode mutex,
executes
creation request and goes to refrigerator.
Another one waits till directory lock is released, remains "thawed"
and thus
freezer cgroup state never becomes "FROZEN".

Notes:
1) Interesting, that this is not a pure deadlock: one can thaw cgroup
and then
freeze it again.
2) The issue was introduced by commit
d310310cbff18ec385c6ab4d58f33b100192a96a.
3) This patch is not aimed to fix the issue, but to show the problem
root.
Look like this problem moght be applicable to other hunks from the
commit,
mentioned above.



Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>

---
   net/sunrpc/sched.c |1 -
   1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9ae5885..ec7ccc1 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
   
   static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)

   {

-   freezable_schedule_unsafe();
if (signal_pending_state(mode, current))
return -ERESTARTSYS;
return 0;

Ummm...so what actually does the schedule() with this patch?

Schedule() replaces freezable_schedule_unsafe() of course, sorry for this.


There was a bit of discussion on this recently -- see the thread with
this subject line in linux-nfs:

  Re: Hang due to nfs letting tasks freeze with locked inodes

Thanks, had a look.


Basically it comes down to this:

All of the proposals so far to fix this problem just switch out the
freezable_schedule_unsafe (and similar) calls for those that don't
allow the process to freeze.

The problem there is that we originally added that stuff in response to
bug reports about machines failing to suspend. What often happens is
that the network interfaces come down, and then the freezer runs over
all of the processes, which never return because they're blocked
waiting on the server to reply.

I probably don't understand something, but this sounds somewhat wrong to
me: freezing processes _after_ network is down.




...shrug...

Maybe we should just go ahead and do it (and to CIFS as well). Just be
prepared for the inevitable complaints about laptops failing to suspend
once you do.

The worst part in all of this, from my POW, is that current behavior
makes NFS non-freezable in a generic case, even in case of freezing a
container, which has it's own net ns and NFS mount.
So, I would say, that returning of previous logic would make the
world better.


Part of the fix, I think is to add a return code (similar to
ERESTARTSYS) that gets interpreted near the kernel-userland boundary
as: "allow the process to be frozen, and then retry the call once it's
resumed".

With that, filesystems could return the error code when they want to
redrive the entire syscall from that level. That won't work for non-
idempotent requests though. We'd need to do something more elaborate
there.


Might be, that breaking rpc request is something that should be avoided
at all.
With all these locks being held, almost all (any?) of the requests to
remote server
should be considered as an atomic operation from freezer point of view.
The process always can be frozen on signal handling.

IOW, I might worth considering a scenario, when NFS is not freezable at all,
and any problems with suspend on laptops/whatever have to solved in
suspend code.



Fair enough. At this point, I don't care much one way or another. Maybe
if we make this change and laptops start failing to suspend, we'll be
able to use that as leverage pursue other avenues to make the
suspend/resume subsystem work with NFS.

That said, the patch you have really isn't sufficient. There are places
where the NFS client can sleep while waiting for things other than RPC
calls.


Sure. As I said, this patch wasn't aimed to fix the issue but rather 
start the discussion.

Thanks for your patch.



Re: [RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine

2016-08-04 Thread Stanislav Kinsburskiy



04.08.2016 15:16, Jeff Layton пишет:

On Thu, 2016-08-04 at 12:55 +0200, Stanislav Kinsburskiy wrote:

03.08.2016 19:36, Jeff Layton пишет:

On Wed, 2016-08-03 at 20:54 +0400, Stanislav Kinsburskiy wrote:

Otherwise freezer cgroup state might never become "FROZEN".

Here is a deadlock scheme for 2 processes in one freezer cgroup,
which is
freezing:

CPU 0   CPU 1

do_last
inode_lock(dir->d_inode)
vfs_create
nfs_create
...
__rpc_execute
rpc_wait_bit_killable
__refrigerator
  do_last
  inode_lock(dir->d_inode)

So, the problem is that one process takes directory inode mutex,
executes
creation request and goes to refrigerator.
Another one waits till directory lock is released, remains "thawed"
and thus
freezer cgroup state never becomes "FROZEN".

Notes:
1) Interesting, that this is not a pure deadlock: one can thaw cgroup
and then
freeze it again.
2) The issue was introduced by commit
d310310cbff18ec385c6ab4d58f33b100192a96a.
3) This patch is not aimed to fix the issue, but to show the problem
root.
Look like this problem moght be applicable to other hunks from the
commit,
mentioned above.



Signed-off-by: Stanislav Kinsburskiy 

---
   net/sunrpc/sched.c |1 -
   1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9ae5885..ec7ccc1 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
   
   static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)

   {

-   freezable_schedule_unsafe();
if (signal_pending_state(mode, current))
return -ERESTARTSYS;
return 0;

Ummm...so what actually does the schedule() with this patch?

Schedule() replaces freezable_schedule_unsafe() of course, sorry for this.


There was a bit of discussion on this recently -- see the thread with
this subject line in linux-nfs:

  Re: Hang due to nfs letting tasks freeze with locked inodes

Thanks, had a look.


Basically it comes down to this:

All of the proposals so far to fix this problem just switch out the
freezable_schedule_unsafe (and similar) calls for those that don't
allow the process to freeze.

The problem there is that we originally added that stuff in response to
bug reports about machines failing to suspend. What often happens is
that the network interfaces come down, and then the freezer runs over
all of the processes, which never return because they're blocked
waiting on the server to reply.

I probably don't understand something, but this sounds somewhat wrong to
me: freezing processes _after_ network is down.




...shrug...

Maybe we should just go ahead and do it (and to CIFS as well). Just be
prepared for the inevitable complaints about laptops failing to suspend
once you do.

The worst part in all of this, from my POW, is that current behavior
makes NFS non-freezable in a generic case, even in case of freezing a
container, which has it's own net ns and NFS mount.
So, I would say, that returning of previous logic would make the
world better.


Part of the fix, I think is to add a return code (similar to
ERESTARTSYS) that gets interpreted near the kernel-userland boundary
as: "allow the process to be frozen, and then retry the call once it's
resumed".

With that, filesystems could return the error code when they want to
redrive the entire syscall from that level. That won't work for non-
idempotent requests though. We'd need to do something more elaborate
there.


Might be, that breaking rpc request is something that should be avoided
at all.
With all these locks being held, almost all (any?) of the requests to
remote server
should be considered as an atomic operation from freezer point of view.
The process always can be frozen on signal handling.

IOW, I might worth considering a scenario, when NFS is not freezable at all,
and any problems with suspend on laptops/whatever have to solved in
suspend code.



Fair enough. At this point, I don't care much one way or another. Maybe
if we make this change and laptops start failing to suspend, we'll be
able to use that as leverage pursue other avenues to make the
suspend/resume subsystem work with NFS.

That said, the patch you have really isn't sufficient. There are places
where the NFS client can sleep while waiting for things other than RPC
calls.


Sure. As I said, this patch wasn't aimed to fix the issue but rather 
start the discussion.

Thanks for your patch.



Re: [RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine

2016-08-04 Thread Stanislav Kinsburskiy



03.08.2016 19:36, Jeff Layton пишет:

On Wed, 2016-08-03 at 20:54 +0400, Stanislav Kinsburskiy wrote:

Otherwise freezer cgroup state might never become "FROZEN".

Here is a deadlock scheme for 2 processes in one freezer cgroup,
which is
freezing:

CPU 0   CPU 1

do_last
inode_lock(dir->d_inode)
vfs_create
nfs_create
...
__rpc_execute
rpc_wait_bit_killable
__refrigerator
 do_last
 inode_lock(dir->d_inode)

So, the problem is that one process takes directory inode mutex,
executes
creation request and goes to refrigerator.
Another one waits till directory lock is released, remains "thawed"
and thus
freezer cgroup state never becomes "FROZEN".

Notes:
1) Interesting, that this is not a pure deadlock: one can thaw cgroup
and then
freeze it again.
2) The issue was introduced by commit
d310310cbff18ec385c6ab4d58f33b100192a96a.
3) This patch is not aimed to fix the issue, but to show the problem
root.
Look like this problem moght be applicable to other hunks from the
commit,
mentioned above.


Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
  net/sunrpc/sched.c |1 -
  1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9ae5885..ec7ccc1 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
  
  static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)

  {
-   freezable_schedule_unsafe();
if (signal_pending_state(mode, current))
return -ERESTARTSYS;
return 0;


Ummm...so what actually does the schedule() with this patch?


Schedule() replaces freezable_schedule_unsafe() of course, sorry for this.


There was a bit of discussion on this recently -- see the thread with
this subject line in linux-nfs:

 Re: Hang due to nfs letting tasks freeze with locked inodes


Thanks, had a look.


Basically it comes down to this:

All of the proposals so far to fix this problem just switch out the
freezable_schedule_unsafe (and similar) calls for those that don't
allow the process to freeze.

The problem there is that we originally added that stuff in response to
bug reports about machines failing to suspend. What often happens is
that the network interfaces come down, and then the freezer runs over
all of the processes, which never return because they're blocked
waiting on the server to reply.


I probably don't understand something, but this sounds somewhat wrong to 
me: freezing processes _after_ network is down.





...shrug...

Maybe we should just go ahead and do it (and to CIFS as well). Just be
prepared for the inevitable complaints about laptops failing to suspend
once you do.


The worst part in all of this, from my POW, is that current behavior
makes NFS non-freezable in a generic case, even in case of freezing a
container, which has it's own net ns and NFS mount.
So, I would say, that returning of previous logic would make the
world better.


Part of the fix, I think is to add a return code (similar to
ERESTARTSYS) that gets interpreted near the kernel-userland boundary
as: "allow the process to be frozen, and then retry the call once it's
resumed".

With that, filesystems could return the error code when they want to
redrive the entire syscall from that level. That won't work for non-
idempotent requests though. We'd need to do something more elaborate
there.



Might be, that breaking rpc request is something that should be avoided 
at all.
With all these locks being held, almost all (any?) of the requests to 
remote server

should be considered as an atomic operation from freezer point of view.
The process always can be frozen on signal handling.

IOW, I might worth considering a scenario, when NFS is not freezable at all,
and any problems with suspend on laptops/whatever have to solved in 
suspend code.





Re: [RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine

2016-08-04 Thread Stanislav Kinsburskiy



03.08.2016 19:36, Jeff Layton пишет:

On Wed, 2016-08-03 at 20:54 +0400, Stanislav Kinsburskiy wrote:

Otherwise freezer cgroup state might never become "FROZEN".

Here is a deadlock scheme for 2 processes in one freezer cgroup,
which is
freezing:

CPU 0   CPU 1

do_last
inode_lock(dir->d_inode)
vfs_create
nfs_create
...
__rpc_execute
rpc_wait_bit_killable
__refrigerator
 do_last
 inode_lock(dir->d_inode)

So, the problem is that one process takes directory inode mutex,
executes
creation request and goes to refrigerator.
Another one waits till directory lock is released, remains "thawed"
and thus
freezer cgroup state never becomes "FROZEN".

Notes:
1) Interesting, that this is not a pure deadlock: one can thaw cgroup
and then
freeze it again.
2) The issue was introduced by commit
d310310cbff18ec385c6ab4d58f33b100192a96a.
3) This patch is not aimed to fix the issue, but to show the problem
root.
Look like this problem moght be applicable to other hunks from the
commit,
mentioned above.


Signed-off-by: Stanislav Kinsburskiy 
---
  net/sunrpc/sched.c |1 -
  1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9ae5885..ec7ccc1 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
  
  static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)

  {
-   freezable_schedule_unsafe();
if (signal_pending_state(mode, current))
return -ERESTARTSYS;
return 0;


Ummm...so what actually does the schedule() with this patch?


Schedule() replaces freezable_schedule_unsafe() of course, sorry for this.


There was a bit of discussion on this recently -- see the thread with
this subject line in linux-nfs:

 Re: Hang due to nfs letting tasks freeze with locked inodes


Thanks, had a look.


Basically it comes down to this:

All of the proposals so far to fix this problem just switch out the
freezable_schedule_unsafe (and similar) calls for those that don't
allow the process to freeze.

The problem there is that we originally added that stuff in response to
bug reports about machines failing to suspend. What often happens is
that the network interfaces come down, and then the freezer runs over
all of the processes, which never return because they're blocked
waiting on the server to reply.


I probably don't understand something, but this sounds somewhat wrong to 
me: freezing processes _after_ network is down.





...shrug...

Maybe we should just go ahead and do it (and to CIFS as well). Just be
prepared for the inevitable complaints about laptops failing to suspend
once you do.


The worst part in all of this, from my POW, is that current behavior
makes NFS non-freezable in a generic case, even in case of freezing a
container, which has it's own net ns and NFS mount.
So, I would say, that returning of previous logic would make the
world better.


Part of the fix, I think is to add a return code (similar to
ERESTARTSYS) that gets interpreted near the kernel-userland boundary
as: "allow the process to be frozen, and then retry the call once it's
resumed".

With that, filesystems could return the error code when they want to
redrive the entire syscall from that level. That won't work for non-
idempotent requests though. We'd need to do something more elaborate
there.



Might be, that breaking rpc request is something that should be avoided 
at all.
With all these locks being held, almost all (any?) of the requests to 
remote server

should be considered as an atomic operation from freezer point of view.
The process always can be frozen on signal handling.

IOW, I might worth considering a scenario, when NFS is not freezable at all,
and any problems with suspend on laptops/whatever have to solved in 
suspend code.





[RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine

2016-08-03 Thread Stanislav Kinsburskiy
Otherwise freezer cgroup state might never become "FROZEN".

Here is a deadlock scheme for 2 processes in one freezer cgroup, which is
freezing:

CPU 0   CPU 1

do_last
inode_lock(dir->d_inode)
vfs_create
nfs_create
...
__rpc_execute
rpc_wait_bit_killable
__refrigerator
do_last
inode_lock(dir->d_inode)

So, the problem is that one process takes directory inode mutex, executes
creation request and goes to refrigerator.
Another one waits till directory lock is released, remains "thawed" and thus
freezer cgroup state never becomes "FROZEN".

Notes:
1) Interesting, that this is not a pure deadlock: one can thaw cgroup and then
freeze it again.
2) The issue was introduced by commit d310310cbff18ec385c6ab4d58f33b100192a96a.
3) This patch is not aimed to fix the issue, but to show the problem root.
Look like this problem moght be applicable to other hunks from the commit,
mentioned above.


Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 net/sunrpc/sched.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9ae5885..ec7ccc1 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-   freezable_schedule_unsafe();
if (signal_pending_state(mode, current))
return -ERESTARTSYS;
return 0;



[RFC PATCH] sunrpc: do not allow process to freeze within RPC state machine

2016-08-03 Thread Stanislav Kinsburskiy
Otherwise freezer cgroup state might never become "FROZEN".

Here is a deadlock scheme for 2 processes in one freezer cgroup, which is
freezing:

CPU 0   CPU 1

do_last
inode_lock(dir->d_inode)
vfs_create
nfs_create
...
__rpc_execute
rpc_wait_bit_killable
__refrigerator
do_last
inode_lock(dir->d_inode)

So, the problem is that one process takes directory inode mutex, executes
creation request and goes to refrigerator.
Another one waits till directory lock is released, remains "thawed" and thus
freezer cgroup state never becomes "FROZEN".

Notes:
1) Interesting, that this is not a pure deadlock: one can thaw cgroup and then
freeze it again.
2) The issue was introduced by commit d310310cbff18ec385c6ab4d58f33b100192a96a.
3) This patch is not aimed to fix the issue, but to show the problem root.
Look like this problem moght be applicable to other hunks from the commit,
mentioned above.


Signed-off-by: Stanislav Kinsburskiy 
---
 net/sunrpc/sched.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 9ae5885..ec7ccc1 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -253,7 +253,6 @@ EXPORT_SYMBOL_GPL(rpc_destroy_wait_queue);
 
 static int rpc_wait_bit_killable(struct wait_bit_key *key, int mode)
 {
-   freezable_schedule_unsafe();
if (signal_pending_state(mode, current))
return -ERESTARTSYS;
return 0;



Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-26 Thread Stanislav Kinsburskiy



25.07.2016 20:21, Eric W. Biederman пишет:

Stanislav Kinsburskiy <skinsbur...@virtuozzo.com> writes:


Gentlemen,

Looks like there are no objections to this patch.

There has been objection.

The only justification for the change that has been put forward is
someone doing a restore lazily.  I don't see a reason why you can't call
prctl_set_mm_exe_file until you have the file in place instead of a
place holder that sounds like a trivial solution to any restore issues.


If I understand your proposal correctly, you mean, that the call to
prctl_set_mm_exe_file can be posponed till the actual file is in place.
It can be done this way you propose (although it's ugly).
But you objection looks like you want to preserve some behavior you 
believe is reliable.

But it's not.


The truth is an unlimited settable exe link is essentially meaningless,
as you can't depend on it for anything.  One shot seems the best
compromise I have seen put forward between the definite
checkpoint/restart requirement to set the this value and the general
need to have something that makes sense and people can depend on for
system management.


Depending on exe link for system management is useful, but can't be 
reliable and

can't prevent malicious software to compromise the system.
If we wouldn't have the ability to change exe link, than the only thing 
we could be sure,

that process at least started with the binary we believe is reliable.

But since we can change it, the only thing we can rely now is the file 
is executable

and process have permissions to execute it.
And this guarantee in preserved across any amount of exe link replacement.


Also there is a big fat bug in prctl_set_mm_exe_file.  It doesn't
validate that the new file is a actually mmaped executable.  We would
definitely need that to be fixed before even considering removing the
limit.


Why do we need it? To guarantee, that process has read permission to the 
file?




Right now all I see is people involved in the implementation details of
their own little feature

So for the patch I am responding to:
Nacked-by: "Eric W. Biederman" <ebied...@xmission.com>

Plus the merge window is open so no one is taking any patches right now.
It is the time to take what has already been staged and get that code
merged.

Eric




Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-26 Thread Stanislav Kinsburskiy



25.07.2016 20:21, Eric W. Biederman пишет:

Stanislav Kinsburskiy  writes:


Gentlemen,

Looks like there are no objections to this patch.

There has been objection.

The only justification for the change that has been put forward is
someone doing a restore lazily.  I don't see a reason why you can't call
prctl_set_mm_exe_file until you have the file in place instead of a
place holder that sounds like a trivial solution to any restore issues.


If I understand your proposal correctly, you mean, that the call to
prctl_set_mm_exe_file can be posponed till the actual file is in place.
It can be done this way you propose (although it's ugly).
But you objection looks like you want to preserve some behavior you 
believe is reliable.

But it's not.


The truth is an unlimited settable exe link is essentially meaningless,
as you can't depend on it for anything.  One shot seems the best
compromise I have seen put forward between the definite
checkpoint/restart requirement to set the this value and the general
need to have something that makes sense and people can depend on for
system management.


Depending on exe link for system management is useful, but can't be 
reliable and

can't prevent malicious software to compromise the system.
If we wouldn't have the ability to change exe link, than the only thing 
we could be sure,

that process at least started with the binary we believe is reliable.

But since we can change it, the only thing we can rely now is the file 
is executable

and process have permissions to execute it.
And this guarantee in preserved across any amount of exe link replacement.


Also there is a big fat bug in prctl_set_mm_exe_file.  It doesn't
validate that the new file is a actually mmaped executable.  We would
definitely need that to be fixed before even considering removing the
limit.


Why do we need it? To guarantee, that process has read permission to the 
file?




Right now all I see is people involved in the implementation details of
their own little feature

So for the patch I am responding to:
Nacked-by: "Eric W. Biederman" 

Plus the merge window is open so no one is taking any patches right now.
It is the time to take what has already been staged and get that code
merged.

Eric




Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-20 Thread Stanislav Kinsburskiy



18.07.2016 22:11, One Thousand Gnomes пишет:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

IFF it is the same uid or root (in which case you already lost). In the
case of cross uid activity this is not true.


Could you elaborate on it, please?


Alan




Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-20 Thread Stanislav Kinsburskiy



18.07.2016 22:11, One Thousand Gnomes пишет:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

IFF it is the same uid or root (in which case you already lost). In the
case of cross uid activity this is not true.


Could you elaborate on it, please?


Alan




Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-13 Thread Stanislav Kinsburskiy



12.07.2016 18:52, Eric W. Biederman пишет:

Cyrill Gorcunov <gorcu...@gmail.com> writes:


On Tue, Jul 12, 2016 at 07:30:29PM +0400, Stanislav Kinsburskiy wrote:

This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

But the way how the feature can be used is the following:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

While removing this limitation allow to replace files from underneath of a
running process as many times as required. One of the use cases is network
file systems migration (NFS, to be precise) by CRIU.

NFS mount can't be mounted on restore stage because network is locked.
To overcome this limitation, another file system (FUSE-based) is used. Then
opened files replaced by the proper ones NFS is remounted.
Thus exe link replace has to be done twice: first on restore stage and second
- when actual NFS was remounted.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>

Persistent exe-link doesn't guarantee anything if you have rights to ptrace
task and inject own code into (from security POV). So lets rip it out.

Acked-by: Cyrill Gorcunov <gorcu...@openvz.org>

I believe the original concern was someone injecting a code into a
process and playing silly buggers with the exe link.  Someone who does
not have ptrace capability.

It is completely not ok to change this until someone goes back to the
original conversation and looks at the original threat model, and
refutes it.


Fair enough, Eric.
That's why all the people, who were participating in original 
discussion, included in the recipients list.



Eric





Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-13 Thread Stanislav Kinsburskiy



12.07.2016 18:52, Eric W. Biederman пишет:

Cyrill Gorcunov  writes:


On Tue, Jul 12, 2016 at 07:30:29PM +0400, Stanislav Kinsburskiy wrote:

This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

But the way how the feature can be used is the following:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

While removing this limitation allow to replace files from underneath of a
running process as many times as required. One of the use cases is network
file systems migration (NFS, to be precise) by CRIU.

NFS mount can't be mounted on restore stage because network is locked.
To overcome this limitation, another file system (FUSE-based) is used. Then
opened files replaced by the proper ones NFS is remounted.
Thus exe link replace has to be done twice: first on restore stage and second
- when actual NFS was remounted.

Signed-off-by: Stanislav Kinsburskiy 

Persistent exe-link doesn't guarantee anything if you have rights to ptrace
task and inject own code into (from security POV). So lets rip it out.

Acked-by: Cyrill Gorcunov 

I believe the original concern was someone injecting a code into a
process and playing silly buggers with the exe link.  Someone who does
not have ptrace capability.

It is completely not ok to change this until someone goes back to the
original conversation and looks at the original threat model, and
refutes it.


Fair enough, Eric.
That's why all the people, who were participating in original 
discussion, included in the recipients list.



Eric





Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-12 Thread Stanislav Kinsburskiy



12.07.2016 18:42, Oleg Nesterov пишет:

On 07/12, Stanislav Kinsburskiy wrote:

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
  
-	/*

-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-

I didn't even try to read the changelog so I do not know why do you
want this change ;)

But I would like to ack it in any case. I never understood why do we
want/need this MMF_EXE_FILE_CHANGED check, I suggested to remove it
many times.

And can't resist, please note the xchg() below. Currently (before this
patch) we do not need it. I was specially added to ensure that we can
just remove this test_and_set_bit(MMF_EXE_FILE_CHANGED) without adding
a race.


Thanks, Oleg. I'll take a look.
But should this be addressed in this patch? Especially if it's not 
needed even now (before this patch)?




Acked-by: Oleg Nesterov <o...@redhat.com>





Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-12 Thread Stanislav Kinsburskiy



12.07.2016 18:42, Oleg Nesterov пишет:

On 07/12, Stanislav Kinsburskiy wrote:

--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
  
-	/*

-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-

I didn't even try to read the changelog so I do not know why do you
want this change ;)

But I would like to ack it in any case. I never understood why do we
want/need this MMF_EXE_FILE_CHANGED check, I suggested to remove it
many times.

And can't resist, please note the xchg() below. Currently (before this
patch) we do not need it. I was specially added to ensure that we can
just remove this test_and_set_bit(MMF_EXE_FILE_CHANGED) without adding
a race.


Thanks, Oleg. I'll take a look.
But should this be addressed in this patch? Especially if it's not 
needed even now (before this patch)?




Acked-by: Oleg Nesterov 





[PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-12 Thread Stanislav Kinsburskiy
This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

But the way how the feature can be used is the following:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

While removing this limitation allow to replace files from underneath of a
running process as many times as required. One of the use cases is network
file systems migration (NFS, to be precise) by CRIU.

NFS mount can't be mounted on restore stage because network is locked.
To overcome this limitation, another file system (FUSE-based) is used. Then
opened files replaced by the proper ones NFS is remounted.
Thus exe link replace has to be done twice: first on restore stage and second
- when actual NFS was remounted.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 include/linux/sched.h |4 +++-
 kernel/sys.c  |   10 --
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..83b5f2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -518,7 +518,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE   16  /* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
 
 #define MMF_HAS_UPROBES19  /* has uprobes */
 #define MMF_RECALC_UPROBES 20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
 
-   /*
-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);



[PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-12 Thread Stanislav Kinsburskiy
This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

But the way how the feature can be used is the following:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

While removing this limitation allow to replace files from underneath of a
running process as many times as required. One of the use cases is network
file systems migration (NFS, to be precise) by CRIU.

NFS mount can't be mounted on restore stage because network is locked.
To overcome this limitation, another file system (FUSE-based) is used. Then
opened files replaced by the proper ones NFS is remounted.
Thus exe link replace has to be done twice: first on restore stage and second
- when actual NFS was remounted.

Signed-off-by: Stanislav Kinsburskiy 
---
 include/linux/sched.h |4 +++-
 kernel/sys.c  |   10 --
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..83b5f2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -518,7 +518,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE   16  /* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
 
 #define MMF_HAS_UPROBES19  /* has uprobes */
 #define MMF_RECALC_UPROBES 20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
 
-   /*
-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);



[PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-12 Thread Stanislav Kinsburskiy
This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

But the way how the feature can be used is the following:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

While removing this limitation allow to replace files from underneath of a
running process as many times as required. One of the use cases is network
file systems migration (NFS, to be precise) by CRIU.

NFS mount can't be mounted on restore stage because network is locked.
To overcome this limitation, another file system (FUSE-based) is used. Then
opened files replaced by the proper ones NFS is remounted.
Thus exe link replace has to be done twice: first on restore stage and second
- when actual NFS was remounted.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 include/linux/sched.h |4 +++-
 kernel/sys.c  |   10 --
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..83b5f2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -518,7 +518,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE   16  /* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
 
 #define MMF_HAS_UPROBES19  /* has uprobes */
 #define MMF_RECALC_UPROBES 20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
 
-   /*
-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);



[PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-12 Thread Stanislav Kinsburskiy
This limitation came with the reason to remove "another
way for malicious code to obscure a compromised program and
masquerade as a benign process" by allowing "security-concious program can use
this prctl once during its early initialization to ensure the prctl cannot
later be abused for this purpose":

http://marc.info/?l=linux-kernel=133160684517468=2

But the way how the feature can be used is the following:

1) Attach to process via ptrace (protected by CAP_SYS_PTRACE)
2) Unmap all the process file mappings, related to "exe" file.
3) Change exe link (protected by CAP_SYS_RESOURCE).

IOW, some other process already has an access to process internals (and thus
it's already compromised), and can inject fork and use the child of the
compromised program to masquerade.
Which means this limitation doesn't solve the problem it was aimed to.

While removing this limitation allow to replace files from underneath of a
running process as many times as required. One of the use cases is network
file systems migration (NFS, to be precise) by CRIU.

NFS mount can't be mounted on restore stage because network is locked.
To overcome this limitation, another file system (FUSE-based) is used. Then
opened files replaced by the proper ones NFS is remounted.
Thus exe link replace has to be done twice: first on restore stage and second
- when actual NFS was remounted.

Signed-off-by: Stanislav Kinsburskiy 
---
 include/linux/sched.h |4 +++-
 kernel/sys.c  |   10 --
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 553af29..83b5f2d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -518,7 +518,9 @@ static inline int get_dumpable(struct mm_struct *mm)
/* leave room for more dump flags */
 #define MMF_VM_MERGEABLE   16  /* KSM may merge identical pages */
 #define MMF_VM_HUGEPAGE17  /* set when VM_HUGEPAGE is set 
on vma */
-#define MMF_EXE_FILE_CHANGED   18  /* see prctl_set_mm_exe_file() */
+/* This ine-shot flag is droped due to necessivity of changing exe once again
+ * on NFS restore */
+//#define MMF_EXE_FILE_CHANGED 18  /* see prctl_set_mm_exe_file() */
 
 #define MMF_HAS_UPROBES19  /* has uprobes */
 #define MMF_RECALC_UPROBES 20  /* MMF_HAS_UPROBES can be wrong */
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..fd6f508 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1696,16 +1696,6 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, 
unsigned int fd)
fput(exe_file);
}
 
-   /*
-* The symlink can be changed only once, just to disallow arbitrary
-* transitions malicious software might bring in. This means one
-* could make a snapshot over all processes running and monitor
-* /proc/pid/exe changes to notice unusual activity if needed.
-*/
-   err = -EPERM;
-   if (test_and_set_bit(MMF_EXE_FILE_CHANGED, >flags))
-   goto exit;
-
err = 0;
/* set the new file, lockless */
get_file(exe.file);



Re: call_usermodehelper in containers

2016-02-13 Thread Stanislav Kinsburskiy



13.02.2016 00:39, Ian Kent пишет:

On Fri, 2013-11-15 at 15:54 +0400, Stanislav Kinsbursky wrote:

15.11.2013 15:03, Eric W. Biederman пишет:

Stanislav Kinsbursky  writes:


12.11.2013 17:30, Jeff Layton пишет:

On Tue, 12 Nov 2013 17:02:36 +0400
Stanislav Kinsbursky  wrote:


12.11.2013 15:12, Jeff Layton пишет:

On Mon, 11 Nov 2013 16:47:03 -0800
Greg KH  wrote:


On Mon, Nov 11, 2013 at 07:18:25AM -0500, Jeff Layton
wrote:

We have a bit of a problem wrt to upcalls that use
call_usermodehelper
with containers and I'd like to bring this to some sort
of resolution...

A particularly problematic case (though there are
others) is the
nfsdcltrack upcall. It basically uses
call_usermodehelper to run a
program in userland to track some information on stable
storage for
nfsd.

I thought the discussion at the kernel summit about this
issue was:
- don't do this.
- don't do it.
- if you really need to do this, fix nfsd


Sorry, I couldn't make the kernel summit so I missed that
discussion. I
guess LWN didn't cover it?

In any case, I guess then that we'll either have to come up
with some
way to fix nfsd here, or simply ensure that nfsd can never
be started
unless root in the container has a full set of a full set of
capabilities.

One sort of Rube Goldberg possibility to fix nfsd is:

- when we start nfsd in a container, fork off an extra
kernel thread
  that just sits idle. That thread would need to be a
descendant of the
  userland process that started nfsd, so we'd need to
create it with
  kernel_thread().

- Have the kernel just start up the UMH program in the
init_ns mount
  namespace as it currently does, but also pass the pid
of the idle
  kernel thread to the UMH upcall.

- The program will then use /proc//root and
/proc//ns/* to set
  itself up for doing things properly.

Note that with this mechanism we can't actually run a
different binary
per container, but that's probably fine for most purposes.


Hmmm... Why we can't? We can go a bit further with userspace
idea.

We use UMH some very limited number of user programs. For 2,
actually:
1) /sbin/nfs_cache_getent
2) /sbin/nfsdcltrack


No, the kernel uses them for a lot more than that. Pretty much
all of
the keys API upcalls use it. See all of the callers of
call_usermodehelper. All of them are running user binaries out
of the
kernel, and almost all of them are certainly broken wrt
containers.


If we convert them into proxies, which use /proc//root
and /proc//ns/*, this will allow us to lookup the right
binary.
The only limitation here is presence of this "proxy" binaries
on "host".


Suppose I spawn my own container as a user, using all of this
spiffy
new user namespace stuff. Then I make the kernel use
call_usermodehelper to call the upcall in the init_ns, and then
trick
it into running my new "escape_from_namespace" program with
"real" root
privileges.

I don't think we can reasonably assume that having the kernel
exec an
arbitrary binary inside of a container is safe. Doing so inside
of the
init_ns is marginally more safe, but only marginally so...


And we don't need any significant changes in kernel.

BTW, Jeff, could you remind me, please, why exactly we need to
use UMH to run the binary?
What are this capabilities, which force us to do so?


Nothing _forces_ us to do so, but upcalls are very difficult to
handle,
and UMH has a lot of advantages over a long-running daemon
launched by
userland.

Originally, I created the nfsdcltrack upcall as a running daemon
called
nfsdcld, and the kernel used rpc_pipefs to communicate with it.

Everyone hated it because no one likes to have to run daemons
for
infrequently used upcalls. It's a pain for users to ensure that
it's
running and it's a pain to handle when it isn't. So, I was
encouraged
to turn that instead into a UMH upcall.

But leaving that aside, this problem is a lot larger than just
nfsd. We
have a *lot* of UMH upcalls in the kernel, so this problem is
more
general than just "fixing" nfsd's.


Ok. So we are talking about generic approach to UMH support in a
container (and/or namespace).

Actually, as far as I can see, there are more that one aspect,
which is not supported.
One one them is executing of the right binary. Another one is
capabilities (and maybe there are more, like user namespaces), but
I
don't really care about them for now.
Executing the right binary, actually, is not about namespaces at
all. This is about lookup implementation in VFS
(do_execve_common).




Would be great to unshare FS for forked UHM kthread and swap it to
desired root. This will solve the problem with proper lookup.
However,
as far as I understand, this approach is not welcome by the
community.

I don't understand that one.  Having a preforked thread with the
proper
environment that can act like kthreadd in terms of spawning user
mode
helpers works and is simple.  The only downside I can see is that
there
is extra overhead.


What do you mean by "simple" here? 

Re: call_usermodehelper in containers

2016-02-13 Thread Stanislav Kinsburskiy



13.02.2016 00:39, Ian Kent пишет:

On Fri, 2013-11-15 at 15:54 +0400, Stanislav Kinsbursky wrote:

15.11.2013 15:03, Eric W. Biederman пишет:

Stanislav Kinsbursky  writes:


12.11.2013 17:30, Jeff Layton пишет:

On Tue, 12 Nov 2013 17:02:36 +0400
Stanislav Kinsbursky  wrote:


12.11.2013 15:12, Jeff Layton пишет:

On Mon, 11 Nov 2013 16:47:03 -0800
Greg KH  wrote:


On Mon, Nov 11, 2013 at 07:18:25AM -0500, Jeff Layton
wrote:

We have a bit of a problem wrt to upcalls that use
call_usermodehelper
with containers and I'd like to bring this to some sort
of resolution...

A particularly problematic case (though there are
others) is the
nfsdcltrack upcall. It basically uses
call_usermodehelper to run a
program in userland to track some information on stable
storage for
nfsd.

I thought the discussion at the kernel summit about this
issue was:
- don't do this.
- don't do it.
- if you really need to do this, fix nfsd


Sorry, I couldn't make the kernel summit so I missed that
discussion. I
guess LWN didn't cover it?

In any case, I guess then that we'll either have to come up
with some
way to fix nfsd here, or simply ensure that nfsd can never
be started
unless root in the container has a full set of a full set of
capabilities.

One sort of Rube Goldberg possibility to fix nfsd is:

- when we start nfsd in a container, fork off an extra
kernel thread
  that just sits idle. That thread would need to be a
descendant of the
  userland process that started nfsd, so we'd need to
create it with
  kernel_thread().

- Have the kernel just start up the UMH program in the
init_ns mount
  namespace as it currently does, but also pass the pid
of the idle
  kernel thread to the UMH upcall.

- The program will then use /proc//root and
/proc//ns/* to set
  itself up for doing things properly.

Note that with this mechanism we can't actually run a
different binary
per container, but that's probably fine for most purposes.


Hmmm... Why we can't? We can go a bit further with userspace
idea.

We use UMH some very limited number of user programs. For 2,
actually:
1) /sbin/nfs_cache_getent
2) /sbin/nfsdcltrack


No, the kernel uses them for a lot more than that. Pretty much
all of
the keys API upcalls use it. See all of the callers of
call_usermodehelper. All of them are running user binaries out
of the
kernel, and almost all of them are certainly broken wrt
containers.


If we convert them into proxies, which use /proc//root
and /proc//ns/*, this will allow us to lookup the right
binary.
The only limitation here is presence of this "proxy" binaries
on "host".


Suppose I spawn my own container as a user, using all of this
spiffy
new user namespace stuff. Then I make the kernel use
call_usermodehelper to call the upcall in the init_ns, and then
trick
it into running my new "escape_from_namespace" program with
"real" root
privileges.

I don't think we can reasonably assume that having the kernel
exec an
arbitrary binary inside of a container is safe. Doing so inside
of the
init_ns is marginally more safe, but only marginally so...


And we don't need any significant changes in kernel.

BTW, Jeff, could you remind me, please, why exactly we need to
use UMH to run the binary?
What are this capabilities, which force us to do so?


Nothing _forces_ us to do so, but upcalls are very difficult to
handle,
and UMH has a lot of advantages over a long-running daemon
launched by
userland.

Originally, I created the nfsdcltrack upcall as a running daemon
called
nfsdcld, and the kernel used rpc_pipefs to communicate with it.

Everyone hated it because no one likes to have to run daemons
for
infrequently used upcalls. It's a pain for users to ensure that
it's
running and it's a pain to handle when it isn't. So, I was
encouraged
to turn that instead into a UMH upcall.

But leaving that aside, this problem is a lot larger than just
nfsd. We
have a *lot* of UMH upcalls in the kernel, so this problem is
more
general than just "fixing" nfsd's.


Ok. So we are talking about generic approach to UMH support in a
container (and/or namespace).

Actually, as far as I can see, there are more that one aspect,
which is not supported.
One one them is executing of the right binary. Another one is
capabilities (and maybe there are more, like user namespaces), but
I
don't really care about them for now.
Executing the right binary, actually, is not about namespaces at
all. This is about lookup implementation in VFS
(do_execve_common).




Would be great to unshare FS for forked UHM kthread and swap it to
desired root. This will solve the problem with proper lookup.
However,
as far as I understand, this approach is not welcome by the
community.

I don't understand that one.  Having a preforked thread with the
proper
environment that can act like kthreadd in terms of spawning user
mode
helpers works and is simple.  The only downside I 

[PATCH v2] autofs: show pipe inode in mount options

2016-01-25 Thread Stanislav Kinsburskiy
This is required for CRIU (Checkpoint Restart In Userspace) to migrate a mount
point, when write end in user space is closed.
Below is the brief description of the problem.
To migrate non-catatonic autofs mount point, one have to restore control
pipe between kernel and and autofs master process.
One of the autofs masters is systemd, which closes pipe write end after
passing it to the kernel with mount call.
To be able to restore systemd control pipe, one have to know, which read pipe
end in systemd corresponds to the write pipe end in the kernel. Pipe "fd" in
mount options is not enough, because it was closed and probably replaced by
some other descriptor.
Thus, some other attribute is required to be able to find the read pipe end.
The best attribute, allowing to find correct pipe end is inode number, becuase
it's unique for the whole system and can't be reused until autofs mount
exists. This attribute also allows to recognize a situation with autofs mount
without master (no process with specified "pgrp" or not file descriptor with
"pipe_ino", specified in autofs mount options).

v2:
1) New option "pipe_ino" was moved to the end of the options list.
2) Option is printed only if CONFIG_CHECKPOINT_RESTORE is set.


Signed-off-by: Stanislav Kinsburskiy 
---
 fs/autofs4/inode.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index a3ae0b2..4320faa 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -94,7 +94,12 @@ static int autofs4_show_options(struct seq_file *m, struct 
dentry *root)
seq_printf(m, ",direct");
else
seq_printf(m, ",indirect");
-
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   if (sbi->pipe)
+   seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino);
+   else
+   seq_printf(m, ",pipe_ino=-1");
+#endif
return 0;
 }
 



Re: [PATCH] autofs: show pipe inode in mount options

2016-01-25 Thread Stanislav Kinsburskiy



23.01.2016 01:30, Ian Kent пишет:

On Fri, 2016-01-22 at 12:34 +0100, Stanislav Kinsburskiy wrote:

Hi again,

I would like to ask about any progress with this patch?
Any other requirements to make it able to merge?

Sorry for the delay.

Since there haven't been any comments from Al or Stephen I'm think I
should include it in the series I plan on sending to linux-next to
rename autofs4 to autofs (among other things).

I haven't had anything significant enough for autofs to warrant
maintaining a tree and sending push requests so I'll need to ask
Stephen what I need to do (perhaps you could offer some advise on that
now Stephen, please).

I'm also struggling to get back to this and carry out the needed
testing and I'll need to re-base the series too now but I'm getting
there.

I didn't see a follow up patch with an updated description, did I miss
it?


No, you didn't.
I have send it few minutes ago.
Thanks.


Ian




[PATCH v2] autofs: show pipe inode in mount options

2016-01-25 Thread Stanislav Kinsburskiy
This is required for CRIU (Checkpoint Restart In Userspace) to migrate a mount
point, when write end in user space is closed.
Below is the brief description of the problem.
To migrate non-catatonic autofs mount point, one have to restore control
pipe between kernel and and autofs master process.
One of the autofs masters is systemd, which closes pipe write end after
passing it to the kernel with mount call.
To be able to restore systemd control pipe, one have to know, which read pipe
end in systemd corresponds to the write pipe end in the kernel. Pipe "fd" in
mount options is not enough, because it was closed and probably replaced by
some other descriptor.
Thus, some other attribute is required to be able to find the read pipe end.
The best attribute, allowing to find correct pipe end is inode number, becuase
it's unique for the whole system and can't be reused until autofs mount
exists. This attribute also allows to recognize a situation with autofs mount
without master (no process with specified "pgrp" or not file descriptor with
"pipe_ino", specified in autofs mount options).

v2:
1) New option "pipe_ino" was moved to the end of the options list.
2) Option is printed only if CONFIG_CHECKPOINT_RESTORE is set.


Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/autofs4/inode.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index a3ae0b2..4320faa 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -94,7 +94,12 @@ static int autofs4_show_options(struct seq_file *m, struct 
dentry *root)
seq_printf(m, ",direct");
else
seq_printf(m, ",indirect");
-
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   if (sbi->pipe)
+   seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino);
+   else
+   seq_printf(m, ",pipe_ino=-1");
+#endif
return 0;
 }
 



Re: [PATCH] autofs: show pipe inode in mount options

2016-01-25 Thread Stanislav Kinsburskiy



23.01.2016 01:30, Ian Kent пишет:

On Fri, 2016-01-22 at 12:34 +0100, Stanislav Kinsburskiy wrote:

Hi again,

I would like to ask about any progress with this patch?
Any other requirements to make it able to merge?

Sorry for the delay.

Since there haven't been any comments from Al or Stephen I'm think I
should include it in the series I plan on sending to linux-next to
rename autofs4 to autofs (among other things).

I haven't had anything significant enough for autofs to warrant
maintaining a tree and sending push requests so I'll need to ask
Stephen what I need to do (perhaps you could offer some advise on that
now Stephen, please).

I'm also struggling to get back to this and carry out the needed
testing and I'll need to re-base the series too now but I'm getting
there.

I didn't see a follow up patch with an updated description, did I miss
it?


No, you didn't.
I have send it few minutes ago.
Thanks.


Ian




Re: [PATCH] autofs: show pipe inode in mount options

2016-01-22 Thread Stanislav Kinsburskiy

Hi again,

I would like to ask about any progress with this patch?
Any other requirements to make it able to merge?


Re: [PATCH] autofs: show pipe inode in mount options

2016-01-22 Thread Stanislav Kinsburskiy

Hi again,

I would like to ask about any progress with this patch?
Any other requirements to make it able to merge?


Re: [PATCH] autofs: show pipe inode in mount options

2016-01-07 Thread Stanislav Kinsburskiy

Good day, gentlemen.

Could you update, what's the status with this patch?
Without it it's impossible to match process pipe with kernel pipe, while 
this is "must have" to be able to migrate AutoFS via CRIU.




16.12.2015 13:02, Stanislav Kinsburskiy пишет:

This is required for CRIU to migrate a mount point, when write end in user
space is closed.
To be able to migrate such mount, read end of the pipe have to be searched
within autofs master process, and pipe inode will be used as a key.

Signed-off-by: Stanislav Kinsburskiy 
---
  fs/autofs4/inode.c |4 
  1 file changed, 4 insertions(+)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index a3ae0b2..16f875a 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -77,6 +77,10 @@ static int autofs4_show_options(struct seq_file *m, struct 
dentry *root)
return 0;
  
  	seq_printf(m, ",fd=%d", sbi->pipefd);

+   if (sbi->pipe)
+   seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino);
+   else
+   seq_printf(m, ",pipe_ino=-1");
if (!uid_eq(root_inode->i_uid, GLOBAL_ROOT_UID))
seq_printf(m, ",uid=%u",
from_kuid_munged(_user_ns, root_inode->i_uid));



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] autofs: show pipe inode in mount options

2016-01-07 Thread Stanislav Kinsburskiy

Good day, gentlemen.

Could you update, what's the status with this patch?
Without it it's impossible to match process pipe with kernel pipe, while 
this is "must have" to be able to migrate AutoFS via CRIU.




16.12.2015 13:02, Stanislav Kinsburskiy пишет:

This is required for CRIU to migrate a mount point, when write end in user
space is closed.
To be able to migrate such mount, read end of the pipe have to be searched
within autofs master process, and pipe inode will be used as a key.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
  fs/autofs4/inode.c |4 
  1 file changed, 4 insertions(+)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index a3ae0b2..16f875a 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -77,6 +77,10 @@ static int autofs4_show_options(struct seq_file *m, struct 
dentry *root)
return 0;
  
  	seq_printf(m, ",fd=%d", sbi->pipefd);

+   if (sbi->pipe)
+   seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino);
+   else
+   seq_printf(m, ",pipe_ino=-1");
if (!uid_eq(root_inode->i_uid, GLOBAL_ROOT_UID))
seq_printf(m, ",uid=%u",
from_kuid_munged(_user_ns, root_inode->i_uid));



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] autofs: show pipe inode in mount options

2015-12-16 Thread Stanislav Kinsburskiy
This is required for CRIU to migrate a mount point, when write end in user
space is closed.
To be able to migrate such mount, read end of the pipe have to be searched
within autofs master process, and pipe inode will be used as a key.

Signed-off-by: Stanislav Kinsburskiy 
---
 fs/autofs4/inode.c |4 
 1 file changed, 4 insertions(+)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index a3ae0b2..16f875a 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -77,6 +77,10 @@ static int autofs4_show_options(struct seq_file *m, struct 
dentry *root)
return 0;
 
seq_printf(m, ",fd=%d", sbi->pipefd);
+   if (sbi->pipe)
+   seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino);
+   else
+   seq_printf(m, ",pipe_ino=-1");
if (!uid_eq(root_inode->i_uid, GLOBAL_ROOT_UID))
seq_printf(m, ",uid=%u",
from_kuid_munged(_user_ns, root_inode->i_uid));

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] autofs: show pipe inode in mount options

2015-12-16 Thread Stanislav Kinsburskiy
This is required for CRIU to migrate a mount point, when write end in user
space is closed.
To be able to migrate such mount, read end of the pipe have to be searched
within autofs master process, and pipe inode will be used as a key.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/autofs4/inode.c |4 
 1 file changed, 4 insertions(+)

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index a3ae0b2..16f875a 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -77,6 +77,10 @@ static int autofs4_show_options(struct seq_file *m, struct 
dentry *root)
return 0;
 
seq_printf(m, ",fd=%d", sbi->pipefd);
+   if (sbi->pipe)
+   seq_printf(m, ",pipe_ino=%ld", sbi->pipe->f_inode->i_ino);
+   else
+   seq_printf(m, ",pipe_ino=-1");
if (!uid_eq(root_inode->i_uid, GLOBAL_ROOT_UID))
seq_printf(m, ",uid=%u",
from_kuid_munged(_user_ns, root_inode->i_uid));

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 2/3] splice: new SPLICE_F_PACKET flag introduced

2015-12-15 Thread Stanislav Kinsburskiy
This flag is used by kernel only to represent pipe packetized mode for splice
engine.

Signed-off-by: Stanislav Kinsburskiy 
---
 include/linux/splice.h |5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/splice.h b/include/linux/splice.h
index da2751d..13ca14d 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -20,6 +20,11 @@
 #define SPLICE_F_MORE  (0x04)  /* expect more data */
 #define SPLICE_F_GIFT  (0x08)  /* pages passed in are a gift */
 
+#ifdef __KERNEL__
+
+#define SPLICE_F_PACKET(0x1000) /* pipe buffers have to be packetized 
*/
+
+#endif
 /*
  * Passed to the actors
  */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 0/3] Fix splice for packetized pipes

2015-12-15 Thread Stanislav Kinsburskiy
With introduction of packetized pipes, splice wasn't updated to respect this
new behaviour.
In terms of splice it means, that it never set PIPE_BUF_FLAG_PACKET on
created pipe buffer regarless pipe operating mode, thus breaking the whole
logic.
To fix this, new SPLICE_F_PACKET flag was introduced. It's supposed to be used
only in kernel and set, when write pipe is in packetized mode. In
splice_to_pipe() it's converted into created pipe buffer PIPE_BUF_FLAG_PACKET
flag.

The following series implements...

---

Stanislav Kinsburskiy (3):
  pipe: make is_packetized() non-static and declare in pipe_fs_i.h
  splice: new SPLICE_F_PACKET flag introduced
  splice: add support of splicing to packetized pipe


 fs/pipe.c |2 +-
 fs/splice.c   |8 
 include/linux/pipe_fs_i.h |2 ++
 include/linux/splice.h|5 +
 4 files changed, 16 insertions(+), 1 deletion(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 1/3] pipe: make is_packetized() non-static and declare in pipe_fs_i.h

2015-12-15 Thread Stanislav Kinsburskiy
With introduction of packetized pipe mode, represented by O_DIRECT flag,
splice stopped working correctly with a pipe in this mode.
To be able to fix them, this helper have to exposed.

Signed-off-by: Stanislav Kinsburskiy 
---
 fs/pipe.c |2 +-
 include/linux/pipe_fs_i.h |2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 42cf8dd..645d142 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -329,7 +329,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
return ret;
 }
 
-static inline int is_packetized(struct file *file)
+int is_packetized(struct file *file)
 {
return (file->f_flags & O_DIRECT) != 0;
 }
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index eb8b8ac..9fdaf8f 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -146,4 +146,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
 
 int create_pipe_files(struct file **, int);
 
+int is_packetized(struct file *file);
+
 #endif

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 3/3] splice: add support of splicing to packetized pipe

2015-12-15 Thread Stanislav Kinsburskiy
This patch uses SPLICE_F_PACKET as a flag, representing packetized pipe.
In splice_to_pipe() this flag is converted into PIPE_BUF_FLAG_PACKET on pipe
buffer.

Signed-off-by: Stanislav Kinsburskiy 
---
 fs/splice.c |8 
 1 file changed, 8 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 4cf700d..d698fb2 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -210,6 +210,8 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
buf->ops = spd->ops;
if (spd->flags & SPLICE_F_GIFT)
buf->flags |= PIPE_BUF_FLAG_GIFT;
+   if (spd->flags & SPLICE_F_PACKET)
+   buf->flags |= PIPE_BUF_FLAG_PACKET;
 
pipe->nrbufs++;
page_nr++;
@@ -1420,6 +1422,9 @@ static long do_splice(struct file *in, loff_t __user 
*off_in,
offset = in->f_pos;
}
 
+   if (is_packetized(out))
+   flags |= SPLICE_F_PACKET;
+
ret = do_splice_to(in, , opipe, len, flags);
 
if (!off_in)
@@ -1609,6 +1614,9 @@ static long vmsplice_to_pipe(struct file *file, const 
struct iovec __user *iov,
if (splice_grow_spd(pipe, ))
return -ENOMEM;
 
+   if (is_packetized(file))
+   spd.flags |= SPLICE_F_PACKET;
+
spd.nr_pages = get_iovec_page_array(iov, nr_segs, spd.pages,
spd.partial, false,
spd.nr_pages_max);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] fcntl: allow to set O_DIRECT flag on pipe

2015-12-15 Thread Stanislav Kinsburskiy
With packetized mode for pipes, it's not possible to set O_DIRECT on pipe file
via sys_fcntl, because of unsupported (by pipes) sanity checks.
Ability to set this flag will be used by CRIU to migrate packetized pipes.

Signed-off-by: Stanislav Kinsburskiy 
---
 fs/fcntl.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..4463a87 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -51,7 +51,8 @@ static int setfl(int fd, struct file * filp, unsigned long 
arg)
   if (arg & O_NDELAY)
   arg |= O_NONBLOCK;
 
-   if (arg & O_DIRECT) {
+   /* Pipe packetized mode is controlled by O_DIRECT flag */
+   if (!IS_FIFO(f->f_mode) && (arg & O_DIRECT)) {
if (!filp->f_mapping || !filp->f_mapping->a_ops ||
!filp->f_mapping->a_ops->direct_IO)
return -EINVAL;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] fcntl: allow to set O_DIRECT flag on pipe

2015-12-15 Thread Stanislav Kinsburskiy
With packetized mode for pipes, it's not possible to set O_DIRECT on pipe file
via sys_fcntl, because of unsupported sanity checks.
Ability to set this flag will be used by CRIU to migrate packetized pipes.

v2:
Fixed typos and mode variable to check.

Signed-off-by: Stanislav Kinsburskiy 
---
 fs/fcntl.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..350a2c8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -51,7 +51,8 @@ static int setfl(int fd, struct file * filp, unsigned long 
arg)
   if (arg & O_NDELAY)
   arg |= O_NONBLOCK;
 
-   if (arg & O_DIRECT) {
+   /* Pipe packetized mode is controlled by O_DIRECT flag */
+   if (!S_ISFIFO(filp->f_inode->i_mode) && (arg & O_DIRECT)) {
if (!filp->f_mapping || !filp->f_mapping->a_ops ||
!filp->f_mapping->a_ops->direct_IO)
return -EINVAL;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 3/3] splice: add support of splicing to packetized pipe

2015-12-15 Thread Stanislav Kinsburskiy
This patch uses SPLICE_F_PACKET as a flag, representing packetized pipe.
In splice_to_pipe() this flag is converted into PIPE_BUF_FLAG_PACKET on pipe
buffer.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/splice.c |8 
 1 file changed, 8 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 4cf700d..d698fb2 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -210,6 +210,8 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
buf->ops = spd->ops;
if (spd->flags & SPLICE_F_GIFT)
buf->flags |= PIPE_BUF_FLAG_GIFT;
+   if (spd->flags & SPLICE_F_PACKET)
+   buf->flags |= PIPE_BUF_FLAG_PACKET;
 
pipe->nrbufs++;
page_nr++;
@@ -1420,6 +1422,9 @@ static long do_splice(struct file *in, loff_t __user 
*off_in,
offset = in->f_pos;
}
 
+   if (is_packetized(out))
+   flags |= SPLICE_F_PACKET;
+
ret = do_splice_to(in, , opipe, len, flags);
 
if (!off_in)
@@ -1609,6 +1614,9 @@ static long vmsplice_to_pipe(struct file *file, const 
struct iovec __user *iov,
if (splice_grow_spd(pipe, ))
return -ENOMEM;
 
+   if (is_packetized(file))
+   spd.flags |= SPLICE_F_PACKET;
+
spd.nr_pages = get_iovec_page_array(iov, nr_segs, spd.pages,
spd.partial, false,
spd.nr_pages_max);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 0/3] Fix splice for packetized pipes

2015-12-15 Thread Stanislav Kinsburskiy
With introduction of packetized pipes, splice wasn't updated to respect this
new behaviour.
In terms of splice it means, that it never set PIPE_BUF_FLAG_PACKET on
created pipe buffer regarless pipe operating mode, thus breaking the whole
logic.
To fix this, new SPLICE_F_PACKET flag was introduced. It's supposed to be used
only in kernel and set, when write pipe is in packetized mode. In
splice_to_pipe() it's converted into created pipe buffer PIPE_BUF_FLAG_PACKET
flag.

The following series implements...

---

Stanislav Kinsburskiy (3):
  pipe: make is_packetized() non-static and declare in pipe_fs_i.h
  splice: new SPLICE_F_PACKET flag introduced
  splice: add support of splicing to packetized pipe


 fs/pipe.c |2 +-
 fs/splice.c   |8 
 include/linux/pipe_fs_i.h |2 ++
 include/linux/splice.h|5 +
 4 files changed, 16 insertions(+), 1 deletion(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 2/3] splice: new SPLICE_F_PACKET flag introduced

2015-12-15 Thread Stanislav Kinsburskiy
This flag is used by kernel only to represent pipe packetized mode for splice
engine.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 include/linux/splice.h |5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/splice.h b/include/linux/splice.h
index da2751d..13ca14d 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -20,6 +20,11 @@
 #define SPLICE_F_MORE  (0x04)  /* expect more data */
 #define SPLICE_F_GIFT  (0x08)  /* pages passed in are a gift */
 
+#ifdef __KERNEL__
+
+#define SPLICE_F_PACKET(0x1000) /* pipe buffers have to be packetized 
*/
+
+#endif
 /*
  * Passed to the actors
  */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH RFC 1/3] pipe: make is_packetized() non-static and declare in pipe_fs_i.h

2015-12-15 Thread Stanislav Kinsburskiy
With introduction of packetized pipe mode, represented by O_DIRECT flag,
splice stopped working correctly with a pipe in this mode.
To be able to fix them, this helper have to exposed.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/pipe.c |2 +-
 include/linux/pipe_fs_i.h |2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 42cf8dd..645d142 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -329,7 +329,7 @@ pipe_read(struct kiocb *iocb, struct iov_iter *to)
return ret;
 }
 
-static inline int is_packetized(struct file *file)
+int is_packetized(struct file *file)
 {
return (file->f_flags & O_DIRECT) != 0;
 }
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index eb8b8ac..9fdaf8f 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -146,4 +146,6 @@ struct pipe_inode_info *get_pipe_info(struct file *file);
 
 int create_pipe_files(struct file **, int);
 
+int is_packetized(struct file *file);
+
 #endif

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] fcntl: allow to set O_DIRECT flag on pipe

2015-12-15 Thread Stanislav Kinsburskiy
With packetized mode for pipes, it's not possible to set O_DIRECT on pipe file
via sys_fcntl, because of unsupported (by pipes) sanity checks.
Ability to set this flag will be used by CRIU to migrate packetized pipes.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/fcntl.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..4463a87 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -51,7 +51,8 @@ static int setfl(int fd, struct file * filp, unsigned long 
arg)
   if (arg & O_NDELAY)
   arg |= O_NONBLOCK;
 
-   if (arg & O_DIRECT) {
+   /* Pipe packetized mode is controlled by O_DIRECT flag */
+   if (!IS_FIFO(f->f_mode) && (arg & O_DIRECT)) {
if (!filp->f_mapping || !filp->f_mapping->a_ops ||
!filp->f_mapping->a_ops->direct_IO)
return -EINVAL;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] fcntl: allow to set O_DIRECT flag on pipe

2015-12-15 Thread Stanislav Kinsburskiy
With packetized mode for pipes, it's not possible to set O_DIRECT on pipe file
via sys_fcntl, because of unsupported sanity checks.
Ability to set this flag will be used by CRIU to migrate packetized pipes.

v2:
Fixed typos and mode variable to check.

Signed-off-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>
---
 fs/fcntl.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ee85cd4..350a2c8 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -51,7 +51,8 @@ static int setfl(int fd, struct file * filp, unsigned long 
arg)
   if (arg & O_NDELAY)
   arg |= O_NONBLOCK;
 
-   if (arg & O_DIRECT) {
+   /* Pipe packetized mode is controlled by O_DIRECT flag */
+   if (!S_ISFIFO(filp->f_inode->i_mode) && (arg & O_DIRECT)) {
if (!filp->f_mapping || !filp->f_mapping->a_ops ||
!filp->f_mapping->a_ops->direct_IO)
return -EINVAL;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/