Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-27 Thread Mateusz Guzik
On Tue, Mar 27, 2018 at 08:29:39AM +0200, Michal Hocko wrote:
> On Tue 27-03-18 02:20:39, Yang Shi wrote:
> [...]
> The patch looks reasonable to me. Maybe it would be better to be more
> explicit about the purpose of the patch. As others noticed, this alone
> wouldn't solve the mmap_sem contention issues. I _think_ that if you
> were more explicit about the mmap_sem abuse it would trigger less
> questions.
> 

From what I gather, even with other fixes the kernel will still end up
grabbing the semaphore. In that case I don't see the upside of adding the
spinlock for the args. The downside is growth of mm_struct.

i.e. the code can be refactored to just take the lock once and relock only
if necessary (when it is unable to copy to userspace without faulting)

-- 
Mateusz Guzik


Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

2018-03-26 Thread Mateusz Guzik
On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> mmap_sem is on the hot path of the kernel and is very contended, but it
> is abused too. It is used to protect arg_start|end and env_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically that
> are not related to VM; they could be set to arbitrary values by C/R.
> 

They are not arbitrary - there is basic validation performed when
setting them.

> And, the mmap_sem contention may cause unexpected issue like below:
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>Tainted: GE 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps  D0 14018  1 0x0004
>   885582f84000 885e8682f000 880972943000 885ebf499bc0
>   8828ee12 c900349bfca8 817154d0 0040
>   00ff812f872a 885ebf499bc0 024000d000948300 880972943000
>  Call Trace:
>   [] ? __schedule+0x250/0x730
>   [] schedule+0x36/0x80
>   [] rwsem_down_read_failed+0xf0/0x150
>   [] call_rwsem_down_read_failed+0x18/0x30
>   [] down_read+0x20/0x40
>   [] proc_pid_cmdline_read+0xd9/0x4e0
>   [] ? do_filp_open+0xa5/0x100
>   [] __vfs_read+0x37/0x150
>   [] ? security_file_permission+0x9b/0xc0
>   [] vfs_read+0x96/0x130
>   [] SyS_read+0x55/0xc0
>   [] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
> 

While switching to the arg spinlock here will relieve mmap_sem to an
extent, it won't help with the problem you are seeing here.

proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm, and
there you get yet another down_read(&mm->mmap_sem).

i.e. the issue you ran into is well known and predates my change.

The problem does not stem from contention either, but from blocking for a
long time while holding the lock - the most common example is dealing
with a dead nfs mount vs mmapped areas.

I don't have good ideas on how to fix the problem. The least bad one I
came up with was to trylock with a timeout - after a failure either return
an error or resort to returning p_comm. ps/top could be modified to fall
back to snatching the name from /status.
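The trylock-with-timeout idea can be sketched in userspace like this (a pthread mutex stands in for mmap_sem, a bounded retry count approximates the timeout; all names here are hypothetical):

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Bounded-wait read: try to take the lock a few times, then fall back to
 * the short comm name instead of blocking indefinitely. */
static int read_cmdline_bounded(pthread_mutex_t *sem, const char *cmdline,
                                const char *comm, char *out, size_t outlen,
                                int max_tries)
{
    struct timespec backoff = { 0, 1000000 }; /* 1 ms between attempts */
    int i;

    for (i = 0; i < max_tries; i++) {
        if (pthread_mutex_trylock(sem) == 0) {
            snprintf(out, outlen, "%s", cmdline); /* full argv copy */
            pthread_mutex_unlock(sem);
            return 0;
        }
        nanosleep(&backoff, NULL);
    }
    snprintf(out, outlen, "[%s]", comm); /* fallback: p_comm only */
    return -1;
}
```

With the lock free the full cmdline is returned; with it held the caller gets the bracketed comm, mirroring the ps/top fallback suggested above.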

Since the lock owner is now being stored in the semaphore, perhaps the
above routine can happily spin until it grabs the lock or the owner is
detected to have gone into uninterruptible sleep and react accordingly. 

I don't know whether it is feasible to somehow avoid the mmap lock
altogether.

If it has to be there no matter what, the code can be refactored to grab
it once and relock only if the copyout would fault. This would in particular
reduce the number of times it is taken to begin with and still provide
the current synchronisation against prctl. But the fundamental problem
will remain.

That said, the refactoring above will have the same effect as your patch
and will avoid growing mm_struct.

That's my $0.03. MM overlords have to comment on what to do with this.
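The grab-once/relock-on-fault refactoring described above can be sketched in userspace as follows (a mutex stands in for mmap_sem and a callback simulates copy_to_user; all names are hypothetical):

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

struct fake_mm {
    pthread_mutex_t sem;              /* stands in for mmap_sem */
    unsigned long arg_start, arg_end;
};

typedef int (*copy_fn)(char *dst, unsigned long start, unsigned long end);

/* Snapshot the range under the lock, drop it for the copyout, and relock
 * to refresh the snapshot only when the copy "faults" - so the lock is
 * normally taken just once per read. */
static int read_args(struct fake_mm *mm, char *dst, copy_fn copy)
{
    unsigned long start, end;
    int err;

    pthread_mutex_lock(&mm->sem);
    start = mm->arg_start;
    end = mm->arg_end;
    pthread_mutex_unlock(&mm->sem);

    err = copy(dst, start, end);
    if (err != -14)                   /* success, or an error other than -EFAULT */
        return err;

    /* copyout faulted: relock, refresh the snapshot, try once more */
    pthread_mutex_lock(&mm->sem);
    start = mm->arg_start;
    end = mm->arg_end;
    pthread_mutex_unlock(&mm->sem);
    return copy(dst, start, end);
}

static int fake_copy_calls;

/* Simulated copy_to_user: "faults" on the first attempt only. */
static int fake_copy_faulting(char *dst, unsigned long start, unsigned long end)
{
    (void)start; (void)end;
    if (fake_copy_calls++ == 0)
        return -14;                   /* -EFAULT */
    strcpy(dst, "args");
    return 0;
}
```

The lock is held only for the snapshots, never across the (potentially faulting) copy itself.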

> So, introduce a new spinlock in mm_struct to protect the concurrent access
> to arg_start|end and env_start|end.
> 
> And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
> sem for writing to protect against others") changed down_read to
> down_write to avoid a write race condition in prctl_set_mm(). Since we
> already have a dedicated lock to protect them, it is safe to change back
> to down_read.
> 
> Signed-off-by: Yang Shi 
> Cc: Alexey Dobriyan 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Cc: Mateusz Guzik 
> Cc: Cyrill Gorcunov 
> ---
> v1 --> v2:
> * Use spinlock instead of rwlock per Matthew's suggestion
> * Replace down_write with down_read in prctl_set_mm (see commit log for details)
> 
>  fs/proc/base.c   |  8 
>  include/linux/mm_types.h |  2 ++
>  kernel/fork.c|  1 +
>  kernel/sys.c | 14 ++
>  mm/init-mm.c |  1 +
>  5 files changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9298324..e0282b6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
>   goto out_mmput;
>   }
>  
> - down_read(&mm->mmap_sem);
> + spin_lock(&mm->arg_lock);
>   arg_start = mm->arg_start;
>   arg_end = mm->arg_end;
>   env_start = mm->env_start;
>   env_end = mm->env_end;
> - up_read(&mm->mmap_sem);
> + spin_unlock(&mm->arg_lock);
>  
>   BUG_ON(arg_start > arg_end);
>   BUG_ON(env_start > env_end);

Re: linux-next: build failure after merge of the vfs tree

2018-03-19 Thread Mateusz Guzik
On Mon, Mar 19, 2018 at 05:06:27PM +1100, Stephen Rothwell wrote:
> Hi Al,
> 
> After merging the vfs tree, today's linux-next build (x86_64 allnoconfig)
> failed like this:
> 
> fs/super.c: In function 'do_thaw_all_callback':
> fs/super.c:942:3: error: implicit declaration of function 
> 'emergency_thaw_bdev'; did you mean 'emergency_remount'? 
> [-Werror=implicit-function-declaration]
>emergency_thaw_bdev(sb);
>^~~
>emergency_remount
> 
> Caused by commit
> 
>   92afc556e622 ("buffer.c: call thaw_super during emergency thaw")
> 
> I have reverted that commit for today.
> 

Oops, I did not test with CONFIG_BLOCK disabled. The sysrq func itself is
guarded by it, so IMHO the right fixup is to do the same thing:

diff --git a/fs/super.c b/fs/super.c
index 5fa9a8d..86b5575 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -935,6 +935,7 @@ void emergency_remount(void)
}
 }
 
+#ifdef CONFIG_BLOCK
 static void do_thaw_all_callback(struct super_block *sb)
 {
down_write(&sb->s_umount);
@@ -968,6 +969,7 @@ void emergency_thaw_all(void)
schedule_work(work);
}
 }
+#endif
 
 /*
  * Unnamed block devices are dummy devices used by virtual


-- 
Mateusz Guzik



[PATCH] proc: get rid of task lock/unlock pair to read umask for the "status" file

2018-02-07 Thread Mateusz Guzik
get_task_umask locks/unlocks the task on its own. The only caller does
the same thing immediately after.

Utilize the fact that the task has to be locked anyway and just do it once.
Since there are no other users and the code is short, fold it in.
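The effect of the fold can be illustrated in userspace (a mutex stands in for task_lock(); the struct and function names below are hypothetical):

```c
#include <assert.h>
#include <pthread.h>

/* Instead of a helper that locks the task just to read fs->umask, read
 * both umask and max_fds inside the single locked section the caller
 * already needs - one lock/unlock pair instead of two. */
struct fake_fs { int umask; };
struct fake_files { int max_fds; };
struct fake_task {
    pthread_mutex_t alloc_lock;    /* stands in for task_lock() */
    struct fake_fs *fs;
    struct fake_files *files;
};

static void task_state_snapshot(struct fake_task *p, int *umask, int *max_fds)
{
    *umask = -1;                   /* sentinel: no Umask line printed */
    *max_fds = 0;

    pthread_mutex_lock(&p->alloc_lock);
    if (p->fs)
        *umask = p->fs->umask;
    if (p->files)
        *max_fds = p->files->max_fds;
    pthread_mutex_unlock(&p->alloc_lock);
}
```

The -1 sentinel replaces the old -ENOENT return, with the printout deferred until after the lock is dropped.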

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 fs/proc/array.c | 23 +--
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index d67a72d..5f0ec1f 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -141,25 +141,12 @@ static inline const char *get_task_state(struct task_struct *tsk)
return task_state_array[task_state_index(tsk)];
 }
 
-static inline int get_task_umask(struct task_struct *tsk)
-{
-   struct fs_struct *fs;
-   int umask = -ENOENT;
-
-   task_lock(tsk);
-   fs = tsk->fs;
-   if (fs)
-   umask = fs->umask;
-   task_unlock(tsk);
-   return umask;
-}
-
 static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *p)
 {
struct user_namespace *user_ns = seq_user_ns(m);
struct group_info *group_info;
-   int g, umask;
+   int g, umask = -1;
struct task_struct *tracer;
const struct cred *cred;
pid_t ppid, tpid = 0, tgid, ngid;
@@ -177,16 +164,16 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
ngid = task_numa_group_id(p);
cred = get_task_cred(p);
 
-   umask = get_task_umask(p);
-   if (umask >= 0)
-   seq_printf(m, "Umask:\t%#04o\n", umask);
-
task_lock(p);
+   if (p->fs)
+   umask = p->fs->umask;
if (p->files)
max_fds = files_fdtable(p->files)->max_fds;
task_unlock(p);
rcu_read_unlock();
 
+   if (umask >= 0)
+   seq_printf(m, "Umask:\t%#04o\n", umask);
seq_printf(m, "State:\t%s", get_task_state(p));
 
seq_put_decimal_ull(m, "\nTgid:\t", tgid);
-- 
1.8.3.1




Re: [RESEND PATCH 05/13] vfs: Replace array of file pointers with an IDR

2017-10-04 Thread Mateusz Guzik
On Tue, Jul 11, 2017 at 06:50:52PM +0530, Sandhya Bankar wrote:
> Instead of storing all the file pointers in a single array, use an
> IDR.  It is RCU-safe, and does not need to be reallocated when the
> fd array grows.  It also handles allocation of new file descriptors.
> 
> ---
>  
[snip]
> @@ -604,22 +576,9 @@ void put_unused_fd(unsigned int fd)
>  void __fd_install(struct files_struct *files, unsigned int fd,
>   struct file *file)
>  {
> - struct fdtable *fdt;
> -
> - might_sleep();
> - rcu_read_lock_sched();
> -
> - while (unlikely(files->resize_in_progress)) {
> - rcu_read_unlock_sched();
> - wait_event(files->resize_wait, !files->resize_in_progress);
> - rcu_read_lock_sched();
> - }
> - /* coupled with smp_wmb() in expand_fdtable() */
> - smp_rmb();
> - fdt = rcu_dereference_sched(files->fdt);
> - BUG_ON(fdt->fd[fd] != NULL);
> - rcu_assign_pointer(fdt->fd[fd], file);
> - rcu_read_unlock_sched();
> + rcu_read_lock();
> + BUG_ON(idr_replace(&files->fd_idr, file, fd));
> + rcu_read_unlock();
>  }
>  
>  void fd_install(unsigned int fd, struct file *file)
> @@ -641,10 +600,9 @@ int __close_fd(struct files_struct *files, unsigned fd)
>   fdt = files_fdtable(files);
>   if (fd >= fdt->max_fds)
>   goto out_unlock;
> - file = fdt->fd[fd];
> + file = idr_remove(&files->fd_idr, fd);
>   if (!file)
>   goto out_unlock;
> - rcu_assign_pointer(fdt->fd[fd], NULL);
>   __clear_close_on_exec(fd, fdt);
>   __put_unused_fd(files, fd);
>   spin_unlock(&files->file_lock);

I have no opinions about the switch; however, these 2 places make me
worried. I did not check all the changes, so perhaps I missed something.

In the current code we are safe when it comes to concurrent install and
close, in particular here:

CPU0                        CPU1
alloc_fd
                            __close_fd
fd_install

__close_fd will either see a NULL pointer and return -EBADF or will see
an installed pointer and proceed with the close.

Your proposed patch seems to be buggy in this regard.

You call idr_remove, which from what I understand will free up the slot
no matter what. You only detect an error based on whether there was a
non-NULL pointer there or not. If the pointer was NULL, fd_install can
still proceed to play with a deallocated entry.
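A toy model of the race (a fixed-size table stands in for the fdtable/IDR; all names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* A slot reserved by alloc_fd but not yet installed must be treated by
 * close as "allocated but NULL", not freed - otherwise a racing
 * fd_install writes into a slot that close already released. */
enum slot_state { SLOT_FREE, SLOT_RESERVED, SLOT_INSTALLED };

struct toy_table {
    enum slot_state state[4];
    void *file[4];
};

/* Array-style close: only clears the pointer; a reserved-but-empty slot
 * is reported as EBADF and stays reserved for the pending install. */
static int toy_close(struct toy_table *t, int fd)
{
    if (t->state[fd] == SLOT_FREE || t->file[fd] == NULL)
        return -9; /* -EBADF: install has not happened yet */
    t->file[fd] = NULL;
    t->state[fd] = SLOT_FREE;
    return 0;
}

static void toy_install(struct toy_table *t, int fd, void *file)
{
    assert(t->state[fd] == SLOT_RESERVED); /* slot must still be ours */
    t->file[fd] = file;
    t->state[fd] = SLOT_INSTALLED;
}
```

An idr_remove-style close would instead free the reserved slot outright, and the install's assertion above is exactly what would then blow up (or worse, silently write into reused memory).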

-- 
Mateusz Guzik



[PATCH 0/2] fix emergency unfreezing

2017-10-03 Thread Mateusz Guzik
Filesystems frozen with fsfreeze --freeze are not getting thawed with
sysrq j (or echo j > /proc/sysrq-trigger). Thawing only deals with the
bdev part and does not do anything about fsfreeze.

I went for the easiest fix possible: just thaw the super block.
Doing it requires a little bit of refactoring in order to keep the sb
locked the whole time.

The first patch moves the sb iteration found in emergency remount to a
dedicated func for code reuse. This patch is a no-op.

The second employs the factored-out code to do both bdev and sb thaws.

Mateusz Guzik (2):
  vfs: factor sb iteration out of do_emergency_remount
  buffer.c: call thaw_super during emergency thaw

 fs/buffer.c|  25 +
 fs/super.c | 105 +++--
 include/linux/fs.h |   1 +
 3 files changed, 80 insertions(+), 51 deletions(-)

-- 
1.8.3.1




[PATCH 2/2] buffer.c: call thaw_super during emergency thaw

2017-10-03 Thread Mateusz Guzik
There are 2 distinct freezing mechanisms - one operates on block
devices and another one directly on super blocks. Both end up with the
same result, but thawing only one of them does not thaw the other.

In particular fsfreeze --freeze uses the ioctl variant going to the
super block. Since prior to this patch emergency thaw was not doing
a relevant thaw, filesystems frozen with this method remained
unaffected.

The patch is a hack which adds blind unfreezing.

In order to keep the super block write-locked the whole time the code
is shuffled around and the newly introduced __iterate_supers is
employed.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 fs/buffer.c| 25 +
 fs/super.c | 44 ++--
 include/linux/fs.h |  1 +
 3 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 170df85..37ea00b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -523,35 +523,12 @@ static int osync_buffers_list(spinlock_t *lock, struct list_head *list)
return err;
 }
 
-static void do_thaw_one(struct super_block *sb, void *unused)
+void emergency_thaw_bdev(struct super_block *sb)
 {
while (sb->s_bdev && !thaw_bdev(sb->s_bdev, sb))
printk(KERN_WARNING "Emergency Thaw on %pg\n", sb->s_bdev);
 }
 
-static void do_thaw_all(struct work_struct *work)
-{
-   iterate_supers(do_thaw_one, NULL);
-   kfree(work);
-   printk(KERN_WARNING "Emergency Thaw complete\n");
-}
-
-/**
- * emergency_thaw_all -- forcibly thaw every frozen filesystem
- *
- * Used for emergency unfreeze of all filesystems via SysRq
- */
-void emergency_thaw_all(void)
-{
-   struct work_struct *work;
-
-   work = kmalloc(sizeof(*work), GFP_ATOMIC);
-   if (work) {
-   INIT_WORK(work, do_thaw_all);
-   schedule_work(work);
-   }
-}
-
 /**
 * sync_mapping_buffers - write out & wait upon a mapping's "associated" buffers
  * @mapping: the mapping which wants those buffers written
diff --git a/fs/super.c b/fs/super.c
index fd9c02f..83c5c8a 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -36,6 +36,7 @@
 #include 
 #include "internal.h"
 
+static int thaw_super_locked(struct super_block *sb);
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
@@ -934,6 +935,40 @@ void emergency_remount(void)
}
 }
 
+static void do_thaw_all_callback(struct super_block *sb)
+{
+   down_write(&sb->s_umount);
+   if (sb->s_root && sb->s_flags & MS_BORN) {
+   emergency_thaw_bdev(sb);
+   thaw_super_locked(sb);
+   } else {
+   up_write(&sb->s_umount);
+   }
+}
+
+static void do_thaw_all(struct work_struct *work)
+{
+   __iterate_supers(do_thaw_all_callback);
+   kfree(work);
+   printk(KERN_WARNING "Emergency Thaw complete\n");
+}
+
+/**
+ * emergency_thaw_all -- forcibly thaw every frozen filesystem
+ *
+ * Used for emergency unfreeze of all filesystems via SysRq
+ */
+void emergency_thaw_all(void)
+{
+   struct work_struct *work;
+
+   work = kmalloc(sizeof(*work), GFP_ATOMIC);
+   if (work) {
+   INIT_WORK(work, do_thaw_all);
+   schedule_work(work);
+   }
+}
+
 /*
  * Unnamed block devices are dummy devices used by virtual
  * filesystems which don't use real block-devices.  -- jrs
@@ -1503,11 +1538,10 @@ int freeze_super(struct super_block *sb)
  *
  * Unlocks the filesystem and marks it writeable again after freeze_super().
  */
-int thaw_super(struct super_block *sb)
+static int thaw_super_locked(struct super_block *sb)
 {
int error;
 
-   down_write(&sb->s_umount);
if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
up_write(&sb->s_umount);
return -EINVAL;
@@ -1538,4 +1572,10 @@ int thaw_super(struct super_block *sb)
deactivate_locked_super(sb);
return 0;
 }
+
+int thaw_super(struct super_block *sb)
+{
+   down_write(&sb->s_umount);
+   return thaw_super_locked(sb);
+}
 EXPORT_SYMBOL(thaw_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 339e737..e3f1b13 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2428,6 +2428,7 @@ extern int finish_open(struct file *file, struct dentry *dentry,
 extern void kill_bdev(struct block_device *);
 extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
+extern void emergency_thaw_bdev(struct super_block *sb);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
 
-- 
1.8.3.1




[PATCH 1/2] vfs: factor sb iteration out of do_emergency_remount

2017-10-03 Thread Mateusz Guzik
The intent is to reduce code duplication with other code
iterating the list.

No functional changes.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 fs/super.c | 61 -
 1 file changed, 36 insertions(+), 25 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 166c4ee..fd9c02f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -574,6 +574,28 @@ void drop_super_exclusive(struct super_block *sb)
 }
 EXPORT_SYMBOL(drop_super_exclusive);
 
+static void __iterate_supers(void (*f)(struct super_block *))
+{
+   struct super_block *sb, *p = NULL;
+
+   spin_lock(&sb_lock);
+   list_for_each_entry(sb, &super_blocks, s_list) {
+   if (hlist_unhashed(&sb->s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(&sb_lock);
+
+   f(sb);
+
+   spin_lock(&sb_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(&sb_lock);
+}
 /**
  * iterate_supers - call function for all active superblocks
  * @f: function to call
@@ -881,33 +903,22 @@ int do_remount_sb(struct super_block *sb, int sb_flags, void *data, int force)
return retval;
 }
 
-static void do_emergency_remount(struct work_struct *work)
+static void do_emergency_remount_callback(struct super_block *sb)
 {
-   struct super_block *sb, *p = NULL;
-
-   spin_lock(&sb_lock);
-   list_for_each_entry(sb, &super_blocks, s_list) {
-   if (hlist_unhashed(&sb->s_instances))
-   continue;
-   sb->s_count++;
-   spin_unlock(&sb_lock);
-   down_write(&sb->s_umount);
-   if (sb->s_root && sb->s_bdev && (sb->s_flags & SB_BORN) &&
-   !sb_rdonly(sb)) {
-   /*
-* What lock protects sb->s_flags??
-*/
-   do_remount_sb(sb, SB_RDONLY, NULL, 1);
-   }
-   up_write(&sb->s_umount);
-   spin_lock(&sb_lock);
-   if (p)
-   __put_super(p);
-   p = sb;
+   down_write(&sb->s_umount);
+   if (sb->s_root && sb->s_bdev && (sb->s_flags & SB_BORN) &&
+   !sb_rdonly(sb)) {
+   /*
+* What lock protects sb->s_flags??
+*/
+   do_remount_sb(sb, SB_RDONLY, NULL, 1);
}
-   if (p)
-   __put_super(p);
-   spin_unlock(&sb_lock);
+   up_write(&sb->s_umount);
+}
+
+static void do_emergency_remount(struct work_struct *work)
+{
+   __iterate_supers(do_emergency_remount_callback);
kfree(work);
printk("Emergency Remount complete\n");
 }
-- 
1.8.3.1



[PATCH 0/2] minor fd cleanup

2017-10-03 Thread Mateusz Guzik
The first patch is a super minor nit and I'm fine with it being ignored.

Second patch is imho of actual value.

fd_install has an avoidable requirement to be called in a sleepable context.
It matters e.g. in RHEL 7, to which the patch was recently backported and
where the new requirement could not be enforced.

I don't see any benefits from sleeping over locking, and not sleeping
makes the function slightly nicer.

Mateusz Guzik (2):
  vfs: stop clearing close on exec when closing a fd
  vfs: grab the lock instead of blocking in __fd_install during resizing

 Documentation/filesystems/porting |  4 
 fs/file.c | 12 +++-
 2 files changed, 7 insertions(+), 9 deletions(-)

-- 
1.8.3.1



[PATCH 2/2] vfs: grab the lock instead of blocking in __fd_install during resizing

2017-10-03 Thread Mateusz Guzik
Explicit locking in the fallback case provides a safe state of the
table. Getting rid of blocking semantics makes __fd_install usable
again in non-sleepable contexts, which eases backporting efforts.

There is a side effect of slightly nicer assembly for the common case
as might_sleep can now be removed.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 Documentation/filesystems/porting |  4 
 fs/file.c | 11 +++
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 93e0a24..17bb4dc 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -502,10 +502,6 @@ in your dentry operations instead.
store it as cookie.
 --
 [mandatory]
-   __fd_install() & fd_install() can now sleep. Callers should not
-   hold a spinlock or other resources that do not allow a schedule.
---
-[mandatory]
any symlink that might use page_follow_link_light/page_put_link() must
have inode_nohighmem(inode) called before anything might start playing with
its pagecache.  No highmem pages should end up in the pagecache of such
diff --git a/fs/file.c b/fs/file.c
index 9d047bd..4115503 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -592,13 +592,16 @@ void __fd_install(struct files_struct *files, unsigned int fd,
 {
struct fdtable *fdt;
 
-   might_sleep();
rcu_read_lock_sched();
 
-   while (unlikely(files->resize_in_progress)) {
+   if (unlikely(files->resize_in_progress)) {
rcu_read_unlock_sched();
-   wait_event(files->resize_wait, !files->resize_in_progress);
-   rcu_read_lock_sched();
+   spin_lock(&files->file_lock);
+   fdt = files_fdtable(files);
+   BUG_ON(fdt->fd[fd] != NULL);
+   rcu_assign_pointer(fdt->fd[fd], file);
+   spin_unlock(&files->file_lock);
+   return;
}
/* coupled with smp_wmb() in expand_fdtable() */
smp_rmb();
-- 
1.8.3.1



[PATCH 1/2] vfs: stop clearing close on exec when closing a fd

2017-10-03 Thread Mateusz Guzik
Codepaths allocating a fd always make sure the bit is set/unset
depending on flags, thus clearing on close is redundant.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 fs/file.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/file.c b/fs/file.c
index 1fc7fbb..9d047bd 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -631,7 +631,6 @@ int __close_fd(struct files_struct *files, unsigned fd)
if (!file)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
-   __clear_close_on_exec(fd, fdt);
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
return filp_close(file, files);
-- 
1.8.3.1



Re: fs, net: deadlock between bind/splice on af_unix

2017-02-07 Thread Mateusz Guzik
On Sun, Feb 05, 2017 at 11:22:12PM -0800, Cong Wang wrote:
> On Tue, Jan 31, 2017 at 10:14 AM, Mateusz Guzik <mgu...@redhat.com> wrote:
> > On Mon, Jan 30, 2017 at 10:44:03PM -0800, Cong Wang wrote:
> >> Mind being more specific?
> >
> > Consider 2 threads which bind the same socket, but with different paths.
> >
> > Currently exactly one file will get created, the one used to bind.
> >
> > With your patch both threads can succeed creating their respective
> > files, but only one will manage to bind. The other one must error out,
> > but it already created a file it is unclear what to do with.
> 
> In this case, it simply puts the path back:
> 
> err = -EINVAL;
> if (u->addr)
> goto out_up;
> [...]
> 
> out_up:
> mutex_unlock(&u->bindlock);
> out_put:
> if (err)
> path_put(&path);
> out:
> return err;
> 
> 
> Which is what unix_release_sock() does too:
> 
> if (path.dentry)
> path_put(&path);

Yes, but unix_release_sock is expected to leave the file behind.
Note I'm not claiming there is a leak, but that racing threads will be
able to trigger a condition where you create a file and fail to bind it.

What to do with the file now?

Untested, but a likely working solution would rework the code so that
e.g. a flag is set and the lock can be dropped.


Re: fs, net: deadlock between bind/splice on af_unix

2017-01-31 Thread Mateusz Guzik
On Mon, Jan 30, 2017 at 10:44:03PM -0800, Cong Wang wrote:
> On Thu, Jan 26, 2017 at 10:41 PM, Mateusz Guzik <mgu...@redhat.com> wrote:
> > On Thu, Jan 26, 2017 at 09:11:07PM -0800, Cong Wang wrote:
> >> On Thu, Jan 26, 2017 at 3:29 PM, Mateusz Guzik <mgu...@redhat.com> wrote:
> >> > Currently the file creation is potponed until unix_bind can no longer
> >> > fail otherwise. With it reordered, it may be someone races you with a
> >> > different path and now you are left with a file to clean up. Except it
> >> > is quite unclear for me if you can unlink it.
> >>
> >> What races do you mean here? If you mean someone could get a
> >> refcount of that file, it could happen no matter we have bindlock or not
> >> since it is visible once created. The filesystem layer should take care of
> >> the file refcount so all we need to do here is calling path_put() as in my
> >> patch. Or if you mean two threads calling unix_bind() could race without
> >> binlock, only one of them should succeed the other one just fails out.
> >
> > Two threads can race and one fails with EINVAL.
> >
> > With your patch there is a new file created and it is unclear what to
> > do with it - leaving it as it is sounds like the last resort and
> > unlinking it sounds extremely fishy as it opens you to games played by
> > the user.
> 
> But the file is created and visible to users too even without my patch,
> the file is also put when the unix sock is released. So the only difference
> my patch makes is bindlock is no longer taken during file creation, which
> does not seem to be the cause of the problem you complain here.
> 
> Mind being more specific?

Consider 2 threads which bind the same socket, but with different paths.

Currently exactly one file will get created, the one used to bind.

With your patch both threads can succeed creating their respective
files, but only one will manage to bind. The other one must error out,
but it has already created a file and it is unclear what to do with it.


Re: timerfd: use-after-free in timerfd_remove_cancel

2017-01-30 Thread Mateusz Guzik
On Mon, Jan 30, 2017 at 07:41:59PM +0100, Dmitry Vyukov wrote:
> Hello,
> 
> The following program triggers use-after-free in timerfd_remove_cancel:
> https://gist.githubusercontent.com/dvyukov/202576d437c84ffbbe52e9ccd77e1b44/raw/5562bff8626a73627157331ea2b837f59080ac84/gistfile1.txt
> 
> BUG: KASAN: use-after-free in __list_del include/linux/list.h:104
> [inline] at addr 88006bab1410
> BUG: KASAN: use-after-free in __list_del_entry
> include/linux/list.h:119 [inline] at addr 88006bab1410
> BUG: KASAN: use-after-free in list_del_rcu include/linux/rculist.h:129
> [inline] at addr 88006bab1410
> BUG: KASAN: use-after-free in timerfd_remove_cancel fs/timerfd.c:120
> [inline] at addr 88006bab1410
> BUG: KASAN: use-after-free in timerfd_release+0x28e/0x310
> fs/timerfd.c:209 at addr 88006bab1410
> Write of size 8 by task a.out/2897
> 
[..]
> Seems that ctx->might_cancel is racy.
> 

Indeed it is. Can you try the patch below, please? If it works I'll send
it in a nicer form.

diff --git a/fs/timerfd.c b/fs/timerfd.c
index c173cc1..63f91c3 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -112,14 +112,30 @@ void timerfd_clock_was_set(void)
rcu_read_unlock();
 }
 
+static void timerfd_set_cancel(struct timerfd_ctx *ctx)
+{
+   if (ctx->might_cancel)
+   return;
+
+   spin_lock(&cancel_lock);
+   if (!ctx->might_cancel) {
+   ctx->might_cancel = true;
+   list_add_rcu(&ctx->clist, &cancel_list);
+   }
+   spin_unlock(&cancel_lock);
+}
+
 static void timerfd_remove_cancel(struct timerfd_ctx *ctx)
 {
+   if (!ctx->might_cancel)
+   return;
+
+   spin_lock(&cancel_lock);
if (ctx->might_cancel) {
ctx->might_cancel = false;
-   spin_lock(&cancel_lock);
list_del_rcu(&ctx->clist);
-   spin_unlock(&cancel_lock);
}
+   spin_unlock(&cancel_lock);
 }
 
 static bool timerfd_canceled(struct timerfd_ctx *ctx)
@@ -134,16 +150,10 @@ static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
 {
if ((ctx->clockid == CLOCK_REALTIME ||
 ctx->clockid == CLOCK_REALTIME_ALARM) &&
-   (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
-   if (!ctx->might_cancel) {
-   ctx->might_cancel = true;
-   spin_lock(&cancel_lock);
-   list_add_rcu(&ctx->clist, &cancel_list);
-   spin_unlock(&cancel_lock);
-   }
-   } else if (ctx->might_cancel) {
+   (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET))
+   timerfd_set_cancel(ctx);
+   else
timerfd_remove_cancel(ctx);
-   }
 }
 
 static ktime_t timerfd_get_remaining(struct timerfd_ctx *ctx)


Re: fs, net: deadlock between bind/splice on af_unix

2017-01-26 Thread Mateusz Guzik
On Thu, Jan 26, 2017 at 09:11:07PM -0800, Cong Wang wrote:
> On Thu, Jan 26, 2017 at 3:29 PM, Mateusz Guzik <mgu...@redhat.com> wrote:
> > On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
> >> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
> >> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> >> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
> >> >>
> >> >>> > Why do we do autobind there, anyway, and why is it conditional on
> >> >>> > SOCK_PASSCRED?  Note that e.g. for SOCK_STREAM we can bloody well get
> >> >>> > to sending stuff without autobind ever done - just use socketpair()
> >> >>> > to create that sucker and we won't be going through the connect()
> >> >>> > at all.
> >> >>>
> >> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls 
> >> >>> unix_autobind(),
> >> >>> not SOCK_STREAM.
> >> >>
> >> >> Yes, I've noticed.  What I'm asking is what in there needs autobind 
> >> >> triggered
> >> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
> >> >>
> >> >>> I guess some lock, perhaps the u->bindlock could be dropped before
> >> >>> acquiring the next one (sb_writer), but I need to double check.
> >> >>
> >> >> Bad idea, IMO - do you *want* autobind being able to come through while
> >> >> bind(2) is busy with mknod?
> >> >
> >> >
> >> > Ping. This is still happening on HEAD.
> >> >
> >>
> >> Thanks for your reminder. Mind to give the attached patch (compile only)
> >> a try? I take another approach to fix this deadlock, which moves the
> >> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
> >> impact with this way.
> >>
> >
> > I don't think this is the right approach.
> >
> > Currently the file creation is potponed until unix_bind can no longer
> > fail otherwise. With it reordered, it may be someone races you with a
> > different path and now you are left with a file to clean up. Except it
> > is quite unclear for me if you can unlink it.
> 
> What races do you mean here? If you mean someone could get a
> refcount of that file, it could happen no matter we have bindlock or not
> since it is visible once created. The filesystem layer should take care of
> the file refcount so all we need to do here is calling path_put() as in my
> patch. Or if you mean two threads calling unix_bind() could race without
> binlock, only one of them should succeed the other one just fails out.

Two threads can race and one fails with EINVAL.

With your patch there is a new file created and it is unclear what to
do with it - leaving it as it is sounds like the last resort and
unlinking it sounds extremely fishy as it opens you to games played by
the user.


Re: fs, net: deadlock between bind/splice on af_unix

2017-01-26 Thread Mateusz Guzik
On Tue, Jan 17, 2017 at 01:21:48PM -0800, Cong Wang wrote:
> On Mon, Jan 16, 2017 at 1:32 AM, Dmitry Vyukov  wrote:
> > On Fri, Dec 9, 2016 at 7:41 AM, Al Viro  wrote:
> >> On Thu, Dec 08, 2016 at 10:32:00PM -0800, Cong Wang wrote:
> >>
> >>> > Why do we do autobind there, anyway, and why is it conditional on
> >>> > SOCK_PASSCRED?  Note that e.g. for SOCK_STREAM we can bloody well get
> >>> > to sending stuff without autobind ever done - just use socketpair()
> >>> > to create that sucker and we won't be going through the connect()
> >>> > at all.
> >>>
> >>> In the case Dmitry reported, unix_dgram_sendmsg() calls unix_autobind(),
> >>> not SOCK_STREAM.
> >>
> >> Yes, I've noticed.  What I'm asking is what in there needs autobind 
> >> triggered
> >> on sendmsg and why doesn't the same need affect the SOCK_STREAM case?
> >>
> >>> I guess some lock, perhaps the u->bindlock could be dropped before
> >>> acquiring the next one (sb_writer), but I need to double check.
> >>
> >> Bad idea, IMO - do you *want* autobind being able to come through while
> >> bind(2) is busy with mknod?
> >
> >
> > Ping. This is still happening on HEAD.
> >
> 
> Thanks for your reminder. Mind to give the attached patch (compile only)
> a try? I take another approach to fix this deadlock, which moves the
> unix_mknod() out of unix->bindlock. Not sure if there is any unexpected
> impact with this way.
> 

I don't think this is the right approach.

Currently the file creation is postponed until unix_bind can no longer
fail otherwise. With it reordered, it may be that someone races you with
a different path and now you are left with a file to clean up. Except it
is quite unclear to me if you can unlink it.

I don't have a good idea how to fix it. A somewhat typical approach
would introduce an intermediate state ("under construction") and drop
the lock between calling into unix_mknod.

In this particular case, perhaps you could repurpose gc_flags as a
general flags carrier and add a 'binding in process' flag to test.


Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()

2016-10-05 Thread Mateusz Guzik
On Tue, Oct 04, 2016 at 11:06:15AM +0200, Jan Kara wrote:
> On Tue 04-10-16 10:53:40, Pierre Morel wrote:
> > When triggering thaw-filesystems via magic sysrq, the system enters a
> > loop in do_thaw_one(), as thaw_bdev() still returns success if
> > bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
> > error (and simplify the code a bit at the same time).
> > 
> 
> The patch looks good.
> 

Now that I had a closer look, while the patch indeed gets rid of the
infinite loop, the functionality itself does not work properly.

Note I'm not familiar with this code, so chances are I got details
wrong. I also don't know the original reasoning.

The current state is that you can freeze by calling either freeze_super
or freeze_bdev. The latter bumps bd_fsfreeze_count and calls
freeze_super. freeze_super does NOT modify bd_fsfreeze_count.

freeze_bdev is used by device mapper, xfs and e2fs.

freeze_super is used by the FIFREEZE ioctl.

This disparity leads to breakage.

A quick look suggests device freezing has its users and the sb is
optionally present. So the fix would consist of making all freezers
call the bdev variant.

The j sysrq ends up not working in 2 ways.
1. It ends up calling do_thaw_one -> thaw_bdev. It will look at the
bd_fsfreeze_count counter and error out. But if the user froze the fs
with the ioctl, the counter is not modified and the fs in question ends
up not being thawed.

2. (assuming 1 is fixed) Since freezing through freeze_bdev supports
nesting, multiple invocations of the sysrq may be needed to actually
thaw.

Now, looking at *_bdev functions:

> struct super_block *freeze_bdev(struct block_device *bdev)
> {   
> struct super_block *sb;
> int error = 0;
> 
> mutex_lock(&bdev->bd_fsfreeze_mutex);
> if (++bdev->bd_fsfreeze_count > 1) {

No limit is put in place, so in principle the counter will eventually overflow and turn negative.

> /*
>  * We don't even need to grab a reference - the first call
>  * to freeze_bdev grab an active reference and only the last
>  * thaw_bdev drops it.
>  */
> sb = get_super(bdev);
> if (sb) 
> drop_super(sb);
> mutex_unlock(&bdev->bd_fsfreeze_mutex);

The code assumes nobody thaws the fs from under the consumer.

That is, code doing sb = freeze_bdev(b); ...; thaw_bdev(b, sb); risks
use-after-free if the user thawed the fs.

I did not check if current consumers hold the object in different ways
and thus avoiding the bug.

> return sb;
> }
> 
> sb = get_active_super(bdev);
> if (!sb)
> goto out;
> if (sb->s_op->freeze_super)
> error = sb->s_op->freeze_super(sb);
> else
> error = freeze_super(sb);

This differs from the ioctl version, which has this check:

>if (sb->s_op->freeze_fs == NULL && sb->s_op->freeze_super == NULL)
>return -EOPNOTSUPP;

For a filesystem with freeze_fs == NULL, freeze_super will return 0 and
freeze_bdev will end up returning 0.

I don't know if this turns into a real problem.

> int thaw_bdev(struct block_device *bdev, struct super_block *sb)
[snip]
> if (sb->s_op->thaw_super)
> error = sb->s_op->thaw_super(sb);
> else
> error = thaw_super(sb);
> if (error) {
> bdev->bd_fsfreeze_count++;
> mutex_unlock(&bdev->bd_fsfreeze_mutex);
> return error;
> }

The fact that someone else can thaw makes it very fishy to pass super
blocks around in the first place. In particular:
kernel freeze -> user thaw -> mount -> user freeze -> kernel thaw

Are we guaranteed that the sb returned for the kernel freeze is the same
which was used for user freeze?

That said, rough proposed fix is as follows:
- store the pointer to the sb used for freezing in bdev
- drop the sb argument from thaw_bdev
- optionally return a referenced sb from freeze_bdev, otherwise just
  return NULL on success
- convert all freeze consumers to use *_bdev variants
- somewhat cosmetic, introduce nesting limit for freezes (say 2048 or
  whatever, just to have something)
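
A toy userspace model of the proposed bookkeeping (all names and the
nesting cap are invented) may make the intended semantics clearer: the
sb used for freezing is remembered in the bdev, thaw takes no sb
argument, and thawing an unfrozen bdev always reports an error:

```c
#include <errno.h>
#include <stddef.h>

/* Purely illustrative model; FREEZE_MAX and all names are made up. */
#define FREEZE_MAX 2048

struct sb { int frozen; };
struct bdev {
	int fsfreeze_count;
	struct sb *fsfreeze_sb;	/* sb used for freezing, per the proposal */
	struct sb sb;
};

static int freeze_bdev(struct bdev *b)
{
	if (b->fsfreeze_count >= FREEZE_MAX)
		return -EBUSY;		/* nesting limit */
	if (b->fsfreeze_count++ > 0)
		return 0;		/* already frozen, just nest */
	b->sb.frozen = 1;
	b->fsfreeze_sb = &b->sb;	/* remember the sb in the bdev */
	return 0;
}

static int thaw_bdev(struct bdev *b)
{
	if (b->fsfreeze_count == 0)
		return -EINVAL;		/* always report the error */
	if (--b->fsfreeze_count > 0)
		return 0;		/* still nested, stays frozen */
	b->fsfreeze_sb->frozen = 0;
	b->fsfreeze_sb = NULL;
	return 0;
}
```

It also shows why a single sysrq-j style thaw is not enough while
freezes are nested: only the final thaw actually unfreezes.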

The ioctl used to use freeze_bdev but this was changed with 18e9e5104fcd
"Introduce freeze_super and thaw_super for the fsfreeze ioctl" which
cites btrfs as the reason for freeze_super as opposed to freeze_bdev.

CC'ing the author for comments.

-- 
Mateusz Guzik


Re: [PATCH] fs/block_dev.c: always return error in thaw_bdev()

2016-10-02 Thread Mateusz Guzik
On Thu, Aug 20, 2015 at 12:07:49PM +0200, Pierre Morel wrote:
> When triggering thaw-filesystems via magic sysrq, the system enters a
> loop in do_thaw_one(), as thaw_bdev() still returns success if
> bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
> error (and simplify the code a bit at the same time).
> 
> Reviewed-by: Eric Farman <far...@linux.vnet.ibm.com>
> Reviewed-by: Cornelia Huck <cornelia.h...@de.ibm.com>
> Signed-off-by: Pierre Morel <pmo...@linux.vnet.ibm.com>
> ---
>  fs/block_dev.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index c7e4163..7809c92 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -269,14 +269,11 @@ int thaw_bdev(struct block_device *bdev, struct super_block *sb)
>   error = sb->s_op->thaw_super(sb);
>   else
>   error = thaw_super(sb);
> - if (error) {
> + if (error)
>   bdev->bd_fsfreeze_count++;
> - mutex_unlock(&bdev->bd_fsfreeze_mutex);
> - return error;
> - }
>  out:
>   mutex_unlock(&bdev->bd_fsfreeze_mutex);
> - return 0;
> + return error;
>  }
>  EXPORT_SYMBOL(thaw_bdev);
>  

Apparently this never got in.

The bug is still there, reproducible with mere:
echo j > /proc/sysrq-trigger

-- 
Mateusz Guzik


Re: [PATCH 3/4] autofs - make mountpoint checks namespace aware

2016-09-17 Thread Mateusz Guzik
On Wed, Sep 14, 2016 at 02:14:45PM +0800, Ian Kent wrote:
> If an automount mount is clone(2)ed into a file system that is
> propagation private, when it later expires in the originating
> namespace subsequent calls to autofs ->d_automount() for that
> dentry in the original namespace will return ELOOP until the
> mount is manually umounted in the cloned namespace.
> 
> In the same way, if an autofs mount is triggered by automount(8)
> running within a container the dentry will be seen as mounted in
> the root init namespace and calls to ->d_automount() in that namespace
> will return ELOOP until the mount is umounted within the container.
> 
> Also, have_submounts() can return an incorrect result when a mount
> exists in a namespace other than the one being checked.
> 
> @@ -460,7 +460,7 @@ static int autofs4_d_manage(struct dentry *dentry, bool rcu_walk)
>  
>   if (ino->flags & AUTOFS_INF_WANT_EXPIRE)
>   return 0;
> - if (d_mountpoint(dentry))
> + if (is_local_mountpoint(dentry))
>   return 0;
>   inode = d_inode_rcu(dentry);
>   if (inode && S_ISLNK(inode->i_mode))

This change is within RCU lookup.

is_local_mountpoint may end up calling __is_local_mountpoint, which will
optionally take the namespace_sem lock, resulting in a splat:

 #0:  (&(&sbi->fs_lock)->rlock){+.+...}, at: [] autofs4_d_manage+0x202/0x290
Preemption disabled at:[] autofs4_d_manage+0x202/0x290

CPU: 1 PID: 1307 Comm: iknowthis Not tainted 4.8.0-rc6-next-20160916dupa #448
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 8800077378f8 818eaffc 0001 
 880007737930 8110c870 880007588048 82483840
 0015  880007588040 880007737978
Call Trace:
 [] dump_stack+0x85/0xc9
 [] ___might_sleep+0x1e0/0x2e0
 [] __might_sleep+0x71/0xe0
 [] ? preempt_count_sub+0x39/0x80
 [] down_read+0x2a/0xc0
 [] __is_local_mountpoint+0x66/0xe0
 [] autofs4_d_manage+0x263/0x290
 [] follow_managed+0x157/0x480
 [] lookup_fast+0x3ab/0x690
 [] ? trailing_symlink+0x370/0x370
 [] ? path_init+0x917/0xa10
 [] ? __mutex_init+0x77/0x80
 [] path_openat+0x2bc/0x13e0
 [] ? path_lookupat+0x1f0/0x1f0
 [] ? __asan_loadN+0xf/0x20
 [] ? pvclock_clocksource_read+0xd6/0x180
 [] ? kvm_clock_read+0x23/0x40
 [] do_filp_open+0x122/0x1c0
 [] ? preempt_count_sub+0x39/0x80
 [] ? may_open_dev+0x50/0x50
 [] ? preempt_count_sub.part.67+0x18/0x90
 [] ? preempt_count_sub+0x39/0x80
 [] ? _raw_spin_unlock+0x31/0x50
 [] ? __alloc_fd+0x141/0x2b0
 [] do_sys_open+0x17c/0x2c0
 [] ? filp_open+0x60/0x60
 [] ? trace_hardirqs_on_thunk+0x1a/0x1c
 [] SyS_open+0x1e/0x20
 [] entry_SYSCALL_64_fastpath+0x1f/0xc2
 [] ? trace_hardirqs_off_caller+0xc5/0x120

I don't know this code. Perhaps it will be perfectly fine performance-wise
to just drop out of RCU lookup in this case.
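
For illustration, a userspace model of that fallback (the -ECHILD
convention matches what ->d_manage() uses; everything else here is
invented): in rcu_walk mode the check refuses to take the sleeping lock
and the caller retries in ref-walk mode, where blocking is allowed:

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-ins for namespace_sem and the mount table state. */
static pthread_mutex_t namespace_sem = PTHREAD_MUTEX_INITIALIZER;
static bool mounted_here = true;

static int is_local_mountpoint_slow(void)
{
	pthread_mutex_lock(&namespace_sem);	/* may sleep */
	int ret = mounted_here;
	pthread_mutex_unlock(&namespace_sem);
	return ret;
}

static int d_manage(bool rcu_walk)
{
	if (rcu_walk)
		return -ECHILD;		/* can't block: retry without RCU */
	if (is_local_mountpoint_slow())
		return 0;
	return -EISDIR;
}

/* the lookup path retries in ref-walk mode on -ECHILD */
static int follow_managed(void)
{
	int err = d_manage(true);
	if (err == -ECHILD)
		err = d_manage(false);
	return err;
}
```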

-- 
Mateusz Guzik


Re: fs: GPF in bd_mount

2016-09-04 Thread Mateusz Guzik
On Sun, Sep 04, 2016 at 12:43:28PM +0200, Dmitry Vyukov wrote:
> Hello,
> 
> The following program triggers GPF in bd_mount:
> 
> 
> // autogenerated by syzkaller (http://github.com/google/syzkaller)
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> int main()
> {
>   int fd;
> 
>   unshare(CLONE_NEWUSER);
>   unshare(CLONE_NEWNS);
>   mkdir("./bus", 0662515705056234013740);
>   mount("./bus/bus", "./bus", "bdev", 0,
>   "\xa9\x95\xbd\x88\x07\x6a\x39\xe8\xf4\xef\xf2\x6b\x88\x53\x1d\xdb"
>   "\xd2\x83\xf9\x5f\x4f\x44\x71\xf2\x08\x84\x2b\xae\x94\x87\xb7\xa6"
>   "\xf8\x9d\xc9\x96\xc7\x17\x2e\x22\xc4\xd2\xcc\xf9\x04\x0b\xd2\xaf"
>   "\xf3\x0b\xec\xeb\x2b\x75\xf2\x93\xa2\xd4\x00\xd8\x69\x47\x48\xf5"
>   "\xaf\x2b\xb8\x7c\x06\x04\x69\x8b\x46\x0d\x44\x79\x8c\x86\x68\xfd"
>   "\xd3\xb4\x1c\x8e\x9e\x6c\x58\x0c\xa5\xdf\x55\x4d\x59\x65\xc9\x70"
>   "\x7c\x8a\x44\x26\x7d\xba\xf0\x3d\x46\x9e\x3c\xbe\x22\xc3");
>   return 0;
> }
> 
> 
> general protection fault:  [#1] SMP DEBUG_PAGEALLOC KASAN
> Modules linked in:
> CPU: 2 PID: 4052 Comm: a.out Not tainted 4.8.0-rc3-next-20160825+ #11
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: 88006b37a380 task.stack: 880066dc
> RIP: 0010:[]  []
> bd_mount+0x52/0xa0 fs/block_dev.c:650
> RSP: 0018:880066dc7ca0  EFLAGS: 00010207
> RAX: dc00 RBX:  RCX: 0001
> RDX: 0018 RSI: 886fd6c0 RDI: 00c7
> RBP: 880066dc7cb0 R08: 88006b642cb8 R09: 
> R10:  R11:  R12: 8880d440
> R13: 88006a4ac1c0 R14: 88006b64b000 R15: 
> FS:  012b2880() GS:88006d20() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 004b2160 CR3: 6cc72000 CR4: 06e0
> Stack:
>  880068e2c840 8880d440 880066dc7cf0 8186e73b
>  0004 88006659c600 8880d440 88006a4ac1c0
>   880068e2c840 880066dc7d40 818ce44a
> Call Trace:
>  [] mount_fs+0x9b/0x2f0 fs/super.c:1177
>  [] vfs_kern_mount+0x7a/0x3e0 fs/namespace.c:948
>  [< inline >] do_new_mount fs/namespace.c:2393
>  [] do_mount+0x3d5/0x26b0 fs/namespace.c:2715
>  [< inline >] SYSC_mount fs/namespace.c:2907
>  [] SyS_mount+0xab/0x120 fs/namespace.c:2884
>  [] do_syscall_64+0x1df/0x640 arch/x86/entry/common.c:288
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> arch/x86/entry/entry_64.S:249
> Code: 87 e8 f3 ca fb ff 48 85 c0 48 89 c3 74 4c e8 a6 47 ca ff 48 8d
> bb c8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80>
> 3c 02 00 75 36 4c 8b a3 c8 00 00 00 48 b8 00 00 00 00 00 fc
> RIP  [] bd_mount+0x52/0xa0 fs/block_dev.c:650
>  RSP 
> ---[ end trace 0e5d909159d79633 ]---
> 
> 
> On commit 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


I think this is fixed with:
commit e9e5e3fae8da7e237049e00e0bfc9e32fd808fe8
Author: Vegard Nossum <vegard.nos...@oracle.com>
Date:   Mon Aug 22 12:47:43 2016 +0200

bdev: fix NULL pointer dereference

in mainline. The commit is absent in linux-next.

-- 
Mateusz Guzik


Re: [PACTH v4 1/3] mm, proc: Implement /proc//totmaps

2016-08-31 Thread Mateusz Guzik
On Wed, Aug 31, 2016 at 12:36:26PM -0400, Robert Foss wrote:
> On 2016-08-31 05:45 AM, Jacek Anaszewski wrote:
> > > +static void *m_totmaps_start(struct seq_file *p, loff_t *pos)
> > > +{
> > > +return NULL + (*pos == 0);
> > > +}
> > > +
> > > +static void *m_totmaps_next(struct seq_file *p, void *v, loff_t *pos)
> > > +{
> > > +++*pos;
> > > +return NULL;
> > > +}
> > > +
> > 
> > When reading totmaps of kernel processes the following NULL pointer
> > dereference occurs:
> > 
> > Unable to handle kernel NULL pointer dereference at virtual address
> > 0044
> > [] (down_read) from [] (totmaps_proc_show+0x2c/0x1e8)
> > [] (totmaps_proc_show) from [] (seq_read+0x1c8/0x4b8)
> > [] (seq_read) from [] (__vfs_read+0x2c/0x110)
> > [] (__vfs_read) from [] (vfs_read+0x8c/0x110)
> > [] (vfs_read) from [] (SyS_read+0x40/0x8c)
> > [] (SyS_read) from [] (ret_fast_syscall+0x0/0x3c)
> > 
> > It seems that some protection is needed for such processes, so that
> > totmaps would return empty string then, like in case of smaps.
> > 
> 
> Thanks for the testing Jacek!
> 
> I had a look around the corresponding smaps code, but I'm not seeing any
> checks, do you know where that check actually is made?
> 

See m_start in fs/proc/task_mmu.c. It not only checks for a non-NULL mm,
but also tries to bump ->mm_users and only then proceeds to walk the mm.
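
For reference, a userspace sketch of that pattern (names invented; the
real kernel helper for the conditional bump is mmget_not_zero()): only
walk the mm if it exists and its user count can be raised from a
non-zero value, so an mm that is already going away is skipped instead
of dereferenced:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mm { atomic_int mm_users; };
struct task { struct mm *mm; };

/* bump ->mm_users only if it is still non-zero (mmget_not_zero pattern) */
static bool mm_get_not_zero(struct mm *mm)
{
	int users = atomic_load(&mm->mm_users);
	while (users != 0) {
		if (atomic_compare_exchange_weak(&mm->mm_users, &users,
						 users + 1))
			return true;	/* safe to walk the mm now */
	}
	return false;			/* mm already torn down */
}

static bool start_walk(struct task *t)
{
	if (!t->mm)
		return false;		/* kernel thread: nothing to show */
	return mm_get_not_zero(t->mm);
}
```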

-- 
Mateusz Guzik


Re: [PATCHv2 0/2] introduce get_task_exe_file and use it to fix audit_exe_compare

2016-08-30 Thread Mateusz Guzik
On Tue, Aug 30, 2016 at 02:50:21PM -0400, Richard Guy Briggs wrote:
> On 2016-08-23 16:20, Mateusz Guzik wrote:
> > audit_exe_compare directly accesses mm->exe_file without making sure the
> > object is stable. Fixing it using current primitives results in
> > partially duplicating what proc_exe_link is doing.
> > 
> > As such, introduce a trivial helper which can be used in both places and
> > fix the func.
> > 
> > Changes since v1:
> > * removed an unused 'out' label which crept in
> > 
> > Mateusz Guzik (2):
> >   mm: introduce get_task_exe_file
> >   audit: fix exe_file access in audit_exe_compare
> 
> The task_lock affects a much bigger struct than the mm ref count.  Is
> this really necessary?  Is a spin-lock significantly lower cost than a
> refcount?  Other than that, this refactorization looks sensible.
> 

proc_exe_link was taking the lock anyway to guarantee a stable mm.
I think the helper cleans the code up a little bit and there is a
micro-optimisation in not playing with the refcount.

If audit_exe_compare has guarantees that the task won't reach exit_mm,
it can use get_mm_exe_file, which means the atomic op would be only on
the file object.

I was under the impression this is the expected behaviour, but your
patch used the task lock to grab mm, so I mimicked it here.

-- 
Mateusz Guzik


Re: [PATCHv2 1/2] mm: introduce get_task_exe_file

2016-08-23 Thread Mateusz Guzik
On Tue, Aug 23, 2016 at 04:48:13PM +0200, Oleg Nesterov wrote:
> On 08/23, Mateusz Guzik wrote:
> >
> > +struct file *get_task_exe_file(struct task_struct *task)
> > +{
> > +   struct file *exe_file = NULL;
> > +   struct mm_struct *mm;
> > +
> > +   task_lock(task);
> > +   mm = task->mm;
> > +   if (mm) {
> > +   if (!(task->flags & PF_KTHREAD))
> > +   exe_file = get_mm_exe_file(mm);
> > +   }
> > +   task_unlock(task);
> > +   return exe_file;
> > +}
> 
> I can't believe I am going to comment the coding style but I can't resist ;)
> 
>   if (mm && !(task->flags & PF_KTHREAD)))
>   exe_file = get_mm_exe_file(mm);
> 
> looks a bit simpler to me. But this is purely cosmetic and subjective,
> both patches look good to me.
> 

Actually I did it for some consistency with get_task_mm.

The check can likely be done prior to taking the lock in both functions
and that would clean them up a little bit, but I wanted to avoid nit
picking... :>

> Acked-by: Oleg Nesterov <o...@redhat.com>
> 

Thanks

-- 
Mateusz Guzik


[PATCHv2 0/2] introduce get_task_exe_file and use it to fix audit_exe_compare

2016-08-23 Thread Mateusz Guzik
audit_exe_compare directly accesses mm->exe_file without making sure the
object is stable. Fixing it using current primitives results in
partially duplicating what proc_exe_link is doing.

As such, introduce a trivial helper which can be used in both places and
fix the func.

Changes since v1:
* removed an unused 'out' label which crept in

Mateusz Guzik (2):
  mm: introduce get_task_exe_file
  audit: fix exe_file access in audit_exe_compare

 fs/proc/base.c       |  7 +--
 include/linux/mm.h   |  1 +
 kernel/audit_watch.c |  8 +---
 kernel/fork.c        | 23 +++
 4 files changed, 30 insertions(+), 9 deletions(-)

-- 
1.8.3.1



[PATCHv2 1/2] mm: introduce get_task_exe_file

2016-08-23 Thread Mateusz Guzik
For more convenient access if one has a pointer to the task.

As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.

Use the helper in proc_exe_link.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
---
 fs/proc/base.c |  7 +--
 include/linux/mm.h |  1 +
 kernel/fork.c  | 23 +++
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2ed41cb..ebccdc1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1556,18 +1556,13 @@ static const struct file_operations proc_pid_set_comm_operations = {
 static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 {
struct task_struct *task;
-   struct mm_struct *mm;
struct file *exe_file;
 
task = get_proc_task(d_inode(dentry));
if (!task)
return -ENOENT;
-   mm = get_task_mm(task);
+   exe_file = get_task_exe_file(task);
put_task_struct(task);
-   if (!mm)
-   return -ENOENT;
-   exe_file = get_mm_exe_file(mm);
-   mmput(mm);
if (exe_file) {
*exe_path = exe_file->f_path;
path_get(&exe_file->f_path);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d85402..f4e639e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2014,6 +2014,7 @@ extern void mm_drop_all_locks(struct mm_struct *mm);
 
 extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
 extern struct file *get_mm_exe_file(struct mm_struct *mm);
+extern struct file *get_task_exe_file(struct task_struct *task);
 
 extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages);
 extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6fe775c..a4b2384 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -800,6 +800,29 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 EXPORT_SYMBOL(get_mm_exe_file);
 
 /**
+ * get_task_exe_file - acquire a reference to the task's executable file
+ *
+ * Returns %NULL if task's mm (if any) has no associated executable file or
+ * this is a kernel thread with borrowed mm (see the comment above get_task_mm).
+ * User must release file via fput().
+ */
+struct file *get_task_exe_file(struct task_struct *task)
+{
+   struct file *exe_file = NULL;
+   struct mm_struct *mm;
+
+   task_lock(task);
+   mm = task->mm;
+   if (mm) {
+   if (!(task->flags & PF_KTHREAD))
+   exe_file = get_mm_exe_file(mm);
+   }
+   task_unlock(task);
+   return exe_file;
+}
+EXPORT_SYMBOL(get_task_exe_file);
+
+/**
  * get_task_mm - acquire a reference to the task's mm
  *
  * Returns %NULL if the task has no mm.  Checks PF_KTHREAD (meaning
-- 
1.8.3.1



[PATCHv2 2/2] audit: fix exe_file access in audit_exe_compare

2016-08-23 Thread Mateusz Guzik
Prior to the change the function would blindly deference mm, exe_file
and exe_file->f_inode, each of which could have been NULL or freed.

Use get_task_exe_file to safely obtain stable exe_file.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
Acked-by: Konstantin Khlebnikov <khlebni...@yandex-team.ru>
---
 kernel/audit_watch.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index d6709eb..0d302a8 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -19,6 +19,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 
+#include <linux/file.h>
 #include <linux/kernel.h>
 #include <linux/audit.h>
 #include <linux/kthread.h>
@@ -544,10 +545,11 @@ int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark)
unsigned long ino;
dev_t dev;
 
-   rcu_read_lock();
-   exe_file = rcu_dereference(tsk->mm->exe_file);
+   exe_file = get_task_exe_file(tsk);
+   if (!exe_file)
+   return 0;
ino = exe_file->f_inode->i_ino;
dev = exe_file->f_inode->i_sb->s_dev;
-   rcu_read_unlock();
+   fput(exe_file);
return audit_mark_compare(mark, ino, dev);
 }
-- 
1.8.3.1



[PATCH 2/2] audit: fix exe_file access in audit_exe_compare

2016-08-22 Thread Mateusz Guzik
Prior to the change the function would blindly deference mm, exe_file
and exe_file->f_inode, each of which could have been NULL or freed.

Use get_task_exe_file to safely obtain stable exe_file.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 kernel/audit_watch.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index d6709eb..0d302a8 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -19,6 +19,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 
+#include <linux/file.h>
 #include <linux/kernel.h>
 #include <linux/audit.h>
 #include <linux/kthread.h>
@@ -544,10 +545,11 @@ int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark)
unsigned long ino;
dev_t dev;
 
-   rcu_read_lock();
-   exe_file = rcu_dereference(tsk->mm->exe_file);
+   exe_file = get_task_exe_file(tsk);
+   if (!exe_file)
+   return 0;
ino = exe_file->f_inode->i_ino;
dev = exe_file->f_inode->i_sb->s_dev;
-   rcu_read_unlock();
+   fput(exe_file);
return audit_mark_compare(mark, ino, dev);
 }
-- 
1.8.3.1



[PATCH 0/2] introduce get_task_exe_file and use it to fix audit_exe_compare

2016-08-22 Thread Mateusz Guzik
audit_exe_compare directly accesses mm->exe_file without making sure the
object is stable. Fixing it using current primitives results in
partially duplicating what proc_exe_link is doing.

As such, introduce a trivial helper which can be used in both places and
fix the func.

Mateusz Guzik (2):
  mm: introduce get_task_exe_file
  audit: fix exe_file access in audit_exe_compare

 fs/proc/base.c       |  7 +--
 include/linux/mm.h   |  1 +
 kernel/audit_watch.c |  8 +---
 kernel/fork.c        | 24 
 4 files changed, 31 insertions(+), 9 deletions(-)

-- 
1.8.3.1



[PATCH 1/2] mm: introduce get_task_exe_file

2016-08-22 Thread Mateusz Guzik
For more convenient access if one has a pointer to the task.

As a minor nit take advantage of the fact that only task lock + rcu are
needed to safely grab ->exe_file. This saves mm refcount dance.

Use the helper in proc_exe_link.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 fs/proc/base.c |  7 +--
 include/linux/mm.h |  1 +
 kernel/fork.c  | 24 
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 2ed41cb..ebccdc1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1556,18 +1556,13 @@ static const struct file_operations proc_pid_set_comm_operations = {
 static int proc_exe_link(struct dentry *dentry, struct path *exe_path)
 {
struct task_struct *task;
-   struct mm_struct *mm;
struct file *exe_file;
 
task = get_proc_task(d_inode(dentry));
if (!task)
return -ENOENT;
-   mm = get_task_mm(task);
+   exe_file = get_task_exe_file(task);
put_task_struct(task);
-   if (!mm)
-   return -ENOENT;
-   exe_file = get_mm_exe_file(mm);
-   mmput(mm);
if (exe_file) {
*exe_path = exe_file->f_path;
path_get(&exe_file->f_path);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d85402..f4e639e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2014,6 +2014,7 @@ extern void mm_drop_all_locks(struct mm_struct *mm);
 
 extern void set_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file);
 extern struct file *get_mm_exe_file(struct mm_struct *mm);
+extern struct file *get_task_exe_file(struct task_struct *task);
 
 extern bool may_expand_vm(struct mm_struct *, vm_flags_t, unsigned long npages);
 extern void vm_stat_account(struct mm_struct *, vm_flags_t, long npages);
diff --git a/kernel/fork.c b/kernel/fork.c
index 6fe775c..84a636d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -800,6 +800,30 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 EXPORT_SYMBOL(get_mm_exe_file);
 
 /**
+ * get_task_exe_file - acquire a reference to the task's executable file
+ *
+ * Returns %NULL if task's mm (if any) has no associated executable file or
+ * this is a kernel thread with borrowed mm (see the comment above get_task_mm).
+ * User must release file via fput().
+ */
+struct file *get_task_exe_file(struct task_struct *task)
+{
+   struct file *exe_file = NULL;
+   struct mm_struct *mm;
+
+   task_lock(task);
+   mm = task->mm;
+   if (mm) {
+   if (!(task->flags & PF_KTHREAD))
+   exe_file = get_mm_exe_file(mm);
+   }
+   task_unlock(task);
+out:
+   return exe_file;
+}
+EXPORT_SYMBOL(get_task_exe_file);
+
+/**
  * get_task_mm - acquire a reference to the task's mm
  *
  * Returns %NULL if the task has no mm.  Checks PF_KTHREAD (meaning
-- 
1.8.3.1



Re: [PATCH] audit: fix audit_exe_compare using get_mm_exe_file

2016-08-22 Thread Mateusz Guzik
On Mon, Aug 22, 2016 at 11:41:34AM -0400, Richard Guy Briggs wrote:
> Fix original naive attempt to get/lock access to task->mm->exe_file by
> using get_mm_exe_file and checking for NULL.
> 
> See: https://lkml.org/lkml/2016/7/30/97
> 
> Signed-off-by: Richard Guy Briggs <r...@redhat.com>
> ---
>  kernel/audit_watch.c |   13 ++---
>  1 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> index d6709eb..0b29279 100644
> --- a/kernel/audit_watch.c
> +++ b/kernel/audit_watch.c
> @@ -19,6 +19,7 @@
>   * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
>   */
>  
> +#include <linux/file.h>
>  #include <linux/kernel.h>
>  #include <linux/audit.h>
>  #include <linux/kthread.h>
> @@ -540,14 +541,20 @@ int audit_dupe_exe(struct audit_krule *new, struct audit_krule *old)
>  
>  int audit_exe_compare(struct task_struct *tsk, struct audit_fsnotify_mark *mark)
>  {
> + struct mm_struct *mm;
>   struct file *exe_file;
>   unsigned long ino;
>   dev_t dev;
>  
> - rcu_read_lock();
> - exe_file = rcu_dereference(tsk->mm->exe_file);
> + mm = get_task_mm(tsk);
> + if (!mm)
> + return 0;
> + exe_file = get_mm_exe_file(mm);
> + mmput(mm);
> + if (!exe_file)
> + return 0;
>   ino = exe_file->f_inode->i_ino;
>   dev = exe_file->f_inode->i_sb->s_dev;
> - rcu_read_unlock();
> + fput(exe_file);
>   return audit_mark_compare(mark, ino, dev);
>  }

I think this works but I have a better idea. This basically duplicates a
part of proc_exe_link and does unnecessary work.

Instead, one can introduce get_task_exe_file. task_lock held guarantees
the stability of the mm struct itself, and then we can try to grab
the file without mm refcount dance.

I'll implement this later today.

-- 
Mateusz Guzik


Re: [PACTH v1] mm, proc: Implement /proc/<pid>/totmaps

2016-08-10 Thread Mateusz Guzik
On Wed, Aug 10, 2016 at 11:39:12AM -0400, Robert Foss wrote:
> 
> 
> On 2016-08-09 04:17 PM, Robert Foss wrote:
> > > > +static int totmaps_proc_show(struct seq_file *m, void *data)
> > > > +{
> > > > +struct proc_maps_private *priv = m->private;
> > > > +struct mm_struct *mm;
> > > > +struct vm_area_struct *vma;
> > > > +struct mem_size_stats *mss_sum = priv->mss;
> > > > +
> > > > +/* reference to priv->task already taken */
> > > > +/* but need to get the mm here because */
> > > > +/* task could be in the process of exiting */
> > > > +mm = get_task_mm(priv->task);
> > > > +if (!mm || IS_ERR(mm))
> > > > +return -EINVAL;
> > > > +
> > > 
> > > That's not how it's done in smaps.
> > 
> > Alright, I'll have to look into the difference between this approach and
> > the smaps one.
> 
> 
> I had a look at show_smaps(), and it's not entirely clear to me what the
> advantage of doing it show_smaps() way.
> 
> mm = get_task_mm(priv->task) is needed to iterate through all of the
> mappings. Is there a preferable way of doing that?

In the other part of the mail I stated smaps goes to proc_maps_open
which has:
priv->mm = proc_mem_open(inode, PTRACE_MODE_READ);

This gives you stable access to mm and all needed permission checks.

Then, in the read routine you can just:
if (!atomic_inc_not_zero(&mm->mm_users))
goto thats_it;

See smaps routines or e.g. environ_read.
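
The atomic_inc_not_zero() in that snippet only takes a reference if the count
has not already dropped to zero, i.e. the mm is not already being torn down. A
minimal userspace model of that primitive, using C11 atomics — illustrative
only, not the kernel's implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace model of the kernel's atomic_inc_not_zero(): take a reference
 * only if the count is still non-zero, i.e. the object is not already on
 * its way to destruction. Returns true if the increment happened.
 */
static bool inc_not_zero(atomic_int *count)
{
    int old = atomic_load(count);

    do {
        if (old == 0)
            return false;      /* too late, do not resurrect the object */
    } while (!atomic_compare_exchange_weak(count, &old, old + 1));

    return true;
}
```

The compare-and-swap loop is what makes the "non-zero" check and the increment
a single atomic step, so a racing final decrement can never be undone.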
-- 
Mateusz Guzik


Re: [PACTH v1] mm, proc: Implement /proc/<pid>/totmaps

2016-08-09 Thread Mateusz Guzik
On Tue, Aug 09, 2016 at 12:05:43PM -0400, robert.f...@collabora.com wrote:
> From: Sonny Rao <sonny...@chromium.org>
> 
> This is based on earlier work by Thiago Goncales. It implements a new
> per process proc file which summarizes the contents of the smaps file
> but doesn't display any addresses.  It gives more detailed information
> than statm like the PSS (proportional set size).  It differs from the
> original implementation in that it doesn't use the full blown set of
> seq operations, uses a different termination condition, and doesn't
> display "Locked" as that was broken in the original implementation.
> 
> This new proc file provides information faster than parsing the potentially
> huge smaps file.

I have no idea about usefulness of this.

The patch is definitely buggy with respect to how it implements actual
access to mm.

> +static int totmaps_proc_show(struct seq_file *m, void *data)
> +{
> + struct proc_maps_private *priv = m->private;
> + struct mm_struct *mm;
> + struct vm_area_struct *vma;
> + struct mem_size_stats *mss_sum = priv->mss;
> +
> + /* reference to priv->task already taken */
> + /* but need to get the mm here because */
> + /* task could be in the process of exiting */
> + mm = get_task_mm(priv->task);
> + if (!mm || IS_ERR(mm))
> + return -EINVAL;
> +

That's not how it's done in smaps.

> +static int totmaps_open(struct inode *inode, struct file *file)
> +{
> + struct proc_maps_private *priv;
> + int ret = -ENOMEM;
> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> + if (priv) {
> + priv->mss = kzalloc(sizeof(*priv->mss), GFP_KERNEL);
> + if (!priv->mss)
> + return -ENOMEM;

Cases below explicitly kfree(priv). I can't remember whether the close
routine gets called if this one fails. Either way, something is wrong
here.

> +
> + /* we need to grab references to the task_struct */
> + /* at open time, because there's a potential information */
> + /* leak where the totmaps file is opened and held open */
> + /* while the underlying pid to task mapping changes */
> + /* underneath it */
> + priv->task = get_pid_task(proc_pid(inode), PIDTYPE_PID);

This performs no permission checks that I would see. If you take a look
at smaps you will see the user ends up in proc_maps_open which performs
proc_mem_open(inode, PTRACE_MODE_READ) and gets a mm from there.


> + if (!priv->task) {
> + kfree(priv->mss);
> + kfree(priv);
> + return -ESRCH;
> + }
> +
> + ret = single_open(file, totmaps_proc_show, priv);
> + if (ret) {
> + put_task_struct(priv->task);
> + kfree(priv->mss);
> + kfree(priv);
> + }
> + }
> + return ret;
> +}
> +

-- 
Mateusz Guzik


Re: Is reading from /proc/self/smaps thread-safe?

2016-07-31 Thread Mateusz Guzik
On Tue, Jul 26, 2016 at 02:44:48PM +0200, Marcin Ślusarz wrote:
> Hey
> 
> I have a simple program that mmaps 8MB of anonymous memory, spawns 16
> threads, reads /proc/self/smaps in each thread and looks up whether
> mapped address can be found in smaps. From time to time it's not there.
> 
> Is this supposed to work reliably?
> 
> My guess is that libc functions allocate memory internally using mmap
> and modify process' address space while other thread is iterating over
> vmas.
> 
> I see that reading from smaps takes mmap_sem in read mode. I'm guessing
> vm modifications are done under mmap_sem in write mode.
> 
> Documentation/filesystem/proc.txt says reading from smaps is "slow but
> very precise" (although in context of RSS).
> 

Address space modification definitely happens as threads get their
stacks mmaped and unmapped.

If you run your program under strace you will see all threads perform
multiple read()s to get the content, as the kernel keeps returning short
reads (below 1 page size). In particular, seq_read imposes the limit
artificially.
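
The effect is easy to model in userspace: any content larger than the chunk
size takes several read() trips to consume, and the kernel may drop its locks
between them. A self-contained sketch (it uses an ordinary temp file rather
than smaps, and the sizes are made up):

```c
#include <unistd.h>

/*
 * Drain a descriptor in fixed-size chunks, the way userspace ends up
 * consuming a seq_file-backed proc file: one read() per chunk until EOF.
 * Returns total bytes read and stores the number of read() calls made.
 */
static long read_in_chunks(int fd, char *buf, size_t bufsz, int *ncalls)
{
    long total = 0;
    ssize_t n;

    *ncalls = 0;
    while ((n = read(fd, buf, bufsz)) > 0) {
        total += (long)n;
        (*ncalls)++;
    }
    return total;
}
```

Each of those read() calls is a separate trip into the kernel, which is exactly
the window in which another thread can mmap or munmap and change the vma list.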

Since there are multiple trips to the kernel, locks are dropped and
special measures are needed to maintain consistency of the result.

In m_start you can see there is a best-effort attempt: it is remembered
what vma was accessed by the previous run. But the vma can be unmapped
before we get here next time.

So no, reading the file when the content is bigger than 4k is not
guaranteed to give consistent results across reads.

I don't have a good idea how to fix it, and it is likely not worth
working on. This is not the only place which is unable to return
reliable information for sufficiently large dataset.

The obvious thing to try out is just storing all the necessary
information and generating the text form on read. Unfortunately even
that data is quite big -- over 100 bytes per vma. This can be shrunk
down significantly by encoding which information is present, as opposed
to keeping full records. But with thousands of entries per application
this translates into kilobytes of memory which would have to be
allocated just to hold it, which sounds like a non-starter to me.

-- 
Mateusz Guzik


Re: [PATCH] prctl: remove one-shot limitation for changing exe link

2016-07-30 Thread Mateusz Guzik
On Sat, Jul 30, 2016 at 12:31:40PM -0500, Eric W. Biederman wrote:
> So what I am requesting is very simple.  That the checks in
> prctl_set_mm_exe_file be tightened up to more closely approach what
> execve requires.  Thus preserving the value of the /proc/[pid]/exe for
> the applications that want to use the exe link.
> 
> Once the checks in prctl_set_mm_exe_file are tightened up please feel
> free to remove the one shot test.
> 

This is more fishy.

First of all exe_file is used by the audit subsystem. So someone has to
ask audit people what is the significance (if any) of the field.

All exe_file users but one use get_mm_exe_file and handle NULL
gracefully.

Even with the current limit of changing the field once, the user can
cause a transient failure of get_mm_exe_file: it can fail to take a
reference if it races with the count dropping to 0.

This transient failure can be used to get a NULL value stored in
->exe_file during fork (in dup_mmap):
RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));

The one place which is not using get_mm_exe_file to get to the pointer
is audit_exe_compare:
rcu_read_lock();
exe_file = rcu_dereference(tsk->mm->exe_file);
ino = exe_file->f_inode->i_ino;
dev = exe_file->f_inode->i_sb->s_dev;
rcu_read_unlock();

This is buggy on two counts:
1. exe_file can be NULL
2. rcu does not protect f_inode

The issue is made worse with allowing arbitrary number changes.

Modifying get_mm_exe_file to retry, so that it in effect never returns
NULL, is trivial. With an arbitrary number of changes allowed this may
require some cond_resched() or similar.

For comments I cc'ed Richard Guy Briggs, who is both an audit person and
the author of audit_exe_compare.
-- 
Mateusz Guzik


Re: [PATCH v2 3/3] Write dump into container's filesystem for pipe_type core_pattern

2016-06-07 Thread Mateusz Guzik
On Tue, Jun 07, 2016 at 07:29:37PM +0800, Zhao Lei wrote:
> In current system, when we set core_pattern to a pipe, both pipe program
> and program's output are in host's filesystem.
> But when we set core_pattern to a file, the container will write dump
> into container's filesystem.
> 
> For example, when we set following core_pattern:
>  # echo "|/my_dump_pipe %s %c %p %u %g %t e" >/proc/sys/kernel/core_pattern
> and trigger a segmentation fault in a container, my_dump_pipe is searched
> for in the host's filesystem, and it will write the coredump into the
> host's filesystem too.
> 
> In a privileged container, user can destroy host system by following
> command:
>  # # In a container
>  # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
>  # make_dump
> 
> Actually, no operation in a container should change the host's
> environment; the container should use core_pattern as its own private
> setting. In detail, in the core dump action:
> 1: Search pipe program in container's fs namespace.
> 2: Run pipe program in container's fs namespace to write coredump to it.
> 
> This patch fixes the above problem by running the pipe program with the
> container's fs_root.
> 

This does not look sufficient, but I can't easily verify.

For instance, is the spawned process able to attach itself with ptrace
to processes outside of the original container? Even if not, can it
mount procfs and use /proc/pid/root of processes outside of said
container?

The spawned process should be subject to all limitations imposed on the
container (which may mean it just must be in the container).

-- 
Mateusz Guzik


[PATCH] locking/barriers: evaluate the argument to lockless_dereference only once

2016-06-07 Thread Mateusz Guzik
The commit 25841ee0e9d2a ("Validate lockless_dereference() is used on a
pointer type") added a second use of the parameter to the macro.

This leads to trouble with consumers which use arguments with side
effects.

Instead, reuse the value which was read the first time.

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>

---
 include/linux/compiler.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e9c6417..06f27fd 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -551,8 +551,8 @@ static __always_inline void __write_once_size(volatile void 
*p, void *res, int s
  */
 #define lockless_dereference(p) \
 ({ \
-   __maybe_unused const void * const _p2 = p; \
typeof(p) _p1 = READ_ONCE(p); \
+   __maybe_unused const void * const _p2 = _p1; \
smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
(_p1); \
 })
-- 
1.8.3.1



Re: [tip:locking/core] locking/barriers: Validate lockless_dereference() is used on a pointer type

2016-06-06 Thread Mateusz Guzik
On Fri, Jun 03, 2016 at 03:58:09AM -0700, tip-bot for Peter Zijlstra wrote:
> Commit-ID:  25841ee0e9d2a1d952828138416701f20ea831eb
> Gitweb: http://git.kernel.org/tip/25841ee0e9d2a1d952828138416701f20ea831eb
> Author: Peter Zijlstra <pet...@infradead.org>
> AuthorDate: Sun, 22 May 2016 12:48:27 +0200
> Committer:  Ingo Molnar <mi...@kernel.org>
> CommitDate: Fri, 3 Jun 2016 12:06:11 +0200
> 
> locking/barriers: Validate lockless_dereference() is used on a pointer type
> 
> Add a cast to void * to validate the argument @p is indeed a pointer.
> 
> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> Cc: Alexey Dobriyan <adobri...@gmail.com>
> Cc: Andrew Morton <a...@linux-foundation.org>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Link: 
> http://lkml.kernel.org/r/20160522104827.gp3...@twins.programming.kicks-ass.net
> Signed-off-by: Ingo Molnar <mi...@kernel.org>
> ---
>  include/linux/compiler.h | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 793c082..e9c6417 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -545,9 +545,13 @@ static __always_inline void __write_once_size(volatile 
> void *p, void *res, int s
>   * Similar to rcu_dereference(), but for situations where the pointed-to
>   * object's lifetime is managed by something other than RCU.  That
>   * "something other" might be reference counting or simple immortality.
> + *
> + * The seemingly unused void * variable is to validate @p is indeed a pointer
> + * type. All pointer types silently cast to void *.
>   */
>  #define lockless_dereference(p) \
>  ({ \
> + __maybe_unused const void * const _p2 = p; \
>   typeof(p) _p1 = READ_ONCE(p); \
>   smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
>   (_p1); \


This causes issues, e.g.:
BUG: KASAN: user-memory-access on address 60ff95001931
Read of size 1 by task NetworkManager/897
CPU: 1 PID: 897 Comm: NetworkManager Not tainted 4.7.0-rc1dupa+ #355
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 88003a1f8040 17d08b23 880033aff4b0 8187b2f8
 88005b44f2c8 880033aff548 880033aff538 813506a6
 81167e5esystemd[1]: sshd.service: Forked /usr/sbin/sshd as 904

 ed000675fe9d 0286 880033aff588
Call Trace:
 [] dump_stack+0x85/0xcd
 [] kasan_report_error+0x456/0x560
 [] ? vprintk_default+0x3e/0x60
 [] ? printk+0xa8/0xd8
 [] ? power_down+0xa9/0xa9
 [] kasan_report+0x58/0x60
 [] ? leaf_walk_rcu+0x235/0x2d0
 [] __asan_load1+0x47/0x50
 [] leaf_walk_rcu+0x235/0x2d0
[snip]

The reason is that leaf_walk_rcu does get_child_rcu(pn, cindex++), which
ends up in lockless_dereference as pn[cindex++], which is now evaluated
twice. 

The simplest fix I see does the assignment later, that is:
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e9c6417..06f27fd 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -551,8 +551,8 @@ static __always_inline void __write_once_size(volatile void 
*p, void *res, int s
  */
 #define lockless_dereference(p) \
 ({ \
-   __maybe_unused const void * const _p2 = p; \
typeof(p) _p1 = READ_ONCE(p); \
+   __maybe_unused const void * const _p2 = _p1; \
smp_read_barrier_depends(); /* Dependency order vs. p above. */ \
(_p1); \
 })

-- 
Mateusz Guzik


[PATCH] coredump: fix dumping through pipes

2016-06-05 Thread Mateusz Guzik
The offset in the core file used to be tracked with ->written field of
the coredump_params structure. The field was retired in favour of
file->f_pos.

However, ->f_pos is not maintained for pipes which leads to breakage.

Restore explicit tracking of the offset in coredump_params. Introduce
->pos field for this purpose since ->written was already reused.

Fixes: a00839395103 ("get rid of coredump_params->written").

Reported-by: Zbigniew Jędrzejewski-Szmek <zbys...@in.waw.pl>
Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
Reviewed-by: Omar Sandoval <osan...@fb.com>
---
 arch/powerpc/platforms/cell/spufs/coredump.c | 2 +-
 fs/binfmt_elf.c  | 2 +-
 fs/binfmt_elf_fdpic.c| 2 +-
 fs/coredump.c| 4 +++-
 include/linux/binfmts.h  | 1 +
 5 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c 
b/arch/powerpc/platforms/cell/spufs/coredump.c
index 84fb984..85c85eb 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -172,7 +172,7 @@ static int spufs_arch_write_note(struct spu_context *ctx, 
int i,
if (rc < 0)
goto out;
 
-   skip = roundup(cprm->file->f_pos - total + sz, 4) - cprm->file->f_pos;
+   skip = roundup(cprm->pos - total + sz, 4) - cprm->pos;
if (!dump_skip(cprm, skip))
goto Eio;
 out:
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index e158b22..a7a28110 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2275,7 +2275,7 @@ static int elf_core_dump(struct coredump_params *cprm)
goto end_coredump;
 
/* Align to page */
-   if (!dump_skip(cprm, dataoff - cprm->file->f_pos))
+   if (!dump_skip(cprm, dataoff - cprm->pos))
goto end_coredump;
 
for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 71ade0e..2035893 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1787,7 +1787,7 @@ static int elf_fdpic_core_dump(struct coredump_params 
*cprm)
goto end_coredump;
}
 
-   if (!dump_skip(cprm, dataoff - cprm->file->f_pos))
+   if (!dump_skip(cprm, dataoff - cprm->pos))
goto end_coredump;
 
if (!elf_fdpic_dump_segments(cprm))
diff --git a/fs/coredump.c b/fs/coredump.c
index 38a7ab8..281b768 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -794,6 +794,7 @@ int dump_emit(struct coredump_params *cprm, const void 
*addr, int nr)
return 0;
file->f_pos = pos;
cprm->written += n;
+   cprm->pos += n;
nr -= n;
}
return 1;
@@ -808,6 +809,7 @@ int dump_skip(struct coredump_params *cprm, size_t nr)
if (dump_interrupted() ||
file->f_op->llseek(file, nr, SEEK_CUR) < 0)
return 0;
+   cprm->pos += nr;
return 1;
} else {
while (nr > PAGE_SIZE) {
@@ -822,7 +824,7 @@ EXPORT_SYMBOL(dump_skip);
 
 int dump_align(struct coredump_params *cprm, int align)
 {
-   unsigned mod = cprm->file->f_pos & (align - 1);
+   unsigned mod = cprm->pos & (align - 1);
if (align & (align - 1))
return 0;
return mod ? dump_skip(cprm, align - mod) : 1;
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 576e463..314b3ca 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -65,6 +65,7 @@ struct coredump_params {
unsigned long limit;
unsigned long mm_flags;
loff_t written;
+   loff_t pos;
 };
 
 /*
-- 
1.8.3.1



[PATCH] coredump: fix dumping through pipes

2016-06-05 Thread Mateusz Guzik
The offset in the core file used to be tracked with ->written field of
the coredump_params structure. Commit a0083939510 ("get rid of
coredump_params->written") replaced all its uses with file->f_pos.

However, ->f_pos is not maintained for pipes which leads to breakage.

Restore explicit tracking of the offset in coredump_params. Introduce
->pos field for this purpose since ->written was already reused.

Reported-by: Zbigniew Jędrzejewski-Szmek <zbys...@in.waw.pl>
Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 arch/powerpc/platforms/cell/spufs/coredump.c | 2 +-
 fs/binfmt_elf.c  | 2 +-
 fs/binfmt_elf_fdpic.c| 2 +-
 fs/coredump.c| 4 +++-
 include/linux/binfmts.h  | 1 +
 5 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c 
b/arch/powerpc/platforms/cell/spufs/coredump.c
index 84fb984..85c85eb 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -172,7 +172,7 @@ static int spufs_arch_write_note(struct spu_context *ctx, 
int i,
if (rc < 0)
goto out;
 
-   skip = roundup(cprm->file->f_pos - total + sz, 4) - cprm->file->f_pos;
+   skip = roundup(cprm->pos - total + sz, 4) - cprm->pos;
if (!dump_skip(cprm, skip))
goto Eio;
 out:
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index e158b22..a7a28110 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2275,7 +2275,7 @@ static int elf_core_dump(struct coredump_params *cprm)
goto end_coredump;
 
/* Align to page */
-   if (!dump_skip(cprm, dataoff - cprm->file->f_pos))
+   if (!dump_skip(cprm, dataoff - cprm->pos))
goto end_coredump;
 
for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
diff --git a/fs/binfmt_elf_fdpic.c b/fs/binfmt_elf_fdpic.c
index 71ade0e..2035893 100644
--- a/fs/binfmt_elf_fdpic.c
+++ b/fs/binfmt_elf_fdpic.c
@@ -1787,7 +1787,7 @@ static int elf_fdpic_core_dump(struct coredump_params 
*cprm)
goto end_coredump;
}
 
-   if (!dump_skip(cprm, dataoff - cprm->file->f_pos))
+   if (!dump_skip(cprm, dataoff - cprm->pos))
goto end_coredump;
 
if (!elf_fdpic_dump_segments(cprm))
diff --git a/fs/coredump.c b/fs/coredump.c
index 38a7ab8..281b768 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -794,6 +794,7 @@ int dump_emit(struct coredump_params *cprm, const void 
*addr, int nr)
return 0;
file->f_pos = pos;
cprm->written += n;
+   cprm->pos += n;
nr -= n;
}
return 1;
@@ -808,6 +809,7 @@ int dump_skip(struct coredump_params *cprm, size_t nr)
if (dump_interrupted() ||
file->f_op->llseek(file, nr, SEEK_CUR) < 0)
return 0;
+   cprm->pos += nr;
return 1;
} else {
while (nr > PAGE_SIZE) {
@@ -822,7 +824,7 @@ EXPORT_SYMBOL(dump_skip);
 
 int dump_align(struct coredump_params *cprm, int align)
 {
-   unsigned mod = cprm->file->f_pos & (align - 1);
+   unsigned mod = cprm->pos & (align - 1);
if (align & (align - 1))
return 0;
return mod ? dump_skip(cprm, align - mod) : 1;
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index 576e463..314b3ca 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -65,6 +65,7 @@ struct coredump_params {
unsigned long limit;
unsigned long mm_flags;
loff_t written;
+   loff_t pos;
 };
 
 /*
-- 
1.8.3.1



[tip:x86/asm] x86/arch_prctl/64: Restore accidentally removed put_cpu() in ARCH_SET_GS

2016-05-13 Thread tip-bot for Mateusz Guzik
Commit-ID:  4afd0565552c87f23834db9121dd9cf6955d0b43
Gitweb: http://git.kernel.org/tip/4afd0565552c87f23834db9121dd9cf6955d0b43
Author: Mateusz Guzik <mgu...@redhat.com>
AuthorDate: Tue, 10 May 2016 22:56:43 +0200
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 13 May 2016 13:50:15 +0200

x86/arch_prctl/64: Restore accidentally removed put_cpu() in ARCH_SET_GS

This fixes an oversight in:

731e33e39a5b95 ("Remove FSBASE/GSBASE < 4G optimization")

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
Cc: Alexander Shishkin <alexander.shish...@linux.intel.com>
Cc: Andy Lutomirski <l...@amacapital.net>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Brian Gerst <brge...@gmail.com>
Cc: Denys Vlasenko <dvlas...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Jiri Olsa <jo...@redhat.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Stephane Eranian <eran...@google.com>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: Vince Weaver <vincent.wea...@maine.edu>
Link: 
http://lkml.kernel.org/r/1462913803-29634-1-git-send-email-mgu...@redhat.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 arch/x86/kernel/process_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4285f6a..6b16c36 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -541,6 +541,7 @@ long do_arch_prctl(struct task_struct *task, int code, 
unsigned long addr)
load_gs_index(0);
ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, addr);
}
+   put_cpu();
break;
case ARCH_SET_FS:
/* Not strictly needed for fs, but do it for symmetry
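The diff is one line, but the failure mode generalizes: get_cpu() disables preemption, and any exit path that skips the matching put_cpu() leaves preemption disabled for good. A minimal userspace sketch of the imbalance (the *_stub names and counter are illustrative stand-ins, not kernel APIs):

```c
#include <assert.h>

/* Toy stand-in for the kernel's preemption counter: get_cpu() disables
 * preemption by bumping the count, put_cpu() re-enables it by dropping
 * it. A syscall must return with the counter balanced. */
static int preempt_count;

static void get_cpu_stub(void) { preempt_count++; }
static void put_cpu_stub(void) { preempt_count--; }

/* Shape of the bug: one exit path skips the put, leaking the count. */
static int set_gs_buggy(int ok)
{
    get_cpu_stub();
    if (!ok)
        return -1;      /* returns with preemption still "disabled" */
    put_cpu_stub();
    return 0;
}

/* Shape of the fix: every exit path pairs the get with a put. */
static int set_gs_fixed(int ok)
{
    int ret = 0;

    get_cpu_stub();
    if (!ok)
        ret = -1;
    put_cpu_stub();     /* the restored call */
    return ret;
}
```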


Re: [PATCH] x86/arch_prctl/64: restore accidentally removed put_cpu in ARCH_SET_GS

2016-05-11 Thread Mateusz Guzik
On Tue, May 10, 2016 at 01:58:24PM -0700, Andy Lutomirski wrote:
> On Tue, May 10, 2016 at 1:56 PM, Mateusz Guzik <mgu...@redhat.com> wrote:
> > This fixes 731e33e39a5b95ad770 "Remove FSBASE/GSBASE < 4G optimization"
> 
> Indeed.  How did that survive lockdep?
> 

lockdep_sys_exit only checks actual locks.

In the common path, after return from a particular syscall, interrupts get
blindly disabled (as opposed to checking first that they are enabled).
The preemption count is not checked in the fast path at all; it is only
checked elsewhere as a side effect of calls to e.g. schedule().

How about a hack along these lines (note I don't claim this is
committable as it is, but it should work):

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index ec138e5..5887bc7 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -303,6 +303,24 @@ static void syscall_slow_exit_work(struct pt_regs *regs, 
u32 cached_flags)
tracehook_report_syscall_exit(regs, step);
 }
 
+#ifdef CONFIG_PROVE_LOCKING
+/*
+ * Called after syscall handlers return.
+ */
+__visible void syscall_assert_exit(struct pt_regs *regs)
+{
+   if (in_atomic() || irqs_disabled()) {
+   printk(KERN_ERR "invalid state on exit from syscall %ld: "
+   "in_atomic(): %d, irqs_disabled(): %d, pid: %d, "
+   "name: %s\n", regs->orig_ax, in_atomic(),
+   irqs_disabled(), current->pid, current->comm);
+   }
+
+   if (irqs_disabled())
+   local_irq_enable();
+}
+#endif
+
 /*
  * Called with IRQs on and fully valid regs.  Returns with IRQs off in a
  * state such that we can immediately switch to user mode.
@@ -314,9 +332,7 @@ __visible inline void syscall_return_slowpath(struct 
pt_regs *regs)
 
CT_WARN_ON(ct_state() != CONTEXT_KERNEL);
 
-   if (IS_ENABLED(CONFIG_PROVE_LOCKING) &&
-   WARN(irqs_disabled(), "syscall %ld left IRQs disabled", regs->orig_ax))
-   local_irq_enable();
+   syscall_assert_exit(regs);
 
/*
 * First do one-time work.  If these work items are enabled, we
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 9ee0da1..6c5cc23 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -210,6 +210,12 @@ entry_SYSCALL_64_fastpath:
movq    %rax, RAX(%rsp)
 1:
 
+#ifdef CONFIG_PROVE_LOCKING
+   /*
+* We want to validate bunch of stuff, which will clobber registers.
+*/
+   jmp 2f
+#endif
/*
 * If we get here, then we know that pt_regs is clean for SYSRET64.
 * If we see that no exit work is required (which we are required
@@ -236,6 +242,7 @@ entry_SYSCALL_64_fastpath:
 */
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
+2:
SAVE_EXTRA_REGS
movq    %rsp, %rdi
call    syscall_return_slowpath /* returns with IRQs disabled */

-- 
Mateusz Guzik


[PATCH] x86/arch_prctl/64: restore accidentally removed put_cpu in ARCH_SET_GS

2016-05-10 Thread Mateusz Guzik
This fixes 731e33e39a5b95ad770 "Remove FSBASE/GSBASE < 4G optimization"

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 arch/x86/kernel/process_64.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 4285f6a..6b16c36 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -541,6 +541,7 @@ long do_arch_prctl(struct task_struct *task, int code, 
unsigned long addr)
load_gs_index(0);
ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, addr);
}
+   put_cpu();
break;
case ARCH_SET_FS:
/* Not strictly needed for fs, but do it for symmetry
-- 
1.8.3.1



[PATCH] rlimit: locking tidy ups

2016-05-04 Thread Mateusz Guzik
rlimits are stored in task->signal and are guaranteed to remain valid as
long as the task struct is valid. All modifications are protected by
locking task->group_leader. Additionally changes to RLIMIT_CPU need
task->sighand.

do_prlimit takes tasklist_lock, which as a side effect guarantees a stable
->sighand. However, there is no need to take the lock for any limit other
than RLIMIT_CPU, and even then we can get away with locking sighand itself.

proc_pid_limits takes ->sighand lock prior to accessing rlimits, but it
serves no purpose as it does not prevent modifications.

Both functions effectively always perform a ->sighand != NULL check, but it
is only of concern when RLIMIT_CPU is being set. ->sighand is only cleared
when the process is reaped, so a dedicated check merely makes it less likely
to access the limits of a dead process.

As such, eliminate the unneeded check and:
- do_prlimit: stop taking tasklist_lock at all and only lock sighand when
necessary
- proc_pid_limits: lock group leader in order to obtain a stable copy

Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
---
 fs/proc/base.c |  6 ++
 kernel/sys.c   | 22 ++
 kernel/time/posix-cpu-timers.c |  3 +--
 security/selinux/hooks.c   |  4 +++-
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 704ae63..3d4963e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -618,14 +618,12 @@ static int proc_pid_limits(struct seq_file *m, struct 
pid_namespace *ns,
   struct pid *pid, struct task_struct *task)
 {
unsigned int i;
-   unsigned long flags;
 
struct rlimit rlim[RLIM_NLIMITS];
 
-   if (!lock_task_sighand(task, &flags))
-   return 0;
+   task_lock(task->group_leader);
memcpy(rlim, task->signal->rlim, sizeof(struct rlimit) * RLIM_NLIMITS);
-   unlock_task_sighand(task, &flags);
+   task_unlock(task->group_leader);
 
/*
 * print the file header
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..1c8a67d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1361,7 +1361,9 @@ int do_prlimit(struct task_struct *tsk, unsigned int 
resource,
struct rlimit *new_rlim, struct rlimit *old_rlim)
 {
struct rlimit *rlim;
+   unsigned long flags;
int retval = 0;
+   int sighand_locked = 0;
 
if (resource >= RLIM_NLIMITS)
return -EINVAL;
@@ -1373,15 +1375,17 @@ int do_prlimit(struct task_struct *tsk, unsigned int 
resource,
return -EPERM;
}
 
-   /* protect tsk->signal and tsk->sighand from disappearing */
-   read_lock(&tasklist_lock);
-   if (!tsk->sighand) {
-   retval = -ESRCH;
-   goto out;
+   task_lock(tsk->group_leader);
+   if (new_rlim && resource == RLIMIT_CPU &&
+   new_rlim->rlim_cur != RLIM_INFINITY) {
+   if (!lock_task_sighand(tsk, &flags)) {
+   retval = -ESRCH;
+   goto out;
+   }
+   sighand_locked = 1;
}
 
rlim = tsk->signal->rlim + resource;
-   task_lock(tsk->group_leader);
if (new_rlim) {
/* Keep the capable check against init_user_ns until
   cgroups can contain all limits */
@@ -1407,7 +1411,6 @@ int do_prlimit(struct task_struct *tsk, unsigned int 
resource,
if (new_rlim)
*rlim = *new_rlim;
}
-   task_unlock(tsk->group_leader);
 
/*
 * RLIMIT_CPU handling.   Note that the kernel fails to return an error
@@ -1418,8 +1421,11 @@ int do_prlimit(struct task_struct *tsk, unsigned int 
resource,
 if (!retval && new_rlim && resource == RLIMIT_CPU &&
 new_rlim->rlim_cur != RLIM_INFINITY)
update_rlimit_cpu(tsk, new_rlim->rlim_cur);
+
+   if (sighand_locked)
+   unlock_task_sighand(tsk, &flags);
 out:
-   read_unlock(&tasklist_lock);
+   task_unlock(tsk->group_leader);
return retval;
 }
 
diff --git a/kernel/time/posix-cpu-timers.c b/kernel/time/posix-cpu-timers.c
index 1cafba8..fc38417 100644
--- a/kernel/time/posix-cpu-timers.c
+++ b/kernel/time/posix-cpu-timers.c
@@ -23,9 +23,8 @@ void update_rlimit_cpu(struct task_struct *task, unsigned 
long rlim_new)
 {
cputime_t cputime = secs_to_cputime(rlim_new);
 
-   spin_lock_irq(&task->sighand->siglock);
+   lockdep_assert_held(&task->sighand->siglock);
set_process_cpu_timer(task, CPUCLOCK_PROF, &cputime, NULL);
-   spin_unlock_irq(&task->sighand->siglock);
 }
 
 static int check_clock(const clockid_t which_clock)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index a86d537..d74b91a 100644
--- a/security/selinux/hooks.c
+++ b/security/selin

Re: linux-next: Tree for May 2 [WARNING: at fs/dcache.c]

2016-05-02 Thread Mateusz Guzik
On Mon, May 02, 2016 at 07:15:24PM +0900, Sergey Senozhatsky wrote:
> On (05/02/16 18:40), Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20160429
> 
> Hello,
> 
> [0.368791] [ cut here ]
> [0.368850] WARNING: CPU: 0 PID: 1 at fs/dcache.c:1688 d_set_d_op+0x5e/0xcc
> [0.368911] Modules linked in:
> [0.369002] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.6.0-rc6-next-20160502-dbg-5-gf58c9da-dirty #404
> [0.369161]   880133067908 811b8202 
> 
> [0.369371]   880133067948 81039365 
> 0698e5dffe26
> [0.369580]  880132c090c0 81613680 880132c040a0 
> 880132c08000
> [0.369791] Call Trace:
> [0.369846]  [] dump_stack+0x4d/0x63
> [0.369904]  [] __warn+0xb8/0xd3
> [0.369962]  [] warn_slowpath_null+0x18/0x1a
> [0.370021]  [] d_set_d_op+0x5e/0xcc
> [0.370079]  [] simple_lookup+0x2e/0x45

The issue is that 2 macros have the same value:

#define DCACHE_OP_REAL  0x0800

#define DCACHE_PAR_LOOKUP   0x0800 /* being looked up (with parent locked shared) */

Verified with switching one to 0x1000 and the warning went away.
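For reference, the aliasing is mechanical: two flags sharing a bit value make
independent flag tests observe each other, so d_set_d_op() warning about a bit
it never set is expected. A small sketch using the values quoted above (the
macro names with suffixes are mine, illustrative rather than the kernel headers):

```c
#include <assert.h>

/* The colliding pair from this report, plus the value DCACHE_OP_REAL
 * was switched to for verification. Values mirror the quoted snippet. */
#define DCACHE_OP_REAL_OLD  0x0800u  /* as merged: aliases PAR_LOOKUP */
#define DCACHE_PAR_LOOKUP   0x0800u  /* being looked up, parent locked shared */
#define DCACHE_OP_REAL_NEW  0x1000u  /* distinct bit: no aliasing */

/* Set only the "parallel lookup" bit, as the lookup path would. */
static unsigned int mark_par_lookup(unsigned int d_flags)
{
    return d_flags | DCACHE_PAR_LOOKUP;
}
```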

-- 
Mateusz Guzik


Re: [PATCH] proc: prevent accessing /proc/<pid>/environ until it's ready

2016-04-28 Thread Mateusz Guzik
On Thu, Apr 28, 2016 at 09:04:18PM +0200, Mathias Krause wrote:
> If /proc/<pid>/environ gets read before the envp[] array is fully set
> up in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying
> to read more bytes than are actually written, as env_start will already
> be set but env_end will still be zero, making the range calculation
> underflow, allowing to read beyond the end of what has been written.
> 
> Fix this as it is done for /proc/<pid>/cmdline by testing env_end for
> zero. It is, apparently, intentionally set last in create_*_tables().
> 
> This bug was found by the PaX size_overflow plugin that detected the
> arithmetic underflow of 'this_len = env_end - (env_start + src)' when
> env_end is still zero.
> 
> Reported-at: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=116461
> ---
>  fs/proc/base.c |3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 4f764c2ac1a5..45f2162e55b2 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -955,7 +955,8 @@ static ssize_t environ_read(struct file *file, char 
> __user *buf,
>   struct mm_struct *mm = file->private_data;
>   unsigned long env_start, env_end;
>  
> - if (!mm)
> + /* Ensure the process spawned far enough to have an environment. */
> + if (!mm || !mm->env_end)
>   return 0;
>  
>   page = (char *)__get_free_page(GFP_TEMPORARY);

In this case get_cmdline in mm/util.c should also be patched for
completeness. It tests for arg_end, but later accesses env_end.
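The arithmetic in question is easy to demonstrate: with env_end still zero, the
unsigned subtraction wraps to an enormous length. A standalone sketch of both
the unguarded expression and the proposed check (function names and the sample
address are mine, not the kernel code):

```c
#include <assert.h>

/* Unguarded length calculation as in environ_read()/get_cmdline():
 * when env_end is still 0 (exec has not finished setting up the
 * tables), the unsigned subtraction wraps around. */
static unsigned long this_len_unguarded(unsigned long env_start,
                                        unsigned long env_end,
                                        unsigned long src)
{
    return env_end - (env_start + src);
}

/* With the proposed check, a not-yet-initialized mm reads as empty. */
static unsigned long this_len_guarded(unsigned long env_start,
                                      unsigned long env_end,
                                      unsigned long src)
{
    if (!env_end)
        return 0;
    return env_end - (env_start + src);
}
```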

-- 
Mateusz Guzik


Re: [PATCH] fs: reintroduce freezing nesting

2016-04-19 Thread Mateusz Guzik
On Tue, Apr 19, 2016 at 10:48:41PM +0200, Florian Margaine wrote:
> The behavior was removed in 18e9e5104fcd9a973ffe3eed3816c87f2a1b6cd2
> noting that this was a better idea than using a counter. However, this
> behavior is actually wanted if multiple applications want to freeze
> concurrently while remaining non-racy.
> 
> This patch reintroduces this feature by using a counter.
> 

This patch is wrong.

It uses non-atomic ops to modify the counter and no locks are
held to protect it.

I would argue the code should track that freezing has started and
additional freezers must only return when the state is
SB_FREEZE_COMPLETE.

> ---
>  fs/super.c | 15 +++
>  include/linux/fs.h |  1 +
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index 74914b1..9fa8ca1 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -231,6 +231,7 @@ static struct super_block *alloc_super(struct
> file_system_type *type, int flags)
>    */
>   down_write_nested(&s->s_umount, SINGLE_DEPTH_NESTING);
>   s->s_count = 1;
> + s->s_freezers = 0;
>   atomic_set(&s->s_active, 1);
>   mutex_init(&s->s_vfs_rename_mutex);
>   lockdep_set_class(&s->s_vfs_rename_mutex, &type->s_vfs_rename_key);
> @@ -1275,12 +1276,12 @@ int freeze_super(struct super_block *sb)
>  {
>   int ret;
>  
> + sb->s_freezers++;
> + if (sb->s_writers.frozen != SB_UNFROZEN)
> + return 0;
> +
>   atomic_inc(&sb->s_active);
>   down_write(&sb->s_umount);
> - if (sb->s_writers.frozen != SB_UNFROZEN) {
> - deactivate_locked_super(sb);
> - return -EBUSY;
> - }
>  
>   if (!(sb->s_flags & MS_BORN)) {
>   up_write(&sb->s_umount);
> @@ -1338,14 +1339,20 @@ EXPORT_SYMBOL(freeze_super);
>   * @sb: the super to thaw
>   *
>   * Unlocks the filesystem and marks it writeable again after
> freeze_super().
> + * Since nesting freezes is allowed, only the last freeze actually
> unlocks.
>   */
>  int thaw_super(struct super_block *sb)
>  {
>   int error;
>  
> + sb->s_freezers--;
> + if (sb->s_freezers > 0)
> + return 0;
> +
>   down_write(&sb->s_umount);
>   if (sb->s_writers.frozen == SB_UNFROZEN) {
>   up_write(&sb->s_umount);
> + sb->s_freezers++;
>   return -EINVAL;
>   }
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e514f76..c045e2a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1333,6 +1333,7 @@ struct super_block {
>   struct quota_info   s_dquot;/* Diskquota specific
> options */
>  
>   struct sb_writers   s_writers;
> + int s_freezers;
>  
>   char s_id[32];  /* Informational
> name */
>   u8 s_uuid[16];  /* UUID */
> -- 
> 2.8.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Mateusz Guzik


Re: [PATCH] fs: add the FIGETFROZEN ioctl call

2016-04-14 Thread Mateusz Guzik
On Thu, Apr 14, 2016 at 09:57:07AM +0200, Florian Margaine wrote:
> This lets userland get the filesystem freezing status, aka whether the
> filesystem is frozen or not. This is so that an application can know if
> it should freeze the filesystem or if it isn't necessary when taking a
> snapshot.

The feature may be useful in general, I don't know.

However, I'm confused why programs would depend on it. If you froze
a particular subsystem, you don't have to check. If you did not, what
prevents whoever originally froze it from unfreezing as you access it?

As such, maybe the feature you are looking for would count how many
times the fs is frozen.

-- 
Mateusz Guzik


Re: [PATCH] Make core_pattern support namespace

2016-02-17 Thread Mateusz Guzik
On Wed, Feb 17, 2016 at 02:15:24PM -0600, Eric W. Biederman wrote:
> Mateusz Guzik <mgu...@redhat.com> writes:
> > On Tue, Feb 16, 2016 at 07:33:39PM +0800, Zhao Lei wrote:
> >> For container based on namespace design, it is good to allow
> >> each container keeping their own coredump setting.
> >
> > Sorry if this is a false alarm, I don't have easy means to test it, but
> > is not this an immediate privilege escalation?
> 
> It is.  This is why we do not currently have a per namespace setting.
> 

Thanks for the confirmation.

> Solving the user mode helper problem is technically a fair amount of
> work, if not theoretically challenging.
> 

Well, I would say custom core_patterns without pipe support are still
better than none.

Say one would ensure a stable core_pattern (i.e. that it cannot be
modified as it is being parsed) and a restricted set of allowed
characters in the pattern (which would not include the pipe), validated
when one attempts to set the pattern.

Does this sound acceptable? If so, and there are no counter ideas from
Lei, I can get around to that.

-- 
Mateusz Guzik


Re: [PATCH] Make core_pattern support namespace

2016-02-16 Thread Mateusz Guzik
On Tue, Feb 16, 2016 at 07:33:39PM +0800, Zhao Lei wrote:
> Currently, each container shared one copy of coredump setting
> with the host system, if host system changed the setting, each
> running containers will be affected.
> 
> Moreover, it is not easy to let each container keeping their own
> coredump setting.
> 
> We can use some workaround as pipe program to make the second
> requirement possible, but it is not simple, and both host and
> container are limited to set to fixed pipe program.
> In one word, for host running contailer, we can't change core_pattern
> anymore.
> To make the problem more hard, if a host running more than one
> container product, each product will try to snatch the global
> coredump setting to fit their own requirement.
> 
> For container based on namespace design, it is good to allow
> each container keeping their own coredump setting.
> 
> It will bring us following benefit:
> 1: Each container can change their own coredump setting
>based on operation on /proc/sys/kernel/core_pattern
> 2: Coredump setting changed in host will not affect
>running containers.
> 3: Support both case of "putting coredump in guest" and
>"putting curedump in host".
> 
> Each namespace-based software(lxc, docker, ..) can use this function
> to custom their dump setting.
> 
> And this function makes each continer working as separate system,
> it fit for design goal of namespace.
> 

Sorry if this is a false alarm, I don't have easy means to test it, but
is not this an immediate privilege escalation?

In particular, if the new pattern was to include a pipe, would not it
cause the kernel to run whatever the namespace put in there, but on the
"host"?

The feature definitely looks useful, but this is another privileged
operation which is now made available to unprivileged users. This poses
some questions:
- should the namespace be allowed to set anything it wants, including
  pipes? I would argue the latter should be disallowed for simplicity
- now that the pattern can change at any time by namespace root, it
  becomes fishy to parse it for filename generation without any
  protection against concurrent modification
- how do you ensure that namespaces which did not explicitly set their
  pattern still get updates from the host?

That said, I suggest adding a pointer to an allocated buffer holding the
pattern, or NULL otherwise. If the pointer is NULL, the "host" pattern is
used.

For dumping purposes the code could just kmalloc, rcu_read_lock, memcpy
and be done with it. Or, if it is acceptable given the size, just have a
buffer on the stack.

Finally, the code setting it could simply xchg the pointer to avoid
locks if there is no good lock to be taken here, and then kfree_rcu the
old buffer.

Just my $0,03.

-- 
Mateusz Guzik


[PATCH] xattr handlers: plug a lock leak in simple_xattr_list

2016-02-03 Thread Mateusz Guzik
The code could leak xattrs->lock on error.

Problem introduced with 786534b92f3ce68f4 "tmpfs: listxattr should
include POSIX ACL xattrs".

Signed-off-by: Mateusz Guzik 
---
 fs/xattr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 07d0e47..4861322 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -940,7 +940,7 @@ ssize_t simple_xattr_list(struct inode *inode, struct 
simple_xattrs *xattrs,
bool trusted = capable(CAP_SYS_ADMIN);
struct simple_xattr *xattr;
ssize_t remaining_size = size;
-   int err;
+   int err = 0;
 
 #ifdef CONFIG_FS_POSIX_ACL
if (inode->i_acl) {
@@ -965,11 +965,11 @@ ssize_t simple_xattr_list(struct inode *inode, struct 
simple_xattrs *xattrs,
 
err = xattr_list_one(&buffer, &remaining_size, xattr->name);
if (err)
-   return err;
+   break;
}
spin_unlock(&xattrs->lock);
 
-   return size - remaining_size;
+   return err ? err : size - remaining_size;
 }
 
 /*
-- 
1.8.3.1



Re: [PATCH 1/3] mm: migrate: do not touch page->mem_cgroup of live pages

2016-02-03 Thread Mateusz Guzik
l_recover+0x119/0x130
[  105.724532]   [] jbd2_journal_load+0xe0/0x390
[  105.736533]   [] ext4_load_journal+0x5ef/0x6b8
[  105.748581]   [] ext4_fill_super+0x1ad3/0x2a10
[  105.760597]   [] mount_bdev+0x18c/0x1c0
[  105.772574]   [] ext4_mount+0x15/0x20
[  105.784489]   [] mount_fs+0x39/0x170
[  105.796406]   [] vfs_kern_mount+0x6b/0x150
[  105.808386]   [] do_mount+0x24d/0xed0
[  105.820355]   [] SyS_mount+0x83/0xd0
[  105.832317]   [] entry_SYSCALL_64_fastpath+0x1f/0xbd
[  105.844390] irq event stamp: 341784
[  105.856114] hardirqs last  enabled at (341783): [] 
flat_send_IPI_mask+0x8a/0xc0
[  105.879644] hardirqs last disabled at (341784): [] 
_raw_spin_lock_irq+0x1f/0x80
[  105.903369] softirqs last  enabled at (341744): [] 
__do_softirq+0x343/0x480
[  105.926856] softirqs last disabled at (341739): [] 
irq_exit+0x106/0x120
[  105.950499] 
[  105.950499] other info that might help us debug this:
[  105.973803]  Possible unsafe locking scenario:
[  105.973803] 
[  105.997202]CPU0
[  106.008754]
[  106.032196]   lock(&(&mapping->tree_lock)->rlock);
  <Interrupt>
[  106.043832]     lock(&(&mapping->tree_lock)->rlock);
[  106.055891] 
[  106.055891]  *** DEADLOCK ***
[  106.055891] 
[  106.090289] 2 locks held by trinity-c3/1600:
[  106.102049]  #0:  (&mm->mmap_sem){++}, at: [] 
__do_page_fault+0x15f/0x470
[  106.125631]  #1:  (&(&mapping->tree_lock)->rlock){?.-.-.}, at: 
[] migrate_page_move_mapping+0x71/0x4f0
[  106.149777] 
[  106.149777] stack backtrace:
[  106.172646] CPU: 3 PID: 1600 Comm: trinity-c3 Not tainted 
4.5.0-rc2-next-20160203dupa+ #121
[  106.196137] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  106.208200]  0086 cf42937f 88004bdef630 
814acb6d
[  106.231859]  88004e0e 82f17fb0 88004bdef680 
811a47e9
[  106.255250]   8801 8801 

[  106.278647] Call Trace:
[  106.290201]  [] dump_stack+0x85/0xc8
[  106.302013]  [] print_usage_bug+0x1eb/0x1fc
[  106.314144]  [] ? 
print_shortest_lock_dependencies+0x1d0/0x1d0
[  106.337708]  [] mark_lock+0x20d/0x290
[  106.349493]  [] mark_held_locks+0x71/0x90
[  106.361462]  [] ? _raw_spin_unlock_irq+0x2c/0x40
[  106.373469]  [] trace_hardirqs_on_caller+0xa9/0x1c0
[  106.385466]  [] trace_hardirqs_on+0xd/0x10
[  106.397421]  [] _raw_spin_unlock_irq+0x2c/0x40
[  106.409417]  [] commit_charge+0xbe/0x390
[  106.421310]  [] mem_cgroup_migrate+0x135/0x360
[  106.433352]  [] migrate_page_move_mapping+0x132/0x4f0
[  106.445369]  [] migrate_page+0x2b/0x50
[  106.457301]  [] buffer_migrate_page+0x10a/0x150
[  106.469260]  [] move_to_new_page+0x93/0x270
[  106.481209]  [] ? try_to_unmap+0xa7/0x170
[  106.493094]  [] ? page_remove_rmap+0x2a0/0x2a0
[  106.505052]  [] ? __hugepage_set_anon_rmap+0x80/0x80
[  106.517130]  [] migrate_pages+0x846/0xac0
[  106.528993]  [] ? __reset_isolation_suitable+0x120/0x120
[  106.541061]  [] ? isolate_freepages_block+0x4e0/0x4e0
[  106.553068]  [] compact_zone+0x33d/0xa80
[  106.565026]  [] compact_zone_order+0x7b/0xc0
[  106.576944]  [] try_to_compact_pages+0x13a/0x2e0
[  106.588948]  [] __alloc_pages_direct_compact+0x3b/0xf9
[  106.600995]  [] 
__alloc_pages_slowpath.constprop.87+0x2e5/0x886
[  106.624383]  [] __alloc_pages_nodemask+0x456/0x460
[  106.636380]  [] alloc_pages_vma+0x28b/0x2d0
[  106.648440]  [] do_huge_pmd_anonymous_page+0x13e/0x540
[  106.660497]  [] handle_mm_fault+0x7e4/0x980
[  106.672585]  [] ? handle_mm_fault+0x59/0x980
[  106.684595]  [] __do_page_fault+0x1cd/0x470
[  106.696524]  [] trace_do_page_fault+0x6e/0x250
[  106.708477]  [] do_async_page_fault+0x1a/0xb0
[  106.720407]  [] async_page_fault+0x28/0x30

-- 
Mateusz Guzik

