[PATCH] fs/ext4: prevent the CPU from being 100% occupied in ext4_mb_discard_group_preallocations

2021-04-18 Thread Wen Yang
The kworker has occupied 100% of the CPU for several days:
  PID USER PR NI VIRT RES SHR S  %CPU %MEM   TIME+ COMMAND
68086 root 20  0    0   0   0 R 100.0  0.0 9718:18 kworker/u64:11

And the stack obtained through sysrq is as follows:
[20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
[20613144.850427] RIP: 0010:[]  [] ext4_mb_discard_group_preallocations+0x1b3/0x480 [ext4]
...
[20613144.850435] Stack:
[20613144.850435]  881942d6a6e8 8813bb5f72d0 0001a02427cf 0140
[20613144.850436]  880f80618000  c9001342f770 c9001342f770
[20613144.850437]  ea0056360dc0 88158d837000 ea0045155f80 88114557e000
[20613144.850438] Call Trace:
[20613144.850439]  [] ext4_mb_new_blocks+0x429/0x550 [ext4]
[20613144.850439]  [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
[20613144.850440]  [] ? numa_zonelist_order_handler+0xa1/0x1c0
[20613144.850441]  [] ext4_map_blocks+0x172/0x620 [ext4]
[20613144.850441]  [] ? ext4_writepages+0x4cd/0xf00 [ext4]
[20613144.850442]  [] ext4_writepages+0x7e5/0xf00 [ext4]
[20613144.850442]  [] ? wb_position_ratio+0x1f0/0x1f0
[20613144.850443]  [] do_writepages+0x1e/0x30
[20613144.850444]  [] __writeback_single_inode+0x45/0x320
[20613144.850444]  [] writeback_sb_inodes+0x272/0x600
[20613144.850445]  [] __writeback_inodes_wb+0x92/0xc0
[20613144.850445]  [] wb_writeback+0x268/0x300
[20613144.850446]  [] wb_workfn+0xb4/0x380
[20613144.850447]  [] process_one_work+0x189/0x420
[20613144.850447]  [] worker_thread+0x4e/0x4b0
[20613144.850448]  [] ? process_one_work+0x420/0x420
[20613144.850448]  [] kthread+0xe6/0x100
[20613144.850449]  [] ? kthread_park+0x60/0x60
[20613144.850450]  [] ret_from_fork+0x39/0x50

The thread that references this pa has been waiting for IO to return:
PID: 15140  TASK: 88004d6dc300  CPU: 16  COMMAND: "kworker/u64:1"
[c900273e7518] __schedule at 8173ca3b
[c900273e75a0] schedule at 8173cfb6
[c900273e75b8] io_schedule at 810bb75a
[c900273e75e0] bit_wait_io at 8173d8d1
[c900273e75f8] __wait_on_bit_lock at 8173d4e9
[c900273e7638] out_of_line_wait_on_bit_lock at 8173d742
[c900273e76b0] __lock_buffer at 81288c32
[c900273e76c8] do_get_write_access at a00dd177 [jbd2]
[c900273e7728] jbd2_journal_get_write_access at a00dd3a3 [jbd2]
[c900273e7750] __ext4_journal_get_write_access at a023b37b [ext4]
[c900273e7788] ext4_mb_mark_diskspace_used at a0242a0b [ext4]
[c900273e77f0] ext4_mb_new_blocks at a0244100 [ext4]
[c900273e7860] ext4_ext_map_blocks at a02389ae [ext4]
[c900273e7950] ext4_map_blocks at a0204b52 [ext4]
[c900273e79d0] ext4_writepages at a0208675 [ext4]
[c900273e7b30] do_writepages at 811c487e
[c900273e7b40] __writeback_single_inode at 81280265
[c900273e7b90] writeback_sb_inodes at 81280ab2
[c900273e7c90] __writeback_inodes_wb at 81280ed2
[c900273e7cd8] wb_writeback at 81281238
[c900273e7d80] wb_workfn at 812819f4
[c900273e7e18] process_one_work at 810a5dc9
[c900273e7e60] worker_thread at 810a60ae
[c900273e7ec0] kthread at 810ac696
[c900273e7f50] ret_from_fork at 81741dd9

Our bare-metal servers use multiple hard disks: the Linux kernel runs
on the system disk, and business programs run on several hard disks
virtualized by the BM hypervisor. The IO has not returned here because
the process that handles IO in the BM hypervisor has failed.

The CPU resources of a cloud server are precious, and the server
cannot be restarted after running for a long time, so this patch makes
a small optimization to keep the CPU from being 100% occupied.

Signed-off-by: Wen Yang 
Cc: "Theodore Ts'o" 
Cc: Andreas Dilger 
Cc: Ritesh Harjani 
Cc: Baoyou Xie 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/ext4/mballoc.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a02fadf..c73f212 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,8 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
 
+static inline void ext4_mb_show_pa(struct super_block *sb);
+
 /*
  * The algorithm using this percpu seq counter goes below:
  * 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -4217,9 +4219,9 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa, *tmp;
struct list_head list;
struct ext4_buddy e4b;
+   int free_total = 0;
+   int busy, free;
int e

Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p

2021-04-15 Thread Wen Yang




On 2021/4/11 at 12:25 PM, Theodore Ts'o wrote:

On Sun, Apr 11, 2021 at 03:45:01AM +0800, Wen Yang wrote:

At this time, some logs are lost. It is suspected that the hard disk itself
is faulty.


If you have a kernel crash dump, that means you can extract out the
dmesg buffer, correct?  Is there any I/O error messages in the kernel
log?

What is the basis of the suspicion that the hard drive is faulty?
Kernel dmesg output?  Error reporting from smartctl?



Hello, we are using a bare-metal cloud server
(https://www.semanticscholar.org/paper/High-density-Multi-tenant-Bare-metal-Cloud-Zhang-Zheng/ab1b5f0743816c8cb7188019d844ff3f7d565d9f/figure/3),
so there are no error logs in dmesg or smartctl, and we have to check
in the BM hypervisor. We eventually found that the IO-handling process
on the BM hypervisor was indeed behaving abnormally.




There are many hard disks on our server. Maybe we should not occupy 100% CPU
for a long time just because one hard disk fails.


It depends on the nature of the hard drive failure.  How is it
failing?

One thing which we do need to be careful about is when focusing on how
to prevent a failure caused by some particular (potentially extreme)
scenarios, that we don't cause problems on more common scenarios (for
example a heavily loaded server, and/or a case where the file system
is almost full where we have multiple files "fighting" over a small
number of free blocks).

In general, my attitude is that the best way to protect against hard
drive failures is to have processes which are monitoring the health of
the system, and if there is evidence of a failed drive, that we
immediately kill all jobs which are relying on that drive (which we
call "draining" a particular drive), and/or if a sufficiently large
percentage of the drives have failed, or the machine can no longer do
its job, to automatically move all of those jobs to other servers
(e.g., "drain" the server), and then send the machine to some kind of
data center repair service, where the failed hard drives can be
replaced.

I'm skeptical of attempts to make the file system somehow
continue to be able to "work" in the face of hard drive failures,
since failures can be highly atypical, and what might work well in one
failure scenario might be catastrophic in another.  It's especially
problematic if the HDD is not explicitly signalling an error condition,
but rather being slow (because it's doing a huge number of retries),
or the HDD is returning data which is simply different from what was
previously written.  The best we can do in that case is to detect that
something is wrong (this is where metadata checksums would be very
helpful), and then either remount the file system r/o, or panic the
machine, and/or signal to userspace that some particular file system
should be drained.



Thank you.
We generally separate the physical disks: one system disk and several
business disks. The Linux kernel runs on the system disk, and various
services run on the business disks. In this way, even if a business
disk has a problem, it will not affect the entire system.



But the current implementation of mballoc may cause 100% of a CPU to be
occupied for a long time. Could we optimize it slightly, as follows:


diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a02fadf..c73f212 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,8 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
ext4_group_t group);
 static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);

+static inline void ext4_mb_show_pa(struct super_block *sb);
+
 /*
  * The algorithm using this percpu seq counter goes below:
  * 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -4217,9 +4219,9 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa, *tmp;
struct list_head list;
struct ext4_buddy e4b;
+   int free_total = 0;
+   int busy, free;
int err;
-   int busy = 0;
-   int free, free_total = 0;

mb_debug(sb, "discard preallocation for group %u\n", group);
if (list_empty(&grp->bb_prealloc_list))
@@ -4247,6 +4249,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)

INIT_LIST_HEAD(&list);
 repeat:
+   busy = 0;
free = 0;
ext4_lock_group(sb, group);
list_for_each_entry_safe(pa, tmp,
@@ -4255,6 +4258,8 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
if (atomic_read(&pa->pa_count)) {
spin_unlock(&pa->pa_lock);
busy = 1;
+   mb_debug(sb, "used pa while discarding for group %u\n", group);
+   ext4_mb_show_pa(sb);
continue;
}
  

Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p

2021-04-10 Thread Wen Yang




On 2021/4/9 at 1:47 PM, riteshh wrote:

On 21/04/09 02:50AM, Wen Yang wrote:

On Apr 7, 2021, at 5:16 AM, riteshh  wrote:


On 21/04/07 03:01PM, Wen Yang wrote:

From: Wen Yang 

The kworker has occupied 100% of the CPU for several days:
  PID USER PR NI VIRT RES SHR S  %CPU %MEM   TIME+ COMMAND
68086 root 20  0    0   0   0 R 100.0  0.0 9718:18 kworker/u64:11

And the stack obtained through sysrq is as follows:
[20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
[20613144.850438] Call Trace:
[20613144.850439]  [] ext4_mb_new_blocks+0x429/0x550 [ext4]
[20613144.850439]  [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
[20613144.850441]  [] ext4_map_blocks+0x172/0x620 [ext4]
[20613144.850442]  [] ext4_writepages+0x7e5/0xf00 [ext4]
[20613144.850443]  [] do_writepages+0x1e/0x30
[20613144.850444]  [] __writeback_single_inode+0x45/0x320
[20613144.850444]  [] writeback_sb_inodes+0x272/0x600
[20613144.850445]  [] __writeback_inodes_wb+0x92/0xc0
[20613144.850445]  [] wb_writeback+0x268/0x300
[20613144.850446]  [] wb_workfn+0xb4/0x380
[20613144.850447]  [] process_one_work+0x189/0x420
[20613144.850447]  [] worker_thread+0x4e/0x4b0

The cpu resources of the cloud server are precious, and the server
cannot be restarted after running for a long time, so a configuration
parameter is added to prevent this endless loop.


Strange, if there is an endless loop here. Then I would definitely see
if there is any accounting problem in pa->pa_count. Otherwise busy=1
should not be set every time. The ext4_mb_show_pa() function may help
debug this.


If yes, then that means there always exists either a file preallocation
or a group preallocation. Maybe it is possible, in some use case.
Others may know of such use case, if any.
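
To make the concern above concrete, here is a minimal userspace sketch of the retry pattern under discussion. It is only a toy model: struct toy_pa, toy_discard_pass(), toy_discard_group() and the max_passes limit are hypothetical names invented for illustration, not the code in fs/ext4/mballoc.c and not the code of the patch. It assumes the loop retries while nothing has been freed and at least one PA was skipped because its pa_count was non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct ext4_prealloc_space: only what the loop reads. */
struct toy_pa {
    int pa_count;    /* non-zero while some allocator still holds a reference */
    int pa_free;     /* blocks that discarding this PA would give back */
    bool discarded;  /* set once the PA has been discarded */
};

/*
 * One pass over the group's PAs: referenced PAs are skipped and reported
 * through *busy, unreferenced ones are discarded.
 * Returns the number of blocks freed in this pass.
 */
static int toy_discard_pass(struct toy_pa *pas, size_t n, bool *busy)
{
    int freed = 0;

    *busy = false;
    for (size_t i = 0; i < n; i++) {
        if (pas[i].discarded)
            continue;
        if (pas[i].pa_count != 0) {
            *busy = true;   /* someone still uses this PA, skip it */
            continue;
        }
        pas[i].discarded = true;
        freed += pas[i].pa_free;
    }
    return freed;
}

/*
 * Repeat the pass while nothing was freed and a busy PA was seen, but stop
 * after max_passes attempts (hypothetical bound, must be >= 1).
 * Returns the total number of blocks freed; *passes_used reports the passes.
 */
static int toy_discard_group(struct toy_pa *pas, size_t n,
                             int max_passes, int *passes_used)
{
    int freed = 0;
    int pass = 0;
    bool busy = false;

    do {
        freed += toy_discard_pass(pas, n, &busy);
        pass++;
    } while (freed == 0 && busy && pass < max_passes);

    *passes_used = pass;
    return freed;
}
```

In this model a single PA whose pa_count never drops makes every pass come back with busy set and nothing freed, so without a bound the loop would spin for as long as the reference is held.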



If this code is broken, then it doesn't make sense to me that we would
leave it in the "run forever" state after the patch, and require a sysfs
tunable to be set to have a properly working system?



Is there anything particularly strange about the workload/system that
might cause this?  Filesystem is very full, memory is very low, etc?


Hi Ritesh and Andreas,

Thank you for your reply. Since there is still a faulty machine, we have
analyzed it again and found it is indeed a very special case:


crash> struct ext4_group_info 8813bb5f72d0
struct ext4_group_info {
   bb_state = 0,
   bb_free_root = {
 rb_node = 0x0
   },
   bb_first_free = 1681,
   bb_free = 0,


Not related to this issue, but the values of the above two variables
don't look consistent.


   bb_fragments = 0,
   bb_largest_free_order = -1,
   bb_prealloc_list = {
 next = 0x880268291d78,
 prev = 0x880268291d78 ---> *** The list is empty
   },


Ok. So when you collected the dump this list was empty.


   alloc_sem = {
 count = {
   counter = 0
 },
 wait_list = {
   next = 0x8813bb5f7308,
   prev = 0x8813bb5f7308
 },
 wait_lock = {
   raw_lock = {
 {
   val = {
 counter = 0
   },
   {
 locked = 0 '\000',
 pending = 0 '\000'
   },
   {
 locked_pending = 0,
 tail = 0
   }
 }
   }
 },
 osq = {
   tail = {
 counter = 0
   }
 },
 owner = 0x0
   },
   bb_counters = 0x8813bb5f7328
}
crash>


crash> list 0x880268291d78  -l ext4_prealloc_space.pa_group_list -s ext4_prealloc_space.pa_count


No point in doing this, I guess, since the list is empty anyway.
What you may be seeing below is some garbage data.


880268291d78
   pa_count = {
 counter = 1  ---> pa->pa_count
   }
8813bb5f72f0
   pa_count = {
 counter = -30701
   }


I guess that since the list is empty and you are seeing garbage, the
counter value of the above node looks weird.
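
For reference when reading the dump, here is a minimal userspace sketch of how emptiness is decided for a circular doubly linked list. It follows the same convention as the kernel's list_head but is a standalone re-implementation, so the helpers below are illustrative only. A head is empty only when its next pointer points back at the head itself; next == prev also holds for a list with exactly one entry, so that equality alone does not settle the question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points to itself in both directions. */
static void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Empty means "next points back at the head", not "next equals prev". */
static bool list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Count the entries by walking until we are back at the head. */
static size_t list_length(const struct list_head *head)
{
    size_t n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

Applied to a dump, the check is whether the next value printed for bb_prealloc_list equals the address of the bb_prealloc_list member itself.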




crash> struct -xo  ext4_prealloc_space
struct ext4_prealloc_space {
[0x0] struct list_head pa_inode_list;
   [0x10] struct list_head pa_group_list;
  union {
  struct list_head pa_tmp_list;
  struct callback_head pa_rcu;
   [0x20] } u;
   [0x30] spinlock_t pa_lock;
   [0x34] atomic_t pa_count;
   [0x38] unsigned int pa_deleted;
   [0x40] ext4_fsblk_t pa_pstart;
   [0x48] ext4_lblk_t pa_lstart;
   [0x4c] ext4_grpblk_t pa_len;
   [0x50] ext4_grpblk_t pa_free;
   [0x54] unsigned short pa_type;
   [0x58] spinlock_t *pa_obj_lock;
   [0x60] struct inode *pa_inode;
}
SIZE: 0x68


crash> rd 0x880268291d68 20
880268291d68:  881822f8a4c8 881822f8a4c8   ..."..."
880268291d78:  8813bb5f72f0 8813bb5f72f0   .r_..r_.
880268291d88:  1000 880db2371000   ..7.
880268291d98:  00010001    
880268291da8:  00029c39 00170c41   9...A...
880268291db8:  0016 881822f8a4d8   ..."
880268291dc8:  881822f8a268 

Re: [PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_p

2021-04-08 Thread Wen Yang

> On Apr 7, 2021, at 5:16 AM, riteshh  wrote:
>>
>> On 21/04/07 03:01PM, Wen Yang wrote:
>>> From: Wen Yang 
>>>
>>> The kworker has occupied 100% of the CPU for several days:
>>>   PID USER PR NI VIRT RES SHR S  %CPU %MEM   TIME+ COMMAND
>>> 68086 root 20  0    0   0   0 R 100.0  0.0 9718:18 kworker/u64:11
>>>
>>> And the stack obtained through sysrq is as follows:
>>> [20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
>>> [20613144.850438] Call Trace:
>>> [20613144.850439]  [] ext4_mb_new_blocks+0x429/0x550 [ext4]
>>> [20613144.850439]  [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
>>> [20613144.850441]  [] ext4_map_blocks+0x172/0x620 [ext4]
>>> [20613144.850442]  [] ext4_writepages+0x7e5/0xf00 [ext4]
>>> [20613144.850443]  [] do_writepages+0x1e/0x30
>>> [20613144.850444]  [] __writeback_single_inode+0x45/0x320
>>> [20613144.850444]  [] writeback_sb_inodes+0x272/0x600
>>> [20613144.850445]  [] __writeback_inodes_wb+0x92/0xc0
>>> [20613144.850445]  [] wb_writeback+0x268/0x300
>>> [20613144.850446]  [] wb_workfn+0xb4/0x380
>>> [20613144.850447]  [] process_one_work+0x189/0x420
>>> [20613144.850447]  [] worker_thread+0x4e/0x4b0
>>>
>>> The cpu resources of the cloud server are precious, and the server
>>> cannot be restarted after running for a long time, so a configuration
>>> parameter is added to prevent this endless loop.
>>
>> Strange, if there is an endless loop here. Then I would definitely see
>> if there is any accounting problem in pa->pa_count. Otherwise busy=1
>> should not be set every time. The ext4_mb_show_pa() function may help
>> debug this.
>>
>> If yes, then that means there always exists either a file preallocation
>> or a group preallocation. Maybe it is possible, in some use case.
>> Others may know of such use case, if any.

> If this code is broken, then it doesn't make sense to me that we would
> leave it in the "run forever" state after the patch, and require a sysfs
> tunable to be set to have a properly working system?

> Is there anything particularly strange about the workload/system that
> might cause this?  Filesystem is very full, memory is very low, etc?

Hi Ritesh and Andreas,

Thank you for your reply. Since there is still a faulty machine, we have 
analyzed it again and found it is indeed a very special case:



crash> struct ext4_group_info 8813bb5f72d0
struct ext4_group_info {
  bb_state = 0,
  bb_free_root = {
rb_node = 0x0
  },
  bb_first_free = 1681,
  bb_free = 0,
  bb_fragments = 0,
  bb_largest_free_order = -1,
  bb_prealloc_list = {
next = 0x880268291d78,
prev = 0x880268291d78 ---> *** The list is empty
  },
  alloc_sem = {
count = {
  counter = 0
},
wait_list = {
  next = 0x8813bb5f7308,
  prev = 0x8813bb5f7308
},
wait_lock = {
  raw_lock = {
{
  val = {
counter = 0
  },
  {
locked = 0 '\000',
pending = 0 '\000'
  },
  {
locked_pending = 0,
tail = 0
  }
}
  }
},
osq = {
  tail = {
counter = 0
  }
},
owner = 0x0
  },
  bb_counters = 0x8813bb5f7328
}
crash>


crash> list 0x880268291d78  -l ext4_prealloc_space.pa_group_list -s ext4_prealloc_space.pa_count

880268291d78
  pa_count = {
counter = 1  ---> pa->pa_count
  }
8813bb5f72f0
  pa_count = {
counter = -30701
  }


crash> struct -xo  ext4_prealloc_space
struct ext4_prealloc_space {
   [0x0] struct list_head pa_inode_list;
  [0x10] struct list_head pa_group_list;
 union {
 struct list_head pa_tmp_list;
 struct callback_head pa_rcu;
  [0x20] } u;
  [0x30] spinlock_t pa_lock;
  [0x34] atomic_t pa_count;
  [0x38] unsigned int pa_deleted;
  [0x40] ext4_fsblk_t pa_pstart;
  [0x48] ext4_lblk_t pa_lstart;
  [0x4c] ext4_grpblk_t pa_len;
  [0x50] ext4_grpblk_t pa_free;
  [0x54] unsigned short pa_type;
  [0x58] spinlock_t *pa_obj_lock;
  [0x60] struct inode *pa_inode;
}
SIZE: 0x68
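
The offsets above explain why the next command reads memory starting at 0x880268291d68 rather than at the list node address. A small sketch of that arithmetic follows; the two helper functions are made up for illustration, while the offsets and addresses are the ones printed in this mail.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets copied from the `struct -xo ext4_prealloc_space` output. */
enum {
    PA_GROUP_LIST_OFFSET = 0x10,  /* struct list_head pa_group_list */
    PA_COUNT_OFFSET      = 0x34   /* atomic_t pa_count */
};

/* Address of the containing structure, given the address of a member. */
static uint64_t member_to_base(uint64_t member_addr, uint64_t offset)
{
    return member_addr - offset;
}

/* Address of a member, given the address of the containing structure. */
static uint64_t base_to_member(uint64_t base_addr, uint64_t offset)
{
    return base_addr + offset;
}
```

With the list node at 0x880268291d78, member_to_base() gives 0x880268291d68 for the start of the ext4_prealloc_space, and base_to_member() places pa_count at 0x880268291d9c, which is the upper half of the 8-byte word that rd prints at 880268291d98.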


crash> rd 0x880268291d68 20
880268291d68:  881822f8a4c8 881822f8a4c8   ..."..."
880268291d78:  8813bb5f72f0 8813bb5f72f0   .r_..r_.
880268291d88:  1000 880db2371000   ..7.
880268291d98:  00010001    
880268291da8:  00029c39 00170c41   9...A...
880268291db8:  0016 881822f8a4d8   ..."
880268291dc8:  881822f8a268 880268291af8   h.."..)h
880268291dd8:  880268291dd0 ea00

[PATCH] ext4: add a configurable parameter to prevent endless loop in ext4_mb_discard_group_preallocations

2021-04-07 Thread Wen Yang
From: Wen Yang 

The kworker has occupied 100% of the CPU for several days:
  PID USER PR NI VIRT RES SHR S  %CPU %MEM   TIME+ COMMAND
68086 root 20  0    0   0   0 R 100.0  0.0 9718:18 kworker/u64:11

And the stack obtained through sysrq is as follows:
[20613144.850426] task: 8800b5e08000 task.stack: c9001342c000
[20613144.850427] RIP: 0010:[]  [] ext4_mb_discard_group_preallocations+0x1b3/0x480 [ext4]
[20613144.850428] RSP: 0018:c9001342f740  EFLAGS: 0246
[20613144.850428] RAX:  RBX: 8813bb5f72f0 RCX:
[20613144.850429] RDX: 0001 RSI: 880268291d78 RDI: 880268291d98
[20613144.850430] RBP: c9001342f7e8 R08: 00493b8bc070da84 R09:
[20613144.850430] R10: 1800 R11: 880155e7c380 R12: 880268291d98
[20613144.850431] R13: 8813bb5f72e0 R14: 880268291d78 R15: 880268291d68
[20613144.850432] FS:  () GS:881fbf08() knlGS:
[20613144.850432] CS:  0010 DS:  ES:  CR0: 80050033
[20613144.850433] CR2: 00c000823010 CR3: 01c08000 CR4: 003606f0
[20613144.850434] DR0:  DR1:  DR2:
[20613144.850434] DR3:  DR6: fffe0ff0 DR7: 0400
[20613144.850435] Stack:
[20613144.850435]  881942d6a6e8 8813bb5f72d0 0001a02427cf 0140
[20613144.850436]  880f80618000  c9001342f770 c9001342f770
[20613144.850437]  ea0056360dc0 88158d837000 ea0045155f80 88114557e000
[20613144.850438] Call Trace:
[20613144.850439]  [] ext4_mb_new_blocks+0x429/0x550 [ext4]
[20613144.850439]  [] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
[20613144.850440]  [] ? numa_zonelist_order_handler+0xa1/0x1c0
[20613144.850441]  [] ext4_map_blocks+0x172/0x620 [ext4]
[20613144.850441]  [] ? ext4_writepages+0x4cd/0xf00 [ext4]
[20613144.850442]  [] ext4_writepages+0x7e5/0xf00 [ext4]
[20613144.850442]  [] ? wb_position_ratio+0x1f0/0x1f0
[20613144.850443]  [] do_writepages+0x1e/0x30
[20613144.850444]  [] __writeback_single_inode+0x45/0x320
[20613144.850444]  [] writeback_sb_inodes+0x272/0x600
[20613144.850445]  [] __writeback_inodes_wb+0x92/0xc0
[20613144.850445]  [] wb_writeback+0x268/0x300
[20613144.850446]  [] wb_workfn+0xb4/0x380
[20613144.850447]  [] process_one_work+0x189/0x420
[20613144.850447]  [] worker_thread+0x4e/0x4b0
[20613144.850448]  [] ? process_one_work+0x420/0x420
[20613144.850448]  [] kthread+0xe6/0x100
[20613144.850449]  [] ? kthread_park+0x60/0x60
[20613144.850450]  [] ret_from_fork+0x39/0x50

The cpu resources of the cloud server are precious, and the server
cannot be restarted after running for a long time, so a configuration
parameter is added to prevent this endless loop.

Signed-off-by: Wen Yang 
Cc: "Theodore Ts'o" 
Cc: Andreas Dilger 
Cc: Baoyou Xie 
Cc: linux-e...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 fs/ext4/ext4.h    |  1 +
 fs/ext4/mballoc.c | 19 +++++++++++++++----
 fs/ext4/mballoc.h |  2 ++
 fs/ext4/sysfs.c   |  2 ++
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2866d24..c238fec 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1543,6 +1543,7 @@ struct ext4_sb_info {
unsigned long s_mb_last_start;
unsigned int s_mb_prefetch;
unsigned int s_mb_prefetch_limit;
+   unsigned long s_mb_max_retries_per_group;
 
/* stats for buddy allocator */
atomic_t s_bal_reqs;/* number of reqs with len > 1 */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 99bf091..c126b15 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2853,6 +2853,8 @@ int ext4_mb_init(struct super_block *sb)
sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
sbi->s_mb_max_inode_prealloc = MB_DEFAULT_MAX_INODE_PREALLOC;
+   sbi->s_mb_max_retries_per_group = MB_DISCARD_RETRIES_FOREVER; 
+
/*
 * The default group preallocation is 512, which for 4k block
 * sizes translates to 2 megabytes.  However for bigalloc file
@@ -4206,6 +4208,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
ext4_group_t group, int needed)
 {
struct ext4_group_info *grp = ext4_get_group_info(sb, group);
+   struct ext4_sb_info *sbi = EXT4_SB(sb);
struct buffer_head *bitmap_bh = NULL;
struct ext4_prealloc_space *pa, *tmp;
struct list_head list;
@@ -4213,6 +4216,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
int err;
int busy = 0;
int free, free_total = 0;
+   int discard_retries = 0;
 
mb_debug(sb, "discard preallocation for group %u\n", group);
if (list_empty(&grp->bb_prealloc_list

Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry

2021-01-07 Thread Wen Yang




On 2021/1/8 at 2:28 AM, Greg Kroah-Hartman wrote:

On Fri, Jan 08, 2021 at 12:21:38AM +0800, Wen Yang wrote:



On 2021/1/7 at 8:17 PM, Greg Kroah-Hartman wrote:

On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote:

Dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag set, so
they should be deleted when the process exits.

Suppose the following race appears:

release_task                 dput
-> proc_flush_task
                             -> dentry->d_op->d_delete(dentry)
-> __exit_signal
                             -> dentry->d_lockref.count--  and return.

In proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal() (the pid has not been unhashed yet), so d_delete returns
false and this dentry still cannot be deleted.

This dentry will stay cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent dentry will stay cached too, and
these dentries can only be deleted when drop_caches is manually triggered.
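
The ordering problem described above can be replayed with a small deterministic toy model. Everything in it (the toy_* structs and functions) is hypothetical and greatly simplified: it assumes only what the description states, namely that the flush skips a dentry that is still in use and that d_delete() refuses deletion while the pid is still hashed.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_task   { bool pid_hashed; };
struct toy_dentry { int count; bool cached; };

/* Stand-in for d_op->d_delete(): agrees only once the pid is unhashed. */
static bool toy_d_delete(const struct toy_task *t)
{
    return !t->pid_hashed;
}

/* Stand-in for proc_flush_task(): cannot drop a dentry that is in use. */
static void toy_proc_flush_task(struct toy_dentry *d)
{
    if (d->count == 0)
        d->cached = false;
}

/* Stand-in for __exit_signal(): unhashes the pid. */
static void toy_exit_signal(struct toy_task *t)
{
    t->pid_hashed = false;
}

/* Stand-in for dput(): ask d_delete first, then drop the reference. */
static void toy_dput(struct toy_dentry *d, const struct toy_task *t)
{
    bool may_delete = toy_d_delete(t);

    d->count--;
    if (d->count == 0 && may_delete)
        d->cached = false;
}

/* Returns true if the dentry is still cached after both paths have run. */
static bool toy_run(bool dput_before_exit_signal)
{
    struct toy_task t = { .pid_hashed = true };
    struct toy_dentry d = { .count = 1, .cached = true }; /* held elsewhere */

    toy_proc_flush_task(&d);      /* skipped: the dentry is still in use */
    if (dput_before_exit_signal) {
        toy_dput(&d, &t);         /* pid still hashed, d_delete says no */
        toy_exit_signal(&t);
    } else {
        toy_exit_signal(&t);
        toy_dput(&d, &t);         /* pid unhashed, the dentry can go */
    }
    return d.cached;
}
```

In this model toy_run(true) leaves the dentry cached with a count of 0, which is the leak described above, while toy_run(false) frees it.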

This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.

This issue was introduced by 60347f6716aa ("pid namespaces: prepare
proc_flust_task() to flush entries from multiple proc trees"), exposed by
f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then
fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc").


Why are you just submitting a series for 4.9 and 4.19, what about 4.14?
We can't have users move to a newer kernel and then experience old bugs,
right?


Okay, the patches corresponding to 4.14 will be ready later.


Note for some reason you didn't cc: the stable list for these patches :(


But the larger question is why are you backporting a whole new feature
here?  Why is CLONE_PIDFD needed?  That feels really wrong...



We are backporting CLONE_PIDFD because 7bc3e6e55acf ("proc: Use a
list of inodes to flush from proc") relies on wait_pidfd.lock. There are
indeed many associated modifications here, and we are still testing them.
Please review the code further.


Is the only "issue" here wasted memory?  Will it eventually be freed
anyway even if you do not echo to the proc file to flush caches?

You mention the inability to create a new pid for a specific namespace,
is that really a problem?  Shouldn't the code handle such issues
normally?  What breaks without these changes?

I think at this point, it might just be time for you to move to a newer
kernel release, as adding a whole new userspace feature for this feels
really really odd.

What is preventing you from doing that today?  What holds you to older
kernels that will not allow you to move forward?



We have encountered this problem in the cloud server environment. Users 
will frequently create and delete containers, and the corresponding 
pid_ns will accumulate, eventually making it impossible to create a new 
container.


https://bugzilla.kernel.org/show_bug.cgi?id=208613

The kernels (4.9/4.19) used on a large scale in our current production 
environment (almost tens of thousands of machines) may need to be fixed.


Thanks.

--
Best wishes,
Wen




Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry

2021-01-07 Thread Wen Yang




On 2021/1/7 at 8:17 PM, Greg Kroah-Hartman wrote:

On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote:

Dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag set, so
they should be deleted when the process exits.

Suppose the following race appears:

release_task                 dput
-> proc_flush_task
                             -> dentry->d_op->d_delete(dentry)
-> __exit_signal
                             -> dentry->d_lockref.count--  and return.

In proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal() (the pid has not been unhashed yet), so d_delete returns
false and this dentry still cannot be deleted.

This dentry will stay cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent dentry will stay cached too, and
these dentries can only be deleted when drop_caches is manually triggered.

This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.

This issue was introduced by 60347f6716aa ("pid namespaces: prepare
proc_flust_task() to flush entries from multiple proc trees"), exposed by
f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then
fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc").


Why are you just submitting a series for 4.9 and 4.19, what about 4.14?
We can't have users move to a newer kernel and then experience old bugs,
right?


Okay, the patches corresponding to 4.14 will be ready later.



But the larger question is why are you backporting a whole new feature
here?  Why is CLONE_PIDFD needed?  That feels really wrong...



We are backporting CLONE_PIDFD because 7bc3e6e55acf ("proc:
Use a list of inodes to flush from proc") relies on wait_pidfd.lock.
There are indeed many associated modifications here, and we are still
testing them. Please review the code further.


Thanks.

--
Best wishes,
Wen



[PATCH 4.19 1/7] clone: add CLONE_PIDFD

2021-01-06 Thread Wen Yang
From: Christian Brauner 

[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the caller's
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
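
As an illustration of the "Pid:\t%d" format mentioned above, here is a minimal sketch of how a userspace tool might pull the pid out of the text of an fdinfo file. The function name parse_fdinfo_pid() is made up for this example and is not part of the patch or of any library; it only assumes that the field appears at the start of a line in the format stated above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan the text of an fdinfo file line by line and return the number that
 * follows "Pid:\t" on the first line starting with that prefix.
 * Returns -1 when no such line is found.
 */
static long parse_fdinfo_pid(const char *text)
{
    const char *line = text;

    while (line && *line) {
        long pid;

        if (strncmp(line, "Pid:\t", 5) == 0 &&
            sscanf(line + 5, "%ld", &pid) == 1)
            return pid;

        line = strchr(line, '\n');  /* move on to the next line, if any */
        if (line)
            line++;
    }
    return -1;
}
```

A caller would read /proc/self/fdinfo/<fd> into a buffer and pass the buffer to this function; a result of -1 would mean the kernel did not report a Pid field for that descriptor.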

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds 
Signed-off-by: Christian Brauner 
Co-developed-by: Jann Horn 
Signed-off-by: Jann Horn 
Reviewed-by: Oleg Nesterov 
Cc: Arnd Bergmann 
Cc: "Eric W. Biederman" 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: David Howells 
Cc: "Michael Kerrisk (man-pages)" 
Cc: Andy Lutomirsky 
Cc: Andrew Morton 
Cc: Aleksa Sarai 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc:  # 4.19.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang 
---
 include/linux/pid.h|   2 +
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c  | 107 +++--
 3 files changed, 106 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 14a9a39..29c0a99 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -66,6 +66,8 @@ struct pid
 
 extern struct pid init_struct_pid;
 
+extern const struct file_operations pidfd_fops;
+
 static inline struct pid *get_pid(struct pid *pid)
 {
if (pid)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 22627f8..ed4ee17 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -10,6 +10,7 @@
 #define CLONE_FS       0x0200  /* set if fs info shared between processes */
 #define CLONE_FILES    0x0400  /* set if open files shared between processes */
 #define CLONE_SIGHAND  0x0800  /* set if signal handlers and blocked signals shared */
+#define CLONE_PIDFD    0x1000  /* set if a pidfd should be placed in parent */
 #define CLONE_PTRACE   0x2000  /* set if we want to let tracing continue on the child too */
 #define CLONE_VFORK    0x4000  /* set if the parent wants the child to wake it up on mm_release */
 #define CLONE_PARENT   0x8000  /* set if we want to have the same parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index f2c92c1..e419891 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -21,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1666,6 +1668,58 @@ static void copy_oom_score_adj(u64 clone_flags, struct 
task_struct *tsk)
mutex_unlock(&oom_adj_mutex);
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+   struct pid *pid = file->private_data;
+
+   file->private_data = NULL;
+   put_pid(pid);
+   return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+   struct pid_namespace *ns = proc_pid_ns(file_inode(m->file));
+   struct pid *pid = f->private_data;
+
+   seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+   seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+   .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+   .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file 

[PATCH 4.19 5/7] proc: Clear the pieces of proc_inode that proc_evict_inode cares about

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ]

This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
inodes when a process dies that caused me to tidy this up.

The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.19.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index fffc7e4..45b4344 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -34,21 +34,27 @@ static void proc_evict_inode(struct inode *inode)
 {
struct proc_dir_entry *de;
struct ctl_table_header *head;
+   struct proc_inode *ei = PROC_I(inode);
 
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
 
/* Stop tracking associated processes */
-   put_pid(PROC_I(inode)->pid);
+   if (ei->pid) {
+   put_pid(ei->pid);
+   ei->pid = NULL;
+   }
 
/* Let go of any associated proc directory entry */
-   de = PDE(inode);
-   if (de)
+   de = ei->pde;
+   if (de) {
pde_put(de);
+   ei->pde = NULL;
+   }
 
-   head = PROC_I(inode)->sysctl;
+   head = ei->sysctl;
if (head) {
-   RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+   RCU_INIT_POINTER(ei->sysctl, NULL);
proc_sys_evict_inode(inode, head);
}
 }
-- 
1.8.3.1



[PATCH 4.19 6/7] proc: Use d_invalidate in proc_prune_siblings_dcache

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ]

The function d_prune_aliases has the problem that it will only prune
aliases that are completely unused.  It will not remove aliases for
the dcache or even think of removing mounts from the dcache.  For that
behavior d_invalidate is needed.

To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.

For completeness the directory and the non-directory cases are
separated because in theory (although not currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.

Plus the differences between d_find_any_alias and d_find_alias make
it clear why the directory and non-directory code do not share code.

To make it clear these routines now invalidate dentries, rename
proc_prune_siblings_dcache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache to proc_sys_invalidate_dcache.

V2: Split the directory and non-directory cases to make this
code robust to future changes in proc.

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.19.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 16 ++--
 fs/proc/internal.h|  2 +-
 fs/proc/proc_sysctl.c |  8 
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 45b4344..fad579e 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -118,7 +118,7 @@ void __init proc_init_kmemcache(void)
BUILD_BUG_ON(sizeof(struct proc_dir_entry) >= SIZEOF_PDE);
 }
 
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t 
*lock)
 {
struct inode *inode;
struct proc_inode *ei;
@@ -147,7 +147,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, 
spinlock_t *lock)
continue;
}
 
-   d_prune_aliases(inode);
+   if (S_ISDIR(inode->i_mode)) {
+   struct dentry *dir = d_find_any_alias(inode);
+   if (dir) {
+   d_invalidate(dir);
+   dput(dir);
+   }
+   } else {
+   struct dentry *dentry;
+   while ((dentry = d_find_alias(inode))) {
+   d_invalidate(dentry);
+   dput(dentry);
+   }
+   }
iput(inode);
deactivate_super(sb);
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 6cae472..1db693b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -210,7 +210,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 void proc_init_kmemcache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t 
*lock);
 void set_proc_pid_nlink(void);
 extern struct inode *proc_get_inode(struct super_block *, struct 
proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 57b16bf..f8f1f8a 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -262,9 +262,9 @@ static void unuse_table(struct ctl_table_header *p)
complete(p->unregistering);
 }
 
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
 {
-   proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+   proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
@@ -286,10 +286,10 @@ static void start_unregistering(struct ctl_table_header 
*p)
spin_unlock(&sysctl_lock);
}
/*
-* Prune dentries for unregistered sysctls: namespaced sysctls
+* Invalidate dentries for unregistered sysctls: namespaced sysctls
 * can have duplicate names and contaminate dcache very badly.
 */
-   proc_sys_prune_dcache(p);
+   proc_sys_invalidate_dcache(p);
/*
 * do not remove from the list until nobody holds it; walking the
 * list in do_sysctl() relies on that.
-- 
1.8.3.1



[PATCH 4.19 2/7] pidfd: add polling support

2021-01-06 Thread Wen Yang
From: "Joel Fernandes (Google)" 

[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]

This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if the PID is reused
between the time LMK sends the kill signal and the time it checks for the
existence of the PID, the check may be made against the wrong process.
Using the polling support, LMK will be able to get notified when a process
exits in a race-free and fast way, and it can do other things
(such as polling on other fds) while waiting for the killed process
to die.
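
The use case above can be sketched from userspace roughly as follows. This
is a minimal illustration, not part of the patch: wait_for_exit() and
kill_and_wait_demo() are hypothetical names, and the pidfd is obtained with
pidfd_open(2), which only exists in Linux 5.3 and later. On a 4.19 kernel
carrying this series the descriptor would instead come from clone() with
CLONE_PIDFD. The fallback syscall number is an assumption that holds on
most, but not all, architectures.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumption: number used by most architectures */
#endif

/* Block until the process behind pidfd has exited; 0 on success, -1 on error. */
static int wait_for_exit(int pidfd)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int ready;

	assert(pidfd >= 0);
	do {
		ready = poll(&pfd, 1, -1);
	} while (ready < 0 && errno == EINTR);

	return (ready == 1 && (pfd.revents & POLLIN)) ? 0 : -1;
}

/*
 * Hypothetical demo: start a child, kill it, and learn about its death
 * through the pidfd instead of probing /proc/pid.
 * Returns 0 if the exit was seen via poll(), 1 if no pidfd could be
 * obtained (for example on a kernel without pidfd_open), -1 otherwise.
 */
int kill_and_wait_demo(void)
{
	pid_t child = fork();
	int pidfd, ret;

	if (child < 0)
		return -1;
	if (child == 0) {
		for (;;)
			pause();	/* child: sleep until killed */
	}

	pidfd = (int)syscall(SYS_pidfd_open, child, 0);
	if (pidfd < 0) {
		kill(child, SIGKILL);
		waitpid(child, NULL, 0);
		return 1;
	}

	kill(child, SIGKILL);
	ret = wait_for_exit(pidfd);	/* readable once the child has exited */

	waitpid(child, NULL, 0);	/* reap the zombie */
	close(pidfd);
	return ret;
}
```

Because the descriptor refers to the process itself rather than to a PID
number, the wait cannot be confused by PID reuse, and the same poll() call
could watch other descriptors at the same time.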

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be an option in this case because the task can
   be reaped and destroyed before the poll returns.

2. Including the waitqueue in struct pid means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski 
Cc: Steven Rostedt 
Cc: Daniel Colascione 
Cc: Jann Horn 
Cc: Tim Murray 
Cc: Jonathan Kowalski 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: Kees Cook 
Cc: David Howells 
Cc: Oleg Nesterov 
Cc: kernel-t...@android.com
Reviewed-by: Oleg Nesterov 
Co-developed-by: Daniel Colascione 
Signed-off-by: Daniel Colascione 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Christian Brauner 
Cc:  # 4.19.x
Signed-off-by: Wen Yang 
---
 include/linux/pid.h |  3 +++
 kernel/fork.c   | 26 ++
 kernel/pid.c|  2 ++
 kernel/signal.c | 11 +++
 4 files changed, 42 insertions(+)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 29c0a99..a82d2f7 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -3,6 +3,7 @@
 #define _LINUX_PID_H
 
 #include 
+#include 
 
 enum pid_type
 {
@@ -60,6 +61,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+   /* wait queue for pidfd notifications */
+   wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index e419891..33dc746 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1688,8 +1688,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct 
file *f)
 }
 #endif
 
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct 
*pts)
+{
+   struct task_struct *task;
+   struct pid *pid = file->private_data;
+   int poll_flags = 0;
+
+   poll_wait(file, &pid->wait_pidfd, pts);
+
+   rcu_read_lock();
+   task = pid_task(pid, PIDTYPE_PID);
+   /*
+* Inform pollers only when the whole thread group exits.
+* If the thread group leader exits before all other threads in the
+* group, then poll(2) should block, similar to the wait(2) family.
+*/
+   if (!task || (task->exit_state && thread_group_empty(task)))
+   poll_flags = POLLIN | POLLRDNORM;
+   rcu_read_unlock();
+
+   return poll_flags;
+}
+
 const struct file_operations pidfd_fops = {
.release = pidfd_release,
+   .poll = pidfd_poll,
 #ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index b88fe5e..3ba6fcb 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -214,6 +214,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
 
+   init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->pid_allocated & PIDNS_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index a02a25a..22a04795 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1810,6 +1810,14 @@ int send_sigqueue(struct sigqueue *q, struct pid *pid, 
enum pid_type type)
return ret;
 }
 
+static void do_notify_pidfd(struct task_struct *task)
+{
+   struct pid *pid;
+
+   pid = task_pid(task);
+   wake_up_all(&pid->wait_pidfd);
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notify_parent_cldstop instead.
@@ -1833,6 +1841,9 @@ bool do_notify_par

[PATCH 4.19 3/7] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 0afa5ca82212247456f9de1468b595a111fee633 ]

I am about to need and use the same functionality for pid based
inodes and there is no point in adding a second field when
this field is already here and serving the same purpose.

Just give the field a generic name so it is clear that
it is no longer sysctl specific.

Also for good measure initialize sibling_inodes when
proc_inode is initialized.

Signed-off-by: Eric W. Biederman 
Cc:  # 4.19.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 1 +
 fs/proc/internal.h| 2 +-
 fs/proc/proc_sysctl.c | 8 
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 31bf3bb..e5334ed 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -70,6 +70,7 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
ei->pde = NULL;
ei->sysctl = NULL;
ei->sysctl_entry = NULL;
+   INIT_HLIST_NODE(&ei->sibling_inodes);
ei->ns_ops = NULL;
inode = &ei->vfs_inode;
return inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 95b1419..d922c01 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -91,7 +91,7 @@ struct proc_inode {
struct proc_dir_entry *pde;
struct ctl_table_header *sysctl;
struct ctl_table *sysctl_entry;
-   struct hlist_node sysctl_inodes;
+   struct hlist_node sibling_inodes;
const struct proc_ns_operations *ns_ops;
struct inode vfs_inode;
 } __randomize_layout;
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c95f32b..0f578f6 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -274,9 +274,9 @@ static void proc_sys_prune_dcache(struct ctl_table_header 
*head)
node = hlist_first_rcu(&head->inodes);
if (!node)
break;
-   ei = hlist_entry(node, struct proc_inode, sysctl_inodes);
+   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&ei->sysctl_inodes);
+   hlist_del_init_rcu(&ei->sibling_inodes);
spin_unlock(&sysctl_lock);
 
inode = &ei->vfs_inode;
@@ -478,7 +478,7 @@ static struct inode *proc_sys_make_inode(struct super_block 
*sb,
}
ei->sysctl = head;
ei->sysctl_entry = table;
-   hlist_add_head_rcu(&ei->sysctl_inodes, &head->inodes);
+   hlist_add_head_rcu(&ei->sibling_inodes, &head->inodes);
head->count++;
spin_unlock(&sysctl_lock);
 
@@ -509,7 +509,7 @@ static struct inode *proc_sys_make_inode(struct super_block 
*sb,
 void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
 {
spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&PROC_I(inode)->sysctl_inodes);
+   hlist_del_init_rcu(&PROC_I(inode)->sibling_inodes);
if (!--head->count)
kfree_rcu(head, rcu);
spin_unlock(&sysctl_lock);
-- 
1.8.3.1



[PATCH 4.19 7/7] proc: Use a list of inodes to flush from proc

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt, which allows different
mounts of proc to have different mount options even in the same pid
namespace. It also allows for the removal of proc_mnt, which will
trivially allow the first mount of proc to honor its mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid, which winds up being
safe (if suboptimal) as this is just an optimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing, its reference count is
taken in release_task and dropped in proc_flush_pid.  Further, the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing is taking place.  This removes a small race
where a dentry could be recreated.

As struct pid is supposed to be small and I need a per pid lock,
I reuse the only lock that currently exists in struct pid:
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
initialization into the initial version of the patch.  A boot
failure was reported by "kernel test robot ", and
failure to initialize that pid->inodes matches all of the reported
symptoms.

Signed-off-by: Eric W. Biederman 
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid
namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush
entries from multiple proc trees")
Cc:  # 4.19.x: b3e583825266: clone: add
CLONE_PIDFD
Cc:  # 4.19.x: b53b0b9d9a61: pidfd: add polling
support
Cc:  # 4.19.x: 0afa5ca82212: proc: Rename in
proc_inode rename sysctl_inodes sibling_inodes
Cc:  # 4.19.x: 26dbc60f385f: proc: Generalize
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc:  # 4.19.x: 71448011ea2a: proc: Clear the
pieces of
proc_inode that proc_evict_inode cares about
Cc:  # 4.19.x: f90f3cafe8d5: Use d_invalidate in
proc_prune_siblings_dcache
Cc:  # 4.19.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c  | 111 
 fs/proc/inode.c |   2 +-
 fs/proc/internal.h  |   1 +
 include/linux/pid.h |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c   |   4 +-
 kernel/pid.c|   1 +
 7 files changed, 45 insertions(+), 79 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5e705fa..ea74c7c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1741,11 +1741,25 @@ void task_dump_owner(struct task_struct *task, umode_t 
mode,
*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+   struct pid *pid = ei->pid;
+
+   if (S_ISDIR(ei->vfs_inode.i_mode)) {
+   spin_lock(&pid->wait_pidfd.lock);
+   hlist_del_init_rcu(&ei->sibling_inodes);
+   spin_unlock(&pid->wait_pidfd.lock);
+   }
+
+   put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
+   struct pid *pid;
 
/* We need a new inode */
 
@@ -1763,10 +1777,18 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb,
/*
 * grab the reference to task.
 */
-   ei->pid = get_task_pid(task, PIDTYPE_PID);
-   if (!ei->pid)
+   pid = get_task_pid(task, PIDTYPE_PID);
+   if (!pid)
goto out_unlock;
 
+   /* Let the pid remember us for quick removal */
+   ei->pid = pid;
+   if (S_ISDIR(mode)) {
+   spin_lock(&pid->wait_pidfd.lock);
+   hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+   spin_unlock(&pid->wait_pidfd.lock);
+   }
+
task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
security_task_to_inode(task, inode);
 
@@ -3067,90 +3089,29 @@ static struct dentry *proc_tgid_base_lookup(struct 
inode *dir, struct dentry *de
.permission = proc_pid_permis

[PATCH 4.19 4/7] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 26dbc60f385ff9cff475ea2a3bad02e80fd6fa43 ]

This prepares the way for allowing the pid part of proc to use this
dcache pruning code as well.

Signed-off-by: Eric W. Biederman 
Cc:  # 4.19.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 38 ++
 fs/proc/internal.h|  1 +
 fs/proc/proc_sysctl.c | 35 +--
 3 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index e5334ed..fffc7e4 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -112,6 +112,44 @@ void __init proc_init_kmemcache(void)
BUILD_BUG_ON(sizeof(struct proc_dir_entry) >= SIZEOF_PDE);
 }
 
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+{
+   struct inode *inode;
+   struct proc_inode *ei;
+   struct hlist_node *node;
+   struct super_block *sb;
+
+   rcu_read_lock();
+   for (;;) {
+   node = hlist_first_rcu(inodes);
+   if (!node)
+   break;
+   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
+   spin_lock(lock);
+   hlist_del_init_rcu(&ei->sibling_inodes);
+   spin_unlock(lock);
+
+   inode = &ei->vfs_inode;
+   sb = inode->i_sb;
+   if (!atomic_inc_not_zero(&sb->s_active))
+   continue;
+   inode = igrab(inode);
+   rcu_read_unlock();
+   if (unlikely(!inode)) {
+   deactivate_super(sb);
+   rcu_read_lock();
+   continue;
+   }
+
+   d_prune_aliases(inode);
+   iput(inode);
+   deactivate_super(sb);
+
+   rcu_read_lock();
+   }
+   rcu_read_unlock();
+}
+
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
struct super_block *sb = root->d_sb;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index d922c01..6cae472 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -210,6 +210,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 void proc_init_kmemcache(void);
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 void set_proc_pid_nlink(void);
 extern struct inode *proc_get_inode(struct super_block *, struct 
proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 0f578f6..57b16bf 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -264,40 +264,7 @@ static void unuse_table(struct ctl_table_header *p)
 
 static void proc_sys_prune_dcache(struct ctl_table_header *head)
 {
-   struct inode *inode;
-   struct proc_inode *ei;
-   struct hlist_node *node;
-   struct super_block *sb;
-
-   rcu_read_lock();
-   for (;;) {
-   node = hlist_first_rcu(&head->inodes);
-   if (!node)
-   break;
-   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
-   spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&ei->sibling_inodes);
-   spin_unlock(&sysctl_lock);
-
-   inode = &ei->vfs_inode;
-   sb = inode->i_sb;
-   if (!atomic_inc_not_zero(&sb->s_active))
-   continue;
-   inode = igrab(inode);
-   rcu_read_unlock();
-   if (unlikely(!inode)) {
-   deactivate_super(sb);
-   rcu_read_lock();
-   continue;
-   }
-
-   d_prune_aliases(inode);
-   iput(inode);
-   deactivate_super(sb);
-
-   rcu_read_lock();
-   }
-   rcu_read_unlock();
+   proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
-- 
1.8.3.1



[PATCH v2 4.9 04/10] proc: Better ownership of files for non-dumpable tasks in user namespaces

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 68eb94f16227336a5773b83ecfa8290f1d6b78ce ]

Instead of making the files owned by the GLOBAL_ROOT_USER, make
non-dumpable files whose mm has always lived in a user namespace owned
by the user namespace root.  This allows the container root to have
things work as expected in a container.
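
The ownership rule can be modelled outside the kernel as a small decision
function. The sketch below is an illustration only: struct task_model,
struct owner and pick_owner() are hypothetical stand-ins for the task,
credential and user namespace state that task_dump_owner() consults, and
plain unsigned integers replace kuid_t and kgid_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GLOBAL_ROOT_ID 0u			/* stands in for GLOBAL_ROOT_UID/GID */
#define INVALID_ID     ((unsigned int)-1)	/* stands in for an unmapped id */

struct owner {
	unsigned int uid;
	unsigned int gid;
};

/* Hypothetical, flattened view of the state the patch looks at. */
struct task_model {
	unsigned int euid, egid;	/* effective credentials of the task */
	bool has_mm;			/* false when task->mm is NULL */
	bool dumpable;			/* get_dumpable(mm) == SUID_DUMP_USER */
	unsigned int ns_root_uid;	/* id 0 of mm->user_ns, or INVALID_ID */
	unsigned int ns_root_gid;
};

/* Decide who owns a per-task proc file, following the order in the patch. */
struct owner pick_owner(const struct task_model *t, bool world_readable_dir)
{
	struct owner o;

	assert(t != NULL);

	/* Default to the effective ownership of the task. */
	o.uid = t->euid;
	o.gid = t->egid;

	/* World readable and executable directories keep that default. */
	if (world_readable_dir)
		return o;

	if (!t->has_mm) {
		/* No mm at all: owned by the global root. */
		o.uid = GLOBAL_ROOT_ID;
		o.gid = GLOBAL_ROOT_ID;
	} else if (!t->dumpable) {
		/* Non-dumpable: owned by the root of the mm's user namespace,
		 * falling back to the global root if that id is unmapped. */
		o.uid = (t->ns_root_uid != INVALID_ID) ? t->ns_root_uid
						       : GLOBAL_ROOT_ID;
		o.gid = (t->ns_root_gid != INVALID_ID) ? t->ns_root_gid
						       : GLOBAL_ROOT_ID;
	}
	return o;
}
```

For a non-dumpable task inside a container whose root maps to id 100000,
this model reports 100000 as the owner instead of the global root, which
is the behaviour change the commit message describes.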

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c | 102 ++---
 fs/proc/fd.c   |  12 +--
 fs/proc/internal.h |  16 ++---
 3 files changed, 61 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ee2e0ec..5bfdb61 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1676,12 +1676,63 @@ static int proc_pid_readlink(struct dentry * dentry, 
char __user * buffer, int b
 
 /* building an inode */
 
+void task_dump_owner(struct task_struct *task, mode_t mode,
+kuid_t *ruid, kgid_t *rgid)
+{
+   /* Depending on the state of dumpable compute who should own a
+* proc file for a task.
+*/
+   const struct cred *cred;
+   kuid_t uid;
+   kgid_t gid;
+
+   /* Default to the tasks effective ownership */
+   rcu_read_lock();
+   cred = __task_cred(task);
+   uid = cred->euid;
+   gid = cred->egid;
+   rcu_read_unlock();
+
+   /*
+* Before the /proc/pid/status file was created the only way to read
+* the effective uid of a /process was to stat /proc/pid.  Reading
+* /proc/pid/status is slow enough that procps and other packages
+* kept stating /proc/pid.  To keep the rules in /proc simple I have
+* made this apply to all per process world readable and executable
+* directories.
+*/
+   if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) {
+   struct mm_struct *mm;
+   task_lock(task);
+   mm = task->mm;
+   /* Make non-dumpable tasks owned by some root */
+   if (mm) {
+   if (get_dumpable(mm) != SUID_DUMP_USER) {
+   struct user_namespace *user_ns = mm->user_ns;
+
+   uid = make_kuid(user_ns, 0);
+   if (!uid_valid(uid))
+   uid = GLOBAL_ROOT_UID;
+
+   gid = make_kgid(user_ns, 0);
+   if (!gid_valid(gid))
+   gid = GLOBAL_ROOT_GID;
+   }
+   } else {
+   uid = GLOBAL_ROOT_UID;
+   gid = GLOBAL_ROOT_GID;
+   }
+   task_unlock(task);
+   }
+   *ruid = uid;
+   *rgid = gid;
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
-   const struct cred *cred;
 
/* We need a new inode */
 
@@ -1703,13 +1754,7 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb,
if (!ei->pid)
goto out_unlock;
 
-   if (task_dumpable(task)) {
-   rcu_read_lock();
-   cred = __task_cred(task);
-   inode->i_uid = cred->euid;
-   inode->i_gid = cred->egid;
-   rcu_read_unlock();
-   }
+   task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
security_task_to_inode(task, inode);
 
 out:
@@ -1724,7 +1769,6 @@ int pid_getattr(struct vfsmount *mnt, struct dentry 
*dentry, struct kstat *stat)
 {
struct inode *inode = d_inode(dentry);
struct task_struct *task;
-   const struct cred *cred;
struct pid_namespace *pid = dentry->d_sb->s_fs_info;
 
generic_fillattr(inode, stat);
@@ -1742,12 +1786,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry 
*dentry, struct kstat *stat)
 */
return -ENOENT;
}
-   if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
-   task_dumpable(task)) {
-   cred = __task_cred(task);
-   stat->uid = cred->euid;
-   stat->gid = cred->egid;
-   }
+   task_dump_owner(task, inode->i_mode, &stat->uid, &stat->gid);
}
rcu_read_unlock();
return 0;
@@ -1763,18 +1802,11 @@ int pid_getattr(struct vfsmount *mnt, struct dentry 
*dentry, struct kstat *stat)
  * Rewrite the inode's ownerships here because the owning task may have
  * performed a setuid(), etc.
  *
- * Before the /proc/pid/status file was created the only way to read
- * the effective uid of a /process was to stat /proc/pid.  Reading
- * /proc/pid/status is slow enough that procps and other packages
- * kept stating /proc/pid.  To keep the

[PATCH 4.19 0/7] fix a race in release_task when flushing the dentry

2021-01-06 Thread Wen Yang
Dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag set, so they
should be deleted when the process exits.

Suppose the following race appears: 

release_task dput 
-> proc_flush_task 
 -> dentry->d_op->d_delete(dentry) 
-> __exit_signal 
 -> dentry->d_lockref.count--  and return. 

In proc_flush_task(), the dentry is not deleted if another process is still
using it. Meanwhile, in dput(), d_op->d_delete() can run before
__exit_signal() (the pid has not been unhashed yet), so d_delete returns
false and the dentry still cannot be deleted.

The dentry then stays cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent dentry stays cached as well, and
these dentries can only be freed when drop_caches is triggered manually.

This wastes memory. What is more troublesome is that these dentries hold a
reference to the pid: according to commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may become impossible to create a new pid_ns.

This issue was introduced by 60347f6716aa ("pid namespaces: prepare
proc_flust_task() to flush entries from multiple proc trees"), exposed by
f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then
fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc").


Christian Brauner (1):
  clone: add CLONE_PIDFD

Eric W. Biederman (5):
  proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
  proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
  proc: Clear the pieces of proc_inode that proc_evict_inode cares about
  proc: Use d_invalidate in proc_prune_siblings_dcache
  proc: Use a list of inodes to flush from proc

Joel Fernandes (Google) (1):
  pidfd: add polling support

 fs/proc/base.c | 111 -
 fs/proc/inode.c|  67 +--
 fs/proc/internal.h |   4 +-
 fs/proc/proc_sysctl.c  |  45 ++-
 include/linux/pid.h|   6 ++
 include/linux/proc_fs.h|   4 +-
 include/uapi/linux/sched.h |   1 +
 kernel/exit.c  |   4 +-
 kernel/fork.c  | 133 +++--
 kernel/pid.c   |   3 +
 kernel/signal.c|  11 
 11 files changed, 262 insertions(+), 127 deletions(-)

-- 
1.8.3.1



[PATCH v2 4.9 05/10] proc: use %u for pid printing and slightly less stack

2021-01-06 Thread Wen Yang
From: Alexey Dobriyan 

[ Upstream commit e3912ac37e07a13c70675cd75020694de4841c74 ]

PROC_NUMBUF is 13 which is enough for "negative int + \n + \0".

However PIDs and TGIDs are never negative and newline is not a concern,
so use just 10 per integer.
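
The arithmetic behind the buffer sizes can be checked from userspace. The
sketch below is only an illustration and assumes a 32-bit unsigned int;
UINT_DECIMAL_DIGITS, format_id() and signed_decimal_length() are
hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical name: decimal digits in UINT_MAX when unsigned int is 32 bits. */
#define UINT_DECIMAL_DIGITS 10

/* Format an id with "%u" as the patched call sites do; returns its length. */
int format_id(char *buf, size_t size, unsigned int id)
{
	assert(buf != NULL && size > 0);
	return snprintf(buf, size, "%u", id);
}

/* Characters a signed value needs in decimal, excluding the NUL terminator. */
int signed_decimal_length(int value)
{
	return snprintf(NULL, 0, "%d", value);
}
```

With these helpers, the largest unsigned value fits a char[10 + 1] buffer
exactly, while the most negative int needs 11 characters before the
terminator. Adding a newline and the terminator to those 11 characters
gives the 13 bytes of PROC_NUMBUF.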

Link: http://lkml.kernel.org/r/20171120203005.GA27743@avx2
Signed-off-by: Alexey Dobriyan 
Cc: Alexander Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c| 16 
 fs/proc/fd.c  |  2 +-
 fs/proc/self.c|  6 +++---
 fs/proc/thread_self.c |  5 ++---
 4 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5bfdb61..3502a40 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3018,11 +3018,11 @@ static struct dentry *proc_tgid_base_lookup(struct 
inode *dir, struct dentry *de
 static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 {
struct dentry *dentry, *leader, *dir;
-   char buf[PROC_NUMBUF];
+   char buf[10 + 1];
struct qstr name;
 
name.name = buf;
-   name.len = snprintf(buf, sizeof(buf), "%d", pid);
+   name.len = snprintf(buf, sizeof(buf), "%u", pid);
/* no ->d_hash() rejects on procfs */
dentry = d_hash_and_lookup(mnt->mnt_root, &name);
if (dentry) {
@@ -3034,7 +3034,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, 
pid_t pid, pid_t tgid)
return;
 
name.name = buf;
-   name.len = snprintf(buf, sizeof(buf), "%d", tgid);
+   name.len = snprintf(buf, sizeof(buf), "%u", tgid);
leader = d_hash_and_lookup(mnt->mnt_root, &name);
if (!leader)
goto out;
@@ -3046,7 +3046,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, 
pid_t pid, pid_t tgid)
goto out_put_leader;
 
name.name = buf;
-   name.len = snprintf(buf, sizeof(buf), "%d", pid);
+   name.len = snprintf(buf, sizeof(buf), "%u", pid);
dentry = d_hash_and_lookup(dir, &name);
if (dentry) {
d_invalidate(dentry);
@@ -3226,14 +3226,14 @@ int proc_pid_readdir(struct file *file, struct 
dir_context *ctx)
for (iter = next_tgid(ns, iter);
 iter.task;
 iter.tgid += 1, iter = next_tgid(ns, iter)) {
-   char name[PROC_NUMBUF];
+   char name[10 + 1];
int len;
 
cond_resched();
if (!has_pid_permissions(ns, iter.task, 2))
continue;
 
-   len = snprintf(name, sizeof(name), "%d", iter.tgid);
+   len = snprintf(name, sizeof(name), "%u", iter.tgid);
ctx->pos = iter.tgid + TGID_OFFSET;
if (!proc_fill_cache(file, ctx, name, len,
 proc_pid_instantiate, iter.task, NULL)) {
@@ -3557,10 +3557,10 @@ static int proc_task_readdir(struct file *file, struct 
dir_context *ctx)
for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
 task;
 task = next_tid(task), ctx->pos++) {
-   char name[PROC_NUMBUF];
+   char name[10 + 1];
int len;
tid = task_pid_nr_ns(task, ns);
-   len = snprintf(name, sizeof(name), "%d", tid);
+   len = snprintf(name, sizeof(name), "%u", tid);
if (!proc_fill_cache(file, ctx, name, len,
proc_task_instantiate, task, NULL)) {
/* returning this tgid failed, save it as the first
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 00ce153..390c2fe 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -235,7 +235,7 @@ static int proc_readfd_common(struct file *file, struct 
dir_context *ctx,
for (fd = ctx->pos - 2;
 fd < files_fdtable(files)->max_fds;
 fd++, ctx->pos++) {
-   char name[PROC_NUMBUF];
+   char name[10 + 1];
int len;
 
if (!fcheck_files(files, fd))
diff --git a/fs/proc/self.c b/fs/proc/self.c
index f6e2e3f..dd06755 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -35,11 +35,11 @@ static const char *proc_self_get_link(struct dentry *dentry,
 
if (!tgid)
return ERR_PTR(-ENOENT);
-   /* 11 for max length of signed int in decimal + NULL term */
-   name = kmalloc(12, dentry ? GFP_KERNEL : GFP_ATOMIC);
+   /* max length of unsigned int in decimal + NULL term */
+   name = kmalloc(10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);
if (unlikely(!name))
return dentry ? ERR_PTR(-ENOMEM) : ERR_PTR(-ECHILD);
-   sprintf(name, "%d", tgid);
+   sprintf(name, "%u", tgid);
set_delayed_call(done, kfree_link, name);
return name;
 }
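
The buffer sizes used in the diff above can be checked in user space. The
following is a minimal sketch, assuming a 32-bit unsigned int as on Linux;
format_unsigned_id() and signed_decimal_len() are hypothetical helper names
used only for illustration and are not part of the patch.

```c
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper: print an id with "%u", as the patch does.
 * Returns the number of characters snprintf() wanted to write.
 */
int format_unsigned_id(char *buf, size_t size, unsigned int id)
{
	return snprintf(buf, size, "%u", id);
}

/*
 * Hypothetical helper: length of the longest "%d" output, for comparison.
 * snprintf(NULL, 0, ...) only computes the length and writes nothing.
 */
int signed_decimal_len(void)
{
	return snprintf(NULL, 0, "%d", INT_MIN);
}
```

With a 32-bit unsigned int the largest value has 10 digits, so a buffer of
10 + 1 bytes holds it together with the terminating NUL, while "%d" may need
one more byte for the sign.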

[PATCH v2 4.9 08/10] proc: Clear the pieces of proc_inode that proc_evict_inode cares about

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ]

This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
inodes when a process dies that caused me to tidy this up.

The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 920c761..739fb9c 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -32,21 +32,27 @@ static void proc_evict_inode(struct inode *inode)
 {
struct proc_dir_entry *de;
struct ctl_table_header *head;
+   struct proc_inode *ei = PROC_I(inode);
 
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
 
/* Stop tracking associated processes */
-   put_pid(PROC_I(inode)->pid);
+   if (ei->pid) {
+   put_pid(ei->pid);
+   ei->pid = NULL;
+   }
 
/* Let go of any associated proc directory entry */
-   de = PDE(inode);
-   if (de)
+   de = ei->pde;
+   if (de) {
pde_put(de);
+   ei->pde = NULL;
+   }
 
-   head = PROC_I(inode)->sysctl;
+   head = ei->sysctl;
if (head) {
-   RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+   RCU_INIT_POINTER(ei->sysctl, NULL);
proc_sys_evict_inode(inode, head);
}
 }
-- 
1.8.3.1



[PATCH v2 4.9 10/10] proc: Use a list of inodes to flush from proc

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt, which allows for different
mounts of proc having different mount options even in the same pid
namespace. It also allows for the removal of proc_mnt, which will
trivially allow the first mount of proc to honor its mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  This winds up being
safe (if suboptimal) as this is just an optimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing, its reference count is
taken in release_task and dropped in proc_flush_pid.  Further, the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing is taking place.  This removes a small race
where a dentry could be recreated.

As struct pid is supposed to be small and I need a per pid lock,
I reuse the only lock that currently exists in struct pid, the
wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.
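
The scheme described above can be sketched as a small user-space model. This
is only an illustration under stated assumptions: every model_* name is
hypothetical, a C11 atomic_flag spinlock stands in for wait_pidfd.lock, a
plain singly linked list stands in for the RCU hlist, and setting a flag
stands in for d_invalidate on the directory.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a proc inode that belongs to a pid. */
struct model_inode {
	struct model_inode *next;	/* stand-in for sibling_inodes */
	int is_dir;			/* only directories are tracked */
	int invalidated;		/* set when the model "flushes" it */
};

/* Hypothetical stand-in for struct pid with its list of inodes. */
struct model_pid {
	atomic_flag lock;		/* stand-in for wait_pidfd.lock */
	struct model_inode *inodes;	/* stand-in for pid->inodes */
};

static void model_lock(struct model_pid *pid)
{
	while (atomic_flag_test_and_set(&pid->lock))
		;	/* spin until the flag is released */
}

static void model_unlock(struct model_pid *pid)
{
	atomic_flag_clear(&pid->lock);
}

void model_pid_init(struct model_pid *pid)
{
	atomic_flag_clear(&pid->lock);
	pid->inodes = NULL;
}

/* Let the pid remember a directory inode; other files are skipped. */
void model_track_inode(struct model_pid *pid, struct model_inode *ino)
{
	ino->next = NULL;
	ino->invalidated = 0;
	if (!ino->is_dir)
		return;
	model_lock(pid);
	ino->next = pid->inodes;
	pid->inodes = ino;
	model_unlock(pid);
}

/*
 * Detach one inode at a time under the lock, then invalidate it with
 * the lock dropped. Returns the number of inodes flushed.
 */
int model_flush_pid(struct model_pid *pid)
{
	int flushed = 0;

	for (;;) {
		struct model_inode *ino;

		model_lock(pid);
		ino = pid->inodes;
		if (ino) {
			pid->inodes = ino->next;
			ino->next = NULL;
		}
		model_unlock(pid);
		if (!ino)
			break;
		ino->invalidated = 1;
		flushed++;
	}
	return flushed;
}
```

Flushing twice is harmless in this model because the list is already empty
the second time, which mirrors the statement that an early call of
proc_flush_pid is safe, if suboptimal.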

v2: Initialize pid->inodes.  I somehow failed to get that
initialization into the initial version of the patch.  A boot
failure was reported by "kernel test robot ", and
failure to initialize that pid->inodes matches all of the reported
symptoms.

Signed-off-by: Eric W. Biederman 
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid
namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush
entries from multiple proc trees")
Cc:  # 4.9.x: b3e583825266: clone: add CLONE_PIDFD
Cc:  # 4.9.x: b53b0b9d9a61: pidfd: add polling
support
Cc:  # 4.9.x: db978da8fa1d: proc: Pass file mode to
proc_pid_make_inode
Cc:  # 4.9.x: 68eb94f16227: proc: Better ownership of
files for non-dumpable tasks in user namespaces
Cc:  # 4.9.x: e3912ac37e07: proc: use %u for pid
printing and slightly less stack
Cc:  # 4.9.x: 0afa5ca82212: proc: Rename in
proc_inode rename sysctl_inodes sibling_inodes
Cc:  # 4.9.x: 26dbc60f385f: proc: Generalize
proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc:  # 4.9.x: 71448011ea2a: proc: Clear the pieces of
proc_inode that proc_evict_inode cares about
Cc:  # 4.9.x: f90f3cafe8d5: Use d_invalidate in
proc_prune_siblings_dcache
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for 7bc3e6e55acf)
Signed-off-by: Wen Yang 
---
 fs/proc/base.c  | 111 
 fs/proc/inode.c |   2 +-
 fs/proc/internal.h  |   1 +
 include/linux/pid.h |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c   |   5 ++-
 kernel/pid.c|   1 +
 7 files changed, 45 insertions(+), 80 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3502a40..11caf35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t 
mode,
*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+   struct pid *pid = ei->pid;
+
+   if (S_ISDIR(ei->vfs_inode.i_mode)) {
+   spin_lock(&pid->wait_pidfd.lock);
+   hlist_del_init_rcu(&ei->sibling_inodes);
+   spin_unlock(&pid->wait_pidfd.lock);
+   }
+
+   put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
+   struct pid *pid;
 
/* We need a new inode */
 
@@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb,
/*
 * grab the reference to task.
 */
-   ei->pid = get_task_pid(task, PIDTYPE_PID);
-   if (!ei->pid)
+   pid = get_task_pid(task, PIDTYPE_PID);
+   if (!pid)
goto out_unlock;
 
+   /* Let the pid remember us for quick removal */
+   ei->pid = pid;
+   if (S_ISDIR(mode)) {
+   spin_lock(&pid->wait_pidfd.lock);
+   hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+   spi

[PATCH v2 4.9 01/10] clone: add CLONE_PIDFD

2021-01-06 Thread Wen Yang
From: Christian Brauner 

[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the caller's
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
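
As an illustration of the "Pid:\t%d" format mentioned above, here is a minimal
user-space sketch that extracts the pid from fdinfo-style text. The function
name parse_fdinfo_pid() and the -1 "not found" return value are assumptions
made for this example only; the other fdinfo fields in the usage below are
sample data.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: scan fdinfo-style text line by line and return
 * the number from the first "Pid:\t<n>" line, or -1 if there is none.
 */
long parse_fdinfo_pid(const char *text)
{
	const char *line = text;

	while (line && *line) {
		long pid;

		/* The literal "Pid:" must match at the start of the line. */
		if (sscanf(line, "Pid:\t%ld", &pid) == 1)
			return pid;

		/* Advance to the character after the next newline. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

A caller would read /proc/self/fdinfo/ followed by the fd number into a
buffer and pass that buffer to the helper.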

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD, and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds 
Signed-off-by: Christian Brauner 
Co-developed-by: Jann Horn 
Signed-off-by: Jann Horn 
Reviewed-by: Oleg Nesterov 
Cc: Arnd Bergmann 
Cc: "Eric W. Biederman" 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: David Howells 
Cc: "Michael Kerrisk (man-pages)" 
Cc: Andy Lutomirsky 
Cc: Andrew Morton 
Cc: Aleksa Sarai 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc:  # 4.9.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang 
---
 include/linux/pid.h|   1 +
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c  | 105 +++--
 3 files changed, 103 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 97b745d..7599a78 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,7 @@ struct pid_link
struct hlist_node node;
struct pid *pid;
 };
+extern const struct file_operations pidfd_fops;
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..ed6e31d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -9,6 +9,7 @@
 #define CLONE_FS   0x0200  /* set if fs info shared between 
processes */
 #define CLONE_FILES0x0400  /* set if open files shared between 
processes */
 #define CLONE_SIGHAND  0x0800  /* set if signal handlers and blocked 
signals shared */
+#define CLONE_PIDFD0x1000  /* set if a pidfd should be placed in 
parent */
 #define CLONE_PTRACE   0x2000  /* set if we want to let tracing 
continue on the child too */
 #define CLONE_VFORK0x4000  /* set if the parent wants the child to 
wake it up on mm_release */
 #define CLONE_PARENT   0x8000  /* set if we want to have the same 
parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index b64efec..4249f60 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -1460,6 +1461,58 @@ static void posix_cpu_timers_init(struct task_struct 
*tsk)
 task->pids[type].pid = pid;
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+   struct pid *pid = file->private_data;
+
+   file->private_data = NULL;
+   put_pid(pid);
+   return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+   struct pid_namespace *ns = file_inode(m->file)->i_sb->s_fs_info;
+   struct pid *pid = f->private_data;
+
+   seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+   seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+   .release = pidfd_release,
+#ifdef CONFIG_PROC_FS
+   .show_fdinfo = pidfd_show_fdinfo,
+#endif
+};
+
+/**
+ * pidfd_create() - Create a new pid file descriptor.
+ *
+ * @pid:  struct pid that the pidfd will reference
+ *
+ * This 

[PATCH v2 4.9 07/10] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 26dbc60f385ff9cff475ea2a3bad02e80fd6fa43 ]

This prepares the way for allowing the pid part of proc to use this
dcache pruning code as well.

Signed-off-by: Eric W. Biederman 
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for 26dbc60f385f)
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 38 ++
 fs/proc/internal.h|  1 +
 fs/proc/proc_sysctl.c | 35 +--
 3 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 14d9c1d..920c761 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -101,6 +101,44 @@ void __init proc_init_inodecache(void)
 init_once);
 }
 
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+{
+   struct inode *inode;
+   struct proc_inode *ei;
+   struct hlist_node *node;
+   struct super_block *sb;
+
+   rcu_read_lock();
+   for (;;) {
+   node = hlist_first_rcu(inodes);
+   if (!node)
+   break;
+   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
+   spin_lock(lock);
+   hlist_del_init_rcu(&ei->sibling_inodes);
+   spin_unlock(lock);
+
+   inode = &ei->vfs_inode;
+   sb = inode->i_sb;
+   if (!atomic_inc_not_zero(&sb->s_active))
+   continue;
+   inode = igrab(inode);
+   rcu_read_unlock();
+   if (unlikely(!inode)) {
+   deactivate_super(sb);
+   rcu_read_lock();
+   continue;
+   }
+
+   d_prune_aliases(inode);
+   iput(inode);
+   deactivate_super(sb);
+
+   rcu_read_lock();
+   }
+   rcu_read_unlock();
+}
+
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
struct super_block *sb = root->d_sb;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 409b5c5..9bc44a1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,6 +200,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 extern void proc_init_inodecache(void);
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 extern struct inode *proc_get_inode(struct super_block *, struct 
proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 671490e..f19063b 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -262,40 +262,7 @@ static void unuse_table(struct ctl_table_header *p)
 
 static void proc_sys_prune_dcache(struct ctl_table_header *head)
 {
-   struct inode *inode;
-   struct proc_inode *ei;
-   struct hlist_node *node;
-   struct super_block *sb;
-
-   rcu_read_lock();
-   for (;;) {
-   node = hlist_first_rcu(&head->inodes);
-   if (!node)
-   break;
-   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
-   spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&ei->sibling_inodes);
-   spin_unlock(&sysctl_lock);
-
-   inode = &ei->vfs_inode;
-   sb = inode->i_sb;
-   if (!atomic_inc_not_zero(&sb->s_active))
-   continue;
-   inode = igrab(inode);
-   rcu_read_unlock();
-   if (unlikely(!inode)) {
-   deactivate_super(sb);
-   rcu_read_lock();
-   continue;
-   }
-
-   d_prune_aliases(inode);
-   iput(inode);
-   deactivate_super(sb);
-
-   rcu_read_lock();
-   }
-   rcu_read_unlock();
+   proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
-- 
1.8.3.1



[PATCH v2 4.9 06/10] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 0afa5ca82212247456f9de1468b595a111fee633 ]

I am about to need and use the same functionality for pid based
inodes and there is no point in adding a second field when
this field is already here and serving the same purpose.

Just give the field a generic name so it is clear that
it is no longer sysctl specific.

Also for good measure initialize sibling_inodes when
proc_inode is initialized.

Signed-off-by: Eric W. Biederman 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 1 +
 fs/proc/internal.h| 2 +-
 fs/proc/proc_sysctl.c | 8 
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a289349..14d9c1d 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -67,6 +67,7 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
ei->pde = NULL;
ei->sysctl = NULL;
ei->sysctl_entry = NULL;
+   INIT_HLIST_NODE(&ei->sibling_inodes);
ei->ns_ops = NULL;
inode = &ei->vfs_inode;
return inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 103435f..409b5c5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -65,7 +65,7 @@ struct proc_inode {
struct proc_dir_entry *pde;
struct ctl_table_header *sysctl;
struct ctl_table *sysctl_entry;
-   struct hlist_node sysctl_inodes;
+   struct hlist_node sibling_inodes;
const struct proc_ns_operations *ns_ops;
struct inode vfs_inode;
 };
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 191573a..671490e 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -272,9 +272,9 @@ static void proc_sys_prune_dcache(struct ctl_table_header 
*head)
node = hlist_first_rcu(&head->inodes);
if (!node)
break;
-   ei = hlist_entry(node, struct proc_inode, sysctl_inodes);
+   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&ei->sysctl_inodes);
+   hlist_del_init_rcu(&ei->sibling_inodes);
spin_unlock(&sysctl_lock);
 
inode = &ei->vfs_inode;
@@ -480,7 +480,7 @@ static struct inode *proc_sys_make_inode(struct super_block 
*sb,
}
ei->sysctl = head;
ei->sysctl_entry = table;
-   hlist_add_head_rcu(&ei->sysctl_inodes, &head->inodes);
+   hlist_add_head_rcu(&ei->sibling_inodes, &head->inodes);
head->count++;
spin_unlock(&sysctl_lock);
 
@@ -511,7 +511,7 @@ static struct inode *proc_sys_make_inode(struct super_block 
*sb,
 void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
 {
spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&PROC_I(inode)->sysctl_inodes);
+   hlist_del_init_rcu(&PROC_I(inode)->sibling_inodes);
if (!--head->count)
kfree_rcu(head, rcu);
spin_unlock(&sysctl_lock);
-- 
1.8.3.1



[PATCH v2 4.9 09/10] proc: Use d_invalidate in proc_prune_siblings_dcache

2021-01-06 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ]

The function d_prune_aliases has the problem that it will only prune
aliases that are completely unused.  It will not remove aliases for
the dcache or even think of removing mounts from the dcache.  For that
behavior d_invalidate is needed.

To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.

For completeness the directory and the non-directory cases are
separated because in theory (although not currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.

Plus the differences between d_find_any_alias and d_find_alias make
it clear why the directory and non-directory code do not share code.

To make it clear that these routines now invalidate dentries, rename
proc_prune_siblings_dcache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache to proc_sys_invalidate_dcache.

V2: Split the directory and non-directory cases to make this
code robust to future changes in proc.

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for f90f3cafe8d5)
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 16 ++--
 fs/proc/internal.h|  2 +-
 fs/proc/proc_sysctl.c |  8 
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 739fb9c..2af9f4f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -107,7 +107,7 @@ void __init proc_init_inodecache(void)
 init_once);
 }
 
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t 
*lock)
 {
struct inode *inode;
struct proc_inode *ei;
@@ -136,7 +136,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, 
spinlock_t *lock)
continue;
}
 
-   d_prune_aliases(inode);
+   if (S_ISDIR(inode->i_mode)) {
+   struct dentry *dir = d_find_any_alias(inode);
+   if (dir) {
+   d_invalidate(dir);
+   dput(dir);
+   }
+   } else {
+   struct dentry *dentry;
+   while ((dentry = d_find_alias(inode))) {
+   d_invalidate(dentry);
+   dput(dentry);
+   }
+   }
iput(inode);
deactivate_super(sb);
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9bc44a1..6a1d679 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,7 +200,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 extern void proc_init_inodecache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t 
*lock);
 extern struct inode *proc_get_inode(struct super_block *, struct 
proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index f19063b..b6668a5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -260,9 +260,9 @@ static void unuse_table(struct ctl_table_header *p)
complete(p->unregistering);
 }
 
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
 {
-   proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+   proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
@@ -284,10 +284,10 @@ static void start_unregistering(struct ctl_table_header 
*p)
spin_unlock(&sysctl_lock);
}
/*
-* Prune dentries for unregistered sysctls: namespaced sysctls
+* Invalidate dentries for unregistered sysctls: namespaced sysctls
 * can have duplicate names and contaminate dcache very badly.
 */
-   proc_sys_prune_dcache(p);
+   proc_sys_invalidate_dcache(p);
/*
 * do not remove from the list until nobody holds it; walking the
 * list in do_sysctl() relies on that.
-- 
1.8.3.1



[PATCH v2 4.9 03/10] proc: Pass file mode to proc_pid_make_inode

2021-01-06 Thread Wen Yang
From: Andreas Gruenbacher 

[ Upstream commit db978da8fa1d0819b210c137d31a339149b88875 ]

Pass the file mode of the proc inode to be created to
proc_pid_make_inode.  In proc_pid_make_inode, initialize inode->i_mode
before calling security_task_to_inode.  This allows selinux to set
isec->sclass right away without introducing "half-initialized" inode
security structs.

Signed-off-by: Andreas Gruenbacher 
Signed-off-by: Paul Moore 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c   | 23 +--
 fs/proc/fd.c |  6 ++
 fs/proc/internal.h   |  2 +-
 fs/proc/namespaces.c |  3 +--
 security/selinux/hooks.c |  1 +
 5 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b9e4183..ee2e0ec 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1676,7 +1676,8 @@ static int proc_pid_readlink(struct dentry * dentry, char 
__user * buffer, int b
 
 /* building an inode */
 
-struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct 
*task)
+struct inode *proc_pid_make_inode(struct super_block * sb,
+ struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
@@ -1690,6 +1691,7 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb, struct task_struct *t
 
/* Common stuff */
ei = PROC_I(inode);
+   inode->i_mode = mode;
inode->i_ino = get_next_ino();
inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
inode->i_op = &proc_def_inode_operations;
@@ -2041,7 +2043,9 @@ struct map_files_info {
struct proc_inode *ei;
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFLNK |
+   ((mode & FMODE_READ ) ? S_IRUSR : 0) |
+   ((mode & FMODE_WRITE) ? S_IWUSR : 0));
if (!inode)
return -ENOENT;
 
@@ -2050,12 +2054,6 @@ struct map_files_info {
 
inode->i_op = &proc_map_files_link_inode_operations;
inode->i_size = 64;
-   inode->i_mode = S_IFLNK;
-
-   if (mode & FMODE_READ)
-   inode->i_mode |= S_IRUSR;
-   if (mode & FMODE_WRITE)
-   inode->i_mode |= S_IWUSR;
 
d_set_d_op(dentry, &tid_map_files_dentry_operations);
d_add(dentry, inode);
@@ -2409,12 +2407,11 @@ static int proc_pident_instantiate(struct inode *dir,
struct inode *inode;
struct proc_inode *ei;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, p->mode);
if (!inode)
goto out;
 
ei = PROC_I(inode);
-   inode->i_mode = p->mode;
if (S_ISDIR(inode->i_mode))
set_nlink(inode, 2);/* Use getattr to fix if necessary */
if (p->iop)
@@ -3096,11 +3093,10 @@ static int proc_pid_instantiate(struct inode *dir,
 {
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | 
S_IXUGO);
if (!inode)
goto out;
 
-   inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
inode->i_op = &proc_tgid_base_inode_operations;
inode->i_fop = &proc_tgid_base_operations;
inode->i_flags|=S_IMMUTABLE;
@@ -3391,11 +3387,10 @@ static int proc_task_instantiate(struct inode *dir,
struct dentry *dentry, struct task_struct *task, const void *ptr)
 {
struct inode *inode;
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | 
S_IXUGO);
 
if (!inode)
goto out;
-   inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
inode->i_op = &proc_tid_base_inode_operations;
inode->i_fop = &proc_tid_base_operations;
inode->i_flags|=S_IMMUTABLE;
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index d21dafe..4274f83 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -183,14 +183,13 @@ static int proc_fd_link(struct dentry *dentry, struct 
path *path)
struct proc_inode *ei;
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFLNK);
if (!inode)
goto out;
 
ei = PROC_I(inode);
ei->fd = fd;
 
-   inode->i_mode = S_IFLNK;
inode->i_op = &proc_pid_link_inode_operations;
inode->i_size = 64;
 
@@ -322,14 +321,13 @@ int proc_fd_permission(struct inode *inode, int mask)
struct proc_inode *ei;
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFREG | S_IRUSR);
if

[PATCH v2 4.9 02/10] pidfd: add polling support

2021-01-06 Thread Wen Yang
From: "Joel Fernandes (Google)" 

[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]

This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between when LMK sends a kill signal and checks for existence of the
PID, the wrong process is possibly checked for existence.
Using the polling support, LMK will be able to get notified when a process
exits in a race-free and fast way, and it allows the LMK to do other things
(such as polling on other fds) while waiting for the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be an option in this case because the task can
   be reaped and destroyed before the poll returns.

2. Including the waitqueue in struct pid means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.

Cc: Andy Lutomirski 
Cc: Steven Rostedt 
Cc: Daniel Colascione 
Cc: Jann Horn 
Cc: Tim Murray 
Cc: Jonathan Kowalski 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: Kees Cook 
Cc: David Howells 
Cc: Oleg Nesterov 
Cc: kernel-t...@android.com
Reviewed-by: Oleg Nesterov 
Co-developed-by: Daniel Colascione 
Signed-off-by: Daniel Colascione 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Christian Brauner 
Cc:  # 4.9.x
(pidfd: fix up cherry-pick conflicts for b53b0b9d9a61)
Signed-off-by: Wen Yang 
---
 include/linux/pid.h |  3 +++
 kernel/fork.c   | 26 ++
 kernel/pid.c|  2 ++
 kernel/signal.c | 11 +++
 4 files changed, 42 insertions(+)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7599a78..f5552ba 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -2,6 +2,7 @@
 #define _LINUX_PID_H
 
 #include 
+#include 
 
 enum pid_type
 {
@@ -62,6 +63,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+   /* wait queue for pidfd notifications */
+   wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 4249f60..e3a4a14 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1481,8 +1481,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct 
file *f)
 }
 #endif
 
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct 
*pts)
+{
+   struct task_struct *task;
+   struct pid *pid = file->private_data;
+   int poll_flags = 0;
+
+   poll_wait(file, &pid->wait_pidfd, pts);
+
+   rcu_read_lock();
+   task = pid_task(pid, PIDTYPE_PID);
+   /*
+* Inform pollers only when the whole thread group exits.
+* If the thread group leader exits before all other threads in the
+* group, then poll(2) should block, similar to the wait(2) family.
+*/
+   if (!task || (task->exit_state && thread_group_empty(task)))
+   poll_flags = POLLIN | POLLRDNORM;
+   rcu_read_unlock();
+
+   return poll_flags;
+}
+
 const struct file_operations pidfd_fops = {
.release = pidfd_release,
+   .poll = pidfd_poll,
 #ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index fa704f8..e605398 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -333,6 +333,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
 
+   init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index bedca16..053de87a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1632,6 +1632,14 @@ int send_sigqueue(struct sigqueue *q, struct task_struct 
*t, int group)
return ret;
 }
 
+static void do_notify_pidfd(struct task_struct *task)
+{
+   struct pid *pid;
+
+   pid = task_pid(task);
+   wake_up_all(&pid->wait_pidfd);
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/continued status change, use do_notif

[PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry

2021-01-06 Thread Wen Yang
The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, so they
should be deleted when the process exits.

Suppose the following race appears: 

release_task dput 
-> proc_flush_task 
 -> dentry->d_op->d_delete(dentry) 
-> __exit_signal 
 -> dentry->d_lockref.count--  and return. 

In proc_flush_task(), if another process is using this dentry, it will
not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
before __exit_signal() (the pid has not been unhashed yet), so d_delete
returns false and this dentry still cannot be deleted.

This dentry will stay cached (although its count is 0 and the
DCACHE_OP_DELETE flag is set), its parent dentry will be cached too, and
these dentries can only be deleted when drop_caches is manually triggered.

This will result in wasted memory. What's more troublesome is that these
dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
limit on the number of pid namespaces"), if the pid cannot be released, it
may result in the inability to create a new pid_ns.

This issue was introduced by 60347f6716aa ("pid namespaces: prepare
proc_flust_task() to flush entries from multiple proc trees"), exposed by
f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then
fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc").


Alexey Dobriyan (1):
  proc: use %u for pid printing and slightly less stack

Andreas Gruenbacher (1):
  proc: Pass file mode to proc_pid_make_inode

Christian Brauner (1):
  clone: add CLONE_PIDFD

Eric W. Biederman (6):
  proc: Better ownership of files for non-dumpable tasks in user
namespaces
  proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
  proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
  proc: Clear the pieces of proc_inode that proc_evict_inode cares about
  proc: Use d_invalidate in proc_prune_siblings_dcache
  proc: Use a list of inodes to flush from proc

Joel Fernandes (Google) (1):
  pidfd: add polling support

 fs/proc/base.c | 242 -
 fs/proc/fd.c   |  20 +---
 fs/proc/inode.c|  67 -
 fs/proc/internal.h |  22 ++---
 fs/proc/namespaces.c   |   3 +-
 fs/proc/proc_sysctl.c  |  45 ++---
 fs/proc/self.c |   6 +-
 fs/proc/thread_self.c  |   5 +-
 include/linux/pid.h|   5 +
 include/linux/proc_fs.h|   4 +-
 include/uapi/linux/sched.h |   1 +
 kernel/exit.c  |   5 +-
 kernel/fork.c  | 131 +++-
 kernel/pid.c   |   3 +
 kernel/signal.c|  11 +++
 security/selinux/hooks.c   |   1 +
 16 files changed, 343 insertions(+), 228 deletions(-)

-- 
1.8.3.1



Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry

2021-01-03 Thread Wen Yang




On 2020/12/31 at 5:22 PM, Greg Kroah-Hartman wrote:

On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote:



On 2020/12/4 at 2:31 AM, Wen Yang wrote:

Dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag set, so they
should be deleted when the process exits.

Suppose the following race appears:

release_task                    dput
-> proc_flush_task
                                -> dentry->d_op->d_delete(dentry)
-> __exit_signal
                                -> dentry->d_lockref.count--  and return.

In proc_flush_task(), if another process is still using this dentry, it is
not deleted. Meanwhile, in dput(), d_op->d_delete() can run before
__exit_signal() (the pid has not yet been unhashed), so d_delete() returns
false and the dentry still cannot be deleted.

The dentry then stays cached indefinitely (although its count is 0 and the
DCACHE_OP_DELETE flag is set), and so does its parent dentry. These dentries
can only be freed when drop_caches is triggered manually.

This wastes memory. More troublesome, these dentries hold a reference to the
pid. Since commit f333c700c610 ("pidns: Add a limit on the number of pid
namespaces"), a pid that cannot be released may make it impossible to create
a new pid_ns.

This problem occurred in our cluster environment (Linux 4.9 LTS).
We could reproduce it by manually constructing a test program + adding some
debugging switches in the kernel:
* A test program to open the directory (/proc/<pid>/ns) [1]
* Adding some debugging switches to the kernel, adding a delay between
 proc_flush_task and __exit_signal in release_task() [2]

The test process is as follows:

A, terminal #1

Turn on the debug switch:
echo 1 > /proc/sys/vm/dentry_debug_trace

Execute the following unshare command:
sudo unshare --pid --fork --mount-proc bash


B, terminal #2

Find the pid of the unshare process:

# pstree -p | grep unshare
 | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)


Find the corresponding dentry:
# dmesg | grep pid=818
[70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 
entry=818/8802c7b670e8


C, terminal #3

Run the opendir test program, which opens the /proc/818/ns/ directory and keeps it open:

# ./a.out /proc/818/ns/
pid: 876
.
..
net
uts
ipc
pid
user
mnt
cgroup
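
The test program referenced as [1] is not included in this thread. A minimal sketch of what such a program might look like follows: it prints its own pid, lists the directory, and leaves the directory stream open so that the dentry stays referenced. The function name hold_dir_open() is hypothetical, and the real test program may well differ.

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open @path, print our pid and every entry name,
 * and return the still-open DIR stream (NULL on failure). Keeping the
 * stream open keeps a reference on the directory's dentry.
 */
static DIR *hold_dir_open(const char *path, int *nr_entries)
{
	struct dirent *de;
	DIR *dir = opendir(path);

	*nr_entries = 0;
	if (!dir)
		return NULL;

	printf("pid: %d\n", (int)getpid());
	while ((de = readdir(dir)) != NULL) {
		printf("%s\n", de->d_name);
		(*nr_entries)++;
	}
	return dir;
}
```

A main() built around this helper would call hold_dir_open(argv[1], &n) and then pause() until it is killed, which matches the `kill -9 876` step later in the procedure.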

D, go back to terminal #2

Turn on the debugging switches to construct the race:
# echo 818 > /proc/sys/vm/dentry_debug_pid
# echo 1 > /proc/sys/vm/dentry_debug_delay

Kill the unshare process (pid 818). Since the debugging switches have been
turned on, it will get stuck in release_task():
# kill -9 818

Then kill the process that opened the /proc/818/ns/ directory:
# kill -9 876

Then turn off these debugging switches to allow the 818 process to exit:
# echo 0 > /proc/sys/vm/dentry_debug_delay
# echo 0 > /proc/sys/vm/dentry_debug_pid

Checking dmesg, we find that the count of the dentry (/proc/818/ns) is 0 and
its flags are 0x2800cc (DCACHE_OP_DELETE, defined as 0x0008, is set), but it
is still cached:
# dmesg | grep 8802a3999548
…
[565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, 
inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, 
pinode=8802c7812010, keywords: be cached
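
As a quick check of the flag values in the log line above, the following sketch tests the DCACHE_OP_DELETE bit in both the dentry flags (0x2800cc) and the parent flags (0x20008c). The macro value is the one quoted in the message; the helper name has_op_delete() is hypothetical.

```c
#define DCACHE_OP_DELETE 0x0008	/* value quoted in the message above */

/* Hypothetical helper: returns 1 when the DCACHE_OP_DELETE bit is set */
static int has_op_delete(unsigned int d_flags)
{
	return (d_flags & DCACHE_OP_DELETE) != 0;
}
```

Both has_op_delete(0x2800cc) and has_op_delete(0x20008c) return 1, because the low byte of each value (0xcc and 0x8c) has bit 3 set.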


It could also be verified via the crash tool:

crash> dentry.d_flags,d_iname,d_inode,d_lockref -x 8802bea7b528
  d_flags = 0x2800cc
  d_iname = "ns\000"
  d_inode = 0x8802b38c2010
  d_lockref = {
    {
      lock_count = 0x0,
      {
        lock = {
          {
            rlock = {
              raw_lock = {
                {
                  val = {
                    counter = 0x0
                  },
                  {
                    locked = 0x0,
                    pending = 0x0
                  },
                  {
                    locked_pending = 0x0,
                    tail = 0x0
                  }
                }
              }
            }
          }
        },
        count = 0x0
      }
    }
  }
crash> kmem 8802bea7b528
CACHE         OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE  NAME
8802dd5f5900      192      23663  26130    871    16k  dentry
  SLAB          MEMORY        NODE  TOTAL  ALLOCATED  FREE
  ea000afa9e00  8802bea78000     0     30         25     5
  FREE / [ALLOCATED]
  [8802bea7b520]

PAGE  PHYSICAL  MAPPING  INDEX  CNT  FLAGS
ea000afa9ec0 2bea7b000 dead04000  0 2f8000
crash>

This series of patches is to fix this issue.

Regards,
Wen

Alexey Dobriyan (1):
proc: use %u for pid printing and slightly less stack

Andreas Gruenbacher (1):
proc: Pass file mode to proc_pid_make_inode

Christian Brauner (1):
clone: add CLONE_PIDFD

Eric W. Biederman (6):
proc: Better ownership of files for non-dumpable tasks in user

Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry

2020-12-16 Thread Wen Yang




On 2020/12/4 at 2:31 AM, Wen Yang wrote:

Dentries such as /proc/<pid>/ns/ have the DCACHE_OP_DELETE flag set, so they
should be deleted when the process exits.

Suppose the following race appears:

release_task                    dput
-> proc_flush_task
                                -> dentry->d_op->d_delete(dentry)
-> __exit_signal
                                -> dentry->d_lockref.count--  and return.

In proc_flush_task(), if another process is still using this dentry, it is
not deleted. Meanwhile, in dput(), d_op->d_delete() can run before
__exit_signal() (the pid has not yet been unhashed), so d_delete() returns
false and the dentry still cannot be deleted.

The dentry then stays cached indefinitely (although its count is 0 and the
DCACHE_OP_DELETE flag is set), and so does its parent dentry. These dentries
can only be freed when drop_caches is triggered manually.

This wastes memory. More troublesome, these dentries hold a reference to the
pid. Since commit f333c700c610 ("pidns: Add a limit on the number of pid
namespaces"), a pid that cannot be released may make it impossible to create
a new pid_ns.

This problem occurred in our cluster environment (Linux 4.9 LTS).
We could reproduce it by manually constructing a test program + adding some
debugging switches in the kernel:
* A test program to open the directory (/proc/<pid>/ns) [1]
* Adding some debugging switches to the kernel, adding a delay between
proc_flush_task and __exit_signal in release_task() [2]

The test process is as follows:

A, terminal #1

Turn on the debug switch:
echo 1 > /proc/sys/vm/dentry_debug_trace

Execute the following unshare command:
sudo unshare --pid --fork --mount-proc bash


B, terminal #2

Find the pid of the unshare process:

# pstree -p | grep unshare
| `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818)


Find the corresponding dentry:
# dmesg | grep pid=818
[70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 
entry=818/8802c7b670e8


C, terminal #3

Run the opendir test program, which opens the /proc/818/ns/ directory and keeps it open:

# ./a.out /proc/818/ns/
pid: 876
.
..
net
uts
ipc
pid
user
mnt
cgroup

D, go back to terminal #2

Turn on the debugging switches to construct the race:
# echo 818 > /proc/sys/vm/dentry_debug_pid
# echo 1 > /proc/sys/vm/dentry_debug_delay

Kill the unshare process (pid 818). Since the debugging switches have been
turned on, it will get stuck in release_task():
# kill -9 818

Then kill the process that opened the /proc/818/ns/ directory:
# kill -9 876

Then turn off these debugging switches to allow the 818 process to exit:
# echo 0 > /proc/sys/vm/dentry_debug_delay
# echo 0 > /proc/sys/vm/dentry_debug_pid

Checking dmesg, we find that the count of the dentry (/proc/818/ns) is 0 and
its flags are 0x2800cc (DCACHE_OP_DELETE, defined as 0x0008, is set), but it
is still cached:
# dmesg | grep 8802a3999548
…
[565.559156] XXX dput:853 dentry=ns/8802bea7b528, flag=2800cc, cnt=0, 
inode=8802b38c2010, pdentry=818/8802c7b670e8, pflag=20008c, pcnt=1, 
pinode=8802c7812010, keywords: be cached


It could also be verified via the crash tool:

crash> dentry.d_flags,d_iname,d_inode,d_lockref -x 8802bea7b528
  d_flags = 0x2800cc
  d_iname = "ns\000"
  d_inode = 0x8802b38c2010
  d_lockref = {
    {
      lock_count = 0x0,
      {
        lock = {
          {
            rlock = {
              raw_lock = {
                {
                  val = {
                    counter = 0x0
                  },
                  {
                    locked = 0x0,
                    pending = 0x0
                  },
                  {
                    locked_pending = 0x0,
                    tail = 0x0
                  }
                }
              }
            }
          }
        },
        count = 0x0
      }
    }
  }
crash> kmem 8802bea7b528
CACHE         OBJSIZE  ALLOCATED  TOTAL  SLABS  SSIZE  NAME
8802dd5f5900      192      23663  26130    871    16k  dentry
  SLAB          MEMORY        NODE  TOTAL  ALLOCATED  FREE
  ea000afa9e00  8802bea78000     0     30         25     5
  FREE / [ALLOCATED]
  [8802bea7b520]

PAGE  PHYSICAL  MAPPING  INDEX  CNT  FLAGS
ea000afa9ec0 2bea7b000 dead04000  0 2f8000
crash>

This series of patches is to fix this issue.

Regards,
Wen

Alexey Dobriyan (1):
   proc: use %u for pid printing and slightly less stack

Andreas Gruenbacher (1):
   proc: Pass file mode to proc_pid_make_inode

Christian Brauner (1):
   clone: add CLONE_PIDFD

Eric W. Biederman (6):
   proc: Better ownership of files for non-dumpable tasks in user
 namespaces
   proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
   proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache

[PATCH 10/10] proc: Use a list of inodes to flush from proc

2020-12-03 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]

Rework the flushing of proc to use a list of directory inodes that
need to be flushed.

The list is kept on struct pid not on struct task_struct, as there is
a fixed connection between proc inodes and pids but at least for the
case of de_thread the pid of a task_struct changes.

This removes the dependency on proc_mnt, which allows different mounts
of proc to have different mount options even in the same pid
namespace. It also allows for the removal of proc_mnt, which will
trivially allow the first mount of proc to honor its mount options.

This flushing remains an optimization.  The functions
pid_delete_dentry and pid_revalidate ensure that ordinary dcache
management will not attempt to use dentries past the point their
respective task has died.  When unused the shrinker will
eventually be able to remove these dentries.

There is a case in de_thread where proc_flush_pid can be
called early for a given pid.  This winds up being
safe (if suboptimal) as this is just an optimization.

Only pid directories are put on the list as the other
per pid files are children of those directories and
d_invalidate on the directory will get them as well.

So that the pid can be used during flushing, its reference count is
taken in release_task and dropped in proc_flush_pid.  Further, the call
of proc_flush_pid is moved after the tasklist_lock is released in
release_task so that it is certain that the pid has already been
unhashed when flushing is taking place.  This removes a small race
where a dentry could be recreated.

As struct pid is supposed to be small and I need a per-pid lock,
I reuse the only lock that currently exists in struct pid:
the wait_pidfd.lock.

The net result is that this adds all of this functionality
with just a little extra list management overhead and
a single extra pointer in struct pid.

v2: Initialize pid->inodes.  I somehow failed to get that
initialization into the initial version of the patch.  A boot
failure was reported by "kernel test robot ", and
failure to initialize that pid->inodes matches all of the reported
symptoms.

Signed-off-by: Eric W. Biederman 
Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
Fixes: 60347f6716aa ("pid namespaces: prepare proc_flust_task() to flush entries from multiple proc trees")
Cc:  # 4.9.x: b3e5838: clone: add CLONE_PIDFD
Cc:  # 4.9.x: b53b0b9: pidfd: add polling support
Cc:  # 4.9.x: db978da: proc: Pass file mode to proc_pid_make_inode
Cc:  # 4.9.x: 68eb94f: proc: Better ownership of files for non-dumpable tasks in user namespaces
Cc:  # 4.9.x: e3912ac: proc: use %u for pid printing and slightly less stack
Cc:  # 4.9.x: 0afa5ca: proc: Rename in proc_inode rename sysctl_inodes sibling_inodes
Cc:  # 4.9.x: 26dbc60: proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache
Cc:  # 4.9.x: 7144801: proc: Clear the pieces of proc_inode that proc_evict_inode cares about
Cc:  # 4.9.x: f90f3ca: Use d_invalidate in proc_prune_siblings_dcache
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for 7bc3e6e55acf)
Signed-off-by: Wen Yang 
---
 fs/proc/base.c  | 111 
 fs/proc/inode.c |   2 +-
 fs/proc/internal.h  |   1 +
 include/linux/pid.h |   1 +
 include/linux/proc_fs.h |   4 +-
 kernel/exit.c   |   5 ++-
 kernel/pid.c|   1 +
 7 files changed, 45 insertions(+), 80 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 3502a40..11caf35 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1728,11 +1728,25 @@ void task_dump_owner(struct task_struct *task, mode_t mode,
*rgid = gid;
 }
 
+void proc_pid_evict_inode(struct proc_inode *ei)
+{
+   struct pid *pid = ei->pid;
+
+   if (S_ISDIR(ei->vfs_inode.i_mode)) {
+   spin_lock(&pid->wait_pidfd.lock);
+   hlist_del_init_rcu(&ei->sibling_inodes);
+   spin_unlock(&pid->wait_pidfd.lock);
+   }
+
+   put_pid(pid);
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
+   struct pid *pid;
 
/* We need a new inode */
 
@@ -1750,10 +1764,18 @@ struct inode *proc_pid_make_inode(struct super_block * sb,
/*
 * grab the reference to task.
 */
-   ei->pid = get_task_pid(task, PIDTYPE_PID);
-   if (!ei->pid)
+   pid = get_task_pid(task, PIDTYPE_PID);
+   if (!pid)
goto out_unlock;
 
+   /* Let the pid remember us for quick removal */
+   ei->pid = pid;
+   if (S_ISDIR(mode)) {
+   spin_lock(&pid->wait_pidfd.lock);
+   hlist_add_head_rcu(&ei->sibling_inodes, &pid->inodes);
+   spin_unlock(&pid->wait_pidfd.lock);
+   }

[PATCH 09/10] proc: Use d_invalidate in proc_prune_siblings_dcache

2020-12-03 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit f90f3cafe8d56d593fc509a4185da1d5800efea4 ]

The function d_prune_aliases has the problem that it will only prune
aliases that are completely unused.  It will not remove aliases for
the dcache or even think of removing mounts from the dcache.  For that
behavior d_invalidate is needed.

To use d_invalidate replace d_prune_aliases with d_find_alias followed
by d_invalidate and dput.

For completeness the directory and the non-directory cases are
separated because in theory (although not currently in practice for
proc) directories can only ever have a single dentry while
non-directories can have hardlinks and thus multiple dentries.
As part of this separation use d_find_any_alias for directories
to spare d_find_alias the extra work of doing that.

Plus the differences between d_find_any_alias and d_find_alias make
it clear why the directory and non-directory cases do not share code.

To make it clear these routines now invalidate dentries, rename
proc_prune_siblings_dcache to proc_invalidate_siblings_dcache, and rename
proc_sys_prune_dcache to proc_sys_invalidate_dcache.

V2: Split the directory and non-directory cases.  To make this
code robust to future changes in proc.

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for f90f3cafe8d5)
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 16 ++--
 fs/proc/internal.h|  2 +-
 fs/proc/proc_sysctl.c |  8 
 3 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 739fb9c..2af9f4f 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -107,7 +107,7 @@ void __init proc_init_inodecache(void)
 init_once);
 }
 
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
 {
struct inode *inode;
struct proc_inode *ei;
@@ -136,7 +136,19 @@ void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
continue;
}
 
-   d_prune_aliases(inode);
+   if (S_ISDIR(inode->i_mode)) {
+   struct dentry *dir = d_find_any_alias(inode);
+   if (dir) {
+   d_invalidate(dir);
+   dput(dir);
+   }
+   } else {
+   struct dentry *dentry;
+   while ((dentry = d_find_alias(inode))) {
+   d_invalidate(dentry);
+   dput(dentry);
+   }
+   }
iput(inode);
deactivate_super(sb);
 
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 9bc44a1..6a1d679 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,7 +200,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 extern void proc_init_inodecache(void);
-void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
+void proc_invalidate_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index f19063b..b6668a5 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -260,9 +260,9 @@ static void unuse_table(struct ctl_table_header *p)
complete(p->unregistering);
 }
 
-static void proc_sys_prune_dcache(struct ctl_table_header *head)
+static void proc_sys_invalidate_dcache(struct ctl_table_header *head)
 {
-   proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
+   proc_invalidate_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
@@ -284,10 +284,10 @@ static void start_unregistering(struct ctl_table_header *p)
spin_unlock(&sysctl_lock);
}
/*
-* Prune dentries for unregistered sysctls: namespaced sysctls
+* Invalidate dentries for unregistered sysctls: namespaced sysctls
 * can have duplicate names and contaminate dcache very badly.
 */
-   proc_sys_prune_dcache(p);
+   proc_sys_invalidate_dcache(p);
/*
 * do not remove from the list until nobody holds it; walking the
 * list in do_sysctl() relies on that.
-- 
1.8.3.1



[PATCH 07/10] proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache

2020-12-03 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 26dbc60f385ff9cff475ea2a3bad02e80fd6fa43 ]

This prepares the way for allowing the pid part of proc to use this
dcache pruning code as well.

Signed-off-by: Eric W. Biederman 
Cc:  # 4.9.x
(proc: fix up cherry-pick conflicts for 26dbc60f385f)
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 38 ++
 fs/proc/internal.h|  1 +
 fs/proc/proc_sysctl.c | 35 +--
 3 files changed, 40 insertions(+), 34 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 14d9c1d..920c761 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -101,6 +101,44 @@ void __init proc_init_inodecache(void)
 init_once);
 }
 
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock)
+{
+   struct inode *inode;
+   struct proc_inode *ei;
+   struct hlist_node *node;
+   struct super_block *sb;
+
+   rcu_read_lock();
+   for (;;) {
+   node = hlist_first_rcu(inodes);
+   if (!node)
+   break;
+   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
+   spin_lock(lock);
+   hlist_del_init_rcu(&ei->sibling_inodes);
+   spin_unlock(lock);
+
+   inode = &ei->vfs_inode;
+   sb = inode->i_sb;
+   if (!atomic_inc_not_zero(&sb->s_active))
+   continue;
+   inode = igrab(inode);
+   rcu_read_unlock();
+   if (unlikely(!inode)) {
+   deactivate_super(sb);
+   rcu_read_lock();
+   continue;
+   }
+
+   d_prune_aliases(inode);
+   iput(inode);
+   deactivate_super(sb);
+
+   rcu_read_lock();
+   }
+   rcu_read_unlock();
+}
+
 static int proc_show_options(struct seq_file *seq, struct dentry *root)
 {
struct super_block *sb = root->d_sb;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 409b5c5..9bc44a1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -200,6 +200,7 @@ struct pde_opener {
 extern const struct inode_operations proc_pid_link_inode_operations;
 
 extern void proc_init_inodecache(void);
+void proc_prune_siblings_dcache(struct hlist_head *inodes, spinlock_t *lock);
 extern struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
 extern int proc_fill_super(struct super_block *, void *data, int flags);
 extern void proc_entry_rundown(struct proc_dir_entry *);
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 671490e..f19063b 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -262,40 +262,7 @@ static void unuse_table(struct ctl_table_header *p)
 
 static void proc_sys_prune_dcache(struct ctl_table_header *head)
 {
-   struct inode *inode;
-   struct proc_inode *ei;
-   struct hlist_node *node;
-   struct super_block *sb;
-
-   rcu_read_lock();
-   for (;;) {
-   node = hlist_first_rcu(&head->inodes);
-   if (!node)
-   break;
-   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
-   spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&ei->sibling_inodes);
-   spin_unlock(&sysctl_lock);
-
-   inode = &ei->vfs_inode;
-   sb = inode->i_sb;
-   if (!atomic_inc_not_zero(&sb->s_active))
-   continue;
-   inode = igrab(inode);
-   rcu_read_unlock();
-   if (unlikely(!inode)) {
-   deactivate_super(sb);
-   rcu_read_lock();
-   continue;
-   }
-
-   d_prune_aliases(inode);
-   iput(inode);
-   deactivate_super(sb);
-
-   rcu_read_lock();
-   }
-   rcu_read_unlock();
+   proc_prune_siblings_dcache(&head->inodes, &sysctl_lock);
 }
 
 /* called under sysctl_lock, will reacquire if has to wait */
-- 
1.8.3.1



[PATCH 08/10] proc: Clear the pieces of proc_inode that proc_evict_inode cares about

2020-12-03 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 71448011ea2a1cd36d8f5cbdab0ed716c454d565 ]

This just keeps everything tidier, and allows for using flags like
SLAB_TYPESAFE_BY_RCU where slabs are not always cleared before reuse.
I don't see reuse without reinitializing happening with the proc_inode
but I had a false alarm while reworking flushing of proc dentries and
inodes when a process dies that caused me to tidy this up.

The code is a little easier to follow and reason about this
way so I figured the changes might as well be kept.

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 920c761..739fb9c 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -32,21 +32,27 @@ static void proc_evict_inode(struct inode *inode)
 {
struct proc_dir_entry *de;
struct ctl_table_header *head;
+   struct proc_inode *ei = PROC_I(inode);
 
truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
 
/* Stop tracking associated processes */
-   put_pid(PROC_I(inode)->pid);
+   if (ei->pid) {
+   put_pid(ei->pid);
+   ei->pid = NULL;
+   }
 
/* Let go of any associated proc directory entry */
-   de = PDE(inode);
-   if (de)
+   de = ei->pde;
+   if (de) {
pde_put(de);
+   ei->pde = NULL;
+   }
 
-   head = PROC_I(inode)->sysctl;
+   head = ei->sysctl;
if (head) {
-   RCU_INIT_POINTER(PROC_I(inode)->sysctl, NULL);
+   RCU_INIT_POINTER(ei->sysctl, NULL);
proc_sys_evict_inode(inode, head);
}
 }
-- 
1.8.3.1



[PATCH 05/10] proc: use %u for pid printing and slightly less stack

2020-12-03 Thread Wen Yang
From: Alexey Dobriyan 

[ Upstream commit e3912ac37e07a13c70675cd75020694de4841c74 ]

PROC_NUMBUF is 13 which is enough for "negative int + \n + \0".

However PIDs and TGIDs are never negative and newline is not a concern,
so use just 10 per integer.
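
The arithmetic behind the `10 + 1` buffer can be checked in userspace: the largest 32-bit unsigned value, 4294967295, has 10 decimal digits, so 10 bytes plus the terminating NUL are enough for "%u". The sketch below assumes a 32-bit unsigned int, which holds on Linux ABIs; the helper name format_pid_len() is hypothetical.

```c
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper: format @v with "%u" into a 10 + 1 byte buffer and
 * return the length snprintf() reports. A result of at most 10 means the
 * output was not truncated.
 */
static int format_pid_len(unsigned int v)
{
	char buf[10 + 1];

	return snprintf(buf, sizeof(buf), "%u", v);
}
```

By contrast, "%d" with the most negative int ("-2147483648") needs 11 characters plus the NUL, and the old PROC_NUMBUF of 13 additionally reserved one byte for a newline.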

Link: http://lkml.kernel.org/r/20171120203005.GA27743@avx2
Signed-off-by: Alexey Dobriyan 
Cc: Alexander Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c| 16 
 fs/proc/fd.c  |  2 +-
 fs/proc/self.c|  6 +++---
 fs/proc/thread_self.c |  5 ++---
 4 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 5bfdb61..3502a40 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3018,11 +3018,11 @@ static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *de
 static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
 {
struct dentry *dentry, *leader, *dir;
-   char buf[PROC_NUMBUF];
+   char buf[10 + 1];
struct qstr name;
 
name.name = buf;
-   name.len = snprintf(buf, sizeof(buf), "%d", pid);
+   name.len = snprintf(buf, sizeof(buf), "%u", pid);
/* no ->d_hash() rejects on procfs */
dentry = d_hash_and_lookup(mnt->mnt_root, &name);
if (dentry) {
@@ -3034,7 +3034,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
return;
 
name.name = buf;
-   name.len = snprintf(buf, sizeof(buf), "%d", tgid);
+   name.len = snprintf(buf, sizeof(buf), "%u", tgid);
leader = d_hash_and_lookup(mnt->mnt_root, &name);
if (!leader)
goto out;
@@ -3046,7 +3046,7 @@ static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
goto out_put_leader;
 
name.name = buf;
-   name.len = snprintf(buf, sizeof(buf), "%d", pid);
+   name.len = snprintf(buf, sizeof(buf), "%u", pid);
dentry = d_hash_and_lookup(dir, &name);
if (dentry) {
d_invalidate(dentry);
@@ -3226,14 +3226,14 @@ int proc_pid_readdir(struct file *file, struct dir_context *ctx)
for (iter = next_tgid(ns, iter);
 iter.task;
 iter.tgid += 1, iter = next_tgid(ns, iter)) {
-   char name[PROC_NUMBUF];
+   char name[10 + 1];
int len;
 
cond_resched();
if (!has_pid_permissions(ns, iter.task, 2))
continue;
 
-   len = snprintf(name, sizeof(name), "%d", iter.tgid);
+   len = snprintf(name, sizeof(name), "%u", iter.tgid);
ctx->pos = iter.tgid + TGID_OFFSET;
if (!proc_fill_cache(file, ctx, name, len,
 proc_pid_instantiate, iter.task, NULL)) {
@@ -3557,10 +3557,10 @@ static int proc_task_readdir(struct file *file, struct dir_context *ctx)
for (task = first_tid(proc_pid(inode), tid, ctx->pos - 2, ns);
 task;
 task = next_tid(task), ctx->pos++) {
-   char name[PROC_NUMBUF];
+   char name[10 + 1];
int len;
tid = task_pid_nr_ns(task, ns);
-   len = snprintf(name, sizeof(name), "%d", tid);
+   len = snprintf(name, sizeof(name), "%u", tid);
if (!proc_fill_cache(file, ctx, name, len,
proc_task_instantiate, task, NULL)) {
/* returning this tgid failed, save it as the first
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 00ce153..390c2fe 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -235,7 +235,7 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
for (fd = ctx->pos - 2;
 fd < files_fdtable(files)->max_fds;
 fd++, ctx->pos++) {
-   char name[PROC_NUMBUF];
+   char name[10 + 1];
int len;
 
if (!fcheck_files(files, fd))
diff --git a/fs/proc/self.c b/fs/proc/self.c
index f6e2e3f..dd06755 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -35,11 +35,11 @@ static const char *proc_self_get_link(struct dentry *dentry,
 
if (!tgid)
return ERR_PTR(-ENOENT);
-   /* 11 for max length of signed int in decimal + NULL term */
-   name = kmalloc(12, dentry ? GFP_KERNEL : GFP_ATOMIC);
+   /* max length of unsigned int in decimal + NULL term */
+   name = kmalloc(10 + 1, dentry ? GFP_KERNEL : GFP_ATOMIC);
if (unlikely(!name))
return dentry ? ERR_PTR(-ENOMEM) : ERR_PTR(-ECHILD);
-   sprintf(name, "%d", tgid);
+   sprintf(name, "%u", tgid);
set_delayed_call(done, kfree_link, name);
return name;
 }
d

[PATCH 06/10] proc: Rename in proc_inode rename sysctl_inodes sibling_inodes

2020-12-03 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 0afa5ca82212247456f9de1468b595a111fee633 ]

I am about to need and use the same functionality for pid based
inodes and there is no point in adding a second field when
this field is already here and serving the same purpose.

Just give the field a generic name so it is clear that
it is no longer sysctl specific.

Also for good measure initialize sibling_inodes when
proc_inode is initialized.

Signed-off-by: Eric W. Biederman 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/inode.c   | 1 +
 fs/proc/internal.h| 2 +-
 fs/proc/proc_sysctl.c | 8 
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index a289349..14d9c1d 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -67,6 +67,7 @@ static struct inode *proc_alloc_inode(struct super_block *sb)
ei->pde = NULL;
ei->sysctl = NULL;
ei->sysctl_entry = NULL;
+   INIT_HLIST_NODE(&ei->sibling_inodes);
ei->ns_ops = NULL;
inode = &ei->vfs_inode;
return inode;
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 103435f..409b5c5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -65,7 +65,7 @@ struct proc_inode {
struct proc_dir_entry *pde;
struct ctl_table_header *sysctl;
struct ctl_table *sysctl_entry;
-   struct hlist_node sysctl_inodes;
+   struct hlist_node sibling_inodes;
const struct proc_ns_operations *ns_ops;
struct inode vfs_inode;
 };
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 191573a..671490e 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -272,9 +272,9 @@ static void proc_sys_prune_dcache(struct ctl_table_header *head)
node = hlist_first_rcu(&head->inodes);
if (!node)
break;
-   ei = hlist_entry(node, struct proc_inode, sysctl_inodes);
+   ei = hlist_entry(node, struct proc_inode, sibling_inodes);
spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&ei->sysctl_inodes);
+   hlist_del_init_rcu(&ei->sibling_inodes);
spin_unlock(&sysctl_lock);
 
inode = &ei->vfs_inode;
@@ -480,7 +480,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
}
ei->sysctl = head;
ei->sysctl_entry = table;
-   hlist_add_head_rcu(&ei->sysctl_inodes, &head->inodes);
+   hlist_add_head_rcu(&ei->sibling_inodes, &head->inodes);
head->count++;
spin_unlock(&sysctl_lock);
 
@@ -511,7 +511,7 @@ static struct inode *proc_sys_make_inode(struct super_block *sb,
 void proc_sys_evict_inode(struct inode *inode, struct ctl_table_header *head)
 {
spin_lock(&sysctl_lock);
-   hlist_del_init_rcu(&PROC_I(inode)->sysctl_inodes);
+   hlist_del_init_rcu(&PROC_I(inode)->sibling_inodes);
if (!--head->count)
kfree_rcu(head, rcu);
spin_unlock(&sysctl_lock);
-- 
1.8.3.1



[PATCH 04/10] proc: Better ownership of files for non-dumpable tasks in user namespaces

2020-12-03 Thread Wen Yang
From: "Eric W. Biederman" 

[ Upstream commit 68eb94f16227336a5773b83ecfa8290f1d6b78ce ]

Instead of making the files owned by the GLOBAL_ROOT_USER, make
non-dumpable files whose mm has always lived in a user namespace owned
by the user namespace root.  This allows the container root to have
things work as expected in a container.
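
The ownership rule described above can be sketched in userspace. This is a simplified model of the uid half of the decision only, under stated assumptions: ids are plain integers rather than kuid_t, and model_owner_task, model_owner_uid, MODEL_INVALID_ID and MODEL_GLOBAL_ROOT are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the kernel uses kuid_t and GLOBAL_ROOT_UID. */
#define MODEL_INVALID_ID  ((unsigned int)-1)
#define MODEL_GLOBAL_ROOT 0u

struct model_owner_task {
	unsigned int euid;        /* effective uid of the task */
	bool has_mm;              /* false when task->mm is NULL */
	bool dumpable;            /* get_dumpable(mm) == SUID_DUMP_USER */
	unsigned int ns_root_uid; /* uid 0 of the mm's user namespace as a
				   * global id, MODEL_INVALID_ID if unmapped */
};

/*
 * is_pid_dir stands for mode == (S_IFDIR|S_IRUGO|S_IXUGO), the world
 * readable and executable per-process directories such as /proc/<pid>.
 */
static unsigned int model_owner_uid(const struct model_owner_task *t,
				    bool is_pid_dir)
{
	assert(t != NULL);

	if (is_pid_dir)
		return t->euid;            /* always the effective uid */
	if (!t->has_mm)
		return MODEL_GLOBAL_ROOT;  /* no mm: global root */
	if (t->dumpable)
		return t->euid;            /* dumpable: effective uid */
	if (t->ns_root_uid != MODEL_INVALID_ID)
		return t->ns_root_uid;     /* non-dumpable: namespace root */
	return MODEL_GLOBAL_ROOT;          /* namespace root is unmapped */
}
```

For a non-dumpable task in a container whose root maps to global uid 100000, the model returns 100000 for regular per-process files, which is the behaviour the commit message describes as letting the container root work as expected.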

Signed-off-by: "Eric W. Biederman" 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c | 102 ++---
 fs/proc/fd.c   |  12 +--
 fs/proc/internal.h |  16 ++---
 3 files changed, 61 insertions(+), 69 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ee2e0ec..5bfdb61 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1676,12 +1676,63 @@ static int proc_pid_readlink(struct dentry * dentry, 
char __user * buffer, int b
 
 /* building an inode */
 
+void task_dump_owner(struct task_struct *task, mode_t mode,
+kuid_t *ruid, kgid_t *rgid)
+{
+   /* Depending on the state of dumpable compute who should own a
+* proc file for a task.
+*/
+   const struct cred *cred;
+   kuid_t uid;
+   kgid_t gid;
+
+   /* Default to the tasks effective ownership */
+   rcu_read_lock();
+   cred = __task_cred(task);
+   uid = cred->euid;
+   gid = cred->egid;
+   rcu_read_unlock();
+
+   /*
+* Before the /proc/pid/status file was created the only way to read
+* the effective uid of a /process was to stat /proc/pid.  Reading
+* /proc/pid/status is slow enough that procps and other packages
+* kept stating /proc/pid.  To keep the rules in /proc simple I have
+* made this apply to all per process world readable and executable
+* directories.
+*/
+   if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) {
+   struct mm_struct *mm;
+   task_lock(task);
+   mm = task->mm;
+   /* Make non-dumpable tasks owned by some root */
+   if (mm) {
+   if (get_dumpable(mm) != SUID_DUMP_USER) {
+   struct user_namespace *user_ns = mm->user_ns;
+
+   uid = make_kuid(user_ns, 0);
+   if (!uid_valid(uid))
+   uid = GLOBAL_ROOT_UID;
+
+   gid = make_kgid(user_ns, 0);
+   if (!gid_valid(gid))
+   gid = GLOBAL_ROOT_GID;
+   }
+   } else {
+   uid = GLOBAL_ROOT_UID;
+   gid = GLOBAL_ROOT_GID;
+   }
+   task_unlock(task);
+   }
+   *ruid = uid;
+   *rgid = gid;
+}
+
 struct inode *proc_pid_make_inode(struct super_block * sb,
  struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
-   const struct cred *cred;
 
/* We need a new inode */
 
@@ -1703,13 +1754,7 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb,
if (!ei->pid)
goto out_unlock;
 
-   if (task_dumpable(task)) {
-   rcu_read_lock();
-   cred = __task_cred(task);
-   inode->i_uid = cred->euid;
-   inode->i_gid = cred->egid;
-   rcu_read_unlock();
-   }
+   task_dump_owner(task, 0, &inode->i_uid, &inode->i_gid);
security_task_to_inode(task, inode);
 
 out:
@@ -1724,7 +1769,6 @@ int pid_getattr(struct vfsmount *mnt, struct dentry 
*dentry, struct kstat *stat)
 {
struct inode *inode = d_inode(dentry);
struct task_struct *task;
-   const struct cred *cred;
struct pid_namespace *pid = dentry->d_sb->s_fs_info;
 
generic_fillattr(inode, stat);
@@ -1742,12 +1786,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry 
*dentry, struct kstat *stat)
 */
return -ENOENT;
}
-   if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) ||
-   task_dumpable(task)) {
-   cred = __task_cred(task);
-   stat->uid = cred->euid;
-   stat->gid = cred->egid;
-   }
+   task_dump_owner(task, inode->i_mode, &stat->uid, &stat->gid);
}
rcu_read_unlock();
return 0;
@@ -1763,18 +1802,11 @@ int pid_getattr(struct vfsmount *mnt, struct dentry 
*dentry, struct kstat *stat)
  * Rewrite the inode's ownerships here because the owning task may have
  * performed a setuid(), etc.
  *
- * Before the /proc/pid/status file was created the only way to read
- * the effective uid of a /process was to stat /proc/pid.  Reading
- * /proc/pid/status is slow enough that procps and other packages
- * kept stating /proc/pid.  To keep the

[PATCH 03/10] proc: Pass file mode to proc_pid_make_inode

2020-12-03 Thread Wen Yang
From: Andreas Gruenbacher 

[ Upstream commit db978da8fa1d0819b210c137d31a339149b88875 ]

Pass the file mode of the proc inode to be created to
proc_pid_make_inode.  In proc_pid_make_inode, initialize inode->i_mode
before calling security_task_to_inode.  This allows selinux to set
isec->sclass right away without introducing "half-initialized" inode
security structs.

Signed-off-by: Andreas Gruenbacher 
Signed-off-by: Paul Moore 
Cc:  # 4.9.x
Signed-off-by: Wen Yang 
---
 fs/proc/base.c   | 23 +--
 fs/proc/fd.c |  6 ++
 fs/proc/internal.h   |  2 +-
 fs/proc/namespaces.c |  3 +--
 security/selinux/hooks.c |  1 +
 5 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index b9e4183..ee2e0ec 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1676,7 +1676,8 @@ static int proc_pid_readlink(struct dentry * dentry, char 
__user * buffer, int b
 
 /* building an inode */
 
-struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct 
*task)
+struct inode *proc_pid_make_inode(struct super_block * sb,
+ struct task_struct *task, umode_t mode)
 {
struct inode * inode;
struct proc_inode *ei;
@@ -1690,6 +1691,7 @@ struct inode *proc_pid_make_inode(struct super_block * 
sb, struct task_struct *t
 
/* Common stuff */
ei = PROC_I(inode);
+   inode->i_mode = mode;
inode->i_ino = get_next_ino();
inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
inode->i_op = &proc_def_inode_operations;
@@ -2041,7 +2043,9 @@ struct map_files_info {
struct proc_inode *ei;
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFLNK |
+   ((mode & FMODE_READ ) ? S_IRUSR : 0) |
+   ((mode & FMODE_WRITE) ? S_IWUSR : 0));
if (!inode)
return -ENOENT;
 
@@ -2050,12 +2054,6 @@ struct map_files_info {
 
inode->i_op = &proc_map_files_link_inode_operations;
inode->i_size = 64;
-   inode->i_mode = S_IFLNK;
-
-   if (mode & FMODE_READ)
-   inode->i_mode |= S_IRUSR;
-   if (mode & FMODE_WRITE)
-   inode->i_mode |= S_IWUSR;
 
d_set_d_op(dentry, &tid_map_files_dentry_operations);
d_add(dentry, inode);
@@ -2409,12 +2407,11 @@ static int proc_pident_instantiate(struct inode *dir,
struct inode *inode;
struct proc_inode *ei;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, p->mode);
if (!inode)
goto out;
 
ei = PROC_I(inode);
-   inode->i_mode = p->mode;
if (S_ISDIR(inode->i_mode))
set_nlink(inode, 2);/* Use getattr to fix if necessary */
if (p->iop)
@@ -3096,11 +3093,10 @@ static int proc_pid_instantiate(struct inode *dir,
 {
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | 
S_IXUGO);
if (!inode)
goto out;
 
-   inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
inode->i_op = &proc_tgid_base_inode_operations;
inode->i_fop = &proc_tgid_base_operations;
inode->i_flags|=S_IMMUTABLE;
@@ -3391,11 +3387,10 @@ static int proc_task_instantiate(struct inode *dir,
struct dentry *dentry, struct task_struct *task, const void *ptr)
 {
struct inode *inode;
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | 
S_IXUGO);
 
if (!inode)
goto out;
-   inode->i_mode = S_IFDIR|S_IRUGO|S_IXUGO;
inode->i_op = &proc_tid_base_inode_operations;
inode->i_fop = &proc_tid_base_operations;
inode->i_flags|=S_IMMUTABLE;
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index d21dafe..4274f83 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -183,14 +183,13 @@ static int proc_fd_link(struct dentry *dentry, struct 
path *path)
struct proc_inode *ei;
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFLNK);
if (!inode)
goto out;
 
ei = PROC_I(inode);
ei->fd = fd;
 
-   inode->i_mode = S_IFLNK;
inode->i_op = &proc_pid_link_inode_operations;
inode->i_size = 64;
 
@@ -322,14 +321,13 @@ int proc_fd_permission(struct inode *inode, int mask)
struct proc_inode *ei;
struct inode *inode;
 
-   inode = proc_pid_make_inode(dir->i_sb, task);
+   inode = proc_pid_make_inode(dir->i_sb, task, S_IFREG | S_IRUSR);
if

[PATCH 01/10] clone: add CLONE_PIDFD

2020-12-03 Thread Wen Yang
From: Christian Brauner 

[ Upstream commit b3e5838252665ee4cfa76b82bdf1198dca81e5be ]

This patchset makes it possible to retrieve pid file descriptors at
process creation time by introducing the new flag CLONE_PIDFD to the
clone() system call.  Linus originally suggested to implement this as a
new flag to clone() instead of making it a separate system call.  As
spotted by Linus, there is exactly one bit for clone() left.

CLONE_PIDFD creates file descriptors based on the anonymous inode
implementation in the kernel that will also be used to implement the new
mount api.  They serve as a simple opaque handle on pids.  Logically,
this makes it possible to interpret a pidfd differently, narrowing or
widening the scope of various operations (e.g. signal sending).  Thus, a
pidfd cannot just refer to a tgid, but also a tid, or in theory - given
appropriate flag arguments in relevant syscalls - a process group or
session. A pidfd does not represent a privilege.  This does not imply it
cannot ever be that way but for now this is not the case.

A pidfd comes with additional information in fdinfo if the kernel supports
procfs.  The fdinfo file contains the pid of the process in the caller's
pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".

As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
parent_tidptr argument of clone.  This has the advantage that we can
give back the associated pid and the pidfd at the same time.

To remove worries about missing metadata access this patchset comes with
a sample program that illustrates how a combination of CLONE_PIDFD and
pidfd_send_signal() can be used to gain race-free access to process
metadata through /proc/<pid>.  The sample program can easily be
translated into a helper that would be suitable for inclusion in libc so
that users don't have to worry about writing it themselves.

Suggested-by: Linus Torvalds 
Signed-off-by: Christian Brauner 
Co-developed-by: Jann Horn 
Signed-off-by: Jann Horn 
Reviewed-by: Oleg Nesterov 
Cc: Arnd Bergmann 
Cc: "Eric W. Biederman" 
Cc: Kees Cook 
Cc: Thomas Gleixner 
Cc: David Howells 
Cc: "Michael Kerrisk (man-pages)" 
Cc: Andy Lutomirsky 
Cc: Andrew Morton 
Cc: Aleksa Sarai 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc:  # 4.9.x
(clone: fix up cherry-pick conflicts for b3e583825266)
Signed-off-by: Wen Yang 
---
 include/linux/pid.h|   1 +
 include/uapi/linux/sched.h |   1 +
 kernel/fork.c  | 119 +++--
 3 files changed, 117 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 97b745d..7599a78 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,7 @@ struct pid_link
struct hlist_node node;
struct pid *pid;
 };
+extern const struct file_operations pidfd_fops;
 
 static inline struct pid *get_pid(struct pid *pid)
 {
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 5f0fe01..ed6e31d 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -9,6 +9,7 @@
 #define CLONE_FS   0x0200  /* set if fs info shared between 
processes */
 #define CLONE_FILES0x0400  /* set if open files shared between 
processes */
 #define CLONE_SIGHAND  0x0800  /* set if signal handlers and blocked 
signals shared */
+#define CLONE_PIDFD0x1000  /* set if a pidfd should be placed in 
parent */
 #define CLONE_PTRACE   0x2000  /* set if we want to let tracing 
continue on the child too */
 #define CLONE_VFORK0x4000  /* set if the parent wants the child to 
wake it up on mm_release */
 #define CLONE_PARENT   0x8000  /* set if we want to have the same 
parent as the cloner */
diff --git a/kernel/fork.c b/kernel/fork.c
index b64efec..076297a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,7 +11,22 @@
  * management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
  */
 
+#include 
 #include 
+#if 0
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+>>>>>>> b3e58382... clone: add CLONE_PIDFD
+#endif
 #include 
 #include 
 #include 
@@ -1460,6 +1475,58 @@ static void posix_cpu_timers_init(struct task_struct 
*tsk)
 task->pids[type].pid = pid;
 }
 
+static int pidfd_release(struct inode *inode, struct file *file)
+{
+   struct pid *pid = file->private_data;
+
+   file->private_data = NULL;
+   put_pid(pid);
+   return 0;
+}
+
+#ifdef CONFIG_PROC_FS
+static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
+{
+   struct pid_namespace *ns = file_inode(m->file)->i_sb->s_fs_info;
+   struct pid *pid = f->private_data;
+
+   seq_put_decimal_ull(m, "Pid:\t", pid_nr_ns(pid, ns));
+   seq_putc(m, '\n');
+}
+#endif
+
+const struct file_operations pidfd_fops = {
+   .release = pidfd

[PATCH 02/10] pidfd: add polling support

2020-12-03 Thread Wen Yang
From: "Joel Fernandes (Google)" 

[ Upstream commit b53b0b9d9a613c418057f6cb921c2f40a6f78c24 ]

This patch adds polling support to pidfd.

Android low memory killer (LMK) needs to know when a process dies once
it is sent the kill signal. It does so by checking for the existence of
/proc/pid which is both racy and slow. For example, if a PID is reused
between the time LMK sends a kill signal and the time it checks for the
existence of the PID, the wrong process may be checked for existence.
Using the polling support, LMK will be able to get notified when a process
exits in a race-free and fast way, which allows the LMK to do other things
(such as polling on other fds) while waiting for the process being killed
to die.

For notification to polling processes, we follow the same existing
mechanism in the kernel used when the parent of the task group is to be
notified of a child's death (do_notify_parent). This is precisely when the
tasks waiting on a poll of pidfd are also awakened in this patch.

We have decided to include the waitqueue in struct pid for the following
reasons:
1. The wait queue has to survive for the lifetime of the poll. Including
   it in task_struct would not be an option in this case because the task can
   be reaped and destroyed before the poll returns.

2. Including the waitqueue in struct pid means that during
   de_thread(), the new thread group leader automatically gets the new
   waitqueue/pid even though its task_struct is different.

Appropriate test cases are added in the second patch to provide coverage of
all the cases the patch is handling.
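
The readiness rule that the poll handler applies can be sketched in userspace. This is a minimal model, assuming only the two task properties the check looks at; model_poll_task and model_pidfd_readable are hypothetical names, and the kernel function returns POLLIN | POLLRDNORM where this sketch returns true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task fields the check consults. */
struct model_poll_task {
	bool exited;             /* task->exit_state is non-zero */
	bool thread_group_empty; /* no other threads remain in the group */
};

/*
 * Model of the decision in pidfd_poll(). task is NULL when the pid no
 * longer refers to any task, i.e. it has already been reaped.
 */
static bool model_pidfd_readable(const struct model_poll_task *task)
{
	if (task == NULL)
		return true;
	/* A dead leader with live threads must keep pollers blocked. */
	return task->exited && task->thread_group_empty;
}
```

The second condition is what makes poll(2) block while a thread group leader has exited but other threads are still running, matching the behaviour of the wait(2) family.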

Cc: Andy Lutomirski 
Cc: Steven Rostedt 
Cc: Daniel Colascione 
Cc: Jann Horn 
Cc: Tim Murray 
Cc: Jonathan Kowalski 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: Kees Cook 
Cc: David Howells 
Cc: Oleg Nesterov 
Cc: kernel-t...@android.com
Reviewed-by: Oleg Nesterov 
Co-developed-by: Daniel Colascione 
Signed-off-by: Daniel Colascione 
Signed-off-by: Joel Fernandes (Google) 
Signed-off-by: Christian Brauner 
Cc:  # 4.9.x: b3e5838: clone: add CLONE_PIDFD
Cc:  # 4.9.x
(pidfd: fix up cherry-pick conflicts for b53b0b9d9a61)
Signed-off-by: Wen Yang 
---
 include/linux/pid.h |  3 +++
 kernel/fork.c   | 26 ++
 kernel/pid.c|  2 ++
 kernel/signal.c | 11 +++
 4 files changed, 42 insertions(+)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 7599a78..f5552ba 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -2,6 +2,7 @@
 #define _LINUX_PID_H
 
 #include 
+#include 
 
 enum pid_type
 {
@@ -62,6 +63,8 @@ struct pid
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
+   /* wait queue for pidfd notifications */
+   wait_queue_head_t wait_pidfd;
struct rcu_head rcu;
struct upid numbers[1];
 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 076297a..ac57f91 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1495,8 +1495,34 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct 
file *f)
 }
 #endif
 
+/*
+ * Poll support for process exit notification.
+ */
+static unsigned int pidfd_poll(struct file *file, struct poll_table_struct 
*pts)
+{
+   struct task_struct *task;
+   struct pid *pid = file->private_data;
+   int poll_flags = 0;
+
+   poll_wait(file, &pid->wait_pidfd, pts);
+
+   rcu_read_lock();
+   task = pid_task(pid, PIDTYPE_PID);
+   /*
+* Inform pollers only when the whole thread group exits.
+* If the thread group leader exits before all other threads in the
+* group, then poll(2) should block, similar to the wait(2) family.
+*/
+   if (!task || (task->exit_state && thread_group_empty(task)))
+   poll_flags = POLLIN | POLLRDNORM;
+   rcu_read_unlock();
+
+   return poll_flags;
+}
+
 const struct file_operations pidfd_fops = {
.release = pidfd_release,
+   .poll = pidfd_poll,
 #ifdef CONFIG_PROC_FS
.show_fdinfo = pidfd_show_fdinfo,
 #endif
diff --git a/kernel/pid.c b/kernel/pid.c
index fa704f8..e605398 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -333,6 +333,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (type = 0; type < PIDTYPE_MAX; ++type)
INIT_HLIST_HEAD(&pid->tasks[type]);
 
+   init_waitqueue_head(&pid->wait_pidfd);
+
upid = pid->numbers + ns->level;
spin_lock_irq(&pidmap_lock);
if (!(ns->nr_hashed & PIDNS_HASH_ADDING))
diff --git a/kernel/signal.c b/kernel/signal.c
index bedca16..053de87a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1632,6 +1632,14 @@ int send_sigqueue(struct sigqueue *q, struct task_struct 
*t, int group)
return ret;
 }
 
+static void do_notify_pidfd(struct task_struct *task)
+{
+   struct pid *pid;
+
+   pid = task_pid(task);
+   wake_up_all(&pid->wait_pidfd);
+}
+
 /*
  * Let a parent know about the death of a child.
  * For a stopped/c

[PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry

2020-12-03 Thread Wen Yang
ug_trace;
+
 static int proc_pid_instantiate(struct inode *dir,
   struct dentry * dentry,
   struct task_struct *task, const void *ptr)
@@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir,
d_set_d_op(dentry, _dentry_operations);

d_add(dentry, inode);
+
+   if (sysctl_dentry_debug_trace)
+   printk("XXX %s:%d pid=%d tid=%d  entry=%s/%p\n",
+   __func__, __LINE__, task->pid, task->tgid,
+   dentry->d_name.name, dentry);
+
/* Close the race of the process dying before we return the dentry */
if (pid_revalidate(dentry, 0))
return 0;
diff --git a/kernel/exit.c b/kernel/exit.c
index 27f4168..2b3e1b6 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -55,6 +55,8 @@
 #include 
 #include 

+#include 
+
 #include 
 #include 
 #include 
@@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
put_task_struct(tsk);
 }

+int sysctl_dentry_debug_delay __read_mostly = 0;
+int sysctl_dentry_debug_pid __read_mostly = 0;

 void release_task(struct task_struct *p)
 {
@@ -178,6 +182,11 @@ void release_task(struct task_struct *p)

proc_flush_task(p);

+   if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) {
+   while (sysctl_dentry_debug_delay)
+   mdelay(1);
+   }
+
write_lock_irq(_lock);
ptrace_release_task(p);
__exit_signal(p);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 513e6da..27f1395 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
 static int max_extfrag_threshold = 1000;
 #endif

+extern int sysctl_dentry_debug_trace;
+extern int sysctl_dentry_debug_delay;
+extern int sysctl_dentry_debug_pid;
+
 static struct ctl_table kern_table[] = {
{
.procname   = "sched_child_runs_first",
@@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.proc_handler   = proc_dointvec,
.extra1 = ,
},
+   {
+   .procname   = "dentry_debug_trace",
+   .data   = _dentry_debug_trace,
+   .maxlen = sizeof(sysctl_dentry_debug_trace),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = ,
+   },
+   {
+   .procname   = "dentry_debug_delay",
+   .data   = _dentry_debug_delay,
+   .maxlen = sizeof(sysctl_dentry_debug_delay),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = ,
+   },
+   {
+   .procname   = "dentry_debug_pid",
+   .data   = _dentry_debug_pid,
+   .maxlen = sizeof(sysctl_dentry_debug_pid),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = ,
+   },
 #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
{
.procname   = "legacy_va_layout",


Signed-off-by: Wen Yang 
Cc: Pavel Emelyanov 
Cc: Oleg Nesterov 
Cc: Sukadev Bhattiprolu 
Cc: Paul Menage 
Cc: "Eric W. Biederman" 
Cc: Greg Kroah-Hartman 
Cc: 
-- 
1.8.3.1



Re: [PATCH] exit: fix a race in release_task when flushing the dentry

2020-11-29 Thread Wen Yang




On 2020/11/29 at 2:05 PM, Greg Kroah-Hartman wrote:

On Sat, Nov 28, 2020 at 11:28:53PM +0800, Wen Yang wrote:



On 2020/11/28 at 10:05 PM, Greg Kroah-Hartman wrote:

On Sat, Nov 28, 2020 at 09:59:09PM +0800, Wen Yang wrote:



On 2020/11/28 at 4:06 PM, Greg Kroah-Hartman wrote:

On Sat, Nov 28, 2020 at 02:47:22PM +0800, Wen Yang wrote:

[ Upstream commit 7bc3e6e55acf065500a24621f3b313e7e5998acf ]


No, that is not this commit at all.

What are you wanting to have happen here?

confused,

greg k-h



Thanks.
Let's explain it briefly:

The dentries such as /proc/<pid>/ns/ipc have the DCACHE_OP_DELETE flag, so they
should be deleted when the process exits.
Suppose the following race occurs:

release_task                dput
-> proc_flush_task
                            -> dentry->d_op->d_delete(dentry)
-> __exit_signal
                            -> dentry->d_lockref.count--  and return.


In the proc_flush_task function, because another process is still using this
dentry, it cannot be deleted.
In the dput function, d_delete may be executed before __exit_signal (the pid
has not been unhashed yet), so d_delete returns false and the dentry is
not deleted.

So this dentry remains cached (with a count of 0), as do its parent
dentries, and they can only be freed when drop_caches is
manually triggered.


In the release_task function, we should move proc_flush_task after the
tasklist_lock is released (just like commit
7bc3e6e55acf065500a24621f3b313e7e5998acf did).


I do not understand, is this a patch being submitted for the main kernel
tree, or for a stable kernel release?

If stable, please read:
  https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

If main kernel tree, you can't have the "Upstream commit" line in the
changelog text as that makes no sense at all.



Hi,
This patch is submitted for the stable branches (from 4.9.y
to 5.6.y).

This problem can also be solved if the following patches are backported to
the stable branches:
7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
26dbc60f385f ("proc: Generalize proc_sys_prune_dcache into
proc_prune_siblings_dcache")
f90f3cafe8d5 ("proc: Use d_invalidate in proc_prune_siblings_dcache")

However, the above-mentioned patches modify too much code (more than 100
lines), and they may also contain some undiscovered bugs.

So the safer method may be to apply this small patch (also ported from the
equivalent fix that already exists in Linus' tree).

We will reformat the patch later.


We always prefer to take the original, upstream patches, instead of
one-off changes as almost always, those one-off changes end up being
wrong and hard to work with over time.

So if we need more than one patch to solve this reported problem, that's
fine, can you test the above series of patches and provide a backported
set of them that we can use for this?



Ok, we will follow your suggestions.
Thanks.

--
Best wishes,
Wen



[PATCH] proc: add locking checks in proc_inode_is_dead

2020-11-28 Thread Wen Yang
The proc_inode_is_dead function might race with __unhash_process.
This will result in a whole bunch of stale proc entries being cached.
To prevent that, add the required locking.

Signed-off-by: Wen Yang 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Cc: Alexey Dobriyan 
Cc: Christian Brauner 
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
---
 fs/proc/base.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 1bc9bcd..59720bc 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1994,7 +1994,13 @@ static int pid_revalidate(struct dentry *dentry, 
unsigned int flags)
 
 static inline bool proc_inode_is_dead(struct inode *inode)
 {
-   return !proc_pid(inode)->tasks[PIDTYPE_PID].first;
+   bool has_task;
+
+   read_lock(&tasklist_lock);
+   has_task = pid_has_task(proc_pid(inode), PIDTYPE_PID);
+   read_unlock(&tasklist_lock);
+
+   return !has_task;
 }
 
 int pid_delete_dentry(const struct dentry *dentry)
-- 
1.8.3.1



Stable backport request for fixing the issue of not being able to create a new pid_ns

2020-10-08 Thread Wen Yang
After the process exits, the following three dentries still refer to the pid:
/proc/<pid>
/proc/<pid>/ns
/proc/<pid>/ns/pid

https://bugzilla.kernel.org/show_bug.cgi?id=208613

According to the commit f333c700c610 ("pidns: Add a limit on the number of
pid namespaces"), if the pid cannot be released, it may result in the
inability to create a new pid_ns.

Please backport the following patches to the kernel stable trees (from 4.9.y
to 5.6.y):
7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
26dbc60f385f ("proc: Generalize proc_sys_prune_dcache into 
proc_prune_siblings_dcache")
f90f3cafe8d5 ("proc: Use d_invalidate in proc_prune_siblings_dcache")

Signed-off-by: Wen Yang 
Cc: Eric W. Biederman 
Cc: sta...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org


Re: [PATCH] net: core: explicitly call linkwatch_fire_event to speed up the startup of network services

2020-09-15 Thread Wen Yang



On 2020/8/6 at 5:09 PM, Wen Yang wrote:



On 2020/8/5 at 6:58 AM, David Miller wrote:

From: Wen Yang 
Date: Sat,  1 Aug 2020 16:58:45 +0800


diff --git a/net/core/link_watch.c b/net/core/link_watch.c
index 75431ca..6b9d44b 100644
--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -98,6 +98,9 @@ static bool linkwatch_urgent_event(struct 
net_device *dev)

  if (netif_is_lag_port(dev) || netif_is_lag_master(dev))
  return true;
  +    if ((dev->flags & IFF_UP) && dev->operstate == IF_OPER_DOWN)
+    return true;
+
  return netif_carrier_ok(dev) && qdisc_tx_changing(dev);
  }


You're bypassing explicitly the logic here:

/*
 * Limit the number of linkwatch events to one
 * per second so that a runaway driver does not
 * cause a storm of messages on the netlink
 * socket.  This limit does not apply to up events
 * while the device qdisc is down.
 */
if (!urgent_only)
    linkwatch_nextevent = jiffies + HZ;
/* Limit wrap-around effect on delay. */
else if (time_after(linkwatch_nextevent, jiffies + HZ))
    linkwatch_nextevent = jiffies;

Something about this isn't right.  We need to analyze what you are seeing,
what device you are using, and what systemd is doing to figure out what
the right place for the fix.

Thank you.



Thank you very much for your comments.
We are using virtio_net and the environment is a microvm similar to 
firecracker.


Let's briefly explain.
net_device->operstate is assigned through linkwatch_event, and the 
call stack is as follows:

process_one_work
-> linkwatch_event
 -> __linkwatch_run_queue
  -> linkwatch_do_dev
   -> rfc2863_policy
    -> default_operstate

During the machine startup process, net_device->operstate has the 
following two-step state changes:


STEP A: virtnet_probe detects the network card and triggers the 
execution of linkwatch_fire_event.

Since linkwatch_nextevent is initialized to 0, linkwatch_work will run.
And since net_device->state is 6 (__LINK_STATE_PRESENT | 
__LINK_STATE_NOCARRIER), net_device->operstate will be changed from 
IF_OPER_UNKNOWN to IF_OPER_DOWN:

eth0 operstate:0 (IF_OPER_UNKNOWN) -> operstate:2 (IF_OPER_DOWN)

virtnet_probe then executes netif_carrier_on to update 
net_device->state, it will be changed from ‘__LINK_STATE_PRESENT | 
__LINK_STATE_NOCARRIER’ to __LINK_STATE_PRESENT:
eth0 state: 6 (__LINK_STATE_PRESENT | __LINK_STATE_NOCARRIER) -> 2 
(__LINK_STATE_PRESENT)


STEP B: One second later (because linkwatch_nextevent = jiffies + HZ), 
linkwatch_work is executed again.
At this time, since net_device->state is __LINK_STATE_PRESENT, 
net_device->operstate will be changed from IF_OPER_DOWN to IF_OPER_UP:

eth0 operstate:2 (IF_OPER_DOWN) -> operstate:6 (IF_OPER_UP)
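
The two transitions in STEP A and STEP B can be reproduced with a small userspace model. This is a sketch under stated assumptions: the bit numbers are inferred from the state values quoted above (6 and 2), the model ignores the dormant and lower-layer-down cases, and model_operstate and the MODEL_ constants are hypothetical names rather than kernel symbols.

```c
#include <assert.h>

/* Bit numbers consistent with state 6 = PRESENT | NOCARRIER, 2 = PRESENT. */
enum {
	MODEL_LINK_STATE_PRESENT   = 1,
	MODEL_LINK_STATE_NOCARRIER = 2
};

/* operstate values as quoted in the description above. */
enum {
	MODEL_IF_OPER_UNKNOWN = 0,
	MODEL_IF_OPER_DOWN    = 2,
	MODEL_IF_OPER_UP      = 6
};

/* Derive operstate from the carrier bit only (simplified). */
static int model_operstate(unsigned long state)
{
	if (state & (1UL << MODEL_LINK_STATE_NOCARRIER))
		return MODEL_IF_OPER_DOWN;
	return MODEL_IF_OPER_UP;
}
```

With state 6 the model yields IF_OPER_DOWN (STEP A), and with state 2 it yields IF_OPER_UP (STEP B). No hardware access is involved in either step.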


The above state changes complete within 2 seconds.
Generally, the machine loads the initramfs first and does some 
initialization there, which takes some time; it then does a 
switch_root to the system disk and continues the initialization, which 
also takes some time, and finally starts the systemd-networkd 
service, brings the link up, and so on.
By then the linkwatch_work work item has had enough time to run 
twice, and net_device->operstate is already IF_OPER_UP.

So bringing the link up returns quickly, as the following log shows:
Aug 06 16:35:55.966121 iZuf6h1kfgutxc3el68z2lZ systemd-networkd[580]: 
eth0: bringing link up

...
Aug 06 16:35:55.990461 iZuf6h1kfgutxc3el68z2lZ systemd-networkd[580]: 
eth0: flags change: +UP +LOWER_UP +RUNNING


But we are now using a MicroVM, which must start extremely quickly, so it 
bypasses the initramfs and boots the trimmed system on the disk 
directly.
systemd-networkd starts less than 1 second after booting. STEP B has 
not run yet at that point, so there is a wait of several hundred 
milliseconds here, as follows:

Jul 20 22:00:47.432552 systemd-networkd[210]: eth0: bringing link up
...
Jul 20 22:00:47.446108 systemd-networkd[210]: eth0: flags change: +UP 
+LOWER_UP

...
Jul 20 22:00:47.781463 systemd-networkd[210]: eth0: flags change: 
+RUNNING



Note: dhcp pays attention to IFF_RUNNING status, we may refer to:
https://www.kernel.org/doc/Documentation/networking/operstates.txt

A routing daemon or dhcp client just needs to care for IFF_RUNNING or
waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before
considering the interface / querying a DHCP address.

Finally, STEP B above only updates operstate from state that is already 
known on the net_device (operstate/state), with no hardware interaction 
involved, so waiting 1 second there is not really justified.


By adding:
+	if ((dev->flags & IFF_UP) && dev->operstate == IF_OPER_DOWN)
+		return true;
+
we hope to improve the linkwatch_urgent_event function a bit.

We hope to get more of your advice and guidance.

Best wishes,
Wen


Hi, this issue is worth continuing to discuss.

Re: [PATCH] net: core: explicitly call linkwatch_fire_event to speed up the startup of network services

2020-08-06 Thread Wen Yang




On 2020/8/5 at 6:58 AM, David Miller wrote:

From: Wen Yang 
Date: Sat,  1 Aug 2020 16:58:45 +0800


diff --git a/net/core/link_watch.c b/net/core/link_watch.c
index 75431ca..6b9d44b 100644
--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -98,6 +98,9 @@ static bool linkwatch_urgent_event(struct net_device *dev)
 	if (netif_is_lag_port(dev) || netif_is_lag_master(dev))
 		return true;
 
+	if ((dev->flags & IFF_UP) && dev->operstate == IF_OPER_DOWN)
+		return true;
+
 	return netif_carrier_ok(dev) && qdisc_tx_changing(dev);
 }
  


You're bypassing explicitly the logic here:

	/*
	 * Limit the number of linkwatch events to one
	 * per second so that a runaway driver does not
	 * cause a storm of messages on the netlink
	 * socket.  This limit does not apply to up events
	 * while the device qdisc is down.
	 */
	if (!urgent_only)
		linkwatch_nextevent = jiffies + HZ;
	/* Limit wrap-around effect on delay. */
	else if (time_after(linkwatch_nextevent, jiffies + HZ))
		linkwatch_nextevent = jiffies;

Something about this isn't right.  We need to analyze what you are seeing,
what device you are using, and what systemd is doing to figure out what
the right place for the fix is.

Thank you.



Thank you very much for your comments.
We are using virtio_net and the environment is a microvm similar to 
firecracker.


Let's briefly explain.
net_device->operstate is assigned through linkwatch_event, and the call 
stack is as follows:

process_one_work
 -> linkwatch_event
  -> __linkwatch_run_queue
   -> linkwatch_do_dev
    -> rfc2863_policy
     -> default_operstate

During the machine startup process, net_device->operstate has the 
following two-step state changes:


STEP A: virtnet_probe detects the network card and triggers the 
execution of linkwatch_fire_event.

Since linkwatch_nextevent is initialized to 0, linkwatch_work will run.
And since net_device->state is 6 (__LINK_STATE_PRESENT | 
__LINK_STATE_NOCARRIER), net_device->operstate will be changed from 
IF_OPER_UNKNOWN to IF_OPER_DOWN:

eth0 operstate:0 (IF_OPER_UNKNOWN) -> operstate:2 (IF_OPER_DOWN)

virtnet_probe then executes netif_carrier_on to update 
net_device->state, which changes from ‘__LINK_STATE_PRESENT | 
__LINK_STATE_NOCARRIER’ to __LINK_STATE_PRESENT:
eth0 state: 6 (__LINK_STATE_PRESENT | __LINK_STATE_NOCARRIER) -> 2 
(__LINK_STATE_PRESENT)


STEP B: One second later (because linkwatch_nextevent = jiffies + HZ), 
linkwatch_work is executed again.
At this point net_device->state is __LINK_STATE_PRESENT, so 
net_device->operstate is changed from IF_OPER_DOWN to IF_OPER_UP:

eth0 operstate:2 (IF_OPER_DOWN) -> operstate:6 (IF_OPER_UP)
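
The two transitions in STEP A and STEP B can be modelled in user space. The following is only a minimal sketch: the sketch_* names are hypothetical, the bit numbers are inferred from the values printed above (PRESENT | NOCARRIER == 6), and the real default_operstate() handles more cases (for example dormant and lower-layer-down devices) that are left out here.

```c
#include <stdbool.h>

/* Bit numbers inferred from the mail: (1 << 1) | (1 << 2) == 6. */
enum sketch_link_state_bit {
	SKETCH_LINK_STATE_PRESENT   = 1,
	SKETCH_LINK_STATE_NOCARRIER = 2,
};

/* RFC 2863 operstate values as quoted in the mail. */
enum sketch_operstate {
	SKETCH_OPER_UNKNOWN = 0,
	SKETCH_OPER_DOWN    = 2,
	SKETCH_OPER_UP      = 6,
};

/* Simplified model: carrier present -> UP, otherwise DOWN. */
static enum sketch_operstate sketch_default_operstate(unsigned long state)
{
	bool carrier_ok = !(state & (1UL << SKETCH_LINK_STATE_NOCARRIER));

	return carrier_ok ? SKETCH_OPER_UP : SKETCH_OPER_DOWN;
}
```

With state 6 the model yields IF_OPER_DOWN (STEP A); once netif_carrier_on has cleared the NOCARRIER bit, state 2 yields IF_OPER_UP (STEP B). Nothing in this mapping needs hardware access.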


The above state changes complete within 2 seconds.
A machine normally loads the initramfs first and does some 
initialization there, which takes some time; it then switch_roots to 
the system disk and continues the initialization, which also takes 
some time; finally it starts the systemd-networkd service, which 
brings the link up, and so on.
By then the linkwatch_work work queue has had enough time to run twice, 
and net_device->operstate is already IF_OPER_UP.

So bringing the link up returns quickly, with the following log:
Aug 06 16:35:55.966121 iZuf6h1kfgutxc3el68z2lZ systemd-networkd[580]: 
eth0: bringing link up

...
Aug 06 16:35:55.990461 iZuf6h1kfgutxc3el68z2lZ systemd-networkd[580]: 
eth0: flags change: +UP +LOWER_UP +RUNNING


But we are now using a MicroVM, which must start extremely quickly, so 
it bypasses the initramfs and boots the trimmed system directly from the disk.
systemd-networkd starts less than 1 second after booting. STEP B has 
not run yet at that point, so systemd-networkd waits several hundred 
milliseconds here, as follows:

Jul 20 22:00:47.432552 systemd-networkd[210]: eth0: bringing link up
...
Jul 20 22:00:47.446108 systemd-networkd[210]: eth0: flags change: +UP 
+LOWER_UP

...
Jul 20 22:00:47.781463 systemd-networkd[210]: eth0: flags change: +RUNNING


Note: DHCP clients rely on the IFF_RUNNING status; see:
https://www.kernel.org/doc/Documentation/networking/operstates.txt

A routing daemon or dhcp client just needs to care for IFF_RUNNING or
waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before
considering the interface / querying a DHCP address.

Finally, STEP B above only updates operstate from state that is already 
known on the net_device (operstate/state), with no hardware interaction 
involved, so waiting 1 second there is not really justified.
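
To make the timing argument concrete, here is a rough user-space model of how the delay for the next linkwatch run is chosen. It is a sketch only, loosely modelled on the rate-limit logic discussed in this thread: sketch_linkwatch_delay and its parameters are hypothetical names, with 'now' standing in for jiffies, 'nextevent' for linkwatch_nextevent, and 'hz' for HZ.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the delay choice. All arithmetic is unsigned,
 * so a deadline that already lies in the past wraps to a huge value.
 */
static unsigned long sketch_linkwatch_delay(unsigned long now,
					    unsigned long nextevent,
					    unsigned long hz, bool urgent)
{
	unsigned long delay = nextevent - now;

	if (urgent)
		delay = 0;	/* urgent events are not rate limited */

	if (delay > hz)
		delay = 0;	/* deadline already passed (wrapped) */

	return delay;
}
```

Under this model the first event after boot runs at once (nextevent is still 0), a non-urgent event right after it waits up to one second, and an event classified as urgent runs immediately, which is the effect the proposed change to linkwatch_urgent_event aims for.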


By adding:
+	if ((dev->flags & IFF_UP) && dev->operstate == IF_OPER_DOWN)
+		return true;
+
we hope to improve the linkwatch_urgent_event function a bit.

We hope to get more of your advice and guidance.

Best wishes,
Wen


[PATCH] net: core: explicitly call linkwatch_fire_event to speed up the startup of network services

2020-08-01 Thread Wen Yang
The linkwatch_event work queue runs up to one second later.
When the MicroVM starts, it takes 300+ms for the ethX flag
to change from '+UP +LOWER_UP' to '+RUNNING', as follows:
Jul 20 22:00:47.432552 systemd-networkd[210]: eth0: bringing link up
...
Jul 20 22:00:47.446108 systemd-networkd[210]: eth0: flags change: +UP +LOWER_UP
...
Jul 20 22:00:47.781463 systemd-networkd[210]: eth0: flags change: +RUNNING

Let's manually trigger it here to make the network service start faster.

After applying this patch, the time consumption of
systemd-networkd.service was reduced from 366ms to 50ms.

Signed-off-by: Wen Yang 
Cc: "David S. Miller" 
Cc: Jakub Kicinski 
Cc: Andrew Lunn 
Cc: Eric Dumazet 
Cc: Jiri Pirko 
Cc: Leon Romanovsky 
Cc: Julian Wiedmann 
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 net/core/link_watch.c | 3 +++
 net/core/rtnetlink.c  | 1 +
 2 files changed, 4 insertions(+)

diff --git a/net/core/link_watch.c b/net/core/link_watch.c
index 75431ca..6b9d44b 100644
--- a/net/core/link_watch.c
+++ b/net/core/link_watch.c
@@ -98,6 +98,9 @@ static bool linkwatch_urgent_event(struct net_device *dev)
 	if (netif_is_lag_port(dev) || netif_is_lag_master(dev))
 		return true;
 
+	if ((dev->flags & IFF_UP) && dev->operstate == IF_OPER_DOWN)
+		return true;
+
 	return netif_carrier_ok(dev) && qdisc_tx_changing(dev);
 }
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 58c484a..fd0b3b6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2604,6 +2604,7 @@ static int do_setlink(const struct sk_buff *skb,
 				       extack);
 		if (err < 0)
 			goto errout;
+		linkwatch_fire_event(dev);
 	}
 
 	if (tb[IFLA_MASTER]) {
-- 
1.8.3.1



Re: [PATCH] usb: roles: Switch on role-switch uevent reporting

2020-05-08 Thread Wen Yang




On 2020/5/9 at 12:29 AM, Bryan O'Donoghue wrote:

Right now we don't report to user-space a role switch when doing a
usb_role_switch_set_role() despite having registered the uevent callbacks.

This patch switches on the notifications allowing user-space to see
role-switch change notifications and subsequently determine the current
controller data-role.

example:
PFX=/devices/platform/soc/78d9000.usb/ci_hdrc.0

root@somebox# udevadm monitor -p

KERNEL[49.894994] change $PFX/usb_role/ci_hdrc.0-role-switch (usb_role)
ACTION=change
DEVPATH=$PFX/usb_role/ci_hdrc.0-role-switch
SUBSYSTEM=usb_role
DEVTYPE=usb_role_switch
USB_ROLE_SWITCH=ci_hdrc.0-role-switch
SEQNUM=2432

Cc: Heikki Krogerus 
Cc: Greg Kroah-Hartman 
Cc: Chunfeng Yun 
Cc: Suzuki K Poulose 
Cc: Alexandre Belloni 
Cc: Wen Yang 
Cc: chenqiwu 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bryan O'Donoghue 
---
  drivers/usb/roles/class.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/roles/class.c b/drivers/usb/roles/class.c
index 5b17709821df..27d92af29635 100644
--- a/drivers/usb/roles/class.c
+++ b/drivers/usb/roles/class.c
@@ -49,8 +49,10 @@ int usb_role_switch_set_role(struct usb_role_switch *sw, enum usb_role role)
 	mutex_lock(&sw->lock);
 
 	ret = sw->set(sw, role);
-	if (!ret)
+	if (!ret) {
 		sw->role = role;
+		kobject_uevent(&sw->dev.kobj, KOBJ_CHANGE);
+	}
 
 	mutex_unlock(&sw->lock);
  



Hi, we may also need to handle the return value of kobject_uevent(). 
Should we move the call below the mutex_unlock(&sw->lock) line?
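
A rough user-space sketch of that suggestion follows. Everything in it is hypothetical (the sketch_* names, the lock_depth counter standing in for the mutex, the stubbed notifier); it only illustrates the ordering: update the role under the lock, drop the lock, then send the notification and propagate its result.

```c
/* Hypothetical stand-in for struct usb_role_switch. */
struct sketch_switch {
	int lock_depth;		/* > 0 while the "mutex" is held */
	int role;
	int events_sent;
	int events_under_lock;	/* notifications sent with the lock held */
};

/* Stub for sw->set(): rejects negative roles. */
static int sketch_set(struct sketch_switch *sw, int role)
{
	(void)sw;
	return role < 0 ? -22 : 0;
}

/* Stub for kobject_uevent(): records whether the lock was held. */
static int sketch_uevent(struct sketch_switch *sw)
{
	if (sw->lock_depth)
		sw->events_under_lock++;
	sw->events_sent++;
	return 0;
}

static int sketch_set_role(struct sketch_switch *sw, int role)
{
	int ret;

	sw->lock_depth++;		/* stands in for mutex_lock() */
	ret = sketch_set(sw, role);
	if (!ret)
		sw->role = role;
	sw->lock_depth--;		/* stands in for mutex_unlock() */

	if (ret)
		return ret;

	return sketch_uevent(sw);	/* result is no longer dropped */
}
```

Whether a failed notification should really fail the role switch is a separate design question; the sketch just shows one way to avoid ignoring the return value.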


Regards,
Wen



[PATCH] checkpatch: add checks for fixes tags

2019-10-17 Thread Wen Yang
The SHA1 in a Fixes tag should be at least 12 digits long, as suggested
by Stephen:
https://lkml.org/lkml/2019/9/10/626
https://lkml.org/lkml/2019/7/10/304

And the fixes tag should also be capitalized (Fixes:),
as suggested by David:
https://lkml.org/lkml/2019/10/1/1067

Add checks for the above issues.

Signed-off-by: Wen Yang 
Cc: Andy Whitcroft  (maintainer:CHECKPATCH)
Cc: Joe Perches  (maintainer:CHECKPATCH)
Cc: Stephen Rothwell 
Cc: "David S. Miller" 
Cc: linux-kernel@vger.kernel.org (open list)
---
 scripts/checkpatch.pl | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index a85d719df1f4..daefd0c546ff 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2925,7 +2925,7 @@ sub process {
 		}
 
 # check for invalid commit id
-		if ($in_commit_log && $line =~ /(^fixes:|\bcommit)\s+([0-9a-f]{6,40})\b/i) {
+		if ($in_commit_log && $line =~ /(\bcommit)\s+([0-9a-f]{6,40})\b/i) {
 			my $id;
 			my $description;
 			($id, $description) = git_commit_info($2, undef, undef);
@@ -2935,6 +2935,25 @@ sub process {
 			}
 		}
 
+# check for invalid fixes tag
+		if ($in_commit_log && $line =~ /(^fixes:)\s+([0-9a-f]{6,40})\b/i) {
+			my $id;
+			my $description;
+			($id, $description) = git_commit_info($2, undef, undef);
+			if (!defined($id)) {
+				WARN("UNKNOWN_COMMIT_ID",
+				     "Unknown commit id '$2', maybe rebased or not pulled?\n" . $herecurr);
+			}
+			if ($1 ne "Fixes:") {
+				WARN("FIXES_TAG_STYLE",
+				     "The fixes tag should be capitalized (Fixes:).\n" . $hereprev);
+			}
+			if (length($2) < 12) {
+				WARN("FIXES_TAG_STYLE",
+				     "SHA1 should be at least 12 digits long.\n" . $hereprev);
+			}
+		}
+
 # ignore non-hunk lines and lines being removed
 		next if (!$hunk_line || $line =~ /^-/);
 
-- 
2.23.0
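
For readers who do not read Perl, the two style checks added above can be approximated in C. This is only a sketch under assumptions: sketch_check_fixes_tag and the SKETCH_TAG_* flags are hypothetical names, the commit lookup done by git_commit_info() is left out, and the matching is slightly looser than the regular expression in the patch (there is no word-boundary check after the hash).

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_TAG_STYLE      0x1	/* tag is not spelled exactly "Fixes:" */
#define SKETCH_TAG_SHORT_SHA1 0x2	/* fewer than 12 hex digits */

/*
 * Returns -1 if the line is not a fixes tag at all, otherwise a bitmask
 * of the SKETCH_TAG_* problems found (0 means the tag looks fine).
 */
static int sketch_check_fixes_tag(const char *line)
{
	static const char tag[] = "Fixes:";
	const size_t tag_len = sizeof(tag) - 1;
	bool exact = true;
	size_t digits = 0;
	int problems = 0;
	const char *p;
	size_t i;

	/* Case-insensitive match of the tag, remembering exact spelling. */
	for (i = 0; i < tag_len; i++) {
		if (tolower((unsigned char)line[i]) !=
		    tolower((unsigned char)tag[i]))
			return -1;
		if (line[i] != tag[i])
			exact = false;
	}

	/* At least one whitespace character, then the hash. */
	p = line + tag_len;
	if (!isspace((unsigned char)*p))
		return -1;
	while (isspace((unsigned char)*p))
		p++;

	while (isxdigit((unsigned char)p[digits]))
		digits++;
	if (digits < 6 || digits > 40)
		return -1;

	if (!exact)
		problems |= SKETCH_TAG_STYLE;
	if (digits < 12)
		problems |= SKETCH_TAG_SHORT_SHA1;

	return problems;
}
```

A lowercase "fixes:" line with a 12-digit hash reports only the style problem, while a capitalized tag with a 7-digit hash reports only the short SHA1.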



Re: [PATCH] net: dsa: rtl8366rb: add missing of_node_put after calling of_get_child_by_name

2019-10-16 Thread Wen Yang



On 2019/10/2 1:03 AM, David Miller wrote:

From: Wen Yang 
Date: Sun, 29 Sep 2019 15:00:47 +0800


of_node_put() needs to be called once the device node obtained
from of_get_child_by_name() is no longer in use.
irq_domain_add_linear() also calls of_node_get() to increase the refcount,
so the irq_domain will not be affected when the node is released.

fixes: d8652956cf37 ("net: dsa: realtek-smi: Add Realtek SMI driver")
Signed-off-by: Wen Yang 


Please capitalize Fixes:, seriously I am very curious where did you
learn to specify the fixes tag non-capitalized?

Patch applied, thanks.



Thank you for your comments.

We checked the code repository and found that both ‘Fixes’ and ‘fixes’
are being used, such as:

commit a53651ec93a8d7ab5b26c5390e0c389048b4b4b6
…
 net: ena: don't wake up tx queue when down
…
 fixes: 1738cd3ed342 (net: ena: Add a driver for Amazon Elastic
Network Adapters (ENA))
…

And,

commit 1df379924304b687263942452836db1d725155df
…
 clk: consoldiate the __clk_get_hw() declarations
…

 Fixes: 59fcdce425b7 ("clk: Remove ifdef for COMMON_CLK in
clk-provider.h")
 fixes: 73e0e496afda ("clkdev: Always allocate a struct clk and call
__clk_get() w/ CCF")
…


We also found that the SHA1 following ‘Fixes:’ should be at least 12
digits long.

So we plan to modify the checkpatch.pl script to check for these issues.


diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index a85d719..ddcd2d0 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2925,7 +2925,7 @@ sub process {
 		}
 
 # check for invalid commit id
-		if ($in_commit_log && $line =~ /(^fixes:|\bcommit)\s+([0-9a-f]{6,40})\b/i) {
+		if ($in_commit_log && $line =~ /(\bcommit)\s+([0-9a-f]{6,40})\b/i) {
 			my $id;
 			my $description;
 			($id, $description) = git_commit_info($2, undef, undef);
@@ -2935,6 +2935,25 @@ sub process {
 			}
 		}
 
+# check for fixes tag
+		if ($in_commit_log && $line =~ /(^fixes:)\s+([0-9a-f]{6,40})\b/i) {
+			my $id;
+			my $description;
+			($id, $description) = git_commit_info($2, undef, undef);
+			if (!defined($id)) {
+				WARN("UNKNOWN_COMMIT_ID",
+				     "Unknown commit id '$2', maybe rebased or not pulled?\n" . $herecurr);
+			}
+			if ($1 ne "Fixes:") {
+				WARN("FIXES_TAG_STYLE",
+				     "The fixes tag should be capitalized (Fixes:).\n" . $hereprev);
+			}
+			if (length($2) < 12) {
+				WARN("FIXES_TAG_STYLE",
+				     "SHA1 should be at least 12 digits long.\n" . $hereprev);
+			}
+		}
+
 # ignore non-hunk lines and lines being removed
 		next if (!$hunk_line || $line =~ /^-/);

--
Best wishes,
Wen Yang


Re: [PATCH] net: mscc: ocelot: add missing of_node_put after calling of_get_child_by_name

2019-10-16 Thread Wen Yang




On 2019/10/2 1:02 AM, David Miller wrote:

From: Wen Yang 
Date: Sun, 29 Sep 2019 14:54:24 +0800


of_node_put() needs to be called once the device node obtained
from of_get_child_by_name() is no longer in use.
'ports' must be released on both the success and the failure paths,
so clean up the code using goto.

fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support")
Signed-off-by: Wen Yang 


Applied.



Thank you for your comments.

We checked the code repository and found that both ‘Fixes’ and ‘fixes’ 
are being used, such as:


commit a53651ec93a8d7ab5b26c5390e0c389048b4b4b6
…
net: ena: don't wake up tx queue when down
…
fixes: 1738cd3ed342 (net: ena: Add a driver for Amazon Elastic 
Network Adapters (ENA))

…

And,

commit 1df379924304b687263942452836db1d725155df
…
clk: consoldiate the __clk_get_hw() declarations
…

Fixes: 59fcdce425b7 ("clk: Remove ifdef for COMMON_CLK in 
clk-provider.h")
fixes: 73e0e496afda ("clkdev: Always allocate a struct clk and call 
__clk_get() w/ CCF")

…


We also found that the SHA1 following ‘Fixes:’ should be at least 12 
digits long.


So we plan to modify the checkpatch.pl script to check for these issues.


diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index a85d719..ddcd2d0 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -2925,7 +2925,7 @@ sub process {
 		}
 
 # check for invalid commit id
-		if ($in_commit_log && $line =~ /(^fixes:|\bcommit)\s+([0-9a-f]{6,40})\b/i) {
+		if ($in_commit_log && $line =~ /(\bcommit)\s+([0-9a-f]{6,40})\b/i) {
 			my $id;
 			my $description;
 			($id, $description) = git_commit_info($2, undef, undef);
@@ -2935,6 +2935,25 @@ sub process {
 			}
 		}
 
+# check for fixes tag
+		if ($in_commit_log && $line =~ /(^fixes:)\s+([0-9a-f]{6,40})\b/i) {
+			my $id;
+			my $description;
+			($id, $description) = git_commit_info($2, undef, undef);
+			if (!defined($id)) {
+				WARN("UNKNOWN_COMMIT_ID",
+				     "Unknown commit id '$2', maybe rebased or not pulled?\n" . $herecurr);
+			}
+			if ($1 ne "Fixes:") {
+				WARN("FIXES_TAG_STYLE",
+				     "The fixes tag should be capitalized (Fixes:).\n" . $hereprev);
+			}
+			if (length($2) < 12) {
+				WARN("FIXES_TAG_STYLE",
+				     "SHA1 should be at least 12 digits long.\n" . $hereprev);
+			}
+		}
+
 # ignore non-hunk lines and lines being removed
 		next if (!$hunk_line || $line =~ /^-/);


--
Best wishes,
Wen Yang





[PATCH] net: dsa: rtl8366rb: add missing of_node_put after calling of_get_child_by_name

2019-09-29 Thread Wen Yang
of_node_put() needs to be called once the device node obtained
from of_get_child_by_name() is no longer in use.
irq_domain_add_linear() also calls of_node_get() to increase the refcount,
so the irq_domain will not be affected when the node is released.

fixes: d8652956cf37 ("net: dsa: realtek-smi: Add Realtek SMI driver")
Signed-off-by: Wen Yang 
Cc: Linus Walleij 
Cc: Andrew Lunn 
Cc: Vivien Didelot 
Cc: Florian Fainelli 
Cc: "David S. Miller" 
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/dsa/rtl8366rb.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/net/dsa/rtl8366rb.c b/drivers/net/dsa/rtl8366rb.c
index a268085..f5cc8b0 100644
--- a/drivers/net/dsa/rtl8366rb.c
+++ b/drivers/net/dsa/rtl8366rb.c
@@ -507,7 +507,8 @@ static int rtl8366rb_setup_cascaded_irq(struct realtek_smi *smi)
 	irq = of_irq_get(intc, 0);
 	if (irq <= 0) {
 		dev_err(smi->dev, "failed to get parent IRQ\n");
-		return irq ? irq : -EINVAL;
+		ret = irq ? irq : -EINVAL;
+		goto out_put_node;
 	}
 
 	/* This clears the IRQ status register */
@@ -515,7 +516,7 @@ static int rtl8366rb_setup_cascaded_irq(struct realtek_smi *smi)
 			  &val);
 	if (ret) {
 		dev_err(smi->dev, "can't read interrupt status\n");
-		return ret;
+		goto out_put_node;
 	}
 
 	/* Fetch IRQ edge information from the descriptor */
@@ -537,7 +538,7 @@ static int rtl8366rb_setup_cascaded_irq(struct realtek_smi *smi)
 				 val);
 	if (ret) {
 		dev_err(smi->dev, "could not configure IRQ polarity\n");
-		return ret;
+		goto out_put_node;
 	}
 
 	ret = devm_request_threaded_irq(smi->dev, irq, NULL,
@@ -545,7 +546,7 @@ static int rtl8366rb_setup_cascaded_irq(struct realtek_smi *smi)
 					"RTL8366RB", smi);
 	if (ret) {
 		dev_err(smi->dev, "unable to request irq: %d\n", ret);
-		return ret;
+		goto out_put_node;
 	}
 	smi->irqdomain = irq_domain_add_linear(intc,
 					       RTL8366RB_NUM_INTERRUPT,
@@ -553,12 +554,15 @@ static int rtl8366rb_setup_cascaded_irq(struct realtek_smi *smi)
 					       smi);
 	if (!smi->irqdomain) {
 		dev_err(smi->dev, "failed to create IRQ domain\n");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out_put_node;
 	}
 	for (i = 0; i < smi->num_ports; i++)
 		irq_set_parent(irq_create_mapping(smi->irqdomain, i), irq);
 
-	return 0;
+out_put_node:
+	of_node_put(intc);
+	return ret;
 }
 
 static int rtl8366rb_set_addr(struct realtek_smi *smi)
-- 
1.8.3.1
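
The shape of the fix above, reduced to a user-space model: every exit taken after the node has been acquired goes through one label that drops the reference. This is a minimal sketch with hypothetical names (struct sketch_node, sketch_setup_irq, the fail_step parameter); it is not the driver code.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct device_node: only the refcount matters. */
struct sketch_node {
	int refcount;
};

/* Plays the role of a getter such as of_get_child_by_name(). */
static struct sketch_node *sketch_node_get(struct sketch_node *np)
{
	if (np)
		np->refcount++;
	return np;
}

/* Plays the role of of_node_put(). */
static void sketch_node_put(struct sketch_node *np)
{
	if (np)
		np->refcount--;
}

/* 'fail_step' (1..3) forces that setup step to fail; 0 means success. */
static int sketch_setup_irq(struct sketch_node *parent, int fail_step)
{
	struct sketch_node *intc;
	int ret = 0;
	int step;

	intc = sketch_node_get(parent);
	if (!intc)
		return -22;	/* nothing acquired, nothing to release */

	for (step = 1; step <= 3; step++) {
		if (step == fail_step) {
			ret = -22;
			goto out_put_node;
		}
	}

out_put_node:
	sketch_node_put(intc);	/* runs on success and on every failure */
	return ret;
}
```

Because the success path falls through the same label, the reference count is balanced no matter which step fails.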



[PATCH] net: mscc: ocelot: add missing of_node_put after calling of_get_child_by_name

2019-09-29 Thread Wen Yang
of_node_put() needs to be called once the device node obtained
from of_get_child_by_name() is no longer in use.
'ports' must be released on both the success and the failure paths,
so clean up the code using goto.

fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support")
Signed-off-by: Wen Yang 
Cc: Alexandre Belloni 
Cc: Microchip Linux Driver Support 
Cc: "David S. Miller" 
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/ethernet/mscc/ocelot_board.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_board.c b/drivers/net/ethernet/mscc/ocelot_board.c
index b063eb7..aac1151 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -388,13 +388,14 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
 			continue;
 
 		phy = of_phy_find_device(phy_node);
+		of_node_put(phy_node);
 		if (!phy)
 			continue;
 
 		err = ocelot_probe_port(ocelot, port, regs, phy);
 		if (err) {
 			of_node_put(portnp);
-			return err;
+			goto out_put_ports;
 		}
 
 		phy_mode = of_get_phy_mode(portnp);
@@ -422,7 +423,8 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
 				"invalid phy mode for port%d, (Q)SGMII only\n",
 				port);
 			of_node_put(portnp);
-			return -EINVAL;
+			err = -EINVAL;
+			goto out_put_ports;
 		}
 
 		serdes = devm_of_phy_get(ocelot->dev, portnp, NULL);
@@ -435,7 +437,8 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
 					"missing SerDes phys for port%d\n",
 					port);
 
-			goto err_probe_ports;
+			of_node_put(portnp);
+			goto out_put_ports;
 		}
 
 		ocelot->ports[port]->serdes = serdes;
@@ -447,9 +450,8 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
 
 	dev_info(&pdev->dev, "Ocelot switch probed\n");
 
-	return 0;
-
-err_probe_ports:
+out_put_ports:
+	of_node_put(ports);
 	return err;
 }
 
-- 
1.8.3.1



[PATCH] can: dev: add missing of_node_put after calling of_get_child_by_name

2019-09-28 Thread Wen Yang
of_node_put() needs to be called once the device node obtained
from of_get_child_by_name() is no longer in use.

fixes: 2290aefa2e90 ("can: dev: Add support for limiting configured bitrate")
Signed-off-by: Wen Yang 
Cc: Wolfgang Grandegger 
Cc: Marc Kleine-Budde 
Cc: "David S. Miller" 
Cc: Franklin S Cooper Jr 
Cc: linux-...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/can/dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/can/dev.c b/drivers/net/can/dev.c
index ac86be5..1c88c36 100644
--- a/drivers/net/can/dev.c
+++ b/drivers/net/can/dev.c
@@ -848,6 +848,7 @@ void of_can_transceiver(struct net_device *dev)
 		return;
 
 	ret = of_property_read_u32(dn, "max-bitrate", &priv->bitrate_max);
+	of_node_put(dn);
 	if ((ret && ret != -EINVAL) || (!ret && !priv->bitrate_max))
 		netdev_warn(dev, "Invalid value for transceiver max bitrate. Ignoring bitrate limit.\n");
 }
-- 
1.8.3.1



[PATCH v7] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-16 Thread Wen Yang
The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f6 ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
v7: adapt to commit ("cpufreq: Make cpufreq_generic_init() return void")
v6: keep the blank line and fix warning: label 'out_unmap_sdcpwr' defined but not used.
v5: put together the code to get, use, and release cpu device_node.
v4: restore the blank line.
v3: fix a leaked reference.
v2: clean up the code according to the advice of Viresh.

 drivers/cpufreq/pasemi-cpufreq.c | 23 +++++++++--------------
 1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 93f39a1..c66f566 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -131,10 +131,18 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	int err = -ENODEV;
 
 	cpu = of_get_cpu_node(policy->cpu, NULL);
+	if (!cpu)
+		goto out;
 
+	max_freqp = of_get_property(cpu, "clock-frequency", NULL);
 	of_node_put(cpu);
-	if (!cpu)
+	if (!max_freqp) {
+		err = -EINVAL;
 		goto out;
+	}
+
+	/* we need the freq in kHz */
+	max_freq = *max_freqp / 1000;
 
 	dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
 	if (!dn)
@@ -171,16 +179,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	}
 
 	pr_debug("init cpufreq on CPU %d\n", policy->cpu);
-
-	max_freqp = of_get_property(cpu, "clock-frequency", NULL);
-	if (!max_freqp) {
-		err = -EINVAL;
-		goto out_unmap_sdcpwr;
-	}
-
-	/* we need the freq in kHz */
-	max_freq = *max_freqp / 1000;
-
 	pr_debug("max clock-frequency is at %u kHz\n", max_freq);
 	pr_debug("initializing frequency table\n");
 
@@ -199,9 +197,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy *policy)
 	cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
 	return 0;
 
-out_unmap_sdcpwr:
-	iounmap(sdcpwr_mapbase);
-
 out_unmap_sdcasr:
 	iounmap(sdcasr_mapbase);
 out:
-- 
2.9.5
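
The ordering that this patch establishes (acquire the node, read what is needed, then release it) can be shown with a small user-space model. It is a sketch under assumptions: struct sketch_cpu_node and sketch_read_max_freq_khz are hypothetical, and unlike the patch the model copies the converted value out before dropping the reference, so nothing derived from the node is touched afterwards.

```c
#include <stddef.h>

/* Hypothetical stand-in for a CPU device node with one property. */
struct sketch_cpu_node {
	int refcount;
	const unsigned int *clock_frequency;	/* in Hz, NULL if absent */
};

/* Returns 0 and stores the frequency in kHz, or a negative error code. */
static int sketch_read_max_freq_khz(struct sketch_cpu_node *cpu,
				    unsigned int *max_freq_khz)
{
	const unsigned int *max_freqp;

	if (!cpu)
		return -19;		/* no node, nothing to release */

	cpu->refcount++;		/* like of_get_cpu_node() */

	max_freqp = cpu->clock_frequency;	/* like of_get_property() */
	if (max_freqp)
		*max_freq_khz = *max_freqp / 1000;	/* we need kHz */

	cpu->refcount--;		/* like of_node_put(): last use is above */

	return max_freqp ? 0 : -22;
}
```

All uses of the node sit between the get and the put, which is the property the use-after-free fix restores.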



[PATCH v3] coccinelle: semantic code search for missing of_node_put

2019-07-15 Thread Wen Yang
There are functions which increment a reference counter for a device node.
These functions belong to a programming interface for the management
of information from device trees.
The counter must be decremented after the last usage of a device node.
We find these functions by using the following script:


@initialize:ocaml@
@@

let relevant_str = "use of_node_put() on it when done"

let contains s1 s2 =
    let re = Str.regexp_string s2
    in
        try ignore (Str.search_forward re s1 0); true
        with Not_found -> false

let relevant_functions = ref []

let add_function f c =
    if not (List.mem f !relevant_functions)
    then
      begin
        let s = String.concat " "
          (
            (List.map String.lowercase_ascii
              (List.filter
                (function x ->
                  Str.string_match
                  (Str.regexp "[a-zA-Z_\\(\\)][-a-zA-Z0-9_\\(\\)]*$")
                x 0) (Str.split (Str.regexp "[ .;\t\n]+") c)))) in
             if contains s relevant_str
             then
               Printf.printf "Found relevant function: %s\n" f;
               relevant_functions := f :: !relevant_functions;
      end

@r@
identifier fn;
comments c;
type T = struct device_node *;
@@

T@c fn(...) {
...
}

@script:ocaml@
f << r.fn;
c << r.c;
@@

let (cb,cm,ca) = List.hd c in
let c = String.concat " " cb in
add_function f c


Then copy the function names found by the above script to the r_miss_put
rule. This rule checks for missing of_node_put.

And this patch also looks for places where an of_node_put() call is on some
paths but not on others (implemented by the r_miss_put_ext rule).

Finally, this patch finds use-after-free issues for a node.
(implemented by the r_use_after_put rule)

Suggested-by: Julia Lawall 
Signed-off-by: Wen Yang 
Cc: Julia Lawall 
Cc: Gilles Muller 
Cc: Nicolas Palix 
Cc: Michal Marek 
Cc: Masahiro Yamada 
Cc: Wen Yang 
Cc: Markus Elfring 
Cc: co...@systeme.lip6.fr
---
v3: delete the global set, add a rule that checks for use-after-free.
v2: improve the commit description and delete duplicate code.

 scripts/coccinelle/free/of_node_put.cocci | 192 ++
 1 file changed, 192 insertions(+)
 create mode 100644 scripts/coccinelle/free/of_node_put.cocci

diff --git a/scripts/coccinelle/free/of_node_put.cocci b/scripts/coccinelle/free/of_node_put.cocci
new file mode 100644
index 0000000..cda43fa
--- /dev/null
+++ b/scripts/coccinelle/free/of_node_put.cocci
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0
+/// Find missing of_node_put
+///
+// Confidence: Moderate
+// Copyright: (C) 2018-2019 Wen Yang, ZTE.
+// Comments:
+// Options: --no-includes --include-headers
+
+virtual report
+virtual org
+
+@initialize:python@
+@@
+
+report_miss_prefix = "ERROR: missing of_node_put; acquired a node pointer with refcount incremented on line "
+report_miss_suffix = ", but without a corresponding object release within this function."
+org_miss_main = "acquired a node pointer with refcount incremented"
+org_miss_sec = "needed of_node_put"
+report_use_after_put = "ERROR: use-after-free; reference preceded by of_node_put on line "
+org_use_after_put_main = "of_node_put"
+org_use_after_put_sec = "reference"
+
+@r_miss_put exists@
+local idexpression struct device_node *x;
+expression e, e1;
+position p1, p2;
+statement S;
+type T, T1;
+@@
+
+* x = @p1\(of_find_all_nodes\|
+ of_get_cpu_node\|
+ of_get_parent\|
+ of_get_next_parent\|
+ of_get_next_child\|
+ of_get_next_cpu_node\|
+ of_get_compatible_child\|
+ of_get_child_by_name\|
+ of_find_node_opts_by_path\|
+ of_find_node_by_name\|
+ of_find_node_by_type\|
+ of_find_compatible_node\|
+ of_find_node_with_property\|
+ of_find_matching_node_and_match\|
+ of_find_node_by_phandle\|
+ of_parse_phandle\|
+ of_find_next_cache_node\|
+ of_get_next_available_child\)(...);
+...
+if (x == NULL || ...) S
+... when != e = (T)x
+when != of_node_put(x)
+when != of_get_next_parent(x)
+when != of_find_matching_node(x, ...)
+when != if (x) { ... return x; }
+when != v4l2_async_notifier_add_fwnode_subdev(..., <+...x...+>, ...)
+when != e1 = of_fwnode_handle(x)
+(
+ if (x) { ... when forall
+ of_node_put(x) ... }
+|
+ return (T1)x;
+|
+ return of_fwnode_handle(x);
+|
+* return@p2 ...;
+)
+
+@script:python depends on report && r_miss_put@
+p1 << r_miss_put.p1;
+p2 << r_miss_put.p2;
+@@
+
+coccilib.report.print_report(p2[0], report_miss_prefix + p1[0].line + report_miss_suffix)
+
+@script:python depends on org && r_miss_put@
+p1 << r_miss_put.p1;
+p2 << r_miss_put.p2;
+@@
+
+cocci.print_main(org_miss_main, p1)
+cocci.print_secs(org_miss_sec, p2)
+

[PATCH 0/2] ASoC: samsung: odroid: fix err handling of odroid_audio_probe

2019-07-12 Thread Wen Yang
We developed a Coccinelle SmPL script, ran it against
sound/soc/samsung/odroid.c, and found some use-after-free problems.
This patch series fixes those problems.

Wen Yang (2):
  ASoC: samsung: odroid: fix an use-after-free issue for codec
  ASoC: samsung: odroid: fix a double-free issue for cpu_dai

 sound/soc/samsung/odroid.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Cc: Krzysztof Kozlowski 
Cc: Sangbeom Kim 
Cc: Sylwester Nawrocki 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org

-- 
2.9.5



[PATCH 2/2] ASoC: samsung: odroid: fix a double-free issue for cpu_dai

2019-07-12 Thread Wen Yang
The cpu_dai variable is still being used after the of_node_put() call,
which may result in double-free:

	of_node_put(cpu_dai);			---> released here

	ret = devm_snd_soc_register_card(dev, card);
	if (ret < 0) {
		...
		goto err_put_clk_i2s;		---> jump to err_put_clk_i2s
	...

err_put_clk_i2s:
	clk_put(priv->clk_i2s_bus);
err_put_sclk:
	clk_put(priv->sclk_i2s);
err_put_cpu_dai:
	of_node_put(cpu_dai);			---> double-free here

Fixes: d832d2b246c5 ("ASoC: samsung: odroid: Fix of_node refcount unbalance")
Signed-off-by: Wen Yang 
Cc: Krzysztof Kozlowski 
Cc: Sangbeom Kim 
Cc: Sylwester Nawrocki 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org
---
 sound/soc/samsung/odroid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sound/soc/samsung/odroid.c b/sound/soc/samsung/odroid.c
index 64ebe89..f0f5fa9 100644
--- a/sound/soc/samsung/odroid.c
+++ b/sound/soc/samsung/odroid.c
@@ -308,7 +308,6 @@ static int odroid_audio_probe(struct platform_device *pdev)
 		ret = PTR_ERR(priv->clk_i2s_bus);
 		goto err_put_sclk;
 	}
-	of_node_put(cpu_dai);
 
 	ret = devm_snd_soc_register_card(dev, card);
 	if (ret < 0) {
@@ -316,6 +315,7 @@ static int odroid_audio_probe(struct platform_device *pdev)
 		goto err_put_clk_i2s;
 	}
 
+	of_node_put(cpu_dai);
 	of_node_put(codec);
 	return 0;
 
-- 
2.9.5
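
The double release described above can be reproduced in a tiny model: the reference is dropped exactly once, either at the end of the success path or in the error label, never both. The names below (struct sketch_dai, sketch_probe, register_fails) are hypothetical and only illustrate the control flow.

```c
/* Hypothetical stand-in for the cpu_dai device node. */
struct sketch_dai {
	int refcount;
};

static void sketch_dai_put(struct sketch_dai *dai)
{
	dai->refcount--;
}

/* 'register_fails' simulates devm_snd_soc_register_card() failing. */
static int sketch_probe(struct sketch_dai *cpu_dai, int register_fails)
{
	int ret = 0;

	cpu_dai->refcount++;		/* reference taken earlier in probe */

	if (register_fails) {
		ret = -22;
		goto err_put_cpu_dai;
	}

	/* Success: release only after the last point that can fail. */
	sketch_dai_put(cpu_dai);
	return 0;

err_put_cpu_dai:
	sketch_dai_put(cpu_dai);	/* the only release on the error path */
	return ret;
}
```

If the success-path put were moved back above the failing step, the error label would drop the reference a second time, which is the bug the patch removes.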



[PATCH 1/2] ASoC: samsung: odroid: fix an use-after-free issue for codec

2019-07-12 Thread Wen Yang
The codec variable is still being used after the of_node_put() call,
which may result in use-after-free.

Fixes: bc3cf17b575a ("ASoC: samsung: odroid: Add support for secondary CPU DAI")
Signed-off-by: Wen Yang 
Cc: Krzysztof Kozlowski 
Cc: Sangbeom Kim 
Cc: Sylwester Nawrocki 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org
---
 sound/soc/samsung/odroid.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/sound/soc/samsung/odroid.c b/sound/soc/samsung/odroid.c
index dfb6e46..64ebe89 100644
--- a/sound/soc/samsung/odroid.c
+++ b/sound/soc/samsung/odroid.c
@@ -284,9 +284,8 @@ static int odroid_audio_probe(struct platform_device *pdev)
 	}
 
 	of_node_put(cpu);
-	of_node_put(codec);
 	if (ret < 0)
-		return ret;
+		goto err_put_node;
 
 	ret = snd_soc_of_get_dai_link_codecs(dev, codec, codec_link);
 	if (ret < 0)
@@ -317,6 +316,7 @@ static int odroid_audio_probe(struct platform_device *pdev)
 		goto err_put_clk_i2s;
 	}
 
+	of_node_put(codec);
 	return 0;
 
 err_put_clk_i2s:
@@ -326,6 +326,8 @@ static int odroid_audio_probe(struct platform_device *pdev)
 err_put_cpu_dai:
 	of_node_put(cpu_dai);
 	snd_soc_of_put_dai_link_codecs(codec_link);
+err_put_node:
+	of_node_put(codec);
 	return ret;
 }
 
-- 
2.9.5



[PATCH v6] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-11 Thread Wen Yang
The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f6 ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
v6: keep the blank line and fix warning: label 'out_unmap_sdcpwr' defined but not used.
v5: put together the code to get, use, and release cpu device_node.
v4: restore the blank line.
v3: fix a leaked reference.
v2: clean up the code according to the advice of viresh.

 drivers/cpufreq/pasemi-cpufreq.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 6b1e4ab..7d557f9 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -131,10 +131,18 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
int err = -ENODEV;
 
cpu = of_get_cpu_node(policy->cpu, NULL);
+   if (!cpu)
+   goto out;
 
+   max_freqp = of_get_property(cpu, "clock-frequency", NULL);
of_node_put(cpu);
-   if (!cpu)
+   if (!max_freqp) {
+   err = -EINVAL;
goto out;
+   }
+
+   /* we need the freq in kHz */
+   max_freq = *max_freqp / 1000;
 
dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
if (!dn)
@@ -171,16 +179,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
}
 
pr_debug("init cpufreq on CPU %d\n", policy->cpu);
-
-   max_freqp = of_get_property(cpu, "clock-frequency", NULL);
-   if (!max_freqp) {
-   err = -EINVAL;
-   goto out_unmap_sdcpwr;
-   }
-
-   /* we need the freq in kHz */
-   max_freq = *max_freqp / 1000;
-
pr_debug("max clock-frequency is at %u kHz\n", max_freq);
pr_debug("initializing frequency table\n");
 
@@ -196,7 +194,11 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
policy->cur = pas_freqs[cur_astate].frequency;
ppc_proc_freq = policy->cur * 1000ul;
 
-   return cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
+   err = cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
+   if (err)
+   goto out_unmap_sdcpwr;
+
+   return 0;
 
 out_unmap_sdcpwr:
iounmap(sdcpwr_mapbase);
-- 
2.9.5
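
Note: the "get, use, and release" grouping from v5/v6 can be sketched in
userspace as follows. This is not kernel code. struct node, node_put(),
node_get_clock() and read_max_freq_khz() are hypothetical stand-ins. One
deliberate difference: the sketch copies the property value before the put,
whereas the patch dereferences max_freqp after of_node_put(). That stricter
ordering is a choice of the sketch, not a statement about what the kernel
requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node with one property. */
struct node {
	int refcount;
	unsigned int clock_hz;	/* stand-in for "clock-frequency"; 0 = absent */
};

/* Stand-in for of_node_put(): drop one reference; NULL is ignored. */
static void node_put(struct node *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	n->refcount--;
}

/* Stand-in for of_get_property(): returns a pointer into the node. */
static const unsigned int *node_get_clock(const struct node *n)
{
	assert(n->refcount > 0);	/* trips on use-after-put */
	return n->clock_hz ? &n->clock_hz : NULL;
}

/* Get, use, and release the node in one place; return the value in kHz. */
static int read_max_freq_khz(struct node *cpu, unsigned int *max_freq_khz)
{
	const unsigned int *max_freqp;

	if (!cpu)
		return -19;		/* mirrors -ENODEV */

	max_freqp = node_get_clock(cpu);
	if (!max_freqp) {
		node_put(cpu);
		return -22;		/* mirrors -EINVAL */
	}

	*max_freq_khz = *max_freqp / 1000;	/* copy while the node is held */
	node_put(cpu);
	return 0;
}
```

After this helper returns, the rest of the function only needs the copied
value, so no later error path has to remember the node reference.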



[PATCH 1/4] ASoC: simple-card: fix an use-after-free in simple_dai_link_of_dpcm()

2019-07-10 Thread Wen Yang
The node variable is still being used after the of_node_put() call,
which may result in use-after-free.

Fixes: cfc652a73331 ("ASoC: simple-card: tidyup prefix for snd_soc_codec_conf")
Signed-off-by: Wen Yang 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: Kuninori Morimoto 
Cc: Jon Hunter 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org
---
 sound/soc/generic/simple-card.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/sound/soc/generic/simple-card.c b/sound/soc/generic/simple-card.c
index e5cde0d..4117e54 100644
--- a/sound/soc/generic/simple-card.c
+++ b/sound/soc/generic/simple-card.c
@@ -124,8 +124,6 @@ static int simple_dai_link_of_dpcm(struct asoc_simple_priv 
*priv,
 
li->link++;
 
-   of_node_put(node);
-
/* For single DAI link & old style of DT node */
if (is_top)
prefix = PREFIX;
@@ -147,17 +145,17 @@ static int simple_dai_link_of_dpcm(struct 
asoc_simple_priv *priv,
 
ret = asoc_simple_parse_cpu(np, dai_link, &is_single_links);
if (ret)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_parse_clk_cpu(dev, np, dai_link, dai);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_set_dailink_name(dev, dai_link,
   "fe.%s",
   cpus->dai_name);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
asoc_simple_canonicalize_cpu(dai_link, is_single_links);
} else {
@@ -180,17 +178,17 @@ static int simple_dai_link_of_dpcm(struct 
asoc_simple_priv *priv,
 
ret = asoc_simple_parse_codec(np, dai_link);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_parse_clk_codec(dev, np, dai_link, dai);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_set_dailink_name(dev, dai_link,
   "be.%s",
   codecs->dai_name);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
/* check "prefix" from top node */
snd_soc_of_parse_node_prefix(top, cconf, codecs->of_node,
@@ -208,19 +206,21 @@ static int simple_dai_link_of_dpcm(struct 
asoc_simple_priv *priv,
 
ret = asoc_simple_parse_tdm(np, dai);
if (ret)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_parse_daifmt(dev, node, codec,
   prefix, &dai_link->dai_fmt);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
dai_link->dpcm_playback = 1;
dai_link->dpcm_capture  = 1;
dai_link->ops   = _ops;
dai_link->init  = asoc_simple_dai_init;
 
-   return 0;
+out_put_node:
+   of_node_put(node);
+   return ret;
 }
 
 static int simple_dai_link_of(struct asoc_simple_priv *priv,
-- 
2.9.5



[PATCH 0/4] Fix some use-after-free problems in sound/soc/generic

2019-07-10 Thread Wen Yang
We developed a Coccinelle SmPL script to check the sound/soc/generic code and
found some use-after-free problems.
This patch series fixes those problems.

Wen Yang (4):
  ASoC: simple-card: fix an use-after-free in simple_dai_link_of_dpcm()
  ASoC: simple-card: fix an use-after-free in simple_for_each_link()
  ASoC: audio-graph-card: fix use-after-free in graph_dai_link_of_dpcm()
  ASoC: audio-graph-card: fix an use-after-free in graph_get_dai_id()

 sound/soc/generic/audio-graph-card.c | 30 --
 sound/soc/generic/simple-card.c  | 26 +-
 2 files changed, 29 insertions(+), 27 deletions(-)

Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: Kuninori Morimoto 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org

-- 
2.9.5



[PATCH 3/4] ASoC: audio-graph-card: fix use-after-free in graph_dai_link_of_dpcm()

2019-07-10 Thread Wen Yang
After calling of_node_put() on the ports, port, and node variables,
they are still being used, which may result in use-after-free.
Fix this issue by calling of_node_put() after the last usage.

Fixes: dd98fbc558a0 ("ASoC: audio-graph-card: cleanup DAI link loop method - step1")
Signed-off-by: Wen Yang 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: Kuninori Morimoto 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org
---
 sound/soc/generic/audio-graph-card.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/sound/soc/generic/audio-graph-card.c 
b/sound/soc/generic/audio-graph-card.c
index 30a4e83..31fc83d 100644
--- a/sound/soc/generic/audio-graph-card.c
+++ b/sound/soc/generic/audio-graph-card.c
@@ -208,10 +208,6 @@ static int graph_dai_link_of_dpcm(struct asoc_simple_priv 
*priv,
 
dev_dbg(dev, "link_of DPCM (%pOF)\n", ep);
 
-   of_node_put(ports);
-   of_node_put(port);
-   of_node_put(node);
-
if (li->cpu) {
int is_single_links = 0;
 
@@ -229,17 +225,17 @@ static int graph_dai_link_of_dpcm(struct asoc_simple_priv 
*priv,
 
ret = asoc_simple_parse_cpu(ep, dai_link, &is_single_links);
if (ret)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_parse_clk_cpu(dev, ep, dai_link, dai);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_set_dailink_name(dev, dai_link,
   "fe.%s",
   cpus->dai_name);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
/* card->num_links includes Codec */
asoc_simple_canonicalize_cpu(dai_link, is_single_links);
@@ -263,17 +259,17 @@ static int graph_dai_link_of_dpcm(struct asoc_simple_priv 
*priv,
 
ret = asoc_simple_parse_codec(ep, dai_link);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_parse_clk_codec(dev, ep, dai_link, dai);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_set_dailink_name(dev, dai_link,
   "be.%s",
   codecs->dai_name);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
/* check "prefix" from top node */
snd_soc_of_parse_node_prefix(top, cconf, codecs->of_node,
@@ -293,19 +289,23 @@ static int graph_dai_link_of_dpcm(struct asoc_simple_priv 
*priv,
 
ret = asoc_simple_parse_tdm(ep, dai);
if (ret)
-   return ret;
+   goto out_put_node;
 
ret = asoc_simple_parse_daifmt(dev, cpu_ep, codec_ep,
   NULL, &dai_link->dai_fmt);
if (ret < 0)
-   return ret;
+   goto out_put_node;
 
dai_link->dpcm_playback = 1;
dai_link->dpcm_capture  = 1;
dai_link->ops   = _ops;
dai_link->init  = asoc_simple_dai_init;
 
-   return 0;
+out_put_node:
+   of_node_put(ports);
+   of_node_put(port);
+   of_node_put(node);
+   return ret;
 }
 
 static int graph_dai_link_of(struct asoc_simple_priv *priv,
-- 
2.9.5



[PATCH 2/4] ASoC: simple-card: fix an use-after-free in simple_for_each_link()

2019-07-10 Thread Wen Yang
The codec variable is still being used after the of_node_put() call,
which may result in use-after-free.

Fixes: d947cdfd4be2 ("ASoC: simple-card: cleanup DAI link loop method - step1")
Signed-off-by: Wen Yang 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: Kuninori Morimoto 
Cc: Jon Hunter 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org
---
 sound/soc/generic/simple-card.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/sound/soc/generic/simple-card.c b/sound/soc/generic/simple-card.c
index 4117e54..ef84915 100644
--- a/sound/soc/generic/simple-card.c
+++ b/sound/soc/generic/simple-card.c
@@ -364,8 +364,6 @@ static int simple_for_each_link(struct asoc_simple_priv 
*priv,
goto error;
}
 
-   of_node_put(codec);
-
/* get convert-xxx property */
memset(&adata, 0, sizeof(adata));
for_each_child_of_node(node, np)
@@ -387,11 +385,13 @@ static int simple_for_each_link(struct asoc_simple_priv 
*priv,
ret = func_noml(priv, np, codec, li, is_top);
 
if (ret < 0) {
+   of_node_put(codec);
of_node_put(np);
goto error;
}
}
 
+   of_node_put(codec);
node = of_get_next_child(top, node);
} while (!is_top && node);
 
-- 
2.9.5
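
Note: the per-iteration reference handling in the loop above can be sketched
in userspace as follows. This is not kernel code. struct node, node_get(),
node_put(), handle_link() and for_each_link() are hypothetical stand-ins. The
sketch keeps the two put sites of the patch (one on the error exit, one at the
end of a normal iteration).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node: only a reference count. */
struct node {
	int refcount;
};

/* Stand-in for of_node_get(): take one reference. */
static struct node *node_get(struct node *n)
{
	if (n)
		n->refcount++;
	return n;
}

/* Stand-in for of_node_put(): drop one reference; NULL is ignored. */
static void node_put(struct node *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	n->refcount--;
}

/* Hypothetical per-link callback that still reads codec. */
static int handle_link(const struct node *codec, int fail)
{
	assert(codec->refcount > 0);	/* trips on use-after-put */
	return fail ? -22 : 0;		/* -22 mirrors -EINVAL */
}

/* Each iteration takes one reference and drops it on every way out. */
static int for_each_link(struct node *codecs[], const int fail[], size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct node *codec = node_get(codecs[i]);
		int ret = handle_link(codec, fail[i]);

		if (ret < 0) {
			node_put(codec);	/* error exit */
			return ret;
		}

		node_put(codec);		/* end of a normal iteration */
	}

	return 0;
}
```

Whichever way the loop is left, the counts return to their starting values,
and no callback runs on a node whose reference was already dropped.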



[PATCH 4/4] ASoC: audio-graph-card: fix an use-after-free in graph_get_dai_id()

2019-07-10 Thread Wen Yang
After calling of_node_put() on the node variable, it is still being
used, which may result in use-after-free.
Fix this issue by calling of_node_put() after the last usage.

Fixes: a0c426fe1433 ("ASoC: simple-card-utils: check "reg" property on asoc_simple_card_get_dai_id()")
Signed-off-by: Wen Yang 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: Kuninori Morimoto 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org
---
 sound/soc/generic/audio-graph-card.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/sound/soc/generic/audio-graph-card.c 
b/sound/soc/generic/audio-graph-card.c
index 31fc83d..c8abb86 100644
--- a/sound/soc/generic/audio-graph-card.c
+++ b/sound/soc/generic/audio-graph-card.c
@@ -63,6 +63,7 @@ static int graph_get_dai_id(struct device_node *ep)
struct device_node *endpoint;
struct of_endpoint info;
int i, id;
+   u32 *reg;
int ret;
 
/* use driver specified DAI ID if exist */
@@ -83,8 +84,9 @@ static int graph_get_dai_id(struct device_node *ep)
return info.id;
 
node = of_get_parent(ep);
+   reg = of_get_property(node, "reg", NULL);
of_node_put(node);
-   if (of_get_property(node, "reg", NULL))
+   if (reg)
return info.port;
}
node = of_graph_get_port_parent(ep);
-- 
2.9.5



[tip:irq/urgent] irqchip/renesas-rza1: Prevent use-after-free in rza1_irqc_probe()

2019-07-09 Thread tip-bot for Wen Yang
Commit-ID:  7c8e90ddf02f139a90bc29c04302e9914818f0c8
Gitweb: https://git.kernel.org/tip/7c8e90ddf02f139a90bc29c04302e9914818f0c8
Author: Wen Yang 
AuthorDate: Mon, 8 Jul 2019 14:19:04 +0800
Committer:  Thomas Gleixner 
CommitDate: Tue, 9 Jul 2019 14:53:50 +0200

irqchip/renesas-rza1: Prevent use-after-free in rza1_irqc_probe()

The gic_node is still being used in the rza1_irqc_parse_map() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a644ccb819bc ("irqchip: Add Renesas RZ/A1 Interrupt Controller driver")
Signed-off-by: Wen Yang 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Geert Uytterhoeven 
Link: https://lkml.kernel.org/r/1562566745-7447-3-git-send-email-wen.yan...@zte.com.cn
---
 drivers/irqchip/irq-renesas-rza1.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/irqchip/irq-renesas-rza1.c 
b/drivers/irqchip/irq-renesas-rza1.c
index b1f19b210190..b0d46ac42b89 100644
--- a/drivers/irqchip/irq-renesas-rza1.c
+++ b/drivers/irqchip/irq-renesas-rza1.c
@@ -208,20 +208,19 @@ static int rza1_irqc_probe(struct platform_device *pdev)
return PTR_ERR(priv->base);
 
gic_node = of_irq_find_parent(np);
-   if (gic_node) {
+   if (gic_node)
parent = irq_find_host(gic_node);
-   of_node_put(gic_node);
-   }
 
if (!parent) {
dev_err(dev, "cannot find parent domain\n");
-   return -ENODEV;
+   ret = -ENODEV;
+   goto out_put_node;
}
 
ret = rza1_irqc_parse_map(priv, gic_node);
if (ret) {
dev_err(dev, "cannot parse %s: %d\n", "interrupt-map", ret);
-   return ret;
+   goto out_put_node;
}
 
priv->chip.name = "rza1-irqc",
@@ -237,10 +236,12 @@ static int rza1_irqc_probe(struct platform_device *pdev)
priv);
if (!priv->irq_domain) {
dev_err(dev, "cannot initialize irq domain\n");
-   return -ENOMEM;
+   ret = -ENOMEM;
}
 
-   return 0;
+out_put_node:
+   of_node_put(gic_node);
+   return ret;
 }
 
 static int rza1_irqc_remove(struct platform_device *pdev)


[PATCH 2/2] powerpc/83xx: cleanup error paths in mpc831x_usb_cfg()

2019-07-09 Thread Wen Yang
Rename the jump labels according to the cleanup they perform,
and move reference handling to simplify cleanup.

Signed-off-by: Wen Yang 
Cc: Scott Wood 
Cc: Kumar Gala 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Markus Elfring 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
---
 arch/powerpc/platforms/83xx/usb.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/83xx/usb.c 
b/arch/powerpc/platforms/83xx/usb.c
index 19dcef5..56b36fa 100644
--- a/arch/powerpc/platforms/83xx/usb.c
+++ b/arch/powerpc/platforms/83xx/usb.c
@@ -160,11 +160,9 @@ int mpc831x_usb_cfg(void)
 
/* Map USB SOC space */
ret = of_address_to_resource(np, 0, &res);
-   if (ret) {
-   of_node_put(immr_node);
-   of_node_put(np);
-   return ret;
-   }
+   if (ret)
+   goto out_put_node;
+
usb_regs = ioremap(res.start, resource_size(&res));
 
/* Using on-chip PHY */
@@ -173,7 +171,7 @@ int mpc831x_usb_cfg(void)
u32 refsel;
 
if (of_device_is_compatible(immr_node, "fsl,mpc8308-immr"))
-   goto out;
+   goto out_unmap;
 
if (of_device_is_compatible(immr_node, "fsl,mpc8315-immr"))
refsel = CONTROL_REFSEL_24MHZ;
@@ -200,8 +198,9 @@ int mpc831x_usb_cfg(void)
ret = -EINVAL;
}
 
-out:
+out_unmap:
iounmap(usb_regs);
+out_put_node:
of_node_put(immr_node);
of_node_put(np);
return ret;
-- 
2.9.5
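
Note: the label layout after this cleanup (each label named after what it
undoes, with resources released in reverse order of acquisition) can be
sketched in userspace as follows. This is not kernel code. struct node,
node_put(), map_regs(), unmap_regs() and usb_cfg() are hypothetical stand-ins,
and the two error parameters only simulate failures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node: only a reference count. */
struct node {
	int refcount;
};

/* Stand-in for of_node_put(): drop one reference; NULL is ignored. */
static void node_put(struct node *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	n->refcount--;
}

static bool mapped;	/* stand-in for the ioremap() state of usb_regs */

static void map_regs(void)
{
	mapped = true;
}

static void unmap_regs(void)
{
	assert(mapped);	/* unmapping something never mapped is a bug */
	mapped = false;
}

/* The caller hands over one reference each to immr_node and np. */
static int usb_cfg(struct node *immr_node, struct node *np,
		   int resource_err, int phy_err)
{
	int ret;

	ret = resource_err;	/* stand-in for of_address_to_resource() */
	if (ret)
		goto out_put_node;	/* nothing is mapped yet */

	map_regs();

	if (phy_err) {
		ret = phy_err;
		goto out_unmap;
	}

out_unmap:
	unmap_regs();
out_put_node:
	node_put(immr_node);
	node_put(np);
	return ret;
}
```

An early failure jumps past out_unmap, so unmap_regs() never runs for a
mapping that was not created, while later failures fall through both labels.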



[PATCH 1/2] powerpc/83xx: fix use-after-free in mpc831x_usb_cfg()

2019-07-09 Thread Wen Yang
The immr_node variable is still being used after the of_node_put() call,
which may result in use-after-free.
Fix this issue by calling of_node_put() after the last usage.

Fixes: fd066e850351 ("powerpc/mpc8308: fix USB DR controller initialization")
Signed-off-by: Wen Yang 
Cc: Scott Wood 
Cc: Kumar Gala 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Markus Elfring 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
---
 arch/powerpc/platforms/83xx/usb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/83xx/usb.c 
b/arch/powerpc/platforms/83xx/usb.c
index 3d247d7..19dcef5 100644
--- a/arch/powerpc/platforms/83xx/usb.c
+++ b/arch/powerpc/platforms/83xx/usb.c
@@ -158,11 +158,10 @@ int mpc831x_usb_cfg(void)
 
iounmap(immap);
 
-   of_node_put(immr_node);
-
/* Map USB SOC space */
ret = of_address_to_resource(np, 0, &res);
if (ret) {
+   of_node_put(immr_node);
of_node_put(np);
return ret;
}
@@ -203,6 +202,7 @@ int mpc831x_usb_cfg(void)
 
 out:
iounmap(usb_regs);
+   of_node_put(immr_node);
of_node_put(np);
return ret;
 }
-- 
2.9.5



[PATCH 0/2] fix use-after-free in mpc831x_usb_cfg() and do some cleanups

2019-07-09 Thread Wen Yang
Fix use-after-free in mpc831x_usb_cfg() and do some cleanups.
According to Markus's suggestion, split it into two small patches:
https://lkml.org/lkml/2019/7/8/520

Wen Yang (2):
  powerpc/83xx: fix use-after-free in mpc831x_usb_cfg()
  powerpc/83xx: cleanup error paths in mpc831x_usb_cfg()

 arch/powerpc/platforms/83xx/usb.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

-- 
2.9.5



[PATCH v5] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-09 Thread Wen Yang
The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
v5: put together the code to get, use, and release cpu device_node.
v4: restore the blank line.
v3: fix a leaked reference.
v2: clean up the code according to the advice of viresh.

 drivers/cpufreq/pasemi-cpufreq.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 6b1e4ab..1f0beb7 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -131,10 +131,17 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
int err = -ENODEV;
 
cpu = of_get_cpu_node(policy->cpu, NULL);
-
-   of_node_put(cpu);
if (!cpu)
goto out;
+   max_freqp = of_get_property(cpu, "clock-frequency", NULL);
+   of_node_put(cpu);
+   if (!max_freqp) {
+   err = -EINVAL;
+   goto out;
+   }
+
+   /* we need the freq in kHz */
+   max_freq = *max_freqp / 1000;
 
dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
if (!dn)
@@ -171,16 +178,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
}
 
pr_debug("init cpufreq on CPU %d\n", policy->cpu);
-
-   max_freqp = of_get_property(cpu, "clock-frequency", NULL);
-   if (!max_freqp) {
-   err = -EINVAL;
-   goto out_unmap_sdcpwr;
-   }
-
-   /* we need the freq in kHz */
-   max_freq = *max_freqp / 1000;
-
pr_debug("max clock-frequency is at %u kHz\n", max_freq);
pr_debug("initializing frequency table\n");
 
-- 
2.9.5



[PATCH v4] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-08 Thread Wen Yang
The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
v4: restore the blank line.
v3: fix a leaked reference.
v2: clean up the code according to the advice of viresh.

 drivers/cpufreq/pasemi-cpufreq.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 6b1e4ab..f0c98fc 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -128,20 +128,21 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
int cur_astate, idx;
struct resource res;
struct device_node *cpu, *dn;
-   int err = -ENODEV;
+   int err;
 
cpu = of_get_cpu_node(policy->cpu, NULL);
 
-   of_node_put(cpu);
if (!cpu)
-   goto out;
+   return -ENODEV;
 
dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
if (!dn)
dn = of_find_compatible_node(NULL, NULL,
 "pasemi,pwrficient-sdc");
-   if (!dn)
+   if (!dn) {
+   err = -ENODEV;
goto out;
+   }
err = of_address_to_resource(dn, 0, &res);
of_node_put(dn);
if (err)
@@ -196,6 +197,7 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
policy->cur = pas_freqs[cur_astate].frequency;
ppc_proc_freq = policy->cur * 1000ul;
 
+   of_node_put(cpu);
return cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
 
 out_unmap_sdcpwr:
@@ -204,6 +206,7 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
 out_unmap_sdcasr:
iounmap(sdcasr_mapbase);
 out:
+   of_node_put(cpu);
return err;
 }
 
-- 
2.9.5



[PATCH v3] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-08 Thread Wen Yang
The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
v3: fix a leaked reference.
v2: clean up the code according to the advice of viresh.

 drivers/cpufreq/pasemi-cpufreq.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 6b1e4ab..9dc5163 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -128,20 +128,20 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
int cur_astate, idx;
struct resource res;
struct device_node *cpu, *dn;
-   int err = -ENODEV;
+   int err;
 
cpu = of_get_cpu_node(policy->cpu, NULL);
-
-   of_node_put(cpu);
if (!cpu)
-   goto out;
+   return -ENODEV;
 
dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
if (!dn)
dn = of_find_compatible_node(NULL, NULL,
 "pasemi,pwrficient-sdc");
-   if (!dn)
+   if (!dn) {
+   err = -ENODEV;
goto out;
+   }
err = of_address_to_resource(dn, 0, &res);
of_node_put(dn);
if (err)
@@ -196,6 +196,7 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
policy->cur = pas_freqs[cur_astate].frequency;
ppc_proc_freq = policy->cur * 1000ul;
 
+   of_node_put(cpu);
return cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
 
 out_unmap_sdcpwr:
@@ -204,6 +205,7 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
 out_unmap_sdcasr:
iounmap(sdcasr_mapbase);
 out:
+   of_node_put(cpu);
return err;
 }
 
-- 
2.9.5



[PATCH v2] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-08 Thread Wen Yang
The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
v2: clean up the code according to the advice of viresh.

 drivers/cpufreq/pasemi-cpufreq.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 6b1e4ab..c6d464b 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -128,20 +128,18 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
int cur_astate, idx;
struct resource res;
struct device_node *cpu, *dn;
-   int err = -ENODEV;
+   int err;
 
cpu = of_get_cpu_node(policy->cpu, NULL);
-
-   of_node_put(cpu);
if (!cpu)
-   goto out;
+   return -ENODEV;
 
dn = of_find_compatible_node(NULL, NULL, "1682m-sdc");
if (!dn)
dn = of_find_compatible_node(NULL, NULL,
 "pasemi,pwrficient-sdc");
if (!dn)
-   goto out;
+   return -ENODEV;
err = of_address_to_resource(dn, 0, &res);
of_node_put(dn);
if (err)
@@ -196,6 +194,7 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
policy->cur = pas_freqs[cur_astate].frequency;
ppc_proc_freq = policy->cur * 1000ul;
 
+   of_node_put(cpu);
return cpufreq_generic_init(policy, pas_freqs, get_gizmo_latency());
 
 out_unmap_sdcpwr:
@@ -204,6 +203,7 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
 out_unmap_sdcasr:
iounmap(sdcasr_mapbase);
 out:
+   of_node_put(cpu);
return err;
 }
 
-- 
2.9.5



[PATCH] irqchip: renesas-rza1: fix an use-after-free in rza1_irqc_probe()

2019-07-08 Thread Wen Yang
The gic_node is still being used in the rza1_irqc_parse_map() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a644ccb819bc ("irqchip: Add Renesas RZ/A1 Interrupt Controller driver")
Signed-off-by: Wen Yang 
Cc: Thomas Gleixner 
Cc: Jason Cooper 
Cc: Marc Zyngier 
Cc: Geert Uytterhoeven 
Cc: Chris Brandt 
Cc: Simon Horman 
Cc: linux-kernel@vger.kernel.org
---
 drivers/irqchip/irq-renesas-rza1.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/irqchip/irq-renesas-rza1.c 
b/drivers/irqchip/irq-renesas-rza1.c
index b1f19b21..b0d46ac 100644
--- a/drivers/irqchip/irq-renesas-rza1.c
+++ b/drivers/irqchip/irq-renesas-rza1.c
@@ -208,20 +208,19 @@ static int rza1_irqc_probe(struct platform_device *pdev)
return PTR_ERR(priv->base);
 
gic_node = of_irq_find_parent(np);
-   if (gic_node) {
+   if (gic_node)
parent = irq_find_host(gic_node);
-   of_node_put(gic_node);
-   }
 
if (!parent) {
dev_err(dev, "cannot find parent domain\n");
-   return -ENODEV;
+   ret = -ENODEV;
+   goto out_put_node;
}
 
ret = rza1_irqc_parse_map(priv, gic_node);
if (ret) {
dev_err(dev, "cannot parse %s: %d\n", "interrupt-map", ret);
-   return ret;
+   goto out_put_node;
}
 
priv->chip.name = "rza1-irqc",
@@ -237,10 +236,12 @@ static int rza1_irqc_probe(struct platform_device *pdev)
priv);
if (!priv->irq_domain) {
dev_err(dev, "cannot initialize irq domain\n");
-   return -ENOMEM;
+   ret = -ENOMEM;
}
 
-   return 0;
+out_put_node:
+   of_node_put(gic_node);
+   return ret;
 }
 
 static int rza1_irqc_remove(struct platform_device *pdev)
-- 
2.9.5



[PATCH] phy: ti: am654-serdes: fix an use-after-free in serdes_am654_clk_register()

2019-07-08 Thread Wen Yang
The regmap_node variable is still being used in the syscon_node_to_regmap()
call after the of_node_put() call, which may result in use-after-free.

Fixes: 71e2f5c5c224 ("phy: ti: Add a new SERDES driver for TI's AM654x SoC")
Signed-off-by: Wen Yang 
Cc: Kishon Vijay Abraham I 
Cc: Roger Quadros 
Cc: linux-kernel@vger.kernel.org
---
 drivers/phy/ti/phy-am654-serdes.c | 33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/phy/ti/phy-am654-serdes.c 
b/drivers/phy/ti/phy-am654-serdes.c
index f8edd08..f14f1f0 100644
--- a/drivers/phy/ti/phy-am654-serdes.c
+++ b/drivers/phy/ti/phy-am654-serdes.c
@@ -405,6 +405,7 @@ static int serdes_am654_clk_register(struct serdes_am654 
*am654_phy,
const __be32 *addr;
unsigned int reg;
struct clk *clk;
+   int ret = 0;
 
mux = devm_kzalloc(dev, sizeof(*mux), GFP_KERNEL);
if (!mux)
@@ -413,34 +414,40 @@ static int serdes_am654_clk_register(struct serdes_am654 
*am654_phy,
init = >clk_data;
 
regmap_node = of_parse_phandle(node, "ti,serdes-clk", 0);
-   of_node_put(regmap_node);
if (!regmap_node) {
dev_err(dev, "Fail to get serdes-clk node\n");
-   return -ENODEV;
+   ret = -ENODEV;
+   goto out_put_node;
}
 
regmap = syscon_node_to_regmap(regmap_node->parent);
if (IS_ERR(regmap)) {
dev_err(dev, "Fail to get Syscon regmap\n");
-   return PTR_ERR(regmap);
+   ret = PTR_ERR(regmap);
+   goto out_put_node;
}
 
num_parents = of_clk_get_parent_count(node);
if (num_parents < 2) {
dev_err(dev, "SERDES clock must have parents\n");
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_put_node;
}
 
parent_names = devm_kzalloc(dev, (sizeof(char *) * num_parents),
GFP_KERNEL);
-   if (!parent_names)
-   return -ENOMEM;
+   if (!parent_names) {
+   ret = -ENOMEM;
+   goto out_put_node;
+   }
 
of_clk_parent_fill(node, parent_names, num_parents);
 
addr = of_get_address(regmap_node, 0, NULL, NULL);
-   if (!addr)
-   return -EINVAL;
+   if (!addr) {
+   ret = -EINVAL;
+   goto out_put_node;
+   }
 
reg = be32_to_cpu(*addr);
 
@@ -456,12 +463,16 @@ static int serdes_am654_clk_register(struct serdes_am654 
*am654_phy,
mux->hw.init = init;
 
clk = devm_clk_register(dev, &mux->hw);
-   if (IS_ERR(clk))
-   return PTR_ERR(clk);
+   if (IS_ERR(clk)) {
+   ret = PTR_ERR(clk);
+   goto out_put_node;
+   }
 
am654_phy->clks[clock_num] = clk;
 
-   return 0;
+out_put_node:
+   of_node_put(regmap_node);
+   return ret;
 }
 
 static const struct of_device_id serdes_am654_id_table[] = {
-- 
2.9.5



[PATCH] cpufreq/pasemi: fix an use-after-free in pas_cpufreq_cpu_init()

2019-07-08 Thread Wen Yang
The cpu variable is still being used in the of_get_property() call
after the of_node_put() call, which may result in use-after-free.

Fixes: a9acc26b75f ("cpufreq/pasemi: fix possible object reference leak")
Signed-off-by: Wen Yang 
Cc: "Rafael J. Wysocki" 
Cc: Viresh Kumar 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/cpufreq/pasemi-cpufreq.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/cpufreq/pasemi-cpufreq.c b/drivers/cpufreq/pasemi-cpufreq.c
index 6b1e4ab..d2dd47b 100644
--- a/drivers/cpufreq/pasemi-cpufreq.c
+++ b/drivers/cpufreq/pasemi-cpufreq.c
@@ -132,7 +132,6 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
 
cpu = of_get_cpu_node(policy->cpu, NULL);
 
-   of_node_put(cpu);
if (!cpu)
goto out;
 
@@ -141,15 +140,15 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
dn = of_find_compatible_node(NULL, NULL,
 "pasemi,pwrficient-sdc");
if (!dn)
-   goto out;
+   goto out_put_cpu_node;
err = of_address_to_resource(dn, 0, &res);
of_node_put(dn);
if (err)
-   goto out;
+   goto out_put_cpu_node;
sdcasr_mapbase = ioremap(res.start + SDCASR_OFFSET, 0x2000);
if (!sdcasr_mapbase) {
err = -EINVAL;
-   goto out;
+   goto out_put_cpu_node;
}
 
dn = of_find_compatible_node(NULL, NULL, "1682m-gizmo");
@@ -177,6 +176,7 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
err = -EINVAL;
goto out_unmap_sdcpwr;
}
+   of_node_put(cpu);
 
/* we need the freq in kHz */
max_freq = *max_freqp / 1000;
@@ -203,6 +203,8 @@ static int pas_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
 
 out_unmap_sdcasr:
iounmap(sdcasr_mapbase);
+out_put_cpu_node:
+   of_node_put(cpu);
 out:
return err;
 }
-- 
2.9.5



[PATCH] net: pasemi: fix an use-after-free in pasemi_mac_phy_init()

2019-07-05 Thread Wen Yang
The phy_dn variable is still being used in of_phy_connect() after the
of_node_put() call, which may result in use-after-free.

Fixes: 1dd2d06c0459 ("net: Rework pasemi_mac driver to use of_mdio infrastructure")
Signed-off-by: Wen Yang 
Cc: "David S. Miller" 
Cc: Thomas Gleixner 
Cc: Luis Chamberlain 
Cc: Michael Ellerman 
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/ethernet/pasemi/pasemi_mac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/pasemi/pasemi_mac.c 
b/drivers/net/ethernet/pasemi/pasemi_mac.c
index bf5a7bc..be66601 100644
--- a/drivers/net/ethernet/pasemi/pasemi_mac.c
+++ b/drivers/net/ethernet/pasemi/pasemi_mac.c
@@ -1042,7 +1042,6 @@ static int pasemi_mac_phy_init(struct net_device *dev)
 
dn = pci_device_to_OF_node(mac->pdev);
phy_dn = of_parse_phandle(dn, "phy-handle", 0);
-   of_node_put(phy_dn);
 
mac->link = 0;
mac->speed = 0;
@@ -1051,6 +1050,7 @@ static int pasemi_mac_phy_init(struct net_device *dev)
phydev = of_phy_connect(dev, phy_dn, _adjust_link, 0,
PHY_INTERFACE_MODE_SGMII);
 
+   of_node_put(phy_dn);
if (!phydev) {
printk(KERN_ERR "%s: Could not attach to phy\n", dev->name);
return -ENODEV;
-- 
2.9.5



[PATCH] net: axienet: fix a potential double free in axienet_probe()

2019-07-05 Thread Wen Yang
There is a possible double-free issue in axienet_probe():

1701:   np = of_parse_phandle(pdev->dev.of_node, "axistream-connected", 0);
1702:   if (np) {
...
1787:   of_node_put(np); ---> released here
1788:   lp->eth_irq = platform_get_irq(pdev, 0);
1789:   } else {
...
1801:   }
1802:   if (IS_ERR(lp->dma_regs)) {
...
1805:   of_node_put(np); ---> double released here
1806:   goto free_netdev;
1807:   }

We solve this problem by removing the unnecessary of_node_put().

Fixes: 28ef9ebdb64c ("net: axienet: make use of axistream-connected attribute optional")
Signed-off-by: Wen Yang 
Cc: Anirudha Sarangi 
Cc: John Linn 
Cc: "David S. Miller" 
Cc: Michal Simek 
Cc: Robert Hancock 
Cc: net...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/ethernet/xilinx/xilinx_axienet_main.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c 
b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
index 561e28a..4fc627f 100644
--- a/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
+++ b/drivers/net/ethernet/xilinx/xilinx_axienet_main.c
@@ -1802,7 +1802,6 @@ static int axienet_probe(struct platform_device *pdev)
if (IS_ERR(lp->dma_regs)) {
dev_err(&pdev->dev, "could not map DMA regs\n");
ret = PTR_ERR(lp->dma_regs);
-   of_node_put(np);
goto free_netdev;
}
if ((lp->rx_irq <= 0) || (lp->tx_irq <= 0)) {
-- 
2.9.5
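
To make the double release concrete, here is a minimal userspace sketch of
the reference counting on the failing path. It is an illustration only:
struct mock_node and its helpers are hypothetical, and the device tree is
assumed to own one reference of its own on the node.

```c
#include <assert.h>

/* Hypothetical stand-in for struct device_node. */
struct mock_node {
	int refcount;
};

static void mock_get(struct mock_node *n)
{
	n->refcount++;
}

static void mock_put(struct mock_node *n)
{
	assert(n->refcount > 0);
	n->refcount--;
}

/*
 * Returns the refcount left on the node after a probe that fails at the
 * "could not map DMA regs" step.
 */
static int refs_after_failed_probe(int extra_put_on_error)
{
	struct mock_node np = { 1 };	/* 1 = reference owned by the tree */

	mock_get(&np);			/* lookup of "axistream-connected" */
	mock_put(&np);			/* put at the end of the if (np) branch */

	if (extra_put_on_error)
		mock_put(&np);		/* the put removed by the patch */

	return np.refcount;
}
```

With the extra put the count reaches zero although the tree still points
at the node; without it the tree's own reference is left intact.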



[PATCH] can: flexcan: fix a use-after-free in flexcan_setup_stop_mode()

2019-07-05 Thread Wen Yang
The gpr_np variable is still being used in dev_dbg() after the
of_node_put() call, which may result in use-after-free.

Fixes: de3578c198c6 ("can: flexcan: add self wakeup support")
Signed-off-by: Wen Yang 
Cc: Wolfgang Grandegger 
Cc: Marc Kleine-Budde 
Cc: "David S. Miller" 
Cc: linux-...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/can/flexcan.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/can/flexcan.c b/drivers/net/can/flexcan.c
index f2fe344..33ce45d 100644
--- a/drivers/net/can/flexcan.c
+++ b/drivers/net/can/flexcan.c
@@ -1437,10 +1437,10 @@ static int flexcan_setup_stop_mode(struct 
platform_device *pdev)
 
priv = netdev_priv(dev);
priv->stm.gpr = syscon_node_to_regmap(gpr_np);
-   of_node_put(gpr_np);
if (IS_ERR(priv->stm.gpr)) {
dev_dbg(&pdev->dev, "could not find gpr regmap\n");
-   return PTR_ERR(priv->stm.gpr);
+   ret = PTR_ERR(priv->stm.gpr);
+   goto out_put_node;
}
 
priv->stm.req_gpr = out_val[1];
@@ -1455,7 +1455,9 @@ static int flexcan_setup_stop_mode(struct platform_device 
*pdev)
 
device_set_wakeup_capable(&pdev->dev, true);
 
-   return 0;
+out_put_node:
+   of_node_put(gpr_np);
+   return ret;
 }
 
 static const struct of_device_id flexcan_of_match[] = {
-- 
2.9.5
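
The single exit label used by this patch can be sketched in userspace as
follows. The names are hypothetical: mock_get() and mock_put() stand in
for the node lookup and of_node_put(), and the counter "outstanding" only
tracks how many references are currently held.

```c
#include <assert.h>
#include <errno.h>

static int outstanding;	/* references taken but not yet dropped */

static void mock_get(void)
{
	outstanding++;
}

static void mock_put(void)
{
	assert(outstanding > 0);
	outstanding--;
}

/* Every path after the lookup leaves through out_put_node. */
static int setup(int regmap_fails)
{
	int ret = 0;

	mock_get();		/* lookup of gpr_np */

	if (regmap_fails) {
		ret = -ENODEV;
		goto out_put_node;
	}

	/* ... remaining setup that may still read the node ... */

out_put_node:
	mock_put();
	return ret;
}
```

Because the put sits behind the label, the error path and the success
path both drop the reference after the last place the node is used.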



[PATCH] powerpc/prom: fix use-after-free on cpu_to_chip_id()

2019-07-05 Thread Wen Yang
The np variable is still being used after the of_node_put() call,
which may result in use-after-free.
We fix this issue by calling of_node_put() after the last usage.

Fixes: 3eb906c6b6c1 ("powerpc: Make cpu_to_chip_id() available when SMP=n")
Signed-off-by: Wen Yang 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Christophe Leroy 
Cc: Andrew Morton 
Cc: Mike Rapoport 
Cc: Greg Kroah-Hartman 
Cc: "Aneesh Kumar K.V" 
Cc: Thomas Gleixner 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
---
 arch/powerpc/kernel/prom.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 7159e79..ad87b94 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -872,13 +872,15 @@ EXPORT_SYMBOL(of_get_ibm_chip_id);
 int cpu_to_chip_id(int cpu)
 {
struct device_node *np;
+   int id;
 
np = of_get_cpu_node(cpu, NULL);
if (!np)
return -1;
 
+   id = of_get_ibm_chip_id(np);
of_node_put(np);
-   return of_get_ibm_chip_id(np);
+   return id;
 }
 EXPORT_SYMBOL(cpu_to_chip_id);
 
-- 
2.9.5



[PATCH] powerpc: fix use-after-free on fixup_port_irq()

2019-07-05 Thread Wen Yang
There is a possible use-after-free issue in fixup_port_irq():

460 static void __init fixup_port_irq(int index,
461   struct device_node *np,
462   struct plat_serial8250_port *port)
463 {
...
469 if (!virq && legacy_serial_infos[index].irq_check_parent) {
470 np = of_get_parent(np);  --> modified here.
...
474 of_node_put(np); ---> released here
475 }
...
481 #ifdef CONFIG_SERIAL_8250_FSL
482   if (of_device_is_compatible(np, "fsl,ns16550")) --> dereferenced here
...
484 #endif
485 }

We solve this problem by introducing a new parent_np variable.

Fixes: 9deaa53ac7fa ("serial: add irq handler for Freescale 16550 errata.")
Signed-off-by: Wen Yang 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Rob Herring 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
---
 arch/powerpc/kernel/legacy_serial.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c
index 7cea597..0105f3e 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -461,17 +461,18 @@ static void __init fixup_port_irq(int index,
  struct device_node *np,
  struct plat_serial8250_port *port)
 {
+   struct device_node *parent_np;
unsigned int virq;
 
DBG("fixup_port_irq(%d)\n", index);
 
virq = irq_of_parse_and_map(np, 0);
if (!virq && legacy_serial_infos[index].irq_check_parent) {
-   np = of_get_parent(np);
-   if (np == NULL)
+   parent_np = of_get_parent(np);
+   if (parent_np == NULL)
return;
-   virq = irq_of_parse_and_map(np, 0);
-   of_node_put(np);
+   virq = irq_of_parse_and_map(parent_np, 0);
+   of_node_put(parent_np);
}
if (!virq)
return;
-- 
2.9.5
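
Why the separate parent_np variable matters can be shown with a small
userspace sketch. Everything here is hypothetical scaffolding: struct
mock_node, mock_get_parent() and mock_put() only imitate the shape of the
kernel calls, and the "simple-bus" parent is an invented example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct device_node. */
struct mock_node {
	const char *compatible;
	struct mock_node *parent;
	int refcount;
};

static struct mock_node *mock_get_parent(struct mock_node *n)
{
	if (n->parent)
		n->parent->refcount++;
	return n->parent;
}

static void mock_put(struct mock_node *n)
{
	if (n)
		n->refcount--;
}

/* Returns 1 if the node examined at the end is the serial port itself. */
static int checks_port_node(int use_parent_var)
{
	struct mock_node bus = { "simple-bus", NULL, 1 };
	struct mock_node port = { "fsl,ns16550", &bus, 1 };
	struct mock_node *np = &port;
	struct mock_node *parent_np;

	if (use_parent_var) {
		parent_np = mock_get_parent(np);	/* np is left untouched */
		mock_put(parent_np);
	} else {
		np = mock_get_parent(np);		/* np now names the parent */
		mock_put(np);
	}

	/* later compatibility check, as in the CONFIG_SERIAL_8250_FSL block */
	return strcmp(np->compatible, "fsl,ns16550") == 0;
}
```

When np is overwritten, the later check runs against the parent whose
reference was already dropped; with parent_np it runs against the port.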



[PATCH] powerpc/83xx: fix use-after-free on mpc831x_usb_cfg()

2019-07-05 Thread Wen Yang
The np variable is still being used after the of_node_put() call,
which may result in use-after-free.
We fix this issue by calling of_node_put() after the last usage.
This patch also does some cleanup.

Fixes: fd066e850351 ("powerpc/mpc8308: fix USB DR controller initialization")
Signed-off-by: Wen Yang 
Cc: Scott Wood 
Cc: Kumar Gala 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-...@lists.ozlabs.org
Cc: linux-kernel@vger.kernel.org
---
 arch/powerpc/platforms/83xx/usb.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/83xx/usb.c 
b/arch/powerpc/platforms/83xx/usb.c
index 3d247d7..56b36fa 100644
--- a/arch/powerpc/platforms/83xx/usb.c
+++ b/arch/powerpc/platforms/83xx/usb.c
@@ -158,14 +158,11 @@ int mpc831x_usb_cfg(void)
 
iounmap(immap);
 
-   of_node_put(immr_node);
-
/* Map USB SOC space */
ret = of_address_to_resource(np, 0, &res);
-   if (ret) {
-   of_node_put(np);
-   return ret;
-   }
+   if (ret)
+   goto out_put_node;
+
usb_regs = ioremap(res.start, resource_size(&res));
 
/* Using on-chip PHY */
@@ -174,7 +171,7 @@ int mpc831x_usb_cfg(void)
u32 refsel;
 
if (of_device_is_compatible(immr_node, "fsl,mpc8308-immr"))
-   goto out;
+   goto out_unmap;
 
if (of_device_is_compatible(immr_node, "fsl,mpc8315-immr"))
refsel = CONTROL_REFSEL_24MHZ;
@@ -201,8 +198,10 @@ int mpc831x_usb_cfg(void)
ret = -EINVAL;
}
 
-out:
+out_unmap:
iounmap(usb_regs);
+out_put_node:
+   of_node_put(immr_node);
of_node_put(np);
return ret;
 }
-- 
2.9.5



[PATCH] ASoC: audio-graph-card: fix use-after-free in graph_for_each_link

2019-07-04 Thread Wen Yang
After calling of_node_put() on the codec_ep and codec_port variables,
they are still being used, which may result in use-after-free.
We fix this issue by calling of_node_put() after the last usage.

Fixes: fce9b90c1ab7 ("ASoC: audio-graph-card: cleanup DAI link loop method - step2")
Signed-off-by: Wen Yang 
Cc: Liam Girdwood 
Cc: Mark Brown 
Cc: Jaroslav Kysela 
Cc: Takashi Iwai 
Cc: Kuninori Morimoto 
Cc: alsa-de...@alsa-project.org
Cc: linux-kernel@vger.kernel.org
---
 sound/soc/generic/audio-graph-card.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/sound/soc/generic/audio-graph-card.c 
b/sound/soc/generic/audio-graph-card.c
index e438011..30a4e83 100644
--- a/sound/soc/generic/audio-graph-card.c
+++ b/sound/soc/generic/audio-graph-card.c
@@ -421,9 +421,6 @@ static int graph_for_each_link(struct asoc_simple_priv 
*priv,
codec_ep = of_graph_get_remote_endpoint(cpu_ep);
codec_port = of_get_parent(codec_ep);
 
-   of_node_put(codec_ep);
-   of_node_put(codec_port);
-
/* get convert-xxx property */
memset(&adata, 0, sizeof(adata));
graph_parse_convert(dev, codec_ep, &adata);
@@ -443,6 +440,9 @@ static int graph_for_each_link(struct asoc_simple_priv 
*priv,
else
ret = func_noml(priv, cpu_ep, codec_ep, li);
 
+   of_node_put(codec_ep);
+   of_node_put(codec_port);
+
if (ret < 0)
return ret;
 
-- 
2.9.5



[PATCH v2] media: xilinx: fix leaked of_node references

2019-07-01 Thread Wen Yang
of_get_child_by_name() returns a node pointer with its refcount
incremented, so the reference must be dropped with of_node_put() after
the last use.

Detected by coccinelle with the following warnings:
drivers/media/platform/xilinx/xilinx-vipp.c:487:3-9: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 477, but 
without a corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-vipp.c:491:1-7: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 477, but 
without a corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:732:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:741:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:757:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:764:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Patrice Chotard 
Cc: Hyun Kwon 
Cc: Laurent Pinchart 
Cc: Mauro Carvalho Chehab 
Cc: Michal Simek 
Cc: linux-me...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
v2: fix Comparison to NULL

 drivers/media/platform/xilinx/xilinx-tpg.c  | 18 +-
 drivers/media/platform/xilinx/xilinx-vipp.c |  8 +---
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/media/platform/xilinx/xilinx-tpg.c 
b/drivers/media/platform/xilinx/xilinx-tpg.c
index ed01bed..e71d022 100644
--- a/drivers/media/platform/xilinx/xilinx-tpg.c
+++ b/drivers/media/platform/xilinx/xilinx-tpg.c
@@ -713,10 +713,13 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
struct device_node *port;
unsigned int nports = 0;
bool has_endpoint = false;
+   int ret = 0;
 
ports = of_get_child_by_name(node, "ports");
-   if (ports == NULL)
+   if (!ports) {
ports = node;
+   of_node_get(ports);
+   }
 
for_each_child_of_node(ports, port) {
const struct xvip_video_format *format;
@@ -729,7 +732,8 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
if (IS_ERR(format)) {
dev_err(dev, "invalid format in DT");
of_node_put(port);
-   return PTR_ERR(format);
+   ret = PTR_ERR(format);
+   goto out_put_node;
}
 
/* Get and check the format description */
@@ -738,7 +742,8 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
} else if (xtpg->vip_format != format) {
dev_err(dev, "in/out format mismatch in DT");
of_node_put(port);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_put_node;
}
 
if (nports == 0) {
@@ -754,14 +759,17 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
 
if (nports != 1 && nports != 2) {
dev_err(dev, "invalid number of ports %u\n", nports);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_put_node;
}
 
xtpg->npads = nports;
if (nports == 2 && has_endpoint)
xtpg->has_input = true;
 
-   return 0;
+out_put_node:
+   of_node_put(ports);
+   return ret;
 }
 
 static int xtpg_probe(struct platform_device *pdev)
diff --git a/drivers/media/platform/xilinx/xilinx-vipp.c 
b/drivers/media/platform/xilinx/xilinx-vipp.c
index edce040..307717c 100644
--- a/drivers/media/platform/xilinx/xilinx-vipp.c
+++ b/drivers/media/platform/xilinx/xilinx-vipp.c
@@ -472,7 +472,7 @@ static int xvip_graph_dma_init(struct xvip_composite_device 
*xdev)
 {
struct device_node *ports;
struct device_node *port;
-   int ret;
+   int ret = 0;
 
ports = of_get_child_by_name(xdev->dev->of_node, "ports");
if (ports == NULL) {
@@ -484,11 +484,13 @@ static int xvip_graph_dma_init(struct 
xvip_composite_device *xdev)
ret = xvip_graph_dma_init_one(xdev, port);
if (ret < 0) {
of_node_put(port);
-   return ret;
+   goto out_put_node;
}
}
 
-   return 0;
+out_put_node:

[PATCH 1/3] media: xilinx: fix leaked of_node references

2019-06-27 Thread Wen Yang
of_get_child_by_name() returns a node pointer with its refcount
incremented, so the reference must be dropped with of_node_put() after
the last use.

Detected by coccinelle with the following warnings:
drivers/media/platform/xilinx/xilinx-vipp.c:487:3-9: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 477, but 
without a corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-vipp.c:491:1-7: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 477, but 
without a corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:732:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:741:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:757:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.
drivers/media/platform/xilinx/xilinx-tpg.c:764:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 717, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Patrice Chotard 
Cc: Hyun Kwon 
Cc: Laurent Pinchart 
Cc: Mauro Carvalho Chehab 
Cc: Michal Simek 
Cc: linux-me...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/media/platform/xilinx/xilinx-tpg.c  | 18 +-
 drivers/media/platform/xilinx/xilinx-vipp.c |  8 +---
 2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/drivers/media/platform/xilinx/xilinx-tpg.c 
b/drivers/media/platform/xilinx/xilinx-tpg.c
index ed01bed..e71d022 100644
--- a/drivers/media/platform/xilinx/xilinx-tpg.c
+++ b/drivers/media/platform/xilinx/xilinx-tpg.c
@@ -713,10 +713,13 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
struct device_node *port;
unsigned int nports = 0;
bool has_endpoint = false;
+   int ret = 0;
 
ports = of_get_child_by_name(node, "ports");
-   if (ports == NULL)
+   if (ports == NULL) {
ports = node;
+   of_node_get(ports);
+   }
 
for_each_child_of_node(ports, port) {
const struct xvip_video_format *format;
@@ -729,7 +732,8 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
if (IS_ERR(format)) {
dev_err(dev, "invalid format in DT");
of_node_put(port);
-   return PTR_ERR(format);
+   ret = PTR_ERR(format);
+   goto out_put_node;
}
 
/* Get and check the format description */
@@ -738,7 +742,8 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
} else if (xtpg->vip_format != format) {
dev_err(dev, "in/out format mismatch in DT");
of_node_put(port);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_put_node;
}
 
if (nports == 0) {
@@ -754,14 +759,17 @@ static int xtpg_parse_of(struct xtpg_device *xtpg)
 
if (nports != 1 && nports != 2) {
dev_err(dev, "invalid number of ports %u\n", nports);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_put_node;
}
 
xtpg->npads = nports;
if (nports == 2 && has_endpoint)
xtpg->has_input = true;
 
-   return 0;
+out_put_node:
+   of_node_put(ports);
+   return ret;
 }
 
 static int xtpg_probe(struct platform_device *pdev)
diff --git a/drivers/media/platform/xilinx/xilinx-vipp.c 
b/drivers/media/platform/xilinx/xilinx-vipp.c
index edce040..307717c 100644
--- a/drivers/media/platform/xilinx/xilinx-vipp.c
+++ b/drivers/media/platform/xilinx/xilinx-vipp.c
@@ -472,7 +472,7 @@ static int xvip_graph_dma_init(struct xvip_composite_device 
*xdev)
 {
struct device_node *ports;
struct device_node *port;
-   int ret;
+   int ret = 0;
 
ports = of_get_child_by_name(xdev->dev->of_node, "ports");
if (ports == NULL) {
@@ -484,11 +484,13 @@ static int xvip_graph_dma_init(struct 
xvip_composite_device *xdev)
ret = xvip_graph_dma_init_one(xdev, port);
if (ret < 0) {
of_node_put(port);
-   return ret;
+   goto out_put_node;
}
}
 
-   return 0;
+out_put_node:
+  

[PATCH 0/3] fix leaked of_node references in drivers/media

2019-06-27 Thread Wen Yang
Calls such as of_get_cpu_node(), of_find_compatible_node() and
of_parse_phandle() return a node pointer with its refcount incremented,
so the reference must be dropped with of_node_put() after the last use.

We wrote a Coccinelle SmPL script to check the drivers/media/ code for
missing puts and found some issues.
This patch series fixes those issues.

Wen Yang (3):
  media: xilinx: fix leaked of_node references
  media: exynos4-is: fix leaked of_node references
  media: ti-vpe: fix leaked of_node references

 drivers/media/platform/exynos4-is/fimc-is.c   |  1 +
 drivers/media/platform/exynos4-is/media-dev.c |  2 ++
 drivers/media/platform/ti-vpe/cal.c   |  1 +
 drivers/media/platform/xilinx/xilinx-tpg.c| 18 +-
 drivers/media/platform/xilinx/xilinx-vipp.c   |  8 +---
 5 files changed, 22 insertions(+), 8 deletions(-)

Cc: Mauro Carvalho Chehab 
Cc: Hans Verkuil 
Cc: Laurent Pinchart 
Cc: Philipp Zabel 
Cc: Stanimir Varbanov 
Cc: linux-me...@vger.kernel.org

-- 
2.9.5
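
One detail of the xilinx-tpg change in this series is that the fallback
path (ports = node) takes its own reference, so the single put at the
exit label is balanced whether or not a "ports" child exists. A userspace
sketch of that idea, with hypothetical mock_get()/mock_put() helpers in
place of of_get_child_by_name(), of_node_get() and of_node_put():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node. */
struct mock_node {
	int refcount;
};

static struct mock_node *mock_get(struct mock_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

static void mock_put(struct mock_node *n)
{
	if (n)
		n->refcount--;
}

/*
 * Returns the references left over after one parse.
 * has_ports_child selects whether the "ports" lookup succeeds.
 */
static int refs_left_after_parse(int has_ports_child)
{
	struct mock_node node = { 0 }, child = { 0 };
	struct mock_node *ports;

	/* stand-in for the lookup of the "ports" child */
	ports = mock_get(has_ports_child ? &child : NULL);
	if (!ports)
		ports = mock_get(&node);	/* fallback takes its own reference */

	/* ... walk the children of ports ... */

	mock_put(ports);	/* one put is correct on either path */
	return node.refcount + child.refcount;
}
```

Without the extra get on the fallback path, the common put would drop a
reference on node that this function never took.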



[PATCH 3/3] media: ti-vpe: fix leaked of_node references

2019-06-27 Thread Wen Yang
of_get_parent() returns a node pointer with its refcount incremented,
so the reference must be dropped with of_node_put() after the last
use.

Detected by coccinelle with the following warnings:
drivers/media/platform/ti-vpe/cal.c:1621:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 1607, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Benoit Parrot 
Cc: Mauro Carvalho Chehab 
Cc: linux-me...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/media/platform/ti-vpe/cal.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/media/platform/ti-vpe/cal.c 
b/drivers/media/platform/ti-vpe/cal.c
index 9e86d761..8e19974 100644
--- a/drivers/media/platform/ti-vpe/cal.c
+++ b/drivers/media/platform/ti-vpe/cal.c
@@ -1613,6 +1613,7 @@ of_get_next_port(const struct device_node *parent,
}
prev = port;
} while (!of_node_name_eq(port, "port"));
+   of_node_put(ports);
}
 
return port;
-- 
2.9.5



[PATCH 2/3] media: exynos4-is: fix leaked of_node references

2019-06-27 Thread Wen Yang
of_get_child_by_name() returns a node pointer with its refcount
incremented, so the reference must be dropped with of_node_put() after
the last use.

Detected by coccinelle with the following warnings:
drivers/media/platform/exynos4-is/fimc-is.c:813:2-8: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 807, but 
without a corresponding object release within this function.
drivers/media/platform/exynos4-is/fimc-is.c:870:1-7: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 807, but 
without a corresponding object release within this function.
drivers/media/platform/exynos4-is/fimc-is.c:885:1-7: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 807, but 
without a corresponding object release within this function.
drivers/media/platform/exynos4-is/media-dev.c:545:1-7: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 541, but 
without a corresponding object release within this function.
drivers/media/platform/exynos4-is/media-dev.c:528:1-7: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 499, but 
without a corresponding object release within this function.
drivers/media/platform/exynos4-is/media-dev.c:534:1-7: ERROR: missing 
of_node_put; acquired a node pointer with refcount incremented on line 499, but 
without a corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Kyungmin Park 
Cc: Sylwester Nawrocki 
Cc: Mauro Carvalho Chehab 
Cc: Kukjin Kim 
Cc: Krzysztof Kozlowski 
Cc: linux-me...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-samsung-...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/media/platform/exynos4-is/fimc-is.c   | 1 +
 drivers/media/platform/exynos4-is/media-dev.c | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/media/platform/exynos4-is/fimc-is.c 
b/drivers/media/platform/exynos4-is/fimc-is.c
index e043d55..b7cc8e6 100644
--- a/drivers/media/platform/exynos4-is/fimc-is.c
+++ b/drivers/media/platform/exynos4-is/fimc-is.c
@@ -806,6 +806,7 @@ static int fimc_is_probe(struct platform_device *pdev)
return -ENODEV;
 
is->pmu_regs = of_iomap(node, 0);
+   of_node_put(node);
if (!is->pmu_regs)
return -ENOMEM;
 
diff --git a/drivers/media/platform/exynos4-is/media-dev.c 
b/drivers/media/platform/exynos4-is/media-dev.c
index d53427a..a838189 100644
--- a/drivers/media/platform/exynos4-is/media-dev.c
+++ b/drivers/media/platform/exynos4-is/media-dev.c
@@ -501,6 +501,7 @@ static int fimc_md_register_sensor_entities(struct fimc_md 
*fmd)
continue;
 
ret = fimc_md_parse_port_node(fmd, port, index);
+   of_node_put(port);
if (ret < 0) {
of_node_put(node);
goto cleanup;
@@ -542,6 +543,7 @@ static int __of_get_csis_id(struct device_node *np)
if (!np)
return -EINVAL;
of_property_read_u32(np, "reg", &reg);
+   of_node_put(np);
return reg - FIMC_INPUT_MIPI_CSI2_0;
 }
 
-- 
2.9.5



[PATCH 0/4] fix leaked of_node references in drivers/media

2019-05-06 Thread Wen Yang
Calls such as of_get_cpu_node(), of_find_compatible_node() and
of_parse_phandle() return a node pointer with its refcount incremented,
so the reference must be dropped with of_node_put() after the last use.

We wrote a Coccinelle SmPL script to check the drivers/media/ code for
missing puts and found some issues.
This patch series fixes those issues.

Wen Yang (4):
  media: venus: firmware: fix leaked of_node references
  media: mtk-vpu: fix leaked of_node references
  media: mtk-vcodec: fix leaked of_node references
  media: xilinx: fix leaked of_node references

 drivers/media/platform/exynos4-is/fimc-is.c   | 1 +
 drivers/media/platform/exynos4-is/media-dev.c | 1 +
 drivers/media/platform/mtk-vcodec/mtk_vcodec_dec_pm.c | 2 +-
 drivers/media/platform/mtk-vpu/mtk_vpu.c  | 2 +-
 drivers/media/platform/qcom/venus/firmware.c  | 6 --
 drivers/media/platform/xilinx/xilinx-vipp.c   | 8 +---
 6 files changed, 13 insertions(+), 7 deletions(-)

Cc: Mauro Carvalho Chehab 
Cc: Hans Verkuil 
Cc: Laurent Pinchart 
Cc: Kieran Bingham 
Cc: Philipp Zabel 
Cc: Stanimir Varbanov 
Cc: linux-me...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

-- 
2.9.5



[PATCH v3] ARM: rockchip: fix a leaked reference by adding missing of_node_put

2019-04-26 Thread Wen Yang
of_get_next_child() returns a node pointer with its refcount
incremented, so the reference must be dropped with of_node_put() after
the last use.

Detected by coccinelle with the following warnings:
./arch/arm/mach-rockchip/pm.c:269:2-8: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 259, but without a corresponding 
object release within this function.
./arch/arm/mach-rockchip/pm.c:275:2-8: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 259, but without a corresponding 
object release within this function
./arch/arm/mach-rockchip/platsmp.c:281:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 272, but without a 
corresponding object release within this function.
./arch/arm/mach-rockchip/platsmp.c:285:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 272, but without a 
corresponding object release within this function.
./arch/arm/mach-rockchip/platsmp.c:289:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 272, but without a 
corresponding object release within this function.
./arch/arm/mach-rockchip/platsmp.c:303:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 294, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Reviewed-by: Florian Fainelli 
Suggested-by: Heiko Stuebner 
Cc: Russell King 
Cc: Heiko Stuebner 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-rockc...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
v2: add a missing space between "adding" and "missing"
v3: just add a regular of_node_put 

 arch/arm/mach-rockchip/platsmp.c | 12 ++++++++++--
 arch/arm/mach-rockchip/pm.c  |  2 ++
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/arm/mach-rockchip/platsmp.c b/arch/arm/mach-rockchip/platsmp.c
index 4675d92..afd1514 100644
--- a/arch/arm/mach-rockchip/platsmp.c
+++ b/arch/arm/mach-rockchip/platsmp.c
@@ -278,19 +278,25 @@ static void __init rockchip_smp_prepare_cpus(unsigned int 
max_cpus)
sram_base_addr = of_iomap(node, 0);
if (!sram_base_addr) {
pr_err("%s: could not map sram registers\n", __func__);
+   of_node_put(node);
return;
}
 
-   if (has_pmu && rockchip_smp_prepare_pmu())
+   if (has_pmu && rockchip_smp_prepare_pmu()) {
+   of_node_put(node);
return;
+   }
 
if (read_cpuid_part() == ARM_CPU_PART_CORTEX_A9) {
-   if (rockchip_smp_prepare_sram(node))
+   if (rockchip_smp_prepare_sram(node)) {
+   of_node_put(node);
return;
+   }
 
/* enable the SCU power domain */
pmu_set_power_domain(PMU_PWRDN_SCU, true);
 
+   of_node_put(node);
node = of_find_compatible_node(NULL, NULL, "arm,cortex-a9-scu");
if (!node) {
pr_err("%s: missing scu\n", __func__);
@@ -300,6 +306,7 @@ static void __init rockchip_smp_prepare_cpus(unsigned int 
max_cpus)
scu_base_addr = of_iomap(node, 0);
if (!scu_base_addr) {
pr_err("%s: could not map scu registers\n", __func__);
+   of_node_put(node);
return;
}
 
@@ -318,6 +325,7 @@ static void __init rockchip_smp_prepare_cpus(unsigned int 
max_cpus)
asm ("mrc p15, 1, %0, c9, c0, 2\n" : "=r" (l2ctlr));
ncores = ((l2ctlr >> 24) & 0x3) + 1;
}
+   of_node_put(node);
 
/* Make sure that all cores except the first are really off */
for (i = 1; i < ncores; i++)
diff --git a/arch/arm/mach-rockchip/pm.c b/arch/arm/mach-rockchip/pm.c
index 065b09e..4a4f914 100644
--- a/arch/arm/mach-rockchip/pm.c
+++ b/arch/arm/mach-rockchip/pm.c
@@ -266,12 +266,14 @@ static int __init rk3288_suspend_init(struct device_node 
*np)
rk3288_bootram_base = of_iomap(sram_np, 0);
if (!rk3288_bootram_base) {
pr_err("%s: could not map bootram base\n", __func__);
+   of_node_put(sram_np);
return -ENOMEM;
}
 
ret = of_address_to_resource(sram_np, 0, &res);
if (ret) {
pr_err("%s: could not get bootram phy addr\n", __func__);
+   of_node_put(sram_np);
return ret;
}
rk3288_bootram_phy = res.start;
-- 
2.9.5
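
In platsmp.c the same node variable is reused for a second lookup, which
is why the patch adds a put just before the lookup of the SCU node. A
userspace sketch of that step, using hypothetical mock_find() and
mock_put() helpers rather than the real of_* calls:

```c
#include <assert.h>

/* Hypothetical stand-in for struct device_node. */
struct mock_node {
	int refcount;
};

static struct mock_node *mock_find(struct mock_node *n)
{
	n->refcount++;
	return n;
}

static void mock_put(struct mock_node *n)
{
	n->refcount--;
}

/* Returns the references still held on the first node at the end. */
static int sram_refs_left(int put_before_reuse)
{
	struct mock_node sram = { 0 }, scu = { 0 };
	struct mock_node *node;

	node = mock_find(&sram);
	/* ... use the sram node ... */

	if (put_before_reuse)
		mock_put(node);		/* drop the sram reference first */
	node = mock_find(&scu);		/* the same variable now tracks scu */

	/* ... use the scu node ... */
	mock_put(node);

	assert(scu.refcount == 0);
	return sram.refcount;
}
```

Once node is reassigned, the final put can only reach the second node, so
the first reference has to be dropped before the variable is reused.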



[PATCH] fpga: stratix10-soc: fix use-after-free on s10_init()

2019-04-23 Thread Wen Yang
of_find_matching_node() drops a reference on the node passed as its
starting point, so after the call fw_np no longer holds the reference
taken by the earlier lookup and must not be used.
This patch adds an of_node_get() before of_find_matching_node() to avoid
the use-after-free problem.

Fixes: e7eef1d7633a ("fpga: add intel stratix10 soc fpga manager driver")
Signed-off-by: Wen Yang 
Cc: Alan Tull 
Cc: Moritz Fischer 
Cc: Nicolas Saenz Julienne 
Cc: linux-f...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/fpga/stratix10-soc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/fpga/stratix10-soc.c b/drivers/fpga/stratix10-soc.c
index 13851b3..215d337 100644
--- a/drivers/fpga/stratix10-soc.c
+++ b/drivers/fpga/stratix10-soc.c
@@ -507,12 +507,16 @@ static int __init s10_init(void)
if (!fw_np)
return -ENODEV;
 
+   of_node_get(fw_np);
np = of_find_matching_node(fw_np, s10_of_match);
-   if (!np)
+   if (!np) {
+   of_node_put(fw_np);
return -ENODEV;
+   }
 
of_node_put(np);
ret = of_platform_populate(fw_np, s10_of_match, NULL, NULL);
+   of_node_put(fw_np);
if (ret)
return ret;
 
-- 
2.9.5
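
The effect of a lookup that drops a reference on its starting node can be
modelled in userspace. This is a sketch under stated assumptions:
mock_find_from() is a hypothetical helper that imitates only the
reference counting of of_find_matching_node(), and fw_np is assumed to
hold exactly one reference from its earlier lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node. */
struct mock_node {
	int refcount;
};

/*
 * Imitates the counting of of_find_matching_node(): the match gains a
 * reference, the starting node loses one.
 */
static struct mock_node *mock_find_from(struct mock_node *from,
					struct mock_node *match)
{
	if (match)
		match->refcount++;
	if (from)
		from->refcount--;
	return match;
}

/* Returns the references held on fw_np when it is used after the lookup. */
static int fw_refs_at_populate(int take_extra_ref)
{
	struct mock_node fw = { 0 }, child = { 0 };
	struct mock_node *np;

	fw.refcount++;			/* earlier lookup of fw_np */
	if (take_extra_ref)
		fw.refcount++;		/* of_node_get() added by the patch */

	np = mock_find_from(&fw, &child);
	np->refcount--;			/* of_node_put(np) */

	return fw.refcount;		/* what of_platform_populate() would see */
}
```

Without the extra get the caller holds no reference by the time fw_np is
passed on; with it, one reference remains and is dropped afterwards.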



[PATCH 0/3] fix leaked of_node references in drivers/firmware

2019-04-16 Thread Wen Yang
Calls such as of_get_cpu_node(), of_find_compatible_node() and
of_parse_phandle() return a node pointer with its refcount incremented,
so the reference must be dropped with of_node_put() after the last use.

We wrote a Coccinelle SmPL script to check the drivers/firmware code for
missing puts and found some issues.
This patch series fixes those issues.

Wen Yang (3):
  firmware: arm_sdei: fix leaked of_node references
  firmware: psci: fix leaked of_node references
  firmware: stratix10-svc: fix leaked of_node references

 drivers/firmware/arm_sdei.c  |  1 +
 drivers/firmware/psci.c  |  4 +++-
 drivers/firmware/stratix10-svc.c | 14 ++
 3 files changed, 14 insertions(+), 5 deletions(-)

-- 
2.9.5



[PATCH 1/3] firmware: arm_sdei: fix leaked of_node references

2019-04-16 Thread Wen Yang
In sdei_present_dt function, fw_np is obtained by calling
of_find_node_by_name(), np is obtained by calling
of_find_matching_node(), and the reference counts of those
two device_nodes, fw_np and np, are increased.
But when the function exits, only of_node_put is called on np,
and fw_np's reference count is leaked.

Detected by coccinelle with the following warnings:
./drivers/firmware/arm_sdei.c:1088:2-8: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 1082, but without a 
corresponding object release within this function.
./drivers/firmware/arm_sdei.c:1091:1-7: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 1082, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: James Morse 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/firmware/arm_sdei.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c
index e6376f9..2faa329 100644
--- a/drivers/firmware/arm_sdei.c
+++ b/drivers/firmware/arm_sdei.c
@@ -1084,6 +1084,7 @@ static bool __init sdei_present_dt(void)
return false;
 
np = of_find_matching_node(fw_np, sdei_of_match);
+   of_node_put(fw_np);
if (!np)
return false;
of_node_put(np);
-- 
2.9.5



[PATCH 2/3] firmware: psci: fix leaked of_node references

2019-04-16 Thread Wen Yang
of_find_matching_node_and_match() returns a node pointer with its
refcount incremented, so the reference must be dropped with
of_node_put() after the last use.

672 int __init psci_dt_init(void)
673 {
674 struct device_node *np;
...
678 np = of_find_matching_node_and_match(...);
679
680 if (!np || !of_device_is_available(np))
682 return -ENODEV;  ---> leaked here
...
686 return init_fn(np);  ---> released here
687 }

Detected by using coccinelle.

Signed-off-by: Wen Yang 
Cc: Mark Rutland 
Cc: Lorenzo Pieralisi 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/firmware/psci.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/psci.c b/drivers/firmware/psci.c
index c80ec1d..e4143f8 100644
--- a/drivers/firmware/psci.c
+++ b/drivers/firmware/psci.c
@@ -677,8 +677,10 @@ int __init psci_dt_init(void)
 
np = of_find_matching_node_and_match(NULL, psci_of_match, &matched_np);
 
-   if (!np || !of_device_is_available(np))
+   if (!np || !of_device_is_available(np)) {
+   of_node_put(np);
return -ENODEV;
+   }
 
init_fn = (psci_initcall_t)matched_np->data;
return init_fn(np);
-- 
2.9.5
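
The early-return branch above calls of_node_put(np) even when np is NULL,
which relies on of_node_put() ignoring a NULL pointer. A userspace sketch
of that shape, with hypothetical names (struct mock_node, mock_put(),
mock_dt_init()) and an "available" flag standing in for
of_device_is_available():

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device_node. */
struct mock_node {
	int refcount;
	int available;
};

static void mock_put(struct mock_node *n)
{
	if (n)			/* tolerate NULL, as of_node_put() does */
		n->refcount--;
}

static int mock_dt_init(struct mock_node *np)
{
	if (np)
		np->refcount++;	/* reference taken by the lookup */

	if (!np || !np->available) {
		mock_put(np);	/* covers "not found" and "disabled" alike */
		return -ENODEV;
	}

	mock_put(np);		/* on success the init callback drops it */
	return 0;
}

/* Returns the references left on the node after one init attempt. */
static int refs_left(int found, int available)
{
	struct mock_node node = { 0, available };

	(void)mock_dt_init(found ? &node : NULL);
	return node.refcount;
}
```

A single put in the combined branch is enough because the NULL case falls
through the check inside the put helper.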



[PATCH 3/3] firmware: stratix10-svc: fix leaked of_node references

2019-04-16 Thread Wen Yang
In stratix10_svc_init(), fw_np is obtained by calling
of_find_node_by_name() and np by calling of_find_matching_node();
both calls increment the reference count of the returned
device_node.
When the function exits, of_node_put() is called only on np,
so the reference held on fw_np is leaked.
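
A minimal user-space sketch of the single-exit cleanup shape that avoids this kind of leak is given here. All names (`struct mock_node`, `mock_get()`, `mock_put()`, `svc_init_sketch()`) are hypothetical and exist only for illustration; the `populate_err` parameter is an assumed stand-in for the result of a later setup step that may fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct device_node. */
struct mock_node {
	int refcount;
};

/* Models a lookup that takes a reference on the node it returns. */
static struct mock_node *mock_get(struct mock_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

/* Models of_node_put(): NULL is accepted and ignored. */
static void mock_put(struct mock_node *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	n->refcount--;
}

/*
 * Two references are taken. The child is released as soon as it has been
 * checked; the parent is released at one label that every later exit path,
 * including the success path, passes through.
 */
static int svc_init_sketch(struct mock_node *fw, struct mock_node *child,
			   int populate_err)
{
	struct mock_node *fw_np, *np;
	int ret;

	fw_np = mock_get(fw);
	if (!fw_np)
		return -ENODEV;		/* nothing acquired yet */

	np = mock_get(child);
	if (!np) {
		ret = -ENODEV;
		goto out_put_fw_np;
	}
	mock_put(np);

	ret = populate_err;		/* assumed result of a later setup step */
	if (ret)
		goto out_put_fw_np;

	ret = 0;			/* final registration step succeeded */

out_put_fw_np:
	mock_put(fw_np);
	return ret;
}
```

Routing every exit after the first acquisition through one label keeps the release in a single place, so a newly added error path cannot forget it.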

Detected by coccinelle with the following warnings:
./drivers/firmware/stratix10-svc.c:1020:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 1014, but without a 
corresponding object release within this function.
./drivers/firmware/stratix10-svc.c:1025:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 1014, but without a 
corresponding object release within this function.
./drivers/firmware/stratix10-svc.c:1027:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 1014, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Greg Kroah-Hartman 
Cc: Alan Tull 
Cc: Richard Gong 
Cc: Nicolas Saenz Julienne 
Cc: linux-kernel@vger.kernel.org
---
 drivers/firmware/stratix10-svc.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/firmware/stratix10-svc.c b/drivers/firmware/stratix10-svc.c
index 6e65148..482a6bd 100644
--- a/drivers/firmware/stratix10-svc.c
+++ b/drivers/firmware/stratix10-svc.c
@@ -1016,15 +1016,21 @@ static int __init stratix10_svc_init(void)
return -ENODEV;
 
np = of_find_matching_node(fw_np, stratix10_svc_drv_match);
-   if (!np)
-   return -ENODEV;
+   if (!np) {
+   ret = -ENODEV;
+   goto out_put_fw_np;
+   }
 
of_node_put(np);
ret = of_platform_populate(fw_np, stratix10_svc_drv_match, NULL, NULL);
if (ret)
-   return ret;
+   goto out_put_fw_np;
 
-   return platform_driver_register(&stratix10_svc_driver);
+   ret = platform_driver_register(&stratix10_svc_driver);
+
+out_put_fw_np:
+   of_node_put(fw_np);
+   return ret;
 }
 
 static void __exit stratix10_svc_exit(void)
-- 
2.9.5



[PATCH 1/2] power: supply: fix leaked of_node refs in ab8500_bm_of_probe

2019-04-16 Thread Wen Yang
The call to of_parse_phandle() returns a node pointer with its refcount
incremented, so it must be explicitly decremented after the last
use.

492 int ab8500_bm_of_probe(struct device *dev,
493                        struct device_node *np,
494                        struct abx500_bm_data *bm)
495 {
496 const struct batres_vs_temp *tmp_batres_tbl;
497 struct device_node *battery_node;
...
501 /* get phandle to 'battery-info' node */
502 battery_node = of_parse_phandle(np, "battery", 0);
...
509 if (!btech) {
510 dev_warn(dev, "missing property battery-name/type\n");
511 return -EINVAL;   ---> leaked here
512 }
...
540 of_node_put(battery_node);   ---> released here
541
542 return 0;
543 }

Detected by coccinelle with the following warnings:
./drivers/power/supply/ab8500_bmdata.c:511:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 502, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Sebastian Reichel 
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/power/supply/ab8500_bmdata.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/power/supply/ab8500_bmdata.c b/drivers/power/supply/ab8500_bmdata.c
index 7b2b699..f6a6697 100644
--- a/drivers/power/supply/ab8500_bmdata.c
+++ b/drivers/power/supply/ab8500_bmdata.c
@@ -508,6 +508,7 @@ int ab8500_bm_of_probe(struct device *dev,
btech = of_get_property(battery_node, "stericsson,battery-type", NULL);
if (!btech) {
dev_warn(dev, "missing property battery-name/type\n");
+   of_node_put(battery_node);
return -EINVAL;
}
 
-- 
2.9.5



[PATCH 2/2] power: supply: fix leaked of_node refs in power_supply_get_battery_info

2019-04-16 Thread Wen Yang
The call to of_parse_phandle() returns a node pointer with its refcount
incremented, so it must be explicitly decremented after the last
use.

Detected by coccinelle with the following warnings:
./drivers/power/supply/power_supply_core.c:601:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 595, but without a 
corresponding object release within this function.
./drivers/power/supply/power_supply_core.c:604:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 595, but without a 
corresponding object release within this function.
./drivers/power/supply/power_supply_core.c:632:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 595, but without a 
corresponding object release within this function.
./drivers/power/supply/power_supply_core.c:635:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 595, but without a 
corresponding object release within this function.
./drivers/power/supply/power_supply_core.c:653:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 595, but without a 
corresponding object release within this function.
./drivers/power/supply/power_supply_core.c:664:3-9: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 595, but without a 
corresponding object release within this function.
./drivers/power/supply/power_supply_core.c:673:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 595, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Sebastian Reichel 
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/power/supply/power_supply_core.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/power/supply/power_supply_core.c b/drivers/power/supply/power_supply_core.c
index 65c619c..874495c 100644
--- a/drivers/power/supply/power_supply_core.c
+++ b/drivers/power/supply/power_supply_core.c
@@ -598,10 +598,12 @@ int power_supply_get_battery_info(struct power_supply *psy,
 
err = of_property_read_string(battery_np, "compatible", &value);
if (err)
-   return err;
+   goto out_put_node;
 
-   if (strcmp("simple-battery", value))
-   return -ENODEV;
+   if (strcmp("simple-battery", value)) {
+   err = -ENODEV;
+   goto out_put_node;
+   }
 
/* The property and field names below must correspond to elements
 * in enum power_supply_property. For reasoning, see
@@ -629,10 +631,12 @@ int power_supply_get_battery_info(struct power_supply *psy,
 
len = of_property_count_u32_elems(battery_np, "ocv-capacity-celsius");
if (len < 0 && len != -EINVAL) {
-   return len;
+   err = len;
+   goto out_put_node;
} else if (len > POWER_SUPPLY_OCV_TEMP_MAX) {
dev_err(&psy->dev, "Too many temperature values\n");
-   return -EINVAL;
+   err = -EINVAL;
+   goto out_put_node;
} else if (len > 0) {
of_property_read_u32_array(battery_np, "ocv-capacity-celsius",
   info->ocv_temp, len);
@@ -650,7 +654,8 @@ int power_supply_get_battery_info(struct power_supply *psy,
dev_err(&psy->dev, "failed to get %s\n", propname);
kfree(propname);
power_supply_put_battery_info(psy, info);
-   return -EINVAL;
+   err = -EINVAL;
+   goto out_put_node;
}
 
kfree(propname);
@@ -661,7 +666,8 @@ int power_supply_get_battery_info(struct power_supply *psy,
devm_kcalloc(&psy->dev, tab_len, sizeof(*table), GFP_KERNEL);
if (!info->ocv_table[index]) {
power_supply_put_battery_info(psy, info);
-   return -ENOMEM;
+   err = -ENOMEM;
+   goto out_put_node;
}
 
for (i = 0; i < tab_len; i++) {
@@ -670,7 +676,9 @@ int power_supply_get_battery_info(struct power_supply *psy,
}
}
 
-   return 0;
+out_put_node:
+   of_node_put(battery_np);
+   return err;
 }
 EXPORT_SYMBOL_GPL(power_supply_get_battery_info);
 
-- 
2.9.5



[PATCH 0/2] fix leaked of_node references in drivers/power

2019-04-16 Thread Wen Yang
Calls to of_get_cpu_node(), of_find_compatible_node(), of_parse_phandle()
and similar functions return a node pointer with its refcount
incremented, so it must be explicitly decremented after the last use.

We developed a coccinelle SmPL script to detect such leaks, ran it on
the drivers/power code and found some issues.
This patch series fixes those issues.

Cc: Sebastian Reichel 
Cc: linux...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Wen Yang (2):
  power: supply: fix leaked of_node refs in ab8500_bm_of_probe
  power: supply: fix leaked of_node refs in
power_supply_get_battery_info

 drivers/power/supply/ab8500_bmdata.c |  1 +
 drivers/power/supply/power_supply_core.c | 24 ++++++++++++++++--------
 2 files changed, 17 insertions(+), 8 deletions(-)

-- 
2.9.5



[PATCH v2] pinctrl: rockchip: fix leaked of_node references

2019-04-15 Thread Wen Yang
The call to of_parse_phandle() returns a node pointer with its refcount
incremented, so it must be explicitly decremented after the last
use.

Detected by coccinelle with the following warnings:
./drivers/pinctrl/pinctrl-rockchip.c:3221:2-8: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 3196, but without a 
corresponding object release within this function.
./drivers/pinctrl/pinctrl-rockchip.c:3223:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 3196, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Linus Walleij 
Cc: Heiko Stuebner 
Cc: linux-g...@vger.kernel.org
Cc: linux-rockc...@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
v2:
 - Move the of_node_put() call below the whole if clause.
 - Inside the if clause node is NULL, so there is no need to call
   of_node_put() before returning.

 drivers/pinctrl/pinctrl-rockchip.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pinctrl/pinctrl-rockchip.c b/drivers/pinctrl/pinctrl-rockchip.c
index 16bf21b..6436336 100644
--- a/drivers/pinctrl/pinctrl-rockchip.c
+++ b/drivers/pinctrl/pinctrl-rockchip.c
@@ -3212,6 +3212,7 @@ static int rockchip_get_bank_data(struct rockchip_pin_bank *bank,
base,
&rockchip_regmap_config);
}
+   of_node_put(node);
}
 
bank->irq = irq_of_parse_and_map(bank->of_node, 0);
-- 
2.9.5



[PATCH 3/5] pinctrl: st: fix leaked of_node references

2019-04-12 Thread Wen Yang
The call to of_get_child_by_name() returns a node pointer with its
refcount incremented, so it must be explicitly decremented after the
last use.

Detected by coccinelle with the following warnings:
./drivers/pinctrl/pinctrl-st.c:1188:3-9: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 1175, but without a 
corresponding object release within this function.
./drivers/pinctrl/pinctrl-st.c:1188:3-9: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 1175, but without a 
corresponding object release within this function.
./drivers/pinctrl/pinctrl-st.c:1199:2-8: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 1175, but without a 
corresponding object release within this function.
./drivers/pinctrl/pinctrl-st.c:1199:2-8: ERROR: missing of_node_put; acquired a 
node pointer with refcount incremented on line 1175, but without a 
corresponding object release within this function.

Signed-off-by: Wen Yang 
Cc: Patrice Chotard 
Cc: Linus Walleij 
Cc: linux-g...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org (open list)
---
 drivers/pinctrl/pinctrl-st.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/pinctrl/pinctrl-st.c b/drivers/pinctrl/pinctrl-st.c
index e66af93..195b442 100644
--- a/drivers/pinctrl/pinctrl-st.c
+++ b/drivers/pinctrl/pinctrl-st.c
@@ -1170,7 +1170,7 @@ static int st_pctl_dt_parse_groups(struct device_node *np,
struct property *pp;
struct st_pinconf *conf;
struct device_node *pins;
-   int i = 0, npins = 0, nr_props;
+   int i = 0, npins = 0, nr_props, ret = 0;
 
pins = of_get_child_by_name(np, "st,pins");
if (!pins)
@@ -1185,7 +1185,8 @@ static int st_pctl_dt_parse_groups(struct device_node *np,
npins++;
} else {
pr_warn("Invalid st,pins in %pOFn node\n", np);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto out_put_node;
}
}
 
@@ -1195,8 +1196,10 @@ static int st_pctl_dt_parse_groups(struct device_node *np,
grp->pin_conf = devm_kcalloc(info->dev,
npins, sizeof(*conf), GFP_KERNEL);
 
-   if (!grp->pins || !grp->pin_conf)
-   return -ENOMEM;
+   if (!grp->pins || !grp->pin_conf) {
+   ret = -ENOMEM;
+   goto out_put_node;
+   }
 
/*  */
for_each_property_of_node(pins, pp) {
@@ -1229,9 +1232,11 @@ static int st_pctl_dt_parse_groups(struct device_node *np,
}
i++;
}
+
+out_put_node:
of_node_put(pins);
 
-   return 0;
+   return ret;
 }
 
 static int st_pctl_parse_functions(struct device_node *np,
-- 
2.9.5



[PATCH 1/5] pinctrl: pistachio: fix leaked of_node references

2019-04-12 Thread Wen Yang
The call to of_get_child_by_name() returns a node pointer with its
refcount incremented, so it must be explicitly decremented after the
last use.

Detected by coccinelle with the following warnings:
./drivers/pinctrl/pinctrl-pistachio.c:1422:1-7: ERROR: missing of_node_put; 
acquired a node pointer with refcount incremented on line 1360, but without a 
corresponding object release within this function.
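
When a child node is looked up once per loop iteration, each error exit from the loop body has to drop the reference taken in that iteration before jumping to the common error label. The following user-space sketch models that shape. `struct mock_node`, `mock_get()`, `mock_put()` and `register_banks_sketch()` are hypothetical names, and releasing the reference at the end of a successful iteration is a simplification assumed for the sketch, not a statement about what the driver does on success.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for one child device_node. */
struct mock_node {
	int refcount;
	int has_gpio_controller;
	int irq;
};

/* Models a child lookup that takes a reference on the node it returns. */
static struct mock_node *mock_get(struct mock_node *n)
{
	if (n)
		n->refcount++;
	return n;
}

/* Models of_node_put(): NULL is accepted and ignored. */
static void mock_put(struct mock_node *n)
{
	if (!n)
		return;
	assert(n->refcount > 0);
	n->refcount--;
}

static int register_banks_sketch(struct mock_node *banks, size_t nbanks)
{
	size_t i;
	int ret;

	for (i = 0; i < nbanks; i++) {
		struct mock_node *child = mock_get(&banks[i]);
		int irq = child->irq;

		if (!child->has_gpio_controller) {
			mock_put(child);	/* drop this iteration's ref */
			ret = -ENODEV;
			goto err;
		}

		if (irq < 0) {
			mock_put(child);	/* drop this iteration's ref */
			ret = irq;
			goto err;
		}

		/* Assumption for the sketch: the node is no longer needed. */
		mock_put(child);
	}

	return 0;
err:
	return ret;
}
```

The common `err` label cannot perform the release itself because it has no record of which child was current when the failure occurred, so the put sits next to each `goto`.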

Signed-off-by: Wen Yang 
Cc: Linus Walleij 
Cc: linux-g...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/pinctrl/pinctrl-pistachio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/pinctrl/pinctrl-pistachio.c b/drivers/pinctrl/pinctrl-pistachio.c
index aa5f949..5b0678f 100644
--- a/drivers/pinctrl/pinctrl-pistachio.c
+++ b/drivers/pinctrl/pinctrl-pistachio.c
@@ -1367,6 +1367,7 @@ static int pistachio_gpio_register(struct pistachio_pinctrl *pctl)
if (!of_find_property(child, "gpio-controller", NULL)) {
dev_err(pctl->dev,
"No gpio-controller property for bank %u\n", i);
+   of_node_put(child);
ret = -ENODEV;
goto err;
}
@@ -1374,6 +1375,7 @@ static int pistachio_gpio_register(struct pistachio_pinctrl *pctl)
irq = irq_of_parse_and_map(child, 0);
if (irq < 0) {
dev_err(pctl->dev, "No IRQ for bank %u: %d\n", i, irq);
+   of_node_put(child);
ret = irq;
goto err;
}
-- 
2.9.5


