Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Qian Cai



On 6/24/2021 7:48 AM, Will Deacon wrote:
> Ok, diff below which attempts to tackle the offset issue I mentioned as
> well. Qian Cai -- please can you try with these changes?

This works fine.

> 
> Will
> 
> --->8
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 175b6c113ed8..39284ff2a6cd 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -116,7 +116,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
>  
>  static inline bool is_swiotlb_force_bounce(struct device *dev)
>  {
> -   return dev->dma_io_tlb_mem->force_bounce;
> +   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +
> +   return mem && mem->force_bounce;
>  }
>  
>  void __init swiotlb_exit(void);
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 44be8258e27b..0ffbaae9fba2 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -449,6 +449,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  		dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
>  	unsigned int nslots = nr_slots(alloc_size), stride;
>  	unsigned int index, wrap, count = 0, i;
> +	unsigned int offset = swiotlb_align_offset(dev, orig_addr);
>  	unsigned long flags;
>  
>  	BUG_ON(!nslots);
> @@ -497,7 +498,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
>  	for (i = index; i < index + nslots; i++) {
>  		mem->slots[i].list = 0;
>  		mem->slots[i].alloc_size =
> -			alloc_size - ((i - index) << IO_TLB_SHIFT);
> +			alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
>  	}
>  	for (i = index - 1;
>  	     io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
> 


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-23 Thread Qian Cai



On 6/23/2021 2:37 PM, Will Deacon wrote:
> On Wed, Jun 23, 2021 at 12:39:29PM -0400, Qian Cai wrote:
>>
>>
>> On 6/18/2021 11:40 PM, Claire Chang wrote:
>>> Propagate the swiotlb_force into io_tlb_default_mem->force_bounce and
>>> use it to determine whether to bounce the data or not. This will be
>>> useful later to allow for different pools.
>>>
>>> Signed-off-by: Claire Chang 
>>> Reviewed-by: Christoph Hellwig 
>>> Tested-by: Stefano Stabellini 
>>> Tested-by: Will Deacon 
>>> Acked-by: Stefano Stabellini 
>>
>> Reverting the rest of the series up to this patch fixed a boot crash with 
>> NVMe on today's linux-next.
> 
> Hmm, so that makes patch 7 the suspicious one, right?

Will, no. It is rather patch #6 (this patch). Only patches #6 through #12
were reverted to fix the issue. Also, looking at the offset of the crash,

pc : dma_direct_map_sg+0x304/0x8f0
is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() is the new function introduced by this patch:

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+   return dev->dma_io_tlb_mem->force_bounce;
+}
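
For illustration, here is a minimal userspace sketch of the guarded form of
this helper, assuming the crash comes from dev->dma_io_tlb_mem being NULL for
this device. The two structs are hypothetical stand-ins that model only the
fields used here; they are not the kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel structures; only the
 * fields needed for this illustration are modelled. */
struct io_tlb_mem {
	bool force_bounce;
};

struct device {
	struct io_tlb_mem *dma_io_tlb_mem;	/* assumed to be NULL when no pool is attached */
};

/* Guarded variant: a missing pool means "do not force bouncing" rather
 * than a NULL pointer dereference. */
static inline bool is_swiotlb_force_bounce(struct device *dev)
{
	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

	return mem && mem->force_bounce;
}
```

With a NULL pool pointer the guarded helper returns false; the unguarded
version would fault on the same input.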

> 
> Looking at that one more closely, it looks like swiotlb_find_slots() takes
> 'alloc_size + offset' as its 'alloc_size' parameter from
> swiotlb_tbl_map_single() and initialises 'mem->slots[i].alloc_size' based
> on 'alloc_size + offset', which looks like a change in behaviour from the
> old code, which didn't include the offset there.
> 
> swiotlb_release_slots() then adds the offset back on afaict, so we end up
> accounting for it twice and possibly unmap more than we're supposed to?
> 
> Will
> 
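
To make the accounting concern above concrete, here is a small worked sketch
of the per-slot size arithmetic. It assumes 2 KiB slots (IO_TLB_SHIFT of 11)
and uses two hypothetical helper names; it models only the arithmetic, not
the real slot bookkeeping.

```c
#include <stddef.h>

#define IO_TLB_SHIFT 11	/* assumption: 2 KiB slots */

/* Per-slot remaining size when the caller passed 'size + offset' as
 * alloc_size and the offset is left in place. */
static size_t slot_size_with_offset(size_t alloc_size, unsigned int slot)
{
	return alloc_size - ((size_t)slot << IO_TLB_SHIFT);
}

/* Per-slot remaining size with the offset taken back out, i.e.
 * alloc_size - (offset + (slot << IO_TLB_SHIFT)). */
static size_t slot_size_without_offset(size_t alloc_size, size_t offset,
				       unsigned int slot)
{
	return alloc_size - (offset + ((size_t)slot << IO_TLB_SHIFT));
}
```

For a 4096-byte mapping with a 512-byte offset, alloc_size arrives as 4608.
Slot 0 then records 4608 bytes in the first form and 4096 in the second. If
the release path adds the offset again, only the second form ends up
accounting for it once.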


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-23 Thread Qian Cai



On 6/18/2021 11:40 PM, Claire Chang wrote:
> Propagate the swiotlb_force into io_tlb_default_mem->force_bounce and
> use it to determine whether to bounce the data or not. This will be
> useful later to allow for different pools.
> 
> Signed-off-by: Claire Chang 
> Reviewed-by: Christoph Hellwig 
> Tested-by: Stefano Stabellini 
> Tested-by: Will Deacon 
> Acked-by: Stefano Stabellini 

Reverting the rest of the series up to this patch fixed a boot crash with NVMe 
on today's linux-next.

[   22.286574][T7] Unable to handle kernel paging request at virtual 
address dfff800e
[   22.295225][T7] Mem abort info:
[   22.298743][T7]   ESR = 0x9604
[   22.302496][T7]   EC = 0x25: DABT (current EL), IL = 32 bits
[   22.308525][T7]   SET = 0, FnV = 0
[   22.312274][T7]   EA = 0, S1PTW = 0
[   22.316131][T7]   FSC = 0x04: level 0 translation fault
[   22.321704][T7] Data abort info:
[   22.325278][T7]   ISV = 0, ISS = 0x0004
[   22.329840][T7]   CM = 0, WnR = 0
[   22.333503][T7] [dfff800e] address between user and kernel 
address ranges
[   22.338543][  T256] igb 0006:01:00.0: Intel(R) Gigabit Ethernet Network 
Connection
[   22.341400][T7] Internal error: Oops: 9604 [#1] SMP
[   22.348915][  T256] igb 0006:01:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 
4c:38:d5:09:c8:83
[   22.354458][T7] Modules linked in: igb(+) i2c_algo_bit nvme mlx5_core(+) 
i2c_core nvme_core firmware_class
[   22.362512][  T256] igb 0006:01:00.0: eth0: PBA No: G69016-004
[   22.372287][T7] CPU: 13 PID: 7 Comm: kworker/u64:0 Not tainted 
5.13.0-rc7-next-20210623+ #47
[   22.372293][T7] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, 
BIOS 1.6 06/28/2020
[   22.372298][T7] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[   22.378145][  T256] igb 0006:01:00.0: Using MSI-X interrupts. 4 rx queue(s), 
4 tx queue(s)
[   22.386901][T7] 
[   22.386905][T7] pstate: 1005 (nzcV daif -PAN -UAO -TCO BTYPE=--)
[   22.386910][T7] pc : dma_direct_map_sg+0x304/0x8f0

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
(inlined by) dma_direct_map_page at /usr/src/linux-next/kernel/dma/direct.h:90
(inlined by) dma_direct_map_sg at /usr/src/linux-next/kernel/dma/direct.c:428

[   22.386919][T7] lr : dma_map_sg_attrs+0x6c/0x118
[   22.386924][T7] sp : 80001dc8eac0
[   22.386926][T7] x29: 80001dc8eac0 x28: 199e70b0 x27: 

[   22.386935][T7] x26: 000847ee7000 x25: 80001158e570 x24: 
0002
[   22.386943][T7] x23: dfff8000 x22: 0100 x21: 
199e7460
[   22.386951][T7] x20: 199e7488 x19: 0001 x18: 
10062670
[   22.386955][  T253] Unable to handle kernel paging request at virtual 
address dfff800e
[   22.386958][T7] x17: 8000109f6a90 x16: 8000109e1b4c x15: 
89303420
[   22.386965][  T253] Mem abort info:
[   22.386967][T7] x14: 0001 x13: 80001158e000
[   22.386970][  T253]   ESR = 0x9604
[   22.386972][T7]  x12: 1fffe00108fdce01
[   22.386975][  T253]   EC = 0x25: DABT (current EL), IL = 32 bits
[   22.386976][T7] x11: 1fffe00108fdce03 x10: 000847ee700c x9 : 
0004
[   22.386981][  T253]   SET = 0, FnV = 0
[   22.386983][T7] 
[   22.386985][T7] x8 : 73b91d72
[   22.386986][  T253]   EA = 0, S1PTW = 0
[   22.386987][T7]  x7 :  x6 : 000e
[   22.386990][  T253]   FSC = 0x04: level 0 translation fault
[   22.386992][T7] 
[   22.386994][T7] x5 : dfff8000
[   22.386995][  T253] Data abort info:
[   22.386997][T7]  x4 : 0008c7ede000
[   22.386999][  T253]   ISV = 0, ISS = 0x0004
[   22.386999][T7]  x3 : 0008c7ede000
[   22.387003][T7] x2 : 1000
[   22.387003][  T253]   CM = 0, WnR = 0
[   22.387006][T7]  x1 :  x0 : 0071
[   22.387008][  T253] [dfff800e] address between user and kernel 
address ranges
[   22.387011][T7] 
[   22.387013][T7] Call trace:
[   22.387016][T7]  dma_direct_map_sg+0x304/0x8f0
[   22.387022][T7]  dma_map_sg_attrs+0x6c/0x118
[   22.387026][T7]  nvme_map_data+0x2ec/0x21d8 [nvme]
[   22.387040][T7]  nvme_queue_rq+0x274/0x3f0 [nvme]
[   22.387052][T7]  blk_mq_dispatch_rq_list+0x2ec/0x18a0
[   22.387060][T7]  __blk_mq_sched_dispatch_requests+0x2a0/0x3e8
[   22.387065][T7]  blk_mq_sched_dispatch_requests+0xa4/0x100
[   22.387070][T7]  __blk_mq_run_hw_queue+0x148/0x1d8
[   22.387075][T7]  __blk_mq_delay_run_hw_queue+0x3f8/0x730
[   22.414539][  T269] igb 0006:01:00.0 enP6p1s0: renamed from eth0
[   22.418957][T7]  blk_mq_run_hw_queue+0x148/0x248
[   22.418969][T7]  blk_mq_sched_insert_request+0x2a4/0x330
[   22.418975][T7]  blk_execute_rq_nowait+0xc8/0x118
[   22.418981][T7]  blk_execute_rq+0xd4/0x188
[   22.453203][  T255] 

Re: [PATCH 0/5] cpufreq: cppc: Fix suspend/resume specific races with FIE code

2021-06-15 Thread Qian Cai



On 6/15/2021 3:50 AM, Viresh Kumar wrote:
> Hi Qian,
> 
> First of all thanks for testing this, I need more of your help to test
> this out :)
> 
> FWIW, I did test this on my Hikey board today, with some hacks, and
> tried multiple insmod/rmmod operations for the driver, and I wasn't
> able to reproduce the issue you reported. I did enable the list-debug
> config option.

The setup here is an arm64 server with 32 CPUs.

> 
> On 14-06-21, 09:48, Qian Cai wrote:
>> Unfortunately, this series looks like it needs more work.
>>
>> [  487.773586][T0] CPU17: Booted secondary processor 0x000801 
>> [0x503f0002]
>> [  487.976495][  T670] list_del corruption. next->prev should be 
>> 009b66e9ec70, but was 009b66dfec70
>> [  487.987037][  T670] [ cut here ]
>> [  487.992351][  T670] kernel BUG at lib/list_debug.c:54!
>> [  487.997810][  T670] Internal error: Oops - BUG: 0 [#1] SMP
>> [  488.003295][  T670] Modules linked in: cpufreq_userspace xfs loop 
>> cppc_cpufreq processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod 
>> igb i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
>> [  488.021759][  T670] CPU: 1 PID: 670 Comm: cppc_fie Not tainted 
>> 5.13.0-rc5-next-20210611+ #46
>> [  488.030190][  T670] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, 
>> BIOS 1.6 06/28/2020
>> [  488.038705][  T670] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
>> [  488.045398][  T670] pc : __list_del_entry_valid+0x154/0x158
>> [  488.050969][  T670] lr : __list_del_entry_valid+0x154/0x158
>> [  488.056534][  T670] sp : 8000229afd70
>> [  488.060534][  T670] x29: 8000229afd70 x28: 0008c8f4f340 x27: 
>> dfff8000
>> [  488.068361][  T670] x26: 009b66e9ec70 x25: 800011c8b4d0 x24: 
>> 0008d4bfe488
>> [  488.076188][  T670] x23: 0008c8f4f340 x22: 0008c8f4f340 x21: 
>> 009b6789ec70
>> [  488.084015][  T670] x20: 0008d4bfe4c8 x19: 009b66e9ec70 x18: 
>> 0008c8f4fd70
>> [  488.091842][  T670] x17: 20747562202c3037 x16: 6365396536366239 x15: 
>> 0028
>> [  488.099669][  T670] x14:  x13: 0001 x12: 
>> 60136cdd3447
>> [  488.107495][  T670] x11: 1fffe0136cdd3446 x10: 60136cdd3446 x9 : 
>> 8000103ee444
>> [  488.115322][  T670] x8 : 009b66e9a237 x7 : 0001 x6 : 
>> 009b66e9a230
>> [  488.123149][  T670] x5 : 9fec9322cbba x4 : 60136cdd3447 x3 : 
>> 1fffe001191e9e69
>> [  488.130975][  T670] x2 :  x1 :  x0 : 
>> 0054
>> [  488.138803][  T670] Call trace:
>> [  488.141935][  T670]  __list_del_entry_valid+0x154/0x158
>> [  488.147153][  T670]  kthread_worker_fn+0x15c/0xda0
> 
> This is a strange place to get the issue from. And this is a new
> issue.

Well, it was still the same exercise: CPU online/offline.

> 
>> [  488.151939][  T670]  kthread+0x3ac/0x460
>> [  488.155854][  T670]  ret_from_fork+0x10/0x18
>> [  488.160120][  T670] Code: 911e8000 aa1303e1 910a 941b595b (d421)
>> [  488.166901][  T670] ---[ end trace e637e2d38b2cc087 ]---
>> [  488.172206][  T670] Kernel panic - not syncing: Oops - BUG: Fatal 
>> exception
>> [  488.179182][  T670] SMP: stopping secondary CPUs
>> [  489.209347][  T670] SMP: failed to stop secondary CPUs 0-1,10-11,16-17,31
>> [  489.216128][  T][  T670] Memoryn ]---
> 
> Can you give details on what exactly did you try to do, to get this ?
> Normal boot or something more ?

Basically, the system uses CPPC as the cpufreq driver and schedutil as the
governor. I ran a few workloads to make the CPUs scale up and down, then
offlined all CPUs except the last one, and then onlined all CPUs again.

> 
> I have made some changes to the way calls were happening, may get this
> thing sorted. Can you please try this branch ?
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc
> 
> I can see one place where race can happen, i.e. between
> topology_clear_scale_freq_source() and topology_scale_freq_tick(). It
> is possible that sfd->set_freq_scale() may get called for a previously
> set handler as there is no protection there.
> 
> I will see how to fix that. But I am not sure if the issue reported
> above comes from there.
> 
> Anyway, please give my branch a try, lets see.

I am hesitant to try this at the moment because this all feels like shooting in
the dark. Ideally, you will be able to get access to one of those arm64 servers
(Huawei, Ampere, TX2, FJ etc) eventually and really try the same exercises
yourself with those 

Re: [PATCH 0/5] cpufreq: cppc: Fix suspend/resume specific races with FIE code

2021-06-14 Thread Qian Cai



On 6/10/2021 4:23 AM, Viresh Kumar wrote:
> Hi Qian,
> 
> It would be helpful if you can test this patchset and confirm if the races you
> mentioned went away or not and that the FIE code works as we wanted it to.
> 
> I don't have a real setup and so it won't be easy for me to test this out.
> 
> I have already sent a temporary fix for 5.13 and this patchset is targeted for
> 5.14 and is based over that.

Unfortunately, this series looks like it needs more work.

[  487.773586][T0] CPU17: Booted secondary processor 0x000801 
[0x503f0002]
[  487.976495][  T670] list_del corruption. next->prev should be 
009b66e9ec70, but was 009b66dfec70
[  487.987037][  T670] [ cut here ]
[  487.992351][  T670] kernel BUG at lib/list_debug.c:54!
[  487.997810][  T670] Internal error: Oops - BUG: 0 [#1] SMP
[  488.003295][  T670] Modules linked in: cpufreq_userspace xfs loop 
cppc_cpufreq processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb 
i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
[  488.021759][  T670] CPU: 1 PID: 670 Comm: cppc_fie Not tainted 
5.13.0-rc5-next-20210611+ #46
[  488.030190][  T670] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, 
BIOS 1.6 06/28/2020
[  488.038705][  T670] pstate: 60c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
[  488.045398][  T670] pc : __list_del_entry_valid+0x154/0x158
[  488.050969][  T670] lr : __list_del_entry_valid+0x154/0x158
[  488.056534][  T670] sp : 8000229afd70
[  488.060534][  T670] x29: 8000229afd70 x28: 0008c8f4f340 x27: 
dfff8000
[  488.068361][  T670] x26: 009b66e9ec70 x25: 800011c8b4d0 x24: 
0008d4bfe488
[  488.076188][  T670] x23: 0008c8f4f340 x22: 0008c8f4f340 x21: 
009b6789ec70
[  488.084015][  T670] x20: 0008d4bfe4c8 x19: 009b66e9ec70 x18: 
0008c8f4fd70
[  488.091842][  T670] x17: 20747562202c3037 x16: 6365396536366239 x15: 
0028
[  488.099669][  T670] x14:  x13: 0001 x12: 
60136cdd3447
[  488.107495][  T670] x11: 1fffe0136cdd3446 x10: 60136cdd3446 x9 : 
8000103ee444
[  488.115322][  T670] x8 : 009b66e9a237 x7 : 0001 x6 : 
009b66e9a230
[  488.123149][  T670] x5 : 9fec9322cbba x4 : 60136cdd3447 x3 : 
1fffe001191e9e69
[  488.130975][  T670] x2 :  x1 :  x0 : 
0054
[  488.138803][  T670] Call trace:
[  488.141935][  T670]  __list_del_entry_valid+0x154/0x158
[  488.147153][  T670]  kthread_worker_fn+0x15c/0xda0
[  488.151939][  T670]  kthread+0x3ac/0x460
[  488.155854][  T670]  ret_from_fork+0x10/0x18
[  488.160120][  T670] Code: 911e8000 aa1303e1 910a 941b595b (d421)
[  488.166901][  T670] ---[ end trace e637e2d38b2cc087 ]---
[  488.172206][  T670] Kernel panic - not syncing: Oops - BUG: Fatal exception
[  488.179182][  T670] SMP: stopping secondary CPUs
[  489.209347][  T670] SMP: failed to stop secondary CPUs 0-1,10-11,16-17,31
[  489.216128][  T][  T670] Memoryn ]---

> 
> -8<-
> 
> The CPPC driver currently stops the frequency invariance related
> kthread_work and irq_work from cppc_freq_invariance_exit() which is only
> called during driver's removal.
> 
> This is not sufficient as the CPUs can get hot-plugged out while the
> driver is in use, the same also happens during system suspend/resume.
> 
> In such cases we can reach a state where the CPU is removed by the
> kernel but its kthread_work or irq_work aren't stopped.
> 
> Fix this by implementing the start_cpu() and stop_cpu() callbacks in the
> cpufreq core, which will be called for each CPU's addition/removal.
> 
> A similar call was already available in the cpufreq core, which isn't required
> anymore and so its users are migrated to use exit() callback instead.
> 
> This is targeted for v5.14-rc1.
> 
> --
> Viresh
> 
> Viresh Kumar (5):
>   cpufreq: cppc: Migrate to ->exit() callback instead of ->stop_cpu()
>   cpufreq: intel_pstate: Migrate to ->exit() callback instead of
> ->stop_cpu()
>   cpufreq: powernv: Migrate to ->exit() callback instead of
> ->stop_cpu()
>   cpufreq: Add start_cpu() and stop_cpu() callbacks
>   cpufreq: cppc: Fix suspend/resume specific races with the FIE code
> 
>  Documentation/cpu-freq/cpu-drivers.rst |   7 +-
>  drivers/cpufreq/Kconfig.arm|   1 -
>  drivers/cpufreq/cppc_cpufreq.c | 163 ++---
>  drivers/cpufreq/cpufreq.c  |  11 +-
>  drivers/cpufreq/intel_pstate.c |   9 +-
>  drivers/cpufreq/powernv-cpufreq.c  |  23 ++--
>  include/linux/cpufreq.h|   5 +-
>  7 files changed, 119 insertions(+), 100 deletions(-)
> 


Power9 NV linux-next random process hang

2021-01-05 Thread Qian Cai
.config: 
https://cailca.coding.net/public/linux/mm/git/files/master/powerpc.config

Today's linux-next started to generate random process hangs quite easily;
yesterday's build seemed to work fine. Sometimes the process stack seems corrupt
while the process is running at 100% CPU, and gdb shows it has just entered a
subroutine, with no obvious reason why it hangs.

[ 6732.309621][T11627] task:ranbug  state:R  running task 
stack:24176 pid: 2893 ppid:  2867 flags:0x0004 
[ 6732.309779][T11627] Call Trace: 
[ 6732.309826][T11627] [c0006166fa30] [c0006166fb60] 0xc0006166fb60 
(unreliable) 

Also, running the LTP syscalls tests ended up hanging with lots of zombie
processes. Any idea?

root      2023  0.0  0.0      0     0 ?        Zs   14:10   0:00 [login]
root     52052  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [recv01]
root     52054  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [recvfrom01]
root     52056  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [recvmsg01]
root     52155  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [rt_sigtimedwait]
root     52305  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [semctl01]
root     52362  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [send01]
root     52386  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04]
root     52387  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04]
root     52388  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04]
root     52389  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04]
root     52390  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04]
root     52392  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04_64]
root     52393  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04_64]
root     52394  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04_64]
root     52395  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04_64]
root     52396  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile04_64]
root     52398  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile05]
root     52400  0.0  0.0      0     0 pts/0    Z    15:03   0:00 [sendfile05_64]
root     52415  0.0  0.0      0     0 pts/0    Z    15:04   0:00 [sendmsg01]
root     53470  0.0  0.0      0     0 pts/0    Z    15:04   0:00 [sendto01]
root     53763  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53764  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53765  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53766  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53767  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53768  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53769  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53770  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53771  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53772  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53773  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53774  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53775  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53776  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53777  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53778  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53779  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53780  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
root     53782  0.0  0.0      0     0 pts/0    Z    15:06   0:00 [setrlimit01]
nobody   54290  0.0  0.0      0     0 pts/0    Z    15:07   0:00 [sysctl03]
root     56813  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56814  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56815  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56816  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56817  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56818  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56819  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56820  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56821  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56822  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56823  0.0  0.0      0     0 pts/0    Z    16:09   0:00 [waitpid03]
root     56825  0.0  0.0 

Re: [PATCH] powerpc/mm: Refactor the floor/ceiling check in hugetlb range freeing functions

2020-12-11 Thread Qian Cai
On Fri, 2020-11-06 at 13:20 +, Christophe Leroy wrote:
> All hugetlb range freeing functions have a verification like the following,
> which only differs by the mask used, depending on the page table level.
> 
>   start &= MASK;
>   if (start < floor)
>   return;
>   if (ceiling) {
>   ceiling &= MASK;
>   if (! ceiling)
>   return;
>   }
>   if (end - 1 > ceiling - 1)
>   return;
> 
> Refactor that into a helper function which takes the mask as
> an argument, returning true when [start;end[ is not fully
> contained inside [floor;ceiling[
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/hugetlbpage.c | 56 ---
>  1 file changed, 19 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 36c3800769fb..f8d8a4988e15 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -294,6 +294,21 @@ static void hugepd_free(struct mmu_gather *tlb, void *hugepte)
>  static inline void hugepd_free(struct mmu_gather *tlb, void *hugepte) {}
>  #endif
>  
> +/* Return true when the entry to be freed maps more than the area being freed */
> +static bool range_is_outside_limits(unsigned long start, unsigned long end,
> + unsigned long floor, unsigned long ceiling,
> + unsigned long mask)
> +{
> + if ((start & mask) < floor)
> + return true;
> + if (ceiling) {
> + ceiling &= mask;
> + if (!ceiling)
> + return true;
> + }
> + return end - 1 > ceiling - 1;
> +}
> +
>  static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
> unsigned long start, unsigned long end,
> unsigned long floor, unsigned long ceiling)
> @@ -309,15 +324,7 @@ static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshif
>   if (shift > pdshift)
>   num_hugepd = 1 << (shift - pdshift);
>  
> - start &= pdmask;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= pdmask;
> - if (! ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(start, end, floor, ceiling, pdmask))
>   return;
>  
>   for (i = 0; i < num_hugepd; i++, hpdp++)
> @@ -334,18 +341,9 @@ static void hugetlb_free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
>  unsigned long addr, unsigned long end,
>  unsigned long floor, unsigned long ceiling)
>  {
> - unsigned long start = addr;
>   pgtable_t token = pmd_pgtable(*pmd);
>  
> - start &= PMD_MASK;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= PMD_MASK;
> - if (!ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(addr, end, floor, ceiling, PMD_MASK))
>   return;
>  
>   pmd_clear(pmd);
> @@ -395,15 +393,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
> addr, next, floor, ceiling);
>   } while (addr = next, addr != end);
>  
> - start &= PUD_MASK;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= PUD_MASK;
> - if (!ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(start, end, floor, ceiling, PUD_MASK))
>   return;
>  
>   pmd = pmd_offset(pud, start);
> @@ -446,15 +436,7 @@ static void hugetlb_free_pud_range(struct mmu_gather *tlb, p4d_t *p4d,
>   }
>   } while (addr = next, addr != end);
>  
> - start &= PGDIR_MASK;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= PGDIR_MASK;
> - if (!ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(start, end, floor, ceiling, PGDIR_MASK))
>   return;
>  
>   pud = pud_offset(p4d, start);

Well, "start" is still used in hugetlb_free_pmd_range() and
hugetlb_free_pud_range() after the range_is_outside_limits() check, but after
this patch "start" no longer has the mask applied, i.e., there is no "&=".

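A minimal userspace sketch of the difference, with the helper copied from the
patch above and a hypothetical wrapper, start_seen_after_check(), that returns
the "start" value later code such as pmd_offset(pud, start) would see. The
mask and addresses used with it are made up for illustration.

```c
#include <stdbool.h>

/* Userspace copy of the helper from the quoted patch. 'start' is passed
 * by value, so the masking here never reaches the caller's variable. */
static bool range_is_outside_limits(unsigned long start, unsigned long end,
				    unsigned long floor, unsigned long ceiling,
				    unsigned long mask)
{
	if ((start & mask) < floor)
		return true;
	if (ceiling) {
		ceiling &= mask;
		if (!ceiling)
			return true;
	}
	return end - 1 > ceiling - 1;
}

/* Hypothetical caller shape. Returns the 'start' that code after the
 * check would use, or 0 if the range check fails. */
static unsigned long start_seen_after_check(unsigned long start, unsigned long end,
					    unsigned long floor, unsigned long ceiling,
					    unsigned long mask, bool mask_in_caller)
{
	if (mask_in_caller)
		start &= mask;	/* old code: "start &= MASK" before the checks */
	if (range_is_outside_limits(start, end, floor, ceiling, mask))
		return 0;
	return start;
}
```

With mask = ~0xfffffUL, start = 0x100800, end = 0x180000, floor = 0x100000 and
ceiling = 0x200000, the old shape continues with 0x100000 while the new shape
continues with the unmasked 0x100800.
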
Anyway, reverting this commit from today's linux-next fixed a crash on POWER9 NV.

# runltp -f hugetlb
[ 7703.114640][T58070] LTP: starting hugemmap05_1 (hugemmap05 -m)
[ 7703.157792][   C99] [ cut here ]
[ 7703.158279][   C99] kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:387!
[ 7703.158306][   C99] Oops: Exception in kernel mode, sig: 5 [#1]
[ 

Re: [PATCH v6 0/5] PCI: Unify ECAM constants in native PCI Express drivers

2020-12-08 Thread Qian Cai
On Sun, 2020-11-29 at 23:07 +, Krzysztof Wilczyński wrote:
> Unify ECAM-related constants into a single set of standard constants
> defining memory address shift values for the byte-level address that can
> be used when accessing the PCI Express Configuration Space, and then
> move native PCI Express controller drivers to use newly introduced
> definitions retiring any driver-specific ones.
> 
> The ECAM ("Enhanced Configuration Access Mechanism") is defined by the
> PCI Express specification (see PCI Express Base Specification, Revision
> 5.0, Version 1.0, Section 7.2.2, p. 676), thus most hardware should
> implement it the same way.
> 
> Most of the native PCI Express controller drivers define their ECAM-related
> constants, many of these could be shared, or use open-coded values when
> setting the ".bus_shift" field of the "struct pci_ecam_ops".
> 
> All of the newly added constants should remove ambiguity and reduce the
> number of open-coded values, and also correlate more strongly with the
> descriptions in the aforementioned specification (see Table 7-1
> "Enhanced Configuration Address Mapping", p. 677).
> 
> Suggested-by: Bjorn Helgaas 
> Signed-off-by: Krzysztof Wilczyński 
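
As a rough sketch of what such shift constants encode, the function below
computes the byte offset of a configuration register inside an ECAM window.
The macro and function names are hypothetical, chosen for this example only;
the shift values follow the ECAM layout (bus in bits 27:20, device and
function in bits 19:12, register in bits 11:0).

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the constants from the series. */
#define ECAM_BUS_SHIFT   20	/* bus number in bits 27:20 */
#define ECAM_DEVFN_SHIFT 12	/* device (5 bits) and function (3 bits) in bits 19:12 */
#define ECAM_REG_MASK    0xfffu	/* register offset in bits 11:0 */

/* Byte offset of a config register relative to the start of the ECAM window. */
static uint32_t ecam_offset(uint8_t bus, uint8_t devfn, uint16_t reg)
{
	return ((uint32_t)bus << ECAM_BUS_SHIFT) |
	       ((uint32_t)devfn << ECAM_DEVFN_SHIFT) |
	       (reg & ECAM_REG_MASK);
}
```

For example, device 00:01.0 has devfn 8 (device 1, function 0), so register
0x10 of that device sits at offset 0x8010 from the ECAM base.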

Reverting this series on the top of today's linux-next fixed a boot crash on
arm64 Thunder X2 server.

.config: https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

[  186.285957][T1] ACPI: PCI Root Bridge [PCI0] (domain  [bus 00-7f])
[  186.293127][T1] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig 
Segments MSI HPX-Type3]
[  186.317072][T1] acpi PNP0A08:00: _OSC: not requesting OS control; OS 
requires [ExtendedConfig ASPM ClockPM MSI]
[  186.330336][T1] acpi PNP0A08:00: ECAM area [mem 0x3000-0x37ff] 
reserved by PNP0C02:00
[  186.339538][T1] acpi PNP0A08:00: ECAM at [mem 0x3000-0x37ff] for 
[bus 00-7f]
[  186.353258][T1] PCI host bridge to bus :00
[  186.358162][T1] pci_bus :00: root bus resource [mem 
0x4000-0x5fff window]
[  186.366509][T1] pci_bus :00: root bus resource [mem 
0x100-0x13f window]
[  186.375366][T1] pci_bus :00: root bus resource [bus 00-7f]
[  186.382652][T1] pci :00:00.0: [177d:af00] type 00 class 0x06
[  186.395174][T1] pci :00:01.0: [177d:af84] type 01 class 0x060400
[  186.402433][T1] pci :00:01.0: PME# supported from D0 D3hot D3cold
[  186.415652][T1] Unable to handle kernel paging request at virtual 
address 80002937
[  186.424398][T1] Mem abort info:
[  186.427930][T1]   ESR = 0x9607
[  186.431725][T1]   EC = 0x25: DABT (current EL), IL = 32 bits
[  186.437805][T1]   SET = 0, FnV = 0
[  186.441599][T1]   EA = 0, S1PTW = 0
[  186.445485][T1] Data abort info:
[  186.449104][T1]   ISV = 0, ISS = 0x0007
[  186.453687][T1]   CM = 0, WnR = 0
[  186.457396][T1] swapper pgtable: 64k pages, 48-bit VAs, 
pgdp=da92
[  186.464979][T1] [80002937] pgd=008ffcff0003, 
p4d=008ffcff0003, pud=008ffcff0003, pmd=811d0003, 
pte=
[  186.478424][T1] Internal error: Oops: 9607 [#1] SMP
[  186.484059][T1] Modules linked in:
[  186.487854][T1] CPU: 38 PID: 1 Comm: swapper/0 Tainted: GWL  
  5.10.0-rc7-next-20201208 #3
[  186.497617][T1] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.16 07/29/2020
[  186.508174][T1] pstate: 20400089 (nzCv daIf +PAN -UAO -TCO BTYPE=--)
[  186.514954][T1] pc : pci_generic_config_read+0x78/0x1d0
[  186.520587][T1] lr : pci_generic_config_read+0x64/0x1d0
pci_generic_config_read at drivers/pci/access.c:83
[  186.526223][T1] sp : 05f0ef30
[  186.530278][T1] x29: 05f0ef30 x28: 0010 
[  186.536359][T1] x27:  x26: 0087 
[  186.542441][T1] x25:  x24: 00084a3a5000 
[  186.548517][T1] x23: 05f0f150 x22: 0004 
[  186.554593][T1] x21: 800011404588 x20: 05f0eff0 
[  186.560669][T1] x19: 00084a3a5000 x18: 1fffe001cf0d53ed 
[  186.566750][T1] x17:  x16: 0003 
[  186.572831][T1] x15:  x14: 0003 
[  186.578908][T1] x13: 60be1ddf x12: 1fffe0be1dde 
[  186.584983][T1] x11: 1fffe0be1dde x10: 60be1dde 
[  186.591059][T1] x9 : 800010c4f59c x8 : 05f0eef3 
[  186.597139][T1] x7 : 0001 x6 : 0001 
[  186.603222][T1] x5 : 1fffe00109474a1c x4 : 1fffe010fd074cb2 
[  186.609298][T1] x3 :  x2 :  
[  186.615374][T1] x1 : 0001 x0 : 80002937 
[  186.621451][T1] Call trace:
[  186.624623][T1]  pci_generic_config_read+0x78/0x1d0
__raw_readl at arch/arm64/include/asm/io.h:75
(inlined by) pci_generic_config_read at drivers/pci/access.c:93
[  

Re: [PATCH 3/7] powerpc/64s: flush L1D after user accesses

2020-12-03 Thread Qian Cai
On Thu, 2020-12-03 at 12:17 -0500, Qian Cai wrote:
> []
> > +static inline bool
> > +bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
> > +{
> > +   return WARN(mmu_has_feature(MMU_FTR_RADIX_KUAP) &&
> > +   (regs->kuap & (is_write ? AMR_KUAP_BLOCK_WRITE : AMR_KUAP_BLOCK_READ)),
> > +   "Bug: %s fault blocked by AMR!", is_write ? "Write" : "Read");
> > +}
> 
> A simple "echo t > /proc/sysrq-trigger" will trigger this warning almost
> endlessly on POWER9 NV.

I have just realized that the patch only moved this warning around, so the issue
already existed. Since I have not tested sysrq-t regularly, I am not sure when it
started to break. So far, I have reverted the commits below for testing, which
did not help, i.e., the sysrq-t issue remains.

16852975f0f  Revert "powerpc/64s: Use early_mmu_has_feature() in set_kuap()"
129e240ead32 Revert "powerpc: Implement user_access_save() and user_access_restore()"
edb0046c842c Revert "powerpc/64s/kuap: Add missing isync to KUAP restore paths"
2d46ee87ce44 Revert "powerpc/64/kuap: Conditionally restore AMR in interrupt exit"
c1e0e805fc57 Revert "powerpc/64s/kuap: Conditionally restore AMR in kuap_restore_amr asm"
7f30b7aaf23a Revert "selftests/powerpc: rfi_flush: disable entry flush if present"
bc9b9967a100 Revert "powerpc/64s: flush L1D on kernel entry"
b77e7b54f5eb Revert "powerpc/64s: flush L1D after user accesses"
22dddf532c64 Revert "powerpc: Only include kup-radix.h for 64-bit Book3S"
2679d155c46a Revert "selftests/powerpc: entry flush test"
87954b9b4243 Revert "selftests/powerpc: refactor entry and rfi_flush tests"
342d82bd4c5d Revert "powerpc/64s: rename pnv|pseries_setup_rfi_flush to _setup_security_mitigations"



Re: [PATCH 3/7] powerpc/64s: flush L1D after user accesses

2020-12-03 Thread Qian Cai
On Thu, 2020-12-03 at 12:17 -0500, Qian Cai wrote:
> A simple "echo t > /proc/sysrq-trigger" will trigger this warning almost
> endlessly on Power8 NV.

Correction -- POWER9 NV.



Re: [PATCH 3/7] powerpc/64s: flush L1D after user accesses

2020-12-03 Thread Qian Cai
On Fri, 2020-11-20 at 10:13 +1100, Daniel Axtens wrote:
> From: Nicholas Piggin 
> 
> IBM Power9 processors can speculatively operate on data in the L1 cache
> before it has been completely validated, via a way-prediction mechanism. It
> is not possible for an attacker to determine the contents of impermissible
> memory using this method, since these systems implement a combination of
> hardware and software security measures to prevent scenarios where
> protected data could be leaked.
> 
> However these measures don't address the scenario where an attacker induces
> the operating system to speculatively execute instructions using data that
> the attacker controls. This can be used for example to speculatively bypass
> "kernel user access prevention" techniques, as discovered by Anthony
> Steinhauser of Google's Safeside Project. This is not an attack by itself,
> but there is a possibility it could be used in conjunction with
> side-channels or other weaknesses in the privileged code to construct an
> attack.
> 
> This issue can be mitigated by flushing the L1 cache between privilege
> boundaries of concern. This patch flushes the L1 cache after user accesses.
> 
> This is part of the fix for CVE-2020-4788.
> 
> Signed-off-by: Nicholas Piggin 
> Signed-off-by: Daniel Axtens 
> ---
>  .../admin-guide/kernel-parameters.txt |  4 +
>  .../powerpc/include/asm/book3s/64/kup-radix.h | 66 --
>  arch/powerpc/include/asm/exception-64s.h  |  3 +
>  arch/powerpc/include/asm/feature-fixups.h |  9 ++
>  arch/powerpc/include/asm/kup.h| 19 +++--
>  arch/powerpc/include/asm/security_features.h  |  3 +
>  arch/powerpc/include/asm/setup.h  |  1 +
>  arch/powerpc/kernel/exceptions-64s.S  | 85 ++-
>  arch/powerpc/kernel/setup_64.c| 62 ++
>  arch/powerpc/kernel/vmlinux.lds.S |  7 ++
>  arch/powerpc/lib/feature-fixups.c | 50 +++
>  arch/powerpc/platforms/powernv/setup.c| 10 ++-
>  arch/powerpc/platforms/pseries/setup.c|  4 +
>  13 files changed, 233 insertions(+), 90 deletions(-)
[]
> diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h b/arch/powerpc/include/asm/book3s/64/kup-radix.h
> index 3ee1ec60be84..97c2394e7dea 100644
> --- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
[]
> +static inline bool
> +bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
> +{
> + return WARN(mmu_has_feature(MMU_FTR_RADIX_KUAP) &&
> + (regs->kuap & (is_write ? AMR_KUAP_BLOCK_WRITE : AMR_KUAP_BLOCK_READ)),
> + "Bug: %s fault blocked by AMR!", is_write ? "Write" : "Read");
> +}

A simple "echo t > /proc/sysrq-trigger" will trigger this warning almost
endlessly on Power8 NV.

.config: 
https://cailca.coding.net/public/linux/mm/git/files/master/powerpc.config

[  391.734028][ T1986] Bug: Read fault blocked by AMR!
[  391.734032][ T1986] WARNING: CPU: 80 PID: 1986 at 
arch/powerpc/include/asm/book3s/64/kup-radix.h:145 do_page_fault+0x8fc/0xb70
[  391.734232][ T1986] Modules linked in: kvm_hv kvm ip_tables x_tables sd_mod 
ahci libahci tg3 libata firmware_class libphy dm_mirror dm_region_hash dm_log 
dm_mod
[  391.734425][ T1986] CPU: 80 PID: 1986 Comm: bash Tainted: GW 
5.10.0-rc6-next-20201203+ #3
[  391.734535][ T1986] NIP:  c004dd1c LR: c004dd18 CTR: 

[  391.734648][ T1986] REGS: c00020003a0bf3a0 TRAP: 0700   Tainted: GW  
(5.10.0-rc6-next-20201203+)
[  391.734768][ T1986] MSR:  90021033   CR: 
4884  XER: 
[  391.734906][ T1986] CFAR: c00bb05c IRQMASK: 1 
[  391.734906][ T1986] GPR00: c004dd18 c00020003a0bf630 
c7fe0d00 001f 
[  391.734906][ T1986] GPR04: c0f1cc58 0003 
0027 c000201cc6207218 
[  391.734906][ T1986] GPR08: 0023 0002 
c00020004753bd80 c7f1cee8 
[  391.734906][ T1986] GPR12: 2000 c000201fff7f8380 
4000 000110929798 
[  391.734906][ T1986] GPR16: 000110929724 0001108c6988 
00011085f290 00011092d568 
[  391.734906][ T1986] GPR20: 0001229f1f80 0001 
0001 c0aa8dc8 
[  391.734906][ T1986] GPR24: c0ab4a00 c00020001cc8c880 
  
[  391.734906][ T1986] GPR28: c801aa18 0160 
c00020003a0bf760 0300 
[  391.735865][ T1986] NIP [c004dd1c] do_page_fault+0x8fc/0xb70
[  391.735947][ T1986] LR [c004dd18] do_page_fault+0x8f8/0xb70
[  391.736033][ T1986] Call Trace:
[  391.736072][ T1986] [c00020003a0bf630] [c004dd18] 
do_page_fault+0x8f8/0xb70 (unreliable)
[  391.736181][ T1986] [c00020003a0bf6f0] [c000c1b8] 
handle_page_fault+0x10/0x2c
[  391.736294][ T1986] --- interrupt: 300 at 

Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-29 Thread Qian Cai
On Wed, 2020-10-28 at 17:31 -0700, Paul E. McKenney wrote:
> On Thu, Oct 29, 2020 at 11:09:07AM +1100, Michael Ellerman wrote:
> > Qian Cai  writes:
> > > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > > in the CPU-hotplug onlining process, which results in lockdep splats as
> > > follows:
> > 
> > Since when?
> > What kernel version?
> > 
> > I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> > v5.10-rc1. Am I missing a CONFIG?
> 
> My guess would be that adding CONFIG_PROVE_RAW_LOCK_NESTING=y will
> get you some splats.

Well, I don't have that set, so it should be CONFIG_PROVE_RCU_LIST=y. Anyway,
this is the .config to reproduce on POWER9 NV:

https://cailca.coding.net/public/linux/mm/git/files/master/powerpc.config



Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-29 Thread Qian Cai
On Thu, 2020-10-29 at 11:09 +1100, Michael Ellerman wrote:
> Qian Cai  writes:
> > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > in the CPU-hotplug onlining process, which results in lockdep splats as
> > follows:
> 
> Since when?

For me, it started with the commit in the link, which now looks to be merged
into v5.10-rc1. It also needs CONFIG_PROVE_RCU_LIST=y to reproduce.

> What kernel version?
> 
> I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> v5.10-rc1. Am I missing a CONFIG?
> 
> cheers
> 
> 
> >  WARNING: suspicious RCU usage
> >  -----------------------------
> >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > 
> >  other info that might help us debug this:
> > 
> >  RCU used illegally from offline CPU!
> >  rcu_scheduler_active = 1, debug_locks = 1
> >  no locks held by swapper/1/0.
> > 
> >  Call Trace:
> >  dump_stack+0xec/0x144 (unreliable)
> >  lockdep_rcu_suspicious+0x128/0x14c
> >  __lock_acquire+0x1060/0x1c60
> >  lock_acquire+0x140/0x5f0
> >  _raw_spin_lock_irqsave+0x64/0xb0
> >  clockevents_register_device+0x74/0x270
> >  register_decrementer_clockevent+0x94/0x110
> >  start_secondary+0x134/0x800
> >  start_secondary_prolog+0x10/0x14
> > 
> > This is avoided by moving the call to rcu_cpu_starting up near the
> > beginning of the start_secondary() function. Note that the
> > raw_smp_processor_id() is required in order to avoid calling into
> > lockdep before RCU has declared the CPU to be watched for readers.
> > 
> > Link: 
> > https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> > Signed-off-by: Qian Cai 
> > ---
> >  arch/powerpc/kernel/smp.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 3c6b9822f978..8c2857cbd960 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
> >  /* Activate a secondary processor. */
> >  void start_secondary(void *unused)
> >  {
> > -   unsigned int cpu = smp_processor_id();
> > +   unsigned int cpu = raw_smp_processor_id();
> >  
> >     mmgrab(&init_mm);
> >     current->active_mm = &init_mm;
> >  
> >     smp_store_cpu_info(cpu);
> >     set_dec(tb_ticks_per_jiffy);
> > +   rcu_cpu_starting(cpu);
> >     preempt_disable();
> >     cpu_callin_map[cpu] = 1;
> >  
> > -- 
> > 2.28.0



[PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-28 Thread Qian Cai
The call to rcu_cpu_starting() in start_secondary() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats as
follows:

 WARNING: suspicious RCU usage
 -----------------------------
 kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 RCU used illegally from offline CPU!
 rcu_scheduler_active = 1, debug_locks = 1
 no locks held by swapper/1/0.

 Call Trace:
 dump_stack+0xec/0x144 (unreliable)
 lockdep_rcu_suspicious+0x128/0x14c
 __lock_acquire+0x1060/0x1c60
 lock_acquire+0x140/0x5f0
 _raw_spin_lock_irqsave+0x64/0xb0
 clockevents_register_device+0x74/0x270
 register_decrementer_clockevent+0x94/0x110
 start_secondary+0x134/0x800
 start_secondary_prolog+0x10/0x14

This is avoided by moving the call to rcu_cpu_starting up near the
beginning of the start_secondary() function. Note that the
raw_smp_processor_id() is required in order to avoid calling into
lockdep before RCU has declared the CPU to be watched for readers.

Link: 
https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Signed-off-by: Qian Cai 
---
 arch/powerpc/kernel/smp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 3c6b9822f978..8c2857cbd960 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
 /* Activate a secondary processor. */
 void start_secondary(void *unused)
 {
-   unsigned int cpu = smp_processor_id();
+   unsigned int cpu = raw_smp_processor_id();
 
    mmgrab(&init_mm);
    current->active_mm = &init_mm;
 
    smp_store_cpu_info(cpu);
    set_dec(tb_ticks_per_jiffy);
+   rcu_cpu_starting(cpu);
    preempt_disable();
    cpu_callin_map[cpu] = 1;
 
-- 
2.28.0



[PATCH] powerpc/eeh_cache: Fix a possible debugfs deadlock

2020-10-28 Thread Qian Cai
Lockdep complains about a possible deadlock (shown below) in
eeh_addr_cache_show() because it acquires a lock with IRQs enabled,
while eeh_addr_cache_insert_dev() acquires the same lock with IRQs
disabled. Make eeh_addr_cache_show() acquire the lock with IRQs
disabled as well.

       CPU0                    CPU1
       ----                    ----
  lock(&pci_io_addr_cache_root.piar_lock);
                               local_irq_disable();
                               lock(>lock);
                               lock(&pci_io_addr_cache_root.piar_lock);
  <Interrupt>
    lock(>lock);

  *** DEADLOCK ***

  lock_acquire+0x140/0x5f0
  _raw_spin_lock_irqsave+0x64/0xb0
  eeh_addr_cache_insert_dev+0x48/0x390
  eeh_probe_device+0xb8/0x1a0
  pnv_pcibios_bus_add_device+0x3c/0x80
  pcibios_bus_add_device+0x118/0x290
  pci_bus_add_device+0x28/0xe0
  pci_bus_add_devices+0x54/0xb0
  pcibios_init+0xc4/0x124
  do_one_initcall+0xac/0x528
  kernel_init_freeable+0x35c/0x3fc
  kernel_init+0x24/0x148
  ret_from_kernel_thread+0x5c/0x80

  lock_acquire+0x140/0x5f0
  _raw_spin_lock+0x4c/0x70
  eeh_addr_cache_show+0x38/0x110
  seq_read+0x1a0/0x660
  vfs_read+0xc8/0x1f0
  ksys_read+0x74/0x130
  system_call_exception+0xf8/0x1d0
  system_call_common+0xe8/0x218

Fixes: 5ca85ae6318d ("powerpc/eeh_cache: Add a way to dump the EEH address cache")
Signed-off-by: Qian Cai 
---
 arch/powerpc/kernel/eeh_cache.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
index 6b50bf15d8c1..bf3270426d82 100644
--- a/arch/powerpc/kernel/eeh_cache.c
+++ b/arch/powerpc/kernel/eeh_cache.c
@@ -264,8 +264,9 @@ static int eeh_addr_cache_show(struct seq_file *s, void *v)
 {
    struct pci_io_addr_range *piar;
    struct rb_node *n;
+   unsigned long flags;
 
-   spin_lock(&pci_io_addr_cache_root.piar_lock);
+   spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
    for (n = rb_first(&pci_io_addr_cache_root.rb_root); n; n = rb_next(n)) {
        piar = rb_entry(n, struct pci_io_addr_range, rb_node);
 
@@ -273,7 +274,7 @@ static int eeh_addr_cache_show(struct seq_file *s, void *v)
           (piar->flags & IORESOURCE_IO) ? "i/o" : "mem",
           &piar->addr_lo, &piar->addr_hi, pci_name(piar->pcidev));
    }
-   spin_unlock(&pci_io_addr_cache_root.piar_lock);
+   spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
 
return 0;
 }
-- 
2.28.0



[PATCH -next] Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"

2020-10-14 Thread Qian Cai
This reverts commit 3a3181e16fbde752007759f8759d25e0ff1fc425, which
causes memory corruption on POWER9 NV.

Signed-off-by: Qian Cai 
---
 arch/powerpc/include/asm/pci-bridge.h |   6 --
 arch/powerpc/kernel/pci-common.c  | 114 --
 2 files changed, 120 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d21e070352dc..d2a2a14e56f9 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -48,9 +48,6 @@ struct pci_controller_ops {
 
 /*
  * Structure of a PCI controller (host bridge)
- *
- * @irq_count: number of interrupt mappings
- * @irq_map: interrupt mappings
  */
 struct pci_controller {
struct pci_bus *bus;
@@ -130,9 +127,6 @@ struct pci_controller {
 
void *private_data;
struct npu *npu;
-
-   unsigned int irq_count;
-   unsigned int *irq_map;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index deb831f0ae13..be108616a721 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -353,115 +353,6 @@ struct pci_controller *pci_find_controller_for_domain(int domain_nr)
return NULL;
 }
 
-/*
- * Assumption is made on the interrupt parent. All interrupt-map
- * entries are considered to have the same parent.
- */
-static int pcibios_irq_map_count(struct pci_controller *phb)
-{
-   const __be32 *imap;
-   int imaplen;
-   struct device_node *parent;
-   u32 intsize, addrsize, parintsize, paraddrsize;
-
-   if (of_property_read_u32(phb->dn, "#interrupt-cells", &intsize))
-   return 0;
-   if (of_property_read_u32(phb->dn, "#address-cells", &addrsize))
-   return 0;
-
-   imap = of_get_property(phb->dn, "interrupt-map", &imaplen);
-   if (!imap) {
-   pr_debug("%pOF : no interrupt-map\n", phb->dn);
-   return 0;
-   }
-   imaplen /= sizeof(u32);
-   pr_debug("%pOF : imaplen=%d\n", phb->dn, imaplen);
-
-   if (imaplen < (addrsize + intsize + 1))
-   return 0;
-
-   imap += intsize + addrsize;
-   parent = of_find_node_by_phandle(be32_to_cpup(imap));
-   if (!parent) {
-   pr_debug("%pOF : no imap parent found !\n", phb->dn);
-   return 0;
-   }
-
-   if (of_property_read_u32(parent, "#interrupt-cells", &parintsize)) {
-   pr_debug("%pOF : parent lacks #interrupt-cells!\n", phb->dn);
-   return 0;
-   }
-
-   if (of_property_read_u32(parent, "#address-cells", &paraddrsize))
-   paraddrsize = 0;
-
-   return imaplen / (addrsize + intsize + 1 + paraddrsize + parintsize);
-}
-
-static void pcibios_irq_map_init(struct pci_controller *phb)
-{
-   phb->irq_count = pcibios_irq_map_count(phb);
-   if (phb->irq_count < PCI_NUM_INTX)
-   phb->irq_count = PCI_NUM_INTX;
-
-   pr_debug("%pOF : interrupt map #%d\n", phb->dn, phb->irq_count);
-
-   phb->irq_map = kcalloc(phb->irq_count, sizeof(unsigned int),
-  GFP_KERNEL);
-}
-
-static void pci_irq_map_register(struct pci_dev *pdev, unsigned int virq)
-{
-   struct pci_controller *phb = pci_bus_to_host(pdev->bus);
-   int i;
-
-   if (!phb->irq_map)
-   return;
-
-   for (i = 0; i < phb->irq_count; i++) {
-   /*
-* Look for an empty or an equivalent slot, as INTx
-* interrupts can be shared between adapters.
-*/
-   if (phb->irq_map[i] == virq || !phb->irq_map[i]) {
-   phb->irq_map[i] = virq;
-   break;
-   }
-   }
-
-   if (i == phb->irq_count)
-   pr_err("PCI:%s all platform interrupts mapped\n",
-  pci_name(pdev));
-}
-
-/*
- * Clearing the mapped interrupts will also clear the underlying
- * mappings of the ESB pages of the interrupts when under XIVE. It is
- * a requirement of PowerVM to clear all memory mappings before
- * removing a PHB.
- */
-static void pci_irq_map_dispose(struct pci_bus *bus)
-{
-   struct pci_controller *phb = pci_bus_to_host(bus);
-   int i;
-
-   if (!phb->irq_map)
-   return;
-
-   pr_debug("PCI: Clearing interrupt mappings for PHB %04x:%02x...\n",
-pci_domain_nr(bus), bus->number);
-   for (i = 0; i < phb->irq_count; i++)
-   irq_dispose_mapping(phb->irq_map[i]);
-
-   kfree(phb->irq_map);
-}
-
-void pcibios_remove_bus(struct pci_bus *bus)
-{
-   pci_irq_map_dispose(bus);
-}
-EXPORT_SYMBOL_GPL(pcibios_remove_bus);
-
 /*
  * Reads the interrupt pin to determine if interrupt is

Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed

2020-10-13 Thread Qian Cai
On Wed, 2020-09-23 at 09:06 +0200, Cédric Le Goater wrote:
> On 9/23/20 2:33 AM, Qian Cai wrote:
> > On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
> > > When a passthrough IO adapter is removed from a pseries machine using
> > > hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
> > > guest OS to clear all page table entries related to the adapter. If
> > > some are still present, the RTAS call which isolates the PCI slot
> > > returns error 9001 "valid outstanding translations" and the removal of
> > > the IO adapter fails. This is because when the PHBs are scanned, Linux
> > > maps automatically the INTx interrupts in the Linux interrupt number
> > > space but these are never removed.
> > > 
> > > To solve this problem, we introduce a PPC platform specific
> > > pcibios_remove_bus() routine which clears all interrupt mappings when
> > > the bus is removed. This also clears the associated page table entries
> > > of the ESB pages when using XIVE.
> > > 
> > > For this purpose, we record the logical interrupt numbers of the
> > > mapped interrupt under the PHB structure and let pcibios_remove_bus()
> > > do the clean up.
> > > 
> > > Since some PCI adapters, like GPUs, use the "interrupt-map" property
> > > to describe interrupt mappings other than the legacy INTx interrupts,
> > > we can not restrict the size of the mapping array to PCI_NUM_INTX. The
> > > number of interrupt mappings is computed from the "interrupt-map"
> > > property and the mapping array is allocated accordingly.
> > > 
> > > Cc: "Oliver O'Halloran" 
> > > Cc: Alexey Kardashevskiy 
> > > Signed-off-by: Cédric Le Goater 
> > 
> > Some syscall fuzzing will trigger this on POWER9 NV where the traces pointed
> > to
> > this patch.
> > 
> > .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> 
> OK. The patch is missing a NULL assignment after kfree() and that
> might be the issue.
> 
> I did try PHB removal under PowerNV, so I would like to understand
> how we managed to remove the PCI bus twice, and possibly reproduce it.
> Any chance we could grab what the syscall fuzzer (syzkaller) did?
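
For reference, the missing NULL assignment mentioned above can be modelled in
userspace. This is only a sketch under stated assumptions: struct phb_model and
its helpers are made up for illustration, they are not the kernel's
struct pci_controller, and this is not claimed to be the actual fix. It only
shows why clearing the pointer after freeing makes a second dispose call a
no-op instead of a double free.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the irq_count/irq_map pair added by the patch. */
struct phb_model {
	unsigned int irq_count;
	unsigned int *irq_map;
};

/* Allocate a zeroed map. Returns 0 on success, -1 on allocation failure. */
static int phb_model_init(struct phb_model *phb, unsigned int count)
{
	assert(phb != NULL);
	phb->irq_map = calloc(count, sizeof(*phb->irq_map));
	if (!phb->irq_map) {
		phb->irq_count = 0;
		return -1;
	}
	phb->irq_count = count;
	return 0;
}

/*
 * Free the map and clear the pointer, so that a repeated call finds
 * nothing to free. Returns 1 if memory was released, 0 otherwise.
 */
static int phb_model_dispose(struct phb_model *phb)
{
	assert(phb != NULL);
	if (!phb->irq_map)
		return 0;

	free(phb->irq_map);
	phb->irq_map = NULL;	/* the assignment the patch is said to lack */
	phb->irq_count = 0;
	return 1;
}
```

Calling phb_model_dispose() twice on the same object releases the memory only
once. Without the NULL assignment, the second call would pass the
!phb->irq_map check and free the same pointer again.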

Any update on this? Maybe Michael or Stephen could drop this patch for now, so
that our fuzzing can continue to find other new issues?

It can still be reproduced on today's linux-next. BTW, this is running trinity
as an unprivileged user. Here is a snapshot of each fuzzing thread when this
happens:

http://people.redhat.com/qcai/pcibios_remove_bus/trinity-post-mortem.log

It can be reproduced by simply running this for a while:

$ trinity -C  --arch 64

[19611.946827][T1717146] pci_bus 0035:03: busn_res: [bus 03-07] is released
[19611.950956][T1717146] pci_bus 0035:08: busn_res: [bus 08-0c] is released
[19611.951260][T1717146] 
=
[19611.952336][T1717146] BUG kmalloc-16 (Tainted: GW  O ): Object 
already free
[19611.952365][T1717146] 
-
[19611.952365][T1717146] 
[19611.952411][T1717146] Disabling lock debugging due to kernel taint
[19611.952438][T1717146] INFO: Allocated in pcibios_scan_phb+0x104/0x3e0 
age=1960714 cpu=4 pid=1
[19611.952481][T1717146]__slab_alloc+0xa4/0xf0
[19611.952500][T1717146]__kmalloc+0x294/0x330
[19611.952519][T1717146]pcibios_scan_phb+0x104/0x3e0
[19611.952549][T1717146]pcibios_init+0x84/0x124
[19611.952578][T1717146]do_one_initcall+0xac/0x528
[19611.952599][T1717146]kernel_init_freeable+0x35c/0x3fc
[19611.952618][T1717146]kernel_init+0x24/0x148
[19611.952646][T1717146]ret_from_kernel_thread+0x5c/0x80
[19611.952665][T1717146] INFO: Freed in pcibios_remove_bus+0x70/0x90 age=0 
cpu=16 pid=1717146
[19611.952691][T1717146]kfree+0x49c/0x510
[19611.952700][T1717146]pcibios_remove_bus+0x70/0x90
[19611.952711][T1717146]pci_remove_bus+0xe4/0x110
[19611.952730][T1717146]pci_remove_bus_device+0x74/0x170
[19611.952749][T1717146]pci_remove_bus_device+0x4c/0x170
[19611.952768][T1717146]pci_stop_and_remove_bus_device_locked+0x34/0x50
[19611.952798][T1717146]remove_store+0xc0/0xe0
[19611.952819][T1717146]dev_attr_store+0x30/0x50
[19611.952852][T1717146]sysfs_kf_write+0x68/0xb0
[19611.952870][T1717146]kernfs_fop_write+0x114/0x260
[19611.952904][T1717146]vfs_write+0xe4/0x260
[19611.952922][T1717146]ksys_write+0x74/0x130
[19611.952951][T1717146]system_call_exception+0xf8/0x1d0
[19611.952970][T1717146]system_

Re: [PATCH v2 09/11] powerpc/smp: Optimize update_mask_by_l2

2020-10-07 Thread Qian Cai
On Wed, 2020-10-07 at 19:47 +0530, Srikar Dronamraju wrote:
> Can you confirm if CONFIG_CPUMASK_OFFSTACK is enabled in your config?

Yes, https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config

We test linux-next here almost daily.



Re: [PATCH v2 09/11] powerpc/smp: Optimize update_mask_by_l2

2020-10-07 Thread Qian Cai
On Mon, 2020-09-21 at 15:26 +0530, Srikar Dronamraju wrote:
> All threads of a SMT4 core can either be part of this CPU's l2-cache
> mask or not related to this CPU l2-cache mask. Use this relation to
> reduce the number of iterations needed to find all the CPUs that share
> the same l2-cache.
> 
> Use a temporary mask to iterate through the CPUs that may share l2_cache
> mask. Also instead of setting one CPU at a time into cpu_l2_cache_mask,
> copy the SMT4/sub mask at one shot.
> 
...
>  static bool update_mask_by_l2(int cpu)
>  {
> + struct cpumask *(*submask_fn)(int) = cpu_sibling_mask;
>   struct device_node *l2_cache, *np;
> + cpumask_var_t mask;
>   int i;
>  
>   l2_cache = cpu_to_l2cache(cpu);
> @@ -1240,22 +1264,37 @@ static bool update_mask_by_l2(int cpu)
>   return false;
>   }
>  
> - cpumask_set_cpu(cpu, cpu_l2_cache_mask(cpu));
> - for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
> + alloc_cpumask_var_node(&mask, GFP_KERNEL, cpu_to_node(cpu));

Shouldn't this be GFP_ATOMIC? Otherwise, during CPU hotplug, we get the
following splat (irqs were disabled in do_idle()):

[  335.420001][T0] BUG: sleeping function called from invalid context at 
mm/slab.h:494
[  335.420003][T0] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 
0, name: swapper/88
[  335.420005][T0] no locks held by swapper/88/0.
[  335.420007][T0] irq event stamp: 18074448
[  335.420015][T0] hardirqs last  enabled at (18074447): 
[] tick_nohz_idle_enter+0x9c/0x110
[  335.420019][T0] hardirqs last disabled at (18074448): 
[] do_idle+0x138/0x3b0
do_idle at kernel/sched/idle.c:253 (discriminator 1)
[  335.420023][T0] softirqs last  enabled at (18074440): 
[] irq_enter_rcu+0x94/0xa0
[  335.420026][T0] softirqs last disabled at (18074439): 
[] irq_enter_rcu+0x70/0xa0
[  335.420030][T0] CPU: 88 PID: 0 Comm: swapper/88 Tainted: GW  
   5.9.0-rc8-next-20201007 #1
[  335.420032][T0] Call Trace:
[  335.420037][T0] [c0002a4bfcf0] [c0649e98] 
dump_stack+0xec/0x144 (unreliable)
[  335.420043][T0] [c0002a4bfd30] [c00f6c34] 
___might_sleep+0x2f4/0x310
[  335.420048][T0] [c0002a4bfdb0] [c0354f94] 
slab_pre_alloc_hook.constprop.82+0x124/0x190
[  335.420051][T0] [c0002a4bfe00] [c035e9e8] 
__kmalloc_node+0x88/0x3a0
slab_alloc_node at mm/slub.c:2817
(inlined by) __kmalloc_node at mm/slub.c:4013
[  335.420054][T0] [c0002a4bfe80] [c06494d8] 
alloc_cpumask_var_node+0x38/0x80
kmalloc_node at include/linux/slab.h:577
(inlined by) alloc_cpumask_var_node at lib/cpumask.c:116
[  335.420060][T0] [c0002a4bfef0] [c003eedc] 
start_secondary+0x27c/0x800
update_mask_by_l2 at arch/powerpc/kernel/smp.c:1267
(inlined by) add_cpu_to_masks at arch/powerpc/kernel/smp.c:1387
(inlined by) start_secondary at arch/powerpc/kernel/smp.c:1420
[  335.420063][T0] [c0002a4bff90] [c000c468] 
start_secondary_resume+0x10/0x14
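
The splat above is the "might sleep" check firing: an allocation that is
allowed to block was requested while interrupts were disabled. A rough
userspace model of that check follows. Everything in it is hypothetical
(enum model_gfp, struct model_ctx and model_alloc() are not kernel APIs, and
the flag values are not the kernel's gfp_t bits). It only illustrates why a
non-blocking flag avoids the warning in this context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical allocation modes for this model only. */
enum model_gfp {
	MODEL_GFP_KERNEL,	/* the allocator may block */
	MODEL_GFP_ATOMIC,	/* the allocator must not block */
};

/* Models the calling context and counts invalid-context reports. */
struct model_ctx {
	bool irqs_disabled;
	unsigned int sleep_violations;
};

/*
 * Allocate memory, recording a violation when a blocking allocation is
 * requested from a context that cannot sleep. The allocation itself is
 * still attempted, as in the log above where the BUG is only a report.
 */
static void *model_alloc(struct model_ctx *ctx, size_t size,
			 enum model_gfp gfp)
{
	assert(ctx != NULL);
	if (gfp == MODEL_GFP_KERNEL && ctx->irqs_disabled)
		ctx->sleep_violations++;
	return malloc(size);
}
```

In this model, a MODEL_GFP_KERNEL request with irqs_disabled set increments
sleep_violations, while a MODEL_GFP_ATOMIC request from the same context does
not.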

> + cpumask_and(mask, cpu_online_mask, cpu_cpu_mask(cpu));
> +
> + if (has_big_cores)
> + submask_fn = cpu_smallcore_mask;
> +
> + /* Update l2-cache mask with all the CPUs that are part of submask */
> + or_cpumasks_related(cpu, cpu, submask_fn, cpu_l2_cache_mask);
> +
> + /* Skip all CPUs already part of current CPU l2-cache mask */
> + cpumask_andnot(mask, mask, cpu_l2_cache_mask(cpu));
> +
> + for_each_cpu(i, mask) {
>   /*
>* when updating the marks the current CPU has not been marked
>* online, but we need to update the cache masks
>*/
>   np = cpu_to_l2cache(i);
> - if (!np)
> - continue;
>  
> - if (np == l2_cache)
> - set_cpus_related(cpu, i, cpu_l2_cache_mask);
> + /* Skip all CPUs already part of current CPU l2-cache */
> + if (np == l2_cache) {
> + or_cpumasks_related(cpu, i, submask_fn, cpu_l2_cache_mask);
> + cpumask_andnot(mask, mask, submask_fn(i));
> + } else {
> + cpumask_andnot(mask, mask, cpu_l2_cache_mask(i));
> + }
>  
>   of_node_put(np);
>   }
>   of_node_put(l2_cache);
> + free_cpumask_var(mask);
>  
>   return true;
>  }



Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed

2020-09-22 Thread Qian Cai
On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
> When a passthrough IO adapter is removed from a pseries machine using
> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
> guest OS to clear all page table entries related to the adapter. If
> some are still present, the RTAS call which isolates the PCI slot
> returns error 9001 "valid outstanding translations" and the removal of
> the IO adapter fails. This is because when the PHBs are scanned, Linux
> maps automatically the INTx interrupts in the Linux interrupt number
> space but these are never removed.
> 
> To solve this problem, we introduce a PPC platform specific
> pcibios_remove_bus() routine which clears all interrupt mappings when
> the bus is removed. This also clears the associated page table entries
> of the ESB pages when using XIVE.
> 
> For this purpose, we record the logical interrupt numbers of the
> mapped interrupt under the PHB structure and let pcibios_remove_bus()
> do the clean up.
> 
> Since some PCI adapters, like GPUs, use the "interrupt-map" property
> to describe interrupt mappings other than the legacy INTx interrupts,
> we can not restrict the size of the mapping array to PCI_NUM_INTX. The
> number of interrupt mappings is computed from the "interrupt-map"
> property and the mapping array is allocated accordingly.
> 
> Cc: "Oliver O'Halloran" 
> Cc: Alexey Kardashevskiy 
> Signed-off-by: Cédric Le Goater 

Some syscall fuzzing triggers the following on POWER9 NV, and the traces point
to this patch.

.config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config

[ 3574.564109][  T965] ata1.00: disabled
[ 3574.580373][T151472] sd 0:0:0:0: [sdb] Synchronizing SCSI cache
[ 3574.581180][T151472] sd 0:0:0:0: [sdb] Synchronize Cache(10) failed: Result: 
hostbyte=0x04 driverbyte=0x00
[ 3574.581226][T151472] sd 0:0:0:0: [sdb] Stopping disk
[ 3574.581289][T151472] sd 0:0:0:0: [sdb] Start/Stop Unit failed: Result: 
hostbyte=0x04 driverbyte=0x00
[ 3574.611424][ T3019] Read-error on swap-device (254:1:849792)
[ 3574.611685][ T3019] Read-error on swap-device (254:1:914944)
[ 3574.611769][ T3019] Read-error on swap-device (254:1:915072)
[ 3574.611838][ T3019] Read-error on swap-device (254:1:915200)
[ 3574.611926][ T3019] Read-error on swap-device (254:1:915328)
[ 3574.612268][ T3084] Read-error on swap-device (254:1:792576)
[ 3574.612342][ T3084] Read-error on swap-device (254:1:792704)
[ 3574.612757][ T2362] Read-error on swap-device (254:1:957440)
[ 3574.612773][ T2905] Read-error on swap-device (254:1:784128)
[ 3574.613015][ T2362] Read-error on swap-device (254:1:957568)
[ 3574.613160][ T2905] Read-error on swap-device (254:1:784256)
[ 3574.613241][ T2362] Read-error on swap-device (254:1:957696)
[ 3574.613342][ T2362] Read-error on swap-device (254:1:957824)
[ 3574.614448][ T3019] Core dump to |/usr/lib/systemd/systemd-coredump pipe 
failed
[ 3574.614663][ T3019] Read-error on swap-device (254:1:961536)
[ 3574.675330][T151844] Read-error on swap-device (254:1:128)
[ 3574.675515][T151844] Read-error on swap-device (254:1:256)
[ 3574.675700][T151844] Read-error on swap-device (254:1:384)
[ 3574.703570][  T971] ata2.00: disabled
[ 3574.710393][T151472] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 3574.710864][T151472] sd 1:0:0:0: [sda] Synchronize Cache(10) failed: Result: 
hostbyte=0x04 driverbyte=0x00
[ 3574.710922][T151472] sd 1:0:0:0: [sda] Stopping disk
[ 3574.711010][T151472] sd 1:0:0:0: [sda] Start/Stop Unit failed: Result: 
hostbyte=0x04 driverbyte=0x00
[ 3574.826569][  T674] dm-0: writeback error on inode 68507862, offset 65536, 
sector 54281504
[ 3575.117547][ T3366] dm-0: writeback error on inode 68507851, offset 0, 
sector 54378880
[ 3575.140104][T151472] pci 0004:03:00.0: Removing from iommu group 3
[ 3575.141778][T151472] pci 0004:03 : [PE# fb] Releasing PE
[ 3575.141965][T151472] pci 0004:03 : [PE# fb] Removing DMA window #0
[ 3575.142452][T151472] pci 0004:03 : [PE# fb] Disabling 64-bit DMA bypass
[ 3575.149369][T151472] pci_bus 0004:03: busn_res: [bus 03] is released
[ 3575.150574][T152037] Read-error on swap-device (254:1:35584)
[ 3575.150713][T152037] Read-error on swap-device (254:1:35712)
[ 3575.152632][T152037] Read-error on swap-device (254:1:915584)
[ 3575.152706][T151472] pci_bus 0004:04: busn_res: [bus 04-08] is released
[ 3575.152983][T151472] 
=
[ 3575.153937][T151472] BUG kmalloc-16 (Not tainted): Object already free
[ 3575.153962][T151472] 
-
[ 3575.153962][T151472] 
[ 3575.154020][T151472] Disabling lock debugging due to kernel taint
[ 3575.154047][T151472] INFO: Allocated in pcibios_scan_phb+0x104/0x3e0 
age=356904 cpu=4 pid=1
[ 3575.154084][T151472] __slab_alloc+0xa4/0xf0
[ 3575.154105][T151472] __kmalloc+0x294/0x330
[ 3575.154127][T151472] pcibios_scan_phb+0x104/0x3e0
[ 

Re: [PATCH -next] fork: silence a false postive warning in __mmdrop

2020-09-08 Thread Qian Cai
On Wed, Jul 22, 2020 at 03:44:06PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 22, 2020 at 09:19:00AM -0400, Qian Cai wrote:
> > On Wed, Jul 22, 2020 at 12:06:37PM +0200, pet...@infradead.org wrote:
> > > On Thu, Jun 04, 2020 at 11:03:44AM -0400, Qian Cai wrote:
> > > > The linux-next commit bf2c59fce407 ("sched/core: Fix illegal RCU from
> > > > offline CPUs") delayed,
> > > > 
> > > > idle->active_mm = &init_mm;
> > > > 
> > > > into finish_cpu() instead of idle_task_exit() which results in a false
> > > > positive warning that was originally designed in the commit 3eda69c92d47
> > > > ("kernel/fork.c: detect early free of a live mm").
> > > > 
> > > >  WARNING: CPU: 127 PID: 72976 at kernel/fork.c:697
> > > >  __mmdrop+0x230/0x2c0
> > > >  do_exit+0x424/0xfa0
> > > >  Call Trace:
> > > >  do_exit+0x424/0xfa0
> > > >  do_group_exit+0x64/0xd0
> > > >  sys_exit_group+0x24/0x30
> > > >  system_call_exception+0x108/0x1d0
> > > >  system_call_common+0xf0/0x278
> > > 
> > > Please explain; because afaict this is a use-after-free.
> > > 
> > > The thing is __mmdrop() is going to actually free the mm, so then what
> > > is finish_cpu()'s mmdrop() going to do?
> > > 
> > > ->active_mm() should have a refcount on the mm.
> > 
> > Well, the refcount issue you mentioned already happened before
> > bf2c59fce407 was introduced as well, but then it looks harmless because
> > mmdrop() in finish_cpu() will do,
> > 
> > if (unlikely(atomic_dec_and_test(&mm->mm_count)))
> > __mmdrop(mm);
> 
> That's not harmless, that's a use-after-free. Those can cause memory
> corruption bugs and the like at best. Who knows what's at the location
> of mm->mm_count after we've already freed it.
> 
> > where that atomic_dec_and_test() sees the negative refcount and will not
> > invoke __mmdrop() again. It is not clear to me whether, once the CPU is
> > offline, it needs to care about its idle thread mm_count at all. Even if
> > this refcount issue is finally addressed, it could hit this warning in
> > finish_cpu() without this patch.
> > 
> > On the other hand, if you look at the commit 3eda69c92d47, it is clear
> > that the assumption of,
> > 
> >    WARN_ON_ONCE(mm == current->active_mm);
> > 
> > is totally gone due to bf2c59fce407. Thus, the patch is to fix that
> > discrepancy first and then I'll look at the imbalanced mmdrop()/mmgrab()
> > elsewhere.
> 
> No, you're talking nonsense. We must not free @mm when
> 'current->active_mm == mm', never.
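
The refcount argument above can be illustrated with a minimal userspace sketch. This is not kernel code: struct fake_mm, the free_calls field, and the fake_* helpers are hypothetical names, and C11 atomics stand in for the kernel's atomic_t. It only shows why an unbalanced drop does not run the free path a second time, while still writing to an object that has already been released.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct mm_struct; only the refcount matters. */
struct fake_mm {
	atomic_int mm_count;
	int free_calls;		/* how many times the free path ran */
};

/* Models __mmdrop(): here it only records that the object was "freed". */
static void fake_mmdrop_slow(struct fake_mm *mm)
{
	mm->free_calls++;
}

/* Models mmgrab(): take one reference. */
static void fake_mmgrab(struct fake_mm *mm)
{
	assert(mm);
	atomic_fetch_add(&mm->mm_count, 1);
}

/*
 * Models mmdrop(): run the free path only when the count reaches
 * exactly zero. Returns the counter value after the decrement.
 */
static int fake_mmdrop(struct fake_mm *mm)
{
	int now;

	assert(mm);
	now = atomic_fetch_sub(&mm->mm_count, 1) - 1;
	if (now == 0)
		fake_mmdrop_slow(mm);
	return now;
}
```

Calling fake_mmdrop() once more than fake_mmgrab() leaves the counter at -1 with free_calls still at 1. There is no double free, but in the kernel that last decrement would land in memory that may already belong to something else.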

Yes, you are right. It still triggers the report below on powerpc with today's
linux-next after fuzzing for a while (I saw it a few times on recent linux-next
before as well, but so far it is mostly reproducible on powerpc). Any idea?

[12802.547809][T191552] BUG mm_struct (Tainted: G   O ): Poison overwritten
[12802.547824][T191552] 
-
[12802.547824][T191552] 
[12802.547843][T191552] Disabling lock debugging due to kernel taint
[12802.547867][T191552] INFO: 0x0e2a54ec-0x0e2a54ec @offset=96464. First byte 0x6a instead of 0x6b
[12802.547889][T191552] INFO: Allocated in dup_mm+0x48/0x6d0 age=955 cpu=108 pid=191552
[12802.547915][T191552] __slab_alloc+0xa4/0xf0
[12802.547937][T191552] kmem_cache_alloc+0x314/0x4a0
[12802.547959][T191552] dup_mm+0x48/0x6d0
dup_mm at kernel/fork.c:1344
[12802.547978][T191552] copy_process+0x11bc/0x19a0
[12802.548010][T191552] kernel_clone+0x120/0xb80
[12802.548031][T191552] __do_sys_clone+0x88/0xd0
[12802.548055][T191552] system_call_exception+0xf8/0x1d0
[12802.548083][T191552] system_call_common+0xe8/0x218
[12802.548093][T191552] INFO: Freed in __mmdrop+0x144/0x250 age=942 cpu=69 pid=882503
[12802.548140][T191552] kmem_cache_free+0x47c/0x500
[12802.548161][T191552] __mmdrop+0x144/0x250
__mmdrop at kernel/fork.c:685
[12802.548170][T191552] do_exit+0x3f4/0xed0
[12802.548212][T191552] do_group_exit+0x5c/0xd0
[12802.548244][T191552] sys_exit_group+0x1c/0x20
[12802.548277][T191552] system_call_exception+0xf8/0x1d0
[12802.548309][T191552] system_call_common+0xe8/0x218
[12802.548342][T191552] INFO: Slab 0x48df84af objects=64 used=64 fp=0x flags=0x87fff810200
[12802.548379][T191552] INFO: Object 0x583c5ba3 @offset=96384 fp=0x681f5d04
[12802.548379][T191552] 
[12802.548419][T1
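
The "First byte 0x6a instead of 0x6b" line can be read with a small userspace sketch. Slab debugging fills freed objects with the poison byte 0x6b (POISON_FREE) and later checks that the pattern is intact. The helper names below are hypothetical, and the idea that a stale 32-bit counter decrement produced the 0x6a is only one plausible reading of the report, not something the log proves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POISON_FREE 0x6b	/* byte written into freed slab objects */

/* Fill a "freed" object with the poison pattern. */
static void poison_object(unsigned char *obj, size_t len)
{
	assert(obj);
	memset(obj, POISON_FREE, len);
}

/*
 * Return the offset of the first byte that no longer holds the poison
 * value, or -1 if the whole object is still intact.
 */
static long first_corrupt_byte(const unsigned char *obj, size_t len)
{
	assert(obj);
	for (size_t i = 0; i < len; i++)
		if (obj[i] != POISON_FREE)
			return (long)i;
	return -1;
}

/* Simulate a late 32-bit counter decrement landing at offset off. */
static void stale_decrement(unsigned char *obj, size_t off)
{
	uint32_t v;

	assert(obj);
	memcpy(&v, obj + off, sizeof(v));
	v--;				/* 0x6b6b6b6b becomes 0x6b6b6b6a */
	memcpy(obj + off, &v, sizeof(v));
}
```

After poison_object() and one stale_decrement(), first_corrupt_byte() finds a single 0x6a byte inside the four bytes of the counter. On a little-endian machine it is the first of the four.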

Re: [PATCH v3 0/3] Off-load TLB invalidations to host for !GTSE

2020-07-16 Thread Qian Cai
On Fri, Jul 03, 2020 at 11:06:05AM +0530, Bharata B Rao wrote:
> Hypervisor may choose not to enable Guest Translation Shootdown Enable
> (GTSE) option for the guest. When GTSE isn't ON, the guest OS isn't
> permitted to use instructions like tlbie and tlbsync directly, but is
> expected to make hypervisor calls to get the TLB flushed.
> 
> This series enables the TLB flush routines in the radix code to
> off-load TLB flushing to hypervisor via the newly proposed hcall
> H_RPT_INVALIDATE. 
> 
> To easily check the availability of GTSE, it is made an MMU feature.
> The OV5 handling and H_REGISTER_PROC_TBL hcall are changed to
> handle GTSE as an optionally available feature and to not assume GTSE
> when radix support is available.
> 
> The actual hcall implementation for KVM isn't included in this
> patchset and will be posted separately.
> 
> Changes in v3
> =
> - Fixed a bug in the hcall wrapper code where we were missing setting
>   H_RPTI_TYPE_NESTED while retrying the failed flush request with
>   a full flush for the nested case.
> - s/psize_to_h_rpti/psize_to_rpti_pgsize
> 
> v2: 
> https://lore.kernel.org/linuxppc-dev/20200626131000.5207-1-bhar...@linux.ibm.com/T/#t
> 
> Bharata B Rao (2):
>   powerpc/mm: Enable radix GTSE only if supported.
>   powerpc/pseries: H_REGISTER_PROC_TBL should ask for GTSE only if
> enabled
> 
> Nicholas Piggin (1):
>   powerpc/mm/book3s64/radix: Off-load TLB invalidations to host when
> !GTSE
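
The choice described in the cover letter can be sketched as a tiny dispatch function. This is a userspace illustration only: struct mmu_features, enum flush_path, and pick_flush_path() are hypothetical names and do not come from the series, which implements the decision inside the radix TLB flush routines.

```c
#include <assert.h>
#include <stdbool.h>

/* Which mechanism ends up performing the flush (illustrative only). */
enum flush_path {
	FLUSH_DIRECT,	/* guest issues tlbie/tlbsync itself */
	FLUSH_HCALL,	/* guest asks the hypervisor (H_RPT_INVALIDATE) */
};

/* Hypothetical feature record; the series exposes GTSE as an MMU feature. */
struct mmu_features {
	bool radix;
	bool gtse;
};

/* Pick the flush path for a radix guest based on GTSE availability. */
static enum flush_path pick_flush_path(const struct mmu_features *f)
{
	assert(f && f->radix);
	return f->gtse ? FLUSH_DIRECT : FLUSH_HCALL;
}
```

The point of making GTSE a feature bit is that radix support alone no longer implies the direct path.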

Reverting the whole series fixed random memory corruptions during boot on
POWER9 PowerNV systems below.

IBM 8335-GTH (ibm,witherspoon)
POWER9, altivec supported
262144 MB memory, 2000 GB disk space

.config:
https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config

[9.338996][  T925] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[9.339026][  T925] Faulting instruction address: 0x
[9.339051][  T925] Oops: Kernel access of bad area, sig: 11 [#1]
[9.339064][  T925] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 NUMA PowerNV
[9.339098][  T925] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod
[9.339150][  T925] CPU: 92 PID: 925 Comm: (md-udevd) Not tainted 5.8.0-rc5-next-20200716 #3
[9.339186][  T925] NIP:   LR: c021f2cc CTR: 

[9.339210][  T925] REGS: c000201cb52d79b0 TRAP: 0400   Not tainted  
(5.8.0-rc5-next-20200716)
[9.339244][  T925] MSR:  900040009033   CR: 
2492  XER: 
[9.339278][  T925] CFAR: c021f2c8 IRQMASK: 0 
[9.339278][  T925] GPR00: c021f2cc c000201cb52d7c40 
c5901000 c000201cb52d7ca8 
[9.339278][  T925] GPR04: c0080ea60038  
7fff 7fff 
[9.339278][  T925] GPR08:   
c000201cb50bd500 0003 
[9.339278][  T925] GPR12:  c000201fff694500 
7fffa4a8a940 7fffa4a8a6c8 
[9.339278][  T925] GPR16: 7fffa4a8a8f8 7fffa4a8a650 
7fffa4a8a488  
[9.339278][  T925] GPR20: 00050001 7fffa4a8a984 
7fff ca4545cc 
[9.339278][  T925] GPR24: c0affe28  
 0166 
[9.339278][  T925] GPR28: c000201cb52d7ca8 c0080ea6 
c000201cc3b72600 7fff 
[9.339493][  T925] NIP [] 0x0
[9.339516][  T925] LR [c021f2cc] __seccomp_filter+0xec/0x530
bpf_dispatcher_nop_func at include/linux/bpf.h:567
(inlined by) bpf_prog_run_pin_on_cpu at include/linux/filter.h:597
(inlined by) seccomp_run_filters at kernel/seccomp.c:324
(inlined by) __seccomp_filter at kernel/seccomp.c:937
[9.339538][  T925] Call Trace:
[9.339548][  T925] [c000201cb52d7c40] [c021f2cc] __seccomp_filter+0xec/0x530 (unreliable)
[9.339566][  T925] [c000201cb52d7d50] [c0025af8] do_syscall_trace_enter+0xb8/0x470
do_seccomp at arch/powerpc/kernel/ptrace/ptrace.c:252
(inlined by) do_syscall_trace_enter at arch/powerpc/kernel/ptrace/ptrace.c:327
[9.339600][  T925] [c000201cb52d7dc0] [c002c8f8] system_call_exception+0x138/0x180
[9.339625][  T925] [c000201cb52d7e20] [c000c9e8] system_call_common+0xe8/0x214
[9.339648][  T925] Instruction dump:
[9.339667][  T925]       
  
[9.339706][  T925]       
  
[9.339748][  T925] ---[ end trace d89eb80f9a6bc141 ]---
[  OK  ] Started Journal Service.
[   10.452364][  T925] Kernel panic - not syncing: Fatal exception
[   11.876655][  T925] ---[ end Kernel panic - not syncing: Fatal exception ]---

There can also be many random userspace segfaults, such as,

[   16.463545][  T771] rngd[771]: segfault (11) at 0 nip 0 lr 0 code 1 in rngd[106d6+2]
[   16.463620][  T771] rngd[771]: code:     
   

Re: [PATCH 18/20] block: refator submit_bio_noacct

2020-07-02 Thread Qian Cai
On Mon, Jun 29, 2020 at 09:39:45PM +0200, Christoph Hellwig wrote:
> Split out a __submit_bio_noacct helper for the actual de-recursion
> algorithm, and simplify the loop by using a continue when we can't
> enter the queue for a bio.
> 
> Signed-off-by: Christoph Hellwig 
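
For readers unfamiliar with the de-recursion idea mentioned in the patch description, here is a simplified userspace sketch. It is not the kernel's algorithm (which also orders bios by queue level); struct toy_bio, current_list, and the toy_* helpers are hypothetical names. It shows only the core trick: a nested submission is appended to a list owned by the outermost caller instead of recursing.

```c
#include <assert.h>
#include <stddef.h>

/* Toy request: handling it may submit one child request. */
struct toy_bio {
	struct toy_bio *next;		/* link in the pending list */
	struct toy_bio *child;		/* submitted from within the handler */
	int handled_at_depth;		/* toy_submit() nesting seen by handler */
};

struct toy_bio_list {
	struct toy_bio *head, *tail;
};

static struct toy_bio_list *current_list;	/* models current->bio_list */
static int depth;				/* nesting of toy_submit() */

static void toy_list_add_tail(struct toy_bio_list *l, struct toy_bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

static struct toy_bio *toy_list_pop(struct toy_bio_list *l)
{
	struct toy_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
	}
	return b;
}

static void toy_submit(struct toy_bio *bio);

/* The "driver": handling a bio may submit another one. */
static void toy_handle(struct toy_bio *bio)
{
	bio->handled_at_depth = depth;
	if (bio->child)
		toy_submit(bio->child);
}

static void toy_submit(struct toy_bio *bio)
{
	struct toy_bio_list on_stack = { NULL, NULL };

	assert(bio);
	depth++;
	if (current_list) {
		/* Already inside a submission: queue instead of recursing. */
		toy_list_add_tail(current_list, bio);
		depth--;
		return;
	}

	current_list = &on_stack;
	do {
		toy_handle(bio);
	} while ((bio = toy_list_pop(&on_stack)) != NULL);
	current_list = NULL;	/* must not outlive the on-stack list */
	depth--;
}
```

With a chain of three bios where each one submits the next, every handler runs at depth 1. Because the pending list lives in the outermost caller's stack frame, the pointer to it has to be cleared before that frame goes away.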

Reverting this commit and its dependencies,

5a6c35f9af41 block: remove direct_make_request
ff93ea0ce763 block: shortcut __submit_bio_noacct for blk-mq drivers

fixed the stack-out-of-bounds during boot,

https://lore.kernel.org/linux-block/bcdeaa05a9728...@google.com/

[   55.573431][ T1373] BUG: KASAN: stack-out-of-bounds in bio_alloc_bioset+0x493/0x4a0
bio_alloc_bioset+0x493/0x4a0:
bio_list_empty at include/linux/bio.h:561
(inlined by) bio_alloc_bioset at block/bio.c:482
[   55.581140][ T1373] Read of size 8 at addr c9000a7df1e0 by task mount/1373
[   55.588409][ T1373]
[   55.590615][ T1373] CPU: 2 PID: 1373 Comm: mount Not tainted 5.8.0-rc3-next-20200702 #2
[   55.598672][ T1373] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   55.607972][ T1373] Call Trace:
[   55.607980][ T1373]  dump_stack+0x9d/0xe0
[   55.607984][ T1373]  ? bio_alloc_bioset+0x493/0x4a0
[   55.607992][ T1373]  ? bio_alloc_bioset+0x493/0x4a0
[   55.625007][ T1373]  print_address_description.constprop.8.cold.10+0x56/0x44e
[   55.632191][ T1373]  ? bio_alloc_bioset+0x493/0x4a0
[   55.637100][ T1373]  ? bio_alloc_bioset+0x493/0x4a0
[   55.642011][ T1373]  kasan_report.cold.11+0x37/0x7c
[   55.646923][ T1373]  ? bio_alloc_bioset+0x493/0x4a0
[   55.651968][ T1373]  bio_alloc_bioset+0x493/0x4a0
[   55.651971][ T1373]  ? bvec_alloc+0x290/0x290
[   55.651975][ T1373]  ? mark_lock+0x147/0x1800
[   55.651978][ T1373]  ? mark_lock+0x147/0x1800
[   55.651981][ T1373]  bio_clone_fast+0xe/0x30
[   55.651983][ T1373]  bio_split+0x8a/0x4c0
[   55.651986][ T1373]  ? print_irqtrace_events+0x270/0x270
[   55.651990][ T1373]  __blk_queue_split+0xc42/0x13e0
[   55.651998][ T1373]  ? __lock_acquire+0xc57/0x4da0
[   55.693322][ T1373]  ? __blk_rq_map_sg+0x14c0/0x14c0
[   55.699711][ T1373]  ? lockdep_hardirqs_on_prepare+0x550/0x550
[   55.705602][ T1373]  ? mark_held_locks+0xb0/0x110
[   55.705605][ T1373]  ? lockdep_hardirqs_on_prepare+0x550/0x550
[   55.705608][ T1373]  ? lockdep_hardirqs_on_prepare+0x550/0x550
[   55.705611][ T1373]  ? find_held_lock+0x33/0x1c0
[   55.705614][ T1373]  ? find_held_lock+0x33/0x1c0
[   55.705618][ T1373]  blk_mq_submit_bio+0x19e/0x1e20
[   55.705621][ T1373]  ? lock_downgrade+0x720/0x720
[   55.705624][ T1373]  ? blk_mq_try_issue_directly+0x140/0x140
[   55.705628][ T1373]  ? rcu_read_lock_sched_held+0xaa/0xd0
[   55.705631][ T1373]  ? rcu_read_lock_bh_held+0xc0/0xc0
[   55.705635][ T1373]  ? blk_queue_enter+0x83c/0x9a0
[   55.705647][ T1373]  ? submit_bio_checks+0x1cc0/0x1cc0
[   55.767384][ T1373]  submit_bio_noacct+0x9c0/0xeb0
[   55.772212][ T1373]  ? blk_queue_enter+0x9a0/0x9a0
[   55.777038][ T1373]  ? lockdep_hardirqs_on_prepare+0x550/0x550
[   55.782913][ T1373]  ? trace_hardirqs_on+0x20/0x1b5
[   55.787825][ T1373]  ? submit_bio+0xe7/0x480
[   55.792125][ T1373]  submit_bio+0xe7/0x480
[   55.796252][ T1373]  ? bio_associate_blkg_from_css+0x4a3/0xd30
[   55.802124][ T1373]  ? submit_bio_noacct+0xeb0/0xeb0
[   55.807124][ T1373]  ? lock_downgrade+0x720/0x720
[   55.811862][ T1373]  ? rcu_read_unlock+0x50/0x50
[   55.816512][ T1373]  ? lockdep_init_map_waits+0x267/0x7b0
[   55.821948][ T1373]  ? lockdep_init_map_waits+0x267/0x7b0
[   55.827386][ T1373]  ? __raw_spin_lock_init+0x34/0x100
[   55.833957][ T1373]  submit_bio_wait+0xf9/0x200
[   55.838521][ T1373]  ? submit_bio_wait_endio+0x30/0x30
[   55.845091][ T1373]  xfs_rw_bdev+0x3ca/0x4d0
[   55.849396][ T1373]  xlog_do_io+0x149/0x320
[   55.853611][ T1373]  xlog_bread+0x1e/0xb0
[   55.857651][ T1373]  xlog_find_verify_log_record+0xba/0x4c0
[   55.863264][ T1373]  ? xlog_header_check_mount+0xb0/0xb0
[   55.868615][ T1373]  xlog_find_zeroed+0x2bc/0x4c0
[   55.873356][ T1373]  ? print_irqtrace_events+0x270/0x270
[   55.880093][ T1373]  ? xlog_find_verify_log_record+0x4c0/0x4c0
[   55.885966][ T1373]  ? __lock_acquire+0x1920/0x4da0
[   55.890881][ T1373]  xlog_find_head+0xd4/0x790
[   55.895355][ T1373]  ? xlog_find_zeroed+0x4c0/0x4c0
[   55.900269][ T1373]  ? rcu_read_lock_sched_held+0xaa/0xd0
[   55.905708][ T1373]  ? rcu_read_lock_bh_held+0xc0/0xc0
[   55.910885][ T1373]  ? sugov_update_single+0x18d/0x4f0
[   55.916058][ T1373]  xlog_find_tail+0xc2/0x810
[   55.920534][ T1373]  ? mark_lock+0x147/0x1800
[   55.924921][ T1373]  ? xlog_verify_head+0x4c0/0x4c0
[   55.929834][ T1373]  ? debug_show_held_locks+0x30/0x50
[   55.935007][ T1373]  ? print_irqtrace_events+0x270/0x270
[   55.940358][ T1373]  ? try_to_wake_up+0x6d1/0xf40
[   55.945094][ T1373]  ? mark_held_locks+0xb0/0x110
[   55.949835][ T1373]  ? lockdep_hardirqs_on_prepare+0x38c/0x550
[   55.955708][ T1373]  ? 

Re: [PATCH v4 08/14] powerpc: add support for folded p4d page tables

2020-06-04 Thread Qian Cai



> On Jun 3, 2020, at 3:05 PM, Andrew Morton  wrote:
> 
> A bunch of new material just landed in linux-next/powerpc.
> 
> The timing is awkward!  I trust this will be going into mainline during
> this merge window?  If not, please drop it and repull after -rc1.

I have noticed the same pattern over and over again, i.e., a lot of new powerpc
material shows up in linux-next for only a few days before a pull request is
sent to Linus.

There is absolutely no safety net for this kind of practice. The main problem is
that Linus seems totally fine with it.

Re: [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu()

2020-06-02 Thread Qian Cai
On Tue, Jun 02, 2020 at 05:05:11PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 02, 2020 at 10:42:35AM -0400, Qian Cai wrote:
> 
> > Reverting this commit fixed the POWER9 boot warning,
> 
> ARGH, I'm an idiot. Please try this instead:
>
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index a3eb6eba8c41..c4201b7f42b1 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -438,7 +438,7 @@ void irq_exit_rcu(void)
>   */
>  void irq_exit(void)
>  {
> - irq_exit_rcu();
> + __irq_exit_rcu();
>   rcu_irq_exit();
>/* must be last! */
>   lockdep_hardirq_exit();

This works fine.
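
A note on why the one-line change matters, as a hedged userspace model. With the original patch, irq_exit_rcu() gained a lockdep_hardirq_exit() call, and irq_exit() both called irq_exit_rcu() and kept its own lockdep_hardirq_exit(), so the lockdep "exit" ran twice per interrupt. The model_* names and the plain integer counter are assumptions for illustration; the real functions also do RCU and softirq work that is left out here.

```c
#include <assert.h>
#include <stddef.h>

static int hardirq_context;	/* models current->hardirq_context */

static void model_lockdep_hardirq_enter(void) { hardirq_context++; }
static void model_lockdep_hardirq_exit(void)  { hardirq_context--; }

/* Models irq_enter_rcu(): includes the lockdep "enter" bookkeeping. */
static void model_irq_enter_rcu(void)
{
	model_lockdep_hardirq_enter();
}

/* Models __irq_exit_rcu(): common exit work, no lockdep bookkeeping. */
static void model_irq_exit_common(void)
{
}

/* Models the new irq_exit_rcu(): common work plus lockdep "exit". */
static void model_irq_exit_rcu(void)
{
	model_irq_exit_common();
	model_lockdep_hardirq_exit();
}

/* irq_exit() as first posted: lockdep "exit" runs twice. */
static void model_irq_exit_as_posted(void)
{
	model_irq_exit_rcu();
	model_lockdep_hardirq_exit();
}

/* irq_exit() with the follow-up diff: lockdep "exit" runs once. */
static void model_irq_exit_fixed(void)
{
	model_irq_exit_common();
	model_lockdep_hardirq_exit();
}

/* Run one enter/exit pair and return the leftover context count. */
static int leftover_after_interrupt(void (*exit_fn)(void))
{
	assert(exit_fn != NULL);
	hardirq_context = 0;
	model_irq_enter_rcu();
	exit_fn();
	return hardirq_context;
}
```

In this model the as-posted variant leaves the counter at -1 after a single interrupt, while the fixed variant returns it to 0. A non-zero leftover is the kind of state that DEBUG_LOCKS_WARN_ON(current->hardirq_context) complains about.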


Re: [PATCH 11/14] x86/entry: Clarify irq_{enter,exit}_rcu()

2020-06-02 Thread Qian Cai
On Fri, May 29, 2020 at 11:27:39PM +0200, Peter Zijlstra wrote:
> Because:
> 
>   irq_enter_rcu() includes lockdep_hardirq_enter()
>   irq_exit_rcu() does *NOT* include lockdep_hardirq_exit()
> 
> Which resulted in two 'stray' lockdep_hardirq_exit() calls in
> idtentry.h, and me spending a long time trying to find the matching
> enter calls.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  arch/x86/include/asm/idtentry.h |2 --
>  kernel/softirq.c|   19 +--
>  2 files changed, 13 insertions(+), 8 deletions(-)
> 
[]
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -404,12 +404,7 @@ static inline void tick_irq_exit(void)
>  #endif
>  }
>  
> -/**
> - * irq_exit_rcu() - Exit an interrupt context without updating RCU
> - *
> - * Also processes softirqs if needed and possible.
> - */
> -void irq_exit_rcu(void)
> +static inline void __irq_exit_rcu(void)
>  {
>  #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
>   local_irq_disable();
> @@ -425,6 +420,18 @@ void irq_exit_rcu(void)
>  }
>  
>  /**
> + * irq_exit_rcu() - Exit an interrupt context without updating RCU
> + *
> + * Also processes softirqs if needed and possible.
> + */
> +void irq_exit_rcu(void)
> +{
> + __irq_exit_rcu();
> +  /* must be last! */
> + lockdep_hardirq_exit();
> +}
> +
> +/**
>   * irq_exit - Exit an interrupt context, update RCU and lockdep
>   *
>   * Also processes softirqs if needed and possible.
> 
>

Reverting this commit fixed the POWER9 boot warning,

[0.005196][T0] clocksource: timebase: mask: 0x 
max_cycles: 0x761537d007, max_idle_ns: 440795202126 ns
[0.012502][T0] clocksource: timebase mult[1f4] shift[24] registered
[0.030273][T0] [ cut here ]
[0.034421][T0] DEBUG_LOCKS_WARN_ON(current->hardirq_context)
[0.034433][T0] WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3680 
lockdep_hardirqs_on_prepare+0x29c/0x2d0
[0.045874][T0] Modules linked in:
[0.047977][T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
5.7.0-next-20200602 #1
[0.053187][T0] NIP:  c01d2fec LR: c01d2fe8 CTR: 
c074b0a0
[0.057395][T0] REGS: c130f810 TRAP: 0700   Not tainted  
(5.7.0-next-20200602)
[0.062614][T0] MSR:  90021033   CR: 
48000422  XER: 2004
[0.069856][T0] CFAR: c010e448 IRQMASK: 1
[0.069856][T0] GPR00: c01d2fe8 c130faa0 
c130aa00 002d
[0.069856][T0] GPR04: c133c3b0 000d 
6e6f635f 72727563284e4f5f
[0.069856][T0] GPR08: 0002 c0dcf230 
0001 c12b0280
[0.069856][T0] GPR12:  c57b 
 
[0.069856][T0] GPR16:   
 
[0.069856][T0] GPR20:  0001 
10004d9c 100053ed
[0.069856][T0] GPR24: 10005411 0001 
0002 0003
[0.069856][T0] GPR28:   
 c3e3b008
[0.117846][T0] NIP [c01d2fec] 
lockdep_hardirqs_on_prepare+0x29c/0x2d0
[0.123052][T0] LR [c01d2fe8] 
lockdep_hardirqs_on_prepare+0x298/0x2d0
[0.127248][T0] Call Trace:
[0.129337][T0] [c130faa0] [c01d2fe8] 
lockdep_hardirqs_on_prepare+0x298/0x2d0 (unreliable)
[0.137613][T0] [c130fb10] [c02d3834] 
trace_hardirqs_on+0x94/0x230
trace_hardirqs_on at kernel/trace/trace_preemptirq.c:49
[0.141824][T0] [c130fb60] [c0039100] 
interrupt_exit_kernel_prepare+0x110/0x1f0
interrupt_exit_kernel_prepare at arch/powerpc/kernel/syscall_64.c:337
[0.148069][T0] [c130fbc0] [c000f328] 
interrupt_return+0x118/0x1c0
[0.152281][T0] --- interrupt: 900 at arch_local_irq_restore+0xc0/0xd0
arch_local_irq_restore at arch/powerpc/kernel/irq.c:367
(inlined by) arch_local_irq_restore at arch/powerpc/kernel/irq.c:318
[0.152281][T0] LR = start_kernel+0x7f0/0x9dc
[0.153579][T0] [c130fec0] [c1208fa8] 
init_on_free+0x0/0x2b0 (unreliable)
[0.159810][T0] [c130fee0] [c0c845c8] 
start_kernel+0x7e4/0x9dc
start_kernel at init/main.c:961 (discriminator 3)
[0.165017][T0] [c130ff90] [c000c890] 
start_here_common+0x1c/0x8c
[0.169224][T0] Instruction dump:
[0.171324][T0] 0fe0 e8010080 ebc10060 ebe10068 7c0803a6 4bfffe7c 
3c82ff8b 3c62ff8a
[0.177558][T0] 38848808 3863e460 4bf3b3fd 6000 <0fe0> e8010080 
ebc10060 ebe10068
[0.183796][T0] irq event stamp: 16
[0.186904][T0] hardirqs last  enabled at (14): [] 
rcu_core+0x9a4/0xbe0
[0.191130][T0] hardirqs last disabled at (15): [] 
__do_softirq+0x5d4/0x8d8
[0.195365][T0] softirqs last  

Re: [PATCH] powerpc/kvm/book3s64/vio: fix some RCU-list locks

2020-05-26 Thread Qian Cai
On Wed, May 27, 2020 at 11:13:23AM +1000, Paul Mackerras wrote:
> On Sun, May 10, 2020 at 01:18:34AM -0400, Qian Cai wrote:
> > It is unsafe to traverse kvm->arch.spapr_tce_tables and
> > stt->iommu_tables without the RCU read lock held. Also, add
> > cond_resched_rcu() in places with the RCU read lock held that could take
> > a while to finish.
> 
> This mostly looks fine.  The cond_resched_rcu() in kvmppc_tce_validate
> doesn't seem necessary (the list would rarely have more than a few
> dozen entries) and could be a performance problem given that TCE
> validation is a hot-path.
> 
> Are you OK with me modifying the patch to take out that
> cond_resched_rcu(), or is there some reason why it's essential that it
> be there?

Feel free to take out that cond_resched_rcu(). Your reasoning makes
sense.


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-21 Thread Qian Cai
On Thu, May 21, 2020 at 11:39:38AM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> > On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > > Just a head up. Repeatedly compiling kernels for a while would trigger
> > > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > > .config are in,
> > > 
> > > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > > not seen anything like that myself. Let me go have a look.
> > > 
> > > 
> > > In as far as the logs are readable (they're a wrapped mess, please don't
> > > do that!), they contain very little useful, as is typical with IPIs :/
> > > 
> > > > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > > flush_smp_call_function_queue+0x1fa/0x2e0
> > 
> > So I've tried to think of a race that could produce that and here is
> > the only thing I could come up with. It's a bit complicated unfortunately:
> 
> This:
> 
> > smp_call_function_single_async() {          smp_call_function_single_async() {
> >     // verified csd->flags != CSD_LOCK          // verified csd->flags != CSD_LOCK
> >     csd->flags = CSD_LOCK                       csd->flags = CSD_LOCK
> 
> concurrent smp_call_function_single_async() using the same csd is what
> I'm looking at as well. Now in the ILB case there is an easy cure:
> 
> (because there is only a single ilb target)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
>* is idle. And the softirq performing nohz idle load balance
>* will be run before returning from the IPI.
>*/
> - smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> + smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
>  }
>  
>  /*
> 
> Qian, can you give that a spin?

Running for a few hours now. It works fine.
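
The idea behind the diff can be modelled in userspace, assuming it switches from the csd embedded in the target CPU's runqueue to the one in the sending CPU's runqueue. All toy_* names are hypothetical. The model is sequential, so it shows two senders colliding on one shared csd rather than the actual concurrent race in which both pass the "not locked" check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Toy call-single-data: "locked" means queued and not yet run. */
struct toy_csd {
	bool locked;
};

struct toy_rq {
	struct toy_csd nohz_csd;
};

static struct toy_rq toy_runqueues[TOY_NR_CPUS];

/* Returns false if the csd was already in flight. */
static bool toy_send_async(struct toy_csd *csd)
{
	assert(csd);
	if (csd->locked)
		return false;
	csd->locked = true;
	return true;
}

/* Before: every sender uses the csd that belongs to the target CPU. */
static bool kick_using_target_csd(int sender, int target)
{
	(void)sender;
	return toy_send_async(&toy_runqueues[target].nohz_csd);
}

/* After: each sender uses the csd that belongs to its own CPU. */
static bool kick_using_sender_csd(int sender, int target)
{
	(void)target;
	return toy_send_async(&toy_runqueues[sender].nohz_csd);
}

/* Mark every csd as idle again. */
static void toy_reset(void)
{
	for (size_t i = 0; i < TOY_NR_CPUS; i++)
		toy_runqueues[i].nohz_csd.locked = false;
}
```

When CPUs 0 and 1 both kick CPU 2 before it has run the callback, the target-owned csd is claimed twice, while the sender-owned csds never overlap.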


Re: Endless soft-lockups for compiling workload since next-20200519

2020-05-20 Thread Qian Cai
On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a head up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
> 
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.

Yes, I ended up figuring out the same commit a bit earlier. Since then I
reverted that commit and its dependency,

2a0a24ebb499 ("sched: Make scheduler_ipi inline")

Everything works fine so far.

> 
> 
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/

Sorry about that. I forgot that the Gmail web UI wraps lines. I will
switch to mutt.

> 
> > [ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0
> > [ 1168.00][C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
> > fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
> > acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
> > scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
> > dm_region_hash dm_log dm_mod
> > [ 1168.029492][C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> > 5.7.0-rc6-next-20200519 #1
> > [ 1168.037665][C1] Hardware name: HPE ProLiant DL385
> > Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> > [ 1168.046978][C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
> > [ 1168.053658][C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
> > ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
> > e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
> > 10 52 3f 71
> > [ 1168.073262][C1] RSP: 0018:c9178918 EFLAGS: 00010046
> > [ 1168.079253][C1] RAX:  RBX: 430c58f8
> > RCX: 8ec26083
> > [ 1168.087156][C1] RDX: 0003 RSI: dc00
> > RDI: 430c58f8
> > [ 1168.095054][C1] RBP: c91789a8 R08: ed1108618cec
> > R09: ed1108618cec
> > [ 1168.102964][C1] R10: 430c675b R11: 
> > R12: 430c58e0
> > [ 1168.110866][C1] R13: 8eb30c40 R14: 430c5880
> > R15: 430c58e0
> > [ 1168.118767][C1] FS:  ()
> > GS:4308() knlGS:
> > [ 1168.127628][C1] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 1168.134129][C1] CR2: 55b169604560 CR3: 000d08a14000
> > CR4: 003406e0
> > [ 1168.142026][C1] Call Trace:
> > [ 1168.145206][C1]  
> > [ 1168.147957][C1]  ? smp_call_on_cpu_callback+0xd0/0xd0
> > [ 1168.153421][C1]  ? rcu_read_lock_sched_held+0xac/0xe0
> > [ 1168.158880][C1]  ? rcu_read_lock_bh_held+0xc0/0xc0
> > [ 1168.164076][C1]  generic_smp_call_function_single_interrupt+0x13/0x2b
> > [ 1168.170938][C1]  smp_call_function_single_interrupt+0x157/0x4e0
> > [ 1168.177278][C1]  ? smp_call_function_interrupt+0x4e0/0x4e0
> > [ 1168.183172][C1]  ? interrupt_entry+0xe4/0xf0
> > [ 1168.187846][C1]  ? trace_hardirqs_off_caller+0x8d/0x1f0
> > [ 1168.193478][C1]  ? trace_hardirqs_on_caller+0x1f0/0x1f0
> > [ 1168.199116][C1]  ? _nohz_idle_balance+0x221/0x360
> > [ 1168.204228][C1]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [ 1168.209690][C1]  call_function_single_interrupt+0xf/0x20


Endless soft-lockups for compiling workload since next-20200519

2020-05-19 Thread Qian Cai
Just a heads-up. Repeatedly compiling kernels for a while triggers
endless soft-lockups since next-20200519 on both x86_64 and powerpc.
The .config files are in,

https://github.com/cailca/linux-mm

I did first try to revert the linux-next commit 68cd9f4e7238
("tick/nohz: Narrow down noise while setting current task's tick
dependency"), but it did not help.

== x86_64 ==
[ 1167.993773][C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
flush_smp_call_function_queue+0x1fa/0x2e0
[ 1168.00][C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
dm_region_hash dm_log dm_mod
[ 1168.029492][C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
5.7.0-rc6-next-20200519 #1
[ 1168.037665][C1] Hardware name: HPE ProLiant DL385
Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[ 1168.046978][C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
[ 1168.053658][C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
10 52 3f 71
[ 1168.073262][C1] RSP: 0018:c9178918 EFLAGS: 00010046
[ 1168.079253][C1] RAX:  RBX: 430c58f8
RCX: 8ec26083
[ 1168.087156][C1] RDX: 0003 RSI: dc00
RDI: 430c58f8
[ 1168.095054][C1] RBP: c91789a8 R08: ed1108618cec
R09: ed1108618cec
[ 1168.102964][C1] R10: 430c675b R11: 
R12: 430c58e0
[ 1168.110866][C1] R13: 8eb30c40 R14: 430c5880
R15: 430c58e0
[ 1168.118767][C1] FS:  ()
GS:4308() knlGS:
[ 1168.127628][C1] CS:  0010 DS:  ES:  CR0: 80050033
[ 1168.134129][C1] CR2: 55b169604560 CR3: 000d08a14000
CR4: 003406e0
[ 1168.142026][C1] Call Trace:
[ 1168.145206][C1]  
[ 1168.147957][C1]  ? smp_call_on_cpu_callback+0xd0/0xd0
[ 1168.153421][C1]  ? rcu_read_lock_sched_held+0xac/0xe0
[ 1168.158880][C1]  ? rcu_read_lock_bh_held+0xc0/0xc0
[ 1168.164076][C1]  generic_smp_call_function_single_interrupt+0x13/0x2b
[ 1168.170938][C1]  smp_call_function_single_interrupt+0x157/0x4e0
[ 1168.177278][C1]  ? smp_call_function_interrupt+0x4e0/0x4e0
[ 1168.183172][C1]  ? interrupt_entry+0xe4/0xf0
[ 1168.187846][C1]  ? trace_hardirqs_off_caller+0x8d/0x1f0
[ 1168.193478][C1]  ? trace_hardirqs_on_caller+0x1f0/0x1f0
[ 1168.199116][C1]  ? _nohz_idle_balance+0x221/0x360
[ 1168.204228][C1]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1168.209690][C1]  call_function_single_interrupt+0xf/0x20
[ 1168.215415][C1] RIP: 0010:_raw_spin_unlock_irqrestore+0x46/0x50
[ 1168.221747][C1] Code: 8d 5e ff 4c 89 e7 e8 a9 35 5f ff f6 c7 02
75 13 53 9d e8 fd c0 6f ff 65 ff 0d 4e ab a6 70 5b 41 5c 5d c3 e8 dc
c2 6f ff 53 9d  eb 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 ff
05 2b ab a6
[ 1168.241353][C1] RSP: 0018:c9178bd0 EFLAGS: 0246
ORIG_RAX: ff04
[ 1168.249700][C1] RAX:  RBX: 0246
RCX: 8eba0740
[ 1168.257602][C1] RDX: 0007 RSI: dc00
RDI: 888214f5c8e4
[ 1168.265503][C1] RBP: c9178be0 R08: fbfff2120216
R09: 
[ 1168.273400][C1] R10:  R11: 
R12: 43145880
[ 1168.281300][C1] R13: 90b2db80 R14: 0002
R15: 0001000164cb
[ 1168.289218][C1]  ? call_function_single_interrupt+0xa/0x20
[ 1168.295117][C1]  ? lockdep_hardirqs_on+0x1b0/0x2c0
[ 1168.300319][C1]  _nohz_idle_balance+0x221/0x360
[ 1168.305256][C1]  run_rebalance_domains+0x16c/0x2e0
[ 1168.310452][C1]  __do_softirq+0x1ca/0x96a
[ 1168.314861][C1]  ? __irqentry_text_end+0x1fa9e7/0x1fa9e7
[ 1168.320579][C1]  ? hrtimer_reprogram+0x170/0x170
[ 1168.325608][C1]  ? __bpf_trace_preemptirq_template+0x100/0x100
[ 1168.331856][C1]  ? lapic_next_event+0x3c/0x50
[ 1168.336617][C1]  ? clockevents_program_event+0xfc/0x180
[ 1168.342249][C1]  ? check_flags.part.28+0x86/0x220
[ 1168.347355][C1]  ? trace_hardirqs_off+0x8d/0x1f0
[ 1168.352374][C1]  ? __bpf_trace_preemptirq_template+0x100/0x100
[ 1168.358620][C1]  ? rcu_read_lock_sched_held+0xac/0xe0
[ 1168.364077][C1]  ? rcu_read_lock_bh_held+0xc0/0xc0
[ 1168.369282][C1]  irq_exit+0xd6/0xf0
[ 1168.373168][C1]  smp_apic_timer_interrupt+0x215/0x560
[ 1168.378628][C1]  ? smp_call_function_single_interrupt+0x4e0/0x4e0
[ 1168.385137][C1]  ? smp_call_function_interrupt+0x4e0/0x4e0
[ 1168.391031][C1]  ? interrupt_entry+0xe4/0xf0
[ 1168.395705][C1]  ? trace_hardirqs_off_caller+0x8d/0x1f0
[ 1168.401336][C1]  ? 

[PATCH] powerpc/xive: silence kmemleak false positives

2020-05-13 Thread Qian Cai
opal_xive_donate_page() will reference the newly allocated memory using
__pa(). Since kmemleak is unable to track physical addresses, this results
in false positives. Silence those by using kmemleak_ignore().

unreferenced object 0xc000201b53e9 (size 65536):
 comm "qemu-kvm", pid 124557, jiffies 4295650285 (age 364.370s)
 hex dump (first 32 bytes):
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
 backtrace:
   [<acc2fb77>] xive_native_alloc_vp_block+0x168/0x210
   xive_native_provision_pages at arch/powerpc/sysdev/xive/native.c:645
   (inlined by) xive_native_alloc_vp_block at arch/powerpc/sysdev/xive/native.c:674
   [<4d5c7964>] kvmppc_xive_compute_vp_id+0x20c/0x3b0 [kvm]
   [<55317cd2>] kvmppc_xive_connect_vcpu+0xa4/0x4a0 [kvm]
   [<93dfc014>] kvm_arch_vcpu_ioctl+0x388/0x508 [kvm]
   [<d25aea0f>] kvm_vcpu_ioctl+0x15c/0x950 [kvm]
   [<48155cd6>] ksys_ioctl+0xd8/0x130
   [<41ffeaa7>] sys_ioctl+0x28/0x40
   [<4afc4310>] system_call_exception+0x114/0x1e0
   [<fb70a873>] system_call_common+0xf0/0x278

Signed-off-by: Qian Cai 
---
 arch/powerpc/sysdev/xive/native.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 5218fdc4b29a..2d19f28967a6 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -647,6 +648,9 @@ static bool xive_native_provision_pages(void)
pr_err("Failed to allocate provisioning page\n");
return false;
}
+   /* Kmemleak is unable to track the physical address. */
+   kmemleak_ignore(p);
+
opal_xive_donate_page(chip, __pa(p));
}
return true;
-- 
2.21.0 (Apple Git-122.2)
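
Why a physical-address reference looks like a leak can be shown with a toy scanner. This is a rough userspace model, not how kmemleak is implemented: the real scanner matches pointers anywhere inside tracked objects, and TOY_PAGE_OFFSET, toy_pa(), and struct toy_object are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_OFFSET 0xc000000000000000ULL	/* illustrative linear-map base */

/* Toy __pa(): strip the linear-map offset from a "virtual" address. */
static uint64_t toy_pa(uint64_t vaddr)
{
	assert(vaddr >= TOY_PAGE_OFFSET);
	return vaddr - TOY_PAGE_OFFSET;
}

struct toy_object {
	uint64_t vaddr;		/* address the allocator handed out */
	bool ignored;		/* set by the toy kmemleak_ignore() */
};

/* Conservative scan: is the object's virtual address stored anywhere? */
static bool toy_referenced(const uint64_t *mem, size_t n, uint64_t vaddr)
{
	for (size_t i = 0; i < n; i++)
		if (mem[i] == vaddr)
			return true;
	return false;
}

/* Reported when nothing references the object and it is not ignored. */
static bool toy_reported_as_leak(const struct toy_object *obj,
				 const uint64_t *mem, size_t n)
{
	assert(obj);
	if (obj->ignored)
		return false;
	return !toy_referenced(mem, n, obj->vaddr);
}
```

If the only surviving copy of the address is toy_pa(vaddr), the scan finds no match and reports the object, even though it is still in use. Marking it as ignored suppresses the report.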



[PATCH] powerpc/kvm/radix: ignore kmemleak false positives

2020-05-13 Thread Qian Cai
kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
pud_populate() and pmd_populate() will use __pa() to reference the newly
allocated memory.

Since kmemleak is unable to track physical addresses, this results in false
positives. Silence those by using kmemleak_ignore().

unreferenced object 0xc000201c382a1000 (size 4096):
 comm "qemu-kvm", pid 124828, jiffies 4295733767 (age 341.250s)
 hex dump (first 32 bytes):
   c0 00 20 09 f4 60 03 87 c0 00 20 10 72 a0 03 87  .. ..` .r...
   c0 00 20 0e 13 a0 03 87 c0 00 20 1b dc c0 03 87  .. ... .
 backtrace:
   [<4cc2790f>] kvmppc_create_pte+0x838/0xd20 [kvm_hv]
   kvmppc_pmd_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:366
   (inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:590
   [<d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
   [<bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
   [<86dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
   [<5ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
   [<d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
   [<d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
   [<2543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
   [<48155cd6>] ksys_ioctl+0xd8/0x130
   [<41ffeaa7>] sys_ioctl+0x28/0x40
   [<4afc4310>] system_call_exception+0x114/0x1e0
   [<fb70a873>] system_call_common+0xf0/0x278
unreferenced object 0xc0002001f0c03900 (size 256):
 comm "qemu-kvm", pid 124830, jiffies 4295735235 (age 326.570s)
 hex dump (first 32 bytes):
   c0 00 20 10 fa a0 03 87 c0 00 20 10 fa a1 03 87  .. ... .
   c0 00 20 10 fa a2 03 87 c0 00 20 10 fa a3 03 87  .. ... .
 backtrace:
   [<23f675b8>] kvmppc_create_pte+0x854/0xd20 [kvm_hv]
   kvmppc_pte_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:356
   (inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:593
   [<d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
   [<bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
   [<86dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
   [<5ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
   [<d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
   [<d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
   [<2543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
   [<48155cd6>] ksys_ioctl+0xd8/0x130
   [<41ffeaa7>] sys_ioctl+0x28/0x40
   [<4afc4310>] system_call_exception+0x114/0x1e0
   [<fb70a873>] system_call_common+0xf0/0x278

Signed-off-by: Qian Cai 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index aa12cd4078b3..bc6c1aa3d0e9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -353,7 +353,13 @@ static struct kmem_cache *kvm_pmd_cache;
 
 static pte_t *kvmppc_pte_alloc(void)
 {
-   return kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);
+   pte_t *pte;
+
+   pte = kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);
+   /* pmd_populate() will only reference _pa(pte). */
+   kmemleak_ignore(pte);
+
+   return pte;
 }
 
 static void kvmppc_pte_free(pte_t *ptep)
@@ -363,7 +369,13 @@ static void kvmppc_pte_free(pte_t *ptep)
 
 static pmd_t *kvmppc_pmd_alloc(void)
 {
-   return kmem_cache_alloc(kvm_pmd_cache, GFP_KERNEL);
+   pmd_t *pmd;
+
+   pmd = kmem_cache_alloc(kvm_pmd_cache, GFP_KERNEL);
+   /* pud_populate() will only reference _pa(pmd). */
+   kmemleak_ignore(pmd);
+
+   return pmd;
 }
 
 static void kvmppc_pmd_free(pmd_t *pmdp)
-- 
2.21.0 (Apple Git-122.2)



Re: [PATCH] powerpc/kvm: silence kmemleak false positives

2020-05-13 Thread Qian Cai



> On May 13, 2020, at 12:04 AM, Michael Ellerman  wrote:
> 
> This should probably also have an include of  ?

No, asm/book3s/64/pgalloc.h already includes it, and since this is 
book3s_64_mmu_radix.c, it will pick it up eventually via,

asm/pgalloc.h
  asm/book3s/pgalloc.h

Re: [PATCH] powerpc/kvm: silence kmemleak false positives

2020-05-11 Thread Qian Cai



> On May 11, 2020, at 7:15 AM, Michael Ellerman  wrote:
> 
> There is kmemleak_alloc_phys(), which according to the docs can be used
> for tracking a phys address.
> 
> Did you try that?

Catalin, feel free to give your thoughts here.

My understanding is that the doc is a bit misleading. 
kmemleak_alloc_phys() allocates kmemleak objects for a physical address 
range, so that kmemleak can scan the memory within that range for pointers 
referencing other memory. It has only been used in memblock so far, but the new 
memory allocations here contain no references to other memory.

In this case, we already have kmemleak objects for those memory allocations. 
It is just that other pointers reference that memory by its physical address, 
and it is a known kmemleak limitation that it cannot track such a connection. 
Thus, we always use kmemleak_ignore() to avoid reporting those as leaks and to 
skip scanning them, because they contain no references to other memory.
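
To illustrate the limitation described above, here is a minimal userspace 
sketch, not kernel code. It assumes a toy scanner that only matches words equal 
to an object's virtual address; TOY_PAGE_OFFSET, toy_pa(), toy_scan_finds() and 
toy_reference_visible() are hypothetical names made up for this example, and 
real kmemleak is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the virtual-to-physical offset. */
#define TOY_PAGE_OFFSET 0x1000u

/* Toy __pa(): the value a page-table entry would store. */
static uintptr_t toy_pa(const void *vaddr)
{
	return (uintptr_t)vaddr - TOY_PAGE_OFFSET;
}

/* Toy scanner: true if any word in the range equals obj's virtual address. */
static bool toy_scan_finds(const uintptr_t *words, size_t n, const void *obj)
{
	for (size_t i = 0; i < n; i++)
		if (words[i] == (uintptr_t)obj)
			return true;
	return false;
}

/*
 * Store a reference to an object either as a toy physical address or as
 * a plain virtual pointer, then report whether the scanner can see it.
 */
static bool toy_reference_visible(bool store_physical)
{
	static char obj[256];
	uintptr_t table[1];

	table[0] = store_physical ? toy_pa(obj) : (uintptr_t)obj;
	return toy_scan_finds(table, 1, obj);
}
```

With the reference stored as a toy physical address the scanner sees nothing 
pointing at the object, which is the situation where kmemleak_ignore() is used 
in the patch.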

[PATCH] powerpc/kvm/book3s64/vio: fix some RCU-list locks

2020-05-09 Thread Qian Cai
It is unsafe to traverse kvm->arch.spapr_tce_tables and
stt->iommu_tables without the RCU read lock held. Also, add
cond_resched_rcu() in places with the RCU read lock held that could take
a while to finish.

 arch/powerpc/kvm/book3s_64_vio.c:76 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 no locks held by qemu-kvm/4265.

 stack backtrace:
 CPU: 96 PID: 4265 Comm: qemu-kvm Not tainted 5.7.0-rc4-next-20200508+ #2
 Call Trace:
 [c000201a8690f720] [c0715948] dump_stack+0xfc/0x174 (unreliable)
 [c000201a8690f770] [c01d9470] lockdep_rcu_suspicious+0x140/0x164
 [c000201a8690f7f0] [c00810b9fb48] 
kvm_spapr_tce_release_iommu_group+0x1f0/0x220 [kvm]
 [c000201a8690f870] [c00810b8462c] 
kvm_spapr_tce_release_vfio_group+0x54/0xb0 [kvm]
 [c000201a8690f8a0] [c00810b84710] kvm_vfio_destroy+0x88/0x140 [kvm]
 [c000201a8690f8f0] [c00810b7d488] kvm_put_kvm+0x370/0x600 [kvm]
 [c000201a8690f990] [c00810b7e3c0] kvm_vm_release+0x38/0x60 [kvm]
 [c000201a8690f9c0] [c05223f4] __fput+0x124/0x330
 [c000201a8690fa20] [c0151cd8] task_work_run+0xb8/0x130
 [c000201a8690fa70] [c01197e8] do_exit+0x4e8/0xfa0
 [c000201a8690fb70] [c011a374] do_group_exit+0x64/0xd0
 [c000201a8690fbb0] [c0132c90] get_signal+0x1f0/0x1200
 [c000201a8690fcc0] [c0020690] do_notify_resume+0x130/0x3c0
 [c000201a8690fda0] [c0038d64] syscall_exit_prepare+0x1a4/0x280
 [c000201a8690fe20] [c000c8f8] system_call_common+0xf8/0x278

 
 arch/powerpc/kvm/book3s_64_vio.c:368 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 2 locks held by qemu-kvm/4264:
  #0: c000201ae2d000d8 (>mutex){+.+.}-{3:3}, at: 
kvm_vcpu_ioctl+0xdc/0x950 [kvm]
  #1: c000200c9ed0c468 (>srcu){}-{0:0}, at: 
kvmppc_h_put_tce+0x88/0x340 [kvm]

 
 arch/powerpc/kvm/book3s_64_vio.c:108 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by qemu-kvm/4257:
  #0: c000200b1b363a40 (>lock){+.+.}-{3:3}, at: 
kvm_vfio_set_attr+0x598/0x6c0 [kvm]

 
 arch/powerpc/kvm/book3s_64_vio.c:146 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by qemu-kvm/4257:
  #0: c000200b1b363a40 (>lock){+.+.}-{3:3}, at: 
kvm_vfio_set_attr+0x598/0x6c0 [kvm]

Signed-off-by: Qian Cai 
---
 arch/powerpc/kvm/book3s_64_vio.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index 50555ad1db93..4f5016bab723 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -73,6 +73,7 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
struct kvmppc_spapr_tce_iommu_table *stit, *tmp;
struct iommu_table_group *table_group = NULL;
 
+   rcu_read_lock();
list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
 
table_group = iommu_group_get_iommudata(grp);
@@ -87,7 +88,9 @@ extern void kvm_spapr_tce_release_iommu_group(struct kvm *kvm,
kref_put(&stit->kref, kvm_spapr_tce_liobn_put);
}
}
+   cond_resched_rcu();
}
+   rcu_read_unlock();
 }
 
 extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm, int tablefd,
@@ -105,12 +108,14 @@ extern long kvm_spapr_tce_attach_iommu_group(struct kvm 
*kvm, int tablefd,
if (!f.file)
return -EBADF;
 
+   rcu_read_lock();
list_for_each_entry_rcu(stt, &kvm->arch.spapr_tce_tables, list) {
if (stt == f.file->private_data) {
found = true;
break;
}
}
+   rcu_read_unlock();
 
fdput(f);
 
@@ -143,6 +148,7 @@ extern long kvm_spapr_tce_attach_iommu_group(struct kvm 
*kvm, int tablefd,
if (!tbl)
return -EINVAL;
 
+   rcu_read_lock();
list_for_each_entry_rcu(stit, &stt->iommu_tables, next) {
if (tbl != stit->tbl)
continue;
@@ -150,14 +156,17 @@ extern long kvm_spapr_tce_attach_iommu_group(struct kvm 
*kvm, int tablefd,
if (!kref_get_unless_zero(&stit->kref)) {
/* stit is being destroyed */
iommu_tce_table_put(tbl);
+   rcu_read_unlock();
return -ENOTTY;
}
/*
 * The table is already known to this KVM, we just increased
 * its KVM reference counter and can return.
 */
+   rcu_read_unlock();
return 0;
}
+   rcu_read_unlock()

[PATCH] powerpc/mm/book3s64/iommu: fix some RCU-list locks

2020-05-09 Thread Qian Cai
It is safe to traverse mm->context.iommu_group_mem_list with either
mem_list_mutex or the RCU read lock held. Silence a few RCU-list false
positive warnings and fix a few missing RCU read locks.

 arch/powerpc/mm/book3s64/iommu_api.c:330 RCU-list traversed in non-reader 
section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 2 locks held by qemu-kvm/4305:
  #0: c00bc3fe4d68 (>lock){+.+.}-{3:3}, at: 
tce_iommu_ioctl.part.9+0xc7c/0x1870 [vfio_iommu_spapr_tce]
  #1: c1501910 (mem_list_mutex){+.+.}-{3:3}, at: mm_iommu_get+0x50/0x190

 
 arch/powerpc/mm/book3s64/iommu_api.c:132 RCU-list traversed in non-reader 
section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 2 locks held by qemu-kvm/4305:
  #0: c00bc3fe4d68 (>lock){+.+.}-{3:3}, at: 
tce_iommu_ioctl.part.9+0xc7c/0x1870 [vfio_iommu_spapr_tce]
  #1: c1501910 (mem_list_mutex){+.+.}-{3:3}, at: 
mm_iommu_do_alloc+0x120/0x5f0

 
 arch/powerpc/mm/book3s64/iommu_api.c:292 RCU-list traversed in non-reader 
section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 2 locks held by qemu-kvm/4312:
  #0: c00ecafe23c8 (>mutex){+.+.}-{3:3}, at: 
kvm_vcpu_ioctl+0xdc/0x950 [kvm]
  #1: c00045e6c468 (>srcu){}-{0:0}, at: 
kvmppc_h_put_tce+0x88/0x340 [kvm]

 
 arch/powerpc/mm/book3s64/iommu_api.c:424 RCU-list traversed in non-reader 
section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 2 locks held by qemu-kvm/4312:
  #0: c00ecafe23c8 (>mutex){+.+.}-{3:3}, at: 
kvm_vcpu_ioctl+0xdc/0x950 [kvm]
  #1: c00045e6c468 (>srcu){}-{0:0}, at: 
kvmppc_h_put_tce+0x88/0x340 [kvm]

Signed-off-by: Qian Cai 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index fa05bbd1f682..bf0108b6f445 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -129,7 +129,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
 
mutex_lock(&mem_list_mutex);
 
-   list_for_each_entry_rcu(mem2, &mm->context.iommu_group_mem_list, next) {
+   list_for_each_entry_rcu(mem2, &mm->context.iommu_group_mem_list, next,
+   lockdep_is_held(&mem_list_mutex)) {
/* Overlap? */
if ((mem2->ua < (ua + (entries << PAGE_SHIFT))) &&
(ua < (mem2->ua +
@@ -289,6 +290,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct 
mm_struct *mm,
 {
struct mm_iommu_table_group_mem_t *mem, *ret = NULL;
 
+   rcu_read_lock();
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
if ((mem->ua <= ua) &&
(ua + size <= mem->ua +
@@ -297,6 +299,7 @@ struct mm_iommu_table_group_mem_t *mm_iommu_lookup(struct 
mm_struct *mm,
break;
}
}
+   rcu_read_unlock();
 
return ret;
 }
@@ -327,7 +330,8 @@ struct mm_iommu_table_group_mem_t *mm_iommu_get(struct 
mm_struct *mm,
 
mutex_lock(&mem_list_mutex);
 
-   list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
+   list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next,
+   lockdep_is_held(&mem_list_mutex)) {
if ((mem->ua == ua) && (mem->entries == entries)) {
ret = mem;
++mem->used;
@@ -421,6 +425,7 @@ bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long 
hpa,
struct mm_iommu_table_group_mem_t *mem;
unsigned long end;
 
+   rcu_read_lock();
list_for_each_entry_rcu(mem, &mm->context.iommu_group_mem_list, next) {
if (mem->dev_hpa == MM_IOMMU_TABLE_INVALID_HPA)
continue;
@@ -437,6 +442,7 @@ bool mm_iommu_is_devmem(struct mm_struct *mm, unsigned long 
hpa,
return true;
}
}
+   rcu_read_unlock();
 
return false;
 }
-- 
2.21.0 (Apple Git-122.2)
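
The two kinds of change in the iommu_api.c patch above (passing a lockdep 
condition versus taking the RCU read lock) can be pictured with a small 
userspace model. This is only a sketch under the assumption that the 
traversal check warns when neither an RCU read-side section nor the supplied 
condition holds; all toy_* names are hypothetical and nothing here is the 
kernel's actual implementation.

```c
#include <assert.h>
#include <stdbool.h>

static int toy_rcu_depth;	/* models rcu_read_lock() nesting */
static int toy_warnings;	/* counts "non-reader section" reports */

static void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

static void toy_rcu_read_unlock(void)
{
	toy_rcu_depth--;
}

/* Warn unless we are in a reader section or the caller's condition holds. */
static void toy_list_check(bool cond)
{
	if (!cond && toy_rcu_depth == 0)
		toy_warnings++;
}

/* Run three traversals and return how many of them would have warned. */
static int toy_rcu_list_demo(void)
{
	bool toy_mutex_held;

	toy_warnings = 0;

	/* Bare traversal, no lock of any kind: warns. */
	toy_list_check(false);

	/* Traversal inside a reader section: silent. */
	toy_rcu_read_lock();
	toy_list_check(false);
	toy_rcu_read_unlock();

	/* Traversal under the writer-side mutex, passed as the condition: silent. */
	toy_mutex_held = true;
	toy_list_check(toy_mutex_held);

	return toy_warnings;
}
```

Only the first traversal is counted as a warning in this model, which matches 
the intent of the patch: paths already holding mem_list_mutex state so via the 
extra argument, and the remaining paths take the RCU read lock.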



[PATCH] powerpc/powernv/pci: fix a RCU-list lock

2020-05-09 Thread Qian Cai
It is unsafe to traverse tbl->it_group_list without the RCU read lock.

 WARNING: suspicious RCU usage
 5.7.0-rc4-next-20200508 #1 Not tainted
 -
 arch/powerpc/platforms/powernv/pci-ioda-tce.c:355 RCU-list traversed in 
non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 3 locks held by qemu-kvm/4305:
  #0: c00bc3fe6988 (>group_lock){}-{3:3}, at: 
vfio_fops_unl_ioctl+0x108/0x410 [vfio]
  #1: c0080fcc7400 (_drivers_lock){+.+.}-{3:3}, at: 
vfio_fops_unl_ioctl+0x148/0x410 [vfio]
  #2: c00bc3fe4d68 (>lock){+.+.}-{3:3}, at: 
tce_iommu_attach_group+0x3c/0x4f0 [vfio_iommu_spapr_tce]

 stack backtrace:
 CPU: 4 PID: 4305 Comm: qemu-kvm Not tainted 5.7.0-rc4-next-20200508 #1
 Call Trace:
 [c010f29afa60] [c07154c8] dump_stack+0xfc/0x174 (unreliable)
 [c010f29afab0] [c01d8ff0] lockdep_rcu_suspicious+0x140/0x164
 [c010f29afb30] [c00dae2c] 
pnv_pci_unlink_table_and_group+0x11c/0x200
 [c010f29afb70] [c00d4a34] pnv_pci_ioda2_unset_window+0xc4/0x190
 [c010f29afbf0] [c00d4b4c] pnv_ioda2_take_ownership+0x4c/0xd0
 [c010f29afc30] [c0080fd60ee0] tce_iommu_attach_group+0x2c8/0x4f0 
[vfio_iommu_spapr_tce]
 [c010f29afcd0] [c0080fcc11a0] vfio_fops_unl_ioctl+0x238/0x410 [vfio]
 [c010f29afd50] [c05430a8] ksys_ioctl+0xd8/0x130
 [c010f29afda0] [c0543128] sys_ioctl+0x28/0x40
 [c010f29afdc0] [c0038af4] system_call_exception+0x114/0x1e0
 [c010f29afe20] [c000c8f0] system_call_common+0xf0/0x278

Signed-off-by: Qian Cai 
---
 arch/powerpc/platforms/powernv/pci-ioda-tce.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c 
b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
index 5dc6847d5f4c..6be9cf292b4e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
@@ -352,6 +352,8 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
 
/* Remove link to a group from table's list of attached groups */
found = false;
+
+   rcu_read_lock();
list_for_each_entry_rcu(tgl, &tbl->it_group_list, next) {
if (tgl->table_group == table_group) {
list_del_rcu(&tgl->next);
@@ -360,6 +362,8 @@ void pnv_pci_unlink_table_and_group(struct iommu_table *tbl,
break;
}
}
+   rcu_read_unlock();
+
if (WARN_ON(!found))
return;
 
-- 
2.21.0 (Apple Git-122.2)



Re: ioremap() called early from pnv_pci_init_ioda_phb()

2020-05-09 Thread Qian Cai



> On May 9, 2020, at 4:38 AM, Nicholas Piggin  wrote:
> 
> Your patch to use early_ioremap is faulting? I wonder why?

Yes, and I don’t know the reason either. I suppose there are not many places in 
other parts of the kernel that keep using addresses from early_ioremap() after 
the system has booted. Otherwise, we would see those leak warnings elsewhere.

Thus, we probably have to audit the code and, if still necessary, call 
early_ioremap() and then early_iounmap(), followed by an ioremap() once the slab 
allocator is available?

Re: [PATCH v3] powerpc/64s/pgtable: fix an undefined behaviour

2020-05-09 Thread Qian Cai



> On Mar 6, 2020, at 1:56 AM, Christophe Leroy  wrote:
> 
> 
> 
> Le 06/03/2020 à 05:48, Qian Cai a écrit :
>> Booting a power9 server with hash MMU could trigger an undefined
>> behaviour because pud_offset(p4d, 0) will do,
>> 0 >> (PAGE_SHIFT:16 + PTE_INDEX_SIZE:8 + H_PMD_INDEX_SIZE:10)
>> Fix it by converting pud_index() and friends to static inline
>> functions.
>> UBSAN: shift-out-of-bounds in arch/powerpc/mm/ptdump/ptdump.c:282:15
>> shift exponent 34 is too large for 32-bit type 'int'
>> CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-20200303+ #13
>> Call Trace:
>> dump_stack+0xf4/0x164 (unreliable)
>> ubsan_epilogue+0x18/0x78
>> __ubsan_handle_shift_out_of_bounds+0x160/0x21c
>> walk_pagetables+0x2cc/0x700
>> walk_pud at arch/powerpc/mm/ptdump/ptdump.c:282
>> (inlined by) walk_pagetables at arch/powerpc/mm/ptdump/ptdump.c:311
>> ptdump_check_wx+0x8c/0xf0
>> mark_rodata_ro+0x48/0x80
>> kernel_init+0x74/0x194
>> ret_from_kernel_thread+0x5c/0x74
>> Suggested-by: Christophe Leroy 
>> Signed-off-by: Qian Cai 
> 
> Reviewed-by: Christophe Leroy 

Michael, can you take a look at this patch when you have a chance? It looks 
like it has fallen through the cracks.

> 
>> ---
>> v3: convert pud_index() etc to static inline functions.
>> v2: convert pud_offset() etc to static inline functions.
>>  arch/powerpc/include/asm/book3s/64/pgtable.h | 23 
>>  1 file changed, 19 insertions(+), 4 deletions(-)
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index 201a69e6a355..bd432c6706b9 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -998,10 +998,25 @@ extern struct page *pgd_page(pgd_t pgd);
>>  #define pud_page_vaddr(pud) __va(pud_val(pud) & ~PUD_MASKED_BITS)
>>  #define pgd_page_vaddr(pgd) __va(pgd_val(pgd) & ~PGD_MASKED_BITS)
>>  -#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & (PTRS_PER_PGD - 
>> 1))
>> -#define pud_index(address) (((address) >> (PUD_SHIFT)) & (PTRS_PER_PUD - 1))
>> -#define pmd_index(address) (((address) >> (PMD_SHIFT)) & (PTRS_PER_PMD - 1))
>> -#define pte_index(address) (((address) >> (PAGE_SHIFT)) & (PTRS_PER_PTE - 
>> 1))
>> +static inline unsigned long pgd_index(unsigned long address)
>> +{
>> +return (address >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
>> +}
>> +
>> +static inline unsigned long pud_index(unsigned long address)
>> +{
>> +return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
>> +}
>> +
>> +static inline unsigned long pmd_index(unsigned long address)
>> +{
>> +return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
>> +}
>> +
>> +static inline unsigned long pte_index(unsigned long address)
>> +{
>> +return (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
>> +}
>>/*
>>   * Find an entry in a page-table-directory.  We combine the address region
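
As a rough userspace illustration of why the conversion quoted above helps: 
with the macro form, a literal 0 argument has type int, so shifting it by 34 
is what UBSAN reported. A static inline function converts the argument to 
unsigned long first. The sketch below assumes a 64-bit unsigned long; 
TOY_PUD_SHIFT uses the 16 + 8 + 10 = 34 from the report, while 
TOY_PTRS_PER_PUD and the toy_ names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Shift taken from the report: PAGE_SHIFT 16 + PTE_INDEX_SIZE 8 + H_PMD_INDEX_SIZE 10. */
#define TOY_PUD_SHIFT 34
/* Hypothetical table size, chosen only for the example. */
#define TOY_PTRS_PER_PUD 512UL

/* This sketch only makes sense where unsigned long is wider than the shift. */
_Static_assert(sizeof(unsigned long) * CHAR_BIT > TOY_PUD_SHIFT,
	       "unsigned long must be wider than TOY_PUD_SHIFT");

/*
 * Macro form, for comparison only (never expanded here):
 *   #define TOY_PUD_INDEX(a) (((a) >> TOY_PUD_SHIFT) & (TOY_PTRS_PER_PUD - 1))
 * TOY_PUD_INDEX(0) would shift an int by 34, which is undefined.
 */

/* Function form: the parameter type forces a conversion to unsigned long. */
static inline unsigned long toy_pud_index(unsigned long address)
{
	return (address >> TOY_PUD_SHIFT) & (TOY_PTRS_PER_PUD - 1);
}
```

Calling toy_pud_index(0) is well defined here because the 0 is converted to 
unsigned long before the shift takes place.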



[PATCH] powerpc/kvm: silence kmemleak false positives

2020-05-08 Thread Qian Cai
kvmppc_pmd_alloc() and kvmppc_pte_alloc() allocate some memory but then
pud_populate() and pmd_populate() will use __pa() to reference the newly
allocated memory. The same happens in xive_native_provision_pages().

Since kmemleak is unable to track references by physical address, this results
in false positives. Silence them by using kmemleak_ignore().

unreferenced object 0xc000201c382a1000 (size 4096):
  comm "qemu-kvm", pid 124828, jiffies 4295733767 (age 341.250s)
  hex dump (first 32 bytes):
c0 00 20 09 f4 60 03 87 c0 00 20 10 72 a0 03 87  .. ..` .r...
c0 00 20 0e 13 a0 03 87 c0 00 20 1b dc c0 03 87  .. ... .
  backtrace:
[<4cc2790f>] kvmppc_create_pte+0x838/0xd20 [kvm_hv]
kvmppc_pmd_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:366
(inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:590
[<d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
[<bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
[<86dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
[<5ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
[<d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
[<d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
[<2543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
[<48155cd6>] ksys_ioctl+0xd8/0x130
[<41ffeaa7>] sys_ioctl+0x28/0x40
[<4afc4310>] system_call_exception+0x114/0x1e0
[<fb70a873>] system_call_common+0xf0/0x278
unreferenced object 0xc0002001f0c03900 (size 256):
  comm "qemu-kvm", pid 124830, jiffies 4295735235 (age 326.570s)
  hex dump (first 32 bytes):
c0 00 20 10 fa a0 03 87 c0 00 20 10 fa a1 03 87  .. ... .
c0 00 20 10 fa a2 03 87 c0 00 20 10 fa a3 03 87  .. ... .
  backtrace:
[<23f675b8>] kvmppc_create_pte+0x854/0xd20 [kvm_hv]
kvmppc_pte_alloc at arch/powerpc/kvm/book3s_64_mmu_radix.c:356
(inlined by) kvmppc_create_pte at arch/powerpc/kvm/book3s_64_mmu_radix.c:593
[<d123c49a>] kvmppc_book3s_instantiate_page+0x2e0/0x8c0 [kvm_hv]
[<bb549087>] kvmppc_book3s_radix_page_fault+0x1b4/0x2b0 [kvm_hv]
[<86dddc0e>] kvmppc_book3s_hv_page_fault+0x214/0x12a0 [kvm_hv]
[<5ae9ccc2>] kvmppc_vcpu_run_hv+0xc5c/0x15f0 [kvm_hv]
[<d22162ff>] kvmppc_vcpu_run+0x34/0x48 [kvm]
[<d6953bc4>] kvm_arch_vcpu_ioctl_run+0x314/0x420 [kvm]
[<2543dd54>] kvm_vcpu_ioctl+0x33c/0x950 [kvm]
[<48155cd6>] ksys_ioctl+0xd8/0x130
[<41ffeaa7>] sys_ioctl+0x28/0x40
[<4afc4310>] system_call_exception+0x114/0x1e0
[<fb70a873>] system_call_common+0xf0/0x278
unreferenced object 0xc000201b53e9 (size 65536):
  comm "qemu-kvm", pid 124557, jiffies 4295650285 (age 364.370s)
  hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
  backtrace:
[<acc2fb77>] xive_native_alloc_vp_block+0x168/0x210
xive_native_provision_pages at arch/powerpc/sysdev/xive/native.c:645
(inlined by) xive_native_alloc_vp_block at 
arch/powerpc/sysdev/xive/native.c:674
[<4d5c7964>] kvmppc_xive_compute_vp_id+0x20c/0x3b0 [kvm]
[<55317cd2>] kvmppc_xive_connect_vcpu+0xa4/0x4a0 [kvm]
[<93dfc014>] kvm_arch_vcpu_ioctl+0x388/0x508 [kvm]
[<d25aea0f>] kvm_vcpu_ioctl+0x15c/0x950 [kvm]
[<48155cd6>] ksys_ioctl+0xd8/0x130
[<41ffeaa7>] sys_ioctl+0x28/0x40
[<4afc4310>] system_call_exception+0x114/0x1e0
[<fb70a873>] system_call_common+0xf0/0x278

Signed-off-by: Qian Cai 
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c | 16 ++--
 arch/powerpc/sysdev/xive/native.c  |  4 
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index aa12cd4078b3..bc6c1aa3d0e9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -353,7 +353,13 @@ static struct kmem_cache *kvm_pmd_cache;
 
 static pte_t *kvmppc_pte_alloc(void)
 {
-   return kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);
+   pte_t *pte;
+
+   pte = kmem_cache_alloc(kvm_pte_cache, GFP_KERNEL);
+   /* pmd_populate() will only reference _pa(pte). */
+   kmemleak_ignore(pte);
+
+   return pte;
 }
 
 static void kvmppc_pte_free(pte_t *ptep)
@@ -363,7 +369,13 @@ static void kvmppc_pte_free(pte_t *ptep)
 
 static pmd_t *kvmppc_pmd_alloc(void)
 {
-   return kmem_cache_alloc(kvm_pmd_cache, GFP_KERNEL);
+   pmd_t *pmd;
+
+   pmd = kmem_cache_alloc(kvm_pmd_cache, GFP_K

Re: ioremap() called early from pnv_pci_init_ioda_phb()

2020-05-08 Thread Qian Cai



> On May 8, 2020, at 10:39 AM, Qian Cai  wrote:
> 
> Booting POWER9 PowerNV prints this message,
> 
> "ioremap() called early from pnv_pci_init_ioda_phb+0x420/0xdfc. Use 
> early_ioremap() instead"
> 
> but using the patch below results in leaks because it never calls 
> early_iounmap() anywhere. However, it looks to me like it was by design that 
> the phb->regs mapping stays around forever, since it is used in 
> pnv_ioda_get_inval_reg(), so is the check_early_ioremap_leak() initcall just 
> too strong?
> 
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -36,6 +36,7 @@
> #include 
> #include 
> #include 
> +#include 
> 
> #include 
> 
> @@ -3827,7 +3828,7 @@ static void __init pnv_pci_init_ioda_phb(struct 
> device_node *np,
>/* Get registers */
>if (!of_address_to_resource(np, 0, &r)) {
>phb->regs_phys = r.start;
> -   phb->regs = ioremap(r.start, resource_size(&r));
> +   phb->regs = early_ioremap(r.start, resource_size(&r));
>if (phb->regs == NULL)
>pr_err("  Failed to map registers !\n");

This will also trigger a panic with debugfs reads, so isn’t this commit 
bogus, at least for powerpc64?

d538aadc2718 ("powerpc/ioremap: warn on early use of ioremap()")

[11017.617022][T122068] Faulting instruction address: 0xc00db564
[11017.617257][T122066] Faulting instruction address: 0xc00db564
[11017.617950][T122073] Faulting instruction address: 0xc00db564
[11017.61][T122064] BUG: Unable to handle kernel data access on read at 
0xffe20e10
[11017.618935][T122064] Faulting instruction address: 0xc00db564
[11017.737996][T122072] 
[11017.738010][T122073] Oops: Kernel access of bad area, sig: 11 [#2]
[11017.738024][T122073] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
DEBUG_PAGEALLOC NUMA PowerNV
[11017.738051][T122073] Modules linked in: brd ext4 crc16 mbcache jbd2 loop 
kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci tg3 mdio libata 
libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
[11017.738110][T122073] CPU: 108 PID: 122073 Comm: read_all Tainted: G  D W 
5.7.0-rc4-next-20200508+ #4
[11017.738138][T122073] NIP:  c00db564 LR: c056f660 CTR: 
c00db550
[11017.738173][T122073] REGS: c00374f6f980 TRAP: 0380   Tainted: G  D W 
 (5.7.0-rc4-next-20200508+)
[11017.738234][T122073] MSR:  90009033   CR: 
22002282  XER: 2004
[11017.738278][T122073] CFAR: c056f65c IRQMASK: 0 
[11017.738278][T122073] GPR00: c056f660 c00374f6fc10 
c1689400 c000201ffc41aa00 
[11017.738278][T122073] GPR04: c00374f6fc70  
 0001 
[11017.738278][T122073] GPR08:  ffe2 
 c008ee380080 
[11017.738278][T122073] GPR12: c00db550 c000201fff671280 
  
[11017.738278][T122073] GPR16: 0002 10040800 
1001ccd8 1001cc80 
[11017.738278][T122073] GPR20: 1001cc98 1001ccc8 
1001cca8 1001cb48 
[11017.738278][T122073] GPR24:   
03ff 7fffebb67390 
[11017.738278][T122073] GPR28: c00374f6fd90 c000200c0c6a7550 
 c000200c0c6a7500 
[11017.738542][T122073] NIP [c00db564] pnv_eeh_dbgfs_get_inbB+0x14/0x30
[11017.738579][T122073] LR [c056f660] simple_attr_read+0xa0/0x180
[11017.738613][T122073] Call Trace:
[11017.738645][T122073] [c00374f6fc10] [c056f630] 
simple_attr_read+0x70/0x180 (unreliable)
[11017.738672][T122073] [c00374f6fcb0] [c064a2e0] 
full_proxy_read+0x90/0xe0
[11017.738686][T122073] [c00374f6fd00] [c051fe0c] 
__vfs_read+0x3c/0x70
[11017.738722][T122073] [c00374f6fd20] [c051feec] 
vfs_read+0xac/0x170
[11017.738757][T122073] [c00374f6fd70] [c052034c] 
ksys_read+0x7c/0x140
[11017.738818][T122073] [c00374f6fdc0] [c0038af4] 
system_call_exception+0x114/0x1e0
[11017.738867][T122073] [c00374f6fe20] [c000c8f0] 
system_call_common+0xf0/0x278
[11017.738916][T122073] Instruction dump:
[11017.738948][T122073] 7c0004ac f9490d10 a14d0c78 3860 b14d0c7a 4e800020 
6000 7c0802a6 
[11017.739001][T122073] 6000 e9230278 e9290028 7c0004ac  0c09 
4c00012c 3860 
[11017.739052][T122073] ---[ end trace f68728a0d3053b5e ]---
[11017.828156][T122073] 
[11017.828170][T122068] Oops: Kernel access of bad area, sig: 11 [#3]
[11017.828184][T122068] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
DEBUG_PAGEALLOC NUMA PowerNV
[11017.828209][T122068] Modules linked in: brd ext4 crc16 mbcache jbd2 loop 
kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci tg3 mdio libata 
libphy firmwa

ioremap() called early from pnv_pci_init_ioda_phb()

2020-05-08 Thread Qian Cai
 Booting POWER9 PowerNV prints this message,

"ioremap() called early from pnv_pci_init_ioda_phb+0x420/0xdfc. Use 
early_ioremap() instead"

but using the patch below results in leaks because it never calls 
early_iounmap() anywhere. However, it looks to me like it was by design that the 
phb->regs mapping stays around forever, since it is used in 
pnv_ioda_get_inval_reg(), so is the check_early_ioremap_leak() initcall just 
too strong?

--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -3827,7 +3828,7 @@ static void __init pnv_pci_init_ioda_phb(struct 
device_node *np,
/* Get registers */
if (!of_address_to_resource(np, 0, &r)) {
phb->regs_phys = r.start;
-   phb->regs = ioremap(r.start, resource_size(&r));
+   phb->regs = early_ioremap(r.start, resource_size(&r));
if (phb->regs == NULL)
pr_err("  Failed to map registers !\n");

[   23.080069][T1] [ cut here ]
[   23.080089][T1] Debug warning: early ioremap leak of 10 areas detected.
[   23.080089][T1] please boot with early_ioremap_debug and report the 
dmesg.
[   23.080157][T1] WARNING: CPU: 4 PID: 1 at mm/early_ioremap.c:99 
check_early_ioremap_leak+0xd4/0x108
[   23.080171][T1] Modules linked in:
[   23.080192][T1] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 
5.7.0-rc4-next-20200508+ #4
[   23.080214][T1] NIP:  c103f2d8 LR: c103f2d4 CTR: 

[   23.080226][T1] REGS: c0003df0f860 TRAP: 0700   Not tainted  
(5.7.0-rc4-next-20200508+)
[   23.080259][T1] MSR:  90029033   CR: 
48000222  XER: 2004
[   23.080296][T1] CFAR: c010d5a8 IRQMASK: 0 
[   23.080296][T1] GPR00: c103f2d4 c0003df0faf0 
c1689400 0072 
[   23.080296][T1] GPR04: 0006  
c0003df0f7e4 0004 
[   23.080296][T1] GPR08: 001ffbb6  
c0003dee6680 0002 
[   23.080296][T1] GPR12:  c01fae00 
c1057860 c10578b0 
[   23.080296][T1] GPR16: c1002d38 c14f0660 
c14f0680 c14f06a0 
[   23.080296][T1] GPR20: c14f06c0 c14f06e0 
c14f0700 c14f0720 
[   23.080296][T1] GPR24: c0c4bc30 c00486b82000 
c15a0fe0 c15a0fc0 
[   23.080296][T1] GPR28: 0010 0010 
c1061e30 000a 
[   23.080507][T1] NIP [c103f2d8] 
check_early_ioremap_leak+0xd4/0x108
[   23.080530][T1] LR [c103f2d4] check_early_ioremap_leak+0xd0/0x108
[   23.080552][T1] Call Trace:
[   23.080571][T1] [c0003df0faf0] [c103f2d4] 
check_early_ioremap_leak+0xd0/0x108 (unreliable)
[   23.080607][T1] [c0003df0fb80] [c001130c] 
do_one_initcall+0xcc/0x660
[   23.080648][T1] [c0003df0fc80] [c1004c18] 
kernel_init_freeable+0x480/0x568
[   23.080681][T1] [c0003df0fdb0] [c0012180] 
kernel_init+0x24/0x194
[   23.080713][T1] [c0003df0fe20] [c000cb28] 
ret_from_kernel_thread+0x5c/0x74

This is from the early_ioremap_debug dmesg.

[0.00][T0] [ cut here ]
[0.00][T0] __early_ioremap(0x000600c3c001, 0001) [0] => 
 + ffbe
[0.00][T0] WARNING: CPU: 0 PID: 0 at mm/early_ioremap.c:162 
__early_ioremap+0x2d8/0x408
[0.00][T0] Modules linked in:
[0.00][T0] CPU: 0 PID: 0 Comm: swapper Not tainted 
5.7.0-rc4-next-20200508+ #4
[0.00][T0] NIP:  c103f5e4 LR: c103f5e0 CTR: 
c01e77f0
[0.00][T0] REGS: c168f980 TRAP: 0700   Not tainted  
(5.7.0-rc4-next-20200508+)
[0.00][T0] MSR:  90021033   CR: 
28000248  XER: 2004
[0.00][T0] CFAR: c010d5a8 IRQMASK: 1 
[0.00][T0] GPR00: c103f5e0 c168fc10 
c1689400 0050 
[0.00][T0] GPR04: c152f6f8  
c168f904  
[0.00][T0] GPR08:   
c162f600 0002 
[0.00][T0] GPR12: c01e77f0 c5b3 
  
[0.00][T0] GPR16:   
 1000 
[0.00][T0] GPR20:  81ae 
  
[0.00][T0] GPR24: 0001 c1061da8 
0008 0008 
[0.00][T0] GPR28:  c1061db0 
 c1061eb8 
[0.00][T0] NIP [c103f5e4] __early_ioremap+0x2d8/0x408
[0.00][T0] LR 

Re: [PATCH 3/3] powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel

2020-04-23 Thread Qian Cai



> On Apr 21, 2020, at 1:35 PM, Naveen N. Rao  
> wrote:
> 
> Since commit c55d7b5e64265f ("powerpc: Remove STRICT_KERNEL_RWX
> incompatibility with RELOCATABLE"), powerpc kernels with
> -mprofile-kernel can crash in certain scenarios with a trace like below:
> 
>BUG: Unable to handle kernel instruction fetch (NULL pointer?)
>Faulting instruction address: 0x
>Oops: Kernel access of bad area, sig: 11 [#1]
>LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA PowerNV
>
>NIP [] 0x0
>LR [c008102c0048] ext4_iomap_end+0x8/0x30 [ext4]
>Call Trace:
> iomap_apply+0x20c/0x920 (unreliable)
> iomap_bmap+0xfc/0x160
> ext4_bmap+0xa4/0x180 [ext4]
> bmap+0x4c/0x80
> jbd2_journal_init_inode+0x44/0x1a0 [jbd2]
> ext4_load_journal+0x440/0x860 [ext4]
> ext4_fill_super+0x342c/0x3ab0 [ext4]
> mount_bdev+0x25c/0x290
> ext4_mount+0x28/0x50 [ext4]
> legacy_get_tree+0x4c/0xb0
> vfs_get_tree+0x4c/0x130
> do_mount+0xa18/0xc50
> sys_mount+0x158/0x180
> system_call+0x5c/0x68
> 
> The NIP points to NULL, or a random location (data even), while the LR
> always points to the LEP of a function (with an offset of 8), indicating
> that something went wrong with ftrace. However, ftrace is not
> necessarily active when such crashes occur.
> 
> The kernel OOPS sometimes follows a warning from ftrace indicating that
> some module functions could not be patched with a nop. Other times, if a
> module is loaded early during boot, instruction patching can fail due to
> a separate bug, but the error is not reported due to missing error
> reporting.
> 
> In all the above cases when instruction patching fails, ftrace will be
> disabled but certain kernel module functions will be left with default
> calls to _mcount(). This is not a problem with ELFv1. However, with
> -mprofile-kernel, the default stub is problematic since it depends on a
> valid module TOC in r2. If the kernel (or a different module) calls into
> a function that does not use the TOC, the function won't have a prologue
> to setup the module TOC. When that function calls into _mcount(), we
> will end up in the relocation stub that will use the previous TOC, and
> end up trying to jump into a random location. From the above trace:
> 
>   iomap_apply+0x20c/0x920 [kernel TOC]
>   |
>   V
>   ext4_iomap_end+0x8/0x30 [no GEP == kernel TOC]
>   |
>   V
>   _mcount() stub
>   [uses kernel TOC -> random entry]
> 
> To address this, let's change over to using the special stub that is
> used for ftrace_[regs_]caller() for _mcount(). This ensures that we are
> not dependent on a valid module TOC in r2 for default _mcount()
> handling.
> 
> Reported-by: Qian Cai 
> Signed-off-by: Naveen N. Rao 

Feel free to add,

Tested-by: Qian Cai 

Re: [PATCH v3 0/4] Clean up hugetlb boot command line processing

2020-04-20 Thread Qian Cai



> On Apr 17, 2020, at 2:50 PM, Mike Kravetz  wrote:
> 
> Longpeng(Mike) reported a weird message from hugetlb command line processing
> and proposed a solution [1].  While the proposed patch does address the
> specific issue, there are other related issues in command line processing.
> As hugetlbfs evolved, updates to command line processing have been made to
> meet immediate needs and not necessarily in a coordinated manner.  The result
> is that some processing is done in arch specific code, some is done in arch
> independent code and coordination is problematic.  Semantics can vary between
> architectures.
> 
> The patch series does the following:
> - Define arch specific arch_hugetlb_valid_size routine used to validate
>  passed huge page sizes.
> - Move hugepagesz= command line parsing out of arch specific code and into
>  an arch independent routine.
> - Clean up command line processing to follow desired semantics and
>  document those semantics.
> 
> [1] 
> https://lore.kernel.org/linux-mm/20200305033014.1152-1-longpe...@huawei.com
> 
> Mike Kravetz (4):
>  hugetlbfs: add arch_hugetlb_valid_size
>  hugetlbfs: move hugepagesz= parsing to arch independent code
>  hugetlbfs: remove hugetlb_add_hstate() warning for existing hstate
>  hugetlbfs: clean up command line processing

Reverting this series fixed many undefined behaviors on arm64 with the config,

https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config

[   54.172683][T1] UBSAN: shift-out-of-bounds in 
./include/linux/hugetlb.h:555:34
[   54.180411][T1] shift exponent 4294967285 is too large for 64-bit type 
'unsigned long'
[   54.15][T1] CPU: 130 PID: 1 Comm: swapper/0 Not tainted 
5.7.0-rc2-next-20200420 #1
[   54.197284][T1] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.11 06/18/2019
[   54.207888][T1] Call trace:
[   54.211100][T1]  dump_backtrace+0x0/0x224
[   54.215565][T1]  show_stack+0x20/0x2c
[   54.219651][T1]  dump_stack+0xfc/0x184
[   54.223829][T1]  __ubsan_handle_shift_out_of_bounds+0x304/0x344
[   54.230204][T1]  hugetlb_add_hstate+0x3ec/0x414
huge_page_size at include/linux/hugetlb.h:555
(inlined by) hugetlb_add_hstate at mm/hugetlb.c:3301
[   54.235191][T1]  hugetlbpage_init+0x14/0x30
[   54.239824][T1]  do_one_initcall+0x6c/0x144
[   54.26][T1]  do_initcall_level+0x158/0x1c4
[   54.249336][T1]  do_initcalls+0x68/0xb0
[   54.253597][T1]  do_basic_setup+0x28/0x30
[   54.258049][T1]  kernel_init_freeable+0x19c/0x228
[   54.263188][T1]  kernel_init+0x14/0x208
[   54.267473][T1]  ret_from_fork+0x10/0x18
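
A note on the number in these reports: 4294967285 is what -11 looks like when it is printed as an unsigned 32-bit value, so the shift count reaching huge_page_size() was most likely negative or uninitialized. That is an inference from the number alone, not something the log states. A minimal sketch of the decoding, with made-up helper names:

```c
#include <stdint.h>

/*
 * Reinterpret the 32-bit shift exponent printed by UBSAN as a signed
 * value. Explicit arithmetic is used so the result does not rely on
 * implementation-defined unsigned-to-signed conversion.
 * (Hypothetical helper, not a kernel function.)
 */
static inline int64_t exponent_as_signed(uint32_t exponent)
{
	if (exponent > (uint32_t)INT32_MAX)
		return (int64_t)exponent - ((int64_t)1 << 32);
	return (int64_t)exponent;
}

/* Shifting a 64-bit value is only defined for exponents 0..63. */
static inline int shift_is_defined(uint32_t exponent)
{
	return exponent < 64;
}
```

With the value from the log, exponent_as_signed(4294967285u) gives -11, and shift_is_defined() rejects it, which matches what UBSAN complains about.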


[   55.534338][T1] UBSAN: shift-out-of-bounds in 
./include/linux/hugetlb.h:555:34
[   55.542064][T1] shift exponent 4294967285 is too large for 64-bit type 
'unsigned long'
[   55.550555][T1] CPU: 129 PID: 1 Comm: swapper/0 Not tainted 
5.7.0-rc2-next-20200420 #1
[   55.558992][T1] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.11 06/18/2019
[   55.569659][T1] Call trace:
[   55.572898][T1]  dump_backtrace+0x0/0x224
[   55.577335][T1]  show_stack+0x20/0x2c
[   55.581442][T1]  dump_stack+0xfc/0x184
[   55.585621][T1]  __ubsan_handle_shift_out_of_bounds+0x304/0x344
[   55.592031][T1]  __hugetlb_cgroup_file_dfl_init+0x37c/0x384
[   55.598062][T1]  hugetlb_cgroup_file_init+0x9c/0xd8
[   55.603399][T1]  hugetlb_init+0x248/0x448
[   55.607840][T1]  do_one_initcall+0x6c/0x144
[   55.612493][T1]  do_initcall_level+0x158/0x1c4
[   55.617404][T1]  do_initcalls+0x68/0xb0
[   55.621664][T1]  do_basic_setup+0x28/0x30
[   55.626107][T1]  kernel_init_freeable+0x19c/0x228
[   55.631253][T1]  kernel_init+0x14/0x208
[   55.635519][T1]  ret_from_fork+0x10/0x18

[  153.283648][T1] 

[  153.293078][T1] UBSAN: shift-out-of-bounds in 
./include/linux/hugetlb.h:555:34
[  153.300841][T1] shift exponent 4294967285 is too large for 64-bit type 
'unsigned long'
[  153.309185][T1] CPU: 161 PID: 1 Comm: swapper/0 Tainted: G L 
   5.7.0-rc2-next-20200420 #1
[  153.318879][T1] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.11 06/18/2019
[  153.329352][T1] Call trace:
[  153.332545][T1]  dump_backtrace+0x0/0x224
[  153.336945][T1]  show_stack+0x20/0x2c
[  153.341000][T1]  dump_stack+0xfc/0x184
[  153.345149][T1]  __ubsan_handle_shift_out_of_bounds+0x304/0x344
[  153.351465][T1]  hugetlbfs_fill_super+0x424/0x43c
[  153.356560][T1]  vfs_get_super+0xcc/0x170
[  153.360959][T1]  get_tree_nodev+0x28/0x34
[  153.365358][T1]  hugetlbfs_get_tree+0xfc/0x128
[  153.370193][T1]  vfs_get_tree+0x54/0x158
[  153.374513][T1]  fc_mount+0x1c/0x5c
[  153.378399][T1]  mount_one_hugetlbfs+0x54/0xc8
[  153.383233][T1]  

Re: [PATCH 15/21] mm: memmap_init: iterate over memblock regions rather that check each PFN

2020-04-20 Thread Qian Cai



> On Apr 12, 2020, at 3:48 PM, Mike Rapoport  wrote:
> 
> From: Baoquan He 
> 
> When called during boot the memmap_init_zone() function checks if each PFN
> is valid and actually belongs to the node being initialized using
> early_pfn_valid() and early_pfn_in_nid().
> 
> Each such check may cost up to O(log(n)) where n is the number of memory
> banks, so for large amount of memory overall time spent in early_pfn*()
> becomes substantial.
> 
> Since the information is anyway present in memblock, we can iterate over
> memblock memory regions in memmap_init() and only call memmap_init_zone()
> for PFN ranges that are know to be valid and in the appropriate node.
> 
> Signed-off-by: Baoquan He 
> Signed-off-by: Mike Rapoport 
> ---
> mm/page_alloc.c | 26 --
> 1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7f6a3081edb8..c43ce8709457 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5995,14 +5995,6 @@ void __meminit memmap_init_zone(unsigned long size, 
> int nid, unsigned long zone,
>* function.  They do not exist on hotplugged memory.
>*/
>   if (context == MEMMAP_EARLY) {
> - if (!early_pfn_valid(pfn)) {
> - pfn = next_pfn(pfn);
> - continue;
> - }
> - if (!early_pfn_in_nid(pfn, nid)) {
> - pfn++;
> - continue;
> - }

This causes a compilation warning from Clang,

mm/page_alloc.c:5917:39: warning: unused function 'next_pfn' [-Wunused-function]
static inline __meminit unsigned long next_pfn(unsigned long pfn)

This should fix it,

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d469384c9ca7..b48336e20bdc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5912,23 +5912,6 @@ overlap_memmap_init(unsigned long zone, unsigned long 
*pfn)
return false;
 }
 
-#ifdef CONFIG_SPARSEMEM
-/* Skip PFNs that belong to non-present sections */
-static inline __meminit unsigned long next_pfn(unsigned long pfn)
-{
-   const unsigned long section_nr = pfn_to_section_nr(++pfn);
-
-   if (present_section_nr(section_nr))
-   return pfn;
-   return section_nr_to_pfn(next_present_section_nr(section_nr));
-}
-#else
-static inline __meminit unsigned long next_pfn(unsigned long pfn)
-{
-   return pfn++;
-}
-#endif
-
 /*
  * Initially all pages are reserved - free ones are freed
  * up by memblock_free_all() once the early boot process is

>   if (overlap_memmap_init(zone, &pfn))
>   continue;
>   if (defer_init(nid, pfn, end_pfn))
> @@ -6118,9 +6110,23 @@ static void __meminit zone_init_free_lists(struct zone 
> *zone)
> }
> 
> void __meminit __weak memmap_init(unsigned long size, int nid,
> -   unsigned long zone, unsigned long start_pfn)
> +   unsigned long zone,
> +   unsigned long range_start_pfn)
> {
> - memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, NULL);
> + unsigned long start_pfn, end_pfn;
> + unsigned long range_end_pfn = range_start_pfn + size;
> + int i;
> +
> + for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> + start_pfn = clamp(start_pfn, range_start_pfn, range_end_pfn);
> + end_pfn = clamp(end_pfn, range_start_pfn, range_end_pfn);
> +
> + if (end_pfn > start_pfn) {
> + size = end_pfn - start_pfn;
> + memmap_init_zone(size, nid, zone, start_pfn,
> +  MEMMAP_EARLY, NULL);
> + }
> + }
> }
> 
> static int zone_batchsize(struct zone *zone)
> -- 
> 2.25.1
> 
> 
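
For reference, the clamping in the new memmap_init() loop boils down to intersecting each memblock region with the zone's PFN range and skipping empty results. A minimal user-space sketch of that step follows; clamp_pfn() and overlap_pfns() are hypothetical names used only for illustration, with clamp_pfn() standing in for the kernel's clamp() macro:

```c
/* Hypothetical stand-in for the kernel's clamp() macro. */
static inline unsigned long clamp_pfn(unsigned long v, unsigned long lo,
				      unsigned long hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Intersect the memory region [start, end) with the zone range
 * [range_start, range_end). Returns the number of PFNs in the overlap
 * and stores its first PFN in *first; returns 0 (leaving *first
 * untouched) when the two ranges do not overlap.
 */
static inline unsigned long overlap_pfns(unsigned long start, unsigned long end,
					 unsigned long range_start,
					 unsigned long range_end,
					 unsigned long *first)
{
	start = clamp_pfn(start, range_start, range_end);
	end = clamp_pfn(end, range_start, range_end);
	if (end <= start)
		return 0;
	*first = start;
	return end - start;
}
```

A region entirely outside the zone collapses to an empty range after clamping, which is why the patch only needs the single end_pfn > start_pfn check before calling memmap_init_zone().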



Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-17 Thread Qian Cai



> On Apr 2, 2020, at 7:24 AM, Michael Ellerman  wrote:
> 
> Qian Cai  writes:
>> From: Peter Zijlstra 
>> 
>> In the CPU-offline process, it calls mmdrop() after idle entry and the
>> subsequent call to cpuhp_report_idle_dead(). Once execution passes the
>> call to rcu_report_dead(), RCU is ignoring the CPU, which results in
>> lockdep complaining when mmdrop() uses RCU from either memcg or
>> debugobjects below.
>> 
>> Fix it by cleaning up the active_mm state from BP instead. Every arch
>> which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
>> from AP. The only exception is parisc because it switches them to
>> &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
>> but the patch will still work there because it calls mmgrab(&init_mm) in
>> smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
> 
> Thanks for debugging this. How did you hit it in the first place?
> 
> A link to the original thread would have helped me:
> 
>  https://lore.kernel.org/lkml/20200113190331.12788-1-...@lca.pw/
> 
>> WARNING: suspicious RCU usage
>> -
>> kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
>> 
>> other info that might help us debug this:
>> 
>> RCU used illegally from offline CPU!
>> Call Trace:
>> dump_stack+0xf4/0x164 (unreliable)
>> lockdep_rcu_suspicious+0x140/0x164
>> get_work_pool+0x110/0x150
>> __queue_work+0x1bc/0xca0
>> queue_work_on+0x114/0x120
>> css_release+0x9c/0xc0
>> percpu_ref_put_many+0x204/0x230
>> free_pcp_prepare+0x264/0x570
>> free_unref_page+0x38/0xf0
>> __mmdrop+0x21c/0x2c0
>> idle_task_exit+0x170/0x1b0
>> pnv_smp_cpu_kill_self+0x38/0x2e0
>> cpu_die+0x48/0x64
>> arch_cpu_idle_dead+0x30/0x50
>> do_idle+0x2f4/0x470
>> cpu_startup_entry+0x38/0x40
>> start_secondary+0x7a8/0xa80
>> start_secondary_resume+0x10/0x14
> 
> Do we know when this started happening? ie. can we determine a Fixes
> tag?
> 
>> 
>> Signed-off-by: Qian Cai 
>> ---
>> arch/powerpc/platforms/powernv/smp.c |  1 -
>> include/linux/sched/mm.h |  2 ++
>> kernel/cpu.c | 18 +-
>> kernel/sched/core.c  |  5 +++--
>> 4 files changed, 22 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/smp.c 
>> b/arch/powerpc/platforms/powernv/smp.c
>> index 13e251699346..b2ba3e95bda7 100644
>> --- a/arch/powerpc/platforms/powernv/smp.c
>> +++ b/arch/powerpc/platforms/powernv/smp.c
>> @@ -167,7 +167,6 @@ static void pnv_smp_cpu_kill_self(void)
>>  /* Standard hot unplug procedure */
>> 
>>  idle_task_exit();
>> -current->active_mm = NULL; /* for sanity */
> 
> If I'm reading it right, we'll now be running with active_mm == init_mm
> in the offline loop.
> 
> I guess that's fine, I can't think of any reason it would matter, and it
> seems like we were NULL'ing it out just for paranoia's sake not because
> of any actual problem.
> 
> Acked-by: Michael Ellerman  (powerpc)

Peter, can you take a look at this patch when you have a chance?

> 
> 
> cheers
> 
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index c49257a3b510..a132d875d351 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm)
>>  __mmdrop(mm);
>> }
>> 
>> +void mmdrop(struct mm_struct *mm);
>> +
>> /*
>>  * This has to be called after a get_task_mm()/mmget_not_zero()
>>  * followed by taking the mmap_sem for writing before modifying the
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 2371292f30b0..244d30544377 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -3,6 +3,7 @@
>>  *
>>  * This code is licenced under the GPL.
>>  */
>> +#include 
>> #include 
>> #include 
>> #include 
>> @@ -564,6 +565,21 @@ static int bringup_cpu(unsigned int cpu)
>>  return bringup_wait_for_ap(cpu);
>> }
>> 
>> +static int finish_cpu(unsigned int cpu)
>> +{
>> +struct task_struct *idle = idle_thread_get(cpu);
>> +struct mm_struct *mm = idle->active_mm;
>> +
>> +/*
>> + * idle_task_exit() will have switched to &init_mm, now
>> + * clean up any remaining active_mm state.
>> + */
>> +if (mm != &init_mm)
>> +idle->active_mm = &init_mm;
>> +mmdrop(mm);
>> +return 0;
>> +}
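
To make the reference handling in finish_cpu() easier to follow, here is a toy user-space model of it. This is only a sketch under stated assumptions: the struct and function names ending in _model are invented, the reference count is a plain int, and nothing here is kernel code.

```c
/* Toy stand-ins for the kernel structures; only the fields used here. */
struct mm_model {
	int count;
};

struct task_model {
	struct mm_model *active_mm;
};

/* Plays the role of init_mm in this model. */
static struct mm_model init_mm_model = { .count = 1 };

/* Drop one reference; the real mmdrop() frees the mm at zero. */
static inline void mmdrop_model(struct mm_model *mm)
{
	mm->count--;
}

/*
 * Mirrors the shape of finish_cpu() in the quoted patch: it runs on the
 * boot processor once the dead CPU is gone, repoints the idle task at
 * init_mm and drops the reference the idle task was still holding.
 */
static inline int finish_cpu_model(struct task_model *idle)
{
	struct mm_model *mm = idle->active_mm;

	if (mm != &init_mm_model)
		idle->active_mm = &init_mm_model;
	mmdrop_model(mm);
	return 0;
}
```

The point of the patch, as far as the commit message describes it, is where this drop happens: on a CPU that RCU still watches, rather than on the CPU that is going offline.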

Re: POWER9 crash due to STRICT_KERNEL_RWX (WAS: Re: Linux-next POWER9 NULL pointer NIP...)

2020-04-17 Thread Qian Cai



> On Apr 17, 2020, at 3:01 AM, Naveen N. Rao  wrote:
> 
> Hi Qian,
> 
> Qian Cai wrote:
>> OK, reverted the commit,
>> c55d7b5e6426 (“powerpc: Remove STRICT_KERNEL_RWX incompatibility with 
>> RELOCATABLE”)
>> or set STRICT_KERNEL_RWX=n fixed the crash below and also mentioned in this 
>> thread,
>> https://lore.kernel.org/lkml/15ac5b0e-a221-4b8c-9039-fa96b8ef7...@lca.pw/
> 
> Do you see any errors logged in dmesg when you see the crash?  
> STRICT_KERNEL_RWX changes how patch_instruction() works, so it would be 
> interesting to see if there are any ftrace-related errors thrown before the 
> crash.

Yes, looks like there is a warning right after,

echo function > /sys/kernel/debug/tracing/current_tracer
echo nop > /sys/kernel/debug/tracing/current_tracer

and just before the crash,

[ T3454] ftrace-powerpc: Unexpected call sequence at de85f044: 48003d1d 
7c0802a6
[   56.870472][ T3454] [ cut here ]
[   56.870500][ T3454] WARNING: CPU: 52 PID: 3454 at kernel/trace/ftrace.c:2026 
ftrace_bug+0x104/0x310
[   56.870527][ T3454] Modules linked in: kvm_hv kvm ses enclosure 
scsi_transport_sas ip_tables x_tables xfs sd_mod i40e firmware_class aacraid 
dm_mirror dm_region_hash dm_log dm_mod
[   56.870592][ T3454] CPU: 52 PID: 3454 Comm: nip.sh Not tainted 
5.7.0-rc1-next-20200416 #4
[   56.870627][ T3454] NIP:  c02a3ae4 LR: c02a47fc CTR: 
c02436f0
[   56.870661][ T3454] REGS: c0069a9ef710 TRAP: 0700   Not tainted  
(5.7.0-rc1-next-20200416)
[   56.870697][ T3454] MSR:  9282b033 
  CR: 28228222  XER: 
[   56.870748][ T3454] CFAR: c02a3a2c IRQMASK: 0 
[   56.870748][ T3454] GPR00: c02a47fc c0069a9ef9a0 
c12f9000 ffea 
[   56.870748][ T3454] GPR04: c0002004e2160438 c007fedf0ad8 
614ca19d 0007 
[   56.870748][ T3454] GPR08: 0003  
 0002 
[   56.870748][ T3454] GPR12: 4000 c007fffd5600 
4000 000139ae9798 
[   56.870748][ T3454] GPR16: 000139ae9724 000139a86968 
000139a1f230 000139aed568 
[   56.870748][ T3454] GPR20: 0001402af8b0 0009 
000139a996e8 7fffc9186d94 
[   56.870748][ T3454] GPR24:  c0069a9efc00 
c132cd00 c0069a9efc40 
[   56.870748][ T3454] GPR28: c11c29e8 0001 
c0002004e2160438 c00809321a64 
[   56.870969][ T3454] NIP [c02a3ae4] ftrace_bug+0x104/0x310
ftrace_bug at kernel/trace/ftrace.c:2026
[   56.870995][ T3454] LR [c02a47fc] ftrace_modify_all_code+0x16c/0x210
ftrace_modify_all_code at kernel/trace/ftrace.c:2672
[   56.871034][ T3454] Call Trace:
[   56.871057][ T3454] [c0069a9ef9a0] [4b899a9efa00] 0x4b899a9efa00 
(unreliable)
[   56.871086][ T3454] [c0069a9efa20] [c02a47fc] 
ftrace_modify_all_code+0x16c/0x210
[   56.871125][ T3454] [c0069a9efa50] [c0061b68] 
arch_ftrace_update_code+0x18/0x30
[   56.871162][ T3454] [c0069a9efa70] [c02a49c4] 
ftrace_run_update_code+0x44/0xc0
[   56.871199][ T3454] [c0069a9efaa0] [c02aa3c8] 
ftrace_startup+0xe8/0x1b0
[   56.871236][ T3454] [c0069a9efae0] [c02aa4e0] 
register_ftrace_function+0x50/0xc0
[   56.871275][ T3454] [c0069a9efb10] [c02d0468] 
function_trace_init+0x98/0xd0
[   56.871312][ T3454] [c0069a9efb40] [c02c75c0] 
tracing_set_tracer+0x350/0x640
[   56.871349][ T3454] [c0069a9efbe0] [c02c7a90] 
tracing_set_trace_write+0x1e0/0x370
[   56.871388][ T3454] [c0069a9efd00] [c052094c] 
__vfs_write+0x3c/0x70
[   56.871424][ T3454] [c0069a9efd20] [c0523d4c] 
vfs_write+0xcc/0x200
[   56.871461][ T3454] [c0069a9efd70] [c05240ec] 
ksys_write+0x7c/0x140
[   56.871498][ T3454] [c0069a9efdc0] [c0038a94] 
system_call_exception+0x114/0x1e0
[   56.871535][ T3454] [c0069a9efe20] [c000c870] 
system_call_common+0xf0/0x278
[   56.871570][ T3454] Instruction dump:
[   56.871592][ T3454] 7d908120 4e800020 6000 2b890001 409effd4 3c62ff8b 
38631958 4bf4491d 
[   56.871639][ T3454] 6000 4bc0 6000 fba10068 <0fe0> 3901 
3ce20003 3d22fed7 
[   56.871685][ T3454] irq event stamp: 95388
[   56.871708][ T3454] hardirqs last  enabled at (95387): [] 
console_unlock+0x6a4/0x950
[   56.871746][ T3454] hardirqs last disabled at (95388): [] 
program_check_common_virt+0x2bc/0x310
[   56.871785][ T3454] softirqs last  enabled at (91222): [] 
__do_softirq+0x658/0x8d8
[   56.871823][ T3454] softirqs last disabled at (91215): [] 
irq_exit+0x16c/0x1d0
[   56.871859][ T3454] ---[ end trace 48f8445450a4e206 ]---
[   56.871907][ T3454] ftrace failed to modify 
[   56.871913][ T3454] [] 
show_sas_rphy_phy_identifier+0xc/0x60 [scsi_transport_sas]
show_sas_rphy_phy_identifier at drivers/scsi/scsi_transport_

Re: POWER9 crash due to STRICT_KERNEL_RWX (WAS: Re: Linux-next POWER9 NULL pointer NIP...)

2020-04-17 Thread Qian Cai



> On Apr 16, 2020, at 10:46 PM, Russell Currey  wrote:
> 
> On Thu, 2020-04-16 at 22:40 -0400, Qian Cai wrote:
>>> On Apr 16, 2020, at 10:27 PM, Russell Currey 
>>> wrote:
>>> 
>>> Reverting the patch with the given config will have the same effect
>>> as
>>> STRICT_KERNEL_RWX=n.  Not discounting that it could be a bug on the
>>> powerpc side (i.e. relocatable kernels with strict RWX on haven't
>>> been
>>> exhaustively tested yet), but we should definitely figure out
>>> what's
>>> going on with this bad access first.
>> 
>> BTW, this bad access only happened once. The overwhelming rest of
>> crashes are with NULL pointer NIP like below. How can you explain
>> that STRICT_KERNEL_RWX=n would also make those NULL NIP disappear if
>> STRICT_KERNEL_RWX is just a messenger?
> 
> What happens if you test with STRICT_KERNEL_RWX=y and RELOCATABLE=n,
> reverting my patch?  This would give us an idea of whether it's
> something broken recently or if there's something else going on.

That combination will crash as well. I don’t think it was broken recently, though,
because the crash could already happen back in 5.6-rc1, when your commit was first
introduced.

> 
>> 
>> [  215.281666][T16896] LTP: starting chown04_16
>> [  215.424203][T18297] BUG: Unable to handle kernel instruction fetch
>> (NULL pointer?)
>> [  215.424289][T18297] Faulting instruction address: 0x
>> [  215.424313][T18297] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  215.424341][T18297] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256
>> DEBUG_PAGEALLOC NUMA PowerNV
>> [  215.424383][T18297] Modules linked in: loop kvm_hv kvm ip_tables
>> x_tables xfs sd_mod bnx2x mdio tg3 ahci libahci libphy libata
>> firmware_class dm_mirror dm_region_hash dm_log dm_mod
>> [  215.424459][T18297] CPU: 85 PID: 18297 Comm: chown04_16 Tainted:
>> GW 5.6.0-next-20200405+ #3
>> [  215.424489][T18297] NIP:   LR: c0080fbc0408
>> CTR: 
>> [  215.424530][T18297] REGS: c000200b8606f990 TRAP: 0400   Tainted:
>> GW  (5.6.0-next-20200405+)
>> [  215.424570][T18297] MSR:  900040009033
>>   CR: 84000248  XER: 2004
>> [  215.424619][T18297] CFAR: c0080fbc64f4 IRQMASK: 0 
>> [  215.424619][T18297] GPR00: c06c2238 c000200b8606fc20
>> c165ce00  
>> [  215.424619][T18297] GPR04: c000201a58106400 c000200b8606fcc0
>> 5f037e7d 00013bfb 
>> [  215.424619][T18297] GPR08: c000201a58106400 
>>  c1652ee0 
>> [  215.424619][T18297] GPR12:  c000201fff69a600
>>   
>> [  215.424619][T18297] GPR16:  
>>   
>> [  215.424619][T18297] GPR20:  
>>  0007 
>> [  215.424619][T18297] GPR24:  
>> c0080fbc8688 c000200b8606fcc0 
>> [  215.424619][T18297] GPR28:  7fff
>> c0080fbc0400 c00020068b8c0e70 
>> [  215.424914][T18297] NIP [] 0x0
>> [  215.424953][T18297] LR [c0080fbc0408] find_free_cb+0x8/0x30
>> [loop]
>> find_free_cb at drivers/block/loop.c:2129
>> [  215.424997][T18297] Call Trace:
>> [  215.425036][T18297] [c000200b8606fc20] [c06c2290]
>> idr_for_each+0xf0/0x170 (unreliable)
>> [  215.425073][T18297] [c000200b8606fca0] [c0080fbc2744]
>> loop_lookup.part.2+0x4c/0xb0 [loop]
>> loop_lookup at drivers/block/loop.c:2144
>> [  215.425105][T18297] [c000200b8606fce0] [c0080fbc3558]
>> loop_control_ioctl+0x120/0x1d0 [loop]
>> [  215.425149][T18297] [c000200b8606fd40] [c04eb688]
>> ksys_ioctl+0xd8/0x130
>> [  215.425190][T18297] [c000200b8606fd90] [c04eb708]
>> sys_ioctl+0x28/0x40
>> [  215.425233][T18297] [c000200b8606fdb0] [c003cc30]
>> system_call_exception+0x110/0x1e0
>> [  215.425274][T18297] [c000200b8606fe20] [c000c9f0]
>> system_call_common+0xf0/0x278
>> [  215.425314][T18297] Instruction dump:
>> [  215.425338][T18297]     
>>    
>> [  215.425374][T18297]     
>>    
>> [  215.425422][T18297] ---[ end trace ebed248fad431966 ]---
>> [  215.642114][T18297] 
>> [  216.642220][T18297] Kernel panic - not syncing: Fatal exception



Re: POWER9 crash due to STRICT_KERNEL_RWX (WAS: Re: Linux-next POWER9 NULL pointer NIP...)

2020-04-16 Thread Qian Cai



> On Apr 16, 2020, at 10:46 PM, Russell Currey  wrote:
> 
> On Thu, 2020-04-16 at 22:40 -0400, Qian Cai wrote:
>>> On Apr 16, 2020, at 10:27 PM, Russell Currey 
>>> wrote:
>>> 
>>> Reverting the patch with the given config will have the same effect
>>> as
>>> STRICT_KERNEL_RWX=n.  Not discounting that it could be a bug on the
>>> powerpc side (i.e. relocatable kernels with strict RWX on haven't
>>> been
>>> exhaustively tested yet), but we should definitely figure out
>>> what's
>>> going on with this bad access first.
>> 
>> BTW, this bad access only happened once. The overwhelming rest of
>> crashes are with NULL pointer NIP like below. How can you explain
>> that STRICT_KERNEL_RWX=n would also make those NULL NIP disappear if
>> STRICT_KERNEL_RWX is just a messenger?
> 
> What happens if you test with STRICT_KERNEL_RWX=y and RELOCATABLE=n,
> reverting my patch?  This would give us an idea of whether it's
> something broken recently or if there's something else going on.

I don’t know what you meant by reverting your patch, because that combination
can be tested as-is. Anyway, it could take a long time to reproduce, so I’ll
keep it running for up to 12 hours to confirm that it really does not crash.

> 
>> 
>> [  215.281666][T16896] LTP: starting chown04_16
>> [  215.424203][T18297] BUG: Unable to handle kernel instruction fetch
>> (NULL pointer?)
>> [  215.424289][T18297] Faulting instruction address: 0x
>> [  215.424313][T18297] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  215.424341][T18297] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256
>> DEBUG_PAGEALLOC NUMA PowerNV
>> [  215.424383][T18297] Modules linked in: loop kvm_hv kvm ip_tables
>> x_tables xfs sd_mod bnx2x mdio tg3 ahci libahci libphy libata
>> firmware_class dm_mirror dm_region_hash dm_log dm_mod
>> [  215.424459][T18297] CPU: 85 PID: 18297 Comm: chown04_16 Tainted:
>> GW 5.6.0-next-20200405+ #3
>> [  215.424489][T18297] NIP:   LR: c0080fbc0408
>> CTR: 
>> [  215.424530][T18297] REGS: c000200b8606f990 TRAP: 0400   Tainted:
>> GW  (5.6.0-next-20200405+)
>> [  215.424570][T18297] MSR:  900040009033
>>   CR: 84000248  XER: 2004
>> [  215.424619][T18297] CFAR: c0080fbc64f4 IRQMASK: 0 
>> [  215.424619][T18297] GPR00: c06c2238 c000200b8606fc20
>> c165ce00  
>> [  215.424619][T18297] GPR04: c000201a58106400 c000200b8606fcc0
>> 5f037e7d 00013bfb 
>> [  215.424619][T18297] GPR08: c000201a58106400 
>>  c1652ee0 
>> [  215.424619][T18297] GPR12:  c000201fff69a600
>>   
>> [  215.424619][T18297] GPR16:  
>>   
>> [  215.424619][T18297] GPR20:  
>>  0007 
>> [  215.424619][T18297] GPR24:  
>> c0080fbc8688 c000200b8606fcc0 
>> [  215.424619][T18297] GPR28:  7fff
>> c0080fbc0400 c00020068b8c0e70 
>> [  215.424914][T18297] NIP [] 0x0
>> [  215.424953][T18297] LR [c0080fbc0408] find_free_cb+0x8/0x30
>> [loop]
>> find_free_cb at drivers/block/loop.c:2129
>> [  215.424997][T18297] Call Trace:
>> [  215.425036][T18297] [c000200b8606fc20] [c06c2290]
>> idr_for_each+0xf0/0x170 (unreliable)
>> [  215.425073][T18297] [c000200b8606fca0] [c0080fbc2744]
>> loop_lookup.part.2+0x4c/0xb0 [loop]
>> loop_lookup at drivers/block/loop.c:2144
>> [  215.425105][T18297] [c000200b8606fce0] [c0080fbc3558]
>> loop_control_ioctl+0x120/0x1d0 [loop]
>> [  215.425149][T18297] [c000200b8606fd40] [c04eb688]
>> ksys_ioctl+0xd8/0x130
>> [  215.425190][T18297] [c000200b8606fd90] [c04eb708]
>> sys_ioctl+0x28/0x40
>> [  215.425233][T18297] [c000200b8606fdb0] [c003cc30]
>> system_call_exception+0x110/0x1e0
>> [  215.425274][T18297] [c000200b8606fe20] [c000c9f0]
>> system_call_common+0xf0/0x278
>> [  215.425314][T18297] Instruction dump:
>> [  215.425338][T18297]     
>>    
>> [  215.425374][T18297]     
>>    
>> [  215.425422][T18297] ---[ end trace ebed248fad431966 ]---
>> [  215.642114][T18297] 
>> [  216.642220][T18297] Kernel panic - not syncing: Fatal exception



Re: POWER9 crash due to STRICT_KERNEL_RWX (WAS: Re: Linux-next POWER9 NULL pointer NIP...)

2020-04-16 Thread Qian Cai



> On Apr 16, 2020, at 10:27 PM, Russell Currey  wrote:
> 
> Reverting the patch with the given config will have the same effect as
> STRICT_KERNEL_RWX=n.  Not discounting that it could be a bug on the
> powerpc side (i.e. relocatable kernels with strict RWX on haven't been
> exhaustively tested yet), but we should definitely figure out what's
> going on with this bad access first.

BTW, this bad access only happened once. The overwhelming majority of the
crashes have a NULL pointer NIP like the one below. How can you explain that
STRICT_KERNEL_RWX=n would also make those NULL NIPs disappear if
STRICT_KERNEL_RWX is just a messenger?

[  215.281666][T16896] LTP: starting chown04_16
[  215.424203][T18297] BUG: Unable to handle kernel instruction fetch (NULL 
pointer?)
[  215.424289][T18297] Faulting instruction address: 0x
[  215.424313][T18297] Oops: Kernel access of bad area, sig: 11 [#1]
[  215.424341][T18297] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
DEBUG_PAGEALLOC NUMA PowerNV
[  215.424383][T18297] Modules linked in: loop kvm_hv kvm ip_tables x_tables 
xfs sd_mod bnx2x mdio tg3 ahci libahci libphy libata firmware_class dm_mirror 
dm_region_hash dm_log dm_mod
[  215.424459][T18297] CPU: 85 PID: 18297 Comm: chown04_16 Tainted: GW  
   5.6.0-next-20200405+ #3
[  215.424489][T18297] NIP:   LR: c0080fbc0408 CTR: 

[  215.424530][T18297] REGS: c000200b8606f990 TRAP: 0400   Tainted: GW  
(5.6.0-next-20200405+)
[  215.424570][T18297] MSR:  900040009033   CR: 
84000248  XER: 2004
[  215.424619][T18297] CFAR: c0080fbc64f4 IRQMASK: 0 
[  215.424619][T18297] GPR00: c06c2238 c000200b8606fc20 
c165ce00  
[  215.424619][T18297] GPR04: c000201a58106400 c000200b8606fcc0 
5f037e7d 00013bfb 
[  215.424619][T18297] GPR08: c000201a58106400  
 c1652ee0 
[  215.424619][T18297] GPR12:  c000201fff69a600 
  
[  215.424619][T18297] GPR16:   
  
[  215.424619][T18297] GPR20:   
 0007 
[  215.424619][T18297] GPR24:   
c0080fbc8688 c000200b8606fcc0 
[  215.424619][T18297] GPR28:  7fff 
c0080fbc0400 c00020068b8c0e70 
[  215.424914][T18297] NIP [] 0x0
[  215.424953][T18297] LR [c0080fbc0408] find_free_cb+0x8/0x30 [loop]
find_free_cb at drivers/block/loop.c:2129
[  215.424997][T18297] Call Trace:
[  215.425036][T18297] [c000200b8606fc20] [c06c2290] 
idr_for_each+0xf0/0x170 (unreliable)
[  215.425073][T18297] [c000200b8606fca0] [c0080fbc2744] 
loop_lookup.part.2+0x4c/0xb0 [loop]
loop_lookup at drivers/block/loop.c:2144
[  215.425105][T18297] [c000200b8606fce0] [c0080fbc3558] 
loop_control_ioctl+0x120/0x1d0 [loop]
[  215.425149][T18297] [c000200b8606fd40] [c04eb688] 
ksys_ioctl+0xd8/0x130
[  215.425190][T18297] [c000200b8606fd90] [c04eb708] sys_ioctl+0x28/0x40
[  215.425233][T18297] [c000200b8606fdb0] [c003cc30] 
system_call_exception+0x110/0x1e0
[  215.425274][T18297] [c000200b8606fe20] [c000c9f0] 
system_call_common+0xf0/0x278
[  215.425314][T18297] Instruction dump:
[  215.425338][T18297]       
  
[  215.425374][T18297]       
  
[  215.425422][T18297] ---[ end trace ebed248fad431966 ]---
[  215.642114][T18297] 
[  216.642220][T18297] Kernel panic - not syncing: Fatal exception

POWER9 crash due to STRICT_KERNEL_RWX (WAS: Re: Linux-next POWER9 NULL pointer NIP...)

2020-04-16 Thread Qian Cai
OK, either reverting the commit,

c55d7b5e6426 (“powerpc: Remove STRICT_KERNEL_RWX incompatibility with 
RELOCATABLE”)

or setting STRICT_KERNEL_RWX=n fixed the crash below, which was also mentioned
in this thread,

https://lore.kernel.org/lkml/15ac5b0e-a221-4b8c-9039-fa96b8ef7...@lca.pw/

[  148.110969][T13115] LTP: starting chown04_16
[  148.255048][T13380] kernel tried to execute exec-protected page 
(c16804ac) - exploit attempt? (uid: 0)
[  148.255099][T13380] BUG: Unable to handle kernel instruction fetch
[  148.255122][T13380] Faulting instruction address: 0xc16804ac
[  148.255136][T13380] Oops: Kernel access of bad area, sig: 11 [#1]
[  148.255157][T13380] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
DEBUG_PAGEALLOC NUMA PowerNV
[  148.255171][T13380] Modules linked in: loop kvm_hv kvm xfs sd_mod bnx2x mdio 
ahci tg3 libahci libphy libata firmware_class dm_mirror dm_region_hash dm_log 
dm_mod
[  148.255213][T13380] CPU: 45 PID: 13380 Comm: chown04_16 Tainted: GW  
   5.6.0+ #7
[  148.255236][T13380] NIP:  c16804ac LR: c0080fa60408 CTR: 
c16804ac
[  148.255250][T13380] REGS: c010a6fafa00 TRAP: 0400   Tainted: GW  
(5.6.0+)
[  148.255281][T13380] MSR:  900010009033   CR: 
84000248  XER: 2004
[  148.255310][T13380] CFAR: c0080fa66534 IRQMASK: 0 
[  148.255310][T13380] GPR00: c0973268 c010a6fafc90 
c1648200  
[  148.255310][T13380] GPR04: c00d8a22dc00 c010a6fafd30 
b5e98331 00012c9f 
[  148.255310][T13380] GPR08: c00d8a22dc00  
 c163c520 
[  148.255310][T13380] GPR12: c16804ac c01dad80 
  
[  148.255310][T13380] GPR16:   
  
[  148.255310][T13380] GPR20:   
  
[  148.255310][T13380] GPR24: 7fff8f5e2e48  
c0080fa6a488 c010a6fafd30 
[  148.255310][T13380] GPR28:  7fff 
c0080fa60400 c00efd0c6780 
[  148.255494][T13380] NIP [c16804ac] sysctl_net_busy_read+0x0/0x4
[  148.255516][T13380] LR [c0080fa60408] find_free_cb+0x8/0x30 [loop]
[  148.255528][T13380] Call Trace:
[  148.255538][T13380] [c010a6fafc90] [c09732c0] 
idr_for_each+0xf0/0x170 (unreliable)
[  148.255572][T13380] [c010a6fafd10] [c0080fa626c4] 
loop_lookup.part.1+0x4c/0xb0 [loop]
[  148.255597][T13380] [c010a6fafd50] [c0080fa634d8] 
loop_control_ioctl+0x120/0x1d0 [loop]
[  148.255623][T13380] [c010a6fafdb0] [c04ddc08] 
ksys_ioctl+0xd8/0x130
[  148.255636][T13380] [c010a6fafe00] [c04ddc88] sys_ioctl+0x28/0x40
[  148.255669][T13380] [c010a6fafe20] [c000b378] 
system_call+0x5c/0x68
[  148.255699][T13380] Instruction dump:
[  148.255718][T13380]       
  
[  148.255744][T13380]       
  
[  148.255772][T13380] ---[ end trace a5894a74208c22ec ]---
[  148.576663][T13380] 
[  149.576765][T13380] Kernel panic - not syncing: Fatal exception




Re: Linux-next POWER9 NULL pointer NIP since 1st Apr.

2020-04-15 Thread Qian Cai



> On Apr 10, 2020, at 3:20 PM, Qian Cai  wrote:
> 
> 
> 
>> On Apr 9, 2020, at 10:14 AM, Steven Rostedt  wrote:
>> 
>> On Thu, 9 Apr 2020 06:06:35 -0400
>> Qian Cai  wrote:
>> 
>>>>> I’ll go to bisect some more but it is going to take a while.
>>>>> 
>>>>> $ git log --oneline 4c205c84e249..8e99cf91b99b
>>>>> 8e99cf91b99b tracing: Do not allocate buffer in trace_find_next_entry() 
>>>>> in atomic
>>>>> 2ab2a0924b99 tracing: Add documentation on set_ftrace_notrace_pid and 
>>>>> set_event_notrace_pid
>>>>> ebed9628f5c2 selftests/ftrace: Add test to test new set_event_notrace_pid 
>>>>> file
>>>>> ed8839e072b8 selftests/ftrace: Add test to test new 
>>>>> set_ftrace_notrace_pid file
>>>>> 276836260301 tracing: Create set_event_notrace_pid to not trace tasks  
>>>> 
>>>>> b3b1e6ededa4 ftrace: Create set_ftrace_notrace_pid to not trace tasks
>>>>> 717e3f5ebc82 ftrace: Make function trace pid filtering a bit more exact  
>>>> 
>>>> If it is affecting function tracing, it is probably one of the above two
>>>> commits.  
>>> 
>>> OK, it was narrowed down to one of those messed with mcount here,
>> 
>> Thing is, nothing here touches mcount.
> 
> Yes, you are right. I went back to test the commit just before the 5.7-trace 
> merge request,
> I did reproduce there. The thing is that this bastard could take more 6-hour 
> to happen,
> so my previous attempt did not wait long enough. Back to the square one…

OK, I have started testing each commit for up to 12 hours. The progress so far is,

BAD: v5.6-rc1
GOOD: v5.5
GOOD: 153b5c566d30 Merge tag 'microblaze-v5.6-rc1' of 
git://git.monstr.eu/linux-2.6-microblaze

Next I’ll be testing:

71c3a888cbca Merge tag 'powerpc-5.6-1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

If that is BAD, the merge request is the culprit. A few commits look more
related than others:

5290ae2b8e5f powerpc/64: Use {SAVE,REST}_NVGPRS macros
ed0bc98f8cbe powerpc/64s: Reimplement power4_idle code in C

Does that ring any bells yet?
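
The soak-time rule above (a commit only counts as GOOD after it has survived
the whole test window) could be sketched as a helper for "git bisect run".
This is only a sketch, not the script used for the results above:
soak_verdict, the window length and the reproducer command are hypothetical
names, and it assumes the reproducer is driven from a machine that survives
the crash. It relies on GNU timeout exiting with status 124 when the window
expires.

```shell
#!/bin/sh
# Hypothetical helper: classify one commit for "git bisect run".
# Usage: soak_verdict SECONDS COMMAND [ARGS...]
# Returns 0 (good) if COMMAND exits cleanly or is still running when the
# soak window ends; returns 1 (bad) if it fails before the window ends.
soak_verdict() {
    window=$1
    shift
    status=0
    # GNU timeout exits with 124 when it had to stop the command.
    timeout "$window" "$@" || status=$?
    case $status in
        0|124) return 0 ;;   # clean exit, or survived the whole window
        *)     return 1 ;;   # failed early: treat as reproduced
    esac
}
```

With a 12-hour window this would be called as something like
"soak_verdict 43200 ./reproduce.sh" (reproduce.sh being a placeholder);
"git bisect run" treats exit status 0 as good and 1 as bad.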




Re: Linux-next POWER9 NULL pointer NIP since 1st Apr.

2020-04-10 Thread Qian Cai



> On Apr 9, 2020, at 10:14 AM, Steven Rostedt  wrote:
> 
> On Thu, 9 Apr 2020 06:06:35 -0400
> Qian Cai  wrote:
> 
>>>> I’ll go to bisect some more but it is going to take a while.
>>>> 
>>>> $ git log --oneline 4c205c84e249..8e99cf91b99b
>>>> 8e99cf91b99b tracing: Do not allocate buffer in trace_find_next_entry() in 
>>>> atomic
>>>> 2ab2a0924b99 tracing: Add documentation on set_ftrace_notrace_pid and 
>>>> set_event_notrace_pid
>>>> ebed9628f5c2 selftests/ftrace: Add test to test new set_event_notrace_pid 
>>>> file
>>>> ed8839e072b8 selftests/ftrace: Add test to test new set_ftrace_notrace_pid 
>>>> file
>>>> 276836260301 tracing: Create set_event_notrace_pid to not trace tasks  
>>> 
>>>> b3b1e6ededa4 ftrace: Create set_ftrace_notrace_pid to not trace tasks
>>>> 717e3f5ebc82 ftrace: Make function trace pid filtering a bit more exact  
>>> 
>>> If it is affecting function tracing, it is probably one of the above two
>>> commits.  
>> 
>> OK, it was narrowed down to one of those messed with mcount here,
> 
> Thing is, nothing here touches mcount.

Yes, you are right. I went back to test the commit just before the 5.7-trace
merge request, and I did reproduce it there. The thing is that this bastard
can take more than 6 hours to happen, so my previous attempt did not wait
long enough. Back to square one...

> 
>> 
>> 8e99cf91b99b tracing: Do not allocate buffer in trace_find_next_entry() in 
>> atomic
> 
> Touches reading the trace buffer.
> 
>> 2ab2a0924b99 tracing: Add documentation on set_ftrace_notrace_pid and 
>> set_event_notrace_pid
> 
> Documentation.
> 
>> 6a13a0d7b4d1 ftrace/kprobe: Show the maxactive number on kprobe_events
> 
> kprobe output.
> 
>> c9b7a4a72ff6 ring-buffer/tracing: Have iterator acknowledge dropped events
> 
> Reading the buffer.
> 
>> 06e0a548bad0 tracing: Do not disable tracing when reading the trace file
> 
> Reading the buffer.
> 
>> 1039221cc278 ring-buffer: Do not disable recording when there is an iterator
> 
> Reading the buffer.
> 
>> 07b8b10ec94f ring-buffer: Make resize disable per cpu buffer instead of 
>> total buffer
> 
> Resizing the buffer.
> 
>> 153368ce1bd0 ring-buffer: Optimize rb_iter_head_event()
> 
> Reading the buffer.
> 
>> ff84c50cfb4b ring-buffer: Do not die if rb_iter_peek() fails more than thrice
> 
> Reading the buffer.
> 
>> 785888c544e0 ring-buffer: Have rb_iter_head_event() handle concurrent writer
> 
> Reading the buffer.
> 
>> 28e3fc56a471 ring-buffer: Add page_stamp to iterator for synchronization
> 
> Reading the buffer.
> 
>> bc1a72afdc4a ring-buffer: Rename ring_buffer_read() to 
>> read_buffer_iter_advance()
> 
> Reading the buffer.
> 
>> ead6ecfddea5 ring-buffer: Have ring_buffer_empty() not depend on tracing 
>> stopped
> 
> Reading the buffer.
> 
>> ff895103a84a tracing: Save off entry when peeking at next entry
> 
> Reading the buffer.
> 
>> bf2cbe044da2 tracing: Use address-of operator on section symbols
> 
> Affects trace_printk()
> 
>> bbd9d05618a6 gpu/trace: add a gpu total memory usage tracepoint
> 
> New tracepoint infrastructure (just new trace events for gpu)
> 
>> 89b74cac7834 tools/bootconfig: Show line and column in parse error
> 
> Extended command line boot config.
> 
>> 306b69dce926 bootconfig: Support O= option
> 
> Extended command line boot config
> 
>> 5412e0b763e0 tracing: Remove unused TRACE_BUFFER bits
> 
> Removed unused enums.
> 
>> b396bfdebffc tracing: Have hwlat ts be first instance and record count of 
>> instances
> 
> Affects only the hardware latency detector (most likely not even
> configured in the kernel).
> 
> So I don't understand how any of the above commits can cause a problem.
> 
> -- Steve



Re: Linux-next POWER9 NULL pointer NIP since 1st Apr.

2020-04-09 Thread Qian Cai



> On Apr 7, 2020, at 9:30 AM, Steven Rostedt  wrote:
> 
> On Tue, 7 Apr 2020 09:01:10 -0400
> Qian Cai  wrote:
> 
>> + Steven
>> 
>>> On Apr 7, 2020, at 8:42 AM, Michael Ellerman  wrote:
>>> 
>>> Qian Cai  writes:  
>>>> Ever since 1st Apr, linux-next has been triggering a NULL pointer NIP on
>>>> POWER9 (see below) using this config,
>>>> 
>>>> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
>>>> 
>>>> It takes a while to reproduce, so before I bury myself in bisecting, I am
>>>> just sending a heads-up to see if anyone spots anything obvious.
>>>> 
>>>> [  206.744625][T13224] LTP: starting fallocate04
>>>> [  207.601583][T27684] /dev/zero: Can't open blockdev
>>>> [  208.674301][T27684] EXT4-fs (loop0): mounting ext3 file system using 
>>>> the ext4 subsystem
>>>> [  208.680347][T27684] BUG: Unable to handle kernel instruction fetch 
>>>> (NULL pointer?)
>>>> [  208.680383][T27684] Faulting instruction address: 0x
>>>> [  208.680406][T27684] Oops: Kernel access of bad area, sig: 11 [#1]
>>>> [  208.680439][T27684] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
>>>> DEBUG_PAGEALLOC NUMA PowerNV
>>>> [  208.680474][T27684] Modules linked in: ext4 crc16 mbcache jbd2 loop 
>>>> kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci mdio tg3 
>>>> libata libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
>>>> [  208.680576][T27684] CPU: 117 PID: 27684 Comm: fallocate04 Tainted: G
>>>> W 5.6.0-next-20200401+ #288
>>>> [  208.680614][T27684] NIP:   LR: c008102c0048 CTR: 
>>>> 
>>>> [  208.680657][T27684] REGS: c000200361def420 TRAP: 0400   Tainted: G  
>>>>   W  (5.6.0-next-20200401+)
>>>> [  208.680700][T27684] MSR:  90004280b033 
>>>>   CR: 4208  XER: 2004
>>>> [  208.680760][T27684] CFAR: c0081032c494 IRQMASK: 0 
>>>> [  208.680760][T27684] GPR00: c05ac3f8 c000200361def6b0 
>>>> c165c200 c00020107dae0bd0 
>>>> [  208.680760][T27684] GPR04:  0400 
>>>>   
>>>> [  208.680760][T27684] GPR08: c000200361def6e8 c008102c0040 
>>>> 7fff c1614e80 
>>>> [  208.680760][T27684] GPR12:  c000201fff671280 
>>>>  0002 
>>>> [  208.680760][T27684] GPR16: 0002 00040001 
>>>> c00020030f5a1000 c00020030f5a1548 
>>>> [  208.680760][T27684] GPR20: c15fbad8 c168c654 
>>>> c000200361def818 c05b4c10 
>>>> [  208.680760][T27684] GPR24:  c008103365b8 
>>>> c00020107dae0bd0 0400 
>>>> [  208.680760][T27684] GPR28: c168c3a8  
>>>>   
>>>> [  208.681014][T27684] NIP [] 0x0
>>>> [  208.681065][T27684] LR [c008102c0048] ext4_iomap_end+0x8/0x30 
>>>> [ext4]  
>>> 
>>> That LR looks like it's pointing to the return from _mcount in
>>> ext4_iomap_end(), which means we have probably crashed in ftrace
>>> somewhere.
>>> 
>>> Did you have tracing enabled when you ran the test? Or does it do
>>> tracing itself?  
>> 
>> Yes, it runs ftrace first, before running LTP to trigger it,
>> 
>> https://github.com/cailca/linux-mm/blob/master/test.sh
>> 
>> echo function > /sys/kernel/debug/tracing/current_tracer
>> echo nop > /sys/kernel/debug/tracing/current_tracer
>> 
>> There is another crash, with a non-NULL NIP, but then the symbol behaves weirdly.
>> 
>> # ./scripts/faddr2line vmlinux sysctl_net_busy_read+0x0/0x4
>> skipping sysctl_net_busy_read address at 0xc16804ac due to 
>> non-function symbol of type 'D'
>> 
>> [  148.110969][T13115] LTP: starting chown04_16
>> [  148.255048][T13380] kernel tried to execute exec-protected page 
>> (c16804ac) - exploit attempt? (uid: 0)
>> [  148.255099][T13380] BUG: Unable to handle kernel instruction fetch
>> [  148.255122][T13380] Faulting instruction address: 0xc16804ac
>> [  148.255136][T13380] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  148.255157][T13380] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 

Re: Linux-next POWER9 NULL pointer NIP since 1st Apr.

2020-04-08 Thread Qian Cai



> On Apr 7, 2020, at 9:30 AM, Steven Rostedt  wrote:
> 
> On Tue, 7 Apr 2020 09:01:10 -0400
> Qian Cai  wrote:
> 
>> + Steven
>> 
>>> On Apr 7, 2020, at 8:42 AM, Michael Ellerman  wrote:
>>> 
>>> Qian Cai  writes:  
>>>> Ever since 1st Apr, linux-next has been triggering a NULL pointer NIP on
>>>> POWER9 (see below) using this config,
>>>> 
>>>> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
>>>> 
>>>> It takes a while to reproduce, so before I bury myself in bisecting, I am
>>>> just sending a heads-up to see if anyone spots anything obvious.
>>>> 
>>>> [  206.744625][T13224] LTP: starting fallocate04
>>>> [  207.601583][T27684] /dev/zero: Can't open blockdev
>>>> [  208.674301][T27684] EXT4-fs (loop0): mounting ext3 file system using 
>>>> the ext4 subsystem
>>>> [  208.680347][T27684] BUG: Unable to handle kernel instruction fetch 
>>>> (NULL pointer?)
>>>> [  208.680383][T27684] Faulting instruction address: 0x
>>>> [  208.680406][T27684] Oops: Kernel access of bad area, sig: 11 [#1]
>>>> [  208.680439][T27684] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
>>>> DEBUG_PAGEALLOC NUMA PowerNV
>>>> [  208.680474][T27684] Modules linked in: ext4 crc16 mbcache jbd2 loop 
>>>> kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci mdio tg3 
>>>> libata libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
>>>> [  208.680576][T27684] CPU: 117 PID: 27684 Comm: fallocate04 Tainted: G
>>>> W 5.6.0-next-20200401+ #288
>>>> [  208.680614][T27684] NIP:   LR: c008102c0048 CTR: 
>>>> 
>>>> [  208.680657][T27684] REGS: c000200361def420 TRAP: 0400   Tainted: G  
>>>>   W  (5.6.0-next-20200401+)
>>>> [  208.680700][T27684] MSR:  90004280b033 
>>>>   CR: 4208  XER: 2004
>>>> [  208.680760][T27684] CFAR: c0081032c494 IRQMASK: 0 
>>>> [  208.680760][T27684] GPR00: c05ac3f8 c000200361def6b0 
>>>> c165c200 c00020107dae0bd0 
>>>> [  208.680760][T27684] GPR04:  0400 
>>>>   
>>>> [  208.680760][T27684] GPR08: c000200361def6e8 c008102c0040 
>>>> 7fff c1614e80 
>>>> [  208.680760][T27684] GPR12:  c000201fff671280 
>>>>  0002 
>>>> [  208.680760][T27684] GPR16: 0002 00040001 
>>>> c00020030f5a1000 c00020030f5a1548 
>>>> [  208.680760][T27684] GPR20: c15fbad8 c168c654 
>>>> c000200361def818 c05b4c10 
>>>> [  208.680760][T27684] GPR24:  c008103365b8 
>>>> c00020107dae0bd0 0400 
>>>> [  208.680760][T27684] GPR28: c168c3a8  
>>>>   
>>>> [  208.681014][T27684] NIP [] 0x0
>>>> [  208.681065][T27684] LR [c008102c0048] ext4_iomap_end+0x8/0x30 
>>>> [ext4]  
>>> 
>>> That LR looks like it's pointing to the return from _mcount in
>>> ext4_iomap_end(), which means we have probably crashed in ftrace
>>> somewhere.
>>> 
>>> Did you have tracing enabled when you ran the test? Or does it do
>>> tracing itself?  
>> 
>> Yes, it runs ftrace first, before running LTP to trigger it,
>> 
>> https://github.com/cailca/linux-mm/blob/master/test.sh
>> 
>> echo function > /sys/kernel/debug/tracing/current_tracer
>> echo nop > /sys/kernel/debug/tracing/current_tracer
>> 
>> There is another crash, with a non-NULL NIP, but then the symbol behaves weirdly.
>> 
>> # ./scripts/faddr2line vmlinux sysctl_net_busy_read+0x0/0x4
>> skipping sysctl_net_busy_read address at 0xc16804ac due to 
>> non-function symbol of type 'D'
>> 
>> [  148.110969][T13115] LTP: starting chown04_16
>> [  148.255048][T13380] kernel tried to execute exec-protected page 
>> (c16804ac) - exploit attempt? (uid: 0)
>> [  148.255099][T13380] BUG: Unable to handle kernel instruction fetch
>> [  148.255122][T13380] Faulting instruction address: 0xc16804ac
>> [  148.255136][T13380] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  148.255157][T13380] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 

Re: Linux-next POWER9 NULL pointer NIP since 1st Apr.

2020-04-07 Thread Qian Cai
+ Steven

> On Apr 7, 2020, at 8:42 AM, Michael Ellerman  wrote:
> 
> Qian Cai  writes:
>> Ever since 1st Apr, linux-next has been triggering a NULL pointer NIP on
>> POWER9 (see below) using this config,
>> 
>> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
>> 
>> It takes a while to reproduce, so before I bury myself in bisecting, I am
>> just sending a heads-up to see if anyone spots anything obvious.
>> 
>> [  206.744625][T13224] LTP: starting fallocate04
>> [  207.601583][T27684] /dev/zero: Can't open blockdev
>> [  208.674301][T27684] EXT4-fs (loop0): mounting ext3 file system using the 
>> ext4 subsystem
>> [  208.680347][T27684] BUG: Unable to handle kernel instruction fetch (NULL 
>> pointer?)
>> [  208.680383][T27684] Faulting instruction address: 0x
>> [  208.680406][T27684] Oops: Kernel access of bad area, sig: 11 [#1]
>> [  208.680439][T27684] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
>> DEBUG_PAGEALLOC NUMA PowerNV
>> [  208.680474][T27684] Modules linked in: ext4 crc16 mbcache jbd2 loop 
>> kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci mdio tg3 libata 
>> libphy firmware_class dm_mirror dm_region_hash dm_log dm_mod
>> [  208.680576][T27684] CPU: 117 PID: 27684 Comm: fallocate04 Tainted: G  
>>   W 5.6.0-next-20200401+ #288
>> [  208.680614][T27684] NIP:   LR: c008102c0048 CTR: 
>> 
>> [  208.680657][T27684] REGS: c000200361def420 TRAP: 0400   Tainted: G
>> W  (5.6.0-next-20200401+)
>> [  208.680700][T27684] MSR:  90004280b033 
>>   CR: 4208  XER: 2004
>> [  208.680760][T27684] CFAR: c0081032c494 IRQMASK: 0 
>> [  208.680760][T27684] GPR00: c05ac3f8 c000200361def6b0 
>> c165c200 c00020107dae0bd0 
>> [  208.680760][T27684] GPR04:  0400 
>>   
>> [  208.680760][T27684] GPR08: c000200361def6e8 c008102c0040 
>> 7fff c1614e80 
>> [  208.680760][T27684] GPR12:  c000201fff671280 
>>  0002 
>> [  208.680760][T27684] GPR16: 0002 00040001 
>> c00020030f5a1000 c00020030f5a1548 
>> [  208.680760][T27684] GPR20: c15fbad8 c168c654 
>> c000200361def818 c05b4c10 
>> [  208.680760][T27684] GPR24:  c008103365b8 
>> c00020107dae0bd0 0400 
>> [  208.680760][T27684] GPR28: c168c3a8  
>>   
>> [  208.681014][T27684] NIP [] 0x0
>> [  208.681065][T27684] LR [c008102c0048] ext4_iomap_end+0x8/0x30 [ext4]
> 
> That LR looks like it's pointing to the return from _mcount in
> ext4_iomap_end(), which means we have probably crashed in ftrace
> somewhere.
> 
> Did you have tracing enabled when you ran the test? Or does it do
> tracing itself?

Yes, it runs ftrace first, before running LTP to trigger it,

https://github.com/cailca/linux-mm/blob/master/test.sh

echo function > /sys/kernel/debug/tracing/current_tracer
echo nop > /sys/kernel/debug/tracing/current_tracer

There is another crash, with a non-NULL NIP, but then the symbol behaves weirdly.

# ./scripts/faddr2line vmlinux sysctl_net_busy_read+0x0/0x4
skipping sysctl_net_busy_read address at 0xc16804ac due to non-function 
symbol of type 'D'

[  148.110969][T13115] LTP: starting chown04_16
[  148.255048][T13380] kernel tried to execute exec-protected page 
(c16804ac) - exploit attempt? (uid: 0)
[  148.255099][T13380] BUG: Unable to handle kernel instruction fetch
[  148.255122][T13380] Faulting instruction address: 0xc16804ac
[  148.255136][T13380] Oops: Kernel access of bad area, sig: 11 [#1]
[  148.255157][T13380] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
DEBUG_PAGEALLOC NUMA PowerNV
[  148.255171][T13380] Modules linked in: loop kvm_hv kvm xfs sd_mod bnx2x mdio 
ahci tg3 libahci libphy libata firmware_class dm_mirror dm_region_hash dm_log 
dm_mod
[  148.255213][T13380] CPU: 45 PID: 13380 Comm: chown04_16 Tainted: GW  
   5.6.0+ #7
[  148.255236][T13380] NIP:  c16804ac LR: c0080fa60408 CTR: 
c16804ac
[  148.255250][T13380] REGS: c010a6fafa00 TRAP: 0400   Tainted: GW  
(5.6.0+)
[  148.255281][T13380] MSR:  900010009033   CR: 
84000248  XER: 2004
[  148.255310][T13380] CFAR: c0080fa66534 IRQMASK: 0 
[  148.255310][T13380] GPR00: c0973268 c010a6fafc90 
c1648200  
[  148.255310][T13380] GPR04: c00d8a22dc00 c010a6fafd30 
b5e98331 00012c9f 
[  148.255310][T13380] GPR0

Linux-next POWER9 NULL pointer NIP since 1st Apr.

2020-04-06 Thread Qian Cai
Ever since 1st Apr, linux-next has been triggering a NULL pointer NIP on POWER9
(see below) using this config,

https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config

It takes a while to reproduce, so before I bury myself in bisecting, I am just
sending a heads-up to see if anyone spots anything obvious.

[  206.744625][T13224] LTP: starting fallocate04
[  207.601583][T27684] /dev/zero: Can't open blockdev
[  208.674301][T27684] EXT4-fs (loop0): mounting ext3 file system using the 
ext4 subsystem
[  208.680347][T27684] BUG: Unable to handle kernel instruction fetch (NULL 
pointer?)
[  208.680383][T27684] Faulting instruction address: 0x
[  208.680406][T27684] Oops: Kernel access of bad area, sig: 11 [#1]
[  208.680439][T27684] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
DEBUG_PAGEALLOC NUMA PowerNV
[  208.680474][T27684] Modules linked in: ext4 crc16 mbcache jbd2 loop kvm_hv 
kvm ip_tables x_tables xfs sd_mod bnx2x ahci libahci mdio tg3 libata libphy 
firmware_class dm_mirror dm_region_hash dm_log dm_mod
[  208.680576][T27684] CPU: 117 PID: 27684 Comm: fallocate04 Tainted: G
W 5.6.0-next-20200401+ #288
[  208.680614][T27684] NIP:   LR: c008102c0048 CTR: 

[  208.680657][T27684] REGS: c000200361def420 TRAP: 0400   Tainted: GW  
(5.6.0-next-20200401+)
[  208.680700][T27684] MSR:  90004280b033 
  CR: 4208  XER: 2004
[  208.680760][T27684] CFAR: c0081032c494 IRQMASK: 0 
[  208.680760][T27684] GPR00: c05ac3f8 c000200361def6b0 
c165c200 c00020107dae0bd0 
[  208.680760][T27684] GPR04:  0400 
  
[  208.680760][T27684] GPR08: c000200361def6e8 c008102c0040 
7fff c1614e80 
[  208.680760][T27684] GPR12:  c000201fff671280 
 0002 
[  208.680760][T27684] GPR16: 0002 00040001 
c00020030f5a1000 c00020030f5a1548 
[  208.680760][T27684] GPR20: c15fbad8 c168c654 
c000200361def818 c05b4c10 
[  208.680760][T27684] GPR24:  c008103365b8 
c00020107dae0bd0 0400 
[  208.680760][T27684] GPR28: c168c3a8  
  
[  208.681014][T27684] NIP [] 0x0
[  208.681065][T27684] LR [c008102c0048] ext4_iomap_end+0x8/0x30 [ext4]
[  208.681091][T27684] Call Trace:
[  208.681129][T27684] [c000200361def6b0] [c05ac3bc] 
iomap_apply+0x20c/0x920 (unreliable)
iomap_apply at fs/iomap/apply.c:80 (discriminator 4)
[  208.681173][T27684] [c000200361def7f0] [c05b4adc] 
iomap_bmap+0xfc/0x160
iomap_bmap at fs/iomap/fiemap.c:142
[  208.681228][T27684] [c000200361def850] [c008102c2c1c] 
ext4_bmap+0xa4/0x180 [ext4]
ext4_bmap at fs/ext4/inode.c:3213
[  208.681260][T27684] [c000200361def890] [c04f71fc] bmap+0x4c/0x80
[  208.681281][T27684] [c000200361def8c0] [c0080fdb0acc] 
jbd2_journal_init_inode+0x44/0x1a0 [jbd2]
jbd2_journal_init_inode at fs/jbd2/journal.c:1255
[  208.681326][T27684] [c000200361def960] [c0081031c808] 
ext4_load_journal+0x440/0x860 [ext4]
[  208.681371][T27684] [c000200361defa30] [c00810322a14] 
ext4_fill_super+0x342c/0x3ab0 [ext4]
[  208.681414][T27684] [c000200361defba0] [c04cb0bc] 
mount_bdev+0x25c/0x290
[  208.681478][T27684] [c000200361defc40] [c00810310250] 
ext4_mount+0x28/0x50 [ext4]
[  208.681520][T27684] [c000200361defc60] [c053242c] 
legacy_get_tree+0x4c/0xb0
[  208.681556][T27684] [c000200361defc90] [c04c864c] 
vfs_get_tree+0x4c/0x130
[  208.681593][T27684] [c000200361defd00] [c050a1c8] 
do_mount+0xa18/0xc50
[  208.681641][T27684] [c000200361defdd0] [c050a9a8] 
sys_mount+0x158/0x180
[  208.681679][T27684] [c000200361defe20] [c000b3f8] 
system_call+0x5c/0x68
[  208.681726][T27684] Instruction dump:
[  208.681747][T27684]       
  
[  208.681797][T27684]       
  
[  208.681839][T27684] ---[ end trace 4e9e2bab7f1d4048 ]---
[  208.802259][T27684] 
[  209.802373][T27684] Kernel panic - not syncing: Fatal exception

[  215.281666][T16896] LTP: starting chown04_16
[  215.424203][T18297] BUG: Unable to handle kernel instruction fetch (NULL 
pointer?)
[  215.424289][T18297] Faulting instruction address: 0x
[  215.424313][T18297] Oops: Kernel access of bad area, sig: 11 [#1]
[  215.424341][T18297] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 
DEBUG_PAGEALLOC NUMA PowerNV
[  215.424383][T18297] Modules linked in: loop kvm_hv kvm ip_tables x_tables 
xfs sd_mod bnx2x mdio tg3 ahci libahci libphy libata firmware_class dm_mirror 
dm_region_hash dm_log dm_mod
[  215.424459][T18297] CPU: 85 PID: 18297 Comm: chown04_16 Tainted: GW  
   5.6.0-next-20200405+ #3
[  215.424489][T18297] NIP:   LR: c0080fbc0408 CTR: 

Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Qian Cai



> On Apr 2, 2020, at 11:54 AM, Paul E. McKenney  wrote:
> 
> I do run this combination quite frequently, but only as part of
> rcutorture, which might not be a representative workload.  For one thing,
> it has a minimal userspace consisting only of a trivial init program.
> I don't recall having ever seen this.  (I have seen one recent complaint
> about an IPI being sent to an offline CPU, but I cannot prove that this
> was not due to RCU bugs that I was chasing at the time.)

Yes, it is hard to hit with a trivial init, while running systemd should be
able to catch it, since it uses cgroups.

Re: [PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-02 Thread Qian Cai



> On Apr 2, 2020, at 7:24 AM, Michael Ellerman  wrote:
> 
> Qian Cai  writes:
>> From: Peter Zijlstra 
>> 
>> In the CPU-offline process, it calls mmdrop() after idle entry and the
>> subsequent call to cpuhp_report_idle_dead(). Once execution passes the
>> call to rcu_report_dead(), RCU is ignoring the CPU, which results in
>> lockdep complaining when mmdrop() uses RCU from either memcg or
>> debugobjects below.
>> 
>> Fix it by cleaning up the active_mm state from BP instead. Every arch
>> which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
>> from AP. The only exception is parisc because it switches them to
>> &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
>> but the patch will still work there because it calls mmgrab(&init_mm) in
>> smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
> 
> Thanks for debugging this. How did you hit it in the first place?

Just repeatedly offline/online CPUs, which will eventually cause an idle
thread refcount to go to 0 and trigger __mmdrop(). Of course, lockdep
(PROVE_RCU?) needs to be enabled, and you need some luck to hit the cgroup,
workqueue or debugobject code paths that call RCU.
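
The offline/online cycling described above could be sketched as a small loop
over the sysfs "online" files. This is a sketch under assumptions, not the
script that found the bug: cycle_cpus and the CPU_SYSFS override are
hypothetical names, the latter added only so the loop can be dry-run against
a fake directory tree. On a real system it has to run as root, on a kernel
with lockdep enabled.

```shell
#!/bin/sh
# Hypothetical stress loop: offline and re-online every hotpluggable CPU.
# CPU_SYSFS may be overridden to dry-run against a fake tree.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

# Usage: cycle_cpus ROUNDS
cycle_cpus() {
    rounds=$1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for f in "$CPU_SYSFS"/cpu[0-9]*/online; do
            # Skip CPUs without a writable "online" file (often the boot CPU).
            [ -w "$f" ] || continue
            echo 0 > "$f"   # take the CPU offline
            echo 1 > "$f"   # bring it back online
        done
        i=$((i + 1))
    done
}
```

Something like "cycle_cpus 1000" may be needed before the warning shows up,
since hitting it depends on which code path the final mmdrop() takes.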

> 
> A link to the original thread would have helped me:
> 
>  https://lore.kernel.org/lkml/20200113190331.12788-1-...@lca.pw/
> 
>> WARNING: suspicious RCU usage
>> -
>> kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
>> 
>> other info that might help us debug this:
>> 
>> RCU used illegally from offline CPU!
>> Call Trace:
>> dump_stack+0xf4/0x164 (unreliable)
>> lockdep_rcu_suspicious+0x140/0x164
>> get_work_pool+0x110/0x150
>> __queue_work+0x1bc/0xca0
>> queue_work_on+0x114/0x120
>> css_release+0x9c/0xc0
>> percpu_ref_put_many+0x204/0x230
>> free_pcp_prepare+0x264/0x570
>> free_unref_page+0x38/0xf0
>> __mmdrop+0x21c/0x2c0
>> idle_task_exit+0x170/0x1b0
>> pnv_smp_cpu_kill_self+0x38/0x2e0
>> cpu_die+0x48/0x64
>> arch_cpu_idle_dead+0x30/0x50
>> do_idle+0x2f4/0x470
>> cpu_startup_entry+0x38/0x40
>> start_secondary+0x7a8/0xa80
>> start_secondary_resume+0x10/0x14
> 
> Do we know when this started happening? ie. can we determine a Fixes
> tag?

I don’t know. I looked at some commits, and it seems the code was like that
even 10 years ago. It must be that nobody runs lockdep (PROVE_RCU?) with CPU
hotplug very regularly.

> 
>> 
>> Signed-off-by: Qian Cai 
>> ---
>> arch/powerpc/platforms/powernv/smp.c |  1 -
>> include/linux/sched/mm.h |  2 ++
>> kernel/cpu.c | 18 +-
>> kernel/sched/core.c  |  5 +++--
>> 4 files changed, 22 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/smp.c 
>> b/arch/powerpc/platforms/powernv/smp.c
>> index 13e251699346..b2ba3e95bda7 100644
>> --- a/arch/powerpc/platforms/powernv/smp.c
>> +++ b/arch/powerpc/platforms/powernv/smp.c
>> @@ -167,7 +167,6 @@ static void pnv_smp_cpu_kill_self(void)
>>  /* Standard hot unplug procedure */
>> 
>>  idle_task_exit();
>> -current->active_mm = NULL; /* for sanity */
> 
> If I'm reading it right, we'll now be running with active_mm == init_mm
> in the offline loop.
> 
> I guess that's fine, I can't think of any reason it would matter, and it
> seems like we were NULL'ing it out just for paranoia's sake not because
> of any actual problem.
> 
> Acked-by: Michael Ellerman  (powerpc)
> 
> 
> cheers
> 
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index c49257a3b510..a132d875d351 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm)
>>  __mmdrop(mm);
>> }
>> 
>> +void mmdrop(struct mm_struct *mm);
>> +
>> /*
>>  * This has to be called after a get_task_mm()/mmget_not_zero()
>>  * followed by taking the mmap_sem for writing before modifying the
>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> index 2371292f30b0..244d30544377 100644
>> --- a/kernel/cpu.c
>> +++ b/kernel/cpu.c
>> @@ -3,6 +3,7 @@
>>  *
>>  * This code is licenced under the GPL.
>>  */
>> +#include 
>> #include 
>> #include 
>> #include 
>> @@ -564,6 +565,21 @@ static int bringup_cpu(unsigned int cpu)
>>  return bringup_wait_for_ap(cpu);
>> }
>> 
>> +static int finish_cpu(unsigned int cpu)
>> +{
>> +struct task_struct *idl

[PATCH v2] sched/core: fix illegal RCU from offline CPUs

2020-04-01 Thread Qian Cai
From: Peter Zijlstra 

In the CPU-offline process, it calls mmdrop() after idle entry and the
subsequent call to cpuhp_report_idle_dead(). Once execution passes the
call to rcu_report_dead(), RCU is ignoring the CPU, which results in
lockdep complaining when mmdrop() uses RCU from either memcg or
debugobjects below.

Fix it by cleaning up the active_mm state from BP instead. Every arch
which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
from AP. The only exception is parisc because it switches them to
&init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
but the patch will still work there because it calls mmgrab(&init_mm) in
smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().

WARNING: suspicious RCU usage
-
kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!

other info that might help us debug this:

RCU used illegally from offline CPU!
Call Trace:
 dump_stack+0xf4/0x164 (unreliable)
 lockdep_rcu_suspicious+0x140/0x164
 get_work_pool+0x110/0x150
 __queue_work+0x1bc/0xca0
 queue_work_on+0x114/0x120
 css_release+0x9c/0xc0
 percpu_ref_put_many+0x204/0x230
 free_pcp_prepare+0x264/0x570
 free_unref_page+0x38/0xf0
 __mmdrop+0x21c/0x2c0
 idle_task_exit+0x170/0x1b0
 pnv_smp_cpu_kill_self+0x38/0x2e0
 cpu_die+0x48/0x64
 arch_cpu_idle_dead+0x30/0x50
 do_idle+0x2f4/0x470
 cpu_startup_entry+0x38/0x40
 start_secondary+0x7a8/0xa80
 start_secondary_resume+0x10/0x14


Signed-off-by: Qian Cai 
---
 arch/powerpc/platforms/powernv/smp.c |  1 -
 include/linux/sched/mm.h |  2 ++
 kernel/cpu.c | 18 +-
 kernel/sched/core.c  |  5 +++--
 4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index 13e251699346..b2ba3e95bda7 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -167,7 +167,6 @@ static void pnv_smp_cpu_kill_self(void)
/* Standard hot unplug procedure */
 
idle_task_exit();
-   current->active_mm = NULL; /* for sanity */
cpu = smp_processor_id();
DBG("CPU%d offline\n", cpu);
generic_set_cpu_dead(cpu);
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index c49257a3b510..a132d875d351 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -49,6 +49,8 @@ static inline void mmdrop(struct mm_struct *mm)
__mmdrop(mm);
 }
 
+void mmdrop(struct mm_struct *mm);
+
 /*
  * This has to be called after a get_task_mm()/mmget_not_zero()
  * followed by taking the mmap_sem for writing before modifying the
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 2371292f30b0..244d30544377 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3,6 +3,7 @@
  *
  * This code is licenced under the GPL.
  */
+#include 
 #include 
 #include 
 #include 
@@ -564,6 +565,21 @@ static int bringup_cpu(unsigned int cpu)
return bringup_wait_for_ap(cpu);
 }
 
+static int finish_cpu(unsigned int cpu)
+{
+   struct task_struct *idle = idle_thread_get(cpu);
+   struct mm_struct *mm = idle->active_mm;
+
+   /*
+* idle_task_exit() will have switched to &init_mm, now
+* clean up any remaining active_mm state.
+*/
+   if (mm != &init_mm)
+   idle->active_mm = &init_mm;
+   mmdrop(mm);
+   return 0;
+}
+
 /*
  * Hotplug state machine related functions
  */
@@ -1549,7 +1565,7 @@ static struct cpuhp_step cpuhp_hp_states[] = {
[CPUHP_BRINGUP_CPU] = {
.name   = "cpu:bringup",
.startup.single = bringup_cpu,
-   .teardown.single= NULL,
+   .teardown.single= finish_cpu,
.cant_stop  = true,
},
/* Final state before CPU kills itself */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a2694ba82874..8787958339d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6200,13 +6200,14 @@ void idle_task_exit(void)
struct mm_struct *mm = current->active_mm;
 
BUG_ON(cpu_online(smp_processor_id()));
+   BUG_ON(current != this_rq()->idle);
 
if (mm != &init_mm) {
switch_mm(mm, &init_mm, current);
-   current->active_mm = &init_mm;
finish_arch_post_lock_switch();
}
-   mmdrop(mm);
+
+   /* finish_cpu(), as ran on the BP, will clean up the active_mm state */
 }
 
 /*
-- 
2.21.0 (Apple Git-122.2)



Re: Argh, can't find dcache properties !

2020-03-24 Thread Qian Cai



> On Mar 24, 2020, at 4:06 PM, Chris Packham 
>  wrote:
> 
> On Tue, 2020-03-24 at 15:47 +1100, Michael Ellerman wrote:
>> Chris Packham  writes:
>>> Hi All,
>>> 
>>> Just booting up v5.5.11 on a Freescale T2080RDB and I'm seeing the
>>> following message.
>>> 
>>> kern.warning linuxbox kernel: Argh, can't find dcache properties !
>>> kern.warning linuxbox kernel: Argh, can't find icache properties !
>>> 
>>> This was changed from DBG() to pr_warn() in commit 3b9176e9a874
>>> ("powerpc/setup_64: fix -Wempty-body warnings") but the message
>>> seems
>>> to be much older than that. So it's probably been an issue on the
>>> T2080
>>> (and other QorIQ SoCs) for a while.
>> 
>> That's an e6500 I think? So 64-bit Book3E.
>> 
> 
> Yes that's correct.
> 
>> You'll be getting the default values, which is 64 bytes so I guess
>> that
>> works in practice.
>> 
>>> Looking at the code the t208x doesn't specify any of the d-cache-
>>> size/i-cache-size properties. Should I add them to silence the
>>> warning
>>> or switch it to pr_debug()/pr_info()?
>> 
>> Yeah ideally you'd add them to the device tree(s) for those boards.
>> 
> 
> I think the info I need is in the block diagram[0]. I'll whip up
> a patch.
> 
> --
> [0] - 
> https://www.nxp.com/products/processors-and-microcontrollers/power-architecture/qoriq-communication-processors/t-series/qoriq-t2080-and-t2081-multicore-communications-processors:T2080

BTW, POWER9 PowerNV would have the same thing. 

[0.00][T0] Setting debug_guardpage_minorder to 1
[0.00][T0] Reserving 512MB of memory at 128MB for crashkernel 
(System RAM: 262144MB)
[0.00][T0] radix-mmu: Page sizes from device-tree:
[0.00][T0] radix-mmu: Page size shift = 12 AP=0x0
[0.00][T0] radix-mmu: Page size shift = 16 AP=0x5
[0.00][T0] radix-mmu: Page size shift = 21 AP=0x1
[0.00][T0] radix-mmu: Page size shift = 30 AP=0x2
[0.00][T0] radix-mmu: Activating Kernel Userspace Execution 
Prevention
[0.00][T0] radix-mmu: Activating Kernel Userspace Access Prevention
[0.00][T0] radix-mmu: Mapped 0x-0x0160 
with 2.00 MiB pages (exec)
[0.00][T0] radix-mmu: Mapped 0x0160-0x4000 
with 2.00 MiB pages
[0.00][T0] radix-mmu: Mapped 0x4000-0x0020 
with 1.00 GiB pages
[0.00][T0] radix-mmu: Mapped 0x2000-0x2020 
with 1.00 GiB pages
[0.00][T0] radix-mmu: Initializing Radix MMU
[0.00][T0] Linux version 5.6.0-rc7-next-20200324+ (root@ibm-p9wr) 
(gcc version 8.3.1 20191121 (Red Hat 8.3.1-5) (GCC)) #2 SMP Tue Mar 24 15:52:36 
EDT 2020
[0.00][T0] Argh, can't find dcache properties !
[0.00][T0] Argh, can't find icache properties !
[0.00][T0] Found initrd at 0xc785:0xcad26142
[0.00][T0] OPAL: Found memory mapped LPC bus on chip 0
[0.00][T0] Using PowerNV machine description
[0.00][T0] printk: bootconsole [udbg0] enabled
[0.00][T0] CPU maps initialized for 4 threads per core
[0.00][T0] -
[0.00][T0] phys_mem_size = 0x40
[0.00][T0] dcache_bsize  = 0x80
[0.00][T0] icache_bsize  = 0x80
[0.00][T0] cpu_features  = 0x0001f86f8f5fb1a7
[0.00][T0]   possible= 0x0003fbffcf5fb1a7
[0.00][T0]   always  = 0x006f8b5c91a1
[0.00][T0] cpu_user_features = 0xdc0065c2 0xaee0
[0.00][T0] mmu_features  = 0xbc006041
[0.00][T0] firmware_features = 0x1000
[0.00][T0] vmalloc start = 0xc008
[0.00][T0] IO start  = 0xc00a
[0.00][T0] vmemmap start = 0xc00c
[0.00][T0] -

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Qian Cai



> On Mar 6, 2020, at 7:56 PM, Anshuman Khandual  
> wrote:
> 
> 
> 
> On 03/07/2020 06:04 AM, Qian Cai wrote:
>> 
>> 
>>> On Mar 6, 2020, at 7:03 PM, Anshuman Khandual  
>>> wrote:
>>> 
>>> Hmm, set_pte_at() function is not preferred here for these tests. The idea
>>> is to avoid or atleast minimize TLB/cache flushes triggered from these sort
>>> of 'static' tests. set_pte_at() is platform provided and could/might trigger
>>> these flushes or some other platform specific synchronization stuff. Just
>> 
>> Why is that important for this debugging option?
> 
> Primarily reason is to avoid TLB/cache flush instructions on the system
> during these tests that only involve transforming different page table
> level entries through helpers. Unless really necessary, why should it
> emit any TLB/cache flush instructions ?
> 
>> 
>>> wondering is there specific reason with respect to the soft lock up problem
>>> making it necessary to use set_pte_at() rather than a simple WRITE_ONCE() ?
>> 
>> Looking at the s390 version of set_pte_at(), it has this comment,
>> vmaddr);
>> 
>> /*
>> * Certain architectures need to do special things when PTEs
>> * within a page table are directly modified.  Thus, the following
>> * hook is made available.
>> */
>> 
>> I can only guess that powerpc  could be the same here.
> 
> This comment is present in multiple platforms while defining set_pte_at().
> Is not 'barrier()' here alone good enough ? Else what exactly set_pte_at()

No, barrier() is not enough.

> does as compared to WRITE_ONCE() that avoids the soft lock up, just trying
> to understand.

I could surely spend hours figuring out exactly which parts of set_pte_at() are
necessary for pte_clear() not to get stuck, then propose a solution and possibly
retest on multiple arches. I am not sure that is a good use of my time just to
save a few TLB/cache flushes on a debug kernel.

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Qian Cai



> On Mar 6, 2020, at 7:03 PM, Anshuman Khandual  
> wrote:
> 
> Hmm, set_pte_at() function is not preferred here for these tests. The idea
> is to avoid or atleast minimize TLB/cache flushes triggered from these sort
> of 'static' tests. set_pte_at() is platform provided and could/might trigger
> these flushes or some other platform specific synchronization stuff. Just

Why is that important for this debugging option?

> wondering is there specific reason with respect to the soft lock up problem
> making it necessary to use set_pte_at() rather than a simple WRITE_ONCE() ?

Looking at the s390 version of set_pte_at(), it has this comment,
vmaddr);

/*
 * Certain architectures need to do special things when PTEs
 * within a page table are directly modified.  Thus, the following
 * hook is made available.
 */

I can only guess that powerpc  could be the same here.

Re: [PATCH V15] mm/debug: Add tests validating architecture page table helpers

2020-03-06 Thread Qian Cai
On Fri, 2020-03-06 at 05:27 +0530, Anshuman Khandual wrote:
> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> level along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> inside kernel_init() right after async_synchronize_full().
> 
> This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. Any
> architecture, which is willing to subscribe this test will need to select
> ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, s390
> and ppc32 platforms where the test is known to build and run successfully.
> Going forward, other architectures too can subscribe the test after fixing
> any build or runtime problems with their page table helpers. Meanwhile for
> better platform coverage, the test can also be enabled with CONFIG_EXPERT
> even without ARCH_HAS_DEBUG_VM_PGTABLE.
> 
> Folks interested in making sure that a given platform's page table helpers
> conform to expected generic MM semantics should enable the above config
> which will just trigger this test during boot. Any non conformity here will
> be reported as an warning which would need to be fixed. This test will help
> catch any changes to the agreed upon semantics expected from generic MM and
> enable platforms to accommodate it thereafter.

OK, I got this working on powerpc hash MMU as well, so how about this?

diff --git a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
index 64d0f9b15c49..c527d05c0459 100644
--- a/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
+++ b/Documentation/features/debug/debug-vm-pgtable/arch-support.txt
@@ -22,8 +22,7 @@
 |   nios2: | TODO |
 |openrisc: | TODO |
 |  parisc: | TODO |
-|  powerpc/32: |  ok  |
-|  powerpc/64: | TODO |
+| powerpc: |  ok  |
 |   riscv: | TODO |
 |s390: |  ok  |
 |  sh: | TODO |
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2e7eee523ba1..176930f40e07 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -116,7 +116,7 @@ config PPC
    #
    select ARCH_32BIT_OFF_T if PPC32
    select ARCH_HAS_DEBUG_VIRTUAL
-   select ARCH_HAS_DEBUG_VM_PGTABLE if PPC32
+   select ARCH_HAS_DEBUG_VM_PGTABLE
    select ARCH_HAS_DEVMEM_IS_ALLOWED
    select ARCH_HAS_ELF_RANDOMIZE
    select ARCH_HAS_FORTIFY_SOURCE
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 96a91bda3a85..98990a515268 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -256,7 +256,8 @@ static void __init pte_clear_tests(struct mm_struct *mm,
pte_t *ptep,
    pte_t pte = READ_ONCE(*ptep);
 
    pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
-   WRITE_ONCE(*ptep, pte);
+   set_pte_at(mm, vaddr, ptep, pte);
+   barrier();
    pte_clear(mm, vaddr, ptep);
    pte = READ_ONCE(*ptep);
    WARN_ON(!pte_none(pte));


Re: [PATCH -next v2] powerpc/64s/pgtable: fix an undefined behaviour

2020-03-05 Thread Qian Cai



> On Mar 5, 2020, at 2:22 PM, Christophe Leroy  wrote:
> 
> 
> 
> Le 05/03/2020 à 15:32, Qian Cai a écrit :
>> Booting a power9 server with hash MMU could trigger an undefined
>> behaviour because pud_offset(p4d, 0) will do,
>> 0 >> (PAGE_SHIFT:16 + PTE_INDEX_SIZE:8 + H_PMD_INDEX_SIZE:10)
>> Fix it by converting pud_offset() and friends to static inline
>> functions.
> 
> I was suggesting to convert pud_index() to static inline, because that's 
> where the shift sits. Is it not possible ?
> 
> Here you seems to fix the problem for now, but if someone reuses pud_index() 
> in another macro one day, the same problem may happen again.
> 

Sounds reasonable. I sent out a v3:

https://lore.kernel.org/lkml/20200306044852.3236-1-...@lca.pw/T/#u

> Christophe
> 
>>  UBSAN: shift-out-of-bounds in arch/powerpc/mm/ptdump/ptdump.c:282:15
>>  shift exponent 34 is too large for 32-bit type 'int'
>>  CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-20200303+ #13
>>  Call Trace:
>>  dump_stack+0xf4/0x164 (unreliable)
>>  ubsan_epilogue+0x18/0x78
>>  __ubsan_handle_shift_out_of_bounds+0x160/0x21c
>>  walk_pagetables+0x2cc/0x700
>>  walk_pud at arch/powerpc/mm/ptdump/ptdump.c:282
>>  (inlined by) walk_pagetables at arch/powerpc/mm/ptdump/ptdump.c:311
>>  ptdump_check_wx+0x8c/0xf0
>>  mark_rodata_ro+0x48/0x80
>>  kernel_init+0x74/0x194
>>  ret_from_kernel_thread+0x5c/0x74
>> Suggested-by: Christophe Leroy 
>> Signed-off-by: Qian Cai 
>> ---
>>  arch/powerpc/include/asm/book3s/64/pgtable.h | 20 ++--
>>  1 file changed, 14 insertions(+), 6 deletions(-)
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index fa60e8594b9f..4967bc9e25e2 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -1016,12 +1016,20 @@ static inline bool p4d_access_permitted(p4d_t p4d, 
>> bool write)
>>#define pgd_offset(mm, address)((mm)->pgd + pgd_index(address))
>>  -#define pud_offset(p4dp, addr) \
>> -(((pud_t *) p4d_page_vaddr(*(p4dp))) + pud_index(addr))
>> -#define pmd_offset(pudp,addr) \
>> -(((pmd_t *) pud_page_vaddr(*(pudp))) + pmd_index(addr))
>> -#define pte_offset_kernel(dir,addr) \
>> -(((pte_t *) pmd_page_vaddr(*(dir))) + pte_index(addr))
>> +static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
>> +{
>> +return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
>> +}
>> +
>> +static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
>> +{
>> +return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
>> +}
>> +
>> +static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
>> +{
>> +return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
>> +}
>>#define pte_offset_map(dir,addr)  pte_offset_kernel((dir), (addr))
>>  



[PATCH v3] powerpc/64s/pgtable: fix an undefined behaviour

2020-03-05 Thread Qian Cai
Booting a power9 server with hash MMU could trigger an undefined
behaviour because pud_offset(p4d, 0) will do,

0 >> (PAGE_SHIFT:16 + PTE_INDEX_SIZE:8 + H_PMD_INDEX_SIZE:10)

Fix it by converting pud_index() and friends to static inline
functions.

UBSAN: shift-out-of-bounds in arch/powerpc/mm/ptdump/ptdump.c:282:15
shift exponent 34 is too large for 32-bit type 'int'
CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-20200303+ #13
Call Trace:
dump_stack+0xf4/0x164 (unreliable)
ubsan_epilogue+0x18/0x78
__ubsan_handle_shift_out_of_bounds+0x160/0x21c
walk_pagetables+0x2cc/0x700
walk_pud at arch/powerpc/mm/ptdump/ptdump.c:282
(inlined by) walk_pagetables at arch/powerpc/mm/ptdump/ptdump.c:311
ptdump_check_wx+0x8c/0xf0
mark_rodata_ro+0x48/0x80
kernel_init+0x74/0x194
ret_from_kernel_thread+0x5c/0x74

Suggested-by: Christophe Leroy 
Signed-off-by: Qian Cai 
---

v3: convert pud_index() etc to static inline functions.
v2: convert pud_offset() etc to static inline functions.

 arch/powerpc/include/asm/book3s/64/pgtable.h | 23 
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 201a69e6a355..bd432c6706b9 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -998,10 +998,25 @@ extern struct page *pgd_page(pgd_t pgd);
 #define pud_page_vaddr(pud)__va(pud_val(pud) & ~PUD_MASKED_BITS)
 #define pgd_page_vaddr(pgd)__va(pgd_val(pgd) & ~PGD_MASKED_BITS)
 
-#define pgd_index(address) (((address) >> (PGDIR_SHIFT)) & (PTRS_PER_PGD - 1))
-#define pud_index(address) (((address) >> (PUD_SHIFT)) & (PTRS_PER_PUD - 1))
-#define pmd_index(address) (((address) >> (PMD_SHIFT)) & (PTRS_PER_PMD - 1))
-#define pte_index(address) (((address) >> (PAGE_SHIFT)) & (PTRS_PER_PTE - 1))
+static inline unsigned long pgd_index(unsigned long address)
+{
+   return (address >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
+}
+
+static inline unsigned long pud_index(unsigned long address)
+{
+   return (address >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
+}
+
+static inline unsigned long pmd_index(unsigned long address)
+{
+   return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
+}
+
+static inline unsigned long pte_index(unsigned long address)
+{
+   return (address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+}
 
 /*
  * Find an entry in a page-table-directory.  We combine the address region
-- 
2.21.0 (Apple Git-122.2)
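
For readers following the thread, here is a minimal user-space sketch of why
moving the shift into a static inline function avoids the UBSAN report. It is
not the kernel code: every demo_* name is hypothetical, 34 is simply the shift
exponent from the report above (16 + 8 + 10), 512 is a made-up table size, and
a 64-bit unsigned long is assumed.

```c
#include <assert.h>

/* Illustrative constants, not the kernel's definitions. 34 is the shift
 * exponent UBSAN reported; 512 entries is a made-up table size. */
#define DEMO_PUD_SHIFT		34
#define DEMO_PTRS_PER_PUD	512UL

/* Macro form: the shift is evaluated in the (promoted) type of whatever
 * the caller passes, so a plain int argument is shifted as an int. */
#define demo_pud_index_macro(address) \
	(((address) >> DEMO_PUD_SHIFT) & (DEMO_PTRS_PER_PUD - 1))

/* Inline form: the argument is converted to unsigned long at the call,
 * so the shift is always done in unsigned long. */
static inline unsigned long demo_pud_index(unsigned long address)
{
	return (address >> DEMO_PUD_SHIFT) & (DEMO_PTRS_PER_PUD - 1);
}

/* Self-check; assumes a 64-bit unsigned long (LP64). */
static inline void demo_pud_index_check(void)
{
	/* A plain int 0 is safe here: it is converted before the shift. */
	assert(demo_pud_index(0) == 0);
	assert(demo_pud_index(1UL << DEMO_PUD_SHIFT) == 1);
	/* The macro is only well defined when the caller passes a wide type. */
	assert(demo_pud_index_macro(1UL << DEMO_PUD_SHIFT) == 1);
}
```

Calling demo_pud_index_macro(0) with a bare int literal would shift a 32-bit
int by 34, which is the undefined behaviour the patch is meant to rule out no
matter which caller reuses the helper.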



[PATCH -next v2] powerpc/64s/pgtable: fix an undefined behaviour

2020-03-05 Thread Qian Cai
Booting a power9 server with hash MMU could trigger an undefined
behaviour because pud_offset(p4d, 0) will do,

0 >> (PAGE_SHIFT:16 + PTE_INDEX_SIZE:8 + H_PMD_INDEX_SIZE:10)

Fix it by converting pud_offset() and friends to static inline
functions.

 UBSAN: shift-out-of-bounds in arch/powerpc/mm/ptdump/ptdump.c:282:15
 shift exponent 34 is too large for 32-bit type 'int'
 CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-20200303+ #13
 Call Trace:
 dump_stack+0xf4/0x164 (unreliable)
 ubsan_epilogue+0x18/0x78
 __ubsan_handle_shift_out_of_bounds+0x160/0x21c
 walk_pagetables+0x2cc/0x700
 walk_pud at arch/powerpc/mm/ptdump/ptdump.c:282
 (inlined by) walk_pagetables at arch/powerpc/mm/ptdump/ptdump.c:311
 ptdump_check_wx+0x8c/0xf0
 mark_rodata_ro+0x48/0x80
 kernel_init+0x74/0x194
 ret_from_kernel_thread+0x5c/0x74

Suggested-by: Christophe Leroy 
Signed-off-by: Qian Cai 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index fa60e8594b9f..4967bc9e25e2 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1016,12 +1016,20 @@ static inline bool p4d_access_permitted(p4d_t p4d, bool 
write)
 
 #define pgd_offset(mm, address) ((mm)->pgd + pgd_index(address))
 
-#define pud_offset(p4dp, addr) \
-   (((pud_t *) p4d_page_vaddr(*(p4dp))) + pud_index(addr))
-#define pmd_offset(pudp,addr) \
-   (((pmd_t *) pud_page_vaddr(*(pudp))) + pmd_index(addr))
-#define pte_offset_kernel(dir,addr) \
-   (((pte_t *) pmd_page_vaddr(*(dir))) + pte_index(addr))
+static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
+{
+   return (pud_t *)p4d_page_vaddr(*p4d) + pud_index(address);
+}
+
+static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
+{
+   return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
+}
+
+static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
+{
+   return (pte_t *)pmd_page_vaddr(*pmd) + pte_index(address);
+}
 
 #define pte_offset_map(dir,addr)   pte_offset_kernel((dir), (addr))
 
-- 
1.8.3.1



[PATCH -next] powerpc/mm/ptdump: fix an undefined behaviour

2020-03-04 Thread Qian Cai
Booting a power9 server with hash MMU could trigger an undefined
behaviour because pud_offset(p4d, 0) will do,

0 >> (PAGE_SHIFT:16 + PTE_INDEX_SIZE:8 + H_PMD_INDEX_SIZE:10)

 UBSAN: shift-out-of-bounds in arch/powerpc/mm/ptdump/ptdump.c:282:15
 shift exponent 34 is too large for 32-bit type 'int'
 CPU: 6 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-20200303+ #13
 Call Trace:
 dump_stack+0xf4/0x164 (unreliable)
 ubsan_epilogue+0x18/0x78
 __ubsan_handle_shift_out_of_bounds+0x160/0x21c
 walk_pagetables+0x2cc/0x700
 walk_pud at arch/powerpc/mm/ptdump/ptdump.c:282
 (inlined by) walk_pagetables at arch/powerpc/mm/ptdump/ptdump.c:311
 ptdump_check_wx+0x8c/0xf0
 mark_rodata_ro+0x48/0x80
 kernel_init+0x74/0x194
 ret_from_kernel_thread+0x5c/0x74

Fixes: 8eb07b187000 ("powerpc/mm: Dump linux pagetables")
Signed-off-by: Qian Cai 
---

Notes for maintainers:

This is on the top of the linux-next commit "powerpc: add support for
folded p4d page tables" which is in the Andrew's tree.

 arch/powerpc/mm/ptdump/ptdump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index 9d6256b61df3..b530f81398a7 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -279,7 +279,7 @@ static void walk_pmd(struct pg_state *st, pud_t *pud, 
unsigned long start)
 
 static void walk_pud(struct pg_state *st, p4d_t *p4d, unsigned long start)
 {
-   pud_t *pud = pud_offset(p4d, 0);
+   pud_t *pud = pud_offset(p4d, 0UL);
unsigned long addr;
unsigned int i;
 
-- 
2.21.0 (Apple Git-122.2)
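
A small user-space sketch of what the 0 -> 0UL change relies on, assuming a
32-bit int and a 64-bit unsigned long. The demo_* names are hypothetical and
34 is the shift exponent from the UBSAN report; nothing here is kernel code.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_SHIFT 34	/* 16 + 8 + 10, as in the commit message */

/* Width in bits of an expression's type after integer promotion. */
#define demo_bits(x) (sizeof(+(x)) * CHAR_BIT)

/* In C a shift is only defined when the exponent is smaller than the
 * width of the promoted left operand. */
static inline int demo_shift_defined(size_t lhs_bits)
{
	return (size_t)DEMO_SHIFT < lhs_bits;
}

/* Self-check; assumes 32-bit int and 64-bit unsigned long. */
static inline void demo_literal_check(void)
{
	assert(!demo_shift_defined(demo_bits(0)));	/* int: 34 >= 32 */
	assert(demo_shift_defined(demo_bits(0UL)));	/* unsigned long: 34 < 64 */
}
```

The literal 0 has type int, so a macro that shifts its argument by 34 is
undefined for it, while 0UL is shifted as an unsigned long.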



Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-03-04 Thread Qian Cai



> On Mar 4, 2020, at 1:49 AM, Christophe Leroy  wrote:
> 
> AFAIU, you are not taking an interrupt here. You are stuck in the 
> pte_update(), most likely due to nested locks. Try with LOCKDEP ?

Not exactly sure what you meant here, but the kernel has lockdep fully enabled 
and it did not flag anything.

Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-03-03 Thread Qian Cai


> Below is slightly modified version of your change above and should still
> prevent the bug on powerpc. Will it be possible for you to re-test this
> ? Once confirmed, will send a patch enabling this test on powerpc64
> keeping your authorship. Thank you.

This works fine on radix MMU, but I decided to go a bit further and test hash
MMU. The kernel gets stuck as shown below. I did confirm that pte_alloc_map_lock()
was successful, but I don't understand hash MMU well enough to tell why
it could still take an interrupt in pte_clear_tests() even before we call
pte_unmap_unlock().

[   33.881515][T1] ok 8 - property-entry
[   33.883653][T1] debug_vm_pgtable: debug_vm_pgtable: Validating
architecture page table helpers
[   60.418885][C8] watchdog: BUG: soft lockup - CPU#8 stuck for 23s!
[swapper/0:1]
[   60.418913][C8] Modules linked in:
[   60.418927][C8] irq event stamp: 2896762
[   60.418945][C8] hardirqs last  enabled at (2896761): []
fast_exc_return_irq+0x28/0x34
[   60.418960][C8] hardirqs last disabled at (2896762): []
decrementer_common+0x10c/0x130
[   60.418985][C8] softirqs last  enabled at (2896760): []
__do_softirq+0x640/0x8c8
[   60.419009][C8] softirqs last disabled at (2896753): []
irq_exit+0x16c/0x1d0
[   60.419024][C8] CPU: 8 PID: 1 Comm: swapper/0 Not tainted 5.6.0-rc4-next-
20200303+ #7
[   60.419055][C8] NIP:  c103dc14 LR: c103db0c CTR:

[   60.419076][C8] REGS: c0003dd4fa30 TRAP: 0901   Not tainted  (5.6.0-
rc4-next-20200303+)
[   60.419107][C8] MSR:  90009033   CR:
42000222  XER: 
[   60.419134][C8] CFAR: c103dc1c IRQMASK: 0 
[   60.419134][C8] GPR00: c103db0c c0003dd4fcc0 c1657d00
0521000100c0 
[   60.419134][C8] GPR04: 8105 000a f4d9864c
0001 
[   60.419134][C8] GPR08:   0001
000a 
[   60.419134][C8] GPR12:  c01f9880 
[   60.419220][C8] NIP [c103dc14] debug_vm_pgtable+0x7a8/0xbb4
hash__pte_update at arch/powerpc/include/asm/book3s/64/hash.h:159
(inlined by) pte_update at arch/powerpc/include/asm/book3s/64/pgtable.h:359
(inlined by) pte_clear at arch/powerpc/include/asm/book3s/64/pgtable.h:477
(inlined by) pte_clear_tests at mm/debug_vm_pgtable.c:259
(inlined by) debug_vm_pgtable at mm/debug_vm_pgtable.c:368
[   60.419241][C8] LR [c103db0c] debug_vm_pgtable+0x6a0/0xbb4
pmd_basic_tests at mm/debug_vm_pgtable.c:74
(inlined by) debug_vm_pgtable at mm/debug_vm_pgtable.c:363
[   60.419260][C8] Call Trace:
[   60.419278][C8] [c0003dd4fcc0] [c103d994]
debug_vm_pgtable+0x528/0xbb4 (unreliable)
[   60.419302][C8] [c0003dd4fdb0] [c0010eac]
kernel_init+0x30/0x194
[   60.419325][C8] [c0003dd4fe20] [c000b748]
ret_from_kernel_thread+0x5c/0x74
[   60.419363][C8] Instruction dump:
[   60.419382][C8] 7d075078 7ce74b78 7ce0f9ad 40c2fff0 7e449378 7fc3f378
4b03531d 6000 
[   60.419416][C8] 4880 3920 3941 3900 <7e00f8a8> 7e075039
40c2fff8 7e074878 
[   98.908889][C8] rcu: INFO: rcu_sched self-detected stall on CPU
[   98.908933][C8] rcu: 8-: (6500 ticks this GP)
idle=522/1/0x4002 softirq=132/132 fqs=3250 
[   98.908963][C8] (t=6501 jiffies g=-719 q=510)
[   98.908984][C8] NMI backtrace for cpu 8
[   98.909012][C8] CPU: 8 PID: 1 Comm: swapper/0 Tainted:
G L5.6.0-rc4-next-20200303+ #7
[   98.909025][C8] Call Trace:
[   98.909046][C8] [c0003dd4f360] [c0970fe0]
dump_stack+0xf4/0x164 (unreliable)
[   98.909070][C8] [c0003dd4f3b0] [c097dcf4]
nmi_cpu_backtrace+0x1b4/0x1e0
[   98.909084][C8] [c0003dd4f450] [c097df48]
nmi_trigger_cpumask_backtrace+0x228/0x2c0
[   98.909118][C8] [c0003dd4f500] [c0057bf8]
arch_trigger_cpumask_backtrace+0x28/0x40
[   98.909152][C8] [c0003dd4f520] [c0202dd4]
rcu_dump_cpu_stacks+0x1c4/0x234
[   98.909184][C8] [c0003dd4f5a0] [c0201634]
rcu_sched_clock_irq+0xd54/0x1130
[   98.909207][C8] [c0003dd4f6c0] [c0217068]
update_process_times+0x48/0xb0
[   98.909239][C8] [c0003dd4f6f0] [c02358b4]
tick_sched_handle+0x34/0xb0
[   98.909262][C8] [c0003dd4f720] [c02361d8]
tick_sched_timer+0x68/0xe0
[   98.909284][C8] [c0003dd4f760] [c0219768]
__hrtimer_run_queues+0x528/0xa60
[   98.909306][C8] [c0003dd4f880] [c021ab58]
hrtimer_interrupt+0x128/0x330
[   98.909329][C8] [c0003dd4f930] [c002e1b4]
timer_interrupt+0x264/0x680
[   98.909352][C8] [c0003dd4f9c0] [c0009264]
decrementer_common+0x124/0x130
[   98.909366][C8] --- interrupt: 901 at debug_vm_pgtable+0x7a8/0xbb4
[   98.909366][C8] LR = debug_vm_pgtable+0x6a0/0xbb4
[   98.909402][C8] 

Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-03-02 Thread Qian Cai
On Wed, 2020-02-26 at 10:51 -0500, Qian Cai wrote:
> On Wed, 2020-02-26 at 15:45 +0100, Christophe Leroy wrote:
> > 
> > Le 26/02/2020 à 15:09, Qian Cai a écrit :
> > > On Mon, 2020-02-17 at 08:47 +0530, Anshuman Khandual wrote:
> > > > This adds tests which will validate architecture page table helpers and
> > > > other accessors in their compliance with expected generic MM semantics.
> > > > This will help various architectures in validating changes to existing
> > > > page table helpers or addition of new ones.
> > > > 
> > > > This test covers basic page table entry transformations including but 
> > > > not
> > > > limited to old, young, dirty, clean, write, write protect etc at various
> > > > level along with populating intermediate entries with next page table 
> > > > page
> > > > and validating them.
> > > > 
> > > > Test page table pages are allocated from system memory with required 
> > > > size
> > > > and alignments. The mapped pfns at page table levels are derived from a
> > > > real pfn representing a valid kernel text symbol. This test gets called
> > > > inside kernel_init() right after async_synchronize_full().
> > > > 
> > > > This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. 
> > > > Any
> > > > architecture, which is willing to subscribe this test will need to 
> > > > select
> > > > ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, 
> > > > s390
> > > > and ppc32 platforms where the test is known to build and run 
> > > > successfully.
> > > > Going forward, other architectures too can subscribe the test after 
> > > > fixing
> > > > any build or runtime problems with their page table helpers. Meanwhile 
> > > > for
> > > > better platform coverage, the test can also be enabled with 
> > > > CONFIG_EXPERT
> > > > even without ARCH_HAS_DEBUG_VM_PGTABLE.
> > > > 
> > > > Folks interested in making sure that a given platform's page table 
> > > > helpers
> > > > conform to expected generic MM semantics should enable the above config
> > > > which will just trigger this test during boot. Any non conformity here 
> > > > will
> > > > be reported as an warning which would need to be fixed. This test will 
> > > > help
> > > > catch any changes to the agreed upon semantics expected from generic MM 
> > > > and
> > > > enable platforms to accommodate it thereafter.
> > > 
> > > How useful is this that straightly crash the powerpc?
> > > 
> > > [   23.263425][T1] debug_vm_pgtable: debug_vm_pgtable: Validating
> > > architecture page table helpers
> > > [   23.263625][T1] [ cut here ]
> > > [   23.263649][T1] kernel BUG at arch/powerpc/mm/pgtable.c:274!
> > 
> > The problem on PPC64 is known and has to be investigated and fixed.
> 
> It might be interesting to hear what powerpc64 maintainers would say about it
> and if it is actually worth "fixing" in the arch code, but that BUG_ON() was
> there since 2009 and had not been exposed until this patch comes alone?

The patch below makes it work on powerpc64 by dodging the BUG_ON()s in 
assert_pte_locked() triggered by pte_clear_tests().


diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 96dd7d574cef..50b385233971 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -55,6 +55,8 @@
 #define RANDOM_ORVALUE GENMASK(BITS_PER_LONG - 1, S390_MASK_BITS)
 #define RANDOM_NZVALUE GENMASK(7, 0)
 
+unsigned long vaddr;
+
 static void __init pte_basic_tests(unsigned long pfn, pgprot_t prot)
 {
    pte_t pte = pfn_pte(pfn, prot);
@@ -256,7 +258,7 @@ static void __init pte_clear_tests(struct mm_struct *mm,
pte_t *ptep)
 
    pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
    WRITE_ONCE(*ptep, pte);
-   pte_clear(mm, 0, ptep);
+   pte_clear(mm, vaddr, ptep);
    pte = READ_ONCE(*ptep);
    WARN_ON(!pte_none(pte));
 }
@@ -310,8 +312,9 @@ void __init debug_vm_pgtable(void)
    pgtable_t saved_ptep;
    pgprot_t prot;
    phys_addr_t paddr;
-   unsigned long vaddr, pte_aligned, pmd_aligned;
+   unsigned long pte_aligned, pmd_aligned;
    unsigned long pud_aligned, p4d_aligned, pgd_aligned;
+   spinlock_t *ptl;
 
    pr_info("Validating architecture page table helpers\n");
    prot = vm_get_page_prot(VMFLAGS);
@@ -344,7 +347,7 @@ void __init debug_vm_pgtable(void)
    p4dp = p4d_alloc(mm, pgdp, vaddr);
    pudp = pud_alloc(mm, p4dp, vaddr);
    pmdp = pmd_alloc(mm, pudp, vaddr);
-   ptep = pte_alloc_map(mm, pmdp, vaddr);
+   ptep = pte_alloc_map_lock(mm, pmdp, vaddr, &ptl);
 
    /*
     * Save all the page table page addresses as the page table
@@ -370,7 +373,7 @@ void __init debug_vm_pgtable(void)
    p4d_clear_tests(mm, p4dp);
    pgd_clear_tests(mm, pgdp);
 
-   pte_unmap(ptep);
+   pte_unmap_unlock(ptep, ptl);
 
    pmd_populate_tests(mm, pmdp, saved_ptep);
    pud_populate_tests(mm, pudp, saved_pmdp);


Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-02-26 Thread Qian Cai
On Wed, 2020-02-26 at 15:45 +0100, Christophe Leroy wrote:
> 
> Le 26/02/2020 à 15:09, Qian Cai a écrit :
> > On Mon, 2020-02-17 at 08:47 +0530, Anshuman Khandual wrote:
> > > This adds tests which will validate architecture page table helpers and
> > > other accessors in their compliance with expected generic MM semantics.
> > > This will help various architectures in validating changes to existing
> > > page table helpers or addition of new ones.
> > > 
> > > This test covers basic page table entry transformations including but not
> > > limited to old, young, dirty, clean, write, write protect etc at various
> > > level along with populating intermediate entries with next page table page
> > > and validating them.
> > > 
> > > Test page table pages are allocated from system memory with required size
> > > and alignments. The mapped pfns at page table levels are derived from a
> > > real pfn representing a valid kernel text symbol. This test gets called
> > > inside kernel_init() right after async_synchronize_full().
> > > 
> > > This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. Any
> > > architecture, which is willing to subscribe this test will need to select
> > > ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, 
> > > s390
> > > and ppc32 platforms where the test is known to build and run successfully.
> > > Going forward, other architectures too can subscribe the test after fixing
> > > any build or runtime problems with their page table helpers. Meanwhile for
> > > better platform coverage, the test can also be enabled with CONFIG_EXPERT
> > > even without ARCH_HAS_DEBUG_VM_PGTABLE.
> > > 
> > > Folks interested in making sure that a given platform's page table helpers
> > > conform to expected generic MM semantics should enable the above config
> > > which will just trigger this test during boot. Any non conformity here 
> > > will
> > > be reported as an warning which would need to be fixed. This test will 
> > > help
> > > catch any changes to the agreed upon semantics expected from generic MM 
> > > and
> > > enable platforms to accommodate it thereafter.
> > 
> > How useful is this that straightly crash the powerpc?
> > 
> > [   23.263425][T1] debug_vm_pgtable: debug_vm_pgtable: Validating
> > architecture page table helpers
> > [   23.263625][T1] [ cut here ]
> > [   23.263649][T1] kernel BUG at arch/powerpc/mm/pgtable.c:274!
> 
> The problem on PPC64 is known and has to be investigated and fixed.

It might be interesting to hear what powerpc64 maintainers would say about it
and if it is actually worth "fixing" in the arch code, but that BUG_ON() has been
there since 2009 and had not been exposed until this patch came along?


Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-02-26 Thread Qian Cai
On Wed, 2020-02-26 at 09:09 -0500, Qian Cai wrote:
> On Mon, 2020-02-17 at 08:47 +0530, Anshuman Khandual wrote:
> > This adds tests which will validate architecture page table helpers and
> > other accessors in their compliance with expected generic MM semantics.
> > This will help various architectures in validating changes to existing
> > page table helpers or addition of new ones.
> > 
> > This test covers basic page table entry transformations including but not
> > limited to old, young, dirty, clean, write, write protect etc at various
> > level along with populating intermediate entries with next page table page
> > and validating them.
> > 
> > Test page table pages are allocated from system memory with required size
> > and alignments. The mapped pfns at page table levels are derived from a
> > real pfn representing a valid kernel text symbol. This test gets called
> > inside kernel_init() right after async_synchronize_full().
> > 
> > This test gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected. Any
> > architecture, which is willing to subscribe this test will need to select
> > ARCH_HAS_DEBUG_VM_PGTABLE. For now this is limited to arc, arm64, x86, s390
> > and ppc32 platforms where the test is known to build and run successfully.
> > Going forward, other architectures too can subscribe the test after fixing
> > any build or runtime problems with their page table helpers. Meanwhile for
> > better platform coverage, the test can also be enabled with CONFIG_EXPERT
> > even without ARCH_HAS_DEBUG_VM_PGTABLE.
> > 
> > Folks interested in making sure that a given platform's page table helpers
> > conform to expected generic MM semantics should enable the above config
> > which will just trigger this test during boot. Any non conformity here will
> > be reported as an warning which would need to be fixed. This test will help
> > catch any changes to the agreed upon semantics expected from generic MM and
> > enable platforms to accommodate it thereafter.
> 
> How useful is this that straightly crash the powerpc?

And then generate warnings on arm64,

[  146.634626][T1] debug_vm_pgtable: debug_vm_pgtable: Validating
architecture page table helpers
[  146.643995][T1] [ cut here ]
[  146.649350][T1] virt_to_phys used for non-linear address:
(ptrval) (start_kernel+0x0/0x580)
[  146.658840][T1] WARNING: CPU: 165 PID: 1 at arch/arm64/mm/physaddr.c:15
__virt_to_phys+0x98/0xe0
[  146.667976][T1] Modules linked in:
[  146.671741][T1] CPU: 165 PID: 1 Comm: swapper/0 Tainted:
G L5.6.0-rc3-next-20200226 #1
[  146.681397][T1] Hardware name: HPE Apollo
70 /C01_APACHE_MB , BIOS L50_5.13_1.11 06/18/2019
[  146.691840][T1] pstate: 6049 (nZCv daif +PAN -UAO)
[  146.697334][T1] pc : __virt_to_phys+0x98/0xe0
[  146.702045][T1] lr : __virt_to_phys+0x98/0xe0
[  146.706753][T1] sp : 18ff00082b7afe10
[  146.710766][T1] x29: 18ff00082b7afe30 x28:  
[  146.716782][T1] x27:  x26:  
[  146.722798][T1] x25:  x24:  
[  146.728813][T1] x23:  x22:  
[  146.734827][T1] x21:  x20: 9000135b4000 
[  146.740842][T1] x19: 900011200858 x18:  
[  146.746857][T1] x17:  x16:  
[  146.752872][T1] x15:  x14: 3078302b6c656e72 
[  146.758887][T1] x13: 656b5f7472617473 x12: 90001369ea90 
[  146.764901][T1] x11: 00c9 x10: 800082b76c0e 
[  146.770917][T1] x9 : 9d6a2e2260401300 x8 : 9d6a2e2260401300 
[  146.776932][T1] x7 :  x6 :  
[  146.782946][T1] x5 : 0080 x4 :  
[  146.788960][T1] x3 : 0010 x2 : 0008 
[  146.794975][T1] x1 : 0006 x0 : 0053 
[  146.800990][T1] Call trace:
[  146.804140][T1]  __virt_to_phys+0x98/0xe0
[  146.808512][T1]  debug_vm_pgtable+0x74/0x3fc
[  146.813140][T1]  kernel_init+0x1c/0x208
[  146.817334][T1]  ret_from_fork+0x10/0x18
[  146.821608][T1] irq event stamp: 19843388
[  146.825978][T1] hardirqs last  enabled at (19843387):
[] console_unlock+0x8d0/0x970
[  146.835553][T1] hardirqs last disabled at (19843388):
[] do_debug_exception+0x58/0x2cc
[  146.845387][T1] softirqs last  enabled at (19843384):
[] __do_softirq+0x864/0x900
[  146.854796][T1] softirqs last disabled at (19843377):
[] irq_exit+0x1c8/0x238
[  146.863845][T1] ---[ end trace 31678d9e845dff89 ]---

> 
> [   23.263425][T1] debug_vm_pgtable: debug_vm_pgtable: Validating
> architecture page table helpers

Re: [PATCH V14] mm/debug: Add tests validating architecture page table helpers

2020-02-26 Thread Qian Cai
> Cc: "H. Peter Anvin" 
> Cc: Kirill A. Shutemov 
> Cc: Paul Walmsley 
> Cc: Palmer Dabbelt 
> Cc: linux-snps-...@lists.infradead.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux-ri...@lists.infradead.org
> Cc: x...@kernel.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> 
> Suggested-by: Catalin Marinas 
> Reviewed-by: Ingo Molnar 
> Tested-by: Gerald Schaefer # s390
> Tested-by: Christophe Leroy  # ppc32
> Signed-off-by: Andrew Morton 
> Signed-off-by: Christophe Leroy 
> Signed-off-by: Anshuman Khandual 
> ---
> This adds a test validation for architecture exported page table helpers.
> Patch adds basic transformation tests at various levels of the page table.
> 
> This test was originally suggested by Catalin during arm64 THP migration
> RFC discussion earlier. Going forward it can include more specific tests
> with respect to various generic MM functions like THP, HugeTLB etc and
> platform specific tests.
> 
> https://lore.kernel.org/linux-mm/20190628102003.ga56...@arrakis.emea.arm.com/
> 
> Needs to be applied on linux V5.6-rc2
> 
> Changes in V14:
> 
> - Disabled DEBUG_VM_PGTABLE for IA64 and ARM (32 Bit) per Andrew and Christophe
> - Updated DEBUG_VM_PGTABLE documentation wrt EXPERT and disabled platforms
> - Updated RANDOM_[OR|NZ]VALUE open encodings with GENMASK() per Catalin
> - Updated s390 constraint bits from 12 to 4 (S390_MASK_BITS) per Gerald
> - Updated in-code documentation for RANDOM_ORVALUE per Gerald
> - Updated pxx_basic_tests() to use invert functions first per Catalin
> - Dropped ARCH_HAS_4LEVEL_HACK check from pud_basic_tests()
> - Replaced __ARCH_HAS_[4|5]LEVEL_HACK with __PAGETABLE_[PUD|P4D]_FOLDED per 
> Catalin
> - Trimmed the CC list on the commit message per Catalin
> 
> Changes in V13: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=237125)
> 
> - Subscribed s390 platform and updated debug-vm-pgtable/arch-support.txt per 
> Gerald
> - Dropped keyword 'extern' from debug_vm_pgtable() declaration per Christophe
> - Moved debug_vm_pgtable() declarations to  per Christophe
> - Moved debug_vm_pgtable() call site into kernel_init() per Christophe
> - Changed CONFIG_DEBUG_VM_PGTABLE rules per Christophe
> - Updated commit to include new supported platforms and changed config 
> selection
> 
> Changes in V12: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=233905)
> 
> - Replaced __mmdrop() with mmdrop()
> - Enable ARCH_HAS_DEBUG_VM_PGTABLE on X86 for non CONFIG_X86_PAE platforms as 
> the
>   test procedure interfere with pre-allocated PMDs attached to the PGD 
> resulting
>   in runtime failures with VM_BUG_ON()
> 
> Changes in V11: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=221135)
> 
> - Rebased the patch on V5.4
> 
> Changes in V10: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=205529)
> 
> - Always enable DEBUG_VM_PGTABLE when DEBUG_VM is enabled per Ingo
> - Added tags from Ingo
> 
> Changes in V9: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=201429)
> 
> - Changed feature support enumeration for powerpc platforms per Christophe
> - Changed config wrapper for basic_[pmd|pud]_tests() to enable ARC platform
> - Enabled the test on ARC platform
> 
> Changes in V8: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=194297)
> 
> - Enabled ARCH_HAS_DEBUG_VM_PGTABLE on PPC32 platform per Christophe
> - Updated feature documentation as DEBUG_VM_PGTABLE is now enabled on PPC32 
> platform
> - Moved ARCH_HAS_DEBUG_VM_PGTABLE earlier to indent it with DEBUG_VM per 
> Christophe
> - Added an information message in debug_vm_pgtable() per Christophe
> - Dropped random_vaddr boundary condition checks per Christophe and Qian
> - Replaced virt_addr_valid() check with pfn_valid() check in 
> debug_vm_pgtable()
> - Slightly changed pr_fmt(fmt) information
> 
> Changes in V7: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=193051)
> 
> - Memory allocation and free routines for mapped pages have been dropped
> - Mapped pfns are derived from standard kernel text symbol per Matthew
> - Moved debug_vm_pgtable() after page_alloc_init_late() per Michal and Qian
> - Updated the commit message per Michal
> - Updated W=1 GCC warning problem on x86 per Qian Cai
> - Addition of new alloc_contig_pages() helper has been submitted separately
> 
> Changes in V6: 
> (https://patchwork.kernel.org/project/linux-mm/list/?series=187589)
> 
> - Moved alloc_gigantic_page_order() into mm/page_alloc.c per Michal
> - Moved alloc_gigantic_page

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-02-03 Thread Qian Cai
On Mon, 2020-02-03 at 16:14 +0100, Christophe Leroy wrote:
> 
> Le 02/02/2020 à 12:26, Qian Cai a écrit :
> > 
> > 
> > > On Jan 30, 2020, at 9:13 AM, Christophe Leroy  
> > > wrote:
> > > 
> > > config DEBUG_VM_PGTABLE
> > > bool "Debug arch page table for semantics compliance" if 
> > > ARCH_HAS_DEBUG_VM_PGTABLE || EXPERT
> > > depends on MMU
> > > default 'n' if !ARCH_HAS_DEBUG_VM_PGTABLE
> > > default 'y' if DEBUG_VM
> > 
> > Is it really necessary to potentially force all bots to run this? Syzbot,
> > kernel test robot etc? Does it ever pay off for all their machine time
> > there?
> > 
> 
> Machine time ?
> 
> On a 32-bit powerpc running at 132 MHz, the test takes less than 10 ms.
> Is it worth taking the risk of not detecting faults by not selecting it 
> by default ?

The risk is quite low: as Catalin mentioned, this thing is not meant to detect
regressions but is rather for arch/mm maintainers.

I do appreciate the effort to get as many people as possible to run this thing,
so that it gets noticed once it is broken. However, DEBUG_VM is such a generic
Kconfig option these days that it is even enabled by default in Fedora Linux,
so I would rather see a more sensible default, even though the test runtime is
fairly short on a small machine for now.

> 
> [5.656916] debug_vm_pgtable: debug_vm_pgtable: Validating 
> architecture page table helpers
> [5.665661] debug_vm_pgtable: debug_vm_pgtable: Validated 
> architecture page table helpers



Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-02-02 Thread Qian Cai



> On Jan 30, 2020, at 9:13 AM, Christophe Leroy  wrote:
> 
> config DEBUG_VM_PGTABLE
>bool "Debug arch page table for semantics compliance" if 
> ARCH_HAS_DEBUG_VM_PGTABLE || EXPERT
>depends on MMU
>default 'n' if !ARCH_HAS_DEBUG_VM_PGTABLE
>default 'y' if DEBUG_VM

Is it really necessary to potentially force all bots to run this? Syzbot,
kernel test robot etc? Does it ever pay off for all their machine time there?
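
For reference, the way the quoted Kconfig entry resolves can be sketched in
plain C. This is only a model, assuming MMU=y and that the user has not set the
symbol by hand; the function names are made up for illustration and are not
kernel APIs. Kconfig applies the first "default" line whose condition holds,
and a bool symbol with no matching default falls back to n.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the quoted DEBUG_VM_PGTABLE entry (assumes MMU=y).
 * Returns the value the symbol takes when the user does not set it.
 */
bool debug_vm_pgtable_default(bool arch_has_debug_vm_pgtable, bool debug_vm)
{
	/* default 'n' if !ARCH_HAS_DEBUG_VM_PGTABLE (first match wins) */
	if (!arch_has_debug_vm_pgtable)
		return false;

	/* default 'y' if DEBUG_VM */
	if (debug_vm)
		return true;

	/* No default line matched: a bool symbol falls back to n. */
	return false;
}

/*
 * The prompt, and therefore the chance to override the default, is only
 * offered if ARCH_HAS_DEBUG_VM_PGTABLE || EXPERT.
 */
bool debug_vm_pgtable_prompt_visible(bool arch_has_debug_vm_pgtable,
				     bool expert)
{
	return arch_has_debug_vm_pgtable || expert;
}
```

Under this model, any config with DEBUG_VM=y on a subscribed architecture picks
the test up automatically, which is the behaviour being questioned above.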

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-29 Thread Qian Cai



> On Jan 29, 2020, at 5:36 AM, Catalin Marinas  wrote:
> 
> On Tue, Jan 28, 2020 at 02:07:10PM -0500, Qian Cai wrote:
>> On Jan 28, 2020, at 12:47 PM, Catalin Marinas  
>> wrote:
>>> The primary goal here is not finding regressions but having clearly
>>> defined semantics of the page table accessors across architectures. x86
>>> and arm64 are a good starting point and other architectures will be
>>> enabled as they are aligned to the same semantics.
>> 
>> This still does not answer the fundamental question. If this test is
>> simply inefficient to find bugs,
> 
> Who said this is inefficient (other than you)?

Inefficient at finding bugs. It is said to have found only a bug or two in its lifetime?

> 
>> who wants to spend time to use it regularly? 
> 
> Arch maintainers, mm maintainers introducing new macros or assuming
> certain new semantics of the existing macros.
> 
>> If this is just one off test that may get running once in a few years
>> (when introducing a new arch), how does it justify the ongoing cost to
>> maintain it?
> 
> You are really missing the point. It's not only for a new arch but
> changes to existing arch code. And if the arch code churn in this area
> is relatively small, I'd expect a similarly small cost of maintaining
> this test.
> 
> If you only turn DEBUG_VM on once every few years, don't generalise this
> to the rest of the kernel developers (as others pointed out, this test
> is default y if DEBUG_VM).

Quite the opposite: I am running DEBUG_VM almost daily for regression
workloads, and I feel strongly that mixing this thing in there does not add
any value.

So, I would suggest decoupling this from DEBUG_VM and clearly
documenting that this test is not intended for automated regression
workloads, so those people don’t need to waste time running this.

> 
> Anyway, I think that's a pointless discussion, so not going to reply
> further (unless you have technical content to add).
> 
> -- 
> Catalin



Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-28 Thread Qian Cai



> On Jan 28, 2020, at 12:47 PM, Catalin Marinas  wrote:
> 
> The primary goal here is not finding regressions but having clearly
> defined semantics of the page table accessors across architectures. x86
> and arm64 are a good starting point and other architectures will be
> enabled as they are aligned to the same semantics.

This still does not answer the fundamental question. If this test is simply
inefficient at finding bugs, who wants to spend time using it regularly? If this
is just a one-off test that may get run once every few years (when introducing
a new arch), how does it justify the ongoing cost of maintaining it?

I do agree there could be a need to clearly define this thing, but that belongs
in documentation rather than in a test. It is confusing to mix this with
other config options which have a somewhat different purpose; it will then be a
waste of time for people who mistakenly enable this for regular automated
testing and never find any bug with it.

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-28 Thread Qian Cai



> On Jan 28, 2020, at 7:10 AM, Mike Rapoport  wrote:
> 
> Aren't x86 and arm64 not decent enough?
> Even if this test could be used to detect regressions only on these two
> platforms, the test is valuable.

The question is: does it detect regressions well enough? Where is the list of
past bugs that it has found?

It is usual for unproven debugging features to remain out of tree first,
keep gathering the unique bugs they find, and then justify mainline
inclusion with enough data.

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-27 Thread Qian Cai



> On Jan 28, 2020, at 1:13 AM, Christophe Leroy  wrote:
> 
> ppc32 an indecent / legacy platform ? Are you kidding ?
> 
> PowerQUICC II Pro for instance is fully supported by the manufacturer and
> widely used in many small networking devices.

Of course, I forgot about embedded devices. The question is how many
developers are actually going to run this debug option on embedded devices.

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-27 Thread Qian Cai



> On Jan 28, 2020, at 2:03 AM, Anshuman Khandual  
> wrote:
> 
> 'allyesconfig' makes 'DEBUG_VM = y' which in turn will enable 
> 'DEBUG_VM_PGTABLE = y'
> on platforms that subscribe ARCH_HAS_DEBUG_VM_PGTABLE.

Isn’t that only for compile testing? Who is booting such a beast and making
sure everything works as expected?

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-27 Thread Qian Cai



> On Jan 28, 2020, at 1:17 AM, Christophe Leroy  wrote:
> 
> It is 'default y' so there is not much risk that it is forgotten, at least all
> test suites run with 'allyes_defconfig' will trigger the test, so I think it 
> is really a good feature.

This thing depends on DEBUG_VM, which I don’t see selected by any
defconfig. Am I missing anything?

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-27 Thread Qian Cai



> On Jan 27, 2020, at 11:58 PM, Anshuman Khandual  
> wrote:
> 
> As I had mentioned before, the test attempts to formalize page table helper
> semantics as expected from generic MM code paths and intends to catch
> deviations when enabled on a given platform. How else should we test semantic
> errors otherwise ? There are past examples of usefulness for this procedure on
> arm64 and on s390. I am wondering how else to prove the usefulness of a debug
> feature if these references are not enough.

I am not saying it will not be useful. As you mentioned, it actually found a bug
or two in the past. The problem is that there is always a cost to maintaining
something like this, and nobody knows how things could break in the future, even
for the isolated code you mentioned, given how complicated the kernel code
base is. From the information you have given so far, I am not so positive that
many developers would enable this debug feature and use it on a regular basis.

On the other hand, it might be just as good to maintain this thing out of tree
by yourself anyway, because if it isn’t going to be used by many developers,
few people are going to contribute to it or even notice when it is broken.
What’s the point of getting this merged apart from getting some
meaningless credits?

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-27 Thread Qian Cai



> On Jan 27, 2020, at 10:06 PM, Anshuman Khandual  
> wrote:
> 
> 
> 
> On 01/28/2020 07:41 AM, Qian Cai wrote:
>> 
>> 
>>> On Jan 27, 2020, at 8:28 PM, Anshuman Khandual  
>>> wrote:
>>> 
>>> This adds tests which will validate architecture page table helpers and
>>> other accessors in their compliance with expected generic MM semantics.
>>> This will help various architectures in validating changes to existing
>>> page table helpers or addition of new ones.
>>> 
>>> This test covers basic page table entry transformations including but not
>>> limited to old, young, dirty, clean, write, write protect etc at various
>>> levels along with populating intermediate entries with next page table page
>>> and validating them.
>>> 
>>> Test page table pages are allocated from system memory with required size
>>> and alignments. The mapped pfns at page table levels are derived from a
>>> real pfn representing a valid kernel text symbol. This test gets called
>>> right after page_alloc_init_late().
>>> 
>>> This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
>>> CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need to
>>> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
>>> arm64. Going forward, other architectures too can enable this after fixing
>>> build or runtime problems (if any) with their page table helpers.
> 
> Hello Qian,
> 
>> 
>> What’s the value of this block of new code? It only supports x86 and arm64
>> which are supposed to be good now.
> 
> We have been over the usefulness of this code many times before as the patch
> is already in its V12. Currently it is enabled on arm64, x86 (except PAE),
> arc and ppc32. There are build time or runtime problems with other archs
> which prevent

I am not sure I care too much about arc and ppc32, which are pretty much
legacy platforms.

> enablement of this test (for the moment) but then the goal is to integrate all
> of them going forward. The test not only validates platform's adherence to the
> expected semantics from generic MM but also helps in keeping it that way
> during code changes in future as well.

Another option may be to get some decent arches on board first before merging
this thing, so it has more chances to catch regressions for developers who
might run this.

> 
>> Did those tests ever find any regression or this is almost only useful for new
> 
> The test has already found problems with s390 page table helpers.

Hmm, that is pretty weak given that s390 is not even officially supported in
this version.

> 
>> architectures which only happened once in a few years?
> 
> Again, not only does it validate what exists today but it's also a tool to make
> sure that all platforms continue to adhere to commonly agreed upon semantics
> as reflected through the tests here.
> 
>> The worry is that if not many people use this config and code that much in
> 
> Debug features or tests in the kernel are used when required. These are never
> or should not be enabled by default. AFAICT this is true even for entire
> DEBUG_VM packaged tests. Do you have any particular data or precedent to
> substantiate the fact that this test will be used any less often than the
> other similar ones in the tree ? I can only speak for arm64 platform but the
> very idea for this test came from Catalin when we were trying to understand
> the semantics for THP helpers while enabling THP migration without split.
> Apart from going over the commit messages from the past, there was no other
> way to figure out how any particular page table helper is supposed to change
> a given page table entry. This test tries to formalize those semantics.

I am thinking about how we made so many mistakes before by merging too many of
those debugging options, many of which have been broken for many releases,
proving that nobody actually used them regularly. We don’t need to repeat the
same mistake again. I am actually thinking about removing things like
page_poisoning, which has almost never found any bug recently and only causes
pain when interacting with other new features that almost nobody will test
together with it to begin with. We even have some SLUB debugging code that has
sat there for almost 15 years, that almost nobody has used, and that the
maintainers refused to remove.

> 
>> the future because it is inefficient at finding bugs, it will simply rot
> Could you be more specific here ? What parts of the test are inefficient ? I
> am happy to improve upon the test. Do let me know you if you have suggestions.
> 

Re: [PATCH V12] mm/debug: Add tests validating architecture page table helpers

2020-01-27 Thread Qian Cai



> On Jan 27, 2020, at 8:28 PM, Anshuman Khandual  
> wrote:
> 
> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> levels along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> right after page_alloc_init_late().
> 
> This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_DEBUG_VM. Architectures willing to subscribe to this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.

What’s the value of this block of new code? It only supports x86 and arm64,
which are supposed to be good now. Did those tests ever find any regression, or
is this almost only useful for new architectures, which only happen once every
few years? The worry is that if not many people use this config and code that
much in the future because it is inefficient at finding bugs, it will simply
rot like a few other debugging options we have in the mainline that will be a
pain to remove later on.
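
To make the quoted description concrete, here is a minimal userspace sketch of
the kind of invariant such a test checks. It is not the code from the patch:
the toy_pte_t type, the TOY_PTE_* bit positions and every toy_* helper are
hypothetical stand-ins invented for illustration, and real architectures encode
these bits differently. Each check applies the inverse helper first, so the
result does not depend on the starting value of the entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page table entry; the bit layout is made up. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PTE_YOUNG (UINT64_C(1) << 0)
#define TOY_PTE_DIRTY (UINT64_C(1) << 1)
#define TOY_PTE_WRITE (UINT64_C(1) << 2)

toy_pte_t toy_pte_mkyoung(toy_pte_t p)   { p.val |= TOY_PTE_YOUNG;  return p; }
toy_pte_t toy_pte_mkold(toy_pte_t p)     { p.val &= ~TOY_PTE_YOUNG; return p; }
toy_pte_t toy_pte_mkdirty(toy_pte_t p)   { p.val |= TOY_PTE_DIRTY;  return p; }
toy_pte_t toy_pte_mkclean(toy_pte_t p)   { p.val &= ~TOY_PTE_DIRTY; return p; }
toy_pte_t toy_pte_mkwrite(toy_pte_t p)   { p.val |= TOY_PTE_WRITE;  return p; }
toy_pte_t toy_pte_wrprotect(toy_pte_t p) { p.val &= ~TOY_PTE_WRITE; return p; }

bool toy_pte_young(toy_pte_t p) { return (p.val & TOY_PTE_YOUNG) != 0; }
bool toy_pte_dirty(toy_pte_t p) { return (p.val & TOY_PTE_DIRTY) != 0; }
bool toy_pte_write(toy_pte_t p) { return (p.val & TOY_PTE_WRITE) != 0; }

/* Returns the number of violated invariants; 0 means the helpers agree. */
int toy_pte_basic_checks(toy_pte_t pte)
{
	int failures = 0;

	failures += !toy_pte_young(toy_pte_mkyoung(toy_pte_mkold(pte)));
	failures += toy_pte_young(toy_pte_mkold(toy_pte_mkyoung(pte)));
	failures += !toy_pte_dirty(toy_pte_mkdirty(toy_pte_mkclean(pte)));
	failures += toy_pte_dirty(toy_pte_mkclean(toy_pte_mkdirty(pte)));
	failures += !toy_pte_write(toy_pte_mkwrite(toy_pte_wrprotect(pte)));
	failures += toy_pte_write(toy_pte_wrprotect(toy_pte_mkwrite(pte)));

	return failures;
}
```

A helper that forgot to clear its bit, for example a toy_pte_mkold() that
returned its argument unchanged, would make toy_pte_basic_checks() return a
non-zero count.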

Re: "ftrace: Rework event_create_dir()" triggers boot error messages

2020-01-06 Thread Qian Cai



> On Dec 18, 2019, at 11:31 PM, Steven Rostedt  wrote:
> 
> On Wed, 18 Dec 2019 22:58:23 -0500
> Qian Cai  wrote:
> 
>> The linux-next commit "ftrace: Rework event_create_dir()" [1] triggers boot
>> warnings for Clang-built (Clang version 8.0.1) kernels (reproduced on both
>> arm64 and powerpc). Reverting it (with trivial conflict fixes) on top of
>> today’s linux-next fixed the issue.
>> configs:
>> https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config
>> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
>> 
>> [1] https://lore.kernel.org/lkml/2019132458.342979...@infradead.org/
>> 
>> [  115.799327][T1] Registered efivars operations
>> [  115.849770][T1] clocksource: Switched to clocksource arch_sys_counter
>> [  115.901145][T1] Could not initialize trace point 
>> events/sys_enter_rt_sigreturn
>> [  115.908854][T1] Could not create directory for event 
>> sys_enter_rt_sigreturn
>> [  115.998949][T1] Could not initialize trace point 
>> events/sys_enter_restart_syscall
>> [  116.006802][T1] Could not create directory for event 
>> sys_enter_restart_syscall
>> [  116.062702][T1] Could not initialize trace point 
>> events/sys_enter_getpid
>> [  116.069828][T1] Could not create directory for event sys_enter_getpid
>> [  116.078058][T1] Could not initialize trace point 
>> events/sys_enter_gettid
>> [  116.085181][T1] Could not create directory for event sys_enter_gettid
>> [  116.093405][T1] Could not initialize trace point 
>> events/sys_enter_getppid
>> [  116.100612][T1] Could not create directory for event sys_enter_getppid
>> [  116.108989][T1] Could not initialize trace point 
>> events/sys_enter_getuid
>> [  116.116058][T1] Could not create directory for event sys_enter_getuid
>> [  116.124250][T1] Could not initialize trace point 
>> events/sys_enter_geteuid
>> [  116.131457][T1] Could not create directory for event sys_enter_geteuid
>> [  116.139840][T1] Could not initialize trace point 
>> events/sys_enter_getgid
>> [  116.146908][T1] Could not create directory for event sys_enter_getgid
>> [  116.155163][T1] Could not initialize trace point 
>> events/sys_enter_getegid
>> [  116.162370][T1] Could not create directory for event sys_enter_getegid
>> [  116.178015][T1] Could not initialize trace point 
>> events/sys_enter_setsid
>> [  116.185138][T1] Could not create directory for event sys_enter_setsid
>> [  116.269307][T1] Could not initialize trace point 
>> events/sys_enter_sched_yield
>> [  116.276811][T1] Could not create directory for event 
>> sys_enter_sched_yield
>> [  116.527652][T1] Could not initialize trace point 
>> events/sys_enter_munlockall
>> [  116.535126][T1] Could not create directory for event 
>> sys_enter_munlockall
>> [  116.622096][T1] Could not initialize trace point 
>> events/sys_enter_vhangup
>> [  116.629307][T1] Could not create directory for event sys_enter_vhangup
>> [  116.783867][T1] Could not initialize trace point events/sys_enter_sync
>> [  116.790819][T1] Could not create directory for event sys_enter_sync
>> [  117.723402][T1] pnp: PnP ACPI init
> 
> I noticed that all of the above have zero parameters. Does the
> following patch fix it?
> 
> (note, I prefer "ret" and "i" on different lines anyway)
> 
> -- Steve
> 
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 53935259f701..abb70c71fe60 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -269,7 +269,8 @@ static int __init syscall_enter_define_fields(struct 
> trace_event_call *call)
>   struct syscall_trace_enter trace;
>   struct syscall_metadata *meta = call->data;
>   int offset = offsetof(typeof(trace), args);
> - int ret, i;
> + int ret = 0;
> + int i;
> 
>   for (i = 0; i < meta->nb_args; i++) {
>   ret = trace_define_field(call, meta->types[i],

Steve, those errors are still there in today’s linux-next. Is this patch on its
way to linux-next?
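
For readers skimming the thread, the pattern behind Steve's quoted patch can be
sketched in standalone C. This is an illustration only: toy_define_field() and
toy_define_fields() are hypothetical stand-ins, and the sketch assumes the real
function returns ret after the loop, which the quoted hunk does not show. With
zero arguments the loop body never runs, so a ret that is not initialized would
be returned with an indeterminate value.

```c
/* Hypothetical stand-in for a per-field setup call: 0 on success. */
int toy_define_field(int idx, int fail_at)
{
	return idx == fail_at ? -1 : 0;
}

/* Define nb_args fields; fail_at < 0 means no field fails. */
int toy_define_fields(int nb_args, int fail_at)
{
	int ret = 0;	/* needed: the loop may run zero times */
	int i;

	for (i = 0; i < nb_args; i++) {
		ret = toy_define_field(i, fail_at);
		if (ret)
			break;
	}

	return ret;
}
```

Calling toy_define_fields(0, -1) returns 0 here, which matches the observation
that every failing event in the log takes no parameters.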



Re: "ftrace: Rework event_create_dir()" triggers boot error messages

2019-12-18 Thread Qian Cai



> On Dec 18, 2019, at 11:31 PM, Steven Rostedt  wrote:
> 
> On Wed, 18 Dec 2019 22:58:23 -0500
> Qian Cai  wrote:
> 
>> The linux-next commit "ftrace: Rework event_create_dir()" [1] triggers boot
>> warnings for Clang-built (Clang version 8.0.1) kernels (reproduced on both
>> arm64 and powerpc). Reverting it (with trivial conflict fixes) on top of
>> today’s linux-next fixed the issue.
>> 
>> configs:
>> https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config
>> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
>> 
>> [1] https://lore.kernel.org/lkml/2019132458.342979...@infradead.org/
>> 
>> [  115.799327][T1] Registered efivars operations
>> [  115.849770][T1] clocksource: Switched to clocksource arch_sys_counter
>> [  115.901145][T1] Could not initialize trace point 
>> events/sys_enter_rt_sigreturn
>> [  115.908854][T1] Could not create directory for event 
>> sys_enter_rt_sigreturn
>> [  115.998949][T1] Could not initialize trace point 
>> events/sys_enter_restart_syscall
>> [  116.006802][T1] Could not create directory for event 
>> sys_enter_restart_syscall
>> [  116.062702][T1] Could not initialize trace point 
>> events/sys_enter_getpid
>> [  116.069828][T1] Could not create directory for event sys_enter_getpid
>> [  116.078058][T1] Could not initialize trace point 
>> events/sys_enter_gettid
>> [  116.085181][T1] Could not create directory for event sys_enter_gettid
>> [  116.093405][T1] Could not initialize trace point 
>> events/sys_enter_getppid
>> [  116.100612][T1] Could not create directory for event sys_enter_getppid
>> [  116.108989][T1] Could not initialize trace point 
>> events/sys_enter_getuid
>> [  116.116058][T1] Could not create directory for event sys_enter_getuid
>> [  116.124250][T1] Could not initialize trace point 
>> events/sys_enter_geteuid
>> [  116.131457][T1] Could not create directory for event sys_enter_geteuid
>> [  116.139840][T1] Could not initialize trace point 
>> events/sys_enter_getgid
>> [  116.146908][T1] Could not create directory for event sys_enter_getgid
>> [  116.155163][T1] Could not initialize trace point 
>> events/sys_enter_getegid
>> [  116.162370][T1] Could not create directory for event sys_enter_getegid
>> [  116.178015][T1] Could not initialize trace point 
>> events/sys_enter_setsid
>> [  116.185138][T1] Could not create directory for event sys_enter_setsid
>> [  116.269307][T1] Could not initialize trace point 
>> events/sys_enter_sched_yield
>> [  116.276811][T1] Could not create directory for event 
>> sys_enter_sched_yield
>> [  116.527652][T1] Could not initialize trace point 
>> events/sys_enter_munlockall
>> [  116.535126][T1] Could not create directory for event 
>> sys_enter_munlockall
>> [  116.622096][T1] Could not initialize trace point 
>> events/sys_enter_vhangup
>> [  116.629307][T1] Could not create directory for event sys_enter_vhangup
>> [  116.783867][T1] Could not initialize trace point events/sys_enter_sync
>> [  116.790819][T1] Could not create directory for event sys_enter_sync
>> [  117.723402][T1] pnp: PnP ACPI init
> 
> I noticed that all of the above have zero parameters. Does the
> following patch fix it?

Yes, it works.

> 
> (note, I prefer "ret" and "i" on different lines anyway)
> 
> -- Steve
> 
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index 53935259f701..abb70c71fe60 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -269,7 +269,8 @@ static int __init syscall_enter_define_fields(struct 
> trace_event_call *call)
>   struct syscall_trace_enter trace;
>   struct syscall_metadata *meta = call->data;
>   int offset = offsetof(typeof(trace), args);
> - int ret, i;
> + int ret = 0;
> + int i;
> 
>   for (i = 0; i < meta->nb_args; i++) {
>   ret = trace_define_field(call, meta->types[i],



"ftrace: Rework event_create_dir()" triggers boot error messages

2019-12-18 Thread Qian Cai
The linux-next commit "ftrace: Rework event_create_dir()" [1] triggers boot
warnings for Clang-built (Clang version 8.0.1) kernels (reproduced on both
arm64 and powerpc). Reverting it (with trivial conflict fixes) on top of
today’s linux-next fixed the issue.

configs:
https://raw.githubusercontent.com/cailca/linux-mm/master/arm64.config
https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config

[1] https://lore.kernel.org/lkml/2019132458.342979...@infradead.org/

[  115.799327][T1] Registered efivars operations
[  115.849770][T1] clocksource: Switched to clocksource arch_sys_counter
[  115.901145][T1] Could not initialize trace point 
events/sys_enter_rt_sigreturn
[  115.908854][T1] Could not create directory for event 
sys_enter_rt_sigreturn
[  115.998949][T1] Could not initialize trace point 
events/sys_enter_restart_syscall
[  116.006802][T1] Could not create directory for event 
sys_enter_restart_syscall
[  116.062702][T1] Could not initialize trace point events/sys_enter_getpid
[  116.069828][T1] Could not create directory for event sys_enter_getpid
[  116.078058][T1] Could not initialize trace point events/sys_enter_gettid
[  116.085181][T1] Could not create directory for event sys_enter_gettid
[  116.093405][T1] Could not initialize trace point events/sys_enter_getppid
[  116.100612][T1] Could not create directory for event sys_enter_getppid
[  116.108989][T1] Could not initialize trace point events/sys_enter_getuid
[  116.116058][T1] Could not create directory for event sys_enter_getuid
[  116.124250][T1] Could not initialize trace point events/sys_enter_geteuid
[  116.131457][T1] Could not create directory for event sys_enter_geteuid
[  116.139840][T1] Could not initialize trace point events/sys_enter_getgid
[  116.146908][T1] Could not create directory for event sys_enter_getgid
[  116.155163][T1] Could not initialize trace point events/sys_enter_getegid
[  116.162370][T1] Could not create directory for event sys_enter_getegid
[  116.178015][T1] Could not initialize trace point events/sys_enter_setsid
[  116.185138][T1] Could not create directory for event sys_enter_setsid
[  116.269307][T1] Could not initialize trace point 
events/sys_enter_sched_yield
[  116.276811][T1] Could not create directory for event 
sys_enter_sched_yield
[  116.527652][T1] Could not initialize trace point 
events/sys_enter_munlockall
[  116.535126][T1] Could not create directory for event sys_enter_munlockall
[  116.622096][T1] Could not initialize trace point events/sys_enter_vhangup
[  116.629307][T1] Could not create directory for event sys_enter_vhangup
[  116.783867][T1] Could not initialize trace point events/sys_enter_sync
[  116.790819][T1] Could not create directory for event sys_enter_sync
[  117.723402][T1] pnp: PnP ACPI init
[  117.736379][T1] system 00:00: [mem 0x3000-0x3fff window] could 
not be reserved
[  126.020353][T1] pnp: PnP ACPI: found 1 devices
[  126.093919][T1] NET: Registered protocol family 2
[  126.180007][T1] tcp_listen_portaddr_hash hash table entries: 65536 
(order: 6, 4718592 bytes, vmalloc)
[  126.206510][T1] TCP established hash table entries: 524288 (order: 6, 
4194304 bytes, vmalloc)
[  126.227766][T1] TCP bind hash table entries: 65536 (order: 6, 4194304 
bytes, vmalloc)
[  126.240146][T1] TCP: Hash tables configured (established 524288 bind 
65536)

XFS check crash (WAS Re: [PATCH v11 1/4] kasan: support backing vmalloc space with real shadow memory)

2019-11-29 Thread Qian Cai



> On Nov 29, 2019, at 7:29 AM, Daniel Axtens  wrote:
> 
 
 Nope, it's vm_map_ram() not being handled
>>> 
>>> 
>>> Another suspicious one. Related to kasan/vmalloc?
>> 
>> Very likely the same as with ion:
>> 
>> # git grep vm_map_ram|grep xfs
>> fs/xfs/xfs_buf.c:* vm_map_ram() will allocate auxiliary 
>> structures (e.g.
>> fs/xfs/xfs_buf.c:   bp->b_addr = vm_map_ram(bp->b_pages, 
>> bp->b_page_count,
> 
> Aaargh, that's an embarrassing miss.
> 
> It's a bit intricate because kasan_vmalloc_populate function is
> currently set up to take a vm_struct not a vmap_area, but I'll see if I
> can get something simple out this evening - I'm away for the first part
> of next week.
> 
> Do you have to do anything interesting to get it to explode with xfs? Is
> it as simple as mounting a drive and doing some I/O? Or do you need to
> do something more involved?


Instead, I trigger something a bit different: I manually trigger a crash first so
that the XFS partition is shut down uncleanly.

# echo c >/proc/sysrq-trigger

and then reboot into the same kernel, which crashes while checking the XFS
filesystem. This can be worked around by first booting an older kernel (v4.18),
where xfs_repair succeeds; rebooting into the new linux-next kernel is then fine.

[  OK  ] Started File System Check on /dev/mapper/rhel_hpe--sy680gen9--01-root.
 Mounting /sysroot...
[  141.177726][ T1730] SGI XFS with security attributes, no debug enabled
[  141.432382][ T1720] XFS (dm-0): Mounting V5 Filesystem
[**] A start job is running for /sysroot (39s / 1min 51s)[  158.738816][ 
T1720] XFS (dm-0): Starting recovery (logdev: internal)
[  158.792010][  T844] BUG: unable to handle page fault for address: 
f52001fc
[  158.830913][  T844] #PF: supervisor read access in kernel mode
[  158.859680][  T844] #PF: error_code(0x) - not-present page
[  158.886057][  T844] PGD 207ffe3067 P4D 207ffe3067 PUD 2071f2067 PMD 
f68e08067 PTE 0
[  158.922065][  T844] Oops:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  158.949620][  T844] CPU: 112 PID: 844 Comm: kworker/112:1 Not tainted 
5.4.0-next-20191127+ #3
[  158.988759][  T844] Hardware name: HP Synergy 680 Gen9/Synergy 680 Gen9 
Compute Module, BIOS I40 05/23/2018
[  159.033380][  T844] Workqueue: xfs-buf/dm-0 xfs_buf_ioend_work [xfs]
[  159.061935][  T844] RIP: 0010:__asan_load4+0x3a/0xa0
[  159.061941][  T844] Code: 00 00 00 00 00 00 ff 48 39 f8 77 6d 48 8d 47 03 48 
89 c2 83 e2 07 48 83 fa 02 76 30 48 be 00 00 00 00 00 fc ff df 48 c1 e8 03 <0f> 
b6 04 30 84 c0 75 3e 5d c3 48 b8 00 00 00 00 00 80 ff ff eb c7
[  159.061944][  T844] RSP: 0018:c9000a4b7cb0 EFLAGS: 00010a06
[  159.061949][  T844] RAX: 192001fc RBX: c9000f80 RCX: 
c06d10ae
[  159.061952][  T844] RDX: 0003 RSI: dc00 RDI: 
c9000f800060
[  159.061955][  T844] RBP: c9000a4b7cb0 R08: ed130bee89e5 R09: 
0001
[  159.061958][  T844] R10: ed130bee89e4 R11: 88985f744f23 R12: 

[  159.061961][  T844] R13: 889724be0040 R14: 88836c8e5000 R15: 
000c8000
[  159.061965][  T844] FS:  () GS:88985f70() 
knlGS:
[  159.061968][  T844] CS:  0010 DS:  ES:  CR0: 80050033
[  159.061971][  T844] CR2: f52001fc CR3: 001f615b8004 CR4: 
003606e0
[  159.061974][  T844] DR0:  DR1:  DR2: 

[  159.061976][  T844] DR3:  DR6: fffe0ff0 DR7: 
0400
[  159.061978][  T844] Call Trace:
[  159.062118][  T844]  xfs_inode_buf_verify+0x13e/0x230 [xfs]
[  159.062264][  T844]  xfs_inode_buf_readahead_verify+0x13/0x20 [xfs]
[  159.634441][  T844]  xfs_buf_ioend+0x153/0x6b0 [xfs]
[  159.634455][  T844]  ? trace_hardirqs_on+0x3a/0x160
[  159.679087][  T844]  xfs_buf_ioend_work+0x15/0x20 [xfs]
[  159.702689][  T844]  process_one_work+0x579/0xb90
[  159.723898][  T844]  ? pwq_dec_nr_in_flight+0x170/0x170
[  159.747499][  T844]  worker_thread+0x63/0x5b0
[  159.767531][  T844]  ? process_one_work+0xb90/0xb90
[  159.789549][  T844]  kthread+0x1e6/0x210
[  159.807166][  T844]  ? kthread_create_worker_on_cpu+0xc0/0xc0
[  159.833064][  T844]  ret_from_fork+0x3a/0x50
[  159.852200][  T844] Modules linked in: xfs sd_mod bnx2x mdio firmware_class 
hpsa scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod
[  159.915273][  T844] CR2: f52001fc
[  159.934029][  T844] ---[ end trace 3f3b30f5fc34bbf1 ]---
[  159.957937][  T844] RIP: 0010:__asan_load4+0x3a/0xa0
[  159.980316][  T844] Code: 00 00 00 00 00 00 ff 48 39 f8 77 6d 48 8d 47 03 48 
89 c2 83 e2 07 48 83 fa 02 76 30 48 be 00 00 00 00 00 fc ff df 48 c1 e8 03 <0f> 
b6 04 30 84 c0 75 3e 5d c3 48 b8 00 00 00 00 00 80 ff ff eb c7
[  160.068386][  T844] RSP: 0018:c9000a4b7cb0 EFLAGS: 00010a06
[  160.068389][  T844] RAX: 192001fc RBX: c9000f80 RCX: 
c06d10ae

Re: lockdep warning while booting POWER9 PowerNV

2019-11-21 Thread Qian Cai



> On Sep 4, 2019, at 11:55 PM, Michael Ellerman  wrote:
> 
> Bart Van Assche  writes:
>> On 8/30/19 2:13 PM, Qian Cai wrote:
>>> https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
>>> 
>>> Once in a while, booting an IBM POWER9 PowerNV system (8335-GTH) would generate
>>> a warning in lockdep_register_key() at,
>>> 
>>> if (WARN_ON_ONCE(static_obj(key)))
>>> 
>>> because
>>> 
>>> key = 0xc19ad118
>>> &_stext = 0xc000
>>> &_end = 0xc49d
>>> 
>>> i.e., it causes static_obj() to return 1.
>> 
>> (back from a trip)
>> 
>> Hi Qian,
>> 
>> Does this mean that on POWER9 it can happen that a dynamically allocated 
>> object has an address that falls between &_stext and &_end?
> 
> I thought that was true on all arches due to initmem, but seems not.
> 
> I guess we have the same problem as s390 and we need to define
> arch_is_kernel_initmem_freed().
> 
> Qian, can you try this:
> 
> diff --git a/arch/powerpc/include/asm/sections.h 
> b/arch/powerpc/include/asm/sections.h
> index 4a1664a8658d..616b1b7b7e52 100644
> --- a/arch/powerpc/include/asm/sections.h
> +++ b/arch/powerpc/include/asm/sections.h
> @@ -5,8 +5,22 @@
> 
> #include 
> #include 
> +
> +#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
> +
> #include 
> 
> +extern bool init_mem_is_free;
> +
> +static inline int arch_is_kernel_initmem_freed(unsigned long addr)
> +{
> + if (!init_mem_is_free)
> + return 0;
> +
> + return addr >= (unsigned long)__init_begin &&
> + addr < (unsigned long)__init_end;
> +}
> +
> extern char __head_end[];
> 
> #ifdef __powerpc64__
> 

Michael, this fix is also needed, as another one of those warnings is now
triggered where the allocated memory is from initmem.

[   31.326825] key = c19049a0
[   31.326862] stext = c000, end = c70e
[   31.326907] init_start = c0c7, init_end = c20f

[   31.325021] WARNING: CPU: 0 PID: 5 at kernel/locking/lockdep.c:1121 
lockdep_register_key+0xb4/0x340
[   31.325061] Modules linked in: tg3(+) ahci(+) libahci libata mdio libphy 
firmware_class dm_mirror dm_region_hash dm_log dm_mod
[   31.325128] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 
5.4.0-rc8-next-20191120+ #4
[   31.325190] Workqueue: events work_for_cpu_fn
[   31.325215] NIP:  c01a23a4 LR: c075eccc CTR: 
[   31.325257] REGS: c0002e72f4c0 TRAP: 0700   Not tainted  
(5.4.0-rc8-next-20191120+)
[   31.325320] MSR:  9282b033   CR: 
48000c20  XER: 2004
[   31.325392] CFAR: c01a233c IRQMASK: 0 
   GPR00: c075eccc c0002e72f750 c2cff500 
c70df500 
   GPR04: c014beb01990 c19042b8  
 
   GPR08:   c0425e28 
c00c04761020 
   GPR12:  c70e c0002e5214f8 
c01ffca018c8 
   GPR16: c01ffca018e4 c01ffca01c80 c01ffca018d0 
c0002e6e3e48 
   GPR20: c2cbf500 c0002e520080 c01ffca05408 
c0002e6e3e00 
   GPR24: c07d36d0 0005 0005 
c1904000 
   GPR28: c70e c000 c19049a0 
c0002e72f7f0 
[   31.325765] NIP [c01a23a4] lockdep_register_key+0xb4/0x340
[   31.325809] LR [c075eccc] alloc_netdev_mqs+0x15c/0x500
[   31.325848] Call Trace:
[   31.325886] [c0002e72f750] [0005] 0x5 (unreliable)
[   31.325930] [c0002e72f7f0] [c075eccc] 
alloc_netdev_mqs+0x15c/0x500
[   31.325984] [c0002e72f8d0] [c07d37f0] 
alloc_etherdev_mqs+0x60/0x90
[   31.326047] [c0002e72f910] [c0080f150110] tg3_init_one+0x108/0x1d00 
[tg3]
[   31.326098] [c0002e72fac0] [c0633b48] local_pci_probe+0x78/0x100
[   31.326143] [c0002e72fb50] [c0134b60] work_for_cpu_fn+0x40/0x70
[   31.326190] [c0002e72fb80] [c013927c] 
process_one_work+0x3ac/0x710
[   31.326221] [c0002e72fc70] [c0138d90] 
process_scheduled_works+0x60/0xa0
[   31.326274] [c0002e72fcb0] [c0139ba4] worker_thread+0x344/0x4a0
[   31.326317] [c0002e72fda0] [c0142f68] kthread+0x1b8/0x1e0
[   31.326363] [c0002e72fe20] [c000b748] 
ret_from_kernel_thread+0x5c/0x74
[   31.326412] Instruction dump:
[   31.326448] 2823 418200a0 7fc3f378 48191fd9 6000 70630001 41810018 
7fc3f378 
[   31.326510] 4807fe25 6000 70630001 40810060 <0fe0> 3c62fffc 8883fa2f 
70840001 
[   31.326573] irq event stamp: 8
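
For illustration only, here is a minimal userspace sketch of the check under
discussion: an address inside the kernel image is treated as static unless it
lies in the init range and init memory has already been freed. The struct, the
function names and any addresses used with them are hypothetical stand-ins, not
the kernel's static_obj() implementation; the actual change is the
arch_is_kernel_initmem_freed() hunk quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the linker-provided section bounds. */
struct image_layout {
	uintptr_t stext;       /* start of the kernel image */
	uintptr_t end;         /* end of the kernel image (exclusive) */
	uintptr_t init_begin;  /* start of init memory */
	uintptr_t init_end;    /* end of init memory (exclusive) */
	bool init_mem_is_free; /* set once init memory has been handed back */
};

/* Same shape as the quoted arch_is_kernel_initmem_freed(). */
static bool initmem_freed(const struct image_layout *l, uintptr_t addr)
{
	if (!l->init_mem_is_free)
		return false;

	return addr >= l->init_begin && addr < l->init_end;
}

/* Simplified model: static means inside the image and not in freed initmem. */
static bool looks_static(const struct image_layout *l, uintptr_t addr)
{
	if (initmem_freed(l, addr))
		return false;

	return addr >= l->stext && addr < l->end;
}
```

With a layout where the init range sits inside the image, an address in that
range is reported as static before init memory is freed and as dynamic
afterwards, which is the behaviour the lockdep warning depends on.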

Re: powerpc ftrace broken due to "manual merge of the ftrace tree with the arm64 tree"

2019-11-18 Thread Qian Cai
On Mon, 2019-11-18 at 10:16 -0500, Steven Rostedt wrote:
> On Mon, 18 Nov 2019 09:58:42 -0500
> Steven Rostedt  wrote:
> 
> > On Mon, 18 Nov 2019 09:51:04 -0500
> > Steven Rostedt  wrote:
> > 
> > > > > Test this commit please: b83b43ffc6e4b514ca034a0fbdee01322e2f7022 
> > > > >  
> > > > 
> > > > # git reset --hard b83b43ffc6e4b514ca034a0fbdee01322e2f7022
> > > > 
> > > > Yes, that one is bad.
> > > 
> > > Can you see if this patch fixes the issue for you?  
> > 
> > Don't bother. This isn't the right fix, I now see the real issue.
> > 
> > New fix coming shortly.
> > 
> 
> Can you try this?

Yes, it works fine.

> 
> It appears that I picked a name "ftrace_graph_stub", that was already in
> use by powerpc. This just renames the function stub I used.
> 
> -- Steve
> 
> diff --git a/include/asm-generic/vmlinux.lds.h 
> b/include/asm-generic/vmlinux.lds.h
> index 0f358be551cd..996db32c491b 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -112,7 +112,7 @@
>  #ifdef CONFIG_FTRACE_MCOUNT_RECORD
>  #ifdef CC_USING_PATCHABLE_FUNCTION_ENTRY
>  /*
> - * Need to also make ftrace_graph_stub point to ftrace_stub
> + * Need to also make ftrace_stub_graph point to ftrace_stub
>   * so that the same stub location may have different protocols
>   * and not mess up with C verifiers.
>   */
> @@ -120,17 +120,17 @@
>   __start_mcount_loc = .; \
>   KEEP(*(__patchable_function_entries))   \
>   __stop_mcount_loc = .;  \
> - ftrace_graph_stub = ftrace_stub;
> + ftrace_stub_graph = ftrace_stub;
>  #else
>  #define MCOUNT_REC() . = ALIGN(8);   \
>   __start_mcount_loc = .; \
>   KEEP(*(__mcount_loc))   \
>   __stop_mcount_loc = .;  \
> - ftrace_graph_stub = ftrace_stub;
> + ftrace_stub_graph = ftrace_stub;
>  #endif
>  #else
>  # ifdef CONFIG_FUNCTION_TRACER
> -#  define MCOUNT_REC()   ftrace_graph_stub = ftrace_stub;
> +#  define MCOUNT_REC()   ftrace_stub_graph = ftrace_stub;
>  # else
>  #  define MCOUNT_REC()
>  # endif
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index fa3ce10d0405..67e0c462b059 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -336,10 +336,10 @@ int ftrace_graph_entry_stub(struct ftrace_graph_ent 
> *trace)
>   * Simply points to ftrace_stub, but with the proper protocol.
>   * Defined by the linker script in linux/vmlinux.lds.h
>   */
> -extern void ftrace_graph_stub(struct ftrace_graph_ret *);
> +extern void ftrace_stub_graph(struct ftrace_graph_ret *);
>  
>  /* The callbacks that hook a function */
> -trace_func_graph_ret_t ftrace_graph_return = ftrace_graph_stub;
> +trace_func_graph_ret_t ftrace_graph_return = ftrace_stub_graph;
>  trace_func_graph_ent_t ftrace_graph_entry = ftrace_graph_entry_stub;
>  static trace_func_graph_ent_t __ftrace_graph_entry = ftrace_graph_entry_stub;
>  
> @@ -619,7 +619,7 @@ void unregister_ftrace_graph(struct fgraph_ops *gops)
>   goto out;
>  
>   ftrace_graph_active--;
> - ftrace_graph_return = ftrace_graph_stub;
> + ftrace_graph_return = ftrace_stub_graph;
>   ftrace_graph_entry = ftrace_graph_entry_stub;
>   __ftrace_graph_entry = ftrace_graph_entry_stub;
>   ftrace_shutdown(_ops, FTRACE_STOP_FUNC_RET);


Re: powerpc ftrace broken due to "manual merge of the ftrace tree with the arm64 tree"

2019-11-15 Thread Qian Cai
On Fri, 2019-11-15 at 16:02 -0500, Steven Rostedt wrote:
> On Fri, 15 Nov 2019 15:28:52 -0500
> Qian Cai  wrote:
> 
> > # echo function >/sys/kernel/debug/tracing/current_tracer
> > 
> > It hangs forever with today's linux-next on powerpc. Reverting the conflict fix
> > [1] as below fixes the issue.
> > 
> > [1] 
> > https://lore.kernel.org/linux-next/20191115135357.10386...@canb.auug.org.au/
> 
> What's your config file?

https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config

> 
> And can you test the two conflicting commits to see which one caused
> your error?
> 
> Test this commit please: b83b43ffc6e4b514ca034a0fbdee01322e2f7022

# git reset --hard b83b43ffc6e4b514ca034a0fbdee01322e2f7022

Yes, that one is bad.

> 
> And see if the issue is with that one, and not with the one without it.
> 
> -- Steve
> 
> 
> > 
> > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-
> > generic/vmlinux.lds.h
> > index 7d0d03a03d4d..a9f79b53 100644
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -136,29 +136,20 @@
> >  #endif
> >  
> >  #ifdef CONFIG_FTRACE_MCOUNT_RECORD
> > -/*
> > - * The ftrace call sites are logged to a section whose name depends on the
> > - * compiler option used. A given kernel image will only use one, AKA
> > - * FTRACE_CALLSITE_SECTION. We capture all of them here to avoid header
> > - * dependencies for FTRACE_CALLSITE_SECTION's definition.
> > - */
> > -/*
> > - * Need to also make ftrace_graph_stub point to ftrace_stub
> > - * so that the same stub location may have different protocols
> > - * and not mess up with C verifiers.
> > - */
> > -#define MCOUNT_REC()   . = ALIGN(8);   \
> > +#ifdef CC_USING_PATCHABLE_FUNCTION_ENTRY
> > +#define MCOUNT_REC()   . = ALIGN(8)\
> >     __start_mcount_loc = .; \
> > -   KEEP(*(__mcount_loc))   \
> >     KEEP(*(__patchable_function_entries))   \
> >     __stop_mcount_loc = .;  \
> >     ftrace_graph_stub = ftrace_stub;
> >  #else
> > -# ifdef CONFIG_FUNCTION_TRACER
> > -#  define MCOUNT_REC() ftrace_graph_stub = ftrace_stub;
> > -# else
> > -#  define MCOUNT_REC()
> > -# endif
> > +#define MCOUNT_REC()   . = ALIGN(8);   \
> > +   __start_mcount_loc = .; \
> > +   KEEP(*(__mcount_loc))   \
> > +   __stop_mcount_loc = .;
> > +#endif
> > +#else
> > +#define MCOUNT_REC()
> >  #endif
> >  
> >  #ifdef CONFIG_TRACE_BRANCH_PROFILING
> 
> 


powerpc ftrace broken due to "manual merge of the ftrace tree with the arm64 tree"

2019-11-15 Thread Qian Cai
# echo function >/sys/kernel/debug/tracing/current_tracer

It hangs forever with today's linux-next on powerpc. Reverting the conflict fix
[1] as below fixes the issue.

[1] https://lore.kernel.org/linux-next/20191115135357.10386...@canb.auug.org.au/

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-
generic/vmlinux.lds.h
index 7d0d03a03d4d..a9f79b53 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,29 +136,20 @@
 #endif
 
 #ifdef CONFIG_FTRACE_MCOUNT_RECORD
-/*
- * The ftrace call sites are logged to a section whose name depends on the
- * compiler option used. A given kernel image will only use one, AKA
- * FTRACE_CALLSITE_SECTION. We capture all of them here to avoid header
- * dependencies for FTRACE_CALLSITE_SECTION's definition.
- */
-/*
- * Need to also make ftrace_graph_stub point to ftrace_stub
- * so that the same stub location may have different protocols
- * and not mess up with C verifiers.
- */
-#define MCOUNT_REC()   . = ALIGN(8);   \
+#ifdef CC_USING_PATCHABLE_FUNCTION_ENTRY
+#define MCOUNT_REC()   . = ALIGN(8)\
    __start_mcount_loc = .; \
-   KEEP(*(__mcount_loc))   \
    KEEP(*(__patchable_function_entries))   \
    __stop_mcount_loc = .;  \
    ftrace_graph_stub = ftrace_stub;
 #else
-# ifdef CONFIG_FUNCTION_TRACER
-#  define MCOUNT_REC() ftrace_graph_stub = ftrace_stub;
-# else
-#  define MCOUNT_REC()
-# endif
+#define MCOUNT_REC()   . = ALIGN(8);   \
+   __start_mcount_loc = .; \
+   KEEP(*(__mcount_loc))   \
+   __stop_mcount_loc = .;
+#endif
+#else
+#define MCOUNT_REC()
 #endif
 
 #ifdef CONFIG_TRACE_BRANCH_PROFILING


Re: [PATCH v11 1/4] kasan: support backing vmalloc space with real shadow memory

2019-11-15 Thread Qian Cai
On Thu, 2019-10-31 at 20:39 +1100, Daniel Axtens wrote:
>   /*
>* In this function, newly allocated vm_struct has VM_UNINITIALIZED
>* flag. It means that vm_struct is not fully initialized.
> @@ -3377,6 +3411,9 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned 
> long *offsets,
>  
>   setup_vmalloc_vm_locked(vms[area], vas[area], VM_ALLOC,
>pcpu_get_vm_areas);
> +
> + /* assume success here */
> + kasan_populate_vmalloc(sizes[area], vms[area]);
>   }
>   spin_unlock(_area_lock);

This is wrong: it ends up doing a GFP_KERNEL allocation while in_atomic(), since
vmap_area_lock is held here.

[   32.231000][T1] BUG: sleeping function called from invalid context at
mm/page_alloc.c:4681
[   32.239934][T1] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1,
name: swapper/0
[   32.248896][T1] 2 locks held by swapper/0/1:
[   32.253580][T1]  #0: 880d6160 (pcpu_alloc_mutex){+.+.}, at:
pcpu_alloc+0x707/0xbe0
[   32.262305][T1]  #1: 88105558 (vmap_area_lock){+.+.}, at:
pcpu_get_vm_areas+0xc4f/0x1e60
[   32.271919][T1] CPU: 4 PID: 1 Comm: swapper/0 Tainted:
GW 5.4.0-rc7-next-20191115+ #6
[   32.281555][T1] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385
Gen10, BIOS A40 03/09/2018
[   32.281896][T1] Call Trace:
[   32.281896][T1]  dump_stack+0xa0/0xea
[   32.281896][T1]  ___might_sleep.cold.89+0xd2/0x122
[   32.301996][T1]  __might_sleep+0x73/0xe0
[   32.301996][T1]  __alloc_pages_nodemask+0x442/0x720
[   32.311564][T1]  ? __kasan_check_read+0x11/0x20
[   32.311564][T1]  ? __alloc_pages_slowpath+0x1870/0x1870
[   32.321705][T1]  ? mark_held_locks+0x86/0xb0
[   32.321705][T1]  ? _raw_spin_unlock_irqrestore+0x44/0x50
[   32.331563][T1]  alloc_page_interleave+0x18/0x130
[   32.331563][T1]  alloc_pages_current+0xf6/0x110
[   32.341979][T1]  __get_free_pages+0x12/0x60
[   32.341979][T1]  __pte_alloc_kernel+0x1b/0xc0
[   32.351563][T1]  apply_to_page_range+0x5b5/0x690
[   32.351563][T1]  ? memset+0x40/0x40
[   32.361693][T1]  kasan_populate_vmalloc+0x6d/0xa0
[   32.361693][T1]  pcpu_get_vm_areas+0xd49/0x1e60
[   32.371425][T1]  ? vm_map_ram+0x10d0/0x10d0
[   32.371425][T1]  ? pcpu_mem_zalloc+0x65/0x90
[   32.371425][T1]  pcpu_create_chunk+0x152/0x3f0
[   32.371425][T1]  pcpu_alloc+0xa2f/0xbe0
[   32.391423][T1]  ? pcpu_balance_workfn+0xb00/0xb00
[   32.391423][T1]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
[   32.391423][T1]  ? kasan_kmalloc+0x9/0x10
[   32.391423][T1]  ? kmem_cache_alloc_trace+0x1f8/0x470
[   32.411421][T1]  ? iommu_dma_get_resv_regions+0x10/0x10
[   32.411421][T1]  __alloc_percpu+0x15/0x20
[   32.411421][T1]  init_iova_flush_queue+0x79/0x230
[   32.411421][T1]  iommu_setup_dma_ops+0x87d/0x890
[   32.431420][T1]  ? __kasan_check_write+0x14/0x20
[   32.431420][T1]  ? refcount_sub_and_test_checked+0xba/0x170
[   32.431420][T1]  ? __kasan_check_write+0x14/0x20
[   32.431420][T1]  ? iommu_dma_alloc+0x1e0/0x1e0
[   32.451420][T1]  ? iommu_group_get_for_dev+0x153/0x450
[   32.451420][T1]  ? refcount_dec_and_test_checked+0x11/0x20
[   32.451420][T1]  ? kobject_put+0x36/0x270
[   32.451420][T1]  amd_iommu_add_device+0x560/0x710
[   32.471423][T1]  ? iommu_probe_device+0x150/0x150
[   32.471423][T1]  iommu_probe_device+0x8c/0x150
[   32.471423][T1]  add_iommu_group+0xe/0x20
[   32.471423][T1]  bus_for_each_dev+0xfe/0x160
[   32.491421][T1]  ? subsys_dev_iter_init+0x80/0x80
[   32.491421][T1]  ? blocking_notifier_chain_register+0x4f/0x70
[   32.491421][T1]  bus_set_iommu+0xc6/0x100
[   32.491421][T1]  ? e820__memblock_setup+0x10e/0x10e
[   32.511571][T1]  amd_iommu_init_api+0x25/0x3e
[   32.511571][T1]  state_next+0x214/0x7ea
[   32.511571][T1]  ? check_flags.part.25+0x86/0x220
[   32.511571][T1]  ? early_amd_iommu_init+0x10c0/0x10c0
[   32.531421][T1]  ? e820__memblock_setup+0x10e/0x10e
[   32.531421][T1]  ? rcu_read_lock_sched_held+0xac/0xe0
[   32.531421][T1]  ? e820__memblock_setup+0x10e/0x10e
[   32.551423][T1]  amd_iommu_init+0x25/0x57
[   32.551423][T1]  pci_iommu_init+0x26/0x62
[   32.551423][T1]  do_one_initcall+0xfe/0x4fa
[   32.551423][T1]  ? perf_trace_initcall_level+0x240/0x240
[   32.571420][T1]  ? rcu_read_lock_sched_held+0xac/0xe0
[   32.571420][T1]  ? rcu_read_lock_bh_held+0xc0/0xc0
[   32.571420][T1]  ? __kasan_check_read+0x11/0x20
[   32.571420][T1]  kernel_init_freeable+0x420/0x4e4
[   32.591420][T1]  ? start_kernel+0x6a9/0x6a9
[   32.591420][T1]  ? lockdep_hardirqs_on+0x1b0/0x2a0
[   32.591420][T1]  ? _raw_spin_unlock_irq+0x27/0x40
[   32.591420][T1]  ? rest_init+0x307/0x307
[   32.611557][T1]  kernel_init+0x11/0x139
[   32.611557][T1]  ? rest_init+0x307/0x307
[   32.611557][T1]  ret_from_fork+0x27/0x50



Section mismatch warnings on powerpc

2019-10-30 Thread Qian Cai
I still see these warnings:

WARNING: vmlinux.o(.text+0x2d04): Section mismatch in reference from the
variable __boot_from_prom to the function .init.text:prom_init()
The function __boot_from_prom() references
the function __init prom_init().
This is often because __boot_from_prom lacks a __init
annotation or the annotation of prom_init is wrong.

WARNING: vmlinux.o(.text+0x2ec8): Section mismatch in reference from the
variable start_here_common to the function .init.text:start_kernel()
The function start_here_common() references
the function __init start_kernel().
This is often because start_here_common lacks a __init
annotation or the annotation of start_kernel is wrong.

There is a patch for this:

http://patchwork.ozlabs.org/patch/895442/

Is it still waiting for Michael to come up with better names?


Re: [PATCH v2] powerpc/imc: Dont create debugfs files for cpu-less nodes

2019-10-30 Thread Qian Cai
On Tue, 2019-07-23 at 16:57 +0530, Anju T Sudhakar wrote:
> Hi Qian,
> 
> On 7/16/19 12:11 AM, Qian Cai wrote:
> > On Thu, 2019-07-11 at 14:53 +1000, Michael Ellerman wrote:
> > > Hi Maddy,
> > > 
> > > Madhavan Srinivasan  writes:
> > > > diff --git a/arch/powerpc/platforms/powernv/opal-imc.c
> > > > b/arch/powerpc/platforms/powernv/opal-imc.c
> > > > index 186109bdd41b..e04b20625cb9 100644
> > > > --- a/arch/powerpc/platforms/powernv/opal-imc.c
> > > > +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> > > > @@ -69,20 +69,20 @@ static void export_imc_mode_and_cmd(struct 
> > > > device_node
> > > > *node,
> > > >     if (of_property_read_u32(node, "cb_offset", _offset))
> > > >     cb_offset = IMC_CNTL_BLK_OFFSET;
> > > >   
> > > > -   for_each_node(nid) {
> > > > -   loc = (u64)(pmu_ptr->mem_info[chip].vbase) + cb_offset;
> > > > +   while (ptr->vbase != NULL) {
> > > 
> > > This means you'll bail out as soon as you find a node with no vbase, but
> > > it's possible we could have a CPU-less node intermingled with other
> > > nodes.
> > > 
> > > So I think you want to keep the for loop, but continue if you see a NULL
> > > vbase?
> > 
> > Not sure if this will also take care of some of the messages seen during boot
> > on today's linux-next, even without this patch.
> > 
> > 
> > [   18.077780][T1] debugfs: Directory 'imc' with parent 'powerpc' already present!
> > 
> > 
> 
> This is introduced by a recent commit: c33d442328f55 (debugfs: make error
> message a bit more verbose).
> 
> So basically, the debugfs imc_* file is created per node, and is created by the
> first nest unit that is registered. For the subsequent nest units,
> debugfs_create_dir() will just return since the imc_* file already exists.
> 
> The commit "c33d442328f55 (debugfs: make error message a bit more verbose)"
> prints a message if the debugfs file already exists in debugfs_create_dir().
> That is why we are encountering these messages now.
> 
> This patch (i.e., powerpc/imc: Dont create debugfs files for cpu-less nodes)
> will address the initial issue, i.e., "numa crash while reading imc_* debugfs
> files for cpu less nodes", and will not address these debugfs messages.
> 
> But yeah, this is a good catch. We can have some checks to avoid these debugfs
> messages.

Anju, do you still plan to fix those "Directory 'imc' with parent 'powerpc'
already present!" warnings as they are still there in the latest linux-next?

> 
> 
> Hi Michael,
> 
> Do we need to have a separate patch to address these debugfs messages, or can
> we address the same in the next version of this patch itself?
> 
> 
> Thanks,
> 
> Anju
> 
> 
> 
> 
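
The difference Michael points out in the quoted review above, between stopping
at the first node without a vbase and skipping only that node, can be shown with
a small standalone sketch. The struct and function names are made up for
illustration and are not the opal-imc code.

```c
#include <stddef.h>

/* Hypothetical per-node record; vbase is NULL for a CPU-less node. */
struct node_mem {
	void *vbase;
};

/* Stops at the first NULL vbase, so later nodes are never visited. */
static size_t visit_until_null(const struct node_mem *nodes, size_t n)
{
	size_t visited = 0;

	for (size_t i = 0; i < n && nodes[i].vbase != NULL; i++)
		visited++;

	return visited;
}

/* Walks every node and only skips the ones without a vbase. */
static size_t visit_skip_null(const struct node_mem *nodes, size_t n)
{
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		if (nodes[i].vbase == NULL)
			continue;
		visited++;
	}

	return visited;
}
```

With a CPU-less node in the middle of the array, the first variant visits only
the nodes before it, while the second still reaches the nodes after it.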


Re: [PATCH v7] numa: make node_to_cpumask_map() NUMA_NO_NODE aware

2019-10-30 Thread Qian Cai



> On Oct 30, 2019, at 6:28 AM, Peter Zijlstra  wrote:
> 
> It only makes 'wild' guesses when the BIOS is shit and it complains
> about that.
> 
> Or do you like your BIOS broken?

Agreed. It is garbage in, garbage out. No need to complicate the existing
code further.

Re: [PATCH V8] mm/debug: Add tests validating architecture page table helpers

2019-10-29 Thread Qian Cai



> On Oct 28, 2019, at 1:29 AM, Anshuman Khandual wrote:
> 
> This adds tests which will validate architecture page table helpers and
> other accessors in their compliance with expected generic MM semantics.
> This will help various architectures in validating changes to existing
> page table helpers or addition of new ones.
> 
> This test covers basic page table entry transformations including but not
> limited to old, young, dirty, clean, write, write protect etc at various
> levels along with populating intermediate entries with next page table page
> and validating them.
> 
> Test page table pages are allocated from system memory with required size
> and alignments. The mapped pfns at page table levels are derived from a
> real pfn representing a valid kernel text symbol. This test gets called
> right after page_alloc_init_late().
> 
> This gets built and run when CONFIG_DEBUG_VM_PGTABLE is selected along with
> CONFIG_VM_DEBUG. Architectures willing to subscribe this test also need to
> select CONFIG_ARCH_HAS_DEBUG_VM_PGTABLE which for now is limited to x86 and
> arm64. Going forward, other architectures too can enable this after fixing
> build or runtime problems (if any) with their page table helpers.
> 
> Folks interested in making sure that a given platform's page table helpers
> conform to expected generic MM semantics should enable the above config
> which will just trigger this test during boot. Any non conformity here will
> be reported as a warning which would need to be fixed. This test will help
> catch any changes to the agreed upon semantics expected from generic MM and
> enable platforms to accommodate it thereafter.

This looks like a perfect candidate to streamline with the new kunit framework, 
no?

[PATCH] powerpc/powernv/smp: fix a warning at CPU hotplug

2019-10-28 Thread Qian Cai
The commit e78a7614f387 ("idle: Prevent late-arriving interrupts from
disrupting offline") introduced a warning on powerpc with CPU hotplug,

WARNING: CPU: 1 PID: 0 at arch/powerpc/platforms/powernv/smp.c:160
pnv_smp_cpu_kill_self+0x5c/0x330
Call Trace:
 cpu_die+0x48/0x64
 arch_cpu_idle_dead+0x30/0x50
 do_idle+0x2e4/0x460
 cpu_startup_entry+0x3c/0x40
 start_secondary+0x7a8/0xa80
 start_secondary_resume+0x10/0x14

because it calls local_irq_disable() before arch_cpu_idle_dead().

Fixes: e78a7614f387 ("idle: Prevent late-arriving interrupts from disrupting 
offline")
Signed-off-by: Qian Cai 
---
 arch/powerpc/platforms/powernv/smp.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index fbd6e6b7bbf2..51f4e07b9168 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -157,7 +157,6 @@ static void pnv_smp_cpu_kill_self(void)
 * This hard disables local interurpts, ensuring we have no lazy
 * irqs pending.
 */
-   WARN_ON(irqs_disabled());
hard_irq_disable();
WARN_ON(lazy_irq_pending());
 
-- 
1.8.3.1



Re: [PATCH V7] mm/debug: Add tests validating architecture page table helpers

2019-10-24 Thread Qian Cai



> On Oct 24, 2019, at 11:45 PM, Anshuman Khandual wrote:
> 
> Nothing specific. But just tested this with x86 defconfig with relevant 
> configs
> which are required for this test. Not sure if it involved W=1.

No, it will not. It needs to be run like this:

make W=1 -j 64 2>/tmp/warns
