Re: [PATCH v5 0/6] Fix some bugs related to rmap and dax

2022-04-01 Thread Qian Cai
On Fri, Apr 01, 2022 at 11:44:16AM +0800, Muchun Song wrote:
> Thanks for your report. Would you mind providing the .config?

$ make ARCH=arm64 defconfig debug.config



Re: [PATCH v5 0/6] Fix some bugs related to rmap and dax

2022-03-31 Thread Qian Cai
On Fri, Mar 18, 2022 at 03:45:23PM +0800, Muchun Song wrote:
> This series is based on next-20220225.
> 
> Patches 1-2 fix a cache flush bug; because subsequent patches depend on
> those changes, they are placed in this series.  Patches 3-4 are
> preparation for fixing a dax bug in patch 5.  Patch 6 is a code cleanup,
> since the previous patch removes the usage of follow_invalidate_pte().

Reverting this series fixed boot crashes.

 KASAN: null-ptr-deref in range [0x0018-0x001f]
 Mem abort info:
   ESR = 0x9604
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x04: level 0 translation fault
 Data abort info:
   ISV = 0, ISS = 0x0004
   CM = 0, WnR = 0
 [dfff8003] address between user and kernel address ranges
 Internal error: Oops: 9604 [#1] PREEMPT SMP
 Modules linked in: cdc_ether usbnet ipmi_devintf ipmi_msghandler cppc_cpufreq 
fuse ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon 
raid6_pq zstd_compress dm_mod nouveau crct10dif_ce drm_ttm_helper mlx5_core ttm 
drm_dp_helper drm_kms_helper nvme mpt3sas nvme_core xhci_pci raid_class drm 
xhci_pci_renesas
 CPU: 3 PID: 1707 Comm: systemd-udevd Not tainted 
5.17.0-next-20220331-4-g2d550916a6b9 #51
 pstate: 104000c9 (nzcV daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : __lock_acquire
 lr : lock_acquire.part.0
 sp : 800030a16fd0
 x29: 800030a16fd0 x28: dd876c4e9f90 x27: 0018
 x26:  x25: 0018 x24: 
 x23: 08022beacf00 x22: dd8772507660 x21: 
 x20:  x19:  x18: dd8772417d2c
 x17: dd876c5bc2e0 x16: 1fffe100457d5b06 x15: 0094
 x14: f1f1 x13: f3f3f3f3 x12: 08022beacf08
 x11: 1bb0ee482fa5 x10: dd8772417d28 x9 : 
 x8 : 0003 x7 : dd876c4e9f90 x6 : 
 x5 :  x4 : 0001 x3 : 
 x2 :  x1 : 0003 x0 : dfff8000
 Call trace:
  __lock_acquire
  lock_acquire.part.0
  lock_acquire
  _raw_spin_lock
  page_vma_mapped_walk
  try_to_migrate_one
  rmap_walk_anon
  try_to_migrate
  __unmap_and_move
  unmap_and_move
  migrate_pages
  migrate_misplaced_page
  do_huge_pmd_numa_page
  __handle_mm_fault
  handle_mm_fault
  do_translation_fault
  do_mem_abort
  el0_da
  el0t_64_sync_handler
  el0t_64_sync
 Code: d65f03c0 d343ff61 d2d0 f2fbffe0 (38e06820)
 ---[ end trace  ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x5d8763da from 0x8800
 PHYS_OFFSET: 0x8000
 CPU features: 0x000,00085c0d,19801c82
 Memory Limit: none
 ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
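
For readers following the oops above: the faulting address looks like a generic
KASAN shadow lookup for a near-NULL pointer, not the pointer itself. Below is a
minimal sketch of that arithmetic. It assumes the archive stripped runs of
zeros, so that x0 was 0xdfff800000000000 (taken here as the shadow offset) and
the faulting address was 0xdfff800000000003. The function names are
illustrative and are not kernel API.

```python
# Sketch of generic KASAN shadow address arithmetic (illustrative names).
WORD_MASK = (1 << 64) - 1
KASAN_SHADOW_SCALE_SHIFT = 3  # generic KASAN: one shadow byte covers 8 bytes

# Assumption: shadow offset read from x0 in the oops, with stripped zeros restored.
ASSUMED_SHADOW_OFFSET = 0xDFFF800000000000


def mem_to_shadow(addr: int, shadow_offset: int = ASSUMED_SHADOW_OFFSET) -> int:
    """Return the shadow byte address that describes `addr`."""
    return ((addr >> KASAN_SHADOW_SCALE_SHIFT) + shadow_offset) & WORD_MASK


def shadow_to_range(shadow_addr: int, shadow_offset: int = ASSUMED_SHADOW_OFFSET):
    """Return the inclusive (first, last) addresses covered by one shadow byte."""
    first = ((shadow_addr - shadow_offset) << KASAN_SHADOW_SCALE_SHIFT) & WORD_MASK
    return first, first + (1 << KASAN_SHADOW_SCALE_SHIFT) - 1


# x27 in the register dump is 0x18, i.e. a small offset from a NULL pointer.
print(hex(mem_to_shadow(0x18)))
print([hex(a) for a in shadow_to_range(0xDFFF800000000003)])
```

Under those assumptions the shadow byte for 0x18 sits at 0xdfff800000000003,
and that byte covers 0x18-0x1f, which would match the "null-ptr-deref in range"
line. In other words, the lock pointer handed to __lock_acquire was probably
NULL plus a small structure offset.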
> 
> v5:
> - Collect Reviewed-by from Dan Williams.
> - Fix panic reported by kernel test robot.
> - Remove pmdpp parameter from follow_invalidate_pte() and fold it into 
> follow_pte().
> 
> v4:
> - Fix compilation error on riscv.
> 
> v3:
> - Based on next-20220225.
> 
> v2:
> - Avoid the overly long line in lots of places suggested by Christoph.
> - Fix a compiler warning reported by kernel test robot since pmd_pfn()
>   is not defined when !CONFIG_TRANSPARENT_HUGEPAGE on powerpc architecture.
> - Split a new patch 4 for preparation of fixing the dax bug.
> 
> Muchun Song (6):
>   mm: rmap: fix cache flush on THP pages
>   dax: fix cache flush on PMD-mapped pages
>   mm: rmap: introduce pfn_mkclean_range() to cleans PTEs
>   mm: pvmw: add support for walking devmap pages
>   dax: fix missing writeprotect the pte entry
>   mm: simplify follow_invalidate_pte()
> 
>  fs/dax.c | 82 
> +---
>  include/linux/mm.h   |  3 --
>  include/linux/rmap.h |  3 ++
>  mm/internal.h| 26 +++--
>  mm/memory.c  | 81 +++
>  mm/page_vma_mapped.c | 16 +-
>  mm/rmap.c| 68 +++
>  7 files changed, 114 insertions(+), 165 deletions(-)
> 
> -- 
> 2.11.0
> 



Re: [PATCH v2 2/2] mm: fix initialization of struct page for holes in memory layout

2021-01-11 Thread Qian Cai
On Sun, 2021-01-10 at 17:39 +0200, Mike Rapoport wrote:
> On Wed, Jan 06, 2021 at 04:04:21PM -0500, Qian Cai wrote:
> > On Wed, 2021-01-06 at 10:05 +0200, Mike Rapoport wrote:
> > > I think we trigger PF_POISONED_CHECK() in PageSlab(), then
> > > fffe
> > > is "accessed" from VM_BUG_ON_PAGE().
> > > 
> > > It seems to me that we are not initializing struct pages for holes at the
> > > node
> > > boundaries because zones are already clamped to exclude those holes.
> > > 
> > > Can you please try to see if the patch below will produce any useful info:
> > 
> > [0.00] init_unavailable_range: spfn: 8c, epfn: 9b, zone: DMA, node:
> > 0
> > [0.00] init_unavailable_range: spfn: 1f7be, epfn: 1f9fe, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 28784, epfn: 288e4, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 298b9, epfn: 298bd, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 29923, epfn: 29931, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 29933, epfn: 29941, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 29945, epfn: 29946, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 29ff9, epfn: 2a823, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 33a23, epfn: 33a53, zone:
> > DMA32, node: 0
> > [0.00] init_unavailable_range: spfn: 78000, epfn: 10, zone:
> > DMA32, node: 0
> > ...
> > [  572.222563][ T2302] kpagecount_read: pfn 47f380 is poisoned
> ...
> > [  590.570032][ T2302] kpagecount_read: pfn 47 is poisoned
> > [  604.268653][ T2302] kpagecount_read: pfn 87ff80 is poisoned
> ...
> > [  604.611698][ T2302] kpagecount_read: pfn 87ffbc is poisoned
> > [  617.484205][ T2302] kpagecount_read: pfn c7ff80 is poisoned
> ...
> > [  618.212344][ T2302] kpagecount_read: pfn c7 is poisoned
> > [  633.134228][ T2302] kpagecount_read: pfn 107ff80 is poisoned
> ...
> > [  633.874087][ T2302] kpagecount_read: pfn 107 is poisoned
> > [  647.686412][ T2302] kpagecount_read: pfn 147ff80 is poisoned
> ...
> > [  648.425548][ T2302] kpagecount_read: pfn 147 is poisoned
> > [  663.692630][ T2302] kpagecount_read: pfn 187ff80 is poisoned
> ...
> > [  664.432671][ T2302] kpagecount_read: pfn 187 is poisoned
> > [  675.462757][ T2302] kpagecount_read: pfn 1c7ff80 is poisoned
> ...
> > [  676.202548][ T2302] kpagecount_read: pfn 1c7 is poisoned
> > [  687.121605][ T2302] kpagecount_read: pfn 207ff80 is poisoned
> ...
> > [  687.860981][ T2302] kpagecount_read: pfn 207 is poisoned
> 
> The e820 map has a hole near the end of each node and these holes are not
> initialized with init_unavailable_range() after it was interleaved with
> memmap initialization because such holes are not accounted for by
> zone->spanned_pages.
> 
> Yet, I still cannot really understand how this never triggered
> 
>   VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> before v5.7 as all the struct pages for these holes would have zone=0 and
> node=0 ... 
> 
> @Qian, can you please boot your system with memblock=debug and share the
> logs?
> 

http://people.redhat.com/qcai/memblock.txt



Re: [PATCH v2 2/2] mm: fix initialization of struct page for holes in memory layout

2021-01-06 Thread Qian Cai
On Wed, 2021-01-06 at 10:05 +0200, Mike Rapoport wrote:
> I think we trigger PF_POISONED_CHECK() in PageSlab(), then fffe
> is "accessed" from VM_BUG_ON_PAGE().
> 
> It seems to me that we are not initializing struct pages for holes at the node
> boundaries because zones are already clamped to exclude those holes.
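
One way to see where the odd fault address could come from is to model the
compound_head() step that PageSlab() goes through, as described above. This is
a sketch only. It assumes an uninitialized struct page is poisoned with
all-ones bytes, that the word read by compound_head() is therefore
0xffffffffffffffff, and that the archive truncated the fault address from
0xfffffffffffffffe. The struct page address used below is likewise an
assumed, zero-restored value.

```python
# Model of compound_head() on a poisoned struct page (illustrative sketch).
WORD_MASK = (1 << 64) - 1
PAGE_POISON_PATTERN = WORD_MASK  # assumption: poison is -1 as an unsigned long


def compound_head(page_addr: int, compound_head_word: int) -> int:
    """Mimic the kernel helper: bit 0 set means 'tail page, head = word - 1'."""
    if compound_head_word & 1:
        return (compound_head_word - 1) & WORD_MASK
    return page_addr


# Assumed struct page address; the poisoned word has bit 0 set, so the page is
# mistaken for a tail page and the "head" becomes a bogus pointer.
bogus_head = compound_head(0xFFFFEA0011FCE000, PAGE_POISON_PATTERN)
print(hex(bogus_head))
```

Any later check that reads the flags of that "head" page would then touch
0xfffffffffffffffe, which is not mapped.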
> 
> Can you please try to see if the patch below will produce any useful info:

[0.00] init_unavailable_range: spfn: 8c, epfn: 9b, zone: DMA, node: 0
[0.00] init_unavailable_range: spfn: 1f7be, epfn: 1f9fe, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 28784, epfn: 288e4, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 298b9, epfn: 298bd, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 29923, epfn: 29931, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 29933, epfn: 29941, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 29945, epfn: 29946, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 29ff9, epfn: 2a823, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 33a23, epfn: 33a53, zone: DMA32, 
node: 0
[0.00] init_unavailable_range: spfn: 78000, epfn: 10, zone: DMA32, 
node: 0
...
[  572.222563][ T2302] kpagecount_read: pfn 47f380 is poisoned
[  572.228208][ T2302] kpagecount_read: pfn 47f381 is poisoned
[  572.233823][ T2302] kpagecount_read: pfn 47f382 is poisoned
[  572.239465][ T2302] kpagecount_read: pfn 47f383 is poisoned
[  572.245495][ T2302] kpagecount_read: pfn 47f384 is poisoned
[  572.251110][ T2302] kpagecount_read: pfn 47f385 is poisoned
[  572.256739][ T2302] kpagecount_read: pfn 47f386 is poisoned
[  572.262353][ T2302] kpagecount_read: pfn 47f387 is poisoned
[  572.268445][ T2302] kpagecount_read: pfn 47f388 is poisoned
[  572.274057][ T2302] kpagecount_read: pfn 47f389 is poisoned
[  572.279687][ T2302] kpagecount_read: pfn 47f38a is poisoned
[  572.285320][ T2302] kpagecount_read: pfn 47f38b is poisoned
[  572.290934][ T2302] kpagecount_read: pfn 47f38c is poisoned
[  572.296939][ T2302] kpagecount_read: pfn 47f38d is poisoned
[  572.302551][ T2302] kpagecount_read: pfn 47f38e is poisoned
[  572.308180][ T2302] kpagecount_read: pfn 47f38f is poisoned
[  572.313791][ T2302] kpagecount_read: pfn 47f390 is poisoned
[  572.319859][ T2302] kpagecount_read: pfn 47f391 is poisoned
[  572.325536][ T2302] kpagecount_read: pfn 47f392 is poisoned
[  572.331150][ T2302] kpagecount_read: pfn 47f393 is poisoned
[  572.336945][ T2302] kpagecount_read: pfn 47f394 is poisoned
[  572.342981][ T2302] kpagecount_read: pfn 47f395 is poisoned
[  572.348615][ T2302] kpagecount_read: pfn 47f396 is poisoned
[  572.354226][ T2302] kpagecount_read: pfn 47f397 is poisoned
[  572.359865][ T2302] kpagecount_read: pfn 47f398 is poisoned
[  572.365495][ T2302] kpagecount_read: pfn 47f399 is poisoned
[  572.371568][ T2302] kpagecount_read: pfn 47f39a is poisoned
[  572.377199][ T2302] kpagecount_read: pfn 47f39b is poisoned
[  572.382813][ T2302] kpagecount_read: pfn 47f39c is poisoned
[  572.388443][ T2302] kpagecount_read: pfn 47f39d is poisoned
[  572.394507][ T2302] kpagecount_read: pfn 47f39e is poisoned
[  572.400137][ T2302] kpagecount_read: pfn 47f39f is poisoned
[  572.405766][ T2302] kpagecount_read: pfn 47f3a0 is poisoned
[  572.411379][ T2302] kpagecount_read: pfn 47f3a1 is poisoned
[  572.417475][ T2302] kpagecount_read: pfn 47f3a2 is poisoned
[  572.423088][ T2302] kpagecount_read: pfn 47f3a3 is poisoned
[  572.428717][ T2302] kpagecount_read: pfn 47f3a4 is poisoned
[  572.434329][ T2302] kpagecount_read: pfn 47f3a5 is poisoned
[  572.439963][ T2302] kpagecount_read: pfn 47f3a6 is poisoned
[  572.446079][ T2302] kpagecount_read: pfn 47f3a7 is poisoned
[  572.451692][ T2302] kpagecount_read: pfn 47f3a8 is poisoned
[  572.457367][ T2302] kpagecount_read: pfn 47f3a9 is poisoned
[  572.462981][ T2302] kpagecount_read: pfn 47f3aa is poisoned
[  572.469079][ T2302] kpagecount_read: pfn 47f3ab is poisoned
[  572.474694][ T2302] kpagecount_read: pfn 47f3ac is poisoned
[  572.480332][ T2302] kpagecount_read: pfn 47f3ad is poisoned
[  572.485962][ T2302] kpagecount_read: pfn 47f3ae is poisoned
[  572.491577][ T2302] kpagecount_read: pfn 47f3af is poisoned
[  572.497677][ T2302] kpagecount_read: pfn 47f3b0 is poisoned
[  572.503292][ T2302] kpagecount_read: pfn 47f3b1 is poisoned
[  572.508921][ T2302] kpagecount_read: pfn 47f3b2 is poisoned
[  572.514535][ T2302] kpagecount_read: pfn 47f3b3 is poisoned
[  572.520643][ T2302] kpagecount_read: pfn 47f3b4 is poisoned
[  572.526273][ T2302] kpagecount_read: pfn 47f3b5 is poisoned
[  572.531886][ T2302] kpagecount_read: pfn 47f3b6 is poisoned
[  572.537524][ T2302] kpagecount_read: pfn 47f3b7 is poisoned
[  572.543676][ T2302] kpagecount_read: pfn 47f3b8 is poisoned
[  572.549305][ T2302] kpagecount_read: pfn 47f3b9 is poisoned
[  572.554919][ T2302] kpagecount_read: pfn 47f3ba is poisoned
[  

Power9 NV linux-next random process hang

2021-01-05 Thread Qian Cai
.config: 
https://cailca.coding.net/public/linux/mm/git/files/master/powerpc.config

Today's linux-next started to generate random process hangs quite easily.
Yesterday's build seemed to work fine. Sometimes the process stack seems corrupt
while the process is running at 100% CPU, and gdb shows it has just entered a
subroutine where I really can't see why it hangs.

[ 6732.309621][T11627] task:ranbug  state:R  running task 
stack:24176 pid: 2893 ppid:  2867 flags:0x0004 
[ 6732.309779][T11627] Call Trace: 
[ 6732.309826][T11627] [c0006166fa30] [c0006166fb60] 0xc0006166fb60 
(unreliable) 

Also, running LTP syscalls ended up hanging with lots of zombie processes. Any
ideas?

root     2023  0.0  0.0      0     0 ?     Zs   14:10   0:00 [login]
root    52052  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [recv01]
root    52054  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [recvfrom01]
root    52056  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [recvmsg01]
root    52155  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [rt_sigtimedwait]
root    52305  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [semctl01]
root    52362  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [send01]
root    52386  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04]
root    52387  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04]
root    52388  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04]
root    52389  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04]
root    52390  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04]
root    52392  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04_64]
root    52393  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04_64]
root    52394  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04_64]
root    52395  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04_64]
root    52396  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile04_64]
root    52398  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile05]
root    52400  0.0  0.0      0     0 pts/0 Z    15:03   0:00 [sendfile05_64]
root    52415  0.0  0.0      0     0 pts/0 Z    15:04   0:00 [sendmsg01]
root    53470  0.0  0.0      0     0 pts/0 Z    15:04   0:00 [sendto01]
root    53763  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53764  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53765  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53766  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53767  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53768  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53769  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53770  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53771  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53772  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53773  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53774  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53775  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53776  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53777  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53778  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53779  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53780  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
root    53782  0.0  0.0      0     0 pts/0 Z    15:06   0:00 [setrlimit01]
nobody  54290  0.0  0.0      0     0 pts/0 Z    15:07   0:00 [sysctl03]
root    56813  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56814  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56815  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56816  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56817  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56818  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56819  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56820  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56821  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56822  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56823  0.0  0.0      0     0 pts/0 Z    16:09   0:00 [waitpid03]
root    56825  0.0  0.0

Re: [PATCH v21 00/19] per memcg lru lock

2021-01-05 Thread Qian Cai
On Tue, 2021-01-05 at 13:35 -0800, Hugh Dickins wrote:
> This patchset went into mmotm 2020-11-16-16-23, so probably linux-next
> on 2020-11-17: you'll have had three trouble-free weeks testing with it
> in, so it's not a likely suspect.  I haven't looked yet at your report,
> to think of a more likely suspect: will do.

My memory was probably bad then. Unfortunately, I also had a two-week holiday
before Thanksgiving. I have tried a few times so far and have only been able to
reproduce it once. Looks nasty...



Re: [PATCH v21 00/19] per memcg lru lock

2021-01-05 Thread Qian Cai
On Tue, 2021-01-05 at 11:42 -0800, Shakeel Butt wrote:
> On Tue, Jan 5, 2021 at 11:30 AM Qian Cai  wrote:
> > On Thu, 2020-11-05 at 16:55 +0800, Alex Shi wrote:
> > > This version is rebased on next/master 20201104, with many of Johannes's
> > > Acks and some changes according to Johannes's comments. It also adds a new
> > > patch v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to
> > > support v21-0007.
> > > 
> > > This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> > > added to -mm tree yesterday.
> > > 
> > > Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> > > Johannes Weiner.
> > 
> > Given the troublesome history of this patchset, the fact that it was put
> > into linux-next recently, and that it touches both THP and mlock, is it a
> > good idea to suspect this patchset of introducing some races and a
> > spontaneous crash under some mlock memory pressure?
> 
> This has already been merged into the linus tree. Were you able to get
> a similar crash on the latest upstream kernel as well?

No, I seldom test mainline these days. Before the vacation, I had tested
linux-next up to something like 12/10, which did not include this patchset IIRC,
and never saw any crash like this. I am still trying to figure out how to
reproduce it quickly, so I can try a revert to confirm.



Re: [PATCH v21 00/19] per memcg lru lock

2021-01-05 Thread Qian Cai
On Thu, 2020-11-05 at 16:55 +0800, Alex Shi wrote:
> This version is rebased on next/master 20201104, with many of Johannes's
> Acks and some changes according to Johannes's comments. It also adds a new
> patch v21-0006-mm-rmap-stop-store-reordering-issue-on-page-mapp.patch to
> support v21-0007.
> 
> This patchset followed 2 memcg VM_WARN_ON_ONCE_PAGE patches which were
> added to -mm tree yesterday.
>  
> Many thanks for line by line review by Hugh Dickins, Alexander Duyck and
> Johannes Weiner.

Given the troublesome history of this patchset, the fact that it was put into
linux-next recently, and that it touches both THP and mlock, is it a good idea
to suspect this patchset of introducing some races and a spontaneous crash
under some mlock memory pressure?

[10392.154328][T23803] huge_memory: total_mapcount: 5, page_count(): 6
[10392.154835][T23803] page:eb7725ad refcount:6 mapcount:0 
mapping: index:0x7fff72a0 pfn:0x20023760
[10392.154865][T23803] head:eb7725ad order:5 compound_mapcount:0 
compound_pincount:0
[10392.154889][T23803] anon flags: 
0x87fff89000d(locked|uptodate|dirty|head|swapbacked)
[10392.154908][T23803] raw: 087fff89000d 5deadbeef100 5deadbeef122 
c0002016ff5e0849
[10392.154933][T23803] raw: 7fff72a0  0006 
c0002014eb676000
[10392.154965][T23803] page dumped because: total_mapcount(head) > 0
[10392.154987][T23803] pages's memcg:c0002014eb676000
[10392.155023][T23803] [ cut here ]
[10392.155042][T23803] kernel BUG at mm/huge_memory.c:2767!
[10392.155064][T23803] Oops: Exception in kernel mode, sig: 5 [#1]
[10392.155084][T23803] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=256 NUMA PowerNV
[10392.155114][T23803] Modules linked in: loop kvm_hv kvm ip_tables x_tables 
sd_mod bnx2x ahci tg3 libahci mdio libphy libata firmware_class dm_mirror 
dm_region_hash dm_log dm_mod
[10392.155185][T23803] CPU: 44 PID: 23803 Comm: ranbug Not tainted 
5.11.0-rc2-next-20210105 #2
[10392.155217][T23803] NIP:  c03b5218 LR: c03b5214 CTR: 

[10392.155247][T23803] REGS: c0001a8d6ee0 TRAP: 0700   Not tainted  
(5.11.0-rc2-next-20210105)
[10392.155279][T23803] MSR:  92823033  
 CR: 2842  XER: 
[10392.155314][T23803] CFAR: c03135ac IRQMASK: 1 
[10392.155314][T23803] GPR00: c03b5214 c0001a8d7180 
c7f70b00 001e 
[10392.155314][T23803] GPR04: c0eacd38 0004 
0027 c01ffe8a7218 
[10392.155314][T23803] GPR08: 0023  
 c7eacfc8 
[10392.155314][T23803] GPR12: 2000 c01cda00 
 0001 
[10392.155314][T23803] GPR16: c00c0008008dd808 0004 
 0020 
[10392.155314][T23803] GPR20: c00c0008008dd800 0020 
0006 0001 
[10392.155314][T23803] GPR24: 0005  
c0002016ff5e0848  
[10392.155314][T23803] GPR28: c0002014eb676e60 c00c0008008dd800 
c0001a8d73a8 c00c0008008dd800 
[10392.155533][T23803] NIP [c03b5218] 
split_huge_page_to_list+0xa38/0xa40
[10392.18][T23803] LR [c03b5214] split_huge_page_to_list+0xa34/0xa40
[10392.155579][T23803] Call Trace:
[10392.155595][T23803] [c0001a8d7180] [c03b5214] 
split_huge_page_to_list+0xa34/0xa40 (unreliable)
[10392.155630][T23803] [c0001a8d7270] [c02dd378] 
shrink_page_list+0x1568/0x1b00
shrink_page_list at mm/vmscan.c:1251 (discriminator 1)
[10392.155655][T23803] [c0001a8d7380] [c02df798] 
shrink_inactive_list+0x228/0x5e0
[10392.155678][T23803] [c0001a8d7450] [c02e0858] 
shrink_lruvec+0x2b8/0x6f0
shrink_lruvec at mm/vmscan.c:2462
[10392.155710][T23803] [c0001a8d7590] [c02e0fd8] 
shrink_node+0x348/0x970
[10392.155742][T23803] [c0001a8d7660] [c02e1728] 
do_try_to_free_pages+0x128/0x560
[10392.155765][T23803] [c0001a8d7710] [c02e3b78] 
try_to_free_pages+0x198/0x500
[10392.155780][T23803] [c0001a8d77e0] [c0356f5c] 
__alloc_pages_slowpath.constprop.112+0x64c/0x1380
[10392.155795][T23803] [c0001a8d79c0] [c0358170] 
__alloc_pages_nodemask+0x4e0/0x590
[10392.155830][T23803] [c0001a8d7a50] [c0381fb8] 
alloc_pages_vma+0xb8/0x340
[10392.155854][T23803] [c0001a8d7ac0] [c0324fe8] 
handle_mm_fault+0xf38/0x1bd0
[10392.155887][T23803] [c0001a8d7ba0] [c0316cd4] 
__get_user_pages+0x434/0x7d0
[10392.155920][T23803] [c0001a8d7cb0] [c03197d0] 
__mm_populate+0xe0/0x290
__mm_populate at mm/gup.c:1459
[10392.155952][T23803] [c0001a8d7d20] [c032d5a0] 
do_mlock+0x180/0x360
do_mlock at mm/mlock.c:688
[10392.155975][T23803] [c0001a8d7d90] [c032d954] sys_mlock+0x24/0x40
[10392.155999][T23803] [c0001a8d7db0] [c002f510] 
system_call_exception+0x170/0x280
[10392.156032][T23803] [c0001a8d7e10] 

Re: [PATCH v2 2/2] mm: fix initialization of struct page for holes in memory layout

2021-01-05 Thread Qian Cai
On Tue, 2021-01-05 at 10:24 +0200, Mike Rapoport wrote:
> Hi,
> 
> On Mon, Jan 04, 2021 at 02:03:00PM -0500, Qian Cai wrote:
> > On Wed, 2020-12-09 at 23:43 +0200, Mike Rapoport wrote:
> > > From: Mike Rapoport 
> > > 
> > > Interleave initialization of pages that correspond to holes with the
> > > initialization of memory map, so that zone and node information will be
> > > properly set on such pages.
> > > 
> > > Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
> > > rather
> > > that check each PFN")
> > > Reported-by: Andrea Arcangeli 
> > > Signed-off-by: Mike Rapoport 
> > 
> > Reverting this commit on the top of today's linux-next fixed a crash while
> > reading /proc/kpagecount on a NUMA server.
> 
> Can you please post the entire dmesg?

http://people.redhat.com/qcai/dmesg.txt

> Is it possible to get the pfn that triggered the crash?

Do you have any idea how to convert that fffe to pfn as it is always
that address? I don't understand what that address is though. I tried to catch
it from struct page pointer and page_address() without luck.
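
A possible way to get a pfn out of this oops, sketched under assumptions: take
the struct page pointer instead of the bogus fault address, then undo the
vmemmap indexing. The sketch assumes that RBX in the register dump of the
original report held the struct page pointer and was 0xffffea0011fce000 before
the archive stripped the leading digits. It also assumes the default x86-64
vmemmap base of 0xffffea0000000000 (no randomization of that region) and a
64-byte struct page. The helper names are illustrative.

```python
# Sketch: recover a pfn from a struct page pointer (illustrative helper names).
ASSUMED_VMEMMAP_BASE = 0xFFFFEA0000000000  # assumption: default x86-64 layout
ASSUMED_STRUCT_PAGE_SIZE = 64              # assumption: sizeof(struct page)
PAGE_SHIFT = 12                            # 4 KiB pages on x86-64


def page_to_pfn(page_addr: int) -> int:
    """Index of the struct page within the vmemmap array, i.e. the pfn."""
    offset = page_addr - ASSUMED_VMEMMAP_BASE
    if offset < 0 or offset % ASSUMED_STRUCT_PAGE_SIZE:
        raise ValueError("address is not a struct page slot in the assumed vmemmap")
    return offset // ASSUMED_STRUCT_PAGE_SIZE


def pfn_to_phys(pfn: int) -> int:
    """Physical address of the first byte of the page frame."""
    return pfn << PAGE_SHIFT


pfn = page_to_pfn(0xFFFFEA0011FCE000)
print(hex(pfn), hex(pfn_to_phys(pfn)))
```

With those assumptions the result is pfn 0x47f380, which appears to agree with
the value left in R12 in the same register dump.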

>  
> > [ 8858.006726][T99897] BUG: unable to handle page fault for address:
> > fffe
> > [ 8858.014814][T99897] #PF: supervisor read access in kernel mode
> > [ 8858.020686][T99897] #PF: error_code(0x) - not-present page
> > [ 8858.026557][T99897] PGD 1371417067 P4D 1371417067 PUD 1371419067 PMD 0 
> > [ 8858.033224][T99897] Oops:  [#1] SMP KASAN NOPTI
> > [ 8858.038710][T99897] CPU: 28 PID: 99897 Comm: proc01 Tainted:
> > G   O  5.11.0-rc1-next-20210104 #1
> > [ 8858.048515][T99897] Hardware name: HPE ProLiant DL385 Gen10/ProLiant
> > DL385 Gen10, BIOS A40 03/09/2018
> > [ 8858.057794][T99897] RIP: 0010:kpagecount_read+0x1be/0x5e0
> > PageSlab at include/linux/page-flags.h:342
> > (inlined by) kpagecount_read at fs/proc/page.c:69



Re: [PATCH v2 2/2] mm: fix initialization of struct page for holes in memory layout

2021-01-04 Thread Qian Cai
On Wed, 2020-12-09 at 23:43 +0200, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> There could be struct pages that are not backed by actual physical memory.
> This can happen when the actual memory bank is not a multiple of
> SECTION_SIZE or when an architecture does not register memory holes
> reserved by the firmware as memblock.memory.
> 
> Such pages are currently initialized using the init_unavailable_mem() function,
> which iterates through PFNs in holes in memblock.memory; if there is a
> struct page corresponding to a PFN, the fields of this page are set to
> default values and it is marked as Reserved.
> 
> init_unavailable_mem() does not take into account zone and node the page
> belongs to and sets both zone and node links in struct page to zero.
> 
> On a system that has firmware reserved holes in a zone above ZONE_DMA, for
> instance in a configuration below:
> 
>   # grep -A1 E820 /proc/iomem
>   7a17b000-7a216fff : Unknown E820 type
>   7a217000-7bff : System RAM
> 
> unset zone link in struct page will trigger
> 
>   VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page);
> 
> because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone link in
> struct page) in the same pageblock.
> 
> Interleave initialization of pages that correspond to holes with the
> initialization of memory map, so that zone and node information will be
> properly set on such pages.
> 
> Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather
> that check each PFN")
> Reported-by: Andrea Arcangeli 
> Signed-off-by: Mike Rapoport 

Reverting this commit on the top of today's linux-next fixed a crash while
reading /proc/kpagecount on a NUMA server.

[ 8858.006726][T99897] BUG: unable to handle page fault for address: 
fffe
[ 8858.014814][T99897] #PF: supervisor read access in kernel mode
[ 8858.020686][T99897] #PF: error_code(0x) - not-present page
[ 8858.026557][T99897] PGD 1371417067 P4D 1371417067 PUD 1371419067 PMD 0 
[ 8858.033224][T99897] Oops:  [#1] SMP KASAN NOPTI
[ 8858.038710][T99897] CPU: 28 PID: 99897 Comm: proc01 Tainted: G   O   
   5.11.0-rc1-next-20210104 #1
[ 8858.048515][T99897] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 
Gen10, BIOS A40 03/09/2018
[ 8858.057794][T99897] RIP: 0010:kpagecount_read+0x1be/0x5e0
PageSlab at include/linux/page-flags.h:342
(inlined by) kpagecount_read at fs/proc/page.c:69
[ 8858.063717][T99897] Code: 3c 30 00 0f 85 29 03 00 00 48 8b 53 08 48 8d 42 ff 
83 e2 01 48 0f 44 c3 48 89 c2 48 c1 ea 03 42 80 3c 32 00 0f 85 e7 02 00 00 <48> 
83 38 ff 0f 84 f3 01 00 00 48 89 c8 48 c1 e8 03 42 80 3c 30 00
[ 8858.083303][T99897] RSP: 0018:c9002159fdd0 EFLAGS: 00010246
[ 8858.089637][T99897] RAX: fffe RBX: ea0011fce000 RCX: 
ea0011fce008
[ 8858.097518][T99897] RDX: 1fff RSI: 0064d7c0 RDI: 
951f91c8
[ 8858.105396][T99897] RBP: 0064d7c0 R08: ed129063f402 R09: 
ed129063f402
[ 8858.113760][T99897] R10: 8894831fa00b R11: ed129063f401 R12: 
0047f380
[ 8858.121639][T99897] R13: 0400 R14: dc00 R15: 
0064d7c0
[ 8858.129517][T99897] FS:  7fd18849d040() GS:88a02fc0() 
knlGS:
[ 8858.138886][T99897] CS:  0010 DS:  ES:  CR0: 80050033
[ 8858.145369][T99897] CR2: fffe CR3: 001c8b5d CR4: 
003506e0
[ 8858.153247][T99897] Call Trace:
[ 8858.156415][T99897]  proc_reg_read+0x1a6/0x240
[ 8858.161345][T99897]  vfs_read+0x175/0x440
[ 8858.165383][T99897]  ksys_read+0xf1/0x1c0
[ 8858.169420][T99897]  ? vfs_write+0x870/0x870
[ 8858.173719][T99897]  ? task_work_run+0xeb/0x170
[ 8858.178284][T99897]  ? syscall_enter_from_user_mode+0x1c/0x40
[ 8858.184073][T99897]  do_syscall_64+0x33/0x40
[ 8858.188863][T99897]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 8858.194652][T99897] RIP: 0033:0x7fd187da1d5d
[ 8858.198952][T99897] Code: 31 11 2b 00 31 c9 64 83 3e 0b 75 ca eb b8 e8 ca fb 
ff ff 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 39 ca 77 2b 31 c0 0f 05 <48> 
3d 00 f0 ff ff 77 0b c3 66 2e 0f 1f 84 00 00 00 00 00 48 8b 15
[ 8858.218978][T99897] RSP: 002b:7ffe733de1f8 EFLAGS: 0246 ORIG_RAX: 

[ 8858.227297][T99897] RAX: ffda RBX: 7ffe733df370 RCX: 
7fd187da1d5d
[ 8858.235824][T99897] RDX: 0400 RSI: 0064d7c0 RDI: 
0004
[ 8858.243739][T99897] RBP: 0400 R08: 018fbe73 R09: 
7fd187e13d40
[ 8858.251617][T99897] R10:  R11: 0246 R12: 
023f9c00
[ 8858.259496][T99897] R13: 0004 R14: 0044663c R15: 

[ 8858.267856][T99897] Modules linked in: vfat fat fuse vfio_pci vfio_virqfd 
vfio_iommu_type1 vfio loop iavf kvm_amd ses kvm enclosure irqbypass 
acpi_cpufreq ip_tables x_tables sd_mod smartpqi bnxt_en scsi_transport_sas tg3 
i40e firmware_class libphy dm_mirror dm_region_hash dm_log 

Re: [PATCH 3/3] driver core: platform: use bus_type functions

2020-12-11 Thread Qian Cai
On Thu, 2020-11-19 at 13:46 +0100, Uwe Kleine-König wrote:
> This works towards the goal mentioned in 2006 in commit 594c8281f905
> ("[PATCH] Add bus_type probe, remove, shutdown methods.").
> 
> The functions are moved to where the other bus_type functions are
> defined and renamed to match the already established naming scheme.
> 
> Signed-off-by: Uwe Kleine-König 

Reverting this commit from today's linux-next fixed a crash during shutdown.

.config: https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

[ 9771.596916][T113465] BUG: unable to handle page fault for address: 
ffe8
[ 9771.604627][T113465] #PF: supervisor read access in kernel mode
[ 9771.610581][T113465] #PF: error_code(0x) - not-present page
[ 9771.616533][T113465] PGD 19c1e17067 P4D 19c1e17067 PUD 19c1e19067 PMD 0 
[ 9771.623279][T113465] Oops:  [#1] SMP KASAN PTI
[ 9771.628098][T113465] CPU: 22 PID: 113465 Comm: reboot Tainted: G  IO 
 5.10.0-rc7-next-20201211 #1
[ 9771.638071][T113465] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 
Gen10, BIOS U34 11/13/2019
[ 9771.647431][T113465] RIP: 0010:platform_shutdown+0x44/0x70
platform_shutdown at drivers/base/platform.c:1357
[ 9771.652956][T113465] Code: fa 48 c1 ea 03 80 3c 02 00 75 3d 48 b8 00 00 00 
00 00 fc ff df 48 8b 6b 68 48 8d 7d e8 48 89 fa 48 c1 ea 03 80 3c 02 00 75 17 
<48> 8b 45 e8 48 85 c0 74 0b 48 8d 7b f0 5b 5d e9 08 45 6c 00 5b 5d
[ 9771.672623][T113465] RSP: 0018:c90008a77d38 EFLAGS: 00010246
[ 9771.678665][T113465] RAX: dc00 RBX: 60d78810 RCX: 
60d78870
[ 9771.686628][T113465] RDX: 1ffd RSI: 0001 RDI: 
ffe8
[ 9771.694591][T113465] RBP:  R08: ed110c1af166 R09: 
ed110c1af166
[ 9771.702555][T113465] R10: 60d78b2b R11: ed110c1af165 R12: 
60d78810
[ 9771.710516][T113465] R13: 60d78920 R14: fbfff2db0008 R15: 
60d78818
[ 9771.718478][T113465] FS:  7f3434549540() GS:88901f50() 
knlGS:
[ 9771.727402][T113465] CS:  0010 DS:  ES:  CR0: 80050033
[ 9771.733966][T113465] CR2: ffe8 CR3: 00092e9c0004 CR4: 
007706e0
[ 9771.741929][T113465] DR0:  DR1:  DR2: 

[ 9771.749890][T113465] DR3:  DR6: fffe0ff0 DR7: 
0400
[ 9771.757852][T113465] PKRU: 5554
[ 9771.761359][T113465] Call Trace:
[ 9771.764604][T113465]  device_shutdown+0x2ec/0x540
[ 9771.769335][T113465]  kernel_restart+0xe/0x40
[ 9771.773721][T113465]  __do_sys_reboot+0x143/0x2b0
[ 9771.778450][T113465]  ? kernel_power_off+0xa0/0xa0
[ 9771.783269][T113465]  ? debug_object_deactivate+0x3b0/0x3b0
[ 9771.788877][T113465]  ? syscall_enter_from_user_mode+0x17/0x40
[ 9771.794747][T113465]  ? rcu_read_lock_sched_held+0xa1/0xd0
[ 9771.800267][T113465]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[ 9771.806221][T113465]  ? syscall_enter_from_user_mode+0x1c/0x40
[ 9771.812087][T113465]  do_syscall_64+0x33/0x40
[ 9771.816472][T113465]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 9771.822340][T113465] RIP: 0033:0x7f343379b857
[ 9771.826724][T113465] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 
00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 
<48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 01 86 2c 00 f7 d8 64 89 02 b8
[ 9771.846392][T113465] RSP: 002b:7ffef9f85e58 EFLAGS: 0246 ORIG_RAX: 
00a9
[ 9771.854791][T113465] RAX: ffda RBX:  RCX: 
7f343379b857
[ 9771.862752][T113465] RDX: 01234567 RSI: 28121969 RDI: 
fee1dead
[ 9771.870713][T113465] RBP: 7ffef9f85ea0 R08: 0002 R09: 

[ 9771.878673][T113465] R10: 004b R11: 0246 R12: 
0001
[ 9771.886635][T113465] R13: fffe R14: 0006 R15: 

[ 9771.894596][T113465] Modules linked in: isofs cdrom fuse loop nls_ascii 
nls_cp437 vfat fat kvm_intel kvm ses enclosure irqbypass efivarfs ip_tables 
x_tables sd_mod tg3 nvme firmware_class smartpqi nvme_core libphy 
scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: 
dummy_del_mod]
[ 9771.921472][T113465] CR2: ffe8
[ 9771.925590][T113465] ---[ end trace 8a3c9cffc1068bd2 ]---
[ 9771.931017][T113465] RIP: 0010:platform_shutdown+0x44/0x70
[ 9771.936535][T113465] Code: fa 48 c1 ea 03 80 3c 02 00 75 3d 48 b8 00 00 00 
00 00 fc ff df 48 8b 6b 68 48 8d 7d e8 48 89 fa 48 c1 ea 03 80 3c 02 00 75 17 
<48> 8b 45 e8 48 85 c0 74 0b 48 8d 7b f0 5b 5d e9 08 45 6c 00 5b 5d
[ 9771.956204][T113465] RSP: 0018:c90008a77d38 EFLAGS: 00010246

> ---
>  drivers/base/platform.c | 132 
>  1 file changed, 65 insertions(+), 67 deletions(-)
> 
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index b847f5f8f992..8ad06daa2eaa 100644
> --- 

Re: [PATCH] powerpc/mm: Refactor the floor/ceiling check in hugetlb range freeing functions

2020-12-11 Thread Qian Cai
On Fri, 2020-11-06 at 13:20 +, Christophe Leroy wrote:
> All hugetlb range freeing functions have a verification like the following,
> which only differs by the mask used, depending on the page table level.
> 
>   start &= MASK;
>   if (start < floor)
>   return;
>   if (ceiling) {
>   ceiling &= MASK;
>   if (! ceiling)
>   return;
>   }
>   if (end - 1 > ceiling - 1)
>   return;
> 
> Refactor that into a helper function which takes the mask as
> an argument, returning true when [start;end[ is not fully
> contained inside [floor;ceiling[
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/mm/hugetlbpage.c | 56 ---
>  1 file changed, 19 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 36c3800769fb..f8d8a4988e15 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -294,6 +294,21 @@ static void hugepd_free(struct mmu_gather *tlb, void 
> *hugepte)
>  static inline void hugepd_free(struct mmu_gather *tlb, void *hugepte) {}
>  #endif
>  
> +/* Return true when the entry to be freed maps more than the area being 
> freed */
> +static bool range_is_outside_limits(unsigned long start, unsigned long end,
> + unsigned long floor, unsigned long ceiling,
> + unsigned long mask)
> +{
> + if ((start & mask) < floor)
> + return true;
> + if (ceiling) {
> + ceiling &= mask;
> + if (!ceiling)
> + return true;
> + }
> + return end - 1 > ceiling - 1;
> +}
> +
>  static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int 
> pdshift,
> unsigned long start, unsigned long end,
> unsigned long floor, unsigned long ceiling)
> @@ -309,15 +324,7 @@ static void free_hugepd_range(struct mmu_gather *tlb, 
> hugepd_t *hpdp, int pdshif
>   if (shift > pdshift)
>   num_hugepd = 1 << (shift - pdshift);
>  
> - start &= pdmask;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= pdmask;
> - if (! ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(start, end, floor, ceiling, pdmask))
>   return;
>  
>   for (i = 0; i < num_hugepd; i++, hpdp++)
> @@ -334,18 +341,9 @@ static void hugetlb_free_pte_range(struct mmu_gather 
> *tlb, pmd_t *pmd,
>  unsigned long addr, unsigned long end,
>  unsigned long floor, unsigned long ceiling)
>  {
> - unsigned long start = addr;
>   pgtable_t token = pmd_pgtable(*pmd);
>  
> - start &= PMD_MASK;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= PMD_MASK;
> - if (!ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(addr, end, floor, ceiling, PMD_MASK))
>   return;
>  
>   pmd_clear(pmd);
> @@ -395,15 +393,7 @@ static void hugetlb_free_pmd_range(struct mmu_gather 
> *tlb, pud_t *pud,
> addr, next, floor, ceiling);
>   } while (addr = next, addr != end);
>  
> - start &= PUD_MASK;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= PUD_MASK;
> - if (!ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(start, end, floor, ceiling, PUD_MASK))
>   return;
>  
>   pmd = pmd_offset(pud, start);
> @@ -446,15 +436,7 @@ static void hugetlb_free_pud_range(struct mmu_gather 
> *tlb, p4d_t *p4d,
>   }
>   } while (addr = next, addr != end);
>  
> - start &= PGDIR_MASK;
> - if (start < floor)
> - return;
> - if (ceiling) {
> - ceiling &= PGDIR_MASK;
> - if (!ceiling)
> - return;
> - }
> - if (end - 1 > ceiling - 1)
> + if (range_is_outside_limits(start, end, floor, ceiling, PGDIR_MASK))
>   return;
>  
>   pud = pud_offset(p4d, start);

Well, "start" is still used in hugetlb_free_pmd_range() and
hugetlb_free_pud_range() after the range_is_outside_limits() call, but with this
patch "start" is no longer masked there, i.e., the "start &= MASK" step is gone.

Anyway, reverting this commit from today's linux-next fixed a crash on POWER9 NV.
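
To illustrate the point about "start", here is a minimal user-space sketch. It
copies range_is_outside_limits() from the quoted patch; DEMO_MASK and the two
start_seen_*() functions are hypothetical names made up for this example and are
not kernel code. It only shows that the old inline check left "start" masked for
the code that follows, while the helper masks just its own by-value copy.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for PUD_MASK/PGDIR_MASK: clears the low 16 bits. */
#define DEMO_MASK (~0xffffUL)

/* Same logic as the helper added by the quoted patch. */
static bool range_is_outside_limits(unsigned long start, unsigned long end,
				    unsigned long floor, unsigned long ceiling,
				    unsigned long mask)
{
	if ((start & mask) < floor)
		return true;
	if (ceiling) {
		ceiling &= mask;
		if (!ceiling)
			return true;
	}
	return end - 1 > ceiling - 1;
}

/* Old inline check: "start" is masked in place before later code reuses it. */
static unsigned long start_seen_before_patch(unsigned long start)
{
	start &= DEMO_MASK;
	return start;
}

/* New helper call: only the callee's copy is masked, the caller's is not. */
static unsigned long start_seen_after_patch(unsigned long start,
					    unsigned long end)
{
	bool outside = range_is_outside_limits(start, end, 0, 0, DEMO_MASK);

	/* With floor == 0 and ceiling == 0 nothing can be outside the limits. */
	assert(!outside);
	return start;
}
```

With start = 0x12345, the first variant hands 0x10000 to the code that follows
the check, while the second still hands over 0x12345. That unmasked value is
what pmd_offset(pud, start) and pud_offset(p4d, start) would receive after the
patch.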

# runltp -f hugetlb
[ 7703.114640][T58070] LTP: starting hugemmap05_1 (hugemmap05 -m)
[ 7703.157792][   C99] [ cut here ]
[ 7703.158279][   C99] kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:387!
[ 7703.158306][   C99] Oops: Exception in kernel mode, sig: 5 [#1]
[ 

Re: [PATCH v2 16/17] driver core: Refactor fw_devlink feature

2020-12-11 Thread Qian Cai
On Fri, 2020-11-20 at 18:02 -0800, Saravana Kannan wrote:
> The current implementation of fw_devlink is very inefficient because it
> tries to get away without creating fwnode links in the name of saving
> memory usage. Past attempts to optimize runtime at the cost of memory
> usage were blocked with request for data showing that the optimization
> made significant improvement for real world scenarios.
> 
> We have those scenarios now. There have been several reports of boot
> time increase in the order of seconds in this thread [1]. Several OEMs
> and SoC manufacturers have also privately reported significant
> (350-400ms) increase in boot time due to all the parsing done by
> fw_devlink.
> 
> So this patch uses all the setup done by the previous patches in this
> series to refactor fw_devlink to be more efficient. Most of the code has
> been moved out of firmware specific (DT mostly) code into driver core.
> 
> This brings the following benefits:
> - Instead of parsing the device tree multiple times during bootup,
>   fw_devlink parses each fwnode node/property only once and creates
>   fwnode links. The rest of the fw_devlink code then just looks at these
>   fwnode links to do rest of the work.
> 
> - Makes it much easier to debug probe issue due to fw_devlink in the
>   future. fw_devlink=on blocks the probing of devices if they depend on
>   a device that hasn't been added yet. With this refactor, it'll be very
>   easy to tell what that device is because we now have a reference to
>   the fwnode of the device.
> 
> - Much easier to add fw_devlink support to ACPI and other firmware
>   types. A refactor to move the common bits from DT specific code to
>   driver core was in my TODO list as a prerequisite to adding ACPI
>   support to fw_devlink. This series gets that done.
> 
> [1] - 
> https://lore.kernel.org/linux-omap/ea02f57e-871d-cd16-4418-c1da4bbc4...@ti.com/
> Signed-off-by: Saravana Kannan 
> Tested-by: Laurent Pinchart 
> Tested-by: Grygorii Strashko 

Reverting this commit and its dependency:

2d09e6eb4a6f driver core: Delete pointless parameter in 
fwnode_operations.add_links

from today's linux-next fixed a boot crash on an arm64 Thunder X2 server.

.config: https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

[   57.413929][T1] ACPI: 5 ACPI AML tables successfully acquired and loaded
[   60.571643][T1] ACPI: Interpreter enabled
[   60.576104][T1] ACPI: Using GIC for interrupt routing
[   60.582474][T1] ACPI: MCFG table detected, 1 entries
[   60.588051][T1] ACPI: IORT: SMMU-v3[40230] Mapped to Proximity 
domain 0
[   60.601374][T1] Unable to handle kernel paging request at virtual 
address dfff8000
[   60.610146][T1] Mem abort info:
[   60.613694][T1]   ESR = 0x9604
[   60.617496][T1]   EC = 0x25: DABT (current EL), IL = 32 bits
[   60.623616][T1]   SET = 0, FnV = 0
[   60.627420][T1]   EA = 0, S1PTW = 0
[   60.631304][T1] Data abort info:
[   60.634957][T1]   ISV = 0, ISS = 0x0004
[   60.639546][T1]   CM = 0, WnR = 0
[   60.643255][T1] [dfff8000] address between user and kernel 
address ranges
[   60.651226][T1] Internal error: Oops: 9604 [#1] SMP
[   60.656864][T1] Modules linked in:
[   60.660658][T1] CPU: 38 PID: 1 Comm: swapper/0 Tainted: GW   
  5.10.0-rc7-next-20201211 #2
[   60.670424][T1] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.16 07/29/2020
[   60.680979][T1] pstate: 1049 (nzcV daif +PAN -UAO -TCO BTYPE=--)
[   60.687757][T1] pc : device_add+0xf60/0x16b0
[   60.692430][T1] lr : device_add+0xf08/0x16b0
[   60.697098][T1] sp : 063bf7d0
[   60.701147][T1] x29: 063bf7d0 x28: 1f760810 
[   60.707226][T1] x27: fff8 x26: 1f760858 
[   60.713304][T1] x25: 1f760c58 x24: fff8 
[   60.719381][T1] x23: 1fffe3eec10b x22: 8000190d0260 
[   60.725458][T1] x21: 800011dba708 x20:  
[   60.731535][T1] x19: 1fffe0c77f10 x18: 1fffe001cf0d53ed 
[   60.737616][T1] x17:  x16: 133676f1 
[   60.743709][T1] x15:  x14: 800011731e34 
[   60.749786][T1] x13: 73217fb9 x12: 13217fb8 
[   60.755864][T1] x11: 13217fb8 x10: 73217fb8 
[   60.761940][T1] x9 : dfff8000 x8 : 8000190bfdc7 
[   60.768017][T1] x7 : 0001 x6 : 73217fb9 
[   60.774094][T1] x5 : 73217fb9 x4 : 06324a80 
[   60.780170][T1] x3 : 1fffe0c64951 x2 :  
[   60.786247][T1] x1 :  x0 : dfff8000 
[   60.792324][T1] Call trace:
[   60.795495][T1]  device_add+0xf60/0x16b0
__fw_devlink_link_to_consumers at drivers/base/core.c:1583
(inlined by) fw_devlink_link_device at drivers/base/core.c:1726
(inlined by) device_add at 

Re: [PATCH 4/6] locking/lockdep: Clean up check_redundant() a bit

2020-12-10 Thread Qian Cai
On Thu, 2020-12-10 at 15:42 +0100, Peter Zijlstra wrote:
[]
>  /*
> @@ -2706,6 +2666,55 @@ static inline int check_irq_usage(struct
>  }
>  #endif /* CONFIG_TRACE_IRQFLAGS */
>  
> +#ifdef CONFIG_LOCKDEP_SMALL
> +/*
> + * Check that the dependency graph starting at  can lead to
> + *  or not. If it can,  ->  dependency is already
> + * in the graph.
> + *
> + * Return BFS_RMATCH if it does, or BFS_RMATCH if it does not, return BFS_E* 
> if
> + * any error appears in the bfs search.

Correction -- or BFS_RNOMATCH if it does not.



Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-12-08 Thread Qian Cai
On Mon, 2020-12-07 at 19:27 +, Valentin Schneider wrote:
> Ok, can reproduce this on a TX2 on next-20201207. I didn't use your config,
> I oldconfig'd my distro config and only modified it to CONFIG_PREEMPT_NONE.
> Interestingly the BUG happens on CPU127 here too...

I think that number is totally random. For example, on this x86, it could happen
for CPU8 or CPU111.



Re: [PATCH v4 17/26] kvm: arm64: Add offset for hyp VA <-> PA conversion

2020-12-07 Thread Qian Cai
On Wed, 2020-12-02 at 18:41 +, David Brazdil wrote:
> Add a host-initialized constant to KVM nVHE hyp code for converting
> between EL2 linear map virtual addresses and physical addresses.
> Also add `__hyp_pa` macro that performs the conversion.
> 
> Signed-off-by: David Brazdil 
> ---
>  arch/arm64/kvm/hyp/nvhe/psci-relay.c |  3 +++
>  arch/arm64/kvm/va_layout.c   | 30 +---
>  2 files changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/nvhe/psci-relay.c 
> b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
> index 61375d4571c2..70b42f433449 100644
> --- a/arch/arm64/kvm/hyp/nvhe/psci-relay.c
> +++ b/arch/arm64/kvm/hyp/nvhe/psci-relay.c
> @@ -18,6 +18,9 @@
>  /* Config options set by the host. */
>  __ro_after_init u32 kvm_host_psci_version;
>  __ro_after_init struct psci_0_1_function_ids kvm_host_psci_0_1_function_ids;
> +__ro_after_init s64 hyp_physvirt_offset;
> +
> +#define __hyp_pa(x) ((phys_addr_t)((x)) + hyp_physvirt_offset)
>  
>  static u64 get_psci_func_id(struct kvm_cpu_context *host_ctxt)
>  {
> diff --git a/arch/arm64/kvm/va_layout.c b/arch/arm64/kvm/va_layout.c
> index 4130b72e6891..d8cc51bd60bf 100644
> --- a/arch/arm64/kvm/va_layout.c
> +++ b/arch/arm64/kvm/va_layout.c
> @@ -23,6 +23,30 @@ static u8 tag_lsb;
>  static u64 tag_val;
>  static u64 va_mask;
>  
> +/*
> + * Compute HYP VA by using the same computation as kern_hyp_va().
> + */
> +static u64 __early_kern_hyp_va(u64 addr)
> +{
> + addr &= va_mask;
> + addr |= tag_val << tag_lsb;
> + return addr;
> +}
> +
> +/*
> + * Store a hyp VA <-> PA offset into a hyp-owned variable.
> + */
> +static void init_hyp_physvirt_offset(void)
> +{
> + extern s64 kvm_nvhe_sym(hyp_physvirt_offset);
> + u64 kern_va, hyp_va;
> +
> + /* Compute the offset from the hyp VA and PA of a random symbol. */
> + kern_va = (u64)kvm_ksym_ref(__hyp_text_start);
> + hyp_va = __early_kern_hyp_va(kern_va);
> + CHOOSE_NVHE_SYM(hyp_physvirt_offset) = (s64)__pa(kern_va) - (s64)hyp_va;

The code here introduced a warning on TX2 from today's linux-next.

.config: https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

[   29.356963] CPU255: Booted secondary processor 0x011f03 [0x431f0af1]
[   29.358301] smp: Brought up 2 nodes, 256 CPUs
[   29.364962] SMP: Total of 256 processors activated.
[   29.364985] CPU features: detected: Privileged Access Never
[   29.365003] CPU features: detected: LSE atomic instructions
[   29.365023] CPU features: detected: CRC32 instructions
[   29.431660] CPU: All CPU(s) started at EL2
[   29.431685] [ cut here ]
[   29.431713] virt_to_phys used for non-linear address: (ptrval) 
(__hyp_idmap_text_end+0x0/0x534)
[   29.431744] WARNING: CPU: 0 PID: 1 at arch/arm64/mm/physaddr.c:15 
__virt_to_phys+0x80/0xc0
[   29.431759] Modules linked in:
[   29.431787] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
5.10.0-rc6-next-20201207+ #2
[   29.431804] pstate: 1049 (nzcV daif +PAN -UAO -TCO BTYPE=--)
[   29.431819] pc : __virt_to_phys+0x80/0xc0
[   29.431834] lr : __virt_to_phys+0x80/0xc0
[   29.431848] sp : 05fefc90
[   29.431862] x29: 05fefc90 x28: 8000191c9010 
[   29.431891] x27: 05f21228 x26: b14e19fe279ae3eb 
[   29.431920] x25: 8000191c9010 x24: 8000191c9000 
[   29.431948] x23: 8000191c9000 x22: 000f800011235acc 
[   29.431975] x21: 0001 x20: 000f8000 
[   29.432003] x19: 800011235acc x18: 6001cedcc336 
[   29.432031] x17: 1308 x16: 0002 
[   29.432058] x15:  x14: 7261656e696c2d6e 
[   29.432086] x13: 60bfdee7 x12: 1fffe0bfdee6 
[   29.432113] x11: 1fffe0bfdee6 x10: 60bfdee6 
[   29.432141] x9 : 80001020a928 x8 : 05fef737 
[   29.432169] x7 : 0001 x6 : 60bfdee7 
[   29.432196] x5 : 60bfdee7 x4 : 1fffe0bfdedc 
[   29.432223] x3 : 1fffe0be4009 x2 : 60bfdf5c 
[   29.432251] x1 : 8fd448c3d76ca800 x0 :  
[   29.432279] Call trace:
[   29.432294]  __virt_to_phys+0x80/0xc0
[   29.432312]  kvm_compute_layout+0x21c/0x264
init_hyp_physvirt_offset at arch/arm64/kvm/va_layout.c:47
(inlined by) kvm_compute_layout at arch/arm64/kvm/va_layout.c:82
[   29.432327]  smp_cpus_done+0x164/0x17c
[   29.432342]  smp_init+0xc4/0xd8
[   29.432358]  kernel_init_freeable+0x4ec/0x734
[   29.432375]  kernel_init+0x18/0x12c
[   29.432391]  ret_from_fork+0x10/0x1c
[   29.432405] irq event stamp: 490612
[   29.432424] hardirqs last  enabled at (490611): [] 
console_unlock+0x8e0/0xca0
[   29.432440] hardirqs last disabled at (490612): [] 
el1_dbg+0x24/0x50
[   29.432455] softirqs last  enabled at (487946): [] 
_stext+0xa98/0x113c
[   29.432473] softirqs last disabled at (487939): [] 
irq_exit+0x500/0x5e0
[   29.432492] ---[ end trace 96247b4cbbdf9333 ]---
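
For reference, the offset arithmetic in the quoted hunk can be sketched in user
space as below. This is only an illustration of the "PA = hyp VA + offset"
bookkeeping. All demo_* names are hypothetical, the offset is kept as an
unsigned wrapping value rather than the s64 used in the patch, and the sketch
says nothing about the __virt_to_phys warning itself, which concerns the address
passed to __pa().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for hyp_physvirt_offset. Unsigned wraparound gives
 * the same bit pattern as the signed (s64) difference used in the patch.
 */
static uint64_t demo_physvirt_offset;

/* Mirrors the __hyp_pa() idea: add the stored offset to a hyp VA. */
static uint64_t demo_hyp_pa(uint64_t hyp_va)
{
	return hyp_va + demo_physvirt_offset;
}

/* Record the offset from one known (PA, hyp VA) pair of the same symbol. */
static void demo_init_offset(uint64_t pa, uint64_t hyp_va)
{
	demo_physvirt_offset = pa - hyp_va;

	/* Sanity check: the reference pair must round-trip. */
	assert(demo_hyp_pa(hyp_va) == pa);
}
```

The conversion is only as good as the reference pair: if the PA used to compute
the offset is wrong, every later demo_hyp_pa() result is shifted by the same
amount.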

> +}
> +
>  /*
>   * We want to generate a hyp VA with the following format (with V 

Re: [PATCH v14 09/10] arch, mm: wire up memfd_secret system call were relevant

2020-12-07 Thread Qian Cai
On Thu, 2020-12-03 at 08:29 +0200, Mike Rapoport wrote:
> From: Mike Rapoport 
> 
> Wire up memfd_secret system call on architectures that define
> ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.
> 
> Signed-off-by: Mike Rapoport 
> Acked-by: Palmer Dabbelt 
> Acked-by: Arnd Bergmann 
> ---
>  arch/arm64/include/uapi/asm/unistd.h   | 1 +
>  arch/riscv/include/asm/unistd.h| 1 +
>  arch/x86/entry/syscalls/syscall_32.tbl | 1 +
>  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>  include/linux/syscalls.h   | 1 +
>  include/uapi/asm-generic/unistd.h  | 6 +-
>  mm/secretmem.c | 3 +++
>  scripts/checksyscalls.sh   | 4 
>  8 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/uapi/asm/unistd.h 
> b/arch/arm64/include/uapi/asm/unistd.h
> index f83a70e07df8..ce2ee8f1e361 100644
> --- a/arch/arm64/include/uapi/asm/unistd.h
> +++ b/arch/arm64/include/uapi/asm/unistd.h
> @@ -20,5 +20,6 @@
>  #define __ARCH_WANT_SET_GET_RLIMIT
>  #define __ARCH_WANT_TIME32_SYSCALLS
>  #define __ARCH_WANT_SYS_CLONE3
> +#define __ARCH_WANT_MEMFD_SECRET
>  
>  #include 
> diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
> index 977ee6181dab..6c316093a1e5 100644
> --- a/arch/riscv/include/asm/unistd.h
> +++ b/arch/riscv/include/asm/unistd.h
> @@ -9,6 +9,7 @@
>   */
>  
>  #define __ARCH_WANT_SYS_CLONE
> +#define __ARCH_WANT_MEMFD_SECRET
>  
>  #include 
>  
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
> b/arch/x86/entry/syscalls/syscall_32.tbl
> index c52ab1c4a755..109e6681b8fa 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -446,3 +446,4 @@
>  439  i386faccessat2  sys_faccessat2
>  440  i386process_madvise sys_process_madvise
>  441  i386watch_mount sys_watch_mount
> +442  i386memfd_secretsys_memfd_secret
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index f3270a9ef467..742cf17d7725 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -363,6 +363,7 @@
>  439  common  faccessat2  sys_faccessat2
>  440  common  process_madvise sys_process_madvise
>  441  common  watch_mount sys_watch_mount
> +442  common  memfd_secretsys_memfd_secret
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 6d55324363ab..f9d93fbf9b69 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1010,6 +1010,7 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int 
> sig,
>  asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
>  asmlinkage long sys_watch_mount(int dfd, const char __user *path,
>   unsigned int at_flags, int watch_fd, int 
> watch_id);
> +asmlinkage long sys_memfd_secret(unsigned long flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h 
> b/include/uapi/asm-generic/unistd.h
> index 5df46517260e..51151888f330 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -861,9 +861,13 @@ __SYSCALL(__NR_faccessat2, sys_faccessat2)
>  __SYSCALL(__NR_process_madvise, sys_process_madvise)
>  #define __NR_watch_mount 441
>  __SYSCALL(__NR_watch_mount, sys_watch_mount)
> +#ifdef __ARCH_WANT_MEMFD_SECRET
> +#define __NR_memfd_secret 442
> +__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
> +#endif

I can't see where it is defined for arm64, since it looks like Andrew has
deleted the above chunk. Thus, we get a warning using this .config:

https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

:1539:2: warning: #warning syscall memfd_secret not implemented [-Wcpp]

>  
>  #undef __NR_syscalls
> -#define __NR_syscalls 442
> +#define __NR_syscalls 443
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 7236f4d9458a..b8a32954ac68 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -415,6 +415,9 @@ static int __init secretmem_setup(char *str)
>   unsigned long reserved_size;
>   int err;
>  
> + if (!can_set_direct_map())
> + return 0;
> +
>   reserved_size = memparse(str, NULL);
>   if (!reserved_size)
>   return 0;
> diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
> index a18b47695f55..b7609958ee36 100755
> --- a/scripts/checksyscalls.sh
> +++ b/scripts/checksyscalls.sh
> @@ -40,6 +40,10 @@ cat << EOF
>  #define __IGNORE_setrlimit   /* setrlimit */
>  #endif
>  
> +#ifndef __ARCH_WANT_MEMFD_SECRET
> +#define __IGNORE_memfd_secret
> +#endif
> +
>  /* Missing flags argument */
>  #define __IGNORE_renameat/* renameat2 */
>  



Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-12-05 Thread Qian Cai
On Sat, 2020-12-05 at 18:37 +, Valentin Schneider wrote:
> From there I see:
> 
> [20798.166987][  T650] CPU127 nr_running=2
> [20798.171185][  T650]  p=migration/127
> [20798.175161][  T650]  p=kworker/127:1
> 
> so this might be another workqueue hurdle. This should be prevented by:
> 
>   06249738a41a ("workqueue: Manually break affinity on hotplug")

Well, it was reproduced on the latest linux-next, which already includes that
commit.

> Note that much earlier in your log, you have a softlockup on CPU127:
> 
> [   74.278367][  C127] watchdog: BUG: soft lockup - CPU#127 stuck for 23s!
> [swapper/0:1]

That's something separate. It was there all the time.

https://lore.kernel.org/linux-acpi/20200929183444.25079-1-...@redhat.com/



Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-12-04 Thread Qian Cai
On Tue, 2020-11-17 at 19:28 +, Valentin Schneider wrote:
> We did have some breakage in that area, but all the holes I was aware of
> have been plugged. What would help here is to see which tasks are still
> queued on that outgoing CPU, and their recent activity.
> 
> Something like
> - ftrace_dump_on_oops on your kernel cmdline
> - trace-cmd start -e 'sched:*'
>  
> 
> ought to do it. Then you can paste the (tail of the) ftrace dump.
> 
> I also had this laying around, which may or may not be of some help:

Okay, your patch did not help, since it can still be reproduced using this,

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/hotplug/cpu_hotplug/functional/cpuhotplug04.sh

# while :; do cpuhotplug04.sh -l 1; done

The ftrace dump has too much output on this 256-CPU system, so I did not have
the patience to wait for it to finish after 15 minutes. But here is the log
captured so far (search for "kernel BUG" there).

http://people.redhat.com/qcai/console.log

> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a6aaf9fb3400..c4a4cb8b47a2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7534,7 +7534,25 @@ int sched_cpu_dying(unsigned int cpu)
>   sched_tick_stop(cpu);
>  
>   rq_lock_irqsave(rq, );
> - BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));
> +
> + if (rq->nr_running != 1 || rq_has_pinned_tasks(rq)) {
> + struct task_struct *g, *p;
> +
> + pr_crit("CPU%d nr_running=%d\n", cpu, rq->nr_running);
> + rcu_read_lock();
> + for_each_process_thread(g, p) {
> + if (task_cpu(p) != cpu)
> + continue;
> +
> + if (!task_on_rq_queued(p))
> + continue;
> +
> + pr_crit("\tp=%s\n", p->comm);
> + }
> + rcu_read_unlock();
> + BUG();
> + }
> +
>   rq_unlock_irqrestore(rq, );
>  
>   calc_load_migrate(rq);
> 



Re: [PATCH] mm/memblock:use a more appropriate order calculation when free memblock pages

2020-12-04 Thread Qian Cai
On Thu, 2020-12-03 at 23:23 +0800, carver4...@163.com wrote:
> From: Hailong Liu 
> 
> When system in the booting stage, pages span from [start, end] of a memblock
> are freed to buddy in a order as large as possible (less than MAX_ORDER) at
> first, then decrease gradually to a proper order(less than end) in a loop.
> 
> However, *min(MAX_ORDER - 1UL, __ffs(start))* can not get the largest order
> in some cases.
> Instead, *__ffs(end - start)* may be more appropriate and meaningful.
> 
> Signed-off-by: Hailong Liu 

Reverting this commit on top of today's linux-next fixed boot crashes on
multiple NUMA systems.

[5.050736][T0] flags: 0x3fffc0()
[5.055103][T0] raw: 003fffc0 ea000448 ea000448 

[5.063572][T0] raw:    

[5.072045][T0] page dumped because: VM_BUG_ON_PAGE(pfn & ((1 << order) 
- 1))
[5.079580][T0] [ cut here ]
[5.084883][T0] kernel BUG at mm/page_alloc.c:1015!
[5.090151][T0] invalid opcode:  [#1] SMP KASAN NOPTI
[5.095894][T0] CPU: 0 PID: 0 Comm: swapper Not tainted 
5.10.0-rc6-next-20201204+ #11
[5.104099][T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 
Gen10, BIOS A40 07/10/2019
[5.113370][T0] RIP: 0010:__free_one_page+0xa19/0x1140
[5.118864][T0] Code: d2 e9 69 f6 ff ff 0f 0b 48 c7 c6 e0 52 2d a5 4c 89 
ff e8 7a 98 f8 ff 0f 0b 0f 0b 48 c7 c6 60 53 2d a5 4c 89 ff e8 67 98 f8 ff <0f> 
0b 48 c7 c6 c0 53 2d a5 4c 89 ff e8 56 98 f8 ff 0f 0b 48 89 da
[5.138427][T0] RSP: :a5807c30 EFLAGS: 00010086
[5.144367][T0] RAX:  RBX: 0008 RCX: 
a3c4abf4
[5.152228][T0] RDX: 1d40008f RSI:  RDI: 
ea000478
[5.160091][T0] RBP: 0007 R08: fbfff5918fc5 R09: 
fbfff5918fc5
[5.167951][T0] R10: ac8c7e23 R11: fbfff5918fc4 R12: 

[5.175815][T0] R13: 0003 R14: 7fff6000 R15: 
ea000440
[5.183677][T0] FS:  () GS:1e80() 
knlGS:
[5.192499][T0] CS:  0010 DS:  ES:  CR0: 80050033
[5.198963][T0] CR2: 88907efff000 CR3: 000ce3e14000 CR4: 
000406b0
[5.206823][T0] Call Trace:
[5.209978][T0]  ? rwlock_bug.part.1+0x90/0x90
[5.214774][T0]  free_one_page+0x7e/0x1e0
[5.219142][T0]  __free_pages_ok+0x646/0x13b0
[5.223863][T0]  memblock_free_all+0x21c/0x2c0
(inlined by) __free_memory_core at mm/memblock.c:2037
(inlined by) free_low_memory_core_early at mm/memblock.c:2060
(inlined by) memblock_free_all at mm/memblock.c:2100
[5.228662][T0]  ? reset_all_zones_managed_pages+0x9a/0x9a
[5.234515][T0]  ? memblock_alloc_try_nid+0xe6/0x127
[5.239842][T0]  ? memblock_alloc_try_nid_raw+0x12a/0x12a
[5.245610][T0]  ? early_amd_iommu_init+0x1e1f/0x1e1f
[5.251024][T0]  ? iommu_go_to_state+0x24/0x28
[5.255831][T0]  mem_init+0x1a/0x350
[5.259762][T0]  mm_init+0x5f/0x87
[5.263515][T0]  start_kernel+0x14c/0x3a7
[5.267882][T0]  ? copy_bootdata+0x19/0x47
[5.272340][T0]  secondary_startup_64_no_verify+0xc2/0xcb
[5.278102][T0] Modules linked in:
[5.281869][T0] random: get_random_bytes called from 
print_oops_end_marker+0x26/0x40 with crng_init=0
[5.281878][T0] ---[ end trace 32dd7228cc16af82 ]---
[5.296795][T0] RIP: 0010:__free_one_page+0xa19/0x1140
[5.302299][T0] Code: d2 e9 69 f6 ff ff 0f 0b 48 c7 c6 e0 52 2d a5 4c 89 
ff e8 7a 98 f8 ff 0f 0b 0f 0b 48 c7 c6 60 53 2d a5 4c 89 ff e8 67 98 f8 ff <0f> 
0b 48 c7 c6 c0 53 2d a5 4c 89 ff e8 56 98 f8 ff 0f 0b 48 89 da
[5.321864][T0] RSP: :a5807c30 EFLAGS: 00010086
[5.327803][T0] RAX:  RBX: 0008 RCX: 
a3c4abf4
[5.335665][T0] RDX: 1d40008f RSI:  RDI: 
ea000478
[5.343526][T0] RBP: 0007 R08: fbfff5918fc5 R09: 
fbfff5918fc5
[5.351389][T0] R10: ac8c7e23 R11: fbfff5918fc4 R12: 

[5.359249][T0] R13: 0003 R14: 7fff6000 R15: 
ea000440
[5.367110][T0] FS:  () GS:1e80() 
knlGS:
[5.375932][T0] CS:  0010 DS:  ES:  CR0: 80050033
[5.382397][T0] CR2: 88907efff000 CR3: 000ce3e14000 CR4: 
000406b0
[5.390261][T0] Kernel panic - not syncing: Fatal exception
[5.396320][T0] ---[ end Kernel panic - not syncing: Fatal exception ]---

> ---
>  mm/memblock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index b68ee8678..7c6d0dde7 100644
> --- a/mm/memblock.c
> +++ 

Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-12-03 Thread Qian Cai
FYI, it did crash on arm64 (Thunder X2) as well, so I'll re-run to gather more
information too.

.config: https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

[20370.682747][T77637] psci: CPU123 killed (polled 0 ms) 
[20370.823651][  T635] IRQ 43: no longer affine to CPU124 
[20370.828862][  T635] IRQ 49: no longer affine to CPU124 
[20370.834072][  T635] IRQ 60: no longer affine to CPU124 
[20370.839517][  T635] IRQ 94: no longer affine to CPU124 
[20370.845778][T77637] CPU124: shutdown 
[20370.861891][T77637] psci: CPU124 killed (polled 10 ms) 
[20371.425434][T77637] CPU125: shutdown 
[20371.441464][T77637] psci: CPU125 killed (polled 10 ms) 
[20371.984072][T77637] CPU126: shutdown 
[20372.57][T77637] psci: CPU126 killed (polled 10 ms) 
[20372.223858][  T650] [ cut here ] 
[20372.229599][  T650] kernel BUG at kernel/sched/core.c:7594! 
[20372.235165][  T650] Internal error: Oops - BUG: 0 [#1] SMP 
[20372.240643][  T650] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1
vfio loop processor ip_tables x_tables sd_mod mlx5_core firmware_class ahci
libahci libata dm_mirror dm_region_hash dm_log dm_mod efivarfs 
[20372.259814][  T650] CPU: 127 PID: 650 Comm: migration/127 Tainted: G 
L5.10.0-rc6-next-20201203+ #5 
[20372.270152][  T650] Hardware name: HPE Apollo 70 /C01_APACHE_MB  
   , BIOS L50_5.13_1.16 07/29/2020 
[20372.280579][  T650] Stopper: multi_cpu_stop+0x0/0x390 <- 0x0 
[20372.286230][  T650] pstate: 20400089 (nzCv daIf +PAN -UAO -TCO BTYPE=--) 
[20372.292923][  T650] pc : sched_cpu_dying+0x198/0x1b8 
[20372.297879][  T650] lr : sched_cpu_dying+0x68/0x1b8 
[20372.302748][  T650] sp : 1076fba0 
[20372.306747][  T650] x29: 1076fba0 x28:   
[20372.312751][  T650] x27: 0001 x26: 800011db3000  
[20372.318753][  T650] x25: 000e7bdd16a8 x24: 005a  
[20372.324754][  T650] x23: 007f x22: 0080  
[20372.330756][  T650] x21: fab7 x20: 000e7be63818  
[20372.336757][  T650] x19: 000e7be63800 x18: 1fffe001cf7cb3ed  
[20372.342758][  T650] x17: 1308 x16:   
[20372.348759][  T650] x15: 0001053f x14: 0001053f  
[20372.354761][  T650] x13: 620edf65 x12: 1fffe20edf64  
[20372.360763][  T650] x11: 1fffe20edf64 x10: 620edf64  
[20372.366764][  T650] x9 : dfff8000 x8 : 1076fb23  
[20372.372766][  T650] x7 : 0001 x6 : 0001  
[20372.378767][  T650] x5 : 1fffe20b9a0a x4 : dfff8000  
[20372.384769][  T650] x3 : dfff8000 x2 : 0003  
[20372.390770][  T650] x1 : 000e7be63840 x0 : 0002  
[20372.396771][  T650] Call trace: 
[20372.399905][  T650]  sched_cpu_dying+0x198/0x1b8 
[20372.404514][  T650]  cpuhp_invoke_callback+0x208/0x2bf0 
[20372.409730][  T650]  take_cpu_down+0x11c/0x1f0 
[20372.414165][  T650]  multi_cpu_stop+0x184/0x390 
[20372.418687][  T650]  cpu_stopper_thread+0x1f0/0x430 
[20372.423557][  T650]  smpboot_thread_fn+0x3a8/0x9c8 
[20372.428339][  T650]  kthread+0x3a0/0x448 
[20372.432253][  T650]  ret_from_fork+0x10/0x1c 
[20372.436517][  T650] Code: d65f03c0 911a82a2 140004fb 17d9 (d421)  
[20372.443298][  T650] ---[ end trace c51d5b6889ec29a8 ]--- 
[20372.448602][  T650] Kernel panic - not syncing: Oops - BUG: Fatal exception 



Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP

2020-12-03 Thread Qian Cai
On Thu, 2020-12-03 at 18:27 +0100, Marek Szyprowski wrote:
> Hi
> 
> On 03.12.2020 16:46, Marek Szyprowski wrote:
> > On 25.11.2020 03:32, Matthew Wilcox wrote:
> > > On Tue, Nov 17, 2020 at 11:43:02PM +, Matthew Wilcox wrote:
> > > > On Tue, Nov 17, 2020 at 07:15:13PM +, Matthew Wilcox wrote:
> > > > > I find both of these functions exceptionally confusing.  Does this
> > > > > make it easier to understand?
> > > > Never mind, this is buggy.  I'll send something better tomorrow.
> > > That took a week, not a day.  *sigh*.  At least this is shorter.
> > > 
> > > commit 1a02863ce04fd325922d6c3db6d01e18d55f966b
> > > Author: Matthew Wilcox (Oracle) 
> > > Date:   Tue Nov 17 10:45:18 2020 -0500
> > > 
> > >  fix mm-truncateshmem-handle-truncates-that-split-thps.patch
> > 
> > This patch landed in todays linux-next (20201203) as commit 
> > 8678b27f4b8b ("8678b27f4b8bfc130a13eb9e9f27171bcd8c0b3b"). Sadly it 
> > breaks booting of ANY of my ARM 32bit test systems, which use initrd. 
> > ARM64bit based systems boot fine. Here is example of the crash:
> 
> One more thing. Reverting those two:
> 
> 1b1aa968b0b6 mm-truncateshmem-handle-truncates-that-split-thps-fix-fix
> 
> 8678b27f4b8b mm-truncateshmem-handle-truncates-that-split-thps-fix
> 
> on top of linux next-20201203 fixes the boot issues.

We also have to revert those two patches to fix a hang where one process keeps
running at 100% CPU in find_get_entries() while all other threads block on the
i_mutex almost forever.

[  380.735099] INFO: task trinity-c58:2143 can't die for more than 125 seconds.
[  380.742923] task:trinity-c58 state:R  running task stack:26056 pid: 
2143 ppid:  1914 flags:0x4006
[  380.753640] Call Trace:
[  380.756811]  ? find_get_entries+0x339/0x790
find_get_entry at mm/filemap.c:1848
(inlined by) find_get_entries at mm/filemap.c:1904
[  380.761723]  ? __lock_page_or_retry+0x3f0/0x3f0
[  380.767009]  ? shmem_undo_range+0x3bf/0xb60
[  380.771944]  ? unmap_mapping_pages+0x96/0x230
[  380.777036]  ? find_held_lock+0x33/0x1c0
[  380.781688]  ? shmem_write_begin+0x1b0/0x1b0
[  380.786703]  ? unmap_mapping_pages+0xc2/0x230
[  380.791796]  ? down_write+0xe0/0x150
[  380.796114]  ? do_wp_page+0xc60/0xc60
[  380.800507]  ? shmem_truncate_range+0x14/0x80
[  380.805618]  ? shmem_setattr+0x827/0xc70
[  380.810274]  ? notify_change+0x6cf/0xc30
[  380.814941]  ? do_truncate+0xe2/0x180
[  380.819335]  ? do_truncate+0xe2/0x180
[  380.823741]  ? do_sys_openat2+0x5c0/0x5c0
[  380.828484]  ? do_sys_ftruncate+0x2e2/0x4e0
[  380.833417]  ? trace_hardirqs_on+0x1c/0x150
[  380.838335]  ? do_syscall_64+0x33/0x40
[  380.842828]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  380.848870]



Re: [PATCH v3 1/1] kasan: fix object remain in offline per-cpu quarantine

2020-12-03 Thread Qian Cai
On Wed, 2020-12-02 at 15:53 +0800, Kuan-Ying Lee wrote:
> We hit this issue in our internal test.
> When enabling generic kasan, a kfree()'d object is put into per-cpu
> quarantine first. If the cpu goes offline, object still remains in
> the per-cpu quarantine. If we call kmem_cache_destroy() now, slub
> will report "Objects remaining" error.

Reverting this commit on top of today's linux-next fixed memory corruption
during CPU hotplug.

.config: https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

[  421.539476][  T120] BUG kmalloc-128 (Not tainted): Object already free
[  421.546047][  T120] 
-
[  421.546047][  T120] 
[  421.557165][  T120] INFO: Allocated in 
memcg_alloc_page_obj_cgroups+0x86/0x140 age=755 cpu=21 pid=2316
[  421.566533][  T120]  __slab_alloc+0x55/0x70
[  421.570744][  T120]  __kmalloc_node+0xdc/0x280
[  421.575215][  T120]  memcg_alloc_page_obj_cgroups+0x86/0x140
[  421.580910][  T120]  allocate_slab+0xd8/0x610
[  421.585299][  T120]  ___slab_alloc+0x4cb/0x830
[  421.589770][  T120]  __slab_alloc+0x55/0x70
[  421.593985][  T120]  kmem_cache_alloc+0x225/0x280
[  421.598724][  T120]  vm_area_dup+0x76/0x2a0
[  421.602940][  T120]  __split_vma+0x90/0x4b0
[  421.607151][  T120]  mprotect_fixup+0x5da/0x7d0
[  421.611712][  T120]  do_mprotect_pkey+0x41a/0x7c0
[  421.616447][  T120]  __x64_sys_mprotect+0x74/0xb0
[  421.621181][  T120]  do_syscall_64+0x33/0x40
[  421.625479][  T120]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  421.631262][  T120] INFO: Freed in quarantine_put+0xb5/0x1b0 age=3 cpu=21 
pid=120
[  421.638795][  T120]  quarantine_put+0xe7/0x1b0
[  421.643270][  T120]  slab_free_freelist_hook+0x71/0x1a0
[  421.648529][  T120]  kfree+0xe2/0x5d0
[  421.652215][  T120]  __free_slab+0x1f8/0x300
[  421.656517][  T120]  qlist_free_all+0x56/0xc0
[  421.660903][  T120]  kasan_cpu_offline+0x1a/0x20
[  421.665550][  T120]  cpuhp_invoke_callback+0x1dd/0x1530
[  421.670811][  T120]  cpuhp_thread_fun+0x343/0x690
[  421.675547][  T120]  smpboot_thread_fn+0x30f/0x780
[  421.680371][  T120]  kthread+0x359/0x420
[  421.684317][  T120]  ret_from_fork+0x22/0x30
[  421.688616][  T120] INFO: Slab 0x80d42669 objects=12 used=11 
fp=0x0ce8ce1d flags=0x4bfffc10201
[  421.699031][  T120] INFO: Object 0x0ce8ce1d @offset=1408 
fp=0x
[  421.699031][  T120] 
[  421.709186][  T120] Redzone 2d1421b0: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.719339][  T120] Redzone 18565a7c: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.729492][  T120] Redzone d8a699c9: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.739645][  T120] Redzone af065f39: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.749800][  T120] Redzone 480c8db9: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.759954][  T120] Redzone c37ee06b: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.770106][  T120] Redzone 40f9cbf1: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.780256][  T120] Redzone e714e01e: bb bb bb bb bb bb bb bb bb bb 
bb bb bb bb bb bb  
[  421.790408][  T120] Object 0ce8ce1d: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[  421.800471][  T120] Object 46eb4462: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[  421.810536][  T120] Object c3a122ae: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[  421.820599][  T120] Object d0195822: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[  421.830665][  T120] Object 8332e5f7: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[  421.840729][  T120] Object a04f77eb: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[  421.850796][  T120] Object 326e9ce3: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b 6b  
[  421.860862][  T120] Object fa32b4b7: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 
6b 6b 6b 6b 6b a5  kkk.
[  421.870926][  T120] Redzone 1d59aa8f: bb bb bb bb bb bb bb bb
  
[  421.880382][  T120] Padding f49e6727: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 
5a 5a 5a 5a 5a 5a  
[  421.890537][  T120] Padding 7eb7befd: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 
5a 5a 5a 5a 5a 5a  
[  421.900691][  T120] Padding c69f7c35: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 
5a 5a 5a 5a 5a 5a  
[  421.910842][  T120] CPU: 21 PID: 120 Comm: cpuhp/21 Tainted: GB  
   5.10.0-rc6-next-20201203+ #8
[  421.920733][  T120] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 
Gen10, BIOS A40 07/10/2019
[  421.930011][  T120] Call Trace:
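
For readers unfamiliar with the mechanism in the quoted commit message, here is
a minimal userspace sketch of a per-CPU quarantine that is drained when its CPU
goes offline. It is only a model of the idea, not the kernel code: the names
quarantine_put() and quarantine_cpu_offline(), the offline flag, NR_CPUS = 2 and
the nr_freed counter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_CPUS 2	/* illustrative; not a kernel value */

/* One parked object; a real quarantine would track sizes as well. */
struct qnode {
	struct qnode *next;
};

struct cpu_quarantine {
	struct qnode *head;	/* objects whose free is being delayed */
	bool offline;		/* set once the CPU has gone away */
};

static struct cpu_quarantine quarantine[NR_CPUS];
static unsigned int nr_freed;	/* counts real frees, for demonstration */

static void really_free(struct qnode *n)
{
	free(n);
	nr_freed++;
}

/* Delay the free by parking the object, unless the CPU is already offline. */
static void quarantine_put(int cpu, struct qnode *n)
{
	struct cpu_quarantine *q = &quarantine[cpu];

	assert(n != NULL);
	if (q->offline) {
		really_free(n);
		return;
	}
	n->next = q->head;
	q->head = n;
}

/* Offline callback: stop parking objects and release what is still queued. */
static void quarantine_cpu_offline(int cpu)
{
	struct cpu_quarantine *q = &quarantine[cpu];
	struct qnode *n = q->head;

	q->offline = true;
	q->head = NULL;
	while (n) {
		struct qnode *next = n->next;

		really_free(n);
		n = next;
	}
}
```

The list is detached before it is walked, so each object can only be released
once by the drain. Whether the real callback has an equivalent guard is not
something this sketch can answer.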

Re: [PATCH v2] lib: stackdepot: Add support to configure STACK_HASH_SIZE

2020-12-03 Thread Qian Cai
On Thu, 2020-11-26 at 10:13 +0530, vji...@codeaurora.org wrote:
> From: Yogesh Lal 
> 
> Add a kernel parameter stack_hash_order to configure STACK_HASH_SIZE.
> 
> Aim is to have configurable value for STACK_HASH_SIZE, so that one
> can configure it depending on usecase there by reducing the static
> memory overhead.
> 
> One example is of Page Owner, default value of STACK_HASH_SIZE lead
> stack depot to consume 8MB of static memory. Making it configurable
> and use lower value helps to enable features like CONFIG_PAGE_OWNER
> without any significant overhead.
> 
> Suggested-by: Minchan Kim 
> Signed-off-by: Yogesh Lal 
> Signed-off-by: Vijayanand Jitta 

Reverting this commit on today's linux-next fixed boot crash with KASAN.

.config:
https://cailca.coding.net/public/linux/mm/git/files/master/x86.config
https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config


[5.135848][T0] random: get_random_u64 called from 
__kmem_cache_create+0x2e/0x490 with crng_init=0
[5.135909][T0] BUG: unable to handle page fault for address: 
002ac6d0
[5.152733][T0] #PF: supervisor read access in kernel mode
[5.158585][T0] #PF: error_code(0x) - not-present page
[5.164438][T0] PGD 0 P4D 0 
[5.167670][T0] Oops:  [#1] SMP KASAN NOPTI
[5.172566][T0] CPU: 0 PID: 0 Comm: swapper Not tainted 
5.10.0-rc6-next-20201203+ #3
[5.180685][T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 
Gen10, BIOS A40 07/10/2019
[5.189950][T0] RIP: 0010:stack_depot_save+0xf4/0x460
stack_depot_save at lib/stackdepot.c:272
[5.195362][T0] Code: 00 00 83 ff 01 0f 84 b3 00 00 00 8b 0d 35 67 39 08 
b8 01 00 00 00 48 d3 e0 48 8b 0d 46 9c 27 11 48 83 e8 01 21 d8 4c 8d 34 c1 <4d> 
8b 2e 4d 85 ed 0f 84 ca 00 00 00 41 8d 74 24 ff 48 c1 e6 03 eb
[5.214927][T0] RSP: :99007c18 EFLAGS: 00010002
[5.220865][T0] RAX: 000558da RBX: caa558da RCX: 

[5.228726][T0] RDX: 0cc0 RSI: e11e461a RDI: 
0001
[5.236590][T0] RBP: 99007c68 R08: 2ff39dab R09: 
0007
[5.244450][T0] R10: 99007b60 R11: 0005 R12: 
0008
[5.252313][T0] R13: 8881000400b7 R14: 002ac6d0 R15: 
0078
[5.260173][T0] FS:  () GS:1e80() 
knlGS:
[5.268996][T0] CS:  0010 DS:  ES:  CR0: 80050033
[5.275674][T0] CR2: 002ac6d0 CR3: 000cc8814000 CR4: 
000406b0
[5.283534][T0] Call Trace:
[5.286687][T0]  kasan_save_stack+0x2f/0x40
[5.291225][T0]  ? kasan_save_stack+0x19/0x40
[5.295939][T0]  ? kasan_kmalloc.constprop.8+0x85/0xa0
[5.301793][T0]  ? __kmem_cache_create+0x26a/0x490
[5.306950][T0]  ? create_boot_cache+0x75/0x98
[5.311751][T0]  ? kmem_cache_init+0x42/0x146
[5.316471][T0]  ? mm_init+0x64/0x87
[5.320399][T0]  ? start_kernel+0x14c/0x3a7
[5.324945][T0]  ? secondary_startup_64_no_verify+0xc2/0xcb
[5.330885][T0]  ? lockdep_hardirqs_on_prepare+0x3d0/0x3d0
[5.336733][T0]  ? lockdep_hardirqs_on_prepare+0x3d0/0x3d0
[5.342590][T0]  ? __isolate_free_page+0x540/0x540
[5.347742][T0]  ? find_held_lock+0x33/0x1c0
[5.352371][T0]  ? __alloc_pages_nodemask+0x534/0x700
[5.357784][T0]  ? __alloc_pages_slowpath.constprop.110+0x20f0/0x20f0
[5.364600][T0]  ? __kasan_init_slab_obj+0x20/0x30
[5.369753][T0]  ? unpoison_range+0xf/0x30
[5.374207][T0]  kasan_kmalloc.constprop.8+0x85/0xa0
kasan_set_track at mm/kasan/common.c:47
(inlined by) set_alloc_info at mm/kasan/common.c:405
(inlined by) kasan_kmalloc at mm/kasan/common.c:436
[5.379886][T0]  __kmem_cache_create+0x26a/0x490
early_kmem_cache_node_alloc at mm/slub.c:3566
(inlined by) init_kmem_cache_nodes at mm/slub.c:3606
(inlined by) kmem_cache_open at mm/slub.c:3858
(inlined by) __kmem_cache_create at mm/slub.c:4468
[5.384864][T0]  create_boot_cache+0x75/0x98
create_boot_cache at mm/slab_common.c:568
[5.389493][T0]  kmem_cache_init+0x42/0x146
[5.394035][T0]  mm_init+0x64/0x87
[5.397791][T0]  start_kernel+0x14c/0x3a7
[5.402159][T0]  ? copy_bootdata+0x19/0x47
[5.406615][T0]  secondary_startup_64_no_verify+0xc2/0xcb
[5.412380][T0] Modules linked in:
[5.416136][T0] CR2: 002ac6d0
[5.420158][T0] ---[ end trace c97cf41616dddbe6 ]---
[5.425483][T0] RIP: 0010:stack_depot_save+0xf4/0x460
[5.430898][T0] Code: 00 00 83 ff 01 0f 84 b3 00 00 00 8b 0d 35 67 39 08 
b8 01 00 00 00 48 d3 e0 48 8b 0d 46 9c 27 11 48 83 e8 01 21 d8 4c 8d 34 c1 <4d> 
8b 2e 4d 85 ed 0f 84 ca 00 00 00 41 8d 74 24 ff 48 c1 e6 03 eb
[5.450464][T0] RSP: :99007c18 EFLAGS: 00010002
[5.456403][T0] RAX: 000558da 
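
The arithmetic behind the quoted 8MB figure and the faulting address can be
checked with a small sketch. It assumes a hash order of 20 and 8-byte pointers,
neither of which the report states. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a table of 2^order bucket pointers of ptr_size bytes. */
static size_t stack_table_bytes(unsigned int order, size_t ptr_size)
{
	return ((size_t)1 << order) * ptr_size;
}

/* Map a hash to a bucket index; the table size is a power of two. */
static size_t stack_bucket_index(unsigned int hash, unsigned int order)
{
	size_t mask = ((size_t)1 << order) - 1;

	assert(order < 32);
	return hash & mask;
}

/* Byte offset of a bucket from the start of the table. */
static size_t stack_bucket_offset(unsigned int hash, unsigned int order,
				  size_t ptr_size)
{
	return stack_bucket_index(hash, order) * ptr_size;
}
```

With order 20 the table takes 2^20 * 8 = 8 MiB. Masking the RBX value caa558da
gives 558da, matching RAX, and 558da * 8 = 2ac6d0, matching R14 and CR2. That
would be consistent with the table base being NULL when stack_depot_save() ran
during kmem_cache_init(), but this is an inference from the register dump, not
something confirmed in the thread.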

Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-12-03 Thread Qian Cai
On Mon, 2020-11-23 at 19:13 +0100, Sebastian Andrzej Siewior wrote:
> On 2020-11-18 09:44:34 [-0500], Qian Cai wrote:
> > On Tue, 2020-11-17 at 19:28 +, Valentin Schneider wrote:
> > > We did have some breakage in that area, but all the holes I was aware of
> > > have been plugged. What would help here is to see which tasks are still
> > > queued on that outgoing CPU, and their recent activity.
> > > 
> > > Something like
> > > - ftrace_dump_on_oops on your kernel cmdline
> > > - trace-cmd start -e 'sched:*'
> > >  
> > > 
> > > ought to do it. Then you can paste the (tail of the) ftrace dump.
> > > 
> > > I also had this laying around, which may or may not be of some help:
> > 
> > Once I have found a reliable reproducer, I'll report back.
> 
> any update?

Hmm, the bug is still there after running a bit longer. Let me apply Valentin's
patch and set up ftrace to try to catch it again.

[ 6152.085915][   T61] kernel BUG at kernel/sched/core.c:7594!
[ 6152.091523][   T61] invalid opcode:  [#1] SMP KASAN PTI
[ 6152.097126][   T61] CPU: 10 PID: 61 Comm: migration/10 Tainted: G  
IO  5.10.0-rc6-next-20201201+ #1
[ 6152.107272][   T61] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 
Gen10, BIOS U34 11/13/2019
[ 6152.116545][   T61] Stopper: multi_cpu_stop+0x0/0x350 <- 0x0
[ 6152.122237][   T61] RIP: 0010:sched_cpu_dying+0x14f/0x180
[ 6152.127667][   T61] Code: 10 00 31 c0 48 83 c4 08 5b 41 5c 41 5d 5d c3 be 08 
00 00 00 48 c7 c7 60 5f 15 a1 e8 1b e5 4d 00 f0 4c 01 25 63 c8 5a 09 eb a3 <0f> 
0b 48 89 34 24 e8 f6 e0 4d 00 48 8b 34 24 e9 1e ff ff ff 48 89
[ 6152.147248][   T61] RSP: 0018:c90006fbfca0 EFLAGS: 00010002
[ 6152.153202][   T61] RAX: 723d RBX: 8887dfab2400 RCX: 
1110fbf56488
[ 6152.161076][   T61] RDX:  RSI: 723d RDI: 
8887dfab2440
[ 6152.168950][   T61] RBP: c90006fbfcc0 R08: fbfff417923d R09: 
fbfff417923d
[ 6152.176824][   T61] R10: a0bc91e7 R11: fbfff417923c R12: 
8887dfab2418
[ 6152.184698][   T61] R13: 0086 R14: 97b03da0 R15: 
0003
[ 6152.192574][   T61] FS:  () GS:8887dfa8() 
knlGS:
[ 6152.201409][   T61] CS:  0010 DS:  ES:  CR0: 80050033
[ 6152.207886][   T61] CR2: 55fed2192f58 CR3: 000cb7e14006 CR4: 
007706e0
[ 6152.215761][   T61] DR0:  DR1:  DR2: 

[ 6152.223636][   T61] DR3:  DR6: fffe0ff0 DR7: 
0400
[ 6152.231509][   T61] PKRU: 5554
[ 6152.234928][   T61] Call Trace:
[ 6152.238086][   T61]  ? x86_pmu_starting_cpu+0x20/0x20
[ 6152.243166][   T61]  ? sched_cpu_wait_empty+0x290/0x290
[ 6152.248422][   T61]  cpuhp_invoke_callback+0x1d8/0x1520
[ 6152.253677][   T61]  ? x2apic_send_IPI_mask+0x10/0x10
[ 6152.258758][   T61]  ? clear_local_APIC+0x788/0xc10
[ 6152.263663][   T61]  ? cpuhp_invoke_callback+0x1520/0x1520
[ 6152.269178][   T61]  take_cpu_down+0x10f/0x1a0
[ 6152.273646][   T61]  multi_cpu_stop+0x149/0x350
[ 6152.278201][   T61]  ? stop_machine_yield+0x10/0x10
[ 6152.283106][   T61]  cpu_stopper_thread+0x200/0x400
[ 6152.288012][   T61]  ? cpu_stop_create+0x70/0x70
[ 6152.292655][   T61]  smpboot_thread_fn+0x30a/0x770
[ 6152.297472][   T61]  ? smpboot_register_percpu_thread+0x370/0x370
[ 6152.303600][   T61]  ? trace_hardirqs_on+0x1c/0x150
[ 6152.308504][   T61]  ? __kthread_parkme+0xcc/0x1a0
[ 6152.313321][   T61]  ? smpboot_register_percpu_thread+0x370/0x370
[ 6152.319447][   T61]  kthread+0x354/0x420
[ 6152.323390][   T61]  ? kthread_create_on_node+0xc0/0xc0
[ 6152.328645][   T61]  ret_from_fork+0x22/0x30
[ 6152.332938][   T61] Modules linked in: isofs cdrom fuse loop nls_ascii 
nls_cp437 vfat fat kvm_intel kvm irqbypass ses enclosure efivarfs ip_tables 
x_tables sd_mod tg3 nvme firmware_class smartpqi nvme_core libphy 
scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: 
dummy_del_mod]
[ 6152.359732][   T61] ---[ end trace f59b31dec044f746 ]---
[ 6152.365076][   T61] RIP: 0010:sched_cpu_dying+0x14f/0x180
[ 6152.370505][   T61] Code: 10 00 31 c0 48 83 c4 08 5b 41 5c 41 5d 5d c3 be 08 
00 00 00 48 c7 c7 60 5f 15 a1 e8 1b e5 4d 00 f0 4c 01 25 63 c8 5a 09 eb a3 <0f> 
0b 48 89 34 24 e8 f6 e0 4d 00 48 8b 34 24 e9 1e ff ff ff 48 89
[ 6152.390085][   T61] RSP: 0018:c90006fbfca0 EFLAGS: 00010002
[ 6152.396039][   T61] RAX: 723d RBX: 8887dfab2400 RCX: 
1110fbf56488
[ 6152.403914][   T61] RDX:  RSI: 723d RDI: 
8887dfab2440
[ 6152.411789][   T61] RBP: c90006fbfcc0 R08: fbfff417923d R09: 
fbfff417923d
[ 6152.419662][   T61] R10: a0bc91e7 R11: fbfff417923c R12: 
8887dfab2418
[ 6152.427537][   T61] R13: 0086 R14: 97b03da0

Re: [PATCH] mm/swapfile: Do not sleep with a spin lock held

2020-12-02 Thread Qian Cai
On Wed, 2020-12-02 at 15:15 -0800, Andrew Morton wrote:
> On Wed,  2 Dec 2020 10:15:49 -0500 Qian Cai  wrote:
> 
> > We can't call kvfree() with a spin lock held, so defer it.
> > 
> 
> Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
> 
> Do you think it's worth a cc:stable?  IOW, is this known to ever
> produce a warning?

Yes, it did trigger a might_sleep() warning.



Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-12-02 Thread Qian Cai
On Mon, 2020-11-23 at 19:13 +0100, Sebastian Andrzej Siewior wrote:
> On 2020-11-18 09:44:34 [-0500], Qian Cai wrote:
> > On Tue, 2020-11-17 at 19:28 +, Valentin Schneider wrote:
> > > We did have some breakage in that area, but all the holes I was aware of
> > > have been plugged. What would help here is to see which tasks are still
> > > queued on that outgoing CPU, and their recent activity.
> > > 
> > > Something like
> > > - ftrace_dump_on_oops on your kernel cmdline
> > > - trace-cmd start -e 'sched:*'
> > >  
> > > 
> > > ought to do it. Then you can paste the (tail of the) ftrace dump.
> > > 
> > > I also had this laying around, which may or may not be of some help:
> > 
> > Once I have found a reliable reproducer, I'll report back.
> 
> any update?

Just back from a vacation. I have been running the same workload on today's
linux-next for a few hours and it has been good so far. I'll surely report back
if it happens again in our daily runs.



[PATCH] mm/swapfile: Do not sleep with a spin lock held

2020-12-02 Thread Qian Cai
We can't call kvfree() with a spin lock held, so defer it.

Signed-off-by: Qian Cai 
---
 mm/swapfile.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index c4a613688a17..d58361109066 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2867,6 +2867,7 @@ late_initcall(max_swapfiles_check);
 static struct swap_info_struct *alloc_swap_info(void)
 {
struct swap_info_struct *p;
+   struct swap_info_struct *defer = NULL;
unsigned int type;
int i;
 
@@ -2895,7 +2896,7 @@ static struct swap_info_struct *alloc_swap_info(void)
smp_wmb();
WRITE_ONCE(nr_swapfiles, nr_swapfiles + 1);
} else {
-   kvfree(p);
+   defer = p;
p = swap_info[type];
/*
 * Do not memset this entry: a racing procfs swap_next()
@@ -2908,6 +2909,7 @@ static struct swap_info_struct *alloc_swap_info(void)
plist_node_init(&p->avail_lists[i], 0);
p->flags = SWP_USED;
spin_unlock(&swap_lock);
+   kvfree(defer);
spin_lock_init(&p->lock);
spin_lock_init(&p->cont_lock);
 
-- 
2.28.0
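
The deferral pattern used by the patch can be tried out in userspace. The sketch
below is only an analogue under stated assumptions: a pthread mutex stands in
for the spin lock, free() stands in for kvfree(), and struct slot, slots[],
MAX_SLOTS and alloc_slot() are made-up names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_SLOTS 4	/* illustrative table size */

struct slot {
	int used;
};

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;
static struct slot *slots[MAX_SLOTS];
static int nr_slots;

/* Return a slot marked as used, or NULL on allocation failure or a full table. */
static struct slot *alloc_slot(void)
{
	struct slot *p = calloc(1, sizeof(*p));
	struct slot *defer = NULL;	/* released only after the lock is dropped */
	int i;

	if (!p)
		return NULL;

	pthread_mutex_lock(&slot_lock);
	assert(nr_slots <= MAX_SLOTS);
	for (i = 0; i < nr_slots; i++)
		if (!slots[i]->used)
			break;
	if (i >= MAX_SLOTS) {
		/* Table full: drop the lock first, then free. */
		pthread_mutex_unlock(&slot_lock);
		free(p);
		return NULL;
	}
	if (i >= nr_slots) {
		/* No reusable entry: publish the new allocation. */
		slots[i] = p;
		nr_slots++;
	} else {
		/* Reuse the old entry; remember the spare one for later. */
		defer = p;
		p = slots[i];
	}
	p->used = 1;
	pthread_mutex_unlock(&slot_lock);

	free(defer);	/* free(NULL) is a no-op */
	return p;
}
```

Keeping the spare pointer in a local and releasing it after the unlock means no
allocator call happens inside the critical section. The table-full path is
handled the same way.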



Re: [PATCH 0/7] HWPoison: Refactor get page interface

2020-12-02 Thread Qian Cai
On Thu, 2020-11-19 at 11:57 +0100, Oscar Salvador wrote:
> Hi,
> 
> following up on previous fix-ups an refactors, this patchset simplifies
> the get page interface and removes the MF_COUNT_INCREASED trick we have
> for soft offline.

Well, the madvise() EIO is back. I don't understand why we can't test it on a
NUMA system before posting this over and over again.

# git clone https://e.coding.net/cailca/linux/mm
# cd mm; make
# ./ranbug 1 
- start: migrate_huge_offline
- use NUMA nodes 0,3.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 3
madvise: Input/output error

[ 1270.054919][ T7497] Soft offlining pfn 0x1958e00 at process virtual address 
0x7f7d9ca0
[ 1270.067318][ T7497] Soft offlining pfn 0x18d0600 at process virtual address 
0x7f7d9c80
[ 1270.078856][ T7497] Soft offlining pfn 0x1ac800 at process virtual address 
0x7f7d9ca0
[ 1270.091268][ T7497] Soft offlining pfn 0x1e10a00 at process virtual address 
0x7f7d9c80
[ 1270.101946][ T7497] Soft offlining pfn 0x18c800 at process virtual address 
0x7f7d9ca0
[ 1270.111678][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, 
page count 2, type bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.126133][ T7497] Soft offlining pfn 0x18b5400 at process virtual address 
0x7f7d9c80
[ 1270.136581][ T7497] Soft offlining pfn 0x211c00 at process virtual address 
0x7f7d9ca0
[ 1270.146214][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, 
page count 2, type bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.160624][ T7497] Soft offlining pfn 0x19bee00 at process virtual address 
0x7f7d9c80
[ 1270.170896][ T7497] Soft offlining pfn 0x1e21a00 at process virtual address 
0x7f7d9ca0
[ 1270.185011][ T7497] Soft offlining pfn 0x1fd1200 at process virtual address 
0x7f7d9c80
[ 1270.195341][ T7497] Soft offlining pfn 0x1882400 at process virtual address 
0x7f7d9ca0
[ 1270.480593][ T7497] Soft offlining pfn 0x18bc000 at process virtual address 
0x7f7d9c80
[ 1270.491961][ T7497] soft offline: 0x18bc000: hugepage isolation failed: 0, 
page count 2, type 3bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.506018][ T7497] Soft offlining pfn 0x1e76a00 at process virtual address 
0x7f7d9c80
[ 1270.590266][ T7497] Soft offlining pfn 0x1b3c00 at process virtual address 
0x7f7d9ca0
[ 1270.600207][ T7497] soft offline: 0x1b3c00: hugepage isolation failed: 0, 
page count 2, type bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.614316][ T7497] Soft offlining pfn 0x1882600 at process virtual address 
0x7f7d9c80
[ 1270.662427][ T7497] Soft offlining pfn 0x1b3c00 at process virtual address 
0x7f7d9ca0
[ 1270.744249][ T7497] Soft offlining pfn 0x18bc000 at process virtual address 
0x7f7d9c80
[ 1270.754314][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 
0x7f7d9ca0
[ 1270.765204][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, 
page count 2, type 3bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.816653][ T7497] Soft offlining pfn 0x18d0400 at process virtual address 
0x7f7d9c80
[ 1270.827049][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 
0x7f7d9ca0
[ 1270.837997][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, 
page count 2, type 3bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.852156][ T7497] Soft offlining pfn 0x186ca00 at process virtual address 
0x7f7d9c80
[ 1270.862350][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 
0x7f7d9ca0
[ 1270.872922][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, 
page count 2, type 3bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.887133][ T7497] Soft offlining pfn 0x18ac200 at process virtual address 
0x7f7d9c80
[ 1270.897450][ T7497] Soft offlining pfn 0x211c00 at process virtual address 
0x7f7d9ca0
[ 1270.907416][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, 
page count 2, type bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.921365][ T7497] Soft offlining pfn 0x1e1cc00 at process virtual address 
0x7f7d9c80
[ 1270.931700][ T7497] Soft offlining pfn 0x18c800 at process virtual address 
0x7f7d9ca0
[ 1270.941580][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, 
page count 2, type bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.955649][ T7497] Soft offlining pfn 0x1e6ae00 at process virtual address 
0x7f7d9c80
[ 1270.966063][ T7497] Soft offlining pfn 0x211c00 at process virtual address 
0x7f7d9ca0
[ 1270.975965][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, 
page count 2, type bfffc1000e (referenced|uptodate|dirty|head)
[ 1270.990059][ T7497] Soft offlining pfn 0x1e72e00 at process virtual address 
0x7f7d9c80
[ 1271.000323][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 
0x7f7d9ca0
[ 1271.011006][ T7497] soft offline: 0x18d1200: hugepage isolation 

Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-11-18 Thread Qian Cai
On Tue, 2020-11-17 at 19:28 +, Valentin Schneider wrote:
> We did have some breakage in that area, but all the holes I was aware of
> have been plugged. What would help here is to see which tasks are still
> queued on that outgoing CPU, and their recent activity.
> 
> Something like
> - ftrace_dump_on_oops on your kernel cmdline
> - trace-cmd start -e 'sched:*'
>  
> 
> ought to do it. Then you can paste the (tail of the) ftrace dump.
> 
> I also had this laying around, which may or may not be of some help:

Once I have found a reliable reproducer, I'll report back.



Re: [PATCH v4 11/19] sched/core: Make migrate disable and CPU hotplug cooperative

2020-11-13 Thread Qian Cai
On Fri, 2020-10-23 at 12:12 +0200, Peter Zijlstra wrote:
> From: Thomas Gleixner 
> 
> On CPU unplug tasks which are in a migrate disabled region cannot be pushed
> to a different CPU until they returned to migrateable state.
> 
> Account the number of tasks on a runqueue which are in a migrate disabled
> section and make the hotplug wait mechanism respect that.
> 
> Signed-off-by: Thomas Gleixner 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/core.c  |   36 ++--
>  kernel/sched/sched.h |4 
>  2 files changed, 34 insertions(+), 6 deletions(-)
> 
[] 
> @@ -7310,7 +7334,7 @@ int sched_cpu_dying(unsigned int cpu)
>   sched_tick_stop(cpu);
>  
>   rq_lock_irqsave(rq, &rf);
> - BUG_ON(rq->nr_running != 1);
> + BUG_ON(rq->nr_running != 1 || rq_has_pinned_tasks(rq));

CPU hotplug is now triggering this. This is with Valentin's affine_move_task()
fix on top:

https://lore.kernel.org/lkml/20201113112414.2569-1-valentin.schnei...@arm.com/

[  809.412232][  T428] kernel BUG at kernel/sched/core.c:7547!
[  809.417841][  T428] invalid opcode:  [#1] SMP KASAN PTI
[  809.423445][  T428] CPU: 72 PID: 428 Comm: migration/72 Tainted: G  
I   5.10.0-rc3-next-20201113+ #1
[  809.433678][  T428] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 
Gen10, BIOS U34 11/13/2019
[  809.442951][  T428] Stopper: multi_cpu_stop+0x0/0x350 <- 0x0
[  809.448643][  T428] RIP: 0010:sched_cpu_dying+0x10f/0x130
[  809.454071][  T428] Code: 10 00 31 c0 48 83 c4 08 5b 41 5c 41 5d 5d c3 be 08 
00 00 00 48 c7 c7 60 3f b5 96 e8 ab 81 4d 00 f0 4c 01 25 73 c4 5a 09 eb a3 <0f> 
0b 48 89 34 24 e8 86 7d 4d 00 48 8b 34 24 e9 5d ff ff ff e8 88
[  809.473650][  T428] RSP: 0018:c9000889fca0 EFLAGS: 00010002
[  809.479606][  T428] RAX:  RBX: 8887dfcb23c0 RCX: 
8e057e0d
[  809.487482][  T428] RDX: 1110fbf96480 RSI: 7c11 RDI: 
8887dfcb2400
[  809.495355][  T428] RBP: c9000889fcc0 R08: fbfff2cb8e96 R09: 
fbfff2cb8e96
[  809.503229][  T428] R10: 965c74af R11: fbfff2cb8e95 R12: 
8887dfcb23d8
[  809.511103][  T428] R13: 0086 R14: 8d5038e0 R15: 
0003
[  809.518979][  T428] FS:  () GS:8887dfc8() 
knlGS:
[  809.527815][  T428] CS:  0010 DS:  ES:  CR0: 80050033
[  809.534291][  T428] CR2: 7fea4cdf899c CR3: 0018c7414002 CR4: 
007706e0
[  809.542165][  T428] DR0:  DR1:  DR2: 

[  809.550040][  T428] DR3:  DR6: fffe0ff0 DR7: 
0400
[  809.557913][  T428] PKRU: 5554
[  809.561332][  T428] Call Trace:
[  809.564489][  T428]  ? x86_pmu_starting_cpu+0x20/0x20
[  809.569570][  T428]  ? sched_cpu_wait_empty+0x220/0x220
[  809.574826][  T428]  cpuhp_invoke_callback+0x1d8/0x1520
[  809.580082][  T428]  ? x2apic_send_IPI_mask+0x10/0x10
[  809.585161][  T428]  ? clear_local_APIC+0x788/0xc10
[  809.590068][  T428]  ? cpuhp_invoke_callback+0x1520/0x1520
[  809.595584][  T428]  take_cpu_down+0x10f/0x1a0
[  809.600053][  T428]  multi_cpu_stop+0x149/0x350
[  809.604607][  T428]  ? stop_machine_yield+0x10/0x10
[  809.609511][  T428]  cpu_stopper_thread+0x200/0x400
[  809.614416][  T428]  ? cpu_stop_create+0x70/0x70
[  809.619059][  T428]  smpboot_thread_fn+0x30a/0x770
[  809.623878][  T428]  ? smpboot_register_percpu_thread+0x370/0x370
[  809.630005][  T428]  ? trace_hardirqs_on+0x1c/0x150
[  809.634910][  T428]  ? __kthread_parkme+0xcc/0x1a0
[  809.639729][  T428]  ? smpboot_register_percpu_thread+0x370/0x370
[  809.645855][  T428]  kthread+0x352/0x420
[  809.649798][  T428]  ? kthread_create_on_node+0xc0/0xc0
[  809.655052][  T428]  ret_from_fork+0x22/0x30
[  809.659345][  T428] Modules linked in: nls_ascii nls_cp437 vfat fat 
kvm_intel kvm ses enclosure irqbypass efivarfs ip_tables x_tables sd_mod nvme 
tg3 firmware_class smartpqi nvme_core scsi_transport_sas libphy dm_mirror 
dm_region_hash dm_log dm_mod
[  809.681502][  T428] ---[ end trace 416318a3e677bf17 ]---
[  809.686844][  T428] RIP: 0010:sched_cpu_dying+0x10f/0x130
[  809.692273][  T428] Code: 10 00 31 c0 48 83 c4 08 5b 41 5c 41 5d 5d c3 be 08 
00 00 00 48 c7 c7 60 3f b5 96 e8 ab 81 4d 00 f0 4c 01 25 73 c4 5a 09 eb a3 <0f> 
0b 48 89 34 24 e8 86 7d 4d 00 48 8b 34 24 e9 5d ff ff ff e8 88
[  809.711853][  T428] RSP: 0018:c9000889fca0 EFLAGS: 00010002
[  809.717807][  T428] RAX:  RBX: 8887dfcb23c0 RCX: 
8e057e0d
[  809.725681][  T428] RDX: 1110fbf96480 RSI: 7c11 RDI: 
8887dfcb2400
[  809.733556][  T428] RBP: c9000889fcc0 R08: fbfff2cb8e96 R09: 
fbfff2cb8e96
[  809.741432][  T428] R10: 965c74af R11: fbfff2cb8e95 R12: 
8887dfcb23d8
[  809.749307][  T428] R13: 0086 R14: 8d5038e0 R15: 
0003
[  809.757182][  T428] FS:  () 
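
As a reading aid for the quoted BUG_ON(), here is a toy model of the condition
it guards. It assumes the only task expected on a dying CPU's runqueue is the
stopper thread, and that pinned tasks are tracked with a simple counter. Apart
from rq_has_pinned_tasks(), nr_running and struct rq, which appear in the quoted
diff, the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy runqueue holding only the two counters the check looks at. */
struct rq {
	unsigned int nr_running;	/* runnable tasks, stopper included */
	unsigned int nr_pinned;		/* tasks in a migrate-disabled section */
};

/* A task on this runqueue enters a migrate-disabled section. */
static void migrate_disable_on(struct rq *rq)
{
	rq->nr_pinned++;
}

/* The same task leaves the section and becomes migratable again. */
static void migrate_enable_on(struct rq *rq)
{
	assert(rq->nr_pinned > 0);
	rq->nr_pinned--;
}

static bool rq_has_pinned_tasks(const struct rq *rq)
{
	return rq->nr_pinned != 0;
}

/* Negation of the BUG_ON() condition: true when the CPU may safely die. */
static bool cpu_can_die(const struct rq *rq)
{
	return rq->nr_running == 1 && !rq_has_pinned_tasks(rq);
}
```

In this model the splat above corresponds to cpu_can_die() returning false. The
trace alone does not say whether that was caused by an extra runnable task or by
a pinned one.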

Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote:
> a) Do you also get this on CONFIG_PREEMPT=y?

This also happens with:

CONFIG_PREEMPT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_DEBUG_PREEMPT=y
CONFIG_PREEMPTIRQ_TRACEPOINTS=y

[ 1235.044945][  T330] INFO: task trinity-c4:60050 blocked for more than 245 
seconds.
[ 1235.052540][  T330]   Not tainted 5.10.0-rc3-next-20201112+ #2
[ 1235.058774][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 1235.067392][  T330] task:trinity-c4  state:D stack:26880 pid:60050 ppid: 
 1722 flags:0x4000
[ 1235.076505][  T330] Call Trace:
[ 1235.079680][ T330] __schedule (kernel/sched/core.c:4272 
kernel/sched/core.c:5019) 
[ 1235.083971][ T330] ? __sched_text_start (kernel/sched/core.c:4901) 
[ 1235.088721][ T330] schedule (kernel/sched/core.c:5099 (discriminator 1)) 
[ 1235.092661][ T330] schedule_timeout (kernel/time/timer.c:1848) 
[ 1235.097399][ T330] ? usleep_range (kernel/time/timer.c:1833) 
[ 1235.101945][ T330] ? wait_for_completion (kernel/sched/completion.c:85 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[ 1235.107156][ T330] ? lock_downgrade (kernel/locking/lockdep.c:5443) 
[ 1235.111883][ T330] ? rcu_read_unlock (./include/linux/rcupdate.h:692 
(discriminator 5)) 
[ 1235.116561][ T330] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 
./include/asm-generic/atomic-instrumented.h:707 
./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) 
[ 1235.121459][ T330] ? _raw_spin_unlock_irq 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) 
[ 1235.126601][ T330] wait_for_completion (kernel/sched/completion.c:86 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[ 1235.131591][ T330] ? wait_for_completion_interruptible 
(kernel/sched/completion.c:137) 
[ 1235.138013][ T330] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[ 1235.143698][ T330] affine_move_task (./include/linux/instrumented.h:101 
./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 
./include/linux/refcount.h:315 ./include/linux/refcount.h:333 
kernel/sched/core.c:2263) 
[ 1235.148451][ T330] ? move_queued_task (kernel/sched/core.c:2151) 
[ 1235.153351][ T330] ? update_curr (kernel/sched/sched.h:1176 
kernel/sched/fair.c:845) 
[ 1235.157848][ T330] ? enqueue_entity (kernel/sched/fair.c:4247) 
[ 1235.162658][ T330] ? set_next_task_fair 
(./arch/x86/include/asm/jump_label.h:25 (discriminator 2) 
./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 
(discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) 
kernel/sched/fair.c:10953 (discriminator 2)) 
[ 1235.167667][ T330] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
[ 1235.172905][ T330] ? affine_move_task (kernel/sched/core.c:2287) 
[ 1235.177826][ T330] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[ 1235.183501][ T330] sched_setaffinity (kernel/sched/core.c:6460) 
[ 1235.188345][ T330] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) 
[ 1235.193937][ T330] ? _copy_from_user (./arch/x86/include/asm/uaccess_64.h:46 
./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) 
[ 1235.198605][ T330] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 
kernel/sched/core.c:6500 kernel/sched/core.c:6500) 
[ 1235.204291][ T330] ? sched_setaffinity (kernel/sched/core.c:6500) 
[ 1235.209324][ T330] ? syscall_enter_from_user_mode 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
kernel/entry/common.c:98) 
[ 1235.215133][ T330] do_syscall_64 (arch/x86/entry/common.c:46) 
[ 1235.219431][ T330] entry_SYSCALL_64_after_hwframe 
(arch/x86/entry/entry_64.S:127) 
[ 1235.225251][  T330] RIP: 0033:0x7fb102b1178d

> b) Could you try the below?

It has been running well so far on multiple systems. I'll keep it running and
report back if it happens again.



Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 19:31 +, Valentin Schneider wrote:
> One thing I don't get: that trace shows refcount_dec_and_test()
> (kernel/sched/core.c:2263) happening before the wait_for_completion(). It's
> not the case in the below trace.

Yes, that is normal. Sometimes the decoding is a bit off. I am not sure why, but
it may be that some debugging options like KASAN obscure it.

> a) Do you also get this on CONFIG_PREEMPT=y?

I don't know. None of the systems here has that, but I could probably try.

> b) Could you try the below?

Let me run it and report.

> 
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02076e6d3792..fad0a8e62aca 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1923,7 +1923,7 @@ static int migration_cpu_stop(void *data)
>   else
>   p->wake_cpu = dest_cpu;
>  
> - } else if (dest_cpu < 0) {
> + } else if (dest_cpu < 0 || pending) {
>   /*
>* This happens when we get migrated between migrate_enable()'s
>* preempt_enable() and scheduling the stopper task. At that
> @@ -1933,6 +1933,17 @@ static int migration_cpu_stop(void *data)
>* more likely.
>*/
>  
> + /*
> +  * The task moved before the stopper got to run. We're holding
> +  * ->pi_lock, so the allowed mask is stable - if it got
> +  * somewhere allowed, we're done.
> +  */
> + if (pending && cpumask_test_cpu(task_cpu(p), p->cpus_ptr)) {
> + p->migration_pending = NULL;
> + complete = true;
> + goto out;
> + }
> +
>   /*
>* When this was migrate_enable() but we no longer have an
>* @pending, a concurrent SCA 'fixed' things and we should be
> 



Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote:
> On 12/11/20 16:38, Qian Cai wrote:
> > Some syscall fuzzing from an unprivileged user starts to trigger this below
> > since this commit first appeared in the linux-next today. Does it ring any
> > bells?

X86 in a KVM guest as well.

guest .config: 
https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

To reproduce:

# /usr/libexec/qemu-kvm -name kata -cpu host -smp 48 -m 48g \
    -hda rhel-8.3-x86_64-kvm.img.qcow2 -cdrom kata.iso \
    -nic user,hostfwd=tcp::-:22 -nographic

== inside the guest ==
# git clone https://e.coding.net/cailca/linux/mm
# cd mm; make
# ./random -x 0-100 -f

[17213.432777][ T348] INFO: task trinity-c7:216885 can't die for more than 122 
seconds.
[17213.434895][ T348] task:trinity-c7  state:D stack:27088 pid:216885 
ppid:103237 flags:0x4004
[17213.437297][ T348] Call Trace:
[17213.438142][ T348] __schedule (kernel/sched/core.c:4272 
kernel/sched/core.c:5019) 
[17213.439256][ T348] ? __sched_text_start (kernel/sched/core.c:4901) 
[17213.440477][ T348] schedule (./arch/x86/include/asm/current.h:15 
(discriminator 1) ./include/linux/sched.h:1892 (discriminator 1) 
kernel/sched/core.c:5100 (discriminator 1)) 
[17213.441501][ T348] schedule_timeout (kernel/time/timer.c:1848) 
[17213.442834][ T348] ? usleep_range (kernel/time/timer.c:1833) 
[17213.444070][ T348] ? wait_for_completion (kernel/sched/completion.c:85 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[17213.445457][ T348] ? lock_downgrade (kernel/locking/lockdep.c:5443) 
[17213.446695][ T348] ? rcu_read_unlock (./include/linux/rcupdate.h:692 
(discriminator 5)) 
[17213.447911][ T348] ? do_raw_spin_lock (./arch/x86/include/asm/atomic.h:202 
./include/asm-generic/atomic-instrumented.h:707 
./include/asm-generic/qspinlock.h:82 kernel/locking/spinlock_debug.c:113) 
[17213.449190][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.450714][ T348] ? _raw_spin_unlock_irq 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
./include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199) 
[17213.452042][ T348] wait_for_completion (kernel/sched/completion.c:86 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[17213.453468][ T348] ? wait_for_completion_interruptible 
(kernel/sched/completion.c:137) 
[17213.455152][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.456651][ T348] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[17213.458115][ T348] affine_move_task (./include/linux/instrumented.h:101 
./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 
./include/linux/refcount.h:315 ./include/linux/refcount.h:333 
kernel/sched/core.c:2263) 
[17213.459313][ T348] ? move_queued_task (kernel/sched/core.c:2151) 
[17213.460553][ T348] ? update_curr (kernel/sched/sched.h:1176 
kernel/sched/fair.c:845) 
[17213.461684][ T348] ? enqueue_entity (kernel/sched/fair.c:4247) 
[17213.463001][ T348] ? set_next_task_fair 
(./arch/x86/include/asm/jump_label.h:25 (discriminator 2) 
./include/linux/jump_label.h:200 (discriminator 2) kernel/sched/fair.c:4567 
(discriminator 2) kernel/sched/fair.c:4683 (discriminator 2) 
kernel/sched/fair.c:10953 (discriminator 2)) 
[17213.464294][ T348] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
[17213.465668][ T348] ? affine_move_task (kernel/sched/core.c:2287) 
[17213.466952][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.468452][ T348] ? _raw_spin_unlock_irqrestore 
(./include/linux/spinlock_api_smp.h:160 kernel/locking/spinlock.c:191) 
[17213.469908][ T348] sched_setaffinity (kernel/sched/core.c:6460) 
[17213.471127][ T348] ? __ia32_sys_sched_getattr (kernel/sched/core.c:6393) 
[17213.472644][ T348] ? _copy_from_user (./arch/x86/include/asm/uaccess_64.h:46 
./arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16) 
[17213.473850][ T348] __x64_sys_sched_setaffinity (kernel/sched/core.c:6511 
kernel/sched/core.c:6500 kernel/sched/core.c:6500) 
[17213.475307][ T348] ? sched_setaffinity (kernel/sched/core.c:6500) 
[17213.476542][ T348] ? lockdep_hardirqs_on_prepare 
(kernel/locking/lockdep.c:4036 kernel/locking/lockdep.c:4096 
kernel/locking/lockdep.c:4048) 
[17213.477991][ T348] ? syscall_enter_from_user_mode 
(./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 
kernel/entry/common.c:98) 
[17213.479428][ T348] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 
(discriminator 22)) 
[17213.480642][ T348] do_syscall_64 (arch/x86/entry/common.c:46) 
[17213.481706][ T348] entry_SYSCALL_64_after_hwframe 
(arch/x86/entry/entry_64.

Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Thu, 2020-11-12 at 17:26 +, Valentin Schneider wrote:
> On 12/11/20 16:38, Qian Cai wrote:
> > Some syscall fuzzing from an unprivileged user starts to trigger this below
> > since this commit first appeared in the linux-next today. Does it ring any
> > bells?
> > 
> 
> What's the .config? I'm interested in
> CONFIG_PREEMPT
> CONFIG_PREEMPT_RT
> CONFIG_SMP

https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

# CONFIG_PREEMPT is not set
CONFIG_SMP=y

Also, I have been able to reproduce this on powerpc as well just now.

> 
> From a quick look it seems that tree doesn't have Thomas' "generalization" of
> migrate_disable(), so if this doesn't have PREEMPT_RT we could forget about
> migrate_disable() for now.
> 
> > [12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than
> > 368 seconds.
> > [12065.073524][ T1310]   Tainted: G L5.10.0-rc3-next-
> > 20201112 #2
> > [12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730
> > ppid: 82688 flags:0x
> > [12065.098818][ T1310] Call trace:
> > [12065.101987][ T1310]  __switch_to+0xf0/0x1a8
> > [12065.106227][ T1310]  __schedule+0x6ec/0x1708
> > [12065.110505][ T1310]  schedule+0x1bc/0x3b0
> > [12065.114562][ T1310]  schedule_timeout+0x3c4/0x4c0
> > [12065.119275][ T1310]  wait_for_completion+0x13c/0x248
> > [12065.124257][ T1310]  affine_move_task+0x410/0x688
> > (inlined by) affine_move_task at kernel/sched/core.c:2261
> > [12065.129013][ T1310]  __set_cpus_allowed_ptr+0x1b4/0x370
> > [12065.134248][ T1310]  sched_setaffinity+0x4f0/0x7e8
> > [12065.139088][ T1310]  __arm64_sys_sched_setaffinity+0x1f4/0x2a0
> > [12065.144972][ T1310]  do_el0_svc+0x124/0x228
> > [12065.149165][ T1310]  el0_sync_handler+0x208/0x384
> > [12065.153876][ T1310]  el0_sync+0x140/0x180
> > [12065.157971][ T1310]
> 
> So that's a task changing the affinity of some task (either itself or
> another; I can't say without a decoded stacktrace), and then blocking on a
> wait_for_completion() that apparently never happens.
> 
> I don't see stop_one_cpu() in the trace, so I assume it's the !task_running
> case, for which the completion should be completed before getting to the
> wait (unless we *do* have migrate_disable()).
> 
> Could you please run scripts/decode_stacktrace.sh on the above?

[12065.101987][ T1310] __switch_to (arch/arm64/kernel/process.c:580) 
[12065.106227][ T1310] __schedule (kernel/sched/core.c:4272 
kernel/sched/core.c:5019) 
[12065.110505][ T1310] schedule (./arch/arm64/include/asm/current.h:19 
(discriminator 1) ./arch/arm64/include/asm/preempt.h:53 (discriminator 1) 
kernel/sched/core.c:5099 (discriminator 1)) 
[12065.114562][ T1310] schedule_timeout (kernel/time/timer.c:1848) 
[12065.119275][ T1310] wait_for_completion (kernel/sched/completion.c:85 
kernel/sched/completion.c:106 kernel/sched/completion.c:117 
kernel/sched/completion.c:138) 
[12065.124257][ T1310] affine_move_task (./include/linux/instrumented.h:101 
./include/asm-generic/atomic-instrumented.h:220 ./include/linux/refcount.h:272 
./include/linux/refcount.h:315 ./include/linux/refcount.h:333 
kernel/sched/core.c:2263) 
[12065.129013][ T1310] __set_cpus_allowed_ptr (kernel/sched/core.c:2353) 
[12065.134248][ T1310] sched_setaffinity (kernel/sched/core.c:6460) 
[12065.139088][ T1310] __arm64_sys_sched_setaffinity (kernel/sched/core.c:6511 
kernel/sched/core.c:6500 kernel/sched/core.c:6500) 
[12065.144972][ T1310] do_el0_svc (arch/arm64/kernel/syscall.c:36 
arch/arm64/kernel/syscall.c:48 arch/arm64/kernel/syscall.c:159 
arch/arm64/kernel/syscall.c:205) 
[12065.149165][ T1310] el0_sync_handler (arch/arm64/kernel/entry-common.c:236 
arch/arm64/kernel/entry-common.c:254) 
[12065.153876][ T1310] el0_sync (arch/arm64/kernel/entry.S:741)

== powerpc ==
[18060.020301][ T676] [c000200014227670] [c0a6d1e8] 
__func__.5350+0x1220e0/0x181338 unreliable 
[18060.020333][ T676] [c000200014227850] [c001a278] __switch_to 
(arch/powerpc/kernel/process.c:1273) 
[18060.020351][ T676] [c0002000142278c0] [c08f3e94] __schedule 
(kernel/sched/core.c:4269 kernel/sched/core.c:5019) 
[18060.020377][ T676] [c000200014227990] [c08f4638] schedule 
(./include/asm-generic/preempt.h:59 (discriminator 1) kernel/sched/core.c:5099 
(discriminator 1)) 
[18060.020394][ T676] [c0002000142279c0] [c08fbd34] schedule_timeout 
(kernel/time/timer.c:1847) 
[18060.020420][ T676] [c000200014227ac0] [c08f6398] wait_for_completion 
(kernel/sched/completion.c:85 kernel/sched/completion.c:106 
kernel/sched/completion.c:117 kernel/sched/completion.c:138) 
[18060.02

Re: [PATCH v4 10/19] sched: Fix migrate_disable() vs set_cpus_allowed_ptr()

2020-11-12 Thread Qian Cai
On Fri, 2020-10-23 at 12:12 +0200, Peter Zijlstra wrote:
> Concurrent migrate_disable() and set_cpus_allowed_ptr() has
> interesting features. We rely on set_cpus_allowed_ptr() to not return
> until the task runs inside the provided mask. This expectation is
> exported to userspace.
> 
> This means that any set_cpus_allowed_ptr() caller must wait until
> migrate_enable() allows migrations.
> 
> At the same time, we don't want migrate_enable() to schedule, due to
> patterns like:
> 
>   preempt_disable();
>   migrate_disable();
>   ...
>   migrate_enable();
>   preempt_enable();
> 
> And:
> 
>   raw_spin_lock();
>   spin_unlock();
> 
> this means that when migrate_enable() must restore the affinity
> mask, it cannot wait for completion thereof. Luck will have it that
> that is exactly the case where there is a pending
> set_cpus_allowed_ptr(), so let that provide storage for the async stop
> machine.
> 
> Much thanks to Valentin who used TLA+ most effective and found lots of
> 'interesting' cases.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/sched.h |1 
>  kernel/sched/core.c   |  234 +++-
> --
>  2 files changed, 205 insertions(+), 30 deletions(-)

Some syscall fuzzing from an unprivileged user starts to trigger this below
since this commit first appeared in the linux-next today. Does it ring any
bells?

[12065.065837][ T1310] INFO: task trinity-c30:91730 blocked for more than 368 
seconds.
[12065.073524][ T1310]   Tainted: G L
5.10.0-rc3-next-20201112 #2
[12065.081076][ T1310] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[12065.089648][ T1310] task:trinity-c30 state:D stack:26576 pid:91730 ppid: 
82688 flags:0x
[12065.098818][ T1310] Call trace:
[12065.101987][ T1310]  __switch_to+0xf0/0x1a8
[12065.106227][ T1310]  __schedule+0x6ec/0x1708
[12065.110505][ T1310]  schedule+0x1bc/0x3b0
[12065.114562][ T1310]  schedule_timeout+0x3c4/0x4c0
[12065.119275][ T1310]  wait_for_completion+0x13c/0x248
[12065.124257][ T1310]  affine_move_task+0x410/0x688
(inlined by) affine_move_task at kernel/sched/core.c:2261
[12065.129013][ T1310]  __set_cpus_allowed_ptr+0x1b4/0x370
[12065.134248][ T1310]  sched_setaffinity+0x4f0/0x7e8
[12065.139088][ T1310]  __arm64_sys_sched_setaffinity+0x1f4/0x2a0
[12065.144972][ T1310]  do_el0_svc+0x124/0x228
[12065.149165][ T1310]  el0_sync_handler+0x208/0x384
[12065.153876][ T1310]  el0_sync+0x140/0x180
[12065.157971][ T1310] 
[12065.157971][ T1310] Showing all locks held in the system:
[12065.166401][ T1310] 1 lock held by khungtaskd/1310:
[12065.171288][ T1310]  #0: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
rcu_lock_acquire.constprop.56+0x0/0x38
[12065.182210][ T1310] 4 locks held by trinity-main/82688:
[12065.187515][ T1310] 2 locks held by kworker/u513:3/82813:
[12065.192922][ T1310]  #0: 00419d38 
((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x69c/0x18c8
[12065.203890][ T1310]  #1: 122bfd40 
((work_completion)(>work)){+.+.}-{0:0}, at: __update_idle_core+0xa8/0x460
[12065.214916][ T1310] 1 lock held by trinity-c35/137168:
[12065.220061][ T1310]  #0: 0087ce767898 (>ldisc_sem){}-{0:0}, at: 
ldsem_down_read+0x3c/0x48
[12065.229483][ T1310] 3 locks held by trinity-c61/137611:
[12065.234757][ T1310] 1 lock held by trinity-c7/137630:
[12065.239828][ T1310] 1 lock held by trinity-c57/137714:
[12065.242612][T137611] futex_wake_op: trinity-c61 tries to shift op by 1008; 
fix this program
[12065.245012][ T1310] 1 lock held by trinity-c52/137771:
[12065.258538][ T1310] 2 locks held by trinity-c42/137835:
[12065.263783][ T1310] 4 locks held by trinity-c22/137868:
[12065.269051][ T1310]  #0: 000e78503798 (>lock){-.-.}-{2:2}, at: 
newidle_balance+0x92c/0xd78
[12065.278155][ T1310]  #1: 0087ce767930 
(>atomic_write_lock){+.+.}-{3:3}, at: tty_write_lock+0x30/0x58
[12065.288317][ T1310]  #2: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
__mutex_lock+0x24c/0x1310
[12065.297592][ T1310]  #3: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
lock_page_memcg+0x98/0x240
[12065.307026][ T1310] 2 locks held by trinity-c34/137896:
[12065.312266][ T1310]  #0: 000e78463798 (>lock){-.-.}-{2:2}, at: 
__schedule+0x22c/0x1708
[12065.321023][ T1310]  #1: 800018d0cb40 (rcu_read_lock){}-{1:2}, at: 
__update_idle_core+0xa8/0x460
[12065.330663][ T1310] 2 locks held by trinity-c43/137909:
[12065.335996][ T1310] 1 lock held by trinity-c24/137910:
[12065.341164][ T1310] 1 lock held by trinity-c1/137954:
[12065.346272][ T1310] 1 lock held by trinity-c49/138020:
[12065.351425][ T1310] 1 lock held by trinity-c10/138021:
[12065.356649][ T1310] 1 lock held by trinity-c32/138039:
[12065.361813][ T1310] 4 locks held by trinity-c36/138042:
[12065.367129][ T1310] 2 locks held by trinity-c14/138061:
[12065.372378][ T1310] 2 locks held by trinity-c38/138070:
[12065.377688][ T1310] 1 lock held by 

Re: linux-next boot error: BUG: unable to handle kernel NULL pointer dereference in mempool_init_node

2020-11-11 Thread Qian Cai
It looks to me like the code paths below were recently modified heavily by this
patchset. If this is reproducible, that can be confirmed by reverting it.

https://lore.kernel.org/linux-arm-kernel/cover.1605046662.git.andreyk...@google.com/

On Tue, 2020-11-10 at 23:45 -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:3e14f70c Add linux-next specific files for 2020
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=12e6af6250
> kernel config:  https://syzkaller.appspot.com/x/.config?x=d6f4c7e100b61b76
> dashboard link: https://syzkaller.appspot.com/bug?extid=2d6f3dad1a42d86a5801
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+2d6f3dad1a42d86a5...@syzkaller.appspotmail.com
> 
> RPC: Registered named UNIX socket transport module.
> RPC: Registered udp transport module.
> RPC: Registered tcp transport module.
> RPC: Registered tcp NFSv4.1 backchannel transport module.
> NET: Registered protocol family 44
> pci_bus :00: resource 4 [io  0x-0x0cf7 window]
> pci_bus :00: resource 5 [io  0x0d00-0x window]
> pci_bus :00: resource 6 [mem 0x000a-0x000b window]
> pci_bus :00: resource 7 [mem 0xc000-0xfebfefff window]
> pci :00:00.0: Limiting direct PCI/PCI transfers
> PCI: CLS 0 bytes, default 64
> PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> software IO TLB: mapped [mem 0xb5e0-0xb9e0] (64MB)
> RAPL PMU: API unit is 2^-32 Joules, 0 fixed counters, 10737418240 ms ovfl
> timer
> kvm: already loaded the other module
> clocksource: tsc: mask: 0x max_cycles: 0x212735223b2,
> max_idle_ns: 440795277976 ns
> clocksource: Switched to clocksource tsc
> Initialise system trusted keyrings
> workingset: timestamp_bits=40 max_order=21 bucket_order=0
> zbud: loaded
> DLM installed
> squashfs: version 4.0 (2009/01/31) Phillip Lougher
> FS-Cache: Netfs 'nfs' registered for caching
> NFS: Registering the id_resolver key type
> Key type id_resolver registered
> Key type id_legacy registered
> nfs4filelayout_init: NFSv4 File Layout Driver Registering...
> Installing knfsd (copyright (C) 1996 o...@monad.swb.de).
> FS-Cache: Netfs 'cifs' registered for caching
> Key type cifs.spnego registered
> Key type cifs.idmap registered
> ntfs: driver 2.1.32 [Flags: R/W].
> efs: 1.0a - http://aeschi.ch.eu.org/efs/
> jffs2: version 2.2. (NAND) (SUMMARY)  © 2001-2006 Red Hat, Inc.
> romfs: ROMFS MTD (C) 2007 Red Hat, Inc.
> QNX4 filesystem 0.2.3 registered.
> qnx6: QNX6 filesystem 1.0.0 registered.
> fuse: init (API version 7.32)
> orangefs_debugfs_init: called with debug mask: :none: :0:
> orangefs_init: module version upstream loaded
> JFS: nTxBlock = 8192, nTxLock = 65536
> SGI XFS with ACLs, security attributes, realtime, quota, no debug enabled
> 9p: Installing v9fs 9p2000 file system support
> FS-Cache: Netfs '9p' registered for caching
> NILFS version 2 loaded
> befs: version: 0.9.3
> ocfs2: Registered cluster interface o2cb
> ocfs2: Registered cluster interface user
> OCFS2 User DLM kernel interface loaded
> gfs2: GFS2 installed
> BUG: kernel NULL pointer dereference, address: 0018
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x) - not-present page
> PGD 0 P4D 0 
> Oops:  [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.10.0-rc3-next-2020-syzkaller
> #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> RIP: 0010:nearest_obj include/linux/slub_def.h:169 [inline]
> RIP: 0010:kasan_slab_free+0x19/0x110 mm/kasan/common.c:350
> Code: 00 48 c7 c0 fb ff ff ff c3 cc cc cc cc cc cc cc cc 41 55 49 89 d5 41 54
> 49 89 fc 48 89 f7 55 48 89 f5 53 89 cb e8 f7 27 7e ff <41> 8b 7c 24 18 48 be
> 00 00 00 00 00 16 00 00 48 c1 e8 0c 48 89 c1
> RSP: :c9c67d30 EFLAGS: 00010293
> RAX: 0001436d RBX:  RCX: 8130a760
> RDX: 888140748000 RSI: 8130a76a RDI: 0007
> RBP: 8881436d R08: 00fe R09: ed10286da800
> R10:  R11:  R12: 
> R13: 81945766 R14: 888143557944 R15: 81943b80
> FS:  () GS:8880b9e0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0018 CR3: 0b08e000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  kasan_slab_free_mempool include/linux/kasan.h:202 [inline]
>  kasan_poison_element mm/mempool.c:107 [inline]
>  add_element mm/mempool.c:124 [inline]
>  mempool_init_node+0x37e/0x580 mm/mempool.c:205
>  mempool_create_node mm/mempool.c:269 [inline]
>  mempool_create+0x76/0xc0 mm/mempool.c:254
>  

Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-11 Thread Qian Cai
On Wed, 2020-11-11 at 17:27 +0800, Ming Lei wrote:
> Can this issue disappear by applying the following change?

This makes the system boot again as well.

> 
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index e32958f0b687..b1fe6176d77f 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -469,9 +469,6 @@ struct blk_flush_queue *blk_alloc_flush_queue(int node,
> int cmd_size,
>   INIT_LIST_HEAD(>flush_queue[1]);
>   INIT_LIST_HEAD(>flush_data_in_flight);
>  
> - lockdep_register_key(>key);
> - lockdep_set_class(>mq_flush_lock, >key);
> -
>   return fq;
>  
>   fail_rq:
> @@ -486,7 +483,6 @@ void blk_free_flush_queue(struct blk_flush_queue *fq)
>   if (!fq)
>   return;
>  
> - lockdep_unregister_key(>key);
>   kfree(fq->flush_rq);
>   kfree(fq);
>  }
> 
> 
> Thanks, 
> Ming



Re: linux-next: build warning after merge of the bpf-next tree

2020-11-11 Thread Qian Cai
On Wed, 2020-11-11 at 12:01 +1100, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the bpf-next tree, today's linux-next build (powerpc
> ppc64_defconfig) produced this warning:
> 
> kernel/bpf/btf.c:4481:20: warning: 'btf_parse_module' defined but not used [-
> Wunused-function]
>  4481 | static struct btf *btf_parse_module(const char *module_name, const
> void *data, unsigned int data_size)
>   |^~~~
> 
> Introduced by commit
> 
>   36e68442d1af ("bpf: Load and verify kernel module BTFs")
> 

It looks like btf_parse_module() is only used when
CONFIG_DEBUG_INFO_BTF_MODULES=y, so this should fix it.

diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 0f1fd2669d69..e877eeebc616 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -4478,6 +4478,7 @@ struct btf *btf_parse_vmlinux(void)
return ERR_PTR(err);
 }
 
+#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
 static struct btf *btf_parse_module(const char *module_name, const void *data, 
unsigned int data_size)
 {
struct btf_verifier_env *env = NULL;
@@ -4546,6 +4547,7 @@ static struct btf *btf_parse_module(const char 
*module_name, const void *data, u
}
return ERR_PTR(err);
 }
+#endif /* CONFIG_DEBUG_INFO_BTF_MODULES */
 
 struct btf *bpf_prog_get_target_btf(const struct bpf_prog *prog)
 {
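The same pattern in a standalone form, in case it helps: a static helper that
is only called under a config option needs the same guard, otherwise
-Wunused-function fires when the option is off. This is only a sketch;
DEMO_HAVE_MODULES, demo_parse_module() and demo_load() are made-up names
standing in for the real config symbol and functions.

```c
#include <assert.h>
#include <stddef.h>

/* DEMO_HAVE_MODULES is a hypothetical stand-in for the real config option. */
#ifdef DEMO_HAVE_MODULES
/* Only compiled when its single caller below is compiled too. */
static int demo_parse_module(const char *name)
{
	return name[0] != '\0' ? 0 : -1;
}
#endif /* DEMO_HAVE_MODULES */

int demo_load(const char *name)
{
	assert(name != NULL);
#ifdef DEMO_HAVE_MODULES
	return demo_parse_module(name);
#else
	/* Without the option there is no caller, so no helper is defined. */
	return -1;
#endif
}
```

Built without -DDEMO_HAVE_MODULES, the helper is not defined at all, so there
is nothing for the compiler to flag as unused.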



Re: [PATCH v3 19/35] x86/io_apic: Cleanup trigger/polarity helpers

2020-11-09 Thread Qian Cai
On Sat, 2020-10-24 at 22:35 +0100, David Woodhouse wrote:
> From: Thomas Gleixner 
> 
> 'trigger' and 'polarity' are used throughout the I/O-APIC code for handling
> the trigger type (edge/level) and the active low/high configuration. While
> there are defines for initializing these variables and struct members, they
> are not used consequently and the meaning of 'trigger' and 'polarity' is
> opaque and confusing at best.
> 
> Rename them to 'is_level' and 'active_low' and make them boolean in various
> structs so it's entirely clear what the meaning is.
> 
> Signed-off-by: Thomas Gleixner 
> Signed-off-by: David Woodhouse 
> ---
>  arch/x86/include/asm/hw_irq.h   |   6 +-
>  arch/x86/kernel/apic/io_apic.c  | 244 +---
>  arch/x86/pci/intel_mid_pci.c|   8 +-
>  drivers/iommu/amd/iommu.c   |  10 +-
>  drivers/iommu/intel/irq_remapping.c |   9 +-
>  5 files changed, 130 insertions(+), 147 deletions(-)

Reverting the rest of the patchset up to this commit on next-20201109 fixed an
endless soft-lockup issue when booting the AMD server below. I noticed that the
failed boots always have this IOMMU IO_PAGE_FAULT before the soft lockups:

[ 3404.093354][T1] AMD-Vi: Interrupt remapping enabled
[ 3404.098593][T1] AMD-Vi: Virtual APIC enabled
[ 3404.107783][  T340] pci :00:14.0: AMD-Vi: Event logged [IO_PAGE_FAULT 
domain=0x address=0xfffdf8020200 flags=0x0008]
[ 3404.120644][T1] AMD-Vi: Lazy IO/TLB flushing enabled
[ 3404.126011][T1] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
[ 3404.133173][T1] software IO TLB: mapped [mem 
0x68dcf000-0x6cdcf000] (64MB)

.config (if ever matters):
https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

good boot dmesg (with the commits reverted):
http://people.redhat.com/qcai/dmesg.txt

== system info ==
Dell Poweredge R6415
AMD EPYC 7401P 24-Core Processor
24576 MB memory, 239 GB disk space

[  OK  ] Started Flush Journal to Persistent Storage.
[  OK  ] Started udev Kernel Device Manager.
[  OK  ] Started udev Coldplug all Devices.
[  OK  ] Started Monitoring of LVM2 mirrors,…sing dmeventd or progress polling.
[  OK  ] Reached target Local File Systems (Pre).
 Mounting /boot...
[  OK  ] Created slice system-lvm2\x2dpvscan.slice.
[ 3740.376500][ T1058] XFS (sda1): Mounting V5 Filesystem
[ 3740.438474][ T1058] XFS (sda1): Ending clean mount
[ 3765.159433][C0] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! 
[systemd:1]
[ 3765.166929][C0] Modules linked in: acpi_cpufreq(+) ip_tables x_tables 
sd_mod ahci libahci tg3 bnxt_en megaraid_sas libata firmware_class libphy 
dm_mirror dm_region_hash dm_log dm_mod
[ 3765.183576][C0] irq event stamp: 26230104
[ 3765.187954][C0] hardirqs last  enabled at (26230103): 
[] asm_common_interrupt+0x1e/0x40
[ 3765.197873][C0] hardirqs last disabled at (26230104): 
[] sysvec_apic_timer_interrupt+0xa/0xa0
[ 3765.208303][C0] softirqs last  enabled at (26202664): 
[] __do_softirq+0x61b/0x95d
[ 3765.217699][C0] softirqs last disabled at (26202591): 
[] asm_call_irq_on_stack+0x12/0x20
[ 3765.227702][C0] CPU: 0 PID: 1 Comm: systemd Not tainted 
5.10.0-rc2-next-20201109+ #2
[ 3765.235793][C0] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 
1.9.3 06/25/2019
[ 3765.244065][C0] RIP: 0010:lock_acquire+0x1f4/0x820
lock_acquire at kernel/locking/lockdep.c:5404
[ 3765.249211][C0] Code: ff ff ff 48 83 c4 20 65 0f c1 05 a7 ba 9e 7e 83 f8 
01 4c 8b 54 24 08 0f 85 60 04 00 00 41 52 9d 48 b8 00 00 00 00 00 fc ff df <48> 
01 c3 c7 03 00 00 00 00 c7 43 08 00 00 00 00 48 8b0
[ 3765.268657][C0] RSP: 0018:c906f9f8 EFLAGS: 0246
[ 3765.274587][C0] RAX: dc00 RBX: 1920df42 RCX: 
1920df28
[ 3765.282420][C0] RDX: 111020645769 RSI:  RDI: 
0001
[ 3765.290256][C0] RBP: 0001 R08: fbfff164cb10 R09: 
fbfff164cb10
[ 3765.298090][C0] R10: 0246 R11: fbfff164cb0f R12: 
88812be555b0
[ 3765.305922][C0] R13:  R14:  R15: 

[ 3765.313750][C0] FS:  7f12bb8c59c0() GS:8881b700() 
knlGS:
[ 3765.322537][C0] CS:  0010 DS:  ES:  CR0: 80050033
[ 3765.328985][C0] CR2: 7f0c2d828fd0 CR3: 00011868a000 CR4: 
003506f0
[ 3765.336820][C0] Call Trace:
[ 3765.339979][C0]  ? rcu_read_unlock+0x40/0x40
[ 3765.344609][C0]  __d_move+0x2a2/0x16f0
__seqprop_spinlock_assert at include/linux/seqlock.h:277
(inlined by) __d_move at fs/dcache.c:2861
[ 3765.348711][C0]  ? d_move+0x47/0x70
[ 3765.352560][C0]  ? _raw_spin_unlock+0x1a/0x30
[ 3765.357275][C0]  d_move+0x47/0x70
write_seqcount_t_end at include/linux/seqlock.h:560
(inlined by) write_sequnlock at include/linux/seqlock.h:901
(inlined by) d_move at fs/dcache.c:2916
[ 3765.360951][C0]  ? vfs_rename+0x9ac/0x1270

Re: [PATCH][next] cpumask: allocate enough space for string and trailing '\0' char

2020-11-09 Thread Qian Cai
On Mon, 2020-11-09 at 13:04 +, Colin King wrote:
> From: Colin Ian King 
> 
> Currently the allocation of cpulist is based on the length of buf but does
> not include the addition end of string '\0' terminator. Static analysis is
> reporting this as a potential out-of-bounds access on cpulist. Fix this by
> allocating enough space for the additional '\0' terminator.
> 
> Addresses-Coverity: ("Out-of-bounds access")
> Fixes: 65987e67f7ff ("cpumask: add "last" alias for cpu list specifications")

Yeah, this bad commit also introduced KASAN errors everywhere, which then
disable lockdep and make our linux-next CI miserable. Confirmed that this
patch fixes it.

> Signed-off-by: Colin Ian King 
> ---
>  lib/cpumask.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index 34ecb3005941..cb8a3ef0e73e 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -185,7 +185,7 @@ int __ref cpulist_parse(const char *buf, struct cpumask
> *dstp)
>  {
>   int r;
>   char *cpulist, last_cpu[5]; /* NR_CPUS <=  */
> - size_t len = strlen(buf);
> + size_t len = strlen(buf) + 1;
>   bool early = !slab_is_available();
>  
>   if (!strcmp(buf, "all")) {
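As a userspace illustration of the off-by-one being fixed here: strlen() does
not count the terminator, so a buffer meant to hold a copy of the string needs
strlen(buf) + 1 bytes. This is a minimal sketch and not the kernel code;
dup_cpulist() is a hypothetical helper and malloc() stands in for whatever
allocator cpulist_parse() uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace helper, not the kernel's cpulist_parse(). */
char *dup_cpulist(const char *buf)
{
	assert(buf != NULL);

	/* strlen() excludes the trailing '\0', so reserve one extra byte. */
	size_t len = strlen(buf) + 1;
	char *copy = malloc(len);

	if (!copy)
		return NULL;

	/* Copies the string and its terminator; stays inside the allocation. */
	memcpy(copy, buf, len);
	return copy;
}
```

With len = strlen(buf) alone, the copy of the terminator would land one byte
past the allocation, which is the kind of access KASAN flags.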



Re: [tip: ras/core] x86/mce: Enable additional error logging on certain Intel CPUs

2020-11-09 Thread Qian Cai
On Mon, 2020-11-02 at 11:18 +, tip-bot2 for Tony Luck wrote:
> The following commit has been merged into the ras/core branch of tip:
> 
> Commit-ID: 68299a42f84288537ee3420c431ac0115ccb90b1
> Gitweb:
> https://git.kernel.org/tip/68299a42f84288537ee3420c431ac0115ccb90b1
> Author:Tony Luck 
> AuthorDate:Fri, 30 Oct 2020 12:04:00 -07:00
> Committer: Borislav Petkov 
> CommitterDate: Mon, 02 Nov 2020 11:15:59 +01:00
> 
> x86/mce: Enable additional error logging on certain Intel CPUs
> 
> The Xeon versions of Sandy Bridge, Ivy Bridge and Haswell support an
> optional additional error logging mode which is enabled by an MSR.
> 
> Previously, this mode was enabled from the mcelog(8) tool via /dev/cpu,
> but userspace should not be poking at MSRs. So move the enabling into
> the kernel.
> 
>  [ bp: Correct the explanation why this is done. ]
> 
> Suggested-by: Boris Petkov 
> Signed-off-by: Tony Luck 
> Signed-off-by: Borislav Petkov 

Booting a simple KVM guest using today's linux-next now generates the errors
below inside the guest due to this patch. Are they expected?

# qemu-kvm -name kata -cpu host -smp 48 -m 48g \
    -hda rhel-8.3-x86_64-kvm.img.qcow2 -cdrom kata.iso \
    -nic user,hostfwd=tcp::-:22 -nographic

guest .config (if ever matters): 
https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

[6.801741][T0] clocksource: tsc-early: mask: 0x 
max_cycles: 0x1e3bca858ab, max_idle_ns: 440795282452 ns
[6.804371][T0] Calibrating delay loop (skipped), value calculated using 
timer frequency.. 4194.90 BogoMIPS (lpj=20974530)
[6.806956][T0] pid_max: default: 49152 minimum: 384
[6.814328][T0] Mount-cache hash table entries: 131072 (order: 8, 
1048576 bytes, linear)
[6.814328][T0] Mountpoint-cache hash table entries: 131072 (order: 8, 
1048576 bytes, linear)
[6.814328][T0] x86/cpu: User Mode Instruction Prevention (UMIP) 
activated
[6.814328][T0] unchecked MSR access error: RDMSR from 0x17f at rIP: 
0x84483f16 (mce_intel_feature_init+0x156/0x270)
[6.814328][T0] Call Trace:
[6.814328][T0]  __mcheck_cpu_init_vendor+0x105/0x250
__rdmsr at arch/x86/include/asm/msr.h:93
(inlined by) native_read_msr at arch/x86/include/asm/msr.h:127
(inlined by) intel_imc_init at arch/x86/kernel/cpu/mce/intel.c:524
(inlined by) mce_intel_feature_init at arch/x86/kernel/cpu/mce/intel.c:537
[6.814328][T0]  mcheck_cpu_init+0x21f/0xb00
[6.814328][T0]  identify_cpu+0xfcb/0x1980
[6.814328][T0]  identify_boot_cpu+0xd/0xb5
[6.814328][T0]  check_bugs+0x6c/0x1606
[6.814328][T0]  ? _raw_spin_unlock+0x1a/0x30
[6.814328][T0]  ? poking_init+0x2b5/0x2ea
[6.814328][T0]  ? l1tf_cmdline+0x11a/0x11a
[6.814328][T0]  ? lockdep_init_map_waits+0x267/0x6f0
[6.814328][T0]  start_kernel+0x372/0x39f
[6.814328][T0]  secondary_startup_64_no_verify+0xc2/0xcb
[6.814328][T0] unchecked MSR access error: WRMSR to 0x17f (tried to 
write 0x0002) at rIP: 0x84483f3a 
(mce_intel_feature_init+0x17a/0x270)
[6.814328][T0] Call Trace:
[6.814328][T0]  __mcheck_cpu_init_vendor+0x105/0x250
[6.814328][T0]  mcheck_cpu_init+0x21f/0xb00
[6.814328][T0]  identify_cpu+0xfcb/0x1980
[6.814328][T0]  identify_boot_cpu+0xd/0xb5
[6.814328][T0]  check_bugs+0x6c/0x1606
[6.814328][T0]  ? _raw_spin_unlock+0x1a/0x30
[6.814328][T0]  ? poking_init+0x2b5/0x2ea
[6.814328][T0]  ? l1tf_cmdline+0x11a/0x11a
[6.814328][T0]  ? lockdep_init_map_waits+0x267/0x6f0
[6.814328][T0]  start_kernel+0x372/0x39f
[6.814328][T0]  secondary_startup_64_no_verify+0xc2/0xcb
[6.814328][T0] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
[6.814328][T0] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0, 1GB 0
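
For reference, a minimal userspace sketch of what the trace above suggests: the
guest is started with -cpu host, so it reports the host's CPU model, but the
hypervisor presumably does not implement MSR 0x17f, and a plain read faults.
A "safe" accessor that reports failure instead of faulting lets the caller skip
the feature. Everything here (struct fake_cpu, read_msr_safe(),
write_msr_safe(), enable_extra_logging()) is hypothetical and only models the
behaviour; it is not the kernel code. The bit value is inferred from the
"tried to write 0x0002" line in the log.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_ERROR_CONTROL 0x17f        /* MSR number taken from the log above */
#define EXTRA_LOGGING_BIT (1ULL << 1)  /* the log shows an attempted write of 0x2 */

/* Hypothetical stand-in for a CPU; a guest may lack the MSR entirely. */
struct fake_cpu {
	bool has_error_control_msr;
	uint64_t error_control;
};

/* Model of a "safe" read: return an error instead of faulting. */
int read_msr_safe(const struct fake_cpu *cpu, uint32_t msr, uint64_t *val)
{
	if (msr != MSR_ERROR_CONTROL || !cpu->has_error_control_msr)
		return -EIO;
	*val = cpu->error_control;
	return 0;
}

/* Model of a "safe" write with the same error convention. */
int write_msr_safe(struct fake_cpu *cpu, uint32_t msr, uint64_t val)
{
	if (msr != MSR_ERROR_CONTROL || !cpu->has_error_control_msr)
		return -EIO;
	cpu->error_control = val;
	return 0;
}

/* Try to enable the extra logging mode; quietly give up if the MSR is absent. */
bool enable_extra_logging(struct fake_cpu *cpu)
{
	uint64_t val;

	if (read_msr_safe(cpu, MSR_ERROR_CONTROL, &val))
		return false;	/* e.g. a guest whose hypervisor lacks the MSR */

	val |= EXTRA_LOGGING_BIT;
	return write_msr_safe(cpu, MSR_ERROR_CONTROL, val) == 0;
}
```

With this shape, a bare-metal CPU ends up with bit 1 set, while a guest without
the MSR just reports that the feature was not enabled.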

== host CPU ==
# lscpu
Architecture:x86_64
CPU op-mode(s):  32-bit, 64-bit
Byte Order:  Little Endian
CPU(s):  48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  12
Socket(s):   4
NUMA node(s):4
Vendor ID:   GenuineIntel
CPU family:  6
Model:   63
Model name:  Intel(R) Xeon(R) CPU E5-4650 v3 @ 2.10GHz
Stepping:2
CPU MHz: 1980.076
BogoMIPS:4195.25
Virtualization:  VT-x
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:30720K
NUMA node0 CPU(s):   0-5,24-29
NUMA node1 CPU(s):   6-11,30-35
NUMA node2 CPU(s):   12-17,36-41
NUMA node3 CPU(s):   18-23,42-47
Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
ssse3 sdbg fma cx16 xtpr 

Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-09 Thread Qian Cai
On Mon, 2020-11-09 at 08:49 +, John Garry wrote:
> On 07/11/2020 00:17, Qian Cai wrote:
> > On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
> > > I am able to hit the boot hang and similar kind of stack traces as
> > > reported by Qian with shared .config on x86 machine.
> > > In my case the system boots after a hang of 40-45 mins. Qian, is it
> > > true for you as well ?
> > I don't know. I have never waited that long.
> > 
> > .
> > 
> 
> Hi Qian,
> 
> By chance, do you have an equivalent arm64 .config, enabling the same RH
> config options?
> 
> I suppose I could try to do this myself also, but an authentic version
> would be nicer.
The closest one I have here is:
https://cailca.coding.net/public/linux/mm/git/files/master/arm64.config

but it only selects the Thunder X2 platform, and you would need to manually
select CONFIG_MEGARAID_SAS=m to start with. Also, none of the arm64 systems
here have megaraid_sas.



Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-06 Thread Qian Cai
On Sat, 2020-11-07 at 00:55 +0530, Sumit Saxena wrote:
> I am able to hit the boot hang and similar kind of stack traces as
> reported by Qian with shared .config on x86 machine.
> In my case the system boots after a hang of 40-45 mins. Qian, is it
> true for you as well ?
I don't know. I have never waited that long.



Re: [PATCH] arm64/smp: Move rcu_cpu_starting() earlier

2020-11-06 Thread Qian Cai
On Fri, 2020-11-06 at 10:37 +, Will Deacon wrote:
> > diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> > index 09c96f57818c..10729d2d6084 100644
> > --- a/arch/arm64/kernel/smp.c
> > +++ b/arch/arm64/kernel/smp.c
> > @@ -421,6 +421,8 @@ void cpu_die_early(void)
> >  
> > update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
> >  
> > +   rcu_report_dead(cpu);
> 
> I think this is in the wrong place, see:
> 
> https://lore.kernel.org/r/20201106103602.9849-1-w...@kernel.org
> 
> which seems to fix the problem for me.
Ah, I had not realized that cpu_psci_cpu_die() might not return. Your patchset
looks good to me.



Re: [PATCH] arm64/smp: Move rcu_cpu_starting() earlier

2020-11-05 Thread Qian Cai
On Thu, 2020-11-05 at 15:28 -0800, Paul E. McKenney wrote:
> On Thu, Nov 05, 2020 at 06:02:49PM -0500, Qian Cai wrote:
> > On Thu, 2020-11-05 at 22:22 +, Will Deacon wrote:
> > > On Fri, Oct 30, 2020 at 04:33:25PM +, Will Deacon wrote:
> > > > On Wed, 28 Oct 2020 14:26:14 -0400, Qian Cai wrote:
> > > > > The call to rcu_cpu_starting() in secondary_start_kernel() is not
> > > > > early
> > > > > enough in the CPU-hotplug onlining process, which results in lockdep
> > > > > splats as follows:
> > > > > 
> > > > >  WARNING: suspicious RCU usage
> > > > >  -
> > > > >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader
> > > > > section!!
> > > > > 
> > > > > [...]
> > > > 
> > > > Applied to arm64 (for-next/fixes), thanks!
> > > > 
> > > > [1/1] arm64/smp: Move rcu_cpu_starting() earlier
> > > >   https://git.kernel.org/arm64/c/ce3d31ad3cac
> > > 
> > > Hmm, this patch has caused a regression in the case that we fail to
> > > online a CPU because it has incompatible CPU features and so we park it
> > > in cpu_die_early(). We now get an endless spew of RCU stalls because the
> > > core will never come online, but is being tracked by RCU. So I'm tempted
> > > to revert this and live with the lockdep warning while we figure out a
> > > proper fix.
> > > 
> > > What's the correct way to undo rcu_cpu_starting(), given that we cannot
> > > invoke the full hotplug machinery here? Is it correct to call
> > > rcutree_dying_cpu() on the bad CPU and then rcutree_dead_cpu() from the
> > > CPU doing cpu_up(), or should we do something else?
> > It looks to me that rcu_report_dead() does the opposite of
> > rcu_cpu_starting(),
> > so lift rcu_report_dead() out of CONFIG_HOTPLUG_CPU and use it there to
> > rewind,
> > Paul?
> 
> Yes, rcu_report_dead() should do the trick.  Presumably the earlier
> online-time CPU-hotplug notifiers are also unwound?
I don't think that is an issue here. cpu_die_early() sets CPU_STUCK_IN_KERNEL,
and then __cpu_up() will see a timeout waiting for the AP to come online and
deal with CPU_STUCK_IN_KERNEL accordingly. Thus, something like this? I don't
see anything in rcu_report_dead() that depends on CONFIG_HOTPLUG_CPU=y.

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 09c96f57818c..10729d2d6084 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -421,6 +421,8 @@ void cpu_die_early(void)
 
update_cpu_boot_status(CPU_STUCK_IN_KERNEL);
 
+   rcu_report_dead(cpu);
+
cpu_park_loop();
 }
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2a52f42f64b6..bd04b09b84b3 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4077,7 +4077,6 @@ void rcu_cpu_starting(unsigned int cpu)
smp_mb(); /* Ensure RCU read-side usage follows above initialization. */
 }
 
-#ifdef CONFIG_HOTPLUG_CPU
 /*
  * The outgoing function has no further need of RCU, so remove it from
  * the rcu_node tree's ->qsmaskinitnext bit masks.
@@ -4117,6 +4116,7 @@ void rcu_report_dead(unsigned int cpu)
rdp->cpu_started = false;
 }
 
+#ifdef CONFIG_HOTPLUG_CPU
 /*
  * The outgoing CPU has just passed through the dying-idle state, and we
  * are being invoked from the CPU that was IPIed to continue the offline
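
To illustrate the reasoning in this thread, here is a rough userspace model of
why a parked CPU that stays tracked leads to endless stalls, and why undoing
rcu_cpu_starting() with something like rcu_report_dead() resolves it. The
struct and function names (rcu_model, model_cpu_starting(), and so on) are made
up for this sketch; the real code works on the rcu_node tree's bit masks and is
considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, much simplified view of what a grace period waits for. */
struct rcu_model {
	uint64_t tracked;	/* CPUs the grace period must hear from */
	uint64_t reported;	/* CPUs that have passed a quiescent state */
};

/* Rough analogue of rcu_cpu_starting(): begin tracking this CPU. */
void model_cpu_starting(struct rcu_model *m, unsigned int cpu)
{
	m->tracked |= 1ULL << cpu;
}

/* Rough analogue of rcu_report_dead(): stop tracking this CPU. */
void model_report_dead(struct rcu_model *m, unsigned int cpu)
{
	m->tracked &= ~(1ULL << cpu);
}

/* A running CPU reports a quiescent state. */
void model_quiescent(struct rcu_model *m, unsigned int cpu)
{
	m->reported |= 1ULL << cpu;
}

/* The grace period can end only when every tracked CPU has reported. */
bool model_gp_can_end(const struct rcu_model *m)
{
	return (m->tracked & ~m->reported) == 0;
}
```

In this model, a CPU that was marked as starting and then parked never reports,
so model_gp_can_end() stays false until model_report_dead() removes it from the
tracked set.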



Re: [PATCH] arm64/smp: Move rcu_cpu_starting() earlier

2020-11-05 Thread Qian Cai
On Thu, 2020-11-05 at 22:22 +, Will Deacon wrote:
> On Fri, Oct 30, 2020 at 04:33:25PM +, Will Deacon wrote:
> > On Wed, 28 Oct 2020 14:26:14 -0400, Qian Cai wrote:
> > > The call to rcu_cpu_starting() in secondary_start_kernel() is not early
> > > enough in the CPU-hotplug onlining process, which results in lockdep
> > > splats as follows:
> > > 
> > >  WARNING: suspicious RCU usage
> > >  -
> > >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > > 
> > > [...]
> > 
> > Applied to arm64 (for-next/fixes), thanks!
> > 
> > [1/1] arm64/smp: Move rcu_cpu_starting() earlier
> >   https://git.kernel.org/arm64/c/ce3d31ad3cac
> 
> Hmm, this patch has caused a regression in the case that we fail to
> online a CPU because it has incompatible CPU features and so we park it
> in cpu_die_early(). We now get an endless spew of RCU stalls because the
> core will never come online, but is being tracked by RCU. So I'm tempted
> to revert this and live with the lockdep warning while we figure out a
> proper fix.
> 
> What's the correct way to undo rcu_cpu_starting(), given that we cannot
> invoke the full hotplug machinery here? Is it correct to call
> rcutree_dying_cpu() on the bad CPU and then rcutree_dead_cpu() from the
> CPU doing cpu_up(), or should we do something else?
It looks to me like rcu_report_dead() does the opposite of rcu_cpu_starting(),
so we could lift rcu_report_dead() out of CONFIG_HOTPLUG_CPU and use it there
to rewind. Paul?



Re: [PATCH] KVM: x86: use positive error values for msr emulation that causes #GP

2020-11-04 Thread Qian Cai
On Sun, 2020-11-01 at 13:55 +0200, Maxim Levitsky wrote:
> Recent introduction of the userspace msr filtering added code that uses
> negative error codes for cases that result in either #GP delivery to
> the guest, or handled by the userspace msr filtering.
> 
> This breaks an assumption that a negative error code returned from the
> msr emulation code is a semi-fatal error which should be returned
> to userspace via KVM_RUN ioctl and usually kill the guest.
> 
> Fix this by reusing the already existing KVM_MSR_RET_INVALID error code,
> and by adding a new KVM_MSR_RET_FILTERED error code for the
> userspace filtered msrs.
> 
> Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation
> to userspace")
> Reported-by: Qian Cai 
> Signed-off-by: Maxim Levitsky 
Apparently, it does not apply cleanly on today's linux-next. Paolo, is it
possible to toss this into -next soon, so our CI won't be blocked because of
this bug?
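
As a reading aid, the return-value convention described in the patch can be
sketched as below: negative values are treated as semi-fatal errors that go
back to userspace, while positive codes mean the access is handled in a
guest-visible way. The macro names come from the commit message, but the
numeric values, the enum, and classify_msr_result() are assumptions made for
this sketch, not the KVM source.

```c
#include <assert.h>
#include <errno.h>

/* Names from the commit message; the numeric values are assumed here. */
#define KVM_MSR_RET_INVALID  2	/* invalid MSR access, results in #GP */
#define KVM_MSR_RET_FILTERED 3	/* access denied by the userspace MSR filter */

/* Hypothetical outcomes a caller might distinguish. */
enum msr_outcome {
	MSR_HANDLED,			/* emulation succeeded */
	MSR_INJECT_GP,			/* deliver #GP to the guest */
	MSR_DEFER_TO_USERSPACE_FILTER,	/* let the userspace filter decide */
	MSR_EXIT_WITH_ERROR,		/* semi-fatal: return error from KVM_RUN */
};

/* Map an emulation return code to an outcome under the convention above. */
enum msr_outcome classify_msr_result(int ret)
{
	if (ret < 0)
		return MSR_EXIT_WITH_ERROR;
	if (ret == 0)
		return MSR_HANDLED;
	if (ret == KVM_MSR_RET_FILTERED)
		return MSR_DEFER_TO_USERSPACE_FILTER;
	/* Sketch choice: any other positive code is treated as a #GP. */
	return MSR_INJECT_GP;
}
```

The point of the split is that a filtered or invalid MSR no longer looks like a
fatal error to the code that decides whether to kill the guest.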



Re: kernel BUG at mm/page-writeback.c:2241 [ BUG_ON(PageWriteback(page); ]

2020-11-04 Thread Qian Cai
On Wed, 2020-11-04 at 16:16 +0100, Jan Kara wrote:
> On Mon 26-10-20 10:26:26, Qian Cai wrote:
> > On Mon, 2020-10-26 at 07:55 -0600, Jens Axboe wrote:
> > > I've tried to reproduce this as well, to no avail. Qian, could you perhaps
> > > detail the setup? What kind of storage, kernel config, compiler, etc.
> > > 
> > 
> > So far I have only been able to reproduce on this Intel platform:
> > 
> > HPE DL560 gen10
> > Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
> > 131072 MB memory, 1000 GB disk space (smartpqi nvme)
> 
> Did you try running with the debug patch Matthew sent? Any results?
Running every day, but no luck so far.



Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-04 Thread Qian Cai
On Tue, 2020-11-03 at 08:04 -0500, Qian Cai wrote:
> On Tue, 2020-11-03 at 10:54 +, John Garry wrote:
> > I have no x86 system to test that x86 config, though. How about 
> > v5.10-rc2 for this issue?
> 
> v5.10-rc2 is also broken here.

John, Kashyap, any update on this? If this is going to take a while to fix
properly, should I send a patch to revert this, or at least disable the feature
by default for megaraid_sas in the meantime, so it no longer breaks existing
systems out there?

> 
> [  251.941451][  T330] INFO: task systemd-udevd:551 blocked for more than 122
> seconds.
> [  251.949176][  T330]   Not tainted 5.10.0-rc2 #3
> [  251.954094][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  251.962633][  T330] task:systemd-udevd   state:D stack:27160 pid:  551
> ppid:   506 flags:0x0324
> [  251.971707][  T330] Call Trace:
> [  251.974871][  T330]  __schedule+0x71d/0x1b50
> [  251.979155][  T330]  ? kcore_callback+0x1d/0x1d
> [  251.983709][  T330]  schedule+0xbf/0x270
> [  251.987640][  T330]  schedule_timeout+0x3fc/0x590
> [  251.992370][  T330]  ? usleep_range+0x120/0x120
> [  251.996910][  T330]  ? wait_for_completion+0x156/0x250
> [  252.002080][  T330]  ? lock_downgrade+0x700/0x700
> [  252.006792][  T330]  ? rcu_read_unlock+0x40/0x40
> [  252.011435][  T330]  ? do_raw_spin_lock+0x121/0x290
> [  252.016324][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [  252.022178][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> [  252.027235][  T330]  wait_for_completion+0x15e/0x250
> [  252.032226][  T330]  ? wait_for_completion_interruptible+0x2f0/0x2f0
> [  252.038590][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> [  252.03][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> [  252.049502][  T330]  __flush_work+0x42a/0x900
> [  252.053882][  T330]  ? queue_delayed_work_on+0x90/0x90
> [  252.059025][  T330]  ? __queue_work+0x463/0xf40
> [  252.063583][  T330]  ? init_pwq+0x320/0x320
> [  252.06][  T330]  ? queue_work_on+0x5e/0x80
> [  252.072249][  T330]  ? trace_hardirqs_on+0x1c/0x150
> [  252.077138][  T330]  work_on_cpu+0xe7/0x130
> [  252.081347][  T330]  ? flush_delayed_work+0xc0/0xc0
> [  252.086231][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
> [  252.091655][  T330]  ? work_debug_hint+0x30/0x30
> [  252.096284][  T330]  ? pci_device_shutdown+0x80/0x80
> [  252.101274][  T330]  ? cpumask_next_and+0x57/0x80
> [  252.105990][  T330]  pci_device_probe+0x500/0x5c0
> [  252.110703][  T330]  ? pci_device_remove+0x1f0/0x1f0
> [  252.115697][  T330]  really_probe+0x207/0xad0
> [  252.120065][  T330]  ? device_driver_attach+0x120/0x120
> [  252.125317][  T330]  driver_probe_device+0x1f1/0x370
> [  252.130291][  T330]  device_driver_attach+0xe5/0x120
> [  252.135281][  T330]  __driver_attach+0xf0/0x260
> [  252.139827][  T330]  bus_for_each_dev+0x117/0x1a0
> [  252.144552][  T330]  ? subsys_dev_iter_exit+0x10/0x10
> [  252.149609][  T330]  bus_add_driver+0x399/0x560
> [  252.154166][  T330]  driver_register+0x189/0x310
> [  252.158795][  T330]  ? 0xc05c5000
> [  252.162838][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
> [  252.168593][  T330]  do_one_initcall+0xf6/0x510
> [  252.173143][  T330]  ? perf_trace_initcall_level+0x490/0x490
> [  252.178809][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> [  252.183973][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> [  252.189728][  T330]  ? do_init_module+0x49/0x6c0
> [  252.194370][  T330]  ? kmem_cache_alloc_trace+0x12e/0x2a0
> [  252.199780][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> [  252.204942][  T330]  do_init_module+0x1ed/0x6c0
> [  252.209479][  T330]  load_module+0x4a25/0x5cf0
> [  252.213950][  T330]  ? layout_and_allocate+0x2770/0x2770
> [  252.219271][  T330]  ? __vmalloc_node+0x8d/0x100
> [  252.223913][  T330]  ? kernel_read_file+0x485/0x5a0
> [  252.228796][  T330]  ? kernel_read_file+0x305/0x5a0
> [  252.233696][  T330]  ? __ia32_sys_fsconfig+0x6a0/0x6a0
> [  252.238841][  T330]  ? __do_sys_finit_module+0xff/0x180
> [  252.244093][  T330]  __do_sys_finit_module+0xff/0x180
> [  252.249155][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
> [  252.254403][  T330]  ? __fget_files+0x1c3/0x2e0
> [  252.258940][  T330]  do_syscall_64+0x33/0x40
> [  252.263234][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  252.268984][  T330] RIP: 0033:0x7f7cf6a4878d
> [  252.273276][  T330] Code: Unable to access opcode bytes at RIP
> 0x7f7cf6a48763.
> [  252.280499][  T330] RSP: 002b:7ffcfa94b978 EFLAGS: 0246 ORIG_RAX:
> 0139
> [  252.288781][  T330] RAX: ffda RBX: 55e01f48b730 RCX:
> 7f7cf6a4878d
> [  252.296628][  T330] RDX:  RSI: 7f7cf75ba82d RDI:
> 000

Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-03 Thread Qian Cai
On Tue, 2020-11-03 at 10:54 +, John Garry wrote:
> I have no x86 system to test that x86 config, though. How about 
> v5.10-rc2 for this issue?

v5.10-rc2 is also broken here.

[  251.941451][  T330] INFO: task systemd-udevd:551 blocked for more than 122 
seconds.
[  251.949176][  T330]   Not tainted 5.10.0-rc2 #3
[  251.954094][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[  251.962633][  T330] task:systemd-udevd   state:D stack:27160 pid:  551 ppid: 
  506 flags:0x0324
[  251.971707][  T330] Call Trace:
[  251.974871][  T330]  __schedule+0x71d/0x1b50
[  251.979155][  T330]  ? kcore_callback+0x1d/0x1d
[  251.983709][  T330]  schedule+0xbf/0x270
[  251.987640][  T330]  schedule_timeout+0x3fc/0x590
[  251.992370][  T330]  ? usleep_range+0x120/0x120
[  251.996910][  T330]  ? wait_for_completion+0x156/0x250
[  252.002080][  T330]  ? lock_downgrade+0x700/0x700
[  252.006792][  T330]  ? rcu_read_unlock+0x40/0x40
[  252.011435][  T330]  ? do_raw_spin_lock+0x121/0x290
[  252.016324][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[  252.022178][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
[  252.027235][  T330]  wait_for_completion+0x15e/0x250
[  252.032226][  T330]  ? wait_for_completion_interruptible+0x2f0/0x2f0
[  252.038590][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[  252.03][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
[  252.049502][  T330]  __flush_work+0x42a/0x900
[  252.053882][  T330]  ? queue_delayed_work_on+0x90/0x90
[  252.059025][  T330]  ? __queue_work+0x463/0xf40
[  252.063583][  T330]  ? init_pwq+0x320/0x320
[  252.06][  T330]  ? queue_work_on+0x5e/0x80
[  252.072249][  T330]  ? trace_hardirqs_on+0x1c/0x150
[  252.077138][  T330]  work_on_cpu+0xe7/0x130
[  252.081347][  T330]  ? flush_delayed_work+0xc0/0xc0
[  252.086231][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
[  252.091655][  T330]  ? work_debug_hint+0x30/0x30
[  252.096284][  T330]  ? pci_device_shutdown+0x80/0x80
[  252.101274][  T330]  ? cpumask_next_and+0x57/0x80
[  252.105990][  T330]  pci_device_probe+0x500/0x5c0
[  252.110703][  T330]  ? pci_device_remove+0x1f0/0x1f0
[  252.115697][  T330]  really_probe+0x207/0xad0
[  252.120065][  T330]  ? device_driver_attach+0x120/0x120
[  252.125317][  T330]  driver_probe_device+0x1f1/0x370
[  252.130291][  T330]  device_driver_attach+0xe5/0x120
[  252.135281][  T330]  __driver_attach+0xf0/0x260
[  252.139827][  T330]  bus_for_each_dev+0x117/0x1a0
[  252.144552][  T330]  ? subsys_dev_iter_exit+0x10/0x10
[  252.149609][  T330]  bus_add_driver+0x399/0x560
[  252.154166][  T330]  driver_register+0x189/0x310
[  252.158795][  T330]  ? 0xc05c5000
[  252.162838][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
[  252.168593][  T330]  do_one_initcall+0xf6/0x510
[  252.173143][  T330]  ? perf_trace_initcall_level+0x490/0x490
[  252.178809][  T330]  ? kasan_unpoison_shadow+0x30/0x40
[  252.183973][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
[  252.189728][  T330]  ? do_init_module+0x49/0x6c0
[  252.194370][  T330]  ? kmem_cache_alloc_trace+0x12e/0x2a0
[  252.199780][  T330]  ? kasan_unpoison_shadow+0x30/0x40
[  252.204942][  T330]  do_init_module+0x1ed/0x6c0
[  252.209479][  T330]  load_module+0x4a25/0x5cf0
[  252.213950][  T330]  ? layout_and_allocate+0x2770/0x2770
[  252.219271][  T330]  ? __vmalloc_node+0x8d/0x100
[  252.223913][  T330]  ? kernel_read_file+0x485/0x5a0
[  252.228796][  T330]  ? kernel_read_file+0x305/0x5a0
[  252.233696][  T330]  ? __ia32_sys_fsconfig+0x6a0/0x6a0
[  252.238841][  T330]  ? __do_sys_finit_module+0xff/0x180
[  252.244093][  T330]  __do_sys_finit_module+0xff/0x180
[  252.249155][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
[  252.254403][  T330]  ? __fget_files+0x1c3/0x2e0
[  252.258940][  T330]  do_syscall_64+0x33/0x40
[  252.263234][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  252.268984][  T330] RIP: 0033:0x7f7cf6a4878d
[  252.273276][  T330] Code: Unable to access opcode bytes at RIP 
0x7f7cf6a48763.
[  252.280499][  T330] RSP: 002b:7ffcfa94b978 EFLAGS: 0246 ORIG_RAX: 
0139
[  252.288781][  T330] RAX: ffda RBX: 55e01f48b730 RCX: 
7f7cf6a4878d
[  252.296628][  T330] RDX:  RSI: 7f7cf75ba82d RDI: 
0006
[  252.304482][  T330] RBP: 7f7cf75ba82d R08:  R09: 
7ffcfa94baa0
[  252.312331][  T330] R10: 0006 R11: 0246 R12: 

[  252.320167][  T330] R13: 55e01f433530 R14: 0002 R15: 

[  252.328052][  T330] 
[  252.328052][  T330] Showing all locks held in the system:
[  252.335722][  T330] 3 locks held by kworker/3:1/289:
[  252.340697][  T330]  #0: 8881001eb338 
((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ec/0x1610
[  252.350906][  T330]  #1: c90004ef7e00 
((work_completion)()){+.+.}-{0:0}, at: process_one_work+0x820/0x1610
[  252.361725][  T330]  #2: 88810dc600e0 (>scan_mutex){+.+.}-{3:3}, 
at: 

Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL

2020-11-02 Thread Qian Cai
On Mon, 2020-11-02 at 12:50 -0700, Jens Axboe wrote:
> Ah, but that's because later patches assume that TIF_NOTIFY_SIGNAL is
> always there once all archs have been converted. If you just want to back
> out that patch, you'll need to just revert this one:
> 
> commit 82ef6998ed9d488e56bbfbcc2ec9adf62bf78f08
> Author: Jens Axboe 
> Date:   Fri Oct 9 16:04:39 2020 -0600
> 
> kernel: remove checking for TIF_NOTIFY_SIGNAL
> 
> as well and I suspect it should build.

No, at a minimum, I'll need to revert these to build successfully.

7b074c15374c io_uring: remove 'twa_signal_ok' deadlock work-around
eb48a0f216fa kernel: remove checking for TIF_NOTIFY_SIGNAL
c634e6b63a81 signal: kill JOBCTL_TASK_WORK
f8b667db31a3 io_uring: JOBCTL_TASK_WORK is no longer used by task_work
4c3d9c3b415a s390: add support for TIF_NOTIFY_SIGNAL

Then, it will fix the boot issue as well.






Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL

2020-11-02 Thread Qian Cai
On Mon, 2020-11-02 at 10:07 -0700, Jens Axboe wrote:
> On 11/2/20 9:59 AM, Qian Cai wrote:
> > On Sun, 2020-11-01 at 17:31 +, Heiko Carstens wrote:
> > > On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote:
> > > > Wire up TIF_NOTIFY_SIGNAL handling for s390.
> > > > 
> > > > Cc: linux-s...@vger.kernel.org
> > > > Signed-off-by: Jens Axboe 
> > 
> > Even though I did confirm that today's linux-next contains this additional
> > patch
> > from Heiko below, a z10 guest is still unable to boot. Reverting the whole
> > series (reverting only "s390: add support for TIF_NOTIFY_SIGNAL" introduced
> > compiling errors) fixed the problem, i.e., git revert --no-edit
> > af0dd809f3d3..7b074c15374c [1]
> 
> That's odd, it should build fine without that patch. How did it fail for you?

In file included from ./arch/s390/include/asm/bug.h:5,
 from ./include/linux/bug.h:5,
 from ./include/linux/mmdebug.h:5,
 from ./include/linux/percpu.h:5,
 from ./include/linux/context_tracking_state.h:5,
 from ./include/linux/hardirq.h:5,
 from ./include/linux/kvm_host.h:7,
 from arch/s390/kernel/asm-offsets.c:11:
./include/linux/sched/signal.h: In function ‘signal_pending’:
./include/linux/sched/signal.h:368:39: error: ‘TIF_NOTIFY_SIGNAL’ undeclared
(first use in this function); did you mean ‘TIF_NOTIFY_RESUME’?
  if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
   ^
./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
 # define unlikely(x) __builtin_expect(!!(x), 0)
  ^
./include/linux/sched/signal.h:368:39: note: each undeclared identifier is
reported only once for each function it appears in
  if (unlikely(test_tsk_thread_flag(p, TIF_NOTIFY_SIGNAL)))
   ^
./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
 # define unlikely(x) __builtin_expect(!!(x), 0)
  ^
make[1]: *** [scripts/Makefile.build:117: arch/s390/kernel/asm-offsets.s] Error
1
make: *** [Makefile:1198: prepare0] Error 2

> 
> Can you try and add this on top? Looks like I forgot the signal change for
> s390, though that shouldn't really cause any issues.

It does not help with the boot issue at all.

> 
> 
> diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
> index 9e900a8977bd..a68c3796a1bf 100644
> --- a/arch/s390/kernel/signal.c
> +++ b/arch/s390/kernel/signal.c
> @@ -472,7 +472,7 @@ void do_signal(struct pt_regs *regs)
>   current->thread.system_call =
>   test_pt_regs_flag(regs, PIF_SYSCALL) ? regs->int_code : 0;
>  
> - if (get_signal()) {
> + if (test_thread_flag(TIF_NOTIFY_SIGNAL) && get_signal()) {
>   /* Whee!  Actually deliver the signal.  */
>   if (current->thread.system_call) {
>   regs->int_code = current->thread.system_call;
> 



Re: [PATCH] s390: add support for TIF_NOTIFY_SIGNAL

2020-11-02 Thread Qian Cai
On Sun, 2020-11-01 at 17:31 +, Heiko Carstens wrote:
> On Thu, Oct 29, 2020 at 10:21:11AM -0600, Jens Axboe wrote:
> > Wire up TIF_NOTIFY_SIGNAL handling for s390.
> > 
> > Cc: linux-s...@vger.kernel.org
> > Signed-off-by: Jens Axboe 

Even though I did confirm that today's linux-next contains this additional patch
from Heiko below, a z10 guest is still unable to boot. Reverting the whole
series (reverting only "s390: add support for TIF_NOTIFY_SIGNAL" introduced
compiling errors) fixed the problem, i.e., git revert --no-edit
af0dd809f3d3..7b074c15374c [1]

.config: https://cailca.coding.net/public/linux/mm/git/files/master/s390.config

01: [3.284902] systemd[1]: systemd 239 (239-40.el8) running in system mode. 
01: (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +
01: GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCR
01: E2 default-hierarchy=legacy)
01: [3.285558] systemd[1]: Detected virtualization zvm. 
01: [3.285585] systemd[1]: Detected architecture s390x. 
01: [3.285618] systemd[1]: Running in initial RAM disk. 
01: [3.376459] systemd[1]: Set hostname to . 
01: [3.464950] mkdir (45) used greatest stack depth: 57824 bytes left   
01: 
01: Welcome to [0;34mRed Hat Enterprise Linux 8.3 (Ootpa) dracut-049-95.git20200
01: 804.el8 (Initramfs)[0m! 
01: 
00: [   87.908107] random: crng init done 
 
01: [  490.492263] INFO: task (sd-executor):42 can't die for more than 368 secon
01: ds. 
01: [  490.492303] task:(sd-executor)   state:R  running task stack:58984 pi
01: d:   42 ppid: 1 flags:0x0002
01: [  490.492359] Call Trace:  
01: [  490.492382]  [<163f0652>] __schedule+0xa12/0x1840
01: [  490.492391]  [<163f1562>] schedule+0xe2/0x310
(inlined by) __preempt_count_add at arch/s390/include/asm/preempt.h:56
(discriminator 1)
(inlined by) __preempt_count_sub at arch/s390/include/asm/preempt.h:63
(discriminator 1)
(inlined by) schedule at kernel/sched/core.c:4602 (discriminator 1)
01: [  490.492399]  [<1640390a>] system_call+0xe2/0x278
system_call at arch/s390/kernel/entry.S:424
01: [  490.492407] no locks held by (sd-executor)/42.   
01: [  490.492420]  
01: [  490.492420] Showing all locks held in the system:
01: [  490.492438] 1 lock held by khungtaskd/25:
01: [  490.492445]  #0: 16b92c80 (rcu_read_lock){}-{1:2}, at: rcu_lo
01: ck_acquire.constprop.54+0x0/0x50
01: [  490.492481]  
01: [  490.492488] =
01: [  490.492488]

[1]:
7b074c15374c io_uring: remove 'twa_signal_ok' deadlock work-around
eb48a0f216fa kernel: remove checking for TIF_NOTIFY_SIGNAL
c634e6b63a81 signal: kill JOBCTL_TASK_WORK
f8b667db31a3 io_uring: JOBCTL_TASK_WORK is no longer used by task_work
c50eb9d59bb1 task_work: remove legacy TWA_SIGNAL path
1d48c8d6d71e xtensa: add support for TIF_NOTIFY_SIGNAL
8ef9c750c5a1 um: add support for TIF_NOTIFY_SIGNAL
3f242a158b7c sparc: add support for TIF_NOTIFY_SIGNAL
40c7ac5c4790 sh: add support for TIF_NOTIFY_SIGNAL
5e59963ed1ac riscv: add support for TIF_NOTIFY_SIGNAL
9333d15595e8 openrisc: add support for TIF_NOTIFY_SIGNAL
c34f87ae2e81 nds32: add support for TIF_NOTIFY_SIGNAL
27af2ca0cdda microblaze: add support for TIF_NOTIFY_SIGNAL
ef1863c4081e ia64: add support for TIF_NOTIFY_SIGNAL
58d670021acc hexagon: add support for TIF_NOTIFY_SIGNAL
1facd6bf079c h8300: add support for TIF_NOTIFY_SIGNAL
1b81145fc28d csky: add support for TIF_NOTIFY_SIGNAL
bbc8d03c0bf3 c6x: add support for TIF_NOTIFY_SIGNAL
6cbc413682ac arm: add support for TIF_NOTIFY_SIGNAL
e9822185daa1 alpha: add support for TIF_NOTIFY_SIGNAL
4c3d9c3b415a s390: add support for TIF_NOTIFY_SIGNAL
d0772a4d9367 mips: add support for TIF_NOTIFY_SIGNAL
07246df9ebe4 powerpc: add support for TIF_NOTIFY_SIGNAL
9edbc08ce909 parisc: add support for TIF_NOTIFY_SIGNAL
c96152dd9c01 nios32: add support for TIF_NOTIFY_SIGNAL
89d22e3adff3 m68k: add support for TIF_NOTIFY_SIGNAL
3db7550a998c arm64: add support for TIF_NOTIFY_SIGNAL
9161d936d1ff arc: add support for TIF_NOTIFY_SIGNAL
fdb5f027ce66 task_work: use TIF_NOTIFY_SIGNAL if available

Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-02 Thread Qian Cai
On Mon, 2020-11-02 at 20:01 +0530, Kashyap Desai wrote:
> > On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> > > From: Kashyap Desai 
> > > 
> > > Fusion adapters can steer completions to individual queues, and we now
> > > have support for shared host-wide tags.
> > > So we can enable multiqueue support for fusion adapters.
> > > 
> > > Once the driver enables shared host-wide tags, the CPU hotplug feature is
> > > also supported, as it was enabled by the patchset below - commit
> > > bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> > > offline")
> > > 
> > > Currently driver has provision to disable host-wide tags using
> > > "host_tagset_enable" module parameter.
> > > 
> > > Once we do not have any major performance regression using host-wide
> > > tags, we will drop the hand-crafted interrupt affinity settings.
> > > 
> > > Performance is also meeting the expectation - (used both none and
> > > mq-deadline scheduler)
> > > 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> > > 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get
> > > 3.1M IOPs.
> > > 
> > > Signed-off-by: Kashyap Desai 
> > > Signed-off-by: Hannes Reinecke 
> > > Signed-off-by: John Garry 
> > 
> > Reverting this commit fixed an issue where a Dell PowerEdge R6415 server
> > with megaraid_sas is unable to boot.
> 
> I will take a look at this. BTW, can you try keeping the same patch but using
> the module parameter "host_tagset_enable=0"?

Yes, that also works.



Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-02 Thread Qian Cai
On Mon, 2020-11-02 at 14:51 +, John Garry wrote:
> On 02/11/2020 14:17, Qian Cai wrote:
> > [  251.961152][  T330] INFO: task systemd-udevd:567 blocked for more than
> > 122 seconds.
> > [  251.968876][  T330]   Not tainted 5.10.0-rc1-next-20201102 #1
> > [  251.975003][  T330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  251.983546][  T330] task:systemd-udevd   state:D stack:27224 pid:  567
> > ppid:   506 flags:0x4324
> > [  251.992620][  T330] Call Trace:
> > [  251.995784][  T330]  __schedule+0x71d/0x1b60
> > [  252.67][  T330]  ? __sched_text_start+0x8/0x8
> > [  252.004798][  T330]  schedule+0xbf/0x270
> > [  252.008735][  T330]  schedule_timeout+0x3fc/0x590
> > [  252.013464][  T330]  ? usleep_range+0x120/0x120
> > [  252.018008][  T330]  ? wait_for_completion+0x156/0x250
> > [  252.023176][  T330]  ? lock_downgrade+0x700/0x700
> > [  252.027886][  T330]  ? rcu_read_unlock+0x40/0x40
> > [  252.032530][  T330]  ? do_raw_spin_lock+0x121/0x290
> > [  252.037412][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> > [  252.043268][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> > [  252.048331][  T330]  wait_for_completion+0x15e/0x250
> > [  252.053323][  T330]  ? wait_for_completion_interruptible+0x320/0x320
> > [  252.059687][  T330]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
> > [  252.065543][  T330]  ? _raw_spin_unlock_irq+0x1f/0x30
> > [  252.070606][  T330]  __flush_work+0x42a/0x900
> > [  252.074989][  T330]  ? queue_delayed_work_on+0x90/0x90
> > [  252.080139][  T330]  ? __queue_work+0x463/0xf40
> > [  252.084700][  T330]  ? init_pwq+0x320/0x320
> > [  252.088891][  T330]  ? queue_work_on+0x5e/0x80
> > [  252.093364][  T330]  ? trace_hardirqs_on+0x1c/0x150
> > [  252.098255][  T330]  work_on_cpu+0xe7/0x130
> > [  252.102461][  T330]  ? flush_delayed_work+0xc0/0xc0
> > [  252.107342][  T330]  ? __mutex_unlock_slowpath+0xd4/0x670
> > [  252.112764][  T330]  ? work_debug_hint+0x30/0x30
> > [  252.117391][  T330]  ? pci_device_shutdown+0x80/0x80
> > [  252.122378][  T330]  ? cpumask_next_and+0x57/0x80
> > [  252.127094][  T330]  pci_device_probe+0x500/0x5c0
> > [  252.131824][  T330]  ? pci_device_remove+0x1f0/0x1f0
> 
> Is CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled? I figure it is, with this call.
> 
> Or please share the .config

No. https://cailca.coding.net/public/linux/mm/git/files/master/x86.config

> 
> Cheers,
> John
> 
> > [  252.136805][  T330]  really_probe+0x207/0xad0
> > [  252.141191][  T330]  ? device_driver_attach+0x120/0x120
> > [  252.146428][  T330]  driver_probe_device+0x1f1/0x370
> > [  252.151424][  T330]  device_driver_attach+0xe5/0x120
> > [  252.156399][  T330]  __driver_attach+0xf0/0x260
> > [  252.160953][  T330]  bus_for_each_dev+0x117/0x1a0
> > [  252.165669][  T330]  ? subsys_dev_iter_exit+0x10/0x10
> > [  252.170731][  T330]  bus_add_driver+0x399/0x560
> > [  252.175289][  T330]  driver_register+0x189/0x310
> > [  252.179919][  T330]  ? 0xc05c1000
> > [  252.183960][  T330]  megasas_init+0x117/0x1000 [megaraid_sas]
> > [  252.189713][  T330]  do_one_initcall+0xf6/0x510
> > [  252.194267][  T330]  ? perf_trace_initcall_level+0x490/0x490
> > [  252.199940][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> > [  252.205104][  T330]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
> > [  252.210859][  T330]  ? do_init_module+0x49/0x6c0
> > [  252.215500][  T330]  ? kmem_cache_alloc_trace+0x11f/0x1e0
> > [  252.220925][  T330]  ? kasan_unpoison_shadow+0x30/0x40
> > [  252.226068][  T330]  do_init_module+0x1ed/0x6c0
> > [  252.230608][  T330]  load_module+0x4a59/0x5d20
> > [  252.235081][  T330]  ? layout_and_allocate+0x2770/0x2770
> > [  252.240404][  T330]  ? __vmalloc_node+0x8d/0x100
> > [  252.245046][  T330]  ? kernel_read_file+0x485/0x5a0
> > [  252.249934][  T330]  ? kernel_read_file+0x305/0x5a0
> > [  252.254839][  T330]  ? __x64_sys_fsconfig+0x970/0x970
> > [  252.259903][  T330]  ? __do_sys_finit_module+0xff/0x180
> > [  252.265153][  T330]  __do_sys_finit_module+0xff/0x180
> > [  252.270216][  T330]  ? __do_sys_init_module+0x1d0/0x1d0
> > [  252.275465][  T330]  ? __fget_files+0x1c3/0x2e0
> > [  252.280010][  T330]  do_syscall_64+0x33/0x40
> > [  252.284304][  T330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [  252.290054][  T330] RIP: 0033:0x7fbb3e2fa78d
> > [  252.294348][  T330] Code: Unable to access opcode bytes at RIP
> > 0x7fbb3e2fa763.
> > [  252.301584][  T330] RSP: 002b:7ffe572e8d18 EFLAGS: 0246 ORIG_RAX:
> > 0

Re: WARN_ON(fuse_insert_writeback(root, wpa)) in tree_insert()

2020-11-02 Thread Qian Cai
On Thu, 2020-10-29 at 16:20 +0100, Miklos Szeredi wrote:
> On Thu, Oct 29, 2020 at 4:02 PM Qian Cai  wrote:
> > On Wed, 2020-10-07 at 16:08 -0400, Qian Cai wrote:
> > > Running some fuzzing by a unprivileged user on virtiofs could trigger the
> > > warning below. The warning was introduced not long ago by the commit
> > > c146024ec44c ("fuse: fix warning in tree_insert() and clean up writepage
> > > insertion").
> > > 
> > > From the logs, the last piece of the fuzzing code is:
> > > 
> > > fgetxattr(fd=426, name=0x7f39a69af000, value=0x7f39a8abf000, size=1)
> > 
> > I can still reproduce it on today's linux-next. Any idea on how to debug it
> > further?
> 
> Can you please try the attached patch?

It has survived the testing over the weekend. There is an issue where virtiofsd
hung, but it looks like a separate issue.



Re: [PATCH v8 17/18] scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug

2020-11-02 Thread Qian Cai
On Wed, 2020-08-19 at 23:20 +0800, John Garry wrote:
> From: Kashyap Desai 
> 
> Fusion adapters can steer completions to individual queues, and
> we now have support for shared host-wide tags.
> So we can enable multiqueue support for fusion adapters.
> 
> Once driver enable shared host-wide tags, cpu hotplug feature is also
> supported as it was enabled using below patchsets -
> commit bf0beec0607d ("blk-mq: drain I/O when all CPUs in a hctx are
> offline")
> 
> Currently driver has provision to disable host-wide tags using
> "host_tagset_enable" module parameter.
> 
> Once we do not have any major performance regression using host-wide
> tags, we will drop the hand-crafted interrupt affinity settings.
> 
> Performance is also meeting the expecatation - (used both none and
> mq-deadline scheduler)
> 24 Drive SSD on Aero with/without this patch can get 3.1M IOPs
> 3 VDs consist of 8 SAS SSD on Aero with/without this patch can get 3.1M
> IOPs.
> 
> Signed-off-by: Kashyap Desai 
> Signed-off-by: Hannes Reinecke 
> Signed-off-by: John Garry 

Reverting this commit fixed an issue where a Dell PowerEdge R6415 server with
megaraid_sas is unable to boot.

c1:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 
02)
DeviceName: Integrated RAID
Subsystem: Dell PERC H730P Mini
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-  [disabled]
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, 
L1 <1us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- 
SlotPowerLimit 0.000W
DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 512 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- 
TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x8, ASPM L0s, Exit Latency 
L0s <2us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x8 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range BC, TimeoutDis+, NROPrPrP-, 
LTR-
 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, 
ExtFmt-, EETLPPrefix-
 EmergencyPowerReduction Not Supported, 
EmergencyPowerReductionInit-
 FRS-, TPHComp-, ExtTPHComp-
 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, 
OBFF Disabled
 AtomicOpsCtl: ReqEn-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
 Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
 Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, 
EqualizationComplete+, EqualizationPhase1+
 EqualizationPhase2+, EqualizationPhase3+, 
LinkEqualizationRequest-
Capabilities: [a8] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address:   Data: 
Masking:   Pending: 
Capabilities: [c0] MSI-X: Enable+ Count=97 Masked-
Vector table: BAR=1 offset=e000
PBA: BAR=1 offset=f000
Capabilities: [100 v2] Advanced Error Reporting
UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- 
RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ 
RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- 
RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
AdvNonFatalErr+
CEMsk:  RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ 
AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- 
ECRCChkCap- ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 0401 c00f c108 4ba9007a
Capabilities: [1e0 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
LaneErrStat: 0
Capabilities: [1c0 v1] Power Budgeting 
  

Re: [PATCH] s390/smp: Move rcu_cpu_starting() earlier

2020-10-31 Thread Qian Cai
On Sat, 2020-10-31 at 19:37 +0100, Heiko Carstens wrote:
> On Wed, Oct 28, 2020 at 02:27:42PM -0400, Qian Cai wrote:
> > The call to rcu_cpu_starting() in smp_init_secondary() is not early
> > enough in the CPU-hotplug onlining process, which results in lockdep
> > splats as follows:
> > 
> >  WARNING: suspicious RCU usage
> >  -
> >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > 
> >  other info that might help us debug this:
> > 
> >  RCU used illegally from offline CPU!
> >  rcu_scheduler_active = 1, debug_locks = 1
> >  no locks held by swapper/1/0.
> > 
> >  Call Trace:
> >  show_stack+0x158/0x1f0
> >  dump_stack+0x1f2/0x238
> >  __lock_acquire+0x2640/0x4dd0
> >  lock_acquire+0x3a8/0xd08
> >  _raw_spin_lock_irqsave+0xc0/0xf0
> >  clockevents_register_device+0xa8/0x528
> >  init_cpu_timer+0x33e/0x468
> >  smp_init_secondary+0x11a/0x328
> >  smp_start_secondary+0x82/0x88
> > 
> > This is avoided by moving the call to rcu_cpu_starting up near the
> > beginning of the smp_init_secondary() function. Note that the
> > raw_smp_processor_id() is required in order to avoid calling into
> > lockdep before RCU has declared the CPU to be watched for readers.
> > 
> > Link: 
> > https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> > Signed-off-by: Qian Cai 
> > ---
> >  arch/s390/kernel/smp.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Could you provide the config you used? I'm wondering why I can't
> reproduce this even though I have lot's of debug options enabled.

https://cailca.coding.net/public/linux/mm/git/files/master/s390.config

Essentially, I believe it requires CONFIG_PROVE_RCU_LIST=y. Also, it occurs to
me that this only starts to happen after the commit mentioned in the above link.



Re: [PATCH -next] fs: Fix memory leaks in do_renameat2() error paths

2020-10-30 Thread Qian Cai
On Fri, 2020-10-30 at 09:27 -0600, Jens Axboe wrote:
> On 10/30/20 9:24 AM, Qian Cai wrote:
> > We will need to call putname() before do_renameat2() returning -EINVAL
> > to avoid memory leaks.
> 
> Thanks, should mention that this isn't final by any stretch (which is
> why it hasn't been posted yet), just pushed out for some exposure.

I don't know what other people think about this, but I do find it a bit
discouraging to test half-baked patches in linux-next that are not even ready
to be posted for review.

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=3c5499fa56f568005648e6e38201f8ae9ab88015



[PATCH -next] fs: Fix memory leaks in do_renameat2() error paths

2020-10-30 Thread Qian Cai
We need to call putname() before do_renameat2() returns -EINVAL to
avoid memory leaks.

Fixes: 3c5499fa56f5 ("fs: make do_renameat2() take struct filename")
Signed-off-by: Qian Cai 
---
 fs/namei.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 27f5a4e025fd..9dc5e1b139c9 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4362,11 +4362,11 @@ int do_renameat2(int olddfd, struct filename *oldname, int newdfd,
int error;
 
if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
-   return -EINVAL;
+   goto out;
 
if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
(flags & RENAME_EXCHANGE))
-   return -EINVAL;
+   goto out;
 
if (flags & RENAME_EXCHANGE)
target_flags = 0;
@@ -4486,6 +4486,14 @@ int do_renameat2(int olddfd, struct filename *oldname, int newdfd,
}
 exit:
return error;
+out:
+   if (!IS_ERR(oldname))
+   putname(oldname);
+
+   if (!IS_ERR(newname))
+   putname(newname);
+
+   return -EINVAL;
 }
 
 SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
-- 
2.28.0
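
For anyone following along, the ownership rule behind this fix can be
sketched in userspace. This is only an illustration under stated
assumptions: name_get(), name_put(), rename_sketch() and the FLAG_*
values are hypothetical stand-ins, not the helpers or flags in
fs/namei.c. The point is that once the callee owns both names, the early
flag checks have to release them too.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for getname()/putname(); not the kernel API. */
static int live_names; /* number of name buffers currently allocated */

static char *name_get(const char *s)
{
	char *copy = malloc(strlen(s) + 1);

	if (copy) {
		strcpy(copy, s);
		live_names++;
	}
	return copy;
}

static void name_put(char *name)
{
	if (name) {
		free(name);
		live_names--;
	}
}

#define FLAG_NOREPLACE 0x1u /* hypothetical flag values */
#define FLAG_EXCHANGE  0x2u

/*
 * The callee owns both names once it is called, so every return path,
 * including the early flag checks, has to release them.
 */
static int rename_sketch(char *oldname, char *newname, unsigned int flags)
{
	int error = 0;

	if (flags & ~(FLAG_NOREPLACE | FLAG_EXCHANGE)) {
		error = -EINVAL;
		goto out;
	}
	if ((flags & FLAG_NOREPLACE) && (flags & FLAG_EXCHANGE)) {
		error = -EINVAL;
		goto out;
	}
	/* The real work would happen here. */
out:
	name_put(oldname);
	name_put(newname);
	return error;
}
```

With a plain "return -EINVAL" in either check, live_names would stay
above zero after the call, which is the leak in miniature.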



Re: kernel BUG at mm/page-writeback.c:2241 [ BUG_ON(PageWriteback(page); ]

2020-10-30 Thread Qian Cai
On Thu, 2020-10-22 at 18:12 +0100, Matthew Wilcox wrote:
> On Thu, Oct 22, 2020 at 11:35:26AM -0400, Qian Cai wrote:
> > On Thu, 2020-10-22 at 01:49 +0100, Matthew Wilcox wrote:
> > > On Wed, Oct 21, 2020 at 08:30:18PM -0400, Qian Cai wrote:
> > > > Today's linux-next starts to trigger this wondering if anyone has any
> > > > clue.
> > > 
> > > I've seen that occasionally too.  I changed that BUG_ON to VM_BUG_ON_PAGE
> > > to try to get a clue about it.  Good to know it's not the THP patches
> > > since they aren't in linux-next.
> > > 
> > > I don't understand how it can happen.  We have the page locked, and then
> > > we
> > > do:
> > > 
> > > if (PageWriteback(page)) {
> > > if (wbc->sync_mode != WB_SYNC_NONE)
> > > wait_on_page_writeback(page);
> > > else
> > > goto continue_unlock;
> > > }
> > > 
> > > VM_BUG_ON_PAGE(PageWriteback(page), page);
> > > 
> > > Nobody should be able to put this page under writeback while we have it
> > > locked ... right?  The page can be redirtied by the code that's supposed
> > > to be writing it back, but I don't see how anyone can make PageWriteback
> > > true while we're holding the page lock.
> > 
> > It happened again on today's linux-next:
> > 
> > [ 7613.579890][T55770] page:a4b35e02 refcount:3 mapcount:0
> > mapping:457ceb87 index:0x3e pfn:0x1cef4e
> > [ 7613.590594][T55770] aops:xfs_address_space_operations ino:805d85a dentry
> > name:"doio.f1.55762"
> > [ 7613.599192][T55770] flags:
> > 0xbfffc000bf(locked|waiters|referenced|uptodate|dirty|lru|active)
> > [ 7613.608596][T55770] raw: 00bfffc000bf ea0005027d48
> > 88810eaec030 888231f3a6a8
> > [ 7613.617101][T55770] raw: 003e 
> > 0003 888143724000
> > [ 7613.625590][T55770] page dumped because:
> > VM_BUG_ON_PAGE(PageWriteback(page))
> > [ 7613.632695][T55770] page->mem_cgroup:888143724000
> 
> Seems like it reproduces for you pretty quickly.  I have no luck ;-(
> 
> Can you add this?

It turns out I have had no luck for the last few days. I'll keep running and
report back if it triggers again.

> 
> +++ b/mm/page-writeback.c
> @@ -2774,6 +2774,7 @@ int __test_set_page_writeback(struct page *page, bool
> keep_write)
> struct address_space *mapping = page_mapping(page);
> int ret, access_ret;
>  
> +   VM_BUG_ON_PAGE(!PageLocked(page), page);
> lock_page_memcg(page);
> if (mapping && mapping_use_writeback_tags(mapping)) {
> XA_STATE(xas, &mapping->i_pages, page_index(page));
> 
> This is the only place (afaict) that sets PageWriteback, so that will
> tell us whether someone is setting Writeback without holding the lock,
> or whether we're suffering from a spurious wakeup.
> 



Re: WARN_ON(fuse_insert_writeback(root, wpa)) in tree_insert()

2020-10-30 Thread Qian Cai
On Thu, 2020-10-29 at 16:20 +0100, Miklos Szeredi wrote:
> On Thu, Oct 29, 2020 at 4:02 PM Qian Cai  wrote:
> > On Wed, 2020-10-07 at 16:08 -0400, Qian Cai wrote:
> > > Running some fuzzing by a unprivileged user on virtiofs could trigger the
> > > warning below. The warning was introduced not long ago by the commit
> > > c146024ec44c ("fuse: fix warning in tree_insert() and clean up writepage
> > > insertion").
> > > 
> > > From the logs, the last piece of the fuzzing code is:
> > > 
> > > fgetxattr(fd=426, name=0x7f39a69af000, value=0x7f39a8abf000, size=1)
> > 
> > I can still reproduce it on today's linux-next. Any idea on how to debug it
> > further?
> 
> Can you please try the attached patch?

So far so good. I'll keep running it over the weekend to be a little more sure.
It was taking a while to reproduce.



Re: WARN_ON(fuse_insert_writeback(root, wpa)) in tree_insert()

2020-10-29 Thread Qian Cai
On Wed, 2020-10-07 at 16:08 -0400, Qian Cai wrote:
> Running some fuzzing by a unprivileged user on virtiofs could trigger the
> warning below. The warning was introduced not long ago by the commit
> c146024ec44c ("fuse: fix warning in tree_insert() and clean up writepage
> insertion").
> 
> From the logs, the last piece of the fuzzing code is:
> 
> fgetxattr(fd=426, name=0x7f39a69af000, value=0x7f39a8abf000, size=1)

I can still reproduce it on today's linux-next. Any idea on how to debug it
further?

The last syscall to trigger this time is:

ftruncate(fd=410, length=4)

[main]  testfile fd:410 filename:trinity-testfile1 flags:2 fopened:1 
fcntl_flags:42400 global:1
[main]   start: 0x7fadab1eb000 size:4KB  name: trinity-testfile1 global:1

[ 3353.774694][T124459] WARNING: CPU: 45 PID: 124459 at fs/fuse/file.c:1742 
tree_insert.part.39+0x0/0x10 [fuse]
[ 3353.777295][T124459] Modules linked in: isofs kvm_intel kvm irqbypass 
nls_ascii nls_cp437 vfat fat ip_tables x_tables virtiofs fuse sr_mod sd_mod 
cdrom ata_piix virtio_pci virtio_ring e1000 libata virtio dm_d
[ 3353.783690][T124459] CPU: 45 PID: 124459 Comm: trinity-c45 Not tainted 
5.10.0-rc1-next-20201029+ #3
[ 3353.786200][T124459] Hardware name: Red Hat KVM, BIOS 
1.14.0-1.module+el8.3.0+7638+07cf13d2 04/01/2014
[ 3353.788746][T124459] RIP: 0010:tree_insert.part.39+0x0/0x10 [fuse]
[ 3353.790847][T124459] Code: fd b7 d7 48 8b 0c 24 e9 ec fb ff ff 0f 1f 40 00 
66 2e 0f 1f 84 00 00 00 00 00 0f 0b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 
<0f> 0b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 b0
[ 3353.796025][T124459] RSP: 0018:c90008b4f828 EFLAGS: 00010286
[ 3353.797628][T124459] RAX: 88818875cd00 RBX: 888261d9a100 RCX: 
8882051023d0
[ 3353.799752][T124459] RDX:  RSI: 888261d9a100 RDI: 
88818875cdb0
[ 3353.803681][T124459] RBP: ea000a835300 R08: 888261d9a1f8 R09: 
f52001169ef8
[ 3353.807019][T124459] R10: 0003 R11: f52001169ef8 R12: 
888205101f40
[ 3353.810694][T124459] R13: ea0007d812c0 R14: 8881b48b1000 R15: 
888205102470
[ 3353.813877][T124459] FS:  7fadae016740() GS:888bcd14() 
knlGS:
[ 3353.817613][T124459] CS:  0010 DS:  ES:  CR0: 80050033
[ 3353.819366][T124459] CR2: 00e6 CR3: 000125140004 CR4: 
00170ee0
[ 3353.822295][T124459] Call Trace:
[ 3353.823242][T124459]  fuse_writepage_locked+0xa43/0xd40 [fuse]
[ 3353.824930][T124459]  fuse_launder_page+0x5b/0xc0 [fuse]
[ 3353.826466][T124459]  invalidate_inode_pages2_range+0x709/0xa90
[ 3353.828231][T124459]  ? unmap_mapping_pages+0x91/0x230
[ 3353.829703][T124459]  ? truncate_exceptional_pvec_entries.part.18+0x460/0x460
[ 3353.832203][T124459]  ? unmap_mapping_pages+0xbd/0x230
[ 3353.833657][T124459]  ? virtio_fs_wake_pending_and_unlock+0x1eb/0x610 
[virtiofs]
[ 3353.835757][T124459]  ? lock_downgrade+0x700/0x700
[ 3353.837184][T124459]  ? down_write+0xdb/0x150
[ 3353.838484][T124459]  ? unmap_mapping_pages+0xbd/0x230
[ 3353.840278][T124459]  ? do_wp_page+0xc50/0xc50
[ 3353.841603][T124459]  fuse_do_setattr+0xd9c/0x13f0 [fuse]
[ 3353.843155][T124459]  ? print_usage_bug+0x1a0/0x1a0
[ 3353.844527][T124459]  ? fuse_flush_times+0x3d0/0x3d0 [fuse]
[ 3353.846129][T124459]  ? mark_held_locks+0xb0/0x110
[ 3353.847471][T124459]  fuse_setattr+0x1ff/0x4b0 [fuse]
[ 3353.848901][T124459]  notify_change+0x6ca/0xc30
[ 3353.850663][T124459]  ? down_write_killable_nested+0x170/0x170
[ 3353.852334][T124459]  ? do_truncate+0xdd/0x180
[ 3353.853651][T124459]  do_truncate+0xdd/0x180
[ 3353.854912][T124459]  ? do_sys_openat2+0x5b0/0x5b0
[ 3353.856339][T124459]  ? rcu_read_lock_any_held+0xcd/0xf0
[ 3353.857898][T124459]  ? __sb_start_write+0x229/0x2d0
[ 3353.859314][T124459]  do_sys_ftruncate+0x1f5/0x2c0
[ 3353.861148][T124459]  ? trace_hardirqs_on+0x1c/0x150
[ 3353.862529][T124459]  do_syscall_64+0x33/0x40
[ 3353.863801][T124459]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3353.865413][T124459] RIP: 0033:0x7fadad92978d
[ 3353.866612][T124459] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e 
fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 
<48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8
[ 3353.872380][T124459] RSP: 002b:7fffabe83818 EFLAGS: 0246 ORIG_RAX: 
004d
[ 3353.874667][T124459] RAX: ffda RBX: 004d RCX: 
7fadad92978d
[ 3353.876842][T124459] RDX: fffd RSI: 0004 RDI: 
019a
[ 3353.879053][T124459] RBP: 004d R08: 207124800010c410 R09: 
9a60a1048000
[ 3353.881679][T124459] R10:  R11: 0246 R12: 
0002
[ 3353.883872][T124459] R13: 7fadaded4058 R14: 7fadae0166c0 R15: 
7fadaded4000
[ 3353.886136][T124459] CPU: 45 PID: 124459 Comm: trinity-c45 Not tainted 
5.10.0-rc1-next-20201029+ #3
[ 3353.888602][T124459] Hardware name: R

Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-29 Thread Qian Cai
On Wed, 2020-10-28 at 17:31 -0700, Paul E. McKenney wrote:
> On Thu, Oct 29, 2020 at 11:09:07AM +1100, Michael Ellerman wrote:
> > Qian Cai  writes:
> > > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > > in the CPU-hotplug onlining process, which results in lockdep splats as
> > > follows:
> > 
> > Since when?
> > What kernel version?
> > 
> > I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> > v5.10-rc1. Am I missing a CONFIG?
> 
> My guess would be that adding CONFIG_PROVE_RAW_LOCK_NESTING=y will
> get you some splats.

Well, I don't have that set, so it should be CONFIG_PROVE_RCU_LIST=y. Anyway,
this is the .config to reproduce it on Power9 NV:

https://cailca.coding.net/public/linux/mm/git/files/master/powerpc.config



Re: [PATCH] arm64/smp: Move rcu_cpu_starting() earlier

2020-10-29 Thread Qian Cai
On Thu, 2020-10-29 at 09:10 +, Will Deacon wrote:
> On Wed, Oct 28, 2020 at 02:26:14PM -0400, Qian Cai wrote:
> > The call to rcu_cpu_starting() in secondary_start_kernel() is not early
> > enough in the CPU-hotplug onlining process, which results in lockdep
> > splats as follows:
> > 
> >  WARNING: suspicious RCU usage
> >  -
> >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > 
> >  other info that might help us debug this:
> > 
> >  RCU used illegally from offline CPU!
> >  rcu_scheduler_active = 1, debug_locks = 1
> >  no locks held by swapper/1/0.
> > 
> >  Call trace:
> >   dump_backtrace+0x0/0x3c8
> >   show_stack+0x14/0x60
> >   dump_stack+0x14c/0x1c4
> >   lockdep_rcu_suspicious+0x134/0x14c
> >   __lock_acquire+0x1c30/0x2600
> >   lock_acquire+0x274/0xc48
> >   _raw_spin_lock+0xc8/0x140
> >   vprintk_emit+0x90/0x3d0
> >   vprintk_default+0x34/0x40
> >   vprintk_func+0x378/0x590
> >   printk+0xa8/0xd4
> >   __cpuinfo_store_cpu+0x71c/0x868
> >   cpuinfo_store_cpu+0x2c/0xc8
> >   secondary_start_kernel+0x244/0x318
> > 
> > This is avoided by moving the call to rcu_cpu_starting up near the
> > beginning of the secondary_start_kernel() function.
> 
> Hmm, it's not really a move though -- we'll end up calling this thing twice
> afaict. It would be better to make sure we've called notify_cpu_starting()
> early enough. Can we do that instead?

Paul mentioned that it is fine to call rcu_cpu_starting() multiple times, and
Peter mentioned that CPU bringup is complicated. Thus, I thought about doing
something safe here.

I lightly tested the patch below, which seems fine, but I can't tell for sure
whether it is safe. Any suggestions?

--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -224,6 +224,7 @@ asmlinkage notrace void secondary_start_kernel(void)
 
preempt_disable();
trace_hardirqs_off();
+   notify_cpu_starting(cpu);
 
/*
 * If the system has established the capabilities, make sure
@@ -244,7 +245,6 @@ asmlinkage notrace void secondary_start_kernel(void)
/*
 * Enable GIC and timers.
 */
-   notify_cpu_starting(cpu);
 
ipi_setup(cpu);
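
To illustrate why calling the function more than once is harmless, here
is a toy userspace model. It assumes the call is guarded by a per-CPU
"already started" flag; cpu_starting_sketch(), cpu_started[] and
NR_CPUS_SKETCH are hypothetical names for the sketch, not the RCU
implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4 /* hypothetical CPU count for the model */

static bool cpu_started[NR_CPUS_SKETCH]; /* per-CPU "already started" flag */
static int online_transitions;           /* counts real state changes */

/* Safe to call more than once per CPU: later calls return early. */
static void cpu_starting_sketch(unsigned int cpu)
{
	if (cpu >= NR_CPUS_SKETCH)
		return;
	if (cpu_started[cpu])
		return;
	cpu_started[cpu] = true;
	online_transitions++;
}
```

In this model a second call for the same CPU changes nothing, so an
early call plus the later one from the hotplug notifier would only do
the work once.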



Re: [PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-29 Thread Qian Cai
On Thu, 2020-10-29 at 11:09 +1100, Michael Ellerman wrote:
> Qian Cai  writes:
> > The call to rcu_cpu_starting() in start_secondary() is not early enough
> > in the CPU-hotplug onlining process, which results in lockdep splats as
> > follows:
> 
> Since when?

For me, it has happened since the commit in the link, which now looks to be
merged into v5.10-rc1. It also needs CONFIG_PROVE_RCU_LIST=y.

> What kernel version?
> 
> I haven't seen this running CPU hotplug tests with PROVE_LOCKING=y on
> v5.10-rc1. Am I missing a CONFIG?
> 
> cheers
> 
> 
> >  WARNING: suspicious RCU usage
> >  -
> >  kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
> > 
> >  other info that might help us debug this:
> > 
> >  RCU used illegally from offline CPU!
> >  rcu_scheduler_active = 1, debug_locks = 1
> >  no locks held by swapper/1/0.
> > 
> >  Call Trace:
> >  dump_stack+0xec/0x144 (unreliable)
> >  lockdep_rcu_suspicious+0x128/0x14c
> >  __lock_acquire+0x1060/0x1c60
> >  lock_acquire+0x140/0x5f0
> >  _raw_spin_lock_irqsave+0x64/0xb0
> >  clockevents_register_device+0x74/0x270
> >  register_decrementer_clockevent+0x94/0x110
> >  start_secondary+0x134/0x800
> >  start_secondary_prolog+0x10/0x14
> > 
> > This is avoided by moving the call to rcu_cpu_starting up near the
> > beginning of the start_secondary() function. Note that the
> > raw_smp_processor_id() is required in order to avoid calling into
> > lockdep before RCU has declared the CPU to be watched for readers.
> > 
> > Link: 
> > https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
> > Signed-off-by: Qian Cai 
> > ---
> >  arch/powerpc/kernel/smp.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 3c6b9822f978..8c2857cbd960 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
> >  /* Activate a secondary processor. */
> >  void start_secondary(void *unused)
> >  {
> > -   unsigned int cpu = smp_processor_id();
> > +   unsigned int cpu = raw_smp_processor_id();
> >  
> > mmgrab(&init_mm);
> > current->active_mm = &init_mm;
> >  
> > smp_store_cpu_info(cpu);
> > set_dec(tb_ticks_per_jiffy);
> > +   rcu_cpu_starting(cpu);
> > preempt_disable();
> > cpu_callin_map[cpu] = 1;
> >  
> > -- 
> > 2.28.0



[PATCH] arm64/smp: Move rcu_cpu_starting() earlier

2020-10-28 Thread Qian Cai
The call to rcu_cpu_starting() in secondary_start_kernel() is not early
enough in the CPU-hotplug onlining process, which results in lockdep
splats as follows:

 WARNING: suspicious RCU usage
 -
 kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 RCU used illegally from offline CPU!
 rcu_scheduler_active = 1, debug_locks = 1
 no locks held by swapper/1/0.

 Call trace:
  dump_backtrace+0x0/0x3c8
  show_stack+0x14/0x60
  dump_stack+0x14c/0x1c4
  lockdep_rcu_suspicious+0x134/0x14c
  __lock_acquire+0x1c30/0x2600
  lock_acquire+0x274/0xc48
  _raw_spin_lock+0xc8/0x140
  vprintk_emit+0x90/0x3d0
  vprintk_default+0x34/0x40
  vprintk_func+0x378/0x590
  printk+0xa8/0xd4
  __cpuinfo_store_cpu+0x71c/0x868
  cpuinfo_store_cpu+0x2c/0xc8
  secondary_start_kernel+0x244/0x318

This is avoided by moving the call to rcu_cpu_starting up near the
beginning of the secondary_start_kernel() function.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Signed-off-by: Qian Cai 
---
 arch/arm64/kernel/smp.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 82e75fc2c903..09c96f57818c 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -222,6 +222,7 @@ asmlinkage notrace void secondary_start_kernel(void)
if (system_uses_irq_prio_masking())
init_gic_priority_masking();
 
+   rcu_cpu_starting(cpu);
preempt_disable();
trace_hardirqs_off();
 
-- 
2.28.0



[PATCH] powerpc/eeh_cache: Fix a possible debugfs deadlock

2020-10-28 Thread Qian Cai
Lockdep complains about the possible deadlock below in
eeh_addr_cache_show(), because it acquires a lock with IRQs enabled,
but eeh_addr_cache_insert_dev() needs to acquire the same lock with IRQs
disabled. Let's just make eeh_addr_cache_show() acquire the lock with
IRQs disabled as well.

        CPU0                                CPU1
        ----                                ----
   lock(&pci_io_addr_cache_root.piar_lock);
                                local_irq_disable();
                                lock(>lock);
                                lock(&pci_io_addr_cache_root.piar_lock);
   <Interrupt>
     lock(>lock);

  *** DEADLOCK ***

  lock_acquire+0x140/0x5f0
  _raw_spin_lock_irqsave+0x64/0xb0
  eeh_addr_cache_insert_dev+0x48/0x390
  eeh_probe_device+0xb8/0x1a0
  pnv_pcibios_bus_add_device+0x3c/0x80
  pcibios_bus_add_device+0x118/0x290
  pci_bus_add_device+0x28/0xe0
  pci_bus_add_devices+0x54/0xb0
  pcibios_init+0xc4/0x124
  do_one_initcall+0xac/0x528
  kernel_init_freeable+0x35c/0x3fc
  kernel_init+0x24/0x148
  ret_from_kernel_thread+0x5c/0x80

  lock_acquire+0x140/0x5f0
  _raw_spin_lock+0x4c/0x70
  eeh_addr_cache_show+0x38/0x110
  seq_read+0x1a0/0x660
  vfs_read+0xc8/0x1f0
  ksys_read+0x74/0x130
  system_call_exception+0xf8/0x1d0
  system_call_common+0xe8/0x218

Fixes: 5ca85ae6318d ("powerpc/eeh_cache: Add a way to dump the EEH address cache")
Signed-off-by: Qian Cai 
---
 arch/powerpc/kernel/eeh_cache.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/eeh_cache.c b/arch/powerpc/kernel/eeh_cache.c
index 6b50bf15d8c1..bf3270426d82 100644
--- a/arch/powerpc/kernel/eeh_cache.c
+++ b/arch/powerpc/kernel/eeh_cache.c
@@ -264,8 +264,9 @@ static int eeh_addr_cache_show(struct seq_file *s, void *v)
 {
struct pci_io_addr_range *piar;
struct rb_node *n;
+   unsigned long flags;
 
-   spin_lock(&pci_io_addr_cache_root.piar_lock);
+   spin_lock_irqsave(&pci_io_addr_cache_root.piar_lock, flags);
for (n = rb_first(&pci_io_addr_cache_root.rb_root); n; n = rb_next(n)) {
piar = rb_entry(n, struct pci_io_addr_range, rb_node);
 
@@ -273,7 +274,7 @@ static int eeh_addr_cache_show(struct seq_file *s, void *v)
   (piar->flags & IORESOURCE_IO) ? "i/o" : "mem",
   &piar->addr_lo, &piar->addr_hi, pci_name(piar->pcidev));
}
-   spin_unlock(&pci_io_addr_cache_root.piar_lock);
+   spin_unlock_irqrestore(&pci_io_addr_cache_root.piar_lock, flags);
 
return 0;
 }
-- 
2.28.0
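
The condition lockdep is reporting can be modelled with two booleans per
lock. This is a toy sketch, not lockdep's data structures: struct
lock_class_sketch and irq_inversion_possible() are hypothetical names.
Here "outer" is a lock that is also taken from interrupt context, and
"inner" is a lock acquired while "outer" is held (the piar_lock in the
report).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-lock usage record; not lockdep's real bookkeeping. */
struct lock_class_sketch {
	bool used_in_irq;       /* ever acquired from interrupt context */
	bool held_with_irqs_on; /* ever held while IRQs were enabled */
};

/*
 * "inner" is acquired while "outer" is held. If an interrupt can take
 * "outer" and "inner" can be held with IRQs enabled, one CPU can hold
 * "inner" and get interrupted by a handler waiting for "outer", while
 * another CPU holds "outer" and waits for "inner".
 */
static bool irq_inversion_possible(const struct lock_class_sketch *outer,
				   const struct lock_class_sketch *inner)
{
	return outer->used_in_irq && inner->held_with_irqs_on;
}
```

Switching the debugfs reader to the irqsave variant corresponds to
clearing held_with_irqs_on for the inner lock, after which the check in
this model no longer fires.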



[PATCH] s390/smp: Move rcu_cpu_starting() earlier

2020-10-28 Thread Qian Cai
The call to rcu_cpu_starting() in smp_init_secondary() is not early
enough in the CPU-hotplug onlining process, which results in lockdep
splats as follows:

 WARNING: suspicious RCU usage
 -
 kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 RCU used illegally from offline CPU!
 rcu_scheduler_active = 1, debug_locks = 1
 no locks held by swapper/1/0.

 Call Trace:
 show_stack+0x158/0x1f0
 dump_stack+0x1f2/0x238
 __lock_acquire+0x2640/0x4dd0
 lock_acquire+0x3a8/0xd08
 _raw_spin_lock_irqsave+0xc0/0xf0
 clockevents_register_device+0xa8/0x528
 init_cpu_timer+0x33e/0x468
 smp_init_secondary+0x11a/0x328
 smp_start_secondary+0x82/0x88

This is avoided by moving the call to rcu_cpu_starting up near the
beginning of the smp_init_secondary() function. Note that the
raw_smp_processor_id() is required in order to avoid calling into
lockdep before RCU has declared the CPU to be watched for readers.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Signed-off-by: Qian Cai 
---
 arch/s390/kernel/smp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index ebfe86d097f0..390d97daa2b3 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -855,13 +855,14 @@ void __init smp_detect_cpus(void)
 
 static void smp_init_secondary(void)
 {
-   int cpu = smp_processor_id();
+   int cpu = raw_smp_processor_id();
 
S390_lowcore.last_update_clock = get_tod_clock();
restore_access_regs(S390_lowcore.access_regs_save_area);
set_cpu_flag(CIF_ASCE_PRIMARY);
set_cpu_flag(CIF_ASCE_SECONDARY);
cpu_init();
+   rcu_cpu_starting(cpu);
preempt_disable();
init_cpu_timer();
vtime_init();
-- 
2.28.0
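
The ordering problem can be shown with a small userspace model. It is
only a sketch: cpu_watched, lock_acquire_sketch() and the two bringup
functions are hypothetical names, and the "splat" counter stands in for
the lockdep warning above. The assumption is simply that any lock taken
before the starting call trips the debug check.

```c
#include <assert.h>
#include <stdbool.h>

static bool cpu_watched; /* set once the model's "RCU" knows about this CPU */
static int splats;       /* number of warnings the model has raised */

static void starting_sketch(void)
{
	cpu_watched = true;
}

/* Stand-in for a lock acquisition that runs a debug check first. */
static void lock_acquire_sketch(void)
{
	if (!cpu_watched)
		splats++;
}

/* Original ordering: a lock is taken before the starting call. */
static int bringup_late(void)
{
	cpu_watched = false;
	splats = 0;
	lock_acquire_sketch(); /* e.g. timer registration taking a lock */
	starting_sketch();
	return splats;
}

/* Patched ordering: the starting call comes before any lock is taken. */
static int bringup_early(void)
{
	cpu_watched = false;
	splats = 0;
	starting_sketch();
	lock_acquire_sketch();
	return splats;
}
```

In this model only the late ordering raises a warning, which matches the
reasoning in the changelog for moving the call up.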



[PATCH] powerpc/smp: Move rcu_cpu_starting() earlier

2020-10-28 Thread Qian Cai
The call to rcu_cpu_starting() in start_secondary() is not early enough
in the CPU-hotplug onlining process, which results in lockdep splats as
follows:

 WARNING: suspicious RCU usage
 -
 kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 RCU used illegally from offline CPU!
 rcu_scheduler_active = 1, debug_locks = 1
 no locks held by swapper/1/0.

 Call Trace:
 dump_stack+0xec/0x144 (unreliable)
 lockdep_rcu_suspicious+0x128/0x14c
 __lock_acquire+0x1060/0x1c60
 lock_acquire+0x140/0x5f0
 _raw_spin_lock_irqsave+0x64/0xb0
 clockevents_register_device+0x74/0x270
 register_decrementer_clockevent+0x94/0x110
 start_secondary+0x134/0x800
 start_secondary_prolog+0x10/0x14

This is avoided by moving the call to rcu_cpu_starting up near the
beginning of the start_secondary() function. Note that the
raw_smp_processor_id() is required in order to avoid calling into
lockdep before RCU has declared the CPU to be watched for readers.

Link: https://lore.kernel.org/lkml/160223032121.7002.1269740091547117869.tip-bot2@tip-bot2/
Signed-off-by: Qian Cai 
---
 arch/powerpc/kernel/smp.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 3c6b9822f978..8c2857cbd960 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1393,13 +1393,14 @@ static void add_cpu_to_masks(int cpu)
 /* Activate a secondary processor. */
 void start_secondary(void *unused)
 {
-	unsigned int cpu = smp_processor_id();
+	unsigned int cpu = raw_smp_processor_id();
 
 	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
 
 	smp_store_cpu_info(cpu);
 	set_dec(tb_ticks_per_jiffy);
+	rcu_cpu_starting(cpu);
 	preempt_disable();
 	cpu_callin_map[cpu] = 1;
 
-- 
2.28.0
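
The ordering constraint described in the commit message can be modelled in userspace. The sketch below is a toy model, not kernel code: `cpu_watched`, `toy_rcu_cpu_starting()`, `toy_lock_acquire()`, `bringup_late()` and `bringup_early()` are hypothetical names. It only illustrates that a lockdep-like check fires when a lock is taken before the CPU has been reported to RCU, and stays quiet once the report comes first.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-in for RCU's notion of which CPUs it is watching. */
static bool cpu_watched[TOY_NR_CPUS];
/* Counts how often the toy lockdep-like check complained. */
static int splats;

/* Toy counterpart of rcu_cpu_starting(): mark the CPU as watched. */
void toy_rcu_cpu_starting(int cpu)
{
	cpu_watched[cpu] = true;
}

/* Toy lock acquisition: the check expects a CPU that RCU already watches. */
void toy_lock_acquire(int cpu)
{
	if (!cpu_watched[cpu])
		splats++;	/* models "RCU used illegally from offline CPU!" */
}

/* Bring-up order before the patch: a lock is taken first. */
int bringup_late(int cpu)
{
	splats = 0;
	cpu_watched[cpu] = false;
	toy_lock_acquire(cpu);		/* e.g. clockevent registration */
	toy_rcu_cpu_starting(cpu);
	return splats;
}

/* Bring-up order after the patch: report the CPU to RCU first. */
int bringup_early(int cpu)
{
	splats = 0;
	cpu_watched[cpu] = false;
	toy_rcu_cpu_starting(cpu);
	toy_lock_acquire(cpu);
	return splats;
}
```

In this model `bringup_late()` reports one complaint and `bringup_early()` reports none, which is the effect the patch aims for on the real bring-up path.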



Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-28 Thread Qian Cai
On Wed, 2020-10-28 at 08:53 -0700, Paul E. McKenney wrote:
> On Wed, Oct 28, 2020 at 10:39:47AM -0400, Qian Cai wrote:
> > On Tue, 2020-10-27 at 20:01 -0700, Paul E. McKenney wrote:
> > > If I have the right email thread associated with the right fixes, these
> > > commits in -rcu should be what you are looking for:
> > > 
> > > 73b658b6b7d5 ("rcu: Prevent lockdep-RCU splats on lock
> > > acquisition/release")
> > > 626b79aa935a ("x86/smpboot:  Move rcu_cpu_starting() earlier")
> > > 
> > > And maybe this one as well:
> > > 
> > > 3a6f638cb95b ("rcu,ftrace: Fix ftrace recursion")
> > > 
> > > Please let me know if these commits do not fix things.
> > > While those patches silence the warnings for x86, other arches are still
> > > suffering. Only applying the patch from Boqun below fixed everything.
> 
> Fair point!
> 
> > > Is it a good idea for Boqun to write a formal patch, or should we fix all
> > > arches individually like "x86/smpboot: Move rcu_cpu_starting() earlier"?
> 
> By Boqun's patch, you mean the change to debug_lockdep_rcu_enabled()
> shown below?  Peter Zijlstra showed that real failures can happen, so we

Yes.

> do not want to cover them up.  So we are firmly in "fix all architectures"
> space here, sorry!
> 
> I am happy to accumulate those patches, but cannot commit to creating
> or testing them.

Okay, I posted 3 patches for each arch and CC'ed you. BTW, it looks like
something is wrong with @vger.kernel.org today; I received many of these:

4.7.1 Hello [216.205.24.124], for recipient address 
 the policy analysis reported: zpostgrey: 
connect: Connection refused

and I can see your previous mails did not even reach there either.

https://lore.kernel.org/lkml/





Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-28 Thread Qian Cai
On Tue, 2020-10-27 at 20:01 -0700, Paul E. McKenney wrote:
> If I have the right email thread associated with the right fixes, these
> commits in -rcu should be what you are looking for:
> 
> 73b658b6b7d5 ("rcu: Prevent lockdep-RCU splats on lock acquisition/release")
> 626b79aa935a ("x86/smpboot:  Move rcu_cpu_starting() earlier")
> 
> And maybe this one as well:
> 
> 3a6f638cb95b ("rcu,ftrace: Fix ftrace recursion")
> 
> Please let me know if these commits do not fix things.
While those patches silence the warnings for x86, other arches are still
suffering. Only applying the patch from Boqun below fixed everything.

Is it a good idea for Boqun to write a formal patch, or should we fix all arches
individually like "x86/smpboot: Move rcu_cpu_starting() earlier"?

> > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > index 39334d2d2b37..35d9bab65b75 100644
> > > --- a/kernel/rcu/update.c
> > > +++ b/kernel/rcu/update.c
> > > @@ -275,8 +275,8 @@ EXPORT_SYMBOL_GPL(rcu_callback_map);
> > >  
> > >  noinstr int notrace debug_lockdep_rcu_enabled(void)
> > >  {
> > > - return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
> > > -current->lockdep_recursion == 0;
> > > + return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE &&
> > > +__lockdep_enabled;
> > >  }
> > >  EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);

The warnings for each arch are:

== powerpc ==
[0.176044][T1] smp: Bringing up secondary CPUs ...
[0.179731][T0] 
[0.179734][T0] =
[0.179736][T0] WARNING: suspicious RCU usage
[0.179739][T0] 5.10.0-rc1-next-20201028+ #2 Not tainted
[0.179741][T0] -
[0.179744][T0] kernel/locking/lockdep.c:3497 RCU-list traversed in non-reader section!!
[0.179745][T0] 
[0.179745][T0] other info that might help us debug this:
[0.179745][T0] 
[0.179748][T0] 
[0.179748][T0] RCU used illegally from offline CPU!
[0.179748][T0] rcu_scheduler_active = 1, debug_locks = 1
[0.179750][T0] no locks held by swapper/1/0.
[0.179752][T0] 
[0.179752][T0] stack backtrace:
[0.179757][T0] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.0-rc1-next-20201028+ #2
[0.179759][T0] Call Trace:
[0.179767][T0] [c00015b27ab0] [c0657188] dump_stack+0xec/0x144 (unreliable)
[0.179776][T0] [c00015b27af0] [c014d0d4] lockdep_rcu_suspicious+0x128/0x14c
[0.179782][T0] [c00015b27b70] [c0148920] __lock_acquire+0x1060/0x1c60
[0.179788][T0] [c00015b27ca0] [c014a1d0] lock_acquire+0x140/0x5f0
[0.179794][T0] [c00015b27d90] [c08f22f4] _raw_spin_lock_irqsave+0x64/0xb0
[0.179801][T0] [c00015b27dd0] [c01a1094] clockevents_register_device+0x74/0x270
[0.179808][T0] [c00015b27e80] [c001f194] register_decrementer_clockevent+0x94/0x110
[0.179814][T0] [c00015b27ef0] [c003fd84] start_secondary+0x134/0x800
[0.179819][T0] [c00015b27f90] [c000c454] start_secondary_prolog+0x10/0x14
[0.179855][T0] 
[0.179857][T0] =
[0.179858][T0] WARNING: suspicious RCU usage
[0.179860][T0] 5.10.0-rc1-next-20201028+ #2 Not tainted
[0.179862][T0] -
[0.179864][T0] kernel/locking/lockdep.c:886 RCU-list traversed in non-reader section!!
[0.179866][T0] 
[0.179866][T0] other info that might help us debug this:
[0.179866][T0] 
[0.179868][T0] 
[0.179868][T0] RCU used illegally from offline CPU!
[0.179868][T0] rcu_scheduler_active = 1, debug_locks = 1
[0.179870][T0] no locks held by swapper/1/0.
[0.179871][T0] 
[0.179871][T0] stack backtrace:
[0.179875][T0] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.0-rc1-next-20201028+ #2
[0.179876][T0] Call Trace:
[0.179880][T0] [c00015b27980] [c0657188] dump_stack+0xec/0x144 (unreliable)
[0.179886][T0] [c00015b279c0] [c014d0d4] lockdep_rcu_suspicious+0x128/0x14c
[0.179892][T0] [c00015b27a40] [c014b010] register_lock_class+0x680/0xc70
[0.179896][T0] [c00015b27b50] [c014795c] __lock_acquire+0x9c/0x1c60
[0.179901][T0] [c00015b27c80] [c014a1d0] lock_acquire+0x140/0x5f0
[0.179906][T0] [c00015b27d70] [c08f22f4] _raw_spin_lock_irqsave+0x64/0xb0
[0.179912][T0] [c00015b27db0] [c03a2fb4] __delete_object+0x44/0x80
[0.179917][T0] [c00015b27de0] [c035a964] slab_free_freelist_hook+0x174/0x300
[0.179921][T0] [c00015b27e50] [c035f848] kfree+0xf8/0x500
[0.179926][T0] [c00015b27ed0] [c0656878] free_cpumask_var+0x18/0x30
[0.179931][T0] 

Re: [PATCH v6 2/4] KVM: x86: report negative values from wrmsr emulation to userspace

2020-10-27 Thread Qian Cai
On Mon, 2020-10-26 at 15:40 -0400, Qian Cai wrote:
> On Wed, 2020-09-23 at 00:10 +0300, Maxim Levitsky wrote:
> > This will allow the KVM to report such errors (e.g -ENOMEM)
> > to the userspace.
> > 
> > Signed-off-by: Maxim Levitsky 
> 
> Reverting this and its dependency:
> 
> 72f211ecaa80 KVM: x86: allow kvm_x86_ops.set_efer to return an error value
> 
> on top of linux-next (they have unfortunately also been merged into mainline
> at the same time) fixed an issue where a simple Intel KVM guest is unable to
> boot, as shown below.

So I debugged this a bit more. This also breaks nested virt (VMX). We have here:

[  345.504403] kvm [1491]: vcpu0 unhandled rdmsr: 0x4e data 0x0
[  345.758560] kvm [1491]: vcpu0 unhandled rdmsr: 0x1c9 data 0x0
[  345.758594] kvm [1491]: vcpu0 unhandled rdmsr: 0x1a6 data 0x0
[  345.758619] kvm [1491]: vcpu0 unhandled rdmsr: 0x1a7 data 0x0
[  345.758644] kvm [1491]: vcpu0 unhandled rdmsr: 0x3f6 data 0x0
[  345.951601] kvm [1493]: vcpu1 unhandled rdmsr: 0x4e data 0x0
[  351.857036] kvm [1493]: vcpu1 unhandled wrmsr: 0xc90 data 0xf

After this commit, -ENOENT is returned to vcpu_enter_guest(), which causes
userspace to abort.

kvm_msr_ignored_check()
  kvm_set_msr()
kvm_emulate_wrmsr()
  vmx_handle_exit()
vcpu_enter_guest()

Something like the patch below will unbreak userspace, but does anyone have a
better idea?

--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1748,7 +1748,7 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu)
return 0;
 
/* Signal all other negative errors to userspace */
-   if (r < 0)
+   if (r < 0 && r != -ENOENT)
return r;
 
/* MSR write failed? Inject a #GP */
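
The effect of the workaround above on the return-code dispatch can be sketched as a small standalone model. This is not KVM code: `toy_action`, `toy_wrmsr_before()` and `toy_wrmsr_after()` are hypothetical names, and the model assumes the `#GP` branch still tests `r > 0`, as in the quoted v6 patch.

```c
#include <errno.h>

/* Possible outcomes of the modelled WRMSR emulation (names are made up). */
enum toy_action {
	TOY_EXIT_TO_USERSPACE,	/* negative value propagated to the caller */
	TOY_INJECT_GP,		/* guest receives #GP */
	TOY_CONTINUE,		/* instruction is skipped, guest keeps running */
};

/* Dispatch as in the v6 patch: every negative value goes to userspace. */
enum toy_action toy_wrmsr_before(int r)
{
	if (r < 0)
		return TOY_EXIT_TO_USERSPACE;
	if (r > 0)
		return TOY_INJECT_GP;
	return TOY_CONTINUE;
}

/* Dispatch with the workaround: -ENOENT is no longer propagated. */
enum toy_action toy_wrmsr_after(int r)
{
	if (r < 0 && r != -ENOENT)
		return TOY_EXIT_TO_USERSPACE;
	if (r > 0)
		return TOY_INJECT_GP;
	return TOY_CONTINUE;
}
```

Under these assumptions an ignored MSR write (`-ENOENT`) no longer aborts the guest, while other negative errors such as `-ENOMEM` still reach userspace.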



Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-27 Thread Qian Cai
On Mon, 2020-10-12 at 11:11 +0800, Boqun Feng wrote:
> Hi,
> 
> On Fri, Oct 09, 2020 at 09:41:24AM -0400, Qian Cai wrote:
> > On Fri, 2020-10-09 at 07:58 +, tip-bot2 for Peter Zijlstra wrote:
> > > The following commit has been merged into the locking/core branch of tip:
> > > 
> > > Commit-ID:     4d004099a668c41522242aa146a38cc4eb59cb1e
> > > Gitweb:        https://git.kernel.org/tip/4d004099a668c41522242aa146a38cc4eb59cb1e
> > > Author:        Peter Zijlstra 
> > > AuthorDate:    Fri, 02 Oct 2020 11:04:21 +02:00
> > > Committer:     Ingo Molnar 
> > > CommitterDate: Fri, 09 Oct 2020 08:53:30 +02:00
> > > 
> > > lockdep: Fix lockdep recursion
> > > 
> > > Steve reported that lockdep_assert*irq*(), when nested inside lockdep
> > > itself, will trigger a false-positive.
> > > 
> > > One example is the stack-trace code, as called from inside lockdep,
> > > triggering tracing, which in turn calls RCU, which then uses
> > > lockdep_assert_irqs_disabled().
> > > 
> > > Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu variables")
> > > Reported-by: Steven Rostedt 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > Signed-off-by: Ingo Molnar 
> > 
> > Reverting this linux-next commit fixed the RCU-list warnings at boot everywhere.
> > 
> 
> I think this happened because in this commit debug_lockdep_rcu_enabled()
> didn't adapt to the change that made lockdep_recursion a percpu
> variable?
> 
> Qian, mind to try the following?

Boqun, Paul, may I ask what's the latest with the fixes? I must admit that I got
lost in this thread, but I remember that the patch from Boqun below silences at
least quite a few of those warnings, if not all. The problem is that some of
those warnings trigger a lockdep circular-locking warning due to printk() being
called with some locks held, which in turn disables lockdep and makes our test
runs inefficient.

> 
> Although, arguably the problem still exists, i.e. we still have an RCU
> read-side critical section inside lock_acquire(), which may be called on
> a yet-to-online CPU, which RCU doesn't watch. I think this used to be OK
> because we don't "free" anything from lockdep, IOW, there is no
> synchronize_rcu() or call_rcu() that _needs_ to wait for the RCU
> read-side critical sections inside lockdep. But now we have lock class
> recycling, so it might be a problem.
> 
> That said, currently validate_chain() and lock class recycling are
> mutually excluded via graph_lock, so we are safe for this one ;-)
> 
> --->8
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 39334d2d2b37..35d9bab65b75 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -275,8 +275,8 @@ EXPORT_SYMBOL_GPL(rcu_callback_map);
>  
>  noinstr int notrace debug_lockdep_rcu_enabled(void)
>  {
> - return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
> -current->lockdep_recursion == 0;
> + return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE &&
> +__lockdep_enabled;
>  }
>  EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
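
Boqun's hypothesis quoted above can be illustrated with a toy model. None of this is the kernel's implementation: the counters and the `toy_*` functions are hypothetical, and the model simply assumes that after the conversion only a per-CPU counter is bumped while lockdep is running, so a check that still reads the per-task field never notices the recursion.

```c
#include <stdbool.h>

/* Toy state; the names echo the quoted diff but the definitions are assumptions. */
static int task_lockdep_recursion;	/* stands in for current->lockdep_recursion */
static int cpu_lockdep_recursion;	/* stands in for the per-CPU counter */
static bool toy_debug_locks = true;
static bool toy_rcu_scheduler_active = true;

/* Assumption: entering lockdep bumps only the per-CPU counter. */
void toy_lockdep_enter(void)
{
	cpu_lockdep_recursion++;
}

void toy_lockdep_exit(void)
{
	cpu_lockdep_recursion--;
}

/* Old check: still looks at the per-task field, which no longer changes. */
bool toy_rcu_enabled_old(void)
{
	return toy_rcu_scheduler_active && toy_debug_locks &&
	       task_lockdep_recursion == 0;
}

/* New check: consults the per-CPU counter, as the proposed diff intends. */
bool toy_rcu_enabled_new(void)
{
	return toy_rcu_scheduler_active && toy_debug_locks &&
	       cpu_lockdep_recursion == 0;
}
```

In the model, the old check keeps reporting "enabled" while lockdep is active, which is how RCU-list checks could fire from inside lockdep itself; the new check turns itself off for the duration.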




Re: [PATCH -next] arm64: Fix redefinition of init_new_context()

2020-10-27 Thread Qian Cai
On Mon, 2020-10-12 at 10:10 -0400, Qian Cai wrote:
> The linux-next commit c870baeede75 ("asm-generic: add generic MMU
> versions of mmu context functions") missed a case in the arm64/for-next
> branch.
> 
> Signed-off-by: Qian Cai 

Arnd, Stephen, can you apply this patch? Those compile errors are back again
in next-20201027.

> ---
>  arch/arm64/include/asm/mmu_context.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/include/asm/mmu_context.h b/arch/arm64/include/asm/mmu_context.h
> index da5f146e665b..cd5c33a50469 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -176,6 +176,7 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp)
>   */
>  void check_and_switch_context(struct mm_struct *mm);
>  
> +#define init_new_context init_new_context
>  static inline int
>  init_new_context(struct task_struct *tsk, struct mm_struct *mm)
>  {
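
The one-line fix relies on a common preprocessor idiom: an architecture defines a macro with the same name as its function so that a generic header can skip its fallback. The sketch below is a standalone illustration with hypothetical `toy_*` names; it assumes the generic header guards its fallback with `#ifndef`, which is what the fix implies.

```c
/* "arch" header: provides its own version and announces it via the macro. */
#define toy_init_new_context toy_init_new_context
static inline int toy_init_new_context(int asid)
{
	return asid + 1;	/* arch-specific behaviour (made up) */
}

/* "generic" header: fallback only when the arch did not announce one. */
#ifndef toy_init_new_context
static inline int toy_init_new_context(int asid)
{
	(void)asid;
	return 0;		/* generic no-op */
}
#endif

/* Caller sees exactly one definition, the arch one. */
int toy_call(int asid)
{
	return toy_init_new_context(asid);
}
```

Removing the `#define` line in this sketch makes both definitions visible and the compiler reports a redefinition, which matches the class of error named in the subject line.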



Re: [PATCH v6 2/4] KVM: x86: report negative values from wrmsr emulation to userspace

2020-10-26 Thread Qian Cai
On Wed, 2020-09-23 at 00:10 +0300, Maxim Levitsky wrote:
> This will allow the KVM to report such errors (e.g -ENOMEM)
> to the userspace.
> 
> Signed-off-by: Maxim Levitsky 

Reverting this and its dependency:

72f211ecaa80 KVM: x86: allow kvm_x86_ops.set_efer to return an error value

on top of linux-next (they have unfortunately also been merged into mainline
at the same time) fixed an issue where a simple Intel KVM guest is unable to
boot, as shown below.

.config: http://people.redhat.com/qcai/x86.config

qemu-kvm-4.2.0-34.module+el8.3.0+7976+077be4ec.x86_64

# /usr/libexec/qemu-kvm -name ubuntu-20.04-server-cloudimg -cpu host -smp 2 -m 2g \
    -hda ./ubuntu-20.04-server-cloudimg.qcow2 -cdrom ./ubuntu-20.04-server-cloudimg.iso \
    -nic user,hostfwd=tcp::-:22 -nographic

[1.141022] evm: Initialising EVM extended attributes:
[1.143344] evm: security.selinux
[1.144968] evm: security.SMACK64
[1.146574] evm: security.SMACK64EXEC
[1.148305] evm: security.SMACK64TRANSMUTE
[1.150215] evm: security.SMACK64MMAP
[1.151960] evm: security.apparmor
[1.153755] evm: security.ima
[1.155454] evm: security.capability
[1.155456] evm: HMAC attrs: 0x1
[1.162331] ata1.00: ATA-7: QEMU HARDDISK, 2.5+, max UDMA/100
[1.162635] PM:   Magic number: 8:937:635
[1.165607] ata1.00: 2147483648 sectors, multi 16: LBA48 
[1.169799] scsi 0:0:0:0: Direct-Access ATA  QEMU HARDDISK2.5+ PQ: 0 ANSI: 5
[1.174196] rtc_cmos 00:00: setting system clock to 2020-10-26T13:38:53 UTC (1603719533)
[1.178237] sd 0:0:0:0: Attached scsi generic sg0 type 0
[1.178293] sd 0:0:0:0: [sda] 2147483648 512-byte logical blocks: (1.10 TB/1.00 TiB)
[1.180567] ata2.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100
[1.183986] sd 0:0:0:0: [sda] Write Protect is off
[error: kvm run failed No such file or directory
RAX= RBX= RCX=0150 RDX=801c
RSI= RDI=0150 RBP=b67840083e40 RSP=b67840083e00
R8 =931dfda17608 R9 = R10=931dfda17848 R11=
R12= R13=00b7 R14=931dfd4013c0 R15=aa8f48d0
RIP=aa078894 RFL=0246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   00c0
CS =0010   00a09b00 DPL=0 CS64 [-RA]
SS =   00c0
DS =   00c0
FS =   00c0
GS = 931dfda0  00c0
LDT=   00c0
TR =0040 fe003000 206f 8b00 DPL=0 TSS64-busy
GDT= fe001000 007f
IDT= fe00 0fff
CR0=80050033 CR2= CR3=2960a001 CR4=00760ef0
DR0= DR1= DR2= DR3= 
DR6=fffe0ff0 DR7=0400
EFER=0d01
Code=dc 60 4e 00 4c 89 e0 41 5c 5d c3 0f 1f 44 00 00 89 f0 89 f9 <0f> 30 31 c0 0f 1f 44 00 00 c3 55 48 c1 e2 20 89 f6 48 09 d6 89 c2 48 89 e5 48 83 ec 08 89

> ---
>  arch/x86/kvm/emulate.c | 7 +--
>  arch/x86/kvm/x86.c | 6 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 1d450d7710d63..d855304f5a509 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -3702,13 +3702,16 @@ static int em_dr_write(struct x86_emulate_ctxt *ctxt)
>  static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
>  {
>   u64 msr_data;
> + int ret;
>  
>   msr_data = (u32)reg_read(ctxt, VCPU_REGS_RAX)
>   | ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
> - if (ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data))
> +
> + ret = ctxt->ops->set_msr(ctxt, reg_read(ctxt, VCPU_REGS_RCX), msr_data);
> + if (ret > 0)
>   return emulate_gp(ctxt, 0);
>  
> - return X86EMUL_CONTINUE;
> + return ret < 0 ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
>  }
>  
>  static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 063d70e736f7f..e4b07be450d4e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1612,8 +1612,12 @@ int kvm_emulate_wrmsr(struct kvm_vcpu *vcpu)
>  {
>   u32 ecx = kvm_rcx_read(vcpu);
>   u64 data = kvm_read_edx_eax(vcpu);
> + int ret = kvm_set_msr(vcpu, ecx, data);
>  
> - if (kvm_set_msr(vcpu, ecx, data)) {
> + if (ret < 0)
> + return ret;
> +
> + if (ret > 0) {
>   trace_kvm_msr_write_ex(ecx, data);
>   kvm_inject_gp(vcpu, 0);
>   return 1;



Re: kernel BUG at mm/page-writeback.c:2241 [ BUG_ON(PageWriteback(page); ]

2020-10-26 Thread Qian Cai
On Mon, 2020-10-26 at 07:55 -0600, Jens Axboe wrote:
> I've tried to reproduce this as well, to no avail. Qian, could you perhaps
> detail the setup? What kind of storage, kernel config, compiler, etc.

This should work:

https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config



Re: kernel BUG at mm/page-writeback.c:2241 [ BUG_ON(PageWriteback(page); ]

2020-10-26 Thread Qian Cai
On Mon, 2020-10-26 at 07:55 -0600, Jens Axboe wrote:
> I've tried to reproduce this as well, to no avail. Qian, could you perhaps
> detail the setup? What kind of storage, kernel config, compiler, etc.
> 

So far I have only been able to reproduce on this Intel platform:

HPE DL560 gen10
Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz
131072 MB memory, 1000 GB disk space (smartpqi nvme)

It was running some CPU and memory hotplug, KVM and then LTP workloads
(syscalls, mm, and fs). In the end, it was always this LTP test case that
triggered it:

# export LTPROOT; rwtest -N iogen01 -i 120s -s read,write -Da -Dv -n 2 500b:$TMPDIR/doio.f1.$$ 1000b:$TMPDIR/doio.f2.$$
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/fs/doio/rwtest

gcc-8.3.1-5.1.el8.x86_64
.config: https://git.code.tencent.com/cail/linux-mm/blob/master/x86.config


== storage information == 
[   46.131150] smartpqi :23:00.0: Online Firmware Activation enabled
[   46.138495] smartpqi :23:00.0: Serial Management Protocol enabled
[   46.145701] smartpqi :23:00.0: New Soft Reset Handshake enabled
[   46.152705] smartpqi :23:00.0: RAID IU Timeout enabled
[   46.158934] smartpqi :23:00.0: TMF IU Timeout enabled
[   46.168740] scsi host0: smartpqi
[   47.750425] nvme nvme1: 31/0/0 default/read/poll queues
[   47.826211] scsi 0:0:0:0: Direct-Access ATA  MM1000GEFQV  HPG8 PQ: 0 ANSI: 6
[   47.841752] smartpqi :23:00.0: added 0:0:0:0  Direct-Access ATA  MM1000GEFQV  AIO+ qd=32
[   47.941249]  nvme1n1: p1
[   47.943078] scsi 0:0:1:0: Enclosure HPE  Smart Adapter2.62 PQ: 0 ANSI: 5
[   47.958506] smartpqi :23:00.0: added 0:0:1:0 51402ec001f36448 Enclosure HPE  Smart AdapterAIO-
[   47.962844] nvme nvme0: 31/0/0 default/read/poll queues
[   48.015511] scsi 0:2:0:0: RAID  HPE  P408i-a SR Gen10 2.62 PQ: 0 ANSI: 5
[   48.029736] smartpqi :23:00.0: added 0:2:0:0  RAID HPE  P408i-a SR Gen10 
[   48.042711] smartpqi :4e:00.0: Microsemi Smart Family Controller found
[   48.149820]  nvme0n1: p1
[   48.956194] smartpqi :4e:00.0: Online Firmware Activation enabled
[   48.963399] smartpqi :4e:00.0: Serial Management Protocol enabled
[   48.970625] smartpqi :4e:00.0: New Soft Reset Handshake enabled
[   48.977645] smartpqi :4e:00.0: RAID IU Timeout enabled
[   48.983873] smartpqi :4e:00.0: TMF IU Timeout enabled
[   48.994687] scsi host1: smartpqi
[   50.612955] scsi 1:0:0:0: Enclosure HPE  Smart Adapter2.62 PQ: 0 ANSI: 5
[   50.628219] smartpqi :4e:00.0: added 1:0:0:0 51402ec01040ffc8 Enclosure HPE  Smart AdapterAIO-
[   50.681859] scsi 1:2:0:0: RAID  HPE  E208i-p SR Gen10 2.62 PQ: 0 ANSI: 5
[   50.695843] smartpqi :4e:00.0: added 1:2:0:0  RAID HPE  E208i-p SR Gen10 
[   50.856683] sd 0:0:0:0: [sda] 1953525168 512-byte logical blocks: (1.00 TB/932 GiB)
[   50.865195] sd 0:0:0:0: [sda] 4096-byte physical blocks
[   50.871354] sd 0:0:0:0: [sda] Write Protect is off
[   50.876956] sd 0:0:0:0: [sda] Mode Sense: 46 00 10 08
[   50.877299] sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   50.898824]  sda: sda1 sda2 sda3
[   50.943835] sd 0:0:0:0: [sda] Attached SCSI disk



Re: kernel BUG at mm/page-writeback.c:2241 [ BUG_ON(PageWriteback(page); ]

2020-10-22 Thread Qian Cai
On Thu, 2020-10-22 at 01:49 +0100, Matthew Wilcox wrote:
> On Wed, Oct 21, 2020 at 08:30:18PM -0400, Qian Cai wrote:
> > Today's linux-next started to trigger this. Wondering if anyone has any clue.
> 
> I've seen that occasionally too.  I changed that BUG_ON to VM_BUG_ON_PAGE
> to try to get a clue about it.  Good to know it's not the THP patches
> since they aren't in linux-next.
> 
> I don't understand how it can happen.  We have the page locked, and then we do:
> 
> 	if (PageWriteback(page)) {
> 		if (wbc->sync_mode != WB_SYNC_NONE)
> 			wait_on_page_writeback(page);
> 		else
> 			goto continue_unlock;
> 	}
> 
> 	VM_BUG_ON_PAGE(PageWriteback(page), page);
> 
> Nobody should be able to put this page under writeback while we have it
> locked ... right?  The page can be redirtied by the code that's supposed
> to be writing it back, but I don't see how anyone can make PageWriteback
> true while we're holding the page lock.

It happened again on today's linux-next:

[ 7613.579890][T55770] page:a4b35e02 refcount:3 mapcount:0 mapping:457ceb87 index:0x3e pfn:0x1cef4e
[ 7613.590594][T55770] aops:xfs_address_space_operations ino:805d85a dentry name:"doio.f1.55762"
[ 7613.599192][T55770] flags: 0xbfffc000bf(locked|waiters|referenced|uptodate|dirty|lru|active)
[ 7613.608596][T55770] raw: 00bfffc000bf ea0005027d48 88810eaec030 888231f3a6a8
[ 7613.617101][T55770] raw: 003e  0003 888143724000
[ 7613.625590][T55770] page dumped because: VM_BUG_ON_PAGE(PageWriteback(page))
[ 7613.632695][T55770] page->mem_cgroup:888143724000



kernel BUG at mm/page-writeback.c:2241 [ BUG_ON(PageWriteback(page); ]

2020-10-21 Thread Qian Cai
Today's linux-next started to trigger this. Wondering if anyone has any clue.

[ 9765.086947][T48578] LTP: starting iogen01 (export LTPROOT; rwtest -N iogen01 -i 120s -s read,write -Da -Dv -n 2 500b:$TMPDIR/doio.f1.$$ 1000b:$TMPDIR/doio.f2.$$)
[ 9839.423703][T97227] [ cut here ]
[ 9839.429819][T97227] kernel BUG at mm/page-writeback.c:2241!
[ 9839.435459][T97227] invalid opcode:  [#1] SMP KASAN PTI
[ 9839.441066][T97227] CPU: 56 PID: 97227 Comm: doio Tainted: G  IO  5.9.0-next-20201021 #1
[ 9839.450251][T97227] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 9839.459532][T97227] RIP: 0010:write_cache_pages+0x95f/0xeb0
[ 9839.465137][T97227] Code: 03 80 3c 02 00 0f 85 e5 04 00 00 49 8b 46 08 48 c7 c6 40 fb ca 9c 48 8d 50 ff a8 01 4c 0f 45 f2 4c 89 f7 e8 33 e3 08 00 0f 0b <0f> 0b 3d 00 00 08 00 0f 84 c3 00 00 00 48 8b 54 24 30 48 c1 ea 03
[ 9839.484715][T97227] RSP: 0018:c9003063f610 EFLAGS: 00010282
[ 9839.490672][T97227] RAX: 01bfffc0803f RBX: ea0024e68500 RCX: 9b8d232e
[ 9839.498547][T97227] RDX:  RSI: 0008 RDI: ea0024e68500
[ 9839.506422][T97227] RBP: c9003063f708 R08: f940049cd0a1 R09: f940049cd0a1
[ 9839.514297][T97227] R10: ea0024e68507 R11: f940049cd0a0 R12: c9003063fa20
[ 9839.522171][T97227] R13: ea0024e68500 R14: ea0024e68500 R15: dc00
[ 9839.530044][T97227] FS:  7f23ef12a740() GS:88a01f28() knlGS:
[ 9839.538878][T97227] CS:  0010 DS:  ES:  CR0: 80050033
[ 9839.545355][T97227] CR2: 01c79000 CR3: 000c15786004 CR4: 007706e0
[ 9839.553229][T97227] DR0:  DR1:  DR2: 
[ 9839.561104][T97227] DR3:  DR6: fffe0ff0 DR7: 0400
[ 9839.568976][T97227] PKRU: 5554
[ 9839.572395][T97227] Call Trace:
[ 9839.575559][T97227]  ? iomap_writepage_map+0x23a0/0x23a0
[ 9839.580900][T97227]  ? clear_page_dirty_for_io+0x990/0x990
[ 9839.586420][T97227]  ? rcu_read_lock_sched_held+0x9c/0xd0
[ 9839.591850][T97227]  ? rcu_read_lock_bh_held+0xb0/0xb0
[ 9839.597021][T97227]  ? find_held_lock+0x33/0x1c0
[ 9839.601670][T97227]  ? xfs_vm_writepages+0xc2/0x130
[ 9839.606575][T97227]  ? lock_downgrade+0x700/0x700
[ 9839.611305][T97227]  ? rcu_read_unlock+0x40/0x40
[ 9839.615949][T97227]  ? do_raw_spin_lock+0x121/0x290
[ 9839.620854][T97227]  ? rwlock_bug.part.1+0x90/0x90
[ 9839.625669][T97227]  iomap_writepages+0x3f/0xb0
iomap_writepages at fs/iomap/buffered-io.c:1576
[ 9839.630225][T97227]  xfs_vm_writepages+0xd7/0x130
[ 9839.634955][T97227]  ? xfs_vm_readahead+0x10/0x10
[ 9839.639686][T97227]  ? find_held_lock+0x33/0x1c0
[ 9839.644327][T97227]  do_writepages+0xcd/0x250
do_writepages at mm/page-writeback.c:2355
[ 9839.648707][T97227]  ? page_writeback_cpu_online+0x10/0x10
[ 9839.654224][T97227]  ? do_raw_spin_lock+0x121/0x290
[ 9839.659129][T97227]  ? rwlock_bug.part.1+0x90/0x90
[ 9839.663945][T97227]  ? rcu_read_lock_bh_held+0xb0/0xb0
[ 9839.669113][T97227]  __filemap_fdatawrite_range+0x250/0x310
__filemap_fdatawrite_range at mm/filemap.c:423
[ 9839.674717][T97227]  ? delete_from_page_cache_batch+0xaa0/0xaa0
[ 9839.680669][T97227]  ? rcu_read_lock_bh_held+0xb0/0xb0
[ 9839.685836][T97227]  ? rcu_read_lock_sched_held+0x9c/0xd0
[ 9839.691265][T97227]  file_write_and_wait_range+0x85/0xe0
file_write_and_wait_range at mm/filemap.c:761
[ 9839.696607][T97227]  xfs_file_fsync+0x192/0x710
xfs_file_fsync at fs/xfs/xfs_file.c:105
[ 9839.701163][T97227]  ? xfs_file_read_iter+0x490/0x490
[ 9839.706242][T97227]  ? up_write+0x148/0x460
[ 9839.710446][T97227]  ? iomap_write_begin+0xde0/0xde0
[ 9839.715438][T97227]  xfs_file_buffered_aio_write+0x82a/0xa30
generic_write_sync at include/linux/fs.h:2727
(inlined by) xfs_file_buffered_aio_write at fs/xfs/xfs_file.c:684
[ 9839.721129][T97227]  ? xfs_file_aio_write_checks+0x620/0x620
[ 9839.726820][T97227]  ? lockdep_hardirqs_on_prepare+0x3d0/0x3d0
[ 9839.732691][T97227]  new_sync_write+0x3aa/0x610
[ 9839.737247][T97227]  ? new_sync_read+0x600/0x600
[ 9839.741888][T97227]  ? vfs_write+0x36c/0x5b0
[ 9839.746181][T97227]  ? rcu_read_lock_any_held+0xcd/0xf0
[ 9839.751433][T97227]  vfs_write+0x3e9/0x5b0



Re: WARNING: suspicious RCU usage in io_init_identity

2020-10-16 Thread Qian Cai
On Fri, 2020-10-16 at 01:12 -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:b2926c10 Add linux-next specific files for 20201016
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=12fc877f90
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6160209582f55fb1
> dashboard link: https://syzkaller.appspot.com/bug?extid=4596e1fcf98efa7d1745
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+4596e1fcf98efa7d1...@syzkaller.appspotmail.com
> 
> =
> WARNING: suspicious RCU usage
> 5.9.0-next-20201016-syzkaller #0 Not tainted
> -
> include/linux/cgroup.h:494 suspicious rcu_dereference_check() usage!

Introduced by the linux-next commit:

07950f53f85b ("io_uring: COW io_identity on mismatch")

I can't find where the patchset was posted. Anyway, should this fix it?

--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1049,7 +1049,9 @@ static void io_init_identity(struct io_identity *id)
id->files = current->files;
id->mm = current->mm;
 #ifdef CONFIG_BLK_CGROUP
+   rcu_read_lock();
id->blkcg_css = blkcg_css();
+   rcu_read_unlock();
 #endif
id->creds = current_cred();
id->nsproxy = current->nsproxy;
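
Why bracketing the lookup silences the warning can be shown with a toy model of a checked dereference. This is a userspace sketch with hypothetical `toy_*` names, not the kernel's RCU API: the accessor merely counts a complaint when no read-side section is open. Whether the pointer may still be used after the section ends depends on reference counting that the sketch does not model.

```c
static int toy_rcu_depth;	/* nesting depth of the toy read-side section */
static int toy_warnings;	/* number of "suspicious usage" reports */

void toy_rcu_read_lock(void)
{
	toy_rcu_depth++;
}

void toy_rcu_read_unlock(void)
{
	toy_rcu_depth--;
}

/* Toy checked dereference: complain when no read-side section is open. */
const void *toy_rcu_dereference(const void *p)
{
	if (toy_rcu_depth == 0)
		toy_warnings++;
	return p;
}

static const int toy_css = 7;	/* stand-in for the object being looked up */

/* Unprotected lookup, modelled on the code before the proposed fix. */
const void *toy_lookup_unprotected(void)
{
	return toy_rcu_dereference(&toy_css);
}

/* Lookup bracketed by the read-side markers, as in the proposed fix. */
const void *toy_lookup_protected(void)
{
	const void *p;

	toy_rcu_read_lock();
	p = toy_rcu_dereference(&toy_css);
	toy_rcu_read_unlock();
	return p;
}
```

Both lookups return the same pointer in this model; only the unprotected one increments the warning counter.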

> 
> other info that might help us debug this:
> 
> 
> rcu_scheduler_active = 2, debug_locks = 1
> no locks held by syz-executor.0/8301.
> 
> stack backtrace:
> CPU: 0 PID: 8301 Comm: syz-executor.0 Not tainted 5.9.0-next-20201016-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x198/0x1fb lib/dump_stack.c:118
>  task_css include/linux/cgroup.h:494 [inline]
>  blkcg_css include/linux/blk-cgroup.h:224 [inline]
>  blkcg_css include/linux/blk-cgroup.h:217 [inline]
>  io_init_identity+0x3a9/0x450 fs/io_uring.c:1052
>  io_uring_alloc_task_context+0x176/0x250 fs/io_uring.c:7730
>  io_uring_add_task_file+0x10d/0x180 fs/io_uring.c:8653
>  io_uring_get_fd fs/io_uring.c:9144 [inline]
>  io_uring_create fs/io_uring.c:9308 [inline]
>  io_uring_setup+0x2727/0x3660 fs/io_uring.c:9342
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x45de59
> Code: 0d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7f7e11fe1bf8 EFLAGS: 0206 ORIG_RAX: 01a9
> RAX: ffda RBX: 2080 RCX: 0045de59
> RDX: 206d4000 RSI: 2080 RDI: 0087
> RBP: 0118c020 R08: 2040 R09: 2040
> R10: 2000 R11: 0206 R12: 206d4000
> R13: 20ee7000 R14: 2040 R15: 2000
> 
> 
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.



Re: Unbreakable loop in fuse_fill_write_pages()

2020-10-15 Thread Qian Cai
On Tue, 2020-10-13 at 14:40 -0400, Vivek Goyal wrote:
> > == the thread is stuck in the loop ==
> > [10813.290694] task:trinity-c33 state:D stack:25888 pid:254219 ppid: 87180 flags:0x4004
> > [10813.292671] Call Trace:
> > [10813.293379]  __schedule+0x71d/0x1b50
> > [10813.294182]  ? __sched_text_start+0x8/0x8
> > [10813.295146]  ? mark_held_locks+0xb0/0x110
> > [10813.296117]  schedule+0xbf/0x270
> > [10813.296782]  ? __lock_page_killable+0x276/0x830
> > [10813.297867]  io_schedule+0x17/0x60
> > [10813.298772]  __lock_page_killable+0x33b/0x830
> 
> This seems to suggest that filemap_fault() is blocked on page lock and
> is sleeping. For some reason it never wakes up. Not sure why.
> 
> And this will be called from.
> 
> fuse_fill_write_pages()
>iov_iter_fault_in_readable()
> 
> So fuse code will take inode_lock() and then looks like same process
> is sleeping waiting on page lock. And rest of the processes get blocked
> behind inode lock.
> 
> If we are woken up (while waiting on page lock), we should make forward
> progress. Question is what page it is and why the entity which is
> holding lock is not releasing lock.

FYI, it was mentioned that this is likely a deadlock in FUSE:

https://lore.kernel.org/linux-fsdevel/CAHk-=wh9Eu-gNHzqgfvUAAiO=vj+pwnzxkv+tx55xhgpfy+...@mail.gmail.com/





[PATCH -next] Revert "powerpc/pci: unmap legacy INTx interrupts when a PHB is removed"

2020-10-14 Thread Qian Cai
This reverts commit 3a3181e16fbde752007759f8759d25e0ff1fc425 which
causes memory corruptions on POWER9 NV.

Signed-off-by: Qian Cai 
---
 arch/powerpc/include/asm/pci-bridge.h |   6 --
 arch/powerpc/kernel/pci-common.c  | 114 --
 2 files changed, 120 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index d21e070352dc..d2a2a14e56f9 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -48,9 +48,6 @@ struct pci_controller_ops {
 
 /*
  * Structure of a PCI controller (host bridge)
- *
- * @irq_count: number of interrupt mappings
- * @irq_map: interrupt mappings
  */
 struct pci_controller {
struct pci_bus *bus;
@@ -130,9 +127,6 @@ struct pci_controller {
 
void *private_data;
struct npu *npu;
-
-   unsigned int irq_count;
-   unsigned int *irq_map;
 };
 
 /* These are used for config access before all the PCI probing
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index deb831f0ae13..be108616a721 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -353,115 +353,6 @@ struct pci_controller *pci_find_controller_for_domain(int 
domain_nr)
return NULL;
 }
 
-/*
- * Assumption is made on the interrupt parent. All interrupt-map
- * entries are considered to have the same parent.
- */
-static int pcibios_irq_map_count(struct pci_controller *phb)
-{
-   const __be32 *imap;
-   int imaplen;
-   struct device_node *parent;
-   u32 intsize, addrsize, parintsize, paraddrsize;
-
-   if (of_property_read_u32(phb->dn, "#interrupt-cells", &intsize))
-   return 0;
-   if (of_property_read_u32(phb->dn, "#address-cells", &addrsize))
-   return 0;
-
-   imap = of_get_property(phb->dn, "interrupt-map", &imaplen);
-   if (!imap) {
-   pr_debug("%pOF : no interrupt-map\n", phb->dn);
-   return 0;
-   }
-   imaplen /= sizeof(u32);
-   pr_debug("%pOF : imaplen=%d\n", phb->dn, imaplen);
-
-   if (imaplen < (addrsize + intsize + 1))
-   return 0;
-
-   imap += intsize + addrsize;
-   parent = of_find_node_by_phandle(be32_to_cpup(imap));
-   if (!parent) {
-   pr_debug("%pOF : no imap parent found !\n", phb->dn);
-   return 0;
-   }
-
-   if (of_property_read_u32(parent, "#interrupt-cells", &parintsize)) {
-   pr_debug("%pOF : parent lacks #interrupt-cells!\n", phb->dn);
-   return 0;
-   }
-
-   if (of_property_read_u32(parent, "#address-cells", &paraddrsize))
-   paraddrsize = 0;
-
-   return imaplen / (addrsize + intsize + 1 + paraddrsize + parintsize);
-}
-
-static void pcibios_irq_map_init(struct pci_controller *phb)
-{
-   phb->irq_count = pcibios_irq_map_count(phb);
-   if (phb->irq_count < PCI_NUM_INTX)
-   phb->irq_count = PCI_NUM_INTX;
-
-   pr_debug("%pOF : interrupt map #%d\n", phb->dn, phb->irq_count);
-
-   phb->irq_map = kcalloc(phb->irq_count, sizeof(unsigned int),
-  GFP_KERNEL);
-}
-
-static void pci_irq_map_register(struct pci_dev *pdev, unsigned int virq)
-{
-   struct pci_controller *phb = pci_bus_to_host(pdev->bus);
-   int i;
-
-   if (!phb->irq_map)
-   return;
-
-   for (i = 0; i < phb->irq_count; i++) {
-   /*
-* Look for an empty or an equivalent slot, as INTx
-* interrupts can be shared between adapters.
-*/
-   if (phb->irq_map[i] == virq || !phb->irq_map[i]) {
-   phb->irq_map[i] = virq;
-   break;
-   }
-   }
-
-   if (i == phb->irq_count)
-   pr_err("PCI:%s all platform interrupts mapped\n",
-  pci_name(pdev));
-}
-
-/*
- * Clearing the mapped interrupts will also clear the underlying
- * mappings of the ESB pages of the interrupts when under XIVE. It is
- * a requirement of PowerVM to clear all memory mappings before
- * removing a PHB.
- */
-static void pci_irq_map_dispose(struct pci_bus *bus)
-{
-   struct pci_controller *phb = pci_bus_to_host(bus);
-   int i;
-
-   if (!phb->irq_map)
-   return;
-
-   pr_debug("PCI: Clearing interrupt mappings for PHB %04x:%02x...\n",
-pci_domain_nr(bus), bus->number);
-   for (i = 0; i < phb->irq_count; i++)
-   irq_dispose_mapping(phb->irq_map[i]);
-
-   kfree(phb->irq_map);
-}
-
-void pcibios_remove_bus(struct pci_bus *bus)
-{
-   pci_irq_map_dispose(bus);
-}
-EXPORT_SYMBOL_GPL(pcibios_remove_bus);
-
 /*
  * Reads the interrupt pin to determine if interrupt is

Re: Unbreakable loop in fuse_fill_write_pages()

2020-10-13 Thread Qian Cai
On Tue, 2020-10-13 at 15:57 -0400, Vivek Goyal wrote:
> Hmm..., So how do I reproduce it. Just run trinity as root and it will
> reproduce after some time?

You only need to run it as an unprivileged user after mounting virtiofs on /tmp
(trinity will need to create and use files there), using as many CPUs as
possible. Also, make sure your guest's memory usage does not exceed the host's
/dev/shm size. Otherwise, horrible things could happen.

$ trinity -C 48 --arch 64

It might dump core or exit for some other unrelated reason, so just keep
retrying. It is best to apply your recent patch for the virtiofs false-positive
warning first, so that the warning won't taint the kernel, which would stop
trinity. Today, I was able to reproduce it twice, within half an hour each time.





Re: Unbreakable loop in fuse_fill_write_pages()

2020-10-13 Thread Qian Cai
On Tue, 2020-10-13 at 14:58 -0400, Vivek Goyal wrote:

> I am wondering if virtiofsd still alive and responding to requests? I
> see another task which is blocked on getdents() for more than 120s.
> 
> [10580.142571][  T348] INFO: task trinity-c36:254165 blocked for more than 123
> +seconds.
> [10580.143924][  T348]   Tainted: G   O5.9.0-next-20201013+ #2
> [10580.145158][  T348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> +disables this message.
> [10580.146636][  T348] task:trinity-c36 state:D stack:26704 pid:254165
> ppid:
> +87180 flags:0x0004
> [10580.148260][  T348] Call Trace:
> [10580.148789][  T348]  __schedule+0x71d/0x1b50
> [10580.149532][  T348]  ? __sched_text_start+0x8/0x8
> [10580.150343][  T348]  schedule+0xbf/0x270
> [10580.151044][  T348]  schedule_preempt_disabled+0xc/0x20
> [10580.152006][  T348]  __mutex_lock+0x9f1/0x1360
> [10580.152777][  T348]  ? __fdget_pos+0x9c/0xb0
> [10580.153484][  T348]  ? mutex_lock_io_nested+0x1240/0x1240
> [10580.154432][  T348]  ? find_held_lock+0x33/0x1c0
> [10580.155220][  T348]  ? __fdget_pos+0x9c/0xb0
> [10580.155934][  T348]  __fdget_pos+0x9c/0xb0
> [10580.156660][  T348]  __x64_sys_getdents+0xff/0x230
> 
> May be virtiofsd crashed and hence no requests are completing leading
> to a hard lockup?

Virtiofsd is still working. Once this happened, I manually created a file on the
guest (in virtiofs), and I could then see its content from the host.



Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed

2020-10-13 Thread Qian Cai
On Wed, 2020-09-23 at 09:06 +0200, Cédric Le Goater wrote:
> On 9/23/20 2:33 AM, Qian Cai wrote:
> > On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
> > > When a passthrough IO adapter is removed from a pseries machine using
> > > hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
> > > guest OS to clear all page table entries related to the adapter. If
> > > some are still present, the RTAS call which isolates the PCI slot
> > > returns error 9001 "valid outstanding translations" and the removal of
> > > the IO adapter fails. This is because when the PHBs are scanned, Linux
> > > maps automatically the INTx interrupts in the Linux interrupt number
> > > space but these are never removed.
> > > 
> > > To solve this problem, we introduce a PPC platform specific
> > > pcibios_remove_bus() routine which clears all interrupt mappings when
> > > the bus is removed. This also clears the associated page table entries
> > > of the ESB pages when using XIVE.
> > > 
> > > For this purpose, we record the logical interrupt numbers of the
> > > mapped interrupt under the PHB structure and let pcibios_remove_bus()
> > > do the clean up.
> > > 
> > > Since some PCI adapters, like GPUs, use the "interrupt-map" property
> > > to describe interrupt mappings other than the legacy INTx interrupts,
> > > we can not restrict the size of the mapping array to PCI_NUM_INTX. The
> > > number of interrupt mappings is computed from the "interrupt-map"
> > > property and the mapping array is allocated accordingly.
> > > 
> > > Cc: "Oliver O'Halloran" 
> > > Cc: Alexey Kardashevskiy 
> > > Signed-off-by: Cédric Le Goater 
> > 
> > Some syscall fuzzing will trigger this on POWER9 NV where the traces pointed
> > to
> > this patch.
> > 
> > .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> 
> OK. The patch is missing a NULL assignement after kfree() and that
> might be the issue. 
> 
> I did try PHB removal under PowerNV, so I would like to understand 
> how we managed to remove twice the PCI bus and possibly reproduce. 
> Any chance we could grab what the syscall fuzzer (syzkaller) did ? 
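
For reference, the "NULL assignment after kfree()" idea quoted above can be
sketched in userspace C. This is only a minimal model, not the powerpc code:
struct phb_model and both helpers are hypothetical names, calloc()/free() stand
in for kcalloc()/kfree(), and whether this alone would cure the reported
corruption is not established in this thread.

```c
#include <stdlib.h>

/* Hypothetical, reduced stand-in for the two fields the patch added. */
struct phb_model {
	unsigned int irq_count;
	unsigned int *irq_map;
};

/* Allocate the mapping array; returns 0 on success, -1 on failure. */
static int phb_model_init(struct phb_model *phb, unsigned int count)
{
	phb->irq_map = calloc(count, sizeof(*phb->irq_map));
	if (!phb->irq_map) {
		phb->irq_count = 0;
		return -1;
	}
	phb->irq_count = count;
	return 0;
}

/*
 * Release the mapping array. Clearing the pointer afterwards makes a
 * second call hit the early return instead of freeing the same memory
 * again. Returns 1 if memory was released, 0 if there was nothing to do.
 */
static int phb_model_dispose(struct phb_model *phb)
{
	if (!phb->irq_map)
		return 0;

	free(phb->irq_map);
	phb->irq_map = NULL;
	phb->irq_count = 0;
	return 1;
}
```

With the pointer cleared, calling phb_model_dispose() twice on the same
structure is harmless in this model; without it, the second call would free
the same object again, which is the pattern the "Object already free" report
below complains about.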

Any update on this? Maybe Michael or Stephen could drop this for now, so our
fuzzing could continue to find something else new?

It can still be reproduced on today's linux-next. BTW, this is running trinity
as an unprivileged user. This is a snapshot of each fuzzing thread when
this happens.

http://people.redhat.com/qcai/pcibios_remove_bus/trinity-post-mortem.log

It can be reproduced by simply running this for a while:

$ trinity -C  --arch 64

[19611.946827][T1717146] pci_bus 0035:03: busn_res: [bus 03-07] is released
[19611.950956][T1717146] pci_bus 0035:08: busn_res: [bus 08-0c] is released
[19611.951260][T1717146] 
=
[19611.952336][T1717146] BUG kmalloc-16 (Tainted: GW  O ): Object 
already free
[19611.952365][T1717146] 
-
[19611.952365][T1717146] 
[19611.952411][T1717146] Disabling lock debugging due to kernel taint
[19611.952438][T1717146] INFO: Allocated in pcibios_scan_phb+0x104/0x3e0 
age=1960714 cpu=4 pid=1
[19611.952481][T1717146]__slab_alloc+0xa4/0xf0
[19611.952500][T1717146]__kmalloc+0x294/0x330
[19611.952519][T1717146]pcibios_scan_phb+0x104/0x3e0
[19611.952549][T1717146]pcibios_init+0x84/0x124
[19611.952578][T1717146]do_one_initcall+0xac/0x528
[19611.952599][T1717146]kernel_init_freeable+0x35c/0x3fc
[19611.952618][T1717146]kernel_init+0x24/0x148
[19611.952646][T1717146]ret_from_kernel_thread+0x5c/0x80
[19611.952665][T1717146] INFO: Freed in pcibios_remove_bus+0x70/0x90 age=0 
cpu=16 pid=1717146
[19611.952691][T1717146]kfree+0x49c/0x510
[19611.952700][T1717146]pcibios_remove_bus+0x70/0x90
[19611.952711][T1717146]pci_remove_bus+0xe4/0x110
[19611.952730][T1717146]pci_remove_bus_device+0x74/0x170
[19611.952749][T1717146]pci_remove_bus_device+0x4c/0x170
[19611.952768][T1717146]pci_stop_and_remove_bus_device_locked+0x34/0x50
[19611.952798][T1717146]remove_store+0xc0/0xe0
[19611.952819][T1717146]dev_attr_store+0x30/0x50
[19611.952852][T1717146]sysfs_kf_write+0x68/0xb0
[19611.952870][T1717146]kernfs_fop_write+0x114/0x260
[19611.952904][T1717146]vfs_write+0xe4/0x260
[19611.952922][T1717146]ksys_write+0x74/0x130
[19611.952951][T1717146]system_call_exception+0xf8/0x1d0
[19611.952970][T1717146]system_

Re: Unbreakable loop in fuse_fill_write_pages()

2020-10-13 Thread Qian Cai
On Tue, 2020-10-13 at 14:58 -0400, Vivek Goyal wrote:
> I am wondering if virtiofsd still alive and responding to requests? I
> see another task which is blocked on getdents() for more than 120s.
> 
> [10580.142571][  T348] INFO: task trinity-c36:254165 blocked for more than 123
> +seconds.
> [10580.143924][  T348]   Tainted: G   O5.9.0-next-20201013+ #2
> [10580.145158][  T348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> +disables this message.
> [10580.146636][  T348] task:trinity-c36 state:D stack:26704 pid:254165
> ppid:
> +87180 flags:0x0004
> [10580.148260][  T348] Call Trace:
> [10580.148789][  T348]  __schedule+0x71d/0x1b50
> [10580.149532][  T348]  ? __sched_text_start+0x8/0x8
> [10580.150343][  T348]  schedule+0xbf/0x270
> [10580.151044][  T348]  schedule_preempt_disabled+0xc/0x20
> [10580.152006][  T348]  __mutex_lock+0x9f1/0x1360
> [10580.152777][  T348]  ? __fdget_pos+0x9c/0xb0
> [10580.153484][  T348]  ? mutex_lock_io_nested+0x1240/0x1240
> [10580.154432][  T348]  ? find_held_lock+0x33/0x1c0
> [10580.155220][  T348]  ? __fdget_pos+0x9c/0xb0
> [10580.155934][  T348]  __fdget_pos+0x9c/0xb0
> [10580.156660][  T348]  __x64_sys_getdents+0xff/0x230
> 
> May be virtiofsd crashed and hence no requests are completing leading
> to a hard lockup?

No, it did not crash. After I forcibly closed the guest, the virtiofsd daemon
exited normally. However, I can't tell whether the virtiofsd daemon was still
functioning normally at that point. I'll enable debugging and retry to see if
there is anything interesting.



Unbreakable loop in fuse_fill_write_pages()

2020-10-13 Thread Qian Cai
Running some fuzzing on virtiofs as an unprivileged user on today's linux-next
could trigger the soft-lockups below.

# virtiofsd --socket-path=/tmp/vhostqemu -o source=$TESTDIR -o cache=always -o no_posix_lock

Basically, everything was blocking on inode_lock(inode) because one thread
(trinity-c33) was holding it but was stuck in the loop in
fuse_fill_write_pages(), unable to exit for more than 10 minutes before I
executed sysrq-t. Afterwards, the system was totally unresponsive:

kernel:NMI watchdog: Watchdog detected hard LOCKUP on cpu 8

To exit the loop, it needs

"tmp" passed to iov_iter_advance(ii, tmp) to be non-zero in each iteration,

and the loop condition to become false:

} while (iov_iter_count(ii) && count < fc->max_write &&
 ap->num_pages < max_pages && offset == 0);
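
To illustrate the exit conditions above, here is a rough userspace model of
such a copy loop. It is a sketch under assumptions, not the code in
fs/fuse/file.c: fill_pages_model(), copy_fn and max_spins are hypothetical,
and the model assumes that a zero-byte copy sends the loop back to retry the
same chunk. It says nothing about why the real thread made no progress.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the atomic user copy: returns the number of
 * bytes copied, or 0 if nothing could be copied this time.
 */
typedef size_t (*copy_fn)(size_t bytes);

/*
 * Returns the number of loop iterations needed to consume "total" bytes,
 * or -1 if max_spins consecutive retries made no progress. The cap only
 * exists so that the model terminates.
 */
static long fill_pages_model(size_t total, size_t chunk, copy_fn copy,
			     long max_spins)
{
	size_t remaining = total;
	long iterations = 0, spins = 0;

	assert(chunk > 0);
	while (remaining) {
		size_t bytes = remaining < chunk ? remaining : chunk;
		size_t tmp = copy(bytes);

		iterations++;
		if (!tmp) {		/* nothing copied: retry the chunk */
			if (++spins >= max_spins)
				return -1;
			continue;
		}
		spins = 0;
		remaining -= tmp;	/* the "advance by tmp" step */
	}
	return iterations;
}

/* A copy that always succeeds in full. */
static size_t copy_ok(size_t bytes)
{
	return bytes;
}

/* A copy that never makes progress. */
static size_t copy_stuck(size_t bytes)
{
	(void)bytes;
	return 0;
}
```

With copy_ok the model finishes once everything is consumed; with copy_stuck
"remaining" never shrinks, so only the artificial cap ends the loop.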

== the thread is stuck in the loop ==
[10813.290694] task:trinity-c33 state:D stack:25888 pid:254219 ppid: 87180
flags:0x4004
[10813.292671] Call Trace:
[10813.293379]  __schedule+0x71d/0x1b50
[10813.294182]  ? __sched_text_start+0x8/0x8
[10813.295146]  ? mark_held_locks+0xb0/0x110
[10813.296117]  schedule+0xbf/0x270
[10813.296782]  ? __lock_page_killable+0x276/0x830
[10813.297867]  io_schedule+0x17/0x60
[10813.298772]  __lock_page_killable+0x33b/0x830
[10813.299695]  ? wait_on_page_bit+0x710/0x710
[10813.300609]  ? __lock_page_or_retry+0x3c0/0x3c0
[10813.301894]  ? up_read+0x1a3/0x730
[10813.302791]  ? page_cache_free_page.isra.45+0x390/0x390
[10813.304077]  filemap_fault+0x2bd/0x2040
[10813.305019]  ? read_cache_page_gfp+0x10/0x10
[10813.306041]  ? lock_downgrade+0x700/0x700
[10813.306958]  ? replace_page_cache_page+0x1130/0x1130
[10813.308124]  __do_fault+0xf5/0x530
[10813.308968]  handle_mm_fault+0x1c0e/0x25b0
[10813.309955]  ? copy_page_range+0xfe0/0xfe0
[10813.310895]  do_user_addr_fault+0x383/0x820
[10813.312084]  exc_page_fault+0x56/0xb0
[10813.312979]  asm_exc_page_fault+0x1e/0x30
[10813.313978] RIP: 0010:iov_iter_fault_in_readable+0x271/0x350
fault_in_pages_readable at include/linux/pagemap.h:745
(inlined by) iov_iter_fault_in_readable at lib/iov_iter.c:438
[10813.315293] Code: 48 39 d7 0f 82 1a ff ff ff 0f 01 cb 0f ae e8 44 89 c0 8a 0a
0f 01 ca 88 4c 24 70 85 c0 74 da e9 f8 fe ff ff 0f 01 cb 0f ae e8 <8a> 11 0f 01
ca 88 54 24 30 85 c0 0f 85 04 ff ff ff 48 29 ee e9
 45
[10813.319196] RSP: 0018:c90017ccf830 EFLAGS: 00050246
[10813.320446] RAX:  RBX: 192002f99f08 RCX: 7fe284f1004c
[10813.322202] RDX: 0001 RSI: 1000 RDI: 8887a7664000
[10813.323729] RBP: 1000 R08:  R09: 
[10813.325282] R10: c90017ccfd48 R11: ed102789d5ff R12: 8887a7664020
[10813.326898] R13: c90017ccfd40 R14: dc00 R15: 00e0df6a
[10813.328456]  ? iov_iter_revert+0x8e0/0x8e0
[10813.329404]  ? copyin+0x96/0xc0
[10813.330230]  ? iov_iter_copy_from_user_atomic+0x1f0/0xa40
[10813.331742]  fuse_perform_write+0x3eb/0xf20 [fuse]
fuse_fill_write_pages at fs/fuse/file.c:1150
(inlined by) fuse_perform_write at fs/fuse/file.c:1226
[10813.332880]  ? fuse_file_fallocate+0x5f0/0x5f0 [fuse]
[10813.334090]  fuse_file_write_iter+0x6b7/0x900 [fuse]
[10813.335191]  do_iter_readv_writev+0x42b/0x6d0
[10813.336161]  ? new_sync_write+0x610/0x610
[10813.337194]  do_iter_write+0x11f/0x5b0
[10813.338177]  ? __sb_start_write+0x229/0x2d0
[10813.339169]  vfs_writev+0x16d/0x2d0
[10813.339973]  ? vfs_iter_write+0xb0/0xb0
[10813.340950]  ? __fdget_pos+0x9c/0xb0
[10813.342039]  ? rcu_read_lock_sched_held+0x9c/0xd0
[10813.343120]  ? rcu_read_lock_bh_held+0xb0/0xb0
[10813.344104]  ? find_held_lock+0x33/0x1c0
[10813.345050]  do_writev+0xfb/0x1e0
[10813.345920]  ? vfs_writev+0x2d0/0x2d0
[10813.346802]  ? lockdep_hardirqs_on_prepare+0x27c/0x3d0
[10813.348026]  ? syscall_enter_from_user_mode+0x1c/0x50
[10813.349197]  do_syscall_64+0x33/0x40
[10813.350026]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

== soft-lockups ==
[10579.953730][  T348]   Tainted: G   O  5.9.0-next-20201013+ #2
[10579.955016][  T348] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[10579.956467][  T348] task:trinity-c25 state:D stack:26704 pid:253906 
ppid: 87180 flags:0x4002
[10579.958028][  T348] Call Trace:
[10579.958609][  T348]  __schedule+0x71d/0x1b50
[10579.959309][  T348]  ? __sched_text_start+0x8/0x8
[10579.960144][  T348]  schedule+0xbf/0x270
[10579.960774][  T348]  rwsem_down_write_slowpath+0x8ea/0xf30
[10579.961828][  T348]  ? rwsem_mark_wake+0x8d0/0x8d0
[10579.962675][  T348]  ? lockdep_hardirqs_on_prepare+0x3d0/0x3d0
[10579.963721][  T348]  ? rcu_read_lock_sched_held+0x9c/0xd0
[10579.964658][  T348]  ? lock_acquire+0x1c8/0x820
[10579.965453][  T348]  ? down_write+0x138/0x150
[10579.966237][  T348]  ? down_write+0xb3/0x150
[10579.966994][  T348]  down_write+0x138/0x150
[10579.967787][  T348]  ? down_write_killable_nested+0x170/0x170
[10579.968844][  T348]  fuse_flush+0x1a0/0x500 [fuse]
[10579.969732][  T348]  ? 

Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-12 Thread Qian Cai
On Mon, 2020-10-12 at 11:11 +0800, Boqun Feng wrote:
> Hi,
> 
> On Fri, Oct 09, 2020 at 09:41:24AM -0400, Qian Cai wrote:
> > On Fri, 2020-10-09 at 07:58 +, tip-bot2 for Peter Zijlstra wrote:
> > > The following commit has been merged into the locking/core branch of tip:
> > > 
> > > Commit-ID: 4d004099a668c41522242aa146a38cc4eb59cb1e
> > > Gitweb:
> > > https://git.kernel.org/tip/4d004099a668c41522242aa146a38cc4eb59cb1e
> > > Author:Peter Zijlstra 
> > > AuthorDate:Fri, 02 Oct 2020 11:04:21 +02:00
> > > Committer: Ingo Molnar 
> > > CommitterDate: Fri, 09 Oct 2020 08:53:30 +02:00
> > > 
> > > lockdep: Fix lockdep recursion
> > > 
> > > Steve reported that lockdep_assert*irq*(), when nested inside lockdep
> > > itself, will trigger a false-positive.
> > > 
> > > One example is the stack-trace code, as called from inside lockdep,
> > > triggering tracing, which in turn calls RCU, which then uses
> > > lockdep_assert_irqs_disabled().
> > > 
> > > Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-
> > > cpu
> > > variables")
> > > Reported-by: Steven Rostedt 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > Signed-off-by: Ingo Molnar 
> > 
> > Reverting this linux-next commit fixed booting RCU-list warnings everywhere.
> > 
> 
> I think this happened because in this commit debug_lockdep_rcu_enabled()
> didn't adopt to the change that made lockdep_recursion a percpu
> variable?
> 
> Qian, mind to try the following?

Yes, it works fine.

> 
> Although, arguably the problem still exists, i.e. we still have an RCU
> read-side critical section inside lock_acquire(), which may be called on
> a yet-to-online CPU, which RCU doesn't watch. I think this used to be OK
> because we don't "free" anything from lockdep, IOW, there is no
> synchronize_rcu() or call_rcu() that _needs_ to wait for the RCU
> read-side critical sections inside lockdep. But now we lock class
> recycling, so it might be a problem.
> 
> That said, currently validate_chain() and lock class recycling are
> mutually excluded via graph_lock, so we are safe for this one ;-)
> 
> --->8
> diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> index 39334d2d2b37..35d9bab65b75 100644
> --- a/kernel/rcu/update.c
> +++ b/kernel/rcu/update.c
> @@ -275,8 +275,8 @@ EXPORT_SYMBOL_GPL(rcu_callback_map);
>  
>  noinstr int notrace debug_lockdep_rcu_enabled(void)
>  {
> - return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE && debug_locks &&
> -current->lockdep_recursion == 0;
> + return rcu_scheduler_active != RCU_SCHEDULER_INACTIVE &&
> +__lockdep_enabled;
>  }
>  EXPORT_SYMBOL_GPL(debug_lockdep_rcu_enabled);
> 



[PATCH -next] arm64: Fix redefinition of init_new_context()

2020-10-12 Thread Qian Cai
The linux-next commit c870baeede75 ("asm-generic: add generic MMU
versions of mmu context functions") missed a case in the arm64/for-next
branch.

Signed-off-by: Qian Cai 
---
 arch/arm64/include/asm/mmu_context.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/mmu_context.h 
b/arch/arm64/include/asm/mmu_context.h
index da5f146e665b..cd5c33a50469 100644
--- a/arch/arm64/include/asm/mmu_context.h
+++ b/arch/arm64/include/asm/mmu_context.h
@@ -176,6 +176,7 @@ static inline void cpu_replace_ttbr1(pgd_t *pgdp)
  */
 void check_and_switch_context(struct mm_struct *mm);
 
+#define init_new_context init_new_context
 static inline int
 init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 {
-- 
2.28.0
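
For readers unfamiliar with the idiom used in this one-liner, the following
is a reduced, single-file sketch of how a "#define name name" line lets an
architecture override a generic default. It assumes the generic header guards
its default with #ifndef; the int-based signatures and return values are
hypothetical simplifications, not the kernel prototypes.

```c
/* --- what an architecture header would contain (simplified model) --- */

/* Announce that this architecture brings its own init_new_context(). */
#define init_new_context init_new_context
static inline int init_new_context(int task_id, int mm_id)
{
	(void)task_id;
	return mm_id + 100;	/* arbitrary marker for the arch version */
}

/* --- what a generic fallback header would contain (simplified model) --- */

/* Only provide the default when no architecture version was announced. */
#ifndef init_new_context
static inline int init_new_context(int task_id, int mm_id)
{
	(void)task_id;
	(void)mm_id;
	return 0;
}
#endif
```

If the #define line is removed, both definitions are compiled and the build
fails with a redefinition error, which matches the subject of this patch.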



Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-09 Thread Qian Cai
On Fri, 2020-10-09 at 13:36 -0400, Qian Cai wrote:
> Back to x86, we have:
> 
> start_secondary()
>   smp_callin()
> apic_ap_setup()
>   setup_local_APIC()
> printk() in certain conditions.
> 
> which is before smp_store_cpu_info().
> 
> Can't we add a rcu_cpu_starting() at the very top for each start_secondary(),
> secondary_start_kernel(), smp_start_secondary() etc, so we don't worry about
> any printk() later?

This is rather ironic. rcu_cpu_starting() itself takes a lock and then
triggers the same report:

[8.826732][T0]  __lock_acquire.cold.76+0x2ad/0x3e0
[8.826732][T0]  lock_acquire+0x1c8/0x820
[8.826732][T0]  _raw_spin_lock_irqsave+0x30/0x50
[8.826732][T0]  rcu_cpu_starting+0xd0/0x2c0
[8.826732][T0]  start_secondary+0x10/0x2a0
[8.826732][T0]  secondary_startup_64_no_verify+0xb8/0xbb




Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-09 Thread Qian Cai
On Fri, 2020-10-09 at 18:23 +0200, Peter Zijlstra wrote:
> On Fri, Oct 09, 2020 at 06:58:37AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 09, 2020 at 09:41:24AM -0400, Qian Cai wrote:
> > > On Fri, 2020-10-09 at 07:58 +, tip-bot2 for Peter Zijlstra wrote:
> > > > The following commit has been merged into the locking/core branch of
> > > > tip:
> > > > 
> > > > Commit-ID: 4d004099a668c41522242aa146a38cc4eb59cb1e
> > > > Gitweb:
> > > > https://git.kernel.org/tip/4d004099a668c41522242aa146a38cc4eb59cb1e
> > > > Author:Peter Zijlstra 
> > > > AuthorDate:Fri, 02 Oct 2020 11:04:21 +02:00
> > > > Committer: Ingo Molnar 
> > > > CommitterDate: Fri, 09 Oct 2020 08:53:30 +02:00
> > > > 
> > > > lockdep: Fix lockdep recursion
> > > > 
> > > > Steve reported that lockdep_assert*irq*(), when nested inside lockdep
> > > > itself, will trigger a false-positive.
> > > > 
> > > > One example is the stack-trace code, as called from inside lockdep,
> > > > triggering tracing, which in turn calls RCU, which then uses
> > > > lockdep_assert_irqs_disabled().
> > > > 
> > > > Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to
> > > > per-cpu
> > > > variables")
> > > > Reported-by: Steven Rostedt 
> > > > Signed-off-by: Peter Zijlstra (Intel) 
> > > > Signed-off-by: Ingo Molnar 
> > > 
> > > Reverting this linux-next commit fixed booting RCU-list warnings
> > > everywhere.
> > 
> > Is it possible that the RCU-list warnings were being wrongly suppressed
> > without a21ee6055c30?  As in are you certain that these RCU-list warnings
> > are in fact false positives?
> > > [4.002695][T0]  init_timer_key+0x29/0x220
> > > [4.002695][T0]  identify_cpu+0xfcb/0x1980
> > > [4.002695][T0]  identify_secondary_cpu+0x1d/0x190
> > > [4.002695][T0]  smp_store_cpu_info+0x167/0x1f0
> > > [4.002695][T0]  start_secondary+0x5b/0x290
> > > [4.002695][T0]  secondary_startup_64_no_verify+0xb8/0xbb
> 
> They're actually correct warnings, this is trying to use RCU before that
> CPU is reported to RCU.
> 
> Possibly something like the below works, but I've not tested it, nor
> have I really thought hard about it, bring up tricky and this is just
> moving code.

I don't think this will always work. Basically, anything like printk() would
trigger the warning because it tries to acquire a lock. For example, on arm64:

[0.418627]  lockdep_rcu_suspicious+0x134/0x14c
[0.418629]  __lock_acquire+0x1c30/0x2600
[0.418631]  lock_acquire+0x274/0xc48
[0.418632]  _raw_spin_lock+0xc8/0x140
[0.418634]  vprintk_emit+0x90/0x3d0
[0.418636]  vprintk_default+0x34/0x40
[0.418638]  vprintk_func+0x378/0x590
[0.418640]  printk+0xa8/0xd4
[0.418642]  __cpuinfo_store_cpu+0x71c/0x868
[0.418644]  cpuinfo_store_cpu+0x2c/0xc8
[0.418645]  secondary_start_kernel+0x244/0x318

Back to x86, we have:

start_secondary()
  smp_callin()
apic_ap_setup()
  setup_local_APIC()
printk() in certain conditions.

which is before smp_store_cpu_info().

Can't we add a rcu_cpu_starting() at the very top for each start_secondary(),
secondary_start_kernel(), smp_start_secondary() etc, so we don't worry about any
printk() later?
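
The ordering question above can be made concrete with a toy model. Everything
here is hypothetical (struct cpu_model and the *_model helpers are not kernel
interfaces), and it assumes that the call reporting the CPU to RCU takes no
lock of its own, which may not hold for the real rcu_cpu_starting().

```c
#include <stdbool.h>

/* Toy model of one secondary CPU during bring-up (not kernel code). */
struct cpu_model {
	bool rcu_watching;	/* true once the CPU was reported to RCU */
	int warnings;		/* count of "suspicious RCU usage" reports */
};

/* Stand-in for rcu_cpu_starting(); assumed here to take no lock itself. */
static void rcu_cpu_starting_model(struct cpu_model *cpu)
{
	cpu->rcu_watching = true;
}

/* Stand-in for lock_acquire(): lockdep walks RCU-protected lists. */
static void lock_acquire_model(struct cpu_model *cpu)
{
	if (!cpu->rcu_watching)
		cpu->warnings++;
}

/* Stand-in for printk(): it needs a lock, so it goes through lockdep. */
static void printk_model(struct cpu_model *cpu)
{
	lock_acquire_model(cpu);
}

/* Report to RCU only after some early code has already printed. */
static int bringup_report_late(void)
{
	struct cpu_model cpu = { false, 0 };

	printk_model(&cpu);		/* early setup code prints first */
	rcu_cpu_starting_model(&cpu);
	printk_model(&cpu);
	return cpu.warnings;
}

/* Report to RCU as the very first step of the entry function. */
static int bringup_report_first(void)
{
	struct cpu_model cpu = { false, 0 };

	rcu_cpu_starting_model(&cpu);
	printk_model(&cpu);
	printk_model(&cpu);
	return cpu.warnings;
}
```

In this model only the "report first" ordering stays free of warnings, which
is the point of putting the call at the very top of each entry function.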

> 
> ---
> 
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 35ad8480c464..9173d64ee69d 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1670,6 +1670,9 @@ void __init identify_boot_cpu(void)
>  void identify_secondary_cpu(struct cpuinfo_x86 *c)
>  {
>   BUG_ON(c == &boot_cpu_data);
> +
> + rcu_cpu_starting(smp_processor_id());
> +
>   identify_cpu(c);
>  #ifdef CONFIG_X86_32
>   enable_sep_cpu();
> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> index 6a80f36b5d59..5f436cb4f7c4 100644
> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> @@ -794,8 +794,6 @@ void mtrr_ap_init(void)
>   if (!use_intel() || mtrr_aps_delayed_init)
>   return;
>  
> - rcu_cpu_starting(smp_processor_id());
> -
>   /*
>* Ideally we should hold mtrr_mutex here to avoid mtrr entries
>* changed, but this routine will be called in cpu boot time,
> 



Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-09 Thread Qian Cai
On Fri, 2020-10-09 at 06:58 -0700, Paul E. McKenney wrote:
> On Fri, Oct 09, 2020 at 09:41:24AM -0400, Qian Cai wrote:
> > On Fri, 2020-10-09 at 07:58 +, tip-bot2 for Peter Zijlstra wrote:
> > > The following commit has been merged into the locking/core branch of tip:
> > > 
> > > Commit-ID: 4d004099a668c41522242aa146a38cc4eb59cb1e
> > > Gitweb:
> > > https://git.kernel.org/tip/4d004099a668c41522242aa146a38cc4eb59cb1e
> > > Author:Peter Zijlstra 
> > > AuthorDate:Fri, 02 Oct 2020 11:04:21 +02:00
> > > Committer: Ingo Molnar 
> > > CommitterDate: Fri, 09 Oct 2020 08:53:30 +02:00
> > > 
> > > lockdep: Fix lockdep recursion
> > > 
> > > Steve reported that lockdep_assert*irq*(), when nested inside lockdep
> > > itself, will trigger a false-positive.
> > > 
> > > One example is the stack-trace code, as called from inside lockdep,
> > > triggering tracing, which in turn calls RCU, which then uses
> > > lockdep_assert_irqs_disabled().
> > > 
> > > Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-
> > > cpu
> > > variables")
> > > Reported-by: Steven Rostedt 
> > > Signed-off-by: Peter Zijlstra (Intel) 
> > > Signed-off-by: Ingo Molnar 
> > 
> > Reverting this linux-next commit fixed booting RCU-list warnings everywhere.
> 
> Is it possible that the RCU-list warnings were being wrongly suppressed
> without a21ee6055c30?  As in are you certain that these RCU-list warnings
> are in fact false positives?

I guess you mean this commit a046a86082cc ("lockdep: Fix lockdep recursion")
instead of a21ee6055c30. It is unclear to me how that commit a046a86082cc would
suddenly start to generate those warnings, although I can see it starts to use
percpu variables even though the CPU is not yet set online.

DECLARE_PER_CPU(unsigned int, lockdep_recursion);

Anyway, the problem is what happens during early boot:

start_secondary()
  smp_init_secondary()
init_cpu_timer()
  clockevents_register_device()

We are taking a lock there but the CPU is not yet online, and
__lock_acquire() would call things like hlist_for_each_entry_rcu() from
lookup_chain_cache() or register_lock_class(), thus triggering the RCU-list
warnings about RCU being used from an offline CPU.

I am not entirely sure how to fix those though.



Re: [tip: locking/core] lockdep: Fix lockdep recursion

2020-10-09 Thread Qian Cai
On Fri, 2020-10-09 at 07:58 +, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the locking/core branch of tip:
> 
> Commit-ID: 4d004099a668c41522242aa146a38cc4eb59cb1e
> Gitweb:
> https://git.kernel.org/tip/4d004099a668c41522242aa146a38cc4eb59cb1e
> Author:Peter Zijlstra 
> AuthorDate:Fri, 02 Oct 2020 11:04:21 +02:00
> Committer: Ingo Molnar 
> CommitterDate: Fri, 09 Oct 2020 08:53:30 +02:00
> 
> lockdep: Fix lockdep recursion
> 
> Steve reported that lockdep_assert*irq*(), when nested inside lockdep
> itself, will trigger a false-positive.
> 
> One example is the stack-trace code, as called from inside lockdep,
> triggering tracing, which in turn calls RCU, which then uses
> lockdep_assert_irqs_disabled().
> 
> Fixes: a21ee6055c30 ("lockdep: Change hardirq{s_enabled,_context} to per-cpu
> variables")
> Reported-by: Steven Rostedt 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Ingo Molnar 

Reverting this linux-next commit fixed the RCU-list warnings seen during boot
everywhere.

== x86 ==
[8.101841][T1] rcu: Hierarchical SRCU implementation.
[8.110615][T5] NMI watchdog: Enabled. Permanently consumes one hw-PMU 
counter.
[8.153506][T1] smp: Bringing up secondary CPUs ...
[8.163075][T1] x86: Booting SMP configuration:
[8.167843][T1]  node  #0, CPUs:#1
[4.002695][T0] 
[4.002695][T0] =
[4.002695][T0] WARNING: suspicious RCU usage
[4.002695][T0] 5.9.0-rc8-next-20201009 #2 Not tainted
[4.002695][T0] -
[4.002695][T0] kernel/locking/lockdep.c:3497 RCU-list traversed in 
non-reader section!!
[4.002695][T0] 
[4.002695][T0] other info that might help us debug this:
[4.002695][T0] 
[4.002695][T0] 
[4.002695][T0] RCU used illegally from offline CPU!
[4.002695][T0] rcu_scheduler_active = 1, debug_locks = 1
[4.002695][T0] no locks held by swapper/1/0.
[4.002695][T0] 
[4.002695][T0] stack backtrace:
[4.002695][T0] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
5.9.0-rc8-next-20201009 #2
[4.002695][T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 
Gen10, BIOS A40 07/10/2019
[4.002695][T0] Call Trace:
[4.002695][T0]  dump_stack+0x99/0xcb
[4.002695][T0]  __lock_acquire.cold.76+0x2ad/0x3e0
lookup_chain_cache at kernel/locking/lockdep.c:3497
(inlined by) lookup_chain_cache_add at kernel/locking/lockdep.c:3517
(inlined by) validate_chain at kernel/locking/lockdep.c:3572
(inlined by) __lock_acquire at kernel/locking/lockdep.c:4837
[4.002695][T0]  ? lockdep_hardirqs_on_prepare+0x3d0/0x3d0
[4.002695][T0]  lock_acquire+0x1c8/0x820
lockdep_recursion_finish at kernel/locking/lockdep.c:435
(inlined by) lock_acquire at kernel/locking/lockdep.c:5444
(inlined by) lock_acquire at kernel/locking/lockdep.c:5407
[4.002695][T0]  ? __debug_object_init+0xb4/0xf50
[4.002695][T0]  ? memset+0x1f/0x40
[4.002695][T0]  ? rcu_read_unlock+0x40/0x40
[4.002695][T0]  ? mce_gather_info+0x170/0x170
[4.002695][T0]  ? arch_freq_get_on_cpu+0x270/0x270
[4.002695][T0]  ? mce_cpu_restart+0x40/0x40
[4.002695][T0]  _raw_spin_lock_irqsave+0x30/0x50
[4.002695][T0]  ? __debug_object_init+0xb4/0xf50
[4.002695][T0]  __debug_object_init+0xb4/0xf50
[4.002695][T0]  ? mce_amd_feature_init+0x80c/0xa70
[4.002695][T0]  ? debug_object_fixup+0x30/0x30
[4.002695][T0]  ? machine_check_poll+0x2d0/0x2d0
[4.002695][T0]  ? mce_cpu_restart+0x40/0x40
[4.002695][T0]  init_timer_key+0x29/0x220
[4.002695][T0]  identify_cpu+0xfcb/0x1980
[4.002695][T0]  identify_secondary_cpu+0x1d/0x190
[4.002695][T0]  smp_store_cpu_info+0x167/0x1f0
[4.002695][T0]  start_secondary+0x5b/0x290
[4.002695][T0]  secondary_startup_64_no_verify+0xb8/0xbb
[8.379508][T1]   #2
[8.389728][T1]   #3
[8.399901][T1] 

== s390 ==
00: [1.539768] rcu: Hierarchical SRCU implementation.   
00: [1.561622] smp: Bringing up secondary CPUs ...  
00: [1.568677]  
00: [1.568681] =
00: [1.568682] WARNING: suspicious RCU usage
00: [1.568688] 5.9.0-rc8-next-20201009 #2 Not tainted   
00: [1.568688] -
00: [1.568691] kernel/locking/lockdep.c:3497 RCU-list traversed in non-reade
00: r section!! 
00: [1.568692]  
00: [1.568692] other info that might help us debug this:

Re: [PATCHv2] arm64: initialize per-cpu offsets earlier

2020-10-08 Thread Qian Cai
On Mon, 2020-10-05 at 17:43 +0100, Mark Rutland wrote:
> The current initialization of the per-cpu offset register is difficult
> to follow and this initialization is not always early enough for
> upcoming instrumentation with KCSAN, where the instrumentation callbacks
> use the per-cpu offset.
> 
> To make it possible to support KCSAN, and to simplify reasoning about
> early bringup code, let's initialize the per-cpu offset earlier, before
> we run any C code that may consume it. To do so, this patch adds a new
> init_this_cpu_offset() helper that's called before the usual
> primary/secondary start functions. For consistency, this is also used to
> re-initialize the per-cpu offset after the runtime per-cpu areas have
> been allocated (which can change CPU0's offset).
> 
> So that init_this_cpu_offset() isn't subject to any instrumentation that
> might consume the per-cpu offset, it is marked with noinstr, preventing
> instrumentation.
> 
> Signed-off-by: Mark Rutland 
> Cc: Catalin Marinas 
> Cc: James Morse 
> Cc: Will Deacon 
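
A rough userspace model of why the ordering described above matters. This is not the arm64 code from the patch: the array names, MODEL_NR_CPUS and the model_* helpers are all made up for illustration, and the "offset register" is just an array slot. It only shows that a per-cpu access made before the offset is loaded resolves to the wrong CPU's data.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count, for illustration only */

static long percpu_area[MODEL_NR_CPUS];		/* one slot per modelled CPU */
static size_t offset_table[MODEL_NR_CPUS];	/* offsets computed at setup time */
static size_t offset_reg[MODEL_NR_CPUS];	/* models each CPU's offset register; zero until loaded */

/* Fill in the table of per-CPU offsets (done once, before any CPU uses it). */
static void model_setup_offsets(void)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		offset_table[cpu] = (size_t)cpu * sizeof(long);
}

/* Model of the early per-CPU step: load this CPU's offset register. */
static void model_init_this_cpu_offset(int cpu)
{
	offset_reg[cpu] = offset_table[cpu];
}

/* Per-CPU access: base address plus whatever the offset register holds. */
static long *model_this_cpu_ptr(int cpu)
{
	return (long *)((char *)percpu_area + offset_reg[cpu]);
}
```

In this model, calling model_this_cpu_ptr(2) before model_init_this_cpu_offset(2) returns CPU 0's slot, which is the kind of silent misuse that instrumentation running early in bringup would hit.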

Reverting this commit on top of today's linux-next fixed an issue where a
ThunderX2 system is unable to boot:

.config: https://gitlab.com/cailca/linux-mm/-/blob/master/arm64.config

EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable, KASLR will be disabled
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services and installing virtual address map...

It hung here for more than 10 minutes, even with "earlycon", before I gave up.
With the revert it boots again, and the following lines appear almost immediately.

[0.00][T0] Booting Linux on physical CPU 0x00 [0x431f0af1]
[0.00][T0] Linux version 5.9.0-rc8-next-20201008+ (gcc (GCC) 8.3.1 
20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #6 SMP Thu Oct 8 
20:57:40 EDT 2020
[0.00][T0] efi: EFI v2.70 by American Megatrends
[0.00][T0] efi: ESRT=0xf9224418 SMBIOS=0xfcca SMBIOS 
3.0=0xfcc9 ACPI 2.0=0xf972 MEMRESERVE=0xfc965918 
[0.00][T0] esrt: Reserving ESRT space from 0xf9224418 to 
0xf9224450.
[0.00][T0] ACPI: Early table checksum verification disabled
[0.00][T0] ACPI: RSDP 0xF972 24 (v02 HPE   )
[0.00][T0] ACPI: XSDT 0xF9720028 DC (v01 HPE
ServerCL 01072009 AMI  00010013)
[0.00][T0] ACPI: FACP 0xF9720108 000114 (v06 HPE
ServerCL 01072009 AMI  00010013)
[0.00][T0] ACPI: DSDT 0xF9720220 000714 (v02 HPE
ServerCL 20150406 INTL 20170831)
[0.00][T0] ACPI: FIDT 0xF9720938 9C (v01 HPE
ServerCL 01072009 AMI  00010013)
...

# lscpu
Architecture:aarch64
Byte Order:  Little Endian
CPU(s):  224
On-line CPU(s) list: 0-223
Thread(s) per core:  4
Core(s) per socket:  28
Socket(s):   2
NUMA node(s):2
Vendor ID:   Cavium
Model:   1
Model name:  ThunderX2 99xx
Stepping:0x1
BogoMIPS:400.00
L1d cache:   32K
L1i cache:   32K
L2 cache:256K
L3 cache:32768K
NUMA node0 CPU(s):   0-111
NUMA node1 CPU(s):   112-223
Flags:   fp asimd aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm

> ---
>  arch/arm64/include/asm/cpu.h |  2 ++
>  arch/arm64/kernel/head.S |  3 +++
>  arch/arm64/kernel/setup.c| 12 ++--
>  arch/arm64/kernel/smp.c  | 13 -
>  4 files changed, 19 insertions(+), 11 deletions(-)
> 
> Since v1[1]:
> 
> * Fix typos
> * Rebase atop v5.9-rc4
> 
> Mark.
> 
> [1] https://lore.kernel.org/r/20200730163806.23053-1-mark.rutl...@arm.com
> 
> diff --git a/arch/arm64/include/asm/cpu.h b/arch/arm64/include/asm/cpu.h
> index 7faae6ff3ab4d..d9d60b18e8116 100644
> --- a/arch/arm64/include/asm/cpu.h
> +++ b/arch/arm64/include/asm/cpu.h
> @@ -68,4 +68,6 @@ void __init init_cpu_features(struct cpuinfo_arm64 *info);
>  void update_cpu_features(int cpu, struct cpuinfo_arm64 *info,
>struct cpuinfo_arm64 *boot);
>  
> +void init_this_cpu_offset(void);
> +
>  #endif /* __ASM_CPU_H */
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 037421c66b147..2720e6ec68140 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -452,6 +452,8 @@ SYM_FUNC_START_LOCAL(__primary_switched)
>   bl  __pi_memset
>   dsb ishst   // Make zero page visible to
> PTW
>  
> + bl  init_this_cpu_offset
> +
>  #ifdef CONFIG_KASAN
>   bl  kasan_early_init
>  #endif
> @@ -758,6 +760,7 @@ SYM_FUNC_START_LOCAL(__secondary_switched)
>   ptrauth_keys_init_cpu x2, x3, x4, x5
>  #endif
>  
> + bl  init_this_cpu_offset
>   b   secondary_start_kernel
>  SYM_FUNC_END(__secondary_switched)
>  
> diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c
> index 53acbeca4f574..fde4396418add 100644
> --- 

Re: misc I/O submission cleanups

2020-10-08 Thread Qian Cai
On Mon, 2020-10-05 at 10:41 +0200, Christoph Hellwig wrote:
> Hi Martin,
> 
> this series tidies up various loose ends in the SCSI I/O submission path.

Reverting this patchset on top of today's linux-next fixed the boot failures
below with libata, i.e.,

git revert --no-edit 653eb7c99d84..ed7fb2d018fd

== Easy to reproduce using qemu-kvm "-hda" with CONFIG_ATA_PIIX=y ==
.config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config

[   46.047499][  T757] ata2.00: WARNING: zero len r/w req
[   46.049734][  T644] ata2.00: WARNING: zero len r/w req
[   46.051962][  T757] ata2.00: WARNING: zero len r/w req
[   46.054182][  T644] ata2.00: WARNING: zero len r/w req
[   46.058018][  T757] ata2.00: WARNING: zero len r/w req
[   46.060514][  T644] ata2.00: WARNING: zero len r/w req
[   46.065764][  T757] ata2.00: WARNING: zero len r/w req
[   46.068005][  T644] ata2.00: WARNING: zero len r/w req
[   46.070192][  T644] ata2.00: WARNING: zero len r/w req
[   46.072379][  T644] ata2.00: WARNING: zero len r/w req
[   46.074629][  T644] ata2.00: WARNING: zero len r/w req
[   46.077255][  T644] ata2.00: WARNING: zero len r/w req
[   46.081553][   C36] sr 1:0:0:0: [sr0] tag#0 UNKNOWN(0x2003) Result: 
hostbyte=0x07 driverbyte=0x00 cmd_age=0s
[   46.086336][   C36] sr 1:0:0:0: [sr0] tag#0 CDB: opcode=0x28 
[   46.089171][   C36] blk_update_request: I/O error, dev sr0, sector 2097136 
op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[   46.094979][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.097526][  T757] blk_update_request: I/O error, dev sr0, sector 2097136 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.102364][  T757] Buffer I/O error on dev sr0, logical block 2097136, 
async page read
[   46.106080][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.108590][  T757] blk_update_request: I/O error, dev sr0, sector 2097137 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.113234][  T757] Buffer I/O error on dev sr0, logical block 2097137, 
async page read
[   46.117053][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.119581][  T757] blk_update_request: I/O error, dev sr0, sector 2097138 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.124382][  T757] Buffer I/O error on dev sr0, logical block 2097138, 
async page read
[   46.128545][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.131038][  T757] blk_update_request: I/O error, dev sr0, sector 2097139 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.135905][  T757] Buffer I/O error on dev sr0, logical block 2097139, 
async page read
[   46.139835][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.142422][  T757] blk_update_request: I/O error, dev sr0, sector 2097140 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.147240][  T757] Buffer I/O error on dev sr0, logical block 2097140, 
async page read
[   46.150764][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.153248][  T757] blk_update_request: I/O error, dev sr0, sector 2097141 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.158439][  T757] Buffer I/O error on dev sr0, logical block 2097141, 
async page read
[   46.162383][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.165062][  T757] blk_update_request: I/O error, dev sr0, sector 2097142 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.169785][  T757] Buffer I/O error on dev sr0, logical block 2097142, 
async page read
[   46.173252][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.175968][  T757] blk_update_request: I/O error, dev sr0, sector 2097143 
op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[   46.181049][  T757] Buffer I/O error on dev sr0, logical block 2097143, 
async page read
[   46.184779][   C36] sr 1:0:0:0: [sr0] tag#0 UNKNOWN(0x2003) Result: 
hostbyte=0x07 driverbyte=0x00 cmd_age=0s
[   46.188966][   C36] sr 1:0:0:0: [sr0] tag#0 CDB: opcode=0x28 
[   46.191731][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.194223][  T757] Buffer I/O error on dev sr0, logical block 0, async page 
read
[   46.197976][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.200781][  T757] Buffer I/O error on dev sr0, logical block 1, async page 
read
[   46.204221][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.207133][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.209790][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer
[   46.212464][  T757] sr 1:0:0:0: [sr0] tag#0 unaligned transfer

== baremetal with ahci ==
[   14.560235][  T515] ata1.00: WARNING: zero len r/w req
[   14.560397][  T515] ata1.00: WARNING: zero len r/w req
[   14.560450][  T515] ata1.00: WARNING: zero len r/w req
[   14.560502][  T515] ata1.00: WARNING: zero len r/w req
[   14.560594][  T515] ata1.00: WARNING: zero len r/w req
[   14.560644][  T515] ata1.00: WARNING: zero len r/w req
[   14.560709][  C100] sd 0:0:0:0: [sdb] tag#7 UNKNOWN(0x2003) Result: 
hostbyte=0x07 driverbyte=0x00 cmd_age=0s
[   14.560790][  C100] sd 0:0:0:0: 

Re: INFO: task can't die in request_wait_answer

2020-10-07 Thread Qian Cai
On Sun, 2020-10-04 at 21:10 -0700, syzbot wrote:
> syzbot has found a reproducer for the following issue on:
> 
> HEAD commit:2172e358 Add linux-next specific files for 20201002
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1596c7a390
> kernel config:  https://syzkaller.appspot.com/x/.config?x=70698f530a7e856f
> dashboard link: https://syzkaller.appspot.com/bug?extid=ea48ca29949b1820e745
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16e1c8e790
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10a166af90

Looking through the reproducer, it performs privileged operations such as
mount, which seems to indicate that syzkaller runs it as root, and a root user
could easily break things in ways that cause soft lockups like these.

So far, I can only reproduce this as root, and I also need to Ctrl-C the
reproducer for the kernel to get stuck in request_wait_answer() forever,
probably because there is no response from the server (the server has already
exited).

> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+ea48ca29949b1820e...@syzkaller.appspotmail.com
> 
> INFO: task syz-executor220:7040 can't die for more than 143 seconds.
> task:syz-executor220 state:D stack:28792 pid: 7040 ppid:  6888
> flags:0x4004
> Call Trace:
>  context_switch kernel/sched/core.c:3772 [inline]
>  __schedule+0xec5/0x2200 kernel/sched/core.c:4521
>  schedule+0xcf/0x270 kernel/sched/core.c:4599
>  request_wait_answer+0x505/0x7f0 fs/fuse/dev.c:402
>  __fuse_request_send fs/fuse/dev.c:421 [inline]
>  fuse_simple_request+0x526/0xc10 fs/fuse/dev.c:503
>  fuse_do_getattr+0x226/0xc40 fs/fuse/dir.c:952
>  fuse_update_get_attr fs/fuse/dir.c:988 [inline]
>  fuse_getattr+0x37f/0x430 fs/fuse/dir.c:1723
>  vfs_getattr_nosec+0x246/0x2e0 fs/stat.c:87
>  vfs_getattr fs/stat.c:124 [inline]
>  vfs_statx+0x18d/0x390 fs/stat.c:189
>  vfs_fstatat fs/stat.c:207 [inline]
>  vfs_stat include/linux/fs.h:3148 [inline]
>  __do_sys_newstat+0x91/0x110 fs/stat.c:349
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x446c99
> Code: Bad RIP value.
> RSP: 002b:7fafe2cfedb8 EFLAGS: 0246 ORIG_RAX: 0004
> RAX: ffda RBX: 006dbc28 RCX: 00446c99
> RDX: 00446c99 RSI:  RDI: 24c0
> RBP: 006dbc20 R08:  R09: 
> R10:  R11: 0246 R12: 006dbc2c
> R13: 7fff48a82c7f R14: 7fafe2cff9c0 R15: 
> INFO: task syz-executor220:7040 blocked for more than 143 seconds.
>   Not tainted 5.9.0-rc7-next-20201002-syzkaller #0
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:syz-executor220 state:D stack:28792 pid: 7040 ppid:  6888
> flags:0x4004
> Call Trace:
>  context_switch kernel/sched/core.c:3772 [inline]
>  __schedule+0xec5/0x2200 kernel/sched/core.c:4521
>  schedule+0xcf/0x270 kernel/sched/core.c:4599
>  request_wait_answer+0x505/0x7f0 fs/fuse/dev.c:402
>  __fuse_request_send fs/fuse/dev.c:421 [inline]
>  fuse_simple_request+0x526/0xc10 fs/fuse/dev.c:503
>  fuse_do_getattr+0x226/0xc40 fs/fuse/dir.c:952
>  fuse_update_get_attr fs/fuse/dir.c:988 [inline]
>  fuse_getattr+0x37f/0x430 fs/fuse/dir.c:1723
>  vfs_getattr_nosec+0x246/0x2e0 fs/stat.c:87
>  vfs_getattr fs/stat.c:124 [inline]
>  vfs_statx+0x18d/0x390 fs/stat.c:189
>  vfs_fstatat fs/stat.c:207 [inline]
>  vfs_stat include/linux/fs.h:3148 [inline]
>  __do_sys_newstat+0x91/0x110 fs/stat.c:349
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x446c99
> Code: Bad RIP value.
> RSP: 002b:7fafe2cfedb8 EFLAGS: 0246 ORIG_RAX: 0004
> RAX: ffda RBX: 006dbc28 RCX: 00446c99
> RDX: 00446c99 RSI:  RDI: 24c0
> RBP: 006dbc20 R08:  R09: 
> R10:  R11: 0246 R12: 006dbc2c
> R13: 7fff48a82c7f R14: 7fafe2cff9c0 R15: 
> INFO: task syz-executor220:7044 can't die for more than 143 seconds.
> task:syz-executor220 state:D stack:28752 pid: 7044 ppid:  6887
> flags:0x4004
> Call Trace:
>  context_switch kernel/sched/core.c:3772 [inline]
>  __schedule+0xec5/0x2200 kernel/sched/core.c:4521
>  schedule+0xcf/0x270 kernel/sched/core.c:4599
>  request_wait_answer+0x505/0x7f0 fs/fuse/dev.c:402
>  __fuse_request_send fs/fuse/dev.c:421 [inline]
>  fuse_simple_request+0x526/0xc10 fs/fuse/dev.c:503
>  fuse_do_getattr+0x226/0xc40 fs/fuse/dir.c:952
>  fuse_update_get_attr fs/fuse/dir.c:988 [inline]
>  fuse_getattr+0x37f/0x430 fs/fuse/dir.c:1723
>  vfs_getattr_nosec+0x246/0x2e0 fs/stat.c:87
>  vfs_getattr fs/stat.c:124 [inline]
>  vfs_statx+0x18d/0x390 fs/stat.c:189
>  

WARN_ON(fuse_insert_writeback(root, wpa)) in tree_insert()

2020-10-07 Thread Qian Cai
Running some fuzzing as an unprivileged user on virtiofs could trigger the
warning below. The warning was introduced not long ago by commit
c146024ec44c ("fuse: fix warning in tree_insert() and clean up writepage
insertion").

From the logs, the last piece of the fuzzing code is:

fgetxattr(fd=426, name=0x7f39a69af000, value=0x7f39a8abf000, size=1)

[main]  testfile fd:426 filename:trinity-testfile2 flags:2 fopened:1 
fcntl_flags:42c00 global:1
[main]   start: 0x7f39a58e6000 size:4KB  name: trinity-testfile2 global:1
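
For reference, the shape of that last call can be sketched in C as below. This is not a reproducer: the log only shows raw pointers, so the attribute name trinity passed is unknown, and probe_xattr(), its path argument and the O_RDWR open mode are placeholders chosen for illustration. Only the 1-byte value buffer (size=1) is taken from the log.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <unistd.h>

/*
 * Open a file and issue a single fgetxattr() with a 1-byte value buffer.
 * Both the path and the attribute name are supplied by the caller; they are
 * placeholders, not values recovered from the trinity log.
 * Returns the fgetxattr() result, or -1 with errno set on failure.
 */
static ssize_t probe_xattr(const char *path, const char *name)
{
	char value[1];
	ssize_t ret;
	int saved_errno;
	int fd = open(path, O_RDWR);	/* assumed open mode */

	if (fd < 0)
		return -1;

	ret = fgetxattr(fd, name, value, sizeof(value));

	/* Preserve the fgetxattr() errno across close(). */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return ret;
}
```

Something like probe_xattr("trinity-testfile2", "user.demo") on a virtiofs mount would exercise the same syscall entry point, though whether it reaches tree_insert() presumably depends on the file having dirty pages at that moment.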

[15969.175004][T179559] WARNING: CPU: 0 PID: 179559 at fs/fuse/file.c:1732 
tree_insert.part.40+0x0/0x10 [fuse]
[15969.180644][T179559] Modules linked in: loop isofs kvm_intel kvm irqbypass 
nls_ascii nls_cp437 vfat fat ip_tables x_tables virtiofs fuse sr_mod sd_mod 
cdrom ata_piix virtio_pci virtio_ring e1000 virtio libat]
[15969.197671][T179559] CPU: 0 PID: 179559 Comm: trinity-c24 Tainted: G 
  O  5.9.0-rc8-next-20201007+ #1
[15969.204027][T179559] Hardware name: Red Hat KVM, BIOS 
1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[15969.208993][T179559] RIP: 0010:tree_insert.part.40+0x0/0x10 [fuse]
[15969.213593][T179559] Code: 44 24 10 48 8b 74 24 08 48 8b 0c 24 e9 40 fc ff 
ff 66 0f 1f 84 00 00 00 00 00 0f 0b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 
<0f> 0b c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 b0
[15969.224348][T179559] RSP: 0018:c90007fc77f8 EFLAGS: 00010286
[15969.227798][T179559] RAX: 8884b8f73500 RBX: 8884b8f76900 RCX: 
8889e45ff910
[15969.233572][T179559] RDX:  RSI: 8884b8f76900 RDI: 
8884b8f735b0
[15969.238282][T179559] RBP: ea000550c880 R08: 8884b8f769f8 R09: 
f52000ff8ef2
[15969.243394][T179559] R10: 0003 R11: f52000ff8ef2 R12: 
8889e45ff480
[15969.247845][T179559] R13: ea0004d71380 R14: 88818285c000 R15: 
8889e45ff9b0
[15969.252884][T179559] FS:  7f39a8ab7740() GS:888bcc60() 
knlGS:
[15969.258385][T179559] CS:  0010 DS:  ES:  CR0: 80050033
[15969.262647][T179559] CR2: 008f CR3: 000557d56005 CR4: 
00770ef0
[15969.268492][T179559] DR0:  DR1:  DR2: 

[15969.273773][T179559] DR3:  DR6: fffe0ff0 DR7: 
0600
[15969.278030][T179559] PKRU: 5554
[15969.279920][T179559] Call Trace:
[15969.282279][T179559]  fuse_writepage_locked+0xa20/0xd10 [fuse]
[15969.285587][T179559]  fuse_launder_page+0x5b/0xc0 [fuse]
[15969.288303][T179559]  invalidate_inode_pages2_range+0x709/0xa90
invalidate_inode_pages2_range at mm/truncate.c:765
[15969.292495][T179559]  ? truncate_exceptional_pvec_entries.part.18+0x460/0x460
[15969.296605][T179559]  ? rcu_read_lock_sched_held+0x9c/0xd0
[15969.301015][T179559]  ? rcu_read_lock_bh_held+0xb0/0xb0
[15969.304427][T179559]  ? rcu_read_unlock+0x40/0x40
[15969.306759][T179559]  ? _raw_spin_unlock+0x1a/0x30
[15969.309124][T179559]  ? fuse_change_attributes+0x237/0x540 [fuse]
[15969.313701][T179559]  fuse_do_getattr+0x28b/0xd50 [fuse]
fuse_do_getattr at fs/fuse/dir.c:962
[15969.316774][T179559]  ? do_syscall_64+0x33/0x40
[15969.319617][T179559]  ? fuse_dentry_revalidate+0x6c0/0x6c0 [fuse]
[15969.323498][T179559]  ? rcu_read_lock_bh_held+0xb0/0xb0
[15969.326591][T179559]  ? find_held_lock+0x33/0x1c0
[15969.328989][T179559]  ? rwlock_bug.part.1+0x90/0x90
[15969.332202][T179559]  fuse_permission+0x29c/0x3c0 [fuse]
[15969.335564][T179559]  ? __kasan_kmalloc.constprop.11+0xc1/0xd0
[15969.338445][T179559]  inode_permission+0x2c1/0x390
[15969.342187][T179559]  vfs_getxattr+0x43/0x80
[15969.344605][T179559]  getxattr+0xe5/0x210
[15969.347120][T179559]  ? path_listxattr+0x100/0x100
[15969.350019][T179559]  ? rcu_read_lock_sched_held+0x9c/0xd0
[15969.354014][T179559]  ? rcu_read_lock_bh_held+0xb0/0xb0
[15969.356977][T179559]  ? find_held_lock+0x33/0x1c0
[15969.359631][T179559]  ? __task_pid_nr_ns+0x127/0x3a0
[15969.363099][T179559]  ? lock_downgrade+0x730/0x730
[15969.365714][T179559]  ? syscall_enter_from_user_mode+0x17/0x50
[15969.369104][T179559]  ? rcu_read_lock_sched_held+0x9c/0xd0
[15969.374492][T179559]  __x64_sys_fgetxattr+0xd9/0x140
[15969.377317][T179559]  do_syscall_64+0x33/0x40
[15969.380588][T179559]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[15969.384059][T179559] RIP: 0033:0x7f39a83ca78d
[15969.386559][T179559] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e 
fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 
<48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cb 56 2c 00 f7 d8
[15969.399200][T179559] RSP: 002b:7ffe920f3778 EFLAGS: 0246 ORIG_RAX: 
00c1
[15969.405661][T179559] RAX: ffda RBX: 00c1 RCX: 
7f39a83ca78d
[15969.411274][T179559] RDX: 7f39a8abf000 RSI: 7f39a69af000 RDI: 
01aa
[15969.415813][T179559] RBP: 00c1 R08: 0480 R09: 
003e
[15969.421984][T179559] R10: 

Re: [PATCH v2 09/11] powerpc/smp: Optimize update_mask_by_l2

2020-10-07 Thread Qian Cai
On Wed, 2020-10-07 at 19:47 +0530, Srikar Dronamraju wrote:
> Can you confirm if CONFIG_CPUMASK_OFFSTACK is enabled in your config?

Yes, https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config

We test linux-next here almost daily.


