Re: [BUG] random kernel crashes after THP rework on s390 (maybe also on PowerPC and ARM)

2016-02-13 Thread Sebastian Ott

On Sat, 13 Feb 2016, Kirill A. Shutemov wrote:
> Could you check if revert of fecffad25458 helps?

I reverted fecffad25458 on top of 721675fcf277cf - it oopsed with:

[ 1851.721062] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1851.721075] failing address:  TEID: 0483
[ 1851.721078] Fault in home space mode while using kernel ASCE.
[ 1851.721085] AS:00d5c007 R3:0007 S:a800 P:003d
[ 1851.721128] Oops: 0004 ilc:3 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1851.721135] Modules linked in: bridge stp llc btrfs mlx4_ib mlx4_en ib_sa ib_mad vxlan xor ip6_udp_tunnel ib_core udp_tunnel ptp pps_core ib_addr ghash_s390 raid6_pq prng ecb aes_s390 mlx4_core des_s390 des_generic genwqe_card sha512_s390 sha256_s390 sha1_s390 sha_common crc_itu_t dm_mod scm_block vhost_net tun vhost eadm_sch macvtap macvlan kvm autofs4
[ 1851.721183] CPU: 7 PID: 256422 Comm: bash Not tainted 4.5.0-rc3-00058-g07923d7-dirty #178
[ 1851.721186] task: 7fbfd290 ti: 8c604000 task.ti: 8c604000
[ 1851.721189] Krnl PSW : 0704d0018000 0045d3b8 (__rb_erase_color+0x280/0x308)
[ 1851.721200]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 EA:3
               Krnl GPRS: 0001 0020  bd07eff1
[ 1851.721205]            0027ca10  83e45898 77b61198
[ 1851.721207]            7ce1a490 bd07eff0 7ce1a548 0027ca10
[ 1851.721210]            bd07c350 bd07eff0 8c607aa8 8c607a68
[ 1851.721221] Krnl Code: 0045d3aa: e3c0d0080024   stg     %%r12,8(%%r13)
                          0045d3b0: b9040039       lgr     %%r3,%%r9
                         #0045d3b4: a53b0001       oill    %%r3,1
                         >0045d3b8: e3301024       stg     %%r3,0(%%r1)
                          0045d3be: ec28000e007c   cgij    %%r2,0,8,45d3da
                          0045d3c4: e3402004       lg      %%r4,0(%%r2)
                          0045d3ca: b904001c       lgr     %%r1,%%r12
                          0045d3ce: ec143f3f0056   rosbg   %%r1,%%r4,63,63,0
[ 1851.721269] Call Trace:
[ 1851.721273] ([<83e45898>] 0x83e45898)
[ 1851.721279]  [<0029342a>] unlink_anon_vmas+0x9a/0x1d8
[ 1851.721282]  [<00283f34>] free_pgtables+0xcc/0x148
[ 1851.721285]  [<0028c376>] exit_mmap+0xd6/0x300
[ 1851.721289]  [<00134db8>] mmput+0x90/0x118
[ 1851.721294]  [<002d76bc>] flush_old_exec+0x5d4/0x700
[ 1851.721298]  [<003369f4>] load_elf_binary+0x2f4/0x13e8
[ 1851.721301]  [<002d6e4a>] search_binary_handler+0x9a/0x1f8
[ 1851.721304]  [<002d8970>] do_execveat_common.isra.32+0x668/0x9a0
[ 1851.721307]  [<002d8cec>] do_execve+0x44/0x58
[ 1851.721310]  [<002d8f92>] SyS_execve+0x3a/0x48
[ 1851.721315]  [<006fb096>] system_call+0xd6/0x258
[ 1851.721317]  [<03ff997436d6>] 0x3ff997436d6
[ 1851.721319] INFO: lockdep is turned off.
[ 1851.721321] Last Breaking-Event-Address:
[ 1851.721323]  [<0045d31a>] __rb_erase_color+0x1e2/0x308
[ 1851.721327]
[ 1851.721329] ---[ end trace 0d80041ac00cfae2 ]---


> 
> And could you share what the crashes look like? I haven't seen any backtraces yet.
> 

Sure. I didn't because they really looked random to me. Most of the time
they hit in RCU or list debugging code, but I assumed those were just the
messengers that happened to observe the corruption first. Anyhow, here is an
older one that might look interesting:

[   59.851421] list_del corruption. next->prev should be 6e1eb000, but was 0400
[   59.851469] [ cut here ]
[   59.851472] WARNING: at lib/list_debug.c:71
[   59.851475] Modules linked in: bridge stp llc btrfs xor mlx4_en vxlan ip6_udp_tunnel udp_tunnel mlx4_ib ptp pps_core ib_sa ib_mad ib_core ib_addr ghash_s390 prng raid6_pq ecb aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 mlx4_core sha_common genwqe_card scm_block crc_itu_t vhost_net tun vhost dm_mod macvtap eadm_sch macvlan kvm autofs4
[   59.851532] CPU: 0 PID: 5400 Comm: git Not tainted 4.4.0-07794-ga4eff16-dirty #77
[   59.851535] task: d231 ti: d661 task.ti: d661
[   59.851539] Krnl PSW : 0704c0018000 00487434 (__list_del_entry+0xa4/0xe0)
[   59.851548]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
               Krnl GPRS: 01a7a1cf d231 0054 0001
[   59.851554]            00487430   774e6900
[   59.851557]            03ff5300 6d4017a0 03ff52f0 03ff52f0
[   59.851560]            03d10178 6e1eb000 00487430 d6613b00
[   59.851571] Krnl 
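
For readers who have not met this warning before: the "list_del corruption.
next->prev should be ..." message comes from a consistency check that runs
before an entry is unlinked from a doubly linked list. The sketch below is a
minimal user-space model of that kind of check, not the kernel's code. The
names list_node, node_init, node_add, node_del_is_sane and node_del_checked
are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal doubly linked list node, loosely modeled on struct list_head. */
struct list_node {
	struct list_node *next;
	struct list_node *prev;
};

/* Turn a node into an empty circular list that points at itself. */
static void node_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

/* Insert 'n' directly after 'head'. */
static void node_add(struct list_node *n, struct list_node *head)
{
	assert(n != NULL && head != NULL);
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* True only if both neighbours still point back at 'entry'. */
static bool node_del_is_sane(const struct list_node *entry)
{
	return entry->prev->next == entry && entry->next->prev == entry;
}

/*
 * Unlink 'entry' only if the neighbours are consistent.
 * Returns false (and leaves the list untouched) when corruption is seen.
 */
static bool node_del_checked(struct list_node *entry)
{
	if (!node_del_is_sane(entry))
		return false;
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	return true;
}
```

If some unrelated code overwrites the prev pointer of a neighbour, the check
fails at the next deletion. That fits the observation above: the list code
is the first to notice the corruption, not necessarily its source.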

Re: [PATCH V2 00/29] Book3s abstraction in preparation for new MMU model

2016-02-13 Thread Denis Kirjanov
On 2/13/16, Aneesh Kumar K.V  wrote:
> Paul Mackerras  writes:
>
>> On Mon, Feb 08, 2016 at 02:50:12PM +0530, Aneesh Kumar K.V wrote:
>>> Hello,
>>>
>>> This is a large series, mostly consisting of code movement. No new
>>> features are done in this series. The changes are done to accommodate
>>> the upcoming new memory model in future powerpc chips. The details of
>>> the new MMU model can be found at
>>>
>>>  http://ibm.biz/power-isa3 (Needs registration). I am including a summary
>>> of the changes below.

It's not a good idea to put your changes somewhere and ask people to
register before they can download them. It just complicates testing
your large set of changes.

>>
>> This series doesn't seem to apply against either v4.4 or Linus'
>> current master.  What is this patch against?
>>
>
> The patchset has dependencies on other patchsets posted to the
> list. The best option is to pull the branch mentioned instead of trying to
> apply them individually.
>
> -aneesh
>
> ___
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [V3] powerpc/mm: Fix Multi hit ERAT cause by recent THP update

2016-02-13 Thread Aneesh Kumar K.V
Michael Ellerman  writes:

> On Tue, 2016-02-09 at 01:20:31 UTC, "Aneesh Kumar K.V" wrote:
>> With ppc64 we use the deposited pgtable_t to store the hash pte slot
>> information. We should not withdraw the deposited pgtable_t without
>> marking the pmd none. This ensures that low level hash fault handling
>> will skip this huge pte and we will handle them at upper levels.
>> 
>> A recent change to pmd splitting changed the above in order to handle the
>> race between pmd split and exit_mmap. The race is explained below.
>> 
>> Consider the following race:
>> 
>>  CPU0                                CPU1
>> shrink_page_list()
>>   add_to_swap()
>>     split_huge_page_to_list()
>>       __split_huge_pmd_locked()
>>         pmdp_huge_clear_flush_notify()
>>         // pmd_none() == true
>>                                      exit_mmap()
>>                                        unmap_vmas()
>>                                          zap_pmd_range()
>>                                            // no action on pmd since
>>                                            // pmd_none() == true
>>         pmd_populate()
>> 
>> As a result the THP will not be freed. The leak is detected by check_mm():
>> 
>>  BUG: Bad rss-counter state mm:880058d2e580 idx:1 val:512
>> 
>> The above required us to not mark pmd none during a pmd split.
>> 
>> The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
>> level fault handling code skips this pte. At higher level we do take the
>> ptl lock. That should serialize us against the pmd split. Once the lock is
>> acquired we do check the pmd again using pmd_same. That should always
>> return false for us and hence we should retry the access. We do the
>> pmd_same check in all cases after taking ptl with
>> THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
>> huge_pmd_set_accessed).
>> 
>> Also make sure we wait for irq disable sections on other cpus to finish
>> before flipping a huge pte entry with a regular pmd entry. Code paths
>> like find_linux_pte_or_hugepte depend on irq disable to get
>> a stable pte_t pointer. A parallel thp split needs to make sure we
>> don't convert a pmd pte to a regular pmd entry without waiting for the
>> irq disable section to finish.
>> 
>> Acked-by: Kirill A. Shutemov 
>> Signed-off-by: Aneesh Kumar K.V 
>
> Applied to powerpc fixes, thanks.
>
> https://git.kernel.org/powerpc/c/9db4cd6c21535a4846b38808f3
>
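
The interleaving in the quoted commit message can be replayed with a small,
deterministic user-space model. This is only a sketch under simplifying
assumptions: there is no real concurrency, and the page table lock is modeled
by running the zap after the split when the entry is not none. All names
(toy_mm, toy_split_begin, toy_split_finish, toy_zap_pmd, toy_race) are
hypothetical; the value 512 is taken from the "val:512" in the quoted BUG
line.

```c
#include <assert.h>
#include <stdbool.h>

/* States a toy pmd entry can be in. */
enum pmd_state { PMD_NONE, PMD_HUGE, PMD_TABLE };

struct toy_mm {
	enum pmd_state pmd;
	long rss_pages;		/* pages still accounted to this mm */
};

#define TOY_HPAGE_NR 512	/* pages per huge page in this model */

/* CPU0, first half of the split: optionally leave the pmd marked none. */
static void toy_split_begin(struct toy_mm *mm, bool mark_none)
{
	if (mark_none)
		mm->pmd = PMD_NONE;
}

/* CPU0, second half of the split: install the regular page table. */
static void toy_split_finish(struct toy_mm *mm)
{
	mm->pmd = PMD_TABLE;
}

/* CPU1, exit path: a none entry is skipped, anything else is freed. */
static void toy_zap_pmd(struct toy_mm *mm)
{
	if (mm->pmd == PMD_NONE)
		return;
	assert(mm->rss_pages >= TOY_HPAGE_NR);
	mm->rss_pages -= TOY_HPAGE_NR;
	mm->pmd = PMD_NONE;
}

/* Replay the race and return the pages left accounted at the end. */
static long toy_race(bool mark_none)
{
	struct toy_mm mm = { .pmd = PMD_HUGE, .rss_pages = TOY_HPAGE_NR };

	toy_split_begin(&mm, mark_none);
	if (mm.pmd == PMD_NONE) {
		toy_zap_pmd(&mm);	/* CPU1 sees none and skips the entry */
		toy_split_finish(&mm);	/* CPU0 populates the pmd afterwards */
	} else {
		toy_split_finish(&mm);	/* CPU1 modeled as waiting on the lock */
		toy_zap_pmd(&mm);
	}
	return mm.rss_pages;
}
```

In this model toy_race(true) leaves 512 pages accounted, which corresponds to
the leak reported by check_mm(), while toy_race(false) ends with 0.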

Can we apply the hunk below? The reason for marking the pmd none was to
avoid clearing both _PAGE_USER and _PAGE_PRESENT on the pte. At pmd
level that used to mean a hugepd pointer. We fixed that earlier
by introducing _PAGE_PTE, but at the time I thought it was harmless to
mark the pmd none. Marking it none will still result in the race I
explained above, even though the window is much smaller now.

diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index c8a00da39969..03f6e72697d0 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -694,7 +694,7 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
 {
-	pmd_hugepage_update(vma->vm_mm, address, pmdp, ~0UL, 0);
+	pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
 	/*
 	 * This ensures that generic code that rely on IRQ disabling
 	 * to prevent a parallel THP split work as expected.
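
To make the effect of the one-line change concrete, here is a minimal
user-space sketch of a clear-mask/set-mask update, assuming the last two
arguments of pmd_hugepage_update() are the bits to clear and the bits to set.
The helper toy_update and the TOY_* bit values are invented for illustration
and do not match the real powerpc definitions; the real update is also done
atomically, which this model ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this example. */
#define TOY_PAGE_PRESENT 0x1ULL
#define TOY_PAGE_USER    0x2ULL
#define TOY_PAGE_PTE     0x4ULL

/*
 * Clear the bits in 'clr', then set the bits in 'set'.
 * Returns the previous value of the entry.
 */
static uint64_t toy_update(uint64_t *entry, uint64_t clr, uint64_t set)
{
	uint64_t old;

	assert(entry != NULL);
	old = *entry;
	*entry = (old & ~clr) | set;
	return old;
}
```

With a clear mask of ~0 the entry becomes 0, which is what "marking the pmd
none" means here. With a clear mask of only TOY_PAGE_PRESENT the remaining
bits survive, so the entry is no longer present but is still distinguishable
from a none entry.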
