Re: [PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-22 Thread Naoya Horiguchi
On Wed, Mar 22, 2017 at 03:39:00PM +0100, Christian Borntraeger wrote:
> On 03/22/2017 01:53 PM, Christian Borntraeger wrote:
> > On 03/22/2017 03:31 AM, Naoya Horiguchi wrote:
> >> I found the race condition which triggers the following bug when
> >> move_pages() and soft offline are called on a single hugetlb page
> >> concurrently.
> >>
> >> [61163.578957] Soft offlining page 0x119400 at 0x7000
> >> [61163.580062] BUG: unable to handle kernel paging request at 
> >> ea0011943820
> >> [61163.580791] IP: follow_huge_pmd+0x143/0x190
> >> [61163.581203] PGD 7ffd2067
> >> [61163.581204] PUD 7ffd1067
> >> [61163.581471] PMD 0
> >> [61163.581723]
> >> [61163.582052] Oops:  [#1] SMP
> >> [61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
> >> parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs 
> >> libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix 
> >> serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror 
> >> dm_region_hash dm_log dm_mod [last unloaded: cap_check]
> >> [61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P  
> >>  OE   4.11.0-rc2-mm1+ #2
> >> [61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> >> [61163.586627] task: 88007c951680 task.stack: c90004bd8000
> >> [61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
> >> [61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
> >> [61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
> >> 3000
> >> [61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
> >> 000465003e80
> >> [61163.589486] RBP: c90004bdbd18 R08:  R09: 
> >> 880138d34000
> >> [61163.590097] R10: ea000465 R11: 00c363b0 R12: 
> >> ea0011943800
> >> [61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
> >> 77ff8000
> >> [61163.591375] FS:  7fc977710740() GS:88007dc0() 
> >> knlGS:
> >> [61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
> >> [61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
> >> 001406f0
> >> [61163.593330] Call Trace:
> >> [61163.593556]  follow_page_mask+0x270/0x550
> >> [61163.593908]  SYSC_move_pages+0x4ea/0x8f0
> >> [61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
> >> [61163.594798]  SyS_move_pages+0xe/0x10
> >> [61163.595113]  do_syscall_64+0x67/0x180
> >> [61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
> >> [61163.595837] RIP: 0033:0x7fc976e03949
> >> [61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
> >> 0117
> >> [61163.596940] RAX: ffda RBX:  RCX: 
> >> 7fc976e03949
> >> [61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
> >> 5827
> >> [61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
> >> 0004
> >> [61163.598842] R10: 00c363b0 R11: 0246 R12: 
> >> 00400650
> >> [61163.599456] R13: 7ffe72221ee0 R14:  R15: 
> >> 
> >> [61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 
> >> 0c 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 
> >> 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
> >> [61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
> >> [61163.602376] CR2: ea0011943820
> >> [61163.602767] ---[ end trace e4f81353a2d23232 ]---
> >> [61163.603236] Kernel panic - not syncing: Fatal exception
> >> [61163.603706] Kernel Offset: disabled
> >>
> >> This bug is triggered when pmd_present() returns true for non-present
> >> hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
> >> Using pmd_present() to determine present/non-present for hugetlb is
> >> not correct, because pmd_present() checks multiple bits (not only
> >> _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
> >>
> >> Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in 
> >> follow_huge_pmd()")
> >> Signed-off-by: Naoya Horiguchi 
> >> Cc: [4.0+]
> > 
> > I think this is broken for s390. The page table entries look different from
> > the segment table entries (pmds) on s390, e.g. they have the invalid bit at
> > different places. Using pte functions on pmd does not work here.
> > Gerald can you confirm.
> > 
> 
> 
> Hmmm, it looks like that the s390 variant of huge_ptep_get already
> does the translation. So its probably fine.

Thank you for checking. I think so too: generic hugetlb code should refer to
leaf-level page table entries as 'pte' even if they are actually pmds or puds.
The arch-dependent details are contained in huge_ptep_get(), as you pointed
out.

- Naoya


Re: [PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-22 Thread Christian Borntraeger
On 03/22/2017 01:53 PM, Christian Borntraeger wrote:
> On 03/22/2017 03:31 AM, Naoya Horiguchi wrote:
>> I found the race condition which triggers the following bug when
>> move_pages() and soft offline are called on a single hugetlb page
>> concurrently.
>>
>> [61163.578957] Soft offlining page 0x119400 at 0x7000
>> [61163.580062] BUG: unable to handle kernel paging request at 
>> ea0011943820
>> [61163.580791] IP: follow_huge_pmd+0x143/0x190
>> [61163.581203] PGD 7ffd2067
>> [61163.581204] PUD 7ffd1067
>> [61163.581471] PMD 0
>> [61163.581723]
>> [61163.582052] Oops:  [#1] SMP
>> [61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
>> parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs 
>> libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix 
>> serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror 
>> dm_region_hash dm_log dm_mod [last unloaded: cap_check]
>> [61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P
>>OE   4.11.0-rc2-mm1+ #2
>> [61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> [61163.586627] task: 88007c951680 task.stack: c90004bd8000
>> [61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
>> [61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
>> [61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
>> 3000
>> [61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
>> 000465003e80
>> [61163.589486] RBP: c90004bdbd18 R08:  R09: 
>> 880138d34000
>> [61163.590097] R10: ea000465 R11: 00c363b0 R12: 
>> ea0011943800
>> [61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
>> 77ff8000
>> [61163.591375] FS:  7fc977710740() GS:88007dc0() 
>> knlGS:
>> [61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
>> [61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
>> 001406f0
>> [61163.593330] Call Trace:
>> [61163.593556]  follow_page_mask+0x270/0x550
>> [61163.593908]  SYSC_move_pages+0x4ea/0x8f0
>> [61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
>> [61163.594798]  SyS_move_pages+0xe/0x10
>> [61163.595113]  do_syscall_64+0x67/0x180
>> [61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
>> [61163.595837] RIP: 0033:0x7fc976e03949
>> [61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
>> 0117
>> [61163.596940] RAX: ffda RBX:  RCX: 
>> 7fc976e03949
>> [61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
>> 5827
>> [61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
>> 0004
>> [61163.598842] R10: 00c363b0 R11: 0246 R12: 
>> 00400650
>> [61163.599456] R13: 7ffe72221ee0 R14:  R15: 
>> 
>> [61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 
>> 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 
>> 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
>> [61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
>> [61163.602376] CR2: ea0011943820
>> [61163.602767] ---[ end trace e4f81353a2d23232 ]---
>> [61163.603236] Kernel panic - not syncing: Fatal exception
>> [61163.603706] Kernel Offset: disabled
>>
>> This bug is triggered when pmd_present() returns true for non-present
>> hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
>> Using pmd_present() to determine present/non-present for hugetlb is
>> not correct, because pmd_present() checks multiple bits (not only
>> _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
>>
>> Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
>> Signed-off-by: Naoya Horiguchi 
>> Cc: [4.0+]
> 
> I think this is broken for s390. The page table entries look different from
> the segment table entries (pmds) on s390, e.g. they have the invalid bit at
> different places. Using pte functions on pmd does not work here.
> Gerald can you confirm.
> 


Hmmm, it looks like the s390 variant of huge_ptep_get() already
does the translation. So it's probably fine.



Re: [PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-22 Thread Michal Hocko
[CC Mike]

On Wed 22-03-17 11:31:38, Naoya Horiguchi wrote:
> I found the race condition which triggers the following bug when
> move_pages() and soft offline are called on a single hugetlb page
> concurrently.
> 
> [61163.578957] Soft offlining page 0x119400 at 0x7000
> [61163.580062] BUG: unable to handle kernel paging request at 
> ea0011943820
> [61163.580791] IP: follow_huge_pmd+0x143/0x190
> [61163.581203] PGD 7ffd2067
> [61163.581204] PUD 7ffd1067
> [61163.581471] PMD 0
> [61163.581723]
> [61163.582052] Oops:  [#1] SMP
> [61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
> parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs 
> libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix 
> serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror 
> dm_region_hash dm_log dm_mod [last unloaded: cap_check]
> [61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P 
>   OE   4.11.0-rc2-mm1+ #2
> [61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [61163.586627] task: 88007c951680 task.stack: c90004bd8000
> [61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
> [61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
> [61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
> 3000
> [61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
> 000465003e80
> [61163.589486] RBP: c90004bdbd18 R08:  R09: 
> 880138d34000
> [61163.590097] R10: ea000465 R11: 00c363b0 R12: 
> ea0011943800
> [61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
> 77ff8000
> [61163.591375] FS:  7fc977710740() GS:88007dc0() 
> knlGS:
> [61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
> [61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
> 001406f0
> [61163.593330] Call Trace:
> [61163.593556]  follow_page_mask+0x270/0x550
> [61163.593908]  SYSC_move_pages+0x4ea/0x8f0
> [61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
> [61163.594798]  SyS_move_pages+0xe/0x10
> [61163.595113]  do_syscall_64+0x67/0x180
> [61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
> [61163.595837] RIP: 0033:0x7fc976e03949
> [61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
> 0117
> [61163.596940] RAX: ffda RBX:  RCX: 
> 7fc976e03949
> [61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
> 5827
> [61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
> 0004
> [61163.598842] R10: 00c363b0 R11: 0246 R12: 
> 00400650
> [61163.599456] R13: 7ffe72221ee0 R14:  R15: 
> 
> [61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 
> 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 
> <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
> [61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
> [61163.602376] CR2: ea0011943820
> [61163.602767] ---[ end trace e4f81353a2d23232 ]---
> [61163.603236] Kernel panic - not syncing: Fatal exception
> [61163.603706] Kernel Offset: disabled
> 
> This bug is triggered when pmd_present() returns true for non-present
> hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
> Using pmd_present() to determine present/non-present for hugetlb is
> not correct, because pmd_present() checks multiple bits (not only
> _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
> 
> Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
> Signed-off-by: Naoya Horiguchi 
> Cc: [4.0+]
> ---
>  mm/hugetlb.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c 
> v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> index 3d0aab9..f501f14 100644
> --- v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c
> +++ v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> @@ -4651,6 +4651,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  {
>  	struct page *page = NULL;
>  	spinlock_t *ptl;
> +	pte_t pte;
>  retry:
>  	ptl = pmd_lockptr(mm, pmd);
>  	spin_lock(ptl);
> @@ -4660,12 +4661,13 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	 */
>  	if (!pmd_huge(*pmd))
>  		goto out;
> -	if (pmd_present(*pmd)) {
> +	pte = huge_ptep_get((pte_t *)pmd);
> +	if (pte_present(pte)) {
>  		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
>  		if (flags & FOLL_GET)
>  			get_page(page);

Re: [PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-22 Thread Christian Borntraeger
On 03/22/2017 03:31 AM, Naoya Horiguchi wrote:
> I found the race condition which triggers the following bug when
> move_pages() and soft offline are called on a single hugetlb page
> concurrently.
> 
> [61163.578957] Soft offlining page 0x119400 at 0x7000
> [61163.580062] BUG: unable to handle kernel paging request at 
> ea0011943820
> [61163.580791] IP: follow_huge_pmd+0x143/0x190
> [61163.581203] PGD 7ffd2067
> [61163.581204] PUD 7ffd1067
> [61163.581471] PMD 0
> [61163.581723]
> [61163.582052] Oops:  [#1] SMP
> [61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
> parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs 
> libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix 
> serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror 
> dm_region_hash dm_log dm_mod [last unloaded: cap_check]
> [61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P 
>   OE   4.11.0-rc2-mm1+ #2
> [61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [61163.586627] task: 88007c951680 task.stack: c90004bd8000
> [61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
> [61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
> [61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
> 3000
> [61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
> 000465003e80
> [61163.589486] RBP: c90004bdbd18 R08:  R09: 
> 880138d34000
> [61163.590097] R10: ea000465 R11: 00c363b0 R12: 
> ea0011943800
> [61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
> 77ff8000
> [61163.591375] FS:  7fc977710740() GS:88007dc0() 
> knlGS:
> [61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
> [61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
> 001406f0
> [61163.593330] Call Trace:
> [61163.593556]  follow_page_mask+0x270/0x550
> [61163.593908]  SYSC_move_pages+0x4ea/0x8f0
> [61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
> [61163.594798]  SyS_move_pages+0xe/0x10
> [61163.595113]  do_syscall_64+0x67/0x180
> [61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
> [61163.595837] RIP: 0033:0x7fc976e03949
> [61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
> 0117
> [61163.596940] RAX: ffda RBX:  RCX: 
> 7fc976e03949
> [61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
> 5827
> [61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
> 0004
> [61163.598842] R10: 00c363b0 R11: 0246 R12: 
> 00400650
> [61163.599456] R13: 7ffe72221ee0 R14:  R15: 
> 
> [61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 
> 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 
> <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
> [61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
> [61163.602376] CR2: ea0011943820
> [61163.602767] ---[ end trace e4f81353a2d23232 ]---
> [61163.603236] Kernel panic - not syncing: Fatal exception
> [61163.603706] Kernel Offset: disabled
> 
> This bug is triggered when pmd_present() returns true for non-present
> hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
> Using pmd_present() to determine present/non-present for hugetlb is
> not correct, because pmd_present() checks multiple bits (not only
> _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
> 
> Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
> Signed-off-by: Naoya Horiguchi 
> Cc: [4.0+]

I think this is broken for s390. The page table entries look different from
the segment table entries (pmds) on s390, e.g. they have the invalid bit in
different places. Using pte functions on a pmd does not work here.
Gerald, can you confirm?





> ---
>  mm/hugetlb.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c 
> v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> index 3d0aab9..f501f14 100644
> --- v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c
> +++ v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> @@ -4651,6 +4651,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  {
>  	struct page *page = NULL;
>  	spinlock_t *ptl;
> +	pte_t pte;
>  retry:
>  	ptl = pmd_lockptr(mm, pmd);
>  	spin_lock(ptl);
> @@ -4660,12 +4661,13 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
>  	 */
>  	if (!pmd_huge(*pmd))

Re: [PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-22 Thread Christian Borntraeger
On 03/22/2017 03:31 AM, Naoya Horiguchi wrote:
> I found the race condition which triggers the following bug when
> move_pages() and soft offline are called on a single hugetlb page
> concurrently.
> 
> [61163.578957] Soft offlining page 0x119400 at 0x7000
> [61163.580062] BUG: unable to handle kernel paging request at 
> ea0011943820
> [61163.580791] IP: follow_huge_pmd+0x143/0x190
> [61163.581203] PGD 7ffd2067
> [61163.581204] PUD 7ffd1067
> [61163.581471] PMD 0
> [61163.581723]
> [61163.582052] Oops:  [#1] SMP
> [61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
> parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs 
> libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix 
> serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror 
> dm_region_hash dm_log dm_mod [last unloaded: cap_check]
> [61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P 
>   OE   4.11.0-rc2-mm1+ #2
> [61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [61163.586627] task: 88007c951680 task.stack: c90004bd8000
> [61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
> [61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
> [61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
> 3000
> [61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
> 000465003e80
> [61163.589486] RBP: c90004bdbd18 R08:  R09: 
> 880138d34000
> [61163.590097] R10: ea000465 R11: 00c363b0 R12: 
> ea0011943800
> [61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
> 77ff8000
> [61163.591375] FS:  7fc977710740() GS:88007dc0() 
> knlGS:
> [61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
> [61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
> 001406f0
> [61163.593330] Call Trace:
> [61163.593556]  follow_page_mask+0x270/0x550
> [61163.593908]  SYSC_move_pages+0x4ea/0x8f0
> [61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
> [61163.594798]  SyS_move_pages+0xe/0x10
> [61163.595113]  do_syscall_64+0x67/0x180
> [61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
> [61163.595837] RIP: 0033:0x7fc976e03949
> [61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
> 0117
> [61163.596940] RAX: ffda RBX:  RCX: 
> 7fc976e03949
> [61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
> 5827
> [61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
> 0004
> [61163.598842] R10: 00c363b0 R11: 0246 R12: 
> 00400650
> [61163.599456] R13: 7ffe72221ee0 R14:  R15: 
> 
> [61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 
> 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 
> <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
> [61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
> [61163.602376] CR2: ea0011943820
> [61163.602767] ---[ end trace e4f81353a2d23232 ]---
> [61163.603236] Kernel panic - not syncing: Fatal exception
> [61163.603706] Kernel Offset: disabled
> 
> This bug is triggered when pmd_present() returns true for non-present
> hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
> Using pmd_present() to determine present/non-present for hugetlb is
> not correct, because pmd_present() checks multiple bits (not only
> _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
> 
> Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
> Signed-off-by: Naoya Horiguchi 
> Cc: [4.0+]

I think this is broken for s390. On s390 the page table entries look
different from the segment table entries (pmds); e.g. they have the invalid
bit in different places. Using pte functions on a pmd does not work here.
Gerald, can you confirm?

> ---
>  mm/hugetlb.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c 
> v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> index 3d0aab9..f501f14 100644
> --- v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c
> +++ v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> @@ -4651,6 +4651,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long 
> address,
>  {
>   struct page *page = NULL;
>   spinlock_t *ptl;
> + pte_t pte;
>  retry:
>   ptl = pmd_lockptr(mm, pmd);
>   spin_lock(ptl);
> @@ -4660,12 +4661,13 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long 
> address,
>*/
>   if (!pmd_huge(*pmd))
>   goto out;
> - if (pmd_present(*pmd)) 

Re: [PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-21 Thread Hillf Danton



On March 22, 2017 10:32 AM Naoya Horiguchi wrote: 
> 
> I found the race condition which triggers the following bug when
> move_pages() and soft offline are called on a single hugetlb page
> concurrently.
> 
> [61163.578957] Soft offlining page 0x119400 at 0x7000
> [61163.580062] BUG: unable to handle kernel paging request at 
> ea0011943820
> [61163.580791] IP: follow_huge_pmd+0x143/0x190
> [61163.581203] PGD 7ffd2067
> [61163.581204] PUD 7ffd1067
> [61163.581471] PMD 0
> [61163.581723]
> [61163.582052] Oops:  [#1] SMP
> [61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
> parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq
> ip_tables xfs libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel 
> ata_piix serio_raw libata virtio_pci 8139cp
> virtio_ring virtio
> mii floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: cap_check]
> [61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P 
>   OE   4.11.0-rc2-mm1+ #2
> [61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> [61163.586627] task: 88007c951680 task.stack: c90004bd8000
> [61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
> [61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
> [61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
> 3000
> [61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
> 000465003e80
> [61163.589486] RBP: c90004bdbd18 R08:  R09: 
> 880138d34000
> [61163.590097] R10: ea000465 R11: 00c363b0 R12: 
> ea0011943800
> [61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
> 77ff8000
> [61163.591375] FS:  7fc977710740() GS:88007dc0() 
> knlGS:
> [61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
> [61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
> 001406f0
> [61163.593330] Call Trace:
> [61163.593556]  follow_page_mask+0x270/0x550
> [61163.593908]  SYSC_move_pages+0x4ea/0x8f0
> [61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
> [61163.594798]  SyS_move_pages+0xe/0x10
> [61163.595113]  do_syscall_64+0x67/0x180
> [61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
> [61163.595837] RIP: 0033:0x7fc976e03949
> [61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
> 0117
> [61163.596940] RAX: ffda RBX:  RCX: 
> 7fc976e03949
> [61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
> 5827
> [61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
> 0004
> [61163.598842] R10: 00c363b0 R11: 0246 R12: 
> 00400650
> [61163.599456] R13: 7ffe72221ee0 R14:  R15: 
> 
> [61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 
> 4c 01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49
> 01 d4 f6 45 bc
> 04 74 90 <49> 8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
> [61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
> [61163.602376] CR2: ea0011943820
> [61163.602767] ---[ end trace e4f81353a2d23232 ]---
> [61163.603236] Kernel panic - not syncing: Fatal exception
> [61163.603706] Kernel Offset: disabled
> 
> This bug is triggered when pmd_present() returns true for non-present
> hugetlb, so fixing the present check in follow_huge_pmd() prevents it.
> Using pmd_present() to determine present/non-present for hugetlb is
> not correct, because pmd_present() checks multiple bits (not only
> _PAGE_PRESENT) for historical reason and it can misjudge hugetlb state.
> 
> Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
> Signed-off-by: Naoya Horiguchi 
> Cc: [4.0+]
> ---

Acked-by: Hillf Danton 

>  mm/hugetlb.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c 
> v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> index 3d0aab9..f501f14 100644
> --- v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c
> +++ v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
> @@ -4651,6 +4651,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long 
> address,
>  {
>   struct page *page = NULL;
>   spinlock_t *ptl;
> + pte_t pte;
>  retry:
>   ptl = pmd_lockptr(mm, pmd);
>   spin_lock(ptl);
> @@ -4660,12 +4661,13 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long 
> address,
>*/
>   if (!pmd_huge(*pmd))
>   goto out;
> - if (pmd_present(*pmd)) {
> + pte = huge_ptep_get((pte_t *)pmd);
> + if (pte_present(pte)) {
>   page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
>   if (flags & FOLL_GET)
>

[PATCH v1] mm, hugetlb: use pte_present() instead of pmd_present() in follow_huge_pmd()

2017-03-21 Thread Naoya Horiguchi
I found a race condition that triggers the following bug when
move_pages() and soft offline are called concurrently on a single
hugetlb page.

[61163.578957] Soft offlining page 0x119400 at 0x7000
[61163.580062] BUG: unable to handle kernel paging request at 
ea0011943820
[61163.580791] IP: follow_huge_pmd+0x143/0x190
[61163.581203] PGD 7ffd2067
[61163.581204] PUD 7ffd1067
[61163.581471] PMD 0
[61163.581723]
[61163.582052] Oops:  [#1] SMP
[61163.582349] Modules linked in: binfmt_misc ppdev virtio_balloon 
parport_pc pcspkr i2c_piix4 parport i2c_core acpi_cpufreq ip_tables xfs 
libcrc32c ata_generic pata_acpi virtio_blk 8139too crc32c_intel ata_piix 
serio_raw libata virtio_pci 8139cp virtio_ring virtio mii floppy dm_mirror 
dm_region_hash dm_log dm_mod [last unloaded: cap_check]
[61163.585130] CPU: 0 PID: 22573 Comm: iterate_numa_mo Tainted: P   
OE   4.11.0-rc2-mm1+ #2
[61163.586055] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[61163.586627] task: 88007c951680 task.stack: c90004bd8000
[61163.587181] RIP: 0010:follow_huge_pmd+0x143/0x190
[61163.587622] RSP: 0018:c90004bdbcd0 EFLAGS: 00010202
[61163.588096] RAX: 000465003e80 RBX: ea0004e34d30 RCX: 
3000
[61163.588818] RDX: 11943800 RSI: 00080001 RDI: 
000465003e80
[61163.589486] RBP: c90004bdbd18 R08:  R09: 
880138d34000
[61163.590097] R10: ea000465 R11: 00c363b0 R12: 
ea0011943800
[61163.590751] R13: 8801b8d34000 R14: ea00 R15: 
77ff8000
[61163.591375] FS:  7fc977710740() GS:88007dc0() 
knlGS:
[61163.592068] CS:  0010 DS:  ES:  CR0: 80050033
[61163.592627] CR2: ea0011943820 CR3: 7a746000 CR4: 
001406f0
[61163.593330] Call Trace:
[61163.593556]  follow_page_mask+0x270/0x550
[61163.593908]  SYSC_move_pages+0x4ea/0x8f0
[61163.594253]  ? lru_cache_add_active_or_unevictable+0x4b/0xd0
[61163.594798]  SyS_move_pages+0xe/0x10
[61163.595113]  do_syscall_64+0x67/0x180
[61163.595434]  entry_SYSCALL64_slow_path+0x25/0x25
[61163.595837] RIP: 0033:0x7fc976e03949
[61163.596148] RSP: 002b:7ffe72221d88 EFLAGS: 0246 ORIG_RAX: 
0117
[61163.596940] RAX: ffda RBX:  RCX: 
7fc976e03949
[61163.597567] RDX: 00c22390 RSI: 1400 RDI: 
5827
[61163.598177] RBP: 7ffe72221e00 R08: 00c2c3a0 R09: 
0004
[61163.598842] R10: 00c363b0 R11: 0246 R12: 
00400650
[61163.599456] R13: 7ffe72221ee0 R14:  R15: 

[61163.600067] Code: 81 e4 ff ff 1f 00 48 21 c2 49 c1 ec 0c 48 c1 ea 0c 4c 
01 e2 49 bc 00 00 00 00 00 ea ff ff 48 c1 e2 06 49 01 d4 f6 45 bc 04 74 90 <49> 
8b 7c 24 20 40 f6 c7 01 75 2b 4c 89 e7 8b 47 1c 85 c0 7e 2a
[61163.601845] RIP: follow_huge_pmd+0x143/0x190 RSP: c90004bdbcd0
[61163.602376] CR2: ea0011943820
[61163.602767] ---[ end trace e4f81353a2d23232 ]---
[61163.603236] Kernel panic - not syncing: Fatal exception
[61163.603706] Kernel Offset: disabled

This bug is triggered when pmd_present() returns true for a non-present
hugetlb entry, so fixing the present check in follow_huge_pmd() prevents it.
Using pmd_present() to determine whether a hugetlb entry is present is
not correct, because pmd_present() checks multiple bits (not only
_PAGE_PRESENT) for historical reasons and can therefore misjudge the
hugetlb state.

Fixes: e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()")
Signed-off-by: Naoya Horiguchi 
Cc: [4.0+]
---
 mm/hugetlb.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
index 3d0aab9..f501f14 100644
--- v4.11-rc2-mmotm-2017-03-17-15-26/mm/hugetlb.c
+++ v4.11-rc2-mmotm-2017-03-17-15-26_patched/mm/hugetlb.c
@@ -4651,6 +4651,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 {
 	struct page *page = NULL;
 	spinlock_t *ptl;
+	pte_t pte;
 retry:
 	ptl = pmd_lockptr(mm, pmd);
 	spin_lock(ptl);
@@ -4660,12 +4661,13 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address,
 	 */
 	if (!pmd_huge(*pmd))
 		goto out;
-	if (pmd_present(*pmd)) {
+	pte = huge_ptep_get((pte_t *)pmd);
+	if (pte_present(pte)) {
 		page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
 		if (flags & FOLL_GET)
 			get_page(page);
 	} else {
-		if (is_hugetlb_entry_migration(huge_ptep_get((pte_t *)pmd))) {
+		if (is_hugetlb_entry_migration(pte)) {
 			spin_unlock(ptl);
 			__migration_entry_wait(mm,