Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-22 Thread Kirill A. Shutemov
On Wed, Jun 22, 2016 at 02:42:25AM +, Naoya Horiguchi wrote:
> On Wed, Jun 22, 2016 at 01:37:00AM +, Horiguchi Naoya(堀口 直也) wrote:
> > On Tue, Jun 21, 2016 at 06:04:33PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > > > > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > > > > + unsigned long address, struct page 
> > > > > *page)
> > > > > +{
> > > > > + pgd_t *pgd;
> > > > > + pud_t *pud;
> > > > > + pmd_t *pmd;
> > > > > +
> > > > > + pgd = pgd_offset(vma->vm_mm, address);
> > > > > + if (!pgd_present(*pgd))
> > > > > + return;
> > > > > +
> > > > > + pud = pud_offset(pgd, address);
> > > > > + if (!pud_present(*pud))
> > > > > + return;
> > > > > +
> > > > > + pmd = pmd_offset(pud, address);
> > > > > + __split_huge_pmd(vma, pmd, address, page, true);
> > > > >  }
> > > > 
> > > > I don't see a reason to introduce new function. Just move the page
> > > > check under ptl from split_huge_pmd_address() and that should be enough.
> > 
> > Sorry for my slow response (I was offline yesterday.)
> > 
> > My point of separating function is to avoid checking pmd_present outside ptl
> > just for freeze=true case (I didn't want affect other path,
> > i.e. from vma_adjust_trans_huge().)
> > But I think that the new function is unnecessary if we move the following
> > part of split_huge_pmd_address() into ptl,
> > 
> > if (!pmd_present(*pmd) || (!pmd_trans_huge(*pmd) && 
> > !pmd_devmap(*pmd)))
> > return;
> > 
> > Does it make sense?
> > 
> > > > Or am I missing something?
> > > 
> > > I'm talking about something like patch below. Could you test it?
> > 
> > Thanks, with this patch my 3-hour testing doesn't trigger the problem,
> > so it works. But I feel it's weird because I think that the source of the
> > race is "if (!pmd_present)" check in split_huge_pmd_address() called 
> > outside ptl.
> > Your patch doesn't change that part, so I'm not sure why this fix works.
> 
> Hmm, after sending previous email, I found the bug triggered with your patch.
> So I updated the patch with removing the prechecks.
> 
> Thanks,
> Naoya Horiguchi
> ---
> From: Naoya Horiguchi 
> Date: Wed, 22 Jun 2016 11:34:47 +0900
> Subject: [PATCH] mm: thp: move pmd check inside ptl for freeze_page()
> 
> I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
> a testcase with 3 processes:
>   - process 1: keep writing thp,
>   - process 2: keep clearing soft-dirty bits from virtual address of process 1
>   - process 3: call migratepages for process 1,
> 
> The kernel message is like this:
> 
>   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
>   invalid opcode:  [#1] SMP
>   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
> tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
> virtio_ring virtio
>   CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
>   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>   task: 88003732 ti: 88007cdd task.ti: 88007cdd
>   RIP: 0010:[]  [] 
> split_huge_page_to_list+0x496/0x590
>   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
>   RAX: 0001 RBX: 88007c7b88c0 RCX: 
>   RDX:  RSI: 00070200 RDI: ea0003188000
>   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
>   R10: 8800 R11: c01f R12: ea0003188000
>   R13: ea0003188000 R14:  R15: 0480
>   FS:  7f8ec241d740() GS:88007dc0() 
> knlGS: CS:  0010 DS:  ES:  CR0: 
> 80050033
>   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
>   Stack:
>8139ef6d ea00031c6280 88011ffec000 
>7040 7020 88007cdd3d08 8800dbbe3008
>0480 88007cdd3c20 811dd0b1 88007cdd3d68
>   Call Trace:
>[] ? list_del+0xd/0x30
>[] queue_pages_pte_range+0x4d1/0x590
>[] __walk_page_range+0x204/0x4e0
>[] walk_page_range+0x71/0xf0
>[] queue_pages_range+0x75/0x90
>[] ? queue_pages_hugetlb+0x190/0x190
>[] ? new_node_page+0xc0/0xc0
>[] ? change_prot_numa+0x40/0x40
>[] migrate_to_node+0x71/0xd0
>[] do_migrate_pages+0x1c3/0x210
>[] SyS_migrate_pages+0x261/0x290
>[] entry_SYSCALL_64_fastpath+0x1a/0xa4
>   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 
> c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 
> 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
>   RIP  [] split_huge_page_to_list+0x496/0x590
>RSP 
> 
> I'm not sure of the full scenario of the 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-22 Thread Kirill A. Shutemov
On Wed, Jun 22, 2016 at 02:42:25AM +, Naoya Horiguchi wrote:
> On Wed, Jun 22, 2016 at 01:37:00AM +, Horiguchi Naoya(堀口 直也) wrote:
> > On Tue, Jun 21, 2016 at 06:04:33PM +0300, Kirill A. Shutemov wrote:
> > > On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > > > > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > > > > + unsigned long address, struct page 
> > > > > *page)
> > > > > +{
> > > > > + pgd_t *pgd;
> > > > > + pud_t *pud;
> > > > > + pmd_t *pmd;
> > > > > +
> > > > > + pgd = pgd_offset(vma->vm_mm, address);
> > > > > + if (!pgd_present(*pgd))
> > > > > + return;
> > > > > +
> > > > > + pud = pud_offset(pgd, address);
> > > > > + if (!pud_present(*pud))
> > > > > + return;
> > > > > +
> > > > > + pmd = pmd_offset(pud, address);
> > > > > + __split_huge_pmd(vma, pmd, address, page, true);
> > > > >  }
> > > > 
> > > > I don't see a reason to introduce new function. Just move the page
> > > > check under ptl from split_huge_pmd_address() and that should be enough.
> > 
> > Sorry for my slow response (I was offline yesterday.)
> > 
> > My point of separating function is to avoid checking pmd_present outside ptl
> > just for freeze=true case (I didn't want affect other path,
> > i.e. from vma_adjust_trans_huge().)
> > But I think that the new function is unnecessary if we move the following
> > part of split_huge_pmd_address() into ptl,
> > 
> > if (!pmd_present(*pmd) || (!pmd_trans_huge(*pmd) && 
> > !pmd_devmap(*pmd)))
> > return;
> > 
> > Does it make sense?
> > 
> > > > Or am I missing something?
> > > 
> > > I'm talking about something like patch below. Could you test it?
> > 
> > Thanks, with this patch my 3-hour testing doesn't trigger the problem,
> > so it works. But I feel it's weird because I think that the source of the
> > race is "if (!pmd_present)" check in split_huge_pmd_address() called 
> > outside ptl.
> > Your patch doesn't change that part, so I'm not sure why this fix works.
> 
> Hmm, after sending previous email, I found the bug triggered with your patch.
> So I updated the patch with removing the prechecks.
> 
> Thanks,
> Naoya Horiguchi
> ---
> From: Naoya Horiguchi 
> Date: Wed, 22 Jun 2016 11:34:47 +0900
> Subject: [PATCH] mm: thp: move pmd check inside ptl for freeze_page()
> 
> I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
> a testcase with 3 processes:
>   - process 1: keep writing thp,
>   - process 2: keep clearing soft-dirty bits from virtual address of process 1
>   - process 3: call migratepages for process 1,
> 
> The kernel message is like this:
> 
>   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
>   invalid opcode:  [#1] SMP
>   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
> tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
> virtio_ring virtio
>   CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
>   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>   task: 88003732 ti: 88007cdd task.ti: 88007cdd
>   RIP: 0010:[]  [] 
> split_huge_page_to_list+0x496/0x590
>   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
>   RAX: 0001 RBX: 88007c7b88c0 RCX: 
>   RDX:  RSI: 00070200 RDI: ea0003188000
>   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
>   R10: 8800 R11: c01f R12: ea0003188000
>   R13: ea0003188000 R14:  R15: 0480
>   FS:  7f8ec241d740() GS:88007dc0() 
> knlGS: CS:  0010 DS:  ES:  CR0: 
> 80050033
>   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
>   Stack:
>8139ef6d ea00031c6280 88011ffec000 
>7040 7020 88007cdd3d08 8800dbbe3008
>0480 88007cdd3c20 811dd0b1 88007cdd3d68
>   Call Trace:
>[] ? list_del+0xd/0x30
>[] queue_pages_pte_range+0x4d1/0x590
>[] __walk_page_range+0x204/0x4e0
>[] walk_page_range+0x71/0xf0
>[] queue_pages_range+0x75/0x90
>[] ? queue_pages_hugetlb+0x190/0x190
>[] ? new_node_page+0xc0/0xc0
>[] ? change_prot_numa+0x40/0x40
>[] migrate_to_node+0x71/0xd0
>[] do_migrate_pages+0x1c3/0x210
>[] SyS_migrate_pages+0x261/0x290
>[] entry_SYSCALL_64_fastpath+0x1a/0xa4
>   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 
> c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 
> 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
>   RIP  [] split_huge_page_to_list+0x496/0x590
>RSP 
> 
> I'm not sure of the full scenario of the reproduction, but my debug 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-21 Thread Naoya Horiguchi
On Wed, Jun 22, 2016 at 01:37:00AM +, Horiguchi Naoya(堀口 直也) wrote:
> On Tue, Jun 21, 2016 at 06:04:33PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > > > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > > > +   unsigned long address, struct page 
> > > > *page)
> > > > +{
> > > > +   pgd_t *pgd;
> > > > +   pud_t *pud;
> > > > +   pmd_t *pmd;
> > > > +
> > > > +   pgd = pgd_offset(vma->vm_mm, address);
> > > > +   if (!pgd_present(*pgd))
> > > > +   return;
> > > > +
> > > > +   pud = pud_offset(pgd, address);
> > > > +   if (!pud_present(*pud))
> > > > +   return;
> > > > +
> > > > +   pmd = pmd_offset(pud, address);
> > > > +   __split_huge_pmd(vma, pmd, address, page, true);
> > > >  }
> > > 
> > > I don't see a reason to introduce new function. Just move the page
> > > check under ptl from split_huge_pmd_address() and that should be enough.
> 
> Sorry for my slow response (I was offline yesterday.)
> 
> My point of separating function is to avoid checking pmd_present outside ptl
> just for freeze=true case (I didn't want affect other path,
> i.e. from vma_adjust_trans_huge().)
> But I think that the new function is unnecessary if we move the following
> part of split_huge_pmd_address() into ptl,
> 
> if (!pmd_present(*pmd) || (!pmd_trans_huge(*pmd) && 
> !pmd_devmap(*pmd)))
> return;
> 
> Does it make sense?
> 
> > > Or am I missing something?
> > 
> > I'm talking about something like patch below. Could you test it?
> 
> Thanks, with this patch my 3-hour testing doesn't trigger the problem,
> so it works. But I feel it's weird because I think that the source of the
> race is "if (!pmd_present)" check in split_huge_pmd_address() called outside 
> ptl.
> Your patch doesn't change that part, so I'm not sure why this fix works.

Hmm, after sending previous email, I found the bug triggered with your patch.
So I updated the patch with removing the prechecks.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi 
Date: Wed, 22 Jun 2016 11:34:47 +0900
Subject: [PATCH] mm: thp: move pmd check inside ptl for freeze_page()

I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
a testcase with 3 processes:
  - process 1: keep writing thp,
  - process 2: keep clearing soft-dirty bits from virtual address of process 1
  - process 3: call migratepages for process 1,

The kernel message is like this:

  kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
  invalid opcode:  [#1] SMP
  Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
virtio_ring virtio
  CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: 88003732 ti: 88007cdd task.ti: 88007cdd
  RIP: 0010:[]  [] 
split_huge_page_to_list+0x496/0x590
  RSP: 0018:88007cdd3b70  EFLAGS: 00010202
  RAX: 0001 RBX: 88007c7b88c0 RCX: 
  RDX:  RSI: 00070200 RDI: ea0003188000
  RBP: 88007cdd3bb8 R08: 0001 R09: 3000
  R10: 8800 R11: c01f R12: ea0003188000
  R13: ea0003188000 R14:  R15: 0480
  FS:  7f8ec241d740() GS:88007dc0() knlGS:  
   CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
  Stack:
   8139ef6d ea00031c6280 88011ffec000 
   7040 7020 88007cdd3d08 8800dbbe3008
   0480 88007cdd3c20 811dd0b1 88007cdd3d68
  Call Trace:
   [] ? list_del+0xd/0x30
   [] queue_pages_pte_range+0x4d1/0x590
   [] __walk_page_range+0x204/0x4e0
   [] walk_page_range+0x71/0xf0
   [] queue_pages_range+0x75/0x90
   [] ? queue_pages_hugetlb+0x190/0x190
   [] ? new_node_page+0xc0/0xc0
   [] ? change_prot_numa+0x40/0x40
   [] migrate_to_node+0x71/0xd0
   [] do_migrate_pages+0x1c3/0x210
   [] SyS_migrate_pages+0x261/0x290
   [] entry_SYSCALL_64_fastpath+0x1a/0xa4
  Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 
b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 
00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
  RIP  [] split_huge_page_to_list+0x496/0x590
   RSP 

I'm not sure of the full scenario of the reproduction, but my debug showed that
split_huge_pmd_address(freeze=true) returned without running main code of pmd
splitting because pmd_present(*pmd) in precheck somehow returned 0.
If this happens, the subsequent try_to_unmap() fails and returns non-zero
(because 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-21 Thread Naoya Horiguchi
On Wed, Jun 22, 2016 at 01:37:00AM +, Horiguchi Naoya(堀口 直也) wrote:
> On Tue, Jun 21, 2016 at 06:04:33PM +0300, Kirill A. Shutemov wrote:
> > On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > > > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > > > +   unsigned long address, struct page 
> > > > *page)
> > > > +{
> > > > +   pgd_t *pgd;
> > > > +   pud_t *pud;
> > > > +   pmd_t *pmd;
> > > > +
> > > > +   pgd = pgd_offset(vma->vm_mm, address);
> > > > +   if (!pgd_present(*pgd))
> > > > +   return;
> > > > +
> > > > +   pud = pud_offset(pgd, address);
> > > > +   if (!pud_present(*pud))
> > > > +   return;
> > > > +
> > > > +   pmd = pmd_offset(pud, address);
> > > > +   __split_huge_pmd(vma, pmd, address, page, true);
> > > >  }
> > > 
> > > I don't see a reason to introduce new function. Just move the page
> > > check under ptl from split_huge_pmd_address() and that should be enough.
> 
> Sorry for my slow response (I was offline yesterday.)
> 
> My point of separating function is to avoid checking pmd_present outside ptl
> just for freeze=true case (I didn't want affect other path,
> i.e. from vma_adjust_trans_huge().)
> But I think that the new function is unnecessary if we move the following
> part of split_huge_pmd_address() into ptl,
> 
> if (!pmd_present(*pmd) || (!pmd_trans_huge(*pmd) && 
> !pmd_devmap(*pmd)))
> return;
> 
> Does it make sense?
> 
> > > Or am I missing something?
> > 
> > I'm talking about something like patch below. Could you test it?
> 
> Thanks, with this patch my 3-hour testing doesn't trigger the problem,
> so it works. But I feel it's weird because I think that the source of the
> race is "if (!pmd_present)" check in split_huge_pmd_address() called outside 
> ptl.
> Your patch doesn't change that part, so I'm not sure why this fix works.

Hmm, after sending previous email, I found the bug triggered with your patch.
So I updated the patch with removing the prechecks.

Thanks,
Naoya Horiguchi
---
From: Naoya Horiguchi 
Date: Wed, 22 Jun 2016 11:34:47 +0900
Subject: [PATCH] mm: thp: move pmd check inside ptl for freeze_page()

I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
a testcase with 3 processes:
  - process 1: keep writing thp,
  - process 2: keep clearing soft-dirty bits from virtual address of process 1
  - process 3: call migratepages for process 1,

The kernel message is like this:

  kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
  invalid opcode:  [#1] SMP
  Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
virtio_ring virtio
  CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: 88003732 ti: 88007cdd task.ti: 88007cdd
  RIP: 0010:[]  [] 
split_huge_page_to_list+0x496/0x590
  RSP: 0018:88007cdd3b70  EFLAGS: 00010202
  RAX: 0001 RBX: 88007c7b88c0 RCX: 
  RDX:  RSI: 00070200 RDI: ea0003188000
  RBP: 88007cdd3bb8 R08: 0001 R09: 3000
  R10: 8800 R11: c01f R12: ea0003188000
  R13: ea0003188000 R14:  R15: 0480
  FS:  7f8ec241d740() GS:88007dc0() knlGS:  
   CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
  Stack:
   8139ef6d ea00031c6280 88011ffec000 
   7040 7020 88007cdd3d08 8800dbbe3008
   0480 88007cdd3c20 811dd0b1 88007cdd3d68
  Call Trace:
   [] ? list_del+0xd/0x30
   [] queue_pages_pte_range+0x4d1/0x590
   [] __walk_page_range+0x204/0x4e0
   [] walk_page_range+0x71/0xf0
   [] queue_pages_range+0x75/0x90
   [] ? queue_pages_hugetlb+0x190/0x190
   [] ? new_node_page+0xc0/0xc0
   [] ? change_prot_numa+0x40/0x40
   [] migrate_to_node+0x71/0xd0
   [] do_migrate_pages+0x1c3/0x210
   [] SyS_migrate_pages+0x261/0x290
   [] entry_SYSCALL_64_fastpath+0x1a/0xa4
  Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 
b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 
00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
  RIP  [] split_huge_page_to_list+0x496/0x590
   RSP 

I'm not sure of the full scenario of the reproduction, but my debug showed that
split_huge_pmd_address(freeze=true) returned without running main code of pmd
splitting because pmd_present(*pmd) in precheck somehow returned 0.
If this happens, the subsequent try_to_unmap() fails and returns non-zero
(because page_mapcount() still > 0), and 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-21 Thread Naoya Horiguchi
On Tue, Jun 21, 2016 at 06:04:33PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > > + unsigned long address, struct page *page)
> > > +{
> > > + pgd_t *pgd;
> > > + pud_t *pud;
> > > + pmd_t *pmd;
> > > +
> > > + pgd = pgd_offset(vma->vm_mm, address);
> > > + if (!pgd_present(*pgd))
> > > + return;
> > > +
> > > + pud = pud_offset(pgd, address);
> > > + if (!pud_present(*pud))
> > > + return;
> > > +
> > > + pmd = pmd_offset(pud, address);
> > > + __split_huge_pmd(vma, pmd, address, page, true);
> > >  }
> > 
> > I don't see a reason to introduce new function. Just move the page
> > check under ptl from split_huge_pmd_address() and that should be enough.

Sorry for my slow response (I was offline yesterday.)

My point of separating function is to avoid checking pmd_present outside ptl
just for freeze=true case (I didn't want affect other path,
i.e. from vma_adjust_trans_huge().)
But I think that the new function is unnecessary if we move the following
part of split_huge_pmd_address() into ptl,

if (!pmd_present(*pmd) || (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)))
return;

Does it make sense?

> > Or am I missing something?
> 
> I'm talking about something like patch below. Could you test it?

Thanks, with this patch my 3-hour testing doesn't trigger the problem,
so it works. But I feel it's weird because I think that the source of the
race is "if (!pmd_present)" check in split_huge_pmd_address() called outside 
ptl.
Your patch doesn't change that part, so I'm not sure why this fix works.

> If it works fine to you, feel free to submit with my Signed-off-by.
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index eb810816bbc6..92ce91c03cd0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -98,7 +98,7 @@ static inline int split_huge_page(struct page *page)
>  void deferred_split_huge_page(struct page *page);
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> - unsigned long address, bool freeze);
> + unsigned long address, bool freeze, struct page *page);
>  
>  #define split_huge_pmd(__vma, __pmd, __address)  
> \
>   do {\
> @@ -106,7 +106,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>   if (pmd_trans_huge(*pmd)\
>   || pmd_devmap(*pmd))\
>   __split_huge_pmd(__vma, __pmd, __address,   \
> - false); \
> + false, NULL);   \
>   }  while (0)
>  
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index eaf3a4a655a6..2297aa41581e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1638,7 +1638,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>  }
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> - unsigned long address, bool freeze)
> + unsigned long address, bool freeze, struct page *page)
>  {
>   spinlock_t *ptl;
>   struct mm_struct *mm = vma->vm_mm;
> @@ -1646,8 +1646,17 @@ void __split_huge_pmd(struct vm_area_struct *vma, 
> pmd_t *pmd,
>  
>   mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
>   ptl = pmd_lock(mm, pmd);
> +
> + /*
> +  * If caller asks to setup a migration entries, we need a page to check
> +  * pmd against. Otherwise we can end up replacing wrong page.
> +  */
> + VM_BUG_ON(freeze && !page);
> + if (page && page != pmd_page(*pmd))
> + goto out;
> +
>   if (pmd_trans_huge(*pmd)) {
> - struct page *page = pmd_page(*pmd);
> + page = pmd_page(*pmd);
>   if (PageMlocked(page))
>   clear_page_mlock(page);
>   } else if (!pmd_devmap(*pmd))
> @@ -1678,20 +1687,12 @@ void split_huge_pmd_address(struct vm_area_struct 
> *vma, unsigned long address,
>   return;
>  
>   /*
> -  * If caller asks to setup a migration entries, we need a page to check
> -  * pmd against. Otherwise we can end up replacing wrong page.
> -  */
> - VM_BUG_ON(freeze && !page);
> - if (page && page != pmd_page(*pmd))
> - return;
> -
> - /*
>* Caller holds the mmap_sem write mode or the anon_vma lock,
>* so a huge pmd cannot materialize from under us (khugepaged
>* holds both the mmap_sem write mode and the anon_vma lock
>* write mode).

Not a big issue, but this comment about mmap_sem seems relevant only when
called from vma_adjust_trans_huge(), so might need some update?

>*/
> 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-21 Thread Naoya Horiguchi
On Tue, Jun 21, 2016 at 06:04:33PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > > + unsigned long address, struct page *page)
> > > +{
> > > + pgd_t *pgd;
> > > + pud_t *pud;
> > > + pmd_t *pmd;
> > > +
> > > + pgd = pgd_offset(vma->vm_mm, address);
> > > + if (!pgd_present(*pgd))
> > > + return;
> > > +
> > > + pud = pud_offset(pgd, address);
> > > + if (!pud_present(*pud))
> > > + return;
> > > +
> > > + pmd = pmd_offset(pud, address);
> > > + __split_huge_pmd(vma, pmd, address, page, true);
> > >  }
> > 
> > I don't see a reason to introduce new function. Just move the page
> > check under ptl from split_huge_pmd_address() and that should be enough.

Sorry for my slow response (I was offline yesterday.)

My point of separating function is to avoid checking pmd_present outside ptl
just for freeze=true case (I didn't want affect other path,
i.e. from vma_adjust_trans_huge().)
But I think that the new function is unnecessary if we move the following
part of split_huge_pmd_address() into ptl,

if (!pmd_present(*pmd) || (!pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)))
return;

Does it make sense?

> > Or am I missing something?
> 
> I'm talking about something like patch below. Could you test it?

Thanks, with this patch my 3-hour testing doesn't trigger the problem,
so it works. But I feel it's weird because I think that the source of the
race is "if (!pmd_present)" check in split_huge_pmd_address() called outside 
ptl.
Your patch doesn't change that part, so I'm not sure why this fix works.

> If it works fine to you, feel free to submit with my Signed-off-by.
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index eb810816bbc6..92ce91c03cd0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -98,7 +98,7 @@ static inline int split_huge_page(struct page *page)
>  void deferred_split_huge_page(struct page *page);
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> - unsigned long address, bool freeze);
> + unsigned long address, bool freeze, struct page *page);
>  
>  #define split_huge_pmd(__vma, __pmd, __address)  
> \
>   do {\
> @@ -106,7 +106,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>   if (pmd_trans_huge(*pmd)\
>   || pmd_devmap(*pmd))\
>   __split_huge_pmd(__vma, __pmd, __address,   \
> - false); \
> + false, NULL);   \
>   }  while (0)
>  
>  
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index eaf3a4a655a6..2297aa41581e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1638,7 +1638,7 @@ static void __split_huge_pmd_locked(struct 
> vm_area_struct *vma, pmd_t *pmd,
>  }
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> - unsigned long address, bool freeze)
> + unsigned long address, bool freeze, struct page *page)
>  {
>   spinlock_t *ptl;
>   struct mm_struct *mm = vma->vm_mm;
> @@ -1646,8 +1646,17 @@ void __split_huge_pmd(struct vm_area_struct *vma, 
> pmd_t *pmd,
>  
>   mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
>   ptl = pmd_lock(mm, pmd);
> +
> + /*
> +  * If caller asks to setup a migration entries, we need a page to check
> +  * pmd against. Otherwise we can end up replacing wrong page.
> +  */
> + VM_BUG_ON(freeze && !page);
> + if (page && page != pmd_page(*pmd))
> + goto out;
> +
>   if (pmd_trans_huge(*pmd)) {
> - struct page *page = pmd_page(*pmd);
> + page = pmd_page(*pmd);
>   if (PageMlocked(page))
>   clear_page_mlock(page);
>   } else if (!pmd_devmap(*pmd))
> @@ -1678,20 +1687,12 @@ void split_huge_pmd_address(struct vm_area_struct 
> *vma, unsigned long address,
>   return;
>  
>   /*
> -  * If caller asks to setup a migration entries, we need a page to check
> -  * pmd against. Otherwise we can end up replacing wrong page.
> -  */
> - VM_BUG_ON(freeze && !page);
> - if (page && page != pmd_page(*pmd))
> - return;
> -
> - /*
>* Caller holds the mmap_sem write mode or the anon_vma lock,
>* so a huge pmd cannot materialize from under us (khugepaged
>* holds both the mmap_sem write mode and the anon_vma lock
>* write mode).

Not a big issue, but this comment about mmap_sem seems relevant only when
called from vma_adjust_trans_huge(), so might need some update?

>*/
> 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-21 Thread Kirill A. Shutemov
On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > +   unsigned long address, struct page *page)
> > +{
> > +   pgd_t *pgd;
> > +   pud_t *pud;
> > +   pmd_t *pmd;
> > +
> > +   pgd = pgd_offset(vma->vm_mm, address);
> > +   if (!pgd_present(*pgd))
> > +   return;
> > +
> > +   pud = pud_offset(pgd, address);
> > +   if (!pud_present(*pud))
> > +   return;
> > +
> > +   pmd = pmd_offset(pud, address);
> > +   __split_huge_pmd(vma, pmd, address, page, true);
> >  }
> 
> I don't see a reason to introduce new function. Just move the page
> check under ptl from split_huge_pmd_address() and that should be enough.
> 
> Or am I missing something?

I'm talking about something like patch below. Could you test it?

If it works fine to you, feel free to submit with my Signed-off-by.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index eb810816bbc6..92ce91c03cd0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -98,7 +98,7 @@ static inline int split_huge_page(struct page *page)
 void deferred_split_huge_page(struct page *page);
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long address, bool freeze);
+   unsigned long address, bool freeze, struct page *page);
 
 #define split_huge_pmd(__vma, __pmd, __address)
\
do {\
@@ -106,7 +106,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
if (pmd_trans_huge(*pmd)\
|| pmd_devmap(*pmd))\
__split_huge_pmd(__vma, __pmd, __address,   \
-   false); \
+   false, NULL);   \
}  while (0)
 
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eaf3a4a655a6..2297aa41581e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1638,7 +1638,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long address, bool freeze)
+   unsigned long address, bool freeze, struct page *page)
 {
spinlock_t *ptl;
struct mm_struct *mm = vma->vm_mm;
@@ -1646,8 +1646,17 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
ptl = pmd_lock(mm, pmd);
+
+   /*
+* If caller asks to setup a migration entries, we need a page to check
+* pmd against. Otherwise we can end up replacing wrong page.
+*/
+   VM_BUG_ON(freeze && !page);
+   if (page && page != pmd_page(*pmd))
+   goto out;
+
if (pmd_trans_huge(*pmd)) {
-   struct page *page = pmd_page(*pmd);
+   page = pmd_page(*pmd);
if (PageMlocked(page))
clear_page_mlock(page);
} else if (!pmd_devmap(*pmd))
@@ -1678,20 +1687,12 @@ void split_huge_pmd_address(struct vm_area_struct *vma, 
unsigned long address,
return;
 
/*
-* If caller asks to setup a migration entries, we need a page to check
-* pmd against. Otherwise we can end up replacing wrong page.
-*/
-   VM_BUG_ON(freeze && !page);
-   if (page && page != pmd_page(*pmd))
-   return;
-
-   /*
 * Caller holds the mmap_sem write mode or the anon_vma lock,
 * so a huge pmd cannot materialize from under us (khugepaged
 * holds both the mmap_sem write mode and the anon_vma lock
 * write mode).
 */
-   __split_huge_pmd(vma, pmd, address, freeze);
+   __split_huge_pmd(vma, pmd, address, freeze, page);
 }
 
 void vma_adjust_trans_huge(struct vm_area_struct *vma,
-- 
 Kirill A. Shutemov


Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-21 Thread Kirill A. Shutemov
On Mon, Jun 20, 2016 at 12:32:01PM +0300, Kirill A. Shutemov wrote:
> > +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> > +   unsigned long address, struct page *page)
> > +{
> > +   pgd_t *pgd;
> > +   pud_t *pud;
> > +   pmd_t *pmd;
> > +
> > +   pgd = pgd_offset(vma->vm_mm, address);
> > +   if (!pgd_present(*pgd))
> > +   return;
> > +
> > +   pud = pud_offset(pgd, address);
> > +   if (!pud_present(*pud))
> > +   return;
> > +
> > +   pmd = pmd_offset(pud, address);
> > +   __split_huge_pmd(vma, pmd, address, page, true);
> >  }
> 
> I don't see a reason to introduce new function. Just move the page
> check under ptl from split_huge_pmd_address() and that should be enough.
> 
> Or am I missing something?

I'm talking about something like patch below. Could you test it?

If it works fine to you, feel free to submit with my Signed-off-by.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index eb810816bbc6..92ce91c03cd0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -98,7 +98,7 @@ static inline int split_huge_page(struct page *page)
 void deferred_split_huge_page(struct page *page);
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long address, bool freeze);
+   unsigned long address, bool freeze, struct page *page);
 
 #define split_huge_pmd(__vma, __pmd, __address)
\
do {\
@@ -106,7 +106,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
if (pmd_trans_huge(*pmd)\
|| pmd_devmap(*pmd))\
__split_huge_pmd(__vma, __pmd, __address,   \
-   false); \
+   false, NULL);   \
}  while (0)
 
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eaf3a4a655a6..2297aa41581e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1638,7 +1638,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct 
*vma, pmd_t *pmd,
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-   unsigned long address, bool freeze)
+   unsigned long address, bool freeze, struct page *page)
 {
spinlock_t *ptl;
struct mm_struct *mm = vma->vm_mm;
@@ -1646,8 +1646,17 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
 
mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
ptl = pmd_lock(mm, pmd);
+
+   /*
+* If caller asks to setup a migration entries, we need a page to check
+* pmd against. Otherwise we can end up replacing wrong page.
+*/
+   VM_BUG_ON(freeze && !page);
+   if (page && page != pmd_page(*pmd))
+   goto out;
+
if (pmd_trans_huge(*pmd)) {
-   struct page *page = pmd_page(*pmd);
+   page = pmd_page(*pmd);
if (PageMlocked(page))
clear_page_mlock(page);
} else if (!pmd_devmap(*pmd))
@@ -1678,20 +1687,12 @@ void split_huge_pmd_address(struct vm_area_struct *vma, 
unsigned long address,
return;
 
/*
-* If caller asks to setup a migration entries, we need a page to check
-* pmd against. Otherwise we can end up replacing wrong page.
-*/
-   VM_BUG_ON(freeze && !page);
-   if (page && page != pmd_page(*pmd))
-   return;
-
-   /*
 * Caller holds the mmap_sem write mode or the anon_vma lock,
 * so a huge pmd cannot materialize from under us (khugepaged
 * holds both the mmap_sem write mode and the anon_vma lock
 * write mode).
 */
-   __split_huge_pmd(vma, pmd, address, freeze);
+   __split_huge_pmd(vma, pmd, address, freeze, page);
 }
 
 void vma_adjust_trans_huge(struct vm_area_struct *vma,
-- 
 Kirill A. Shutemov


Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-20 Thread Kirill A. Shutemov
On Mon, Jun 20, 2016 at 08:55:03AM +, Naoya Horiguchi wrote:
> On Fri, Jun 17, 2016 at 11:40:41AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jun 17, 2016 at 11:30:03AM +0900, Naoya Horiguchi wrote:
> > > I found a race condition triggering VM_BUG_ON() in freeze_page(), when 
> > > running
> > > a testcase with 3 processes:
> > >   - process 1: keep writing thp,
> > >   - process 2: keep clearing soft-dirty bits from virtual address of 
> > > process 1
> > >   - process 3: call migratepages for process 1,
> > >
> > > The kernel message is like this:
> > > 
> > >   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
> > >   invalid opcode:  [#1] SMP
> > >   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> > > virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq 
> > > tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy 
> > > virtio_pci virtio_ring virtio
> > >   CPU: 0 PID: 28863 Comm: migratepages Not tainted 
> > > 4.6.0-v4.6-160602-0827-+ #2
> > >   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > >   task: 88003732 ti: 88007cdd task.ti: 88007cdd
> > >   RIP: 0010:[]  [] 
> > > split_huge_page_to_list+0x496/0x590
> > >   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
> > >   RAX: 0001 RBX: 88007c7b88c0 RCX: 
> > >   RDX:  RSI: 00070200 RDI: ea0003188000
> > >   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
> > >   R10: 8800 R11: c01f R12: ea0003188000
> > >   R13: ea0003188000 R14:  R15: 0480
> > >   FS:  7f8ec241d740() GS:88007dc0() 
> > > knlGS: CS:  0010 DS:  ES:  CR0: 
> > > 80050033
> > >   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
> > >   Stack:
> > >8139ef6d ea00031c6280 88011ffec000 
> > >7040 7020 88007cdd3d08 8800dbbe3008
> > >0480 88007cdd3c20 811dd0b1 88007cdd3d68
> > >   Call Trace:
> > >[] ? list_del+0xd/0x30
> > >[] queue_pages_pte_range+0x4d1/0x590
> > >[] __walk_page_range+0x204/0x4e0
> > >[] walk_page_range+0x71/0xf0
> > >[] queue_pages_range+0x75/0x90
> > >[] ? queue_pages_hugetlb+0x190/0x190
> > >[] ? new_node_page+0xc0/0xc0
> > >[] ? change_prot_numa+0x40/0x40
> > >[] migrate_to_node+0x71/0xd0
> > >[] do_migrate_pages+0x1c3/0x210
> > >[] SyS_migrate_pages+0x261/0x290
> > >[] entry_SYSCALL_64_fastpath+0x1a/0xa4
> > >   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 
> > > c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 
> > > c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
> > >   RIP  [] split_huge_page_to_list+0x496/0x590
> > >RSP 
> > > 
> > > I'm not sure of the full scenario of the reproduction, but my debug 
> > > showed that
> > > split_huge_pmd_address(freeze=true) returned without running main code of 
> > > pmd
> > > splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
> > > try_to_unmap() fails and returns non-zero (because page_mapcount() still 
> > > > 0),
> > > and finally VM_BUG_ON() fires.
> > > 
> > > This patch fixes it by adding a separate split_huge_pmd_address()'s 
> > > variant
> > > for freeze=true and checking pmd's state within ptl for that case.
> > 
> > Checking pmd under ptl is right thing to do, but I want to understand the
> > scenario first.
> > 
> > Do you have code to trigger this?
> 
> Here's the testcode (maybe takes 5-10 min to trigger.)
> 
>   background_migratepages() {
>   local pid=$1
>   
>   while kill -0 $pid 2> /dev/null ; do
>   migratepages $pid 0 1
>   migratepages $pid 1 0
>   done
>   }
>   
>   background_clear_refs() {
>   local pid=$1
>   
>   while kill -0 $pid 2> /dev/null ; do
>   echo 4 > /proc/$pid/clear_refs 2> /dev/null
>   done
>   }
>   
>   while true ; do
>   $(dirname $BASH_SOURCE)/thp_alloc &
>   PID=$!
>   sleep 0.$RANDOM
>   background_migratepages $PID > /dev/null &
>   background_clear_refs $PID   > /dev/null &
>   sleep 0.$RANDOM
>   kill -9 $PID
>   done
> 
> 
>   # thp_alloc.c
> 
>   #include 
>   #include 
>   #include 
>   
>   int main(int argc, char **argv) {
>   size_t size = 2*1024*1024*10;
>   char *p = mmap((void *)0x7000UL, size, PROT_READ|PROT_WRITE,
>  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>   madvise(p, size, MADV_HUGEPAGE);
>   while (1)
>   memset(p, 0, size);
>   }

I failed to trigger it myself so far, but I can see where race is comming
from.

> > > I think that this change seems to fit the comment in 
> > > split_huge_pmd_address()
> > > that says "Caller holds the mmap_sem write mode, so a huge pmd 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-20 Thread Kirill A. Shutemov
On Mon, Jun 20, 2016 at 08:55:03AM +, Naoya Horiguchi wrote:
> On Fri, Jun 17, 2016 at 11:40:41AM +0300, Kirill A. Shutemov wrote:
> > On Fri, Jun 17, 2016 at 11:30:03AM +0900, Naoya Horiguchi wrote:
> > > I found a race condition triggering VM_BUG_ON() in freeze_page(), when 
> > > running
> > > a testcase with 3 processes:
> > >   - process 1: keep writing thp,
> > >   - process 2: keep clearing soft-dirty bits from virtual address of 
> > > process 1
> > >   - process 3: call migratepages for process 1,
> > >
> > > The kernel message is like this:
> > > 
> > >   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
> > >   invalid opcode:  [#1] SMP
> > >   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> > > virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq 
> > > tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy 
> > > virtio_pci virtio_ring virtio
> > >   CPU: 0 PID: 28863 Comm: migratepages Not tainted 
> > > 4.6.0-v4.6-160602-0827-+ #2
> > >   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > >   task: 88003732 ti: 88007cdd task.ti: 88007cdd
> > >   RIP: 0010:[]  [] 
> > > split_huge_page_to_list+0x496/0x590
> > >   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
> > >   RAX: 0001 RBX: 88007c7b88c0 RCX: 
> > >   RDX:  RSI: 00070200 RDI: ea0003188000
> > >   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
> > >   R10: 8800 R11: c01f R12: ea0003188000
> > >   R13: ea0003188000 R14:  R15: 0480
> > >   FS:  7f8ec241d740() GS:88007dc0() 
> > > knlGS: CS:  0010 DS:  ES:  CR0: 
> > > 80050033
> > >   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
> > >   Stack:
> > >8139ef6d ea00031c6280 88011ffec000 
> > >7040 7020 88007cdd3d08 8800dbbe3008
> > >0480 88007cdd3c20 811dd0b1 88007cdd3d68
> > >   Call Trace:
> > >[] ? list_del+0xd/0x30
> > >[] queue_pages_pte_range+0x4d1/0x590
> > >[] __walk_page_range+0x204/0x4e0
> > >[] walk_page_range+0x71/0xf0
> > >[] queue_pages_range+0x75/0x90
> > >[] ? queue_pages_hugetlb+0x190/0x190
> > >[] ? new_node_page+0xc0/0xc0
> > >[] ? change_prot_numa+0x40/0x40
> > >[] migrate_to_node+0x71/0xd0
> > >[] do_migrate_pages+0x1c3/0x210
> > >[] SyS_migrate_pages+0x261/0x290
> > >[] entry_SYSCALL_64_fastpath+0x1a/0xa4
> > >   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 
> > > c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 
> > > c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
> > >   RIP  [] split_huge_page_to_list+0x496/0x590
> > >RSP 
> > > 
> > > I'm not sure of the full scenario of the reproduction, but my debug 
> > > showed that
> > > split_huge_pmd_address(freeze=true) returned without running main code of 
> > > pmd
> > > splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
> > > try_to_unmap() fails and returns non-zero (because page_mapcount() still 
> > > > 0),
> > > and finally VM_BUG_ON() fires.
> > > 
> > > This patch fixes it by adding a separate split_huge_pmd_address()'s 
> > > variant
> > > for freeze=true and checking pmd's state within ptl for that case.
> > 
> > Checking pmd under ptl is right thing to do, but I want to understand the
> > scenario first.
> > 
> > Do you have code to trigger this?
> 
> Here's the testcode (maybe takes 5-10 min to trigger.)
> 
>   background_migratepages() {
>   local pid=$1
>   
>   while kill -0 $pid 2> /dev/null ; do
>   migratepages $pid 0 1
>   migratepages $pid 1 0
>   done
>   }
>   
>   background_clear_refs() {
>   local pid=$1
>   
>   while kill -0 $pid 2> /dev/null ; do
>   echo 4 > /proc/$pid/clear_refs 2> /dev/null
>   done
>   }
>   
>   while true ; do
>   $(dirname $BASH_SOURCE)/thp_alloc &
>   PID=$!
>   sleep 0.$RANDOM
>   background_migratepages $PID > /dev/null &
>   background_clear_refs $PID   > /dev/null &
>   sleep 0.$RANDOM
>   kill -9 $PID
>   done
> 
> 
>   # thp_alloc.c
> 
>   #include 
>   #include 
>   #include 
>   
>   int main(int argc, char **argv) {
>   size_t size = 2*1024*1024*10;
>   char *p = mmap((void *)0x7000UL, size, PROT_READ|PROT_WRITE,
>  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>   madvise(p, size, MADV_HUGEPAGE);
>   while (1)
>   memset(p, 0, size);
>   }

I failed to trigger it myself so far, but I can see where race is comming
from.

> > > I think that this change seems to fit the comment in 
> > > split_huge_pmd_address()
> > > that says "Caller holds the mmap_sem write mode, so a huge pmd 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-20 Thread Naoya Horiguchi
On Fri, Jun 17, 2016 at 11:40:41AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 17, 2016 at 11:30:03AM +0900, Naoya Horiguchi wrote:
> > I found a race condition triggering VM_BUG_ON() in freeze_page(), when 
> > running
> > a testcase with 3 processes:
> >   - process 1: keep writing thp,
> >   - process 2: keep clearing soft-dirty bits from virtual address of 
> > process 1
> >   - process 3: call migratepages for process 1,
> >
> > The kernel message is like this:
> > 
> >   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
> >   invalid opcode:  [#1] SMP
> >   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> > virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq 
> > tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy 
> > virtio_pci virtio_ring virtio
> >   CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ 
> > #2
> >   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> >   task: 88003732 ti: 88007cdd task.ti: 88007cdd
> >   RIP: 0010:[]  [] 
> > split_huge_page_to_list+0x496/0x590
> >   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
> >   RAX: 0001 RBX: 88007c7b88c0 RCX: 
> >   RDX:  RSI: 00070200 RDI: ea0003188000
> >   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
> >   R10: 8800 R11: c01f R12: ea0003188000
> >   R13: ea0003188000 R14:  R15: 0480
> >   FS:  7f8ec241d740() GS:88007dc0() 
> > knlGS: CS:  0010 DS:  ES:  CR0: 
> > 80050033
> >   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
> >   Stack:
> >8139ef6d ea00031c6280 88011ffec000 
> >7040 7020 88007cdd3d08 8800dbbe3008
> >0480 88007cdd3c20 811dd0b1 88007cdd3d68
> >   Call Trace:
> >[] ? list_del+0xd/0x30
> >[] queue_pages_pte_range+0x4d1/0x590
> >[] __walk_page_range+0x204/0x4e0
> >[] walk_page_range+0x71/0xf0
> >[] queue_pages_range+0x75/0x90
> >[] ? queue_pages_hugetlb+0x190/0x190
> >[] ? new_node_page+0xc0/0xc0
> >[] ? change_prot_numa+0x40/0x40
> >[] migrate_to_node+0x71/0xd0
> >[] do_migrate_pages+0x1c3/0x210
> >[] SyS_migrate_pages+0x261/0x290
> >[] entry_SYSCALL_64_fastpath+0x1a/0xa4
> >   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 
> > c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 
> > c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
> >   RIP  [] split_huge_page_to_list+0x496/0x590
> >RSP 
> > 
> > I'm not sure of the full scenario of the reproduction, but my debug showed 
> > that
> > split_huge_pmd_address(freeze=true) returned without running main code of 
> > pmd
> > splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
> > try_to_unmap() fails and returns non-zero (because page_mapcount() still > 
> > 0),
> > and finally VM_BUG_ON() fires.
> > 
> > This patch fixes it by adding a separate split_huge_pmd_address()'s variant
> > for freeze=true and checking pmd's state within ptl for that case.
> 
> Checking pmd under ptl is right thing to do, but I want to understand the
> scenario first.
> 
> Do you have code to trigger this?

Here's the testcode (maybe takes 5-10 min to trigger.)

  background_migratepages() {
local pid=$1
  
while kill -0 $pid 2> /dev/null ; do
migratepages $pid 0 1
migratepages $pid 1 0
done
  }
  
  background_clear_refs() {
local pid=$1
  
while kill -0 $pid 2> /dev/null ; do
echo 4 > /proc/$pid/clear_refs 2> /dev/null
done
  }
  
  while true ; do
$(dirname $BASH_SOURCE)/thp_alloc &
PID=$!
sleep 0.$RANDOM
background_migratepages $PID > /dev/null &
background_clear_refs $PID   > /dev/null &
sleep 0.$RANDOM
kill -9 $PID
  done


  # thp_alloc.c

  #include 
  #include 
  #include 
  
  int main(int argc, char **argv) {
size_t size = 2*1024*1024*10;
char *p = mmap((void *)0x7000UL, size, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
madvise(p, size, MADV_HUGEPAGE);
while (1)
memset(p, 0, size);
  }



> > I think that this change seems to fit the comment in 
> > split_huge_pmd_address()
> > that says "Caller holds the mmap_sem write mode, so a huge pmd cannot
> > materialize from under us." We don't hold mmap_sem write if called from
> > split_huge_page(), so maybe there were some different assumptions between
> > callers (split_huge_page() and vma_adjust_trans_huge().)
> > 
> > Signed-off-by: Naoya Horiguchi 
> > ---
> >  include/linux/huge_mm.h |  8 
> >  

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-20 Thread Naoya Horiguchi
On Fri, Jun 17, 2016 at 11:40:41AM +0300, Kirill A. Shutemov wrote:
> On Fri, Jun 17, 2016 at 11:30:03AM +0900, Naoya Horiguchi wrote:
> > I found a race condition triggering VM_BUG_ON() in freeze_page(), when 
> > running
> > a testcase with 3 processes:
> >   - process 1: keep writing thp,
> >   - process 2: keep clearing soft-dirty bits from virtual address of 
> > process 1
> >   - process 3: call migratepages for process 1,
> >
> > The kernel message is like this:
> > 
> >   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
> >   invalid opcode:  [#1] SMP
> >   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> > virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq 
> > tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy 
> > virtio_pci virtio_ring virtio
> >   CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ 
> > #2
> >   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> >   task: 88003732 ti: 88007cdd task.ti: 88007cdd
> >   RIP: 0010:[]  [] 
> > split_huge_page_to_list+0x496/0x590
> >   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
> >   RAX: 0001 RBX: 88007c7b88c0 RCX: 
> >   RDX:  RSI: 00070200 RDI: ea0003188000
> >   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
> >   R10: 8800 R11: c01f R12: ea0003188000
> >   R13: ea0003188000 R14:  R15: 0480
> >   FS:  7f8ec241d740() GS:88007dc0() 
> > knlGS: CS:  0010 DS:  ES:  CR0: 
> > 80050033
> >   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
> >   Stack:
> >8139ef6d ea00031c6280 88011ffec000 
> >7040 7020 88007cdd3d08 8800dbbe3008
> >0480 88007cdd3c20 811dd0b1 88007cdd3d68
> >   Call Trace:
> >[] ? list_del+0xd/0x30
> >[] queue_pages_pte_range+0x4d1/0x590
> >[] __walk_page_range+0x204/0x4e0
> >[] walk_page_range+0x71/0xf0
> >[] queue_pages_range+0x75/0x90
> >[] ? queue_pages_hugetlb+0x190/0x190
> >[] ? new_node_page+0xc0/0xc0
> >[] ? change_prot_numa+0x40/0x40
> >[] migrate_to_node+0x71/0xd0
> >[] do_migrate_pages+0x1c3/0x210
> >[] SyS_migrate_pages+0x261/0x290
> >[] entry_SYSCALL_64_fastpath+0x1a/0xa4
> >   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 
> > c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 
> > c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
> >   RIP  [] split_huge_page_to_list+0x496/0x590
> >RSP 
> > 
> > I'm not sure of the full scenario of the reproduction, but my debug showed 
> > that
> > split_huge_pmd_address(freeze=true) returned without running main code of 
> > pmd
> > splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
> > try_to_unmap() fails and returns non-zero (because page_mapcount() still > 
> > 0),
> > and finally VM_BUG_ON() fires.
> > 
> > This patch fixes it by adding a separate split_huge_pmd_address()'s variant
> > for freeze=true and checking pmd's state within ptl for that case.
> 
> Checking pmd under ptl is right thing to do, but I want to understand the
> scenario first.
> 
> Do you have code to trigger this?

Here's the testcode (maybe takes 5-10 min to trigger.)

  background_migratepages() {
local pid=$1
  
while kill -0 $pid 2> /dev/null ; do
migratepages $pid 0 1
migratepages $pid 1 0
done
  }
  
  background_clear_refs() {
local pid=$1
  
while kill -0 $pid 2> /dev/null ; do
echo 4 > /proc/$pid/clear_refs 2> /dev/null
done
  }
  
  while true ; do
$(dirname $BASH_SOURCE)/thp_alloc &
PID=$!
sleep 0.$RANDOM
background_migratepages $PID > /dev/null &
background_clear_refs $PID   > /dev/null &
sleep 0.$RANDOM
kill -9 $PID
  done


  # thp_alloc.c

  #include 
  #include 
  #include 
  
  int main(int argc, char **argv) {
size_t size = 2*1024*1024*10;
char *p = mmap((void *)0x7000UL, size, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
madvise(p, size, MADV_HUGEPAGE);
while (1)
memset(p, 0, size);
  }



> > I think that this change seems to fit the comment in 
> > split_huge_pmd_address()
> > that says "Caller holds the mmap_sem write mode, so a huge pmd cannot
> > materialize from under us." We don't hold mmap_sem write if called from
> > split_huge_page(), so maybe there were some different assumptions between
> > callers (split_huge_page() and vma_adjust_trans_huge().)
> > 
> > Signed-off-by: Naoya Horiguchi 
> > ---
> >  include/linux/huge_mm.h |  8 
> >  mm/huge_memory.c| 50 
> > 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-17 Thread Kirill A. Shutemov
On Fri, Jun 17, 2016 at 11:30:03AM +0900, Naoya Horiguchi wrote:
> I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
> a testcase with 3 processes:
>   - process 1: keep writing thp,
>   - process 2: keep clearing soft-dirty bits from virtual address of process 1
>   - process 3: call migratepages for process 1,
>
> The kernel message is like this:
> 
>   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
>   invalid opcode:  [#1] SMP
>   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
> tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
> virtio_ring virtio
>   CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
>   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>   task: 88003732 ti: 88007cdd task.ti: 88007cdd
>   RIP: 0010:[]  [] 
> split_huge_page_to_list+0x496/0x590
>   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
>   RAX: 0001 RBX: 88007c7b88c0 RCX: 
>   RDX:  RSI: 00070200 RDI: ea0003188000
>   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
>   R10: 8800 R11: c01f R12: ea0003188000
>   R13: ea0003188000 R14:  R15: 0480
>   FS:  7f8ec241d740() GS:88007dc0() 
> knlGS: CS:  0010 DS:  ES:  CR0: 
> 80050033
>   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
>   Stack:
>8139ef6d ea00031c6280 88011ffec000 
>7040 7020 88007cdd3d08 8800dbbe3008
>0480 88007cdd3c20 811dd0b1 88007cdd3d68
>   Call Trace:
>[] ? list_del+0xd/0x30
>[] queue_pages_pte_range+0x4d1/0x590
>[] __walk_page_range+0x204/0x4e0
>[] walk_page_range+0x71/0xf0
>[] queue_pages_range+0x75/0x90
>[] ? queue_pages_hugetlb+0x190/0x190
>[] ? new_node_page+0xc0/0xc0
>[] ? change_prot_numa+0x40/0x40
>[] migrate_to_node+0x71/0xd0
>[] do_migrate_pages+0x1c3/0x210
>[] SyS_migrate_pages+0x261/0x290
>[] entry_SYSCALL_64_fastpath+0x1a/0xa4
>   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 
> c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 
> 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
>   RIP  [] split_huge_page_to_list+0x496/0x590
>RSP 
> 
> I'm not sure of the full scenario of the reproduction, but my debug showed 
> that
> split_huge_pmd_address(freeze=true) returned without running main code of pmd
> splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
> try_to_unmap() fails and returns non-zero (because page_mapcount() still > 0),
> and finally VM_BUG_ON() fires.
> 
> This patch fixes it by adding a separate split_huge_pmd_address()'s variant
> for freeze=true and checking pmd's state within ptl for that case.

Checking pmd under ptl is right thing to do, but I want to understand the
scenario first.

Do you have code to trigger this?

> I think that this change seems to fit the comment in split_huge_pmd_address()
> that says "Caller holds the mmap_sem write mode, so a huge pmd cannot
> materialize from under us." We don't hold mmap_sem write if called from
> split_huge_page(), so maybe there were some different assumptions between
> callers (split_huge_page() and vma_adjust_trans_huge().)
> 
> Signed-off-by: Naoya Horiguchi 
> ---
>  include/linux/huge_mm.h |  8 
>  mm/huge_memory.c| 50 
> +
>  mm/rmap.c   |  3 +--
>  3 files changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git v4.6/include/linux/huge_mm.h v4.6_patched/include/linux/huge_mm.h
> index d7b9e53..6fa4348 100644
> --- v4.6/include/linux/huge_mm.h
> +++ v4.6_patched/include/linux/huge_mm.h
> @@ -108,8 +108,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>   }  while (0)
>  
>  
> -void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long 
> address,
> - bool freeze, struct page *page);
> +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> + unsigned long address, struct page *page);
>  
>  extern int hugepage_madvise(struct vm_area_struct *vma,
>   unsigned long *vm_flags, int advice);
> @@ -177,8 +177,8 @@ static inline void deferred_split_huge_page(struct page 
> *page) {}
>  #define split_huge_pmd(__vma, __pmd, __address)  \
>   do { } while (0)
>  
> -static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> - unsigned long address, bool freeze, struct page *page) {}
> +static inline void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> + unsigned long address, 

Re: [PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-17 Thread Kirill A. Shutemov
On Fri, Jun 17, 2016 at 11:30:03AM +0900, Naoya Horiguchi wrote:
> I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
> a testcase with 3 processes:
>   - process 1: keep writing thp,
>   - process 2: keep clearing soft-dirty bits from virtual address of process 1
>   - process 3: call migratepages for process 1,
>
> The kernel message is like this:
> 
>   kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
>   invalid opcode:  [#1] SMP
>   Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
> virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
> tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
> virtio_ring virtio
>   CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
>   Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>   task: 88003732 ti: 88007cdd task.ti: 88007cdd
>   RIP: 0010:[]  [] 
> split_huge_page_to_list+0x496/0x590
>   RSP: 0018:88007cdd3b70  EFLAGS: 00010202
>   RAX: 0001 RBX: 88007c7b88c0 RCX: 
>   RDX:  RSI: 00070200 RDI: ea0003188000
>   RBP: 88007cdd3bb8 R08: 0001 R09: 3000
>   R10: 8800 R11: c01f R12: ea0003188000
>   R13: ea0003188000 R14:  R15: 0480
>   FS:  7f8ec241d740() GS:88007dc0() 
> knlGS: CS:  0010 DS:  ES:  CR0: 
> 80050033
>   CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
>   Stack:
>8139ef6d ea00031c6280 88011ffec000 
>7040 7020 88007cdd3d08 8800dbbe3008
>0480 88007cdd3c20 811dd0b1 88007cdd3d68
>   Call Trace:
>[] ? list_del+0xd/0x30
>[] queue_pages_pte_range+0x4d1/0x590
>[] __walk_page_range+0x204/0x4e0
>[] walk_page_range+0x71/0xf0
>[] queue_pages_range+0x75/0x90
>[] ? queue_pages_hugetlb+0x190/0x190
>[] ? new_node_page+0xc0/0xc0
>[] ? change_prot_numa+0x40/0x40
>[] migrate_to_node+0x71/0xd0
>[] do_migrate_pages+0x1c3/0x210
>[] SyS_migrate_pages+0x261/0x290
>[] entry_SYSCALL_64_fastpath+0x1a/0xa4
>   Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 
> c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 
> 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
>   RIP  [] split_huge_page_to_list+0x496/0x590
>RSP 
> 
> I'm not sure of the full scenario of the reproduction, but my debug showed 
> that
> split_huge_pmd_address(freeze=true) returned without running main code of pmd
> splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
> try_to_unmap() fails and returns non-zero (because page_mapcount() still > 0),
> and finally VM_BUG_ON() fires.
> 
> This patch fixes it by adding a separate split_huge_pmd_address()'s variant
> for freeze=true and checking pmd's state within ptl for that case.

Checking pmd under ptl is right thing to do, but I want to understand the
scenario first.

Do you have code to trigger this?

> I think that this change seems to fit the comment in split_huge_pmd_address()
> that says "Caller holds the mmap_sem write mode, so a huge pmd cannot
> materialize from under us." We don't hold mmap_sem write if called from
> split_huge_page(), so maybe there were some different assumptions between
> callers (split_huge_page() and vma_adjust_trans_huge().)
> 
> Signed-off-by: Naoya Horiguchi 
> ---
>  include/linux/huge_mm.h |  8 
>  mm/huge_memory.c| 50 
> +
>  mm/rmap.c   |  3 +--
>  3 files changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git v4.6/include/linux/huge_mm.h v4.6_patched/include/linux/huge_mm.h
> index d7b9e53..6fa4348 100644
> --- v4.6/include/linux/huge_mm.h
> +++ v4.6_patched/include/linux/huge_mm.h
> @@ -108,8 +108,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>   }  while (0)
>  
>  
> -void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long 
> address,
> - bool freeze, struct page *page);
> +void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> + unsigned long address, struct page *page);
>  
>  extern int hugepage_madvise(struct vm_area_struct *vma,
>   unsigned long *vm_flags, int advice);
> @@ -177,8 +177,8 @@ static inline void deferred_split_huge_page(struct page 
> *page) {}
>  #define split_huge_pmd(__vma, __pmd, __address)  \
>   do { } while (0)
>  
> -static inline void split_huge_pmd_address(struct vm_area_struct *vma,
> - unsigned long address, bool freeze, struct page *page) {}
> +static inline void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
> + unsigned long address, struct page *page) {}
>  
> 

[PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-16 Thread Naoya Horiguchi
I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
a testcase with 3 processes:
  - process 1: keep writing thp,
  - process 2: keep clearing soft-dirty bits from virtual address of process 1
  - process 3: call migratepages for process 1,

The kernel message is like this:

  kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
  invalid opcode:  [#1] SMP
  Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
virtio_ring virtio
  CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: 88003732 ti: 88007cdd task.ti: 88007cdd
  RIP: 0010:[]  [] 
split_huge_page_to_list+0x496/0x590
  RSP: 0018:88007cdd3b70  EFLAGS: 00010202
  RAX: 0001 RBX: 88007c7b88c0 RCX: 
  RDX:  RSI: 00070200 RDI: ea0003188000
  RBP: 88007cdd3bb8 R08: 0001 R09: 3000
  R10: 8800 R11: c01f R12: ea0003188000
  R13: ea0003188000 R14:  R15: 0480
  FS:  7f8ec241d740() GS:88007dc0() knlGS:  
   CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
  Stack:
   8139ef6d ea00031c6280 88011ffec000 
   7040 7020 88007cdd3d08 8800dbbe3008
   0480 88007cdd3c20 811dd0b1 88007cdd3d68
  Call Trace:
   [] ? list_del+0xd/0x30
   [] queue_pages_pte_range+0x4d1/0x590
   [] __walk_page_range+0x204/0x4e0
   [] walk_page_range+0x71/0xf0
   [] queue_pages_range+0x75/0x90
   [] ? queue_pages_hugetlb+0x190/0x190
   [] ? new_node_page+0xc0/0xc0
   [] ? change_prot_numa+0x40/0x40
   [] migrate_to_node+0x71/0xd0
   [] do_migrate_pages+0x1c3/0x210
   [] SyS_migrate_pages+0x261/0x290
   [] entry_SYSCALL_64_fastpath+0x1a/0xa4
  Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 
b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 
00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
  RIP  [] split_huge_page_to_list+0x496/0x590
   RSP 

I'm not sure of the full scenario of the reproduction, but my debug showed that
split_huge_pmd_address(freeze=true) returned without running main code of pmd
splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
try_to_unmap() fails and returns non-zero (because page_mapcount() still > 0),
and finally VM_BUG_ON() fires.

This patch fixes it by adding a separate split_huge_pmd_address()'s variant
for freeze=true and checking pmd's state within ptl for that case.

I think that this change seems to fit the comment in split_huge_pmd_address()
that says "Caller holds the mmap_sem write mode, so a huge pmd cannot
materialize from under us." We don't hold mmap_sem write if called from
split_huge_page(), so maybe there were some different assumptions between
callers (split_huge_page() and vma_adjust_trans_huge().)

Signed-off-by: Naoya Horiguchi 
---
 include/linux/huge_mm.h |  8 
 mm/huge_memory.c| 50 +
 mm/rmap.c   |  3 +--
 3 files changed, 43 insertions(+), 18 deletions(-)

diff --git v4.6/include/linux/huge_mm.h v4.6_patched/include/linux/huge_mm.h
index d7b9e53..6fa4348 100644
--- v4.6/include/linux/huge_mm.h
+++ v4.6_patched/include/linux/huge_mm.h
@@ -108,8 +108,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
}  while (0)
 
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
-   bool freeze, struct page *page);
+void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
+   unsigned long address, struct page *page);
 
 extern int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice);
@@ -177,8 +177,8 @@ static inline void deferred_split_huge_page(struct page 
*page) {}
 #define split_huge_pmd(__vma, __pmd, __address)\
do { } while (0)
 
-static inline void split_huge_pmd_address(struct vm_area_struct *vma,
-   unsigned long address, bool freeze, struct page *page) {}
+static inline void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
+   unsigned long address, struct page *page) {}
 
 static inline int hugepage_madvise(struct vm_area_struct *vma,
   unsigned long *vm_flags, int advice)
diff --git v4.6/mm/huge_memory.c v4.6_patched/mm/huge_memory.c
index b49ee12..c48f22c 100644
--- v4.6/mm/huge_memory.c
+++ v4.6_patched/mm/huge_memory.c
@@ -2989,6 +2989,16 @@ void __split_huge_pmd(struct vm_area_struct 

[PATCH v1 1/2] mm: thp: move pmd check inside ptl for freeze_page()

2016-06-16 Thread Naoya Horiguchi
I found a race condition triggering VM_BUG_ON() in freeze_page(), when running
a testcase with 3 processes:
  - process 1: keep writing thp,
  - process 2: keep clearing soft-dirty bits from virtual address of process 1
  - process 3: call migratepages for process 1,

The kernel message is like this:

  kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
  invalid opcode:  [#1] SMP
  Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr 
virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis 
tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci 
virtio_ring virtio
  CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: 88003732 ti: 88007cdd task.ti: 88007cdd
  RIP: 0010:[]  [] 
split_huge_page_to_list+0x496/0x590
  RSP: 0018:88007cdd3b70  EFLAGS: 00010202
  RAX: 0001 RBX: 88007c7b88c0 RCX: 
  RDX:  RSI: 00070200 RDI: ea0003188000
  RBP: 88007cdd3bb8 R08: 0001 R09: 3000
  R10: 8800 R11: c01f R12: ea0003188000
  R13: ea0003188000 R14:  R15: 0480
  FS:  7f8ec241d740() GS:88007dc0() knlGS:  
   CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 7f8ec1f3ed20 CR3: 3707b000 CR4: 06f0
  Stack:
   8139ef6d ea00031c6280 88011ffec000 
   7040 7020 88007cdd3d08 8800dbbe3008
   0480 88007cdd3c20 811dd0b1 88007cdd3d68
  Call Trace:
   [] ? list_del+0xd/0x30
   [] queue_pages_pte_range+0x4d1/0x590
   [] __walk_page_range+0x204/0x4e0
   [] walk_page_range+0x71/0xf0
   [] queue_pages_range+0x75/0x90
   [] ? queue_pages_hugetlb+0x190/0x190
   [] ? new_node_page+0xc0/0xc0
   [] ? change_prot_numa+0x40/0x40
   [] migrate_to_node+0x71/0xd0
   [] do_migrate_pages+0x1c3/0x210
   [] SyS_migrate_pages+0x261/0x290
   [] entry_SYSCALL_64_fastpath+0x1a/0xa4
  Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 
b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 
00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
  RIP  [] split_huge_page_to_list+0x496/0x590
   RSP 

I'm not sure of the full scenario of the reproduction, but my debug showed that
split_huge_pmd_address(freeze=true) returned without running main code of pmd
splitting because pmd_present(*pmd) was 0. If this happens, the subsequent
try_to_unmap() fails and returns non-zero (because page_mapcount() still > 0),
and finally VM_BUG_ON() fires.

This patch fixes it by adding a separate split_huge_pmd_address()'s variant
for freeze=true and checking pmd's state within ptl for that case.

I think that this change seems to fit the comment in split_huge_pmd_address()
that says "Caller holds the mmap_sem write mode, so a huge pmd cannot
materialize from under us." We don't hold mmap_sem write if called from
split_huge_page(), so maybe there were some different assumptions between
callers (split_huge_page() and vma_adjust_trans_huge().)

Signed-off-by: Naoya Horiguchi 
---
 include/linux/huge_mm.h |  8 
 mm/huge_memory.c| 50 +
 mm/rmap.c   |  3 +--
 3 files changed, 43 insertions(+), 18 deletions(-)

diff --git v4.6/include/linux/huge_mm.h v4.6_patched/include/linux/huge_mm.h
index d7b9e53..6fa4348 100644
--- v4.6/include/linux/huge_mm.h
+++ v4.6_patched/include/linux/huge_mm.h
@@ -108,8 +108,8 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
}  while (0)
 
 
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
-   bool freeze, struct page *page);
+void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
+   unsigned long address, struct page *page);
 
 extern int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice);
@@ -177,8 +177,8 @@ static inline void deferred_split_huge_page(struct page 
*page) {}
 #define split_huge_pmd(__vma, __pmd, __address)\
do { } while (0)
 
-static inline void split_huge_pmd_address(struct vm_area_struct *vma,
-   unsigned long address, bool freeze, struct page *page) {}
+static inline void split_huge_pmd_address_freeze(struct vm_area_struct *vma,
+   unsigned long address, struct page *page) {}
 
 static inline int hugepage_madvise(struct vm_area_struct *vma,
   unsigned long *vm_flags, int advice)
diff --git v4.6/mm/huge_memory.c v4.6_patched/mm/huge_memory.c
index b49ee12..c48f22c 100644
--- v4.6/mm/huge_memory.c
+++ v4.6_patched/mm/huge_memory.c
@@ -2989,6 +2989,16 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,