RE: [PATCH v8 05/22] Add vm_replace_mixed()
> -Original Message- > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > Behalf Of Kirill A. Shutemov > Sent: Monday, July 28, 2014 9:26 PM > To: Matthew Wilcox > Cc: Wilcox, Matthew R; linux-fsde...@vger.kernel.org; linux...@kvack.org; > linux-kernel@vger.kernel.org > Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed() > > On Fri, Jul 25, 2014 at 03:44:50PM -0400, Matthew Wilcox wrote: > > On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote: > > > > update_hiwater_rss(mm); > > > > > > No: you cannot end up with lower rss after replace, iiuc. > > > > Actually, you can ... when we replace a real page with a PFN, our rss > > decreases. > > Okay. > > > > Do you mean you pointed to new file all the time? O_CREAT doesn't > > > truncate file if it exists, iirc. > > > > It was pointing to a new file. Still not sure why that one failed to > > trigger the problem. The slightly modified version attached triggered > > the problem *just fine* :-) > > > > I've attached all the patches in my tree so far. For the v9 patch > > kit, I'll keep patch 3 as a separate patch, but roll patches 1, 2 and > > 4 into other patches. > > > > I am seeing something odd though. When I run double-map with > > debugging printks inserted in strategic spots in the kernel, I see > > four calls to do_dax_fault(). The first two, as expected, are the > > loads from the two mapped addresses. The third is via mkwrite, but > > then the fourth time I get a regular page fault for write, and I don't > understand why I get it. > > > > Any ideas? > > unmap_mapping_range() clears pte you've just set by vm_replace_mixed() on > third fault. > > And locking looks wrong: it seems you need to hold i_mmap_mutex while > replacing hole page with pfn. Your VM_BUG_ON() in zap_pte_single() triggers > on my setup. > > > +static void zap_pte_single(struct vm_area_struct *vma, pte_t *pte, > > + unsigned long addr) > > +{ > > + struct mm_struct *mm = vma->vm_mm; > > + int force_flush = 0; > > + int rss[NR_MM_COUNTERS]; > > + > > + > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_m > utex)); > > It's wrong place for VM_BUG_ON(): zap_pte_single() on anon mapping should > work fine) Hi Shutemov: 1. I am confuse that why insert_page() drop the PageAnon, insert_pfn() donot drop PageAnon? 2. remap_vmalloc_range() -> vm_insert_page()->insert_page() -> inc_mm_counter_fast(mm, MM_FILEPAGES); so, in this scenario, this vmalloc page maybe not sure be a page cache, why increase the MM_FILEPAGES ? > > > + > > + init_rss_vec(rss); > > Vector to commit single update to mm counters? What about inline counters > update for rss == NULL case? > > > + update_hiwater_rss(mm); > > + flush_cache_page(vma, addr, pte_pfn(*pte)); > > + zap_pte(NULL, vma, pte, addr, NULL, rss, &force_flush); > > + flush_tlb_page(vma, addr); > > + add_mm_rss_vec(mm, rss); > > +} > > + > > -- > Kirill A. Shutemov > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to > majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Fri, Jul 25, 2014 at 03:44:50PM -0400, Matthew Wilcox wrote: > On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote: > > > update_hiwater_rss(mm); > > > > No: you cannot end up with lower rss after replace, iiuc. > > Actually, you can ... when we replace a real page with a PFN, our rss > decreases. Okay. > > Do you mean you pointed to new file all the time? O_CREAT doesn't truncate > > file if it exists, iirc. > > It was pointing to a new file. Still not sure why that one failed to trigger > the problem. The slightly modified version attached triggered the problem > *just fine* :-) > > I've attached all the patches in my tree so far. For the v9 patch kit, > I'll keep patch 3 as a separate patch, but roll patches 1, 2 and 4 into > other patches. > > I am seeing something odd though. When I run double-map with debugging > printks inserted in strategic spots in the kernel, I see four calls to > do_dax_fault(). The first two, as expected, are the loads from the two > mapped addresses. The third is via mkwrite, but then the fourth time > I get a regular page fault for write, and I don't understand why I get it. > > Any ideas? unmap_mapping_range() clears pte you've just set by vm_replace_mixed() on third fault. And locking looks wrong: it seems you need to hold i_mmap_mutex while replacing hole page with pfn. Your VM_BUG_ON() in zap_pte_single() triggers on my setup. > +static void zap_pte_single(struct vm_area_struct *vma, pte_t *pte, > + unsigned long addr) > +{ > + struct mm_struct *mm = vma->vm_mm; > + int force_flush = 0; > + int rss[NR_MM_COUNTERS]; > + > + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); It's wrong place for VM_BUG_ON(): zap_pte_single() on anon mapping should work fine) > + > + init_rss_vec(rss); Vector to commit single update to mm counters? What about inline counters update for rss == NULL case? > + update_hiwater_rss(mm); > + flush_cache_page(vma, addr, pte_pfn(*pte)); > + zap_pte(NULL, vma, pte, addr, NULL, rss, &force_flush); > + flush_tlb_page(vma, addr); > + add_mm_rss_vec(mm, rss); > +} > + -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote: > > update_hiwater_rss(mm); > > No: you cannot end up with lower rss after replace, iiuc. Actually, you can ... when we replace a real page with a PFN, our rss decreases. > Do you mean you pointed to new file all the time? O_CREAT doesn't truncate > file if it exists, iirc. It was pointing to a new file. Still not sure why that one failed to trigger the problem. The slightly modified version attached triggered the problem *just fine* :-) I've attached all the patches in my tree so far. For the v9 patch kit, I'll keep patch 3 as a separate patch, but roll patches 1, 2 and 4 into other patches. I am seeing something odd though. When I run double-map with debugging printks inserted in strategic spots in the kernel, I see four calls to do_dax_fault(). The first two, as expected, are the loads from the two mapped addresses. The third is via mkwrite, but then the fourth time I get a regular page fault for write, and I don't understand why I get it. Any ideas? [ 880.966632] do_dax_fault: fault a8 page = (null) bh state 0 written 0 addr 7ff598835000 [ 880.966637] dax_load_hole: page = ea0002784730 [ 882.114387] do_dax_fault: fault a8 page = ea0002784730 bh state 0 written 0 addr 7ff598834000 [ 882.114389] dax_load_hole: page = ea0002784730 [ 882.780013] do_dax_fault: fault 5 page = ea0002784730 bh state 0 written 0 addr 7ff598835000 [ 882.780095] insert_pfn: pte = 800108200225 [ 882.780096] do_dax_fault: page = ea0002784730 pfn = 108200 error = 0 [ 882.780098] CPU: 1 PID: 1511 Comm: double-map Not tainted 3.16.0-rc6+ #89 [ 882.780099] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013 [ 882.780100] 8800b41e3ba0 815c6e73 ea0002784730 [ 882.780102] 8800b41e3c88 8124c319 7ff598835000 8800b2436ac0 [ 882.780104] 0005 a01e3020 0004 8805 [ 882.780106] Call Trace: [ 882.780110] [] dump_stack+0x4d/0x66 [ 882.780114] [] do_dax_fault+0x569/0x6f0 [ 882.780133] [] dax_fault+0x3f/0x90 [ 882.780136] [] ? native_sched_clock+0x2e/0xb0 [ 882.780137] [] dax_mkwrite+0xe/0x10 [ 882.780143] [] ext4_dax_mkwrite+0x15/0x20 [ext4] [ 882.780146] [] do_page_mkwrite+0x47/0xb0 [ 882.780148] [] do_wp_page+0x6e2/0x990 [ 882.780150] [] handle_mm_fault+0x6ab/0xf70 [ 882.780154] [] __do_page_fault+0x1ec/0x5b0 [ 882.780163] [] do_page_fault+0x22/0x30 [ 882.780165] [] page_fault+0x28/0x30 [ 882.780204] do_dax_fault: fault a9 page = (null) bh state 20 written 1 addr 7ff598835000 [ 882.780206] insert_pfn: pte = 800108200225 [ 882.780207] do_dax_fault: page = (null) pfn = 108200 error = 0 [ 882.780208] CPU: 1 PID: 1511 Comm: double-map Not tainted 3.16.0-rc6+ #89 [ 882.780208] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./Q87M-D2H, BIOS F6 08/03/2013 [ 882.780209] 8800b41e3bc0 815c6e73 [ 882.780211] 8800b41e3ca8 8124c319 7ff598835000 8800b41e3c08 [ 882.780213] 8800a2a60608 a01e3020 88a9 [ 882.780214] Call Trace: [ 882.780216] [] dump_stack+0x4d/0x66 [ 882.780218] [] do_dax_fault+0x569/0x6f0 [ 882.780232] [] dax_fault+0x3f/0x90 [ 882.780238] [] ext4_dax_fault+0x15/0x20 [ext4] [ 882.780240] [] __do_fault+0x41/0xd0 [ 882.780241] [] do_shared_fault.isra.56+0x35/0x220 [ 882.780243] [] handle_mm_fault+0x303/0xf70 [ 882.780246] [] __do_page_fault+0x1ec/0x5b0 [ 882.780254] [] do_page_fault+0x22/0x30 [ 882.780255] [] page_fault+0x28/0x30 diff --git a/fs/dax.c b/fs/dax.c index b4fdfd9..4b0f928 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -257,6 +257,7 @@ static int dax_load_hole(struct address_space *mapping, struct page *page, if (!page) page = find_or_create_page(mapping, vmf->pgoff, GFP_KERNEL | __GFP_ZERO); +printk("%s: page = %p\n", __func__, page); if (!page) return VM_FAULT_OOM; /* Recheck i_size under page lock to avoid truncate race */ @@ -332,6 +333,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, if (error || bh.b_size < PAGE_SIZE) goto sigbus; +printk("%s: fault %x page = %p bh state %lx written %d addr %lx\n", __func__, vmf->flags, page, bh.b_state, buffer_written(&bh), vaddr); if (!buffer_written(&bh) && !vmf->cow_page) { if (vmf->flags & FAULT_FLAG_WRITE) { error = get_block(inode, block, &bh, 1); @@ -372,6 +374,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, error = dax_get_pfn(&bh, &pfn, blkbits); if (error > 0) error = vm_replace_mixed(vma, vaddr, pfn); +printk("%s: page = %p pfn = %
RE: [PATCH v8 05/22] Add vm_replace_mixed()
It is double page_table_lock issue, should be free-and-realloc will be simple and readability? + if (!pte_none(*pte)) { + if (!replace) + goto out_unlock; + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); + pte_unmap_unlock(pte, ptl); + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); + pte = get_locked_pte(mm, addr, &ptl); + } Best, Figo > -Original Message- > From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On > Behalf Of Kirill A. Shutemov > Sent: Wednesday, July 23, 2014 11:55 PM > To: Matthew Wilcox > Cc: Wilcox, Matthew R; linux-fsde...@vger.kernel.org; linux...@kvack.org; > linux-kernel@vger.kernel.org > Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed() > > On Wed, Jul 23, 2014 at 10:27:45AM -0400, Matthew Wilcox wrote: > > On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote: > > > On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote: > > > > I'd love to use a lighter-weight weapon! What would you recommend > > > > using, zap_pte_range()? > > > > > > The most straight-forward way: extract body of pte cycle from > > > zap_pte_range() to separate function -- zap_pte() -- and use it. > > > > OK, I can do that. What about the other parts of zap_page_range(), do > > I need to call them? > > > > lru_add_drain(); > > No, I guess.. > > > tlb_gather_mmu(&tlb, mm, address, end); > > tlb_finish_mmu(&tlb, address, end); > > New zap_pte() should tolerate tlb == NULL and does flush_tlb_page() or > pte_clear_*flush or something. > > > update_hiwater_rss(mm); > > No: you cannot end up with lower rss after replace, iiuc. > > > mmu_notifier_invalidate_range_start(mm, address, end); > > mmu_notifier_invalidate_range_end(mm, address, end); > > mmu_notifier_invalidate_page() should be enough. > > > > > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) { > > > > perror(argv[1]); > > > > exit(1); > > > > } > > > > > > > > if (ftruncate(fd, 4096) < 0) { > > > > > > Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below > > > will fault in page from backing storage, not hole and write will not > > > replace anything. > > > > Ah, it was starting with a new file, hence the O_CREAT up above. > > Do you mean you pointed to new file all the time? O_CREAT doesn't truncate > file if it exists, iirc. > > -- > Kirill A. Shutemov > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to > majord...@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: mailto:"d...@kvack.org";> em...@kvack.org -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Wed, Jul 23, 2014 at 10:27:45AM -0400, Matthew Wilcox wrote: > On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote: > > On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote: > > > I'd love to use a lighter-weight weapon! What would you recommend using, > > > zap_pte_range()? > > > > The most straight-forward way: extract body of pte cycle from > > zap_pte_range() to separate function -- zap_pte() -- and use it. > > OK, I can do that. What about the other parts of zap_page_range(), > do I need to call them? > > lru_add_drain(); No, I guess.. > tlb_gather_mmu(&tlb, mm, address, end); > tlb_finish_mmu(&tlb, address, end); New zap_pte() should tolerate tlb == NULL and does flush_tlb_page() or pte_clear_*flush or something. > update_hiwater_rss(mm); No: you cannot end up with lower rss after replace, iiuc. > mmu_notifier_invalidate_range_start(mm, address, end); > mmu_notifier_invalidate_range_end(mm, address, end); mmu_notifier_invalidate_page() should be enough. > > > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) { > > > perror(argv[1]); > > > exit(1); > > > } > > > > > > if (ftruncate(fd, 4096) < 0) { > > > > Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will > > fault in page from backing storage, not hole and write will not replace > > anything. > > Ah, it was starting with a new file, hence the O_CREAT up above. Do you mean you pointed to new file all the time? O_CREAT doesn't truncate file if it exists, iirc. -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote: > On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote: > > I'd love to use a lighter-weight weapon! What would you recommend using, > > zap_pte_range()? > > The most straight-forward way: extract body of pte cycle from > zap_pte_range() to separate function -- zap_pte() -- and use it. OK, I can do that. What about the other parts of zap_page_range(), do I need to call them? lru_add_drain(); tlb_gather_mmu(&tlb, mm, address, end); update_hiwater_rss(mm); mmu_notifier_invalidate_range_start(mm, address, end); [ unmap_single_vma(&tlb, vma, address, end, details);] mmu_notifier_invalidate_range_end(mm, address, end); tlb_finish_mmu(&tlb, address, end); > > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) { > > perror(argv[1]); > > exit(1); > > } > > > > if (ftruncate(fd, 4096) < 0) { > > Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will > fault in page from backing storage, not hole and write will not replace > anything. Ah, it was starting with a new file, hence the O_CREAT up above. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote: > On Wed, Jul 23, 2014 at 02:45:40PM +0300, Kirill A. Shutemov wrote: > > On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote: > > > From: Matthew Wilcox > > > > > > vm_insert_mixed() will fail if there is already a valid PTE at that > > > location. The DAX code would rather replace the previous value with > > > the new PTE. > > > > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, > > > unsigned long addr, > > > if (!pte) > > > goto out; > > > retval = -EBUSY; > > > - if (!pte_none(*pte)) > > > - goto out_unlock; > > > + if (!pte_none(*pte)) { > > > + if (!replace) > > > + goto out_unlock; > > > + > > > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); > > > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); > > > > zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not > > going to work. > > I have a test program that exercises this path ... it seems to work! > Following the code, I don't understand why it does. Maybe it's not > exercising this path after all? I've attached the program (so that I > have an "oh, duh" moment about 5 seconds after sending the email). See below. > > > And zap_page_range*() is pretty heavy weapon to shoot down one pte, which > > we already have pointer to. Why? > > I'd love to use a lighter-weight weapon! What would you recommend using, > zap_pte_range()? The most straight-forward way: extract body of pte cycle from zap_pte_range() to separate function -- zap_pte() -- and use it. > #include > #include > #include > #include > #include > #include > #include > #include > > int > main(int argc, char *argv[]) > { > int fd; > void *addr; > char buf[4096]; > > if (argc != 2) { > fprintf(stderr, "usage: %s filename\n", argv[0]); > exit(1); > } > > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) { > perror(argv[1]); > exit(1); > } > > if (ftruncate(fd, 4096) < 0) { Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will fault in page from backing storage, not hole and write will not replace anything. > perror("ftruncate"); > exit(1); > } > > if ((addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, > fd, 0)) == MAP_FAILED) { > perror("mmap"); > exit(1); > } > > close(fd); > > /* first read */ > memcpy(buf, addr, 4096); > > /* now write a bit */ > memcpy(addr, buf, 8); > > printf("%s: test passed.\n", argv[0]); > exit(0); > } -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Wed, Jul 23, 2014 at 02:45:40PM +0300, Kirill A. Shutemov wrote: > On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote: > > From: Matthew Wilcox > > > > vm_insert_mixed() will fail if there is already a valid PTE at that > > location. The DAX code would rather replace the previous value with > > the new PTE. > > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, > > unsigned long addr, > > if (!pte) > > goto out; > > retval = -EBUSY; > > - if (!pte_none(*pte)) > > - goto out_unlock; > > + if (!pte_none(*pte)) { > > + if (!replace) > > + goto out_unlock; > > + > > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); > > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); > > zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not > going to work. I have a test program that exercises this path ... it seems to work! Following the code, I don't understand why it does. Maybe it's not exercising this path after all? I've attached the program (so that I have an "oh, duh" moment about 5 seconds after sending the email). > And zap_page_range*() is pretty heavy weapon to shoot down one pte, which > we already have pointer to. Why? I'd love to use a lighter-weight weapon! What would you recommend using, zap_pte_range()? #include #include #include #include #include #include #include #include int main(int argc, char *argv[]) { int fd; void *addr; char buf[4096]; if (argc != 2) { fprintf(stderr, "usage: %s filename\n", argv[0]); exit(1); } if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) { perror(argv[1]); exit(1); } if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); exit(1); } if ((addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)) == MAP_FAILED) { perror("mmap"); exit(1); } close(fd); /* first read */ memcpy(buf, addr, 4096); /* now write a bit */ memcpy(addr, buf, 8); printf("%s: test passed.\n", argv[0]); exit(0); }
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote: > From: Matthew Wilcox > > vm_insert_mixed() will fail if there is already a valid PTE at that > location. The DAX code would rather replace the previous value with > the new PTE. > > Signed-off-by: Matthew Wilcox > --- > include/linux/mm.h | 8 ++-- > mm/memory.c| 34 +- > 2 files changed, 27 insertions(+), 15 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e04f531..8d1194c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1958,8 +1958,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned > long addr, > int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page > *); > int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, > unsigned long pfn); > -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, > - unsigned long pfn); > +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, > + unsigned long pfn, bool replace); > +#define vm_insert_mixed(vma, addr, pfn) \ > + __vm_insert_mixed(vma, addr, pfn, false) > +#define vm_replace_mixed(vma, addr, pfn) \ > + __vm_insert_mixed(vma, addr, pfn, true) > int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned > long len); > > > diff --git a/mm/memory.c b/mm/memory.c > index 42bf429..cf06c97 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1476,7 +1476,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned > long addr, > * pages reserved for the old functions anyway. > */ > static int insert_page(struct vm_area_struct *vma, unsigned long addr, > - struct page *page, pgprot_t prot) > + struct page *page, pgprot_t prot, bool replace) > { > struct mm_struct *mm = vma->vm_mm; > int retval; > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, > unsigned long addr, > if (!pte) > goto out; > retval = -EBUSY; > - if (!pte_none(*pte)) > - goto out_unlock; > + if (!pte_none(*pte)) { > + if (!replace) > + goto out_unlock; > + > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not going to work. And zap_page_range*() is pretty heavy weapon to shoot down one pte, which we already have pointer to. Why? -- Kirill A. Shutemov -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8 05/22] Add vm_replace_mixed()
On Tue 22-07-14 15:47:53, Matthew Wilcox wrote: > From: Matthew Wilcox > > vm_insert_mixed() will fail if there is already a valid PTE at that > location. The DAX code would rather replace the previous value with > the new PTE. > > Signed-off-by: Matthew Wilcox This looks good to me although I'm not an expert in this area. So just: Acked-by: Jan Kara Honza > --- > include/linux/mm.h | 8 ++-- > mm/memory.c| 34 +- > 2 files changed, 27 insertions(+), 15 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index e04f531..8d1194c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1958,8 +1958,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned > long addr, > int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page > *); > int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr, > unsigned long pfn); > -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, > - unsigned long pfn); > +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, > + unsigned long pfn, bool replace); > +#define vm_insert_mixed(vma, addr, pfn) \ > + __vm_insert_mixed(vma, addr, pfn, false) > +#define vm_replace_mixed(vma, addr, pfn) \ > + __vm_insert_mixed(vma, addr, pfn, true) > int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned > long len); > > > diff --git a/mm/memory.c b/mm/memory.c > index 42bf429..cf06c97 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1476,7 +1476,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned > long addr, > * pages reserved for the old functions anyway. > */ > static int insert_page(struct vm_area_struct *vma, unsigned long addr, > - struct page *page, pgprot_t prot) > + struct page *page, pgprot_t prot, bool replace) > { > struct mm_struct *mm = vma->vm_mm; > int retval; > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, > unsigned long addr, > if (!pte) > goto out; > retval = -EBUSY; > - if (!pte_none(*pte)) > - goto out_unlock; > + if (!pte_none(*pte)) { > + if (!replace) > + goto out_unlock; > + > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); > + } > > /* Ok, finally just insert the thing.. */ > get_page(page); > @@ -1549,12 +1553,12 @@ int vm_insert_page(struct vm_area_struct *vma, > unsigned long addr, > BUG_ON(vma->vm_flags & VM_PFNMAP); > vma->vm_flags |= VM_MIXEDMAP; > } > - return insert_page(vma, addr, page, vma->vm_page_prot); > + return insert_page(vma, addr, page, vma->vm_page_prot, false); > } > EXPORT_SYMBOL(vm_insert_page); > > static int insert_pfn(struct vm_area_struct *vma, unsigned long addr, > - unsigned long pfn, pgprot_t prot) > + unsigned long pfn, pgprot_t prot, bool replace) > { > struct mm_struct *mm = vma->vm_mm; > int retval; > @@ -1566,8 +1570,12 @@ static int insert_pfn(struct vm_area_struct *vma, > unsigned long addr, > if (!pte) > goto out; > retval = -EBUSY; > - if (!pte_none(*pte)) > - goto out_unlock; > + if (!pte_none(*pte)) { > + if (!replace) > + goto out_unlock; > + > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex)); > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL); > + } > > /* Ok, finally just insert the thing.. */ > entry = pte_mkspecial(pfn_pte(pfn, prot)); > @@ -1620,14 +1628,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, > unsigned long addr, > if (track_pfn_insert(vma, &pgprot, pfn)) > return -EINVAL; > > - ret = insert_pfn(vma, addr, pfn, pgprot); > + ret = insert_pfn(vma, addr, pfn, pgprot, false); > > return ret; > } > EXPORT_SYMBOL(vm_insert_pfn); > > -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, > - unsigned long pfn) > +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr, > + unsigned long pfn, bool replace) > { > BUG_ON(!(vma->vm_flags & VM_MIXEDMAP)); > > @@ -1645,11 +1653,11 @@ int vm_insert_mixed(struct vm_area_struct *vma, > unsigned long addr, > struct page *page; > > page = pfn_to_page(pfn); > - return insert_page(vma, addr, page, vma->vm_page_prot); > + return insert_page(vma, addr, page, vma->vm_page_prot, replace); > } > - return insert_pfn(vma, addr, pfn, vma->vm_page_prot); > +