RE: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-28 Thread Zhang, Tianfei


> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> Behalf Of Kirill A. Shutemov
> Sent: Monday, July 28, 2014 9:26 PM
> To: Matthew Wilcox
> Cc: Wilcox, Matthew R; linux-fsde...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()
> 
> On Fri, Jul 25, 2014 at 03:44:50PM -0400, Matthew Wilcox wrote:
> > On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote:
> > > > update_hiwater_rss(mm);
> > >
> > > No: you cannot end up with lower rss after replace, iiuc.
> >
> > Actually, you can ... when we replace a real page with a PFN, our rss
> > decreases.
> 
> Okay.
> 
> > > Do you mean you pointed to new file all the time? O_CREAT doesn't
> > > truncate file if it exists, iirc.
> >
> > It was pointing to a new file.  Still not sure why that one failed to
> > trigger the problem.  The slightly modified version attached triggered
> > the problem *just fine* :-)
> >
> > I've attached all the patches in my tree so far.  For the v9 patch
> > kit, I'll keep patch 3 as a separate patch, but roll patches 1, 2 and
> > 4 into other patches.
> >
> > I am seeing something odd though.  When I run double-map with
> > debugging printks inserted in strategic spots in the kernel, I see
> > four calls to do_dax_fault().  The first two, as expected, are the
> > loads from the two mapped addresses.  The third is via mkwrite, but
> > then the fourth time I get a regular page fault for write, and I don't
> understand why I get it.
> >
> > Any ideas?
> 
> unmap_mapping_range() clears pte you've just set by vm_replace_mixed() on
> third fault.
> 
> And locking looks wrong: it seems you need to hold i_mmap_mutex while
> replacing hole page with pfn. Your VM_BUG_ON() in zap_pte_single() triggers
> on my setup.
> 
> > +static void zap_pte_single(struct vm_area_struct *vma, pte_t *pte,
> > +   unsigned long addr)
> > +{
> > +   struct mm_struct *mm = vma->vm_mm;
> > +   int force_flush = 0;
> > +   int rss[NR_MM_COUNTERS];
> > +
> > +
>   VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_m
> utex));
> 
> It's wrong place for VM_BUG_ON(): zap_pte_single() on anon mapping should
> work fine)

Hi Shutemov:
1. I am confuse that why insert_page() drop the PageAnon, insert_pfn() donot 
drop PageAnon?

2. 
remap_vmalloc_range() -> vm_insert_page()->insert_page() -> 
inc_mm_counter_fast(mm, MM_FILEPAGES);

so, in this scenario, this vmalloc page maybe not sure be a page cache, why 
increase the MM_FILEPAGES ?


> 
> > +
> > +   init_rss_vec(rss);
> 
> Vector to commit single update to mm counters? What about inline counters
> update for rss == NULL case?
> 
> > +   update_hiwater_rss(mm);
> > +   flush_cache_page(vma, addr, pte_pfn(*pte));
> > +   zap_pte(NULL, vma, pte, addr, NULL, rss, &force_flush);
> > +   flush_tlb_page(vma, addr);
> > +   add_mm_rss_vec(mm, rss);
> > +}
> > +
> 
> --
>  Kirill A. Shutemov
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
> majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-28 Thread Kirill A. Shutemov
On Fri, Jul 25, 2014 at 03:44:50PM -0400, Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote:
> > > update_hiwater_rss(mm);
> > 
> > No: you cannot end up with lower rss after replace, iiuc.
> 
> Actually, you can ... when we replace a real page with a PFN, our rss
> decreases.

Okay.

> > Do you mean you pointed to new file all the time? O_CREAT doesn't truncate
> > file if it exists, iirc.
> 
> It was pointing to a new file.  Still not sure why that one failed to trigger
> the problem.  The slightly modified version attached triggered the problem
> *just fine* :-)
> 
> I've attached all the patches in my tree so far.  For the v9 patch kit,
> I'll keep patch 3 as a separate patch, but roll patches 1, 2 and 4 into
> other patches.
> 
> I am seeing something odd though.  When I run double-map with debugging
> printks inserted in strategic spots in the kernel, I see four calls to
> do_dax_fault().  The first two, as expected, are the loads from the two
> mapped addresses.  The third is via mkwrite, but then the fourth time
> I get a regular page fault for write, and I don't understand why I get it.
> 
> Any ideas?

unmap_mapping_range() clears pte you've just set by vm_replace_mixed() on
third fault.

And locking looks wrong: it seems you need to hold i_mmap_mutex while
replacing hole page with pfn. Your VM_BUG_ON() in zap_pte_single()
triggers on my setup.

> +static void zap_pte_single(struct vm_area_struct *vma, pte_t *pte,
> + unsigned long addr)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + int force_flush = 0;
> + int rss[NR_MM_COUNTERS];
> +
> + VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));

It's wrong place for VM_BUG_ON(): zap_pte_single() on anon mapping should
work fine)

> +
> + init_rss_vec(rss);

Vector to commit single update to mm counters? What about inline counters
update for rss == NULL case?

> + update_hiwater_rss(mm);
> + flush_cache_page(vma, addr, pte_pfn(*pte));
> + zap_pte(NULL, vma, pte, addr, NULL, rss, &force_flush);
> + flush_tlb_page(vma, addr);
> + add_mm_rss_vec(mm, rss);
> +}
> +

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-25 Thread Matthew Wilcox
On Wed, Jul 23, 2014 at 06:55:00PM +0300, Kirill A. Shutemov wrote:
> > update_hiwater_rss(mm);
> 
> No: you cannot end up with lower rss after replace, iiuc.

Actually, you can ... when we replace a real page with a PFN, our rss
decreases.

> Do you mean you pointed to new file all the time? O_CREAT doesn't truncate
> file if it exists, iirc.

It was pointing to a new file.  Still not sure why that one failed to trigger
the problem.  The slightly modified version attached triggered the problem
*just fine* :-)

I've attached all the patches in my tree so far.  For the v9 patch kit,
I'll keep patch 3 as a separate patch, but roll patches 1, 2 and 4 into
other patches.

I am seeing something odd though.  When I run double-map with debugging
printks inserted in strategic spots in the kernel, I see four calls to
do_dax_fault().  The first two, as expected, are the loads from the two
mapped addresses.  The third is via mkwrite, but then the fourth time
I get a regular page fault for write, and I don't understand why I get it.

Any ideas?


[  880.966632] do_dax_fault: fault a8 page =   (null) bh state 0 written
 0 addr 7ff598835000
[  880.966637] dax_load_hole: page = ea0002784730
[  882.114387] do_dax_fault: fault a8 page = ea0002784730 bh state 0 
written 0 addr 7ff598834000
[  882.114389] dax_load_hole: page = ea0002784730
[  882.780013] do_dax_fault: fault 5 page = ea0002784730 bh state 0 written 
0 addr 7ff598835000
[  882.780095] insert_pfn: pte = 800108200225
[  882.780096] do_dax_fault: page = ea0002784730 pfn = 108200 error = 0
[  882.780098] CPU: 1 PID: 1511 Comm: double-map Not tainted 3.16.0-rc6+ #89
[  882.780099] Hardware name: Gigabyte Technology Co., Ltd. To be filled by 
O.E.M./Q87M-D2H, BIOS F6 08/03/2013
[  882.780100]   8800b41e3ba0 815c6e73 
ea0002784730
[  882.780102]  8800b41e3c88 8124c319 7ff598835000 
8800b2436ac0
[  882.780104]  0005 a01e3020 0004 
8805
[  882.780106] Call Trace:
[  882.780110]  [] dump_stack+0x4d/0x66
[  882.780114]  [] do_dax_fault+0x569/0x6f0
[  882.780133]  [] dax_fault+0x3f/0x90
[  882.780136]  [] ? native_sched_clock+0x2e/0xb0
[  882.780137]  [] dax_mkwrite+0xe/0x10
[  882.780143]  [] ext4_dax_mkwrite+0x15/0x20 [ext4]
[  882.780146]  [] do_page_mkwrite+0x47/0xb0
[  882.780148]  [] do_wp_page+0x6e2/0x990
[  882.780150]  [] handle_mm_fault+0x6ab/0xf70
[  882.780154]  [] __do_page_fault+0x1ec/0x5b0
[  882.780163]  [] do_page_fault+0x22/0x30
[  882.780165]  [] page_fault+0x28/0x30
[  882.780204] do_dax_fault: fault a9 page =   (null) bh state 20 
written 1 addr 7ff598835000
[  882.780206] insert_pfn: pte = 800108200225
[  882.780207] do_dax_fault: page =   (null) pfn = 108200 error = 0
[  882.780208] CPU: 1 PID: 1511 Comm: double-map Not tainted 3.16.0-rc6+ #89
[  882.780208] Hardware name: Gigabyte Technology Co., Ltd. To be filled by 
O.E.M./Q87M-D2H, BIOS F6 08/03/2013
[  882.780209]   8800b41e3bc0 815c6e73 

[  882.780211]  8800b41e3ca8 8124c319 7ff598835000 
8800b41e3c08
[  882.780213]  8800a2a60608 a01e3020  
88a9
[  882.780214] Call Trace:
[  882.780216]  [] dump_stack+0x4d/0x66
[  882.780218]  [] do_dax_fault+0x569/0x6f0
[  882.780232]  [] dax_fault+0x3f/0x90
[  882.780238]  [] ext4_dax_fault+0x15/0x20 [ext4]
[  882.780240]  [] __do_fault+0x41/0xd0
[  882.780241]  [] do_shared_fault.isra.56+0x35/0x220
[  882.780243]  [] handle_mm_fault+0x303/0xf70
[  882.780246]  [] __do_page_fault+0x1ec/0x5b0
[  882.780254]  [] do_page_fault+0x22/0x30
[  882.780255]  [] page_fault+0x28/0x30


diff --git a/fs/dax.c b/fs/dax.c
index b4fdfd9..4b0f928 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -257,6 +257,7 @@ static int dax_load_hole(struct address_space *mapping, 
struct page *page,
if (!page)
page = find_or_create_page(mapping, vmf->pgoff,
GFP_KERNEL | __GFP_ZERO);
+printk("%s: page = %p\n", __func__, page);
if (!page)
return VM_FAULT_OOM;
/* Recheck i_size under page lock to avoid truncate race */
@@ -332,6 +333,7 @@ static int do_dax_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
if (error || bh.b_size < PAGE_SIZE)
goto sigbus;
 
+printk("%s: fault %x page = %p bh state %lx written %d addr %lx\n", __func__, 
vmf->flags, page, bh.b_state, buffer_written(&bh), vaddr);
if (!buffer_written(&bh) && !vmf->cow_page) {
if (vmf->flags & FAULT_FLAG_WRITE) {
error = get_block(inode, block, &bh, 1);
@@ -372,6 +374,8 @@ static int do_dax_fault(struct vm_area_struct *vma, struct 
vm_fault *vmf,
error = dax_get_pfn(&bh, &pfn, blkbits);
if (error > 0)
error = vm_replace_mixed(vma, vaddr, pfn);
+printk("%s: page = %p pfn = %

RE: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-23 Thread Zhang, Tianfei
It is double page_table_lock issue, should be free-and-realloc will be simple 
and readability?

+   if (!pte_none(*pte)) {
+   if (!replace)
+   goto out_unlock;
+   
VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
+  pte_unmap_unlock(pte, ptl);
+   zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
+  pte = get_locked_pte(mm, addr, &ptl);
+   }

Best,
Figo

> -Original Message-
> From: owner-linux...@kvack.org [mailto:owner-linux...@kvack.org] On
> Behalf Of Kirill A. Shutemov
> Sent: Wednesday, July 23, 2014 11:55 PM
> To: Matthew Wilcox
> Cc: Wilcox, Matthew R; linux-fsde...@vger.kernel.org; linux...@kvack.org;
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v8 05/22] Add vm_replace_mixed()
> 
> On Wed, Jul 23, 2014 at 10:27:45AM -0400, Matthew Wilcox wrote:
> > On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> > > > I'd love to use a lighter-weight weapon!  What would you recommend
> > > > using, zap_pte_range()?
> > >
> > > The most straight-forward way: extract body of pte cycle from
> > > zap_pte_range() to separate function -- zap_pte() -- and use it.
> >
> > OK, I can do that.  What about the other parts of zap_page_range(), do
> > I need to call them?
> >
> > lru_add_drain();
> 
> No, I guess..
> 
> > tlb_gather_mmu(&tlb, mm, address, end);
> > tlb_finish_mmu(&tlb, address, end);
> 
> New zap_pte() should tolerate tlb == NULL and does flush_tlb_page() or
> pte_clear_*flush or something.
> 
> > update_hiwater_rss(mm);
> 
> No: you cannot end up with lower rss after replace, iiuc.
> 
> > mmu_notifier_invalidate_range_start(mm, address, end);
> > mmu_notifier_invalidate_range_end(mm, address, end);
> 
> mmu_notifier_invalidate_page() should be enough.
> 
> > > > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
> > > > perror(argv[1]);
> > > > exit(1);
> > > > }
> > > >
> > > > if (ftruncate(fd, 4096) < 0) {
> > >
> > > Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below
> > > will fault in page from backing storage, not hole and write will not
> > > replace anything.
> >
> > Ah, it was starting with a new file, hence the O_CREAT up above.
> 
> Do you mean you pointed to new file all the time? O_CREAT doesn't truncate
> file if it exists, iirc.
> 
> --
>  Kirill A. Shutemov
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
> majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-23 Thread Kirill A. Shutemov
On Wed, Jul 23, 2014 at 10:27:45AM -0400, Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> > > I'd love to use a lighter-weight weapon!  What would you recommend using,
> > > zap_pte_range()?
> > 
> > The most straight-forward way: extract body of pte cycle from
> > zap_pte_range() to separate function -- zap_pte() -- and use it.
> 
> OK, I can do that.  What about the other parts of zap_page_range(),
> do I need to call them?
> 
> lru_add_drain();

No, I guess..

> tlb_gather_mmu(&tlb, mm, address, end);
> tlb_finish_mmu(&tlb, address, end);

New zap_pte() should tolerate tlb == NULL and does flush_tlb_page() or
pte_clear_*flush or something.

> update_hiwater_rss(mm);

No: you cannot end up with lower rss after replace, iiuc.

> mmu_notifier_invalidate_range_start(mm, address, end);
> mmu_notifier_invalidate_range_end(mm, address, end);

mmu_notifier_invalidate_page() should be enough.

> > >   if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
> > >   perror(argv[1]);
> > >   exit(1);
> > >   }
> > > 
> > >   if (ftruncate(fd, 4096) < 0) {
> > 
> > Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will
> > fault in page from backing storage, not hole and write will not replace
> > anything.
> 
> Ah, it was starting with a new file, hence the O_CREAT up above.

Do you mean you pointed to new file all the time? O_CREAT doesn't truncate
file if it exists, iirc.

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-23 Thread Matthew Wilcox
On Wed, Jul 23, 2014 at 05:20:48PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> > I'd love to use a lighter-weight weapon!  What would you recommend using,
> > zap_pte_range()?
> 
> The most straight-forward way: extract body of pte cycle from
> zap_pte_range() to separate function -- zap_pte() -- and use it.

OK, I can do that.  What about the other parts of zap_page_range(),
do I need to call them?

lru_add_drain();
tlb_gather_mmu(&tlb, mm, address, end);
update_hiwater_rss(mm);
mmu_notifier_invalidate_range_start(mm, address, end);
[   unmap_single_vma(&tlb, vma, address, end, details);]
mmu_notifier_invalidate_range_end(mm, address, end);
tlb_finish_mmu(&tlb, address, end);

> > if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
> > perror(argv[1]);
> > exit(1);
> > }
> > 
> > if (ftruncate(fd, 4096) < 0) {
> 
> Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will
> fault in page from backing storage, not hole and write will not replace
> anything.

Ah, it was starting with a new file, hence the O_CREAT up above.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-23 Thread Kirill A. Shutemov
On Wed, Jul 23, 2014 at 09:52:22AM -0400, Matthew Wilcox wrote:
> On Wed, Jul 23, 2014 at 02:45:40PM +0300, Kirill A. Shutemov wrote:
> > On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote:
> > > From: Matthew Wilcox 
> > > 
> > > vm_insert_mixed() will fail if there is already a valid PTE at that
> > > location.  The DAX code would rather replace the previous value with
> > > the new PTE.
> 
> > > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, 
> > > unsigned long addr,
> > >   if (!pte)
> > >   goto out;
> > >   retval = -EBUSY;
> > > - if (!pte_none(*pte))
> > > - goto out_unlock;
> > > + if (!pte_none(*pte)) {
> > > + if (!replace)
> > > + goto out_unlock;
> > > + 
> > > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> > > + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
> > 
> > zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not
> > going to work.
> 
> I have a test program that exercises this path ... it seems to work!
> Following the code, I don't understand why it does.  Maybe it's not
> exercising this path after all?  I've attached the program (so that I
> have an "oh, duh" moment about 5 seconds after sending the email).

See below.

> 
> > And zap_page_range*() is pretty heavy weapon to shoot down one pte, which
> > we already have pointer to. Why?
> 
> I'd love to use a lighter-weight weapon!  What would you recommend using,
> zap_pte_range()?

The most straight-forward way: extract body of pte cycle from
zap_pte_range() to separate function -- zap_pte() -- and use it.

> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> int
> main(int argc, char *argv[])
> {
>   int fd;
>   void *addr;
>   char buf[4096];
> 
>   if (argc != 2) {
>   fprintf(stderr, "usage: %s filename\n", argv[0]);
>   exit(1);
>   }
> 
>   if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
>   perror(argv[1]);
>   exit(1);
>   }
> 
>   if (ftruncate(fd, 4096) < 0) {

Shouldn't this be ftruncate(fd, 0)? Otherwise the memcpy() below will
fault in page from backing storage, not hole and write will not replace
anything.

>   perror("ftruncate");
>   exit(1);
>   }
> 
>   if ((addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
>   fd, 0)) == MAP_FAILED) {
>   perror("mmap");
>   exit(1);
>   }
> 
>   close(fd);
> 
>   /* first read */
>   memcpy(buf, addr, 4096);
> 
>   /* now write a bit */
>   memcpy(addr, buf, 8);
> 
>   printf("%s: test passed.\n", argv[0]);
>   exit(0);
> }


-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-23 Thread Matthew Wilcox
On Wed, Jul 23, 2014 at 02:45:40PM +0300, Kirill A. Shutemov wrote:
> On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox 
> > 
> > vm_insert_mixed() will fail if there is already a valid PTE at that
> > location.  The DAX code would rather replace the previous value with
> > the new PTE.

> > @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, 
> > unsigned long addr,
> > if (!pte)
> > goto out;
> > retval = -EBUSY;
> > -   if (!pte_none(*pte))
> > -   goto out_unlock;
> > +   if (!pte_none(*pte)) {
> > +   if (!replace)
> > +   goto out_unlock;
> > +   
> > VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> > +   zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
> 
> zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not
> going to work.

I have a test program that exercises this path ... it seems to work!
Following the code, I don't understand why it does.  Maybe it's not
exercising this path after all?  I've attached the program (so that I
have an "oh, duh" moment about 5 seconds after sending the email).

> And zap_page_range*() is pretty heavy weapon to shoot down one pte, which
> we already have pointer to. Why?

I'd love to use a lighter-weight weapon!  What would you recommend using,
zap_pte_range()?

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int
main(int argc, char *argv[])
{
	int fd;
	void *addr;
	char buf[4096];

	if (argc != 2) {
		fprintf(stderr, "usage: %s filename\n", argv[0]);
		exit(1);
	}

	if ((fd = open(argv[1], O_CREAT|O_RDWR, 0666)) < 0) {
		perror(argv[1]);
		exit(1);
	}

	if (ftruncate(fd, 4096) < 0) {
		perror("ftruncate");
		exit(1);
	}

	if ((addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED,
	fd, 0)) == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	close(fd);

	/* first read */
	memcpy(buf, addr, 4096);

	/* now write a bit */
	memcpy(addr, buf, 8);

	printf("%s: test passed.\n", argv[0]);
	exit(0);
}


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-23 Thread Kirill A. Shutemov
On Tue, Jul 22, 2014 at 03:47:53PM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> vm_insert_mixed() will fail if there is already a valid PTE at that
> location.  The DAX code would rather replace the previous value with
> the new PTE.
> 
> Signed-off-by: Matthew Wilcox 
> ---
>  include/linux/mm.h |  8 ++--
>  mm/memory.c| 34 +-
>  2 files changed, 27 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e04f531..8d1194c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1958,8 +1958,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned 
> long addr,
>  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page 
> *);
>  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn);
> -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn);
> +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, bool replace);
> +#define vm_insert_mixed(vma, addr, pfn)  \
> + __vm_insert_mixed(vma, addr, pfn, false)
> +#define vm_replace_mixed(vma, addr, pfn) \
> + __vm_insert_mixed(vma, addr, pfn, true)
>  int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned 
> long len);
>  
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 42bf429..cf06c97 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1476,7 +1476,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned 
> long addr,
>   * pages reserved for the old functions anyway.
>   */
>  static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> - struct page *page, pgprot_t prot)
> + struct page *page, pgprot_t prot, bool replace)
>  {
>   struct mm_struct *mm = vma->vm_mm;
>   int retval;
> @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (!pte)
>   goto out;
>   retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (!replace)
> + goto out_unlock;
> + 
> VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);

zap_page_range_single() takes ptl by itself in zap_pte_range(). It's not
going to work.

And zap_page_range*() is pretty heavy weapon to shoot down one pte, which
we already have pointer to. Why?

-- 
 Kirill A. Shutemov
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v8 05/22] Add vm_replace_mixed()

2014-07-23 Thread Jan Kara
On Tue 22-07-14 15:47:53, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> vm_insert_mixed() will fail if there is already a valid PTE at that
> location.  The DAX code would rather replace the previous value with
> the new PTE.
> 
> Signed-off-by: Matthew Wilcox 
  This looks good to me although I'm not an expert in this area. So just:
Acked-by: Jan Kara 

Honza

> ---
>  include/linux/mm.h |  8 ++--
>  mm/memory.c| 34 +-
>  2 files changed, 27 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e04f531..8d1194c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1958,8 +1958,12 @@ int remap_pfn_range(struct vm_area_struct *, unsigned 
> long addr,
>  int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page 
> *);
>  int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
>   unsigned long pfn);
> -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn);
> +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, bool replace);
> +#define vm_insert_mixed(vma, addr, pfn)  \
> + __vm_insert_mixed(vma, addr, pfn, false)
> +#define vm_replace_mixed(vma, addr, pfn) \
> + __vm_insert_mixed(vma, addr, pfn, true)
>  int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned 
> long len);
>  
>  
> diff --git a/mm/memory.c b/mm/memory.c
> index 42bf429..cf06c97 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1476,7 +1476,7 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned 
> long addr,
>   * pages reserved for the old functions anyway.
>   */
>  static int insert_page(struct vm_area_struct *vma, unsigned long addr,
> - struct page *page, pgprot_t prot)
> + struct page *page, pgprot_t prot, bool replace)
>  {
>   struct mm_struct *mm = vma->vm_mm;
>   int retval;
> @@ -1492,8 +1492,12 @@ static int insert_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (!pte)
>   goto out;
>   retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (!replace)
> + goto out_unlock;
> + 
> VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
> + }
>  
>   /* Ok, finally just insert the thing.. */
>   get_page(page);
> @@ -1549,12 +1553,12 @@ int vm_insert_page(struct vm_area_struct *vma, 
> unsigned long addr,
>   BUG_ON(vma->vm_flags & VM_PFNMAP);
>   vma->vm_flags |= VM_MIXEDMAP;
>   }
> - return insert_page(vma, addr, page, vma->vm_page_prot);
> + return insert_page(vma, addr, page, vma->vm_page_prot, false);
>  }
>  EXPORT_SYMBOL(vm_insert_page);
>  
>  static int insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn, pgprot_t prot)
> + unsigned long pfn, pgprot_t prot, bool replace)
>  {
>   struct mm_struct *mm = vma->vm_mm;
>   int retval;
> @@ -1566,8 +1570,12 @@ static int insert_pfn(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (!pte)
>   goto out;
>   retval = -EBUSY;
> - if (!pte_none(*pte))
> - goto out_unlock;
> + if (!pte_none(*pte)) {
> + if (!replace)
> + goto out_unlock;
> + 
> VM_BUG_ON(!mutex_is_locked(&vma->vm_file->f_mapping->i_mmap_mutex));
> + zap_page_range_single(vma, addr, PAGE_SIZE, NULL);
> + }
>  
>   /* Ok, finally just insert the thing.. */
>   entry = pte_mkspecial(pfn_pte(pfn, prot));
> @@ -1620,14 +1628,14 @@ int vm_insert_pfn(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (track_pfn_insert(vma, &pgprot, pfn))
>   return -EINVAL;
>  
> - ret = insert_pfn(vma, addr, pfn, pgprot);
> + ret = insert_pfn(vma, addr, pfn, pgprot, false);
>  
>   return ret;
>  }
>  EXPORT_SYMBOL(vm_insert_pfn);
>  
> -int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> - unsigned long pfn)
> +int __vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> + unsigned long pfn, bool replace)
>  {
>   BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
>  
> @@ -1645,11 +1653,11 @@ int vm_insert_mixed(struct vm_area_struct *vma, 
> unsigned long addr,
>   struct page *page;
>  
>   page = pfn_to_page(pfn);
> - return insert_page(vma, addr, page, vma->vm_page_prot);
> + return insert_page(vma, addr, page, vma->vm_page_prot, replace);
>   }
> - return insert_pfn(vma, addr, pfn, vma->vm_page_prot);
> +