Re: [Xen-devel] [PATCH v2] x86: enable RCU based table free

2017-08-24 Thread Kirill A. Shutemov
On Thu, Aug 24, 2017 at 11:22:58AM +0200, Vitaly Kuznetsov wrote:
> On x86, software page-table walkers depend on the fact that a remote TLB
> flush does an IPI: the walk is performed lockless but with interrupts
> disabled, so when a page table is freed, the freeing CPU blocks on the
> required remote TLB flush until the walker is done. On other
> architectures, which don't require an IPI to do a remote TLB flush, we
> have an RCU-based mechanism instead (see include/asm-generic/tlb.h for
> more details).
> 
> In virtualized environments we may want to override the .flush_tlb_others
> hook in pv_mmu_ops and use a hypercall to ask the hypervisor to do the
> remote TLB flush for us. This breaks the IPI assumption. Xen PV has done
> this for years, and the upcoming remote TLB flush for Hyper-V will do it
> too. This is not safe: software page-table walkers may step on an
> already-freed page.
> 
> Solve the issue by enabling the RCU-based table free mechanism. Testing
> with kernbench and mmap/munmap microbenchmarks didn't show any notable
> performance impact.
> 
> Suggested-by: Peter Zijlstra <pet...@infradead.org>
> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
> Acked-by: Juergen Gross <jgr...@suse.com>
> ---
> Changes since v1:
> - Enable HAVE_RCU_TABLE_FREE unconditionally to avoid different code paths
>   for no reason [Linus Torvalds]
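
For context, the RCU-based scheme being enabled works by batching
page-table pages and freeing them from an RCU callback instead of right
after the TLB flush. A lockless walker runs with interrupts disabled,
which is an RCU-sched read-side section, so by the time the callback
runs no walker can still see the table. A condensed sketch of the
mechanism from include/asm-generic/tlb.h and mm/memory.c as it looked at
the time (simplified, comments mine):

/* Page-table pages queued for freeing after a grace period. */
struct mmu_table_batch {
	struct rcu_head	rcu;
	unsigned int	nr;
	void		*tables[0];
};

static void tlb_remove_table_rcu(struct rcu_head *head)
{
	struct mmu_table_batch *batch;
	int i;

	batch = container_of(head, struct mmu_table_batch, rcu);

	/* Every walker that could have seen these tables has finished
	 * its interrupts-disabled section by now, so freeing is safe
	 * without an IPI. */
	for (i = 0; i < batch->nr; i++)
		__tlb_remove_table(batch->tables[i]);

	free_page((unsigned long)batch);
}

void tlb_table_flush(struct mmu_gather *tlb)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch) {
		/* Defer the free until after a grace period. */
		call_rcu_sched(&(*batch)->rcu, tlb_remove_table_rcu);
		*batch = NULL;
	}
}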

Acked-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>

-- 
 Kirill A. Shutemov



Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT

2017-08-24 Thread Kirill A. Shutemov
On Wed, Aug 23, 2017 at 04:03:53PM -0700, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 3:36 PM, Kirill A. Shutemov
> <kirill.shute...@linux.intel.com> wrote:
> >
> > Below is a test case that allocates a lot of page tables and measures
> > fork/exit time. (I'm not entirely sure it's the best way to stress the
> > codepath.)
> 
> Looks ok to me. Doing a profile (without the RCU freeing, obviously) gives me
> 
>    0.77%  a.out  [kernel.vmlinux]  [k] free_pgd_range
> 
> so it does seem to spend time in the page directory code.
> 
> > Unpatched:  average 4.8322s, stddev 0.114s
> > Patched:    average 4.8362s, stddev 0.111s
> 
> Ok, I vote for avoiding the complexity of two different behaviors, and
> just making the page table freeing use RCU unconditionally.
> 
> If actively trying to trigger that code doesn't show a real measurable
> difference, I don't think it matters, and the fewer different code
> paths we have, the better.

Numbers from a bigger 2-socket machine:

Unpatched:  average 5.0542s, stddev 0.058s
Patched:    average 5.0440s, stddev 0.072s

Still fine.

I don't see a reason not to go down this path.

-- 
 Kirill A. Shutemov



Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT

2017-08-23 Thread Kirill A. Shutemov
On Wed, Aug 23, 2017 at 08:27:18PM +, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 12:59 PM, Kirill A. Shutemov
> <kir...@shutemov.name> wrote:
> >
> > In this case we need performance numbers for a !PARAVIRT kernel.
> 
> Yes.
> 
> > Numbers for a tight loop of "mmap(MAP_POPULATE); munmap()" might be
> > interesting too, as a worst-case scenario.
> 
> Actually, I don't think you want to populate all the pages. You just
> want to populate *one* page, in order to build up the page directory
> structure, not allocate all the final pages.
> 
> And we only free the actual page tables when there is nothing around,
> so it should be at least a 2MB-aligned region etc.
> 
> So you should do a *big* allocation, and then touch a single page in
> the middle, and then munmap it - that should give you maximal page
> table activity. Otherwise the page tables will generally just stay
> around.
> 
> Realistically, it's mainly exit() that frees page tables. Yes, you may
> have a few page tables free'd by a normal munmap(), but it's usually
> very limited. Which is why I suggested that script-heavy thing with
> lots of small executables. That tends to be the main realistic load
> that really causes a ton of page directory activity.
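
For illustration, the pattern Linus describes might look like this (a
sketch; the loop count and sizes are mine, not from the thread):

#include <assert.h>
#include <sys/mman.h>

#define GB (1UL << 30)

int main(void)
{
	unsigned long i;

	for (i = 0; i < 10000; i++) {
		/* Reserve 1G of anonymous memory; nothing is populated yet. */
		char *p = mmap(NULL, GB, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		assert(p != MAP_FAILED);

		/* Touch a single page in the middle, which builds the
		 * intermediate page tables covering it. */
		p[GB / 2] = 1;

		/* Tearing down the whole mapping frees those tables,
		 * exercising the table-free path. */
		munmap(p, GB);
	}
	return 0;
}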

Below is a test case that allocates a lot of page tables and measures
fork/exit time. (I'm not entirely sure it's the best way to stress the
codepath.)

Unpatched:  average 4.8322s, stddev 0.114s
Patched:    average 4.8362s, stddev 0.111s

Both without PARAVIRT. The patch was modified to enable HAVE_RCU_TABLE_FREE
for !PARAVIRT too.

The test case requires "echo 1 > /proc/sys/vm/overcommit_memory".

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/prctl.h>

#define PUD_SIZE (1UL << 30)
#define PMD_SIZE (1UL << 21)

#define NR_PUD 4096

#define NSEC_PER_SEC 1000000000L

int main(void)
{
	char *addr = NULL;
	unsigned long i, j;
	struct timespec start, finish;
	long long nsec;

	/* Disable THP so the touches below allocate real page tables. */
	prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);

	/* Touch one page per PMD in NR_PUD 1G mappings, building a large
	 * number of page tables for fork/exit to copy and free. */
	for (i = 0; i < NR_PUD; i++) {
		addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
				MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			break;
		}

		for (j = 0; j < PUD_SIZE; j += PMD_SIZE)
			assert(addr[j] == 0);
	}

	for (i = 0; i < 10; i++) {
		pid_t pid;

		clock_gettime(CLOCK_MONOTONIC, &start);
		pid = fork();
		if (pid == -1)
			perror("fork");
		if (!pid)
			exit(0);
		wait(NULL);
		clock_gettime(CLOCK_MONOTONIC, &finish);

		nsec = (finish.tv_sec - start.tv_sec) * NSEC_PER_SEC +
			(finish.tv_nsec - start.tv_nsec);
		printf("%lld\n", nsec);
	}

	return 0;
}
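
For completeness, build and run steps would be something like this (the
file name is arbitrary):

$ gcc -O2 -o pt-test pt-test.c
# echo 1 > /proc/sys/vm/overcommit_memory
$ ./pt-test
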
-- 
 Kirill A. Shutemov



Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT

2017-08-23 Thread Kirill A. Shutemov
On Wed, Aug 23, 2017 at 11:26:46AM -0700, Linus Torvalds wrote:
> On Wed, Aug 23, 2017 at 6:45 AM, Vitaly Kuznetsov <vkuzn...@redhat.com> wrote:
> >
> > Solve the issue by enabling the RCU-based table free mechanism when
> > PARAVIRT is selected in the config. Testing with kernbench doesn't show
> > any notable performance impact:
> 
> I wonder if we should just make it unconditional if it doesn't really
> show any performance difference. One less piece of config complexity to
> worry about (and in this case I'm not so much worried about Kconfig
> itself as just "oh, you have totally different paths in the core VM
> depending on PARAVIRT").

In this case we need performance numbers for a !PARAVIRT kernel.

> That said, the thing to test for these kinds of things is often
> heavily scripted loads that just run thousands and thousands of really
> small processes, and build up and tear down page tables all the time
> because of fork/exit.
> 
> The load I've used occasionally is just "make test" in the git source
> tree. Tons and tons of trivial fork/exec/exit things for all those
> small tests and shell scripts.

Numbers for a tight loop of "mmap(MAP_POPULATE); munmap()" might be
interesting too, as a worst-case scenario.
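
Such a loop could be as simple as the following sketch (the size and
iteration count are arbitrary):

#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
	/* 128M mapped and populated per iteration: every page is faulted
	 * in up front, then the whole range is torn down at once. */
	size_t len = 128UL << 20;
	unsigned long i;

	for (i = 0; i < 1000; i++) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE,
			       -1, 0);
		assert(p != MAP_FAILED);
		munmap(p, len);
	}
	return 0;
}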

-- 
 Kirill A. Shutemov



Re: [Xen-devel] [PATCHv4 18/33] x86/xen: convert __xen_pgd_walk() and xen_cleanmfnmap() to support p4d

2017-03-07 Thread Kirill A. Shutemov
> > 	unpin = (vaddr == 2 * PGDIR_SIZE);
> > -	set_pgd(pgd, __pgd(0));
> > -	do {
> > -		pud = pud_page + pud_index(va);
> > -		if (pud_none(*pud)) {
> > -			va += PUD_SIZE;
> > -		} else if (pud_large(*pud)) {
> > -			pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
> > -			xen_free_ro_pages(pa, PUD_SIZE);
> > -			va += PUD_SIZE;
> > -		} else {
> > -			pmd = pmd_offset(pud, va);
> > -			if (pmd_large(*pmd)) {
> > -				pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
> > -				xen_free_ro_pages(pa, PMD_SIZE);
> > -			} else if (!pmd_none(*pmd)) {
> > -				pte = pte_offset_kernel(pmd, va);
> > -				set_pmd(pmd, __pmd(0));
> > -				for (i = 0; i < PTRS_PER_PTE; ++i) {
> > -					if (pte_none(pte[i]))
> > -						break;
> > -					pa = pte_pfn(pte[i]) << PAGE_SHIFT;
> > -					xen_free_ro_pages(pa, PAGE_SIZE);
> > -				}
> > -				xen_cleanmfnmap_free_pgtbl(pte, unpin);
> > -			}
> > -			va += PMD_SIZE;
> > -			if (pmd_index(va))
> > -				continue;
> > -			set_pud(pud, __pud(0));
> > -			xen_cleanmfnmap_free_pgtbl(pmd, unpin);
> > -		}
> > -
> > -	} while (pud_index(va) || pmd_index(va));
> > -	xen_cleanmfnmap_free_pgtbl(pud_page, unpin);
> > +	vaddr &= PMD_MASK;
> > +	pgd = pgd_offset_k(vaddr);
> > +	p4d = p4d_offset(pgd, 0);
> > +	for (i = 0; i < PTRS_PER_P4D; i++) {
> > +		if (p4d_none(p4d[i]))
> > +			continue;
> > +		xen_cleanmfnmap_p4d(p4d + i, unpin);
> > +	}
> 
> Don't we need to pass vaddr down to all routines so that they select
> appropriate tables? You seem to always be choosing the first one.

IIUC, we clear the whole page-table subtree covered by one pgd entry,
so, no, there's no need to pass vaddr down. A pointer to the page table
entry is enough.

But I know virtually nothing about Xen. Please re-check my reasoning.

I would also appreciate help with getting the x86 Xen code to work with
5-level paging enabled. For now I've made CONFIG_XEN depend on
!CONFIG_X86_5LEVEL.

Fixup (the return values of the lower-level walkers were being dropped,
so a needed flush could be missed; accumulate them into 'flush'):

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a4079cfab007..d66b7e79781a 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -629,7 +629,8 @@ static int xen_pud_walk(struct mm_struct *mm, pud_t *pud,
 		pmd = pmd_offset(&pud[i], 0);
 		if (PTRS_PER_PMD > 1)
 			flush |= (*func)(mm, virt_to_page(pmd), PT_PMD);
-		xen_pmd_walk(mm, pmd, func, last && i == nr - 1, limit);
+		flush |= xen_pmd_walk(mm, pmd, func,
+				last && i == nr - 1, limit);
 	}
 	return flush;
 }
@@ -650,7 +651,8 @@ static int xen_p4d_walk(struct mm_struct *mm, p4d_t *p4d,
 		pud = pud_offset(&p4d[i], 0);
 		if (PTRS_PER_PUD > 1)
 			flush |= (*func)(mm, virt_to_page(pud), PT_PUD);
-		xen_pud_walk(mm, pud, func, last && i == nr - 1, limit);
+		flush |= xen_pud_walk(mm, pud, func,
+				last && i == nr - 1, limit);
 	}
 	return flush;
 }
@@ -706,7 +708,7 @@ static int __xen_pgd_walk(struct mm_struct *mm, pgd_t *pgd,
 		p4d = p4d_offset(&pgd[i], 0);
 		if (PTRS_PER_P4D > 1)
 			flush |= (*func)(mm, virt_to_page(p4d), PT_P4D);
-		xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
+		flush |= xen_p4d_walk(mm, p4d, func, i == nr - 1, limit);
 	}
 
 	/* Do the top level last, so that the callbacks can use it as
-- 
 Kirill A. Shutemov



Re: [Xen-devel] [PATCHv4 28/33] x86/mm: add support of additional page table level during early boot

2017-03-06 Thread Kirill A. Shutemov
On Mon, Mar 06, 2017 at 03:05:49PM -0500, Boris Ostrovsky wrote:
> 
> > diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> > index 9991224f6238..c9e41f1599dd 100644
> > --- a/arch/x86/include/asm/pgtable_64.h
> > +++ b/arch/x86/include/asm/pgtable_64.h
> > @@ -14,15 +14,17 @@
> >  #include <linux/bitops.h>
> >  #include <linux/threads.h>
> >  
> > +extern p4d_t level4_kernel_pgt[512];
> > +extern p4d_t level4_ident_pgt[512];
> >  extern pud_t level3_kernel_pgt[512];
> >  extern pud_t level3_ident_pgt[512];
> >  extern pmd_t level2_kernel_pgt[512];
> >  extern pmd_t level2_fixmap_pgt[512];
> >  extern pmd_t level2_ident_pgt[512];
> >  extern pte_t level1_fixmap_pgt[512];
> > -extern pgd_t init_level4_pgt[];
> > +extern pgd_t init_top_pgt[];
> >  
> > -#define swapper_pg_dir init_level4_pgt
> > +#define swapper_pg_dir init_top_pgt
> >  
> >  extern void paging_init(void);
> >  
> 
> 
> This means you also need
> 
> 
> diff --git a/arch/x86/xen/xen-pvh.S b/arch/x86/xen/xen-pvh.S
> index 5e24671..e1a5fbe 100644
> --- a/arch/x86/xen/xen-pvh.S
> +++ b/arch/x86/xen/xen-pvh.S
> @@ -87,7 +87,7 @@ ENTRY(pvh_start_xen)
> 	wrmsr
> 
> 	/* Enable pre-constructed page tables. */
> -	mov $_pa(init_level4_pgt), %eax
> +	mov $_pa(init_top_pgt), %eax
> 	mov %eax, %cr3
> 	mov $(X86_CR0_PG | X86_CR0_PE), %eax
> 	mov %eax, %cr0
> 
> 

Ah, thanks. I'd missed that.

The fix is folded in.
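
(As a sanity check for a rename like this, something along these lines
will catch any remaining users of the old name:)

$ git grep -nw init_level4_pgt -- arch/x86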

-- 
 Kirill A. Shutemov



Re: [Xen-devel] NUMA_BALANCING and Xen PV guest regression in 3.20-rc0

2015-02-19 Thread Kirill A. Shutemov
On Thu, Feb 19, 2015 at 01:06:53PM +, David Vrabel wrote:
> Mel,
> 
> The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
> dereference pmd outside of the lock during NUMA hinting fault) and
> specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
> p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.
> 
> Any fault on a present userspace mapping (e.g., a write to a read-only
> mapping) is being misinterpreted as a NUMA hinting fault and not handled
> correctly. All userspace programs end up continuously faulting.
> 
> This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
> all present userspace page table entries.

I feel like I'm missing some very basic background on how Xen works, but
why does it set _PAGE_GLOBAL on userspace entries? It sounds strange to me.

 
> Note the comment in asm/pgtable_types.h that says that
> _PAGE_BIT_PROTNONE is only valid on non-present entries.
> 
>   /* If _PAGE_BIT_PRESENT is clear, we use these: */
>   /* - if the user mapped it with PROT_NONE; pte_present gives true */
>   #define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL
> 
> Adjusting pte_protnone() and pmd_protnone() to check for the absence of
> _PAGE_PRESENT allows 64-bit Xen PV guests to work correctly again (see
> following patch), but I'm not sure if NUMA_BALANCING would correctly
> work with this change.
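
For reference, a check of the shape described above might look roughly
like this (a sketch of the idea, not necessarily the exact patch;
pmd_protnone() would be adjusted the same way):

static inline int pte_protnone(pte_t pte)
{
	/* Treat an entry as PROT_NONE only if it is *not* present, so
	 * Xen setting _PAGE_GLOBAL on present user entries is no
	 * longer misread as a NUMA hinting fault. */
	return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
		== _PAGE_PROTNONE;
}
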
-- 
 Kirill A. Shutemov
