Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Andrea Arcangeli
On Tue, Apr 01, 2008 at 01:55:32PM -0700, Christoph Lameter wrote:
 +/* Perform a callback */
 +int __emm_notify(struct mm_struct *mm, enum emm_operation op,
 + unsigned long start, unsigned long end)
 +{
 + struct emm_notifier *e = rcu_dereference(mm)->emm_notifier;
 + int x;
 +
 + while (e) {
 +
 + if (e->callback) {
 + x = e->callback(e, mm, op, start, end);
 + if (x)
 + return x;

There are much bigger issues besides the rcu safety in this patch:
proper aging of the secondary mmu through access bits set by hardware
is unfixable with this model (you would need to do age |=
e->callback()), which is the proof of why this isn't flexible enough by
forcing the same parameters and retvals for all methods. No idea why
you go for such an inferior solution that will never get the aging right
and will likely fall apart if we add more methods in the future.

For example the switch you have to add in
xpmem_emm_notifier_callback doesn't look good, at least gcc may be
able to optimize it with an array indexing simulating proper pointer
to function like in #v9.

Most other patches will apply cleanly on top of my coming mmu
notifiers #v10 that I hope will go in -mm.

For #v10 the only two left open issues to discuss are:

1) the moment you remove rcu_read_lock from the methods (my #v9 had
   rcu_read_lock, so synchronize_rcu() in Jack's patch was working with
   my #v9) GRU has no way to ensure the methods will fire immediately
   after registering. To fix this race after removing the
   rcu_read_lock (to prepare for the later patches that allow the VM
   to schedule when the mmu notifier methods are invoked) I can
   replace rcu_read_lock with seqlock locking in the same way as I did
   in a previous patch posted here (seqlock_write around the
   registration method, and seqlock_read replaying all callbacks if the
   race happened); a rough illustrative sketch of this scheme follows
   this list. Then synchronize_rcu becomes unnecessary and the
   methods will be correctly replayed, allowing GRU not to corrupt
   memory after the registration method. EMM would also need a fix
   like this for GRU to be safe on top of EMM.

   Another less obviously safe approach is to allow the register
   method to succeed only when mm_users=1 and the task is single
   threaded. This way, if all the places where the mmu notifiers are
   invoked on the mm not by the current task are only doing
   invalidates after/before zapping ptes, and if the instantiation of new
   ptes is single threaded too, we shouldn't worry if we miss an
   invalidate for a pte that is zero and doesn't point to any physical
   page. In the places where current->mm != mm I'm using
   invalidate_page 99% of the time, and that only follows the
   ptep_clear_flush. The problem is the range_begin calls that will happen
   before zapping the pte in places where current->mm !=
   mm. Unfortunately in my incremental patch where I move all
   invalidate_page outside of the PT lock to prepare for allowing
   sleeping inside the mmu notifiers, I used range_begin/end in places
   like try_to_unmap_cluster where current->mm != mm. In general
   this solution looks more fragile than the seqlock.

2) I'm uncertain how the driver can handle a range_end called before
   range_begin. Also multiple range_begin can happen in parallel later
   followed by range_end, so if there's a global seqlock that
   serializes the secondary mmu page fault, that will screwup (you
   can't seqlock_write in range_begin and sequnlock_write in
   range_end). The write side of the seqlock must be serialized and
   calling seqlock_write twice in a row before any sequnlock operation
   will break.

   A recursive rwsem taken in range_begin and released in range_end
   seems to be the only way to stop the secondary mmu page faults.

   If I were to remove all range_begin/end in places where current->mm
   != mm, then I could as well bail out in mmu_notifier_register when
   mm_users != 1, to solve problem 2 too.

   My solution to this is that I believe the driver is safe if
   range_end is missed, as long as range_end is followed by an invalidate
   event like in invalidate_range_end. So the driver is ok to just
   keep a static flag that records whether range_begin has ever happened,
   and to simply return from range_end without doing anything if no
   range_begin ever happened.
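
   A rough illustrative sketch of the seqlock scheme mentioned in point
   1 (this is not the posted #v10 code; the list and field names are
   made up, only the seqlock primitives are the real kernel API):

	/*
	 * Illustrative only: registration takes the write side of a
	 * per-mm seqlock; the invalidate hook replays the callback walk
	 * if a registration raced with it, so a freshly registered
	 * notifier cannot miss an invalidate.
	 */
	void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
	{
		write_seqlock(&mm->mmu_notifier_lock);
		mn->next = mm->mmu_notifier_list;
		mm->mmu_notifier_list = mn;
		write_sequnlock(&mm->mmu_notifier_lock);
	}

	void __mmu_notifier_invalidate_page(struct mm_struct *mm,
					    unsigned long address)
	{
		struct mmu_notifier *mn;
		unsigned seq;

		do {
			seq = read_seqbegin(&mm->mmu_notifier_lock);
			for (mn = mm->mmu_notifier_list; mn; mn = mn->next)
				mn->ops->invalidate_page(mn, mm, address);
			/* a register raced in: replay so it sees this invalidate */
		} while (read_seqretry(&mm->mmu_notifier_lock, seq));
	}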


Notably I'll be trying to use range_begin in KVM too so I got to deal
with 2) too. For Nick: the reason for using range_begin is supposedly
an optimization: to guarantee that the last free of the page will
happen outside the mmu_lock, so KVM internally to the mmu_lock is free
to do:

 spin_lock(kvm->mmu_lock)
 put_page()
 spte = nonpresent
 flush secondary tlb()
 spin_unlock(kvm->mmu_lock)

The above ordering is unsafe if the page could ever reach the freelist
before the tlb flush happened. The range_begin will take the mmu_lock
and will hold off 

Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2

2008-04-02 Thread Avi Kivity
David Abrahams wrote:
 With the title combination, the guest takes nearly 100% of my real CPU
 time and still only sees one CPU.  Is this a known problem, and does it
 have a known solution?

   

Can you send the output of 'kvm_stat -1'?  Also, what does the guest 
think it is doing (in task manager, etc.)?

What HAL do you see in device manager?

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH] Fix endianness for virtio-blk config space

2008-04-02 Thread Avi Kivity
Anthony Liguori wrote:
 The virtio config space is little endian.  Make sure that in virtio-blk we
 store the values in little endian format.
   

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 07:32:35AM +0300, Avi Kivity wrote:
 It ought to work.  gfn_to_hfn() (old gfn_to_page) will still need to take a 
 refcount if possible.

This reminds me that with mmu notifiers we could implement gfn_to_hfn
with only follow_page, and skip the refcounting on the struct page.

I'm not suggesting that though; the refcount makes the code more
robust IMHO, and notably it allows running on kernels without mmu
notifiers.

So the trick is to do something like

   if (pfn_valid(pfn))
           put_page(pfn_to_page(pfn));

That's the trick I was suggesting to take care of mmio space.

If the reserved ram is used instead of VT-d to allow DMA, that will
have to change to:

   if (pfn_valid(pfn)) {
           page = pfn_to_page(pfn);
           if (page_count(page))
                   put_page(page);
   }

that will take care of both the reserved ram and the pfn.

In general I made it for the reserved-ram with only the page_count !=
0 check, because 'struct page' existed.

I'm unsure if it's good to add struct pages for non-ram, I find it
slightly confusing and not the right thing, it takes memory for stuff
that can't happen (refcounting only makes sense if the page finally
goes in the freelist when count reaches 0, and PG_dirty/referenced
bits and the like don't make sense either for non-ram or
reserved-ram). So I'm not sure the vmemmap trick is the best.

 This will increase the potential for type errors, so maybe we need to make 
 gfn_t and hfn_t distinct structures, and use accessors to get the actual 
 values.

That's also a possibility. If we go for this then hfn_t is probably a
better name as it's less likely to collide with core kernel VM
code. Otherwise perhaps pfn can be used instead of hfn, it's up to
you ;).



Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part

2008-04-02 Thread Avi Kivity
Andrea Arcangeli wrote:
 On Wed, Apr 02, 2008 at 07:32:35AM +0300, Avi Kivity wrote:
   
 It ought to work.  gfn_to_hfn() (old gfn_to_page) will still need to take a 
 refcount if possible.
 

 This reminds me, that mmu notifiers we could implement gfn_to_hfn only
 with follow_page and skip the refcounting on the struct page.

 I'm not suggesting that though, the refcount makes the code more
 robust IMHO, and notably it allows to run on kernels without mmu
 notifiers.

   

Isn't it faster though?  We don't need to pull in the cacheline 
containing the struct page anymore.

We could hack something to make pre mmu notifier kernels work.


 I'm unsure if it's good to add struct pages for non-ram, I find it
 slightly confusing and not the right thing, it takes memory for stuff
 that can't happen (refcounting only makes sense if the page finally
 goes in the freelist when count reaches 0, and PG_dirty/referenced
 bits and the like don't make sense either for non-ram or
 reserved-ram). So I'm not sure the vmemmap trick is the best.

   

I thought we'd meet with resistance to that idea :) though I'd like to 
point out that struct pages do exist for mmio on some machines (those 
with >= 4GB).

 This will increase the potential for type errors, so maybe we need to make 
 gfn_t and hfn_t distinct structures, and use accessors to get the actual 
 values.
 

 That's also a possibility. If we go for this then hfn_t is probably a
 better name as it's less likely to collide with core kernel VM
 code. Otherwise perhaps pfn can be used instead of hfn, it's up to
 you ;).
   

I guess we can move to pfn, they're unambiguous enough.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote:
 Isn't it faster though?  We don't need to pull in the cacheline containing 
 the struct page anymore.

Exactly, and not only that: get_user_pages is likely a bit slower than we
need for just a kvm pte lookup. GRU uses follow_page directly because it
runs in the tlb miss handler; for us instead the tlb miss handler
doesn't invoke a page fault unless the spte is non present.

I expect for us that optimization will be mostly lost in the noise
but it is likely measurable in heavy VM workloads and in general it
is surely worth it in the long run (if nothing else as a micro-optimization
for whatever spte fault benchmark).

 We could hack something to make pre mmu notifier kernels work.

Actually I already removed the refcounting for the reserved ram, so
it's going to be very easy to do with a few CONFIG_MMU_NOTIFIER.

But because it's a low prio I would defer it for later, first patch
can do the refcounting unconditionally to better test it. (I'm
assuming we're going to deal with kernels without mmu notifiers [or
without EMM ;) ] for a long while)

 I thought we'd meet with resistance to that idea :) though I'd like to 
 point out that struct pages do exist for mmio on some machines (those with 
 >= 4GB).

Sure, those should have page_count 0 like the reserved ram. I'm
uncertain about the PageReserved semantics, I thought Nick nuked it
long ago but perhaps it was something else. Now copy_page_range bails
out on PFNMAP vmas, in the old days it would threat PG_reserved pages
mapped with remap_page_range like a VM_SHARED by not wrprotecting and
not refcounting even if this was a private mapping.

In general using page_t for anything but ram is something that should
be avoided and that also makes PG_reserved semantics mostly
irrelevant. The users of remap_pfn_range (like /dev/mem) should never
use pages regardless if it's ram or non-ram or if page_t exists or
not. The fact that page_t sometimes exists for non-ram is strictly a
performance optimization to make the pfn_to_page resolution quicker at
runtime (the speedup is significant and the memory loss is negligible).

 I guess we can move to pfn, they're unambiguous enough.

Ok! ;)



Re: [kvm-devel] [RFC/PATCH 05/15 v3] kvm-s390: s390 arch backend for the kvm kernel module

2008-04-02 Thread Christian Borntraeger
Am Montag, 31. März 2008 schrieb Arnd Bergmann:

Hello Arnd,

thanks for the review.

 On Tuesday 25 March 2008, Carsten Otte wrote:
 
  +
  +static inline void __user *__guestaddr_to_user(struct kvm_vcpu *vcpu,
  +  u64 guestaddr)
  +{
  +   u64 prefix  = vcpu->arch.sie_block->prefix;
  +   u64 origin  = vcpu->kvm->arch.guest_origin;
  +   u64 memsize = vcpu->kvm->arch.guest_memsize;
  +
  +   if (guestaddr < 2 * PAGE_SIZE)
  +           guestaddr += prefix;
  +   else if ((guestaddr >= prefix) && (guestaddr < prefix + 2 * PAGE_SIZE))
  +           guestaddr -= prefix;
 
 What happens if prefix is set to 4096? Does this do the right thing
 according to the architecture definition?

The z/Architecture and the instructions (spx + sigp set prefix) don't allow 
4096 as a prefix address. When setting a prefix, bits 1-18 (IBM numbering 
scheme) are used and appended with 13 zeroes to the right. That means the 
prefix address is always a multiple of 8192.
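
Purely as an illustration of that alignment rule (this is not part of the
patch, and masking of the upper bits is omitted): clearing the 13 low bits
of the supplied value yields the effective prefix, which is therefore always
a multiple of 2 * PAGE_SIZE = 8192.

	static inline u64 effective_prefix(u64 value)
	{
		/* "13 zeroes appended to the right" == low 13 bits clear */
		return value & ~0x1fffULL;
	}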



Re: [kvm-devel] [PATCH 2/2] Move in-kernel PIT device to separate file

2008-04-02 Thread Avi Kivity
Anthony Liguori wrote:
 This patch is mostly code motion to move the newly refactored
 in-kernel PIT device to a separate file.

   

Applied both, thanks.


-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Robin Holt
On Wed, Apr 02, 2008 at 08:49:52AM +0200, Andrea Arcangeli wrote:
 Most other patches will apply cleanly on top of my coming mmu
 notifiers #v10 that I hope will go in -mm.
 
 For #v10 the only two left open issues to discuss are:

Does your v10 allow sleeping inside the callbacks?

Thanks,
Robin



Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part

2008-04-02 Thread Avi Kivity
Andrea Arcangeli wrote:
 On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote:
   
 Isn't it faster though?  We don't need to pull in the cacheline containing 
 the struct page anymore.
 

 Exactly, not only that, get_user_pages is likely a bit slower that we
 need for just kvm pte lookup. GRU uses follow_page directly because it
 runs in the tlb miss handler, for us instead the tlb miss handler
 doesn't invoke a page fault unless the spte is non present.

   

How about this plan?

0. Merge mmu notifiers

1. gfn_to_page() -> gfn_to_pfn()

Still keeping the refcount.  Change bad_page to kvm_bad_hfn.

2. Drop the refcounting from gfn_to_pfn() and from kvm_release_page_*()

Still using get_user_pages() (and dropping the refcount immediately)

Simultaneously, change hack_module.awk to add the refcount back.

3. Export follow_page() or something based on fast_gup(), and use it

btw, if we change the method we use to read the Linux pte, I'd like to 
get the writable bit out of it.  This way, when we create an spte for a 
gpte that is writable and dirty, we can set the spte writable iff the 
Linux pte is writable.  This avoids breaking COW unnecessarily.
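
A minimal sketch of that idea (illustrative only, not the actual patch;
host_pte, gpte_writable and gpte_dirty are hypothetical locals, while the
pte_* helpers and PT_*_MASK constants follow the usual kernel/KVM naming):

	/* Propagate write permission only when the Linux pte itself is
	 * writable, so COW mappings are not broken unnecessarily. */
	if (pte_present(host_pte)) {
		spte = (pte_pfn(host_pte) << PAGE_SHIFT) | PT_PRESENT_MASK;
		if (gpte_writable && gpte_dirty && pte_write(host_pte))
			spte |= PT_WRITABLE_MASK;
	}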

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] KVM hangs with windows XP guest - help with debugging?

2008-04-02 Thread Jonathan Underwood
On 01/04/2008, Avi Kivity [EMAIL PROTECTED] wrote:

  Please run 'kvm_stat -1' (found in the tarball) while the guest is hanging
 and report the output.
# /home/jgu/Desktop/kvm_stat -1
efer_reload  8757563   129
exits   17039786   362
halt_exits384540   129
halt_wakeup0 0
invlpg 0 0
io_exits 3938052   100
irq_exits 113040 0
irq_window 0 0
light_exits  8419349   233
mmio_exits   4813794 0
pf_fixed 587157199
pf_guest  523123 0
request_irq0 0
signal_exits  136078 0
tlb_flush 61846733

 Also report whether the qemu process is consuming
 cpu or not.


Yes, it's hammering the CPU flat out.

Let me know if i can provide more info.
J.

  --
  Any sufficiently difficult bug is indistinguishable from a feature.





Re: [kvm-devel] [PATCH] KVM: add kvm_get_kvm and kvm_put_kvm functions

2008-04-02 Thread Avi Kivity
Izik Eidus wrote:
 From ebb9fe4765f1572314d2249e29a7ef4d0de07273 Mon Sep 17 00:00:00 2001
 From: Izik Eidus [EMAIL PROTECTED]
 Date: Sun, 30 Mar 2008 15:48:35 +0300
 Subject: [PATCH] KVM: add kvm_get_kvm and kvm_put_kvm functions,

 the main purpose of adding these functions is the ability to release the
 spinlock that protects the kvm list while still being able to do
 operations on a specific kvm in a safe way.

   

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] KVM hangs with windows XP guest - help with debugging?

2008-04-02 Thread Jonathan Underwood
On 02/04/2008, Jonathan Underwood [EMAIL PROTECTED] wrote:
 On 01/04/2008, Avi Kivity [EMAIL PROTECTED] wrote:

Please run 'kvm_stat -1' (found in the tarball) while the guest is hanging
   and report the output.

 # /home/jgu/Desktop/kvm_stat -1
  efer_reload  8757563   129
  exits   17039786   362
  halt_exits384540   129
  halt_wakeup0 0
  invlpg 0 0
  io_exits 3938052   100
  irq_exits 113040 0
  irq_window 0 0
  light_exits  8419349   233
  mmio_exits   4813794 0
  pf_fixed 587157199
  pf_guest  523123 0
  request_irq0 0
  signal_exits  136078 0
  tlb_flush 61846733


Actually, that VM did eventually spring back into life, bizarrely.
However, later on I saw another hard hang:

# /home/jgu/Desktop/kvm_stat -1
efer_reload 28180156  1140
exits  339058958312917
halt_exits   1254859 0
halt_wakeup0 0
invlpg 0 0
io_exits18320867 0
irq_exits 826972 0
irq_window 0 0
light_exits312098749312898
mmio_exits   9583557 0
pf_fixed   110064034312919
pf_guest 2235329 0
request_irq0 0
signal_exits 1207511  1087
tlb_flush2391783 0

And the VM was still consuming CPU.

J.



Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 02:16:41PM +0300, Avi Kivity wrote:
 Ugh, there's still mark_page_accessed() and SetPageDirty().

btw, like PG_dirty is only set if the spte is writeable,
mark_page_accessed should only run if the accessed bit is set in the
spte. It doesn't matter now as nobody could possibly clear it, but
with mmu notifiers the ->clear_young will clear that accessed bit for
the first time, and if the guest didn't use the spte in the meantime,
we don't need to mark the page accessed and we can let the VM swap out
the page at the next round of the lru.

But that's ok, we'll simply run mark_page_accessed and SetPageDirty if
pfn_valid is true. We're under mmu_lock so the mmu notifiers will
prevent the page to go away from under us. 

The detail I'm uncertain about is why my reserved-ram patch had both
pfn_valid = 1 and page_count = 0 and PG_reserved = 0 for all reserved
ram. Perhaps I triggered a bug in some memmap initialization, as the
common code should keep all pages that aren't supposed to go in the
freelist when freed as PG_reserved and page_count 1. Anyway in
practice it worked fine and there's zero risk I'm not using
reserved-ram with the page_count = 0 check ;).

The rmap_remove should look like this (this won't work with the
reserved ram but that needs fixing somehow as reserved-ram must be
like mmio region but with a pfn_valid and in turn with a page_t, and
then the reserved-ram patch will be able to cope with the reserved ram
not having an allocated memmap by luck of how page_alloc.c works).

if (pfn_valid(pfn)) {
	page = pfn_to_page(pfn);
	if (!PageReserved(page)) {
		BUG_ON(page_count(page) != 1);
		if (is_writeable_pte(*spte))
			SetPageDirty(page);
		if (*spte & PT_ACCESSED_MASK)
			mark_page_accessed(page);
	}
}

It still skips an atomic op. Your plan still sounds just fine despite
the above, in fact it sounds too perfect: the awk hack to re-add the
refcounting when building the external module if CONFIG_MMU_NOTIFIER
isn't defined is going to be messy, a plain CONFIG_MMU_NOTIFIER in
kvm.git would be simpler and more robust IMHO even if less perfect :).



Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 01:50:19PM +0200, Andrea Arcangeli wrote:
   if (pfn_valid(pfn)) {
   page = pfn_to_page(pfn);
   if (!PageReserved(page)) {
   BUG_ON(page_count(page) != 1);
   if (is_writeable_pte(*spte))
   SetPageDirty(page);
   if (*spte & PT_ACCESSED_MASK)
   mark_page_accessed(page);
   }
   }

warning the indenting of the above in text-mode was wrong... read it
like a c compiler would do sorry ;)



[kvm-devel] How guest virtual address is translated to physical address?

2008-04-02 Thread Guillaume Thouvenin
Hello,

 I have a question about how the guest page table and shadow page table
work together and, more precisely, about how the host is involved when the
guest accesses a page that is already in the page table.

 The guest maintains its page table to translate guest virtual address
to guest physical address. It uses the MMU and cr3 register for the
translation. When there is a page fault, the host gets involved and it
catches the guest physical address (by walking through the guest page
table?) and fills its shadow page table. Thus the host can make the
translation from guest physical address to physical address. Now, if
the guest reads the same page, the PTE will point to guest physical
address but there will not be any page fault so the host will not be
involved in the translation (not so sure). I don't see how the guest
virtual address will be translated to a physical address. I'm missing
something but I don't see what.


Thanks for your help,
Guillaume



Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Robin Holt
I must have missed v10.  Could you repost so I can build xpmem
against it to see how it operates?  To help reduce confusion, you should
probably commandeer the patches from Christoph's set which you think are
needed to make it sleep.

Thanks,
Robin


On Wed, Apr 02, 2008 at 01:16:51PM +0200, Andrea Arcangeli wrote:
 On Wed, Apr 02, 2008 at 05:59:25AM -0500, Robin Holt wrote:
  On Wed, Apr 02, 2008 at 08:49:52AM +0200, Andrea Arcangeli wrote:
   Most other patches will apply cleanly on top of my coming mmu
   notifiers #v10 that I hope will go in -mm.
   
   For #v10 the only two left open issues to discuss are:
  
  Does your v10 allow sleeping inside the callbacks?
 
 Yes if you apply all the patches. But not if you apply the first patch
 only; most patches in the EMM series will apply cleanly or with minor
 rejects to #v10 too. Christoph's further work to make EMM sleep
 capable looks very good and it's going to be 100% shared; it's also
 going to be a lot more controversial for merging than the first patch
 of either #v10 or EMM. EMM also doesn't allow sleeping inside the callbacks
 if you only apply the first patch in the series.
 
 My priority is to get #v9 or the coming #v10 merged in -mm (only
 difference will be the replacement of rcu_read_lock with the seqlock
 to avoid breaking the synchronize_rcu in GRU code). I will mix seqlock
 with rcu ordered writes. EMM indeed breaks GRU by making
 synchronize_rcu a noop and by not providing any alternative (I will
 obsolete synchronize_rcu making it a noop instead). This assumes Jack
 used synchronize_rcu for whatever good reason. But this isn't the real
 strong point against EMM, adding seqlock to EMM is as easy as adding
 it to #v10 (admittedly with #v10 it's a bit easier because I didn't
 expand the hlist operations for zero gain like in EMM).
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/



Re: [kvm-devel] [PATCH 1/1] direct mmio for passthrough - kernel part

2008-04-02 Thread Anthony Liguori
Avi Kivity wrote:
 Andrea Arcangeli wrote:
 On Wed, Apr 02, 2008 at 12:50:50PM +0300, Avi Kivity wrote:
  
 Isn't it faster though?  We don't need to pull in the cacheline 
 containing the struct page anymore.
 

 Exactly, not only that, get_user_pages is likely a bit slower that we
 need for just kvm pte lookup. GRU uses follow_page directly because it
 runs in the tlb miss handler, for us instead the tlb miss handler
 doesn't invoke a page fault unless the spte is non present.

   

 How about this plan?

 0. Merge mmu notifiers

Are mmu-notifiers likely to get pushed when the 2.6.26 window opens up?

 1. gfn_to_page() - gfn_to_pfn()

 Still keeping the refcount.  Change bad_page to kvm_bad_hfn.

kvm_bad_pfn.

 2. Drop the refcounting from gfn_to_pfn() and from kvm_release_page_*()

 Still using get_user_pages() (and dropping the refcount immediately)

 Simultaneously, change hack_module.awk to add the refcount back.

This has the dependency on mmu notifiers so step 1 can conceivably be 
merged in the absence of mmu notifiers.

Regards,

Anthony Liguori

 3. Export follow_page() or something based on fast_gup(), and use it

 btw, if we change the method we use to read the Linux pte, I'd like to 
 get the writable bit out of it.  This was, when we create an spte for 
 a gpte that is writable and dirty, we can set the spte writable iff 
 the Linux pte is writable.  This avoids breaking COW unnecessarily.





Re: [kvm-devel] How guest virtual address is translated to physical address?

2008-04-02 Thread Anthony Liguori
Guillaume Thouvenin wrote:
 Hello,

  I have a question about how guest page table and shadow page table
 work together and more precisely, about how host is involved when guest
 access a page that it's already in the page table.

  The guest maintains its page table to translate guest virtual address
 to guest physical address. It uses the MMU and cr3 register for the
 translation. When there is a page fault, the host gets involved and it
 catches the guest physical address (by walking through the guest page
 table?) and fills its shadow page table.

When a page fault occurs, a vmexit is generated.  The vmexit is 
delivered before delivery to the interrupt handler but after CR2 is 
populated with the faulting address.  We can then obtain the faulting 
virtual address from CR2 in the host and walk the guest's page table to 
determine what guest physical address should be mapped at that guest 
virtual address (if any at all).

We then use that information to add an entry to the shadow page table.  
The guest runs with the shadow page table installed, the hardware never 
actually sees the guest page table.  This is because the hardware has no 
knowledge of MMU virtualization (this is all assuming pre-EPT/NPT hardware).
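
In rough pseudocode, the fault path described above looks like this (every
identifier here is illustrative rather than real KVM code):

	static void handle_shadow_page_fault(struct kvm_vcpu *vcpu)
	{
		gva_t gva = read_guest_cr2(vcpu);              /* faulting guest virtual address */
		gpa_t gpa = walk_guest_page_tables(vcpu, gva); /* consult the guest's own tables */

		if (gpa_is_valid(gpa))
			map_shadow_entry(vcpu, gva, gpa);  /* gva -> host physical in the shadow table */
		else
			inject_page_fault(vcpu, gva);      /* genuine guest fault: reflect it */
	}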

  Thus the host can make the
 translation from guest physical address to physical address.

And we use a combination of things to make that determination.  First, 
there is a set of slots that cover ranges of physical addresses.  Slots 
are necessary on x86 as there may be very large holes in physical 
memory.  It also makes certain optimization easier.  Each slot contains 
a base virtual address.  This is a QEMU userspace address that is where 
we malloc()'d memory for that particular slot.  Normally, this 
corresponds to phys_ram_base within QEMU.

We add the guest physical address to the virtual address base in the 
slot (minus the slot starting address) to obtain a QEMU userspace 
address for the guest physical address.  Once we have that, we can call 
get_user_pages() to pin the userspace memory into physical memory.  We 
can then obtain a host physical address.  That's what we use to populate 
the shadow page table.
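
A simplified sketch of that lookup (error handling and slot validation are
omitted, and the field names are only assumed to match the real struct
kvm_memory_slot):

	static unsigned long gfn_to_hva(struct kvm *kvm, gfn_t gfn)
	{
		struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);

		/* slot's QEMU userspace base plus the offset of this gfn */
		return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
	}

	static unsigned long gfn_to_host_pfn(struct kvm *kvm, gfn_t gfn)
	{
		struct page *page;

		/* pin the backing page; 2.6.25-era get_user_pages() arguments:
		 * (tsk, mm, start, len, write, force, pages, vmas) */
		get_user_pages(current, current->mm, gfn_to_hva(kvm, gfn),
			       1, 1, 0, &page, NULL);
		return page_to_pfn(page);
	}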

The flip side of this is that as long as we have a shadow page table 
mapping referencing a host physical address, we must keep the page 
pinned into memory.  Once we drop the shadow page table entry (because 
the guest process has gone away or we decide to evict it from the cache) 
we can reduce the reference count via put_page().

This is where mmu notifiers come into play.  If Linux really wants to 
reclaim memory on the host, there's no way ATM for it to do so with the 
memory we have pinned.  mmu notifiers provide a mechanism for Linux to 
ask KVM to explicitly unpin a particular guest page.

  Now, if
 the guest reads the same page, the PTE will point to guest physical
 address but there will not be any page fault so the host will not be
 involved in the translation (not so sure). I don't see how the guest
 virtual address will be translated to physical address? I miss
 something but I don't see what.
   

I think what you're missing is that while the guest is running, its CR3 
does not point to its own page table but rather to the shadow page 
table.  We intercept CR3 reads to pretend to the guest that its page 
table is, in fact, installed, but it's really the shadow page table 
that's in the hardware register.

Regards,

Anthony Liguori

 Thanks for your help,
 Guillaume





Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2

2008-04-02 Thread David Abrahams

on Wed Apr 02 2008, Avi Kivity avi-AT-qumranet.com wrote:

 David Abrahams wrote:
 With the title combination, the guest takes nearly 100% of my real CPU
 time and still only sees one CPU.  Is this a known problem, and does it
 have a known solution?

   

 Can you send the output of 'kvm_stat -1'?  

I don't appear to have a kvm_stat executable anywhere.  How do I acquire
that?

 Also, what does the guest
 think it is doing (in task manager, etc.)?

System Idle Process :(

Everything is calm in the guest, but top in the host shows a kvm
process at between 50 and 70 percent of the CPU.

 What HAL do you see in device manager?

In the guest?  I don't see anything that obviously appears to be HAL in
device manager.  Could you be more specific?

Sorry to be so clueless, and thanks again in advance,

-- 
Dave Abrahams
Boost Consulting
http://boost-consulting.com



Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount

2008-04-02 Thread Andrea Arcangeli
On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
  This results in f.e. the Aim9 brk performance test to go down by 10-15%.

I guess it's more likely because of overscheduling for small critical
sections; did you count the total number of context switches? I
guess there will be a lot more with your patch applied. That
regression is a showstopper and it is the reason why I've suggested
before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
option to make the VM locks sleep capable only when XPMEM=y
(PREEMPT_RT will enable it too). Thanks for doing the benchmark work!



Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

 There are much bigger issues besides the rcu safety in this patch,
 proper aging of the secondary mmu through access bits set by hardware
 is unfixable with this model (you would need to do age |=
 e->callback()), which is the proof of why this isn't flexible enough by
 forcing the same parameters and retvals for all methods. No idea why
 you go for such an inferior solution that will never get the aging right
 and will likely fall apart if we add more methods in the future.

There is always the possibility to add special functions in the same way 
as done in the mmu notifier series if it really becomes necessary. EMM 
does  in no way preclude that.

Here, f.e., we can add a special emm_age() function that iterates 
differently and does the | for you.
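
Roughly, such a helper could look like this (illustrative only, not part of
the posted series; the emm_referenced opcode is assumed here):

	int __emm_age(struct mm_struct *mm, unsigned long start, unsigned long end)
	{
		struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
		int young = 0;

		while (e) {
			if (e->callback)
				/* OR the per-subsystem "referenced" results together */
				young |= e->callback(e, mm, emm_referenced, start, end);
			e = rcu_dereference(e->next);
		}
		return young;
	}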

 For example the switch you have to add in
 xpmem_emm_notifier_callback doesn't look good, at least gcc may be
 able to optimize it with an array indexing simulating proper pointer
 to function like in #v9.

Actually the switch looks really good because it allows code to run
for all callbacks like f.e. xpmem_tg_ref(). Otherwise the refcounting code 
would have to be added to each callback.

 
 Most other patches will apply cleanly on top of my coming mmu
 notifiers #v10 that I hope will go in -mm.
 
 For #v10 the only two left open issues to discuss are:

Did I see #v10? Could you start a new subject when you post please? Do 
not respond to some old message otherwise the threading will be wrong.

methods will be correctly replayed, allowing GRU not to corrupt
memory after the registration method. EMM would also need a fix
like this for GRU to be safe on top of EMM.

How exactly does the GRU corrupt memory?
 
Another less obviously safe approach is to allow the register
method to succeed only when mm_users=1 and the task is single
threaded. This way, if all the places where the mmu notifiers are
invoked on the mm not by the current task are only doing
invalidates after/before zapping ptes, and if the instantiation of new
ptes is single threaded too, we shouldn't worry if we miss an
invalidate for a pte that is zero and doesn't point to any physical
page. In the places where current->mm != mm I'm using
invalidate_page 99% of the time, and that only follows the
ptep_clear_flush. The problem is the range_begin calls that will happen
before zapping the pte in places where current->mm !=
mm. Unfortunately in my incremental patch where I move all
invalidate_page outside of the PT lock to prepare for allowing
sleeping inside the mmu notifiers, I used range_begin/end in places
like try_to_unmap_cluster where current->mm != mm. In general
this solution looks more fragile than the seqlock.

Hmmm... Okay that is one solution that would just require a BUG_ON in the 
registration methods.

 2) I'm uncertain how the driver can handle a range_end called before
range_begin. Also multiple range_begin can happen in parallel later
followed by range_end, so if there's a global seqlock that
serializes the secondary mmu page fault, that will screwup (you
can't seqlock_write in range_begin and sequnlock_write in
range_end). The write side of the seqlock must be serialized and
calling seqlock_write twice in a row before any sequnlock operation
will break.

Well doesn't the requirement of just one execution thread also deal with 
that issue?



Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

 On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
This results in f.e. the Aim9 brk performance test to got down by 10-15%.
 
 I guess it's more likely because of overscheduling for small crtitical
 sections, did you counted the total number of context switches? I
 guess there will be a lot more with your patch applied. That
 regression is a showstopper and it is the reason why I've suggested
 before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
 option to make the VM locks sleep capable only when XPMEM=y
 (PREEMPT_RT will enable it too). Thanks for doing the benchmark work!

There are more context switches if locks are contended. 

But that has actually also some good aspects because we avoid busy loops 
and can potentially continue work in another process.




[kvm-devel] EMM: Fixup return value handling of emm_notify()

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Christoph Lameter wrote:

 Here f.e. We can add a special emm_age() function that iterates 
 differently and does the | for you.

Well maybe not really necessary. How about this fix? It's likely a problem 
to stop callbacks if one callback returned an error.


Subject: EMM: Fixup return value handling of emm_notify()

Right now we stop calling additional subsystems if one callback returned
an error. That has the potential for causing additional trouble with the
subsystems that do not receive the callbacks they expect if one has failed.

So change the handling of error code to continue callbacks to other
subsystems but return the first error code encountered.

If a callback returns a positive return value then add up all the values
from all the calls. That can be used to establish how many references
exist (xpmem may want this feature at some point) or ensure that the
aging works the way Andrea wants it to (KVM, XPmem so far do not
care too much).

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/rmap.c |   28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c	2008-04-02 11:46:20.738342852 -0700
+++ linux-2.6/mm/rmap.c 2008-04-02 12:03:57.672494320 -0700
@@ -299,27 +299,45 @@ void emm_notifier_register(struct emm_no
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
-/* Perform a callback */
+/*
+ * Perform a callback
+ *
+ * The return of this function is either a negative error of the first
+ * callback that failed or a consolidated count of all the positive
+ * values that were returned by the callbacks.
+ */
 int __emm_notify(struct mm_struct *mm, enum emm_operation op,
unsigned long start, unsigned long end)
 {
	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
int x;
+   int result = 0;
 
while (e) {
-
		if (e->callback) {
			x = e->callback(e, mm, op, start, end);
-			if (x)
-				return x;
+
+			/*
+			 * Callback may return a positive value to indicate a count
+			 * or a negative error code. We keep the first error code
+			 * but continue to perform callbacks to other subscribed
+			 * subsystems.
+			 */
+			if (x && result >= 0) {
+				if (x >= 0)
+					result += x;
+				else
+					result = x;
+			}
}
+
/*
 * emm_notifier contents (e) must be fetched after
 * the retrival of the pointer to the notifier.
 */
		e = rcu_dereference(e->next);
}
-   return 0;
+   return result;
 }
 EXPORT_SYMBOL_GPL(__emm_notify);
 #endif



[kvm-devel] [PATCH] Don't assume struct page for x86 MMU

2008-04-02 Thread Anthony Liguori
This patch introduces a gfn_to_pfn() function and corresponding functions like
kvm_release_pfn_dirty().  Using these new functions, we can modify the x86
MMU to no longer assume that it can always get a struct page for any given gfn.

We don't want to eliminate gfn_to_page() entirely because a number of places
assume they can do gfn_to_page() and then kmap() the results.  When we support
IO memory, gfn_to_page() will fail for IO pages although gfn_to_pfn() will
succeed.

This does not implement support for avoiding reference counting for reserved
RAM or for IO memory.  However, it should make those things pretty straight
forward.

Since we're only introducing new common symbols, I don't think it will break
the non-x86 architectures but I haven't tested those.  I've tested Intel,
AMD, NPT, and hugetlbfs with Windows and Linux guests.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1594ee0..867920f 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -240,11 +240,9 @@ static int is_rmap_pte(u64 pte)
return is_shadow_present_pte(pte);
 }
 
-static struct page *spte_to_page(u64 pte)
+static pfn_t spte_to_pfn(u64 pte)
 {
-   hfn_t hfn = (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
-
-   return pfn_to_page(hfn);
+   return (pte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
 static gfn_t pse36_gfn_delta(u32 gpte)
@@ -541,19 +539,19 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
struct kvm_rmap_desc *desc;
struct kvm_rmap_desc *prev_desc;
struct kvm_mmu_page *sp;
-   struct page *page;
+   pfn_t pfn;
unsigned long *rmapp;
int i;
 
if (!is_rmap_pte(*spte))
return;
sp = page_header(__pa(spte));
-   page = spte_to_page(*spte);
-   mark_page_accessed(page);
+   pfn = spte_to_pfn(*spte);
+   kvm_set_pfn_accessed(pfn);
if (is_writeble_pte(*spte))
-   kvm_release_page_dirty(page);
+   kvm_release_pfn_dirty(pfn);
else
-   kvm_release_page_clean(page);
+   kvm_release_pfn_clean(pfn);
	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], is_large_pte(*spte));
if (!*rmapp) {
		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
@@ -634,11 +632,11 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
spte = rmap_next(kvm, rmapp, spte);
}
if (write_protected) {
-   struct page *page;
+   pfn_t pfn;
 
spte = rmap_next(kvm, rmapp, NULL);
-   page = spte_to_page(*spte);
-   SetPageDirty(page);
+   pfn = spte_to_pfn(*spte);
+   kvm_set_pfn_dirty(pfn);
}
 
/* check for huge page mappings */
@@ -1035,7 +1033,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
 unsigned pt_access, unsigned pte_access,
 int user_fault, int write_fault, int dirty,
 int *ptwrite, int largepage, gfn_t gfn,
-struct page *page, bool speculative)
+pfn_t pfn, bool speculative)
 {
u64 spte;
int was_rmapped = 0;
@@ -1057,10 +1055,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
 
		child = page_header(pte & PT64_BASE_ADDR_MASK);
mmu_page_remove_parent_pte(child, shadow_pte);
-   } else if (page != spte_to_page(*shadow_pte)) {
+   } else if (pfn != spte_to_pfn(*shadow_pte)) {
		pgprintk("hfn old %lx new %lx\n",
-page_to_pfn(spte_to_page(*shadow_pte)),
-page_to_pfn(page));
+spte_to_pfn(*shadow_pte), pfn);
		rmap_remove(vcpu->kvm, shadow_pte);
} else {
if (largepage)
@@ -1089,7 +1086,7 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 
*shadow_pte,
if (largepage)
spte |= PT_PAGE_SIZE_MASK;
 
-   spte |= page_to_phys(page);
+   spte |= (pfn << PAGE_SHIFT);
 
	if ((pte_access & ACC_WRITE_MASK)
	    || (write_fault && !is_write_protection(vcpu) && !user_fault)) {
@@ -1134,12 +1131,12 @@ unshadowed:
if (!was_rmapped) {
rmap_add(vcpu, shadow_pte, gfn, largepage);
if (!is_rmap_pte(*shadow_pte))
-   kvm_release_page_clean(page);
+   kvm_release_pfn_clean(pfn);
} else {
if (was_writeble)
-   kvm_release_page_dirty(page);
+   kvm_release_pfn_dirty(pfn);
else
-   kvm_release_page_clean(page);
+   kvm_release_pfn_clean(pfn);
}
if (!ptwrite || !*ptwrite)

Re: [kvm-devel] EMM: Fixup return value handling of emm_notify()

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 12:03:50PM -0700, Christoph Lameter wrote:
 +		/*
 +		 * Callback may return a positive value to indicate a count
 +		 * or a negative error code. We keep the first error code
 +		 * but continue to perform callbacks to other subscribed
 +		 * subsystems.
 +		 */
 +		if (x && result >= 0) {
 +			if (x >= 0)
 +				result += x;
 +			else
 +				result = x;
 +		}
 	}
 +

Now think of when one of the kernel janitors will micro-optimize
PG_dirty to be returned by invalidate_page so a single set_page_dirty
will be invoked... Keep in mind this is a kernel internal API; ask
Greg if we can change it in order to optimize later in the future. I
think my #v9 is optimal enough while being simple at the same time,
but anyway it's silly to be hardwired to such an interface that, worst
of all, requires switch statements instead of proper pointers to
functions and a fixed set of parameters and retval semantics for all
methods.



[kvm-devel] EMM: Require single threadedness for registration.

2008-04-02 Thread Christoph Lameter
Here is a patch to require single threaded execution during emm_register. 
This also allows an easy implementation of an unregister function and gets
rid of the races that Andrea worried about.

The approach here is similar to what was used in selinux for security
context changes (see selinux_setprocattr).

Is it okay for the users of emm to require single threadedness for 
registration?



Subject: EMM: Require single threaded execution for register and unregister

We can avoid the concurrency issues arising at registration if we
only allow registration of notifiers when the process has only a single
thread. That even allows to avoid the use of rcu.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 mm/rmap.c |   46 +-
 1 file changed, 37 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c	2008-04-02 13:53:46.002473685 -0700
+++ linux-2.6/mm/rmap.c 2008-04-02 14:03:05.872199896 -0700
@@ -286,20 +286,48 @@ void emm_notifier_release(struct mm_stru
}
 }
 
-/* Register a notifier */
+/*
+ * Register a notifier
+ *
+ * mmap_sem is held writably.
+ *
+ * Process must be single threaded.
+ */
 void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
 {
+	BUG_ON(atomic_read(&mm->mm_users) != 1);
+
 	e->next = mm->emm_notifier;
-	/*
-	 * The update to emm_notifier (e->next) must be visible
-	 * before the pointer becomes visible.
-	 * rcu_assign_pointer() does exactly what we need.
-	 */
-	rcu_assign_pointer(mm->emm_notifier, e);
+	mm->emm_notifier = e;
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
 /*
+ * Unregister a notifier
+ *
+ * mmap_sem is held writably
+ *
+ * Process must be single threaded
+ */
+void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
+{
+	struct emm_notifier *p = mm->emm_notifier;
+
+	BUG_ON(atomic_read(&mm->mm_users) != 1);
+
+	if (e == p)
+		mm->emm_notifier = e->next;
+	else {
+		while (p->next != e)
+			p = p->next;
+
+		p->next = e->next;
+	}
+	e->callback(e, mm, emm_release, 0, TASK_SIZE);
+}
+EXPORT_SYMBOL_GPL(emm_notifier_unregister);
+
+/*
  * Perform a callback
  *
  * The return of this function is either a negative error of the first
@@ -309,7 +337,7 @@ EXPORT_SYMBOL_GPL(emm_notifier_register)
 int __emm_notify(struct mm_struct *mm, enum emm_operation op,
unsigned long start, unsigned long end)
 {
-	struct emm_notifier *e = rcu_dereference(mm->emm_notifier);
+	struct emm_notifier *e = mm->emm_notifier;
int x;
int result = 0;
 
@@ -335,7 +363,7 @@ int __emm_notify(struct mm_struct *mm, e
 * emm_notifier contents (e) must be fetched after
 * the retrival of the pointer to the notifier.
 */
-		e = rcu_dereference(e->next);
+		e = e->next;
}
return result;
 }
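
For reference, a hedged usage sketch of the interface above (the driver-side
notifier is hypothetical): registration has to happen while the process is
still single threaded, with mmap_sem held for writing as required by the
comments in the patch.

	down_write(&current->mm->mmap_sem);
	/* only valid while mm_users == 1 and the task is single threaded */
	emm_notifier_register(&my_driver_notifier, current->mm);
	up_write(&current->mm->mmap_sem);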



Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 10:59:50AM -0700, Christoph Lameter wrote:
 Did I see #v10? Could you start a new subject when you post please? Do 
 not respond to some old message otherwise the threading will be wrong.

I wasn't clear enough, #v10 was in the works... I was thinking about
the last two issues before posting it.

 How exactly does the GRU corrupt memory?

Jack added synchronize_rcu, I assume for a reason.

  
 Another less obviously safe approach is to allow the register
 method to succeed only when mm_users=1 and the task is single
 threaded. This way, if all the places where the mmu notifiers are
 invoked on the mm not by the current task are only doing
 invalidates after/before zapping ptes, and if the instantiation of new
 ptes is single threaded too, we shouldn't worry if we miss an
 invalidate for a pte that is zero and doesn't point to any physical
 page. In the places where current->mm != mm I'm using
 invalidate_page 99% of the time, and that only follows the
 ptep_clear_flush. The problem is the range_begin calls that will happen
 before zapping the pte in places where current->mm !=
 mm. Unfortunately in my incremental patch where I move all
 invalidate_page outside of the PT lock to prepare for allowing
 sleeping inside the mmu notifiers, I used range_begin/end in places
 like try_to_unmap_cluster where current->mm != mm. In general
 this solution looks more fragile than the seqlock.
 
 Hmmm... Okay that is one solution that would just require a BUG_ON in the 
 registration methods.

Perhaps you didn't notice that this solution can't work if you call
range_begin/end not in the current context and try_to_unmap_cluster
does exactly that for both my patchset and yours. Missing an _end is
ok, missing a _begin is never ok.

 Well doesnt the requirement of just one execution thread also deal with 
 that issue?

Yes, except again it can't work for try_to_unmap_cluster.

This solution is only applicable to #v10 if I fix try_to_unmap_cluster
to only call invalidate_page (relying on the fact the VM holds a pin
and a lock on any page that is being mmu-notifier-invalidated).

You can't use the single threaded approach to solve either 1 or 2,
because your _begin call is called anywhere and that's where you call
the secondary-tlb flush and it's fatal to miss it.

invalidate_page is always called after, so it enforces the tlb flush
to be called _after_, and so it's inherently safe.



Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 11:15:26AM -0700, Christoph Lameter wrote:
 On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
 
  On Tue, Apr 01, 2008 at 01:55:36PM -0700, Christoph Lameter wrote:
 This results in f.e. the Aim9 brk performance test to got down by 
   10-15%.
  
  I guess it's more likely because of overscheduling for small crtitical
  sections, did you counted the total number of context switches? I
  guess there will be a lot more with your patch applied. That
  regression is a showstopper and it is the reason why I've suggested
  before to add a CONFIG_XPMEM or CONFIG_MMU_NOTIFIER_SLEEP config
  option to make the VM locks sleep capable only when XPMEM=y
  (PREEMPT_RT will enable it too). Thanks for doing the benchmark work!
 
 There are more context switches if locks are contended. 
 
 But that has actually also some good aspects because we avoid busy loops 
 and can potentially continue work in another process.

That would be the case if the wait time were longer than the
scheduling time. The whole point is that with anon_vma the write side
is so fast it's likely never worth scheduling (probably not even with
preempt-rt for the write side; the read side is an entirely different
matter, but the read side can run concurrently if the system is
heavily paging), hence the slowdown. What you benchmarked is the
write side, which is also the fast path when the system is heavily
CPU bound. I have to say AIM is a great benchmark to test this
regression.

But I think a config option will solve all of this.
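
Roughly the kind of thing I have in mind (sketch only: the config
symbol and the field names are made up, the point is just that the
lock flavor gets decided at compile time while the call sites stay
identical):

#ifdef CONFIG_MMU_NOTIFIER_SLEEP        /* hypothetical option, =y only for XPMEM / PREEMPT_RT */
#define anon_vma_lock(av)       down_write(&(av)->sem)
#define anon_vma_unlock(av)     up_write(&(av)->sem)
#else
#define anon_vma_lock(av)       spin_lock(&(av)->lock)
#define anon_vma_unlock(av)     spin_unlock(&(av)->lock)
#endif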



Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

  Hmmm... Okay that is one solution that would just require a BUG_ON in the 
  registration methods.
 
 Perhaps you didn't notice that this solution can't work if you call
 range_begin/end not in the current context and try_to_unmap_cluster
 does exactly that for both my patchset and yours. Missing an _end is
 ok, missing a _begin is never ok.

If you look at the patch you will see a requirement of holding a
write lock on mmap_sem, which will keep out get_user_pages().



[kvm-devel] [PATCH 2 of 8] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1207159010 -7200
# Node ID fe00cb9deeb31467396370c835cb808f4b85209a
# Parent  a406c0cc686d0ca94a4d890d661cdfa48cfba09f
Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -121,27 +121,6 @@
seqlock_init(mm-mmu_notifier_lock);
 }
 
-#define ptep_clear_flush_notify(__vma, __address, __ptep)  \
-({ \
-   pte_t __pte;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __pte = ptep_clear_flush(___vma, ___address, __ptep);   \
-   mmu_notifier_invalidate_page(___vma-vm_mm, ___address);\
-   __pte;  \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep)
\
-({ \
-   int __young;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
-   __young |= mmu_notifier_clear_flush_young(___vma-vm_mm,\
- ___address);  \
-   __young;\
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -173,9 +152,6 @@
 {
 }
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,11 +194,13 @@
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
-   pteval = ptep_clear_flush_notify(vma, address, pte);
+   pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
+   /* must invalidate_page _before_ freeing the page */
+   mmu_notifier_invalidate_page(mm, address);
page_cache_release(page);
}
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1626,9 +1626,10 @@
 */
page_table = pte_offset_map_lock(mm, pmd, address,
 ptl);
-   page_cache_release(old_page);
+   new_page = NULL;
if (!pte_same(*page_table, orig_pte))
goto unlock;
+   page_cache_release(old_page);
 
page_mkwrite = 1;
}
@@ -1644,6 +1645,7 @@
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+   old_page = new_page = NULL;
goto unlock;
}
 
@@ -1688,7 +1690,7 @@
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1700,12 +1702,18 @@
} else
mem_cgroup_uncharge_page(new_page);
 
-   if (new_page)
+unlock:
+   pte_unmap_unlock(page_table, ptl);
+
+   if (new_page) {
+   if (new_page == old_page)
+   /* cow happened, notify before releasing old_page */
+   mmu_notifier_invalidate_page(mm, address);
page_cache_release(new_page);
+   }
if (old_page)
page_cache_release(old_page);
-unlock:
-   pte_unmap_unlock(page_table, ptl);
+
if (dirty_page) {
if (vma-vm_file)
file_update_time(vma-vm_file);
diff --git 

[kvm-devel] [PATCH 4 of 8] The conversion to a rwsem allows callbacks during rmap traversal

2008-04-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1207159011 -7200
# Node ID 3c3787c496cab1fc590ba3f97e7904bdfaab5375
# Parent  d880c227ddf345f5d577839d36d150c37b653bfd
The conversion to a rwsem allows callbacks during rmap traversal
for files in a non atomic context. A rw style lock also allows concurrent
walking of the reverse map. This is fairly straightforward if one removes
pieces of the resched checking.

[Restarting unmapping is an issue to be discussed].

This slightly increases Aim9 performance results on an 8p.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@
if (!vma_shareable(vma, addr))
return;
 
-   spin_lock(mapping-i_mmap_lock);
+   down_read(mapping-i_mmap_sem);
vma_prio_tree_foreach(svma, iter, mapping-i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@
put_page(virt_to_page(spte));
spin_unlock(mm-page_table_lock);
 out:
-   spin_unlock(mapping-i_mmap_lock);
+   up_read(mapping-i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@
pgoff = offset  PAGE_SHIFT;
 
i_size_write(inode, offset);
-   spin_lock(mapping-i_mmap_lock);
+   down_read(mapping-i_mmap_sem);
if (!prio_tree_empty(mapping-i_mmap))
hugetlb_vmtruncate_list(mapping-i_mmap, pgoff);
-   spin_unlock(mapping-i_mmap_lock);
+   up_read(mapping-i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@
INIT_LIST_HEAD(inode-i_devices);
INIT_RADIX_TREE(inode-i_data.page_tree, GFP_ATOMIC);
rwlock_init(inode-i_data.tree_lock);
-   spin_lock_init(inode-i_data.i_mmap_lock);
+   init_rwsem(inode-i_data.i_mmap_sem);
INIT_LIST_HEAD(inode-i_data.private_list);
spin_lock_init(inode-i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(inode-i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -503,7 +503,7 @@
unsigned inti_mmap_writable;/* count VM_SHARED mappings */
struct prio_tree_root   i_mmap; /* tree of private and shared 
mappings */
struct list_headi_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-   spinlock_t  i_mmap_lock;/* protect tree, count, list */
+   struct rw_semaphore i_mmap_sem; /* protect tree, count, list */
unsigned inttruncate_count; /* Cover race condition with 
truncate */
unsigned long   nrpages;/* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -716,7 +716,7 @@
struct address_space *check_mapping;/* Check page-mapping if set */
pgoff_t first_index;/* Lowest page-index to unmap 
*/
pgoff_t last_index; /* Highest page-index to unmap 
*/
-   spinlock_t *i_mmap_lock;/* For unmap_mapping_range: */
+   struct rw_semaphore *i_mmap_sem;/* For unmap_mapping_range: */
unsigned long truncate_count;   /* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -274,12 +274,12 @@
atomic_dec(inode-i_writecount);
 
/* insert tmp into the share list, just after mpnt */
-   spin_lock(file-f_mapping-i_mmap_lock);
+   down_write(file-f_mapping-i_mmap_sem);
tmp-vm_truncate_count = mpnt-vm_truncate_count;
flush_dcache_mmap_lock(file-f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file-f_mapping);
-   spin_unlock(file-f_mapping-i_mmap_lock);
+   up_write(file-f_mapping-i_mmap_sem);
}
 
/*
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -61,16 +61,16 @@
 /*
  * Lock ordering:
  *
- *  -i_mmap_lock  (vmtruncate)
+ *  -i_mmap_sem   (vmtruncate)
  *-private_lock   (__free_pte-__set_page_dirty_buffers)
  *  -swap_lock(exclusive_swap_page, others)
  *-mapping-tree_lock
  *
  *  -i_mutex
- *-i_mmap_lock(truncate-unmap_mapping_range)
+ *-i_mmap_sem 

Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

 paging), hence the slowdown. What you benchmarked is the write side,
 which is also the fast path when the system is heavily CPU bound. I've
 to say aim is a great benchmark to test this regression.

I am a bit surprised that brk performance is that important. There
may be other measurements that have to be made to assess how this
would impact a real load.






[kvm-devel] [PATCH 0 of 8] mmu notifiers #v10

2008-04-02 Thread Andrea Arcangeli
Hello,

this is the mmu notifier #v10. Patches 1 and 2 are the only difference between
this and EMM V2. The rest is the same as with Christoph's patches.

I think maximum priority should be given to merging patches 1 and 2
into -mm and, ASAP, into mainline.

Patches from 3 to 8 can go in -mm for testing but I'm not sure if we should
support sleep capable notifiers in mainline unless we make the VM locking
conditional to avoid overscheduling for extremely small critical sections in
the common case. I only rediffed Christoph's patches on top of the mmu
notifier patches.

KVM's current plans are to depend heavily on mmu notifiers for
swapping, to optimize the spte faults, and we need them for smp guest
ballooning with madvise(DONT_NEED) and other optimizations and
features.

Patches from 3 to 8 are Christoph's work ported on top of #v10 to make the
#v10 mmu notifiers sleep capable (at least supposedly). I didn't test the
scheduling, but I assume you'll quickly test XPMEM on top of this.



Re: [kvm-devel] EMM: Fixup return value handling of emm_notify()

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

 but anyway it's silly to be hardwired to such an interface that worst
 of all requires switch statements instead of proper pointer to
 functions and a fixed set of parameters and retval semantics for all
 methods.

The EMM API with a single callback is the simplest approach at this point. 
A common callback for all operations allows the driver to implement common 
entry and exit code as seen in XPMem.
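
To illustrate the point, here is a rough sketch of such a callback
with common entry and exit code (all of the example_* hooks are
made-up placeholders, not the real XPMem functions):

static int example_emm_callback(struct emm_notifier *e, struct mm_struct *mm,
                                enum emm_operation op,
                                unsigned long start, unsigned long end)
{
        int ret = 0;

        example_common_enter(e);                /* hypothetical: locking, stats */
        switch (op) {
        case emm_invalidate_start:
                ret = example_block_faults_and_flush(e, start, end);    /* hypothetical */
                break;
        case emm_invalidate_end:
                example_unblock_faults(e);      /* hypothetical */
                break;
        case emm_release:
                example_teardown(e, mm);        /* hypothetical */
                break;
        default:
                break;
        }
        example_common_exit(e);                 /* hypothetical */
        return ret;
}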

I guess we can complicate this more by switching to a different API
or adding additional emm_xxx() callbacks if need be, but I really want
to have a strong case for why this would be needed. There is the
danger of adding frills with special callbacks for this and that
situation that could make the notifier complicated and specific to a
certain usage scenario.

Having this generic simple interface will hopefully avoid such things.





[kvm-devel] [PATCH 5 of 8] We no longer abort unmapping in unmap vmas because we can reschedule while

2008-04-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1207159055 -7200
# Node ID 316e5b1e4bf388ef0198c91b3067ed1e4171d7f6
# Parent  3c3787c496cab1fc590ba3f97e7904bdfaab5375
We no longer abort unmapping in unmap_vmas because we can reschedule
while unmapping, since we are holding a semaphore. This allows moving
more of the tlb flushing into unmap_vmas, reducing code in various
places.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -723,8 +723,7 @@
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-   struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long 
start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -805,7 +805,6 @@
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -817,20 +816,13 @@
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-   struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
 {
@@ -838,7 +830,15 @@
unsigned long tlb_start = 0;/* For tlb_finish_mmu */
int tlb_start_valid = 0;
unsigned long start = start_addr;
-   int fullmm = (*tlbp)-fullmm;
+   int fullmm;
+   struct mmu_gather *tlb;
+   struct mm_struct *mm = vma-vm_mm;
+
+   mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
+   lru_add_drain();
+   tlb = tlb_gather_mmu(mm, 0);
+   update_hiwater_rss(mm);
+   fullmm = tlb-fullmm;
 
for ( ; vma  vma-vm_start  end_addr; vma = vma-vm_next) {
unsigned long end;
@@ -865,7 +865,7 @@
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
-   start = unmap_page_range(*tlbp, vma,
+   start = unmap_page_range(tlb, vma,
start, end, zap_work, details);
 
if (zap_work  0) {
@@ -873,13 +873,15 @@
break;
}
 
-   tlb_finish_mmu(*tlbp, tlb_start, start);
+   tlb_finish_mmu(tlb, tlb_start, start);
cond_resched();
-   *tlbp = tlb_gather_mmu(vma-vm_mm, fullmm);
+   tlb = tlb_gather_mmu(vma-vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+   tlb_finish_mmu(tlb, start_addr, end_addr);
+   mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start;   /* which is now the end (or restart) address */
 }
 
@@ -893,20 +895,10 @@
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
 {
-   struct mm_struct *mm = vma-vm_mm;
-   struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
 
-   lru_add_drain();
-   tlb = tlb_gather_mmu(mm, 0);
-   update_hiwater_rss(mm);
-   mmu_notifier_invalidate_range_start(mm, address, end);
-   end = unmap_vmas(tlb, vma, address, end, nr_accounted, details);
-   mmu_notifier_invalidate_range_end(mm, address, end);
-   if (tlb)
-   tlb_finish_mmu(tlb, address, end);
-   return end;
+   return 

[kvm-devel] [PATCH 8 of 8] This patch adds a lock ordering rule to avoid a potential deadlock when

2008-04-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1207159059 -7200
# Node ID f3f119118b0abd9c4624263ef388dc7230d937fe
# Parent  31fc23193bd039cc595fba1ca149a9715f7d0fb2
This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@
  *
  *  -i_mutex  (generic_file_buffered_write)
  *-mmap_sem   (fault_in_pages_readable-do_page_fault)
+ *
+ *When taking multiple mmap_sems, one should lock the lowest-addressed
+ *one first proceeding on up to the highest-addressed one.
  *
  *  -i_mutex
  *-i_alloc_sem (various)
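
For illustration, a hypothetical helper (not part of the patch) that
follows the rule the comment documents:

/* Take two mmap_sems without deadlocking: lowest-addressed one first. */
static void lock_two_mmap_sems(struct mm_struct *mm1, struct mm_struct *mm2)
{
        if (mm1 == mm2) {
                down_write(&mm1->mmap_sem);
                return;
        }
        if (mm1 > mm2) {
                struct mm_struct *tmp = mm1;

                mm1 = mm2;
                mm2 = tmp;
        }
        down_write(&mm1->mmap_sem);
        down_write_nested(&mm2->mmap_sem, SINGLE_DEPTH_NESTING);
}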



[kvm-devel] [PATCH 3 of 8] Move the tlb flushing into free_pgtables. The conversion of the locks

2008-04-02 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1207159010 -7200
# Node ID d880c227ddf345f5d577839d36d150c37b653bfd
# Parent  fe00cb9deeb31467396370c835cb808f4b85209a
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables(). Moving the tlb flushing into free_pgtables allows
sleeping in parts of free_pgtables().

This means that we do a tlb_finish_mmu() before freeing the page tables.
Strictly speaking there may not be the need to do another tlb flush after
freeing the tables. But it's the only way to free a series of page table
pages from the tlb list. And we do not want to call into the page allocator
for performance reasons. Aim9 numbers look okay after this patch.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -751,8 +751,8 @@
void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-   unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+   unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@
} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-   unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+   unsigned long ceiling)
 {
+   struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma-vm_next;
unsigned long addr = vma-vm_start;
@@ -286,8 +288,10 @@
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma-vm_end,
+   tlb = tlb_gather_mmu(vma-vm_mm, 0);
+   hugetlb_free_pgd_range(tlb, addr, vma-vm_end,
floor, next? next-vm_start: ceiling);
+   tlb_finish_mmu(tlb, addr, vma-vm_end);
} else {
/*
 * Optimization: gather nearby vmas into one call down
@@ -299,8 +303,10 @@
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
-   free_pgd_range(tlb, addr, vma-vm_end,
+   tlb = tlb_gather_mmu(vma-vm_mm, 0);
+   free_pgd_range(tlb, addr, vma-vm_end,
floor, next? next-vm_start: ceiling);
+   tlb_finish_mmu(tlb, addr, vma-vm_end);
}
vma = next;
}
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1751,9 +1751,9 @@
mmu_notifier_invalidate_range_start(mm, start, end);
unmap_vmas(tlb, vma, start, end, nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(tlb, vma, prev? prev-vm_end: FIRST_USER_ADDRESS,
+   tlb_finish_mmu(tlb, start, end);
+   free_pgtables(vma, prev? prev-vm_end: FIRST_USER_ADDRESS,
 next? next-vm_start: 0);
-   tlb_finish_mmu(tlb, start, end);
mmu_notifier_invalidate_range_end(mm, start, end);
 }
 
@@ -2050,8 +2050,8 @@
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(tlb, vma, 0, -1, nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+   free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
/*
 * Walk the list again, actually closing and freeing it,



Re: [kvm-devel] [PATCH 2 of 8] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

 diff --git a/mm/memory.c b/mm/memory.c
 --- a/mm/memory.c
 +++ b/mm/memory.c
 @@ -1626,9 +1626,10 @@
*/
   page_table = pte_offset_map_lock(mm, pmd, address,
ptl);
 - page_cache_release(old_page);
 + new_page = NULL;
   if (!pte_same(*page_table, orig_pte))
   goto unlock;
 + page_cache_release(old_page);
  
   page_mkwrite = 1;
   }

This is deferring frees and not moving the callouts. KVM specific? What 
exactly is this doing?

A significant portion of this seems to be undoing what the first patch 
did.



Re: [kvm-devel] EMM: Require single threadedness for registration.

2008-04-02 Thread Christoph Lameter
On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

 That would work for #v10 if I remove the invalidate_range_start from
 try_to_unmap_cluster, it can't work for EMM because you've
 emm_invalidate_start firing anywhere outside the context of the
 current task (even regular rmap code, not just nonlinear corner case
 will trigger the race). In short the single threaded approach would be

But in that case it will be firing for a callback to another mm_struct. 
The notifiers are bound to mm_structs and keep separate contexts.

 The requirement for invalidate_page is that the pte and linux tlb are
 flushed _before_ and the page is freed _after_ the invalidate_page
 method. that's not the case for _begin/_end. The page is freed well
 before _end runs, hence the need of _begin and to block the secondary
 mmu page fault during the vma-mangling operations.

You could flush in _begin and free on _end? I thought you were taking
a refcount on the page? You could drop the refcount only on _end to
ensure that the page does not go away before then.




Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 02:54:52PM -0700, Christoph Lameter wrote:
 On Wed, 2 Apr 2008, Andrea Arcangeli wrote:
 
   Hmmm... Okay that is one solution that would just require a BUG_ON in the 
   registration methods.
  
  Perhaps you didn't notice that this solution can't work if you call
  range_begin/end not in the current context and try_to_unmap_cluster
  does exactly that for both my patchset and yours. Missing an _end is
  ok, missing a _begin is never ok.
 
 If you look at the patch you will see a requirement of holding a 
 writelock on mmap_sem which will keep out get_user_pages().

I said try_to_unmap_cluster, not get_user_pages.

  CPU0                                      CPU1
  ----                                      ----
  try_to_unmap_cluster:
  emm_invalidate_start in EMM (or mmu_notifier_invalidate_range_start in #v10)
  walking the list by hand in EMM (or with hlist cleaner in #v10)
  xpmem method invoked
  schedule for a long while inside invalidate_range_start while skbs are sent
                                            gru registers
                                            synchronize_rcu (sorry useless now)
                                            single threaded, so taking a page fault
                                            secondary tlb instantiated
  xpmem method returns
  end of the list (didn't notice that it has to restart to flush the gru)
  zap pte
  free the page
                                            gru corrupts memory

CPU 1 was single threaded, CPU0 doesn't hold any mmap_sem or any other
lock that could ever serialize against the GRU as far as I can tell.

In general my #v10 solution mixing seqlock + rcu looks more robust
and allows multithreaded attachment of mmu notifiers as well. I could
have fixed it with the single threaded approach thanks to the fact
that the only place outside the mm->mmap_sem is try_to_unmap_cluster
for me, but it wasn't simple to convert, nor worth it, given nonlinear
isn't worth optimizing for (not even the core VM cares about
try_to_unmap_cluster, which is in fact the only place in the VM with
an O(N) complexity for each try_to_unmap call, where N is the size of
the mapping divided by page_size).



Re: [kvm-devel] [patch 5/9] Convert anon_vma lock to rw_sem and refcount

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 02:56:25PM -0700, Christoph Lameter wrote:
 I am a bit surprised that brk performance is that important. There may be 

I think it's not brk but fork that is being slowed down, did you
oprofile? AIM forks a lot... The write-side fast path generating the
overscheduling, I guess, is when the new vmas are created for the
child and queued in the parent anon_vma in O(1), so immediately; even
preempt-rt would be ok with it spinning and not scheduling, it's just
a list_add (much faster than schedule() indeed). Every time there's a
collision, when multiple children fork simultaneously and they all
try to queue in the same anon_vma, things will slow down.
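
For reference, the write side being benchmarked is essentially just
this (a sketch from memory of the current anon_vma_link; the patch
turns the spin_lock pair into down_write/up_write):

static void example_anon_vma_link(struct vm_area_struct *vma)
{
        struct anon_vma *anon_vma = vma->anon_vma;

        spin_lock(&anon_vma->lock);     /* becomes down_write() with the patch */
        list_add_tail(&vma->anon_vma_node, &anon_vma->head);
        spin_unlock(&anon_vma->lock);   /* becomes up_write() */
}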



Re: [kvm-devel] EMM: Require single threadedness for registration.

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 03:06:19PM -0700, Christoph Lameter wrote:
 On Thu, 3 Apr 2008, Andrea Arcangeli wrote:
 
  That would work for #v10 if I remove the invalidate_range_start from
  try_to_unmap_cluster, it can't work for EMM because you've
  emm_invalidate_start firing anywhere outside the context of the
  current task (even regular rmap code, not just nonlinear corner case
  will trigger the race). In short the single threaded approach would be
 
 But in that case it will be firing for a callback to another mm_struct. 
 The notifiers are bound to mm_structs and keep separate contexts.

Why can't it fire on the mm_struct where GRU just registered? That
mm_struct existed way before GRU registered, and the VM is free to
unmap it w/o mmap_sem if there is any memory pressure.

 You could flush in _begin and free on _end? I thought you are taking a 
 refcount on the page? You can drop the refcount only on _end to ensure 
 that the page does not go away before.

we're going to lock + flush on _begin and unlock on _end w/o
refcounting, to microoptimize. Freeing is done by
unmap_vmas/madvise/munmap at will. That's a very slow path; inflating
the balloon is not problematic. But invalidate_page makes it possible
to avoid blocking page faults during swapping, so minor faults can
happen and refresh the pte young bits etc... When the VM unmaps the
page while holding the page pin, there's no race, and that's where
invalidate_page is being used to generate lower invalidation overhead.
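
To illustrate the scheme (my sketch, not the actual kvm code;
sec_mmu_flush_range is a hypothetical driver hook): the secondary mmu
fault path simply refuses to instantiate mappings while any
_begin.._end section is open, so no page refcount is needed.

struct sec_mmu {
        spinlock_t lock;
        int invalidate_count;   /* how many _begin.._end sections are open */
};

static void sec_mmu_range_begin(struct sec_mmu *s,
                                unsigned long start, unsigned long end)
{
        spin_lock(&s->lock);
        s->invalidate_count++;
        sec_mmu_flush_range(s, start, end);     /* hypothetical secondary tlb flush */
        spin_unlock(&s->lock);
}

static void sec_mmu_range_end(struct sec_mmu *s,
                              unsigned long start, unsigned long end)
{
        spin_lock(&s->lock);
        s->invalidate_count--;  /* the fault path maps nothing while this is > 0 */
        spin_unlock(&s->lock);
}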



Re: [kvm-devel] [PATCH 1 of 8] Core of mmu notifiers

2008-04-02 Thread Christoph Lameter
On Wed, 2 Apr 2008, Andrea Arcangeli wrote:

 + void (*invalidate_page)(struct mmu_notifier *mn,
 + struct mm_struct *mm,
 + unsigned long address);
 +
 + void (*invalidate_range_start)(struct mmu_notifier *mn,
 +struct mm_struct *mm,
 +unsigned long start, unsigned long end);
 + void (*invalidate_range_end)(struct mmu_notifier *mn,
 +  struct mm_struct *mm,
 +  unsigned long start, unsigned long end);

Still two methods ...

 +void __mmu_notifier_release(struct mm_struct *mm)
 +{
 + struct mmu_notifier *mn;
 + unsigned seq;
 +
 + seq = read_seqbegin(&mm->mmu_notifier_lock);
 + while (unlikely(!hlist_empty(&mm->mmu_notifier_list))) {
 + mn = hlist_entry(mm->mmu_notifier_list.first,
 +  struct mmu_notifier,
 +  hlist);
 + hlist_del(&mn->hlist);
 + if (mn->ops->release)
 + mn->ops->release(mn, mm);
 + BUG_ON(read_seqretry(&mm->mmu_notifier_lock, seq));
 + }
 +}

seqlock just taken for checking if everything is ok?

 +
 +/*
 + * If no young bitflag is supported by the hardware, -clear_flush_young can
 + * unmap the address and return 1 or 0 depending if the mapping previously
 + * existed or not.
 + */
 +int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 + unsigned long address)
 +{
 + struct mmu_notifier *mn;
 + struct hlist_node *n;
 + int young = 0;
 + unsigned seq;
 +
 + seq = read_seqbegin(&mm->mmu_notifier_lock);
 + do {
 + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
 + if (mn->ops->clear_flush_young)
 + young |= mn->ops->clear_flush_young(mn, mm,
 + address);
 + }
 + } while (read_seqretry(&mm->mmu_notifier_lock, seq));
 +

The critical section could be run multiple times for one callback which 
could result in multiple callbacks to clear the young bit. Guess not that 
big of an issue?

 +void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 +   unsigned long address)
 +{
 + struct mmu_notifier *mn;
 + struct hlist_node *n;
 + unsigned seq;
 +
 + seq = read_seqbegin(&mm->mmu_notifier_lock);
 + do {
 + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
 + if (mn->ops->invalidate_page)
 + mn->ops->invalidate_page(mn, mm, address);
 + }
 + } while (read_seqretry(&mm->mmu_notifier_lock, seq));
 +}

Ok. Retry would try to invalidate the page a second time which is not a 
problem unless you would drop the refcount or make other state changes 
that require correspondence with mapping. I guess this is the reason 
that you stopped adding a refcount?

 +void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 +   unsigned long start, unsigned long end)
 +{
 + struct mmu_notifier *mn;
 + struct hlist_node *n;
 + unsigned seq;
 +
 + seq = read_seqbegin(&mm->mmu_notifier_lock);
 + do {
 + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
 + if (mn->ops->invalidate_range_start)
 + mn->ops->invalidate_range_start(mn, mm,
 + start, end);
 + }
 + } while (read_seqretry(&mm->mmu_notifier_lock, seq));
 +}

Multiple invalidate_range_starts on the same range? This means the driver 
needs to be able to deal with the situation and ignore the repeated 
call?

 +void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 +   unsigned long start, unsigned long end)
 +{
 + struct mmu_notifier *mn;
 + struct hlist_node *n;
 + unsigned seq;
 +
 + seq = read_seqbegin(&mm->mmu_notifier_lock);
 + do {
 + hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_list, hlist) {
 + if (mn->ops->invalidate_range_end)
 + mn->ops->invalidate_range_end(mn, mm,
 +   start, end);
 + }
 + } while (read_seqretry(&mm->mmu_notifier_lock, seq));
 +}

Retry can lead to multiple invalidate_range callbacks with the same 
parameters? Driver needs to ignore if the range is already clear?


Re: [kvm-devel] [patch 1/9] EMM Notifier: The notifier calls

2008-04-02 Thread Christoph Lameter
On Thu, 3 Apr 2008, Andrea Arcangeli wrote:

 I said try_to_unmap_cluster, not get_user_pages.
 
   CPU0CPU1
   try_to_unmap_cluster:
   emm_invalidate_start in EMM (or mmu_notifier_invalidate_range_start in #v10)
   walking the list by hand in EMM (or with hlist cleaner in #v10)
   xpmem method invoked
   schedule for a long while inside invalidate_range_start while skbs are sent
   gru registers
   synchronize_rcu (sorry useless now)

All of this would be much easier if you could stop the drivel. The sync 
rcu was for an earlier release of the mmu notifier. Why the sniping?

   single threaded, so taking a page fault
   secondary tlb instantiated

The driver must not allow faults to occur between start and end. The
trouble here is that GRU and xpmem are mixed. If CPU0 had been running
GRU instead of XPMEM then the fault would not have occurred, because
the GRU would have noticed that a range op is active. If both systems
had been running xpmem then the same would have worked.
 
I guess this means that an address space cannot reliably be
registered with multiple subsystems if some of those do not take a
refcount. If all drivers were required to take a refcount then this
would also not occur.
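
For reference, the refcount discipline referred to here is roughly the
following on the secondary fault path (a sketch using the current
get_user_pages() signature, error handling trimmed):

static struct page *example_pin_page(struct mm_struct *mm, unsigned long addr)
{
        struct page *page;
        int ret;

        down_read(&mm->mmap_sem);
        ret = get_user_pages(current, mm, addr, 1, 1, 0, &page, NULL);
        up_read(&mm->mmap_sem);
        if (ret != 1)
                return NULL;
        /* this reference keeps the memory from being reused until the
         * driver drops it after tearing down its own mapping */
        return page;
}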

 In general my #v10 solution mixing seqlock + rcu looks more robust and
 allows multithreaded attachment of mmu notifers as well. I could have

Well it's easy to say that if no one else has looked at it yet. I
expressed some concerns in reply to your post of #v10.




[kvm-devel] [patch 0/3] separate thread for IO handling V3

2008-04-02 Thread Marcelo Tosatti
This version fixes the vmdk problems found by the
regression testing.

Dor, regarding the option to disable the IO thread, it would require
duplicating most of the changed code. For now I believe it's better
to get the patch into a state where it's considered stable enough for
inclusion.

Please rerun the regression tests. Thanks.

-- 




[kvm-devel] [patch 1/3] QEMU/KVM: separate thread for IO handling

2008-04-02 Thread Marcelo Tosatti
Move IO processing from vcpu0 to a dedicated thread.

This removes load on vcpu0 by allowing better cache locality and also
improves latency.

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

Index: kvm-userspace.io/qemu/qemu-kvm.c
===
--- kvm-userspace.io.orig/qemu/qemu-kvm.c
+++ kvm-userspace.io/qemu/qemu-kvm.c
@@ -28,6 +28,8 @@ kvm_context_t kvm_context;
 
 extern int smp_cpus;
 
+static int qemu_kvm_reset_requested;
+
 pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
 __thread struct vcpu_info *vcpu;
@@ -38,6 +40,7 @@ struct qemu_kvm_signal_table {
 };
 
 static struct qemu_kvm_signal_table io_signal_table;
+static struct qemu_kvm_signal_table vcpu_signal_table;
 
 #define SIG_IPI (SIGRTMIN+4)
 
@@ -51,6 +54,8 @@ struct vcpu_info {
 int stopped;
 } vcpu_info[256];
 
+pthread_t io_thread;
+
 static inline unsigned long kvm_get_thread_id(void)
 {
 return syscall(SYS_gettid);
@@ -67,12 +72,19 @@ static void sig_ipi_handler(int n)
 
 void kvm_update_interrupt_request(CPUState *env)
 {
-if (env  vcpu  env != vcpu-env) {
-   if (vcpu_info[env-cpu_index].signalled)
-   return;
-   vcpu_info[env-cpu_index].signalled = 1;
-   if (vcpu_info[env-cpu_index].thread)
-   pthread_kill(vcpu_info[env-cpu_index].thread, SIG_IPI);
+int signal = 0;
+
+if (env) {
+if (!vcpu)
+signal = 1;
+if (vcpu  env != vcpu-env  !vcpu_info[env-cpu_index].signalled)
+signal = 1;
+
+if (signal) {
+vcpu_info[env-cpu_index].signalled = 1;
+if (vcpu_info[env-cpu_index].thread)
+pthread_kill(vcpu_info[env-cpu_index].thread, SIG_IPI);
+}
 }
 }
 
@@ -105,7 +117,7 @@ static void post_kvm_run(void *opaque, i
 
 static int pre_kvm_run(void *opaque, int vcpu)
 {
-CPUState *env = cpu_single_env;
+CPUState *env = qemu_kvm_cpu_env(vcpu);
 
 kvm_arch_pre_kvm_run(opaque, vcpu);
 
@@ -151,7 +163,8 @@ static int has_work(CPUState *env)
 return kvm_arch_has_work(env);
 }
 
-static int kvm_eat_signal(CPUState *env, int timeout)
+static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
+  int timeout)
 {
 struct timespec ts;
 int r, e, ret = 0;
@@ -160,12 +173,12 @@ static int kvm_eat_signal(CPUState *env,
 
 ts.tv_sec = timeout / 1000;
 ts.tv_nsec = (timeout % 1000) * 100;
-r = sigtimedwait(io_signal_table.sigset, siginfo, ts);
+r = sigtimedwait(waitset-sigset, siginfo, ts);
 if (r == -1  (errno == EAGAIN || errno == EINTR)  !timeout)
return 0;
 e = errno;
 pthread_mutex_lock(qemu_mutex);
-if (vcpu)
+if (env  vcpu)
 cpu_single_env = vcpu-env;
 if (r == -1  !(errno == EAGAIN || errno == EINTR)) {
printf(sigtimedwait: %s\n, strerror(e));
@@ -181,7 +194,7 @@ static int kvm_eat_signal(CPUState *env,
 if (env  vcpu_info[env-cpu_index].stop) {
vcpu_info[env-cpu_index].stop = 0;
vcpu_info[env-cpu_index].stopped = 1;
-   pthread_kill(vcpu_info[0].thread, SIG_IPI);
+   pthread_kill(io_thread, SIGUSR1);
 }
 pthread_mutex_unlock(qemu_mutex);
 
@@ -192,24 +205,16 @@ static int kvm_eat_signal(CPUState *env,
 static void kvm_eat_signals(CPUState *env, int timeout)
 {
 int r = 0;
+struct qemu_kvm_signal_table *waitset = vcpu_signal_table;
 
-while (kvm_eat_signal(env, 0))
+while (kvm_eat_signal(waitset, env, 0))
r = 1;
 if (!r  timeout) {
-   r = kvm_eat_signal(env, timeout);
+   r = kvm_eat_signal(waitset, env, timeout);
if (r)
-   while (kvm_eat_signal(env, 0))
+   while (kvm_eat_signal(waitset, env, 0))
;
 }
-/*
- * we call select() even if no signal was received, to account for
- * for which there is no signal handler installed.
- */
-pthread_mutex_lock(qemu_mutex);
-cpu_single_env = vcpu-env;
-if (env-cpu_index == 0)
-   main_loop_wait(0);
-pthread_mutex_unlock(qemu_mutex);
 }
 
 static void kvm_main_loop_wait(CPUState *env, int timeout)
@@ -225,29 +230,29 @@ static int all_threads_paused(void)
 {
 int i;
 
-for (i = 1; i  smp_cpus; ++i)
+for (i = 0; i  smp_cpus; ++i)
if (vcpu_info[i].stopped)
return 0;
 return 1;
 }
 
-static void pause_other_threads(void)
+static void pause_all_threads(void)
 {
 int i;
 
-for (i = 1; i  smp_cpus; ++i) {
+for (i = 0; i  smp_cpus; ++i) {
vcpu_info[i].stop = 1;
pthread_kill(vcpu_info[i].thread, SIG_IPI);
 }
 while (!all_threads_paused())
-   kvm_eat_signals(vcpu-env, 0);
+   kvm_eat_signal(io_signal_table, NULL, 1000);
 }
 
-static void resume_other_threads(void)
+static void resume_all_threads(void)
 {
 int i;
 
-for (i = 1; i  smp_cpus; ++i) {
+for (i = 0; i  smp_cpus; ++i) {

[kvm-devel] [patch 3/3] QEMU/KVM: notify IO thread of pending bhs

2008-04-02 Thread Marcelo Tosatti
This fixes the slow vmdk issue found in the regression
tests.

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

Index: kvm-userspace.io/qemu/qemu-kvm.c
===
--- kvm-userspace.io.orig/qemu/qemu-kvm.c
+++ kvm-userspace.io/qemu/qemu-kvm.c
@@ -411,6 +411,12 @@ int kvm_init_ap(void)
 return 0;
 }
 
+void qemu_kvm_notify_work(void)
+{
+if (io_thread)
+pthread_kill(io_thread, SIGUSR1);
+}
+
 /*
  * The IO thread has all signals that inform machine events
  * blocked (io_signal_table), so it won't get interrupted
Index: kvm-userspace.io/qemu/qemu-kvm.h
===
--- kvm-userspace.io.orig/qemu/qemu-kvm.h
+++ kvm-userspace.io/qemu/qemu-kvm.h
@@ -60,6 +60,8 @@ void qemu_kvm_aio_wait_start(void);
 void qemu_kvm_aio_wait(void);
 void qemu_kvm_aio_wait_end(void);
 
+void qemu_kvm_notify_work(void);
+
 void kvm_tpr_opt_setup();
 void kvm_tpr_access_report(CPUState *env, uint64_t rip, int is_write);
 int handle_tpr_access(void *opaque, int vcpu,
Index: kvm-userspace.io/qemu/vl.c
===
--- kvm-userspace.io.orig/qemu/vl.c
+++ kvm-userspace.io/qemu/vl.c
@@ -7588,6 +7588,8 @@ void qemu_bh_schedule(QEMUBH *bh)
 if (env) {
 cpu_interrupt(env, CPU_INTERRUPT_EXIT);
 }
+if (kvm_enabled())
+qemu_kvm_notify_work();
 }
 
 void qemu_bh_cancel(QEMUBH *bh)

-- 




[kvm-devel] [patch 2/3] QEMU/KVM: add function to handle signals

2008-04-02 Thread Marcelo Tosatti
SIGUSR1 has no handler, and the SIGUSR2 one does nothing useful
anymore (that's also true for SIGIO on the tap fd, which runs
host_alarm_handler unnecessarily).

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]

Index: kvm-userspace.io/qemu/qemu-kvm.c
===
--- kvm-userspace.io.orig/qemu/qemu-kvm.c
+++ kvm-userspace.io/qemu/qemu-kvm.c
@@ -163,13 +163,30 @@ static int has_work(CPUState *env)
 return kvm_arch_has_work(env);
 }
 
+static int kvm_process_signal(int si_signo)
+{
+struct sigaction sa;
+
+switch (si_signo) {
+case SIGUSR2:
+pthread_cond_signal(qemu_aio_cond);
+break;
+case SIGALRM:
+case SIGIO:
+sigaction(si_signo, NULL, sa);
+sa.sa_handler(si_signo);
+break;
+}
+
+return 1;
+}
+
 static int kvm_eat_signal(struct qemu_kvm_signal_table *waitset, CPUState *env,
   int timeout)
 {
 struct timespec ts;
 int r, e, ret = 0;
 siginfo_t siginfo;
-struct sigaction sa;
 
 ts.tv_sec = timeout / 1000;
 ts.tv_nsec = (timeout % 1000) * 100;
@@ -184,13 +201,9 @@ static int kvm_eat_signal(struct qemu_kv
printf(sigtimedwait: %s\n, strerror(e));
exit(1);
 }
-if (r != -1) {
-   sigaction(siginfo.si_signo, NULL, sa);
-   sa.sa_handler(siginfo.si_signo);
-   if (siginfo.si_signo == SIGUSR2)
-   pthread_cond_signal(qemu_aio_cond);
-   ret = 1;
-}
+if (r != -1)
+ret = kvm_process_signal(siginfo.si_signo);
+
 if (env  vcpu_info[env-cpu_index].stop) {
vcpu_info[env-cpu_index].stop = 0;
vcpu_info[env-cpu_index].stopped = 1;

-- 




Re: [kvm-devel] [02/17][PATCH] Implement smp_call_function_mask for ia64 - V8

2008-04-02 Thread Luck, Tony
  Have you looked at Jens Axboe's patches to make all this stuff a lot 
  more arch-common?

 Nope, do you have a pointer?

Check your favourite archive for this Subject line:

 Generic smp_call_function(), improvements, and  smp_call_function_single()

-Tony



Re: [kvm-devel] [PATCH 1 of 8] Core of mmu notifiers

2008-04-02 Thread Andrea Arcangeli
On Wed, Apr 02, 2008 at 03:34:01PM -0700, Christoph Lameter wrote:
 Still two methods ...

Yes, invalidate_page is called with the core VM holding a reference
on the page and _after_ the tlb flush. The invalidate_end is called
after the page has been freed already and after the tlb flush. They
have different semantics, and with invalidate_page there's no need to
block the kvm fault handler. But invalidate_page is only the most
efficient for operations that aren't creating holes in the vma; for
the rest invalidate_range_start/end provides the best performance by
reducing the number of tlb flushes.

 seqlock just taken for checking if everything is ok?

Exactly.
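
The write side of the registration is then trivial; roughly (a
simplified sketch, not the exact #v10 code):

void mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
{
        write_seqlock(&mm->mmu_notifier_lock);
        hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_list);
        write_sequnlock(&mm->mmu_notifier_lock);
        /*
         * Any method whose read section overlapped this write fails
         * read_seqretry() and walks the list again, so the new notifier
         * cannot miss an invalidate that raced with registration.
         */
}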

 The critical section could be run multiple times for one callback which 
 could result in multiple callbacks to clear the young bit. Guess not that 
 big of an issue?

Yes, that's ok.

 Ok. Retry would try to invalidate the page a second time which is not a 
 problem unless you would drop the refcount or make other state changes 
 that require correspondence with mapping. I guess this is the reason 
 that you stopped adding a refcount?

The current patch using mmu notifiers is already robust against
multiple invalidates. The refcount represents a spte mapping; if we
already invalidated it, the spte will be non-present and there's no
page to unpin. The removal of the refcount is only a
microoptimization.

 Multiple invalidate_range_starts on the same range? This means the driver 
 needs to be able to deal with the situation and ignore the repeated 
 call?

The driver would need to store current->pid in a list and remove it
in range_stop. And range_stop would need to do nothing at all if the
pid isn't found in the list.
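
Something like this sketch (illustrative only) is what I mean by that
bookkeeping:

struct open_range {
        struct list_head list;
        pid_t pid;
};

static LIST_HEAD(open_ranges);
static DEFINE_SPINLOCK(open_ranges_lock);

static void example_range_start_bookkeeping(void)
{
        struct open_range *r = kmalloc(sizeof(*r), GFP_ATOMIC);

        if (!r)
                return;         /* a real driver couldn't just drop this */
        r->pid = current->pid;
        spin_lock(&open_ranges_lock);
        list_add(&r->list, &open_ranges);
        spin_unlock(&open_ranges_lock);
}

static void example_range_end_bookkeeping(void)
{
        struct open_range *r, *found = NULL;

        spin_lock(&open_ranges_lock);
        list_for_each_entry(r, &open_ranges, list) {
                if (r->pid == current->pid) {
                        found = r;
                        break;
                }
        }
        if (found)
                list_del(&found->list);
        spin_unlock(&open_ranges_lock);
        kfree(found);           /* an _end whose _start was never seen is ignored */
}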

But thinking more I'm not convinced the driver is safe by ignoring a
range_end that runs before range_begin (pid not found in the list).
And I don't see a clear way to fix it, either internally to the
device driver or externally. So the repeated call is easy to handle
for the driver. What is not trivial is to block the secondary page
faults when mmu_notifier_register happens in the middle of a
range_start/end critical section. sptes can be established in between
range_start/_end and that shouldn't happen. So the core problem
remains how to handle mmu_notifier_register happening in the middle
of _range_start/_end; dismissing it as a job for the driver doesn't
seem feasible (you have the same problem with EMM of course).

 Retry can lead to multiple invalidate_range callbacks with the same 
 parameters? Driver needs to ignore if the range is already clear?

Mostly covered above.




Re: [kvm-devel] [PATCH 1 of 8] Core of mmu notifiers

2008-04-02 Thread Christoph Lameter
Thinking about this adventurous locking some more: I think you are 
misunderstanding what a seqlock is. It is *not* a spinlock.

The critical read section, with the reading of a version before and
after, gives you access to a certain version of memory as it is or
was some time ago (caching effect). It does not mean that the current
state of memory is fixed, and neither does it provide synchronization
when an item is added to the list.

So it could be that you are traversing a list that is missing one item 
because it is not visible to this processor yet.

You may just see a state from the past. I would think that you will need a 
real lock in order to get the desired effect.




Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2

2008-04-02 Thread Liu, Eric E
David Abrahams wrote:
 on Wed Apr 02 2008, Avi Kivity avi-AT-qumranet.com wrote:
 
 David Abrahams wrote:
 With the title combination, the guest takes nearly 100% of my real
 CPU time and still only sees one CPU.  Is this a known problem, and
 does it have a known solution? 
 
 
 
 Can you send the output of 'kvm_stat -1'?
 
 I don't appear to have a kvm_stat executable anywhere.  How do I
 acquire that?
Hi David,
Do you have the kvm userspace code? If you do, you can find it at
kvm-userspace/kvm_stat. You can also get the code with git clone
git://git.kernel.org/pub/scm/virt/kvm/kvm-userspace.git.




[kvm-devel] EMM: disable other notifiers before register and unregister

2008-04-02 Thread Christoph Lameter
Ok, let's forget about the single threaded thing to solve the
registration races. As Andrea pointed out this still has issues with
other subscribed subsystems (and also try_to_unmap). We could do
something like what stop_machine_run does: first disable all running
subsystems before registering a new one.

Maybe this is a possible solution.


Subject: EMM: disable other notifiers before register and unregister

As Andrea has pointed out: There are races during registration if other
subsystem notifiers are active while we register a callback.

Solve that issue by adding two new notifiers:

emm_stop
Stops the notifier operations. Notifier must block on
invalidate_start and emm_referenced from this point on.
If an invalidate_start has not been completed by a call
to invalidate_end then the driver must wait until the
operation is complete before returning.

emm_start
Restart notifier operations.

Before registration all other subscribed subsystems are stopped.
Then the new subsystem is subscribed and things can get running
without consistency issues.

Subsystems are restarted after the lists have been updated.

This also works for unregistering. If we can get all subsystems
to stop then we can also reliably unregister a subsystem. So
provide that callback.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

---
 include/linux/rmap.h |   10 +++---
 mm/rmap.c|   30 ++
 2 files changed, 37 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/rmap.h
===
--- linux-2.6.orig/include/linux/rmap.h 2008-04-02 18:16:07.906032549 -0700
+++ linux-2.6/include/linux/rmap.h  2008-04-02 18:17:10.291070009 -0700
@@ -94,7 +94,9 @@ enum emm_operation {
emm_release,/* Process exiting, */
emm_invalidate_start,   /* Before the VM unmaps pages */
emm_invalidate_end, /* After the VM unmapped pages */
-   emm_referenced  /* Check if a range was referenced */
+   emm_referenced, /* Check if a range was referenced */
+   emm_stop,   /* Halt all faults/invalidate_starts */
+   emm_start,  /* Restart operations */
 };
 
 struct emm_notifier {
@@ -126,13 +128,15 @@ static inline int emm_notify(struct mm_s
 
 /*
  * Register a notifier with an mm struct. Release occurs when the process
- * terminates by calling the notifier function with emm_release.
+ * terminates by calling the notifier function with emm_release or when
+ * emm_notifier_unregister is called.
  *
  * Must hold the mmap_sem for write.
  */
 extern void emm_notifier_register(struct emm_notifier *e,
struct mm_struct *mm);
-
+extern void emm_notifier_unregister(struct emm_notifier *e,
+   struct mm_struct *mm);
 
 /*
  * Called from mm/vmscan.c to handle paging out
Index: linux-2.6/mm/rmap.c
===
--- linux-2.6.orig/mm/rmap.c2008-04-02 18:16:09.378057062 -0700
+++ linux-2.6/mm/rmap.c 2008-04-02 18:16:10.710079201 -0700
@@ -289,16 +289,46 @@ void emm_notifier_release(struct mm_stru
 /* Register a notifier */
 void emm_notifier_register(struct emm_notifier *e, struct mm_struct *mm)
 {
+   /* Bring all other notifiers into a quiescent state */
+   emm_notify(mm, emm_stop, 0, TASK_SIZE);
+
e->next = mm->emm_notifier;
+
/*
 * The update to emm_notifier (e->next) must be visible
 * before the pointer becomes visible.
 * rcu_assign_pointer() does exactly what we need.
 */
rcu_assign_pointer(mm->emm_notifier, e);
+
+   /* Continue notifiers */
+   emm_notify(mm, emm_start, 0, TASK_SIZE);
 }
 EXPORT_SYMBOL_GPL(emm_notifier_register);
 
+/* Unregister a notifier */
+void emm_notifier_unregister(struct emm_notifier *e, struct mm_struct *mm)
+{
+   struct emm_notifier *p;
+
+   emm_notify(mm, emm_stop, 0, TASK_SIZE);
+
+   p = mm->emm_notifier;
+   if (e == p)
+   mm->emm_notifier = e->next;
+   else {
+   while (p->next != e)
+   p = p->next;
+
+   p->next = e->next;
+   }
+   e->next = mm->emm_notifier;
+
+   emm_notify(mm, emm_start, 0, TASK_SIZE);
+   e->callback(e, mm, emm_release, 0, TASK_SIZE);
+}
+EXPORT_SYMBOL_GPL(emm_notifier_unregister);
+
 /*
  * Perform a callback
  *




Re: [kvm-devel] Ubuntu Gutsy host / XP guest / -smp 2

2008-04-02 Thread Jim Paris
David Abrahams wrote:
  What HAL do you see in device manager?
 
 In the guest?  I don't see anything that obviously appears to be HAL in
 device manager.  Could you be more specific?

Hi,

He's referring to the Windows computer type -- the entry under
computer in device manager.  See
  http://kvm.qumranet.com/kvmwiki/Windows_ACPI_Workaround

-jim



[kvm-devel] [Patch][00/18] kvm-ia64 for kernel V10

2008-04-02 Thread Zhang, Xiantao
Compared with V9, this just fixes indentation issues in patch 12. I
put the patchset in
git://git.kernel.org/pub/scm/linux/kernel/git/xiantao/kvm-ia64.git
kvm-ia64-mc10. Please help review it.
Especially, the first two patches (the TR management patch and the
smp_call_function_mask patch) need Tony's review and Ack. Thanks:-)
Xiantao 
