Re: [PATCH] do not keep interrupt window closed by sti in real mode

2009-04-08 Thread H. Peter Anvin
Avi Kivity wrote:
>
> I'm guessing the problem is due to the second instruction.  We don't
> clear the 'blocked by interrupt shadow' flag when we emulate, which
> extends interrupt shadow by one more instruction.  If the instruction
> sequence is 'sti hlt' we end in an inconsistent state.
>

Ah, and since we're in real mode, we have to emulate everything (at
least on some hardware), right?  So we really do need to clear the
interrupt shadow bit in the interpreter... I don't see a way around that.

Otherwise not just STI but MOV SS shadows will break, and in real mode
MOV SS shadow is crucial.

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [libvirt] Re: [Qemu-devel] Changing the QEMU svn VERSION string

2009-04-08 Thread Gerd Hoffmann

On 04/07/09 19:13, Jamie Lokier wrote:
> Anthony Liguori wrote:
>> I still think libvirt should work with versions of QEMU/KVM built from
>> svn/git though.  I think the only way to do that is for libvirt to relax
>> their version checks to accommodate suffixes in the form
>> major.minor.stable-foo.
>
> Ok, but try to stick to a well-defined rule about what suffix means
> later or earlier.  In package managers, 1.2.3-rc1 is typically
> seen as a later version than 1.2.3 purely due to syntax.

Fedora typically handles this using a leading zero in the 'release'
component for pre-final versions, like this: app-1.2.3-0.rc1.fc11
(rc/beta) and app-1.2.3-1.fc11 (final).  Likewise for snapshots:
app-1.2.3-0.svn${date}.fc11

> If you're
> consistently meaning 0.11.0-rc1 is earlier than 0.11.0 (final),
> that might need to be encoded in libvirt and other wrappers, if they
> have any fine-grained version sensitivity such as command line
> changes or bug workarounds.

libvirt scans the help text to figure out which features are present
(checking for the -drive and -uuid command-line switches, for example).


cheers,
  Gerd
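
The leading-zero convention described above can be sketched roughly as follows. This is a simplified illustration and not RPM's actual comparison algorithm: `split_key` and `nvr_key` are hypothetical helpers, the snapshot date is made up, and the only rule assumed is that digit runs compare numerically and sort after alphabetic runs.

```python
import re

def split_key(s):
    """Split a version or release string into comparable chunks.

    Each chunk becomes a tuple so that digit runs compare numerically
    and always sort after alphabetic runs (a simplifying assumption).
    """
    parts = re.findall(r"\d+|[a-zA-Z]+", s)
    return [(1, int(p), "") if p.isdigit() else (0, 0, p) for p in parts]

def nvr_key(version, release):
    """Sort key: compare the version first, then the release."""
    return (split_key(version), split_key(release))

builds = [
    ("1.2.3", "1.fc11"),              # final
    ("1.2.3", "0.rc1.fc11"),          # release candidate
    ("1.2.3", "0.svn20090408.fc11"),  # snapshot (illustrative date)
]
ordered = sorted(builds, key=lambda b: nvr_key(*b))

# With the suffix glued onto the version instead, the candidate
# looks newer than the final release, which is the problem Jamie raises.
naive_rc_is_newer = split_key("1.2.3") < split_key("1.2.3-rc1")
```

Because the pre-final builds carry a release that starts with 0, both sort before the final build without any special knowledge of what "rc" or "svn" means.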



Re: [PATCH] do not keep interrupt window closed by sti in real mode

2009-04-08 Thread Avi Kivity

H. Peter Anvin wrote:
> Avi Kivity wrote:
>> I'm guessing the problem is due to the second instruction.  We don't
>> clear the 'blocked by interrupt shadow' flag when we emulate, which
>> extends interrupt shadow by one more instruction.  If the instruction
>> sequence is 'sti hlt' we end in an inconsistent state.
>
> Ah, and since we're in real mode, we have to emulate everything (at
> least on some hardware), right?

Well, not everything.  We use vm86 mode in the guest to emulate real
mode.  Of course that doesn't support all instructions, so we emulate
these.  Unfortunately it also doesn't support big real mode.

> So we really do need to clear the
> interrupt shadow bit in the interpreter... I don't see a way around that.

Yes.

> Otherwise not just STI but MOV SS shadows will break, and in real mode
> MOV SS shadow is crucial.

'mov ss' executes natively.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
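
The failure mode discussed in this thread can be modelled as a toy state machine. This is only a sketch under stated assumptions, not KVM code: `ToyVcpu` and its methods are hypothetical, and the two bit values are chosen for illustration (they stand in for the STI and MOV SS blocking bits of the guest interruptibility state).

```python
STI_BLOCKING = 0x1     # stand-in for "blocking by STI"
MOV_SS_BLOCKING = 0x2  # stand-in for "blocking by MOV SS"
SHADOW_MASK = STI_BLOCKING | MOV_SS_BLOCKING

class ToyVcpu:
    """Hypothetical model of the interrupt-shadow bookkeeping."""

    def __init__(self):
        self.interruptibility = 0
        self.halted = False

    def run_native_sti(self):
        # Hardware sets the shadow when the guest executes STI natively.
        self.interruptibility |= STI_BLOCKING

    def emulate_hlt(self, clear_shadow):
        # The emulator retires one instruction on the guest's behalf.
        # The shadow only covers one instruction, so it should end here.
        if clear_shadow:
            self.interruptibility &= ~SHADOW_MASK
        self.halted = True

    def can_inject_interrupt(self):
        return (self.interruptibility & SHADOW_MASK) == 0

def sti_hlt(clear_shadow):
    """Run 'sti; hlt' and report whether an interrupt can wake the guest."""
    vcpu = ToyVcpu()
    vcpu.run_native_sti()
    vcpu.emulate_hlt(clear_shadow)
    return vcpu.can_inject_interrupt()
```

In this model `sti_hlt(False)` leaves a halted vcpu that cannot accept interrupts, which is the inconsistent state described above, while `sti_hlt(True)` does not.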



[ kvm-Bugs-2741742 ] debian-20090117-kfreebsd-i386 installation fails on kvm > 80

2009-04-08 Thread SourceForge.net
Bugs item #2741742, was opened at 2009-04-07 22:37
Message generated for change (Comment added) made by scoof
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2741742&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Andreas Jacobsen (scoof)
Assigned to: Nobody/Anonymous (nobody)
Summary: debian-20090117-kfreebsd-i386 installation fails on kvm > 80

Initial Comment:
Machine: Thinkpad T60 with Intel T2400, running 32 bit Debian
Linux trillian 2.6.26-1-686 #1 SMP Sat Jan 10 18:29:31 UTC 2009 i686 GNU/Linux
ii  linux-image-2.6.26-1-686  2.6.26-13 
Linux 2.6.26 image on PPro/Celeron/PII/PIII/P4

KVM command-line: kvm -m 1024 -name kfreebsd-debian -hda debian-kfreebsd.img 
-hdb debian-kfreebsd-swap.img -cdrom debian-20090117-kfreebsd-i386-install.iso 
-boot d -net nic,macaddr=00:16:3e:49:01:33,vlan=0 -net tap

debian-kfreebsd.img is an 8G qcow2 image, debian-kfreebsd-swap.img is a 1G 
qcow2 image.
debian-20090117-kfreebsd-i386-install.iso is available from 
http://glibc-bsd.alioth.debian.org/install-cd/kfreebsd-i386/20090117/debian-20090117-kfreebsd-i386-install.iso

Steps to reproduce:
Boot KVM > 80
Select Express
Select ad0
Press a, q
Select BootMgr
Select ad1
Press a, q
Select BootMgr
Press tab, enter
Press c, enter, enter, /, enter
Select ad1
Press c, enter, s, enter
Press q
Select Minimal
Select Exit
Select CD/DVD
Select cd0
Select Yes
Press Alt-F3 to select
Select Europe
Select Copenhagen
Installation should now fail

Installation fails with kvm > 80
Installation succeeds with qemu without kqemu
Installation succeeds with kvm-80
Installation fails with kvm-83 and -no-kvm
Installation fails with kvm-83 and -no-kvm-pit

Sometimes, it fails with a duplicate alloc or duplicate dealloc error 
message


--

Comment By: Andreas Jacobsen (scoof)
Date: 2009-04-08 10:50

Message:
the latest git pull (kvm-84-6620-ge3dbe3f) fails already during boot of the
debian-kfreebsd-iso.

The kernel prints:
Fatal double fault
eip = 0xc0480708
esp = 0xe437cc20
ebp = 0xe437cc78


--

Comment By: Amit Shah (amitshah)
Date: 2009-04-08 08:49

Message:
I just tried this with the kvm git snapshot and it works fine. Can you try
with a nightly build or from the git tree?

--



[ kvm-Bugs-2741742 ] debian-20090117-kfreebsd-i386 installation fails on kvm > 80

2009-04-08 Thread SourceForge.net
Bugs item #2741742, was opened at 2009-04-07 22:37
Message generated for change (Comment added) made by scoof

Comment By: Andreas Jacobsen (scoof)
Date: 2009-04-08 12:05

Message:
Git bisect fingers this:
6364a3918cb5c28376849e7fca3e09bd66b859f3 is first bad commit
commit 6364a3918cb5c28376849e7fca3e09bd66b859f3
Author: Marcelo Tosatti mtosa...@redhat.com
Date:   Mon Dec 1 22:32:04 2008 -0200

KVM: MMU: skip global pgtables on sync due to cr3 switch

Skip syncing global pages on cr3 switch (but not on cr4/cr0). This is
important for Linux 32-bit guests with PAE, where the kmap page is
marked as global.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

:040000 040000 18636d9c5eecaa456d7a216a67b8e16fa99c021b 0b83a51e41eb613dddb8c8e85742e5072b505561 M  arch




[ kvm-Bugs-2741742 ] debian-20090117-kfreebsd-i386 installation fails on kvm > 80

2009-04-08 Thread SourceForge.net
Bugs item #2741742, was opened at 2009-04-07 22:37
Message generated for change (Comment added) made by aurel32

Comment By: Aurelien Jarno (aurel32)
Date: 2009-04-08 13:26

Message:
This is most probably the OOS issue. Try loading the kvm module with
oos_shadow=0



AW: AW: AW: KVM performance

2009-04-08 Thread BRAUN, Stefanie

> BRAUN, Stefanie wrote:
>> 1. Subtest: VLC reads video from local disk and streams it via udp to
>> another pc
>>   Host performance:            11% 11%
>>   kvm process in host (top):   22% 22%
>>   vlc process in vmu (top):    15% 7%
>
> While this isn't wonderful, it's not your major bottleneck now.  What's
> the bandwidth generated by the workload?

Generated Bandwidth : 6500 kbit per sec

>> 4. Subtest: Reading video locally, adding a logo to the video stream
>> and then saving the video locally
>>   Host performance:            50% 50%
>>   kvm process in host (top):   99% 99%
>>   vlc process in vmu (top):    99% 99%
>
> Now this is bad.  Please provide the output of 'kvm_stat -1' while this
> is running.  Also, describe the guest.  Is it Linux?  if so, i386 or
> x86_64?  and is CONFIG_HIGHMEM enabled?

Linux, Fedora 10, x86_64, (2.6.27.21-170.2.56.fc10.x86_64)
The config file does not contain a CONFIG_HIGHMEM parameter.

> UDP performance is a known issue now, and we are working on it.  TCP is
> much better due to segmentation offload.
>
> --
> I have a truly marvellous patch that fixes the bug which this signature
> is too narrow to contain.



vmu01_stat
Description: vmu01_stat


[ kvm-Bugs-2741742 ] debian-20090117-kfreebsd-i386 installation fails on kvm > 80

2009-04-08 Thread SourceForge.net
Bugs item #2741742, was opened at 2009-04-07 22:37
Message generated for change (Comment added) made by scoof

Comment By: Andreas Jacobsen (scoof)
Date: 2009-04-08 15:07

Message:
It works with oos_shadow=0



Re: [libvirt] Re: [Qemu-devel] Changing the QEMU svn VERSION string

2009-04-08 Thread Anthony Liguori

Paul Brook wrote:
> On Tuesday 07 April 2009, Daniel Jacobowitz wrote:
>> On Tue, Apr 07, 2009 at 08:52:46AM -0500, Anthony Liguori wrote:
>>> I think that's going to lead to even more confusion.  While I'm inclined
>>> to not greatly mind 0.10.99 for the development tree, when we do release
>>> candidates for the next release, it's going to be 0.11.0-rc1.  I don't
>>> expect RPMs to ever be created from non-release versions of QEMU provided
>>> we stick to our plan of frequent releases.
>>
>> FWIW, GDB uses 6.8.50 (devel branch), 6.8.90 (release branch), 6.8.91
>> (rc1).  That's worked out well for us.
>
> I like this one.

So do I.

Regards,

Anthony Liguori

> I'm extremely sceptical of anything that claims to need a fine grained version
> number. In practice version numbers for open source projects are fairly
> arbitrary and meaningless because almost everyone has their own set of
> patches and backported fixes anyway.
>
> Paul
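
One property of the purely numeric GDB-style scheme mentioned above is that it orders correctly under plain tuple comparison, with no suffix rules at all. A minimal sketch, assuming dot-separated integer components only; `parse` is a hypothetical helper, and "6.9" is used here merely as a stand-in for whatever the next release would be called.

```python
def parse(version):
    """Turn '6.8.50' into (6, 8, 50) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))

# Previous release, devel branch, release branch, rc1, next release.
gdb_style = ["6.8", "6.8.50", "6.8.90", "6.8.91", "6.9"]
in_order = sorted(gdb_style, key=parse) == gdb_style

def is_plain_numeric(version):
    """True if the string needs no suffix handling at all."""
    try:
        parse(version)
        return True
    except ValueError:
        return False
```

A string such as 0.11.0-rc1 does not fit this parser, which is the extra rule a consumer would have to encode for the suffix form.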


Re: [libvirt] Re: [Qemu-devel] Changing the QEMU svn VERSION string

2009-04-08 Thread Jamie Lokier
Paul Brook wrote:
> I'm extremely sceptical of anything that claims to need a fine
> grained version number. In practice version numbers for open source
> projects are fairly arbitrary and meaningless because almost
> everyone has their own set of patches and backported fixes anyway.

I find it's needed only when you need to interact with a program and
work around bugs or temporarily broken features, and also when the
program gives no other way to determine its features.  For some
reason, I find kernels are the main thing this matters for...

If the help text, some other output, or an API gives enough
information for interacting programs to know what to do, that's much
better and works with arbitrary patches etc.

-- Jamie
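
Feature detection from help text, as opposed to version checks, might look roughly like the sketch below. It is not libvirt's code: `probe_features` is a hypothetical helper and `SAMPLE_HELP` is an invented snippet, not real QEMU output. The only assumption is that each switch starts a line of the help text.

```python
import re

# Invented sample; real help output is longer and formatted differently.
SAMPLE_HELP = """\
usage: emulator [options] [disk_image]
-drive file=FILE    use FILE as a drive image
-uuid UUID          specify machine UUID
-name STRING        set the name of the guest
"""

def probe_features(help_text, switches):
    """Report which command-line switches appear in the help text."""
    present = set(re.findall(r"^(-[\w-]+)", help_text, flags=re.MULTILINE))
    return {switch: (switch in present) for switch in switches}

features = probe_features(SAMPLE_HELP, ["-drive", "-uuid", "-snapshot"])
```

This keeps working with arbitrary patches and backports, because it asks the binary what it supports rather than inferring it from a number.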


Re: [PATCH] do not keep interrupt window closed by sti in real mode

2009-04-08 Thread Glauber Costa
On Tue, Apr 07, 2009 at 09:14:58PM -0700, H. Peter Anvin wrote:
> Glauber Costa wrote:
>> While in real mode, sti does not block interrupts from the subsequent
>> instruction. This is stated at Intel SDM Volume 2b, page 4-432
>
> I don't see how you're getting that idea from the STI documentation --
> and I am quite sure that that is not the case.  Quite on the contrary.
> The only difference between protected mode and real mode has to do with
> the handling of VIF when CPL=3 (this rather naturally falls out if one
> considers CPL=0 in real mode).
>
> The text is:
>
> If protected-mode virtual interrupts are not enabled, STI sets the
> interrupt flag (IF) in the EFLAGS register. After the IF flag is set,
> the processor begins responding to external, maskable interrupts after
> the next instruction is executed. The delayed effect of this instruction
> is provided to allow interrupts to be enabled just before returning from
> a procedure (or subroutine). For instance, if an STI instruction is
> followed by an RET instruction, the RET instruction is allowed to
> execute before external interrupts are recognized. If the STI
> instruction is followed by a CLI instruction (which clears the IF flag),
> the effect of the STI instruction is negated.
>
> Obviously, in real mode, protected-mode virtual interrupts are not
> enabled, as is also confirmed by Table 4-5.

I get the idea from the pseudocode in the STI description.
It says:
IF PE = 0 (* Executing in real-address mode *)
THEN
IF <- 1; (* Set Interrupt Flag *)
ELSE (* Executing in protected mode or virtual-8086 mode *)

There is no mention of any other activity besides setting the IF flag.
Also, sti is used extensively in many places, like the Linux kernel in the
guest, and it works just fine in kvm. So I was led to believe that real mode
was in fact behaving differently.

I'll take a look at Avi's suggestion.



Re: [kvm] [PATCH 13/16] kvm: enable MSI-X capabilty for assigned device

2009-04-08 Thread Alex Williamson
Hi Sheng,

On Wed, 2009-04-08 at 10:26 +0800, Sheng Yang wrote:
> On Wednesday 08 April 2009 00:38:10 Alex Williamson wrote:
>> On Tue, 2009-04-07 at 14:09 +0800, Sheng Yang wrote:
>>> Could you enable DEVICE_ASSSIGNMENT_DEBUG=1 in
>>> qemu/hw/device-assignment.c and post the output?
>>
>> Yup, see below.  The error comes after I 'ifdown eth0; ifup eth0' in the
>> guest.  Note bnx2 appears to only turn on MSIX for SMP systems.  Thanks,
>>
>> Alex
>
> Seems your ifdown/ifup script reloads the module?

No, the bnx2 module isn't unloaded on ifdown.

> Oh god, I found one bug after checking the spec:
>
> System software reads this field to determine the MSI-X Table Size *N*, which
> is encoded as *N-1*. For example, a returned value of “011” indicates
> a table size of 4.
>
> But it still doesn't seem to explain the problem... (OK, it may affect the
> guest in an unknown way as well...) I will post a fix for it soon.
[snip]
>
> The writes to MMIO have been intercepted, but the code fails to count them?
> Strange...
>
> Could you try this debug?

I added the debug printfs, plus the MSI-X table size patch, and printed
the value of msg_ctrl as we loop through.  Output below.  This is what
made me think the MSI-X state isn't getting cleared when the driver
closes the interface.  Let me know what you think.  Thanks,

Alex
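
The N-1 encoding quoted from the spec above can be checked with a short sketch. It assumes the MSI-X Message Control layout from the PCI specification (Table Size in bits 10:0, Function Mask in bit 14, MSI-X Enable in bit 15); the constant and function names are illustrative and do not come from device-assignment.c.

```python
MSIX_TABLE_SIZE_MASK = 0x07FF  # bits 10:0, holds N-1
MSIX_FUNCTION_MASK = 1 << 14   # bit 14
MSIX_ENABLE = 1 << 15          # bit 15

def msix_table_size(msg_ctrl):
    """Decode the number of MSI-X table entries (field stores N-1)."""
    return (msg_ctrl & MSIX_TABLE_SIZE_MASK) + 1

def msix_enabled(msg_ctrl):
    """True if the MSI-X Enable bit is set in Message Control."""
    return bool(msg_ctrl & MSIX_ENABLE)

def msix_function_masked(msg_ctrl):
    """True if all vectors are masked via the Function Mask bit."""
    return bool(msg_ctrl & MSIX_FUNCTION_MASK)
```

Forgetting the +1 under-counts the table by one entry, which is the bug class described above. Checking the enable bit in a printed msg_ctrl value is one way to see whether MSI-X state was actually cleared when the driver closed the interface.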

init_assigned_device: Registering real physical device 03:00.0 (bus=3 dev=0 
func=0)
get_real_device: region 0 size 33554432 start 0xf400 type 512 resource_fd 19
assigned_dev_pci_read_config: (4.0): address= val=0x14e4 len=2
assigned_dev_pci_read_config: (4.0): address=0002 val=0x1639 len=2
assigned_dev_pci_read_config: (4.0): address= val=0x14e4 len=2
assigned_dev_pci_read_config: (4.0): address=0002 val=0x1639 len=2
assigned_dev_pci_read_config: (4.0): address= val=0x14e4 len=2
assigned_dev_pci_read_config: (4.0): address=0002 val=0x1639 len=2
assigned_dev_pci_read_config: (4.0): address=000a val=0x0200 len=2
assigned_dev_pci_read_config: (4.0): address= val=0x14e4 len=2
assigned_dev_pci_read_config: (4.0): address=0002 val=0x1639 len=2
assigned_dev_pci_write_config: (4.0): address=0010 val=0x len=4
assigned_dev_pci_read_config: (4.0): address=0010 val=0xfe00 len=4
assigned_dev_pci_read_config: (4.0): address=0010 val=0xfe00 len=4
assigned_dev_pci_write_config: (4.0): address=0010 val=0xf400 len=4
assigned_dev_iomem_map: e_phys=f400 r_virt=0x7f4fd9bfc000 type=0 
len=0200 region_num=0 
assigned_dev_iomem_map: munmap done, virt_base 0x0x7f4fd9c08000
assigned_dev_pci_read_config: (4.0): address=0004 val=0x0442 len=2
assigned_dev_pci_write_config: (4.0): address=0004 val=0x0442 len=2
assigned_dev_pci_write_config: NON BAR (4.0): address=0004 val=0x0442 len=2
assigned_dev_pci_write_config: (4.0): address=0014 val=0x len=4
assigned_dev_pci_read_config: (4.0): address=0014 val=0x len=4
assigned_dev_pci_write_config: (4.0): address=0018 val=0x len=4
assigned_dev_pci_read_config: (4.0): address=0018 val=0x len=4
assigned_dev_pci_write_config: (4.0): address=001c val=0x len=4
assigned_dev_pci_read_config: (4.0): address=001c val=0x len=4
assigned_dev_pci_write_config: (4.0): address=0020 val=0x len=4
assigned_dev_pci_read_config: (4.0): address=0020 val=0x len=4
assigned_dev_pci_write_config: (4.0): address=0024 val=0x len=4
assigned_dev_pci_read_config: (4.0): address=0024 val=0x len=4
assigned_dev_pci_write_config: (4.0): address=0030 val=0x len=4
assigned_dev_pci_write_config: NON BAR (4.0): address=0030 val=0x len=4
assigned_dev_pci_read_config: (4.0): address=0030 val=0x0001 len=4
assigned_dev_pci_read_config: (4.0): address=0030 val=0x0001 len=4
assigned_dev_pci_write_config: (4.0): address=0030 val=0x0001 len=4
assigned_dev_pci_write_config: NON BAR (4.0): address=0030 val=0x0001 len=4
assigned_dev_pci_read_config: (4.0): address=0004 val=0x0442 len=2
assigned_dev_pci_write_config: (4.0): address=0004 val=0x0442 len=2
assigned_dev_pci_write_config: NON BAR (4.0): address=0004 val=0x0442 len=2
assigned_dev_pci_read_config: (4.0): address=003d val=0x0001 len=1
assigned_dev_pci_write_config: (4.0): address=003c val=0x000b len=1
assigned_dev_pci_read_config: (4.0): address=000a val=0x0200 len=2
assigned_dev_pci_read_config: (4.0): address= val=0x14e4 len=2
assigned_dev_pci_read_config: (4.0): address=0002 val=0x1639 len=2
assigned_dev_pci_read_config: (4.0): address=000e val=0x0080 len=1
assigned_dev_pci_read_config: (4.0): address= val=0x163914e4 len=4
assigned_dev_pci_read_config: (4.0): address=000e val=0x0080 len=1
assigned_dev_pci_read_config: (4.0): address=0006 val=0x0010 len=2
assigned_dev_pci_read_config: (4.0): address=0034 val=0x0040 len=1
assigned_dev_pci_read_config: (4.0): address=0040 val=0x0005 len=1
assigned_dev_pci_read_config: (4.0): 

Re: User Question

2009-04-08 Thread Cam Macdonell

Randy Broman wrote:
> I'm running Kubuntu Jaunty 9.04 on an AMD Phenom II 910, with a custom
> 2.6.28 kernel. I want to install KVM with a Windows XP guest. Apologies,
> I'm confused as to exactly what to install
>
> -I can (should?) apt-get install KVM and/or QEMU from the Jaunty archives.
> -I can configure KVM into my custom kernel using CONFIG_KVM=m,
> CONFIG_KVM_AMD=m and a couple other .config options.
> -I can download, compile and install kvm-84 from source for my kernel
> On this basis I would presumably invoke qemu-system-x86_64 from
> the install directory.

Hi Randy,

I follow the third method.  Compiling the kvm-84 tarball will build both
the kernel modules and the userspace qemu-system-x86_64 and
install them for you.

> This is a home not production system, and I'd like to get the best guest
> performance possible.

Getting the best performance possible will depend on your use of
devices.  For network performance, do not use the user network stack;
it's really slow.  Use vde or bridged networking instead.  Docs for this
are here (http://www.linux-kvm.org/page/Networking).  VDE has no
description (yet), but google will help.

> I'm a little confused between KVM and QEMU ... I
> know there's a KVM kernel module(s) plus the facility to run the virtual
> guest, but I'm unsure which of the above choices to use.

QEMU is a system emulator that emulates numerous architectures.  With
KVM, QEMU is used in userspace to manage virtual devices and
allocate memory for the VMs (no processor emulation is done; QEMU is
only used for x86 on x86 within KVM), but kernel modules are added to
take advantage of hardware virtualization support.  It's well known that
the names of executables can be confusing :)

> Would appreciate recommendations and/or pointers to useful docs.

linux-kvm.org is the place to start.  Specifically the how-to section.

Good luck,
Cam


Re: [PATCH] disable interrupt shadow state for emulated instruction

2009-04-08 Thread H. Peter Anvin

Glauber Costa wrote:
> We currently unblock shadow interrupt state when we skip an instruction,
> but fail to do so when we actually emulate one. This blocks interrupts
> in key instruction blocks, in particular sti; hlt; sequences
>
> Without this patch, I cannot boot gpxe option roms on vmx machines.
> This is described at https://bugzilla.redhat.com/show_bug.cgi?id=494469
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c6997c0..cee38e4 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -736,26 +736,34 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>         vmcs_writel(GUEST_RFLAGS, rflags);
>  }
>
> +static void vmx_block_interrupt_shadow(struct kvm_vcpu *vcpu)
> +{
> +       /*
> +        * We emulated an instruction, so temporary interrupt blocking
> +        * should be removed, if set.
> +        */
> +       u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
> +       u32 interruptibility_mask = ((GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
> +
> +       if (interruptibility & interruptibility_mask)
> +               vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
> +                            interruptibility & ~interruptibility_mask);
> +       vcpu->arch.interrupt_window_open = 1;
> +}
> +

How does this logic work when the instruction emulated is an STI or MOV
SS instruction?  In particular, when does GUEST_INTERRUPTIBILITY_INFO
get set to reflect the *blocking* operation?


The pseudo-code for this kind of stuff looks like:


forever {
        tmp_int_flags <- int_flags

        /* Begin instruction execution */
        int_flags |= GUEST_INTR_STATE_STI   /* STI instruction */
        /* End instruction execution */

        int_flags &= ~tmp_int_flags

        if (irq_pending && eflags.if == 1 && int_flags == 0)
                take_interrupt();
}

Note the behavior in the case of sequential STIs, that int_flags goes to 
0 after the second execution.
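
For anyone who wants to play with the pseudo-code, here is a minimal C sketch of one loop iteration. The names (shadow_step, can_take_interrupt, INTR_STATE_*) are made up for illustration and are not the kernel's definitions; handling MOV SS the same way as STI is an assumption on top of the pseudo-code, which only shows STI.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the interruptibility bits; not the kernel's definitions. */
#define INTR_STATE_STI    0x1u
#define INTR_STATE_MOV_SS 0x2u

enum insn { INSN_OTHER, INSN_STI, INSN_MOV_SS };

/*
 * Run one instruction through the shadow logic and return the new
 * interruptibility flags. A blocking bit survives only if it was newly
 * set by this instruction; a bit that was already set on entry is dropped.
 */
static uint32_t shadow_step(uint32_t int_flags, enum insn insn)
{
    uint32_t tmp_int_flags = int_flags;

    /* Begin instruction execution */
    if (insn == INSN_STI)
        int_flags |= INTR_STATE_STI;
    else if (insn == INSN_MOV_SS)      /* assumption: same rule as STI */
        int_flags |= INTR_STATE_MOV_SS;
    /* End instruction execution */

    int_flags &= ~tmp_int_flags;
    return int_flags;
}

/* An interrupt can be taken only with IF set and no shadow active. */
static int can_take_interrupt(int irq_pending, int eflags_if, uint32_t int_flags)
{
    return irq_pending && eflags_if && int_flags == 0;
}
```

With this model, sti followed by any other instruction (e.g. hlt) leaves int_flags at 0 after the second step, and two STIs in a row also end at 0, which is the sequential-STI behavior noted above.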


-hpa


Re: [PATCH] disable interrupt shadow state for emulated instruction

2009-04-08 Thread Glauber Costa
On Wed, Apr 08, 2009 at 11:16:05AM -0700, H. Peter Anvin wrote:
 Glauber Costa wrote:
 We currently unblock the interrupt shadow state when we skip an instruction,
 but fail to do so when we actually emulate one. This blocks interrupts
 in key instruction sequences, in particular sti; hlt.

 Without this patch, I cannot boot gPXE option ROMs on VMX machines.
 This is described at https://bugzilla.redhat.com/show_bug.cgi?id=494469

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index c6997c0..cee38e4 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -736,26 +736,34 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
     vmcs_writel(GUEST_RFLAGS, rflags);
  }
 
 +static void vmx_block_interrupt_shadow(struct kvm_vcpu *vcpu)
 +{
 +   /*
 +    * We emulated an instruction, so temporary interrupt blocking
 +    * should be removed, if set.
 +    */
 +   u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
 +   u32 interruptibility_mask = ((GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 +
 +   if (interruptibility & interruptibility_mask)
 +       vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
 +                    interruptibility & ~interruptibility_mask);
 +   vcpu->arch.interrupt_window_open = 1;
 +}
 +

 How does this logic work when the instruction emulated is an STI or MOV
 SS instruction?  In particular, when does GUEST_INTERRUPTIBILITY_INFO
 get set to reflect the *blocking* operation?
mov ss is a non-issue, since it is executed natively.

As for sti, I'm not sure. I see code for emulating sti, but in my testing
this code was never touched, under a number of different scenarios.
Avi, can you clarify whether sti can in fact be emulated, and under which
circumstances?

If it can, I'd say we'd have to introduce a block_interrupt_shadow as well,
and call it from within the emulator whenever the first sti is dispatched.



Re: [PATCH] disable interrupt shadow state for emulated instruction

2009-04-08 Thread H. Peter Anvin

Glauber Costa wrote:

mov ss is a non-issue, since it is executed natively.


In real mode?

-hpa


Re: [PATCH] disable interrupt shadow state for emulated instruction

2009-04-08 Thread Glauber Costa
On Wed, Apr 08, 2009 at 11:31:54AM -0700, H. Peter Anvin wrote:
 Glauber Costa wrote:
 mov ss is a non-issue, since it is executed natively.

 In real mode?
it seems so, to me. But I can be wrong. If I am, then I'd
propose the same path I proposed for sti for this.



Re: [PATCH] disable interrupt shadow state for emulated instruction

2009-04-08 Thread Gleb Natapov
On Wed, Apr 08, 2009 at 01:57:32PM -0400, Glauber Costa wrote:
 We currently unblock the interrupt shadow state when we skip an instruction,
 but fail to do so when we actually emulate one. This blocks interrupts
 in key instruction sequences, in particular sti; hlt.
 
 Without this patch, I cannot boot gPXE option ROMs on VMX machines.
 This is described at https://bugzilla.redhat.com/show_bug.cgi?id=494469
 
 Signed-off-by: Glauber Costa <glom...@redhat.com>
 CC: H. Peter Anvin <h...@zytor.com>
 CC: Avi Kivity <a...@redhat.com>
 ---
  arch/x86/include/asm/kvm_host.h |    1 +
  arch/x86/kvm/svm.c              |   12 ++--
  arch/x86/kvm/vmx.c              |   29 +++--
  arch/x86/kvm/x86.c              |    2 ++
  4 files changed, 32 insertions(+), 12 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 3fc4623..0890a6e 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -513,6 +513,7 @@ struct kvm_x86_ops {
   void (*run)(struct kvm_vcpu *vcpu, struct kvm_run *run);
   int (*handle_exit)(struct kvm_run *run, struct kvm_vcpu *vcpu);
   void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
 + void (*block_interrupt_shadow)(struct kvm_vcpu *vcpu);
remove_interrupt_shadow or unblock_interrupt_shadow would be better
names IMHO.

   void (*patch_hypercall)(struct kvm_vcpu *vcpu,
   unsigned char *hypercall_addr);
   int (*get_irq)(struct kvm_vcpu *vcpu);
 diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
 index 3ffb695..d303e86 100644
 --- a/arch/x86/kvm/svm.c
 +++ b/arch/x86/kvm/svm.c
 @@ -210,6 +210,14 @@ static int is_external_interrupt(u32 info)
   return info == (SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_INTR);
  }
  
 +static void svm_block_interrupt_shadow(struct kvm_vcpu *vcpu)
 +{
 + struct vcpu_svm *svm = to_svm(vcpu);
 +
 + svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK;
 + vcpu->arch.interrupt_window_open = (svm->vcpu.arch.hflags & HF_GIF_MASK);
 +}
 +
  static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
  {
   struct vcpu_svm *svm = to_svm(vcpu);
 @@ -223,9 +231,8 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
   __func__, kvm_rip_read(vcpu), svm->next_rip);
  
   kvm_rip_write(vcpu, svm->next_rip);
 - svm->vmcb->control.int_state &= ~SVM_INTERRUPT_SHADOW_MASK;
  
 - vcpu->arch.interrupt_window_open = (svm->vcpu.arch.hflags & HF_GIF_MASK);
 + svm_block_interrupt_shadow(vcpu);
  }
  
  static int has_svm(void)
 @@ -2660,6 +2667,7 @@ static struct kvm_x86_ops svm_x86_ops = {
   .run = svm_vcpu_run,
   .handle_exit = handle_exit,
   .skip_emulated_instruction = skip_emulated_instruction,
 + .block_interrupt_shadow = svm_block_interrupt_shadow,
   .patch_hypercall = svm_patch_hypercall,
   .get_irq = svm_get_irq,
   .set_irq = svm_set_irq,
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index c6997c0..cee38e4 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -736,26 +736,34 @@ static void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
   vmcs_writel(GUEST_RFLAGS, rflags);
  }
  
 +static void vmx_block_interrupt_shadow(struct kvm_vcpu *vcpu)
 +{
 + /*
 +  * We emulated an instruction, so temporary interrupt blocking
 +  * should be removed, if set.
 +  */
 + u32 interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
 + u32 interruptibility_mask = ((GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 +
 + if (interruptibility & interruptibility_mask)
 + vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
 +  interruptibility & ~interruptibility_mask);
 + vcpu->arch.interrupt_window_open = 1;
 +}
 +
  static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
  {
   unsigned long rip;
 - u32 interruptibility;
  
   rip = kvm_rip_read(vcpu);
   rip += vmcs_read32(VM_EXIT_INSTRUCTION_LEN);
   kvm_rip_write(vcpu, rip);
  
 - /*
 -  * We emulated an instruction, so temporary interrupt blocking
 -  * should be removed, if set.
 -  */
 - interruptibility = vmcs_read32(GUEST_INTERRUPTIBILITY_INFO);
 - if (interruptibility & 3)
 - vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
 -  interruptibility & ~3);
 - vcpu->arch.interrupt_window_open = 1;
 + /* skipping an emulated instruction also counts */
 + vmx_block_interrupt_shadow(vcpu);
  }
  
 +
  static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
   bool has_error_code, u32 error_code)
  {
 @@ -3727,6 +3735,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
   .run = vmx_vcpu_run,
   .handle_exit = vmx_handle_exit,
   .skip_emulated_instruction = skip_emulated_instruction,
 + .block_interrupt_shadow = vmx_block_interrupt_shadow,
   

Re: [PATCH] disable interrupt shadow state for emulated instruction

2009-04-08 Thread Gleb Natapov
On Wed, Apr 08, 2009 at 03:43:06PM -0300, Glauber Costa wrote:
 On Wed, Apr 08, 2009 at 11:31:54AM -0700, H. Peter Anvin wrote:
  Glauber Costa wrote:
  mov ss is a non-issue, since it is executed natively.
 
  In real mode?
 it seems so, to me. But I can be wrong. If I am, then I'd
 propose the same path I proposed for sti for this.
 
In big real mode everything is emulated.

--
Gleb.


[PATCH v2 1/3] qemu: Add prototype and make qemu_uuid_parse() non-static

2009-04-08 Thread Alex Williamson
SMBIOS parameters can also provide a UUID outside of vl.c.

Signed-off-by: Alex Williamson <alex.william...@hp.com>
---

 sysemu.h |    1 +
 vl.c     |    2 +-
 2 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/sysemu.h b/sysemu.h
index 3eab34b..e94d5c3 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -15,6 +15,7 @@ extern const char *bios_dir;
 extern int vm_running;
 extern const char *qemu_name;
 extern uint8_t qemu_uuid[];
+int qemu_uuid_parse(const char *str, uint8_t *uuid);
 #define UUID_FMT "%02hhx%02hhx%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx-%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx"
 
 typedef struct vm_change_state_entry VMChangeStateEntry;
diff --git a/vl.c b/vl.c
index ddbcc6c..6235341 100644
--- a/vl.c
+++ b/vl.c
@@ -4197,7 +4197,7 @@ static BOOL WINAPI qemu_ctrl_handler(DWORD type)
 }
 #endif
 
-static int qemu_uuid_parse(const char *str, uint8_t *uuid)
+int qemu_uuid_parse(const char *str, uint8_t *uuid)
 {
 int ret;
 



Re: User Question

2009-04-08 Thread Randy Broman
Thanks Cam for the information. I've compiled and installed the kvm-84
tarball, I'm running modules KVM and KVM_AMD, and I can start my WinXP
guest successfully with qemu-system-x86_64. I've compiled and installed
vde, but I can't get vde networking working in the guest. My Kubuntu host
is on the 192.168.0 network, and I did:

% sudo vde_switch -t tap0 -daemon
% sudo ifconfig tap0 192.168.1.1 netmask 255.255.255.0

When I do an ifconfig -a, both my host eth0 and tap0 appear to show up
correctly on the host (?). When I start qemu-system-x86_64 with its
several options, I'm confused as to how to specify the

-net vde[,vlan=n][,name=str][,sock=socketpath][,port=n][,group=groupname][,mode=octalmode]

options. I have only one guest on the system/network. The guest starts,
and I've set its IP at 192.168.1.2 and gateway 192.168.1.1, but there is
no connectivity.

Another question: should I use -net model=virtio, and if so where do I
get the WinXP driver for that?

Thanks, Randy



Cam Macdonell wrote:

I'm running Kubuntu Jaunty 9.04 on an AMD Phenom II 910, with a custom
2.6.28 kernel. I want to install KVM with a Windows XP guest. Apologies
I'm confused as to exactly what to install 

-I can (should?) apt-get install KVM and/or QEMU from the Jaunty 
archives.

-I can configure KVM into my custom kernel using CONFIG_KVM=m,
CONFIG_KVM_AMD=m and a couple other .config options.
-I can download, compile and install kvm-84 from source for my kernel
On this basis I would presumably invoke qemu-system-x86_64 from
the install directory.


Hi Randy,

I follow the third method.  Compiling the kvm-84 tarball will build both 
the kernel modules and the userspace qemu-system-x86_64 and install them 
for you.



This is a home not production system, and I'd like to get the best guest
performance possible. 


Getting the best performance possible will depend on your use of 
devices.  For network performance, do not use the user network stack; 
it's really slow.  Use vde or bridged networking instead.  Docs for this 
are here (http://www.linux-kvm.org/page/Networking).  VDE has no 
description (yet), but google will help.



I'm a little confused between KVM and QEMU ... I
know there's a KVM kernel module(s) plus the facility to run the virtual
guest, but I'm unsure which of the above choices to use. 


Qemu is a system emulator that emulates numerous architectures.  With 
KVM, Qemu is used in userspace to manage virtual devices and 
allocate memory for the VMs (no processor emulation is done; Qemu is 
only used for x86 on x86 within KVM), while kernel modules are added to 
take advantage of hardware virtualization support.  It's well known that 
the names of the executables can be confusing :)



Would appreciate recommendations and/or pointers to useful docs.


linux-kvm.org is the place to start.  Specifically the how-to section.

Good luck,
Cam


Re: [kvm] [PATCH 13/16] kvm: enable MSI-X capabilty for assigned device

2009-04-08 Thread Sheng Yang
On Thursday 09 April 2009 00:13:56 Alex Williamson wrote:
 Hi Sheng,

 On Wed, 2009-04-08 at 10:26 +0800, Sheng Yang wrote:
  On Wednesday 08 April 2009 00:38:10 Alex Williamson wrote:
   On Tue, 2009-04-07 at 14:09 +0800, Sheng Yang wrote:
Could you enable DEVICE_ASSIGNMENT_DEBUG=1 in
qemu/hw/device-assignment.c and post the output?
  
   Yup, see below.  The error comes after I 'ifdown eth0; ifup eth0' in
   the guest.  Note bnx2 appears to only turn on MSIX for SMP systems. 
   Thanks,
  
   Alex
 
  Seems your ifdown/ifup script reload the module?

 No, the bnx2 module isn't unloaded on ifdown.

  Oh god, I found one bug after checking the spec:
 
  System software reads this field to determine the MSI-X Table Size *N*,
  which is encoded as *N-1*. For example, a returned value of “011”
  indicates a table size of 4.
 
  But it still doesn't seem to explain the problem... (OK, it may affect
  the guest in an unknown way as well...) I will post a fix for it soon.

 [snip]

  The writes to MMIO have been intercepted, but the code fails to count
  them? Strange...
 
  Could you try this debug?

 I added the debug printfs, plus the MSI-X table size patch, and printed
 the value of msg_ctrl as we loop through.  Output below.  This is what
 made me think the MSI-X state isn't getting cleared when the driver
 closes the interface.  Let me know what you think.  Thanks,


Thanks Alex, now I know where the problem is. Part of the functionality
hasn't been implemented...

 address=0052 val=0x8008 len=2 the MSIX capabilty position is 0x50
 the MSIX entries_max_nr is 0x9
 0: msg_ctrl: 0001
 1: msg_ctrl: 0001
 2: msg_ctrl: 0001
 3: msg_ctrl: 0001
 4: msg_ctrl: 0001
 5: msg_ctrl: 0001
 6: msg_ctrl: 0001
 7: msg_ctrl: 0001
 8: msg_ctrl: 0001
 MSI-X entry number is zero!
 assigned_dev_update_msix_mmio: No such device or address
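
The numbers in this output are consistent with the N-1 encoding quoted from the spec, assuming the val=0x8008 write at 0x52 is the MSI-X Message Control word (capability at 0x50, plus 2). A minimal sketch of the decoding; the macro and function names are illustrative only and not taken from device-assignment.c:

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the MSI-X Message Control word (names are illustrative). */
#define MSIX_CTRL_TABLE_SIZE_MASK 0x07ffu  /* bits 10:0, encoded as N-1 */
#define MSIX_CTRL_ENABLE          0x8000u  /* bit 15: MSI-X Enable */

/* Number of entries in the MSI-X table: the field value plus one. */
static unsigned int msix_table_size(uint16_t msg_ctrl)
{
    return (msg_ctrl & MSIX_CTRL_TABLE_SIZE_MASK) + 1u;
}

/* Non-zero if the MSI-X Enable bit is set. */
static int msix_enabled(uint16_t msg_ctrl)
{
    return (msg_ctrl & MSIX_CTRL_ENABLE) != 0;
}
```

For 0x8008 this gives a table size of 9 with MSI-X enabled, matching the "entries_max_nr is 0x9" line above.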

The driver writes to the vectors first, then enables MSI-X,

 msix_mmio_writel: write to MSI-X entry table mmio offset 0xc, val 0x0
 msix_mmio_writel: write to MSI-X entry table mmio offset 0x1c, val 0x0
 msix_mmio_writel: write to MSI-X entry table mmio offset 0x2c, val 0x0
 msix_mmio_writel: write to MSI-X entry table mmio offset 0x3c, val 0x0
 msix_mmio_writel: write to MSI-X entry table mmio offset 0x4c, val 0x0
 msix_mmio_writel: write to MSI-X entry table mmio offset 0x5c, val 0x0
 msix_mmio_writel: write to MSI-X entry table mmio offset 0x6c, val 0x0
 msix_mmio_writel: write to MSI-X entry table mmio offset 0x7c, val 0x0

And finally clears the mask bit.

Currently we don't implement the mask capability for MSI-X vectors, so it
won't work...

OK. For now I'd like to remove the check of the mask bit and only ignore an
unused vector when its msg data is zero (hope it won't cause more problems).
We will add support for per-vector masking later.

Thanks for helping to debug!
-- 
regards
Yang, Sheng


[PATCH 1/1] kvm: don't check per-vector mask bit before enable MSI-X

2009-04-08 Thread Sheng Yang
Some drivers (e.g. bnx2) do the following to enable MSI-X:
1. Mask all vectors.
2. Write the msg data and address.
3. Enable MSI-X.
4. Unmask all the vectors.

For such drivers, checking the per-vector mask bit before MSI-X is enabled
causes the device to fail to enable MSI-X. So now we determine whether a
vector is in use only by checking whether msg_data is zero.
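
As a rough standalone illustration of the resulting rule (not the actual device-assignment.c code): each MSI-X table entry is 16 bytes, with Message Data at offset 8 and Vector Control at offset 12. The function name, the macros and the counting of entries are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSIX_ENTRY_SIZE        16  /* bytes per MSI-X table entry */
#define MSIX_ENTRY_DATA_OFFSET  8  /* Message Data dword within an entry */

/*
 * Count entries whose Message Data is non-zero. The per-vector mask bit
 * in Vector Control (offset 12) is deliberately not consulted, so an
 * entry that is still masked but already programmed is counted.
 */
static int count_usable_entries(const uint8_t *table, int entries_max_nr)
{
    int i, entries_nr = 0;

    for (i = 0; i < entries_max_nr; i++) {
        uint32_t msg_data;

        memcpy(&msg_data,
               table + i * MSIX_ENTRY_SIZE + MSIX_ENTRY_DATA_OFFSET,
               sizeof(msg_data));
        /* Ignore an unused entry even if it is unmasked */
        if (msg_data == 0)
            continue;
        entries_nr++;
    }
    return entries_nr;
}
```

With the old mask-bit check, a driver following steps 1 to 3 above would have every entry skipped at enable time, which matches the "MSI-X entry number is zero!" failure seen with bnx2.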

Signed-off-by: Sheng Yang <sh...@linux.intel.com>
---
 qemu/hw/device-assignment.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index f33ce3c..1f0a1a7 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -823,9 +823,6 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev)
     /* Get the usable entry number for allocating */
     for (i = 0; i < entries_max_nr; i++) {
         memcpy(&msg_ctrl, va + i * 16 + 12, 4);
-        /* 0x1 is mask bit for per vector */
-        if (msg_ctrl & 0x1)
-            continue;
         memcpy(&msg_data, va + i * 16 + 8, 4);
         /* Ignore unused entry even it's unmasked */
         if (msg_data == 0)
-- 
1.5.4.5



[PATCH 3/4] add replace_page(): change the page pte is pointing to.

2009-04-08 Thread Izik Eidus
replace_page() allows changing the mapping of a pte from one physical page
to a different physical page.

This function works by removing oldpage from the rmap and calling
put_page() on it, and by setting the pte to point to newpage and
inserting it into the rmap using page_add_file_rmap().

Note: newpage must be a non-anonymous page. The reason for this is that
replace_page() is built to allow mapping one page at more than one
virtual address; the mapping of this page can happen at different
offsets inside each vma, and therefore we cannot trust page->index
anymore.

The side effect of this issue is that newpage cannot be anything but
a kernel-allocated page that is not swappable.
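
To make the ordering of the bookkeeping easier to follow, here is a small userspace toy model of the steps described above. All types and names (toy_page, toy_pte, toy_replace_page) are hypothetical stand-ins, not kernel types; locking, TLB flushing and the mmu notifier call are left out entirely.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-ins for kernel objects; none of these are the real kernel types. */
struct toy_page {
    int refcount;   /* models page_count() */
    int mapcount;   /* models the rmap: number of ptes pointing here */
};

struct toy_pte {
    struct toy_page *page;
};

/*
 * Point *ptep at newpage instead of oldpage, but only if the pte still
 * holds the value the caller sampled earlier (orig). Returns 0 on
 * success, -EFAULT if the pte changed in the meantime.
 */
static int toy_replace_page(struct toy_pte *ptep, struct toy_pte orig,
                            struct toy_page *oldpage, struct toy_page *newpage)
{
    if (ptep->page != orig.page)
        return -EFAULT;

    /* take a reference and an rmap entry on the new page first */
    newpage->refcount++;
    newpage->mapcount++;

    ptep->page = newpage;

    /* then drop the old page's rmap entry and reference */
    oldpage->mapcount--;
    oldpage->refcount--;
    return 0;
}
```

The comparison against orig plays the role of the pte_same() check in the patch: if something else changed the pte after the caller looked at it, nothing is touched.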

Signed-off-by: Izik Eidus <iei...@redhat.com>
---
 include/linux/mm.h |    5 +++
 mm/memory.c        |   80 
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..7a831ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1240,6 +1240,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
unsigned int foll_flags);
 #define FOLL_WRITE 0x01/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace page in vma with new page
+ * @vma:  vma that hold the pte oldpage is pointed by.
+ * @oldpage:  the page we are replacing with newpage
+ * @newpage:  the page we replace oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+                 struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *ptep;
+   spinlock_t *ptl;
+   unsigned long addr;
+   int ret;
+
+   BUG_ON(PageAnon(newpage));
+
+   ret = -EFAULT;
+   addr = page_address_in_vma(oldpage, vma);
+   if (addr == -EFAULT)
+   goto out;
+
+   pgd = pgd_offset(mm, addr);
+   if (!pgd_present(*pgd))
+   goto out;
+
+   pud = pud_offset(pgd, addr);
+   if (!pud_present(*pud))
+   goto out;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   goto out;
+
+   ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+   if (!ptep)
+   goto out;
+
+   if (!pte_same(*ptep, orig_pte)) {
+   pte_unmap_unlock(ptep, ptl);
+   goto out;
+   }
+
+   ret = 0;
+   get_page(newpage);
+   page_add_file_rmap(newpage);
+
+   flush_cache_page(vma, addr, pte_pfn(*ptep));
+   ptep_clear_flush(vma, addr, ptep);
+   set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+   page_remove_rmap(oldpage);
+   if (PageAnon(oldpage)) {
+   dec_mm_counter(mm, anon_rss);
+   inc_mm_counter(mm, file_rss);
+   }
+   put_page(oldpage);
+
+   pte_unmap_unlock(ptep, ptl);
+
+out:
+   return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages. the old
  * mappings are removed. any references to nonexistent pages results
-- 
1.5.6.5



[PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()

2009-04-08 Thread Izik Eidus
This macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then taking a vmexit
in order to set it.

This function is an optimization for kvm/users of mmu_notifiers for COW
pages. It is useful for kvm when ksm is used because it allows kvm
not to have to receive a VMEXIT and only then map the shared page into
the mmu shadow pages, but instead map it directly at the same time
linux maps the page into the host page table.

This mmu notifier macro works by calling a callback that will map
the physical page directly into the shadow page tables.

(Users of mmu_notifiers that didn't implement the set_pte_at_notify()
callback will just receive the mmu_notifier_invalidate_page callback.)
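
The dispatch rule in the last paragraph can be sketched in plain userspace C as follows. This is only a toy model: toy_notifier_ops, toy_shadow and the helper names are hypothetical, and the real mmu_notifier_ops has more callbacks and walks a list of registered notifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified notifier; the real mmu_notifier_ops has more callbacks. */
struct toy_notifier_ops {
    void (*change_pte)(void *ctx, unsigned long address, unsigned long pte);
    void (*invalidate_page)(void *ctx, unsigned long address);
};

struct toy_notifier {
    const struct toy_notifier_ops *ops;
    void *ctx;
};

/*
 * Tell a secondary MMU that the pte for `address` changed. A user that
 * implements change_pte can install the new mapping right away; any
 * other user just gets its mapping invalidated and refills it later.
 */
static void toy_notify_change_pte(struct toy_notifier *mn,
                                  unsigned long address, unsigned long pte)
{
    if (mn->ops->change_pte)
        mn->ops->change_pte(mn->ctx, address, pte);
    else if (mn->ops->invalidate_page)
        mn->ops->invalidate_page(mn->ctx, address);
}

/* Example user: a one-entry "shadow page table". */
struct toy_shadow {
    unsigned long address;
    unsigned long pte;
    int valid;
};

static void shadow_change_pte(void *ctx, unsigned long address, unsigned long pte)
{
    struct toy_shadow *s = ctx;

    s->address = address;
    s->pte = pte;
    s->valid = 1;
}

static void shadow_invalidate_page(void *ctx, unsigned long address)
{
    struct toy_shadow *s = ctx;

    if (s->valid && s->address == address)
        s->valid = 0;
}

static const struct toy_notifier_ops ops_with_change_pte = {
    .change_pte = shadow_change_pte,
    .invalidate_page = shadow_invalidate_page,
};

static const struct toy_notifier_ops ops_invalidate_only = {
    .invalidate_page = shadow_invalidate_page,
};
```

With ops_with_change_pte the shadow entry stays valid and picks up the new pte at once; with ops_invalidate_only it is dropped and would have to be re-established by a later fault, which is the extra exit this patch is trying to avoid.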

Signed-off-by: Izik Eidus <iei...@redhat.com>
---
 include/linux/mmu_notifier.h |   34 ++
 mm/memory.c                  |   10 --
 mm/mmu_notifier.c            |   20 
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 struct mm_struct *mm,
 unsigned long address);
 
+   /* 
+   * change_pte is called in cases that pte mapping into page is changed
+   * for example when ksm mapped pte to point into a new shared page.
+   */
+   void (*change_pte)(struct mmu_notifier *mn,
+  struct mm_struct *mm,
+  unsigned long address,
+  pte_t pte);
+
/*
 * Before this is invoked any secondary MMU is still ok to
 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm, 
+ unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+   if (mm_has_notifiers(mm))
+   __mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
__young;\
 })
 
+#define set_pte_at_notify(__mm, __address, __ptep, __pte)              \
+({                                                                     \
+   struct mm_struct *___mm = __mm;                                     \
+   unsigned long ___address = __address;                               \
+   pte_t ___pte = __pte;                                               \
+                                                                       \
+   /* use the local copies so each macro argument is evaluated once */ \
+   set_pte_at(___mm, ___address, __ptep, ___pte);                      \
+   mmu_notifier_change_pte(___mm, ___address, ___pte);                 \
+})
+
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
return 0;
 }
 
+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+  unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
 
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..1e1a14b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,9 +2051,15 @@ gotten:
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
page_add_new_anon_rmap(new_page, vma, address);
-  

[PATCH 0/4] ksm - dynamic page sharing driver for linux v3

2009-04-08 Thread Izik Eidus
From v2 to v3:

1) Remove the unnecessary is_dirty_pte() check inside PageKsm().
   We had added the is_dirty_pte() check to protect against the
   "reuse:" case inside do_wp_page().
   Andrea pointed out to me that such a condition can never happen:
   if VM_SHARED is set, no anonymous page can be on the vma, so such a
   page can never become a KsmPage, and therefore KsmPages can never
   trigger the reuse case.
   (See "From v1 to v2" below for more info.)

2) Add a !vm_file check in addition to PageKsm() when checking for a
   shared page.
   Until now, Ksm checked whether a page is a shared page (KsmPage)
   by just running get_user_page() and then checking whether
   Page != AnonPage.
   The problem arises because Ksm keeps virtual addresses inside its
   data structures: if the user frees a page and allocates a new
   !AnonPage page, Ksm might think this page is a shared page.
   To solve this problem we have added an additional check for Ksm:
   we check whether vma->vm_file is NULL. If we see a virtual address
   whose vma->vm_file is NULL and the page it points to isn't an
   AnonPage, we can safely conclude that this is a shared page
   (KsmPage).

3) Replace jhash() with jhash2().
   Andrey Panin pointed out that we should use jhash2() as it is
   faster than jhash().

Thanks.

(Below is info from previous posts)

From v1 to v2:

1) Fixed a security issue found by Chris Wright:
   Ksm was checking whether a page is a shared page by testing !PageAnon.
   Because Ksm scans only anonymous memory, all !PageAnon pages
   inside the ksm data structures are shared pages. However, there might
   be a case in do_wp_page(), when VM_SHARED is used, where
   do_wp_page() would reuse the page instead of copying it into a new
   anonymous page. It was fixed by adding a check for the
   dirty bit of the virtual addresses pointing into the shared page.
   I could not find any VM code that would clear the dirty bit from
   this virtual address (due to the fact that we allocate the page
   using page_alloc() - kernel allocated pages), ~but I still want
   confirmation about this from the vm guys - thanks.~

2) Moved to sysfs to control ksm:
   It was requested as a better way to control the ksm scanning
   thread than ioctls.
   The sysfs api:
   dir: /sys/kernel/mm/ksm/

   kernel_pages_allocated - how many kernel pages ksm has allocated;
   these pages are not swappable, and each such page is used by ksm
   to share pages with identical content

   pages_shared - how many pages were shared by ksm

   run - set to 1 when you want ksm to run, 0 when not

   max_kernel_pages - the maximum number of kernel pages
   to be allocated by ksm; set to 0 for unlimited

   pages_to_scan - how many pages to scan before ksm sleeps

   sleep - how many usecs ksm will sleep

3) Add a sysfs parameter to control the maximum number of kernel pages
   to be allocated by ksm.

4) Add statistics about how many pages are really shared.
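
For illustration, the sysfs files listed under 2) could be driven from a small C helper like the one below. This is only a sketch under the assumption that the files accept and return plain decimal values; the helper names are made up, and the directory is a parameter so the code can be tried against any directory without the patch set applied.

```c
#include <assert.h>
#include <stdio.h>

/* Build "<dir>/<name>" into path; returns 0 on success, -1 if it does not fit. */
static int ksm_knob_path(char *path, size_t len, const char *dir, const char *name)
{
    int n = snprintf(path, len, "%s/%s", dir, name);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write a decimal value to <dir>/<name>. Returns 0 on success, -1 on error.
 * `dir` would be /sys/kernel/mm/ksm on a kernel carrying this patch set.
 */
static int ksm_write_knob(const char *dir, const char *name, unsigned long value)
{
    char path[256];
    FILE *f;

    if (ksm_knob_path(path, sizeof(path), dir, name) < 0)
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fprintf(f, "%lu\n", value) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;
}

/* Read a decimal value from <dir>/<name>. Returns 0 on success, -1 on error. */
static int ksm_read_knob(const char *dir, const char *name, unsigned long *value)
{
    char path[256];
    FILE *f;
    int ok;

    if (ksm_knob_path(path, sizeof(path), dir, name) < 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%lu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

A typical sequence would then be to write pages_to_scan and sleep, write 1 to run, and later read pages_shared, all with "/sys/kernel/mm/ksm" as dir.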


One issue still to be discussed:
There was a suggestion to use madvise(SHAREABLE) instead of using
ioctls to register memory that needs to be scanned by ksm.
Such a change is outside the area of ksm.c and would require adding
a new madvise api and changing some parts of the vm and the kernel
code, so the first thing to do is to decide whether we really want this.

I don't know of any other open issues.

Thanks.

This is from the first post:
(The kvm part, together with the kvm-userspace part, was posted with V1
about a week ago; whoever wants to test ksm may download the
patch from the lkml archive)

KSM is a linux driver that allows dynamically sharing identical memory
pages between one or more processes.

Unlike traditional page sharing, which is set up when the memory is
allocated, ksm does it dynamically after the memory has been created.
Memory is periodically scanned; identical pages are identified and
merged.
The sharing is unnoticeable by the processes that use this memory.
(The shared pages are marked as readonly, and in case of a write
do_wp_page() takes care of creating a new copy of the page.)

To find identical pages ksm uses an algorithm that is split into three
primary levels:

1) Ksm will start scanning the memory and will calculate a checksum for
   each page that is registered to be scanned.
   (In the first round of scanning, ksm only calculates
   this checksum for all the pages.)

2) Ksm will go over the whole memory again and recalculate the
   checksum of the pages. Pages that are found to have the same
   checksum value are considered pages that most likely
   won't change.
   Ksm will insert these pages into an RB-tree sorted by page content,
   called the unstable tree. The reason this tree is called
   unstable is that the page contents might change
   while they are still inside the tree, and therefore the tree would
   become corrupted.
   Due to this problem ksm takes two more steps in addition to the
   checksum calculation:
   a) Ksm will throw 

[PATCH 2/4] add page_wrprotect(): write protecting page.

2009-04-08 Thread Izik Eidus
This patch adds a new function called page_wrprotect().
page_wrprotect() is used to take a page and mark all the ptes that
point to it as readonly.

The function works by walking the rmap of the page and setting
each pte related to the page as readonly.

The odirect_sync parameter is used to protect against possible races
with O_DIRECT while we are marking the ptes as readonly,
as noted by Andrea Arcangeli:

"While thinking at get_user_pages_fast I figured another worse way
things can go wrong with ksm and o_direct: think a thread writing
constantly to the last 512bytes of a page, while another thread read
and writes to/from the first 512bytes of the page. We can lose
O_DIRECT reads, the very moment we mark any pte wrprotected..."
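
A toy userspace model of the per-pte decision made in the patch below may help when reading it. The types and names (toy_pte, toy_wrprotect_one) are hypothetical, the counts are passed in as plain integers, and clearing `present` only stands in for ptep_clear_flush(); this is a sketch of the logic, not the kernel code.

```c
#include <assert.h>

/* Toy pte: just enough state to model the write-protect decision. */
struct toy_pte {
    unsigned long pfn;
    int present;
    int writable;
};

/*
 * Try to write-protect one mapping of a page. Returns 1 if the pte is
 * mapped and read-only afterwards, 0 otherwise. If the page has
 * references that are not explained by its mappings plus count_offset
 * (for example in-flight O_DIRECT I/O), the pte is restored unchanged
 * and *odirect_sync is cleared.
 */
static int toy_wrprotect_one(struct toy_pte *pte, int mapcount, int page_count,
                             int count_offset, int *odirect_sync)
{
    struct toy_pte entry;

    if (!pte->present)
        return 0;

    if (pte->writable) {
        /* models ptep_clear_flush(): no new user can get in through this pte */
        entry = *pte;
        pte->present = 0;

        if (mapcount + count_offset != page_count) {
            /* unexpected extra reference: put the pte back untouched */
            *odirect_sync = 0;
            *pte = entry;
            return 0;
        }
        entry.writable = 0;
        *pte = entry;
    }
    return 1;
}
```

The point of clearing the pte before comparing the counts is that a reference taken after the comparison would otherwise go unnoticed, which is the race described in the note above.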

Signed-off-by: Izik Eidus <iei...@redhat.com>
---
 include/linux/rmap.h |   11 
 mm/rmap.c            |  139 ++
 2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..469376d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
 #else  /* !CONFIG_MMU */
 
 #define anon_vma_init()	do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page)
 	return 0;
 }
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+				 int count_offset)
+{
+	return 0;
+}
+#endif
 
 #endif /* CONFIG_MMU */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..95c55ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+			      int *odirect_sync, int count_offset)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl, 0);
+	if (!pte)
+		goto out;
+
+	if (pte_write(*pte)) {
+		pte_t entry;
+
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		/*
+		 * Ok this is tricky: get_user_pages_fast() doesn't take
+		 * any lock when it runs, therefore the check that we are
+		 * going to make of the pagecount against the mapcount is
+		 * racy and O_DIRECT can happen right after the check.
+		 * So we clear the pte and flush the tlb before the check;
+		 * this assures us that no O_DIRECT can happen after the
+		 * check or in the middle of the check.
+		 */
+		entry = ptep_clear_flush(vma, address, pte);
+		/*
+		 * Check that no O_DIRECT or similar I/O is in progress on the
+		 * page
+		 */
+		if ((page_mapcount(page) + count_offset) != page_count(page)) {
+			*odirect_sync = 0;
+			set_pte_at_notify(mm, address, pte, entry);
+			goto out_unlock;
+		}
+		entry = pte_wrprotect(entry);
+		set_pte_at_notify(mm, address, pte, entry);
+	}
+	ret = 1;
+
+out_unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+			       int count_offset)
+{
+	struct address_space *mapping;
+	struct prio_tree_iter iter;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	int ret = 0;
+
+	mapping = page_mapping(page);
+	if (!mapping)
+		return ret;
+
+	spin_lock(&mapping->i_mmap_lock);
+
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+		ret += page_wrprotect_one(page, vma, odirect_sync,
+					  count_offset);
+
+	spin_unlock(&mapping->i_mmap_lock);
+
+	return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+			       int count_offset)
+{
+	struct vm_area_struct *vma;
+	struct anon_vma *anon_vma;
+	int ret = 0;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return ret;
+
+	/*
+	 * If the page is inside the swap cache, its _count number was
+	 * increased by one, therefore we have to increase 

[PATCH 4/4] add ksm kernel shared memory driver.

2009-04-08 Thread Izik Eidus
Ksm is a driver that allows merging identical pages between one or more
applications in a way that is invisible to the applications that use it.
Pages that are merged are marked as read-only and are COWed when any
application tries to change them.

Ksm is used for cases where using fork() is not suitable;
one such case is where the pages of the application keep changing
dynamically and the application cannot know in advance which pages are
going to be identical.

Ksm works by walking over the memory pages of the applications it
scans in order to find identical pages.
It uses two sorted data structures, called the stable and unstable trees,
to find identical pages efficiently.
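
To illustrate what "sorted by page content" means, here is a rough
userspace sketch that orders nodes with memcmp() over the whole page.
It uses a plain unbalanced binary search tree rather than the RB-tree
the driver uses, and struct page_node, search_insert() and PAGE_SZ are
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096	/* assumed page size for the sketch */

/* Hypothetical tree node: the key is the full content of the page. */
struct page_node {
	const unsigned char *data;	/* points at PAGE_SZ bytes */
	struct page_node *left, *right;
};

/*
 * Search the tree for a page with identical content. If one exists it
 * is returned and the tree is left unchanged; otherwise 'node' is
 * linked into the tree and NULL is returned.
 */
static struct page_node *search_insert(struct page_node **root,
				       struct page_node *node)
{
	struct page_node **link = root;

	assert(root && node && node->data);
	while (*link) {
		int cmp = memcmp(node->data, (*link)->data, PAGE_SZ);

		if (cmp == 0)
			return *link;	/* identical page: merge candidate */
		link = cmp < 0 ? &(*link)->left : &(*link)->right;
	}
	node->left = node->right = NULL;
	*link = node;
	return NULL;
}
```

Because the key is the page content itself, a page that is written to
after insertion sits at the wrong place in the tree, which is why only
write-protected pages can live in a tree that has to stay consistent.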

When ksm finds two identical pages, it marks them as read-only and merges
them into a single page.
After the pages are marked as read-only and merged into one page, Linux
treats these pages as normal copy-on-write pages and gives the writer its
own copy when a write access happens to them.
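
The merge and the later copy-on-write break can be pictured with a small
userspace model. This is a sketch under simplifying assumptions (a
reference count instead of page tables, malloc() instead of the page
allocator); struct shared_page, struct mapping, try_merge() and
write_byte() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SZ 4096	/* assumed page size for the sketch */

struct shared_page {
	unsigned char data[PAGE_SZ];
	int refs;		/* how many mappings point at this page */
};

struct mapping {
	struct shared_page *page;
	bool readonly;
};

/* Merge b's page into a's page if the contents are identical. */
static bool try_merge(struct mapping *a, struct mapping *b)
{
	assert(a && b && a->page && b->page);
	if (a->page == b->page)
		return false;			/* already shared */
	if (memcmp(a->page->data, b->page->data, PAGE_SZ) != 0)
		return false;
	if (--b->page->refs == 0)
		free(b->page);			/* duplicate is no longer needed */
	b->page = a->page;
	a->page->refs++;
	a->readonly = b->readonly = true;	/* writes must fault from now on */
	return true;
}

/* Write one byte; a write to a shared page gets a private copy first. */
static int write_byte(struct mapping *m, size_t off, unsigned char val)
{
	assert(m && m->page && off < PAGE_SZ);
	if (m->readonly) {
		if (m->page->refs > 1) {
			struct shared_page *copy = malloc(sizeof(*copy));

			if (!copy)
				return -1;
			memcpy(copy->data, m->page->data, PAGE_SZ);
			copy->refs = 1;
			m->page->refs--;
			m->page = copy;
		}
		m->readonly = false;
	}
	m->page->data[off] = val;
	return 0;
}
```

After a merge both mappings read the same memory; the first write through
either of them gets a private copy and the other mapping keeps seeing the
old content.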

Ksm scans only memory areas that were registered to be scanned by it.

Ksm api:

KSM_GET_API_VERSION:
Gives userspace the api version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
Creates a shared memory region fd, which later allows the user to register
the memory region to scan by using:
KSM_REGISTER_MEMORY_REGION and KSM_REMOVE_MEMORY_REGION

KSM_REGISTER_MEMORY_REGION:
Registers a userspace virtual address range to be scanned by ksm.
This ioctl uses the ksm_memory_region structure:
ksm_memory_region:
__u32 npages;
 number of pages to share inside this memory region.
__u32 pad;
__u64 addr;
 the beginning of the virtual address of this region.
__u64 reserved_bits;
 reserved bits for future usage.

KSM_REMOVE_MEMORY_REGION:
Removes a memory region from ksm.
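
From userspace, the ioctl sequence described above might look roughly like
the following. It is a sketch only: the structure and ioctl numbers are
copied from the proposed include/linux/ksm.h (with stdint types in place
of __u32/__u64 so it builds without the patched header), while
ksm_region() and ksm_register() are hypothetical helpers. Rounding the
length up to whole pages and passing a page-aligned address are
assumptions, not something the description states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Copied from the proposed header, with stdint types substituted. */
struct ksm_memory_region {
	uint32_t npages;	/* number of pages to share */
	uint32_t pad;
	uint64_t addr;		/* the beginning of the virtual address */
	uint64_t reserved_bits;
};

#define KSMIO 0xAB
#define KSM_CREATE_SHARED_MEMORY_AREA	_IO(KSMIO, 0x01)
#define KSM_REGISTER_MEMORY_REGION	_IOW(KSMIO, 0x20, struct ksm_memory_region)

/* Hypothetical helper: describe [addr, addr + len) in whole pages. */
static struct ksm_memory_region ksm_region(const void *addr, size_t len,
					   size_t page_size)
{
	struct ksm_memory_region r = { 0 };

	assert(page_size > 0);
	r.npages = (uint32_t)((len + page_size - 1) / page_size);
	r.addr = (uint64_t)(uintptr_t)addr;
	return r;
}

/*
 * Hypothetical helper: ksm_fd is an open fd for /dev/ksm. Creates a
 * shared memory area and registers the buffer with it.
 * Returns the SMA fd, or -1 if either ioctl fails.
 */
static int ksm_register(int ksm_fd, const void *addr, size_t len)
{
	struct ksm_memory_region r;
	long psz = sysconf(_SC_PAGESIZE);
	int sma;

	if (psz <= 0)
		return -1;
	sma = ioctl(ksm_fd, KSM_CREATE_SHARED_MEMORY_AREA);
	if (sma < 0)
		return -1;
	r = ksm_region(addr, len, (size_t)psz);
	if (ioctl(sma, KSM_REGISTER_MEMORY_REGION, &r) < 0) {
		close(sma);
		return -1;
	}
	return sma;
}
```

A caller would open /dev/ksm with open(2), pass the resulting fd together
with the buffer, and later issue KSM_REMOVE_MEMORY_REGION on the returned
SMA fd. On a kernel without this patch the first ioctl simply fails and
the helper returns -1.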

Signed-off-by: Izik Eidus iei...@redhat.com
Signed-off-by: Chris Wright chr...@redhat.com
Signed-off-by: Andrea Arcangeli aarca...@redhat.com
---
 include/linux/ksm.h        |   48 ++
 include/linux/miscdevice.h |    1 +
 mm/Kconfig                 |    6 +
 mm/Makefile                |    1 +
 mm/ksm.c                   | 1674 
 5 files changed, 1730 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..2c11e9a
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#include <asm/types.h>
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+	__u32 npages; /* number of pages to share */
+	__u32 pad;
+	__u64 addr; /* the beginning of the virtual address */
+	__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION              _IO(KSMIO,   0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory region fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA    _IO(KSMIO,   0x01) /* return SMA fd */
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by ksm.
+ */
+#define KSM_REGISTER_MEMORY_REGION       _IOW(KSMIO,  0x20,\
+                                              struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION         _IO(KSMIO,   0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index beb6ec9..297c0bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
 #define HPET_MINOR 228
 #define FUSE_MINOR 229
 #define KVM_MINOR  232
+#define KSM_MINOR  233
 #define MISC_DYNAMIC_MINOR 255
 
 struct device;
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
 	bool
+
+config KSM
+	tristate "Enable KSM for page sharing"
+	help
+	  Enable the KSM kernel module to allow page sharing of equal pages
+	  among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..a15a92d
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1674 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages 

Re: one question about virualization and kvm

2009-04-08 Thread Vasiliy Tolstov
On Wed, 01/04/2009 at 09:01 -0500, Javier Guerra wrote:
 On Wed, Apr 1, 2009 at 7:27 AM, Vasiliy Tolstov v.tols...@selfip.ru wrote:
  Hello!
  I have two containers with os linux. All files in /usr and /bin are
  identical.
  Is that possible to mount/bind /usr and /bin to containers? (not copy
  all files to containers).. ?
 
 the problem (and solution) is exactly the same as if they weren't
 virtual machines, but real machines: use the network.
 
 simply share the directories with NFS and mount them in your initrd
 scripts (preferably read/only).
 
 other way would be to set a new image file with a copy of the
 directories, and mount them on both virtual machines.  of course, now
 you MUST mount them as readonly.  and you can't change anything there
 without unmounting from both VMs.
 
 usually it's not worth it, unless you have tens of identical VMs
 

Thank you for the answer. But if I store 100-200 kvm guests on one host
system and mount all shared resources via NFS, can this slow down my
system?
I only need read-only access to the shared files (only /home and /etc/
are different).


-- 
Vasiliy Tolstov v.tols...@selfip.ru
Selfip.Ru



RE: [PATCH] Map guest (ts,tid) to individual shadow tid

2009-04-08 Thread Liu Yu-B13201

 -Original Message-
 From: kvm-ppc-ow...@vger.kernel.org 
 [mailto:kvm-ppc-ow...@vger.kernel.org] On Behalf Of Liu Yu-B13201
 Sent: Thursday, April 09, 2009 11:45 AM
 To: Hollis Blanchard
 Cc: kvm-ppc@vger.kernel.org
 Subject: RE: [PATCH] Map guest (ts,tid) to individual shadow tid
 
  -Original Message-
  From: Hollis Blanchard [mailto:holl...@us.ibm.com] 
  Sent: Thursday, April 09, 2009 12:43 AM
  To: Liu Yu-B13201
  Cc: kvm-ppc@vger.kernel.org
  Subject: Re: [PATCH] Map guest (ts,tid) to individual shadow tid
  
  On Tuesday 07 April 2009 21:11:11 Liu Yu-B13201 wrote:
   
-Original Message-
From: Hollis Blanchard [mailto:holl...@us.ibm.com] 
Sent: Tuesday, April 07, 2009 11:41 PM
To: Liu Yu-B13201
Cc: kvm-ppc@vger.kernel.org
Subject: Re: [PATCH] Map guest (ts,tid) to individual shadow tid

On Tue, 2009-04-07 at 17:51 +0800, Liu Yu wrote:
 Hi guys,
 
 Thanks to Hollis's idea.
 This patch help to make KVMPPC be more friendly to OSes 
other than Linux.

Wow, that was quick. :)

Have you tested this code? Doesn't this break the assumption in
kvmppc_e500_tlb1_invalidate() (and probably elsewhere) 
  that stid ==
gtid?
   
   Yes, have taken a simple test.
  
  Once we can reduce the number of TLB flushes (see below) it will be 
  interesting to see if there's a performance impact.
  
  Good catch, it needs to handle it here in 
  kvmppc_e500_tlb1_invalidate().
  Thanks.
  But it's ok for now, because TLB1 only contains tid=0 
  entries, and tid=0
  always be mapped to stid=0.
  That's why the test is fine...
  
  OK. Still, it makes me nervous to break such a simple 
  assumption. We should 
  introduce nice accessors to make it difficult to code it 
  wrong in the future. 
  I'm honestly surprised that kvmppc_e500_tlb1_invalidate() is 
  the only affected 
  site.
 
 It is the only site. We don't pay much attention to the shadow tid.
 For 500, the shadow tid is inherited from the guest tid.
 As a guest tlb1 mapping is broken into 4K shadow mappings,
 kvmppc_e500_tlb1_invalidate() needs to check the tid to find 
 all 4K shadow mappings related to a guest mapping.
 
 Actually, I have been thinking about removing all shadow tlb 
 like 440 did.
 This may be helpful for subsequent work such as huge tlb mapping.
 After doing that, kvmppc_e500_tlb1_invalidate() won't have 
 this assumption.
 
 Anyway, the patch is an RFC, not aimed at getting applied.
 Just something to discuss. :)
 

Well, I missed another site...
The host pid should be updated to the stid in two cases:
1. the guest accesses SPRN_PID
2. the guest switches between kernel and userspace (if we map guest kernel
[tid=0] to a non-zero stid)
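
To make the discussion concrete, a (ts,tid) to stid table could look
roughly like this. It is a sketch of the general idea only, not the code
from the patch: struct stid_map, stid_map_init() and stid_lookup() are
hypothetical, the 8-bit id ranges are assumed, and resetting the whole
table when the ids run out is just one possible policy.

```c
#include <assert.h>
#include <string.h>

#define NR_GTID 256	/* assumed: 8-bit guest tid */
#define NR_STID 256	/* assumed: 8-bit shadow tid */

struct stid_map {
	unsigned short stid[2][NR_GTID];	/* indexed by [ts][gtid]; 0 = unmapped */
	unsigned int next;			/* next free shadow tid */
	unsigned int flushes;			/* how often the table was reset */
};

static void stid_map_init(struct stid_map *m)
{
	memset(m, 0, sizeof(*m));
	m->next = 1;	/* 0 is kept as the "unmapped" marker */
}

/*
 * Return the shadow tid for a guest (ts, tid) pair, allocating one on
 * first use. When the shadow ids run out the table is reset; that is
 * the point where the shadow TLB would have to be flushed as well.
 */
static unsigned int stid_lookup(struct stid_map *m, int ts, unsigned int gtid)
{
	assert(m && (ts == 0 || ts == 1) && gtid < NR_GTID);
	if (!m->stid[ts][gtid]) {
		if (m->next >= NR_STID) {
			memset(m->stid, 0, sizeof(m->stid));
			m->next = 1;
			m->flushes++;
		}
		m->stid[ts][gtid] = (unsigned short)m->next++;
	}
	return m->stid[ts][gtid];
}
```

In both cases listed above the host pid would then be reloaded with
stid_lookup() for the new (ts, tid) pair.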