Re: [kvm-devel] stable distro for kvm?

2008-02-13 Thread Alexey Eremenko
-Original Message-
From: [EMAIL PROTECTED] on behalf of Andrey Dmitriev
Sent: Tue 2/12/2008 11:20 PM
To: kvm-devel@lists.sourceforge.net
Subject: [kvm-devel] stable distro for kvm?

Any recommendations or link to plans for a stable KVM with any major distro? 

Latest KVM (KVM-60) is stable; Fedora7+/RHEL5+ have good support for KVM. 
openSUSE/SLES 11.x will likely have good KVM support too.

I've read somewhere that Ubuntu will support it soon (how soon?) but I thought 
it was based on debian, and it doesn't seem to have it as part of etch yet (if 
I switch to unstable, I seem to be able to get it) 
  
Debian Etch (Stable) was feature-frozen before KVM was released. (Debian Etch 
was released in 2007, but it had been in feature-freeze since the summer of 2006.)
As with any Debian Stable release, it takes time for new features to get in; they 
need to be ready nearly a year before they can land in Debian Stable. Both 
Ubuntu 8.04 LTS and Debian Lenny will have good KVM support.

openSUSE 10.3, Ubuntu 7.04, and Ubuntu 7.10 include an early KVM (basic support), 
so they have some bugs.
In any case, I would recommend installing KVM-60, because it is more stable.
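For reference, building the standalone kvm-60 release is roughly the following 
(assuming the standard kvm-60.tar.gz tarball from SourceForge; exact steps and 
prerequisites vary per distro):

    tar xzf kvm-60.tar.gz && cd kvm-60
    ./configure --prefix=/usr/local/kvm
    make && sudo make install
    # swap in the freshly built modules (use kvm-intel on Intel hosts)
    sudo modprobe -r kvm-amd kvm
    sudo modprobe kvm-amd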

-Alexey Technologov, Qumranet QA Team Member.


[kvm-devel] Clock off in guest

2008-02-13 Thread Koen Vermeer
Hi,

I'm running a Linux AMD64 guest on an AMD64 host. The host is running a
2.6.23 kernel (self compiled), the guest is running a stock
linux-image-2.6.22-3-amd64 Debian kernel.

My problem is that the clock on the guest is off (slow), while the clock
on the host seems to be OK. When doing 'time sleep 10' on the guest, it
takes about 16 'real' seconds for it to finish.
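(A rough way to quantify the drift, in case it helps: take timestamps on the guest 
and on the host over the same interval and compare the deltas, e.g.

    date +%s.%N     # run on guest and host, then again a few minutes later

If the guest's delta is noticeably smaller than the host's, the guest clock is 
running slow by roughly that ratio.)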

I can change the host kernel, the guest kernel or fiddle around with kvm
command line options, but I would like some guidance where to start.
Which option has the largest probability of solving my problem? Let me
know if you need some dmesg output or some other log data; I didn't want
to flood the list with random logs of both the host and the guest.

Best,
Koen





Re: [kvm-devel] [PATCH 3/3] KVM: SVM: enable LBRV virtualization

2008-02-13 Thread Avi Kivity
Joerg Roedel wrote:
  
   
 This still has the same issue as the previous patchset: if the guest 
 enables some other bit in MSR_IA32_DEBUGCTLMSR, we silently ignore it. 
 We should either pr_unimpl() on such bits or not handle them (ultimately 
 injecting a #GP).
 

 That's not true. The patch saves the MSR value in vmcb->save.dbgctl. This
 value is returned on reads of that MSR, so no bit is ignored. This value
 in the VMCB is also used as the guest's copy of that MSR if LBR
 virtualization is enabled. 

Right, my mistake.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [PATCH 3/3] KVM: SVM: enable LBRV virtualization

2008-02-13 Thread Joerg Roedel
On Wed, Feb 13, 2008 at 11:50:58AM +0200, Avi Kivity wrote:
 Joerg Roedel wrote:
 @@ -1224,6 +1261,15 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
  if (data != 0)
  goto unhandled;
  break;
 +case MSR_IA32_DEBUGCTLMSR:
 +svm->vmcb->save.dbgctl = data;
 +if (!svm_has(SVM_FEATURE_LBRV))
 +break;
 +if (data & (1ULL << 0))
 +svm_enable_lbrv(svm);
 +else
 +svm_disable_lbrv(svm);
 +break;
  default:
  unhandled:
  return kvm_set_msr_common(vcpu, ecx, data);
   
 
 This still has the same issue as the previous patchset: if the guest enables 
 some other bit in MSR_IA32_DEBUGCTLMSR, we silently ignore it. We should 
 either pr_unimpl() on such bits or not handle them (ultimately injecting a #GP).

That's not true. The patch saves the MSR value in vmcb->save.dbgctl. This
value is returned on reads of that MSR, so no bit is ignored. This value
in the VMCB is also used as the guest's copy of that MSR if LBR
virtualization is enabled. There is another issue, though: I should ensure the
guest does not set reserved bits in that MSR.
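A minimal sketch of that check (the mask matches the DEBUGCTL_RESERVED_BITS
definition used in the reposted patch later in this thread; returning non-zero
from svm_set_msr() is what lets the wrmsr handler inject #GP):

	#define DEBUGCTL_RESERVED_BITS (~(0x3fULL))

	case MSR_IA32_DEBUGCTLMSR:
		if (data & DEBUGCTL_RESERVED_BITS)
			return 1;		/* caller injects #GP */
		svm->vmcb->save.dbgctl = data;
		/* ...enable/disable LBR virtualization as before... */
		break;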

 Also, I'd like a simple patch for 2.6.25 to add support for Windows x86 on 
 AMD.  So if the 
 first patch in the series can add support for the bits that Windows sets in 
 MSR_IA32_DEBUGCTLMSR (I imagine it just writes zero?) then I can queue that 
 for 2.6.25 and 
 the rest for 2.6.26.

Ok, I will work that into the patchset.

Joerg

-- 
   |   AMD Saxony Limited Liability Company & Co. KG
 Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System|  Register Court Dresden: HRA 4896
 Research  |  General Partner authorized to represent:
 Center| AMD Saxony LLC (Wilmington, Delaware, US)
   | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy





Re: [kvm-devel] [PATCH 3/3] KVM: SVM: enable LBRV virtualization

2008-02-13 Thread Avi Kivity
Joerg Roedel wrote:
 @@ -1224,6 +1261,15 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
   if (data != 0)
   goto unhandled;
   break;
 + case MSR_IA32_DEBUGCTLMSR:
 + svm->vmcb->save.dbgctl = data;
 + if (!svm_has(SVM_FEATURE_LBRV))
 + break;
 + if (data & (1ULL << 0))
 + svm_enable_lbrv(svm);
 + else
 + svm_disable_lbrv(svm);
 + break;
   default:
   unhandled:
   return kvm_set_msr_common(vcpu, ecx, data);
   

This still has the same issue as the previous patchset:  if the guest 
enables some other bit in MSR_IA32_DEBUGCTLMSR, we silently ignore it.  
We should either pr_unimpl() on such bits or not handle them (ultimately 
injecting a #GP).

Also, I'd like a simple patch for 2.6.25 to add support for Windows x86 
on AMD.  So if the first patch in the series can add support for the 
bits that Windows sets in MSR_IA32_DEBUGCTLMSR (I imagine it just writes 
zero?) then I can queue that for 2.6.25 and the rest for 2.6.26.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [PATCH 1/2] kvmclock - the host part.

2008-02-13 Thread Avi Kivity
Glauber de Oliveira Costa wrote:
 This is the host part of kvm clocksource implementation. As it does
 not include clockevents, it is a fairly simple implementation. We
 only have to register a per-vcpu area, and start writting to it periodically.

 The area is binary compatible with xen, as we use the same shadow_info 
 structure.

   

 +static void kvm_write_wall_clock(struct kvm_vcpu *v, gpa_t wall_clock)
 +{
 + int version = 1;
 + struct kvm_wall_clock wc;
 + unsigned long flags;
 + struct timespec wc_ts;
 +
 + local_irq_save(flags);
 + kvm_get_msr(v, MSR_IA32_TIME_STAMP_COUNTER,
 +   &v->arch.hv_clock.tsc_timestamp);
   

Why is this needed? IIRC the wall clock is not tied to any vcpu.

If we can remove this, the argument to the function should be kvm, not 
kvm_vcpu. We can remove the irq games as well.

 + wc_ts = current_kernel_time();
 + local_irq_restore(flags);
 +
 + down_write(&current->mm->mmap_sem);
 + kvm_write_guest(v->kvm, wall_clock, &version, sizeof(version));
 + up_write(&current->mm->mmap_sem);
   

Why down_write? accidentally or on purpose?

For mutual exclusion, I suggest taking kvm->lock instead (for the entire 
function).

 +
 + /* With all the info we got, fill in the values */
 + wc.wc_sec = wc_ts.tv_sec;
 + wc.wc_nsec = wc_ts.tv_nsec;
 + wc.wc_version = ++version;
 +
 + down_write(&current->mm->mmap_sem);
 + kvm_write_guest(v->kvm, wall_clock, &wc, sizeof(wc));
 + up_write(&current->mm->mmap_sem);
   

Should be in three steps: write version, write data, write version. 
kvm_write_guest doesn't guarantee any order. It may fail as well, and we 
need to handle that.
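A sketch of that sequence, assuming the function takes struct kvm *kvm as suggested
above (offsets and error handling are illustrative only, not the actual patch):

	version |= 1;				/* odd: update in progress */
	if (kvm_write_guest(kvm, wall_clock, &version, sizeof(version)))
		return;
	wc.wc_sec  = wc_ts.tv_sec;
	wc.wc_nsec = wc_ts.tv_nsec;
	if (kvm_write_guest(kvm, wall_clock + sizeof(version),
			    &wc.wc_sec, sizeof(wc.wc_sec) + sizeof(wc.wc_nsec)))
		return;
	version++;				/* even again: guest sees consistent data */
	kvm_write_guest(kvm, wall_clock, &version, sizeof(version));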

  
 +/* xen binary-compatible interface. See xen headers for details */
 +struct kvm_vcpu_time_info {
 + uint32_t version;
 + uint32_t pad0;
 + uint64_t tsc_timestamp;
 + uint64_t system_time;
 + uint32_t tsc_to_system_mul;
 + int8_t   tsc_shift;
 +}; /* 32 bytes */
 +
 +struct kvm_wall_clock {
 + uint32_t wc_version;
 + uint32_t wc_sec;
 + uint32_t wc_nsec;
 +};
 +
   

These structures are dangerously sized. Suggest __attribute__((__packed__)) 
(or some padding at the end of kvm_vcpu_time_info).
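For illustration, the explicit-padding variant of that suggestion, which keeps the
32-byte size noted in the comment above (the alternative is simply tagging the
struct __attribute__((__packed__))):

	struct kvm_vcpu_time_info {
		uint32_t version;
		uint32_t pad0;
		uint64_t tsc_timestamp;
		uint64_t system_time;
		uint32_t tsc_to_system_mul;
		int8_t   tsc_shift;
		int8_t   pad1[3];	/* explicit tail padding: sizeof is 32 everywhere */
	};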

 diff --git a/include/linux/kvm.h b/include/linux/kvm.h
 index 4de4fd2..78ce53f 100644
 --- a/include/linux/kvm.h
 +++ b/include/linux/kvm.h
 @@ -232,6 +232,7 @@ #define KVM_CAP_USER_MEMORY 3
  #define KVM_CAP_SET_TSS_ADDR 4
  #define KVM_CAP_EXT_CPUID 5
  #define KVM_CAP_VAPIC 6
 +#define KVM_CAP_CLOCKSOURCE 7
  
   

Please refresh against kvm.git, this has changed a bit.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] Clock off in guest

2008-02-13 Thread Dan Kenigsberg
On Wed, Feb 13, 2008 at 10:41:44AM +0100, Koen Vermeer wrote:
 Hi,
 
 I'm running an Linux AMD64 guest on an AMD64 host. The host is running a
 2.6.23 kernel (self compiled), the guest is running a stock
 linux-image-2.6.22-3-amd64 Debian kernel.
 
 My problem is that the clock on the guest is off (slow), while the clock
 on the host seems to be OK. When doing 'time sleep 10' on the guest, it
 takes about 16 'real' seconds for it to finish.

What is the clock source of your guest? If it is the traditional PIT,
you describe a known problem in busy hosts. You may try to set
clocksource=tsc in the guest kernel parameters.
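For illustration (the sysfs path assumes a guest kernel new enough to expose it,
and the GRUB entry is only an example):

    # inside the guest: which clocksource is in use, and which are available
    cat /sys/devices/system/clocksource/clocksource0/current_clocksource
    cat /sys/devices/system/clocksource/clocksource0/available_clocksource

    # to force tsc, append clocksource=tsc to the guest kernel command line, e.g.:
    kernel /vmlinuz-2.6.22-3-amd64 root=/dev/hda1 ro quiet clocksource=tsc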

Regards,

Dan.



[kvm-devel] Signals for file descriptors

2008-02-13 Thread Anders Melchiorsen
I am wondering about this commit,

http://git.kernel.org/?p=virt/kvm/kvm-userspace.git;a=commit;h=b4e392c21c4b98c1c13af353caa3d6e6bcb6b8af

which adds signals on tap I/O. It seems a bit half-done to me. For one
thing, it is mixing timers with I/O.

Anyway, my question is about the remaining file descriptors. Should
signals be activated for them as well, for example in
qemu_set_fd_handler2() ? My example on hand is that connecting a VNC
client currently delays until the next timer expiry.


Anders.



Re: [kvm-devel] Clock off in guest

2008-02-13 Thread Uri Lublin
From: [EMAIL PROTECTED] on behalf of Dan Kenigsberg
Sent: Wed 13/02/2008 13:25
To: Koen Vermeer
Cc: kvm-devel@lists.sourceforge.net
Subject: Re: [kvm-devel] Clock off in guest
 
On Wed, Feb 13, 2008 at 10:41:44AM +0100, Koen Vermeer wrote:
 Hi,
 
 I'm running an Linux AMD64 guest on an AMD64 host. The host is running a
 2.6.23 kernel (self compiled), the guest is running a stock
 linux-image-2.6.22-3-amd64 Debian kernel.
 
 My problem is that the clock on the guest is off (slow), while the clock
 on the host seems to be OK. When doing 'time sleep 10' on the guest, it
 takes about 16 'real' seconds for it to finish.

What is the clock source of your guest? If it is the traditional PIT,
you describe a known problem in busy hosts. You may try to set
clocksource=tsc in the guest kernel parameters.

Regards,

Dan.

Hi,

This would not work if you are using an old version of kvm (with no 
in-kernel apic).
I recommend upgrading to kvm-60 (or the latest Linux kernel).
Alternatively (probably not as good), when the guest's clocksource is PIT, 
adding '-tdf' to the command line sometimes helps.

Uri.


Re: [kvm-devel] Signals for file descriptors

2008-02-13 Thread Avi Kivity
Anders Melchiorsen wrote:
 I am wondering about this commit,

 http://git.kernel.org/?p=virt/kvm/kvm-userspace.git;a=commit;h=b4e392c21c4b98c1c13af353caa3d6e6bcb6b8af

 which adds signals on tap I/O. It seems a bit half-done to me. For one
 thing, it is mixing timers with I/O.

   

The signal handler doesn't actually matter; all that's needed is to 
break out of the loop.

 Anyway, my question is about the remaining file descriptors. Should
 signals be activated for them as well, for example in
 qemu_set_fd_handler2() ? My example on hand is that connecting a VNC
 client currently delays until the next timer expiry.

In practice it doesn't matter, but yes, any fd which we will select() 
needs to have a signal attached.
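For illustration, arming an fd that way is plain fcntl(); this mirrors what the tap
patch does, but the helper name below is made up, not qemu's:

	#include <fcntl.h>
	#include <unistd.h>

	static void arm_fd_sigio(int fd)	/* illustrative name */
	{
		int flags;

		fcntl(fd, F_SETOWN, getpid());		/* deliver SIGIO to this process */
		flags = fcntl(fd, F_GETFL);
		fcntl(fd, F_SETFL, flags | O_ASYNC);	/* readiness now raises SIGIO */
	}

As noted above, the handler body is irrelevant; the delivery alone is what breaks
the loop out of its wait.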

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] Signals for file descriptors

2008-02-13 Thread Anders Melchiorsen
Avi Kivity [EMAIL PROTECTED] writes:

 In practice it doesn't matter, but yes, any fd which we will
 select() needs to have a signal attached.

It matters to me, because I am removing periodic timers, and so I
ended up where I could not attach with VNC at all (well, strace would
break the loop).

With your answers in mind, I will prepare a patch to add this.


Thanks,
Anders.



Re: [kvm-devel] Clock off in guest

2008-02-13 Thread koen
 This would not work if you are using an old version of kvm ( with no
 in-kernel-apic )
 I recommend upgrading to kvm-60 (or latest linux kernel).

Should I upgrade the guest kernel or the host kernel? My bet is the host
kernel, but the clocksource=tsc applies to the guest, so I'm not really
sure...

 Or as an alternative, probably not as good, sometimes (when the guest's
 clocksource is PIT) adding '-tdf' to the command line helps.

I cannot find this in man kvm or man qemu. Should I add this to the
command line that starts the guest?

Best,
Koen




Re: [kvm-devel] Clock off in guest

2008-02-13 Thread koen
Hi Dan,

 On Wed, Feb 13, 2008 at 10:41:44AM +0100, Koen Vermeer wrote:
 I'm running an Linux AMD64 guest on an AMD64 host. The host is running a
 2.6.23 kernel (self compiled), the guest is running a stock
 linux-image-2.6.22-3-amd64 Debian kernel.
 My problem is that the clock on the guest is off (slow), while the clock
 on the host seems to be OK. When doing 'time sleep 10' on the guest, it
 takes about 16 'real' seconds for it to finish.
 What is the clock source of your guest? If it is the traditional PIT,
 you describe a known problem in busy hosts. You may try to set
 clocksource=tsc in the guest kernel parameters.

Thanks for your reply. I tried looking for PIT in dmesg, which didn't give
me anything. Then I tried 'dmesg|grep -i tsc' and I got

Time: tsc clocksource has been installed.

I then ran 'dmesg|grep -i clock' and got

ACPI: PM-Timer IO Port: 0xb008
time.c: Detected 1803.751 MHz processor.
Calibrating delay using timer specific routine.. 14442.19 BogoMIPS
(lpj=28884398)
Using local APIC timer interrupts.
Detected 62.628 MHz APIC timer.
* Found PM-Timer Bug on the chipset. Due to workarounds for a bug,
Time: tsc clocksource has been installed.
Real Time Clock Driver v1.12ac
PCI: Setting latency timer of device :00:01.1 to 64
PCI: Setting latency timer of device :00:01.2 to 64

The kernel command line is rather simple: root=/dev/hda1 ro quiet

Does that answer your question, or should I be looking somewhere else?

Thanks!

Koen




Re: [kvm-devel] [patch 0/6] MMU Notifiers V6

2008-02-13 Thread Jack Steiner
 GRU
 - Simple additional hardware TLB (possibly covering multiple instances of
   Linux)
 - Needs TLB shootdown when the VM unmaps pages.
 - Determines page address via follow_page (from interrupt context) but can
   fall back to get_user_pages().
 - No page reference possible since no page status is kept.

I applied the latest mmuops patch to a 2.6.24 kernel and updated the
GRU driver to use it. As far as I can tell, everything works ok.
Although more testing is needed, all current tests of driver functionality
are working on both a system simulator and a hardware simulator.

The driver itself is still a few weeks from being ready to post but I can
send code fragments of the portions related to mmuops or external TLB
management if anyone is interested.


--- jack



Re: [kvm-devel] Clock off in guest

2008-02-13 Thread koen
 On Wed, Feb 13, 2008 at 02:00:38PM +0100, [EMAIL PROTECTED] wrote:
  What is the clock source of your guest? If it is the traditional PIT,
  you describe a known problem in busy hosts. You may try to set
  clocksource=tsc in the guest kernel parameters.
 Thanks for your reply. I tried looking for PIT in dmesg, which didn't
 give
 me anything. Then I tried 'dmesg|grep -i tsc' and I got
 Time: tsc clocksource has been installed.
 If this was done in the guest then, yes, your guest's clocksource is
 tsc.

Yes, this was done on the guest. The host shows

Marking TSC unstable due to TSCs unsynchronized
hpet0: at MMIO 0xfed0, IRQs 2, 8, 31
hpet0: 3 32-bit timers, 2500 Hz
Time: hpet clocksource has been installed.

 However, as Uri mentioned earlier, this is useful only with newer KVMs.
 I assume that your host runs the kvm from 2.6.23 which is pretty old in
 kvm timescale. Try downloading kvm-60, insmod it to your host and try
 running your guest.

I'll try building a kvm-60 module, run the guest with that and report the
results. I cannot do that immediately, though, because people are actually
using the guest system.

Best,
Koen




[kvm-devel] [PATCH] KVM: SVM: fix Windows XP 64 bit installation crash

2008-02-13 Thread Joerg Roedel
During installation, Windows XP 64 bit wants to access the DEBUGCTL and the last
branch record (LBR) MSRs. Not allowing this in KVM causes the installation to
crash. This patch allows access to these MSRs and fixes the issue.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/svm.c |   22 ++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 13765e9..1ef3e7b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1155,6 +1155,24 @@ static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data)
case MSR_IA32_SYSENTER_ESP:
*data = svm->vmcb->save.sysenter_esp;
break;
+   /* Nobody will change the following 5 values in the VMCB so
+  we can safely return them on rdmsr. They will always be 0
+  until LBRV is implemented. */
+   case MSR_IA32_DEBUGCTLMSR:
+   *data = svm->vmcb->save.dbgctl;
+   break;
+   case MSR_IA32_LASTBRANCHFROMIP:
+   *data = svm->vmcb->save.br_from;
+   break;
+   case MSR_IA32_LASTBRANCHTOIP:
+   *data = svm->vmcb->save.br_to;
+   break;
+   case MSR_IA32_LASTINTFROMIP:
+   *data = svm->vmcb->save.last_excp_from;
+   break;
+   case MSR_IA32_LASTINTTOIP:
+   *data = svm->vmcb->save.last_excp_to;
+   break;
default:
return kvm_get_msr_common(vcpu, ecx, data);
}
@@ -1215,6 +1233,10 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
case MSR_IA32_SYSENTER_ESP:
svm->vmcb->save.sysenter_esp = data;
break;
+   case MSR_IA32_DEBUGCTLMSR:
+   pr_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
+   __FUNCTION__, data);
+   break;
case MSR_K7_EVNTSEL0:
case MSR_K7_EVNTSEL1:
case MSR_K7_EVNTSEL2:
-- 
1.5.3.7






Re: [kvm-devel] Clock off in guest

2008-02-13 Thread Dan Kenigsberg
On Wed, Feb 13, 2008 at 02:00:38PM +0100, [EMAIL PROTECTED] wrote:
 Hi Dan,
 
  On Wed, Feb 13, 2008 at 10:41:44AM +0100, Koen Vermeer wrote:
  I'm running an Linux AMD64 guest on an AMD64 host. The host is running a
  2.6.23 kernel (self compiled), the guest is running a stock
  linux-image-2.6.22-3-amd64 Debian kernel.
  My problem is that the clock on the guest is off (slow), while the clock
  on the host seems to be OK. When doing 'time sleep 10' on the guest, it
  takes about 16 'real' seconds for it to finish.
  What is the clock source of your guest? If it is the traditional PIT,
  you describe a known problem in busy hosts. You may try to set
  clocksource=tsc in the guest kernel parameters.
 
 Thanks for your reply. I tried looking for PIT in dmesg, which didn't give
 me anything. Then I tried 'dmesg|grep -i tsc' and I got
 
 Time: tsc clocksource has been installed.

If this was done in the guest then, yes, your guest's clocksource is
tsc.

However, as Uri mentioned earlier, this is useful only with newer KVMs.
I assume that your host runs the kvm from 2.6.23 which is pretty old in
kvm timescale. Try downloading kvm-60, insmod it to your host and try
running your guest.

regards,
Dan.



Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christoph Raisch

Chelsio's T3 HW doesn't support this.


For ehca we currently can't modify a large MR when it has been allocated.
EHCA Hardware expects the pages to be there (MRs must not have holes).
This is also true for the global MR covering all kernel space.
Therefore we still need the memory to be pinned if ib_umem_get() is
called.

So with the current implementation we don't have much use for a notifier.


It is difficult to make predictions, especially about the future
Gruss / Regards
Christoph Raisch + Hoang-Nam Nguyen






Re: [kvm-devel] Clock off in guest

2008-02-13 Thread Koen Vermeer
On Wed, 2008-02-13 at 17:55 +0200, Dan Kenigsberg wrote:
 On Wed, Feb 13, 2008 at 02:00:38PM +0100, [EMAIL PROTECTED] wrote:
 However, as Uri mentioned earlier, this is useful only with newer KVMs.
 I assume that your host runs the kvm from 2.6.23 which is pretty old in
 kvm timescale. Try downloading kvm-60, insmod it to your host and try
 running your guest.

OK, I installed version 60 of both kvm and kvm-source (Debian
Lenny), ran m-a a-i kvm, and just to be sure I removed and reloaded the
kvm and kvm-amd modules. modinfo kvm and modinfo kvm-amd show that
these are indeed version 60.

I then restarted the guest (which is exactly the same as before) and I
tried the 'sleep 10' test again. Same result: It takes about 17 or 18
seconds for the prompt to return again.

I assume that clocksource=tsc isn't useful as it is already using the
tsc. So, what else can I try? Any command line parameter I can add to
the kvm call? Kernel parameters in the guest? Update the guest OS to a
newer kernel (2.6.22 to 2.6.24)? Update the host OS to a newer kernel
(2.6.23 to 2.6.24)?

Thanks for the help!

Koen




Re: [kvm-devel] [RFC] Qemu powerpc work around

2008-02-13 Thread Jerone Young

On Wed, 2008-02-13 at 09:29 +0200, Avi Kivity wrote:
 Jerone Young wrote:
  So the recent code in qemu cvs has a problem on powerpc. So what I have done
  is mainly work around this in the build system, by creating a
  ppcemb_kvm-softmmu target. Along with this is a fake-exec.c that stubs
  out the functions that are no longer defined (something done by Anthony
  Liguori attempting to fix qemu cvs). What do folks think about this
  approach? For us, all we really need is a qemu that is not built with a tcg
  dependency.
 

 
 Since a target in qemu is a cpu type, how the instructions are executed 
 (kvm, kqemu, dyngen, or tcg) shouldn't come into it.  Instead we can 
 have a --without-cpu-emulation or --no-tcg which would simply disable 
 those parts.

Actually this is a much more sensible solution. So I took some time and
implemented it.

So on the qemu configure command line you use --disable-cpu-emulation

Signed-off-by: Jerone Young [EMAIL PROTECTED]

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -179,11 +179,17 @@ all: $(PROGS)
 
 #
 # cpu emulator library
-LIBOBJS=exec.o kqemu.o translate-all.o cpu-exec.o\
-translate.o op.o host-utils.o
+LIBOBJS=exec.o kqemu.o cpu-exec.o host-utils.o
+
+ifneq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+= translate-all.o translate.o op.o 
 # TCG code generator
 LIBOBJS+= tcg/tcg.o tcg/tcg-dyngen.o tcg/tcg-runtime.o
 CPPFLAGS+=-I$(SRC_PATH)/tcg -I$(SRC_PATH)/tcg/$(ARCH)
+else
+LIBOBJS+= fake-exec.o
+endif
+
 ifeq ($(USE_KVM), 1)
 LIBOBJS+=qemu-kvm.o
 endif
diff --git a/qemu/configure b/qemu/configure
--- a/qemu/configure
+++ b/qemu/configure
@@ -110,6 +110,7 @@ darwin_user=no
 darwin_user=no
 build_docs=no
 uname_release=
+cpu_emulation=yes
 
 # OS specific
 targetos=`uname -s`
@@ -339,6 +340,8 @@ for opt do
   ;;
   --disable-werror) werror=no
   ;;
+  --disable-cpu-emulation) cpu_emulation=no 
+  ;;
   *) echo ERROR: unknown option $opt; exit 1
   ;;
   esac
@@ -770,6 +773,7 @@ fi
 fi
 echo kqemu support $kqemu
 echo kvm support   $kvm
+echo CPU emulation $cpu_emulation
 echo Documentation $build_docs
 [ ! -z "$uname_release" ] && \
 echo uname -r  $uname_release
@@ -1094,12 +1098,20 @@ interp_prefix1=`echo $interp_prefix | 
 interp_prefix1=`echo $interp_prefix | sed s/%M/$target_cpu/g`
 echo "#define CONFIG_QEMU_PREFIX \"$interp_prefix1\"" >> $config_h
 
+disable_cpu_emulation() {
+  if test $cpu_emulation = no; then
+echo "#define NO_CPU_EMULATION 1" >> $config_h
+echo "NO_CPU_EMULATION=1" >> $config_mak
+  fi
+}
+
 configure_kvm() {
   if test $kvm = yes -a $target_softmmu = yes -a \
  \( $cpu = i386 -o $cpu = x86_64 -o $cpu = ia64 -o $cpu = powerpc \); then
 echo "#define USE_KVM 1" >> $config_h
 echo "USE_KVM=1" >> $config_mak
 echo "CONFIG_KVM_KERNEL_INC=$kernel_path/include" >> $config_mak
+disable_cpu_emulation 
   fi
 }
 
diff --git a/qemu/exec.c b/qemu/exec.c
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -35,7 +35,11 @@
 
 #include "cpu.h"
 #include "exec-all.h"
+
+#if !defined(NO_CPU_EMULATION)
 #include "tcg-target.h"
+#endif
+
 #include "qemu-kvm.h"
 #if defined(CONFIG_USER_ONLY)
 #include "qemu.h"
diff --git a/qemu/fake-exec.c b/qemu/fake-exec.c
new file mode 100644
--- /dev/null
+++ b/qemu/fake-exec.c
@@ -0,0 +1,62 @@
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "cpu.h"
+#include "exec-all.h"
+
+int code_copy_enabled = 0;
+
+void cpu_dump_state (CPUState *env, FILE *f,
+ int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+ int flags)
+{
+}
+
+void ppc_cpu_list (FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...))
+{
+}
+
+void cpu_dump_statistics (CPUState *env, FILE*f,
+  int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+  int flags)
+{
+}
+
+unsigned long code_gen_max_block_size(void)
+{
+return 32;
+}
+
+void cpu_gen_init(void)
+{
+}
+
+int cpu_restore_state(TranslationBlock *tb,
+  CPUState *env, unsigned long searched_pc,
+  void *puc)
+
+{
+return 0;
+}
+
+int cpu_ppc_gen_code(CPUState *env, TranslationBlock *tb, int *gen_code_size_ptr)
+{
+return 0;
+}
+
+const ppc_def_t *cpu_ppc_find_by_name (const unsigned char *name)
+{
+return NULL;
+}
+
+int cpu_ppc_register_internal (CPUPPCState *env, const ppc_def_t *def)
+{
+return 0;
+}
+
+void flush_icache_range(unsigned long start, unsigned long stop)
+{
+}





[kvm-devel] KVM: SVM: Implement LBR virtualization

2008-02-13 Thread Joerg Roedel
This patch set enables virtualization of the last branch record (LBR) in SVM if
this feature is supported by the hardware. Compared to the previous post, the fix
for the XP 64 bit install crash has been removed from this series and posted
separately, to keep it small enough for 2.6.25. This patch set applies on top of
that fix.

Joerg

diffstat:

 arch/x86/kvm/kvm_svm.h |2 +
 arch/x86/kvm/svm.c |  118 ++-
 2 files changed, 77 insertions(+), 43 deletions(-)







[kvm-devel] [PATCH 2/3] KVM: SVM: allocate the MSR permission map per VCPU

2008-02-13 Thread Joerg Roedel
This patch changes the kvm-amd module to allocate the SVM MSR permission map
per VCPU instead of a global map for all VCPUs. With this we have more
flexibility allowing specific guests to access virtualized MSRs. This is
required for LBR virtualization.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/kvm_svm.h |2 +
 arch/x86/kvm/svm.c |   67 +++-
 2 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/kvm_svm.h b/arch/x86/kvm/kvm_svm.h
index ecdfe97..65ef0fc 100644
--- a/arch/x86/kvm/kvm_svm.h
+++ b/arch/x86/kvm/kvm_svm.h
@@ -39,6 +39,8 @@ struct vcpu_svm {
unsigned long host_db_regs[NUM_DB_REGS];
unsigned long host_dr6;
unsigned long host_dr7;
+
+   u32 *msrpm;
 };
 
 #endif
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 5a69619..3b31162 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -65,7 +65,6 @@ static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
 }
 
 unsigned long iopm_base;
-unsigned long msrpm_base;
 
 struct kvm_ldttss_desc {
u16 limit0;
@@ -370,12 +369,29 @@ static void set_msr_interception(u32 *msrpm, unsigned msr,
BUG();
 }
 
+static void svm_vcpu_init_msrpm(u32 *msrpm)
+{
+   memset(msrpm, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER));
+
+#ifdef CONFIG_X86_64
+   set_msr_interception(msrpm, MSR_GS_BASE, 1, 1);
+   set_msr_interception(msrpm, MSR_FS_BASE, 1, 1);
+   set_msr_interception(msrpm, MSR_KERNEL_GS_BASE, 1, 1);
+   set_msr_interception(msrpm, MSR_LSTAR, 1, 1);
+   set_msr_interception(msrpm, MSR_CSTAR, 1, 1);
+   set_msr_interception(msrpm, MSR_SYSCALL_MASK, 1, 1);
+#endif
+   set_msr_interception(msrpm, MSR_K6_STAR, 1, 1);
+   set_msr_interception(msrpm, MSR_IA32_SYSENTER_CS, 1, 1);
+   set_msr_interception(msrpm, MSR_IA32_SYSENTER_ESP, 1, 1);
+   set_msr_interception(msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
+}
+
 static __init int svm_hardware_setup(void)
 {
int cpu;
struct page *iopm_pages;
-   struct page *msrpm_pages;
-   void *iopm_va, *msrpm_va;
+   void *iopm_va;
int r;
 
iopm_pages = alloc_pages(GFP_KERNEL, IOPM_ALLOC_ORDER);
@@ -388,37 +404,13 @@ static __init int svm_hardware_setup(void)
clear_bit(0x80, iopm_va); /* allow direct access to PC debug port */
iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
 
-
-   msrpm_pages = alloc_pages(GFP_KERNEL, MSRPM_ALLOC_ORDER);
-
-   r = -ENOMEM;
-   if (!msrpm_pages)
-   goto err_1;
-
-   msrpm_va = page_address(msrpm_pages);
-   memset(msrpm_va, 0xff, PAGE_SIZE * (1 << MSRPM_ALLOC_ORDER));
-   msrpm_base = page_to_pfn(msrpm_pages) << PAGE_SHIFT;
-
-#ifdef CONFIG_X86_64
-   set_msr_interception(msrpm_va, MSR_GS_BASE, 1, 1);
-   set_msr_interception(msrpm_va, MSR_FS_BASE, 1, 1);
-   set_msr_interception(msrpm_va, MSR_KERNEL_GS_BASE, 1, 1);
-   set_msr_interception(msrpm_va, MSR_LSTAR, 1, 1);
-   set_msr_interception(msrpm_va, MSR_CSTAR, 1, 1);
-   set_msr_interception(msrpm_va, MSR_SYSCALL_MASK, 1, 1);
-#endif
-   set_msr_interception(msrpm_va, MSR_K6_STAR, 1, 1);
-   set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_CS, 1, 1);
-   set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_ESP, 1, 1);
-   set_msr_interception(msrpm_va, MSR_IA32_SYSENTER_EIP, 1, 1);
-
if (boot_cpu_has(X86_FEATURE_NX))
kvm_enable_efer_bits(EFER_NX);
 
for_each_online_cpu(cpu) {
r = svm_cpu_init(cpu);
if (r)
-   goto err_2;
+   goto err;
}
 
svm_features = cpuid_edx(SVM_CPUID_FUNC);
@@ -438,10 +430,7 @@ static __init int svm_hardware_setup(void)
 
return 0;
 
-err_2:
-   __free_pages(msrpm_pages, MSRPM_ALLOC_ORDER);
-   msrpm_base = 0;
-err_1:
+err:
__free_pages(iopm_pages, IOPM_ALLOC_ORDER);
iopm_base = 0;
return r;
@@ -449,9 +438,8 @@ err_1:
 
 static __exit void svm_hardware_unsetup(void)
 {
-   __free_pages(pfn_to_page(msrpm_base >> PAGE_SHIFT), MSRPM_ALLOC_ORDER);
__free_pages(pfn_to_page(iopm_base >> PAGE_SHIFT), IOPM_ALLOC_ORDER);
-   iopm_base = msrpm_base = 0;
+   iopm_base = 0;
 }
 
 static void init_seg(struct vmcb_seg *seg)
@@ -536,7 +524,7 @@ static void init_vmcb(struct vcpu_svm *svm)
(1ULL << INTERCEPT_MWAIT);
 
control->iopm_base_pa = iopm_base;
-   control->msrpm_base_pa = msrpm_base;
+   control->msrpm_base_pa = __pa(svm->msrpm);
control->tsc_offset = 0;
control->int_ctl = V_INTR_MASKING_MASK;
 
@@ -615,6 +603,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
 {
struct vcpu_svm *svm;
struct page *page;
+   struct page *msrpm_pages;
int err;
 
svm = 

[kvm-devel] [PATCH 3/3] KVM: SVM: enable LBR virtualization

2008-02-13 Thread Joerg Roedel
This patch implements the Last Branch Record Virtualization (LBRV) feature of
the AMD Barcelona and Phenom processors into the kvm-amd module. It will only
be enabled if the guest enables last branch recording in the DEBUG_CTL MSR. So
there is no increased world switch overhead when the guest doesn't use these
MSRs.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/svm.c |   39 +--
 1 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 3b31162..e1d139f 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -47,6 +47,8 @@ MODULE_LICENSE(GPL);
 #define SVM_FEATURE_LBRV (1 << 1)
 #define SVM_DEATURE_SVML (1 << 2)
 
+#define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
+
 /* enable NPT for AMD64 and X86 with PAE */
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 static bool npt_enabled = true;
@@ -387,6 +389,28 @@ static void svm_vcpu_init_msrpm(u32 *msrpm)
set_msr_interception(msrpm, MSR_IA32_SYSENTER_EIP, 1, 1);
 }
 
+static void svm_enable_lbrv(struct vcpu_svm *svm)
+{
+   u32 *msrpm = svm->msrpm;
+
+   svm->vmcb->control.lbr_ctl = 1;
+   set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 1, 1);
+   set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 1, 1);
+   set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 1, 1);
+   set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 1, 1);
+}
+
+static void svm_disable_lbrv(struct vcpu_svm *svm)
+{
+   u32 *msrpm = svm->msrpm;
+
+   svm->vmcb->control.lbr_ctl = 0;
+   set_msr_interception(msrpm, MSR_IA32_LASTBRANCHFROMIP, 0, 0);
+   set_msr_interception(msrpm, MSR_IA32_LASTBRANCHTOIP, 0, 0);
+   set_msr_interception(msrpm, MSR_IA32_LASTINTFROMIP, 0, 0);
+   set_msr_interception(msrpm, MSR_IA32_LASTINTTOIP, 0, 0);
+}
+
 static __init int svm_hardware_setup(void)
 {
int cpu;
@@ -1231,8 +1255,19 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
svm->vmcb->save.sysenter_esp = data;
break;
case MSR_IA32_DEBUGCTLMSR:
-   pr_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
-   __FUNCTION__, data);
+   if (!svm_has(SVM_FEATURE_LBRV)) {
+   pr_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTL 0x%llx, nop\n",
+   __FUNCTION__, data);
+   break;
+   }
+   if (data  DEBUGCTL_RESERVED_BITS)
+   return 1;
+
+   svm->vmcb->save.dbgctl = data;
+   if (data & (1ULL << 0))
+   svm_enable_lbrv(svm);
+   else
+   svm_disable_lbrv(svm);
break;
case MSR_K7_EVNTSEL0:
case MSR_K7_EVNTSEL1:
-- 
1.5.3.7






[kvm-devel] [PATCH 1/3] KVM: SVM: let init_vmcb() take struct vcpu_svm as parameter

2008-02-13 Thread Joerg Roedel
Change the parameter of the init_vmcb() function in the kvm-amd module from
struct vmcb to struct vcpu_svm.

Signed-off-by: Joerg Roedel [EMAIL PROTECTED]
Signed-off-by: Markus Rechberger [EMAIL PROTECTED]
---
 arch/x86/kvm/svm.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1ef3e7b..5a69619 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -471,10 +471,10 @@ static void init_sys_seg(struct vmcb_seg *seg, uint32_t type)
seg->base = 0;
 }
 
-static void init_vmcb(struct vmcb *vmcb)
+static void init_vmcb(struct vcpu_svm *svm)
 {
-   struct vmcb_control_area *control = &vmcb->control;
-   struct vmcb_save_area *save = &vmcb->save;
+   struct vmcb_control_area *control = &svm->vmcb->control;
+   struct vmcb_save_area *save = &svm->vmcb->save;
 
control->intercept_cr_read =    INTERCEPT_CR0_MASK |
INTERCEPT_CR3_MASK |
@@ -600,7 +600,7 @@ static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
 {
struct vcpu_svm *svm = to_svm(vcpu);
 
-   init_vmcb(svm->vmcb);
+   init_vmcb(svm);
 
if (vcpu->vcpu_id != 0) {
svm->vmcb->save.rip = 0;
@@ -638,7 +638,7 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id)
svm->vmcb_pa = page_to_pfn(page) << PAGE_SHIFT;
svm->asid_generation = 0;
memset(svm->db_regs, 0, sizeof(svm->db_regs));
-   init_vmcb(svm->vmcb);
+   init_vmcb(svm);
 
fx_init(&svm->vcpu);
svm->vcpu.fpu_active = 1;
@@ -1024,7 +1024,7 @@ static int shutdown_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
 * so reinitialize it.
 */
clear_page(svm->vmcb);
-   init_vmcb(svm->vmcb);
+   init_vmcb(svm);
 
kvm_run->exit_reason = KVM_EXIT_SHUTDOWN;
return 0;
-- 
1.5.3.7






Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christoph Lameter
On Tue, 12 Feb 2008, Christian Bell wrote:

 You're arguing that a HW page table is not needed by describing a use
 case that is essentially what all RDMA solutions already do above the
 wire protocols (all solutions except Quadrics, of course).

The HW page table is not essential to the notification scheme. That the 
RDMA uses the page table for linearization is another issue. A chip could 
just have a TLB cache and lookup the entries using the OS page table f.e.

  Lets say you have a two systems A and B. Each has their memory region MemA 
  and MemB. Each side also has page tables for this region PtA and PtB.
  If either side then accesses the page again then the reverse process 
  happens. If B accesses the page then it will first of all incur a page 
  fault because the entry in PtB is missing. The fault will then cause a 
  message to be sent to A to establish the page again. A will create an 
  entry in PtA and will then confirm to B that the page was established. At 
  that point RDMA operations can occur again.
 
 The notifier-reclaim cycle you describe is akin to the out-of-band
 pin-unpin control messages used by existing communication libraries.
 Also, I think what you are proposing can have problems at scale -- A
 must keep track of all of the (potentially many systems) of memA and
 cooperatively get an agreement from all these systems before reclaiming
 the page.

Right. We (SGI) have done something like this for a long time with XPmem 
and it scales ok.

 When messages are sufficiently large, the control messaging necessary
 to setup/teardown the regions is relatively small.  This is not
 always the case however -- in programming models that employ smaller
 messages, the one-sided nature of RDMA is the most attractive part of
 it.  

The messaging would only be needed if a process comes under memory 
pressure. As long as there is enough memory nothing like this will occur.

 Nothing any communication/runtime system can't already do today.  The
 point of RDMA demand paging is enabling the possibility of using RDMA
 without the implied synchronization -- the optimistic part.  Using
 the notifiers to duplicate existing memory region handling for RDMA
 hardware that doesn't have HW page tables is possible but undermines
 the more important consumer of your patches in my opinion.

The notifier scheme should integrate into existing memory region 
handling and not cause a duplication. If you already have library layers 
that do this then it should be possible to integrate it.

 One other area that has not been brought up yet (I think) is the
 applicability of notifiers in letting users know when pinned memory
 is reclaimed by the kernel.  This is useful when a lower-level
 library employs lazy deregistration strategies on memory regions that
 are subsequently released to the kernel via the application's use of
 munmap or sbrk.  Ohio Supercomputing Center has work in this area but
 a generalized approach in the kernel would certainly be welcome.

The driver gets the notifications about memory being reclaimed. The driver 
could then notify user code about the release as well.

Pinned memory currently *cannot* be reclaimed by the kernel. The refcount is 
elevated. This means that the VM tries to remove the mappings and then 
sees that it was not able to remove all references. Then it gives up and 
tries again and again and again. Thus the potential for livelock.




Re: [kvm-devel] Clock off in guest

2008-02-13 Thread Uri Lublin
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wed 13/02/2008 14:52
To: Uri Lublin
Cc: kvm-devel@lists.sourceforge.net
Subject: RE: [kvm-devel] Clock off in guest
 
 This would not work if you are using an old version of kvm ( with no
 in-kernel-apic )
 I recommend upgrading to kvm-60 (or latest linux kernel).

Should I upgrade the guest kernel or the host kernel? My bet is the host
kernel, but the clocksource=tsc applies to the guest, so I'm not really
sure...

The host kernel or kvm.
If you choose to upgrade your host kernel (and the kvm that comes with it), make sure
you are using a recent kvm-userspace too (e.g. kvm-60).

 Or as an alternative, probably not as good, sometimes (when the guest's
 clocksource is PIT) adding '-tdf' to the command line helps.

I cannot find this in man kvm or man qemu.

I'm not sure about the man pages, but kvm/qemu's help says:
bash$ /usr/bin/kvm -h | grep tdf
-tdfinject timer interrupts that got lost

Should I add this to the command line that starts the guest?

Yes, try adding it to the command line that starts the guest (executable name 
may vary):
   /usr/bin/kvm [kvm-params] -tdf

Also tdf (time drift fix) only works when using PIT+PIC (no APIC) so sometimes 
it's helpful
to also add -no-acpi:
   /usr/bin/kvm [kvm-params] -tdf -no-acpi



Best,
Koen




Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christoph Lameter
On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

 But this isn't how IB or iwarp work at all. What you describe is a
 significant change to the general RDMA operation and requires changes to
 both sides of the connection and the wire protocol.

Yes it may require a separate connection between both sides where a 
kind of VM notification protocol is established to tear these things down and 
set them up again. That is if there is nothing in the RDMA protocol that
allows a notification to the other side that the mapping is being torn 
down.

  - In RDMA (iwarp and IB versions) the hardware page tables exist to
linearize the local memory so the remote does not need to be aware
of non-linearities in the physical address space. The main
motivation for this is kernel bypass where the user space app wants
to instruct the remote side to DMA into memory using user space
addresses. Hardware provides the page tables to switch from
incoming user space virtual addresses to physical addresess.

s/switch/translate I guess. That is good and those page tables could be 
used for the notification scheme to enable reclaim. But they are optional 
and are maintaining the driver state. The linearization could be 
reconstructed from the kernel page tables on demand.

Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
for access control and enforcing the liftime of the mapping.

Well the mapping would have to be on demand to avoid the issues that we 
currently have with pinning. The user API could stay the same. If the 
driver tracks the mappings using the notifier then the VM can make sure 
that the right things happen on exit etc etc.

The page tables in the RDMA hardware exist primarily to support
this, and not for other reasons. The pinning of pages is one part
to support the HW page tables and one part to support the RDMA
lifetime rules, the liftime rules are what cause problems for
the VM.

So the driver software can tear down and establish page table 
entries at will? I do not see the problem. The RDMA hardware is one thing, 
the way things are visible to the user another. If the driver can 
establish and remove mappings as needed via RDMA then the user can have 
the illusion of persistent RDMA memory. This is the same as virtual memory 
providing the illusion of a process having lots of memory all for itself.


  - The wire protocol consists of packets that say 'Write XXX bytes to
offset YY in Region RRR'. Creating a region produces the RRR label
and currently pins the pages. So long as the RRR label is valid the
remote side can issue write packets at any time without any
further synchronization. There is no wire level events associated
with creating RRR. You can pass RRR to the other machine in any
fashion, even using carrier pigeons :)
  - The RDMA layer is very general (ala TCP), useful protocols (like SCSI)
are built on top of it and they specify the lifetime rules and
protocol for exchanging RRR.

Well yes of course. What is proposed here is an additional notification 
mechanism (could even be via tcp/udp to simplify things) that would manage 
the mappings at a higher level. The writes would not occur if the mapping 
has not been established.
 
This is your step 'A will then send a message to B notifying..'.
It simply does not exist in the protocol specifications

Of course. You need to create an additional communication layer to get 
that.

 What it boils down to is that to implement true removal of pages in a
 general way the kernel and HCA must either drop packets or stall
 incoming packets, both are big performance problems - and I can't see
 many users wanting this. Enterprise style people using SCSI, NFS, etc
 already have short pin periods and HPC MPI users probably won't care
 about the VM issues enough to warrent the performance overhead.

True maybe you cannot do this by simply staying within the protocol bounds 
of RDMA that is based on page pinning if the RDMA protocol does not 
support a notification to the other side that the mapping is going away. 

If RDMA cannot do this then you would need additional ways of notifying 
the remote side that pages/mappings are invalidated.



Re: [kvm-devel] [RFC] Qemu powerpc work around

2008-02-13 Thread Anthony Liguori
Jerone Young wrote:
 On Wed, 2008-02-13 at 09:29 +0200, Avi Kivity wrote:
   
 Jerone Young wrote:
 
 So the recent code in qemu cvs has a problem on powerpc. So what I have done
 is mainly work around this in the build system, by creating a
 ppcemb_kvm-softmmu target. Along with this is a fake-exec.c that stubs
 out the functions that are no longer defined (something done by Anthony
 Liguori attempting to fix qemu cvs). What do folks think about this
 approach? For us, all we really need is a qemu that is not built with a tcg
 dependency.

   
   
 Since a target in qemu is a cpu type, how the instructions are executed 
 (kvm, kqemu, dyngen, or tcg) shouldn't come into it.  Instead we can 
 have a --without-cpu-emulation or --no-tcg which would simply disable 
 those parts.
 

 Actually this much much more sensible solution. So I took some time and
 implemented it.
   

Funny enough, I was thinking the same thing last night :-)

Please move fake-exec.c to target-ppc/fake-exec.c as it contains PPC 
specific code.  Otherwise, this patch is much better!

Regards,

Anthony Liguori




Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Jason Gunthorpe
On Wed, Feb 13, 2008 at 10:51:58AM -0800, Christoph Lameter wrote:
 On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
 
  But this isn't how IB or iwarp work at all. What you describe is a
  significant change to the general RDMA operation and requires changes to
  both sides of the connection and the wire protocol.
 
 Yes it may require a separate connection between both sides where a 
 kind of VM notification protocol is established to tear these things down and 
 set them up again. That is if there is nothing in the RDMA protocol that
 allows a notification to the other side that the mapping is being down 
 down.

Well, yes, you could build this thing you are describing on top of the
RDMA protocol and get some support from some of the hardware - but it
is a new set of protocols and they would need to be implemented in
several places. It is not transparent to userspace and it is not
compatible with existing implementations.

Unfortunately it really has little to do with the drivers - changes,
for instance, need to be made to support this in the user space MPI
libraries. The RDMA ops do not pass through the kernel, userspace
talks directly to the hardware which complicates building any sort of
abstraction.

That is where I think you run into trouble, if you ask the MPI people
to add code to their critical path to support swapping they probably
will not be too interested. At a minimum to support your idea you need
to check on every RDMA if the remote page is mapped... Plus the
overheads Christian was talking about in the OOB channel(s).

Jason



Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christian Bell
On Wed, 13 Feb 2008, Christoph Lameter wrote:

 Right. We (SGI) have done something like this for a long time with XPmem 
 and it scales ok.

I'd dispute this based on experience developing PGAS language support
on the Altix but more importantly (and less subjectively), I think
that "scales ok" refers to a very specific case.  Sure, pages (and/or
regions) can be large on some systems and the number of systems may
not always be in the thousands but you're still claiming scalability
for a mechanism that essentially logs who accesses the regions.  Then
there's the fact that reclaim becomes a collective communication
operation over all region accessors.  Makes me nervous.

  When messages are sufficiently large, the control messaging necessary
  to setup/teardown the regions is relatively small.  This is not
  always the case however -- in programming models that employ smaller
  messages, the one-sided nature of RDMA is the most attractive part of
  it.  
 
 The messaging would only be needed if a process comes under memory 
 pressure. As long as there is enough memory nothing like this will occur.
 
  Nothing any communication/runtime system can't already do today.  The
  point of RDMA demand paging is enabling the possibility of using RDMA
  without the implied synchronization -- the optimistic part.  Using
  the notifiers to duplicate existing memory region handling for RDMA
  hardware that doesn't have HW page tables is possible but undermines
  the more important consumer of your patches in my opinion.
 

 The notifier scheme should integrate into existing memory region 
 handling and not cause a duplication. If you already have library layers 
 that do this then it should be possible to integrate it.

I appreciate that you're trying to make a general case for the
applicability of notifiers on all types of existing RDMA hardware and
wire protocols.  Also, I'm not disagreeing whether a HW page table
is required or not: clearly it's not required to make *some* use of
the notifier scheme.

However, short of providing user-level notifications for pinned pages
that are inadvertently released to the O/S, I don't believe that the
patchset provides any significant added value for the HPC community
that can't optimistically do RDMA demand paging.


. . christian


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christoph Lameter
On Wed, 13 Feb 2008, Christoph Raisch wrote:

 For ehca we currently can't modify a large MR when it has been allocated.
 EHCA Hardware expects the pages to be there (MRs must not have holes).
 This is also true for the global MR covering all kernel space.
 Therefore we still need the memory to be pinned if ib_umem_get() is
 called.

It cannot be freed and then reallocated? What happens when a process 
exits?


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Qemu powerpc work around

2008-02-13 Thread Anthony Liguori

Hi Jerone,

Jerone Young wrote:

Ok, taking everybody's suggestions: this patch adds a
--disable-cpu-emulation option to qemu.  This way powerpc gains the
ability to compile, and other archs get an easy way to build without
the tcg code.

Signed-off-by: Jerone Young [EMAIL PROTECTED]
  


Can you try out this version of the patch on PPC?  This version also 
supports --disable-cpu-emulation on x86.  It also eliminates -no-kvm 
when using --disable-cpu-emulation and exits if KVM cannot be initialized.


This should be useful on x86 where people cannot easily get their hands 
on gcc-3 and only wish to run KVM.


Regards,

Anthony Liguori
diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 49b81df..8b0436b 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -179,11 +179,17 @@ all: $(PROGS)
 
 #
 # cpu emulator library
-LIBOBJS=exec.o kqemu.o translate-all.o cpu-exec.o\
-translate.o op.o host-utils.o
+LIBOBJS=exec.o kqemu.o cpu-exec.o host-utils.o
+
+ifeq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+=fake-exec.o
+else
+LIBOBJS+= translate-all.o translate.o op.o 
 # TCG code generator
 LIBOBJS+= tcg/tcg.o tcg/tcg-dyngen.o tcg/tcg-runtime.o
 CPPFLAGS+=-I$(SRC_PATH)/tcg -I$(SRC_PATH)/tcg/$(ARCH)
+endif
+
 ifeq ($(USE_KVM), 1)
 LIBOBJS+=qemu-kvm.o
 endif
diff --git a/qemu/configure b/qemu/configure
index 92299b9..bc42665 100755
--- a/qemu/configure
+++ b/qemu/configure
@@ -110,6 +110,7 @@ linux_user=no
 darwin_user=no
 build_docs=no
 uname_release=
+cpu_emulation=yes
 
 # OS specific
 targetos=`uname -s`
@@ -339,6 +340,8 @@ for opt do
   ;;
   --disable-werror) werror=no
   ;;
+  --disable-cpu-emulation) cpu_emulation=no 
+  ;;
  *) echo "ERROR: unknown option $opt"; exit 1
   ;;
   esac
@@ -770,6 +773,7 @@ if test -n "$sparc_cpu"; then
 fi
 echo "kqemu support     $kqemu"
 echo "kvm support       $kvm"
+echo "CPU emulation     $cpu_emulation"
 echo "Documentation     $build_docs"
 [ ! -z "$uname_release" ] && \
 echo "uname -r          $uname_release"
@@ -1094,12 +1098,20 @@ elfload32=no
 interp_prefix1=`echo $interp_prefix | sed "s/%M/$target_cpu/g"`
 echo "#define CONFIG_QEMU_PREFIX \"$interp_prefix1\"" >> $config_h
 
+disable_cpu_emulation() {
+  if test "$cpu_emulation" = "no"; then
+echo "#define NO_CPU_EMULATION 1" >> $config_h
+echo "NO_CPU_EMULATION=1" >> $config_mak
+  fi
+}
+
 configure_kvm() {
   if test "$kvm" = "yes" -a "$target_softmmu" = "yes" -a \
   \( "$cpu" = "i386" -o "$cpu" = "x86_64" -o "$cpu" = "ia64" -o "$cpu" = "powerpc" \); then
 echo "#define USE_KVM 1" >> $config_h
 echo "USE_KVM=1" >> $config_mak
 echo "CONFIG_KVM_KERNEL_INC=$kernel_path/include" >> $config_mak
+disable_cpu_emulation 
   fi
 }
 
diff --git a/qemu/exec.c b/qemu/exec.c
index 050b150..960adcd 100644
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -35,7 +35,11 @@
 
 #include "cpu.h"
 #include "exec-all.h"
+
+#if !defined(NO_CPU_EMULATION)
 #include "tcg-target.h"
+#endif
+
 #include "qemu-kvm.h"
 #if defined(CONFIG_USER_ONLY)
 #include "qemu.h"
diff --git a/qemu/target-i386/fake-exec.c b/qemu/target-i386/fake-exec.c
new file mode 100644
index 000..737286d
--- /dev/null
+++ b/qemu/target-i386/fake-exec.c
@@ -0,0 +1,54 @@
+/*
+ * fake-exec.c
+ *
+ * This is a file for stub functions so that compilation is possible
+ * when TCG CPU emulation is disabled during compilation.
+ *
+ * Copyright 2007 IBM Corporation.
+ * Added by  Authors:
+ * 	Jerone Young [EMAIL PROTECTED]
+ * This work is licensed under the GNU GPL licence version 2 or later.
+ *
+ */
+#include "exec.h"
+#include "cpu.h"
+
+int code_copy_enabled = 0;
+
+CCTable cc_table[CC_OP_NB];
+
+void cpu_dump_statistics (CPUState *env, FILE*f,
+  int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+  int flags)
+{
+}
+
+unsigned long code_gen_max_block_size(void)
+{
+return 32;
+}
+
+void cpu_gen_init(void)
+{
+}
+
+int cpu_restore_state(TranslationBlock *tb,
+  CPUState *env, unsigned long searched_pc,
+  void *puc)
+
+{
+return 0;
+}
+
+int cpu_x86_gen_code(CPUState *env, TranslationBlock *tb, int *gen_code_size_ptr)
+{
+return 0;
+}
+
+void flush_icache_range(unsigned long start, unsigned long stop)
+{
+}
+
+void optimize_flags_init(void)
+{
+}
diff --git a/qemu/target-ppc/fake-exec.c b/qemu/target-ppc/fake-exec.c
new file mode 100644
index 000..b042f58
--- /dev/null
+++ b/qemu/target-ppc/fake-exec.c
@@ -0,0 +1,75 @@
+/*
+ * fake-exec.c
+ *
+ * This is a file for stub functions so that compilation is possible
+ * when TCG CPU emulation is disabled during compilation.
+ *
+ * Copyright 2007 IBM Corporation.
+ * Added by  Authors:
+ * 	Jerone Young [EMAIL PROTECTED]
+ * This work is licensed under the GNU GPL licence version 2 or later.
+ *
+ */
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "cpu.h"
+#include "exec-all.h"
+
+int code_copy_enabled = 0;
+
+void 

Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christoph Lameter
On Wed, 13 Feb 2008, Christian Bell wrote:

 not always be in the thousands but you're still claiming scalability
 for a mechanism that essentially logs who accesses the regions.  Then
 there's the fact that reclaim becomes a collective communication
 operation over all region accessors.  Makes me nervous.

Well, reclaim is not a very fast process (and we usually try to avoid it 
as much as possible for our HPC). Essentially it's only there to allow 
shifts of processing loads and to allow efficient caching of application 
data.

 However, short of providing user-level notifications for pinned pages
 that are inadvertently released to the O/S, I don't believe that the
 patchset provides any significant added value for the HPC community
 that can't optimistically do RDMA demand paging.

We currently also run XPmem with pinning. It's great as long as you just 
run one load on the system. No reclaim ever occurs.

However, if you do things that require lots of allocations etc. then 
the page pinning can easily lead to livelock if reclaim is finally 
triggered and also to strange OOM situations since the VM cannot free any 
pages. So the main issue that is addressed here is reliability of pinned 
page operations. Better VM integration avoids these issues because we can 
unpin on request to deal with memory shortages.



-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christoph Lameter
On Wed, 13 Feb 2008, Jason Gunthorpe wrote:

 Unfortunately it really has little to do with the drivers - changes,
 for instance, need to be made to support this in the user space MPI
 libraries. The RDMA ops do not pass through the kernel, userspace
 talks directly to the hardware which complicates building any sort of
 abstraction.

Ok so the notifiers have to be handed over to the user space library that 
has the function of the device driver here...

 That is where I think you run into trouble, if you ask the MPI people
 to add code to their critical path to support swapping they probably
 will not be too interested. At a minimum to support your idea you need
 to check on every RDMA if the remote page is mapped... Plus the
 overheads Christian was talking about in the OOB channel(s).

You only need to check if a handle has been receiving invalidates. If not 
then you can just go ahead as now. You can use the notifier to take down 
the whole region if any reclaim occurs against it (probably the best and 
simplest approach to implement). Then you mark the handle so that the 
mapping is reestablished before the next operation.
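
To make that concrete, the fast path could look something like the
sketch below in a user-space RDMA library. The handle layout, the
invalidated flag and the two helpers are made up for illustration;
none of this is an existing verbs API, it only shows the "test a
flag, reregister only if needed" idea:

#include <stddef.h>

/* Hypothetical per-registration handle kept by the library. */
struct rdma_handle {
	void         *addr;
	size_t        len;
	volatile int  invalidated;   /* set when a notifier tears the region down */
};

/* Hypothetical helpers provided elsewhere by the library. */
int reregister_region(struct rdma_handle *h);
int hw_post_write(struct rdma_handle *h, size_t off, size_t n);

int post_rdma_write(struct rdma_handle *h, size_t off, size_t n)
{
	if (h->invalidated) {
		/* slow path: region was reclaimed, re-establish the
		   mapping before touching the wire */
		if (reregister_region(h) < 0)
			return -1;
		h->invalidated = 0;
	}
	/* fast path: identical to what the library does today */
	return hw_post_write(h, off, n);
}

In the common case this costs one flag test; the expensive
re-registration only happens after the VM has actually reclaimed
something.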



-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH] Qemu powerpc work around

2008-02-13 Thread Jerone Young
Ok, taking everybody's suggestions: this patch adds a
--disable-cpu-emulation option to qemu.  This way powerpc gains the
ability to compile, and other archs get an easy way to build without
the tcg code.

Signed-off-by: Jerone Young [EMAIL PROTECTED]

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -179,11 +179,15 @@ all: $(PROGS)
 
 #
 # cpu emulator library
-LIBOBJS=exec.o kqemu.o translate-all.o cpu-exec.o\
-translate.o op.o host-utils.o
+LIBOBJS=exec.o kqemu.o cpu-exec.o host-utils.o
+
+ifneq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+= translate-all.o translate.o op.o 
 # TCG code generator
 LIBOBJS+= tcg/tcg.o tcg/tcg-dyngen.o tcg/tcg-runtime.o
 CPPFLAGS+=-I$(SRC_PATH)/tcg -I$(SRC_PATH)/tcg/$(ARCH)
+endif
+
 ifeq ($(USE_KVM), 1)
 LIBOBJS+=qemu-kvm.o
 endif
@@ -214,6 +218,9 @@ LIBOBJS+= op_helper.o helper.o
 LIBOBJS+= op_helper.o helper.o
 ifeq ($(USE_KVM), 1)
 LIBOBJS+= qemu-kvm-powerpc.o
+endif
+ifeq ($(NO_CPU_EMULATION), 1)
+LIBOBJS+=fake-exec-ppc.o
 endif
 endif
 
diff --git a/qemu/configure b/qemu/configure
--- a/qemu/configure
+++ b/qemu/configure
@@ -110,6 +110,7 @@ darwin_user=no
 darwin_user=no
 build_docs=no
 uname_release=
+cpu_emulation=yes
 
 # OS specific
 targetos=`uname -s`
@@ -339,6 +340,8 @@ for opt do
   ;;
   --disable-werror) werror=no
   ;;
+  --disable-cpu-emulation) cpu_emulation=no 
+  ;;
  *) echo "ERROR: unknown option $opt"; exit 1
   ;;
   esac
@@ -770,6 +773,7 @@ fi
 fi
 echo "kqemu support     $kqemu"
 echo "kvm support       $kvm"
+echo "CPU emulation     $cpu_emulation"
 echo "Documentation     $build_docs"
 [ ! -z "$uname_release" ] && \
 echo "uname -r          $uname_release"
@@ -1094,12 +1098,20 @@ interp_prefix1=`echo $interp_prefix | 
 interp_prefix1=`echo $interp_prefix | sed "s/%M/$target_cpu/g"`
 echo "#define CONFIG_QEMU_PREFIX \"$interp_prefix1\"" >> $config_h
 
+disable_cpu_emulation() {
+  if test "$cpu_emulation" = "no"; then
+echo "#define NO_CPU_EMULATION 1" >> $config_h
+echo "NO_CPU_EMULATION=1" >> $config_mak
+  fi
+}
+
 configure_kvm() {
   if test "$kvm" = "yes" -a "$target_softmmu" = "yes" -a \
   \( "$cpu" = "i386" -o "$cpu" = "x86_64" -o "$cpu" = "ia64" -o "$cpu" = "powerpc" \); then
 echo "#define USE_KVM 1" >> $config_h
 echo "USE_KVM=1" >> $config_mak
 echo "CONFIG_KVM_KERNEL_INC=$kernel_path/include" >> $config_mak
+disable_cpu_emulation 
   fi
 }
 
diff --git a/qemu/exec.c b/qemu/exec.c
--- a/qemu/exec.c
+++ b/qemu/exec.c
@@ -35,7 +35,11 @@
 
 #include "cpu.h"
 #include "exec-all.h"
+
+#if !defined(NO_CPU_EMULATION)
 #include "tcg-target.h"
+#endif
+
 #include "qemu-kvm.h"
 #if defined(CONFIG_USER_ONLY)
 #include "qemu.h"
diff --git a/qemu/fake-exec-ppc.c b/qemu/fake-exec-ppc.c
new file mode 100644
--- /dev/null
+++ b/qemu/fake-exec-ppc.c
@@ -0,0 +1,75 @@
+/*
+ * fake-exec.c
+ *
+ * This is a file for stub functions so that compilation is possible
+ * when TCG CPU emulation is disabled during compilation.
+ *
+ * Copyright 2007 IBM Corporation.
+ * Added by  Authors:
+ * Jerone Young [EMAIL PROTECTED]
+ * This work is licensed under the GNU GPL licence version 2 or later.
+ *
+ */
+
+#include <stdarg.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "cpu.h"
+#include "exec-all.h"
+
+int code_copy_enabled = 0;
+
+void cpu_dump_state (CPUState *env, FILE *f,
+ int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+ int flags)
+{
+}
+
+void ppc_cpu_list (FILE *f, int (*cpu_fprintf)(FILE *f, const char *fmt, ...))
+{
+}
+
+void cpu_dump_statistics (CPUState *env, FILE*f,
+  int (*cpu_fprintf)(FILE *f, const char *fmt, ...),
+  int flags)
+{
+}
+
+unsigned long code_gen_max_block_size(void)
+{
+return 32;
+}
+
+void cpu_gen_init(void)
+{
+}
+
+int cpu_restore_state(TranslationBlock *tb,
+  CPUState *env, unsigned long searched_pc,
+  void *puc)
+
+{
+return 0;
+}
+
+int cpu_ppc_gen_code(CPUState *env, TranslationBlock *tb, int 
*gen_code_size_ptr)
+{
+return 0;
+}
+
+const ppc_def_t *cpu_ppc_find_by_name (const unsigned char *name)
+{
+return NULL;
+}
+
+int cpu_ppc_register_internal (CPUPPCState *env, const ppc_def_t *def)
+{
+return 0;
+}
+
+void flush_icache_range(unsigned long start, unsigned long stop)
+{
+}



-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 1/2] kvmclock - the host part.

2008-02-13 Thread Glauber Costa
Avi Kivity wrote:
 Glauber de Oliveira Costa wrote:
 This is the host part of the kvm clocksource implementation. As it does
 not include clockevents, it is a fairly simple implementation. We
 only have to register a per-vcpu area, and start writing to it 
 periodically.

 The area is binary compatible with xen, as we use the same shadow_info 
 structure.

   
 
 +static void kvm_write_wall_clock(struct kvm_vcpu *v, gpa_t wall_clock)
 +{
 +int version = 1;
 +struct kvm_wall_clock wc;
 +unsigned long flags;
 +struct timespec wc_ts;
 +
 +local_irq_save(flags);
 +kvm_get_msr(v, MSR_IA32_TIME_STAMP_COUNTER,
 +  &v->arch.hv_clock.tsc_timestamp);
   
 
 Why is this needed? IIRC the wall clock is not tied to any vcpu.
 
 If we can remove this, the argument to the function should be kvm, not 
 kvm_vcpu. We can remove the irq games as well.

After some more thought, I don't agree. The time calculation in the 
guest will be in the form wallclock + delta tsc. Everything that has a 
tsc _is_ tied to a cpu. Although we can store the area globally, I think 
the best semantics is to have a vcpu always issue an msr write to the 
area before reading it, in order to have the tsc updated.
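
For reference, this is roughly how a guest would turn the per-vcpu
area into a time value; the scaling convention (shift, then a 32.32
fixed-point multiply) is assumed from the Xen interface, and the
rdtsc/barrier details are illustrative rather than part of this patch:

#include <stdint.h>

struct kvm_vcpu_time_info {		/* field layout as quoted below */
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
};

static uint64_t guest_system_time_ns(volatile struct kvm_vcpu_time_info *ti)
{
	uint32_t version, lo32, hi32;
	uint64_t tsc, delta, ns;

	do {
		version = ti->version;		/* odd means an update is in flight */
		__sync_synchronize();
		asm volatile("rdtsc" : "=a" (lo32), "=d" (hi32));
		tsc   = ((uint64_t)hi32 << 32) | lo32;
		delta = tsc - ti->tsc_timestamp;
		if (ti->tsc_shift >= 0)
			delta <<= ti->tsc_shift;
		else
			delta >>= -ti->tsc_shift;
		/* (delta * mul) >> 32, split so the multiply cannot overflow */
		ns  = (delta >> 32) * ti->tsc_to_system_mul;
		ns += ((delta & 0xffffffffull) * ti->tsc_to_system_mul) >> 32;
		ns += ti->system_time;
		__sync_synchronize();
	} while ((version & 1) || version != ti->version);

	return ns;
}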

 +wc_ts = current_kernel_time();
 +local_irq_restore(flags);
 +
 +down_write(&current->mm->mmap_sem);
 +kvm_write_guest(v->kvm, wall_clock, &version, sizeof(version));
 +up_write(&current->mm->mmap_sem);
   
 
 Why down_write? accidentally or on purpose?
accidentally. Marcelo has pointed it out to me, and I do have a version 
without it now.

 For mutual exclusion, I suggest taking kvm->lock instead (for the entire 
 function).

This function is called too often. Since we only need to guarantee 
mutual exclusion in a tiny part, it seems preferable to me. Do you have 
any extra reason for kvm->lock'ing the entire function?

 +
 +/* With all the info we got, fill in the values */
 +wc.wc_sec = wc_ts.tv_sec;
 +wc.wc_nsec = wc_ts.tv_nsec;
 +wc.wc_version = ++version;
 +
 +down_write(&current->mm->mmap_sem);
 +kvm_write_guest(v->kvm, wall_clock, &wc, sizeof(wc));
 +up_write(&current->mm->mmap_sem);
   
 
 Should be in three steps: write version, write data, write version. 
 kvm_write_guest doesn't guarantee any order. It may fail as well, and we 
 need to handle that.
Ok, I see. This is fundamentally different from the system time case, 
because multiple cpus can
be running over the same area.
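
Something like the following is what I'll aim for; a rough sketch
only, reusing the kvm_write_guest()/kvm_read_guest() call pattern
from the patch (wall_clock is the guest physical address, as in the
patch), with locking and error handling still to be settled:

/* Sketch: version goes odd, payload is written, version goes even.
 * A guest that sees an odd version, or a version change across its
 * read, must retry. */
static int kvm_write_wall_clock(struct kvm_vcpu *v, gpa_t wall_clock,
				struct timespec wc_ts)
{
	struct kvm_wall_clock wc;
	u32 version;
	int r;

	r = kvm_read_guest(v->kvm, wall_clock, &version, sizeof(version));
	if (r < 0)
		return r;

	version = (version + 1) | 1;	/* 1. mark the update as in flight */
	r = kvm_write_guest(v->kvm, wall_clock, &version, sizeof(version));
	if (r < 0)
		return r;

	wc.wc_version = version;
	wc.wc_sec     = wc_ts.tv_sec;
	wc.wc_nsec    = wc_ts.tv_nsec;
	r = kvm_write_guest(v->kvm, wall_clock, &wc, sizeof(wc));	/* 2. payload */
	if (r < 0)
		return r;

	version++;			/* 3. even again: data is consistent */
	return kvm_write_guest(v->kvm, wall_clock, &version, sizeof(version));
}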

  
 +/* xen binary-compatible interface. See xen headers for details */
 +struct kvm_vcpu_time_info {
 +uint32_t version;
 +uint32_t pad0;
 +uint64_t tsc_timestamp;
 +uint64_t system_time;
 +uint32_t tsc_to_system_mul;
 +int8_t   tsc_shift;
 +}; /* 32 bytes */
 +
 +struct kvm_wall_clock {
 +uint32_t wc_version;
 +uint32_t wc_sec;
 +uint32_t wc_nsec;
 +};
 +
   
 
 These structures are dangerously sized. Suggest 
 __attribute__((__packed__)) (or some padding at the end of 
 kvm_vcpu_time_info).

They are sized so as to have the same size as Xen's. If it concerns you,
packed should be better.
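
For example (illustration only, not part of the patch), explicit
trailing padding plus the packed attribute would pin the 32-byte
layout regardless of what the compiler would otherwise choose:

struct kvm_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  pad1[3];	/* make the 32-byte size explicit */
} __attribute__((__packed__));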
 diff --git a/include/linux/kvm.h b/include/linux/kvm.h
 index 4de4fd2..78ce53f 100644
 --- a/include/linux/kvm.h
 +++ b/include/linux/kvm.h
 @@ -232,6 +232,7 @@ #define KVM_CAP_USER_MEMORY 3
  #define KVM_CAP_SET_TSS_ADDR 4
  #define KVM_CAP_EXT_CPUID 5
  #define KVM_CAP_VAPIC 6
 +#define KVM_CAP_CLOCKSOURCE 7
  
   
 
 Please refresh against kvm.git, this has changed a bit.
ok.



-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Kanoj Sarcar

--- Christoph Lameter [EMAIL PROTECTED] wrote:

 On Wed, 13 Feb 2008, Christian Bell wrote:
 
  not always be in the thousands but you're still
 claiming scalability
  for a mechanism that essentially logs who accesses
 the regions.  Then
  there's the fact that reclaim becomes a collective
 communication
  operation over all region accessors.  Makes me
 nervous.
 
 Well reclaim is not a very fast process (and we
 usually try to avoid it 
 as much as possible for our HPC). Essentially its
 only there to allow 
 shifts of processing loads and to allow efficient
 caching of application 
 data.
 
  However, short of providing user-level
 notifications for pinned pages
  that are inadvertently released to the O/S, I
 don't believe that the
  patchset provides any significant added value for
 the HPC community
  that can't optimistically do RDMA demand paging.
 
 We currently also run XPmem with pinning. Its great
 as long as you just 
 run one load on the system. No reclaim ever iccurs.
 
 However, if you do things that require lots of
 allocations etc etc then 
 the page pinning can easily lead to livelock if
 reclaim is finally 
 triggerd and also strange OOM situations since the
 VM cannot free any 
 pages. So the main issue that is addressed here is
 reliability of pinned 
 page operations. Better VM integration avoids these
 issues because we can 
 unpin on request to deal with memory shortages.
 
 

I have a question on the basic need for the mmu
notifier stuff wrt rdma hardware and pinning memory.

It seems that the need is to solve potential memory
shortage and overcommit issues by being able to
reclaim pages pinned by rdma driver/hardware. Is my
understanding correct?

If I do understand correctly, then why is rdma page
pinning any different than eg mlock pinning? I imagine
Oracle pins lots of memory (using mlock), how come
they do not run into vm overcommit issues?

Are we up against some kind of breaking c-o-w issue
here that is different between mlock and rdma pinning?

Asked another way, why should effort be spent on a
notifier scheme, and rather not on fixing any memory
accounting problems and unifying how pin pages are
accounted for that get pinned via mlock() or rdma
drivers?

Startup benefits are well understood with the notifier
scheme (ie, not all pages need to be faulted in at
memory region creation time), specially when most of
the memory region is not accessed at all. I would
imagine most of HPC does not work this way though.
Then again, as rdma hardware is applied
(increasingly?) towards apps with short lived
connections, the notifier scheme will help with
startup times.

Kanoj



  

Be a better friend, newshound, and 
know-it-all with Yahoo! Mobile.  Try it now.  
http://mobile.yahoo.com/;_ylt=Ahu06i62sR8HDtDypao8Wcj9tAcJ 


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Christoph Lameter
On Wed, 13 Feb 2008, Kanoj Sarcar wrote:

 It seems that the need is to solve potential memory
 shortage and overcommit issues by being able to
 reclaim pages pinned by rdma driver/hardware. Is my
 understanding correct?

Correct.

 If I do understand correctly, then why is rdma page
 pinning any different than eg mlock pinning? I imagine
 Oracle pins lots of memory (using mlock), how come
 they do not run into vm overcommit issues?

Mlocked pages are not pinned. They are movable by f.e. page migration and 
will potentially be moved by future memory defrag approaches. Currently 
we have the same issues with mlocked pages as with pinned pages. There is 
work in progress to put mlocked pages onto a different lru so that reclaim 
exempts these pages and more work on limiting the percentage of memory 
that can be mlocked.

 Are we up against some kind of breaking c-o-w issue
 here that is different between mlock and rdma pinning?

Not that I know.

 Asked another way, why should effort be spent on a
 notifier scheme, and rather not on fixing any memory
 accounting problems and unifying how pin pages are
 accounted for that get pinned via mlock() or rdma
 drivers?

There are efforts underway to account for and limit mlocked pages as 
described above. Page pinning the way it is done by Infiniband through
increasing the page refcount is treated by the VM as a temporary 
condition not as a permanent pin. The VM will continually try to reclaim 
these pages thinking that the temporary usage of the page must cease 
soon. This is why the use of large amounts of pinned pages can lead to 
livelock situations.

If we want to have pinning behavior then we could mark pinned pages 
specially so that the VM will not continually try to evict these pages. We 
could manage them similar to mlocked pages but just not allow page 
migration, memory unplug and defrag to occur on pinned memory. All of 
these would have to fail. With the notifier scheme the device driver 
could be told to get rid of the pinned memory. This would make these 3 
techniques work despite having an RDMA memory section.

 Startup benefits are well understood with the notifier
 scheme (ie, not all pages need to be faulted in at
 memory region creation time), specially when most of
 the memory region is not accessed at all. I would
 imagine most of HPC does not work this way though.

No, for optimal performance you would want to prefault all pages as 
is done now. The notifier scheme would only become relevant in memory 
shortage situations.

 Then again, as rdma hardware is applied (increasingly?) towards apps 
 with short lived connections, the notifier scheme will help with startup 
 times.

The main use of the notifier scheme is for stability and reliability. The 
pinned pages become unpinnable on request by the VM. So the VM can work 
itself out of memory shortage situations in cooperation with the 
RDMA logic instead of simply failing.

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Pete Wyckoff
[EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800:
 One other area that has not been brought up yet (I think) is the
 applicability of notifiers in letting users know when pinned memory
 is reclaimed by the kernel.  This is useful when a lower-level
 library employs lazy deregistration strategies on memory regions that
 are subsequently released to the kernel via the application's use of
 munmap or sbrk.  Ohio Supercomputing Center has work in this area but
 a generalized approach in the kernel would certainly be welcome.

The whole need for memory registration is a giant pain.  There is no
motivating application need for it---it is simply a hack around
virtual memory and the lack of full VM support in current hardware.
There are real hardware issues that interact poorly with virtual
memory, as discussed previously in this thread.

The way a messaging cycle goes in IB is:

register buf
post send from buf
wait for completion
deregister buf

This tends to get hidden via userspace software libraries into
a single call:

MPI_send(buf)

Now if you actually do the reg/dereg every time, things are very
slow.  So userspace library writers came up with the idea of caching
registrations:

if buf is not registered:
register buf
post send from buf
wait for completion

The second time that the app happens to do a send from the same
buffer, it proceeds much faster.  Spatial locality applies here, and
this caching is generally worth it.  Some libraries have schemes to
limit the size of the registration cache too.
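
For readers outside the MPI world, the caching trick boils down to
something like the libibverbs-flavoured sketch below; the cache
structure and lookup policy are illustrative only, real
implementations (MPICH2, Open MPI, MVAPICH and friends) are far more
elaborate:

#include <stdlib.h>
#include <infiniband/verbs.h>

struct reg_entry {
	void             *addr;
	size_t            len;
	struct ibv_mr    *mr;		/* verbs registration handle */
	struct reg_entry *next;
};

static struct reg_entry *reg_cache;	/* unsorted list, for brevity */

/* Return a registration covering (buf, len), registering on a miss. */
static struct ibv_mr *cached_reg_mr(struct ibv_pd *pd, void *buf, size_t len)
{
	struct reg_entry *e;

	for (e = reg_cache; e; e = e->next)
		if (e->addr == buf && e->len >= len)
			return e->mr;	/* hit: skip the expensive ibv_reg_mr() */

	e = malloc(sizeof(*e));
	if (!e)
		return NULL;
	e->addr = buf;
	e->len  = len;
	e->mr   = ibv_reg_mr(pd, buf, len,
			     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
	if (!e->mr) {
		free(e);
		return NULL;
	}
	e->next   = reg_cache;
	reg_cache = e;
	return e->mr;
}

Every failure mode described below (stale entries after munmap(), an
unbounded pool of pinned memory) lives in exactly this lookup.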

But there are plenty of ways to hurt yourself with such a scheme.
The first being a huge pool of unused but registered memory, as the
library doesn't know the app patterns, and it doesn't know the VM
pressure level in the kernel.

There are plenty of subtle ways that this breaks too.  If the
registered buf is removed from the address space via munmap() or
sbrk() or other ways, the mapping and registration are gone, but the
library has no way of knowing that the app just did this.  Sure the
physical page is still there and pinned, but the app cannot get at
it.  Later if new address space arrives at the same virtual address
but a different physical page, the library will mistakenly think it
already has it registered properly, and data is transferred from
this old now-unmapped physical page.

The whole situation is rather ridiculous, but we are quite stuck
with it for current generation IB and iWarp hardware.  If we can't
have the kernel interact with the device directly, we could at least
manage state in these multiple userspace registration caches.  The
VM could ask for certain (or any) pages to be released, and the
library would respond if they are indeed not in use by the device.
The app itself does not know about pinned regions, and the library
is aware of exactly which regions are potentially in use.

Since the great majority of userspace messaging over IB goes through
middleware like MPI or PGAS languages, and they all have the same
approach to registration caching, this approach could fix the
problem for a big segment of use cases.

More text on the registration caching problem is here:

http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

with an approach using vm_ops open and close operations in a kernel
module here:

http://www.osc.edu/~pw/dreg/

There is a place for VM notifiers in RDMA messaging, but not in
talking to devices, at least not the current set.  If you can define
a reasonable userspace interface for VM notifiers, libraries can
manage registration caches more efficiently, letting the kernel
unmap pinned pages as it likes.
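
To illustrate, the library side of such an interface might look like
the sketch below, reusing struct reg_entry from the earlier sketch.
The notification delivery and the in-use test are assumptions about
an interface that does not exist yet; only ibv_dereg_mr() is a real
call:

/* Assumed helper: does the device still have work outstanding on e? */
int region_in_use_by_device(struct reg_entry *e);

/* Hypothetical: invoked when the kernel wants [start, start+len) back. */
static void reg_cache_invalidate_range(void *start, size_t len)
{
	struct reg_entry **pp = &reg_cache, *e;

	while ((e = *pp) != NULL) {
		int overlaps = (char *)e->addr < (char *)start + len &&
			       (char *)start < (char *)e->addr + e->len;

		if (overlaps && !region_in_use_by_device(e)) {
			*pp = e->next;		/* unlink the cache entry */
			ibv_dereg_mr(e->mr);	/* drop the pin so the VM can reclaim */
			free(e);
		} else {
			pp = &e->next;
		}
	}
}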

-- Pete


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Kanoj Sarcar

--- Christoph Lameter [EMAIL PROTECTED] wrote:

 On Wed, 13 Feb 2008, Kanoj Sarcar wrote:
 
  It seems that the need is to solve potential
 memory
  shortage and overcommit issues by being able to
  reclaim pages pinned by rdma driver/hardware. Is
 my
  understanding correct?
 
 Correct.
 
  If I do understand correctly, then why is rdma
 page
  pinning any different than eg mlock pinning? I
 imagine
  Oracle pins lots of memory (using mlock), how come
  they do not run into vm overcommit issues?
 
 Mlocked pages are not pinned. They are movable by
 f.e. page migration and 
 will be potentially be moved by future memory defrag
 approaches. Currently 
 we have the same issues with mlocked pages as with
 pinned pages. There is 
 work in progress to put mlocked pages onto a
 different lru so that reclaim 
 exempts these pages and more work on limiting the
 percentage of memory 
 that can be mlocked.
 
  Are we up against some kind of breaking c-o-w
 issue
  here that is different between mlock and rdma
 pinning?
 
 Not that I know.
 
  Asked another way, why should effort be spent on a
  notifier scheme, and rather not on fixing any
 memory
  accounting problems and unifying how pin pages are
  accounted for that get pinned via mlock() or rdma
  drivers?
 
 There are efforts underway to account for and limit
 mlocked pages as 
 described above. Page pinning the way it is done by
 Infiniband through
 increasing the page refcount is treated by the VM as
 a temporary 
 condition not as a permanent pin. The VM will
 continually try to reclaim 
 these pages thinking that the temporary usage of the
 page must cease 
 soon. This is why the use of large amounts of pinned
 pages can lead to 
 livelock situations.

Oh ok, yes, I did see the discussion on this; sorry I
missed it. I do see what notifiers bring to the table
now (without endorsing it :-)).

An orthogonal question is this: is IB/rdma the only
culprit that elevates page refcounts? Are there no
other subsystems which do a similar thing?

The example I am thinking about is rawio (Oracle's
mlock'ed SHM regions are handed to rawio, aren't they?).
My understanding of how rawio works in Linux is quite
dated though ...

Kanoj

 
 If we want to have pinning behavior then we could
 mark pinned pages 
 specially so that the VM will not continually try to
 evict these pages. We 
 could manage them similar to mlocked pages but just
 not allow page 
 migration, memory unplug and defrag to occur on
 pinned memory. All of 
 theses would have to fail. With the notifier scheme
 the device driver 
 could be told to get rid of the pinned memory. This
 would make these 3 
 techniques work despite having an RDMA memory
 section.
 
  Startup benefits are well understood with the
 notifier
  scheme (ie, not all pages need to be faulted in at
  memory region creation time), specially when most
 of
  the memory region is not accessed at all. I would
  imagine most of HPC does not work this way though.
 
 No for optimal performance  you would want to
 prefault all pages like 
 it is now. The notifier scheme would only become
 relevant in memory 
 shortage situations.
 
  Then again, as rdma hardware is applied
 (increasingly?) towards apps 
  with short lived connections, the notifier scheme
 will help with startup 
  times.
 
 The main use of the notifier scheme is for stability
 and reliability. The 
 pinned pages become unpinnable on request by the
 VM. So the VM can work 
 itself out of memory shortage situations in
 cooperation with the 
 RDMA logic instead of simply failing.
 
 --
 To unsubscribe, send a message with 'unsubscribe
 linux-mm' in
 the body to [EMAIL PROTECTED]  For more info on
 Linux MM,
 see: http://www.linux-mm.org/ .
 Don't email: <a href=mailto:[EMAIL PROTECTED]> [EMAIL PROTECTED] </a>
 



  

Looking for last minute shopping deals?  
Find them fast with Yahoo! Search.  
http://tools.search.yahoo.com/newsearch/category.php?category=shopping

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] Demand paging for memory regions

2008-02-13 Thread Jesse Barnes
On Wednesday, February 13, 2008 3:43 pm Kanoj Sarcar wrote:
 Oh ok, yes, I did see the discussion on this; sorry I
 missed it. I do see what notifiers bring to the table
 now (without endorsing it :-)).

 An orthogonal question is this: is IB/rdma the only
 culprit that elevates page refcounts? Are there no
 other subsystems which do a similar thing?

 The example I am thinking about is rawio (Oracle's
 mlock'ed SHM regions are handed to rawio, isn't it?).
 My understanding of how rawio works in Linux is quite
 dated though ...

We're doing something similar in the DRM these days...  We need big chunks of 
memory to be pinned so that the GPU can operate on them, but when the 
operation completes we can allow them to be swappable again.  I think with 
the current implementation, allocations are always pinned, but we'll 
definitely want to change that soon.

Dave?

Jesse

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Jason Gunthorpe
On Wed, Feb 13, 2008 at 06:23:08PM -0500, Pete Wyckoff wrote:
 [EMAIL PROTECTED] wrote on Tue, 12 Feb 2008 20:09 -0800:
  One other area that has not been brought up yet (I think) is the
  applicability of notifiers in letting users know when pinned memory
  is reclaimed by the kernel.  This is useful when a lower-level
  library employs lazy deregistration strategies on memory regions that
  are subsequently released to the kernel via the application's use of
  munmap or sbrk.  Ohio Supercomputing Center has work in this area but
  a generalized approach in the kernel would certainly be welcome.
 
 The whole need for memory registration is a giant pain.  There is no
 motivating application need for it---it is simply a hack around
 virtual memory and the lack of full VM support in current hardware.
 There are real hardware issues that interact poorly with virtual
 memory, as discussed previously in this thread.

Well, the registrations also exist to provide protection against
rogue/faulty remotes, but for the purposes of MPI that is probably not
important.

Here is a thought.. Some RDMA hardware can change the page tables on
the fly. What if the kernel had a mechanism to dynamically maintain a
full registration of the processes entire address space ('mlocked' but
able to be migrated)? MPI would never need to register a buffer, and
all the messy cases with munmap/sbrk/etc go away - the risk is that
other MPI nodes can randomly scribble all over the process :)

Christoph: It seemed to me you were first talking about
freeing/swapping/faulting RDMA'able pages - but would pure migration
as a special hardware supported case be useful like Catilan suggested?

Regards,
Jason

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ofa-general] Re: Demand paging for memory regions

2008-02-13 Thread Andrea Arcangeli
Hi Kanoj,

On Wed, Feb 13, 2008 at 03:43:17PM -0800, Kanoj Sarcar wrote:
 Oh ok, yes, I did see the discussion on this; sorry I
 missed it. I do see what notifiers bring to the table
 now (without endorsing it :-)).

I'm not really sure livelocks are the big issue here.

I'm running N 1G VMs on a 1G ram system, with (N-1)G swapped
out. Combining this with auto-ballooning, rss limiting, and ksm ram
sharing, provides really advanced and lowlevel virtualization VM
capabilities to the linux kernel while at the same time guaranteeing
no oom failures as long as the guest pages are lower than ram+swap
(just slower runtime if too many pages are unshared or if the balloons
are deflated etc..).

Swapping the virtual machine in the host may be more efficient than
having the guest swapping over a virtual swap paravirt storage for
example. As more management features are added admins will gain more
experience in handling those new features and they'll find what's best
for them. mmu notifiers and real reliable swapping are the enabler for
those more advanced VM features.

oom livelocks wouldn't happen anyway with KVM as long as the maximal
amount of guest physical memory is lower than RAM.

 An orthogonal question is this: is IB/rdma the only
 culprit that elevates page refcounts? Are there no
 other subsystems which do a similar thing?
 
 The example I am thinking about is rawio (Oracle's
 mlock'ed SHM regions are handed to rawio, isn't it?).
 My understanding of how rawio works in Linux is quite
 dated though ...

rawio in flight I/O shall be limited. As long as each task can't pin
more than X ram, and the ram is released when the task is oom killed,
and the first get_user_pages/alloc_pages/slab_alloc that returns
-ENOMEM takes an oom fail path that returns failure to userland,
everything is ok.

Even with IB deadlock could only happen if IB would allow unlimited
memory to be pinned down by unprivileged users.

If IB is insecure and DoSable without mmu notifiers, then I'm not sure
how enabling swapping of the IB memory could be enough to fix the
DoS. Keep in mind that even tmpfs can't be safe allowing all ram+swap
to be allocated in a tmpfs file (despite the tmpfs file storage
includes swap and not only ram). Pinning the whole ram+swap with tmpfs
livelocks the same way of pinning the whole ram with ramfs. So if you
add mmu notifier support to IB, you only need to RDMA an area as large
as ram+swap to livelock again as before... no difference at all.

I don't think livelocks have anything to do with mmu notifiers (other
than deferring the livelock to the swap+ram point of no return
instead of the current ram point of no return). Livelocks have to be
solved the usual way: handling alloc_pages/get_user_pages/slab
allocation failures with a fail path that returns to userland and
allows the ram to be released if the task was selected for
oom-killage.

The real benefit of the mmu notifiers for IB would be to allow the
rdma region to be larger than RAM without triggering the oom
killer (or without triggering a livelock if it's DoSable but then the
livelock would need fixing to be converted in a regular oom-killing by
some other mean not related to the mmu-notifier, it's really an
orthogonal problem).

So suppose you've a MPI simulation that requires a 10G array and
you've only 1G of ram, then you can rdma over 10G as if you had 10G
of ram. Things will perform ok only if there's some huge locality of
the computations. For virtualization it's orders of magnitude more
useful than for computer clusters, but certain simulations really swap,
so I don't exclude that certain RDMA apps will also need this (dunno about
IB).

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] Qemu powerpc work around

2008-02-13 Thread Jerone Young
This works better. Not sure why but when I had fake-exec in target-ppc,
the build system was complaining that it could not find fake-exec.d. So
then I just decided to move it to fake-exec-ppc.c. 

This patch works fine for powerpc.

On Wed, 2008-02-13 at 12:55 -0600, Anthony Liguori wrote:
 Jerone Young wrote:
  On Wed, 2008-02-13 at 09:29 +0200, Avi Kivity wrote:

  Jerone Young wrote:
  
   So the recent code in qemu cvs has a problem on powerpc. So what I have done
   is mainly work around this in the build system, by creating a
   ppcemb_kvm-sofmmu target. Along with this is a fake-exec.c that stubs
   out the functions that are no longer defined (something done by Anthony
   Liguori attempting to fix qemu_cvs). What do folks think about this
   approach? For us, all we really need is a qemu that is not built with a tcg
   dependency.
 


  Since a target in qemu is a cpu type, how the instructions are executed 
  (kvm, kqemu, dyngen, or tcg) shouldn't come into it.  Instead we can 
  have a --without-cpu-emulation or --no-tcg which would simply disable 
  those parts.
  
 
  Actually this much much more sensible solution. So I took some time and
  implemented it.

 
 Funny enough, I was thinking the same thing last night :-)
 
 Please move fake-exec.c to target-ppc/fake-exec.c as it contains PPC 
 specific code.  Otherwise, this patch is much better!
 
 Regards,
 
 Anthony Liguori
 
 
 -
 This SF.net email is sponsored by: Microsoft
 Defy all challenges. Microsoft(R) Visual Studio 2008.
 http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
 ___
 kvm-devel mailing list
 kvm-devel@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/kvm-devel


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel