Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Rusty Russell
On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
 Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
 Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
 Venet: tput = 4050Mb/s, round-trip = 15255pps (65us rtt)

That rtt time is awful.  I know the notification suppression heuristic
in qemu sucks.

I could dig through the code, but I'll ask directly: what heuristic do
you use for notification prevention in your venet_tap driver?

As you point out, 350-450 is possible, which is still bad, and it's at least
partially caused by the exit to userspace and two system calls.  If virtio_net
had a backend in the kernel, we'd be able to compare numbers properly.

 Bare metal: tput = 9717Mb/s, round-trip = 30396pps (33us rtt)
 Virtio-net: tput = 4578Mb/s, round-trip = 249pps (4016us rtt)
 Venet: tput = 5802Mb/s, round-trip = 15127pps (66us rtt)
 
 Note that even the throughput was slightly better in this test for venet,
 though neither venet nor virtio-net could achieve line-rate.  I suspect
 some tuning may allow these numbers to improve, TBD.

At some point, the copying will hurt you.  This is fairly easy to avoid on
xmit though.

Cheers,
Rusty.


Re: strange guest slowness after some time

2009-04-01 Thread Tomasz Chmielewski

David S. Ahern wrote:


Could you add an (unused) e1000 interface to your virtio guests?
As this issue happens rarely for me, maybe you could help to
reproduce it as well (i.e. if the network gets slow on the virtio
interface, give e1000 an IP address, and check whether the network
is also slow on e1000 on the very same guest).

Will do and report

BTW, what CPU do you have?

One dual core Opteron 2212
Note: I will upgrade to two Shanghai Quad-Cores in 2 weeks and test
with those as well.

I have this slowness on an Intel CPU as well, after about 10 days of
guest uptime (using virtio net):

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU 3050 @ 2.13GHz



For the Intel server, the guest is using the e1000 NIC or virtio or
other? I have a few DL320G5s with this processor; I have not hit this
problem running rhel3 and rhel4 guests using e1000/scsi devices.


As I mentioned, it was using virtio net.

Guests running with e1000 (and virtio_blk) don't have this problem.


--
Tomasz Chmielewski
http://wpkg.org



Re: [PATCH 4/4] Fix task switching.

2009-04-01 Thread Jan Kiszka
Gleb Natapov wrote:
 On Tue, Mar 31, 2009 at 05:21:16PM +0200, Kohl, Bernhard (NSN - DE/Munich) wrote:
 Bernhard Kohl wrote:
 Jan Kiszka wrote:
 Gleb Natapov wrote:
 The patch fixes two problems with task switching.
 1. Back link is written to a wrong TSS.
 2. Instruction emulation is not needed if the reason for task switch
    is a task gate in IDT and access to it is caused by an external event.
 2 is currently solved only for VMX since there is no reliable way to
 skip an instruction in SVM. We should emulate it instead.
 Does this series fix all issues Bernhard, Thomas and Julian stumbled over?

 Jan
 I will try this today. Thanks.

 Yes, it works for us (Thomas + Bernhard).

 Great. Thanks for testing.
 

Same here: No obvious regressions found while running various NMI/IRQ tests.

Jan





[PATCH 1/2] KVM: VMX: Clean up Flex Priority related

2009-04-01 Thread Sheng Yang
And clean up parentheses on returns.

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/vmx.c |   47 ++-
 1 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..1caa1fc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -216,61 +216,69 @@ static inline int is_external_interrupt(u32 intr_info)
 
 static inline int cpu_has_vmx_msr_bitmap(void)
 {
-   return (vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS);
+   return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_USE_MSR_BITMAPS;
 }
 
 static inline int cpu_has_vmx_tpr_shadow(void)
 {
-   return (vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW);
+   return vmcs_config.cpu_based_exec_ctrl & CPU_BASED_TPR_SHADOW;
 }
 
 static inline int vm_need_tpr_shadow(struct kvm *kvm)
 {
-   return ((cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm)));
+   return (cpu_has_vmx_tpr_shadow()) && (irqchip_in_kernel(kvm));
 }
 
 static inline int cpu_has_secondary_exec_ctrls(void)
 {
-   return (vmcs_config.cpu_based_exec_ctrl &
-   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
+   return vmcs_config.cpu_based_exec_ctrl &
+   CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
 }
 
 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
-   return flexpriority_enabled;
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES;
+}
+
+static inline bool cpu_has_vmx_flexpriority(void)
+{
+   return cpu_has_vmx_tpr_shadow() &&
+   cpu_has_vmx_virtualize_apic_accesses();
 }
 
 static inline int cpu_has_vmx_invept_individual_addr(void)
 {
-   return (!!(vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT));
+   return !!(vmx_capability.ept & VMX_EPT_EXTENT_INDIVIDUAL_BIT);
 }
 
 static inline int cpu_has_vmx_invept_context(void)
 {
-   return (!!(vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT));
+   return !!(vmx_capability.ept & VMX_EPT_EXTENT_CONTEXT_BIT);
 }
 
 static inline int cpu_has_vmx_invept_global(void)
 {
-   return (!!(vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT));
+   return !!(vmx_capability.ept & VMX_EPT_EXTENT_GLOBAL_BIT);
 }
 
 static inline int cpu_has_vmx_ept(void)
 {
-   return (vmcs_config.cpu_based_2nd_exec_ctrl &
-   SECONDARY_EXEC_ENABLE_EPT);
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_EPT;
 }
 
 static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm)
 {
-   return ((cpu_has_vmx_virtualize_apic_accesses()) &&
-   (irqchip_in_kernel(kvm)));
+   return flexpriority_enabled &&
+   (cpu_has_vmx_virtualize_apic_accesses()) &&
+   (irqchip_in_kernel(kvm));
 }
 
 static inline int cpu_has_vmx_vpid(void)
 {
-   return (vmcs_config.cpu_based_2nd_exec_ctrl &
-   SECONDARY_EXEC_ENABLE_VPID);
+   return vmcs_config.cpu_based_2nd_exec_ctrl &
+   SECONDARY_EXEC_ENABLE_VPID;
 }
 
 static inline int cpu_has_virtual_nmis(void)
@@ -278,6 +286,11 @@ static inline int cpu_has_virtual_nmis(void)
 return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS;
 }
 
+static inline bool report_flexpriority(void)
+{
+   return flexpriority_enabled;
+}
+
 static int __find_msr_index(struct vcpu_vmx *vmx, u32 msr)
 {
int i;
@@ -1201,7 +1214,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
if (!cpu_has_vmx_ept())
enable_ept = 0;
 
-   if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
+   if (!cpu_has_vmx_flexpriority())
flexpriority_enabled = 0;
 
min = 0;
@@ -3655,7 +3668,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
.check_processor_compatibility = vmx_check_processor_compat,
.hardware_enable = hardware_enable,
.hardware_disable = hardware_disable,
-   .cpu_has_accelerated_tpr = cpu_has_vmx_virtualize_apic_accesses,
+   .cpu_has_accelerated_tpr = report_flexpriority,
 
.vcpu_create = vmx_create_vcpu,
.vcpu_free = vmx_free_vcpu,
-- 
1.5.4.5



[PATCH 2/2] KVM: VMX: Fix feature testing

2009-04-01 Thread Sheng Yang
The feature testing is now done too early, before vmcs_config completes
initialization.

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/vmx.c |   18 +-
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1caa1fc..7d7b0d6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1208,15 +1208,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
  vmx_capability.ept, vmx_capability.vpid);
}
 
-   if (!cpu_has_vmx_vpid())
-   enable_vpid = 0;
-
-   if (!cpu_has_vmx_ept())
-   enable_ept = 0;
-
-   if (!cpu_has_vmx_flexpriority())
-   flexpriority_enabled = 0;
-
min = 0;
 #ifdef CONFIG_X86_64
min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -1320,6 +1311,15 @@ static __init int hardware_setup(void)
if (boot_cpu_has(X86_FEATURE_NX))
kvm_enable_efer_bits(EFER_NX);
 
+   if (!cpu_has_vmx_vpid())
+   enable_vpid = 0;
+
+   if (!cpu_has_vmx_ept())
+   enable_ept = 0;
+
+   if (!cpu_has_vmx_flexpriority())
+   flexpriority_enabled = 0;
+
return alloc_kvm_area();
 }
 
-- 
1.5.4.5



Re: [PATCH 1/2] KVM: VMX: Clean up Flex Priority related

2009-04-01 Thread Avi Kivity

Sheng Yang wrote:

And clean up parentheses on returns.
  


Applied, thanks.  Bad bugs on my part :(



Re: Use rsvd_bits_mask in load_pdptrs for cleanup and considering EXB bit

2009-04-01 Thread Avi Kivity

Dong, Eddie wrote:

Looks good, but doesn't apply; please check if you are working against
the latest version.



Rebased on top of a317a1e496b22d1520218ecf16a02498b99645e2 + previous rsvd bits 
violation check patch.
  


Applied, thanks.



Re: mmu_pages_next() question

2009-04-01 Thread Avi Kivity

Marcelo Tosatti wrote:

On Sun, Mar 29, 2009 at 03:24:08PM +0300, Avi Kivity wrote:
  

static int mmu_pages_next(struct kvm_mmu_pages *pvec,
			  struct mmu_page_path *parents,
			  int i)
{
	int n;

	for (n = i+1; n < pvec->nr; n++) {
		struct kvm_mmu_page *sp = pvec->page[n].sp;

		if (sp->role.level == PT_PAGE_TABLE_LEVEL) {
			parents->idx[0] = pvec->page[n].idx;
			return n;
		}

		parents->parent[sp->role.level-2] = sp;
		parents->idx[sp->role.level-1] = pvec->page[n].idx;
	}

	return n;
}
  
Do we need to break out of the loop if we switch parents during the loop  
(since that will give us a different mmu_page_path)?  Or are callers  
careful to only pass pvecs which belong to the same shadow page?



This function builds mmu_page_path for a number of pagetable (leaf)
pages. Whenever the path changes, mmu_page_path will be rebuilt.

The pages in the pvec must be organized as follows:

level4, level3, level2, level1, level1, level1, , level3, level2,
level1, level1, ...

So you don't have to repeat higher levels for a number of leaf pages.
  


I'm still missing something.   That if () tests for level == 
PT_PAGE_TABLE_LEVEL.  So it looks like we'll have batch sizes of 4, 1, 
1, 1, ... 3, 1, 1, 1, ...?





Re: [PATCH v2 1/5] Fix handling of a fault during NMI unblocked due to IRET

2009-04-01 Thread Avi Kivity

Gleb Natapov wrote:

Bit 12 is undefined in any of the following cases:
 If the VM exit sets the valid bit in the IDT-vectoring information field.
 If the VM exit is due to a double fault.
  


Applied the entire series, thanks.



Re: [PATCH 0/4] fix header-sync using --with-patched-kernel

2009-04-01 Thread Avi Kivity

Mark McLoughlin wrote:

Hi Avi,
Here are a few fairly trivial build patches - they fix
building kvm.git using --with-patched-kernel alongside an unconfigured
kvm.git tree.
  


Applied all, thanks.



Re: [PATCH 2/2] kvm: qemu: check device assignment command

2009-04-01 Thread Avi Kivity

Han, Weidong wrote:

pci_parse_devaddr parses [[domain:]bus:]slot; it's valid even when only a slot
is entered, whereas the device assignment command requires bus:slot.func
(-pcidevice host=bus:slot.func). So I implemented a dedicated function to parse
the device BDF in the device assignment command, rather than mixing the two
parsing functions together.

  


Applied, thanks.



Re: Can't download kvmctl scripts

2009-04-01 Thread Avi Kivity

Brent A Nelson wrote:

URL: http://www.linux-kvm.org/page/HowToConfigScript

The kvmctl scripts in the HowTo pages can't be downloaded, as the 
download links are actually uploads.


Copying smintz...



RE: [PATCH] KVM: Qemu: Do not use log dirty in IA64

2009-04-01 Thread Zhang, Xiantao
Avi Kivity wrote:
 Zhang, Yang wrote:
 hi
  please checkin it to kvm85, thanks!
 
 IA64 does not support log dirty. We should not use it
 in IA64, or it will cause problems.
 
 
 Applied, thanks.  When are you planning to add support for log dirty
 on ia64?

We had the patch at hand, but there are still other issues which block it
upstream, so we haven't tested it yet.

Xiantao


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Rusty Russell wrote:
 On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
   
 Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
 Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
 Venet: tput = 4050Mb/s, round-trip = 15255pps (65us rtt)
 

 That rtt time is awful.  I know the notification suppression heuristic
 in qemu sucks.

 I could dig through the code, but I'll ask directly: what heuristic do
 you use for notification prevention in your venet_tap driver?
   

I am not 100% sure I know what you mean by "notification prevention",
but let me take a stab at it.

So like most of these kinds of constructs, I have two rings (rx + tx on
the guest is reversed to tx + rx on the host), each of which can signal
in either direction for a total of 4 events, 2 on each side of the
connection.  I utilize what I call "bidirectional napi" so that only the
first packet submitted needs to signal across the guest/host boundary.
E.g. the first ingress packet injects an interrupt, and then does a
napi_schedule and masks future irqs.  Likewise, the first egress packet does
a hypercall, and then does a napi_schedule (I don't actually use napi
in this path, but it's conceptually identical) and masks future
hypercalls.  So that is my first form of what I would call notification
prevention.

The second form occurs on the tx-complete path (that is, guest->host
tx).  I only signal back to the guest to reclaim its skbs every 10
packets, or if I drain the queue, whichever comes first (note to self:
make this # configurable).

The nice part about this scheme is it significantly reduces the amount
of guest/host transitions, while still providing the lowest latency
response for single packets possible.  e.g. Send one packet, and you get
one hypercall, and one tx-complete interrupt as soon as it queues on the
hardware.  Send 100 packets, and you get one hypercall and 10
tx-complete interrupts as frequently as every tenth packet queues on the
hardware.  There is no timer governing the flow, etc.

Is that what you were asking?
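
For illustration, here is a rough, standalone sketch of those two suppression
points (all names and helpers are hypothetical, not the actual venet/vbus code):

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical ring state -- illustrative only, not the real venet code. */
struct ring {
	bool host_notified;	/* set once the host was kicked for this burst */
	int  completed;		/* packets completed since the last guest irq */
	int  pending;		/* packets still queued */
};

static void hypercall_kick(struct ring *tx)         { printf("hypercall\n"); }
static void inject_guest_interrupt(struct ring *tx) { printf("tx-complete irq\n"); }

/* Guest egress: only the first packet of a burst signals the host. */
static void guest_xmit(struct ring *tx)
{
	tx->pending++;
	if (!tx->host_notified) {
		tx->host_notified = true;	/* mask further notifications */
		hypercall_kick(tx);		/* one guest->host transition */
	}
}

/* Host tx-complete: signal the guest every 10 packets or on queue drain. */
static void host_tx_complete(struct ring *tx)
{
	tx->pending--;
	if (++tx->completed % 10 == 0 || tx->pending == 0)
		inject_guest_interrupt(tx);
}

int main(void)
{
	struct ring tx = { 0 };
	int i;

	for (i = 0; i < 100; i++)
		guest_xmit(&tx);
	for (i = 0; i < 100; i++)
		host_tx_complete(&tx);
	return 0;	/* expect 1 hypercall and 10 tx-complete interrupts */
}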

 As you point out, 350-450 is possible, which is still bad, and it's at least
 partially caused by the exit to userspace and two system calls.  If virtio_net
 had a backend in the kernel, we'd be able to compare numbers properly.
   
:)

But that is the whole point, isn't it?  I created vbus specifically as a
framework for putting things in the kernel, and that *is* one of the
major reasons it is faster than virtio-net...it's not the difference in,
say, IOQs vs virtio-ring (though note I also think some of the
innovations we have added such as bi-dir napi are helping too, but these
are not in-kernel specific kinds of features and could probably help
the userspace version too).

I would be entirely happy if you guys accepted the general concept and
framework of vbus, and then worked with me to actually convert what I
have as venet-tap into essentially an in-kernel virtio-net.  I am not
specifically interested in creating a competing pv-net driver...I just
needed something to showcase the concepts and I didn't want to hack the
virtio-net infrastructure to do it until I had everyone's blessing. 
Note to maintainers: I *am* perfectly willing to maintain the venet
drivers if, for some reason, we decide that we want to keep them as
is.   It's just an ideal for me to collapse virtio-net and venet-tap
together, and I suspect our community would prefer this as well.

-Greg





Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Andi Kleen wrote:
 Gregory Haskins ghask...@novell.com writes:

 What might be useful is if you could expand a bit more on what the high level
 use cases for this. 

 Questions that come to mind and that would be good to answer:

 This seems to be aimed at having multiple VMs talk
 to each other, but not talk to the rest of the world, correct? 
 Is that a common use case? 
   

Actually we didn't design specifically for either type of environment. 
I think it would, in fact, be well suited to either type of
communication model, even concurrently (e.g. an intra-vm ipc channel
resource could live right on the same bus as a virtio-net and a
virtio-disk resource)

 Wouldn't they typically have a default route  anyways and be able to talk to 
 each 
 other this way? 
 And why can't any such isolation be done with standard firewalling? (it's 
 known that 
 current iptables has some scalability issues, but there's work going on right
 now to fix that). 
   
vbus itself, and even some of the higher level constructs we apply on
top of it (like venet) are at a different scope than I think what you
are getting at above.  Yes, I suppose you could create a private network
using the existing virtio-net + iptables.  But you could also do the
same using virtio-net and a private bridge device as well.  That is not
what we are trying to address.

What we *are* trying to address is making an easy way to declare virtual
resources directly in the kernel so that they can be accessed more
efficiently.  Contrast that to the way it's done today, where the models
live in, say, qemu userspace.

So instead of having
guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
guest->host->[iptables|bridge].  How you make your private network (if
that is what you want to do) is orthogonal...it's the path to get there
that we changed.

 What would be the use cases for non networking devices?

 How would the interfaces to the user look like?
   

I am not sure if you are asking about the guest's perspective or the
host-administrator's perspective.

First, let's look at the low-level device interface from the guest's
perspective.  We can cover the admin perspective in a separate doc, if
need be.

Each device in vbus supports two basic verbs: CALL, and SHM

int (*call)(struct vbus_device_proxy *dev, u32 func,
void *data, size_t len, int flags);

int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
   void *ptr, size_t len,
   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
   int flags);

CALL provides a synchronous method for invoking some verb on the device
(defined by func) with some arbitrary data.  The namespace for func
is part of the ABI for the device in question.  It is analogous to an
ioctl, with the primary difference being that it's remotable (it invokes
from the guest driver across to the host device).

SHM provides a way to register shared-memory with the device which can
be used for asynchronous communication.  The memory is always owned by
the north (the guest), while the south (the host) simply maps it
into its address space.  You can optionally establish a shm_signal
object on this memory for signaling in either direction, and I
anticipate most shm regions will use this feature.  Each shm region has
an id namespace, which like the func namespace from the CALL method
is completely owned by the device ABI.  For example, we might have
id's of RX-RING and TX-RING, etc.

From there, we can (hopefully) build an arbitrary type of IO service to
map on top.  So for instance, for venet-tap, we have CALL verbs for
things like MACQUERY, and LINKUP, and we have SHM ids for RX-QUEUE and
TX-QUEUE.  We can write a driver that speaks this ABI on the bottom
edge, and presents a normal netif interface on the top edge.  So the
actual consumption of these resources can look just like any other
resource of a similar type.
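
To make that concrete, here is a hypothetical sketch of how a guest-side
driver might use the two verbs during probe; the func/id constants and the
venet_probe() flow are made up for illustration and are not the actual
venet ABI:

/* Hypothetical guest-driver probe sketch -- constants, ringbuf layout and
 * venet_probe() flow are illustrative, not the actual venet ABI. */
#include <stddef.h>

typedef unsigned char u8;
typedef unsigned int u32;

struct shm_signal;
struct shm_signal_desc;

/* Stand-in for the proxy ops quoted above. */
struct vbus_device_proxy {
	int (*call)(struct vbus_device_proxy *dev, u32 func,
		    void *data, size_t len, int flags);
	int (*shm)(struct vbus_device_proxy *dev, int id, int prio,
		   void *ptr, size_t len,
		   struct shm_signal_desc *sigdesc, struct shm_signal **signal,
		   int flags);
};

enum { VENET_FUNC_MACQUERY = 1, VENET_FUNC_LINKUP = 2 };  /* hypothetical */
enum { VENET_SHM_RX_QUEUE = 0, VENET_SHM_TX_QUEUE = 1 };  /* hypothetical */

static u8 ringbuf[4096];          /* guest-owned memory backing the TX ring */
static struct shm_signal *tx_signal;

static int venet_probe(struct vbus_device_proxy *dev)
{
	u8 mac[6];
	int ret;

	/* CALL: synchronous, ioctl-like verb -- ask the device for its MAC. */
	ret = dev->call(dev, VENET_FUNC_MACQUERY, mac, sizeof(mac), 0);
	if (ret < 0)
		return ret;

	/* SHM: register guest-owned memory as the TX ring; the host simply
	 * maps it, and tx_signal can then be used to signal either way. */
	ret = dev->shm(dev, VENET_SHM_TX_QUEUE, 0, ringbuf, sizeof(ringbuf),
		       NULL, &tx_signal, 0);
	if (ret < 0)
		return ret;

	/* CALL: bring the link up once the rings are in place. */
	return dev->call(dev, VENET_FUNC_LINKUP, NULL, 0, 0);
}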

-Greg





Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-04-01 Thread Gregory Haskins
Avi Kivity wrote:
 Gregory Haskins wrote:
 +struct shm_signal_irq {
 +__u8  enabled;
 +__u8  pending;
 +__u8  dirty;
 +};
 
 Some ABIs may choose to pad this, suggest explicit padding.
 

 Yeah, good idea.  What is the official way to do this these days?  Are
 GCC pragmas allowed?

   

 I just add a __u8 pad[5] in such cases.

Oh, duh.  Dumb question.  I was getting confused with pack, not pad.  :)
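
For reference, a minimal sketch of the explicitly padded layout being
discussed (using stdint types here; the kernel version would use __u8, and
the exact pad size is an assumption):

#include <stdint.h>

/* Explicit padding keeps the shared-memory ABI layout identical everywhere. */
struct shm_signal_irq {
	uint8_t enabled;
	uint8_t pending;
	uint8_t dirty;
	uint8_t pad[5];		/* pad to an 8-byte boundary, as suggested above */
};

_Static_assert(sizeof(struct shm_signal_irq) == 8, "ABI size must not change");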


 +
 +struct shm_signal;
 +
 +struct shm_signal_ops {
 +int  (*inject)(struct shm_signal *s);
 +void (*fault)(struct shm_signal *s, const char *fmt, ...);
 
 Eww.  Must we involve strings and printf formats?
 

 This is still somewhat of an immature part of the design.  It's supposed
 to be used so that by default, it's a panic.  But on the host side, we
 can do something like inject a machine-check.  That way malicious/broken
 guests cannot (should not? ;)) take down the host.  Note today
 I do not map this to anything other than the default panic, so this
 needs some love.

 But given the asynchronous nature of the fault, I want to be sure we
 have decent accounting to avoid bug reports like silent MCE kills the
 guest ;)  At least this way, we can log the fault string somewhere to
 get a clue.
   

 I see.

 This raises a point I've been thinking of - the symmetrical nature of
 the API vs the asymmetrical nature of guest/host or user/kernel
 interfaces.  This is most pronounced in ->inject(); in the host->guest
 direction this is async (host can continue processing while the guest
 is handling the interrupt), whereas in the guest->host direction it is
 synchronous (the guest is blocked while the host is processing the
 call, unless the host explicitly hands off work to a different thread).

Note that this is exactly what I do (though it is device specific). 
venet-tap has an ioq_notifier registered on its rx ring (which is the
tx-ring for the guest) that simply calls ioq_notify_disable() (which
calls shm_signal_disable() under the covers) and it wakes its
rx-thread.  This all happens in the context of the hypercall, which then
returns and allows the vcpu to re-enter guest mode immediately.










Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-04-01 Thread Avi Kivity

Gregory Haskins wrote:
Note that this is exactly what I do (though it is device specific).
venet-tap has an ioq_notifier registered on its rx ring (which is the
tx-ring for the guest) that simply calls ioq_notify_disable() (which
calls shm_signal_disable() under the covers) and it wakes its
rx-thread.  This all happens in the context of the hypercall, which then
returns and allows the vcpu to re-enter guest mode immediately.
I think this is suboptimal.  The ring is likely to be cache hot on the 
current cpu, waking a thread will introduce scheduling latency + IPI 
+ cache-to-cache transfers.


On a benchmark setup, host resources are likely to exceed guest 
requirements, so you can throw cpu at the problem and no one notices.  
But I think the bits/cycle figure will decrease, even if bits/sec increases.




one question about virualization and kvm

2009-04-01 Thread Vasiliy Tolstov
Hello!
I have two containers with OS Linux. All files in /usr and /bin are
identical.
Is it possible to mount/bind /usr and /bin into the containers (rather
than copying all the files into each container)?

P.S. Sorry for bad english and may be stupid question.
-- 
Vasiliy Tolstov v.tols...@selfip.ru
Selfip.Ru



Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Andi Kleen
On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
 Andi Kleen wrote:
  Gregory Haskins ghask...@novell.com writes:
 
  What might be useful is if you could expand a bit more on what the high 
  level
  use cases for this. 
 
  Questions that come to mind and that would be good to answer:
 
  This seems to be aimed at having multiple VMs talk
  to each other, but not talk to the rest of the world, correct? 
  Is that a common use case? 

 
 Actually we didn't design specifically for either type of environment. 

But surely you must have some specific use case in mind? Something
that it does better than the various methods that are available
today. Or rather there must be some problem you're trying
to solve. I'm just not sure what that problem exactly is.

 What we *are* trying to address is making an easy way to declare virtual
 resources directly in the kernel so that they can be accessed more
 efficiently.  Contrast that to the way its done today, where the models
 live in, say, qemu userspace.
 
 So instead of having
 guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
 guest->host->[iptables|bridge].  How you make your private network (if

So is the goal more performance or simplicity or what?

  What would be the use cases for non networking devices?
 
  How would the interfaces to the user look like?

 
 I am not sure if you are asking about the guests perspective or the
 host-administators perspective.

I was wondering about the host-administrator's perspective.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [RFC PATCH 01/17] shm-signal: shared-memory signals

2009-04-01 Thread Gregory Haskins
Avi Kivity wrote:
 Gregory Haskins wrote:
 Note that this is exactly what I do (though it is device specific).
 venet-tap has a ioq_notifier registered on its rx ring (which is the
 tx-ring for the guest) that simply calls ioq_notify_disable() (which
 calls shm_signal_disable() under the covers) and it wakes its
 rx-thread.  This all happens in the context of the hypercall, which then
 returns and allows the vcpu to re-enter guest mode immediately.
   
 I think this is suboptimal.

Heh, yes I know this is your (well documented) position, but I
respectfully disagree. :)

CPUs are not getting much faster, but they are rapidly getting more
cores.  If we want to continue to make software run increasingly faster,
we need to actually use those cores IMO.  Generally this means splitting
workloads up into as many threads as possible, as long as you can keep
pipelines filled.

   The ring is likely to be cache hot on the current cpu, waking a
 thread will introduce scheduling latency + IPI
This part is a valid criticism, though note that Linux is very adept at
scheduling so we are talking mere ns/us range here, which is dwarfed by
the latency of something like your typical IO device (e.g. 36us for a
rtt packet on 10GE baremetal, etc).  The benefit, of course, is the
potential for increased parallelism which I have plenty of data to show
we are very much taking advantage of here (I can saturate two cores
almost completely according to LTT traces, one doing vcpu work, and the
other running my rx thread which schedules the packet on the hardware)

 +cache-to-cache transfers.
This one I take exception to.  While it is perfectly true that splitting
the work between two cores has a greater cache impact than staying on
one, you cannot look at this one metric alone and say this is bad. 
It's also a function of how efficiently the second (or more) cores are
utilized.  There will be a point in the curve where the cost of cache
coherence will become marginalized by the efficiency added by the extra
compute power.  Some workloads will invariably be on the bad end of that
curve, and therefore doing the work on one core is better.  However, we
can't ignore that there will be others that are on the good end of this
spectrum either.  Otherwise, we risk performance stagnation on our
effectively uniprocessor box ;).  In addition, the task-scheduler will
attempt to co-locate tasks that are sharing data according to a best-fit
within the cache hierarchy.  Therefore, we will still be sharing as much
as possible (perhaps only L2, L3, or a local NUMA domain, but this is
still better than nothing)

The way I have been thinking about these issues is something I have been
calling soft-asics.  In the early days, we had things like a simple
uniprocessor box with a simple dumb ethernet.  People figured out that
if you put more processing power into the NIC, you could offload that
work from the cpu and do more in parallel.   So things like checksum
computation and segmentation duties were a good fit.  More recently, we
see even more advanced hardware where you can do L2 or even L4 packet
classification right in the hardware, etc.  All of these things are
effectively parallel computation, and it occurs in a completely foreign
cache domain!

So a lot of my research has been around the notion of trying to use some
of our cpu cores to do work like some of the advanced asic based offload
engines do.  The cores are often under utilized anyway, and this will
bring some of the features of advanced silicon to commodity resources. 
They also have the added flexibility that its just software, so you can
change or enhance the system at will.

So if you think about it, by using threads like this in venet-tap, I am
effectively using other cores to do csum/segmentation (if the physical
hardware doesn't support it), layer 2 classification (linux bridging),
filtering (iptables in the bridge), queuing, etc as if it was some
smart device out on the PCI bus.  The guest just queues up packets
independently in its own memory, while the device just dma's the data
on its own (after the initial kick).  The vcpu is keeping the pipeline
filled on its side independently.


 On a benchmark setup, host resources are likely to exceed guest
 requirements, so you can throw cpu at the problem and no one notices.
Sure, but with the type of design I have presented this still sorts
itself out naturally even if the host doesn't have the resources.  For
instance, if there is a large number of threads competing for a small
number of cores, we will simply see things like the rx-thread stalling
and going to sleep, or the vcpu thread backpressuring and going idle
(and therefore sleeping).  All of these things are self throttling.  If
you don't have enough resources to run a workload at a desirable
performance level, the system wasn't sized right to begin with. ;)

   But I think the bits/cycle figure will decrease, even if bits/sec
 increases.

Note that this isn't necessarily a bad thing.  I 

Re: one question about virualization and kvm

2009-04-01 Thread Javier Guerra
On Wed, Apr 1, 2009 at 7:27 AM, Vasiliy Tolstov v.tols...@selfip.ru wrote:
 Hello!
 I have two containers with os linux. All files in /usr and /bin are
 identical.
 Is that possible to mount/bind /usr and /bin to containers? (not copy
 all files to containers).. ?

the problem (and solution) is exactly the same as if they weren't
virtual machines, but real machines: use the network.

simply share the directories with NFS and mount them in your initrd
scripts (preferably read/only).

Another way would be to set up a new image file with a copy of the
directories, and mount it on both virtual machines.  Of course, you
MUST then mount it read-only, and you can't change anything there
without unmounting it from both VMs.

Usually it's not worth it unless you have tens of identical VMs.

-- 
Javier


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Andi Kleen wrote:
 On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
   
 Andi Kleen wrote:
 
 Gregory Haskins ghask...@novell.com writes:

 What might be useful is if you could expand a bit more on what the high 
 level
 use cases for this. 

 Questions that come to mind and that would be good to answer:

 This seems to be aimed at having multiple VMs talk
 to each other, but not talk to the rest of the world, correct? 
 Is that a common use case? 
   
   
 Actually we didn't design specifically for either type of environment. 
 

 But surely you must have some specific use case in mind? Something
 that it does better than the various methods that are available
 today. Or rather there must be some problem you're trying
 to solve. I'm just not sure what that problem exactly is.
   
Performance.  We are trying to create a high performance IO infrastructure.

Ideally we would like to see things like virtual-machines have
bare-metal performance (or as close as possible) using just pure
software on commodity hardware.   The data I provided shows that
something like KVM with virtio-net does a good job on throughput even on
10GE, but the latency is several orders of magnitude slower than
bare-metal.   We are addressing this issue and others like it that are a
result of the current design of out-of-kernel emulation.
   
 What we *are* trying to address is making an easy way to declare virtual
 resources directly in the kernel so that they can be accessed more
 efficiently.  Contrast that to the way its done today, where the models
 live in, say, qemu userspace.

 So instead of having
 guest->host->qemu::virtio-net->tap->[iptables|bridge], you simply have
 guest->host->[iptables|bridge].  How you make your private network (if
 

 So is the goal more performance or simplicity or what?
   

(Answered above)

   
 What would be the use cases for non networking devices?

 How would the interfaces to the user look like?
   
   
 I am not sure if you are asking about the guests perspective or the
 host-administators perspective.
 

 I was wondering about the host-administrators perspective.
   
Ah, ok.  Sorry about that.  It was probably good to document that other
thing anyway, so no harm.

So about the host-administrator interface.  The whole thing is driven by
configfs, and the basics are already covered in the documentation in
patch 2, so I won't repeat it here.  Here is a reference to the file for
everyone's convenience:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=Documentation/vbus.txt;h=e8a05dafaca2899d37bd4314fb0c7529c167ee0f;hb=f43949f7c340bf667e68af6e6a29552e62f59033

So a sufficiently privileged user can instantiate a new bus (e.g.
container) and devices on that bus via configfs operations.  The types
of devices available to instantiate are dictated by whatever vbus-device
modules you have loaded into your particular kernel.  The loaded modules
available are enumerated under /sys/vbus/deviceclass.

Now presumably the administrator knows what a particular module is and
how to configure it before instantiating it.  Once they instantiate it,
it will present an interface in sysfs with a set of attributes.  For
example, an instantiated venet-tap looks like this:

ghask...@test:~ tree /sys/vbus/devices
/sys/vbus/devices
`-- foo
    |-- class -> ../../deviceclass/venet-tap
    |-- client_mac
    |-- enabled
    |-- host_mac
    |-- ifname
    `-- interfaces
        `-- 0 -> ../../../instances/bar/devices/0


Some of these attributes, like class and interfaces, are default
attributes that are filled in by the infrastructure.  Other attributes,
like client_mac and enabled are properties defined by the venet-tap
module itself.  So the administrator can then set these attributes as
desired to manipulate the configuration of the instance of the device,
on a per device basis.

So now imagine we have some kind of disk-io vbus device that is designed
to act kind of like a file-loopback device.  It might define an
attribute allowing you to specify the path to the file/block-dev that
you want it to export.

(Warning: completely fictitious tree output to follow ;)

ghask...@test:~ tree /sys/vbus/devices
/sys/vbus/devices
`-- foo
    |-- class -> ../../deviceclass/vdisk
    |-- src_path
    `-- interfaces
        `-- 0 -> ../../../instances/bar/devices/0

So the admin would instantiate this vdisk device and do:

'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'

To point the device to the file on the host that it wants to present as
a vdisk.  Any guest that has access to the particular bus that contains
this device would then see it as a standard vdisk ABI device (as if
there were such a thing yet) and could talk to it using a vdisk
specific driver.

A property of a vbus is that it is inherited by children.  Today, I do
not have direct support in qemu for creating/configuring vbus devices. 
Instead what I do is I set up the vbus and devices 

Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Gregory Haskins wrote:
 Andi Kleen wrote:
   
 On Wed, Apr 01, 2009 at 08:03:49AM -0400, Gregory Haskins wrote:
   
 
 Andi Kleen wrote:
 
   
 Gregory Haskins ghask...@novell.com writes:

 What might be useful is if you could expand a bit more on what the high 
 level
 use cases for this. 

 Questions that come to mind and that would be good to answer:

 This seems to be aimed at having multiple VMs talk
 to each other, but not talk to the rest of the world, correct? 
 Is that a common use case? 
   
   
 
 Actually we didn't design specifically for either type of environment. 
 
   
 But surely you must have some specific use case in mind? Something
 that it does better than the various methods that are available
 today. Or rather there must be some problem you're trying
 to solve. I'm just not sure what that problem exactly is.
   
 
 Performance.  We are trying to create a high performance IO infrastructure.
   
Actually, I should also state that I am interested in enabling some new
kinds of features based on having in-kernel devices like this.  For
instance (and this is still very theoretical and half-baked), I would
like to try to support RT guests.

[adding linux-rt-users]

I think one of the things that we need in order to do that is being able
to convey vcpu priority state information to the host in an efficient
way.  I was thinking that a shared-page per vcpu could have something
like current and threshold priorities.  The guest modifies current
while the host modifies threshold.   The guest would be allowed to
increase its current priority without a hypercall (after all, if it's
already running, presumably it is already of sufficient priority for the
scheduler).  But if the guest wants to drop below threshold, it needs
to hypercall the host to give it an opportunity to schedule() a new task
(vcpu or not).

The host, on the other hand, could apply a mapping so that the guest's
priority of RT1-RT99 might map to RT20-RT30 on the host, or something
like that.  We would have to take other considerations into account as well, such as
implicit boosting on IRQ injection (e.g. the guest could be in HLT/IDLE
when an interrupt is injected...but by virtue of injecting that
interrupt we may need to boost it to (guest-relative) RT50).
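
A very rough sketch of what such a per-vcpu shared page and the
drop-below-threshold rule might look like (all names are hypothetical, per
the half-baked caveat above):

/* Hypothetical per-vcpu page shared between guest and host. */
struct vcpu_prio_page {
	volatile int current;	/* written by the guest */
	volatile int threshold;	/* written by the host */
};

/* Stand-in for the real hypercall that lets the host schedule() something. */
static void hypercall_yield_priority(struct vcpu_prio_page *p, int prio)
{
	p->current = prio;
}

/* Guest side: raising priority (or staying at/above the host-published
 * threshold) needs no exit; dropping below the threshold must notify
 * the host so it gets a chance to schedule() a new task. */
static void guest_set_priority(struct vcpu_prio_page *p, int prio)
{
	if (prio >= p->current || prio >= p->threshold) {
		p->current = prio;		/* no hypercall needed */
		return;
	}
	hypercall_yield_priority(p, prio);	/* dropping below threshold */
}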

Like I said, this is all half-baked right now.  My primary focus is
improving performance, but I did try to lay the groundwork for taking
things in new directions too..rt being an example.

Hope that helps!
-Greg






Re: Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot

2009-04-01 Thread Avi Kivity

Gleb Natapov wrote:

Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot.
It hangs after printing:
 SMP alternatives: switching to UP code
  


Does dropping bit 8 from context->rsvd_bits_mask[0][1] (PT64_ROOT_LEVEL)
help?




Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot

2009-04-01 Thread Gleb Natapov
Commit 3d28613c225ba94062950dacbb2304b2d2024abc breaks linux boot.
It hangs after printing:
 SMP alternatives: switching to UP code

--
Gleb.


[PATCH] Add shared memory PCI device that shares a memory object betweens VMs

2009-04-01 Thread Cam Macdonell
This patch supports sharing memory between VMs and between the host/VM.  It's a
first cut and comments are encouraged.  The goal is to support simple Inter-VM
communication with zero-copy access to shared memory.

The patch adds the switch -ivshmem (short for Inter-VM shared memory) that is
used as follows: -ivshmem file,size.

The shared memory object named 'file' will be created/opened and mapped onto a
PCI memory device with size 'size'.  The PCI device has two BARs, BAR0 for
registers and BAR1 for the memory region that maps the file above.  The memory
region can be mmapped into userspace on the guest (or read and written if you 
want). 

The register region will eventually be used to support interrupts which are
communicated via unix domain sockets, but I need some tips on how to do this
using a qemu character device. 

Also, feel free to suggest a better name if you have one.

Thanks,
Cam

---
 qemu/Makefile.target |2 +
 qemu/hw/ivshmem.c|  363 ++
 qemu/hw/pc.c |6 +
 qemu/hw/pc.h |3 +
 qemu/qemu-options.hx |   10 ++
 qemu/sysemu.h|7 +
 qemu/vl.c|   12 ++
 7 files changed, 403 insertions(+), 0 deletions(-)
 create mode 100644 qemu/hw/ivshmem.c

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 6eed853..167db55 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -640,6 +640,8 @@ OBJS += e1000.o
 
 # Serial mouse
 OBJS += msmouse.o
+# Inter-VM PCI shared memory
+OBJS += ivshmem.o
 
 ifeq ($(USE_KVM_DEVICE_ASSIGNMENT), 1)
 OBJS+= device-assignment.o
diff --git a/qemu/hw/ivshmem.c b/qemu/hw/ivshmem.c
new file mode 100644
index 000..27db95f
--- /dev/null
+++ b/qemu/hw/ivshmem.c
@@ -0,0 +1,363 @@
+/*
+ * Inter-VM Shared Memory PCI device.
+ *
+ * Author:
+ *  Cam Macdonell c...@cs.ualberta.ca
+ *
+ * Based On: cirrus_vga.c and rtl8139.c
+ *
+ * This code is licensed under the GNU GPL v2.
+ */
+
+#include "hw.h"
+#include "console.h"
+#include "pc.h"
+#include "pci.h"
+#include "sysemu.h"
+
+#include "qemu-common.h"
+#include <sys/mman.h>
+
+#define PCI_COMMAND_IOACCESS    0x0001
+#define PCI_COMMAND_MEMACCESS   0x0002
+#define PCI_COMMAND_BUSMASTER   0x0004
+
+//#define DEBUG_IVSHMEM
+
+#ifdef DEBUG_IVSHMEM
+#define IVSHMEM_DPRINTF(fmt, args...)\
+do {printf("IVSHMEM: " fmt, ##args); } while (0)
+#else
+#define IVSHMEM_DPRINTF(fmt, args...)
+#endif
+
+typedef struct IVShmemState {
+uint16_t intrmask;
+uint16_t intrstatus;
+uint8_t *ivshmem_ptr;
+unsigned long ivshmem_offset;
+unsigned int ivshmem_size;
+unsigned long bios_offset;
+unsigned int bios_size;
+target_phys_addr_t base_ctrl;
+int it_shift;
+PCIDevice *pci_dev;
+unsigned long map_addr;
+unsigned long map_end;
+int ivshmem_mmio_io_addr;
+} IVShmemState;
+
+typedef struct PCI_IVShmemState {
+PCIDevice dev;
+IVShmemState ivshmem_state;
+} PCI_IVShmemState;
+
+typedef struct IVShmemDesc {
+char name[1024];
+int size;
+} IVShmemDesc;
+
+
+/* registers for the Inter-VM shared memory device */
+enum ivshmem_registers {
+IntrMask = 0,
+IntrStatus = 16
+};
+
+static int num_ivshmem_devices = 0;
+static IVShmemDesc ivshmem_desc;
+
+static void ivshmem_map(PCIDevice *pci_dev, int region_num,
+uint32_t addr, uint32_t size, int type)
+{
+PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+IVShmemState *s = &d->ivshmem_state;
+
+IVSHMEM_DPRINTF("addr = %u size = %u\n", addr, size);
+cpu_register_physical_memory(addr, s->ivshmem_size, s->ivshmem_offset);
+
+}
+
+void ivshmem_init(const char * optarg) {
+
+char * temp;
+int size;
+
+num_ivshmem_devices++;
+
+/* currently we only support 1 device */
+if (num_ivshmem_devices > MAX_IVSHMEM_DEVICES) {
+return;
+}
+
+temp = strdup(optarg);
+snprintf(ivshmem_desc.name, 1024, "/%s", strsep(&temp, ","));
+size = atol(temp);
+if ( size == -1) {
+ivshmem_desc.size = TARGET_PAGE_SIZE;
+} else {
+ivshmem_desc.size = size*1024*1024;
+}
+IVSHMEM_DPRINTF("optarg is %s, name is %s, size is %d\n", optarg,
+ivshmem_desc.name,
+ivshmem_desc.size);
+}
+
+int ivshmem_get_size(void) {
+return ivshmem_desc.size;
+}
+
+/* accessing registers - based on rtl8139 */
+static void ivshmem_update_irq(IVShmemState *s)
+{
+int isr;
+isr = (s->intrstatus & s->intrmask) & 0xffff;
+
+/* don't print ISR resets */
+if (isr) {
+IVSHMEM_DPRINTF("Set IRQ to %d (%04x %04x)\n",
+   isr ? 1 : 0, s->intrstatus, s->intrmask);
+}
+
+qemu_set_irq(s->pci_dev->irq[0], (isr != 0));
+}
+
+static void ivshmem_mmio_map(PCIDevice *pci_dev, int region_num,
+   uint32_t addr, uint32_t size, int type)
+{
+PCI_IVShmemState *d = (PCI_IVShmemState *)pci_dev;
+IVShmemState *s = 

[PATCH] Guest device driver for an inter-VM shared memory PCI device that maps a shared file (from /dev/shm) on the device's memory.

2009-04-01 Thread Cam Macdonell
This driver corresponds to the shared memory PCI device that maps
a host file into shared memory on the device.

Accessing the device can be done through creating a device file on the guest

num=`cat /proc/devices | grep kvm_ivshmem | awk '{print $1}'`
mknod --mode=666 /dev/ivshmem $num 0

read, write, lseek and mmap are supported, but mmap is the usual usage to get
zero-copy communication.  The driver contains the initial interrupt support,
but I have not yet added the unix domain socket support for interrupts.
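
For illustration, a minimal userspace sketch of the zero-copy mmap usage
described above (the region size and the message written are assumptions for
the example, not part of the driver ABI):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_SIZE (1024 * 1024)	/* assumed region size for this example */

int main(void)
{
	int fd = open("/dev/ivshmem", O_RDWR);
	if (fd < 0) {
		perror("open /dev/ivshmem");
		return 1;
	}

	/* Map the shared region; writes land directly in the host-backed
	 * object, so other guests mapping the same file see them. */
	char *shm = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (shm == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	strcpy(shm, "hello from this guest");	/* zero-copy write */
	printf("region now contains: %s\n", shm);

	munmap(shm, SHM_SIZE);
	close(fd);
	return 0;
}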

---
 drivers/char/Kconfig   |8 +
 drivers/char/Makefile  |2 +
 drivers/char/kvm_ivshmem.c |  370 
 3 files changed, 380 insertions(+), 0 deletions(-)
 create mode 100644 drivers/char/kvm_ivshmem.c

diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index 735bbe2..afa7cb8 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -1099,6 +1099,14 @@ config DEVPORT
depends on ISA || PCI
default y
 
+config KVM_IVSHMEM
+tristate Inter-VM Shared Memory Device
+depends on PCI
+default m
+help
+  This device maps a region of shared memory between the host OS and any
+  number of virtual machines.
+
 source drivers/s390/char/Kconfig
 
 endmenu
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 9caf5b5..021f06b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -111,6 +111,8 @@ obj-$(CONFIG_PS3_FLASH) += ps3flash.o
 obj-$(CONFIG_JS_RTC)   += js-rtc.o
 js-rtc-y = rtc.o
 
+obj-$(CONFIG_KVM_IVSHMEM)  += kvm_ivshmem.o
+
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
 
diff --git a/drivers/char/kvm_ivshmem.c b/drivers/char/kvm_ivshmem.c
new file mode 100644
index 000..7d46ac4
--- /dev/null
+++ b/drivers/char/kvm_ivshmem.c
@@ -0,0 +1,370 @@
+/*
+ * drivers/char/kvm_ivshmem.c - driver for KVM Inter-VM shared memory PCI device
+ *
+ * Copyright 2009 Cam Macdonell c...@cs.ualberta.ca
+ *
+ * Based on cirrusfb.c and 8139cp.c:
+ * Copyright 1999-2001 Jeff Garzik
+ * Copyright 2001-2004 Jeff Garzik
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/proc_fs.h>
+#include <linux/smp_lock.h>
+#include <asm/uaccess.h>
+#include <linux/interrupt.h>
+
+#define TRUE 1
+#define FALSE 0
+#define KVM_IVSHMEM_DEVICE_MINOR_NUM 0
+
+enum {
+/* KVM Shmem device register offsets */
+IntrMask = 0x00,        /* Interrupt Mask */
+IntrStatus = 0x10,      /* Interrupt Status */
+
+ShmOK = 1               /* Everything is OK */
+};
+
+typedef struct kvm_ivshmem_device {
+void __iomem * regs;
+
+void * base_addr;
+
+unsigned int regaddr;
+unsigned int reg_size;
+
+unsigned int ioaddr;
+unsigned int ioaddr_size;
+unsigned int irq;
+
+bool enabled;
+spinlock_t   dev_spinlock;
+} kvm_ivshmem_device;
+
+static kvm_ivshmem_device kvm_ivshmem_dev;
+
+static int device_major_nr;
+
+static int kvm_ivshmem_mmap(struct file *, struct vm_area_struct *);
+static int kvm_ivshmem_open(struct inode *, struct file *);
+static int kvm_ivshmem_release(struct inode *, struct file *);
+static ssize_t kvm_ivshmem_read(struct file *, char *, size_t, loff_t *);
+static ssize_t kvm_ivshmem_write(struct file *, const char *, size_t, loff_t *);
+static loff_t kvm_ivshmem_lseek(struct file * filp, loff_t offset, int origin);
+
+static const struct file_operations kvm_ivshmem_ops = {
+.owner   = THIS_MODULE,
+.open= kvm_ivshmem_open,
+.mmap= kvm_ivshmem_mmap,
+.read= kvm_ivshmem_read,
+.write   = kvm_ivshmem_write,
+.llseek  = kvm_ivshmem_lseek,
+.release = kvm_ivshmem_release,
+};
+
+static struct pci_device_id kvm_ivshmem_id_table[] = {
+{ 0x1af4, 0x1110, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0 },
+{ 0 },
+};
+MODULE_DEVICE_TABLE (pci, kvm_ivshmem_id_table);
+
+static void kvm_ivshmem_remove_device(struct pci_dev* pdev);
+static int kvm_ivshmem_probe_device (struct pci_dev *pdev,
+const struct pci_device_id * ent);
+
+static struct pci_driver kvm_ivshmem_pci_driver = {
+.name = "kvm-shmem",
+.id_table= kvm_ivshmem_id_table,
+.probe= kvm_ivshmem_probe_device,
+.remove= kvm_ivshmem_remove_device,
+};
+
+static ssize_t kvm_ivshmem_read(struct file * filp, char * buffer, size_t len,
+loff_t * poffset)
+{
+
+int bytes_read = 0;
+unsigned long offset;
+
+offset = *poffset;
+
+printk(KERN_INFO "kvm_ivshmem: trying to read\n");
+if (!kvm_ivshmem_dev.base_addr) {
+printk(KERN_ERR "KVM_IVSHMEM: cannot read from ioaddr (NULL)\n");
+return 0;
+}
+
+if (len > kvm_ivshmem_dev.ioaddr_size - offset) {
+len = kvm_ivshmem_dev.ioaddr_size - offset;
+}
+
+printk(KERN_INFO "KVM_IVSHMEM: len is %u\n", (unsigned) len);
+if 

kvm-autotest: weird memory error during stepmaker test

2009-04-01 Thread Ryan Harper
Wondering if anyone else using kvm-autotest stepmaker has ever seen this
error:

Traceback (most recent call last):
  File 
/home/rharper/work/git/build/kvm-autotest/client/tests/kvm_runtest_2/stepmaker.py,
 line 146, in update
self.set_image_from_file(self.screendump_filename)
  File 
/home/rharper/work/git/build/kvm-autotest/client/tests/kvm_runtest_2/stepeditor.py,
 line 499, in set_image_from_file
self.set_image(w, h, data)
  File 
/home/rharper/work/git/build/kvm-autotest/client/tests/kvm_runtest_2/stepeditor.py,
 line 485, in set_image
w, h, w*3))
MemoryError

The guest is still running, but stepmaker isn't recording any more so it's
boned at that point.  And of course, it's near the end of a guest install so
one has lost a decent amount of time...


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ry...@us.ibm.com


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Anthony Liguori

Rusty Russell wrote:

On Wednesday 01 April 2009 05:12:47 Gregory Haskins wrote:
  

Bare metal: tput = 4078Mb/s, round-trip = 25593pps (39us rtt)
Virtio-net: tput = 4003Mb/s, round-trip = 320pps (3125us rtt)
Venet: tput = 4050Mb/s, round-trip = 15255pps (65us rtt)



That rtt time is awful.  I know the notification suppression heuristic
in qemu sucks.

I could dig through the code, but I'll ask directly: what heuristic do
you use for notification prevention in your venet_tap driver?

As you point out, 350-450 is possible, which is still bad, and it's at least
partially caused by the exit to userspace and two system calls.  If virtio_net
had a backend in the kernel, we'd be able to compare numbers properly.
  


I doubt the userspace exit is the problem.  On a modern system, it takes 
about 1us to do a light-weight exit and about 2us to do a heavy-weight 
exit.  A transition to userspace is only about ~150ns, the bulk of the 
additional heavy-weight exit cost is from vcpu_put() within KVM.


If you were to switch to another kernel thread, and I'm pretty sure you 
have to, you're going to still see about a 2us exit cost.  Even if you 
factor in the two syscalls, we're still talking about less than .5us 
that you're saving.  Avi mentioned he had some ideas to allow in-kernel 
thread switching without taking a heavy-weight exit but suffice to say, 
we can't do that today.


You have no easy way to generate PCI interrupts in the kernel either.  
You'll most certainly have to drop down to userspace anyway for that.


I believe the real issue is that we cannot get enough information today 
from tun/tap to do proper notification prevention b/c we don't know when 
the packet processing is completed.


Regards,

Anthony Liguori


Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs

2009-04-01 Thread Anthony Liguori

Hi Cam,

Cam Macdonell wrote:
This patch supports sharing memory between VMs and between the host/VM.  It's a
first cut and comments are encouraged.  The goal is to support simple Inter-VM
communication with zero-copy access to shared memory.


Nice work!

I would suggest two design changes to make here.  The first is that I 
think you should use virtio.  The second is that I think instead of 
relying on mapping in device memory to the guest, you should have the 
guest allocate its own memory to dedicate to sharing.


A lot of what you're doing is duplicating functionality in virtio-pci.  
You can also obtain greater portability by building the drivers with 
virtio.  It may not seem obvious how to make the memory sharing via BAR 
fit into virtio, but if you follow my second suggestion, it will be a 
lot easier.


Right now, you've got a bit of a hole in your implementation because you 
only support files that are powers-of-two in size even though that's not 
documented/enforced.  This is a limitation of PCI resource regions.  
Also, the PCI memory hole is limited in size today which is going to put 
an upper bound on the amount of memory you could ever map into a guest.  
Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
too since qemu_ram_alloc() is a static allocation from a contiguous heap.


If you used virtio, what you could do is provide a ring queue that was 
used to communicate a series of requests/response.  The exchange might 
look like this:


guest: REQ discover memory region
host: RSP memory region id: 4 size: 8k
guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
(addr=944000,size=4k)}

host: RSP mapped region id: 4
guest: REQ notify region id: 4
host: RSP notify region id: 4
guest: REQ poll region id: 4
host: RSP poll region id: 4

And the REQ/RSP order does not have to be in series like this.  In 
general, you need one entry on the queue to poll for new memory regions, 
one entry for each mapped region to poll for incoming notification, and 
then the remaining entries can be used to send short-lived 
requests/responses.


It's important that the REQ map takes a scatter/gather list of physical 
addresses because after running for a while, it's unlikely that you'll 
be able to allocate any significant size of contiguous memory.


From a QEMU perspective, you would do memory sharing by waiting for a 
map REQ from the guest and then you would complete the request by doing 
an mmap(MAP_FIXED) with the appropriate parameters into phys_ram_base.
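To make that concrete, here is a rough sketch of the map request and the qemu-side 
completion.  The structures and the handler are invented for illustration, and the 
guest-physical-to-ram-offset translation qemu would really need is glossed over:

#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical wire format for the "REQ map" message above; none of
 * these names exist in qemu or virtio. */
struct shmem_sg_entry {
    uint64_t guest_paddr;          /* guest-physical address of a chunk */
    uint64_t len;                  /* must be page aligned              */
};

struct shmem_map_req {
    uint32_t region_id;
    uint32_t nr_sg;
    struct shmem_sg_entry sg[];
};

extern uint8_t *phys_ram_base;     /* qemu's guest RAM mapping (2009-era) */

/* Host/qemu side: overlay each chunk of guest RAM with the shared file. */
static int handle_map_req(int shm_fd, const struct shmem_map_req *req)
{
    uint64_t file_off = 0;
    uint32_t i;

    for (i = 0; i < req->nr_sg; i++) {
        void *dst = phys_ram_base + req->sg[i].guest_paddr;

        if (mmap(dst, req->sg[i].len, PROT_READ | PROT_WRITE,
                 MAP_SHARED | MAP_FIXED, shm_fd, file_off) == MAP_FAILED)
            return -1;
        file_off += req->sg[i].len;
    }
    return 0;
}

Note that mmap() wants page-aligned offsets and lengths, which is why the example 
sgl above is built from 4k chunks.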


Notifications are a topic for discussion I think.  A CharDriverState 
could be used but I think it would be more interesting to do something 
like a fd passed by SCM_RIGHTS so that eventfd can be used.
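The sending side of that would be about this much code, using only standard 
POSIX/Linux calls (the surrounding handshake is omitted):

#include <string.h>
#include <sys/eventfd.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hand an eventfd to the peer over an already-connected AF_UNIX socket,
 * so either side can later signal the other without a byte stream. */
static int send_notify_fd(int unix_sock)
{
    int efd = eventfd(0, 0);
    char dummy = 0;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    if (efd < 0)
        return -1;

    memset(ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &efd, sizeof(int));

    if (sendmsg(unix_sock, &msg, 0) < 0)
        return -1;
    return efd;   /* writing 8 bytes to efd now wakes the receiver */
}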


To simplify things, I'd suggest starting out only supporting one memory 
region mapping.


Regards,

Anthony Liguori


Re: KVM on Via Nano (Isaiah) CPUs?

2009-04-01 Thread Avi Kivity

Craig Metz wrote:

  Has anyone (esp. the KVM core developers) tried to determine whether KVM
works on the new Via Nano CPUs? They claim to support the Intel-style VT-x
instruction set extensions and show up in cpuinfo that way. But, according to
some Google searching, folks who have tried to use KVM (or Hyper-V) have not
been successful. It's not clear if this is a CPU implementation problem and/or
something that needs more work in KVM.
  


Via engineers have contacted me and confirmed that this is a problem in 
the processor.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



[ kvm-Bugs-2725367 ] KVM userspace segfaults due to internal VNC server

2009-04-01 Thread SourceForge.net
Bugs item #2725367, was opened at 2009-04-01 19:57
Message generated for change (Tracker Item Submitted) made by technologov
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725367&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 8
Private: No
Submitted By: Technologov (technologov)
Assigned to: Nobody/Anonymous (nobody)
Summary: KVM userspace segfaults due to internal VNC server

Initial Comment:
KVM's internal VNC server is unstable.

When running KVM (KVM-84 or 85rc2), the userspace segfaults when I try to 
connect to it with VNC client.
Only some VNC clients can trigger it. It happens on both Intel & AMD.
I used TightVNC 1.3 client for Linux 64-bit.
No problems happen with SDL rendering.

Host: Intel Core 2 CPU, KVM-85rc2, Fedora 7 x64
Guest: Windows XP SP2 32-bit

The Command sent to Qemu/KVM: 
/usr/local/bin/qemu-system-x86_64 -m 256 -monitor 
tcp:localhost:4502,server,nowait -cdrom /isos/windows/WindowsXP-sp2-vlk.iso  
-hda /vm/winxp.qcow2 -name WindowsXP -vnc :1

GDB output:
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 46912498463376 (LWP 18803)]
0x00438cfc in vga_draw_line24_32 (s1=<value optimized out>,
d=0x2aaabc822000 <Address 0x2aaabc822000 out of bounds>,
s=0x2aaabb3eeef7 , width=36)
at /root/Linstall/kvm-85rc2/qemu/hw/vga_template.h:484
484 ((PIXEL_TYPE *)d)[0] = glue(rgb_to_pixel, PIXEL_NAME)(r, g, b);
(gdb) bt
#0  0x00438cfc in vga_draw_line24_32 (s1=<value optimized out>,
d=0x2aaabc822000 <Address 0x2aaabc822000 out of bounds>,
s=0x2aaabb3eeef7 , width=36)
at /root/Linstall/kvm-85rc2/qemu/hw/vga_template.h:484
#1  0x00437b0d in vga_update_display (opaque=<value optimized out>)
at /root/Linstall/kvm-85rc2/qemu/hw/vga.c:1767
#2  0x00490c45 in vnc_listen_read (opaque=0x2aaabb3eeef7) at vnc.c:2020
#3  0x004093dc in main_loop_wait (timeout=<value optimized out>)
at /root/Linstall/kvm-85rc2/qemu/vl.c:3818
#4  0x0051724a in kvm_main_loop ()
at /root/Linstall/kvm-85rc2/qemu/qemu-kvm.c:588
#5  0x0040e28a in main (argc=13, argv=0x7fff25e77658,
envp=<value optimized out>) at /root/Linstall/kvm-85rc2/qemu/vl.c:3875
(gdb) c
Continuing.


Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb)
The program is not being run.

-Alexey

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725367&group_id=180599


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Andi Kleen
On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
  
 
  But surely you must have some specific use case in mind? Something
  that it does better than the various methods that are available
  today. Or rather there must be some problem you're trying
  to solve. I'm just not sure what that problem exactly is.

 Performance.  We are trying to create a high performance IO infrastructure.

Ok. So the goal is to bypass user space qemu completely for better
performance. Can you please put this into the initial patch
description?

 So the administrator can then set these attributes as
 desired to manipulate the configuration of the instance of the device,
 on a per device basis.

How would the guest learn of any changes in there?

I think the interesting part would be how e.g. a vnet device
would be connected to the outside interfaces.

 So the admin would instantiate this vdisk device and do:
 
 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'

So it would act like a loop device? Would you reuse the loop device
or write something new?

How about VFS mount name spaces?

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Izik Eidus

KAMEZAWA Hiroyuki wrote:

On Tue, 31 Mar 2009 15:21:53 +0300
Izik Eidus iei...@redhat.com wrote:
  
  
  

kpage is actually what is going to be the KsmPage - the shared page...

Right now these pages are not swappable...; after ksm is merged we 
will make these pages swappable as well...




sure.

  

If so, please
 - show the amount of kpage
 
 - allow users to set limit for usage of kpages. or preserve kpages at boot or

   by user's command.
  
  
kpages actually save memory..., and limiting their number would 
limit the number of shared pages...





Ah, I'm working on the memory control cgroup, and *KSM* will be outside its control.
It's ok to make the default limit value INFINITY, but please add knobs.
  
Sure, when I post V2 I will take care of this issue (I will do it 
after I get a little more review of ksm.c :-))




Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs

2009-04-01 Thread Avi Kivity

Anthony Liguori wrote:

Hi Cam,

Cam Macdonell wrote:
This patch supports sharing memory between VMs and between the 
host/VM.  It's a first cut and comments are encouraged.  The goal is 
to support simple Inter-VM communication

with zero-copy access to shared memory.
  


Nice work!

I would suggest two design changes to make here.  The first is that I 
think you should use virtio.


I disagree with this.  While virtio is excellent at exporting guest 
memory, it isn't so good at importing another guest's memory.


  The second is that I think instead of relying on mapping in device 
memory to the guest, you should have the guest allocate its own 
memory to dedicate to sharing.


That's not what you describe below.  You're having the guest allocate 
parts of its address space that happen to be used by RAM, and overlaying 
those parts with the shared memory.


Right now, you've got a bit of a hole in your implementation because 
you only support files that are powers-of-two in size even though 
that's not documented/enforced.  This is a limitation of PCI resource 
regions.  


While the BAR needs to be a power of two, I don't think the RAM backing 
it needs to be.


Also, the PCI memory hole is limited in size today which is going to 
put an upper bound on the amount of memory you could ever map into a 
guest.  


Today.  We could easily lift this restriction by supporting 64-bit 
BARs.  It would probably take only a few lines of code.


Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
too since qemu_ram_alloc() is a static allocation from a contiguous heap.


We need to fix this anyway, for memory hotplug.



If you used virtio, what you could do is provide a ring queue that was 
used to communicate a series of requests/response.  The exchange might 
look like this:


guest: REQ discover memory region
host: RSP memory region id: 4 size: 8k
guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
(addr=944000,size=4k)}

host: RSP mapped region id: 4
guest: REQ notify region id: 4
host: RSP notify region id: 4
guest: REQ poll region id: 4
host: RSP poll region id: 4


That looks significantly more complex.



And the REQ/RSP order does not have to be in series like this.  In 
general, you need one entry on the queue to poll for new memory 
regions, one entry for each mapped region to poll for incoming 
notification, and then the remaining entries can be used to send 
short-lived requests/responses.


It's important that the REQ map takes a scatter/gather list of 
physical addresses because after running for a while, it's unlikely 
that you'll be able to allocate any significant size of contiguous 
memory.


From a QEMU perspective, you would do memory sharing by waiting for a 
map REQ from the guest and then you would complete the request by 
doing an mmap(MAP_FIXED) with the appropriate parameters into 
phys_ram_base.


That will fragment the vma list.  And what do you do when you unmap the 
region?


How does a 256M guest map 1G of shared memory?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Anthony Liguori

Andi Kleen wrote:

On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
  



But surely you must have some specific use case in mind? Something
that it does better than the various methods that are available
today. Or rather there must be some problem you're trying
to solve. I'm just not sure what that problem exactly is.
  
  

Performance.  We are trying to create a high performance IO infrastructure.



Ok. So the goal is to bypass user space qemu completely for better
performance. Can you please put this into the initial patch
description?
  


FWIW, there's nothing that prevents in-kernel back ends with virtio so 
vbus certainly isn't required for in-kernel backends.


That said, I don't think we're bound today by the fact that we're in 
userspace.  Rather we're bound by the interfaces we have between the 
host kernel and userspace to generate IO.  I'd rather fix those 
interfaces than put more stuff in the kernel.


Regards,

Anthony Liguori


Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs

2009-04-01 Thread Anthony Liguori

Avi Kivity wrote:

Anthony Liguori wrote:

Hi Cam,


I would suggest two design changes to make here.  The first is that I 
think you should use virtio.


I disagree with this.  While virtio is excellent at exporting guest 
memory, it isn't so good at importing another guest's memory.


First we need to separate static memory sharing and dynamic memory 
sharing.  Static memory sharing has to be configured on start up.  I 
think in practice, static memory sharing is not terribly interesting 
except for maybe embedded environments.


Dynamic memory sharing requires bidirectional communication in order 
to establish mappings and tear down mappings.  You'll eventually 
recreate virtio once you've implemented this communication mechanism.


  The second is that I think instead of relying on mapping in device 
memory to the guest, you should have the guest allocate its own 
memory to dedicate to sharing.


That's not what you describe below.  You're having the guest allocate 
parts of its address space that happen to be used by RAM, and 
overlaying those parts with the shared memory.


But from the guest's perspective, its RAM is being used for memory sharing.

If you're clever, you could start a guest with -mem-path and then use 
this mechanism to map a portion of one guest's memory into another guest 
without either guest ever knowing who owns the memory and with exactly 
the same driver on both.


Right now, you've got a bit of a hole in your implementation because 
you only support files that are powers-of-two in size even though 
that's not documented/enforced.  This is a limitation of PCI resource 
regions.  


While the BAR needs to be a power of two, I don't think the RAM 
backing it needs to be.


Then you need a side channel to communicate the information to the guest.

Also, the PCI memory hole is limited in size today which is going to 
put an upper bound on the amount of memory you could ever map into a 
guest.  


Today.  We could easily lift this restriction by supporting 64-bit 
BARs.  It would probably take only a few lines of code.


Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
too since qemu_ram_alloc() is a static allocation from a contiguous 
heap.


We need to fix this anyway, for memory hotplug.


It's going to be hard to fix with TCG.

If you used virtio, what you could do is provide a ring queue that 
was used to communicate a series of requests/response.  The exchange 
might look like this:


guest: REQ discover memory region
host: RSP memory region id: 4 size: 8k
guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
(addr=944000,size=4k)}

host: RSP mapped region id: 4
guest: REQ notify region id: 4
host: RSP notify region id: 4
guest: REQ poll region id: 4
host: RSP poll region id: 4


That looks significantly more complex.


It's also supporting dynamic shared memory.  If you do use BARs, then 
perhaps you'd just do PCI hotplug to make things dynamic.




And the REQ/RSP order does not have to be in series like this.  In 
general, you need one entry on the queue to poll for new memory 
regions, one entry for each mapped region to poll for incoming 
notification, and then the remaining entries can be used to send 
short-lived requests/responses.


It's important that the REQ map takes a scatter/gather list of 
physical addresses because after running for a while, it's unlikely 
that you'll be able to allocate any significant size of contiguous 
memory.


From a QEMU perspective, you would do memory sharing by waiting for a 
map REQ from the guest and then you would complete the request by 
doing an mmap(MAP_FIXED) with the appropriate parameters into 
phys_ram_base.


That will fragment the vma list.  And what do you do when you unmap 
the region?


How does a 256M guest map 1G of shared memory?


It doesn't but it couldn't today either b/c of the 32-bit BARs.

Regards,

Anthony Liguori



Re: EPT support breakage on: KVM: VMX: Zero ept module parameter if ept is not present

2009-04-01 Thread Andrew Theurer

Sheng Yang wrote:

Oops... Thanks very much for reporting! I can't believe we haven't awared of
that...

Could you please try the attached patch? Thanks!
  

Tested and works great.  Thanks!

-Andrew

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..8d6465b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1195,15 +1195,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
  vmx_capability.ept, vmx_capability.vpid);
}

-   if (!cpu_has_vmx_vpid())
-   enable_vpid = 0;
-
-   if (!cpu_has_vmx_ept())
-   enable_ept = 0;
-
-   if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
-   flexpriority_enabled = 0;
-
min = 0;
 #ifdef CONFIG_X86_64
min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -1307,6 +1298,15 @@ static __init int hardware_setup(void)
if (boot_cpu_has(X86_FEATURE_NX))
kvm_enable_efer_bits(EFER_NX);

+   if (!cpu_has_vmx_vpid())
+   enable_vpid = 0;
+
+   if (!cpu_has_vmx_ept())
+   enable_ept = 0;
+
+   if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
+   flexpriority_enabled = 0;
+
return alloc_kvm_area();
 }

  




Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Andi Kleen wrote:
 On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
   
 
 
 But surely you must have some specific use case in mind? Something
 that it does better than the various methods that are available
 today. Or rather there must be some problem you're trying
 to solve. I'm just not sure what that problem exactly is.
   
   
 Performance.  We are trying to create a high performance IO infrastructure.
 

 Ok. So the goal is to bypass user space qemu completely for better
 performance. Can you please put this into the initial patch
 description?
   
Yes, good point.  I will be sure to be more explicit in the next rev.

   
 So the administrator can then set these attributes as
 desired to manipulate the configuration of the instance of the device,
 on a per device basis.
 

 How would the guest learn of any changes in there?
   
The only events explicitly supported by the infrastructure of this
nature would be device-add and device-remove.  So when an admin adds or
removes a device to a bus, the guest would see driver::probe() and
driver::remove() callbacks, respectively.  All other events are left (by
design) to be handled by the device ABI itself, presumably over the
provided shm infrastructure.
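From the guest driver's point of view that looks roughly like the sketch below; 
every vbus_* name here is invented for illustration and is not lifted from the 
patch series:

/* Illustrative only: a bus driver whose probe()/remove() are invoked on
 * the device-add / device-remove events described above. */
struct vbus_device_proxy;                          /* opaque per-device handle */

struct vbus_driver {
    const char *type;                              /* e.g. "venet"  */
    int  (*probe)(struct vbus_device_proxy *dev);  /* device-add    */
    void (*remove)(struct vbus_device_proxy *dev); /* device-remove */
};

int vbus_driver_register(struct vbus_driver *drv); /* assumed to exist */

static int venet_probe(struct vbus_device_proxy *dev)
{
    (void)dev;
    /* negotiate the shm rings, register the netif, ... */
    return 0;
}

static void venet_remove(struct vbus_device_proxy *dev)
{
    (void)dev;
    /* unregister the netif, tear down the rings, ... */
}

static struct vbus_driver venet_driver = {
    .type   = "venet",
    .probe  = venet_probe,
    .remove = venet_remove,
};

/* somewhere in module init: vbus_driver_register(&venet_driver); */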

So for instance, I have on my todo list to add a third shm-ring for
events in the venet ABI.   One of the event-types I would like to
support is LINK_UP and LINK_DOWN.  These events would be coupled to the
administrative manipulation of the enabled attribute in sysfs.  Other
event-types could be added as needed/appropriate.

I decided to do it this way because I felt it didn't make sense for me
to expose the attributes directly, since they are often back-end
specific anyway.   Therefore I leave it to the device-specific ABI which
has all the necessary tools for async events built in.


 I think the interesting part would be how e.g. a vnet device
 would be connected to the outside interfaces.
   

Ah, good question.  This ties into the statement I made earlier about
how presumably the administrative agent would know what a module is and
how it works.  As part of this, they would also handle any kind of
additional work, such as wiring the backend up.  Here is a script that I
use for testing that demonstrates this:

--
#!/bin/bash

set -e

modprobe venet-tap
mount -t configfs configfs /config

bridge=vbus-br0

brctl addbr $bridge
brctl setfd $bridge 0
ifconfig $bridge up

createtap()
{
# create a venet-tap device, put it on a new bus, and enable it
mkdir /config/vbus/devices/$1-dev
echo venet-tap > /config/vbus/devices/$1-dev/type
mkdir /config/vbus/instances/$1-bus
ln -s /config/vbus/devices/$1-dev /config/vbus/instances/$1-bus
echo 1 > /sys/vbus/devices/$1-dev/enabled

# enabling registers a netif; wire the resulting interface into the bridge
ifname=$(cat /sys/vbus/devices/$1-dev/ifname)
ifconfig $ifname up
brctl addif $bridge $ifname
}

createtap client
createtap server



This script creates two buses (client-bus and server-bus),
instantiates a single venet-tap on each of them, and then wires them
together with a private bridge instance called vbus-br0.  To complete
the picture here, you would want to launch two kvms, one of each of the
client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.

# (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img)
# (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img)

(And as noted, someday qemu will be able to do all the setup that the
script did, natively.  It would wire whatever tap it created to an
existing bridge with qemu-ifup, just like we do for tun-taps today)

One of the key details is where I do ifname=$(cat
/sys/vbus/devices/$1-dev/ifname).  The ifname attribute of the
venet-tap is a read-only attribute that reports back the netif interface
name that was returned when the device did a register_netdev() (e.g.
eth3).  This register_netdev() operation occurs as a result of echoing
the 1 into the enabled attribute.  Deferring the registration until
the admin explicitly does an enable gives the admin a chance to change
the MAC address of the virtual-adapter before it is registered (note:
the current code doesn't support rw on the mac attributes yet; I need a
parser first).
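A rough sketch of how such a pair of attributes could be implemented with 
ordinary sysfs device attributes (illustrative only; the attribute plumbing in 
the actual venet-tap code may differ):

#include <linux/device.h>
#include <linux/netdevice.h>

struct venettap {
    struct net_device *netdev;
    bool               enabled;
};

/* Writing "1" performs the deferred register_netdev(). */
static ssize_t enabled_store(struct device *dev, struct device_attribute *attr,
                             const char *buf, size_t count)
{
    struct venettap *priv = dev_get_drvdata(dev);

    if (buf[0] == '1' && !priv->enabled) {
        int ret = register_netdev(priv->netdev);   /* picks e.g. "eth3" */
        if (ret)
            return ret;
        priv->enabled = true;
    }
    return count;
}

/* Reports the netif name chosen at register_netdev() time. */
static ssize_t ifname_show(struct device *dev, struct device_attribute *attr,
                           char *buf)
{
    struct venettap *priv = dev_get_drvdata(dev);

    if (!priv->enabled)
        return -EINVAL;
    return sprintf(buf, "%s\n", priv->netdev->name);
}

static DEVICE_ATTR(enabled, 0200, NULL, enabled_store);
static DEVICE_ATTR(ifname, 0444, ifname_show, NULL);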


   
 So the admin would instantiate this vdisk device and do:

 'echo /path/to/my/exported/disk.dat > /sys/vbus/devices/foo/src_path'
 

 So it would act like a loop device? Would you reuse the loop device
 or write something new?
   

Well, keeping in mind that I haven't even looked at writing a block
device for this infrastructure yet, my blanket statement would be
let's reuse as much as possible ;)  If the existing loop infrastructure
would work here, great!

 How about VFS mount name spaces?
   

Yeah, ultimately I would love to be able to support a fairly wide range
of the normal userspace/kernel ABI through this mechanism.  In fact, one
of my original design goals was to somehow expose the syscall ABI
directly via some kind of syscall proxy device on the bus.  I have 

Re: [PATCH] Add shared memory PCI device that shares a memory object betweens VMs

2009-04-01 Thread Cam Macdonell


Hi Anthony and Avi,

Anthony Liguori wrote:

Avi Kivity wrote:

Anthony Liguori wrote:

Hi Cam,


I would suggest two design changes to make here.  The first is that I 
think you should use virtio.


I disagree with this.  While virtio is excellent at exporting guest 
memory, it isn't so good at importing another guest's memory.


First we need to separate static memory sharing and dynamic memory 
sharing.  Static memory sharing has to be configured on start up.  I 
think in practice, static memory sharing is not terribly interesting 
except for maybe embedded environments.


I think there is value for static memory sharing.   It can be used for 
fast, simple synchronization and communication between guests (and the 
host) that need to share data that needs to be updated frequently 
(such as a simple cache or notification system).  It may not be a common 
task, but I think static sharing has its place and that's what this 
device is for at this point.


Dynamic memory sharing requires bidirectional communication in order 
to establish mappings and tear down mappings.  You'll eventually 
recreate virtio once you've implemented this communication mechanism.


  The second is that I think instead of relying on mapping in device 
memory to the guest, you should have the guest allocate its own 
memory to dedicate to sharing.


That's not what you describe below.  You're having the guest allocate 
parts of its address space that happen to be used by RAM, and 
overlaying those parts with the shared memory.


But from the guest's perspective, its RAM is being used for memory 
sharing.


If you're clever, you could start a guest with -mem-path and then use 
this mechanism to map a portion of one guest's memory into another guest 
without either guest ever knowing who owns the memory and with exactly 
the same driver on both.


Right now, you've got a bit of a hole in your implementation because 
you only support files that are powers-of-two in size even though 
that's not documented/enforced.  This is a limitation of PCI resource 
regions.  


While the BAR needs to be a power of two, I don't think the RAM 
backing it needs to be.


Then you need a side channel to communicate the information to the guest.


Couldn't one of the registers in BAR0 be used to store the actual 
(non-power-of-two) size?
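A guest-side sketch of that idea, using standard PCI helpers; the register 
offset and all names are invented for illustration:

#include <linux/pci.h>

/* BAR1 holds the (power-of-two sized) shared memory window, while a
 * register in BAR0 reports how many bytes are actually backed. */
#define IVSHMEM_REG_ACTUAL_SIZE   0x08   /* hypothetical offset in BAR0 */

static int ivshmem_map_shared(struct pci_dev *pdev, void __iomem **shmem,
                              u32 *actual_size)
{
    void __iomem *regs = pci_iomap(pdev, 0, 0);    /* BAR0: registers */

    if (!regs)
        return -ENOMEM;

    *actual_size = ioread32(regs + IVSHMEM_REG_ACTUAL_SIZE);
    pci_iounmap(pdev, regs);

    *shmem = pci_iomap(pdev, 1, *actual_size);     /* BAR1: shared RAM */
    return *shmem ? 0 : -ENOMEM;
}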


Also, the PCI memory hole is limited in size today which is going to 
put an upper bound on the amount of memory you could ever map into a 
guest.  


Today.  We could easily lift this restriction by supporting 64-bit 
BARs.  It would probably take only a few lines of code.


Since you're using qemu_ram_alloc() also, it makes hotplug unworkable 
too since qemu_ram_alloc() is a static allocation from a contiguous 
heap.


We need to fix this anyway, for memory hotplug.


It's going to be hard to fix with TCG.

If you used virtio, what you could do is provide a ring queue that 
was used to communicate a series of requests/response.  The exchange 
might look like this:


guest: REQ discover memory region
host: RSP memory region id: 4 size: 8k
guest: REQ map region id: 4 size: 8k: sgl: {(addr=43000, size=4k), 
(addr=944000,size=4k)}

host: RSP mapped region id: 4
guest: REQ notify region id: 4
host: RSP notify region id: 4
guest: REQ poll region id: 4
host: RSP poll region id: 4


That looks significantly more complex.


It's also supporting dynamic shared memory.  If you do use BARs, then 
perhaps you'd just do PCI hotplug to make things dynamic.




And the REQ/RSP order does not have to be in series like this.  In 
general, you need one entry on the queue to poll for new memory 
regions, one entry for each mapped region to poll for incoming 
notification, and then the remaining entries can be used to send 
short-lived requests/responses.


It's important that the REQ map takes a scatter/gather list of 
physical addresses because after running for a while, it's unlikely 
that you'll be able to allocate any significant size of contiguous 
memory.


From a QEMU perspective, you would do memory sharing by waiting for a 
map REQ from the guest and then you would complete the request by 
doing an mmap(MAP_FIXED) with the appropriate parameters into 
phys_ram_base.


That will fragment the vma list.  And what do you do when you unmap 
the region?


How does a 256M guest map 1G of shared memory?


It doesn't but it couldn't today either b/c of the 32-bit BARs.



Cam


Re: OT: No vmx-Flag in Via Nano CPU on Samsung NC-20 Netbooks

2009-04-01 Thread Luca Tettamanti
On Sat, Mar 28, 2009 at 7:51 AM, Oliver Rath rat...@web.de wrote:
 I took a look at the new Samsung NC-20 Netbook with Via Nano Processor.
 Unfortunately the vmx bit looks to be disabled on the Via Nano U2250.
 Tested with the newest Bios 7MC. Does anyone know more about this
 missing feature?

It seems that VIA processors are not fully compatible with the Intel VT
specification; quoting Avi: "Via engineers have contacted me and
confirmed that this is a problem in the processor."

Luca


Re: KVM on Via Nano (Isaiah) CPUs?

2009-04-01 Thread Jesse Ahrens

 Via engineers have contacted me and confirmed that this is a problem in
 the processor.

I'd like to clarify. Stepping 2 Nano processors do not support VMX. This 
should have been disabled by the BIOS. Support for VMX was not finished 
until stepping 3. If you have a stepping 2 processor with this enabled 
please let me know which platform it is on so we can have the 
manufacturer release a new BIOS.


Jesse Ahrens
Systems Engineer
Centaur Technology


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Chris Wright
* Anthony Liguori (anth...@codemonkey.ws) wrote:
 Andi Kleen wrote:
 On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
 Performance.  We are trying to create a high performance IO infrastructure.

 Ok. So the goal is to bypass user space qemu completely for better
 performance. Can you please put this into the initial patch
 description?

 FWIW, there's nothing that prevents in-kernel back ends with virtio so  
 vbus certainly isn't required for in-kernel backends.

Indeed.

 That said, I don't think we're bound today by the fact that we're in  
 userspace.  Rather we're bound by the interfaces we have between the  
 host kernel and userspace to generate IO.  I'd rather fix those  
 interfaces than put more stuff in the kernel.

And more stuff in the kernel can come at the potential cost of weakening
protection/isolation.


[ kvm-Bugs-2725669 ] kvm init script breaks network interfaces with multiple IPs

2009-04-01 Thread SourceForge.net
Bugs item #2725669, was opened at 2009-04-01 16:44
Message generated for change (Tracker Item Submitted) made by paulsd
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725669&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Paul Donohue (paulsd)
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm init script breaks network interfaces with multiple IPs

Initial Comment:
If multiple IP addresses are assigned to a network interface (Using interface 
aliases - for example 'ifconfig eth0 10.0.0.1 ; ifconfig eth0:1 10.0.0.2'), 
then the kvm init script causes the interface to become unresponsive when it 
creates a bridge using the interface.

I haven't yet had a need to use bridging for my VMs, so I haven't yet tried to 
figure out how to properly configure a bridge when multiple IPs are in use on 
the host system (I assume the multiple IPs simply need to be configured using 
aliases of the bridge itself - for example 'ifconfig sw0 10.0.0.1 ; ifconfig 
sw0:1 10.0.0.2' - but I haven't actually tried it).  Therefore, I am not sure 
at the moment how the kvm init script needs to be updated to fix this problem.

Regardless, I do have a number of machines which are using multiple IPs on the 
host system, and I recently installed kvm on them, then discovered that after 
the next reboot of each machine, the network interface is unresponsive until I 
disable the kvm init script and reboot again.

So, ideally the kvm init script should be updated to properly handle aliased 
interfaces, but at the very least, it needs to be updated to detect aliased 
interfaces and refuse to create a bridge for them, since that seems to 
completely break the underlying interface.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725669&group_id=180599


Re: OT: No vmx-Flag in Via Nano CPU on Samsung NC-20 Netbooks

2009-04-01 Thread Luca Tettamanti
On Wed, Apr 1, 2009 at 10:32 PM, Luca Tettamanti kronos...@gmail.com wrote:
 On Sat, Mar 28, 2009 at 7:51 AM, Oliver Rath rat...@web.de wrote:
 I took a look at the new Samsung NC-20 Netbook with Via Nano Processor.
 Unfortunately the vmx bit looks to be disabled on the Via Nano U2250.
 Tested with the newest Bios 7MC. Does anyone know more about this
 missing feature?

 It seems that VIA processors are not fully compatible with the Intel VT
 specification; quoting Avi: "Via engineers have contacted me and
 confirmed that this is a problem in the processor."

More info here:
http://marc.info/?l=kvm&m=123861829901077&w=2

Luca


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Anthony Liguori wrote:
 Andi Kleen wrote:
 On Wed, Apr 01, 2009 at 10:19:49AM -0400, Gregory Haskins wrote:
  
 
 But surely you must have some specific use case in mind? Something
 that it does better than the various methods that are available
 today. Or rather there must be some problem you're trying
 to solve. I'm just not sure what that problem exactly is.
 
 Performance.  We are trying to create a high performance IO
 infrastructure.
 

 Ok. So the goal is to bypass user space qemu completely for better
 performance. Can you please put this into the initial patch
 description?
   

 FWIW, there's nothing that prevents in-kernel back ends with virtio so
 vbus certainly isn't required for in-kernel backends.

I think there is a slight disconnect here.  This is *exactly* what I am
trying to do.  You can of course do this many ways, and I am not denying
it could be done a different way than the path I have chosen.  One
extreme would be to just slam a virtio-net specific chunk of code
directly into kvm on the host.  Another extreme would be to build a
generic framework into Linux for declaring arbitrary IO types,
integrating it with kvm (as well as other environments such as lguest,
userspace, etc), and building a virtio-net model on top of that.

So in case it is not obvious at this point, I have gone with the latter
approach.  I wanted to make sure it wasn't kvm specific or something
like pci specific so it had the broadest applicability to a range of
environments.  So that is why the design is the way it is.  I understand
that this approach is technically harder/more-complex than the slam
virtio-net into kvm approach, but I've already done that work.  All we
need to do now is agree on the details ;)



 That said, I don't think we're bound today by the fact that we're in
 userspace.
You will *always* be bound by the fact that you are in userspace.  It's
purely a question of how much and does anyone care.    Right now,
the answer is a lot (roughly 45x slower) and at least Greg's customers
do.  I have no doubt that this can and will change/improve in the
future.  But it will always be true that no matter how much userspace
improves, the kernel based solution will always be faster.  It's simple
physics.  I'm cutting out the middleman to ultimately reach the same
destination as the userspace path, so userspace can never be equal.

I agree that the does anyone care part of the equation will approach
zero as the latency difference shrinks across some threshold (probably
the single microsecond range), but I will believe that is even possible
when I see it ;)

Regards,
-Greg



signature.asc
Description: OpenPGP digital signature


[ kvm-Bugs-2725669 ] kvm init script breaks network interfaces with multiple IPs

2009-04-01 Thread SourceForge.net
Bugs item #2725669, was opened at 2009-04-01 15:44
Message generated for change (Comment added) made by iggy_cav
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725669&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Paul Donohue (paulsd)
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm init script breaks network interfaces with multiple IPs

Initial Comment:
If multiple IP addresses are assigned to a network interface (Using interface 
aliases - for example 'ifconfig eth0 10.0.0.1 ; ifconfig eth0:1 10.0.0.2'), 
then the kvm init script causes the interface to become unresponsive when it 
creates a bridge using the interface.

I haven't yet had a need to use bridging for my VMs, so I haven't yet tried to 
figure out how to properly configure a bridge when multiple IPs are in use on 
the host system (I assume the multiple IPs simply need to be configured using 
aliases of the bridge itself - for example 'ifconfig sw0 10.0.0.1 ; ifconfig 
sw0:1 10.0.0.2' - but I haven't actually tried it).  Therefore, I am not sure 
at the moment how the kvm init script needs to be updated to fix this problem.

Regardless, I do have a number of machines which are using multiple IPs on the 
host system, and I recently installed kvm on them, then discovered that after 
the next reboot of each machine, the network interface is unresponsive until I 
disable the kvm init script and reboot again.

So, ideally the kvm init script should be updated to properly handle aliased 
interfaces, but at the very least, it needs to be updated to detect aliased 
interfaces and refuse to create a bridge for them, since that seems to 
completely break the underlying interface.

--

Comment By: Brian Jackson (iggy_cav)
Date: 2009-04-01 16:08

Message:
KVM doesn't come with an init script in the tarball. This is most likely
provided by your distro or some other third party. You should contact them
for support.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725669&group_id=180599


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Chris Wright wrote:
 And more stuff in the kernel can come at the potential cost of weakening
 protection/isolation.
   
Note that the design of vbus should prevent any weakening...though if
you see a hole, please point it out.

(On that front, note that I still have some hardening to do, such as not
calling BUG_ON() in venet-tap if the ring is in a funk, etc)

Regards,
-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Chris Wright
* Gregory Haskins (ghask...@novell.com) wrote:
 Note that the design of vbus should prevent any weakening

Could you elaborate?


Re: OT: No vmx-Flag in Via Nano CPU on Samsung NC-20 Netbooks

2009-04-01 Thread Oliver Rath
Hi Luca!

Luca Tettamanti schrieb:
 [..]
 It seems that VIA processors are not fully compatible with Intel VT
 specification, quoting Avi: Via engineers have contacted me and
 confirmed that this is a problem in the processor.
 

 More info here:
 http://marc.info/?l=kvmm=123861829901077w=2

 Luca
 --
   

Thank you so much for this info! Neither Via support nor Samsung
support was able (or willing?) to answer this question :-(

Maybe we should correct the Wikipedia entry for the Via Nano accordingly?

Best regards

Oliver



Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Chris Wright wrote:
 * Gregory Haskins (ghask...@novell.com) wrote:
   
 Note that the design of vbus should prevent any weakening
 

 Could you elaborate?
   

Absolutely.

So you said that something in the kernel could weaken the
protection/isolation.  And I fully agree that whatever we do here has to
be done carefully...more carefully than a userspace derived counterpart,
naturally.

So to address this, I put in various mechanisms to (hopefully? :) ensure
we can still maintain proper isolation, as well as protect the host,
other guests, and other applications from corruption.  Here are some of
the highlights:

*) As I mentioned, a vbus is a form of a kernel-resource-container. 
It is designed so that the view of a vbus is a unique namespace of
device-ids.  Each bus has its own individual namespace that consists
solely of the devices that have been placed on that bus.  The only way
to create a bus, and/or create a device on a bus, is via the
administrative interface on the host.

*) A task can only associate with, at most, one vbus at a time.  This
means that a task can only see the device-id namespace of the devices on
its associated bus and that's it.  This is enforced by the host kernel by
placing a reference to the associated vbus on the task-struct itself. 
Again, the only way to modify this association is via a host based
administrative operation.  Note that multiple tasks can associate to the
same vbus, which would commonly be used by all threads in an app, or all
vcpus in a guest, etc.

*) the asynchronous nature of the shm/ring interfaces implies we have
the potential for asynchronous faults.  E.g. crap in the ring might
not be discovered at the EIP of the guest vcpu when it actually inserts
the crap, but rather later when the host side tries to update the ring. 
A naive implementation would have the host do a BUG_ON() when it
discovers the discrepancy (note that I still have a few of these to fix
in the venet-tap code).  Instead, what should happen is that we utilize
an asynchronous fault mechanism that allows the guest to always be the
one punished (via something like a machine-check for guests, or SIGABRT
for userspace, etc)

*) south-to-north path signaling robustness.  Because vbus supports a
variety of different environments, I call guest/userspace 'north', and
the host/kernel 'south'.  When the north wants to communicate with the
kernel, its perfectly ok to stall the north indefinitely if the south is
not ready.  However, it is not really ok to stall the south when
communicating with the north because this is an attack vector.  E.g. a
malicious/broken guest could just stop servicing its ring to cause
threads in the host to jam up.  This is bad. :)  So what we do is we
design all south-to-north signaling paths to be robust against
stalling.  What they do instead is manage backpressure a little bit more
intelligently than simply blocking like they might in the guest.  For
instance, in venet-tap, a transmit from netif that has to be injected
in the south-to-north ring when it is full will result in a
netif_stop_queue().   etc.
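As a concrete illustration of that last rule, a sketch of what such a transmit 
path might look like; the ring helpers and the private struct are hypothetical, 
not copied from the venet-tap patches:

#include <linux/netdevice.h>

struct venet_ring;
extern int  ring_full(struct venet_ring *r);
extern void ring_push(struct venet_ring *r, struct sk_buff *skb);
extern void ring_signal(struct venet_ring *r);

struct venettap {
    struct venet_ring *rx_ring;   /* south-to-north (host -> guest) */
};

/* Never block the host thread on a full guest-bound ring; push the
 * pressure back into the network stack instead. */
static int venettap_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct venettap *priv = netdev_priv(dev);

    if (ring_full(priv->rx_ring)) {
        netif_stop_queue(dev);        /* guest not consuming: back off */
        return NETDEV_TX_BUSY;
    }

    ring_push(priv->rx_ring, skb);    /* hand the skb to the guest */
    ring_signal(priv->rx_ring);       /* doorbell; netif_wake_queue()
                                       * happens later, when the guest
                                       * frees ring space            */
    return NETDEV_TX_OK;
}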

I can't think of more examples right now, but I will update this list
if/when I come up with more.  I hope that satisfactorily answered your
question, though!

Regards,
-Greg



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Andi Kleen
On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote:
  description?

 Yes, good point.  I will be sure to be more explicit in the next rev.
 

  So the administrator can then set these attributes as
  desired to manipulate the configuration of the instance of the device,
  on a per device basis.
  
 
  How would the guest learn of any changes in there?

 The only events explicitly supported by the infrastructure of this
 nature would be device-add and device-remove.  So when an admin adds or
 removes a device to a bus, the guest would see driver::probe() and
 driver::remove() callbacks, respectively.  All other events are left (by
 design) to be handled by the device ABI itself, presumably over the
 provided shm infrastructure.

Ok so you rely on a transaction model where everything is set up
before it is somehow committed to the guest? I hope that is made
explicit in the interface somehow.

 This script creates two buses (client-bus and server-bus),
 instantiates a single venet-tap on each of them, and then wires them
 together with a private bridge instance called vbus-br0.  To complete
 the picture here, you would want to launch two kvms, one of each of the
 client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.
 
 # (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img)
 # (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img)
 
 (And as noted, someday qemu will be able to do all the setup that the
 script did, natively.  It would wire whatever tap it created to an
 existing bridge with qemu-ifup, just like we do for tun-taps today)

The usual problem with that is permissions. Just making qemu-ifup suid
is not very nice.  It would be good if any new design addressed this.

 the current code doesnt support rw on the mac attributes yet..i need a
 parser first).

parser in kernel space always sounds scary to me.


 
 Yeah, ultimately I would love to be able to support a fairly wide range
 of the normal userspace/kernel ABI through this mechanism.  In fact, one
 of my original design goals was to somehow expose the syscall ABI
 directly via some kind of syscall proxy device on the bus.  I have since

That sounds really scary for security. 


 backed away from that idea once I started thinking about things some
 more and realized that a significant number of system calls are really
 inappropriate for a guest type environment due to their ability to
 block.   We really dont want a vcpu to block.however, the AIO type

Not only because of blocking, but also because of security issues.
After all one of the usual reasons to run a guest is security isolation.

In general the more powerful the guest API the more risky it is, so some
self moderation is probably a good thing.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Izik Eidus

Anthony Liguori wrote:

Andrea Arcangeli wrote:

On Tue, Mar 31, 2009 at 10:54:57AM -0500, Anthony Liguori wrote:
 
You can still disable ksm and simply return ENOSYS for the MADV_ 
flag.  You 



Anthony, the biggest problem with madvise() is that it is a real system 
call API; at this stage of ksm I wouldn't want to commit to API changes 
in Linux...


The ioctl itself is restricting; madvise is much more so...,

Can we defer this issue until after ksm is merged, and after all the big 
new features that we want to add to ksm are merged?
(Then the API would be much more stable, and we would be able to ask people 
on the list about changing the API; but for a new driver that is yet to be 
merged, it is kind of overkill to add an API to Linux.)


What do you think?



Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Andi Kleen wrote:
 On Wed, Apr 01, 2009 at 04:29:57PM -0400, Gregory Haskins wrote:
   
 description?
   
   
 Yes, good point.  I will be sure to be more explicit in the next rev.

 
   
   
 So the administrator can then set these attributes as
 desired to manipulate the configuration of the instance of the device,
 on a per device basis.
 
 
 How would the guest learn of any changes in there?
   
   
 The only events explicitly supported by the infrastructure of this
 nature would be device-add and device-remove.  So when an admin adds or
 removes a device to a bus, the guest would see driver::probe() and
 driver::remove() callbacks, respectively.  All other events are left (by
 design) to be handled by the device ABI itself, presumably over the
 provided shm infrastructure.
 

 Ok so you rely on a transaction model where everything is set up
 before it is somehow committed to the guest? I hope that is made
 explicit in the interface somehow.
   
Well, it's not an explicit transaction model, but I guess you could think
of it that way.

Generally you set the device up before you launch the guest.  By the
time the guest loads and tries to scan the bus for the initial
discovery, all the devices would be ready to go.

This does bring up the question of hotswap.  Today we fully support
hotswap in and out, but leaving this enabled transaction to the
individual device means that the device-id would be visible in the bus
namespace before the device may want to actually communicate.  Hmmm

Perhaps I need to build this in as a more explicit enabled
feature...and the guest will not see the driver::probe() until this happens.

   
 This script creates two buses (client-bus and server-bus),
 instantiates a single venet-tap on each of them, and then wires them
 together with a private bridge instance called vbus-br0.  To complete
 the picture here, you would want to launch two kvms, one of each of the
 client-bus/server-bus instances.  You can do this via /proc/$pid/vbus.  E.g.

 # (echo client-bus > /proc/self/vbus; qemu-kvm -hda client.img)
 # (echo server-bus > /proc/self/vbus; qemu-kvm -hda server.img)

 (And as noted, someday qemu will be able to do all the setup that the
 script did, natively.  It would wire whatever tap it created to an
 existing bridge with qemu-ifup, just like we do for tun-taps today)
 

 The usual problem with that is permissions. Just making qemu-ifup suid
 it not very nice.  It would be good if any new design addressed this.
   

Well, it's kind of out of my control.  venet-tap ultimately creates a
simple netif interface which we must do something with.  Once its
created, wiring it up to something like a linux-bridge is no different
than something like a tun-tap, so the qemu-ifup requirement doesn't change.

The one thing I can think of is it would be possible to build a
venet-switch module, and this could be done without using brctl or
qemu-ifup...but then I would lose all the benefits of re-using that
infrastructure.  I do not recommend we actually do this, but it would
technically be a way to address your concern.


   
 the current code doesnt support rw on the mac attributes yet..i need a
 parser first).
 

 parser in kernel space always sounds scary to me.
   
Heh..why do you think I keep procrastinating ;)


   
 Yeah, ultimately I would love to be able to support a fairly wide range
 of the normal userspace/kernel ABI through this mechanism.  In fact, one
 of my original design goals was to somehow expose the syscall ABI
 directly via some kind of syscall proxy device on the bus.  I have since
 

 That sounds really scary for security. 


   
 backed away from that idea once I started thinking about things some
 more and realized that a significant number of system calls are really
 inappropriate for a guest type environment due to their ability to
 block.   We really dont want a vcpu to block.however, the AIO type
 

 Not only because of blocking, but also because of security issues.
 After all one of the usual reasons to run a guest is security isolation.
   
Oh yeah, totally agreed.  Not that I am advocating this, because I have
abandoned the idea.  But back when I was thinking of this, I would have
addressed the security with the vbus and syscall-proxy-device objects
themselves.  E.g. if you dont instantiate a syscall-proxy-device on the
bus, the guest wouldnt have access to syscalls at all.   And you could
put filters into the module to limit what syscalls were allowed, which
UID to make the guest appear as, etc.

 In general the more powerful the guest API the more risky it is, so some
 self moderation is probably a good thing.
   
:)

-Greg



signature.asc
Description: OpenPGP digital signature


[ kvm-Bugs-2725669 ] kvm init script breaks network interfaces with multiple IPs

2009-04-01 Thread SourceForge.net
Bugs item #2725669, was opened at 2009-04-01 16:44
Message generated for change (Comment added) made by paulsd
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725669&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Paul Donohue (paulsd)
Assigned to: Nobody/Anonymous (nobody)
Summary: kvm init script breaks network interfaces with multiple IPs

Initial Comment:
If multiple IP addresses are assigned to a network interface (Using interface 
aliases - for example 'ifconfig eth0 10.0.0.1 ; ifconfig eth0:1 10.0.0.2'), 
then the kvm init script causes the interface to become unresponsive when it 
creates a bridge using the interface.

I haven't yet had a need to use bridging for my VMs, so I haven't yet tried to 
figure out how to properly configure a bridge when multiple IPs are in use on 
the host system (I assume the multiple IPs simply need to be configured using 
aliases of the bridge itself - for example 'ifconfig sw0 10.0.0.1 ; ifconfig 
sw0:1 10.0.0.2' - but I haven't actually tried it).  Therefore, I am not sure 
at the moment how the kvm init script needs to be updated to fix this problem.

Regardless, I do have a number of machines which are using multiple IPs on the 
host system, and I recently installed kvm on them, then discovered that after 
the next reboot of each machine, the network interface is unresponsive until I 
disable the kvm init script and reboot again.

So, ideally the kvm init script should be updated to properly handle aliased 
interfaces, but at the very least, it needs to be updated to detect aliased 
interfaces and refuse to create a bridge for them, since that seems to 
completely break the underlying interface.

--

Comment By: Paul Donohue (paulsd)
Date: 2009-04-01 19:48

Message:
Yes, it does, in the userspace tree, under the scripts subdirectory:
http://git.kernel.org/?p=virt/kvm/kvm-userspace.git;a=blob;f=scripts/kvm;h=cddc931fd3b289f3c325e23b55f261e996328bd6;hb=HEAD

--

Comment By: Brian Jackson (iggy_cav)
Date: 2009-04-01 17:08

Message:
KVM doesn't come with an init script in the tarball. This is most likely
provided by your distro or some other third party. You should contact them
for support.

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2725669&group_id=180599


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Anthony Liguori

Gregory Haskins wrote:

Anthony Liguori wrote:
  
I think there is a slight disconnect here.  This is *exactly* what I am
trying to do. 


If it were exactly what you were trying to do, you would have posted a 
virtio-net in-kernel backend implementation instead of a whole new 
paravirtual IO framework ;-)



That said, I don't think we're bound today by the fact that we're in
userspace.


You will *always* be bound by the fact that you are in userspace.


Again, let's talk numbers.  A heavy-weight exit is 1us slower than a 
light-weight exit.  Ideally, you're taking < 1 exit per packet because 
you're batching notifications.  If your ping latency on bare metal 
compared to vbus is 39us to 65us, then all other things being equal, 
the cost imposed by doing what you're doing in userspace would make the 
latency be 66us, taking your latency from 166% of native to 169% of 
native.  That's not a huge difference and I'm sure you'll agree there 
are a lot of opportunities to improve that even further.
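
(Working the arithmetic with the 1GE numbers quoted earlier: 65us / 39us
is roughly 1.67, and adding the ~1us heavy-weight-exit penalty gives
66us / 39us, roughly 1.69, which is where the 166%-to-169%-of-native
figures above come from.)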


And you didn't mention whether your latency tests are based on ping or 
something more sophisticated as ping will be a pathological case that 
doesn't allow any notification batching.



I agree that the "does anyone care" part of the equation will approach
zero as the latency difference shrinks across some threshold (probably
the single microsecond range), but I will believe that is even possible
when I see it ;)
  


Note the other hat we have to wear is not just virtualization developer 
but Linux developer.  If there are bad userspace interfaces for IO that 
impose artificial restrictions, then we need to identify those and fix them.


Regards,

Anthony Liguori


Regards,
-Greg

  




Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Anthony Liguori

Izik Eidus wrote:
Anthony, the biggest problem with madvise() is that it is a real 
system call API; at this stage of ksm I wouldn't want to commit to API 
changes in Linux...


The ioctl itself is restricting; madvise is much more so...,

Can we defer this issue until after ksm is merged, and after all the big 
new features that we want to add to ksm have been merged?
(Then the API would be much more stable, and we would be able to ask 
people on the list about changing the API; but for a new driver that is yet 
to be merged, it is kind of overkill to add an API to Linux.)


What do you think?


You can't change ABIs after something is merged or you break userspace.  
So you need to figure out the right ABI first.


Regards,

Anthony Liguori



Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Chris Wright
* Anthony Liguori (anth...@codemonkey.ws) wrote:
 You can't change ABIs after something is merged or you break userspace.   
 So you need to figure out the right ABI first.

Absolutely.


Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Chris Wright
* Anthony Liguori (anth...@codemonkey.ws) wrote:
 The ioctl() interface is quite bad for what you're doing.  You're  
 telling the kernel extra information about a VA range in userspace.   
 That's what madvise is for.  You're tweaking simple read/write values of  
 kernel infrastructure.  That's what sysfs is for.

I agree re: sysfs (brought it up myself before).  As far as madvise vs.
ioctl, the one thing that comes from the ioctl is fops->release to
automagically unregister memory on exit.  This needs to be handled
anyway if some -p pid is added to add a process after it's running,
so less weight there.

thanks,
-chris


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Rusty Russell
On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote:
 Rusty Russell wrote:
  I could dig through the code, but I'll ask directly: what heuristic do
  you use for notification prevention in your venet_tap driver?
 
 I am not 100% sure I know what you mean with notification prevention,
 but let me take a stab at it.

Good stab :)

 I only signal back to the guest to reclaim its skbs every 10
 packets, or if I drain the queue, whichever comes first (note to self:
 make this # configurable).

Good stab, though I was referring to guest->host signals (I'll assume
you use a similar scheme there).

You use a number of packets, qemu uses a timer (150usec), lguest uses a
variable timer (starting at 500usec, dropping by 1 every time but increasing
by 10 every time we get fewer packets than last time).

So, if the guest sends two packets and stops, you'll hang indefinitely?
That's why we use a timer, otherwise any mitigation scheme has this issue.
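
Roughly, the adaptive part looks like this (only a sketch of the
heuristic described above, with made-up names, not the actual lguest
code):

/* start at 500usec; shave 1usec off after every flush, and back off by
 * 10usec whenever a flush moved fewer packets than the previous one */
static unsigned int timeout_us = 500;
static unsigned int last_pkts;

static void adjust_timeout(unsigned int pkts_this_flush)
{
	if (pkts_this_flush < last_pkts)
		timeout_us += 10;	/* fewer than last time: back off */
	else if (timeout_us > 1)
		timeout_us -= 1;	/* keep nudging the latency down */
	last_pkts = pkts_this_flush;
}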

Thanks,
Rusty.


 
 The nice part about this scheme is it significantly reduces the amount
 of guest/host transitions, while still providing the lowest latency
 response for single packets possible.  e.g. Send one packet, and you get
 one hypercall, and one tx-complete interrupt as soon as it queues on the
 hardware.  Send 100 packets, and you get one hypercall and 10
 tx-complete interrupts as frequently as every tenth packet queues on the
 hardware.  There is no timer governing the flow, etc.
 
 Is that what you were asking?
 
  As you point out, 350-450 is possible, which is still bad, and it's at least
  partially caused by the exit to userspace and two system calls.  If 
  virtio_net
  had a backend in the kernel, we'd be able to compare numbers properly.

 :)
 
 But that is the whole point, isn't it?  I created vbus specifically as a
 framework for putting things in the kernel, and that *is* one of the
 major reasons it is faster than virtio-net...it's not the difference in,
 say, IOQs vs virtio-ring (though note I also think some of the
 innovations we have added such as bi-dir napi are helping too, but these
 are not in-kernel specific kinds of features and could probably help
 the userspace version too).
 
 I would be entirely happy if you guys accepted the general concept and
 framework of vbus, and then worked with me to actually convert what I
 have as venet-tap into essentially an in-kernel virtio-net.  I am not
 specifically interested in creating a competing pv-net driver...I just
 needed something to showcase the concepts and I didn't want to hack the
 virtio-net infrastructure to do it until I had everyone's blessing. 
 Note to maintainers: I *am* perfectly willing to maintain the venet
 drivers if, for some reason, we decide that we want to keep them as
 is.   It's just an ideal for me to collapse virtio-net and venet-tap
 together, and I suspect our community would prefer this as well.
 
 -Greg
 
 


[PATCH] kvm : qemu : fix compilation error in kvm-userspace for ia64

2009-04-01 Thread Zhang, Yang
When using the in-kernel make, it cannot find msidef.h.  This patch
fixes that.

Signed-off-by: Yang Zhang yang.zh...@intel.com

diff --git a/kernel/include-compat/asm-ia64/msidef.h 
b/kernel/include-compat/asm-ia64/msidef.h
new file mode 100644
index 000..592c104
--- /dev/null
+++ b/kernel/include-compat/asm-ia64/msidef.h
@@ -0,0 +1,42 @@
+#ifndef _IA64_MSI_DEF_H
+#define _IA64_MSI_DEF_H
+
+/*
+ * Shifts for APIC-based data
+ */
+
+#define MSI_DATA_VECTOR_SHIFT  0
+#define MSI_DATA_VECTOR(v)		(((u8)v) << MSI_DATA_VECTOR_SHIFT)
+#define MSI_DATA_VECTOR_MASK		0xff00
+
+#define MSI_DATA_DELIVERY_MODE_SHIFT	8
+#define MSI_DATA_DELIVERY_FIXED	(0 << MSI_DATA_DELIVERY_MODE_SHIFT)
+#define MSI_DATA_DELIVERY_LOWPRI	(1 << MSI_DATA_DELIVERY_MODE_SHIFT)
+
+#define MSI_DATA_LEVEL_SHIFT		14
+#define MSI_DATA_LEVEL_DEASSERT	(0 << MSI_DATA_LEVEL_SHIFT)
+#define MSI_DATA_LEVEL_ASSERT		(1 << MSI_DATA_LEVEL_SHIFT)
+
+#define MSI_DATA_TRIGGER_SHIFT		15
+#define MSI_DATA_TRIGGER_EDGE		(0 << MSI_DATA_TRIGGER_SHIFT)
+#define MSI_DATA_TRIGGER_LEVEL		(1 << MSI_DATA_TRIGGER_SHIFT)
+
+/*
+ * Shift/mask fields for APIC-based bus address
+ */
+
+#define MSI_ADDR_DEST_ID_SHIFT		4
+#define MSI_ADDR_HEADER		0xfee0
+
+#define MSI_ADDR_DEST_ID_MASK		0xffff
+#define MSI_ADDR_DEST_ID_CPU(cpu)	((cpu) << MSI_ADDR_DEST_ID_SHIFT)
+
+#define MSI_ADDR_DEST_MODE_SHIFT	2
+#define MSI_ADDR_DEST_MODE_PHYS	(0 << MSI_ADDR_DEST_MODE_SHIFT)
+#define MSI_ADDR_DEST_MODE_LOGIC	(1 << MSI_ADDR_DEST_MODE_SHIFT)
+
+#define MSI_ADDR_REDIRECTION_SHIFT	3
+#define MSI_ADDR_REDIRECTION_CPU	(0 << MSI_ADDR_REDIRECTION_SHIFT)
+#define MSI_ADDR_REDIRECTION_LOWPRI	(1 << MSI_ADDR_REDIRECTION_SHIFT)
+
+#endif	/* _IA64_MSI_DEF_H */
-- 
1.6.0.rc1


[PATCH] KVM: Qemu: Flush i-cache after ide-dma operation in IA64

2009-04-01 Thread Zhang, Yang
The data from DMA may include instructions.  In order to execute the right
instructions, we should flush the i-cache to ensure that data can be seen
by the cpu.

Signed-off-by: Xiantao Zhang xiantao.zh...@intel.com
Signed-off-by: Yang Zhang yang.zh...@intel.com
---


diff --git a/qemu/cache-utils.h b/qemu/cache-utils.h
index b45fde4..5e11d12 100644
--- a/qemu/cache-utils.h
+++ b/qemu/cache-utils.h
@@ -33,8 +33,22 @@ static inline void flush_icache_range(unsigned long start, 
unsigned long stop)
     asm volatile ("sync" : : : "memory");
     asm volatile ("isync" : : : "memory");
 }
+#define qemu_sync_idcache flush_icache_range
+#else
 
+#ifdef __ia64__
+static inline void qemu_sync_idcache(unsigned long start, unsigned long stop)
+{
+    while (start < stop) {
+        asm volatile ("fc %0" :: "r"(start));
+        start += 32;
+    }
+    asm volatile (";;sync.i;;srlz.i;;");
+}
 #else
+static inline void qemu_sync_idcache(unsigned long start, unsigned long stop)
+#endif
+
 #define qemu_cache_utils_init(envp) do { (void) (envp); } while (0)
 #endif
 
diff --git a/qemu/cutils.c b/qemu/cutils.c
index 5b36cc6..7b57173 100644
--- a/qemu/cutils.c
+++ b/qemu/cutils.c
@@ -23,6 +23,7 @@
  */
 #include "qemu-common.h"
 #include "host-utils.h"
+#include "cache-utils.h"
 #include <assert.h>
 
 void pstrcpy(char *buf, int buf_size, const char *str)
@@ -215,6 +216,8 @@ void qemu_iovec_from_buffer(QEMUIOVector *qiov, const void *buf, size_t count)
         if (copy > qiov->iov[i].iov_len)
             copy = qiov->iov[i].iov_len;
         memcpy(qiov->iov[i].iov_base, p, copy);
+        qemu_sync_idcache((unsigned long)qiov->iov[i].iov_base,
+                          (unsigned long)(qiov->iov[i].iov_base + copy));
 p += copy;
 count -= copy;
 }
-- 
1.6.0.rc1




Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Rusty Russell wrote:
 On Wednesday 01 April 2009 22:05:39 Gregory Haskins wrote:
   
 Rusty Russell wrote:
 
 I could dig through the code, but I'll ask directly: what heuristic do
 you use for notification prevention in your venet_tap driver?
   
 I am not 100% sure I know what you mean with notification prevention,
 but let me take a stab at it.
 

 Good stab :)

   
 I only signal back to the guest to reclaim its skbs every 10
 packets, or if I drain the queue, whichever comes first (note to self:
 make this # configurable).
 

 Good stab, though I was referring to guest->host signals (I'll assume
 you use a similar scheme there).
   
Oh, actually no.  The guest->host path only uses the bidir napi thing
I mentioned.  So first packet hypercalls the host immediately with no
delay, schedules my host-side rx thread, disables subsequent
hypercalls, and returns to the guest.  If the guest tries to send
another packet before the time it takes the host to drain all queued
skbs (in this case, 1), it will simply queue it to the ring with no
additional hypercalls.  Like typical napi ingress processing, the host
will leave hypercalls disabled until it finds the ring empty, so this
process can continue indefinitely until the host catches up.  Once fully
drained, the host will re-enable the hypercall channel and subsequent
transmissions will repeat the original process.

In summary, infrequent transmissions will tend to have one hypercall per
packet.  Bursty transmissions will have one hypercall per burst
(starting immediately with the first packet).  In both cases, we
minimize the latency to get the first packet out the door.

So really the only place I am using a funky heuristic is the modulus 10
operation for tx-complete going host->guest.  The rest are kind of
standard napi event mitigation techniques.

 You use a number of packets, qemu uses a timer (150usec), lguest uses a
 variable timer (starting at 500usec, dropping by 1 every time but increasing
 by 10 every time we get fewer packets than last time).

 So, if the guest sends two packets and stops, you'll hang indefinitely?
   
Shouldn't, no.  The host will send tx-complete interrupts at *max* every
10 packets, but if it drains the queue before the modulus 10 expires, it
will send a tx-complete immediately, right before it re-enables
hypercalls.  So there is no hang, and there is no delay.
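
If it helps to picture it, the signaling pattern is roughly the
following (a heavily simplified sketch with hypothetical names, not the
actual venet-tap code; the real thing is linked below):

/* hypothetical helpers, declared only so the sketch stands alone */
struct tx_ring;
extern int  ring_empty(struct tx_ring *ring);
extern void *ring_pop(struct tx_ring *ring);
extern void host_transmit(void *skb);
extern void signal_guest_tx_complete(struct tx_ring *ring);
extern void enable_guest_notifications(struct tx_ring *ring);

#define TX_COMPLETE_BATCH 10

static void host_drain_tx_ring(struct tx_ring *ring)
{
	unsigned int sent = 0;

	while (!ring_empty(ring)) {
		host_transmit(ring_pop(ring));

		/* tx-complete at most once every TX_COMPLETE_BATCH packets */
		if (++sent % TX_COMPLETE_BATCH == 0)
			signal_guest_tx_complete(ring);
	}

	/* fully drained: send the final tx-complete immediately ... */
	if (sent % TX_COMPLETE_BATCH)
		signal_guest_tx_complete(ring);

	/* ... and only then re-enable guest->host hypercalls */
	enable_guest_notifications(ring);
}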

For reference, here is the modulus 10 signaling
(./drivers/vbus/devices/venet-tap.c, line 584):

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l584

Here is the one that happens after the queue is fully drained (line 593)

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l593

and finally, here is where I re-enable hypercalls (or system calls if
the driver is in userspace, etc)

http://git.kernel.org/?p=linux/kernel/git/ghaskins/vbus/linux-2.6.git;a=blob;f=drivers/vbus/devices/venet-tap.c;h=0ccb7ed94a1a8edd0cca269488f940f40fce20df;hb=master#l600

 That's why we use a timer, otherwise any mitigation scheme has this issue.
   

I'm not sure I follow.  I don't think I need a timer at all using this
scheme, but perhaps I am missing something?

Thanks Rusty!
-Greg







[PATCH] KVM: Discard reserved bits checking on PDE bit 7-8

2009-04-01 Thread Sheng Yang
1. It's related to a Linux kernel bug which was fixed by Ingo in commit
07a66d7c53a538e1a9759954a82bb6c07365eff9. The original code had existed for quite a
long time, and it would convert a PDE for a large page into a normal PDE. But it
failed to fit the normal PDE well.  With the code before Ingo's fix, the kernel would
fail the reserved bit check on bit 8 - the remaining global bit of the PTE. So the
kernel would receive a double fault.

2. After discussion, we decided to discard PDE bit 7-8 reserved checking for now.
These bits are marked as reserved in the SDM, but are not in fact checked by the
processor...

Signed-off-by: Sheng Yang sh...@linux.intel.com
---
 arch/x86/kvm/mmu.c |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index e0f63b6..a0b130d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2196,7 +2196,7 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
 		break;
 	case PT32E_ROOT_LEVEL:
 		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 62);	/* PDE */
+			rsvd_bits(maxphyaddr, 62);	/* PDE */
 		context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 62);	/* PTE */
 		context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
@@ -2210,13 +2210,14 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
 		context->rsvd_bits_mask[0][2] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
 		context->rsvd_bits_mask[0][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51) | rsvd_bits(7, 8);
+			rsvd_bits(maxphyaddr, 51);
 		context->rsvd_bits_mask[0][0] = exb_bit_rsvd |
 			rsvd_bits(maxphyaddr, 51);
 		context->rsvd_bits_mask[1][3] = context->rsvd_bits_mask[0][3];
 		context->rsvd_bits_mask[1][2] = context->rsvd_bits_mask[0][2];
 		context->rsvd_bits_mask[1][1] = exb_bit_rsvd |
-			rsvd_bits(maxphyaddr, 51) | rsvd_bits(13, 20);
+			rsvd_bits(maxphyaddr, 51) |
+			rsvd_bits(13, 20);	/* large page */
 		context->rsvd_bits_mask[1][0] = ~0ull;
 		break;
 	}
-- 
1.5.4.5



Re: [PATCH 4/4] add ksm kernel shared memory driver.

2009-04-01 Thread Anthony Liguori

Chris Wright wrote:

* Anthony Liguori (anth...@codemonkey.ws) wrote:
  
The ioctl() interface is quite bad for what you're doing.  You're  
telling the kernel extra information about a VA range in userspace.   
That's what madvise is for.  You're tweaking simple read/write values of  
kernel infrastructure.  That's what sysfs is for.



I agree re: sysfs (brought it up myself before).  As far as madvise vs.
ioctl, the one thing that comes from the ioctl is fops->release to
automagically unregister memory on exit.


This is precisely why ioctl() is a bad interface.  fops->release isn't 
tied to the process but rather tied to the open file.  The file can stay 
open long after the process exits either by a fork()'d child inheriting 
the file descriptor or through something more sinister like SCM_RIGHTS.


In fact, a common mistake is to leak file descriptors by not closing 
them when exec()'ing a process.  Instead of just delaying a close, if 
you rely on this behavior to unregister memory regions, you could 
potentially have badness happen in the kernel if ksm attempted to access 
an invalid memory region.


So you absolutely have to automatically unregister regions in something 
other than the fops-release handler based on something that's tied to 
the pid's life cycle.


Using an interface like madvise() would force the issue to be dealt with 
properly from the start :-)
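
A trivial illustration of the lifetime mismatch (illustrative only,
with /dev/null standing in for /dev/ksm):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/null", O_RDWR);	/* stand-in for /dev/ksm */

	if (fork() == 0) {
		sleep(10);	/* child inherited the fd, so release()   */
		_exit(0);	/* cannot run until the child closes too  */
	}

	close(fd);	/* the registering process closes its copy ...    */
	exit(0);	/* ... and even exits, yet any cleanup hooked to   */
			/* fops->release is deferred by the inherited fd  */
}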


I'm often afraid of what sort of bugs we'd uncover in kvm if we passed 
the fds around via SCM_RIGHTS and started poking around :-/


Regards,

Anthony Liguori



  This needs to be handled
anyway if some -p pid is added to add a process after it's running,
so less weight there.

thanks,
-chris
  




Check valid bit of VM_EXIT_INTR_INFO

2009-04-01 Thread Dong, Eddie
Thx, eddie



commit ad4a9829c8d5b30995f008e32774bd5f555b7e9f
Author: root r...@eddie-wb.localdomain
Date:   Thu Apr 2 11:16:03 2009 +0800

Check valid bit of VM_EXIT_INTR_INFO before unblock nmi.

Signed-off-by: Eddie Dong eddie.d...@intel.com

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..689523a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3268,16 +3268,18 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 
 	exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
 	if (cpu_has_virtual_nmis()) {
-		unblock_nmi = (exit_intr_info & INTR_INFO_UNBLOCK_NMI) != 0;
-		vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
 		/*
 		 * SDM 3: 25.7.1.2
 		 * Re-set bit block by NMI before VM entry if vmexit caused by
 		 * a guest IRET fault.
 		 */
-		if (unblock_nmi && vector != DF_VECTOR)
-			vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+		if (exit_intr_info & INTR_INFO_VALID_MASK) {
+			unblock_nmi = !!(exit_intr_info & INTR_INFO_UNBLOCK_NMI);
+			vector = exit_intr_info & INTR_INFO_VECTOR_MASK;
+			if (unblock_nmi && vector != DF_VECTOR)
+				vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
 				  GUEST_INTR_STATE_NMI);
+		}
 	} else if (unlikely(vmx->soft_vnmi_blocked))
 		vmx->vnmi_blocked_time +=
 			ktime_to_ns(ktime_sub(ktime_get(), vmx->entry_time));



Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Gregory Haskins
Anthony Liguori wrote:
 Gregory Haskins wrote:
 Anthony Liguori wrote:
   I think there is a slight disconnect here.  This is *exactly* what
 I am
 trying to do. 

 If it were exactly what you were trying to do, you would have posted a
 virtio-net in-kernel backend implementation instead of a whole new
 paravirtual IO framework ;-)

semantics, semantics ;)

but ok, fair enough.


 That said, I don't think we're bound today by the fact that we're in
 userspace.
 
 You will *always* be bound by the fact that you are in userspace.

 Again, let's talk numbers.  A heavy-weight exit is 1us slower than a
 light weight exit.  Ideally, you're taking < 1 exit per packet because
 you're batching notifications.  If your ping latency on bare metal
 compared to vbus is 39us to 65us, then all other things being equal,
 the cost imposed by doing what you're doing in userspace would make the
 latency be 66us, taking your latency from 166% of native to 169% of
 native.  That's not a huge difference and I'm sure you'll agree there
 are a lot of opportunities to improve that even further.

Ok, so let's see it happen.  Consider the gauntlet thrown :)  Your
challenge, should you choose to accept it, is to take today's 4000us and
hit a 65us latency target while maintaining 10GE line-rate (at least
1500 mtu line-rate).

I personally don't want to even stop at 65.  I want to hit that 36us!  
In case you think that is crazy, my first prototype of venet was hitting
about 140us, and I shaved 10us here, 10us there, eventually getting down
to the 65us we have today.  The low hanging fruit is all but harvested
at this point, but I am not done searching for additional sources of
latency. I just needed to take a breather to get the code out there for
review. :)


 And you didn't mention whether your latency tests are based on ping or
 something more sophisticated

Well, the numbers posted were actually from netperf -t UDP_RR.  This
generates a pps from a continuous (but non-bursted) RTT measurement.  So
I invert the pps result of this test to get the average rtt time.  I
have also confirmed that ping jives with these results (e.g. virtio-net
results were about 4ms, and venet were about 0.065ms as reported by ping).
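
(The inversion is just rtt = 1/pps; e.g. the 15255 round-trips/sec venet
figure from the 1GE run works out to 1/15255 s, or about 65.5us per
round trip, matching the 65us quoted.)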

 as ping will be a pathological case
Ah, but this is not really pathological IMO.  There are plenty of
workloads that exhibit request-reply patterns (e.g. RPC), and this is a
direct measurement of the systems ability to support these
efficiently.   And even unidirectional flows can be hampered by poor
latency (think PTP clock sync, etc).

Massive throughput with poor latency is like Andrew Tanenbaum's
station-wagon full of backup tapes ;)  I think I have proven we can
actually get both with a little creative use of resources.

 that doesn't allow any notification batching.
Well, if we can take anything away from all this: I think I have
demonstrated that you don't need notification batching to get good
throughput.  And batching on the head-end of the queue adds directly to
your latency overhead, so I don't think its a good technique in general
(though I realize that not everyone cares about latency, per se, so
maybe most are satisfied with the status-quo).


 I agree that the "does anyone care" part of the equation will approach
 zero as the latency difference shrinks across some threshold (probably
 the single microsecond range), but I will believe that is even possible
 when I see it ;)
   

 Note the other hat we have to wear is not just virtualization
 developer but Linux developer.  If there are bad userspace interfaces
 for IO that impose artificial restrictions, then we need to identify
 those and fix them.

Fair enough, and I would love to take that on but alas my
development/debug bandwidth is rather finite these days ;)

-Greg






Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Herbert Xu
Rusty Russell ru...@rustcorp.com.au wrote:
 
 As you point out, 350-450 is possible, which is still bad, and it's at least
 partially caused by the exit to userspace and two system calls.  If virtio_net
 had a backend in the kernel, we'd be able to compare numbers properly.

FWIW I don't really care whether we go with this or a kernel
virtio_net backend.  Either way should be good.  However the
status quo where we're stuck with a user-space backend really
sucks!

Thanks,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Herbert Xu
Anthony Liguori anth...@codemonkey.ws wrote:

 That said, I don't think we're bound today by the fact that we're in 
 userspace.  Rather we're bound by the interfaces we have between the 
 host kernel and userspace to generate IO.  I'd rather fix those 
 interfaces than put more stuff in the kernel.

I'm sorry but I totally disagree with that.  By having our IO
infrastructure in user-space we've basically given up the main
advantage of kvm, which is that the physical drivers operate in
the same environment as the hypervisor.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 00/17] virtual-bus

2009-04-01 Thread Herbert Xu
Chris Wright chr...@sous-sol.org wrote:

 That said, I don't think we're bound today by the fact that we're in  
 userspace.  Rather we're bound by the interfaces we have between the  
 host kernel and userspace to generate IO.  I'd rather fix those  
 interfaces than put more stuff in the kernel.
 
 And more stuff in the kernel can come at the potential cost of weakening
 protection/isolation.

Protection/isolation always comes at a cost.  Not everyone wants
to pay that, just like health insurance :) We should enable the
users to choose which model they want, based on their needs.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


The errors appear when compiling kvm-guest-drivers on kernel-2.6.29

2009-04-01 Thread Zhiyong Wu
HI,

I have recently set up a guest network in KVM.

When I tried to compile the kvm guest driver (virtio) on kernel 2.6.29, an
issue appeared:

(1) the version of kvm guest driver

[r...@fedora9 kvm-guest-drivers-linux {master}]$ git describe
kvm-guest-drivers-linux-1-13-gae1ae62

(2) the output of make

[r...@fedora9 kvm-guest-drivers-linux {master}]$ make
make -C /lib/modules/2.6.29/build M=`pwd` $@
make[1]: Entering directory `/usr/src/linux-2.6.29'
  CC [M]  /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c: In function
\u2018xmit_skb\u2019:
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
\u2018CHECKSUM_HW\u2019 undeclared (first use in this function)
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
(Each undeclared identifier is reported only once
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
for each function it appears in.)
/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:550: error:
\u2018struct sk_buff\u2019 has no member named \u2018h\u2019
make[2]: *** [/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o] Error 1
make[1]: *** [_module_/home/zwu/study/virt/kvm-guest-drivers-linux] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.29'
make: *** [all] Error 2


Cheers,

Zhiyong Wu


Re: The errors appear when compiling kvm-guest-drivers on kernel-2.6.29

2009-04-01 Thread Zhiyong Wu
In virtio_net.c,

#ifdef COMPAT_csum_offset
if (skb->ip_summed == CHECKSUM_HW) {
#else
if (skb->ip_summed == CHECKSUM_PARTIAL) {

It seems that CHECKSUM_HW is not declared.

In skbuff.h, only the macros below are defined.

/* Don't change this without changing skb_csum_unnecessary! */
#define CHECKSUM_NONE 0
#define CHECKSUM_UNNECESSARY 1
#define CHECKSUM_COMPLETE 2
#define CHECKSUM_PARTIAL 3
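
If the only goal is to get this out-of-tree driver to build against
2.6.29, one possible stop-gap (untested, purely a guess at the compat
path's intent) would be to map the old name onto the new one before the
check, since on the xmit path CHECKSUM_HW corresponds to what newer
kernels call CHECKSUM_PARTIAL:

/* untested compat guess; the later skb->h error would still need fixing */
#ifndef CHECKSUM_HW
#define CHECKSUM_HW CHECKSUM_PARTIAL
#endif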

Cheers,

Zhiyong Wu

On Thu, Apr 2, 2009 at 12:10 PM, Zhiyong Wu zwu.ker...@gmail.com wrote:
 HI,

 I have recently setup a guest network in KVM,

 when i tried to compile kvm guest driver - virtio on kernel-2.6.29, an
 issue has appeared

 (1) the version of kvm guest driver

 [r...@fedora9 kvm-guest-drivers-linux {master}]$ git describe
 kvm-guest-drivers-linux-1-13-gae1ae62

 (2) the output of make

 [r...@fedora9 kvm-guest-drivers-linux {master}]$ make
 make -C /lib/modules/2.6.29/build M=`pwd` $@
 make[1]: Entering directory `/usr/src/linux-2.6.29'
  CC [M]  /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c: In function
 \u2018xmit_skb\u2019:
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
 \u2018CHECKSUM_HW\u2019 undeclared (first use in this function)
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
 (Each undeclared identifier is reported only once
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
 for each function it appears in.)
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:550: error:
 \u2018struct sk_buff\u2019 has no member named \u2018h\u2019
 make[2]: *** [/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o] 
 Error 1
 make[1]: *** [_module_/home/zwu/study/virt/kvm-guest-drivers-linux] Error 2
 make[1]: Leaving directory `/usr/src/linux-2.6.29'
 make: *** [all] Error 2


 Cheers,

 Zhiyong Wu



Re: The errors appear when compiling kvm-guest-drivers on kernel-2.6.29

2009-04-01 Thread Brian Jackson
I don't think the kvm-guest-drivers are still well maintained (they haven't 
been touched in 5 months). If you are using kernel 2.6.29, it already has 
virtio drivers and you don't need the kvm-guest-drivers tree at all.

--Brian Jackson



On Wednesday 01 April 2009 23:10:43 Zhiyong Wu wrote:
 HI,

 I have recently setup a guest network in KVM,

 when i tried to compile kvm guest driver - virtio on kernel-2.6.29, an
 issue has appeared

 (1) the version of kvm guest driver

 [r...@fedora9 kvm-guest-drivers-linux {master}]$ git describe
 kvm-guest-drivers-linux-1-13-gae1ae62

 (2) the output of make

 [r...@fedora9 kvm-guest-drivers-linux {master}]$ make
 make -C /lib/modules/2.6.29/build M=`pwd` $@
 make[1]: Entering directory `/usr/src/linux-2.6.29'
   CC [M]  /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c: In function
 \u2018xmit_skb\u2019:
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
 \u2018CHECKSUM_HW\u2019 undeclared (first use in this function)
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
 (Each undeclared identifier is reported only once
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:544: error:
 for each function it appears in.)
 /home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.c:550: error:
 \u2018struct sk_buff\u2019 has no member named \u2018h\u2019
 make[2]: *** [/home/zwu/study/virt/kvm-guest-drivers-linux/virtio_net.o]
 Error 1 make[1]: *** [_module_/home/zwu/study/virt/kvm-guest-drivers-linux]
 Error 2 make[1]: Leaving directory `/usr/src/linux-2.6.29'
 make: *** [all] Error 2


 Cheers,

 Zhiyong Wu


[PATCH 5/4] update ksm userspace interfaces

2009-04-01 Thread Chris Wright
* Anthony Liguori (anth...@codemonkey.ws) wrote:
 Using an interface like madvise() would force the issue to be dealt with  
 properly from the start :-)

Yeah, I'm not at all opposed to it.

This updates to madvise for register and sysfs for control.

madvise issues:
- MADV_SHAREABLE
  - register only ATM, can add MADV_UNSHAREABLE to allow an app to proactively
unregister, but need a cleanup when ->mm goes away via exit/exec
  - will register a region per vma, should probably push the whole thing
into vma rather than keep [mm,addr,len] tuple in ksm
sysfs issues:
- none really, I added a reporting mechanism for number of pages shared,
  doesn't decrement on COW
- could use some extra sanity checks

It compiles!  Diff output is hard to read, I can send a 4/4 w/ this
patch rolled in for easier review.

Signed-off-by: Chris Wright chr...@redhat.com
---
 include/asm-generic/mman.h |1 +
 include/linux/ksm.h|   63 +
 mm/ksm.c   |  352 
 mm/madvise.c   |   18 +++
 4 files changed, 149 insertions(+), 285 deletions(-)

diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h
index 5e3dde2..a1c1d5c 100644
--- a/include/asm-generic/mman.h
+++ b/include/asm-generic/mman.h
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9	/* remove these pages & resources */
 #define MADV_DONTFORK	10	/* don't inherit across fork */
 #define MADV_DOFORK	11	/* do inherit across fork */
+#define MADV_SHAREABLE	12	/* can share identical pages */
 
 /* compatibility flags */
 #define MAP_FILE   0
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 5776dce..e032f6f 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -1,69 +1,8 @@
 #ifndef __LINUX_KSM_H
 #define __LINUX_KSM_H
 
-/*
- * Userspace interface for /dev/ksm - kvm shared memory
- */
-
-#include <linux/types.h>
-#include <linux/ioctl.h>
-
-#include <asm/types.h>
-
-#define KSM_API_VERSION 1
-
 #define ksm_control_flags_run 1
 
-/* for KSM_REGISTER_MEMORY_REGION */
-struct ksm_memory_region {
-   __u32 npages; /* number of pages to share */
-   __u32 pad;
-   __u64 addr; /* the begining of the virtual address */
-__u64 reserved_bits;
-};
-
-struct ksm_kthread_info {
-   __u32 sleep; /* number of microsecoends to sleep */
-   __u32 pages_to_scan; /* number of pages to scan */
-   __u32 flags; /* control flags */
-__u32 pad;
-__u64 reserved_bits;
-};
-
-#define KSMIO 0xAB
-
-/* ioctls for /dev/ksm */
-
-#define KSM_GET_API_VERSION  _IO(KSMIO,   0x00)
-/*
- * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd
- */
-#define KSM_CREATE_SHARED_MEMORY_AREA_IO(KSMIO,   0x01) /* return SMA fd */
-/*
- * KSM_START_STOP_KTHREAD - control the kernel thread scanning speed
- * (can stop the kernel thread from working by setting running = 0)
- */
-#define KSM_START_STOP_KTHREAD  _IOW(KSMIO,  0x02,\
- struct ksm_kthread_info)
-/*
- * KSM_GET_INFO_KTHREAD - return information about the kernel thread
- * scanning speed.
- */
-#define KSM_GET_INFO_KTHREAD_IOW(KSMIO,  0x03,\
- struct ksm_kthread_info)
-
-
-/* ioctls for SMA fds */
-
-/*
- * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
- * scanned by kvm.
- */
-#define KSM_REGISTER_MEMORY_REGION   _IOW(KSMIO,  0x20,\
- struct ksm_memory_region)
-/*
- * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
- */
-#define KSM_REMOVE_MEMORY_REGION _IO(KSMIO,   0x21)
+long ksm_register_memory(struct vm_area_struct *, unsigned long, unsigned long);
 
 #endif
diff --git a/mm/ksm.c b/mm/ksm.c
index eba4c09..fcbf76e 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -17,7 +17,6 @@
 #include <linux/errno.h>
 #include <linux/mm.h>
 #include <linux/fs.h>
-#include <linux/miscdevice.h>
 #include <linux/vmalloc.h>
 #include <linux/file.h>
 #include <linux/mman.h>
@@ -38,6 +37,7 @@
 #include <linux/rbtree.h>
 #include <linux/anon_inodes.h>
 #include <linux/ksm.h>
+#include <linux/kobject.h>
 
 #include <asm/tlbflush.h>
 
@@ -55,20 +55,11 @@ MODULE_PARM_DESC(rmap_hash_size, "Hash table size for the reverse mapping");
  */
 struct ksm_mem_slot {
struct list_head link;
-   struct list_head sma_link;
struct mm_struct *mm;
unsigned long addr; /* the begining of the virtual address */
unsigned npages;/* number of pages to share */
 };
 
-/*
- * ksm_sma - shared memory area, each process have its own sma that contain the
- * information about the slots that it own
- */
-struct ksm_sma {
-   struct list_head sma_slots;
-};
-
 /**
  * struct ksm_scan - cursor for scanning
  * @slot_index: the current slot we are scanning
@@ -190,6 +181,7 @@ static struct kmem_cache *rmap_item_cache;
 
 static int 

[PATCH 4/4 alternative userspace] add ksm kernel shared memory driver

2009-04-01 Thread Chris Wright
Here's ksm w/ a user interface built around madvise for registering and
sysfs for controlling (should just drop config tristate and make it bool,
CONFIG_KSM= y or n).

#include Izik's changelog

  Ksm is a driver that allows merging identical pages between one or more
  applications, in a way invisible to the applications that use it.
  Pages that are merged are marked as readonly and are COWed when any
  application tries to change them.

  Ksm is used for cases where using fork() is not suitable;
  one of these cases is where the pages of the application keep changing
  dynamically and the application cannot know in advance what pages are
  going to be identical.

  Ksm works by walking over the memory pages of the applications it
  scans in order to find identical pages.
  It uses two sorted data structures, called the stable and unstable trees,
  to find identical pages in an effective way.

  When ksm finds two identical pages, it marks them as readonly and merges
  them into a single page; after the pages are marked as readonly and
  merged into one page, linux will treat these pages as normal
  copy_on_write pages and will fork them when write access happens to them.

  Ksm scans just the memory areas that were registered to be scanned by it.

Ksm api (for users to register region):

Register a memory region as shareable:

madvise(void *addr, size_t len, MADV_SHAREABLE)

Unregister a shareable memory region (not currently implemented):

madvise(void *addr, size_t len, MADV_UNSHAREABLE)

Ksm api (for users to control ksm scanning daemon):
/sys/kernel/mm/ksm
|-- pages_shared-- RO, attribute showing number of pages shared
|-- pages_to_scan   -- RW, number of pages to scan per scan loop
|-- run -- RW, whether scanning daemon should scan
`-- sleep   -- RW, number of usecs to sleep between scan loops
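
For illustration, a minimal user of the interface might look like the
sketch below (it assumes the MADV_SHAREABLE value from the patch is
visible to userspace, e.g. via updated headers):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_SHAREABLE
#define MADV_SHAREABLE 12		/* value from the patch above */
#endif

int main(void)
{
	size_t len = 16 * 4096;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* mark the region as a candidate for page merging */
	if (madvise(buf, len, MADV_SHAREABLE) < 0)
		perror("madvise(MADV_SHAREABLE)");

	/* kick the scanning daemon via sysfs (normally an admin task) */
	FILE *f = fopen("/sys/kernel/mm/ksm/run", "w");
	if (f) {
		fputs("1\n", f);
		fclose(f);
	}
	return 0;
}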

Signed-off-by: Izik Eidus iei...@redhat.com
Signed-off-by: Chris Wright chr...@redhat.com
---
 include/asm-generic/mman.h |1 +
 include/linux/ksm.h|8 +
 mm/Kconfig |6 +
 mm/Makefile|1 +
 mm/ksm.c   | 1337 
 mm/madvise.c   |   18 +
 6 files changed, 1371 insertions(+), 0 deletions(-)

diff --git a/include/asm-generic/mman.h b/include/asm-generic/mman.h
index 5e3dde2..a1c1d5c 100644
--- a/include/asm-generic/mman.h
+++ b/include/asm-generic/mman.h
@@ -34,6 +34,7 @@
 #define MADV_REMOVE	9	/* remove these pages & resources */
 #define MADV_DONTFORK	10	/* don't inherit across fork */
 #define MADV_DOFORK	11	/* do inherit across fork */
+#define MADV_SHAREABLE	12	/* can share identical pages */
 
 /* compatibility flags */
 #define MAP_FILE   0
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..e032f6f
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,8 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+#define ksm_control_flags_run 1
+
+long ksm_register_memory(struct vm_area_struct *, unsigned long, unsigned long);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT
 
 config MMU_NOTIFIER
bool
+
+config KSM
+	tristate "Enable KSM for page sharing"
+   help
+ Enable the KSM kernel module to allow page sharing of equal pages
+ among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..fcbf76e
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1337 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C) 2008 Red Hat, Inc.
+ * Authors:
+ * Izik Eidus
+ * Andrea Arcangeli
+ * Chris Wright
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.
+ */
+
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/vmalloc.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include <linux/sched.h>
+#include <linux/rwsem.h>
+#include <linux/pagemap.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>
+#include <linux/spinlock.h>
+#include <linux/jhash.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/wait.h>
+#include <linux/scatterlist.h>
+#include <linux/random.h>
+#include <linux/slab.h>
+#include <linux/swap.h>
+#include <linux/rbtree.h>
+#include <linux/anon_inodes.h>

Re: [PATCH 4/4 alternative userspace] add ksm kernel shared memory driver

2009-04-01 Thread Bert Wesarg
On Thu, Apr 2, 2009 at 07:48, Chris Wright chr...@redhat.com wrote:
 Ksm api (for users to register region):

 Register a memory region as shareable:

 madvise(void *addr, size_t len, MADV_SHAREABLE)

 Unregister a shareable memory region (not currently implemented):

 madvise(void *addr, size_t len, MADV_UNSHAREABLE)
I can't find a definition for MADV_UNSHAREABLE!

Bert