[PATCH] KVM: Fix srcu struct leakage

2010-11-07 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Clean up the srcu struct and refactor its release on early errors.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 virt/kvm/kvm_main.c |   15 +++
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4111a4b..c80a44a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -401,23 +401,19 @@ static struct kvm *kvm_create_vm(void)
r = -ENOMEM;
kvm-memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
if (!kvm-memslots)
-   goto out_err;
+   goto out_err_nosrcu;
if (init_srcu_struct(kvm-srcu))
-   goto out_err;
+   goto out_err_nosrcu;
for (i = 0; i  KVM_NR_BUSES; i++) {
kvm-buses[i] = kzalloc(sizeof(struct kvm_io_bus),
GFP_KERNEL);
-   if (!kvm-buses[i]) {
-   cleanup_srcu_struct(kvm-srcu);
+   if (!kvm-buses[i])
goto out_err;
-   }
}
 
r = kvm_init_mmu_notifier(kvm);
-   if (r) {
-   cleanup_srcu_struct(kvm-srcu);
+   if (r)
goto out_err;
-   }
 
kvm-mm = current-mm;
atomic_inc(kvm-mm-mm_count);
@@ -435,6 +431,8 @@ out:
return kvm;
 
 out_err:
+   cleanup_srcu_struct(kvm-srcu);
+out_err_nosrcu:
hardware_disable_all();
 out_err_nodisable:
for (i = 0; i  KVM_NR_BUSES; i++)
@@ -516,6 +514,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
kvm_arch_destroy_vm(kvm);
hardware_disable_all();
mmdrop(mm);
+   cleanup_srcu_struct(kvm-srcu);
 }
 
 void kvm_get_kvm(struct kvm *kvm)
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] pci-assign: Catch missing KVM support

2010-11-07 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Report an error instead of raising a SEGV when a pci-assign device is
about to be initialized without KVM enabled.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c |5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index bde231d..5f5bde1 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1440,6 +1440,11 @@ static int assigned_initfn(struct PCIDevice *pci_dev)
 uint8_t e_device, e_intx;
 int r;
 
+if (!kvm_enabled()) {
+error_report(pci-assign: error: requires KVM support);
+return -1;
+}
+
 if (!dev-host.seg  !dev-host.bus  !dev-host.dev  !dev-host.func) 
{
 error_report(pci-assign: error: no host device specified);
 return -1;
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Disk I/O stuck with KVM - no clue how to solve that

2010-11-07 Thread Hermann Himmelbauer
Am Samstag 06 November 2010 20:58:12 schrieb Stefan Hajnoczi:
 On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote:
  I experience strange disk I/O stucks on my Linux Host + Guest with KVM,
  which make the system (especially the guests) almost unusable. These
  stucks come periodically, e.g. every 2 to 10 seconds and last between 3
  and sometimes over 120 seconds, which trigger kernel messages like this
  (on host and/or guest):
 
  INFO: task postgres:2195 blocked for more than 120 seconds

 The fact that this happens on the host too suggests there's an issue
 with the host software/hardware and the VM is triggering it but not
 the root cause.

 Does dmesg display any other suspicious messages?

No, there's anything that can be seen via dmesg. I at first suspected the 
hardware, too. I can think of the following reasons:

1) Broken SATA cable / Harddisks - I changed some cables, no change, thus this 
is probably ruled out. I also can't see anything via S.M.A.R.T. Moreover, the 
problem is not bound to a specific device, instead it happens on sda - sdd, 
so I doubt it's harddisk related.

2) Broken Power Supply / Insufficient Power - I'd expect either a complete 
crash or some error messages in this case, so I'd rather rule that out.

3) Broken SATA-Controller - I cannot think of any way to check that, but I'd 
also expect some crashes or kernel messages. I flashed the board to the 
latest BIOS version, no change either.

However, it seems no one except me seems to have this problem, so I'll buy a 
new, similar but different mainboard (Intel instead of Asus), hopefully this 
solves the problem.

What do you think, any better idea?

Anyway, thanks for your reply!

Best Regards,
Hermann

-- 
herm...@qwer.tk
GPG key ID: 299893C7 (on keyservers)
FP: 0124 2584 8809 EF2A DBF9  4902 64B4 D16B 2998 93C7
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] KVM: Fix srcu struct leakage

2010-11-07 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Clean up the srcu struct on vm destruction and refactor its release on
early errors.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

This one is better: kvm_arch_destroy_vm releases the kvm object, so we
need to free the struct earlier.

 virt/kvm/kvm_main.c |   15 +++
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 4111a4b..6cfcde7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -401,23 +401,19 @@ static struct kvm *kvm_create_vm(void)
r = -ENOMEM;
kvm-memslots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
if (!kvm-memslots)
-   goto out_err;
+   goto out_err_nosrcu;
if (init_srcu_struct(kvm-srcu))
-   goto out_err;
+   goto out_err_nosrcu;
for (i = 0; i  KVM_NR_BUSES; i++) {
kvm-buses[i] = kzalloc(sizeof(struct kvm_io_bus),
GFP_KERNEL);
-   if (!kvm-buses[i]) {
-   cleanup_srcu_struct(kvm-srcu);
+   if (!kvm-buses[i])
goto out_err;
-   }
}
 
r = kvm_init_mmu_notifier(kvm);
-   if (r) {
-   cleanup_srcu_struct(kvm-srcu);
+   if (r)
goto out_err;
-   }
 
kvm-mm = current-mm;
atomic_inc(kvm-mm-mm_count);
@@ -435,6 +431,8 @@ out:
return kvm;
 
 out_err:
+   cleanup_srcu_struct(kvm-srcu);
+out_err_nosrcu:
hardware_disable_all();
 out_err_nodisable:
for (i = 0; i  KVM_NR_BUSES; i++)
@@ -513,6 +511,7 @@ static void kvm_destroy_vm(struct kvm *kvm)
 #else
kvm_arch_flush_shadow(kvm);
 #endif
+   cleanup_srcu_struct(kvm-srcu);
kvm_arch_destroy_vm(kvm);
hardware_disable_all();
mmdrop(mm);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RESEND PATCH] virtio-net: init link state correctly

2010-11-07 Thread Rusty Russell
On Fri, 5 Nov 2010 08:17:18 pm Jason Wang wrote:
 For device that supports VIRTIO_NET_F_STATUS, there's no need to
 assume the link is up and we need to call nerif_carrier_off() before
 querying device status, otherwise we may get wrong operstate after
 diver was loaded because the link watch event was not fired as
 expected.
 
 For device that does not support VIRITO_NET_F_STATUS, we could not get
 its status through virtnet_update_status() and what we can only do is
 always assuming the link is up.
 
 Acked-by: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com

Acked-by: Rusty Russell ru...@rustcorp.com.au

Thanks!
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5 v3] MSI-X mask supporting for assigned device(kernel)

2010-11-07 Thread Sheng Yang
On Saturday 06 November 2010 08:27:15 Marcelo Tosatti wrote:
 On Thu, Nov 04, 2010 at 02:15:16PM +0800, Sheng Yang wrote:
  Here is the latest series of MSI-X mask supporting patches.
  
  The bigest change from last version is, in order to reduce the
  complexity, I moved all mask bit operation to kernel, including disabled
  entries. This addressed two concerns:
  1. KVM and QEmu each own a part of mask bit operation.
  2. QEmu need accessing the real hardware to get the mask bit information.
  
  So now QEmu have to use kernel API to get mask bit information. Though it
  would be slower than direct accessing the real hardware's MMIO, the
  direct accessing is unacceptable beacuse in fact the host OS own the
  device. The host OS can access the device without notifying the
  guest(and don't need to do so). Userspace shouldn't penetrate the host
  OS layer to directly operate the real hardware, otherwise would cause
  guest confusion.
  
  Also I discard the userspace mask operation as well, after showed the
  performance number.
  
  This version also removed the capability enabling mechanism. Because we
  want to use the struct kvm_assigned_msix_entry with new IOCTL, so there
  is no compatible issue.
  
  Please review. And I would run more test with current patch. So far so
  good.
 
 It would be good to know where the performance issue is with the entire
 implementation of mask bit in QEMU.
 
 http://www.mail-archive.com/kvm@vger.kernel.org/msg42652.html.
 
 All you mentioned in the past was there was high CPU utilization when
 running in QEMU and decided to start implementing in kernel. But AFAICS
 you did not really look into where the problem was...

We've analysed the same issue in Xen, and believed it was caused by exiting to 
the 
QEmu everytime when guest want to access the mask bit(and some specific kernel 
want 
to do this desperately). So we provided the patches to Xen, and it worked very 
well. 

The cost of exiting to the userspace in KVM is much smaller than Xen side, but 
it 
still much slower than in-kernel. I've shown the performance data of kernel and 
userspace here:

http://ns.spinics.net/lists/kvm/msg43712.html

We can easily get 20% performance gain using kernel.

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-07 Thread Huang Ying
On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote:
 On 11/01/2010 03:22 PM, Anthony Liguori wrote:
  On 11/01/2010 02:20 PM, Huang Ying wrote:
  Yes. As general interface, it may not work so well, but as test
  interface, it works quite well and useful.
 
  Do we have any mechanism to add a test only interface?
 
  I'd like to see what Luiz/Markus think but definitely only a human 
  monitor interface and probably prefix the command with a 'x-' prefix 
  to indicate that it's not a supported interface.
 
 
 Automated testing will want to use qmp.

Yes. The main usage of the interface is automated testing.

  The documentation should be very clear about the limitations of the 
  interface and the intended use-case.
 
 
 Perhaps a { execute: 'enable-version-specific-commands', arguments: 
 ['pfa2hva'] } ?

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] KVM: MMU: don not retry #PF for nonpaging guest

2010-11-07 Thread Xiao Guangrong
On 11/05/2010 06:31 PM, Gleb Natapov wrote:
 On Fri, Nov 05, 2010 at 04:03:28PM +0800, Xiao Guangrong wrote:
 On 11/05/2010 03:45 PM, Gleb Natapov wrote:


 It looks like something broken: apfs can generated in L2 guest (nested ntp 
 guest)
 and be retried in L1 guest.

 Why is this a problem? apf will be generate on direct map even when L2
 guest is running so it should be OK to prefault it into direct map on
 completion.


 The nested_cr3 is different between L2 and L1, fix L2's page fault in L1's 
 page table
 is useless.
 But we are fixing L0 page faults in L0 page table. We do not start apf
 because of L1 faulted in its page table.
 

Hi Gleb,

For example, NPT Guest L1 runs on Host, and Nested NPT Guest L2 runs on Guest 
L1.
Now, Guest L2 is running, has below sequences:

a: NPF/PF occurs in L2 Guest, and generates a apf(named A-apf), then
   L2 Guest is blocked

b: a external event wakes up L2 Guest, and let it run again.

c: L2 Guest VMEXIT to L1 Guest because L2 Guest's action is intercepted by 
Guest L1

d: When cpu enter L1 Guest, A-apf is completed, then it will retry A-apf in
   L1 Guest's mmu context, and this 'retry' is useless.

Could you please point it out for me if i missed something. :-)


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ovs-dev] Flow Control and Port Mirroring

2010-11-07 Thread Rusty Russell
On Sat, 30 Oct 2010 01:29:33 pm Simon Horman wrote:
 [ CCed VHOST contacts ]
 
 On Thu, Oct 28, 2010 at 01:22:02PM -0700, Jesse Gross wrote:
  On Thu, Oct 28, 2010 at 4:54 AM, Simon Horman ho...@verge.net.au wrote:
   My reasoning is that in the non-mirroring case the guest is
   limited by the external interface through wich the packets
   eventually flow - that is 1Gbit/s. But in the mirrored either
   there is no flow control or the flow control is acting on the
   rate of dummy0, which is essentailly infinate.
  
   Before investigating this any further I wanted to ask if
   this behaviour is intentional.
  
  It's not intentional but I can take a guess at what is happening.
  
  When we send the packet to a mirror, the skb is cloned but only the
  original skb is charged to the sender.  If the original packet is
  delivered to localhost then it will be freed quickly and no longer
  accounted for, despite the fact that the real packet is still
  sitting in the transmit queue on the NIC.  The UDP stack will then
  send the next packet, limited only by the speed of the CPU.
 
 That would explain what I have observed.

I can't find the thread (what is ovs-dev?), but I think the tap device
has this fundamental feature: you can blast as many packets as you want
through it.

If that's a bad thing, we have to look harder...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support

2010-11-07 Thread Sheng Yang
On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote:
 On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
  On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
   On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
  +};
  +
  +This ioctl would enable in-kernel MSI-X emulation, which would
  handle MSI-X +mask bit in the kernel.
 
 What happens on repeated calls when it's already enabled?
 How does one disable after it's enabled?

Suppose this should only be called once. But again would update the
MMIO base.
   
   So maybe add this all to documentation.
   
It
would be disabled along with device deassignment.
   
   So what are you going to do when guest enables and later disables MSIX?
   Disable assignment completely?
  
  This device goes with PCI resources allocation, rather than IRQ
  assignment.
 
 I see. I guess we can live with it, but it seems tied to a specific
 mode of use. Would be better to support enabling together with msix too,
 this requires an ability to disable.
 
We enable it explicitly because
not all devices have MSI-X capability.

  +
 
 This is very specialized for assigned devices.  I would like an
 interface not tied to assigned devices explicitly
 (even if the implementation at the moment only works
 for assigned devices, I can patch that in later). E.g. vhost-net
 would benefit from such an extension too.  Maybe tie this to a
 special GSI and this GSI can be an assigned device or not?

You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?
   
   Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
   and kvm_assigned_msix_mmio really.
   
We can't tie it to GSI, because we
should also cover entries without GSI. The entry number should be
fine for all kinds of devices using MSI-X.
   
   I don't completely understand: entries without GSI
   never need to inject an interrupt. Why do we care
   about them? Let's pass such accesses to userspace.
   
   In any case, I think MSIX GSIs are inexpensive (we never search them
   all), so we could create GSIs for all entries if we wanted to.
  
  All mask bit accessing controlled by kernel, in order to keep consistent.
 
 Well I think passing whatever we can't handle to userspace makes
 complete sense. We do this for address/data writes after all.
 If we wanted to do it *all* in kernel we would handle
 address/data there too. I think Avi suggested this at some point.

I think Avi's point was we handle mask bit half in the kernel and half in the 
userspace, which makes API complex. I agree with this so I moved all mask bit 
handling to kernel.

Data/address is another issue, due to QEmu own the routing table. We can change 
it 
in the future, but this is not related to this patch.
 
 The main point is msix mmio BAR handling is completely unrelated
 to the assigned device. Emulated devices have an msix BAR as well.
 So tying it to an assigned devices is suboptimal, we'll need to
 grow yet another interface for these.

If you only means the name, we can change that. But I think all MSI-X entries 
would have entry number, regardless of if it got GSI number. Keep entry number 
as 
parameter makes more sense to me.

And still, the patch's only current user is assigned device. I am glad if it 
can 
cover the other future case along with it, but that's not the primary target of 
this patch.
 
I am not sure about what PV one would looks like.
I use kvm_assigned_msix_entry
just because it can be easily reused.
   
   Not only PV, emulated devices too. OK let's try to figure out:
   
   What happens with userspace is, we inject interrupts either through
   ioctls or eventfds. Avoiding an exit to userspace on MSIX
   access is beneficial in exactly the same way.
   
   We could have an ioctl to pass in the table addresses, and mapping
   table to relevant GSIs. For example, add kvm_msix_mmio with table
   start/end, also specify which GSIs are covered.
   Maybe ask that userspace allocate all GSIs for a table consequitively?
   Or add kvm_irq_routing_msix which is same as msi but
   also has the mmio addresses? Bonus points for avoiding the need
   to scan all table on each write. For example, don't scan at all,
   instead just set a bit in KVM, and set irq entry on the next
   interrupt only: we have KVM_MAX_MSIX_PER_DEV as 256 so maybe a small 32
   byte bitmap would be enough for this, avoding a linear scan.
   
   When we pass in kvm_msix_mmio, kvm would start registering masked
   state.
   
  +4.48 KVM_ASSIGN_GET_MSIX_ENTRY
  +
  +Capability: KVM_CAP_DEVICE_MSIX_MASK
  +Architectures: x86
  +Type: vm ioctl
  +Parameters: struct kvm_assigned_msix_entry (in and out)
  +Returns: 0 on success, !0 on error
  +
  +struct kvm_assigned_msix_entry {
  +   /* Assigned device's ID */
  +   __u32 assigned_dev_id;
  +   /* Ignored */
  +   

Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-07 Thread Anthony Liguori

On 11/06/2010 11:24 AM, Avi Kivity wrote:

On 11/01/2010 03:22 PM, Anthony Liguori wrote:

On 11/01/2010 02:20 PM, Huang Ying wrote:

Yes. As general interface, it may not work so well, but as test
interface, it works quite well and useful.

Do we have any mechanism to add a test only interface?


I'd like to see what Luiz/Markus think but definitely only a human 
monitor interface and probably prefix the command with a 'x-' prefix 
to indicate that it's not a supported interface.




Automated testing will want to use qmp.


The command is inherently racy so if you tried to implement an automated 
unit test, you'd get false


The documentation should be very clear about the limitations of the 
interface and the intended use-case.




Perhaps a { execute: 'enable-version-specific-commands', arguments: 
['pfa2hva'] } ?




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-07 Thread Anthony Liguori

On 11/07/2010 07:29 PM, Huang Ying wrote:

On Sun, 2010-11-07 at 00:24 +0800, Avi Kivity wrote:
   

On 11/01/2010 03:22 PM, Anthony Liguori wrote:
 

On 11/01/2010 02:20 PM, Huang Ying wrote:
   

Yes. As general interface, it may not work so well, but as test
interface, it works quite well and useful.

Do we have any mechanism to add a test only interface?
 

I'd like to see what Luiz/Markus think but definitely only a human
monitor interface and probably prefix the command with a 'x-' prefix
to indicate that it's not a supported interface.

   

Automated testing will want to use qmp.
 

Yes. The main usage of the interface is automated testing.
   


That's precisely what the command should not be used for.

You can't assume a gpa - hva mapping is consistent in an external 
application.  If you want to implement an interface for testing, you 
have to push more of the logic into QEMU to avoid the race.


Regards,

Anthony Liguori


The documentation should be very clear about the limitations of the
interface and the intended use-case.

   

Perhaps a { execute: 'enable-version-specific-commands', arguments:
['pfa2hva'] } ?
 

Best Regards,
Huang Ying


   


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ovs-dev] Flow Control and Port Mirroring

2010-11-07 Thread Simon Horman
On Mon, Nov 08, 2010 at 01:41:23PM +1030, Rusty Russell wrote:
 On Sat, 30 Oct 2010 01:29:33 pm Simon Horman wrote:
  [ CCed VHOST contacts ]
  
  On Thu, Oct 28, 2010 at 01:22:02PM -0700, Jesse Gross wrote:
   On Thu, Oct 28, 2010 at 4:54 AM, Simon Horman ho...@verge.net.au wrote:
My reasoning is that in the non-mirroring case the guest is
limited by the external interface through wich the packets
eventually flow - that is 1Gbit/s. But in the mirrored either
there is no flow control or the flow control is acting on the
rate of dummy0, which is essentailly infinate.
   
Before investigating this any further I wanted to ask if
this behaviour is intentional.
   
   It's not intentional but I can take a guess at what is happening.
   
   When we send the packet to a mirror, the skb is cloned but only the
   original skb is charged to the sender.  If the original packet is
   delivered to localhost then it will be freed quickly and no longer
   accounted for, despite the fact that the real packet is still
   sitting in the transmit queue on the NIC.  The UDP stack will then
   send the next packet, limited only by the speed of the CPU.
  
  That would explain what I have observed.
 
 I can't find the thread (what is ovs-dev?),

Sorry, yes its on ovs-dev.
http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html

 but I think the tap device
 has this fundamental feature: you can blast as many packets as you want
 through it.
 
 If that's a bad thing, we have to look harder...

There does seem to be flow control in the non-mirrored case.
So I suspect its occurring at the skb level but that breaks down when
a clone occurs. It would seem that fragment level flow control would
help this problem (which is basically what Xen's netback/netfront has),
but by this point I am speculating wildly.  I'll try and find out exactly
where the problem is occurring in order for us to have a more informed
discussion.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] qemu-kvm: device assignment: emulate MSI-X mask bits

2010-11-07 Thread Sheng Yang
On Friday 05 November 2010 21:54:16 Michael S. Tsirkin wrote:
 On Fri, Nov 05, 2010 at 07:02:02PM +0800, Sheng Yang wrote:
  On Friday 05 November 2010 17:05:32 Michael S. Tsirkin wrote:
   On Fri, Nov 05, 2010 at 11:20:37AM +0800, Sheng Yang wrote:
On Thursday 04 November 2010 17:44:27 Michael S. Tsirkin wrote:

Thanks very much for reviewing this! Seems nobody cares about
userspace one before...

 On Thu, Nov 04, 2010 at 02:18:21PM +0800, Sheng Yang wrote:
  This patch emulated MSI-X per vector mask bit on assigned device.
  
  Signed-off-by: Sheng Yang sh...@linux.intel.com
  ---
  
   hw/device-assignment.c |  161
   ++-- 1 
files

changed, 155

   insertions(+), 6 deletions(-)
  
  diff --git a/hw/device-assignment.c b/hw/device-assignment.c
  index 8a98876..639aa0b 100644
  --- a/hw/device-assignment.c
  +++ b/hw/device-assignment.c
  @@ -62,6 +62,11 @@ static void
  assigned_dev_load_option_rom(AssignedDevice *dev);
  
   static void assigned_dev_unregister_msix_mmio(AssignedDevice
   *dev);
  
  +static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus,
  uint8_t devfn) +{
  +return (uint32_t)seg  16 | (uint32_t)bus  8 |
  (uint32_t)devfn; +}
  +
  
   static uint32_t assigned_dev_ioport_rw(AssignedDevRegion
   *dev_region,
   
  uint32_t addr, int len,
  uint32_t *val)
   
   {
  
  @@ -264,6 +269,9 @@ static void assigned_dev_iomem_map(PCIDevice
  *pci_dev, int region_num,
  
   AssignedDevRegion *region = r_dev-v_addrs[region_num];
   PCIRegion *real_region =
   r_dev-real_device.regions[region_num]; int ret = 0;
  
  +#ifdef KVM_CAP_DEVICE_MSIX_MASK
  +struct kvm_assigned_msix_mmio msix_mmio;
  +#endif
  
   DEBUG(e_phys=%08 FMT_PCIBUS  r_virt=%p type=%d len=%08
   FMT_PCIBUS  region_num=%d \n,
   
 e_phys, region-u.r_virtbase, type, e_size,
 region_num);
  
  @@ -282,6 +290,16 @@ static void assigned_dev_iomem_map(PCIDevice
  *pci_dev, int region_num,
  
   cpu_register_physical_memory(e_phys + offset,
   
   TARGET_PAGE_SIZE, r_dev-mmio_index);
  
  +#ifdef KVM_CAP_DEVICE_MSIX_MASK
  +   memset(msix_mmio, 0, sizeof(struct
  kvm_assigned_msix_mmio)); + msix_mmio.assigned_dev_id =
  calc_assigned_dev_id(r_dev-h_segnr, +  
  r_dev-
h_busnr,
  r_dev-h_devfn);
  +   msix_mmio.base_addr = e_phys + offset;
  +/* We required kernel MSI-X support */
  +   ret = kvm_assign_reg_msix_mmio(kvm_context, msix_mmio);
  +   if (ret)
  +fprintf(stderr, fail to register in-kernel
  msix_mmio!\n); +#endif
  
   }
   
   }
  
  @@ -824,11 +842,6 @@ static void
  free_assigned_device(AssignedDevice *dev)
  
   }
   
   }
  
  -static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus,
  uint8_t devfn) -{
  -return (uint32_t)seg  16 | (uint32_t)bus  8 |
  (uint32_t)devfn; -}
  -
  
   static void assign_failed_examine(AssignedDevice *dev)
   {
   
   char name[PATH_MAX], dir[PATH_MAX], driver[PATH_MAX] = {},
   *ns;
  
  @@ -1123,6 +1136,27 @@ static int
  get_msix_entries_max_nr(AssignedDevice *adev)
  
   return entries_max_nr;
   
   }
  
  +#ifdef KVM_CAP_DEVICE_MSIX_MASK
  +static int assigned_dev_msix_entry_masked(AssignedDevice *adev,
  int entry) +{
  +struct kvm_assigned_msix_entry msix_entry;
  +int r;
  +
  +memset(msix_entry, 0, sizeof msix_entry);
  +msix_entry.assigned_dev_id =
  calc_assigned_dev_id(adev-h_segnr, +adev-h_busnr,
  (uint8_t)adev-h_devfn);
  +msix_entry.entry = entry;
  +msix_entry.flags = KVM_MSIX_FLAG_QUERY_MASK;
  +r = kvm_assign_get_msix_entry(kvm_context, msix_entry);
  +if (r) {
  +fprintf(stderr, assigned_dev_msix_entry_masked: 
  +   Fail to get mask bit of entry %d\n, entry);
  +return 1;
 
 This error handling seems pretty useless. assert?

Maybe we can continue with it. Assert seems a little strict.
   
   Well need to consider whether to return 1 or 0 then.
  
  Tell it it's masked then routing change may be continue.
 
 OK, then
 1. we probably want to print this only once, not on each write
 2. only try doing this if capability is present.
 
  +}
  +return (msix_entry.flags  KVM_MSIX_FLAG_MASK);
  +}
  +#endif
  +
  
   

Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support

2010-11-07 Thread Sheng Yang
On Saturday 06 November 2010 08:24:51 Marcelo Tosatti wrote:
 On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
  On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
   On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
  +};
  +
  +This ioctl would enable in-kernel MSI-X emulation, which would
  handle MSI-X +mask bit in the kernel.
 
 What happens on repeated calls when it's already enabled?
 How does one disable after it's enabled?

Suppose this should only be called once. But again would update the
MMIO base.
   
   So maybe add this all to documentation.
   
It
would be disabled along with device deassignment.
   
   So what are you going to do when guest enables and later disables MSIX?
   Disable assignment completely?
  
  This device goes with PCI resources allocation, rather than IRQ
  assignment.
 
 What about hot plug, for example? I can't see how this can function
 without unregistering regions.

The register time is when PCI resources allocation, and the unregister time is 
when device was deassigned.

I suppose hot plug path would also notify kernel to deassign the device?

--
regards
Yang, Sheng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] qemu-kvm: device assignment: emulate MSI-X mask bits

2010-11-07 Thread Michael S. Tsirkin
On Mon, Nov 08, 2010 at 01:17:47PM +0800, Sheng Yang wrote:
   @@ -1250,10 +1297,16 @@ static void
   assigned_dev_update_msix(PCIDevice *pci_dev, int enable_msix)
   
fprintf(stderr, assigned_dev_update_msix: MSI-X
entries_max_nr == 0); return;

}
   
   +/*
   + * Guest may try to enable MSI-X before setting MSI-X entry
   done, so + * let's wait until guest unmask the entries.
   + */
  
  Well it can also set up any number of entries, enable msix then
  set up more entries. Now what?
 
 It's the same. If it want to set up more entries, it have to unmask
 them. Then we would get them.

Why can't we handle the first enable in the same way?
   
   Don't understand the question, but I guess the answer is
   pci_enable_msix().
  
  pci_enable_msix is an internal kernel API to enable msix, isn't it?
  We are discussing qemu patch here.
 
 OK, I understand the question now. The others entries would be enabled with 
 unmask, but not very first ones. Guest can unmask the entries first, then 
 enable 
 MSI-X. We should scan the table when MSI-X enabled, that's why the first one 
 is 
 different from others.

I see. I think the comment mislead me: you really mean
'Guest can unmask MSI-X entries when MSI-X is disabled,
 don't do anything until MSI-X is enabled'.



entries_max_nr); if (entries_nr == 0) {
   
   +#ifndef KVM_CAP_DEVICE_MSIX_MASK
   
if (enable_msix)

fprintf(stderr, MSI-X entry number is zero!\n);
  
  And what happens then?
 
 MSI-X can't work for old ones without MSIX mask support.

Old ones?
   
   Old userspace without KVM_CAP_DEVICE_MSIX_MASK
   
 Reload the guest module
 may help.

Guest might not have any concept of modules.
   
   Just an workaround. Not a suggestion method.
  
  Look, if there's some failure mode we need a way to recover,
  or to make it very unlikely. If this is guest doing something
  illegal let's add a comment explaining what it is and why
  it's illegal. If this is legal, why the print out?
 
 It's unsupported in the old QEmu.


I think I have it. The comment you want is:

/* Error: number of MSI-X entries is zero. Using this device
   requires KVM_CAP_DEVICE_MSIX_MASK support in kvm
   which is missing.
 */

As I said prebiously, we must check runtime capabilities
for this, ifdef is not enough to know kernel supports a feature
because there's no dependency between kernel and qemu packages.

 I can change the comment, but I do think it's 
 good to let user know something important is wrong. 

stderr is really there for developers.
You can't expect the user to understand this message.

 Maybe it's better to put it into DEBUG.
 --
 regards
 Yang, Sheng
 
  
   --
   regards
   Yang, Sheng
   
 --
 regards
 Yang, Sheng
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Disk I/O stuck with KVM - no clue how to solve that

2010-11-07 Thread Stefan Hajnoczi
On Sun, Nov 7, 2010 at 4:07 PM, Hermann Himmelbauer du...@qwer.tk wrote:
 Am Samstag 06 November 2010 20:58:12 schrieb Stefan Hajnoczi:
 On Fri, Nov 5, 2010 at 5:16 PM, Hermann Himmelbauer du...@qwer.tk wrote:
  I experience strange disk I/O stucks on my Linux Host + Guest with KVM,
  which make the system (especially the guests) almost unusable. These
  stucks come periodically, e.g. every 2 to 10 seconds and last between 3
  and sometimes over 120 seconds, which trigger kernel messages like this
  (on host and/or guest):
 
  INFO: task postgres:2195 blocked for more than 120 seconds

 The fact that this happens on the host too suggests there's an issue
 with the host software/hardware and the VM is triggering it but not
 the root cause.

 Does dmesg display any other suspicious messages?

 No, there's anything that can be seen via dmesg. I at first suspected the
 hardware, too. I can think of the following reasons:

 1) Broken SATA cable / Harddisks - I changed some cables, no change, thus this
 is probably ruled out. I also can't see anything via S.M.A.R.T. Moreover, the
 problem is not bound to a specific device, instead it happens on sda - sdd,
 so I doubt it's harddisk related.

 2) Broken Power Supply / Insufficient Power - I'd expect either a complete
 crash or some error messages in this case, so I'd rather rule that out.

 3) Broken SATA-Controller - I cannot think of any way to check that, but I'd
 also expect some crashes or kernel messages. I flashed the board to the
 latest BIOS version, no change either.

 However, it seems no one except me seems to have this problem, so I'll buy a
 new, similar but different mainboard (Intel instead of Asus), hopefully this
 solves the problem.

 What do you think, any better idea?

If you have the time, you can use perf probes to trace I/O requests in
the host kernel.  Perhaps completion interrupts are being dropped.
You may wish to start by tracing requests issued and completed by the
SATA driver.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] KVM: assigned dev: MSI-X mask support

2010-11-07 Thread Michael S. Tsirkin
On Mon, Nov 08, 2010 at 11:18:34AM +0800, Sheng Yang wrote:
 On Friday 05 November 2010 21:27:42 Michael S. Tsirkin wrote:
  On Fri, Nov 05, 2010 at 06:53:50PM +0800, Sheng Yang wrote:
   On Friday 05 November 2010 16:43:33 Michael S. Tsirkin wrote:
On Fri, Nov 05, 2010 at 10:44:19AM +0800, Sheng Yang wrote:
   +};
   +
   +This ioctl would enable in-kernel MSI-X emulation, which would
   handle MSI-X +mask bit in the kernel.
  
  What happens on repeated calls when it's already enabled?
  How does one disable after it's enabled?
 
 Suppose this should only be called once. But again would update the
 MMIO base.

So maybe add this all to documentation.

 It
 would be disabled along with device deassignment.

So what are you going to do when guest enables and later disables MSIX?
Disable assignment completely?
   
   This device goes with PCI resources allocation, rather than IRQ
   assignment.
  
  I see. I guess we can live with it, but it seems tied to a specific
  mode of use. Would be better to support enabling together with msix too,
  this requires an ability to disable.
  
 We enable it explicitly because
 not all devices have MSI-X capability.
 
   +
  
  This is very specialized for assigned devices.  I would like an
  interface not tied to assigned devices explicitly
  (even if the implementation at the moment only works
  for assigned devices, I can patch that in later). E.g. vhost-net
  would benefit from such an extension too.  Maybe tie this to a
  special GSI and this GSI can be an assigned device or not?
 
 You're talking about KVM_ASSIGN_GET_MSIX_ENTRY?

Yes but also, and more importantly, about KVM_ASSIGN_REG_MSIX_MMIO
and kvm_assigned_msix_mmio really.

 We can't tie it to GSI, because we
 should also cover entries without GSI. The entry number should be
 fine for all kinds of devices using MSI-X.

I don't completely understand: entries without GSI
never need to inject an interrupt. Why do we care
about them? Let's pass such accesses to userspace.

In any case, I think MSIX GSIs are inexpensive (we never search them
all), so we could create GSIs for all entries if we wanted to.
   
   All mask bit accessing controlled by kernel, in order to keep consistent.
  
  Well I think passing whatever we can't handle to userspace makes
  complete sense. We do this for address/data writes after all.
  If we wanted to do it *all* in kernel we would handle
  address/data there too. I think Avi suggested this at some point.
 
 I think Avi's point was we handle mask bit half in the kernel and half in the 
 userspace, which makes API complex. I agree with this so I moved all mask bit 
 handling to kernel.
 
 Data/address is another issue, due to QEmu own the routing table. We can 
 change it 
 in the future, but this is not related to this patch.
  
  The main point is msix mmio BAR handling is completely unrelated
  to the assigned device. Emulated devices have an msix BAR as well.
  So tying it to an assigned devices is suboptimal, we'll need to
  grow yet another interface for these.
 
 If you only means the name, we can change that.

I mean it shouldn't use the assigned device id. Use some other index:
I suggested adding a new GSI type for this, but can be something else
if you prefer.

 But I think all MSI-X entries 
 would have entry number, regardless of if it got GSI number. Keep entry 
 number as 
 parameter makes more sense to me.
 
 And still, the patch's only current user is assigned device. I am glad if it 
 can 
 cover the other future case along with it, but that's not the primary target 
 of 
 this patch.

It's a problem I think. Stop thinking about the patch for a moment
and think about kvm as a whole. We know we will want such an interface for
emulated and PV devices. What then? Add another interface and a bunch of
code to support it?  Point being, since userspace interfaces need to be
supported forever, they should be generic if possible.


  
 I am not sure about what PV one would looks like.
 I use kvm_assigned_msix_entry
 just because it can be easily reused.

Not only PV, emulated devices too. OK let's try to figure out:

What happens with userspace is, we inject interrupts either through
ioctls or eventfds. Avoiding an exit to userspace on MSIX
access is beneficial in exactly the same way.

We could have an ioctl to pass in the table addresses, and mapping
table to relevant GSIs. For example, add kvm_msix_mmio with table
start/end, also specify which GSIs are covered.
Maybe ask that userspace allocate all GSIs for a table consequitively?
Or add kvm_irq_routing_msix which is same as msi but
also has the mmio addresses? Bonus points for avoiding the need
to scan all table on each write. For example, don't scan at all,
instead just 

Re: [Qemu-devel] [PATCH] virtio-9p: fix build on !CONFIG_UTIMENSAT v2

2010-11-07 Thread M. Mohan Kumar
 This patch introduce a fallback mechanism for old systems that do not
 support utimensat.  This will fix build failure with following warnings:
 
 hw/virtio-9p-local.c: In function 'local_utimensat':
 hw/virtio-9p-local.c:479: warning: implicit declaration of function
 'utimensat' hw/virtio-9p-local.c:479: warning: nested extern declaration
 of 'utimensat'
 
 and
 
 hw/virtio-9p.c: In function 'v9fs_setattr_post_chmod':
 hw/virtio-9p.c:1410: error: 'UTIME_NOW' undeclared (first use in this
 function) hw/virtio-9p.c:1410: error: (Each undeclared identifier is
 reported only once hw/virtio-9p.c:1410: error: for each function it
 appears in.)
 hw/virtio-9p.c:1413: error: 'UTIME_OMIT' undeclared (first use in this
 function) hw/virtio-9p.c: In function 'v9fs_wstat_post_chmod':
 hw/virtio-9p.c:2905: error: 'UTIME_OMIT' undeclared (first use in this
 function)
 
 Signed-off-by: Hidetoshi Seto seto.hideto...@jp.fujitsu.com
 ---
  hw/virtio-9p-local.c |   32 ++--
  hw/virtio-9p.h   |   10 ++
  2 files changed, 40 insertions(+), 2 deletions(-)
 
 diff --git a/hw/virtio-9p-local.c b/hw/virtio-9p-local.c
 index 0d52020..7811d2c 100644
 --- a/hw/virtio-9p-local.c
 +++ b/hw/virtio-9p-local.c
 @@ -479,10 +479,38 @@ static int local_chown(FsContext *fs_ctx, const char
 *path, FsCred *credp) return -1;
  }
 
 +/* TODO: relocate this to proper file, and make it more generic */
 +static int qemu_utimensat(int dirfd, const char *path,
 +  const struct timespec *times, int flags)
 +{

IMHO, this code can be moved to cutils.c

 +#ifdef CONFIG_UTIMENSAT
 +return utimensat(dirfd, path, times, flags);
 +#else
 +/*
 + * Fallback: use utimes() instead of utimensat().
 + * See commit 74bc02b2d2272dc88fb98d43e631eb154717f517 for known
 problem. + */
 +struct timeval tv[2];
 +int i;
 +
 +for (i = 0; i  2; i++) {
 +if (times[i].tv_nsec == UTIME_OMIT || times[i].tv_nsec ==
 UTIME_NOW) { +tv[i].tv_sec = 0;
 +tv[i].tv_usec = 0;
 +} else {
 +tv[i].tv_sec = times[i].tv_sec;
 +tv[i].tv_usec = times[i].tv_nsec / 1000;
 +}
 +}
 +
 +return utimes(path, tv[0]);
 +#endif

The idea of introducing utimensat was to avoid resetting atime to 1970-01-01 
05:30:00 (utime does not give option to not change atime). But as per utimes 
man page, if any of the time field is 0, it would be set to current time. As 
per stat man page, truncate will not update atime, only mtime will be updated.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re:[PATCH v14 06/17] Use callback to deal with skb_release_data() specially.

2010-11-07 Thread xiaohui . xin
From: Xin Xiaohui xiaohui@intel.com

 Hmm, I suggest you read the comment two lines above.

 If destructor_arg is now cleared each time we allocate a new skb, then,
 please move it before dataref in shinfo structure, so that the following
 memset() does the job efficiently...


Something like :

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e6ba898..2dca504 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -195,6 +195,9 @@ struct skb_shared_info {
   __be32  ip6_frag_id;
   __u8tx_flags;
   struct sk_buff  *frag_list;
+  /* Intermediate layers must ensure that destructor_arg
+   * remains valid until skb destructor */
+  void*destructor_arg;
   struct skb_shared_hwtstamps hwtstamps;

   /*
@@ -202,9 +205,6 @@ struct skb_shared_info {
*/
   atomic_tdataref;

-  /* Intermediate layers must ensure that destructor_arg
-   * remains valid until skb destructor */
-  void *  destructor_arg;
   /* must be last field, see pskb_expand_head() */
   skb_frag_t  frags[MAX_SKB_FRAGS];
 };



Will that affect the cache line?
Or, we can move the line to clear destructor_arg to the end of __alloc_skb().
It looks like as the following, which one do you prefer?

Thanks
Xiaohui

---
 net/core/skbuff.c |8 
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c83b421..df852f2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -224,6 +224,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
gfp_mask,
 
child-fclone = SKB_FCLONE_UNAVAILABLE;
}
+   shinfo-destructor_arg = NULL;
 out:
return skb;
 nodata:
@@ -343,6 +344,13 @@ static void skb_release_data(struct sk_buff *skb)
if (skb_has_frags(skb))
skb_drop_fraglist(skb);
 
+   if (skb-dev  dev_is_mpassthru(skb-dev)) {
+   struct skb_ext_page *ext_page =
+   skb_shinfo(skb)-destructor_arg;
+   if (ext_page  ext_page-dtor)
+   ext_page-dtor(ext_page);
+   }
+
kfree(skb-head);
}
 }
-- 
1.7.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v13 10/16] Add a hook to intercept external buffers from NIC driver.

2010-11-07 Thread Xin, Xiaohui
I have addressed this issue in v14 patch set.

Thanks
Xiaohui

-Original Message-
From: David Miller [mailto:da...@davemloft.net]
Sent: Saturday, October 30, 2010 4:29 AM
To: Xin, Xiaohui
Cc: net...@vger.kernel.org; kvm@vger.kernel.org; linux-ker...@vger.kernel.org;
m...@redhat.com; mi...@elte.hu; herb...@gondor.apana.org.au; 
jd...@linux.intel.com
Subject: Re: [PATCH v13 10/16] Add a hook to intercept external buffers from 
NIC driver.

From: Xin, Xiaohui xiaohui@intel.com
Date: Wed, 27 Oct 2010 09:33:12 +0800

 Somehow, it seems not a trivial work to support it now. Can we support it
 later and as a todo with our current work?

I would prefer the feature work properly, rather than only in specific
cases, before being integated.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html