Re: [PATCHv2 0/8 RFC] boot order specification

2010-11-01 Thread Gleb Natapov
On Sun, Oct 31, 2010 at 06:25:53PM -0400, Kevin O'Connor wrote:
 On Sun, Oct 31, 2010 at 01:40:01PM +0200, Gleb Natapov wrote:
  This is current sate of the patch series for people to comment on.
  I tried to use open firmware naming scheme to specify device path names.
  
  The patch series produce names like these:
  for pci machine:
  /p...@i0cf8/pci-isa-bri...@1/f...@03f1/flo...@0
  /p...@i0cf8/pci-isa-bri...@1/f...@03f1/flo...@1
  /p...@i0cf8/a...@1,1/ata-d...@1:0
  /p...@i0cf8/a...@1,1/ata-d...@1:1
  /p...@i0cf8/virtio-...@3/virtio-d...@0
  /p...@i0cf8/ether...@4/ethernet-...@0
  /p...@i0cf8/ether...@5/ethernet-...@0
  
  for isa machine:
  adding '/isa/f...@03f1/flo...@0' at index 2
  adding '/isa/f...@03f1/flo...@1' at index 1
  adding '/isa/a...@0170/ata-d...@0:0' at index 0
  adding '/isa/a...@0170/ata-d...@0:1' at index 3
 
 Hi Gleb,
 
 How will USB drives be identified?
 
USB bus has Open Firmware binding. I haven't look at the spec yet, but it
should be easy.

 I'm not sure how SeaBIOS will be able to line up something like
 /p...@i0cf8/ether...@4/ethernet-...@0 to an optionrom BEV.  Also, if
 there is an optionrom with BCVs (eg, a scsi card), I'm not sure how
 that would that would be identified.

The way to parse  /p...@i0cf8/ether...@4/ethernet-...@0 is this: each
element (between /.../) consist of node-n...@unit-address. node-name
describes device/bus. unit-address is a device address on preceding node.
So p...@i0cf8 tells us that this is pci bus accessible through io
register 0x0cf8, ether...@4 tells us that this is ethernet device in pci
slot 4 function 0, (a...@1,1 means ata device in slot 1 function 1).
ethernet-...@0 means first phy on this ethernet device (usually there is
only one anyway). So if the pci card in slot 4 device 0 has optionrom
with BCV Seabios can associate bootindex with it easily given the 
device path above.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device

2010-11-01 Thread Michael S. Tsirkin
On Tue, Oct 26, 2010 at 03:19:38PM -0700, Sridhar Samudrala wrote:
 With the current default macvtap mode, a KVM guest using virtio with 
 macvtap backend has the following limitations.
 - cannot change/add a mac address on the guest virtio-net
 - cannot create a vlan device on the guest virtio-net
 - cannot enable promiscuous mode on guest virtio-net
 
 This patch introduces a new mode called 'passthru' when creating a 
 macvlan device which allows takeover of the underlying device and 
 passing it to a guest using virtio with macvtap backend.
 
 Only one macvlan device is allowed in passthru mode and it inherits
 the mac address from the underlying device and sets it in promiscuous 
 mode to receive and forward all the packets.
 
 Thanks
 Sridhar

One concern with promisc mode is that for the common case,
which is a single mac and no vlans, we will be getting
extra packets that will get dropped in userspace/guest
as compared to the case where same mac is programmed
in hardware and by guest.

We could let userspace supply a list of mac/vlan addresses through
an ioctl on macvtap, and then
1. for a single mac, program it in hardware
2. for other configurations, set promisc mode

tun already has TUNSETTXFILTER which might come in handy here.
We don't pass in vlans with the filter now but maybe we should.
How does this sound?

 diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
 index 0ef0eb0..bca3cb7 100644
 --- a/drivers/net/macvlan.c
 +++ b/drivers/net/macvlan.c
 @@ -38,6 +38,7 @@ struct macvlan_port {
   struct hlist_head   vlan_hash[MACVLAN_HASH_SIZE];
   struct list_headvlans;
   struct rcu_head rcu;
 + boolpassthru;
  };
  
  #define macvlan_port_get_rcu(dev) \
 @@ -169,6 +170,7 @@ static struct sk_buff *macvlan_handle_frame(struct 
 sk_buff *skb)
   macvlan_broadcast(skb, port, NULL,
 MACVLAN_MODE_PRIVATE |
 MACVLAN_MODE_VEPA|
 +   MACVLAN_MODE_PASSTHRU|
 MACVLAN_MODE_BRIDGE);
   else if (src-mode == MACVLAN_MODE_VEPA)
   /* flood to everyone except source */
 @@ -185,7 +187,10 @@ static struct sk_buff *macvlan_handle_frame(struct 
 sk_buff *skb)
   return skb;
   }
  
 - vlan = macvlan_hash_lookup(port, eth-h_dest);
 + if (port-passthru)
 + vlan = list_first_entry(port-vlans, struct macvlan_dev, list);
 + else
 + vlan = macvlan_hash_lookup(port, eth-h_dest);
   if (vlan == NULL)
   return skb;
  
 @@ -284,6 +289,11 @@ static int macvlan_open(struct net_device *dev)
   struct net_device *lowerdev = vlan-lowerdev;
   int err;
  
 + if (vlan-port-passthru) {
 + dev_set_promiscuity(lowerdev, 1);
 + goto hash_add;
 + }
 +
   err = -EBUSY;
   if (macvlan_addr_busy(vlan-port, dev-dev_addr))
   goto out;
 @@ -296,6 +306,8 @@ static int macvlan_open(struct net_device *dev)
   if (err  0)
   goto del_unicast;
   }
 +
 +hash_add:
   macvlan_hash_add(vlan);
   return 0;
  
 @@ -310,12 +322,18 @@ static int macvlan_stop(struct net_device *dev)
   struct macvlan_dev *vlan = netdev_priv(dev);
   struct net_device *lowerdev = vlan-lowerdev;
  
 + if (vlan-port-passthru) {
 + dev_set_promiscuity(lowerdev, -1);
 + goto hash_del;
 + }
 +
   dev_mc_unsync(lowerdev, dev);
   if (dev-flags  IFF_ALLMULTI)
   dev_set_allmulti(lowerdev, -1);
  
   dev_uc_del(lowerdev, dev-dev_addr);
  
 +hash_del:
   macvlan_hash_del(vlan);
   return 0;
  }
 @@ -549,6 +567,7 @@ static int macvlan_port_create(struct net_device *dev)
   if (port == NULL)
   return -ENOMEM;
  
 + port-passthru = false;
   port-dev = dev;
   INIT_LIST_HEAD(port-vlans);
   for (i = 0; i  MACVLAN_HASH_SIZE; i++)
 @@ -593,6 +612,7 @@ static int macvlan_validate(struct nlattr *tb[], struct 
 nlattr *data[])
   case MACVLAN_MODE_PRIVATE:
   case MACVLAN_MODE_VEPA:
   case MACVLAN_MODE_BRIDGE:
 + case MACVLAN_MODE_PASSTHRU:
   break;
   default:
   return -EINVAL;
 @@ -661,6 +681,10 @@ int macvlan_common_newlink(struct net *src_net, struct 
 net_device *dev,
   }
   port = macvlan_port_get(lowerdev);
  
 + /* Only 1 macvlan device can be created in passthru mode */
 + if (port-passthru)
 + return -EINVAL;
 +
   vlan-lowerdev = lowerdev;
   vlan-dev  = dev;
   vlan-port = port;
 @@ -671,6 +695,13 @@ int macvlan_common_newlink(struct net *src_net, struct 
 net_device *dev,
   if (data  data[IFLA_MACVLAN_MODE])
   vlan-mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
  
 

[PATCH v2 1/7] KVM: fix tracing kvm_try_async_get_page

2010-11-01 Thread Xiao Guangrong
Tracing 'async' and *pfn is useless, since 'async' is always true,
and '*pfn' is always fault_pfn'

We can trace 'gva' and 'gfn' instead, it can help us to see the
life-cycle of an async_pf

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c |2 +-
 include/trace/events/kvm.h |   12 +++-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 39cc0c6..5275c50 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2627,7 +2627,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool 
no_apf, gfn_t gfn,
put_page(pfn_to_page(*pfn));
 
if (!no_apf  can_do_async_pf(vcpu)) {
-   trace_kvm_try_async_get_page(async, *pfn);
+   trace_kvm_try_async_get_page(gva, gfn);
if (kvm_find_async_pf_gfn(vcpu, gfn)) {
trace_kvm_async_pf_doublefault(gva, gfn);
kvm_make_request(KVM_REQ_APF_HALT, vcpu);
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 9c2cc6a..30063c6 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -188,18 +188,20 @@ TRACE_EVENT(kvm_age_page,
 #ifdef CONFIG_KVM_ASYNC_PF
 TRACE_EVENT(
kvm_try_async_get_page,
-   TP_PROTO(bool async, u64 pfn),
-   TP_ARGS(async, pfn),
+   TP_PROTO(u64 gva, u64 gfn),
+   TP_ARGS(gva, gfn),
 
TP_STRUCT__entry(
-   __field(__u64, pfn)
+   __field(u64, gva)
+   __field(u64, gfn)
),
 
TP_fast_assign(
-   __entry-pfn = (!async) ? pfn : (u64)-1;
+   __entry-gva = gva;
+   __entry-gfn = gfn;
),
 
-   TP_printk(pfn %#llx, __entry-pfn)
+   TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn)
 );
 
 TRACE_EVENT(
-- 
1.7.0.4
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/7] KVM: cleanup aysnc_pf tracepoints

2010-11-01 Thread Xiao Guangrong
Use 'DECLARE_EVENT_CLASS' to cleanup async_pf tracepoints

Acked-by: Gleb Natapov g...@redhat.com
Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 include/trace/events/kvm.h |   76 ---
 1 files changed, 35 insertions(+), 41 deletions(-)

diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 30063c6..7bec396 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -186,59 +186,71 @@ TRACE_EVENT(kvm_age_page,
 );
 
 #ifdef CONFIG_KVM_ASYNC_PF
-TRACE_EVENT(
-   kvm_try_async_get_page,
+DECLARE_EVENT_CLASS(kvm_async_get_page_class,
+
TP_PROTO(u64 gva, u64 gfn),
+
TP_ARGS(gva, gfn),
 
TP_STRUCT__entry(
-   __field(u64, gva)
+   __field(__u64, gva)
__field(u64, gfn)
-   ),
+   ),
 
TP_fast_assign(
__entry-gva = gva;
__entry-gfn = gfn;
-   ),
+   ),
 
TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn)
 );
 
-TRACE_EVENT(
-   kvm_async_pf_not_present,
+DEFINE_EVENT(kvm_async_get_page_class, kvm_try_async_get_page,
+
+   TP_PROTO(u64 gva, u64 gfn),
+
+   TP_ARGS(gva, gfn)
+);
+
+DEFINE_EVENT(kvm_async_get_page_class, kvm_async_pf_doublefault,
+
+   TP_PROTO(u64 gva, u64 gfn),
+
+   TP_ARGS(gva, gfn)
+);
+
+DECLARE_EVENT_CLASS(kvm_async_pf_nopresent_ready,
+
TP_PROTO(u64 token, u64 gva),
+
TP_ARGS(token, gva),
 
TP_STRUCT__entry(
__field(__u64, token)
__field(__u64, gva)
-   ),
+   ),
 
TP_fast_assign(
__entry-token = token;
__entry-gva = gva;
-   ),
+   ),
+
+   TP_printk(token %#llx gva %#llx, __entry-token, __entry-gva)
 
-   TP_printk(token %#llx gva %#llx not present, __entry-token,
- __entry-gva)
 );
 
-TRACE_EVENT(
-   kvm_async_pf_ready,
+DEFINE_EVENT(kvm_async_pf_nopresent_ready, kvm_async_pf_not_present,
+
TP_PROTO(u64 token, u64 gva),
-   TP_ARGS(token, gva),
 
-   TP_STRUCT__entry(
-   __field(__u64, token)
-   __field(__u64, gva)
-   ),
+   TP_ARGS(token, gva)
+);
 
-   TP_fast_assign(
-   __entry-token = token;
-   __entry-gva = gva;
-   ),
+DEFINE_EVENT(kvm_async_pf_nopresent_ready, kvm_async_pf_ready,
+
+   TP_PROTO(u64 token, u64 gva),
 
-   TP_printk(token %#llx gva %#llx ready, __entry-token, __entry-gva)
+   TP_ARGS(token, gva)
 );
 
 TRACE_EVENT(
@@ -262,24 +274,6 @@ TRACE_EVENT(
  __entry-address, __entry-pfn)
 );
 
-TRACE_EVENT(
-   kvm_async_pf_doublefault,
-   TP_PROTO(u64 gva, u64 gfn),
-   TP_ARGS(gva, gfn),
-
-   TP_STRUCT__entry(
-   __field(u64, gva)
-   __field(u64, gfn)
-   ),
-
-   TP_fast_assign(
-   __entry-gva = gva;
-   __entry-gfn = gfn;
-   ),
-
-   TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn)
-);
-
 #endif
 
 #endif /* _TRACE_KVM_MAIN_H */
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/7] KVM: fix searching async gfn in kvm_async_pf_gfn_slot

2010-11-01 Thread Xiao Guangrong
Don't search later slots if the slot is empty

Acked-by: Gleb Natapov g...@redhat.com
Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2cfdf2d..9b543f4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6206,8 +6206,8 @@ static u32 kvm_async_pf_gfn_slot(struct kvm_vcpu *vcpu, 
gfn_t gfn)
u32 key = kvm_async_pf_hash_fn(gfn);
 
for (i = 0; i  roundup_pow_of_two(ASYNC_PF_PER_VCPU) 
-(vcpu-arch.apf.gfns[key] != gfn ||
- vcpu-arch.apf.gfns[key] == ~0); i++)
+(vcpu-arch.apf.gfns[key] != gfn 
+ vcpu-arch.apf.gfns[key] != ~0); i++)
key = kvm_async_pf_next_probe(key);
 
return key;
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/7] KVM: avoid unnecessary wait for a async pf

2010-11-01 Thread Xiao Guangrong
In current code, it checks async pf completion out of the wait context,
like this:

if (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
!vcpu-arch.apf.halted)
r = vcpu_enter_guest(vcpu);
else {
..
kvm_vcpu_block(vcpu)
 ^- waiting until 'async_pf.done' is not empty
}

kvm_check_async_pf_completion(vcpu)
 ^- delete list from async_pf.done

So, if we check aysnc pf completion first, it can be blocked at
kvm_vcpu_block

Fixed by mark the vcpu is unhalted in kvm_check_async_pf_completion()
path

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9b543f4..4da8485 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6280,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
vcpu-arch.fault.address = work-arch.token;
kvm_inject_page_fault(vcpu);
}
+   vcpu-arch.apf.halted = false;
 }
 
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 5/7] KVM: handle more completed apfs if possible

2010-11-01 Thread Xiao Guangrong
If it's no need to inject async #PF to PV guest we can handle
more completed apfs at one time, so we can retry guest #PF
as early as possible

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |3 ++-
 arch/x86/kvm/x86.c  |8 ++--
 virt/kvm/async_pf.c |   28 
 3 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1be0058..c95b3ff 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -818,7 +818,8 @@ bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long 
linear_rip);
 
 void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
 struct kvm_async_pf *work);
-void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+/* return true if we can handle more completed apfs, false otherwise */
+bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 struct kvm_async_pf *work);
 void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
   struct kvm_async_pf *work);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4da8485..189664a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6265,7 +6265,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu 
*vcpu,
}
 }
 
-void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
+bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 struct kvm_async_pf *work)
 {
trace_kvm_async_pf_ready(work-arch.token, work-gva);
@@ -6274,13 +6274,17 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
else
kvm_del_async_pf_gfn(vcpu, work-arch.gfn);
 
+   vcpu-arch.apf.halted = false;
+
if ((vcpu-arch.apf.msr_val  KVM_ASYNC_PF_ENABLED) 
!apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
vcpu-arch.fault.error_code = 0;
vcpu-arch.fault.address = work-arch.token;
kvm_inject_page_fault(vcpu);
+   return false;
}
-   vcpu-arch.apf.halted = false;
+
+   return true;
 }
 
 bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 60df9e0..d57ec92 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -123,25 +123,29 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu 
*vcpu)
 void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
 {
struct kvm_async_pf *work;
+   bool ret;
 
if (list_empty_careful(vcpu-async_pf.done) ||
!kvm_arch_can_inject_async_page_present(vcpu))
return;
 
-   spin_lock(vcpu-async_pf.lock);
-   work = list_first_entry(vcpu-async_pf.done, typeof(*work), link);
-   list_del(work-link);
-   spin_unlock(vcpu-async_pf.lock);
+   do {
+   spin_lock(vcpu-async_pf.lock);
+   work = list_first_entry(vcpu-async_pf.done, typeof(*work),
+   link);
+   list_del(work-link);
+   spin_unlock(vcpu-async_pf.lock);
 
-   if (work-page)
-   kvm_arch_async_page_ready(vcpu, work);
-   kvm_arch_async_page_present(vcpu, work);
+   if (work-page)
+   kvm_arch_async_page_ready(vcpu, work);
+   ret = kvm_arch_async_page_present(vcpu, work);
 
-   list_del(work-queue);
-   vcpu-async_pf.queued--;
-   if (work-page)
-   put_page(work-page);
-   kmem_cache_free(async_pf_cache, work);
+   list_del(work-queue);
+   vcpu-async_pf.queued--;
+   if (work-page)
+   put_page(work-page);
+   kmem_cache_free(async_pf_cache, work);
+   } while (ret  !list_empty_careful(vcpu-async_pf.done));
 }
 
 int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 6/7] KVM: fix the race while wakeup all pv guest

2010-11-01 Thread Xiao Guangrong
In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu-async_pf.done
without holding vcpu-async_pf.lock, it will break if we are handling apfs
at this time.

Also use 'list_empty_careful()' instead of 'list_empty()'

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 virt/kvm/async_pf.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index d57ec92..6ef3373 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -200,7 +200,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
 {
struct kvm_async_pf *work;
 
-   if (!list_empty(vcpu-async_pf.done))
+   if (!list_empty_careful(vcpu-async_pf.done))
return 0;
 
work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
@@ -211,7 +211,10 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
get_page(bad_page);
INIT_LIST_HEAD(work-queue); /* for list_del to work */
 
+   spin_lock(vcpu-async_pf.lock);
list_add_tail(work-link, vcpu-async_pf.done);
+   spin_unlock(vcpu-async_pf.lock);
+
vcpu-async_pf.queued++;
return 0;
 }
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH v2 7/7] KVM: KVM: don't break vcpu 'halt' state due to apfs

2010-11-01 Thread Xiao Guangrong
Don't make a KVM_REQ_UNHALT request after async pf is completed since it
can break guest's 'HLT' instruction.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/x86.c   |   13 ++---
 include/linux/kvm_host.h |6 ++
 virt/kvm/kvm_main.c  |9 -
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 189664a..c57fb38 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6105,13 +6105,20 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
 
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
 {
-   return (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
+   if ((vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
!vcpu-arch.apf.halted)
-   || !list_empty_careful(vcpu-async_pf.done)
|| vcpu-arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
|| vcpu-arch.nmi_pending ||
(kvm_arch_interrupt_allowed(vcpu) 
-kvm_cpu_has_interrupt(vcpu));
+kvm_cpu_has_interrupt(vcpu)))
+   return 1;
+
+   if (!list_empty_careful(vcpu-async_pf.done)) {
+   vcpu-arch.apf.halted = false;
+   return 2;
+   }
+
+   return 0;
 }
 
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 462b982..8def043 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -436,6 +436,12 @@ void kvm_arch_hardware_disable(void *garbage);
 int kvm_arch_hardware_setup(void);
 void kvm_arch_hardware_unsetup(void);
 void kvm_arch_check_processor_compat(void *rtn);
+
+/*
+ * return value:  0 if the vcpu is runnable, 0 if not.
+ * Especially, if the return value == 1, we should make a
+ * 'KVM_REQ_UNHALT' request.
+ */
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 
 void kvm_free_physmem(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dbe1d6a..9ccf105 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1375,14 +1375,21 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
 void kvm_vcpu_block(struct kvm_vcpu *vcpu)
 {
DEFINE_WAIT(wait);
+   int ret;
 
for (;;) {
prepare_to_wait(vcpu-wq, wait, TASK_INTERRUPTIBLE);
 
-   if (kvm_arch_vcpu_runnable(vcpu)) {
+   ret = kvm_arch_vcpu_runnable(vcpu);
+
+   if (ret == 1) {
kvm_make_request(KVM_REQ_UNHALT, vcpu);
break;
}
+
+   if (ret  1)
+   break;
+
if (kvm_cpu_has_pending_timer(vcpu))
break;
if (signal_pending(current))
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/7] KVM: handle more completed apfs if possible

2010-11-01 Thread Gleb Natapov
On Mon, Nov 01, 2010 at 05:02:35PM +0800, Xiao Guangrong wrote:
 If it's no need to inject async #PF to PV guest we can handle
 more completed apfs at one time, so we can retry guest #PF
 as early as possible
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/include/asm/kvm_host.h |3 ++-
  arch/x86/kvm/x86.c  |8 ++--
  virt/kvm/async_pf.c |   28 
  3 files changed, 24 insertions(+), 15 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 1be0058..c95b3ff 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -818,7 +818,8 @@ bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned 
 long linear_rip);
  
  void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work);
 -void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 +/* return true if we can handle more completed apfs, false otherwise */
 +bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work);
  void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
  struct kvm_async_pf *work);
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 4da8485..189664a 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6265,7 +6265,7 @@ void kvm_arch_async_page_not_present(struct kvm_vcpu 
 *vcpu,
   }
  }
  
 -void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 +bool kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
struct kvm_async_pf *work)
  {
   trace_kvm_async_pf_ready(work-arch.token, work-gva);
 @@ -6274,13 +6274,17 @@ void kvm_arch_async_page_present(struct kvm_vcpu 
 *vcpu,
   else
   kvm_del_async_pf_gfn(vcpu, work-arch.gfn);
  
 + vcpu-arch.apf.halted = false;
 +
   if ((vcpu-arch.apf.msr_val  KVM_ASYNC_PF_ENABLED) 
   !apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
   vcpu-arch.fault.error_code = 0;
   vcpu-arch.fault.address = work-arch.token;
   kvm_inject_page_fault(vcpu);
 + return false;
   }
 - vcpu-arch.apf.halted = false;
 +
 + return true;
  }
  
  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
 diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
 index 60df9e0..d57ec92 100644
 --- a/virt/kvm/async_pf.c
 +++ b/virt/kvm/async_pf.c
 @@ -123,25 +123,29 @@ void kvm_clear_async_pf_completion_queue(struct 
 kvm_vcpu *vcpu)
  void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
  {
   struct kvm_async_pf *work;
 + bool ret;
  
   if (list_empty_careful(vcpu-async_pf.done) ||
   !kvm_arch_can_inject_async_page_present(vcpu))
   return;
  
 - spin_lock(vcpu-async_pf.lock);
 - work = list_first_entry(vcpu-async_pf.done, typeof(*work), link);
 - list_del(work-link);
 - spin_unlock(vcpu-async_pf.lock);
 + do {
 + spin_lock(vcpu-async_pf.lock);
 + work = list_first_entry(vcpu-async_pf.done, typeof(*work),
 + link);
 + list_del(work-link);
 + spin_unlock(vcpu-async_pf.lock);
  
 - if (work-page)
 - kvm_arch_async_page_ready(vcpu, work);
 - kvm_arch_async_page_present(vcpu, work);
 + if (work-page)
 + kvm_arch_async_page_ready(vcpu, work);
 + ret = kvm_arch_async_page_present(vcpu, work);
  
 - list_del(work-queue);
 - vcpu-async_pf.queued--;
 - if (work-page)
 - put_page(work-page);
 - kmem_cache_free(async_pf_cache, work);
 + list_del(work-queue);
 + vcpu-async_pf.queued--;
 + if (work-page)
 + put_page(work-page);
 + kmem_cache_free(async_pf_cache, work);
 + } while (ret  !list_empty_careful(vcpu-async_pf.done));
  }
  
No need to change kvm_arch_async_page_present() to return anything. You
can do while loop like this:

while (!list_empty_careful(vcpu-async_pf.done) 
   kvm_arch_can_inject_async_page_present(vcpu)) {
}

If kvm_arch_async_page_present() call injects exception
kvm_arch_can_inject_async_page_present() will return false on next
iteration.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/7] KVM: avoid unnecessary wait for a async pf

2010-11-01 Thread Gleb Natapov
On Mon, Nov 01, 2010 at 05:01:28PM +0800, Xiao Guangrong wrote:
 In current code, it checks async pf completion out of the wait context,
 like this:
 
 if (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
   !vcpu-arch.apf.halted)
   r = vcpu_enter_guest(vcpu);
   else {
   ..
   kvm_vcpu_block(vcpu)
^- waiting until 'async_pf.done' is not empty
 }
   
 kvm_check_async_pf_completion(vcpu)
  ^- delete list from async_pf.done
 
 So, if we check aysnc pf completion first, it can be blocked at
 kvm_vcpu_block
 
 Fixed by mark the vcpu is unhalted in kvm_check_async_pf_completion()
 path
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Acked-by: Gleb Natapov g...@redhat.com

 ---
  arch/x86/kvm/x86.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 9b543f4..4da8485 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6280,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
   vcpu-arch.fault.address = work-arch.token;
   kvm_inject_page_fault(vcpu);
   }
 + vcpu-arch.apf.halted = false;
  }
  
  bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
 -- 
 1.7.0.4

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/7] KVM: handle more completed apfs if possible

2010-11-01 Thread Xiao Guangrong
On 11/01/2010 05:24 PM, Gleb Natapov wrote:

 -put_page(work-page);
 -kmem_cache_free(async_pf_cache, work);
 +list_del(work-queue);
 +vcpu-async_pf.queued--;
 +if (work-page)
 +put_page(work-page);
 +kmem_cache_free(async_pf_cache, work);
 +} while (ret  !list_empty_careful(vcpu-async_pf.done));
  }
  
 No need to change kvm_arch_async_page_present() to return anything. You
 can do while loop like this:
 
 while (!list_empty_careful(vcpu-async_pf.done) 
kvm_arch_can_inject_async_page_present(vcpu)) {
 }
 
 If kvm_arch_async_page_present() call injects exception
 kvm_arch_can_inject_async_page_present() will return false on next
 iteration.

Yeah, it's a better way, i'll fix it, thanks Gleb!
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: TODO item: guest programmable mac/vlan filtering with macvtap

2010-11-01 Thread Dragos Tatulea
 1. add a secondary mac (or third, etc) address to the guest virtio-net
 interface.
Maybe I misunderstood this. Is it just setting another mac on the
guest virtio-net interface?


 4. the above stuff must be controllable by host admin
  - Well, for this there are a few options:
     admin switch that allows the guest user to add macs
     preconfig allowed MAC's in mactap (or qemu config) for the guest user
     allow/disallow command for user in qemu (although this doesn't
 seem to be supported)

Well, on a second thought, qemu capabilities should be just fine, right?

-- Dragos
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Device assignment, shared IRQs, uio_pci_generic

2010-11-01 Thread Jan Kiszka
Am 31.10.2010 16:05, Jan Kiszka wrote:
 Still not working here are ehci, ath9k and e1000e when passed through.
 They receive IRQs, but somehow the Linux guest drivers are unhappy about
 the device states (the e1000e detects Hardware Unit Hang e.g.). This
 is independent of my patches. Anyone any ideas? Improper configurations
 of the BIOS? I stumbled over this in the kernel log:
 
 DMAR: BIOS has allocated no shadow GTT; disabling IOMMU for graphics
 

Argh, no one told me that there is CONFIG_DMAR_DEFAULT_ON and that you
have to pass intel_iommu=on if that config switch is off like here.
Moreover, I somehow assumed iommu=0 on the command line means use
IOMMU with index 0 (don't ask me why). Usability deserves some patches
- later.

So it's starting to work, already accessed some devices (ath9k did a
successful scan), but now it's crashing in intel_iommu_attach_device.
Looks like use after release. Sigh.

Jan



signature.asc
Description: OpenPGP digital signature


Re: TODO item: guest programmable mac/vlan filtering with macvtap

2010-11-01 Thread Michael S. Tsirkin
On Mon, Nov 01, 2010 at 11:48:23AM +0100, Dragos Tatulea wrote:
  1. add a secondary mac (or third, etc) address to the guest virtio-net
  interface.
 Maybe I misunderstood this. Is it just setting another mac on the
 guest virtio-net interface?

Well, yes, that's also not possible at the moment.  Or e.g. set more
than one mac per virtio-net device using macvlan.

 
  4. the above stuff must be controllable by host admin
   - Well, for this there are a few options:
      admin switch that allows the guest user to add macs
      preconfig allowed MAC's in mactap (or qemu config) for the guest user
      allow/disallow command for user in qemu (although this doesn't
  seem to be supported)
 
 Well, on a second thought, qemu capabilities should be just fine, right?
 
 -- Dragos

At some level, although I think we also want a way to disable
access that qemu can't override unless it has net admin capability.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Crash in intel_iommu_assign_device

2010-11-01 Thread Jan Kiszka
Hi Sheng,

I'm not claiming to understand the details, but this looks like use
(dereference of pte via dma_pte_addr) after release (free_pgtable_page
of dmar_domain-pgd aka pte) to me:

static int intel_iommu_attach_device(struct iommu_domain *domain,
 struct device *dev)
{
[...]
pte = dmar_domain-pgd;
if (dma_pte_present(pte)) {
free_pgtable_page(dmar_domain-pgd);
dmar_domain-pgd = (struct dma_pte *)
phys_to_virt(dma_pte_addr(pte));
}

At least it crashes here right on pte-val access. Swap both lines?

Jan



signature.asc
Description: OpenPGP digital signature


Cannot boot 2.6.35 SMP guest in 2.6.35 host

2010-11-01 Thread Richard Chan
I am having a problem with a Fedora 14 x86_64 host booting SMP 2.6.35.x guests.
The boot proceeds halfway but does not reach a shell or desktop.
Oddly the console echoes characters.




I am using a AMD SVM system and found this:
https://patchwork.kernel.org/patch/226981/
Has this been resolved?
Fedora 14 x86_64 Host:
Guests:
0. Fedora 14 Install DVD - boots halfway and gets stuck
1. Ubuntu 10.04 (2.6.32.x) - boots SMP to desktop
2. Ubuntu 10.10 (2.6.35.x) -boots UP to shell (server version); cannot
get it to boot to desktop UP or SMP
3. Windows XP guest ok
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cannot boot 2.6.35 SMP guest in 2.6.35 host

2010-11-01 Thread Michael Tokarev
Richard Chan wrote:
 I am having a problem with a Fedora 14 x86_64 host booting SMP 2.6.35.x 
 guests.
 The boot proceeds halfway but does not reach a shell or desktop.
 Oddly the console echoes characters.
 
 I am using a AMD SVM system and found this:
 https://patchwork.kernel.org/patch/226981/
 Has this been resolved?

2.6.36.8 kernel which were released a few days ago
contains 2 simple patches that addresses this issue.
The same patches are needed for 2.6.32 kernel too,
but they weren't applied to 2.6.32.25.  I can't
say anything about fedora kernels.  This is all
about host kernel, not guest one.

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Cannot boot 2.6.35 SMP guest in 2.6.35 host

2010-11-01 Thread Michael Tokarev
Michael Tokarev wrote:
 Richard Chan wrote:
 I am having a problem with a Fedora 14 x86_64 host booting SMP 2.6.35.x 
 guests.
 The boot proceeds halfway but does not reach a shell or desktop.
 Oddly the console echoes characters.

 I am using a AMD SVM system and found this:
 https://patchwork.kernel.org/patch/226981/
 Has this been resolved?
 
 2.6.36.8 kernel which were released a few days ago

I mean 2.6.35.8 ofcourse.

/mjt
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Crash on kvm_iommu_map_pages

2010-11-01 Thread Jan Kiszka
Hi again,

OK, I swapped those two lines in intel_iommu_attach_device [1], fixed
another warning in the wbinvd emulation, but now I'm about to give up.
This is freaky MMU stuff:


general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/devices/pci:00/:00:1a.0/device
CPU 1 
Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss snd_seq 
snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw 
xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle 
nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables 
ip6table_filter ip6_tables x_tables ipv6 fuse loop mac80211 snd_hda_codec_hdmi 
snd_hda_codec_realtek snd_hda_intel snd_hda_codec ath9k_common ath9k_hw 
snd_hwdep snd_pcm snd_timer snd ath pcmcia cfg80211 sdhci_pci tpm_infineon 
sdhci yenta_socket firewire_ohci tpm_tis mmc_core soundcore e1000e sg 
pcmcia_rsrc tpm firewire_core iTCO_wdt video snd_page_alloc pcmcia_core 
i2c_i801 rfkill tpm_bios intel_agp output fujitsu_laptop iTCO_vendor_support 
i2c_core serio_raw pcspkr led_class joydev crc_itu_t button battery intel_gtt 
ac ext4 mbcache jbd2 crc16 sha256_generic aes_x86_64 aes_generic cbc dm_crypt 
linear sd_mod crc_t10dif dm_snapshot dm_mod fan processor ahci libahci 
ata_generi
c liba
ta scsi_mod thermal thermal_sys hwmon
Nov  1 13:19:11 mchn199C kernel:
Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS H700
RIP: 0010:[8121de8c]  [8121de8c] pfn_to_dma_pte+0x73/0x190
RSP: 0018:8800bd4bdb68  EFLAGS: 00010202
RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009
RDX: 0180 RSI: 88012a940938 RDI: 0202
RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004
R10: 0001 R11:  R12: 880128dfee00
R13: 0002 R14: 000f R15: 0009
FS:  7f4990d33710() GS:8800be68() knlGS:
CS:  0010 DS: 002b ES: 002b CR0: 80050033
CR2: 01408000 CR3: bd7db000 CR4: 26e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 
88012a94)
Stack:
8800bd4bdb88 8800ac378000 00d2 000f
0 8800bd4bdce8 8800ac2fc000 0003 000f0001
0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573
Call Trace:
[8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a
[812a9573] iommu_iova_to_phys+0x13/0x15
[a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm]
[8111404d] ? __vmalloc_node+0x86/0x9b
[a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm]
[81081ee8] ? mark_held_locks+0x50/0x72
[8137c983] ? mutex_lock_nested+0x325/0x34d
[a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm]
[a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
[a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
[811355aa] ? fget_light+0x17b/0x31f
[81143bd7] do_vfs_ioctl+0x4c6/0x507
[81135732] ? fget_light+0x303/0x31f
[811355aa] ? fget_light+0x17b/0x31f
[8137ebb9] ? retint_swapgs+0x13/0x1b
[81143c6e] sys_ioctl+0x56/0x7c
[81002df2] system_call_fastpath+0x16/0x1b
Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 81 
e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 85 d8 00 
00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 
RIP  [8121de8c] pfn_to_dma_pte+0x73/0x190
RSP 8800bd4bdb68


Can anyone parse this? Is it Intel-specific or a generic issue? The
kernel is current kvm.git + unrelated patch + [1].

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH v2 7/7] KVM: KVM: don't break vcpu 'halt' state due to apfs

2010-11-01 Thread Gleb Natapov
On Mon, Nov 01, 2010 at 05:05:00PM +0800, Xiao Guangrong wrote:
 Don't make a KVM_REQ_UNHALT request after async pf is completed since it
 can break guest's 'HLT' instruction.
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/x86.c   |   13 ++---
  include/linux/kvm_host.h |6 ++
  virt/kvm/kvm_main.c  |9 -
  3 files changed, 24 insertions(+), 4 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 189664a..c57fb38 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -6105,13 +6105,20 @@ void kvm_arch_flush_shadow(struct kvm *kvm)
  
  int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
  {
 - return (vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
 + if ((vcpu-arch.mp_state == KVM_MP_STATE_RUNNABLE 
   !vcpu-arch.apf.halted)
 - || !list_empty_careful(vcpu-async_pf.done)
   || vcpu-arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
   || vcpu-arch.nmi_pending ||
   (kvm_arch_interrupt_allowed(vcpu) 
 -  kvm_cpu_has_interrupt(vcpu));
 +  kvm_cpu_has_interrupt(vcpu)))
 + return 1;
 +
 + if (!list_empty_careful(vcpu-async_pf.done)) {
 + vcpu-arch.apf.halted = false;
 + return 2;
 + }
kvm_arch_vcpu_runnable() shouldn't change vcpu state. I don't like the
way it propagates internal x86 state to kvm_vcpu_block() too.
May be what you are looking for is the patch below? (not tested).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2cfdf2d..f7aed95 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5295,8 +5295,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
{
switch(vcpu-arch.mp_state) {
case KVM_MP_STATE_HALTED:
-   vcpu-arch.mp_state =
-   KVM_MP_STATE_RUNNABLE;
+   if 
(list_empty_careful(vcpu-async_pf.done))
+   vcpu-arch.mp_state =
+   KVM_MP_STATE_RUNNABLE;
case KVM_MP_STATE_RUNNABLE:
vcpu-arch.apf.halted = false;
break;
@@ -6279,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
vcpu-arch.fault.error_code = 0;
vcpu-arch.fault.address = work-arch.token;
kvm_inject_page_fault(vcpu);
+   vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE;
}
 }
 
--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Jan Kiszka
[ Forgot to CC LKML - maybe it's not KVM-specific.
  BTW, is anyone actually using current KVM device assigment on
  Intel? I'm starting to believe that can only very few lucky people...
]

Am 01.11.2010 13:51, Jan Kiszka wrote:
 Hi again,
 
 OK, I swapped those two lines in intel_iommu_attach_device [1], fixed
 another warning in the wbinvd emulation, but now I'm about to give up.
 This is freaky MMU stuff:
 
 
 general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 last sysfs file: /sys/devices/pci:00/:00:1a.0/device
 CPU 1 
 Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss snd_seq 
 snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw 
 xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle 
 nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 
 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop mac80211 
 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec 
 ath9k_common ath9k_hw snd_hwdep snd_pcm snd_timer snd ath pcmcia cfg80211 
 sdhci_pci tpm_infineon sdhci yenta_socket firewire_ohci tpm_tis mmc_core 
 soundcore e1000e sg pcmcia_rsrc tpm firewire_core iTCO_wdt video 
 snd_page_alloc pcmcia_core i2c_i801 rfkill tpm_bios intel_agp output 
 fujitsu_laptop iTCO_vendor_support i2c_core serio_raw pcspkr led_class joydev 
 crc_itu_t button battery intel_gtt ac ext4 mbcache jbd2 crc16 sha256_generic 
 aes_x86_64 aes_generic cbc dm_crypt linear sd_mod crc_t10dif dm_snapshot 
 dm_mod fan processor ahci libahci ata_gene
ri
 c liba
 ta scsi_mod thermal thermal_sys hwmon
 Nov  1 13:19:11 mchn199C kernel:
 Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS H700
 RIP: 0010:[8121de8c]  [8121de8c] pfn_to_dma_pte+0x73/0x190
 RSP: 0018:8800bd4bdb68  EFLAGS: 00010202
 RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009
 RDX: 0180 RSI: 88012a940938 RDI: 0202
 RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004
 R10: 0001 R11:  R12: 880128dfee00
 R13: 0002 R14: 000f R15: 0009
 FS:  7f4990d33710() GS:8800be68() knlGS:
 CS:  0010 DS: 002b ES: 002b CR0: 80050033
 CR2: 01408000 CR3: bd7db000 CR4: 26e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 
 88012a94)
 Stack:
 8800bd4bdb88 8800ac378000 00d2 000f
 0 8800bd4bdce8 8800ac2fc000 0003 000f0001
 0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573
 Call Trace:
 [8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a
 [812a9573] iommu_iova_to_phys+0x13/0x15
 [a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm]
 [8111404d] ? __vmalloc_node+0x86/0x9b
 [a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm]
 [81081ee8] ? mark_held_locks+0x50/0x72
 [8137c983] ? mutex_lock_nested+0x325/0x34d
 [a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm]
 [a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
 [a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
 [811355aa] ? fget_light+0x17b/0x31f
 [81143bd7] do_vfs_ioctl+0x4c6/0x507
 [81135732] ? fget_light+0x303/0x31f
 [811355aa] ? fget_light+0x17b/0x31f
 [8137ebb9] ? retint_swapgs+0x13/0x1b
 [81143c6e] sys_ioctl+0x56/0x7c
 [81002df2] system_call_fastpath+0x16/0x1b
 Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 81 
 e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 85 d8 
 00 00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 
 RIP  [8121de8c] pfn_to_dma_pte+0x73/0x190
 RSP 8800bd4bdb68
 
 
 Can anyone parse this? Is it Intel-specific or a generic issue? The
 kernel is current kvm.git + unrelated patch + [1].
 
 Jan
 
 [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923
 



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH v2 6/7] KVM: fix the race while wakeup all pv guest

2010-11-01 Thread Gleb Natapov
On Mon, Nov 01, 2010 at 05:03:44PM +0800, Xiao Guangrong wrote:
 In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu-async_pf.done
 without holding vcpu-async_pf.lock, it will break if we are handling apfs
 at this time.
 
This should never happen to well behaved guest, but malicious guest can do it
on purpose.

 Also use 'list_empty_careful()' instead of 'list_empty()'
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Acked-by: Gleb Natapov g...@redhat.com

 ---
  virt/kvm/async_pf.c |5 -
  1 files changed, 4 insertions(+), 1 deletions(-)
 
 diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
 index d57ec92..6ef3373 100644
 --- a/virt/kvm/async_pf.c
 +++ b/virt/kvm/async_pf.c
 @@ -200,7 +200,7 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
  {
   struct kvm_async_pf *work;
  
 - if (!list_empty(vcpu-async_pf.done))
 + if (!list_empty_careful(vcpu-async_pf.done))
   return 0;
  
   work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
 @@ -211,7 +211,10 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
   get_page(bad_page);
   INIT_LIST_HEAD(work-queue); /* for list_del to work */
  
 + spin_lock(vcpu-async_pf.lock);
   list_add_tail(work-link, vcpu-async_pf.done);
 + spin_unlock(vcpu-async_pf.lock);
 +
   vcpu-async_pf.queued++;
   return 0;
  }
 -- 
 1.7.0.4

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] KVM: x86: Issue smp_call_function_many with preemption disabled

2010-11-01 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

smp_call_function_many is specified to be called only with preemption
disabled. Fulfill this requirement.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 arch/x86/kvm/x86.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2cfdf2d..0ed6dad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4012,8 +4012,10 @@ int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu)
return X86EMUL_CONTINUE;
 
if (kvm_x86_ops-has_wbinvd_exit()) {
+   preempt_disable();
smp_call_function_many(vcpu-arch.wbinvd_dirty_mask,
wbinvd_ipi, NULL, 1);
+   preempt_enable();
cpumask_clear(vcpu-arch.wbinvd_dirty_mask);
}
wbinvd();
-- 
1.7.1
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] KVM: x86: Avoid issuing wbinvd twice

2010-11-01 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Micro optimization to avoid calling wbinvd twice on the CPU that has to
emulate it. As we might be preempted between smp_call_function_many and
the local wbinvd, the cache might be filled again so that real work
could be done uselessly.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

I can fold the first patch into this one if it's preferred.

 arch/x86/kvm/x86.c |   10 ++
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ed6dad..58e8232 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4012,13 +4012,15 @@ int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu)
return X86EMUL_CONTINUE;
 
if (kvm_x86_ops-has_wbinvd_exit()) {
-   preempt_disable();
+   int cpu = get_cpu();
+
+   cpumask_set_cpu(cpu, vcpu-arch.wbinvd_dirty_mask);
smp_call_function_many(vcpu-arch.wbinvd_dirty_mask,
wbinvd_ipi, NULL, 1);
-   preempt_enable();
+   put_cpu();
cpumask_clear(vcpu-arch.wbinvd_dirty_mask);
-   }
-   wbinvd();
+   } else
+   wbinvd();
return X86EMUL_CONTINUE;
 }
 EXPORT_SYMBOL_GPL(kvm_emulate_wbinvd);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/7] KVM: fix tracing kvm_try_async_get_page

2010-11-01 Thread Gleb Natapov
On Mon, Nov 01, 2010 at 04:58:43PM +0800, Xiao Guangrong wrote:
 Tracing 'async' and *pfn is useless, since 'async' is always true,
 and '*pfn' is always fault_pfn'
 
 We can trace 'gva' and 'gfn' instead, it can help us to see the
 life-cycle of an async_pf
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Acked-by: Gleb Natapov g...@redhat.com

 ---
  arch/x86/kvm/mmu.c |2 +-
  include/trace/events/kvm.h |   12 +++-
  2 files changed, 8 insertions(+), 6 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 39cc0c6..5275c50 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2627,7 +2627,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool 
 no_apf, gfn_t gfn,
   put_page(pfn_to_page(*pfn));
  
   if (!no_apf  can_do_async_pf(vcpu)) {
 - trace_kvm_try_async_get_page(async, *pfn);
 + trace_kvm_try_async_get_page(gva, gfn);
   if (kvm_find_async_pf_gfn(vcpu, gfn)) {
   trace_kvm_async_pf_doublefault(gva, gfn);
   kvm_make_request(KVM_REQ_APF_HALT, vcpu);
 diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
 index 9c2cc6a..30063c6 100644
 --- a/include/trace/events/kvm.h
 +++ b/include/trace/events/kvm.h
 @@ -188,18 +188,20 @@ TRACE_EVENT(kvm_age_page,
  #ifdef CONFIG_KVM_ASYNC_PF
  TRACE_EVENT(
   kvm_try_async_get_page,
 - TP_PROTO(bool async, u64 pfn),
 - TP_ARGS(async, pfn),
 + TP_PROTO(u64 gva, u64 gfn),
 + TP_ARGS(gva, gfn),
  
   TP_STRUCT__entry(
 - __field(__u64, pfn)
 + __field(u64, gva)
 + __field(u64, gfn)
   ),
  
   TP_fast_assign(
 - __entry-pfn = (!async) ? pfn : (u64)-1;
 + __entry-gva = gva;
 + __entry-gfn = gfn;
   ),
  
 - TP_printk(pfn %#llx, __entry-pfn)
 + TP_printk(gva = %#llx, gfn = %#llx, __entry-gva, __entry-gfn)
  );
  
  TRACE_EVENT(
 -- 
 1.7.0.4

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Roedel, Joerg
The registers rax and rbx contain non-canonical addresses (if
interpreted as pointers). The instruction where this happens is a mov so
I guess that the #GP is because of an non-canonical address.
Can you find out the code-line where this happens and the exact
assembler instruction? (haven't managed to decode the registers used).

Joerg

On Mon, Nov 01, 2010 at 08:57:42AM -0400, Jan Kiszka wrote:
 [ Forgot to CC LKML - maybe it's not KVM-specific.
   BTW, is anyone actually using current KVM device assigment on
   Intel? I'm starting to believe that can only very few lucky people...
 ]
 
 Am 01.11.2010 13:51, Jan Kiszka wrote:
  Hi again,
  
  OK, I swapped those two lines in intel_iommu_attach_device [1], fixed
  another warning in the wbinvd emulation, but now I'm about to give up.
  This is freaky MMU stuff:
  
  
  general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
  last sysfs file: /sys/devices/pci:00/:00:1a.0/device
  CPU 1 
  Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss 
  snd_seq snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 
  ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter 
  ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack 
  nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop 
  mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel 
  snd_hda_codec ath9k_common ath9k_hw snd_hwdep snd_pcm snd_timer snd ath 
  pcmcia cfg80211 sdhci_pci tpm_infineon sdhci yenta_socket firewire_ohci 
  tpm_tis mmc_core soundcore e1000e sg pcmcia_rsrc tpm firewire_core iTCO_wdt 
  video snd_page_alloc pcmcia_core i2c_i801 rfkill tpm_bios intel_agp output 
  fujitsu_laptop iTCO_vendor_support i2c_core serio_raw pcspkr led_class 
  joydev crc_itu_t button battery intel_gtt ac ext4 mbcache jbd2 crc16 
  sha256_generic aes_x86_64 aes_generic cbc dm_crypt linear sd_mod crc_t10dif 
  dm_snapshot dm_mod fan processor ahci libahci ata_gene
 ri
  c liba
  ta scsi_mod thermal thermal_sys hwmon
  Nov  1 13:19:11 mchn199C kernel:
  Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS 
  H700
  RIP: 0010:[8121de8c]  [8121de8c] 
  pfn_to_dma_pte+0x73/0x190
  RSP: 0018:8800bd4bdb68  EFLAGS: 00010202
  RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009
  RDX: 0180 RSI: 88012a940938 RDI: 0202
  RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004
  R10: 0001 R11:  R12: 880128dfee00
  R13: 0002 R14: 000f R15: 0009
  FS:  7f4990d33710() GS:8800be68() knlGS:
  CS:  0010 DS: 002b ES: 002b CR0: 80050033
  CR2: 01408000 CR3: bd7db000 CR4: 26e0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 
  88012a94)
  Stack:
  8800bd4bdb88 8800ac378000 00d2 000f
  0 8800bd4bdce8 8800ac2fc000 0003 000f0001
  0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573
  Call Trace:
  [8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a
  [812a9573] iommu_iova_to_phys+0x13/0x15
  [a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm]
  [8111404d] ? __vmalloc_node+0x86/0x9b
  [a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm]
  [81081ee8] ? mark_held_locks+0x50/0x72
  [8137c983] ? mutex_lock_nested+0x325/0x34d
  [a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm]
  [a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
  [a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
  [811355aa] ? fget_light+0x17b/0x31f
  [81143bd7] do_vfs_ioctl+0x4c6/0x507
  [81135732] ? fget_light+0x303/0x31f
  [811355aa] ? fget_light+0x17b/0x31f
  [8137ebb9] ? retint_swapgs+0x13/0x1b
  [81143c6e] sys_ioctl+0x56/0x7c
  [81002df2] system_call_fastpath+0x16/0x1b
  Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 
  81 e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 
  85 d8 00 00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 
  RIP  [8121de8c] pfn_to_dma_pte+0x73/0x190
  RSP 8800bd4bdb68
  
  
  Can anyone parse this? Is it Intel-specific or a generic issue? The
  kernel is current kvm.git + unrelated patch + [1].
  
  Jan
  
  [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923
  
 



-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message 

Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Jan Kiszka
Am 01.11.2010 14:21, Roedel, Joerg wrote:
 The registers rax and rbx contain non-canonical addresses (if
 interpreted as pointers). The instruction where this happens is a mov so
 I guess that the #GP is because of an non-canonical address.
 Can you find out the code-line where this happens and the exact
 assembler instruction? (haven't managed to decode the registers used).

In pfn_to_dma_pte, line 710:

if (!dma_pte_present(pte)) {
8121de8c:   f6 03 03testb  $0x3,(%rbx)
8121de8f:   0f 85 d8 00 00 00   jne8121df6d 
pfn_to_dma_pte+0x154

The first instruction raises the fault.

Jan

 
   Joerg
 
 On Mon, Nov 01, 2010 at 08:57:42AM -0400, Jan Kiszka wrote:
 [ Forgot to CC LKML - maybe it's not KVM-specific.
   BTW, is anyone actually using current KVM device assigment on
   Intel? I'm starting to believe that can only very few lucky people...
 ]

 Am 01.11.2010 13:51, Jan Kiszka wrote:
 Hi again,

 OK, I swapped those two lines in intel_iommu_attach_device [1], fixed
 another warning in the wbinvd emulation, but now I'm about to give up.
 This is freaky MMU stuff:


 general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 last sysfs file: /sys/devices/pci:00/:00:1a.0/device
 CPU 1 
 Modules linked in: kvm_intel kvm bluetooth snd_pcm_oss snd_mixer_oss 
 snd_seq snd_seq_device edd ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 
 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter 
 ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack 
 nf_defrag_ipv4 ip_tables ip6table_filter ip6_tables x_tables ipv6 fuse loop 
 mac80211 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel 
 snd_hda_codec ath9k_common ath9k_hw snd_hwdep snd_pcm snd_timer snd ath 
 pcmcia cfg80211 sdhci_pci tpm_infineon sdhci yenta_socket firewire_ohci 
 tpm_tis mmc_core soundcore e1000e sg pcmcia_rsrc tpm firewire_core iTCO_wdt 
 video snd_page_alloc pcmcia_core i2c_i801 rfkill tpm_bios intel_agp output 
 fujitsu_laptop iTCO_vendor_support i2c_core serio_raw pcspkr led_class 
 joydev crc_itu_t button battery intel_gtt ac ext4 mbcache jbd2 crc16 
 sha256_generic aes_x86_64 aes_generic cbc dm_crypt linear sd_mod crc_t10dif 
 dm_snapshot dm_mod fan processor ahci libahci ata_ge
ne
 ri
 c liba
 ta scsi_mod thermal thermal_sys hwmon
 Nov  1 13:19:11 mchn199C kernel:
 Pid: 2248, comm: qemu-system-x86 Not tainted 2.6.36+ #12 FJNB211W/CELSIUS 
 H700
 RIP: 0010:[8121de8c]  [8121de8c] 
 pfn_to_dma_pte+0x73/0x190
 RSP: 0018:8800bd4bdb68  EFLAGS: 00010202
 RAX: 1000bd4fe000 RBX: 1000bd4fec00 RCX: 0009
 RDX: 0180 RSI: 88012a940938 RDI: 0202
 RBP: 8800bd4bdba8 R08: ea00025ac2a0 R09: 0004
 R10: 0001 R11:  R12: 880128dfee00
 R13: 0002 R14: 000f R15: 0009
 FS:  7f4990d33710() GS:8800be68() knlGS:
 CS:  0010 DS: 002b ES: 002b CR0: 80050033
 CR2: 01408000 CR3: bd7db000 CR4: 26e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process qemu-system-x86 (pid: 2248, threadinfo 8800bd4bc000, task 
 88012a94)
 Stack:
 8800bd4bdb88 8800ac378000 00d2 000f
 0 8800bd4bdce8 8800ac2fc000 0003 000f0001
 0 8800bd4bdbb8 8121dfbe 8800bd4bdbc8 812a9573
 Call Trace:
 [8121dfbe] intel_iommu_iova_to_phys+0x15/0x2a
 [812a9573] iommu_iova_to_phys+0x13/0x15
 [a04e91d0] kvm_iommu_map_pages+0x77/0x194 [kvm]
 [8111404d] ? __vmalloc_node+0x86/0x9b
 [a04e30e2] __kvm_set_memory_region+0x4e5/0x787 [kvm]
 [81081ee8] ? mark_held_locks+0x50/0x72
 [8137c983] ? mutex_lock_nested+0x325/0x34d
 [a04e33bb] kvm_set_memory_region+0x37/0x50 [kvm]
 [a04e4c15] kvm_vm_ioctl_set_memory_region+0x18/0x1a [kvm]
 [a04e4e44] kvm_vm_ioctl+0x22d/0x3b1 [kvm]
 [811355aa] ? fget_light+0x17b/0x31f
 [81143bd7] do_vfs_ioctl+0x4c6/0x507
 [81135732] ? fget_light+0x303/0x31f
 [811355aa] ? fget_light+0x17b/0x31f
 [8137ebb9] ? retint_swapgs+0x13/0x1b
 [81143c6e] sys_ioctl+0x56/0x7c
 [81002df2] system_call_fastpath+0x16/0x1b
 Code: c7 31 db 47 8d 3c ff e9 1d 01 00 00 0f 0b 4c 89 f2 44 88 f9 48 d3 ea 
 81 e2 ff 01 00 00 41 83 fd 01 48 8d 1c d0 0f 84 0b 01 00 00 f6 03 03 0f 
 85 d8 00 00 00 41 8b 44 24 04 85 c0 79 08 65 8b 04 
 RIP  [8121de8c] pfn_to_dma_pte+0x73/0x190
 RSP 8800bd4bdb68


 Can anyone parse this? Is it Intel-specific or a generic issue? The
 kernel is current kvm.git + unrelated patch + [1].

 Jan

 [1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/61923


 
 
 




signature.asc
Description: OpenPGP digital signature


[PATCHv2] KVM: handle exit due to INVD in VMX

2010-11-01 Thread Gleb Natapov
Currently the exit is unhandled, so guest halts with error if it tries
to execute INVD instruction. Call into emulator when INVD instruction
is executed by a guest instead. This instruction is not needed by ordinary
guests, but firmware (like OpenBIOS) use it and fail.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 ChangeLog:
 v1-v2
  - New and improved patch description

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 9f0cbd9..42d9590 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -239,6 +239,7 @@ enum vmcs_field {
 #define EXIT_REASON_TASK_SWITCH 9
 #define EXIT_REASON_CPUID   10
 #define EXIT_REASON_HLT 12
+#define EXIT_REASON_INVD13
 #define EXIT_REASON_INVLPG  14
 #define EXIT_REASON_RDPMC   15
 #define EXIT_REASON_RDTSC   16
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 993e332..200533e 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3350,6 +3350,11 @@ static int handle_vmx_insn(struct kvm_vcpu *vcpu)
return 1;
 }
 
+static int handle_invd(struct kvm_vcpu *vcpu)
+{
+   return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
+}
+
 static int handle_invlpg(struct kvm_vcpu *vcpu)
 {
unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -3650,6 +3655,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu 
*vcpu) = {
[EXIT_REASON_MSR_WRITE]   = handle_wrmsr,
[EXIT_REASON_PENDING_INTERRUPT]   = handle_interrupt_window,
[EXIT_REASON_HLT] = handle_halt,
+   [EXIT_REASON_INVD]= handle_invd,
[EXIT_REASON_INVLPG]  = handle_invlpg,
[EXIT_REASON_VMCALL]  = handle_vmcall,
[EXIT_REASON_VMCLEAR] = handle_vmx_insn,
--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Roedel, Joerg
On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote:
 Am 01.11.2010 14:21, Roedel, Joerg wrote:
  The registers rax and rbx contain non-canonical addresses (if
  interpreted as pointers). The instruction where this happens is a mov so
  I guess that the #GP is because of an non-canonical address.
  Can you find out the code-line where this happens and the exact
  assembler instruction? (haven't managed to decode the registers used).
 
 In pfn_to_dma_pte, line 710:
 
 if (!dma_pte_present(pte)) {
 8121de8c:   f6 03 03testb  $0x3,(%rbx)
 8121de8f:   0f 85 d8 00 00 00   jne8121df6d 
 pfn_to_dma_pte+0x154
 
 The first instruction raises the fault.

Ok, so it seems that my understanding of the Code: field in the
crash-message was wrong :)
Anyway, the testb uses rbx as an address which has a non-canonical
value. This means the the address of 'pte' is invalid. Since rax also
contains a wrong address the 'parent' variable probably already contains
the wrong address. Does the attached patch help?

diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
index 5619f85..ca46f24 100644
--- a/include/linux/dma_remapping.h
+++ b/include/linux/dma_remapping.h
@@ -6,7 +6,7 @@
  */
 #define VTD_PAGE_SHIFT (12)
 #define VTD_PAGE_SIZE  (1UL  VTD_PAGE_SHIFT)
-#define VTD_PAGE_MASK  (((u64)-1)  VTD_PAGE_SHIFT)
+#define VTD_PAGE_MASK  u64)-1)  VTD_PAGE_SHIFT)  ((1ULL  52) 
- 1))
 #define VTD_PAGE_ALIGN(addr)   (((addr) + VTD_PAGE_SIZE - 1)  VTD_PAGE_MASK)
 
 #define DMA_PTE_READ (1)

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] KVM: Clear assigned guest IRQ on release

2010-11-01 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

When we deassign a guest IRQ, clear the potentially asserted guest line.
There might be no chance for the guest to do this, specifically if we
switch from INTx to MSI mode.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 virt/kvm/assigned-dev.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 5c1b56a..d3ddfea 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -129,6 +129,9 @@ static void deassign_guest_irq(struct kvm *kvm,
kvm_unregister_irq_ack_notifier(kvm, assigned_dev-ack_notifier);
assigned_dev-ack_notifier.gsi = -1;
 
+   kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id,
+   assigned_dev-guest_irq, 0);
+
if (assigned_dev-irq_source_id != -1)
kvm_free_irq_source_id(kvm, assigned_dev-irq_source_id);
assigned_dev-irq_source_id = -1;
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] KVM: Improve IRQ assignment for device passthrough

2010-11-01 Thread Jan Kiszka
Three patches to improve classic device assigment /wrt IRQs. Highlight
is the last one that resolves the host IRQ sharing issue for all PCI 2.3
devices. Quite essential when passing non-MSI-ready devices like many
USB host controllers.

Jan Kiszka (3):
  KVM: Fold assigned interrupt work into IRQ handler
  KVM: Clear assigned guest IRQ on release
  KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

 include/linux/kvm_host.h |2 +-
 virt/kvm/assigned-dev.c  |  215 ++
 2 files changed, 160 insertions(+), 57 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-11-01 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share IRQs of such devices between on the host side when
passing them to a guest.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 include/linux/kvm_host.h |1 +
 virt/kvm/assigned-dev.c  |  153 +
 2 files changed, 140 insertions(+), 14 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index df5497f..fcdc849 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
unsigned int entries_nr;
int host_irq;
bool host_irq_disabled;
+   bool pci_2_3;
struct msix_entry *host_msix_entries;
int guest_irq;
struct kvm_guest_msix_entry *guest_msix_entries;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index d3ddfea..411643c 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct 
kvm_assigned_dev_kernel
return index;
 }
 
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2.
+ */
+static bool pci_2_3_supported(struct pci_dev *pdev)
+{
+   u16 orig, new;
+   bool supported = false;
+
+   pci_block_user_cfg_access(pdev);
+   pci_read_config_word(pdev, PCI_COMMAND, orig);
+   pci_write_config_word(pdev, PCI_COMMAND,
+ orig ^ PCI_COMMAND_INTX_DISABLE);
+   pci_read_config_word(pdev, PCI_COMMAND, new);
+   
+   /*
+* There's no way to protect against
+* hardware bugs or detect them reliably, but as long as we know
+* what the value should be, let's go ahead and check it.
+*/
+   if ((new ^ orig)  ~PCI_COMMAND_INTX_DISABLE) {
+   dev_err(pdev-dev, Command changed from 0x%x to 0x%x: 
+   driver or HW bug?\n, orig, new);
+   goto out;
+   }
+   if (!((new ^ orig)  PCI_COMMAND_INTX_DISABLE)) {
+   dev_warn(pdev-dev, Device does not support 
+disabling interrupts: unable to bind.\n);
+   goto out;
+   }
+   supported = true;
+
+   /* Now restore the original value. */
+   pci_write_config_word(pdev, PCI_COMMAND, orig);
+
+out:
+   pci_unblock_user_cfg_access(pdev);
+   return supported;
+}
+
+static void
+pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
+{
+   u32 cmd_status_dword;
+   u16 origcmd, newcmd;
+
+   /*
+* We do a single dword read to retrieve both command and status.
+* Document assumptions that make this possible.
+*/
+   BUILD_BUG_ON(PCI_COMMAND % 4);
+   BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
+
+   pci_block_user_cfg_access(dev);
+
+   /*
+* Read both command and status registers in a single 32-bit operation.
+* Note: we could cache the value for command and move the status read
+* out of the lock if there was a way to get notified of user changes
+* to command register through sysfs. Should be good for shared irqs.
+*/
+   pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword);
+   origcmd = cmd_status_dword;
+
+   if (irq_status) {
+   /*
+   * Check interrupt status register to see whether our device 
triggered
+   * the interrupt.
+   */
+   *irq_status = (cmd_status_dword  16)  PCI_STATUS_INTERRUPT;
+   if (*irq_status == 0)
+   goto done;
+   }
+
+   newcmd = origcmd  ~PCI_COMMAND_INTX_DISABLE;
+   if (mask)
+   newcmd |= PCI_COMMAND_INTX_DISABLE;
+   if (newcmd != origcmd)
+   pci_write_config_word(dev, PCI_COMMAND, newcmd);
+
+done:
+   pci_unblock_user_cfg_access(dev);
+}
+
 static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
 {
struct kvm_assigned_dev_kernel *assigned_dev =
(struct kvm_assigned_dev_kernel *) dev_id;
+   int ret = IRQ_HANDLED;
unsigned long flags;
 
spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
@@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void 
*dev_id)
guest_entries[i].vector, 1);
}
} else {
-   kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id,
-   assigned_dev-guest_irq, 1);
-
if (assigned_dev-irq_requested_type 
KVM_DEV_IRQ_GUEST_INTX) {
-   disable_irq_nosync(irq);
+   if (assigned_dev-pci_2_3) {
+   unsigned int irq_status;
+
+   

[PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler

2010-11-01 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

The complete work handler runs with assigned_dev_lock acquired and
interrupts disabled, so there is nothing to gain pushing this work out
of the actually IRQ handler. Fold them together.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 include/linux/kvm_host.h |1 -
 virt/kvm/assigned-dev.c  |   71 +++---
 2 files changed, 23 insertions(+), 49 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 462b982..df5497f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -465,7 +465,6 @@ struct kvm_guest_msix_entry {
 
 struct kvm_assigned_dev_kernel {
struct kvm_irq_ack_notifier ack_notifier;
-   struct work_struct interrupt_work;
struct list_head list;
int assigned_dev_id;
int host_segnr;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 7c98928..5c1b56a 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -55,18 +55,24 @@ static int find_index_from_host_irq(struct 
kvm_assigned_dev_kernel
return index;
 }
 
-static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
+static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
 {
-   struct kvm_assigned_dev_kernel *assigned_dev;
-   int i;
-
-   assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
-   interrupt_work);
+   struct kvm_assigned_dev_kernel *assigned_dev =
+   (struct kvm_assigned_dev_kernel *) dev_id;
+   unsigned long flags;
 
-   spin_lock_irq(assigned_dev-assigned_dev_lock);
+   spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
if (assigned_dev-irq_requested_type  KVM_DEV_IRQ_HOST_MSIX) {
struct kvm_guest_msix_entry *guest_entries =
assigned_dev-guest_msix_entries;
+   int index = find_index_from_host_irq(assigned_dev, irq);
+   int i;
+
+   if (index  0)
+   goto out;
+
+   guest_entries[index].flags |= KVM_ASSIGNED_MSIX_PENDING;
+
for (i = 0; i  assigned_dev-entries_nr; i++) {
if (!(guest_entries[i].flags 
KVM_ASSIGNED_MSIX_PENDING))
@@ -76,33 +82,15 @@ static void kvm_assigned_dev_interrupt_work_handler(struct 
work_struct *work)
assigned_dev-irq_source_id,
guest_entries[i].vector, 1);
}
-   } else
+   } else {
kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id,
assigned_dev-guest_irq, 1);
 
-   spin_unlock_irq(assigned_dev-assigned_dev_lock);
-}
-
-static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
-{
-   unsigned long flags;
-   struct kvm_assigned_dev_kernel *assigned_dev =
-   (struct kvm_assigned_dev_kernel *) dev_id;
-
-   spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
-   if (assigned_dev-irq_requested_type  KVM_DEV_IRQ_HOST_MSIX) {
-   int index = find_index_from_host_irq(assigned_dev, irq);
-   if (index  0)
-   goto out;
-   assigned_dev-guest_msix_entries[index].flags |=
-   KVM_ASSIGNED_MSIX_PENDING;
-   }
-
-   schedule_work(assigned_dev-interrupt_work);
-
-   if (assigned_dev-irq_requested_type  KVM_DEV_IRQ_GUEST_INTX) {
-   disable_irq_nosync(irq);
-   assigned_dev-host_irq_disabled = true;
+   if (assigned_dev-irq_requested_type 
+   KVM_DEV_IRQ_GUEST_INTX) {
+   disable_irq_nosync(irq);
+   assigned_dev-host_irq_disabled = true;
+   }
}
 
 out:
@@ -147,33 +135,23 @@ static void deassign_guest_irq(struct kvm *kvm,
assigned_dev-irq_requested_type = ~(KVM_DEV_IRQ_GUEST_MASK);
 }
 
-/* The function implicit hold kvm-lock mutex due to cancel_work_sync() */
 static void deassign_host_irq(struct kvm *kvm,
  struct kvm_assigned_dev_kernel *assigned_dev)
 {
/*
-* In kvm_free_device_irq, cancel_work_sync return true if:
-* 1. work is scheduled, and then cancelled.
-* 2. work callback is executed.
-*
-* The first one ensured that the irq is disabled and no more events
-* would happen. But for the second one, the irq may be enabled (e.g.
-* for MSI). So we disable irq here to prevent further events.
+* We disable irq here to prevent further events.
 *
 * Notice this maybe result in nested disable if the interrupt type is
 * INTx, but it's OK for we are going to free it.
 *
 * If this function is a part of VM destroy, please ensure that till
 * now, the 

Re: [PATCH 0/4] VFIO V5: Non-privileged user level PCI drivers

2010-11-01 Thread Etienne Martineau
Is this the complete patch set? Or maybe it assumes a pre-patch kernel 
tree. If so where can I get it?

thanks,
-Etienne


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Jan Kiszka
Am 01.11.2010 14:53, Roedel, Joerg wrote:
 On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote:
 Am 01.11.2010 14:21, Roedel, Joerg wrote:
 The registers rax and rbx contain non-canonical addresses (if
 interpreted as pointers). The instruction where this happens is a mov so
 I guess that the #GP is because of an non-canonical address.
 Can you find out the code-line where this happens and the exact
 assembler instruction? (haven't managed to decode the registers used).

 In pfn_to_dma_pte, line 710:

 if (!dma_pte_present(pte)) {
 8121de8c:   f6 03 03testb  $0x3,(%rbx)
 8121de8f:   0f 85 d8 00 00 00   jne8121df6d 
 pfn_to_dma_pte+0x154

 The first instruction raises the fault.
 
 Ok, so it seems that my understanding of the Code: field in the
 crash-message was wrong :)
 Anyway, the testb uses rbx as an address which has a non-canonical
 value. This means the the address of 'pte' is invalid. Since rax also
 contains a wrong address the 'parent' variable probably already contains
 the wrong address. Does the attached patch help?
 
 diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
 index 5619f85..ca46f24 100644
 --- a/include/linux/dma_remapping.h
 +++ b/include/linux/dma_remapping.h
 @@ -6,7 +6,7 @@
   */
  #define VTD_PAGE_SHIFT   (12)
  #define VTD_PAGE_SIZE(1UL  VTD_PAGE_SHIFT)
 -#define VTD_PAGE_MASK(((u64)-1)  VTD_PAGE_SHIFT)
 +#define VTD_PAGE_MASKu64)-1)  VTD_PAGE_SHIFT)  ((1ULL 
  52) - 1))
  #define VTD_PAGE_ALIGN(addr) (((addr) + VTD_PAGE_SIZE - 1)  VTD_PAGE_MASK)
  
  #define DMA_PTE_READ (1)
 

Crashes during early boot while initializing dmar. If you need the
trace, I could set up some debug console.

Jan



signature.asc
Description: OpenPGP digital signature


Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Joerg Roedel
On Mon, Nov 01, 2010 at 03:22:15PM +0100, Jan Kiszka wrote:
 Am 01.11.2010 14:53, Roedel, Joerg wrote:
  On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote:
  Am 01.11.2010 14:21, Roedel, Joerg wrote:
  The registers rax and rbx contain non-canonical addresses (if
  interpreted as pointers). The instruction where this happens is a mov so
  I guess that the #GP is because of an non-canonical address.
  Can you find out the code-line where this happens and the exact
  assembler instruction? (haven't managed to decode the registers used).
 
  In pfn_to_dma_pte, line 710:
 
  if (!dma_pte_present(pte)) {
  8121de8c:   f6 03 03testb  $0x3,(%rbx)
  8121de8f:   0f 85 d8 00 00 00   jne8121df6d 
  pfn_to_dma_pte+0x154
 
  The first instruction raises the fault.
  
  Ok, so it seems that my understanding of the Code: field in the
  crash-message was wrong :)
  Anyway, the testb uses rbx as an address which has a non-canonical
  value. This means the the address of 'pte' is invalid. Since rax also
  contains a wrong address the 'parent' variable probably already contains
  the wrong address. Does the attached patch help?
  
  diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
  index 5619f85..ca46f24 100644
  --- a/include/linux/dma_remapping.h
  +++ b/include/linux/dma_remapping.h
  @@ -6,7 +6,7 @@
*/
   #define VTD_PAGE_SHIFT (12)
   #define VTD_PAGE_SIZE  (1UL  VTD_PAGE_SHIFT)
  -#define VTD_PAGE_MASK  (((u64)-1)  VTD_PAGE_SHIFT)
  +#define VTD_PAGE_MASK  u64)-1)  VTD_PAGE_SHIFT)  ((1ULL 
   52) - 1))
   #define VTD_PAGE_ALIGN(addr)   (((addr) + VTD_PAGE_SIZE - 1)  
  VTD_PAGE_MASK)
   
   #define DMA_PTE_READ (1)
  
 
 Crashes during early boot while initializing dmar. If you need the
 trace, I could set up some debug console.

Hmm, no. This was only a guess. The VTD_PAGE_MASK does not mask out the
bits 52-63 of the pte. According to the VT-d spec it is allowed to set
these bits, some are marked as AVL and some have special meanings. If a
pte has one of these bits set the phys_addr calculated will be wrong and
the virt_addr calculated from it too (probably non-canonical, leading to
the GPF). 

Probably masking out these bits in dma_pte_addr helps. 

Joerg

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/2] Minimal RAM API support

2010-11-01 Thread Alex Williamson
v2:

 - Move to Makefile.objs
 - Move structures to memory.c and create a callback function
 - Fix memory leak

I haven't moved to the state parameter because there should only
be a single instance of this per VM.  The state parameter seems
like it would add complications in setup and function calling, but
maybe point me to an example if I'm off base.  Thanks,

Alex

v1:

For VFIO based device assignment, we need to know what guest memory
areas are actual RAM.  RAMBlocks have long since become a grab bag
of misc allocations, so aren't effective for this.  Anthony has had
a RAM API in mind for a while now that addresses this problem.  This
implements just enough of it so that we have an interface to get
actual guest memory physical addresses to setup the host IOMMU.  We
can continue building a full RAM API on top of this stub.

Anthony, feel free to add copyright to memory.c as it's based on
your initial implementation.  I had to add something since the file
in your branch just copies a header with Frabrice's copywrite.
Thanks,

Alex

---

Alex Williamson (2):
  RAM API: Make use of it for x86 PC
  Minimal RAM API support


 Makefile.objs |1 +
 cpu-common.h  |2 +
 hw/pc.c   |   12 +++---
 memory.c  |  109 +
 memory.h  |   18 +
 5 files changed, 136 insertions(+), 6 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] Minimal RAM API support

2010-11-01 Thread Alex Williamson
This adds a minimum chunk of Anthony's RAM API support so that we
can identify actual VM RAM versus all the other things that make
use of qemu_ram_alloc.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 Makefile.objs |1 +
 cpu-common.h  |2 +
 memory.c  |  109 +
 memory.h  |   18 +
 4 files changed, 130 insertions(+), 0 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h

diff --git a/Makefile.objs b/Makefile.objs
index f07fb01..33fae0b 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -154,6 +154,7 @@ hw-obj-y += vl.o loader.o
 hw-obj-y += virtio.o virtio-console.o
 hw-obj-y += fw_cfg.o pci.o pci_host.o pcie_host.o
 hw-obj-y += watchdog.o
+hw-obj-y += memory.o
 hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
 hw-obj-$(CONFIG_ECC) += ecc.o
 hw-obj-$(CONFIG_NAND) += nand.o
diff --git a/cpu-common.h b/cpu-common.h
index a543b5d..6aa2738 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -23,6 +23,8 @@
 /* address in the RAM (different from a physical address) */
 typedef unsigned long ram_addr_t;
 
+#include memory.h
+
 /* memory API */
 
 typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, 
uint32_t value);
diff --git a/memory.c b/memory.c
new file mode 100644
index 000..2895082
--- /dev/null
+++ b/memory.c
@@ -0,0 +1,109 @@
+/*
+ * RAM API
+ *
+ *  Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamson alex.william...@redhat.com
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see http://www.gnu.org/licenses/.
+ */
+#include memory.h
+#include range.h
+
+typedef struct QemuRamSlot {
+target_phys_addr_t start_addr;
+ram_addr_t size;
+ram_addr_t offset;
+QLIST_ENTRY(QemuRamSlot) next;
+} QemuRamSlot;
+
+typedef struct QemuRamSlots {
+QLIST_HEAD(slots, QemuRamSlot) slots;
+} QemuRamSlots;
+
+static QemuRamSlots ram_slots = { .slots = QLIST_HEAD_INITIALIZER(ram_slots) };
+
+static QemuRamSlot *qemu_ram_find_slot(target_phys_addr_t start_addr,
+   ram_addr_t size)
+{
+QemuRamSlot *slot;
+
+QLIST_FOREACH(slot, ram_slots.slots, next) {
+if (slot-start_addr == start_addr  slot-size == size) {
+return slot;
+}
+
+if (ranges_overlap(start_addr, size, slot-start_addr, slot-size)) {
+abort();
+}
+}
+
+return NULL;
+}
+
+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+   ram_addr_t phys_offset)
+{
+QemuRamSlot *slot;
+
+if (!size) {
+return -EINVAL;
+}
+
+assert(!qemu_ram_find_slot(start_addr, size));
+
+slot = qemu_mallocz(sizeof(QemuRamSlot));
+
+slot-start_addr = start_addr;
+slot-size = size;
+slot-offset = phys_offset;
+
+QLIST_INSERT_HEAD(ram_slots.slots, slot, next);
+
+cpu_register_physical_memory(slot-start_addr, slot-size, slot-offset);
+
+return 0;
+}
+
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size)
+{
+QemuRamSlot *slot;
+
+if (!size) {
+return;
+}
+
+slot = qemu_ram_find_slot(start_addr, size);
+assert(slot != NULL);
+
+QLIST_REMOVE(slot, next);
+qemu_free(slot);
+cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED);
+
+return;
+}
+
+int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn)
+{
+QemuRamSlot *slot;
+
+QLIST_FOREACH(slot, ram_slots.slots, next) {
+int ret = fn(opaque, slot-start_addr, slot-size, slot-offset);
+if (ret) {
+return ret;
+}
+}
+return 0;
+}
diff --git a/memory.h b/memory.h
new file mode 100644
index 000..0c17ff9
--- /dev/null
+++ b/memory.h
@@ -0,0 +1,18 @@
+#ifndef QEMU_MEMORY_H
+#define QEMU_MEMORY_H
+
+#include qemu-common.h
+#include cpu-common.h
+
+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+  ram_addr_t phys_offset);
+
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size);
+
+typedef int (*qemu_ram_for_each_slot_fn)(void *opaque,
+ target_phys_addr_t start_addr,
+ ram_addr_t size,
+ ram_addr_t phys_offset);
+
+int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn);
+#endif

--
To 

[PATCH v2 2/2] RAM API: Make use of it for x86 PC

2010-11-01 Thread Alex Williamson
Register the actual VM RAM using the new API

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/pc.c |   12 ++--
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 69b13bf..0ea6d10 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -912,14 +912,14 @@ void pc_memory_init(ram_addr_t ram_size,
 /* allocate RAM */
 ram_addr = qemu_ram_alloc(NULL, pc.ram,
   below_4g_mem_size + above_4g_mem_size);
-cpu_register_physical_memory(0, 0xa, ram_addr);
-cpu_register_physical_memory(0x10,
- below_4g_mem_size - 0x10,
- ram_addr + 0x10);
+
+qemu_ram_register(0, 0xa, ram_addr);
+qemu_ram_register(0x10, below_4g_mem_size - 0x10,
+  ram_addr + 0x10);
 #if TARGET_PHYS_ADDR_BITS  32
 if (above_4g_mem_size  0) {
-cpu_register_physical_memory(0x1ULL, above_4g_mem_size,
- ram_addr + below_4g_mem_size);
+qemu_ram_register(0x1ULL, above_4g_mem_size,
+  ram_addr + below_4g_mem_size);
 }
 #endif
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-11-01 Thread Michael S. Tsirkin
On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 PCI 2.3 allows to generically disable IRQ sources at device level. This
 enables us to share IRQs of such devices between on the host side when
 passing them to a guest.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  include/linux/kvm_host.h |1 +
  virt/kvm/assigned-dev.c  |  153 +
  2 files changed, 140 insertions(+), 14 deletions(-)
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index df5497f..fcdc849 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
   unsigned int entries_nr;
   int host_irq;
   bool host_irq_disabled;
 + bool pci_2_3;
   struct msix_entry *host_msix_entries;
   int guest_irq;
   struct kvm_guest_msix_entry *guest_msix_entries;
 diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
 index d3ddfea..411643c 100644
 --- a/virt/kvm/assigned-dev.c
 +++ b/virt/kvm/assigned-dev.c
 @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct 
 kvm_assigned_dev_kernel
   return index;
  }
  
 +/*
 + * Verify that the device supports Interrupt Disable bit in command register,
 + * per PCI 2.3, by flipping this bit and reading it back: this bit was 
 readonly
 + * in PCI 2.2.
 + */
 +static bool pci_2_3_supported(struct pci_dev *pdev)
 +{
 + u16 orig, new;
 + bool supported = false;
 +
 + pci_block_user_cfg_access(pdev);
 + pci_read_config_word(pdev, PCI_COMMAND, orig);
 + pci_write_config_word(pdev, PCI_COMMAND,
 +   orig ^ PCI_COMMAND_INTX_DISABLE);
 + pci_read_config_word(pdev, PCI_COMMAND, new);
 + 
 + /*
 +  * There's no way to protect against
 +  * hardware bugs or detect them reliably, but as long as we know
 +  * what the value should be, let's go ahead and check it.
 +  */
 + if ((new ^ orig)  ~PCI_COMMAND_INTX_DISABLE) {
 + dev_err(pdev-dev, Command changed from 0x%x to 0x%x: 
 + driver or HW bug?\n, orig, new);
 + goto out;
 + }
 + if (!((new ^ orig)  PCI_COMMAND_INTX_DISABLE)) {
 + dev_warn(pdev-dev, Device does not support 
 +  disabling interrupts: unable to bind.\n);
 + goto out;
 + }
 + supported = true;
 +
 + /* Now restore the original value. */
 + pci_write_config_word(pdev, PCI_COMMAND, orig);
 +
 +out:
 + pci_unblock_user_cfg_access(pdev);
 + return supported;
 +}
 +
 +static void
 +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
 +{
 + u32 cmd_status_dword;
 + u16 origcmd, newcmd;
 +
 + /*
 +  * We do a single dword read to retrieve both command and status.
 +  * Document assumptions that make this possible.
 +  */
 + BUILD_BUG_ON(PCI_COMMAND % 4);
 + BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
 +
 + pci_block_user_cfg_access(dev);
 +
 + /*
 +  * Read both command and status registers in a single 32-bit operation.
 +  * Note: we could cache the value for command and move the status read
 +  * out of the lock if there was a way to get notified of user changes
 +  * to command register through sysfs. Should be good for shared irqs.
 +  */
 + pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword);
 + origcmd = cmd_status_dword;
 +
 + if (irq_status) {
 + /*
 + * Check interrupt status register to see whether our device 
 triggered
 + * the interrupt.
 + */
 + *irq_status = (cmd_status_dword  16)  PCI_STATUS_INTERRUPT;
 + if (*irq_status == 0)
 + goto done;
 + }
 +
 + newcmd = origcmd  ~PCI_COMMAND_INTX_DISABLE;
 + if (mask)
 + newcmd |= PCI_COMMAND_INTX_DISABLE;
 + if (newcmd != origcmd)
 + pci_write_config_word(dev, PCI_COMMAND, newcmd);
 +
 +done:
 + pci_unblock_user_cfg_access(dev);
 +}
 +

Let's return irq_status instead of returning through a pointer?
Will save a branch and generally make the code a bit cleaner.

  static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
  {
   struct kvm_assigned_dev_kernel *assigned_dev =
   (struct kvm_assigned_dev_kernel *) dev_id;
 + int ret = IRQ_HANDLED;
   unsigned long flags;
  
   spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
 @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void 
 *dev_id)
   guest_entries[i].vector, 1);
   }
   } else {
 - kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id,
 - assigned_dev-guest_irq, 1);
 -
   if (assigned_dev-irq_requested_type 
   KVM_DEV_IRQ_GUEST_INTX) {
 -   

Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Jan Kiszka
Am 01.11.2010 15:35, Joerg Roedel wrote:
 On Mon, Nov 01, 2010 at 03:22:15PM +0100, Jan Kiszka wrote:
 Am 01.11.2010 14:53, Roedel, Joerg wrote:
 On Mon, Nov 01, 2010 at 09:25:00AM -0400, Jan Kiszka wrote:
 Am 01.11.2010 14:21, Roedel, Joerg wrote:
 The registers rax and rbx contain non-canonical addresses (if
 interpreted as pointers). The instruction where this happens is a mov so
 I guess that the #GP is because of an non-canonical address.
 Can you find out the code-line where this happens and the exact
 assembler instruction? (haven't managed to decode the registers used).

 In pfn_to_dma_pte, line 710:

 if (!dma_pte_present(pte)) {
 8121de8c:   f6 03 03testb  $0x3,(%rbx)
 8121de8f:   0f 85 d8 00 00 00   jne8121df6d 
 pfn_to_dma_pte+0x154

 The first instruction raises the fault.

 Ok, so it seems that my understanding of the Code: field in the
 crash-message was wrong :)
 Anyway, the testb uses rbx as an address which has a non-canonical
 value. This means the the address of 'pte' is invalid. Since rax also
 contains a wrong address the 'parent' variable probably already contains
 the wrong address. Does the attached patch help?

 diff --git a/include/linux/dma_remapping.h b/include/linux/dma_remapping.h
 index 5619f85..ca46f24 100644
 --- a/include/linux/dma_remapping.h
 +++ b/include/linux/dma_remapping.h
 @@ -6,7 +6,7 @@
   */
  #define VTD_PAGE_SHIFT (12)
  #define VTD_PAGE_SIZE  (1UL  VTD_PAGE_SHIFT)
 -#define VTD_PAGE_MASK  (((u64)-1)  VTD_PAGE_SHIFT)
 +#define VTD_PAGE_MASK  u64)-1)  VTD_PAGE_SHIFT)  ((1ULL 
  52) - 1))
  #define VTD_PAGE_ALIGN(addr)   (((addr) + VTD_PAGE_SIZE - 1)  
 VTD_PAGE_MASK)
  
  #define DMA_PTE_READ (1)


 Crashes during early boot while initializing dmar. If you need the
 trace, I could set up some debug console.
 
 Hmm, no. This was only a guess. The VTD_PAGE_MASK does not mask out the
 bits 52-63 of the pte. According to the VT-d spec it is allowed to set
 these bits, some are marked as AVL and some have special meanings. If a
 pte has one of these bits set the phys_addr calculated will be wrong and
 the virt_addr calculated from it too (probably non-canonical, leading to
 the GPF). 
 
 Probably masking out these bits in dma_pte_addr helps. 
 

Nope. But I just noticed a fatal thinko in my fix to
intel_iommu_attach_device - probably that was the key. Need to boot the
test kernel...

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-11-01 Thread Jan Kiszka
Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
 On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 PCI 2.3 allows to generically disable IRQ sources at device level. This
 enables us to share IRQs of such devices between on the host side when
 passing them to a guest.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  include/linux/kvm_host.h |1 +
  virt/kvm/assigned-dev.c  |  153 
 +
  2 files changed, 140 insertions(+), 14 deletions(-)

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index df5497f..fcdc849 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
  unsigned int entries_nr;
  int host_irq;
  bool host_irq_disabled;
 +bool pci_2_3;
  struct msix_entry *host_msix_entries;
  int guest_irq;
  struct kvm_guest_msix_entry *guest_msix_entries;
 diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
 index d3ddfea..411643c 100644
 --- a/virt/kvm/assigned-dev.c
 +++ b/virt/kvm/assigned-dev.c
 @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct 
 kvm_assigned_dev_kernel
  return index;
  }
  
 +/*
 + * Verify that the device supports Interrupt Disable bit in command 
 register,
 + * per PCI 2.3, by flipping this bit and reading it back: this bit was 
 readonly
 + * in PCI 2.2.
 + */
 +static bool pci_2_3_supported(struct pci_dev *pdev)
 +{
 +u16 orig, new;
 +bool supported = false;
 +
 +pci_block_user_cfg_access(pdev);
 +pci_read_config_word(pdev, PCI_COMMAND, orig);
 +pci_write_config_word(pdev, PCI_COMMAND,
 +  orig ^ PCI_COMMAND_INTX_DISABLE);
 +pci_read_config_word(pdev, PCI_COMMAND, new);
 +
 +/*
 + * There's no way to protect against
 + * hardware bugs or detect them reliably, but as long as we know
 + * what the value should be, let's go ahead and check it.
 + */
 +if ((new ^ orig)  ~PCI_COMMAND_INTX_DISABLE) {
 +dev_err(pdev-dev, Command changed from 0x%x to 0x%x: 
 +driver or HW bug?\n, orig, new);
 +goto out;
 +}
 +if (!((new ^ orig)  PCI_COMMAND_INTX_DISABLE)) {
 +dev_warn(pdev-dev, Device does not support 
 + disabling interrupts: unable to bind.\n);
 +goto out;
 +}
 +supported = true;
 +
 +/* Now restore the original value. */
 +pci_write_config_word(pdev, PCI_COMMAND, orig);
 +
 +out:
 +pci_unblock_user_cfg_access(pdev);
 +return supported;
 +}
 +
 +static void
 +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
 +{
 +u32 cmd_status_dword;
 +u16 origcmd, newcmd;
 +
 +/*
 + * We do a single dword read to retrieve both command and status.
 + * Document assumptions that make this possible.
 + */
 +BUILD_BUG_ON(PCI_COMMAND % 4);
 +BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
 +
 +pci_block_user_cfg_access(dev);
 +
 +/*
 + * Read both command and status registers in a single 32-bit operation.
 + * Note: we could cache the value for command and move the status read
 + * out of the lock if there was a way to get notified of user changes
 + * to command register through sysfs. Should be good for shared irqs.
 + */
 +pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword);
 +origcmd = cmd_status_dword;
 +
 +if (irq_status) {
 +/*
 +* Check interrupt status register to see whether our device 
 triggered
 +* the interrupt.
 +*/
 +*irq_status = (cmd_status_dword  16)  PCI_STATUS_INTERRUPT;
 +if (*irq_status == 0)
 +goto done;
 +}
 +
 +newcmd = origcmd  ~PCI_COMMAND_INTX_DISABLE;
 +if (mask)
 +newcmd |= PCI_COMMAND_INTX_DISABLE;
 +if (newcmd != origcmd)
 +pci_write_config_word(dev, PCI_COMMAND, newcmd);
 +
 +done:
 +pci_unblock_user_cfg_access(dev);
 +}
 +
 
 Let's return irq_status instead of returning through a pointer?
 Will save a branch and generally make the code a bit cleaner.

I'm open for a better API suggestion, but the current one goes like
this: if irq_status is non-null, only mask the IRQ if the status bit
indicates an interrupt. But we also have a user that wants to mask
unconditionally.

 
  static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
  {
  struct kvm_assigned_dev_kernel *assigned_dev =
  (struct kvm_assigned_dev_kernel *) dev_id;
 +int ret = IRQ_HANDLED;
  unsigned long flags;
  
  spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
 @@ -83,19 +169,34 @@ static irqreturn_t kvm_assigned_dev_intr(int irq, void 
 *dev_id)
  guest_entries[i].vector, 1);
  }
  } else {
 -kvm_set_irq(assigned_dev-kvm, 

Re: [PATCH 0/3] Launch other test during migration

2010-11-01 Thread Michael Goldish
On 09/25/2010 11:36 AM, Jason Wang wrote:
 We could give a further test of migration by launch test during migartion. So
 the following series implements:
 
 - A simple class to run a specified test in the background which could be used
 to launch other test during migartion. Its design is rather simple and its 
 usage
 is a little tricky, but it work well.
 - Two sample tests which take advantages of the background class: Test reboot
 during guest migration and test file_transfer during guest migration.
 
 In the future, we could even lauch autotest client test during guest migation.
 
 ---
 
 Jason Wang (3):
   KVM Test: Introduce a helper class to run a test in the background
   KVM test: Test reboot during migration
   KVM test: Test the file transfer during migartion
 
 
  client/tests/kvm/kvm_test_utils.py |   44 +++
  .../kvm/tests/migration_with_file_transfer.py  |   59 
 
  client/tests/kvm/tests/migration_with_reboot.py|   45 +++
  client/tests/kvm/tests_base.cfg.sample |   12 
  4 files changed, 159 insertions(+), 1 deletions(-)
  create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py
  create mode 100644 client/tests/kvm/tests/migration_with_reboot.py

It seems to me that this method is only applicable to tests/functions
that don't require a VM object (i.e. that require only a shell session
object).  kvm_test_utils.reboot() operates on a VM object, and the same
VM is destroyed by migrate() which runs in the background, so eventually
reboot() tries logging into a destroyed VM, which fails because
vm.get_address() fails.  Any monitor operation will also fail.  If the
autotest wrapper requires a VM object (currently it does) then it can't
be used either.

An alternative (somewhat ugly) way to migrate in the background is to
pass a boolean 'migrate' flag to various functions/tests, such as
reboot() and run_autotest().  If 'migrate' is True, these functions will
do something like

vm = kvm_test_utils.migrate(vm, ...)

in their waiting loops, where wait_for() is normally used.  This
guarantees that 'vm' is always a valid VM object.  For example:

# Log in after reboot
while time.time()  end_time:
if migrate_in_bg:
vm = kvm_test_utils.migrate(vm, ...)
session = vm.remote_login()
if session:
break
time.sleep(2)

This is much longer than the usual wait_for(...) but it does the job.
What do you think?
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv3] Add support for async page fault to qemu

2010-11-01 Thread Marcelo Tosatti
On Sun, Oct 24, 2010 at 02:27:55PM +0200, Gleb Natapov wrote:
 Add save/restore of MSR for migration and cpuid bit.
 
 Signed-off-by: Gleb Natapov g...@redhat.com
 --
  v1-v2
   - use vmstate subsection to migrate new msr.
  v2-v3
   - rebase onto uq/master
   - protect use of MSR_KVM_ASYNC_PF_EN with ifdef KVM_CAP_ASYNC_PF

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-11-01 Thread Michael S. Tsirkin
On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote:
 Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
  On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  PCI 2.3 allows to generically disable IRQ sources at device level. This
  enables us to share IRQs of such devices between on the host side when
  passing them to a guest.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  ---
   include/linux/kvm_host.h |1 +
   virt/kvm/assigned-dev.c  |  153 
  +
   2 files changed, 140 insertions(+), 14 deletions(-)
 
  diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
  index df5497f..fcdc849 100644
  --- a/include/linux/kvm_host.h
  +++ b/include/linux/kvm_host.h
  @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
 unsigned int entries_nr;
 int host_irq;
 bool host_irq_disabled;
  +  bool pci_2_3;
 struct msix_entry *host_msix_entries;
 int guest_irq;
 struct kvm_guest_msix_entry *guest_msix_entries;
  diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
  index d3ddfea..411643c 100644
  --- a/virt/kvm/assigned-dev.c
  +++ b/virt/kvm/assigned-dev.c
  @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct 
  kvm_assigned_dev_kernel
 return index;
   }
   
  +/*
  + * Verify that the device supports Interrupt Disable bit in command 
  register,
  + * per PCI 2.3, by flipping this bit and reading it back: this bit was 
  readonly
  + * in PCI 2.2.
  + */
  +static bool pci_2_3_supported(struct pci_dev *pdev)
  +{
  +  u16 orig, new;
  +  bool supported = false;
  +
  +  pci_block_user_cfg_access(pdev);
  +  pci_read_config_word(pdev, PCI_COMMAND, orig);
  +  pci_write_config_word(pdev, PCI_COMMAND,
  +orig ^ PCI_COMMAND_INTX_DISABLE);
  +  pci_read_config_word(pdev, PCI_COMMAND, new);
  +  
  +  /*
  +   * There's no way to protect against
  +   * hardware bugs or detect them reliably, but as long as we know
  +   * what the value should be, let's go ahead and check it.
  +   */
  +  if ((new ^ orig)  ~PCI_COMMAND_INTX_DISABLE) {
  +  dev_err(pdev-dev, Command changed from 0x%x to 0x%x: 
  +  driver or HW bug?\n, orig, new);
  +  goto out;
  +  }
  +  if (!((new ^ orig)  PCI_COMMAND_INTX_DISABLE)) {
  +  dev_warn(pdev-dev, Device does not support 
  +   disabling interrupts: unable to bind.\n);
  +  goto out;
  +  }
  +  supported = true;
  +
  +  /* Now restore the original value. */
  +  pci_write_config_word(pdev, PCI_COMMAND, orig);
  +
  +out:
  +  pci_unblock_user_cfg_access(pdev);
  +  return supported;
  +}
  +
  +static void
  +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
  +{
  +  u32 cmd_status_dword;
  +  u16 origcmd, newcmd;
  +
  +  /*
  +   * We do a single dword read to retrieve both command and status.
  +   * Document assumptions that make this possible.
  +   */
  +  BUILD_BUG_ON(PCI_COMMAND % 4);
  +  BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
  +
  +  pci_block_user_cfg_access(dev);
  +
  +  /*
  +   * Read both command and status registers in a single 32-bit operation.
  +   * Note: we could cache the value for command and move the status read
  +   * out of the lock if there was a way to get notified of user changes
  +   * to command register through sysfs. Should be good for shared irqs.
  +   */
  +  pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword);
  +  origcmd = cmd_status_dword;
  +
  +  if (irq_status) {
  +  /*
  +  * Check interrupt status register to see whether our device 
  triggered
  +  * the interrupt.
  +  */
  +  *irq_status = (cmd_status_dword  16)  PCI_STATUS_INTERRUPT;
  +  if (*irq_status == 0)
  +  goto done;
  +  }
  +
  +  newcmd = origcmd  ~PCI_COMMAND_INTX_DISABLE;
  +  if (mask)
  +  newcmd |= PCI_COMMAND_INTX_DISABLE;
  +  if (newcmd != origcmd)
  +  pci_write_config_word(dev, PCI_COMMAND, newcmd);
  +
  +done:
  +  pci_unblock_user_cfg_access(dev);
  +}
  +
  
  Let's return irq_status instead of returning through a pointer?
  Will save a branch and generally make the code a bit cleaner.
 
 I'm open for a better API suggestion,

Maybe use separate functions for this.
pci_2_3_mask_irq
pci_2_3_unmask_irq
pci_2_3_disable_irq

Common code can go into subfunctions.

 but the current one goes like
 this: if irq_status is non-null, only mask the IRQ if the status bit
 indicates an interrupt. But we also have a user that wants to mask
 unconditionally.

Why do you ever want to do that?

  
   static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
   {
 struct kvm_assigned_dev_kernel *assigned_dev =
 (struct kvm_assigned_dev_kernel *) dev_id;
  +  int ret = IRQ_HANDLED;
 unsigned long flags;
   
 spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
  @@ -83,19 

Re: [PATCH 1/2] msix: Allow msix_init on a device with existing MSI-X capability

2010-11-01 Thread Alex Williamson
On Thu, 2010-10-28 at 17:00 +0200, Avi Kivity wrote:
 On 10/23/2010 06:55 PM, Alex Williamson wrote:
  On Sat, 2010-10-23 at 18:18 +0200, Michael S. Tsirkin wrote:
On Fri, Oct 22, 2010 at 02:40:31PM -0600, Alex Williamson wrote:
  To enable common msix support to be used with pass through devices,
  don't attempt to change the BAR if the device already has an
  MSI-X capability.  This also means we want to pay closer attention
  to the size when we map the msix table page, as it isn't necessarily
  covering the entire end of the BAR.

  Signed-off-by: Alex Williamsonalex.william...@redhat.com
  ---

   hw/msix.c |   67 
   +++--
   1 files changed, 38 insertions(+), 29 deletions(-)

  diff --git a/hw/msix.c b/hw/msix.c
  index 43efbd2..4122395 100644
  --- a/hw/msix.c
  +++ b/hw/msix.c
  @@ -167,35 +167,43 @@ static int msix_add_config(struct PCIDevice 
   *pdev, unsigned short nentries,
   {
   int config_offset;
   uint8_t *config;
  -uint32_t new_size;

  -if (nentries  1 || nentries  PCI_MSIX_FLAGS_QSIZE + 1)
  -return -EINVAL;
  -if (bar_size  0x8000)
  -return -ENOSPC;
  -
  -/* Add space for MSI-X structures */
  -if (!bar_size) {
  -new_size = MSIX_PAGE_SIZE;
  -} else if (bar_size  MSIX_PAGE_SIZE) {
  -bar_size = MSIX_PAGE_SIZE;
  -new_size = MSIX_PAGE_SIZE * 2;
  -} else {
  -new_size = bar_size * 2;
  -}
  -
  -pdev-msix_bar_size = new_size;
  -config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX, 
   MSIX_CAP_LENGTH);
  -if (config_offset  0)
  -return config_offset;
  -config = pdev-config + config_offset;
  -
  -pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
  -/* Table on top of BAR */
  -pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
  -/* Pending bits on top of that */
  -pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + 
   MSIX_PAGE_PENDING) |
  - bar_nr);
  +pdev-msix_bar_size = bar_size;
  +
  +config_offset = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
  +
  +if (!config_offset) {
  +uint32_t new_size;
  +
  +if (nentries  1 || nentries  PCI_MSIX_FLAGS_QSIZE + 1)
  +return -EINVAL;
  +if (bar_size  0x8000)
  +return -ENOSPC;
  +
  +/* Add space for MSI-X structures */
  +if (!bar_size) {
  +new_size = MSIX_PAGE_SIZE;
  +} else if (bar_size  MSIX_PAGE_SIZE) {
  +bar_size = MSIX_PAGE_SIZE;
  +new_size = MSIX_PAGE_SIZE * 2;
  +} else {
  +new_size = bar_size * 2;
  +}
  +
  +pdev-msix_bar_size = new_size;
  +config_offset = pci_add_capability(pdev, PCI_CAP_ID_MSIX,
  +   MSIX_CAP_LENGTH);
  +if (config_offset  0)
  +return config_offset;
  +config = pdev-config + config_offset;
  +
  +pci_set_word(config + PCI_MSIX_FLAGS, nentries - 1);
  +/* Table on top of BAR */
  +pci_set_long(config + MSIX_TABLE_OFFSET, bar_size | bar_nr);
  +/* Pending bits on top of that */
  +pci_set_long(config + MSIX_PBA_OFFSET, (bar_size + 
   MSIX_PAGE_PENDING) |
  + bar_nr);
  +}
   pdev-msix_cap = config_offset;
   /* Make flags bit writeable. */
   pdev-wmask[config_offset + MSIX_CONTROL_OFFSET] |= 
   MSIX_ENABLE_MASK |
  @@ -337,7 +345,8 @@ void msix_mmio_map(PCIDevice *d, int region_num,
   return;
   if (size= offset)
   return;
  -cpu_register_physical_memory(addr + offset, size - offset,
  +cpu_register_physical_memory(addr + offset,
  + MIN(size - offset, MSIX_PAGE_SIZE),
  
This is wrong I think, the table might not fit in a single page.
You would need to read table size out of from device config.
 
  That's true, but I was hoping to save that for later since we don't seem
  to be running into that problem yet.  Current device assignment code
  assumes a single page, and I haven't heard of anyone with a vector table
  that exceeds that yet.  Thanks,
 
 
 Ok; applied.  Please add some warning if the condition happens so the 
 breakage is at least not silent.

Looks like this is already handled before we even get here.  msix_init()
checks nentries against the max supported in a single page and returns
-EINVAL if over.  So I think we're safely covered until we add support
for more pages.  Thanks,

Alex

--
To unsubscribe 

Re: [PATCH 3/3] KVM test: Test the file transfer during migartion

2010-11-01 Thread Michael Goldish
On 09/25/2010 11:36 AM, Jason Wang wrote:
 This test just do the file transfer from host to guest during migartion in 
 order
 to check whether the nic/block state could be saved and loaded correctly.
 
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
  .../kvm/tests/migration_with_file_transfer.py  |   59 
 
  client/tests/kvm/tests_base.cfg.sample |7 ++
  2 files changed, 66 insertions(+), 0 deletions(-)
  create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py
 
 diff --git a/client/tests/kvm/tests/migration_with_file_transfer.py 
 b/client/tests/kvm/tests/migration_with_file_transfer.py
 new file mode 100644
 index 000..8a316bf
 --- /dev/null
 +++ b/client/tests/kvm/tests/migration_with_file_transfer.py
 @@ -0,0 +1,59 @@
 +import logging, time
 +from autotest_lib.client.common_lib import utils, error
 +import kvm_subprocess, kvm_test_utils, kvm_utils
 +
 +
 +def run_migration_with_file_transfer(test, params, env):
 +
 +KVM migration test:
 +1) Get a live VM and clone it.
 +2) Verify that the source VM supports migration.  If it does, proceed 
 with
 +the test.
 +3) Reboot the VM
 +4) Send a migration command to the source VM and wait until it's 
 finished.
 +5) Kill off the source VM.
 +6) Log into the destination VM after the migration is finished.
 +
 +@param test: kvm test object.
 +@param params: Dictionary with test parameters.
 +@param env: Dictionary with the test environment.
 +
 +
 +def transfer_test(vm, host_path, guest_path, timeout=120):
 +
 +vm.copy_files_to does not raise exception, so we need a wrapper
 +in order to make it to be used by BackgroundTest.
 +
 +if not vm.copy_files_to(host_path, guest_path, timeout=timeout):
 +raise error.TestError(Fail to do the file transfer!)
 +
 +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm))
 +timeout = int(params.get(login_timeout, 360))
 +session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
 +
 +mig_timeout = float(params.get(mig_timeout, 3600))
 +mig_protocol = params.get(migration_protocol, tcp)
 +
 +guest_path = params.get(guest_path, /tmp)
 +file_size = params.get(file_size, 1000)
 +transfer_timeout = int(params.get(transfer_timeout, 240))
 +bg = None
 +
 +try:
 +utils.run(dd if=/dev/zero of=/tmp/file bs=1M count=%s % file_size)

Wouldn't it be useful to read from /dev/urandom instead of /dev/zero,
and compare the md5 hash of the file generated on the host with the md5
hash of the file that arrives in the guest?

 +
 +# Transfer file from host to guest
 +bg = kvm_test_utils.BackgroundTest(transfer_test,
 +   (vm, /tmp/file, guest_path,
 +transfer_timeout))
 +bg.start()
 +
 +while bg.is_alive():
 +logging.info(File transfer is not ended, start a round of 
 migration ...)
 +# Migrate the VM
 +dest_vm = kvm_test_utils.migrate(vm, env, mig_timeout, 
 mig_protocol, False)
 +vm = dest_vm
 +finally:
 +if bg: bg.join()
 +session.close()
 +utils.run(rm -rf /tmp/zero)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-11-01 Thread Jan Kiszka
Am 01.11.2010 16:52, Michael S. Tsirkin wrote:
 On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote:
 Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
 On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 PCI 2.3 allows to generically disable IRQ sources at device level. This
 enables us to share IRQs of such devices between on the host side when
 passing them to a guest.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  include/linux/kvm_host.h |1 +
  virt/kvm/assigned-dev.c  |  153 
 +
  2 files changed, 140 insertions(+), 14 deletions(-)

 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index df5497f..fcdc849 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
unsigned int entries_nr;
int host_irq;
bool host_irq_disabled;
 +  bool pci_2_3;
struct msix_entry *host_msix_entries;
int guest_irq;
struct kvm_guest_msix_entry *guest_msix_entries;
 diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
 index d3ddfea..411643c 100644
 --- a/virt/kvm/assigned-dev.c
 +++ b/virt/kvm/assigned-dev.c
 @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct 
 kvm_assigned_dev_kernel
return index;
  }
  
 +/*
 + * Verify that the device supports Interrupt Disable bit in command 
 register,
 + * per PCI 2.3, by flipping this bit and reading it back: this bit was 
 readonly
 + * in PCI 2.2.
 + */
 +static bool pci_2_3_supported(struct pci_dev *pdev)
 +{
 +  u16 orig, new;
 +  bool supported = false;
 +
 +  pci_block_user_cfg_access(pdev);
 +  pci_read_config_word(pdev, PCI_COMMAND, orig);
 +  pci_write_config_word(pdev, PCI_COMMAND,
 +orig ^ PCI_COMMAND_INTX_DISABLE);
 +  pci_read_config_word(pdev, PCI_COMMAND, new);
 +  
 +  /*
 +   * There's no way to protect against
 +   * hardware bugs or detect them reliably, but as long as we know
 +   * what the value should be, let's go ahead and check it.
 +   */
 +  if ((new ^ orig)  ~PCI_COMMAND_INTX_DISABLE) {
 +  dev_err(pdev-dev, Command changed from 0x%x to 0x%x: 
 +  driver or HW bug?\n, orig, new);
 +  goto out;
 +  }
 +  if (!((new ^ orig)  PCI_COMMAND_INTX_DISABLE)) {
 +  dev_warn(pdev-dev, Device does not support 
 +   disabling interrupts: unable to bind.\n);
 +  goto out;
 +  }
 +  supported = true;
 +
 +  /* Now restore the original value. */
 +  pci_write_config_word(pdev, PCI_COMMAND, orig);
 +
 +out:
 +  pci_unblock_user_cfg_access(pdev);
 +  return supported;
 +}
 +
 +static void
 +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int *irq_status)
 +{
 +  u32 cmd_status_dword;
 +  u16 origcmd, newcmd;
 +
 +  /*
 +   * We do a single dword read to retrieve both command and status.
 +   * Document assumptions that make this possible.
 +   */
 +  BUILD_BUG_ON(PCI_COMMAND % 4);
 +  BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
 +
 +  pci_block_user_cfg_access(dev);
 +
 +  /*
 +   * Read both command and status registers in a single 32-bit operation.
 +   * Note: we could cache the value for command and move the status read
 +   * out of the lock if there was a way to get notified of user changes
 +   * to command register through sysfs. Should be good for shared irqs.
 +   */
 +  pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword);
 +  origcmd = cmd_status_dword;
 +
 +  if (irq_status) {
 +  /*
 +  * Check interrupt status register to see whether our device 
 triggered
 +  * the interrupt.
 +  */
 +  *irq_status = (cmd_status_dword  16)  PCI_STATUS_INTERRUPT;
 +  if (*irq_status == 0)
 +  goto done;
 +  }
 +
 +  newcmd = origcmd  ~PCI_COMMAND_INTX_DISABLE;
 +  if (mask)
 +  newcmd |= PCI_COMMAND_INTX_DISABLE;
 +  if (newcmd != origcmd)
 +  pci_write_config_word(dev, PCI_COMMAND, newcmd);
 +
 +done:
 +  pci_unblock_user_cfg_access(dev);
 +}
 +

 Let's return irq_status instead of returning through a pointer?
 Will save a branch and generally make the code a bit cleaner.

 I'm open for a better API suggestion,
 
 Maybe use separate functions for this.
 pci_2_3_mask_irq
 pci_2_3_unmask_irq
 pci_2_3_disable_irq
 
 Common code can go into subfunctions.
 
 but the current one goes like
 this: if irq_status is non-null, only mask the IRQ if the status bit
 indicates an interrupt. But we also have a user that wants to mask
 unconditionally.
 
 Why do you ever want to do that?

During device shutdown (was disable_irq in the unshared case so far).
The alternative would be to reset the device first, clearing any
potentially pending events. If we can reorder the reset, that is likely
better.

 

  static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
  {
struct kvm_assigned_dev_kernel *assigned_dev =
(struct kvm_assigned_dev_kernel *) 

Re: [PATCH 1/3] KVM: Fold assigned interrupt work into IRQ handler

2010-11-01 Thread Jan Kiszka
Am 01.11.2010 15:08, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 The complete work handler runs with assigned_dev_lock acquired and
 interrupts disabled, so there is nothing to gain pushing this work out
 of the actually IRQ handler. Fold them together.

Err, forget it. kvm_set_irq pulls in the famous pic lock, and that one
is not prepared to be called from IRQ context (lockdep just complained).

I will try to clean this up via a threaded IRQ handler, maybe using the
slow-path only for INTx-type IRQs and pushing MSIs directly from the
hard handler.

Jan

 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  include/linux/kvm_host.h |1 -
  virt/kvm/assigned-dev.c  |   71 
 +++---
  2 files changed, 23 insertions(+), 49 deletions(-)
 
 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
 index 462b982..df5497f 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -465,7 +465,6 @@ struct kvm_guest_msix_entry {
  
  struct kvm_assigned_dev_kernel {
   struct kvm_irq_ack_notifier ack_notifier;
 - struct work_struct interrupt_work;
   struct list_head list;
   int assigned_dev_id;
   int host_segnr;
 diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
 index 7c98928..5c1b56a 100644
 --- a/virt/kvm/assigned-dev.c
 +++ b/virt/kvm/assigned-dev.c
 @@ -55,18 +55,24 @@ static int find_index_from_host_irq(struct 
 kvm_assigned_dev_kernel
   return index;
  }
  
 -static void kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
 +static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
  {
 - struct kvm_assigned_dev_kernel *assigned_dev;
 - int i;
 -
 - assigned_dev = container_of(work, struct kvm_assigned_dev_kernel,
 - interrupt_work);
 + struct kvm_assigned_dev_kernel *assigned_dev =
 + (struct kvm_assigned_dev_kernel *) dev_id;
 + unsigned long flags;
  
 - spin_lock_irq(assigned_dev-assigned_dev_lock);
 + spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
   if (assigned_dev-irq_requested_type  KVM_DEV_IRQ_HOST_MSIX) {
   struct kvm_guest_msix_entry *guest_entries =
   assigned_dev-guest_msix_entries;
 + int index = find_index_from_host_irq(assigned_dev, irq);
 + int i;
 +
 + if (index  0)
 + goto out;
 +
 + guest_entries[index].flags |= KVM_ASSIGNED_MSIX_PENDING;
 +
   for (i = 0; i  assigned_dev-entries_nr; i++) {
   if (!(guest_entries[i].flags 
   KVM_ASSIGNED_MSIX_PENDING))
 @@ -76,33 +82,15 @@ static void 
 kvm_assigned_dev_interrupt_work_handler(struct work_struct *work)
   assigned_dev-irq_source_id,
   guest_entries[i].vector, 1);
   }
 - } else
 + } else {
   kvm_set_irq(assigned_dev-kvm, assigned_dev-irq_source_id,
   assigned_dev-guest_irq, 1);
  
 - spin_unlock_irq(assigned_dev-assigned_dev_lock);
 -}
 -
 -static irqreturn_t kvm_assigned_dev_intr(int irq, void *dev_id)
 -{
 - unsigned long flags;
 - struct kvm_assigned_dev_kernel *assigned_dev =
 - (struct kvm_assigned_dev_kernel *) dev_id;
 -
 - spin_lock_irqsave(assigned_dev-assigned_dev_lock, flags);
 - if (assigned_dev-irq_requested_type  KVM_DEV_IRQ_HOST_MSIX) {
 - int index = find_index_from_host_irq(assigned_dev, irq);
 - if (index  0)
 - goto out;
 - assigned_dev-guest_msix_entries[index].flags |=
 - KVM_ASSIGNED_MSIX_PENDING;
 - }
 -
 - schedule_work(assigned_dev-interrupt_work);
 -
 - if (assigned_dev-irq_requested_type  KVM_DEV_IRQ_GUEST_INTX) {
 - disable_irq_nosync(irq);
 - assigned_dev-host_irq_disabled = true;
 + if (assigned_dev-irq_requested_type 
 + KVM_DEV_IRQ_GUEST_INTX) {
 + disable_irq_nosync(irq);
 + assigned_dev-host_irq_disabled = true;
 + }
   }
  
  out:
 @@ -147,33 +135,23 @@ static void deassign_guest_irq(struct kvm *kvm,
   assigned_dev-irq_requested_type = ~(KVM_DEV_IRQ_GUEST_MASK);
  }
  
 -/* The function implicit hold kvm-lock mutex due to cancel_work_sync() */
  static void deassign_host_irq(struct kvm *kvm,
 struct kvm_assigned_dev_kernel *assigned_dev)
  {
   /*
 -  * In kvm_free_device_irq, cancel_work_sync return true if:
 -  * 1. work is scheduled, and then cancelled.
 -  * 2. work callback is executed.
 -  *
 -  * The first one ensured that the irq is disabled and no more events
 -  * would happen. But for the second one, the irq may be enabled (e.g.
 -  * for MSI). So we disable irq here to 

Re: Crash on kvm_iommu_map_pages

2010-11-01 Thread Jan Kiszka
Am 01.11.2010 16:29, Jan Kiszka wrote:
 Nope. But I just noticed a fatal thinko in my fix to
 intel_iommu_attach_device - probably that was the key. Need to boot the
 test kernel...

That was indeed the reason for this GPF: I blindly swapped the
problematic lines, releasing the wrong page. Sorry, false alarm this
time, will send out the corrected intel_iommu_attach_device fix later.

Jan



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/3] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-11-01 Thread Michael S. Tsirkin
On Mon, Nov 01, 2010 at 05:30:20PM +0100, Jan Kiszka wrote:
 Am 01.11.2010 16:52, Michael S. Tsirkin wrote:
  On Mon, Nov 01, 2010 at 04:41:08PM +0100, Jan Kiszka wrote:
  Am 01.11.2010 16:24, Michael S. Tsirkin wrote:
  On Mon, Nov 01, 2010 at 03:08:37PM +0100, Jan Kiszka wrote:
  From: Jan Kiszka jan.kis...@siemens.com
 
  PCI 2.3 allows to generically disable IRQ sources at device level. This
  enables us to share IRQs of such devices between on the host side when
  passing them to a guest.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  ---
   include/linux/kvm_host.h |1 +
   virt/kvm/assigned-dev.c  |  153 
  +
   2 files changed, 140 insertions(+), 14 deletions(-)
 
  diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
  index df5497f..fcdc849 100644
  --- a/include/linux/kvm_host.h
  +++ b/include/linux/kvm_host.h
  @@ -473,6 +473,7 @@ struct kvm_assigned_dev_kernel {
   unsigned int entries_nr;
   int host_irq;
   bool host_irq_disabled;
  +bool pci_2_3;
   struct msix_entry *host_msix_entries;
   int guest_irq;
   struct kvm_guest_msix_entry *guest_msix_entries;
  diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
  index d3ddfea..411643c 100644
  --- a/virt/kvm/assigned-dev.c
  +++ b/virt/kvm/assigned-dev.c
  @@ -55,10 +55,96 @@ static int find_index_from_host_irq(struct 
  kvm_assigned_dev_kernel
   return index;
   }
   
  +/*
  + * Verify that the device supports Interrupt Disable bit in command 
  register,
  + * per PCI 2.3, by flipping this bit and reading it back: this bit was 
  readonly
  + * in PCI 2.2.
  + */
  +static bool pci_2_3_supported(struct pci_dev *pdev)
  +{
  +u16 orig, new;
  +bool supported = false;
  +
  +pci_block_user_cfg_access(pdev);
  +pci_read_config_word(pdev, PCI_COMMAND, orig);
  +pci_write_config_word(pdev, PCI_COMMAND,
  +  orig ^ PCI_COMMAND_INTX_DISABLE);
  +pci_read_config_word(pdev, PCI_COMMAND, new);
  +
  +/*
  + * There's no way to protect against
  + * hardware bugs or detect them reliably, but as long as we know
  + * what the value should be, let's go ahead and check it.
  + */
  +if ((new ^ orig)  ~PCI_COMMAND_INTX_DISABLE) {
  +dev_err(pdev-dev, Command changed from 0x%x to 0x%x: 
  
  +driver or HW bug?\n, orig, new);
  +goto out;
  +}
  +if (!((new ^ orig)  PCI_COMMAND_INTX_DISABLE)) {
  +dev_warn(pdev-dev, Device does not support 
  + disabling interrupts: unable to bind.\n);
  +goto out;
  +}
  +supported = true;
  +
  +/* Now restore the original value. */
  +pci_write_config_word(pdev, PCI_COMMAND, orig);
  +
  +out:
  +pci_unblock_user_cfg_access(pdev);
  +return supported;
  +}
  +
  +static void
  +pci_2_3_mask_irq(struct pci_dev *dev, int mask, unsigned int 
  *irq_status)
  +{
  +u32 cmd_status_dword;
  +u16 origcmd, newcmd;
  +
  +/*
  + * We do a single dword read to retrieve both command and 
  status.
  + * Document assumptions that make this possible.
  + */
  +BUILD_BUG_ON(PCI_COMMAND % 4);
  +BUILD_BUG_ON(PCI_COMMAND + 2 != PCI_STATUS);
  +
  +pci_block_user_cfg_access(dev);
  +
  +/*
  + * Read both command and status registers in a single 32-bit 
  operation.
  + * Note: we could cache the value for command and move the 
  status read
  + * out of the lock if there was a way to get notified of user 
  changes
  + * to command register through sysfs. Should be good for shared 
  irqs.
  + */
  +pci_read_config_dword(dev, PCI_COMMAND, cmd_status_dword);
  +origcmd = cmd_status_dword;
  +
  +if (irq_status) {
  +/*
  +* Check interrupt status register to see whether our 
  device triggered
  +* the interrupt.
  +*/
  +*irq_status = (cmd_status_dword  16)  
  PCI_STATUS_INTERRUPT;
  +if (*irq_status == 0)
  +goto done;
  +}
  +
  +newcmd = origcmd  ~PCI_COMMAND_INTX_DISABLE;
  +if (mask)
  +newcmd |= PCI_COMMAND_INTX_DISABLE;
  +if (newcmd != origcmd)
  +pci_write_config_word(dev, PCI_COMMAND, newcmd);
  +
  +done:
  +pci_unblock_user_cfg_access(dev);
  +}
  +
 
  Let's return irq_status instead of returning through a pointer?
  Will save a branch and generally make the code a bit cleaner.
 
  I'm open for a better API suggestion,
  
  Maybe use separate functions for this.
  pci_2_3_mask_irq
  pci_2_3_unmask_irq
  pci_2_3_disable_irq
  
  Common code can go 

Re: TODO item: guest programmable mac/vlan filtering with macvtap

2010-11-01 Thread Dragos Tatulea
I have created a wiki page for this [1], also added to the networking
todo list [2]. No meaty information yet. But it's enough to start
working on it.

[1] - http://www.linux-kvm.org/page/GuestProgrammableMacVlanFiltering
[2] - http://www.linux-kvm.org/page/NetworkingTodo

-- Dragos
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] KVM: dirty logging optimization - double buffering

2010-11-01 Thread Marcelo Tosatti
On Wed, Oct 27, 2010 at 06:21:02PM +0900, Takuya Yoshikawa wrote:
 This patch series just change the way we allocate dirty bitmaps but don't
 change timing related issues.
 
  - Changelog
   I have not changed anything about patch 1 and 2 since I got
   looks good comment from Marcelo. Just rebased.
 
   Patch 3 has become simpler by rebasing on top of this series.
 
 Thanks,
   Takuya

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM: Mask KVM_GET_SUPPORTED_CPUID data with Linux cpuid info

2010-11-01 Thread Marcelo Tosatti
On Sun, Oct 24, 2010 at 03:38:46PM +0200, Avi Kivity wrote:
 This allows Linux to mask cpuid bits if, for example, nx is enabled on only
 some cpus.
 
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
  arch/x86/kvm/x86.c |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 48ce015..54fda7e 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2229,6 +2229,11 @@ out:
   return r;
  }
  
 +static void cpuid_mask(u32 *word, int wordnum)
 +{
 + *word = boot_cpu_data.x86_capability[wordnum];
 +}
 +
  static void do_cpuid_1_ent(struct kvm_cpuid_entry2 *entry, u32 function,
  u32 index)
  {
 @@ -2303,7 +2308,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 
 *entry, u32 function,
   break;
   case 1:
   entry-edx = kvm_supported_word0_x86_features;
 + cpuid_mask(entry-edx, 0);
   entry-ecx = kvm_supported_word4_x86_features;
 + cpuid_mask(entry-edx, 4);

ecx?

   /* we support x2apic emulation even if host does not support
* it since we emulate x2apic in software */
   entry-ecx |= F(X2APIC);
 @@ -2394,7 +2401,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 
 *entry, u32 function,
   break;
   case 0x8001:
   entry-edx = kvm_supported_word1_x86_features;
 + cpuid_mask(entry-edx, 1);
   entry-ecx = kvm_supported_word6_x86_features;
 + cpuid_mask(entry-edx, 6);

ecx?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] trace exit to userspace event

2010-11-01 Thread Marcelo Tosatti
On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote:
 Add tracepoint for userspace exit.
 
 Signed-off-by: Gleb Natapov g...@redhat.com
 ---
   ChangeLog:
   v1-v2
log error case too.

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm: add cast within kvm_clear_guest_page to fix warning

2010-11-01 Thread Marcelo Tosatti
On Wed, Oct 27, 2010 at 05:21:21PM +0200, Heiko Carstens wrote:
 From: Heiko Carstens heiko.carst...@de.ibm.com
 
 Fixes this:
 
   CC  arch/s390/kvm/../../../virt/kvm/kvm_main.o
 arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 
 'kvm_clear_guest_page':
 arch/s390/kvm/../../../virt/kvm/kvm_main.c:1224:2: warning: passing argument 
 3 of 'kvm_write_guest_page' makes pointer from integer without a cast
 arch/s390/kvm/../../../virt/kvm/kvm_main.c:1185:5: note: expected 'const void 
 *' but argument is of type 'long unsigned int'
 
 Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com
 ---
  virt/kvm/kvm_main.c |3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

Applied both, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] powerpc: kvm: fix information leak to userland

2010-11-01 Thread Marcelo Tosatti
On Sat, Oct 30, 2010 at 01:04:24PM +0400, Vasiliy Kulikov wrote:
 Structure kvm_ppc_pvinfo is copied to userland with flags and
 pad fields unitialized.  It leads to leaking of contents of
 kernel stack memory.
 
 Signed-off-by: Vasiliy Kulikov sego...@gmail.com
 ---
  I cannot compile this driver, so it is not tested at all.
 
  arch/powerpc/kvm/powerpc.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch v2] x86: kvm: x86: fix information leak to userland

2010-11-01 Thread Marcelo Tosatti
On Sat, Oct 30, 2010 at 10:54:47PM +0400, Vasiliy Kulikov wrote:
 Structures kvm_vcpu_events, kvm_debugregs, kvm_pit_state2 and
 kvm_clock_data are copied to userland with some padding and reserved
 fields unitialized.  It leads to leaking of contents of kernel stack
 memory.  We have to initialize them to zero.
 
 In patch v1 Jan Kiszka suggested to fill reserved fields with zeros
 instead of memset'ting the whole struct.  It makes sense as these
 fields are explicitly marked as padding.  No more fields need zeroing.
 
 Signed-off-by: Vasiliy Kulikov sego...@gmail.com
 ---
  Compile tesed only.
 
  arch/x86/kvm/x86.c |6 ++
  1 files changed, 6 insertions(+), 0 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM x86: remove memset, use vzalloc and don't assign the same value to a variable twice

2010-11-01 Thread Jesper Juhl
On Mon, 1 Nov 2010, Takuya Yoshikawa wrote:

 (2010/10/31 3:28), Jesper Juhl wrote:
  Hi,
  
  We can improve kvm_vm_ioctl_get_dirty_log() slightly by using vzalloc()
  rather than first allocating and then manually zero the memory with
  memset(). Also, while I was looking at this I noticed that we assign
 
 I personally prefer this new vzalloc() to vmalloc() + memset().
 
 Just from my interest, is there real performance difference not just
 the cleanup effect?  If so, we'd better do this for other places too.
 
There's definately a positive size impact for the generated object code 
and we save having to do the call to memset() and the cost of a vzalloc() 
call looks more or less the same as a call to vmalloc() to me.


snip
 
 This patch is not based on kvm.git, I guess.
 
Nope. It was generated against mainline git as of a couple of days ago. I 
can generate a version against kvm.git if you prefer.


  r = -ENOMEM;
  -   dirty_bitmap = vmalloc(n);
  +   dirty_bitmap = vzalloc(n);
  if (!dirty_bitmap)
  goto out;
  -   memset(dirty_bitmap, 0, n);
  
  -   r = -ENOMEM;
 
 This one is here because this belongs to a different code block from the
 previous one.  This keeps it easy to insert another codes in between these
 two blocks.  The optimization will be done at compile time.
 IIRC, I did like this based on Avi's advise.
 
Fair enough, I've dropped that bit from the patch. Updated version below 
(against current mainline git).


Signed-off-by: Jesper Juhl j...@chaosbits.net
---
 x86.c |3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2288ad8..624d4da 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3174,10 +3174,9 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
spin_unlock(kvm-mmu_lock);
 
r = -ENOMEM;
-   dirty_bitmap = vmalloc(n);
+   dirty_bitmap = vzalloc(n);
if (!dirty_bitmap)
goto out;
-   memset(dirty_bitmap, 0, n);
 
r = -ENOMEM;
slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);



-- 
Jesper Juhl j...@chaosbits.net http://www.chaosbits.net/
Plain text mails only, please  http://www.expita.com/nomime.html
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] kernel,cred,kvm,security - removing superfluous rcu_read_lock_held check

2010-11-01 Thread Jiri Olsa
hi,
the rcu_dereference_check is defined as

#define rcu_dereference_check(p, c) \
   __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)

so the caller does not need to specify rcu_read_lock_held() condition.

wbr,
jirka


Signed-off-by: Jiri Olsa jo...@redhat.com
---
 include/linux/cred.h |1 -
 include/linux/fdtable.h  |1 -
 include/linux/kvm_host.h |1 -
 kernel/exit.c|1 -
 kernel/pid.c |1 -
 kernel/rcutorture.c  |2 --
 security/keys/keyring.c  |1 -
 7 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 4aaeab3..a6b9afc 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -283,7 +283,6 @@ static inline void put_cred(const struct cred *_cred)
({  \
const struct task_struct *__t = (task); \
rcu_dereference_check(__t-real_cred,   \
- rcu_read_lock_held() ||   \
  task_is_dead(__t));   \
})
 
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index 133c0ba..df7e3cf 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -60,7 +60,6 @@ struct files_struct {
 
 #define rcu_dereference_check_fdtable(files, fdtfd) \
(rcu_dereference_check((fdtfd), \
-  rcu_read_lock_held() || \
   lockdep_is_held((files)-file_lock) || \
   atomic_read((files)-count) == 1 || \
   rcu_my_thread_group_empty()))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..a90a7e3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,7 +256,6 @@ void kvm_put_kvm(struct kvm *kvm);
 static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
 {
return rcu_dereference_check(kvm-memslots,
-   srcu_read_lock_held(kvm-srcu)
|| lockdep_is_held(kvm-slots_lock));
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index b194feb..f753342 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -85,7 +85,6 @@ static void __exit_signal(struct task_struct *tsk)
struct tty_struct *uninitialized_var(tty);
 
sighand = rcu_dereference_check(tsk-sighand,
-   rcu_read_lock_held() ||
lockdep_tasklist_lock_is_held());
spin_lock(sighand-siglock);
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..c02adda 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -402,7 +402,6 @@ struct task_struct *pid_task(struct pid *pid, enum pid_type 
type)
if (pid) {
struct hlist_node *first;
first = 
rcu_dereference_check(hlist_first_rcu(pid-tasks[type]),
- rcu_read_lock_held() ||
  lockdep_tasklist_lock_is_held());
if (first)
result = hlist_entry(first, struct task_struct, 
pids[(type)].node);
diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 9d8e8fb..0956a73 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -807,7 +807,6 @@ static void rcu_torture_timer(unsigned long unused)
idx = cur_ops-readlock();
completed = cur_ops-completed();
p = rcu_dereference_check(rcu_torture_current,
- rcu_read_lock_held() ||
  rcu_read_lock_bh_held() ||
  rcu_read_lock_sched_held() ||
  srcu_read_lock_held(srcu_ctl));
@@ -868,7 +867,6 @@ rcu_torture_reader(void *arg)
idx = cur_ops-readlock();
completed = cur_ops-completed();
p = rcu_dereference_check(rcu_torture_current,
- rcu_read_lock_held() ||
  rcu_read_lock_bh_held() ||
  rcu_read_lock_sched_held() ||
  srcu_read_lock_held(srcu_ctl));
diff --git a/security/keys/keyring.c b/security/keys/keyring.c
index d37f713..73c23f2 100644
--- a/security/keys/keyring.c
+++ b/security/keys/keyring.c
@@ -157,7 +157,6 @@ static void keyring_destroy(struct key *keyring)
}
 
klist = rcu_dereference_check(keyring-payload.subscriptions,
- rcu_read_lock_held() ||
  atomic_read(keyring-usage) == 0);
if (klist) {
for (loop = klist-nkeys - 1; loop = 0; loop--)
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More 

[PATCH 3/3] net - removing superfluous rcu_read_lock_held check

2010-11-01 Thread Jiri Olsa
hi,
the rcu_dereference_check is defined as

#define rcu_dereference_check(p, c) \
   __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)

so the caller does not need to specify rcu_read_lock_held() condition.

wbr,
jirka


Signed-off-by: Jiri Olsa jo...@redhat.com
---
 include/linux/rtnetlink.h  |3 +--
 include/net/sock.h |3 +--
 net/mac80211/sta_info.c|4 
 net/netlabel/netlabel_domainhash.c |3 +--
 net/netlabel/netlabel_unlabeled.c  |3 +--
 5 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index d42f274..dfe5ba1 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -758,8 +758,7 @@ extern int lockdep_rtnl_is_held(void);
  * or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference()
  */
 #define rcu_dereference_rtnl(p)\
-   rcu_dereference_check(p, rcu_read_lock_held() ||\
-lockdep_rtnl_is_held())
+   rcu_dereference_check(p, lockdep_rtnl_is_held())
 
 /**
  * rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL
diff --git a/include/net/sock.h b/include/net/sock.h
index c7a7362..bee3e9c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1274,8 +1274,7 @@ extern unsigned long sock_i_ino(struct sock *sk);
 static inline struct dst_entry *
 __sk_dst_get(struct sock *sk)
 {
-   return rcu_dereference_check(sk-sk_dst_cache, rcu_read_lock_held() ||
-  sock_owned_by_user(sk) ||
+   return rcu_dereference_check(sk-sk_dst_cache, sock_owned_by_user(sk) ||
   
lockdep_is_held(sk-sk_lock.slock));
 }
 
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index 6d8f897..c879217 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -94,7 +94,6 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data 
*sdata,
struct sta_info *sta;
 
sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)],
-   rcu_read_lock_held() ||
lockdep_is_held(local-sta_lock) ||
lockdep_is_held(local-sta_mtx));
while (sta) {
@@ -102,7 +101,6 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data 
*sdata,
memcmp(sta-sta.addr, addr, ETH_ALEN) == 0)
break;
sta = rcu_dereference_check(sta-hnext,
-   rcu_read_lock_held() ||
lockdep_is_held(local-sta_lock) ||
lockdep_is_held(local-sta_mtx));
}
@@ -120,7 +118,6 @@ struct sta_info *sta_info_get_bss(struct 
ieee80211_sub_if_data *sdata,
struct sta_info *sta;
 
sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)],
-   rcu_read_lock_held() ||
lockdep_is_held(local-sta_lock) ||
lockdep_is_held(local-sta_mtx));
while (sta) {
@@ -129,7 +126,6 @@ struct sta_info *sta_info_get_bss(struct 
ieee80211_sub_if_data *sdata,
memcmp(sta-sta.addr, addr, ETH_ALEN) == 0)
break;
sta = rcu_dereference_check(sta-hnext,
-   rcu_read_lock_held() ||
lockdep_is_held(local-sta_lock) ||
lockdep_is_held(local-sta_mtx));
}
diff --git a/net/netlabel/netlabel_domainhash.c 
b/net/netlabel/netlabel_domainhash.c
index d37b7f8..82795a4 100644
--- a/net/netlabel/netlabel_domainhash.c
+++ b/net/netlabel/netlabel_domainhash.c
@@ -55,8 +55,7 @@ struct netlbl_domhsh_tbl {
  * should be okay */
 static DEFINE_SPINLOCK(netlbl_domhsh_lock);
 #define netlbl_domhsh_rcu_deref(p) \
-   rcu_dereference_check(p, rcu_read_lock_held() || \
-lockdep_is_held(netlbl_domhsh_lock))
+   rcu_dereference_check(p, lockdep_is_held(netlbl_domhsh_lock))
 static struct netlbl_domhsh_tbl *netlbl_domhsh = NULL;
 static struct netlbl_dom_map *netlbl_domhsh_def = NULL;
 
diff --git a/net/netlabel/netlabel_unlabeled.c 
b/net/netlabel/netlabel_unlabeled.c
index e2b0a68..d2f982f 100644
--- a/net/netlabel/netlabel_unlabeled.c
+++ b/net/netlabel/netlabel_unlabeled.c
@@ -116,8 +116,7 @@ struct netlbl_unlhsh_walk_arg {
  * hash table should be okay */
 static DEFINE_SPINLOCK(netlbl_unlhsh_lock);
 #define netlbl_unlhsh_rcu_deref(p) \
-   rcu_dereference_check(p, rcu_read_lock_held() || \
-lockdep_is_held(netlbl_unlhsh_lock))
+   rcu_dereference_check(p, lockdep_is_held(netlbl_unlhsh_lock))
 static struct 

[PATCH 1/3] cgroup - removing superfluous rcu_read_lock_held check

2010-11-01 Thread Jiri Olsa
hi,
the rcu_dereference_check is defined as

#define rcu_dereference_check(p, c) \
   __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)

so the caller does not need to specify rcu_read_lock_held() condition.

wbr,
jirka


Signed-off-by: Jiri Olsa jo...@redhat.com
---
 include/linux/cgroup.h |1 -
 kernel/cgroup.c|6 ++
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed4ba11..caed568 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -536,7 +536,6 @@ static inline struct cgroup_subsys_state 
*cgroup_subsys_state(
  */
 #define task_subsys_state_check(task, subsys_id, __c)  \
rcu_dereference_check(task-cgroups-subsys[subsys_id], \
- rcu_read_lock_held() ||   \
  lockdep_is_held(task-alloc_lock) || \
  cgroup_lock_is_held() || (__c))
 
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66a416b..1f329a2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1687,7 +1687,6 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int 
buflen)
 {
char *start;
struct dentry *dentry = rcu_dereference_check(cgrp-dentry,
- rcu_read_lock_held() ||
  cgroup_lock_is_held());
 
if (!dentry || cgrp == dummytop) {
@@ -1713,7 +1712,6 @@ int cgroup_path(const struct cgroup *cgrp, char *buf, int 
buflen)
break;
 
dentry = rcu_dereference_check(cgrp-dentry,
-  rcu_read_lock_held() ||
   cgroup_lock_is_held());
if (!cgrp-parent)
continue;
@@ -4544,7 +4542,7 @@ unsigned short css_id(struct cgroup_subsys_state *css)
 * it's unchanged until freed.
 */
cssid = rcu_dereference_check(css-id,
-   rcu_read_lock_held() || atomic_read(css-refcnt));
+   atomic_read(css-refcnt));
 
if (cssid)
return cssid-id;
@@ -4557,7 +4555,7 @@ unsigned short css_depth(struct cgroup_subsys_state *css)
struct css_id *cssid;
 
cssid = rcu_dereference_check(css-id,
-   rcu_read_lock_held() || atomic_read(css-refcnt));
+   atomic_read(css-refcnt));
 
if (cssid)
return cssid-depth;
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] rcu - removing superfluous rcu_read_lock_held check

2010-11-01 Thread Jiri Olsa
hi,

the rcu_dereference_check is defined as

  #define rcu_dereference_check(p, c) \
 __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)

so the caller does not need to specify rcu_read_lock_held()
condition.

Several places in kernel are spcifying rcu_read_lock_held as
rcu_dereference_check condition parameter.  I separated them
into 3 patches:

1/3 - cgroup - removing superfluous rcu_read_lock_held check
2/3 - kernel,cred,kvm,security - removing superfluous rcu_read_lock_held check
3/3 - net - removing superfluous rcu_read_lock_held check


sry if I might missed or added somebody wrongly to recipients

wbr,
jirka


Signed-off-by: Jiri Olsa jo...@redhat.com
---
 include/linux/cgroup.h |1 -
 include/linux/cred.h   |1 -
 include/linux/fdtable.h|1 -
 include/linux/kvm_host.h   |1 -
 include/linux/rtnetlink.h  |3 +--
 include/net/sock.h |3 +--
 kernel/cgroup.c|6 ++
 kernel/exit.c  |1 -
 kernel/pid.c   |1 -
 kernel/rcutorture.c|2 --
 net/mac80211/sta_info.c|4 
 net/netlabel/netlabel_domainhash.c |3 +--
 net/netlabel/netlabel_unlabeled.c  |3 +--
 security/keys/keyring.c|1 -
 14 files changed, 6 insertions(+), 25 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-01 Thread Huang Ying
On Mon, 2010-11-01 at 11:49 -0700, Anthony Liguori wrote:
 On 11/01/2010 11:09 AM, Marcelo Tosatti wrote:
  On Tue, Oct 26, 2010 at 10:39:48AM +0800, Huang Ying wrote:
 
  Author: Max Asbockmasb...@linux.vnet.ibm.com
 
  Add command pfa2hva to translate guest physical address to host
  virtual address.
 
  The pfa2hva command provides one step in a chain of translations from
  guest virtual to guest physical to host virtual to host physical. Host
  physical is then used to inject a machine check error. As a
  consequence the HWPOISON code on the host and the MCE injection code
  in qemu-kvm are exercised.
 
  v2:
 
  - Add QMP support
   
 
  Looks good to me. Anthony?
 
 
 The translation is not stable so this is really a bad interface.
 
 It definitely shouldn't be a QMP command and I don't think it's 
 generally useful enough that it should be a monitor command.
 
 It's impossible to use correctly as a general interface.
 
 I gave this feedback when it was initially submitted.

Yes. As general interface, it may not work so well, but as test
interface, it works quite well and useful.

Do we have any mechanism to add a test only interface?

Best Regards,
Huang Ying


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -v2] Monitor command: pfa2hva, translate guest physical address to host virtual address

2010-11-01 Thread Anthony Liguori

On 11/01/2010 02:20 PM, Huang Ying wrote:

Yes. As general interface, it may not work so well, but as test
interface, it works quite well and useful.

Do we have any mechanism to add a test only interface?
   


I'd like to see what Luiz/Markus think but definitely only a human 
monitor interface and probably prefix the command with a 'x-' prefix to 
indicate that it's not a supported interface.


The documentation should be very clear about the limitations of the 
interface and the intended use-case.


Regards,

Anthony Liguori


Best Regards,
Huang Ying


   


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] trace exit to userspace event

2010-11-01 Thread Avi Kivity

 On 11/01/2010 12:06 PM, Marcelo Tosatti wrote:

On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote:
  Add tracepoint for userspace exit.

  Signed-off-by: Gleb Natapovg...@redhat.com
  ---
ChangeLog:
v1-v2
 log error case too.

Applied, thanks.



errno sign is still flipped, no?


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


unhandled wrmsr

2010-11-01 Thread Ross Boylan
I built from qemu-kvm-0.13.0.tar.gz on a Debian system with kernel
linux-image-2.6.32-5-amd642.6.32-26
(but otherwise basically the stable/lenny version) and now see
Oct 26 16:57:38 markov kernel: [ 5757.672426] kvm: 23063: cpu0 unhandled wrmsr: 
0x198 data 0
Oct 26 16:57:38 markov kernel: [ 5757.672454] kvm: 23063: cpu1 unhandled wrmsr: 
0x198 data 0
Oct 26 16:57:38 markov kernel: [ 5757.672476] kvm: 23063: cpu2 unhandled wrmsr: 
0x198 data 0
Oct 26 16:57:38 markov kernel: [ 5757.672497] kvm: 23063: cpu3 unhandled wrmsr: 
0x198 data 0
on startup.  I have 4 CPUs with 8 cores:
Intel(R) Xeon(R) CPU   E5420  @ 2.50GHz stepping 06.
Note this means the kernel side is from the Debian package, and so might
be older.

google shows this message has popped up several times in the past and
been fixed several times.  Most of the messages indicate it is harmless,
but https://bugs.launchpad.net/ubuntu/+source/linux/+bug/325851 on
2009-07-01 says it's not benign.  There also seems to be a current bug
open about it,
http://sourceforge.net/tracker/index.php?func=detailaid=3056363group_id=180599atid=893831,
 which is also marked as fixed.

Can anyone tell me anything more about these messages?  Do they indicate
a problem I need to fix?  If so, any advice on how to fix it?

Thanks.
Ross

P.S. Please cc me on reply.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] trace exit to userspace event

2010-11-01 Thread Marcelo Tosatti
On Mon, Nov 01, 2010 at 03:33:20PM -0400, Avi Kivity wrote:
  On 11/01/2010 12:06 PM, Marcelo Tosatti wrote:
 On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote:
   Add tracepoint for userspace exit.
 
   Signed-off-by: Gleb Natapovg...@redhat.com
   ---
 ChangeLog:
 v1-v2
  log error case too.
 
 Applied, thanks.
 
 
 errno sign is still flipped, no?

No, kvm_arch_vcpu_ioctl_run return -errno on error.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

2010-11-01 Thread Shirley Ma
On Sat, 2010-10-30 at 22:06 +0200, Michael S. Tsirkin wrote:
 On Fri, Oct 29, 2010 at 08:43:08AM -0700, Shirley Ma wrote:
  On Fri, 2010-10-29 at 10:10 +0200, Michael S. Tsirkin wrote:
   Hmm. I don't yet understand. We are still doing copies into the
 per-vq
   buffer, and the data copied is really small.  Is it about cache
 line
   bounces?  Could you try figuring it out?
  
  per-vq buffer is much less expensive than 3 put_copy() call. I will
  collect the profiling data to show that.
 
 What about __put_user? Maybe the access checks are the ones
 that add the cost here? I attach patches to strip access checks:
 they are not needed as we do them on setup time already, anyway.
 Can you try them out and see if performance is improved for you
 please?
 On top of this, we will need to add some scheme to accumulate signals,
 but that is a separate issue.

Yes, moving from put_user/get_user to __put_user/__get_user does improve
the performance by removing the checking.

My concern here is whether checking only in set up would be sufficient
for security? Would be there is a case guest could corrupt the ring
later? If not, that's OK.

 2. How about flushing out queued stuff before we exit
the handle_tx loop? That would address most of
the spec issue. 

The performance is almost as same as the previous patch. I will
   resubmit
the modified one, adding vhost_add_used_and_signal_n after
 handle_tx
loop for processing pending queue.

This patch was a part of modified macvtap zero copy which I
 haven't
submitted yet. I found this helped vhost TX in general. This
 pending
queue will be used by DMA done later, so I put it in vq instead
 of a
local variable in handle_tx.

Thanks
Shirley
   
   BTW why do we need another array? Isn't heads field exactly what
 we
   need
   here?
  
  head field is only for up to 32, the more used buffers add and
 signal
  accumulated the better performance is from test results.
 
 I think we should separate the used update and signalling.  Interrupts
 are expensive so I can believe accumulating even up to 100 of them
 helps. But used head copies are already prety cheap. If we cut the
 overhead by x32, that should make them almost free?

I can separate the used update and signaling to see the best
performance.

  That's was one
  of the reason I didn't use heads. The other reason was I used these
  buffer for pending dma done in mavctap zero copy patch. It could be
 up
  to vq-num in worse case.
 
 We can always increase that, not an issue. 

Good, I will change heads up to vq-num and use it.

Thanks
Shirley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] net - removing superfluous rcu_read_lock_held check

2010-11-01 Thread David Miller
From: Jiri Olsa jo...@redhat.com
Date: Mon,  1 Nov 2010 20:15:09 +0100

 the rcu_dereference_check is defined as
 
   #define rcu_dereference_check(p, c) \
  __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)
 
 so the caller does not need to specify rcu_read_lock_held() condition.

 Signed-off-by: Jiri Olsa jo...@redhat.com

Acked-by: David S. Miller da...@davemloft.net
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 0/2] Test nmi-after-sti

2010-11-01 Thread Avi Kivity
Add a unit test for NMI immediately after an IF-enabling STI.

Avi Kivity (2):
  apic: use boot idt instead of a locally allocated idt
  apic: test nmi-after-sti

 x86/apic.c |   83 +---
 1 files changed, 68 insertions(+), 15 deletions(-)

-- 
1.7.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 2/2] apic: test nmi-after-sti

2010-11-01 Thread Avi Kivity
While not required by the spec, some guests (Linux)
rely on nmi being blocked by an IF-enabling sti.  Add
a unit test for this condition.

Signed-off-by: Avi Kivity a...@redhat.com
---
 x86/apic.c |   67 
 1 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/x86/apic.c b/x86/apic.c
index 165f820..2207040 100644
--- a/x86/apic.c
+++ b/x86/apic.c
@@ -1,6 +1,7 @@
 #include libcflat.h
 #include apic.h
 #include vm.h
+#include smp.h
 
 typedef struct {
 unsigned short offset0;
@@ -274,9 +275,74 @@ static void test_ioapic_simultaneous(void)
g_66  g_78  g_66_after_78  g_66_rip == g_78_rip);
 }
 
+volatile int nmi_counter_private, nmi_counter, nmi_hlt_counter, 
sti_loop_active;
+
+void sti_nop(char *p)
+{
+asm volatile (
+ .globl post_sti \n\t
+ sti \n
+ /*
+  * vmx won't exit on external interrupt if blocked-by-sti,
+  * so give it a reason to exit by accessing an unmapped page.
+  */
+ post_sti: testb $0, %0 \n\t
+ nop \n\t
+ cli
+ : : m(*p)
+ );
+nmi_counter = nmi_counter_private;
+}
+
+static void sti_loop(void *ignore)
+{
+unsigned k = 0;
+
+while (sti_loop_active) {
+   sti_nop((char *)(ulong)((k++ * 4096) % (128 * 1024 * 1024)));
+}
+}
+
+static void nmi_handler(isr_regs_t *regs)
+{
+extern void post_sti(void);
+++nmi_counter_private;
+nmi_hlt_counter += regs-rip == (ulong)post_sti;
+}
+
+static void update_cr3(void *cr3)
+{
+write_cr3((ulong)cr3);
+}
+
+static void test_sti_nmi(void)
+{
+unsigned old_counter;
+
+if (cpu_count()  2) {
+   return;
+}
+
+set_idt_entry(2, nmi_handler);
+on_cpu(1, update_cr3, (void *)read_cr3());
+
+sti_loop_active = 1;
+on_cpu_async(1, sti_loop, 0);
+while (nmi_counter  3) {
+   old_counter = nmi_counter;
+   apic_icr_write(APIC_DEST_PHYSICAL | APIC_DM_NMI | APIC_INT_ASSERT, 1);
+   while (nmi_counter == old_counter) {
+   ;
+   }
+}
+sti_loop_active = 0;
+report(nmi-after-sti, nmi_hlt_counter == 0);
+}
+
 int main()
 {
 setup_vm();
+smp_init();
 
 test_lapic_existence();
 
@@ -288,6 +354,7 @@ int main()
 
 test_ioapic_intr();
 test_ioapic_simultaneous();
+test_sti_nmi();
 
 printf(\nsummary: %d tests, %d failures\n, g_tests, g_fail);
 
-- 
1.7.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH kvm-unit-tests 1/2] apic: use boot idt instead of a locally allocated idt

2010-11-01 Thread Avi Kivity
This allows the smp support, which uses the boot idt, to work.

Signed-off-by: Avi Kivity a...@redhat.com
---
 x86/apic.c |   16 +---
 1 files changed, 1 insertions(+), 15 deletions(-)

diff --git a/x86/apic.c b/x86/apic.c
index 48fa0f7..165f820 100644
--- a/x86/apic.c
+++ b/x86/apic.c
@@ -89,7 +89,7 @@ asm (
 #endif
 );
 
-static idt_entry_t idt[256];
+static idt_entry_t *idt = 0;
 
 static int g_fail;
 static int g_tests;
@@ -127,19 +127,6 @@ void test_enable_x2apic(void)
 }
 }
 
-static void init_idt(void)
-{
-struct {
-u16 limit;
-ulong idt;
-} __attribute__((packed)) idt_ptr = {
-sizeof(idt_entry_t) * 256 - 1,
-(ulong)idt,
-};
-
-asm volatile(lidt %0 : : m(idt_ptr));
-}
-
 static void set_idt_entry(unsigned vec, void (*func)(isr_regs_t *regs))
 {
 u8 *thunk = vmalloc(50);
@@ -296,7 +283,6 @@ int main()
 mask_pic_interrupts();
 enable_apic();
 test_enable_x2apic();
-init_idt();
 
 test_self_ipi();
 
-- 
1.7.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] trace exit to userspace event

2010-11-01 Thread Avi Kivity

 On 11/01/2010 03:43 PM, Marcelo Tosatti wrote:

On Mon, Nov 01, 2010 at 03:33:20PM -0400, Avi Kivity wrote:
   On 11/01/2010 12:06 PM, Marcelo Tosatti wrote:
  On Sun, Oct 24, 2010 at 04:49:08PM +0200, Gleb Natapov wrote:
 Add tracepoint for userspace exit.
  
 Signed-off-by: Gleb Natapovg...@redhat.com
 ---
   ChangeLog:
   v1-v2
log error case too.
  
  Applied, thanks.
  

  errno sign is still flipped, no?

No, kvm_arch_vcpu_ioctl_run return -errno on error.



It's true that this is how the kernel ABI is defined, but for a user 
errnos are positive.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: VMX: Disallow NMI while blocked by STI

2010-11-01 Thread Avi Kivity
While not mandated by the spec, Linux relies on NMI being blocked by an
IF-enabling STI.  VMX also refuses to enter a guest in this state, at
least on some implementations.

Disallow NMI while blocked by STI by checking for the condition, and
requesting an interrupt window exit if it occurs.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/vmx.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 993e332..09321d0 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2788,6 +2788,10 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
return;
}
 
+   if (vmcs_read32(GUEST_INTERRUPTIBILITY_INFO)  GUEST_INTR_STATE_STI) {
+   enable_irq_window(vcpu);
+   return;
+   }
cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_NMI_PENDING;
vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
@@ -2850,7 +2854,8 @@ static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
return 0;
 
return  !(vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) 
-   (GUEST_INTR_STATE_MOV_SS | GUEST_INTR_STATE_NMI));
+ (GUEST_INTR_STATE_MOV_SS | GUEST_INTR_STATE_STI
+  | GUEST_INTR_STATE_NMI));
 }
 
 static bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu)
-- 
1.7.3.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] Fixes, non-PCI-2.3 support, EOI enhancements

2010-11-01 Thread Tom Lyon
I've applied all your patches. Thanks!

On Saturday, October 30, 2010 09:58:55 am Alex Williamson wrote:
 Hi Tom,
 
 I've updated some patches I've been working on to v5 and wanted to
 see what you think.  I also found a couple minor bugs, fixed in this
 series.
 
 The main idea is that since the VFIO interface defines that the INTx
 interrupt is disabled when it fires, we should provide an interface
 to re-enable it.  We currently have to do a read-modify-write to PCI
 config space, requring two ioctls.  I introduce an ioctl to do this
 for us.  An important aspect of this is that now we can support
 non-PCI-2.3 devices since re-enabling the interrupt is abstracted
 (with the same caveat as current KVM device assignment, that they
 must get an exclusive interrupt).  The real trick though is that we
 can then also create an irqfd-like mechanism to trigger re-enabling
 interrupts from and eventfd.  This allows qemu to do all the setup
 for connecting interrupts from VFIO into KVM and EOIs from KVM to
 VFIO, then lets userspace get completely out of the way.
 
 Hope you like it.  Thanks,
 
 Alex
 ---
 
 Alex Williamson (5):
   vfio: Add a new ioctl to support EOI via eventfd
   vfio: Add support for non-PCI 2.3 compliant devices
   vfio: Add ioctl to re-enable interrupts
   vfio: Fix requested regions
   vfio: Fix the ROM mask
 
 
  drivers/vfio/vfio_intrs.c  |  254
 +--- drivers/vfio/vfio_main.c   | 
  37 +-
  drivers/vfio/vfio_pci_config.c |2
  drivers/vfio/vfio_rdwr.c   |4 -
  include/linux/vfio.h   |   14 ++
  5 files changed, 279 insertions(+), 32 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph/rbd block driver for qemu-kvm (v7)

2010-11-01 Thread Sage Weil
Hi,

Are there any lingering issues or concerns with the latest rbd patch, or 
other roadblocks that would prevent this from being merged?

Thanks-
sage


On Fri, 15 Oct 2010, Christian Brunner wrote:
 Hi,
 
 once again, Yehuda committed fixes for all the suggestions made on the 
 list (and more). Here is the next update for the ceph/rbd block driver.
 
 Please let us know if there are any pending issues.

 For those who didn't follow the previous postings:
 
 This is an block driver for the distributed file system Ceph
 (http://ceph.newdream.net/). This driver uses librados (which
 is part of the Ceph server) for direct access to the Ceph object
 store and is running entirely in userspace (Yehuda also
 wrote a driver for the linux kernel, that can be used to access
 rbd volumes as a block device).
 
 Kind Regards,
 Christian
 
 Signed-off-by: Christian Brunner c...@muc.de
 Signed-off-by: Yehuda Sadeh yeh...@hq.newdream.net
 ---
  Makefile.objs |1 +
  block/rbd.c   | 1059 
 +
  block/rbd_types.h |   71 
  configure |   31 ++
  4 files changed, 1162 insertions(+), 0 deletions(-)
  create mode 100644 block/rbd.c
  create mode 100644 block/rbd_types.h
 
 diff --git a/Makefile.objs b/Makefile.objs
 index 6ee077c..56a13c1 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -19,6 +19,7 @@ block-nested-y += parallels.o nbd.o blkdebug.o
  block-nested-$(CONFIG_WIN32) += raw-win32.o
  block-nested-$(CONFIG_POSIX) += raw-posix.o
  block-nested-$(CONFIG_CURL) += curl.o
 +block-nested-$(CONFIG_RBD) += rbd.o
  
  block-obj-y +=  $(addprefix block/, $(block-nested-y))
  
 diff --git a/block/rbd.c b/block/rbd.c
 new file mode 100644
 index 000..fbfb93e
 --- /dev/null
 +++ b/block/rbd.c
 @@ -0,0 +1,1059 @@
 +/*
 + * QEMU Block driver for RADOS (Ceph)
 + *
 + * Copyright (C) 2010 Christian Brunner c...@muc.de
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2.  See
 + * the COPYING file in the top-level directory.
 + *
 + */
 +
 +#include qemu-common.h
 +#include qemu-error.h
 +
 +#include rbd_types.h
 +#include block_int.h
 +
 +#include rados/librados.h
 +
 +
 +
 +/*
 + * When specifying the image filename use:
 + *
 + * rbd:poolname/devicename
 + *
 + * poolname must be the name of an existing rados pool
 + *
 + * devicename is the basename for all objects used to
 + * emulate the raw device.
 + *
 + * Metadata information (image size, ...) is stored in an
 + * object with the name devicename.rbd.
 + *
 + * The raw device is split into 4MB sized objects by default.
 + * The sequencenumber is encoded in a 12 byte long hex-string,
 + * and is attached to the devicename, separated by a dot.
 + * e.g. devicename.1234567890ab
 + *
 + */
 +
 +#define OBJ_MAX_SIZE (1UL  OBJ_DEFAULT_OBJ_ORDER)
 +
 +typedef struct RBDAIOCB {
 +BlockDriverAIOCB common;
 +QEMUBH *bh;
 +int ret;
 +QEMUIOVector *qiov;
 +char *bounce;
 +int write;
 +int64_t sector_num;
 +int aiocnt;
 +int error;
 +struct BDRVRBDState *s;
 +int cancelled;
 +} RBDAIOCB;
 +
 +typedef struct RADOSCB {
 +int rcbid;
 +RBDAIOCB *acb;
 +struct BDRVRBDState *s;
 +int done;
 +int64_t segsize;
 +char *buf;
 +int ret;
 +} RADOSCB;
 +
 +#define RBD_FD_READ 0
 +#define RBD_FD_WRITE 1
 +
 +typedef struct BDRVRBDState {
 +int fds[2];
 +rados_pool_t pool;
 +rados_pool_t header_pool;
 +char name[RBD_MAX_OBJ_NAME_SIZE];
 +char block_name[RBD_MAX_BLOCK_NAME_SIZE];
 +uint64_t size;
 +uint64_t objsize;
 +int qemu_aio_count;
 +int read_only;
 +int event_reader_pos;
 +RADOSCB *event_rcb;
 +} BDRVRBDState;
 +
 +typedef struct rbd_obj_header_ondisk RbdHeader1;
 +
 +static int rbd_next_tok(char *dst, int dst_len,
 +char *src, char delim,
 +const char *name,
 +char **p)
 +{
 +int l;
 +char *end;
 +
 +*p = NULL;
 +
 +if (delim != '\0') {
 +end = strchr(src, delim);
 +if (end) {
 +*p = end + 1;
 +*end = '\0';
 +}
 +}
 +l = strlen(src);
 +if (l = dst_len) {
 +error_report(%s too long, name);
 +return -EINVAL;
 +} else if (l == 0) {
 +error_report(%s too short, name);
 +return -EINVAL;
 +}
 +
 +pstrcpy(dst, dst_len, src);
 +
 +return 0;
 +}
 +
 +static int rbd_parsename(const char *filename,
 + char *pool, int pool_len,
 + char *snap, int snap_len,
 + char *name, int name_len)
 +{
 +const char *start;
 +char *p, *buf;
 +int ret;
 +
 +if (!strstart(filename, rbd:, start)) {
 +return -EINVAL;
 +}
 +
 +buf = qemu_strdup(start);
 +p = buf;
 +
 +ret = rbd_next_tok(pool, pool_len, p, '/', pool name, p);
 +if (ret  0 || !p) {
 +ret = -EINVAL;
 +goto done;
 +}
 +   

Re: [PATCH 3/3] net - removing superfluous rcu_read_lock_held check

2010-11-01 Thread Paul Moore
On Mon, 2010-11-01 at 20:15 +0100, Jiri Olsa wrote:
 hi,
 the rcu_dereference_check is defined as
 
   #define rcu_dereference_check(p, c) \
  __rcu_dereference_check((p), rcu_read_lock_held() || (c), __rcu)
 
 so the caller does not need to specify rcu_read_lock_held() condition.
 
 wbr,
 jirka
 
 
 Signed-off-by: Jiri Olsa jo...@redhat.com

The NetLabel changes look fine to me; thanks for the cleanup.

 ---
  include/linux/rtnetlink.h  |3 +--
  include/net/sock.h |3 +--
  net/mac80211/sta_info.c|4 
  net/netlabel/netlabel_domainhash.c |3 +--
  net/netlabel/netlabel_unlabeled.c  |3 +--
  5 files changed, 4 insertions(+), 12 deletions(-)
 
 diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
 index d42f274..dfe5ba1 100644
 --- a/include/linux/rtnetlink.h
 +++ b/include/linux/rtnetlink.h
 @@ -758,8 +758,7 @@ extern int lockdep_rtnl_is_held(void);
   * or RTNL. Note : Please prefer rtnl_dereference() or rcu_dereference()
   */
  #define rcu_dereference_rtnl(p)  \
 - rcu_dereference_check(p, rcu_read_lock_held() ||\
 -  lockdep_rtnl_is_held())
 + rcu_dereference_check(p, lockdep_rtnl_is_held())
  
  /**
   * rtnl_dereference - fetch RCU pointer when updates are prevented by RTNL
 diff --git a/include/net/sock.h b/include/net/sock.h
 index c7a7362..bee3e9c 100644
 --- a/include/net/sock.h
 +++ b/include/net/sock.h
 @@ -1274,8 +1274,7 @@ extern unsigned long sock_i_ino(struct sock *sk);
  static inline struct dst_entry *
  __sk_dst_get(struct sock *sk)
  {
 - return rcu_dereference_check(sk-sk_dst_cache, rcu_read_lock_held() ||
 -sock_owned_by_user(sk) ||
 + return rcu_dereference_check(sk-sk_dst_cache, sock_owned_by_user(sk) ||
  
 lockdep_is_held(sk-sk_lock.slock));
  }
  
 diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
 index 6d8f897..c879217 100644
 --- a/net/mac80211/sta_info.c
 +++ b/net/mac80211/sta_info.c
 @@ -94,7 +94,6 @@ struct sta_info *sta_info_get(struct ieee80211_sub_if_data 
 *sdata,
   struct sta_info *sta;
  
   sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)],
 - rcu_read_lock_held() ||
   lockdep_is_held(local-sta_lock) ||
   lockdep_is_held(local-sta_mtx));
   while (sta) {
 @@ -102,7 +101,6 @@ struct sta_info *sta_info_get(struct 
 ieee80211_sub_if_data *sdata,
   memcmp(sta-sta.addr, addr, ETH_ALEN) == 0)
   break;
   sta = rcu_dereference_check(sta-hnext,
 - rcu_read_lock_held() ||
   lockdep_is_held(local-sta_lock) ||
   lockdep_is_held(local-sta_mtx));
   }
 @@ -120,7 +118,6 @@ struct sta_info *sta_info_get_bss(struct 
 ieee80211_sub_if_data *sdata,
   struct sta_info *sta;
  
   sta = rcu_dereference_check(local-sta_hash[STA_HASH(addr)],
 - rcu_read_lock_held() ||
   lockdep_is_held(local-sta_lock) ||
   lockdep_is_held(local-sta_mtx));
   while (sta) {
 @@ -129,7 +126,6 @@ struct sta_info *sta_info_get_bss(struct 
 ieee80211_sub_if_data *sdata,
   memcmp(sta-sta.addr, addr, ETH_ALEN) == 0)
   break;
   sta = rcu_dereference_check(sta-hnext,
 - rcu_read_lock_held() ||
   lockdep_is_held(local-sta_lock) ||
   lockdep_is_held(local-sta_mtx));
   }
 diff --git a/net/netlabel/netlabel_domainhash.c 
 b/net/netlabel/netlabel_domainhash.c
 index d37b7f8..82795a4 100644
 --- a/net/netlabel/netlabel_domainhash.c
 +++ b/net/netlabel/netlabel_domainhash.c
 @@ -55,8 +55,7 @@ struct netlbl_domhsh_tbl {
   * should be okay */
  static DEFINE_SPINLOCK(netlbl_domhsh_lock);
  #define netlbl_domhsh_rcu_deref(p) \
 - rcu_dereference_check(p, rcu_read_lock_held() || \
 -  lockdep_is_held(netlbl_domhsh_lock))
 + rcu_dereference_check(p, lockdep_is_held(netlbl_domhsh_lock))
  static struct netlbl_domhsh_tbl *netlbl_domhsh = NULL;
  static struct netlbl_dom_map *netlbl_domhsh_def = NULL;
  
 diff --git a/net/netlabel/netlabel_unlabeled.c 
 b/net/netlabel/netlabel_unlabeled.c
 index e2b0a68..d2f982f 100644
 --- a/net/netlabel/netlabel_unlabeled.c
 +++ b/net/netlabel/netlabel_unlabeled.c
 @@ -116,8 +116,7 @@ struct netlbl_unlhsh_walk_arg {
   * hash table should be okay */
  static DEFINE_SPINLOCK(netlbl_unlhsh_lock);
  #define netlbl_unlhsh_rcu_deref(p) \
 - rcu_dereference_check(p, rcu_read_lock_held() || \
 -  

Re: [PATCH 2/3] kernel,cred,kvm,security - removing superfluous rcu_read_lock_held check

2010-11-01 Thread Paolo Bonzini

On 11/01/2010 08:15 PM, Jiri Olsa wrote:

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..a90a7e3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,7 +256,6 @@ void kvm_put_kvm(struct kvm *kvm);
  static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
  {
return rcu_dereference_check(kvm-memslots,
-   srcu_read_lock_held(kvm-srcu)
|| lockdep_is_held(kvm-slots_lock));
  }



This is an srcu_read_lock_held, which you don't touch here:


diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c
index 9d8e8fb..0956a73 100644
--- a/kernel/rcutorture.c
+++ b/kernel/rcutorture.c
@@ -807,7 +807,6 @@ static void rcu_torture_timer(unsigned long unused)
idx = cur_ops-readlock();
completed = cur_ops-completed();
p = rcu_dereference_check(rcu_torture_current,
- rcu_read_lock_held() ||
  rcu_read_lock_bh_held() ||
  rcu_read_lock_sched_held() ||
  srcu_read_lock_held(srcu_ctl));


I guess the kvm hunk is the incorrect one?

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM x86: remove memset, use vzalloc and don't assign the same value to a variable twice

2010-11-01 Thread Takuya Yoshikawa
Hi Jesper, (dropped some addresses from Cc)

  Jesper Juhl wrote:
 There's definately a positive size impact for the generated object code 
 and we save having to do the call to memset() and the cost of a vzalloc() 
 call looks more or less the same as a call to vmalloc() to me.

  This patch is not based on kvm.git, I guess.
  
 Nope. It was generated against mainline git as of a couple of days ago. I 
 can generate a version against kvm.git if you prefer.

Sorry, I should have said more clearly.

kvm_vm_ioctl_get_dirty_log() has been changed since latest mainline kernel was 
released.
Furthermore, vmalloc() in it is to be removed soon, currently in kvm's next.

I have checked all vmalloc() + memset() in kvm, and there are three remaining.
See the patch below.
  - I have tested on x86.
  - I can test ppc later if needed, but this is so trivial that Alex will see
no problem about this, probably.

So please write your Signed-off-by and ask Avi and Marcelo to apply.

Thanks,
  Takuya

===
Subject: [PATCH] KVM: replace vmalloc and memset with vzalloc

Let's use newly introduced vzalloc().

Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
Cc: Jesper Juhl j...@chaosbits.net
---
 arch/powerpc/kvm/book3s.c |4 +---
 virt/kvm/kvm_main.c   |9 ++---
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index e316847..badc983 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1307,12 +1307,10 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm 
*kvm, unsigned int id)
int err = -ENOMEM;
unsigned long p;
 
-   vcpu_book3s = vmalloc(sizeof(struct kvmppc_vcpu_book3s));
+   vcpu_book3s = vzalloc(sizeof(struct kvmppc_vcpu_book3s));
if (!vcpu_book3s)
goto out;
 
-   memset(vcpu_book3s, 0, sizeof(struct kvmppc_vcpu_book3s));
-
vcpu_book3s-shadow_vcpu = (struct kvmppc_book3s_shadow_vcpu *)
kzalloc(sizeof(*vcpu_book3s-shadow_vcpu), GFP_KERNEL);
if (!vcpu_book3s-shadow_vcpu)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index dbe1d6a..bd44a81 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -606,13 +606,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
/* Allocate if a slot is being created */
 #ifndef CONFIG_S390
if (npages  !new.rmap) {
-   new.rmap = vmalloc(npages * sizeof(*new.rmap));
+   new.rmap = vzalloc(npages * sizeof(*new.rmap));
 
if (!new.rmap)
goto out_free;
 
-   memset(new.rmap, 0, npages * sizeof(*new.rmap));
-
new.user_alloc = user_alloc;
new.userspace_addr = mem-userspace_addr;
}
@@ -635,14 +633,11 @@ int __kvm_set_memory_region(struct kvm *kvm,
  KVM_HPAGE_GFN_SHIFT(level));
lpages -= base_gfn  KVM_HPAGE_GFN_SHIFT(level);
 
-   new.lpage_info[i] = vmalloc(lpages * 
sizeof(*new.lpage_info[i]));
+   new.lpage_info[i] = vzalloc(lpages * 
sizeof(*new.lpage_info[i]));
 
if (!new.lpage_info[i])
goto out_free;
 
-   memset(new.lpage_info[i], 0,
-  lpages * sizeof(*new.lpage_info[i]));
-
if (base_gfn  (KVM_PAGES_PER_HPAGE(level) - 1))
new.lpage_info[i][0].write_count = 1;
if ((base_gfn+npages)  (KVM_PAGES_PER_HPAGE(level) - 1))
-- 
1.7.0.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


buildbot failure in qemu-kvm on default_i386_debian_5_0

2010-11-01 Thread qemu-kvm
The Buildbot has detected a new failure of default_i386_debian_5_0 on qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_debian_5_0/builds/628

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_2

Build Reason: The Nightly scheduler named 'nightly_default' triggered this build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


buildbot failure in qemu-kvm on default_i386_out_of_tree

2010-11-01 Thread qemu-kvm
The Buildbot has detected a new failure of default_i386_out_of_tree on qemu-kvm.
Full details are available at:
 
http://buildbot.b1-systems.de/qemu-kvm/builders/default_i386_out_of_tree/builds/565

Buildbot URL: http://buildbot.b1-systems.de/qemu-kvm/

Buildslave for this Build: b1_qemu_kvm_2

Build Reason: The Nightly scheduler named 'nightly_default' triggered this build
Build Source Stamp: [branch master] HEAD
Blamelist: 

BUILD FAILED: failed compile

sincerely,
 -The Buildbot

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 7/7] KVM: KVM: don't break vcpu 'halt' state due to apfs

2010-11-01 Thread Xiao Guangrong
On 11/01/2010 08:55 PM, Gleb Natapov wrote:

 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 2cfdf2d..f7aed95 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -5295,8 +5295,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
   {
   switch(vcpu-arch.mp_state) {
   case KVM_MP_STATE_HALTED:
 - vcpu-arch.mp_state =
 - KVM_MP_STATE_RUNNABLE;
 + if 
 (list_empty_careful(vcpu-async_pf.done))
 + vcpu-arch.mp_state =
 + KVM_MP_STATE_RUNNABLE;

if nmi/interrupt and apfs completed event occur at the same time, we will miss 
to
exit halt sate. Maybe we can check the pending event here, see below patch 
please.

   case KVM_MP_STATE_RUNNABLE:
   vcpu-arch.apf.halted = false;
   break;
 @@ -6279,6 +6280,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
   vcpu-arch.fault.error_code = 0;
   vcpu-arch.fault.address = work-arch.token;
   kvm_inject_page_fault(vcpu);
 + vcpu-arch.mp_state = KVM_MP_STATE_RUNNABLE;
   }
  }

Have a stupid question, why we make the vcpu runnable here? Sorry i don't know
kvm pv guest to much. :-(

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2044302..27a712b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5267,6 +5267,12 @@ out:
return r;
 }
 
+static bool kvm_vcpu_leave_halt_state(struct kvm_vcpu *vcpu)
+{
+   return vcpu-arch.nmi_pending ||
+   (kvm_arch_interrupt_allowed(vcpu) 
+kvm_cpu_has_interrupt(vcpu));
+}
 
 static int __vcpu_run(struct kvm_vcpu *vcpu)
 {
@@ -5299,8 +5305,9 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
{
switch(vcpu-arch.mp_state) {
case KVM_MP_STATE_HALTED:
-   vcpu-arch.mp_state =
-   KVM_MP_STATE_RUNNABLE;
+   if (kvm_vcpu_leave_halt_state(vcpu))
+   vcpu-arch.mp_state =
+   KVM_MP_STATE_RUNNABLE;
case KVM_MP_STATE_RUNNABLE:
vcpu-arch.apf.halted = false;
break;
@@ -6113,9 +6120,7 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
!vcpu-arch.apf.halted)
|| !list_empty_careful(vcpu-async_pf.done)
|| vcpu-arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED
-   || vcpu-arch.nmi_pending ||
-   (kvm_arch_interrupt_allowed(vcpu) 
-kvm_cpu_has_interrupt(vcpu));
+   || kvm_vcpu_leave_halt_state(vcpu);
 }
 
 void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM x86: remove memset, use vzalloc and don't assign the same value to a variable twice

2010-11-01 Thread Jesper Juhl
On Tue, 2 Nov 2010, Takuya Yoshikawa wrote:

 Hi Jesper, (dropped some addresses from Cc)
 
   Jesper Juhl wrote:
  There's definately a positive size impact for the generated object code 
  and we save having to do the call to memset() and the cost of a vzalloc() 
  call looks more or less the same as a call to vmalloc() to me.
 
   This patch is not based on kvm.git, I guess.
   
  Nope. It was generated against mainline git as of a couple of days ago. I 
  can generate a version against kvm.git if you prefer.
 
 Sorry, I should have said more clearly.
 
 kvm_vm_ioctl_get_dirty_log() has been changed since latest mainline kernel 
 was released.
 Furthermore, vmalloc() in it is to be removed soon, currently in kvm's next.
 
 I have checked all vmalloc() + memset() in kvm, and there are three remaining.
 See the patch below.
   - I have tested on x86.
   - I can test ppc later if needed, but this is so trivial that Alex will see
 no problem about this, probably.
 
 So please write your Signed-off-by and ask Avi and Marcelo to apply.
 
 Thanks,
   Takuya
 
 ===
 Subject: [PATCH] KVM: replace vmalloc and memset with vzalloc
 
 Let's use newly introduced vzalloc().
 
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp

Signed-off-by: Jesper-juhl j...@chaosbits.net


-- 
Jesper Juhl j...@chaosbits.net http://www.chaosbits.net/
Plain text mails only, please  http://www.expita.com/nomime.html
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vfio: Extended capability fixes

2010-11-01 Thread Alex Williamson
- Virtual channel position gets truncated as a u8
 - Print the ecap that's unknown, not the last cap we saw
 - Print actual config offset, which provides enough info to make
   some sense of the error.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 drivers/vfio/vfio_pci_config.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
index 8af995d..8304316 100644
--- a/drivers/vfio/vfio_pci_config.c
+++ b/drivers/vfio/vfio_pci_config.c
@@ -410,7 +410,7 @@ static int vfio_msi_cap_len(struct vfio_dev *vdev, u8 pos)
  * Determine extended capability length for VC (2  9) and
  * MFVC capabilities
  */
-static int vfio_vc_cap_len(struct vfio_dev *vdev, u8 pos)
+static int vfio_vc_cap_len(struct vfio_dev *vdev, u16 pos)
 {
struct pci_dev *pdev = vdev-pdev;
u32 dw;
@@ -580,7 +580,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
printk(KERN_WARNING
%s: pci config conflict at %x, 
caps %x %x\n,
-   __func__, i, map[pos+i], cap);
+   __func__, pos+i, map[pos+i], cap);
map[pos+i] = cap;
}
ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, pos);
@@ -683,7 +683,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
if (len == 0 || len == 0xFF) {
printk(KERN_WARNING
%s: unknown length for pci ext cap %x\n,
-   __func__, cap);
+   __func__, ecap);
len = PCI_CAP_SIZEOF;
}
for (i = 0; i  len; i++) {
@@ -691,7 +691,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
printk(KERN_WARNING
%s: pci config conflict at %x, 
caps %x %x\n,
-   __func__, i, map[epos+i], ecap);
+   __func__, epos+i, map[epos+i], ecap);
map[epos+i] = ecap;
}
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] KVM test: Test the file transfer during migartion

2010-11-01 Thread Jason Wang
Michael Goldish writes:
  On 09/25/2010 11:36 AM, Jason Wang wrote:
   This test just do the file transfer from host to guest during migartion in 
   order
   to check whether the nic/block state could be saved and loaded correctly.
   
   Signed-off-by: Jason Wang jasow...@redhat.com
   ---
.../kvm/tests/migration_with_file_transfer.py  |   59 
   
client/tests/kvm/tests_base.cfg.sample |7 ++
2 files changed, 66 insertions(+), 0 deletions(-)
create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py
   
   diff --git a/client/tests/kvm/tests/migration_with_file_transfer.py 
   b/client/tests/kvm/tests/migration_with_file_transfer.py
   new file mode 100644
   index 000..8a316bf
   --- /dev/null
   +++ b/client/tests/kvm/tests/migration_with_file_transfer.py
   @@ -0,0 +1,59 @@
   +import logging, time
   +from autotest_lib.client.common_lib import utils, error
   +import kvm_subprocess, kvm_test_utils, kvm_utils
   +
   +
   +def run_migration_with_file_transfer(test, params, env):
   +
   +KVM migration test:
   +1) Get a live VM and clone it.
   +2) Verify that the source VM supports migration.  If it does, proceed 
   with
   +the test.
   +3) Reboot the VM
   +4) Send a migration command to the source VM and wait until it's 
   finished.
   +5) Kill off the source VM.
   +6) Log into the destination VM after the migration is finished.
   +
   +@param test: kvm test object.
   +@param params: Dictionary with test parameters.
   +@param env: Dictionary with the test environment.
   +
   +
   +def transfer_test(vm, host_path, guest_path, timeout=120):
   +
   +vm.copy_files_to does not raise exception, so we need a wrapper
   +in order to make it to be used by BackgroundTest.
   +
   +if not vm.copy_files_to(host_path, guest_path, timeout=timeout):
   +raise error.TestError(Fail to do the file transfer!)
   +
   +vm = kvm_test_utils.get_living_vm(env, params.get(main_vm))
   +timeout = int(params.get(login_timeout, 360))
   +session = kvm_test_utils.wait_for_login(vm, timeout=timeout)
   +
   +mig_timeout = float(params.get(mig_timeout, 3600))
   +mig_protocol = params.get(migration_protocol, tcp)
   +
   +guest_path = params.get(guest_path, /tmp)
   +file_size = params.get(file_size, 1000)
   +transfer_timeout = int(params.get(transfer_timeout, 240))
   +bg = None
   +
   +try:
   +utils.run(dd if=/dev/zero of=/tmp/file bs=1M count=%s % 
   file_size)
  
  Wouldn't it be useful to read from /dev/urandom instead of /dev/zero,
  and compare the md5 hash of the file generated on the host with the md5
  hash of the file that arrives in the guest?
  

Yes, as the file transfer test have been applied, I would rebase this
case to use it.

   +
   +# Transfer file from host to guest
   +bg = kvm_test_utils.BackgroundTest(transfer_test,
   +   (vm, /tmp/file, guest_path,
   +transfer_timeout))
   +bg.start()
   +
   +while bg.is_alive():
   +logging.info(File transfer is not ended, start a round of 
   migration ...)
   +# Migrate the VM
   +dest_vm = kvm_test_utils.migrate(vm, env, mig_timeout, 
   mig_protocol, False)
   +vm = dest_vm
   +finally:
   +if bg: bg.join()
   +session.close()
   +utils.run(rm -rf /tmp/zero)
  --
  To unsubscribe from this list: send the line unsubscribe kvm in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] Launch other test during migration

2010-11-01 Thread Jason Wang
Michael Goldish writes:
  On 09/25/2010 11:36 AM, Jason Wang wrote:
   We could give a further test of migration by launch test during migartion. 
   So
   the following series implements:
   
   - A simple class to run a specified test in the background which could be 
   used
   to launch other test during migartion. Its design is rather simple and its 
   usage
   is a little tricky, but it work well.
   - Two sample tests which take advantages of the background class: Test 
   reboot
   during guest migration and test file_transfer during guest migration.
   
   In the future, we could even lauch autotest client test during guest 
   migation.
   
   ---
   
   Jason Wang (3):
 KVM Test: Introduce a helper class to run a test in the background
 KVM test: Test reboot during migration
 KVM test: Test the file transfer during migartion
   
   
client/tests/kvm/kvm_test_utils.py |   44 +++
.../kvm/tests/migration_with_file_transfer.py  |   59 
   
client/tests/kvm/tests/migration_with_reboot.py|   45 +++
client/tests/kvm/tests_base.cfg.sample |   12 
4 files changed, 159 insertions(+), 1 deletions(-)
create mode 100644 client/tests/kvm/tests/migration_with_file_transfer.py
create mode 100644 client/tests/kvm/tests/migration_with_reboot.py
  
  It seems to me that this method is only applicable to tests/functions
  that don't require a VM object (i.e. that require only a shell session
  object).  kvm_test_utils.reboot() operates on a VM object, and the same
  VM is destroyed by migrate() which runs in the background, so eventually
  reboot() tries logging into a destroyed VM, which fails because
  vm.get_address() fails.  Any monitor operation will also fail.  If the
  autotest wrapper requires a VM object (currently it does) then it can't
  be used either.
  

You are right and that's an issue when running test in parallel with
migration, but reboot through shell should work. The aim of this kind
of test is just to add some stress ( such as run_autotest) during
migartion, so most (probably all) of the tests only needs a
session. So I think it's not a very big issue in this situation.

  An alternative (somewhat ugly) way to migrate in the background is to
  pass a boolean 'migrate' flag to various functions/tests, such as
  reboot() and run_autotest().  If 'migrate' is True, these functions will
  do something like
  
  vm = kvm_test_utils.migrate(vm, ...)
  
  in their waiting loops, where wait_for() is normally used.  This
  guarantees that 'vm' is always a valid VM object.  For example:
  
  # Log in after reboot
  while time.time()  end_time:
  if migrate_in_bg:
  vm = kvm_test_utils.migrate(vm, ...)
  session = vm.remote_login()
  if session:
  break
  time.sleep(2)
  
  This is much longer than the usual wait_for(...) but it does the job.
  What do you think?

This makes sense but it would let testcases need to care about the
migration and it's also hard to put all related codes into a wrapper
which would complicate the codes.

Despite the issue of vm object, all tests that only depends on the
shell session should works well with my method and it's more easy to
be extended. Maybe we could just warn the user of its usage and adapt
my method?

  --
  To unsubscribe from this list: send the line unsubscribe kvm in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] powerpc: kvm: fix information leak to userland

2010-11-01 Thread Marcelo Tosatti
On Sat, Oct 30, 2010 at 01:04:24PM +0400, Vasiliy Kulikov wrote:
 Structure kvm_ppc_pvinfo is copied to userland with flags and
 pad fields unitialized.  It leads to leaking of contents of
 kernel stack memory.
 
 Signed-off-by: Vasiliy Kulikov sego...@gmail.com
 ---
  I cannot compile this driver, so it is not tested at all.
 
  arch/powerpc/kvm/powerpc.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html