Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-27 Thread Yunhong Jiang
On Mon, Oct 26, 2015 at 09:37:14PM -0600, Alex Williamson wrote:
> On Mon, 2015-10-26 at 18:20 -0700, Yunhong Jiang wrote:
> > An option to force the VFIO PCI MSI/MSI-X handler to be a non-threaded
> > IRQ, even when CONFIG_IRQ_FORCED_THREADING=y. This is useful when
> > assigning a device to a guest with low latency requirements, since it
> > reduces context switches to/from the IRQ thread.
> 
> Is there any way we can do this automatically?  Perhaps detecting that
> we're on a RT kernel or maybe that the user is running with RT priority?
> I find that module options are mostly misunderstood and misused.

Alex, thanks for review.

It's not easy to detect whether the user is running with RT priority, since 
sometimes the user starts the thread and only sets the scheduler priority 
later.

Also, should we do this only for the in-kernel irqchip scenario and not for a 
userspace handler, since the in-kernel irqchip has lower overhead?

> 
> > An experiment was conducted on a HSW platform for 1 minute, with the
> > guest vCPU bound to an isolated pCPU. The assigned device triggered an
> > interrupt every 1ms. The average EXTERNAL_INTERRUPT exit handling time
> > dropped from 5.3us to 2.2us.
> > 
> > Another choice is to change the VFIO_DEVICE_SET_IRQS ioctl to apply this
> > option only to specific devices when the in-kernel irq_chip is enabled. It
> > provides more flexibility but is more complex; not sure if we need to go
> > that way.
> 
> Allowing the user to decide whether or not to use a threaded IRQ seems
> like a privilege violation; a chance for the user to game the system and
> give themselves better latency, maybe at the cost of others.  I think

Yes, you are right. One benefit of the ioctl change is that it provides a 
per-device option and is thus more flexible.

> we're better off trying to infer the privilege from the task priority or

I'd think the system admin may make this decision after some tuning; like you 
said, it is "maybe at the cost of others", and I'm not sure we should make the 
decision based on task priority or kernel config.

Thanks
--jyh 

> kernel config or, if we run out of options, make a module option as you
> have here requiring the system admin to provide the privilege.  Thanks,
> 
> Alex
> 
> 
> > Signed-off-by: Yunhong Jiang 
> > ---
> >  drivers/vfio/pci/vfio_pci_intrs.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> > b/drivers/vfio/pci/vfio_pci_intrs.c
> > index 1f577b4..ca1f95a 100644
> > --- a/drivers/vfio/pci/vfio_pci_intrs.c
> > +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> > @@ -22,9 +22,13 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include "vfio_pci_private.h"
> >  
> > +static bool nonthread_msi = 1;
> > +module_param(nonthread_msi, bool, 0444);
> > +
> >  /*
> >   * INTx
> >   */
> > @@ -313,6 +317,7 @@ static int vfio_msi_set_vector_signal(struct 
> > vfio_pci_device *vdev,
> > char *name = msix ? "vfio-msix" : "vfio-msi";
> > struct eventfd_ctx *trigger;
> > int ret;
> > +   unsigned long irqflags = 0;
> >  
> > if (vector >= vdev->num_ctx)
> > return -EINVAL;
> > @@ -352,7 +357,10 @@ static int vfio_msi_set_vector_signal(struct 
> > vfio_pci_device *vdev,
> > pci_write_msi_msg(irq, &msg);
> > }
> >  
> > -   ret = request_irq(irq, vfio_msihandler, 0,
> > +   if (nonthread_msi)
> > +   irqflags = IRQF_NO_THREAD;
> > +
> > +   ret = request_irq(irq, vfio_msihandler, irqflags,
> >   vdev->ctx[vector].name, trigger);
> > if (ret) {
> > kfree(vdev->ctx[vector].name);
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v3 2/3] KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()

2015-10-27 Thread Pavel Fedin
 Hello!

> > --- cut ---
> > Additionally, remove unnecessary vgic_set_lr() and LR_STATE_PENDING check
> > in vgic_unqueue_irqs(), because all these things are now done by the
> > following vgic_retire_lr().
> > --- cut ---
> 
> This does not explain the question I'm raising.
> 
> After applying this patch, and before applying your next patch,
> unqueueing an IRQ will not restore the pending state on the
> distributor, but just throw that piece of state away

 It will restore the state, not throw it away.
 I guess I'm just not being clear enough and you misunderstand me. This check in 
vgic_unqueue_irqs() has been redundant from the beginning.
Let's look at current vgic_retire_lr():
https://git.kernel.org/cgit/linux/kernel/git/kvmarm/kvmarm.git/tree/virt/kvm/arm/vgic.c?h=next#n1099
 It already does the LR_STATE_PENDING check and pushback by itself, since 
cff9211eb1a1f58ce7f5a2d596b617928fd4be0e (it's your commit,
BTW), so this check:
https://git.kernel.org/cgit/linux/kernel/git/kvmarm/kvmarm.git/tree/virt/kvm/arm/vgic.c?h=next#n728
 is already redundant. So this is actually a separate change, and perhaps it was 
my fault to squash it in.

> which breaks bisectability and makes it impossible to understand the logic by 
> looking
> at this commit in isolation.

 Would this be understood better if I made this particular refactoring a 
separate commit, with better explanations?

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia



[PATCH v4 1/3] KVM: arm/arm64: Optimize away redundant LR tracking

2015-10-27 Thread Pavel Fedin
Currently we use vgic_irq_lr_map in order to track which LRs hold which
IRQs, and the lr_used bitmap in order to track which LRs are used or free.

vgic_irq_lr_map is actually used only for the piggy-back optimization, and
can easily be replaced by iteration over lr_used. This is good because in
the future, when LPI support is introduced, the number of IRQs will grow to
at least 16384, while numbers from 1024 to 8192 are never going to be used.
This would be a huge waste of memory.

In turn, lr_used is itself completely redundant since
ae705930fca6322600690df9dc1c7d0516145a93 ("arm/arm64: KVM: Keep elrsr/aisr
in sync with software model"), because together with lr_used we also update
elrsr. This allows us to easily replace lr_used with elrsr, inverting all
conditions (because in elrsr '1' means 'free').
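The inverted-bitmap idea can be sketched in plain C: a set ELRSR-style bit means
the LR is free, so "used" LRs are found by walking clear bits. All names below
are invented for illustration; this is a userspace toy model, not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: no separate lr_used bitmap; derive "used" from an
 * ELRSR-style mask where a SET bit means the LR is EMPTY/free. */

#define NR_LR 4

/* Mark LR 'lr' as occupied: clear its "empty" bit. */
static inline uint64_t lr_occupy(uint64_t elrsr, int lr)
{
	return elrsr & ~(1ULL << lr);
}

/* Mark LR 'lr' as free again: set its "empty" bit. */
static inline uint64_t lr_release(uint64_t elrsr, int lr)
{
	return elrsr | (1ULL << lr);
}

/* Count occupied LRs by iterating over *clear* bits, mirroring the
 * for_each_set_bit(lr_used) -> for_each_clear_bit(elrsr) inversion. */
static inline int lr_count_used(uint64_t elrsr)
{
	int lr, used = 0;

	for (lr = 0; lr < NR_LR; lr++)
		if (!(elrsr & (1ULL << lr)))
			used++;
	return used;
}
```

At reset the mask is all-ones (every LR free), which matches the
`vgic_elrsr = ~0` initialization added by the patch below.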

Signed-off-by: Pavel Fedin 
---
 include/kvm/arm_vgic.h |  6 --
 virt/kvm/arm/vgic-v2.c |  1 +
 virt/kvm/arm/vgic-v3.c |  1 +
 virt/kvm/arm/vgic.c| 53 ++
 4 files changed, 17 insertions(+), 44 deletions(-)

diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 8065801..3936bf8 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -295,9 +295,6 @@ struct vgic_v3_cpu_if {
 };
 
 struct vgic_cpu {
-   /* per IRQ to LR mapping */
-   u8  *vgic_irq_lr_map;
-
/* Pending/active/both interrupts on this VCPU */
DECLARE_BITMAP(pending_percpu, VGIC_NR_PRIVATE_IRQS);
DECLARE_BITMAP(active_percpu, VGIC_NR_PRIVATE_IRQS);
@@ -308,9 +305,6 @@ struct vgic_cpu {
unsigned long   *active_shared;
unsigned long   *pend_act_shared;
 
-   /* Bitmap of used/free list registers */
-   DECLARE_BITMAP(lr_used, VGIC_V2_MAX_LRS);
-
/* Number of list registers on this CPU */
int nr_lr;
 
diff --git a/virt/kvm/arm/vgic-v2.c b/virt/kvm/arm/vgic-v2.c
index 8d7b04d..c0f5d7f 100644
--- a/virt/kvm/arm/vgic-v2.c
+++ b/virt/kvm/arm/vgic-v2.c
@@ -158,6 +158,7 @@ static void vgic_v2_enable(struct kvm_vcpu *vcpu)
 * anyway.
 */
vcpu->arch.vgic_cpu.vgic_v2.vgic_vmcr = 0;
+   vcpu->arch.vgic_cpu.vgic_v2.vgic_elrsr = ~0;
 
/* Get the show on the road... */
vcpu->arch.vgic_cpu.vgic_v2.vgic_hcr = GICH_HCR_EN;
diff --git a/virt/kvm/arm/vgic-v3.c b/virt/kvm/arm/vgic-v3.c
index 7dd5d62..92003cb 100644
--- a/virt/kvm/arm/vgic-v3.c
+++ b/virt/kvm/arm/vgic-v3.c
@@ -193,6 +193,7 @@ static void vgic_v3_enable(struct kvm_vcpu *vcpu)
 * anyway.
 */
vgic_v3->vgic_vmcr = 0;
+   vgic_v3->vgic_elrsr = ~0;
 
/*
 * If we are emulating a GICv3, we do it in an non-GICv2-compatible
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index d4669eb..265a410 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -108,6 +108,7 @@ static void vgic_retire_disabled_irqs(struct kvm_vcpu 
*vcpu);
 static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu);
 static struct vgic_lr vgic_get_lr(const struct kvm_vcpu *vcpu, int lr);
 static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr, struct vgic_lr lr_desc);
+static u64 vgic_get_elrsr(struct kvm_vcpu *vcpu);
 static struct irq_phys_map *vgic_irq_map_search(struct kvm_vcpu *vcpu,
int virt_irq);
 static int compute_pending_for_cpu(struct kvm_vcpu *vcpu);
@@ -691,9 +692,11 @@ bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio 
*mmio,
 void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 {
	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+	u64 elrsr = vgic_get_elrsr(vcpu);
+	unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
int i;
 
-   for_each_set_bit(i, vgic_cpu->lr_used, vgic_cpu->nr_lr) {
+   for_each_clear_bit(i, elrsr_ptr, vgic_cpu->nr_lr) {
struct vgic_lr lr = vgic_get_lr(vcpu, i);
 
/*
@@ -1098,7 +1101,6 @@ static inline void vgic_enable(struct kvm_vcpu *vcpu)
 
 static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu)
 {
-	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
 
/*
@@ -1112,8 +1114,6 @@ static void vgic_retire_lr(int lr_nr, int irq, struct 
kvm_vcpu *vcpu)
 
vlr.state = 0;
vgic_set_lr(vcpu, lr_nr, vlr);
-   clear_bit(lr_nr, vgic_cpu->lr_used);
-   vgic_cpu->vgic_irq_lr_map[irq] = LR_EMPTY;
vgic_sync_lr_elrsr(vcpu, lr_nr, vlr);
 }
 
@@ -1128,10 +1128,11 @@ static void vgic_retire_lr(int lr_nr, int irq, struct 
kvm_vcpu *vcpu)
  */
 static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu)
 {
-	struct vgic_cpu *vgic_cpu = &vcpu->arch.vgic_cpu;
+	u64 elrsr = vgic_get_elrsr(vcpu);
+	unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
int lr;
 
-   for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr) {
+   for_each_clear_bit(lr, elrsr_ptr, vgic->nr_lr) {
struct vgic_lr vlr = 

[PATCH v4 2/3] KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings

2015-10-27 Thread Pavel Fedin
1. Remove unnecessary 'irq' argument, because irq number can be retrieved
   from the LR.
2. Since cff9211eb1a1f58ce7f5a2d596b617928fd4be0e
   ("arm/arm64: KVM: Fix arch timer behavior for disabled interrupts ")
   LR_STATE_PENDING is queued back by vgic_retire_lr() itself. Also, it
   clears vlr.state itself. Therefore, we remove the same, now duplicated,
   check with all accompanying bit manipulations from vgic_unqueue_irqs().
3. vgic_retire_lr() is always accompanied by vgic_irq_clear_queued(). Since
   it already does more than just clearing the LR, move
   vgic_irq_clear_queued() inside of it.

Signed-off-by: Pavel Fedin 
---
 virt/kvm/arm/vgic.c | 37 ++---
 1 file changed, 10 insertions(+), 27 deletions(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 265a410..96e45f3 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -105,7 +105,7 @@
 #include "vgic.h"
 
 static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu);
-static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu);
+static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu);
 static struct vgic_lr vgic_get_lr(const struct kvm_vcpu *vcpu, int lr);
 static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr, struct vgic_lr lr_desc);
 static u64 vgic_get_elrsr(struct kvm_vcpu *vcpu);
@@ -717,30 +717,14 @@ void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
 * interrupt then move the active state to the
 * distributor tracking bit.
 */
-   if (lr.state & LR_STATE_ACTIVE) {
+   if (lr.state & LR_STATE_ACTIVE)
vgic_irq_set_active(vcpu, lr.irq);
-   lr.state &= ~LR_STATE_ACTIVE;
-   }
 
/*
 * Reestablish the pending state on the distributor and the
-* CPU interface.  It may have already been pending, but that
-* is fine, then we are only setting a few bits that were
-* already set.
+* CPU interface and mark the LR as free for other use.
 */
-   if (lr.state & LR_STATE_PENDING) {
-   vgic_dist_irq_set_pending(vcpu, lr.irq);
-   lr.state &= ~LR_STATE_PENDING;
-   }
-
-   vgic_set_lr(vcpu, i, lr);
-
-   /*
-* Mark the LR as free for other use.
-*/
-   BUG_ON(lr.state & LR_STATE_MASK);
-   vgic_retire_lr(i, lr.irq, vcpu);
-   vgic_irq_clear_queued(vcpu, lr.irq);
+   vgic_retire_lr(i, vcpu);
 
/* Finally update the VGIC state. */
vgic_update_state(vcpu->kvm);
@@ -1099,16 +1083,18 @@ static inline void vgic_enable(struct kvm_vcpu *vcpu)
vgic_ops->enable(vcpu);
 }
 
-static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu)
+static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu)
 {
struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
 
+   vgic_irq_clear_queued(vcpu, vlr.irq);
+
/*
 * We must transfer the pending state back to the distributor before
 * retiring the LR, otherwise we may lose edge-triggered interrupts.
 */
if (vlr.state & LR_STATE_PENDING) {
-   vgic_dist_irq_set_pending(vcpu, irq);
+   vgic_dist_irq_set_pending(vcpu, vlr.irq);
vlr.hwirq = 0;
}
 
@@ -1135,11 +1121,8 @@ static void vgic_retire_disabled_irqs(struct kvm_vcpu 
*vcpu)
for_each_clear_bit(lr, elrsr_ptr, vgic->nr_lr) {
struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
 
-   if (!vgic_irq_is_enabled(vcpu, vlr.irq)) {
-   vgic_retire_lr(lr, vlr.irq, vcpu);
-   if (vgic_irq_is_queued(vcpu, vlr.irq))
-   vgic_irq_clear_queued(vcpu, vlr.irq);
-   }
+   if (!vgic_irq_is_enabled(vcpu, vlr.irq))
+   vgic_retire_lr(lr, vcpu);
}
 }
 
-- 
2.4.4



[PATCH v4 3/3] KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()

2015-10-27 Thread Pavel Fedin
Now we see that vgic_set_lr() and vgic_sync_lr_elrsr() are always used
together. Merge them into one function, saving from second vgic_ops
dereferencing every time.

Signed-off-by: Pavel Fedin 
---
 include/kvm/arm_vgic.h |  1 -
 virt/kvm/arm/vgic-v2.c |  5 -
 virt/kvm/arm/vgic-v3.c |  5 -
 virt/kvm/arm/vgic.c| 14 ++
 4 files changed, 2 insertions(+), 23 deletions(-)

diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
index 3936bf8..f62addc 100644
--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -112,7 +112,6 @@ struct vgic_vmcr {
 struct vgic_ops {
struct vgic_lr  (*get_lr)(const struct kvm_vcpu *, int);
void(*set_lr)(struct kvm_vcpu *, int, struct vgic_lr);
-   void(*sync_lr_elrsr)(struct kvm_vcpu *, int, struct vgic_lr);
u64 (*get_elrsr)(const struct kvm_vcpu *vcpu);
u64 (*get_eisr)(const struct kvm_vcpu *vcpu);
void(*clear_eisr)(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/arm/vgic-v2.c b/virt/kvm/arm/vgic-v2.c
index c0f5d7f..ff02f08 100644
--- a/virt/kvm/arm/vgic-v2.c
+++ b/virt/kvm/arm/vgic-v2.c
@@ -79,11 +79,7 @@ static void vgic_v2_set_lr(struct kvm_vcpu *vcpu, int lr,
lr_val |= (lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT);
 
vcpu->arch.vgic_cpu.vgic_v2.vgic_lr[lr] = lr_val;
-}
 
-static void vgic_v2_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
- struct vgic_lr lr_desc)
-{
if (!(lr_desc.state & LR_STATE_MASK))
vcpu->arch.vgic_cpu.vgic_v2.vgic_elrsr |= (1ULL << lr);
else
@@ -167,7 +163,6 @@ static void vgic_v2_enable(struct kvm_vcpu *vcpu)
 static const struct vgic_ops vgic_v2_ops = {
.get_lr = vgic_v2_get_lr,
.set_lr = vgic_v2_set_lr,
-   .sync_lr_elrsr  = vgic_v2_sync_lr_elrsr,
.get_elrsr  = vgic_v2_get_elrsr,
.get_eisr   = vgic_v2_get_eisr,
.clear_eisr = vgic_v2_clear_eisr,
diff --git a/virt/kvm/arm/vgic-v3.c b/virt/kvm/arm/vgic-v3.c
index 92003cb..487d635 100644
--- a/virt/kvm/arm/vgic-v3.c
+++ b/virt/kvm/arm/vgic-v3.c
@@ -112,11 +112,7 @@ static void vgic_v3_set_lr(struct kvm_vcpu *vcpu, int lr,
}
 
vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)] = lr_val;
-}
 
-static void vgic_v3_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
- struct vgic_lr lr_desc)
-{
if (!(lr_desc.state & LR_STATE_MASK))
vcpu->arch.vgic_cpu.vgic_v3.vgic_elrsr |= (1U << lr);
else
@@ -212,7 +208,6 @@ static void vgic_v3_enable(struct kvm_vcpu *vcpu)
 static const struct vgic_ops vgic_v3_ops = {
.get_lr = vgic_v3_get_lr,
.set_lr = vgic_v3_set_lr,
-   .sync_lr_elrsr  = vgic_v3_sync_lr_elrsr,
.get_elrsr  = vgic_v3_get_elrsr,
.get_eisr   = vgic_v3_get_eisr,
.clear_eisr = vgic_v3_clear_eisr,
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 96e45f3..fe451d4 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -1032,12 +1032,6 @@ static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr,
vgic_ops->set_lr(vcpu, lr, vlr);
 }
 
-static void vgic_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
-  struct vgic_lr vlr)
-{
-   vgic_ops->sync_lr_elrsr(vcpu, lr, vlr);
-}
-
 static inline u64 vgic_get_elrsr(struct kvm_vcpu *vcpu)
 {
return vgic_ops->get_elrsr(vcpu);
@@ -1100,7 +1094,6 @@ static void vgic_retire_lr(int lr_nr, struct kvm_vcpu 
*vcpu)
 
vlr.state = 0;
vgic_set_lr(vcpu, lr_nr, vlr);
-   vgic_sync_lr_elrsr(vcpu, lr_nr, vlr);
 }
 
 /*
@@ -1162,7 +1155,6 @@ static void vgic_queue_irq_to_lr(struct kvm_vcpu *vcpu, 
int irq,
}
 
vgic_set_lr(vcpu, lr_nr, vlr);
-   vgic_sync_lr_elrsr(vcpu, lr_nr, vlr);
 }
 
 /*
@@ -1340,8 +1332,6 @@ static int process_queued_irq(struct kvm_vcpu *vcpu,
vlr.hwirq = 0;
vgic_set_lr(vcpu, lr, vlr);
 
-   vgic_sync_lr_elrsr(vcpu, lr, vlr);
-
return pending;
 }
 
@@ -1442,8 +1432,6 @@ static void __kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu)
bool level_pending;
 
level_pending = vgic_process_maintenance(vcpu);
-	elrsr = vgic_get_elrsr(vcpu);
-	elrsr_ptr = u64_to_bitmask(&elrsr);
 
/* Deal with HW interrupts, and clear mappings for empty LRs */
for (lr = 0; lr < vgic->nr_lr; lr++) {
@@ -1454,6 +1442,8 @@ static void __kvm_vgic_sync_hwstate(struct kvm_vcpu *vcpu)
}
 
/* Check if we still have something up our sleeve... */
+	elrsr = vgic_get_elrsr(vcpu);
+	elrsr_ptr = u64_to_bitmask(&elrsr);
pending = find_first_zero_bit(elrsr_ptr, vgic->nr_lr);
if (level_pending || pending < vgic->nr_lr)
set_bit(vcpu->vcpu_id, dist->irq_pending_on_cpu);
-- 
2.4.4


[PATCH v4 0/3] KVM: arm/arm64: Clean up some obsolete code

2015-10-27 Thread Pavel Fedin
Current KVM code has lots of old redundancies, which can be cleaned up.
This patchset is actually a better alternative to
http://www.spinics.net/lists/arm-kernel/msg430726.html, which allows us to
keep piggy-backed LRs. The idea is based on the fact that our code also
maintains LR state in elrsr, and this information is enough to track LR
usage.

In case of problems this series can be applied partially, each patch is
a complete refactoring step on its own.

Thanks to Andre Przywara for pinpointing some 4.3+ specifics.

This version has been tested on SMDK5410 development board
(Exynos5410 SoC).

v3 => v4:
- Reordered changes for purpose of better understanding and bisection. All
  changes related to vgic_retire_lr() are gathered in one patch now.

v2 => v3:
- Removed two unused variables in __kvm_vgic_flush_hwstate(), overlooked
  leftover from v1.

v1 => v2:
- Rebased to kvmarm/next of 23.10.2015.
- Do not use vgic_retire_lr() for initializing ELRSR bitmask, because now
  it also handles pushback of PENDING state, use direct initialization
  instead (copied from Andre's patchset).
- Took more care about vgic_retire_lr(), which has deserved own patch.

Pavel Fedin (3):
  KVM: arm/arm64: Optimize away redundant LR tracking
  KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
  KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()

 include/kvm/arm_vgic.h |   7 
 virt/kvm/arm/vgic-v2.c |   6 +--
 virt/kvm/arm/vgic-v3.c |   6 +--
 virt/kvm/arm/vgic.c| 104 +
 4 files changed, 29 insertions(+), 94 deletions(-)

-- 
2.4.4



[PATCH] KVM: arm: Fix crash in free_hyp_pgds() if timer initialization fails

2015-10-27 Thread Pavel Fedin
After vGIC initialization succeeded, and timer initialization failed,
the following crash can be observed on ARM32:

kvm [1]: interrupt-controller@10484000 IRQ57
kvm [1]: kvm_arch_timer: can't find DT node
Unable to handle kernel paging request at virtual address 90484000
pgd = c0003000
[90484000] *pgd=8040006003, *pmd=
Internal error: Oops: 2a06 [#1] PREEMPT SMP ARM
...
[] (v7_flush_kern_dcache_area) from [] 
(kvm_flush_dcache_pte+0x48/0x5c)
[] (kvm_flush_dcache_pte) from [] (unmap_range+0x24c/0x460)
[] (unmap_range) from [] (free_hyp_pgds+0x84/0x160)
[] (free_hyp_pgds) from [] (kvm_arch_init+0x254/0x41c)
[] (kvm_arch_init) from [] (kvm_init+0x28/0x2b4)
[] (kvm_init) from [] (do_one_initcall+0x9c/0x200)

This happens when unmapping reaches the mapped vGIC control registers. The
root of the problem seems to be a combination of two facts:
1. vGIC control region is defined in device trees as having size of
   0x2000. But the specification defines only registers up to 0x1FC,
   therefore it is only one page, not two.
2. unmap_ptes() is expected to recognize device memory and omit cache
   flushing. However, it tests only for PAGE_S2_DEVICE, while devices
   mapped for HYP mode have PAGE_HYP_DEVICE, which is different.
   Therefore, a cache flush is attempted, and it dies when hitting the
   nonexistent second page.

This patch fixes the problem by adding missing recognition of
PAGE_HYP_DEVICE protection value.
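The masked-compare test the fix relies on can be illustrated with a small
hedged sketch. The TOY_* protection values below are made up; the real
PAGE_S2_DEVICE/PAGE_HYP_DEVICE encodings are architecture-specific multi-bit
values, which is exactly why the test is `(pte & P) == P` and not `pte & P`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical multi-bit protection values for the demo only. */
#define TOY_S2_DEVICE	0x05u
#define TOY_HYP_DEVICE	0x0Cu

/* A mapping counts as "device" only if ALL bits of the protection
 * value are present in the PTE. */
static bool pte_is_device(uint32_t pte)
{
	return ((pte & TOY_S2_DEVICE)  == TOY_S2_DEVICE) ||
	       ((pte & TOY_HYP_DEVICE) == TOY_HYP_DEVICE);
}

/* The shape of the fix: skip the dcache flush for either flavour of
 * device mapping, instead of only the stage-2 one. */
static bool needs_dcache_flush(uint32_t pte)
{
	return !pte_is_device(pte);
}
```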

The crash can be observed on Exynos 5410 (and probably on all Exynos5
family) with stock device trees (using MCT) and CONFIG_KVM enabled.

Signed-off-by: Pavel Fedin 
---
 arch/arm/kvm/mmu.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 7b42012..839dd970 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -213,7 +213,10 @@ static void unmap_ptes(struct kvm *kvm, pmd_t *pmd,
kvm_tlb_flush_vmid_ipa(kvm, addr);
 
/* No need to invalidate the cache for device mappings 
*/
-   if ((pte_val(old_pte) & PAGE_S2_DEVICE) != 
PAGE_S2_DEVICE)
+   if (((pte_val(old_pte) & PAGE_S2_DEVICE)
+!= PAGE_S2_DEVICE) &&
+   ((pte_val(old_pte) & PAGE_HYP_DEVICE)
+!= PAGE_HYP_DEVICE))
kvm_flush_dcache_pte(old_pte);
 
put_page(virt_to_page(pte));
-- 
2.4.4



Re: [ANNOUNCE][RFC] KVMGT - the implementation of Intel GVT-g(full GPU virtualization) for KVM

2015-10-27 Thread Jike Song

Hi all,

We are pleased to announce another update of Intel GVT-g for KVM.

Intel GVT-g is a full GPU virtualization solution with mediated pass-through, 
starting from 4th generation Intel Core(TM) processors with Intel Graphics. 
A virtual GPU instance is maintained for each VM, with part of the 
performance-critical resources directly assigned. The capability of running a 
native graphics driver inside a VM, without hypervisor intervention in 
performance-critical paths, achieves a good balance among performance, features, 
and sharing capability. KVM is supported by Intel GVT-g (a.k.a. KVMGT).


Repositories

Kernel: https://github.com/01org/igvtg-kernel (2015q3-3.18.0 branch)
Qemu: https://github.com/01org/igvtg-qemu (kvmgt_public2015q3 branch)


This update consists of:

- KVMGT is now merged with XenGT in unified repositories (kernel and QEMU), 
  though currently in different branches for QEMU. KVMGT and XenGT share the 
  same iGVT-g core logic.
- PPGTT is supported, hence the Windows guest support
- KVMGT now supports both 4th generation (Haswell) and 5th generation 
  (Broadwell) Intel Core(TM) processors
- 2D/3D/media decoding has been validated on Ubuntu 14.04 and 
  Windows 7/Windows 8.1

Next update will be around early Jan, 2016.

Known issues:

- At least 2GB of memory is suggested for the VM to run most 3D workloads.
- 3DMark06 running in a Windows VM may have some stability issues.
- Using VLC to play an .ogg file may cause mosaic artifacts or slow response.


Please subscribe to the mailing list to report BUGs, discuss, and/or contribute:

https://lists.01.org/mailman/listinfo/igvt-g

More information about Intel GVT-g background, architecture, etc. can be found 
at (may not be up to date):

https://01.org/igvt-g
http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf
https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian


Note:

The KVMGT project should be considered a work in progress. As such, it is not a 
complete product, nor should it be considered one. Extra care should be taken 
when testing and configuring a system to use the KVMGT project.


--
Thanks,
Jike

On 12/04/2014 10:24 AM, Jike Song wrote:

Hi all,

   We are pleased to announce the first release of KVMGT project. KVMGT is the 
implementation of Intel GVT-g technology, a full GPU virtualization solution. 
Under Intel GVT-g, a virtual GPU instance is maintained for each VM, with part 
of performance critical resources directly assigned. The capability of running 
native graphics driver inside a VM, without hypervisor intervention in 
performance critical paths, achieves a good balance of performance, feature, 
and sharing capability.


   KVMGT is still in the early stage:

- Basic functions of full GPU virtualization work; the guest can see a 
full-featured vGPU.
  We ran several 3D workloads such as lightsmark, nexuiz, urbanterror and 
warsow.

- Only Linux guests are supported so far, and PPGTT must be disabled in the 
guest through a
  kernel parameter (see README.kvmgt in QEMU).

- This drop also includes some Xen specific changes, which will be cleaned 
up later.

- Our end goal is to upstream both XenGT and KVMGT, which share ~90% of the 
vGPU
  device model logic (which will be part of the i915 driver), differing only 
in hypervisor-
  specific services

- Insufficient test coverage, so please bear with stability issues :)



   There are things that need to be improved, esp. the KVM interfacing part:

1   a domid was added to each KVMGT guest

An ID is needed for foreground OS switching, e.g.

# echo <domid> > /sys/kernel/vgt/control/foreground_vm

domid 0 is reserved for host OS.


2   SRCU workarounds.

Some KVM functions, such as:

kvm_io_bus_register_dev
install_new_memslots

must be called *without* &kvm->srcu read-locked. Otherwise it 
hangs.

In KVMGT, we need to register an iodev only *after* BAR 
registers are
written by the guest. That means we already hold &kvm->srcu -
trapping/emulating PIO (BAR registers) puts us in such a 
condition.
That makes kvm_io_bus_register_dev hang.

Currently we have to disable rcu_assign_pointer() in such 
functions.

These were dirty workarounds, your suggestions are high welcome!


3   syscalls were called to access "/dev/mem" from the kernel

An in-kernel memslot was added for the aperture, but syscalls 
like
open and mmap are used to open and access the character device 
"/dev/mem"
for pass-through.




The source codes(kernel, qemu as well as seabios) are available at github:

git://github.com/01org/KVMGT-kernel
git://github.com/01org/KVMGT-qemu
git://github.com/01org/KVMGT-seabios


[PATCH] vhost: fix performance on LE hosts

2015-10-27 Thread Michael S. Tsirkin
commit 2751c9882b947292fcfb084c4f604e01724af804 ("vhost: cross-endian
support for legacy devices") introduced a minor regression: even with
cross-endian disabled, and even on an LE host, vhost_is_little_endian is
checking the is_le flag, so there's always a branch.

To fix, simply check virtio_legacy_is_little_endian first.
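Why checking the constant first helps can be sketched as follows. This is only
a model of the pattern, not the vhost code, and all names are invented for the
demo: when the first operand of `||` is a compile-time constant true, the
compiler folds the expression and never loads the runtime flag.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vq { bool is_le; };

/* Stand-in for virtio_legacy_is_little_endian(): a constant at build
 * time, true on a little-endian host. */
static inline bool legacy_is_le(void)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
	return true;
#else
	return false;
#endif
}

/* Checking the constant first lets an LE build fold this whole
 * function to 'true', with no branch on vq->is_le. */
static inline bool toy_is_little_endian(const struct toy_vq *vq)
{
	return legacy_is_le() || vq->is_le;
}
```

The actual patch below goes further and compiles the `is_le` check out
entirely when CONFIG_VHOST_CROSS_ENDIAN_LEGACY is disabled.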

Cc: Greg Kurz 
Signed-off-by: Michael S. Tsirkin 
---
 drivers/vhost/vhost.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..d3f7674 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -183,10 +183,17 @@ static inline bool vhost_has_feature(struct 
vhost_virtqueue *vq, int bit)
return vq->acked_features & (1ULL << bit);
 }
 
+#ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
 static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
 {
return vq->is_le;
 }
+#else
+static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
+{
+   return virtio_legacy_is_little_endian() || vq->is_le;
+}
+#endif
 
 /* Memory accessors */
 static inline u16 vhost16_to_cpu(struct vhost_virtqueue *vq, __virtio16 val)
-- 
MST


Re: [PATCH] vhost: fix performance on LE hosts

2015-10-27 Thread Greg Kurz
On Tue, 27 Oct 2015 11:37:39 +0200
"Michael S. Tsirkin"  wrote:
> commit 2751c9882b947292fcfb084c4f604e01724af804 ("vhost: cross-endian
> support for legacy devices") introduced a minor regression: even with
> cross-endian disabled, and even on LE host, vhost_is_little_endian is
> checking is_le flag so there's always a branch.
> 
> To fix, simply check virtio_legacy_is_little_endian first.
> 
> Cc: Greg Kurz 
> Signed-off-by: Michael S. Tsirkin 

Oops my bad for this oversight...

Reviewed-by: Greg Kurz 

> ---
>  drivers/vhost/vhost.h | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 4772862..d3f7674 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -183,10 +183,17 @@ static inline bool vhost_has_feature(struct 
> vhost_virtqueue *vq, int bit)
>   return vq->acked_features & (1ULL << bit);
>  }
> 
> +#ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
>  static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
>  {
>   return vq->is_le;
>  }
> +#else
> +static inline bool vhost_is_little_endian(struct vhost_virtqueue *vq)
> +{
> + return virtio_legacy_is_little_endian() || vq->is_le;
> +}
> +#endif
> 
>  /* Memory accessors */
>  static inline u16 vhost16_to_cpu(struct vhost_virtqueue *vq, __virtio16 val)



Re: Network hangs when communicating with host

2015-10-27 Thread Will Deacon
[apologies for the delay -- I've been off for a week and am catching up
 on email]

On Tue, Oct 20, 2015 at 09:58:51AM -0400, Sasha Levin wrote:
> On 10/20/2015 09:42 AM, Dmitry Vyukov wrote:
> > I now have another issue. My binary fails to mmap a file within lkvm
> > sandbox. The same binary works fine on host and in qemu. I've added
> > strace into sandbox script, and here is the output:
> > 
> > [pid   837] openat(AT_FDCWD, "syzkaller-shm048878722", O_RDWR|O_CLOEXEC) = 5
> > [pid   837] mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_SHARED, 5,
> > 0) = -1 EINVAL (Invalid argument)
> > 
> > I don't see anything that can potentially cause EINVAL here. Is it
> > possible that lkvm somehow affects kernel behavior here?
> > 
> > I run lkvm as:
> > 
> > $ taskset 1 /kvmtool/lkvm sandbox --disk syz-0 --mem=2048 --cpus=2
> > --kernel /arch/x86/boot/bzImage --network mode=user --sandbox
> > /workdir/kvm/syz-0.sh
> 
> It's possible that something in the virtio-9p layer is broken. I'll give
> it a look in the evening.

I ended up with the patch below, but it's really ugly and I didn't get
round to posting it.

Will

--->8

From 7cbcdfef1b9f094db4bf75676f22339f3164e103 Mon Sep 17 00:00:00 2001
From: Will Deacon 
Date: Fri, 17 Apr 2015 17:31:36 +0100
Subject: [PATCH] kvmtool: virtio-net: fix VIRTIO_NET_F_MRG_RXBUF usage in rx
 thread

When merging virtio-net buffers using the VIRTIO_NET_F_MRG_RXBUF feature,
the first buffer added to the used ring should indicate the total number
of buffers used to hold the packet. Unfortunately, kvmtool has a number
of issues when constructing these merged buffers:

  - Commit 5131332e3f1a ("kvmtool: convert net backend to support
bi-endianness") introduced a strange loop counter, which resulted in
hdr->num_buffers being set redundantly the first time round

  - When adding the buffers to the ring, we actually add them one-by-one,
allowing the guest to see the header before we've inserted the rest
of the data buffers...

  - ... which is made worse because we non-atomically increment the
num_buffers count in the header each time we insert a new data buffer

Consequently, the guest quickly becomes confused in its net rx code and
the whole thing grinds to a halt. This is easily exemplified by trying
to boot a root filesystem over NFS, which seldom succeeds.

This patch resolves the issues by allowing us to insert items into the
used ring without updating the index. Once the full payload has been
added and num_buffers corresponds to the total size, we *then* publish
the buffers to the guest.

Cc: Marc Zyngier 
Cc: Sasha Levin 
Signed-off-by: Will Deacon 
---
 include/kvm/virtio.h |  2 ++
 virtio/core.c| 32 +---
 virtio/net.c | 31 +++
 3 files changed, 42 insertions(+), 23 deletions(-)

diff --git a/include/kvm/virtio.h b/include/kvm/virtio.h
index 768ee9668d44..8324ba7d38be 100644
--- a/include/kvm/virtio.h
+++ b/include/kvm/virtio.h
@@ -112,6 +112,8 @@ static inline bool virt_queue__available(struct virt_queue 
*vq)
return virtio_guest_to_host_u16(vq, vq->vring.avail->idx) != 
vq->last_avail_idx;
 }
 
+void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump);
+struct vring_used_elem * virt_queue__set_used_elem_no_update(struct virt_queue 
*queue, u32 head, u32 len, u16 offset);
 struct vring_used_elem *virt_queue__set_used_elem(struct virt_queue *queue, 
u32 head, u32 len);
 
 bool virtio_queue__should_signal(struct virt_queue *vq);
diff --git a/virtio/core.c b/virtio/core.c
index 3b6e4d7cd045..d6ac289d450e 100644
--- a/virtio/core.c
+++ b/virtio/core.c
@@ -21,22 +21,17 @@ const char* virtio_trans_name(enum virtio_trans trans)
return "unknown";
 }
 
-struct vring_used_elem *virt_queue__set_used_elem(struct virt_queue *queue, 
u32 head, u32 len)
+void virt_queue__used_idx_advance(struct virt_queue *queue, u16 jump)
 {
-   struct vring_used_elem *used_elem;
u16 idx = virtio_guest_to_host_u16(queue, queue->vring.used->idx);
 
-   used_elem   = &queue->vring.used->ring[idx % queue->vring.num];
-   used_elem->id   = virtio_host_to_guest_u32(queue, head);
-   used_elem->len  = virtio_host_to_guest_u32(queue, len);
-
/*
 * Use wmb to assure that used elem was updated with head and len.
 * We need a wmb here since we can't advance idx unless we're ready
 * to pass the used element to the guest.
 */
wmb();
-   idx++;
+   idx += jump;
queue->vring.used->idx = virtio_host_to_guest_u16(queue, idx);
 
/*
@@ -45,6 +40,29 @@ struct vring_used_elem *virt_queue__set_used_elem(struct 
virt_queue *queue, u32
 * an updated idx.
 */
wmb();
+}
+
+struct vring_used_elem *
+virt_queue__set_used_elem_no_update(struct virt_queue *queue, u32 head,
+   u32 

Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-27 Thread Paolo Bonzini


On 27/10/2015 07:35, Yunhong Jiang wrote:
> On Mon, Oct 26, 2015 at 09:37:14PM -0600, Alex Williamson wrote:
>> On Mon, 2015-10-26 at 18:20 -0700, Yunhong Jiang wrote:
>>> An option to force VFIO PCI MSI/MSI-X handler as non-threaded IRQ,
>>> even when CONFIG_IRQ_FORCED_THREADING=y. This is useful when
>>> assigning a device to a guest with low latency requirements, since it
>>> reduces the context switches to/from the IRQ thread.
>>
>> Is there any way we can do this automatically?  Perhaps detecting that
>> we're on a RT kernel or maybe that the user is running with RT priority?
>> I find that module options are mostly misunderstood and misused.
> 
> Alex, thanks for review.
> 
> It's not easy to detect if the user is running with RT priority, since 
> sometimes the user starts the thread and only sets the scheduler priority 
> later.
> 
> Also, should we do this only for the in-kernel irqchip scenario and not for 
> the userspace handler, since the in-kernel irqchip has lower overhead?

The overhead of the non-threaded IRQ handler is the same for kernel or
userspace irqchip, since the handler just writes 1 to the eventfd.

On RT kernels, however, can you call eventfd_signal from interrupt
context?  You cannot call spin_lock_irqsave (which can sleep) from a
non-threaded interrupt handler, can you?  You would need a raw spin lock.
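For illustration, a non-threaded handler on PREEMPT_RT would have to look roughly like the fragment below. This is a hypothetical sketch, not the actual VFIO code; it assumes the state touched in hardirq context is guarded by a raw_spinlock_t, which stays a true spinning lock on RT (unlike spinlock_t, which becomes a sleeping lock there):

```
/* Hypothetical hardirq handler sketch for PREEMPT_RT. */
static DEFINE_RAW_SPINLOCK(trigger_lock);

static irqreturn_t vfio_msi_hard_handler(int irq, void *arg)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&trigger_lock, flags);	/* never sleeps, even on RT */
	/* ... equivalent of eventfd_signal(trigger, 1) ... */
	raw_spin_unlock_irqrestore(&trigger_lock, flags);
	return IRQ_HANDLED;
}
```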

Paolo


Re: [PATCH] vhost: fix performance on LE hosts

2015-10-27 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Tue, 27 Oct 2015 11:37:39 +0200

> commit 2751c9882b947292fcfb084c4f604e01724af804 ("vhost: cross-endian
> support for legacy devices") introduced a minor regression: even with
> cross-endian disabled, and even on LE host, vhost_is_little_endian is
> checking is_le flag so there's always a branch.
> 
> To fix, simply check virtio_legacy_is_little_endian first.
> 
> Cc: Greg Kurz 
> Signed-off-by: Michael S. Tsirkin 

Michael, do you want me to take this via the 'net' tree?


Re: [PATCH] KVM: PPC: Implement extension to report number of memslots

2015-10-27 Thread Thomas Huth
On 26/10/15 06:15, Paul Mackerras wrote:
> On Fri, Oct 16, 2015 at 08:41:31AM +0200, Thomas Huth wrote:
>> Yes, we'll likely need this soon! 32 slots are not enough...
> 
> Would anyone object if I raised the limit for PPC to 512 slots?

In the long run we should really make this dynamic somehow instead.
But as a first step, that should IMHO be fine. The question is whether we
immediately need 512 on PPC, too. QEMU for x86 features 256 pluggable
memory DIMM slots ("#define ACPI_MAX_RAM_SLOTS 256"), while PPC only has
32 ("#define SPAPR_MAX_RAM_SLOTS 32"). So maybe we are fine with fewer
memory slots on PPC, e.g. 300 or so?

 Thomas




Re: [PATCH] vhost: fix performance on LE hosts

2015-10-27 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 06:32:55AM -0700, David Miller wrote:
> From: "Michael S. Tsirkin" 
> Date: Tue, 27 Oct 2015 11:37:39 +0200
> 
> > commit 2751c9882b947292fcfb084c4f604e01724af804 ("vhost: cross-endian
> > support for legacy devices") introduced a minor regression: even with
> > cross-endian disabled, and even on LE host, vhost_is_little_endian is
> > checking is_le flag so there's always a branch.
> > 
> > To fix, simply check virtio_legacy_is_little_endian first.
> > 
> > Cc: Greg Kurz 
> > Signed-off-by: Michael S. Tsirkin 
> 
> Michael, do you want me to take this via the 'net' tree?

Sure - especially since it looks like I have nothing else for this merge
window (working on fixing shutdown races, but it's slow going).  Thanks
a lot!

-- 
MST


Re: [RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-27 Thread Alex Williamson
[cc +iommu]

On Tue, 2015-10-27 at 15:40 +, Will Deacon wrote:
> On Fri, Oct 16, 2015 at 09:51:22AM -0600, Alex Williamson wrote:
> > On Fri, 2015-10-16 at 16:03 +0200, Eric Auger wrote:
> > > Hi Alex,
> > > On 10/15/2015 10:52 PM, Alex Williamson wrote:
> > > > We can only provide isolation if DMA is forced through the IOMMU
> > > > aperture.  Don't allow type1 to be used if this is not the case.
> > > > 
> > > > Signed-off-by: Alex Williamson 
> > > > ---
> > > > 
> > > > Eric, I see a number of IOMMU drivers enable this, do the ones you
> > > > care about for ARM set geometry.force_aperture?  Thanks,
> > > I am currently using arm-smmu.c. I don't see force_aperture being set.
> > 
> > Hi Will,
> 
> Hi Alex,
> 
> > Would it be possible to add iommu_domain_geometry support to arm-smmu.c?
> > In addition to this test to verify that DMA cannot bypass the IOMMU, I'd
> > eventually like to pass the aperture information out through the VFIO
> > API.  Thanks,
> 
> The slight snag here is that we don't know which SMMU in the system the
> domain is attached to at the point when the geometry is queried, so I
> can't give you an upper bound on the aperture. For example, if there is
> an SMMU with a 32-bit input address and another with a 48-bit input
> address.
> 
> We could play the same horrible games that we do with the pgsize bitmap,
> and truncate some global aperture everytime we probe an SMMU device, but
> I'd really like to have fewer hacks like that if possible. The root of
> the problem is still that domains are allocated for a bus, rather than
> an IOMMU instance.

Hi Will,

Yes, Intel VT-d has this issue as well.  In theory we can have
heterogeneous IOMMU hardware units (DRHDs) in a system and the upper
bound of the geometry could be diminished if we add a less capable DRHD
into the domain.  I suspect this is more a theoretical problem than a
practical one though as we're typically mixing similar DRHDs and I think
we're still capable of 39-bit addressing in the least capable version
per the spec.

In any case, I really want to start testing geometry.force_aperture,
even if we're not yet comfortable to expose the IOMMU limits to the
user.  The vfio type1 shouldn't be enabled at all for underlying
hardware that allows DMA bypass.  Thanks,

Alex
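A sketch of what such a check in the type1 backend might look like (hypothetical fragment, assuming the `iommu_domain_get_attr()`/`DOMAIN_ATTR_GEOMETRY` interface of this era; `out_free` is an illustrative error label, not code from the posted patch):

```
/* Hypothetical: reject IOMMU domains whose geometry does not force all
 * DMA through the translated aperture. */
struct iommu_domain_geometry geo;

if (iommu_domain_get_attr(domain, DOMAIN_ATTR_GEOMETRY, &geo) ||
    !geo.force_aperture) {
	pr_warn("IOMMU hardware can bypass translation; refusing type1\n");
	ret = -EPERM;
	goto out_free;
}
```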



Re: [RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-27 Thread Will Deacon
On Fri, Oct 16, 2015 at 09:51:22AM -0600, Alex Williamson wrote:
> On Fri, 2015-10-16 at 16:03 +0200, Eric Auger wrote:
> > Hi Alex,
> > On 10/15/2015 10:52 PM, Alex Williamson wrote:
> > > We can only provide isolation if DMA is forced through the IOMMU
> > > aperture.  Don't allow type1 to be used if this is not the case.
> > > 
> > > Signed-off-by: Alex Williamson 
> > > ---
> > > 
> > > Eric, I see a number of IOMMU drivers enable this, do the ones you
> > > care about for ARM set geometry.force_aperture?  Thanks,
> > I am currently using arm-smmu.c. I don't see force_aperture being set.
> 
> Hi Will,

Hi Alex,

> Would it be possible to add iommu_domain_geometry support to arm-smmu.c?
> In addition to this test to verify that DMA cannot bypass the IOMMU, I'd
> eventually like to pass the aperture information out through the VFIO
> API.  Thanks,

The slight snag here is that we don't know which SMMU in the system the
domain is attached to at the point when the geometry is queried, so I
can't give you an upper bound on the aperture. For example, if there is
an SMMU with a 32-bit input address and another with a 48-bit input
address.

We could play the same horrible games that we do with the pgsize bitmap,
and truncate some global aperture every time we probe an SMMU device, but
I'd really like to have fewer hacks like that if possible. The root of
the problem is still that domains are allocated for a bus, rather than
an IOMMU instance.

Will


Re: [PATCH 0/3] kvmtool: don't overwrite /virt/init

2015-10-27 Thread Will Deacon
Hi Oleg,

On Thu, Oct 22, 2015 at 05:59:21PM +0200, Oleg Nesterov wrote:
> On top of "[PATCH 0/3] kvmtool: tiny init for x86_64" I sent.
> 
> I simply can't understand why kvmtool always overwrites /virt/init.
> This doesn't look right, especially because you can't pass "init="
> kernel argument without another patch "[PATCH] kvmtool/run: append
> cfg.kernel_cmdline at the end of real_cmdline" I sent. And so far
> it is not clear to me if you are going to accept that patch or not.

Sorry, I was off last week and didn't have email access. I'm going through
my inbox now and your patches look pretty good to me. I'll apply/test and
then push out if I don't see any issues.

Thanks!

Will


SALLAM ALEIKUM

2015-10-27 Thread MAYA
My dear friend,

   I feel you are old enough to keep secrets and to handle a sensitive matter
confidentially, and after our little chat I felt inside that I should discuss
my proposal with you because of our special level of understanding. That is
why I want to confine this secret deal to you. You must keep this conversation
top secret between the two of us.

Sir, I am from a small state called Texas in the United States, and I am in
fact a single mother; my daughter is seven years old and her name is Daisy. I
am a petroleum engineer with the Shell petroleum company; the American
government contracted me a few years ago, and I was deployed from the Gulf of
Mexico to Iraq, and this is my second year here in Baghdad, Iraq. I am an
engineer who tests and verifies the oil reported to the White House by the
Iraqi government. My signature and reports play a vital role between the
United States of America and the Iraqi government in the crude oil business.

Some American diplomats and some private oil traders came from the United
States to Iraq to carry out a huge crude oil purchase estimated at hundreds of
millions of dollars, and my role was needed to seal the deal by approving the
quality of the oil; without my signature, the purchase cannot succeed. I was
given a total sum of €10 million as my share of the purchase deal. This money
has already been moved out of Iraq through a secure diplomatic courier
company.

With these €10 million moved out of Iraq, the rules of the United States
government do not allow us to enter into business deals, which is why I want
you to partner with me and stand in to receive this money on my behalf in your
country. The company that moved the money out of Iraq will transfer it to any
destination you want.

Sir, I will offer you a total of 20% of the whole amount to receive the
consignment and keep it with you until I arrive to meet you and discuss other
possible business.

If you accept this proposal, I will only need the following:
(1) Your name
(2) Your current address
(3) The name of the company you work for
(4) Your position at your workplace
(5) Your phone number
(6) An ID card, driving licence or international passport
Once I receive all this information, I will process a document and send it to
you, and I will also send a copy to the company introducing you as my only
recognized partner, and they should then transfer my funds to you for further
investment in your country. The whole process is simple and risk-free, but we
must keep away from the spotlight and obey the rules of secrecy.

With best regards,
Your friend and partner
Engineer Kent Jena
Baghdad, Iraq


[PATCH 1/4] Provide simple noop dma ops

2015-10-27 Thread Christian Borntraeger
We are going to require dma_ops for several common drivers, even for
systems that do have an identity mapping. Lets provide some minimal
no-op dma_ops that can be used for that purpose.

Signed-off-by: Christian Borntraeger 
---
 include/linux/dma-mapping.h |  2 ++
 lib/Makefile|  2 +-
 lib/dma-noop.c  | 77 +
 3 files changed, 80 insertions(+), 1 deletion(-)
 create mode 100644 lib/dma-noop.c

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ac07ff0..7912f54 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -66,6 +66,8 @@ struct dma_map_ops {
int is_phys;
 };
 
+extern struct dma_map_ops dma_noop_ops;
+
 #define DMA_BIT_MASK(n)(((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
 
 #define DMA_MASK_NONE  0x0ULL
diff --git a/lib/Makefile b/lib/Makefile
index 13a7c6a..b04ba71 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 sha1.o md5.o irq_regs.o argv_split.o \
 proportions.o flex_proportions.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-earlycpio.o seq_buf.o nmi_backtrace.o
+earlycpio.o seq_buf.o nmi_backtrace.o dma-noop.o
 
 obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
 lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/dma-noop.c b/lib/dma-noop.c
new file mode 100644
index 000..3ce31302
--- /dev/null
+++ b/lib/dma-noop.c
@@ -0,0 +1,77 @@
+/*
+ * lib/dma-noop.c
+ *
+ * Stub DMA noop-ops
+ */
+#include <linux/export.h>
+#include <linux/mm.h>
+#include <linux/dma-mapping.h>
+#include <linux/scatterlist.h>
+
+static void *dma_noop_alloc(struct device *dev, size_t size,
+   dma_addr_t *dma_handle, gfp_t gfp,
+   struct dma_attrs *attrs)
+{
+   void *ret;
+
+   ret = (void *)__get_free_pages(gfp, get_order(size));
+   if (ret) {
+   memset(ret, 0, size);
+   *dma_handle = virt_to_phys(ret);
+   }
+   return ret;
+}
+
+static void dma_noop_free(struct device *dev, size_t size,
+ void *cpu_addr, dma_addr_t dma_addr,
+ struct dma_attrs *attrs)
+{
+   free_pages((unsigned long)cpu_addr, get_order(size));
+}
+
+static dma_addr_t dma_noop_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+   return page_to_phys(page) + offset;
+}
+
+static int dma_noop_map_sg(struct device *dev, struct scatterlist *sgl, int 
nents,
+enum dma_data_direction dir, struct dma_attrs 
*attrs)
+{
+   int i;
+   struct scatterlist *sg;
+
+   for_each_sg(sgl, sg, nents, i) {
+   void *va;
+
+   BUG_ON(!sg_page(sg));
+   va = sg_virt(sg);
+   sg_dma_address(sg) = (dma_addr_t)virt_to_phys(va);
+   sg_dma_len(sg) = sg->length;
+   }
+
+   return nents;
+}
+
+static int dma_noop_mapping_error(struct device *dev, dma_addr_t dma_addr)
+{
+   return 0;
+}
+
+static int dma_noop_supported(struct device *dev, u64 mask)
+{
+   return 1;
+}
+
+struct dma_map_ops dma_noop_ops = {
+   .alloc  = dma_noop_alloc,
+   .free   = dma_noop_free,
+   .map_page   = dma_noop_map_page,
+   .map_sg = dma_noop_map_sg,
+   .mapping_error  = dma_noop_mapping_error,
+   .dma_supported  = dma_noop_supported,
+};
+
+EXPORT_SYMBOL(dma_noop_ops);
-- 
2.4.3



[PATCH 2/4] alpha: use common noop dma ops

2015-10-27 Thread Christian Borntraeger
Some of the alpha pci noop dma ops are identical to the common ones.
Use them.

Signed-off-by: Christian Borntraeger 
---
 arch/alpha/kernel/pci-noop.c | 46 
 1 file changed, 4 insertions(+), 42 deletions(-)

diff --git a/arch/alpha/kernel/pci-noop.c b/arch/alpha/kernel/pci-noop.c
index 2b1f4a1..8e735b5e 100644
--- a/arch/alpha/kernel/pci-noop.c
+++ b/arch/alpha/kernel/pci-noop.c
@@ -123,44 +123,6 @@ static void *alpha_noop_alloc_coherent(struct device *dev, 
size_t size,
return ret;
 }
 
-static void alpha_noop_free_coherent(struct device *dev, size_t size,
-void *cpu_addr, dma_addr_t dma_addr,
-struct dma_attrs *attrs)
-{
-   free_pages((unsigned long)cpu_addr, get_order(size));
-}
-
-static dma_addr_t alpha_noop_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size,
- enum dma_data_direction dir,
- struct dma_attrs *attrs)
-{
-   return page_to_pa(page) + offset;
-}
-
-static int alpha_noop_map_sg(struct device *dev, struct scatterlist *sgl, int 
nents,
-enum dma_data_direction dir, struct dma_attrs 
*attrs)
-{
-   int i;
-   struct scatterlist *sg;
-
-   for_each_sg(sgl, sg, nents, i) {
-   void *va;
-
-   BUG_ON(!sg_page(sg));
-   va = sg_virt(sg);
-   sg_dma_address(sg) = (dma_addr_t)virt_to_phys(va);
-   sg_dma_len(sg) = sg->length;
-   }
-
-   return nents;
-}
-
-static int alpha_noop_mapping_error(struct device *dev, dma_addr_t dma_addr)
-{
-   return 0;
-}
-
 static int alpha_noop_supported(struct device *dev, u64 mask)
 {
return mask < 0x00ffffffUL ? 0 : 1;
@@ -168,10 +130,10 @@ static int alpha_noop_supported(struct device *dev, u64 
mask)
 
 struct dma_map_ops alpha_noop_ops = {
.alloc  = alpha_noop_alloc_coherent,
-   .free   = alpha_noop_free_coherent,
-   .map_page   = alpha_noop_map_page,
-   .map_sg = alpha_noop_map_sg,
-   .mapping_error  = alpha_noop_mapping_error,
+   .free   = dma_noop_free_coherent,
+   .map_page   = dma_noop_map_page,
+   .map_sg = dma_noop_map_sg,
+   .mapping_error  = dma_noop_mapping_error,
.dma_supported  = alpha_noop_supported,
 };
 
-- 
2.4.3



[PATCH 3/4] s390/dma: Allow per device dma ops

2015-10-27 Thread Christian Borntraeger
As virtio-ccw now has dma ops, we can no longer default to the PCI ones.
Make use of dev_archdata to keep the dma_ops per device.

Signed-off-by: Christian Borntraeger 
---
 arch/s390/include/asm/device.h  | 6 +-
 arch/s390/include/asm/dma-mapping.h | 2 +-
 arch/s390/pci/pci.c | 1 +
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/device.h b/arch/s390/include/asm/device.h
index d8f9872..4a9f35e 100644
--- a/arch/s390/include/asm/device.h
+++ b/arch/s390/include/asm/device.h
@@ -3,5 +3,9 @@
  *
  * This file is released under the GPLv2
  */
-#include 
+struct dev_archdata {
+   struct dma_map_ops *dma_ops;
+};
 
+struct pdev_archdata {
+};
diff --git a/arch/s390/include/asm/dma-mapping.h 
b/arch/s390/include/asm/dma-mapping.h
index b3fd54d..71404f1 100644
--- a/arch/s390/include/asm/dma-mapping.h
+++ b/arch/s390/include/asm/dma-mapping.h
@@ -15,7 +15,7 @@ extern struct dma_map_ops s390_dma_ops;
 
 static inline struct dma_map_ops *get_dma_ops(struct device *dev)
 {
-   return &s390_dma_ops;
+   return dev->archdata.dma_ops;
 }
 
 static inline void dma_cache_sync(struct device *dev, void *vaddr, size_t size,
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 7ef12a3..9913915 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -649,6 +649,7 @@ int pcibios_add_device(struct pci_dev *pdev)
 
zdev->pdev = pdev;
pdev->dev.groups = zpci_attr_groups;
+   pdev->dev.archdata.dma_ops = &s390_dma_ops;
zpci_map_resources(pdev);
 
for (i = 0; i < PCI_BAR_COUNT; i++) {
-- 
2.4.3



[PATCH/RFC 0/4] dma ops and virtio

2015-10-27 Thread Christian Borntraeger
This is an RFC to check if I am on the right track.  There
are some attempts to unify the dma ops (Christoph) as well
as some attempts to make virtio use the DMA API (Andy).

At today's discussion at the kernel summit, we concluded that
we want to use the same code on all platforms, wherever
possible, so having a dummy dma_op might be the easiest
solution to keep virtio-ccw as similar as possible to
virtio-pci. Andy Lutomirski will rework his patchset to
unconditionally use the dma ops.  We will also need a
compatibility quirk for powerpc to bypass the iommu mappings
on older QEMU versions (which translates to all versions as
of today) and per device, e.g.  via device tree.  Ben
Herrenschmidt will look into that.

Here is a very quick (still untested) shot at providing the s390 part:
- patch1: dummy dma ops, inspired by the alpha code
- patch2: replace some of the alpha functions with the dummy ones
- patch3: allow per device dma ops for s390
- patch4: wire up virtio dma ops

TODOs
- test (s390 virtio-ccw with Andy's changes, s390 pci) and review
- check i386 nommu dma ops for potential improvements in the noop 
  handlers
- make dma-noop only compile when necessary

Christian Borntraeger (4):
  Provide simple noop dma ops
  alpha: use common noop dma ops
  s390/dma: Allow per device dma ops
  s390/virtio: use noop dma ops

 arch/alpha/kernel/pci-noop.c| 46 ++
 arch/s390/include/asm/device.h  |  6 ++-
 arch/s390/include/asm/dma-mapping.h |  2 +-
 arch/s390/pci/pci.c |  1 +
 drivers/s390/virtio/kvm_virtio.c|  2 +
 drivers/s390/virtio/virtio_ccw.c|  2 +
 include/linux/dma-mapping.h |  2 +
 lib/Makefile|  2 +-
 lib/dma-noop.c  | 77 +
 9 files changed, 95 insertions(+), 45 deletions(-)
 create mode 100644 lib/dma-noop.c

-- 
2.4.3



[PATCH 4/4] s390/virtio: use noop dma ops

2015-10-27 Thread Christian Borntraeger
With all infrastructure in place, lets provide dma_ops for virtio
devices on s390.

Signed-off-by: Christian Borntraeger 
---
 drivers/s390/virtio/kvm_virtio.c | 2 ++
 drivers/s390/virtio/virtio_ccw.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/s390/virtio/kvm_virtio.c b/drivers/s390/virtio/kvm_virtio.c
index 53fb975..05adaa9 100644
--- a/drivers/s390/virtio/kvm_virtio.c
+++ b/drivers/s390/virtio/kvm_virtio.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/dma-mapping.h>
 #include 
 #include 
 #include 
@@ -318,6 +319,7 @@ static void add_kvm_device(struct kvm_device_desc *d, 
unsigned int offset)
return;
}
 
+   kdev->vdev.dev.archdata.dma_ops = &dma_noop_ops;
kdev->vdev.dev.parent = kvm_root;
kdev->vdev.id.device = d->type;
kdev->vdev.config = _vq_configspace_ops;
diff --git a/drivers/s390/virtio/virtio_ccw.c b/drivers/s390/virtio/virtio_ccw.c
index 1cda784..8fb7a6b 100644
--- a/drivers/s390/virtio/virtio_ccw.c
+++ b/drivers/s390/virtio/virtio_ccw.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/dma-mapping.h>
 #include 
 #include 
 #include 
@@ -1093,6 +1094,7 @@ static void virtio_ccw_auto_online(void *data, 
async_cookie_t cookie)
struct ccw_device *cdev = data;
int ret;
 
+   cdev->dev.archdata.dma_ops = &dma_noop_ops;
ret = ccw_device_set_online(cdev);
if (ret)
dev_warn(&cdev->dev, "Failed to set online: %d\n", ret);
-- 
2.4.3



Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-27 Thread Yunhong Jiang
On Tue, Oct 27, 2015 at 10:29:28AM +0100, Paolo Bonzini wrote:
> 
> 
> On 27/10/2015 07:35, Yunhong Jiang wrote:
> > On Mon, Oct 26, 2015 at 09:37:14PM -0600, Alex Williamson wrote:
> >> On Mon, 2015-10-26 at 18:20 -0700, Yunhong Jiang wrote:
> >>> An option to force VFIO PCI MSI/MSI-X handler as non-threaded IRQ,
> >>> even when CONFIG_IRQ_FORCED_THREADING=y. This is useful when
> >>> assigning a device to a guest with low latency requirements, since it
> >>> reduces the context switches to/from the IRQ thread.
> >>
> >> Is there any way we can do this automatically?  Perhaps detecting that
> >> we're on a RT kernel or maybe that the user is running with RT priority?
> >> I find that module options are mostly misunderstood and misused.
> > 
> > Alex, thanks for review.
> > 
> > It's not easy to detect if the user is running with RT priority, since 
> > sometimes the user starts the thread and only sets the scheduler priority 
> > later.
> > 
> > Also, should we do this only for the in-kernel irqchip scenario and not for 
> > the userspace handler, since the in-kernel irqchip has lower overhead?
> 
> The overhead of the non-threaded IRQ handler is the same for kernel or
> userspace irqchip, since the handler just writes 1 to the eventfd.

IIUC, the handler not only writes 1 to the eventfd, it also invokes the wait 
queue function, and the in-kernel irqchip has a different callback from the 
user-space irqchip, am I right? But I should not state that the in-kernel 
irqchip has lower overhead since I have no data for it.

> 
> On RT kernels, however, can you call eventfd_signal from interrupt
> context?  You cannot call spin_lock_irqsave (which can sleep) from a
> non-threaded interrupt handler, can you?  You would need a raw spin lock.

Thanks for pointing this out. Yes, we can't call spin_lock_irqsave on an RT 
kernel. I will do it this way in the next patch, but I'm not sure whether it's 
overkill to use a raw spinlock there, since eventfd_signal is used by other 
callers as well.

Thanks
--jyh


> 
> Paolo


Re: [PATCH 1/4] Provide simple noop dma ops

2015-10-27 Thread Joerg Roedel
Hi Christian,

On Tue, Oct 27, 2015 at 11:48:48PM +0100, Christian Borntraeger wrote:
> +static dma_addr_t dma_noop_map_page(struct device *dev, struct page *page,
> +   unsigned long offset, size_t size,
> +   enum dma_data_direction dir,
> +   struct dma_attrs *attrs)
> +{
> + return page_to_phys(page) + offset;
> +}

X86 also has its own version of these noop dma_ops, see
arch/x86/kernel/pci-nommu.c. This one also checks the dma_mask and
prints a warning if the physical address doesn't fit into the mask.
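A merged helper along those lines might look like the sketch below (hypothetical code, not the posted patch or the x86 original): the physical address is computed as before, but checked against the device's DMA mask first, in the spirit of x86's pci-nommu `check_addr()`:

```
static dma_addr_t dma_noop_map_page(struct device *dev, struct page *page,
				    unsigned long offset, size_t size,
				    enum dma_data_direction dir,
				    struct dma_attrs *attrs)
{
	dma_addr_t addr = page_to_phys(page) + offset;

	/* Warn, as x86's pci-nommu does, when the address exceeds the mask. */
	WARN_ON_ONCE(dev && dev->dma_mask && addr > *dev->dma_mask);
	return addr;
}
```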

I think this would make sense here too, and that we can also make x86
use the same generic noop-dma-ops you are introducing.


Joerg



[PATCH 1/3] virtio_net: Stop doing DMA from the stack

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski 

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

Signed-off-by: Andy Lutomirski 
---
 drivers/net/virtio_net.c | 53 
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..4f10f8a58811 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, 
u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
+
+   struct {
+   struct virtio_net_ctrl_hdr ctrl;
+   virtio_net_ctrl_ack status;
+   } *buf;
+
unsigned out_num = 0, tmp;
+   bool ret;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+   if (!buf)
+   return false;
+   buf->status = ~0;
+
+   buf->ctrl.class = class;
+   buf->ctrl.cmd = cmd;
/* Add header */
-   sg_init_one(&hdr, &ctrl, sizeof(ctrl));
+   sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl));
sgs[out_num++] = &hdr;
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(&stat, &status, sizeof(status));
+   sg_init_one(&stat, &buf->status, sizeof(buf->status));
sgs[out_num] = &stat;
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-   if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   if (unlikely(!virtqueue_kick(vi->cvq))) {
+   ret = (buf->status == VIRTIO_NET_OK);
+   goto out;
+   }
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info 
*vi, u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+   kfree(buf);
+   return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
+   u8 *cmdbyte;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+   if (!cmdbyte)
+   return;
 
-   sg_init_one(sg, &promisc, sizeof(promisc));
+   sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+   *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
-
-   sg_init_one(sg, &allmulti, sizeof(allmulti));
+*cmdbyte ? "en" : "dis");
 
+   *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+*cmdbyte ? "en" : "dis");
+
+   kfree(cmdbyte);
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3



[PATCH 0/3] virtio DMA API core stuff

2015-10-27 Thread Andy Lutomirski
This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c   |  53 +++
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 ---
 drivers/virtio/virtio_ring.c   | 179 ++---
 tools/virtio/linux/dma-mapping.h   |  17 
 7 files changed, 241 insertions(+), 66 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3



[PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski 

virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 179 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 164 insertions(+), 34 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..911fbaa7a260 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 unsigned int total_sg, gfp_t gfp)
 {
@@ -131,7 +198,7 @@ static inline int 

[PATCH 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski 

This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h 
b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c 
b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct 
virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size,
+ &info->queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(&vp_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(&vp_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device 
*vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(&pci_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
+   &vp_dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page. You are my only hope! */
-   return alloc_pages_exact(vring_pci_size(*num), GFP_KERNEL|__GFP_ZERO);
+   return dma_zalloc_coherent(
+ 

[PATCH 1/3] context_tracking: remove duplicate enabled check

2015-10-27 Thread Paolo Bonzini
All calls to context_tracking_enter and context_tracking_exit
are already checking context_tracking_is_enabled, except the
context_tracking_user_enter and context_tracking_user_exit
functions left in for the benefit of assembly calls.

Pull the check up to those functions, by making them simple
wrappers around the user_enter and user_exit inline functions.

Cc: Andy Lutomirski 
Cc: Frederic Weisbecker 
Cc: Rik van Riel 
Cc: Paul McKenney 
Signed-off-by: Paolo Bonzini 
---
 include/linux/context_tracking.h |  4 ++--
 kernel/context_tracking.c| 16 ++--
 2 files changed, 4 insertions(+), 16 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 008fc67d0d96..6ef136ff0897 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -18,13 +18,13 @@ extern void context_tracking_user_exit(void);
 static inline void user_enter(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_enter();
+   context_tracking_enter(CONTEXT_USER);
 
 }
 static inline void user_exit(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_exit();
+   context_tracking_exit(CONTEXT_USER);
 }
 
 static inline enum ctx_state exception_enter(void)
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0a495ab35bc7..6d4c6ce21275 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -63,15 +63,6 @@ void context_tracking_enter(enum ctx_state state)
unsigned long flags;
 
/*
-* Repeat the user_enter() check here because some archs may be calling
-* this from asm and if no CPU needs context tracking, they shouldn't
-* go further. Repeat the check here until they support the inline 
static
-* key check.
-*/
-   if (!context_tracking_is_enabled())
-   return;
-
-   /*
 * Some contexts may involve an exception occuring in an irq,
 * leading to that nesting:
 * rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
@@ -128,7 +119,7 @@ EXPORT_SYMBOL_GPL(context_tracking_enter);
 
 void context_tracking_user_enter(void)
 {
-   context_tracking_enter(CONTEXT_USER);
+   user_enter();
 }
 NOKPROBE_SYMBOL(context_tracking_user_enter);
 
@@ -148,9 +139,6 @@ void context_tracking_exit(enum ctx_state state)
 {
unsigned long flags;
 
-   if (!context_tracking_is_enabled())
-   return;
-
if (in_interrupt())
return;
 
@@ -181,7 +169,7 @@ EXPORT_SYMBOL_GPL(context_tracking_exit);
 
 void context_tracking_user_exit(void)
 {
-   context_tracking_exit(CONTEXT_USER);
+   user_exit();
 }
 NOKPROBE_SYMBOL(context_tracking_user_exit);
 
-- 
2.5.0




[PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit

2015-10-27 Thread Paolo Bonzini
guest_enter and guest_exit must be called with interrupts disabled,
since they take the vtime_seqlock with write_seq{lock,unlock}.
Therefore, it is not necessary to check for exceptions, nor to
save/restore the IRQ state, when context tracking functions are
called by guest_enter and guest_exit.

Split the body of context_tracking_enter and context_tracking_exit
out to __-prefixed functions, and use them from KVM.

Rik van Riel has measured this to speed up a tight vmentry/vmexit
loop by about 2%.

Cc: Andy Lutomirski 
Cc: Frederic Weisbecker 
Cc: Rik van Riel 
Cc: Paul McKenney 
Signed-off-by: Paolo Bonzini 
---
 include/linux/context_tracking.h |  8 +++--
 kernel/context_tracking.c| 64 
 2 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 6ef136ff0897..68b575afe5f5 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,6 +10,10 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
+/* Called with interrupts disabled.  */
+extern void __context_tracking_enter(enum ctx_state state);
+extern void __context_tracking_exit(enum ctx_state state);
+
 extern void context_tracking_enter(enum ctx_state state);
 extern void context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
@@ -88,13 +92,13 @@ static inline void guest_enter(void)
current->flags |= PF_VCPU;
 
if (context_tracking_is_enabled())
-   context_tracking_enter(CONTEXT_GUEST);
+   __context_tracking_enter(CONTEXT_GUEST);
 }
 
 static inline void guest_exit(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_exit(CONTEXT_GUEST);
+   __context_tracking_exit(CONTEXT_GUEST);
 
if (vtime_accounting_enabled())
vtime_guest_exit(current);
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 6d4c6ce21275..d8560ee3bab7 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -58,27 +58,13 @@ static void context_tracking_recursion_exit(void)
  * instructions to execute won't use any RCU read side critical section
  * because this function sets RCU in extended quiescent state.
  */
-void context_tracking_enter(enum ctx_state state)
+void __context_tracking_enter(enum ctx_state state)
 {
-   unsigned long flags;
-
-   /*
-* Some contexts may involve an exception occuring in an irq,
-* leading to that nesting:
-* rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
-* This would mess up the dyntick_nesting count though. And rcu_irq_*()
-* helpers are enough to protect RCU uses inside the exception. So
-* just return immediately if we detect we are in an IRQ.
-*/
-   if (in_interrupt())
-   return;
-
/* Kernel threads aren't supposed to go to userspace */
WARN_ON_ONCE(!current->mm);
 
-   local_irq_save(flags);
if (!context_tracking_recursion_enter())
-   goto out_irq_restore;
+   return;
 
if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
@@ -111,7 +97,27 @@ void context_tracking_enter(enum ctx_state state)
__this_cpu_write(context_tracking.state, state);
}
context_tracking_recursion_exit();
-out_irq_restore:
+}
+NOKPROBE_SYMBOL(__context_tracking_enter);
+EXPORT_SYMBOL_GPL(__context_tracking_enter);
+
+void context_tracking_enter(enum ctx_state state)
+{
+   unsigned long flags;
+
+   /*
+* Some contexts may involve an exception occuring in an irq,
+* leading to that nesting:
+* rcu_irq_enter() rcu_user_exit() rcu_user_exit() rcu_irq_exit()
+* This would mess up the dyntick_nesting count though. And rcu_irq_*()
+* helpers are enough to protect RCU uses inside the exception. So
+* just return immediately if we detect we are in an IRQ.
+*/
+   if (in_interrupt())
+   return;
+
+   local_irq_save(flags);
+   __context_tracking_enter(state);
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_enter);
@@ -135,16 +141,10 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(enum ctx_state state)
+void __context_tracking_exit(enum ctx_state state)
 {
-   unsigned long flags;
-
-   if (in_interrupt())
-   return;
-
-   local_irq_save(flags);
if (!context_tracking_recursion_enter())
-   goto out_irq_restore;
+   

[PATCH 3/3] x86: context_tracking: avoid irq_save/irq_restore on kernel entry and exit

2015-10-27 Thread Paolo Bonzini
x86 always calls user_enter and user_exit with interrupts disabled.
Therefore, it is not necessary to check for exceptions, nor to
save/restore the IRQ state, when context tracking functions are
called by user_enter and user_exit.

Use the previously introduced __context_tracking_enter and
__context_tracking_exit.

Cc: Andy Lutomirski 
Cc: Frederic Weisbecker 
Cc: Rik van Riel 
Cc: Paul McKenney 
Signed-off-by: Paolo Bonzini 
---
 arch/x86/entry/common.c  |  4 ++--
 include/linux/context_tracking.h | 42 ++--
 2 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 80dcc9261ca3..ba707c6da818 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -33,7 +33,7 @@
 __visible void enter_from_user_mode(void)
 {
CT_WARN_ON(ct_state() != CONTEXT_USER);
-   user_exit();
+   __user_exit();
 }
 #endif
 
@@ -262,7 +262,7 @@ __visible void prepare_exit_to_usermode(struct pt_regs 
*regs)
local_irq_disable();
}
 
-   user_enter();
+   __user_enter();
 }
 
 /*
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 68b575afe5f5..d36a5c8085b7 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -19,18 +19,6 @@ extern void context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 
-static inline void user_enter(void)
-{
-   if (context_tracking_is_enabled())
-   context_tracking_enter(CONTEXT_USER);
-
-}
-static inline void user_exit(void)
-{
-   if (context_tracking_is_enabled())
-   context_tracking_exit(CONTEXT_USER);
-}
-
 static inline enum ctx_state exception_enter(void)
 {
enum ctx_state prev_ctx;
@@ -67,13 +55,39 @@ static inline enum ctx_state ct_state(void)
this_cpu_read(context_tracking.state) : CONTEXT_DISABLED;
 }
 #else
-static inline void user_enter(void) { }
-static inline void user_exit(void) { }
+static inline void __context_tracking_enter(enum ctx_state state) { }
+static inline void __context_tracking_exit(enum ctx_state state) { }
+static inline void context_tracking_enter(enum ctx_state state) { }
+static inline void context_tracking_exit(enum ctx_state state) { }
 static inline enum ctx_state exception_enter(void) { return 0; }
 static inline void exception_exit(enum ctx_state prev_ctx) { }
 static inline enum ctx_state ct_state(void) { return CONTEXT_DISABLED; }
 #endif /* !CONFIG_CONTEXT_TRACKING */
 
+static inline void __user_enter(void)
+{
+   if (context_tracking_is_enabled())
+   __context_tracking_enter(CONTEXT_USER);
+
+}
+static inline void __user_exit(void)
+{
+   if (context_tracking_is_enabled())
+   __context_tracking_exit(CONTEXT_USER);
+}
+
+static inline void user_enter(void)
+{
+   if (context_tracking_is_enabled())
+   context_tracking_enter(CONTEXT_USER);
+
+}
+static inline void user_exit(void)
+{
+   if (context_tracking_is_enabled())
+   context_tracking_exit(CONTEXT_USER);
+}
+
 #define CT_WARN_ON(cond) WARN_ON(context_tracking_is_enabled() && (cond))
 
 #ifdef CONFIG_CONTEXT_TRACKING_FORCE
-- 
2.5.0



Re: [PATCH/RFC 0/4] dma ops and virtio

2015-10-27 Thread Joerg Roedel
Hi Christian,

On Tue, Oct 27, 2015 at 11:48:47PM +0100, Christian Borntraeger wrote:
> Here is a very quick (still untested) shot at providing the s390 part:
> - patch1: dummy dma ops, inspired by the alpha code
> - patch2: replace some of the alpha functions with the dummy ones
> - patch3: allow per device dma ops for s390
> - patch4: wire up virtio dma ops

Thanks for the patches, I think they are a good start. I sent you
comments for two of the patches, the others look good to me.


Joerg



Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-27 Thread Paolo Bonzini


On 27/10/2015 22:26, Yunhong Jiang wrote:
>> > On RT kernels however can you call eventfd_signal from interrupt
>> > context?  You cannot call spin_lock_irqsave (which can sleep) from a
>> > non-threaded interrupt handler, can you?  You would need a raw spin lock.
> Thanks for pointing this out. Yes, we can't call spin_lock_irqsave on RT 
> kernel. Will do this way on next patch. But not sure if it's overkill to use 
> raw_spinlock there since the eventfd_signal is used by other caller also.

No, I don't think you can use raw_spinlock there.  The problem is not
just eventfd_signal, it is especially wake_up_locked_poll.  You cannot
convert the whole workqueue infrastructure to use raw_spinlock.

Alex, would it make sense to use the IRQ bypass infrastructure always,
not just for VT-d, to do the MSI injection directly from the VFIO
interrupt handler and bypass the eventfd?  Basically this would add an
RCU-protected list of consumers matching the token to struct
irq_bypass_producer, and a

int (*inject)(struct irq_bypass_consumer *);

callback to struct irq_bypass_consumer.  If any callback returns true,
the eventfd is not signaled.  The KVM implementation would be like this
(compare with virt/kvm/eventfd.c):

/* Extracted out of irqfd_wakeup */
static int
irqfd_wakeup_pollin(struct kvm_kernel_irqfd *irqfd)
{
...
}

/* Extracted out of irqfd_wakeup */
static int
irqfd_wakeup_pollhup(struct kvm_kernel_irqfd *irqfd)
{
...
}

static int
irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync,
 void *key)
{
struct _irqfd *irqfd = container_of(wait,
struct _irqfd, wait);
unsigned long flags = (unsigned long)key;

if (flags & POLLIN)
irqfd_wakeup_pollin(irqfd);
if (flags & POLLHUP)
irqfd_wakeup_pollhup(irqfd);

return 0;
}

static int kvm_arch_irq_bypass_inject(
struct irq_bypass_consumer *cons)
{
struct kvm_kernel_irqfd *irqfd =
container_of(cons, struct kvm_kernel_irqfd,
 consumer); 

irqfd_wakeup_pollin(irqfd);
}

Or do you think it would be a hack?  The latency improvement might
actually be even better than what Yunhong is already reporting.

Paolo


[PATCH 0/3] context_tracking: streamline code, avoid IRQ save/restore

2015-10-27 Thread Paolo Bonzini
The first two of these patches were posted last February, the last one
is new.  Rik's old measurements were that it shaved around .3 microseconds
on each iteration of his KVM benchmark.

I guess three days before the start of the merge window is not
the best time to post patches.  However, I brought this series up at
kernel summit yesterday, and Andy's cleanups actually make it trivial
to apply this to syscall entry.  So here it is, perhaps it's worth it.

Assuming it works, of course, because this is compile-tested only. :)

Paolo

Paolo Bonzini (3):
  context_tracking: remove duplicate enabled check
  context_tracking: avoid irq_save/irq_restore on guest entry and exit
  x86: context_tracking: avoid irq_save/irq_restore on kernel entry and exit

 arch/x86/entry/common.c  |  4 +-
 include/linux/context_tracking.h | 50 +
 kernel/context_tracking.c| 80 
 3 files changed, 76 insertions(+), 58 deletions(-)

-- 
2.5.0



Re: [PATCH 4/4] s390/virtio: use noop dma ops

2015-10-27 Thread Joerg Roedel
On Tue, Oct 27, 2015 at 11:48:51PM +0100, Christian Borntraeger wrote:
> @@ -1093,6 +1094,7 @@ static void virtio_ccw_auto_online(void *data, 
> async_cookie_t cookie)
>   struct ccw_device *cdev = data;
>   int ret;
>  
> + cdev->dev.archdata.dma_ops = &dma_noop_ops;
>   ret = ccw_device_set_online(cdev);
>   if (ret)
>   dev_warn(&cdev->dev, "Failed to set online: %d\n", ret);

Hmm, drivers usually don't deal with setting the dma_ops for their
devices, as they depend on the platform and not so much on the device
itself.

Can you do this special-case handling from device-independent platform
code, where you also set up dma_ops for other devices?


Joerg



Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Christian Borntraeger
On 28.10.2015 at 10:17, Andy Lutomirski wrote:
@@ -423,27 +522,42 @@ EXPORT_SYMBOL_GPL(virtqueue_kick);
> 
>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>  {
> - unsigned int i;
> + unsigned int i, j;
> + u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
> 
>   /* Clear data ptr. */
> - vq->data[head] = NULL;
> + vq->desc_state[head].data = NULL;
> 
> - /* Put back on free list: find end */
> + /* Put back on free list: unmap first-level descriptors and find end */
>   i = head;
> 
> - /* Free the indirect table */
> - if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
> VRING_DESC_F_INDIRECT))
> - kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, 
> vq->vring.desc[i].addr)));
> -
> - while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
> VRING_DESC_F_NEXT)) {
> + while (vq->vring.desc[i].flags & nextflag) {
> + vring_unmap_one(vq, &vq->vring.desc[i]);
>   i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
>   vq->vq.num_free++;
>   }
> 
> + vring_unmap_one(vq, &vq->vring.desc[i]);
>   vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
>   vq->free_head = head;
> +
>   /* Plus final descriptor */
>   vq->vq.num_free++;
> +
> + /* Free the indirect table, if any, now that it's unmapped. */
> + if (vq->desc_state[head].indir_desc) {
> + struct vring_desc *indir_desc = vq->desc_state[head].indir_desc;
> + u32 len = vq->vring.desc[head].len;
> +
> + BUG_ON(!(vq->vring.desc[head].flags & VRING_DESC_F_INDIRECT));
> + BUG_ON(len == 0 || len % sizeof(struct vring_desc));
> +
> + for (j = 0; j < len / sizeof(struct vring_desc); j++)
> + vring_unmap_one(vq, &indir_desc[j]);
> +
> + kfree(vq->desc_state[head].indir_desc);
> + vq->desc_state[head].indir_desc = NULL;
> + }
>  }

something seems to be broken with indirect descriptors with that change

[1.885451] [ cut here ]
[1.885454] kernel BUG at drivers/virtio/virtio_ring.c:552!
[1.885487] illegal operation: 0001 ilc:1 [#1] SMP 
[1.885489] Modules linked in:
[1.885491] CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.3.0-rc3+ #250
[1.885493] task: 0be49988 ti: 0be54000 task.ti: 
0be54000
[1.885514] Krnl PSW : 0404c0018000 0059b9d2 
(detach_buf+0x1ca/0x1d0)
[1.885519]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 
EA:3
Krnl GPRS:  3000 0c0c8c00 0a89c000
[1.885522]0059b8fa 0001  

[1.885523]  0100 
0a89c000
[1.885525]0c082000 0c082000 0059b8fa 
0c7fbc88
[1.885531] Krnl Code: 0059b9c6: a7f4ff40brc 
15,59b846
   0059b9ca: a7f40001   brc 15,59b9cc
  #0059b9ce: a7f40001   brc 15,59b9d0
  >0059b9d2: 0707   bcr 0,%r7
   0059b9d4: 0707   bcr 0,%r7
   0059b9d6: 0707   bcr 0,%r7
   0059b9d8: c004   brcl0,59b9d8
   0059b9de: ebcff0780024   stmg%r12,%r15,120(%r15)
[1.885542] Call Trace:
[1.885544] ([<0059b8fa>] detach_buf+0xf2/0x1d0)
[1.885546]  [<0059bbb4>] virtqueue_get_buf+0xcc/0x218
[1.885549]  [<005dd8fe>] virtblk_done+0xa6/0x120
[1.885550]  [<0059b66e>] vring_interrupt+0x7e/0x108
[1.885552]  [<006c21a4>] virtio_airq_handler+0x7c/0x120
[1.885554]  [<00657e54>] do_airq_interrupt+0xa4/0xc8
[1.885558]  [<001b79a0>] handle_irq_event_percpu+0x60/0x1f0
[1.885560]  [<001bbbea>] handle_percpu_irq+0x72/0xa0
[1.885561]  [<001b6fa4>] generic_handle_irq+0x4c/0x78
[1.885564]  [<0010cc7c>] do_IRQ+0x64/0x88
[1.885566] Btrfs loaded
[1.885600]  [<0080e30a>] io_int_handler+0x10a/0x218
[1.885603]  [<00186b9e>] wake_up_new_task+0x1be/0x260
[1.885604] ([<00186b6a>] wake_up_new_task+0x18a/0x260)
[1.885606]  [<00156182>] _do_fork+0xfa/0x338
[1.885608]  [<0015644e>] kernel_thread+0x4e/0x60
[1.885609]  [<00179202>] kthreadd+0x1a2/0x238
[1.885611]  [<0080e056>] kernel_thread_starter+0x6/0xc
[1.885613]  [<0080e050>] kernel_thread_starter+0x0/0xc
[1.885614] Last Breaking-Event-Address:
[1.885615]  [<0059b9ce>] detach_buf+0x1c6/0x1d0
[1.885617]  
[1.885618] Kernel panic - not syncing: Fatal exception in interrupt

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Joerg Roedel
Hi Andy,

On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
> From: Andy Lutomirski 
> 
> virtio_ring currently sends the device (usually a hypervisor)
> physical addresses of its I/O buffers.  This is okay when DMA
> addresses and physical addresses are the same thing, but this isn't
> always the case.  For example, this never works on Xen guests, and
> it is likely to fail if a physical "virtio" device ever ends up
> behind an IOMMU or swiotlb.

The overall code looks good, but I haven't seen any dma_sync* calls.
When swiotlb=force is in use this would break.

> + vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, 
> vring_map_single(
> + vq,
> + desc, total_sg * sizeof(struct vring_desc),
> + DMA_TO_DEVICE));

Nit: I think readability could be improved here by using a temporary
variable for the return value of vring_map_single().
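The suggested refactor is easiest to see in code. A minimal sketch of the shape, using hypothetical stand-in helpers (the real calls are vring_map_single() and cpu_to_virtio64() from the patch; these stubs exist only so the fragment compiles on its own):

```c
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Stand-ins so the shape of the refactor compiles by itself;
 * they are NOT the real kernel helpers. */
static dma_addr_t map_single_stub(void *cpu_addr)
{
    (void)cpu_addr;
    return 0x1000;           /* pretend bus address */
}

static uint64_t to_virtio64_stub(dma_addr_t addr)
{
    return addr;             /* identity on a little-endian host */
}

/* One call per statement: map first, then convert for the descriptor,
 * instead of nesting the map call inside the conversion. */
static uint64_t fill_desc_addr(void *desc)
{
    dma_addr_t addr = map_single_stub(desc);
    return to_virtio64_stub(addr);
}
```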


Joerg



Re: [PATCH 1/3] virtio_net: Stop doing DMA from the stack

2015-10-27 Thread Joerg Roedel
On Tue, Oct 27, 2015 at 06:17:08PM -0700, Andy Lutomirski wrote:
> From: Andy Lutomirski 
> 
> Once virtio starts using the DMA API, we won't be able to safely DMA
> from the stack.  virtio-net does a couple of config DMA requests
> from small stack buffers -- switch to using dynamically-allocated
> memory.
> 
> This should have no effect on any performance-critical code paths.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  drivers/net/virtio_net.c | 53 
> 
>  1 file changed, 36 insertions(+), 17 deletions(-)

Reviewed-by: Joerg Roedel 


Re: [PATCH] KVM: x86: fix eflags state following processor init/reset

2015-10-27 Thread Wanpeng Li

Ping, :-)
On 10/21/15 2:28 PM, Wanpeng Li wrote:

Reference SDM 3.4.3:

Following initialization of the processor (either by asserting the
RESET pin or the INIT pin), the state of the EFLAGS register is
0002H.

However, KVM does not set the EFLAGS fixed bit, nor does it clear the
other bits, during INIT/RESET.

This patch fixes it by setting the EFLAGS register to 0002H following
initialization of the processor.

Signed-off-by: Wanpeng Li 
---
  arch/x86/kvm/vmx.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b680c2e..326f6ea 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4935,6 +4935,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
vmx_set_efer(vcpu, 0);
vmx_fpu_activate(vcpu);
update_exception_bitmap(vcpu);
+   vmx_set_rflags(vcpu, X86_EFLAGS_FIXED);
  
  	vpid_sync_context(vmx->vpid);

  }




Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:27 PM, Christian Borntraeger
 wrote:
> On 28.10.2015 at 10:17, Andy Lutomirski wrote:
> @@ -423,27 +522,42 @@ EXPORT_SYMBOL_GPL(virtqueue_kick);
>>
>>  static void detach_buf(struct vring_virtqueue *vq, unsigned int head)
>>  {
>> - unsigned int i;
>> + unsigned int i, j;
>> + u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>>
>>   /* Clear data ptr. */
>> - vq->data[head] = NULL;
>> + vq->desc_state[head].data = NULL;
>>
>> - /* Put back on free list: find end */
>> + /* Put back on free list: unmap first-level descriptors and find end */
>>   i = head;
>>
>> - /* Free the indirect table */
>> - if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
>> VRING_DESC_F_INDIRECT))
>> - kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, 
>> vq->vring.desc[i].addr)));
>> -
>> - while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, 
>> VRING_DESC_F_NEXT)) {
>> + while (vq->vring.desc[i].flags & nextflag) {
>> + vring_unmap_one(vq, &vq->vring.desc[i]);
>>   i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next);
>>   vq->vq.num_free++;
>>   }
>>
>> + vring_unmap_one(vq, &vq->vring.desc[i]);
>>   vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head);
>>   vq->free_head = head;
>> +
>>   /* Plus final descriptor */
>>   vq->vq.num_free++;
>> +
>> + /* Free the indirect table, if any, now that it's unmapped. */
>> + if (vq->desc_state[head].indir_desc) {
>> + struct vring_desc *indir_desc = 
>> vq->desc_state[head].indir_desc;
>> + u32 len = vq->vring.desc[head].len;
>> +
>> + BUG_ON(!(vq->vring.desc[head].flags & VRING_DESC_F_INDIRECT));
>> + BUG_ON(len == 0 || len % sizeof(struct vring_desc));
>> +
>> + for (j = 0; j < len / sizeof(struct vring_desc); j++)
>> + vring_unmap_one(vq, &indir_desc[j]);
>> +
>> + kfree(vq->desc_state[head].indir_desc);
>> + vq->desc_state[head].indir_desc = NULL;
>> + }
>>  }
>
> something seems to be broken with indirect descriptors with that change

You're big endian, right?  I had an endian handling bug, and I'll fix it in v2.
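The class of bug in question is easy to demonstrate: for legacy devices on a big-endian host, descriptor fields are stored byte-swapped, so masking a raw flags field against a native-endian constant silently fails. A minimal, self-contained illustration — the byte swap is written out explicitly so it behaves the same on any host, and these helpers are stand-ins, not the kernel's virtio16 accessors:

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT 1

/* Explicit byte swap, standing in for cpu_to_virtio16() on a
 * big-endian host talking to a little-endian legacy device. */
static uint16_t bswap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Wrong: masks the wire-format field with a native-endian constant,
 * so the flag is missed whenever the two formats differ. */
static int has_next_buggy(uint16_t wire_flags)
{
    return (wire_flags & VRING_DESC_F_NEXT) != 0;
}

/* Right: convert the constant once and compare in wire format,
 * as the patch does with its precomputed 'nextflag'. */
static int has_next_fixed(uint16_t wire_flags)
{
    uint16_t nextflag = bswap16(VRING_DESC_F_NEXT);
    return (wire_flags & nextflag) != 0;
}
```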

--Andy


Re: [PATCH 1/3] context_tracking: remove duplicate enabled check

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini  wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
>
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.

This makes my brain hurt.  Assuming that this survives a boot with
CONFIG_CONTEXT_TRACKING_FORCE=y and CONFIG_PROVE_LOCKING=y (with the
implied CONFIG_PROVE_RCU), then:

Acked-by: Andy Lutomirski 


Re: [PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini  wrote:
> guest_enter and guest_exit must be called with interrupts disabled,
> since they take the vtime_seqlock with write_seq{lock,unlock}.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Split the body of context_tracking_entry and context_tracking_exit
> out to __-prefixed functions, and use them from KVM.
>
> Rik van Riel has measured this to speed up a tight vmentry/vmexit
> loop by about 2%.

Looks generally sensible.  I'm not familiar enough with the code to
call it reviewed-by while sitting on the airport shuttle.

--Andy


Re: [PATCH 0/3] context_tracking: streamline code, avoid IRQ save/restore

2015-10-27 Thread Paolo Bonzini


On 28/10/2015 02:39, Paolo Bonzini wrote:
> The first two of these patches were posted last February, the last one
> is new.  Rik's old measurements were that it shaved around 0.3 microseconds
> on each iteration of his KVM benchmark.
> 
> I guess three days before the start of the merge window is not
> the best time to post patches.  However, I brought this series up at
> kernel summit yesterday, and Andy's cleanups actually makes it trivial
> to apply this to syscall entry.  So here it is, perhaps it's worth it.
> 
> Assuming it works, of course, because this is compile-tested only. :)

Heh, Andy said it doesn't. :)

However, we could still merge the first two patches for 4.4 since they
have been tested.  I can even take them myself if I get Acked-by.  The
third one will have to wait for 4.5.

Thanks,

Paolo


Re: [PATCH 1/3] virtio_net: Stop doing DMA from the stack

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:07 PM, Joerg Roedel  wrote:
> On Tue, Oct 27, 2015 at 06:17:08PM -0700, Andy Lutomirski wrote:
>> From: Andy Lutomirski 
>>
>> Once virtio starts using the DMA API, we won't be able to safely DMA
>> from the stack.  virtio-net does a couple of config DMA requests
>> from small stack buffers -- switch to using dynamically-allocated
>> memory.
>>
>> This should have no effect on any performance-critical code paths.
>>
>> Signed-off-by: Andy Lutomirski 
>> ---
>>  drivers/net/virtio_net.c | 53 
>> 
>>  1 file changed, 36 insertions(+), 17 deletions(-)
>
> Reviewed-by: Joerg Roedel 

Should I just send this to netdev now for 4.4?  We could get this one
out of the way, maybe.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread Joerg Roedel
On Tue, Oct 27, 2015 at 06:17:10PM -0700, Andy Lutomirski wrote:
> From: Andy Lutomirski 
> 
> This fixes virtio-pci on platforms and busses that have IOMMUs.  This
> will break the experimental QEMU Q35 IOMMU support until QEMU is
> fixed.  In exchange, it fixes physical virtio hardware as well as
> virtio-pci running under Xen.
> 
> We should clean up the virtqueue API to do its own allocation and
> teach virtqueue_get_avail and virtqueue_get_used to return DMA
> addresses directly.
> 
> Signed-off-by: Andy Lutomirski 
> ---
>  drivers/virtio/virtio_pci_common.h |  3 ++-
>  drivers/virtio/virtio_pci_legacy.c | 19 +++
>  drivers/virtio/virtio_pci_modern.c | 34 --
>  3 files changed, 41 insertions(+), 15 deletions(-)

Same here, you need to call the dma_sync* functions when passing data
from/to the virtio-device.

I think a good test for that is to boot a virtio kvm-guest with
swiotlb=force and see if it still works.



Joerg



Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Joerg Roedel
On Tue, Oct 27, 2015 at 07:13:56PM -0700, Andy Lutomirski wrote:
> On Tue, Oct 27, 2015 at 7:06 PM, Joerg Roedel  wrote:
> > Hi Andy,
> >
> > On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
> >> From: Andy Lutomirski 
> >>
> >> virtio_ring currently sends the device (usually a hypervisor)
> >> physical addresses of its I/O buffers.  This is okay when DMA
> >> addresses and physical addresses are the same thing, but this isn't
> >> always the case.  For example, this never works on Xen guests, and
> >> it is likely to fail if a physical "virtio" device ever ends up
> >> behind an IOMMU or swiotlb.
> >
> > The overall code looks good, but I haven't seen any dma_sync* calls.
> > When swiotlb=force is in use this would break.
> >
> >> + vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, 
> >> vring_map_single(
> >> + vq,
> >> + desc, total_sg * sizeof(struct vring_desc),
> >> + DMA_TO_DEVICE));
> >
> 
> Are you talking about a dma_sync call on the descriptor ring itself?
> Isn't dma_alloc_coherent supposed to make that unnecessary?  I should
> move the allocation into the virtqueue code.
> 
> The docs suggest that I might need to "flush the processor's write
> buffers before telling devices to read that memory".  I'm not sure how
> to do that.

The write buffers should be flushed by the dma-api functions if
necessary.  For dma_alloc_coherent allocations you don't need to call
dma_sync*, but you do for the map_single/map_page/map_sg ones, as these
might be bounce-buffered.
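Why map-style mappings need dma_sync* can be modeled in a few lines: with swiotlb=force, the mapping copies the buffer into a bounce page at map time, so later CPU writes are invisible to the device until an explicit sync re-copies them. A toy userspace model of that behaviour (assumptions: these are illustrative stand-ins, not the real DMA API):

```c
#include <string.h>

/* Toy bounce buffer, standing in for a swiotlb page. */
static char bounce[64];

/* "Map" a CPU buffer: snapshot it into the bounce page and hand the
 * device the bounce address, like dma_map_single() with swiotlb=force. */
static char *toy_map_single(const char *cpu_buf, size_t len)
{
    memcpy(bounce, cpu_buf, len);
    return bounce;
}

/* Re-copy after the CPU modified the buffer, like
 * dma_sync_single_for_device() before kicking the device. */
static void toy_sync_for_device(const char *cpu_buf, size_t len)
{
    memcpy(bounce, cpu_buf, len);
}
```

Without the sync step, the device keeps seeing the stale snapshot taken at map time, which is exactly what breaks under swiotlb=force.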


Joerg



Re: [PATCH 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread David Woodhouse
On Wed, 2015-10-28 at 11:15 +0900, Joerg Roedel wrote:
> On Tue, Oct 27, 2015 at 06:17:10PM -0700, Andy Lutomirski wrote:
> > From: Andy Lutomirski 
> > 
> > This fixes virtio-pci on platforms and busses that have IOMMUs. 
> >  This
> > will break the experimental QEMU Q35 IOMMU support until QEMU is
> > fixed.  In exchange, it fixes physical virtio hardware as well as
> > virtio-pci running under Xen.
> > 
> > We should clean up the virtqueue API to do its own allocation and
> > teach virtqueue_get_avail and virtqueue_get_used to return DMA
> > addresses directly.
> > 
> > Signed-off-by: Andy Lutomirski 
> > ---
> >  drivers/virtio/virtio_pci_common.h |  3 ++-
> >  drivers/virtio/virtio_pci_legacy.c | 19 +++
> >  drivers/virtio/virtio_pci_modern.c | 34 --
> > 
> >  3 files changed, 41 insertions(+), 15 deletions(-)
> 
> Same here, you need to call the dma_sync* functions when passing data
> from/to the virtio-device.
> 
> I think a good test for that is to boot a virtio kvm-guest with
> swiotlb=force and see if it still works.

That's useful but doesn't cover the cases where dma_wmb() is needed,
right? 

We should make sure we're handling descriptors properly as we would for
real hardware, and ensuring that addresses etc. are written to the
descriptor (and a write barrier occurs) *before* the bit is set which
tells the 'hardware' that it owns that descriptor.

AFAICT we're currently setting the flags *first* in virtqueue_add(),
let alone the missing write barrier between the two.
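The publish order being described — fill in the descriptor fields, write barrier, then flip the ownership bit — can be sketched as follows. This is a userspace model with __sync_synchronize() standing in for dma_wmb()/virtio_wmb(); the struct and flag name are invented for the example:

```c
#include <stdint.h>

struct toy_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
};

#define TOY_DESC_F_AVAIL 0x80

/* Publish a descriptor to the "hardware": every payload field must be
 * globally visible before the ownership flag is set, hence the write
 * barrier between the payload stores and the flag store. */
static void publish_desc(volatile struct toy_desc *d,
                         uint64_t addr, uint32_t len)
{
    d->addr = addr;
    d->len = len;
    __sync_synchronize();          /* stand-in for dma_wmb() */
    d->flags |= TOY_DESC_F_AVAIL;  /* device may now read it */
}
```

Setting the flag first, or omitting the barrier, lets the device observe a descriptor whose address and length are not yet written — which is the hazard raised above.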

-- 
dwmw2






[PATCH v2 0/3] virtio DMA API core stuff

2015-10-27 Thread Andy Lutomirski
This switches virtio to use the DMA API unconditionally.  I'm sure
it breaks things, but it seems to work on x86 using virtio-pci, with
and without Xen, and using both the modern 1.0 variant and the
legacy variant.

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c   |  53 +++
 drivers/virtio/Kconfig |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 +--
 drivers/virtio/virtio_ring.c   | 187 ++---
 tools/virtio/linux/dma-mapping.h   |  17 
 7 files changed, 246 insertions(+), 69 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3



Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:06 PM, Joerg Roedel  wrote:
> Hi Andy,
>
> On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
>> From: Andy Lutomirski 
>>
>> virtio_ring currently sends the device (usually a hypervisor)
>> physical addresses of its I/O buffers.  This is okay when DMA
>> addresses and physical addresses are the same thing, but this isn't
>> always the case.  For example, this never works on Xen guests, and
>> it is likely to fail if a physical "virtio" device ever ends up
>> behind an IOMMU or swiotlb.
>
> The overall code looks good, but I haven't seen any dma_sync* calls.
> When swiotlb=force is in use this would break.
>
>> + vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, 
>> vring_map_single(
>> + vq,
>> + desc, total_sg * sizeof(struct vring_desc),
>> + DMA_TO_DEVICE));
>

Are you talking about a dma_sync call on the descriptor ring itself?
Isn't dma_alloc_coherent supposed to make that unnecessary?  I should
move the allocation into the virtqueue code.

The docs suggest that I might need to "flush the processor's write
buffers before telling devices to read that memory".  I'm not sure how
to do that.

> Nit: I think readability could be improved here by using a temporary
> variable for the return value of vring_map_single().
>

I'll do something like that for v2.

--Andy


Re: [PATCH 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread Joerg Roedel
On Wed, Oct 28, 2015 at 11:15:30AM +0900, Joerg Roedel wrote:
> Same here, you need to call the dma_sync* functions when passing data
> from/to the virtio-device.

Okay, forget about this comment. This patch only converts to
dma_coherent allocations, which don't need synchronization.

> I think a good test for that is to boot a virtio kvm-guest with
> swiotlb=force and see if it still works.

But this still holds: it's a good way to test whether your changes work
with bounce-buffering.  Together with DMA_API_DEBUG you can also see
whether your specified dma_directions are right.



Joerg



Re: [PATCH 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread Joerg Roedel
On Wed, Oct 28, 2015 at 11:22:52AM +0900, David Woodhouse wrote:
> On Wed, 2015-10-28 at 11:15 +0900, Joerg Roedel wrote:
> > I think a good test for that is to boot a virtio kvm-guest with
> > swiotlb=force and see if it still works.
> 
> That's useful but doesn't cover the cases where dma_wmb() is needed,
> right? 
> 
> We should make sure we're handling descriptors properly as we would for
> real hardware, and ensuring that addresses etc. are written to the
> descriptor (and a write barrier occurs) *before* the bit is set which
> tells the 'hardware' that it owns that descriptor.

Hmm, good question. The virtio code has virtio_rmb/wmb and should
already call them in the right places.

The virtio barriers default to rmb()/wmb() unless weak_barriers is set,
in which case they call dma_rmb()/dma_wmb(), just compiler barriers on x86.

And weak_barriers is set by the virtio drivers when creating the queue,
and the drivers should know what they are doing. That said, I think
the virtio drivers should already get this right.


Joerg



Re: [PATCH] vhost: fix performance on LE hosts

2015-10-27 Thread David Miller
From: "Michael S. Tsirkin" 
Date: Tue, 27 Oct 2015 11:37:39 +0200

> commit 2751c9882b947292fcfb084c4f604e01724af804 ("vhost: cross-endian
> support for legacy devices") introduced a minor regression: even with
> cross-endian disabled, and even on LE host, vhost_is_little_endian is
> checking is_le flag so there's always a branch.
> 
> To fix, simply check virtio_legacy_is_little_endian first.
> 
> Cc: Greg Kurz 
> Signed-off-by: Michael S. Tsirkin 

Applied.
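The shape of the fix described in the commit message is just a reordered short-circuit: check the compile-time constant first so the is_le load (and its branch) disappears entirely on LE hosts built without cross-endian support. A sketch with userspace stand-ins, not the actual vhost code:

```c
#include <stdbool.h>

struct toy_vq {
    bool is_le;   /* runtime cross-endian state */
};

/* Compile-time answer for the build host, modeled with the compiler's
 * endianness macro; in vhost this is virtio_legacy_is_little_endian(). */
static inline bool legacy_is_little_endian(void)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return false;
#else
    return true;
#endif
}

/* Checking the constant first lets the compiler fold the whole
 * expression to 'true' on a little-endian host, removing the
 * always-taken branch the commit message complains about. */
static inline bool toy_is_little_endian(const struct toy_vq *vq)
{
    return legacy_is_little_endian() || vq->is_le;
}
```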


Re: [PATCH 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 7:21 PM, Joerg Roedel  wrote:
> On Tue, Oct 27, 2015 at 07:13:56PM -0700, Andy Lutomirski wrote:
>> On Tue, Oct 27, 2015 at 7:06 PM, Joerg Roedel  wrote:
>> > Hi Andy,
>> >
>> > On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
>> >> From: Andy Lutomirski 
>> >>
>> >> virtio_ring currently sends the device (usually a hypervisor)
>> >> physical addresses of its I/O buffers.  This is okay when DMA
>> >> addresses and physical addresses are the same thing, but this isn't
>> >> always the case.  For example, this never works on Xen guests, and
>> >> it is likely to fail if a physical "virtio" device ever ends up
>> >> behind an IOMMU or swiotlb.
>> >
>> > The overall code looks good, but I haven't seen any dma_sync* calls.
>> > When swiotlb=force is in use this would break.
>> >
>> >> + vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, 
>> >> vring_map_single(
>> >> + vq,
>> >> + desc, total_sg * sizeof(struct vring_desc),
>> >> + DMA_TO_DEVICE));
>> >
>>
>> Are you talking about a dma_sync call on the descriptor ring itself?
>> Isn't dma_alloc_coherent supposed to make that unnecessary?  I should
>> move the allocation into the virtqueue code.
>>
>> The docs suggest that I might need to "flush the processor's write
>> buffers before telling devices to read that memory".  I'm not sure how
>> to do that.
>
> The write buffers should be flushed by the dma-api functions if
> necessary.  For dma_alloc_coherent allocations you don't need to call
> dma_sync*, but you do for the map_single/map_page/map_sg ones, as these
> might be bounce-buffered.

I think that all the necessary barriers are already there.  I had a
nasty bug that swiotlb=force exposed, though, which I've fixed.

--Andy


Re: [PATCH 3/3] x86: context_tracking: avoid irq_save/irq_restore on kernel entry and exit

2015-10-27 Thread Andy Lutomirski
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini  wrote:
> x86 always calls user_enter and user_exit with interrupt disabled.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Use the previously introduced __context_tracking_entry and
> __context_tracking_exit.

x86 isn't ready for this yet.  We could do a quick-and-dirty fix with
explicit IRQs-on-and-off much protected by the static key, or we could
just wait until I finish the syscall cleanup.  I favor the latter, but
you're all welcome to do the former and I'll review it.

BTW, Frederic, I have a static key patch coming that I think you'll
like.  I'll send it tomorrow once I'm in front of a real computer.

--Andy


[PATCH v2 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Andy Lutomirski
virtio_ring currently sends the device (usually a hypervisor)
physical addresses of its I/O buffers.  This is okay when DMA
addresses and physical addresses are the same thing, but this isn't
always the case.  For example, this never works on Xen guests, and
it is likely to fail if a physical "virtio" device ever ends up
behind an IOMMU or swiotlb.

The immediate use case for me is to enable virtio on Xen guests.
For that to work, we need vring to support DMA address translation
as well as a corresponding change to virtio_pci or to another
driver.

With this patch, if enabled, virtfs survives kmemleak and
CONFIG_DMA_API_DEBUG.

Signed-off-by: Andy Lutomirski 
---
 drivers/virtio/Kconfig   |   2 +-
 drivers/virtio/virtio_ring.c | 187 +++
 tools/virtio/linux/dma-mapping.h |  17 
 3 files changed, 169 insertions(+), 37 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index cab9f3f63a38..77590320d44c 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -60,7 +60,7 @@ config VIRTIO_INPUT
 
  config VIRTIO_MMIO
tristate "Platform bus driver for memory mapped virtio devices"
-   depends on HAS_IOMEM
+   depends on HAS_IOMEM && HAS_DMA
select VIRTIO
---help---
 This drivers provides support for memory mapped virtio
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 096b857e7b75..e71bf0ca59a2 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
@@ -54,7 +55,14 @@
 #define END_USE(vq)
 #endif
 
-struct vring_virtqueue {
+struct vring_desc_state
+{
+   void *data; /* Data for callback. */
+   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+};
+
+struct vring_virtqueue
+{
struct virtqueue vq;
 
/* Actual memory layout for this queue */
@@ -92,12 +100,71 @@ struct vring_virtqueue {
ktime_t last_add_time;
 #endif
 
-   /* Tokens for callbacks. */
-   void *data[];
+   /* Per-descriptor state. */
+   struct vring_desc_state desc_state[];
 };
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+/*
+ * The DMA ops on various arches are rather gnarly right now, and
+ * making all of the arch DMA ops work on the vring device itself
+ * is a mess.  For now, we use the parent device for DMA ops.
+ */
+struct device *vring_dma_dev(const struct vring_virtqueue *vq)
+{
+   return vq->vq.vdev->dev.parent;
+}
+
+/* Map one sg entry. */
+static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
+  struct scatterlist *sg,
+  enum dma_data_direction direction)
+{
+   /*
+* We can't use dma_map_sg, because we don't use scatterlists in
+* the way it expects (we don't guarantee that the scatterlist
+* will exist for the lifetime of the mapping).
+*/
+   return dma_map_page(vring_dma_dev(vq),
+   sg_page(sg), sg->offset, sg->length,
+   direction);
+}
+
+static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
+  void *cpu_addr, size_t size,
+  enum dma_data_direction direction)
+{
+   return dma_map_single(vring_dma_dev(vq),
+ cpu_addr, size, direction);
+}
+
+static void vring_unmap_one(const struct vring_virtqueue *vq,
+   struct vring_desc *desc)
+{
+   u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
 static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
 unsigned int total_sg, gfp_t gfp)
 {
@@ -131,7 +198,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
struct 

[PATCH v2 1/3] virtio_net: Stop doing DMA from the stack

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski 

Once virtio starts using the DMA API, we won't be able to safely DMA
from the stack.  virtio-net does a couple of config DMA requests
from small stack buffers -- switch to using dynamically-allocated
memory.

This should have no effect on any performance-critical code paths.

Cc: net...@vger.kernel.org
Cc: "Michael S. Tsirkin" 
Cc: virtualizat...@lists.linux-foundation.org
Reviewed-by: Joerg Roedel 
Signed-off-by: Andy Lutomirski 
---

Hi Michael and DaveM-

This is a prerequisite for the virtio DMA fixing project.  It works
as a standalone patch, though.  Would it make sense to apply it to
an appropriate networking tree now?

 drivers/net/virtio_net.c | 53 
 1 file changed, 36 insertions(+), 17 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d8838dedb7a4..4f10f8a58811 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 struct scatterlist *out)
 {
struct scatterlist *sgs[4], hdr, stat;
-   struct virtio_net_ctrl_hdr ctrl;
-   virtio_net_ctrl_ack status = ~0;
+
+   struct {
+   struct virtio_net_ctrl_hdr ctrl;
+   virtio_net_ctrl_ack status;
+   } *buf;
+
unsigned out_num = 0, tmp;
+   bool ret;
 
/* Caller should know better */
BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
 
-   ctrl.class = class;
-   ctrl.cmd = cmd;
+   buf = kmalloc(sizeof(*buf), GFP_ATOMIC);
+   if (!buf)
+   return false;
+   buf->status = ~0;
+
+   buf->ctrl.class = class;
+   buf->ctrl.cmd = cmd;
/* Add header */
-   sg_init_one(&hdr, &ctrl, sizeof(ctrl));
+   sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl));
sgs[out_num++] = &hdr;
 
if (out)
sgs[out_num++] = out;
 
/* Add return status. */
-   sg_init_one(&stat, &status, sizeof(status));
+   sg_init_one(&stat, &buf->status, sizeof(buf->status));
sgs[out_num] = &stat;
 
BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-   if (unlikely(!virtqueue_kick(vi->cvq)))
-   return status == VIRTIO_NET_OK;
+   if (unlikely(!virtqueue_kick(vi->cvq))) {
+   ret = (buf->status == VIRTIO_NET_OK);
+   goto out;
+   }
 
/* Spin for a response, the kick causes an ioport write, trapping
 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
   !virtqueue_is_broken(vi->cvq))
cpu_relax();
 
-   return status == VIRTIO_NET_OK;
+   ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+   kfree(buf);
+   return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct scatterlist sg[2];
-   u8 promisc, allmulti;
+   u8 *cmdbyte;
struct virtio_net_ctrl_mac *mac_data;
struct netdev_hw_addr *ha;
int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
return;
 
-   promisc = ((dev->flags & IFF_PROMISC) != 0);
-   allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+   cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+   if (!cmdbyte)
+   return;
 
-   sg_init_one(sg, &promisc, sizeof(promisc));
+   sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
 
+   *cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_PROMISC, sg))
dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-promisc ? "en" : "dis");
-
-   sg_init_one(sg, &allmulti, sizeof(allmulti));
+*cmdbyte ? "en" : "dis");
 
+   *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-allmulti ? "en" : "dis");
+*cmdbyte ? "en" : "dis");
+
+   kfree(cmdbyte);
 
uc_count = netdev_uc_count(dev);
mc_count = netdev_mc_count(dev);
-- 
2.4.3



[PATCH v2 3/3] virtio_pci: Use the DMA API

2015-10-27 Thread Andy Lutomirski
From: Andy Lutomirski 

This fixes virtio-pci on platforms and busses that have IOMMUs.  This
will break the experimental QEMU Q35 IOMMU support until QEMU is
fixed.  In exchange, it fixes physical virtio hardware as well as
virtio-pci running under Xen.

We should clean up the virtqueue API to do its own allocation and
teach virtqueue_get_avail and virtqueue_get_used to return DMA
addresses directly.

Signed-off-by: Andy Lutomirski 
---

drivers/virtio/virtio_pci_common.h |  3 ++-
 drivers/virtio/virtio_pci_legacy.c | 19 +++
 drivers/virtio/virtio_pci_modern.c | 34 --
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index b976d968e793..cd6196b513ad 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -38,8 +38,9 @@ struct virtio_pci_vq_info {
/* the number of entries in the queue */
int num;
 
-   /* the virtual address of the ring queue */
+   /* the ring queue */
void *queue;
+   dma_addr_t queue_dma_addr;  /* bus address */
 
/* the list node for the virtqueues list */
struct list_head node;
diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
index 48bc9797e530..b5293e5f2af4 100644
--- a/drivers/virtio/virtio_pci_legacy.c
+++ b/drivers/virtio/virtio_pci_legacy.c
@@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
info->msix_vector = msix_vec;
 
size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN));
-   info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO);
+   info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size,
+ &info->queue_dma_addr,
+ GFP_KERNEL);
if (info->queue == NULL)
return ERR_PTR(-ENOMEM);
 
/* activate the queue */
-   iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
+   iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
/* create the vring */
@@ -169,7 +171,8 @@ out_assign:
vring_del_virtqueue(vq);
 out_activate_queue:
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(&vp_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
return ERR_PTR(err);
 }
 
@@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info)
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN));
-   free_pages_exact(info->queue, size);
+   dma_free_coherent(&vp_dev->pci_dev->dev, size,
+ info->queue, info->queue_dma_addr);
 }
 
 static const struct virtio_config_ops virtio_pci_config_ops = {
@@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev)
return -ENODEV;
}
 
+   rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64));
+   if (rc)
+   rc = dma_set_mask_and_coherent(&pci_dev->dev,
+   DMA_BIT_MASK(32));
+   if (rc)
+   dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA.  Trying to continue, but this might not work.\n");
+
rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy");
if (rc)
return rc;
diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index 8e5cf194cc0b..fbe0bd1c4881 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num)
return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES));
 }
 
-static void *alloc_virtqueue_pages(int *num)
+static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev,
+  int *num, dma_addr_t *dma_addr)
 {
void *pages;
 
/* TODO: allocate each queue chunk individually */
for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) {
-   pages = alloc_pages_exact(vring_pci_size(*num),
- GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN);
+   pages = dma_zalloc_coherent(
&vp_dev->pci_dev->dev, vring_pci_size(*num),
+   dma_addr, GFP_KERNEL|__GFP_NOWARN);
if (pages)
return pages;
}
@@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num)
return NULL;
 
/* Try to get a single page. You are my only hope! */
-   return alloc_pages_exact(vring_pci_size(*num), GFP_KERNEL|__GFP_ZERO);
+   return dma_zalloc_coherent(
+ 

Re: [PATCH v2 2/3] virtio_ring: Support DMA APIs

2015-10-27 Thread Christian Borntraeger
On 28.10.2015 at 14:30, Andy Lutomirski wrote:

> +static void vring_unmap_one(const struct vring_virtqueue *vq,
> + struct vring_desc *desc)
> +{
> + u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> +
> + if (flags & VRING_DESC_F_INDIRECT) {
> + dma_unmap_single(vring_dma_dev(vq),
> +  virtio64_to_cpu(vq->vq.vdev, desc->addr),
> +  virtio32_to_cpu(vq->vq.vdev, desc->len),
> +  (flags & VRING_DESC_F_WRITE) ?
> +  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> + } else {
> + dma_unmap_page(vring_dma_dev(vq),
> +virtio64_to_cpu(vq->vq.vdev, desc->addr),
> +virtio32_to_cpu(vq->vq.vdev, desc->len),
> +(flags & VRING_DESC_F_WRITE) ?
> +DMA_FROM_DEVICE : DMA_TO_DEVICE);
> + }
> +}
> +



I still have a failure in this:

[1.913040] Unable to handle kernel pointer dereference in virtual kernel 
address space
[1.913044] failing address: 0c80 TEID: 0c800803
[1.913045] Fault in home space mode while using kernel ASCE.
[1.913048] AS:00d56007 R3:0c7f0007 S:0020 
[1.913099] Oops: 0010 ilc:2 [#1] SMP 
[1.913142] Modules linked in:
[1.913144] CPU: 4 PID: 50 Comm: kworker/u18:1 Not tainted 4.3.0-rc3+ #252
[1.913150] Workqueue: events_unbound call_usermodehelper_exec_work
[1.913152] task: 0bb8b310 ti: 0bba8000 task.ti: 
0bba8000
[1.913154] Krnl PSW : 0404e0018000 0059b266 
(vring_unmap_one+0x46/0x8d0)
[1.913158]R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 
EA:3
Krnl GPRS:  0c80 0c036800 0c80
[1.913161]005a21a8   

[1.913162]0704c0018000 04000bdbfce8 0c082000 
0bdbf6e8
[1.913164]0400  0bdbf988 
0bdbf6e8
[1.913170] Krnl Code: 0059b254: e310b0a80004lg  
%r1,168(%r11)
   0059b25a: e3201024   lg  %r2,32(%r1)
  #0059b260: e310b0a4   lg  %r1,160(%r11)
  >0059b266: 4810100c   lh  %r1,12(%r1)
   0059b26a: e320b2a80024   stg %r2,680(%r11)
   0059b270: 4010b2a6   sth %r1,678(%r11)
   0059b274: e310b2a80004   lg  %r1,680(%r11)
   0059b27a: e310b2980024   stg %r1,664(%r11)
[1.913183] Call Trace:
[1.913185] ([<0059b74c>] vring_unmap_one+0x52c/0x8d0)
[1.913187]  [<005a21a8>] detach_buf+0x720/0x788
[1.913188]  [<005a2830>] virtqueue_get_buf+0x620/0x908
[1.913191]  [<005e5336>] virtblk_done+0xa6/0x120
[1.913192]  [<005a3e46>] vring_interrupt+0x2a6/0x2c0
[1.913224]  [<006c9bdc>] virtio_airq_handler+0x7c/0x120
[1.913226]  [<0065f88c>] do_airq_interrupt+0xa4/0xc8
[1.913229]  [<001b79a0>] handle_irq_event_percpu+0x60/0x1f0
[1.913230]  [<001bbbea>] handle_percpu_irq+0x72/0xa0
[1.913232]  [<001b6fa4>] generic_handle_irq+0x4c/0x78
[1.913234]  [<0010cc7c>] do_IRQ+0x64/0x88
[1.913236]  [<00815d42>] io_int_handler+0x10a/0x218
[1.913238]  [<00104268>] copy_thread+0x78/0x1a0
[1.913240] ([<001548f8>] copy_process.isra.11+0x750/0x1a80)
[1.913242]  [<00156122>] _do_fork+0x9a/0x338
[1.913243]  [<0015644e>] kernel_thread+0x4e/0x60
[1.913245]  [<0016da7a>] call_usermodehelper_exec_work+0x7a/0xf0
[1.913247]  [<00171c06>] process_one_work+0x1b6/0x490
[1.913248]  [<00171f38>] worker_thread+0x58/0x588
[1.913250]  [<001788fa>] kthread+0x10a/0x110
[1.913252]  [<00815a8e>] kernel_thread_starter+0x6/0xc
[1.913254]  [<00815a88>] kernel_thread_starter+0x0/0xc
[1.913255] Last Breaking-Event-Address:
[1.913256]  [<005a21a2>] detach_buf+0x71a/0x788
[1.913258]  
[1.913263] Kernel panic - not syncing: Fatal exception in interrupt
