Re: virtio fixes pull for 4.0?

2015-03-16 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Mon, Mar 09, 2015 at 05:43:19PM +1030, Rusty Russell wrote:
  I think it's a good idea to merge these patches (maybe except the
  !TASK_RUNNING thing) sooner rather than later, to make sure people have
  the time to test the fixes properly.  Would you like me to pack up (some
  of them) them up and do a pull request?
 
 I'm waiting a bit longer, since they seem to still be tricking in.

 I don't think there was a pull request since rc1 which
 makes me a bit anxious. Any outstanding issues?

Well, I was hoping for an ack on your mmio patches.  But the weekend
is over, so I've pushed and sent Linus a pull request.

Cheers,
Rusty.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 log fixed] virtio_mmio: fix endian-ness for mmio

2015-03-12 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 Subject: [PATCH] virtio_mmio: fix access width for mmio

Just for the record:

Applied.

Thanks,
Rusty.

 Going over the virtio mmio code, I noticed that it doesn't correctly
 access modern device config values using natural accessors: it uses
 readb to get/set them byte by byte, while the virtio 1.0 spec explicitly 
 states:

 4.2.2.2 Driver Requirements: MMIO Device Register Layout

 ...

 The driver MUST only use 32 bit wide and aligned reads and writes to
 access the control registers described in table 4.1.
 For the device-specific configuration space, the driver MUST use
 8 bit wide accesses for 8 bit wide fields, 16 bit wide and aligned
 accesses for 16 bit wide fields and 32 bit wide and aligned accesses 
 for
 32 and 64 bit wide fields.

 Borrow code from virtio_pci_modern to do this correctly.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

 ---

 This is exactly the same as PATCH virtio_mmio: fix endian-ness for mmio
 but with corrected subject and commit log, to make
 it easier to apply.

 Note: untested: QEMU doesn't support virtio 1.0 for virtio-mmio,
 but seems pretty obvious. Let's apply ASAP so we don't ship
 incompliant drivers for virtio 1.0?

  drivers/virtio/virtio_mmio.c | 79 
 +++-
  1 file changed, 71 insertions(+), 8 deletions(-)

 diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
 index cad5698..0375456 100644
 --- a/drivers/virtio/virtio_mmio.c
 +++ b/drivers/virtio/virtio_mmio.c
 @@ -156,22 +156,85 @@ static void vm_get(struct virtio_device *vdev, unsigned 
 offset,
  void *buf, unsigned len)
  {
   struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
 - u8 *ptr = buf;
 - int i;
 + void __iomem *base = vm_dev-base + VIRTIO_MMIO_CONFIG;
 + u8 b;
 + __le16 w;
 + __le32 l;
  
 - for (i = 0; i  len; i++)
 - ptr[i] = readb(vm_dev-base + VIRTIO_MMIO_CONFIG + offset + i);
 + if (vm_dev-version == 1) {
 + u8 *ptr = buf;
 + int i;
 +
 + for (i = 0; i  len; i++)
 + ptr[i] = readb(base + offset + i);
 + return;
 + }
 +
 + switch (len) {
 + case 1:
 + b = readb(base + offset);
 + memcpy(buf, b, sizeof b);
 + break;
 + case 2:
 + w = cpu_to_le16(readw(base + offset));
 + memcpy(buf, w, sizeof w);
 + break;
 + case 4:
 + l = cpu_to_le32(readl(base + offset));
 + memcpy(buf, l, sizeof l);
 + break;
 + case 8:
 + l = cpu_to_le32(readl(base + offset));
 + memcpy(buf, l, sizeof l);
 + l = cpu_to_le32(ioread32(base + offset + sizeof l));
 + memcpy(buf + sizeof l, l, sizeof l);
 + break;
 + default:
 + BUG();
 + }
  }
  
  static void vm_set(struct virtio_device *vdev, unsigned offset,
  const void *buf, unsigned len)
  {
   struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev);
 - const u8 *ptr = buf;
 - int i;
 + void __iomem *base = vm_dev-base + VIRTIO_MMIO_CONFIG;
 + u8 b;
 + __le16 w;
 + __le32 l;
  
 - for (i = 0; i  len; i++)
 - writeb(ptr[i], vm_dev-base + VIRTIO_MMIO_CONFIG + offset + i);
 + if (vm_dev-version == 1) {
 + const u8 *ptr = buf;
 + int i;
 +
 + for (i = 0; i  len; i++)
 + writeb(ptr[i], base + offset + i);
 +
 + return;
 + }
 +
 + switch (len) {
 + case 1:
 + memcpy(b, buf, sizeof b);
 + writeb(b, base + offset);
 + break;
 + case 2:
 + memcpy(w, buf, sizeof w);
 + writew(le16_to_cpu(w), base + offset);
 + break;
 + case 4:
 + memcpy(l, buf, sizeof l);
 + writel(le32_to_cpu(l), base + offset);
 + break;
 + case 8:
 + memcpy(l, buf, sizeof l);
 + writel(le32_to_cpu(l), base + offset);
 + memcpy(l, buf + sizeof l, sizeof l);
 + writel(le32_to_cpu(l), base + offset + sizeof l);
 + break;
 + default:
 + BUG();
 + }
  }
  
  static u8 vm_get_status(struct virtio_device *vdev)
 -- 
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_rpmsg: set DRIVER_OK before using device

2015-03-11 Thread Rusty Russell
Ohad Ben-Cohen o...@wizery.com writes:
 On Mon, Mar 9, 2015 at 10:41 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Sat, Mar 07, 2015 at 08:06:56PM +0100, Michael S. Tsirkin wrote:
 virtio spec requires that all drivers set DRIVER_OK
 before using devices. While rpmsg isn't yet
 included in the virtio 1 spec, previous spec versions
 also required this.

 virtio rpmsg violates this rule: is calls kick
 before setting DRIVER_OK.

 The fix isn't trivial since simply calling virtio_device_ready earlier
 would mean we might get an interrupt in parallel with adding buffers.

 Instead, split kick out to prepare+notify calls.  prepare before
 virtio_device_ready - when we know we won't get interrupts. notify right
 afterwards.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---

 Ohad, can you review and ack pls?

 Sure,

 Acked-by: Ohad Ben-Cohen o...@wizery.com

Applied.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] virtio-balloon: do not call blocking ops when !TASK_RUNNING

2015-03-09 Thread Rusty Russell
Thomas Huth th...@linux.vnet.ibm.com writes:
 On Wed, 25 Feb 2015 16:11:27 +0100
 Cornelia Huck cornelia.h...@de.ibm.com wrote:

 On Wed, 25 Feb 2015 15:36:02 +0100
 Michael S. Tsirkin m...@redhat.com wrote:
 
  virtio balloon has this code:
  wait_event_interruptible(vb-config_change,
   (diff = towards_target(vb)) != 0
   || vb-need_stats_update
   || kthread_should_stop()
   || freezing(current));
  
  Which is a problem because towards_target() call might block after
  wait_event_interruptible sets task state to TAST_INTERRUPTIBLE, causing
  the task_struct::state collision typical of nesting of sleeping
  primitives
  
  See also http://lwn.net/Articles/628628/ or Thomas's
  bug report
  http://article.gmane.org/gmane.linux.kernel.virtualization/24846
  for a fuller explanation.
  
  To fix, rewrite using wait_woken.
  
  Cc: sta...@vger.kernel.org
  Reported-by: Thomas Huth th...@linux.vnet.ibm.com
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
  
  changes from v1:
 remove wait_event_interruptible
 noticed by Cornelia Huck cornelia.h...@de.ibm.com
  
   drivers/virtio/virtio_balloon.c | 19 ++-
   1 file changed, 14 insertions(+), 5 deletions(-)
  
 
 I was able to reproduce Thomas' original problem and can confirm that
 it is gone with this patch.
 
 Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com

 Right, I just applied the patch on my system, too, and the problem is
 indeed gone! Thanks for the quick fix!

 Tested-by: Thomas Huth th...@linux.vnet.ibm.com

Applied.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio balloon: do not call blocking ops when !TASK_RUNNING

2015-03-09 Thread Rusty Russell
Cornelia Huck cornelia.h...@de.ibm.com writes:
 On Wed, 4 Mar 2015 11:25:56 +0100
 Michael S. Tsirkin m...@redhat.com wrote:

 On Wed, Mar 04, 2015 at 04:44:54PM +1030, Rusty Russell wrote:
  Michael S. Tsirkin m...@redhat.com writes:
   On Mon, Mar 02, 2015 at 10:37:26AM +1030, Rusty Russell wrote:
   Thomas Huth th...@linux.vnet.ibm.com writes:
On Thu, 26 Feb 2015 11:50:42 +1030
Rusty Russell ru...@rustcorp.com.au wrote:
   
Thomas Huth th...@linux.vnet.ibm.com writes:
  Hi all,

 with the recent kernel 3.19, I get a kernel warning when I start my
 KVM guest on s390 with virtio balloon enabled:

The deeper problem is that virtio_ccw_get_config just silently fails 
on
OOM.

Neither get_config nor set_config are expected to fail.
   
AFAIK this is currently not a problem. According to
http://lwn.net/Articles/627419/ these kmalloc calls never
fail because they allocate less than a page.
   
   I strongly suggest you unlearn that fact.
   The fix for this is in two parts:
   
   1) Annotate using sched_annotate_sleep() and add a comment: we may spin
  a few times in low memory situations, but this isn't a high
  performance path.
   
   2) Handle get_config (and other) failure in some more elegant way.
   
   Cheers,
   Rusty.
  
   I agree, but I'd like to point out that even without kmalloc,
   on s390 get_config is blocking - it's waiting
   for a hardware interrupt.
  
   And it makes sense: config is not data path, I don't think
   we should spin there.
  
   So I think besides these two parts, we still need my two patches:
   virtio-balloon: do not call blocking ops when !TASK_RUNNING
  
  I prefer to annotate, over trying to fix this.
  
  Because it's not important.  We might spin a few times, but it's very
  unlikely, and it's certainly not performance critical.
  
  Thanks,
  Rusty.
  
  Subject: virtio_balloon: annotate possible sleep waiting for event.
  
  CCW (s390) does this.
  
  Reported-by: Thomas Huth th...@linux.vnet.ibm.com
  Signed-off-by: Rusty Russell ru...@rustcorp.com.au
  
  diff --git a/drivers/virtio/virtio_balloon.c 
  b/drivers/virtio/virtio_balloon.c
  index 0413157f3b49..3f4d5acdbde0 100644
  --- a/drivers/virtio/virtio_balloon.c
  +++ b/drivers/virtio/virtio_balloon.c
  @@ -340,6 +340,15 @@ static int balloon(void *_vballoon)
 s64 diff;
   
 try_to_freeze();
  +
  +  /*
  +   * Reading the config on the ccw backend involves an
  +   * allocation, so we may actually sleep and have an
  +   * extra iteration.  It's extremely unlikely,
 
 Hmm, this part of the comment seems wrong to me.
 Reading the config on the ccw backend always sleeps
 because it's interrupt driven.

 (...)

 So I suspect
 http://mid.gmane.org/1424874878-17155-1-git-send-email-...@redhat.com
 is better.
 
 What do you think?

 I'd prefer to fix this as well. While the I/O request completes
 instantly on current qemu (the ssch backend handles the start function
 immediately, not asynchronously as on real hardware), this (a) is an
 implementation detail that may change and (b) doesn't account for the
 need to deliver the interrupt to the guest - which might take non-zero
 time.

Ah, I see.  My mistake.

I've thrown out my patch, applied that one.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_rpmsg: set DRIVER_OK before using device

2015-03-09 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 virtio spec requires that all drivers set DRIVER_OK
 before using devices. While rpmsg isn't yet
 included in the virtio 1 spec, previous spec versions
 also required this.

 virtio rpmsg violates this rule: is calls kick
 before setting DRIVER_OK.

 The fix isn't trivial since simply calling virtio_device_ready earlier
 would mean we might get an interrupt in parallel with adding buffers.

 Instead, split kick out to prepare+notify calls.  prepare before
 virtio_device_ready - when we know we won't get interrupts. notify right
 afterwards.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Applied.  I'll wait for Ohad to ack before sending to Linus.

BTW I assume you have a version of qemu which warns on these kind of
failures?  That'd be nice to have!

Thanks,
Rusty.

 ---

 Note: compile-tested only.

  drivers/rpmsg/virtio_rpmsg_bus.c | 17 -
  1 file changed, 16 insertions(+), 1 deletion(-)

 diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c 
 b/drivers/rpmsg/virtio_rpmsg_bus.c
 index 92f6af6..73354ee 100644
 --- a/drivers/rpmsg/virtio_rpmsg_bus.c
 +++ b/drivers/rpmsg/virtio_rpmsg_bus.c
 @@ -951,6 +951,7 @@ static int rpmsg_probe(struct virtio_device *vdev)
   void *bufs_va;
   int err = 0, i;
   size_t total_buf_space;
 + bool notify;
  
   vrp = kzalloc(sizeof(*vrp), GFP_KERNEL);
   if (!vrp)
 @@ -1030,8 +1031,22 @@ static int rpmsg_probe(struct virtio_device *vdev)
   }
   }
  
 + /*
 +  * Prepare to kick but don't notify yet - we can't do this before
 +  * device is ready.
 +  */
 + notify = virtqueue_kick_prepare(vrp-rvq);
 +
 + /* From this point on, we can notify and get callbacks. */
 + virtio_device_ready(vdev);
 +
   /* tell the remote processor it can start sending messages */
 - virtqueue_kick(vrp-rvq);
 + /*
 +  * this might be concurrent with callbacks, but we are only
 +  * doing notify, not a full kick here, so that's ok.
 +  */
 + if (notify)
 + virtqueue_notify(vrp-rvq);
  
   dev_info(vdev-dev, rpmsg host is online\n);
  
 -- 
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio fixes pull for 4.0?

2015-03-09 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 Hi Rusty!
 There are a bunch of (mostly virtio 1.0 related) fixes for virtio
 that need to go into 4.0 I think.
   virtio_blk: typo fix
   virtio_blk: fix comment for virtio 1.0

OK, I've added these two.  I tend to be overcautious after the merge
window.

   virtio_console: init work unconditionally
   virtio_console: avoid config access from irq
   virtio_balloon: set DRIVER_OK before using device

 seem ready?

These are in my virtio-next tree already.  

   virtio_mmio: generation support
   virtio_mmio: fix endian-ness for mmio these two are waiting for ack by 
 Pawel

 These two fix bugs in virtio 1.0 code for mmio.
 Host code for that was AFAIK not posted, so I can't test properly.
 Pawel?

I'm waiting on Acks for these two.

   virtio-balloon: do not call blocking ops when !TASK_RUNNING

 Rusty, it seems you prefer a different fix for this issue,
 while Cornelia prefers mine. Maybe both me and Cornelia misunderstand the
 issue? I know you dealt with a ton of similar issues recently
 so you are more likely to be right, but I'd like to understand
 what's going on better all the same. Could you help please?

In the longer run, we should handle failures from these callbacks.  But
we don't need to do that right now.  So we want the minimal fix.

And an annotation is the minimal fix.  The bug has been there for ages;
it's just the warning that is new (if we *always* slept, we would
spin, but in practice we'll rarely sleep).

   virtio_rpmsg: set DRIVER_OK before using device

 Just posted this, but seems pretty obvious.

Yep, I've applied this too.  Thanks!

 I think it's a good idea to merge these patches (maybe except the
 !TASK_RUNNING thing) sooner rather than later, to make sure people have
 the time to test the fixes properly.  Would you like me to pack up (some
 of them) them up and do a pull request?

I'm waiting a bit longer, since they seem to still be tricking in.

I'm still chasing a QEMU bug, where it seems to fill in a number too
large in the 'len' field for a block device.  It should be 1 byte for a
block device write, for example.  See patch which causes assert() in
qemu, but I had to stop at that point (should get back tomorrow I hope).

Thanks,
Rusty.

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 882a31b..98e99b8 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -243,16 +243,21 @@ int virtio_queue_empty(VirtQueue *vq)
 }
 
 void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
-unsigned int len, unsigned int idx)
+unsigned int len_written, unsigned int idx)
 {
-unsigned int offset;
+unsigned int offset, tot_wlen;
 int i;
 
-trace_virtqueue_fill(vq, elem, len, idx);
+trace_virtqueue_fill(vq, elem, len_written, idx);
+
+for (tot_wlen = i = 0; i  elem-out_num; i++) {
+tot_wlen += elem-out_sg[i].iov_len;
+}
+assert(len_written = tot_wlen);
 
 offset = 0;
 for (i = 0; i  elem-in_num; i++) {
-size_t size = MIN(len - offset, elem-in_sg[i].iov_len);
+size_t size = MIN(len_written - offset, elem-in_sg[i].iov_len);
 
 cpu_physical_memory_unmap(elem-in_sg[i].iov_base,
   elem-in_sg[i].iov_len,
@@ -270,7 +275,7 @@ void virtqueue_fill(VirtQueue *vq, const VirtQueueElement 
*elem,
 
 /* Get a pointer to the next entry in the used ring. */
 vring_used_ring_id(vq, idx, elem-index);
-vring_used_ring_len(vq, idx, len);
+vring_used_ring_len(vq, idx, len_written);
 }
 
 void virtqueue_flush(VirtQueue *vq, unsigned int count)
@@ -288,9 +293,9 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
 }
 
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
-unsigned int len)
+unsigned int len_written)
 {
-virtqueue_fill(vq, elem, len, 0);
+virtqueue_fill(vq, elem, len_written, 0);
 virtqueue_flush(vq, 1);
 }
 
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index df09993..153374f 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -191,10 +191,10 @@ VirtQueue *virtio_add_queue(VirtIODevice *vdev, int 
queue_size,
 void virtio_del_queue(VirtIODevice *vdev, int n);
 
 void virtqueue_push(VirtQueue *vq, const VirtQueueElement *elem,
-unsigned int len);
+unsigned int len_written);
 void virtqueue_flush(VirtQueue *vq, unsigned int count);
 void virtqueue_fill(VirtQueue *vq, const VirtQueueElement *elem,
-unsigned int len, unsigned int idx);
+unsigned int len_written, unsigned int idx);
 
 void virtqueue_map_sg(struct iovec *sg, hwaddr *addr,
 size_t num_sg, int is_write);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: virtio balloon: do not call blocking ops when !TASK_RUNNING

2015-03-04 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Mon, Mar 02, 2015 at 10:37:26AM +1030, Rusty Russell wrote:
 Thomas Huth th...@linux.vnet.ibm.com writes:
  On Thu, 26 Feb 2015 11:50:42 +1030
  Rusty Russell ru...@rustcorp.com.au wrote:
 
  Thomas Huth th...@linux.vnet.ibm.com writes:
Hi all,
  
   with the recent kernel 3.19, I get a kernel warning when I start my
   KVM guest on s390 with virtio balloon enabled:
  
  The deeper problem is that virtio_ccw_get_config just silently fails on
  OOM.
  
  Neither get_config nor set_config are expected to fail.
 
  AFAIK this is currently not a problem. According to
  http://lwn.net/Articles/627419/ these kmalloc calls never
  fail because they allocate less than a page.
 
 I strongly suggest you unlearn that fact.
 The fix for this is in two parts:
 
 1) Annotate using sched_annotate_sleep() and add a comment: we may spin
a few times in low memory situations, but this isn't a high
performance path.
 
 2) Handle get_config (and other) failure in some more elegant way.
 
 Cheers,
 Rusty.

 I agree, but I'd like to point out that even without kmalloc,
 on s390 get_config is blocking - it's waiting
 for a hardware interrupt.

 And it makes sense: config is not data path, I don't think
 we should spin there.

 So I think besides these two parts, we still need my two patches:
 virtio-balloon: do not call blocking ops when !TASK_RUNNING

I prefer to annotate, over trying to fix this.

Because it's not important.  We might spin a few times, but it's very
unlikely, and it's certainly not performance critical.

Thanks,
Rusty.

Subject: virtio_balloon: annotate possible sleep waiting for event.

CCW (s390) does this.

Reported-by: Thomas Huth th...@linux.vnet.ibm.com
Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0413157f3b49..3f4d5acdbde0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -340,6 +340,15 @@ static int balloon(void *_vballoon)
s64 diff;
 
try_to_freeze();
+
+   /*
+* Reading the config on the ccw backend involves an
+* allocation, so we may actually sleep and have an
+* extra iteration.  It's extremely unlikely, and this
+* isn't a fast path in any sense.
+*/
+   sched_annotate_sleep();
+
wait_event_interruptible(vb-config_change,
 (diff = towards_target(vb)) != 0
 || vb-need_stats_update
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio balloon: do not call blocking ops when !TASK_RUNNING

2015-03-02 Thread Rusty Russell
Thomas Huth th...@linux.vnet.ibm.com writes:
 On Thu, 26 Feb 2015 11:50:42 +1030
 Rusty Russell ru...@rustcorp.com.au wrote:

 Thomas Huth th...@linux.vnet.ibm.com writes:
   Hi all,
 
  with the recent kernel 3.19, I get a kernel warning when I start my
  KVM guest on s390 with virtio balloon enabled:
 
 The deeper problem is that virtio_ccw_get_config just silently fails on
 OOM.
 
 Neither get_config nor set_config are expected to fail.

 AFAIK this is currently not a problem. According to
 http://lwn.net/Articles/627419/ these kmalloc calls never
 fail because they allocate less than a page.

I strongly suggest you unlearn that fact.

The fix for this is in two parts:

1) Annotate using sched_annotate_sleep() and add a comment: we may spin
   a few times in low memory situations, but this isn't a high
   performance path.

2) Handle get_config (and other) failure in some more elegant way.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio balloon: do not call blocking ops when !TASK_RUNNING

2015-02-25 Thread Rusty Russell
Thomas Huth th...@linux.vnet.ibm.com writes:
  Hi all,

 with the recent kernel 3.19, I get a kernel warning when I start my
 KVM guest on s390 with virtio balloon enabled:

The deeper problem is that virtio_ccw_get_config just silently fails on
OOM.

Neither get_config nor set_config are expected to fail.

Cornelia, I think ccw and config_area should be allocated inside vcdev.
You could either use pointers, or simply allocate vcdev with GDP_DMA.

This would avoid the kmalloc inside these calls.

Thanks,
Rusty.


 [0.839687] do not call blocking ops when !TASK_RUNNING; state=1 set at
[00174a1e] prepare_to_wait_event+0x7e/0x108
 [0.839694] [ cut here ]
 [0.839697] WARNING: at kernel/sched/core.c:7326
 [0.839698] Modules linked in:
 [0.839702] CPU: 0 PID: 46 Comm: vballoon Not tainted 3.19.0 #233
 [0.839705] task: 021d ti: 021d8000 task.ti: 
 021d8000
 [0.839707] Krnl PSW : 0704c0018000 0015bf8e 
 (__might_sleep+0x8e/0x98)
 [0.839713]R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 
 EA:3
 Krnl GPRS: 000d 021d 0071 0001
 [0.839718]00675ace 01998c50  
 
 [0.839720]00982134 0058f824 00a008a8 
 
 [0.839722]04d9 007ea992 0015bf8a 
 021dbc28
 [0.839731] Krnl Code: 0015bf7e: c0200033e838  larl
 %r2,7d8fee
0015bf84: c0e50028cd62 brasl   %r14,675a48
   #0015bf8a: a7f40001 brc 15,15bf8c
   0015bf8e: 9201a000 mvi 0(%r10),1
0015bf92: a7f4ffe2 brc 15,15bf56
0015bf96: 0707 bcr 0,%r7
0015bf98: ebdff0800024 stmg%r13,%r15,128(%r15)
0015bf9e: a7f13fe0 tmll%r15,16352
 [0.839749] Call Trace:
 [0.839751] ([0015bf8a] __might_sleep+0x8a/0x98)
 [0.839756]  [0028a562] __kmalloc+0x272/0x350
 [0.839759]  [0058f824] virtio_ccw_get_config+0x3c/0x100
 [0.839762]  [0049fcb0] balloon+0x1b8/0x330
 [0.839765]  [001529c8] kthread+0x120/0x138
 [0.839767]  [00683c22] kernel_thread_starter+0x6/0xc
 [0.839770]  [00683c1c] kernel_thread_starter+0x0/0xc
 [0.839772] no locks held by vballoon/46.
 [0.839773] Last Breaking-Event-Address:
 [0.839776]  [0015bf8a] __might_sleep+0x8a/0x98
 [0.839778] ---[ end trace d27fcdfa27273d7c ]---

 The problem seems to be this code in balloon() in
 drivers/virtio/virtio_balloon.c:

   wait_event_interruptible(vb-config_change,
(diff = towards_target(vb)) != 0
|| vb-need_stats_update
|| kthread_should_stop()
|| freezing(current));

 wait_event_interruptible() sets the state of the current task to
 TASK_INTERRUPTIBLE, then checks the condition. The condition contains
 towards_target() which reads the virtio config space via virtio_cread().
 On s390, this then triggers virtio_ccw_get_config() - and this function
 calls some other functions again that might sleep (e.g. kzalloc or
 wait_event in ccw_io_helper) ... and this causes the new kernel warning
 message with kernel 3.19.

 I think it would be quite difficult or at least ugly to rewrite
 virtio_ccw_get_config() so that it does not call sleepable functions
 anymore. So would it be feasible to rewrite the balloon() function that
 it does not call the towards_target() in its wait_event condition
 anymore? I am unfortunately not that familiar with the balloon code
 semantics, so any help is very appreciated here!

  Thanks,
   Thomas
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Qemu and virtio 1.0

2015-02-24 Thread Rusty Russell
OK, I am trying to experiment with virtio 1.0 support using the
latest kernel and MST's qemu tree:

https://git.kernel.org/cgit/virt/kvm/mst/qemu.git/?h=virtio-1.0

The first issue is that the device config endian was wrong (see
attached patch).

I'm now setting up a BE guest on my x86 laptop, and a BE and LE guest
on a BE powerpc machine, to check that all combinations work correctly.
If others test too, that would be appreciated!

Cheers,
Rusty.

From 95ac91554ed602f856a2a5fcc25eaffcad1b1c8d Mon Sep 17 00:00:00 2001
From: Rusty Russell ru...@rustcorp.com.au
Date: Tue, 24 Feb 2015 14:47:44 +1030
Subject: [PATCH] virtio_config_write*/virtio_config_read*: Don't endian swap
 for virtio 1.0.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 079944c..882a31b 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -662,7 +662,12 @@ uint32_t virtio_config_readw(VirtIODevice *vdev, uint32_t 
addr)
 
 k-get_config(vdev, vdev-config);
 
-val = lduw_p(vdev-config + addr);
+/* Virtio 1.0 is always LE */
+if (virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+val = lduw_le_p(vdev-config + addr);
+} else {
+val = lduw_p(vdev-config + addr);
+}
 return val;
 }
 
@@ -677,7 +682,12 @@ uint32_t virtio_config_readl(VirtIODevice *vdev, uint32_t 
addr)
 
 k-get_config(vdev, vdev-config);
 
-val = ldl_p(vdev-config + addr);
+/* Virtio 1.0 is always LE */
+if (virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+val = ldl_le_p(vdev-config + addr);
+} else {
+val = ldl_p(vdev-config + addr);
+}
 return val;
 }
 
@@ -706,7 +716,12 @@ void virtio_config_writew(VirtIODevice *vdev, uint32_t 
addr, uint32_t data)
 return;
 }
 
-stw_p(vdev-config + addr, val);
+/* Virtio 1.0 is always LE */
+if (virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+stw_le_p(vdev-config + addr, val);
+} else {
+stw_p(vdev-config + addr, val);
+}
 
 if (k-set_config) {
 k-set_config(vdev, vdev-config);
@@ -722,7 +737,12 @@ void virtio_config_writel(VirtIODevice *vdev, uint32_t 
addr, uint32_t data)
 return;
 }
 
-stl_p(vdev-config + addr, val);
+/* Virtio 1.0 is always LE */
+if (virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+stl_le_p(vdev-config + addr, val);
+} else {
+stl_p(vdev-config + addr, val);
+}
 
 if (k-set_config) {
 k-set_config(vdev, vdev-config);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PULL] virtio-next

2015-02-17 Thread Rusty Russell
The following changes since commit b97f880c8342fd6e49a02c9ef7507a678722b2b3:

  Merge branch 'for-3.19-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata (2015-01-21 07:54:16 
+1200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux.git 
tags/virtio-next-for-linus

for you to fetch changes up to 5b40a7daf51812b35cf05d1601a779a7043f8414:

  virtio: don't set VIRTIO_CONFIG_S_DRIVER_OK twice. (2015-02-17 16:19:29 +1030)


OK, this has the big virtio 1.0 implementation, as specified by OASIS.

On top of tht is the major rework of lguest, to use PCI and virtio 1.0, to
double-check the implementation.

Then comes the inevitable fixes and cleanups from that work.

Thanks,
Rusty.


Luis R. Rodriguez (1):
  virtual: Documentation: simplify and generalize paravirt_ops.txt

Michael S. Tsirkin (23):
  virtio_pci: drop virtio_config dependency
  virtio/9p: verify device has config space
  virtio/blk: verify device has config space
  virtio/console: verify device has config space
  virtio/net: verify device has config space
  virtio/scsi: verify device has config space
  virtio/balloon: verify device has config space
  mn10300: drop dead code
  pci: add pci_iomap_range
  s390: add pci_iomap_range
  virtio_pci: move probe/remove code to common
  virtio_pci: modern driver
  virtio_pci_modern: reduce number of mappings
  virtio_pci_modern: support devices with no config
  virtio_balloon: coding style fixes
  virtio_blk: coding style fixes
  virtio_ring: coding style fix
  virtio_rng: drop extra empty line
  virtio_pci: Kconfig grammar fix
  virtio_pci: drop Kconfig warnings
  virtio_pci: add an option to disable legacy driver
  virtio_pci: add module param to force legacy mode
  virtio_pci_modern: drop an unused function

Pawel Moll (1):
  virtio-mmio: Update the device to OASIS spec version

Rusty Russell (53):
  virtio-pci: define layout for virtio 1.0
  virtio_pci: macros for PCI layout offsets
  virtio: define VIRTIO_PCI_CAP_PCI_CFG in header.
  virtio: Don't expose legacy block features when VIRTIO_BLK_NO_LEGACY 
defined.
  virtio: Don't expose legacy config features when VIRTIO_CONFIG_NO_LEGACY 
defined.
  virtio_pci: use 16-bit accessor for queue_enable.
  virtio: don't require a config space on the console device.
  lguest: have --rng read from /dev/urandom not /dev/random.
  lguest: add operations to get/set a register from the Launcher.
  lguest: write more information to userspace about pending traps.
  lguest: add infrastructure for userspace to deliver a trap to the guest.
  lguest: add infrastructure to check mappings.
  lguest: send trap 13 through to userspace.
  lguest: suppress PS/2 keyboard polling.
  lguest: don't disable iospace.
  lguest: add iomem region, where guest page faults get sent to userspace.
  lguest: disable ACPI explicitly.
  lguest: Override pcibios_enable_irq/pcibios_disable_irq to our stupid PIC
  lguest: add MMIO region allocator in example launcher.
  lguest: decode mmio accesses for PCI in example launcher.
  lguest: add PCI config space emulation to example launcher.
  lguest: implement virtio-PCI MMIO accesses.
  lguest: fix failure to find linux/virtio_types.h
  lguest: add a dummy PCI host bridge.
  lguest: Convert block device to virtio 1.0 PCI.
  lguest: Convert net device to virtio 1.0 PCI.
  lguest: Convert entropy device to virtio 1.0 PCI.
  lguest: Convert console device to virtio 1.0 PCI.
  lguest: define VIRTIO_CONFIG_NO_LEGACY in example launcher.
  lguest: remove support for lguest bus.
  lguest: remove support for lguest bus in demonstration launcher.
  lguest: remove lguest bus definitions from header.
  lguest: support emerg_wr in console device in example launcher.
  lguest: support backdoor window.
  lguest: always put console in PCI slot #1.
  lguest: use the PCI console device's emerg_wr for early boot messages.
  lguest: remove NOTIFY facility from demonstration launcher.
  lguest: remove NOTIFY call and eventfd facility.
  tools/lguest: handle device reset correctly in example launcher.
  tools/lguest: fix features_accepted logic in example launcher.
  tools/lguest: rename virtio_pci_cfg_cap field to match spec.
  tools/lguest: insert device references from the 1.0 spec (4.1 Virtio Over 
PCI)
  tools/lguest: insert driver references from the 1.0 spec (4.1 Virtio Over 
PCI)
  tools/lguest: handle indirect partway through chain.
  tools/lguest: don't start devices until DRIVER_OK status set.
  lguest: don't look in console features to find emerg_wr.
  tools/lguest: more documentation

Re: [PATCH] virtual: Documentation: simplify and generalize paravirt_ops.txt

2015-02-11 Thread Rusty Russell
Luis R. Rodriguez mcg...@do-not-panic.com writes:
 From: Luis R. Rodriguez mcg...@suse.com

 The general documentation we have for pv_ops is currenty present
 on the IA64 docs, but since this documentation covers IA64 xen
 enablement and IA64 Xen support got ripped out a while ago
 through commit d52eefb47 present since v3.14-rc1 lets just
 simplify, generalize and move the pv_ops documentation to a
 shared place.

OK, I've applied this.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 4/6] hw_random: fix unregister race.

2014-11-11 Thread Rusty Russell
Amos Kong ak...@redhat.com writes:
 From: Rusty Russell ru...@rustcorp.com.au

 The previous patch added one potential problem: we can still be
 reading from a hwrng when it's unregistered.  Add a wait for zero
 in the hwrng_unregister path.

 v4: add cleanup_done flag to insure that cleanup is done

That's a bit weird.  The usual pattern would be to hold a reference
until we're actually finished, but this reference is a bit weird.

We hold the mutex across cleanup, so we could grab that but we have to
take care sleeping inside wait_event, otherwise Peter will have to fix
my code again :)

AFAICT the wake_woken() stuff isn't merged yet, so your patch will
have to do for now.

 @@ -98,6 +99,8 @@ static inline void cleanup_rng(struct kref *kref)
  
   if (rng-cleanup)
   rng-cleanup(rng);
 + rng-cleanup_done = true;
 + wake_up_all(rng_done);
  }
  
  static void set_current_rng(struct hwrng *rng)
 @@ -536,6 +539,11 @@ void hwrng_unregister(struct hwrng *rng)
   kthread_stop(hwrng_fill);
   } else
   mutex_unlock(rng_mutex);
 +
 + /* Just in case rng is reading right now, wait. */
 + wait_event(rng_done, rng-cleanup_done 
 +atomic_read(rng-ref.refcount) == 0);
 +

The atomic_read() isn't necessary here.

However, you should probably init cleanup_done in hwrng_register().
(Probably noone does unregister then register, but let's be clear).

Thanks,
Rusty.

  }
  EXPORT_SYMBOL_GPL(hwrng_unregister);
  
 diff --git a/include/linux/hw_random.h b/include/linux/hw_random.h
 index c212e71..7832e50 100644
 --- a/include/linux/hw_random.h
 +++ b/include/linux/hw_random.h
 @@ -46,6 +46,7 @@ struct hwrng {
   /* internal. */
   struct list_head list;
   struct kref ref;
 + bool cleanup_done;
  };
  
  /** Register a new Hardware Random Number Generator driver. */
 -- 
 1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 3/6] hw_random: use reference counts on each struct hwrng.

2014-11-11 Thread Rusty Russell
Amos Kong ak...@redhat.com writes:
 From: Rusty Russell ru...@rustcorp.com.au

 current_rng holds one reference, and we bump it every time we want
 to do a read from it.

 This means we only hold the rng_mutex to grab or drop a reference,
 so accessing /sys/devices/virtual/misc/hw_random/rng_current doesn't
 block on read of /dev/hwrng.

 Using a kref is overkill (we're always under the rng_mutex), but
 a standard pattern.

 This also solves the problem that the hwrng_fillfn thread was
 accessing current_rng without a lock, which could change (eg. to NULL)
 underneath it.

 v4: decrease last reference for triggering the cleanup

This doesn't make any sense:

 +static void drop_current_rng(void)
 +{
 + struct hwrng *rng = current_rng;
 +
 + BUG_ON(!mutex_is_locked(rng_mutex));
 + if (!current_rng)
 + return;
 +
 + /* release current_rng reference */
 + kref_put(current_rng-ref, cleanup_rng);
 + current_rng = NULL;
 +
 + /* decrease last reference for triggering the cleanup */
 + kref_put(rng-ref, cleanup_rng);
 +}

Why would it drop the refcount twice?  This doesn't make sense.

Hmm, because you added kref_init, which initializes the reference count
to 1, you created this bug.

Leave out the kref_init, and let it naturally be 0 (until, and if, it
becomes current_rng).  Add a comment if you want.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/6] hw_random: fix unregister race.

2014-10-30 Thread Rusty Russell
Herbert Xu herb...@gondor.apana.org.au writes:
 On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote:
 From: Rusty Russell ru...@rustcorp.com.au
 
 The previous patch added one potential problem: we can still be
 reading from a hwrng when it's unregistered.  Add a wait for zero
 in the hwrng_unregister path.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au

 You totally corrupted Rusty's patch.  If you're going to repost
 his series you better make sure that you've got the right patches.

 Just as well though as it made me think a little more about this
 patch :)

OK Amos, can you please repost the complete series?

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] hw_random: use reference counts on each struct hwrng.

2014-10-19 Thread Rusty Russell
Amos Kong ak...@redhat.com writes:
 We got a warning in boot stage when above set_current_rng() is executed,
 it can be fixed by init rng-ref in hwrng_init().


 @@ -166,6 +169,8 @@ static inline int hwrng_init(struct hwrng *rng)
 if (current_quality  0  !hwrng_fill)
 start_khwrngd();
  
 +   kref_init(rng-ref);
 +
 return 0;
  }

OK, I folded this fix on.

Thanks,
Rusty.

hw_random: use reference counts on each struct hwrng.

current_rng holds one reference, and we bump it every time we want
to do a read from it.

This means we only hold the rng_mutex to grab or drop a reference,
so accessing /sys/devices/virtual/misc/hw_random/rng_current doesn't
block on read of /dev/hwrng.

Using a kref is overkill (we're always under the rng_mutex), but
a standard pattern.

This also solves the problem that the hwrng_fillfn thread was
accessing current_rng without a lock, which could change (eg. to NULL)
underneath it.

v3: initialize kref (thanks Amos Kong)
v2: fix missing put_rng() on exit path (thanks Amos Kong)
Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index a0905c818e13..0ecac38da954 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -42,6 +42,7 @@
 #include linux/delay.h
 #include linux/slab.h
 #include linux/random.h
+#include linux/err.h
 #include asm/uaccess.h
 
 
@@ -91,6 +92,59 @@ static void add_early_randomness(struct hwrng *rng)
add_device_randomness(bytes, bytes_read);
 }
 
+static inline void cleanup_rng(struct kref *kref)
+{
+   struct hwrng *rng = container_of(kref, struct hwrng, ref);
+
+   if (rng-cleanup)
+   rng-cleanup(rng);
+}
+
+static void set_current_rng(struct hwrng *rng)
+{
+   BUG_ON(!mutex_is_locked(rng_mutex));
+   kref_get(rng-ref);
+   current_rng = rng;
+}
+
+static void drop_current_rng(void)
+{
+   BUG_ON(!mutex_is_locked(rng_mutex));
+   if (!current_rng)
+   return;
+
+   kref_put(current_rng-ref, cleanup_rng);
+   current_rng = NULL;
+}
+
+/* Returns ERR_PTR(), NULL or refcounted hwrng */
+static struct hwrng *get_current_rng(void)
+{
+   struct hwrng *rng;
+
+   if (mutex_lock_interruptible(rng_mutex))
+   return ERR_PTR(-ERESTARTSYS);
+
+   rng = current_rng;
+   if (rng)
+   kref_get(rng-ref);
+
+   mutex_unlock(rng_mutex);
+   return rng;
+}
+
+static void put_rng(struct hwrng *rng)
+{
+   /*
+* Hold rng_mutex here so we serialize in case they set_current_rng
+* on rng again immediately.
+*/
+   mutex_lock(rng_mutex);
+   if (rng)
+   kref_put(rng-ref, cleanup_rng);
+   mutex_unlock(rng_mutex);
+}
+
 static inline int hwrng_init(struct hwrng *rng)
 {
if (rng-init) {
@@ -110,13 +164,9 @@ static inline int hwrng_init(struct hwrng *rng)
if (current_quality  0  !hwrng_fill)
start_khwrngd();
 
-   return 0;
-}
+   kref_init(rng-ref);
 
-static inline void hwrng_cleanup(struct hwrng *rng)
-{
-   if (rng  rng-cleanup)
-   rng-cleanup(rng);
+   return 0;
 }
 
 static int rng_dev_open(struct inode *inode, struct file *filp)
@@ -154,21 +204,22 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
ssize_t ret = 0;
int err = 0;
int bytes_read, len;
+   struct hwrng *rng;
 
while (size) {
-   if (mutex_lock_interruptible(rng_mutex)) {
-   err = -ERESTARTSYS;
+   rng = get_current_rng();
+   if (IS_ERR(rng)) {
+   err = PTR_ERR(rng);
goto out;
}
-
-   if (!current_rng) {
+   if (!rng) {
err = -ENODEV;
-   goto out_unlock;
+   goto out;
}
 
mutex_lock(reading_mutex);
if (!data_avail) {
-   bytes_read = rng_get_data(current_rng, rng_buffer,
+   bytes_read = rng_get_data(rng, rng_buffer,
rng_buffer_size(),
!(filp-f_flags  O_NONBLOCK));
if (bytes_read  0) {
@@ -200,8 +251,8 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
ret += len;
}
 
-   mutex_unlock(rng_mutex);
mutex_unlock(reading_mutex);
+   put_rng(rng);
 
if (need_resched())
schedule_timeout_interruptible(1);
@@ -213,12 +264,11 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
}
 out:
return ret ? : err;
-out_unlock:
-   mutex_unlock(rng_mutex);
-   goto out;
+
 out_unlock_reading:
mutex_unlock(reading_mutex);
-   goto out_unlock

Re: [PATCH v2 3/6] hw_random: use reference counts on each struct hwrng.

2014-10-19 Thread Rusty Russell
Amos Kong ak...@redhat.com writes:
 From: Rusty Russell ru...@rustcorp.com.au

 current_rng holds one reference, and we bump it every time we want
 to do a read from it.

 This means we only hold the rng_mutex to grab or drop a reference,
 so accessing /sys/devices/virtual/misc/hw_random/rng_current doesn't
 block on read of /dev/hwrng.

 Using a kref is overkill (we're always under the rng_mutex), but
 a standard pattern.

 This also solves the problem that the hwrng_fillfn thread was
 accessing current_rng without a lock, which could change (eg. to NULL)
 underneath it.

 V2: reduce reference count in signal_pending path

OK, I changed it to do the put_rng before the check, so instead of:

 @@ -208,17 +256,19 @@ static ssize_t rng_dev_read(struct file *filp, char 
 __user *buf,
  
   if (signal_pending(current)) {
   err = -ERESTARTSYS;
 + put_rng(rng);
   goto out;
   }
 +
 + put_rng(rng);
   }
  out:
   return ret ? : err;
 -out_unlock:
 - mutex_unlock(rng_mutex);
 - goto out;
 +
  out_unlock_reading:
   mutex_unlock(reading_mutex);
 - goto out_unlock;
 + put_rng(rng);
 + goto out;
  }

We have:


mutex_unlock(reading_mutex);
put_rng(rng);

if (need_resched())
schedule_timeout_interruptible(1);

if (signal_pending(current)) {
err = -ERESTARTSYS;
goto out;
}
}
out:
return ret ? : err;

out_unlock_reading:
mutex_unlock(reading_mutex);
put_rng(rng);
goto out;
}


Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next RFC 1/3] virtio: support for urgent descriptors

2014-10-15 Thread Rusty Russell
Jason Wang jasow...@redhat.com writes:
 Below should be useful for some experiments Jason is doing.
 I thought I'd send it out for early review/feedback.

 event idx feature allows us to defer interrupts until
 a specific # of descriptors were used.
 Sometimes it might be useful to get an interrupt after
 a specific descriptor, regardless.
 This adds a descriptor flag for this, and an API
 to create an urgent output descriptor.
 This is still an RFC:
 we'll need a feature bit for drivers to detect this,
 but we've run out of feature bits for virtio 0.X.
 For experimentation purposes, drivers can assume
 this is set, or add a driver-specific feature bit.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com

The new VRING_DESC_F_URGENT bit is theoretically nicer, but for
networking (which tends to take packets in order) couldn't we just set
the event counter to give us a tx interrupt at the packet we want?

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 04/25] virtio: defer config changed notifications

2014-10-14 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 Defer config changed notifications that arrive during
 probe/scan/freeze/restore.

 This will allow drivers to set DRIVER_OK earlier, without worrying about
 racing with config change interrupts.

 This change will also benefit old hypervisors (before 2009)
 that send interrupts without checking DRIVER_OK: previously,
 the callback could race with driver-specific initialization.

 This will also help simplify drivers.

But AFAICT you never *read* dev-config_changed.

You unconditionally trigger a config_changed event in
virtio_config_enable().  That's a bit weird, but probably OK.

How about the following change (on top of your patch).  I
think the renaming is clearer, and note the added if() test in
virtio_config_enable().

If you approve, I'll fold it in.

Cheers,
Rusty.

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 2536701b098b..df598dd8c5c8 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -122,7 +122,7 @@ static void __virtio_config_changed(struct virtio_device 
*dev)
struct virtio_driver *drv = drv_to_virtio(dev-dev.driver);
 
if (!dev-config_enabled)
-   dev-config_changed = true;
+   dev-config_change_pending = true;
else if (drv  drv-config_changed)
drv-config_changed(dev);
 }
@@ -148,8 +148,9 @@ static void virtio_config_enable(struct virtio_device *dev)
 {
spin_lock_irq(dev-config_lock);
dev-config_enabled = true;
-   __virtio_config_changed(dev);
-   dev-config_changed = false;
+   if (dev-config_change_pending)
+   __virtio_config_changed(dev);
+   dev-config_change_pending = false;
spin_unlock_irq(dev-config_lock);
 }
 
@@ -253,7 +254,7 @@ int register_virtio_device(struct virtio_device *dev)
 
spin_lock_init(dev-config_lock);
dev-config_enabled = false;
-   dev-config_changed = false;
+   dev-config_change_pending = false;
 
/* We always start by resetting the device, in case a previous
 * driver messed it up.  This also tests that code path a little. */
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 5636b119dc25..65261a7244fc 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -80,7 +80,7 @@ bool virtqueue_is_broken(struct virtqueue *vq);
  * @index: unique position on the virtio bus
  * @failed: saved value for CONFIG_S_FAILED bit (for restore)
  * @config_enabled: configuration change reporting enabled
- * @config_changed: configuration change reported while disabled
+ * @config_change_pending: configuration change reported while disabled
  * @config_lock: protects configuration change reporting
  * @dev: underlying device.
  * @id: the device type identification (used to match it with a driver).
@@ -94,7 +94,7 @@ struct virtio_device {
int index;
bool failed;
bool config_enabled;
-   bool config_changed;
+   bool config_change_pending;
spinlock_t config_lock;
struct device dev;
struct virtio_device_id id;
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 08/11] virtio_blk: use virtio v1.0 endian

2014-10-13 Thread Rusty Russell
Cornelia Huck cornelia.h...@de.ibm.com writes:
 Note that we care only about the fields still in use for virtio v1.0.

 Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com
 Reviewed-by: David Hildenbrand d...@linux.vnet.ibm.com
 Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com

Hi Cornelia,

These patches all look good; I'm a bit nervous about our testing
missing some conversion, so we'll need qemu patches for PCI so we can
test on other platforms too.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 03/11] virtio: support more feature bits

2014-10-13 Thread Rusty Russell
Cornelia Huck cornelia.h...@de.ibm.com writes:
 With virtio-1, we support more than 32 feature bits. Let's make
 vdev-guest_features depend on the number of supported feature bits,
 allowing us to grow the feature bits automatically.

It's a judgement call, but I would say that simply using uint64_t
will be sufficient for quite a while.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/25] virtio: add API to enable VQs early

2014-10-13 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 virtio spec 0.9.X requires DRIVER_OK to be set before
 VQs are used, but some drivers use VQs before probe
 function returns.
 Since DRIVER_OK is set after probe, this violates the spec.

 Even though under virtio 1.0 transitional devices support this
 behaviour, we want to make it possible for those early callers to become
 spec compliant and eventually support non-transitional devices.

 Add API for drivers to call before using VQs.

 Sets DRIVER_OK internally.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 Reviewed-by: Cornelia Huck cornelia.h...@de.ibm.com

OK, this all looks good, but I think we should do two things:

1) rename virtio_enable_vqs_early() to virtio_device_ready().
2) Add a BUG_ON in the virtio_ring code to make sure device is ready
   before any ops are called.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/3] hw_random: fix stuck in catting hwrng attributes

2014-09-17 Thread Rusty Russell
Amos Kong ak...@redhat.com writes:

 I started a QEMU (non-smp) guest with one virtio-rng device, and read
 random data from /dev/hwrng by dd:

  # dd if=/dev/hwrng of=/dev/null 

 In the same time, if I check hwrng attributes from sysfs by cat:

  # cat /sys/class/misc/hw_random/rng_*

 The cat process always gets stuck with slow backend (5 k/s), if we
 use a quick backend (1.2 M/s), the cat process will cost 1 to 2
 minutes. The stuck doesn't exist for smp guest.

 Reading syscall enters kernel and call rng_dev_read(), it's user
 context. We used need_resched() to check if other tasks need to
 be run, but it almost always return false, and re-hold the mutex
 lock. The attributes accessing process always fails to hold the
 lock, so the cat gets stuck.

 User context doesn't allow other user contexts run on that CPU,
 unless the kernel code sleeps for some reason. This is why the
 need_reshed() always return false here.

 This patch removed need_resched() and always schedule other tasks
 then other tasks can have chance to hold the lock and execute
 protected code.

OK, this is going to be a rant.

Your explanation doesn't make sense at all.  Worse, your solution breaks
the advice of Kernighan  Plaugher: Don't patch bad code - rewrite
it..

But worst of all, this detailed explanation might have convinced me you
understood the problem better than I did, and applied your patch.

I did some tests.  For me, as expected, the process spends its time
inside the virtio rng read function, holding the mutex and thus blocking
sysfs access; it's not a failure of this code at all.

Your schedule_timeout() fix probably just helps by letting the host
refresh entropy, so we spend less time waiting in the read fn.

I will post a series, which unfortunately is only lightly tested, then
I'm going to have some beer to begin my holiday.  That may help me
forget my disappointment at seeing respected fellow developers
monkey-patching random code they don't understand.

Grrr
Rusty.

 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  drivers/char/hw_random/core.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

 diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
 index c591d7e..263a370 100644
 --- a/drivers/char/hw_random/core.c
 +++ b/drivers/char/hw_random/core.c
 @@ -195,8 +195,7 @@ static ssize_t rng_dev_read(struct file *filp, char 
 __user *buf,
  
   mutex_unlock(rng_mutex);
  
 - if (need_resched())
 - schedule_timeout_interruptible(1);
 + schedule_timeout_interruptible(1);
  
   if (signal_pending(current)) {
   err = -ERESTARTSYS;
 -- 
 1.9.3

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] hw_random: use reference counts on each struct hwrng.

2014-09-17 Thread Rusty Russell
current_rng holds one reference, and we bump it every time we want
to do a read from it.

This means we only hold the rng_mutex to grab or drop a reference,
so accessing /sys/devices/virtual/misc/hw_random/rng_current doesn't
block on read of /dev/hwrng.

Using a kref is overkill (we're always under the rng_mutex), but
a standard pattern.

This also solves the problem that the hwrng_fillfn thread was
accessing current_rng without a lock, which could change (eg. to NULL)
underneath it.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/char/hw_random/core.c | 135 --
 include/linux/hw_random.h |   2 +
 2 files changed, 94 insertions(+), 43 deletions(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index b1b6042ad85c..dc9092a1075d 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -42,6 +42,7 @@
 #include linux/delay.h
 #include linux/slab.h
 #include linux/random.h
+#include linux/err.h
 #include asm/uaccess.h
 
 
@@ -91,6 +92,59 @@ static void add_early_randomness(struct hwrng *rng)
add_device_randomness(bytes, bytes_read);
 }
 
+static inline void cleanup_rng(struct kref *kref)
+{
+   struct hwrng *rng = container_of(kref, struct hwrng, ref);
+
+   if (rng-cleanup)
+   rng-cleanup(rng);
+}
+
+static void set_current_rng(struct hwrng *rng)
+{
+   BUG_ON(!mutex_is_locked(rng_mutex));
+   kref_get(rng-ref);
+   current_rng = rng;
+}
+
+static void drop_current_rng(void)
+{
+   BUG_ON(!mutex_is_locked(rng_mutex));
+   if (!current_rng)
+   return;
+
+   kref_put(current_rng-ref, cleanup_rng);
+   current_rng = NULL;
+}
+
+/* Returns ERR_PTR(), NULL or refcounted hwrng */
+static struct hwrng *get_current_rng(void)
+{
+   struct hwrng *rng;
+
+   if (mutex_lock_interruptible(rng_mutex))
+   return ERR_PTR(-ERESTARTSYS);
+
+   rng = current_rng;
+   if (rng)
+   kref_get(rng-ref);
+
+   mutex_unlock(rng_mutex);
+   return rng;
+}
+
+static void put_rng(struct hwrng *rng)
+{
+   /*
+* Hold rng_mutex here so we serialize in case they set_current_rng
+* on rng again immediately.
+*/
+   mutex_lock(rng_mutex);
+   if (rng)
+   kref_put(rng-ref, cleanup_rng);
+   mutex_unlock(rng_mutex);
+}
+
 static inline int hwrng_init(struct hwrng *rng)
 {
if (rng-init) {
@@ -113,12 +167,6 @@ static inline int hwrng_init(struct hwrng *rng)
return 0;
 }
 
-static inline void hwrng_cleanup(struct hwrng *rng)
-{
-   if (rng  rng-cleanup)
-   rng-cleanup(rng);
-}
-
 static int rng_dev_open(struct inode *inode, struct file *filp)
 {
/* enforce read-only access to this chrdev */
@@ -154,21 +202,22 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
ssize_t ret = 0;
int err = 0;
int bytes_read, len;
+   struct hwrng *rng;
 
while (size) {
-   if (mutex_lock_interruptible(rng_mutex)) {
-   err = -ERESTARTSYS;
+   rng = get_current_rng();
+   if (IS_ERR(rng)) {
+   err = PTR_ERR(rng);
goto out;
}
-
-   if (!current_rng) {
+   if (!rng) {
err = -ENODEV;
-   goto out_unlock;
+   goto out;
}
 
mutex_lock(reading_mutex);
if (!data_avail) {
-   bytes_read = rng_get_data(current_rng, rng_buffer,
+   bytes_read = rng_get_data(rng, rng_buffer,
rng_buffer_size(),
!(filp-f_flags  O_NONBLOCK));
if (bytes_read  0) {
@@ -200,7 +249,6 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
ret += len;
}
 
-   mutex_unlock(rng_mutex);
mutex_unlock(reading_mutex);
 
if (need_resched())
@@ -210,15 +258,16 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
err = -ERESTARTSYS;
goto out;
}
+
+   put_rng(rng);
}
 out:
return ret ? : err;
-out_unlock:
-   mutex_unlock(rng_mutex);
-   goto out;
+
 out_unlock_reading:
mutex_unlock(reading_mutex);
-   goto out_unlock;
+   put_rng(rng);
+   goto out;
 }
 
 
@@ -257,8 +306,8 @@ static ssize_t hwrng_attr_current_store(struct device *dev,
err = hwrng_init(rng);
if (err)
break;
-   hwrng_cleanup(current_rng);
-   current_rng = rng;
+   drop_current_rng();
+   set_current_rng

[PATCH 5/5] hw_random: don't init list element we're about to add to list.

2014-09-17 Thread Rusty Russell
Another interesting anti-pattern.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/char/hw_random/core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index 6a34feca6b43..96fa06716e95 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -485,7 +485,6 @@ int hwrng_register(struct hwrng *rng)
goto out_unlock;
}
}
-   INIT_LIST_HEAD(rng-list);
list_add_tail(rng-list, rng_list);
 
if (old_rng  !rng-init) {
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] hw_random: place mutex around read functions and buffers.

2014-09-17 Thread Rusty Russell
There's currently a big lock around everything, and it means that we
can't query sysfs (eg /sys/devices/virtual/misc/hw_random/rng_current)
while the rng is reading.  This is a real problem when the rng is slow,
or blocked (eg. virtio_rng with qemu's default /dev/random backend)

This doesn't help (it leaves the current lock untouched), just adds a
lock to protect the read function and the static buffers, in preparation
for transition.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/char/hw_random/core.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index aa30a25c8d49..b1b6042ad85c 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -53,7 +53,10 @@
 static struct hwrng *current_rng;
 static struct task_struct *hwrng_fill;
 static LIST_HEAD(rng_list);
+/* Protects rng_list and current_rng */
 static DEFINE_MUTEX(rng_mutex);
+/* Protects rng read functions, data_avail, rng_buffer and rng_fillbuf */
+static DEFINE_MUTEX(reading_mutex);
 static int data_avail;
 static u8 *rng_buffer, *rng_fillbuf;
 static unsigned short current_quality;
@@ -81,7 +84,9 @@ static void add_early_randomness(struct hwrng *rng)
unsigned char bytes[16];
int bytes_read;
 
+   mutex_lock(reading_mutex);
bytes_read = rng_get_data(rng, bytes, sizeof(bytes), 1);
+   mutex_unlock(reading_mutex);
if (bytes_read  0)
add_device_randomness(bytes, bytes_read);
 }
@@ -128,6 +133,7 @@ static inline int rng_get_data(struct hwrng *rng, u8 
*buffer, size_t size,
int wait) {
int present;
 
+   BUG_ON(!mutex_is_locked(reading_mutex));
if (rng-read)
return rng-read(rng, (void *)buffer, size, wait);
 
@@ -160,13 +166,14 @@ static ssize_t rng_dev_read(struct file *filp, char 
__user *buf,
goto out_unlock;
}
 
+   mutex_lock(reading_mutex);
if (!data_avail) {
bytes_read = rng_get_data(current_rng, rng_buffer,
rng_buffer_size(),
!(filp-f_flags  O_NONBLOCK));
if (bytes_read  0) {
err = bytes_read;
-   goto out_unlock;
+   goto out_unlock_reading;
}
data_avail = bytes_read;
}
@@ -174,7 +181,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
if (!data_avail) {
if (filp-f_flags  O_NONBLOCK) {
err = -EAGAIN;
-   goto out_unlock;
+   goto out_unlock_reading;
}
} else {
len = data_avail;
@@ -186,7 +193,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
if (copy_to_user(buf + ret, rng_buffer + data_avail,
len)) {
err = -EFAULT;
-   goto out_unlock;
+   goto out_unlock_reading;
}
 
size -= len;
@@ -194,6 +201,7 @@ static ssize_t rng_dev_read(struct file *filp, char __user 
*buf,
}
 
mutex_unlock(rng_mutex);
+   mutex_unlock(reading_mutex);
 
if (need_resched())
schedule_timeout_interruptible(1);
@@ -208,6 +216,9 @@ out:
 out_unlock:
mutex_unlock(rng_mutex);
goto out;
+out_unlock_reading:
+   mutex_unlock(reading_mutex);
+   goto out_unlock;
 }
 
 
@@ -348,13 +359,16 @@ static int hwrng_fillfn(void *unused)
while (!kthread_should_stop()) {
if (!current_rng)
break;
+   mutex_lock(reading_mutex);
rc = rng_get_data(current_rng, rng_fillbuf,
  rng_buffer_size(), 1);
+   mutex_unlock(reading_mutex);
if (rc = 0) {
pr_warn(hwrng: no data available\n);
msleep_interruptible(1);
continue;
}
+   /* Outside lock, sure, but y'know: randomness. */
add_hwgenerator_randomness((void *)rng_fillbuf, rc,
   rc * current_quality * 8  10);
}
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] hw_random: don't double-check old_rng.

2014-09-17 Thread Rusty Russell
Interesting anti-pattern.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/char/hw_random/core.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index b4a21e9521cf..6a34feca6b43 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -472,14 +472,13 @@ int hwrng_register(struct hwrng *rng)
}
 
old_rng = current_rng;
+   err = 0;
if (!old_rng) {
err = hwrng_init(rng);
if (err)
goto out_unlock;
set_current_rng(rng);
-   }
-   err = 0;
-   if (!old_rng) {
+
err = register_miscdev();
if (err) {
drop_current_rng();
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] hw_random: fix unregister race.

2014-09-17 Thread Rusty Russell
The previous patch added one potential problem: we can still be
reading from a hwrng when it's unregistered.  Add a wait for zero
in the hwrng_unregister path.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/char/hw_random/core.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index dc9092a1075d..b4a21e9521cf 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -60,6 +60,7 @@ static DEFINE_MUTEX(rng_mutex);
 static DEFINE_MUTEX(reading_mutex);
 static int data_avail;
 static u8 *rng_buffer, *rng_fillbuf;
+static DECLARE_WAIT_QUEUE_HEAD(rng_done);
 static unsigned short current_quality;
 static unsigned short default_quality; /* = 0; default to off */
 
@@ -98,6 +99,7 @@ static inline void cleanup_rng(struct kref *kref)
 
if (rng-cleanup)
rng-cleanup(rng);
+   wake_up_all(rng_done);
 }
 
 static void set_current_rng(struct hwrng *rng)
@@ -529,6 +531,9 @@ void hwrng_unregister(struct hwrng *rng)
}
 
mutex_unlock(rng_mutex);
+
+   /* Just in case rng is reading right now, wait. */
+   wait_event(rng_done, atomic_read(rng-ref.refcount) == 0);
 }
 EXPORT_SYMBOL_GPL(hwrng_unregister);
 
-- 
1.9.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] virtio-rng: fix stuck in catting hwrng attributes

2014-09-16 Thread Rusty Russell
Amos Kong ak...@redhat.com writes:
 On Sun, Sep 14, 2014 at 01:12:58AM +0800, Amos Kong wrote:
 On Thu, Sep 11, 2014 at 09:08:03PM +0930, Rusty Russell wrote:
  Amos Kong ak...@redhat.com writes:
   When I check hwrng attributes in sysfs, cat process always gets
   stuck if guest has only 1 vcpu and uses a slow rng backend.
  
   Currently we check if there is any tasks waiting to be run on
   current cpu in rng_dev_read() by need_resched(). But need_resched()
   doesn't work because rng_dev_read() is executing in user context.
  
  I don't understand this explanation?  I'd expect the sysfs process to be
  woken by the mutex_unlock().
 
 But actually sysfs process's not woken always, this is they the
 process gets stuck.

 %s/they/why/

 Hi Rusty,


 Reference:
 http://www.linuxgrill.com/anonymous/fire/netfilter/kernel-hacking-HOWTO-2.html

Sure, that was true when I wrote it, and is still true when preempt is
off.

 read() syscall of /dev/hwrng will enter into kernel, the read operation is
 rng_dev_read(), it's userspace context (not interrupt context).

 Userspace context doesn't allow other user contexts run on that CPU,
 unless the kernel code sleeps for some reason.

This is true assuming preempt is off, yes.

 In this case, the need_resched() doesn't work.

This is exactly what need_resched() is for: it should return true if
there's another process of sufficient priority waiting to be run.  It
implies that schedule() would run it.

git blame doesn't offer any enlightenment here, as to why we use
schedule_timeout_interruptible() at all.

I would expect mutex_unlock() to wake the other reader.  The code
certainly seems to, so it should now be runnable and need_resched()
should return true.

I suspect something else is happening which makes this work.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/2] virtio-rng: skip reading when we start to remove the device

2014-09-12 Thread Rusty Russell
Amit Shah amit.s...@redhat.com writes:
 On (Wed) 10 Sep 2014 [14:11:37], Amos Kong wrote:
 Before we really unregister the hwrng device, reading will get stuck if
 the virtio device is reset. We should return error for reading when we
 start to remove the device.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 Cc: sta...@vger.kernel.org

 Reviewed-by: Amit Shah amit.s...@redhat.com

Thanks, applied.

They're sitting in my fixes branch.  If there are no screams from
linux-next, I'll push to Linus Monday.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] virtio-rng: fix stuck in catting hwrng attributes

2014-09-11 Thread Rusty Russell
Amos Kong ak...@redhat.com writes:
 When I check hwrng attributes in sysfs, cat process always gets
 stuck if guest has only 1 vcpu and uses a slow rng backend.

 Currently we check if there is any tasks waiting to be run on
 current cpu in rng_dev_read() by need_resched(). But need_resched()
 doesn't work because rng_dev_read() is executing in user context.

I don't understand this explanation?  I'd expect the sysfs process to be
woken by the mutex_unlock().

If we're really high priority (vs. the sysfs process) then I can see why
we'd need schedule_timeout_interruptible() instead of just schedule(),
and in that case, need_resched() would be false too.

You could argue that's intended behaviour, but I can't see how it
happens in the normal case anyway.

What am I missing?

Thanks,
Rusty.

 This patch removed need_resched() and increase delay to 10 jiffies,
 then other tasks can have chance to execute protected code.
 Delaying 1 jiffy also works, but 10 jiffies is safer.

 Signed-off-by: Amos Kong ak...@redhat.com
 ---
  drivers/char/hw_random/core.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

 diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
 index c591d7e..b5d1b6f 100644
 --- a/drivers/char/hw_random/core.c
 +++ b/drivers/char/hw_random/core.c
 @@ -195,8 +195,7 @@ static ssize_t rng_dev_read(struct file *filp, char 
 __user *buf,
  
   mutex_unlock(rng_mutex);
  
 - if (need_resched())
 - schedule_timeout_interruptible(1);
 + schedule_timeout_interruptible(10);
  
   if (signal_pending(current)) {
   err = -ERESTARTSYS;
 -- 
 1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 1/2] virtio-rng: fix stuck of hot-unplugging busy device

2014-09-11 Thread Rusty Russell
Amit Shah amit.s...@redhat.com writes:
 On (Wed) 10 Sep 2014 [14:11:36], Amos Kong wrote:
 When we try to hot-remove a busy virtio-rng device from QEMU monitor,
 the device can't be hot-removed. Because virtio-rng driver hangs at
 wait_for_completion_killable().
 
 This patch exits the waiting by completing have_data completion before
 unregistering, resets data_avail to avoid the hwrng core use wrong
 buffer bytes.
 
 Signed-off-by: Amos Kong ak...@redhat.com
 Cc: sta...@vger.kernel.org

 Reviewed-by: Amit Shah amit.s...@redhat.com

Thanks, applied.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/2] virtio: support for urgent descriptors

2014-07-08 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 Below should be useful for some experiments Jason is doing.
 I thought I'd send it out for early review/feedback.

 Compiled-only at this point.

It's not a terrible idea, but it will come down to how effective it is
in practice.

I'm tempted to make it v1.0 only though.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Using virtio for inter-VM communication

2014-06-12 Thread Rusty Russell
Jan Kiszka jan.kis...@siemens.com writes:
 On 2014-06-12 04:27, Rusty Russell wrote:
 Henning Schild henning.sch...@siemens.com writes:
 It was also never implemented, and remains a thought experiment.
 However, implementing it in lguest should be fairly easy.

 The reason why a trusted helper, i.e. additional logic in the
 hypervisor, is not our favorite solution is that we'd like to keep the
 hypervisor as small as possible. I wouldn't exclude such an approach
 categorically, but we have to weigh the costs (lines of code, additional
 hypervisor interface) carefully against the gain (existing
 specifications and guest driver infrastructure).

Reasonable, but I think you'll find it is about the minimal
implementation in practice.  Unfortunately, I don't have time during the
next 6 months to implement it myself :(

 Back to VIRTIO_F_RING_SHMEM_ADDR (which you once brought up in an MCA
 working group discussion): What speaks against introducing an
 alternative encoding of addresses inside virtio data structures? The
 idea of this flag was to replace guest-physical addresses with offsets
 into a shared memory region associated with or part of a virtio
 device.

We would also need a way of defining the shared memory region.  But
that's not the problem.  If such a feature is not accepted by the guest?
How to you fall back?

We don't add features which unmake the standard.

 That would preserve zero-copy capabilities (as long as you can work
 against the shared mem directly, e.g. doing DMA from a physical NIC or
 storage device into it) and keep the hypervisor out of the loop.

This seems ill thought out.  How will you program a NIC via the virtio
protocol without a hypervisor?  And how will you make it safe?  You'll
need an IOMMU.  But if you have an IOMMU you don't need shared memory.

 Is it
 too invasive to existing infrastructure or does it have some other pitfalls?

You'll have to convince every vendor to implement your addition to the
standard.  Which is easier than inventing a completely new system, but
it's not quite virtio.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Using virtio for inter-VM communication

2014-06-11 Thread Rusty Russell
Henning Schild henning.sch...@siemens.com writes:
 Hi,

 i am working on the jailhouse[1] project and am currently looking at
 inter-VM communication. We want to connect guests directly with virtual
 consoles based on shared memory. The code complexity in the hypervisor
 should be minimal, it should just make the shared memory discoverable
 and provide a signaling mechanism.

Hi Henning,

The virtio assumption was that the host can see all of guest
memory.  This simplifies things significantly, and makes it efficient.

If you don't have this, *someone* needs to do a copy.  Usually the guest
OS does a bounce buffer into your shared region.  Goodbye performance.
Or you can play remapping tricks.  Goodbye performance again.

My preferred model is to have a trusted helper (ie. host) which
understands how to copy between virtio rings.  The backend guest (to
steal Xen vocab) R/O maps the descriptor, avail ring and used rings in
the guest.  It then asks the trusted helper to do various operation
(copy into writable descriptor, copy out of readable descriptor, mark
used).  The virtio ring itself acts as a grant table.

Note: that helper mechanism is completely protocol agnostic.  It was
also explicitly designed into the virtio mechanism (with its 4k
boundaries for data structures and its 'len' field to indicate how much
was written into the descriptor). 

It was also never implemented, and remains a thought experiment.
However, implementing it in lguest should be fairly easy.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio specification: OOO completion of the xmit buffers in the networking device

2014-06-03 Thread Rusty Russell
Vlad Zolotarov vl...@cloudius-systems.com writes:
 Rusty, hi!
 I'd like to ask for a small clarification about the virtio spec.
 The virtio specification is put the way that it allows the out-of-order 
 completions in general. Although it states that it's mostly relevant to 
 the virtio-blk it doesn't explicitly bans it for virtio networking 
 devices (for generality?).

Indeed.

 (AFAICS the current virtio HV implementations make sure there is no OOO 
 Tx buffers completions).

 I wonder if this feature (OOO Tx buffers completions) seems practical to 
 u and if not can I hope that it could be clearly stated in the future 
 virtio specification releases that it's forbidden?

Are you sure that vhost-net doesn't allow async completion already?
I can't immediately see why it can't happen, but MST is CC'd.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 9/9] kvmtool: virtio: enable arm/arm64 support for bi-endianness

2014-05-08 Thread Rusty Russell
Alexander Graf ag...@suse.de writes:
 On 05/07/2014 11:57 AM, Marc Zyngier wrote:
 How virtio implementations should determine their endianness is
 a spec question, I think; at any rate QEMU and kvmtool ought to
 agree on how it's done. I think the most recent suggestion on the
 QEMU mailing list (for PPC) is that we should care about the
 guest kernel endianness, but I don't know if anybody thought of
 the pass-through-to-userspace usecase...
 Current opinion on the qemu-devel thread seems to be that we
 should just define that the endianness of the virtio device is
 the endianness of the guest kernel at the point where the guest
 triggers a reset of the virtio device by writing zero the QueuePFN
 or Status registers.
 Virtio by design has full access to guest physical memory. It doesn't
 route DMA via PCI. So user space drivers simply don't make sense here.
 Huh? What if my guest has usespace using an idmap, with Stage-1 MMU for
 isolation only (much like an MPU)? R-class guests anyone?

 Agreed, this is not the general use case, but that doesn't seem to be
 completely unrealistic either.

 Yes, and once that user tries the same without idmap virtio ends up 
 overwriting random memory. It's just not a good idea and I'd much rather 
 see us solve this properly with virtio 1.0 really.

Slightly orthogonal: virtio 1.0 is LE, so endianness is solved.

 Of course alternatively we can continue bikeshedding about this until 
 everything becomes moot because we switched to virtio 1.0 ;).

Transition will be long...

 Rusty / Michael, virtio 1.0 does go via normal DMA channels, right?

No.  We argued about this; it's more PCI-like to do, but there's a
performance cost and it's really unclear that passing through a virtio
PCI device to a (sub)guest is a scenario worth supporting.

Maybe someone will come up with a convincing reason, and we'll add
a feature bit in a future revision...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio: make udp more efficient by avoiding indirect desc

2014-02-12 Thread Rusty Russell
Qin Chuanyu qinchua...@huawei.com writes:
 udp packet use 2 buffers at least, one for vnet_hdr and
 one for skb-data.
 we could change the threshold from 2 to 3, so the udp packet
 which data buff only using single desc will gain from this.
 the guest would avoid from allocating memory dynamically.
 the host would avoid from translating indirect desc.

 Signed-off-by: Chuanyu Qin qinchua...@huawei.com

We should really adapt the threshold depending on how full the ring is.
If it's gotten cramped recently, prefer indirect.  See linux/average.h.

Such a change would have to be benchmarked carefully though!
Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue

2014-01-19 Thread Rusty Russell
Luiz Capitulino lcapitul...@redhat.com writes:
 On Fri, 17 Jan 2014 09:10:47 +1030
 Rusty Russell ru...@rustcorp.com.au wrote:

 Luiz Capitulino lcapitul...@redhat.com writes:
  From: Luiz capitulino lcapitul...@redhat.com
 
  This commit adds support to a new virtqueue called message virtqueue.
 
 OK, this needs a lot of thought (especially since reworking the virtio
 balloon is on the TODO list for the OASIS virtio technical committee...)
 
 But AFAICT this is a way of explicitly saying no to the host's target
 (vs the implicit method of just not meeting the target).  I'm not sure
 that gives enough information to the host.  On the other hand, I'm not
 sure what information we *can* give.
 
 Should we instead be counter-proposing a target?

 The problem is how to estimate a target value. I found it simpler
 to just try to obey what the host is asking for (and fail if not
 possible) than trying to make the guest negotiate with the host.

Understood, but we already do this the other way where the host tells
the guest how much memory to give up.

And is a guest expected to automatically inflate the balloon even if the
host doesn't want the memory, or wait to be asked?

  What does qemu do with this information?
 
  There are two possible scenarios:
 
   1. The balloon driver is currently inflating when it gets under
  pressure
 
  QEMU resets num_pages to the current balloon size. This
  cancels the on-going inflate
 
   2. The balloon driver is not inflating, eg. it's possibly sleeping
 
 QEMU issues a deflate
 
  But note that those scenarios are not supposed to be used with the
  current device, they are part of the automatic ballooning feature.
  I CC'ed you on the QEMU patch, you can find it here case you didn't
  see it:
 
   http://marc.info/?l=kvmm=138988966315690w=2

Yes, caught that after I replied; I should have checked first.

It seems like we are still figuring out how to do ballooning.  The
best approach in cases like this is to make it simple, so I don't hate
this.

But note that Daniel Kiper and I have been discussing a new virtio
balloon for draft 2 of the standard.  I'll CC you when I post that to
one of the OASIS virtio mailing lists.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/4] virtio_balloon: add pressure notification via a new virtqueue

2014-01-16 Thread Rusty Russell
Luiz Capitulino lcapitul...@redhat.com writes:
 From: Luiz capitulino lcapitul...@redhat.com

 This commit adds support to a new virtqueue called message virtqueue.

OK, this needs a lot of thought (especially since reworking the virtio
balloon is on the TODO list for the OASIS virtio technical committee...)

But AFAICT this is a way of explicitly saying no to the host's target
(vs the implicit method of just not meeting the target).  I'm not sure
that gives enough information to the host.  On the other hand, I'm not
sure what information we *can* give.

Should we instead be counter-proposing a target?

What does qemu do with this information?

Thanks,
Rusty.

 The message virtqueue can be used by guests to notify the host about
 important memory-related state changes in the guest. Currently, the
 only implemented notification is the guest is under pressure one,
 which informs the host that the guest is under memory pressure.

 This notification can be used to implement automatic memory ballooning
 in the host. For example, once learning that the guest is under pressure,
 the host could cancel an on-going inflate and/or start a deflate
 operation.

 Doing this through a virtqueue might seem like overkill, as all
 we're doing is to transfer an integer between guest and host. However,
 using a virtqueue offers the following advantages:

  1. We can realibly synchronize host and guest. That is, the guest
 will only continue once the host acknowledges the notification.
 This is important, because if the guest gets under pressure while
 inflating the balloon, it has to stop to give the host a chance
 to reset num_pages (see next commit)

  2. It's extensible. We may (or may not) want to tell the host
 which pressure level the guest finds itself in (ie. low,
 medium or critical)

 The lightweight alternative is to use a configuration space parameter.
 For this to work though, the guest would have to wait the for host
 to acknowloedge the receipt of a configuration change update. I could
 try this if the virtqueue is too overkill.

 Finally, the guest learns it's under pressure by registering a
 callback with the in-kernel vmpressure API.

 FIXMEs:

  - vmpressure API is missing an de-registration routine
  - not completely sure my changes in virtballoon_probe() are correct

 Signed-off-by: Luiz capitulino lcapitul...@redhat.com
 ---
  drivers/virtio/virtio_balloon.c | 93 
 +
  include/uapi/linux/virtio_balloon.h |  1 +
  2 files changed, 84 insertions(+), 10 deletions(-)

 diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
 index 5c4a95b..1c3ee71 100644
 --- a/drivers/virtio/virtio_balloon.c
 +++ b/drivers/virtio/virtio_balloon.c
 @@ -29,6 +29,9 @@
  #include linux/module.h
  #include linux/balloon_compaction.h
  
 +#include linux/cgroup.h
 +#include linux/vmpressure.h
 +
  /*
   * Balloon device works in 4K page units.  So each page is pointed to by
   * multiple balloon pages.  All memory counters in this driver are in balloon
 @@ -37,10 +40,12 @@
  #define VIRTIO_BALLOON_PAGES_PER_PAGE (unsigned)(PAGE_SIZE  
 VIRTIO_BALLOON_PFN_SHIFT)
  #define VIRTIO_BALLOON_ARRAY_PFNS_MAX 256
  
 +#define VIRTIO_BALLOON_MSG_PRESSURE 1
 +
  struct virtio_balloon
  {
   struct virtio_device *vdev;
 - struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
 + struct virtqueue *inflate_vq, *deflate_vq, *stats_vq, *message_vq;
  
   /* Where the ballooning thread waits for config to change. */
   wait_queue_head_t config_change;
 @@ -51,6 +56,8 @@ struct virtio_balloon
   /* Waiting for host to ack the pages we released. */
   wait_queue_head_t acked;
  
 + wait_queue_head_t message_acked;
 +
   /* Number of balloon pages we've told the Host we're not using. */
   unsigned int num_pages;
   /*
 @@ -71,6 +78,9 @@ struct virtio_balloon
   /* Memory statistics */
   int need_stats_update;
   struct virtio_balloon_stat stats[VIRTIO_BALLOON_S_NR];
 +
 + /* Message virtqueue */
 + atomic_t guest_pressure;
  };
  
  static struct virtio_device_id id_table[] = {
 @@ -78,6 +88,41 @@ static struct virtio_device_id id_table[] = {
   { 0 },
  };
  
 +static inline bool guest_under_pressure(const struct virtio_balloon *vb)
 +{
 + return atomic_read(vb-guest_pressure) == 1;
 +}
 +
 +static void vmpressure_event_handler(void *data, int level)
 +{
 + struct virtio_balloon *vb = data;
 +
 + atomic_set(vb-guest_pressure, 1);
 + wake_up(vb-config_change);
 +}
 +
 +static void tell_host_pressure(struct virtio_balloon *vb)
 +{
 + const uint32_t msg = VIRTIO_BALLOON_MSG_PRESSURE;
 + struct scatterlist sg;
 + unsigned int len;
 + int err;
 +
 + sg_init_one(sg, msg, sizeof(msg));
 +
 + err = virtqueue_add_outbuf(vb-message_vq, sg, 1, vb, GFP_KERNEL);
 + if (err  0) {
 + printk(KERN_WARNING virtio-balloon: failed to send host 
 message 

Re: [PATCH V2] virtio-scsi: Fix hotcpu_notifier use-after-free with virtscsi_freeze

2014-01-14 Thread Rusty Russell
Jason Wang jasow...@redhat.com writes:
 From: Asias He asias.he...@gmail.com

 vqs are freed in virtscsi_freeze but the hotcpu_notifier is not
 unregistered. We will have a use-after-free usage when the notifier
 callback is called after virtscsi_freeze.

 Fixes: 285e71ea6f3583a85e27cb2b9a7d8c35d4c0d558
 (virtio-scsi: reset virtqueue affinity when doing cpu hotplug)

Sorry, I missed this in my earlier sweep (I was only skimming emails
during the break).

Applied,
Rusty.

 Cc: sta...@vger.kernel.org
 Signed-off-by: Asias He asias.he...@gmail.com
 Reviewed-by: Paolo Bonzini pbonz...@redhat.com
 Signed-off-by: Jason Wang jasow...@redhat.com
 ---
 Changes from V1:
 - Add Fixes line
 - CC stable
 ---
  drivers/scsi/virtio_scsi.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

 diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
 index c3173dc..16bfd50 100644
 --- a/drivers/scsi/virtio_scsi.c
 +++ b/drivers/scsi/virtio_scsi.c
 @@ -956,6 +956,10 @@ static void virtscsi_remove(struct virtio_device *vdev)
  #ifdef CONFIG_PM_SLEEP
  static int virtscsi_freeze(struct virtio_device *vdev)
  {
 + struct Scsi_Host *sh = virtio_scsi_host(vdev);
 + struct virtio_scsi *vscsi = shost_priv(sh);
 +
 + unregister_hotcpu_notifier(vscsi-nb);
   virtscsi_remove_vqs(vdev);
   return 0;
  }
 @@ -964,8 +968,17 @@ static int virtscsi_restore(struct virtio_device *vdev)
  {
   struct Scsi_Host *sh = virtio_scsi_host(vdev);
   struct virtio_scsi *vscsi = shost_priv(sh);
 + int err;
 +
 + err = virtscsi_init(vdev, vscsi);
 + if (err)
 + return err;
 +
 + err = register_hotcpu_notifier(vscsi-nb);
 + if (err)
 + vdev-config-del_vqs(vdev);
  
 - return virtscsi_init(vdev, vscsi);
 + return err;
  }
  #endif
  
 -- 
 1.8.3.2
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-scsi: Fix hotcpu_notifier use-after-free with virtscsi_freeze

2013-12-16 Thread Rusty Russell
Jason Wang jasow...@redhat.com writes:
 On 10/28/2013 04:01 PM, Asias He wrote:
 vqs are freed in virtscsi_freeze but the hotcpu_notifier is not
 unregistered. We will have a use-after-free usage when the notifier
 callback is called after virtscsi_freeze.

 Signed-off-by: Asias He as...@redhat.com

Please include a Fixes: line, especially if you want the CC: stable.

Thanks,
Rusty.


 ---
  drivers/scsi/virtio_scsi.c | 15 ++-
  1 file changed, 14 insertions(+), 1 deletion(-)

 diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
 index 74b88ef..b26f1a5 100644
 --- a/drivers/scsi/virtio_scsi.c
 +++ b/drivers/scsi/virtio_scsi.c
 @@ -957,6 +957,10 @@ static void virtscsi_remove(struct virtio_device *vdev)
  #ifdef CONFIG_PM
  static int virtscsi_freeze(struct virtio_device *vdev)
  {
 +struct Scsi_Host *sh = virtio_scsi_host(vdev);
 +struct virtio_scsi *vscsi = shost_priv(sh);
 +
 +unregister_hotcpu_notifier(vscsi-nb);
  virtscsi_remove_vqs(vdev);
  return 0;
  }
 @@ -965,8 +969,17 @@ static int virtscsi_restore(struct virtio_device *vdev)
  {
  struct Scsi_Host *sh = virtio_scsi_host(vdev);
  struct virtio_scsi *vscsi = shost_priv(sh);
 +int err;
 +
 +err = virtscsi_init(vdev, vscsi);
 +if (err)
 +return err;
 +
 +err = register_hotcpu_notifier(vscsi-nb);
 +if (err)
 +vdev-config-del_vqs(vdev);
  
 -return virtscsi_init(vdev, vscsi);
 +return err;
  }
  #endif
  

 Ping. Rusty, could you please review and apply this patch?

 Thanks
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_balloon: update_balloon_size(): update correct field

2013-12-04 Thread Rusty Russell
Luiz Capitulino lcapitul...@redhat.com writes:
 According to the virtio spec, the device configuration field
 that should be updated after an inflation or deflation
 operation is the 'actual' field, not the 'num_pages' one.

 Commit 855e0c5288177bcb193f6f6316952d2490478e1c swapped them
 in update_balloon_size(). Fix it.

 Signed-off-by: Luiz Capitulino lcapitul...@redhat.com

Damn, exactly right.  Good catch.

Applied,
Rusty.

  drivers/virtio/virtio_balloon.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
 index c444654..5c4a95b 100644
 --- a/drivers/virtio/virtio_balloon.c
 +++ b/drivers/virtio/virtio_balloon.c
 @@ -285,7 +285,7 @@ static void update_balloon_size(struct virtio_balloon *vb)
  {
   __le32 actual = cpu_to_le32(vb-num_pages);
  
 - virtio_cwrite(vb-vdev, struct virtio_balloon_config, num_pages,
 + virtio_cwrite(vb-vdev, struct virtio_balloon_config, actual,
 actual);
  }
  
 -- 
 1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] virtio: mmio: fix signature checking for BE guests

2013-11-06 Thread Rusty Russell
Pawel Moll pawel.m...@arm.com writes:
 On Tue, 2013-11-05 at 03:36 +, Rusty Russell wrote:
 This particular can is empty - all worms already escaped :-)

I just thought that if you wait for 1.0, it will always be
little-endian, and if the current qemu only supported little-endian your
life might be simpler inside qemu.  But I'm sure Marc doesn't want to
wait :)

 It's a bugfix, not changing anything for the existing hosts. So - yes
 please, do merge this one.

Done, thanks.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] virtio: mmio: fix signature checking for BE guests

2013-11-04 Thread Rusty Russell
Rusty Russell ru...@rustcorp.com.au writes:
 Pawel Moll pawel.m...@arm.com writes:
 On Fri, 2013-10-11 at 15:36 +0100, Marc Zyngier wrote:
 As virtio-mmio config registers are specified to be little-endian,
 using readl() to read the magic value and then memcmp() to check it
 fails on BE (as readl() has an implicit swab).
 
 Fix it by encoding the magic value as an integer instead of a string.
 
 Cc: Rusty Russell ru...@rustcorp.com.au
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Pawel Moll pawel.m...@arm.com
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  drivers/virtio/virtio_mmio.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
 index 1ba0d68..57f24fd 100644
 --- a/drivers/virtio/virtio_mmio.c
 +++ b/drivers/virtio/virtio_mmio.c
 @@ -470,7 +470,7 @@ static int virtio_mmio_probe(struct platform_device 
 *pdev)
  
 /* Check magic value */
 magic = readl(vm_dev-base + VIRTIO_MMIO_MAGIC_VALUE);
 -   if (memcmp(magic, virt, 4) != 0) {
 +   if (magic != ('v' | 'i'  8 | 'r'  16 | 't'  24)) {
 dev_warn(pdev-dev, Wrong magic value 0x%08lx!\n, magic);
 return -ENODEV;
 }

 The new spec will clarify this:

 * 0x000 | R | MagicValue
   Magic value. Must be 0x74726976 (a Little Endian equivalent
   of a virt string).

 ... but I like the 'v'i'r't' characters still being there :-)

 Acked-by: Pawel Moll pawel.m...@arm.com

 Applied, thanks.

OK, I *said* applied, but left it in my pending queue.

Pawel, do you want to open this can of worms?  If so, I'l merge this
now.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] virtio-mmio: handle BE guests on LE hosts

2013-10-16 Thread Rusty Russell
Marc Zyngier marc.zyng...@arm.com writes:
 On 14/10/13 09:21, Rusty Russell wrote:
 Marc Zyngier marc.zyng...@arm.com writes:
 This small patch series adds just enough kernel infrastructure and
 fixes to allow a BE guest to use virtio-mmio on a LE host, provided
 that the host actually supports such madness.

 This has been tested on arm64, with some fixes to KVM and a set of
 changes to kvmtool, both which I am posting separately.
 
 OK, so I already have a patch which supports config space accessors.
 I've posted the series below, and since you want it, I'll put it in
 virtio-next after more testing and rebasing...

 Yes, that's definitely something I'd like to see being merged, as it
 would allow me to drop a significant chunk of changes.

OK, I've just put them into virtio-next.

 But we won't be using feature bits for endianness, since we're defining
 endian in the 1.0 spec (whether LE-everywhere or LE-for-mmio-and-PCI is
 still debated).  Since we need to guess for backwards compat anyway,
 let's keep doing this until v1.0?

 When is 1.0 going to happen? When will actual implementation of drivers
 and devices show up in my favourite platform emulation?

1.0 won't be finalized until early next year.  We aim to publish the
first draft next month, but noone should finalize implementations until
the feedback process is complete.

 Having a grand plan for the future is great, but I need something
 working right now, or at least fairly soonish... And I need it to be
 backward compatible, as none of the above is going to show up overnight.

Well, mmio BE won't work at all right now due to the signature check
being wrong in Linux.  So there are two choices: (1) fix it, and use
heuristics to figure out if the guest is BE, or (2) say there's no BE
mmio, and wait for 1.0.

I don't know what your timeline is: you might need to chat with Pawel
internally.

BTW, for qemu and PPC (though virtio-pci, not mmio) I look at the guest
interrupt delivery endian bit upon any virtio device reset to guess
endian.  Since Linux guests reset the device before doing anything else,
this works well, and supports crazy stuff like kexec a kernel in the
other endian.

 (Yeah, I made this mess with native endian.  I promise I have learnt
 my lesson).

 Paracetamol bill coming you way... ;-)

Now I just have to figure out where to send my equivalent bill...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] virtio-mmio: handle BE guests on LE hosts

2013-10-14 Thread Rusty Russell
Marc Zyngier marc.zyng...@arm.com writes:
 This small patch series adds just enough kernel infrastructure and
 fixes to allow a BE guest to use virtio-mmio on a LE host, provided
 that the host actually supports such madness.

 This has been tested on arm64, with some fixes to KVM and a set of
 changes to kvmtool, both which I am posting separately.

OK, so I already have a patch which supports config space accessors.
I've posted the series below, and since you want it, I'll put it in
virtio-next after more testing and rebasing...

But we won't be using feature bits for endianness, since we're defining
endian in the 1.0 spec (whether LE-everywhere or LE-for-mmio-and-PCI is
still debated).  Since we need to guess for backwards compat anyway,
let's keep doing this until v1.0?

(Yeah, I made this mess with native endian.  I promise I have learnt
my lesson).

Thanks,
Rusty.

Subject: virtio_config: introduce size-based accessors.

This lets the us do endian conversion if necessary, and insulates the
drivers from that change.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 29b9104..e62acdd 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -162,5 +135,139 @@ int virtqueue_set_affinity(struct virtqueue *vq, int cpu)
return 0;
 }
 
+/* Config space accessors. */
+#define virtio_cread(vdev, structname, member, ptr)\
+   do {\
+   /* Must match the member's type, and be integer */  \
+   if (!typecheck(typeofstructname*)0)-member)), *(ptr))) \
+   (*ptr) = 1; \
+   \
+   switch (sizeof(*ptr)) { \
+   case 1: \
+   *(ptr) = virtio_cread8(vdev,\
+  offsetof(structname, member)); \
+   break;  \
+   case 2: \
+   *(ptr) = virtio_cread16(vdev,   \
+   offsetof(structname, member)); \
+   break;  \
+   case 4: \
+   *(ptr) = virtio_cread32(vdev,   \
+   offsetof(structname, member)); \
+   break;  \
+   case 8: \
+   *(ptr) = virtio_cread64(vdev,   \
+   offsetof(structname, member)); \
+   break;  \
+   default:\
+   BUG();  \
+   }   \
+   } while(0)
+
+/* Config space accessors. */
+#define virtio_cwrite(vdev, structname, member, ptr)   \
+   do {\
+   /* Must match the member's type, and be integer */  \
+   if (!typecheck(typeofstructname*)0)-member)), *(ptr))) \
+   BUG_ON((*ptr) == 1);\
+   \
+   switch (sizeof(*ptr)) { \
+   case 1: \
+   virtio_cwrite8(vdev,\
+  offsetof(structname, member),\
+  *(ptr)); \
+   break;  \
+   case 2: \
+   virtio_cwrite16(vdev,   \
+   offsetof(structname, member),   \
+   *(ptr));\
+   break;  \
+   case 4: \
+   virtio_cwrite32(vdev,   \
+   offsetof(structname, member),   \
+   *(ptr));\
+   break

Re: [PATCH 2/3] virtio: mmio: fix signature checking for BE guests

2013-10-14 Thread Rusty Russell
Pawel Moll pawel.m...@arm.com writes:
 On Fri, 2013-10-11 at 15:36 +0100, Marc Zyngier wrote:
 As virtio-mmio config registers are specified to be little-endian,
 using readl() to read the magic value and then memcmp() to check it
 fails on BE (as readl() has an implicit swab).
 
 Fix it by encoding the magic value as an integer instead of a string.
 
 Cc: Rusty Russell ru...@rustcorp.com.au
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Pawel Moll pawel.m...@arm.com
 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  drivers/virtio/virtio_mmio.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
 index 1ba0d68..57f24fd 100644
 --- a/drivers/virtio/virtio_mmio.c
 +++ b/drivers/virtio/virtio_mmio.c
 @@ -470,7 +470,7 @@ static int virtio_mmio_probe(struct platform_device 
 *pdev)
  
  /* Check magic value */
  magic = readl(vm_dev-base + VIRTIO_MMIO_MAGIC_VALUE);
 -if (memcmp(magic, virt, 4) != 0) {
 +if (magic != ('v' | 'i'  8 | 'r'  16 | 't'  24)) {
  dev_warn(pdev-dev, Wrong magic value 0x%08lx!\n, magic);
  return -ENODEV;
  }

 The new spec will clarify this:

 * 0x000 | R | MagicValue
   Magic value. Must be 0x74726976 (a Little Endian equivalent
   of a virt string).

 ... but I like the 'v'i'r't' characters still being there :-)

 Acked-by: Pawel Moll pawel.m...@arm.com

Applied, thanks.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] virtio-mmio: handle BE guests on LE hosts

2013-10-14 Thread Rusty Russell
Marc Zyngier marc.zyng...@arm.com writes:
 Hi Michael,

 On 12/10/13 19:28, Michael S. Tsirkin wrote:
 On Fri, Oct 11, 2013 at 03:36:08PM +0100, Marc Zyngier wrote:
 This small patch series adds just enough kernel infrastructure and
 fixes to allow a BE guest to use virtio-mmio on a LE host, provided
 that the host actually supports such madness.

 This has been tested on arm64, with some fixes to KVM and a set of
 changes to kvmtool, both which I am posting separately.

 A branch containing all the relevant changes is at:
 git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git 
 kvm-arm64/be-on-le-3.12-rc4

 Cc: Rusty Russell ru...@rustcorp.com.au
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Pawel Moll pawel.m...@arm.com
 
 We are changing the spec to make everything LE instead of
 the native endian.
 
 I think that'll fix the issue in a cleaner way.

 While I agree that it would solve the issue completely, it would also
 break all BE users. Is that really an option?

 The whole goal of this series is to implement something that gracefully
 falls back to what we have today, not breaking anything in the process.

Pawel was in favor of incrementing the version number.

OTOH, clearly there are no big endian users at the moment, since they
fail the magic tests...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] virtio-mmio: handle BE guests on LE hosts

2013-10-14 Thread Rusty Russell
Pawel Moll pawel.m...@arm.com writes:
 On Mon, 2013-10-14 at 11:46 +0100, Michael S. Tsirkin wrote:
 What I meant is that under the proposed scheme, users with
 existing v1 drivers must configure a v1 device explicitly.
 
 According to the plan, drivers will be updated so they can
 work with both v1 devices and new v2 devices.
 
 But if you configure a v2 device, old drivers
 will not work.

 That's correct, yes. If qemu 2015(16? 17?).01 will provide v2 devices
 only, kernel 3.11 is not going to recognize any of them.

 What will happen in reality, though, is that the implementation of the
 Linux v1/v2 driver will pre-date *any* v2 device by far. First
 implementation of v1 devices appeared almost 2 years after the driver
 appeared. It will take year for people to upgrade to v2, and probably
 only after they experience problems with page-based addressing mode.

 I think I can live with this.

Yes, this makes sense.  In particular, MMIO doesn't have any users you
don't control at this point.

I expect it to be less than 2 years to get into qemu, but there'll be no
compelling reason to use v2 devices for at least that long.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] virtio-mmio: handle BE guests on LE hosts

2013-10-14 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Mon, Oct 14, 2013 at 06:50:48PM +0200, Paolo Bonzini wrote:
 Il 14/10/2013 17:36, Marc Zyngier ha scritto:
   
   Devices are fine in QEMU, it's only the generic parts (rings) that are
   missing AFAICT.
  So if I understand correctly how it works, target endianness is set at
  compile time, and you have a BE specific QEMU?
 
 Yes.  Though as Alex said, this will have to change for PPC
 little-endian support.
 
 Paolo

 Or we'll just say that this platform requires virtio 1.0.

To come full circle, I have implemented virtio support for BE PPC in
qemu :) And we'll be supporting pre-1.0 for the moment. I'm awaiting
some ppc-specific hooks, but it could be merged without them.

It's icky: we don't really want to use the current endianness of the
CPU, since in theory a BE kernel could be running an LE process.  So at
the time of virtio device reset (the first thing the Linux drivers do)
we read the endianness of the interrupt entry point, which is presumably
the OS endianness.

The result looks like this.  In virtio_reset(void *opaque):

/* We assume all devices are the same endian. */
virtio_byteswap = virtio_get_byteswap();

And accessors like this are used in all virtio devices and routines:

/* Initialized by virtio_get_byteswap() at any virtio device reset. */
extern bool virtio_byteswap;

static inline uint16_t virtio_lduw_phys(hwaddr pa)
{
if (virtio_byteswap) {
return bswap16(lduw_phys(pa));
}
return lduw_phys(pa);

}

static inline uint32_t virtio_ldl_phys(hwaddr pa)
{
if (virtio_byteswap) {
return bswap32(ldl_phys(pa));
}
return ldl_phys(pa);
}

static inline uint64_t virtio_ldq_phys(hwaddr pa)
{
if (virtio_byteswap) {
return bswap64(ldq_phys(pa));
}
return ldq_phys(pa);
}

static inline void virtio_stw_phys(hwaddr pa, uint16_t value)
{
if (virtio_byteswap) {
stw_phys(pa, bswap16(value));
} else {
stw_phys(pa, value);
}
}

static inline void virtio_stl_phys(hwaddr pa, uint32_t value)
{
if (virtio_byteswap) {
stl_phys(pa, bswap32(value));
} else {
stl_phys(pa, value);
}
}

static inline void virtio_stw_p(void *ptr, uint16_t v)
{
if (virtio_byteswap) {
stw_p(ptr, bswap16(v));
} else {
stw_p(ptr, v);
}
}

static inline void virtio_stl_p(void *ptr, uint32_t v)
{
if (virtio_byteswap) {
stl_p(ptr, bswap32(v));
} else {
stl_p(ptr, v);
}
}

static inline void virtio_stq_p(void *ptr, uint64_t v)
{
if (virtio_byteswap) {
stq_p(ptr, bswap64(v));
} else {
stq_p(ptr, v);
}
}

static inline int virtio_lduw_p(const void *ptr)
{
if (virtio_byteswap) {
return bswap16(lduw_p(ptr));
} else {
return lduw_p(ptr);
}
}

static inline int virtio_ldl_p(const void *ptr)
{
if (virtio_byteswap) {
return bswap32(ldl_p(ptr));
} else {
return ldl_p(ptr);
}
}

static inline uint64_t virtio_ldq_p(const void *ptr)
{
if (virtio_byteswap) {
return bswap64(ldq_p(ptr));
} else {
return ldq_p(ptr);
}
}

static inline uint32_t virtio_tswap32(uint32_t s)
{
if (virtio_byteswap) {
return bswap32(tswap32(s));
} else {
return tswap32(s);
}
}

static inline void virtio_tswap32s(uint32_t *s)
{
tswap32s(s);
if (virtio_byteswap) {
*s = bswap32(*s);
}
}

There are obvious optimizations to this, such as making virtio_byteswap
a compile time constant on non-endian-ambivalent targets, but this is
the current approach.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-scsi: Fix virtqueue affinity setup

2013-07-31 Thread Rusty Russell
Asias He as...@redhat.com writes:
 vscsi-num_queues counts the number of request virtqueue which does not
 include the control and event virtqueue. It is wrong to subtract
 VIRTIO_SCSI_VQ_BASE from vscsi-num_queues.

 This patch fixes the following panic.

Applied.

Thanks,
Rusty.


 (qemu) device_del scsi0

  BUG: unable to handle kernel NULL pointer dereference at 0020
  IP: [8179b29f] __virtscsi_set_affinity+0x6f/0x120
  PGD 0
  Oops:  [#1] SMP
  Modules linked in:
  CPU: 0 PID: 659 Comm: kworker/0:1 Not tainted 3.11.0-rc2+ #1172
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: kacpi_hotplug _handle_hotplug_event_func
  task: 88007bee1cc0 ti: 88007bfe4000 task.ti: 88007bfe4000
  RIP: 0010:[8179b29f]  [8179b29f] 
 __virtscsi_set_affinity+0x6f/0x120
  RSP: 0018:88007bfe5a38  EFLAGS: 00010202
  RAX: 0010 RBX: 880077fd0d28 RCX: 0050
  RDX:  RSI: 0246 RDI: 
  RBP: 88007bfe5a58 R08: 880077f6ff00 R09: 0001
  R10: 8143e673 R11: 0001 R12: 0001
  R13: 880077fd0800 R14:  R15: 88007bf489b0
  FS:  () GS:88007ea0() knlGS:
  CS:  0010 DS:  ES:  CR0: 8005003b
  CR2: 0020 CR3: 79f8b000 CR4: 06f0
  Stack:
   880077fd0d28  880077fd0800 0008
   88007bfe5a78 8179b37d 88007bccc800 88007bccc800
   88007bfe5a98 8179b3b6 88007bccc800 880077fd0d28
  Call Trace:
   [8179b37d] virtscsi_set_affinity+0x2d/0x40
   [8179b3b6] virtscsi_remove_vqs+0x26/0x50
   [8179c7d2] virtscsi_remove+0x82/0xa0
   [814cb6b2] virtio_dev_remove+0x22/0x70
   [8167ca49] __device_release_driver+0x69/0xd0
   [8167cb9d] device_release_driver+0x2d/0x40
   [8167bb96] bus_remove_device+0x116/0x150
   [81679936] device_del+0x126/0x1e0
   [81679a06] device_unregister+0x16/0x30
   [814cb889] unregister_virtio_device+0x19/0x30
   [814cdad6] virtio_pci_remove+0x36/0x80
   [81464ae7] pci_device_remove+0x37/0x70
   [8167ca49] __device_release_driver+0x69/0xd0
   [8167cb9d] device_release_driver+0x2d/0x40
   [8167bb96] bus_remove_device+0x116/0x150
   [81679936] device_del+0x126/0x1e0
   [8145edfc] pci_stop_bus_device+0x9c/0xb0
   [8145f036] pci_stop_and_remove_bus_device+0x16/0x30
   [81474a9e] acpiphp_disable_slot+0x8e/0x150
   [81474f6a] hotplug_event_func+0xba/0x1a0
   [814906c8] ? acpi_os_release_object+0xe/0x12
   [81475911] _handle_hotplug_event_func+0x31/0x70
   [810b5333] process_one_work+0x183/0x500
   [810b66e2] worker_thread+0x122/0x400
   [810b65c0] ? manage_workers+0x2d0/0x2d0
   [810bc5de] kthread+0xce/0xe0
   [810bc510] ? kthread_freezable_should_stop+0x70/0x70
   [81ca045c] ret_from_fork+0x7c/0xb0
   [810bc510] ? kthread_freezable_should_stop+0x70/0x70
  Code: 01 00 00 00 74 59 45 31 e4 83 bb c8 01 00 00 02 74 46 66 2e 0f 1f 84 
 00 00 00 00 00 49 63 c4 48 c1 e0 04 48 8b bc 0
 3 10 02 00 00 48 8b 47 20 48 8b 80 d0 01 00 00 48 8b 40 50 48 85 c0 74 07 be
  RIP  [8179b29f] __virtscsi_set_affinity+0x6f/0x120
   RSP 88007bfe5a38
  CR2: 0020
  ---[ end trace 99679331a3775f48 ]---

 CC: sta...@vger.kernel.org
 Signed-off-by: Asias He as...@redhat.com
 ---
  drivers/scsi/virtio_scsi.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
 index 2168258..74b88ef 100644
 --- a/drivers/scsi/virtio_scsi.c
 +++ b/drivers/scsi/virtio_scsi.c
 @@ -751,7 +751,7 @@ static void __virtscsi_set_affinity(struct virtio_scsi 
 *vscsi, bool affinity)
  
   vscsi-affinity_hint_set = true;
   } else {
 - for (i = 0; i  vscsi-num_queues - VIRTIO_SCSI_VQ_BASE; i++)
 + for (i = 0; i  vscsi-num_queues; i++)
   virtqueue_set_affinity(vscsi-req_vqs[i].vq, -1);
  
   vscsi-affinity_hint_set = false;
 -- 
 1.8.3.1
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio: include asm/barrier explicitly

2013-07-07 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 virtio_ring.h uses mb() and friends, make
 it pull in asm/barrier.h itself, not rely
 on other headers to do it.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Applied.

Thanks,
Rusty.
PS.  I'll squeeze these into this merge window, since they're trivial.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] tools/virtio: move module license stub to module.h

2013-07-07 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 This fixes build for the vringh test:
 [linux]$ make -C tools/virtio/
 make: Entering directory `/home/mst/scm/linux/tools/virtio'
 cc -g -O2 -Wall -I. -I ../../usr/include/ -Wno-pointer-sign
 -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD
 -U_FORTIFY_SOURCE   -c -o vringh.o ../../drivers/vhost/vringh.c
 ../../drivers/vhost/vringh.c:1010:16: error: expected declaration
 specifiers or ‘...’ before string constant

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  tools/virtio/linux/module.h | 5 +
  tools/virtio/linux/virtio.h | 3 ---
  2 files changed, 5 insertions(+), 3 deletions(-)

Applied, thanks.

Cheers,
Rusty.

 diff --git a/tools/virtio/linux/module.h b/tools/virtio/linux/module.h
 index 3039a7e..28ce95a 100644
 --- a/tools/virtio/linux/module.h
 +++ b/tools/virtio/linux/module.h
 @@ -1 +1,6 @@
  #include linux/export.h
 +
 +#define MODULE_LICENSE(__MODULE_LICENSE_value) \
 + static __attribute__((unused)) const char *__MODULE_LICENSE_name = \
 + __MODULE_LICENSE_value
 +
 diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
 index cd80183..8447830 100644
 --- a/tools/virtio/linux/virtio.h
 +++ b/tools/virtio/linux/virtio.h
 @@ -45,9 +45,6 @@ struct virtqueue {
   void *priv;
  };
  
 -#define MODULE_LICENSE(__MODULE_LICENSE_value) \
 - const char *__MODULE_LICENSE_name = __MODULE_LICENSE_value
 -
  /* Interfaces exported by virtio_ring. */
  int virtqueue_add_sgs(struct virtqueue *vq,
 struct scatterlist *sgs[],
 -- 
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm_intel: Could not allocate 42 bytes percpu data

2013-07-02 Thread Rusty Russell
Chegu Vinod chegu_vi...@hp.com writes:
 On 6/30/2013 11:22 PM, Rusty Russell wrote:
 Chegu Vinod chegu_vi...@hp.com writes:
 Hello,

 Lots (~700+) of the following messages are showing up in the dmesg of a
 3.10-rc1 based kernel (Host OS is running on a large socket count box
 with HT-on).

 [   82.270682] PERCPU: allocation failed, size=42 align=16, alloc from
 reserved chunk failed
 [   82.272633] kvm_intel: Could not allocate 42 bytes percpu data
 Woah, weird

 Oh.  Shit.  Um, this is embarrassing.

 Thanks,
 Rusty.


 Thanks for your response!

 ===
 module: do percpu allocation after uniqueness check.  No, really!

 v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
 percpu memory on large machines:

  Now we have a new state MODULE_STATE_UNFORMED, we can insert the
  module into the list (and thus guarantee its uniqueness) before we
  allocate the per-cpu region.

 In my defence, it didn't actually say the patch did this.  Just that
 we can.

 This patch actually *does* it.

 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 Tested-by: Noone it seems.

 Your following updated fix seems to be working fine on the larger 
 socket count machine with HT-on.

OK, did you definitely revert every other workaround?

If so, please give me a Tested-by: line...

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kvm_intel: Could not allocate 42 bytes percpu data

2013-07-02 Thread Rusty Russell
Chegu Vinod chegu_vi...@hp.com writes:
 On 7/1/2013 10:49 PM, Rusty Russell wrote:
 Chegu Vinod chegu_vi...@hp.com writes:
 On 6/30/2013 11:22 PM, Rusty Russell wrote:
 module: do percpu allocation after uniqueness check.  No, really!

 v3.8-rc1-5-g1fb9341 was supposed to stop parallel kvm loads exhausting
 percpu memory on large machines:

   Now we have a new state MODULE_STATE_UNFORMED, we can insert the
   module into the list (and thus guarantee its uniqueness) before we
   allocate the per-cpu region.

 In my defence, it didn't actually say the patch did this.  Just that
 we can.

 This patch actually *does* it.

 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 Tested-by: Noone it seems.
 Your following updated fix seems to be working fine on the larger
 socket count machine with HT-on.
 OK, did you definitely revert every other workaround?

 Yes no other workarounds were there when your change was tested.


 If so, please give me a Tested-by: line...

 FYI The actual verification of your change was done by my esteemed 
 colleague :Jim Hull (cc'd) who had access to this larger socket count box.



 Tested-by: Jim Hull jim.h...@hp.com

Thanks, I've put this in my -next tree, and CC'd stable.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Rusty Russell
Alex Williamson alex.william...@redhat.com writes:
 On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
 On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
 
  IOMMU groups themselves don't provide security, they're accessed by
  interfaces like VFIO, which provide the security.  Given a brief look, I
  agree, this looks like a possible backdoor.  The typical VFIO way to
  handle this would be to pass a VFIO file descriptor here to prove that
  the process has access to the IOMMU group.  This is how /dev/vfio/vfio
  gains the ability to setup an IOMMU domain an do mappings with the
  SET_CONTAINER ioctl using a group fd.  Thanks,
 
 How do you envision that in the kernel ? IE. I'm in KVM code, gets that
 vfio fd, what do I do with it ?
 
 Basically, KVM needs to know that the user is allowed to use that iommu
 group. I don't think we want KVM however to call into VFIO directly
 right ?

 Right, we don't want to create dependencies across modules.  I don't
 have a vision for how this should work.  This is effectively a complete
 side-band to vfio, so we're really just dealing in the iommu group
 space.  Maybe there needs to be some kind of registration of ownership
 for the group using some kind of token.  It would need to include some
 kind of notification when that ownership ends.  That might also be a
 convenient tag to toggle driver probing off for devices in the group.
 Other ideas?  Thanks,

It's actually not that bad.

eg. 

struct vfio_container *vfio_container_from_file(struct file *filp)
{
if (filp-f_op != vfio_device_fops)
return ERR_PTR(-EINVAL);

/* OK it really is a vfio fd, return the data. */

}
EXPORT_SYMBOL_GPL(vfio_container_from_file);

...

inside KVM_CREATE_SPAPR_TCE_IOMMU:

struct file *vfio_filp;
struct vfio_container *(lookup)(struct file *filp);

vfio_filp = fget(create_tce_iommu.fd);
if (!vfio)
ret = -EBADF;
lookup = symbol_get(vfio_container_from_file);
if (!lookup)
ret = -EINVAL;
else {
container = lookup(vfio_filp);
if (IS_ERR(container))
ret = PTR_ERR(container);
else
...
symbol_put(vfio_container_from_file);
}

symbol_get() won't try to load a module; it'll just fail.  This is what
you want, since they must have vfio in the kernel to get a valid fd...

Hope that helps,
Rusty.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/4] KVM: PPC: Add support for IOMMU in-kernel handling

2013-06-18 Thread Rusty Russell
Alex Williamson alex.william...@redhat.com writes:
 On Mon, 2013-06-17 at 13:56 +1000, Benjamin Herrenschmidt wrote:
 On Sun, 2013-06-16 at 21:13 -0600, Alex Williamson wrote:
 
  IOMMU groups themselves don't provide security, they're accessed by
  interfaces like VFIO, which provide the security.  Given a brief look, I
  agree, this looks like a possible backdoor.  The typical VFIO way to
  handle this would be to pass a VFIO file descriptor here to prove that
  the process has access to the IOMMU group.  This is how /dev/vfio/vfio
  gains the ability to setup an IOMMU domain an do mappings with the
  SET_CONTAINER ioctl using a group fd.  Thanks,
 
 How do you envision that in the kernel ? IE. I'm in KVM code, gets that
 vfio fd, what do I do with it ?
 
 Basically, KVM needs to know that the user is allowed to use that iommu
 group. I don't think we want KVM however to call into VFIO directly
 right ?

 Right, we don't want to create dependencies across modules.  I don't
 have a vision for how this should work.  This is effectively a complete
 side-band to vfio, so we're really just dealing in the iommu group
 space.  Maybe there needs to be some kind of registration of ownership
 for the group using some kind of token.  It would need to include some
 kind of notification when that ownership ends.  That might also be a
 convenient tag to toggle driver probing off for devices in the group.
 Other ideas?  Thanks,

It's actually not that bad.

eg. 

struct vfio_container *vfio_container_from_file(struct file *filp)
{
if (filp-f_op != vfio_device_fops)
return ERR_PTR(-EINVAL);

/* OK it really is a vfio fd, return the data. */

}
EXPORT_SYMBOL_GPL(vfio_container_from_file);

...

inside KVM_CREATE_SPAPR_TCE_IOMMU:

struct file *vfio_filp;
struct vfio_container *(lookup)(struct file *filp);

vfio_filp = fget(create_tce_iommu.fd);
if (!vfio)
ret = -EBADF;
lookup = symbol_get(vfio_container_from_file);
if (!lookup)
ret = -EINVAL;
else {
container = lookup(vfio_filp);
if (IS_ERR(container))
ret = PTR_ERR(container);
else
...
symbol_put(vfio_container_from_file);
}

symbol_get() won't try to load a module; it'll just fail.  This is what
you want, since they must have vfio in the kernel to get a valid fd...

Hope that helps,
Rusty.

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mod filenames != modnames? (inconsistent name changing)

2013-06-15 Thread Rusty Russell
Linda Walsh l...@tlinx.org writes:
 Should we have any expectation that a module name and it's filename
 should be equivalent?

 I was writing an auto-complete script for modprobe so it wouldn't give
 me the option to double load a module (I'd have to manually type it in if
 I really wanted it).

 Trouble is some modules with names w/underscores ('_') change
 register their names with a 'dash'.

But modprobe kvm_intel already works.  Just convert any - to _?

 Would it be too much trouble to ask that the same names be used in
 both places?

Well, you could get kernel developers to rename their files with
underscores.  I've CC'd them.

We use _ because we turn the module names into identifiers; it's also
the most common choice, and having both is just confusing.

 I have about 30 modules loaded now and out of that list,
 2 are name changes:

 acpi_cpufreq  (filename acip-cpufreq.ko)
 kvm_intel (filename kvm-intel.ko)

 If I load a bunch more... I find more that change names.

 Could the naming in the modules that do this be made consistent?

 Not doing so makes automatically tying a mod-name to it's file
 tend toward being non-deterministic.

 Many modules use '-' or '_' without being confused.
 It would be helpful if the few confused modules could
 get their identity straight.  Is there a problem with
 making these names consistent?

It shouldn't break anything to rename them, if there's consensus I'm
happy to create a patch.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-06-06 Thread Rusty Russell
Anthony Liguori aligu...@us.ibm.com writes:
 Hi Rusty,

 Rusty Russell ru...@rustcorp.com.au writes:

 Anthony Liguori aligu...@us.ibm.com writes:
 4) Do virtio-pcie, make it PCI-e friendly (drop the IO BAR completely), give
it a new device/vendor ID.   Continue to use virtio-pci for existing
devices potentially adding virtio-{net,blk,...}-pcie variants for
people that care to use them.

 Now you have a different compatibility problem; how do you know the
 guest supports the new virtio-pcie net?

 We don't care.

 We would still use virtio-pci for existing devices.  Only new devices
 would use virtio-pcie.

My concern is that new features make the virtio-pcie driver so desirable
that libvirt is pressured to use it ASAP.  Have we just punted the
problem upstream?

(Of course, feature bits were supposed to avoid such a transition issue,
but mistakes accumulate).

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-06-05 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 By my count, net still has 7 feature bits left, so I don't think the
 feature bits are likely to be a limitation in the next 6 months?

 Yes but you wanted a generic transport feature bit
 for flexible SG layout.
 Are you happy with VIRTIO_NET_F_ANY_HEADER_SG then?

If we need it within 6 months, I'd rather do that: this feature would
then be assumed for 1.0, and reserved.  We may do that for other
features, too (the committee will have to review).

 MMIO is a bigger problem.  Linux guests are happy with it: does it break
 the Windows drivers?
 
 Thanks,
 Rusty.

 You mean make BAR0 an MMIO BAR?
 Yes, it would break current windows guests.
 Further, as long as we use same address to notify all queues,
 we would also need to decode the instruction on x86 and that's
 measureably slower than PIO.
 We could go back to discussing hypercall use for notifications,
 but that has its own set of issues...

We might have something ready to deploy in 3 months, but realistically,
I'd rather have it ready and tested outside the main git tree(s) and
push it once the standard is committed.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-06-05 Thread Rusty Russell
Anthony Liguori aligu...@us.ibm.com writes:
 4) Do virtio-pcie, make it PCI-e friendly (drop the IO BAR completely), give
it a new device/vendor ID.   Continue to use virtio-pci for existing
devices potentially adding virtio-{net,blk,...}-pcie variants for
people that care to use them.

Now you have a different compatibility problem; how do you know the
guest supports the new virtio-pcie net?

If you put a virtio-pci card behind a PCI-e bridge today, it's not
compliant, but AFAICT it will Just Work.  (Modulo the 16-dev limit).

I've been assuming we'd avoid a flag day change; that devices would
look like existing virtio-pci with capabilities indicating the new
config layout.

 I think 4 is the best path forward.  It's better for users (guests
 continue to work as they always have).  There's less confusion about
 enabling PCI-e support--you must ask for the virtio-pcie variant and you
 must have a virtio-pcie driver.  It's easy to explain.

Removing both forward and backward compatibility is easy to explain, but
I think it'll be harder to deploy.  This is your area though, so perhaps
I'm wrong.

 It also maps to what regular hardware does.  I highly doubt that there
 are any real PCI cards that made the shift from PCI to PCI-e without
 bumping at least a revision ID.

Noone expected the new cards to Just Work with old OSes: a new machine
meant a new OS and new drivers.  Hardware vendors like that.

Since virtualization often involves legacy, our priorities might be
different.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-06-04 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Mon, Jun 03, 2013 at 09:56:15AM +0930, Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
  On Thu, May 30, 2013 at 08:53:45AM -0500, Anthony Liguori wrote:
  Rusty Russell ru...@rustcorp.com.au writes:
  
   Anthony Liguori aligu...@us.ibm.com writes:
   Forcing a guest driver change is a really big
   deal and I see no reason to do that unless there's a compelling reason
   to.
  
   So we're stuck with the 1.0 config layout for a very long time.
  
   We definitely must not force a guest change.  The explicit aim of the
   standard is that legacy and 1.0 be backward compatible.  One
   deliverable is a document detailing how this is done (effectively a
   summary of changes between what we have and 1.0).
  
  If 2.0 is fully backwards compatible, great.  It seems like such a
  difference that that would be impossible but I need to investigate
  further.
  
  Regards,
  
  Anthony Liguori
 
  If you look at my patches you'll see how it works.
  Basically old guests use BAR0 new ones don't, so
  it's easy: BAR0 access means legacy guest.
  Only started testing but things seem to work
  fine with old guests so far.
 
  I think we need a spec, not just driver code.
 
  Rusty what's the plan? Want me to write it?
 
 We need both, of course, but the spec work will happen in the OASIS WG.
 A draft is good, but let's not commit anything to upstream QEMU until we
 get the spec finalized.  And that is proposed to be late this year.

 Well that would be quite sad really.
 
 This means we can't make virtio a spec compliant pci express device,
 and we can't add any more feature bits, so no
 flexible buffer optimizations for virtio net.

 There are probably more projects that will be blocked.

 So how about we keep extending legacy layout for a bit longer:
 - add a way to access device with MMIO
 - use feature bit 31 to signal 64 bit features
   (and shift device config accordingly)

By my count, net still has 7 feature bits left, so I don't think the
feature bits are likely to be a limitation in the next 6 months?

MMIO is a bigger problem.  Linux guests are happy with it: does it break
the Windows drivers?

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-06-02 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Thu, May 30, 2013 at 08:53:45AM -0500, Anthony Liguori wrote:
 Rusty Russell ru...@rustcorp.com.au writes:
 
  Anthony Liguori aligu...@us.ibm.com writes:
  Forcing a guest driver change is a really big
  deal and I see no reason to do that unless there's a compelling reason
  to.
 
  So we're stuck with the 1.0 config layout for a very long time.
 
  We definitely must not force a guest change.  The explicit aim of the
  standard is that legacy and 1.0 be backward compatible.  One
  deliverable is a document detailing how this is done (effectively a
  summary of changes between what we have and 1.0).
 
 If 2.0 is fully backwards compatible, great.  It seems like such a
 difference that that would be impossible but I need to investigate
 further.
 
 Regards,
 
 Anthony Liguori

 If you look at my patches you'll see how it works.
 Basically old guests use BAR0 new ones don't, so
 it's easy: BAR0 access means legacy guest.
 Only started testing but things seem to work
 fine with old guests so far.

 I think we need a spec, not just driver code.

 Rusty what's the plan? Want me to write it?

We need both, of course, but the spec work will happen in the OASIS WG.
A draft is good, but let's not commit anything to upstream QEMU until we
get the spec finalized.  And that is proposed to be late this year.

Since I'm going to have to reformat the spec and adapt it into OASIS
style anyway, perhaps you should prepare a description as a standalone
text document.  Easier to email and work with...

Now, the idea is that if you want to support 0.9 and 1.0 (or whatever we
call them; I used the term legacy for existing implementations in the
OASIS WG proposal), you add capabilities and don't point them into (the
start of?) BAR0.  Old drivers use BAR0 as now.

One trick to note: while drivers shouldn't use both old and new style on
the same device, you need to allow it for kexec, particularly reset via
BAR0.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PATCH] virtio-spec: small English/punctuation corrections

2013-06-02 Thread Rusty Russell
Luiz Capitulino lcapitul...@redhat.com writes:
 1. s/These are devices are/These devices are
 2. s/Thefirst/The first
 3. s/, Guest should/. Guest should

 Signed-off-by: Luiz Capitulino lcapitul...@redhat.com
 ---
  virtio-spec.lyx | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

Thanks... I changed , Guest to .  The guest, and re-entered the
changes by hand to get change tracking.

Thanks,
Rusty.


 diff --git a/virtio-spec.lyx b/virtio-spec.lyx
 index 6e188d0..7e4ce71 100644
 --- a/virtio-spec.lyx
 +++ b/virtio-spec.lyx
 @@ -116,7 +116,7 @@ description Peripheral Component Interconnect; a common 
 device bus.  Seehtt
  \end_inset
  
   devices.
 - These are devices are found in 
 + These devices are found in 
  \emph on
  virtual
  \emph default
 @@ -1558,7 +1558,7 @@ name sub:Feature-Bits
  \end_layout
  
  \begin_layout Standard
 -Thefirst configuration field indicates the features that the device supports.
 +The first configuration field indicates the features that the device 
 supports.
   The bits are allocated as follows:
  \end_layout
  
 @@ -2919,7 +2919,7 @@ For each ring, guest should then disable interrupts by 
 writing VRING_AVAIL_F_NO_
  INTERRUPT flag in avail structure, if required.
   It can then process used ring entries finally enabling interrupts by 
 clearing
   the VRING_AVAIL_F_NO_INTERRUPT flag or updating the EVENT_IDX field in
 - the available structure, Guest should then execute a memory barrier, and
 + the available structure. Guest should then execute a memory barrier, and
   then recheck the ring empty condition.
   This is necessary to handle the case where, after the last check and before
   enabling interrupts, an interrupt has been suppressed by the device:
 -- 
 1.8.1.4
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-06-02 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Thu, May 30, 2013 at 01:28:26PM +0930, Rusty Russell wrote:
  Yet the structure definitions are descriptive, capturing layout, size
  and endianness in natural a format readable by any C programmer.
 
 From an API design point of view, here are the problems I see:
 
  1) C makes no guarantees about structure layout beyond the first
 member.  Yes, if it's naturally aligned or has a packed attribute,
 GCC does the right thing.  But this isn't kernel land anymore,
 portability matters and there are more compilers than GCC.
 
 [ I argued in detail here, then deleted.  Damn bikeshedding... ]
 
 I think the best thing is to add the decoded integer versions as macros,
 and have a heap of BUILD_BUG_ON() in the kernel source to make sure they
 match.

 Hmm you want to have both in the header?

 That would confuse users for sure :)

 I guess we could put in a big comment explaning
 why do we have both, and which version should
 userspace use. I tried to write it:

 /*
  * On all known C compilers, you can use the
  * offsetof macro to find the correct offset of each field -
  * if your compiler doesn't implement this correctly, use the
  * integer versions below, instead.
  * In that case please don't use the structures above, to avoid confusion.
  */

My version was a little less verbose :)

Subject: virtio_pci: macros for PCI layout offsets.

QEMU wants it, so why not?  Trust, but verify.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index 24bcd9b..6510009 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -453,6 +453,64 @@ static void __iomem *map_capability(struct pci_dev *dev, 
int off, size_t minlen,
return p;
 }
 
+/* This is part of the ABI.  Don't screw with it. */
+static inline void check_offsets(void)
+{
+   /* Note: disk space was harmed in compilation of this function. */
+   BUILD_BUG_ON(VIRTIO_PCI_CAP_VNDR !=
+offsetof(struct virtio_pci_cap, cap_vndr));
+   BUILD_BUG_ON(VIRTIO_PCI_CAP_NEXT !=
+offsetof(struct virtio_pci_cap, cap_next));
+   BUILD_BUG_ON(VIRTIO_PCI_CAP_LEN !=
+offsetof(struct virtio_pci_cap, cap_len));
+   BUILD_BUG_ON(VIRTIO_PCI_CAP_TYPE_AND_BAR !=
+offsetof(struct virtio_pci_cap, type_and_bar));
+   BUILD_BUG_ON(VIRTIO_PCI_CAP_OFFSET !=
+offsetof(struct virtio_pci_cap, offset));
+   BUILD_BUG_ON(VIRTIO_PCI_CAP_LENGTH !=
+offsetof(struct virtio_pci_cap, length));
+   BUILD_BUG_ON(VIRTIO_PCI_NOTIFY_CAP_MULT !=
+offsetof(struct virtio_pci_notify_cap,
+ notify_off_multiplier));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_DFSELECT != 
+offsetof(struct virtio_pci_common_cfg,
+ device_feature_select));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_DF != 
+offsetof(struct virtio_pci_common_cfg, device_feature));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_GFSELECT != 
+offsetof(struct virtio_pci_common_cfg,
+ guest_feature_select));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_GF != 
+offsetof(struct virtio_pci_common_cfg, guest_feature));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_MSIX != 
+offsetof(struct virtio_pci_common_cfg, msix_config));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_NUMQ != 
+offsetof(struct virtio_pci_common_cfg, num_queues));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_STATUS != 
+offsetof(struct virtio_pci_common_cfg, device_status));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SELECT != 
+offsetof(struct virtio_pci_common_cfg, queue_select));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_SIZE != 
+offsetof(struct virtio_pci_common_cfg, queue_size));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_MSIX != 
+offsetof(struct virtio_pci_common_cfg, queue_msix_vector));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_ENABLE != 
+offsetof(struct virtio_pci_common_cfg, queue_enable));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_NOFF != 
+offsetof(struct virtio_pci_common_cfg, queue_notify_off));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCLO != 
+offsetof(struct virtio_pci_common_cfg, queue_desc_lo));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_DESCHI != 
+offsetof(struct virtio_pci_common_cfg, queue_desc_hi));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILLO != 
+offsetof(struct virtio_pci_common_cfg, queue_avail_lo));
+   BUILD_BUG_ON(VIRTIO_PCI_COMMON_Q_AVAILHI != 
+offsetof(struct virtio_pci_common_cfg, queue_avail_hi));
+   BUILD_BUG_ON

Re: updated: kvm networking todo wiki

2013-06-02 Thread Rusty Russell
Anthony Liguori anth...@codemonkey.ws writes:
 Michael S. Tsirkin m...@redhat.com writes:

 On Thu, May 30, 2013 at 08:40:47AM -0500, Anthony Liguori wrote:
 Stefan Hajnoczi stefa...@gmail.com writes:
 
  On Thu, May 30, 2013 at 7:23 AM, Rusty Russell ru...@rustcorp.com.au 
  wrote:
  Anthony Liguori anth...@codemonkey.ws writes:
  Rusty Russell ru...@rustcorp.com.au writes:
  On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
  FWIW, I think what's more interesting is using vhost-net as a 
  networking
  backend with virtio-net in QEMU being what's guest facing.
 
  In theory, this gives you the best of both worlds: QEMU acts as a 
  first
  line of defense against a malicious guest while still getting the
  performance advantages of vhost-net (zero-copy).
 
  It would be an interesting idea if we didn't already have the vhost
  model where we don't need the userspace bounce.
 
  The model is very interesting for QEMU because then we can use vhost as
  a backend for other types of network adapters (like vmxnet3 or even
  e1000).
 
  It also helps for things like fault tolerance where we need to be able
  to control packet flow within QEMU.
 
  (CC's reduced, context added, Dmitry Fleytman added for vmxnet3 
  thoughts).
 
  Then I'm really confused as to what this would look like.  A zero copy
  sendmsg?  We should be able to implement that today.
 
  On the receive side, what can we do better than readv?  If we need to
  return to userspace to tell the guest that we've got a new packet, we
  don't win on latency.  We might reduce syscall overhead with a
  multi-dimensional readv to read multiple packets at once?
 
  Sounds like recvmmsg(2).
 
 Could we map this to mergable rx buffers though?
 
 Regards,
 
 Anthony Liguori

 Yes because we don't have to complete buffers in order.

 What I meant though was for GRO, we don't know how large the received
 packet is going to be.  Mergable rx buffers lets us allocate a pool of
 data for all incoming packets instead of allocating max packet size *
 max packets.

 recvmmsg expects an array of msghdrs and I presume each needs to be
 given a fixed size.  So this seems incompatible with mergable rx
 buffers.

Good point.  You'd need to build 64k buffers to pass to recvmmsg, then
reuse the parts it didn't touch on the next call.  This limits us to
about a 16th of what we could do with an interface which understood
buffer merging, but I don't know how much that would matter in
practice.  We'd need some benchmarks

Cheers,
Rusty.




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: updated: kvm networking todo wiki

2013-05-30 Thread Rusty Russell
Stefan Hajnoczi stefa...@gmail.com writes:
 On Thu, May 30, 2013 at 7:23 AM, Rusty Russell ru...@rustcorp.com.au wrote:
 On the receive side, what can we do better than readv?  If we need to
 return to userspace to tell the guest that we've got a new packet, we
 don't win on latency.  We might reduce syscall overhead with a
 multi-dimensional readv to read multiple packets at once?

 Sounds like recvmmsg(2).

Wow... the future is here, today!

Thanks,
Rusty.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: updated: kvm networking todo wiki

2013-05-29 Thread Rusty Russell
Anthony Liguori anth...@codemonkey.ws writes:
 Rusty Russell ru...@rustcorp.com.au writes:
 On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
 FWIW, I think what's more interesting is using vhost-net as a networking
 backend with virtio-net in QEMU being what's guest facing.
 
 In theory, this gives you the best of both worlds: QEMU acts as a first
 line of defense against a malicious guest while still getting the
 performance advantages of vhost-net (zero-copy).

 It would be an interesting idea if we didn't already have the vhost
 model where we don't need the userspace bounce.

 The model is very interesting for QEMU because then we can use vhost as
 a backend for other types of network adapters (like vmxnet3 or even
 e1000).

 It also helps for things like fault tolerance where we need to be able
 to control packet flow within QEMU.

(CC's reduced, context added, Dmitry Fleytman added for vmxnet3 thoughts).

Then I'm really confused as to what this would look like.  A zero copy
sendmsg?  We should be able to implement that today.

On the receive side, what can we do better than readv?  If we need to
return to userspace to tell the guest that we've got a new packet, we
don't win on latency.  We might reduce syscall overhead with a
multi-dimensional readv to read multiple packets at once?

Confused,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_pci: fix capability format, comments

2013-05-29 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:

 - queue size can actually be 0 which is not a power of 2

Actually, that points to a flaw in the code.  When we shut down the
queue, we should ideally reset it to what the device started with,
rather than 0.

See below.

 - fix capability format. PCI spec says:
   The layout of the information is vendor specific, except that the byte
   immediately following the “Next” pointer in the capability structure is
   defined to be a length field.
   This length field provides the number of bytes in the capability
   structure (including the ID and Next pointer bytes).

That part's definitely correct: applied.

Thanks,
Rusty.

Subjet: virtio_pci: save the desired ringsize.

MST points out that 0 isn't a power of 2.  This means we can't re-open
a virtio device once we write 0 into the queue length.

We should restore the amount the device originally asked for.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/drivers/virtio/virtio_pci-common.h 
b/drivers/virtio/virtio_pci-common.h
index ba1bf81..0e3143b 100644
--- a/drivers/virtio/virtio_pci-common.h
+++ b/drivers/virtio/virtio_pci-common.h
@@ -82,6 +82,9 @@ struct virtio_pci_vq_info {
 
/* MSI-X vector (or none) */
unsigned msix_vector;
+
+   /* What size did the device *want* this to be? */
+   u16 desired_num;
 };
 
 /* the notify function used when creating a virt queue */
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
index 0f0e3a6..8b35c2e 100644
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -230,6 +230,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device 
*vp_dev,
return ERR_PTR(-ENOMEM);
 
info-msix_vector = msix_vec;
+   info-desired_num = num;
 
/* get offset of notification word for this vq (shouldn't wrap) */
off = ioread16(vp_dev-common-queue_notify_off);
@@ -350,7 +351,7 @@ static void del_vq(struct virtqueue *vq)
vring_del_virtqueue(vq);
 
/* This is for our own benefit, not the device's! */
-   iowrite16(0, vp_dev-common-queue_size);
+   iowrite16(info-desired_num, vp_dev-common-queue_size);
iowrite64_twopart(0, vp_dev-common-queue_desc);
iowrite64_twopart(0, vp_dev-common-queue_avail);
iowrite64_twopart(0, vp_dev-common-queue_used);
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-rng only returns zeros with CONFIG_HW_RANDOM=m

2013-05-29 Thread Rusty Russell
Jens Axboe ax...@kernel.dk writes:
 On Wed, Feb 27 2013, Rusty Russell wrote:
 Aurelien Jarno aurel...@aurel32.net writes:
  Hi,
 
  I have noticed that virtio-rng only returns zero for kernels = 2.6.33
  built with CONFIG_HW_RANDOM=m. This is a bit much too predictable for a
  random generator ;-).
 
 Wow.  Fortunately, all of SLES, RHEL, Ubuntu or Fedora set
 CONFIG_HW_RANDOM=y.  What do they know that we don't?
 
 Oops, looks like Debian testing: config-3.2.0-4-amd64:CONFIG_HW_RANDOM=m
 
  The reason for that is virtio expects guest real addresses, while
  rng_core.ko (ie when built as a module) is passing a vmalloced buffer 
  to the virtio-rng read function, declared as such:
 
static u8 rng_buffer[SMP_CACHE_BYTES  32 ? 32 : SMP_CACHE_BYTES]
__cacheline_aligned;
 
 Yuck...  It would be nice if this has oopsed.  Jens, what about this patch?
 
 Cheers,
 Rusty.
 
 Subject: scatterlist: sg_set_buf() argument must be in linear mapping.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 
 diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
 index 4bd6c06..9365375 100644
 --- a/include/linux/scatterlist.h
 +++ b/include/linux/scatterlist.h
 @@ -111,6 +111,9 @@ static inline struct page *sg_page(struct scatterlist 
 *sg)
  static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
unsigned int buflen)
  {
 +#ifdef CONFIG_DEBUG_SG
 +BUG_ON(!virt_addr_valid(buf));
 +#endif
  sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
  }

 Looks good to me, in lieu of being able to return an error. Want me to
 queue it up?

 -- 
 Jens Axboe

Ping?  I haven't seen this go into Linus' tree...

Subject: scatterlist: sg_set_buf() argument must be in linear mapping.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 4bd6c06..9365375 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -111,6 +111,9 @@ static inline struct page *sg_page(struct scatterlist *sg)
 static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
  unsigned int buflen)
 {
+#ifdef CONFIG_DEBUG_SG
+   BUG_ON(!virt_addr_valid(buf));
+#endif
sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
 }
 
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-05-29 Thread Rusty Russell
Anthony Liguori aligu...@us.ibm.com writes:
 Rusty Russell ru...@rustcorp.com.au writes:

 Anthony Liguori aligu...@us.ibm.com writes:
 Michael S. Tsirkin m...@redhat.com writes:
 +case offsetof(struct virtio_pci_common_cfg, device_feature_select):
 +return proxy-device_feature_select;

 Oh dear no...  Please use defines like the rest of QEMU.

 It is pretty ugly.

 I think beauty is in the eye of the beholder here...

 Pretty much every device we have has a switch statement like this.
 Consistency wins when it comes to qualitative arguments like this.

I was agreeing with you here, actually.

 Yet the structure definitions are descriptive, capturing layout, size
 and endianness in natural a format readable by any C programmer.

From an API design point of view, here are the problems I see:

 1) C makes no guarantees about structure layout beyond the first
member.  Yes, if it's naturally aligned or has a packed attribute,
GCC does the right thing.  But this isn't kernel land anymore,
portability matters and there are more compilers than GCC.

[ I argued in detail here, then deleted.  Damn bikeshedding... ]

I think the best thing is to add the decoded integer versions as macros,
and have a heap of BUILD_BUG_ON() in the kernel source to make sure they
match.

 3) It suspect it's harder to review because a subtle change could more
easily have broad impact.  If someone changed the type of a field
from u32 to u16, it changes the offset of every other field.  That's
not terribly obvious in the patch itself unless you understand how
the structure is used elsewhere.

MST's patch just did this, so point taken.  (MST: I'm going to combine
the cfg_type and bar bytes to fix this, patch coming).

 No change, but there's an open question on whether we should nail it to
 little endian (or define the endian by the transport).

 Of course, I can't rule out that the 1.0 standard *may* decide to frob
 the ring layout somehow,

 Well, given that virtio is widely deployed today, I would think the 1.0
 standard should strictly reflect what's deployed today, no?

That will be up to the committee.  I think we want to fix some obvious
pain points, though qemu will not benefit from them in the next 5 years.

 Any new config layout would be 2.0 material, right?

 Re: the new config layout, I don't think we would want to use it for
 anything but new devices.

There are enough concrete reasons that I think we want it for existing
devices:

1) Negotiated ring size/alignment.  Coreboot wants smaller, others want
   larger.
2) Remove assertion that it has to be an I/O bar.  PowerPC wants this.
3) Notification location flexibility.  MST wanted this for performance.
4) More feature bits.

 Forcing a guest driver change is a really big
 deal and I see no reason to do that unless there's a compelling reason
 to.

 So we're stuck with the 1.0 config layout for a very long time.

We definitely must not force a guest change.  The explicit aim of the
standard is that legacy and 1.0 be backward compatible.  One
deliverable is a document detailing how this is done (effectively a
summary of changes between what we have and 1.0).

It's a delicate balancing act.  My plan is to accompany any changes in
the standard with a qemu implementation, so we can see how painful those
changes are.  And if there are performance implications, measure them.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: updated: kvm networking todo wiki

2013-05-28 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Fri, May 24, 2013 at 08:47:58AM -0500, Anthony Liguori wrote:
 Michael S. Tsirkin m...@redhat.com writes:
 
  On Fri, May 24, 2013 at 05:41:11PM +0800, Jason Wang wrote:
  On 05/23/2013 04:50 PM, Michael S. Tsirkin wrote:
   Hey guys,
   I've updated the kvm networking todo wiki with current projects.
   Will try to keep it up to date more often.
   Original announcement below.
  
  Thanks a lot. I've added the tasks I'm currently working on to the wiki.
  
  btw. I notice the virtio-net data plane were missed in the wiki. Is the
  project still being considered?
 
  It might have been interesting several years ago, but now that linux has
  vhost-net in kernel, the only point seems to be to
  speed up networking on non-linux hosts.
 
 Data plane just means having a dedicated thread for virtqueue processing
 that doesn't hold qemu_mutex.
 
 Of course we're going to do this in QEMU.  It's a no brainer.  But not
 as a separate device, just as an improvement to the existing userspace
 virtio-net.
 
  Since non-linux does not have kvm, I doubt virtio is a bottleneck.
 
 FWIW, I think what's more interesting is using vhost-net as a networking
 backend with virtio-net in QEMU being what's guest facing.
 
 In theory, this gives you the best of both worlds: QEMU acts as a first
 line of defense against a malicious guest while still getting the
 performance advantages of vhost-net (zero-copy).

 Great idea, that sounds very intresting.

 I'll add it to the wiki.

 In fact a bit of complexity in vhost was put there in the vague hope to
 support something like this: virtio rings are not translated through
 regular memory tables, instead, vhost gets a pointer to ring address.

 This allows qemu acting as a man in the middle,
 verifying the descriptors but not touching the

 Anyone interested in working on such a project?

It would be an interesting idea if we didn't already have the vhost
model where we don't need the userspace bounce.  We already have two
sets of host side ring code in the kernel (vhost and vringh, though
they're being unified).

All an accelerator can offer on the tx side is zero copy and direct
update of the used ring.  On rx userspace could register the buffers and
the accelerator could fill them and update the used ring.  It still
needs to deal with merged buffers, for example.

You avoid the address translation in the kernel, but I'm not convinced
that's a key problem.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-05-28 Thread Rusty Russell
Anthony Liguori aligu...@us.ibm.com writes:
 Michael S. Tsirkin m...@redhat.com writes:
 +case offsetof(struct virtio_pci_common_cfg, device_feature_select):
 +return proxy-device_feature_select;

 Oh dear no...  Please use defines like the rest of QEMU.

It is pretty ugly.

Yet the structure definitions are descriptive, capturing layout, size
and endianness in natural a format readable by any C programmer.

So AFAICT the question is, do we put the required

#define VIRTIO_PCI_CFG_FEATURE_SEL \
 (offsetof(struct virtio_pci_common_cfg, device_feature_select))

etc. in the kernel headers or qemu?

 Haven't looked at the proposed new ring layout yet.

No change, but there's an open question on whether we should nail it to
little endian (or define the endian by the transport).

Of course, I can't rule out that the 1.0 standard *may* decide to frob
the ring layout somehow, but I'd think it would require a compelling
reason.  I suggest that's 2.0 material...

Cheers,
Rusty.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] virtio-pci: new config layout: using memory BAR

2013-05-28 Thread Rusty Russell
Paolo Bonzini pbonz...@redhat.com writes:
 Il 28/05/2013 19:32, Michael S. Tsirkin ha scritto:
   +
   +switch (addr) {
   +case offsetof(struct virtio_pci_common_cfg, 
   device_feature_select):
   +return proxy-device_feature_select;
  
  Oh dear no...  Please use defines like the rest of QEMU.
 Any good reason not to use offsetof?

 I'm not sure it's portable to use it in case labels.  IIRC, the
 definition ((int)(((T *)0)-field)) is not a valid C integer constant
 expression.  Laszlo?

It's defined to yield an integer constant expression in the ISO standard
(and I think ANSI too, though that's not at hand):

7.19, para 3:

...offsetof(type, member-designator)
which expands to an integer constant expression that has type
size_t, ...

The real question is whether compilers qemu cares about meet the
standard (there's some evidence that older compilers fail this).  If
not, we'll have to define them as raw offsets...

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH rusty/virtio-pci-new-layout] virtio: new layout minor header fixups

2013-05-27 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 Fix issues observed with the new layout code, seen
 when implementing device in qemu:
   - use of uXX in uapi header
   - incorrect readonly tag on one field
   - unconditional warning breaks builds with -Werr

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Hmm, this means if we ever *do* remove those defines, qemu will break.

But perhaps that's OK, because we'll probably remove legacy support at
the same time as we break the header.

Applied, thanks!
Rusty.

 This patch is on top of rusty/virtio-pci-new-layout.

  include/uapi/linux/virtio_pci.h | 12 +++-
  1 file changed, 7 insertions(+), 5 deletions(-)

 diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
 index 3e61d55..cda688f 100644
 --- a/include/uapi/linux/virtio_pci.h
 +++ b/include/uapi/linux/virtio_pci.h
 @@ -86,8 +86,10 @@
  #define VIRTIO_PCI_LEGACY_VRING_ALIGN4096
  
  #ifndef VIRTIO_PCI_NO_LEGACY
 +#ifndef VIRTIO_PCI_LEGACY_COMPAT_NAMES
  /* Don't break compile of old userspace code.  These will go away. */
  #warning Please support virtio_pci non-legacy mode!
 +#endif
  #define VIRTIO_PCI_HOST_FEATURES VIRTIO_PCI_LEGACY_HOST_FEATURES
  #define VIRTIO_PCI_GUEST_FEATURES VIRTIO_PCI_LEGACY_GUEST_FEATURES
  #define VIRTIO_PCI_QUEUE_PFN VIRTIO_PCI_LEGACY_QUEUE_PFN
 @@ -125,10 +127,10 @@
  
  /* This is the PCI capability header: */
  struct virtio_pci_cap {
 - u8 cap_vndr;/* Generic PCI field: PCI_CAP_ID_VNDR */
 - u8 cap_next;/* Generic PCI field: next ptr. */
 - u8 cfg_type;/* One of the VIRTIO_PCI_CAP_*_CFG. */
 - u8 bar; /* Where to find it. */
 + __u8 cap_vndr;  /* Generic PCI field: PCI_CAP_ID_VNDR */
 + __u8 cap_next;  /* Generic PCI field: next ptr. */
 + __u8 cfg_type;  /* One of the VIRTIO_PCI_CAP_*_CFG. */
 + __u8 bar;   /* Where to find it. */
   __le32 offset;  /* Offset within bar. */
   __le32 length;  /* Length. */
  };
 @@ -144,7 +146,7 @@ struct virtio_pci_common_cfg {
   __le32 device_feature_select;   /* read-write */
   __le32 device_feature;  /* read-only */
   __le32 guest_feature_select;/* read-write */
 - __le32 guest_feature;   /* read-only */
 + __le32 guest_feature;   /* read-write */
   __le16 msix_config; /* read-write */
   __le16 num_queues;  /* read-only */
   __u8 device_status; /* read-write */
 -- 
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: provide vhost thread per virtqueue for forwarding scenario

2013-05-22 Thread Rusty Russell
Jason Wang jasow...@redhat.com writes:
 On 05/22/2013 05:59 PM, Zang Hongyong wrote:
 On 2013/5/20 15:43, Michael S. Tsirkin wrote:
 On Mon, May 20, 2013 at 02:11:19AM +, Qinchuanyu wrote:
 Yes, I don't think we want to create threads even more aggressively
 in all cases. I'm worried about scalability as it is.
 I think we should explore a flexible approach, use a thread pool
 (for example, a wq) to share threads between virtqueues,
 switch to a separate thread only if there's free CPU and existing
 threads are busy. Hopefully share threads between vhost instances too.
 On Xen platform, network backend pv driver model has evolved to this
 way. Netbacks from all DomUs share a thread pool,
 and thread number eaqual to cpu core number.
 Is there any plan for kvm paltform?

 There used to be two related RFCs for this, one is the multiple vhost
 workers from Anthony another is percpu vhost thread from Shirley. You
 can search the archives on netdev or kvm for the patches.

As I've said to MST before, I think our entire model is wrong.
Userspace should create the threads and call in.  If you're doing kernel
acceleration, two extra threads per NIC is a tiny overhead.

Of course, such radical changes to vhost doesn't help existing users as
Qinchuanyu asked...

Cheers,
Rusty,
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] virtio_pci: better macro exported in uapi

2013-05-19 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:

 Macro VIRTIO_PCI_CONFIG assumes that userspace actually has a structure
 with a field named msix_enabled.  Add VIRTIO_PCI_CONFIG_OFF that gets
 the msix_enabled by value instead, to make it useful for userspace.  We
 still keep VIRTIO_PCI_CONFIG around for now, in case some userspace uses
 it.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Applied.

Thanks,
Rusty.

 ---

 Changes from v1:
 - rename macro to VIRTIO_PCI_CONFIG_OFF so preprocessor
   can be used to check for its presence, as suggested
   by Rusty
 - keep old macro around as suggested by David Howells

  include/uapi/linux/virtio_pci.h | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

 diff --git a/include/uapi/linux/virtio_pci.h b/include/uapi/linux/virtio_pci.h
 index ea66f3f..e5ec1ca 100644
 --- a/include/uapi/linux/virtio_pci.h
 +++ b/include/uapi/linux/virtio_pci.h
 @@ -80,7 +80,9 @@
  
  /* The remaining space is defined by each driver as the per-driver
   * configuration space */
 -#define VIRTIO_PCI_CONFIG(dev)   ((dev)-msix_enabled ? 24 : 20)
 +#define VIRTIO_PCI_CONFIG_OFF(msix_enabled)  ((msix_enabled) ? 24 : 20)
 +/* Deprecated: please use VIRTIO_PCI_CONFIG_OFF instead */
 +#define VIRTIO_PCI_CONFIG(dev)   
 VIRTIO_PCI_CONFIG_OFF((dev)-msix_enabled)
  
  /* Virtio ABI version, this must match exactly */
  #define VIRTIO_PCI_ABI_VERSION   0
 -- 
 MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_pci: fix macro exported in uapi

2013-05-19 Thread Rusty Russell
David Howells dhowe...@redhat.com writes:
 Rusty Russell ru...@rustcorp.com.au wrote:

 Macro still isn't usable, because userspace can't know whether it's the
 new or old.
 
 We need to either remove it from UAPI, or rename it to
 VIRTIO_PCI_CONFIG_OFF.

 Surely, if userspace is using it as is, you can't remove it, rename it or
 alter it?

The point of the patch is that it's unusable:

#define VIRTIO_PCI_CONFIG(dev)  ((dev)-msix_enabled ? 24 : 20)

ie. it's accessing a member of the kernel's virtio_pci_dev structure.

In theory, userspace could have a structure with the same field and be
using it now, but that's unlikely (qemu certainly doesn't).  No harm no
foul.

In theory it's actually a useful macro, so we could expose it to
userspace, but we'd need to have a new name so userspace can #ifdef for
older headers...

Hope that clarifies,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio_pci: fix macro exported in uapi

2013-05-16 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 macro VIRTIO_PCI_CONFIG except in the unlikely event userspace
 actually has a structure with a field named msix_enabled.
 Get the msix_enabled by value instead, to make it useful
 for userspace.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Macro still isn't usable, because userspace can't know whether it's the
new or old.

We need to either remove it from UAPI, or rename it to
VIRTIO_PCI_CONFIG_OFF.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost-scsi: Depend on NET for memcpy_fromiovec

2013-05-16 Thread Rusty Russell
Joe Perches j...@perches.com writes:
 On Thu, 2013-05-16 at 13:04 +0930, Rusty Russell wrote:
 Asias He as...@redhat.com writes:
  On Wed, May 15, 2013 at 02:47:53PM +0930, Rusty Russell wrote:
 []
  Other users are using memcpy_fromiovec and friends outside net. It seems
  a good idea to put it in a util library. e.g.  crypto/algif_skcipher.c
  which also depends on NET for it.

 []
 Subject: Hoist memcpy_fromiovec into lib/

 You'll need the friends memcpy_toiovec too.

 $ git grep -E \bmemcpy\w+iovec\w*
 crypto/algif_hash.c:err = memcpy_toiovec(msg-msg_iov, ctx-result, len);
 crypto/algif_skcipher.c:err = 
 memcpy_fromiovec(page_address(sg_page(sg)) +
 crypto/algif_skcipher.c:err = 
 memcpy_fromiovec(page_address(sg_page(sg + i)),
 drivers/dma/iovlock.c:#include net/tcp.h /* for memcpy_toiovec */
 drivers/dma/iovlock.c:  return memcpy_toiovec(iov, kdata, len);
 drivers/dma/iovlock.c:  err = memcpy_toiovec(iov, vaddr + offset, 
 len);
 drivers/isdn/mISDN/socket.c:if (memcpy_fromiovec(skb_put(skb, len), 
 msg-msg_iov, len)) {
 drivers/misc/vmw_vmci/vmci_queue_pair.c:err = 
 memcpy_fromiovec((u8 *)va + page_o
 drivers/misc/vmw_vmci/vmci_queue_pair.c:err = 
 memcpy_toiovec(iov, (u8 *)va + pag

Fascinating.  These all indirectly depend on NET, so there's no problem
at the moment.  But it is a bit weird...

crypto/algif_hash.c: depends on CRYPTO_USER_API_HASH - NET
crypto/algif_skcipher.c: depends on CRYPTO_USER_API_SKCIPHER - NET
drivers/dma/iovlock.c: depends on NET_DMA - NET
drivers/isdn/mISDN/socket.c: depends on MISDN - ISDN - NET
drivers/misc/vmw_vmci/vmci_queue_pair.c: depends on VMCI - NET

Patch welcome.

Meanwhile, to avoid more bikeshedding I've put the patch I posted with
all acks in my fixes branch.  One cycle through linux-next, then
straight to Linus.

Subject: Hoist memcpy_fromiovec into lib/

ERROR: memcpy_fromiovec [drivers/vhost/vhost_scsi.ko] undefined!

That function is only present with CONFIG_NET.  Turns out that
crypto/algif_skcipher.c also uses that outside net, but it actually
needs sockets anyway.

socket.h already include uio.h, so no callers need updating.

Reported-by: Randy Dunlap rdun...@infradead.org
Acked-by: David S. Miller da...@davemloft.net
Acked-by: Michael S. Tsirkin m...@redhat.com
Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 428c37a..7266775 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -305,7 +305,6 @@ struct ucred {
 
 extern void cred_to_ucred(struct pid *pid, const struct cred *cred, struct 
ucred *ucred);
 
-extern int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
 extern int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
   int offset, int len);
 extern int csum_partial_copy_fromiovecend(unsigned char *kdata, 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 629aaf5..21628d3 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -35,4 +35,6 @@ static inline size_t iov_length(const struct iovec *iov, 
unsigned long nr_segs)
 }
 
 unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to);
+
+int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index e9c52e1..2377211 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -9,7 +9,7 @@ endif
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-idr.o int_sqrt.o extable.o \
+idr.o int_sqrt.o extable.o iovec.o \
 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/iovec.c b/lib/iovec.c
new file mode 100644
index 000..632c5ea
--- /dev/null
+++ b/lib/iovec.c
@@ -0,0 +1,29 @@
+#include linux/uaccess.h
+#include linux/export.h
+#include linux/uio.h
+
+/*
+ * Copy iovec to kernel. Returns -EFAULT on error.
+ *
+ * Note: this modifies the original iovec.
+ */
+
+int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len)
+{
+   while (len  0) {
+   if (iov-iov_len) {
+   int copy = min_t(unsigned int, len, iov-iov_len);
+   if (copy_from_user(kdata, iov-iov_base, copy))
+   return -EFAULT;
+   len -= copy;
+   kdata += copy;
+   iov-iov_base += copy;
+   iov-iov_len -= copy;
+   }
+   iov++;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(memcpy_fromiovec);
+
diff --git a/net/core/iovec.c b/net/core/iovec.c
index 7e7aeb0..d81257f 100644
--- a/net/core/iovec.c
+++ b/net/core/iovec.c

Re: [PATCH] virtio_console: fix uapi header

2013-05-16 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 uapi should use __u32 not u32.
 Fix a macro in virtio_console.h which uses u32.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com

Ouch.  Added CC:stable, and put into fixes.  Mainly because it's
embarrassing :)

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost-scsi: Depend on NET for memcpy_fromiovec

2013-05-15 Thread Rusty Russell
Nicholas A. Bellinger n...@linux-iscsi.org writes:
 On Wed, 2013-05-15 at 14:47 +0930, Rusty Russell wrote:
 Asias He as...@redhat.com writes:
  scsi.c includes vhost.c which uses memcpy_fromiovec.
 
  This patch fixes this build failure.
 
 From Randy Dunlap:
 '''
 on x86_64:
 
 ERROR: memcpy_fromiovec [drivers/vhost/vhost_scsi.ko] undefined!
 
 It needs to depend on NET since net/core/ provides that function.
 '''
 
 Proper fix please.
 
 Though I can't see why you thought this was a good idea.  Nonetheless, I
 shan't highlight why: I have far too much respect for your intellects
 and abilities.
 
 No, don't thank me!

 Hi Rusty  Asias,

 I assume you mean something like the following patch to allow kbuild to
 work when VHOST_NET + VHOST_SCSI are both enabled and sharing vhost.o,
 yes..?

No, that's a separate issue.

memcpy_fromiovec() has nothing to do with networking: that was just the
first user.  Note that crypto/algif_skcipher.c also uses it.  The
obvious answer is to move it into lib/.

OTOH making vhost_scsi depend on CONFIG_NET is breathtakingly lazy.  I
expect better from experienced kernel hackers :(

Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vhost-scsi: Depend on NET for memcpy_fromiovec

2013-05-15 Thread Rusty Russell
Asias He as...@redhat.com writes:
 On Wed, May 15, 2013 at 02:47:53PM +0930, Rusty Russell wrote:
 Asias He as...@redhat.com writes:
  scsi.c includes vhost.c which uses memcpy_fromiovec.
 
  This patch fixes this build failure.
 
 From Randy Dunlap:
 '''
 on x86_64:
 
 ERROR: memcpy_fromiovec [drivers/vhost/vhost_scsi.ko] undefined!
 
 It needs to depend on NET since net/core/ provides that function.
 '''
 
 Proper fix please.

 --verbose please ;-)

 Making VHOST_SCSI depends on NET looks weird but this is because vhost
 core depends on it. A bunch of patches are cleaning this up. Since MST
 wanted do the vhost.ko split up in 3.11, plus your WIP vringh work, so I
 wanted the fix for 3.10 as minimum as possible.

If this isn't the only symbol causing the problem, then this should be
mentioned in the changelog.  If it is, it should be fixed: we have
plenty of time for that.

Either way, your commit message or the commit itself needs to justify
it!

 Other users are using memcpy_fromiovec and friends outside net. It seems
 a good idea to put it in a util library. e.g.  crypto/algif_skcipher.c
 which also depends on NET for it.

 Though I can't see why you thought this was a good idea.  Nonetheless, I
 shan't highlight why: I have far too much respect for your intellects
 and abilities.
 
 No, don't thank me!

 Interesting.

Heh... I originally wrote an explanation, then found it a bit insulting:
I knew I didn't need to tell you :)

How's this?

From: Rusty Russell ru...@rustcorp.com.au
Subject: Hoist memcpy_fromiovec into lib/

ERROR: memcpy_fromiovec [drivers/vhost/vhost_scsi.ko] undefined!

That function is only present with CONFIG_NET.  Turns out that
crypto/algif_skcipher.c also uses that outside net, but it actually
needs sockets anyway.

socket.h already include uio.h, so no callers need updating.

Reported-by: Randy Dunlap rdun...@infradead.org
Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 428c37a..7266775 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -305,7 +305,6 @@ struct ucred {
 
 extern void cred_to_ucred(struct pid *pid, const struct cred *cred, struct 
ucred *ucred);
 
-extern int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
 extern int memcpy_fromiovecend(unsigned char *kdata, const struct iovec *iov,
   int offset, int len);
 extern int csum_partial_copy_fromiovecend(unsigned char *kdata, 
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 629aaf5..21628d3 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -35,4 +35,6 @@ static inline size_t iov_length(const struct iovec *iov, 
unsigned long nr_segs)
 }
 
 unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to);
+
+int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len);
 #endif
diff --git a/lib/Makefile b/lib/Makefile
index e9c52e1..2377211 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -9,7 +9,7 @@ endif
 
 lib-y := ctype.o string.o vsprintf.o cmdline.o \
 rbtree.o radix-tree.o dump_stack.o timerqueue.o\
-idr.o int_sqrt.o extable.o \
+idr.o int_sqrt.o extable.o iovec.o \
 sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
 proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
diff --git a/lib/iovec.c b/lib/iovec.c
new file mode 100644
index 000..632c5ea
--- /dev/null
+++ b/lib/iovec.c
@@ -0,0 +1,29 @@
+#include linux/uaccess.h
+#include linux/export.h
+#include linux/uio.h
+
+/*
+ * Copy iovec to kernel. Returns -EFAULT on error.
+ *
+ * Note: this modifies the original iovec.
+ */
+
+int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len)
+{
+   while (len  0) {
+   if (iov-iov_len) {
+   int copy = min_t(unsigned int, len, iov-iov_len);
+   if (copy_from_user(kdata, iov-iov_base, copy))
+   return -EFAULT;
+   len -= copy;
+   kdata += copy;
+   iov-iov_base += copy;
+   iov-iov_len -= copy;
+   }
+   iov++;
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL(memcpy_fromiovec);
+
diff --git a/net/core/iovec.c b/net/core/iovec.c
index 7e7aeb0..d81257f 100644
--- a/net/core/iovec.c
+++ b/net/core/iovec.c
@@ -125,31 +125,6 @@ int memcpy_toiovecend(const struct iovec *iov, unsigned 
char *kdata,
 EXPORT_SYMBOL(memcpy_toiovecend);
 
 /*
- * Copy iovec to kernel. Returns -EFAULT on error.
- *
- * Note: this modifies the original iovec.
- */
-
-int memcpy_fromiovec(unsigned char *kdata, struct iovec *iov, int len)
-{
-   while (len  0) {
-   if (iov-iov_len) {
-   int copy = min_t(unsigned int, len, iov-iov_len

Re: [PATCH] vhost-scsi: Depend on NET for memcpy_fromiovec

2013-05-14 Thread Rusty Russell
Asias He as...@redhat.com writes:
 scsi.c includes vhost.c which uses memcpy_fromiovec.

 This patch fixes this build failure.

From Randy Dunlap:
'''
on x86_64:

ERROR: memcpy_fromiovec [drivers/vhost/vhost_scsi.ko] undefined!

It needs to depend on NET since net/core/ provides that function.
'''

Proper fix please.

Though I can't see why you thought this was a good idea.  Nonetheless, I
shan't highlight why: I have far too much respect for your intellects
and abilities.

No, don't thank me!
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] vhost cleanups and separate module

2013-05-12 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Tue, May 07, 2013 at 02:13:44PM +0930, Rusty Russell wrote:
 AFAICT we should always do zero copy.

 It seems not to be a win for small packets.
 I speculate the issue is that ring space isn't released as promptly.
 Further, we can't do it safely for guest to guest and guest to host.
 And if we try, net core just does a packet copy later (which is less
 efficient). So there's a hack in place to detect that and suppress zero
 copy.

AFAICT there are two places we should copy.  One is small packets:
latency plus refcount/callbacks aren't a win.  The other is weird
packets (eg. more than 1000 segements), which we currently drop.

 Though I do wonder if we should
 use a dedicated hook to get an skb into the tun driver and generate it
 ourselves, rather than going sg - iov - skb.
 
 Cheers,
 Rusty.

 I think we'd have to export two interfaces:
 - alloc_skb()
    add frags ...
 - send_skb

 the code to add frags could maybe use some
 library functions ...

I think we just need send_skb for the socket.  We can build the skb
ourselves.  But yes that frag-handling code should be factored out.

I'll see how I go.

Cheers,
Rusty.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] vhost cleanups and separate module

2013-05-07 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Mon, May 06, 2013 at 03:41:36PM +0930, Rusty Russell wrote:
 Asias He as...@redhat.com writes:
  Asias He (3):
vhost: Remove vhost_enable_zcopy in vhost.h
vhost: Move VHOST_NET_FEATURES to net.c
vhost: Make vhost a separate module
 
 I like these cleanups, MST pleasee apply.

 Absolutely. Except it's 3.11 material and I can only
 usefully create a -next branch once -rc1 is out.

 I have some other cleanups which are on hold for the moment pending
 MST's vhost_net simplification.  MST, how's that going?

 Not too well. The array of status bytes which was designed to complete
 packets in order turns out to be a very efficient datastructure:

 It gives us a way to signal completions that is completely lockless for
 multiple completers, and using the producer/consumer model saves extra
 scans for the common case.

 Overall I can save some memory and clean up some code but can't get rid
 of the producer/consumer idices (currently named upend/done indices)
 which is what you asked me to do.
 Your cleanups basically don't work with zcopy because they
 ignore the upend/done indices?
 Would you like to post them, noting they only work with zcopy off, and
 we'll look for a way to apply them, together?

Not quite; it's just that I don't understand that code.  It seemed to be
achieving something (ordered completion) which was entirely unnecessary,
so I went on with other things while you removed it.  Now that's not
possible, I'll revisit.

AFAICT we should always do zero copy.  Though I do wonder if we should
use a dedicated hook to get an skb into the tun driver and generate it
ourselves, rather than going sg - iov - skb.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] vhost cleanups and separate module

2013-05-06 Thread Rusty Russell
Asias He as...@redhat.com writes:
 Asias He (3):
   vhost: Remove vhost_enable_zcopy in vhost.h
   vhost: Move VHOST_NET_FEATURES to net.c
   vhost: Make vhost a separate module

I like these cleanups, MST pleasee apply.

I have some other cleanups which are on hold for the moment pending
MST's vhost_net simplification.  MST, how's that going?

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: suggesting wording fixes for virtio-spec 0.9.5

2013-04-28 Thread Rusty Russell
Laszlo Ersek ler...@redhat.com writes:
 On 04/23/13 06:05, Rusty Russell wrote:
 Laszlo Ersek ler...@redhat.com writes:
 Hi,

 (I'm not subscribed to either list,)

 using the word descriptor is misleading in the following sections:
 
 Yes, I like the use of 'descriptor chains'.  This is a definite
 improvement.
...
 Not sure if it's customary here or if you need it / want it, but anyway

 Reviewed-by: Laszlo Ersek ler...@redhat.com

Not needed, but always useful to ahve confirmation that I didn't
introduce another mistake while fixing my old ones!

 (Also I've fixed the OVMF driver; just reposting the patch today with a
 better commit message.)

 Thanks much!
 Laszlo

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] kvm tools: virtio-net mergable rx buffers

2013-04-28 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Wed, Apr 24, 2013 at 09:51:57AM +0300, Pekka Enberg wrote:
 Hi,
 
 On 04/23/2013 12:35 PM, Eric Northup wrote:
  Do you care about guests with drivers that don't negotiate
  VIRTIO_NET_F_MRG_RXBUF?
 
 On Wed, Apr 24, 2013 at 5:51 AM, Sasha Levin sasha.le...@oracle.com wrote:
  We usually try to keep backward compatibility, but in this case
  mergable RX buffers are about 5 years old now, so it's safe to
  assume they'll be running in any guest.
 
  Unless there is a specific reason to allow working without them
  I'd rather keep the code simple in this case.
 
 Are there such guests around? What's the failure scenario for them
 after this patch?
 
 Pekka

 Warning: have not looked at the patch, just a general comment.

 I think it's reasonable to assume embedded guests such as PXE won't
 negotiate any features.  And, running old guests is one of the reasons
 people use virtualization at all. So 5 years is not a lot.

 In any case, stick to the device spec please, if you want it changed
 please send a spec patch, don't deviate from it randomly.

Supporting old guests is an quality of implementation issue.  It's like
any ABI: if noone will notice, you can remove stuff.

But the case of I can receive GSO packets but I don't support mergeable
buffers is a trivial one: you can support it by pretending the guest
can't handle GSO :)

If you want to support non-Linux guests (eg. bootloaders), you probably
want to keep support for very dumb drivers with no mergable rxbufs
though.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] virtio-net: fill only rx queues which are being used

2013-04-28 Thread Rusty Russell
Sasha Levin sasha.le...@oracle.com writes:
 Due to MQ support we may allocate a whole bunch of rx queues but
 never use them. With this patch we'll safe the space used by
 the receive buffers until they are actually in use:

Thanks, applied!

Cheers,
Rusty.


 sh-4.2# free -h
  total   used   free sharedbuffers cached
 Mem:  490M35M   455M 0B 0B   4.1M
 -/+ buffers/cache:31M   459M
 Swap:   0B 0B 0B
 sh-4.2# ethtool -L eth0 combined 8
 sh-4.2# free -h
  total   used   free sharedbuffers cached
 Mem:  490M   162M   327M 0B 0B   4.1M
 -/+ buffers/cache:   158M   331M
 Swap:   0B 0B 0B

 Signed-off-by: Sasha Levin sasha.le...@oracle.com
 ---
  drivers/net/virtio_net.c | 15 ++-
  1 file changed, 10 insertions(+), 5 deletions(-)

 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 6bfc511..196e721 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -581,7 +581,7 @@ static void refill_work(struct work_struct *work)
   bool still_empty;
   int i;
  
 - for (i = 0; i  vi-max_queue_pairs; i++) {
 + for (i = 0; i  vi-curr_queue_pairs; i++) {
   struct receive_queue *rq = vi-rq[i];
  
   napi_disable(rq-napi);
 @@ -636,7 +636,7 @@ static int virtnet_open(struct net_device *dev)
   struct virtnet_info *vi = netdev_priv(dev);
   int i;
  
 - for (i = 0; i  vi-max_queue_pairs; i++) {
 + for (i = 0; i  vi-curr_queue_pairs; i++) {
   /* Make sure we have some buffers: if oom use wq. */
   if (!try_fill_recv(vi-rq[i], GFP_KERNEL))
   schedule_delayed_work(vi-refill, 0);
 @@ -900,6 +900,7 @@ static int virtnet_set_queues(struct virtnet_info *vi, 
 u16 queue_pairs)
   struct scatterlist sg;
   struct virtio_net_ctrl_mq s;
   struct net_device *dev = vi-dev;
 + int i;
  
   if (!vi-has_cvq || !virtio_has_feature(vi-vdev, VIRTIO_NET_F_MQ))
   return 0;
 @@ -912,8 +913,12 @@ static int virtnet_set_queues(struct virtnet_info *vi, 
 u16 queue_pairs)
   dev_warn(dev-dev, Fail to set num of queue pairs to %d\n,
queue_pairs);
   return -EINVAL;
 - } else
 + } else {
 + for (i = vi-curr_queue_pairs; i  queue_pairs; i++)
 + if (!try_fill_recv(vi-rq[i], GFP_KERNEL))
 + schedule_delayed_work(vi-refill, 0);
   vi-curr_queue_pairs = queue_pairs;
 + }
  
   return 0;
  }
 @@ -1568,7 +1573,7 @@ static int virtnet_probe(struct virtio_device *vdev)
   }
  
   /* Last of all, set up some receive buffers. */
 - for (i = 0; i  vi-max_queue_pairs; i++) {
 + for (i = 0; i  vi-curr_queue_pairs; i++) {
   try_fill_recv(vi-rq[i], GFP_KERNEL);
  
   /* If we didn't even get one input buffer, we're useless. */
 @@ -1692,7 +1697,7 @@ static int virtnet_restore(struct virtio_device *vdev)
  
   netif_device_attach(vi-dev);
  
 - for (i = 0; i  vi-max_queue_pairs; i++)
 + for (i = 0; i  vi-curr_queue_pairs; i++)
   if (!try_fill_recv(vi-rq[i], GFP_KERNEL))
   schedule_delayed_work(vi-refill, 0);
  
 -- 
 1.8.2.1
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-net: fill only rx queues which are being used

2013-04-22 Thread Rusty Russell
Sasha Levin sasha.le...@oracle.com writes:
 Due to MQ support we may allocate a whole bunch of rx queues but
 never use them. With this patch we'll safe the space used by
 the receive buffers until they are actually in use:

Idea is good, implementation needs a tiny tweak:

 @@ -912,8 +913,13 @@ static int virtnet_set_queues(struct virtnet_info *vi, 
 u16 queue_pairs)
   dev_warn(dev-dev, Fail to set num of queue pairs to %d\n,
queue_pairs);
   return -EINVAL;
 - } else
 + } else {
 + if (queue_pairs  vi-curr_queue_pairs)
 + for (i = 0; i  queue_pairs; i++)
 + if (!try_fill_recv(vi-rq[i], GFP_KERNEL))
 + schedule_delayed_work(vi-refill, 0);
   vi-curr_queue_pairs = queue_pairs;
 + }
  
   return 0;
  }

You don't want to refill existing queues, so you don't need the if.

for (i = vi-curr_queue_pairs; i  queue_pairs; i++) {
if (!try_fill_recv(vi-rq[i], GFP_KERNEL))
schedule_delayed_work(vi-refill, 0);

We don't free up buffers when we're reducing queues, but I consider that
a corner case.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 3/3] tcm_vhost: Add hotplug/hotunplug support

2013-04-22 Thread Rusty Russell
Asias He as...@redhat.com writes:
 On Mon, Apr 22, 2013 at 04:17:04PM +0300, Michael S. Tsirkin wrote:
+  evt = kzalloc(sizeof(*evt), GFP_KERNEL);
   
   I think kzalloc not needed here, you init all fields.
  
  Not really! evt-event.lun[4-7] is not initialized. It needs to be 
  0.
 
 So that is 4 bytes just init them when you set rest of lun.

It is not in the fast path. You can do it this way but not a must.
   
   I think that's a bit cleaner than relying on kzalloc to zero-initialize.
  
  I think it is a bit shorter, 4 lines saved. 
 
 OTOH you don't have to think what about the rest of the bytes? then
 hunt through the code and go ah, it's kzalloc.
 Looks it's a style issue but I prefer explicit initialization.

 It is just make none sense to argue about this. If you really want you
 can make patches to remove all the lines where use kzalloc because it is
 cleaner.

I prefer explicit initialization too, FWIW.

+static void tcm_vhost_do_evt_work(struct vhost_scsi *vs,
+  struct tcm_vhost_evt *evt)
+{
+  struct vhost_virtqueue *vq = 
vs-vqs[VHOST_SCSI_VQ_EVT];
+  struct virtio_scsi_event *event = evt-event;
+  struct virtio_scsi_event __user *eventp;
+  unsigned out, in;
+  int head, ret;
+
+  if (!tcm_vhost_check_endpoint(vq))
+  return;
+
+  mutex_lock(vs-vs_events_lock);
+  mutex_lock(vq-mutex);
+again:
+  vhost_disable_notify(vs-dev, vq);
+  head = vhost_get_vq_desc(vs-dev, vq, vq-iov,
+  ARRAY_SIZE(vq-iov), out, in,
+  NULL, NULL);
+  if (head  0) {
+  vs-vs_events_dropped = true;
+  goto out;
+  }
+  if (head == vq-num) {
+  if (vhost_enable_notify(vs-dev, vq))
+  goto again;
   
   Please write loops with while() or for().
   Not with goto. goto is for error handling.
  
  This makes extra indention which is more ugly.
 
 I don't care. No loops with goto and that's a hard rule.

It is not a loop.
   
   If same code repeats, it's a loop.
  
  We really need here is to call vhost_get_vq_desc once. It's not a loop
  like other places to use while() or for() with vhost_get_vq_desc.
 
 Exactly. Like a mutex for example? We only need to
 lock it once. See __mutex_lock_common - uses a for loop,
 doesn't it?

 So what?

 So fix this please, use while or for or even until.

 I do not think it is necessary.

I tend to use again: in such corner cases, where I don't expect it to
be invoked very often.  A do loop looks too normal.

I do exactly the same thing in virtio-net.c, for the same case.

Would you like me to review the entire thing?

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: suggesting wording fixes for virtio-spec 0.9.5

2013-04-22 Thread Rusty Russell
Laszlo Ersek ler...@redhat.com writes:
 Hi,

 (I'm not subscribed to either list,)

 using the word descriptor is misleading in the following sections:

Yes, I like the use of 'descriptor chains'.  This is a definite
improvement.

Here's the diff I ended up with (massaged to minimize it).

Thanks!
Rusty.

--- virtio-spec.txt-old 2013-04-23 13:22:21.339158214 +0930
+++ virtio-spec.txt 2013-04-23 13:34:14.055176464 +0930
@@ -482,10 +482,10 @@
 
 2.3.4 Available Ring
 
-The available ring refers to what descriptors we are offering the 
-device: it refers to the head of a descriptor chain. The “flags” 
+The available ring refers to what descriptor chains we are offering the
+device: each entry refers to the head of a descriptor chain. The “flags”
 field is currently 0 or 1: 1 indicating that we do not need an 
-interrupt when the device consumes a descriptor from the 
+interrupt when the device consumes a descriptor chain from the
 available ring. Alternatively, the guest can ask the device to 
 delay interrupts until an entry with an index specified by the “
 used_event” field is written in the used ring (equivalently, 
@@ -671,16 +671,16 @@
 
 avail-ring[avail-idx % qsz] = head;
 
-However, in general we can add many descriptors before we update 
-the “idx” field (at which point they become visible to the 
-device), so we keep a counter of how many we've added:
+However, in general we can add many separate descriptor chains before we update
+the “idx” field (at which point they become visible to the device),
+so we keep a counter of how many we've added:
 
 avail-ring[(avail-idx + added++) % qsz] = head;
 
 2.4.1.3 Updating The Index Field
 
 Once the idx field of the virtqueue is updated, the device will 
-be able to access the descriptor entries we've created and the 
+be able to access the descriptor chains we've created and the 
 memory they refer to. This is why a memory barrier is generally 
 used before the idx update, to ensure it sees the most up-to-date 
 copy.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] virtio-spec: document virtio-9p

2013-04-15 Thread Rusty Russell
Sasha Levin sasha.le...@oracle.com writes:
 Add basic documentation for virtio-9p. I can expand more on device operation,
 but I don't think there's anything significant enough for the spec to be
 mentioned there. Please let me know if I'm wrong.

 Signed-off-by: Sasha Levin sasha.le...@oracle.com

Hmm, I guess so.  I was happy to leave it undocumented.

Applied,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-net mq vq initialization

2013-04-14 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 On Sun, Apr 14, 2013 at 11:59:42AM -0400, Sasha Levin wrote:
 On 04/14/2013 11:53 AM, Michael S. Tsirkin wrote:
   
   Initializing them only when they're actually needed will do the trick 
   here.
  Not initializing, adding the buffers. In the current spec, initialization
  is always done before DRIVER_OK.
 
 Yeah, that's better, but we're going to need a spec change either way since 
 even
 adding the buffers is specifically stated to happen before DRIVER_OK:
 
 
  6. The receive virtqueues should be filled with receive buffers. This 
 is described
 in detail below in “Setting Up Receive Buffers”.
 
 
 Thanks,
 Sasha

 Yes but that's a minor point, specific to virio-net.
 The main point was to ensure that VQs should have buffers before they get
 any packets.

 OTOH VQ init bere DRIVER_OK is a generic virtio thing: DRIVER_OK means
 we can start sending interrupts and we expect driver to be ready to get
 them.

 Let's solve the real problem with buffers, a couple of pages per VQ
 shouldn't be an issue.

Yeah, let's not populate the extra virtqueues until someone turns
multiqueue on.

I appended to the last paragraph under Setting Up Receive Buffers:

If VIRTIO_NET_F_MQ is negotiated, each of receiveq0...receiveqN that
will be used should be populated with receive buffers

Added: 
   before multiqueue is activated (See [sub:Automatic-receive-steering]).

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: virtio-net mq vq initialization (was: [PATCH 0/5] Usual batch of random ARM fixes for kvmtool)

2013-04-12 Thread Rusty Russell
Sasha Levin sasha.le...@oracle.com writes:
 On 04/11/2013 12:36 PM, Will Deacon wrote:
 Hello folks,
 
 Here's the latest round of ARM fixes and updates for kvmtool. Most of
 this is confined to the arm/ subdirectory, with the exception of a fix
 to the virtio-mmio vq definitions due to the multi-queue work from
 Sasha. I'm not terribly happy about that code though, since it seriously
 increases the memory footprint of the guest.
 
 Without multi-queue, we can boot Debian Wheezy to a prompt in 38MB. With
 the new changes, that increases to 170MB! Any chance we can try and tackle
 this regression please? I keep getting bitten by the OOM killer :(

 (cc Rusty, MST)

 The spec defines the operation of a virtio-net device with regards to multiple
 queues as follows:

 
 Device Initialization

   1. The initialization routine should identify the receive and 
 transmission
 virtqueues, up to N+1 of each kind. If VIRTIO_NET_F_MQ feature
 bit is negotiated, N=max_virtqueue_pairs-1, otherwise identify N=0.

   [...]

   5. Only receiveq0, transmitq0 and controlq are used by default. To use 
 more
 queues driver must negotiate the VIRTIO_NET_F_MQ feature; initialize
 up to max_virtqueue_pairs of each of transmit and receive queues; execute_
 VIRTIO_NET_CTRL_MQ_VQ_PAIRS_SET command specifying
 the number of the transmit and receive queues that is going to be
 used and wait until the device consumes the controlq buffer and acks this
 command.
 

 And kvmtool follows that to the letter: It will initialize the maximum amount 
 of
 queues it can support during initialization and will start using them only 
 when
 the device tells it it should use them.

 As Will has stated, this causes a memory issue since all the data structures 
 that hold
 all possible queues get initialized regardless of whether we actually need 
 them or not,
 which is quite troublesome for systems with small RAM.


 Rusty, MST, would you be open to a spec and code change that would initialize 
 the
 RX/TX vqs on demand instead of on device initialization? Or is there an 
 easier way
 to work around this issue?

I'm confused.  kvmtool is using too much memory, or the guest?  If
kvmtool, the Device Initialization section above applies to the driver,
not the device.  If the guest, well, the language says UP TO N+1.  You
want a small guest, don't use them all.  Or any...

What am I missing?
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V7 0/5] virtio-scsi multiqueue

2013-04-08 Thread Rusty Russell
Asias He as...@redhat.com writes:
 On Sat, Apr 06, 2013 at 09:40:13AM +0100, James Bottomley wrote:
 Well, I haven't had time to look at anything other than the patch I
 commented on.  I'm happy with your fix, so you can add my acked by to
 that one.  Since it's going through the virtio tree, don't wait for me,
 put it in and I'll make you fix up anything I find later that I'm
 unhappy with.

 So, Rusty, could you pick this up in your virtio-next tree?

Done!

Thanks,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 2/2] tcm_vhost: Use vq-private_data to indicate if the endpoint is setup

2013-04-01 Thread Rusty Russell
Michael S. Tsirkin m...@redhat.com writes:
 Rusty's currently doing some reorgs of -net let's delay
 cleanups there to avoid stepping on each other's toys.
 Let's focus on scsi here.
 E.g. any chance framing assumptions can be fixed in 3.10?

I am waiting for your removal of the dma-compelete ordering stuff in
vhost-net.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   3   4   5   6   7   >