Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support

2012-12-19 Thread Stefan Hajnoczi
On Tue, Dec 18, 2012 at 01:32:52PM +0100, Paolo Bonzini wrote:
  struct virtio_scsi_target_state {
 - /* Never held at the same time as vq_lock.  */
 + /* This spinlock ever held at the same time as vq_lock.  */

s/ever/is never/
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support

2012-12-18 Thread Michael S. Tsirkin
On Tue, Dec 18, 2012 at 01:32:52PM +0100, Paolo Bonzini wrote:
 This patch adds queue steering to virtio-scsi.  When a target is sent
 multiple requests, we always drive them to the same queue so that FIFO
 processing order is kept.  However, if a target was idle, we can choose
 a queue arbitrarily.  In this case the queue is chosen according to the
 current VCPU, so the driver expects the number of request queues to be
 equal to the number of VCPUs.  This makes it easy and fast to select
 the queue, and also lets the driver optimize the IRQ affinity for the
 virtqueues (each virtqueue's affinity is set to the CPU that owns
 the queue).
 
 The speedup comes from improving cache locality and giving CPU affinity
 to the virtqueues, which is why this scheme was selected.  Assuming that
 the thread that is sending requests to the device is I/O-bound, it is
 likely to be sleeping at the time the ISR is executed, and thus executing
 the ISR on the same processor that sent the requests is cheap.
 
 However, the kernel will not execute the ISR on the best processor
 unless you explicitly set the affinity.  This is because in practice
 you will have many such I/O-bound processes and thus many otherwise
 idle processors.  Then the kernel will execute the ISR on a random
 processor, rather than the one that is sending requests to the device.
 
 The alternative to per-CPU virtqueues is per-target virtqueues.  To
 achieve the same locality, we could dynamically choose the virtqueue's
 affinity based on the CPU of the last task that sent a request.  This
 is less appealing because we do not set the affinity directly---we only
 provide a hint to the irqbalanced running in userspace.  Dynamically
 changing the affinity only works if the userspace applies the hint
 fast enough.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
   v1->v2: improved comments and commit messages, added memory barriers
 
  drivers/scsi/virtio_scsi.c |  234 +--
  1 files changed, 201 insertions(+), 33 deletions(-)
 
 diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
 index 4f6c6a3..ca9d29d 100644
 --- a/drivers/scsi/virtio_scsi.c
 +++ b/drivers/scsi/virtio_scsi.c
 @@ -26,6 +26,7 @@
  
  #define VIRTIO_SCSI_MEMPOOL_SZ 64
  #define VIRTIO_SCSI_EVENT_LEN 8
 +#define VIRTIO_SCSI_VQ_BASE 2
  
  /* Command queue element */
  struct virtio_scsi_cmd {
 @@ -57,24 +58,57 @@ struct virtio_scsi_vq {
   struct virtqueue *vq;
  };
  
 -/* Per-target queue state */
 +/*
 + * Per-target queue state.
 + *
 + * This struct holds the data needed by the queue steering policy.  When a
 + * target is sent multiple requests, we need to drive them to the same queue so
 + * that FIFO processing order is kept.  However, if a target was idle, we can
 + * choose a queue arbitrarily.  In this case the queue is chosen according to
 + * the current VCPU, so the driver expects the number of request queues to be
 + * equal to the number of VCPUs.  This makes it easy and fast to select the
 + * queue, and also lets the driver optimize the IRQ affinity for the virtqueues
 + * (each virtqueue's affinity is set to the CPU that owns the queue).
 + *
 + * An interesting effect of this policy is that only writes to req_vq need to
 + * take the tgt_lock.  Read can be done outside the lock because:
 + *
 + * - writes of req_vq only occur when atomic_inc_return(&tgt->reqs) returns 1.
 + *   In that case, no other CPU is reading req_vq: even if they were in
 + *   virtscsi_queuecommand_multi, they would be spinning on tgt_lock.
 + *
 + * - reads of req_vq only occur when the target is not idle (reqs != 0).
 + *   A CPU that enters virtscsi_queuecommand_multi will not modify req_vq.
 + *
 + * Similarly, decrements of reqs are never concurrent with writes of req_vq.
 + * Thus they can happen outside the tgt_lock, provided of course we make reqs
 + * an atomic_t.
 + */
  struct virtio_scsi_target_state {
 - /* Never held at the same time as vq_lock.  */
 + /* This spinlock ever held at the same time as vq_lock.  */
   spinlock_t tgt_lock;
 +
 + /* Count of outstanding requests.  */
 + atomic_t reqs;
 +
 + /* Currently active virtqueue for requests sent to this target.  */
 + struct virtio_scsi_vq *req_vq;
  };
  
  /* Driver instance state */
  struct virtio_scsi {
   struct virtio_device *vdev;
  
 - struct virtio_scsi_vq ctrl_vq;
 - struct virtio_scsi_vq event_vq;
 - struct virtio_scsi_vq req_vq;
 -
   /* Get some buffers ready for event vq */
   struct virtio_scsi_event_node event_list[VIRTIO_SCSI_EVENT_LEN];
  
   struct virtio_scsi_target_state *tgt;
 +
 + u32 num_queues;
 +
 + struct virtio_scsi_vq ctrl_vq;
 + struct virtio_scsi_vq event_vq;
 + struct virtio_scsi_vq req_vqs[];
  };
  
  static struct kmem_cache *virtscsi_cmd_cache;
 @@ -109,6 +143,7 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
   struct 

Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support

2012-12-18 Thread Paolo Bonzini
Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
 -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
 + struct virtio_scsi_target_state *tgt,
 + struct scsi_cmnd *sc)
  {
 -struct virtio_scsi *vscsi = shost_priv(sh);
 -struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
  struct virtio_scsi_cmd *cmd;
 +struct virtio_scsi_vq *req_vq;
  int ret;
  
  struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
 @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
  BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
  memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
  
 -if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
 +req_vq = ACCESS_ONCE(tgt->req_vq);
 
 This ACCESS_ONCE without a barrier looks strange to me.
 Can req_vq change? Needs a comment.

Barriers are needed to order two things.  Here I don't have the second thing
to order against, hence no barrier.

Accessing req_vq lockless is safe, and there's a comment about it, but you
still want ACCESS_ONCE to ensure the compiler doesn't play tricks.  It
shouldn't be necessary, because the critical section of
virtscsi_queuecommand_multi will already include the appropriate
compiler barriers, but it is actually clearer this way to me. :)
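
For readers outside the kernel tree, here is a minimal userspace sketch of what a single forced read looks like. The macro body mirrors the usual volatile-cast idiom; the struct names and the helper queue_id_of() are hypothetical stand-ins, not code from the patch, and __typeof__ assumes GCC or Clang.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel macro: force exactly one load of x. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical, simplified stand-ins for the driver structures. */
struct vq {
	int id;
};

struct target {
	struct vq *req_vq;
};

/*
 * Load tgt->req_vq once into a local and use only the local afterwards,
 * so the compiler cannot re-read the field between the check and the use.
 */
static int queue_id_of(struct target *tgt)
{
	struct vq *req_vq = ACCESS_ONCE(tgt->req_vq);

	return req_vq ? req_vq->id : -1;
}
```

The volatile access only constrains the compiler; it does not order the load against other memory accesses on the CPU.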

 +if (virtscsi_kick_cmd(tgt, req_vq, cmd,
sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
GFP_ATOMIC) == 0)
  ret = 0;
 @@ -472,6 +545,48 @@ out:
  return ret;
  }
  
 +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
 +struct scsi_cmnd *sc)
 +{
 +struct virtio_scsi *vscsi = shost_priv(sh);
 +struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 +
 +atomic_inc(&tgt->reqs);
 
 And here we don't have barrier after atomic? Why? Needs a comment.

Because we don't write req_vq, so there's no two writes to order.  Barrier
against what?

 +return virtscsi_queuecommand(vscsi, tgt, sc);
 +}
 +
 +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
 +   struct scsi_cmnd *sc)
 +{
 +struct virtio_scsi *vscsi = shost_priv(sh);
 +struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 +unsigned long flags;
 +u32 queue_num;
 +
 +/*
 + * Using an atomic_t for tgt->reqs lets the virtqueue handler
 + * decrement it without taking the spinlock.
 + *
 + * We still need a critical section to prevent concurrent submissions
 + * from picking two different req_vqs.
 + */
 +spin_lock_irqsave(&tgt->tgt_lock, flags);
 +if (atomic_inc_return(&tgt->reqs) == 1) {
 +queue_num = smp_processor_id();
 +while (unlikely(queue_num >= vscsi->num_queues))
 +queue_num -= vscsi->num_queues;
 +
 +/*
 + * Write reqs before writing req_vq, matching the
 + * smp_read_barrier_depends() in virtscsi_req_done.
 + */
 +smp_wmb();
 +tgt->req_vq = &vscsi->req_vqs[queue_num];
 +}
 +spin_unlock_irqrestore(&tgt->tgt_lock, flags);
 +return virtscsi_queuecommand(vscsi, tgt, sc);
 +}
 +
  static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
  {
  DECLARE_COMPLETION_ONSTACK(comp);
 @@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
  return virtscsi_tmf(vscsi, cmd);
  }
  
 -static struct scsi_host_template virtscsi_host_template = {
 +static struct scsi_host_template virtscsi_host_template_single = {
  .module = THIS_MODULE,
  .name = "Virtio SCSI HBA",
  .proc_name = "virtio_scsi",
 -.queuecommand = virtscsi_queuecommand,
  .this_id = -1,
 +.queuecommand = virtscsi_queuecommand_single,
 +.eh_abort_handler = virtscsi_abort,
 +.eh_device_reset_handler = virtscsi_device_reset,
 +
 +.can_queue = 1024,
 +.dma_boundary = UINT_MAX,
 +.use_clustering = ENABLE_CLUSTERING,
 +};
 +
 +static struct scsi_host_template virtscsi_host_template_multi = {
 +.module = THIS_MODULE,
 +.name = "Virtio SCSI HBA",
 +.proc_name = "virtio_scsi",
 +.this_id = -1,
 +.queuecommand = virtscsi_queuecommand_multi,
  .eh_abort_handler = virtscsi_abort,
  .eh_device_reset_handler = virtscsi_device_reset,
  
 @@ -572,16 +701,27 @@ static struct scsi_host_template virtscsi_host_template = {
&__val, sizeof(__val)); \
  })
  
 +
  static void virtscsi_init_vq(struct virtio_scsi_vq *virtscsi_vq,
 - struct virtqueue *vq)
 + struct virtqueue *vq, bool affinity)
  {
  spin_lock_init(&virtscsi_vq->vq_lock);
  virtscsi_vq->vq = vq;
 +if (affinity)
 +virtqueue_set_affinity(vq, vq->index - VIRTIO_SCSI_VQ_BASE);
 
 I've been 

Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support

2012-12-18 Thread Michael S. Tsirkin
On Tue, Dec 18, 2012 at 03:08:08PM +0100, Paolo Bonzini wrote:
 Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
  -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
  +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
  +   struct virtio_scsi_target_state *tgt,
  +   struct scsi_cmnd *sc)
   {
  -  struct virtio_scsi *vscsi = shost_priv(sh);
  -  struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 struct virtio_scsi_cmd *cmd;
  +  struct virtio_scsi_vq *req_vq;
 int ret;
   
 struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
  @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
 memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
   
  -  if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
  +  req_vq = ACCESS_ONCE(tgt->req_vq);
  
  This ACCESS_ONCE without a barrier looks strange to me.
  Can req_vq change? Needs a comment.
 
 Barriers are needed to order two things.  Here I don't have the second thing
 to order against, hence no barrier.
 
 Accessing req_vq lockless is safe, and there's a comment about it, but you
 still want ACCESS_ONCE to ensure the compiler doesn't play tricks.

That's just it.
Why don't you want compiler to play tricks?

ACCESS_ONCE is needed if the value can change
while you access it, this helps ensure
a consistent value is evaluated.

If it can you almost always need a barrier. If it doesn't
you don't need ACCESS_ONCE.

  It
 shouldn't be necessary, because the critical section of
 virtscsi_queuecommand_multi will already include the appropriate
 compiler barriers,

So if there's a barrier then pls add a comment saying where
it is.

 but it is actually clearer this way to me. :)

No barriers are needed, I think, because
when you queue a command reqs is incremented, so req_vq
cannot change. But this also means ACCESS_ONCE
is not needed either.
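
The invariant argued here (req_vq is only rewritten when the target goes from idle to busy) can be modelled in userspace roughly as follows. This is a sketch under assumptions, not the driver code: NUM_QUEUES, struct target, target_init(), submit() and request_done() are all hypothetical names, a C11 atomic_flag stands in for tgt_lock, and the caller passes the CPU number explicitly.

```c
#include <stdatomic.h>

#define NUM_QUEUES 4	/* hypothetical number of request queues */

/* Hypothetical, simplified model of the per-target state. */
struct target {
	atomic_flag lock;	/* plays the role of tgt_lock */
	atomic_int reqs;	/* count of outstanding requests */
	int req_vq;		/* index of the currently active queue */
};

static void target_init(struct target *tgt)
{
	atomic_flag_clear(&tgt->lock);
	atomic_init(&tgt->reqs, 0);
	tgt->req_vq = -1;
}

/* Submit one request from `cpu`; returns the queue it is steered to. */
static int submit(struct target *tgt, int cpu)
{
	while (atomic_flag_test_and_set_explicit(&tgt->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */

	/* Only a 0 -> 1 transition (idle target) may pick a new queue. */
	if (atomic_fetch_add(&tgt->reqs, 1) == 0)
		tgt->req_vq = cpu % NUM_QUEUES;

	atomic_flag_clear_explicit(&tgt->lock, memory_order_release);

	/* reqs != 0 from here until completion, so req_vq is stable. */
	return tgt->req_vq;
}

/* Completion path: decrements the counter without taking the lock. */
static void request_done(struct target *tgt)
{
	atomic_fetch_sub(&tgt->reqs, 1);
}
```

In this model a second submission from another CPU keeps using the queue picked by the first one until every outstanding request has completed.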

  +  if (virtscsi_kick_cmd(tgt, req_vq, cmd,
   sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
   GFP_ATOMIC) == 0)
 ret = 0;
  @@ -472,6 +545,48 @@ out:
 return ret;
   }
   
  +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
  +  struct scsi_cmnd *sc)
  +{
  +  struct virtio_scsi *vscsi = shost_priv(sh);
  +  struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
  +
  +  atomic_inc(&tgt->reqs);
  
  And here we don't have barrier after atomic? Why? Needs a comment.
 
 Because we don't write req_vq, so there's no two writes to order.  Barrier
 against what?

Between atomic update and command. Once you queue command it
can complete and decrement reqs, if this happens before
increment reqs can become negative even.

  +  return virtscsi_queuecommand(vscsi, tgt, sc);
  +}
  +
  +static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
  + struct scsi_cmnd *sc)
  +{
  +  struct virtio_scsi *vscsi = shost_priv(sh);
  +  struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
  +  unsigned long flags;
  +  u32 queue_num;
  +
  +  /*
  +   * Using an atomic_t for tgt->reqs lets the virtqueue handler
  +   * decrement it without taking the spinlock.
  +   *
  +   * We still need a critical section to prevent concurrent submissions
  +   * from picking two different req_vqs.
  +   */
  +  spin_lock_irqsave(&tgt->tgt_lock, flags);
  +  if (atomic_inc_return(&tgt->reqs) == 1) {
  +  queue_num = smp_processor_id();
  +  while (unlikely(queue_num >= vscsi->num_queues))
  +  queue_num -= vscsi->num_queues;
  +
  +  /*
  +   * Write reqs before writing req_vq, matching the
  +   * smp_read_barrier_depends() in virtscsi_req_done.
  +   */
  +  smp_wmb();
  +  tgt->req_vq = &vscsi->req_vqs[queue_num];
  +  }
  +  spin_unlock_irqrestore(&tgt->tgt_lock, flags);
  +  return virtscsi_queuecommand(vscsi, tgt, sc);
  +}
  +
   static int virtscsi_tmf(struct virtio_scsi *vscsi, struct virtio_scsi_cmd *cmd)
   {
 DECLARE_COMPLETION_ONSTACK(comp);
  @@ -541,12 +656,26 @@ static int virtscsi_abort(struct scsi_cmnd *sc)
 return virtscsi_tmf(vscsi, cmd);
   }
   
  -static struct scsi_host_template virtscsi_host_template = {
  +static struct scsi_host_template virtscsi_host_template_single = {
 .module = THIS_MODULE,
 .name = "Virtio SCSI HBA",
 .proc_name = "virtio_scsi",
  -  .queuecommand = virtscsi_queuecommand,
 .this_id = -1,
  +  .queuecommand = virtscsi_queuecommand_single,
  +  .eh_abort_handler = virtscsi_abort,
  +  .eh_device_reset_handler = virtscsi_device_reset,
  +
  +  .can_queue = 1024,
  +  .dma_boundary = UINT_MAX,
  +  .use_clustering = ENABLE_CLUSTERING,
  +};
  +
  +static struct scsi_host_template virtscsi_host_template_multi = {
  +  .module = THIS_MODULE,
  +  .name = "Virtio SCSI HBA",
  +  .proc_name 

Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support

2012-12-18 Thread Paolo Bonzini
Il 18/12/2012 16:03, Michael S. Tsirkin ha scritto:
 On Tue, Dec 18, 2012 at 03:08:08PM +0100, Paolo Bonzini wrote:
 Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
 -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
 +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
 +   struct virtio_scsi_target_state *tgt,
 +   struct scsi_cmnd *sc)
  {
 -  struct virtio_scsi *vscsi = shost_priv(sh);
 -  struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
struct virtio_scsi_cmd *cmd;
 +  struct virtio_scsi_vq *req_vq;
int ret;
  
struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
 @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
  
 -  if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
 +  req_vq = ACCESS_ONCE(tgt->req_vq);

 This ACCESS_ONCE without a barrier looks strange to me.
 Can req_vq change? Needs a comment.

 Barriers are needed to order two things.  Here I don't have the second thing
 to order against, hence no barrier.

 Accessing req_vq lockless is safe, and there's a comment about it, but you
 still want ACCESS_ONCE to ensure the compiler doesn't play tricks.
 
 That's just it.
 Why don't you want compiler to play tricks?

Because I want the lockless access to occur exactly when I write it.
Otherwise I have one more thing to think about, i.e. what a crazy
compiler writer could do with my code.  And having been on the other
side of the trench, compiler writers can have *really* crazy ideas.

Anyhow, I'll reorganize the code to move the ACCESS_ONCE closer to the
write and make it clearer.

 +  if (virtscsi_kick_cmd(tgt, req_vq, cmd,
  sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
  GFP_ATOMIC) == 0)
ret = 0;
 @@ -472,6 +545,48 @@ out:
return ret;
  }
  
 +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
 +  struct scsi_cmnd *sc)
 +{
 +  struct virtio_scsi *vscsi = shost_priv(sh);
 +  struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
 +
 +  atomic_inc(&tgt->reqs);

 And here we don't have barrier after atomic? Why? Needs a comment.

 Because we don't write req_vq, so there's no two writes to order.  Barrier
 against what?
 
 Between atomic update and command. Once you queue command it
 can complete and decrement reqs, if this happens before
 increment reqs can become negative even.

This is not a problem.  Please read Documentation/memory-barriers.txt:

   The following also do _not_ imply memory barriers, and so may
   require explicit memory barriers under some circumstances
   (smp_mb__before_atomic_dec() for instance):

atomic_add();
atomic_sub();
atomic_inc();
atomic_dec();

   If they're used for statistics generation, then they probably don't
   need memory barriers, unless there's a coupling between statistical
   data.

This is the single-queue case, so it falls under this case.
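
As a rough userspace analogue of the "statistics" case quoted above, a counter whose value never gates access to other data can use relaxed atomics. This sketch uses C11 <stdatomic.h>; the counter and the three helper names are hypothetical, and memory_order_relaxed is only an approximate counterpart of the kernel's unordered atomic_inc() and atomic_dec().

```c
#include <stdatomic.h>

/* Hypothetical counter of outstanding requests, used only for bookkeeping. */
static atomic_int reqs;

/* No ordering is requested: the update is atomic, nothing more. */
static void request_queued(void)
{
	atomic_fetch_add_explicit(&reqs, 1, memory_order_relaxed);
}

static void request_completed(void)
{
	atomic_fetch_sub_explicit(&reqs, 1, memory_order_relaxed);
}

static int outstanding(void)
{
	return atomic_load_explicit(&reqs, memory_order_relaxed);
}
```

Relaxed updates are still atomic, so the count stays exact; what is given up is any guarantee about how the update is ordered against other memory accesses.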

/* Discover virtqueues and write information to configuration.  */
 -  err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
 +  err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
if (err)
return err;
  
 -  virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
 -  virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
 -  virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
 +  virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
 +  virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
 +  for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
 +  virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
 +   vqs[i], vscsi->num_queues > 1);

 So affinity is true if >1 vq? I am guessing this is not
 going to do the right thing unless you have at least
 as many vqs as CPUs.

 Yes, and then you're not setting up the thing correctly.
 
 Why not just check instead of doing the wrong thing?

The right thing could be to set the affinity with a stride, e.g. CPUs
0-4 for virtqueue 0 and so on until CPUs 3-7 for virtqueue 3.
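
One way to read the stride idea is that CPU n is served by virtqueue n modulo the number of queues, so with 8 CPUs and 4 virtqueues, virtqueue 0 covers CPUs 0 and 4 and virtqueue 3 covers CPUs 3 and 7. That reading is an assumption, and the helper below (stride_cpu_mask(), a hypothetical name) is only a sketch of it, limited to 64 CPUs so the result fits in one word.

```c
#include <stdint.h>

/*
 * Bitmask of the CPUs mapped to virtqueue `vq` when CPU n is served by
 * queue n % num_queues. Bit n set means CPU n belongs to this queue.
 */
static uint64_t stride_cpu_mask(unsigned int vq, unsigned int num_queues,
				unsigned int num_cpus)
{
	uint64_t mask = 0;
	unsigned int cpu;

	if (num_queues == 0)
		return 0;	/* avoid a zero stride */

	for (cpu = vq; cpu < num_cpus && cpu < 64; cpu += num_queues)
		mask |= UINT64_C(1) << cpu;

	return mask;
}
```

When the number of queues equals the number of CPUs, each mask has exactly one bit set, which matches the one-queue-per-CPU case the patch assumes.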

Paolo

 Isn't the same thing true for virtio-net mq?

 Paolo
 
 Last I looked it checked vi->max_queue_pairs == num_online_cpus().
 This is even too aggressive I think, max_queue_pairs >=
 num_online_cpus() should be enough.
 



Re: [PATCH v2 5/5] virtio-scsi: introduce multiqueue support

2012-12-18 Thread Michael S. Tsirkin
On Tue, Dec 18, 2012 at 04:51:28PM +0100, Paolo Bonzini wrote:
 Il 18/12/2012 16:03, Michael S. Tsirkin ha scritto:
  On Tue, Dec 18, 2012 at 03:08:08PM +0100, Paolo Bonzini wrote:
  Il 18/12/2012 14:57, Michael S. Tsirkin ha scritto:
  -static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
  +static int virtscsi_queuecommand(struct virtio_scsi *vscsi,
  + struct virtio_scsi_target_state *tgt,
  + struct scsi_cmnd *sc)
   {
  -struct virtio_scsi *vscsi = shost_priv(sh);
  -struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
   struct virtio_scsi_cmd *cmd;
  +struct virtio_scsi_vq *req_vq;
   int ret;
   
   struct Scsi_Host *shost = virtio_scsi_host(vscsi->vdev);
  @@ -461,7 +533,8 @@ static int virtscsi_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
   BUG_ON(sc->cmd_len > VIRTIO_SCSI_CDB_SIZE);
   memcpy(cmd->req.cmd.cdb, sc->cmnd, sc->cmd_len);
   
  -if (virtscsi_kick_cmd(tgt, &vscsi->req_vq, cmd,
  +req_vq = ACCESS_ONCE(tgt->req_vq);
 
  This ACCESS_ONCE without a barrier looks strange to me.
  Can req_vq change? Needs a comment.
 
  Barriers are needed to order two things.  Here I don't have the second thing
  to order against, hence no barrier.
 
  Accessing req_vq lockless is safe, and there's a comment about it, but you
  still want ACCESS_ONCE to ensure the compiler doesn't play tricks.
  
  That's just it.
  Why don't you want compiler to play tricks?
 
 Because I want the lockless access to occur exactly when I write it.

It doesn't occur when you write it. CPU can still move accesses
around. That's why you either need both ACCESS_ONCE and a barrier
or none.

 Otherwise I have one more thing to think about, i.e. what a crazy
 compiler writer could do with my code.  And having been on the other
 side of the trench, compiler writers can have *really* crazy ideas.
 
 Anyhow, I'll reorganize the code to move the ACCESS_ONCE closer to the
 write and make it clearer.
 
  +if (virtscsi_kick_cmd(tgt, req_vq, cmd,
 sizeof cmd->req.cmd, sizeof cmd->resp.cmd,
 GFP_ATOMIC) == 0)
   ret = 0;
  @@ -472,6 +545,48 @@ out:
   return ret;
   }
   
  +static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
  +struct scsi_cmnd *sc)
  +{
  +struct virtio_scsi *vscsi = shost_priv(sh);
  +struct virtio_scsi_target_state *tgt = &vscsi->tgt[sc->device->id];
  +
  +atomic_inc(&tgt->reqs);
 
  And here we don't have barrier after atomic? Why? Needs a comment.
 
  Because we don't write req_vq, so there's no two writes to order.  Barrier
  against what?
  
  Between atomic update and command. Once you queue command it
  can complete and decrement reqs, if this happens before
  increment reqs can become negative even.
 
 This is not a problem.  Please read Documentation/memory-barriers.txt:
 
The following also do _not_ imply memory barriers, and so may
require explicit memory barriers under some circumstances
(smp_mb__before_atomic_dec() for instance):
 
 atomic_add();
 atomic_sub();
 atomic_inc();
 atomic_dec();
 
If they're used for statistics generation, then they probably don't
need memory barriers, unless there's a coupling between statistical
data.
 
 This is the single-queue case, so it falls under this case.

Aha I missed it's single queue. Correct but please add a comment.

   /* Discover virtqueues and write information to configuration.  */
  -err = vdev->config->find_vqs(vdev, 3, vqs, callbacks, names);
  +err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
   if (err)
   return err;
   
  -virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0]);
  -virtscsi_init_vq(&vscsi->event_vq, vqs[1]);
  -virtscsi_init_vq(&vscsi->req_vq, vqs[2]);
  +virtscsi_init_vq(&vscsi->ctrl_vq, vqs[0], false);
  +virtscsi_init_vq(&vscsi->event_vq, vqs[1], false);
  +for (i = VIRTIO_SCSI_VQ_BASE; i < num_vqs; i++)
  +virtscsi_init_vq(&vscsi->req_vqs[i - VIRTIO_SCSI_VQ_BASE],
  + vqs[i], vscsi->num_queues > 1);
 
  So affinity is true if >1 vq? I am guessing this is not
  going to do the right thing unless you have at least
  as many vqs as CPUs.
 
  Yes, and then you're not setting up the thing correctly.
  
  Why not just check instead of doing the wrong thing?
 
 The right thing could be to set the affinity with a stride, e.g. CPUs
 0-4 for virtqueue 0 and so on until CPUs 3-7 for virtqueue 3.
 
 Paolo

I think a simple #vqs == #cpus check would be kind of OK for
starters, otherwise let userspace set affinity.
Again need to think what happens with CPU hotplug.
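
The check suggested here could look something like the following userspace sketch. The function name and its parameters are hypothetical; in the driver the two counts would come from the device configuration and from num_online_cpus(), and CPU hotplug is not handled at all.

```c
#include <stdbool.h>

/*
 * Only pin each request queue to "its" CPU when there is exactly one
 * queue per online CPU; otherwise leave affinity to userspace.
 */
static bool should_set_affinity(unsigned int num_queues,
				unsigned int online_cpus)
{
	return num_queues == online_cpus;
}
```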

  Isn't the same thing true for