Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-24 Thread Avi Kivity

On 05/23/2010 07:30 PM, Michael S. Tsirkin wrote:
>>> Maybe we should use atomics on index then?
>>
>> This should only be helpful if you access the cacheline several times in
>> a row.  That's not the case in virtio (or here).
>
> So why does it help?

We actually do access the cacheline several times in a row here (but not
in virtio?):



	case SHARE:
		while (count < MAX_BOUNCES) {
			/* Spin waiting for other side to change it. */
			while (counter->cacheline1 != count);

Broadcast a read request.

			count++;
			counter->cacheline1 = count;

Broadcast an invalidate request.

			count++;
		}
		break;

	case LOCKSHARE:
		while (count < MAX_BOUNCES) {
			/* Spin waiting for other side to change it. */
			while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
			       != count);

Broadcast a 'read for ownership' request.

			count += 2;
		}
		break;

So each SHARE update costs two coherence transactions (a read, then an
invalidate), where LOCKSHARE pays one.  RMW should certainly be faster
using single-instruction RMW operations (or using prefetchw).
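
For the ring index that would look something like this (a sketch using
the GCC builtins; the helper names are mine, not from this thread):

static inline unsigned int bump_locked(volatile unsigned int *idx)
{
	/* One locked RMW: a single read-for-ownership transaction
	 * (compiles to LOCK XADD on x86). */
	return __sync_fetch_and_add(idx, 1);
}

static inline unsigned int bump_prefetched(volatile unsigned int *idx)
{
	/* Hint that we intend to write, so the line arrives exclusive;
	 * with rw=1 GCC can emit PREFETCHW where the CPU supports it. */
	__builtin_prefetch((const void *)idx, 1, 3);
	unsigned int v = *idx;
	*idx = v + 1;
	return v;
}

Either way the update is one coherence transaction instead of the
read-then-invalidate pair above.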


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-24 Thread Michael S. Tsirkin
On Mon, May 24, 2010 at 09:37:05AM +0300, Avi Kivity wrote:
 On 05/23/2010 07:30 PM, Michael S. Tsirkin wrote:


 Maybe we should use atomics on index then?


 This should only be helpful if you access the cacheline several times in
 a row.  That's not the case in virtio (or here).
  
 So why does it help?


 We actually do access the cacheline several times in a row here (but not  
 in virtio?):

  case SHARE:
  while (count < MAX_BOUNCES) {
  /* Spin waiting for other side to change it. */
  while (counter->cacheline1 != count);

 Broadcast a read request.

  count++;
  counter->cacheline1 = count;

 Broadcast an invalidate request.

  count++;
  }
  break;

  case LOCKSHARE:
  while (count < MAX_BOUNCES) {
  /* Spin waiting for other side to change it. */
  while (__sync_val_compare_and_swap(&counter->cacheline1, count, count+1)
     != count);

 Broadcast a 'read for ownership' request.

  count += 2;
  }
  break;


 So RMW should certainly be faster using single-instruction RMW
 operations (or using prefetchw).

Okay, but why is lockunshare faster than unshare?

 -- 
 Do not meddle in the internals of kernels, for they are subtle and quick to 
 panic.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-24 Thread Avi Kivity

On 05/24/2010 11:05 AM, Michael S. Tsirkin wrote:
> Okay, but why is lockunshare faster than unshare?

No idea.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
 On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
  On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
   Note that this is an exclusive-shared-exclusive bounce only, too.
   
   
   A bounce is a bounce.
  
  I tried to measure this to show that you were wrong, but I was only able
  to show that you're right.  How annoying.  Test code below.
 
 This time for sure!


What do you see?
On my laptop:
[...@tuck testring]$ ./rusty1 share 0 1
CPU 1: share cacheline: 2820410 usec
CPU 0: share cacheline: 2823441 usec
[...@tuck testring]$ ./rusty1 unshare 0 1
CPU 0: unshare cacheline: 2783014 usec
CPU 1: unshare cacheline: 2782951 usec
[...@tuck testring]$ ./rusty1 lockshare 0 1
CPU 1: lockshare cacheline: 1888495 usec
CPU 0: lockshare cacheline: 1888544 usec
[...@tuck testring]$ ./rusty1 lockunshare 0 1
CPU 0: lockunshare cacheline: 1889854 usec
CPU 1: lockunshare cacheline: 1889804 usec
So locked version seems to be faster than unlocked,
and share/unshare not to matter?

same on a workstation:
[r...@qus19 ~]# ./rusty1 unshare 0 1
CPU 0: unshare cacheline: 6037002 usec
CPU 1: unshare cacheline: 6036977 usec
[r...@qus19 ~]# ./rusty1 lockunshare 0 1
CPU 1: lockunshare cacheline: 5734362 usec
CPU 0: lockunshare cacheline: 5734389 usec
[r...@qus19 ~]# ./rusty1 lockshare 0 1
CPU 1: lockshare cacheline: 5733537 usec
CPU 0: lockshare cacheline: 5733564 usec

using another pair of CPUs gives more drastic
results:

[r...@qus19 ~]# ./rusty1 lockshare 0 2
CPU 2: lockshare cacheline: 4226990 usec
CPU 0: lockshare cacheline: 4227038 usec
[r...@qus19 ~]# ./rusty1 lockunshare 0 2
CPU 0: lockunshare cacheline: 4226707 usec
CPU 2: lockunshare cacheline: 4226662 usec
[r...@qus19 ~]# ./rusty1 unshare 0 2
CPU 0: unshare cacheline: 14815048 usec
CPU 2: unshare cacheline: 14815006 usec


The share test seems to never finish on the
workstation. I am debugging this.

-- 
MST


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Avi Kivity

On 05/23/2010 06:31 PM, Michael S. Tsirkin wrote:
> On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
>> On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
>>> On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
>>>>> Note that this is an exclusive-shared-exclusive bounce only, too.
>>>>
>>>> A bounce is a bounce.
>>>
>>> I tried to measure this to show that you were wrong, but I was only able
>>> to show that you're right.  How annoying.  Test code below.
>>
>> This time for sure!
>
> What do you see?
> On my laptop:
> [...@tuck testring]$ ./rusty1 share 0 1
> CPU 1: share cacheline: 2820410 usec
> CPU 0: share cacheline: 2823441 usec
> [...@tuck testring]$ ./rusty1 unshare 0 1
> CPU 0: unshare cacheline: 2783014 usec
> CPU 1: unshare cacheline: 2782951 usec
> [...@tuck testring]$ ./rusty1 lockshare 0 1
> CPU 1: lockshare cacheline: 1888495 usec
> CPU 0: lockshare cacheline: 1888544 usec
> [...@tuck testring]$ ./rusty1 lockunshare 0 1
> CPU 0: lockunshare cacheline: 1889854 usec
> CPU 1: lockunshare cacheline: 1889804 usec

Ugh, can the timing be normalized per operation?  This is unreadable.


> So locked version seems to be faster than unlocked,
> and share/unshare not to matter?

May be due to the processor using the LOCK operation as a hint to
reserve the cacheline for a bit.



> same on a workstation:
> [r...@qus19 ~]# ./rusty1 unshare 0 1
> CPU 0: unshare cacheline: 6037002 usec
> CPU 1: unshare cacheline: 6036977 usec
> [r...@qus19 ~]# ./rusty1 lockunshare 0 1
> CPU 1: lockunshare cacheline: 5734362 usec
> CPU 0: lockunshare cacheline: 5734389 usec
> [r...@qus19 ~]# ./rusty1 lockshare 0 1
> CPU 1: lockshare cacheline: 5733537 usec
> CPU 0: lockshare cacheline: 5733564 usec
>
> using another pair of CPUs gives more drastic
> results:
>
> [r...@qus19 ~]# ./rusty1 lockshare 0 2
> CPU 2: lockshare cacheline: 4226990 usec
> CPU 0: lockshare cacheline: 4227038 usec
> [r...@qus19 ~]# ./rusty1 lockunshare 0 2
> CPU 0: lockunshare cacheline: 4226707 usec
> CPU 2: lockunshare cacheline: 4226662 usec
> [r...@qus19 ~]# ./rusty1 unshare 0 2
> CPU 0: unshare cacheline: 14815048 usec
> CPU 2: unshare cacheline: 14815006 usec

That's expected.  Hyperthread will be fastest (shared L1), shared L2/L3
will be slower, cross-socket will suck.



--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Sun, May 23, 2010 at 06:41:33PM +0300, Avi Kivity wrote:
 On 05/23/2010 06:31 PM, Michael S. Tsirkin wrote:
 On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:

 On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
  
 On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:

 Note that this is an exclusive-shared-exclusive bounce only, too.


 A bounce is a bounce.
  
 I tried to measure this to show that you were wrong, but I was only able
 to show that you're right.  How annoying.  Test code below.

 This time for sure!
  

 What do you see?
 On my laptop:
  [...@tuck testring]$ ./rusty1 share 0 1
  CPU 1: share cacheline: 2820410 usec
  CPU 0: share cacheline: 2823441 usec
  [...@tuck testring]$ ./rusty1 unshare 0 1
  CPU 0: unshare cacheline: 2783014 usec
  CPU 1: unshare cacheline: 2782951 usec
  [...@tuck testring]$ ./rusty1 lockshare 0 1
  CPU 1: lockshare cacheline: 1888495 usec
  CPU 0: lockshare cacheline: 1888544 usec
  [...@tuck testring]$ ./rusty1 lockunshare 0 1
  CPU 0: lockunshare cacheline: 1889854 usec
  CPU 1: lockunshare cacheline: 1889804 usec


 Ugh, can the timing be normalized per operation?  This is unreadable.

 So locked version seems to be faster than unlocked,
 and share/unshare not to matter?


 May be due to the processor using the LOCK operation as a hint to  
 reserve the cacheline for a bit.

Maybe we should use atomics on index then?

 same on a workstation:
 [r...@qus19 ~]# ./rusty1 unshare 0 1
 CPU 0: unshare cacheline: 6037002 usec
 CPU 1: unshare cacheline: 6036977 usec
 [r...@qus19 ~]# ./rusty1 lockunshare 0 1
 CPU 1: lockunshare cacheline: 5734362 usec
 CPU 0: lockunshare cacheline: 5734389 usec
 [r...@qus19 ~]# ./rusty1 lockshare 0 1
 CPU 1: lockshare cacheline: 5733537 usec
 CPU 0: lockshare cacheline: 5733564 usec

 using another pair of CPUs gives more drastic
 results:

 [r...@qus19 ~]# ./rusty1 lockshare 0 2
 CPU 2: lockshare cacheline: 4226990 usec
 CPU 0: lockshare cacheline: 4227038 usec
 [r...@qus19 ~]# ./rusty1 lockunshare 0 2
 CPU 0: lockunshare cacheline: 4226707 usec
 CPU 2: lockunshare cacheline: 4226662 usec
 [r...@qus19 ~]# ./rusty1 unshare 0 2
 CPU 0: unshare cacheline: 14815048 usec
 CPU 2: unshare cacheline: 14815006 usec



 That's expected.  Hyperthread will be fastest (shared L1), shared L2/L3  
 will be slower, cross-socket will suck.

OK, after adding mb in the code (patch will be sent separately),
the test works on my workstation.  Locked is still fastest;
unshared sometimes wins and sometimes loses against shared.

[r...@qus19 ~]# ./cachebounce share 0 1
CPU 0: share cacheline: 6638521 usec
CPU 1: share cacheline: 6638478 usec
[r...@qus19 ~]# ./cachebounce unshare 0 1
CPU 0: unshare cacheline: 6037415 usec
CPU 1: unshare cacheline: 6037374 usec
[r...@qus19 ~]# ./cachebounce lockshare 0 1
CPU 0: lockshare cacheline: 5734017 usec
CPU 1: lockshare cacheline: 5733978 usec
[r...@qus19 ~]# ./cachebounce lockunshare 0 1
CPU 1: lockunshare cacheline: 5733260 usec
CPU 0: lockunshare cacheline: 5733307 usec
[r...@qus19 ~]# ./cachebounce share 0 2
CPU 0: share cacheline: 14529198 usec
CPU 2: share cacheline: 14529156 usec
[r...@qus19 ~]# ./cachebounce unshare 0 2
CPU 2: unshare cacheline: 14815328 usec
CPU 0: unshare cacheline: 14815374 usec
[r...@qus19 ~]# ./cachebounce lockshare 0 2
CPU 0: lockshare cacheline: 4226878 usec
CPU 2: lockshare cacheline: 4226842 usec
[r...@qus19 ~]# ./cachebounce locknushare 0 2
cachebounce: Usage: cachebounce share|unshare|lockshare|lockunshare cpu0 
cpu1
[r...@qus19 ~]# ./cachebounce lockunshare 0 2
CPU 0: lockunshare cacheline: 4227432 usec
CPU 2: lockunshare cacheline: 4227375 usec



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Thu, May 20, 2010 at 02:38:16PM +0930, Rusty Russell wrote:
 On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
  On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
   Note that this is an exclusive-shared-exclusive bounce only, too.
   
   
   A bounce is a bounce.
  
  I tried to measure this to show that you were wrong, but I was only able
  to show that you're right.  How annoying.  Test code below.
 
 This time for sure!

The share option does not work on some
boxes unless I apply the following:
essentially, this adds mb() after each write
and before each read.  It seems to make sense to
me: we must make our counter update visible before we
wait for the other side.

diff --git a/cachebounce.c b/cachebounce.c
index 0387027..ebe5a37 100644
--- a/cachebounce.c
+++ b/cachebounce.c
@@ -77,6 +77,7 @@ int main(int argc, char *argv[])
 			count++;
 			counter->cacheline1 = count;
 			count++;
+			__sync_synchronize();
 		}
 		break;
 	case UNSHARE:
@@ -86,6 +87,7 @@ int main(int argc, char *argv[])
 			count++;
 			counter->cacheline2 = count;
 			count++;
+			__sync_synchronize();
 		}
 		break;
 	case LOCKSHARE:
@@ -98,6 +100,7 @@ int main(int argc, char *argv[])
 		break;
 	case LOCKUNSHARE:
 		while (count < MAX_BOUNCES) {
+			__sync_synchronize();
 			/* Spin waiting for other side to change it. */
 			while (counter->cacheline1 != count);
 
 			__sync_val_compare_and_swap(&counter->cacheline2, count, count+1);
@@ -115,6 +118,7 @@ int main(int argc, char *argv[])
 			count++;
 			counter->cacheline1 = count;
 			count++;
+			__sync_synchronize();
 		}
 		break;
 	case UNSHARE:
@@ -124,6 +128,7 @@ int main(int argc, char *argv[])
 			count++;
 			counter->cacheline1 = count;
 			count++;
+			__sync_synchronize();
 		}
 		break;
 	case LOCKSHARE:


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Avi Kivity

On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:
>>> So locked version seems to be faster than unlocked,
>>> and share/unshare not to matter?
>>
>> May be due to the processor using the LOCK operation as a hint to
>> reserve the cacheline for a bit.
>
> Maybe we should use atomics on index then?

This should only be helpful if you access the cacheline several times in
a row.  That's not the case in virtio (or here).

I think the problem is that LOCKSHARE and SHARE are not symmetric, so
they can't be directly compared.

> OK, after adding mb in the code (patch will be sent separately),
> the test works on my workstation.  Locked is still fastest;
> unshared sometimes wins and sometimes loses against shared.
>
> [r...@qus19 ~]# ./cachebounce share 0 1
> CPU 0: share cacheline: 6638521 usec
> CPU 1: share cacheline: 6638478 usec

66 ns? nice.

> [r...@qus19 ~]# ./cachebounce share 0 2
> CPU 0: share cacheline: 14529198 usec
> CPU 2: share cacheline: 14529156 usec

140 ns, not too bad.  I hope I'm not misinterpreting the results.
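
(Normalized: 6638521 usec / 10^8 bounces ~= 66 ns, and 14529198 usec /
10^8 ~= 145 ns; this assumes MAX_BOUNCES is still the 10^8 of the
posted test.)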

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Sun, May 23, 2010 at 07:03:10PM +0300, Avi Kivity wrote:
 On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:

 So locked version seems to be faster than unlocked,
 and share/unshare not to matter?


 May be due to the processor using the LOCK operation as a hint to
 reserve the cacheline for a bit.
  
 Maybe we should use atomics on index then?


 This should only be helpful if you access the cacheline several times in  
 a row.  That's not the case in virtio (or here).

So why does it help?

 I think the problem is that LOCKSHARE and SHARE are not symmetric, so  
 they can't be directly compared.

In what sense are they not symmetric?

 OK, after adding mb in the code (patch will be sent separately),
 the test works on my workstation.  Locked is still fastest;
 unshared sometimes wins and sometimes loses against shared.

 [r...@qus19 ~]# ./cachebounce share 0 1
 CPU 0: share cacheline: 6638521 usec
 CPU 1: share cacheline: 6638478 usec


 66 ns? nice.

 [r...@qus19 ~]# ./cachebounce share 0 2
 CPU 0: share cacheline: 14529198 usec
 CPU 2: share cacheline: 14529156 usec


 140 ns, not too bad.  I hope I'm not misinterpreting the results.

 -- 
 error compiling committee.c: too many arguments to function


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-23 Thread Michael S. Tsirkin
On Sun, May 23, 2010 at 07:03:10PM +0300, Avi Kivity wrote:
 On 05/23/2010 06:51 PM, Michael S. Tsirkin wrote:

 So locked version seems to be faster than unlocked,
 and share/unshare not to matter?


 May be due to the processor using the LOCK operation as a hint to
 reserve the cacheline for a bit.
  
 Maybe we should use atomics on index then?


 This should only be helpful if you access the cacheline several times in  
 a row.  That's not the case in virtio (or here).

 I think the problem is that LOCKSHARE and SHARE are not symmetric, so  
 they can't be directly compared.

 OK, after adding mb in the code (patch will be sent separately),
 the test works on my workstation.  Locked is still fastest;
 unshared sometimes wins and sometimes loses against shared.

 [r...@qus19 ~]# ./cachebounce share 0 1
 CPU 0: share cacheline: 6638521 usec
 CPU 1: share cacheline: 6638478 usec


 66 ns? nice.

 [r...@qus19 ~]# ./cachebounce share 0 2
 CPU 0: share cacheline: 14529198 usec
 CPU 2: share cacheline: 14529156 usec


 140 ns, not too bad.  I hope I'm not misinterpreting the results.

 -- 
 error compiling committee.c: too many arguments to function


Here's another box: the fastest option
is shared, the slowest unshared, lock in the middle.



[r...@virtlab16 testring]# sh run 0 2
CPU 2: share cacheline: 3304728 usec
CPU 0: share cacheline: 3304784 usec
CPU 0: unshare cacheline: 6283248 usec
CPU 2: unshare cacheline: 6283224 usec
CPU 2: lockshare cacheline: 4018567 usec
CPU 0: lockshare cacheline: 4018609 usec


CPU 2: lockunshare cacheline: 4041791 usec
CPU 0: lockunshare cacheline: 4041832 usec
[r...@virtlab16 testring]# 
[r...@virtlab16 testring]# 
[r...@virtlab16 testring]# 
[r...@virtlab16 testring]# sh run 0 1
CPU 1: share cacheline: 8306326 usec
CPU 0: share cacheline: 8306324 usec
CPU 0: unshare cacheline: 19571697 usec
CPU 1: unshare cacheline: 19571578 usec
CPU 0: lockshare cacheline: 11281566 usec
CPU 1: lockshare cacheline: 11281424 usec
CPU 0: lockunshare cacheline: 11276093 usec
CPU 1: lockunshare cacheline: 11275957 usec


[r...@virtlab16 testring]# sh run 0 3
CPU 0: share cacheline: 8288335 usec
CPU 3: share cacheline: 8288334 usec
CPU 0: unshare cacheline: 19107202 usec
CPU 3: unshare cacheline: 19107139 usec
CPU 0: lockshare cacheline: 11238915 usec
CPU 3: lockshare cacheline: 11238848 usec
CPU 3: lockunshare cacheline: 11132134 usec
CPU 0: lockunshare cacheline: 11132249 usec



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Avi Kivity

On 05/20/2010 01:33 AM, Michael S. Tsirkin wrote:
>> Virtio is already way too bouncy due to the indirection between the
>> avail/used rings and the descriptor pool.  A device with out of order
>> completion (like virtio-blk) will quickly randomize the unused
>> descriptor indexes, so every descriptor fetch will require a bounce.
>>
>> In contrast, if the rings hold the descriptors themselves instead of
>> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
>> every descriptor, amortized.
>
> On the other hand, consider that on fast path we are never using all
> of the ring.  With a good allocator we might be able to keep
> reusing only a small part of the ring, instead of wrapping around
> all of it all of the time.

It's still suboptimal; we have to bounce both the avail/used rings and
the descriptor pool, compared to just the descriptor ring with a direct
design.  Plus we don't need a fancy allocator.

When amortizing cachelines, simple data structures win.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Avi Kivity

On 05/20/2010 08:01 AM, Rusty Russell wrote:
>> A device with out of order
>> completion (like virtio-blk) will quickly randomize the unused
>> descriptor indexes, so every descriptor fetch will require a bounce.
>>
>> In contrast, if the rings hold the descriptors themselves instead of
>> pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
>> every descriptor, amortized.
>
> We already have indirect, this would be a logical next step.  So let's
> think about it.  The avail ring would contain 64 bit values, the used ring
> would contain indexes into the avail ring.

Have just one ring, no indexes.  The producer places descriptors into
the ring and updates the head.  The consumer copies out descriptors to
be processed and copies back in completed descriptors.  Chaining is
always linear.  The descriptors contain a tag that allows the producer to
identify the completion.

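For concreteness, a sketch of what a slot in such a ring might look
like (field names are mine, not a proposed ABI):

struct ring_desc {
	__u64 addr;	/* guest-physical buffer address */
	__u32 len;
	__u16 flags;	/* e.g. NEXT for linear chaining, WRITE */
	__u16 tag;	/* echoed back in the completion so the
			 * producer can match it to its request */
};

One ring of these plus a producer head and consumer tail; the consumer
overwrites consumed slots in place with tagged completions.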

Indirect only pays when there are enough descriptors in it to fill a 
couple of cache lines.  Otherwise it's an extra bounce.


We will always bounce here; that's what happens when transferring data.
The question is how many cache lines per descriptor.  A pointer
adds 1 bounce, linear descriptors cost 1/4 bounce, chained descriptors
cost a bounce.  So best is one ring of linearly chained descriptors.
Indirect works when you have large requests (like block).



> So client writes descriptor page and adds to avail ring, then writes to
> index.
> Server reads index, avail ring, descriptor page (3).  Writes used
> entry (1).  Updates last_used (1).  Client reads used (1), derefs avail (1),
> updates last_used (1), cleans descriptor page (1).
>
> That's 9 cacheline transfers, worst case.  Best case of a half-full ring
> in steady state, assuming 128-byte cache lines, the avail ring costs are
> 1/16, the used entry is 1/64.  This drops it to 6 and 9/64 transfers.

Cache lines are 64 bytes these days.

With a single ring, client writes descriptors (ceil(N/4)), updates head
(1).  Server reads head (1), copies out descriptors (ceil(N/4)), issues
requests, copies back completions (ceil(N/4)), updates tail (1).
Client reads back tail and descriptors (1 + ceil(N/4)).

Worst case: 4 + 4 * ceil(N/4).  Best case I think this drops by half.
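
(E.g. a 2-descriptor chain: ceil(2/4) = 1, so 4 + 4*1 = 8 transfers
worst case versus 9 above, and the advantage grows with chain length
because the descriptors move in bulk rather than one line each.)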




> (Note, the current scheme adds 2 more cacheline transfers, for the descriptor
> table, worst case.

2 bounces per descriptor due to random access.

> Assuming indirect, we get 2/8 xfer best case.  Either way,
> it's not the main source of cacheline xfers).


Indirect adds a double bounce to get to the descriptor table, but any 
descriptors there are accessed linearly.  It's only good when you have 
large chains.



> Can we do better?  The obvious idea is to try to get rid of last_used and
> used, and use the ring itself.  We would use an invalid entry to mark the
> head of the ring.

Interesting!  So a peer will read until it hits a wall.  But how to
update the wall atomically?

Maybe we can have a flag in the descriptor indicate headness or
tailness.  Update looks ugly though: write descriptor with head flag,
write next descriptor with head flag, remove flag from previous descriptor.
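
Something like this, as a sketch (the flag name is mine):

	ring[i] = desc;
	ring[i].flags |= F_HEAD;	/* new entry is the new wall */
	wmb();
	ring[i - 1].flags &= ~F_HEAD;	/* previous entry stops being the
					 * wall; consumer may pass it now */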


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Michael S. Tsirkin
On Thu, May 20, 2010 at 02:31:50PM +0930, Rusty Russell wrote:
 Can we do better?  The obvious idea is to try to get rid of last_used and
 used, and use the ring itself.  We would use an invalid entry to mark the
 head of the ring.
 
 Any other thoughts?
 Rusty.

We also need a way to avoid interrupts at least while we are
processing the ring.

-- 
MST


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Rusty Russell
On Thu, 20 May 2010 04:30:56 pm Avi Kivity wrote:
 On 05/20/2010 08:01 AM, Rusty Russell wrote:
 
  A device with out of order
  completion (like virtio-blk) will quickly randomize the unused
  descriptor indexes, so every descriptor fetch will require a bounce.
 
  In contrast, if the rings hold the descriptors themselves instead of
  pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
  every descriptor, amortized.
   
  We already have indirect, this would be a logical next step.  So let's
  think about it. The avail ring would contain 64 bit values, the used ring
  would contain indexes into the avail ring.
 
 Have just one ring, no indexes.  The producer places descriptors into 
 the ring and updates the head.  The consumer copies out descriptors to 
 be processed and copies back in completed descriptors.  Chaining is 
 always linear.  The descriptors contain a tag that allows the producer to 
 identify the completion.

This could definitely work.  The original reason for the page boundaries
was for untrusted inter-guest communication: with appropriate page protections
they could see each other's rings and a simple inter-guest copy hypercall
could verify that the other guest really exposed that data via virtio ring.

But, cute as that is, we never did that.  And it's not clear that it wins
much over simply having the hypervisor read both rings directly.

  Can we do better?  The obvious idea is to try to get rid of last_used and
  used, and use the ring itself.  We would use an invalid entry to mark the
  head of the ring.
 
 Interesting!  So a peer will read until it hits a wall.  But how to 
 update the wall atomically?
 
 Maybe we can have a flag in the descriptor indicate headness or 
 tailness.  Update looks ugly though: write descriptor with head flag, 
 write next descriptor with head flag, remove flag from previous descriptor.

I was thinking a separate magic invalid entry.  To publish a 3-descriptor
chain, you would write descriptors 2 and 3, write an invalid entry at 4,
barrier, write entry 1.  It is a bit ugly, yes, but not terrible.
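
Roughly, with my names for the ring, entries and barrier:

	/* slots 2 and 3 first; they sit behind the wall at slot 1 */
	ring[2] = desc2;
	ring[3] = desc3;
	/* the new wall */
	ring[4] = INVALID_ENTRY;
	wmb();
	/* overwriting the old wall opens the whole chain at once */
	ring[1] = desc1;

The consumer can never see a half-published chain, because the old
invalid entry at slot 1 only goes away in the final store.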

I think that a simple simulator for this is worth writing, which tracks
cacheline moves under various fullness scenarios...

Cheers,
Rusty.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-20 Thread Avi Kivity

On 05/20/2010 05:34 PM, Rusty Russell wrote:
>> Have just one ring, no indexes.  The producer places descriptors into
>> the ring and updates the head.  The consumer copies out descriptors to
>> be processed and copies back in completed descriptors.  Chaining is
>> always linear.  The descriptors contain a tag that allows the producer to
>> identify the completion.
>
> This could definitely work.  The original reason for the page boundaries
> was for untrusted inter-guest communication: with appropriate page protections
> they could see each other's rings and a simple inter-guest copy hypercall
> could verify that the other guest really exposed that data via virtio ring.
>
> But, cute as that is, we never did that.  And it's not clear that it wins
> much over simply having the hypervisor read both rings directly.

AFAICS having separate avail_ring/used_ring/desc_pool is orthogonal to
this cuteness.



>>> Can we do better?  The obvious idea is to try to get rid of last_used and
>>> used, and use the ring itself.  We would use an invalid entry to mark the
>>> head of the ring.
>>
>> Interesting!  So a peer will read until it hits a wall.  But how to
>> update the wall atomically?
>>
>> Maybe we can have a flag in the descriptor indicate headness or
>> tailness.  Update looks ugly though: write descriptor with head flag,
>> write next descriptor with head flag, remove flag from previous descriptor.
>
> I was thinking a separate magic invalid entry.  To publish a 3-descriptor
> chain, you would write descriptors 2 and 3, write an invalid entry at 4,
> barrier, write entry 1.  It is a bit ugly, yes, but not terrible.

Worth exploring.  This amortizes the indexes into the ring, a good thing.

Another thing we can do is place the tail a half ring away from the head
(and limit ring utilization to 50%), reducing bounces on short kicks.
Or equivalently have an avail ring and used ring, but both containing
tagged descriptors instead of pointers to descriptors.



> I think that a simple simulator for this is worth writing, which tracks
> cacheline moves under various fullness scenarios...

Yup.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Rusty Russell
On Wed, 12 May 2010 04:57:22 am Avi Kivity wrote:
 On 05/07/2010 06:23 AM, Rusty Russell wrote:
  On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
 
  On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
   
  +	/* We publish the last-seen used index at the end of the available ring.
  +	 * It is at the end for backwards compatibility. */
  +	vr->last_used_idx = &(vr)->avail->ring[num];
  +	/* Verify that last used index does not spill over the used ring. */
  +	BUG_ON((void *)vr->last_used_idx +
  +	       sizeof *vr->last_used_idx > (void *)vr->used);
  }
 
 
  Shouldn't this be on its own cache line?
   
  It's next to the available ring; because that's where the guest publishes
  its data.  That whole page is guest-write, host-read.
 
  Putting it on a cacheline by itself would be a slight pessimization; the 
  host
  cpu would have to get the last_used_idx cacheline and the avail descriptor
  cacheline every time.  This way, they are sometimes the same cacheline.
 
 If one peer writes the tail of the available ring, while the other reads 
 last_used_idx, it's a false bounce, no?

I think we're talking about the last 2 entries of the avail ring.  That means
the worst case is 1 false bounce every time around the ring.  I think that's
why we're debating it instead of measuring it :)

Note that this is an exclusive-shared-exclusive bounce only, too.

Cheers,
Rusty.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Avi Kivity

On 05/19/2010 10:39 AM, Rusty Russell wrote:
> I think we're talking about the last 2 entries of the avail ring.  That means
> the worst case is 1 false bounce every time around the ring.

It's low, but why introduce an inefficiency when you can avoid doing it
for the same effort?

> I think that's
> why we're debating it instead of measuring it :)

Measure before optimize is good for code but not for protocols.
Protocols have to be robust against future changes.  Virtio is warty
enough already, we can't keep doing local optimizations.

> Note that this is an exclusive-shared-exclusive bounce only, too.

A bounce is a bounce.

Virtio is already way too bouncy due to the indirection between the
avail/used rings and the descriptor pool.  A device with out of order
completion (like virtio-blk) will quickly randomize the unused
descriptor indexes, so every descriptor fetch will require a bounce.

In contrast, if the rings hold the descriptors themselves instead of
pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for
every descriptor, amortized.
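
(With 16-byte descriptors and 64-byte lines, that's 1/4 of a bounce per
descriptor, amortized, instead of a full one.)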


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Michael S. Tsirkin
On Wed, May 19, 2010 at 11:06:42AM +0300, Avi Kivity wrote:
 On 05/19/2010 10:39 AM, Rusty Russell wrote:

 I think we're talking about the last 2 entries of the avail ring.  That means
 the worst case is 1 false bounce every time around the ring.

 It's low, but why introduce an inefficiency when you can avoid doing it  
 for the same effort?
 
 I think that's
 why we're debating it instead of measuring it :)


 Measure before optimize is good for code but not for protocols.   
 Protocols have to be robust against future changes.  Virtio is warty  
 enough already, we can't keep doing local optimizations.

 Note that this is an exclusive-shared-exclusive bounce only, too.


 A bounce is a bounce.

 Virtio is already way too bouncy due to the indirection between the  
 avail/used rings and the descriptor pool.  A device with out of order  
 completion (like virtio-blk) will quickly randomize the unused  
 descriptor indexes, so every descriptor fetch will require a bounce.

 In contrast, if the rings hold the descriptors themselves instead of  
 pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for  
 every descriptor, amortized.

On the other hand, consider that on fast path we are never using all
of the ring.  With a good allocator we might be able to keep
reusing only a small part of the ring, instead of wrapping around
all of it all of the time.


 -- 
 Do not meddle in the internals of kernels, for they are subtle and quick to 
 panic.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Rusty Russell
On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
  Note that this is an exclusive-shared-exclusive bounce only, too.
 
 
 A bounce is a bounce.

I tried to measure this to show that you were wrong, but I was only able
to show that you're right.  How annoying.  Test code below.

 Virtio is already way too bouncy due to the indirection between the 
 avail/used rings and the descriptor pool.

I tried to do a more careful analysis below, and I think this is an
overstatement.

 A device with out of order 
 completion (like virtio-blk) will quickly randomize the unused 
 descriptor indexes, so every descriptor fetch will require a bounce.
 
 In contrast, if the rings hold the descriptors themselves instead of 
 pointers, we bounce (sizeof(descriptor)/cache_line_size) cache lines for 
 every descriptor, amortized.

We already have indirect, this would be a logical next step.  So let's
think about it. The avail ring would contain 64 bit values, the used ring
would contain indexes into the avail ring.

So client writes descriptor page and adds to avail ring, then writes to
index.  Server reads index, avail ring, descriptor page (3).  Writes used
entry (1).  Updates last_used (1).  Client reads used (1), derefs avail (1),
updates last_used (1), cleans descriptor page (1).

That's 9 cacheline transfers, worst case.  Best case of a half-full ring
in steady state, assuming 128-byte cache lines, the avail ring costs are
1/16, the used entry is 1/64.  This drops it to 6 and 9/64 transfers.
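
(Spelling that out: the server's three reads -- index, avail ring,
descriptor page -- plus its used-entry write and last_used update, then
the client's used read, avail deref, last_used update and descriptor
cleanup: 3 + 1 + 1 + 1 + 1 + 1 + 1 = 9.  The best-case fractions come
from packing: 16 8-byte avail entries or 64 2-byte used entries per
128-byte line.)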

(Note, the current scheme adds 2 more cacheline transfers, for the descriptor
table, worst case.  Assuming indirect, we get 2/8 xfer best case.  Either way,
it's not the main source of cacheline xfers).

Can we do better?  The obvious idea is to try to get rid of last_used and
used, and use the ring itself.  We would use an invalid entry to mark the
head of the ring.

Any other thoughts?
Rusty.


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-19 Thread Rusty Russell
On Thu, 20 May 2010 02:31:50 pm Rusty Russell wrote:
 On Wed, 19 May 2010 05:36:42 pm Avi Kivity wrote:
   Note that this is an exclusive-shared-exclusive bounce only, too.
  
  
  A bounce is a bounce.
 
 I tried to measure this to show that you were wrong, but I was only able
 to show that you're right.  How annoying.  Test code below.

This time for sure!

#define _GNU_SOURCE
#include <unistd.h>
#include <sched.h>
#include <err.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/mman.h>

/* We share memory via an mmap. */
struct counter {
	unsigned int cacheline1;
	char pad[256];
	unsigned int cacheline2;
};

/* The archive garbled this constant; 10^8 matches the per-bounce
 * arithmetic quoted earlier in the thread. */
#define MAX_BOUNCES 100000000

enum mode {
	SHARE,
	UNSHARE,
	LOCKSHARE,
	LOCKUNSHARE,
};

int main(int argc, char *argv[])
{
	cpu_set_t cpuset;
	volatile struct counter *counter;
	struct timeval start, stop;
	bool child;
	unsigned int count;
	uint64_t usec;
	enum mode mode;

	if (argc != 4)
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare "
		     "cpu0 cpu1");

	if (strcmp(argv[1], "share") == 0)
		mode = SHARE;
	else if (strcmp(argv[1], "unshare") == 0)
		mode = UNSHARE;
	else if (strcmp(argv[1], "lockshare") == 0)
		mode = LOCKSHARE;
	else if (strcmp(argv[1], "lockunshare") == 0)
		mode = LOCKUNSHARE;
	else
		errx(1, "Usage: cachebounce share|unshare|lockshare|lockunshare "
		     "cpu0 cpu1");

	CPU_ZERO(&cpuset);

	counter = mmap(NULL, getpagesize(), PROT_READ|PROT_WRITE,
		       MAP_ANONYMOUS|MAP_SHARED, -1, 0);
	if (counter == MAP_FAILED)
		err(1, "Mapping page");

	/* Fault it in. */
	counter->cacheline1 = counter->cacheline2 = 0;

	child = (fork() == 0);

	CPU_SET(atoi(argv[2 + child]), &cpuset);
	if (sched_setaffinity(getpid(), sizeof(cpu_set_t), &cpuset) != 0)
		err(1, "Calling sched_setaffinity()");

	gettimeofday(&start, NULL);

	if (child) {
		count = 1;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline2 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1,
								   count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				/* Locked RMW on the other cacheline (atomic
				 * exchange; the CAS originally posted here
				 * could never succeed on this parity). */
				__sync_lock_test_and_set(&counter->cacheline2,
							 count + 1);
				count += 2;
			}
			break;
		}
	} else {
		count = 0;
		switch (mode) {
		case SHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline1 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case UNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				count++;
				counter->cacheline1 = count;
				count++;
			}
			break;
		case LOCKSHARE:
			/* (The archive truncates the listing around here; the
			 * remainder is reconstructed to mirror the child side,
			 * with the output format taken from the runs quoted
			 * earlier in the thread.) */
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (__sync_val_compare_and_swap(&counter->cacheline1,
								   count, count+1)
				       != count);
				count += 2;
			}
			break;
		case LOCKUNSHARE:
			while (count < MAX_BOUNCES) {
				/* Spin waiting for other side to change it. */
				while (counter->cacheline2 != count);
				__sync_lock_test_and_set(&counter->cacheline1,
							 count + 1);
				count += 2;
			}
			break;
		}
	}

	gettimeofday(&stop, NULL);
	usec = (stop.tv_sec - start.tv_sec) * 1000000ULL
		+ stop.tv_usec - start.tv_usec;

	printf("CPU %s: %s cacheline: %llu usec\n",
	       argv[2 + child], argv[1], (unsigned long long)usec);
	return 0;
}

Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-11 Thread Michael S. Tsirkin
On Tue, May 11, 2010 at 10:27:22PM +0300, Avi Kivity wrote:
 On 05/07/2010 06:23 AM, Rusty Russell wrote:
 On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:

 On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
  
 +	/* We publish the last-seen used index at the end of the available ring.
 +	 * It is at the end for backwards compatibility. */
 +	vr->last_used_idx = &(vr)->avail->ring[num];
 +	/* Verify that last used index does not spill over the used ring. */
 +	BUG_ON((void *)vr->last_used_idx +
 +	       sizeof *vr->last_used_idx > (void *)vr->used);
  }


 Shouldn't this be on its own cache line?
  
 It's next to the available ring; because that's where the guest publishes
 its data.  That whole page is guest-write, host-read.

 Putting it on a cacheline by itself would be a slight pessimization; the host
 cpu would have to get the last_used_idx cacheline and the avail descriptor
 cacheline every time.  This way, they are sometimes the same cacheline.


 If one peer writes the tail of the available ring, while the other reads  
 last_used_idx, it's a false bounce, no?

 Having things on the same cacheline is only worthwhile if they are  
 accessed at the same time.

Yes, this is what I was trying to say.
avail flags and used index *are* accessed at the same time, so
there could be an advantage to sharing a cache line there.

All this should be kept in mind if we ever do
VIRTIO_RING_F_NEW_LAYOUT.

-- 
MST


Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-06 Thread Avi Kivity

On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
> +	/* We publish the last-seen used index at the end of the available ring.
> +	 * It is at the end for backwards compatibility. */
> +	vr->last_used_idx = &(vr)->avail->ring[num];
> +	/* Verify that last used index does not spill over the used ring. */
> +	BUG_ON((void *)vr->last_used_idx +
> +	       sizeof *vr->last_used_idx > (void *)vr->used);
>  }

Shouldn't this be on its own cache line?

One way to achieve this is to allocate it at the end.

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH RFC] virtio: put last seen used index into ring itself

2010-05-06 Thread Rusty Russell
On Thu, 6 May 2010 07:30:00 pm Avi Kivity wrote:
 On 05/05/2010 11:58 PM, Michael S. Tsirkin wrote:
 +	/* We publish the last-seen used index at the end of the available ring.
 +	 * It is at the end for backwards compatibility. */
 +	vr->last_used_idx = &(vr)->avail->ring[num];
 +	/* Verify that last used index does not spill over the used ring. */
 +	BUG_ON((void *)vr->last_used_idx +
 +	       sizeof *vr->last_used_idx > (void *)vr->used);
  }
 
 
 Shouldn't this be on its own cache line?

It's next to the available ring; because that's where the guest publishes
its data.  That whole page is guest-write, host-read.

Putting it on a cacheline by itself would be a slight pessimization; the host
cpu would have to get the last_used_idx cacheline and the avail descriptor
cacheline every time.  This way, they are sometimes the same cacheline.
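
As a layout sketch (following the patch hunk above):

struct vring_avail {
	__u16 flags;
	__u16 idx;
	__u16 ring[];	/* num entries, all guest-write/host-read */
	/* ring[num], one slot past the ring, is where last_used_idx
	 * lives: the same page, and often the same cacheline, as the
	 * tail of the ring. */
};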

Hope that clarifies,
Rusty.