Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Rick Jones


Just to block netperf you can send it SIGSTOP :)



Clever :)  One could, I suppose, achieve the same result by making the remote 
receive socket buffer size smaller than the UDP message size and then not worry 
about having to learn the netserver's PID to send it the SIGSTOP.  I *think* the 
semantics will be substantially the same?  Both will be drops at the socket 
buffer, albeit for different reasons.  The too-small socket buffer version, 
though, doesn't require one to remember to wake the netserver in time to have it 
send results back to netperf without netperf tossing-up an error and not 
reporting any statistics.


Also, netperf has a no-control-connection mode where you can, in effect, cause 
it to send UDP datagrams out into the void - I put it there to allow folks to 
test against the likes of echo, discard, and chargen services, but it may have a 
use here.  It requires that one specify the destination IP and port for the data 
connection explicitly via the test-specific options.  In that mode the only 
stats reported are those local to netperf rather than netserver.


happy benchmarking,

rick jones
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Michael S. Tsirkin
On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
 
 Just to block netperf you can send it SIGSTOP :)
 
 
 Clever :)  One could I suppose achieve the same result by making the
 remote receive socket buffer size smaller than the UDP message size
 and then not worry about having to learn the netserver's PID to send
 it the SIGSTOP.  I *think* the semantics will be substantially the
 same?

If you could set it, yes. But Linux, at least, ignores
any value substantially smaller than 1K, and then
multiplies that by 2:

case SO_RCVBUF:
/* Don't error on this BSD doesn't and if you think
   about it this is right. Otherwise apps have to
   play 'guess the biggest size' games. RCVBUF/SNDBUF
   are treated in BSD as hints */

if (val > sysctl_rmem_max)
val = sysctl_rmem_max;
set_rcvbuf: 
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;

/*
 * We double it on the way in to account for
 * struct sk_buff etc. overhead.   Applications
 * assume that the SO_RCVBUF setting they make will
 * allow that much actual data to be received on that
 * socket.
 *
 * Applications are unaware that struct sk_buff and
 * other overheads allocate from the receive buffer
 * during socket buffer allocation. 
 *
 * And after considering the possible alternatives,
 * returning the value we actually used in getsockopt
 * is the most desirable behavior.
 */ 
if ((val * 2) < SOCK_MIN_RCVBUF)
sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
else
sk->sk_rcvbuf = val * 2;

and

/*  
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */ 
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))


  Both will be drops at the socket buffer, albeit for
 different reasons.  The too-small socket buffer version, though,
 doesn't require one to remember to wake the netserver in time to have
 it send results back to netperf without netperf tossing-up an error
 and not reporting any statistics.
 
 Also, netperf has a no-control-connection mode where you can, in
 effect, cause it to send UDP datagrams out into the void - I put it
 there to allow folks to test against the likes of echo, discard, and
 chargen services, but it may have a use here.  It requires that one
 specify the destination IP and port for the data connection
 explicitly via the test-specific options.  In that mode the only
 stats reported are those local to netperf rather than netserver.

Ah, sounds perfect.

 happy benchmarking,
 
 rick jones



Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Rick Jones

Michael S. Tsirkin wrote:

On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:


Just to block netperf you can send it SIGSTOP :)



Clever :)  One could I suppose achieve the same result by making the
remote receive socket buffer size smaller than the UDP message size
and then not worry about having to learn the netserver's PID to send
it the SIGSTOP.  I *think* the semantics will be substantially the
same?



If you could set it, yes. But Linux, at least, ignores
any value substantially smaller than 1K, and then
multiplies that by 2:

case SO_RCVBUF:
/* Don't error on this BSD doesn't and if you think
   about it this is right. Otherwise apps have to
   play 'guess the biggest size' games. RCVBUF/SNDBUF
   are treated in BSD as hints */

if (val > sysctl_rmem_max)
val = sysctl_rmem_max;
set_rcvbuf: 
sk->sk_userlocks |= SOCK_RCVBUF_LOCK;


/*
 * We double it on the way in to account for
 * struct sk_buff etc. overhead.   Applications
 * assume that the SO_RCVBUF setting they make will
 * allow that much actual data to be received on that
 * socket.
 *
 * Applications are unaware that struct sk_buff and
 * other overheads allocate from the receive buffer
 * during socket buffer allocation. 
 *

 * And after considering the possible alternatives,
 * returning the value we actually used in getsockopt
 * is the most desirable behavior.
 */ 
if ((val * 2) < SOCK_MIN_RCVBUF)
sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
else
sk->sk_rcvbuf = val * 2;

and

/*  
 * Since sk_rmem_alloc sums skb->truesize, even a small frame might need

 * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
 */ 
#define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))


Pity - seems to work back on 2.6.26:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     2882334      0    2361.17
   256           10.00           0              0.00

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 GNU/Linux

Still, even with that (or SIGSTOP) we don't really know where the packets were 
dropped, right?  There is no guarantee they weren't dropped before they got to 
the socket buffer.


happy benchmarking,
rick jones

PS - here is with a -S 1024 option:

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928    1024   10.00     1679269      0    1375.64
  2048           10.00     1490662          1221.13

showing that there is a decent chance that many of the frames were dropped at 
the socket buffer, but not all - I suppose I could/should be checking netstat 
stats... :)


And just a little more, only because I was curious :)

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     1869134      0     384.29
262142           10.00     1869134            384.29

raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost 
(127.0.0.1) port 0 AF_INET : histogram

Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

124928     257   10.00     3076363      0     632.49
   256           10.00           0              0.00



Re: Flow Control and Port Mirroring Revisited

2011-01-24 Thread Michael S. Tsirkin
On Mon, Jan 24, 2011 at 11:01:45AM -0800, Rick Jones wrote:
 Michael S. Tsirkin wrote:
 On Mon, Jan 24, 2011 at 10:27:55AM -0800, Rick Jones wrote:
 
 Just to block netperf you can send it SIGSTOP :)
 
 
 Clever :)  One could I suppose achieve the same result by making the
 remote receive socket buffer size smaller than the UDP message size
 and then not worry about having to learn the netserver's PID to send
 it the SIGSTOP.  I *think* the semantics will be substantially the
 same?
 
 
 If you could set it, yes. But Linux, at least, ignores
 any value substantially smaller than 1K, and then
 multiplies that by 2:
 
 case SO_RCVBUF:
 /* Don't error on this BSD doesn't and if you think
about it this is right. Otherwise apps have to
play 'guess the biggest size' games. RCVBUF/SNDBUF
are treated in BSD as hints */
 
 if (val > sysctl_rmem_max)
 val = sysctl_rmem_max;
 set_rcvbuf: sk->sk_userlocks |=
 SOCK_RCVBUF_LOCK;
 
 /*
  * We double it on the way in to account for
  * struct sk_buff etc. overhead.   Applications
  * assume that the SO_RCVBUF setting they make will
  * allow that much actual data to be received on that
  * socket.
  *
  * Applications are unaware that struct sk_buff and
  * other overheads allocate from the receive buffer
  * during socket buffer allocation.
 *
  * And after considering the possible alternatives,
  * returning the value we actually used in getsockopt
  * is the most desirable behavior.
  */
 if ((val * 2) < SOCK_MIN_RCVBUF)
 sk->sk_rcvbuf = SOCK_MIN_RCVBUF;
 else
 sk->sk_rcvbuf = val * 2;
 
 and
 
 /*
  * Since sk_rmem_alloc sums skb->truesize, even a small frame might need
  * sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak
  */
 #define SOCK_MIN_RCVBUF (2048 + sizeof(struct sk_buff))
 
 Pity - seems to work back on 2.6.26:

Hmm, that code is there at least as far back as 2.6.12.

 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 1024
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec
 
 124928    1024   10.00     2882334      0    2361.17
    256           10.00           0              0.00
 
 raj@tardy:~/netperf2_trunk$ uname -a
 Linux tardy 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64 
 GNU/Linux
 
 Still, even with that (or SIGSTOP) we don't really know where the
 packets were dropped, right?  There is no guarantee they weren't
 dropped before they got to the socket buffer.
 
 happy benchmarking,
 rick jones

Right. Better to send to a port with no socket listening there;
that would drop the packet at an early (if not the earliest
possible) opportunity.

 PS - here is with a -S 1024 option:
 
 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1024 -m 1024
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec
 
 124928    1024   10.00     1679269      0    1375.64
   2048           10.00     1490662          1221.13
 
 showing that there is a decent chance that many of the frames were
 dropped at the socket buffer, but not all - I suppose I could/should
 be checking netstat stats... :)
 
 And just a little more, only because I was curious :)
 
 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1M -m 257
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec
 
 124928     257   10.00     1869134      0     384.29
 262142           10.00     1869134            384.29
 
 raj@tardy:~/netperf2_trunk$ src/netperf -t UDP_STREAM -- -S 1 -m 257
 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
 localhost (127.0.0.1) port 0 AF_INET : histogram
 Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec
 
 124928     257   10.00     3076363      0     632.49
    256           10.00           0              0.00

Re: Flow Control and Port Mirroring Revisited

2011-01-23 Thread Michael S. Tsirkin
On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote:
 On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
  On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
   On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
 [ Trimmed Eric from CC list as vger was complaining that it is too 
 long ]
 
 On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
  So it won't be all that simple to implement well, and before we 
  try,
  I'd like to know whether there are applications that are helped
  by it. For example, we could try to measure latency at various
  pps and see whether the backpressure helps. netperf has -b, -w
  flags which might help these measurements.
  
  Those options are enabled when one adds --enable-burst to the
  pre-compilation ./configure  of netperf (one doesn't have to
  recompile netserver).  However, if one is also looking at latency
  statistics via the -j option in the top-of-trunk, or simply at the
  histogram with --enable-histogram on the ./configure and a verbosity
  level of 2 (global -v 2) then one wants the very top of trunk
  netperf from:
 
 Hi,
 
 I have constructed a test where I run an un-paced  UDP_STREAM test in
 one guest and a paced omni rr test in another guest at the same time.

Hmm, what is this supposed to measure?  Basically each time you run an
un-paced UDP_STREAM you get some random load on the network.
You can't tell what it was exactly, only that it was between
the send and receive throughput.
   
   Rick mentioned in another email that I messed up my test parameters a bit,
   so I will re-run the tests, incorporating his suggestions.
   
   What I was attempting to measure was the effect of an unpaced UDP_STREAM
   on the latency of more moderated traffic, because I am interested in
   what effect an abusive guest has on other guests and how that may be
   mitigated.
   
   Could you suggest some tests that you feel are more appropriate?
  
  Yes. To rephrase my concern in these terms: besides the malicious guest
  you have another piece of software on the host (netperf) that interferes with
  the traffic, and it cooperates with the malicious guest.
  Right?
 
 Yes, that is the scenario in this test.

Yes, but I think that you want to put some controlled load on the host.
Let's assume that we improve the speed somehow and now you can push more
bytes per second without loss.  Result might be a regression in your
test because you let the guest push as much as it can and suddenly it
can push more data through.  OTOH with packet loss the load on host is
anywhere in between send and receive throughput: there's no easy way to
measure it from netperf: the earlier some buffers overrun, the earlier
the packets get dropped and the less the load on host.

This is why I say that to get a specific
load on host you want to limit the sender
to a specific BW and then either
- make sure packet loss % is close to 0.
- make sure packet loss % is close to 100%.

  IMO for a malicious guest you would send
  UDP packets that then get dropped by the host.
  
  For example block netperf in host so that
  it does not consume packets from the socket.
 
 I'm more interested in rate-limiting netperf than blocking it.

Well, I mean netperf on the host.

 But in any case, do you mean use iptables or tc based on
 classification made by net_cls?

Just to block netperf you can send it SIGSTOP :)

-- 
MST


Re: Flow Control and Port Mirroring Revisited

2011-01-23 Thread Simon Horman
On Sun, Jan 23, 2011 at 12:39:02PM +0200, Michael S. Tsirkin wrote:
 On Sun, Jan 23, 2011 at 05:38:49PM +1100, Simon Horman wrote:
  On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
   On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:

[snip]

 Hmm, what is this supposed to measure?  Basically each time you run an
 un-paced UDP_STREAM you get some random load on the network.
 You can't tell what it was exactly, only that it was between
 the send and receive throughput.

Rick mentioned in another email that I messed up my test parameters a 
bit,
so I will re-run the tests, incorporating his suggestions.

What I was attempting to measure was the effect of an unpaced UDP_STREAM
on the latency of more moderated traffic, because I am interested in
what effect an abusive guest has on other guests and how that may be
mitigated.

Could you suggest some tests that you feel are more appropriate?
   
   Yes. To rephrase my concern in these terms: besides the malicious guest
   you have another piece of software on the host (netperf) that interferes with
   the traffic, and it cooperates with the malicious guest.
   Right?
  
  Yes, that is the scenario in this test.
 
 Yes, but I think that you want to put some controlled load on the host.
 Let's assume that we improve the speed somehow and now you can push more
 bytes per second without loss.  Result might be a regression in your
 test because you let the guest push as much as it can and suddenly it
 can push more data through.  OTOH with packet loss the load on host is
 anywhere in between send and receive throughput: there's no easy way to
 measure it from netperf: the earlier some buffers overrun, the earlier
 the packets get dropped and the less the load on host.
 
 This is why I say that to get a specific
 load on host you want to limit the sender
 to a specific BW and then either
 - make sure packet loss % is close to 0.
 - make sure packet loss % is close to 100%.

Thanks, and sorry for being a bit slow.  I now see what you have
been getting at with regards to limiting the tests.
I will see about getting some numbers based on your suggestions.



Re: Flow Control and Port Mirroring Revisited

2011-01-22 Thread Michael S. Tsirkin
On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
 On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
  On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
   [ Trimmed Eric from CC list as vger was complaining that it is too long ]
   
   On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
So it won't be all that simple to implement well, and before we try,
I'd like to know whether there are applications that are helped
by it. For example, we could try to measure latency at various
pps and see whether the backpressure helps. netperf has -b, -w
flags which might help these measurements.

Those options are enabled when one adds --enable-burst to the
pre-compilation ./configure  of netperf (one doesn't have to
recompile netserver).  However, if one is also looking at latency
statistics via the -j option in the top-of-trunk, or simply at the
histogram with --enable-histogram on the ./configure and a verbosity
level of 2 (global -v 2) then one wants the very top of trunk
netperf from:
   
   Hi,
   
   I have constructed a test where I run an un-paced  UDP_STREAM test in
   one guest and a paced omni rr test in another guest at the same time.
  
  Hmm, what is this supposed to measure?  Basically each time you run an
  un-paced UDP_STREAM you get some random load on the network.
  You can't tell what it was exactly, only that it was between
  the send and receive throughput.
 
 Rick mentioned in another email that I messed up my test parameters a bit,
 so I will re-run the tests, incorporating his suggestions.
 
 What I was attempting to measure was the effect of an unpaced UDP_STREAM
 on the latency of more moderated traffic, because I am interested in
 what effect an abusive guest has on other guests and how that may be
 mitigated.
 
 Could you suggest some tests that you feel are more appropriate?

Yes. To rephrase my concern in these terms: besides the malicious guest
you have another piece of software on the host (netperf) that interferes with
the traffic, and it cooperates with the malicious guest.
Right?

IMO for a malicious guest you would send
UDP packets that then get dropped by the host.

For example block netperf in host so that
it does not consume packets from the socket.





Re: Flow Control and Port Mirroring Revisited

2011-01-22 Thread Simon Horman
On Sat, Jan 22, 2011 at 11:57:42PM +0200, Michael S. Tsirkin wrote:
 On Sat, Jan 22, 2011 at 10:11:52AM +1100, Simon Horman wrote:
  On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
   On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
[ Trimmed Eric from CC list as vger was complaining that it is too long 
]

On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
 So it won't be all that simple to implement well, and before we try,
 I'd like to know whether there are applications that are helped
 by it. For example, we could try to measure latency at various
 pps and see whether the backpressure helps. netperf has -b, -w
 flags which might help these measurements.
 
 Those options are enabled when one adds --enable-burst to the
 pre-compilation ./configure  of netperf (one doesn't have to
 recompile netserver).  However, if one is also looking at latency
 statistics via the -j option in the top-of-trunk, or simply at the
 histogram with --enable-histogram on the ./configure and a verbosity
 level of 2 (global -v 2) then one wants the very top of trunk
 netperf from:

Hi,

I have constructed a test where I run an un-paced  UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.
   
   Hmm, what is this supposed to measure?  Basically each time you run an
   un-paced UDP_STREAM you get some random load on the network.
   You can't tell what it was exactly, only that it was between
   the send and receive throughput.
  
  Rick mentioned in another email that I messed up my test parameters a bit,
  so I will re-run the tests, incorporating his suggestions.
  
  What I was attempting to measure was the effect of an unpaced UDP_STREAM
  on the latency of more moderated traffic, because I am interested in
  what effect an abusive guest has on other guests and how that may be
  mitigated.
  
  Could you suggest some tests that you feel are more appropriate?
 
 Yes. To rephrase my concern in these terms: besides the malicious guest
 you have another piece of software on the host (netperf) that interferes with
 the traffic, and it cooperates with the malicious guest.
 Right?

Yes, that is the scenario in this test.

 IMO for a malicious guest you would send
 UDP packets that then get dropped by the host.
 
 For example block netperf in host so that
 it does not consume packets from the socket.

I'm more interested in rate-limiting netperf than blocking it.
But in any case, do you mean use iptables or tc based on
classification made by net_cls?



Re: Flow Control and Port Mirroring Revisited

2011-01-21 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
 [ Trimmed Eric from CC list as vger was complaining that it is too long ]
 
 On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
  So it won't be all that simple to implement well, and before we try,
  I'd like to know whether there are applications that are helped
  by it. For example, we could try to measure latency at various
  pps and see whether the backpressure helps. netperf has -b, -w
  flags which might help these measurements.
  
  Those options are enabled when one adds --enable-burst to the
  pre-compilation ./configure  of netperf (one doesn't have to
  recompile netserver).  However, if one is also looking at latency
  statistics via the -j option in the top-of-trunk, or simply at the
  histogram with --enable-histogram on the ./configure and a verbosity
  level of 2 (global -v 2) then one wants the very top of trunk
  netperf from:
 
 Hi,
 
 I have constructed a test where I run an un-paced  UDP_STREAM test in
 one guest and a paced omni rr test in another guest at the same time.

Hmm, what is this supposed to measure?  Basically each time you run an
un-paced UDP_STREAM you get some random load on the network.
You can't tell what it was exactly, only that it was between
the send and receive throughput.

 Briefly I get the following results from the omni test..
 
 1. Omni test only:MEAN_LATENCY=272.00
 2. Omni and stream test:  MEAN_LATENCY=3423.00
 3. cpu and net_cls group: MEAN_LATENCY=493.00
As per 2 plus cgroups are created for each guest
and guest tasks added to the groups
 4. 100Mbit/s class:   MEAN_LATENCY=273.00
As per 3 plus the net_cls groups each have a 100MBit/s HTB class
 5. cpu.shares=128:MEAN_LATENCY=652.00
As per 4 plus the cpu groups have cpu.shares set to 128
 6. Busy CPUS: MEAN_LATENCY=15126.00
As per 5 but the CPUs are made busy using a simple shell while loop
 
 There is a bit of noise in the results as the two netperf invocations
 aren't started at exactly the same moment.
 
 For reference, my netperf invocations are:
 netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
 netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr 
 -k foo -b 1 -w 200 -m 200
 
 foo contains
 PROTOCOL
 THROUGHPUT,THROUGHPUT_UNITS
 LOCAL_SEND_THROUGHPUT
 LOCAL_RECV_THROUGHPUT
 REMOTE_SEND_THROUGHPUT
 REMOTE_RECV_THROUGHPUT
 RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
 P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
 LOCAL_CPU_UTIL,REMOTE_CPU_UTIL


Re: Flow Control and Port Mirroring Revisited

2011-01-21 Thread Rick Jones

I have constructed a test where I run an un-paced  UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.



Hmm, what is this supposed to measure?  Basically each time you run an
un-paced UDP_STREAM you get some random load on the network.


Well, if the netperf is (effectively) pinned to a given CPU, presumably it would 
be trying to generate UDP datagrams at the same rate each time.  Indeed, though, 
there is no guarantee that rate would consistently get through each time.


But then, that is where one can use the confidence intervals options to get an 
idea by how much the rate varied.
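
For reference, that confidence-interval machinery is driven by netperf's global -I and -i options; a sketch, with the host address carried over from the invocations earlier in the thread:

```shell
# Sketch: ask netperf to repeat the test until it is 99% confident the
# reported result is within a 5% interval (+/- 2.5%), running between
# 3 and 10 iterations.  -I sets confidence level and interval width,
# -i sets max,min iteration counts.
netperf -t UDP_STREAM -H 172.17.60.216 -I 99,5 -i 10,3
```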


rick jones


Re: Flow Control and Port Mirroring Revisited

2011-01-21 Thread Simon Horman
On Fri, Jan 21, 2011 at 11:59:30AM +0200, Michael S. Tsirkin wrote:
 On Thu, Jan 20, 2011 at 05:38:33PM +0900, Simon Horman wrote:
  [ Trimmed Eric from CC list as vger was complaining that it is too long ]
  
  On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
   So it won't be all that simple to implement well, and before we try,
   I'd like to know whether there are applications that are helped
   by it. For example, we could try to measure latency at various
   pps and see whether the backpressure helps. netperf has -b, -w
   flags which might help these measurements.
   
   Those options are enabled when one adds --enable-burst to the
   pre-compilation ./configure  of netperf (one doesn't have to
   recompile netserver).  However, if one is also looking at latency
   statistics via the -j option in the top-of-trunk, or simply at the
   histogram with --enable-histogram on the ./configure and a verbosity
   level of 2 (global -v 2) then one wants the very top of trunk
   netperf from:
  
  Hi,
  
  I have constructed a test where I run an un-paced  UDP_STREAM test in
  one guest and a paced omni rr test in another guest at the same time.
 
 Hmm, what is this supposed to measure?  Basically each time you run an
 un-paced UDP_STREAM you get some random load on the network.
 You can't tell what it was exactly, only that it was between
 the send and receive throughput.

Rick mentioned in another email that I messed up my test parameters a bit,
so I will re-run the tests, incorporating his suggestions.

What I was attempting to measure was the effect of an unpaced UDP_STREAM
on the latency of more moderated traffic, because I am interested in
what effect an abusive guest has on other guests and how that may be
mitigated.

Could you suggest some tests that you feel are more appropriate?



Re: Flow Control and Port Mirroring Revisited

2011-01-20 Thread Simon Horman
[ Trimmed Eric from CC list as vger was complaining that it is too long ]

On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
 So it won't be all that simple to implement well, and before we try,
 I'd like to know whether there are applications that are helped
 by it. For example, we could try to measure latency at various
 pps and see whether the backpressure helps. netperf has -b, -w
 flags which might help these measurements.
 
 Those options are enabled when one adds --enable-burst to the
 pre-compilation ./configure  of netperf (one doesn't have to
 recompile netserver).  However, if one is also looking at latency
 statistics via the -j option in the top-of-trunk, or simply at the
 histogram with --enable-histogram on the ./configure and a verbosity
 level of 2 (global -v 2) then one wants the very top of trunk
 netperf from:

Hi,

I have constructed a test where I run an un-paced  UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.
Briefly, I get the following results from the omni test:

1. Omni test only:  MEAN_LATENCY=272.00
2. Omni and stream test:MEAN_LATENCY=3423.00
3. cpu and net_cls group:   MEAN_LATENCY=493.00
   As per 2 plus cgroups are created for each guest
   and guest tasks added to the groups
4. 100Mbit/s class: MEAN_LATENCY=273.00
   As per 3 plus the net_cls groups each have a 100MBit/s HTB class
5. cpu.shares=128:  MEAN_LATENCY=652.00
   As per 4 plus the cpu groups have cpu.shares set to 128
6. Busy CPUS:   MEAN_LATENCY=15126.00
   As per 5 but the CPUs are made busy using a simple shell while loop
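
The 100Mbit/s HTB class of scenario 4 can be set up along these lines; this is a sketch, and the device name, class ids, cgroup path, and classid value below are assumptions rather than the configuration actually used in the test:

```shell
# Sketch of an HTB + net_cls setup like scenario 4 (names and ids are
# assumptions).  Attach an HTB qdisc, add a 100 Mbit/s class, and steer
# traffic classified by the net_cls cgroup into that class.

DEV=eth0                       # assumed guest-facing interface

# Root HTB qdisc, defaulting unmatched traffic to class 1:10
tc qdisc add dev "$DEV" root handle 1: htb default 10

# 100 Mbit/s class for the guest's traffic
tc class add dev "$DEV" parent 1: classid 1:10 htb rate 100mbit ceil 100mbit

# Classify packets by the sending task's net_cls cgroup
tc filter add dev "$DEV" parent 1: protocol ip prio 10 handle 1: cgroup

# Tag the guest's tasks: 0x00010010 encodes major:minor 1:10
echo 0x00010010 > /sys/fs/cgroup/net_cls/guest1/net_cls.classid
```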

There is a bit of noise in the results as the two netperf invocations
aren't started at exactly the same moment.

For reference, my netperf invocations are:
netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr 
-k foo -b 1 -w 200 -m 200

foo contains
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL



Re: Flow Control and Port Mirroring Revisited

2011-01-20 Thread Rick Jones

Simon Horman wrote:

[ Trimmed Eric from CC list as vger was complaining that it is too long ]
...
I have constructed a test where I run an un-paced UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.
Briefly I get the following results from the omni test:

...



There is a bit of noise in the results as the two netperf invocations
aren't started at exactly the same moment.

For reference, my netperf invocations are:
netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr 
-k foo -b 1 -w 200 -m 200


Since the -b and -w are in the test-specific portion, this test was not actually 
 paced. The -w will have been ignored entirely (IIRC) and the -b will have 
attempted to set the burst size of a --enable-burst ./configured netperf.  If 
netperf was ./configured that way, it will have had two rr transactions in 
flight at one time - the regular one and then the one additional from the -b 
option.  If netperf was not ./configured with --enable-burst then a warning 
message should have been emitted.


Also, I am guessing you wanted TCP_NODELAY set, and that is -D but not a global 
-D.  I'm reasonably confident the -m 200 will have been ignored, but it would be 
best to drop it. So, I think your second line needs to be:


netperf.omni -p 12866 -c -C -H  172.17.60.216 -t omni -j -v 2 -b 1 -w 200 -- -r 
1 -d rr -k foo -D


If you want the request and response sizes to be 200 bytes, use -r 200 
(test-specific).


Also, if you ./configure with --enable-omni first, that netserver will 
understand both omni and non-omni tests at the same time and you don't have to 
have a second netserver on a different control port.  You can also go into 
config.h after the ./configure and unset WANT_MIGRATION and then UDP_STREAM in 
netperf will be the true classic UDP_STREAM code rather than the migrated to 
omni path.



foo contains
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL


As the -k file parsing option didn't care until recently (within the hour or 
so), I think it didn't matter that you had more than four lines (assuming that 
is a verbatim cat of foo).  However, if you pull the *current* top of trunk, it 
will probably start to care - I'm in the midst of adding support for direct 
output selection in the -k, -o and -O options and also cleaning-up the omni 
printing code to the point where there is only the one routine parsing the 
output selection file.  Currently that is the one for human output, which has 
a four line restriction.  I will try to make it smarter as I go.


happy benchmarking,

rick jones


Re: Flow Control and Port Mirroring Revisited

2011-01-19 Thread Simon Horman
On Tue, Jan 18, 2011 at 10:13:33PM +0200, Michael S. Tsirkin wrote:
 On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
  So it won't be all that simple to implement well, and before we try,
  I'd like to know whether there are applications that are helped
  by it. For example, we could try to measure latency at various
  pps and see whether the backpressure helps. netperf has -b, -w
  flags which might help these measurements.
  
  Those options are enabled when one adds --enable-burst to the
  pre-compilation ./configure  of netperf (one doesn't have to
  recompile netserver).  However, if one is also looking at latency
  statistics via the -j option in the top-of-trunk, or simply at the
  histogram with --enable-histogram on the ./configure and a verbosity
  level of 2 (global -v 2) then one wants the very top of trunk
  netperf from:
  
  http://www.netperf.org/svn/netperf2/trunk
  
  to get the recently added support for accurate (netperf level) RTT
  measurements on burst-mode request/response tests.
  
  happy benchmarking,
  
  rick jones

Thanks Rick, that is really helpful.

  PS - the enhanced latency statistics from -j are only available in
  the omni version of the TCP_RR test.  To get that add a
  --enable-omni to the ./configure - and in this case both netperf and
  netserver have to be recompiled.
 
 
 Is this TCP only? I would love to get latency data from UDP as well.

At a glance, -- -T UDP is what you are after.
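A hedged sketch of such an invocation, assuming a top-of-trunk netperf built with --enable-omni; the host address is the one used elsewhere in this thread, and foo is an output-selection file:

```shell
# Omni request/response over UDP (-T UDP is test-specific),
# with -j latency statistics and keyval output selected by foo.
netperf -t omni -j -v 2 -H 172.17.60.216 -- -T UDP -d rr -r 1 -k foo
```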


Re: Flow Control and Port Mirroring Revisited

2011-01-18 Thread Rick Jones

So it won't be all that simple to implement well, and before we try,
I'd like to know whether there are applications that are helped
by it. For example, we could try to measure latency at various
pps and see whether the backpressure helps. netperf has -b, -w
flags which might help these measurements.


Those options are enabled when one adds --enable-burst to the pre-compilation 
./configure  of netperf (one doesn't have to recompile netserver).  However, if 
one is also looking at latency statistics via the -j option in the top-of-trunk, 
or simply at the histogram with --enable-histogram on the ./configure and a 
verbosity level of 2 (global -v 2) then one wants the very top of trunk netperf 
from:


http://www.netperf.org/svn/netperf2/trunk

to get the recently added support for accurate (netperf level) RTT measurements 
on burst-mode request/response tests.


happy benchmarking,

rick jones

PS - the enhanced latency statistics from -j are only available in the omni 
version of the TCP_RR test.  To get that add a --enable-omni to the ./configure 
- and in this case both netperf and netserver have to be recompiled.  For very 
basic output one can peruse the output of:


src/netperf -t omni -- -O /?

and then pick those outputs of interest and put them into an output selection 
file which one then passes to either (test-specific) -o, -O or -k to get CSV, 
Human or keyval output respectively.  E.G.


raj@tardy:~/netperf2_trunk$ cat foo
THROUGHPUT,THROUGHPUT_UNITS
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY

when foo is passed to -o one will get those all on one line of CSV.  To -O one 
gets three lines of more netperf-classic-like human readable output, and when 
one passes that to -k one gets a string of keyval output a la:


raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 
AF_INET : histogram

THROUGHPUT=29454.12
THROUGHPUT_UNITS=Trans/s
RT_LATENCY=33.951
MIN_LATENCY=19
MEAN_LATENCY=32.00
MAX_LATENCY=126
P50_LATENCY=32
P90_LATENCY=38
P99_LATENCY=41
STDDEV_LATENCY=5.46

Histogram of request/response times
UNIT_USEC :0:0:0:0:0:0:0:0:0:0
TEN_USEC  :0: 3553: 45244: 237790: 7859:   86:   10:3:0:0
HUNDRED_USEC  :0:2:0:0:0:0:0:0:0:0
UNIT_MSEC :0:0:0:0:0:0:0:0:0:0
TEN_MSEC  :0:0:0:0:0:0:0:0:0:0
HUNDRED_MSEC  :0:0:0:0:0:0:0:0:0:0
UNIT_SEC  :0:0:0:0:0:0:0:0:0:0
TEN_SEC   :0:0:0:0:0:0:0:0:0:0
100_SECS: 0
HIST_TOTAL:  294547



Re: Flow Control and Port Mirroring Revisited

2011-01-18 Thread Michael S. Tsirkin
On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
 So it won't be all that simple to implement well, and before we try,
 I'd like to know whether there are applications that are helped
 by it. For example, we could try to measure latency at various
 pps and see whether the backpressure helps. netperf has -b, -w
 flags which might help these measurements.
 
 Those options are enabled when one adds --enable-burst to the
 pre-compilation ./configure  of netperf (one doesn't have to
 recompile netserver).  However, if one is also looking at latency
 statistics via the -j option in the top-of-trunk, or simply at the
 histogram with --enable-histogram on the ./configure and a verbosity
 level of 2 (global -v 2) then one wants the very top of trunk
 netperf from:
 
 http://www.netperf.org/svn/netperf2/trunk
 
 to get the recently added support for accurate (netperf level) RTT
 measurements on burst-mode request/response tests.
 
 happy benchmarking,
 
 rick jones
 
 PS - the enhanced latency statistics from -j are only available in
 the omni version of the TCP_RR test.  To get that add a
 --enable-omni to the ./configure - and in this case both netperf and
 netserver have to be recompiled.


Is this TCP only? I would love to get latency data from UDP as well.

  For very basic output one can
 peruse the output of:
 
 src/netperf -t omni -- -O /?
 
 and then pick those outputs of interest and put them into an output
 selection file which one then passes to either (test-specific) -o,
 -O or -k to get CSV, Human or keyval output respectively.  E.G.
 
 raj@tardy:~/netperf2_trunk$ cat foo
 THROUGHPUT,THROUGHPUT_UNITS
 RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
 P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
 
 when foo is passed to -o one will get those all on one line of CSV.
 To -O one gets three lines of more netperf-classic-like human
 readable output, and when one passes that to -k one gets a string of
 keyval output a la:
 
 raj@tardy:~/netperf2_trunk$ src/netperf -t omni -j -v 2 -- -r 1 -d rr -k foo
 OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost
 (127.0.0.1) port 0 AF_INET : histogram
 THROUGHPUT=29454.12
 THROUGHPUT_UNITS=Trans/s
 RT_LATENCY=33.951
 MIN_LATENCY=19
 MEAN_LATENCY=32.00
 MAX_LATENCY=126
 P50_LATENCY=32
 P90_LATENCY=38
 P99_LATENCY=41
 STDDEV_LATENCY=5.46
 
 Histogram of request/response times
 UNIT_USEC :0:0:0:0:0:0:0:0:0:0
 TEN_USEC  :0: 3553: 45244: 237790: 7859:   86:   10:3:0:0
 HUNDRED_USEC  :0:2:0:0:0:0:0:0:0:0
 UNIT_MSEC :0:0:0:0:0:0:0:0:0:0
 TEN_MSEC  :0:0:0:0:0:0:0:0:0:0
 HUNDRED_MSEC  :0:0:0:0:0:0:0:0:0:0
 UNIT_SEC  :0:0:0:0:0:0:0:0:0:0
 TEN_SEC   :0:0:0:0:0:0:0:0:0:0
 100_SECS: 0
 HIST_TOTAL:  294547


Re: Flow Control and Port Mirroring Revisited

2011-01-18 Thread Rick Jones

Michael S. Tsirkin wrote:

On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:


PS - the enhanced latency statistics from -j are only available in
the omni version of the TCP_RR test.  To get that add a
--enable-omni to the ./configure - and in this case both netperf and
netserver have to be recompiled.


Is this TCP only? I would love to get latency data from UDP as well.


I believe it will work with UDP request response as well.  The omni test code 
strives to be protocol agnostic.  (I'm sure there are bugs of course, there 
always are.)


There is though the added complication of there being no specific matching of 
requests to responses.  The code as written takes advantage of TCP's in-order 
semantics and recovery from packet loss.  In a plain UDP_RR test, with one at 
a time transactions, if either the request or response are lost, data flow 
effectively stops there until the timer expires.  So, one has reasonable RTT 
numbers from before that point.  In a burst UDP RR test, the code doesn't know 
which request/response was lost and so the matching being done to get RTTs will 
be off by each lost datagram.  And if something were re-ordered the timestamps 
would be off even without a datagram loss event.


To fix that would require netperf do something it has not yet done in 18-odd 
years :)  That is actually echo something back from the netserver on the RR test 
- either an id, or a timestamp.  That means dirtying the buffers which means 
still more cache misses, from places other than the actual stack. Not beyond the 
realm of the possible, but it would be a bit of departure for normal operation 
(*) and could enforce a minimum request/response size beyond the present single 
byte (ok, perhaps only two or four bytes :).  But that, perhaps, is a discussion 
best left to netperf-talk at netperf.org.


happy benchmarking,

rick jones

(*) netperf does have the concept of reading from and/or dirtying buffers, 
put in back in the days of COW/page-remapping in HP-UX 9.0, but that was mainly 
to force COW and/or show the effect of the required data cache purges/flushes. 
As such it was made conditional on DIRTY being defined.



Re: Flow Control and Port Mirroring Revisited

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 07:37:30AM +0900, Simon Horman wrote:
 On Fri, Jan 14, 2011 at 08:54:15AM +0200, Michael S. Tsirkin wrote:
  On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
   On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
 On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
  On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au 
  wrote:
   On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
   On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
   
[ snip ]

 I know that everyone likes a nice netperf result but I agree 
 with
 Michael that this probably isn't the right question to be 
 asking.  I
 don't think that socket buffers are a real solution to the 
 flow
 control problem: they happen to provide that functionality 
 but it's
 more of a side effect than anything.  It's just that the 
 amount of
 memory consumed by packets in the queue(s) doesn't really 
 have any
 implicit meaning for flow control (think multiple physical 
 adapters,
 all with the same speed instead of a virtual device and a 
 physical
 device with wildly different speeds).  The analog in the 
 physical
 world that you're looking for would be Ethernet flow control.
 Obviously, if the question is limiting CPU or memory 
 consumption then
 that's a different story.
   
Point taken. I will see if I can control CPU (and thus memory) 
consumption
using cgroups and/or tc.
  
   I have found that I can successfully control the throughput using
   the following techniques
  
   1) Place a tc egress filter on dummy0
  
   2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and 
   then eth1,
      this is effectively the same as one of my hacks to the 
   datapath
      that I mentioned in an earlier mail. The result is that eth1
      paces the connection.
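(A sketch of the setup described above; the bridge name, OpenFlow port numbers, and rate are illustrative, and a tbf shaper stands in for whichever egress filter was actually used.)

```shell
# 1) Rate-limit dummy0 so anything sent through it is paced.
tc qdisc add dev dummy0 root tbf rate 100mbit burst 10k latency 50ms
# 2) Add a flow that outputs to dummy0 first and then eth1, so the
#    shaping on dummy0 ends up pacing the connection on eth1.
ovs-ofctl add-flow br0 "in_port=1,actions=output:2,output:3"
```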

This is actually a bug. This means that one slow connection will affect
fast ones. I intend to change the default for qemu to sndbuf=0 : this
will fix it but break your pacing. So pls do not count on this
behaviour.
   
   Do you have a patch I could test?
  
  You can (and users already can) just run qemu with sndbuf=0. But if you
  like, below.
 
 Thanks
 
   Further to this, I wonder if there is any interest in providing
   a method to switch the action order - using ovs-ofctl is a hack 
   imho -
   and/or switching the default action order for mirroring.
  
  I'm not sure that there is a way to do this that is correct in the
  generic case.  It's possible that the destination could be a VM 
  while
  packets are being mirrored to a physical device or we could be
  multicasting or some other arbitrarily complex scenario.  Just think
  of what a physical switch would do if it has ports with two 
  different
  speeds.
 
 Yes, I have considered that case. And I agree that perhaps there
 is no sensible default. But perhaps we could make it configurable 
 somehow?

The fix is at the application level. Run netperf with -b and -w flags to
limit the speed to a sensible value.
   
   Perhaps I should have stated my goals more clearly.
   I'm interested in situations where I don't control the application.
  
  Well an application that streams UDP without any throttling
  at the application level will break on a physical network, right?
  So I am not sure why should one try to make it work on the virtual one.
  
  But let's assume that you do want to throttle the guest
  for reasons such as QOS. The proper approach seems
  to be to throttle the sender, not have a dummy throttled
  receiver pacing it. Place the qemu process in the
  correct net_cls cgroup, set the class id and apply a rate limit?
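(A sketch of that suggestion using cgroup-v1 paths; the mount point, class id, device, rate, and QEMU_PID are all illustrative.)

```shell
# Tag traffic from the qemu process with class 1:10 via net_cls...
mkdir -p /sys/fs/cgroup/net_cls/guest0
echo 0x10010 > /sys/fs/cgroup/net_cls/guest0/net_cls.classid  # maps to 1:10
echo "$QEMU_PID" > /sys/fs/cgroup/net_cls/guest0/tasks
# ...and rate-limit that class with HTB on the egress device.
tc qdisc add dev eth1 root handle 1: htb
tc class add dev eth1 parent 1: classid 1:10 htb rate 100mbit
tc filter add dev eth1 parent 1: protocol ip handle 1: cgroup
```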
 
 I would like to be able to use a class to rate limit egress packets.
 That much works fine for me.
 
 What I would also like is for there to be back-pressure such that the guest
 doesn't consume lots of CPU, spinning, sending packets as fast as it can,
 almost all of which are dropped. That does seem like a lot of wasted
 CPU to me.
 
 Unfortunately there are several problems with this and I am fast concluding
 that I will need to use a CPU cgroup. Which does make some sense, as what I
 am really trying to limit here is CPU usage not network packet rates - even
 if the test using the CPU is netperf.  So long as the CPU usage can
 (mostly) be attributed to the guest using a cgroup should work fine.  And
 indeed seems to in my limited testing.
 
 One scenario in which I don't think it is 

Re: Flow Control and Port Mirroring Revisited

2011-01-17 Thread Michael S. Tsirkin
On Mon, Jan 17, 2011 at 10:26:25AM +1030, Rusty Russell wrote:
 On Mon, 17 Jan 2011 09:07:30 am Simon Horman wrote:
 
 [snip]
 
 I've been away, but what concerns me is that socket buffer limits are
 bypassed in various configurations, due to skb cloning.  We should probably
 drop such limits altogether, or fix them to be consistent.

Further, it looks like when the limits are not bypassed, they
easily result in deadlocks. For example, with
multiple tun devices attached to a single bridge in host,
if a number of these have their queues blocked,
others will reach the socket buffer limit and
traffic on the bridge will get blocked altogether.

It might be better to drop the limits altogether
unless we can fix them. Happily, as the limits are off by
default, doing so does not require kernel changes.

 Simple fix is as someone suggested here, to attach the clone.  That might
 seriously reduce your sk limit, though.  I haven't thought about it hard,
 but might it make sense to move ownership into skb_shared_info; ie. the
 data, rather than the skb head?
 
 Cheers,
 Rusty.

tracking data ownership might benefit others such as various zero-copy
strategies. It might need to be done per-page, though, not per-skb.

-- 
MST


Re: Flow Control and Port Mirroring Revisited

2011-01-16 Thread Simon Horman
On Fri, Jan 14, 2011 at 08:54:15AM +0200, Michael S. Tsirkin wrote:
 On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
  On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
   On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
 On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au 
 wrote:
  On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
  On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
   On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
  
   [ snip ]
   
I know that everyone likes a nice netperf result but I agree 
with
Michael that this probably isn't the right question to be 
asking.  I
don't think that socket buffers are a real solution to the flow
control problem: they happen to provide that functionality but 
it's
more of a side effect than anything.  It's just that the 
amount of
memory consumed by packets in the queue(s) doesn't really have 
any
implicit meaning for flow control (think multiple physical 
adapters,
all with the same speed instead of a virtual device and a 
physical
device with wildly different speeds).  The analog in the 
physical
world that you're looking for would be Ethernet flow control.
Obviously, if the question is limiting CPU or memory 
consumption then
that's a different story.
  
   Point taken. I will see if I can control CPU (and thus memory) 
   consumption
   using cgroups and/or tc.
 
  I have found that I can successfully control the throughput using
  the following techniques
 
  1) Place a tc egress filter on dummy0
 
  2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then 
  eth1,
     this is effectively the same as one of my hacks to the datapath
     that I mentioned in an earlier mail. The result is that eth1
     paces the connection.
   
   This is actually a bug. This means that one slow connection will affect
   fast ones. I intend to change the default for qemu to sndbuf=0 : this
   will fix it but break your pacing. So pls do not count on this
   behaviour.
  
  Do you have a patch I could test?
 
 You can (and users already can) just run qemu with sndbuf=0. But if you
 like, below.

Thanks

  Further to this, I wonder if there is any interest in providing
  a method to switch the action order - using ovs-ofctl is a hack 
  imho -
  and/or switching the default action order for mirroring.
 
 I'm not sure that there is a way to do this that is correct in the
 generic case.  It's possible that the destination could be a VM while
 packets are being mirrored to a physical device or we could be
 multicasting or some other arbitrarily complex scenario.  Just think
 of what a physical switch would do if it has ports with two different
 speeds.

Yes, I have considered that case. And I agree that perhaps there
is no sensible default. But perhaps we could make it configurable 
somehow?
   
   The fix is at the application level. Run netperf with -b and -w flags to
   limit the speed to a sensible value.
  
  Perhaps I should have stated my goals more clearly.
  I'm interested in situations where I don't control the application.
 
 Well an application that streams UDP without any throttling
 at the application level will break on a physical network, right?
 So I am not sure why should one try to make it work on the virtual one.
 
 But let's assume that you do want to throttle the guest
 for reasons such as QOS. The proper approach seems
 to be to throttle the sender, not have a dummy throttled
 receiver pacing it. Place the qemu process in the
 correct net_cls cgroup, set the class id and apply a rate limit?

I would like to be able to use a class to rate limit egress packets.
That much works fine for me.

What I would also like is for there to be back-pressure such that the guest
doesn't consume lots of CPU, spinning, sending packets as fast as it can,
 almost all of which are dropped. That does seem like a lot of wasted
CPU to me.

Unfortunately there are several problems with this and I am fast concluding
that I will need to use a CPU cgroup. Which does make some sense, as what I
am really trying to limit here is CPU usage not network packet rates - even
if the test using the CPU is netperf.  So long as the CPU usage can
(mostly) be attributed to the guest using a cgroup should work fine.  And
indeed seems to in my limited testing.

One scenario in which I don't think it is possible for there to be
back-pressure in a meaningful sense is if root in the guest sets
/proc/sys/net/core/wmem_default to a large value, say 200.


I do think that to some extent there is back-pressure 

Re: Flow Control and Port Mirroring Revisited

2011-01-16 Thread Rusty Russell
On Mon, 17 Jan 2011 09:07:30 am Simon Horman wrote:

[snip]

I've been away, but what concerns me is that socket buffer limits are
bypassed in various configurations, due to skb cloning.  We should probably
drop such limits altogether, or fix them to be consistent.

Simple fix is as someone suggested here, to attach the clone.  That might
seriously reduce your sk limit, though.  I haven't thought about it hard,
but might it make sense to move ownership into skb_shared_info; ie. the
data, rather than the skb head?

Cheers,
Rusty.


Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Simon Horman
On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
 On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au wrote:
  On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
  On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
   On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
  
   [ snip ]
   
I know that everyone likes a nice netperf result but I agree with
Michael that this probably isn't the right question to be asking.  I
don't think that socket buffers are a real solution to the flow
control problem: they happen to provide that functionality but it's
more of a side effect than anything.  It's just that the amount of
memory consumed by packets in the queue(s) doesn't really have any
implicit meaning for flow control (think multiple physical adapters,
all with the same speed instead of a virtual device and a physical
device with wildly different speeds).  The analog in the physical
world that you're looking for would be Ethernet flow control.
Obviously, if the question is limiting CPU or memory consumption then
that's a different story.
  
   Point taken. I will see if I can control CPU (and thus memory) 
   consumption
   using cgroups and/or tc.
 
  I have found that I can successfully control the throughput using
  the following techniques
 
  1) Place a tc egress filter on dummy0
 
  2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
     this is effectively the same as one of my hacks to the datapath
     that I mentioned in an earlier mail. The result is that eth1
     paces the connection.
 
  Further to this, I wonder if there is any interest in providing
  a method to switch the action order - using ovs-ofctl is a hack imho -
  and/or switching the default action order for mirroring.
 
 I'm not sure that there is a way to do this that is correct in the
 generic case.  It's possible that the destination could be a VM while
 packets are being mirrored to a physical device or we could be
 multicasting or some other arbitrarily complex scenario.  Just think
 of what a physical switch would do if it has ports with two different
 speeds.

Yes, I have considered that case. And I agree that perhaps there
is no sensible default. But perhaps we could make it configurable somehow?


Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Michael S. Tsirkin
On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
 On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
  On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au wrote:
   On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
   On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
   
[ snip ]

 I know that everyone likes a nice netperf result but I agree with
 Michael that this probably isn't the right question to be asking.  I
 don't think that socket buffers are a real solution to the flow
 control problem: they happen to provide that functionality but it's
 more of a side effect than anything.  It's just that the amount of
 memory consumed by packets in the queue(s) doesn't really have any
 implicit meaning for flow control (think multiple physical adapters,
 all with the same speed instead of a virtual device and a physical
 device with wildly different speeds).  The analog in the physical
 world that you're looking for would be Ethernet flow control.
 Obviously, if the question is limiting CPU or memory consumption then
 that's a different story.
   
Point taken. I will see if I can control CPU (and thus memory) 
consumption
using cgroups and/or tc.
  
   I have found that I can successfully control the throughput using
   the following techniques
  
   1) Place a tc egress filter on dummy0
  
   2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
      this is effectively the same as one of my hacks to the datapath
      that I mentioned in an earlier mail. The result is that eth1
      paces the connection.

This is actually a bug. This means that one slow connection will
affect fast ones. I intend to change the default for qemu to sndbuf=0 :
this will fix it but break your pacing. So pls do not count on this behaviour.

   Further to this, I wonder if there is any interest in providing
   a method to switch the action order - using ovs-ofctl is a hack imho -
   and/or switching the default action order for mirroring.
  
  I'm not sure that there is a way to do this that is correct in the
  generic case.  It's possible that the destination could be a VM while
  packets are being mirrored to a physical device or we could be
  multicasting or some other arbitrarily complex scenario.  Just think
  of what a physical switch would do if it has ports with two different
  speeds.
 
 Yes, I have considered that case. And I agree that perhaps there
 is no sensible default. But perhaps we could make it configurable somehow?

The fix is at the application level. Run netperf with -b and -w flags to
limit the speed to a sensible value.
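(For reference, -b and -w are global options and require a netperf ./configured with --enable-burst; the host and values below are illustrative.)

```shell
# -b: extra transactions kept in flight per burst;
# -w: wait interval between bursts. Both are honoured only
#     by an --enable-burst build; otherwise netperf warns.
netperf -H 172.17.60.216 -t omni -b 1 -w 200 -- -r 1 -d rr
```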

-- 
MST


Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Simon Horman
On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
 On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
  On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
   On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au wrote:
On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
 On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:

 [ snip ]
 
  I know that everyone likes a nice netperf result but I agree with
  Michael that this probably isn't the right question to be asking.  
  I
  don't think that socket buffers are a real solution to the flow
  control problem: they happen to provide that functionality but it's
  more of a side effect than anything.  It's just that the amount of
  memory consumed by packets in the queue(s) doesn't really have any
  implicit meaning for flow control (think multiple physical 
  adapters,
  all with the same speed instead of a virtual device and a physical
  device with wildly different speeds).  The analog in the physical
  world that you're looking for would be Ethernet flow control.
  Obviously, if the question is limiting CPU or memory consumption 
  then
  that's a different story.

 Point taken. I will see if I can control CPU (and thus memory) 
 consumption
 using cgroups and/or tc.
   
I have found that I can successfully control the throughput using
the following techniques
   
1) Place a tc egress filter on dummy0
   
2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
   this is effectively the same as one of my hacks to the datapath
   that I mentioned in an earlier mail. The result is that eth1
   paces the connection.
 
 This is actually a bug. This means that one slow connection will affect
 fast ones. I intend to change the default for qemu to sndbuf=0 : this
 will fix it but break your pacing. So pls do not count on this
 behaviour.

Do you have a patch I could test?

Further to this, I wonder if there is any interest in providing
a method to switch the action order - using ovs-ofctl is a hack imho -
and/or switching the default action order for mirroring.
   
   I'm not sure that there is a way to do this that is correct in the
   generic case.  It's possible that the destination could be a VM while
   packets are being mirrored to a physical device or we could be
   multicasting or some other arbitrarily complex scenario.  Just think
   of what a physical switch would do if it has ports with two different
   speeds.
  
  Yes, I have considered that case. And I agree that perhaps there
  is no sensible default. But perhaps we could make it configurable somehow?
 
 The fix is at the application level. Run netperf with -b and -w flags to
 limit the speed to a sensible value.

Perhaps I should have stated my goals more clearly.
I'm interested in situations where I don't control the application.



Re: Flow Control and Port Mirroring Revisited

2011-01-13 Thread Michael S. Tsirkin
On Fri, Jan 14, 2011 at 03:35:28PM +0900, Simon Horman wrote:
 On Fri, Jan 14, 2011 at 06:58:18AM +0200, Michael S. Tsirkin wrote:
  On Fri, Jan 14, 2011 at 08:41:36AM +0900, Simon Horman wrote:
   On Thu, Jan 13, 2011 at 10:45:38AM -0500, Jesse Gross wrote:
On Thu, Jan 13, 2011 at 1:47 AM, Simon Horman ho...@verge.net.au 
wrote:
 On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
 On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
  On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
 
  [ snip ]
  
   I know that everyone likes a nice netperf result but I agree with
   Michael that this probably isn't the right question to be 
   asking.  I
   don't think that socket buffers are a real solution to the flow
   control problem: they happen to provide that functionality but 
   it's
   more of a side effect than anything.  It's just that the amount 
   of
   memory consumed by packets in the queue(s) doesn't really have 
   any
   implicit meaning for flow control (think multiple physical 
   adapters,
   all with the same speed instead of a virtual device and a 
   physical
   device with wildly different speeds).  The analog in the physical
   world that you're looking for would be Ethernet flow control.
   Obviously, if the question is limiting CPU or memory consumption 
   then
   that's a different story.
 
  Point taken. I will see if I can control CPU (and thus memory) 
  consumption
  using cgroups and/or tc.

 I have found that I can successfully control the throughput using
 the following techniques

 1) Place a tc egress filter on dummy0

 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then 
 eth1,
    this is effectively the same as one of my hacks to the datapath
    that I mentioned in an earlier mail. The result is that eth1
    paces the connection.
  
  This is actually a bug. This means that one slow connection will affect
  fast ones. I intend to change the default for qemu to sndbuf=0 : this
  will fix it but break your pacing. So pls do not count on this
  behaviour.
 
 Do you have a patch I could test?

You can (and users already can) just run qemu with sndbuf=0. But if you
like, below.
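For reference, the tap send buffer can already be disabled per guest from the command line; a hypothetical invocation (the disk image and tap interface names are placeholders, not from this thread) might look like:

```shell
# sndbuf=0 removes the tap send-buffer limit, overriding the 1MB
# default that currently provides the socket-buffer pacing.
qemu-system-x86_64 -hda guest.img \
    -net nic,model=virtio \
    -net tap,ifname=tap0,script=no,sndbuf=0
```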

 Further to this, I wonder if there is any interest in providing
 a method to switch the action order - using ovs-ofctl is a hack imho -
 and/or switching the default action order for mirroring.

I'm not sure that there is a way to do this that is correct in the
generic case.  It's possible that the destination could be a VM while
packets are being mirrored to a physical device or we could be
multicasting or some other arbitrarily complex scenario.  Just think
of what a physical switch would do if it has ports with two different
speeds.
   
   Yes, I have considered that case. And I agree that perhaps there
   is no sensible default. But perhaps we could make it configurable somehow?
  
  The fix is at the application level. Run netperf with -b and -w flags to
  limit the speed to a sensible value.
 
 Perhaps I should have stated my goals more clearly.
 I'm interested in situations where I don't control the application.

Well, an application that streams UDP without any throttling
at the application level will break on a physical network, right?
So I am not sure why one should try to make it work on the virtual one.

But let's assume that you do want to throttle the guest
for reasons such as QOS. The proper approach seems
to be to throttle the sender, not have a dummy throttled
receiver pacing it. Place the qemu process in the
correct net_cls cgroup, set the class id and apply a rate limit?
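That suggestion might be sketched as follows; the cgroup mount point, class id, rate, and $QEMU_PID are illustrative assumptions, not a tested recipe:

```shell
# Tag packets from the qemu process (the vhost thread inherits its
# cgroups) with class id 1:1 via the net_cls cgroup.
mkdir -p /sys/fs/cgroup/net_cls/vm1
echo 0x10001 > /sys/fs/cgroup/net_cls/vm1/net_cls.classid
echo "$QEMU_PID" > /sys/fs/cgroup/net_cls/vm1/tasks

# Rate-limit that class on the egress device with HTB, using the
# cgroup classifier to map tagged packets to class 1:1.
tc qdisc add dev eth1 root handle 1: htb
tc class add dev eth1 parent 1: classid 1:1 htb rate 100mbit
tc filter add dev eth1 parent 1: protocol ip prio 10 handle 1: cgroup
```

This throttles the sender regardless of what the guest's applications do, which is the point of the suggestion.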


---

diff --git a/net/tap-linux.c b/net/tap-linux.c
index f7aa904..0dbcdd4 100644
--- a/net/tap-linux.c
+++ b/net/tap-linux.c
@@ -87,7 +87,7 @@ int tap_open(char *ifname, int ifname_size, int *vnet_hdr, int vnet_hdr_required
  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
  * a good default, given a 1500 byte MTU.
  */
-#define TAP_DEFAULT_SNDBUF 1024*1024
+#define TAP_DEFAULT_SNDBUF 0
 
 int tap_set_sndbuf(int fd, QemuOpts *opts)
 {


Re: Flow Control and Port Mirroring Revisited

2011-01-12 Thread Simon Horman
On Mon, Jan 10, 2011 at 06:31:55PM +0900, Simon Horman wrote:
 On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
  On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
  
  [ snip ]
   
   I know that everyone likes a nice netperf result but I agree with
   Michael that this probably isn't the right question to be asking.  I
   don't think that socket buffers are a real solution to the flow
   control problem: they happen to provide that functionality but it's
   more of a side effect than anything.  It's just that the amount of
   memory consumed by packets in the queue(s) doesn't really have any
   implicit meaning for flow control (think multiple physical adapters,
   all with the same speed instead of a virtual device and a physical
   device with wildly different speeds).  The analog in the physical
   world that you're looking for would be Ethernet flow control.
   Obviously, if the question is limiting CPU or memory consumption then
   that's a different story.
  
  Point taken. I will see if I can control CPU (and thus memory) consumption
  using cgroups and/or tc.
 
 I have found that I can successfully control the throughput using
 the following techniques
 
 1) Place a tc egress filter on dummy0
 
 2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
this is effectively the same as one of my hacks to the datapath
that I mentioned in an earlier mail. The result is that eth1
paces the connection.

Further to this, I wonder if there is any interest in providing
a method to switch the action order - using ovs-ofctl is a hack imho -
and/or switching the default action order for mirroring.

 3) 2) + place a tc egress filter on eth1
 
 Which mostly makes sense to me although I am a little confused about
 why 1) needs a filter on dummy0 (a filter on eth1 has no effect)
 but 3) needs a filter on eth1 (a filter on dummy0 has no effect,
 even if the skb is sent to dummy0 last).
 
 I also had some limited success using CPU cgroups, though obviously
 that targets CPU usage and thus the effect on throughput is fairly coarse.
 In short, it's a useful technique but not one that bears further
 discussion here.
 
 


Re: Flow Control and Port Mirroring Revisited

2011-01-10 Thread Simon Horman
On Fri, Jan 07, 2011 at 10:23:58AM +0900, Simon Horman wrote:
 On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:
 
 [ snip ]
  
  I know that everyone likes a nice netperf result but I agree with
  Michael that this probably isn't the right question to be asking.  I
  don't think that socket buffers are a real solution to the flow
  control problem: they happen to provide that functionality but it's
  more of a side effect than anything.  It's just that the amount of
  memory consumed by packets in the queue(s) doesn't really have any
  implicit meaning for flow control (think multiple physical adapters,
  all with the same speed instead of a virtual device and a physical
  device with wildly different speeds).  The analog in the physical
  world that you're looking for would be Ethernet flow control.
  Obviously, if the question is limiting CPU or memory consumption then
  that's a different story.
 
 Point taken. I will see if I can control CPU (and thus memory) consumption
 using cgroups and/or tc.

I have found that I can successfully control the throughput using
the following techniques

1) Place a tc egress filter on dummy0

2) Use ovs-ofctl to add a flow that sends skbs to dummy0 and then eth1,
   this is effectively the same as one of my hacks to the datapath
   that I mentioned in an earlier mail. The result is that eth1
   paces the connection.

3) 2) + place a tc egress filter on eth1

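A rough sketch of techniques 1) and 2) above (interface names, OpenFlow port numbers, and the rate are my assumptions, not taken from the original setup):

```shell
# 1) Pace egress on dummy0 (the mirror port) with a token bucket.
tc qdisc add dev dummy0 root tbf rate 10mbit burst 10kb latency 70ms

# 2) Output to dummy0 (port 2) before eth1 (port 1), so the clone
#    goes to dummy0 and the original skb - which carries the socket
#    accounting - is queued last, on eth1.
ovs-ofctl add-flow br0 "in_port=3,actions=output:2,output:1"
```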
Which mostly makes sense to me although I am a little confused about
why 1) needs a filter on dummy0 (a filter on eth1 has no effect)
but 3) needs a filter on eth1 (a filter on dummy0 has no effect,
even if the skb is sent to dummy0 last).

I also had some limited success using CPU cgroups, though obviously
that targets CPU usage and thus the effect on throughput is fairly coarse.
In short, it's a useful technique but not one that bears further
discussion here.



Flow Control and Port Mirroring Revisited

2011-01-06 Thread Simon Horman
Hi,

Back in October I reported that I noticed a problem whereby flow control
breaks down when openvswitch is configured to mirror a port[1].

I have (finally) looked into this further and the problem appears to relate
to cloning of skbs, as Jesse Gross originally suspected.

More specifically, in do_execute_actions[2] the first n-1 times that an skb
needs to be transmitted it is cloned first and the final time the original
skb is used.

In the case that there is only one action, which is the normal case, then
the original skb will be used. But in the case of mirroring the cloning
comes into effect. And in my case the cloned skb seems to go to the (slow)
eth1 interface while the original skb goes to the (fast) dummy0 interface
that I set up to be a mirror. The result is that dummy0 paces the flow,
and it's a cracking pace at that.

As an experiment I hacked do_execute_actions() to use the original skb
for the first action instead of the last one.  In my case the result was
that eth1 paces the flow, and things work reasonably nicely.

Well, sort of. Things work well for non-GSO skbs but extremely poorly for
GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
netserver. I'm unsure why, but I digress.

It seems to me that my hack illustrates the point that the flow ends up
being paced by one interface. However I think that what would be
desirable is that the flow is paced by the slowest link. Unfortunately
I'm unsure how to achieve that.

One idea that I had was to skb_get() the original skb each time it is
cloned - that is easy enough. But unfortunately it seems to me that
approach would require some sort of callback mechanism in kfree_skb() so
that the cloned skbs can kfree_skb() the original skb.

Ideas would be greatly appreciated.

[1] 
http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
[2] 
http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Eric Dumazet
On Thursday, 06 January 2011 at 18:33 +0900, Simon Horman wrote:
 Hi,
 
 Back in October I reported that I noticed a problem whereby flow control
 breaks down when openvswitch is configured to mirror a port[1].
 
 I have (finally) looked into this further and the problem appears to relate
 to cloning of skbs, as Jesse Gross originally suspected.
 
 More specifically, in do_execute_actions[2] the first n-1 times that an skb
 needs to be transmitted it is cloned first and the final time the original
 skb is used.
 
 In the case that there is only one action, which is the normal case, then
 the original skb will be used. But in the case of mirroring the cloning
 comes into effect. And in my case the cloned skb seems to go to the (slow)
 eth1 interface while the original skb goes to the (fast) dummy0 interface
 that I set up to be a mirror. The result is that dummy0 paces the flow,
 and it's a cracking pace at that.
 
 As an experiment I hacked do_execute_actions() to use the original skb
 for the first action instead of the last one.  In my case the result was
 that eth1 paces the flow, and things work reasonably nicely.
 
 Well, sort of. Things work well for non-GSO skbs but extremely poorly for
 GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
 netserv. I'm unsure why, but I digress.
 
 It seems to me that my hack illustrates the point that the flow ends up
 being paced by one interface. However I think that what would be
 desirable is that the flow is paced by the slowest link. Unfortunately
 I'm unsure how to achieve that.
 

Hi Simon !

pacing is done because skb is attached to a socket, and a socket has a
limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
of all truesize skbs in flight.

When you enter something that :

1) Get a clone of the skb, queue the clone to device X
2) queue the original skb to device Y

Then :  Socket sndbuf is not affected at all by device X queue.
This is speed on device Y that matters.

You want to get servo control on both X and Y

You could try to

1) Get a clone of skb
   Attach it to the socket too (so that the socket gets feedback of the
final orphaning of the clone) with skb_set_owner_w()
   queue the clone to device X

Unfortunately, stacked skb->destructor() makes this possible only for a
known destructor (aka sock_wfree())
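Eric's sndbuf accounting can be poked at from userspace. A minimal Python sketch (my illustration, not from the thread): each in-flight skb's truesize is charged against the socket's send-buffer budget, and Linux stores double whatever SO_SNDBUF value is requested, to cover that per-skb bookkeeping overhead:

```python
import socket

# Ask for a 64 KiB send buffer on a UDP socket; in-flight skbs are
# charged against this budget (sk->sk_wmem_alloc) until orphaned.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)

# Linux reports back double the requested value to account for
# per-skb bookkeeping overhead.
actual = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
print("effective SO_SNDBUF:", actual)
s.close()
```

With sndbuf=0 on the tap device this accounting is effectively unlimited, which is why the pacing disappears.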

 One idea that I had was to skb_get() the original skb each time it is
 cloned - that is easy enough. But unfortunately it seems to me that
 approach would require some sort of callback mechanism in kfree_skb() so
 that the cloned skbs can kfree_skb() the original skb.
 
 Ideas would be greatly appreciated.
 
 [1] 
 http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
 [2] 
 http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD
 --




Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Michael S. Tsirkin
On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
 Hi,
 
 Back in October I reported that I noticed a problem whereby flow control
 breaks down when openvswitch is configured to mirror a port[1].

Apropos the UDP flow control.  See this
http://www.spinics.net/lists/netdev/msg150806.html
for some problems it introduces.
Unfortunately UDP does not have built-in flow control.
At some level it's just conceptually broken:
it's not present in physical networks so why should
we try and emulate it in a virtual network?


Specifically, when you do:
# netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
You are asking: what happens if I push data faster than it can be received?
But why is this an interesting question?
Ask 'what is the maximum rate at which I can send data with X% packet
loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
-b and -w flags for this. It needs to be configured
with --enable-intervals=yes for them to work.

If you pose the questions this way the problem of pacing
the execution just goes away.
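An example of the suggested invocation (the target address is from the thread; the burst size and interval are arbitrary, and netperf must have been built with --enable-intervals=yes):

```shell
# Send bursts of 100 messages (-b) paced at one burst every
# 10 ms (-w) rather than as fast as the socket allows.
netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -b 100 -w 10 -- -m 1472
```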

 
 I have (finally) looked into this further and the problem appears to relate
 to cloning of skbs, as Jesse Gross originally suspected.
 
 More specifically, in do_execute_actions[2] the first n-1 times that an skb
 needs to be transmitted it is cloned first and the final time the original
 skb is used.
 
 In the case that there is only one action, which is the normal case, then
 the original skb will be used. But in the case of mirroring the cloning
 comes into effect. And in my case the cloned skb seems to go to the (slow)
 eth1 interface while the original skb goes to the (fast) dummy0 interface
 that I set up to be a mirror. The result is that dummy0 paces the flow,
 and it's a cracking pace at that.
 
 As an experiment I hacked do_execute_actions() to use the original skb
 for the first action instead of the last one.  In my case the result was
 that eth1 paces the flow, and things work reasonably nicely.
 
 Well, sort of. Things work well for non-GSO skbs but extremely poorly for
 GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
 netserv. I'm unsure why, but I digress.
 
 It seems to me that my hack illustrates the point that the flow ends up
 being paced by one interface. However I think that what would be
 desirable is that the flow is paced by the slowest link. Unfortunately
 I'm unsure how to achieve that.

What if you have multiple UDP sockets with different targets
in the guest?

 One idea that I had was to skb_get() the original skb each time it is
 cloned - that is easy enough. But unfortunately it seems to me that
 approach would require some sort of callback mechanism in kfree_skb() so
 that the cloned skbs can kfree_skb() the original skb.
 
 Ideas would be greatly appreciated.
 
 [1] 
 http://openvswitch.org/pipermail/dev_openvswitch.org/2010-October/003806.html
 [2] 
 http://openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=blob;f=datapath/actions.c;h=5e16143ca402f7da0ee8fc18ee5eb16c3b7598e6;hb=HEAD


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Simon Horman
On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
 On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
  Hi,
  
  Back in October I reported that I noticed a problem whereby flow control
  breaks down when openvswitch is configured to mirror a port[1].
 
 Apropos the UDP flow control.  See this
 http://www.spinics.net/lists/netdev/msg150806.html
 for some problems it introduces.
 Unfortunately UDP does not have built-in flow control.
 At some level it's just conceptually broken:
 it's not present in physical networks so why should
 we try and emulate it in a virtual network?
 
 
 Specifically, when you do:
 # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
 You are asking: what happens if I push data faster than it can be received?
 But why is this an interesting question?
 Ask 'what is the maximum rate at which I can send data with %X packet
 loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
 -b and -w flags for this. It needs to be configured
 with --enable-intervals=yes for them to work.
 
 If you pose the questions this way the problem of pacing
 the execution just goes away.

I am aware that UDP inherently lacks flow control.

The aspect of flow control that I am interested in is situations where the
guest can create large amounts of work for the host. However, in the case of
virtio with vhost-net the CPU utilisation seems to be almost entirely
attributable to the vhost and qemu-system processes, and in the case of
virtio without vhost-net the CPU is used by the qemu-system process. In both
cases I assume that I could use a cgroup or something similar to limit the
guests.

Assuming all of that is true then from a resource control problem point of
view, which is mostly what I am concerned about, the problem goes away.
However, I still think that it would be nice to resolve the situation I
described.


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Michael S. Tsirkin
On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
 On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
  On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
   Hi,
   
   Back in October I reported that I noticed a problem whereby flow control
   breaks down when openvswitch is configured to mirror a port[1].
  
  Apropos the UDP flow control.  See this
  http://www.spinics.net/lists/netdev/msg150806.html
  for some problems it introduces.
  Unfortunately UDP does not have built-in flow control.
  At some level it's just conceptually broken:
  it's not present in physical networks so why should
  we try and emulate it in a virtual network?
  
  
  Specifically, when you do:
  # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
  You are asking: what happens if I push data faster than it can be received?
  But why is this an interesting question?
  Ask 'what is the maximum rate at which I can send data with %X packet
  loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
  -b and -w flags for this. It needs to be configured
  with --enable-intervals=yes for them to work.
  
  If you pose the questions this way the problem of pacing
  the execution just goes away.
 
 I am aware that UDP inherently lacks flow control.

Everyone is aware of that, but this is always followed by a 'however'
:).

 The aspect of flow control that I am interested in is situations where the
 guest can create large amounts of work for the host. However, it seems that
 in the case of virtio with vhostnet that the CPU utilisation seems to be
 almost entirely attributable to the vhost and qemu-system processes.  And
 in the case of virtio without vhost net the CPU is used by the qemu-system
 process. In both cases I assume that I could use a cgroup or something
 similar to limit the guests.

cgroups, yes. the vhost process inherits the cgroups
from the qemu process so you can limit them all.

If you are after limiting the max throughput of the guest
you can do this with cgroups as well.

 Assuming all of that is true then from a resource control problem point of
 view, which is mostly what I am concerned about, the problem goes away.
 However, I still think that it would be nice to resolve the situation I
 described.

We need to articulate what's wrong here, otherwise we won't
be able to resolve the situation. We are sending UDP packets
as fast as we can and some receivers can't cope. Is this the problem?
We have made attempts to add a pseudo flow control in the past
in an attempt to make UDP on the same host work better.
Maybe they help some but they also sure introduce problems.

-- 
MST


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Simon Horman
On Thu, Jan 06, 2011 at 02:07:22PM +0200, Michael S. Tsirkin wrote:
 On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
  On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
   On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
Hi,

Back in October I reported that I noticed a problem whereby flow control
breaks down when openvswitch is configured to mirror a port[1].
   
   Apropos the UDP flow control.  See this
   http://www.spinics.net/lists/netdev/msg150806.html
   for some problems it introduces.
   Unfortunately UDP does not have built-in flow control.
   At some level it's just conceptually broken:
   it's not present in physical networks so why should
   we try and emulate it in a virtual network?
   
   
   Specifically, when you do:
   # netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
   You are asking: what happens if I push data faster than it can be 
   received?
   But why is this an interesting question?
   Ask 'what is the maximum rate at which I can send data with %X packet
   loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
   -b and -w flags for this. It needs to be configured
   with --enable-intervals=yes for them to work.
   
   If you pose the questions this way the problem of pacing
   the execution just goes away.
  
  I am aware that UDP inherently lacks flow control.
 
 Everyone is aware of that, but this is always followed by a 'however'
 :).
 
  The aspect of flow control that I am interested in is situations where the
  guest can create large amounts of work for the host. However, it seems that
  in the case of virtio with vhostnet that the CPU utilisation seems to be
  almost entirely attributable to the vhost and qemu-system processes.  And
  in the case of virtio without vhost net the CPU is used by the qemu-system
  process. In both cases I assume that I could use a cgroup or something
  similar to limit the guests.
 
 cgroups, yes. the vhost process inherits the cgroups
 from the qemu process so you can limit them all.
 
 If you are after limiting the max throughput of the guest
 you can do this with cgroups as well.

Do you mean a CPU cgroup or something else?

  Assuming all of that is true then from a resource control problem point of
  view, which is mostly what I am concerned about, the problem goes away.
  However, I still think that it would be nice to resolve the situation I
  described.
 
 We need to articulate what's wrong here, otherwise we won't
 be able to resolve the situation. We are sending UDP packets
 as fast as we can and some receivers can't cope. Is this the problem?
 We have made attempts to add a pseudo flow control in the past
 in an attempt to make UDP on the same host work better.
 Maybe they help some but they also sure introduce problems.

In the case where port mirroring is not active, which is the
usual case, to some extent there is flow control in place due to
(as Eric Dumazet pointed out) the socket buffer.

When port mirroring is activated the flow control operates based
only on one port - which can't be controlled by the administrator
in an obvious way.

I think that it would be more intuitive if flow control was
based on sending a packet to all ports rather than just one.

Though now I think about it some more, perhaps this isn't the best either.
For instance the case where data was being sent to dummy0 and suddenly
adding a mirror on eth1 slowed everything down.

So perhaps there needs to be another knob to tune when setting
up port-mirroring. Or perhaps the current situation isn't so bad.


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Simon Horman
On Thu, Jan 06, 2011 at 11:22:42AM +0100, Eric Dumazet wrote:
 On Thursday, 06 January 2011 at 18:33 +0900, Simon Horman wrote:
  Hi,
  
  Back in October I reported that I noticed a problem whereby flow control
  breaks down when openvswitch is configured to mirror a port[1].
  
  I have (finally) looked into this further and the problem appears to relate
  to cloning of skbs, as Jesse Gross originally suspected.
  
  More specifically, in do_execute_actions[2] the first n-1 times that an skb
  needs to be transmitted it is cloned first and the final time the original
  skb is used.
  
  In the case that there is only one action, which is the normal case, then
  the original skb will be used. But in the case of mirroring the cloning
  comes into effect. And in my case the cloned skb seems to go to the (slow)
  eth1 interface while the original skb goes to the (fast) dummy0 interface
  that I set up to be a mirror. The result is that dummy0 paces the flow,
  and it's a cracking pace at that.
  
  As an experiment I hacked do_execute_actions() to use the original skb
  for the first action instead of the last one.  In my case the result was
  that eth1 paces the flow, and things work reasonably nicely.
  
  Well, sort of. Things work well for non-GSO skbs but extremely poorly for
  GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
  netserv. I'm unsure why, but I digress.
  
  It seems to me that my hack illustrates the point that the flow ends up
  being paced by one interface. However I think that what would be
  desirable is that the flow is paced by the slowest link. Unfortunately
  I'm unsure how to achieve that.
  
 
 Hi Simon !
 
 pacing is done because skb is attached to a socket, and a socket has a
 limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
 of all truesize skbs in flight.
 
 When you enter something that :
 
 1) Get a clone of the skb, queue the clone to device X
 2) queue the original skb to device Y
 
 Then : Socket sndbuf is not affected at all by device X queue.
 This is speed on device Y that matters.
 
 You want to get servo control on both X and Y
 
 You could try to
 
 1) Get a clone of skb
 Attach it to the socket too (so that the socket gets feedback of the
 final orphaning of the clone) with skb_set_owner_w()
queue the clone to device X
 
 Unfortunately, stacked skb->destructor() makes this possible only for
 known destructor (aka sock_wfree())

Hi Eric !

Thanks for the advice. I had thought about the socket buffer but at some
point it slipped my mind.

In any case the following patch seems to implement the change that I had in
mind. However my discussions with Michael S. Tsirkin elsewhere in this
thread are beginning to make me think that perhaps this change isn't the
best solution.

diff --git a/datapath/actions.c b/datapath/actions.c
index 5e16143..505f13f 100644
--- a/datapath/actions.c
+++ b/datapath/actions.c
@@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
 
 	for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, rem)) {
 		if (prev_port != -1) {
-			do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
+			struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
+			if (nskb) {
+				if (skb->sk)
+					skb_set_owner_w(nskb, skb->sk);
+				do_output(dp, nskb, prev_port);
+			}
 			prev_port = -1;
 		}

I got a rather nasty panic without the if (skb->sk) check;
I guess some skbs don't have a socket.


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Michael S. Tsirkin
On Thu, Jan 06, 2011 at 09:29:02PM +0900, Simon Horman wrote:
 On Thu, Jan 06, 2011 at 02:07:22PM +0200, Michael S. Tsirkin wrote:
  On Thu, Jan 06, 2011 at 08:30:52PM +0900, Simon Horman wrote:
   On Thu, Jan 06, 2011 at 12:27:55PM +0200, Michael S. Tsirkin wrote:
On Thu, Jan 06, 2011 at 06:33:12PM +0900, Simon Horman wrote:
 Hi,
 
 Back in October I reported that I noticed a problem whereby flow control
 breaks down when openvswitch is configured to mirror a port[1].

Apropos the UDP flow control.  See this
http://www.spinics.net/lists/netdev/msg150806.html
for some problems it introduces.
Unfortunately UDP does not have built-in flow control.
At some level it's just conceptually broken:
it's not present in physical networks so why should
we try and emulate it in a virtual network?


Specifically, when you do:
# netperf -c -4 -t UDP_STREAM -H 172.17.60.218 -l 30 -- -m 1472
 You are asking: what happens if I push data faster than it can be received?
 But why is this an interesting question?
 Ask 'what is the maximum rate at which I can send data with X% packet
loss' or 'what is the packet loss at rate Y Gb/s'. netperf has
-b and -w flags for this. It needs to be configured
with --enable-intervals=yes for them to work.

If you pose the questions this way the problem of pacing
the execution just goes away.
   
   I am aware that UDP inherently lacks flow control.
  
  Everyone is aware of that, but this is always followed by a 'however'
  :).
  
   The aspect of flow control that I am interested in is situations where the
   guest can create large amounts of work for the host. However, it seems
   that in the case of virtio with vhostnet the CPU utilisation is almost
   entirely attributable to the vhost and qemu-system processes. And
   in the case of virtio without vhost net the CPU is used by the qemu-system
   process. In both cases I assume that I could use a cgroup or something
   similar to limit the guests.
  
  cgroups, yes. The vhost process inherits the cgroups
  from the qemu process so you can limit them all.
  
  If you are after limiting the max throughput of the guest
  you can do this with cgroups as well.
 
 Do you mean a CPU cgroup or something else?

net classifier cgroup

   Assuming all of that is true then from a resource control problem point of
   view, which is mostly what I am concerned about, the problem goes away.
   However, I still think that it would be nice to resolve the situation I
   described.
  
  We need to articulate what's wrong here, otherwise we won't
  be able to resolve the situation. We are sending UDP packets
  as fast as we can and some receivers can't cope. Is this the problem?
  We have made attempts to add a pseudo flow control in the past
  in an attempt to make UDP on the same host work better.
  Maybe they help some but they also sure introduce problems.
 
 In the case where port mirroring is not active, which is the
 usual case, to some extent there is flow control in place due to
 (as Eric Dumazet pointed out) the socket buffer.
 
 When port mirroring is activated the flow control operates based
 only on one port - which can't be controlled by the administrator
 in an obvious way.
 
 I think that it would be more intuitive if flow control was
 based on sending a packet to all ports rather than just one.
 
 Though now I think about it some more, perhaps this isn't the best either.
 For instance, consider the case where data was being sent to dummy0 and
 suddenly adding a mirror on eth1 slowed everything down.
 
 So perhaps there needs to be another knob to tune when setting
 up port-mirroring. Or perhaps the current situation isn't so bad.

To understand whether it's bad, you'd need to measure it.
The netperf manual says:
5.2.4 UDP_STREAM

A UDP_STREAM test is similar to a TCP_STREAM test except UDP is used as
the transport rather than TCP.

A UDP_STREAM test has no end-to-end flow control - UDP provides none and
neither does netperf. However, if you wish, you can configure netperf
with --enable-intervals=yes to enable the global command-line -b and -w
options to pace bursts of traffic onto the network.

This has a number of implications.

...
and one of the implications is that the max throughput
might not be reached when you try to send as much data as possible.
It might be confusing that this is what netperf does by default with UDP_STREAM:
if the endpoint is much faster than the network the issue might not appear.

-- 
MST


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Eric Dumazet
On Thu, 2011-01-06 at 21:44 +0900, Simon Horman wrote:

 Hi Eric !
 
 Thanks for the advice. I had thought about the socket buffer but at some
 point it slipped my mind.
 
 In any case the following patch seems to implement the change that I had in
 mind. However, my discussions with Michael Tsirkin elsewhere in this thread
 are beginning to make me think that perhaps this change isn't the best
 solution.
 
 diff --git a/datapath/actions.c b/datapath/actions.c
 index 5e16143..505f13f 100644
 --- a/datapath/actions.c
 +++ b/datapath/actions.c
 @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
  
   for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, rem)) {
   if (prev_port != -1) {
 - do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
 + struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
 + if (nskb) {
 + if (skb->sk)
 + skb_set_owner_w(nskb, skb->sk);
 + do_output(dp, nskb, prev_port);
 + }
   prev_port = -1;
   }
 
 I got a rather nasty panic without the if (skb->sk),
 I guess some skbs don't have a socket.

Indeed, some packets are not linked to a socket.

(ARP packets for example)

Sorry, I should have mentioned it :)




Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Jesse Gross
On Thu, Jan 6, 2011 at 7:44 AM, Simon Horman ho...@verge.net.au wrote:
 On Thu, Jan 06, 2011 at 11:22:42AM +0100, Eric Dumazet wrote:
 On Thu, 2011-01-06 at 18:33 +0900, Simon Horman wrote:
  Hi,
 
  Back in October I reported that I noticed a problem whereby flow control
  breaks down when openvswitch is configured to mirror a port[1].
 
  I have (finally) looked into this further and the problem appears to relate
  to cloning of skbs, as Jesse Gross originally suspected.
 
  More specifically, in do_execute_actions[2] the first n-1 times that an skb
  needs to be transmitted it is cloned first and the final time the original
  skb is used.
 
  In the case that there is only one action, which is the normal case, then
  the original skb will be used. But in the case of mirroring the cloning
  comes into effect. And in my case the cloned skb seems to go to the (slow)
  eth1 interface while the original skb goes to the (fast) dummy0 interface
  that I set up to be a mirror. The result is that dummy0 paces the flow,
  and it's a cracking pace at that.
 
  As an experiment I hacked do_execute_actions() to use the original skb
  for the first action instead of the last one.  In my case the result was
  that eth1 paces the flow, and things work reasonably nicely.
 
  Well, sort of. Things work well for non-GSO skbs but extremely poorly for
  GSO skbs where only 3 (yes 3, not 3%) end up at the remote host running
  netserver. I'm unsure why, but I digress.
 
  It seems to me that my hack illustrates the point that the flow ends up
  being paced by one interface. However I think that what would be
  desirable is that the flow is paced by the slowest link. Unfortunately
  I'm unsure how to achieve that.
 

 Hi Simon !

 pacing is done because skb is attached to a socket, and a socket has a
 limited (but configurable) sndbuf. sk->sk_wmem_alloc is the current sum
 of all truesize skbs in flight.

 When you enter something that :

 1) Get a clone of the skb, queue the clone to device X
 2) queue the original skb to device Y

 Then: Socket sndbuf is not affected at all by device X queue.
       It is the speed on device Y that matters.

 You want to get servo control on both X and Y

 You could try to

 1) Get a clone of skb
    Attach it to the socket too (so that the socket gets feedback of final
 orphaning for the clone) with skb_set_owner_w()
    queue the clone to device X

 Unfortunately, stacked skb->destructor() makes this possible only for a
 known destructor (aka sock_wfree())

 Hi Eric !

 Thanks for the advice. I had thought about the socket buffer but at some
 point it slipped my mind.

 In any case the following patch seems to implement the change that I had in
 mind. However, my discussions with Michael Tsirkin elsewhere in this thread
 are beginning to make me think that perhaps this change isn't the best
 solution.

I know that everyone likes a nice netperf result but I agree with
Michael that this probably isn't the right question to be asking.  I
don't think that socket buffers are a real solution to the flow
control problem: they happen to provide that functionality but it's
more of a side effect than anything.  It's just that the amount of
memory consumed by packets in the queue(s) doesn't really have any
implicit meaning for flow control (think multiple physical adapters,
all with the same speed instead of a virtual device and a physical
device with wildly different speeds).  The analog in the physical
world that you're looking for would be Ethernet flow control.
Obviously, if the question is limiting CPU or memory consumption then
that's a different story.

This patch also double counts memory, since the full size of the
packet will be accounted for by each clone, even though they share the
actual packet data.  Probably not too significant here but it might be
when flooding/mirroring to many interfaces.  This is at least fixable
(the Xen-style accounting through page tracking deals with it, though
it has its own problems).


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Simon Horman
On Thu, Jan 06, 2011 at 02:28:18PM +0100, Eric Dumazet wrote:
 On Thu, 2011-01-06 at 21:44 +0900, Simon Horman wrote:
 
  Hi Eric !
  
  Thanks for the advice. I had thought about the socket buffer but at some
  point it slipped my mind.
  
  In any case the following patch seems to implement the change that I had in
  mind. However, my discussions with Michael Tsirkin elsewhere in this thread
  are beginning to make me think that perhaps this change isn't the best
  solution.
  
  diff --git a/datapath/actions.c b/datapath/actions.c
  index 5e16143..505f13f 100644
  --- a/datapath/actions.c
  +++ b/datapath/actions.c
  @@ -384,7 +384,12 @@ static int do_execute_actions(struct datapath *dp, struct sk_buff *skb,
   
  for (a = actions, rem = actions_len; rem > 0; a = nla_next(a, rem)) {
  if (prev_port != -1) {
  -   do_output(dp, skb_clone(skb, GFP_ATOMIC), prev_port);
  +   struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
  +   if (nskb) {
  +   if (skb->sk)
  +   skb_set_owner_w(nskb, skb->sk);
  +   do_output(dp, nskb, prev_port);
  +   }
  prev_port = -1;
  }
  
  I got a rather nasty panic without the if (skb->sk),
  I guess some skbs don't have a socket.
 
 Indeed, some packets are not linked to a socket.
 
 (ARP packets for example)
 
 Sorry, I should have mentioned it :)

Not at all, the occasional panic during hacking is good for the soul.


Re: Flow Control and Port Mirroring Revisited

2011-01-06 Thread Simon Horman
On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:

[ snip ]
 
 I know that everyone likes a nice netperf result but I agree with
 Michael that this probably isn't the right question to be asking.  I
 don't think that socket buffers are a real solution to the flow
 control problem: they happen to provide that functionality but it's
 more of a side effect than anything.  It's just that the amount of
 memory consumed by packets in the queue(s) doesn't really have any
 implicit meaning for flow control (think multiple physical adapters,
 all with the same speed instead of a virtual device and a physical
 device with wildly different speeds).  The analog in the physical
 world that you're looking for would be Ethernet flow control.
 Obviously, if the question is limiting CPU or memory consumption then
 that's a different story.

Point taken. I will see if I can control CPU (and thus memory) consumption
using cgroups and/or tc.

 This patch also double counts memory, since the full size of the
 packet will be accounted for by each clone, even though they share the
 actual packet data.  Probably not too significant here but it might be
 when flooding/mirroring to many interfaces.  This is at least fixable
 (the Xen-style accounting through page tracking deals with it, though
 it has its own problems).

Agreed on all counts.

