Re: [net-next RFC V4 PATCH 0/4] Multiqueue virtio-net

2012-06-25 Thread Sridhar Samudrala

On 6/25/2012 2:16 AM, Jason Wang wrote:

Hello All:

This series is an updated version of the multiqueue virtio-net driver based on
Krishna Kumar's work to let virtio-net use multiple rx/tx queues for packet
reception and transmission. Please review and comment.

Test Environment:
- Intel(R) Xeon(R) CPU E5620 @ 2.40GHz, 8 cores, 2 NUMA nodes
- Two directly connected 82599 NICs

Test Summary:

- Highlights: huge improvements on TCP_RR test
- Lowlights: regression on small packet transmission, higher cpu utilization
  than single queue, need further optimization

Does this also scale with an increased number of VMs?

Thanks
Sridhar


Analysis of the performance results:

- I counted the number of packets sent/received during the test, and
   multiqueue shows much higher capability in terms of packets per second.

- For the tx regression, multiqueue sends about 1-2 times more packets
   than single queue, and the packet sizes are much smaller than with single
   queue. I suspect TCP does less batching with multiqueue, so I hacked
   tcp_write_xmit() to force more batching; multiqueue then works as well as
   single queue for both small-packet transmission and throughput.

- I didn't include accelerated RFS for virtio-net in this series as it still
   needs further shaping; for those interested, please see:
   http://www.mail-archive.com/kvm@vger.kernel.org/msg64111.html




--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread Sridhar Samudrala
On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
 Propagate software FDB table into hardware uc, mc lists when
 the NETIF_F_HW_FDB is set.
 
 This resolves the case below where an embedded switch is used
 in hardware to do inter-VF or VF-PF switching. This patch
 pushes the FDB entry (specifically the MAC address) into the
 embedded switch with dev_uc_add and dev_mc_add so the switch
 learns about the software bridge.
 
 
     veth0      veth2
       |          |
   +----------------+
   |    bridge0     |      software bridging
   +----------------+
         /
        /
    ethx.y     ethx
      VF        PF
       \         \        propagate FDB entries to HW
        \         \
   +-------------------+
   |  Embedded Bridge  |   hardware offloaded switching
   +-------------------+
 

This scenario works now as adding an interface to a bridge puts it in
promiscuous mode. So adding a PF to a software bridge should not be
a problem as it supports promiscuous mode. But adding a VF will not
work.

Are you trying to avoid the requirement of having to put the interface 
in promiscuous mode when adding to a bridge?

Thanks
Sridhar



 This is only an RFC; a couple more changes are needed.
 
 (1) Optimize HW FDB set/del to only walk the list if an FDB-offloaded
 device is attached. Or decide it doesn't matter given the unlikely()
 path.
 
 (2) Is it good enough to just call dev_uc_{add|del} or
 dev_mc_{add|del}? Or do some devices really need a new netdev
 callback to do this operation correctly? I think it should be
 good enough as is.
 
 (3) Wrapped the list walk in rcu_read_lock() just in case; maybe every
 case is already inside rcu_read_lock()/unlock().
 
 Also, this is in response to the thread below regarding macvlan and
 exposing rx filters; posting now to see if folks think this is the
 right idea and whether it will resolve at least the bridge case.
 
 http://lists.openwall.net/netdev/2011/11/08/135
 
 Signed-off-by: John Fastabend john.r.fastab...@intel.com
 ---
 
  include/linux/netdev_features.h |2 ++
  net/bridge/br_fdb.c |   34 ++
  2 files changed, 36 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
 index 77f5202..5936fae 100644
 --- a/include/linux/netdev_features.h
 +++ b/include/linux/netdev_features.h
 @@ -55,6 +55,8 @@ enum {
   NETIF_F_NOCACHE_COPY_BIT,   /* Use no-cache copyfromuser */
   NETIF_F_LOOPBACK_BIT,   /* Enable loopback */
 
 + NETIF_F_HW_FDB, /* Hardware supports switching */
 +
   /*
* Add your fresh new feature above and remember to update
* netdev_features_strings[] in net/core/ethtool.c and maybe
 diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
 index 5ba0c84..4cc545b 100644
 --- a/net/bridge/br_fdb.c
 +++ b/net/bridge/br_fdb.c
 @@ -81,9 +81,26 @@ static void fdb_rcu_free(struct rcu_head *head)
   kmem_cache_free(br_fdb_cache, ent);
  }
 
 +static void fdb_hw_delete(struct net_bridge *br,
 +			  struct net_bridge_fdb_entry *fdb)
 +{
 +	struct net_bridge_port *op;
 +
 +	rcu_read_lock();
 +	list_for_each_entry_rcu(op, &br->port_list, list) {
 +		struct net_device *dev = op->dev;
 +
 +		if ((dev->features & NETIF_F_HW_FDB) &&
 +		    dev != fdb->dst->dev)
 +			dev_uc_del(dev, fdb->addr.addr);
 +	}
 +	rcu_read_unlock();
 +}
 +
  static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f)
  {
 	hlist_del_rcu(&f->hlist);
 +	fdb_hw_delete(br, f);
 	fdb_notify(br, f, RTM_DELNEIGH);
 	call_rcu(&f->rcu, fdb_rcu_free);
  }
 @@ -350,6 +367,22 @@ static struct net_bridge_fdb_entry *fdb_find_rcu(struct 
 hlist_head *head,
   return NULL;
  }
 
 +static void fdb_hw_create(struct net_bridge *br,
 +			  struct net_bridge_fdb_entry *fdb)
 +{
 +	struct net_bridge_port *op;
 +
 +	rcu_read_lock();
 +	list_for_each_entry_rcu(op, &br->port_list, list) {
 +		struct net_device *dev = op->dev;
 +
 +		if ((dev->features & NETIF_F_HW_FDB) &&
 +		    dev != fdb->dst->dev)
 +			dev_uc_add(dev, fdb->addr.addr);
 +	}
 +	rcu_read_unlock();
 +}
 +
  static struct net_bridge_fdb_entry *fdb_create(struct hlist_head *head,
  struct net_bridge_port *source,
  const unsigned char *addr)
 @@ -363,6 +396,7 @@ static struct net_bridge_fdb_entry *fdb_create(struct 
 hlist_head *head,
 	fdb->is_local = 0;
 	fdb->is_static = 0;
 	fdb->updated = fdb->used = jiffies;
 +	fdb_hw_create(source->br, fdb);
 	hlist_add_head_rcu(&fdb->hlist, head);
   }
   return fdb;
 



Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-09 Thread Sridhar Samudrala
On Thu, 2012-02-09 at 12:30 -0800, John Fastabend wrote:
 On 2/9/2012 10:14 AM, Sridhar Samudrala wrote:
  On Wed, 2012-02-08 at 19:22 -0800, John Fastabend wrote:
  Propagate software FDB table into hardware uc, mc lists when
  the NETIF_F_HW_FDB is set.
 
  This resolves the case below where an embedded switch is used
  in hardware to do inter-VF or VF-PF switching. This patch
  pushes the FDB entry (specifically the MAC address) into the
  embedded switch with dev_add_uc and dev_add_mc so the switch
  learns about the software bridge.
 
 
     veth0      veth2
       |          |
   +----------------+
   |    bridge0     |      software bridging
   +----------------+
         /
        /
    ethx.y     ethx
      VF        PF
       \         \        propagate FDB entries to HW
        \         \
   +-------------------+
   |  Embedded Bridge  |   hardware offloaded switching
   +-------------------+
 
  
  This scenario works now as adding an interface to a bridge puts it in
  promiscuous mode. So adding a PF to a software bridge should not be
  a problem as it supports promiscuous mode. But adding a VF will not
  work.
 
 It shouldn't work, because the embedded bridge will look up the address
 in its FDB and, when it doesn't match any unicast filters, it will forward
 the packet onto the wire. Because veth0 and veth2 above never get
 inserted into the embedded bridge's FDB, the packets will _never_ get
 routed there.
 
 That said, the current 'ixgbe' driver is doing something broken in that
 it is always setting the unicast hash table and mirroring bits to 1. So
 if you think this is working, you're seeing a bug where packets are being
 sent onto the wire AND up to the PF. Packets with destination addresses
 matching veth1 should not end up on the wire, and vice versa. This is
 specific to ixgbe and is not the case for other SR-IOV devices.

OK. Is this behavior going to be fixed?

 
 This causes some issues: (a) it has some very real performance implications,
 and (b) at this point you have some strange behavior from my point of view.
 The embedded bridge is not a learning bridge, nor is it acting like an
 802.1Q VEB or VEPA.
 
  
  Are you trying to avoid the requirement of having to put the interface 
  in promiscuous mode when adding to a bridge?
  
 
 I think the bridge being in promiscuous mode is correct.

The interface that is added to the bridge is put in promiscuous mode,
not the bridge itself. In this example, I assumed that setting
promiscuous on the PF puts the embedded bridge in learning mode.

Thanks
Sridhar

 
 Hope that helps; sorry it's a bit long-winded.
 John
 
 
 


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [net-next RFC PATCH 5/5] virtio-net: flow director support

2011-12-08 Thread Sridhar Samudrala

On 12/7/2011 3:02 AM, Jason Wang wrote:

On 12/06/2011 11:42 PM, Sridhar Samudrala wrote:

On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:
On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang jasow...@redhat.com
wrote:

On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:
On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang jasow...@redhat.com
wrote:

On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:

On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang jasow...@redhat.com
  wrote:

The vcpus are just threads and may not be bound to physical CPUs, so
what is the big picture here?  Is the guest even in the position to
set the best queue mappings today?


Not sure it could publish the best mapping, but the idea is to make sure
the packets of a flow are handled by the same guest vcpu and maybe the
same vhost thread, in order to eliminate packet reordering and lock
contention. But this assumption does not account for the bouncing of
vhost or vcpu threads, which would also affect the result.

Okay, this is why I'd like to know what the big picture here is.  What
solution are you proposing?  How are we going to have everything from
guest application, guest kernel, host threads, and host NIC driver
play along so we get the right steering up the entire stack.  I think
there needs to be an answer to that before changing virtio-net to add
any steering mechanism.


Yes. Also the current model of a vhost thread per VM's interface doesn't
help with packet steering all the way from the guest to the host physical
NIC.

I think we need to have vhost thread(s) per CPU that can handle packets
to/from the physical NIC's TX/RX queues. Currently we have a single vhost
thread for a VM's interface that handles all the packets from various
flows coming from a multi-queue physical NIC.


Even if we have per-CPU work threads, only one socket is used to queue
the packets, so a multi-queue (multi-socket) tap/macvtap is still needed.

I think so. We need per-CPU tap/macvtap sockets along with per-CPU vhost
threads. This will parallelize all the way from the physical NIC to vhost.

Thanks
Sridhar



Re: [net-next RFC PATCH 5/5] virtio-net: flow director support

2011-12-06 Thread Sridhar Samudrala

On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:

On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang jasow...@redhat.com wrote:

On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:

On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang jasow...@redhat.com wrote:

On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:

On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang jasow...@redhat.com
  wrote:

The vcpus are just threads and may not be bound to physical CPUs, so
what is the big picture here?  Is the guest even in the position to
set the best queue mappings today?


Not sure it could publish the best mapping, but the idea is to make sure the
packets of a flow are handled by the same guest vcpu and maybe the same
vhost thread, in order to eliminate packet reordering and lock
contention. But this assumption does not account for the bouncing of vhost
or vcpu threads, which would also affect the result.

Okay, this is why I'd like to know what the big picture here is.  What
solution are you proposing?  How are we going to have everything from
guest application, guest kernel, host threads, and host NIC driver
play along so we get the right steering up the entire stack.  I think
there needs to be an answer to that before changing virtio-net to add
any steering mechanism.


Yes. Also the current model of a vhost thread per VM's interface doesn't
help with packet steering all the way from the guest to the host physical
NIC.

I think we need to have vhost thread(s) per CPU that can handle packets
to/from the physical NIC's TX/RX queues. Currently we have a single vhost
thread for a VM's interface that handles all the packets from various
flows coming from a multi-queue physical NIC.

Thanks
Sridhar



Re: [net-next RFC PATCH 5/5] virtio-net: flow director support

2011-12-06 Thread Sridhar Samudrala

On 12/6/2011 8:14 AM, Michael S. Tsirkin wrote:

On Tue, Dec 06, 2011 at 07:42:54AM -0800, Sridhar Samudrala wrote:

On 12/6/2011 5:15 AM, Stefan Hajnoczi wrote:

On Tue, Dec 6, 2011 at 10:21 AM, Jason Wang jasow...@redhat.com wrote:

On 12/06/2011 05:18 PM, Stefan Hajnoczi wrote:

On Tue, Dec 6, 2011 at 6:33 AM, Jason Wang jasow...@redhat.com wrote:

On 12/05/2011 06:55 PM, Stefan Hajnoczi wrote:

On Mon, Dec 5, 2011 at 8:59 AM, Jason Wang jasow...@redhat.com
  wrote:

The vcpus are just threads and may not be bound to physical CPUs, so
what is the big picture here?  Is the guest even in the position to
set the best queue mappings today?

Not sure it could publish the best mapping, but the idea is to make sure the
packets of a flow are handled by the same guest vcpu and maybe the same
vhost thread, in order to eliminate packet reordering and lock
contention. But this assumption does not account for the bouncing of vhost
or vcpu threads, which would also affect the result.

Okay, this is why I'd like to know what the big picture here is.  What
solution are you proposing?  How are we going to have everything from
guest application, guest kernel, host threads, and host NIC driver
play along so we get the right steering up the entire stack.  I think
there needs to be an answer to that before changing virtio-net to add
any steering mechanism.



Yes. Also the current model of a vhost thread per VM's interface
doesn't help with packet steering
all the way from the guest to the host physical NIC.

I think we need to have vhost thread(s) per CPU that can handle
packets to/from the physical NIC's TX/RX queues.
Currently we have a single vhost thread for a VM's interface
that handles all the packets from
various flows coming from a multi-queue physical NIC.

Thanks
Sridhar

It's not hard to try that:
1. revert c23f3445e68e1db0e74099f264bc5ff5d55ebdeb
this will convert our thread to a workqueue
2. convert the workqueue to a per-cpu one

It didn't work that well in the past, but YMMV
Yes. I tried this before we went ahead with the per-interface vhost
threading model.

At that time, per-cpu vhost showed a regression with a single VM, and
per-vq vhost showed good performance improvements up to 8 VMs.

So just making it per-cpu would not be enough. I think we may need a way
to schedule vcpu threads on the same cpu socket as vhost.

Another aspect we need to look into is splitting the vhost thread
into separate threads for TX and RX. Shirley is doing some work in this
area and she is seeing perf improvements as long as the TX and RX threads
are on the same cpu socket.


On the surface I'd say a single thread makes some sense
as long as guest uses a single queue.

But this may not be scalable long term when we want to support a large
number of VMs, each having multiple virtio-net interfaces with multiple
queues.

Thanks
Sridhar



Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode

2011-11-30 Thread Sridhar Samudrala

On 11/30/2011 3:00 PM, Chris Wright wrote:

* Ben Hutchings (bhutchi...@solarflare.com) wrote:

On Wed, 2011-11-30 at 13:04 -0800, Chris Wright wrote:

I agree that it's confusing.  Couldn't you simplify your ascii art
(hopefully removing hw assumptions about receive processing, and
completely ignoring vlans for the moment) to something like:

            |RX
            v
 +----------------------+
 | +------------------+ |
 | |  RX MAC filter   | |
 | | and port select  | |
 | +------------------+ |
 |    /      |      \   |
 |   /       |       \  |
 | match   match   match|
 |   0       1       2  |
 |   v       v       v  |
 +---+-------+-------+--+
     |       |       |
     PF     VF 1    VF 2

And there's an unclear number of ways to update RX MAC filter and port
select table.

1) PF ndo_set_mac_addr
I expect that to be implicit to match 0.

2) PF ndo_set_rx_mode
Less clear, but I'd still expect these to implicitly match 0

3) PF ndo_set_vf_mac
I expect these to be an explicit match to VF N (given the interface
specifies which VF's MAC is being programmed).

I'm not sure whether this is supposed to implicitly add to the MAC
filter or whether that has to be changed too.  That's the main
difference between my models (a) and (b).

I see now.  I wasn't entirely clear on the difference before.  It's also
going to be hw specific.  I think (Intel folks can verify) that the
Intel SR-IOV devices have a single global unicast exact match table,
for example.


There's also PF ndo_set_vf_vlan.

Right, although I had mentioned I was trying to limit just to MAC
filtering to simplify.


4) VF ndo_set_mac_addr
This one may or may not be allowed (setting MAC+port if the VF is owned
by a guest is likely not allowed), but would expect an implicit VF N.

5) VF ndo_set_rx_mode
Same as 4) above.

So this is where we are today.

Cool, good that we agree there.


6) PF or VF? ndo_set_rx_filter_addr
The new proposal, which has an explicit VF, although when it's VF_SELF
I'm not clear if this is just the same as 5) above?

Have I missed anything?

Any physical port can be bridged to a mixture of guests with and without
their own VFs.  Packets sent from a guest with a VF to the address of a
guest without a VF need to be forwarded to the PF rather than the
physical port, but none of the drivers currently get to know about those
addresses.

To clarify, do you mean something like this?

        physical port
              |
    +---------+---------+
    |      +-----+      |
    |      | VEB |      |
    |      +-----+      |
    |     /   |   \     |
    +----+----+----+----+
         |    |    |
         PF  VF 1 VF 2
        /     |    |
  +---------+ VM4 +---------+
  |   sw    |     | macvtap |
  |  switch |     +---------+
  +-+--+--+-+          |
   /   |   \          VM5
 VM1  VM2  VM3

This has VMs 1-3 hanging off the PF via a Linux bridge (traditional hv
switching), VM4 directly owning VF1 (PCI device assignment), and VM5
indirectly owning VF2 (macvtap passthrough, that started this whole
thing).

So, I understand you're saying that VM4 or VM5 sending a packet to VM1
goes into the VEB, out the PF, and into the Linux bridging code, right? At
which point the PF is in promiscuous mode (btw, the same does not work if a
bridge is attached to a VF, at least for some VFs, due to lack of
promiscuous mode).


Packets sent from a guest with a VF to the address of another guest with
a VF need to be forwarded similarly, but the driver should be able to
infer that from (3).

Right, and that works currently for the case where both guests are like
VM4, i.e. they directly own the VF via PCI device assignment. But for VM4
to talk to VM5, VF3 is not in promiscuous mode and has a different MAC
address than VM5's vNIC. If the embedded bridge does not learn, and
nobody programmed it to forward frames for VM5 via VF3...

I think you are referring to VF2; there is no VF3 in your picture.
In macvtap passthru mode, VF2 will be set to the same MAC address as
VM5's MAC.

So VM4 should be able to talk to VM5.


I believe this is what Roopa's patch will allow. The question now is
whether there's a better way to handle this?

My understanding is that Roopa's patch will allow setting additional MAC
addresses for VM5 without the need to put the VF in promiscuous mode.

Thanks
Sridhar


In my mind, we'd model the NIC's embedded bridge as, well, a bridge.
And set anti-spoofing, port mirroring, port mac/vlan filtering, etc via
that bridge.

thanks,
-chris





Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-11 Thread Sridhar Samudrala

On 9/11/2011 6:18 AM, Roopa Prabhu wrote:



On 9/11/11 2:44 AM, Michael S. Tsirkin m...@redhat.com wrote:


AFAIK, though it might maintain a single filter table space in hw, hw does
know which filter belongs to which VF. And the OS driver does not need to do
anything special. The VF driver exposes a VF netdev. And any uc/mc addresses
registered with a VF netdev are registered with the hw by the driver. And hw
will filter and send only pkts that the VF has expressed interest in.

No special filter partitioning in hw is required.

Thanks,
Roopa

Yes, but what I mean is, if the size of the single filter table
is limited, we need to decide how many addresses each guest is
allowed. If we let one guest ask for as many as it wants, it can
lock others out.

Yes, true. In these cases, i.e. when the number of unicast addresses being
registered is more than it can handle, the VF driver will put the VF in
promiscuous mode (or at least it's supposed to; I think all drivers do
that).

What does putting the VF in promiscuous mode mean? How can the NIC decide
which set of MAC addresses is passed to the VF? Does it mean the VF sees
all the packets received by the NIC, including packets destined for other
VFs/the PF?

Thanks
Sridhar



Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Sridhar Samudrala
On Thu, 2011-09-08 at 09:19 -0700, Roopa Prabhu wrote:
 
 
 On 9/8/11 4:08 AM, Michael S. Tsirkin m...@redhat.com wrote:
 
  On Wed, Sep 07, 2011 at 10:20:28PM -0700, Roopa Prabhu wrote:
  On 9/7/11 5:34 AM, Michael S. Tsirkin m...@redhat.com wrote:
  
  On Tue, Sep 06, 2011 at 03:35:40PM -0700, Roopa Prabhu wrote:
  This patch is an attempt at providing address filtering support for
  macvtap devices in PASSTHRU mode. It's still a work in progress.
  Briefly tested for basic functionality. Wanted to get some feedback on
  the direction before proceeding.
  
  
  Good work, thanks.
  
  
  Thanks.
  
  I have hopefully CC'ed all concerned people.
  
  kvm crowd might also be interested.
  Try using ./scripts/get_maintainer.pl as well.
  
  Thanks for the tip. Expanded CC list a bit more.
  
  PASSTHRU mode today sets the lowerdev in promiscuous mode. In PASSTHRU
  mode there is a 1-1 mapping between the macvtap device and the physical
  NIC or VF, and all filtering is done in lowerdev hw. The lowerdev does
  not need to be in promiscuous mode as long as the guest filters are
  passed down to the lowerdev. This patch tries to remove the need for
  putting the lowerdev in promiscuous mode.
  I have also referred to the thread below where TUNSETTXFILTER was 
  mentioned
  in 
  this context: 
   http://patchwork.ozlabs.org/patch/69297/
  
  This patch basically passes the addresses got by TUNSETTXFILTER to 
  macvlan
  lowerdev.
  
  I have looked at previous work and discussions on this for qemu-kvm
  by Michael Tsirkin, Alex Williamson and Dragos Tatulea
  http://patchwork.ozlabs.org/patch/78595/
  http://patchwork.ozlabs.org/patch/47160/
  https://patchwork.kernel.org/patch/474481/
  
  Redhat bugzilla by Michael Tsirkin:
  https://bugzilla.redhat.com/show_bug.cgi?id=655013
  
  I used Michael's qemu-kvm patch for testing the changes with KVM
  
  I would like to cover both MAC and vlan filtering in this work.
  
  Open Questions/Issues:
  - There is a need for vlan filtering to complete the patch. It will require
    a new tap ioctl cmd for vlans.
    Some ideas on this are:
  
    a) TUNSETVLANFILTER: This will entail sending the whole vlan bitmap
    filter (similar to tun_filter for addresses). Passing the vlan ids to
    the lower device will mean going through the whole list of vlans every
    time.
  
    OR
  
    b) TUNSETVLAN with a vlan id and a flag to set/unset
  
    Does option 'b' sound ok?
  
  - In this implementation we make the macvlan address list the same as the
    address list that came in the filter with TUNSETTXFILTER. This will not
    cover cases where the macvlan device needs to have other addresses that
    are not necessarily in the filter. Is this a problem?
  
  What cases do you have in mind?
  
  This patch targets only macvlan PASSTHRU mode and for PASSTHRU mode I don't
  see a problem with uc/mc address list being the same in all the stacked
  netdevs in the path. I called that out above to make sure I was not missing
  any case in PASSTHRU mode where this might be invalid. Otherwise I don't 
  see
  a problem in the simple PASSTHRU use case this patch supports.
  
  - The patch currently only supports passing of IFF_PROMISC and
  IFF_MULTICAST
  filter flags to lowerdev
  
  This patch series implements the following
  01/3 - macvlan: Add support for unicast filtering in macvlan
  02/3 - macvlan: Add function to set addr filter on lower device in 
  passthru
  mode
  03/3 - macvtap: Add support for TUNSETTXFILTER
  
  Please comment. Thanks.
  
  Signed-off-by: Roopa Prabhu ropra...@cisco.com
  Signed-off-by: Christian Benvenuti be...@cisco.com
  Signed-off-by: David Wang dwa...@cisco.com
  
  The security isn't lower than with promisc, so I don't see
  a problem with this as such.
  
  There are more features we'll want down the road though,
  so let's see whether the interface will be able to
  satisfy them in a backwards compatible way before we
  set it in stone. Here's what I came up with:
  
  How will the filtering table be partitioned within guests?
  
  Since this patch supports macvlan PASSTHRU mode only, in which the lower
  device has a 1-1 mapping to the guest NIC, it does not require any
  partitioning of the filtering table within guests. Unless I misunderstood
  something.
  If the lower device were being shared by multiple guest network interfaces
  (non PASSTHRU mode), only then we will need to maintain separate filter
  tables for each guest network interface in macvlan and forward the pkt to
  respective guest interface after a filter lookup. This could affect
  performance too I think.
  
  Not with hardware filtering support. Which is where we'd need to
  partition the host nic mac table between guests.
  
 I need to understand this more. In non passthru case when a VF or physical
 nic is shared between guests, the nic does not really know about the guests,
 so I was thinking we do the same thing as we do for the passthru case (ie
 send all the 

Re: [net-next-2.6 PATCH 0/3 RFC] macvlan: MAC Address filtering support for passthru mode

2011-09-08 Thread Sridhar Samudrala

On 9/8/2011 8:00 PM, Roopa Prabhu wrote:



On 9/8/11 12:33 PM, Michael S. Tsirkin m...@redhat.com wrote:


On Thu, Sep 08, 2011 at 12:23:56PM -0700, Roopa Prabhu wrote:

I think the main usecase for passthru mode is to assign a SR-IOV VF to
a single guest.


Yes and for the passthru usecase this patch should be enough to enable
filtering in hw (eventually like I indicated before I need to fix vlan
filtering too).

So with filtering in hw, and in sriov VF case, VFs
actually share a filtering table. How will that
be partitioned?

AFAIK, though it might maintain a single filter table space in hw, hw does
know which filter belongs to which VF. And the OS driver does not need to do
anything special. The VF driver exposes a VF netdev. And any uc/mc addresses
registered with a VF netdev are registered with the hw by the driver. And hw
will filter and send only pkts that the VF has expressed interest in.

Does your NIC driver support adding multiple MAC addresses to a VF?
I have tried a few other SR-IOV NICs some time back and they didn't
support this feature.


Currently, we don't have an interface to add multiple MAC addresses to a
netdev other than the indirect way of creating a macvlan i/f on top of it.

Thanks
Sridhar



No special filter partitioning in hw is required.

Thanks,
Roopa






Re: [net-next RFC PATCH 0/7] multiqueue support for tun/tap

2011-08-12 Thread Sridhar Samudrala
On Fri, 2011-08-12 at 09:54 +0800, Jason Wang wrote:
 As multi-queue NICs are commonly used in high-end servers,
 the current single-queue-based tap cannot satisfy the
 requirement of scaling guest network performance as the
 number of vcpus increases. So the following series
 implements multi-queue support in tun/tap.
 
 In order to take advantage of this, a multi-queue capable
 driver and qemu are also needed. I just rebased the latest
 version of Krishna's multi-queue virtio-net driver into this
 series to simplify testing. And for multiqueue-capable
 qemu, you can refer to the patches I posted in
 http://www.spinics.net/lists/kvm/msg52808.html. Vhost is
 also a must to achieve high performance, and its code can
 be used for multi-queue without modification. Alternatively,
 this series can also be used for Krishna's M:N
 implementation of multiqueue, but I didn't test that.
 
 The idea is simple: each socket is abstracted as a queue
 for tun/tap, and userspace may open as many files as
 required and then attach them to the device. In order to
 keep ABI compatibility, device creation is still done in
 TUNSETIFF, and two new ioctls, TUNATTACHQUEUE and
 TUNDETACHQUEUE, were added for userspace to manipulate the
 number of queues for the tun/tap.

Is it possible to have tap create these queues automatically when
TUNSETIFF is called instead of requiring userspace to issue the new
ioctls? I am just wondering if it is possible to get multi-queue
enabled without any changes to qemu. I guess the number of queues
could be based on the number of vhost threads/guest virtio-net queues.

Also, is it possible to enable multi-queue on the host alone without
any guest virtio-net changes?

Have you done any multiple TCP_RR/UDP_RR testing with small packet
sizes? 256-byte request/response with 50-100 instances?

 
 I've done some basic performance testing of multi-queue
 tap. For tun, I just tested it through vpnc.
 
 Notes:
 - Tests show improvement when receiving packets from a
 local/external host to the guest, and when sending big packets
 from the guest to a local/external host.
 - The current multiqueue-based virtio-net/tap introduces a
 regression when sending small packets (512 bytes) from the guest
 to a local/external host. I suspect it's an issue of queue
 selection in both the guest driver and tap. Will continue to
 investigate.
 - I will post the performance numbers as a reply to this
 mail.
 
 TODO:
 - solve the issue of small-packet transmission
 - address the comments on the virtio-net driver
 - performance tuning
 
 Please review and comment. Thanks.
 
 ---
 
 Jason Wang (5):
   tuntap: move socket/sock related structures to tun_file
   tuntap: categorize ioctl
   tuntap: introduce multiqueue related flags
   tuntap: multiqueue support
   tuntap: add ioctls to attach or detach a file form tap device
 
 Krishna Kumar (2):
   Change virtqueue structure
   virtio-net changes
 
 
   drivers/net/tun.c   |  738 ++-
  drivers/net/virtio_net.c|  578 --
  drivers/virtio/virtio_pci.c |   10 -
  include/linux/if_tun.h  |5 
  include/linux/virtio.h  |1 
  include/linux/virtio_net.h  |3 
  6 files changed, 867 insertions(+), 468 deletions(-)
 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2 RFC 3/4] virtio_net: limit xmit polling

2011-06-02 Thread Sridhar Samudrala
On Thu, 2011-06-02 at 18:43 +0300, Michael S. Tsirkin wrote:
 Current code might introduce a lot of latency variation
 if there are many pending bufs at the time we
 attempt to transmit a new one. This is bad for
 real-time applications and can't be good for TCP either.
 
 Free up just enough to both clean up all buffers
 eventually and to be able to xmit the next packet.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
   drivers/net/virtio_net.c |  106 +-
   1 files changed, 67 insertions(+), 39 deletions(-)
 
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index a0ee78d..b25db1c 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -509,17 +509,33 @@ again:
   return received;
  }
 
 -static void free_old_xmit_skbs(struct virtnet_info *vi)
 +static bool free_old_xmit_skb(struct virtnet_info *vi)
  {
 	struct sk_buff *skb;
 	unsigned int len;
 
 -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 -		pr_debug("Sent skb %p\n", skb);
 -		vi->dev->stats.tx_bytes += skb->len;
 -		vi->dev->stats.tx_packets++;
 -		dev_kfree_skb_any(skb);
 -	}
 +	skb = virtqueue_get_buf(vi->svq, &len);
 +	if (unlikely(!skb))
 +		return false;
 +	pr_debug("Sent skb %p\n", skb);
 +	vi->dev->stats.tx_bytes += skb->len;
 +	vi->dev->stats.tx_packets++;
 +	dev_kfree_skb_any(skb);
 +	return true;
 +}
 +
 +/* Check capacity and try to free enough pending old buffers to enable queueing
 + * new ones. Return true if we can guarantee that a following
 + * virtqueue_add_buf will succeed. */
 +static bool free_xmit_capacity(struct virtnet_info *vi)
 +{
 +	struct sk_buff *skb;
 +	unsigned int len;
 +
 +	while (virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2)
 +		if (unlikely(!free_old_xmit_skb(vi)))
 +			return false;
If we are using INDIRECT descriptors, 1 descriptor entry is good enough
to guarantee that an skb can be queued unless we run out of memory.
Is it worth checking if 'indirect' is set on the svq and then only freeing
1 descriptor? Otherwise, we will be dropping the packet if there are
less than 18 free descriptors although we only need 1.


 + return true;
  }
 
  static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
 @@ -572,30 +588,34 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
  static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
  {
 	struct virtnet_info *vi = netdev_priv(dev);
 -	int capacity;
 -
 -	/* Free up any pending old buffers before queueing new ones. */
 -	free_old_xmit_skbs(vi);
 -
 -	/* Try to transmit */
 -	capacity = xmit_skb(vi, skb);
 -
 -	/* This can happen with OOM and indirect buffers. */
 -	if (unlikely(capacity < 0)) {
 -		if (net_ratelimit()) {
 -			if (likely(capacity == -ENOMEM)) {
 -				dev_warn(&dev->dev,
 -					 "TX queue failure: out of memory\n");
 -			} else {
 -				dev->stats.tx_fifo_errors++;
 +	int ret, i;
 +
 +	/* We normally do have space in the ring, so try to queue the skb as
 +	 * fast as possible. */
 +	ret = xmit_skb(vi, skb);
 +	if (unlikely(ret < 0)) {
 +		/* This triggers on the first xmit after ring full condition.
 +		 * We need to free up some skbs first. */
 +		if (likely(free_xmit_capacity(vi))) {
 +			ret = xmit_skb(vi, skb);
 +			/* This should never fail. Check, just in case. */
 +			if (unlikely(ret < 0)) {
 				dev_warn(&dev->dev,
 					 "Unexpected TX queue failure: %d\n",
 -					 capacity);
 +					 ret);
 +				dev->stats.tx_fifo_errors++;
 +				dev->stats.tx_dropped++;
 +				kfree_skb(skb);
 +				return NETDEV_TX_OK;
 			}
 +		} else {
 +			/* Ring full: it might happen if we get a callback while
 +			 * the queue is still mostly full. This should be
 +			 * extremely rare. */
 +			dev->stats.tx_dropped++;
 +			kfree_skb(skb);
 +			goto stop;
 		}
 -		dev->stats.tx_dropped++;
 -		kfree_skb(skb);
 -		return NETDEV_TX_OK;
 	}
 	virtqueue_kick(vi->svq);
 
 @@ -603,18 +623,26 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);
 
 - /* Apparently nice girls don't return TX_BUSY; stop the queue
 -  * before it gets out of hand.  Naturally, this wastes 

Re: [PATCHv2 dontapply] vhost-net tx tuning

2011-02-01 Thread Sridhar Samudrala
On Tue, 2011-02-01 at 17:52 +0200, Michael S. Tsirkin wrote:
 OK, so thinking about it more, maybe the issue is this:
 tx becomes full. We process one request and interrupt the guest,
 then it adds one request and the queue is full again.
 
 Maybe the following will help it stabilize?  By default with it we will
 only interrupt when we see an empty ring.
 Which is liklely too much: pls try other values
 in the middle: e.g. make bufs half the ring,
 or bytes some small value like half ring * 200, or packets some
 small value etc.
 
 Set any one parameter to 0 to get current
 behaviour (interrupt immediately when enabled).
 
 Warning: completely untested.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 ---
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index aac05bc..6769cdc 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -32,6 +32,13 @@
   * Using this limit prevents one virtqueue from starving others. */
  #define VHOST_NET_WEIGHT 0x80000
 
 +int tx_bytes_coalesce = 10;
 +module_param(tx_bytes_coalesce, int, 0644);
 +int tx_bufs_coalesce = 10;
 +module_param(tx_bufs_coalesce, int, 0644);
 +int tx_packets_coalesce = 10;
 +module_param(tx_packets_coalesce, int, 0644);
 +
  enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
 @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
 +	int bytes_coalesced = 0;
 +	int bufs_coalesced = 0;
 +	int packets_coalesced = 0;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
 @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
 -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
 +		packets_coalesced += 1;
 +		bytes_coalesced += len;
 +		bufs_coalesced += out;
 +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
 +			     bytes_coalesced > tx_bytes_coalesce ||
 +			     bufs_coalesced > tx_bufs_coalesce))
 +			vhost_add_used_and_signal(&net->dev, vq, head, 0);

I think the counters that exceed the limits need to be reset to 0 here.
Otherwise we keep signaling for every buffer once we hit this condition.

Thanks
Sridhar

 +		else
 +			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
 
 +	if (likely(packets_coalesced &&
 +		   bytes_coalesced &&
 +		   bufs_coalesced))
 +		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }
 
 --
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: Network performance with small packets

2011-01-31 Thread Sridhar Samudrala
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:
 
  OK, so thinking about it more, maybe the issue is this:
  tx becomes full. We process one request and interrupt the guest,
  then it adds one request and the queue is full again.
 
  Maybe the following will help it stabilize?
  By itself it does nothing, but if you set
  all the parameters to a huge value we will
  only interrupt when we see an empty ring.
  Which might be too much: pls try other values
  in the middle: e.g. make bufs half the ring,
  or bytes some small value, or packets some
  small value etc.
 
  Warning: completely untested.
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index aac05bc..6769cdc 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -32,6 +32,13 @@
   * Using this limit prevents one virtqueue from starving others. */
  #define VHOST_NET_WEIGHT 0x80000
 
 +int tx_bytes_coalesce = 0;
 +module_param(tx_bytes_coalesce, int, 0644);
 +int tx_bufs_coalesce = 0;
 +module_param(tx_bufs_coalesce, int, 0644);
 +int tx_packets_coalesce = 0;
 +module_param(tx_packets_coalesce, int, 0644);
 +
  enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
 @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
 +	int bytes_coalesced = 0;
 +	int bufs_coalesced = 0;
 +	int packets_coalesced = 0;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
 @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
 -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
 +		packets_coalesced += 1;
 +		bytes_coalesced += len;
 +		bufs_coalesced += in;
 
 Should this instead be:
   bufs_coalesced += out;
 
 Perusing the code I see that earlier there is a check to see if in is not
 zero, and, if so, error out of the loop.  After the check, in is not
 touched until it is added to bufs_coalesced, effectively not changing
 bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
 below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by
delaying the signaling until we come out of the for loop.
It definitely reduced the number of vmexits significantly for the
small-message guest-to-host stream test, and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}
 
+	if (total_len > 0)
+		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }
 

 
 Or am I missing something?
 
 +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
 +			     bytes_coalesced > tx_bytes_coalesce ||
 +			     bufs_coalesced > tx_bufs_coalesce))
 +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 +		else
 +			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
 
 +	if (likely(packets_coalesced < tx_packets_coalesce ||
 +		   bytes_coalesced < tx_bytes_coalesce ||
 +		   bufs_coalesced < tx_bufs_coalesce))
 +		vhost_signal(&net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

It is possible that we can miss signaling the guest even after
processing a few pkts, if we don't hit any of these conditions.

 
 
 Steve D.
 



Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Sridhar Samudrala
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
 When MSI is off, each interrupt needs to be bounced through the io
 thread when it's set/cleared, so vhost-net causes more context switches and
 higher CPU utilization than userspace virtio which handles networking in
 the same thread.
 
 We'll need to fix this by adding level irq support in kvm irqfd,
 for now disable vhost-net in these configurations.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 I need to report some error from virtio-pci
 that would be handled specially (disable but don't
 report an error) so I wanted one that's never likely to be used by a
 userspace ioctl. I selected ERANGE but it'd
 be easy to switch to something else. Comments?

Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?

-Sridhar
 
  hw/vhost.c  |4 +++-
  hw/virtio-net.c |6 --
  hw/virtio-pci.c |3 +++
  hw/virtio.h |2 ++
  4 files changed, 12 insertions(+), 3 deletions(-)
 
 diff --git a/hw/vhost.c b/hw/vhost.c
 index 1d09ed0..c79765a 100644
 --- a/hw/vhost.c
 +++ b/hw/vhost.c
 @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
  
      r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
      if (r < 0) {
 -        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
 +        if (r != -EVIRTIO_DISABLED) {
 +            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
 +        }
          goto fail_notifiers;
      }
  
 diff --git a/hw/virtio-net.c b/hw/virtio-net.c
 index ccb3e63..5de3fee 100644
 --- a/hw/virtio-net.c
 +++ b/hw/virtio-net.c
 @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
          if (!n->vhost_started) {
              int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
              if (r < 0) {
 -                error_report("unable to start vhost net: %d: "
 -                             "falling back on userspace virtio", -r);
 +                if (r != -EVIRTIO_DISABLED) {
 +                    error_report("unable to start vhost net: %d: "
 +                                 "falling back on userspace virtio", -r);
 +                }
              } else {
                  n->vhost_started = 1;
              }
 diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
 index dd8887a..dbf4be0 100644
 --- a/hw/virtio-pci.c
 +++ b/hw/virtio-pci.c
 @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
      EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
  
      if (assign) {
 +        if (!msix_enabled(&proxy->pci_dev)) {
 +            return -EVIRTIO_DISABLED;
 +        }
          int r = event_notifier_init(notifier, 0);
          if (r < 0) {
              return r;
 diff --git a/hw/virtio.h b/hw/virtio.h
 index d8546d5..53bbdba 100644
 --- a/hw/virtio.h
 +++ b/hw/virtio.h
 @@ -98,6 +98,8 @@ typedef struct {
      void (*vmstate_change)(void * opaque, bool running);
  } VirtIOBindings;
  
 +#define EVIRTIO_DISABLED ERANGE
 +
  #define VIRTIO_PCI_QUEUE_MAX 64
  
  #define VIRTIO_NO_VECTOR 0xffff



Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Sridhar Samudrala
On Thu, 2011-01-20 at 19:47 +0200, Michael S. Tsirkin wrote:
 On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote:
  On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
   When MSI is off, each interrupt needs to be bounced through the io
   thread when it's set/cleared, so vhost-net causes more context switches 
   and
   higher CPU utilization than userspace virtio which handles networking in
   the same thread.
   
   We'll need to fix this by adding level irq support in kvm irqfd,
   for now disable vhost-net in these configurations.
   
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   ---
   
   I need to report some error from virtio-pci
   that would be handled specially (disable but don't
   report an error) so I wanted one that's never likely to be used by a
   userspace ioctl. I selected ERANGE but it'd
   be easy to switch to something else. Comments?
  
  Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?
  
  -Sridhar
 
 The error is reported by virtio-pci which does not know about vhost.
 I started with EVIRTIO_MSIX_DISABLED and made is shorter.
 Would EVIRTIO_MSIX_DISABLED be better?

I think so. This makes it more clear.
-Sridhar
 
   



Re: Performance test result between per-vhost kthread disable and enable

2010-11-23 Thread Sridhar Samudrala

On 11/23/2010 5:41 AM, Michael S. Tsirkin wrote:

On Tue, Nov 23, 2010 at 09:23:41PM +0800, lidong chen wrote:

At this point, I'd suggest testing vhost-net on the upstream kernel,
not on rhel kernels. The change that introduced per-device threads is:
c23f3445e68e1db0e74099f264bc5ff5d55ebdeb
i will try this tomorrow.

Is CONFIG_SCHED_DEBUG set?
yes. CONFIG_SCHED_DEBUG=y.

Disable it. Either debug scheduler or perf-test it :)

Another debug option  to disable is CONFIG_WORKQUEUE_TRACER if it is set
when using old rhel6 kernels.

-Sridhar


2010/11/23 Michael S. Tsirkinm...@redhat.com:

On Tue, Nov 23, 2010 at 10:13:43AM +0800, lidong chen wrote:

I test the performance between per-vhost kthread disable and enable.

Test method:
Send the same traffic load between per-vhost kthread disable and
enable, and compare the cpu rate of host os.
I run five vm on kvm, each of them have five nic.
the vhost version which per-vhost kthread disable we used is rhel6
beta 2(2.6.32.60).
the vhost version which per-vhost kthread enable we used is rhel6 (2.6.32-71).

At this point, I'd suggest testing vhost-net on the upstream kernel,
not on rhel kernels. The change that introduced per-device threads is:
c23f3445e68e1db0e74099f264bc5ff5d55ebdeb


Test result:
with per-vhost kthread disable, the cpu rate of host os is 110%.
with per-vhost kthread enable, the cpu rate of host os is 130%.

Is CONFIG_SCHED_DEBUG set? We are stressing the scheduler a lot with
vhost-net.


In 2.6.32.60, the whole system has only one vhost kthread.
[r...@rhel6-kvm1 ~]# ps -ef | grep vhost
root   973 2  0 Nov22 ?00:00:00 [vhost]

In 2.6.32.71, the whole system has 25 vhost kthreads.
[r...@kvm-4slot ~]# ps -ef | grep vhost-
root 12896 2  0 10:26 ?00:00:00 [vhost-12842]
root 12897 2  0 10:26 ?00:00:00 [vhost-12842]
root 12898 2  0 10:26 ?00:00:00 [vhost-12842]
root 12899 2  0 10:26 ?00:00:00 [vhost-12842]
root 12900 2  0 10:26 ?00:00:00 [vhost-12842]

root 13022 2  0 10:26 ?00:00:00 [vhost-12981]
root 13023 2  0 10:26 ?00:00:00 [vhost-12981]
root 13024 2  0 10:26 ?00:00:00 [vhost-12981]
root 13025 2  0 10:26 ?00:00:00 [vhost-12981]
root 13026 2  0 10:26 ?00:00:00 [vhost-12981]

root 13146 2  0 10:26 ?00:00:00 [vhost-13088]
root 13147 2  0 10:26 ?00:00:00 [vhost-13088]
root 13148 2  0 10:26 ?00:00:00 [vhost-13088]
root 13149 2  0 10:26 ?00:00:00 [vhost-13088]
root 13150 2  0 10:26 ?00:00:00 [vhost-13088]
...

Code difference:
In 2.6.32.60,in function vhost_init, create the kthread for vhost.
vhost_workqueue = create_singlethread_workqueue(vhost);

In 2.6.32.71,in function vhost_dev_set_owner, create the kthread for
each nic interface.
dev-wq = create_singlethread_workqueue(vhost_name);

Conclusion:
with per-vhost kthread enabled, the system can achieve more throughput.
but when dealing with the same traffic load, per-vhost kthread wastes
more cpu resources.

In my application scenario, the cpu resource is more important, and one
kthread for dealing with the traffic load is enough.

So i think we should add a param to control this.
for a CPU-bound system, this param disables per-vhost kthread.
for an I/O-bound system, this param enables per-vhost kthread.
the default value of this param is enable.

If my opinion is right, i will give a patch for this.

Let's try to figure out what the issue is, first.

--
MST






Re: [PATCH] macvlan: Introduce 'passthru' mode to takeover the underlying device

2010-10-31 Thread Sridhar Samudrala

On 10/29/2010 6:45 AM, Arnd Bergmann wrote:

On Friday 29 October 2010, Sridhar Samudrala wrote:

With the current default 'vepa' mode, a KVM guest using virtio with
macvtap backend has the following limitations.
- cannot change/add a mac address on the guest virtio-net

I believe this could be changed if there is a neeed, but I actually
consider it one of the design points of macvlan that the guest
is not able to change the mac address. With 802.1Qbg you rely on
the switch being able to identify the guest by its MAC address,
which the host kernel must ensure.

Currently the host cannot prevent a guest user from trying to change/add
a mac address on the guest virtio-net. From the guest's point of view, the
request succeeds, but the incoming packets are dropped silently by the
host interface.


- cannot create a vlan device on the guest virtio-net

Why not? If this doesn't work, it's probably a bug!
Because the host is not aware of the guest vlan tag and the host 
external interface will filter out

incoming vlan tagged packets.


Why does the passthru mode enable it if it doesn't work
already?

passthru mode puts the host external interface in promiscuous mode which 
allows vlan tagged

packets to be received.

Even in tap/bridge mode, this works because adding an external interface 
to the bridge causes it to be

put in promiscuous mode.

Thanks
Sridhar



Re: [RFC PATCH 1/1] vhost: TX used buffer guest signal accumulation

2010-10-28 Thread Sridhar Samudrala
On Thu, 2010-10-28 at 13:13 -0700, Shirley Ma wrote:
 On Thu, 2010-10-28 at 12:32 -0700, Shirley Ma wrote:
  Also I found a big TX regression for old guest and new guest. For old
  guest, I am able to get almost 11Gb/s for 2K message size, but for the
  new guest kernel, I can only get 3.5 Gb/s with the patch and same
  host.
  I will dig it why. 
 
 The regression is from guest kernel, not from this patch. Tested 2.6.31
 kernel, its performance is less than 2Gb/s for 2K message size already.
 I will resubmit the patch for review. 
 
 I will start to test from 2.6.30 kernel to figure it when TX regression
 induced in virtio_net. Any suggestion which guest kernel I should test
 to figure out this regression?

It would be some change in virtio-net driver that may have improved the
latency of small messages which in turn would have reduced the bandwidth
as TCP could not accumulate and send large packets.

Thanks
Sridhar



[PATCH iproute2] Support 'mode' parameter when creating macvtap device

2010-10-28 Thread Sridhar Samudrala
Add support for 'mode' parameter when creating a macvtap device.
This allows a macvtap device to be created in bridge, private or
the default vepa modes.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---

diff --git a/ip/Makefile b/ip/Makefile
index 2f223ca..6054e8a 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
 ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
 ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
 iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
-iplink_macvlan.o
+iplink_macvlan.o iplink_macvtap.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
new file mode 100644
index 000..35199b1
--- /dev/null
+++ b/ip/iplink_macvtap.c
@@ -0,0 +1,90 @@
+/*
+ * iplink_macvtap.c	macvtap device support
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <linux/if_link.h>
+
+#include "rt_names.h"
+#include "utils.h"
+#include "ip_common.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+		"Usage: ... macvtap mode { private | vepa | bridge }\n"
+	);
+}
+
+static int mode_arg(void)
+{
+	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
+		"\"vepa\" or \"bridge\"\n");
+	return -1;
+}
+
+static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
+			     struct nlmsghdr *n)
+{
+	while (argc > 0) {
+		if (matches(*argv, "mode") == 0) {
+			__u32 mode = 0;
+			NEXT_ARG();
+
+			if (strcmp(*argv, "private") == 0)
+				mode = MACVLAN_MODE_PRIVATE;
+			else if (strcmp(*argv, "vepa") == 0)
+				mode = MACVLAN_MODE_VEPA;
+			else if (strcmp(*argv, "bridge") == 0)
+				mode = MACVLAN_MODE_BRIDGE;
+			else
+				return mode_arg();
+
+			addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
+		} else if (matches(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "macvtap: what is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--, argv++;
+	}
+
+	return 0;
+}
+
+static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
+{
+	__u32 mode;
+
+	if (!tb)
+		return;
+
+	if (!tb[IFLA_MACVLAN_MODE] ||
+	    RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE]) < sizeof(__u32))
+		return;
+
+	mode = *(__u32 *)RTA_DATA(tb[IFLA_MACVLAN_MODE]);
+	fprintf(f, " mode %s ",
+		  mode == MACVLAN_MODE_PRIVATE ? "private"
+		: mode == MACVLAN_MODE_VEPA    ? "vepa"
+		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		:				 "unknown");
+}
+
+struct link_util macvtap_link_util = {
+	.id		= "macvtap",
+	.maxattr	= IFLA_MACVLAN_MAX,
+	.parse_opt	= macvtap_parse_opt,
+	.print_opt	= macvtap_print_opt,
+};




[PATCH iproute2] macvlan/macvtap: support 'passthru' mode

2010-10-28 Thread Sridhar Samudrala
Add support for 'passthru' mode when creating a macvlan/macvtap device
which allows takeover of the underlying device and passing it to a KVM
guest using virtio with macvtap backend.

Only one macvlan device is allowed in passthru mode. It inherits
the mac address from the underlying device, which is put in promiscuous
mode to receive and forward all the packets.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index f5bb2dc..23de79e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -230,6 +230,7 @@ enum macvlan_mode {
MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
MACVLAN_MODE_VEPA= 2, /* talk to other ports through ext bridge */
MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
+   MACVLAN_MODE_PASSTHRU  = 8, /* take over the underlying device */
 };
 
 /* SR-IOV virtual function management section */
diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
index a3c78bd..15022aa 100644
--- a/ip/iplink_macvlan.c
+++ b/ip/iplink_macvlan.c
@@ -23,14 +23,14 @@
 static void explain(void)
 {
 	fprintf(stderr,
-		"Usage: ... macvlan mode { private | vepa | bridge }\n"
+		"Usage: ... macvlan mode { private | vepa | bridge | passthru }\n"
 	);
 }
 
 static int mode_arg(void)
 {
 	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
-		"\"vepa\" or \"bridge\"\n");
+		"\"vepa\", \"bridge\" or \"passthru\"\n");
 	return -1;
 }
 
@@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			mode = MACVLAN_MODE_VEPA;
 		else if (strcmp(*argv, "bridge") == 0)
 			mode = MACVLAN_MODE_BRIDGE;
+		else if (strcmp(*argv, "passthru") == 0)
+			mode = MACVLAN_MODE_PASSTHRU;
 		else
 			return mode_arg();
 
@@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 		  mode == MACVLAN_MODE_PRIVATE ? "private"
 		: mode == MACVLAN_MODE_VEPA    ? "vepa"
 		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU ? "passthru"
 		:				 "unknown");
 }
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
index 35199b1..5665b6d 100644
--- a/ip/iplink_macvtap.c
+++ b/ip/iplink_macvtap.c
@@ -20,14 +20,14 @@
 static void explain(void)
 {
 	fprintf(stderr,
-		"Usage: ... macvtap mode { private | vepa | bridge }\n"
+		"Usage: ... macvtap mode { private | vepa | bridge | passthru }\n"
 	);
 }
 
 static int mode_arg(void)
 {
 	fprintf(stderr, "Error: argument of \"mode\" must be \"private\", "
-		"\"vepa\" or \"bridge\"\n");
+		"\"vepa\", \"bridge\" or \"passthru\"\n");
 	return -1;
 }
 
@@ -45,6 +45,8 @@ static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
 			mode = MACVLAN_MODE_VEPA;
 		else if (strcmp(*argv, "bridge") == 0)
 			mode = MACVLAN_MODE_BRIDGE;
+		else if (strcmp(*argv, "passthru") == 0)
+			mode = MACVLAN_MODE_PASSTHRU;
 		else
 			return mode_arg();
 
@@ -79,6 +81,7 @@ static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
 		  mode == MACVLAN_MODE_PRIVATE ? "private"
 		: mode == MACVLAN_MODE_VEPA    ? "vepa"
 		: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+		: mode == MACVLAN_MODE_PASSTHRU ? "passthru"
 		:				 "unknown");
 }
 




[PATCH iproute2] Add passthru mode and support 'mode' parameter with macvtap devices

2010-10-26 Thread Sridhar Samudrala
Support a new 'passthru' mode with macvlan and 'mode' parameter
with macvtap devices.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index f5bb2dc..23de79e 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -230,6 +230,7 @@ enum macvlan_mode {
	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
	MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */
	MACVLAN_MODE_BRIDGE  = 4, /* talk to bridge ports directly */
+	MACVLAN_MODE_PASSTHRU = 8, /* take over the underlying device */
 };
 
 /* SR-IOV virtual function management section */
diff --git a/ip/Makefile b/ip/Makefile
index 2f223ca..6054e8a 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -3,7 +3,7 @@ IPOBJ=ip.o ipaddress.o ipaddrlabel.o iproute.o iprule.o \
 ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o iptuntap.o \
 ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o \
 iplink_vlan.o link_veth.o link_gre.o iplink_can.o \
-iplink_macvlan.o
+iplink_macvlan.o iplink_macvtap.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/iplink_macvlan.c b/ip/iplink_macvlan.c
index a3c78bd..97787f9 100644
--- a/ip/iplink_macvlan.c
+++ b/ip/iplink_macvlan.c
@@ -48,6 +48,8 @@ static int macvlan_parse_opt(struct link_util *lu, int argc, char **argv,
		mode = MACVLAN_MODE_VEPA;
	else if (strcmp(*argv, "bridge") == 0)
		mode = MACVLAN_MODE_BRIDGE;
+	else if (strcmp(*argv, "passthru") == 0)
+		mode = MACVLAN_MODE_PASSTHRU;
	else
		return mode_arg();
 
@@ -82,6 +84,7 @@ static void macvlan_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
	  mode == MACVLAN_MODE_PRIVATE ? "private"
	: mode == MACVLAN_MODE_VEPA    ? "vepa"
	: mode == MACVLAN_MODE_BRIDGE  ? "bridge"
+	: mode == MACVLAN_MODE_PASSTHRU ? "passthru"
	:				"unknown");
 }
 
diff --git a/ip/iplink_macvtap.c b/ip/iplink_macvtap.c
new file mode 100644
index 000..040cc68
--- /dev/null
+++ b/ip/iplink_macvtap.c
@@ -0,0 +1,93 @@
+/*
+ * iplink_macvtap.cmacvtap device support
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ */
+
+#include stdio.h
+#include stdlib.h
+#include string.h
+#include sys/socket.h
+#include linux/if_link.h
+
+#include rt_names.h
+#include utils.h
+#include ip_common.h
+
+static void explain(void)
+{
+   fprintf(stderr,
+   Usage: ... macvtap mode { private | vepa | bridge | passthru 
}\n
+   );
+}
+
+static int mode_arg(void)
+{
+fprintf(stderr, Error: argument of \mode\ must be \private\, 
+   \vepa\ or \bridge\ \passthru\\n);
+return -1;
+}
+
+static int macvtap_parse_opt(struct link_util *lu, int argc, char **argv,
+ struct nlmsghdr *n)
+{
+   while (argc  0) {
+   if (matches(*argv, mode) == 0) {
+   __u32 mode = 0;
+   NEXT_ARG();
+
+   if (strcmp(*argv, private) == 0)
+   mode = MACVLAN_MODE_PRIVATE;
+   else if (strcmp(*argv, vepa) == 0)
+   mode = MACVLAN_MODE_VEPA;
+   else if (strcmp(*argv, bridge) == 0)
+   mode = MACVLAN_MODE_BRIDGE;
+   else if (strcmp(*argv, passthru) == 0)
+   mode = MACVLAN_MODE_PASSTHRU;
+   else
+   return mode_arg();
+
+   addattr32(n, 1024, IFLA_MACVLAN_MODE, mode);
+   } else if (matches(*argv, help) == 0) {
+   explain();
+   return -1;
+   } else {
+   fprintf(stderr, macvtap: what is \%s\?\n, *argv);
+   explain();
+   return -1;
+   }
+   argc--, argv++;
+   }
+
+   return 0;
+}
+
+static void macvtap_print_opt(struct link_util *lu, FILE *f, struct rtattr 
*tb[])
+{
+   __u32 mode;
+
+   if (!tb)
+   return;
+
+   if (!tb[IFLA_MACVLAN_MODE] ||
+   RTA_PAYLOAD(tb[IFLA_MACVLAN_MODE])  sizeof(__u32))
+   return;
+
+   mode = *(__u32 *)RTA_DATA(tb[IFLA_VLAN_ID]);
+   fprintf(f,  mode %s ,
+ mode == MACVLAN_MODE_PRIVATE ? private
+   : mode == MACVLAN_MODE_VEPA? vepa
+   : mode == MACVLAN_MODE_BRIDGE  ? bridge
+   : mode

[RFC PATCH] macvlan: Introduce a PASSTHRU mode to takeover the underlying device

2010-10-26 Thread Sridhar Samudrala
With the current default macvtap mode, a KVM guest using virtio with 
macvtap backend has the following limitations.
- cannot change/add a mac address on the guest virtio-net
- cannot create a vlan device on the guest virtio-net
- cannot enable promiscuous mode on guest virtio-net

This patch introduces a new mode called 'passthru' when creating a 
macvlan device which allows takeover of the underlying device and 
passing it to a guest using virtio with macvtap backend.

Only one macvlan device is allowed in passthru mode; it inherits
the mac address from the underlying device, which is put into promiscuous
mode so that it receives and forwards all packets.

Thanks
Sridhar

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 0ef0eb0..bca3cb7 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -38,6 +38,7 @@ struct macvlan_port {
	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
	struct list_head	vlans;
	struct rcu_head		rcu;
+	bool			passthru;
 };
 
 #define macvlan_port_get_rcu(dev) \
@@ -169,6 +170,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
		macvlan_broadcast(skb, port, NULL,
				  MACVLAN_MODE_PRIVATE |
				  MACVLAN_MODE_VEPA    |
+				  MACVLAN_MODE_PASSTHRU|
				  MACVLAN_MODE_BRIDGE);
	else if (src->mode == MACVLAN_MODE_VEPA)
		/* flood to everyone except source */
@@ -185,7 +187,10 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
		return skb;
	}
 
-	vlan = macvlan_hash_lookup(port, eth->h_dest);
+	if (port->passthru)
+		vlan = list_first_entry(&port->vlans, struct macvlan_dev, list);
+	else
+		vlan = macvlan_hash_lookup(port, eth->h_dest);
	if (vlan == NULL)
		return skb;
 
@@ -284,6 +289,11 @@ static int macvlan_open(struct net_device *dev)
	struct net_device *lowerdev = vlan->lowerdev;
	int err;
 
+	if (vlan->port->passthru) {
+		dev_set_promiscuity(lowerdev, 1);
+		goto hash_add;
+	}
+
	err = -EBUSY;
	if (macvlan_addr_busy(vlan->port, dev->dev_addr))
		goto out;
@@ -296,6 +306,8 @@ static int macvlan_open(struct net_device *dev)
		if (err < 0)
			goto del_unicast;
	}
+
+hash_add:
	macvlan_hash_add(vlan);
	return 0;
 
@@ -310,12 +322,18 @@ static int macvlan_stop(struct net_device *dev)
	struct macvlan_dev *vlan = netdev_priv(dev);
	struct net_device *lowerdev = vlan->lowerdev;
 
+	if (vlan->port->passthru) {
+		dev_set_promiscuity(lowerdev, -1);
+		goto hash_del;
+	}
+
	dev_mc_unsync(lowerdev, dev);
	if (dev->flags & IFF_ALLMULTI)
		dev_set_allmulti(lowerdev, -1);
 
	dev_uc_del(lowerdev, dev->dev_addr);
 
+hash_del:
	macvlan_hash_del(vlan);
	return 0;
 }
@@ -549,6 +567,7 @@ static int macvlan_port_create(struct net_device *dev)
	if (port == NULL)
		return -ENOMEM;
 
+	port->passthru = false;
	port->dev = dev;
	INIT_LIST_HEAD(&port->vlans);
	for (i = 0; i < MACVLAN_HASH_SIZE; i++)
@@ -593,6 +612,7 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
	case MACVLAN_MODE_PRIVATE:
	case MACVLAN_MODE_VEPA:
	case MACVLAN_MODE_BRIDGE:
+	case MACVLAN_MODE_PASSTHRU:
		break;
	default:
		return -EINVAL;
@@ -661,6 +681,10 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
	}
	port = macvlan_port_get(lowerdev);
 
+	/* Only 1 macvlan device can be created in passthru mode */
+	if (port->passthru)
+		return -EINVAL;
+
	vlan->lowerdev = lowerdev;
	vlan->dev      = dev;
	vlan->port     = port;
@@ -671,6 +695,13 @@ int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
	if (data && data[IFLA_MACVLAN_MODE])
		vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]);
 
+	if (vlan->mode == MACVLAN_MODE_PASSTHRU) {
+		if (!list_empty(&port->vlans))
+			return -EINVAL;
+		port->passthru = true;
+		memcpy(dev->dev_addr, lowerdev->dev_addr, ETH_ALEN);
+	}
+
	err = register_netdevice(dev);
	if (err < 0)
		goto destroy_port;
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 2fc66dd..8454805 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -232,6 +232,7 @@ enum macvlan_mode {
	MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */
	MACVLAN_MODE_VEPA    = 2, /* talk to other ports through ext bridge */

Re: [PATCH v11 13/17] Add mp(mediate passthru) device.

2010-09-28 Thread Sridhar Samudrala

 On 9/28/2010 8:18 AM, Arnd Bergmann wrote:

On Tuesday 28 September 2010, Michael S. Tsirkin wrote:

On Tue, Sep 28, 2010 at 04:39:59PM +0200, Arnd Bergmann wrote:

Can you be more specific what the problem is? Do you think
it breaks when a guest sends VLAN tagged frames or when macvtap
is connected to a VLAN interface that adds another tag (or
only the combination)?

I expect the protocol value to be wrong when guest sends vlan tagged
frames as 802.1q frames have a different format.

Ok, I see. Would that be fixed by using eth_type_trans()? I don't
see any code in there that tries to deal with the VLAN tag, so
do we have the same problem in the tun/tap driver?
tun_get_user() does call eth_type_trans(). Not sure why I didn't use it
in the macvtap code.

Need to test it with guest VLAN tagging to make sure it works.

Also, I wonder how we handle the case where both the guest and
the host do VLAN tagging. Does the host transparently override
the guest tag, or does it add a nested tag? More importantly,
what should it do?

I would think If both guest and host do VLAN tagging, the tags will be 
nested.


Thanks
Sridhar





Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Sridhar Samudrala

 On 9/9/2010 2:45 AM, Krishna Kumar2 wrote:

Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:

Some more results and likely cause for single netperf
degradation below.



Guest -> Host (single netperf):
I am getting a drop of almost 20%. I am trying to figure out
why.

Host -> guest (single netperf):
I am getting an improvement of almost 15%. Again - unexpected.

Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
for runs upto 128 sessions. With fewer netperf (under 8), there
was a drop of 3-7% in #packets, but beyond that, the #packets
improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions is having negative effect for
some reason on the tx side. The code path in virtio-net has not
changed much, so the drop in some cases is quite unexpected.

The drop for the single netperf seems to be due to multiple vhost.
I changed the patch to start *single* vhost:

Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
Guest -> Host (1 netperf) : Latency: -3%, SD: 3.5%
I remember seeing a similar issue when using a separate vhost thread for
TX and RX queues. Basically, we should have the same vhost thread process
a TCP flow in both directions. I guess this allows the data and ACKs to be
processed in sync.



Thanks
Sridhar

Single vhost performs well but hits the barrier at 16 netperf
sessions:

SINGLE vhost (Guest -> Host):
1 netperf:    BW: 10.7%  SD: -1.4%
4 netperfs:   BW: 3%     SD: 1.4%
8 netperfs:   BW: 17.7%  SD: -10%
16 netperfs:  BW: 4.7%   SD: -7.0%
32 netperfs:  BW: -6.1%  SD: -5.7%
BW and SD both improves (guest multiple txqs help). For 32
netperfs, SD improves.

But with multiple vhosts, guest is able to send more packets
and BW increases much more (SD too increases, but I think
that is expected). From the earlier results:

N#    BW1     BW2 (%)         SD1    SD2 (%)         RSD1    RSD2 (%)
_____________________________________________________________________
4     26387   40716 (54.30)   20     28    (40.00)   86      85    (-1.16)
8     24356   41843 (71.79)   88     129   (46.59)   372     362   (-2.68)
16    23587   40546 (71.89)   375    564   (50.40)   1558    1519  (-2.50)
32    22927   39490 (72.24)   1617   2171  (34.26)   6694    5722  (-14.52)
48    23067   39238 (70.10)   3931   5170  (31.51)   15823   13552 (-14.35)
64    22927   38750 (69.01)   7142   9914  (38.81)   28972   26173 (-9.66)
96    22568   38520 (70.68)   16258  27844 (71.26)   65944   73031 (10.74)
_____________________________________________________________________
(All tests were done without any tuning)

From my testing:

1. Single vhost improves mq guest performance upto 16
netperfs but degrades after that.
2. Multiple vhost degrades single netperf guest
performance, but significantly improves performance
for any number of netperf sessions.

Likely cause for the 1 stream degradation with multiple
vhost patch:

1. Two vhosts run handling the RX and TX respectively.
I think the issue is related to cache ping-pong esp
since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The
performance drop is the same, but the reason is the
guest is not using txq[0] in most cases (dev_pick_tx),
so vhost's rx and tx are running on different threads.
But whenever the guest uses txq[0], only one vhost
runs and the performance is similar to original.

I went back to my *submitted* patch and started a guest
with numtxq=16 and pinned every vhost to cpus #01. Now
whether guest used txq[0] or txq[n], the performance is
similar or better (between 10-27% across 10 runs) than
original code. Also, -6% to -24% improvement in SD.

I will start a full test run of original vs submitted
code with minimal tuning (Avi also suggested the same),
and re-send. Please let me know if you need any other
data.

Thanks,

- KK






Re: [RFC PATCH v9 00/16] Provide a zero-copy method on KVM virtio-net.

2010-09-03 Thread Sridhar Samudrala
On Fri, 2010-09-03 at 13:14 +0300, Michael S. Tsirkin wrote:
 On Tue, Aug 10, 2010 at 06:23:24PM -0700, Shirley Ma wrote:
  Hello Xiaohui,
  
  On Fri, 2010-08-06 at 17:23 +0800, xiaohui@intel.com wrote:
   Our goal is to improve the bandwidth and reduce the CPU usage.
   Exact performance data will be provided later. 
  
  Have you had any performance data to share here? I tested my
  experimental macvtap zero copy for TX only. The performance I have seen
  as below without any tuning, (default setting):
  
  Before: netperf 16K message size results with 60 secs run is 7.5Gb/s
  over ixgbe 10GbE card. perf top shows:
  
  2103.00 12.9% copy_user_generic_string
  1541.00  9.4% handle_tx
  1490.00  9.1% _raw_spin_unlock_irqrestore
  1361.00  8.3% _raw_spin_lock_irqsave
  1288.00  7.9% _raw_spin_lock
  924.00  5.7% vhost_worker
  
  After: netperf results with 60 secs run is 8.1Gb/s, perf output:
  
  1093.00  9.9% _raw_spin_unlock_irqrestore
  1048.00  9.5% handle_tx
  934.00  8.5% _raw_spin_lock_irqsave
  864.00  7.9% _raw_spin_lock
  644.00  5.9% vhost_worker
  387.00  3.5% use_mm 
  
  I am still working on collecting more data (latency, cpu
  utilization...). I will let you know once I get all data for macvtap TX
  zero copy. Also I found some vhost performance regression on the new
  kernel with tuning. I used to get 9.4Gb/s, now I couldn't get it.
  
  Shirley
 
 Could you please try disabling mergeable buffers, and see if this gets
 you back where you were?
 -global virtio-net-pci.mrg_rxbuf=off 

I don't think Shirley had mergeable buffers on when she ran these tests. 
The qemu patch to support mergeable buffers with vhost is not yet upstream.

Shirley is on vacation and will be back on Sept 7 and can provide more
detailed performance data and post her patch.

Thanks
Sridhar



Re: [PATCH] cgroups: fix API thinko

2010-08-06 Thread Sridhar Samudrala

On 8/5/2010 3:59 PM, Michael S. Tsirkin wrote:

The cgroup_attach_task_current_cg API that we have upstream is backwards: we
really need an API to attach the current task to the cgroups
of another process A.

In our case (vhost), a privileged user wants to attach its task to the cgroups
of a less privileged one; the API makes us run it in the other
task's context, and this fails.

So let's make the API generic and just pass in 'from' and 'to' tasks.
Add an inline wrapper for cgroup_attach_task_current_cg to avoid
breaking bisect.

Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
---

Paul, Li, Sridhar, could you please review the following
patch?

I only compile-tested it due to travel, but looks
straight-forward to me.
Alex Williamson volunteered to test and report the results.
Sending out now for review as I might be offline for a bit.
Will only try to merge when done, obviously.

If OK, I would like to merge this through -net tree,
together with the patch fixing vhost-net.
Let me know if that sounds ok.

Thanks!

This patch is on top of net-next, it is needed for fix
vhost-net regression in net-next, where a non-priveledged
process can't enable the device anymore:

when qemu uses vhost, inside the ioctl call it
creates a thread, and tries to add
this thread to the groups of current, and it fails.
But we control the thread, so to solve the problem,
we really should tell it 'connect to out cgroups'.
   

So an unprivileged qemu cannot attach vhost thread to its own cgroups.
I guess you are planning to make the cgroup_attach_task_all() call in 
vhost_worker()

to attach itself to the cgroups of qemu. The new API looks fine, but the
name is a little confusing. How about
  task_inherit_cgroups(struct task_struct *from, struct task_struct *to)


What this patch does is add an API for that.

  include/linux/cgroup.h |   11 ++-
  kernel/cgroup.c|9 +
  2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 43b2072..b38ec60 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -525,7 +525,11 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
  void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
  int cgroup_scan_tasks(struct cgroup_scanner *scan);
  int cgroup_attach_task(struct cgroup *, struct task_struct *);
-int cgroup_attach_task_current_cg(struct task_struct *);
+int cgroup_attach_task_all(struct task_struct *from, struct task_struct *);
+static inline int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+   return cgroup_attach_task_all(current, tsk);
+}

  /*
   * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
@@ -583,6 +587,11 @@ static inline int cgroupstats_build(struct cgroupstats 
*stats,
  }

  /* No cgroups - nothing to do */
+static inline int cgroup_attach_task_all(struct task_struct *from,
+struct task_struct *t)
+{
+   return 0;
+}
  static inline int cgroup_attach_task_current_cg(struct task_struct *t)
  {
return 0;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dce8ebc..e6293b8 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1606,10 +1606,11 @@ int cgroup_attach_task(struct cgroup *cgrp, struct 
task_struct *tsk)
  }

  /**
- * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
+ * @from: attach to all cgroups of a given task
   * @tsk: the task to be attached
   */
-int cgroup_attach_task_current_cg(struct task_struct *tsk)
+int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
  {
struct cgroupfs_root *root;
struct cgroup *cur_cg;
@@ -1617,7 +1618,7 @@ int cgroup_attach_task_current_cg(struct task_struct *tsk)

cgroup_lock();
for_each_active_root(root) {
-   cur_cg = task_cgroup_from_root(current, root);
+   cur_cg = task_cgroup_from_root(from, root);
   
Now that we are not operating on current, cur_cg should be renamed as 
from_cg

retval = cgroup_attach_task(cur_cg, tsk);
if (retval)
break;
@@ -1626,7 +1627,7 @@ int cgroup_attach_task_current_cg(struct task_struct *tsk)

return retval;
  }
-EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+EXPORT_SYMBOL_GPL(cgroup_attach_task_all);

  /*
   * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
   





Re: Make vhost multi-threaded and associate each thread to its guest's cgroup

2010-07-27 Thread Sridhar Samudrala
On Tue, 2010-07-27 at 23:42 +0300, Michael S. Tsirkin wrote:
 Sridhar,
 I pushed a patchset with all known issues fixed,
 on my vhost-net-next branch.
 
 For now this ignores the cpu mask issue, addressing
 only the cgroups issue.
 
 Would appreciate testing and reports.

I had to apply the following patch to get it to build.
With this patch, I am seeing similar results to what I saw earlier.

Thanks
Sridhar

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 996e751..8543898 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -23,6 +23,7 @@
 #include <linux/highmem.h>
 #include <linux/slab.h>
 #include <linux/kthread.h>
+#include <linux/cgroup.h>
 
 #include <linux/net.h>
 #include <linux/if_packet.h>
@@ -252,7 +253,7 @@ static long vhost_dev_set_owner(struct vhost_dev *dev)
	}
 
	dev->worker = worker;
-	err = cgroup_attach_task_current_cg(poller);
+	err = cgroup_attach_task_current_cg(worker);
	if (err)
		goto err_cgroup;
	wake_up_process(worker);	/* avoid contributing to loadavg */




Re: [PATCH repost] sched: export sched_set/getaffinity to modules

2010-07-26 Thread Sridhar Samudrala
On Mon, 2010-07-26 at 20:12 +0300, Michael S. Tsirkin wrote:
 On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
  On 07/02, Peter Zijlstra wrote:
  
   On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
   
 Does  it (Tejun's kthread_clone() patch) also  inherit the
cgroup of the caller?
  
   Of course, its a simple do_fork() which inherits everything just as you
   would expect from a similar sys_clone()/sys_fork() call.
  
  Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
  from ioctl(), right?
  
  Then the new thread becomes the natural child of the caller, and it shares
  -mm with the parent. And files, dup_fd() without CLONE_FS.
  
  Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
  TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
  just because the parent gets SIGQUIT or abother coredumpable signal.
  Or the new thread can recieve SIGSTOP via ^Z.
  
  Perhaps this is OK, I do not know. Just to remind that kernel_thread()
  is merely clone(CLONE_VM).
  
  Oleg.
 
 With some machinery to stop it later, yes.
 Oleg, how does the below look to you?
 
 Here I explicitly drop the fds so we don't share them.
 CLONE_VM takes care of sharing the mm I think.
 About signals - for the vhost-net use this is OK as we use
 uninterruptible sleep anyway (like the new kthread_worker does).
 
 This code seems to work fine for me so far - any comments?
 
 ---
 
 diff --git a/include/linux/kthread.h b/include/linux/kthread.h
 index aabc8a1..72c7b17 100644
 --- a/include/linux/kthread.h
 +++ b/include/linux/kthread.h
 @@ -9,6 +9,11 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
  const char namefmt[], ...)
   __attribute__((format(printf, 3, 4)));
 
 +struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
 +void *data,
 +const char namefmt[], ...)
 + __attribute__((format(printf, 3, 4)));
 +
  /**
   * kthread_run - create and wake a thread.
   * @threadfn: the function to run until signal_pending(current).
 diff --git a/kernel/kthread.c b/kernel/kthread.c
 index 83911c7..b81588c 100644
 --- a/kernel/kthread.c
 +++ b/kernel/kthread.c
 @@ -149,6 +149,38 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
  }
  EXPORT_SYMBOL(kthread_create);
 
 +/* Same as kthread_create, but inherit attributes (cgroups, priority,
 + * CPU mask) from current. */
 +struct task_struct *kthread_create_inherit(int (*threadfn)(void *data),
 +					   void *data,
 +					   const char namefmt[],
 +					   ...)
 +{
 +	struct kthread_create_info create;
 +
 +	create.threadfn = threadfn;
 +	create.data = data;
 +	init_completion(&create.done);
 +
 +	create_kthread(&create);
 +	wait_for_completion(&create.done);
 +
 +	if (!IS_ERR(create.result)) {
 +		va_list args;
 +
 +		/* Don't share files with parent as drivers use release for
 +		 * close on exit, etc. */
 +		exit_files(create.result);
 +
 +		va_start(args, namefmt);
 +		vsnprintf(create.result->comm, sizeof(create.result->comm),
 +			  namefmt, args);
 +		va_end(args);
 +	}
 +	return create.result;
 +}
 +EXPORT_SYMBOL(kthread_create_inherit);
 +
  /**
   * kthread_bind - bind a just-created kthread to a cpu.
   * @p: thread created by kthread_create().


I have been testing out a similar patch that uses kernel_thread() without 
CLONE_FILES
flag rather than create_kthread() and then closing the files.
Either version should be fine.

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..634eaf7 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,11 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
   const char namefmt[], ...)
__attribute__((format(printf, 3, 4)));
 
+struct task_struct *kthread_clone(int (*threadfn)(void *data),
+ void *data,
+ const char namefmt[], ...)
+   __attribute__((format(printf, 3, 4)));
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..806dae5 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -149,6 +149,38 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 }
 EXPORT_SYMBOL(kthread_create);
 
+struct task_struct *kthread_clone(int (*threadfn)(void *data),
+ void *data,
+ const char namefmt[],
+ ...)
+{
+   struct kthread_create_info create;
+   int

Re: Virtio network performance poorer than emulated e1000

2010-07-22 Thread Sridhar Samudrala
On Thu, 2010-07-22 at 19:53 -0400, Balachandar wrote:
 I am resending this email as Freddie didn't use 'reply to all' when
 replying to this message. I am also updating to answer Freddie's
 questions..
 
 I can see that virtio network performance is poorer than emaulated
 e1000 nic. I did some simple ping test and with emulated e1000 the
 average rtt is around 600 microsec. With virtio the average rtt is 800
 microsec. I am using a tap + bridge configuration. I run kvm as
 follows
 
 kvm -m 512 -hda vdisk.img \
 -net nic,model=virtio \
 -net tap,ifname=tap0,script=qemu-ifup,downscript=no

With newer qemu-kvm, you need to use -netdev tap to enable checksum/gso
offloads. 

Try
-net nic,model=virtio,netdev=tap0
-netdev tap,ifname=tap0,id=tap0,vhost=on,script=qemu-ifup,downscript=no

Thanks
Sridhar
 
 I am running Debian squeeze distribution with guest and host kernel 2.6.34.
 
 Does anyone else see some results like this or is it only me? Could
 changing the distribution help as i am running a testing one?
 
 Actually we are having a custom application which just measures
 ping-pong latency. The results that i get with virtio is around 750
 micro seconds. The result i get with emulated e1000 is around 250
 micro sec. Also i tried to use vhost but the virtio latency remained
 the same.
 
 Also i tried the tests with guest and host 2.6.26 kernel. I get better
 results for virtio than e1000. I get 550 for e1000 and 500 for virtio.
 
 Actually my application needs as low a latency as possible. I am ready
 to trade-off throughput and cpu utilization. I was very excited when i
 saw the vhost-net module with people claiming low latencies. I am
 worried that i am missing some performance offered by virtio and
 vhost.
 
 
 Thanks,
 Bala



Re: [PATCH] vhost-net: avoid flush under lock

2010-07-15 Thread Sridhar Samudrala
On Thu, 2010-07-15 at 15:19 +0300, Michael S. Tsirkin wrote:
 We flush under vq mutex when changing backends.
 This creates a deadlock as workqueue being flushed
 needs this lock as well.
 
 https://bugzilla.redhat.com/show_bug.cgi?id=612421
 
 Drop the vq mutex before flush: we have the device mutex
 which is sufficient to prevent another ioctl from touching
 the vq.

Why do we need to flush the vq when trying to set the backend and
we find that it is already set. Is this just an optimization?

Thanks
Sridhar
 
 Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
 ---
  drivers/vhost/net.c |5 +
  1 files changed, 5 insertions(+), 0 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 28d7786..50df58e6 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -534,11 +534,16 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 		rcu_assign_pointer(vq->private_data, sock);
 		vhost_net_enable_vq(n, vq);
  done:
 +	mutex_unlock(&vq->mutex);
 +
 	if (oldsock) {
 		vhost_net_flush_vq(n, index);
 		fput(oldsock->file);
 	}
 
 +	mutex_unlock(&n->dev.mutex);
 +	return 0;
 +
  err_vq:
 	mutex_unlock(&vq->mutex);
  err:



Re: [PATCH repost] sched: export sched_set/getaffinity to modules

2010-07-14 Thread Sridhar Samudrala
On Tue, 2010-07-13 at 14:09 +0300, Michael S. Tsirkin wrote: 
 On Mon, Jul 12, 2010 at 11:59:08PM -0700, Sridhar Samudrala wrote:
  On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote:
  On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
  On 07/02, Peter Zijlstra wrote:
  On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
Does  it (Tejun's kthread_clone() patch) also  inherit the
  cgroup of the caller?
  Of course, its a simple do_fork() which inherits everything just as you
  would expect from a similar sys_clone()/sys_fork() call.
  Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
  from ioctl(), right?
  
  Then the new thread becomes the natural child of the caller, and it shares
  -mm with the parent. And files, dup_fd() without CLONE_FS.
  
  Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
  TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
  just because the parent gets SIGQUIT or abother coredumpable signal.
  Or the new thread can recieve SIGSTOP via ^Z.
  
  Perhaps this is OK, I do not know. Just to remind that kernel_thread()
  is merely clone(CLONE_VM).
  
  Oleg.
  
  Right. Doing this might break things like flush.  The signal and exit
  behaviour needs to be examined carefully. I am also unsure whether
  using such threads might be more expensive than inheriting kthreadd.
  
  Should we just leave it to the userspace to set the cgroup/cpumask
  after qemu starts the guest and
  the vhost threads?
  
  Thanks
  Sridhar
 
 Yes but we can't trust userspace to do this. It's important
 to do it on thread creation: if we don't, malicious userspace
 can create large amount of work exceeding the cgroup limits.
 
 And the same applies so the affinity: if the qemu process
 is limited to a set of CPUs, it's important to make
 the kernel thread that does work our behalf limited to the same
 set of CPUs.
 
 This is not unique to vhost, it's just that virt scenarious are affected
 by this more: people seem to run untrusted applications and expect the
 damage to be contained.

OK. So we want to create a thread that is a child of kthreadd, but inherits
the cgroup/cpumask from the caller. How about an exported kthread function
kthread_create_in_current_cg() that does this?

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index aabc8a1..e0616f0 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -9,6 +9,9 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
   const char namefmt[], ...)
__attribute__((format(printf, 3, 4)));
 
+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+void *data, char *name);
+
 /**
  * kthread_run - create and wake a thread.
  * @threadfn: the function to run until signal_pending(current).
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..ea4e737 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <trace/events/sched.h>
+#include <linux/cgroup.h>
 
 static DEFINE_SPINLOCK(kthread_create_lock);
 static LIST_HEAD(kthread_create_list);
@@ -149,6 +150,42 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
 }
 EXPORT_SYMBOL(kthread_create);
 
+struct task_struct *kthread_create_in_current_cg(int (*threadfn)(void *data),
+				void *data, char *name)
+{
+   struct task_struct *worker;
+   cpumask_var_t mask;
+   int ret = -ENOMEM;
+
+   if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+       goto out_free_mask;
+
+   worker = kthread_create(threadfn, data, "%s-%d", name, current->pid);
+   if (IS_ERR(worker))
+       goto out_free_mask;
+
+   ret = sched_getaffinity(current->pid, mask);
+   if (ret)
+       goto out_stop_worker;
+
+   ret = sched_setaffinity(worker->pid, mask);
+   if (ret)
+       goto out_stop_worker;
+
+   ret = cgroup_attach_task_current_cg(worker);
+   if (ret)
+       goto out_stop_worker;
+
+   return worker;
+
+out_stop_worker:
+   kthread_stop(worker);
+out_free_mask:
+   free_cpumask_var(mask);
+   return ERR_PTR(ret);
+}
+EXPORT_SYMBOL(kthread_create_in_current_cg);
+
 /**
  * kthread_bind - bind a just-created kthread to a cpu.
  * @p: thread created by kthread_create().
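For illustration, the affinity half of the helper above has a direct userspace analogue in Python's `os.sched_getaffinity`/`os.sched_setaffinity` (Linux-only; pid 0 means the calling task). The function name below is hypothetical, a sketch only, not part of the patch:

```python
import os

def inherit_affinity(pid):
    """Apply the calling task's CPU affinity mask to task `pid` --
    the userspace analogue of the sched_getaffinity()/sched_setaffinity()
    pair used by kthread_create_in_current_cg() (pid 0 == calling task)."""
    mask = os.sched_getaffinity(0)   # caller's cpumask
    os.sched_setaffinity(pid, mask)  # propagate it to the target task
    return mask
```

Calling `inherit_affinity(0)` is a no-op that re-applies the caller's own mask, which is handy for testing the round-trip.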


Thanks
Sridhar

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH repost] sched: export sched_set/getaffinity to modules

2010-07-14 Thread Sridhar Samudrala

On 7/14/2010 5:05 PM, Oleg Nesterov wrote:

On 07/14, Sridhar Samudrala wrote:
   

OK. So we want to create a thread that is a child of kthreadd, but inherits the 
cgroup/cpumask
from the caller. How about an exported kthread function 
kthread_create_in_current_cg()
that does this?
 

Well. I must admit, this looks a bit strange to me ;)

Instead of exporting sched_xxxaffinity() we export the new function
which calls them. And I don't think this new helper is very useful
in general. May be I am wrong...
   
If we agree on exporting sched_xxxaffinity() functions, we don't need
this new kthread function and we can do the same in vhost as the original
patch did.

Thanks
Sridhar



Re: [PATCH repost] sched: export sched_set/getaffinity to modules

2010-07-13 Thread Sridhar Samudrala

On 7/4/2010 2:00 AM, Michael S. Tsirkin wrote:

On Fri, Jul 02, 2010 at 11:06:37PM +0200, Oleg Nesterov wrote:
   

On 07/02, Peter Zijlstra wrote:
 

On Fri, 2010-07-02 at 11:01 -0700, Sridhar Samudrala wrote:
   

  Does  it (Tejun's kthread_clone() patch) also  inherit the
cgroup of the caller?
 

Of course, its a simple do_fork() which inherits everything just as you
would expect from a similar sys_clone()/sys_fork() call.
   

Yes. And I'm afraid it can inherit more than we want. IIUC, this is called
from ioctl(), right?

Then the new thread becomes the natural child of the caller, and it shares
->mm with the parent. And files, dup_fd() without CLONE_FS.

Signals. Say, if you send SIGKILL to this new thread, it can't sleep in
TASK_INTERRUPTIBLE or KILLABLE after that. And this SIGKILL can be sent
just because the parent gets SIGQUIT or another coredumpable signal.
Or the new thread can receive SIGSTOP via ^Z.

Perhaps this is OK, I do not know. Just to remind that kernel_thread()
is merely clone(CLONE_VM).

Oleg.
 


Right. Doing this might break things like flush.  The signal and exit
behaviour needs to be examined carefully. I am also unsure whether
using such threads might be more expensive than inheriting kthreadd.

   
Should we just leave it to the userspace to set the cgroup/cpumask after
qemu starts the guest and the vhost threads?

Thanks
Sridhar




Re: [PATCH repost] sched: export sched_set/getaffinity to modules

2010-07-02 Thread Sridhar Samudrala

On 7/1/2010 7:55 AM, Peter Zijlstra wrote:

On Thu, 2010-07-01 at 16:53 +0200, Tejun Heo wrote:
   

Hello,

On 07/01/2010 04:46 PM, Oleg Nesterov wrote:
 

It might be a good idea to make the function take extra clone flags
but anyways once created cloned task can be treated the same way as
other kthreads, so nothing else needs to be changed.
 

This makes kthread_stop() work. Otherwise the new thread is just
the CLONE_VM child of the caller, and the caller is the user-mode
task doing ioctl() ?
   

Hmmm, indeed.  It makes the attribute inheritance work but circumvents
the whole reason there is kthreadd.
 

I thought the whole reason there was threadd was to avoid the
inheritance? So avoiding the avoiding of inheritance seems like the goal
here, no?
   
I think so. Does it (Tejun's kthread_clone() patch) also inherit the
cgroup of the caller? Or do we still need the explicit call to attach
the thread to the current task's cgroup?

I am on vacation next week and cannot look into this until Jul 12. Hope
this will be resolved by then.

If not, i will look into after i am back.

Thanks
Sridhar



Re: [PATCH] vhost: break out of polling loop on error

2010-06-28 Thread Sridhar Samudrala
On Sun, 2010-06-27 at 11:59 +0300, Michael S. Tsirkin wrote:
 When ring parsing fails, we currently handle this
 as ring empty condition. This means that we enable
 kicks and recheck ring empty: if this not empty,
 we re-start polling which of course will fail again.
 
 Instead, let's return a negative error code and stop polling.

One minor comment on error return below. With that change,

Acked-by: Sridhar Samudrala s...@us.ibm.com

 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 Dave, I'm sending this out so it can get reviewed.
 I'll put this on my vhost tree
 so no need for you to pick this patch directly.
 
  drivers/vhost/net.c   |   12 ++--
  drivers/vhost/vhost.c |   33 +
  drivers/vhost/vhost.h |8 
  3 files changed, 31 insertions(+), 22 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 0f41c91..54096ee 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -98,7 +98,8 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
  static void handle_tx(struct vhost_net *net)
  {
   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
 - unsigned head, out, in, s;
 + unsigned out, in, s;
 + int head;
   struct msghdr msg = {
   .msg_name = NULL,
   .msg_namelen = 0,
 @@ -135,6 +136,9 @@ static void handle_tx(struct vhost_net *net)
ARRAY_SIZE(vq->iov),
out, in,
NULL, NULL);
 + /* On error, stop handling until the next kick. */
 + if (head < 0)
 + break;
   /* Nothing new?  Wait for eventfd to tell us they refilled. */
   if (head == vq->num) {
   wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 @@ -192,7 +196,8 @@ static void handle_tx(struct vhost_net *net)
  static void handle_rx(struct vhost_net *net)
  {
   struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
 - unsigned head, out, in, log, s;
 + unsigned out, in, log, s;
 + int head;
   struct vhost_log *vq_log;
   struct msghdr msg = {
   .msg_name = NULL,
 @@ -228,6 +233,9 @@ static void handle_rx(struct vhost_net *net)
ARRAY_SIZE(vq->iov),
out, in,
vq_log, log);
 + /* On error, stop handling until the next kick. */
 + if (head < 0)
 + break;
   /* OK, now we need to know about added descriptors. */
   if (head == vq->num) {
   if (unlikely(vhost_enable_notify(vq))) {
 diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
 index 3b83382..5ccd384 100644
 --- a/drivers/vhost/vhost.c
 +++ b/drivers/vhost/vhost.c
 @@ -873,12 +873,13 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
   * number of output then some number of input descriptors, it's actually two
   * iovecs, but we pack them into one and note how many of each there were.
   *
 - * This function returns the descriptor number found, or vq->num (which
 - * is never a valid descriptor number) if none was found. */
 -unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 -struct iovec iov[], unsigned int iov_size,
 -unsigned int *out_num, unsigned int *in_num,
 -struct vhost_log *log, unsigned int *log_num)
 + * This function returns the descriptor number found, or vq->num (which is
 + * never a valid descriptor number) if none was found.  A negative code is
 + * returned on error. */
 +int vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
 +   struct iovec iov[], unsigned int iov_size,
 +   unsigned int *out_num, unsigned int *in_num,
 +   struct vhost_log *log, unsigned int *log_num)
  {
   struct vring_desc desc;
   unsigned int i, head, found = 0;
 @@ -890,13 +891,13 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
   if (get_user(vq->avail_idx, &vq->avail->idx)) {
   vq_err(vq, "Failed to access avail idx at %p\n",
  &vq->avail->idx);
 - return vq->num;
 + return -EFAULT;
   }
 
   if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
   vq_err(vq, "Guest moved used index from %u to %u",
  last_avail_idx, vq->avail_idx);
 - return vq->num;
 + return -EFAULT;

This should be -EINVAL
   }
 
   /* If there's nothing new since last we looked, return invalid. */
 @@ -912,14 +913,14 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
   vq_err(vq, "Failed to read head: idx %d address %p\n

Re: [PATCHv2] vhost-net: add dhclient work-around from userspace

2010-06-28 Thread Sridhar Samudrala
On Mon, 2010-06-28 at 13:08 +0300, Michael S. Tsirkin wrote:
 Userspace virtio server has the following hack
 so guests rely on it, and we have to replicate it, too:
 
 Use port number to detect incoming IPv4 DHCP response packets,
 and fill in the checksum for these.
 
 The issue we are solving is that on linux guests, some apps
 that use recvmsg with AF_PACKET sockets, don't know how to
 handle CHECKSUM_PARTIAL;
 The interface to return the relevant information was added
 in 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
 and older userspace does not use it.
 One important user of recvmsg with AF_PACKET is dhclient,
 so we add a work-around just for DHCP.
 
 Don't bother applying the hack to IPv6 as userspace virtio does not
 have a work-around for that - let's hope guests will do the right
 thing wrt IPv6.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 Dave, I'm going to put this patch on the vhost tree,
 no need for you to bother merging it - you'll get
 it with a pull request.
 
 
  drivers/vhost/net.c |   44 +++-
  1 files changed, 43 insertions(+), 1 deletions(-)
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index cc19595..03bba6a 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -24,6 +24,10 @@
  #include <linux/if_tun.h>
  #include <linux/if_macvlan.h>
 
 +#include <linux/ip.h>
 +#include <linux/udp.h>
 +#include <linux/netdevice.h>
 +
  #include <net/sock.h>
 
  #include "vhost.h"
 @@ -186,6 +190,44 @@ static void handle_tx(struct vhost_net *net)
   unuse_mm(net->dev.mm);
  }
 
 +static int peek_head(struct sock *sk)

This routine is doing more than just peeking the head of sk's receive
queue. Maybe this should be named similar to what qemu calls
'work_around_broken_dhclient()'.
 +{
 + struct sk_buff *skb;
 +
 + lock_sock(sk);
 + skb = skb_peek(&sk->sk_receive_queue);
 + if (unlikely(!skb)) {
 + release_sock(sk);
 + return 0;
 + }
 + /* Userspace virtio server has the following hack so
 +  * guests rely on it, and we have to replicate it, too: */
 + /* Use port number to detect incoming IPv4 DHCP response packets,
 +  * and fill in the checksum. */
 +
 + /* The issue we are solving is that on linux guests, some apps
 +  * that use recvmsg with AF_PACKET sockets, don't know how to
 +  * handle CHECKSUM_PARTIAL;
 +  * The interface to return the relevant information was added in
 +  * 8dc4194474159660d7f37c495e3fc3f10d0db8cc,
 +  * and older userspace does not use it.
 +  * One important user of recvmsg with AF_PACKET is dhclient,
 +  * so we add a work-around just for DHCP. */
 + if (skb->ip_summed == CHECKSUM_PARTIAL &&
 + skb_headlen(skb) >= skb_transport_offset(skb) +
 + sizeof(struct udphdr) &&
 + udp_hdr(skb)->dest == htons(68) &&
 + skb_network_header_len(skb) >= sizeof(struct iphdr) &&
 + ip_hdr(skb)->protocol == IPPROTO_UDP &&
 + skb->protocol == htons(ETH_P_IP)) {

Isn't it more logical to check for skb->protocol, followed by ip_hdr and
then udp_hdr?

 + skb_checksum_help(skb);
 + /* Restore ip_summed value: tun passes it to user. */
 + skb->ip_summed = CHECKSUM_PARTIAL;
 + }
 + release_sock(sk);
 + return 1;
 +}
 +
  /* Expects to be always run from workqueue - which acts as
   * read-size critical section for our kind of RCU. */
  static void handle_rx(struct vhost_net *net)
 @@ -222,7 +264,7 @@ static void handle_rx(struct vhost_net *net)
   vq_log = unlikely(vhost_has_feature(net-dev, VHOST_F_LOG_ALL)) ?
   vq-log : NULL;
 
 - for (;;) {
 + while (peek_head(sock-sk)) {
   head = vhost_get_vq_desc(net-dev, vq, vq-iov,
ARRAY_SIZE(vq-iov),
out, in,



Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost pollers

2010-06-24 Thread Sridhar Samudrala
On Thu, 2010-06-24 at 11:11 +0300, Michael S. Tsirkin wrote:
 On Sun, May 30, 2010 at 10:25:01PM +0200, Tejun Heo wrote:
  Apply the cpumask and cgroup of the initializing task to the created
  vhost poller.
  
  Based on Sridhar Samudrala's patch.
  
  Cc: Michael S. Tsirkin m...@redhat.com
  Cc: Sridhar Samudrala samudrala.srid...@gmail.com
 
 
 I wanted to apply this, but modpost fails:
 ERROR: sched_setaffinity [drivers/vhost/vhost_net.ko] undefined!
 ERROR: sched_getaffinity [drivers/vhost/vhost_net.ko] undefined!
 
 Did you try building as a module?

In my original implementation, i had these calls in workqueue.c.
Now that these are moved to vhost.c which can be built as a module,
these symbols need to be exported.
The following patch fixes the build issue with vhost as a module.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/kernel/sched.c b/kernel/sched.c
index 3c2a54f..15a0c6f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4837,6 +4837,7 @@ out_put_task:
put_online_cpus();
return retval;
 }
+EXPORT_SYMBOL_GPL(sched_setaffinity);
 
 static int get_user_cpu_mask(unsigned long __user *user_mask_ptr,
			      unsigned len, struct cpumask *new_mask)
@@ -4900,6 +4901,7 @@ out_unlock:
 
return retval;
 }
+EXPORT_SYMBOL_GPL(sched_getaffinity);
 
 /**
  * sys_sched_getaffinity - get the cpu affinity of a process


  ---
   drivers/vhost/vhost.c |   36 +++-
   1 file changed, 31 insertions(+), 5 deletions(-)
  
  Index: work/drivers/vhost/vhost.c
  ===
  --- work.orig/drivers/vhost/vhost.c
  +++ work/drivers/vhost/vhost.c
  @@ -23,6 +23,7 @@
    #include <linux/highmem.h>
    #include <linux/slab.h>
    #include <linux/kthread.h>
   +#include <linux/cgroup.h>
   
    #include <linux/net.h>
    #include <linux/if_packet.h>
  @@ -176,12 +177,30 @@ repeat:
   long vhost_dev_init(struct vhost_dev *dev,
  struct vhost_virtqueue *vqs, int nvqs)
   {
  -   struct task_struct *poller;
  -   int i;
  +   struct task_struct *poller = NULL;
  +   cpumask_var_t mask;
  +   int i, ret = -ENOMEM;
  +
   +   if (!alloc_cpumask_var(&mask, GFP_KERNEL))
   +   goto out;
   
   poller = kthread_create(vhost_poller, dev, "vhost-%d", current->pid);
   -   if (IS_ERR(poller))
   -   return PTR_ERR(poller);
   +   if (IS_ERR(poller)) {
   +   ret = PTR_ERR(poller);
   +   goto out;
   +   }
   +
   +   ret = sched_getaffinity(current->pid, mask);
   +   if (ret)
   +   goto out;
   +
   +   ret = sched_setaffinity(poller->pid, mask);
   +   if (ret)
   +   goto out;
   +
   +   ret = cgroup_attach_task_current_cg(poller);
   +   if (ret)
   +   goto out;
   
   dev->vqs = vqs;
   dev->nvqs = nvqs;
   @@ -202,7 +221,14 @@ long vhost_dev_init(struct vhost_dev *de
   vhost_poll_init(&dev->vqs[i].poll,
   dev->vqs[i].handle_kick, POLLIN, dev);
   }
  -   return 0;
  +
  +   wake_up_process(poller);/* avoid contributing to loadavg */
  +   ret = 0;
  +out:
  +   if (ret)
  +   kthread_stop(poller);
  +   free_cpumask_var(mask);
  +   return ret;
   }
  
   /* Caller should have device mutex */



[PATCH] Fix bugs in msix_set/unset_mask_notifier() routines

2010-06-02 Thread Sridhar Samudrala

I am hitting the following assertions in msix.c when doing a guest reboot
or live migration using vhost:
qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r >= 0' failed.
qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier: Assertion
`dev->msix_mask_notifier_opaque[vector]' failed.

The following patch fixes the bugs in handling msix_is_masked() condition
in msix_set/unset_mask_notifier() routines.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com


diff --git a/hw/msix.c b/hw/msix.c
index 1398680..a191df1 100644
--- a/hw/msix.c
+++ b/hw/msix.c
@@ -609,7 +609,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)
 
 int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
 {
-int r;
+int r = 0;
 if (vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])
 return 0;
 
@@ -619,13 +619,15 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
 
 /* Unmask the new notifier unless vector is masked. */
 if (msix_is_masked(dev, vector)) {
-return 0;
+goto out;
 }
 r = dev->msix_mask_notifier(dev, vector, opaque,
 msix_is_masked(dev, vector));
 if (r < 0) {
 return r;
 }
+
+out:
 dev->msix_mask_notifier_opaque[vector] = opaque;
 return r;
 }
@@ -640,8 +642,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
 assert(dev->msix_mask_notifier_opaque[vector]);
 
 /* Mask the old notifier unless it is already masked. */
-if (msix_is_masked(dev, vector)) {
-return 0;
+if (!msix_is_masked(dev, vector)) {
+goto out;
 }
 r = dev->msix_mask_notifier(dev, vector,
 dev->msix_mask_notifier_opaque[vector],
@@ -649,6 +651,8 @@ int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
 if (r < 0) {
 return r;
 }
+
+out:
 dev->msix_mask_notifier_opaque[vector] = NULL;
 return r;
 }





Re: [PATCH] msix: fix msix_set/unset_mask_notifier

2010-06-02 Thread Sridhar Samudrala
On Wed, 2010-06-02 at 20:49 +0300, Michael S. Tsirkin wrote:
 Sridhar Samudrala reported hitting the following assertions
 in msix.c when doing a guest reboot or live migration using vhost.
 qemu-kvm/hw/msix.c:375: msix_mask_all: Assertion `r >= 0' failed.
 qemu-kvm/hw/msix.c:640: msix_unset_mask_notifier:
 Assertion `dev->msix_mask_notifier_opaque[vector]' failed.
 
 The issue is that we didn't clear/set the opaque pointer
 when vector is masked. The following patch fixes this.
 
 Signed-off-by: Sridhar Samudrala s...@us.ibm.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 Sridhar, could you test the following please?

Yes. looks good and works fine for me.
Tested-by: Sridhar Samudrala s...@us.ibm.com

 
  hw/msix.c |   33 -
  1 files changed, 16 insertions(+), 17 deletions(-)
 
 diff --git a/hw/msix.c b/hw/msix.c
 index b400769..24ff6ae 100644
 --- a/hw/msix.c
 +++ b/hw/msix.c
 @@ -610,7 +610,7 @@ void msix_unuse_all_vectors(PCIDevice *dev)
 
  int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
  {
 -int r;
 +int r = 0;
  if (vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])
  return 0;
 
 @@ -619,13 +619,11 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
  assert(!dev->msix_mask_notifier_opaque[vector]);
 
  /* Unmask the new notifier unless vector is masked. */
 -if (msix_is_masked(dev, vector)) {
 -return 0;
 -}
 -r = dev->msix_mask_notifier(dev, vector, opaque,
 -msix_is_masked(dev, vector));
 -if (r < 0) {
 -return r;
 +if (!msix_is_masked(dev, vector)) {
 +r = dev->msix_mask_notifier(dev, vector, opaque, false);
 +if (r < 0) {
 +return r;
 +}
  }
  dev->msix_mask_notifier_opaque[vector] = opaque;
  return r;
 @@ -634,21 +632,21 @@ int msix_set_mask_notifier(PCIDevice *dev, unsigned vector, void *opaque)
  int msix_unset_mask_notifier(PCIDevice *dev, unsigned vector)
  {
  int r = 0;
 +void *opaque;
  if (vector >= dev->msix_entries_nr || !dev->msix_entry_used[vector])
  return 0;
 
 +opaque = dev->msix_mask_notifier_opaque[vector];
 +
  assert(dev->msix_mask_notifier);
 -assert(dev->msix_mask_notifier_opaque[vector]);
 +assert(opaque);
 
  /* Mask the old notifier unless it is already masked. */
 -if (msix_is_masked(dev, vector)) {
 -return 0;
 -}
 -r = dev->msix_mask_notifier(dev, vector,
 -dev->msix_mask_notifier_opaque[vector],
 -!msix_is_masked(dev, vector));
 -if (r < 0) {
 -return r;
 +if (!msix_is_masked(dev, vector)) {
 +r = dev->msix_mask_notifier(dev, vector, opaque, true);
 +if (r < 0) {
 +return r;
 +}
  }
  dev->msix_mask_notifier_opaque[vector] = NULL;
  return r;
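The invariant this fix restores can be modeled in a few lines: the opaque pointer is stored and cleared unconditionally, while the notifier callback fires only when the vector is unmasked. A toy userspace sketch of that state machine (names hypothetical, not qemu code):

```python
class MaskNotifierModel:
    """Toy model of the fixed msix_set/unset_mask_notifier logic:
    the opaque pointer is recorded/cleared unconditionally, and the
    notifier callback runs only while the vector is unmasked."""
    def __init__(self):
        self.masked = {}   # vector -> bool
        self.opaque = {}   # vector -> stored opaque pointer
        self.calls = []    # log of notifier invocations

    def set_notifier(self, vector, opaque):
        assert vector not in self.opaque        # mirrors the qemu assert
        if not self.masked.get(vector, False):
            self.calls.append(("notify", vector, opaque, False))
        self.opaque[vector] = opaque            # always record, masked or not

    def unset_notifier(self, vector):
        opaque = self.opaque[vector]            # must have been set earlier
        if not self.masked.get(vector, False):
            self.calls.append(("notify", vector, opaque, True))
        del self.opaque[vector]                 # always clear
```

With this ordering, a set/unset cycle on a masked vector makes no callbacks and still leaves the opaque table consistent, which is exactly the case the old code got wrong.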



Re: [PATCH 3/3] vhost: apply cpumask and cgroup to vhost workers

2010-06-01 Thread Sridhar Samudrala
On Tue, 2010-06-01 at 11:35 +0200, Tejun Heo wrote:
 Apply the cpumask and cgroup of the initializing task to the created
 vhost worker.
 
 Based on Sridhar Samudrala's patch.  Li Zefan spotted a bug in error
 path (twice), fixed (twice).
 
 Signed-off-by: Tejun Heo t...@kernel.org
 Cc: Michael S. Tsirkin m...@redhat.com
 Cc: Sridhar Samudrala samudrala.srid...@gmail.com
 Cc: Li Zefan l...@cn.fujitsu.com
 ---
  drivers/vhost/vhost.c |   34 ++
  1 file changed, 30 insertions(+), 4 deletions(-)
 
 Index: work/drivers/vhost/vhost.c
 ===
 --- work.orig/drivers/vhost/vhost.c
 +++ work/drivers/vhost/vhost.c
 @@ -23,6 +23,7 @@
  #include <linux/highmem.h>
  #include <linux/slab.h>
  #include <linux/kthread.h>
 +#include <linux/cgroup.h>
 
  #include <linux/net.h>
  #include <linux/if_packet.h>
 @@ -187,11 +188,29 @@ long vhost_dev_init(struct vhost_dev *de
   struct vhost_virtqueue *vqs, int nvqs)
  {
   struct task_struct *worker;
 - int i;
 + cpumask_var_t mask;
 + int i, ret = -ENOMEM;
 +
 + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
 + goto out_free_mask;

I think this is another bug in the error path. You should simply
do a return instead of a goto here when alloc_cpumask_var fails.

Thanks
Sridhar
 
   worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
 - if (IS_ERR(worker))
 - return PTR_ERR(worker);
 + if (IS_ERR(worker)) {
 + ret = PTR_ERR(worker);
 + goto out_free_mask;
 + }
 +
 + ret = sched_getaffinity(current->pid, mask);
 + if (ret)
 + goto out_stop_worker;
 +
 + ret = sched_setaffinity(worker->pid, mask);
 + if (ret)
 + goto out_stop_worker;
 +
 + ret = cgroup_attach_task_current_cg(worker);
 + if (ret)
 + goto out_stop_worker;
 
   dev->vqs = vqs;
   dev->nvqs = nvqs;
 @@ -214,7 +233,14 @@ long vhost_dev_init(struct vhost_dev *de
   }
 
   wake_up_process(worker);/* avoid contributing to loadavg */
 - return 0;
 + ret = 0;
 + goto out_free_mask;
 +
 +out_stop_worker:
 + kthread_stop(worker);
 +out_free_mask:
 + free_cpumask_var(mask);
 + return ret;
  }
 
  /* Caller should have device mutex */



Re: [PATCH 2/3] workqueue: Add an API to create a singlethread workqueue attached to the current task's cgroup

2010-05-27 Thread Sridhar Samudrala
On Thu, 2010-05-27 at 14:44 +0200, Oleg Nesterov wrote:
 On 05/27, Michael S. Tsirkin wrote:
 
  On Tue, May 18, 2010 at 05:04:51PM -0700, Sridhar Samudrala wrote:
   Add a new kernel API to create a singlethread workqueue and attach it's
   task to current task's cgroup and cpumask.
  
   Signed-off-by: Sridhar Samudrala s...@us.ibm.com
 
  Could someone familiar with workqueue code please comment on whether
  this patch is suitable for 2.6.35?
 
  It is needed to fix the case where vhost user might cause a kernel
  thread to consume more CPU than allowed by the cgroup.
  Should I merge it through the vhost tree?
  Ack for this?
 
 I don't understand the reasons for this patch, but this doesn't matter.
 
 I don't really see any need to change workqueue.c,
 
    +static struct task_struct *get_singlethread_wq_task(struct workqueue_struct *wq)
    +{
    + return (per_cpu_ptr(wq->cpu_wq, singlethread_cpu))->thread;
    +}
 
 (Not sure this trivial static helper with the single caller makes sense, but
  see below)
 
    +/* Create a singlethread workqueue and attach its task to the current
    + * task's cgroup and set its cpumask to the current task's cpumask.
    + */
    +struct workqueue_struct *create_singlethread_workqueue_in_current_cg(char *name)
   +{
   + struct workqueue_struct *wq;
   + struct task_struct *task;
   + cpumask_var_t mask;
   +
   + wq = create_singlethread_workqueue(name);
   + if (!wq)
   + goto out;
   +
    + if (!alloc_cpumask_var(&mask, GFP_KERNEL))
    + goto err;
    +
    + if (sched_getaffinity(current->pid, mask))
    + goto err;
    +
    + task = get_singlethread_wq_task(wq);
    + if (sched_setaffinity(task->pid, mask))
   + goto err;
   +
   + if (cgroup_attach_task_current_cg(task))
   + goto err;
   +out: 
   + return wq;
   +err:
   + destroy_workqueue(wq);
   + wq = NULL;
   + goto out;
   +}
 
 Instead, cgroup.c (or whoever needs this) can do
 
   struct move_struct {
   struct work_struct work;
   int ret;
   };
 
   static void move_func(struct work_struct *work)
   {
   struct move_struct *move = container_of(...);
 
   if (cgroup_attach_task_current_cg(current))

We are trying to attach the task associated with the workqueue to the
current task's cgroup. So what we need is
   cgroup_attach_task_current_cg(wq->task);

However there is no interface currently that exports the task associated
with a workqueue. It is hidden in cpu_workqueue_struct and is only 
accessible within workqueue.c.


   ret = -EANY;
   }
 
   static struct workqueue_struct 
 *create_singlethread_workqueue_in_current_cg(char *name)
   {
   struct workqueue_struct *wq;
   struct move_struct move = {
   .work = __WORK_INITIALIZER(move_func);
   };
 
   wq = create_singlethread_workqueue(name);
   if (!wq)
   return NULL;
 
    queue_work(wq, &move.work);
    flush_work(&move.work);
 
   if (move.ret) {
   destroy_workqueue(wq);
   wq = NULL;
   }
 
   return wq;
   }
 
 Or. Just export wq_per_cpu() from workqueue.c (probably with a better name) 
 and
 use it like the patch does.
This requires that struct cpu_workqueue_struct and struct
workqueue_struct are made externally visible by moving them to
kernel/workqueue.h.

Instead what about adding the simple helper get_singlethread_wq_task()
in workqueue.c and exporting it.
I can add create_singlethread_workqueue_in_current_cg() to cgroup.c
using this helper routine.

Thanks
Sridhar




Re: [PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup

2010-05-21 Thread Sridhar Samudrala

On 5/20/2010 3:22 PM, Paul Menage wrote:

On Tue, May 18, 2010 at 5:04 PM, Sridhar Samudrala
samudrala.srid...@gmail.com  wrote:
   

Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudralas...@us.ibm.com
 

Reviewed-by: Paul Menagemen...@google.com

It would be more efficient to just attach directly to current-cgroups
rather than potentially creating/destroying one css_set for each
hierarchy until we've completely converged on current-cgroups - but
that would require a bunch of refactoring of the guts of
cgroup_attach_task() to ensure that the right can_attach()/attach()
callbacks are made. That doesn't really seem worthwhile right now for
the initial use, that I imagine isn't going to be
performance-sensitive.
   
Yes. In our use-case, this will be called only once per guest interface
when the guest comes up. Hope you or someone more familiar with the
cgroups subsystem can optimize this function later.


Thanks
Sridhar



[PATCH 1/3] cgroups: Add an API to attach a task to current task's cgroup

2010-05-18 Thread Sridhar Samudrala
Add a new kernel API to attach a task to current task's cgroup
in all the active hierarchies.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -570,6 +570,7 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
 void cgroup_iter_end(struct cgroup *cgrp, struct cgroup_iter *it);
 int cgroup_scan_tasks(struct cgroup_scanner *scan);
 int cgroup_attach_task(struct cgroup *, struct task_struct *);
+int cgroup_attach_task_current_cg(struct task_struct *);
 
 /*
  * CSS ID is ID for cgroup_subsys_state structs under subsys. This only works
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6d870f2..6cfeb06 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1788,6 +1788,29 @@ out:
return retval;
 }
 
+/**
+ * cgroup_attach_task_current_cg - attach task 'tsk' to current task's cgroup
+ * @tsk: the task to be attached
+ */
+int cgroup_attach_task_current_cg(struct task_struct *tsk)
+{
+   struct cgroupfs_root *root;
+   struct cgroup *cur_cg;
+   int retval = 0;
+
+   cgroup_lock();
+   for_each_active_root(root) {
+   cur_cg = task_cgroup_from_root(current, root);
+   retval = cgroup_attach_task(cur_cg, tsk);
+   if (retval)
+   break;
+   }
+   cgroup_unlock();
+
+   return retval;
+}
+EXPORT_SYMBOL_GPL(cgroup_attach_task_current_cg);
+
 /*
  * Attach task with pid 'pid' to cgroup 'cgrp'. Call with cgroup_mutex
  * held. May take task_lock of task
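For reference, the set of active hierarchies that cgroup_attach_task_current_cg() walks for the current task is what userspace sees in /proc/<pid>/cgroup, one line per mounted hierarchy. A small parser sketch (the sample content below is hypothetical):

```python
def parse_proc_cgroup(text):
    """Parse /proc/<pid>/cgroup content into
    {hierarchy_id: (controllers, cgroup_path)} -- each entry
    corresponds to one active hierarchy the task belongs to,
    i.e. one iteration of for_each_active_root()."""
    out = {}
    for line in text.strip().splitlines():
        hid, controllers, path = line.split(":", 2)
        out[int(hid)] = (controllers.split(",") if controllers else [], path)
    return out

# Hypothetical sample: two hierarchies, both placing the task
# under a per-guest cgroup created by libvirt/qemu.
sample = "4:cpu,cpuacct:/qemu/guest1\n2:cpuset:/qemu/guest1\n"
```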




[ PATCH 3/3] vhost: make it more scalable by creating a vhost thread per device

2010-05-18 Thread Sridhar Samudrala
Make vhost more scalable by creating a separate vhost thread per
vhost device. This provides better scaling across virtio-net interfaces
in multiple guests.

Also attach each vhost thread to the cgroup and cpumask of the
associated guest (qemu or libvirt).

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9777583..18bf5be 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -329,19 +329,22 @@ static void handle_rx_net(struct work_struct *work)
 static int vhost_net_open(struct inode *inode, struct file *f)
 {
struct vhost_net *n = kmalloc(sizeof *n, GFP_KERNEL);
+   struct vhost_dev *dev;
int r;
if (!n)
return -ENOMEM;
+
+   dev = &n->dev;
n->vqs[VHOST_NET_VQ_TX].handle_kick = handle_tx_kick;
n->vqs[VHOST_NET_VQ_RX].handle_kick = handle_rx_kick;
-   r = vhost_dev_init(&n->dev, n->vqs, VHOST_NET_VQ_MAX);
+   r = vhost_dev_init(dev, n->vqs, VHOST_NET_VQ_MAX);
if (r < 0) {
kfree(n);
return r;
}
 
-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT, dev);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN, dev);
n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
f->private_data = n;
@@ -644,25 +647,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-   int r = vhost_init();
-   if (r)
-   goto err_init;
-   r = misc_register(&vhost_net_misc);
-   if (r)
-   goto err_reg;
-   return 0;
-err_reg:
-   vhost_cleanup();
-err_init:
-   return r;
-
+   return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
misc_deregister(&vhost_net_misc);
-   vhost_cleanup();
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 49fa953..b076b9b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -37,8 +37,6 @@ enum {
VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
 {
@@ -57,18 +55,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
if (!((unsigned long)key & poll->mask))
return 0;
 
-   queue_work(vhost_workqueue, &poll->work);
+   queue_work(poll->dev->wq, &poll->work);
return 0;
 }
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-unsigned long mask)
+unsigned long mask, struct vhost_dev *dev)
 {
INIT_WORK(&poll->work, func);
init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
init_poll_funcptr(&poll->table, vhost_poll_func);
poll->mask = mask;
+   poll->dev = dev;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -97,7 +96,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-   queue_work(vhost_workqueue, &poll->work);
+   queue_work(poll->dev->wq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -129,6 +128,13 @@ long vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue *vqs, int nvqs)
 {
int i;
+   char vhost_name[20];
+
+   snprintf(vhost_name, 20, "vhost-%d", current->pid);
+   dev->wq = create_singlethread_workqueue_in_current_cg(vhost_name);
+   if (!dev->wq)
+   return -ENOMEM;
+
dev->vqs = vqs;
dev->nvqs = nvqs;
mutex_init(&dev->mutex);
@@ -144,7 +150,7 @@ long vhost_dev_init(struct vhost_dev *dev,
if (dev->vqs[i].handle_kick)
vhost_poll_init(&dev->vqs[i].poll,
dev->vqs[i].handle_kick,
-   POLLIN);
+   POLLIN, dev);
}
return 0;
 }
@@ -217,6 +223,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
if (dev->mm)
mmput(dev->mm);
dev->mm = NULL;
+
+   destroy_workqueue(dev->wq);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1105,16 +1113,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
vq_err(vq, "Failed to enable notification at %p: %d\n",
   &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-   vhost_workqueue = create_singlethread_workqueue("vhost");
-   if (!vhost_workqueue)
-   return -ENOMEM;
-   return 0;
-}
-
-void vhost_cleanup(void)
-{
-   destroy_workqueue(vhost_workqueue);
-}
diff --git a/drivers/vhost/vhost.h

Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-12 Thread Sridhar Samudrala
On Sun, 2010-04-11 at 18:47 +0300, Michael S. Tsirkin wrote:
 On Thu, Apr 08, 2010 at 05:05:42PM -0700, Sridhar Samudrala wrote:
  On Mon, 2010-04-05 at 10:35 -0700, Sridhar Samudrala wrote:
   On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:
 Make vhost scalable by creating a separate vhost thread per vhost
 device. This provides better scaling across multiple guests and with
 multiple interfaces in a guest.

Thanks for looking into this. An alternative approach is
to simply replace create_singlethread_workqueue with
create_workqueue which would get us a thread per host CPU.

It seems that in theory this should be the optimal approach
wrt CPU locality, however, in practice a single thread
seems to get better numbers. I have a TODO to investigate this.
Could you try looking into this?
   
   Yes. I tried using create_workqueue(), but the results were not good
    at least when the number of guest interfaces is less than the number
   of CPUs. I didn't try more than 8 guests.
   Creating a separate thread per guest interface seems to be more
   scalable based on the testing i have done so far.
   
   I will try some more tests and get some numbers to compare the following
   3 options.
   - single vhost thread
   - vhost thread per cpu
   - vhost thread per guest virtio interface
  
  Here are the results with netperf TCP_STREAM 64K guest to host on a
  8-cpu Nehalem system. It shows cumulative bandwidth in Mbps and host 
  CPU utilization.
  
  Current default single vhost thread
  ---
  1 guest:  12500  37%
  2 guests: 12800  46%
  3 guests: 12600  47%
  4 guests: 12200  47%
  5 guests: 12000  47%
  6 guests: 11700  47%
  7 guests: 11340  47%
  8 guests: 11200  48%
  
  vhost thread per cpu
  
  1 guest:   4900 25%
  2 guests: 10800 49%
  3 guests: 17100 67%
  4 guests: 20400 84%
  5 guests: 21000 90%
  6 guests: 22500 92%
  7 guests: 23500 96%
  8 guests: 24500 99%
  
  vhost thread per guest interface
  
  1 guest:  12500 37%
  2 guests: 21000 72%
  3 guests: 21600 79%
  4 guests: 21600 85%
  5 guests: 22500 89%
  6 guests: 22800 94%
  7 guests: 24500 98%
  8 guests: 26400 99%
  
  Thanks
  Sridhar
 
 
 Consider using Ingo's perf tool to get error bars, but looks good
 overall. 

What do you mean by getting error bars?

 One thing I note though is that we seem to be able to
 consume up to 99% CPU now. So I think with this approach
 we can no longer claim that we are just like some other parts of
 networking stack, doing work outside any cgroup, and we should
 make the vhost thread inherit the cgroup and cpu mask
 from the process calling SET_OWNER.

Yes. I am not sure what is the right interface to do this, but this
should also allow binding qemu to a set of cpus and automatically having
vhost thread inherit the same cpu mask.

Thanks
Sridhar
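[Regarding the error-bars question above: an error bar is just the spread across repeated runs, reported alongside the mean so that a 10% swing can be told apart from noise. A minimal sketch using only the Python standard library — the five run values here are illustrative, not measured:]

```python
import statistics

def error_bar(samples):
    """Mean and sample standard deviation of repeated benchmark runs;
    report results as mean +/- stdev rather than a single number."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # n-1 in the denominator
    return mean, stdev

# Illustrative only: five hypothetical repeats of the 1-guest test (Mbps).
runs = [12500, 12300, 12700, 12600, 12400]
mean, stdev = error_bar(runs)
print(f"{mean:.0f} +/- {stdev:.0f} Mbps")  # 12500 +/- 158 Mbps
```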



Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-09 Thread Sridhar Samudrala
On Thu, 2010-04-08 at 17:14 -0700, Rick Jones wrote:
  Here are the results with netperf TCP_STREAM 64K guest to host on a
  8-cpu Nehalem system.
 
 I presume you mean 8 core Nehalem-EP, or did you mean 8 processor Nehalem-EX?

Yes. It is a 2-socket quad-core Nehalem, so I guess it is an 8-core
Nehalem-EP.
 
 Don't get me wrong, I *like* the netperf 64K TCP_STREAM test, I like it a 
 lot!-) 
 but I find it incomplete and also like to run things like single-instance 
 TCP_RR 
 and multiple-instance, multiple transaction (./configure --enable-burst) 
 TCP_RR tests, particularly when concerned with scaling issues.

Can we run multiple-instance and multiple-transaction tests with a
single netperf command line?

Is there any easy way to get consolidated throughput when a netserver on
the host is servicing netperf clients from multiple guests?

Thanks
Sridhar
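[One common answer to the consolidation question is to capture each guest's netperf output and sum the final throughput fields. A sketch, assuming the classic TCP_STREAM output where the result row ends with throughput in 10^6 bits/sec — the sample lines below are made up:]

```python
def consolidated_mbps(outputs):
    """Sum the throughput field from several netperf TCP_STREAM outputs.
    Assumes the classic format: the last all-numeric row of each output
    ends with the throughput in 10^6 bits/sec."""
    total = 0.0
    for out in outputs:
        rows = [line.split() for line in out.splitlines() if line.split()]
        numeric = [r for r in rows
                   if all(f.replace('.', '', 1).isdigit() for f in r)]
        total += float(numeric[-1][-1])  # throughput is the last field
    return total

# Hypothetical captured result rows from two guests.
guest1 = "87380 16384 65536 10.00 9300.12"
guest2 = "87380 16384 65536 10.00 8900.50"
print(consolidated_mbps([guest1, guest2]))  # ~18200.62 Mbps total
```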

 
 happy benchmarking,
 
 rick jones
 
  It shows cumulative bandwidth in Mbps and host 
  CPU utilization.
  
  Current default single vhost thread
  ---
  1 guest:  12500  37%
  2 guests: 12800  46%
  3 guests: 12600  47%
  4 guests: 12200  47%
  5 guests: 12000  47%
  6 guests: 11700  47%
  7 guests: 11340  47%
  8 guests: 11200  48%
  
  vhost thread per cpu
  
  1 guest:   4900 25%
  2 guests: 10800 49%
  3 guests: 17100 67%
  4 guests: 20400 84%
  5 guests: 21000 90%
  6 guests: 22500 92%
  7 guests: 23500 96%
  8 guests: 24500 99%
  
  vhost thread per guest interface
  
  1 guest:  12500 37%
  2 guests: 21000 72%
  3 guests: 21600 79%
  4 guests: 21600 85%
  5 guests: 22500 89%
  6 guests: 22800 94%
  7 guests: 24500 98%
  8 guests: 26400 99%
  
  Thanks
  Sridhar
  
  



Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-08 Thread Sridhar Samudrala
On Mon, 2010-04-05 at 10:35 -0700, Sridhar Samudrala wrote:
 On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
  On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:
   Make vhost scalable by creating a separate vhost thread per vhost
   device. This provides better scaling across multiple guests and with
   multiple interfaces in a guest.
  
  Thanks for looking into this. An alternative approach is
  to simply replace create_singlethread_workqueue with
  create_workqueue which would get us a thread per host CPU.
  
  It seems that in theory this should be the optimal approach
  wrt CPU locality, however, in practice a single thread
  seems to get better numbers. I have a TODO to investigate this.
  Could you try looking into this?
 
 Yes. I tried using create_workqueue(), but the results were not good
 at least when the number of guest interfaces is less than the number
 of CPUs. I didn't try more than 8 guests.
 Creating a separate thread per guest interface seems to be more
 scalable based on the testing i have done so far.
 
 I will try some more tests and get some numbers to compare the following
 3 options.
 - single vhost thread
 - vhost thread per cpu
 - vhost thread per guest virtio interface

Here are the results with netperf TCP_STREAM 64K guest to host on a
8-cpu Nehalem system. It shows cumulative bandwidth in Mbps and host 
CPU utilization.

Current default single vhost thread
---
1 guest:  12500  37%
2 guests: 12800  46%
3 guests: 12600  47%
4 guests: 12200  47%
5 guests: 12000  47%
6 guests: 11700  47%
7 guests: 11340  47%
8 guests: 11200  48%

vhost thread per cpu

1 guest:   4900 25%
2 guests: 10800 49%
3 guests: 17100 67%
4 guests: 20400 84%
5 guests: 21000 90%
6 guests: 22500 92%
7 guests: 23500 96%
8 guests: 24500 99%

vhost thread per guest interface

1 guest:  12500 37%
2 guests: 21000 72%
3 guests: 21600 79%
4 guests: 21600 85%
5 guests: 22500 89%
6 guests: 22800 94%
7 guests: 24500 98%
8 guests: 26400 99%

Thanks
Sridhar
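[The tables above can be reduced to a single efficiency figure, cumulative Mbps per percent of host CPU, which shows the per-device-thread configuration paying for its extra throughput with roughly proportional CPU. A small sketch over the published numbers:]

```python
def mbps_per_cpu(table):
    """Aggregate Mbps delivered per percent of host CPU, per guest count.
    table maps guest count -> (cumulative Mbps, host CPU %)."""
    return {guests: round(mbps / cpu) for guests, (mbps, cpu) in table.items()}

# Endpoints taken from the tables above.
single  = {1: (12500, 37), 8: (11200, 48)}   # single vhost thread
per_dev = {1: (12500, 37), 8: (26400, 99)}   # thread per guest interface

print(mbps_per_cpu(single))   # {1: 338, 8: 233}
print(mbps_per_cpu(per_dev))  # {1: 338, 8: 267}
```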




Re: [PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-05 Thread Sridhar Samudrala
On Sun, 2010-04-04 at 14:14 +0300, Michael S. Tsirkin wrote:
 On Fri, Apr 02, 2010 at 10:31:20AM -0700, Sridhar Samudrala wrote:
  Make vhost scalable by creating a separate vhost thread per vhost
  device. This provides better scaling across multiple guests and with
  multiple interfaces in a guest.
 
 Thanks for looking into this. An alternative approach is
 to simply replace create_singlethread_workqueue with
 create_workqueue which would get us a thread per host CPU.
 
 It seems that in theory this should be the optimal approach
 wrt CPU locality, however, in practice a single thread
 seems to get better numbers. I have a TODO to investigate this.
 Could you try looking into this?

Yes. I tried using create_workqueue(), but the results were not good
at least when the number of guest interfaces is less than the number
of CPUs. I didn't try more than 8 guests.
Creating a separate thread per guest interface seems to be more
scalable based on the testing i have done so far.

I will try some more tests and get some numbers to compare the following
3 options.
- single vhost thread
- vhost thread per cpu
- vhost thread per guest virtio interface

Thanks
Sridhar
 
  
  I am seeing better aggregated throughput/latency when running netperf
  across multiple guests or multiple interfaces in a guest in parallel
  with this patch.
 
 Any numbers? What happens to CPU utilization?
 
  Signed-off-by: Sridhar Samudrala s...@us.ibm.com
  
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index a6a88df..29aa80f 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
  return r;
  }
   
  -   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
  -   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
  +   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
  +   &n->dev);
  +   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
  +   &n->dev);
  n->tx_poll_state = VHOST_NET_POLL_DISABLED;
   
  f->private_data = n;
  @@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
   
   int vhost_net_init(void)
   {
  -   int r = vhost_init();
  -   if (r)
  -   goto err_init;
  -   r = misc_register(&vhost_net_misc);
  -   if (r)
  -   goto err_reg;
  -   return 0;
  -err_reg:
  -   vhost_cleanup();
  -err_init:
  -   return r;
  -
  +   return misc_register(&vhost_net_misc);
   }
  +
   module_init(vhost_net_init);
   
   void vhost_net_exit(void)
   {
  misc_deregister(&vhost_net_misc);
  -   vhost_cleanup();
   }
   module_exit(vhost_net_exit);
   
  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
  index 7bd7a1e..243f4d3 100644
  --- a/drivers/vhost/vhost.c
  +++ b/drivers/vhost/vhost.c
  @@ -36,8 +36,6 @@ enum {
  VHOST_MEMORY_F_LOG = 0x1,
   };
   
  -static struct workqueue_struct *vhost_workqueue;
  -
   static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
  poll_table *pt)
   {
  @@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
  if (!((unsigned long)key & poll->mask))
  return 0;
   
  -   queue_work(vhost_workqueue, &poll->work);
  +   queue_work(poll->dev->wq, &poll->work);
  return 0;
   }
   
   /* Init poll structure */
   void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
  -unsigned long mask)
  +unsigned long mask, struct vhost_dev *dev)
   {
  INIT_WORK(&poll->work, func);
  init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
  init_poll_funcptr(&poll->table, vhost_poll_func);
  poll->mask = mask;
  +   poll->dev = dev;
   }
   
   /* Start polling a file. We add ourselves to file's wait queue. The caller 
  must
  @@ -96,7 +95,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
   
   void vhost_poll_queue(struct vhost_poll *poll)
   {
  -   queue_work(vhost_workqueue, &poll->work);
  +   queue_work(poll->dev->wq, &poll->work);
   }
   
   static void vhost_vq_reset(struct vhost_dev *dev,
  @@ -128,6 +127,11 @@ long vhost_dev_init(struct vhost_dev *dev,
  struct vhost_virtqueue *vqs, int nvqs)
   {
  int i;
  +
  +   dev->wq = create_singlethread_workqueue("vhost");
  +   if (!dev->wq)
  +   return -ENOMEM;
  +
  dev->vqs = vqs;
  dev->nvqs = nvqs;
  mutex_init(&dev->mutex);
  @@ -143,7 +147,7 @@ long vhost_dev_init(struct vhost_dev *dev,
  if (dev->vqs[i].handle_kick)
  vhost_poll_init(&dev->vqs[i].poll,
  dev->vqs[i].handle_kick,
  -   POLLIN);
  +   POLLIN, dev);
  }
  return 0;
   }
  @@ -216,6 +220,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
  if (dev->mm)
  mmput(dev->mm);
  dev->mm = NULL

[PATCH] vhost: Make it more scalable by creating a vhost thread per device.

2010-04-02 Thread Sridhar Samudrala
Make vhost scalable by creating a separate vhost thread per vhost
device. This provides better scaling across multiple guests and with
multiple interfaces in a guest.

I am seeing better aggregated throughput/latency when running netperf
across multiple guests or multiple interfaces in a guest in parallel
with this patch.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index a6a88df..29aa80f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -339,8 +339,10 @@ static int vhost_net_open(struct inode *inode, struct file *f)
return r;
}
 
-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT,
+   &n->dev);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN,
+   &n->dev);
n->tx_poll_state = VHOST_NET_POLL_DISABLED;
 
f->private_data = n;
@@ -643,25 +645,14 @@ static struct miscdevice vhost_net_misc = {
 
 int vhost_net_init(void)
 {
-   int r = vhost_init();
-   if (r)
-   goto err_init;
-   r = misc_register(&vhost_net_misc);
-   if (r)
-   goto err_reg;
-   return 0;
-err_reg:
-   vhost_cleanup();
-err_init:
-   return r;
-
+   return misc_register(&vhost_net_misc);
 }
+
 module_init(vhost_net_init);
 
 void vhost_net_exit(void)
 {
misc_deregister(&vhost_net_misc);
-   vhost_cleanup();
 }
 module_exit(vhost_net_exit);
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7bd7a1e..243f4d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -36,8 +36,6 @@ enum {
VHOST_MEMORY_F_LOG = 0x1,
 };
 
-static struct workqueue_struct *vhost_workqueue;
-
 static void vhost_poll_func(struct file *file, wait_queue_head_t *wqh,
poll_table *pt)
 {
@@ -56,18 +54,19 @@ static int vhost_poll_wakeup(wait_queue_t *wait, unsigned mode, int sync,
if (!((unsigned long)key & poll->mask))
return 0;
 
-   queue_work(vhost_workqueue, &poll->work);
+   queue_work(poll->dev->wq, &poll->work);
return 0;
 }
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-unsigned long mask)
+unsigned long mask, struct vhost_dev *dev)
 {
INIT_WORK(&poll->work, func);
init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
init_poll_funcptr(&poll->table, vhost_poll_func);
poll->mask = mask;
+   poll->dev = dev;
 }
 
 /* Start polling a file. We add ourselves to file's wait queue. The caller must
@@ -96,7 +95,7 @@ void vhost_poll_flush(struct vhost_poll *poll)
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-   queue_work(vhost_workqueue, &poll->work);
+   queue_work(poll->dev->wq, &poll->work);
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -128,6 +127,11 @@ long vhost_dev_init(struct vhost_dev *dev,
struct vhost_virtqueue *vqs, int nvqs)
 {
int i;
+
+   dev->wq = create_singlethread_workqueue("vhost");
+   if (!dev->wq)
+   return -ENOMEM;
+
dev->vqs = vqs;
dev->nvqs = nvqs;
mutex_init(&dev->mutex);
@@ -143,7 +147,7 @@ long vhost_dev_init(struct vhost_dev *dev,
if (dev->vqs[i].handle_kick)
vhost_poll_init(&dev->vqs[i].poll,
dev->vqs[i].handle_kick,
-   POLLIN);
+   POLLIN, dev);
}
return 0;
 }
@@ -216,6 +220,8 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
if (dev->mm)
mmput(dev->mm);
dev->mm = NULL;
+
+   destroy_workqueue(dev->wq);
 }
 
 static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
@@ -1095,16 +1101,3 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
vq_err(vq, "Failed to enable notification at %p: %d\n",
   &vq->used->flags, r);
 }
-
-int vhost_init(void)
-{
-   vhost_workqueue = create_singlethread_workqueue("vhost");
-   if (!vhost_workqueue)
-   return -ENOMEM;
-   return 0;
-}
-
-void vhost_cleanup(void)
-{
-   destroy_workqueue(vhost_workqueue);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..60fefd0 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -29,10 +29,11 @@ struct vhost_poll {
/* struct which will handle all actual work. */
struct work_struct work;
unsigned long mask;
+   struct vhost_dev *dev;
 };
 
 void vhost_poll_init(struct vhost_poll *poll, work_func_t func,
-unsigned long mask);
+unsigned long mask, struct vhost_dev *dev);

Re: [RFC][PATCH v2 0/3] Provide a zero-copy method on KVM virtio-net.

2010-04-02 Thread Sridhar Samudrala
On Fri, 2010-04-02 at 15:25 +0800, xiaohui@intel.com wrote:
 The idea is simple: just pin the guest VM user space and then
 let the host NIC driver have the chance to DMA directly to it. 
 The patches are based on vhost-net backend driver. We add a device
 which provides proto_ops as sendmsg/recvmsg to vhost-net to
 send/recv directly to/from the NIC driver. KVM guest who use the
 vhost-net backend may bind any ethX interface in the host side to
 get copyless data transfer thru guest virtio-net frontend.

What is the advantage of this approach compared to PCI-passthrough
of the host NIC to the guest?
Does this require pinning of the entire guest memory? Or only the
send/receive buffers?

Thanks
Sridhar
 
 The scenario is like this:
 
 The guest virtio-net driver submits multiple requests thru vhost-net
 backend driver to the kernel. And the requests are queued and then
 completed after corresponding actions in h/w are done.
 
 For read, user space buffers are dispensed to NIC driver for rx when
 a page constructor API is invoked. Means NICs can allocate user buffers
 from a page constructor. We add a hook in netif_receive_skb() function
 to intercept the incoming packets, and notify the zero-copy device.
 
 For write, the zero-copy device may allocate a new host skb, put the
 payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
 The request remains pending until the skb is transmitted by h/w.
 
 Here, we have ever considered 2 ways to utilize the page constructor
 API to dispense the user buffers.
 
 One:  Modify __alloc_skb() function a bit, it can only allocate a 
   structure of sk_buff, and the data pointer is pointing to a 
   user buffer which is coming from a page constructor API.
   Then the shinfo of the skb is also from guest.
   When packet is received from hardware, the skb-data is filled
   directly by h/w. What we have done is in this way.
 
   Pros:   We can avoid any copy here.
   Cons:   Guest virtio-net driver needs to allocate skb as almost
   the same method with the host NIC drivers, say the size
   of netdev_alloc_skb() and the same reserved space in the
   head of skb. Many NIC drivers are the same with guest and
   ok for this. But some lastest NIC drivers reserves special
   room in skb head. To deal with it, we suggest to provide
   a method in guest virtio-net driver to ask for parameter
   we interest from the NIC driver when we know which device 
   we have bind to do zero-copy. Then we ask guest to do so.
   Is that reasonable?
 
 Two:  Modify driver to get user buffer allocated from a page constructor
   API(to substitute alloc_page()), the user buffer are used as payload
   buffers and filled by h/w directly when packet is received. Driver
   should associate the pages with skb (skb_shinfo(skb)-frags). For 
   the head buffer side, let host allocates skb, and h/w fills it. 
   After that, the data filled in host skb header will be copied into
   guest header buffer which is submitted together with the payload buffer.
 
   Pros:   We could less care the way how guest or host allocates their
   buffers.
   Cons:   We still need a bit copy here for the skb header.
 
 We are not sure which way is the better here. This is the first thing we want
 to get comments from the community. We wish the modification to the network
 part will be generic which not used by vhost-net backend only, but a user
 application may use it as well when the zero-copy device may provides async
 read/write operations later.
 
 Please give comments especially for the network part modifications.
 
 
 We provide multiple submits and asynchronous notification to 
 vhost-net too.
 
 Our goal is to improve the bandwidth and reduce the CPU usage.
 Exact performance data will be provided later. But for simple
 test with netperf, we found bandwidth up and CPU % up too,
 but the bandwidth up ratio is much more than the CPU % up ratio.
 
 What we have not done yet:
   packet split support
   To support GRO
   Performance tuning
 
 what we have done in v1:
   polish the RCU usage
   deal with write logging in asynchroush mode in vhost
   add notifier block for mp device
   rename page_ctor to mp_port in netdevice.h to make it looks generic
   add mp_dev_change_flags() for mp device to change NIC state
   add CONIFG_VHOST_MPASSTHRU to limit the usage when module is not load
   a small fix for missing dev_put when fail
   using dynamic minor instead of static minor number
   a __KERNEL__ protect to mp_get_sock()
 
 what we have done in v2:
   
   remove most of the RCU usage, since the ctor pointer is only
   changed by BIND/UNBIND ioctl, and during that time, NIC will be
   stopped to get good cleanup(all outstanding requests are finished),
   so the 

Re: [PATCH] kvm: Increase NR_IOBUS_DEVS limit to 200

2010-03-31 Thread Sridhar Samudrala
On Wed, 2010-03-31 at 12:51 +0300, Michael S. Tsirkin wrote:
 On Tue, Mar 30, 2010 at 04:48:25PM -0700, Sridhar Samudrala wrote:
  This patch increases the current hardcoded limit of NR_IOBUS_DEVS
  from 6 to 200. We are hitting this limit when creating a guest with more
  than 1 virtio-net device using vhost-net backend. Each virtio-net
  device requires 2 such devices to service notifications from rx/tx queues.
  
  Signed-off-by: Sridhar Samudrala s...@us.ibm.com
  
 
 I tried this, but observed a measurable performance
 degradation with vhost net and this patch. Did not
 investigate yet.  Do you see this as well?

No. I am not seeing any degradation that is not within the range of
variation seen with kvm networking.
For ex: On a Nehalem 2 socket, 8 core system, running netperf TCP_STREAM
with 64K size packets, i am getting 12-13Gb/s guest to host and
11-12Gb/s host to guest with and without this patch.

Thanks
Sridhar


 
  diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
  index a3fd0f9..7fb48d3 100644
  --- a/include/linux/kvm_host.h
  +++ b/include/linux/kvm_host.h
  @@ -54,7 +54,7 @@ extern struct kmem_cache *kvm_vcpu_cache;
*/
   struct kvm_io_bus {
  int   dev_count;
  -#define NR_IOBUS_DEVS 6
  +#define NR_IOBUS_DEVS 200 
  struct kvm_io_device *devs[NR_IOBUS_DEVS];
   };
   
  
  



[PATCH] kvm: Increase NR_IOBUS_DEVS limit to 200

2010-03-30 Thread Sridhar Samudrala
This patch increases the current hardcoded limit of NR_IOBUS_DEVS
from 6 to 200. We are hitting this limit when creating a guest with more
than 1 virtio-net device using vhost-net backend. Each virtio-net
device requires 2 such devices to service notifications from rx/tx queues.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com


diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a3fd0f9..7fb48d3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -54,7 +54,7 @@ extern struct kmem_cache *kvm_vcpu_cache;
  */
 struct kvm_io_bus {
int   dev_count;
-#define NR_IOBUS_DEVS 6
+#define NR_IOBUS_DEVS 200 
struct kvm_io_device *devs[NR_IOBUS_DEVS];
 };
 





Re: Unable to create more than 1 guest virtio-net device using vhost-net backend

2010-03-22 Thread Sridhar Samudrala
On Mon, 2010-03-22 at 20:16 +0200, Michael S. Tsirkin wrote:
 On Sun, Mar 21, 2010 at 01:58:29PM +0200, Avi Kivity wrote:
  On 03/21/2010 01:34 PM, Michael S. Tsirkin wrote:
  On Sun, Mar 21, 2010 at 12:29:31PM +0200, Avi Kivity wrote:
 
  On 03/21/2010 12:15 PM, Michael S. Tsirkin wrote:
   
  Nothing easy that I can see. Each device needs 2 of these.  Avi, Gleb,
  any objections to increasing the limit to say 16?  That would give us
  5 more devices to the limit of 6 per guest.
 
 
 
  Increase it to 200, then.
 
   
   OK. I think we'll also need a smarter allocator
   than the bus->dev_count++ we have now. Right?
 
 
  No, why?
   
  We'll run into problems if devices are created/removed in random order,
  won't we?
 
 
  unregister_dev() takes care of it.
 
  Eventually we'll want faster scanning than the linear search we employ
  now, though.
   
  Yes I suspect with 200 entries we will :). Let's just make it 16 for
  now?
 
 
  Let's make it 200 and fix the performance problems later.  Making it 16  
  is just asking for trouble.
 
 I did this and performance with vhost seems to become much more noisy,
 and drop by about 10% on average, even though in practice only
 a single device is created. Still trying to figure it out ...
 Any idea?

I am not sure if this 10% variation is due to the increase of NR_IOBUS_DEVS.
In our testing, we do see variations of 10% or higher between multiple netperf 
instances with the same setup/configuration when using virtio/vhost.

Thanks
Sridhar
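[On the faster-than-linear-scan point raised above: keeping the registered devices sorted by start address turns the per-access lookup into a binary search. The sketch below is a toy userspace model of that idea (tuples of start/length/handle), not the kvm_io_bus code:]

```python
import bisect

def linear_lookup(devs, addr):
    """O(n): scan every registered device, as the current code does."""
    for start, length, dev in devs:
        if start <= addr < start + length:
            return dev
    return None

def sorted_lookup(starts, devs, addr):
    """O(log n): devs kept sorted by start address, found via bisect."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, length, dev = devs[i]
        if start <= addr < start + length:
            return dev
    return None

# Hypothetical device ranges: (start, length, handle).
devs = sorted([(0x100, 8, "a"), (0x200, 4, "b"), (0x300, 16, "c")])
starts = [d[0] for d in devs]
assert linear_lookup(devs, 0x202) == sorted_lookup(starts, devs, 0x202) == "b"
```

With 6 devices the scan cost is negligible either way; at 200 the sorted variant starts to matter on hot MMIO/PIO paths.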



Re: vlan disable TSO of virtio in KVM

2010-02-26 Thread Sridhar Samudrala
On Fri, 2010-02-26 at 10:51 +0800, David V. Cloud wrote:
 Hi,
 
 I read some kernel source. My basic understanding is that, in
 net/8021q/vlan_dev.c, vlan_dev_init(), the dev->features of a vconfig
 created interface is defined to be
 dev->features |= real_dev->features & real_dev->vlan_features;
 
 However, in drivers/net/virtio_net.c, vlan_features is never set (I
 will assume it would be 0). So, dev->features will be 0 for the
 ethX.vid interface.
 
 I verify it  using ethtool -k on each KVM.
 # ethtool -k eth0
 shows that rx/tx csum, sg, tso, gso are on
 # ethtool -k eth0.3003
 all offloading features are off.
 
 I think that is why TSO is never enabled when running large package
 traffic between two vlan interfaces of different KVMs.
 
 I also took a look at VMware's pv implementation, which is
 drivers/net/vmxnet3/vmxnet3_drv.c, they enable dev->vlan_features
 when probing, by
   netdev->vlan_features = netdev->features;
 
 
 I was wondering why vlan_features was not defined in virtio_net. Is it
 a BUG? Or, it is due to some constraints?
 Could any explain that?

I saw the same issue some time back and submitted a couple of patches
to address it, but were not accepted as the fix is not done at the
right place. Not sure if we can do this right without updating
virtio_net_hdr with vlan-specific info.
http://thread.gmane.org/gmane.linux.network/150197/focus=150838
http://thread.gmane.org/gmane.linux.network/150198/focus=150837

Thanks
Sridhar
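[The masking line quoted above fully explains the ethtool output in the report: whatever bits vlan_features leaves clear are stripped from the vlan interface. A toy model with made-up flag values (the real NETIF_F_* constants differ):]

```python
# Hypothetical flag bits standing in for the real NETIF_F_* values.
NETIF_F_SG, NETIF_F_TSO, NETIF_F_HW_CSUM = 0x1, 0x2, 0x4

def vlan_dev_features(real_features, real_vlan_features):
    """net/8021q/vlan_dev.c rule: the vlan device only inherits offloads
    the real device advertises in BOTH features and vlan_features."""
    return real_features & real_vlan_features

virtio = NETIF_F_SG | NETIF_F_TSO | NETIF_F_HW_CSUM
# virtio_net never sets vlan_features, so it stays 0 ...
print(vlan_dev_features(virtio, 0))        # 0: eth0.3003 loses TSO/SG/csum
# ... while vmxnet3 copies features into vlan_features:
print(vlan_dev_features(virtio, virtio))   # 7: offloads preserved
```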

 
 Thanks,
 -D
 
 
 
 On Thu, Feb 25, 2010 at 6:30 PM, David V. Cloud david.v.cl...@gmail.com 
 wrote:
  Hi all,
  I have been deploying two KVMs on my Debian testing box. Two KVMs each
  use one tap device connecting to the host.
 
   When I was
   doing netperf with a large packet size from KVM2 (tap1) to KVM1 (tap0)
   using ethX on them, I could verify that TSO did happen by
    # tcpdump -nt -i tap1
   I can see messages like,
   IP 192.168.101.2.39994 > 192.168.101.1.41316: Flags [P.], seq
   7912865:7918657, ack 0, win 92, options [nop,nop,TS val 874151 ecr
   874803], length 5792
  
   So, according to the 'length', the skb didn't get segmented.
 
   However, when I
   (1) set up VLANs using vconfig on KVM2 and KVM1, and got two new interfaces
   eth1.3003, eth0.3003 on both machines,
   (2) ran netperf between the two new interfaces, TSO no longer showed up;
    # tcpdump -nt -i tap1
   I only got,
   vlan 3003, p 0, IP 10.214.10.2.42324 > 10.214.10.1.56460: Flags [P.],
   seq 2127976:2129424, ack 1, win 92, options [nop,nop,TS val 926034 ecr
   926686], length 1448
  
   So, all the large packets get segmented in virtio (is that right?)
 
  My KVM command line options are,
  kvm -hda $IMG -m 768 -net nic,model=virtio,macaddr=52:54:00:12:34:56
  -net tap,ifname=$TAP,script=no
 
 
 
   My question is whether this is the expected behavior. Can VLAN tagging
   coexist with TSO in the virtio_net driver?
   If this is not the desired result, any hint for fixing the problem?
 
  Thanks
  -D
 


Re: virtio GSO makes IPv6 unbearably slow

2010-02-01 Thread Sridhar Samudrala

On 2/1/2010 11:50 AM, Bernhard Schmidt wrote:

Hi,

I have a really weird issue all of a sudden on _one_ of my two KVM
hosts. The other one, while running on a different hardware in a
different network, is configured in a very similar way and does not show
these issues (so far).

Host:
- AMD Athlon64 3500+
- Debian testing amd64
- Kernel 2.6.32-trunk from Debian
- Debian qemu-kvm 0.11.1+dfsg-1

The system has a routed uplink (Intel e100) and an internal bridge, that
connects all the VMs (only virtual ports). Networking in the VMs is
usually configured like this:

-net nic,model=virtio,macaddr=00:16:3E:7C:30:AA -net
tap,ifname=vm.compile,script=no

The guests are Debian amd64, either testing or stable, running a custom
stripped 2.6.31.6 or 2.6.32.7. I've also tested older kernels (2.6.28.2
and 2.6.30.4) and have seen the same problem.

Problem:
IPv6 tx throughput from a Guest to a host in the internet is extremely
low (around 16kB/s). IPv6 rx is fine, as well as throughput in IPv4
(both ways). Also fine is throughput from the guest to the host and from
the host to the internet (also both ways).

Running tcpdump on the tap-device on the host while doing an scp to
another system shows this:

20:43:27.130438 IP6 GUEST.59864 > DEST.22: Flags [.], seq 70785:73641, ack 
2128, win 327, options [nop,nop,TS val 4294920148 ecr 63373335], length 2856
20:43:27.130506 IP6 HOST > GUEST: ICMP6, packet too big, mtu 1500, length 1240
20:43:27.131965 IP6 DEST.22 > GUEST.59864: Flags [.], ack 69357, win 501, 
options [nop,nop,TS val 63373335 ecr 4294920144], length 0
20:43:27.131996 IP6 DEST.22 > GUEST.59864: Flags [.], ack 70785, win 501, 
options [nop,nop,TS val 63373335 ecr 4294920144], length 0
20:43:27.132651 IP6 GUEST.59864 > DEST.22: Flags [.], seq 73641:76497, ack 
2128, win 327, options [nop,nop,TS val 4294920148 ecr 63373335], length 2856
20:43:27.132704 IP6 HOST > GUEST: ICMP6, packet too big, mtu 1500, length 1240
20:43:27.346347 IP6 GUEST.59864 > DEST.22: Flags [.], seq 70785:72213, ack 
2128, win 327, options [nop,nop,TS val 4294920202 ecr 63373335], length 1428
20:43:27.360411 IP6 DEST.22 > GUEST.59864: Flags [.], ack 72213, win 501, 
options [nop,nop,TS val 63373358 ecr 4294920202], length 0
20:43:27.361045 IP6 GUEST.59864 > DEST.22: Flags [.], seq 76497:79353, ack 
2128, win 327, options [nop,nop,TS val 4294920205 ecr 63373358], length 2856

So, the guest sends packets of 2856 bytes, which are too large to pass
through the eth0 of the host (1500-byte MTU). Thus, the host rejects the
packet and sends back an ICMPv6 "packet too big", which sometimes makes
the guest send at least one packet with the small MTU, but sometimes it
doesn't. This repeats over and over and slows down the transfer.

According to ethtool, GSO is enabled in the guest, but disabling it using
ethtool does not have any effect. However, passing virtio_net.gso=0 on the
guest kernel command line fixes the issue completely.

Anyone having any ideas?

I ran into a similar problem when using the Intel e1000e driver on the host,
and found a bug in most of the Intel ethernet drivers when handling large
GSO IPv6 packets.

You can see the fix i submitted here

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=8e1e8a4779cb23c1d9f51e9223795e07ec54d77a


But you seem to be using the Intel e100 driver, so I am not sure you are 
running into the same issue.

You could try disabling both tso and gso on the guest using ethtool.

On the host, do you have gso/tso enabled on the uplink e100 driver?

Thanks
Sridhar



Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu

2010-01-27 Thread Sridhar Samudrala
On Wed, 2010-01-27 at 22:39 +0100, Arnd Bergmann wrote:
 On Wednesday 27 January 2010, Anthony Liguori wrote:
   I think -net socket,fd should just be (trivially) extended to work with
   raw sockets out of the box, with no support for opening it. Then you can
   have libvirt or some wrapper open a raw socket and a private namespace
   and just pass it down.

   That'd work. Anthony?
  
  The fundamental problem that I have with all of this is that we should 
  not be introducing new network backends that are based around something 
  only a developer is going to understand.  If I'm a user and I want to 
  use an external switch in VEPA mode, how in the world am I going to know 
  that I'm supposed to use the -net raw backend or the -net socket 
  backend?  It might as well be the -net butterflies backend as far as a 
  user is concerned.
 
 My point is that we already have -net socket,fd and any user that passes
 an fd into that already knows what he wants to do with it. Making it
 work with raw sockets is just a natural extension to this, which works
 on all kernels and (with separate namespaces) is reasonably secure.

Didn't realize that -net socket is already there and supports TCP and
UDP sockets. I will look into extending -net socket to support AF_PACKET
SOCK_RAW type sockets.

Thanks
Sridhar



[PATCH qemu-kvm] Add raw(af_packet) network backend to qemu

2010-01-26 Thread Sridhar Samudrala
This patch adds raw socket backend to qemu and is based on Or Gerlitz's
patch re-factored and ported to the latest qemu-kvm git tree.
It also includes support for vnet_hdr option that enables gso/checksum
offload with raw backend. You can find the linux kernel patch to support
this feature here.
   http://thread.gmane.org/gmane.linux.network/150308

Signed-off-by: Sridhar Samudrala s...@us.ibm.com 

diff --git a/Makefile.objs b/Makefile.objs
index 357d305..4468124 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -34,6 +34,8 @@ net-nested-$(CONFIG_SOLARIS) += tap-solaris.o
 net-nested-$(CONFIG_AIX) += tap-aix.o
 net-nested-$(CONFIG_SLIRP) += slirp.o
 net-nested-$(CONFIG_VDE) += vde.o
+net-nested-$(CONFIG_POSIX) += raw.o
+net-nested-$(CONFIG_LINUX) += raw-linux.o
 net-obj-y += $(addprefix net/, $(net-nested-y))
 
 ##
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index eba578a..4aa40f2 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -15,6 +15,7 @@
 #include "net.h"
 #include "net/checksum.h"
 #include "net/tap.h"
+#include "net/raw.h"
 #include "qemu-timer.h"
 #include "virtio-net.h"
 
@@ -133,6 +134,9 @@ static int peer_has_vnet_hdr(VirtIONet *n)
     case NET_CLIENT_TYPE_TAP:
         n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        n->has_vnet_hdr = raw_has_vnet_hdr(n->nic->nc.peer);
+        break;
     default:
         return 0;
     }
@@ -149,6 +153,9 @@ static int peer_has_ufo(VirtIONet *n)
     case NET_CLIENT_TYPE_TAP:
         n->has_ufo = tap_has_ufo(n->nic->nc.peer);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        n->has_ufo = raw_has_ufo(n->nic->nc.peer);
+        break;
     default:
         return 0;
     }
@@ -165,6 +172,9 @@ static void peer_using_vnet_hdr(VirtIONet *n, int using_vnet_hdr)
     case NET_CLIENT_TYPE_TAP:
         tap_using_vnet_hdr(n->nic->nc.peer, using_vnet_hdr);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        raw_using_vnet_hdr(n->nic->nc.peer, using_vnet_hdr);
+        break;
     default:
         break;
     }
@@ -180,6 +190,9 @@ static void peer_set_offload(VirtIONet *n, int csum, int tso4, int tso6,
     case NET_CLIENT_TYPE_TAP:
         tap_set_offload(n->nic->nc.peer, csum, tso4, tso6, ecn, ufo);
         break;
+    case NET_CLIENT_TYPE_RAW:
+        raw_set_offload(n->nic->nc.peer, csum, tso4, tso6, ecn, ufo);
+        break;
     default:
         break;
     }
diff --git a/net.c b/net.c
index 6ef93e6..1ca2415 100644
--- a/net.c
+++ b/net.c
@@ -26,6 +26,7 @@
 #include "config-host.h"
 
 #include "net/tap.h"
+#include "net/raw.h"
 #include "net/socket.h"
 #include "net/dump.h"
 #include "net/slirp.h"
@@ -1004,6 +1005,27 @@ static struct {
 },
 { /* end of list */ }
 },
+}, {
+    .type = "raw",
+    .init = net_init_raw,
+    .desc = {
+        NET_COMMON_PARAMS_DESC,
+        {
+            .name = "fd",
+            .type = QEMU_OPT_STRING,
+            .help = "file descriptor of an already opened raw socket",
+        }, {
+            .name = "ifname",
+            .type = QEMU_OPT_STRING,
+            .help = "interface name",
+        }, {
+            .name = "vnet_hdr",
+            .type = QEMU_OPT_BOOL,
+            .help = "enable PACKET_VNET_HDR option on the raw interface",
+        },
+        { /* end of list */ }
+    },
+
 #ifdef CONFIG_VDE
 }, {
 .type = vde,
@@ -1076,6 +1098,7 @@ int net_client_init(Monitor *mon, QemuOpts *opts, int is_netdev)
 #ifdef CONFIG_VDE
         strcmp(type, "vde") != 0 &&
 #endif
+        strcmp(type, "raw") != 0 &&
         strcmp(type, "socket") != 0) {
         qemu_error("The '%s' network backend type is not valid with -netdev\n",
                    type);
diff --git a/net.h b/net.h
index 116bb80..4722185 100644
--- a/net.h
+++ b/net.h
@@ -34,7 +34,8 @@ typedef enum {
 NET_CLIENT_TYPE_TAP,
 NET_CLIENT_TYPE_SOCKET,
 NET_CLIENT_TYPE_VDE,
-NET_CLIENT_TYPE_DUMP
+NET_CLIENT_TYPE_DUMP,
+NET_CLIENT_TYPE_RAW,
 } net_client_type;
 
 typedef void (NetPoll)(VLANClientState *, bool enable);
diff --git a/net/raw-linux.c b/net/raw-linux.c
new file mode 100644
index 000..9ed2e6a
--- /dev/null
+++ b/net/raw-linux.c
@@ -0,0 +1,97 @@
+/*
+ * QEMU System Emulator
+ *
+ * Copyright (c) 2003-2008 Fabrice Bellard
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies

[PATCH qemu-kvm] Add generic peer_* routines for the remaining tap specific routines

2010-01-26 Thread Sridhar Samudrala
This patch adds generic peer routines for the remaining tap-specific
routines (using_vnet_hdr & set_offload). This makes it easier to add
new backends like raw (packet sockets) that support gso/checksum offload.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 6e48997..eba578a 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -129,10 +129,13 @@ static int peer_has_vnet_hdr(VirtIONet *n)
     if (!n->nic->nc.peer)
         return 0;
 
-    if (n->nic->nc.peer->info->type != NET_CLIENT_TYPE_TAP)
-        return 0;
-
-    n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer);
+    switch (n->nic->nc.peer->info->type) {
+    case NET_CLIENT_TYPE_TAP:
+        n->has_vnet_hdr = tap_has_vnet_hdr(n->nic->nc.peer);
+        break;
+    default:
+        return 0;
+    }
 
     return n->has_vnet_hdr;
 }
@@ -142,11 +145,46 @@ static int peer_has_ufo(VirtIONet *n)
     if (!peer_has_vnet_hdr(n))
         return 0;
 
-    n->has_ufo = tap_has_ufo(n->nic->nc.peer);
+    switch (n->nic->nc.peer->info->type) {
+    case NET_CLIENT_TYPE_TAP:
+        n->has_ufo = tap_has_ufo(n->nic->nc.peer);
+        break;
+    default:
+        return 0;
+    }
 
     return n->has_ufo;
 }
 
+static void peer_using_vnet_hdr(VirtIONet *n, int using_vnet_hdr)
+{
+    if (!n->nic->nc.peer)
+        return;
+
+    switch (n->nic->nc.peer->info->type) {
+    case NET_CLIENT_TYPE_TAP:
+        tap_using_vnet_hdr(n->nic->nc.peer, using_vnet_hdr);
+        break;
+    default:
+        break;
+    }
+}
+
+static void peer_set_offload(VirtIONet *n, int csum, int tso4, int tso6,
+                             int ecn, int ufo)
+{
+    if (!n->nic->nc.peer)
+        return;
+
+    switch (n->nic->nc.peer->info->type) {
+    case NET_CLIENT_TYPE_TAP:
+        tap_set_offload(n->nic->nc.peer, csum, tso4, tso6, ecn, ufo);
+        break;
+    default:
+        break;
+    }
+}
+
 static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
 {
     VirtIONet *n = to_virtio_net(vdev);
@@ -154,7 +192,7 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev, uint32_t features)
     features |= (1 << VIRTIO_NET_F_MAC);
 
     if (peer_has_vnet_hdr(n)) {
-        tap_using_vnet_hdr(n->nic->nc.peer, 1);
+        peer_using_vnet_hdr(n, 1);
     } else {
         features &= ~(0x1 << VIRTIO_NET_F_CSUM);
         features &= ~(0x1 << VIRTIO_NET_F_HOST_TSO4);
@@ -197,7 +235,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
     n->mergeable_rx_bufs = !!(features & (1 << VIRTIO_NET_F_MRG_RXBUF));
 
     if (n->has_vnet_hdr) {
-        tap_set_offload(n->nic->nc.peer,
+        peer_set_offload(n,
             (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
             (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
             (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
@@ -761,8 +799,8 @@ static int virtio_net_load(QEMUFile *f, void *opaque, int version_id)
     }
 
     if (n->has_vnet_hdr) {
-        tap_using_vnet_hdr(n->nic->nc.peer, 1);
-        tap_set_offload(n->nic->nc.peer,
+        peer_using_vnet_hdr(n, 1);
+        peer_set_offload(n,
             (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
             (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
             (n->vdev.guest_features >> VIRTIO_NET_F_GUEST_TSO6) & 1,




Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu

2010-01-26 Thread Sridhar Samudrala
On Tue, 2010-01-26 at 14:47 -0600, Anthony Liguori wrote:
 On 01/26/2010 02:40 PM, Sridhar Samudrala wrote:
  This patch adds raw socket backend to qemu and is based on Or Gerlitz's
  patch re-factored and ported to the latest qemu-kvm git tree.
  It also includes support for vnet_hdr option that enables gso/checksum
  offload with raw backend. You can find the linux kernel patch to support
  this feature here.
  http://thread.gmane.org/gmane.linux.network/150308
 
  Signed-off-by: Sridhar Samudralas...@us.ibm.com
 
 
 See the previous discussion about the raw backend from Or's original 
 patch.  There's no obvious reason why we should have this in addition to 
 a tun/tap backend.
 
  The only use-case I know of is macvlan, but macvtap addresses this 
  functionality while not introducing the rather nasty security problems 
  associated with a raw backend.

The raw backend can be attached to a physical device, a macvlan, or an SR-IOV VF.
I don't think the AF_PACKET socket itself introduces any security problems. A
raw socket can be created only by a user with the CAP_NET_RAW capability. The only
issue is if we need to assume that qemu itself is an untrusted process and a
raw fd cannot be passed to it.
But I think it is a useful backend to support in qemu that provides guest
to remote host connectivity without the need for a bridge/tap.

macvtap could be an alternative if it supports binding to SR-IOV VFs too.

Thanks
Sridhar




Re: [PATCH qemu-kvm] Add raw(af_packet) network backend to qemu

2010-01-26 Thread Sridhar Samudrala
On Tue, 2010-01-26 at 14:50 -0600, Anthony Liguori wrote:
 On 01/26/2010 02:47 PM, Anthony Liguori wrote:
  On 01/26/2010 02:40 PM, Sridhar Samudrala wrote:
  This patch adds raw socket backend to qemu and is based on Or Gerlitz's
  patch re-factored and ported to the latest qemu-kvm git tree.
  It also includes support for vnet_hdr option that enables gso/checksum
  offload with raw backend. You can find the linux kernel patch to support
  this feature here.
  http://thread.gmane.org/gmane.linux.network/150308
 
  Signed-off-by: Sridhar Samudralas...@us.ibm.com
 
  See the previous discussion about the raw backend from Or's original 
  patch.  There's no obvious reason why we should have this in addition 
  to a tun/tap backend.
 
   The only use-case I know of is macvlan, but macvtap addresses this 
   functionality while not introducing the rather nasty security problems 
   associated with a raw backend.
 
 Not to mention that from a user perspective, raw makes almost no sense 
 as it's an obscure socket protocol family.
Not clear what you mean here. AF_PACKET socket is just a transport
mechanism for the host kernel to put the packets from the guest directly
to an attached interface and vice-versa.

 A user wants to do useful things like bridged networking or direct VF 
 assignment.  We should have -net backends that reflect things that make 
 sense to a user.

Binding to an SR-IOV VF is one of the use cases supported by the raw
backend.

Thanks
Sridhar



virtio-net offloads not enabled with latest qemu-kvm

2009-12-07 Thread Sridhar Samudrala
With the latest upstream qemu-kvm git tree, all the offloads are disabled
on virtio-net.

peer_has_vnet_hdr(n) in virtio_net_get_features() is failing because
n->vc->peer is NULL. Could not figure out yet why the peer field is not initialized.

Do i need any new options to be specified with qemu command?

qemu-system-x86_64 -m 1024 -kernel /boot/vmlinuz-2.6.32-guest -append 
'root=/dev/vda1 console=tty0 console=ttyS0,115200' -initrd 
/boot/initrd-2.6.32-guest.img -drive 
file=/kvm_images/fedora10-2-vm,if=virtio,index=0 -net 
nic,macaddr=54:52:00:35:e3:74,model=virtio -net 
tap,ifname=vnet1,script=no,downscript=no

Works fine with qemu-kvm-0.11.0 and i see checksum/tso/gso are enabled on guest 
virtio-net device.

Thanks
Sridhar



Re: vhost-net patches

2009-10-22 Thread Sridhar Samudrala
On Thu, 2009-10-22 at 19:43 +0200, Michael S. Tsirkin wrote:

 
 Possibly we'll have to debug this in vhost in host kernel.
 I would debug this directly, it's just that my setup is somehow
 different and I do not see this issue, otherwise I would not
 waste your time.
 
 Can we add some printks?
 handle_tx has this at the top:
 
 if (!sock || !sock_writeable(sock->sk))
         return;

I added some debug printks in handle_rx and handle_tx.
The get_user() calls are failing with EFAULT.

Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: handle_rx: net:f5494800
Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to enable 
notification: -14
Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to disable 
notification: -14
Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:51:41 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:51:48 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:51:48 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:51:49 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:51:49 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:51:53 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:51:53 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:52:03 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:52:03 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:52:22 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:52:22 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:52:23 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:52:23 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:53:17 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:53:17 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14
Oct 21 10:56:56 IBM-19B5263ED41-009047018100 kernel: handle_tx: net:f5494800
Oct 21 10:56:56 IBM-19B5263ED41-009047018100 kernel: Failed to access avail idx 
at 0002 ret:-14

Thanks
Sridhar
 Could you please add
 
  printk(KERN_ERR "%s: !sock = %d, !sock || !sock_writeable(sock->sk) = %d\n",
         __func__, !sock, !sock || !sock_writeable(sock->sk));
 
 *Before* these checks?
 Then make modules modules_install
 rmmod vhost_net
 insmod vhost_net
 and re-run.
 If you want me to send a patch that does this, let me know please.
 
 Thanks!



Re: vhost-net patches

2009-10-19 Thread Sridhar Samudrala
On Sun, 2009-10-18 at 19:32 +0200, Michael S. Tsirkin wrote:
 On Sun, Oct 18, 2009 at 12:53:56PM +0200, Michael S. Tsirkin wrote:
  On Fri, Oct 16, 2009 at 12:29:29PM -0700, Sridhar Samudrala wrote:
   Hi Michael,
   
   We are trying out your vhost-net patches from your git trees on 
   kernel.org.
   I am using mst/vhost.git as host kernel and mst/qemu-kvm.git for qemu.
   
   I am using the following qemu script to start the guest using userspace 
   tap backend.
   
    /home/sridhar/git/mst/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 
   /home/sridhar/kvm_images/fedora10-1-vm -m 512 -drive 
   file=/home/sridhar/kvm_images/fedora10-1-vm,if=virtio,index=0,boot=on 
   -net nic,macaddr=54:52:00:35:e3:73,model=virtio -net 
   tap,ifname=vnet0,script=no,downscript=no
   
   Now that i got the default backend to work, i wanted to try vhost in 
   kernel. But
   could not figure out the right -net option to pass to qemu.
   
   Can you let me know the right syntax to start a guest using vhost.
   
   Thanks
   Sridhar
  
  Here's an example with raw socket:
  
  /root/kvm-test/bin/qemu-system-x86_64 -m 1G -kernel \
  /boot/vmlinuz-$release -append \
  'root=UUID=d5d2d201-d086-42ad-bb1d-32fbe40eda71 ro quiet nosplash \
  console=tty0 console=ttyS0,9600n8' -initrd /boot/guest-initrd.img \
  $HOME/disk.raw.copy -net raw,ifname=eth3 -net nic,model=virtio,vhost \
  -balloon none -redir tcp:8023::22
  
  As you see, I changed the command line.
  You now simply add ,vhost after model, and it will locate
  host network interface specified earlier and attach to it.
  This should have been clear from running  qemu with -help
  flag. Could you please suggest how can that text
  be clarified?

I updated to your latest git trees and the default user-space tap backend using 
the
following -net options worked fine.
-net tap,ifname=vnet0,script=no,downscript=no -net nic,model=virtio

But i could not get vhost to work with either raw or tap backends.
I tried the following combinations.
1) -net raw,ifname=eth0 -net nic,model=virtio,vhost
2) -net raw,ifname=vnet0, -net nic,model=virtio,vhost
3) -net tap,ifname=vnet0,script=no,downscript=no -net nic,model=virtio,vhost

They all failed with the following error:
vhost_net_init returned -7
This is an error message from hw/virtio-net.c:virtio_net_driver_ok() when
vhost_net_start() fails. It looks like dev->binding->irqfd() is failing in
vhost_virtqueue_init(). I haven't debugged further yet. I have CONFIG_EVENTFD
enabled in the host kernel.

Are all the above -net options supposed to work?

In your descriptions, you say that checksum/tso offload is not supported.
Isn't it possible to send/receive large packets without checksums using
AF_PACKET sockets if the attached interface supports these offloads?
Do you see the same offload issue even when using the tap backend via vhost?

Thanks
Sridhar









Re: [Qemu-devel] Re: Release plan for 0.12.0

2009-10-14 Thread Sridhar Samudrala
On Wed, 2009-10-14 at 17:50 +0200, Michael S. Tsirkin wrote:
 On Wed, Oct 14, 2009 at 04:19:17PM +0100, Jamie Lokier wrote:
  Michael S. Tsirkin wrote:
   On Wed, Oct 14, 2009 at 09:17:15AM -0500, Anthony Liguori wrote:
Michael S. Tsirkin wrote:
Looks like Or has abandoned it.  I have an updated version which works
with new APIs, etc.  Let me post it and we'll go from there.
   
  
I'm generally inclined to oppose the functionality as I don't think 
it  offers any advantages over the existing backends.

   
I patch it in and use it all the time.  It's much easier to setup
on a random machine than a bridged config.
  
   
Having two things that do the same thing is just going to lead to user  
confusion.
   
    They do not do the same thing. With a raw socket you can use Windows
    update without a bridge on the host; with tap you can't.
  
  On the other hand, with raw socket, guest Windows can't access files
  on the host's Samba share can it?  So it's not that useful even for
  Windows guests.
 
 I guess this depends on whether you use the same host for samba :)
 
If the problem is that tap is too hard to set up, we should try to  
simplify tap configuration.
   
    The problem is that the bridge is too hard to set up.
    Simplifying that is a good idea, but outside the scope
    of the qemu project.
  
  I venture it's important enough for qemu that it's worth working on
  that.  Something that looks like the raw socket but behaves like an
  automatically instantiated bridge attached to the bound interface
  would be a useful interface.
 
 I agree, that would be good to have.

Can't we bind the raw socket to the tap interface instead of the
physical interface and allow the bridge config to work?

Thanks
Sridhar


 
  I don't have much time, but I'll help anybody who wants to do that.
  
  -- Jamie


Re: [PATCH qemu-kvm] Enable UFO on virtio-net and tap devices

2009-10-08 Thread Sridhar Samudrala
On Thu, 2009-10-08 at 11:07 +0100, Mark McLoughlin wrote:
 On Wed, 2009-10-07 at 14:50 -0700, Sridhar Samudrala wrote:
  Linux 2.6.32 includes UDP fragmentation offload (UFO) support in software.
  So we can enable UFO on the host tap device if supported and allow setting
  UFO on virtio-net in the guest.
 
 Hmm, we really need to detect whether the host has tuntap UFO support
 before advertising it to the guest. Maybe in net_tap_fd_init() we should
 toggle TUN_F_UFO back and forth and check for EINVAL?

Sure. Here is an updated patch that checks for UFO support in host.

Thanks
Sridhar

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ce8e6cb..9561f34 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -150,7 +150,10 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
     features |= (1 << VIRTIO_NET_F_HOST_TSO6);
     features |= (1 << VIRTIO_NET_F_HOST_ECN);
     features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
-    /* Kernel can't actually handle UFO in software currently. */
+    if (tap_has_ufo(host)) {
+        features |= (1 << VIRTIO_NET_F_GUEST_UFO);
+        features |= (1 << VIRTIO_NET_F_HOST_UFO);
+    }
 }
 #endif
 
@@ -189,7 +192,8 @@ static void virtio_net_set_features(VirtIODevice *vdev, 
uint32_t features)
   (features  VIRTIO_NET_F_GUEST_CSUM)  1,
   (features  VIRTIO_NET_F_GUEST_TSO4)  1,
   (features  VIRTIO_NET_F_GUEST_TSO6)  1,
-  (features  VIRTIO_NET_F_GUEST_ECN)   1);
+  (features  VIRTIO_NET_F_GUEST_ECN)   1,
+  (features  VIRTIO_NET_F_GUEST_UFO)   1);
 #endif
 }
 
diff --git a/net.c b/net.c
index 8032ff8..838e42c 100644
--- a/net.c
+++ b/net.c
@@ -1271,6 +1271,11 @@ void tap_using_vnet_hdr(void *opaque, int using_vnet_hdr)
 {
 }
 
+int tap_has_ufo(void *opaque)
+{
+return 0;
+}
+
 #else /* !defined(_WIN32) */
 
 /* Maximum GSO packet size (64k) plus plenty of room for
@@ -1292,6 +1297,7 @@ typedef struct TAPState {
 unsigned int write_poll : 1;
 unsigned int has_vnet_hdr : 1;
 unsigned int using_vnet_hdr : 1;
+unsigned int has_ufo: 1;
 } TAPState;
 
 static int launch_script(const char *setup_script, const char *ifname, int fd);
@@ -1527,9 +1533,22 @@ static int tap_probe_vnet_hdr(int fd)
 #endif
 }
 
+int tap_has_ufo(void *opaque)
+{
+    VLANClientState *vc = opaque;
+    TAPState *s = vc->opaque;
+
+    return s ? s->has_ufo : 0;
+}
+
 #ifdef TUNSETOFFLOAD
+
+#ifndef TUN_F_UFO
+#define TUN_F_UFO  0x10
+#endif
+
 static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
-   int ecn)
+   int ecn, int ufo)
 {
     TAPState *s = vc->opaque;
     unsigned int offload = 0;
@@ -1542,11 +1561,18 @@ static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
        offload |= TUN_F_TSO6;
    if ((tso4 || tso6) && ecn)
        offload |= TUN_F_TSO_ECN;
+   if (ufo)
+       offload |= TUN_F_UFO;
    }
 
-    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0)
-       fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
-               strerror(errno));
+    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+        /* Try without UFO */
+        offload &= ~TUN_F_UFO;
+        if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+            fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
+                    strerror(errno));
+        }
+    }
 }
 #endif /* TUNSETOFFLOAD */
 
@@ -1574,6 +1600,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
                                  int vnet_hdr)
 {
     TAPState *s;
+    unsigned int offload;
 
     s = qemu_mallocz(sizeof(TAPState));
     s->fd = fd;
@@ -1583,7 +1610,14 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
     s->vc->receive_raw = tap_receive_raw;
 #ifdef TUNSETOFFLOAD
     s->vc->set_offload = tap_set_offload;
-    tap_set_offload(s->vc, 0, 0, 0, 0);
+
+    s->has_ufo = 0;
+    /* Check if tap supports UFO */
+    offload = TUN_F_CSUM | TUN_F_UFO;
+    if (ioctl(s->fd, TUNSETOFFLOAD, offload) == 0)
+        s->has_ufo = 1;
+
+    tap_set_offload(s->vc, 0, 0, 0, 0, 0);
 #endif
     tap_read_poll(s, 1);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "fd=%d", fd);
diff --git a/net.h b/net.h
index 925c67c..6bb6434 100644
--- a/net.h
+++ b/net.h
@@ -14,7 +14,7 @@ typedef ssize_t (NetReceive)(VLANClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(VLANClientState *, const struct iovec *, int);
 typedef void (NetCleanup) (VLANClientState *);
 typedef void (LinkStatusChanged)(VLANClientState *);
-typedef void (SetOffload)(VLANClientState *, int, int, int, int);
+typedef void (SetOffload)(VLANClientState *, int, int, int, int, int);
 
 struct VLANClientState {
 NetReceive *receive;
@@ -92,6 +92,7 @@ void do_info_usernet(Monitor *mon);
 
 int tap_has_vnet_hdr(void *opaque);
 void tap_using_vnet_hdr

[PATCH qemu-kvm] Enable UFO on virtio-net and tap devices

2009-10-07 Thread Sridhar Samudrala
Linux 2.6.32 includes UDP fragmentation offload support in software.
So we can enable UFO on the host tap device if supported and allow setting
UFO on virtio-net in the guest.

This improves UDP stream performance significantly between guest to host
and inter-guest.

TUN_F_UFO is a new #define added to 2.6.32 kernel header file
include/linux/if_tun.h. Until this updated header file gets into distro
releases, I think we need to have this defined in qemu.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ce8e6cb..c73487d 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -150,7 +150,8 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
         features |= (1 << VIRTIO_NET_F_HOST_TSO6);
         features |= (1 << VIRTIO_NET_F_HOST_ECN);
         features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
-        /* Kernel can't actually handle UFO in software currently. */
+        features |= (1 << VIRTIO_NET_F_GUEST_UFO);
+        features |= (1 << VIRTIO_NET_F_HOST_UFO);
     }
 #endif
 
@@ -189,7 +190,8 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
                            (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
                            (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
                            (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
-                           (features >> VIRTIO_NET_F_GUEST_ECN)  & 1);
+                           (features >> VIRTIO_NET_F_GUEST_ECN)  & 1,
+                           (features >> VIRTIO_NET_F_GUEST_UFO)  & 1);
 #endif
 }
 
diff --git a/net.c b/net.c
index 8032ff8..1942e25 100644
--- a/net.c
+++ b/net.c
@@ -1528,8 +1528,13 @@ static int tap_probe_vnet_hdr(int fd)
 }
 
 #ifdef TUNSETOFFLOAD
+
+#ifndef TUN_F_UFO
+#define TUN_F_UFO  0x10
+#endif
+
 static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
-   int ecn)
+   int ecn, int ufo)
 {
     TAPState *s = vc->opaque;
     unsigned int offload = 0;
@@ -1542,11 +1547,18 @@ static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
        offload |= TUN_F_TSO6;
    if ((tso4 || tso6) && ecn)
        offload |= TUN_F_TSO_ECN;
+   if (ufo)
+       offload |= TUN_F_UFO;
    }
 
-    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0)
-       fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
-               strerror(errno));
+    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+        /* Try without UFO */
+        offload &= ~TUN_F_UFO;
+        if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+            fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
+                    strerror(errno));
+        }
+    }
 }
 #endif /* TUNSETOFFLOAD */
 
@@ -1583,7 +1595,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
     s->vc->receive_raw = tap_receive_raw;
 #ifdef TUNSETOFFLOAD
     s->vc->set_offload = tap_set_offload;
-    tap_set_offload(s->vc, 0, 0, 0, 0);
+    tap_set_offload(s->vc, 0, 0, 0, 0, 0);
 #endif
     tap_read_poll(s, 1);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "fd=%d", fd);
diff --git a/net.h b/net.h
index 925c67c..ac3701c 100644
--- a/net.h
+++ b/net.h
@@ -14,7 +14,7 @@ typedef ssize_t (NetReceive)(VLANClientState *, const uint8_t *, size_t);
 typedef ssize_t (NetReceiveIOV)(VLANClientState *, const struct iovec *, int);
 typedef void (NetCleanup) (VLANClientState *);
 typedef void (LinkStatusChanged)(VLANClientState *);
-typedef void (SetOffload)(VLANClientState *, int, int, int, int);
+typedef void (SetOffload)(VLANClientState *, int, int, int, int, int);
 
 struct VLANClientState {
 NetReceive *receive;
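The VIRTIO_NET_F_* manipulation in the patch above is ordinary feature-bit negotiation: the host advertises a mask, the guest driver acknowledges a subset, and both ends operate on the intersection. A minimal sketch (bit positions follow the virtio-net spec; the helper itself is illustrative, not qemu code):

```python
# Minimal sketch of virtio feature-bit negotiation. Bit positions follow
# the virtio-net spec (GUEST_UFO = 10, HOST_UFO = 14); the negotiate()
# helper is illustrative, not qemu's actual code.
VIRTIO_NET_F_GUEST_UFO = 1 << 10   # guest can receive UFO packets
VIRTIO_NET_F_HOST_UFO  = 1 << 14   # host/tap accepts UFO from the guest

def negotiate(host_features, guest_acked):
    """Both ends end up using the intersection of what the host
    advertises and what the guest driver acknowledges."""
    return host_features & guest_acked

host = VIRTIO_NET_F_GUEST_UFO | VIRTIO_NET_F_HOST_UFO  # tap supports UFO
guest = VIRTIO_NET_F_GUEST_UFO                         # driver acks GUEST_UFO only
print(hex(negotiate(host, guest)))  # → 0x400
```

This is why probing tap UFO support before get_features matters: a bit the host never advertises can never end up in the negotiated set.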





Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources

2009-06-09 Thread Sridhar Samudrala

Rusty Russell wrote:

On Mon, 8 Jun 2009 02:46:08 pm Herbert Xu wrote:
  

On Fri, Jun 05, 2009 at 05:16:31PM -0700, Sridhar Samudrala wrote:


+   /* Software UFO is not yet supported */
+   segs = ERR_PTR(-EPROTONOSUPPORT);
  

Hmm, we need to fill this in before you start using it for virt.
After all, it's very difficult for the guest to know whether the
output device on the host is going to be able to do UFO or not.



Yep, no point telling the guest we support UFO if we don't!
  
This path is taken only when the packet needs to go out a device that
doesn't support UFO. Until software UFO is supported, I thought it may be
useful to provide a way to enable UFO inter-guest, between host and guest,
and to an external host via an outgoing device that supports UFO.


Thanks
Sridhar



Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources

2009-06-09 Thread Sridhar Samudrala

Herbert Xu wrote:

On Mon, Jun 08, 2009 at 10:04:47AM -0700, Sridhar Samudrala wrote:
  

OK. Can we use skb_segment() to do IP fragmentation of UDP packets?



It should be able to.
  
Unfortunately, this doesn't work for UDP without any changes.
skb_segment() currently adds a transport header to each segmented skb,
but when we are using IP fragmentation only the first fragment should
include the UDP header. We either need to fix skb_segment() to handle
IP fragmentation or write a new skb_fragment(). I will look into this
when I get some time.

Thanks
Sridhar
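For reference, the IP-fragmentation constraint being discussed: only the first fragment carries the UDP header, each non-final fragment's data length is a multiple of 8, and frag_off is expressed in 8-byte units (RFC 791). A small illustrative sketch (not kernel code):

```python
def ip_fragments(datagram_len, max_frag_data):
    """Describe the IP fragments of a UDP datagram of datagram_len bytes
    (UDP header included). max_frag_data is the IP payload room per
    fragment (MTU minus IP header). Illustrative only."""
    per_frag = (max_frag_data // 8) * 8   # non-final fragments: multiple of 8
    frags, off = [], 0
    while off < datagram_len:
        take = min(per_frag, datagram_len - off)
        frags.append({
            "frag_off": off // 8,              # offset in 8-byte units
            "mf": off + take < datagram_len,   # More Fragments flag
            "has_udp_hdr": off == 0,           # only the first fragment
            "data_len": take,
        })
        off += take
    return frags

frags = ip_fragments(3000, 1480)
print(len(frags), [f["has_udp_hdr"] for f in frags])  # → 3 [True, False, False]
```

This is exactly why skb_segment(), which replicates the transport header into every segment, cannot be reused unchanged for UDP/IP fragmentation.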

  

The function itself looks protocol independent and if it is simply
splitting large skb into a list of mtu sized skb's, can we fixup the
ip header differently for UDP packets(id and frag_off) in inet_gso_segment
and ipv6_gso_segment?



Yes that's the idea.

Cheers,
  





Re: [PATCH qemu-kvm] Enable UFO on virtio-net/tap devices

2009-06-08 Thread Sridhar Samudrala
On Sun, 2009-06-07 at 09:21 +0300, Avi Kivity wrote:
 Sridhar Samudrala wrote:
  Enable UFO on the host tap device if supported and allow setting UFO
  on virtio-net in the guest.
 
  Signed-off-by: Sridhar Samudrala s...@us.ibm.com
 
 
  diff --git a/hw/virtio-net.c b/hw/virtio-net.c
  index 3c77b99..8a53e27 100644
  --- a/hw/virtio-net.c
  +++ b/hw/virtio-net.c
  @@ -134,7 +134,8 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
   features |= (1 << VIRTIO_NET_F_HOST_TSO6);
   features |= (1 << VIRTIO_NET_F_HOST_ECN);
   features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
  -/* Kernel can't actually handle UFO in software currently. */
  +   // features |= (1 << VIRTIO_NET_F_HOST_UFO);
  +features |= (1 << VIRTIO_NET_F_GUEST_UFO);

 
 Where are these defined?

They are in hw/virtio_net.h

 
  @@ -990,8 +990,13 @@ static int tap_probe_vnet_hdr(int fd)
   }
   
   #ifdef TUNSETOFFLOAD
  +
  +#ifndef TUN_F_UFO
  +#define TUN_F_UFO  0x10
  +#endif
  +

 
 Should just use the definition in the kernel header.
 
TUN_F_UFO is not defined in the current version of the kernel header
linux/if_tun.h. I have added it in my kernel patches to support UFO.
So until this definition gets into the distro versions of the kernel
headers, I think we need to have it defined in qemu.

Thanks
Sridhar



Re: [PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources

2009-06-08 Thread Sridhar Samudrala
On Mon, 2009-06-08 at 15:16 +1000, Herbert Xu wrote:
 On Fri, Jun 05, 2009 at 05:16:31PM -0700, Sridhar Samudrala wrote:
 
  +   /* Software UFO is not yet supported */
  +   segs = ERR_PTR(-EPROTONOSUPPORT);
 
 Hmm, we need to fill this in before you start using it for virt.
 After all, it's very difficult for the guest to know whether the
 output device on the host is going to be able to do UFO or not.

OK. Can we use skb_segment() to do IP fragmentation of UDP packets?
Or does it only support TCP segmentation?
The function itself looks protocol independent, and if it is simply
splitting a large skb into a list of mtu-sized skbs, can we fix up the
ip header differently for UDP packets (id and frag_off) in inet_gso_segment
and ipv6_gso_segment?


Thanks
Sridhar



[PATCH net-next 1/4] IPv6: HW csum support for outgoing large UFO packets

2009-06-05 Thread Sridhar Samudrala

Add HW checksum support for outgoing large UDP/IPv6 packets destined to a UFO
enabled device.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---

 net/ipv6/udp.c |   51 ++-
 1 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fc333d8..fad0f5f 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -638,6 +638,47 @@ static void udp_v6_flush_pending_frames(struct sock *sk)
}
 }
 
+/**
+ * udp6_hwcsum_outgoing  -  handle outgoing HW checksumming
+ * @sk:    socket we are sending on
+ * @skb:   sk_buff containing the filled-in UDP header
+ *         (checksum field must be zeroed out)
+ */
+static void udp6_hwcsum_outgoing(struct sock *sk, struct sk_buff *skb,
+const struct in6_addr *saddr,
+const struct in6_addr *daddr, int len)
+{
+   unsigned int offset;
+   struct udphdr *uh = udp_hdr(skb);
+   __wsum csum = 0;
+
+       if (skb_queue_len(&sk->sk_write_queue) == 1) {
+               /* Only one fragment on the socket.  */
+               skb->csum_start = skb_transport_header(skb) - skb->head;
+               skb->csum_offset = offsetof(struct udphdr, check);
+               uh->check = csum_ipv6_magic(saddr, daddr, len, IPPROTO_UDP, 0);
+       } else {
+               /*
+                * HW-checksum won't work as there are two or more
+                * fragments on the socket so that all csums of sk_buffs
+                * should be together
+                */
+               offset = skb_transport_offset(skb);
+               skb->csum = skb_checksum(skb, offset, skb->len - offset, 0);
+
+               skb->ip_summed = CHECKSUM_NONE;
+
+               skb_queue_walk(&sk->sk_write_queue, skb) {
+                       csum = csum_add(csum, skb->csum);
+               }
+
+               uh->check = csum_ipv6_magic(saddr, daddr, len, IPPROTO_UDP,
+                                           csum);
+               if (uh->check == 0)
+                       uh->check = CSUM_MANGLED_0;
+       }
+}
+
 /*
  * Sending
  */
@@ -668,7 +709,14 @@ static int udp_v6_push_pending_frames(struct sock *sk)
 
        if (is_udplite)
                csum = udplite_csum_outgoing(sk, skb);
-       else
+       else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {   /* UDP csum disabled */
+               skb->ip_summed = CHECKSUM_NONE;
+               goto send;
+       } else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */
+               udp6_hwcsum_outgoing(sk, skb, &fl->fl6_src, &fl->fl6_dst,
+                                    up->len);
+               goto send;
+       } else
                csum = udp_csum_outgoing(sk, skb);
 
        /* add protocol-dependent pseudo-header */
@@ -677,6 +725,7 @@ static int udp_v6_push_pending_frames(struct sock *sk)
        if (uh->check == 0)
                uh->check = CSUM_MANGLED_0;
 
+send:
        err = ip6_push_pending_frames(sk);
 out:
        up->len = 0;
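The CSUM_MANGLED_0 substitution in the patch above follows the UDP rule (RFC 768) that a computed checksum of zero must be transmitted as all ones, since zero on the wire means "checksum absent". The fold itself is the usual 16-bit ones'-complement sum; a rough sketch (pseudo-header omitted, illustrative only, not kernel code):

```python
def udp_checksum(data: bytes) -> int:
    """16-bit ones'-complement internet checksum (RFC 1071 style), with
    the UDP zero-substitution applied. Illustrative: the pseudo-header
    that csum_ipv6_magic() folds in is omitted here."""
    if len(data) % 2:
        data += b"\x00"                       # pad odd-length input
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    while total >> 16:                        # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    check = ~total & 0xFFFF
    return check if check != 0 else 0xFFFF    # CSUM_MANGLED_0: 0 -> all-ones
```

Because ones'-complement addition is associative, the per-fragment csums collected with csum_add() in the patch can be folded together in any order before the final pseudo-header pass.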




[PATCH net-next 2/4] udp: Handle large UFO packets from untrusted sources

2009-06-05 Thread Sridhar Samudrala
Setup partial checksum and add gso checks to handle large UFO packets from
untrusted sources.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---


 include/net/udp.h  |3 +++
 net/ipv4/af_inet.c |2 ++
 net/ipv4/udp.c |   60 
 net/ipv6/udp.c |   22 +++
 4 files changed, 87 insertions(+), 0 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 90e6ce5..274b764 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -207,4 +207,7 @@ extern void udp4_proc_exit(void);
 #endif
 
 extern void udp_init(void);
+
+extern int udp_v4_gso_send_check(struct sk_buff *skb);
+extern struct sk_buff *udp_ufo_fragment(struct sk_buff *skb, int features);
 #endif /* _UDP_H */
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d873621..1c1ac7d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1426,6 +1426,8 @@ static struct net_protocol tcp_protocol = {
 static struct net_protocol udp_protocol = {
.handler =  udp_rcv,
.err_handler =  udp_err,
+   .gso_send_check = udp_v4_gso_send_check,
+   .gso_segment = udp_ufo_fragment,
.no_policy =1,
.netns_ok = 1,
 };
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8f4158d..678bdf9 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1815,6 +1815,66 @@ void __init udp_init(void)
sysctl_udp_wmem_min = SK_MEM_QUANTUM;
 }
 
+int udp_v4_gso_send_check(struct sk_buff *skb)
+{
+   const struct iphdr *iph;
+   struct udphdr *uh;
+
+   if (!pskb_may_pull(skb, sizeof(*uh)))
+   return -EINVAL;
+
+   iph = ip_hdr(skb);
+   uh = udp_hdr(skb);
+
+       uh->check = 0;
+       uh->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr, skb->len,
+                                      IPPROTO_UDP, 0);
+       skb->csum_start = skb_transport_header(skb) - skb->head;
+       skb->csum_offset = offsetof(struct udphdr, check);
+       skb->ip_summed = CHECKSUM_PARTIAL;
+   return 0;
+}
+
+struct sk_buff *udp_ufo_fragment(struct sk_buff *skb, int features)
+{
+   struct sk_buff *segs = ERR_PTR(-EINVAL);
+   struct udphdr *uh;
+   unsigned uhlen;
+   unsigned int mss;
+
+   if (!pskb_may_pull(skb, sizeof(*uh)))
+   goto out;
+
+   uh = udp_hdr(skb);
+   uhlen = sizeof(*uh);
+
+   __skb_pull(skb, uhlen);
+
+       mss = skb_shinfo(skb)->gso_size;
+       if (unlikely(skb->len <= mss))
+               goto out;
+
+       if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
+               int type = skb_shinfo(skb)->gso_type;
+
+               if (unlikely(type & ~(SKB_GSO_UDP | SKB_GSO_DODGY) ||
+                            !(type & (SKB_GSO_UDP))))
+                       goto out;
+
+               skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
+
+   segs = NULL;
+   goto out;
+   }
+
+   /* Software UFO is not yet supported */
+   segs = ERR_PTR(-EPROTONOSUPPORT);
+
+out:
+   return segs;
+}
+EXPORT_SYMBOL(udp_ufo_fragment);
+
+
 EXPORT_SYMBOL(udp_disconnect);
 EXPORT_SYMBOL(udp_ioctl);
 EXPORT_SYMBOL(udp_prot);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fad0f5f..ccd42d5 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1081,9 +1081,31 @@ int compat_udpv6_getsockopt(struct sock *sk, int level, int optname,
 }
 #endif
 
+
+static int udpv6_gso_send_check(struct sk_buff *skb)
+{
+   struct ipv6hdr *ipv6h;
+   struct udphdr *uh;
+
+   if (!pskb_may_pull(skb, sizeof(*uh)))
+   return -EINVAL;
+
+   ipv6h = ipv6_hdr(skb);
+   uh = udp_hdr(skb);
+
+       uh->check = 0;
+       uh->check = ~csum_ipv6_magic(&ipv6h->saddr, &ipv6h->daddr, skb->len,
+                                    IPPROTO_UDP, 0);
+       skb->csum_start = skb_transport_header(skb) - skb->head;
+       skb->csum_offset = offsetof(struct udphdr, check);
+       skb->ip_summed = CHECKSUM_PARTIAL;
+   return 0;
+}
+
 static struct inet6_protocol udpv6_protocol = {
.handler=   udpv6_rcv,
.err_handler=   udpv6_err,
+   .gso_send_check =   udpv6_gso_send_check,
+   .gso_segment=   udp_ufo_fragment,
.flags  =   INET6_PROTO_NOPOLICY|INET6_PROTO_FINAL,
 };
 




[PATCH net-next 3/4] virtio_net: Allow UFO feature to be set on virtio_net device.

2009-06-05 Thread Sridhar Samudrala
Allow UFO feature to be set on virtio_net device.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---


 drivers/net/virtio_net.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c9ca67..d24ede0 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -773,6 +773,7 @@ static struct ethtool_ops virtnet_ethtool_ops = {
.set_tx_csum = virtnet_set_tx_csum,
.set_sg = ethtool_op_set_sg,
.set_tso = ethtool_op_set_tso,
+   .set_ufo = ethtool_op_set_ufo,
.get_link = ethtool_op_get_link,
 };
 
@@ -1013,7 +1014,7 @@ static unsigned int features[] = {
VIRTIO_NET_F_GSO, VIRTIO_NET_F_MAC,
VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6,
VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
-   VIRTIO_NET_F_GUEST_ECN, /* We don't yet handle UFO input. */
+   VIRTIO_NET_F_GUEST_ECN, VIRTIO_NET_F_GUEST_UFO,
VIRTIO_NET_F_MRG_RXBUF, VIRTIO_NET_F_STATUS, VIRTIO_NET_F_CTRL_VQ,
VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN,
VIRTIO_F_NOTIFY_ON_EMPTY,




[PATCH net-next 4/4] tun: Allow tap device to send/receive UFO packets.

2009-06-05 Thread Sridhar Samudrala
Handle send/receive of UFO packets in tun/tap driver.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com

---

 drivers/net/tun.c  |   13 -
 include/linux/if_tun.h |1 +
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 15cfcb2..96d23b8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -635,6 +635,9 @@ static __inline__ ssize_t tun_get_user(struct tun_struct *tun,
                case VIRTIO_NET_HDR_GSO_TCPV6:
                        skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
                        break;
+               case VIRTIO_NET_HDR_GSO_UDP:
+                       skb_shinfo(skb)->gso_type = SKB_GSO_UDP;
+                       break;
                default:
                        tun->dev->stats.rx_frame_errors++;
                        kfree_skb(skb);
@@ -720,6 +723,8 @@ static __inline__ ssize_t tun_put_user(struct tun_struct *tun,
                        gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
                else if (sinfo->gso_type & SKB_GSO_TCPV6)
                        gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+               else if (sinfo->gso_type & SKB_GSO_UDP)
+                       gso.gso_type = VIRTIO_NET_HDR_GSO_UDP;
                else
                        BUG();
                if (sinfo->gso_type & SKB_GSO_TCP_ECN)
@@ -1068,7 +1073,8 @@ static int set_offload(struct net_device *dev, unsigned long arg)
        old_features = dev->features;
        /* Unset features, set them as we chew on the arg. */
        features = (old_features & ~(NETIF_F_HW_CSUM|NETIF_F_SG|NETIF_F_FRAGLIST
-                                   |NETIF_F_TSO_ECN|NETIF_F_TSO|NETIF_F_TSO6));
+                                   |NETIF_F_TSO_ECN|NETIF_F_TSO|NETIF_F_TSO6
+                                   |NETIF_F_UFO));
 
        if (arg & TUN_F_CSUM) {
                features |= NETIF_F_HW_CSUM|NETIF_F_SG|NETIF_F_FRAGLIST;
@@ -1085,6 +1091,11 @@ static int set_offload(struct net_device *dev, unsigned long arg)
                        features |= NETIF_F_TSO6;
                arg &= ~(TUN_F_TSO4|TUN_F_TSO6);
        }
+
+       if (arg & TUN_F_UFO) {
+               features |= NETIF_F_UFO;
+               arg &= ~TUN_F_UFO;
+       }
        }
 
/* This gives the user a way to test for new features in future by
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 915ba57..3f5fd52 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -62,6 +62,7 @@
 #define TUN_F_TSO4     0x02    /* I can handle TSO for IPv4 packets */
 #define TUN_F_TSO6     0x04    /* I can handle TSO for IPv6 packets */
 #define TUN_F_TSO_ECN  0x08    /* I can handle TSO with ECN bits. */
+#define TUN_F_UFO      0x10    /* I can handle UFO packets */
 
 /* Protocol info prepended to the packets (when IFF_NO_PI is not set) */
 #define TUN_PKT_STRIP  0x0001




[PATCH qemu-kvm] Enable UFO on virtio-net/tap devices

2009-06-05 Thread Sridhar Samudrala
Enable UFO on the host tap device if supported and allow setting UFO
on virtio-net in the guest.

Signed-off-by: Sridhar Samudrala s...@us.ibm.com


diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 3c77b99..8a53e27 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -134,7 +134,8 @@ static uint32_t virtio_net_get_features(VirtIODevice *vdev)
         features |= (1 << VIRTIO_NET_F_HOST_TSO6);
         features |= (1 << VIRTIO_NET_F_HOST_ECN);
         features |= (1 << VIRTIO_NET_F_MRG_RXBUF);
-        /* Kernel can't actually handle UFO in software currently. */
+        // features |= (1 << VIRTIO_NET_F_HOST_UFO);
+        features |= (1 << VIRTIO_NET_F_GUEST_UFO);
     }
 #endif
 
@@ -173,6 +174,7 @@ static void virtio_net_set_features(VirtIODevice *vdev, uint32_t features)
                            (features >> VIRTIO_NET_F_GUEST_CSUM) & 1,
                            (features >> VIRTIO_NET_F_GUEST_TSO4) & 1,
                            (features >> VIRTIO_NET_F_GUEST_TSO6) & 1,
+                           (features >> VIRTIO_NET_F_GUEST_UFO)  & 1,
                            (features >> VIRTIO_NET_F_GUEST_ECN)  & 1);
 #endif
 }
diff --git a/net.c b/net.c
index 01e31db..e7dbcd0 100644
--- a/net.c
+++ b/net.c
@@ -990,8 +990,13 @@ static int tap_probe_vnet_hdr(int fd)
 }
 
 #ifdef TUNSETOFFLOAD
+
+#ifndef TUN_F_UFO
+#define TUN_F_UFO  0x10
+#endif
+
 static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
-   int ecn)
+   int ecn, int ufo)
 {
     TAPState *s = vc->opaque;
     unsigned int offload = 0;
@@ -1004,11 +1009,18 @@ static void tap_set_offload(VLANClientState *vc, int csum, int tso4, int tso6,
        offload |= TUN_F_TSO6;
    if ((tso4 || tso6) && ecn)
        offload |= TUN_F_TSO_ECN;
+   if (ufo)
+       offload |= TUN_F_UFO;
    }
 
-    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0)
-       fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
-               strerror(errno));
+    if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+        /* Try without UFO */
+        offload &= ~TUN_F_UFO;
+        if (ioctl(s->fd, TUNSETOFFLOAD, offload) != 0) {
+            fprintf(stderr, "TUNSETOFFLOAD ioctl() failed: %s\n",
+                    strerror(errno));
+        }
+    }
 }
 #endif /* TUNSETOFFLOAD */
 
@@ -1043,7 +1055,7 @@ static TAPState *net_tap_fd_init(VLANState *vlan,
     s->vc->fd_read_raw = tap_receive_raw;
 #ifdef TUNSETOFFLOAD
     s->vc->set_offload = tap_set_offload;
-    tap_set_offload(s->vc, 0, 0, 0, 0);
+    tap_set_offload(s->vc, 0, 0, 0, 0, 0);
 #endif
     qemu_set_fd_handler2(s->fd, tap_can_send, tap_send, NULL, s);
     snprintf(s->vc->info_str, sizeof(s->vc->info_str), "fd=%d", fd);
diff --git a/net.h b/net.h
index 3d0b6f2..ecfb1f9 100644
--- a/net.h
+++ b/net.h
@@ -11,7 +11,7 @@ typedef struct VLANClientState VLANClientState;
 
 typedef void (NetCleanup) (VLANClientState *);
 typedef void (LinkStatusChanged)(VLANClientState *);
-typedef void (SetOffload)(VLANClientState *, int, int, int, int);
+typedef void (SetOffload)(VLANClientState *, int, int, int, int, int);
 
 struct VLANClientState {
 IOReadHandler *fd_read;




virtio-net not working with the latest qemu-kvm git

2009-05-04 Thread Sridhar Samudrala
When I moved to the latest qemu-kvm git tree from kvm-85, I noticed that
networking stopped working between the host and the guest.
It started working when I put the device in promiscuous mode by running
tcpdump in the background on the guest.

After browsing through the recent patches, i found that the following commit
is causing the regression.

Remove stray GSO code from virtio_net (Mark McLoughlin)
http://git.kernel.org/?p=virt/kvm/qemu-kvm.git;a=commitdiff;h=559a8f45f34cc50d1a60b4f67a06614d506b2e01

The comment doesn't seem to match the code that is removed by this patch.

Thanks
Sridhar
